A demonstration of using markdown and knitr with R studio

Eric Nord, Dept. of Horticulture, Penn State University

The recently released version of RStudio (0.96.122) has integrated the knitr package to allow easy use markdown to produce dynamic reports. Here we'll briefly explore the possibilities.
For some time I have been caught on the issue of how I document my work in R. Ideally, I want to have my code, and the statistical output and figures all accessible, but I've had no good way to do this. My requirements:
1. Code and commentary stored in one file
a. both data manipulation and analysis code stored
b. ability to have extensive commentary w/o commenting out every line
c. ability to keep code that is not run for reference
2. Statistical output and figures easily kept with the code that produced them
3. Easy to go back and re-run or update analyses

Pasting output and figures into a word processor document sort of works, but it has several flaws, the most obvious of which is that it is not easy to re-run code from there. I got around this by keeping a comprehensive code file and a “results” file that statistical output and figures, as well as commentary went into.

THERE IS A BETTER WAY!

Using markdown and knitr in RStudio makes it almost trivially easy to put this all in one (sort of) file. The markdown file is easily readable plain text (like this file) and can contain R code in discrete “chunks”. Knitr will process the markdown file into an HTML file that includes the code and the output from that code. Knitr options allow the user to specify whether code chunks are evaluated, (if something didn't work but you want to keep a record of how you tried it, a chunk that isn't evaluated is perfect), and whether the code itself is shown in the HTML file (perfect for figures).

Getting stated with markdown

In the RStudio editor pane choose File>New>R Markdown. You may be prompted to install the knitr package, in which case type install.packages("knitr",dep=T). (Note the format in the markdown file here - this is an inline code chunk, and in the HTML file the text will be formatted as code). Markdown is a simple markup language - clink the “MD” button on the toolbar above the editor pane for a quick reference sheet. To insert a code chunk choose Chunks>Insert Chunk from the “Chunks” menu on the editor toolbar.


Demonstration: Analysis of Electrical Usage

I keep track of electrical usage for some reason. What can I learn about my household electrical use? I'll insert a chunk here to get some data.
[Note: the {r load-data} in the code chunk means evaluate this with R, and the name of the chunk is “load-data”]

elec <- read.delim("~/Documents/ERIC/Stats/R mini class/electric bill.txt")
dim(elec)
## [1] 101   9
head(elec)
##   month year  kwh days est   cost avgT dT.yr kWhd.1
## 1     8 2003  476   21   a  33.32   69    -8  22.67
## 2     9 2003 1052   30   e 112.33   73    -1  35.05
## 3    10 2003  981   28   a  24.98   60    -6  35.05
## 4    11 2003 1094   32   a  73.51   53     2  34.19
## 5    12 2003 1409   32   a  93.23   44     6  44.03
## 6     1 2004 1083   32   a  72.84   34     3  33.84

We have a data frame with 101 rows and 9 columns. How has daily electric usage (kWhd.1) changed over the years?
This code chunk will create a graph for us:

plot(elec$kWhd.1, type = "l", xlab = "month of residence", ylab = expression(paste("Daily electric use (KWH", 
    d^-1, ")")))
abline(v = c(0, 12, 24, 36, 48, 60, 72, 84, 96) + 5.5, lty = 3)  # add vertical lines at the end of each year.

plot of chunk daily-usage

There is a pretty strong seasonal pattern, but there is a lot of month-to-month variability also. Some of this is because they only read our meter every two months, and send us an “estimated” bill between months. There is a lot we could do with this data, but for now let's see how daily use is related to temperature (our house has electric heat, but no AC).
[Note: the {echo=FALSE} in the code chunk means don't show the code, just run it. Avoid periods and spaces in the chunk options]
plot of chunk daily-usage-vs-temp

There is clearly a relationship, though noisy. Let's try a regression. (Hint Ctrl+Alt+< (Mac: Cmd+Opt+<) insetes a new chunk)

m1 <- lm(kWhd.1 ~ avgT, data = elec)
summary(m1)
## 
## Call:
## lm(formula = kWhd.1 ~ avgT, data = elec)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -22.04  -4.76  -1.17   5.29  26.45 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  43.6586     2.6651   16.38  < 2e-16 ***
## avgT         -0.2850     0.0485   -5.87  5.8e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Residual standard error: 8.07 on 99 degrees of freedom
## Multiple R-squared: 0.258,   Adjusted R-squared: 0.251 
## F-statistic: 34.5 on 1 and 99 DF,  p-value: 5.79e-08 
## 

A significant relationship with a very low p-value of 0 but low explanatory power (R2 ) of 0.2508. Might there be an increase in electric use in very hot weather with all the fans running?

m2 <- lm(kWhd.1 ~ avgT + I(avgT^2), data = elec)
summary(m2)
## 
## Call:
## lm(formula = kWhd.1 ~ avgT + I(avgT^2), data = elec)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -19.99  -5.21   0.25   5.71  17.22 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 72.68427    7.07798   10.27  < 2e-16 ***
## avgT        -1.59068    0.30201   -5.27  8.2e-07 ***
## I(avgT^2)    0.01305    0.00298    4.37  3.1e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Residual standard error: 7.42 on 98 degrees of freedom
## Multiple R-squared: 0.379,   Adjusted R-squared: 0.367 
## F-statistic: 29.9 on 2 and 98 DF,  p-value: 7.11e-11 
## 

Indeed there is. This improves the R2 quite a bit to 0.3666.
In April of 2008 (month 57 of occupancy) we added insulation to the attic. Has this helped?

elec$insul <- rep(0, nrow(elec))
elec$insul[57:nrow(elec)] <- 1  # create a boolean for insulation
m3 <- lm(kWhd.1 ~ avgT + I(avgT^2) + insul, data = elec)
summary(m3)
## 
## Call:
## lm(formula = kWhd.1 ~ avgT + I(avgT^2) + insul, data = elec)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -14.455  -3.766   0.178   3.448  13.045 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 75.18349    5.40014   13.92  < 2e-16 ***
## avgT        -1.53703    0.23016   -6.68  1.5e-09 ***
## I(avgT^2)    0.01271    0.00227    5.59  2.1e-07 ***
## insul       -9.61350    1.13405   -8.48  2.6e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Residual standard error: 5.65 on 97 degrees of freedom
## Multiple R-squared: 0.643,   Adjusted R-squared: 0.632 
## F-statistic: 58.3 on 3 and 97 DF,  p-value: <2e-16 
## 

This improves the R2 even more to 0.6324. Considering all the “estimated bills” and the noise they introduce, that is pretty good! Insulation has saved us about -9.614 KWH per day! Let's plot this:
plot of chunk insulation-plot

One of the great advantages of this method of docuemnting your work is that you can easily update this (like when I get around to adding the newer bills I can see if the new water heater and fridge are saving even more electricity!) by updating the data file and running the analysis again. Most of the important conclusions are in inline code chunks, so R just updates them too.

A few extra notes

For more information

RStudio has some basic info http://www.rstudio.org/docs/authoring/using_markdown
Wikipedia has an article on markdown here http://en.wikipedia.org/wiki/Markdown
Texts is an editor for Mac and Windows that can handle Markdown http://www.texts.io/
The knitr help page online has much information about R chunk options http://yihui.name/knitr/options
A recent post on R bloggers has a great example also http://www.r-bloggers.com/example-reproducile-report-using-r-markdown-analysis-of-california-schools-test-data/