Sunday, January 16, 2011

Confidence Intervals for Regression plot

I was recently confronted with the problem of plotting 95% confidence intervals on a regression line.  I originally thought this would be a straight forward process in R, but soon came to realize there are some intricacies that need to be tackled before completing a presentable plot.

For this example we will construct a regression plot between island habitat area (log transformed) and species number on each island. First we create vectors with the data:

area <- c(153750,15860,5800,5545,11970,2000,100)

species <- c(36,34,33,33,21,12,3)

Logarithms can be applied during the regression, but this procedure creates a problem later when predicted values are obtained from the model (see below). Therefore it is important to create new vectors with the log-transformed values

larea <- log(area)

lsp <- log(species)


We now proceed to calculate the regression between log(area) and log(sp).  Based on Island Biogeography Theory (IBT), the area of the habitat should predict species number on a logarithmic scale.  We perform a regression analysis using the lm() command.  The results of the regression are stored in the mod1 object and printed with the summary() command.:

mod1 <- lm(lsp~larea)


lm(formula = lsp ~ larea)

      1       2       3       4       5       6       7
-0.5609  0.1960  0.5266  0.5428 -0.1850 -0.1034 -0.4161

            Estimate Std. Error t value Pr(>|t|)  
(Intercept) -0.13566    0.77702  -0.175  0.86825  
larea        0.35837    0.08744   4.098  0.00937 **
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4781 on 5 degrees of freedom
Multiple R-squared: 0.7706,    Adjusted R-squared: 0.7247
F-statistic:  16.8 on 1 and 5 DF,  p-value: 0.00937

A quick review of the results show that area significantly predicts species number, as expected from the IBT.  One would now proceed to determine the validity and quality of the analysis through plotting of residuals and related techniques.  Since we are concerned with plots we will skip these analysis in the current post.

To plot the confidence intervals we need to use the predict() command which is used, as its name implies, to predict y-values based on a particular regression model.  The command is actually a wrapper command which applies to different models (lm, glm) and it will adapt itself to the proper analysis. In our case, since we used a liner model (lm), predict()will invoke the predict.lm() command.  For more information type ?predict.lm in the R prompt. 

To illustrate the use of predict(), type in the following commands:

[1] 11.943083  9.671555  8.665613  8.620652  9.390159  7.600902  4.605170

[1] 3.583519 3.526361 3.496508 3.496508 3.044522 2.484907 1.098612

       1        2        3        4        5        6        7
4.144421 3.330367 2.969864 2.953751 3.229522 2.588300 1.514710

The previous code shows the values for the log-transformed area, the number of species, also log-transformed and the predicted values according to the regression equation: lsp = larea*(0.35837) - 0.13566 .  These means that the predicted number of species for an area of 11.94 is 4.144, on the logarithmic scale.   The values returned by predict() are always located on the regression line.

A simple scatterplot and regression line can be produced with the following commands:

plot(larea, lsp)


The predict() command calculates confidence intervals for the predicted values.


predict(mod1, interval="confidence")

       fit       lwr      upr
1 4.144421 3.2690777 5.019765
2 3.330367 2.8114190 3.849315
3 2.969864 2.5052945 3.434434
4 2.953751 2.4891850 3.418317
5 3.229522 2.7355129 3.723531
6 2.588300 2.0681003 3.108500
7 1.514710 0.4952322 2.534188

The previous code shows how to compute confidence intervals for larea values in the regression model (i.e. mod1).  By adding the option interval= “confidence”, lower (lwr) and upper (upr) confidence intervals are computed alongside fitted values.  By default 95% confidence intervals are calculated, if you require higher confidence the level=0.99 may be used.  We could plot these confidence intervals on top of the previous figure, by storing the results of the predict command on an object and then using the lines() command:


a <- predict(mod1, interval="confidence")

lines(larea, a[,2], lty=2)

lines(larea, a[,3], lty=2)


This graph seems a bit “choppy” and this effect is due to having few data points in the regression.  We can solve this problem by predicting a larger number of values and their corresponding confidence intervals.  This should result in smoother confidence interval lines.  To do this we need to create a vector with a large number of values in the range of the original x-variable (i.e. larea).  We then use these new values to calculate predicted values (which we wont use) and their confidence intervals to plot them as curves.

newx <- seq(min(larea), max(larea), 0.1)

a <- predict(mod1, newdata=data.frame(larea=newx), interval="confidence")


Now, these two last commands require a detailed explanation.  the first one creates a  sequence starting on larea’s minimum value, ending on its maximum value taking 0.1 steps.  The newx vector contains a total of 74 values.

The second command calculates predicted y-values for newx.  In order to accomplish this we need to tell R to use newx as a new set of data and use it as if it were larea.  This is accomplished by the option: newdata=data.frame(larea=newx)

It is very important, that we use the same name as the x-variable in the model (i.e. larea in mod1), otherwise R will complain about vectors of different size.  Now, the only thing left to do is to redo the plot and add the lines:



lines(newx,a[,2], lty=3)

lines(newx,a[,3], lty=3)

It is very important to use newx in the lines() command otherwise the length of vectors will not match. 



the lines in this final graph are a lot smoother.