
Sunday, January 16, 2011

Confidence Intervals for Regression plot

I was recently confronted with the problem of plotting 95% confidence intervals on a regression line.  I originally thought this would be a straightforward process in R, but soon came to realize there are some intricacies that need to be tackled before completing a presentable plot.

For this example we will construct a regression plot of species number against island habitat area, with both variables log transformed. First we create vectors with the data:

area <- c(153750,15860,5800,5545,11970,2000,100)

species <- c(36,34,33,33,21,12,3)

Logarithms can be applied inside the model formula, but predict() then expects new data on the original, untransformed scale, which complicates the plotting steps later on (see below). Therefore it is simpler to create new vectors with the log-transformed values:

larea <- log(area)

lsp <- log(species)

 

We now proceed to calculate the regression between log(area) and log(species).  Based on Island Biogeography Theory (IBT), the area of the habitat should predict species number on a logarithmic scale.  We perform a regression analysis using the lm() command.  The results of the regression are stored in the mod1 object and printed with the summary() command:

mod1 <- lm(lsp~larea)

summary(mod1)

Call:
lm(formula = lsp ~ larea)

Residuals:
      1       2       3       4       5       6       7
-0.5609  0.1960  0.5266  0.5428 -0.1850 -0.1034 -0.4161

Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept) -0.13566    0.77702  -0.175  0.86825  
larea        0.35837    0.08744   4.098  0.00937 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4781 on 5 degrees of freedom
Multiple R-squared: 0.7706,    Adjusted R-squared: 0.7247
F-statistic:  16.8 on 1 and 5 DF,  p-value: 0.00937

A quick review of the results shows that area significantly predicts species number, as expected from the IBT.  One would now proceed to determine the validity and quality of the analysis through plotting of residuals and related techniques.  Since we are concerned with plots we will skip these analyses in the current post.
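The figures quoted from the summary can also be pulled out of the model object directly. The following is a minimal sketch using base R only (coef(), confint() and summary()); the object names est, ci and r2 are introduced here purely for illustration.

```r
# Data and model as defined earlier in the post
area    <- c(153750, 15860, 5800, 5545, 11970, 2000, 100)
species <- c(36, 34, 33, 33, 21, 12, 3)
larea   <- log(area)
lsp     <- log(species)
mod1    <- lm(lsp ~ larea)

est <- coef(mod1)                # intercept and slope
ci  <- confint(mod1)             # 95% confidence intervals for both coefficients
r2  <- summary(mod1)$r.squared   # multiple R-squared

print(round(est, 5))
print(round(ci, 3))
print(round(r2, 4))
```

Note that confint() describes uncertainty in the coefficients themselves, which is a different thing from the confidence limits around the fitted line that the rest of this post plots.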

To plot the confidence intervals we need to use the predict() command which is used, as its name implies, to predict y-values based on a particular regression model.  The command is actually a generic function that applies to different model types (lm, glm) and dispatches to the appropriate method.  In our case, since we used a linear model (lm), predict() will invoke the predict.lm() command.  For more information type ?predict.lm at the R prompt.
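The dispatch described above can be checked in a couple of lines. This is only a sketch: it confirms that mod1 has class "lm" and that calling predict.lm() directly gives the same result as calling predict(); the variable name same is made up for this example.

```r
# Data and model as defined earlier in the post
area    <- c(153750, 15860, 5800, 5545, 11970, 2000, 100)
species <- c(36, 34, 33, 33, 21, 12, 3)
larea   <- log(area)
lsp     <- log(species)
mod1    <- lm(lsp ~ larea)

# The class of the model object decides which predict method is used
print(class(mod1))

# Calling the lm method explicitly should match the generic call
same <- identical(predict(mod1), predict.lm(mod1))
print(same)
```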

To illustrate the use of predict(), type in the following commands:

larea
[1] 11.943083  9.671555  8.665613  8.620652  9.390159  7.600902  4.605170

lsp
[1] 3.583519 3.526361 3.496508 3.496508 3.044522 2.484907 1.098612

predict(mod1)
       1        2        3        4        5        6        7
4.144421 3.330367 2.969864 2.953751 3.229522 2.588300 1.514710

The previous code shows the log-transformed area values, the log-transformed species numbers, and the values predicted by the regression equation: lsp = larea*(0.35837) - 0.13566.  This means that the predicted number of species for a log-area of 11.94 is 4.144, on the logarithmic scale.  The values returned by predict() always lie on the regression line.
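To make the link between the regression equation and predict() concrete, the fitted values can be rebuilt by hand from the coefficients. The sketch below assumes the same data and model as above; by_hand and back are illustrative names. It also back-transforms the first prediction with exp() to express it as a species count, roughly 63 for the largest island.

```r
# Data and model as defined earlier in the post
area    <- c(153750, 15860, 5800, 5545, 11970, 2000, 100)
species <- c(36, 34, 33, 33, 21, 12, 3)
larea   <- log(area)
lsp     <- log(species)
mod1    <- lm(lsp ~ larea)

b <- coef(mod1)

# intercept + slope * x, exactly the regression equation
by_hand <- b[["(Intercept)"]] + b[["larea"]] * larea

# Should agree with predict() up to floating point error
print(all.equal(by_hand, unname(predict(mod1))))

# Back-transform the first prediction from the log scale to a species count
back <- exp(predict(mod1)[1])
print(back)
```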

A simple scatterplot and regression line can be produced with the following commands:

plot(larea, lsp)

abline(mod1)

The predict() command can also calculate confidence intervals for the predicted values:

 

predict(mod1, interval="confidence")

       fit       lwr      upr
1 4.144421 3.2690777 5.019765
2 3.330367 2.8114190 3.849315
3 2.969864 2.5052945 3.434434
4 2.953751 2.4891850 3.418317
5 3.229522 2.7355129 3.723531
6 2.588300 2.0681003 3.108500
7 1.514710 0.4952322 2.534188
 

The previous code shows how to compute confidence intervals for the larea values in the regression model (i.e. mod1).  By adding the option interval="confidence", lower (lwr) and upper (upr) confidence limits are computed alongside the fitted values.  By default 95% confidence intervals are calculated; if you require higher confidence, an option such as level=0.99 may be used.  We could plot these confidence intervals on top of the previous figure by storing the results of the predict() command in an object and then using the lines() command:

 

a <- predict(mod1, interval="confidence")

lines(larea, a[,2], lty=2)

lines(larea, a[,3], lty=2)
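As a cross-check on what predict() reports, the limits can be rebuilt by hand as fit plus or minus a t quantile times the standard error of the fit. The sketch below uses the se.fit=TRUE option of predict() and qt(); the names p, tq, by_hand, ci95, ci99, width95 and width99 are illustrative. It also shows that level=0.99 gives wider intervals than the default.

```r
# Data and model as defined earlier in the post
area    <- c(153750, 15860, 5800, 5545, 11970, 2000, 100)
species <- c(36, 34, 33, 33, 21, 12, 3)
larea   <- log(area)
lsp     <- log(species)
mod1    <- lm(lsp ~ larea)

# Fitted values together with their standard errors and residual df
p  <- predict(mod1, se.fit = TRUE)
tq <- qt(0.975, df = p$df)          # two-sided 95% t quantile

by_hand <- cbind(fit = p$fit,
                 lwr = p$fit - tq * p$se.fit,
                 upr = p$fit + tq * p$se.fit)

ci95 <- predict(mod1, interval = "confidence")
ci99 <- predict(mod1, interval = "confidence", level = 0.99)

# The hand-built limits should match the default 95% output
print(all.equal(by_hand, ci95, check.attributes = FALSE))

# Interval widths at each data point: 99% should be wider than 95%
width95 <- ci95[, "upr"] - ci95[, "lwr"]
width99 <- ci99[, "upr"] - ci99[, "lwr"]
print(cbind(width95, width99))
```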

[Figure r2: regression plot with confidence limits drawn through the original data points]

This graph seems a bit "choppy". This effect is due to having few data points in the regression, and to lines() joining them in the order given, which is not sorted by larea.  We can solve this problem by predicting a larger number of values and their corresponding confidence intervals.  This should result in smoother confidence interval lines.  To do this we need to create a vector with a large number of values in the range of the original x-variable (i.e. larea).  We then use these new values to calculate predicted values (which we won't use) and their confidence intervals to plot them as curves.

newx <- seq(min(larea), max(larea), 0.1)

a <- predict(mod1, newdata=data.frame(larea=newx), interval="confidence")

 

Now, these last two commands require a detailed explanation.  The first one creates a sequence starting at larea's minimum value and increasing in steps of 0.1 without going past its maximum value.  The newx vector contains a total of 74 values.
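The size and end point of the sequence are easy to inspect. The sketch below checks the count of 74 and shows that, with a fixed step of 0.1, the last value falls slightly short of max(larea). The length.out variant (newx2, a name introduced here) is offered only as an alternative, not as what the post uses; it fixes the number of points and lands on both end points.

```r
# Log-transformed area as defined earlier in the post
area  <- c(153750, 15860, 5800, 5545, 11970, 2000, 100)
larea <- log(area)

# Fixed step: the sequence stops at the last value not exceeding max(larea)
newx <- seq(min(larea), max(larea), 0.1)
print(length(newx))
print(c(last = max(newx), max_larea = max(larea)))

# Alternative: fix the number of points instead of the step size
newx2 <- seq(min(larea), max(larea), length.out = 100)
print(range(newx2))
```

If the confidence lines appear to stop just before the last data point, this small gap between max(newx) and max(larea) is one possible cause.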

The second command calculates predicted y-values for newx.  In order to accomplish this we need to tell R to use newx as a new set of data and use it as if it were larea.  This is accomplished by the option: newdata=data.frame(larea=newx)

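It is very important that we use the same name as the x-variable in the model (i.e. larea in mod1).  Otherwise predict() will not find that variable in newdata, will fall back on the original larea vector, and will return only 7 values (with a warning), so the lines() calls below fail because the vector lengths differ.  Now, the only thing left to do is to redo the plot and add the lines: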

plot(larea,lsp)

abline(mod1)

lines(newx,a[,2], lty=3)

lines(newx,a[,3], lty=3)

It is very important to use newx in the lines() command; otherwise the lengths of the x and y vectors will not match.
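The length mismatch is easiest to see by comparing the number of rows predict() returns with and without the correct column name. In this sketch the misnamed column x is a deliberate, hypothetical mistake, and suppressWarnings() is used only to keep the output quiet; good and bad are illustrative names.

```r
# Data and model as defined earlier in the post
area    <- c(153750, 15860, 5800, 5545, 11970, 2000, 100)
species <- c(36, 34, 33, 33, 21, 12, 3)
larea   <- log(area)
lsp     <- log(species)
mod1    <- lm(lsp ~ larea)

newx <- seq(min(larea), max(larea), 0.1)

# Correct: the column name matches the predictor used in mod1
good <- predict(mod1, newdata = data.frame(larea = newx),
                interval = "confidence")

# Deliberately wrong column name: predict() cannot find larea in newdata,
# so it silently uses the original 7-value larea vector (and warns)
bad <- suppressWarnings(
  predict(mod1, newdata = data.frame(x = newx), interval = "confidence")
)

print(c(rows_good = nrow(good), rows_bad = nrow(bad), length_newx = length(newx)))
```

Passing the second result to lines(newx, ...) is what triggers the "'x' and 'y' lengths differ" error.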

[Figure r3: regression plot with smooth confidence limits computed from newx]

 

The lines in this final graph are a lot smoother.
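If you would rather draw the limits through the original data points, keep in mind that lines() joins points in the order they are supplied, and larea is not sorted here. A minimal sketch of one way around this, assuming the same data and model, is to reorder both the x-values and the matching rows of the predict() output with order(); ord is a name introduced for this example.

```r
# Data and model as defined earlier in the post
area    <- c(153750, 15860, 5800, 5545, 11970, 2000, 100)
species <- c(36, 34, 33, 33, 21, 12, 3)
larea   <- log(area)
lsp     <- log(species)
mod1    <- lm(lsp ~ larea)

a   <- predict(mod1, interval = "confidence")
ord <- order(larea)                 # row indices that sort larea ascending

plot(larea, lsp)
abline(mod1)

# Reorder x and the matching rows of 'a' together so the lines do not double back
lines(larea[ord], a[ord, "lwr"], lty = 2)
lines(larea[ord], a[ord, "upr"], lty = 2)
```

With only seven points the limits are still drawn as straight segments, so the newx approach above remains the smoother option.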

24 comments:

  1. I am having problems with the confidence interval lines zigzagging. I know I need to sort the upr and lwr confidence limit columns to straighten the curves out but haven't been successful yet.

  2. By zigzagging, do you mean they cross the regression line? It does seem like you are mixing the upper and lower confidence limits. Maybe you can post (here) some of your code. That might help.

  3. Thank you that was so useful!!

  4. For the second part of this plot you need to change interval to "prediction." The standard errors are slightly larger for prediction intervals rather than confidence interval. Confidence intervals are valid where you have x values in your data set. For possible x values not present in your dataset, you need to use a prediction interval.

  6. Thanks, Dalton, my confidence intervals were looking suspiciously narrow until I changed the interval from "confidence" to "prediction". And thanks to Campitor for posting this, it worked like a charm!

  7. # I'm confused: I copied and ran
    # the code, and R still complains:

    > Error in xy.coords(x, y) : 'x' and 'y' lengths differ

    #Had a look at the lengths:
    dim(a)
    > [1] 74 3

    #So, I tried for the fun of it:
    newx <- seq(min(larea), max(larea), length.out=7)

    #Reproduces the behaviour
    #described by Veronica.
    # Q: Did the R core team
    # change something I missed?
    # Or am I missing a point here?
    # BTW, using my own data,
    # same thing happens.
    # Tried restarting R several times.

    # Running 2.15.0
    # on x86_64-pc-mingw32/x64 (64-bit)

    Replies
    1. Hey, what you missed is a comma:
      [1] is supposed to be [,1] to include all rows. In this case, x and y lengths are equal.

  8. # Oh my, stupid me!
    # There we go:


    predict(mod1, interval="confidence")
    a <- predict(mod1, interval="confidence")
    lines(larea, a[,2], lty=2)
    lines(larea, a[,3], lty=2)

    newx <- seq(min(larea), max(larea), 0.1)
    a <- predict(mod1, newdata=data.frame(larea=newx), interval="prediction")
    lines(newx, a[,2], lty=2, col="red")
    lines(newx, a[,3], lty=2, col="green")

    a <- predict(mod1, newdata=data.frame(larea=newx), interval="confidence")
    lines(newx, a[,2], lty=1, col="darkred")
    lines(newx, a[,3], lty=1, col="darkgreen")

  9. Thank you ! it was so useful ! cheers !

  10. How would one go about filling in the area between the upper and lower limit

  11. Hi, should this work if you want confidence intervals on the partial regression plots, i.e. added-variable plots?

  12. Please note that when you create a confidence band for the entire regression line, you ought to adjust for simultaneous estimation, e.g. by using the Working-Hotelling procedure.

  13. Thanks, I have been working on this problem for a couple of hours this afternoon and found this approach so much easier than several others I was given! Just one change because I was using Rcmdr and submitting all at once (after developing the model) is that I changed one of the "a <-" paragraphs to read b<- etc so the PI's were different to CI's.
    Lynette

  14. Thank you very much for your help! Your method worked almost perfectly.


    The only inconvenience is that for some of the plots, the confidence band lines did not extend to the end of the data range. How can I solve this?

  15. I like the way on how you put up your blogs. Wonderful and awesome. Hope to read more post from you in the future. Goodluck. Happy blogging!

    Bubble
    www.gofastek.com

  16. @Cidy Dy.

    thank you for your comments.

  17. @Moreno-Bernal
    That can probably be fixed if you extend the x and y limits with xlim and ylim. The procedure stated above should create confidence intervals for the same length as the regression slope.

  18. Hello, thanks for the useful and understandable script. If possible I would like to keep using the range that is generated from my actual data and not start adding sequenced numbers. This is because I keep having errors with the number of rows of the sequence, related to the range of my data (min/max), and wind up playing with the interval step (changing 0.01). Is there a way to use "confidence" with original data but to make a smooth line? I tend to get many layers, like a spider web on top of each other (not crossing regr line).

    Thanks

    Replies
    1. Am getting the same thing, the dotted line that represents the CI lines looks like it has been overwritten multiple times in a similar location, but just off enough to create a linear 'scribble.' Does not cross the regression line.

  19. Every time I visit your blog it really completes my day, and hey its not a joke. I am telling the truth. Thank you for always inspiring us and for writing a very touching article.

    zandra

    www.n8fan.net

  20. Hi,

    I am trying to replicate this part of the code:

    a <- predict(mod1, newdata=data.frame(larea=newx), interval="prediction")

    OR

    a <- predict(mod1, newdata=data.frame(larea=newx), interval="confidence")

    And I am getting the following error:

    "Error: unexpected '=' in "newx <- predict(mod1, newdata=data.frame(larea="

    I'm not sure where it thinks the error is. Any ideas? TY
