
Sunday, May 14, 2017

A gentle introduction to Generalized Linear Models in R

What are generalized linear models?

Generalized linear models (glm) are an extension of linear models used when the errors do not follow a normal distribution. In previous posts I’ve discussed linear models (lm), their use and interpretation.

To recap, an lm models a response variable that depends on one or more independent variables:
y~x
  Regular linear models make several assumptions; a really important one is the normal distribution of the errors. Errors are the differences between the observed and the predicted values of the response variable.

Let’s use an example. Say you are modeling the effect of seedling density on seedling herbivory, both measured as continuous variables.
herbivory~density
  If you assume a linear relationship between the two variables, a linear model will produce a linear equation that lets us predict how much herbivory a plant will suffer based on its local density. Such equations are generally presented in the form:

\[ y = \alpha + \beta_1 x \]

With \(\alpha\) as the intercept and \(\beta_1\) as the regression coefficient, which depicts the linear effect of density on herbivory. If \(\beta_1 > 0\), an increase in density has a positive effect on herbivory (i.e., increases it). This relationship is usually displayed as a regression line relating the dependent and independent variables.
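As a quick illustration, here is how such a model is fitted in R with lm(). Note that the numbers below are simulated placeholders, not real data:

> set.seed(42)                                      # reproducible fake data
> density <- runif(50, 1, 20)                       # simulated seedling densities
> herbivory <- 2 + 0.5*density + rnorm(50, sd = 2)  # linear effect plus normal errors
> m0 <- lm(herbivory ~ density)                     # fit the linear model
> coef(m0)                                          # estimates of alpha and beta_1
> hist(resid(m0))                                   # errors should look roughly normal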



  The errors are the differences (dotted lines) between the observed values (dots in the previous figure) and the regression line. When both variables are normally distributed and linearly related, the distribution of the errors should follow a normal distribution with zero mean.


  For certain types of data, like counts or proportions, the possible values are limited by certain restrictions. Counts, for example, can never be negative or fractional. Proportions are bound between 0 and 1. In most of these cases, the relationship between the dependent and independent variables is no longer linear and is usually described by a non-linear equation. This means that the distribution of the errors along the fitted line is no longer normal either. In these cases a generalized linear model should be preferred over a linear model.

  As an example, let us consider a dichotomous response variable such as survival of a seedling in the forest one year after germination. The seedling may survive (success) or it may not (failure). We are interested in determining whether herbivory affects the fate of seedlings, so our goal is to fit a model that describes the effect of herbivory on seedling survival. Survival is a probability and, as such, its range is restricted to \( y \in [0,1] \). In contrast, a regular linear model would fit a line that predicts values in the range \( \hat{y} \in (-\infty, \infty) \). This is clearly wrong: there are no negative probabilities, nor probabilities higher than one. This is one of the reasons why a linear model is not advisable in this scenario. We need to fit a model that limits the response variable between 0 and 1. A sigmoid relationship, similar to the one shown in the following figure, may be used to model the effect of herbivory on survival.

  There are two approaches for fitting a sigmoidal curve. One is to fit a non-linear model to the data and try to estimate the exponential or sigmoidal equation that best describes the effect of herbivory on seedling survival. Non-linear models require very precise data and previous knowledge of the relationship between the variables, and they estimate several parameters. The second option is to use a linear model adapted to our probability data. The second option is preferred since we have robust methods for fitting linear models.

Generalized linear models (glm) allow us to fit linear models to data that do not meet the criteria for linear regression. Glm's fit predictors that describe the relationship between the independent and the response variables while taking into account the restrictions imposed by the data. In our example, they predict expected values that lie between 0 and 1.

 Predictions are performed through a predictor function. These predictor functions \(\eta(x)\) are usually not in the linear form \( y = \alpha + \beta x + \varepsilon \) and therefore need to be linearized by a link function. In other words, the link function \( g(\eta) \) linearizes the non-linear predictor function \( \eta(x) \), allowing us to use robust linear-model methods. Since there are many predictor and link functions, depending on the nature of the data, these models are referred to in the plural as generalized linear models.

  The previous explanation will become clearer with an example. Previously, we wanted to determine the effect of herbivory on the probability of seedlings surviving one year in the forest. The following equation models a sigmoidal curve that describes the effect of an independent variable \(x\) on the probability of survival \(p\).

\[ \eta(x) = p =\frac{e^{\alpha+\beta x}}{1+e^{\alpha + \beta x}} \]
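To get a feel for this curve, here is a quick sketch that plots it for arbitrary, made-up values of \(\alpha\) and \(\beta\):

> a <- 3.5; b <- -0.23                       # arbitrary illustrative parameters
> curve(exp(a + b*x) / (1 + exp(a + b*x)),   # the sigmoid equation above
+       from = 0, to = 70, xlab = "x", ylab = "p")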

  We are interested in determining the effect of herbivory (\( \beta \)) on the probability of survival using a linear model. To this end we use the log of the odds ratio, or logit, as the link function. The odds are the ratio between the probability of survival \(p\) and the probability of not surviving \( (1-p) \), and the logit is the natural log of these odds. Therefore, the predictor equation may be transformed into a linear model as:

\[ g(\eta) = \ln\left({\frac{p}{1-p}}\right) = \alpha +\beta x \]

Now we may fit a regular linear model to estimate the parameter \(\beta\), which describes the effect of herbivory on survival probability. The only precaution we need to take is that the response variable has been transformed into logits, so we need to transform the estimates back into probabilities or odds ratios to interpret them.
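In R, the logit transform and its inverse are available as qlogis() and plogis(), which makes moving between the two scales easy. A tiny example:

> p <- 0.8            # an example survival probability
> qlogis(p)           # logit: log(0.8/0.2), about 1.386
> plogis(qlogis(p))   # the inverse logit recovers 0.8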

There are different predictor and link functions depending on the type of response variable. Counts usually use a log link function, while probabilities or dichotomous variables use a logit or probit link. I will not go into detail in this post about why or which link function should be used. In the next section I’ll show how to perform and interpret a glm in R.

GLMs in R

In R, generalized linear models are fitted using the glm() command. It is similar to the lm() command in that it requires a formula that describes the relationship between the dependent and the independent variables. However, glm() also requires that we define an error distribution family. The family defines the distribution of the errors and chooses the appropriate link function. In R the two most commonly used families are Poisson and binomial. Poisson is commonly used for counts, while the binomial family is used for proportions or dichotomous response variables.
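Schematically, the calls look like this (the data frame d below is a made-up placeholder, just to show the family argument):

> d <- data.frame(x = 1:10, count = rpois(10, 3), survived = rbinom(10, 1, 0.5))
> glm(count ~ x, data = d, family = poisson)      # counts: log link by default
> glm(survived ~ x, data = d, family = binomial)  # 0/1 data: logit link by default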

  For our first example we will use a simple model that studies the effect of herbivory, measured as leaf area ( \(\textrm{mm}^2\) ) removed, on the fate of seedlings of the endemic Costa Rican oak Quercus costaricensis. One year after germination, seedlings were recorded as survived (1) or died (0). We create a subsample of the data set in R as follows:

> oak <- data.frame(survival = c(1,1,1,0,1,0,0,0,0,1,1,1,1,0,0,0,0,1,1),
+                   area = c(5.41,5.63,25.92,15.17,13.04,18.85,33.95,22.87,12.01,11.6,6.09,2.28,4.05,59.94,63.16,22.76,23.54,0.21,2.55))
> plot(survival~area, data=oak)

  The figure shows a likely decrease in survival with an increase in leaf area removed. We now need to test this effect using a glm. We fit a simple model of the effect of herbivory on survival using the binomial family, which uses a logit link function. As with linear models, the result of the glm must be assigned to an object for downstream analysis.
> m1 <- glm(survival~area, data=oak, family=binomial)

The results are displayed after asking for a summary:
> summary(m1)

Call:
glm(formula = survival ~ area, family = binomial, data = oak)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-1.5413  -0.5731   0.2427   0.4369   2.2065

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)   3.5582     1.6145   2.204   0.0275 *
area         -0.2277     0.1009  -2.257   0.0240 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)
Null deviance: 26.287  on 18  degrees of freedom
Residual deviance: 13.113  on 17  degrees of freedom
AIC: 17.113
Number of Fisher Scoring iterations: 6 


  As in previous posts, we will analyze the results in detail. The first part shows the model tested and the distribution of the deviance residuals. Since we are fitting a model, the differences between the observed values and those predicted by the model are the residuals. A summary of their distribution is presented: the minimum and maximum residuals, the first and third quartiles, and the median.

I normally just compare the Min-Max and 1Q-3Q to see if they are comparable in absolute size. If not, I worry about skewed data and overdispersion (more on that later).
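If you want those five numbers without scanning the summary, they can be pulled straight from the deviance residuals:

> quantile(residuals(m1, type = "deviance"))   # min, 1Q, median, 3Q, max as above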

  R then presents us with the estimates for the linear model coefficients. In our case we fit a simple linear model of the form \( y = \alpha + \beta x + \varepsilon \), so there are only two parameters to be estimated: the intercept (\(\alpha\)) and the effect of area on survival (\(\beta\)). In this case the intercept is of little interest to us, but we do see a significant negative effect of area on survival ( \(\beta = -0.2277, p=0.0240\) ).

  The summary then tells us that the variance was estimated using the binomial family without overdispersion. The null deviance depicts the residual variance of a null model (i.e., just the intercept), while the residual deviance is the variance of the residuals when the covariate area is included in the model. This is why we have one less degree of freedom in the residual deviance: we are estimating one extra term in the model. The summary ends with the number of Fisher scoring iterations needed for convergence (I never check this).

  Before we interpret the results, it is advisable to test some assumptions about the fit of the model. First of all we need to check for overdispersion. In linear models (lm) the variance of the residuals is estimated from the data. However, in glm’s the variance of the residuals is specified by the distribution used and is usually described as a particular relationship between the mean and the variance. For example, had we chosen a Poisson distribution, the mean and the variance should be the same. A similar relationship exists for the binomial distribution. If the data show a greater variance than expected under the distribution, we have overdispersion. When overdispersion occurs, the fitted model and the estimated parameters are biased and should not be used.

Overdispersion usually means that a covariate with an important effect on the response variable was omitted. For a logistic model, the deviance residuals should follow a chi-square distribution with \( (n-p) \) degrees of freedom, where \(p\) is the number of estimated parameters. Under the chi-square distribution, the deviance and its degrees of freedom should be comparable. Thus, we can check the null hypothesis of no overdispersion with:
> 1-pchisq(m1$deviance,m1$df.residual)
[1] 0.7285575
  Since we do not see any evidence of overdispersion ( \(p>0.05\) ), we may assume the model is appropriate and test the significance of the model through a deviance test:
> anova(m1, test="Chisq")
Analysis of Deviance Table

Model: binomial, link: logit

Response: survival

Terms added sequentially (first to last)

               Df Deviance Resid. Df Resid. Dev  Pr(>Chi)
NULL                             18     26.287
area           1   13.174        17     13.114 0.0002839 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
  The results show that a model that includes area is significantly different from a model with just the intercept (NULL). Therefore we can conclude that there is a significant effect of area on survival.

Now, let’s get back to the interpretation of the coefficients. As we saw before, the model estimates two coefficients: the intercept and the slope of the regression.
> coef(m1)
(Intercept)        area 
  3.5581582  -0.2276513 
This shows that the effect of area is negative; however, we cannot interpret these values directly. As shown at the beginning of the post, our variables were transformed to logits. Logits are the log of the odds ratio \( \tfrac{p}{1-p} \), the odds of an event happening (e.g., the seedling survives) over the odds of it not happening (the seedling dies). An odds ratio (OR) of 1 means that seedling survival is as likely as death. An OR < 1 means that seedlings are more likely to die than to survive, while survival is more likely than death if OR > 1. So in order to interpret the coefficients from the glm, we need to convert logits into ORs by exponentiating them.
> exp(coef(m1))
(Intercept)        area 
 35.0984945   0.7964019 
  As previously mentioned, we are not too interested in the intercept. The OR for area is less than 1, suggesting that each additional \(\textrm{mm}^2\) of leaf area removed lowers the odds of survival by ( \(1-0.7964=0.2036\) ) approximately 20%.

  If we want 95% confidence intervals for the coefficients, we can ask for them with the confint() function, but they also need to be exponentiated in order to interpret them as ORs.
> exp(confint(m1))
Waiting for profiling to be done...
                2.5 %       97.5 %
(Intercept) 2.9039600 2466.3851121
area        0.6092671    0.9289333
The "Waiting for profiling to be done..." line is just a message printed while R computes profile-likelihood intervals and can be ignored. The 95% confidence intervals for area do not include OR = 1, which confirms the significant negative effect of area on survival. We can conclude that each additional unit of leaf area removed by herbivores decreases the odds of survival; with 95% certainty, the decrease lies between roughly 7% and 39%.
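Those percentages come straight from the exponentiated limits; the arithmetic in one line:

> 1 - exp(confint(m1))["area", ]   # 1 - 0.609 ≈ 0.39 and 1 - 0.929 ≈ 0.07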

R-square

There is no simple way of estimating \(R^2\) for glm’s. A common approximation is based on the ratio of the null and residual deviances. As previously mentioned, the former measures the fit of a model that only includes the intercept, while the latter corresponds to the model that includes the covariates.
> 1 - (m1$deviance / m1$null.deviance)
[1] 0.5011413
We can interpret this value to say that approximately 50% of the deviance in survival is explained by herbivory.

Plot

A plot may be constructed in two ways: using the base graphics commands in R or using ggplot2. Although ggplot2 is a handy tool, I always prefer base graphs since they look better for publication. In order to plot the observed values and the expected curve, we need to calculate predicted values, as we did in the post where we calculated confidence intervals for regression lines.

> xv <- seq(0, 70, length = 1000)
> su.pre <- predict(m1, data.frame(area = xv), type = "response", se.fit = TRUE)
> plot(survival~area, data=oak, xlim=c(0,80), ylab="Seedling survival", xlab="Herbivory")
> lines(su.pre$fit ~ xv)
> lines(su.pre$fit + su.pre$se.fit ~ xv, lty = 2)
> lines(su.pre$fit - su.pre$se.fit ~ xv, lty = 2)
In the previous commands we do the following:
  1. Create 1000 values between 0 and 70 to be used as \(x\) values for our regression model.
  2. Estimate \(\hat{y}\) values for each \(x\) value created in the previous step, along with standard errors for the estimates (option se.fit = TRUE).
  3. Plot the observed values with appropriate axis labels.
  4. Add lines for the expected \(\hat{y}\) values (su.pre$fit) and dashed lines one standard error above and below the fit.
The resulting graph shows the expected sigmoidal decrease in seedling survival with increasing herbivory.
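For completeness, if you do prefer ggplot2, a roughly equivalent figure can be produced by letting geom_smooth() refit the same binomial glm; this is a sketch, not a polished publication plot:

> library(ggplot2)
> ggplot(oak, aes(x = area, y = survival)) +
+   geom_point() +
+   geom_smooth(method = "glm", method.args = list(family = binomial)) +
+   labs(x = "Herbivory", y = "Seedling survival")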