A Gentle Introduction to Generalized Linear Models in R <h2>
What are generalized linear models?</h2>
Generalized linear models (<code>glm</code>) are a special form of linear model used when the errors do not follow a normal distribution. In previous posts I’ve discussed <a href="http://r-eco-evo.blogspot.com/2011/08/comparing-two-regression-slopes-by.html">linear models</a> (<code>lm</code>), their use, and their interpretation.<br />
<br />
To recap, <code>lm</code>’s model a response variable that depends on one or more independent variables<br />
<blockquote>
y~x</blockquote>
Regular linear models make several assumptions; a particularly important one is that the errors follow a normal distribution. Errors are the differences between the observed and predicted values of the response variable.<br />
<br />
Let’s use an example. Let’s say you are modeling the effect of seedling density on seedling herbivory, both measured as continuous variables.<br />
<blockquote>
herbivory~density</blockquote>
If you assume a linear relationship between both variables, a linear model will produce a linear equation that allows us to predict how much herbivory a plant will have based on its density. Generally such equations are presented in the form:<br />
<br />
\[ y = \alpha + \beta_1 x \]<br />
<div dir="ltr">
<br />
With \(\alpha\) as the intercept and \(\beta_1\) as the regression coefficient, which describes the linear effect of density on herbivory. If \(\beta_1 > 0\), an increase in density will have a positive effect on (i.e., increase) herbivory. This relationship is usually displayed as a regression line relating the dependent and independent variables.</div>
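As a quick sketch of this in R (simulated numbers, not data from the post), `lm()` fits the line and `coef()` returns \(\alpha\) and \(\beta_1\):

```r
# Hypothetical simulated data: herbivory increasing with seedling density
set.seed(42)
density   <- runif(50, 1, 20)                       # seedlings per plot (invented)
herbivory <- 2 + 0.8 * density + rnorm(50, sd = 1)  # true alpha = 2, beta1 = 0.8

m <- lm(herbivory ~ density)
coef(m)        # (Intercept) estimates alpha; density estimates beta1
hist(resid(m)) # with an intercept, residuals have zero mean and should look normal
```

Since \(\beta_1 > 0\) in this simulation, the fitted line predicts more herbivory at higher densities.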
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://3.bp.blogspot.com/-WsfrPKlhRvU/WRm3NzbnjqI/AAAAAAAACo0/ki_584ZuKYgnpFfs6H0MGprItAc6N3UDQCLcB/s1600/Untitled%2Bdrawing.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="248" src="https://3.bp.blogspot.com/-WsfrPKlhRvU/WRm3NzbnjqI/AAAAAAAACo0/ki_584ZuKYgnpFfs6H0MGprItAc6N3UDQCLcB/s320/Untitled%2Bdrawing.png" width="320" /></a></div>
<br />
<br />
The errors are the differences (dotted line) between the observed values (dots on previous figure) and the regression line. When both variables are normally distributed and linearly related, the distribution of the errors should follow a normal distribution with zero mean.<br />
<div dir="ltr">
<br />
For certain types of data, like counts or proportions, their distributions are limited by certain restrictions. Counts, for example, cannot ever be negative. Proportions are bound between 0 and 1. In most of these cases, the relationship between dependent and independent variables is no longer linear and is usually described by non-linear equations. This means that the error distribution along the fitted line is no longer normal either. In these cases a generalized linear model should be preferred over linear models.</div>
<div dir="ltr">
<br />
As an example, let us consider a dichotomous response variable such as survival of a seedling in the forest one year after germination. The seedling may survive (success) or it may not (failure). We are interested in determining if herbivory affects the fate of seedlings. Our goal is to fit a model that describes the effect of herbivory on seedling survival. Survival is a probability and as such, its range is restricted to \( y \in [0,1]\). In contrast, a regular linear model would fit a line that predicts values in the range \( \hat{y} \in (-\infty, \infty)\). This is clearly wrong. There aren’t any negative probabilities or probabilities higher than one. This is one of the reasons why a linear model is not advisable in this scenario. We would need to fit a model that limits the response variable between 0 and 1. A sigmoid relationship, similar to the one shown in the following figure, may be used to model the effect of herbivory on survival.</div>
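To see the problem concretely (a sketch with invented 0/1 data, not from the post), fitting a straight line to a binary response yields predicted “probabilities” outside [0, 1]:

```r
# Invented dichotomous data: seedlings die above a herbivory threshold
herbivory <- seq(0, 100, by = 5)
survival  <- as.numeric(herbivory < 40)

bad <- lm(survival ~ herbivory)  # a plain linear model on 0/1 data
range(predict(bad))              # predictions fall below 0 and above 1
```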
<div class="separator" style="clear: both; text-align: center;">
<img src="https://lh3.googleusercontent.com/-2wmGsYna7Sg/WRjSSwbeNQI/AAAAAAAACn4/Mm9mh9F3hicYHs_SaXO20i9zxBM4NJ0lQCHM/s9999/Rplot001_1.png" style="max-width: 100%;" width="300" /></div>
<br />
<div dir="ltr">
There are two approaches for fitting a sigmoidal curve. One is to fit a non-linear model to the data and try to estimate the exponential or sigmoidal equation that best describes the effect of herbivory on seedling survival. Non-linear models require very precise data, prior knowledge of the relationship between the variables, and the estimation of several parameters. The second option is to use a linear model adapted to our probability data. The second option is preferred since we have robust methods for fitting linear models.<br />
<br />
Generalized linear models (<code>glm</code>) allow us to fit linear models to data that do not meet the criteria for linear regression. GLMs fit predictors that describe the relationship between the independent and the response variables while taking into account the restrictions imposed by the data. In our example, they predict expected values that lie between 0 and 1.<br />
<br />
Predictions are performed through a predictor function. These predictor functions \(\eta(x)\) are usually not in the linear form \( y = \alpha + \beta_1 x + \varepsilon \) and therefore need to be linearized by a <strong>link</strong> function. In other words, the link function \( g(\eta) \) linearizes the non-linear predictor function \( \eta(x) \), allowing us to use robust methods. Since there are many predictor and link functions depending on the nature of the data, these models are referred to in plural as <em>generalized linear models.</em></div>
<div dir="ltr">
<br />
The previous explanation will become clearer with an example. Previously, we wanted to determine the effect of herbivory on the probability of seedlings surviving one year in the forest. The following equation models a sigmoidal curve that describes the effect of an independent variable \(x\) on the probability of survival \(p\).</div>
<br />
\[ \eta(x) = p =\frac{e^{\alpha+\beta x}}{1+e^{\alpha + \beta x}} \]<br />
<div dir="ltr">
<br />
We are interested in determining the effect of herbivory (\( \beta \)) on the probability of survival using a linear model. To this end we use the log of the odds, or <b>logit</b>, as the link function. The odds are the ratio between the probability of survival \(p\) and the probability of not surviving \( (1-p) \); the logit is the natural log of these odds. The predictor equation may therefore be transformed into a linear model as:<br />
<br />
\[ g(\eta) = \ln\left({\frac{p}{1-p}}\right) = \alpha +\beta x \]</div>
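Base R implements both functions: `plogis()` is the logistic curve \(\eta(x)\) and `qlogis()` is the logit link \(g(\eta)\). A quick numerical check (with hypothetical coefficient values) confirms they are inverses:

```r
alpha <- 3.5; beta <- -0.2        # hypothetical intercept and slope
x <- seq(0, 60, by = 10)

eta <- alpha + beta * x           # the linear predictor
p   <- plogis(eta)                # exp(eta) / (1 + exp(eta)); always in (0, 1)
qlogis(p)                         # log(p / (1 - p)); recovers eta
```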
<div dir="ltr">
<br />
Now we may fit a regular linear model to estimate the parameter \(\beta\), which describes the effect of herbivory on survival probability. The only precaution we need to take is that the response variable has been transformed into logits, so we need to transform the estimates back into probabilities or odds ratios to interpret them.</div>
<div dir="ltr">
<br />
There are different predictor and link functions depending on the type of response variable. Counts usually use a log link, while probabilities or dichotomous variables use a logit or probit link. I will not go into detail about which link function should be used and why in this post. In the next section I’ll show how to perform and interpret a <code>glm</code> in R.<br />
<br /></div>
<h3>
GLMs in R</h3>
<div dir="ltr">
In R, generalized linear models are performed using the <code>glm()</code> command. It is similar to the <code>lm()</code> command as it requires a formula that describes the relationship between the dependent and the independent variables. However, <code>glm()</code> requires that we define an error distribution family. The <em>family</em> defines the distribution of the errors and chooses the appropriate link function. In R the two most commonly used families are Poisson and Binomial. Poisson is commonly used for counts, while the Binomial family is used in proportions or dichotomous response variables.</div>
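Each family object carries its default link, which can be inspected directly in R; an alternative link may also be requested explicitly:

```r
binomial()$link                 # "logit" -- default for dichotomous data / proportions
poisson()$link                  # "log"   -- default for counts
binomial(link = "probit")$link  # alternative links can be set explicitly
```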
<div dir="ltr">
<br />
For our first example we will use a simple model that studies the effect of herbivory, measured as leaf area ( \(\textrm{mm}^2\) ) removed, on the fate of seedlings of the endemic Costa Rican oak <em>Quercus costaricensis</em>. One year after germination, seedlings were recorded as survived (1) or dead (0). We create a subsample of the data set in R as follows:</div>
<br />
<pre><blockquote>
> oak <- data.frame(survival = c(1,1,1,0,1,0,0,0,0,1,1,1,1,0,0,0,0,1,1),
+                   area = c(5.41,5.63,25.92,15.17,13.04,18.85,33.95,22.87,12.01,
+                            11.6,6.09,2.28,4.05,59.94,63.16,22.76,23.54,0.21,2.55))
> plot(survival~area, data=oak)</blockquote>
</pre>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://3.bp.blogspot.com/-McdwLtA50G8/WRmyG-j6CLI/AAAAAAAACoo/4QQd9z2tbnYll-aHOfiJPnBK0GIqlLDSACLcB/s1600/scatter.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="320" src="https://3.bp.blogspot.com/-McdwLtA50G8/WRmyG-j6CLI/AAAAAAAACoo/4QQd9z2tbnYll-aHOfiJPnBK0GIqlLDSACLcB/s320/scatter.png" width="320" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<div dir="ltr">
The figure shows a likely decrease in survival with an increase in leaf area removed. We now need to test the effect using a glm. We fit a simple model of the effect of herbivory on survival using the binomial distribution which uses a logit link function. As with linear models, the result of the glm must be assigned to an object for downstream analysis.</div>
<blockquote>
> m1 <- glm(survival~area, data=oak, family=binomial)<br />
<br /></blockquote>
<div dir="ltr">
The results are displayed after asking for a summary:</div>
<blockquote>
<pre>> summary(m1)

Call:
glm(formula = survival ~ area, family = binomial, data = oak)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.5413  -0.5731   0.2427   0.4369   2.2065  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)  
(Intercept)   3.5582     1.6145   2.204   0.0275 *
area         -0.2277     0.1009  -2.257   0.0240 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 26.287  on 18  degrees of freedom
Residual deviance: 13.113  on 17  degrees of freedom
AIC: 17.113

Number of Fisher Scoring iterations: 6</pre>
</blockquote>
<br />
<br />
<div dir="ltr">
As in previous posts, we will analyze the results in detail. The first part shows the model tested and the distribution of the deviance residuals. Since we are fitting a model, the differences between the observed values and those predicted by the model are the residuals. A summary of their distribution is presented as the minimum and maximum residuals, the first and third quartiles, and the median.</div>
<div dir="ltr">
<br />
I normally just compare the Min-Max and 1Q-3Q to see if they are comparable in absolute size. If not, I worry about skewed data and overdispersion (more on that later).</div>
<div dir="ltr">
<br />
R then presents us with the estimates for the linear model coefficients. In our case we fit a simple linear model of the form \( y = \alpha + \beta_1 x + \varepsilon \), so there are only two parameters to be estimated: the intercept (\(\alpha\)) and the effect of area on survival (\(\beta_1\)). In this case the intercept is of little interest to us, but we do see a significant negative effect of <code>area</code> on <code>survival</code> ( \(\beta_1 = -0.2277, p=0.0240\)).</div>
<div dir="ltr">
<br />
The summary then tells us that the variance was estimated using the binomial family without overdispersion. The null deviance depicts the residual variance of a null model (i.e., just the intercept), while the residual deviance is the variance of the residuals when the covariate <code>area</code> is included in the model. This is why the residual deviance has one less degree of freedom: we are including a new term in the model. The summary ends with the number of Fisher scoring iterations needed for convergence (I never check this).</div>
<div dir="ltr">
<br />
Before we interpret the results, it is advisable to test some predictions about the fit of the model. First of all we need to check for <strong>overdispersion</strong>. In linear models (<code>lm</code>) the variance of the residuals is estimated from the data. In <code>glm</code>’s, however, the variance of the residuals is specified by the distribution used and is usually described as a particular relationship between the mean and the variance. For example, had we chosen a Poisson distribution, the mean and the variance should be equal. A similar relationship exists for the binomial distribution. If the data show greater variance than expected under the distribution, we have overdispersion. When overdispersion occurs, the fitted model and the estimated parameters are biased and should not be used. <br />
<br />
Overdispersion usually means that a covariate with an important effect on the response variable was omitted. For a logistic model the deviance residuals should follow a chi-square distribution with \( (n-p) \) degrees of freedom, where \(p\) is the number of estimated parameters. Under this chi-square distribution the deviance and its degrees of freedom should be comparable. Thus, we can check the null hypothesis of no overdispersion by:</div>
<blockquote>
> 1-pchisq(m1$deviance,m1$df.residual)<br />
[1] 0.7285575</blockquote>
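The same check can be wrapped in a small helper function (my own sketch, not part of the post's code) so it can be reused on any fitted glm; plugging in the deviance and degrees of freedom reported above reproduces the p-value:

```r
# Hypothetical helper: p-value for the chi-square overdispersion check.
# A small p-value would suggest more variance than the family allows.
overdispersion_p <- function(model) {
  1 - pchisq(model$deviance, model$df.residual)
}

overdispersion_p(list(deviance = 13.113, df.residual = 17))  # ~ 0.73
```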
<div dir="ltr">
Since we do not see any evidence of overdispersion ( \(p>0.05\) ), we may assume the model is appropriate and test the significance of the model through a deviance test:</div>
<blockquote>
<pre>> anova(m1, test="Chisq")
Analysis of Deviance Table

Model: binomial, link: logit

Response: survival

Terms added sequentially (first to last)

     Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
NULL                    18     26.287              
area  1   13.174        17     13.114 0.0002839 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1</pre>
</blockquote>
<div dir="ltr">
The results show that a model that includes <code>area</code> is significantly different than a model with just the intercept (<code>NULL</code>). Therefore we can conclude that there is a significant effect of area on survival.</div>
<div dir="ltr">
<br />
Now, let’s get back to the interpretation of the coefficients. As we saw before, the model predicts two coefficients, the intercept and the slope of the regression.</div>
<blockquote>
> coef(m1)<br />
(Intercept) area<br />
3.5581582 -0.2276513</blockquote>
<div dir="ltr">
This shows that the effect of area is negative; however, we cannot interpret these values directly. As shown at the beginning of the post, our variables were transformed into logits. Logits are the log of the odds ratio \( \tfrac{p}{1-p} \), the odds of an event happening (e.g., a seedling survives) over the odds of it not happening (the seedling dies). An odds ratio (OR) of 1 means that seedling survival and death are equally likely. If OR &lt; 1, seedlings are more likely to die than to survive, while survival is more likely than death if OR &gt; 1. So in order to interpret the coefficients from the glm, we need to convert logits into ORs by exponentiating them.</div>
<pre><blockquote>
> exp(coef(m1))
(Intercept)        area 
 35.0984945   0.7964019</blockquote>
</pre>
<div dir="ltr">
As previously mentioned, we are not too interested in the intercept. The OR for <code>area</code> is less than 1, suggesting that each additional \(\textrm{mm}^2\) of leaf area removed lowers the odds of survival by (\(1-0.7964=0.2036 \) ) approximately 20%.</div>
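The arithmetic of that back-transformation is worth spelling out (the coefficient value is taken from the `coef(m1)` output above):

```r
b  <- -0.2276513   # logit-scale slope for area, from coef(m1)
OR <- exp(b)       # odds ratio per additional mm^2 of leaf area removed
OR                 # ~ 0.796: each mm^2 multiplies the odds of survival by ~0.8
(1 - OR) * 100     # ~ 20: roughly a 20% reduction in the odds per mm^2
```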
<div dir="ltr">
<br />
If we want 95% confidence intervals for the coefficients we can ask for them with the <code>confint()</code> function but they also need to be exponentiated in order to interpret the OR's.</div>
<pre><blockquote>
> exp(confint(m1))
Waiting for profiling to be done...
                2.5 %       97.5 %
(Intercept) 2.9039600 2466.3851121
area        0.6092671    0.9289333</blockquote>
</pre>
<div dir="ltr">
Disregard the profiling message; <code>confint()</code> computes profile-likelihood intervals for glm’s and simply reports that it is doing so. The 95% confidence interval for area does not include OR = 1, which confirms the significant negative effect of area on survival. We can conclude that each additional unit of leaf area removed by herbivores decreases the odds of survival, with 95% certainty, by between roughly 7% and 39%.</div>
<h4>
R-square</h4>
<div dir="ltr">
There is no exact equivalent of \(R^2\) for glm’s. The usual approximation is a ratio of the null and residual deviances. As previously mentioned, the former shows the fit of a model that only includes the intercept, while the latter includes the covariates. </div>
<blockquote>
> 1 - (m1$deviance / m1$null.deviance)<br />
[1] 0.5011413</blockquote>
<div dir="ltr">
We can interpret this value as saying that approximately 50% of the deviance in survival is explained by herbivory.</div>
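As a reusable sketch (my own helper, not from the post), this deviance ratio, sometimes called a McFadden-style pseudo-\(R^2\), can be computed for any fitted glm:

```r
# Hypothetical helper: proportional reduction in deviance relative to the null model
pseudo_r2 <- function(model) {
  1 - model$deviance / model$null.deviance
}

pseudo_r2(list(deviance = 13.113, null.deviance = 26.287))  # ~ 0.50
```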
<h4>
Plot</h4>
<div dir="ltr">
A plot may be constructed in two ways: using the base graph commands in R or using <code>ggplot2</code>. Although ggplot is a handy tool, I always prefer base graphs since they look better for publication. In order to plot the observed values and the expected curve, we need to calculate predicted values as we did in the post where we calculated <a href="http://r-eco-evo.blogspot.com/2011/01/confidence-intervals-for-regression-plot.html" target="_blank">confidence intervals for regression lines</a>. </div>
<br />
<blockquote>
> xv <- seq(0,70, length=1000)<br />
> su.pre <- predict(m1, list(area=xv), type="response", se=T)<br />
> plot(survival~area, data=oak, xlim=c(0,80), ylab="Seedling survival", xlab="Herbivory")<br />
> lines(su.pre$fit~xv)<br />
> lines(su.pre$fit+su.pre$se.fit ~ xv, lty=2) <br />
> lines(su.pre$fit-su.pre$se.fit ~xv, lty=2)</blockquote>
<div dir="ltr">
In the previous commands we do the following: </div>
<ol>
<li>Create 1000 values between 0 and 70 to be used as \(x\) values for our regression model.</li>
<li>Estimate \(\hat{y}\) values for each \(x\) value created in the previous step and standard errors for the estimates (option <code>se=T</code>).</li>
<li>Plot the observed values with appropriate axis labels</li>
<li>Add lines for the expected \(\hat{y}\) values (<code>su.pre$fit</code>) and an approximate error band of plus or minus one standard error (<code>su.pre$se.fit</code>).</li>
</ol>
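One caveat worth adding (my note, not in the original): bands built as fit ± SE on the response scale can stray outside [0, 1]. A common alternative is to compute the band on the link (logit) scale and back-transform it with `plogis()`, which keeps it inside the valid range. A self-contained sketch using the same `oak` data:

```r
# Same data and model as above, repeated so the sketch is self-contained
oak <- data.frame(survival = c(1,1,1,0,1,0,0,0,0,1,1,1,1,0,0,0,0,1,1),
                  area = c(5.41,5.63,25.92,15.17,13.04,18.85,33.95,22.87,12.01,
                           11.6,6.09,2.28,4.05,59.94,63.16,22.76,23.54,0.21,2.55))
m1 <- glm(survival ~ area, data = oak, family = binomial)

xv <- seq(0, 70, length = 1000)
pr <- predict(m1, list(area = xv), type = "link", se.fit = TRUE)  # logit scale

fit   <- plogis(pr$fit)                      # back-transform to probabilities
upper <- plogis(pr$fit + 1.96 * pr$se.fit)   # approximate 95% band
lower <- plogis(pr$fit - 1.96 * pr$se.fit)

plot(survival ~ area, data = oak, xlim = c(0, 80),
     ylab = "Seedling survival", xlab = "Herbivory")
lines(fit ~ xv)
lines(upper ~ xv, lty = 2)  # bounded between 0 and 1 by construction
lines(lower ~ xv, lty = 2)
```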
<div dir="ltr">
The resulting graph shows the expected sigmoidal decrease in seedling survival with increasing herbivory.</div>
<div class="separator" style="clear: both; text-align: center;">
<img src="https://lh3.googleusercontent.com/-2MDD6jlqDoo/WRjSS7RW5uI/AAAAAAAACn8/Cw8alBuJc_w75GwiWqb_WfsJtXpW6VeowCHM/s9999/Rplot001_2.png" style="max-width: 100%;" width="400" /></div>
<h1>
Comparing two regression slopes by means of an ANCOVA</h1>
Regressions are commonly used in biology to determine the causal relationship between two variables. This analysis is most commonly used in morphological studies, where the allometric relationship between two morphological variables is of fundamental interest. Comparing scaling parameters (i.e. slopes) between groups can be used by biologists to assess different growth patterns or the development of different forms or shapes between groups. For example, the regression between head size and body size may differ between males and females if they grow differently. This difference in allometric growth should manifest itself as a different slope in the two regression lines.<br />
<h2>
Ancova</h2>
The analysis of covariance (ANCOVA) is used to compare two or more regression lines by testing the effect of a categorical factor on a dependent variable (y-var) while controlling for the effect of a continuous co-variable (x-var). When we want to compare two or more regression lines, the categorical factor splits the relationship between x-var and y-var into several linear equations, one for each level of the categorical factor. <br />
Regression lines are compared by studying the interaction of the categorical variable (i.e. treatment effect) with the continuous independent variable (x-var). If the interaction is significantly different from zero, the effect of the continuous covariate on the response depends on the level of the categorical factor. In other words, the regression lines have different slopes (right graph in the figure below). A significant treatment effect with no significant interaction shows that the covariate has the same effect for all levels of the categorical factor; however, since the treatment effect is important, the regression lines, although parallel, have different intercepts. Finally, if neither the treatment effect nor its interaction with the covariate is significant (but the covariate is), there is a single regression line. A reaction norm is used to graphically represent the possible outcomes of an ANCOVA.<br />
<a href="http://lh6.ggpht.com/-E1deZPMVV9M/TluC0mgY6xI/AAAAAAAAA58/Qd7YQ7tlPRk/s1600-h/ancova-fig-2%25255B10%25255D.jpg"><img alt="ancova-fig-2" border="0" height="225" src="http://lh4.ggpht.com/-sOH4146CDSc/TluC1FNYuGI/AAAAAAAAA6A/GHCCiieg3QE/ancova-fig-2_thumb%25255B6%25255D.jpg?imgmax=800" style="background-image: none; border-bottom-width: 0px; border-left-width: 0px; border-right-width: 0px; border-top-width: 0px; display: block; float: none; margin-left: auto; margin-right: auto; padding-left: 0px; padding-right: 0px; padding-top: 0px;" title="ancova-fig-2" width="468" /></a><br />
For this example we want to determine how body size (snout-vent length) relates to pelvic canal width in both male and female alligators (data from <a href="http://udel.edu/~mcdonald/statancova.html" target="_blank">Handbook of Biological Statistics</a>). In this specific case, sex is a categorical factor with two levels (i.e. male and female), while snout-vent length is the regressor (x-var) and pelvic canal width is the response variable (y-var). The ANCOVA will be used to assess whether the regression between body size and pelvic width is comparable between the sexes. <br />
<h2>
Interpretation</h2>
An ANCOVA is able to test for differences in slopes and intercepts among regression lines. Both concepts have different biological interpretations. Differences in intercepts are interpreted as differences in magnitude but not in the rate of change. If we are measuring sizes and regression lines have the same slope but cross the y-axis at different values, lines should be parallel. This means that growth is similar for both lines but one group is simply larger than the other. A difference in slopes is interpreted as differences in the rate of change. In allometric studies, this means that there is a significant change in growth rates among groups. <br />
Slopes should be tested first, by testing for the interaction between the covariate and the factor. If slopes are significantly different between groups, then testing for different intercepts is somewhat inconsequential since it is very likely that the intercepts differ too (unless both lines go through zero). Additionally, if the interaction is significant, testing for main effects is meaningless (see <a href="http://r-eco-evo.blogspot.com/2007/10/infamous-type-iii-ss-before-i-started_20.html" target="_blank">The Infamous Type III SS</a>). If the interaction between the covariate and the factor is not significantly different from zero, then we can assume the slopes are similar between equations. In this case, we may proceed to test for differences in intercept values among regression lines. <br />
<h2>
Performing an ANCOVA</h2>
For an ANCOVA our data should have a format very similar to that needed for an Analysis of Variance. We need a categorical factor with two or more levels (i.e. sex factor has two levels: male and female) and at least one independent variable and one dependent or response variable (y-var). <br />
<blockquote>
<pre>> head(gator)
   sex snout pelvic
1 male  1.10   7.62
2 male  1.19   8.20
3 male  1.13   8.00
4 male  1.15   9.60
5 male  0.96   6.50
6 male  1.19   8.17</pre>
</blockquote>
The preceding code shows the first six lines of the <span style="font-family: 'Courier New';">gator</span> object which includes three variables: <span style="font-family: 'Courier New';">sex</span>, <span style="font-family: 'Courier New';">snout</span> and <span style="font-family: 'Courier New';">pelvic</span>, which hold the sex, snout-vent size and the pelvic canal width of alligators, respectively. The sex variable is a factor with two levels, while the other two variables are numeric in their type. <br />
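The `gator` data set comes from the Handbook of Biological Statistics linked above. If you just want to follow the mechanics without downloading it, a hypothetical stand-in with the same structure can be simulated (the values are invented, so your ANCOVA output will differ from the results shown below):

```r
# Invented stand-in for the gator data frame (same columns, fabricated values)
set.seed(1)
n_male <- 19; n_female <- 16
sex    <- factor(rep(c("male", "female"), c(n_male, n_female)))
snout  <- runif(n_male + n_female, 0.9, 1.5)
pelvic <- ifelse(sex == "male", 0.45, -0.22) + 6.6 * snout +
          rnorm(n_male + n_female, sd = 0.5)
gator  <- data.frame(sex, snout, pelvic)
head(gator)
```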
We can do an ANCOVA both with the <span style="font-family: 'Courier New';">lm()</span> and <span style="font-family: 'Courier New';">aov()</span> commands. For this tutorial, we will use the <span style="font-family: 'Courier New';">aov()</span> command due to its simplicity.
<br />
<blockquote>
<pre>> mod1 <- aov(pelvic~snout*sex, data=gator)
> summary(mod1)
            Df Sum Sq Mean Sq  F value    Pr(>F)    
snout        1 51.871  51.871 134.5392 8.278e-13 ***
sex          1  2.016   2.016   5.2284   0.02921 *  
snout:sex    1  0.005   0.005   0.0129   0.91013    
Residuals   31 11.952   0.386                       
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 </pre>
</blockquote>
The previous code shows the ANCOVA model, pelvic is modeled as the dependent variable with sex as the factor and snout as the covariate. The summary of the results show a significant effect of snout and sex, but no significant interaction. These results suggest that the slope of the regression between snout-vent length and pelvic width is similar for both males and females. <br />
A second, more parsimonious model should be fit without the interaction to test for a significant difference in intercepts.
<br />
<blockquote>
<pre>> mod2 <- aov(pelvic~snout+sex, data=gator)
> summary(mod2)
            Df Sum Sq Mean Sq  F value    Pr(>F)    
snout        1 51.871  51.871 138.8212 3.547e-13 ***
sex          1  2.016   2.016   5.3948   0.02671 *  
Residuals   32 11.957   0.374                       
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 </pre>
</blockquote>
The second model shows that sex has a significant effect on the dependent variable which in this case can be interpreted as a significant difference in ‘intercepts’ between the regression lines of males and females. We can compare <span style="font-family: 'Courier New';">mod1</span> and <span style="font-family: 'Courier New';">mod2</span> with the <span style="font-family: 'Courier New';">anova()</span> command to assess if removing the interaction significantly affects the fit of the model:<br />
<blockquote>
<pre>> anova(mod1,mod2)
Analysis of Variance Table

Model 1: pelvic ~ snout * sex
Model 2: pelvic ~ snout + sex
  Res.Df    RSS Df  Sum of Sq      F Pr(>F)
1     31 11.952                            
2     32 11.957 -1 -0.0049928 0.0129 0.9101</pre>
</blockquote>
The <span style="font-family: 'Courier New';">anova()</span> command clearly shows that removing the interaction does not significantly affect the fit of the model (F=0.0129, p=0.91). Therefore, we may conclude that the most parsimonious model is <span style="font-family: 'Courier New';">mod2</span>. Biologically we observe that for alligators, body size has a significant and positive effect on pelvic width and the effect is similar for males and females. However, we still don’t know the actual values of the slopes and intercepts for each sex. <br />
At this point we are going to fit linear regressions separately for males and females. In most cases, this should have been performed before the ANCOVA. However, in this example we first tested for differences in the regression lines and once we were certain of the significant effects we proceeded to fit regression lines.
<br />
To accomplish this, we are now going to sub-set the data matrix into two sets, one for males and another for females. We can do this with the <span style="font-family: 'Courier New';">subset()</span> command or using the extract functions <span style="font-family: 'Courier New';">[]</span>. We will use both in the following code for didactic purposes:<br />
<blockquote>
<pre>> machos <- subset(gator, sex=="male")
> hembras <- gator[gator$sex=='female',]</pre>
</blockquote>
Separate regression lines can also be fitted using the subset option within the <span style="font-family: 'Courier New';">lm()</span> command, however we will use separate data frames to simplify the creation of graphs:<br />
<blockquote>
<pre>> reg1 <- lm(pelvic~snout, data=machos); summary(reg1)

Call:
lm(formula = pelvic ~ snout, data = machos)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.85665 -0.40653 -0.08933  0.04518  1.57408 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   0.4527     0.9697   0.467    0.647    
snout         6.5854     0.8625   7.636 6.85e-07 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.7085 on 17 degrees of freedom
Multiple R-squared: 0.7742,     Adjusted R-squared: 0.761 
F-statistic: 58.3 on 1 and 17 DF,  p-value: 6.846e-07 

> reg2 <- lm(pelvic~snout, data=hembras); summary(reg2)

Call:
lm(formula = pelvic ~ snout, data = hembras)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.69961 -0.19364 -0.07634  0.04907  1.15098 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -0.2199     0.9689  -0.227    0.824    
snout         6.7471     0.9574   7.047  5.8e-06 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.4941 on 14 degrees of freedom
Multiple R-squared: 0.7801,     Adjusted R-squared: 0.7644 
F-statistic: 49.67 on 1 and 14 DF,  p-value: 5.797e-06 </pre>
</blockquote>
The regression lines indicate that males have a higher intercept (a=0.4527) than females (a=-0.2199), which means that, for a given snout-vent length, males have a wider pelvic canal. We can now plot both regression lines as follows:<br />
<blockquote>
<pre>> plot(pelvic~snout, data=gator, type='n')
> points(machos$snout,machos$pelvic, pch=20)
> points(hembras$snout,hembras$pelvic, pch=1)
> abline(reg1, lty=1)
> abline(reg2, lty=2)
> legend("bottomright", c("Male","Female"), lty=c(1,2), pch=c(20,1) )</pre>
</blockquote>
The resulting plot shows the regression lines for males and females on the same plot.<br />
<div align="center">
<a href="http://lh6.ggpht.com/-X-_0t9cpFG8/TluC1vqwedI/AAAAAAAAA6E/MgI-WmdQeqc/s1600-h/fig-ancova-1%25255B7%25255D.jpg"><img alt="fig-ancova-1" border="0" height="343" src="http://lh5.ggpht.com/-VXGF7mpLKbo/TluC2LXB2yI/AAAAAAAAA6I/D0NddPkhG34/fig-ancova-1_thumb%25255B7%25255D.jpg?imgmax=800" style="background-image: none; border-bottom-width: 0px; border-left-width: 0px; border-right-width: 0px; border-top-width: 0px; display: block; float: none; margin-left: auto; margin-right: auto; padding-left: 0px; padding-right: 0px; padding-top: 0px;" title="fig-ancova-1" width="404" /></a></div>
<h2>
Advanced</h2>
We can fit both regression models with a single call to the <span style="font-family: 'Courier New';">lm()</span> command using the nested structure of snout nested within sex (i.e. <span style="font-family: 'Courier New';">sex/snout</span>) and removing the single intercept for the model so that separate intercepts are fit for each equation.<br />
<blockquote>
<pre>> reg.todo <- lm(pelvic~sex/snout - 1, data=gator)
> summary(reg.todo)
Call:
lm(formula = pelvic ~ sex/snout - 1, data = gator)
Residuals:
     Min       1Q   Median       3Q      Max 
-0.85665 -0.33099 -0.08933  0.05774  1.57408 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
sexfemale        -0.2199     1.2175  -0.181    0.858    
sexmale           0.4527     0.8498   0.533    0.598    
sexfemale:snout   6.7471     1.2031   5.608 3.76e-06 ***
sexmale:snout     6.5854     0.7558   8.713 7.73e-10 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 0.6209 on 31 degrees of freedom
Multiple R-squared: 0.9936,     Adjusted R-squared: 0.9928 
F-statistic:  1213 on 4 and 31 DF,  p-value: < 2.2e-16
</pre>
</blockquote>
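An additional check (a sketch, not part of the original analysis): whether the male and female slopes really differ can be tested directly with an interaction model, where the <span style="font-family: 'Courier New';">snout:sex</span> term estimates the difference between the two slopes.<br />
<blockquote>
<pre>> # Sketch, assuming the 'gator' data frame used above
> reg.int <- lm(pelvic ~ snout * sex, data = gator)
> summary(reg.int)  # a non-significant snout:sex row supports a common slope</pre>
</blockquote>
<br />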
<h2>
Conclusion</h2>
These results are consistent with our previous findings: pelvic width is positively and linearly related to snout-vent length in both sexes. The regression slopes are positive and similar for males and females (b = 6.59 and b = 6.75, respectively; both well above 1, meaning that pelvic width increases faster than snout-vent length). Finally, the regression line for males intersects the y-axis at a higher value than that for females, which means that males have a wider pelvis for any given snout-vent length.Campitornoreply@blogger.com41tag:blogger.com,1999:blog-2159205108979523204.post-30959700134188374602008-11-22T16:02:00.001-08:002013-04-13T08:46:01.035-07:00Spatial Statistics. Part I.<h1>
Spatial Statistics. Part I.</h1>
<h2>
Ripley's K(r) function</h2>
Spatial statistics are becoming an important tool for population ecologists. These techniques allow researchers to answer questions regarding the spatial arrangements of individuals within a population and the causal factors that may be influencing spatial distributions. I will detail some basic spatial statistics using the <span style="font-family: Courier New;">spatstat </span>library.
<br />
<blockquote style="font-family: Courier New;">
<span style="font-family: courier new;">library(spatstat)</span>
</blockquote>
The first approach to a planar point pattern (ppp) is to describe the size of the plot, the number of individuals (i.e. points) and the density of such occurrences or instances. In spatial statistics density is usually referred to as <i>intensity</i>. We will be using the bei data-set included in the spatstat library. From the help page: «The dataset <span style="font-family: Courier New;">bei</span> gives the positions of 3604 trees of the species <i>Beilschmiedia pendula</i> (Lauraceae) in a 1000 by 500 metre rectangular sampling region in the tropical rainforest of Barro Colorado Island.»
The following code invokes the bei data-set and requests some summary statistics:
<br />
<blockquote>
<span style="font-family: courier new;">> data(bei)</span>
<span style="font-family: courier new;">> summary(bei)</span>
<br />
<div style="font-family: courier new; margin-left: 40px;">
Planar point pattern: 3604 points
Average intensity 0.00721 points per square metre
Window: rectangle = [0, 1000] x [0, 500] metres
Window area = 5e+05 square metres
Unit of length: 1 metre
</div>
<span style="font-family: courier new;">> plot(bei)</span>
</blockquote>
The previous results show a total of 3604 points in a 50 ha plot. The average intensity (i.e. density) is 0.00721 points per square meter or 72.1 trees per hectare. The position of the trees is shown in the following graph produced by the <span style="font-family: Courier New;">plot(bei)</span> command:
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="http://2.bp.blogspot.com/-1N7DgiHpmKw/UWl9WoC89hI/AAAAAAAAA98/MzLHGYqbqaE/s1600/bei.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://2.bp.blogspot.com/-1N7DgiHpmKw/UWl9WoC89hI/AAAAAAAAA98/MzLHGYqbqaE/s1600/bei.png" height="178" width="320" /></a></div>
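<br />
A quick sanity check (my aside, not part of the original post): the reported intensity is simply the number of points divided by the window area, and can be recomputed by hand.
<br />
<blockquote>
<span style="font-family: courier new;">> npoints(bei) / area.owin(as.owin(bei))  # 3604 / 500000 = 0.007208</span>
</blockquote>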
<br />
<div>
<div id="ti-5" style="padding: 1em 0pt; text-align: center;">
<br />
<div style="text-align: left;">
The plot clearly shows that this species --<i>B. pendula</i>-- is common in certain areas of the plot while absent from others. This suggests an aggregated spatial distribution, which is commonplace for tropical trees. To assess the spatial distribution of these points, we will use Ripley's K(r) function. This method, also known as Ripley's reduced second moment function, estimates the expected number of additional points within a distance <i>r</i> of a randomly chosen point, scaled by the overall intensity (Ripley 1976).
Ripley's K(r) function is generally transformed as follows:
<br />
<div style="text-align: center;">
\[ L(r)=\sqrt{\frac{K(r)}{\pi}}-r \]
</div>
</div>
</div>
</div>
The <i>L(r)</i> function transforms the theoretical expectation under a random distribution into a horizontal line passing through the origin (0,0), making it easier to interpret than the rapidly increasing K(r) function.
Ripley's K(r) function is produced by the <span style="font-family: Courier New;">Kest()</span> command in R. Given the large number of points in the bei data-set, calculations may take a while on slower computers. The L(r) transformation is performed on-the-fly by the plot() command.
<br />
<blockquote>
<span style="font-family: courier new;">> a1 <- Kest(bei, correction="isotropic", nlarge=Inf)</span>
<span style="font-family: courier new;">> plot(a1, sqrt(./pi)-r~r, ylab="L(r)")</span>
</blockquote>
The previous code requests Ripley's K(r) function for the bei data-set. We specifically request the isotropic correction for edge effects. Spatstat's <span style="font-family: Courier New;">Kest()</span> function has a restriction for data-sets larger than 3000 points; in order to circumvent this restriction we must include the <span style="font-family: Courier New;">nlarge=Inf</span> option in the command. Results are stored in the <span style="font-family: Courier New;">a1 </span>object.
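A related shortcut (my aside, assuming your spatstat version includes it): the <span style="font-family: Courier New;">Lest()</span> function computes the L(r) transformation directly, so the transformation inside plot() can be avoided.
<br />
<blockquote>
<span style="font-family: courier new;">> a2 <- Lest(bei, correction="isotropic", nlarge=Inf)</span><br />
<span style="font-family: courier new;">> plot(a2, . - r ~ r, ylab="L(r) - r")</span>
</blockquote>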
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="http://2.bp.blogspot.com/-L33-xG4me3g/UWl9kNwBRmI/AAAAAAAAA-M/RAf81YZb-mc/s1600/Kest.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://2.bp.blogspot.com/-L33-xG4me3g/UWl9kNwBRmI/AAAAAAAAA-M/RAf81YZb-mc/s1600/Kest.png" height="240" width="320" /></a></div>
<br />Ripley's L(r) function is shown in the previous graph. The x-axis shows the automatically selected radii (r) for which abundances are calculated, while the y-axis shows the L(r) function. The dotted red line is the expected value for a random distribution and the black solid line is the observed count. If observed values lie above the zero line (i.e. the random expectation) one should suspect an aggregated distribution. Nevertheless, we need to assess whether this deviation is large enough to reject the null hypothesis of a random spatial distribution or CSR (complete spatial randomness). To answer this question we need to create confidence intervals for the null hypothesis using the envelope() command. These calculations require a lot of computer resources given the large number of points in the <span style="font-family: Courier New;">bei </span>data-set, therefore I reduced the number of simulations to 50 from the default <span style="font-family: Courier New;">nsim=99</span>:<br />
<blockquote style="font-family: courier new;">
> sobre <- envelope(bei, Kest, nlarge=Inf, nsim=50)
<br />
<div style="margin-left: 40px;">
Generating 50 simulations of CSR ...
1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,
16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30,
31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45,
46, 47, 48, 49, 50.
</div>
Done.
> sobre
<br />
<div style="margin-left: 40px;">
Pointwise critical envelopes for K(r)
Obtained from 50 simulations of CSR
Significance level of pointwise Monte Carlo test: 2/51 = 0.0392156862745098
Data: bei
Function value object (class 'fv')
for the function r -> K(r)
Entries:
id label description
-- ----- -----------
r r distance argument r
obs obs(r) function value for data pattern
theo theo(r) theoretical value for CSR
lo lo(r) lower pointwise envelope of simulations
hi hi(r) upper pointwise envelope of simulations
--------------------------------------
Default plot formula:
. ~ r
Recommended range of argument r: [0, 125]
Unit of length: 1 metre
</div>
> plot(sobre, sqrt(./pi)-r~r, lty=c(1,2,3,3), col=c(1,2,3,3), ylab="L(t)")
</blockquote>
The resulting graph is as follows:
<br />
<div>
<div>
<div id="dwak" style="padding: 1em 0pt; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
<a href="http://3.bp.blogspot.com/-6tSqTxQsqJ8/UWl9kCSo-oI/AAAAAAAAA-Q/IiKg1YZUfnE/s1600/sobre.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://3.bp.blogspot.com/-6tSqTxQsqJ8/UWl9kCSo-oI/AAAAAAAAA-Q/IiKg1YZUfnE/s1600/sobre.png" height="240" width="320" /></a></div>
<div id="dwak" style="padding: 1em 0pt; text-align: center;">
<br /></div>
Given that the observed count (black solid line) lies above the confidence interval (green dotted lines), we can conclude that the spatial distribution of <i>B. pendula</i> trees deviates significantly from random expectations.
<br />
<h3>
Conclusion</h3>
As expected for a tropical tree (Condit et al. 2000), Ripley's K(r) shows that <i>B. pendula</i> is spatially aggregated.
References:
<br />
<div style="margin-left: 0.0866in;">
<div style="margin: 0pt 0pt 0pt 40px;">
Condit R, Ashton PS, Baker P, Bunyavejchewin S, Gunatilleke S, Gunatilleke N, Hubbell SP, Foster RB, Itoh A, LaFrankie JV, Lee HS, Losos E, Manokaran N, Sukumar R, Yamakura T (2000) Spatial patterns in the distribution of tropical tree species. Science 288:1414-1418.</div>
</div>
<div style="margin-left: 0.5in;">
</div>
<div style="margin-left: 0.5in;">
<div style="margin: 0pt;">
Ripley BD (1976) The second-order analysis of stationary point processes. Journal of Applied Probability 13:255-266.
</div>
<div style="margin: 0pt;">
Ripley BD (1977) Modelling spatial patterns. Journal of the Royal Statistical Society, Series B 39:172-212.
</div>
</div>
</div>
</div>
Campitornoreply@blogger.com0tag:blogger.com,1999:blog-2159205108979523204.post-15531075536210855132007-10-20T16:03:00.000-07:002008-08-17T20:58:43.881-07:00The Infamous type III SS<h1><span style="font-family:verdana;"><span style="font-size:100%;"> </span></span></h1><p></p><p>Before I started using R, I was under the impression that there was only one type of sums of squares (SS) that should be used: type III SS. It actually was a bit confusing when some people chose to use other types of SS. I also didn't know that SAS invented the term "type III SS". Once I started using R and came across my first unbalanced factorial design...reality hit me. </p><p>First of all, I couldn't get R to deploy my coveted type III SS. I could clearly see that any summary coming from <span style="font-family:Courier New;">lm()</span> or <span style="font-family:Courier New;">aov()</span> was clearly based on type I SS, or sequential sums of squares. After some reading, I could get type II SS from different models, but the process seemed a bit too complicated. It had to be simpler!! </p><p>I played with the idea of posting on R-help, but I was (again) scared away by the terrible comments from some of the gurus. After searching for endless hours on Google, I came across a bunch of interesting information. I read comments and posts regarding how bad type III SS are, mainly because they allow testing for main effects even in the presence of an interaction. Honestly, I don't think it is all that terrible, but I understand it shouldn't be done. I actually teach it that way to my students: "If an interaction is significant, the main effects are rarely interpretable". </p><p>In the other camp, I read proponents suggesting that type III SS are the only ones that allow hypothesis testing, given that they are order independent. Also very true. </p><p>I made up my mind when someone compared SAS to Micro$oft. From then on, I chose not to use type III SS if possible. 
Therefore, this entry aims to teach how to deal with unbalanced factorials and how to get the appropriate SS and F values from R. And if you're so inclined, I will show you how to get the blasted type III SS's. Most of this information is scattered all over the Internet and I'm not to be credited for any of it. I just collected it for my personal use, and since this Blog serves as my scrapbook for R methods I easily forget, it ended up here. </p><p>Let's begin with an example. Let's suppose we have a two-way unbalanced ANOVA to study the effect of forest type and species on herbivory. The biological hypothesis suggests that species vary in their susceptibility to herbivory, and that the percentage of leaf area damaged by herbivory is independent of forest type. We sampled three different forest types: riparian, transition and mature forests. In each forest type, we collected samples from four different species and determined the percent of area removed from a randomly selected leaf. Given that the number of samples per species is not constant in each site, the design is unbalanced. The data is presented in the following table:</p><p></p><p><table dir="ltr" cellpadding="1" cellspacing="1" width="241" height="195"><tbody><tr><td style="border-top: 1px solid black; border-left: 1px solid black;" bg="" colspan="4" width="241" height="33"><span style=""><b>Herbivory </b></span>
</td></tr><tr><td style="border: 1px solid black;" bgcolor="#a8bdc4" width="46" height="33"><p></p>
</td><td style="border: 1px solid black;" bgcolor="#a8bdc4" width="69" height="33"><div align="center"><b>Riparian </b></div></td><td style="border: 1px solid black;" bgcolor="#a8bdc4" width="67" height="33"><div align="center"><b>Transition </b></div></td><td style="border: 1px solid black;" bgcolor="#a8bdc4" width="61" height="33"><div align="center"><b>Mature</b> </div></td></tr><tr><td style="border: 1px solid black;" bgcolor="#a8bdc4" width="46" height="33"><b>Sp1 </b>
</td><td style="border: 1px solid black;" width="69" height="33"><div style="text-align: center;">42 44 36 </div><div style="text-align: center;">13 19 22 </div></td><td style="border: 1px solid black;" width="67" height="33"><div style="text-align: center;">33 26 33 </div><div style="text-align: center;">21 </div></td><td style="border: 1px solid black;" width="61" height="33"><div style="text-align: center;"></div><div style="text-align: center;">31 3 25 </div><div style="text-align: center;">25 24 </div><div style="text-align: center;"></div></td></tr><tr><td style="border: 1px solid black;" bgcolor="#a8bdc4" width="46" height="33">
<b>Sp2</b>
</td><td style="border: 1px solid black;" width="69" height="33"><div style="text-align: center;">28 23 34 </div><div style="text-align: center;">42 13 </div></td><td style="border: 1px solid black;" width="67" height="33"><div style="text-align: center;">34 33 31 </div><div style="text-align: center;">36 </div></td><td style="border: 1px solid black;" width="61" height="33"><div style="text-align: center;">3 26 28 </div><div style="text-align: center;">32 4 16 </div></td></tr><tr><td style="border: 1px solid black;" bgcolor="#a8bdc4" width="46" height="33">
<b>Sp3</b>
</td><td style="border: 1px solid black;" width="69" height="33"><div style="text-align: center;">1 29 19 </div></td><td style="border: 1px solid black;" width="67" height="33"><div style="text-align: center;">11 9 7 </div><div style="text-align: center;">1 6 </div></td><td style="border: 1px solid black;" width="61" height="33"><div style="text-align: center;">21 1 9 </div><div style="text-align: center;">3 </div></td></tr><tr><td style="border: 1px solid black;" bgcolor="#a8bdc4" width="46" height="33">
<b>Sp4</b>
</td><td style="border: 1px solid black;" width="69" height="33"><div style="text-align: center;">24 9 22 </div><div style="text-align: center;">2 15 </div></td><td style="border: 1px solid black;" width="67" height="33"><div style="text-align: center;">27 12 12 </div><div style="text-align: center;">5 16 15 </div></td><td style="border: 1px solid black;" width="61" height="33"><div style="text-align: center;">22 7 25 </div><div style="text-align: center;">5 12 </div></td></tr></tbody></table></p><p>
We read the data into R as a data frame named <span style="font-family:Courier New;">herb</span>. Now we can check the number of replicates using the <span style="font-family:Courier New;">replications()</span> command:
<span style=";font-family:Courier New;font-size:100%;" ><pre>> replications(herbiv~bosque*especie, data=herb)
$bosque
bosque
    maduro    ripario transicion 
        20         19         19 

$especie
especie
sp1 sp2 sp3 sp4 
 15  15  12  16 

$"bosque:especie"
            especie
bosque       sp1 sp2 sp3 sp4
  maduro       5   6   4   5
  ripario      6   5   3   5
  transicion   4   4   5   6
</pre></span>
As we can clearly see, the model is highly unbalanced. We shall perform an analysis of variance in various ways. First we will fit a saturated model, and then vary the order of the factors in the model:
<span style=";font-family:Courier New;font-size:100%;" ><pre>> mod.sat <- aov(herbiv~bosque*especie, data=herb)
> mod.sat.2 <- aov(herbiv~especie*bosque, data=herb)
</pre></span></p><p></p><p>The summary tables for both models show that SS are calculated sequentially.
</p>
<p style="font-family: Courier New;">> summary(mod.sat)
</p><p style="font-family: Courier New;">
</p><p style="font-family: Courier New;" size="3"></p><pre>               Df Sum Sq Mean Sq F value    Pr(>F)    
bosque          2  464.0   232.0  2.4766   0.09516 .  
especie         3 2810.8   936.9 10.0017 3.434e-05 ***
bosque:especie  6  530.4    88.4  0.9436   0.47348    
Residuals      46 4309.1    93.7
</pre><p></p><p style="font-family: Courier New;">
</p><p style="font-family: Courier New;" size="3"></p><pre>> summary(mod.sat.2)
               Df Sum Sq Mean Sq F value    Pr(>F)    
especie         3 2834.8   944.9 10.0871 3.186e-05 ***
bosque          2  440.0   220.0  2.3486    0.1069    
especie:bosque  6  530.4    88.4  0.9436    0.4735    
Residuals      46 4309.1    93.7
</pre><p></p> <p>Although very close, the SS for the species factor (i.e. especie) differ depending on the order of the factors. If species comes first, SS = 2834.8; but if species is the second term in the model, SS = 2810.8. These are all sequential SS, known in SAS as Type I SS. In the SAS world, we would now use proc GLM to get type III sums of squares, and test hypotheses regarding the interaction and the main effects. Nonetheless, this approach may lead us down a dangerous path, since it allows us to test for main effects even in the presence of a significant interaction. </p><p></p><p><b>The R way</b>
The advisable method in R is to search for the most parsimonious model, obtaining SS by comparing models that differ in the number of parameters (i.e. factors) included. This procedure is automagically done by the <span style="font-family:Courier New;">drop1()</span> command. This command, as its name implies, drops terms from the model and compares the original model to the reduced one. It generally begins by removing non-significant interactions. The results given by the <span style="font-family:Courier New;">drop1()</span> command are order-independent, therefore the following code examples will only be performed on mod.sat: </p><p style="font-family: Courier New;" size="3"></p><pre>> drop1(mod.sat, test="F")
Single term deletions
Model:
herbiv ~ bosque * especie
               Df Sum of Sq    RSS   AIC F value  Pr(F)
<none>                      4309.1 273.9               
bosque:especie  6     530.4 4839.5 268.6  0.9436 0.4735
>
<p>
</p></pre><p></p><p>The drop1() command used in the previous code requires two parameters or options: the name of the fitted <span style="font-family:Courier New;">aov()</span> model, and the type of test to perform. We asked R to drop terms from "mod.sat", our saturated model. We also requested F statistics, which compare the original model with the new model without the term. AIC, the Akaike Information Criterion, measures the goodness of fit of a statistical model while taking into account the number of parameters included. The AIC statistic allows us to find the model that best fits the data with the minimum number of parameters. </p><p>One should choose the model with the lowest AIC value. In the previous output, removing the interaction term bosque:especie produces a better model with a lower AIC than the saturated formula (268.6 vs 273.9). Therefore, we can conclude that removing the interaction term benefits our model. The F statistic provided is based on SS calculated by comparing models with and without the interaction term, and hence can be used to report the non-significance of the interaction. </p><p>We now fit a linear model without the interaction term, and request the appropriate sums of squares with <span style="font-family:Courier New;">drop1()</span>. </p><p><span style=";font-family:Courier New;font-size:100%;" >> mod1 <- aov(herbiv~bosque+especie, data=herb)</span> </p><p><span style=";font-family:Courier New;font-size:100%;" >> drop1(mod1, test="F")</span> </p><p><span style=";font-family:Courier New;font-size:100%;" >Single term deletions</span> </p><p><span style="font-family:Courier New;"><pre>Model:
herbiv ~ bosque + especie
         Df Sum of Sq    RSS   AIC F value     Pr(F)    
<none>                4839.5 268.6                      
bosque    2     440.0 5279.5 269.6  2.3639    0.1041    
especie   3    2810.8 7650.2 289.2 10.0672 2.461e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
</pre></span>
The output from the drop1() command shows that removing either of the main components (i.e. 'bosque' or 'especie') would lower the goodness of fit of the model. This can be seen by inspecting the AIC statistics. We see that by removing <span style="font-family:Courier New;"><none></span> of the factors, we get an AIC of 268.6. However, if we remove 'bosque' or 'especie', we get a higher AIC value, thus suggesting that we already have the most parsimonious model. </p><p></p><p>F-statistics are computed by comparing the original model with one where the factor is removed. We can see that 'bosque' is not significant, while 'especie' is highly significant (F=10.07; df=3,52; p<0.001). As I perceive it, these are type II (marginal) SS, but I could be wrong. We can clearly see that they are order independent by fitting a model where 'especie' comes first: </p><p></p><p><span style=";font-family:Courier New;font-size:100%;" >> drop1(aov(herbiv~especie+bosque, herb), test="F")
Single term deletions</span> </p><p><span style=";font-family:Courier New;font-size:100%;" ><pre>Model:
herbiv ~ especie + bosque
         Df Sum of Sq    RSS   AIC F value     Pr(F)    
<none>                4839.5 268.6                      
especie   3    2810.8 7650.2 289.2 10.0672 2.461e-05 ***
bosque    2     440.0 5279.5 269.6  2.3639    0.1041    
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
</pre></span></p><p></p><p></p><p>The results do not deviate from the previous analysis, thus confirming that this type of analysis is order independent. My only grievance with this approach is that the ANOVA table has to be assembled by hand. I still haven't figured out if this can be done automatically. </p><p>
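A possible workaround (my sketch, not from the original post): each F-test reported by drop1() can be reproduced with an explicit comparison of nested models via anova(), which makes it easier to assemble an ANOVA table by hand. </p><p></p><p><span style="font-family:Courier New;"><pre># Sketch, assuming the 'herb' data frame and 'mod1' fitted above
> anova(aov(herbiv ~ especie, data = herb), mod1)  # F-test for 'bosque' adjusted for 'especie'
> anova(aov(herbiv ~ bosque, data = herb), mod1)   # F-test for 'especie' adjusted for 'bosque'</pre></span></p><p>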
<b>Type III SS from R</b> </p><p></p><p>If you want to get type III sums of squares anyway (maybe due to extraordinary circumstances: your boss demands them!) you can easily get them from R. As Dr. Venables says: you just have to know where to look for them. </p><p></p><p>I will not go into detail on how this works, mainly because I don't know how it works. To get type III SS, you have to do the following: </p><p></p><p>1. Change the default contrast matrix used by R to one which produces orthogonal contrasts, such as <span style="font-family:Courier New;">contr.helmert</span> or <span style="font-family:Courier New;">contr.sum</span>: </p><p><span style="font-family:Courier New;">> options(contrasts=c("contr.helmert", "contr.poly"))
</span></p><p>
2. Fit a new anova model (saturated), which uses the new contrast matrix we selected in step 1. </p><p><span style="font-family:Courier New;">> mod.III <- aov(herbiv~bosque*especie, data=herb)
</span></p><p>3. Then use the <span style="font-family:Courier New;">drop1()</span> command, choosing to remove all objects using the formula shorthand notation <span style="font-family:Courier New;">.~.</span> :</p><p><span style=";font-family:Courier New;font-size:100%;" >> drop1(mod.III, .~., test="F")
Single term deletions</span> </p><p><span style=";font-family:Courier New;font-size:100%;" ><pre>Model:
herbiv ~ bosque * especie <p> </p> <p><span style="font-family:Courier New;">               Df Sum of Sq    RSS   AIC F value     Pr(F)    
<none>                       4309.1 273.9                      
bosque          2     436.3 4745.4 275.5  2.3289    0.1088    
especie         3    2753.8 7062.8 296.5  9.7989 4.108e-05 ***
bosque:especie  6     530.4 4839.5 268.6  0.9436    0.4735
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
</span></p></pre></span></p><p></p><p>And voila!...You get type III sums of squares, which should make any SAS user happy. The <i>car</i> package has built-in functions to get type III SS, but I've read they sometimes produce odd output. I've never tested this, since I don't like installing many libraries to do things you can get from the base stats module. </p><p></p><p><b>Conclusion</b> </p><p></p><p>In my opinion the type III sums-of-squares issue is a bit technical for the average user. Although some suggest that the average user shouldn't be doing statistics, it is commonplace for many students and some researchers to have to deal with ANOVA on their own, with very little statistics background. I think R does the correct thing by not supplying the infamous SS's easily, and forcing the user to think about what he/she really wants. Nonetheless, this discussion should be part of R's help files, and the correct procedure should also be easier to find. Finally, the <span style="font-family:Courier New;">drop1()</span> method should provide a properly formatted ANOVA table, but again, I may just not know how to do it...and the correct procedure is hidden somewhere within R's mysterious innards. </p>Campitornoreply@blogger.com9tag:blogger.com,1999:blog-2159205108979523204.post-1914202808861419032007-10-06T14:47:00.000-07:002007-10-06T15:10:16.795-07:00Contrasts in R<p class="MsoNormal" style="MARGIN: 0in 0in 0pt"><span style="font-family:Times New Roman;font-size:100%;">One of the most neglected topics in Biostatistics is the calculation of contrasts in ANOVA. Most researchers are content with simply calculating an ANOVA table and stating that differences among groups are statistically significant. Some even present box-plots or other graphics that show where the differences lie. However, few researchers actually test the specific differences they are looking for.</span> </p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt; TEXT-INDENT: 6pt"><span style="font-family:Times New Roman;"></span></p><p class="MsoNormal" style="MARGIN: 0in 0in 0pt; TEXT-INDENT: 6pt"><span style="font-family:Times New Roman;font-size:100%;">An ANOVA is designed to test a statistical hypothesis. It looks at the means of each treatment or level, and determines if all the means are equal or not. Nevertheless, the biological hypotheses introduced by the researchers need to be specifically addressed with contrasts. One can generalize and state that while ANOVA analyzes the statistical hypothesis, contrasts look into the biological hypothesis.</span> </p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt; TEXT-INDENT: 6pt"></p><p class="MsoNormal" style="MARGIN: 0in 0in 0pt; TEXT-INDENT: 6pt"><span style="font-family:Times New Roman;font-size:100%;"></span></p><p class="MsoNormal" style="MARGIN: 0in 0in 0pt; TEXT-INDENT: 6pt"><span style="font-family:Times New Roman;font-size:100%;"></span></p><p class="MsoNormal" style="MARGIN: 0in 0in 0pt; TEXT-INDENT: 6pt"></p><p class="MsoNormal" style="MARGIN: 0in 0in 0pt; TEXT-INDENT: 6pt"><span style="font-family:Times New Roman;font-size:100%;">The following entry will show how to create a contrast in R. I will not go into detail regarding the theory behind contrasts. I suggest you read a good book on Experimental Design. </span></p><p class="MsoNormal" style="MARGIN: 0in 0in 0pt; TEXT-INDENT: 6pt"></p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt"><span style="font-family:Times New Roman;font-size:100%;"></span></p><p class="MsoNormal" style="MARGIN: 0in 0in 0pt"><span style="font-family:Times New Roman;font-size:100%;"></span></p><p class="MsoNormal" style="MARGIN: 0in 0in 0pt"><span style="font-family:Times New Roman;font-size:100%;">Let’s begin by using a data set included in R. We will use the InsectSprays data frame. This experiment shows the number of insects killed (variable count) by using six different insecticides (A:F). The data is loaded by using:</span> </p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt"></p><p class="MsoNormal" style="MARGIN: 0in 0in 0pt"><span style="font-family:Courier New;font-size:100%;"></span></p><p class="MsoNormal" style="MARGIN: 0in 0in 0pt"><span style="font-family:Courier New;font-size:100%;"></span></p><p class="MsoNormal" style="MARGIN: 0in 0in 0pt"><span style="font-family:Courier New;font-size:100%;">> data(InsectSprays)</span> </p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt"></p><p class="MsoNormal" style="MARGIN: 0in 0in 0pt"><span style="font-family:Times New Roman;font-size:100%;"></span></p><p class="MsoNormal" style="MARGIN: 0in 0in 0pt"><span style="font-family:Times New Roman;font-size:100%;"></span></p><p class="MsoNormal" style="MARGIN: 0in 0in 0pt"><span style="font-family:Times New Roman;font-size:100%;">Once the data has been invoked, we can start working on it. The first thing we want to do is to determine if the model is balanced. We can use the summary() command to assess some basic information on the data.frame:</span> </p><p class="MsoNormal" style="MARGIN: 0in 0in 0pt"></p><p class="MsoNormal" style="MARGIN: 0in 0in 0pt"><span style="font-family:'Courier New';"><span style="font-size:100%;"></span></span></p><p class="MsoNormal" style="MARGIN: 0in 0in 0pt"><span style="font-family:'Courier New';"><span style="font-size:100%;"></span></span></p><p class="MsoNormal" style="MARGIN: 0in 0in 0pt"><span style="font-family:'Courier New';"><span style="font-size:100%;"></span></span></p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt"><span style="font-family:'Courier New';"><span style="font-size:100%;">> summary(InsectSprays) </span></span></p><p class="MsoNormal" style="MARGIN: 0in 0in 0pt"><span style="font-family:'Courier New';"><span style="font-size:100%;"></span></span></p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt"><span style="font-family:'Courier New';"><span style="font-size:100%;">count spray </span></span></p><p class="MsoNormal" style="MARGIN: 0in 0in 0pt"><span style="font-family:'Courier New';"><span style="font-size:100%;">Min. : 0.00 A:12 </span></span></p><p class="MsoNormal" style="MARGIN: 0in 0in 0pt"><span style="font-family:'Courier New';"><span style="font-size:100%;">1st Qu.: 3.00 B:12 </span></span></p><p class="MsoNormal" style="MARGIN: 0in 0in 0pt"><span style="font-family:'Courier New';"><span style="font-size:100%;">Median : 7.00 C:12 </span></span></p><p class="MsoNormal" style="MARGIN: 0in 0in 0pt"><span style="font-family:'Courier New';"><span style="font-size:100%;">Mean : 9.50 D:12 </span></span></p><p class="MsoNormal" style="MARGIN: 0in 0in 0pt"><span style="font-family:'Courier New';"><span style="font-size:100%;">3rd Qu.:14.25 E:12 </span></span></p><p class="MsoNormal" style="MARGIN: 0in 0in 0pt"><span style="font-family:'Courier New';"><span style="font-size:100%;">Max. :26.00 F:12 </span></span></p><p class="MsoNormal" style="MARGIN: 0in 0in 0pt; TEXT-INDENT: 6pt"></p><p class="MsoNormal" style="MARGIN: 0in 0in 0pt; TEXT-INDENT: 6pt"><span style="font-size:100%;"><span style="font-family:Times New Roman;"></span></span></p><p class="MsoNormal" style="MARGIN: 0in 0in 0pt; TEXT-INDENT: 6pt"><span style="font-size:100%;"><span style="font-family:Times New Roman;"></span></span></p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt; TEXT-INDENT: 6pt"><span style="font-size:100%;"><span style="font-family:Times New Roman;">The previous output shows that we have a balanced design, with 12 replicates per group. Because the data are stored in a data frame, the character variable, </span><span style="font-family:'Courier New';">spray</span><span style="font-family:Times New Roman;">, is automatically recognized by R as a factor. </span></span></p>
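Balance can also be checked directly; a quick sketch with base R (the data set ships with R, so this runs as-is):

```r
# Count observations per treatment level in the built-in InsectSprays data
data(InsectSprays)
table(InsectSprays$spray)
# replications() reports the number of replicates per factor level
replications(count ~ spray, data = InsectSprays)   # spray: 12
```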
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt; TEXT-INDENT: 6pt"><span style="font-family:Times New Roman;font-size:100%;">Let’s imagine there are three main questions. First, we want to determine whether there are any differences among the spray treatments. Second, the researcher wants to know whether the first three sprays differ from the last three. Third, there is a strong indication that sprays A, B, and F should have a stronger effect than the remaining treatments. Since contrasts are pre-planned or <i>a priori</i> comparisons, we should start by creating the contrasts matrix. We need to create a vector of coefficients for each comparison, and then bind the vectors into a matrix:</span> </p>
<p class="code" style="MARGIN: 0in 0in 0pt"><span style="font-family:Courier New;font-size:100%;"></span></p><p class="code" style="MARGIN: 0in 0in 0pt"><span style="font-family:Courier New;font-size:100%;">> c1 <- c(1,1,1,-1,-1,-1) </span></p><p class="code" style="MARGIN: 0in 0in 0pt"><span style="font-family:Courier New;font-size:100%;">> c2 <- c(1,1,-1,-1,-1,1) </span></p><p class="code" style="MARGIN: 0in 0in 0pt"><span style="font-family:Courier New;font-size:100%;">> mat <- cbind(c1,c2)</span> </p>
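As a quick sanity check (my own addition, not part of the original analysis): each column of the contrast matrix should sum to zero, so that every contrast compares one group of treatment means against another:

```r
# Each planned contrast must sum to zero across the six treatment levels
c1 <- c(1, 1, 1, -1, -1, -1)   # first three sprays vs last three
c2 <- c(1, 1, -1, -1, -1, 1)   # A, B, F vs C, D, E
mat <- cbind(c1, c2)
colSums(mat)   # c1: 0, c2: 0
```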
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt"><span style="font-size:100%;"><span style="font-family:Times New Roman;">Next, we need to specify that the contrast matrix </span><span style="font-family:'Courier New';">mat</span><span style="font-family:Times New Roman;"> should be used to compute the sums of squares (SS) during the ANOVA calculations. We achieve this by assigning the contrasts matrix to the factor </span><span style="font-family:'Courier New';">spray</span><span style="font-family:Times New Roman;"> using the </span><span style="font-family:'Courier New';">contrasts()</span><span style="font-family:Times New Roman;"> command:</span></span> </p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt"></p><p class="MsoNormal" style="MARGIN: 0in 0in 0pt"></p><p class="code" style="MARGIN: 0in 0in 0pt"><span style="font-family:Courier New;font-size:100%;">> contrasts(InsectSprays$spray) <- mat</span> </p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt"><span style="font-size:100%;"><span style="font-family:Times New Roman;">Note that once this assignment is made, the </span><span style="font-family:'Courier New';">aov()</span><span style="font-family:Times New Roman;"> command will calculate SS for the contrasts defined in each column of </span><span style="font-family:'Courier New';">mat</span><span style="font-family:Times New Roman;">. If you wish to change the contrasts, a new contrasts matrix should be created and assigned to the factor.</span></span> </p><p class="MsoNormal" style="MARGIN: 0in 0in 0pt; TEXT-INDENT: 6pt"><span style="font-family:Times New Roman;font-size:100%;">We now perform the analysis of variance and request a summary table. The ANOVA table is split to show the contrasts:</span> </p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt; TEXT-INDENT: 6pt"></p><p class="code" style="MARGIN: 0in 0in 0pt"><span style="font-family:Courier New;font-size:100%;"></span></p><p class="code" style="MARGIN: 0in 0in 0pt"><span style="font-family:Courier New;font-size:100%;">> model1 <- aov(count ~ spray, data = InsectSprays) </span></p><p class="code" style="MARGIN: 0in 0in 0pt"></p><p class="code" style="MARGIN: 0in 0in 0pt"><span style="font-family:Courier New;font-size:100%;">> summary(model1, split=list(spray=list("First 3 vs other"=1, "ABF vs CDE"=2))) </span></p><p class="code" style="MARGIN: 0in 0in 0pt"></p><p class="code" style="MARGIN: 0in 0in 0pt"></p><table id="qxk." cellspacing="0" cellpadding="3" border="0"><tbody><tr><td></td><td><span style="font-family:courier new;font-size:85%;">Df </span></td><td><span style="font-family:courier new;font-size:85%;">Sum Sq </span></td><td><span style="font-family:courier new;font-size:85%;">Mean Sq </span></td><td><span style="font-family:courier new;font-size:85%;">F value </span></td><td><span style="font-family:courier new;font-size:85%;">Pr(>F) </span></td><td><span style="font-family:courier new;font-size:85%;"></span></td></tr><tr><td><span style="font-family:courier new;font-size:85%;">spray</span></td><td><span style="font-family:courier new;font-size:85%;">5</span></td><td><p align="right"><span style="font-family:courier new;font-size:85%;">2668.83</span></p></td><td><p align="right"><span style="font-family:courier new;font-size:85%;">533.77</span></p></td><td><p align="right"><span style="font-family:courier new;font-size:85%;">34.702</span></p></td><td><p align="right"><span style="font-family:courier new;font-size:85%;"><2e-16</span></p></td><td><span style="font-family:courier new;font-size:85%;">***</span></td></tr><tr><td><span style="font-family:courier new;font-size:85%;">spray: First 3 vs other </span></td><td><span style="font-family:courier new;font-size:85%;">1</span></td><td><p 
align="right"><span style="font-family:courier new;font-size:85%;">93.39</span></p></td><td><p align="right"><span style="font-family:courier new;font-size:85%;">93.39</span></p></td><td><p align="right"><span style="font-family:courier new;font-size:85%;">6.0716 </span></p></td><td><p align="right"><span style="font-family:courier new;font-size:85%;">0.01635 </span></p></td><td><span style="font-family:courier new;font-size:85%;">*</span></td></tr><tr><td><span style="font-family:courier new;font-size:85%;">spray: ABF vs CDE </span></td><td><span style="font-family:courier new;font-size:85%;">1</span></td><td><p align="right"><span style="font-family:courier new;font-size:85%;">2558.67</span></p></td><td><p align="right"><span style="font-family:courier new;font-size:85%;">2558.67</span></p></td><td><p align="right"><span style="font-family:courier new;font-size:85%;">166.3495 </span></p></td><td><p align="right"><span style="font-family:courier new;font-size:85%;">< 2e-16 </span></p></td><td><span style="font-family:courier new;font-size:85%;">***</span></td></tr><tr><td><span style="font-family:courier new;font-size:85%;">Residuals </span></td><td><span style="font-family:courier new;font-size:85%;">66</span></td><td><p align="right"><span style="font-family:courier new;font-size:85%;">1015.17</span></p></td><td><p align="right"><span style="font-family:courier new;font-size:85%;">15.38 </span></p></td><td><span style="font-family:courier new;font-size:85%;"></span></td><td><span style="font-family:courier new;font-size:85%;"></span></td><td><span style="font-family:courier new;font-size:85%;"></span></td></tr></tbody></table>
<p class="code" style="MARGIN: 0in 0in 0pt"></p><p class="code" style="MARGIN: 0in 0in 0pt"><span style="font-family:Courier New;font-size:100%;">--- </span></p><p class="code" style="MARGIN: 0in 0in 0pt"><span style="font-size:100%;"><span style="font-family:Courier New;">Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1 </span></span></p><p class="MsoNormal" style="MARGIN: 0in 0in 0pt"></p><p class="MsoNormal" style="MARGIN: 0in 0in 0pt"></p><p class="MsoNormal" style="MARGIN: 0in 0in 0pt"><span style="font-size:100%;"><span style="font-family:Times New Roman;"></span></span></p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt"><span style="font-size:100%;"><span style="font-family:Times New Roman;">Let’s analyze the previous code. First we conduct an analysis of variance with the </span><span style="font-family:'Courier New';">aov()</span><span style="font-family:Times New Roman;"> command and store the results in the </span><span style="font-family:'Courier New';">model1</span><span style="font-family:Times New Roman;"> object. We then request a summary table with the </span><span style="font-family:'Courier New';">summary()</span><span style="font-family:Times New Roman;"> command, this time using the </span><span style="font-family:'Courier New';">split</span><span style="font-family:Times New Roman;"> option. The </span><span style="font-family:'Courier New';">split</span><span style="font-family:Times New Roman;"> option takes a list naming the factors whose SS should be partitioned; within each factor, we provide a named list labeling each contrast (i.e., each column of </span><span style="font-family:'Courier New';">mat</span><span style="font-family:Times New Roman;">). </span></span></p>
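The estimate and t-test for each planned contrast can also be read off as regression coefficients, since an aov() fit is a linear model underneath. A sketch, repeating the setup so it is self-contained:

```r
data(InsectSprays)
contrasts(InsectSprays$spray) <- cbind(c1 = c(1, 1, 1, -1, -1, -1),
                                       c2 = c(1, 1, -1, -1, -1, 1))
model1 <- aov(count ~ spray, data = InsectSprays)
# summary.lm() reports one row per contrast column (sprayc1, sprayc2),
# plus the columns R adds internally to complete the 5 df for the factor
summary.lm(model1)
```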
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt"><span style="font-size:100%;"><span style="font-family:Times New Roman;">The ANOVA table shows a significant effect of the spray types on insect mortality. The contrasts indicate significant differences between the first three and the last three spray types. Finally, we see that most of the treatment SS is explained by the comparison of the ABF and CDE groups. These differences among groups are easily visualized with a boxplot:</span></span></p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt"><span style="font-family:Courier New;font-size:100%;">> boxplot(count ~ spray, data = InsectSprays,
+ xlab = "Type of spray", ylab = "Insect count",
+ main = "InsectSprays data", col = "skyblue")
</span></p>
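The group means tell the same story as the boxplot; for example, using base R's tapply() (my addition, not in the original post):

```r
# Mean insect count per spray type; A, B and F are clearly higher than C, D and E
data(InsectSprays)
with(InsectSprays, tapply(count, spray, mean))
```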
<div id="itri" style="PADDING-RIGHT: 0px; PADDING-LEFT: 0px; PADDING-BOTTOM: 1em; PADDING-TOP: 1em; TEXT-ALIGN: left"><img style="WIDTH: 402px; HEIGHT: 312px" height="240" src="http://docs.google.com/File?id=dddbb94z_207j49h6qgq" width="216" /></div>Campitornoreply@blogger.com2tag:blogger.com,1999:blog-2159205108979523204.post-39525921289052530012007-09-15T12:17:00.000-07:002007-09-15T13:53:50.007-07:00Neighbor Joining Tree with ApeToday I used R to create a neighbor-joining phenogram with the <span style="font-family:courier new;"><a href="http://pbil.univ-lyon1.fr/R/ape/">ape</a> </span>library. The phenogram will be used to visualize the genetic differentiation of a species among six locations in Costa Rica. Genetic distances were calculated from microsatellite data with <a href="http://www.anu.edu.au/BoZo/GenAlEx/">GenAlEx</a>, because I haven't figured out how to calculate Nei's genetic dissimilarities in R. This is the procedure I used:<br /><br />First I imported the genetic distance matrix from Excel. The matrix was symmetric, meaning that the same distances are mirrored above and below the diagonal. I copied the matrix to the clipboard from Excel using Ctrl-C. The distance matrix included row and column names: the six locations. The matrix was imported into R by assigning it to an object named m:<br /><br /><span style="font-family:courier new;">> m <- as.matrix(read.table("clipboard", head=T, row.names=1)) <span style="font-family:georgia;"><br /><br />The previous command imports the data stored in the clipboard and transforms the data frame into a matrix. The options of the <span style="font-family:courier new;">read.table()</span> command, <span style="font-family:courier new;">head=T</span> and <span style="font-family:courier new;">row.names=1</span>, tell R that the first line of the matrix is a header and that row names are stored in the first column, respectively. 
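If you prefer not to rely on the clipboard (which is platform-dependent and not reproducible), the same structure can be built directly in R. The site names and distances below are made up purely for illustration:

```r
# Hypothetical symmetric distance matrix for six made-up sites
sites <- c("S1", "S2", "S3", "S4", "S5", "S6")
m <- matrix(0, 6, 6, dimnames = list(sites, sites))
m[lower.tri(m)] <- c(0.10, 0.15, 0.20, 0.25, 0.30,
                     0.12, 0.18, 0.22, 0.28,
                     0.14, 0.19, 0.24,
                     0.16, 0.21,
                     0.17)
m <- m + t(m)      # mirror the lower triangle to make the matrix symmetric
d <- as.dist(m)    # nj() expects a "dist" object
```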
After checking that the matrix was imported correctly, I proceeded to load the ape library:<br /><br />> <span style="font-family:courier new;">library(ape)</span><br /><br />The <span style="font-family:courier new;">ape </span>(Analyses of Phylogenetics and Evolution) library contains various methods for the analysis of genetic and evolutionary data. <span style="font-family:courier new;">ape</span> also provides commands to calculate distances from DNA sequences; I will tackle those in another post. Now back to my tree. After importing the distance matrix and loading the ape library, I created the tree with the <span style="font-family:courier new;">nj()</span> command:<br /><br />> <span style="font-family:courier new;">arbol <- nj(as.dist(m))<br />> plot(arbol, type = "unrooted")</span><br /><br />The <span style="font-family:courier new;">nj()</span> command takes only the distance matrix and stores the resulting tree topology in the 'arbol' object; the <span style="font-family:courier new;">type = "unrooted"</span> argument to <span style="font-family:courier new;">plot()</span> then draws it, as expected, as an unrooted tree. 
The plot command produced:<br /><br /></span></span><a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://1.bp.blogspot.com/_meRs3Pf_2a4/Ruw73vQciwI/AAAAAAAAAAM/3pSuRm9n9dU/s1600-h/arbolmama.jpg"><img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer;" src="http://1.bp.blogspot.com/_meRs3Pf_2a4/Ruw73vQciwI/AAAAAAAAAAM/3pSuRm9n9dU/s320/arbolmama.jpg" alt="" id="BLOGGER_PHOTO_ID_5110525505922960130" border="0" /></a>We can see that 'Puriscal' lies between two well-demarcated groups: one containing 'Guapiles' and 'Guatuso', and a second containing the other three populations.<br /><span style="font-family:courier new;"><span style="font-family:georgia;"><br /><br /></span></span>Campitornoreply@blogger.com5tag:blogger.com,1999:blog-2159205108979523204.post-26667836563086157442006-09-29T10:03:00.000-07:002006-11-13T17:13:04.457-08:00Mantel Partial RegressionTo perform a partial Mantel test on three distance matrices:
<span style="font-family:courier new;">
> library(vegan)
</span><span style="font-family:courier new;">> mantel.partial (xdis, ydis, zdis, </span>
<span style="font-family:courier new;">+ method = "pearson",</span> <span style="font-family:courier new;">permutations = 1000) </span>
Example:
Matrix 1: Genetic distance (Nei's D).
Matrix 2: Geographic distance.
Matrix 3: Regional correspondence (i.e., a binary matrix: 1 if two populations belong to the same region, 0 otherwise).
This allows us to test whether a correlation between geographic distance and genetic differentiation exists once regional correspondence is taken into account.Campitornoreply@blogger.com0
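A self-contained sketch of the whole procedure with simulated data (every value below is fabricated for illustration; the vegan package must be installed):

```r
library(vegan)

set.seed(42)
# Six hypothetical populations: random coordinates stand in for real locations
coords <- matrix(runif(12), ncol = 2)
geo <- dist(coords)                          # geographic distances
gen <- dist(coords + rnorm(12, sd = 0.1))    # noisy "genetic" distances
# Regional correspondence, coded as described above: 1 = same region
region <- rep(c("North", "South"), each = 3)
reg <- as.dist(outer(region, region, "==") * 1)

# Partial Mantel test: genetic vs geographic distance,
# controlling for regional correspondence
mantel.partial(gen, geo, reg, method = "pearson", permutations = 999)
```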