Before I started using R, I was under the impression that there was only one type of sums of squares (SS) that should be used: type III SS. It actually was a bit confusing when some people chose to use other types of SS. I also didn't know that SAS invented the term: type III SS. Once I started using R and came across my first unbalanced factorial design...reality hit me.

First of all, I couldn't get R to deploy my coveted type III SS. I could clearly see that any summary coming from lm() or aov() were clearly type I SS, or additive sums of squares. After some reading, I could get type II SS from different models, but the process seemed a bit to complicated. It had to be simpler!!

I played with the idea of posting in R-help , but I was (again) scared away by the terrible comments from some of the gurus. After searching for endless hours on Google, I came across a bunch of interesting information. I read comments and posts regarding how bad type III SS are, mainly because they allow to test for main effects even in the presence of interaction. Honestly I don't think is all that terrible, but I understand it shouldn't be done. I actually teach it that way to my students: "If an interaction is significant, the main effects are rarely interpretable".

On the other camp, I read proponents suggesting that type III SS are the only ones that allow hypothesis testing, giving that they are order independent. Also very true.

I made up my mind, when someone compared SAS to Micro$oft. From then on, I chose not to use type III SS if possible. Therefore, this entry is to teach how to deal with unbalanced factorials and how to get the appropriate SS and F values from R. And if you're so inclined, I will show you how to get the blasted type III SS's. Most of this information is scattered all over the Internet and I'm not to be credited for any of it. I just collected it for my personal use, and since this Blog serves as my basket case for R methods I easily forget, they ended up here.

Let's begin with an example. Let´s suppose we have a two-way unbalanced ANOVA to study the effect of forest type and species on herbivory. The biological hypothesis suggests that species vary in their susceptibility to herbivory, and the percentage of leaves damaged by herbivory are independent of forest type. We sample three different forest types: riparian, transition and mature forests. In each forest type, we collected samples from four different species and determined the percent of area removed from a randomly selected leaf. Given that species number is not constant on each site, the design is unbalanced. The data is presented in the following table:

Herbivory
| |||

Riparian | Transition | Mature | |

Sp1
| 42 44 36 13 19 22 | 33 26 33 21 | 31 3 25 25 24 |

Sp2
| 28 23 34 42 13 | 34 33 31 36 | 3 26 28 32 4 16 |

Sp3
| 1 29 19 | 11 9 7 1 6 | 21 1 9 3 |

Sp4
| 24 9 22 2 15 | 27 12 12 5 16 15 | 22 7 25 5 12 |

We read the data and included it in R as a data frame named: herb. Now we can check the number of replicates using the replications() command:

> replications(herbiv~bosque*especie, data=herb) $bosque bosque maduro ripario transicion 20 19 19 $especie especie sp1 sp2 sp3 sp4 15 15 12 16 $"bosque:especie" especie bosque sp1 sp2 sp3 sp4 maduro 5 6 4 5 ripario 6 5 3 5 transicion 4 4 5 6As we can clearly see, the model is highly unbalanced. We shall perform an analysis of variance in various ways. First we will fit a saturated model, and then vary the order of the factors in the model:

> mod.sat <- aov(herbiv~bosque*especie, data=herb) > mod.sat.2 <- aov(herbiv~especie*bosque, data=herb)

The summary tables for both models show that SS are calculated sequentially.

> summary(mod.sat)

Df Sum Sq Mean Sq F value Pr(>F) bosque 2 464.0 232.0 2.4766 0.09516 . especie 3 2810.8 936.9 10.0017 3.434e-05 *** bosque:especie 6 530.4 88.4 0.9436 0.47348 Residuals 46 4309.1 93.7

> summary(mod.sat.2) Df Sum Sq Mean Sq F value Pr(>F) especie 3 2834.8 944.9 10.0871 3.186e-05 *** bosque 2 440.0 220.0 2.3486 0.1069 especie:bosque 6 530.4 88.4 0.9436 0.4735 Residuals 46 4309.1 93.7

Although very close, the SS for the species factor (i.e. especies) is different depending on the order of the factors. If species comes first then (SS = 2834,8), but if species is the second term in the model then SS=2810.8. These SS are all sequential SS´s, known by SAS as Type I SS. In the SAS world, now we would use proc GLM and get type III sums of squares, and test hypothesis regarding the interaction and the main effects. Nonetheless, this approach may lead us down a dangerous path, since it allows us to test for main effects even in the presence of a significant interaction.

**The R way**
The advisable method in R, is to search for the most parsimonious model and then obtain SS by comparing models that differ in the number of parameters (i.e. factors) included. This procedure is automagically done by the drop1() command. This command, as its name implies, drops arguments from the model and compares the original model to the reduced one. It generally begins by removing non-significant interactions. The results given by drop1() command are order-independent, therefore the following code examples will only be performed on mod.sat:

> drop1(mod.sat, test="F") Single term deletions Model: herbiv ~ bosque * especie Df Sum of Sq RSS AIC F value Pr(F) <none> 4309.1 273.9 bosque:especie 6 530.4 4839.5 268.6 0.9436 0.4735 >

The drop1() command used in the previous code, required two parameters or options: the name of the fitted aov() model, and the type of test to perform. We asked R to drop terms from "mod.sat", our saturated model. We also requested F statistics, which compare the original model, with the new model without the term. AIC, the Akaike Information Criterion, measures the goodness of fit of a statistical model taking into account the number of parameters included. The AIC statistic allows us to find the best model that fits the data, with the minimum number of parameters.

One should choose the model with the lowest AIC value. In the previous output, removing the interaction term: bosque:especie; produces a better model with a lower AIC than the saturated formula (268.6 vs 273.9). Therefore, we can conclude that removing the interaction term should benefit our model. The F statistic provided is based on SS calculated by comparing models with and without the interaction term, and hence, can be used to report the non-significance of the interaction.

We now fit a linear model without the interaction term, and request the appropriate sums-of-square with the drop1().

> mod1 <- aov(herbiv~bosque+especie, data=herb)

> drop1(mod1, test="F")

Single term deletions

Model: herbiv ~ bosque + especie Df Sum of Sq RSS AIC F value Pr(F) <none> 4839.5 268.6 bosque 2 440.0 5279.5 269.6 2.3639 0.1041 especie 3 2810.8 7650.2 289.2 10.0672 2.461e-05 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1The output from the drop1() command shows that removing any of the natural components (i.e. 'bosque' or 'especie') would lower the goodness of fit of the model. This can be seen by inspecting the AIC statistics. We see that by removing <none> of the factors, we get an AIC of 268.6. However, if we remove 'bosque' or 'especie', we get a higher AIC value, thus suggesting that we have the most parsimonious model.

F-statistics are computed by comparing the original model, with one where the factor is removed. We can see that 'bosque' is not significant, while 'especie' is highly significant (F=10.07; df=2,46; p<0.001). As I precieve it, these are all type III SS, but I could be wrong. We can clearly see that they are order independent, by fitting a model where 'especie' comes first:

> drop1(aov(herbiv~especie+bosque, herb), test="F") Single term deletions

Model: herbiv ~ especie + bosque Df Sum of Sq RSS AIC F value Pr(F) <none> 4839.5 268.6 especie 3 2810.8 7650.2 289.2 10.0672 2.461e-05 *** bosque 2 440.0 5279.5 269.6 2.3639 0.1041 --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

The results do not deviate from the previous analysis, thus confirming that this type of analysis is order independent. My only grievance with this type of analysis, is that the ANOVA table has to be computed by hand. I still haven't figured out if this can be done automatically.

**Type III SS from R**

If you want to get type III sums-of-square anyway (maybe due to extraordinary circumstances: your boss demands them!) you can easily get them from R. As Dr. Venables says: you just have to know where to look for them.

I will not go into detail on how this works, mainly, because I don't know how it works. To get type III-SS, you have to do the following:

1. Change the default contrast matrix used by R, to one which produces orthogonal contrasts such as contr.helmert or contr.sum:

> options(contrasts=c("contr.helmert", "contr.poly"))

2. Fit a new anova model (saturated), which uses the new contrast matrix we selected in step 1.

> mod.III <- aov(herbiv~bosque*especie, data=herb)

3. Then use the drop1() command, choosing to remove all objects using the formula shorthand notation .~. :

> drop1(mod.III, .~., test="F") Single term deletions

Model: herbiv ~ bosque * especie

Df Sum of Sq RSS AIC F value Pr(F) <none> 4309.1 273.9 bosque 2 436.3 4745.4 275.5 2.3289 0.1088 especie 3 2753.8 7062.8 296.5 9.7989 4.108e-05 *** bosque:especie 6 530.4 4839.5 268.6 0.9436 0.4735 --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

And voila!...You get type III sums-of-square, which should make any SAS user happy. The *Car* package, has built in functions to get type-III SS, but I've read they produce funny output sometimes. I've never tested this, since I don't like installing many libraries to do things you can get from the base stats module.

**Conclusion**

In my opinion the type III sums-of-square issue, is a bit technical for the average user. Although some suggest that the average user shouldn't be doing statistics, it is commonplace for many students and some researchers to have to deal with anova on their own, with very little statistics background. I think R does the correct thing by not supplying the infamous SS's easily, and forcing the user to think about what he/she really wants. Nonetheless, this discussion should be part of R's help files, and the correct procedure should be also easier to find. Finally, the drop1() method should provide a properly formatted ANOVA table, but again, I may just not know how to do it...and the correct procedure is hidden somewhere within R's mysterious innards.

I don't know how active this blog still is but I want to thank you very much for this insightful discussion!

ReplyDeleteI have been trying to piece together this information for a long time but I never managed. Especially the hint about setting the contrasts option to contr.helmert was a missing piece that I would have never discovered myself.

I am left with one puzzling and annoying question: using your method to arrive at a SAS-like Type III analysis, how does one calculate the intercept? I was under the impression, that "intercept" is another word for what used to be called "correction factor". However, the correction factor for the data in your example would be (1127^2)/58=21898.77, while the "intercept" given by Statistica (and, I assume, by SAS) for a Type III analysis of your data is 21120.45. This is the same intercept that the Anova command from R's car package gives (if the contrasts option is set to contr.helmert), so I assume there is something mysterious going on here that I don't understand. Of course, one could simply use the Anova(car) command, but it would be much nicer to be able to recreate the SAS-style Type III anova in R without installing additional packages.

I think, once you know that your model is best without the interaction term, you could produce a nice formatted ANOVA table in the following way:

ReplyDeletemodel <- lm(herbiv~especie+bosque, herb)

anova(model)

I may of course be wrong ;)

lovely blog post - it would be great if you provided a data table we could download & read, to follow through the example.

thank you! your explanations were the "light at the end of the tunnel"!

ReplyDeleteThank you, Andrea. glad you liked it.

ReplyDeleteRegarding your doubt about calling the final drop1() table for type III SS. I think that this is actually what you would call type II SS according to this post: Anova – Type I/II/III SS explained

ReplyDeleteThis comment has been removed by a blog administrator.

ReplyDeleteThis is a bit of a naïve question from an R amateur. how can I undo or reset the default settings after applying the "contr.helmert" settings?

ReplyDeleteFind the latest used and new cars for sale.

ReplyDeleteGreat used car deals and prices.

More here postallads4free.com

Thanks for the post!

ReplyDeleteIn reply to Andoni Santander, I have found the following way of setting up again the defalut contrasts:

options(contrasts = c("contr.treatment", "contr.poly"))

A question: Would the coefficients change depending the type of SS? I guess no, but can anyone confirm it? Coef does not apply to the drop1 outcome.

Thanks and congratulations for the blog

It was really a nice article and i was really impressed by reading this Data Science online Training Bangalore

ReplyDeletegreat post

ReplyDeletehotmail sign up login