ANOVA by any other name
- Details
- Created: 18 January 2008
- Written by Steven Ouellette
Analysis of variance (ANOVA) is an elegant procedure—simple, economical, and powerful. Let's say we have this research question:
“Which of the five materials being considered for making a gear has the best wear characteristic?”
This will lead to the statistical question, “Which of the five materials has the highest average wear?” (I’ll discuss another statistical question—about the dispersion—in next month’s column.)
We make up eight gears out of each of the five materials and run them on our wear tester (on which, of course, we have performed our measurement system analysis and determined acceptability). You can get the data here.
We might be tempted to do multiple t-tests, but with five materials we would have to do 10 different pairwise t-tests (which is annoying). Even worse, when we do that we increase the chance of making an α error. If α_FW is the chance of making a Type I error somewhere across all the tests, α_PC is the Type I error for each individual test, and c is the total number of tests:

α_FW = 1 − (1 − α_PC)^c
So if our α_PC for each t-test is 0.05, our actual α_FW is inflated to about 40.13 percent.
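That inflation figure is easy to check directly. A minimal sketch (the 10 comes from the C(5,2) possible pairwise comparisons among five materials):

```python
from math import comb

alpha_pc = 0.05            # per-comparison Type I error
c = comb(5, 2)             # 10 pairwise t-tests among five materials
alpha_fw = 1 - (1 - alpha_pc) ** c
print(round(alpha_fw, 4))  # about 0.4013, i.e., 40.13 percent
```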
Holy leftover fruit cake, Batman! That’s a serious chance of concluding that a significant difference exists among the materials when in fact none does.
Luckily, there’s a way to test for equality of all group averages in one test. ANOVA works because we have two potential sources of variation: variation within each group and variation between the averages of each group. If all the groups have the same average wear, then the variation between and within the groups is due only to random chance, and the different materials have no effect on wear. On the other hand, if the different materials do have different averages, the total variation we see will be higher than we would have guessed from the variability within each group.
I can therefore estimate the population variance in two ways: I can take the average of the variances within each group, and I can find the variance between the group means and multiply by the sample size. Both ought to give me the same answer, within sampling error, if the true means of the groups are all the same. Here’s the genius part—if I make a ratio of these two estimates, I can use the good old F-statistic to test whether they’re equal. If they differ only by an amount that could be due to sampling error, then as far as I can tell the groups all have the same average, and I calculate an F close to one. If the averages really are different, then the between-groups estimate of the variance contains sampling error plus a variance component due to the differences in group averages.
Clearly, this F is going to be larger than one. Another cool thing is that it’s a one-tail test (see why?), which gives us additional power for the same α.
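The two estimates can be built by hand. Here is a minimal sketch on made-up data (three groups of four readings, not the article’s gear data): with equal group sizes, the within estimate is the average of the group variances, the between estimate is the variance of the group means scaled by n, and their ratio is F.

```python
import statistics

# hypothetical data: three groups of four readings each (equal n)
groups = [
    [10.2, 9.8, 10.5, 10.1],
    [12.0, 11.6, 12.3, 11.9],
    [10.0, 10.4, 9.7, 10.3],
]
n = len(groups[0])   # observations per group
k = len(groups)      # number of groups

# within-groups estimate: average of the sample variances
ms_within = statistics.mean(statistics.variance(g) for g in groups)

# between-groups estimate: variance of the group means, scaled by n
ms_between = n * statistics.variance([statistics.mean(g) for g in groups])

F = ms_between / ms_within  # near 1 if all true means are equal
print(F)
```

With this made-up data the second group sits well above the others, so F comes out far larger than one; the two estimates also satisfy the sum-of-squares identity SS_total = SS_between + SS_within exactly.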
OK, fine, you knew all that. But sometimes it’s fun to just sit back and appreciate elegance when you find it.
Our null hypothesis for this ANOVA is:
H0: μ_{1} = μ_{2} = μ_{3} = μ_{4} = μ_{5}
The alternative hypothesis is that the null statement is not true somehow—at least one mean differs from another.
First I check for normality, because that’s one of the assumptions in ANOVA.
Material | n | (A-D) A² | p | (S-W) W | p | (L-M) r | p | Skew. | p | Kurt. | p
---------|---|----------|---|---------|---|---------|---|-------|---|-------|---
1 | 8 | 0.463 | 0.266 | 0.902 | 0.301 | 0.166 | 0.796 | 0.190 | 0.798 | -1.301 | >.10
2 | 8 | 0.654 | 0.089 | 0.859 | 0.117 | 0.701 | 0.161 | -0.820 | 0.271 | -0.924 | >.10
3 | 8 | 0.392 | 0.396 | 0.897 | 0.274 | 0.000 | 1.000 | 0.000 | 1.000 | -1.456 | >.10
4 | 8 | 0.337 | 0.530 | 0.933 | 0.542 | 0.034 | 0.958 | 0.083 | 0.911 | -0.438 | >.10
5 | 8 | 0.777 | 0.043* | 0.809 | 0.036* | 0.750 | 0.111 | -1.113 | 0.137 | 0.291 | >.10

(A-D = Anderson-Darling; S-W = Shapiro-Wilk; * = significant at α = 0.05)
There’s one material (material 5) that might not be normally distributed, as indicated by the Anderson-Darling and Shapiro-Wilk tests. ANOVA is fairly robust to departures from normality when n is large, but is highly affected by outliers. I reviewed a histogram of the data and found that, while it might be skewed, there are no outliers, so we are probably safe with ANOVA. Just to be sure, I ran a Kruskal-Wallis nonparametric test, which confirmed the results.
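For the curious, the Kruskal-Wallis test itself is just an ANOVA-flavored test on ranks. A minimal sketch on made-up wear readings (not the article’s data), with no correction for tied ranks: the H statistic is compared against a chi-square with k − 1 = 4 degrees of freedom, whose critical value is 9.488 at α = 0.05.

```python
from itertools import chain

# hypothetical wear readings, five materials, n = 8 each (no tied values)
groups = [
    [12.1, 13.4, 11.8, 12.9, 13.1, 12.5, 12.0, 13.3],
    [14.2, 15.1, 14.8, 13.9, 15.3, 14.5, 14.9, 15.0],
    [11.0, 10.4, 11.6, 10.9, 11.2, 10.7, 11.4, 10.8],
    [13.6, 12.8, 13.0, 13.8, 12.7, 13.2, 13.5, 12.6],
    [16.1, 15.8, 16.4, 15.6, 16.0, 16.3, 15.7, 16.2],
]

# rank every observation in the pooled sample (assumes no ties)
pooled = sorted(chain.from_iterable(groups))
rank = {x: i + 1 for i, x in enumerate(pooled)}
N = len(pooled)

# H compares the mean rank of each group to what chance would give
H = 12 / (N * (N + 1)) * sum(
    len(g) * (sum(rank[x] for x in g) / len(g)) ** 2 for g in groups
) - 3 * (N + 1)
print(H)  # with these well-separated groups, H lands near 36, far above 9.488
```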
So we perform an ANOVA on our gear data, and generate something like this:
ONEWAY ANOVA
wear by material [1 to 5]
Source | df | SS | MS | F | p
-------|----|----|----|---|---
Between | 4 | 700.1500 | 175.0375 | 46.324 | 0.000*
Within | 35 | 132.2500 | 3.7786 | |
Total | 39 | 832.4000 | | |
Fixed Effects Analysis:
ω² = 81.92%
I’m going to talk about dispersion analysis in March, so for now let’s assume we have equal variances within the five materials. The p-value on the end of the ANOVA table is the probability of getting an F-statistic of 46.324 (or more extreme) from an F-distribution with 4, 35 degrees of freedom and an average of 1, which is pretty dang unlikely. So we reject the hypothesis that all the averages are equal, and conclude that the different materials do in fact influence the gear wear. The ω² number is an estimate of the percentage of the total variation explained by differences in material, so clearly those differences are large compared to the sampling error.
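The table’s arithmetic can be verified straight from the sums of squares: each MS is its SS divided by its df, F is MS_between over MS_within, and one common fixed-effects formula for ω² is (SS_between − df_between · MS_within) / (SS_total + MS_within). A quick check with the table’s numbers:

```python
# values taken from the ANOVA table above
ss_between, df_between = 700.15, 4
ss_within, df_within = 132.25, 35
ss_total = ss_between + ss_within        # 832.40

ms_between = ss_between / df_between     # 175.0375
ms_within = ss_within / df_within        # about 3.7786
F = ms_between / ms_within               # about 46.324

# omega-squared: estimated share of total variation explained by material
omega_sq = (ss_between - df_between * ms_within) / (ss_total + ms_within)
print(round(F, 3), round(100 * omega_sq, 2))
```

Both computed values land on the table’s reported F of 46.324 and ω² of 81.92 percent.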
But now what? We’ve found a significant difference, and if you refer to the alternative hypothesis, you’ll notice that ANOVA doesn’t tell you where the differences are. Now we enter the world of post-hoc analysis. (Post-hoc just means “after the fact.” Ham hock is something else entirely, so stop drooling.) We will delve into this realm next month, when our managers intrude their reality on our nice antiseptic ANOVA. Then I will show you something that you might not have seen before that could save you oodles of money.
Then again, I could be wrong.
Thanks to Mike Petrovich for his program MVPstats, which makes these types of analyses fast and easy. Mike now has a shareware version of his flexible SPC program available for download.