ANOVA compares the means of randomly sampled groups to test whether or not they differ.

The different levels of the categorical variable representing different groups are called **treatments**.

Each observation is called a **replicate** and is assumed to be independent from all other replicates.

So how do we compare group means?

[Figure: colored bars at top represent the variances within groups; black bars represent the variance ignoring groups.]

\[Y_{ij}=\mu + A_i + \epsilon_{ij}\]

\(\mu\) is the population grand mean (the sample grand mean \(\bar{Y}\) is an unbiased estimator of \(\mu\)).

\(A_i\) is the additive linear effect of treatment \(i\) relative to the grand mean (the \(A_i\) sum to 0).

\(\epsilon_{ij}\) is the residual error (the deviation of individual points around their treatment-group means). These errors are assumed to be distributed as \(N(0,\sigma^2)\).

\[Y_{ij}=\mu + A_i + \epsilon_{ij}\]

Let's see how changing model parameters affects the y variable. We will start by defining a few variables that will be kept constant.

```r
library(ggplot2)  # for qplot()

grandMean <- 10
groups <- rep(c("A", "B"), 100)
myError <- rnorm(200, mean = 0, sd = 3)
```

```r
y <- grandMean + c(-5, 5) + myError
qplot(groups, y, geom = "boxplot", fill = groups)
```

```r
y <- grandMean + c(-0.3, 0.3) + myError
qplot(groups, y, geom = "boxplot", fill = groups)
```
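We can also run the simulated data through an ANOVA to see how the size of the group effect, relative to the error variance, drives the result. This is a sketch (the seed is arbitrary, chosen only for reproducibility):

```r
set.seed(42)  # assumed seed, for reproducibility

# Rebuild the simulated data from the model Y_ij = mu + A_i + e_ij
grandMean <- 10
groups <- rep(c("A", "B"), 100)
myError <- rnorm(200, mean = 0, sd = 3)

# Large group effect (+/- 5): F is huge, p is tiny
yBig <- grandMean + c(-5, 5) + myError
anova(lm(yBig ~ groups))

# Small group effect (+/- 0.3): F is small, p is large
ySmall <- grandMean + c(-0.3, 0.3) + myError
anova(lm(ySmall ~ groups))
```

With the same error variance, only the size of \(A_i\) differs between the two fits, so comparing the two tables shows how the F ratio grows with the among-group effect.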

- Calculate the Total SS
- Calculate the Within-group SS
- Calculate the Between-group SS
- Calculate Mean Squares using appropriate df.
- Calculate the F ratio
- Calculate the p-value

As you could tell from your reading, calculating the appropriate sums of squares is complicated and depends on your experimental design. Be careful (or just use Monte Carlo)!

Example data from Gotelli Table 10.1

```
##   unmanipulated control treatment
## 1            10       9        12
## 2            12      11        13
## 3            12      11        15
## 4            13      12        16
```

These data come from 12 experimental plots used to test the effect of soil temperature on flowering period in larkspur. The treatment plots had heating coils installed and activated; the control plots had heating coils installed but never turned on; the unmanipulated plots were untouched.

We want to know if there is a difference in mean flowering length between these groups.
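The calculation steps listed above can be worked through by hand for these data. A sketch (the data frame is typed in from the table; `flower` is a made-up name):

```r
# Gotelli Table 10.1, in long format
flower <- data.frame(
  days  = c(10, 12, 12, 13,   # unmanipulated
             9, 11, 11, 12,   # control
            12, 13, 15, 16),  # treatment
  group = rep(c("unmanipulated", "control", "treatment"), each = 4)
)

grand      <- mean(flower$days)                       # grand mean
groupMeans <- tapply(flower$days, flower$group, mean) # treatment means

# Sums of squares
SStotal  <- sum((flower$days - grand)^2)
SSwithin <- sum((flower$days - groupMeans[flower$group])^2)
SSamong  <- SStotal - SSwithin

# Mean squares, F ratio, and p-value
a <- 3; n <- 4                       # a groups, n replicates per group
MSamong  <- SSamong / (a - 1)
MSwithin <- SSwithin / (a * (n - 1))
Fratio   <- MSamong / MSwithin       # approx. 5.1 with df = (2, 9)
pval     <- pf(Fratio, a - 1, a * (n - 1), lower.tail = FALSE)

anova(lm(days ~ group, data = flower))  # should match the hand calculation
```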

Source | degrees of freedom | SS | Mean Square (MS) | Expected Mean Square | F-ratio
---|---|---|---|---|---
Among groups | \(a-1\) | \(\sum\limits_{i=1}^a\sum\limits_{j=1}^n(\bar{Y}_i - \bar{Y})^2\) | \(\frac{SS_{among}}{a-1}\) | \(\sigma^2 + n\sigma^2_A\) | \(\frac{MS_{among}}{MS_{within}}\)
Within groups (residual) | \(a(n-1)\) | \(\sum\limits_{i=1}^a\sum\limits_{j=1}^n(Y_{ij} - \bar{Y}_i)^2\) | \(\frac{SS_{within}}{a(n-1)}\) | \(\sigma^2\) | 
Total | \(an-1\) | \(\sum\limits_{i=1}^a\sum\limits_{j=1}^n(Y_{ij} - \bar{Y})^2\) | \(\frac{SS_{total}}{an-1}\) | \(\sigma^2_Y\) | 

- The samples are independent and identically distributed
- The residuals \(\epsilon_{ij}\) are normally distributed
- Within-group variances are similar across all groups ("homoscedastic")
- Observations are properly categorized in groups

- Main effects are additive (no strong interactions).
- Balanced design (equal number of observations in all groups). If you are going to do a lot of ANOVA, you need to read about the ANOVA controversy in R and other models for calculating SS that are more appropriate for unbalanced designs.

```r
myMod <- lm(iris$Petal.Length ~ iris$Species)
summary(myMod)
```

```
## 
## Call:
## lm(formula = iris$Petal.Length ~ iris$Species)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -1.260 -0.258  0.038  0.240  1.348 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             1.46200    0.06086   24.02   <2e-16 ***
## iris$Speciesversicolor  2.79800    0.08607   32.51   <2e-16 ***
## iris$Speciesvirginica   4.09000    0.08607   47.52   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4303 on 147 degrees of freedom
## Multiple R-squared:  0.9414, Adjusted R-squared:  0.9406 
## F-statistic:  1180 on 2 and 147 DF,  p-value: < 2.2e-16
```

```r
anova(myMod)
```

```
## Analysis of Variance Table
## 
## Response: iris$Petal.Length
##               Df Sum Sq Mean Sq F value    Pr(>F)    
## iris$Species   2 437.10 218.551  1180.2 < 2.2e-16 ***
## Residuals    147  27.22   0.185                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```
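The same ANOVA table can also be obtained in one step with `aov()`, which fits the linear model and lets `summary()` print the ANOVA table directly:

```r
# Equivalent one-way ANOVA via aov(); the data argument avoids iris$ prefixes
summary(aov(Petal.Length ~ Species, data = iris))
```

The F value, degrees of freedom, and p-value should match the `anova(lm(...))` output above; `aov()` is simply a wrapper around `lm()` with ANOVA-style printing.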

```r
plot(iris$Petal.Length ~ iris$Species)
```

```r
# Homogeneity of variances
bartlett.test(iris$Petal.Length ~ iris$Species)
```

```
## 
##  Bartlett test of homogeneity of variances
## 
## data:  iris$Petal.Length by iris$Species
## Bartlett's K-squared = 55.423, df = 2, p-value = 9.229e-13
```

The very small p-value indicates that the within-group variances differ among species, so the homoscedasticity assumption is violated here.

```r
plot(myMod)
```

Life gets more complicated when we consider more than one factor.

\[Y_{ijk}=\mu + A_i + B_j + AB_{ij} + \epsilon_{ijk}\]

With more than one factor, we have the possibility of non-additive effects of one factor on the other. These are called **interaction terms**.

```r
data(warpbreaks)
model <- lm(warpbreaks$breaks ~ warpbreaks$tension + warpbreaks$wool)
anova(model)
```

```
## Analysis of Variance Table
## 
## Response: warpbreaks$breaks
##                    Df Sum Sq Mean Sq F value   Pr(>F)   
## warpbreaks$tension  2 2034.3 1017.13  7.5367 0.001378 **
## warpbreaks$wool     1  450.7  450.67  3.3393 0.073614 . 
## Residuals          50 6747.9  134.96                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

With more than one factor there is the possibility of non-additive effects, where the effect of factor A depends on the level of factor B.
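Note that the warpbreaks model above is purely additive. To actually test the interaction, use `*` in the model formula, which expands to the main effects plus the interaction term:

```r
data(warpbreaks)

# breaks ~ tension * wool expands to tension + wool + tension:wool
modelInt <- lm(breaks ~ tension * wool, data = warpbreaks)
anova(modelInt)  # the tension:wool row tests the interaction
```

The `tension:wool` row of this table corresponds to the \(AB_{ij}\) term of the two-factor model: its F test asks whether the effect of tension on breaks differs between the two wool types.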