13.1 - The Univariate Approach: Analysis of Variance (ANOVA)

In the univariate case, the data can often be arranged in a table as shown below:

The columns correspond to the responses to g different treatments or from g different populations, and the rows correspond to the subjects within each of these treatments or populations.

Notation:

The assumptions for the Analysis of Variance are the same as for a two-sample t-test, except that the number of groups may be more than two:

  1. The data from group i have a common mean μi; i.e., E(Yij) = μi. This means that there are no sub-populations with different means.
  2. Homoskedasticity: The data from all groups have common variance σ2; i.e., var(Yij) = σ2. That is, the variability in the data does not depend on group membership.
  3. Independence: The subjects are independently sampled.
  4. Normality: The data are normally distributed.

The hypothesis of interest is that all of the means are equal to one another. Mathematically we write this as:

\(H_0: \mu_1 = \mu_2 = \dots = \mu_g\)

The alternative is expressed as:

\(H_a: \mu_i \ne \mu_j \) for at least one \(i \ne j\).

i.e., there is a difference between at least one pair of group population means. The following notation is used:

  • \(\bar{y}_{i.} = \frac{1}{n_i}\sum_{j=1}^{n_i}Y_{ij}\) = sample mean for group i. This is the average of all the observations for j = 1 to ni belonging to the ith group. The dot in the second subscript means that the average involves summing over the second subscript of y.

  • \(\bar{y}_{..} = \frac{1}{N}\sum_{i=1}^{g}\sum_{j=1}^{n_i}Y_{ij}\) = grand mean. This is the average of all the observations within each group and over the groups, dividing by the total sample size N = n1 + ... + ng. The double dots indicate that we are summing over both subscripts of y.
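As a concrete illustration, the group means and the grand mean can be computed directly in NumPy; the data values here are made up purely for the example, with g = 3 hypothetical groups:

```python
import numpy as np

# Hypothetical responses for g = 3 treatment groups (made-up data).
groups = [
    np.array([4.0, 5.0, 6.0]),       # group 1, n_1 = 3
    np.array([7.0, 8.0, 9.0, 8.0]),  # group 2, n_2 = 4
    np.array([2.0, 3.0]),            # group 3, n_3 = 2
]

# Sample mean for group i: average over j = 1, ..., n_i of Y_ij.
group_means = [float(y.mean()) for y in groups]

# Grand mean: average of all N = n_1 + ... + n_g observations.
all_obs = np.concatenate(groups)
grand_mean = float(all_obs.mean())

print(group_means)           # [5.0, 8.0, 2.5]
print(round(grand_mean, 4))  # 5.7778  (= 52/9)
```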

The Analysis of Variance involves a partitioning of the total sum of squares which is defined as in the expression below:

\[SS_{total} = \sum_{i=1}^{g}\sum_{j=1}^{n_i}(Y_{ij}-\bar{y}_{..})^2\]

Here we are summing the squared difference between each observation and the grand mean. Note that if the observations tend to be far away from the Grand Mean, then this will take a large value. Conversely, if all of the observations tend to be close to the Grand Mean, this will take a small value. Thus, the total sum of squares measures the variation of the data about the Grand Mean.

An Analysis of Variance (ANOVA) is a partitioning of the total sum of squares. In the second line of the expression below we add and subtract the sample mean for the ith group. In the third line, expanding the square yields two terms (the cross-product term sums to zero): the first term involves the differences between the observations and the group means, \(\bar{y}_{i.}\), while the second term involves the differences between the group means and the grand mean.

\[\begin{array}{lll} SS_{total} & = & \sum_{i=1}^{g}\sum_{j=1}^{n_i}(Y_{ij}-\bar{y}_{..})^2 \\ & = & \sum_{i=1}^{g}\sum_{j=1}^{n_i}\left[(Y_{ij}-\bar{y}_{i.})+(\bar{y}_{i.}-\bar{y}_{..})\right]^2 \\ & = &\underset{SS_{error}}{\underbrace{\sum_{i=1}^{g}\sum_{j=1}^{n_i}(Y_{ij}-\bar{y}_{i.})^2}}+\underset{SS_{treat}}{\underbrace{\sum_{i=1}^{g}n_i(\bar{y}_{i.}-\bar{y}_{..})^2}} \end{array}\]

The first term is called the error sum of squares and measures the variation of the data about their group means. If the observations tend to be close to their group means, this value will be small; if they tend to be far away from their group means, it will be large. The second term is called the treatment sum of squares and involves the differences between the group means and the Grand Mean. If the group means are close to the Grand Mean, this value will be small; if they tend to be far away from the Grand Mean, it will be large. Thus, the treatment sum of squares measures the variation of the group means about the Grand Mean.
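The partition of the total sum of squares can be verified numerically. The following is a minimal sketch with made-up data for g = 3 groups (the values are arbitrary and serve only to illustrate the identity):

```python
import numpy as np

# Hypothetical data for g = 3 groups (made-up values).
groups = [np.array([4.0, 5.0, 6.0]),
          np.array([7.0, 8.0, 9.0, 8.0]),
          np.array([2.0, 3.0])]

all_obs = np.concatenate(groups)
grand_mean = all_obs.mean()

# SS_total: squared deviations of every observation from the grand mean.
ss_total = ((all_obs - grand_mean) ** 2).sum()

# SS_error: squared deviations of observations from their own group mean.
ss_error = sum(((y - y.mean()) ** 2).sum() for y in groups)

# SS_treat: n_i-weighted squared deviations of group means from the grand mean.
ss_treat = sum(len(y) * (y.mean() - grand_mean) ** 2 for y in groups)

# The partition holds: SS_total = SS_error + SS_treat.
print(np.isclose(ss_total, ss_error + ss_treat))  # True
```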

These results can be summarized in the analysis of variance table below:

The ANOVA table contains columns for Source, Degrees of Freedom (d.f.), Sum of Squares (SS), Mean Square (MS), and F. The sources include Treatment and Error, which add up to a Total:

\[\begin{array}{lllll} \text{Source} & \text{d.f.} & \text{SS} & \text{MS} & \text{F} \\ \hline \text{Treatment} & g-1 & SS_{treat} & MS_{treat} = SS_{treat}/(g-1) & F = MS_{treat}/MS_{error} \\ \text{Error} & N-g & SS_{error} & MS_{error} = SS_{error}/(N-g) & \\ \hline \text{Total} & N-1 & SS_{total} & & \end{array}\]

The degrees of freedom for treatment in the first row of the table is the number of groups or treatments minus 1, i.e., g - 1. The total degrees of freedom is the total sample size minus 1, i.e., N - 1. The error degrees of freedom is then obtained by subtracting the treatment degrees of freedom from the total degrees of freedom to obtain N - g.

The formulas for the sums of squares are given in the SS column. The mean square terms are obtained by dividing the sums of squares by their corresponding degrees of freedom.

The final column contains the F statistic, which is obtained by dividing the MS for treatment by the MS for error.

Under the null hypothesis of equal group means, Ho : μ1 = μ2 = ... = μg, this F statistic is F-distributed with g - 1 and N - g degrees of freedom:

\(F \sim F_{g-1, N-g}\)

The numerator degrees of freedom g - 1 comes from the degrees of freedom for treatments in the ANOVA table. This is referred to as the numerator degrees of freedom since the formula for the F-statistic involves the Mean Square for Treatment in the numerator. The denominator degrees of freedom N - g is equal to the degrees of freedom for error in the ANOVA table. This is referred to as the denominator degrees of freedom since the formula for the F-statistic involves the Mean Square Error in the denominator.

We reject Ho at level α if the F statistic is greater than the critical value from the F-table with g - 1 and N - g degrees of freedom, evaluated at level α:

\(F > F_{g-1, N-g, \alpha}\)