Two-Sample Pooled t-Interval

Example

The feeding habits of two species of net-casting spiders are studied. The species, the deinopis and menneus, coexist in eastern Australia. The following data were obtained on the size, in millimeters, of the prey of random samples of the two species:

data

What is the difference, if any, in the mean size of the prey (of the entire populations) of the two species?

Solution. Let's start by formulating the problem in statistical notation. We have two random variables, which we can define as:

  • \(X_i\) = the size (in millimeters) of the prey of a randomly selected deinopis spider
  • \(Y_i\) = the size (in millimeters) of the prey of a randomly selected menneus spider

In statistical notation, then, we are asked to estimate the difference in the two population means, that is:

\(\mu_X-\mu_Y\)

(By virtue of the fact that the spiders were selected randomly, we can assume the measurements are independent.)

We clearly need some help before we can finish our work on the example. Let's see what the following theorem does for us.

Theorem. If \(X_1,X_2,\ldots,X_n\sim N(\mu_X,\sigma^2)\) and \(Y_1,Y_2,\ldots,Y_m\sim N(\mu_Y,\sigma^2)\) are independent random samples, then a \((1-\alpha)100\%\) confidence interval for \(\mu_X-\mu_Y\), the difference in the population means, is:

\((\bar{X}-\bar{Y})\pm (t_{\alpha/2,n+m-2}) S_p \sqrt{\dfrac{1}{n}+\dfrac{1}{m}}\)

where \(S_p^2\), the "pooled sample variance":

\(S_p^2=\dfrac{(n-1)S^2_X+(m-1)S^2_Y}{n+m-2}\)

is an unbiased estimator of the common variance \(\sigma^2\).
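As a rough illustration of the formula in the theorem, here is a minimal Python sketch that computes the interval from summary statistics. The function name `pooled_t_interval` and its parameter names are our own choices, not part of the lesson. The critical value \(t_{\alpha/2,n+m-2}\) is assumed to be looked up separately (from a t-table or software) and passed in as `t_crit`.

```python
import math


def pooled_t_interval(xbar, ybar, var_x, var_y, n, m, t_crit):
    """Return (lower, upper) for mu_X - mu_Y using the pooled-variance formula.

    xbar, ybar   : sample means
    var_x, var_y : sample variances S_X^2 and S_Y^2
    n, m         : sample sizes
    t_crit       : t_{alpha/2, n+m-2}, supplied by the caller (assumption:
                   obtained from a t-table or statistical software)
    """
    # Pooled sample variance S_p^2
    sp2 = ((n - 1) * var_x + (m - 1) * var_y) / (n + m - 2)
    # Margin of error: t * S_p * sqrt(1/n + 1/m)
    margin = t_crit * math.sqrt(sp2) * math.sqrt(1 / n + 1 / m)
    diff = xbar - ybar
    return diff - margin, diff + margin
```

Passing the critical value in, rather than computing it, keeps the sketch free of third-party dependencies.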

Proof. We'll start with the punch line first. If it is known that:

\(T=\dfrac{(\bar{X}-\bar{Y})-(\mu_X-\mu_Y)}{S_p\sqrt{\dfrac{1}{n}+\dfrac{1}{m}}} \sim t_{n+m-2}\)

then the proof is a bit on the trivial side, because we then know that:

\(P\left[-t_{\alpha/2,n+m-2} \leq \dfrac{(\bar{X}-\bar{Y})-(\mu_X-\mu_Y)}{S_p\sqrt{\dfrac{1}{n}+\dfrac{1}{m}}} \leq t_{\alpha/2,n+m-2}\right]=1-\alpha\)

And then, it is just a matter of manipulating the inequalities inside the brackets. First, multiplying each part of the inequality by the quantity in the denominator, we get:

\(-t_{\alpha/2,n+m-2}\times S_p\sqrt{\dfrac{1}{n}+\dfrac{1}{m}} \leq (\bar{X}-\bar{Y})-(\mu_X-\mu_Y)\leq t_{\alpha/2,n+m-2}\times S_p\sqrt{\dfrac{1}{n}+\dfrac{1}{m}}\)

Then, subtracting the difference in the sample means from each part of the inequality, we get:

\(-(\bar{X}-\bar{Y})-t_{\alpha/2,n+m-2}\times S_p\sqrt{\dfrac{1}{n}+\dfrac{1}{m}} \leq -(\mu_X-\mu_Y) \leq -(\bar{X}-\bar{Y})+t_{\alpha/2,n+m-2}\times S_p\sqrt{\dfrac{1}{n}+\dfrac{1}{m}} \)

And, finally, dividing through the inequality by −1, and thereby changing the direction of the inequality signs, we get:

\((\bar{X}-\bar{Y})-t_{\alpha/2,n+m-2}\times S_p\sqrt{\dfrac{1}{n}+\dfrac{1}{m}} \leq \mu_X-\mu_Y \leq (\bar{X}-\bar{Y})+t_{\alpha/2,n+m-2}\times S_p\sqrt{\dfrac{1}{n}+\dfrac{1}{m}} \)

That is, we get the claimed \((1-\alpha)100\%\) confidence interval for the difference in the population means:

\((\bar{X}-\bar{Y})\pm (t_{\alpha/2,n+m-2}) S_p \sqrt{\dfrac{1}{n}+\dfrac{1}{m}}\)

 Now, it's just a matter of going back and proving that first distributional result, namely that:

\(T=\dfrac{(\bar{X}-\bar{Y})-(\mu_X-\mu_Y)}{S_p\sqrt{\dfrac{1}{n}+\dfrac{1}{m}}} \sim t_{n+m-2}\)

Well, by the assumed normality of the \(X_i\) and \(Y_i\) measurements, we know that the means of each of the samples are also normally distributed. That is:

\(\bar{X}\sim N \left(\mu_X,\dfrac{\sigma^2}{n}\right)\)   and    \(\bar{Y}\sim N \left(\mu_Y,\dfrac{\sigma^2}{m}\right)\)

Then, the independence of the two samples implies that the difference in the two sample means is normally distributed with the mean equaling the difference in the two population means and the variance equaling the sum of the two variances. That is:

\(\bar{X}-\bar{Y} \sim N\left(\mu_X-\mu_Y,\dfrac{\sigma^2}{n}+\dfrac{\sigma^2}{m}\right)\)

Now, we can standardize the difference in the two sample means to get:

\(Z=\dfrac{(\bar{X}-\bar{Y})-(\mu_X-\mu_Y)}{\sqrt{\dfrac{\sigma^2}{n}+\dfrac{\sigma^2}{m}}} \sim N(0,1)\)

Now, the normality of the \(X_i\) and \(Y_i\) measurements also implies that:

 \(\dfrac{(n-1)S^2_X}{\sigma^2}\sim \chi^2_{n-1}\)   and   \(\dfrac{(m-1)S^2_Y}{\sigma^2}\sim \chi^2_{m-1}\)

And, the independence of the two samples implies that when we add those two chi-square random variables, we get another chi-square random variable with the degrees of freedom (n−1 and m−1) added. That is:

\(U=\dfrac{(n-1)S^2_X}{\sigma^2}+\dfrac{(m-1)S^2_Y}{\sigma^2}\sim \chi^2_{n+m-2}\)

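Now, it's just a matter of using the definition of a t random variable. Because, for normal samples, the sample means are independent of the sample variances, Z and U are independent, and so: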

\(T=\dfrac{Z}{\sqrt{U/(n+m-2)}}\)

Substituting in the values we defined above for Z and U, we get:

\(T=\dfrac{\dfrac{(\bar{X}-\bar{Y})-(\mu_X-\mu_Y)}{\sqrt{\dfrac{\sigma^2}{n}+\dfrac{\sigma^2}{m}}}}{\sqrt{\left[\dfrac{(n-1)S^2_X}{\sigma^2}+\dfrac{(m-1)S^2_Y}{\sigma^2}\right]/(n+m-2)}}\)

Pulling out a factor of 1/σ in both the numerator and denominator, we get:

\(T=\dfrac{\dfrac{1}{\sigma} \dfrac{(\bar{X}-\bar{Y})-(\mu_X-\mu_Y)}{\sqrt{\dfrac{1}{n}+\dfrac{1}{m}}}}{\dfrac{1}{\sigma} \sqrt{\dfrac{(n-1)S^2_X+(m-1)S^2_Y}{(n+m-2)}}}\)

And, canceling out the (1/σ)s and recognizing that the denominator is the pooled standard deviation:

\(S_p=\sqrt{\dfrac{(n-1)S^2_X+(m-1)S^2_Y}{n+m-2}}\)

we get:

\(T=\dfrac{(\bar{X}-\bar{Y})-(\mu_X-\mu_Y)}{S_p\sqrt{\dfrac{1}{n}+\dfrac{1}{m}}}\)

That is, we have shown that:

\(T=\dfrac{(\bar{X}-\bar{Y})-(\mu_X-\mu_Y)}{S_p\sqrt{\dfrac{1}{n}+\dfrac{1}{m}}}\sim t_{n+m-2}\)

And we are done.... our proof is complete!
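One way to build intuition for the result is to check it by simulation. The sketch below is not part of the original lesson. It repeatedly draws two independent normal samples with a common variance, forms the pooled t-interval each time, and records how often the interval captures the true difference \(\mu_X-\mu_Y\). The population values (`mu_x`, `mu_y`, `sigma`), the number of repetitions, and the seed are arbitrary choices made for illustration. The critical value 2.101 is \(t_{0.025,18}\), which applies when n = m = 10.

```python
import math
import random


def simulate_coverage(mu_x, mu_y, sigma, n, m, t_crit, reps=20000, seed=0):
    """Estimate how often the pooled t-interval covers mu_x - mu_y."""
    rng = random.Random(seed)
    true_diff = mu_x - mu_y
    hits = 0
    for _ in range(reps):
        # Independent normal samples sharing the same standard deviation
        x = [rng.gauss(mu_x, sigma) for _ in range(n)]
        y = [rng.gauss(mu_y, sigma) for _ in range(m)]
        xbar = sum(x) / n
        ybar = sum(y) / m
        var_x = sum((v - xbar) ** 2 for v in x) / (n - 1)
        var_y = sum((v - ybar) ** 2 for v in y) / (m - 1)
        # Pooled standard deviation and margin of error
        sp = math.sqrt(((n - 1) * var_x + (m - 1) * var_y) / (n + m - 2))
        margin = t_crit * sp * math.sqrt(1 / n + 1 / m)
        # The interval covers the truth when the estimate is within the margin
        if abs((xbar - ybar) - true_diff) <= margin:
            hits += 1
    return hits / reps


# Illustrative (hypothetical) population values; t_crit = t_{0.025, 18}
coverage = simulate_coverage(mu_x=10.0, mu_y=9.0, sigma=2.0, n=10, m=10, t_crit=2.101)
print(f"Estimated coverage: {coverage:.3f}")
```

If the three assumptions hold, the estimated coverage should land close to 0.95, up to simulation noise.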

Notes

(1) Three assumptions are made in deriving the above confidence interval formula. They are:

  • The measurements (\(X_i\) and \(Y_i\)) are independent.
  • The measurements in each population are normally distributed.
  • The measurements in each population have the same variance \(\sigma^2\).

That means that we should use the interval to estimate the difference in two population means only when the three conditions hold, at least approximately, for our given data set. Otherwise, the actual confidence level of the interval may differ from the stated \((1-\alpha)100\%\).

(2) There are no restrictions on the sample sizes n and m. They don't have to be equal and they don't have to be large.

(3) The pooled sample variance \(S_p^2\) is an average of the sample variances weighted by their degrees of freedom, n − 1 and m − 1. The larger sample gets more weight. For example, suppose:

n = 11   and   m = 31

 \(s^2_x=4\)  and  \(s^2_y=8\)

Then, the unweighted average of the sample variances is 6, as shown here:

\(\dfrac{4+8}{2}=6\)

But, the pooled sample variance is 7, as the following calculation illustrates:

\(s_p^2=\dfrac{(11-1)4+(31-1)8}{11+31-2}=\dfrac{10(4)+30(8)}{40}=7\)

In this case, the larger sample size (m = 31) is associated with the variance of 8, and so the pooled sample variance gets "pulled" upwards from the unweighted average of 6 to the weighted average of 7. By the way, note that if the sample sizes are equal, that is, m = n = r, say, then the pooled sample variance \(S_p^2\) reduces to an unweighted average.
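The arithmetic in this note can be reproduced with a few lines of Python. This is only a sketch, and the helper name `pooled_variance` is our own. It evaluates the unweighted average, the pooled variance for n = 11 and m = 31, and the pooled variance for equal sample sizes (n = m = 10, chosen arbitrarily) to show that the last case collapses to the unweighted average.

```python
def pooled_variance(var_x, var_y, n, m):
    """Average of two sample variances, weighted by n - 1 and m - 1."""
    return ((n - 1) * var_x + (m - 1) * var_y) / (n + m - 2)


unweighted = (4 + 8) / 2                       # simple average of the variances
pooled = pooled_variance(4, 8, n=11, m=31)     # larger sample pulls toward 8
equal_n = pooled_variance(4, 8, n=10, m=10)    # equal sizes: weights cancel

print(unweighted, pooled, equal_n)  # prints: 6.0 7.0 6.0
```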

With all of the technical details behind us, let's now return to our example.


Example (continued)

The feeding habits of two species of net-casting spiders are studied. The species, the deinopis and menneus, coexist in eastern Australia. The following data were obtained on the size, in millimeters, of the prey of random samples of the two species:

data

What is the difference, if any, in the mean size of the prey (of the entire populations) of the two species?

Solution. First, we should make at least a superficial attempt to address whether the three conditions are met. Given that the data were obtained in a random manner, we can go ahead and believe that the condition of independence is met. Given that the sample variances are not all that different, that is, they are at least similar in magnitude:

\(s^2_{deinopis}=6.3001\) and \(s^2_{menneus}=3.61\)

we can go ahead and assume that the variances of the two populations are similar. Assessing normality is a bit trickier, as the sample sizes are quite small. Let me just say that normal probability plots don't give an alarming reason to rule out the possibility that the measurements are normally distributed. So, let's proceed!

The pooled sample variance is calculated to be 4.955:

\(s_p^2=\dfrac{(10-1)6.3001+(10-1)3.61}{10+10-2}=4.955\)

which leads to a pooled standard deviation of 2.226:

\(s_p=\sqrt{4.955}=2.226\)

(Of course, because the sample sizes are equal, m = n = 10, the pooled sample variance is just an unweighted average of the two variances 6.3001 and 3.61.)

Because m = n = 10, to calculate a 95% confidence interval for the difference in the two means, we need to use a t-table or statistical software to determine that:

\(t_{0.025,10+10-2}=t_{0.025,18}=2.101\)

The sample means are calculated to be:

\(\bar{x}_{deinopis}=10.26\)  and  \(\bar{y}_{menneus}=9.02\)

We have everything we need now to calculate a 95% confidence interval for the difference in the population means. It is:

\((10.26-9.02)\pm 2.101(2.226)\sqrt{\dfrac{1}{10}+\dfrac{1}{10}}\)

which simplifies to:

 \(1.24 \pm 2.092\)   or    \((-0.852,3.332)\)

That is, we can be 95% confident that the actual mean difference in the size of the prey is between −0.85 mm and 3.33 mm. Because the interval contains the value 0, we cannot conclude that the population means differ. 
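The calculation can be retraced step by step in Python as a check on the arithmetic. This sketch uses only the summary statistics quoted above. The critical value 2.101 is taken from the text rather than computed, and the variable names are our own.

```python
import math

# Summary statistics quoted in the example
n = m = 10
var_x, var_y = 6.3001, 3.61      # sample variances (deinopis, menneus)
xbar, ybar = 10.26, 9.02         # sample means (deinopis, menneus)
t_crit = 2.101                   # t_{0.025, 18}, as given in the text

# Pooled variance and pooled standard deviation
sp2 = ((n - 1) * var_x + (m - 1) * var_y) / (n + m - 2)
sp = math.sqrt(sp2)

# Margin of error and interval endpoints
margin = t_crit * sp * math.sqrt(1 / n + 1 / m)
diff = xbar - ybar
lower, upper = diff - margin, diff + margin

print(f"s_p^2 = {sp2:.3f}, s_p = {sp:.3f}, margin = {margin:.3f}")
print(f"95% CI: ({lower:.3f}, {upper:.3f})")
```

The printed values should agree with the hand calculation to three decimal places.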

Using Minitab

The commands necessary for asking Minitab to calculate a two-sample pooled t-interval for \(\mu_X-\mu_Y\) depend on whether the data are entered in two columns, or the data are entered in one column with a grouping variable in a second column. We'll illustrate using the spider and prey example.

When the Data are Entered in Two Columns

(1) Enter the data in two columns, such as:

minitab

(2) Under the Stat menu, select Basic Statistics, and then select 2-Sample t...:

minitab

(3) In the pop-up window that appears, select Samples in different columns. Specify the name of the First variable, and specify the name of the Second variable. Click on the box labeled Assume equal variances. (If you want a confidence level that differs from Minitab's default level of 95.0, under Options..., type in the desired confidence level. Select OK on the Options window.) Select OK on the 2-Sample t... window:

minitab

The confidence interval output will appear in the session window. Here's what the output looks like for the spider and prey example with the confidence interval circled in red:

minitab

When the Data are Entered in One Column, and a Grouping Variable in a Second Column

(1) Enter the data in one column (called Prey, say), and the grouping variable in a second column (called Group, say, with 1 denoting a deinopis spider and 2 denoting a menneus spider), such as:

data

(2) Under the Stat menu, select Basic Statistics, and then select 2-Sample t...:

minitab

(3) In the pop-up window that appears, select Samples in one column. Specify the name of the Samples variable (Prey, for us) and specify the name of the Subscripts (grouping) variable (Group, for us). Click on the box labeled Assume equal variances. (If you want a confidence level that differs from Minitab's default level of 95.0, under Options..., type in the desired confidence level. Select OK on the Options window.) Select OK on the 2-Sample t... window:

minitab

The confidence interval output will appear in the session window. Here's what the output looks like for the example above with the confidence interval circled in red:

minitab