5.1 - Sampling Distribution of the Sample Mean


 

Unit Summary

  • Sampling Distribution of the Sample Mean
  • Sampling Distribution of the Mean When the Population is Normal
  • Central Limit Theorem
  • Application of Sample Mean Distribution
  • Demonstrations of Central Limit Theorem

Reading Assignment
An Introduction to Statistical Methods and Data Analysis (See Course Schedule).

 

General Objective: In inferential statistics, we use a characteristic of the sample (a statistic) to estimate a characteristic of the population (a parameter). What happens when we take a sample of size n from some population? If the population is continuous, how is the sample mean distributed? If the population is categorical, how is the sample proportion distributed? We use the sample mean (the statistic) to estimate the population mean (the parameter), and the sample proportion (the statistic) to estimate the population proportion (the parameter). To do so, we need to know the properties of the sample mean and the sample proportion, which is why we study the sampling distributions of these statistics. We will begin with the sampling distribution of the sample mean. Because a sample statistic is a single value that estimates a population parameter, we refer to the statistic as a point estimate.

Before we begin, we briefly explain the notation and some new terms that we will use in this lesson and in future lessons.

Notation:

  • Sample mean: book uses y-bar or \(\bar{y}\); most other sources use x-bar or \(\bar{x}\)
  • Population mean: standard notation is the Greek letter \(\mu\)
  • Sample proportion: book uses π-hat (\(\hat{\pi}\)); other sources use p-hat (\(\hat{p}\))
  • Population proportion: book uses \(\pi\); other sources use p

[NOTE: Remember that the use of \(\pi\) is NOT to be interpreted as the numeric representation of 3.14 but instead is simply a symbol.]

Terms

  • Standard error – the standard deviation of a sample statistic (for example, of the sample mean)
  • Standard deviation – describes the spread of the individual observations, e.g. within a sample
  • Parameters, e.g. mean and SD, are summary measures of population, e.g. \(\mu\) and \(\sigma\).  These are fixed. 
  • Statistics, e.g. sample mean and sample SD, are summary measures of a sample, e.g. \(\bar{x}\) and s.  These vary: each time we take a sample we get different observations, so the statistics change from sample to sample.  This is the motivation behind this lesson. Because of this sampling variation, the sample statistics themselves have a distribution that can be described by some measure of central tendency and spread.

Sampling Distribution of the Sample Mean

A large tank of fish from a hatchery is being delivered to the lake. We want to know the average length of the fish in the tank. Instead of measuring all the fish, we randomly sample some of them and use the sample mean to estimate the population mean.

Note: The sample mean \(\bar{y}\) is random since its value depends on the sample chosen. It is called a statistic. The population mean is fixed, usually denoted as \(\mu\).

The sampling distribution of the (sample) mean is also called the distribution of the variable \(\bar{y}\).

In general, the sampling distribution of the sample mean is complicated to work out, except when the sample size is very small or large. In the following example, we illustrate the sampling distribution for a very small population. The sampling method is to sample without replacement.

Example:  Pumpkin Weights

The population is the weight of six pumpkins (in pounds) displayed in a carnival "guess the weight" game booth. You are asked to guess the average weight of the six pumpkins by taking a random sample without replacement from the population.

| Pumpkin | A | B | C | D | E | F |
| --- | --- | --- | --- | --- | --- | --- |
| Weight (in pounds) | 19 | 14 | 15 | 9 | 10 | 17 |

a. Calculate the population mean \(\mu\).

\(\mu\) = (19 + 14 + 15 + 9 + 10 + 17 ) / 6 = 14 pounds

b. Obtain the sampling distribution of the sample mean for a sample size of 2 when one samples without replacement.

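| Sample | Weight | \(\bar{y}\) | Probability |
| --- | --- | --- | --- |
| A, B | 19, 14 | 16.5 | 1/15 |
| A, C | 19, 15 | 17.0 | 1/15 |
| A, D | 19, 9 | 14.0 | 1/15 |
| A, E | 19, 10 | 14.5 | 1/15 |
| A, F | 19, 17 | 18.0 | 1/15 |
| B, C | 14, 15 | 14.5 | 1/15 |
| B, D | 14, 9 | 11.5 | 1/15 |
| B, E | 14, 10 | 12.0 | 1/15 |
| B, F | 14, 17 | 15.5 | 1/15 |
| C, D | 15, 9 | 12.0 | 1/15 |
| C, E | 15, 10 | 12.5 | 1/15 |
| C, F | 15, 17 | 16.0 | 1/15 |
| D, E | 9, 10 | 9.5 | 1/15 |
| D, F | 9, 17 | 13.0 | 1/15 |
| E, F | 10, 17 | 13.5 | 1/15 |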

Distribution of  \(\bar{y}\):

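| \(\bar{y}\) | 9.5 | 11.5 | 12.0 | 12.5 | 13.0 | 13.5 | 14.0 | 14.5 | 15.5 | 16.0 | 16.5 | 17.0 | 18.0 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Probability | 1/15 | 1/15 | 2/15 | 1/15 | 1/15 | 1/15 | 1/15 | 2/15 | 1/15 | 1/15 | 1/15 | 1/15 | 1/15 |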
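The enumeration above can also be reproduced programmatically. The following is a minimal Python sketch, not part of the original lesson; the function name `sampling_distribution` is an illustrative choice. It lists every equally likely sample of size 2 drawn without replacement and tallies the resulting sample means as exact fractions.

```python
from collections import Counter
from fractions import Fraction
from itertools import combinations

# Pumpkin weights in pounds, taken from the table in the example.
weights = {"A": 19, "B": 14, "C": 15, "D": 9, "E": 10, "F": 17}

def sampling_distribution(population, n):
    """Return {sample mean: probability} for samples of size n drawn without replacement."""
    # Every unordered sample of size n is equally likely.
    samples = list(combinations(population.values(), n))
    counts = Counter(Fraction(sum(s), n) for s in samples)
    return {mean: Fraction(c, len(samples)) for mean, c in sorted(counts.items())}

dist_2 = sampling_distribution(weights, 2)
for mean, prob in dist_2.items():
    print(f"{float(mean):5.1f}  {prob}")

# Expected value of the sample mean: sum of value * probability.
mean_of_means = sum(mean * prob for mean, prob in dist_2.items())
print("mean of sample mean:", mean_of_means)
```

Using `Fraction` keeps the probabilities exact (1/15, 2/15), so the result can be compared directly with the table.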

( i ) One can thus see that the chance that the sample mean is exactly the population mean is only 1 in 15, which is very small. (In some other examples, it may happen that the sample mean can never equal the population mean.) When we use the sample mean to estimate the population mean, some error will be involved, since the sample mean is random.

( ii ) The mean of the sample mean when the sample size is 2:

Mean of sample mean
   = (16.5 + 17.0 + 14.0 + 14.5 + 18.0 + 14.5 + 11.5 + 12.0 + 15.5 + 12.0 + 12.5 + 16.0 + 9.5 + 13.0 + 13.5) / 15
   = 14 pounds

Thus, even though each sample may give you an answer involving some error, the expected value is right at the target: exactly the population mean. In other words, if one does the experiment over and over again, the overall average of the sample mean is exactly the population mean.

Part Two

Now let's obtain the sampling distribution for the sample mean when the sample size is 5.

| Sample | Weight | \(\bar{y}\) | Probability |
| --- | --- | --- | --- |
| A, B, C, D, E | 19, 14, 15, 9, 10 | 13.4 | 1/6 |
| A, B, C, D, F | 19, 14, 15, 9, 17 | 14.8 | 1/6 |
| A, B, C, E, F | 19, 14, 15, 10, 17 | 15.0 | 1/6 |
| A, B, D, E, F | 19, 14, 9, 10, 17 | 13.8 | 1/6 |
| A, C, D, E, F | 19, 15, 9, 10, 17 | 14.0 | 1/6 |
| B, C, D, E, F | 14, 15, 9, 10, 17 | 13.0 | 1/6 |

Distribution of  \(\bar{y}\):

| \(\bar{y}\) | 13.0 | 13.4 | 13.8 | 14.0 | 14.8 | 15.0 |
| --- | --- | --- | --- | --- | --- | --- |
| Probability | 1/6 | 1/6 | 1/6 | 1/6 | 1/6 | 1/6 |

( i ) Again, we see that using the sample mean to estimate the population mean involves sampling error. However, the error with a sample of size 5 is, on average, smaller than with a sample of size 2.

( ii ) The mean of the sample mean when the sample size is 5:

Mean of sample mean
   = (13.4 + 14.8 + 15.0 + 13.8 + 14.0 + 13.0) / 6
   = 14 pounds

The following dot plots show the distribution of the sample means corresponding to sample sizes of 2 and of 5.

Sampling error is the error resulting from using a sample characteristic to estimate a population characteristic.

Sample size and sampling error: As the dot plots above show, the possible sample means cluster more closely around the population mean as the sample size increases. Thus, the possible sampling error decreases as the sample size increases.
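To put numbers on this clustering for the pumpkin example, here is a small Python sketch that compares the spread of the possible sample means for sample sizes 2 and 5. The helper names and the choice of mean absolute error as the summary are illustrative assumptions, not something defined in the lesson.

```python
from itertools import combinations
from statistics import mean

weights = [19, 14, 15, 9, 10, 17]  # pumpkin weights in pounds
mu = mean(weights)                  # population mean (14 pounds)

def sample_means(population, n):
    """All equally likely sample means for samples of size n without replacement."""
    return [sum(s) / n for s in combinations(population, n)]

def mean_absolute_error(population, n):
    """Average distance between the sample mean and the population mean."""
    return mean(abs(m - mu) for m in sample_means(population, n))

for n in (2, 5):
    means = sample_means(weights, n)
    print(f"n={n}: sample means range from {min(means)} to {max(means)}, "
          f"mean absolute error {mean_absolute_error(weights, n):.3f}")
```

With these data the average absolute error is about 1.87 pounds for samples of size 2 and about 0.60 pounds for samples of size 5.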

The mean of the sample mean is the population mean. That is: \(\mu_{\bar{y}}=\mu\)

When sampling with replacement, the standard deviation of the sample mean, called the standard error, equals the population standard deviation divided by the square root of the sample size. That is: \(\sigma_{\bar{y}}=\frac{\sigma}{\sqrt{n}}\).

 

Remark: When the sampling is done without replacement (as in the pumpkin example), there is a finite population correction factor in the formula. Let M denote the population size and n the sample size:

\[\sigma_{\bar{y}}=\sqrt{\frac{M-n}{M-1}} \frac{\sigma}{\sqrt{n}}\]
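As a check on this formula, the sketch below (an illustration, with function names of our own choosing) computes the standard error for the pumpkin data in two ways: from the formula with the correction factor, and directly as the standard deviation of all possible sample means.

```python
from itertools import combinations
from math import sqrt
from statistics import pstdev

weights = [19, 14, 15, 9, 10, 17]  # pumpkin weights in pounds
M = len(weights)                    # population size
sigma = pstdev(weights)             # population standard deviation

def standard_error_formula(n):
    """Standard error from the formula with the finite population correction."""
    return sqrt((M - n) / (M - 1)) * sigma / sqrt(n)

def standard_error_enumerated(n):
    """Standard deviation of all equally likely sample means (without replacement)."""
    means = [sum(s) / n for s in combinations(weights, n)]
    return pstdev(means)

for n in (2, 5):
    print(f"n={n}: formula {standard_error_formula(n):.4f}, "
          f"enumeration {standard_error_enumerated(n):.4f}")
```

The two columns agree: roughly 2.25 pounds for n = 2 and 0.71 pounds for n = 5.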

If the population size is large compared to the sample size (population size is more than 20 times the sample size), then the finite population correction factor can be ignored and we can use the simpler formula for sampling with replacement.
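A short worked bound, added here as an illustration, shows why the factor matters little in that case. The correction factor grows toward 1 as M increases, so for any \(M \ge 20n\):

\[\sqrt{\frac{M-n}{M-1}} \ge \sqrt{\frac{20n-n}{20n-1}} = \sqrt{\frac{19n}{20n-1}} > \sqrt{\frac{19}{20}} \approx 0.975\]

The factor therefore lies between about 0.975 and 1, and dropping it overstates the standard error by less than 3%.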

 

Sampling Distribution of the Mean When the Population is Normal

Key Fact: If the population is normally distributed with mean \(\mu\) and standard deviation \(\sigma\), then the sampling distribution of the sample mean is also normally distributed no matter what the sample size is. When the sampling is done with replacement or when the population size is large compared to the sample size, it follows from the above two formulas that \(\bar{y}\) has mean \(\mu\) and standard error \(\sigma / \sqrt{n}\).

SPECIAL NOTE: In the rest of this course, we only deal with the case when the sampling is done with replacement or when the population size is much larger than the sample size.

 

Central Limit Theorem

For a large sample size (rule of thumb: n ≥ 30), \(\bar{y}\) is approximately normally distributed, regardless of the distribution of the population one samples from. If the population has mean \(\mu\) and standard deviation \(\sigma\), then \(\bar{y}\) has mean \(\mu\) and standard error \(\sigma / \sqrt{n}\).

 

Application of Sample Mean Distribution

When we know the sample mean is normal or approximately normal, and we know the population mean, \(\mu\), and population standard deviation, \(\sigma\), then we can calculate a z-score for the sample mean and determine probabilities for it where:

\[Z=\frac{\bar{y}-\mu}{\sigma/\sqrt{n}}\]

Example:  Speedboat Engines

The engines made by Ford for speedboats had an average power of 220 horsepower (HP) and standard deviation of 15 HP.

1. A potential buyer intends to take a sample of four engines and will not place an order if the sample mean is less than 215 HP. What is the probability that the buyer will not place an order?

We want to find P(\(\bar{y}\) < 215) = ?

Answer: We need to know whether the population is normally distributed, because the sample size n = 4 is too small for the central limit theorem to apply (the rule of thumb requires n ≥ 30). If the population is known to be normal, then we can proceed, since the sampling distribution of the mean of a normal population is also normal for all sample sizes.

If the population follows a normal distribution, we can conclude that \(\bar{y}\) has a normal distribution with mean 220 HP and a standard error of \(\sigma/\sqrt{n}=15/\sqrt{4}=7.5\) HP.

P(\(\bar{y}\) < 215)
   = P(Z < (215 - 220) / 7.5)
   = P(Z < -0.67)
   = 0.2514

If the customer just samples four engines, the probability that the customer will not place an order is 25.14%.

2. If the customer samples 100 engines, what is the probability that the sample mean will be less than 215?

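Try to figure out your answer first, then compare it with the answer below.

Answer: With n = 100, the central limit theorem applies, so \(\bar{y}\) is approximately normal with mean 220 HP and a standard error of \(\sigma/\sqrt{n}=15/\sqrt{100}=1.5\) HP.

P(\(\bar{y}\) < 215)
   = P(Z < (215 - 220) / 1.5)
   = P(Z < -3.33)
   = 0.0004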
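Both parts of this example can be checked numerically. The sketch below uses Python's standard-library `statistics.NormalDist` in place of a printed z-table; the function name `prob_mean_below` is an illustrative choice, and the calculation assumes that \(\bar{y}\) is normal or approximately normal.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma = 220, 15   # population mean and SD in HP, from the example
cutoff = 215          # the buyer's threshold for the sample mean

def prob_mean_below(cutoff, mu, sigma, n):
    """Return (z, P(sample mean < cutoff)), assuming the sample mean is normal."""
    se = sigma / sqrt(n)          # standard error of the sample mean
    z = (cutoff - mu) / se        # z-score of the cutoff
    return z, NormalDist().cdf(z)

for n in (4, 100):
    z, p = prob_mean_below(cutoff, mu, sigma, n)
    print(f"n={n}: z={z:.2f}, P={p:.4f}")
```

For n = 4 the unrounded z-score gives about 0.2525, slightly different from the 0.2514 obtained with z rounded to -0.67 for table lookup. For n = 100 the probability is well below 0.001.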

Web Demonstration of Central Limit Theorem

Before we begin the demonstration, let's talk about what we should be looking for...

Note: N is the sample size in the demonstration. We can check that:

  1. If the population is skewed, then the distribution of the sample mean won't look normal when N is small. (When doing a simulation, one replicates the process many times; using 10,000 replications is a good idea.)
  2. If the population is normal, then the distribution of the sample mean looks normal even if N = 2.
  3. If the population is skewed, then the distribution of the sample mean looks more and more normal as N gets larger.
  4. Note that in all cases, the mean of the sample mean is close to the population mean and the standard deviation of the sample mean is close to \(\sigma / \sqrt{N}\).
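The points in this list can also be explored offline with a short simulation. The sketch below is independent of the website demonstration and makes its own assumptions: the skewed population is an exponential distribution with mean 1 and standard deviation 1, and a simple moment-based skewness (near 0 for a symmetric, normal-looking shape) stands in for looking at a histogram.

```python
import random
from math import sqrt
from statistics import fmean, pstdev

random.seed(1)  # fixed seed so the run is repeatable

# Assumed skewed population: exponential with rate 1, so mu = 1 and sigma = 1.
MU, SIGMA = 1.0, 1.0

def simulate_sample_means(n, replications=10_000):
    """Draw `replications` samples of size n and return their sample means."""
    return [sum(random.expovariate(1.0) for _ in range(n)) / n
            for _ in range(replications)]

def skewness(values):
    """Moment-based skewness; close to 0 for a symmetric (e.g. normal) shape."""
    m, s = fmean(values), pstdev(values)
    return fmean(((v - m) / s) ** 3 for v in values)

for n in (2, 30):
    means = simulate_sample_means(n)
    print(f"N={n}: mean={fmean(means):.3f}, SD={pstdev(means):.3f}, "
          f"sigma/sqrt(N)={SIGMA / sqrt(n):.3f}, skewness={skewness(means):.2f}")
```

In a run like this, the mean of the sample means should stay near 1 and their standard deviation near \(\sigma / \sqrt{N}\) for both sample sizes, while the skewness should be noticeably smaller at N = 30 than at N = 2.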

To work through this demonstration of the central limit theorem yourself, click on the link to the website, https://onlinestatbook.com/stat_sim/sampling_dist/ , and then click Begin.

Another Demonstration of the Central Limit Theorem

Here is a video that illustrates the Central Limit Theorem using a dataset where the data is heavily skewed.  Let's see what happens.