Lesson 4: How to Get a Good Sample

Assignments

Learning Objectives

Chapter 4

After successfully completing this lesson, you should be able to:

Terms to Know

Chapter 4

Commentary

Section 4.1. Chapter 4 in Textbook

Overview

In this lesson, we will add to our knowledge base by explaining ways to obtain appropriate samples for statistical studies.

4.1 Common Research Strategies

Chapter 4 Section 4.1

The following research strategies are described in this section of the textbook.

  1. Sample Surveys
  2. Experiments
  3. Observational Studies
  4. Meta-Analyses (also covered in Chapter 25--not required for the course)
  5. Case Studies

Terms Used with Sample Surveys (Chapter 4 Section 4.2 in Textbook)

It is first necessary to distinguish between a census and a sample survey.  A census is a collection of data from every member of the population, while a sample survey is a collection of data from a subset of the population.  A sample survey is a type of observational study. Obviously, it is much easier to conduct a sample survey than a census.  The remaining sections of this lesson (Chapter 4) will discuss issues about sample surveys.

Of the many terms that are used with sample surveys, the following four need the most clarification because of how they are connected to each other.

The graph illustrates the relationship between population, sampling frame and sample. The population characteristics can be estimated by observed sample characteristics.

Figure 4.1 Relationship between Population, Sampling Frame and Sample

Example 4.1. Who are those angry women?

(Streitfield, D., 1988 and Wallis, 1987)

Recalling some of the information from Example 2.1 in Lesson 2, in 1987, Shere Hite published a best-selling book called Women and Love: A Cultural Revolution in Progress. This 7-year research project produced a controversial 922-page publication that summarized the results from a survey that was designed to examine how American women felt about their relationships with men. Hite mailed out 100,000 fifteen-page questionnaires to women who were members of a wide variety of organizations across the U.S.   Questionnaires were actually sent to the leader of each organization. The leader was asked to distribute questionnaires to all members. Each questionnaire contained 127 open-ended questions with many parts and follow-ups. Part of Hite's directions read as follows: "Feel free to skip around and answer only those questions you choose." Approximately 4500 questionnaires were returned.

In Lesson 2, we determined that the

It is also easy to identify that the sampling unit was an American woman.  So, the key question is "What is the sampling frame?" Most people think the sampling frame was the 100,000 women who received the questionnaires.  However, this answer is not correct because the sampling frame was the list from which the 100,000 women who were sent the survey were obtained.  In this instance, the sampling frame included all American women who had some affiliation with an organization.  There is no statistical term to attach to the 100,000 women who received the questionnaire.  However, if the response rate had been 100%, the sample would have been the 100,000 women who responded to the survey.

You should also remember that ideally the sampling frame should include the entire population. If this is not possible, the sampling frame should appropriately represent the desired population. In this case, the sampling frame of all American women who were "affiliated with some organization" did not appropriately represent the population of all American women.  In Lesson 2, we called this problem selection bias.

Chapter 4 of your text also lists three difficulties that are possible when samples are obtained for surveys. These three difficulties, which happen to be possible with this example, include:

  1. Using the wrong sampling frame. We just discussed this problem in the preceding paragraph. This problem is also called selection bias.
  2. Not reaching the individuals selected.   Because the questionnaire was sent to leaders of organizations, there is no guarantee that these questionnaires actually reached the women who were supposed to be in the sample.
  3. Getting "no response" or a "volunteer response."   In Lesson 2, we learned that this survey has a problem with nonresponse bias because of the low response rate. This problem can also be called "no response" or "volunteer response."

4.2 The Beauty of Sampling

Sample surveys are generally used to estimate the percentage of people in the population that have a certain characteristic or opinion.  If you follow the news, you will probably recall that most of these polls are based on samples of size 1000 to 1500 people.  So, why is a sample size of around 1000 people commonly used in surveying?  The answer is based on understanding what is called the margin of error.

The margin of error:

For a sample size of n = 1000, the margin of error is \(\frac{1}{\sqrt{n}}=\frac{1}{\sqrt{1000}}\approx 0.032\), or about 3%.
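The calculation above can be checked with a few lines of Python. This is only a minimal sketch of the 1/√n formula given in this lesson; the function name `margin_of_error` is chosen here for illustration and does not come from the textbook.

```python
import math


def margin_of_error(n: int) -> float:
    """Margin of error for a sample percentage, using the 1 / sqrt(n) formula."""
    if n <= 0:
        raise ValueError("sample size must be positive")
    return 1 / math.sqrt(n)


moe = margin_of_error(1000)
# Show the unrounded value and the rounded percentage quoted in the lesson.
print(f"n = 1000: margin of error = {moe:.4f} (about {moe:.0%})")
```

The unrounded value is a little above 0.03, which is why the lesson reports it as "about 3%."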

Even though you will not be asked to calculate a margin of error in this course, you should remember the margin of error formula and that it depends only on the size of the sample. The size of the population is not used in the calculation of the margin of error.  So, a percentage estimated from a sample of a given size will have the same margin of error (accuracy), regardless of whether the population size is 5,000 or 5 billion.   It also helps that pollsters believe that an accuracy of ± 3% is reasonable with surveys.
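One way to see why samples of roughly 1000 are so common is to turn the formula around and ask how large a sample is needed for a target margin of error. The sketch below assumes the same 1/√n formula; solving it for n gives n = 1/(margin of error)². That rearrangement and the function name `required_sample_size` are illustrative additions, not something stated in the textbook.

```python
import math


def required_sample_size(target_moe: float) -> int:
    """Smallest n whose 1 / sqrt(n) margin of error is at or below target_moe."""
    if target_moe <= 0:
        raise ValueError("target margin of error must be positive")
    # Rearranging moe = 1 / sqrt(n) gives n = 1 / moe**2; round up to a whole person.
    return math.ceil(1 / target_moe ** 2)


n_needed = required_sample_size(0.03)
print(f"Sample size needed for a 3% margin of error: {n_needed}")
```

Note that the population size never appears in the function, which mirrors the point made above: only the sample size matters for this formula.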

So what does the margin of error represent?  The following statement represents the generic interpretation of a margin of error.

Generic Interpretation:  If one obtains many samples of the same size from a defined population, the difference between the sample percent and the true population percent will be within the margin of error, at least 95% of the time.
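The generic interpretation can be illustrated with a small simulation. The sketch below is a rough check rather than a proof: it assumes a hypothetical population in which the true percent is 55%, draws 2000 samples of size 1000, and counts how often the sample percent lands within 1/√n of the true percent. The true percent, the number of samples, and the seed are all arbitrary choices made for this illustration.

```python
import math
import random


def simulate_coverage(true_pct: float, n: int, num_samples: int, seed: int = 0) -> float:
    """Share of simulated samples whose percent is within 1/sqrt(n) of true_pct."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    moe = 1 / math.sqrt(n)
    within = 0
    for _ in range(num_samples):
        # Each respondent says "yes" with probability true_pct.
        yes_count = sum(rng.random() < true_pct for _ in range(n))
        sample_pct = yes_count / n
        if abs(sample_pct - true_pct) <= moe:
            within += 1
    return within / num_samples


coverage = simulate_coverage(true_pct=0.55, n=1000, num_samples=2000)
print(f"Share of samples within the margin of error: {coverage:.1%}")
```

If the interpretation holds, the printed share should come out at roughly 95% or a little higher.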

Key Features of the Interpretation of the Margin of Error

Example 4.2. Margin of Error

Suppose a recent poll based on 1000 Americans finds that 55% approve of the president's current educational plan.  Since the sample size is 1000, the margin of error is about 3%.  These poll results suggest that 55% ± 3% of all Americans approve of the president's current educational plan. What is the correct interpretation of this margin of error?

Margin of Error Interpretation

The difference between our sample percent and the true population percent will be within 3%, at least 95% of the time.  This means that we are almost certain that 55% ± 3% or (52% to 58%) of all Americans approve of the president's current educational plan.   Because every value in this range falls above 50%, we can also say that we are pretty sure that a majority of Americans support the president's current educational plan.  If any value in the range had been 50% or less, then we would not have been able to say that the majority supported the plan.  The range of values (52% to 58%) is called a 95% confidence interval.   We will go into further detail about confidence intervals in Lesson 7.
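The interval in Example 4.2 can be reproduced as follows. This is a minimal sketch that uses the unrounded 1/√n margin of error, so the endpoints differ slightly from the rounded 52% to 58% quoted above; the function name `confidence_interval` is an illustrative choice.

```python
import math


def confidence_interval(sample_pct: float, n: int) -> tuple[float, float]:
    """Approximate 95% interval: sample percent plus or minus 1 / sqrt(n)."""
    moe = 1 / math.sqrt(n)
    return sample_pct - moe, sample_pct + moe


low, high = confidence_interval(sample_pct=0.55, n=1000)
print(f"Approximate 95% confidence interval: {low:.1%} to {high:.1%}")

# A majority claim is supported only if the whole interval sits above 50%.
majority_supported = low > 0.50
print(f"Entire interval above 50%: {majority_supported}")
```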

4.3 Relationship between Sample Size and Margin of Error

There is a predictable relationship between sample size and margin of error. The numbers found in Table 4.1 help to explain this relationship.

Table 4.1. Calculated Margins of Error for Selected Sample Sizes

Sample Size (n) | Margin of Error (M.E.)
200 | 7.1%
400 | 5.0%
700 | 3.8%
1000 | 3.2%
1200 | 2.9%
1500 | 2.6%
2000 | 2.2%
3000 | 1.8%
4000 | 1.6%
5000 | 1.4%

From this table, one can clearly see that as sample size increases, the margin of error decreases. In order to add additional clarity to this finding, the information from Table 4.1 is also displayed in Figure 4.2.

The graph shows the relationship between sample size and margin of error. Margin of error decreases as the sample size increases.

Figure 4.2 Relationship Between Sample Size and Margin of Error

In Figure 4.2, you again find that as the sample size increases, the margin of error decreases.  However, you should also notice that the amount by which the margin of error decreases is substantial between sample sizes of 200 and 1500.  This implies that the accuracy of the estimate is strongly affected by the size of the sample.  In contrast, the margin of error does not substantially decrease at sample sizes above 1500.  Therefore, pollsters have concluded that it is not worth it to spend additional time and money for samples that contain more than 1500 people.
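The values in Table 4.1 and the diminishing returns described above can be recomputed directly from the 1/√n formula. The sketch below is illustrative only; the variable names are chosen here and are not part of the course materials.

```python
import math

sample_sizes = [200, 400, 700, 1000, 1200, 1500, 2000, 3000, 4000, 5000]

# Margin of error in percentage points for each sample size in Table 4.1.
margins = {n: 100 / math.sqrt(n) for n in sample_sizes}

for n in sample_sizes:
    print(f"{n:>5}  {margins[n]:.1f}%")

# Compare the improvement from 200 to 1500 with the improvement from 1500 to 5000.
drop_200_to_1500 = margins[200] - margins[1500]
drop_1500_to_5000 = margins[1500] - margins[5000]
print(f"Drop from n=200 to n=1500:  {drop_200_to_1500:.1f} points")
print(f"Drop from n=1500 to n=5000: {drop_1500_to_5000:.1f} points")
```

Adding 1300 people at the low end buys far more accuracy than adding 3500 people at the high end, which is the pattern visible in Figure 4.2.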

4.4 Simple Random Sampling and Other Sampling Methods

Sampling Methods can be classified into one of two categories:

Probability Sampling

In probability sampling it is possible to determine both which sampling units belong to which sample and the probability that each sample will be selected. The following sampling methods, which are listed in Chapter 4, are types of probability sampling:

  1. Simple Random Sampling (SRS)
  2. Stratified Sampling
  3. Cluster Sampling
  4. Multistage Sampling
  5. Random-Digit Dialing
  6. Systematic Sampling

Of the six methods listed above, students have the most trouble distinguishing between stratified sampling and cluster sampling.

Stratified Sampling is possible when it makes sense to partition the population into groups based on a factor that may influence the variable that is being measured.   These groups are then called strata.  An individual group is called a stratum.  With stratified sampling one should:

Stratified sampling works best when a heterogeneous population is split into fairly homogeneous groups.  Under these conditions, stratification generally produces more precise estimates of the population percents than estimates that would be found from a simple random sample. Table 4.2 shows some examples of ways to obtain a stratified sample.

Table 4.2. Examples of Stratified Samples

 | Example 1 | Example 2 | Example 3
Population | All people in U.S. | All PSU intercollegiate athletes | All elementary students in the local school district
Groups (Strata) | 4 Time Zones in the U.S. (Eastern, Central, Mountain, Pacific) | 26 PSU intercollegiate teams | 11 different elementary schools in the local school district
Obtain a Simple Random Sample | 500 people from each of the 4 time zones | 5 athletes from each of the 26 PSU teams | 20 students from each of the 11 elementary schools
Sample | 4 × 500 = 2000 selected people | 26 × 5 = 130 selected athletes | 11 × 20 = 220 selected students
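Example 2 in Table 4.2 can be sketched in code. The rosters below are entirely made up for illustration (26 hypothetical teams with 20 invented athlete IDs each), and the function name `stratified_sample` is an assumption of this sketch. The point is only to show the mechanics: take a simple random sample from every stratum.

```python
import random

rng = random.Random(42)  # fixed seed so the illustration is repeatable

# Hypothetical rosters: 26 teams (strata), each with 20 made-up athlete IDs.
rosters = {
    f"team_{t:02d}": [f"team_{t:02d}_athlete_{a:02d}" for a in range(1, 21)]
    for t in range(1, 27)
}


def stratified_sample(strata: dict, per_stratum: int, rng: random.Random) -> list:
    """Draw a simple random sample of per_stratum members from every stratum."""
    sample = []
    for members in strata.values():
        sample.extend(rng.sample(members, per_stratum))
    return sample


sample = stratified_sample(rosters, per_stratum=5, rng=rng)
print(f"Selected athletes: {len(sample)}")  # 26 teams x 5 athletes each
```

Every team contributes exactly 5 athletes, so no stratum can be left out of the sample by chance.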

Cluster Sampling is very different from Stratified Sampling. With cluster sampling one should

It is important to note that, unlike with the strata in stratified sampling, the clusters should be microcosms, rather than subsections, of the population.   Each cluster should be heterogeneous. Additionally, the statistical analysis used with cluster sampling is not only different, but also more complicated than that used with stratified sampling.

Table 4.3. Examples of Cluster Samples

 | Example 1 | Example 2 | Example 3
Population | All people in U.S. | All PSU intercollegiate athletes | All elementary students in a local school district
Groups (Clusters) | 4 Time Zones in the U.S. (Eastern, Central, Mountain, Pacific) | 26 PSU intercollegiate teams | 11 different elementary schools in the local school district
Obtain a Simple Random Sample | 2 time zones from the 4 possible time zones | 8 teams from the 26 possible teams | 4 elementary schools from the 11 possible elementary schools
Sample | every person in the 2 selected time zones | every athlete on the 8 selected teams | every student in the 4 selected elementary schools
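Example 2 in Table 4.3 can be sketched the same way. As before, the rosters are hypothetical (26 invented teams with made-up roster sizes and athlete IDs) and the function name `cluster_sample` is an assumption of this sketch. Here the random selection is applied to the groups themselves, and then every member of each selected group is included.

```python
import random

rng = random.Random(7)  # fixed seed so the illustration is repeatable

# Hypothetical rosters: 26 teams (clusters) with made-up sizes of 15 to 35 athletes.
rosters = {
    f"team_{t:02d}": [
        f"team_{t:02d}_athlete_{a:02d}" for a in range(1, 16 + (t % 5) * 5)
    ]
    for t in range(1, 27)
}


def cluster_sample(clusters: dict, num_clusters: int, rng: random.Random):
    """Randomly pick num_clusters whole clusters and keep every member of each."""
    chosen = rng.sample(sorted(clusters), num_clusters)
    sample = [member for name in chosen for member in clusters[name]]
    return chosen, sample


chosen_teams, sample = cluster_sample(rosters, num_clusters=8, rng=rng)
print(f"Teams selected: {len(chosen_teams)}")
print(f"Athletes in sample: {len(sample)}")
```

Unlike the stratified version, the final sample size is not fixed in advance: it depends on how many athletes happen to be on the 8 selected teams, and 18 teams contribute no one at all.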

Each of the three examples found in Tables 4.2 and 4.3 was used to illustrate how both stratified and cluster sampling could be accomplished. However, there are obviously times when one sampling method is preferred over the other. The following explanations add some clarification about when to use which method.

Judgment Sampling

The following sampling methods that are listed in your text are types of judgment sampling:

  1. volunteer samples
  2. haphazard (convenience) samples

Since judgment sampling is based on human choice rather than random selection, statistical theory cannot explain what is happening.   In your textbook, the two types of judgment samples listed above are called "sampling disasters."

Section 4.2. Article: "How Polls are Conducted"

The article is exceptional and provides great insight into how major polls are conducted. When you are finished reading this article you may want to go to the Gallup Poll Web site, http://www.gallup.com, and see the results from recent Gallup polls.

It is important to be mindful of margin of error as discussed in this article. We all need to remember that public opinion on a given topic cannot be appropriately measured with one question that is only asked on one poll.  Such results only provide a snapshot at that moment under certain conditions.  The concept of repeating procedures over different conditions and times leads to more valuable and durable results. Within this section of the article, there is also an error: "in 95 out of those 100 polls, his rating would be between 46% and 54%." This should instead say that in 95 out of those 100 polls, the true population percent would be within the confidence interval calculated. In 5 of those surveys, the confidence interval would not contain the population percent.

Lesson 4 Practice Questions

Answer the following Practice Questions to check your understanding of the material in this lesson.

Think About It!

Come up with an answer to these questions by yourself.

1. Which of the following is not an example of probability sampling?

a. simple random sampling
b. cluster sampling
c. convenience sampling
d. stratified sampling

2. Which of the following surveys would have the smallest margin of error?

a. a sample size of n = 1,600 from a population of 50 million
b. a sample size of n = 500 from a population of 5 billion
c. a sample size of n = 100 from a population of 10 million

3. Suppose a recent survey finds that 80% of Penn State students prefer that fall semester begins after Labor Day. The results of this survey were based on opinions expressed by 200 Penn State students. Which of the following represents the calculation of the margin of error for this survey?

a. 200
b. 1/200
c. \(1/ \sqrt{200}\)
d. \(\sqrt{200}\)

4. Suppose a margin of error for a poll is 4%. What is the correct interpretation of the margin of error for this poll? In about 95% of all samples of this size, the ________________.

a. difference between the sample percent and the population percent will be within 4%.
b. probability that the sample percent does not equal the population percent is 4%.
c. probability that the sample percent does equal the population percent is 4%.
d. difference between the sample percent and the population percent will exceed 4%.

5. In order to survey the opinions of its customers, a restaurant chain obtained a random sample of 30 customers from each restaurant in the chain. Each selected customer was asked to fill out a survey. Which one of the following sampling plans was used in this survey?

a. cluster sampling
b. stratified sampling