# Lesson 3: Characteristics of Good Sample Surveys and Comparative Studies

### Assignments

• Read Chapters 4 & 5 in the text
• Work through the Lesson 3 online notes that follow
• Complete all lesson assignments as listed in the Canvas Module

### Learning Objectives

#### Chapters 4 & 5

After successfully completing this lesson, you should be able to:

• Distinguish between a population, sample, and sampling frame.
• Interpret and identify the factors that affect the margin of error.
• Identify types of probability samples and judgment samples.
• Apply the "Difficulties and Disasters" in sampling to real world problems.
• Identify all steps used and issues addressed by the Gallup Poll.
• Distinguish between randomized experiments and observational studies.
• Distinguish between two independent and two dependent (matched paired) samples.
• Apply basic terms associated with research studies.

### Terms to Know

#### Chapter 4

• sample surveys
• experiments
• observational studies
• unit (sampling unit)
• population
• sample
• sampling frame
• census
• margin of error (ME)
• sample size (n)
• probability sampling
• simple random sample
• stratified sampling
• cluster sampling
• systematic sampling
• voluntary sample
• convenience sample
• haphazard sampling
• nonresponse bias
• response rate
• random-digit dialing
• selection bias
• sample percent
• population percent

#### From Chapter 5

• experimental unit
• explanatory variable
• treatment
• response (outcome) variable
• confounding variable
• randomized experiments
• independent samples
• dependent samples
• matched pair design
• block design
• carryover effect
• placebo (placebo effect)
• case-control study
• retrospective study
• prospective study
• single blind
• double blind

# 3.1 Overview

In this lesson, we will add to our knowledge base by explaining ways to obtain appropriate samples for statistical studies.  We will look both at sample surveys and at comparative studies.  For a sample survey,  we want the sample to provide an accurate reflection of the population we are interested in - and random sampling can help.  In a comparative study, we want the groups being studied to provide a fair comparison - and random assignment can help.

# 3.2 Defining a Common Language for Sampling

The Gallup Global Emotions Survey in 2015 included interviews with more than 147,000 people in 140 countries. In each country Gallup is interested in studying the emotional well-being of the population of adults 18 years old and over. They do this by taking a sample of about 1000 adults in each country using techniques that involve random selection of people for interviews. Luckily, in most countries, they are able to operate quite freely as they conduct interviews in the local language about various positive and negative experiences and feelings that the respondents have. Unfortunately, in a country like Syria, where there was an active war going on in 2015, the security situation made it impossible for Gallup to operate in about a third of the country. The two-thirds of the population that lived in areas without active fighting going on formed the sampling frame for Gallup’s poll in Syria. Gallup published their Global Emotions report about the 2015 interviews in March, 2016. In one question they asked people “Did you smile or laugh a lot yesterday?” and about 72% of their respondents worldwide answered yes. However, only about half of that percentage said “yes” in war-torn Syria.

Let’s examine the general framework of the example above and define a common language for the processes used in sample surveys.

It is first necessary to distinguish between a census and a sample survey.  A census is an attempt to collect data from every member of the population, while a sample survey is a collection of data from a subset of the population chosen by the researcher.  A sample survey is a type of observational study. Obviously, it is much easier to conduct a sample survey than a census. In planning a sample survey, the researcher needs to precisely define the following:

• Sampling Unit: The individual person, animal, or object on which the measurement (observation) is taken.
• Population: The entire group of individuals or objects that we wish to know something about.  A numerical characteristic of the population is called a parameter.
• Sampling Frame: The list of sampling units from which those to be contacted for inclusion in the sample are obtained. The sampling frame lies between the population and the sample. Ideally the sampling frame should match the population, but it rarely does because the population is usually too large for all of its members to be listed.
• Sample: Those individuals or objects who provide the data to be collected.  Numerical characteristics of the sample are called statistics and are typically used as estimates of population parameters.

Figure 3.1 Relationship between Population, Sampling Frame and Sample

Example 3.1. Who are those angry women?

(Streitfield, D., 1988 and Wallis, 1987)

Recalling some of the information from Example 2.1 in Lesson 2, in 1987, Shere Hite published a best-selling book called Women and Love: A Cultural Revolution in Progress. This 7-year research project produced a controversial 922-page publication that summarized the results from a survey that was designed to examine how American women felt about their relationships with men. Hite mailed out 100,000 fifteen-page questionnaires to women who were members of a wide variety of organizations across the U.S.   Questionnaires were actually sent to the leader of each organization. The leader was asked to distribute questionnaires to all members. Each questionnaire contained 127 open-ended questions with many parts and follow-ups. Part of Hite's directions read as follows: "Feel free to skip around and answer only those questions you choose." Approximately 4500 questionnaires were returned.

In Lesson 2, we determined that the

• population was all American women.
• sample was the 4,500 women who responded.

It is also easy to identify that the sampling unit was an American woman.  So, the key question is "What is the sampling frame?" Some might think that the sampling frame was the 100,000 women who received the questionnaires (that's the intended sample).  However, this answer is not correct because the sampling frame is the list from which the 100,000 women who were sent the survey were drawn.  In this instance, the sampling frame included all American women who had some affiliation with an organization because those are the women who had some possibility of being contacted.  If the response rate had been 100%, the sample would have been the 100,000 women who responded to the survey.

You should also remember that ideally the sampling frame should be as close to the entire population as possible. If this is not possible, the sampling frame should appropriately represent the desired population. In this case, the sampling frame of all American women who were "affiliated with some organization" did not appropriately represent the population of all American women.  In Lesson 2, we called this problem selection bias.

This example illustrates three key difficulties that can result in bias in sample surveys:

1. Using the wrong sampling frame. As discussed above, bias can result when the sampling frame leaves out major portions of the population. This is called undercoverage, which is a type of selection bias.
2. Not reaching the individuals selected.   Because the questionnaire was sent to leaders of organizations, there is no guarantee that these questionnaires actually reached the women who were supposed to be in the sample.
3. Getting a low response rate.   In Lesson 2, we learned that this survey has a problem with nonresponse bias because of the low response rate. This problem can create bias if the people who respond have different views than those who do not.

Summary: Focusing on these distinctions is meant to help you think carefully about the process of creating a sample so you can identify issues that might arise in interpreting the results of a sample survey.

The process:  You want to know about a POPULATION but you only really have access to a SAMPLING FRAME that you can draw an INTENDED SAMPLE from; but in the end you only get observations from the actual SAMPLE.

When you read about a sample survey, always try to break down the process used into these component parts.

When a report says that a random sample was used, that usually means that the intended sample was randomly selected from the sampling frame.  You must then judge whether the sampling frame was really representative of the population and whether the sample was really representative of the intended sample.  When you read the methodology used in high quality sample surveys, you will find that they go to great lengths to make adjustments to avoid bias from these issues.  If no such adjustments are made, survey results can be quite misleading.
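The chain from population to sampling frame to sample can be illustrated with a small simulation. The sketch below is only an illustration under made-up assumptions: a hypothetical population of 100,000 people, a sampling frame that covers 60% of them, and "yes" rates of 70% inside the frame and 40% outside it. None of these numbers come from a real survey, and every selected person is assumed to respond.

```python
import random

def simulate_survey(seed: int = 0):
    """Return (population %, sampling-frame %, sample %) answering "yes".

    All rates below are made-up values used for illustration only.
    """
    rng = random.Random(seed)
    population = []
    for _ in range(100_000):
        in_frame = rng.random() < 0.60        # hypothetical: 60% of people are on the list
        p_yes = 0.70 if in_frame else 0.40    # hypothetical: opinions differ off the list
        population.append((in_frame, rng.random() < p_yes))

    # Only people on the list can be selected.
    frame = [person for person in population if person[0]]
    # Random selection from the frame (assumes everyone selected responds).
    sample = rng.sample(frame, 1000)

    def pct_yes(people):
        return 100 * sum(yes for _, yes in people) / len(people)

    return pct_yes(population), pct_yes(frame), pct_yes(sample)

pop_pct, frame_pct, sample_pct = simulate_survey()
print(f"population: {pop_pct:.1f}%  frame: {frame_pct:.1f}%  sample: {sample_pct:.1f}%")
```

Under these assumptions the random sample tracks the sampling frame closely but misses the population percent by a wide margin, which is the undercoverage problem described above: random selection cannot repair a frame that leaves out part of the population.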

# 3.3 The Beauty of Sampling

Many sample surveys are used to estimate the percentage of people in a population that have a certain characteristic or opinion.  If you follow the news, you might remember hearing that many of these polls are based on samples of size 1000 to 1500 people.  So, why is a sample size of around 1000 people commonly used in surveying?  The answer is based on understanding what is called the margin of error.

The margin of error:

• measures the reliability of the percent or other estimate based on the survey data
• is smaller when the sample size (n) is larger
• does not provide information about bias or other errors in a survey

For a sample size of n = 1000, the margin of error for a sample proportion is around $$\frac{1}{\sqrt{n}}=\frac{1}{\sqrt{1000}}\approx 0.03$$, or about 3%.  Since other problems inherent in surveys may often cause biases of a percent or two, pollsters often believe that it is not worth the expense to achieve the small improvement in the margin of error that might be gained by increasing the sample size further (see section 3.4).

The margin of error for most sample estimates is inversely proportional to the square root of the size of the sample, $$\sqrt{n}$$.  For example, if you have four times as many people in your sample, your margin of error will be cut in half and your survey will be twice as reliable.  As long as the population is much larger than the sample, the size of the population does not affect the margin of error.  So, a percentage estimated from a sample will have the same margin of error (reliability), regardless of whether the population size is 50,000 or 5 billion.   If a survey is conducted using an unbiased methodology, then the margin of error tells us directly about the accuracy of the poll at estimating a population parameter.
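The rule of thumb above can be checked with a few lines of code. This is a minimal sketch that assumes the approximate formula 1/√n used in this lesson; the function name `margin_of_error` is chosen here for illustration.

```python
import math

def margin_of_error(n: int) -> float:
    """Approximate 95% margin of error for a sample proportion: 1 / sqrt(n)."""
    if n <= 0:
        raise ValueError("sample size must be positive")
    return 1.0 / math.sqrt(n)

# Quadrupling the sample size cuts the margin of error in half.
for n in (1000, 4000):
    print(f"n = {n}: margin of error is about {margin_of_error(n):.1%}")
```

Note that the population size never appears in the calculation; only the sample size n matters.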

So what does the margin of error represent?

Interpretation:  If one obtains many unbiased samples of the same size from a defined population, the difference between the sample percent and the true population percent will be within the margin of error, at least 95% of the time.

Key Features of the Interpretation of the Margin of Error

• Even though a pollster obtains only one sample, you should remember that the interpretation of the margin of error is based on what would happen if the survey were conducted repeatedly under identical conditions.  The key to statistics is analyzing the quality of the process used to gather data. The margin of error says something about the reliability of that process.
• The margin of error represents the largest distance that would occur in most unbiased surveys between the sample percent, which is the percent obtained by the poll, and the true population percent, which is unknown because we have not sampled the entire population.
• When talking about the margin of error, it is just not possible to say that the difference between the sample percent and the population percent will be within the margin of error for 100% of all possible samples.  So, statisticians use the laws of probability to ensure that at least 95% of the time, the difference between the sample percent and the population percent will be within the margin of error.
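The "repeated surveys" idea behind this interpretation can be simulated. The sketch below assumes a hypothetical true population percent of 72% and unbiased simple random samples of n = 1000; it repeats the poll many times and reports the share of polls whose sample percent lands within the 1/√n margin of error. The function name and the number of repetitions are illustrative choices.

```python
import math
import random

def share_within_margin(true_p: float, n: int, n_polls: int, seed: int = 0) -> float:
    """Fraction of simulated polls whose sample proportion is within 1/sqrt(n) of true_p."""
    rng = random.Random(seed)
    me = 1.0 / math.sqrt(n)
    hits = 0
    for _ in range(n_polls):
        # Each respondent independently says "yes" with probability true_p.
        yes = sum(rng.random() < true_p for _ in range(n))
        if abs(yes / n - true_p) <= me:
            hits += 1
    return hits / n_polls

share = share_within_margin(true_p=0.72, n=1000, n_polls=1000)
print(f"{share:.1%} of simulated polls were within the margin of error")
```

In runs like this the share should come out at 95% or a little higher, but never 100% in the long run: a few unlucky samples always miss by more than the margin of error.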

Example 3.2. Margin of Error and the Gallup Emotions Report

The Gallup Global Emotions Report was released in March, 2016 and included results of surveys Gallup conducted in 140 different countries in 2015 to study the emotional well-being of the populations of each country. For example, Gallup’s survey in Paraguay (population size 7 million) included about 1000 interviews and found the adults in that country to be the happiest in the world with 84% of respondents indicating they had laughed or smiled a lot the day before.  On the other hand, Gallup’s survey in Syria (population size 23 million) also included about 1000 interviews and found the adults in that country to be the least happy in the world with only about 36% of respondents indicating they had laughed or smiled a lot the day before. The surveys in both of those two countries had a margin of error of about 3%.

The results of the poll in Paraguay suggest that 84% ± 3% of all adults in Paraguay smile or laugh a lot on any given day. What is the correct interpretation of this margin of error?

Margin of Error Interpretation

Assuming the poll in Paraguay used an unbiased procedure, the difference between our sample percent and the true population percent will be within 3%, at least 95% of the time.  This means that we are almost certain that 84% ± 3%, or 81% to 87%, of all adults in Paraguay smile or laugh a lot each day.  Because every value in this range falls above 72%, we can also say that we are pretty sure that the rate in Paraguay is above the worldwide average of 72%.  If any value in the range had been 72% or less, then we would not have been able to make that kind of statement with as much certainty.  The range of values (81% to 87%) is called a 95% confidence interval.  Other levels of confidence besides 95% may be used, but 95% is the most typical. We will go into further detail about confidence intervals in Lesson 9. Importantly, the poll in Syria also had a margin of error of about 3% despite that country having a population that is about three times larger. However, the interpretation of the margin of error in Syria should also include the reminder that it reflects only the variability due to the randomness in the survey.  The margin of error in that survey does not include any information about the likely bias in the Syria poll that resulted from undercoverage, since security concerns made it impossible for Gallup to have access to a third of the population (see section 3.2).
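The comparison with the worldwide figure of 72% can be written out as a short calculation. This sketch uses the percents quoted in the example; the helper names `interval` and `compare_to_benchmark` are hypothetical and exist only for illustration.

```python
def interval(sample_pct: float, me_pct: float) -> tuple[float, float]:
    """Range of plausible population percents: sample percent plus or minus the margin of error."""
    return sample_pct - me_pct, sample_pct + me_pct

def compare_to_benchmark(sample_pct: float, me_pct: float, benchmark_pct: float) -> str:
    """Say whether the whole interval sits above or below a benchmark percent."""
    low, high = interval(sample_pct, me_pct)
    if low > benchmark_pct:
        return "above"
    if high < benchmark_pct:
        return "below"
    return "cannot tell"

print(interval(84, 3))                   # Paraguay: (81, 87)
print(compare_to_benchmark(84, 3, 72))   # Paraguay vs. worldwide 72%: above
print(compare_to_benchmark(36, 3, 72))   # Syria vs. worldwide 72%: below
```

The calculation only accounts for sampling variability. It says nothing about bias, such as the undercoverage in the Syria poll.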

# 3.4 Relationship between Sample Size and Margin of Error

As discussed in the previous section, the margin of error for sample estimates will shrink with the square root of the sample size. For example, a typical margin of error for sample percents for different sample sizes is given in Table 3.1 and plotted in Figure 3.2.

Table 3.1. Calculated Margins of Error for Selected Sample Sizes

| Sample Size (n) | Margin of Error (M.E.) |
|---|---|
| 200 | 7.1% |
| 400 | 5.0% |
| 700 | 3.8% |
| 1000 | 3.2% |
| 1200 | 2.9% |
| 1500 | 2.6% |
| 2000 | 2.2% |
| 3000 | 1.8% |
| 4000 | 1.6% |
| 5000 | 1.4% |

Let's look at the implications of this square root relationship. To cut the margin of error in half, say from 3.2% down to 1.6%, you need a sample four times as large, such as going from 1000 to 4000 respondents. To cut the margin of error by a factor of five, you need a sample 25 times as large, as when the margin of error goes from 7.1% down to 1.4% while the sample size moves from n = 200 up to n = 5000.
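Turning the 1/√n rule around gives the sample size needed for a target margin of error, n ≈ 1/ME². The sketch below assumes that approximation; the function name `required_sample_size` is an illustrative choice.

```python
import math

def required_sample_size(target_me: float) -> int:
    """Smallest n with 1/sqrt(n) <= target_me, where target_me is a proportion (0.03 = 3%)."""
    if not 0 < target_me < 1:
        raise ValueError("target margin of error must be between 0 and 1")
    # Round first to absorb floating-point noise before taking the ceiling.
    return math.ceil(round(1.0 / target_me ** 2, 6))

for me in (0.05, 0.025):
    print(f"margin of error {me:.1%} needs about n = {required_sample_size(me)}")
```

Halving the target from 5% to 2.5% raises the required sample size from 400 to 1600, the same fourfold increase described above.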

Figure 3.2 Relationship Between Sample Size and Margin of Error

In Figure 3.2, you again find that as the sample size increases, the margin of error decreases.  However, you should also notice that there is a diminishing return from taking larger and larger samples.  In the table and graph, the amount by which the margin of error decreases is most substantial between sample sizes of 200 and 1500.  This implies that the reliability of the estimate is more strongly affected by the size of the sample in that range.  In contrast, the margin of error does not substantially decrease at sample sizes above 1500 (since it is already below 3%).  It is rarely worth it for pollsters to spend additional time and money to bring the margin of error down below 3% or so.  After that point, it is probably better to spend additional resources on reducing sources of bias that might be on the same order as the margin of error.  An obvious exception would be in a government survey, like the one used to estimate the unemployment rate, where even tenths of a percent matter.

# 3.5 Simple Random Sampling and Other Sampling Methods

Sampling Methods can be classified into one of two categories:

• Probability Sampling: Each possible sample has a known probability of being selected
• Non-probability Sampling: Samples do not have a known probability of being selected, as in convenience or voluntary response surveys

Probability Sampling

In probability sampling it is possible to both determine which sampling units belong to which sample and the probability that each sample will be selected. The following sampling methods are examples of probability sampling:

1. Simple Random Sampling (SRS)
2. Stratified Sampling
3. Cluster Sampling
4. Systematic Sampling
5. Multistage Sampling (in which some of the methods above are combined in stages)

Of the five methods listed above, students have the most trouble distinguishing between stratified sampling and cluster sampling.
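As a point of reference before comparing stratified and cluster sampling, here is a minimal sketch of the first and fourth methods in the list. It assumes a hypothetical sampling frame of 100 numbered units and uses the common textbook form of systematic sampling (a random starting point, then every k-th unit); the function names are illustrative.

```python
import random

def simple_random_sample(frame, n, seed=None):
    """Draw n distinct units so that every group of n units is equally likely."""
    rng = random.Random(seed)
    return rng.sample(frame, n)

def systematic_sample(frame, n, seed=None):
    """Pick a random start among the first k units, then take every k-th unit."""
    rng = random.Random(seed)
    k = len(frame) // n          # sampling interval
    start = rng.randrange(k)     # random starting position from 0 to k - 1
    return [frame[start + i * k] for i in range(n)]

frame = list(range(1, 101))      # hypothetical frame: units numbered 1 to 100
srs = simple_random_sample(frame, 10, seed=1)
systematic = systematic_sample(frame, 10, seed=1)
print(sorted(srs))
print(systematic)
```

Both methods rely on a random mechanism rather than human choice, which is what makes them probability samples.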

Stratified Sampling is possible when it makes sense to partition the population into groups based on a factor that may influence the variable that is being measured.   These groups are then called strata.  An individual group is called a stratum.  With stratified sampling one should:

• partition the population into groups (strata)
• obtain a simple random sample from each group (stratum)
• collect data on each sampling unit that was randomly sampled from each group (stratum)

Stratified sampling works best when a heterogeneous population is split into fairly homogeneous groups.  Under these conditions, stratification generally produces more precise estimates of the population percents than estimates that would be found from a simple random sample. Table 3.2 shows some examples of ways to obtain a stratified sample.

Table 3.2. Examples of Stratified Samples

| | Example 1 | Example 2 | Example 3 |
|---|---|---|---|
| Population | All people in U.S. | All PSU intercollegiate athletes | All elementary students in the local school district |
| Groups (Strata) | 4 Time Zones in the U.S. (Eastern, Central, Mountain, Pacific) | 26 PSU intercollegiate teams | 11 different elementary schools in the local school district |
| Obtain a Simple Random Sample | 500 people from each of the 4 time zones | 5 athletes from each of the 26 PSU teams | 20 students from each of the 11 elementary schools |
| Sample | 4 × 500 = 2000 selected people | 26 × 5 = 130 selected athletes | 11 × 20 = 220 selected students |
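The stratified procedure in Example 3 can be sketched in code. The enrollment of 300 students per school is a made-up number used only to build a frame; the function name `stratified_sample` is likewise illustrative.

```python
import random

def stratified_sample(strata, n_per_stratum, seed=None):
    """Take a simple random sample of the same size from every stratum."""
    rng = random.Random(seed)
    return {name: rng.sample(units, n_per_stratum) for name, units in strata.items()}

# Hypothetical frame: 11 elementary schools (strata) with 300 students each.
schools = {
    f"school_{s}": [f"school_{s}_student_{i}" for i in range(1, 301)]
    for s in range(1, 12)
}

sample = stratified_sample(schools, n_per_stratum=20, seed=1)
total = sum(len(students) for students in sample.values())
print(total)  # 11 × 20 = 220 selected students
```

Every stratum contributes to the sample, so no school can be left out by chance.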

Cluster Sampling is very different from Stratified Sampling. With cluster sampling one should:

• divide the population into groups (clusters).
• obtain a simple random sample of so many clusters from all possible clusters.
• obtain data on every sampling unit in each of the randomly selected clusters.

It is important to note that, unlike with the strata in stratified sampling, the clusters should be microcosms, rather than subsections, of the population.   Each cluster should be heterogeneous. Additionally, the statistical analysis used with cluster sampling is not only different, but also more complicated than that used with stratified sampling.

Table 3.3. Examples of Cluster Samples

| | Example 1 | Example 2 | Example 3 |
|---|---|---|---|
| Population | All people in U.S. | All PSU intercollegiate athletes | All elementary students in a local school district |
| Groups (Clusters) | 4 Time Zones in the U.S. (Eastern, Central, Mountain, Pacific) | 26 PSU intercollegiate teams | 11 different elementary schools in the local school district |
| Obtain a Simple Random Sample | 2 time zones from the 4 possible time zones | 8 teams from the 26 possible teams | 4 elementary schools from the 11 possible elementary schools |
| Sample | every person in the 2 selected time zones | every athlete on the 8 selected teams | every student in the 4 selected elementary schools |
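For contrast, the cluster procedure in Example 3 can be sketched the same way. Again the enrollment of 300 students per school is a made-up number, and `cluster_sample` is an illustrative name.

```python
import random

def cluster_sample(clusters, n_clusters, seed=None):
    """Randomly choose whole clusters, then keep every unit in each chosen cluster."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)
    return {name: list(clusters[name]) for name in chosen}

# Hypothetical frame: 11 elementary schools (clusters) with 300 students each.
schools = {
    f"school_{s}": [f"school_{s}_student_{i}" for i in range(1, 301)]
    for s in range(1, 12)
}

sample = cluster_sample(schools, n_clusters=4, seed=1)
print(sorted(sample))                                     # the 4 selected schools
print(sum(len(students) for students in sample.values())) # every student in those schools
```

Here the randomness is in which schools are chosen, not in which students are chosen within a school. That is the key difference from the stratified design, where every school is represented and the randomness is within each school.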

Each of the three examples in Tables 3.2 and 3.3 was used to illustrate how both stratified and cluster sampling could be carried out. However, there are obviously times when one sampling method is preferred over the other. The following explanations add some clarification about when to use which method.

• With Example 1: Stratified sampling would be preferred over cluster sampling, particularly if the questions of interest are affected by time zone.  For example the percentage of people watching a live sporting event on television might be highly affected by the time zone they are in.  Cluster sampling really works best when there are a reasonable number of clusters relative to the entire population. In this case, selecting 2 clusters from 4 possible clusters really does not provide much advantage over simple random sampling.
• With Example 2: Either stratified sampling or cluster sampling could be used.  It would depend on what questions are being asked.  For instance, consider the question "Do you agree or disagree that you receive adequate attention from the team of doctors at the Sports Medicine Clinic when injured?"  The answer to this question would probably not be team dependent, so cluster sampling would be fine.  In contrast, if the question of interest is "Do you agree or disagree that weather affects your performance during an athletic event?"  The answer to this question would probably be influenced by whether or not the sport is played outside or inside.  Consequently, stratified sampling would be preferred.
• With Example 3: Cluster sampling would probably be better than stratified sampling if each individual elementary school appropriately represents the entire population, as in a school district where students from throughout the district can attend any school.  Stratified sampling could be used if the elementary schools had very different locations and served only their local neighborhood (e.g., one elementary school is located in a rural setting while another elementary school is located in an urban setting).  Again, the questions of interest would affect which sampling method should be used.

The most common method of carrying out a poll today is using Random Digit Dialing, in which a machine randomly dials phone numbers.  Some polls go even further and have a machine conduct the interview itself rather than just dialing the number!  Such "robo call polls" can be very biased because they have extremely low response rates (most people don't like speaking to a machine) and because federal law prevents such calls to cell phones.  Since the people who have landline phone service tend to be older than people who have cell phone service only, another potential source of bias is introduced.  National polling organizations that use random digit dialing in conducting interviewer-based polls are very careful to match the number of landline versus cell phones to the population they are trying to survey.

Non-probability Sampling

The following sampling methods that are listed in your text are types of non-probability sampling that should be avoided:

1. volunteer samples
2. haphazard (convenience) samples

Since such non-probability sampling methods are based on human choice rather than random selection, statistical theory cannot explain how they might behave and potential sources of bias are rampant.   In your textbook, the two types of non-probability samples listed above are called "sampling disasters."

Read the article: "How Polls are Conducted" by the Gallup organization available in Canvas.

The article provides great insight into how major polls are conducted. When you are finished reading this article you may want to go to the Gallup Poll Web site, http://www.gallup.com, and see the results from recent Gallup polls.  Another excellent source of public opinion polls on a wide variety of topics using solid sampling methodology is the Pew Research Center website at http://www.pewresearch.org.  When you read one of the summary reports on the Pew site, there is a link (in the upper right corner) to the complete report giving more detailed results and a full description of their methodology, as well as a link to the actual questionnaire used in the survey so you can judge whether there might be bias in the wording of their survey.

It is important to be mindful of margin of error as discussed in this article. We all need to remember that public opinion on a given topic cannot be appropriately measured with one question that is only asked on one poll.  Such results only provide a snapshot at that moment under certain conditions.  The concept of repeating procedures over different conditions and times leads to more valuable and durable results. Within this section of the Gallup article, there is also an error: "in 95 out of those 100 polls, his rating would be between 46% and 54%." This should instead say that in an expected 95 out of those 100 polls, the true population percent would be within the confidence interval calculated. In 5 of those surveys, the confidence interval would not contain the population percent.

# 3.6 Defining a Common Language for Comparative Studies

Overview: We've learned some of the very basics about research studies that compare two or more samples of one variable. Now we will explore this topic in more detail.  We first need to learn a few terms. These include:

1. experimental unit
2. explanatory variable
3. treatment
4. response (outcome) variable
5. confounding variable

The experimental unit is the smallest basic object to which one can assign different conditions (treatments).  In research studies, the experimental unit does not always have to be a person. In fact, the statistical terminology that is associated with research studies actually came from studies done in agriculture. Examples of an experimental unit include:

• person
• animal
• plant
• set of twins
• married couple
• plot of land
• building

The explanatory variable is the variable used to form or define the different samples.  In randomized experiments, one explanatory variable is the variable that is used to explain differences in the groups. In this instance, the explanatory variable can also be called a treatment when each experimental unit is randomly assigned a certain condition. Examples of explanatory variables include:

• gender
• type of plant
• type of drug
• type of medical procedure
• teaching method

You should note that gender and type of plant cannot be called treatments because one cannot randomly assign gender or type of plant.

The response (outcome) variable is the outcome of the study that is either measured or counted. We have seen the response (outcome) variable in previous lessons. Examples of response variables include:

• height
• weight
• temperature
• classification of whether a person is a vegetarian
• classification of symptom severity for an illness

Of course, some variables may play different roles in different studies. For example, in an experiment to see whether a new diet might help in reducing your weight, weight is the response variable and whether you used the new diet or not would be the explanatory variable.  On the other hand, in an observational study to examine how a person's weight might affect their heart rate, weight would play the role of an explanatory variable and heart rate would be the response variable.

A confounding variable is a variable that affects the response variable and is also related to the explanatory variable. The effect of a confounding variable on the response variable cannot be separated from the effect of the explanatory variable.  Therefore, we cannot clearly determine that the explanatory variable is solely responsible for any effect on the response or outcome variable when a confounding variable is present.   Confounding variables are problematic in observational studies.

Example 3.3.  Laboratory experiments conducted in the 1980s showed that pregnant mice exposed to high doses of ultrasound gave birth to lower weight infant mice than unexposed mice (in fact, the higher the dose, the greater the effect on birthweights). This worried obstetricians, who feared that sonograms given to women during pregnancy might cause lower weights in their children.  Researchers at Johns Hopkins University Hospital then examined the birthweights of infants of mothers who had sonograms versus those whose mothers had no such exposure.  They found that the 1598 infants who had been exposed averaged a couple of ounces lower in weight than the 944 infants whose mothers did not have a sonogram.  However, the women who got sonograms were more likely to have had twins in the past and were more likely to be over 40 years old.  Having twins or being over 40 are examples of confounding variables in this study since they provide an alternate explanation for the data.  You cannot tell whether it was the sonogram that caused the lower birthweights or just the confounding medical reasons for getting the sonogram in the first place.  Later experimental evidence in humans did not show sonograms to have any effect (see Abramowicz et al., 2008 for a review).
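How a confounding variable can create a difference that the explanatory variable did not cause can be seen in a small simulation. The sketch below is purely hypothetical and does not use the Johns Hopkins data: the exposure is given no true effect at all, a made-up "high risk" trait lowers the outcome, and high-risk units are more likely to be exposed unless exposure is assigned at random. All probabilities and effect sizes are invented for illustration.

```python
import random
from statistics import mean

def simulate(randomize: bool, n: int = 20_000, seed: int = 0) -> float:
    """Return mean outcome of exposed minus unexposed units.

    The exposure has NO true effect; only the confounder changes the outcome.
    """
    rng = random.Random(seed)
    exposed, unexposed = [], []
    for _ in range(n):
        high_risk = rng.random() < 0.3              # hypothetical confounder
        if randomize:
            gets_exposure = rng.random() < 0.5      # random assignment (coin flip)
        else:
            # Observational setting: high-risk units are exposed more often.
            gets_exposure = rng.random() < (0.8 if high_risk else 0.3)
        # The outcome depends on the confounder, not on the exposure.
        outcome = rng.gauss(100, 10) - (8 if high_risk else 0)
        (exposed if gets_exposure else unexposed).append(outcome)
    return mean(exposed) - mean(unexposed)

print(f"observational difference: {simulate(randomize=False):.2f}")
print(f"randomized difference:    {simulate(randomize=True):.2f}")
```

In the observational version the exposed group looks worse even though the exposure does nothing, because that group contains more high-risk units. With random assignment the two groups have about the same share of high-risk units, and the difference should be close to zero.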

# 3.7 Types of Research Studies

So far we have discussed two basic types of comparative statistical research studies:

• randomized experiments
• observational studies

With a randomized experiment, the researcher

• creates differences in the explanatory variable when randomly assigning treatments
• allows for possible "cause and effect" conclusions if other precautions are taken
• can minimize the effect of "confounding" variables

With an observational study, the researcher

• observes differences in the explanatory variable in natural settings/groupings (no variable is randomly assigned)
• strives for association conclusions since "cause and effect" conclusions are not possible
• must accept that confounding variables are potential problems

Example 3.4. Randomized Experiment (Two Independent Samples)

An educator wants to compare the effectiveness of computer software that teaches reading versus a standard curriculum used to teach reading. The educator tests the reading ability of a group of 60 students and then randomly divides them into two classes of 30 students each. One class uses the computer regularly while the other class uses a standard curriculum. At the end of the semester, the educator retests the students and compares the mean increase in reading ability for the two groups.

This example is a randomized experiment because the students were randomly assigned to one of two methods to learn reading.  Also in this example:

• the experimental unit is the student
• the explanatory variable (treatment) is the method used to teach reading
• the response variable is the change found in reading ability at the end of the semester for each individual student.

The randomization that is used in this example cancels out other factors (confounding variables) that could also affect a change in reading ability. Specifically, the randomization will cancel out factors that may result from either self-selected or haphazardly-formed groups. With self-selection, students might base their decision on whether or not they like the computer or whether or not their friends will be in the class. This is no longer a problem when the groups are randomly formed.  Consequently, "cause and effect" statements can be used if statistical significance is found and other precautions are used to treat each group the same except for the different treatments assigned.

In statistics, we also say that the two samples in this study are independent. The label of independent samples is used when the results for the one sample have no impact on the results found with the second sample.  In this instance, each student provided a measurement for only one treatment.  The results from students in one group will not impact the results of students in the other group, so the results from the two samples are independent.
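
The random assignment in Example 3.4 can be sketched in a few lines of Python. This is only an illustration: the function name `assign_to_classes`, the student ID numbers 1 to 60, and the seed are assumptions made here, not part of the educator's study.

```python
import random

def assign_to_classes(student_ids, seed=None):
    """Randomly split students into two equal-sized classes (hypothetical helper)."""
    rng = random.Random(seed)
    shuffled = list(student_ids)
    rng.shuffle(shuffled)              # chance alone decides the ordering
    half = len(shuffled) // 2
    return {
        "software": sorted(shuffled[:half]),   # class using the reading software
        "standard": sorted(shuffled[half:]),   # class using the standard curriculum
    }

students = list(range(1, 61))          # assumed ID numbers for the 60 students
classes = assign_to_classes(students, seed=3)
print(len(classes["software"]), len(classes["standard"]))
```

Because each student lands in exactly one class, each student provides a measurement for only one treatment, which is what makes the two samples independent.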

Example 3.5. Observational Study (Two Independent Samples)

A medical researcher conjectures that smoking can result in wrinkled skin around the eyes. The researcher obtained a sample of smokers and a sample of nonsmokers. Each person was classified as either having or not having prominent wrinkles. The study compared the percent of prominent wrinkles for the two groups.

This example cannot be a randomized experiment because it would be both unrealistic and unethical to randomly assign who would be the smoker and who would be the nonsmoker. Also in this example:

• the experimental unit is the person
• the explanatory variable is smoking status
• the response variable is whether or not each person has prominent wrinkles

Because this example is an observational study, it is possible that confounding variables may also be responsible for whether or not a person has prominent wrinkles. Possible confounding variables include (1) how much time the person spends outside, (2) whether or not the person wears sunscreen, and (3) other variables that revolve around health and nutrition (especially those that could be related to smoking status). Because we can't separate the impact that these variables may have on the response variable, "cause and effect" conclusions are never possible. The researcher would be limited to saying either that there is an association between smoking status and wrinkle status or that there is a difference in the two percents when comparing smokers to nonsmokers.

This is also an example where the two samples are independent. The individuals in this study were classified as being either smokers or nonsmokers. The results from the smoking group had no impact on the results from the nonsmoking group.

Example 3.6. Randomized Experiment (Two Dependent Samples or Matched Pairs)

Is the right hand stronger than the left hand for those who are right-handed? An instrument has been developed to measure the force exerted (in pounds) when squeezed by one hand. The subjects for this study include 10 right-handed people.  How can we best answer this question?

What would happen if we tried to implement what was done in Example 3.4?  This would mean that we would randomly assign five people to use their right and five people to use their left hand. The results from the two groups would then be compared. Hopefully you see that even though randomization is being used with this approach, the results may not be the best because it is possible that - just by the luck of the draw with so few people - the one group could be comprised of strong people while the other group could be comprised of weak people.  If this happened, one could erroneously conclude that one hand is stronger for reasons other than that there is a difference in the two hands.
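
A small simulation can make this "luck of the draw" concrete. The sketch below assumes ten made-up baseline strengths (they are not data from the study) and repeatedly splits the people at random into two groups of five, recording how far apart the groups' average strengths end up before any difference between hands is considered.

```python
import random
import statistics

def group_mean_gap(strengths, rng):
    """Randomly split people into two equal groups; return the gap in mean strength."""
    shuffled = strengths[:]
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return statistics.mean(shuffled[:half]) - statistics.mean(shuffled[half:])

rng = random.Random(0)
# Hypothetical baseline strengths (pounds) for 10 people; values are assumptions.
strengths = [47, 20, 33, 28, 41, 25, 36, 30, 22, 44]

gaps = [group_mean_gap(strengths, rng) for _ in range(1000)]
largest_gap = max(abs(g) for g in gaps)
print(f"Largest gap between group means in 1000 random splits: {largest_gap:.1f} pounds")
```

With groups this small, some random splits put most of the strong people together, so a sizable gap can appear even when the hand being used has no effect at all.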

A better approach would be to have each person use both hands and then compare the results for the two hands. With this approach:

• the experimental unit is the person
• the explanatory variable (treatment) is hand being used (right hand or left hand)
• the response variable is force exerted (in pounds) for each hand.

The design used in the example is called a block design because the results from each person form a block. Specifically, this block design is called a matched pairs (block) design because each person provides two data observations that can be paired together (i.e. left and right hands of the same person form the pairs). Consequently, we can say that we have two dependent samples.  Table 3.3 shows how a spreadsheet for the data in the matched pairs design might look.

Table 3.3. Spreadsheet of Matched Pairs (Block) Design for Example 3.6

| Person | Force from Right Hand | Force from Left Hand |
|---|---|---|
| 1 | | |
| 2 | | |
| 3 | | |
| ... | | |
| 10 | | |

In Table 3.3, one sees that the results from each person form a block. The reason that this design is used is so that unwanted or extraneous variation can be removed from the data. In order to accomplish this goal, the data analysis is based on the differences rather than on the original data. By using the differences, we compare each person's two observations with each other, which is what distinguishes matched pairs from independent samples. Table 3.4 shows some data that could have been collected in this study.

Table 3.4. Picture Data of Matched Pairs (Block) Design for Example 3.6

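| Person | Force from Right Hand (pounds) | Force from Left Hand (pounds) | Difference = (Right Hand Force) - (Left Hand Force) |
|---|---|---|---|
| 1 | 47 | 38 | 47 - 38 = 9 pounds |
| 2 | 20 | 15 | 20 - 15 = 5 pounds |
| 3 | 33 | 26 | 33 - 26 = 7 pounds |
| ... | ... | ... | ... |
| 10 | 28 | 27 | 28 - 27 = 1 pound |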

As you examine the results from Table 3.4, you should see that there are innate differences in strength when comparing the people who participated in the study. For example, Person 1 is much stronger than Person 2.  However, the variation from person to person is no longer a factor when the differences are used in the data analysis rather than using the original data.
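
As a rough check of this point, the sketch below uses only the four rows that Table 3.4 actually shows (Persons 1, 2, 3, and 10; the other six people are not listed, so they are left out). It computes the paired differences and compares their spread with the spread of the original right-hand values. The variable names are chosen here for illustration.

```python
import statistics

# Rows shown in Table 3.4: person -> (right hand force, left hand force), in pounds.
forces = {1: (47, 38), 2: (20, 15), 3: (33, 26), 10: (28, 27)}

right = [r for r, _ in forces.values()]
differences = [r - l for r, l in forces.values()]   # right minus left, per person

mean_difference = statistics.mean(differences)
spread_right = statistics.stdev(right)              # person-to-person variation
spread_differences = statistics.stdev(differences)  # variation left after pairing

print(differences, mean_difference)
print(round(spread_right, 1), round(spread_differences, 1))
```

The differences vary much less than the raw forces do, because subtracting within each person removes that person's overall strength from the comparison.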

Also, as you examine Table 3.4, you should see why we classify the two measurements for each experimental unit as dependent.   A higher value in one hand is usually followed by a higher value in the other hand.   The values are more similar for each pair of measurements for each experimental unit than the values are between experimental units.

Even though the matched pairs design is critical in this example, this study would also benefit from randomization. Since each person is doing both things or providing two measurements, the randomization could be used to determine the order in which the treatments are done. Why would this enhance the study? Problems can exist with block designs, including matched pairs designs, when what happens with the first measurement "carries over" to the second measurement. This "carryover" effect is a type of confounding that is found with block designs.

For example, a "carryover" effect could possibly occur if complicated equipment was used to measure force exerted by a hand. If everyone used their right hand first, they might not do so well with the right hand because of not understanding the equipment, but do much better with their left hand because they learned how the equipment worked. In statistics, this is called a training effect. The opposite, however, could also take place. Suppose everyone was asked to first exert force with their right hand for 15 minutes and then repeat this task with their left hand. Participants might do okay with their right hand but become either bored or fatigued or sore when asked to repeat this task with the left hand. So again, what happened with the first measurement would "carry over" and affect the second measurement. One may conclude that one hand is stronger than the other, not because this is really true, but because the "carryover" effect allowed this to happen.

The overall conclusion is that if you randomly assign the order of treatment, some people will use their right hand first and other people will use their left hand first. This randomization should cancel out the possibility of a "carryover" effect.  In statistics, we call this a randomized block design, as shown in Table 3.5. Randomizing the order of treatment makes this a randomized experiment.

Table 3.5. Randomized Matched Pairs (Block) Design for Example 3.6

| Person | Hand Used First | Hand Used Second |
|---|---|---|
| 1 | Right Hand | Left Hand |
| 2 | Left Hand | Right Hand |
| 3 | Right Hand | Left Hand |
| ... | ... | ... |
| 10 | Left Hand | Right Hand |
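
One way to produce an assignment like the one in Table 3.5 is sketched below. It assumes a separate fair coin flip for each person, which is only one possible scheme (a study might instead force exactly five people to start with each hand); the function name `randomize_hand_order` is hypothetical.

```python
import random

def randomize_hand_order(person_ids, seed=None):
    """Randomly choose which hand each person uses first (hypothetical helper)."""
    rng = random.Random(seed)
    orders = {}
    for person in person_ids:
        first = rng.choice(["Right Hand", "Left Hand"])   # coin flip per person
        second = "Left Hand" if first == "Right Hand" else "Right Hand"
        orders[person] = (first, second)
    return orders

orders = randomize_hand_order(range(1, 11), seed=1)
for person, (first, second) in orders.items():
    print(person, first, second)
```

Every person still provides both measurements, so the pairing is kept; only the order of the two treatments is left to chance.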

Example 3.7. Observational Study (Two Dependent Samples or Matched Pairs)

An owner of a theater wants to determine if the time of the showing affects attendance at a "scary" movie. In order to check this claim, a sample of five nights from all possible  nights over the past month was obtained. The attendance (total number of tickets sold) for both the 7:00 PM and the 9:30 PM showings was determined for each of the five nights.

In this example:

• the experimental unit is the night
• the explanatory variable is the showing time
• the response variable is the attendance at each showing.

This example also uses a matched pairs (block) design because there are two measurements made on each night. A picture of this matched pairs block design is found in Table 3.6.

Table 3.6. Matched Pairs (Block) Design for Example 3.7

| Night | Attendance at 7:00 PM Showing | Attendance at 9:30 PM Showing |
|---|---|---|
| 1 | | |
| 2 | | |
| 3 | | |
| 4 | | |
| 5 | | |

Again, why is the matched pairs design preferred over two independent samples? In this example, our goal is to determine whether or not time of showing affects attendance at the "scary" movie. We do not want any extraneous or other unwanted variation to explain the differences in attendance. In this example, the potential unwanted variation would be the variation that would exist from night to night. Some of the selected nights may fall on a weekend while other nights may fall on a weekday. This factor could affect attendance. However, this will no longer be a problem when both measurements are made on the same night.

This example, however, cannot be a randomized experiment because it would be impossible to randomly assign time of showing. The 7:00 PM showing will always take place before the 9:30 PM showing. Consequently, there is a possibility that what happens at the 7:00 PM showing may "carry over" and affect attendance at the 9:30 PM showing. A possible "carryover" effect could be the fact that there is a limited amount of parking near the theater. If this were true, perhaps those at the 7:00 PM showing take all the available spots. Then people planning to attend the 9:30 PM showing may not attend because of not being able to find a parking spot. However, this problem may not exist if there is sufficient time between the two showings so that those who attended the 7:00 PM showing have time to leave before those attending the 9:30 PM showing arrive. In any event, because this is an observational study, confounding variables are possible. "Cause and effect" conclusions may not be drawn even if statistical significance is found.

Table 3.7. Summary of the Four Examples

| Example | Type of Study | Type of Samples | Randomization Used | Is Confounding Possible? |
|---|---|---|---|---|
| 3.4 | Experiment | Two Independent | Randomize type of treatment | No, randomization cancels out confounding |
| 3.5 | Observational | Two Independent | None | Yes |
| 3.6 | Experiment | Two Dependent (Matched Pairs) | Randomize order of treatment | No, randomization cancels out confounding |
| 3.7 | Observational | Two Dependent (Matched Pairs) | None | Yes |

# 3.8 Designing a Better Observational Study

The problem of confounding makes the interpretation of observational studies difficult.  Thus, it is important to design observational studies in a way that minimizes confounding. A case-control study is one example of a technique that can help in certain circumstances.

In a case-control study, people with the response of interest form a group of "cases" and are compared to a group of "controls" who are in similar circumstances except for the fact that they do not have the response. This type of study is very common in the study of factors that might be associated with uncommon diseases.

Example 3.8. In order to study whether the long-term use of cell phones might be associated with a greater risk of brain tumors, researchers in France conducted a case-control study (see Coureau et al., 2013). In their study, the cell phone use of 253 patients with glioma (a type of brain tumor) was compared with that of 892 matched controls of similar age who lived in the same areas (according to electoral rolls). In this way, the researchers hoped that other environmental factors associated with the differing brain tumor rates seen in differing parts of the country would be eliminated as confounders.

The difficulty in performing a case-control study comes with finding a good group of controls (e.g. similar life circumstances, same gender, similar age, similar family histories of disease, etc.).

Notice that case-control studies are done retrospectively - they compare patients who have the disease today with those who don't and ask them about past exposures or behaviors. But people's memories might be faulty or affected by the fact that they have the disease. For example, in a study of people who have arthritis, patients will often "remember" that their parents also suffered from arthritis, indicating a strong genetic component. However, when their non-suffering siblings with the same parents are asked the same question, no genetic component is found.

One alternative is to conduct a prospective study, in which people with different exposures or behaviors (the explanatory variables) are followed over time to see how many in each situation get the disease (the response variable).

Example 3.9. A prospective study followed 447,357 non-Hispanic white, cancer-free members of the AARP who were 50 to 71 years old in 1995-1996. These subjects were followed for about ten years, until the end of 2006, and during that decade 2,904 of them developed melanoma skin cancer. All of the subjects filled out a series of questionnaires at the start of the study, and explanatory factors identified at that time could be examined to see if they might be associated with the onset of melanoma later in life. In one study based on this very large cohort of AARP members, Loftfield et al. (2015) found a relationship between heavy coffee drinking (4 cups a day or more) and a modest decrease in the risk of getting melanoma. Of course, this observed association cannot be viewed as causal since other variables associated with melanoma risk (like exposure to sunlight) might also be different between the heavy coffee drinkers and the non coffee drinkers.
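
To get a feel for the scale of the response in Example 3.9, the overall proportion of the cohort that developed melanoma during follow-up can be computed from the two counts given above. This is only a back-of-the-envelope sketch: it ignores differences in how long each person was followed, and the variable names are chosen here for illustration.

```python
cohort_size = 447_357      # AARP members followed, from Example 3.9
melanoma_cases = 2_904     # members who developed melanoma during follow-up

proportion = melanoma_cases / cohort_size
per_thousand = 1000 * proportion

print(f"{proportion:.2%} of the cohort, or about {per_thousand:.1f} per 1,000 people")
```

A response this rare is one reason a prospective study needs such a large cohort: only a very large group yields enough cases to compare across exposure groups.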

# Lesson 3 - Test Yourself!

Now it's time to test yourself to check your understanding of the material in this lesson.  Be sure to also try the Practice quiz in Canvas (you can take that multiple times to see more practice problems).

# Lesson 3 - Have Fun With It!

#### Have Fun With It!

J.B. Landers © for CAUSEweb.org

Learn about Samples with Samba Music

Our Experiment
By: Laura Krajewski

Verse 1

Here's a question for you which you must test:
Which leads to greater success?
A student who eats well every day,
Or one whose diet's gone astray?

Chorus

When we plan out our experiment,
We must be sure to add
A control, replication, randomization,
And maybe even blocking ain't bad.

Verse 2

Ok let's see...who shall we test?
University students are best.
Both boys and girls from every year,
To make sure randomness is clear.

But boys and girls may react differently,
Maybe we should group them separately.
This will let us further randomize the test,
Thus by blocking we aim for the best.

Chorus

Verse 3

So we've got a random group of them,
Our roots are set now what's our first stem?
We must ruin the diets of a few,
Perhaps just feed them Mountain Dew.

Now some we'll feed well every day,
Their diets cannot go astray!
And for our control some folks stay untouched,
To see if they have any luck.

Chorus

Verse 4

What about blinding? Can we use it now?
Can we keep our treatments hidden somehow?
In this case no, I think they'd know,
But thanks for the suggestion though!

So we've got three groups, what do we think we'll see?
Who will win academically?
Let's hypothesize it's those who are fed well,
All the others will not excel.

Chorus

Verse 5

If we run this once will that be good?
Can our results then be fully understood?
Not so fast! We must once again replicate,
Perhaps try again in another state.

Now let's run the test, that's all there's left to do.
But even then our results aren't always true.
Maybe diet isn't the cause of what we find,
We must keep both types of error in mind.

So remember:

A control, replication, randomization,
And maybe even blocking ain't bad.