Lesson 12: Correlation & Simple Linear Regression
Objectives
 Construct a scatterplot using Minitab Express and interpret it
 Identify the explanatory and response variables in a given scenario
 Identify situations in which correlation or regression analyses are appropriate
 Compute Pearson r using Minitab Express, interpret it, and test for its statistical significance
 Construct a simple linear regression model (i.e., y-intercept and slope) using Minitab Express, interpret it, and test for its statistical significance
 Compute and interpret a residual given a simple linear regression model
 Compute and interpret the coefficient of determination (R^{2})
 Explain how outliers can influence correlation and regression analyses
 Explain why extrapolation is inappropriate
In Lesson 11 we examined relationships between two categorical variables with the chi-square test of independence. In this lesson, we will examine the relationships between two quantitative variables with correlation and simple linear regression. Quantitative variables have numerical values with magnitudes that can be placed in a meaningful order. You were first introduced to correlation and regression in Lesson 3.4. We will review some of the same concepts again, and we will see how we can test for the statistical significance of a correlation or regression slope using the \(t\) distribution.
In addition to reading Section 9.1 in the Lock^{5} textbook this week, you may also want to go back to review Sections 2.5 and 2.6 where scatterplots, correlation, and regression were first introduced.
12.1  Review: Scatterplots
In Lesson 3 you learned that a scatterplot can be used to display data from two quantitative variables. Let's review.
 Scatterplot
 A graphical representation of two quantitative variables in which the explanatory variable is on the x-axis and the response variable is on the y-axis.
How do we determine which variable is the explanatory variable and which is the response variable? In general, the explanatory variable attempts to explain, or predict, the observed outcome. The response variable measures the outcome of a study.
 Explanatory variable

Variable that is used to explain variability in the response variable, also known as an independent variable or predictor variable; in an experimental study, this is the variable that is manipulated by the researcher.
 Response variable

The outcome variable, also known as a dependent variable.
When describing the relationship between two quantitative variables, we need to consider the following:
 Direction (positive or negative)
 Form (linear or nonlinear)
 Strength (weak, moderate, strong)
 Outliers
In this class we will focus on linear relationships. This occurs when the line of best fit for describing the relationship between \(x\) and \(y\) is a straight line. The linear relationship between two variables is positive when both increase together; in other words, as values of \(x\) get larger, values of \(y\) get larger. This is also known as a direct relationship. The linear relationship between two variables is negative when one increases as the other decreases; for example, as values of \(x\) get larger, values of \(y\) get smaller. This is also known as an indirect (or inverse) relationship.
Scatterplots are useful tools for visualizing data. Next we will explore correlations as a way to numerically summarize these relationships.
MinitabExpress – Review of Using Minitab Express to Construct a Scatterplot
Let's construct a scatterplot to examine the relation between quiz scores and final exam scores.
 Open the data set:
 On a PC or Mac: GRAPHS > Scatterplot
 Select Simple
 Double click the variable Final in the box on the left to insert the variable into the Y variable box
 Double click the variable Quiz Average in the box on the left to insert the variable into the X variable box
 Click OK
This should result in the following scatterplot:
12.2  Correlation
In this course we have been using Pearson's \(r\) as a measure of the correlation between two quantitative variables. In a sample, we use the symbol \(r\). In a population, we use the symbol \(\rho\) ("rho").
Pearson's \(r\) can easily be computed using Minitab Express. However, understanding the conceptual formula may help you to better understand the meaning of a correlation coefficient.
 Pearson's \(r\): Conceptual Formula

\(r=\frac{\sum{z_x z_y}}{n-1}\)
where \(z_x=\frac{x - \overline{x}}{s_x}\) and \(z_y=\frac{y - \overline{y}}{s_y}\)
When we replace \(z_x\) and \(z_y\) with the \(z\) score formulas and move the \(n-1\) to a separate fraction we get the formula in your textbook: \(r=\frac{1}{n-1}\Sigma{\left(\frac{x-\overline x}{s_x}\right) \left( \frac{y-\overline y}{s_y}\right)}\)
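The conceptual formula can be traced step by step in code. The sketch below is illustrative only: the data values are made up, the function name `pearson_r` is hypothetical, and Python is used simply as a calculator (the course itself relies on Minitab Express).

```python
from statistics import mean, stdev

def pearson_r(x, y):
    """Pearson's r as the sum of z-score products divided by n - 1."""
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    s_x, s_y = stdev(x), stdev(y)  # sample standard deviations (n - 1 denominator)
    z_products = [
        ((xi - x_bar) / s_x) * ((yi - y_bar) / s_y)
        for xi, yi in zip(x, y)
    ]
    return sum(z_products) / (n - 1)

# Made-up example data (not from the course data sets)
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
r = pearson_r(x, y)
print(round(r, 4))  # 0.7746
```

Swapping the two arguments gives the same value, which matches the property that the correlation between \(x\) and \(y\) equals the correlation between \(y\) and \(x\).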
If conducting a test by hand, a \(t\) test statistic with \(df=n-2\) is computed: \(t=\frac{r - \rho_{0}}{\sqrt{\frac{1-r^2}{n-2}}} \)
In this course you will never need to compute \(r\) or the test statistic by hand; we will always use Minitab Express to perform these calculations.
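For readers who want to see the arithmetic behind the test statistic, here is a minimal sketch. It assumes \(\rho_0=0\) by default and uses \(r=0.608630\) from this lesson's quiz average and final exam example; the sample size \(n=50\) is inferred from the 49 total degrees of freedom in the regression output for the same data. The function name `t_from_r` is hypothetical.

```python
import math

def t_from_r(r, n, rho_0=0.0):
    """t test statistic for a correlation, with df = n - 2."""
    standard_error = math.sqrt((1 - r**2) / (n - 2))
    return (r - rho_0) / standard_error

# r from the quiz average / final exam example; n = 50 is inferred (see lead-in)
t = t_from_r(0.608630, 50)
print(round(t, 2))
```

The result is close to 5.31 on 48 degrees of freedom.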
MinitabExpress – Computing Pearson's r
We previously created a scatterplot of quiz averages and final exam scores and observed a linear relationship. Here, we will compute the correlation between these two variables.
 Open the data set:
 On a PC: Select STATISTICS > Correlation > Correlation
 On a Mac: Select Statistics > Regression > Correlation
 Double click Quiz_Average and Final in the box on the left to insert them into the Variables box
 Click OK
This should result in the following output:
Pearson correlation of Quiz_Average and Final = 0.608630 
P-Value = <0.0001 
Properties of Pearson's r
 \(-1\leq r \leq +1\)
 For a positive association, \(r>0\); for a negative association, \(r<0\); if there is no linear relationship, \(r=0\)
 The closer \(r\) is to 0 the weaker the relationship and the closer to +1 or -1 the stronger the relationship (e.g., \(r=-.88\) is a stronger relationship than \(r=+.60\)); the sign of the correlation provides direction only
 Correlation is unit free; the \(x\) and \(y\) variables do NOT need to be on the same scale (e.g., it is possible to compute the correlation between height in centimeters and weight in pounds)
 It does not matter which variable you label as \(x\) and which you label as \(y\). The correlation between \(x\) and \(y\) is equal to the correlation between \(y\) and \(x\).
The following table may serve as a guideline when evaluating correlation coefficients:
Absolute Value of \(r\)  Strength of the Relationship 

0 to 0.2  Very weak 
0.2 to 0.4  Weak 
0.4 to 0.6  Moderate 
0.6 to 0.8  Strong 
0.8 to 1.0  Very strong 
12.2.1  Hypothesis Testing
In testing the statistical significance of the relationship between two quantitative variables we will use the five step hypothesis testing procedure:
Step 1: Check assumptions and write hypotheses
In order to use Pearson's \(r\), both variables must be quantitative and the relationship between \(x\) and \(y\) must be linear.
Research Question  Is the correlation in the population different from 0?  Is the correlation in the population positive?  Is the correlation in the population negative? 

Null Hypothesis, \(H_{0}\)  \(\rho=0\)  \(\rho= 0\)  \(\rho = 0\) 
Alternative Hypothesis, \(H_{a}\)  \(\rho \neq 0\)  \(\rho > 0\)  \(\rho< 0\) 
Type of Hypothesis Test  Two-tailed, non-directional  Right-tailed, directional  Left-tailed, directional 
Step 2: Calculate the test statistic
Use Minitab Express to compute \(r\).
Step 3: Determine the p-value
Minitab Express will give you the p-value for a two-tailed test (i.e., \(H_a: \rho \neq 0\)). If you are conducting a one-tailed test you will need to divide the p-value in the output by 2, provided the sample correlation is in the direction stated in the alternative hypothesis.
Step 4: Make a decision
If \(p \leq \alpha\), reject the null hypothesis; there is evidence of a relationship in the population.
If \(p>\alpha\), fail to reject the null hypothesis; there is not evidence of a relationship in the population.
Step 5: State a conclusion
Based on your decision in Step 4, write a conclusion in terms of the original research question.
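The p-value adjustment and the decision rule can be written out as a short sketch. The function names and the example numbers below are hypothetical. Halving the two-tailed p-value is only appropriate when the sample correlation falls in the direction named by the alternative hypothesis; the sketch returns the complementary tail otherwise.

```python
def one_tailed_p(two_tailed_p, r, alternative):
    """Convert a two-tailed p-value into a one-tailed p-value.

    alternative is "greater" (H_a: rho > 0) or "less" (H_a: rho < 0).
    """
    if alternative not in ("greater", "less"):
        raise ValueError("alternative must be 'greater' or 'less'")
    in_stated_direction = (r > 0) if alternative == "greater" else (r < 0)
    half = two_tailed_p / 2
    # If r points the "wrong" way, the one-tailed p-value is the other tail.
    return half if in_stated_direction else 1 - half

def decide(p, alpha=0.05):
    """Decision rule: reject H0 when p <= alpha."""
    return "reject H0" if p <= alpha else "fail to reject H0"

# Made-up output: r = 0.35 with a two-tailed p-value of 0.04
p_right = one_tailed_p(0.04, 0.35, "greater")  # 0.02
p_left = one_tailed_p(0.04, 0.35, "less")      # 0.98
print(decide(p_right), "|", decide(p_left))
```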
12.2.1.1  Example: Temperature & Coffee Sales
Data concerning sales at a student-run cafe were retrieved from cafedata.xls; more information about this data set is available at cafedata.txt. Let's determine if there is a statistically significant relationship between the maximum daily temperature and coffee sales.
Maximum daily temperature and coffee sales are both quantitative variables. From the scatterplot below we can see that the relationship is linear.
\(H_0: \rho = 0\)
\(H_a: \rho \neq 0\)
Pearson correlation of Max Daily Temperature (F) and Coffees = 0.741302 
P-Value = <0.0001 
\(r=0.741302\)
\(p<.0001\)
\(p \leq \alpha\) therefore we reject the null hypothesis.
There is evidence of a relationship between the maximum daily temperature and coffee sales in the population.
12.2.1.2  Example: Age & Height
Data concerning body measurements from 507 adults were retrieved from body.dat.txt; for more information see body.txt. In this example we will use the variables of age (in years) and height (in centimeters).
Research question: Is there a relationship between age and height in adults?
Age (in years) and height (in centimeters) are both quantitative variables. From the scatterplot below we can see that the relationship is linear (or at least not nonlinear).
\(H_0: \rho = 0\)
\(H_a: \rho \neq 0\)
From Minitab Express:
Pearson correlation of Height (cm) and Age = 0.067883 
P-Value = 0.1269 
\(r=0.067883\)
\(p=.1269\)
\(p > \alpha\) therefore we fail to reject the null hypothesis.
There is not evidence of a relationship between age and height in the population from which this sample was drawn.
12.2.2  Correlation Matrix
When examining correlations for more than two variables (i.e., more than one pair), correlation matrices are commonly used. In Minitab Express, if you request the correlations between three or more variables at once, your output will contain a correlation matrix with all of the possible pairwise correlations. For each pair of variables, Pearson's \(r\) will be given along with the p-value. The following pages include examples of interpreting correlation matrices.
12.2.2.1  Video Example: Student Survey
This example uses the StudentSurvey.MTW dataset from the Lock^{5} textbook.
12.2.2.2  Example: Body Correlation Matrix
This correlation matrix was constructed using the body dataset. These data are from the Journal of Statistics Education data archive.
Six variables were used: age, weight (kg), height (cm), hip girth, abdominal girth, and wrist girth.
Cell contents grouped by Age, Weight, Height, Hip Girth, and Abdominal Girth; First row: Pearson correlation, Following row: P-Value
Variable  Statistic  Age  Weight (kg)  Height (cm)  Hip Girth  Abdominal Girth 

Weight (kg)  Pearson r  0.207265 
Weight (kg)  P-Value  <0.0001 
Height (cm)  Pearson r  0.067883  0.717301 
Height (cm)  P-Value  0.1269  <0.0001 
Hip Girth  Pearson r  0.227080  0.762969  0.338584 
Hip Girth  P-Value  <0.0001  <0.0001  <0.0001 
Abdominal Girth  Pearson r  0.422188  0.711816  0.313197  0.825892 
Abdominal Girth  P-Value  <0.0001  <0.0001  <0.0001  <0.0001 
Wrist Girth  Pearson r  0.192024  0.816488  0.690834  0.458857  0.435420 
Wrist Girth  P-Value  <0.0001  <0.0001  <0.0001  <0.0001  <0.0001 
This correlation matrix presents 15 different correlations. For each of the 15 pairs of variables, the first entry is the Pearson's \(r\) correlation coefficient and the entry below it is the p-value.
The correlation between age and weight is \(r=0.207265\). This correlation is statistically significant (\(p<0.0001\)). That is, there is evidence of a relationship between age and weight in the population.
The correlation between age and height is \(r=0.067883\). This correlation is not statistically significant (\(p=0.1269\)). There is not evidence of a relationship between age and height in the population.
The correlation between weight and height is \(r=0.717301\). This correlation is statistically significant (\(p<0.0001\)). That is, there is evidence of a relationship between weight and height in the population.
And so on.
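The count of 15 correlations follows from choosing 2 variables out of 6. As a rough sketch of how a table of pairwise correlations can be built, the code below uses three made-up variables (not the body dataset) and hypothetical function names.

```python
from itertools import combinations
from math import comb, sqrt

def pearson_r(x, y):
    """Pearson's r computed from sums of squares and cross-products."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_yy = sum((yi - y_bar) ** 2 for yi in y)
    return s_xy / sqrt(s_xx * s_yy)

def pairwise_correlations(columns):
    """Return {(name_a, name_b): r} for every unordered pair of columns."""
    return {
        (a, b): pearson_r(columns[a], columns[b])
        for a, b in combinations(columns, 2)
    }

# Made-up data with three variables, so comb(3, 2) = 3 pairs
columns = {
    "A": [1, 2, 3, 4, 5],
    "B": [2, 4, 5, 4, 5],
    "C": [5, 3, 4, 2, 1],
}
pairs = pairwise_correlations(columns)
for (a, b), r in pairs.items():
    print(a, b, round(r, 3))

print(comb(6, 2))  # 15 pairs for six variables
```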
12.3  Simple Linear Regression
Recall from Lesson 3 that regression uses one or more explanatory variables (\(x\)) to predict one response variable (\(y\)). In this lesson we will be learning specifically about simple linear regression. The "simple" part is that we will be using only one explanatory variable. If there are two or more explanatory variables, then multiple linear regression is necessary. The "linear" part is that we will be using a straight line to predict the response variable using the explanatory variable.
You may recall from an algebra class that the formula for a straight line is \(y=mx+b\), where \(m\) is the slope and \(b\) is the \(y\)-intercept. The slope is a measure of how steep the line is; in algebra this is sometimes described as "change in \(y\) over change in \(x\)," or "rise over run". A positive slope indicates a line moving from the bottom left to top right. A negative slope indicates a line moving from the top left to bottom right. For every one unit increase in \(x\), the value of \(y\) changes by the value of the slope. The \(y\)-intercept is the location where the line crosses the \(y\)-axis; this is the value of \(y\) when \(x\) equals 0.
In statistics, we use a similar formula:
 Simple Linear Regression Line in a Sample
 \(\widehat{y}=b_0 +b_1 x\)

\(\widehat{y}\) = predicted value of \(y\) for a given value of \(x\)
\(b_0\) = \(y\)-intercept
\(b_1\) = slope
In the population, the \(y\)-intercept is denoted as \(\beta_0\) and the slope is denoted as \(\beta_1\).
Some textbooks and statisticians use slightly different notation. For example, you may see either of the following notations used:
\(\widehat{y}=\widehat{\beta}_0+\widehat{\beta}_1 x \;\;\; \text{or} \;\;\; \widehat{y}=a+b x\)
Note that in all of the equations above, the \(y\)-intercept is the value that stands alone and the slope is the value attached to \(x\).
Example: Interpreting the Equation for a Line
The plot below shows the line \(\widehat{y}=6.5+1.8x\)
Here, the \(y\)-intercept is 6.5. This means that when \(x=0\) then the predicted value of \(y\) is 6.5.
The slope is 1.8. For every one unit increase in \(x\), the predicted value of \(y\) increases by 1.8.
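The two interpretations above can be checked directly. The following is a small sketch using the line \(\widehat{y}=6.5+1.8x\) from this example; the function name `predict` is hypothetical.

```python
def predict(x, b0=6.5, b1=1.8):
    """Predicted value of y for a given x on the line y-hat = b0 + b1 * x."""
    return b0 + b1 * x

at_zero = predict(0)                    # the y-intercept
one_unit_step = predict(4) - predict(3)  # change in y-hat per one unit of x
print(at_zero, round(one_unit_step, 1))
```

The prediction at \(x=0\) is the intercept, and the difference between predictions one unit apart is the slope, whichever starting value of \(x\) is chosen.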
Example: Interpreting the Regression Line Predicting Weight with Height
Data were collected from a random sample of World Campus STAT 200 students. The plot below shows the regression line \(\widehat{weight}=-150.950+4.854(height)\)
Here, the \(y\)-intercept is -150.950. This means that an individual who is 0 inches tall would be predicted to weigh -150.950 pounds. In this particular scenario this intercept does not have any real applicable meaning because our range of heights is about 50 to 80 inches. We would never use this model to predict the weight of someone who is 0 inches tall. What we are really interested in here is the slope.
The slope is 4.854. For every one inch increase in height, the predicted weight increases by 4.854 pounds.
Review: Key Terms
In the next sections you will learn how to construct and test for the statistical significance of a simple linear regression model. But first, let's review some key terms:
 Explanatory variable
Variable that is used to explain variability in the response variable, also known as an independent variable or predictor variable; in an experimental study, this is the variable that is manipulated by the researcher.
 Response variable
The outcome variable, also known as a dependent variable.
 Simple linear regression
A method for predicting one response variable using one explanatory variable and a constant (i.e., the \(y\)-intercept).
 \(y\)-intercept
The point on the \(y\)-axis where a line crosses (i.e., value of \(y\) when \(x = 0\)); in regression, also known as the constant.
 Slope
A measure of the direction (positive or negative) and steepness of a line; for every one unit increase in \(x\), the predicted value of \(y\) changes by the value of the slope.
12.3.1  Formulas & Assumptions
Simple linear regression uses data from a sample to construct the line of best fit. But what makes a line “best fit”? The most common method of constructing a regression line, and the method that we will be using in this course, is the least squares method. The least squares method computes the values of the \(y\)-intercept and slope that make the sum of the squared residuals as small as possible. A residual is the difference between the actual value of \(y\) and the value of \(y\) predicted by \(\widehat{y}=b_0+b_1 x\). Residuals are symbolized by \(\varepsilon \) (“epsilon”) in a population and \(e\) or \(\widehat{\varepsilon }\) in a sample.
As with most predictions, you expect there to be some error. In other words, you expect the prediction to not be exactly correct. For example, when predicting the percent of voters who selected your candidate, you would expect the prediction to be close to, but not necessarily equal to, the final voting percentage. Also, in regression, usually not every individual with the same \(x\) value has the same \(y\) value. For example, if we are using height to predict weight, not every person with the same height would have the same weight. These errors in regression predictions are called residuals or prediction errors. The residuals are calculated by taking the observed \(y\) value minus its corresponding predicted \(y\) value. Therefore, each individual has a residual. The goal in least squares regression is to select the line that minimizes the sum of the squared residuals. In essence, we create a best fit line that has the least amount of error.
 Residual
 \(e_i =y_i - \widehat{y}_i\)

\(y_i\) = actual value of y for the ith observation
\(\widehat{y}_i\) = predicted value of y for the ith observation
 Sum of Squared Residuals

Also known as Sum of Squared Errors (SSE)
\(SSE=\sum (y-\widehat{y})^2\)
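To make the residual and SSE definitions concrete, here is a minimal sketch with made-up data. The line \(\widehat{y}=2.2+0.6x\) is the least squares line for these five points, and \(\widehat{y}=2.0+0.7x\) is an arbitrary alternative included only for comparison; all names are hypothetical.

```python
def residuals(x, y, b0, b1):
    """Observed y minus predicted y for each observation."""
    return [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

def sse(x, y, b0, b1):
    """Sum of squared residuals (sum of squared errors)."""
    return sum(e ** 2 for e in residuals(x, y, b0, b1))

# Made-up example data (not from the course data sets)
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

sse_least_squares = sse(x, y, 2.2, 0.6)  # least squares line for this data
sse_other = sse(x, y, 2.0, 0.7)          # an arbitrary competing line
print(round(sse_least_squares, 2), round(sse_other, 2))  # 2.4 2.55
```

The competing line has the larger sum of squared residuals, which is what "least squares" means: no other straight line gives a smaller SSE for the same data.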
Recall, the equation for a simple linear regression line is \(\widehat{y}=b_0+b_1x\) where \(b_0\) is the \(y\)-intercept and \(b_1\) is the slope.
Statistical software will compute the values of the \(y\)-intercept and slope that minimize the sum of squared residuals. The conceptual formulas below show how these statistics are related to one another and how they relate to correlation, which you learned about earlier in this lesson. In this course we will always be using Minitab Express to compute these values.
 Slope
 \(b_1 =r \frac{s_y}{s_x}\)

\(r\) = Pearson’s correlation coefficient between \(x\) and \(y\)
\(s_y\) = standard deviation of \(y\)
\(s_x\) = standard deviation of \(x\)
 \(y\)-intercept
 \(b_0=\overline {y} - b_1 \overline {x}\)

\(\overline {y}\) = mean of \(y\)
\(\overline {x}\) = mean of \(x\)
\(b_1\) = slope
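The slope and intercept formulas can be applied directly once \(r\), the two standard deviations, and the two means are known. The sketch below uses made-up data and a hypothetical function name; it is meant only to show how the pieces fit together.

```python
from statistics import mean, stdev

def least_squares_line(x, y):
    """Return (b0, b1) using b1 = r * s_y / s_x and b0 = y_bar - b1 * x_bar."""
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    s_x, s_y = stdev(x), stdev(y)  # sample standard deviations
    # Pearson's r from deviations about the means
    r = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / ((n - 1) * s_x * s_y)
    b1 = r * s_y / s_x
    b0 = y_bar - b1 * x_bar
    return b0, b1

# Made-up example data (not from the course data sets)
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
b0, b1 = least_squares_line(x, y)
print(round(b0, 2), round(b1, 2))  # 2.2 0.6
```

Because \(b_0=\overline{y}-b_1\overline{x}\), the fitted line always passes through the point \((\overline{x}, \overline{y})\).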
In order to use the methods above, there are four assumptions that must be met:
 Linearity: The relationship between \(x\) and \(y\) must be linear. Check this assumption by examining a scatterplot of \(x\) and \(y\).
 Independence of errors: There is not a relationship between the residuals and the \(y\) variable; in other words, \(y\) is independent from errors. Check this assumption by examining a scatterplot of “residuals versus fits”; the correlation should be approximately 0.
 Normality of errors: The residuals must be approximately normally distributed. Check this assumption by examining a normal probability plot; the observations should be near the line. You can also examine a histogram of the residuals; it should be approximately normally distributed.
 Equal variances: The variance of the residuals is the same for all values of \(x\). Check this assumption by examining the scatterplot of “residuals versus fits”; the variance of the residuals should be the same across all values of the \(x\)axis. If the plot shows a pattern (e.g., bowtie or megaphone shape), then variances are not consistent and this assumption has not been met.
Example: Checking Assumptions
The following example uses students' scores on two tests.
 Linearity. The scatterplot below shows that the relationship between Test 3 and Test 4 scores is linear.
 Independence of errors. The plot of residuals versus fits is shown below. The correlation shown in this scatterplot is approximately \(r=0\), thus this assumption has been met.
 Normality of errors. On the normal probability plot we are looking to see if our observations follow the given line. This tells us that the distribution of residuals is approximately normal. We could also look at the second graph which is a histogram of the residuals; here we see that the distribution of residuals is approximately normal.
 Equal variance. Again we will use the plot of residuals versus fits. Now we are checking that the variance of the residuals is consistent across all fitted values.
The next section will show you how to construct simple linear regression equations using statistical software. The graphs shown above can be obtained when running the regression model using Minitab Express.
Before we continue, let’s review a few of the new terms:
 Least squares method
 Method of constructing a regression line which makes the sum of squared residuals as small as possible for the given data.
 Residual
 Actual value minus predicted value (i.e., \(e=y - \widehat{y}\)); vertical distance between the actual \(y\) value and the regression line.
 Sum of squared residuals
 The sum of all of the residuals squared: \(\sum (y-\widehat{y})^2\).
12.3.2  Minitab Express  Simple Linear Regression
MinitabExpress – Obtaining Simple Linear Regression Output
We previously created a scatterplot of quiz averages and final exam scores and observed a linear relationship. Here, we will use quiz scores to predict final exam scores.
 Open the data set:
 On a PC or Mac: Select STATISTICS > Regression > Simple Regression
 Double click Final in the box on the left to insert it into the Response (Y) box on the right
 Double click Quiz_Average in the box on the left to insert it into the Predictor (X) box on the right
 Under the Graphs tab, click the box for Residual plots
 Click OK
This should result in the following output:
Source  DF  Adj SS  Adj MS  F-Value  P-Value 

Regression  1  2663.66  2663.66  28.24  <0.0001 
Error  48  4527.06  94.31  
Total  49  7190.72 
S  R-sq  R-sq(adj) 

9.71152  37.04%  35.73% 
Term  Coef  SE Coef  T-Value  P-Value 

Constant  12.12  11.94  1.01  0.3153 
Quiz_Average  0.7513  0.1414  5.31  <0.0001 
Final = 12.12 + 0.7513 Quiz_Average 
Obs  Final  Fit  Resid  Std Resid  

11  49  70.4975  -21.4975  -2.25  R 
40  80  61.2158  18.7842  2.03  R 
47  37  59.5050  -22.5050  -2.46  R 
R Large residual
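Several numbers in this output are tied together by simple arithmetic, and checking them is a good way to learn to read the tables. The sketch below copies values from the output above; the variable names are hypothetical, and small differences from the printed values are due to rounding in the output.

```python
import math

# Values copied from the analysis of variance and coefficient tables
ss_regression, ss_error, ss_total = 2663.66, 4527.06, 7190.72
df_regression, df_error = 1, 48
coef_slope, se_slope = 0.7513, 0.1414

ms_regression = ss_regression / df_regression  # Adj MS for Regression
ms_error = ss_error / df_error                 # Adj MS for Error
f_value = ms_regression / ms_error             # F-Value
s = math.sqrt(ms_error)                        # S, the standard error of the regression
r_sq = ss_regression / ss_total                # R-sq as a proportion
t_slope = coef_slope / se_slope                # T-Value for Quiz_Average

print(round(f_value, 2), round(s, 3), round(100 * r_sq, 2), round(t_slope, 2))
```

In simple linear regression the square of the slope's \(t\) statistic equals the \(F\) statistic, which is why both rows report the same p-value.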
On the next page you will learn how to test for the statistical significance of the slope.
12.3.3  Hypothesis Testing
We can use statistical inference (i.e., hypothesis testing) to draw conclusions about how the population of \(y\) values relates to the population of \(x\) values, based on the sample of \(x\) and \(y\) values.
The equation \(Y=\beta_0+\beta_1 x\) describes this relationship in the population. Within this model there are two parameters that we use sample data to estimate: the \(y\)-intercept (\(\beta_0\) estimated by \(b_0\)) and the slope (\(\beta_1\) estimated by \(b_1\)). We can use the five step hypothesis testing procedure to test for the statistical significance of each separately. Note, typically we are only interested in testing for the statistical significance of the slope because a significant result is evidence that \(\beta_1 \neq 0\), which means that \(x\) can be used to predict \(y\). When \(\beta_1 = 0\) then the line of best fit is a straight horizontal line and having information about \(x\) does not change the predicted value of \(y\); in other words, \(x\) does not help us to predict \(y\). If the value of the slope is anything other than 0, then the predicted value of \(y\) will be different for different values of \(x\), and having \(x\) helps us to better predict \(y\).
We are usually not concerned with the statistical significance of the \(y\)-intercept unless there is some theoretical meaning to \(\beta_0 \neq 0\). Below you will see how to test the statistical significance of the slope and how to construct a confidence interval for the slope; the procedures for the \(y\)-intercept would be the same.
Step 1: Check assumptions and write hypotheses
The assumptions of simple linear regression are linearity, independence of errors, normality of errors, and equal error variance. You should check all of these assumptions before proceeding.
Research Question  Is the slope in the population different from 0?  Is the slope in the population positive?  Is the slope in the population negative? 

Null Hypothesis, \(H_{0}\)  \(\beta_1 =0\)  \(\beta_1= 0\)  \(\beta_1= 0\) 
Alternative Hypothesis, \(H_{a}\)  \(\beta_1\neq 0\)  \(\beta_1> 0\)  \(\beta_1< 0\) 
Type of Hypothesis Test  Two-tailed, non-directional  Right-tailed, directional  Left-tailed, directional 
Step 2: Calculate the test statistic
Minitab Express will compute the \(t\) test statistic:
\(t=\frac{b_1}{SE(b_1)}\) where \(SE(b_1)=\sqrt{\frac{\frac{\sum (e^2)}{n-2}}{\sum (x - \overline{x})^2}}\)
Step 3: Determine the p-value
Minitab Express will compute the p-value for the non-directional hypothesis \(H_a: \beta_1 \neq 0 \)
If you are conducting a one-tailed test you will need to divide the p-value in the Minitab Express output by 2, provided the sample slope is in the direction stated in the alternative hypothesis.
Step 4: Make a decision
If \(p\leq \alpha\) reject the null hypothesis. If \(p>\alpha\) fail to reject the null hypothesis.
Step 5: State a conclusion
Based on your decision in Step 4, write a conclusion in terms of the original research question.
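The test statistic for the slope can be reproduced from raw data. This is a minimal sketch with made-up data and a hypothetical function name; the slope is computed as \(\sum (x-\overline{x})(y-\overline{y}) / \sum (x-\overline{x})^2\), which is algebraically the same as \(r \frac{s_y}{s_x}\).

```python
import math

def slope_t_statistic(x, y):
    """Return (b1, SE(b1), t) for a simple linear regression of y on x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = s_xy / s_xx                 # same value as r * s_y / s_x
    b0 = y_bar - b1 * x_bar
    sum_sq_resid = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    se_b1 = math.sqrt((sum_sq_resid / (n - 2)) / s_xx)
    return b1, se_b1, b1 / se_b1

# Made-up example data (not from the course data sets)
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
b1, se_b1, t = slope_t_statistic(x, y)
print(round(b1, 3), round(se_b1, 4), round(t, 3))  # 0.6 0.2828 2.121
```

For this data the slope's \(t\) statistic matches the value given by the correlation formula in 12.2, \(t=\frac{r}{\sqrt{(1-r^2)/(n-2)}}\); the two tests are equivalent in simple linear regression.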
12.3.3.1  Example: Business Decisions
A student-run cafe wants to use data to determine how many wraps they should make today. If they make too many wraps they will have waste. But, if they don't make enough wraps they will lose out on potential profit. They have been collecting data concerning their daily sales as well as data concerning the daily temperature. They found that there is a statistically significant relationship between daily temperature and coffee sales. So, the students want to know if a similar relationship exists between daily temperature and wrap sales. The video below will walk you through the process of using simple linear regression to determine if daily temperature can be used to predict wrap sales. The screenshots and annotations below the video will walk you through these steps again.
Data concerning sales at a student-run cafe were obtained from a Journal of Statistics Education article. Data were retrieved from cafedata.xls; more information about this data set is available at cafedata.txt.
Research question:
Can daily temperature be used to predict wrap sales?
 \(H_0: \beta_1 =0\)
 \(H_a: \beta_1 \neq 0\)
The scatterplot below shows that the relationship between maximum daily temperature and wrap sales is linear (or at least it's not nonlinear), though the relationship appears to be weak.
The plot of residuals versus fits below can be used to check the assumptions of independent errors and equal error variances. There is not a significant correlation between the residuals and fits, therefore the assumption of independent errors has been met. The variance of the residuals is relatively consistent for all fitted values, therefore the assumption of equal error variances has been met.
Finally, we must check for the normality of errors. We can use the normal probability plot below to check that our data points fall near the line. Or, we can use the histogram of residuals below to check that the errors are approximately normally distributed.
Now that we have checked all of the assumptions of simple linear regression, we can examine the regression model.
Source  DF  Adj SS  Adj MS  F-Value  P-Value 

Regression  1  16.41  16.4053  0.47  0.4961 
Max Daily Temperature (F)  1  16.41  16.4053  0.47  0.4961 
Error  45  1567.55  34.8345  
Lack-of-Fit  24  875.17  36.4654  1.11  0.4106 
Pure Error  21  692.38  32.9706  
Total  46  1583.96 
S  R-sq  R-sq(adj)  R-sq(pred) 

5.90208  1.04%  0.00%  0.00% 
Term  Coef  SE Coef  T-Value  P-Value  VIF 

Constant  11.418  2.665  4.29  <0.0001  
Max Daily Temperature (F)  0.04139  0.06032  0.69  0.4961  1.00 
Wraps Sold = 11.418 + 0.04139 Max Daily Temperature (F) 
\(t = 0.69\)
\(p=0.4961\)
\(p > \alpha\), fail to reject the null hypothesis
There is not evidence that maximum daily temperature can be used to predict the number of wraps sold in the population of all days.
12.3.4  Confidence Interval for Slope
We can use the slope that was computed from our sample to construct a confidence interval for the population slope (\(\beta_1\)). This confidence interval follows the same general form that we have been using:
 General Form of a Confidence Interval
 \(sample\ statistic \pm (multiplier)(standard\ error)\)
 Confidence Interval of \(\beta_1\)
 \(b_1 \pm t^\ast (SE_{b_1})\)

\(b_1\) = sample slope
\(t^\ast\) = value from the \(t\) distribution with \(df=n-2\)
\(SE_{b_1}\) = standard error of \(b_1\)
Example: Confidence Interval of \(\beta_1\)
Below is the Minitab Express output for a regression model using Test 3 scores to predict Test 4 scores. Let's construct a 95% confidence interval for the slope.
Term  Coef  SE Coef  T-Value  P-Value  VIF 

Constant  16.37  12.40  1.32  0.1993  
Test 3  0.8034  0.1360  5.91  <0.0001  1.00 
From the Minitab Express output, we can see that \(b_1=0.8034\) and \(SE(b_1)=0.1360\)
We must construct a \(t\) distribution to look up the appropriate multiplier. There are \(n-2\) degrees of freedom.
\(df=26-2=24\)
\(t_{24,\;.05/2}=2.064\)
\(b_1 \pm t^\ast \times SE(b_1)\)
\(0.8034 \pm 2.064 (0.1360) = 0.8034 \pm 0.2807 = [0.523,\;1.084]\)
We are 95% confident that \(0.523 \leq \beta_1 \leq 1.084 \)
In other words, we are 95% confident that in the population the slope is between 0.523 and 1.084. For every one point increase in Test 3 the predicted value of Test 4 increases between 0.523 and 1.084 points.
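The interval arithmetic above can be reproduced in a few lines. The sketch takes the multiplier \(t^\ast=2.064\) as given from the \(t\) distribution with 24 degrees of freedom rather than computing it; the function name is hypothetical.

```python
def slope_confidence_interval(b1, se_b1, t_star):
    """Confidence interval for the population slope: b1 +/- t* x SE(b1)."""
    margin = t_star * se_b1
    return b1 - margin, b1 + margin

# Values from the Test 3 / Test 4 output; t* is looked up, not computed here
lower, upper = slope_confidence_interval(0.8034, 0.1360, 2.064)
print(round(lower, 3), round(upper, 3))  # 0.523 1.084
```

Because the interval does not contain 0, it agrees with the small p-value reported for the slope in the output.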
12.4  Coefficient of Determination
The proportion of variation in the response variable that can be explained by (i.e., accounted for by) the explanatory variable is denoted by \(R^2\). This is known as the coefficient of determination or R-squared.
Example: \(R^2\) From Output
S  R-sq  R-sq(adj)  R-sq(pred) 

9.71152  37.04%  35.73%  29.82% 
In our Exam Data example this value is 37.04% meaning that 37.04% of the variation in the final exam scores can be explained by quiz averages.
In simple linear regression, the coefficient of determination is equal to the correlation coefficient squared. In other words \(R^2=(Pearson's\;r)^2\)
Example: \(R^2\) From Pearson's r
The correlation between quiz averages and final exam scores was \(r=.608630\)
Coefficient of determination: \(R^2=.608630^2=.3704\)
Pearson correlation of Quiz_Average and Final = 0.608630 
P-Value = <0.0001 
When going from \(r\) to \(R^2\) you can simply square the correlation coefficient. \(R^2\) will always be a value between 0 and 1, inclusive. When going from \(R^2\) to \(r\), in addition to computing \(\sqrt{R^2}\), the direction of the relationship must also be taken into account. If the relationship is positive then the correlation will be positive. If the relationship is negative then the correlation will be negative.
Examples: From \(R^2\) to \(r\)
Quiz Averages and Final Exam Scores
There is a direct (i.e., positive) relationship between quiz averages and final exam scores. The coefficient of determination (\(R^2\)) is 37.04%. What is the correlation between quiz averages and final exam scores?
\(r=\pm\sqrt{R^2}=\pm\sqrt{.3704}= \pm .6086\)
The correlation is \(r=+.6086\) because we are told that there is a positive relationship between the two variables.
Daily High Temperatures and Hot Chocolate Sales
As the daily high temperature decreases, hot chocolate sales increase at a restaurant. 49% of the variance in hot chocolate sales can be attributed to variance in daily high temperatures. What is the correlation between daily high temperatures and hot chocolate sales?
\(R^2=.49\), so \(r=\pm\sqrt{.49}= \pm .7\)
Because there is an inverse (i.e., negative) relationship between the two variables, the correlation is negative. The correlation between daily high temperatures and hot chocolate sales is \(r=-.7\).
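The two examples above follow the same two steps: take the square root of \(R^2\), then attach the sign that matches the direction of the relationship. A minimal sketch of that rule is shown here; the function name `r_from_r_squared` and its `positive_relationship` parameter are hypothetical and exist only for this illustration.

```python
import math

def r_from_r_squared(r_squared, positive_relationship):
    """Recover Pearson's r from R^2 given the direction of the relationship."""
    if not 0 <= r_squared <= 1:
        raise ValueError("R^2 must be between 0 and 1")
    magnitude = math.sqrt(r_squared)
    # The square root gives only the strength; the direction supplies the sign.
    return magnitude if positive_relationship else -magnitude

# Quiz averages and final exam scores: positive relationship, R^2 = 37.04%
print(round(r_from_r_squared(0.3704, positive_relationship=True), 4))  # 0.6086

# Daily high temperatures and hot chocolate sales: negative relationship, R^2 = 49%
print(round(r_from_r_squared(0.49, positive_relationship=False), 2))   # -0.7
```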
12.5  Cautions
Here we will examine a few important issues related to correlation and regression: the impact of outliers, extrapolation, and the interpretation of causation.
Influence of Outliers
An outlier may decrease or increase a correlation value. Below, in the first plot there are no outliers. The second and third plots each have one outlier. Depending on the location of the outlier, the correlation could be decreased or increased.
When the point (90, 0) was added, the correlation decreased from r = 0.825 to r = 0.343. This decrease occurred because the outlier was not in line with the pattern of the other points.
When the point (0, 0) was added, the correlation increased from r = 0.825 to r = 0.972. This increase occurred because the outlier was in line with the pattern of the other points.
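The same effect can be reproduced with a small made-up data set. The sketch below is an illustration only: the `x` and `y` values are hypothetical and are not the data behind the plots described above, so the correlations it prints will differ from 0.825, 0.343, and 0.972. The direction of each change is what matters.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# HYPOTHETICAL data with a positive linear trend and no outliers.
x = [50, 55, 60, 65, 70, 75, 80, 85]
y = [55, 50, 68, 60, 75, 66, 85, 78]

r_base = pearson_r(x, y)

# Outlier that is NOT in line with the pattern: large x, very small y.
r_off_line = pearson_r(x + [90], y + [0])

# Outlier that IS in line with the pattern: small x, small y.
r_in_line = pearson_r(x + [0], y + [0])

print(f"no outlier:          r = {r_base:.3f}")
print(f"with point (90, 0):  r = {r_off_line:.3f}")  # lower than r_base
print(f"with point (0, 0):   r = {r_in_line:.3f}")   # higher than r_base
```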
Extrapolation
A regression equation should not be used to make predictions for values that are far from those that were used to construct the model or for those that come from a different population. This misuse of regression is known as extrapolation.
For example, the regression line below was constructed using data from adults who were between 147 and 198 centimeters tall. It would not be appropriate to use this regression model to predict the weight of a child. First, children are a different population and were not included in the sample that was used to construct this model. Second, the height of a child will likely not fall within the range of heights used to construct this regression model. If we wanted to use height to predict weight in children, we would need to obtain a sample of children and construct a new model.
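One way to guard against the range half of this problem is to refuse predictions for heights outside the interval used to fit the model. The sketch below assumes hypothetical coefficients, because the lesson does not report the fitted equation; only the 147 to 198 centimeter range comes from the example above. The function name `predict_weight_kg` is also made up for this illustration.

```python
# Range of adult heights (cm) used to construct the model, as stated in the lesson.
HEIGHT_MIN_CM, HEIGHT_MAX_CM = 147, 198

# HYPOTHETICAL coefficients for illustration only; the lesson does not report them.
INTERCEPT_KG = -100.0
SLOPE_KG_PER_CM = 1.0

def predict_weight_kg(height_cm):
    """Predict weight from height, refusing to extrapolate beyond the fitted range."""
    if not HEIGHT_MIN_CM <= height_cm <= HEIGHT_MAX_CM:
        raise ValueError(
            f"height {height_cm} cm is outside the range "
            f"{HEIGHT_MIN_CM}-{HEIGHT_MAX_CM} cm used to fit the model"
        )
    return INTERCEPT_KG + SLOPE_KG_PER_CM * height_cm

print(predict_weight_kg(170))  # 70.0 under the hypothetical coefficients

try:
    predict_weight_kg(110)  # a typical child's height
except ValueError as error:
    print("Refused:", error)
```

A range check like this cannot detect the other problem, a case drawn from a different population whose value happens to fall inside the range. That judgment still has to be made by the analyst.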
Interpretation of Causation
Recall from earlier in the course, correlation does not equal causation. To establish causation one must rule out the possibility of lurking variables. The best method to accomplish this is through a solid design of your experiment, preferably one that uses a control group and randomization methods.
For example, consider smoking cigarettes and lung cancer. Does smoking cause lung cancer? Initially this was answered as yes, but this was based on a strong correlation between smoking and lung cancer. Not until scientific research verified that smoking can lead to lung cancer was causation established. If you were to review the history of cigarette warning labels, the first mandated label only mentioned that smoking was hazardous to your health. Not until 1981 did the label mention that smoking causes lung cancer.
12.6  Video Example: March Madness
Have you ever filled out a March Madness bracket? For those of you who are not familiar with March Madness, it is the NCAA basketball tournament that takes place each spring in the month of March. It’s so big that the President even completes a bracket! If you have completed a bracket, how did you pick your teams? What was the reasoning that you used to fill it out? Here we will look at one way to use statistics to inform our decisions!
If you would like to work through this example on your own, the data can be found in the following file:
Let's use NCAA tournament points per game (PPG) as the response variable and regular season PPG as the explanatory variable. Here, we are assuming that a team with higher tournament PPG would win the tournament games. We have two variables that are quantitative, making simple linear regression the appropriate analysis tool.
Data were collected from the 2014-15 season. The video below walks through the analyses presented here. Below the video is a review of this process with a bit more detail.
Before using linear methods we should examine a scatterplot to ensure that the relationship is linear (as opposed to nonlinear). The scatterplot below shows a linear, though weak, relationship between regular season and NCAA Tournament PPG.
We can compute the correlation coefficient to learn more about the relationship between these two variables.
Pearson correlation of NCAA Tournament PPG and Regular Season PPG = 0.360180
P-Value = 0.0026
We see that \(r=0.360180\). This is a moderately weak correlation. The \(p\)-value is low (\(p=0.0026\)) for the null hypothesis that \(\rho=0\), so this is a statistically significant correlation. In other words, we conclude that in the population (assuming this is a representative sample) the correlation between regular season and NCAA Tournament PPG is different from 0.
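The test statistic behind this \(p\)-value can be reproduced by hand with the standard formula \(t=\frac{r\sqrt{n-2}}{\sqrt{1-r^2}}\), which has \(n-2\) degrees of freedom. The sketch below assumes \(n=68\), which is inferred from the total degrees of freedom (67) in the ANOVA table reported later in this example rather than stated directly. The function name `t_statistic_for_r` is hypothetical.

```python
import math

def t_statistic_for_r(r, n):
    """t statistic for testing H0: rho = 0, with n - 2 degrees of freedom."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

r = 0.360180
n = 68  # ASSUMPTION: inferred from total DF = 67 in the ANOVA table

t = t_statistic_for_r(r, n)
print(round(t, 2))  # 3.14
```

Converting this \(t\) statistic to a \(p\)-value requires the \(t\) distribution with 66 degrees of freedom, which statistical software such as Minitab Express handles.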
Next, we can construct a regression equation for predicting NCAA Tournament PPG using regular season PPG.
Before we can interpret regression output, we should check the assumptions of linear regression (LINE). We already checked the assumption of a linear relationship by examining the scatterplot of the two variables. We can use the plot of residuals versus fits below to check the assumptions of independent errors and equal error variances. Here, we see that there is no correlation between the residuals and fitted values. In addition, the variances of the residuals are approximately equal across all fitted values.
The final assumption that we must check is the normality of residuals. Using the normal probability plot or a histogram of the residuals we see that the residuals are approximately normally distributed. All assumptions have been met and it is appropriate to use linear regression methods with these data.
The ANOVA source table gives us information about the entire model. The \(p\)-value for the model is 0.0026. Because this is simple linear regression, this is the same \(p\)-value that we found earlier when we examined the correlation and the same \(p\)-value that we see below in the test of the statistical significance for the slope. Our \(R^2\) value is 0.1297, which tells us that 12.97% of the variance in NCAA Tournament PPG can be explained by regular season PPG. This is a relatively low \(R^2\) value.
Source  DF  Adj SS  Adj MS  F-Value  P-Value
Regression  1  598.32  598.320  9.84  0.0026
Regular Season PPG  1  598.32  598.320  9.84  0.0026
Error  66  4013.74  60.814
Lack-of-Fit  58  3795.70  65.443  2.40  0.0935
Pure Error  8  218.03  27.254
Total  67  4612.06
S  R-sq  R-sq(adj)  R-sq(pred)
7.79835  12.97%  11.65%  5.97%
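The model summary values can be recovered from the sums of squares in the ANOVA table. The following sketch copies the numbers from the output above and applies the standard definitions: \(R^2\) is the regression sum of squares divided by the total sum of squares, \(S\) is the square root of the error mean square, and \(F\) is the regression mean square divided by the error mean square. The variable names are chosen for this illustration only.

```python
import math

# Values copied from the ANOVA table in this example.
ss_regression, ss_error, ss_total = 598.32, 4013.74, 4612.06
df_regression, df_error, df_total = 1, 66, 67

ms_regression = ss_regression / df_regression
ms_error = ss_error / df_error

r_sq = ss_regression / ss_total                        # coefficient of determination
r_sq_adj = 1 - (1 - r_sq) * df_total / df_error        # adjusted for model size
f_value = ms_regression / ms_error                     # overall F statistic
s = math.sqrt(ms_error)                                # standard error of the regression

print(f"R-sq      = {r_sq:.2%}")      # 12.97%
print(f"R-sq(adj) = {r_sq_adj:.2%}")  # 11.65%
print(f"F-Value   = {f_value:.2f}")   # 9.84
print(f"S         = {s:.3f}")         # 7.798
```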
Term  Coef  SE Coef  T-Value  P-Value  VIF
Constant  21.54  14.18  1.52  0.1336
Regular Season PPG  0.6197  0.1976  3.14  0.0026  1.00
NCAA Tournament PPG = 21.54 + 0.6197 Regular Season PPG 
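To show how the fitted equation is used, the sketch below computes a predicted value and a residual for one made-up team, and reproduces the slope's \(t\) statistic as the coefficient divided by its standard error. The team's values (70 regular season PPG, 68 tournament PPG) are hypothetical and are assumed to fall within the range of the observed data; the function names are also invented for this illustration.

```python
# Coefficients copied from the regression output in this example.
INTERCEPT = 21.54
SLOPE = 0.6197
SE_SLOPE = 0.1976

def predict_tournament_ppg(regular_season_ppg):
    """Predicted NCAA Tournament PPG from the fitted regression equation."""
    return INTERCEPT + SLOPE * regular_season_ppg

def residual(observed_tournament_ppg, regular_season_ppg):
    """Residual = observed value minus predicted value."""
    return observed_tournament_ppg - predict_tournament_ppg(regular_season_ppg)

# HYPOTHETICAL team: 70 PPG in the regular season, 68 PPG in the tournament.
predicted = predict_tournament_ppg(70)
team_residual = residual(68, 70)
print(round(predicted, 3))      # 64.919
print(round(team_residual, 3))  # 3.081 (the team scored more than predicted)

# The slope's t statistic is the coefficient divided by its standard error.
t_slope = SLOPE / SE_SLOPE
print(round(t_slope, 2))        # 3.14
```

A positive residual means the observed value is above the regression line; a negative residual means it is below.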
While there is a statistically significant relationship between regular season PPG and NCAA Tournament PPG, the \(R^2\) value is relatively low. At this point we would probably go back and revise our theory. We may want to choose a different variable to predict NCAA Tournament PPG. Or, we may want to add additional variables to our current model by using multiple linear regression methods.
12.7  Lesson 12 Summary
Objectives
 Construct a scatterplot using Minitab Express and interpret it
 Identify the explanatory and response variables in a given scenario
 Identify situations in which correlation or regression analyses are appropriate
 Compute Pearson r using Minitab Express, interpret it, and test for its statistical significance
 Construct a simple linear regression model (i.e., y-intercept and slope) using Minitab Express, interpret it, and test for its statistical significance
 Compute and interpret a residual given a simple linear regression model
 Compute and interpret the coefficient of determination (\(R^2\))
 Explain how outliers can influence correlation and regression analyses
 Explain why extrapolation is inappropriate
In this lesson you learned how to test for the statistical significance of Pearson's \(r\) and the slope of a simple linear regression line using the \(t\) distribution to approximate the sampling distribution. You were also introduced to the coefficient of determination.