Let's get started! Here is what you will learn in this lesson.
Upon completion of this lesson, you should be able to do the following:
Previously we considered the distribution of a single quantitative variable. Now we will study the relationship between two variables, where both variables are qualitative (i.e. categorical) or both are quantitative. When we consider the relationship between two variables, there are three possibilities:
Understand that some categorical variables exist naturally (e.g. a person's race, political party affiliation, or class standing), while others are created by grouping a quantitative variable (e.g. taking height and creating the groups Short, Medium, and Tall). We analyze categorical data by recording the counts or percentages of cases occurring in each category. Although you can compare several categorical variables, we are only going to consider the relationship between two such variables.
The Class Survey data set (CLASS_SURVEY.MTW or CLASS_SURVEY.XLS) consists of student responses to a survey given last semester in a STAT 200 course. We can construct a two-way table showing the relationship between Smoke Cigarettes (row variable) and Gender (column variable) using either Minitab or SPSS.
To create a two-way table in Minitab:
To create a two-way table in Minitab Express:
The marginal distribution along the bottom (the row labeled All) gives the distribution of Gender alone (disregarding Smoke Cigarettes). The marginal distribution on the right (the column labeled All) is for Smoke Cigarettes alone (disregarding Gender). Since more females (127) than males (99) participated in the survey, we should report percentages instead of counts in order to compare the cigarette-smoking behavior of females and males. This gives the conditional distribution of Smoke Cigarettes given Gender, which suggests we are treating Gender as an explanatory variable (i.e. a variable that we use to explain what is happening with another variable). These conditional percentages are calculated by dividing the count for each level of Smoke Cigarettes (No, Yes) by the total for each level of Gender (Female, Male). For example, the conditional percentage of No given Female is found by 120/127 = 94.5%.
We can calculate these conditional percentages using either Minitab or SPSS:
To calculate these conditional percentages using Minitab:
To calculate these conditional percentages using Minitab Express:
Although you do not need the counts, having them visible helps show how the conditional probabilities of smoking behavior within gender are calculated. We can see from this display that the 94.49% conditional probability of No Smoking given that the Gender is Female is found by dividing the number who are both No and Female (count of 120) by the number of Females (count of 127). The Cell Contents legend tells you what is displayed in each cell: the top value is Count and the bottom value is Percent of Column. Alternatively, we could compute the conditional probabilities of Gender given Smoking by calculating the Row Percents; for example, 120 divided by 209 gives 57.42%. This would be interpreted as: of those who say they do not smoke, 57.42% are female, meaning that 42.58% of non-smokers are male (found by 100% - 57.42%).
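As a sketch of how these conditional percentages are computed: the female counts (120 No out of 127) and the totals (209 No, 226 respondents) are quoted in the text above, while the male cell counts of 89 No and 10 Yes are inferred here by subtraction.

```python
# Two-way table of Smoke Cigarettes (rows) by Gender (columns).
# Female counts come from the lesson text; the male counts are
# inferred from the quoted totals (209 No overall, 226 respondents).
table = {
    ("No",  "Female"): 120, ("No",  "Male"): 89,
    ("Yes", "Female"): 7,   ("Yes", "Male"): 10,
}

def column_percent(table, row, col):
    """Conditional % of `row` given `col`, e.g. % No given Female."""
    col_total = sum(n for (r, c), n in table.items() if c == col)
    return 100 * table[(row, col)] / col_total

def row_percent(table, row, col):
    """Conditional % of `col` given `row`, e.g. % Female given No."""
    row_total = sum(n for (r, c), n in table.items() if r == row)
    return 100 * table[(row, col)] / row_total

print(round(column_percent(table, "No", "Female"), 2))  # 94.49
print(round(row_percent(table, "No", "Female"), 2))     # 57.42
```

The column percent conditions on Gender (divide by the column total, 127 females), while the row percent conditions on smoking status (divide by the row total, 209 non-smokers).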
Hypothetically, suppose observational studies of sugar intake and hyperactivity have been conducted: first separately for boys and girls, and then with the data combined. The following tables list these hypothetical results:
Boys         Normal   Hyper   Rate of Hyperactivity
Low Sugar      25       50    50/75 = 0.67
High Sugar     50      100    100/150 = 0.67

Girls        Normal   Hyper   Rate of Hyperactivity
Low Sugar      75       25    25/100 = 0.25
High Sugar     25        8    8/33 = 0.25

Combined     Normal   Hyper   Rate of Hyperactivity
Low Sugar     100       75    75/175 = 0.43
High Sugar     75      108    108/183 = 0.59
Notice how the rates for Boys (67%) and Girls (25%) are the same regardless of sugar intake. These percentages are exactly what we would expect if no relationship existed between sugar intake and activity level. However, when the two groups are combined, the hyperactivity rates do differ: 43% for Low Sugar and 59% for High Sugar. This difference appears large enough to suggest that a relationship does exist between sugar intake and activity level. This phenomenon is known as Simpson's Paradox, which describes the apparent change in a relationship in a two-way table when groups are combined. In this hypothetical example, boys tended to consume more sugar than girls and also tended to be more hyperactive than girls, which produces the apparent relationship in the combined table.

The confounding variable, gender, should be controlled for by studying boys and girls separately rather than ignored by combining. By definition, a confounding variable is a variable whose effects are mixed up with those of another variable, producing different results when the groups are combined than when each is analyzed separately. By contrast, a lurking variable is a variable not included in the study but with the potential to confound. Consider the previous example: if only the combined statistics were analyzed and a researcher then thought of a variable such as gender, gender would at that point be a lurking variable, since it had not been measured and analyzed.
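The arithmetic behind the three tables can be sketched in Python. (One small caveat: 8/33 is actually about 0.24, which the Girls table rounds to 0.25.)

```python
# Hyperactivity rates from the hypothetical tables above. Within each
# gender the rate is (nearly) identical for both sugar levels, yet the
# combined table shows an apparent difference: Simpson's Paradox.
boys  = {"Low": (25, 50), "High": (50, 100)}   # cell counts: (Normal, Hyper)
girls = {"Low": (75, 25), "High": (25, 8)}

def rate(normal, hyper):
    """Rate of hyperactivity = Hyper / (Normal + Hyper)."""
    return hyper / (normal + hyper)

# Combine boys and girls cell by cell.
combined = {k: (boys[k][0] + girls[k][0], boys[k][1] + girls[k][1])
            for k in ("Low", "High")}

for label, grp in (("Boys", boys), ("Girls", girls), ("Combined", combined)):
    print(label, {k: round(rate(*v), 2) for k, v in grp.items()})
```

Within each gender the Low and High rates match, but the combined rates (0.43 vs. 0.59) suggest a relationship that is not present in either group alone.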
As we did when considering only one variable, we begin with a graphical display. A scatterplot is the most useful display technique for comparing two quantitative variables. We plot the variable we consider the response variable on the y-axis, and we place the explanatory or predictor variable on the x-axis.
How do we determine which variable is which? In general, the explanatory variable attempts to explain, or predict, the observed outcome. The response variable measures the outcome of a study. One may even consider exploring whether one variable causes the variation in another variable – for example, a popular research study is that taller people are more likely to receive higher salaries. In this case, Height would be the explanatory variable used to explain the variation in the response variable Salaries.
In summarizing the relationship between two quantitative variables, we need to consider:
We will refer to the Exam Data set (Final.MTW or Final.XLS), which consists of a random sample of 50 students who took STAT 200 last semester. The data consist of each student's semester average on mastery quizzes and their score on the final exam. We construct a scatterplot showing the relationship between Quiz Average (explanatory or predictor variable) and Final (response variable). Thus, we are studying whether student performance on the mastery quizzes explains the variation in final exam scores. That is, can mastery quiz performance be considered a predictor of final exam score? We create this graph using either Minitab or SPSS:
The result should be the scatterplot below:
We can interpret from either graph that there is a positive association between Quiz Average and Final: lower quiz averages are accompanied by lower final scores, and higher quiz averages by higher final scores. If this relationship were reversed (high quizzes with low finals), the graph would have displayed a negative association; that is, the points in the graph would have decreased going from left to right.
The scatterplot can also be used to describe the form of the relationship. In this example we can see that the relationship is linear; that is, there does not appear to be a change in the direction of the relationship.
In order to measure the strength of a linear relationship between two quantitative variables we use correlation: a measure of the strength of a linear relationship. Using the Exam Data, we calculate the correlation in Minitab as follows:
The output gives us a Pearson Correlation of 0.609.
Equations of Straight Lines: Review
The equation of a straight line is given by y = a + bx. When x = 0, y = a, the intercept of the line; b is the slope of the line: it measures the change in y per unit change in x.
Two examples:
Data 1        Data 2
 x    y        x    y
 0    3        0   13
 1    5        1   11
 2    7        2    9
 3    9        3    7
 4   11        4    5
 5   13        5    3
For 'Data 1' the equation is y = 3 + 2x; the intercept is 3 and the slope is 2. The line slopes upward, indicating a positive relationship between x and y.
For 'Data 2' the equation is y = 13 - 2x; the intercept is 13 and the slope is -2. The line slopes downward, indicating a negative relationship between x and y.
Plot for Data 1 | Plot for Data 2
The relationship between x and y is 'perfect' for these two examples: the points fall exactly on a straight line, and the value of y is determined exactly by the value of x. Our interest will be in relationships between two variables that are not perfect. The correlation between x and y is r = 1.00 for the values of x and y on the left (Data 1) and r = -1.00 for the values of x and y on the right (Data 2).
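These two correlations can be verified with a short Python sketch that computes Pearson's r directly from its definition (covariance over the product of the spreads):

```python
import math

# 'Data 1' and 'Data 2' from the review: perfect positive and negative lines.
x = [0, 1, 2, 3, 4, 5]
y1 = [3 + 2 * v for v in x]    # y = 3 + 2x
y2 = [13 - 2 * v for v in x]   # y = 13 - 2x

def pearson_r(xs, ys):
    """Pearson correlation: sum of cross-deviations over the
    square root of the product of the squared deviations."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    syy = sum((b - my) ** 2 for b in ys)
    return sxy / math.sqrt(sxx * syy)

print(round(pearson_r(x, y1), 6))  # 1.0
print(round(pearson_r(x, y2), 6))  # -1.0
```

Because every point lies exactly on each line, the correlations come out at the extremes of the possible range, +1 and -1.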
Regression analysis is concerned with finding the 'best' fitting line for predicting the average value of a response variable y using a predictor variable x.
The best description of many relationships between two quantitative variables can be achieved using a straight line. In statistics, this line is referred to as a regression line. Historically, this term is associated with Sir Francis Galton, who in the mid-1800s studied the phenomenon that children of tall parents tended to "regress" toward mediocrity.
Adjusting the algebraic line expression, the regression line is written as:
\[\hat{y}=b_0+b_1 x \]
Here, \(b_0\) is the y-intercept and \(b_1\) is the slope of the regression line.
Some questions to consider are:
By answering the third question we should gain insight into the first two questions.
We use the regression line to predict a value of \(\hat{y}\) for any given value of x. The "best" line would make the best predictions: the observed y-values should stray as little as possible from the line. The vertical distances from the observed values to their predicted counterparts on the line are called residuals, and these residuals are referred to as the errors in predicting y. As in any prediction or estimation process, you want these errors to be as small as possible. To accomplish this goal of minimum error, we use the method of least squares: that is, we minimize the sum of the squared residuals. Mathematically, the residuals and the sum of squared residuals appear as follows:
Residuals: \(y-\hat{y}\)
Sum of squared residuals: \(\sum{(y-\hat{y})^2}\)
A unique solution is provided through calculus (not shown!), assuring us that there is in fact one best line. The calculus solution results in the following calculations for \(b_0\) and \(b_1\):
\(b_1=r\frac{S_y}{S_x}\) \(b_0=\bar{y}-b_1\bar{x}\)
Another way of looking at the least squares regression line: when x takes its mean value, the predicted y also takes its mean value. That is, the regression line always passes through the point (\(\bar{x}\), \(\bar{y}\)). As to the other expressions in the slope equation, \(S_y\) is the sample standard deviation of y, the square root of the sum of squared deviations between the observed y-values and the mean of y, divided by n - 1; similarly, \(S_x\) is the sample standard deviation of x.
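A minimal Python sketch of these formulas, checked on 'Data 1' from the straight-line review (where the true line is y = 3 + 2x):

```python
import math

# Least squares slope and intercept via b1 = r * Sy/Sx and b0 = ybar - b1*xbar,
# illustrated on 'Data 1' from the review (true line: y = 3 + 2x).
x = [0, 1, 2, 3, 4, 5]
y = [3, 5, 7, 9, 11, 13]
n = len(x)

xbar, ybar = sum(x) / n, sum(y) / n
sx = math.sqrt(sum((v - xbar) ** 2 for v in x) / (n - 1))  # sample std dev of x
sy = math.sqrt(sum((v - ybar) ** 2 for v in y) / (n - 1))  # sample std dev of y
r = sum((a - xbar) * (b - ybar)
        for a, b in zip(x, y)) / ((n - 1) * sx * sy)       # Pearson correlation

b1 = r * sy / sx          # slope
b0 = ybar - b1 * xbar     # intercept; forces the line through (xbar, ybar)
print(round(b1, 9), round(b0, 9))  # 2.0 3.0
```

The recovered line is exactly y = 3 + 2x, and by construction of \(b_0\) it passes through the point of means (\(\bar{x}\), \(\bar{y}\)).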
Example: Exam Data set (Final.MTW or Final.XLS)
To perform a regression on the Exam Data we can use either Minitab or SPSS:
The following shows the first five rows of the data in the worksheet:
The result should be the following output:
WOW! This is quite a bit of output. We will take this output apart and you will see that these results are not too complicated.
From the output we see:
NOTE: Remember that taking a square root can yield either a positive or a negative value (e.g. both 2 and -2 are square roots of 4). Thus the sign of the correlation matches the sign of the slope.
For example, if we substitute the first Quiz Average of 84.44 into the regression equation we get Final = 12.1 + 0.751 × 84.44 ≈ 75.5598, which is the first value in the FITS column. (Minitab uses the unrounded coefficients internally, so hand arithmetic with the rounded coefficients gives a slightly different value.) Using this value, we can compute the first residual under RESI by taking the difference between the observed y and this fitted value: 90 - 75.5598 = 14.4402. Similar calculations produce the remaining fitted values and residuals.
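A quick arithmetic check in Python. Note that the rounded coefficients printed in the regression equation (12.1 and 0.751) give 75.514, slightly different from the FITS value 75.5598, because Minitab computes the fits from unrounded coefficients:

```python
# First student: Quiz Average 84.44, observed Final 90.
# Coefficients are the rounded values printed in the regression equation;
# Minitab's internal fit uses more decimal places (giving FITS = 75.5598).
b0, b1 = 12.1, 0.751
quiz_avg, observed = 84.44, 90

fitted = b0 + b1 * quiz_avg          # predicted Final score
residual = observed - fitted         # observed minus predicted
print(round(fitted, 3), round(residual, 3))  # 75.514 14.486
```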
The values of the response variable vary in regression problems (think of how not all people of the same height have the same weight), in which we try to predict the value of y from the explanatory variable x. The proportion of variation in the response variable that can be explained (i.e. accounted for) by the explanatory variable is denoted by R². In our Exam Data example this value is 37%, meaning that 37% of the variation in the final exam scores can be explained (now you know why this is also referred to as an explanatory variable) by the quiz averages. Since this value appears in the output and is related to the correlation, we mention R² now; we will take a further look at this statistic in a future lesson.
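Since R² is the square of the correlation, the 37% can be recovered directly from the Pearson correlation of 0.609 reported earlier:

```python
# R-squared is the square of the Pearson correlation r.
r = 0.609                      # correlation between Quiz Average and Final
r_squared = r ** 2
print(round(100 * r_squared))  # 37 (percent of variation explained)
```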
As with most predictions, you expect there to be some error; that is, you expect the prediction not to be exactly correct (e.g. when predicting the final voting percentage, you would expect the prediction to be accurate but not necessarily match the exact final voting percentage). Also, in regression, not every value of x has the same value of y: as mentioned earlier, not every person of the same height (x-variable) has the same weight (y-variable). These errors in regression predictions are called prediction errors or residuals. A residual is calculated by taking the observed y-value minus its corresponding predicted y-value, \(y-\hat{y}\). Therefore we have as many residuals as we do y observations. The goal in least squares regression is to select the line that minimizes these residuals: in essence, we create a best-fit line that has the least amount of error.
In most practical circumstances an outlier decreases the value of a correlation coefficient and weakens the regression relationship, but in some circumstances an outlier may increase a correlation value and strengthen the regression. Figure 1 below provides an example of an influential outlier: a point in a data set that substantially changes the regression equation and the correlation. Figure 1 represents data gathered on a person's Age and Blood Pressure, with Age as the explanatory variable. [Note: the regression plots were obtained in Minitab by Stat > Regression > Fitted Line Plot.]

The top graph in Figure 1 represents the complete set of 10 data points. You can see that one point stands out in the upper right corner, the point (75, 220). The bottom graph is the regression with this point removed. The correlation among the original 10 data points is 0.694, found by taking the square root of 0.481 (the R-sq of 48.1%). But when this outlier is removed, the correlation drops to 0.032, the square root of 0.1%. Also notice how the regression equation originally has a slope greater than 0, but with the outlier removed the slope is practically 0, i.e. a nearly horizontal line. This example is somewhat exaggerated, but it illustrates the effect an outlier can have on the correlation and regression equation. Typically these influential points are far removed from the remaining data points in at least the horizontal direction. As seen here, the age of 75 and the blood pressure of 220 are both beyond the range of the remaining data.
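The two correlations quoted here can be recovered from the R-sq values printed on the fitted line plots:

```python
import math

# With the outlier included: R-sq = 48.1%, so |r| = sqrt(0.481)
print(round(math.sqrt(0.481), 3))  # 0.694
# With the outlier removed: R-sq = 0.1%, so |r| = sqrt(0.001)
print(round(math.sqrt(0.001), 3))  # 0.032
```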
If we conduct a study and establish a strong correlation, does this mean we also have causation? That is, if two variables are related, does that imply that one variable causes the other to occur? Consider smoking cigarettes and lung cancer: does smoking cause lung cancer? Initially this was answered yes, but that answer was based only on the strong correlation between smoking and lung cancer. Not until scientific research verified that smoking can lead to lung cancer was causation established. If you were to review the history of cigarette warning labels, the first mandated label only mentioned that smoking was hazardous to your health; not until 1981 did the label mention that smoking causes lung cancer. (See warning labels.) To establish causation one must rule out the possibility of lurking variable(s). The best method to accomplish this is through a solid design of your experiment, preferably one that uses a control group.
In this lesson we learned the following:
Next, let's take a look at the homework problems for this lesson. This will give you a chance to put what you have learned to use...
Ponder the following, then click the graphic to display the answers.
If you are asked to estimate the weight of a STAT 200 student, what will you use as a point estimate?
Now, if I tell you that the height of the student is 70 inches, can you give a better estimate of the person's weight?