# 10.2 - Inference for Regression

We can use statistical inference (i.e. hypothesis testing) to draw conclusions about how the population of y-values relates to the population of x-values, based on the sample of x and y values. Our model extends beyond a simple straight-line summary because it includes a parameter for the natural variation about the regression line seen in real-life relationships. For each x, the regression line tells us how a population of y-values would respond on average; we call this average the mean response. Naturally, we expect some variation above and below the mean response: not every response is the same for a given x, just as not all people of the same height have the same weight. The equation $$E(Y) = \beta_0 + \beta_1 x$$ describes this population relationship: for any given x-value, the mean y-value is $$\beta_0 + \beta_1 x$$. [NOTE: Some texts use the notation $$\mu_y$$ in place of E(Y). These notations are read, respectively, as "the mean of y" and "the expectation of y". Both have the same meaning.] Some assumptions come with this analysis, however.
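As a compact way to see where the "natural variation" enters, the model can be written with an explicit error term. This is the standard textbook form of the simple linear regression model; the symbol $$\varepsilon$$ for the error is our notation and does not appear elsewhere in this section.

$$
Y = \beta_0 + \beta_1 x + \varepsilon, \qquad \varepsilon \sim N(0, \sigma^2)
$$

Because the error has mean zero, averaging over the population of y-values at a fixed x leaves only the line, which is the mean response:

$$
E(Y) = \beta_0 + \beta_1 x
$$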

### Regression Questions

1. Is there strong, i.e. statistically significant, evidence that y depends on x? In other words, is the slope significantly different from zero?
2. For a particular x-value, $$x=x^*$$, what interval should contain the mean response for all individuals with that x-value?
3. For a particular x-value, $$x=x^*$$, what interval should contain the response of a single individual with that x-value?

Returning to our output for the final exam data, we can conduct a hypothesis test of the slope of the regression line, using t-test methods, to test the following hypotheses:

$$H_0: \beta_1 = 0 \qquad H_a: \beta_1 \neq 0$$

From this output we concern ourselves with the second row under Predictor; the first row, Constant, is not relevant for our purposes. This second Predictor row shows our estimate of $$\beta_1$$ as 0.7513, a standard error $$SE_\beta$$ of 0.1414, a t-value of 5.31, and a p-value of 0.000, or approximately 0. Since our test in Minitab is whether the true slope is zero or not zero, we are conducting a two-sided hypothesis test. In general, the t test statistic is found by $$(\beta-0)/SE_\beta$$, where $$\beta$$ here stands for the estimated slope; in this example, t = 0.7513/0.1414 = 5.31. The p-value is found by doubling the probability P(T ≥ |t|). In this example, since the p-value is less than our standard alpha value of 0.05, we reject H0 and decide that the true slope does differ significantly from zero. We then conclude that Quiz Average is a significant predictor of scores on the Final exam.
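The arithmetic behind this test can be reproduced with a short Python sketch. The slope and standard error are the values quoted from the output; the degrees of freedom are an assumption, since this section does not state the sample size (we use 48, i.e. a hypothetical n = 50, purely for illustration). The helper functions `t_pdf` and `two_sided_p` are our own and approximate the t-distribution tail area by numerical integration using only the standard library.

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))

def two_sided_p(t, df, steps=20000):
    """Two-sided p-value 2 * P(T >= |t|), via Simpson's rule on [0, |t|]."""
    a = abs(t)
    if a == 0:
        return 1.0
    h = a / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    half_area = total * h / 3  # approximates P(0 <= T <= |t|)
    return max(0.0, 1.0 - 2.0 * half_area)

slope = 0.7513     # estimated slope, from the output
se_slope = 0.1414  # standard error of the slope, from the output
df = 48            # ASSUMPTION: n - 2 for a hypothetical n = 50

t_stat = (slope - 0) / se_slope
p_value = two_sided_p(t_stat, df)

print(round(t_stat, 2))   # the t-value reported in the output
print(p_value < 0.05)     # whether we reject H0 at alpha = 0.05
```

With any reasonable degrees of freedom, a t-value this large gives a p-value far below 0.05, which is why the software displays it as 0.000.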

### Prediction Inference

We now turn to questions 2 and 3 regarding estimating both a population mean response and individual response to a given x-value denoted as $$x^*$$. Formulas exist and can be found in various texts, but we will use Minitab for calculations. Keep in mind that when estimating in statistics we typically are referring to the use of confidence intervals. That will be the case here as well as we will use Minitab to calculate confidence intervals for these estimates.
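For reference, the formulas in question usually take the following form for simple linear regression. This is the standard textbook version, not something produced by our output; the symbols $$b_0$$, $$b_1$$ (estimated intercept and slope), $$n$$ (sample size), $$\bar{x}$$ (mean of the x-values), $$s$$ (residual standard deviation), and $$t^*$$ (critical value from a t distribution with n − 2 degrees of freedom) are notation we introduce here.

$$
\hat{y}^* = b_0 + b_1 x^*
$$

Confidence interval for the mean response at $$x^*$$:

$$
\hat{y}^* \pm t^* \, s \sqrt{\frac{1}{n} + \frac{(x^* - \bar{x})^2}{\sum (x_i - \bar{x})^2}}
$$

Prediction interval for an individual response at $$x^*$$:

$$
\hat{y}^* \pm t^* \, s \sqrt{1 + \frac{1}{n} + \frac{(x^* - \bar{x})^2}{\sum (x_i - \bar{x})^2}}
$$

The only difference is the extra 1 under the square root, which accounts for the variation of a single observation about the mean response.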

Sticking with the exam data, what would be a 95% confidence interval for the mean Final exam score for all students with a Quiz Average of 84.44, and what would be a 95% prediction interval for the Final exam score of a particular student with an 84.44 Quiz Average? In Minitab or SPSS we follow our initial regression steps, with a few additions:

To perform prediction inference for linear regression in Minitab:

1. From the menu bar select Stat > Regression > Regression
2. In the window box by Response enter the variable Final
3. In the window box by Predictors enter the variable Quiz Average
4. Click the Options button and in the text box under "Prediction intervals for new observations" enter 84.44 and verify that 95 is entered in the "Confidence Level" text box. Click OK
5. Click OK
6. Click the Storage button and select Residuals. Click OK
7. Click OK again.

To perform prediction inference for linear regression in SPSS:

1. Open SPSS without data
2. Click Analyze > Regression > Linear
3. Click the variable Final and move to the text box for Dependent
4. Click the variable Quiz_Average and move to the text window for Independent(s)
5. Click the button Save and select Mean and Individual under the heading Prediction Intervals
6. Click Continue
7. Click OK

The output and first 4 rows in the worksheet are as follows:

The confidence interval, which can be found in the output window under the heading 95% CI in Minitab and in the SPSS data spreadsheet under the LMCI and UMCI headings, is (72.79, 78.33). This is interpreted as "We are 95% confident that the true mean Final exam score for students with an 84.44 Quiz Average is between 72.79% and 78.33%."

The prediction interval, which can be found in the output window under the heading 95% PI in Minitab and in the SPSS data spreadsheet under the LICI and UICI headings, is (55.84, 95.28). This is interpreted as "We are 95% confident that the Final exam score for a student with an 84.44 Quiz Average is between 55.84% and 95.28%."

You should notice that the confidence interval is narrower than the prediction interval, which makes sense intuitively. Since the confidence interval estimates an average, or mean, response for all students with this quiz average, you should expect it to be more precise than the prediction of the exact Final score for just one student.
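To make the comparison concrete, the following Python sketch takes the two intervals reported above and computes the center and half-width (margin) of each. The helper name `center_and_half_width` is ours; the only data used are the four endpoints from the output.

```python
# Interval endpoints as reported in the output.
ci = (72.79, 78.33)  # 95% CI for the mean response at x* = 84.44
pi = (55.84, 95.28)  # 95% PI for an individual response at x* = 84.44

def center_and_half_width(interval):
    """Return the midpoint and half-width of a (lower, upper) interval."""
    lo, hi = interval
    return (lo + hi) / 2, (hi - lo) / 2

ci_center, ci_half = center_and_half_width(ci)
pi_center, pi_half = center_and_half_width(pi)

print(round(ci_center, 2), round(ci_half, 2))
print(round(pi_center, 2), round(pi_half, 2))
print(round(pi_half / ci_half, 1))  # how many times wider the PI is
```

Both intervals are centered at the same point, about 75.56, which is the predicted Final score at a Quiz Average of 84.44. The margins differ sharply: about 2.77 for the confidence interval versus about 19.72 for the prediction interval, roughly seven times wider.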

### Checking Assumptions in Linear Regression

1. There exists a linear relationship between the y and x. This can be graphically represented by creating a scatterplot of y versus x.
2. The error terms are independent of one another. For example, the regression model would not be appropriate if the final exam scores did not all come from different students.
3. The variance of the error is the same for all values of x. That is, the points are assumed to fall in a band of similar width on either side of the line. Again we can employ a scatterplot, this time of the residuals versus the predictor variable. For constant variance we would expect this graph to show a random scatter of points. If the plot showed a pattern (e.g. a megaphone shape where the residuals became more varied as x increased), this would indicate that the constant variance assumption was being violated. (See the following Residual Plot. The plot shows a random scatter without evidence of any clear pattern.)
4. The error terms follow a normal distribution with a mean of zero and a variance $$\sigma^2$$. This, too, can be checked using a significance test. By storing the residuals we can:
• In Minitab: return to Graph > Probability Plot > Single and enter the column with the residuals into the window for "Graph variables" and click OK.
• In SPSS: Click Analyze > Descriptive Statistics > Q-Q Plot, then click the variable Unstandardized Residuals and move it to the text window for Variables. Click OK.
The null hypothesis is that the residuals follow a normal distribution, so here we are interested in a p-value that is greater than 0.05, as we do not want to reject H0. Rejecting H0 would indicate that the normality assumption has been violated; however, keep in mind that the central limit theorem can be invoked if our sample size is large. (See the following Normal Probability Plot. With a p-value of 0.931 we would not reject H0 and therefore assume that the residuals follow a normal distribution.)
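The residuals used in checks 3 and 4 are simply the observed y-values minus the fitted values. The Python sketch below shows how they are computed by least squares. The data are hypothetical and made up for illustration only; they are not the exam data from this lesson, and the variable names are ours.

```python
# HYPOTHETICAL data for illustration only (not the course's exam scores).
quiz = [62.0, 70.5, 75.0, 81.0, 84.5, 90.0, 95.5]
final = [58.0, 66.0, 71.0, 70.0, 79.0, 83.0, 88.0]

n = len(quiz)
x_bar = sum(quiz) / n
y_bar = sum(final) / n

# Least-squares slope and intercept.
sxx = sum((x - x_bar) ** 2 for x in quiz)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(quiz, final))
b1 = sxy / sxx
b0 = y_bar - b1 * x_bar

# Residual = observed response minus fitted response.
fitted = [b0 + b1 * x for x in quiz]
residuals = [y - f for y, f in zip(final, fitted)]

# These (x, residual) pairs are what a residual plot displays.
for x, e in zip(quiz, residuals):
    print(f"{x:6.1f}  {e:7.2f}")
```

Least-squares residuals always sum to zero (up to rounding), so the mean of the residuals tells us nothing about the assumptions. What matters is whether the residuals show a pattern against x and whether their distribution looks roughly normal.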