# Lesson 5: Auxiliary Data and Regression Estimation

Reading assignment for Lesson 5:  Ch. 8.1 of Sampling by Steven Thompson, 3rd edition.

### Introduction

This lesson discusses when and how to use regression estimation and works through an example. We then compare the regression estimate with the sample mean, which does not take advantage of the auxiliary information. To illustrate that one has to choose the right model, we also apply the ratio estimate to the example even though the condition for using it is not satisfied. Not surprisingly, the ratio estimate performs poorly, since it is not the appropriate model for that data set.

Lesson 5 Objectives

Upon successful completion of this lesson, you will be able to:

- know why and when to use regression estimates
- know how to check the condition to see whether one can use the regression estimate
- compute the regression estimate and its estimated variance
- compute a confidence interval based on the regression estimate
- see that the regression estimate performs better than the expansion estimate when the auxiliary data are useful
- see that the regression estimate performs better than the ratio estimate when the condition for using the ratio estimate is not satisfied

# 5.1 Linear Regression Estimator

Unit Summary

- compute $$\hat{\mu}_L$$ and $$\hat{V}ar(\hat{\mu}_L)$$ (L stands for our linear model)
- confidence interval using the regression estimator

Looking at the data, how do we decide which estimator will work well, or which model to use? These are key questions, and the estimated variance of each estimator is an important indicator.

#### The Idea Behind Regression Estimation

When the auxiliary variable x is linearly related to y but the line does not pass through the origin, a linear regression estimator is appropriate. (This does not mean that the regression estimate cannot be used when the intercept is close to zero. In such cases the regression and ratio estimates may be quite close, and you can choose either one.)

In addition, if multiple auxiliary variables have a linear relationship with y, multiple regression estimates may be appropriate.

To estimate the mean and total of y-values, denoted as μ and τ, one can use the linear relationship between y and known x-values.

$$\hat{y}=a+bx$$, which is our basic regression equation.
The least-squares slope and intercept are $$b=\dfrac{\sum\limits_{i=1}^n(x_i-\bar{x})(y_i-\bar{y})}{\sum\limits_{i=1}^n(x_i-\bar{x})^2}$$ and

$$a=\bar{y}-b\bar{x}$$

Then to estimate the mean for y, substitute as follows:

$$x=\mu_x,\quad a=\bar{y}-b\bar{x},\text{then}$$
$$\hat{\mu}_L=(\bar{y}-b\bar{x})+b\mu_x$$
$$\hat{\mu}_L=\bar{y}+b(\mu_x-\bar{x}),\quad \hat{\mu}_L=a+b\mu_x$$
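
The substitution above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `regression_estimate_mean` and the toy data are hypothetical, chosen so that the points lie exactly on the line y = 1 + 2x and the result is easy to check by hand.

```python
from statistics import mean

def regression_estimate_mean(x, y, mu_x):
    """Return (a, b, mu_hat_L) for paired sample values x, y
    and a known population mean mu_x of the auxiliary variable."""
    x_bar, y_bar = mean(x), mean(y)
    # Least-squares slope: sum of cross-products over sum of squares of x
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b = s_xy / s_xx
    a = y_bar - b * x_bar
    # Regression estimate of the mean: y_bar + b * (mu_x - x_bar) = a + b * mu_x
    mu_hat_L = y_bar + b * (mu_x - x_bar)
    return a, b, mu_hat_L

# Hypothetical toy data lying exactly on y = 1 + 2x
x_toy = [1.0, 2.0, 3.0, 4.0]
y_toy = [3.0, 5.0, 7.0, 9.0]
a, b, mu_hat_L = regression_estimate_mean(x_toy, y_toy, mu_x=5.0)
print(a, b, mu_hat_L)
```

Because the toy points are perfectly linear, the estimate equals 1 + 2(5) = 11, which is the value of the line at $$\mu_x$$.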

Note that even though $$\hat{\mu}_L$$ is not unbiased under simple random sampling, it is approximately so (asymptotically unbiased) for large samples.

Thus, the mean square error of $$\hat{\mu}_L$$ is roughly estimated by:

\begin{align}
\hat{V}ar(\hat{\mu}_L) &=\dfrac{N-n}{N \times n}\cdot \dfrac{\sum\limits_{i=1}^n(y_i-a-bx_i)^2}{n-2}\\
&= \dfrac{N-n}{N \times n}\cdot MSE\\
\end{align}

where MSE is the MSE of the linear regression model of y on x.

Therefore, an approximate (1-α)100% CI for μ is:

$$\hat{\mu}_L \pm t_{n-2,\alpha/2}\sqrt{\hat{V}ar(\hat{\mu}_L)}$$
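
As a rough sketch of how the variance estimate and the confidence interval fit together, the function below computes both from raw sample data. The name `regression_mean_ci`, the toy data, and the population size N = 50 are hypothetical. The t critical value is passed in from a t-table so that the example needs only the standard library.

```python
import math
from statistics import mean

def regression_mean_ci(x, y, mu_x, N, t_crit):
    """Return (mu_hat_L, var_hat, (lower, upper)) for the regression
    estimator of the mean. t_crit is t_{n-2, alpha/2} from a t-table."""
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    b = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
         / sum((xi - x_bar) ** 2 for xi in x))
    a = y_bar - b * x_bar
    mu_hat_L = a + b * mu_x
    # MSE of the fitted line: residual sum of squares over n - 2
    mse = sum((yi - a - b * xi) ** 2 for xi, yi in zip(x, y)) / (n - 2)
    # Estimated variance with the finite population correction
    var_hat = (N - n) / (N * n) * mse
    half_width = t_crit * math.sqrt(var_hat)
    return mu_hat_L, var_hat, (mu_hat_L - half_width, mu_hat_L + half_width)

# Hypothetical toy sample of n = 5 from a population of N = 50
x_toy = [1, 2, 3, 4, 5]
y_toy = [2, 4, 5, 4, 5]
mu_hat_L, var_hat, ci = regression_mean_ci(
    x_toy, y_toy, mu_x=3.5, N=50, t_crit=3.182)  # 3.182 = t_{3, 0.025}
print(mu_hat_L, var_hat, ci)
```

For these toy numbers the fitted line is 2.2 + 0.6x, the MSE is 0.8, and the estimated variance is (45/250)(0.8) = 0.144.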

It follows that:

$$\hat{\tau}_L=N\cdot \hat{\mu}_L=N\bar{y}+b(\tau_x-N\bar{x})$$

\begin{align}
\hat{V}ar(\hat{\tau}_L) &= N^2 \hat{V}ar(\hat{\mu}_L) \\
&= \dfrac{N \times (N-n)}{n} \cdot MSE\\
\end{align}

And, an approximate (1-α)100% CI for τ is:

$$\hat{\tau}_L \pm t_{n-2,\alpha/2}\sqrt{\hat{V}ar(\hat{\tau}_L)}$$
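
The formulas for the total can be sketched the same way, working from summary statistics. The function name `regression_total_ci` and all input values here are hypothetical illustrations, not results from this lesson.

```python
import math

def regression_total_ci(N, n, y_bar, x_bar, b, tau_x, mse, t_crit):
    """Return (tau_hat_L, var_hat, (lower, upper)) for the regression
    estimator of the population total. t_crit is t_{n-2, alpha/2}."""
    # tau_hat_L = N * y_bar + b * (tau_x - N * x_bar)
    tau_hat_L = N * y_bar + b * (tau_x - N * x_bar)
    # Var_hat(tau_hat_L) = N^2 * Var_hat(mu_hat_L) = N * (N - n) / n * MSE
    var_hat = N * (N - n) / n * mse
    half_width = t_crit * math.sqrt(var_hat)
    return tau_hat_L, var_hat, (tau_hat_L - half_width, tau_hat_L + half_width)

# Hypothetical summary values: N = 50, n = 5, known total tau_x = 175
tau_hat_L, var_tau, ci_tau = regression_total_ci(
    N=50, n=5, y_bar=4.0, x_bar=3.0, b=0.6, tau_x=175.0,
    mse=0.8, t_crit=3.182)  # 3.182 = t_{3, 0.025}
print(tau_hat_L, var_tau, ci_tau)
```

With these inputs the estimated total is 200 + 0.6(175 − 150) = 215, which is N times the corresponding mean estimate of 4.3.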

#### Example: First Year Calculus Scores

(See p. 205 of Scheaffer, Mendenhall and Ott)

A mathematics achievement test was given to 486 students prior to entering a certain college, where they then took a calculus class. A simple random sample of 10 students is selected and their calculus scores are recorded. It is known that the average achievement test score for the 486 students was 52. The scatterplot of the 10 sampled students is given below, and the data follow.

The scatter plot shows that there is a strong positive linear relationship.

$$\hat{\mu}_L=\bar{y}+b(\mu_x-\bar{x})=a+b\mu_x$$

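| Student | Achievement test score X | Calculus score Y |
|---|---|---|
| 1 | 39 | 65 |
| 2 | 43 | 78 |
| 3 | 21 | 52 |
| 4 | 64 | 82 |
| 5 | 57 | 92 |
| 6 | 47 | 89 |
| 7 | 28 | 73 |
| 8 | 75 | 98 |
| 9 | 34 | 56 |
| 10 | 52 | 75 |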

Minitab output

#### Application Exercise

Using the results from the Minitab output here, what do you get for the regression estimate?


The Minitab output provides p-values for the constant and the coefficient of X, and both terms are significant. (The ratio estimate is not appropriate, since the constant term differs significantly from zero.)

Now we can compute the variance.

#### Application Exercise

What is the variance of the regression estimate?


#### Application Exercise

What, then, is an approximate 95% CI for μ?

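
The three exercises above can be checked numerically with a short script. This is a sketch that recomputes everything from the ten data pairs in the table rather than reading values off the Minitab output, so small rounding differences are to be expected. The critical value 2.306 is $$t_{8,0.025}$$ taken from a t-table.

```python
import math
from statistics import mean

# Sample data from the table in this example
x = [39, 43, 21, 64, 57, 47, 28, 75, 34, 52]  # achievement test scores
y = [65, 78, 52, 82, 92, 89, 73, 98, 56, 75]  # calculus scores
N, mu_x = 486, 52   # population size and known mean achievement score
T_CRIT = 2.306      # t_{8, 0.025} from a t-table

n = len(x)
x_bar, y_bar = mean(x), mean(y)
b = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
     / sum((xi - x_bar) ** 2 for xi in x))
a = y_bar - b * x_bar

mu_hat_L = a + b * mu_x                               # regression estimate
mse = sum((yi - a - b * xi) ** 2 for xi, yi in zip(x, y)) / (n - 2)
var_mu_L = (N - n) / (N * n) * mse                    # estimated variance
half_width = T_CRIT * math.sqrt(var_mu_L)             # CI half-width

print(f"a = {a:.2f}, b = {b:.4f}")
print(f"mu_hat_L = {mu_hat_L:.2f}, Var_hat = {var_mu_L:.2f}")
print(f"95% CI: {mu_hat_L:.2f} +/- {half_width:.2f}")
```

With unrounded coefficients this gives an estimate of about 80.59 with a half-width of about 6.28. The value 80.63 quoted in this lesson presumably comes from plugging in coefficients rounded to three significant figures (40.8 + 0.766 × 52).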

# 5.2 Comparison of Estimators

Unit Summary

- compare to $$\bar{y}$$, computing $$\hat{V}ar(\bar{y})$$
- compare to $$\hat{\mu}_r$$, computing $$\hat{V}ar(\hat{\mu}_r)$$

A. Compare $$\hat{\mu}_L$$ to the sample mean $$\bar{y}$$, which does not use the auxiliary information in x. Its estimated variance is:

$$\hat{V}ar(\bar{y})=\dfrac{N-n}{N}\cdot \dfrac{s^2}{n}$$

The sample variance of the y-values is $$s^2=(15.11)^2$$.

#### Application Exercise

What is $$\hat{V}ar(\bar{y})$$?


#### Application Exercise

Next, what is an approximate 95% CI for μ based on $$\bar{y}$$?

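
The two exercises for $$\bar{y}$$ can be checked with the sketch below, which recomputes $$s^2$$ from the calculus scores. It assumes the usual n − 1 = 9 degrees of freedom for the interval based on the sample mean, with $$t_{9,0.025}=2.262$$ taken from a t-table.

```python
import math
from statistics import mean, variance

# Calculus scores of the 10 sampled students
y = [65, 78, 52, 82, 92, 89, 73, 98, 56, 75]
N = 486
T_CRIT = 2.262  # t_{9, 0.025} from a t-table (assumes n - 1 degrees of freedom)

n = len(y)
y_bar = mean(y)
s2 = variance(y)                       # sample variance with divisor n - 1
var_y_bar = (N - n) / N * s2 / n       # estimated variance of the sample mean
half_width = T_CRIT * math.sqrt(var_y_bar)

print(f"s = {math.sqrt(s2):.2f}")
print(f"Var_hat(y_bar) = {var_y_bar:.2f}")
print(f"95% CI: {y_bar} +/- {half_width:.2f}")
```

The standard deviation comes out at about 15.11, matching the value given above, and the half-width is about 10.70.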

Recall that the 95% confidence interval using the regression estimate is 80.63 ± 6.28, a much shorter interval.

This regression estimate is more precise than $$\bar{y}$$.

Additionally, we have another estimator that we can look at:

B. Compare $$\hat{\mu}_L$$ to the ratio estimator $$\hat{\mu}_r$$

Next, Minitab was used to find out the mean and standard deviation for X and Y.

The ratio estimate is inappropriate for this example. However, as a counterexample, we can compute the variance of the ratio estimate using the following Minitab printout and compare it to that of the regression estimate.

Note: for the Calculus Scores example we should not use the ratio estimator $$\hat{\mu}_r$$ because the p-value for the constant term is 0.001. This implies that the regression line does not go through the origin, so the ratio estimate is not appropriate. For the purposes of a counterexample, we work it out here anyway:

$$\hat{\mu}_r=r\mu_x=\dfrac{\bar{y}}{\bar{x}}\cdot \mu_x=\dfrac{76}{46}\cdot 52=85.91$$

Next, we need the variance, which requires the residual mean square under the ratio model. The Minitab output gives the sum of squares divided by n − 1, so:

$$s^2_r=\dfrac{1}{10-1} \sum\limits_{i=1}^{10} (y_i-rx_i)^2=283.33$$ (this is huge!)
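
The ratio estimate and $$s^2_r$$ can be recomputed directly from the ten data pairs, as in this sketch. Because it works from unrounded values rather than the Minitab printout, the result differs slightly in the last digits from the 283.33 quoted above.

```python
from statistics import mean

# Sample data from the Calculus Scores example
x = [39, 43, 21, 64, 57, 47, 28, 75, 34, 52]  # achievement test scores
y = [65, 78, 52, 82, 92, 89, 73, 98, 56, 75]  # calculus scores
mu_x = 52  # known population mean of the achievement test score

n = len(x)
r = mean(y) / mean(x)       # sample ratio y_bar / x_bar = 76 / 46
mu_hat_r = r * mu_x         # ratio estimate of the mean
# Residual variance about the line through the origin, divisor n - 1
s2_r = sum((yi - r * xi) ** 2 for xi, yi in zip(x, y)) / (n - 1)

print(f"r = {r:.4f}, mu_hat_r = {mu_hat_r:.2f}, s2_r = {s2_r:.2f}")
```

This gives a ratio estimate of about 85.91 and a residual variance of about 283.4, far larger than the regression MSE.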

Now we can compute the variance:

#### Application Exercise

What is the variance of $$\hat{\mu}_r$$?


Now we can compute a 95% confidence interval for μ.

#### Application Exercise

What is an approximate 95% confidence interval for μ using the ratio estimate?

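
A sketch of the last two exercises follows, using the values quoted in this section (85.91 and 283.33). It rests on two assumptions that this lesson does not state: that the estimated variance of the ratio estimator has the form $$\dfrac{N-n}{N \times n}\cdot s^2_r$$, and that the interval uses n − 1 = 9 degrees of freedom with $$t_{9,0.025}=2.262$$. The function name `ratio_mean_ci` is hypothetical.

```python
import math

def ratio_mean_ci(mu_hat_r, s2_r, N, n, t_crit):
    """Return (var_hat, half_width, (lower, upper)) for the ratio
    estimator of the mean. Assumes Var_hat = (N - n) / (N * n) * s2_r."""
    var_hat = (N - n) / (N * n) * s2_r
    half_width = t_crit * math.sqrt(var_hat)
    return var_hat, half_width, (mu_hat_r - half_width, mu_hat_r + half_width)

# Values quoted in this section; t_{9, 0.025} = 2.262 from a t-table (assumed df)
var_r, half_r, ci_r = ratio_mean_ci(
    mu_hat_r=85.91, s2_r=283.33, N=486, n=10, t_crit=2.262)

print(f"Var_hat(mu_hat_r) = {var_r:.2f}")
print(f"95% CI: 85.91 +/- {half_r:.2f}")
```

Under these assumptions the half-width is about 11.92, compared with 6.28 for the regression estimate.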

We can see that the ratio estimate does even worse than $$\bar{y}$$ when it is used in an inappropriate situation.

The width of the interval is larger than the one for the regression estimate.

The moral of the story is, "Use the right model!"

# Homework

Find the HW 5 assignment in the Homework folder in Canvas.