# Lesson 2: Linear Combinations of Random Variables

### Introduction

This lesson is concerned with linear combinations (or, if you prefer, linear transformations) of the variables. Mathematically, a linear combination can be expressed as follows:

$$Y = c_1X_1 +c_2X_2 +\dots + c_pX_p = \sum_{j=1}^{p}c_jX_j = \mathbf{c}'\mathbf{X}$$.

Here we have a set of coefficients c1 through cp that are multiplied by the corresponding variables X1 through Xp. The first term is c1 times X1, which is added to c2 times X2, and so on up to cp times Xp. Mathematically, this is the sum over j = 1, ..., p of the terms cj times Xj. The random variables X1 through Xp are collected into a column vector X and the coefficients c1 through cp are collected into a column vector c. Hence, the linear combination can be expressed as c′X.

The selection of the coefficients c1 through cp depends very much on the application of interest and on the kinds of scientific questions we would like to address.

### Learning objectives & outcomes

Upon completion of this lesson, you should be able to do the following:

• Interpret the meaning of a specified linear combination;
• Compute the sample mean and variance of a linear combination from the sample means, variances, and covariances of the individual variables.

Later on in this course, when we learn about multivariate data reduction techniques, interpretation of linear combinations will be of great importance.

# 2.1 - Two Examples

### Example: Women’s Nutrition Data

The Women's Nutrition data contains observations for the following variables:

• X1 calcium (mg)
• X2 iron (mg)
• X3 protein (g)
• X4 vitamin A (μg)
• X5 vitamin C (mg)

In addition to addressing questions about the individual nutritional components, we may wish to address questions about certain combinations of these components. For instance, we might want to ask what the total intake of vitamins A and C (in mg) is. Note that vitamin A is measured in micrograms while vitamin C is measured in milligrams. There are a thousand micrograms per milligram, so the total intake of the two vitamins, Y, can be expressed as follows:

Y = 0.001X4 + X5

In this case, the coefficients c1, c2 and c3 are all equal to 0 since the variables X1, X2 and X3 do not appear in this expression. In addition, c4 is equal to 0.001 since each microgram of vitamin A is equal to 0.001 milligrams of vitamin A, and c5 is equal to 1 since vitamin C is already measured in milligrams. In summary, we have

c1 = c2 = c3 = 0, c4 = 0.001, c5 = 1
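As a minimal sketch in Python, the coefficient vector can be written out and applied to a single observation. The intake values in `x` and the helper name `linear_combination` are hypothetical, made up for illustration; the values are not taken from the Women's Nutrition data.

```python
# Coefficients for Y = 0.001*X4 + X5 (total vitamin A and C intake, in mg).
c = [0.0, 0.0, 0.0, 0.001, 1.0]

# Hypothetical observation, in the order X1..X5: calcium (mg), iron (mg),
# protein (g), vitamin A (micrograms), vitamin C (mg). Illustrative only.
x = [600.0, 10.0, 60.0, 800.0, 75.0]


def linear_combination(coefs, values):
    """Return c'x, the sum of coefficient-times-value products."""
    if len(coefs) != len(values):
        raise ValueError("coefs and values must have the same length")
    return sum(cj * xj for cj, xj in zip(coefs, values))


y = linear_combination(c, x)
print(f"Total vitamin A and C intake: {y:.1f} mg")
```

Because c1, c2 and c3 are zero, the calcium, iron and protein values have no effect on the result.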

### Example: Monthly Employment Data

Another example where we might be interested in linear combinations is in the Monthly Employment Data. Here we have observations on 6 variables:

• X1 Number of people laid off or fired
• X2 Number of people resigning
• X3 Number of people retiring
• X4 Number of jobs created
• X5 Number of people hired
• X6 Number of people entering the workforce

Net job increase:

The net job increase is equal to the number of jobs created minus the number of jobs lost.

Y = X4 - X1 - X2 - X3

In this case we have the number of jobs created, (X4), minus the number of people laid off or fired, (X1), minus the number of people resigning, (X2), minus the number of people retiring, (X3). Together, these are all of the people who have left their jobs for whatever reason.

In this case

c1 = c2 = c3 = -1 and c4 = 1.

Because variables 5 and 6 are not included in this expression,

c5 = c6 = 0.

Net employment increase:

In a similar fashion, net employment increase is equal to the number of people hired, (X5), minus the number of people laid off or fired, (X1), minus the number of people resigning, (X2), minus the number of people retiring, (X3).

Y = X5 - X1 - X2 - X3

In this case

c1 = c2 = c3 = -1, c4 = c6 = 0, and  c5 = 1.

Net unemployment increase:

Net unemployment increase is going to be equal to the number of people laid off or fired, (X1), plus the number of people resigning, (X2), plus the number of people entering the workforce, (X6), minus the number of people hired, (X5).

Y = X1 + X2 + X6 - X5

Unfilled jobs:

Finally, if we wanted to ask about the number of jobs that went unfilled, this is simply equal to the number of jobs created, (X4), minus the number of people hired, (X5).

Y = X4 - X5

In other applications, of course, other linear combinations would be of interest.
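The four employment combinations above differ only in their coefficient vectors. The sketch below writes each one out as a list (c1, ..., c6) and applies it to one month of counts. The counts in `x` are hypothetical, chosen only for illustration, and the dictionary layout is one possible arrangement rather than anything prescribed by the lesson.

```python
# One coefficient vector (c1..c6) per linear combination defined above.
combinations = {
    "net job increase": [-1, -1, -1, 1, 0, 0],          # X4 - X1 - X2 - X3
    "net employment increase": [-1, -1, -1, 0, 1, 0],   # X5 - X1 - X2 - X3
    "net unemployment increase": [1, 1, 0, 0, -1, 1],   # X1 + X2 + X6 - X5
    "unfilled jobs": [0, 0, 0, 1, -1, 0],               # X4 - X5
}

# Hypothetical counts for one month, in the order X1..X6 (illustrative only).
x = [120, 80, 30, 300, 260, 150]

# Evaluate each linear combination as the sum of c_j * x_j.
results = {
    name: sum(cj * xj for cj, xj in zip(c, x))
    for name, c in combinations.items()
}

for name, value in results.items():
    print(f"{name}: {value}")
```

Variables that do not appear in a combination simply get a coefficient of 0, so every combination uses the same six-element layout.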

# 2.2 - Measures of Central Tendency

Because linear combinations are functions of random quantities, they are also random variables, and hence have population means and variances. Moreover, if you are looking at several linear combinations, they will have covariances and correlations as well.

Therefore we are interested in knowing:

• What is the population mean of Y?
• What is the population variance of Y?
• What is the population covariance between two linear combinations Y1 and Y2?

Population Mean:

The population mean of a linear combination is equal to the same linear combination of the population means of the component variables. If

$Y = c_1X_1 + c_2X_2 +\dots + c_pX_p =\sum_{j=1}^{p}c_jX_j = \mathbf{c}'\mathbf{X}$.

then

$E(Y) = c_1 \mu_1 +c_2\mu_2 +\dots + c_p\mu_p = \sum_{j=1}^{p}c_j\mu_j = \mathbf{c}'\boldsymbol{\mu}$.

Mathematically, you express this as the sum over j = 1 to p of cj times the mean of the jth variable. If the coefficients cj are collected into a vector c and the means μj are collected into a mean vector μ, you can express this as c transpose times μ.

We can estimate the population mean by replacing the population means with the corresponding sample means; that is, replace all of the μ's with $$\bar{x}$$'s so that $$\bar{y}$$ equals c1 times $$\bar{x}_{1}$$ plus c2 times $$\bar{x}_{2}$$ and so on:

$\bar{y} = c_1\bar{x}_1 + c_2\bar{x}_2 + \dots + c_p\bar{x}_p = \sum_{j=1}^{p}c_j\bar{x}_j = \mathbf{c}'\mathbf{\bar{x}}$.

### Example: Women’s Nutrition Data

The following table shows the sample means for each of our five nutritional components that we computed in the previous lesson.

| Variable | Mean |
|---|---|
| Calcium | 624.0 mg |
| Iron | 11.1 mg |
| Protein | 65.8 g |
| Vitamin A | 839.6 μg |
| Vitamin C | 78.9 mg |

If, as previously, we define Y to be the total intake of vitamins A and C (in mg):

Y = 0.001X4 + X5

Then we can work out the estimated mean intake of the two vitamins as follows:

$$\bar{y}=0.001 \bar{x}_4 +\bar{x}_5 = 0.001 \times 839.6 + 78.9248 = 0.8396 + 78.9248 = 79.7644$$ mg.
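The same calculation can be sketched in Python as the product of the coefficient vector and the vector of sample means. This sketch uses the rounded means from the table, so its result can differ in the later decimal places from the worked calculation, which carries more digits for vitamin C. The variable names are illustrative.

```python
# Coefficients for Y = 0.001*X4 + X5.
c = [0.0, 0.0, 0.0, 0.001, 1.0]

# Sample means from the table, in the order X1..X5:
# calcium (mg), iron (mg), protein, vitamin A (micrograms), vitamin C (mg).
x_bar = [624.0, 11.1, 65.8, 839.6, 78.9]

# y_bar = c' x_bar: the same linear combination applied to the sample means.
y_bar = sum(cj * mean_j for cj, mean_j in zip(c, x_bar))

print(f"Estimated mean intake of vitamins A and C: {y_bar:.2f} mg")
```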

# 2.3 - Population Variance

Linear combinations not only have a population mean but also a population variance. The population variance of a linear combination is expressed as the following double sum over j = 1 to p and k = 1 to p, that is, over all pairs of variables.

$var(Y) =\sum_{j=1}^{p}\sum_{k=1}^{p}c_jc_k\sigma_{jk}=\mathbf{c}'\mathbf{\Sigma}\mathbf{c}$.

In each term within the double sum, the product of the paired coefficients cj times ck is multiplied by the covariance between the jth and kth variables. If Σ is the variance-covariance matrix of X, then Var(Y) = c′ Σ c.

Expressions of vectors and matrices of this form are called quadratic forms.

When using this expression, the covariance between a variable and itself, σjj, is simply equal to the variance of the jth variable, σj²:

$$\sigma_{jj} = \sigma^2_j$$

The variance of the random variable Y can be estimated by the sample variance, $$s^2_Y$$. This is obtained by substituting the sample variances and covariances for the population variances and covariances, as shown in the expression below.

$s^2_Y = \sum_{j=1}^{p}\sum_{k=1}^{p}c_jc_ks_{jk} = \mathbf{c}'\mathbf{S}\mathbf{c}$

A simplified calculation can be found below. This involves two terms.

$s^2_Y = \sum_{j=1}^{p}c^2_j s^2_j +2\sum_{j<k}c_jc_ks_{jk}$

The first term involves summing over all the variables. Here we take the squared coefficients and multiply them by their respective variances. In the second term, we sum over all unique pairs of variables with j less than k, taking the product of cj, ck and the covariance between variables j and k. Since each unique pair appears twice in the original expression, once as (j, k) and once as (k, j), and the covariance matrix is symmetric, we must multiply this sum by 2.
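As a quick check of this simplification, consider the case p = 2, using the same symbols as above. The double sum has four terms, and because $s_{12} = s_{21}$ the two cross terms combine into one:

$$\begin{array}{lll}s^2_Y & = & c_1c_1s_{11} + c_1c_2s_{12} + c_2c_1s_{21} + c_2c_2s_{22}\\ & = & c_1^2s_1^2 + c_2^2s_2^2 + 2c_1c_2s_{12}\end{array}$$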

### Example: Women’s Nutrition Data

Looking at the Women's Nutrition survey data we obtained the following variance/covariance matrix as shown below from the previous lesson.

$$S = \left(\begin{array}{rrrrr}157829.4 & 940.1 & 6075.8 & 102411.1 & 6701.6 \\ 940.1 & 35.8 & 114.1 & 2383.2 & 137.7 \\ 6075.8 & 114.1 & 934.9 & 7330.1 & 477.2 \\ 102411.1 & 2383.2 & 7330.1 & 2668452.4 & 22063.3 \\ 6701.6 & 137.7 & 477.2 & 22063.3 & 5416.3 \end{array}\right)$$.

If we wanted to take a look at the total intake of vitamins A and C (in mg) remember we defined this earlier as:

Y = 0.001X4 + X5

Therefore the sample variance of Y is equal to (0.001)² times the variance of X4, plus the variance of X5, plus 2 times 0.001 times the covariance between X4 and X5. The next few lines carry out the calculation using these values.

$$\begin{array}{lll}s^2_Y & = & 0.001^2s^2_4 + s^2_5 + 2 \times 0.001s_{45}\\ & = & 0.000001 \times 2668452.4 + 5416.3 + 0.002 \times 22063.3\\ & = & 2.7 + 5416.3 + 44.1 \\ & = & 5463.1\end{array}$$
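The calculation above can be reproduced with a short Python sketch that evaluates both the full double sum and the simplified two-term form on the matrix S from this lesson. The function names are illustrative choices, not part of the lesson.

```python
# Sample variance-covariance matrix S from the lesson
# (order: calcium, iron, protein, vitamin A, vitamin C).
S = [
    [157829.4, 940.1, 6075.8, 102411.1, 6701.6],
    [940.1, 35.8, 114.1, 2383.2, 137.7],
    [6075.8, 114.1, 934.9, 7330.1, 477.2],
    [102411.1, 2383.2, 7330.1, 2668452.4, 22063.3],
    [6701.6, 137.7, 477.2, 22063.3, 5416.3],
]

# Coefficients for Y = 0.001*X4 + X5.
c = [0.0, 0.0, 0.0, 0.001, 1.0]


def variance_double_sum(c, S):
    """Full double sum over all (j, k) pairs: c'Sc."""
    p = len(c)
    return sum(c[j] * c[k] * S[j][k] for j in range(p) for k in range(p))


def variance_simplified(c, S):
    """Variance terms plus twice the sum over unique pairs j < k."""
    p = len(c)
    diagonal = sum(c[j] ** 2 * S[j][j] for j in range(p))
    cross = sum(
        c[j] * c[k] * S[j][k] for j in range(p) for k in range(j + 1, p)
    )
    return diagonal + 2 * cross


s2_full = variance_double_sum(c, S)
s2_short = variance_simplified(c, S)
print(round(s2_full, 1), round(s2_short, 1))
```

Both forms agree up to floating-point rounding, and both match the hand calculation to one decimal place.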

# 2.4 - Population Covariance

Sometimes we are interested in more than one linear combination or variable. In this case we may be interested in the association between those two linear combinations. More specifically, we can consider the covariance between two linear combinations of the data.

Consider the pair of linear combinations:

$Y_1 = \sum_{j=1}^{p}c_jX_j \;\;\; \text{and} \;\;\; Y_2 = \sum_{k=1}^{p}d_kX_k$.

Here Y1 and Y2 are two distinct linear combinations, with distinct coefficient vectors c and d. Both Y1 and Y2 are random variables, so they are potentially correlated. We can assess the association between them using their covariance.

The population covariance between Y1 and Y2 is obtained by summing over all pairs of variables. In each term, the coefficients cj and dk from the two linear combinations are multiplied together and then by the covariance between the jth and kth variables.

$cov(Y_1, Y_2) = \sum_{j=1}^{p}\sum_{k=1}^{p}c_jd_k\sigma_{jk}$

We can then estimate the population covariance by using the sample covariance. This is obtained by simply substituting the sample covariances between the pairs of variables for the population covariances between the pairs of variables.

$s_{Y_1,Y_2}= \sum_{j=1}^{p}\sum_{k=1}^{p}c_jd_ks_{jk}$

The population correlation between variables Y1 and Y2 can be obtained using the usual formula: the covariance between Y1 and Y2 divided by the product of the standard deviations of the two variables, as shown below:

$\rho_{Y_1,Y_2} = \frac{\sigma_{Y_1,Y_2}}{\sigma_{Y_1}\sigma_{Y_2}}$

This population correlation is estimated by the sample correlation where we simply substitute in the sample quantities for the population quantities as below

$r_{Y_1,Y_2} = \frac{s_{Y_1, Y_2}}{s_{Y_1}s_{Y_2}}$

### Example: Women’s Nutrition Data

Here is the sample variance-covariance matrix shown previously.

$$S = \left(\begin{array}{rrrrr}157829.4 & 940.1 & 6075.8 & 102411.1 & 6701.6 \\ 940.1 & 35.8 & 114.1 & 2383.2 & 137.7 \\ 6075.8 & 114.1 & 934.9 & 7330.1 & 477.2 \\ 102411.1 & 2383.2 & 7330.1 & 2668452.4 & 22063.3 \\ 6701.6 & 137.7 & 477.2 & 22063.3 & 5416.3 \end{array}\right)$$.

We may wish to define the total intake of vitamins A and C in mg as before.

Y1 = 0.001X4 + X5

and we may also want to take a look at the total intake of calcium and iron:

Y2 = X1 + X2

Then the sample covariance between Y1 and Y2 can be obtained by looking at the covariances between each pair of the component variables times the respective coefficients. In this case we are pairing X1 and X4, X1 and X5, X2 and X4, and X2 and X5, so s41, s42, s51 and s52 all appear in the expression below. The values are taken from the matrix above and substituted into the expression, and the arithmetic is carried out below.

$$\begin{array}{lll}s_{Y_1, Y_2} & = & 0.001s_{41} + 0.001s_{42} + s_{51}+s_{52}\\& = & 0.001 \times 102411.1 + 0.001 \times 2383.2 + 6701.6 +137.7\\ & = & 102.4 + 2.4 + 6701.6 + 137.7\\ & = &6944.1 \end{array}$$

At this point you should be able to confirm that the sample variance of Y2 is 159,745.4, as shown below:

$$\begin{array}{lll}s^2_{Y_2} & = & s_{11}+s_{22}+2s_{12}\\ & = & 157829.4 + 35.8 + 2 \times 940.1\\ & = & 157829.4 + 35.8 + 1880.2 \\ & = & 159745.4 \end{array}$$

And, if we care to obtain the sample correlation between Y1 and Y2, we take the sample covariance that we just obtained and divide by the square root of the product of the two component variances, 5463.1, for Y1, which we obtained earlier, and 159745.4, which we just obtained above. Following this math through, we end up with a correlation of about 0.235 as shown below.

$r_{Y_1,Y_2} = \frac{s_{Y_1, Y_2}}{s_{Y_1}s_{Y_2}}= \frac{6944.1}{\sqrt{5463.1 \times 159745.4}}=0.235$
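The covariance and correlation calculations in this example can be sketched in Python with a single helper that evaluates the double sum of cj times dk times sjk. The helper name `sample_covariance` is an illustrative choice. Calling it with the same coefficient vector twice gives the sample variance of that linear combination.

```python
import math

# Sample variance-covariance matrix S from the lesson
# (order: calcium, iron, protein, vitamin A, vitamin C).
S = [
    [157829.4, 940.1, 6075.8, 102411.1, 6701.6],
    [940.1, 35.8, 114.1, 2383.2, 137.7],
    [6075.8, 114.1, 934.9, 7330.1, 477.2],
    [102411.1, 2383.2, 7330.1, 2668452.4, 22063.3],
    [6701.6, 137.7, 477.2, 22063.3, 5416.3],
]

c = [0.0, 0.0, 0.0, 0.001, 1.0]  # Y1 = 0.001*X4 + X5
d = [1.0, 1.0, 0.0, 0.0, 0.0]    # Y2 = X1 + X2


def sample_covariance(c, d, S):
    """Double sum of c_j * d_k * s_jk over all pairs (j, k)."""
    p = len(S)
    return sum(c[j] * d[k] * S[j][k] for j in range(p) for k in range(p))


s_y1y2 = sample_covariance(c, d, S)  # covariance between Y1 and Y2
s2_y1 = sample_covariance(c, c, S)   # variance of Y1
s2_y2 = sample_covariance(d, d, S)   # variance of Y2
r = s_y1y2 / math.sqrt(s2_y1 * s2_y2)

print(f"cov = {s_y1y2:.1f}, var(Y1) = {s2_y1:.1f}, var(Y2) = {s2_y2:.1f}")
print(f"r = {r:.3f}")
```

The sketch carries full precision through each step, so intermediate values can differ slightly from the hand calculation, which rounds each term to one decimal place. The final correlation agrees to three decimal places.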

# 2.5 - Summary

In this lesson we learned about:

• The definition of a linear combination of random variables;
• Expressions of the population mean and variance of a linear combination and the covariance between two linear combinations;
• How to compute the sample mean of a linear combination from the sample means of the component variables;
• How to compute the sample variance of a linear combination from the sample variances and covariances of the component variables;
• How to compute the sample covariance and correlation between two linear combinations from the sample covariances of the component variables.

Next, complete the homework problems that will give you a chance to put what you have learned to use...