# 10.3 - Linear Discriminant Analysis

We assume that in population \(\pi_i\) the probability density function of \(\mathbf{x}\) is multivariate normal with mean vector \(\mathbf{\mu}_i\) and variance-covariance matrix \(\mathbf{\Sigma}\) (same for all populations). As a formula, this is

\[f(\mathbf{x}|\pi_i) = \frac{1}{(2\pi)^{p/2}|\mathbf{\Sigma}|^{1/2}}\exp\left[-\frac{1}{2}\mathbf{(x-\mu_i)'\Sigma^{-1}(x-\mu_i)}\right]\]

We classify to the population for which \(p_i f(\mathbf{x}|\pi_i)\) is largest.

Because a log transform is monotonic, this is equivalent to classifying an observation to the population for which \(\log[p_i f(\mathbf{x}|\pi_i)]\) is largest.
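To see why this criterion leads to a score that is linear in \(\mathbf{x}\), it helps to expand the log of the normal density given above. The following is a sketch of that algebra using only the symbols already defined; it relies on \(\mathbf{\Sigma}\) being symmetric, so that the two cross terms of the quadratic form combine:

\[\begin{aligned}\log[p_i f(\mathbf{x}|\pi_i)] &= \log p_i - \frac{p}{2}\log(2\pi) - \frac{1}{2}\log|\mathbf{\Sigma}| - \frac{1}{2}\mathbf{(x-\mu_i)'\Sigma^{-1}(x-\mu_i)} \\ &= \underbrace{-\frac{p}{2}\log(2\pi) - \frac{1}{2}\log|\mathbf{\Sigma}| - \frac{1}{2}\mathbf{x'\Sigma^{-1}x}}_{\text{same for every } i} \; - \frac{1}{2}\mathbf{\mu'_i\Sigma^{-1}\mu_i} + \mathbf{\mu'_i\Sigma^{-1}x} + \log p_i\end{aligned}\]

The bracketed terms do not depend on the population, so they can be dropped without changing which population gives the largest value. What remains is linear in \(\mathbf{x}\).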

Linear discriminant analysis is used when the variance-covariance matrix does not depend on the population. In this case, our decision rule is based on the Linear Score Function, a function of the population means for each of our *g* populations, \(\mathbf{\mu}_i\), as well as the pooled variance-covariance matrix.

The *Linear Score Function* is:

\[s^L_i(\mathbf{x}) = -\frac{1}{2}\mathbf{\mu'_i \Sigma^{-1}\mu_i + \mu'_i \Sigma^{-1}x}+ \log p_i = d_{i0}+\sum_{j=1}^{p}d_{ij}x_j + \log p_i\]

where

\[d_{i0} = -\frac{1}{2}\mathbf{\mu'_i\Sigma^{-1}\mu_i}\]

\(d_{ij} = j\text{th element of } \mu'_i\Sigma^{-1}\)

The far right-hand expression resembles a linear regression with intercept term \(d_{i0}\) and regression coefficients \(d_{ij}\).

Linear Discriminant Function:

\[d^L_i(\mathbf{x}) = -\frac{1}{2}\mathbf{\mu'_i\Sigma^{-1}\mu_i + \mu'_i\Sigma^{-1}x} = d_{i0} + \sum_{j=1}^{p}d_{ij}x_j\]

\[d_{i0} = -\frac{1}{2}\mathbf{\mu'_i\Sigma^{-1}\mu_i}\]

Given a sample unit with measurements *x*_{1}, *x*_{2}, ... , *x*_{p}, we classify the sample unit into the population that has the largest Linear Score Function. This is equivalent to classifying to the population for which the posterior probability of membership is largest. The linear score function is computed for each population, then we plug in our observation values and assign the unit to the population with the largest score.
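The scoring and assignment step described above can be sketched in a few lines of NumPy. This is a minimal illustration that assumes the parameters are known; the function names `linear_scores` and `posterior_from_scores` and all numeric values are hypothetical and chosen only for the example. The posterior probabilities are obtained by exponentiating and normalizing the scores, which works because the terms dropped from the log density are common to all populations.

```python
import numpy as np

def linear_scores(x, means, sigma, priors):
    """Return s_i(x) = d_i0 + sum_j d_ij x_j + log p_i for each population i."""
    means = np.asarray(means, dtype=float)       # shape (g, p), row i is mu_i
    sigma = np.asarray(sigma, dtype=float)       # shape (p, p), shared covariance
    # Row i of coef is mu_i' Sigma^{-1}; solving avoids forming the inverse.
    coef = np.linalg.solve(sigma, means.T).T
    intercept = -0.5 * np.sum(coef * means, axis=1)   # d_i0
    return intercept + coef @ np.asarray(x, dtype=float) + np.log(priors)

def posterior_from_scores(scores):
    """Posterior probabilities; terms common to all populations cancel."""
    w = np.exp(scores - np.max(scores))          # subtract max for stability
    return w / w.sum()

# Hypothetical parameters for g = 2 populations and p = 2 variables.
means = [[0.0, 0.0], [2.0, 2.0]]
sigma = [[1.0, 0.5], [0.5, 1.0]]
priors = [0.5, 0.5]
x = [1.5, 1.2]

scores = linear_scores(x, means, sigma, priors)
assigned = int(np.argmax(scores))                # index of the largest score
post = posterior_from_scores(scores)
```

Here `assigned` indexes the population with the largest score, and the same index has the largest entry in `post`.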

However, this is a function of unknown parameters, \(\mathbf{\mu}_i\) and \(\mathbf{\Sigma}\). So, these must be estimated from the data.

Discriminant analysis requires estimates of:

Prior probabilities:

\(p_i = \text{Pr}(\pi_i);\) \(i = 1, 2, \dots, g\)

The population means are estimated by the sample mean vectors:

\(\mathbf{\mu_i} = E(\mathbf{X}|\pi_i)\); \(i = 1, 2, \dots, g\)

The variance-covariance matrix is estimated by using the pooled variance-covariance matrix

\(\Sigma = \text{var}(\mathbf{X}| \pi_i)\); \(i = 1, 2, \dots, g\)

Typically, these parameters are estimated from training data, in which the population membership is known.

**Conditional Density Function Parameters**:

Population Means: \(\mathbf{\mu}_i\) is estimated by substituting in the sample means \(\bar{\mathbf{x}}_i\).

Variance-Covariance matrix: Let \(\mathbf{S}_i\) denote the sample variance-covariance matrix for population *i*. Then the variance-covariance matrix \(\mathbf{\Sigma}\) is estimated by substituting the pooled variance-covariance matrix, shown below, into the Linear Score Function:

\[\mathbf{S}_p = \frac{\sum_{i=1}^{g}(n_i-1)\mathbf{S}_i}{\sum_{i=1}^{g}(n_i-1)}\]
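As a quick illustration of the pooling formula, the sketch below weights each group's sample covariance matrix by its degrees of freedom \(n_i - 1\). The helper name `pooled_covariance` and the small data arrays are made up for this example, and it assumes at least two variables per unit.

```python
import numpy as np

def pooled_covariance(groups):
    """Pooled covariance S_p from a list of (n_i, p) arrays, one per population."""
    groups = [np.asarray(g, dtype=float) for g in groups]
    # np.cov with rowvar=False treats columns as variables and divides by n_i - 1.
    weighted = sum((len(g) - 1) * np.cov(g, rowvar=False) for g in groups)
    return weighted / sum(len(g) - 1 for g in groups)

# Made-up training samples: 3 units from one population, 4 from another.
group_a = [[1.0, 2.0], [2.0, 1.0], [3.0, 4.0]]
group_b = [[6.0, 5.0], [7.0, 8.0], [8.0, 6.0], [9.0, 9.0]]
S_p = pooled_covariance([group_a, group_b])
```

Equivalently, \(\mathbf{S}_p\) is the sum of the within-group centered cross-product matrices divided by the total sample size minus *g*.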

to obtain the estimated linear score function:

\[\hat{s}^L_i(\mathbf{x}) = -\frac{1}{2}\mathbf{\bar{x}'_i S^{-1}_p \bar{x}_i +\bar{x}'_i S^{-1}_p x } + \log{\hat{p}_i} = \hat{d}_{i0} + \sum_{j=1}^{p}\hat{d}_{ij}x_j + \log{\hat{p}_i}\]

where

\[\hat{d}_{i0} = -\frac{1}{2}\mathbf{\bar{x}'_i S^{-1}_p \bar{x}_i} \]

and

\(\hat{d}_{ij} = j\)th element of \(\mathbf{\bar{x}'_iS^{-1}_p}\)

This is a function of the sample mean vectors, the pooled variance-covariance matrix, and the prior probabilities for the *g* populations. It is written in a form that resembles a linear regression formula: an intercept term, plus a linear combination of the measured variables \(x_j\), plus the natural log of the prior probability.

**Decision Rule**: Classify the sample unit into the population that has the largest estimated linear score function.
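Putting the pieces together, the following is a minimal end-to-end sketch: estimate the sample means and pooled variance-covariance matrix from training data, form the estimated linear score coefficients, and apply the decision rule. The function names `fit_lda` and `classify` are hypothetical, the training data are simulated, and taking the priors to be the sample proportions when none are supplied is an assumption of this example rather than something the text prescribes.

```python
import numpy as np

def fit_lda(X, y, priors=None):
    """Estimate the coefficients of the linear score functions from training data."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    labels = np.unique(y)
    n = np.array([np.sum(y == k) for k in labels])
    xbar = np.array([X[y == k].mean(axis=0) for k in labels])    # sample mean vectors
    # Pooled variance-covariance matrix S_p.
    S_p = sum((n_k - 1) * np.cov(X[y == k], rowvar=False)
              for k, n_k in zip(labels, n)) / (n.sum() - len(labels))
    if priors is None:
        priors = n / n.sum()        # assumption: priors taken as sample proportions
    coef = np.linalg.solve(S_p, xbar.T).T                         # row i: xbar_i' S_p^{-1}
    const = -0.5 * np.sum(coef * xbar, axis=1) + np.log(priors)   # d_i0_hat + log p_i_hat
    return labels, coef, const

def classify(X_new, labels, coef, const):
    """Assign each row of X_new to the population with the largest estimated score."""
    scores = np.atleast_2d(X_new) @ coef.T + const
    return labels[np.argmax(scores, axis=1)]

# Simulated training data (illustrative only): two populations, common covariance.
rng = np.random.default_rng(0)
X_train = np.vstack([rng.normal([0.0, 0.0], 1.0, size=(30, 2)),
                     rng.normal([4.0, 4.0], 1.0, size=(30, 2))])
y_train = np.repeat([0, 1], 30)

model = fit_lda(X_train, y_train)
predicted = classify([[0.0, 0.0], [4.0, 4.0]], *model)
```

Passing an explicit `priors` array lets the same sketch be used when the prior probabilities are known from other sources instead of being taken from the training sample.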