# Lesson 10: Discriminant Analysis

### Introduction

Discriminant analysis is a classification problem, where two or more groups or clusters or populations are known a priori and one or more new observations are classified into one of the known populations based on the measured characteristics. Let us look at three different examples.

#### Example 1 - Swiss Bank Notes:

We have two populations of bank notes, genuine, and counterfeit. Six measures are taken on each note:

• Length
• Right-Hand Width
• Left-Hand Width
• Top Margin
• Bottom Margin
• Diagonal across the printed area

Take a bank note of unknown origin and determine, just from these six measurements, whether it is genuine or counterfeit. Perhaps this is not as impractical as it might sound. A more modern equivalent is a scanner that measures the notes automatically and makes a decision.

#### Example 2 - Pottery Data:

Pottery shards are sampled from four sites: L) Llanedyrn, C) Caldicot, I) Ilse Thornes, and A) Ashley Rails. The concentrations of the following chemical constituents were measured at a laboratory:

• Al: Aluminum
• Fe: Iron
• Mg: Magnesium
• Ca: Calcium
• Na: Sodium

An archaeologist encounters a pottery specimen of unknown origin. To determine possible trade routes, the archaeologist may wish to classify its site of origin.

#### Example 3 - Insect Data:

Data were collected on two species of insects in the genus Chaetocnema, (a) Ch. concinna and (b) Ch. heikertlingeri. Three variables were measured on each insect:

• width of the 1st joint of the tarsus (legs)
• width of the 2nd joint of the tarsus
• width of the aedeagus (sex organ)

Our objective is to obtain a classification rule for identifying the insect species based on these three variables. An entomologist can identify these two closely related species, but the differences are so subtle that considerable experience is needed to tell them apart. If a classification rule can be developed, it might offer a more accurate way to differentiate between the two species.

### Learning objectives & outcomes

Upon completion of this lesson, you should be able to do the following:

• Determine whether linear or quadratic discriminant analysis should be applied to a given data set;
• Be able to carry out both types of discriminant analyses using SAS/Minitab;
• Be able to apply the linear discriminant function to classify a subject by its measurements;
• Understand how to assess the efficacy of a discriminant analysis.

# 10.1 - Bayes Rule and Classification Problem

#### Bayes’ Rule

Consider any two events A and B. To find P(B|A), the probability that B occurs given that A has occurred, Bayes’ Rule states the following:

$P(B|A) = \frac{P(A \text{ and } B)}{P(A)}$

This says that the conditional probability is the probability that both A and B occur divided by the unconditional probability that A occurs. This is a simple algebraic restatement of a rule for finding the probability that two events occur together, which is P(A and B) = P(A)P(B|A).

#### Bayes’ Rule Applied to the Classification Problem

We are interested in P(πi | x), the conditional probability that an observation came from population πi given the observed values of the multivariate vector of variables x. We will classify an observation to the population for which the value of P(πi | x) is greatest. This is the most probable group given the observed values of x.

• Suppose that we have g populations (groups) and that the ith population is denoted as πi .
• Let pi = P(πi), the probability that a randomly selected observation is in population πi .
• Let f (x | πi ) be the conditional probability density function of the multivariate set of variables x, given that the observation came from population πi .

Technical Note: We have to be careful about the word probability in conjunction with our observed vector x. A probability density function for continuous variables does not give a probability, but instead gives a measure of “likelihood.”

Using the notation of Bayes’ Rule above, event A = observing the vector x and event B = observation came from population πi . Thus our probability of interest can be found as

$P(\text{ member of } \pi_i | \text{ we observed } \mathbf{x}) = \frac{P(\text{ member of } \pi_i \text{ and we observe } \mathbf{x})}{P(\text{ we observe } \mathbf{x})}$

• The numerator of the expression just given is the likelihood that a randomly selected observation is both from population πi and has the value x. This likelihood = pi f (x | πi ) .
• The denominator is the unconditional likelihood (over all populations) that we could observe x. This likelihood = $$\sum_{j=1}^{g} p_j f(\mathbf{x}|\pi_j)$$

Thus the posterior probability that an observation is a member of population πi is

$p(\pi_i|\mathbf{x}) = \frac{p_i f(\mathbf{x}|\pi_i)}{\sum_{j=1}^{g}p_j f(\mathbf{x}|\pi_j)}$

The classification rule is to assign observation x to the population for which the posterior probability is the greatest.

The denominator is the same for all posterior probabilities (for the various populations) so it is equivalent to say that we will classify an observation to the population for which pi f (x | πi ) is greatest.
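As a minimal sketch of this rule, assume a single measured variable and two hypothetical populations with normal densities. The means, standard deviations, priors, and function names below are invented for illustration and do not come from the lesson's data sets.

```python
import math

def normal_pdf(x, mean, sd):
    """Density of a univariate normal distribution evaluated at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def posteriors(x, priors, densities):
    """Posterior p(pi_i | x) = p_i f(x|pi_i) / sum_j p_j f(x|pi_j)."""
    numerators = [p * f(x) for p, f in zip(priors, densities)]
    total = sum(numerators)  # unconditional likelihood of observing x
    return [num / total for num in numerators]

# Hypothetical populations (illustrative values only).
priors = [0.7, 0.3]
densities = [
    lambda x: normal_pdf(x, mean=10.0, sd=2.0),
    lambda x: normal_pdf(x, mean=14.0, sd=2.0),
]

post = posteriors(12.5, priors, densities)
assigned = post.index(max(post))  # 0-based index of the most probable population
print(post, assigned)
```

Because the denominator is shared, comparing the numerators alone would give the same assignment; the division only matters when the posterior probabilities themselves are of interest.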

#### Two Populations

With only two populations we can express a classification rule in terms of the ratio of the two posterior probabilities. Specifically we would classify to population 1 when

$\frac{p_1 f(\mathbf{x}|\pi_1)}{p_2 f(\mathbf{x}|\pi_2)} > 1$

This can be rewritten to say that we classify to population 1 when

$\frac{ f(\mathbf{x}|\pi_1)}{ f(\mathbf{x}|\pi_2)} > \frac{p_2}{p_1}$

#### Decision Rule

We are going to classify the sample unit or subject into the population πi that maximizes the posterior probability p(πi | x), that is, the population that maximizes

$$f(\mathbf{x|\pi_i})p_i$$

We are going to calculate the posterior probabilities for each of the populations. Then we are going to assign the subject or sample unit to that population that has the highest posterior probability. Ideally that posterior probability is going to be greater than a half, the closer to 100% the better!

Equivalently we are going to assign it to the population that maximizes the log of this product:

$$\log \left[f(\mathbf{x|\pi_i})p_i\right]$$

The denominator that appears above does not depend on the population since it involves summing over all the populations. All we really need to do, then, is assign the unit to the population with the largest value of this product or, equivalently, the largest log of that product. A lot of times it is easier to write this log down.

# 10.2 - Discriminant Analysis Procedure

The following 7-step procedure is usually carried out in discriminant analysis:

• Step 1: Collect ground truth or training data.
• Ground truth or training data are data with known group memberships. Here, we actually know to which population each subject belongs. For example, in the Swiss Bank Notes, we actually know which of these are genuine notes and which others are counterfeit examples.

• Step 2: Prior Probabilities:

The prior probability pi represents the expected portion of the community that belongs to population πi. There are three common choices:

1) Equal priors: $\hat{p}_i = \frac{1}{g}$ This would be used if we believe that all of the population sizes are equal.

2) Arbitrary priors selected according to the investigator's beliefs regarding the relative population sizes. Note that we require:

$$\hat{p}_1 + \hat{p}_2 + \dots + \hat{p}_g = 1$$

3) Estimated priors:

$\hat{p}_i = \frac{n_i}{N}$

where ni is the number of observations from population πi in the training data, and N = n1 + n2 + ... + ng

• Step 3: Use Bartlett’s test to determine if variance-covariance matrices are homogeneous for the two or more populations involved. Result of this test will determine whether to use Linear or Quadratic Discriminant Analysis.

Case 1: Linear discriminant analysis is for homogeneous variance-covariance matrices:

$$\Sigma_1 = \Sigma_2 = \dots = \Sigma_g = \Sigma$$

In this case the variance-covariance matrix does not depend on the population from which the data are obtained.

Case 2: Quadratic discriminant analysis is used for heterogeneous variance-covariance matrices:

$$\Sigma_i \ne \Sigma_j$$ for some $$i \ne j$$

This allows the variance-covariance matrices to depend on which population we are looking at.

(We do not discuss testing whether the means of the populations are different. If they are not, there is no case for DA)

• Step 4: Estimate the parameters of the conditional probability density functions f ( X | πi ). Here, we shall make the following standard assumptions:
1. The data from group i has common mean vector μi
2. The data from group i has common variance-covariance matrix Σ.
3. Independence: The subjects are independently sampled.
4. Normality: The data are multivariate normally distributed.
• Step 5: Compute discriminant functions. This is the rule for classification of the new object into one of the known populations.
• Step 6: Use cross validation to estimate misclassification probabilities.
• As in all statistical procedures it is helpful to use diagnostic procedures to assess the efficacy of the discriminant analysis. We use cross-validation to assess the classification probability. Typically you are going to have some prior rule as to what is an acceptable misclassification rate. Those rules might involve things like, "what is the cost of misclassification?" This could come up in a medical study where you might be able to diagnose cancer. There are really two alternative costs. One is the cost of misclassifying someone as having cancer when they don't, which could cause a certain amount of emotional grief. The other is the cost of misclassifying someone as not having cancer when in fact they do have it. The cost here is obviously greater if early diagnosis improves cure rates.

• Step 7: Classify observations with unknown group memberships.

The procedure described above assumes that the unit or subject being classified actually belongs to one of the populations under consideration. If you have a study where you are looking at two species of insects, A and B, and the insect being classified actually belongs to species C, then it will obviously be misclassified as belonging to either A or B.
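The three choices of prior probabilities in Step 2 can be sketched as follows. This is only an illustration: the function names and the training counts are hypothetical and are not part of the lesson's SAS or Minitab workflow.

```python
def equal_priors(g):
    """Equal priors: p_i = 1/g for each of the g populations."""
    return [1.0 / g] * g

def estimated_priors(counts):
    """Estimated priors: p_i = n_i / N from the training sample sizes."""
    total = sum(counts)
    return [n / total for n in counts]

def is_valid(priors, tol=1e-9):
    """Priors must be non-negative and sum to one."""
    return all(p >= 0 for p in priors) and abs(sum(priors) - 1.0) < tol

# Hypothetical training sample sizes for g = 3 populations.
counts = [30, 50, 20]

print(equal_priors(len(counts)))
print(estimated_priors(counts))

# Arbitrary priors chosen by the investigator (illustrative values).
arbitrary = [0.90, 0.05, 0.05]
print(is_valid(arbitrary))
```

Checking that arbitrary priors sum to one before using them guards against a simple but easy-to-miss specification error.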

# 10.3 - Linear Discriminant Analysis

We assume that in population πi the probability density function of x is multivariate normal with mean vector μi and variance-covariance matrix Σ (same for all populations). As a formula, this is

$f(\mathbf{x}|\pi_i) = \frac{1}{(2\pi)^{p/2}|\mathbf{\Sigma}|^{1/2}}\exp\left[-\frac{1}{2}\mathbf{(x-\mu_i)'\Sigma^{-1}(x-\mu_i)}\right]$

We classify to the population for which pi f (x | πi ) is largest.

Because a log transform is monotonic, this is equivalent to classifying an observation to the population for which log[ pi f (x | πi )] is largest.

Linear discriminant analysis is used when the variance-covariance matrix does not depend on the population from which the data are obtained. In this case, our decision rule is based on the so-called Linear Score Function which is a function of the population means for each of our g populations μi, as well as the pooled variance-covariance matrix.

The Linear Score Function is:

$s^L_i(\mathbf{x}) = -\frac{1}{2}\mathbf{\mu'_i \Sigma^{-1}\mu_i + \mu'_i \Sigma^{-1}x}+ \log p_i = d_{i0}+\sum_{j=1}^{p}d_{ij}x_j + \log p_i$

where

$d_{i0} = -\frac{1}{2}\mathbf{\mu'_i\Sigma^{-1}\mu_i}$

$$d_{ij} = j\text{th element of } \mu'_i\Sigma^{-1}$$

The far right-hand expression resembles a linear regression with intercept term di0 and regression coefficients dij.
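To see where the linear score function comes from, take the log of pi f (x | πi ) under the multivariate normal density given at the start of this section and expand the quadratic form. The following is a short derivation sketch, using the fact that Σ is symmetric:

$$\log\left[p_i f(\mathbf{x}|\pi_i)\right] = -\frac{p}{2}\log(2\pi) - \frac{1}{2}\log|\mathbf{\Sigma}| - \frac{1}{2}\mathbf{x'\Sigma^{-1}x} + \mathbf{\mu'_i\Sigma^{-1}x} - \frac{1}{2}\mathbf{\mu'_i\Sigma^{-1}\mu_i} + \log p_i$$

The first three terms are the same for every population, so they do not affect which population gives the largest value. Dropping them leaves

$$-\frac{1}{2}\mathbf{\mu'_i\Sigma^{-1}\mu_i} + \mathbf{\mu'_i\Sigma^{-1}x} + \log p_i$$

which is the linear score function. The term that is quadratic in x cancels only because Σ is common to all populations.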

Linear Discriminant Function:

$d^L_i(\mathbf{x}) = -\frac{1}{2}\mathbf{\mu'_i\Sigma^{-1}\mu_i + \mu'_i\Sigma^{-1}x} = d_{i0} + \sum_{j=1}^{p}d_{ij}x_j$

$d_{i0} = -\frac{1}{2}\mathbf{\mu'_i\Sigma^{-1}\mu_i}$

Given a sample unit with measurements x1, x2, ... , xp, we classify the sample unit into the population that has the largest Linear Score Function. This is equivalent to classifying to the population for which the posterior probability of membership is largest. The linear score function is computed for each population, then we assign the unit to the population with the largest score.

However, this is a function of unknown parameters, μi and Σ. So, these must be estimated from the data.

Discriminant analysis requires estimates of:

Prior probabilities:

$$p_i = \text{Pr}(\pi_i);$$ $$i = 1, 2, \dots, g$$

The Population Means: these can be estimated by the sample mean vectors:

$$\mathbf{\mu_i} = E(\mathbf{X}|\pi_i)$$; $$i = 1, 2, \dots, g$$

The Variance-covariance matrix: this is going to be estimated by using the pooled variance-covariance matrix

$$\Sigma = \text{var}(\mathbf{X}| \pi_i)$$; $$i = 1, 2, \dots, g$$

Typically, these parameters are estimated from training data, in which the population membership is known.

Conditional Density Function Parameters:

Population Means: μi can be estimated by substituting in the sample means $$\bar{\mathbf{x}}_i$$.

Variance-Covariance matrix: Let Si denote the sample variance-covariance matrix for population i. Then the variance-covariance matrix Σ can be estimated by the pooled variance-covariance matrix, which is substituted into the Linear Score Function as shown below:

$\mathbf{S}_p = \frac{\sum_{i=1}^{g}(n_i-1)\mathbf{S}_i}{\sum_{i=1}^{g}(n_i-1)}$

to obtain the estimated linear score function:

$\hat{s}^L_i(\mathbf{x}) = -\frac{1}{2}\mathbf{\bar{x}'_i S^{-1}_p \bar{x}_i +\bar{x}'_i S^{-1}_p x } + \log{\hat{p}_i} = \hat{d}_{i0} + \sum_{j=1}^{p}\hat{d}_{ij}x_j + \log{\hat{p}_i}$

where

$\hat{d}_{i0} = -\frac{1}{2}\mathbf{\bar{x}'_i S^{-1}_p \bar{x}_i}$

and

$$\hat{d}_{ij} = j$$th element of $$\mathbf{\bar{x}'_iS^{-1}_p}$$

This is a function of the sample mean vectors, the pooled variance-covariance matrix and prior probabilities for g different populations. This is written in a form that looks like a linear regression formula with an intercept term plus a linear combination of response variables, plus the natural log of the prior probabilities.

Decision Rule: Classify the sample unit into the population that has the largest estimated linear score function.
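The estimation steps above can be sketched in plain Python for the two-variable case. The training data, priors, and function names are hypothetical, and the 2 x 2 inverse is written out by hand only to keep the sketch free of external libraries; in practice SAS, Minitab, or a linear algebra library would do this work.

```python
import math

def mean_vector(rows):
    """Sample mean of each variable."""
    n = len(rows)
    return [sum(r[k] for r in rows) / n for k in range(len(rows[0]))]

def covariance(rows):
    """Sample variance-covariance matrix S_i (divisor n_i - 1)."""
    n, p = len(rows), len(rows[0])
    m = mean_vector(rows)
    return [[sum((r[a] - m[a]) * (r[b] - m[b]) for r in rows) / (n - 1)
             for b in range(p)] for a in range(p)]

def pooled_covariance(groups):
    """S_p = sum_i (n_i - 1) S_i / sum_i (n_i - 1)."""
    p = len(groups[0][0])
    dof = sum(len(g) - 1 for g in groups)
    covs = [covariance(g) for g in groups]
    return [[sum((len(g) - 1) * S[a][b] for g, S in zip(groups, covs)) / dof
             for b in range(p)] for a in range(p)]

def inverse_2x2(M):
    """Inverse of a 2 x 2 matrix (enough for this two-variable sketch)."""
    det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    return [[M[1][1] / det, -M[0][1] / det],
            [-M[1][0] / det, M[0][0] / det]]

def linear_score(x, xbar, Sp_inv, prior):
    """Estimated linear score: d_i0 + sum_j d_ij x_j + log(p_i)."""
    d = [sum(xbar[a] * Sp_inv[a][j] for a in range(2)) for j in range(2)]  # xbar' Sp^-1
    d0 = -0.5 * sum(d[j] * xbar[j] for j in range(2))
    return d0 + sum(d[j] * x[j] for j in range(2)) + math.log(prior)

# Hypothetical training data: two groups, two variables each.
group_a = [(1.0, 2.0), (2.0, 3.5), (3.0, 3.0), (2.0, 1.5)]
group_b = [(6.0, 7.0), (7.0, 8.5), (8.0, 8.0), (7.0, 6.5)]
groups = [group_a, group_b]
priors = [0.5, 0.5]

pooled = pooled_covariance(groups)
Sp_inv = inverse_2x2(pooled)
means = [mean_vector(g) for g in groups]

def classify(x):
    """Return 0 for group a or 1 for group b, whichever score is larger."""
    scores = [linear_score(x, m, Sp_inv, p) for m, p in zip(means, priors)]
    return scores.index(max(scores))

print(classify((2.5, 3.0)), classify((6.5, 7.0)))
```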

# 10.4 - Example: Insect Data

Data were collected on two species of insects in the genus Chaetocnema, (species a) Ch. concinna and (species b) Ch. heikertlingeri. Three variables were measured on each insect:

• X1 = Width of the 1st joint of the tarsus (legs)
• X2 = Width of the 2nd joint of the tarsus
• X3 = Width of the aedeagus (sex organ)

We have ten individuals of each species to make up training data. Data on these ten individuals of each species is used to estimate the model parameters which we will use in linear score function.

Our objective is to obtain a classification rule for identifying the insect species from these three variables.

Let's begin...

Step 1: Collect the ground truth data or training data. (described above)

Step 2: Specify the prior probabilities. In this case we do not have any information regarding the relative abundances of the two species. Having no information in order to help specify prior probabilities, by default equal priors are selected

$\hat{p}_1 = \hat{p}_2 = \frac{1}{2}$

Step 3: Test for homogeneity of the variance-covariance matrices using Bartlett's test.

Here we will use the SAS program insect.sas.


No significant difference between the variance-covariance matrices for the two species (L' = 9.83; d.f. = 6; p = 0.132) is found. Thus linear discriminant analysis is appropriate for the data.

Step 4: Estimate the parameters of the conditional probability density functions, i.e., the population mean vectors and the population variance-covariance matrices involved. It turns out that all of this is done automatically in the discriminant analysis procedure.

Step 5: The linear discriminant functions for the two species can be obtained directly from the SAS or Minitab output.

Now, consider an insect with the following measurements. Which species does this belong to?

| Variable | Measurement |
|---|---|
| Joint 1 | 194 |
| Joint 2 | 124 |
| Aedeagus | 49 |

These are responses for the first three variables. The linear discriminant function for species a is obtained by plugging in the values for these three measurements into the equation for species (a):

$$\hat{d}^{L}_a(\textbf{x}) = -247.276 - 1.417 \times 194 + 1.520 \times 124 + 10.954 \times 49 = 203.052$$

and then for species (b):

$$\hat{d}^{L}_b(\textbf{x}) = -193.178 - 0.738 \times 194 + 1.113 \times 124 + 8.250 \times 49 = 205.912$$

Then the linear score function is obtained by adding in a log of one half, here for species (a):

$$\hat{s}^L_a(\mathbf{x}) = \hat{d}^L_a(\mathbf{x}) + \log{\hat{p}_a} = 203.052 + \log{0.5} = 202.359$$

and then for species (b):

$$\hat{s}^L_b(\mathbf{x}) = \hat{d}^L_b(\mathbf{x}) + \log{\hat{p}_b} = 205.912 + \log{0.5} = 205.219$$
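The arithmetic above can be reproduced with a short script. The coefficients are copied from the two estimated discriminant functions shown in this section; the dictionary layout and function name are illustrative choices, not SAS or Minitab output.

```python
import math

# (constant, joint 1, joint 2, aedeagus) for each species, as given above.
coef = {
    "a": (-247.276, -1.417, 1.520, 10.954),
    "b": (-193.178, -0.738, 1.113, 8.250),
}
x = (194, 124, 49)             # measurements of the unidentified insect
priors = {"a": 0.5, "b": 0.5}  # equal priors

def discriminant(c, x):
    """Linear discriminant function: constant plus coefficient-weighted sum."""
    return c[0] + sum(cj * xj for cj, xj in zip(c[1:], x))

d = {k: discriminant(c, x) for k, c in coef.items()}
s = {k: d[k] + math.log(priors[k]) for k in d}  # natural log of the prior

for k in ("a", "b"):
    print(k, round(d[k], 3), round(s[k], 3))
# a 203.052 202.359
# b 205.912 205.219
```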

#### Conclusion

According to the classification rule, the insect is classified into the species that has the highest linear score function. Since $$\hat{s}^L_b(\mathbf{x}) > \hat{s}^L_a(\mathbf{x})$$, we conclude that the insect belongs to species (b) Ch. heikertlingeri.

Of course here addition of log of one half does not make any difference. Whether we classify on the basis of $$\hat{d}^L_b(\mathbf{x})$$ or on the basis of score function, the decision will remain the same. In case the priors are not equal, this would not hold.

You can think of these priors as a 'penalty' in some sense. If you have a higher prior probability of a given species you will give it very little 'penalty' because you will be taking the log of a number close to one, which is not going to subtract much. But if there is a low prior probability you will be taking the log of a very small number, which results in a large reduction.

Note: SAS by default will assume equal priors. Later on we will look at an example where we will not assume equal priors - the Swiss Bank Notes example.

#### Posterior Probabilities

You can also calculate the posterior probabilities. These are used to measure uncertainty regarding the classification of a unit from an unknown group. They will give us some indication of our confidence in our classification of individual subjects.

In this case, the estimated posterior probability that the insect belongs to species (a) Ch. concinna given the observed measurements can be obtained by using this formula:

$\begin{array}{ccl} p(\pi_a|\mathbf{x}) & = & \frac{\exp\{\hat{s}^L_a(\mathbf{x})\}}{\exp\{\hat{s}^L_a(\mathbf{x})\}+\exp\{\hat{s}^L_b(\mathbf{x})\}} \\ & = & \frac{\exp\{202.359\}}{\exp\{202.359\}+\exp\{205.219\}} \\ & = & 0.05\end{array}$

This is a function of our linear score functions for our two species. Here we are looking at the exponential function of the linear score function for species (a) divided by the sum of the exponential functions of the score functions for species (a) and species (b). Using the numbers that we obtained earlier we can carry out the math and get 0.05.

Similarly for species (b), the estimated posterior probability that the insect belongs to Ch. heikertlingeri is:

$\begin{array}{ccl} p(\pi_b|\mathbf{x}) & = & \frac{\exp\{\hat{s}^L_b(\mathbf{x})\}}{\exp\{\hat{s}^L_a(\mathbf{x})\}+\exp\{\hat{s}^L_b(\mathbf{x})\}} \\ & = & \frac{\exp\{205.219\}}{\exp\{202.359\}+\exp\{205.219\}} \\ & = & 0.95\end{array}$

In this case we are 95% confident that the insect belongs to species (b). This is a pretty high level of confidence but there is a 5% chance that we might be in error in this classification. One of the things that you would have to decide is what is an acceptable error rate here. For classification of insects this might be perfectly acceptable, however, in some situations it might not be. For example, looking at the cancer case that we talked about earlier where we were trying to classify someone as having cancer or not having cancer, it may not be acceptable to have 5% error rate. This is an ethical decision that has to be made. It is a decision that has nothing to do with statistics but must be tailored to the situation at hand.
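A small sketch of the posterior calculation from the two score functions follows. Subtracting the largest score before exponentiating is a common numerical precaution and does not change the result; the function name is illustrative.

```python
import math

def posterior_from_scores(scores):
    """exp(s_i) / sum_j exp(s_j), computed relative to the largest score."""
    top = max(scores.values())
    weights = {k: math.exp(s - top) for k, s in scores.items()}
    total = sum(weights.values())
    return {k: w / total for k, w in weights.items()}

# Estimated linear score functions for the unidentified insect (equal priors).
scores = {"a": 202.359, "b": 205.219}
post = posterior_from_scores(scores)
print({k: round(p, 2) for k, p in post.items()})  # {'a': 0.05, 'b': 0.95}
```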

# 10.5 - Estimating Misclassification Probabilities

When an unknown specimen is classified according to any decision rule, there is always a possibility that the item is wrongly classified. This must not be taken as error! This is part of the inherent uncertainty in any statistical procedure. One way to measure how good the discriminant rule is, is to classify the training data according to the developed discrimination rule. Since we know which unit comes from which population among the training data, this will give us some idea of the validity of the discrimination procedure.

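Method 1. The confusion table describes how the discriminant function will classify each observation in the data set. In general, the confusion table takes the form:

| Truth | Classified as 1 | Classified as 2 | ... | Classified as g | Total |
|---|---|---|---|---|---|
| 1 | $n_{11}$ | $n_{12}$ | ... | $n_{1g}$ | $n_{1.}$ |
| 2 | $n_{21}$ | $n_{22}$ | ... | $n_{2g}$ | $n_{2.}$ |
| ... | ... | ... | ... | ... | ... |
| g | $n_{g1}$ | $n_{g2}$ | ... | $n_{gg}$ | $n_{g.}$ |
| Total | $n_{.1}$ | $n_{.2}$ | ... | $n_{.g}$ | $n_{..}$ |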

Rows 1 through g are g populations to which the items truly belong. Across the columns we are looking at how they are classified. n11 is the number of insects correctly classified in species (1). But n12 is the number of insects incorrectly classified into species (2). In this case nij = the number belonging to population i classified into population j. Ideally this matrix will be a diagonal matrix; in practice we hope to get off-diagonal elements to be very small numbers.

The row totals give the number of individuals belonging to each of our populations or species in our training dataset. The column totals give the number classified into each of these species. The total number of observations in the dataset is $n_{..}$. The dot notation is used here in the row totals for summing over the second subscript, whereas in the column totals we are summing over the first subscript.

We will let:

$$p(i|j)$$

denote the probability that a unit from population πj is classified into population πi. These misclassification probabilities can be estimated by taking the number of insects from population j that are misclassified into population i divided by the total number of insects in the sample from population j as shown here:

$\hat{p}(i|j) = \frac{n_{ji}}{n_{j.}}$

This will give the misclassification probabilities.

Example - Insect Data:

From the SAS output, we obtain the following confusion table.

| Truth | Classified as a | Classified as b | Total |
|---|---|---|---|
| a | 10 | 0 | 10 |
| b | 0 | 10 | 10 |
| Total | 10 | 10 | 20 |

Here, no insect was misclassified. So, the misclassification probabilities are all estimated to be equal to zero.

Method 2: Set Aside Method

Step 1: Randomly partition the observations into two "halves".

Step 2: Use one "half" to obtain the discriminant function.

Step 3: Use the discriminant function from Step 2 to classify all members of the second "half" of the data, from which the proportion of misclassified observations can be computed.

Advantage: This method yields unbiased estimates of the misclassification probabilities.

Problem: Does not make optimum use of the data, and so, estimated misclassification probabilities are not as precise as possible.

Method 3: Cross validation

Step 1: Delete one observation from the data.

Step 2: Use the remaining observations to compute a discriminant function.

Step 3: Use the discriminant function from Step 2 to classify the observation removed in Step 1. Steps 1-3 are repeated for all observations; compute the proportions of observations that are misclassified.
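The three cross validation steps can be sketched as a loop. To keep the example short it uses a single variable, equal priors, and a pooled variance, so it is a one-variable stand-in for linear discriminant analysis rather than the lesson's SAS procedure. The data and function names are hypothetical.

```python
def fit_lda_1d(xs, labels):
    """Estimate the group means and the pooled variance for one variable."""
    groups = {}
    for x, g in zip(xs, labels):
        groups.setdefault(g, []).append(x)
    means = {g: sum(v) / len(v) for g, v in groups.items()}
    ss = sum((x - means[g]) ** 2 for x, g in zip(xs, labels))
    pooled_var = ss / (len(xs) - len(groups))
    return means, pooled_var

def classify(x, model):
    """Assign x to the group with the largest linear score (equal priors,
    so the log-prior term is the same for every group and is omitted)."""
    means, var = model
    scores = {g: -0.5 * m * m / var + m * x / var for g, m in means.items()}
    return max(scores, key=scores.get)

def loo_confusion(xs, labels):
    """Leave-one-out confusion counts, indexed as table[truth][assigned]."""
    kinds = sorted(set(labels))
    table = {t: {c: 0 for c in kinds} for t in kinds}
    for i in range(len(xs)):
        # Step 1: delete observation i. Step 2: refit on the rest.
        model = fit_lda_1d(xs[:i] + xs[i + 1:], labels[:i] + labels[i + 1:])
        # Step 3: classify the held-out observation.
        table[labels[i]][classify(xs[i], model)] += 1
    return table

# Hypothetical training data: one measurement per unit, five units per group.
xs = [4.1, 4.8, 5.0, 5.6, 6.4, 6.0, 6.9, 7.0, 7.4, 8.1]
labels = ["a"] * 5 + ["b"] * 5

table = loo_confusion(xs, labels)
print(table)
```

Each observation is classified by a rule that never saw it, which is why the resulting error rates are less optimistic than those from reclassifying the training data.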

Example: Insect Data

The confusion table for the cross validation is

| Truth | Classified as a | Classified as b | Total |
|---|---|---|---|
| a | 10 | 0 | 10 |
| b | 2 | 8 | 10 |
| Total | 12 | 8 | 20 |

Here, the estimated misclassification probabilities are:

$\hat{p}(b|a) = \frac{0}{10} = 0.0$

for insects belonging to species A, and

$\hat{p}(a|b) = \frac{2}{10} = 0.2$

for insects belonging to species B.
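These estimates follow directly from the rows of the cross validation confusion table. A minimal sketch, with an illustrative function name and dictionary layout:

```python
# Rows are the true species, columns the assigned species
# (counts from the cross validation confusion table above).
confusion = {
    "a": {"a": 10, "b": 0},
    "b": {"a": 2, "b": 8},
}

def misclassification_rates(table):
    """Estimate p(i|j) = n_ji / n_j. for each true group j and assigned group i != j."""
    rates = {}
    for truth, row in table.items():
        row_total = sum(row.values())
        for assigned, count in row.items():
            if assigned != truth:
                rates[(assigned, truth)] = count / row_total
    return rates

rates = misclassification_rates(confusion)
print(rates)  # {('b', 'a'): 0.0, ('a', 'b'): 0.2}
```

The keys are ordered as (assigned, truth) to match the p(i|j) notation used in this section.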

Specifying Unequal Priors

Suppose that we have information (from prior experience or from another study) that suggests that 90% of the insects belong to Ch. concinna. Then the score functions for the unidentified specimen are

$$\hat{s}^L_a(\mathbf{x}) = \hat{d}^L_a(\mathbf{x}) + \log{\hat{p}_a} = 203.052 + \log{0.9} = 202.946$$

and

$$\hat{s}^L_b(\mathbf{x}) = \hat{d}^L_b(\mathbf{x}) + \log{\hat{p}_b} = 205.912 + \log{0.1} = 203.609$$

In this case, we would still classify this specimen into Ch. heikertlingeri with posterior probabilities

$$p(\pi_a|\mathbf{x}) = 0.34$$ and $$p(\pi_b|\mathbf{x}) = 0.66$$

These priors can be specified in SAS by adding the "priors" statement: priors "a" = 0.9 "b" = 0.1; following the var statement. However, it should be noted that when the "priors" statement is added, SAS will include log pi as part of the constant term. In other words, in this case, SAS outputs the estimated linear score function, not the estimated linear discriminant function.
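The unequal-prior calculation can be checked with a few lines of Python. This sketch starts from the rounded discriminant function values 203.052 and 205.912 obtained in Section 10.4, so its results agree with software output only up to rounding; the variable names are illustrative.

```python
import math

# Estimated linear discriminant functions for the unidentified insect (rounded).
d = {"a": 203.052, "b": 205.912}
priors = {"a": 0.9, "b": 0.1}

# Score function = discriminant function + natural log of the prior.
scores = {k: d[k] + math.log(priors[k]) for k in d}

# Posterior probabilities, computed relative to the largest score for stability.
top = max(scores.values())
weights = {k: math.exp(v - top) for k, v in scores.items()}
total = sum(weights.values())
post = {k: w / total for k, w in weights.items()}

assigned = max(scores, key=scores.get)
print(scores, post, assigned)
```

Even with a prior of 0.9 for species (a), the larger discriminant function for species (b) keeps its score on top, although the posterior probability is far less decisive than under equal priors.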

# 10.6 - Quadratic Discriminant Analysis

Linear Discriminant Analysis is for homogeneous variance-covariance matrices. However, not all data come from such a simple situation. Quadratic Discriminant Analysis is used for heterogeneous variance-covariance matrices:

$$\Sigma_i \ne \Sigma_j$$ for some $$i \ne j$$

Again, this allows the variance-covariance matrices to depend on which population we are looking at.

Quadratic discriminant analysis calculates a Quadratic Score Function which looks like this:

$s^Q_i (\mathbf{x}) = -\frac{1}{2}\log{|\mathbf{\Sigma_i}|}-\frac{1}{2}{\mathbf{(x-\mu_i)'\Sigma^{-1}_i(x - \mu_i)}}+\log{p_i}$

This is a function of the population mean vector and the variance-covariance matrix for the ith group. We will determine a separate quadratic score function for each of the groups.

This is of course a function of the unknown population mean vector and variance-covariance matrix for group i. These will have to be estimated from ground truth data. As before, we replace the unknown values of μi, Σi, and pi by their estimates to obtain the estimated quadratic score function as shown below:

$\hat{s}^Q_i (\mathbf{x}) = -\frac{1}{2}\log{|\mathbf{S_i}|}-\frac{1}{2}{\mathbf{(x-\bar{x}_i)'S^{-1}_i(x -\bar{x}_i)}}+\log{\hat{p}_i}$

All natural logs are used in this function.

Decision Rule: Our decision rule remains the same as well. We will classify the sample unit or subject into the population that has the largest quadratic score function.
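A minimal sketch of the estimated quadratic score function for the two-variable case is given here. The sample means, covariance matrices, and priors are hypothetical values chosen for illustration, and the 2 x 2 determinant and inverse are written out by hand to avoid external libraries.

```python
import math

def quadratic_score(x, xbar, S, prior):
    """-0.5*log|S| - 0.5*(x - xbar)' S^-1 (x - xbar) + log(prior), for 2 variables."""
    det = S[0][0] * S[1][1] - S[0][1] * S[1][0]
    inv = [[S[1][1] / det, -S[0][1] / det],
           [-S[1][0] / det, S[0][0] / det]]
    dx = [x[0] - xbar[0], x[1] - xbar[1]]
    # Quadratic form (x - xbar)' S^-1 (x - xbar).
    quad = sum(dx[a] * inv[a][b] * dx[b] for a in range(2) for b in range(2))
    return -0.5 * math.log(det) - 0.5 * quad + math.log(prior)

# Hypothetical estimates: each population has its own covariance matrix.
populations = {
    "1": {"xbar": (0.0, 0.0), "S": [[1.0, 0.0], [0.0, 1.0]], "prior": 0.5},
    "2": {"xbar": (3.0, 3.0), "S": [[4.0, 1.0], [1.0, 2.0]], "prior": 0.5},
}

x = (1.0, 1.0)
scores = {k: quadratic_score(x, v["xbar"], v["S"], v["prior"])
          for k, v in populations.items()}
assigned = max(scores, key=scores.get)
print(scores, assigned)
```

Unlike the linear case, the log-determinant and the term quadratic in x differ between populations, so neither can be dropped from the comparison.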

Let's illustrate this using the Swiss Bank Notes example...

# 10.7 - Example: Swiss Bank Notes

Recall that we have two populations of notes, genuine, and counterfeit and that six measurements were taken on each note:

• Length
• Right-Hand Width
• Left-Hand Width
• Top Margin
• Bottom Margin
• Diagonal

#### Priors

In this case it would not be reasonable to consider equal priors for the two types of banknotes. Equal priors would assume that half the banknotes in circulation are counterfeit and half are genuine. This is a very high counterfeit rate and if it were that bad the Swiss government would probably be bankrupt! So we need to consider unequal priors in which the vast majority of banknotes are thought to be genuine. For this example let us assume that no more than 1% of bank notes in circulation are counterfeit and 99% of the notes are genuine. The prior probabilities can then be expressed as:

$$\hat{p}_1 = 0.99$$ and $$\hat{p}_2 = 0.01$$

The first step in the analysis is to carry out Bartlett's test to check for homogeneity of the variance-covariance matrices.

To do this we will use the SAS program swiss9.sas.

#### SAS Notes

By default, SAS will make this decision for you. Let's look at the proc discrim procedure in the SAS program swiss9.sas that we just used.

By including the pool=test option, SAS will decide what kind of discriminant analysis is going to be carried out based on the results of this test.

If you fail to reject, SAS will automatically do a linear discriminant analysis. If you reject, then SAS will do a quadratic discriminant analysis.

There are two other options here. If we put pool=yes then SAS will not carry out Bartlett's test but will go ahead and do a linear discriminant analysis whether it is warranted or not. It will pool the variance-covariance matrices and do a linear discriminant analysis.

If pool=no then SAS will not pool the variance-covariance matrices and SAS will then perform the quadratic discriminant analysis.

SAS does not actually print out the quadratic discriminant function, but it will use quadratic discriminant analysis to classify sample units into populations.


Bartlett's Test finds a significant difference between the variance-covariance matrices of the genuine and counterfeit bank notes (L' = 121.90; d.f. = 21; p < 0.0001). The variance-covariance matrix for the genuine notes is not equal to the variance-covariance matrix for the counterfeit notes. Since we reject the null hypothesis of equal variance-covariance matrices, a linear discriminant analysis is not appropriate for these data. Hence a quadratic discriminant analysis is necessary.

Let us consider a bank note with the following measurements that were entered into the program:

| Variable | Measurement |
|---|---|
| Length | 214.9 |
| Left Width | 130.1 |
| Right Width | 129.9 |
| Bottom Margin | 9.0 |
| Top Margin | 10.6 |
| Diagonal | 140.5 |

Any number of lines of measurements may be considered. Here we are just interested in one set of measurements. It is reported that this bank note should be classified as real or genuine. The posterior probability that it is fake or counterfeit is only 0.000002526. So, the posterior probability that it is genuine is very close to one (actually, this posterior probability is 1 - 0.000002526 = 0.999997474). We are nearly 100% confident that this is a real note and not counterfeit.

Next consider the results of crossvalidation. Note that crossvalidation yields estimates of the probability that a randomly selected note will be correctly classified. The resulting confusion table is as follows:

| Truth | Classified as Counterfeit | Classified as Genuine | Total |
|---|---|---|---|
| Counterfeit | 98 | 2 | 100 |
| Genuine | 1 | 99 | 100 |
| Total | 99 | 101 | 200 |

Here, we can see that 98 out of 100 counterfeit notes are expected to be correctly classified, while 99 out of 100 genuine notes are expected to be correctly classified. Thus, the estimated misclassification probabilities are:

$$\hat{p}(\text{real | fake}) = 0.02$$ and $$\hat{p}(\text{fake | real}) = 0.01$$

The question remains: Are these acceptable misclassification rates?

A decision should be made in advance as to what would be the acceptable levels of error. Here again, you need to think about the consequences of making a mistake. In terms of classifying a genuine note as a counterfeit, one might put somebody in jail who is innocent. If you make the opposite error you might let a criminal get away. What are the costs of these types of errors? And, are the above error rates acceptable? This decision should be made in advance. You should have some prior notion of what you would consider reasonable.

# 10.8 - Summary

In this lesson we learned about:

• How to determine the type of discriminant analysis to be carried out, linear or quadratic;
• How the linear discriminant function can be used to classify a subject into the appropriate population;
• Issues regarding the selection of prior probabilities that a randomly selected subject belongs to a particular population;
• The use of posterior probabilities to assess the uncertainty of the classification of a particular subject;
• The use of crossvalidation and confusion tables to assess the efficacy of discriminant analysis.

Complete the homework problems that will give you a chance to put what you have learned to use.