Key Learning Goals for this Lesson: 

Let the feature vector be X and the class label be Y.
The Bayes rule says that if you have the joint distribution of X and Y, and if X is given, then under the 0-1 loss the optimal decision on Y is to choose the class with maximum posterior probability given X.
There are two big branches of methods for classification. One is called generative modeling, the other is called discriminative modeling. LDA belongs to the first branch. In generative modeling, we try to estimate the within-class density of X given the class label. Combined with the prior (unconditional) probabilities of the classes, the posterior probability of Y can be obtained by the Bayes formula. In discriminative modeling, we try to estimate the posterior probability of Y given X directly, without assuming a model for the marginal distribution of X.
Assume the prior probability, or the marginal pmf, for class k is denoted π_{k}. It is usually estimated simply by the empirical frequency in the training set:

π̂_{k} = N_{k} / N

That is, you take the training data set and count what percentage of the data come from each class (N_{k} samples in class k out of N total).
Then we need the class-conditional density of X. Remember, this is the density of X conditioned on the class k, or class G = k, denoted by f_{k}(x).
According to the Bayes rule, what we need is to compute the posterior probability:

Pr(G = k | X = x) = π_{k} f_{k}(x) / Σ_{l=1}^{K} π_{l} f_{l}(x)

This is the conditional probability of class G given X.
By MAP (maximum a posteriori, i.e., the Bayes rule for 0-1 loss):

Ĝ(x) = arg max_{k} Pr(G = k | X = x)
Notice that the denominator is identical no matter which class k you are considering. Therefore, it makes no difference to the maximization over k, and the MAP rule is essentially maximizing π_{k} times f_{k}(x).
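The MAP recipe above can be sketched in a few lines of code. This is an illustrative toy, not a particular library's API: `priors` plays the role of the π_{k}'s and `densities` the f_{k}'s, here one-dimensional Gaussians built from the standard density formula.

```python
import math

def gauss_pdf(mean, var):
    # Returns a 1-D Gaussian density function with the given mean and variance.
    def f(x):
        return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)
    return f

def map_classify(x, priors, densities):
    # Pick the class k maximizing pi_k * f_k(x); the shared denominator
    # of the Bayes formula is dropped because it does not depend on k.
    scores = [p * f(x) for p, f in zip(priors, densities)]
    return max(range(len(scores)), key=scores.__getitem__)

# Toy example: two classes with equal priors, means 0 and 3.
priors = [0.5, 0.5]
densities = [gauss_pdf(0.0, 1.0), gauss_pdf(3.0, 1.0)]
```

A point near 0 is assigned to class 0 and a point near 3 to class 1, exactly because the corresponding π_{k} f_{k}(x) is larger there.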
Depending on which algorithms you use, you end up with different ways of density estimation within every class.
In Linear Discriminant Analysis (LDA) we assume that every density within each class is a Gaussian distribution.
In LDA we assume those Gaussian distributions for different classes share the same covariance structure. In Quadratic Discriminant Analysis (QDA) we do not put such a constraint. You will see the difference later.
A single Gaussian for the within class densities may not be sufficient. Therefore, we model the within class density of X by a mixture of Gaussians, which we will talk about later in the semester.
You can also use general nonparametric density estimates, for instance kernel estimates.
There is a well-known algorithm called the Naive Bayes algorithm. Here the basic assumption is that, if your input vector X is multidimensional, each within-class density has independent dimensions; that is, all the variables are independent given the class label. Therefore, to estimate the class density, you can separately estimate the density for every dimension and then multiply them to get the joint density. This makes the computation much simpler.
X may be discrete rather than continuous. Instead of densities, we then work with probability mass functions. In this case, we would compute a probability mass function for every dimension and then multiply them to get the joint pmf.
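Here is a minimal sketch of naive Bayes for discrete features, following the recipe just described: per class, estimate one empirical pmf per dimension and multiply them. The function names (`fit_naive_bayes`, `predict`) are ours for illustration, not from a library.

```python
from collections import Counter

def fit_naive_bayes(X, y):
    # X: list of feature tuples, y: list of class labels.
    classes = sorted(set(y))
    n = len(y)
    priors, pmfs = {}, {}
    for k in classes:
        rows = [x for x, label in zip(X, y) if label == k]
        priors[k] = len(rows) / n               # empirical prior pi_k
        # one empirical pmf per dimension, assuming independence given the class
        pmfs[k] = [Counter(col) for col in zip(*rows)]
        for c in pmfs[k]:
            total = sum(c.values())
            for v in c:
                c[v] /= total
    return priors, pmfs

def predict(x, priors, pmfs):
    def score(k):
        s = priors[k]
        for j, v in enumerate(x):
            s *= pmfs[k][j].get(v, 0.0)         # joint pmf = product over dimensions
        return s
    return max(priors, key=score)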
You can see that we have swept through several prominent methods for classification. You should also see that they all fall into the Generative Modeling idea. The only essential difference is in how you actually estimate the density for every class.
Under LDA we assume that the density of X given class k follows a Gaussian distribution. Here is the density formula for a multivariate Gaussian distribution:

f_{k}(x) = (2π)^{−p/2} |Σ_{k}|^{−1/2} exp( −(1/2) (x − μ_{k})^{T} Σ_{k}^{−1} (x − μ_{k}) )

Here p is the dimension and Σ_{k} is the covariance matrix. The formula involves the square root of the determinant of this matrix, and the exponent involves matrix multiplication: the vector x and the mean vector μ_{k} are both column vectors.
For linear discriminant analysis (LDA): Σ_{k} = Σ, ∀k.
In LDA, as we mentioned, you simply assume that the covariance matrix is identical for different classes k. By making this assumption, the classifier becomes linear. The only difference in quadratic discriminant analysis (QDA) is that we do not assume the covariance matrix is identical across classes; for QDA, the decision boundary is determined by a quadratic function.
Since the covariance matrix determines the shape of the Gaussian density, in LDA, the Gaussian densities for different classes have the same shape, but are shifted versions of each other (different mean vectors). Example densities for the LDA model are shown below.
For the moment, we will assume that we already have the covariance matrix for every class (we will talk about how to estimate it in a moment). Let's look at what the optimal classification would be based on the Bayes rule.
Bayes rule says that we should pick a class that has the maximum posterior probability given the feature vector X. If we are using the generative modeling approach this is equivalent to maximizing the product of the prior and the within class density.
Since the log function is increasing, maximizing π_{k} f_{k}(x) is equivalent to maximizing its logarithm: whatever k gives you the maximum also gives you the maximum under the log. Next, we plug in the Gaussian density with the common covariance Σ, multiplied by the prior probability:

log( π_{k} f_{k}(x) ) = log π_{k} − (p/2) log(2π) − (1/2) log |Σ| − (1/2)(x − μ_{k})^{T} Σ^{−1} (x − μ_{k})

Note: the terms −(p/2) log(2π), −(1/2) log |Σ|, and the quadratic term −(1/2) x^{T} Σ^{−1} x are the same for every class k, so they can be dropped from the maximization.

To sum up, after simplification we obtain this formula:

δ_{k}(x) = x^{T} Σ^{−1} μ_{k} − (1/2) μ_{k}^{T} Σ^{−1} μ_{k} + log π_{k}

This is the final classifier. Given any x, you simply plug it into this formula and see which k maximizes it. Usually the number of classes is pretty small, very often only two. Hence, an exhaustive search over the classes is effective.

LDA gives you a linear boundary because the quadratic term is dropped.
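The linear discriminant function can be transcribed almost literally, assuming the parameters (priors, class means, common covariance) are already known; estimation comes later. The function names here are illustrative.

```python
import numpy as np

def lda_discriminants(x, priors, means, cov):
    # delta_k(x) = x' Sigma^{-1} mu_k - (1/2) mu_k' Sigma^{-1} mu_k + log(pi_k)
    x = np.asarray(x, dtype=float)
    cov_inv = np.linalg.inv(cov)
    return [x @ cov_inv @ m - 0.5 * m @ cov_inv @ m + np.log(p)
            for p, m in zip(priors, means)]

def lda_classify(x, priors, means, cov):
    # Exhaustive search over the (few) classes.
    return int(np.argmax(lda_discriminants(x, priors, means, cov)))
```

With equal priors and identity covariance, this simply assigns x to the class with the nearest mean, which is a useful sanity check.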
To sum up, the LDA classification rule is

Ĝ(x) = arg max_{k} δ_{k}(x)

Then the decision boundary between classes k and l is the set

{x : δ_{k} (x) = δ_{l}(x)}

Or equivalently, the following holds on the boundary:

log(π_{k}/π_{l}) − (1/2)(μ_{k} + μ_{l})^{T} Σ^{−1} (μ_{k} − μ_{l}) + x^{T} Σ^{−1} (μ_{k} − μ_{l}) = 0
In binary classification in particular, for instance if we let (k = 1, l = 2), we would define the constant a_{0} below, where π_{1} and π_{2} are the prior probabilities for the two classes and μ_{1} and μ_{2} are the mean vectors:

a_{0} = log(π_{1}/π_{2}) − (1/2)(μ_{1} + μ_{2})^{T} Σ^{−1} (μ_{1} − μ_{2}),  a = Σ^{−1} (μ_{1} − μ_{2})

and classify to class 1 when a_{0} + a^{T} x ≥ 0, to class 2 otherwise.
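The binary rule reduces to checking the sign of one linear function, a_{0} + a^{T}x with a_{0} = log(π_{1}/π_{2}) − ½(μ_{1}+μ_{2})^{T}Σ^{−1}(μ_{1}−μ_{2}) and a = Σ^{−1}(μ_{1}−μ_{2}). A minimal sketch, with parameters assumed known:

```python
import numpy as np

def binary_lda_rule(x, pi1, pi2, mu1, mu2, cov):
    # Classify to class 1 iff a0 + a'x >= 0.
    cov_inv = np.linalg.inv(cov)
    a0 = np.log(pi1 / pi2) - 0.5 * (mu1 + mu2) @ cov_inv @ (mu1 - mu2)
    a = cov_inv @ (mu1 - mu2)
    return 1 if a0 + a @ x >= 0 else 2
```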
Here is a contour plot of this result:
We have two classes and we know the within-class densities. The marginal density is simply the weighted sum of the within-class densities, where the weights are the prior probabilities. Because we have equal weights and because the covariance matrices of the two classes are identical, we get these symmetric contour lines. The black diagonal line is the decision boundary for the two classes: given an x above the line, we classify it into the first class; below the line, into the second class.
There is a missing piece here, right?
For all of the discussion above we assume that we have the prior probabilities for the classes and we also had the within class densities given to us. Of course, in practice you don't have this. In practice, what we have is only a set of training data.
The question is how do we find the π_{k}'s and the f_{k}(x)?
We need to estimate the Gaussian distributions. Here are the formulas for estimating the π_{k}'s and the parameters in the Gaussian distributions; they are in fact the maximum likelihood estimates:

π̂_{k} = N_{k} / N

where N_{k} is the number of class-k samples and N is the total number of points in the training data. As we mentioned, to get the prior probability for class k, you simply count the frequency of data points in class k.
Then, the mean vector for every class is also simple. You take all of the data points in a given class and compute the average, the sample mean:

μ̂_{k} = (1/N_{k}) Σ_{g_{i} = k} x^{(i)}

where g_{i} denotes the class label of the ith sample.
Next, the covariance matrix formula looks slightly complicated. The reason is that we have to get a common covariance matrix for all of the classes. First you divide the data points into the given classes according to their labels. Within class k, for every point you subtract the corresponding mean computed earlier, and then multiply by its transpose. Remember x is a column vector, so a column vector multiplied by a row vector gives a square matrix, which is what we need:

Σ̂ = (1/(N − K)) Σ_{k=1}^{K} Σ_{g_{i} = k} (x^{(i)} − μ̂_{k})(x^{(i)} − μ̂_{k})^{T}

First we do the summation within every class k, then we sum over all of the classes, and finally we normalize by N − K. The maximum likelihood estimator divides by N, but dividing by N − K gives an unbiased estimator. Remember, K is the number of classes, so when N is large the difference between N and N − K is pretty small.
Note that x^{(i)} denotes the ith sample vector.
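The three estimation formulas fit naturally into one function. This is a sketch under the conventions above (empirical priors, per-class sample means, pooled covariance divided by N − K); the name `fit_lda` is ours.

```python
import numpy as np

def fit_lda(X, y):
    # X: (N x p) array-like of samples, y: list of N class labels.
    X = np.asarray(X, dtype=float)
    classes = sorted(set(y))
    N, K = len(y), len(classes)
    priors, means = [], []
    pooled = np.zeros((X.shape[1], X.shape[1]))
    for k in classes:
        Xk = X[[i for i, label in enumerate(y) if label == k]]
        priors.append(len(Xk) / N)      # pi_hat_k = N_k / N
        mu = Xk.mean(axis=0)            # sample mean of class k
        means.append(mu)
        D = Xk - mu
        pooled += D.T @ D               # sum of (x - mu_k)(x - mu_k)' outer products
    cov = pooled / (N - K)              # divide by N - K for an unbiased estimate
    return priors, means, cov
```

The returned triple plugs directly into the linear discriminant function to form the classification rule.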
In summary, if you want to use LDA to obtain a classification rule, the first step would involve estimating the parameters using the formulas above. Once you have these, then go back and find the linear discriminant function and choose a class according to the discriminant functions.
Let's take a look at a specific data set. This is a diabetes data set from the UC Irvine Machine Learning Repository. It is a fairly small data set by today's standards. The original data had eight dimensions. What I did here was to obtain the two prominent principal components from these eight variables; instead of using the original eight dimensions, we will just use these two principal components for this example.
The diabetes data set has two types of samples in it: one type are healthy individuals, the other are individuals with a higher risk of diabetes. Here are the prior probabilities estimated for the two sample types, first for the healthy individuals and second for those at risk:

The first type has a prior probability estimated at 0.651. This means that among the data set (250 to 300 data points), about 65% belong to class one and the other 35% to class two. Next, we computed the mean vector for the two classes separately:
Then we computed the pooled covariance matrix using the formulas discussed earlier.
Once we have done all of this, we compute the linear discriminant functions and obtain the classification rule.
Classification rule:
In this example, if you give me an x, I then plug this value into the above linear function. If the result is greater than or equal to zero I claim that it is in class one, otherwise it is in class two.
Below is a scatter plot of the dominant principal components. The two classes are represented: the first, without diabetes, are the red stars (class 1), and the second class, with diabetes, are the blue circles (class 2). The solid line represents the classification boundary obtained by LDA. It seems the two classes are not that well separated. The dashed or dotted line is the boundary obtained by linear regression on the indicator matrix. In this case the two linear boundaries are very close.
It is always a good practice to plot things so that if something went terribly wrong it would show up in the plots.
Here is the contour plot for the density (the marginal density for x is a mixture of two Gaussians, 2 classes) of the diabetes data. You can see below it looks like a single Gaussian distribution. The reason is because the two classes are so close that they merge into a single mode.
LDA makes some strong assumptions. It assumes that the covariance matrix is identical for different classes. It also assumes that the density is Gaussian. What if these are not true? LDA may not necessarily be bad when the assumptions about the density functions are violated. Here are some examples that might illustrate this.
In certain cases, LDA may yield poor results.
In the first example (a), we generated data that follow exactly the model assumptions of LDA. This means that the two classes, red and blue, actually have the same covariance matrix and are generated by Gaussian distributions. In the plots below, the black line represents the decision boundary. The second example (b) violates all of the assumptions made by LDA. First of all, the within-class density is not a single Gaussian distribution but a mixture of two Gaussians, so the overall density is a mixture of four Gaussians. The two classes also have different covariance matrices.
Then, if we apply LDA we get this decision boundary (above, black line), which is actually very close to the ideal boundary between the two classes. By ideal boundary, I mean the boundary given by the Bayes rule using the true distribution (since we know it in this simulated example).
If you look at another example, (c) below, here we also generated two classes. The red class still contains two Gaussian distributions. The blue class spreads itself over the red class, with one mass of data in the upper right and another in the lower left. If we force LDA we get the decision boundary displayed, and in this case the result is very bad (far below the ideal classification accuracy). You can see that in the upper right the red and blue are very well mixed, whereas in the lower left the mixing is not as bad. If you perform classification using this decision boundary, you can imagine that the error rate would be very high.
In plot (d), the density of each class is estimated by a mixture of two Gaussians. The Bayes rule is applied. The result boundaries are two curves. The separation of the red and the blue is much improved.
This example illustrates when LDA gets into trouble. LDA separates the two classes with a hyperplane. This means that the two classes have to pretty much be two separated masses, each occupying half of the space. In the above example, the blue class breaks into two pieces, left and right. Then you have to use more sophisticated density estimation for the two classes if you want to get a good result. This is an example where LDA has seriously broken down, and it is why it's always a good idea to look at the scatter plot before you choose a method. The scatter plot will often show whether a certain method is appropriate. If you see a scatter plot like this example, where the blue class is broken into pieces, you can imagine that no matter how you position your linear boundary, you are not going to get a good separation between the red and the blue class. You will conclude that LDA is not appropriate and you want to take another approach.
QDA is not really that much different from LDA except that you assume that the covariance matrix can be different for each class and so, we will estimate the covariance matrix Σ_{k} separately for each class k, k =1, 2, ... , K.
Quadratic discriminant function:

δ_{k}(x) = −(1/2) log |Σ_{k}| − (1/2)(x − μ_{k})^{T} Σ_{k}^{−1} (x − μ_{k}) + log π_{k}
This quadratic discriminant function is very much like the linear discriminant function except that because Σ_{k}, the covariance matrix, is not identical, you cannot throw away the quadratic terms. This discriminant function is a quadratic function and will contain second order terms.
Classification rule:

Ĝ(x) = arg max_{k} δ_{k}(x)

The classification rule is similar as well: you simply find the class k which maximizes the quadratic discriminant function.
The decision boundaries are quadratic equations in x.
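The quadratic discriminant function is just as direct to transcribe; the only change from the linear case is a per-class covariance matrix, so the log-determinant and the quadratic term no longer cancel. Function names are illustrative.

```python
import numpy as np

def qda_classify(x, priors, means, covs):
    # delta_k(x) = -0.5*log|Sigma_k| - 0.5*(x-mu_k)' Sigma_k^{-1} (x-mu_k) + log(pi_k)
    x = np.asarray(x, dtype=float)
    scores = []
    for p, mu, S in zip(priors, means, covs):
        d = x - mu
        scores.append(-0.5 * np.log(np.linalg.det(S))
                      - 0.5 * d @ np.linalg.inv(S) @ d
                      + np.log(p))
    return int(np.argmax(scores))
```

If all the covariance matrices passed in are equal, this reduces to the LDA rule, which makes the relationship between the two methods easy to check numerically.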
QDA, because it allows more flexibility for the covariance matrix, tends to fit the data better than LDA, but it also has more parameters to estimate. The number of parameters increases significantly, because QDA requires a separate covariance matrix for every class. If you have many classes and not so many sample points, this can be a problem.
As we talked about at the beginning of this course, there are tradeoffs between fitting the training data well and having a simple model to work with. A simple model sometimes fits the data just as well as a complicated model. Even if the simple model doesn't fit the training data as well as a complex model, it still might be better on the test data because it is more robust.
In this example we do the same things as we have previously with LDA on the prior probabilities and the mean vectors, except now we estimate the covariance matrices separately for each class.
How do we estimate the covariance matrices separately?
Remember, in LDA once we had the summation over the data points in every class we had to pull all the classes together. In QDA we don't do this.
Prior probabilities:
The dashed line in the plot below is the decision boundary given by LDA. The curved line is the decision boundary resulting from the QDA method.
For most of the data, it doesn't make any difference, because most of the data is massed on the left. The percentage of the data in the area where the two decision boundaries differ a lot is small. Therefore, you can imagine that the difference in the error rate is very small.
We obtain the classifier and apply it to the training data set and see what percentage of data are classified incorrectly because we know the true labels in the training data.
By sensitivity we mean: if the person has diabetes, what percentage of the time do we say so? In other words, if it is in fact a positive case, the percentage of the time that we classify it as a positive case.
Specificity refers to the accuracy on negative samples: if the person does not have diabetes, the percentage of the time that we classify them as not having diabetes.
Sensitivity for QDA is the same as that obtained by LDA, but specificity is slightly lower.
In fact, if you are using LDA you can also get a quadratic decision boundary if you expand the basis. Let's say that we have two dimensions X_{1} and X_{2}. We can augment the input features by including X_{1}X_{2}, X_{1}^{2}, and X_{2}^{2}. This means that for every sample point you have X_{1} and X_{2} and we introduce these three new dimensions. They are all computed based on the given X_{1} and X_{2}.
Now, we can use these inputs as a five-dimensional vector: X = (X_{1}, X_{2}, X_{1}X_{2}, X_{1}^{2}, X_{2}^{2}). The true dimension is two, but by introducing these nonlinear basis functions, we pretend that the data is five-dimensional.
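The basis expansion itself is a one-liner, sketched below. Any linear boundary fitted in the expanded space is quadratic when read back in the original (X_{1}, X_{2}) coordinates.

```python
def expand_basis(x1, x2):
    # Map (X1, X2) to the five features (X1, X2, X1*X2, X1^2, X2^2).
    return [x1, x2, x1 * x2, x1 ** 2, x2 ** 2]

# Apply to a whole data set of (x1, x2) pairs:
expanded = [expand_basis(x1, x2) for x1, x2 in [(1.0, 2.0), (0.5, -1.0)]]
```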
Now, we apply LDA and we get the mean vectors as follows:
Next, we get the common covariance matrix for this five dimensional data:
The classification boundary  in terms of the five dimensions  is linear, but in terms of X_{1} and X_{2} it is quadratic because the third, fourth, and fifth dimensions are quadratic functions of the original X_{1} and X_{2} . Here is the classification boundary we obtained using LDA on this expanded basis:
Let's take a look at the decision function in the plot below. The dashed curved line is the decision boundary given by QDA. This straight dashed line is the decision boundary given by LDA based on the two original dimensions. The curved solid black line is the LDA boundary line obtained using the expanded dimensions. If the function on the right hand side of the above equation is nonnegative, classify as 1; otherwise 2.
This new decision boundary is quite different from the boundary obtained using QDA. Again, all three decision boundaries agree on the majority of the data; it is only in the upper and lower areas where they differ.
The within-training-data classification error rate is lower than those of LDA and QDA with the original inputs, so it seems that LDA using the expanded basis performs better on the training data. However, we don't have any test data to check whether it really is better.
Here is another example taken from the textbook. In this example, there are three classes.
You can see that these are slightly different from one another.
Why use the expanded basis?