Minimal Cost-Complexity Pruning

Printer-friendly versionPrinter-friendly version

As we just discussed, R(T), is not a good measure for selecting a subtree because it always favors bigger trees. We need to add a complexity penalty to this resubstitution error rate. The penalty term favors smaller trees, and hence balances with R(T).

Definition for the cost-complexity measure:

For any subtree T < Tmax , we will define its complexity as |T tilde|, the number of terminal or leaf nodes in T . Let α ≥ 0 be a real number called the complexity parameter and define the cost-complexity measure Rα(T) as:

formula

The more leaf nodes that the tree contains the higher complexity of the tree because we have more flexibility in partitioning the space into smaller pieces, and therefore more possibilities for fitting the training data. There's also the issue of how much importance to put on the size of the tree. The complexity parameter α adjusts that.

At the end, the cost complexity measure comes as a penalized version of the resubstitution error rate. This is the function to be minimized when pruning the tree.

Which subtree is selected eventually depends on α . If α  = 0 then the biggest tree will be chosen because the complexity penalty term is essentially dropped. As α approaches infinity, the tree of size 1, i.e., a single root node, will be selected.

In general, given a pre-selected α , find the subtree T(α) that minimizes Rα(T), i.e.,

formula

The minimizing subtree for any α always exists since there are only finitely many subtrees.

Since there are at most a finite number of subtrees of Tmax , Rα(T(α)) yields different values for only finitely many α’s. T(α) continues to be the minimizing tree when α increases until a jump point is reached.

Two questions:

  1. Is there a unique subtree T < Tmax which minimizes Rα(T)?
  2. In the minimizing sequence of trees T1, T2, ... is each subtree obtained by pruning upward from the previous subtree, i.e., does the nesting T1 > T2 > ... > {t1}hold?

If the optimal subtrees are nested, the computation will be a lot easier. We can first find T1, and then to find T2 , we don't need to start again from the maximum tree, but from T1, (because T2 is guaranteed to be a subtree of T1).  In this way when α increases, we prune based on a smaller and smaller subtree.

Definition: The smallest minimizing subtree T(α) for complexity parameter α is defined by the conditions:

  1. Rα(T(α)) = min TTmax Rα(T)
  2. If Rα(T) = Rα(T(α)) , then T(α) ≤ T . - if another tree achieves the minimum at the same α, then the other tree is guaranteed to be bigger than the smallest minimized tree T(α).

By definition, (according the second requirement above), if the smallest minimizing subtree T(α) exists, it must be unique. Earlier we argued that a minimizing subtree always exists because there are only a finite number of subtrees. Here we go one step more. We can prove that the smallest minimizing subtree always exists (details in the text book). This is not trivial to show because one tree smaller than another means the former is embedded in the latter. Tree ordering is a partial ordering.

The starting point for the pruning is not Tmax , but rather T1 = T(0), which is the smallest subtree of Tmax satisfying:

R(T1) = R(Tmax).

The way that you get T1 is as follows:

First, look at the biggest tree, Tmax , and for any two terminal nodes descended from the same parent, for instance tL and tR , if they yield the same re-substitution error rate as the parent node t,  prune off these two terminal nodes, that is, if R(t) = R(tL) + R(tR), prune off tL and tR.

This process is applied recursively. After we have pruned one pair of terminal nodes, the tree shrinks a little bit. Then based on the smaller tree, we do the same thing until we cannot find any pair of terminal nodes satisfying this equality. The resulting tree at this point is T1.

Let's take a look at an example using a plot to see how Tmax is obtained.

T max

Inspect

We will use Tt to denote a branch rooted at t. Then, for Tt, we define R(Tt), (the resubstitution error rate for this branch ) by:

formula

where T tildet is the set of terminal nodes of Tt .

What we can prove is that, if t is any non-terminal or internal node of T1, then it is guaranteed to have a smaller re-substitution error rate than the re-substitution error rate of the branch, that is, R(t) > R(Tt).  If we prune off the branch at t, the resubstitution error rate will strictly increase.

Weakest-Link Cutting

The weakest link cutting method not only finds the next α which results in a different optimal subtree, but find that optimal subtree.

Remember, we previously defined Rα for the entire tree. Here, we extend the definition to a node and then for a single branch coming out of a node.

For any node t T1, we can set Rα({t}) = R(t) + α .

Also, for any branch Tt , we can define Rα(Tt) = R(Tt) + α | T tildet |.

We know that when α = 0, R0(Tt) < R0({t}). The inequality holds for sufficiently small α .

If we gradually increase α, because Rα(Tt) increases faster with α (the coefficient in front of α is larger than that in Rα({t}) ) , at a certain α (α1 below), we will have Rα(Tt) = Rα({t}).

graph

If α further increases, the inequality sign will be reversed, and we have Rα(Tt) > Rα({t}). Some node t may reach the equality earlier than some other. The node that achieves the equality at the smallest α is called the weakest link. It is possible that several nodes achieve the equality at the same time, and hence there are several weakest link nodes.

Solve the inequality Rα(Tt) < Rα({t}) and get

formula

The right hand side is the ratio between the difference in resubstitution error rates and the difference in complexity, which is positive because both the numerator and the denominator are positive.

Define a function g1(t), tT1 by

formula

The weakest link formula in T1achieves the minimum of g1(t).

formula

and put formula. To get the optimal subtree corresponding to α2 , simply remove the branch growing out of formula. When α increases, formula is the first node that becomes more preferable than the branch formula descended from it. If there are several nodes that simultaneously achieve the minimum g1(t), we remove the branch grown out of each of these nodes. α2 is the first value after α1 = 0 that yields an optimal subtree strictly smaller than T1.  For all α1 ≤ α < α2, the smallest minimizing subtree is the same as T1.

Let T2 = T1 - formula.

Repeat the previous steps. Use T2 instead of T1 as the starting tree, find the weakest link in T2 and prune off at all the weakest link nodes to get the next optimal subtree.

formula

Computation

In terms of computation, we need to store a few values at each node.

  • R(t) , the resubstitution error rate for node t. This only need to be computed once.
  • R(Tt), the resubstitution error rate for the branch coming out of node t.  This may need to be updated after pruning because Tt may change after pruning.
  • |Tt| , the number of leaf nodes on the branch coming out of node t. This may need to be updated after pruning.

R(t)

In order to compute the resubstitution error rate R(t) we need the proportion of data points in each class that land in node t. Let's suppose we compute the class priors by the proportion of points in each class. As we grow the tree, we will store the number of points land in node t, as well as the number of points in each class that land in node t. Given those numbers, we can easily estimate the probability of node t and the class posterior given a data point is in node t. R(t) can then be calculated.

As far as calculating the next two numbers, a) the resubstitution error rate for the branch coming out of node t, and b) the number of leaf nodes that are on the branch coming out of node t, these two numbers change after pruning. After pruning we to need to update these values because the number of leaf nodes will have been reduced. To be specific we would need to update the values for all of the ancestor nodes of the branch.

A recursive procedure can be used to compute R(Tt) and |Tt| .

To find the number of leaf nodes in the branch coming out of node t, we can do a bottom-up sweep through the tree. The number of leaf nodes for any node is equal to the number of leaf nodes for the right child node plus the number of leaf nodes for the right child node. A bottom-up sweep ensures that the number of leaf nodes is computed for a child node before for a parent node. Similarly, R(Tt) is equal to the sum of the values for the two child nodes of t. And hence, a bottom-up sweep would do.

Once we have the three values at every node, we compute the ratio g(t) and find the weakest link. The corresponding ratio at the weakest link is the new α. It is guaranteed that the sequence of α obtained in the pruning process is strictly increasing.

If at any stage, there are multiple weakest links, for instance, if formula, then define:

formula

In this case, the two branches are either nested or share no node. The pruning procedure gives sequence of nested subtrees:

T1 > T2 > T3 > ... > { t1}.

Each embedded in the other.

Theorem for αk

The theorem states that the {αk} are an increasing sequence, that is, αk < αk+1, k ≥ 1,where α1 = 0.

For any k ≥ 1, αk ≤ α < αk+1 , the smallest optimal subtree T(α) = Tk) = Tk , i.e., is the same as the smallest optimal subtree for αk.

Basically, this means that smallest optimal subtree Tk stays optimal for all the α's starting from k until it reaches αk+1 . Although we have a sequence of finite subtrees, they are optimal for a continuum of α .

At the initial steps of pruning, the algorithm tends to cut off large sub-branches with many leaf nodes very quickly. Then pruning becomes slower and slower as the tree becoming smaller. The algorithm tends to cut off fewer nodes. Let's look at an example.

Digital Recognition Example

T1 is the smallest optimal subtree for α1 = 0. It has 71 leaf nodes. Next, by finding the weakest link, after one step of pruning, the tree is reduced to size 63 (8 leaf nodes are pruned off in one step). Next, five leaf nodes pruned off. From T3 to T4, the pruning is significant, 18 leave nodes removed. Towards the end, pruning becomes slower.

For classification purpose, we have to select a single α , or a single subtree to use.

table