As we just discussed, R(T) is not a good measure for selecting a subtree because it always favors bigger trees. We need to add a complexity penalty to the resubstitution error rate. The penalty term favors smaller trees and thus balances against R(T).
Definition for the cost-complexity measure:
For any subtree T ≤ T_{max}, we will define its complexity as |T̃|, the number of terminal or leaf nodes in T. Let α ≥ 0 be a real number called the complexity parameter and define the cost-complexity measure R_{α}(T) as:

R_{α}(T) = R(T) + α|T̃|.
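As a concrete sketch (the `Node` class and the recursive helpers are illustrative assumptions, not notation from the text), the cost-complexity measure R_{α}(T) = R(T) + α|T̃| can be computed recursively over a binary tree:

```python
# Hypothetical minimal tree node: R holds the node's resubstitution error
# contribution R(t); left/right are child nodes or None.
class Node:
    def __init__(self, R, left=None, right=None):
        self.R, self.left, self.right = R, left, right

    def is_leaf(self):
        return self.left is None and self.right is None

def num_leaves(t):
    """|T~|: the number of terminal (leaf) nodes in the tree rooted at t."""
    if t.is_leaf():
        return 1
    return num_leaves(t.left) + num_leaves(t.right)

def resub_error(t):
    """R(T): the sum of R over the terminal nodes."""
    if t.is_leaf():
        return t.R
    return resub_error(t.left) + resub_error(t.right)

def cost_complexity(t, alpha):
    """R_alpha(T) = R(T) + alpha * |T~|."""
    return resub_error(t) + alpha * num_leaves(t)
```

For example, a root with two leaves carrying errors 0.1 and 0.2 has R(T) = 0.3 and, at α = 0.1, R_{α}(T) = 0.3 + 0.1 · 2 = 0.5.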
The more leaf nodes that the tree contains the higher complexity of the tree because we have more flexibility in partitioning the space into smaller pieces, and therefore more possibilities for fitting the training data. There's also the issue of how much importance to put on the size of the tree. The complexity parameter α adjusts that.
In the end, the cost-complexity measure is a penalized version of the resubstitution error rate. This is the function to be minimized when pruning the tree.
Which subtree is selected eventually depends on α . If α = 0 then the biggest tree will be chosen because the complexity penalty term is essentially dropped. As α approaches infinity, the tree of size 1, i.e., a single root node, will be selected.
In general, given a pre-selected α, find the subtree T(α) that minimizes R_{α}(T), i.e.,

R_{α}(T(α)) = min_{T ≤ T_{max}} R_{α}(T).
The minimizing subtree for any α always exists since there are only finitely many subtrees.
Since there are at most a finite number of subtrees of T_{max} , R_{α}(T(α)) yields different values for only finitely many α’s. T(α) continues to be the minimizing tree when α increases until a jump point is reached.
Two questions arise: Is the minimizing subtree unique for every α? And as α increases, are the optimal subtrees nested, i.e., is each obtained by pruning the previous one?

If the optimal subtrees are nested, the computation will be much easier. We can first find T_{1}; then, to find T_{2}, we don't need to start again from the maximum tree, but can start from T_{1} (because T_{2} is guaranteed to be a subtree of T_{1}). In this way, as α increases, we prune based on a smaller and smaller subtree.
Definition: The smallest minimizing subtree T(α) for complexity parameter α is defined by two conditions:

1. R_{α}(T(α)) = min_{T ≤ T_{max}} R_{α}(T);
2. if R_{α}(T) = R_{α}(T(α)), then T(α) ≤ T, i.e., T(α) is a subtree of any other minimizing tree.
By definition (according to the second requirement above), if the smallest minimizing subtree T(α) exists, it must be unique. Earlier we argued that a minimizing subtree always exists because there are only a finite number of subtrees. Here we go one step further: we can prove that the smallest minimizing subtree always exists (details in the textbook). This is not trivial to show, because one tree being smaller than another means the former is embedded in the latter; tree ordering is only a partial ordering.
The starting point for the pruning is not T_{max} , but rather T_{1} = T(0), which is the smallest subtree of T_{max} satisfying:
R(T_{1}) = R(T_{max}).
The way that you get T_{1} is as follows:
First, look at the biggest tree, T_{max}. For any two terminal nodes t_{L} and t_{R} descended from the same parent node t, if they yield the same resubstitution error rate as the parent, that is, if R(t) = R(t_{L}) + R(t_{R}), prune off t_{L} and t_{R}.
This process is applied recursively. After we have pruned one pair of terminal nodes, the tree shrinks a little bit. Then based on the smaller tree, we do the same thing until we cannot find any pair of terminal nodes satisfying this equality. The resulting tree at this point is T_{1}.
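The recursive construction of T_{1} can be sketched as follows (the dict-based tree layout and integer error counts are assumptions for illustration; a post-order recursion handles the "repeat until no pair remains" automatically, since each parent is checked after its children have been pruned):

```python
# A tree is a nested dict: {'R': <resubstitution error at t, e.g. an error
# count>, 'left': ..., 'right': ...}; leaves have no 'left'/'right' keys.
def prune_to_T1(t):
    """Recursively remove any pair of sibling leaves whose combined
    resubstitution error equals that of their parent node."""
    if 'left' not in t:                 # leaf: nothing to prune
        return t
    t['left'] = prune_to_T1(t['left'])
    t['right'] = prune_to_T1(t['right'])
    lf, rt = t['left'], t['right']
    if 'left' not in lf and 'left' not in rt and lf['R'] + rt['R'] == t['R']:
        # R(t) = R(t_L) + R(t_R): the split gains nothing, so collapse it
        del t['left'], t['right']
    return t
```

For instance, a parent with error count 2 whose two leaves have counts 1 and 1 is collapsed, and this collapse can cascade upward to its own parent in the same pass.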
Let's take a look at an example using a plot to see how T_{1} is obtained.
We will use T_{t} to denote the branch rooted at node t. For T_{t}, we define R(T_{t}), the resubstitution error rate for this branch, by:

R(T_{t}) = Σ_{t′ ∈ T̃_{t}} R(t′),

where T̃_{t} is the set of terminal nodes of T_{t}.
What we can prove is that if t is any non-terminal or internal node of T_{1}, then R(t) > R(T_{t}); that is, the branch rooted at t is guaranteed to have a strictly smaller resubstitution error rate than the node t itself. If we prune off the branch at t, the resubstitution error rate will therefore strictly increase.
The weakest-link cutting method not only finds the next α that results in a different optimal subtree, but also finds that optimal subtree itself.
Remember, we previously defined R_{α} for the entire tree. Here, we extend the definition to a node and then for a single branch coming out of a node.
For any node t ∈ T_{1}, we can set R_{α}({t}) = R(t) + α .
Also, for any branch T_{t}, we can define R_{α}(T_{t}) = R(T_{t}) + α|T̃_{t}|.
We know that when α = 0, R_{0}(T_{t}) < R_{0}({t}). The inequality holds for sufficiently small α .
If we gradually increase α, then because R_{α}(T_{t}) increases faster with α (its coefficient of α, |T̃_{t}|, is larger than the coefficient 1 in R_{α}({t})), at a certain value of α we will have R_{α}(T_{t}) = R_{α}({t}).
If α further increases, the inequality sign will be reversed, and we have R_{α}(T_{t}) > R_{α}({t}). Some node t may reach the equality earlier than some other. The node that achieves the equality at the smallest α is called the weakest link. It is possible that several nodes achieve the equality at the same time, and hence there are several weakest link nodes.
Solving the inequality R_{α}(T_{t}) < R_{α}({t}) for α gives

α < (R(t) − R(T_{t})) / (|T̃_{t}| − 1).

The right-hand side is the ratio between the difference in resubstitution error rates and the difference in complexity, which is positive because both the numerator and the denominator are positive.
Define a function g_{1}(t), t ∈ T_{1}, by

g_{1}(t) = (R(t) − R(T_{t})) / (|T̃_{t}| − 1) if t is an internal node, and g_{1}(t) = +∞ if t is a terminal node.

The weakest link t̄_{1} in T_{1} achieves the minimum of g_{1}(t),
and we put α_{2} = g_{1}(t̄_{1}), where t̄_{1} denotes the weakest link. To get the optimal subtree corresponding to α_{2}, simply remove the branch growing out of t̄_{1}. When α increases, t̄_{1} is the first node at which the node itself becomes more preferable than the branch descended from it. If there are several nodes that simultaneously achieve the minimum of g_{1}(t), we remove the branch grown out of each of these nodes. α_{2} is the first value after α_{1} = 0 that yields an optimal subtree strictly smaller than T_{1}. For all α_{1} ≤ α < α_{2}, the smallest minimizing subtree is the same as T_{1}.
Let T_{2} = T_{1} − T_{t̄_{1}}, the tree obtained by removing from T_{1} the branch rooted at the weakest link t̄_{1}.
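As a quick numeric illustration (the values are invented): suppose an internal node t has R(t) = 0.20, its branch has R(T_{t}) = 0.08, and the branch contains 4 leaves. Then

```latex
g_1(t) = \frac{R(t) - R(T_t)}{|\tilde{T}_t| - 1} = \frac{0.20 - 0.08}{4 - 1} = 0.04,
```

so pruning at t becomes preferable once α exceeds 0.04; the internal node with the smallest such ratio is the weakest link.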
Repeat the previous steps. Use T_{2} instead of T_{1} as the starting tree, find the weakest link in T_{2}, and prune at all the weakest-link nodes to get the next optimal subtree.
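One weakest-link pruning pass can be sketched as follows (the dict-based tree layout and the helper names are assumptions for illustration; the tree is assumed to have at least one internal node):

```python
# A tree is a nested dict: {'R': <resubstitution error at t>,
# 'left': ..., 'right': ...}; leaves have no 'left'/'right' keys.
def branch_stats(t):
    """Return (R(T_t), |T~_t|): branch error and leaf count, bottom-up."""
    if 'left' not in t:
        return t['R'], 1
    R_l, n_l = branch_stats(t['left'])
    R_r, n_r = branch_stats(t['right'])
    return R_l + R_r, n_l + n_r

def internal_nodes(t):
    """Yield every internal (non-terminal) node of the tree rooted at t."""
    if 'left' in t:
        yield t
        yield from internal_nodes(t['left'])
        yield from internal_nodes(t['right'])

def g(t):
    """g(t) = (R(t) - R(T_t)) / (|T~_t| - 1) for an internal node t."""
    R_branch, n_leaves = branch_stats(t)
    return (t['R'] - R_branch) / (n_leaves - 1)

def prune_weakest_links(t):
    """Prune every weakest link of t in place; return the new alpha."""
    links = [(g(node), node) for node in internal_nodes(t)]
    alpha = min(val for val, _ in links)
    for val, node in links:
        if val == alpha and 'left' in node:
            del node['left'], node['right']   # collapse the branch to a leaf
    return alpha
```

Calling `prune_weakest_links` repeatedly yields the nested subtrees T_{2}, T_{3}, ... together with the corresponding α's, which come out in increasing order.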
In terms of computation, we need to store a few values at each node.
In order to compute the resubstitution error rate R(t), we need the proportion of data points in each class that land in node t. Suppose we estimate the class priors by the proportions of points in each class. As we grow the tree, we store the number of points that land in node t, as well as the number of points in each class that land in node t. Given those numbers, we can easily estimate the probability of node t and the class posterior probabilities given that a data point is in node t; R(t) can then be calculated.
The next two numbers, a) the resubstitution error rate for the branch coming out of node t, and b) the number of leaf nodes on that branch, change after pruning, so they need to be updated. To be specific, after a branch is pruned we must update these values for all of the ancestor nodes of the pruned branch, because their leaf counts will have been reduced.
A recursive procedure can be used to compute R(T_{t}) and |T̃_{t}|.
To find the number of leaf nodes in the branch coming out of node t, we can do a bottom-up sweep through the tree. The number of leaf nodes for any node is equal to the number of leaf nodes of its left child plus the number of leaf nodes of its right child. A bottom-up sweep ensures that the number of leaf nodes is computed for a child node before its parent. Similarly, R(T_{t}) is equal to the sum of the values for the two child nodes of t, and hence a bottom-up sweep suffices.
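The bottom-up sweep can be sketched as a post-order traversal that stores both quantities on each node (the dict layout and key names are assumptions for illustration):

```python
# A tree is a nested dict: {'R': <resubstitution error at t>,
# 'left': ..., 'right': ...}; leaves have no 'left'/'right' keys.
def annotate(t):
    """Store 'n_leaves' = |T~_t| and 'R_branch' = R(T_t) at every node,
    visiting children before parents (post-order, i.e. bottom-up)."""
    if 'left' not in t:
        t['n_leaves'], t['R_branch'] = 1, t['R']
    else:
        annotate(t['left'])
        annotate(t['right'])
        t['n_leaves'] = t['left']['n_leaves'] + t['right']['n_leaves']
        t['R_branch'] = t['left']['R_branch'] + t['right']['R_branch']
    return t
```

After a branch is pruned, re-running this sweep (or updating only the pruned node's ancestors) restores the stored values.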
Once we have the three values at every node, we compute the ratio g(t) and find the weakest link. The corresponding ratio at the weakest link is the new α. It is guaranteed that the sequence of α obtained in the pruning process is strictly increasing.
If at any stage there are multiple weakest links, for instance if g_{k}(t̄_{k}) = g_{k}(t̄′_{k}), then we prune at both nodes and define:

T_{k+1} = T_{k} − T_{t̄_{k}} − T_{t̄′_{k}}.
In this case, the two branches are either nested or share no nodes. The pruning procedure gives a sequence of nested subtrees:

T_{1} > T_{2} > T_{3} > ... > {t_{1}},

each embedded in the previous one.
The theorem states that {α_{k}} is an increasing sequence, that is, α_{k} < α_{k+1} for k ≥ 1, where α_{1} = 0.
For any k ≥ 1 and α_{k} ≤ α < α_{k+1}, the smallest optimal subtree T(α) = T(α_{k}) = T_{k}, i.e., it is the same as the smallest optimal subtree for α_{k}.

Basically, this means that the smallest optimal subtree T_{k} stays optimal for all α's from α_{k} until α_{k+1} is reached. Although we have only a finite sequence of subtrees, they are optimal for a continuum of α.
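For reference, this procedure is implemented in scikit-learn as minimal cost-complexity pruning; `cost_complexity_pruning_path` returns the sequence of effective α's described above (a sketch, assuming scikit-learn is installed):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
# Grows the full tree internally, then computes the pruning path:
# ccp_alphas is the non-decreasing sequence of alpha_k values.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)
print(path.ccp_alphas)
```

Refitting with `DecisionTreeClassifier(ccp_alpha=a)` for any α_{k} ≤ a < α_{k+1} then yields the corresponding subtree T_{k}.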
At the initial steps of pruning, the algorithm tends to cut off large sub-branches with many leaf nodes very quickly. Then pruning becomes slower and slower as the tree becomes smaller, and the algorithm tends to cut off fewer nodes at a time. Let's look at an example.
T_{1} is the smallest optimal subtree for α_{1} = 0. It has 71 leaf nodes. Next, by finding the weakest link, after one step of pruning the tree is reduced to size 63 (8 leaf nodes are pruned off in one step). Next, five more leaf nodes are pruned off. From T_{3} to T_{4}, the pruning is significant: 18 leaf nodes are removed. Toward the end, pruning becomes slower.
For classification purposes, we have to select a single α, that is, a single subtree, to use.