Pruning in Decision Trees

Swagata Ashwani
3 min read · May 27, 2022


Photo by Markus Spiske on Unsplash

Decision trees are a tree-based prediction method, most commonly used for classification. They are fairly unique in the world of Machine Learning in that there is no need to scale features or perform any other kind of standardization.

However, they are prone to overfitting because they keep splitting at every node, and hence need some form of regularization to reduce their complexity.

Let’s take an example. Let N be the number of data points, k the number of features, and d the depth of the tree.
To learn a single node of a decision tree, you sort the data on each feature and update the impurities as you go. If you go through and compute the impurities over the entire dataset at every level of the tree, the cost per level is O(Nk), so the cost of learning the whole tree is

O(Nkd)

Since the depth d lies somewhere between log N (a balanced tree) and N (a degenerate one), this means the total cost is actually somewhere in between O(Nk log N) and O(N²k).
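For a rough sense of scale (with made-up numbers purely for illustration): with N = 10,000 points and k = 20 features, a balanced tree has depth d ≈ log₂ N ≈ 13, so Nkd is on the order of 2–3 million impurity updates, while a degenerate tree with d close to N pushes this toward N²k ≈ 2 billion.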

As you can see from the analysis above, the worst-case performance can be extremely bad! Hence, we use pruning techniques to reduce the complexity and thereby reduce the chance of overfitting.

Pre-Pruning

When working with tree-based methods, there is a much more direct way to control complexity: the number of decision nodes.
A decision tree that subdivides all the way down to individual data points is maximally overfit. This can be controlled either directly (say, by limiting the number of leaves) or by imposing a maximum depth.
This idea (stopping early when some criterion isn’t met) is called pre-pruning.
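As a minimal sketch of what this looks like in practice (using scikit-learn and its built-in iris dataset purely for illustration), pre-pruning just means handing these limits to the learner up front:

```python
# Pre-pruning sketch: the tree simply stops growing once it hits the
# limits we set. The dataset and split are placeholders for illustration.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

tree = DecisionTreeClassifier(
    max_depth=4,          # impose a maximum depth
    max_leaf_nodes=10,    # or limit the number of leaves directly
    min_samples_leaf=5,   # don't create leaves for just a handful of points
    random_state=0,
).fit(X_train, y_train)

print("validation accuracy:", tree.score(X_val, y_val))
```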

Post-Pruning

In this case, we start with a complex tree and trim it back.
The simplest approach is reduced-error pruning: iterate over the leaf nodes, removing those that can be removed without hurting predictive accuracy on a held-out validation set.

There are many pruning algorithms out there; below are three common examples.

Pruning by information gain

We can prune our decision tree using information gain in both pre-pruning and post-pruning. In pre-pruning, we check whether the information gain at a particular node is greater than a minimum-gain threshold, and only split if it is. In post-pruning, we prune the subtrees with the least information gain until we reach a desired number of leaves.
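As a sketch of the pre-pruning variant in scikit-learn, the min_impurity_decrease parameter can play the role of the minimum-gain threshold: with the entropy criterion, a node is only split if the (sample-weighted) entropy reduction is at least that large. The threshold value below is an assumption to be tuned, not a recommendation:

```python
# Pre-pruning on "minimum gain": weak splits are never made.
from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(
    criterion="entropy",          # information gain = reduction in entropy
    min_impurity_decrease=0.01,   # assumed minimum-gain threshold; tune for your data
    random_state=0,
)
# tree.fit(X_train, y_train)  # fit as usual; splits below the threshold are skipped
```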

Reduced Error Pruning (REP)

REP belongs to the post-pruning category. Pruning is performed with the help of a validation set: all nodes are evaluated for pruning in a bottom-up fashion, and a node is pruned if the resulting pruned tree performs no worse than the original tree on the validation set. The subtree at the node is replaced with a leaf node assigned the most common class.
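The sketch below is a toy, self-contained version of REP on a hand-rolled dictionary-based tree. The node format (including the stored "majority" class at each internal node) is an assumption for illustration, not something defined in this article:

```python
# Reduced Error Pruning sketch on a dict-based tree.
# Leaves look like {"label": c}; internal nodes look like
# {"feature": j, "threshold": t, "left": ..., "right": ..., "majority": c},
# where "majority" is the most common class of the training points at the node.

def predict(node, x):
    """Route one sample down the tree and return the predicted class."""
    if "label" in node:
        return node["label"]
    branch = "left" if x[node["feature"]] <= node["threshold"] else "right"
    return predict(node[branch], x)

def accuracy(tree, X_val, y_val):
    return sum(predict(tree, x) == y for x, y in zip(X_val, y_val)) / len(y_val)

def reduced_error_prune(tree, node, X_val, y_val):
    """Bottom-up REP: collapse a subtree into a majority-class leaf whenever
    doing so does not hurt accuracy on the held-out validation set."""
    if "label" in node:
        return
    # Prune the children first (bottom-up traversal).
    reduced_error_prune(tree, node["left"], X_val, y_val)
    reduced_error_prune(tree, node["right"], X_val, y_val)

    before = accuracy(tree, X_val, y_val)
    saved = dict(node)                 # remember the subtree
    node.clear()
    node["label"] = saved["majority"]  # tentatively replace it with a leaf
    if accuracy(tree, X_val, y_val) < before:
        node.clear()
        node.update(saved)             # pruning hurt accuracy: restore the subtree
```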

Cost-complexity pruning

Cost-complexity pruning also falls under the post-pruning category. It works by calculating a Tree Score based on the Residual Sum of Squares (RSS) of the subtree plus a Tree Complexity Penalty that is a function of the number of leaves in the subtree. The penalty compensates for the fact that a tree with more leaves can always fit the training data more closely. Numerically, the Tree Score is defined as follows:

Tree Score = RSS + αT

where α (alpha) is a hyperparameter we find using cross-validation, and T is the number of leaves in the subtree.
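scikit-learn supports cost-complexity pruning out of the box through the ccp_alpha parameter (for classification trees the RSS term is replaced by the tree's total impurity, but the idea is the same). A sketch, with the dataset and train/validation split as placeholders; as the article notes, in practice you would pick α by cross-validation:

```python
# Cost-complexity pruning sketch with scikit-learn's built-in support.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

# Effective alphas at which successive subtrees get pruned away.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)

best_alpha, best_score = 0.0, 0.0
for alpha in path.ccp_alphas:
    tree = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_train, y_train)
    score = tree.score(X_val, y_val)   # in practice, use cross-validation here
    if score > best_score:
        best_alpha, best_score = alpha, score

print(f"best alpha: {best_alpha:.5f}, validation accuracy: {best_score:.3f}")
```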
