Hyper-parameter Tuning in Machine Learning
While building a Machine Learning model, a key factor in improving its performance is selecting the right hyper-parameters. What are hyper-parameters, and how do we select the best ones? Let’s find out.
Hyper-parameters
Hyper-parameters are sets of values that control the way a machine learning algorithm learns. The values chosen affect how the model will learn and thereby impact its performance, stability and interpretability.
For example, for decision trees, the hyper-parameters include the depth of the tree, the maximum number of leaf nodes, and so on.
How to select the best hyper-parameters?
To select the hyper-parameters that lead to the best performance of the model, there are a couple of techniques, described below.
Before looking at each technique, let’s take a look at the decision tree hyper-parameters:
class sklearn.tree.DecisionTreeClassifier(*, criterion='gini', splitter='best', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features=None, random_state=None, max_leaf_nodes=None, min_impurity_decrease=0.0, class_weight=None, ccp_alpha=0.0)
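As a small illustration, the same estimator can be configured very differently just by changing these values (the specific numbers below are arbitrary and purely illustrative):
from sklearn.tree import DecisionTreeClassifier
# A shallow, heavily constrained tree...
shallow_tree = DecisionTreeClassifier(max_depth=3, max_leaf_nodes=8)
# ...versus a deeper tree that is free to grow many more leaves.
deep_tree = DecisionTreeClassifier(max_depth=12, max_leaf_nodes=None)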
1. Grid Search
In Grid Search, we pass a preset list of hyper-parameter values and evaluate the model for each combination. Each set of parameters is taken into consideration and the accuracy is noted. Once all the combinations are evaluated, the set of parameters that gives the best accuracy is considered the winner.
One of the major drawbacks of this technique is that, since it evaluates all possible combinations of hyper-parameters, it can become computationally expensive as the number of parameters grows.
Example of decision tree hyper-parameter tuning using grid search:
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV
param_grid = {'criterion': ['gini', 'entropy'], 'max_depth': np.arange(3, 10)}
decision_tree_model = DecisionTreeClassifier()
decision_tree_gscv = GridSearchCV(decision_tree_model, param_grid, cv=5)  # 5-fold cross-validation
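The search itself is run by calling fit, just like any other scikit-learn estimator; X_train and y_train below are assumed placeholders for your training data:
decision_tree_gscv.fit(X_train, y_train)   # evaluates every (criterion, max_depth) combination
print(decision_tree_gscv.best_params_)     # combination with the highest cross-validated accuracy
print(decision_tree_gscv.best_score_)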
2. Random Search
Random search is similar to grid search, but instead of using all the points in the grid, it tests only a randomly selected subset of these points. The smaller this subset, the faster but less accurate the optimization; the larger this subset, the more accurate the optimization, but the closer it comes to an exhaustive grid search.
One drawback of this technique is that, since the sampling is completely random and uses no intelligence about previous results, it may not always lead to the most accurate result.
Example of decision tree hyper-parameter tuning with random search:
param_dist = {"max_depth": [3, None], "max_features": randint(1, 9), "min_samples_leaf": randint(1, 9), "criterion": ["gini", "entropy"]}
decision_tree_model=DecisionTreeClassifier()
decision_tree_model = RandomizedSearchCV(tree, param_dist, cv=5)
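As before, fitting the search object runs the evaluation; n_iter controls how many random combinations are sampled, and X_train, y_train are again assumed placeholders for the training data:
decision_tree_rscv.fit(X_train, y_train)   # evaluates only the randomly sampled combinations
print(decision_tree_rscv.best_params_)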
3. Bayesian Optimization
Bayesian Optimization provides an efficient technique, based on Bayes’ Theorem, for directing the search of a global optimization problem. It works by building a probabilistic model of the objective function, called the surrogate function, which is then searched efficiently with an acquisition function before candidate samples are chosen for evaluation on the real objective function. It thus provides a way to intelligently search for the best hyper-parameters in a subset of the space, rather than scanning all possible combinations, without being computationally expensive.
Example of hyper-parameter tuning with Bayesian Optimization (using the bayes_opt library):
from bayes_opt import BayesianOptimization
# bo_params_rf is assumed to be a pre-defined objective function returning the score to maximize
bayes_optimizer = BayesianOptimization(bo_params_rf, {
    'max_samples': (0.5, 1), 'max_features': (0.5, 1), 'n_estimators': (100, 200)})
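For a complete, runnable sketch applied to a decision tree, assuming the bayesian-optimization (bayes_opt) package is installed, that X_train and y_train hold the training data, and that the parameter bounds are purely illustrative:
from bayes_opt import BayesianOptimization
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

def tree_objective(max_depth, min_samples_split):
    # Bayesian optimization proposes continuous values, so cast them to integers.
    model = DecisionTreeClassifier(max_depth=int(max_depth),
                                   min_samples_split=int(min_samples_split))
    # Mean cross-validated accuracy is the objective being maximized.
    return cross_val_score(model, X_train, y_train, cv=5).mean()

optimizer = BayesianOptimization(
    f=tree_objective,
    pbounds={'max_depth': (3, 12), 'min_samples_split': (2, 20)},
    random_state=42)
optimizer.maximize(init_points=5, n_iter=20)   # 5 random warm-up points, then 20 guided steps
print(optimizer.max)                           # best score and the hyper-parameters that produced it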
Happy Tuning!