A Gentle Introduction to Regularization in Machine Learning
What is Regularization?
We learned that over-fitting is a common problem for many Machine Learning models. While there are many ways to detect over-fitting and work around it with the data, the most effective way to deal with over-fitting is a technique called Regularization.
Regularization is a way to avoid over-fitting by penalizing large regression coefficients. In other words, it shrinks the parameters and simplifies the model, hence avoiding over-fitting.
Regularization adds penalties to more complex models, so that candidate models can be ranked from most over-fit to best-fit. The model with the lowest penalized ("over-fitting") score is usually the best choice for predictive power.
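To make this concrete, here is a minimal sketch in plain Python/NumPy of what a penalized cost function looks like. The function name, the alpha weight, and the squared penalty are illustrative assumptions; the exact form of the penalty is what distinguishes the regularization types described below.

    import numpy as np

    # Minimal sketch of a regularized cost: the ordinary loss (here, mean
    # squared error) plus a penalty that grows with the size of the coefficients.
    def regularized_cost(X, y, coef, alpha=1.0):
        residuals = X @ coef - y
        data_loss = np.mean(residuals ** 2)    # how well the model fits the data
        penalty = alpha * np.sum(coef ** 2)    # how large/complex the coefficients are
        return data_loss + penalty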
Do we really need Regularization always?
While Regularization is a great way to reduce or avoid over-fitting, it adds the most value for models that are prone to over-fitting, such as tree-based models, as well as for models with high multi-collinearity.
Having said that, the regularization penalty can be customized to the model's use case, as described below.
Types of Regularization
L1 or Lasso Regularization:
L1 or Lasso Regularization adds to the model's cost function an L1 penalty equal to the absolute value of the magnitude of the coefficients. In other words, it limits the size of the coefficients. L1 can yield sparse models (i.e. models with few non-zero coefficients): some coefficients can become exactly zero and be eliminated from the model. Hence, this regularization technique is often used for Feature Selection, to remove features that don't add value to the predictive power.
L1 = α · Σ(absolute values of the coefficients)
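To see the sparsity effect, here is a small sketch using scikit-learn's Lasso; the alpha value and the synthetic data are illustrative assumptions (alpha plays the role of α above). Only the first two of ten features actually drive the target:

    import numpy as np
    from sklearn.linear_model import Lasso

    # Synthetic data: only the first 2 of 10 features matter.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 10))
    y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=100)

    lasso = Lasso(alpha=0.1).fit(X, y)
    print(lasso.coef_)  # most coefficients are driven exactly to zero

The zeroed-out coefficients correspond to features the model has effectively discarded, which is why Lasso is a popular feature-selection tool.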
L2 or Ridge Regularization:
L2 or Ridge Regularization adds to the model's cost function an L2 penalty equal to the square of the magnitude of the coefficients. L2 does not yield sparse models: all coefficients are shrunk toward zero, but none are eliminated.
L2 = α · Σ(squared values of the coefficients)
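For comparison, here is the same sketch with scikit-learn's Ridge (again, the alpha value and the data are illustrative assumptions):

    import numpy as np
    from sklearn.linear_model import Ridge

    # The same synthetic setup: only the first 2 of 10 features matter.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 10))
    y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=100)

    ridge = Ridge(alpha=1.0).fit(X, y)
    print(ridge.coef_)  # all coefficients shrink, but none become exactly zero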
Elastic Net Regularization:
Elastic Net Regularization adds both the L1 and L2 regularization terms to the cost function at the same time.
A hyperparameter called l1_ratio defines how the L1 and L2 penalties are mixed; it is therefore called the ElasticNet mixing parameter. The acceptable range of values for l1_ratio is:
0 <= l1_ratio <= 1
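As a sketch, here is how this looks with scikit-learn's ElasticNet, which is where the l1_ratio parameter comes from; the alpha and l1_ratio values below are illustrative assumptions:

    import numpy as np
    from sklearn.linear_model import ElasticNet

    # Synthetic data, as before: only the first 2 of 10 features matter.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 10))
    y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=100)

    # l1_ratio=0.5 weights the L1 and L2 penalties equally;
    # l1_ratio=1.0 is pure Lasso, l1_ratio=0.0 is pure Ridge.
    enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)
    print(enet.coef_)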