A Gentle Introduction to Logistic Regression
What is Logistic Regression?
Logistic Regression is a supervised learning algorithm that predicts a category, such as yes/no or any other class label. It is primarily used for binary classification but can also be extended to multi-class problems.
Real life examples of Logistic Regression:
Logistic Regression is used in applications such as:
- Gmail: an email classifier that decides whether an incoming email should be marked as “spam” or “not spam”.
- Health care: analysing radiological images to predict whether a tumour is benign or malignant.
Types of Logistic Regression:
- Binary logistic regression: In this type of Logistic Regression, the dependent variable is dichotomous in nature — i.e. it has only two possible outcomes (e.g. 0 or 1).
- Multinomial logistic regression: In this type of logistic regression model, the dependent variable has three or more possible outcomes; however, these values have no specified order.
- Ordinal logistic regression: This type of logistic regression model is used when the response variable has three or more possible outcomes and these values do have a defined order.
Assumptions of Logistic Regression:
Logistic Regression makes fewer assumptions than Linear Regression, but it does assume the following:
- There should be no extreme, highly influential outliers in the data.
- There should be no high correlations (multicollinearity) among the predictors.
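One rough way to screen for the second assumption is to compute the correlation between each pair of predictors. The sketch below does this in plain Python; the function name `pearson_correlation`, the example columns, and the cutoff of 0.9 are all illustrative assumptions, and a pairwise check like this will not catch collinearity that involves three or more predictors together.

```python
import math


def pearson_correlation(a: list[float], b: list[float]) -> float:
    """Pearson correlation coefficient between two equal-length columns."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    covariance = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    spread_a = math.sqrt(sum((x - mean_a) ** 2 for x in a))
    spread_b = math.sqrt(sum((y - mean_b) ** 2 for y in b))
    return covariance / (spread_a * spread_b)


# Made-up predictor columns, for illustration only.
height_cm = [150.0, 160.0, 170.0, 180.0, 190.0]
height_in = [59.1, 63.0, 66.9, 70.9, 74.8]  # nearly a rescaled copy of height_cm
age_years = [34.0, 21.0, 45.0, 29.0, 52.0]

CUTOFF = 0.9  # assumed rule-of-thumb value, not a fixed standard

r_heights = pearson_correlation(height_cm, height_in)
r_age = pearson_correlation(height_cm, age_years)

# Flag a pair of predictors when the absolute correlation exceeds the cutoff.
print("height_cm vs height_in flagged:", abs(r_heights) > CUTOFF)
print("height_cm vs age_years flagged:", abs(r_age) > CUTOFF)
```

Here the two height columns carry almost the same information, so one of them would usually be dropped before fitting the model.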
Logistic Regression Representation:
Like linear regression, logistic regression is a linear model, but instead of predicting a continuous value it predicts a class/category. It does this by passing the linear model through a sigmoid function:
$$y = \frac{1}{1 + e^{-(w_0 + w_1 x)}}$$
where $y$ is the predicted probability of the class and $w_0 + w_1 x$ is the linear model within logistic regression.
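As a minimal sketch of this representation, the snippet below applies a sigmoid to a one-feature linear model in plain Python. The function names `sigmoid` and `predict_probability` and the weights `w0 = -3.0`, `w1 = 1.5` are assumptions chosen for illustration; they do not come from a fitted model.

```python
import math


def sigmoid(z: float) -> float:
    """Map any real number z to a value strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_probability(x: float, w0: float, w1: float) -> float:
    """Predicted probability of class 1 for a single feature value x."""
    return sigmoid(w0 + w1 * x)


# Hypothetical weights, chosen only for illustration.
w0, w1 = -3.0, 1.5

for x in (0.0, 2.0, 4.0):
    print(x, round(predict_probability(x, w0, w1), 3))
```

With these weights, x = 2 gives a linear score of 0 and therefore a probability of exactly 0.5. Smaller values of x push the probability towards 0 and larger values push it towards 1.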
The final prediction of logistic regression is either 0 or 1, but the model itself outputs a probability between 0 and 1. This is why its output is drawn as an S-shaped curve rather than as two discrete levels.
Consider an example where the predicted probability of an event occurring is 0.6. Does this output belong to class 0 or class 1?
A threshold is used to categorize the probabilities of logistic regression into discrete classes.
y = 0 if predicted probability < 0.5
y = 1 if predicted probability ≥ 0.5
With this threshold, a predicted probability of 0.6 is assigned to class 1.
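The thresholding rule can be sketched in a few lines of Python. The helper name `classify` and the example probabilities are illustrative assumptions, and assigning a probability of exactly 0.5 to class 1 is a choice made for this sketch.

```python
def classify(probability: float, threshold: float = 0.5) -> int:
    """Return class 1 when the probability reaches the threshold, otherwise class 0."""
    return 1 if probability >= threshold else 0


# Example probabilities, made up for illustration.
probabilities = [0.1, 0.5, 0.6, 0.93]

labels = [classify(p) for p in probabilities]
print(labels)  # [0, 1, 1, 1]

# A stricter cutoff moves borderline cases into class 0.
strict_labels = [classify(p, threshold=0.7) for p in probabilities]
print(strict_labels)  # [0, 0, 0, 1]
```

The cutoff does not have to be 0.5. Raising it makes the model more reluctant to predict class 1, which can be useful when a false positive is costly.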
In short, to predict the Y label (spam/not spam, cancer/not cancer, fraud/not fraud, etc.), you have to set a probability cutoff, or threshold.