A Gentle Introduction to Linear Regression
Regression is a supervised learning method that predicts a continuous variable. In its simplest form, it fits a straight line described by the following equation -
y = B0 + B1*x
- y is the predicted value of the dependent variable for any given value of the independent variable (x).
- B0 is the intercept, the predicted value of y when x is 0.
- B1 is the regression coefficient, the amount we expect y to change as x increases by one unit.
- x is the independent variable.
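To make the notation concrete, here is a minimal Python sketch of the line equation. The values of B0 and B1 below are made up purely for illustration:

```python
# A minimal sketch of the line equation y = B0 + B1*x.
# The coefficient values here are invented for illustration only.
B0 = 1.5   # intercept: predicted y when x is 0
B1 = 0.8   # regression coefficient: change in y for a one-unit increase in x

def predict(x):
    """Predicted value of the dependent variable for a given x."""
    return B0 + B1 * x

print(predict(0))    # 1.5 (the intercept)
print(predict(10))   # 9.5
```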
Linear regression finds the line of best fit through your data by searching for the coefficients (B0 and B1).
To find the best-fit line, the linear regression algorithm uses the method of least squares. The Ordinary Least Squares (OLS) procedure seeks to minimize the sum of the squared residuals: given a candidate line through the data, we take the vertical distance from each data point to the line (the residual), square it, and sum all of the squared errors together. This sum is the quantity that ordinary least squares minimizes.
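For a single independent variable, the least-squares coefficients have a simple closed form. Here is a short sketch of computing them, along with the sum of squared residuals that OLS minimizes; the data points are invented just to show the calculation:

```python
# Ordinary least squares for one independent variable, on made-up data.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 2.9, 3.8, 5.2, 5.9]

n = len(xs)
x_mean = sum(xs) / n
y_mean = sum(ys) / n

# B1 = sum((x - x_mean) * (y - y_mean)) / sum((x - x_mean)^2)
B1 = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) / \
     sum((x - x_mean) ** 2 for x in xs)
B0 = y_mean - B1 * x_mean

# The quantity OLS minimizes: the sum of the squared residuals.
ssr = sum((y - (B0 + B1 * x)) ** 2 for x, y in zip(xs, ys))

print(B0, B1, ssr)
```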
A typical example is a regression that models the relationship between an independent variable (weight) and a dependent variable (height).
The ordinary least squares method finds the best-fit line by choosing the coefficients that bring the line as close to the data points as possible. This reduces the overall error and produces a model that generalizes well.
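In practice, a library usually fits the coefficients for you. Here is a minimal sketch using scikit-learn's LinearRegression; the weight and height values are made up for illustration:

```python
# Fitting a line to hypothetical weight (kg) vs. height (cm) data with scikit-learn.
import numpy as np
from sklearn.linear_model import LinearRegression

weight = np.array([[55.0], [62.0], [70.0], [78.0], [85.0]])  # independent variable (2D for sklearn)
height = np.array([160.0, 165.0, 172.0, 176.0, 181.0])       # dependent variable

model = LinearRegression().fit(weight, height)
print(model.intercept_, model.coef_[0])   # fitted B0 and B1
print(model.predict([[68.0]]))            # predicted height for a 68 kg person
```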
Although linear regression is a simple algorithm and often produces good results, it relies on a few assumptions (rough checks for two of them are sketched after this list) -
- Linear relationship -
Linear regression needs the relationship between the independent and dependent variables to be linear.
- Multivariate normality -
Linear regression analysis requires all variables to be multivariate normal.
- No or little multicollinearity -
Linear regression assumes that there is little or no correlation between the independent variables.
- Homoscedasticity -
Homoscedasticity describes a situation in which the error term (that is, the “noise” or random disturbance in the relationship between the independent variables and the dependent variable) is the same across all values of the independent variables.
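As referenced above, two of these assumptions can be checked informally in a few lines. The sketch below uses randomly generated hypothetical data: the correlation matrix of the independent variables hints at multicollinearity, and comparing the spread of residuals across fitted values hints at whether homoscedasticity roughly holds. These are rough diagnostics, not formal statistical tests:

```python
# Informal checks of multicollinearity and homoscedasticity on hypothetical data.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))            # two hypothetical independent variables
y = 1.0 + 2.0 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.3, size=100)

# Multicollinearity: high pairwise correlation between independent variables is a warning sign.
print(np.corrcoef(X, rowvar=False))

# Homoscedasticity: residuals should have roughly constant spread across fitted values.
model = LinearRegression().fit(X, y)
fitted = model.predict(X)
residuals = y - fitted
low, high = fitted < np.median(fitted), fitted >= np.median(fitted)
print(residuals[low].std(), residuals[high].std())  # should be of similar magnitude
```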