What are the main assumptions of linear regression?
There are several assumptions of linear regression. If any of them is violated, model predictions and interpretation may be worthless or misleading.
- Linear relationship between features and target variable.
- Additivity means that the effect of changes in one of the features on the target variable does not depend on values of other features. For example, a model for predicting revenue of a company have of two features - the number of items a sold and the number of items b sold. When company sells more items a the revenue increases and this is independent of the number of items b sold. But, if customers who buy a stop buying b, the additivity assumption is violated.
- Features are not correlated (no collinearity) since it can be difficult to separate out the individual effects of collinear features on the target variable.
- Errors are independently and identically normally distributed (yi = B0 + B1*x1i + … + errori):
- No correlation between errors (consecutive errors in the case of time series data).
- Constant variance of errors - homoscedasticity. For example, in case of time series, seasonal patterns can increase errors in seasons with higher activity.
- Errors are normaly distributed, otherwise some features will have more influence on the target variable than to others. If the error distribution is significantly non-normal, confidence intervals may be too wide or too narrow.