Assumptions underlying Regression Analysis
The validity of regression analysis depends on several assumptions concerning the model
Y = α + β1X1 + ... + βkXk + ε.
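As a concrete illustration of this model, the following sketch simulates data from a two-variable (k = 2) version with made-up coefficients and recovers α and the β's by ordinary least squares; the specific numbers are assumptions for the example only.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 2_000

# Two explanatory variables (k = 2); coefficients chosen for illustration.
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
eps = rng.normal(size=n)                      # the residual term ε
y = 3.0 + 1.5 * x1 - 2.0 * x2 + eps           # Y = α + β1·X1 + β2·X2 + ε

# Ordinary least squares; the column of ones estimates the intercept α.
X = np.column_stack([np.ones(n), x1, x2])
(alpha_hat, b1_hat, b2_hat), *_ = np.linalg.lstsq(X, y, rcond=None)
print(alpha_hat, b1_hat, b2_hat)  # close to 3.0, 1.5, -2.0
```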
1. The relationship really is linear (or, for practical purposes, approximately
linear over the range of the population being studied).
2. E[ε] = 0. This is purely a cosmetic assumption: the estimate of
α will absorb any on-average residual effect that differs from zero.
3. ε is normally distributed across the population. While a substantive
assumption, this is typically true, due to the Central Limit Theorem, since the residual term is
the sum of a myriad of other, unidentified explanatory variables. If this assumption does not
hold, statements regarding confidence intervals for individual predictions may be
invalid.
4. StdDev[ε] does not vary with the values of the explanatory variables.
(This is called the homoskedasticity assumption.) Again, if this assumption does not hold,
statements regarding confidence intervals for individual predictions may be invalid.
5. ε is uncorrelated with the explanatory variables of the model. The
regression analysis will “attribute” as much of the variation in the dependent variable
as it can to the explanatory variables. If some unidentified factor covaries with one of the
explanatory variables, the estimate of that explanatory variable’s coefficient (i.e., the
estimate of its effect in the relationship) will suffer from “specification bias,”
since the explanatory variable will have both its own effect and some of the effect of the
unidentified variable attributed to it. This is why, when doing a regression for the purpose of
estimating the effect of some explanatory variable on the dependent variable, we try to work with
the most “complete” model possible.
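The specification bias described in the last assumption can be seen directly in a simulation. In this hedged sketch (coefficients and the correlation between variables are invented for the example), X2 covaries with X1; omitting X2 from the regression pushes part of X2's effect into the estimated coefficient of X1.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# X2 plays the role of the "unidentified factor": it covaries with X1.
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + rng.normal(size=n)

# True relationship: Y = 1 + 2·X1 + 3·X2 + ε
y = 1 + 2 * x1 + 3 * x2 + rng.normal(size=n)

# Complete model: regress Y on both X1 and X2.
X_full = np.column_stack([np.ones(n), x1, x2])
beta_full, *_ = np.linalg.lstsq(X_full, y, rcond=None)

# Misspecified model: X2 omitted.
X_short = np.column_stack([np.ones(n), x1])
beta_short, *_ = np.linalg.lstsq(X_short, y, rcond=None)

print(beta_full[1])   # close to the true coefficient, 2
print(beta_short[1])  # biased toward 2 + 3·0.8 = 4.4 (specification bias)
```

The short regression attributes to X1 its own effect plus the part of X2's effect that travels with X1, which is exactly the bias the text describes.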
Examination of the (sample) residuals resulting from the regression analysis can reveal
failures of assumptions 1, 3, and 4. Such failures are not necessarily a bad thing: they can point
the way to a better model. Detailed examination of outlying observations can help detect violations
of assumption 5, and so can likewise lead to a better model.
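A minimal sketch of such a residual check, on deliberately heteroskedastic simulated data (the data-generating numbers are assumptions for the example): the sample residuals average out to essentially zero by construction, while the growing spread of the residuals with the fitted values flags a violation of assumption 4.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5_000
x = rng.uniform(1, 10, size=n)

# Heteroskedastic data: the residual spread grows with x.
y = 4 + 1.5 * x + rng.normal(scale=0.5 * x, size=n)

X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ beta
resid = y - fitted

# With an intercept in the model, the sample residuals sum to ~0.
print(resid.mean())

# Crude check on assumption 4: if |residuals| grow with the fitted
# values, the spread is not constant across the population.
r = np.corrcoef(np.abs(resid), fitted)[0, 1]
print(r)  # clearly positive here, flagging heteroskedasticity
```

In practice one would plot the residuals against the fitted values (or against each explanatory variable) rather than rely on a single correlation, but the idea is the same: patterns in the residuals point to a failed assumption and toward a better model.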