Assumptions underlying Regression Analysis
The validity of regression analysis depends on several assumptions concerning the model
Y = α + β1X1 + ... + βkXk + ε.
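As a concrete illustration of this model, the following sketch simulates data from a two-variable (k = 2) version with made-up coefficients and recovers α and the β's by ordinary least squares; the specific numbers are assumptions for the example only.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 2_000

# Two explanatory variables (k = 2); coefficients chosen for illustration.
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
eps = rng.normal(size=n)                      # the residual term ε
y = 3.0 + 1.5 * x1 - 2.0 * x2 + eps           # Y = α + β1·X1 + β2·X2 + ε

# Ordinary least squares; the column of ones estimates the intercept α.
X = np.column_stack([np.ones(n), x1, x2])
(alpha_hat, b1_hat, b2_hat), *_ = np.linalg.lstsq(X, y, rcond=None)
print(alpha_hat, b1_hat, b2_hat)  # close to 3.0, 1.5, -2.0
```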
1. The relationship really is linear (or, for practical purposes, approximately
linear over the range of the population being studied).
2. E[ε] = 0. This is purely a cosmetic assumption: the estimate of
α will absorb any on-average residual effect that differs from zero.
3. ε is normally distributed across the population. While a substantive
assumption, this is typically true, due to the Central Limit Theorem, since the residual term is
the sum of a myriad of other, unidentified explanatory variables. If this assumption does not
hold, statements regarding confidence intervals for individual predictions may be
invalid.
4. StdDev[ε] does not vary with the values of the explanatory variables.
(This is called the homoskedasticity assumption.) Again, if this assumption does not hold,
statements regarding confidence intervals for individual predictions may be invalid.
5. ε is uncorrelated with the explanatory variables of the model. The
regression analysis will “attribute” as much of the variation in the dependent variable
as it can to the explanatory variables. If some unidentified factor covaries with one of the
explanatory variables, the estimate of that explanatory variable’s coefficient (i.e., the
estimate of its effect in the relationship) will suffer from “specification bias,”
since the explanatory variable will have both its own effect and some of the effect of the
unidentified variable attributed to it. This is why, when doing a regression for the purpose of
estimating the effect of some explanatory variable on the dependent variable, we try to work with
the most “complete” model possible.
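The specification bias described in the last assumption can be seen directly in a simulation. In this hedged sketch (coefficients and the correlation between variables are invented for the example), X2 covaries with X1; omitting X2 from the regression pushes part of X2's effect into the estimated coefficient of X1.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# X2 plays the role of the "unidentified factor": it covaries with X1.
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + rng.normal(size=n)

# True relationship: Y = 1 + 2·X1 + 3·X2 + ε
y = 1 + 2 * x1 + 3 * x2 + rng.normal(size=n)

# Complete model: regress Y on both X1 and X2.
X_full = np.column_stack([np.ones(n), x1, x2])
beta_full, *_ = np.linalg.lstsq(X_full, y, rcond=None)

# Misspecified model: X2 omitted.
X_short = np.column_stack([np.ones(n), x1])
beta_short, *_ = np.linalg.lstsq(X_short, y, rcond=None)

print(beta_full[1])   # close to the true coefficient, 2
print(beta_short[1])  # biased toward 2 + 3·0.8 = 4.4 (specification bias)
```

The short regression attributes to X1 its own effect plus the part of X2's effect that travels with X1, which is exactly the bias the text describes.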
Examination of the (sample) residuals resulting from the regression analysis can reveal
failures of assumptions 1, 3, and 4. Such failures are not necessarily a bad thing: they can point
the way to a better model. Detailed examination of outlying observations can help detect violations
of assumption 5, and so can likewise lead to a better model.
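A minimal sketch of such a residual check, on deliberately heteroskedastic simulated data (the data-generating numbers are assumptions for the example): the sample residuals average out to essentially zero by construction, while the growing spread of the residuals with the fitted values flags a violation of assumption 4.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5_000
x = rng.uniform(1, 10, size=n)

# Heteroskedastic data: the residual spread grows with x.
y = 4 + 1.5 * x + rng.normal(scale=0.5 * x, size=n)

X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ beta
resid = y - fitted

# With an intercept in the model, the sample residuals sum to ~0.
print(resid.mean())

# Crude check on assumption 4: if |residuals| grow with the fitted
# values, the spread is not constant across the population.
r = np.corrcoef(np.abs(resid), fitted)[0, 1]
print(r)  # clearly positive here, flagging heteroskedasticity
```

In practice one would plot the residuals against the fitted values (or against each explanatory variable) rather than rely on a single correlation, but the idea is the same: patterns in the residuals point to a failed assumption and toward a better model.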