Regression Modeling

Overview

The first step in conducting a regression-based study is to specify a model. In real applications, this is usually the most challenging step - deciding which variables “belong” in the model and which should be excluded, and deciding on the mathematical structure of the model. Statistical analysis alone cannot guarantee that the best model is being used: Good managerial judgment is an important component of the modeling process.

The following sections present some of the most important issues to keep in mind during the modeling process.

Variable Selection

We begin with the problems of specification bias (including too few variables in the model) and collinearity (including too many).

Specification Bias

Recall the childhood poem:

One day, as I walked up the stair,
  I met a man who wasn't there.
He wasn't there again today.
  I wish that he would go away!

The Zen aspect of the poem is that the absence of something can be just as tangible as the presence of something.

Specification bias arises when a potential independent variable - which is related to both the dependent variable and an included independent variable - is omitted from the model. The result is a biased estimate of the coefficient of the included variable (which is forced to play a double role). As always, no amount of statistical analysis can detect the presence of bias, i.e., in this case, the tangible absence of the excluded variable. Only wisdom (good judgment, based on an understanding of the relationship being investigated) can save you.

Example: The motorpool manager receives a call from the city comptroller, asking for an estimate of the average incremental maintenance expense associated with each mile one of the city's cars is driven. Although only mileage and cost are mentioned during the call, the manager must recognize that, in order to obtain a “pure” estimate of the effect of mileage on cost, he must avoid specification bias. Some quiet thought might well suggest to him that he collect data on age as well, since age has its own effect on cost and varies in a systematic way with mileage, i.e., the newer cars tend to be driven further in the course of a year.

(If he were to simply regress cost onto mileage, he’d end up with an underestimate of the true effect of mileage, since mileage would be blamed not only for its own effect on cost, but also indirectly for some of the effect of age on cost. To see this, imagine predicting a year’s worth of maintenance costs for two cars in the motorpool, one that is expected to be driven 15,000 miles during the year, and the other 16,000 miles. Since the latter car is somewhat likely to be a bit newer, the estimated difference in cost will be the mileage-related difference, scaled down a bit due to the age difference between the cars.)
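To see the arithmetic of the bias, here is a small simulation sketch; the numbers (fleet size, dollar figures, the way mileage falls off with age) are invented purely for illustration, not taken from any real motorpool.

  # Hypothetical illustration of omitted-variable (specification) bias:
  # cost depends on both mileage and age, and age is negatively correlated
  # with mileage (newer cars are driven more), so regressing cost on mileage
  # alone understates the true per-mile effect.
  import numpy as np
  import statsmodels.api as sm

  rng = np.random.default_rng(0)
  n = 200
  age = rng.uniform(0, 5, n)                                    # years
  mileage = 15_000 - 800 * age + rng.normal(0, 2_000, n)        # newer cars driven more
  cost = 400 + 0.05 * mileage + 100 * age + rng.normal(0, 150, n)

  short = sm.OLS(cost, sm.add_constant(mileage)).fit()                         # age omitted
  full = sm.OLS(cost, sm.add_constant(np.column_stack([mileage, age]))).fit()  # age included

  print("per-mile estimate, age omitted :", round(short.params[1], 4))  # well below 0.05
  print("per-mile estimate, age included:", round(full.params[1], 4))   # close to 0.05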

Searching for New Explanatory Variables

When a regression analysis is carried out, estimates of the residuals are obtained for all of the sample observations. Sorting the observations with respect to the residuals will put those observations for which the current model most underestimates the dependent variable at one end of the list, and those for which it most overestimates at the other end of the list. If some new variable can be found which differs significantly for the observations at the two ends of the sorted list, then inclusion of that variable as a new explanatory variable in the model might well yield a “better” model.

Example: If the motorpool manager doesn't think to collect data on age, and simply regresses cost onto mileage, he'll obtain a table of residuals. Pulling the records for the three cars with the largest positive residuals, and comparing them to the records for the three cars with the largest negative residuals, he'll notice that the cars in the first group are all two years old, and the cars in the second group are all new. This suggests that adding “age” to the model will yield a new model with greater explanatory power. (In this particular case, if his goal was to estimate the effect of mileage on cost, the addition of the new variable also helps to save him from specification bias.)
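Here is a sketch of that residual-sorting check, reusing the fitted mileage-only model (short) and the age array from the simulated example above; with real records, the manager would pull the actual files for the extreme cars instead.

  # Sort the cars by residual from the mileage-only model and compare the
  # candidate variable (age) at the two ends of the sorted list.
  import numpy as np

  resid = short.resid                  # residuals from the mileage-only fit
  order = np.argsort(resid)            # most negative first, most positive last

  print("mean age, 3 most over-predicted cars :", round(age[order[:3]].mean(), 2))   # tend to be the newest
  print("mean age, 3 most under-predicted cars:", round(age[order[-3:]].mean(), 2))  # tend to be the oldest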

Collinearity

If two independent variables are highly correlated (or if three or more are closely linearly related) in the sample, then it can be difficult to estimate their separate effects via regression analysis. If one (or the other) truly belongs in the regression model, then we will find that

  1. when either is included in the analysis, and the other excluded, the t-ratio of the included variable is large, but
  2. when both are included, both t-ratios are small (because there will be substantial uncertainty in the estimates of the two coefficients, resulting in large standard errors of the coefficients).
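Both points can be reproduced with a few lines of simulation (again with invented data): two nearly identical predictors each show a large t-ratio when entered alone, and both t-ratios collapse when the two are entered together.

  # Collinearity demonstration: x2 is nearly a copy of x1, and y truly depends
  # only on x1. Alone, each predictor has a huge t-ratio; together, neither does.
  import numpy as np
  import statsmodels.api as sm

  rng = np.random.default_rng(1)
  n = 100
  x1 = rng.normal(size=n)
  x2 = x1 + rng.normal(scale=0.05, size=n)      # almost perfectly correlated with x1
  y = 2.0 * x1 + rng.normal(size=n)

  for label, X in [("x1 only", x1), ("x2 only", x2),
                   ("both   ", np.column_stack([x1, x2]))]:
      fit = sm.OLS(y, sm.add_constant(X)).fit()
      print(label, "t-ratios:", np.round(fit.tvalues[1:], 2))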

Difficulties will arise if the prediction equation including both variables is used to make a prediction for an individual with X-values that don't fit the observed relationship between the X's (even when each X-value separately is not very extreme). (You are doing what is sometimes called “hidden extrapolation”.) In such a case, the standard error of the estimated mean can be quite large (and the standard error of the regression much smaller than the standard error of the prediction).
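The effect on the standard errors can be seen numerically in a sketch like the following (hypothetical data): the standard error of the estimated mean is modest at a point consistent with the sample's X-pattern, and much larger at a point whose coordinates are individually mild but jointly off that pattern.

  # Hidden extrapolation: x1 and x2 move together in the sample, so the point
  # (x1, x2) = (1, -1) is far from the data even though each coordinate alone
  # is unremarkable.
  import numpy as np
  import statsmodels.api as sm

  rng = np.random.default_rng(2)
  n = 100
  x1 = rng.normal(size=n)
  x2 = x1 + rng.normal(scale=0.2, size=n)       # x2 tracks x1 closely in the sample
  y = 1.0 + 2.0 * x1 + 1.0 * x2 + rng.normal(size=n)

  fit = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()

  typical = np.array([[1.0, 1.0, 1.0]])         # columns: const, x1, x2
  hidden = np.array([[1.0, 1.0, -1.0]])         # each value mild, the pair never seen

  for label, point in [("typical point       ", typical),
                       ("hidden extrapolation", hidden)]:
      pred = fit.get_prediction(point)
      print(label, "std error of estimated mean:", round(float(pred.se_mean[0]), 3))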

If judgment suggests that both variables are measuring the same thing, include the one that seems the better measure and exclude the other. If, however, the two variables are truly measuring different things, include them both, but be sure to check the standard error of the prediction when making predictions for individuals.

Stepwise Regression

So, here you are, with a plethora of potential independent variables arrayed before you. There is no substitute for the use of good judgment in choosing which to include in your model. But there is a useful procedure, known as stepwise regression, which can aid you in the development of your model.

In a “forward” stepwise regression analysis, the computer will begin by examining every possible simple linear regression model, and will show you the one with the highest coefficient of determination.

Keeping the independent variable just selected for the first model, the computer will next examine every two-independent-variables model which includes the already-selected variable and one other, and will show you the one with the highest coefficient of determination.

And so on, adding one variable at a time, the computer will eventually hand you a sequence of models it deems worthy of consideration.
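Here is a bare-bones sketch of that forward procedure, written out by hand so the greedy logic is visible; the data and variable names are hypothetical, and real packages use more refined entry criteria than raw R-squared.

  # Forward stepwise selection: at each step, try every not-yet-included
  # variable and keep the one that yields the highest R-squared.
  import numpy as np
  import statsmodels.api as sm

  def forward_stepwise(X, y, names):
      """Return [(variables in model, R-squared)] for each forward step."""
      chosen, remaining, sequence = [], list(range(X.shape[1])), []
      while remaining:
          scores = {j: sm.OLS(y, sm.add_constant(X[:, chosen + [j]])).fit().rsquared
                    for j in remaining}
          best = max(scores, key=scores.get)
          chosen.append(best)
          remaining.remove(best)
          sequence.append(([names[j] for j in chosen], scores[best]))
      return sequence

  # Tiny synthetic illustration: only x1 and x3 actually matter.
  rng = np.random.default_rng(3)
  X = rng.normal(size=(80, 4))
  y = 3 * X[:, 0] + 1 * X[:, 2] + rng.normal(size=80)
  for model_vars, r2 in forward_stepwise(X, y, ["x1", "x2", "x3", "x4"]):
      print(model_vars, "R-squared =", round(r2, 3))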

“Backwards” stepwise regression begins with the model including all of the potential independent variables, and successively throws out those which cost the least in terms of reduction of the coefficient of determination. The sequence of models so generated may be different from that generated using forward stepwise regression.

“General” stepwise regression goes merrily on its way, including variables which look good, and throwing them out later if their contribution to the explanatory power of some later model turns out to be too small, and eventually yields a single model. (Which single model you end up with depends on the criteria you specify for inclusion and exclusion of variables.)
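If you would rather not code the search yourself, greedy forward and backward selection is available in standard libraries; for example (assuming you have scikit-learn installed), its SequentialFeatureSelector does this kind of greedy search, though it scores candidate models by cross-validation rather than by raw R-squared.

  # Library version of greedy selection (scikit-learn); direction can be
  # "forward" or "backward".
  import numpy as np
  from sklearn.feature_selection import SequentialFeatureSelector
  from sklearn.linear_model import LinearRegression

  rng = np.random.default_rng(4)
  X = rng.normal(size=(120, 6))
  y = 2 * X[:, 0] - 1.5 * X[:, 3] + rng.normal(size=120)

  selector = SequentialFeatureSelector(LinearRegression(),
                                       n_features_to_select=2,
                                       direction="forward").fit(X, y)
  print("columns kept:", np.flatnonzero(selector.get_support()))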

These stepwise regression procedures are available in most commercial software packages, and can be quite helpful in the modeling process. Just don't let the computer's suggestions completely override your own instincts.