Three Common Modeling Tricks

Interactions

The standard linear regression model does not apply when the effect of one explanatory variable on the dependent variable depends on the value of another explanatory variable. In this case, the coefficient of the first variable, rather than being a constant, is a function of the other variable. This is called an interaction between the explanatory variables.

Examples:

– As automobiles age, the annual cost-per-mile-driven of keeping them in working order increases, i.e., the effect of mileage on maintenance cost depends on the age of a car.
– As workers become more experienced, their level of education becomes less of a factor in determining job performance, i.e., the effect of years-of-education on productivity depends on a worker’s experience.
– Advertising an upcoming sporting event has a more beneficial impact if the visiting team is one of the league leaders, i.e., the effect of advertising on ticket sales depends on the visitor’s league standing.
– Purchasers of condominiums in a resort high-rise will pay a premium to be on a lower floor if the condominium has a beach view, and will pay extra to be on an upper floor if the unit has an inland view, i.e., the effect of “floor number” on the market value of a condominium in the building depends on the view from the unit.

A simple step towards a better model is to at least permit the coefficient of the first variable to change linearly with the value of the second variable.

Example: Cost = α + (β1 + β2Age) Mileage + β3Age + β4Make + ε . In this model, the coefficient of mileage varies with the age of the car. It isn’t essential that age also appear in a separate term: if age has a separate effect of its own, it should be in the model; if the only impact of age on maintenance costs is through its effect on the coefficient of mileage, it should be omitted. Note that an interaction is very different from the dependence of the value of one explanatory variable on the value of another: age and annual mileage might vary independently in the population of cars being studied, and yet the interaction could still be present.

“Multiplying out” the resulting model yields a linear model in which one of the explanatory variables is an artificial variable derived from the original data.

Example: Cost = α + β1Mileage + β2(Age × Mileage) + β3Age + β4Make + ε .

By creating a new column of data corresponding to the product of the interacting variables, we can estimate the coefficients of the new model via regression analysis. The regression results can then be interpreted in the context of the before-multiplying-out form of the model. In particular, the significance level of the t-ratio of the newly-created variable helps us to see how strongly the data supports the inclusion of this new interaction variable in the model.
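To make the mechanics concrete, here is a minimal sketch in Python using numpy and statsmodels (illustrative tools, not ones the note prescribes). The data are simulated, and the column names Mileage, AgeMileage, Age, and Make are assumptions for this example, not part of the original note.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 200

# Simulated data: annual mileage (thousands of miles), age (years),
# and a 0-or-1 make dummy, all invented for illustration.
mileage = rng.uniform(5, 25, n)
age = rng.integers(0, 8, n).astype(float)
make = rng.integers(0, 2, n).astype(float)

# True relationship with an interaction: the mileage coefficient grows with age.
cost = 300 + (20 + 5 * age) * mileage + 40 * age + 100 * make + rng.normal(0, 50, n)

# "Multiplying out": create a new column for the Age x Mileage product.
X = sm.add_constant(np.column_stack([mileage, age * mileage, age, make]))

results = sm.OLS(cost, X).fit()
print(results.summary(xname=["const", "Mileage", "AgeMileage", "Age", "Make"]))
# The t-ratio on AgeMileage indicates how strongly the data support
# including the interaction variable in the model.
```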

Example: For new cars, the average incremental effect of mileage on cost is just β1. For one-year-old cars, the average effect is β1+β2, and for two-year-old cars, β1+2β2 . [Computational note: KStat provides the option of multiplying the constant coefficient by 0 when making a prediction. Doing so, and, for example, putting in 1 for Mileage, 2 for Mileage*Age, and 0 for any other explanatory variables and then making a prediction, yields a resulting standard error of the estimated mean from which a confidence interval for β1+2β2 – the average incremental cost per thousand miles driven for a two-year-old car – can be determined.]
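The same interval can be computed directly from the estimated covariance matrix of the coefficients. Below is a minimal sketch in Python (an illustrative choice; the note itself uses KStat), reusing the simulated setup from the previous sketch; the contrast vector c encodes the linear combination β1 + 2β2.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 200
mileage = rng.uniform(5, 25, n)
age = rng.integers(0, 8, n).astype(float)
cost = 300 + (20 + 5 * age) * mileage + 40 * age + rng.normal(0, 50, n)

X = sm.add_constant(np.column_stack([mileage, age * mileage, age]))
results = sm.OLS(cost, X).fit()

# Contrast vector picking out beta1 + 2*beta2:
# 0*const + 1*Mileage + 2*AgeMileage + 0*Age.
c = np.array([0.0, 1.0, 2.0, 0.0])
est = c @ results.params                    # point estimate of beta1 + 2*beta2
se = np.sqrt(c @ results.cov_params() @ c)  # its standard error
lo, hi = est - 1.96 * se, est + 1.96 * se   # approximate 95% interval (large n)
print(f"beta1 + 2*beta2 = {est:.2f}, 95% CI ({lo:.2f}, {hi:.2f})")
# results.t_test(c) reports the same estimate with an exact t-based interval.
```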

U-shaped nonlinearities

If it is suspected that there is a “bend” in the way some explanatory variable affects the dependent variable, it can be useful to introduce the square of that explanatory variable as a new variable in the regression model.

Example: Net wealth tends to increase more rapidly as one ages. A linear model which uses age to predict net wealth, i.e., Wealth = α + β Age + ε , fails to capture this phenomenon. A better model might be Wealth = α + β1Age + β2Age² + ε .

A quadratic function has the form f(x) = a + bx + cx² , and its graph is a parabola, a particular type of “U”-shaped curve. The curve bends upwards if c > 0 , and downwards (i.e., is an inverted “U”) if c < 0 . The further c is from zero, the sharper the bend of the curve. The bottom of the “U” (or top of the inverted “U”) occurs when x = -b/(2c) , i.e., the values of b and c together determine the horizontal position of the “U”. Finally, varying a varies the vertical position of the “U”. Combining all of these possibilities, the graph of a quadratic function can assume many different “U”-shaped forms. Moreover, if we look at just one section of such a graph, it can take many different curved shapes.

Example: The model Wealth = α + β1Age + β2Age² + ε would most likely be inappropriate if applied to a population containing children, since we’d expect the relationship to be flat over the range of pre-adult ages, and then to begin to bend upwards. However, if the population – as is likely to be the case in a study of net wealth – included only adults, the right-hand half of an upward-bending “U”-shaped curve might fit the sample data quite well. We’d expect the curve to start bending upwards near the age when young adults begin to accumulate wealth, i.e., we’d expect our estimates of the regression coefficients to indicate that -β1/(2β2) takes a value corresponding to the late teens or early twenties (depending on the typical educational level of members of the population). Of course, we’d be reluctant to use the model to make extrapolative predictions for young children.
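As an illustration, the following sketch (Python with numpy and statsmodels, again an illustrative choice) fits the quadratic wealth model to simulated data and recovers the bottom of the “U” at -β1/(2β2). The age-18 vertex is baked into the simulation, not an empirical claim.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 300

# Simulated adult ages and net wealth that accelerates with age;
# the true curve bottoms out at age 18 by construction.
age = rng.uniform(20, 70, n)
wealth = 5 + 0.1 * (age - 18) ** 2 + rng.normal(0, 30, n)

# Add the square of Age as a second explanatory variable.
X = sm.add_constant(np.column_stack([age, age ** 2]))
results = sm.OLS(wealth, X).fit()

b1, b2 = results.params[1], results.params[2]
print(f"estimated bottom of the U at Age = {-b1 / (2 * b2):.1f}")
# With these simulated data the estimated vertex lands near 18,
# the age at which wealth accumulation begins in the simulation.
```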

Other, similar approaches are available to try to capture nonlinear relationships. Introducing the reciprocal of an existing variable as a new variable can help to capture asymmetric “U”-shaped relationships, and introducing the square root of an existing variable can help to capture “sideways” “U”-shapes. However, in the study of relationships encountered in business settings, some combination of the simple interaction and “U”-shape “tricks” presented here can capture most nonlinearities. Indeed, many commercially-available regression-analysis packages contain specific features to facilitate the construction of new variables (i.e., new columns of data) from products of existing variables, or the square of an existing variable.
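Constructing such columns is a one-line operation in most environments; for instance, in Python with numpy (an illustrative choice):

```python
import numpy as np

x = np.array([1.0, 2.0, 4.0, 8.0, 16.0])  # an existing explanatory column

# Each derived column below would enter the regression as an extra variable,
# exactly as the squared and product terms did above.
recip = 1.0 / x       # reciprocal: asymmetric "U"-shaped relationships
root = np.sqrt(x)     # square root: "sideways" "U"-shapes
square = x ** 2       # square: the "U"-shape trick of the previous subsection
```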

Dummy variables

A two-valued qualitative variable can be represented by a single 0-or-1-valued “dummy” variable. If a qualitative variable has three or more possible values (e.g., make-of-car, or marital-status), choose one value as the “base case,” and create one 0-or-1-valued “difference” variable for each other value. (The coefficient of each difference variable represents the difference between the associated value and the base case.)

Example: To include automobile-make (with the values Ford, Honda, BMW, and Sterling) in the maintenance-and-repair-cost model, create three new variables:

            DHonda-Ford   DBMW-Ford   DSterling-Ford
Ford             0            0             0          (base case)
Honda            1            0             0
BMW              0            1             0
Sterling         0            0             1

The regression model will look like:

Cost = α + β1Mileage + β2Age + β3DH-F + β4DB-F + β5DS-F + ε

which actually consists of four separate models estimated from the same sample:

Ford:      Cost = α + β1Mileage + β2Age + ε
Honda:     Cost = α + β1Mileage + β2Age + β3 + ε
BMW:       Cost = α + β1Mileage + β2Age + β4 + ε
Sterling:  Cost = α + β1Mileage + β2Age + β5 + ε
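A sketch of the whole procedure in Python (numpy and statsmodels, simulated data; the make names come from the note, but the numeric make effects are invented for illustration):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 400
makes = rng.choice(["Ford", "Honda", "BMW", "Sterling"], n)
mileage = rng.uniform(5, 25, n)
age = rng.integers(0, 8, n).astype(float)

# One 0-or-1 difference variable per non-base make (Ford is the base case).
d_honda = (makes == "Honda").astype(float)
d_bmw = (makes == "BMW").astype(float)
d_sterling = (makes == "Sterling").astype(float)

# Illustrative make effects, expressed relative to Ford.
cost = (300 + 25 * mileage + 40 * age
        + 50 * d_honda + 200 * d_bmw + 350 * d_sterling
        + rng.normal(0, 60, n))

X = sm.add_constant(np.column_stack([mileage, age, d_honda, d_bmw, d_sterling]))
full = sm.OLS(cost, X).fit()
print(full.params)  # [alpha, beta1, beta2, beta3, beta4, beta5]
```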

The one regression statistic that becomes difficult to interpret after this one-into-many process is the t-ratio. If some of the difference variables have large t-ratios and others small ones, what should you do? You cannot include some of the difference variables and exclude others, since together they represent a single real variable. Instead, you want to test the null hypothesis “H0: The qualitative variable does not ‘belong’ in the model.” This is done via analysis of variance (ANOVA).
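Under the usual regression assumptions, this hypothesis can be tested with a partial F test comparing the full model to the restricted model that omits all three difference variables at once. A minimal sketch, continuing the simulated example above (statsmodels’ compare_f_test is one way to carry out the comparison):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 400
makes = rng.choice(["Ford", "Honda", "BMW", "Sterling"], n)
mileage = rng.uniform(5, 25, n)
age = rng.integers(0, 8, n).astype(float)
dummies = np.column_stack([(makes == m).astype(float)
                           for m in ["Honda", "BMW", "Sterling"]])
cost = (300 + 25 * mileage + 40 * age
        + dummies @ np.array([50.0, 200.0, 350.0]) + rng.normal(0, 60, n))

# Full model (with all three difference variables) vs. restricted model (none).
full = sm.OLS(cost, sm.add_constant(np.column_stack([mileage, age, dummies]))).fit()
restricted = sm.OLS(cost, sm.add_constant(np.column_stack([mileage, age]))).fit()

# Partial F test of H0: the make dummies all have zero coefficients,
# i.e., the qualitative variable does not belong in the model.
f_stat, p_value, df_diff = full.compare_f_test(restricted)
print(f"F = {f_stat:.2f}, p = {p_value:.4f}, df diff = {df_diff:g}")
```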