Regression analysis is a statistical technique for studying linear relationships. [1] It begins by supposing a general form for the relationship, known as the regression model:
Y = α + β1X1 +...+ βkXk + ε .
Example: In the motorpool case, the manager of the motorpool considers the model
Cost = α + β1Mileage + β2Age + β3Make + ε .
Y is the dependent variable, representing a quantity that varies from individual to individual throughout the population, and is the primary focus of interest. X1,..., Xk are the explanatory variables (the so-called “independent variables”), which also vary from one individual to the next, and are thought to be related to Y. Finally, ε is the residual term, which represents the composite effect of all other types of individual differences not explicitly identified in the model. [2]
Besides the model, the other input into a regression analysis is some relevant sample data, consisting of the observed values of the dependent and explanatory variables for a sample of members of the population.
The primary result of a regression analysis is a set of estimates of the regression coefficients α, β1,..., βk. These estimates are made by choosing values for the coefficients that make the average residual 0 and the standard deviation of the residual term as small as possible; this is the familiar least-squares fit. The result is summarized in the prediction equation:
Ypred = a + b1X1 +...+ bkXk .
Example: Fitting the model above to the motorpool data, we obtain:
Costpred = 107.34 + 29.65 Mileage + 73.96 Age + 47.43 Make .
(Dive down for further discussion of the assumptions underlying regression analysis, or examine a workbook which illustrates some of the underlying computations.)
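As a rough stand-in for such a workbook, here is a minimal sketch of how the fit might be carried out in Python with the statsmodels library. The data below are made-up placeholders, not the actual motorpool observations, so the fitted coefficients will not reproduce the numbers above; the 0/1 coding of make (Ford = 0, Honda = 1) is likewise an assumption inferred from the examples that follow.

    import pandas as pd
    import statsmodels.formula.api as smf

    # Made-up stand-in for the motorpool sample (Cost in dollars per year,
    # Mileage in thousands of miles, Age in years, Make coded 0 = Ford, 1 = Honda).
    df = pd.DataFrame({
        "Cost":    [850, 920, 640, 710, 1080, 760],
        "Mileage": [22, 25, 14, 17, 30, 19],
        "Age":     [3, 3, 1, 2, 4, 2],
        "Make":    [0, 1, 0, 1, 0, 1],
    })

    # Fit the full model Cost = a + b1*Mileage + b2*Age + b3*Make + residual.
    model = smf.ols("Cost ~ Mileage + Age + Make", data=df).fit()
    print(model.params)   # the estimates a, b1, b2, b3 of the prediction equation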
Typically, a regression analysis is done for one of two purposes: to predict the value of the dependent variable for individuals for whom some information concerning the explanatory variables is available, or to estimate the effect of some explanatory variable on the dependent variable.
If we know the values of several explanatory variables for an individual, but do not know the value of that individual’s dependent variable, we can use the prediction equation (based on a model using the known variables as its explanatory variables) to estimate the value of the dependent variable for that individual.
In order to see how much our prediction can be trusted, we use the standard error of the prediction [3] to construct confidence intervals for the prediction. (Examine a workbook that provides a detailed discussion of the standard error of the prediction.)
Example: In order to predict the next twelve months’ maintenance and repair expenses for a specific one-year-old Ford currently in the motorpool, we’d first perform a regression analysis using age and make as the explanatory variables:
Costpred = 705.66 + 8.53 Age - 54.27 Make .
Our prediction will then be $714.19, and the margin of error (at the 95%-confidence level) for the prediction is 2.1788 × 124.0141 = $270.20 .
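Continuing the illustrative Python sketch above (made-up data, so the dollar figures will differ from the example), a prediction interval of this sort might be obtained as follows. The model is refit with only age and make as explanatory variables, since the car’s future mileage is not known in advance.

    # Refit using only Age and Make, as in the example, then ask for the
    # prediction and its 95% interval for a one-year-old Ford (Age = 1, Make = 0).
    model2 = smf.ols("Cost ~ Age + Make", data=df).fit()
    new_car = pd.DataFrame({"Age": [1], "Make": [0]})
    frame = model2.get_prediction(new_car).summary_frame(alpha=0.05)
    print(frame[["mean", "obs_ci_lower", "obs_ci_upper"]])  # prediction and its 95% interval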
If our goal is not to make a prediction for an individual, but rather to estimate the mean value of the dependent variable across a large pool of similar individuals, we use the standard error of the estimated mean instead when computing confidence intervals.
Example: Our estimate of the average cost of keeping one-year-old Fords working is $714.19, with a margin of error of 2.1788 × 41.573 = $90.58 .
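In the same sketch, the summary frame also reports the narrower interval for the estimated mean, which is based on the standard error of the estimated mean rather than the standard error of the prediction:

    # Confidence interval for the mean cost across all one-year-old Fords.
    print(frame[["mean", "mean_ci_lower", "mean_ci_upper"]])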
In order to estimate the “pure” effect of some explanatory variable on the dependent variable, we want to control for as many other effects as possible. That is, we’d like to see how our prediction would change for an individual if this explanatory variable were different, while all other aspects of the individual were kept the same. In order to do this, we should always use the most complete model available, i.e., we should include all other relevant factors as additional explanatory variables. (Dive down for further discussion.)
Our estimate of the impact of a unit difference in the targeted explanatory variable is its coefficient in the prediction equation. The extent to which our estimate can be trusted is measured by the standard error of the coefficient.
Example: Using the full regression model, we estimate that the mean marginal maintenance and repair cost associated with driving one of the cars in the motorpool an additional 1000 miles is $29.65, with a margin of error in the estimate of 2.2010 × 3.915 = $8.62 . To better understand why we use the most complete model available, note that any “one of the cars” has a particular age and make, and we want to hold those constant while considering the incremental effect of another 1000 miles of driving.
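In the illustrative sketch, the coefficient and its standard error can be read off the fitted full model, and the margin of error formed by multiplying the standard error by the appropriate t-value (again, made-up data, so the figures will not match the example):

    from scipy import stats

    b = model.params["Mileage"]    # estimated marginal cost per additional 1000 miles
    se = model.bse["Mileage"]      # standard error of the coefficient
    t_crit = stats.t.ppf(0.975, df=model.df_resid)
    print(b, t_crit * se)          # estimate and its 95% margin of error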
Given a specific model, one might wonder whether a particular one of the explanatory variables really “belongs” in the model; equivalently, one might ask if this variable has a true regression coefficient different from 0 (and therefore would affect predictions).
We take the standard approach of classical hypothesis testing: In order to see if there is evidence supporting the inclusion of the variable in the model, we start by hypothesizing that it does not belong, i.e., that its true regression coefficient is 0.
Dividing the estimated coefficient by the standard error of the coefficient yields the t-ratio of the variable, which simply shows how many standard-deviations-worth of sampling error would have to have occurred in order to yield an estimated coefficient so different from the hypothesized true value of 0. We then ask how likely it is to have experienced so much sampling error: This yields the significance level of the sample data with respect to the null hypothesis that 0 is the true value of the coefficient. The closer this significance level is to 0%, the stronger is the evidence against the null hypothesis, and therefore the stronger the evidence is that the true coefficient is indeed different from 0, i.e., that the variable does belong in the model.
Example: In the full model, the significance level of the t-ratio of mileage is 0.0011%. We have overwhelmingly strong evidence that mileage has a true non-zero effect in the model. On the other hand, the significance level of the t-ratio of make is only 12.998%. We have here only a little bit of evidence that the true difference between Fords and Hondas is nonzero. (If we really wish to make a case against Hondas, we’ll require that the estimated difference persist as the sample size is increased, i.e., as more evidence is collected.)
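In the sketch, the t-ratios and their significance levels are reported directly by the fitted model (on the made-up data; the percentages quoted in the example come from the actual motorpool sample):

    print(model.tvalues)   # each coefficient divided by its standard error
    print(model.pvalues)   # significance levels for the null hypothesis "true coefficient = 0"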
Why does the dependent variable take different values for different members of the population? There are two possible answers: “Because the explanatory variables vary.” “Because things still sitting in the residual term vary.” The total variation seen in the dependent variable can be broken down into these two components, and the coefficient of determination [4] is the fraction of the total variation that is explained by the model, i.e., the fraction explained by variation in the explanatory variables. Subtracting the coefficient of determination from 100% indicates the fraction of variation in the dependent variable that the model fails to explain.
Example: Mileage alone can explain 56% of the observed car-to-car variation in annual maintenance costs, while age alone can’t explain much of anything. But variations in mileage and age together can explain over 78% of the variation in costs. They can explain more together than the sum of what they explain separately because mileage masks the effect of age in our data. When both are included in the regression model, the effect of mileage is separated from the effect of age, and the latter effect can then be seen.
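In the sketch, the coefficients of determination for the one-variable, two-variable, and full models can be compared directly (the 56% and 78% figures above come from the actual motorpool data, not from these placeholders):

    print(smf.ols("Cost ~ Mileage", data=df).fit().rsquared)        # mileage alone
    print(smf.ols("Cost ~ Mileage + Age", data=df).fit().rsquared)  # mileage and age together
    print(model.rsquared, model.rsquared_adj)                       # full model, plain and adjusted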
A natural follow-up question is how important variation in each of the explanatory variables is, relative to the others, in explaining the observed variation in the dependent variable. The beta-weights [5] of the explanatory variables can be compared to answer this question. (Dive down for a discussion of the distinction between t-ratios and beta-weights.)
Example: In the full model, the beta-weight of mileage is roughly twice that of age, which in turn is more than twice that of make. If asked, “Why does the annual maintenance cost vary from car to car?” one would answer, “Primarily because the cars vary in how far they’re driven. Of secondary explanatory importance is that they vary in age. Trailing both is the fact that some are Fords and others Hondas, i.e., that make varies across the fleet.”
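Statistical packages do not always report beta-weights directly; in the sketch they can be computed by hand as each coefficient times the ratio of the standard deviation of that explanatory variable to the standard deviation of the dependent variable:

    # Beta-weight of each explanatory variable: b_i * sd(X_i) / sd(Y).
    for name in ["Mileage", "Age", "Make"]:
        beta = model.params[name] * df[name].std() / df["Cost"].std()
        print(name, beta)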
The six “steps” to interpreting the result of a regression analysis are:
1. Use the prediction equation, together with the standard error of the prediction, to make predictions for individuals and to judge how much those predictions can be trusted.
2. Use the standard error of the estimated mean to judge how much the estimated mean value of the dependent variable, across a pool of similar individuals, can be trusted.
3. Use an explanatory variable’s coefficient, together with the standard error of the coefficient, to estimate that variable’s effect on the dependent variable.
4. Use the significance level of a variable’s t-ratio to judge whether the variable truly belongs in the model.
5. Use the coefficient of determination to see how much of the variation in the dependent variable the model explains.
6. Compare beta-weights to judge the relative importance of the explanatory variables in explaining that variation.
[1] Why is it valuable to be able to unravel linear relationships? Some interesting relationships are linear, essentially all managerial relationships are at least locally linear, and several modeling tricks help to transform the most commonly-encountered nonlinear relationships into linear relationships.
[2] The dependent and explanatory variables, as well as the residual term, can be thought of as random variables resulting from the random selection of a single member of the population, i.e., as quantities that vary from one individual to the next.
[3] The standard error of the prediction takes into account both our exposure to error in using a value of 0 for the individual’s residual when making the prediction (measured by the standard error of the regression), and our exposure to sampling error in estimating the regression coefficients (measured by the standard error of the estimated mean).
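In symbols, these two exposures combine as (standard error of the prediction)² = (standard error of the regression)² + (standard error of the estimated mean)². Applied to the one-year-old-Ford example above, the figures 124.0141 and 41.573 would imply a standard error of the regression of roughly √(124.0141² − 41.573²) ≈ 116.8.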
[4] The coefficient of determination is sometimes called the “R-square” of the model. Some computer packages will offer two coefficients of determination, one with an adjective – “adjusted”, “corrected”, or “unbiased” – in front. Given the choice, use the one with the adjective. If it is somewhat less than zero, read it as 0%.
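For reference, the most common form of the adjustment is adjusted R² = 1 − (1 − R²)·(n − 1)/(n − k − 1), where n is the number of observations in the sample and k is the number of explanatory variables; the adjustment compensates for the fact that adding a variable, even an irrelevant one, can never lower the unadjusted R².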
[5] The beta-weight of an explanatory variable has the same sign as the estimated coefficient of that variable. It is the magnitude, i.e., absolute value, of the beta-weight that is of relevance.