At times, experience or judgment will suggest that the linkage between some explanatory variable and the dependent variable in a regression model is not linear.
Example: A company has collected data on one of its factories over the past 20 fiscal quarters. For each quarter, they've divided total operating expenses by the number of (standardized) units of output produced to determine their per-unit cost of production. To understand why this cost has varied, they look at two potential explanatory variables: prod lvl = the level of scheduled output, measured in percentage points of the maximum designed output level of the factory, and rm+lbr = a composite index tracking the market price of raw materials and the hourly cost of direct labor. Their sample data and the results of an initial regression are:
However, the notion of economies of scale would lead us to expect that the linkage between production level and per-unit manufacturing cost is nonlinear. Combined with a bit of operational insight, we'd expect the per-unit cost to drop, rapidly at first and then more slowly, as the production level increases, and then to rise again as production level is pushed beyond what the factory was designed to handle.
A graphical examination of how the residuals from a regression vary with the magnitude of an explanatory variable can reveal a nonlinear linkage as well. In a truly linear relationship, the residuals take both positive and negative values for every range of values of the explanatory variables. A residual plot which shows the sign of the residuals varying systematically with the values of some explanatory variable indicates the presence of a nonlinear relationship between that explanatory variable and the dependent variable.
Example: Plotting the residuals against the raw-material-and-labor index reveals nothing of interest.
However, a plot of the residuals against production levels reveals a definite pattern:
For production levels below 70 and above 90, the residuals are almost all positive (indicating that the model systematically underpredicts the dependent variable in these cases). In between, the residuals are almost all negative (indicating that the model overpredicts in those cases). We could therefore improve the model by adjusting predictions upwards when production level is high or low, and adjusting them downwards when production level is moderate.
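The sign pattern described above can also be checked numerically rather than by eye. The following is a minimal sketch, not the company's analysis: the data are synthetic stand-ins generated from an assumed curved cost relationship (the factory's actual 20 quarters are not reproduced here), and the variable names prod_lvl, rm_lbr and unit_cst are simply code-friendly spellings of the labels used in the text. It fits the linear model and averages the residuals within the three production bands.

```python
import numpy as np

# Synthetic stand-in data: these are NOT the factory's actual figures.
i = np.arange(20)
prod_lvl = 55.0 + 2.5 * i                                  # 55 .. 102.5 percent
rm_lbr = 100.0 + 5.0 * np.where(i % 2 == 0, 1.0, -1.0)     # hypothetical index
rng = np.random.default_rng(0)
unit_cst = (10.5 - 0.17 * prod_lvl + 0.0009 * prod_lvl**2  # assumed curved link
            + 0.02 * rm_lbr + rng.normal(0.0, 0.02, size=20))

# Fit the purely linear model: unit_cst ~ 1 + prod_lvl + rm_lbr.
X = np.column_stack([np.ones(20), prod_lvl, rm_lbr])
coef, *_ = np.linalg.lstsq(X, unit_cst, rcond=None)
residuals = unit_cst - X @ coef

# Average residual within each production band.
low = residuals[prod_lvl < 70].mean()
mid = residuals[(prod_lvl >= 70) & (prod_lvl <= 90)].mean()
high = residuals[prod_lvl > 90].mean()
print(f"low: {low:+.3f}  mid: {mid:+.3f}  high: {high:+.3f}")
```

Band averages that are positive at both ends and negative in the middle are the numerical counterpart of the "U" seen in the residual plot.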
When a residual plot shows a rough "U"-shaped link (either direct or inverted) between the residuals and an explanatory variable, the fit of the model to the data can be improved by introducing the square of that explanatory variable as a new artificial variable in the model. (Here is a workbook which reviews some of the properties of quadratic functions and their graphs, parabolas.)
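As a rough illustration of this step, the sketch below adds a squared column to the explanatory data and compares the sum of squared residuals with and without it. The data are again synthetic (generated from an assumed quadratic cost curve, not taken from the factory), and fit_sse is a hypothetical helper name introduced only for this example.

```python
import numpy as np

def fit_sse(X, y):
    """Least-squares fit; returns (coefficients, sum of squared residuals)."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    return coef, float(resid @ resid)

# Synthetic stand-in data: these are NOT the factory's actual figures.
i = np.arange(20)
prod_lvl = 55.0 + 2.5 * i
rm_lbr = 100.0 + 5.0 * np.where(i % 2 == 0, 1.0, -1.0)
rng = np.random.default_rng(0)
unit_cst = (10.5 - 0.17 * prod_lvl + 0.0009 * prod_lvl**2
            + 0.02 * rm_lbr + rng.normal(0.0, 0.02, size=20))

ones = np.ones(20)
X_lin = np.column_stack([ones, prod_lvl, rm_lbr])
# The new artificial variable is just another column: the square of prod_lvl.
X_quad = np.column_stack([ones, prod_lvl, prod_lvl**2, rm_lbr])

_, sse_lin = fit_sse(X_lin, unit_cst)
coef_quad, sse_quad = fit_sse(X_quad, unit_cst)
print(f"SSE without square: {sse_lin:.4f}")
print(f"SSE with square:    {sse_quad:.4f}")
print(f"estimated coefficient on the square: {coef_quad[2]:.6f}")
```

Adding a column can never increase the sum of squared residuals, so the size of the drop, together with the significance of the new coefficient, is what indicates whether the squared term earns its place.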
Example: Introducing the square of production level as a new explanatory variable in the model yields the regression results:
(unit cst)_pred = 10.5223035 - 0.1744727⋅(prod lvl) + 0.0008948⋅(prod lvl)² + 0.02016781⋅(rm+lbr)
The significance level of 0.0001% for the squared term indicates strong evidence that it has a true non-zero coefficient, and therefore belongs in the model. The coefficient 0.0008948 is positive, indicating that the nonlinear link between per-unit cost and production level is upward-bending. Writing b for the coefficient of (prod lvl) and c for the coefficient of its square, this upward-bending relationship bottoms out when production level is -b/(2c) = -(-0.1744727)/(2⋅0.0008948) = 97.492 (i.e., near 100, as expected from our initial intuition).
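The turning-point arithmetic is easy to reproduce. The sketch below uses the fitted coefficients quoted above; the function name predicted_unit_cost and the index value of 100 for rm+lbr are assumptions made only for illustration.

```python
# Coefficients from the fitted equation quoted in the text.
a = 10.5223035      # intercept
b = -0.1744727      # coefficient of prod lvl
c = 0.0008948       # coefficient of (prod lvl) squared
d = 0.02016781      # coefficient of rm+lbr

def predicted_unit_cost(prod_lvl, rm_lbr):
    """Predicted per-unit cost from the quadratic regression equation."""
    return a + b * prod_lvl + c * prod_lvl**2 + d * rm_lbr

# A parabola a + b*x + c*x^2 with c > 0 reaches its minimum at x = -b / (2c).
turning_point = -b / (2 * c)
print(f"turning point: {turning_point:.2f}")

# With rm+lbr held fixed (100 is an arbitrary illustrative value), the
# predicted cost at the turning point is lower than on either side of it.
for level in (80.0, turning_point, 105.0):
    print(f"prod lvl {level:6.2f}: {predicted_unit_cost(level, 100.0):.4f}")
```

Because rm+lbr enters the equation additively, the choice of its value shifts all three predictions by the same amount and does not move the turning point.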
Other, similar approaches are available to try to capture nonlinear relationships. Introducing the reciprocal of an existing variable as a new variable can help to capture asymmetric “U”-shaped relationships, and introducing the square root of an existing variable can help to capture “sideways” “U”-shapes. However, in the study of relationships encountered in business settings, some combination of the simple interaction and “U”-shape “tricks” presented here can capture most nonlinearities. Indeed, many commercially available regression-analysis packages contain specific features to facilitate the construction of new variables (i.e., new columns of data) from products of existing variables, or the squares of existing variables.
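Outside such packages, these new columns are simple to build by hand. The following sketch shows one way to do it with NumPy; the function derived_columns and its column names are hypothetical, and the reciprocal and square root assume strictly positive values in x.

```python
import numpy as np

def derived_columns(x, z):
    """Build artificial variables from two existing columns x and z.

    Assumes every value in x is strictly positive, so that the reciprocal
    and the square root are both defined.
    """
    x = np.asarray(x, dtype=float)
    z = np.asarray(z, dtype=float)
    return {
        "x_squared": x**2,        # symmetric "U" shape
        "x_reciprocal": 1.0 / x,  # asymmetric "U" shape
        "x_sqrt": np.sqrt(x),     # "sideways" "U" shape
        "x_times_z": x * z,       # simple interaction (product) term
    }

cols = derived_columns([4.0, 16.0], [2.0, 3.0])
for name, values in cols.items():
    print(name, values)
```

Each resulting array can be appended to the explanatory data as an additional column before the regression is rerun.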