# 8.8 - Piecewise Linear Regression Models


We discuss "piecewise linear regression models" here because they use interaction terms that contain dummy variables.

### Example

Let's start with an example that demonstrates the need for using a piecewise approach to our linear regression model. Consider the following plot of the compressive strength (y) of n = 18 batches of concrete against the proportion of water (x) mixed in with the cement:

The estimated regression line (the solid line) appears to fit the data fairly well in some overall sense, but it is clear that we could do better. The residuals versus fits plot:

provides yet more evidence that our model needs work.

We could instead split our original scatter plot into two pieces, at the point where the water-cement ratio is 70%, and fit two separate but connected lines, one for each piece. As you can see, the estimated two-piece function, connected at 70% (the dashed line), appears to do a much better job of describing the trend in the data.

So, let's formulate a piecewise linear regression model for these data, in which there are two pieces connected at x = 70:

$y_i=\beta_0+\beta_1x_{i1}+\beta_2(x_{i1}-70)x_{i2}+\epsilon_i$

Alternatively, we could write our formulated piecewise model as:

$y_i=\beta_0+\beta_1x_{i1}+\beta_2x_{i2}^{*}+\epsilon_i$

where:

• $y_i$ is the compressive strength, in pounds per square inch, of concrete batch $i$
• $x_{i1}$ is the water-cement ratio, in %, of concrete batch $i$
• $x_{i2}$ is a dummy variable for concrete batch $i$ (0 if $x_{i1} \le 70$ and 1 if $x_{i1} > 70$)
• $x_{i2}^{*}$ denotes the interaction term $(x_{i1}-70)x_{i2}$

and the independent error terms $\epsilon_i$ follow a normal distribution with mean 0 and equal variance $\sigma^2$.
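As a minimal sketch of how this formulation translates into a fit, the Python snippet below builds the three predictor columns (intercept, $x_{i1}$, and the interaction term $x_{i2}^{*}$) and estimates the coefficients by ordinary least squares. The function names `piecewise_design` and `fit_piecewise` are hypothetical, and the data are synthetic and noise-free, generated from made-up coefficients purely for illustration; they are not the concrete data discussed above.

```python
import numpy as np

def piecewise_design(x, knot):
    """Design matrix [1, x, (x - knot) * dummy] for a one-knot piecewise model."""
    x = np.asarray(x, dtype=float)
    dummy = (x > knot).astype(float)   # x_i2: 0 if x <= knot, 1 if x > knot
    hinge = (x - knot) * dummy         # x_i2*: the interaction term
    return np.column_stack([np.ones_like(x), x, hinge])

def fit_piecewise(x, y, knot):
    """Ordinary least squares estimates (b0, b1, b2) for the piecewise model."""
    X = piecewise_design(x, knot)
    coef, *_ = np.linalg.lstsq(X, np.asarray(y, dtype=float), rcond=None)
    return coef

# Synthetic, noise-free illustration data (NOT the concrete data from the text).
x_demo = np.array([50.0, 55.0, 60.0, 65.0, 70.0, 75.0, 80.0, 85.0, 90.0])
true_coef = np.array([8.0, -0.05, -0.10])  # hypothetical beta0, beta1, beta2
y_demo = piecewise_design(x_demo, 70.0) @ true_coef

b0, b1, b2 = fit_piecewise(x_demo, y_demo, knot=70.0)
print(f"slope for x <= 70: {b1:.3f}")
print(f"slope for x > 70:  {b1 + b2:.3f}")
```

Because the interaction column is zero at and below the knot, the coefficient on it measures only the change in slope once $x_{i1}$ exceeds 70, which is what keeps the two pieces connected.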

With a little bit of algebra, we can see how the piecewise regression model as formulated above yields two separate linear functions connected at x = 70. Incidentally, the x-value at which the two pieces of the model connect is called the "knot value." For our example here, the knot value is 70.
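That algebra can be sketched by substituting the two possible values of the dummy variable into the model, writing $\mu_Y$ for the mean response (a notation assumed here for illustration). For batches with $x_{i1} \le 70$, the dummy variable $x_{i2}$ equals 0 and the interaction term drops out:

$\mu_Y=\beta_0+\beta_1x_{i1}$

For batches with $x_{i1} > 70$, the dummy variable $x_{i2}$ equals 1:

$\mu_Y=\beta_0+\beta_1x_{i1}+\beta_2(x_{i1}-70)=(\beta_0-70\beta_2)+(\beta_1+\beta_2)x_{i1}$

The two lines have slopes $\beta_1$ and $\beta_1+\beta_2$, and at $x_{i1}=70$ both expressions equal $\beta_0+70\beta_1$, so the pieces meet at the knot.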

Now, estimating our piecewise function in Minitab, we obtain:

With a little bit of algebra we see how the estimated regression equation that Minitab reports:

yields two estimated regression lines, connected at x = 70, that fit the data quite well:

And, the residuals versus fits plot illustrates significant improvement in the fit of the model:

### PRACTICE PROBLEMS: Piecewise linear regression

Shipment data. This exercise is intended to review the concept of piecewise linear regression. The basic idea behind piecewise linear regression is that if the data follow different linear trends over different regions of the data then we should model the regression function in "pieces." The pieces can be connected or not connected. Here, we'll fit a model in which the pieces are connected.

We'll use the shipment.txt data set. An electronics company periodically imports shipments of a certain large part used as a component in several of its products. The size of the shipment varies depending upon production schedules. For handling and distribution to assembly plants, shipments of size 250 thousand parts or less are sent to warehouse A; larger shipments are sent to warehouse B since this warehouse has specialized equipment that provides greater economies of scale for large shipments.

The data set contains information on the cost (y) of handling the shipment in the warehouse (in thousand dollars) and the size (x1) of the shipment (in thousand parts).

1. Create a scatter plot of the data with cost on the y axis and size on the x axis. (See Minitab Help: Creating a basic scatter plot). Based on the plot, does it seem reasonable that there are two different (but connected) regression functions, one when x1 ≤ 250 and one when x1 > 250?
2. Not surprisingly, we'll use a dummy variable and an interaction term to help define the piecewise linear regression model. Specifically, the model we'll fit is:

$y_i=\beta_0+\beta_1x_{i1}+\beta_2(x_{i1}-250)x_{i2}+\epsilon_i$

where $x_{i1}$ is the size of the shipment and $x_{i2} = 0$ if $x_{i1} \le 250$ and $x_{i2} = 1$ if $x_{i1} > 250$. We could also write this model as:

$y_i=\beta_0+\beta_1x_{i1}+\beta_2x_{i2}^{*}+\epsilon_i$

where $x_{i2}^{*} = (x_{i1}-250)x_{i2}$. If we assume the data follow this model, what is the mean response function for shipments whose size is smaller than 250? And, what is the mean response function for shipments whose size is greater than 250? Do the two mean response functions have different slopes and connect when $x_{i1} = 250$?

3. You first need to set up the data set so that you can easily fit the model. In the data set, y denotes cost and x1 denotes size. One straightforward way to create the new variable $x_{i2}^{*}$, which we'll call x1shiftx2, is:
   1. Use Minitab's calculator to create a new variable, x1shift, say, which equals x1 - 250.
   2. Use Minitab's Manip >> Code command (v16) or Data >> Recode >> To Numeric command (v17) to create x2. To do so, tell Minitab that you want to recode values in column x1 using method "Recode range of values." Indicate that you want values from 0 to 250 to take on value 0 and values above 250, up to 500, to take on value 1. Store the recoded column in a specified column of the current worksheet named x2.
   3. Use Minitab's calculator to multiply x1shift by x2 to get x1shiftx2. Review the values obtained to convince yourself that they take on the values as defined. Then, fit the linear regression model with y as the response and x1 and x1shiftx2 as the predictors. What is the estimated regression function for shipments whose size < 250? for shipments whose size > 250?
4. Based on your estimated regression function, what is the predicted cost for a shipment with a size of 125? with a size of 250? with a size of 400? Convince yourself that you get the same prediction for size = 250 regardless of which estimated regression function you use.
5. Using your predicted values for size = 125, 250, and 400, create another scatter plot of the data, but this time "annotate" the graph with the two connected lines. (Note that the F3 key should completely erase any previous work, such as annotation lines, in the Graph >> Plot command.) Do you think the piecewise linear regression model reasonably summarizes these data?
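For readers working outside Minitab, the data-preparation and fitting steps above can be sketched in Python. The variable names x1shift, x2, and x1shiftx2 follow the exercise; the function names are hypothetical, and because the layout of shipment.txt is not described here, the snippet uses synthetic, noise-free stand-in data generated from made-up coefficients rather than the real shipment data.

```python
import numpy as np

KNOT = 250.0  # shipment size (thousand parts) where the two pieces connect

def build_predictors(x1, knot=KNOT):
    """Mirror the worksheet steps: x1shift, x2, and their product x1shiftx2."""
    x1 = np.asarray(x1, dtype=float)
    x1shift = x1 - knot                # sub-step 1: shift size by the knot
    x2 = (x1 > knot).astype(float)     # sub-step 2: 0 up to 250, 1 above 250
    x1shiftx2 = x1shift * x2           # sub-step 3: interaction term
    return x1shift, x2, x1shiftx2

def fit_shipment_model(x1, y, knot=KNOT):
    """Least squares fit of y on x1 and x1shiftx2; returns (b0, b1, b2)."""
    x1 = np.asarray(x1, dtype=float)
    _, _, x1shiftx2 = build_predictors(x1, knot)
    X = np.column_stack([np.ones_like(x1), x1, x1shiftx2])
    coef, *_ = np.linalg.lstsq(X, np.asarray(y, dtype=float), rcond=None)
    return coef

def two_lines(coef, knot=KNOT):
    """(intercept, slope) of the fitted line below and above the knot."""
    b0, b1, b2 = coef
    below = (b0, b1)
    above = (b0 - knot * b2, b1 + b2)
    return below, above

# Synthetic stand-in data (NOT shipment.txt), from hypothetical coefficients.
x1_demo = np.array([100.0, 150.0, 200.0, 250.0, 300.0, 350.0, 400.0, 450.0])
y_demo = 3.0 + 0.04 * x1_demo - 0.02 * np.maximum(x1_demo - KNOT, 0.0)

coef = fit_shipment_model(x1_demo, y_demo)
below, above = two_lines(coef)
for size in (125.0, 250.0, 400.0):
    intercept, slope = below if size <= KNOT else above
    print(f"size {size:.0f}: predicted cost {intercept + slope * size:.2f}")
```

Evaluating both fitted lines at size 250 should give the same prediction, which is the continuity check the exercise asks for; plotting the three predicted points and joining them reproduces the annotated two-piece line.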