Estimating the linear model $Y_i = \beta_0 + \beta_1 X_{1i} + \dots + \beta_k X_{ki} + e_i$ is at the core of many of the statistical analyses conducted today. If you allow the individual X variables to be products of themselves and other variables, the linear model is appropriate for factorial ANOVAs and polynomial regressions, as well as for estimating the mean, t tests, and so on. The flexibility of the linear model has led authors of some textbooks and software to call this the general linear model. I try to avoid this phrase because it can be confused with the generalized linear model, or GLM. The GLM is an important extension that allows researchers to analyze efficiently models in which the responses are proportions and counts, as well as other situations. More on this later.
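As a minimal sketch of why product terms keep the model linear, the following example fits a model with an interaction column by ordinary least squares. The data are synthetic and the coefficient values are hypothetical, chosen only for illustration; NumPy is assumed to be available. The last lines show that an intercept-only linear model reproduces the sample mean.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)

# Synthetic response; the coefficients 1.0, 2.0, -0.5, 0.75 are illustrative assumptions.
y = 1.0 + 2.0 * x1 - 0.5 * x2 + 0.75 * x1 * x2 + rng.normal(scale=0.1, size=n)

# The product x1 * x2 is just another column of the design matrix,
# so the model is still linear in the beta coefficients.
X = np.column_stack([np.ones(n), x1, x2, x1 * x2])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

# An intercept-only linear model estimates the mean of y.
mean_hat = np.linalg.lstsq(np.ones((n, 1)), y, rcond=None)[0][0]

print("estimated coefficients:", np.round(beta_hat, 3))
print("intercept-only estimate:", mean_hat, "sample mean:", y.mean())
```

The same design-matrix approach covers polynomial terms: a column such as `x1 ** 2` can be appended without changing the estimation step.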

The main focus of this entry is extending the linear model into an additive model. In the linear model, each X variable is multiplied by a scalar, the β value. This is what makes it a linear model, but it restricts the relationship between each X and Y (conditional on all the other Xs) to a straight line. With additive models, each term $\beta_j X_{ji}$ is replaced by a function of the X variable, and these functions are usually fairly simple (in terms of degrees of freedom). The model can be rewritten as $Y_i = \alpha + f_1(X_{1i}) + \dots + f_k(X_{ki}) + e_i$. The functions are usually assumed to be splines with a small number of knots. More complex functions can be used, but this may cause the model to overfit the observed data and thus not generalize well to new data sets. The typical graphical output shows the functions, and the numeric output shows the fit of the linear and nonlinear components. The choice of functions, which often comes down to the type and complexity of the splines, is critical.
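The additive structure can be sketched with a small amount of code. The example below is an illustration under stated assumptions, not the method used in this entry: it uses unpenalized regression splines (a cubic truncated-power basis with three knots per variable) fitted by least squares, whereas GAM software typically uses penalized smoothing splines. The data, the knot locations, and the helper name `spline_basis` are all hypothetical.

```python
import numpy as np

def spline_basis(x, knots):
    """Cubic truncated-power spline basis for x, without a constant column."""
    cols = [x, x**2, x**3]
    cols += [np.clip(x - k, 0.0, None) ** 3 for k in knots]
    return np.column_stack(cols)

rng = np.random.default_rng(1)
n = 400
x1 = rng.uniform(-2, 2, size=n)
x2 = rng.uniform(-2, 2, size=n)

# Synthetic response: an additive combination of two nonlinear functions.
y = 0.5 + np.sin(2 * x1) + x2**2 + rng.normal(scale=0.2, size=n)

knots = [-1.0, 0.0, 1.0]          # a small number of knots, chosen by hand
B1 = spline_basis(x1, knots)
B2 = spline_basis(x2, knots)

# Additive model: intercept + f1(x1) + f2(x2), each f a spline.
X = np.column_stack([np.ones(n), B1, B2])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

p1 = B1.shape[1]
f1_hat = B1 @ coef[1:1 + p1]      # estimated contribution of x1
f2_hat = B2 @ coef[1 + p1:]       # estimated contribution of x2
fitted = coef[0] + f1_hat + f2_hat
resid_sd = np.std(y - fitted)

# For comparison: the purely linear model in x1 and x2.
X_lin = np.column_stack([np.ones(n), x1, x2])
coef_lin, *_ = np.linalg.lstsq(X_lin, y, rcond=None)
resid_sd_lin = np.std(y - X_lin @ coef_lin)

print("residual sd, additive:", resid_sd)
print("residual sd, linear:  ", resid_sd_lin)
```

Plotting `f1_hat` against `x1` and `f2_hat` against `x2` gives the kind of graphical output described above. Adding more knots increases the degrees of freedom of each function and, with it, the risk of overfitting.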

To illustrate this procedure, Berndt's 1991 data from 534 respondents on hourly wages and several covariates (experience in years, gender, and education in years) are considered. One outlier with an hourly wage of $44 (z = 6.9) is removed, but the data remain skewed (skewness = 1.28, se = 0.11). Logging these data removes the skew (skewness = 0.05, se = 0.11), so a fairly common approach is to use the logged values as the response variable and assume that the residuals are normally distributed. Suppose the researchers' main interest is in the experience variable, and in whether income increases steadily with experience or whether it increases rapidly until some point and then continues to increase but less rapidly. For argument's sake, let us assume that both increases are linear in the logged wages. The researchers accept that wages increase with education and believe that the relationship is nonlinear, and so they allow this relationship to be modeled with a smoothing spline. Because the variable female is binary, only a single parameter is needed to measure the difference in earnings between males and females. Although categorical variables can be included within generalized additive models (GAMs), the purpose of GAMs is to examine the relationships between quantitative variables and the response variable. The first model is

...
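The skewness check and log transformation described in the wage example can be sketched as follows. This is not Berndt's data and not the model from this entry: the wages are simulated from a lognormal distribution with hypothetical parameters, and the standard error uses the large-sample approximation sqrt(6/n), which for n = 533 rounds to the 0.11 quoted in the example.

```python
import numpy as np

def skewness(x):
    """Moment-based sample skewness (third central moment over sd cubed)."""
    x = np.asarray(x, dtype=float)
    d = x - x.mean()
    return np.mean(d**3) / np.mean(d**2) ** 1.5

rng = np.random.default_rng(2)
# Simulated hourly wages; the lognormal parameters are assumptions for illustration.
wage = rng.lognormal(mean=2.0, sigma=0.5, size=533)

# Standardized scores, the quantity used to flag an extreme observation.
z = (wage - wage.mean()) / wage.std(ddof=1)

skew_raw = skewness(wage)            # right-skewed on the original scale
skew_log = skewness(np.log(wage))    # close to zero after logging
se_skew = np.sqrt(6.0 / wage.size)   # large-sample approximation to the se

print("largest z score:", z.max())
print("skewness (raw):", skew_raw)
print("skewness (log):", skew_log)
print("approximate se:", se_skew)
```

A skewness estimate that is small relative to its standard error, as on the log scale here, is what motivates using the logged values as the response.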
