# Linear Regression and the CEF

So what’s the regression you want to run?

In our world, this question or one like it is heard almost every day. Regression estimates provide a valuable baseline for almost all empirical research because regression is tightly linked to the CEF, and the CEF provides a natural summary of empirical relationships. The link between regression functions – i.e., the best-fitting line generated by minimizing expected squared errors – and the CEF can be explained in at least three ways. To lay out these explanations precisely, it helps to be precise about the regression function we have in mind. This chapter is concerned with the vector of population regression coefficients, defined as the solution to a population least squares problem. At this point, we are not worried about causality. Rather, we let the $K\times 1$ regression coefficient vector $\beta$ be defined by solving

$$\beta = \arg\min_b E\left[(Y_i - X_i'b)^2\right]. \tag{3.1.2}$$

Using the first-order condition,

$$E\left[X_i(Y_i - X_i'b)\right] = 0,$$

the solution for $b$ can be written $\beta = E[X_iX_i']^{-1}E[X_iY_i]$. Note that by construction, $E[X_i(Y_i - X_i'\beta)] = 0$. In other words, the population residual, which we define as $e_i = Y_i - X_i'\beta$, is uncorrelated with the regressors, $X_i$. It bears emphasizing that this error term does not have a life of its own. It owes its existence and meaning to $\beta$.
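The moment formula for $\beta$ and the built-in orthogonality of the residual are easy to check numerically. The sketch below is illustrative only: it uses simulated data and NumPy, and the data-generating process and variable names are invented, not taken from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical data: a constant and one non-constant regressor.
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])   # each row is X_i'
y = 1.0 + 2.0 * x + rng.normal(size=n)

# Sample analog of beta = E[X_i X_i']^{-1} E[X_i Y_i].
beta = np.linalg.solve(X.T @ X / n, X.T @ y / n)

# The residual e_i = Y_i - X_i' beta is uncorrelated with X_i by construction.
e = y - X @ beta
print(beta)          # close to the true coefficients [1, 2]
print(X.T @ e / n)   # approximately zero in every coordinate
```

The second printed vector is zero up to floating-point error regardless of how $Y_i$ was actually generated: orthogonality is a property of least squares, not of the model.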

In the simple bivariate case, where the regression vector includes only the single regressor $x_i$ and a constant, the slope coefficient is

$$\beta_1 = \frac{\mathrm{Cov}(Y_i, x_i)}{V(x_i)},$$

and the intercept is $\alpha = E[Y_i] - \beta_1 E[x_i]$. In the multivariate case, i.e., with more than one non-constant regressor, the slope coefficient for the k-th regressor is given by the regression-anatomy formula:

$$\beta_k = \frac{\mathrm{Cov}(Y_i, \tilde{x}_{ki})}{V(\tilde{x}_{ki})}, \tag{3.1.3}$$

where $\tilde{x}_{ki}$ is the residual from a regression of $x_{ki}$ on all the other covariates.

In other words, $E[X_iX_i']^{-1}E[X_iY_i]$ is the $K\times 1$ vector whose k-th element is $\mathrm{Cov}(Y_i, \tilde{x}_{ki})/V(\tilde{x}_{ki})$. This important formula is said to describe the “anatomy of a multivariate regression coefficient” because it reveals much more than the matrix formula $\beta = E[X_iX_i']^{-1}E[X_iY_i]$. It shows us that each coefficient in a multivariate regression is the bivariate slope coefficient for the corresponding regressor, after “partialling out” all the other variables in the model.

To verify the regression-anatomy formula, substitute

$$Y_i = \beta_0 + \beta_1 x_{1i} + \cdots + \beta_k x_{ki} + \cdots + \beta_K x_{Ki} + e_i$$

in the numerator of (3.1.3). Since $\tilde{x}_{ki}$ is a linear combination of the regressors, it is uncorrelated with $e_i$. Also, since $\tilde{x}_{ki}$ is a residual from a regression on all the other covariates in the model, it must be uncorrelated with these covariates. Finally, for the same reason, the covariance of $\tilde{x}_{ki}$ with $x_{ki}$ is just the variance of $\tilde{x}_{ki}$. We therefore have that $\mathrm{Cov}(Y_i, \tilde{x}_{ki}) = \beta_k V(\tilde{x}_{ki})$.²
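The anatomy formula also holds exactly in any finite sample, which makes it easy to verify numerically: residualize one regressor on the others, compute the bivariate slope, and compare it with the multivariate coefficient. A minimal sketch with simulated data (the data-generating process and variable names are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

# Two correlated regressors plus a constant.
x1 = rng.normal(size=n)
x2 = 0.5 * x1 + rng.normal(size=n)
X = np.column_stack([np.ones(n), x1, x2])
y = 1.0 + 2.0 * x1 - 1.0 * x2 + rng.normal(size=n)

# Full multivariate regression.
beta = np.linalg.lstsq(X, y, rcond=None)[0]

# Regression anatomy for the coefficient on x1: residualize x1 on the
# constant and x2, then apply the bivariate slope formula (3.1.3).
Z = np.column_stack([np.ones(n), x2])
x1_tilde = x1 - Z @ np.linalg.lstsq(Z, x1, rcond=None)[0]
cov_yx = np.mean((y - y.mean()) * (x1_tilde - x1_tilde.mean()))
var_x = np.mean((x1_tilde - x1_tilde.mean()) ** 2)
beta1_anatomy = cov_yx / var_x

print(beta[1], beta1_anatomy)   # the two numbers agree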

The regression-anatomy formula is probably familiar to you from a regression or statistics course, perhaps with one twist: the regression coefficients defined in this section are not estimators, but rather they are nonstochastic features of the joint distribution of dependent and independent variables. The joint distribution is what you would observe if you had a complete enumeration of the population of interest (or knew the stochastic process generating the data). You probably don’t have such information. Still, it’s kosher, even desirable, to think about what a set of population parameters might mean, without initially worrying about how to estimate them.

Below we discuss three reasons why the vector of population regression coefficients might be of interest. These reasons can be summarized by saying that you are interested in regression parameters if you are interested in the CEF.

Theorem 3.1.4 The Linear CEF Theorem (Regression-justification I)

Suppose the CEF is linear. Then the population regression function is it.

Proof. Suppose $E[Y_i|X_i] = X_i'\beta^*$ for a $K\times 1$ vector of coefficients, $\beta^*$. Recall that $E[X_i(Y_i - E[Y_i|X_i])] = 0$ by the CEF-decomposition property. Substitute using $E[Y_i|X_i] = X_i'\beta^*$ to find that $\beta^* = E[X_iX_i']^{-1}E[X_iY_i] = \beta$. ■

The linear CEF theorem raises the question of under what circumstances a CEF is linear. The classic scenario is joint Normality, i.e., the vector $(Y_i, X_i')'$ has a multivariate Normal distribution. This is the scenario considered by Galton (1886), father of regression, who was interested in the intergenerational link between Normally distributed traits such as height and intelligence. The Normal case is clearly of limited empirical relevance since regressors and dependent variables are often discrete, while Normal distributions are continuous. Another linearity scenario arises when regression models are saturated. As reviewed in Section 3.1.4, the saturated regression model has a separate parameter for every possible combination of values that the set of regressors can take on. For example, a saturated regression model with two dummy covariates includes both covariates (with coefficients known as the main effects) and their product (known as an interaction term). Such models are inherently linear, a point we also discuss in Section 3.1.4.
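The saturation point can be illustrated directly: with one parameter per cell, the fitted values of a saturated regression reproduce the sample CEF exactly. A sketch with two simulated dummy covariates (the data-generating process is invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 10_000

# Two binary covariates and an outcome whose CEF is nonlinear in them.
d1 = rng.integers(0, 2, size=n)
d2 = rng.integers(0, 2, size=n)
y = 1.0 + d1 + 2.0 * d2 + 3.0 * d1 * d2 + rng.normal(size=n)

# Saturated model: constant, two main effects, one interaction -- one
# parameter per cell, so the fit reproduces each conditional mean exactly.
X = np.column_stack([np.ones(n), d1, d2, d1 * d2])
beta = np.linalg.lstsq(X, y, rcond=None)[0]
fitted = X @ beta

for a in (0, 1):
    for b in (0, 1):
        cell = (d1 == a) & (d2 == b)
        # Cell mean and fitted value coincide up to floating-point error.
        print(a, b, y[cell].mean(), fitted[cell][0])
```

Because the fitted values equal the conditional means in every cell, the CEF here is linear in the (saturated) regressors by construction, whatever the true relationship between the dummies and the outcome.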

²The regression-anatomy formula is usually attributed to Frisch and Waugh (1933). You can also do regression anatomy this way:

$$\beta_k = \frac{\mathrm{Cov}(\tilde{Y}_{ki}, \tilde{x}_{ki})}{V(\tilde{x}_{ki})},$$

where $\tilde{Y}_{ki}$ is the residual from a regression of $Y_i$ on every covariate except $x_{ki}$. This works because the fitted values removed from $\tilde{Y}_{ki}$ are uncorrelated with $\tilde{x}_{ki}$. Often it’s useful to plot $\tilde{Y}_{ki}$ against $\tilde{x}_{ki}$; the slope of the least-squares fit in this scatterplot is your estimate of the multivariate $\beta_k$, even though the plot is two-dimensional. Note, however, that it’s not enough to partial the other covariates out of $Y_i$ only. That is,

$$\frac{\mathrm{Cov}(\tilde{Y}_{ki}, x_{ki})}{V(x_{ki})} \neq \beta_k,$$

unless $x_{ki}$ is uncorrelated with the other covariates.
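The footnote's caveat is easy to check by simulation: partialling the other covariates out of both the outcome and the regressor recovers the multivariate coefficient, while partialling them out of the outcome alone does not. A sketch with hypothetical simulated data:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100_000

x1 = rng.normal(size=n)
x2 = 0.5 * x1 + rng.normal(size=n)          # deliberately correlated with x1
X = np.column_stack([np.ones(n), x1, x2])
y = 1.0 + 2.0 * x1 - 1.0 * x2 + rng.normal(size=n)
beta1 = np.linalg.lstsq(X, y, rcond=None)[0][1]   # multivariate coefficient on x1

def resid(v, W):
    """Residual from a least-squares regression of v on the columns of W."""
    return v - W @ np.linalg.lstsq(W, v, rcond=None)[0]

def slope(a, b):
    """Bivariate slope from regressing a on b and a constant."""
    bd = b - b.mean()
    return np.dot(a - a.mean(), bd) / np.dot(bd, bd)

Z = np.column_stack([np.ones(n), x2])       # everything except x1
y_tilde = resid(y, Z)
x1_tilde = resid(x1, Z)

print(slope(y_tilde, x1_tilde))   # equals the multivariate beta1
print(slope(y_tilde, x1))         # does NOT: x1 is correlated with x2
```

The second slope is attenuated because the raw $x_1$ still contains variation that is explained by $x_2$.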

The following two reasons for focusing on regression are relevant when the linear CEF theorem does not apply.

Theorem 3.1.5 The Best Linear Predictor Theorem (Regression-justification II)

The function $X_i'\beta$ is the best linear predictor of $Y_i$ given $X_i$ in a MMSE sense.

Proof. $\beta = E[X_iX_i']^{-1}E[X_iY_i]$ solves the population least squares problem, (3.1.2). ■

In other words, just as the CEF, $E[Y_i|X_i]$, is the best (i.e., MMSE) predictor of $Y_i$ given $X_i$ in the class of all functions of $X_i$, the population regression function is the best we can do in the class of linear functions.

Theorem 3.1.6 The Regression-CEF Theorem (Regression-justification III)

The function $X_i'\beta$ provides the MMSE linear approximation to $E[Y_i|X_i]$, that is,

$$\beta = \arg\min_b E\left\{(E[Y_i|X_i] - X_i'b)^2\right\}. \tag{3.1.4}$$

Proof. Write

$$(Y_i - X_i'b)^2 = \{(Y_i - E[Y_i|X_i]) + (E[Y_i|X_i] - X_i'b)\}^2$$
$$= (Y_i - E[Y_i|X_i])^2 + (E[Y_i|X_i] - X_i'b)^2 + 2(Y_i - E[Y_i|X_i])(E[Y_i|X_i] - X_i'b).$$

The first term doesn’t involve $b$ and the last term has expectation zero by the CEF-decomposition property (ii). The CEF-approximation problem, (3.1.4), therefore has the same solution as the population least squares problem, (3.1.2). ■

These two theorems show us two more ways to view regression. Regression provides the best linear predictor for the dependent variable in the same way that the CEF is the best unrestricted predictor of the dependent variable. On the other hand, if we prefer to think about approximating $E[Y_i|X_i]$, as opposed to predicting $Y_i$, the regression-CEF theorem tells us that even if the CEF is nonlinear, regression provides the best linear approximation to it.

The regression-CEF theorem is our favorite way to motivate regression. The statement that regression approximates the CEF lines up with our view of empirical work as an effort to describe the essential features of statistical relationships, without necessarily trying to pin them down exactly. The linear CEF theorem is for special cases only. The best linear predictor theorem is satisfyingly general, but it encourages an overly clinical view of empirical research. We’re not really interested in predicting individual $Y_i$; it’s the distribution of $Y_i$ that we care about.

Figure 3.1.2 illustrates the CEF approximation property for the same schooling CEF plotted in Figure 3.1.1. The regression line fits the somewhat bumpy and nonlinear CEF as if we were estimating a model for $E[Y_i|X_i]$ instead of a model for $Y_i$. In fact, that is exactly what’s going on. An implication of the regression-CEF theorem is that regression coefficients can be obtained by using $E[Y_i|X_i]$ as a dependent variable instead of $Y_i$ itself. To see this, suppose that $X_i$ is a discrete random variable with probability mass function $g_x(u)$ when $X_i = u$. Then

$$E\left\{(E[Y_i|X_i] - X_i'b)^2\right\} = \sum_u \left(E[Y_i|X_i = u] - u'b\right)^2 g_x(u).$$

This means that $\beta$ can be constructed from the weighted least squares regression of $E[Y_i|X_i = u]$ on $u$, where $u$ runs over the values taken on by $X_i$. The weights are given by the distribution of $X_i$, i.e., $g_x(u)$ when $X_i = u$. Another way to see this is to iterate expectations in the formula for $\beta$:

$$\beta = E[X_iX_i']^{-1}E[X_iY_i] = E[X_iX_i']^{-1}E[X_i E(Y_i|X_i)]. \tag{3.1.5}$$
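The grouped-data idea can be checked directly: weighted least squares of the sample CEF on the regressor values, weighted by cell counts, should reproduce the micro-data coefficients. A sketch with simulated schooling data (the wage equation and sample sizes below are invented for illustration, not taken from the text's example):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200_000

# Hypothetical micro data: schooling in integer years and a log wage.
s = rng.integers(0, 21, size=n)
y = 5.0 + 0.075 * s + rng.normal(size=n)

# Micro-data regression of y on a constant and s.
X = np.column_stack([np.ones(n), s])
beta_micro = np.linalg.lstsq(X, y, rcond=None)[0]

# Grouped data: the sample CEF E[y | s = u] and the cell counts.
levels = np.unique(s)
cef = np.array([y[s == u].mean() for u in levels])
counts = np.array([(s == u).sum() for u in levels])

# Weighted least squares of the CEF on u, weighted by the histogram of s,
# reproduces the micro-data coefficients exactly.
U = np.column_stack([np.ones(levels.size), levels])
W = np.diag(counts)
beta_grouped = np.linalg.solve(U.T @ W @ U, U.T @ W @ cef)

print(beta_micro, beta_grouped)   # identical up to floating-point error
```

The equality is exact because the weighted normal equations for the grouped regression are algebraically identical to the micro-data normal equations; note, as the text warns, that the grouped standard errors would not match.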

The CEF or grouped-data version of the regression formula is of practical use when working on a project that precludes the analysis of micro data. For example, Angrist (1998) studies the effect of voluntary military service on earnings later in life. One of the estimation strategies used in this project regresses civilian earnings on a dummy for veteran status, along with personal characteristics and the variables used by the military to screen soldiers. The earnings data come from the US Social Security system, but Social Security earnings records cannot be released to the public. Instead of individual earnings, Angrist worked with average earnings conditional on race, sex, test scores, education, and veteran status. An illustration of the grouped-data approach to regression appears below. We estimated the schooling coefficient in a wage equation using 21 conditional means, the sample CEF of earnings given schooling. As the Stata output reported here shows, a grouped-data regression, weighted by the number of individuals at each schooling level in the sample, produces coefficients identical to what would be obtained using the underlying microdata sample with hundreds of thousands of observations. Note, however, that the standard errors from the grouped regression do not correctly reflect the asymptotic sampling variance of the slope estimate in repeated micro-data samples; for that you need an estimate of the variance of $Y_i - X_i'\beta$. This