# Conditional Expectation Functions

Chapter 1 introduces the notion of mathematical expectation, called “expectation” for short. We write $E[Y_i]$ for the expectation of a variable, $Y_i$. We’re also concerned with conditional expectations, that is, the expectation of a variable in groups (also called “cells”) defined by a second variable. Sometimes this second variable is a dummy, taking on only two values, but it need not be. Often, as in this chapter, we’re interested in conditional expectations in groups defined by the values of variables that aren’t dummies, for example, the expected earnings for people who have completed 16 years of schooling. This sort of conditional expectation can be written as

$$E[Y_i \mid X_i = x],$$

and it’s read as “The conditional expectation of $Y_i$ given that $X_i$ equals the particular value $x$.”

Conditional expectations tell us how the population average of one variable changes as we move the conditioning variable over the values this variable might assume. For every value of the conditioning variable, we might get a different average of the dependent variable, $Y_i$. The collection of all such averages is called the conditional expectation function (CEF for short). $E[Y_i \mid X_i]$ is the CEF of $Y_i$ given $X_i$, without specifying a value for $X_i$, while $E[Y_i \mid X_i = x]$ is one point in the range of this function.
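In a sample, the CEF is estimated by averaging within cells. Here is a minimal sketch of that idea in Python with NumPy. The data are simulated, and the variable names and parameter values (an intercept of 5 and a slope of .1) are our own assumptions for illustration, not estimates from the Census data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical simulated data: x is years of schooling, y is a log wage.
n = 10_000
x = rng.integers(8, 21, size=n)                    # schooling levels 8, ..., 20
y = 5.0 + 0.1 * x + rng.normal(0.0, 0.5, size=n)   # assumed wage equation

# The sample CEF: one average of y for each distinct value of x.
levels = np.unique(x)
cef = np.array([y[x == v].mean() for v in levels])

# One point on the CEF: the average of y in the cell where x equals 16.
cef_at_16 = y[x == 16].mean()
```

The array `cef` plays the role of $E[Y_i \mid X_i]$, while `cef_at_16` is the single point $E[Y_i \mid X_i = 16]$.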

A favorite CEF of ours appears in Figure 2.1. The dots in this figure show the average log weekly wage for men with different levels of schooling (measured by highest grade completed), with schooling levels arrayed on the X-axis (data here come from the 1980 U.S. Census). Though it bobs up and down, the earnings-schooling CEF is strongly upward-sloping, with an average slope of about .1. In other words, each year of schooling is associated with wages that are about 10% higher on average.

 FIGURE 2.1 The CEF and the regression line Notes: This figure shows the conditional expectation function (CEF) of log weekly wages given years of education, and the line generated by regressing log weekly wages on years of education (plotted as a broken line).

Many of the CEFs we’re interested in involve more than one conditioning variable, each of which takes on two or more values. We write

$$E[Y_i \mid X_{1i} = x_1, \ldots, X_{Ki} = x_K]$$

for a CEF with $K$ conditioning variables. With many conditioning variables, the CEF is harder to plot, but the idea is the same. $E[Y_i \mid X_{1i} = x_1, \ldots, X_{Ki} = x_K]$ gives the population average of $Y_i$ with these $K$ other variables held fixed. Instead of looking at average wages conditional only on schooling, for example, we might also condition on cells defined by age, race, and sex.

## Regression and the CEF

Table 2.1 illustrates the matchmaking idea by comparing students who attended public and private colleges, after sorting students into cells on the basis of the colleges to which they applied and were admitted. The body of the chapter explains how we see regression as a quick and easy way of automating such matched comparisons. Here, we use the CEF to make this interpretation of regression more rigorous.

The regression estimates of equation (2.2) reported in Table 2.3 suggest that private school attendance is unrelated to average earnings once individual SAT scores, parental income, and the selectivity of colleges applied and admitted to are held fixed. As a simplification, suppose that the CEF of log wages is a linear function of these conditioning variables. Specifically, assume that

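$$E[\ln Y_i \mid P_i, \text{GROUP}_i, \text{SAT}_i, \ln \text{PI}_i] = \alpha + \beta P_i + \sum_j \gamma_j \text{GROUP}_{ji} + \delta_1 \text{SAT}_i + \delta_2 \ln \text{PI}_i, \tag{2.6}$$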

where Greek letters, as always, are parameters. When the CEF of $\ln Y_i$ is a linear function of the conditioning variables as in equation (2.6), the regression of $\ln Y_i$ on these same conditioning variables recovers this linear function. (We skip a detailed proof of this fact, though it’s not hard to show.) In particular, given linearity, the coefficient on $P_i$ in equation (2.2) will be equal to the coefficient on $P_i$ in equation (2.6).

With a linear CEF, regression estimates of private school effects based on equation (2.2) are also identical to those we’d get from a strategy that (i) matches students by values of $\text{GROUP}_i$, $\text{SAT}_i$, and $\ln \text{PI}_i$; (ii) compares the average earnings of matched students who went to private ($P_i = 1$) and public ($P_i = 0$) schools for each possible combination of the conditioning variables; and (iii) produces a single average by averaging all of these cell-specific contrasts. To see this, it’s enough to use equation (2.6) to write cell-specific comparisons as

$$E[\ln Y_i \mid P_i = 1, \text{GROUP}_i, \text{SAT}_i, \ln \text{PI}_i] - E[\ln Y_i \mid P_i = 0, \text{GROUP}_i, \text{SAT}_i, \ln \text{PI}_i] = \beta.$$

Because our linear model for the CEF assumes that the effect of private school attendance is equal to the constant $\beta$ in every cell, any weighted average of cell-specific private-attendance contrasts is also equal to $\beta$.
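A small simulation can make this equivalence concrete. The sketch below is an illustration under assumptions of our own: the conditioning variables are collapsed into four hypothetical cells, the parameter values are invented, and private attendance is made more likely in higher-earning cells so that the raw private-public gap is misleading. It is not a reconstruction of the estimates in Table 2.3.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical parameters of a linear CEF with one dummy P and four cells.
alpha, beta = 1.0, 0.3
gamma = np.array([0.0, 0.2, 0.4, 0.6])   # cell effects (cell 0 is the reference)

n = 20_000
group = rng.integers(0, 4, size=n)
# Private attendance is more likely in higher cells: probabilities .2, .4, .6, .8.
p = (rng.random(n) < 0.2 + 0.2 * group).astype(float)
y = alpha + beta * p + gamma[group] + rng.normal(0.0, 0.5, size=n)

# Matching: private-minus-public contrast within each cell, then a simple average.
contrasts = np.array([
    y[(group == g) & (p == 1)].mean() - y[(group == g) & (p == 0)].mean()
    for g in range(4)
])
matching_estimate = contrasts.mean()

# Regression: y on a constant, P, and dummies for cells 1 through 3.
dummies = (group[:, None] == np.arange(1, 4)).astype(float)
X = np.column_stack([np.ones(n), p, dummies])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
regression_estimate = coef[1]

# The raw comparison ignores cells and overstates the effect in this simulation.
raw_gap = y[p == 1].mean() - y[p == 0].mean()
```

In this setup, `matching_estimate` and `regression_estimate` both land near the assumed $\beta$ of .3 up to sampling noise, while `raw_gap` is noticeably larger because private students are concentrated in high-earning cells.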

Linear models help us understand regression, but regression is a wonderfully flexible tool, useful regardless of whether the underlying CEF is linear. Regression inherits this flexibility from the following pair of closely related theoretical properties:

■ If $E[Y_i \mid X_{1i}, \ldots, X_{Ki}] = a + \sum_{k=1}^{K} b_k X_{ki}$ for some constants $a$ and $b_1, \ldots, b_K$, then the regression of $Y_i$ on $X_{1i}, \ldots, X_{Ki}$ has intercept $a$ and slopes $b_1, \ldots, b_K$. In other words, if the CEF of $Y_i$ on $X_{1i}, \ldots, X_{Ki}$ is linear, then the regression of $Y_i$ on $X_{1i}, \ldots, X_{Ki}$ is it.

■ If $E[Y_i \mid X_{1i}, \ldots, X_{Ki}]$ is a nonlinear function of the conditioning variables, then the regression of $Y_i$ on $X_{1i}, \ldots, X_{Ki}$ gives the best linear approximation to this nonlinear CEF in the sense of minimizing the expected squared deviation between the fitted values from a linear model and the CEF.

To summarize: if the CEF is linear, regression finds it; if not linear, regression finds a good approximation to it. We’ve just used the first theoretical property to interpret regression estimates of private school effects when the CEF is linear. The second property tells us that we can expect regression estimates of a treatment effect to be close to those we’d get by matching on covariates and then averaging within-cell treatment-control differences, even if the CEF isn’t linear.
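The approximation property has an exact sample counterpart that is easy to check: regressing the outcome on a discrete regressor gives the same line as regressing the sample CEF (each observation’s cell mean) on that regressor. The sketch below assumes a hypothetical nonlinear CEF, $E[Y_i \mid X_i] = \sqrt{X_i}$, chosen only for illustration; the helper `ols` is ours.

```python
import numpy as np

rng = np.random.default_rng(2)

n = 5_000
x = rng.integers(0, 11, size=n).astype(float)
# Hypothetical nonlinear CEF: E[y | x] = sqrt(x), plus noise.
y = np.sqrt(x) + rng.normal(0.0, 1.0, size=n)

def ols(x, y):
    """Intercept and slope from a least-squares regression of y on x."""
    X = np.column_stack([np.ones_like(x), x])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

# Sample CEF evaluated at each observation's own x (its cell mean).
levels, cell = np.unique(x, return_inverse=True)
cell_means = np.bincount(cell, weights=y) / np.bincount(cell)
cef_at_x = cell_means[cell]

fit_to_y = ols(x, y)            # regression of y on x
fit_to_cef = ols(x, cef_at_x)   # regression of the sample CEF on x
```

The two fits agree up to floating-point error, because the regressor is constant within cells, so replacing each outcome by its cell mean leaves the least-squares problem’s cross-products unchanged.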

Figure 2.1 documents the manner in which regression approximates the nonlinear CEF of log wages conditional on schooling. Although the CEF bounces around the regression line, this line captures the strong positive relationship between schooling and wages. Moreover, the regression slope is close to $E\{E[Y_i \mid X_i] - E[Y_i \mid X_i - 1]\}$; that is, the regression slope also comes close to the expected effect of a one-unit change in $X_i$ on $E[Y_i \mid X_i]$.

## Bivariate Regression and Covariance

Regression is closely related to the statistical concept of covariance. The covariance between two variables, $X_i$ and $Y_i$, is defined as

$$C(X_i, Y_i) = E[(X_i - E[X_i])(Y_i - E[Y_i])].$$

Covariance has three important properties:

(i) The covariance of a variable with itself is its variance; $C(X_i, X_i) = V(X_i)$.

(ii) If the expectation of either $X_i$ or $Y_i$ is 0, the covariance between them is the expectation of their product; $C(X_i, Y_i) = E[X_i Y_i]$.

(iii) The covariance between linear functions of variables $X_i$ and $Y_i$, written $W_i = a + bX_i$ and $Z_i = c + dY_i$ for constants $a$, $b$, $c$, $d$, is given by

$$C(W_i, Z_i) = bd\,C(X_i, Y_i).$$
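These three properties hold for sample covariances too, which makes them easy to verify numerically. The following sketch uses simulated data and a helper function `cov` of our own; the constants `a`, `b`, `c`, `d` are arbitrary illustrative values.

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(2.0, 1.5, size=1_000)
y = 0.5 * x + rng.normal(0.0, 1.0, size=1_000)

def cov(u, v):
    """Sample analog of C(U, V) = E[(U - E[U])(V - E[V])]."""
    return np.mean((u - u.mean()) * (v - v.mean()))

# (i) The covariance of a variable with itself is its variance.
prop_i = np.isclose(cov(x, x), x.var())

# (ii) If one variable has mean 0, the covariance is the mean of the product.
x0 = x - x.mean()
prop_ii = np.isclose(cov(x0, y), np.mean(x0 * y))

# (iii) C(a + bX, c + dY) = b * d * C(X, Y).
a, b, c, d = 1.0, 2.0, -3.0, 0.5
prop_iii = np.isclose(cov(a + b * x, c + d * y), b * d * cov(x, y))
```

Note that the additive constants `a` and `c` drop out entirely: shifting a variable changes its mean but not its deviations from that mean.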

The intimate connection between regression and covariance can be seen in a bivariate regression model, that is, a regression with one regressor, $X_i$, plus an intercept. The bivariate regression intercept and slope are the values of $a$ and $b$ that minimize the associated residual sum of squares, which we write as

$$RSS(a, b) = E[(Y_i - a - bX_i)^2].$$

The term RSS references a sum of squares because, carrying out this minimization in a particular sample, we replace expectation with a sample average or sum. The solution for the bivariate case is

$$b = \beta = \frac{C(Y_i, X_i)}{V(X_i)}, \qquad a = \alpha = E[Y_i] - \beta E[X_i]. \tag{2.7}$$

An implication of equation (2.7) is that when two variables are uncorrelated (have a covariance of 0), the regression of either one on the other generates a slope coefficient of 0. Likewise, a bivariate regression slope of 0 implies the two variables involved are uncorrelated.
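The covariance formula for the bivariate slope can be checked against a direct least-squares fit. In this sketch the data are simulated with an assumed intercept of 1 and slope of 2, and `np.polyfit` stands in for any least-squares routine.

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.normal(0.0, 1.0, size=2_000)
y = 1.0 + 2.0 * x + rng.normal(0.0, 1.0, size=2_000)

# Slope and intercept from the covariance formula (sample analogs).
slope = np.mean((x - x.mean()) * (y - y.mean())) / x.var()
intercept = y.mean() - slope * x.mean()

# The same numbers from a direct least-squares fit.
# np.polyfit returns coefficients from the highest degree down: [slope, intercept].
ls_slope, ls_intercept = np.polyfit(x, y, deg=1)
```

Because the slope is a covariance divided by a variance, it is 0 exactly when the sample covariance is 0.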

## Fits and Residuals

Regression breaks any dependent variable into two pieces. Specifically, for dependent variable $Y_i$, we can write

$$Y_i = \hat{Y}_i + e_i.$$

The first term consists of the fitted values, $\hat{Y}_i$, sometimes said to be the part of $Y_i$ that’s “explained” by the model. The second part, the residuals, $e_i$, is what’s left over.

Regression residuals and the regressors included in the model that produced them are uncorrelated. In other words, if $e_i$ is the residual from a regression on $X_{1i}, \ldots, X_{Ki}$, then the regression of $e_i$ on these same variables produces coefficients that are all 0. Because fitted values are a linear combination of regressors, they’re also uncorrelated with residuals. We summarize these important properties here.

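PROPERTIES OF RESIDUALS Suppose that $\alpha$ and $\beta_1, \ldots, \beta_K$ are the intercept and slope coefficients from a regression of $Y_i$ on $X_{1i}, \ldots, X_{Ki}$. The fitted values from this regression are

$$\hat{Y}_i = \alpha + \sum_{k=1}^{K} \beta_k X_{ki},$$

and the associated regression residuals are

$$e_i = Y_i - \hat{Y}_i = Y_i - \alpha - \sum_{k=1}^{K} \beta_k X_{ki}.$$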

Regression residuals

(i) have expectation and sample mean 0: $E[e_i] = \sum_{i=1}^{n} e_i = 0$;

(ii) are uncorrelated in both population and sample with all regressors that made them and with the corresponding fitted values. That is, for each regressor, $X_{ki}$,

$$E[X_{ki} e_i] = \sum_{i=1}^{n} X_{ki} e_i = 0 \quad \text{and} \quad E[\hat{Y}_i e_i] = \sum_{i=1}^{n} \hat{Y}_i e_i = 0.$$
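The sample versions of these properties hold for any least-squares fit that includes an intercept, whether or not the CEF is linear. The sketch below checks them on simulated data; the two regressors and the deliberately nonlinear outcome equation are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 1_000
x1 = rng.normal(size=n)
x2 = rng.uniform(size=n)
# Hypothetical outcome; the CEF is nonlinear in x2 on purpose.
y = 1.0 + 0.5 * x1 - 2.0 * x2 ** 2 + rng.normal(size=n)

# Regress y on a constant, x1, and x2.
X = np.column_stack([np.ones(n), x1, x2])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ coef
resid = y - fitted

sum_resid = resid.sum()              # property (i): residuals sum to 0
sum_x_resid = X[:, 1:].T @ resid     # property (ii): one sum per regressor
sum_fit_resid = fitted @ resid       # property (ii): fitted values
```

All three quantities are 0 up to floating-point rounding, so comparisons should use a small tolerance rather than exact equality.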

You can take these properties on faith, but for those who know a little calculus, they’re easy to establish. Start with the fact that regression parameters and estimates minimize the residual sum of squares. The first-order conditions for this minimization problem amount to statements equivalent to (i) and (ii).
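For readers who want to see the calculus, here is a sketch of the argument in the population, writing the minimization problem with $K$ regressors in the same way as the bivariate problem. Differentiating the expected squared residual with respect to the intercept and each slope, and setting the derivatives to 0 at the minimizing values $\alpha, \beta_1, \ldots, \beta_K$, gives

$$
\begin{aligned}
\frac{\partial}{\partial a} E\Big[\big(Y_i - a - \textstyle\sum_{k} b_k X_{ki}\big)^2\Big] = 0
&\;\Longrightarrow\; -2E\Big[Y_i - \alpha - \textstyle\sum_{k} \beta_k X_{ki}\Big] = -2E[e_i] = 0, \\
\frac{\partial}{\partial b_k} E\Big[\big(Y_i - a - \textstyle\sum_{k} b_k X_{ki}\big)^2\Big] = 0
&\;\Longrightarrow\; -2E\Big[X_{ki}\big(Y_i - \alpha - \textstyle\sum_{k} \beta_k X_{ki}\big)\Big] = -2E[X_{ki} e_i] = 0.
\end{aligned}
$$

The first line is property (i) and the second is the regressor part of property (ii). The statement about fitted values follows by linearity: $E[\hat{Y}_i e_i] = \alpha E[e_i] + \sum_{k} \beta_k E[X_{ki} e_i] = 0$. Replacing expectations with sums over the sample gives the sample versions.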