# Regression Anatomy and the OVB Formula

The most interesting regressions are multiple; that is, they include a causal variable of interest, plus one or more control variables. Equation (2.2), for example, regresses log earnings on a dummy for private college attendance in a model that controls for ability, family background, and the selectivity of schools that students have applied to and been admitted to. We’ve argued that control for covariates in a regression model is much like matching. That is, the regression coefficient on a private school dummy in a model with controls is similar to what we’d get if we divided students into cells based on these controls, compared public school and private school students within these cells, and then took an average of the resulting set of conditional comparisons. Here, we offer a more detailed “regression anatomy” lesson.

Suppose the causal variable of interest is $X_{1i}$ (say, a dummy for private school) and the control variable is $X_{2i}$ (say, SAT scores). With a little work, the coefficient on $X_{1i}$ in a regression controlling for $X_{2i}$ can be written as

$$\beta_1 = \frac{C(Y_i, \tilde{X}_{1i})}{V(\tilde{X}_{1i})},$$

where $\tilde{X}_{1i}$ is the residual from a regression of $X_{1i}$ on $X_{2i}$:

$$X_{1i} = \pi_0 + \pi_1 X_{2i} + \tilde{X}_{1i}.$$

As always, residuals are uncorrelated with the regressors that made them, and so it is for the residual $\tilde{X}_{1i}$. It’s not surprising, therefore, that the coefficient on $X_{1i}$ in a multivariate regression that controls for $X_{2i}$ is the bivariate coefficient from a model that includes only the part of $X_{1i}$ that is uncorrelated with $X_{2i}$. This important regression anatomy formula shapes our understanding of regression coefficients from around the world.

The regression anatomy idea extends to models with more than two regressors. The multivariate coefficient on a given regressor can be written as the coefficient from a bivariate regression on the residual from regressing this regressor on all others. Here’s the anatomy of the $k$th coefficient in a model with $K$ regressors:

REGRESSION ANATOMY

$$\beta_k = \frac{C(Y_i, \tilde{X}_{ki})}{V(\tilde{X}_{ki})},$$

where $\tilde{X}_{ki}$ is the residual from a regression of $X_{ki}$ on the $K - 1$ other covariates included in the model.
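The anatomy formula can be checked numerically. The sketch below is an illustration on simulated data, not part of the original text: the data-generating process, the variable names, and the helper functions `ols` and `anatomy_coef` are all assumptions made for the example. It compares the coefficient from a full multivariate regression with the ratio $C(Y_i, \tilde{X}_{ki})/V(\tilde{X}_{ki})$ built from the auxiliary-regression residual.

```python
import numpy as np

def ols(y, X):
    """OLS coefficients of y on the columns of X (X must already hold a constant)."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

def anatomy_coef(y, X, k):
    """Coefficient on non-constant column k of X via the regression anatomy formula."""
    others = np.delete(X, k, axis=1)                    # constant plus the K-1 other regressors
    x_tilde = X[:, k] - others @ ols(X[:, k], others)   # residual from the auxiliary regression
    # The residual has mean zero, so C(Y, x_tilde) / V(x_tilde) is a ratio of cross-products.
    return (y @ x_tilde) / (x_tilde @ x_tilde)

# Simulated data (hypothetical): x1 is correlated with the controls x2 and x3.
rng = np.random.default_rng(0)
n = 5_000
x2 = rng.normal(size=n)
x3 = rng.normal(size=n)
x1 = 0.5 * x2 - 0.3 * x3 + rng.normal(size=n)
y = 1.0 + 2.0 * x1 - 1.0 * x2 + 0.5 * x3 + rng.normal(size=n)

X = np.column_stack([np.ones(n), x1, x2, x3])
beta_full = ols(y, X)                   # multivariate regression
beta_anatomy = anatomy_coef(y, X, k=1)  # bivariate regression on the residualized x1

print(beta_full[1], beta_anatomy)       # the two numbers agree up to floating-point error
```

The agreement is an algebraic property of least squares, so it holds in any sample, not just on average.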

Regression anatomy is especially revealing when the controls consist of dummy variables, as in equation (2.2). For the purposes of this discussion, we simplify the model of interest to have only dummy controls, that is,

$$\ln Y_i = \alpha + \beta P_i + \sum_{j=1}^{150} \gamma_j GROUP_{ji} + e_i. \tag{2.9}$$

Regression anatomy tells us that the coefficient on $P_i$ controlling for the set of 150 $GROUP_{ji}$ dummies is the bivariate coefficient from a regression on $\tilde{P}_i$, where this is the residual from a regression of $P_i$ on a constant and the set of 150 $GROUP_{ji}$ dummies.

It’s helpful here to add a second subscript to index groups as well as individuals. In this scheme, $\ln Y_{ij}$ is the log earnings of college graduate $i$ in selectivity group $j$, while $P_{ij}$ is this graduate’s private school enrollment status. What is the residual, $\tilde{P}_{ij}$, from the auxiliary regression of $P_{ij}$ on the set of 150 selectivity-group dummies? Because the auxiliary regression that generates $\tilde{P}_{ij}$ has a parameter for every possible value of the underlying CEF, this regression captures the CEF of $P_{ij}$ conditional on selectivity group perfectly. (Here we’re extending the dummy-variable result described by equation (2.8) to regression on dummies describing a categorical variable that takes on many values instead of just two.) Consequently, the fitted value from a regression of $P_{ij}$ on the full set of selectivity-group dummies is the mean private school attendance rate in each group. For applicant $i$ in group $j$, the auxiliary regression residual is therefore $\tilde{P}_{ij} = P_{ij} - \bar{P}_j$, where $\bar{P}_j$ is shorthand for the mean private school enrollment rate in the selectivity group to which $i$ belongs.

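Finally, putting the pieces together, regression anatomy tells us that the multivariate $\beta$ in the model described by equation (2.9) is

$$\beta = \frac{C(\ln Y_{ij}, \tilde{P}_{ij})}{V(\tilde{P}_{ij})} = \frac{C(\ln Y_{ij}, P_{ij} - \bar{P}_j)}{V(P_{ij} - \bar{P}_j)}. \tag{2.10}$$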

This expression reveals that, just as if we were to manually sort students into groups and compare public and private students within each group, regression on private school attendance with control for selectivity-group dummies is also a within-group procedure: variation across groups is removed by subtracting $\bar{P}_j$ to construct the residual, $\tilde{P}_{ij}$. Moreover, as for groups C and D in Table 2.1, equation (2.10) implies that applicant groups in which everyone attends either a public or private institution are uninformative about the effects of private school attendance because $P_{ij} - \bar{P}_j$ is 0 for everyone in such groups.
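The within-group interpretation can also be illustrated on simulated data. The following is a minimal sketch under stated assumptions: six hypothetical selectivity groups rather than the many groups in the text, made-up attendance rates and earnings, and a full set of group dummies with no separate constant (which spans the same space as a constant plus all but one dummy). It compares the regression coefficient on the private dummy with the ratio built from group-demeaned attendance, and confirms that a group with no private attendees contributes a residual of zero.

```python
import numpy as np

rng = np.random.default_rng(1)
n, n_groups = 3_000, 6
group = rng.integers(0, n_groups, size=n)

# Hypothetical private attendance rates by group; the last group is all-public.
p_rate = np.array([0.2, 0.4, 0.5, 0.6, 0.8, 0.0])
private = (rng.random(n) < p_rate[group]).astype(float)
log_earn = 10 + 0.1 * private + 0.05 * group + rng.normal(scale=0.5, size=n)

# Regression of log earnings on the private dummy and a full set of group dummies.
D = (group[:, None] == np.arange(n_groups)).astype(float)
X = np.column_stack([private, D])
beta_reg = np.linalg.lstsq(X, log_earn, rcond=None)[0][0]

# Regression anatomy: the auxiliary residual is attendance minus its group mean.
group_mean = (D.T @ private) / D.sum(axis=0)
p_tilde = private - group_mean[group]
beta_within = (log_earn @ p_tilde) / (p_tilde @ p_tilde)

print(beta_reg, beta_within)                      # agree up to floating-point error
print(np.all(p_tilde[group == n_groups - 1] == 0))  # all-public group: residual is zero
```

Because the residual is identically zero in the all-public group, observations from that group drop out of both the numerator and the denominator of the anatomy ratio.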

The OVB formula, used at the end of this chapter (in Section 2.3) to interpret estimates from models with different sets of controls, provides another revealing take on regression anatomy. Call the coefficient on $X_{1i}$ in a multivariate regression model controlling for $X_{2i}$ the long regression coefficient, $\beta^l$:

$$Y_i = \alpha^l + \beta^l X_{1i} + \gamma X_{2i} + e_i^l.$$

Call the coefficient on $X_{1i}$ in a bivariate regression (that is, without $X_{2i}$) the short regression coefficient, $\beta^s$:

$$Y_i = \alpha^s + \beta^s X_{1i} + e_i^s.$$

The OVB formula describes the relationship between short and long coefficients as follows.

OMITTED VARIABLES BIAS (OVB) FORMULA

$$\beta^s = \beta^l + \pi_{21}\gamma,$$

where $\gamma$ is the coefficient on $X_{2i}$ in the long regression, and $\pi_{21}$ is the coefficient on $X_{1i}$ in a regression of $X_{2i}$ on $X_{1i}$. In words: short equals long plus the effect of omitted times the regression of omitted on included.

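This central formula is worth deriving. The slope coefficient in the short model is

$$\beta^s = \frac{C(Y_i, X_{1i})}{V(X_{1i})}. \tag{2.11}$$

Substituting the long model for $Y_i$ in equation (2.11) gives

$$\frac{C(\alpha^l + \beta^l X_{1i} + \gamma X_{2i} + e_i^l,\, X_{1i})}{V(X_{1i})} = \frac{\beta^l V(X_{1i}) + \gamma C(X_{2i}, X_{1i}) + C(e_i^l, X_{1i})}{V(X_{1i})} = \beta^l + \gamma \frac{C(X_{2i}, X_{1i})}{V(X_{1i})} = \beta^l + \pi_{21}\gamma.$$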

The first equals sign comes from the fact that the covariance of a linear combination of variables is the corresponding linear combination of covariances after distributing terms. Also, the covariance of a constant with anything else is 0, and the covariance of a variable with itself is the variance of that variable. The second equals sign comes from the fact that $C(e_i^l, X_{1i}) = 0$, because residuals are uncorrelated with the regressors that made them ($e_i^l$ is the residual from a regression that includes $X_{1i}$). The third equals sign defines $\pi_{21}$ to be the coefficient on $X_{1i}$ in a regression of $X_{2i}$ on $X_{1i}$.
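Since every step in the derivation is an algebraic property of least squares, the OVB formula holds exactly in any sample. The sketch below checks this on simulated data; the data-generating process, the variable names, and the helper `ols` are assumptions introduced for illustration only.

```python
import numpy as np

def ols(y, *regressors):
    """OLS coefficients of y on a constant and the given regressors (constant first)."""
    X = np.column_stack([np.ones(len(y)), *regressors])
    return np.linalg.lstsq(X, y, rcond=None)[0]

# Simulated data (hypothetical): x2 is correlated with x1 and affects y.
rng = np.random.default_rng(2)
n = 10_000
x1 = rng.normal(size=n)
x2 = 0.7 * x1 + rng.normal(size=n)
y = 1.0 + 0.5 * x1 + 2.0 * x2 + rng.normal(size=n)

_, beta_l, gamma = ols(y, x1, x2)   # long regression
_, beta_s = ols(y, x1)              # short regression (omits x2)
_, pi_21 = ols(x2, x1)              # regression of omitted on included

print(beta_s, beta_l + pi_21 * gamma)  # short equals long plus pi_21 times gamma
```

In this simulated design the omitted variable is positively related to both $X_{1i}$ and the outcome, so the short coefficient exceeds the long one.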

Often, as in the discussion of equations (2.2) and (2.5), we’re interested in short vs. long comparisons across regression models that include a set of controls common to both models. The OVB formula for this scenario is a straightforward extension of the one above. Call the coefficient on $X_{1i}$ in a multivariate regression controlling for $X_{2i}$ and $X_{3i}$ the long regression coefficient, $\beta^l$; call the coefficient on $X_{1i}$ in a multivariate regression controlling only for $X_{3i}$ (that is, without $X_{2i}$) the short regression coefficient, $\beta^s$. The OVB formula in this case can still be written

$$\beta^s = \beta^l + \pi_{21}\gamma, \tag{2.12}$$

where $\gamma$ is the coefficient on $X_{2i}$ in the long regression, but that regression now includes $X_{3i}$ as well as $X_{2i}$, and $\pi_{21}$ is the coefficient on $X_{1i}$ in a regression of $X_{2i}$ on both $X_{1i}$ and $X_{3i}$. Once again, we can say: short equals long plus the effect of omitted times the regression of omitted on included. We leave it to the reader to derive equation (2.12): this derivation tests your understanding (and makes an awesome exam question).
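Without giving away the derivation, the version of the formula with a common control can at least be verified numerically. This is a sketch on simulated data; the coefficients, variable names, and the helper `ols` are hypothetical choices for the example.

```python
import numpy as np

def ols(y, *regressors):
    """OLS coefficients of y on a constant and the given regressors (constant first)."""
    X = np.column_stack([np.ones(len(y)), *regressors])
    return np.linalg.lstsq(X, y, rcond=None)[0]

# Simulated data (hypothetical): x3 is a control common to both models.
rng = np.random.default_rng(3)
n = 10_000
x3 = rng.normal(size=n)
x1 = 0.4 * x3 + rng.normal(size=n)
x2 = 0.6 * x1 - 0.5 * x3 + rng.normal(size=n)
y = 1.0 + 0.5 * x1 + 2.0 * x2 + 1.5 * x3 + rng.normal(size=n)

_, beta_l, gamma, _ = ols(y, x1, x2, x3)  # long: controls for x2 and x3
_, beta_s, _ = ols(y, x1, x3)             # short: controls for x3 only
_, pi_21, _ = ols(x2, x1, x3)             # omitted x2 on included x1 and x3

print(beta_s, beta_l + pi_21 * gamma)     # the two sides agree up to floating-point error
```

Note that the auxiliary regression must include $X_{3i}$; using the bivariate coefficient from regressing $X_{2i}$ on $X_{1i}$ alone would generally break the equality.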
