We’ve made the point that control for covariates can make the CIA more plausible. But more control is not always better. Some variables are bad controls and should not be included in a regression model even when their inclusion might be expected to change the short regression coefficients. Bad controls are variables that are themselves outcome variables in the notional experiment at hand. That is, bad controls might just as well be dependent variables too. Good controls are variables that we can think of as having been fixed at the time the regressor of interest was determined.

The essence of the bad control problem is a version of selection bias, albeit somewhat more subtle than the selection bias discussed in Chapter 2 and Section 3.2. To illustrate, suppose we are interested in the effects of a college degree on earnings and that people can work in one of two occupations, white collar and blue collar. A college degree clearly opens the door to higher-paying white collar jobs. Should occupation therefore be seen as an omitted variable in a regression of wages on schooling? After all, occupation is highly correlated with both education and pay. Perhaps it’s best to look at the effect of college on wages for those within an occupation, say white collar only. The problem with this argument is that once we acknowledge the fact that college affects occupation, comparisons of wages by college degree status within an occupation are no longer apples-to-apples, even if college degree completion is randomly assigned.

Here is a formal illustration of the bad control problem in the college/occupation example. Let $w_i$ be a dummy variable that denotes white collar workers and let $y_i$ denote earnings. The realization of these variables is determined by college graduation status and potential outcomes that are indexed against $c_i$. We have

$$y_i = c_i y_{1i} + (1 - c_i) y_{0i},$$

$$w_i = c_i w_{1i} + (1 - c_i) w_{0i},$$

where $c_i = 1$ for college graduates and is zero otherwise, $\{y_{1i}, y_{0i}\}$ denotes potential earnings, and $\{w_{1i}, w_{0i}\}$ denotes potential white-collar status. We assume that $c_i$ is randomly assigned, so it is independent of all potential outcomes. We have no trouble estimating the causal effect of $c_i$ on either $y_i$ or $w_i$, since independence gives us

$$E[y_i \mid c_i = 1] - E[y_i \mid c_i = 0] = E[y_{1i} - y_{0i}],$$

$$E[w_i \mid c_i = 1] - E[w_i \mid c_i = 0] = E[w_{1i} - w_{0i}].$$

In practice, we might estimate these average treatment effects by regressing $y_i$ and $w_i$ on $c_i$.

Bad control means that a comparison of earnings conditional on $w_i$ does not have a causal interpretation. Consider the difference in mean earnings between college graduates and others conditional on working at a white collar job. We can compute this in a regression model that includes $w_i$, or by regressing $y_i$ on $c_i$ in the sample where $w_i = 1$. The estimand in the latter case is the difference in means with $c_i$ switched off and on, conditional on $w_i = 1$:

$$E[y_i \mid w_i = 1, c_i = 1] - E[y_i \mid w_i = 1, c_i = 0] = E[y_{1i} \mid w_{1i} = 1, c_i = 1] - E[y_{0i} \mid w_{0i} = 1, c_i = 0]. \quad (3.2.12)$$

By the joint independence of $\{y_{1i}, w_{1i}, y_{0i}, w_{0i}\}$ and $c_i$, we have

$$E[y_{1i} \mid w_{1i} = 1, c_i = 1] - E[y_{0i} \mid w_{0i} = 1, c_i = 0] = E[y_{1i} \mid w_{1i} = 1] - E[y_{0i} \mid w_{0i} = 1].$$

This expression illustrates the apples-to-oranges nature of the bad-control problem:

$$E[y_{1i} \mid w_{1i} = 1] - E[y_{0i} \mid w_{0i} = 1] = \underbrace{E[y_{1i} - y_{0i} \mid w_{1i} = 1]}_{\text{causal effect on college grads}} + \underbrace{\{E[y_{0i} \mid w_{1i} = 1] - E[y_{0i} \mid w_{0i} = 1]\}}_{\text{selection bias}}.$$

In other words, the difference in wages between those with and without a college degree conditional on working in a white collar job equals the causal effect of college on those with $w_{1i} = 1$ (people who work at a white collar job when they have a college degree), plus a selection-bias term that reflects the fact that college changes the composition of the pool of white collar workers.

The selection bias in this context can be positive or negative, depending on the relation between occupational choice, college attendance, and potential earnings. The main point is that even if $y_{1i} = y_{0i}$, so that there is no causal effect of college on wages, the conditional comparison in (3.2.12) will not tell us this (the regression of $y_i$ on $w_i$ and $c_i$ has exactly the same problem). It is also incorrect to say that the conditional comparison captures the part of the effect of college that is "not explained by occupation." In fact, the conditional comparison does not tell us much that is useful without a more elaborate model of the links between college, occupation, and earnings.
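The selection-bias story above is easy to see in a small simulation. The sketch below uses a hypothetical data-generating process (the numbers, the latent-ability variable, and the occupational thresholds are our own illustrative assumptions, not from the text): college is randomly assigned and has exactly zero causal effect on earnings, yet the graduate/non-graduate earnings gap computed within white collar workers is far from zero.

```python
# Hypothetical simulation of the bad-control problem: college (c) is randomly
# assigned and has NO causal effect on earnings (y1 = y0 for everyone), yet
# comparing earnings by college status *within* white collar workers is biased.
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

ability = rng.normal(size=n)           # latent skill; drives earnings
c = rng.integers(0, 2, size=n)         # college degree, randomly assigned

# Potential earnings coincide: college has zero causal effect on wages.
y = 10 + 2 * ability + rng.normal(size=n)

# Potential white-collar status: a degree opens the door, so graduates enter
# white collar work at a much lower ability threshold than non-graduates.
w = np.where(c == 1, ability > -1.0, ability > 1.0).astype(int)

# Unconditional comparison recovers the (zero) causal effect on earnings.
uncond = y[c == 1].mean() - y[c == 0].mean()

# Conditional-on-w comparison is contaminated by selection: white collar
# non-graduates are an elite high-ability group, graduates a broader one.
wc = w == 1
cond = y[wc & (c == 1)].mean() - y[wc & (c == 0)].mean()

print(f"unconditional difference:   {uncond:+.3f}")  # close to zero
print(f"within white collar jobs:   {cond:+.3f}")    # large and negative
```

Here the within-occupation contrast is strongly negative purely because college changes who ends up in white collar work, exactly the selection-bias term in the decomposition above.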

As an empirical illustration, we see that the addition of two-digit occupation dummies indeed reduces the schooling coefficient in the NLSY models reported in Table 3.2.1, in this case from .087 to .066. However, it’s hard to say what we should make of this decline. The change in schooling coefficients when we add occupation dummies may simply be an artifact of selection bias. So we would do better to control only for variables that are not themselves caused by education.

A second version of the bad control scenario involves proxy control, that is, the inclusion of variables that might partially control for omitted factors, but are themselves affected by the variable of interest. A simple version of the proxy-control scenario goes like this: Suppose you are interested in a long regression, similar to equation (3.2.10),

$$y_i = \alpha + \rho s_i + \gamma a_i + \varepsilon_i, \quad (3.2.13)$$

where for the purposes of this discussion we’ve replaced the vector of controls $A_i$ with a scalar ability measure $a_i$. Think of this as an IQ score that measures innate ability in eighth grade, before any relevant schooling choices are made (assuming everyone completes eighth grade). The error term in this equation satisfies $E[s_i \varepsilon_i] = E[a_i \varepsilon_i] = 0$ by definition. Since $a_i$ is measured before $s_i$ is determined, it is a good control.

Equation (3.2.13) is the regression of interest, but unfortunately, data on $a_i$ are unavailable. However, you have a second ability measure collected later, after schooling is completed (say, the score on a test used to screen job applicants). Call this variable "late ability," $a_{li}$. In general, schooling increases late ability relative to innate ability. To be specific, suppose

$$a_{li} = \pi_0 + \pi_1 s_i + \pi_2 a_i. \quad (3.2.14)$$

By this, we mean to say that both schooling and innate ability increase late or measured ability. There is almost certainly some randomness in measured ability as well, but we can make our point more simply via the deterministic link, (3.2.14).

You’re worried about OVB in the regression of $y_i$ on $s_i$ alone, so you propose to regress $y_i$ on $s_i$ and late ability, $a_{li}$, since the desired control, $a_i$, is unavailable. Using (3.2.14) to substitute for $a_i$ in (3.2.13), the regression on $s_i$ and $a_{li}$ is

$$y_i = \left(\alpha - \gamma \frac{\pi_0}{\pi_2}\right) + \left(\rho - \gamma \frac{\pi_1}{\pi_2}\right) s_i + \frac{\gamma}{\pi_2} a_{li} + \varepsilon_i. \quad (3.2.15)$$

In this scenario, $\gamma$, $\pi_1$, and $\pi_2$ are all positive, so $\rho - \gamma \frac{\pi_1}{\pi_2}$ is too small unless $\pi_1$ turns out to be zero. In other words, use of a proxy control that is increased by the variable of interest generates a coefficient below the desired effect. Importantly, $\pi_1$ can be investigated to some extent: if the regression of $a_{li}$ on $s_i$ is zero, you might feel better about assuming that $\pi_1$ is zero in (3.2.14).

There is an interesting ambiguity in the proxy-control story that is not present in the first bad-control story. Control for outcome variables is simply misguided; you do not want to control for occupation in a schooling regression if the regression is to have a causal interpretation. In the proxy-control scenario, however, your intentions are good. And while proxy control does not generate the regression coefficient of interest, it may be an improvement on no control at all. Recall that the motivation for proxy control is equation (3.2.13). In terms of the parameters in this model, the OVB formula tells us that a regression on $s_i$ with no controls generates a coefficient of $\rho + \gamma \delta_{as}$, where $\delta_{as}$ is the slope coefficient from a regression of $a_i$ on $s_i$. The schooling coefficient in (3.2.15) might be closer to $\rho$ than the coefficient you estimate with no control at all. Moreover, assuming $\delta_{as}$ is positive, you can safely say that the causal effect of interest lies between these two.
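This bracketing result can also be checked numerically. The sketch below simulates the proxy-control model of (3.2.13)-(3.2.14) with hypothetical parameter values ($\rho$, $\gamma$, $\pi_1$, $\pi_2$, and the schooling equation are our own illustrative assumptions): the no-control coefficient overshoots $\rho$ by $\gamma \delta_{as}$, the proxy-control coefficient undershoots it by $\gamma \pi_1 / \pi_2$, and the true effect lies between the two.

```python
# Hypothetical simulation of proxy control, eqs. (3.2.13)-(3.2.15): the proxy
# coefficient is rho - gamma*pi1/pi2, the no-control coefficient is
# rho + gamma*delta_as, and rho is bracketed by the two.
import numpy as np

rng = np.random.default_rng(1)
n = 500_000

a = rng.normal(size=n)                    # innate (early) ability, unobserved
s = 12 + 0.8 * a + rng.normal(size=n)     # schooling, correlated with ability
rho, gamma = 0.10, 0.05                   # true causal parameters
y = 1.0 + rho * s + gamma * a + 0.1 * rng.normal(size=n)

pi1, pi2 = 0.3, 0.7
a_late = 0.2 + pi1 * s + pi2 * a          # late ability, raised by schooling

def ols(y, regressors):
    """OLS with an intercept; returns the coefficient vector."""
    X = np.column_stack([np.ones(len(y))] + regressors)
    return np.linalg.lstsq(X, y, rcond=None)[0]

b_short = ols(y, [s])[1]                  # ~ rho + gamma*delta_as (too big)
b_proxy = ols(y, [s, a_late])[1]          # ~ rho - gamma*pi1/pi2  (too small)

print(f"no control:    {b_short:.4f}")
print(f"proxy control: {b_proxy:.4f}")
print(f"true rho:      {rho:.4f}")        # lies between the two estimates
```

In this design $\delta_{as} = 0.8/1.64 \approx 0.49$, so the no-control coefficient comes out near $0.124$ and the proxy-control coefficient near $0.079$, with the true $\rho = 0.10$ in between, as the bracketing argument predicts.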

One moral of both the bad-control and the proxy-control stories is that when thinking about controls, timing matters. Variables measured before the variable of interest was determined are generally good controls. In particular, because these variables were determined before the variable of interest, they cannot themselves be outcomes in the causal nexus. In many cases, however, the timing is uncertain or unknown. In such cases, clear reasoning about causal channels requires explicit assumptions about what happened first, or the assertion that none of the control variables are themselves caused by the regressor of interest.