# The Bias of Robust Standard Errors*

$$\hat{\beta} = \Big[\sum_i X_iX_i'\Big]^{-1}\sum_i X_iY_i = (X'X)^{-1}X'y,$$

where $X$ is the $N \times K$ matrix with rows $X_i'$ and $y$ is the $N \times 1$ vector of $Y_i$'s. We saw in Section 3.1.3 that $\hat{\beta}$ has an asymptotically Normal distribution. We can write:

$$\sqrt{N}(\hat{\beta} - \beta) \xrightarrow{d} \mathcal{N}(0, \Omega),$$

where $\Omega$ is the asymptotic covariance matrix. Repeating (3.1.7), the formula for $\Omega$ in this case is

$$\Omega_r = E[X_iX_i']^{-1}\,E[X_iX_i'e_i^2]\,E[X_iX_i']^{-1}, \qquad (8.1.1)$$

where $e_i = Y_i - X_i'\beta$. When residuals are homoskedastic, $\Omega_r$ simplifies to $\Omega_c = \sigma^2 E[X_iX_i']^{-1}$, where $\sigma^2 = E[e_i^2]$.

We are concerned here with the bias of robust standard errors in independent samples (i.e., no clustering or serial correlation). To simplify the derivation of bias, we assume that the regressor vector can be treated as fixed in repeated samples, as it would be if we sampled stratifying on $X_i$. Non-stochastic regressors give a benchmark sampling model that is often used to look at finite-sample distributions. It turns out that we miss little by making this assumption, while simplifying the derivations considerably. With fixed regressors, we have

$$\Omega_r = \Big[\frac{X'X}{N}\Big]^{-1}\Big[\frac{1}{N}\sum_i X_iX_i'\psi_i\Big]\Big[\frac{X'X}{N}\Big]^{-1}, \qquad \Omega_c = \sigma^2\Big[\frac{X'X}{N}\Big]^{-1},$$

where $\Psi = E[ee'] = \mathrm{diag}(\psi_i)$ is the variance matrix of the residuals. Under homoskedasticity, $\psi_i = \sigma^2$.

Asymptotic standard errors are given by the square roots of the diagonal elements of $\Omega_r$ and $\Omega_c$, after removing the asymptotic normalization by dividing by $N$.

In practice, the asymptotic covariance matrix must be estimated. The old-fashioned or conventional variance matrix estimator is

$$\hat{\Omega}_c = (X'X)^{-1}\hat{\sigma}^2,$$

where $\hat{e}_i = Y_i - X_i'\hat{\beta}$ is the estimated regression residual, and $\hat{\sigma}^2 = \frac{1}{N}\sum_i \hat{e}_i^2$ estimates the residual variance. The corresponding robust variance matrix estimator is

$$\hat{\Omega}_r = (X'X)^{-1}\Big(\sum_i X_iX_i'\hat{e}_i^2\Big)(X'X)^{-1}.$$

We can think of the middle term as an estimator of the form $\sum_i X_iX_i'\hat{\psi}_i$, where $\hat{\psi}_i = \hat{e}_i^2$ estimates $\psi_i$.
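To make the two estimators concrete, here is a minimal numpy sketch that computes $\hat{\Omega}_c$ and the robust $\hat{\Omega}_r$ directly from the formulas above. The simulated data and all variable names are our own illustration, not from the text.

```python
import numpy as np

# Simulated data (illustrative only): N observations, K regressors incl. a constant
rng = np.random.default_rng(0)
N, K = 200, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, K - 1))])
y = X @ np.array([1.0, 0.5, -0.25]) + rng.normal(size=N)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y              # OLS: (X'X)^{-1} X'y
e_hat = y - X @ beta_hat                  # estimated residuals

# Conventional: (X'X)^{-1} * sigma2_hat, with sigma2_hat = (1/N) sum e_i^2
sigma2_hat = (e_hat ** 2).mean()
omega_c_hat = XtX_inv * sigma2_hat

# Robust: (X'X)^{-1} (sum_i X_i X_i' e_i^2) (X'X)^{-1}
middle = (X * (e_hat ** 2)[:, None]).T @ X
omega_r_hat = XtX_inv @ middle @ XtX_inv

se_conv = np.sqrt(np.diag(omega_c_hat))
se_robust = np.sqrt(np.diag(omega_r_hat))
```

The only difference between the two is the "middle" of the sandwich: a single pooled $\hat{\sigma}^2$ versus observation-specific $\hat{e}_i^2$ weights.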

By the law of large numbers and Slutsky theorems, $N\hat{\Omega}_c$ converges in probability to $\Omega_c$ while $N\hat{\Omega}_r$ converges to $\Omega_r$. But in finite samples, both variance estimators are biased. The bias in $\hat{\Omega}_c$ is well known from classical least-squares theory and easy to correct. Less appreciated is the fact that if the residuals are homoskedastic, the robust estimator is more biased than the conventional one, perhaps a lot more. From this we conclude that robust standard errors can be more misleading than conventional standard errors in situations where heteroskedasticity is modest. We also propose a rule of thumb that uses the maximum of old-fashioned and robust standard errors to avoid gross misjudgments of precision.

With non-stochastic regressors, we have

$$E[\hat{\Omega}_c] = (X'X)^{-1}E[\hat{\sigma}^2] = (X'X)^{-1}\frac{1}{N}\sum_i E[\hat{e}_i^2].$$

To analyze $E[\hat{e}_i^2]$, start by expanding $\hat{e} = y - X\hat{\beta}$:

$$\hat{e} = y - X(X'X)^{-1}X'y = \big[I_N - X(X'X)^{-1}X'\big](X\beta + e) = Me,$$

where $e$ is the vector of population residuals, $M = I_N - X(X'X)^{-1}X'$ is a non-stochastic residual-maker matrix with $i$th row $m_i'$, and $I_N$ is the $N \times N$ identity matrix. Then $\hat{e}_i = m_i'e$, and

$$E[\hat{e}_i^2] = E[m_i'ee'm_i] = m_i'\Psi m_i.$$

To simplify further, write $m_i = \iota_i - h_i$, where $\iota_i$ is the $i$th column of $I_N$ and $h_i = X(X'X)^{-1}X_i$ is the $i$th column of the projection matrix $H = X(X'X)^{-1}X'$. Then

$$E[\hat{e}_i^2] = (\iota_i - h_i)'\,\Psi\,(\iota_i - h_i) = \psi_i - 2\psi_i h_{ii} + h_i'\Psi h_i, \qquad (8.1.4)$$

where $h_{ii}$, the $i$th diagonal element of $H$, satisfies $h_{ii} = X_i'(X'X)^{-1}X_i$.

Parenthetically, $h_{ii}$ is called the leverage of the $i$th observation. Leverage tells us how much pull a particular value of $X_i$ exerts on the regression line. Note that the $i$th fitted value (the $i$th element of $Hy$) is

$$\hat{Y}_i = h_{ii}Y_i + \sum_{j \neq i} h_{ij}Y_j.$$

A large $h_{ii}$ means that the $i$th observation has a large impact on the $i$th predicted value. In a bivariate regression with a single regressor, $X_i$,

$$h_{ii} = \frac{1}{N} + \frac{(X_i - \bar{X})^2}{\sum_j (X_j - \bar{X})^2}. \qquad (8.1.6)$$

This shows that leverage increases when $X_i$ is far from the mean. In addition to (8.1.6), we know that $h_{ii}$ is a number that lies in the interval $[0,1]$ and that $\sum_{i=1}^N h_{ii} = K$, the number of regressors (see, e.g., Hoaglin and Welsch, 1978).1
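The leverage properties just listed are easy to check numerically. The following sketch (our own illustration, with simulated data) computes $h_{ii}$ from the hat matrix and verifies the bivariate formula:

```python
import numpy as np

rng = np.random.default_rng(1)
N = 50
x = rng.normal(size=N)
X = np.column_stack([np.ones(N), x])     # bivariate regression: constant + x, so K = 2

H = X @ np.linalg.inv(X.T @ X) @ X.T     # projection ("hat") matrix
h = np.diag(H)                           # leverages h_ii

# Bivariate formula: h_ii = 1/N + (x_i - xbar)^2 / sum_j (x_j - xbar)^2
dev = x - x.mean()
h_formula = 1 / N + dev ** 2 / (dev ** 2).sum()
```

Here `h` and `h_formula` agree, every $h_{ii}$ lies in $[0,1]$, and the leverages sum to $K = 2$.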

Suppose residuals are homoskedastic, so that $\psi_i = \sigma^2$. Then (8.1.4) simplifies to

$$E[\hat{e}_i^2] = \sigma^2\big[1 - 2h_{ii} + h_i'h_i\big] = \sigma^2(1 - h_{ii}) \leq \sigma^2,$$

since $h_i'h_i = h_{ii}$ by the symmetry and idempotency of $H$. Using the properties of $h_{ii}$, we can go one step further:

$$E[\hat{\sigma}^2] = \frac{1}{N}\sum_i \sigma^2(1 - h_{ii}) = \sigma^2\,\frac{N-K}{N}.$$

Thus, the bias in $\hat{\Omega}_c$ can be fixed by a simple degrees-of-freedom correction: divide by $N-K$ instead of $N$ in the formula for $\hat{\sigma}^2$, the default in most empirical variance computations. We now want to show that under homoskedasticity the bias in $\hat{\Omega}_r$ is likely to be worse than the bias in $\hat{\Omega}_c$. The bias in the robust covariance matrix estimator is

$$E[\hat{\Omega}_r] = (X'X)^{-1}\Big(\sum_i X_iX_i'\,E[\hat{e}_i^2]\Big)(X'X)^{-1}, \qquad (8.1.8)$$

where $E[\hat{e}_i^2]$ is given by (8.1.4). Under homoskedasticity, $\psi_i = \sigma^2$ and we have $E[\hat{e}_i^2] = \sigma^2(1 - h_{ii})$, as in $\hat{\Omega}_c$. It's clear, therefore, that the bias in $\hat{e}_i^2$ tends to pull robust standard errors down. The general expression, (8.1.8), is hard to evaluate, however. Chesher and Jewitt (1987) show that as long as there is not "too much" heteroskedasticity, robust standard errors based on $\hat{\Omega}_r$ are indeed biased downwards.2

How do we know that $\hat{\Omega}_r$ is likely to be more biased than $\hat{\Omega}_c$? Partly this comes from Monte Carlo evidence (e.g., MacKinnon and White, 1985, and our own small study, discussed below). We also prove this for a bivariate example, where the single regressor, $x_i$, is assumed to be in deviations-from-means form, so there is a single coefficient. In this case, the estimator of interest is $\hat{\beta}_1 = \sum_i x_iY_i \big/ \sum_i x_i^2$, and the leverage is

$$h_{ii} = \frac{x_i^2}{\sum_j x_j^2} = \frac{x_i^2}{Ns_x^2},$$

where $s_x^2 = \frac{1}{N}\sum_j x_j^2$. For the conventional covariance estimator, we have

$$E[\hat{\Omega}_c] = \frac{E[\hat{\sigma}^2]}{Ns_x^2} = \frac{\sigma^2}{Ns_x^2}\Big(\frac{N-1}{N}\Big),$$

so the bias here is small. A simple calculation using (8.1.8) shows that under homoskedasticity, the robust estimator has expectation

$$E[\hat{\Omega}_r] = \frac{\sigma^2}{Ns_x^2}\Big(1 - \sum_i h_{ii}^2\Big).$$

The bias of $\hat{\Omega}_r$ is therefore worse than the bias of $\hat{\Omega}_c$ if $\sum_i h_{ii}^2 > 1/N$, as it is by Jensen's inequality unless the regressor has constant leverage, in which case $h_{ii} = 1/N$ for all $i$.

We can reduce the bias in $\hat{\Omega}_r$ by trying to get a better estimator of $\psi_i$, say $\hat{\psi}_i$. The estimator $\hat{\Omega}_r$ sets $\hat{\psi}_i = \hat{e}_i^2$, the estimator proposed by White (1980a) and our starting point in this section. Here is a summary of the proposals explored in MacKinnon and White (1985):

$$\begin{aligned} HC_0&: \ \hat{\psi}_i = \hat{e}_i^2 \\ HC_1&: \ \hat{\psi}_i = \frac{N}{N-K}\,\hat{e}_i^2 \\ HC_2&: \ \hat{\psi}_i = \frac{\hat{e}_i^2}{1-h_{ii}} \\ HC_3&: \ \hat{\psi}_i = \frac{\hat{e}_i^2}{(1-h_{ii})^2} \end{aligned}$$

$HC_1$ is a simple degrees-of-freedom correction, as is used for $\hat{\Omega}_c$. $HC_2$ uses the leverage to give an unbiased estimate of the variance of the $i$th residual when the residuals are homoskedastic, while $HC_3$ approximates a jackknife estimator. In the applications we've seen, the estimated standard errors tend to get larger as we go down the list, but this is not a theorem.
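The whole menu amounts to swapping different $\hat{\psi}_i$ into the middle of the sandwich. A compact sketch (our own function name and simulated data, not from the text):

```python
import numpy as np

def hc_covariances(X, y):
    """Sandwich covariance estimators HC0-HC3; a sketch of the
    MacKinnon-White menu (names and structure are our own)."""
    N, K = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    e = y - X @ (XtX_inv @ X.T @ y)                 # OLS residuals
    h = np.einsum('ij,jk,ik->i', X, XtX_inv, X)     # leverages h_ii
    psi = {
        'HC0': e ** 2,
        'HC1': e ** 2 * N / (N - K),
        'HC2': e ** 2 / (1 - h),
        'HC3': e ** 2 / (1 - h) ** 2,
    }
    return {name: XtX_inv @ ((X * w[:, None]).T @ X) @ XtX_inv
            for name, w in psi.items()}

# Illustrative use on simulated data
rng = np.random.default_rng(2)
N = 100
X = np.column_stack([np.ones(N), rng.normal(size=N)])
y = X @ np.array([1.0, 0.5]) + rng.normal(size=N)
V = hc_covariances(X, y)
se = {name: np.sqrt(np.diag(v)) for name, v in V.items()}
```

Because $1/(1-h_{ii})^2 \geq 1/(1-h_{ii}) \geq 1$, the $HC_3$ standard errors are never smaller than $HC_2$, and $HC_2$ never smaller than $HC_0$; $HC_1$ versus $HC_2$ can go either way.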

Time out for the Bootstrap

Bootstrapping is a resampling scheme that offers an alternative to inference based on asymptotic formulas. A bootstrap sample is a sample drawn from our own data. In other words, if we have a sample of size N, we treat this sample as if it were the population and draw repeatedly from it (with replacement). The bootstrap standard error is the standard deviation of an estimator across many draws of this sort. Intuitively, we expect the sampling distribution constructed by sampling from our own data to provide a good approximation to the sampling distribution we are after.

There are many ways to bootstrap regression estimates. The simplest is to draw pairs of $\{Y_i, X_i\}$-values, sometimes called the "pairs bootstrap" or a "nonparametric bootstrap". Alternatively, we can keep the $X_i$-values fixed, draw from the distribution of residuals ($\hat{e}_i$), and create a new value of the dependent variable as the sum of the predicted value and the residual draw for the particular observation. This procedure, which is a type of "parametric bootstrap", mimics a sample drawn with non-stochastic regressors and ensures that $X_i$ and the regression residuals are independent. On the other hand, we don't want independence if we're interested in standard errors under heteroskedasticity. An alternative residual bootstrap, called the "wild bootstrap", draws $X_i'\hat{\beta} + \hat{e}_i$ (which, of course, is just the original $Y_i$) with probability 0.5, and $X_i'\hat{\beta} - \hat{e}_i$ otherwise (see, e.g., Mammen, 1993, and Horowitz, 1997). This preserves the relationship between residual variances and $X_i$ observed in the original sample.
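The pairs and wild bootstraps can be sketched as follows (simulated heteroskedastic data and all names are our own illustration; the wild scheme here is the sign-flip version described above):

```python
import numpy as np

rng = np.random.default_rng(2)
N, B = 100, 499
x = rng.normal(size=N)
X = np.column_stack([np.ones(N), x])
y = 1.0 + 0.5 * x + (1 + 0.5 * np.abs(x)) * rng.normal(size=N)  # heteroskedastic errors

def ols(X, y):
    return np.linalg.lstsq(X, y, rcond=None)[0]

b = ols(X, y)
e = y - X @ b

slopes_pairs = np.empty(B)
slopes_wild = np.empty(B)
for r in range(B):
    # Pairs bootstrap: resample (Y_i, X_i) rows with replacement
    idx = rng.integers(0, N, size=N)
    slopes_pairs[r] = ols(X[idx], y[idx])[1]
    # Wild bootstrap: keep X fixed; use X_i'b + e_i or X_i'b - e_i, each w.p. 0.5
    signs = rng.choice([-1.0, 1.0], size=N)
    slopes_wild[r] = ols(X, X @ b + signs * e)[1]

se_pairs = slopes_pairs.std(ddof=1)   # bootstrap standard error of the slope
se_wild = slopes_wild.std(ddof=1)
```

Both schemes preserve the link between residual variance and $x_i$, so the two bootstrap standard errors should be of similar magnitude here.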

Bootstrapping is useful for two reasons. First, in some cases the asymptotic distribution of an estimator can be hard to compute (e.g., the asymptotic distributions of quantile regression estimates involve unknown densities). Bootstrapping provides a computer-intensive but otherwise straightforward computational strategy. Not all asymptotic distributions are approximated by the bootstrap, but it seems to work well for the simple estimators we care about. Second, under some circumstances, the sampling distribution obtained via bootstrap may be closer to the finite-sample distribution of interest than the asymptotic approximation – statisticians call this property asymptotic refinement.

Here, we are mostly interested in the bootstrap because of asymptotic refinement. The asymptotic distribution of regression estimates is easy enough to compute, but we worry that the estimators $HC_0$–$HC_3$ are biased. As a rule, bootstrapping provides an asymptotic refinement when applied to test statistics that have asymptotic distributions which do not depend on any unknown parameters (see, e.g., Horowitz, 2001). Such test statistics are said to be asymptotically pivotal. An example is a t-statistic, which is asymptotically standard Normal. Regression coefficients are not asymptotically pivotal; they have an asymptotic distribution which depends on the unknown residual variance.

The upshot is that if you want better finite-sample inference for regression coefficients, you should bootstrap t-statistics. That is, you calculate the t-statistic in each bootstrap sample and compare the analogous t-statistic from your original sample to this bootstrap "t"-distribution. In each bootstrap sample, the t-statistic is centered at the original-sample estimate, which plays the role of the population parameter. A hypothesis is rejected if the absolute value of the original t-statistic is above, say, the 95th percentile of the absolute values from the bootstrap distribution.
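A minimal sketch of this bootstrap-t procedure for a regression slope (our own illustration; the pairs bootstrap and $HC_0$ standard errors are one possible choice of ingredients):

```python
import numpy as np

rng = np.random.default_rng(5)
N, B = 60, 999
x = rng.normal(size=N)
X = np.column_stack([np.ones(N), x])
y = 0.5 * x + rng.normal(size=N)       # true slope is 0.5

def slope_and_se(X, y):
    XtX_inv = np.linalg.inv(X.T @ X)
    b = XtX_inv @ X.T @ y
    e = y - X @ b
    V = XtX_inv @ ((X * (e ** 2)[:, None]).T @ X) @ XtX_inv   # HC0 sandwich
    return b[1], np.sqrt(V[1, 1])

b1, se1 = slope_and_se(X, y)
t_orig = (b1 - 0.5) / se1              # t-statistic for the (true) null beta1 = 0.5

t_boot = np.empty(B)
for r in range(B):
    idx = rng.integers(0, N, size=N)   # pairs bootstrap draw
    bb, seb = slope_and_se(X[idx], y[idx])
    t_boot[r] = (bb - b1) / seb        # centered at the original-sample estimate

crit = np.quantile(np.abs(t_boot), 0.95)   # bootstrap 5% critical value
reject = bool(abs(t_orig) > crit)
```

The critical value `crit` replaces the usual 1.96, adapting to the finite-sample shape of the t-statistic's distribution.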

Theoretical appeal notwithstanding, as applied researchers, we don't like the idea of bootstrapping pivotal statistics very much. This is partly because we're not only (or even primarily) interested in formal hypothesis testing: we like to see the standard errors in parentheses under our regression coefficients. These provide a summary measure of precision that can be used to construct confidence intervals, compare estimators, and test any hypothesis that strikes us, now or later. We can certainly calculate standard errors from bootstrap samples, but this promises no asymptotic refinement. In our view, therefore, practitioners worried about the finite-sample behavior of robust standard errors should focus on bias corrections like $HC_1$–$HC_3$. We especially like the idea of taking the larger of the conventional standard error (with degrees-of-freedom correction) and one of these three.

An Example

For further insight into the differences between robust covariance estimators, we analyze a simple but important example that has featured in earlier chapters of this book. Suppose you are interested in an estimate of $\beta_1$ in the model

$$Y_i = \beta_0 + \beta_1 D_i + e_i, \qquad (8.1.9)$$

where $D_i$ is a dummy variable. The OLS estimate of $\beta_1$ is the difference in means between those with $D_i$ switched on and off. Denoting these subsamples by the subscripts 1 and 0, we have

$$\hat{\beta}_1 = \bar{Y}_1 - \bar{Y}_0.$$

For the purposes of this derivation we think of $D_i$ as non-random, so that $\sum_i D_i = N_1$ and $\sum_i (1 - D_i) = N_0$ are fixed. Let $r = N_0/N$.

We know something about the finite-sample behavior of $\hat{\beta}_1$ from statistical theory. If $Y_i$ is Normal with equal but unknown variance in both the $D_i = 1$ and $D_i = 0$ populations, then the conventional t-statistic for $\hat{\beta}_1$ has a t-distribution. This is the classic two-sample t-test. Heteroskedasticity in this context means that the variances in the $D_i = 1$ and $D_i = 0$ populations differ. In this case, the testing problem in small samples becomes surprisingly intractable: the exact small-sample distribution for even this simple problem is unknown.5 The robust covariance estimators $HC_0$–$HC_3$ give asymptotic approximations to the unknown finite-sample distribution for the case of unequal variances.

The differences between $HC_0$–$HC_3$ are differences in how the sample variances in the two groups defined by $D_i$ are processed. Define $S_j^2 = \sum_{i:\,D_i = j}(Y_i - \bar{Y}_j)^2$ for $j = 0, 1$. The leverage in this example is

$$h_{ii} = \begin{cases} 1/N_0 & \text{if } D_i = 0 \\ 1/N_1 & \text{if } D_i = 1. \end{cases}$$

Using this, it's straightforward to show that the five variance estimators we've been discussing are

$$\begin{aligned} \text{Conventional}&: \ \frac{N}{N_0N_1}\cdot\frac{S_0^2 + S_1^2}{N-2} = \frac{1}{Nr(1-r)}\cdot\frac{S_0^2 + S_1^2}{N-2} \\ HC_0&: \ \frac{S_0^2}{N_0^2} + \frac{S_1^2}{N_1^2} \\ HC_1&: \ \frac{N}{N-2}\left(\frac{S_0^2}{N_0^2} + \frac{S_1^2}{N_1^2}\right) \\ HC_2&: \ \frac{S_0^2}{N_0(N_0-1)} + \frac{S_1^2}{N_1(N_1-1)} \\ HC_3&: \ \frac{S_0^2}{(N_0-1)^2} + \frac{S_1^2}{(N_1-1)^2} \end{aligned}$$

The conventional estimator pools subsamples: this is efficient when the two variances are the same. The White (1980a) estimator, $HC_0$, adds separate estimates of the sampling variances of the two means, using the consistent (but biased) variance estimators, $S_j^2/N_j$. The $HC_2$ estimator uses unbiased estimators of the sampling variance for each group, since it makes the correct degrees-of-freedom correction. $HC_1$ makes a degrees-of-freedom correction outside the sum, which helps but is generally not quite right. Since we know $HC_2$ to be the unbiased estimate of the sampling variance under homoskedasticity, $HC_3$ must be too big. Note that with $r = 0.5$, a case where the regression design is said to be balanced, the conventional estimator equals $HC_2$, and all five estimators differ little.
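It's easy to confirm numerically that the generic leverage-based $HC_2$ formula reduces to the group-variance expression above. A sketch with our own simulated data:

```python
import numpy as np

rng = np.random.default_rng(4)
N0, N1 = 10, 20
N = N0 + N1
D = np.r_[np.zeros(N0), np.ones(N1)]
y = 2.0 * D + rng.normal(size=N)
X = np.column_stack([np.ones(N), D])

XtX_inv = np.linalg.inv(X.T @ X)
e = y - X @ (XtX_inv @ X.T @ y)               # residuals = deviations from group means
h = np.einsum('ij,jk,ik->i', X, XtX_inv, X)   # leverage: 1/N0 or 1/N1

S0 = (e[D == 0] ** 2).sum()                   # S_0^2 in the text's notation
S1 = (e[D == 1] ** 2).sum()                   # S_1^2

# Slope element of the generic HC2 sandwich ...
psi = e ** 2 / (1 - h)
V_hc2 = (XtX_inv @ ((X * psi[:, None]).T @ X) @ XtX_inv)[1, 1]

# ... equals the group-variance formula from the text
V_formula = S0 / (N0 * (N0 - 1)) + S1 / (N1 * (N1 - 1))
```

The two numbers `V_hc2` and `V_formula` agree to machine precision, because with a dummy regressor $1 - h_{ii} = (N_j - 1)/N_j$ within each group.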

A small Monte Carlo study based on (8.1.9) illustrates the pluses and minuses of the estimators and the extent to which a simple rule of thumb goes a long way towards ameliorating the bias of the HC class. We choose $N = 30$ to highlight small-sample issues, and $r = 0.9$, which implies $h_{ii} = 10/N = 1/3$ if $D_i = 1$. This is a highly unbalanced design. We draw

$$e_i \sim \begin{cases} \mathcal{N}(0, \sigma^2) & \text{if } D_i = 0 \\ \mathcal{N}(0, 1) & \text{if } D_i = 1 \end{cases}$$

and report results for three cases. The first has lots of heteroskedasticity, with $\sigma = 0.5$, while the second has relatively little heteroskedasticity, with $\sigma = 0.85$. No heteroskedasticity ($\sigma = 1$) is the benchmark case.
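The design just described can be sketched in a few lines (our own compact implementation, with 2,000 replications rather than 25,000 to keep it fast, tracking only the conventional and $HC_0$ standard errors; the small group has 3 observations and the larger residual variance):

```python
import numpy as np

rng = np.random.default_rng(6)
N0, N1, sigma = 27, 3, 0.5            # 27 obs with D_i = 0 (sd sigma), 3 with D_i = 1 (sd 1)
N = N0 + N1
reps = 2000

b1 = np.empty(reps)
se_conv = np.empty(reps)
se_hc0 = np.empty(reps)
for rep in range(reps):
    y0 = sigma * rng.normal(size=N0)  # D_i = 0 group (beta0 = beta1 = 0)
    y1 = rng.normal(size=N1)          # D_i = 1 group
    b1[rep] = y1.mean() - y0.mean()
    S0 = ((y0 - y0.mean()) ** 2).sum()
    S1 = ((y1 - y1.mean()) ** 2).sum()
    se_conv[rep] = np.sqrt((S0 + S1) / (N - 2) * (1 / N0 + 1 / N1))   # pooled
    se_hc0[rep] = np.sqrt(S0 / N0 ** 2 + S1 / N1 ** 2)               # White

true_se = np.sqrt(sigma ** 2 / N0 + 1.0 / N1)   # true sd of b1, about 0.59
```

In this run the average conventional standard error falls well below the average $HC_0$ standard error, and both fall short of `true_se` – the pattern described in the text's first panel.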

Table 8.1.1 displays the results. Columns (1) and (2) report means and standard deviations of the various standard error estimators across 25,000 replications of the sampling experiment. The standard deviation of $\hat{\beta}_1$ across replications is the sampling variability we are trying to measure. With lots of heteroskedasticity, as in the upper panel of the table, conventional standard errors are badly biased and, on average, only about half the size of the Monte Carlo standard deviation that constitutes our target. The robust standard errors perform better but, except for $HC_3$, they are still too small.

The standard errors are themselves estimates and have considerable sampling variability. Especially noteworthy is the fact that the robust standard errors have much higher sampling variability than the OLS standard errors, as can be seen in column 2. The sampling variability further increases when we attempt to reduce bias by dividing the residuals by $1 - h_{ii}$ or $(1 - h_{ii})^2$. The worst case is $HC_3$, with a standard deviation about 50% above that of the White (1980a) standard error, $HC_0$.

The last two columns in the table show empirical rejection rates in a nominal 5% test of the hypothesis $\beta_1 = \beta_1^*$, where $\beta_1^*$ is the population parameter (equal to zero, in this case). The test statistics are compared with a Normal distribution and with a t-distribution with $N - 2$ degrees of freedom. Rejection rates are far too high for all tests, even $HC_3$. Using a t-distribution rather than a Normal distribution helps only marginally.

The results with little heteroskedasticity, reported in the second panel, show that conventional standard errors are still too low; the bias is now on the order of 15%. $HC_0$ and $HC_1$ are also too small, by about as much as before in absolute terms, though they now look worse relative to the conventional standard errors. The $HC_2$ and $HC_3$ standard errors are still larger than the conventional standard errors, on average, but empirical rejection rates are higher for these two than for conventional standard errors. This means the robust standard errors are sometimes too small "by accident," an event that happens often enough to inflate rejection rates so that they exceed the conventional rejection rates.

The lesson we take away from this is that robust standard errors are no panacea. They can be smaller than conventional standard errors for two reasons: the small-sample bias we have discussed and the higher sampling variance of these standard errors. We therefore take empirical results where the robust standard errors fall below the conventional standard errors as a red flag: this is very likely due to bias or to a chance occurrence that is better discounted. In this spirit, we like the idea of taking the maximum of the conventional standard error and a robust standard error as your best measure of precision. This rule of thumb helps on two counts: it truncates low values of the robust estimators, reducing bias, and it reduces variability. Table 8.1.1 shows the empirical rejection rates obtained using $\max(HC_j, \text{Conventional})$. The empirical rejection rates using this rule of thumb look pretty good in the first two panels and greatly improve on the robust estimators alone.

Since there is no gain without pain, there must be some cost to using $\max(HC_j, \text{Conventional})$. The cost is that the best standard error when there is no heteroskedasticity is the conventional OLS estimate. This is documented in the bottom panel of the table. Using the maximum inflates standard errors unnecessarily under homoskedasticity, depressing rejection rates. Nevertheless, the table shows that even in this case rejection rates don't go down all that much. We also view an underestimate of precision as less costly than an overestimate. Underestimating precision, we come away thinking the data are not very informative and that we should try to collect more data; overestimating precision, we may mistakenly draw important substantive conclusions.

A final comment on this Monte Carlo investigation concerns the sample size. Labor economists like us are used to working with tens of thousands of observations or more. But sometimes we don’t. In a study of the effects of busing on public school students, Angrist and Lang (2004) work with samples of about 3000 students grouped in 56 schools. The regressor of interest in this study varies within grade only at the school level, so some of the analysis in this paper uses 56 school means. Not surprisingly, therefore, Angrist and Lang (2004) obtained HC standard errors below conventional OLS standard errors when working with school-level data. As a rule, even if you start with the micro data on individuals, when the regressor of interest varies at a higher level of aggregation – a school, state, or some other group or cluster – effective sample sizes are much closer to the number of clusters than to the number of individuals. Inference procedures for clustered data are discussed in detail in the next section.