Few things are as confusing to applied researchers as the role of sample weights. Even now, 20 years post-Ph.D., we read the section of the Stata manual on weighting with some dismay. Weights can be used in a number of ways, and how they are used may well matter for your results. Regrettably, however, the case for or against weighting is often less than clear-cut, as are the specifics of how the weights should be programmed. A detailed discussion of weighting pros and cons is beyond the scope of this book. See Pfeffermann (1993) and Deaton (1997) for two perspectives. In this brief subsection, we provide a few guidelines and a rationale for our approach to weighting.
A simple rule of thumb for weighting regression is to use weights when they make it more likely that the regression you are estimating is close to the population target you are trying to estimate. If, for example, the target (or estimand) is the population regression function, and the sample to be used for estimation is nonrandom, with sampling weights W_i equal to the inverse probability of sampling observation i, then it makes sense to use weighted least squares, weighting by W_i (for this you can use Stata pweights or a SAS WEIGHT statement). Weighting by the inverse sampling probability generates estimates that are consistent for the population regression function even if the sample you have to work with is not a simple random sample.
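To see the logic, here is a minimal simulation sketch (the setup, variable names, and parameter values are ours, purely for illustration): observations are sampled with a probability that depends on the outcome, so unweighted least squares is inconsistent for the population slope, while least squares weighted by the inverse sampling probability recovers it.

```python
import numpy as np

rng = np.random.default_rng(0)

# Population model (illustrative): y = 1 + 2*x + noise.
N = 100_000
x = rng.normal(size=N)
y = 1 + 2 * x + rng.normal(size=N)

# Nonrandom sampling: the probability of being sampled rises with y,
# so the retained sample is not a simple random sample.
p = 1 / (1 + np.exp(-y))
keep = rng.random(N) < p
xs, ys, ps = x[keep], y[keep], p[keep]

def wls(X, y, w):
    """Weighted least squares: scale rows by sqrt(w), then run OLS."""
    sw = np.sqrt(w)
    return np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)[0]

X = np.column_stack([np.ones(len(xs)), xs])
b_weighted = wls(X, ys, 1 / ps)              # inverse-probability weights
b_unweighted = wls(X, ys, np.ones(len(xs)))  # ordinary least squares

print(b_weighted)    # close to the population slope of 2
print(b_unweighted)  # biased by the outcome-dependent sampling
```

The weighted fit reconstructs the population moments because each sampled observation stands in for 1/p_i population units; the unweighted fit estimates the regression in the selected sample, which is a different object.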
A related weighting scenario is grouped data. Suppose that you would like to regress Y_i on X_i in a random sample, presumably because you want to learn about the population regression vector β = E[X_i X_i']^{-1} E[X_i Y_i]. Instead of a random sample, however, you have data grouped at the level of X_i.
That is, you have estimates of E[Y_i|X_i = x] for each value x, estimated using data from a random sample. Let this average be denoted ȳ_x, and suppose you also know n_x, where n_x/N is the relative frequency of x in the underlying random sample. As we saw in Section 3.1.2, the regression of ȳ_x on x, weighted by n_x, is the same as the random-sample regression. Therefore, if your goal is to get back to the microdata regression, it makes sense to weight by group size. We note, however, that macroeconomists, accustomed to working with published averages and ignoring the underlying microdata, might disagree, or perhaps take the point in principle but remain disinclined to buck the tradition in their discipline, which favors the unweighted analysis of aggregates.
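The grouped-data equivalence can be checked directly. The following sketch (our own toy setup) groups simulated microdata by the value of x and confirms that least squares on the cell means ȳ_x, weighted by the cell sizes n_x, reproduces the microdata regression coefficients exactly:

```python
import numpy as np

rng = np.random.default_rng(1)

# Microdata (illustrative): x takes a few discrete values.
n = 10_000
x = rng.integers(0, 5, size=n).astype(float)
y = 1 + 2 * x + rng.normal(size=n)

# Micro regression of y on x (with a constant).
X = np.column_stack([np.ones(n), x])
b_micro = np.linalg.lstsq(X, y, rcond=None)[0]

# Grouped data: for each cell x, the mean ybar_x and the cell size n_x.
cells = np.unique(x)
ybar = np.array([y[x == c].mean() for c in cells])
n_x = np.array([(x == c).sum() for c in cells])

# WLS of ybar_x on x, weighted by n_x: same normal equations as micro OLS.
Xg = np.column_stack([np.ones(len(cells)), cells])
sw = np.sqrt(n_x)
b_grouped = np.linalg.lstsq(Xg * sw[:, None], ybar * sw, rcond=None)[0]

print(b_micro)
print(b_grouped)  # identical up to floating-point error
```

The identity is exact, not asymptotic: weighting the cell means by n_x rebuilds the sums Σy_i, Σx_i y_i, and Σx_i² of the microdata normal equations.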
… are as described in Dehejia and Wahba (1999). The samples in the last two columns are limited to observations with a propensity score between .1 and .9. Propensity score estimates use all the covariates listed in the table.

Notes: The table reports regression estimates of training effects using the Dehejia-Wahba (1999) data with alternative sets of controls. The demographic controls are age, years of schooling, and dummies for Black, Hispanic, high school dropout, and married. Standard errors are reported in parentheses; observation counts are reported in brackets [treated/control].

If, on the other hand, the rationale for weighting has something to do with heteroskedasticity, as in many textbook discussions of weighting, we are even less sympathetic to weighting than the macroeconomists. The argument for weighting under heteroskedasticity goes roughly like this: suppose you are interested in a linear CEF, E[Y_i|X_i] = X_i'β. The error term, defined as e_i = Y_i − X_i'β, may be heteroskedastic. That is, the conditional variance function E[e_i²|X_i] need not be constant. In this case, while the population regression vector is still equal to E[X_i X_i']^{-1} E[X_i Y_i], the sample analog is inefficient. A more precise estimator of the linear CEF is weighted least squares, i.e., minimize the sum of squared errors weighted by an estimate of the inverse of the conditional variance, 1/E[e_i²|X_i].
As noted in Section 3.1.3, an inherently heteroskedastic scenario is the LPM, where Y_i is a dummy variable. Assuming the CEF is in fact linear, as it will be if the model is saturated, then P[Y_i = 1|X_i] = X_i'β, and therefore E[e_i²|X_i] = X_i'β(1 − X_i'β), which is obviously a function of X_i. This is an example of model-based heteroskedasticity, where, in principle, the conditional variance function is easily constructed from estimates of the underlying regression function. The efficient weighted least squares estimator, a special case of generalized least squares (GLS), weights by [X_i'β(1 − X_i'β)]^{-1}. In practice, because the CEF has been assumed to be linear, these weights can be estimated in a first pass by OLS.
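The two-pass procedure can be sketched as follows; the data-generating process (a discrete regressor with a genuinely linear CEF, so both estimators are consistent) is our own illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative LPM with a linear CEF: P[y=1|x] = 0.2 + 0.1*x, x in {0,...,5}.
n = 20_000
x = rng.integers(0, 6, size=n).astype(float)
p_true = 0.2 + 0.1 * x
y = (rng.random(n) < p_true).astype(float)

X = np.column_stack([np.ones(n), x])

# First pass: OLS gives fitted values Xb.
b_ols = np.linalg.lstsq(X, y, rcond=None)[0]
fitted = np.clip(X @ b_ols, 0.01, 0.99)  # guard against exploding weights

# Second pass: weight by 1 / [Xb (1 - Xb)], the estimated inverse
# conditional variance of an LPM residual.
w = 1 / (fitted * (1 - fitted))
sw = np.sqrt(w)
b_gls = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)[0]

print(b_ols)
print(b_gls)  # both near (0.2, 0.1); GLS is the (asymptotically) more precise one
```

The clipping step is a practical guard we added: fitted values near 0 or 1 (or outside the unit interval, which an LPM can produce) would otherwise generate extreme or undefined weights.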
There are two reasons why we prefer not to weight in this case (though we would use a heteroskedasticity-consistent covariance matrix). First, in practice, the estimate of E[e_i²|X_i] may not be very good. If the conditional variance model is a poor approximation and/or the estimates of it are very noisy (in the LPM, this might mean the CEF is not really linear), weighted least squares estimates may have worse finite-sample properties than unweighted estimates. The inferences you draw based on asymptotic theory may therefore be misleading, and the hoped-for efficiency gain may not materialize. Second, if the CEF is not linear, the weighted least squares estimator is no more likely to estimate it than is the unweighted estimator. Moreover, the unweighted estimator still estimates something easy to interpret: the MMSE linear approximation to the population CEF.
Of course, the GLS estimator also provides some sort of approximation, but the nature of this approximation depends on the weights. At a minimum, this makes it harder to compare your results to estimates by other researchers, and opens up additional avenues for specification searches when results depend on weighting. Finally, an old caution comes to mind: “if it ain’t broke, don’t fix it.” The interpretation of the population regression vector is unaffected by heteroskedasticity, so why worry about it? Any efficiency gain from weighting is likely to be modest, and incorrect or poorly estimated weights can do more harm than good.
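The unweighted-but-robust alternative favored above amounts to pairing OLS coefficients with White's heteroskedasticity-consistent (HC0) sandwich covariance matrix; a minimal sketch with a simulated heteroskedastic design (our own, for illustration) shows how the conventional and robust standard errors can diverge:

```python
import numpy as np

rng = np.random.default_rng(3)

# Illustrative design: error variance grows with |x|.
n = 2_000
x = rng.normal(size=n)
y = 1 + 2 * x + rng.normal(size=n) * (0.5 + np.abs(x))

X = np.column_stack([np.ones(n), x])
b = np.linalg.lstsq(X, y, rcond=None)[0]
e = y - X @ b

XtX_inv = np.linalg.inv(X.T @ X)
k = X.shape[1]

# Conventional (homoskedastic) covariance: s^2 (X'X)^{-1}.
s2 = e @ e / (n - k)
V_conv = s2 * XtX_inv

# White/HC0 sandwich: (X'X)^{-1} X' diag(e^2) X (X'X)^{-1}.
meat = (X * e[:, None] ** 2).T @ X
V_hc = XtX_inv @ meat @ XtX_inv

print(np.sqrt(np.diag(V_conv)))  # conventional standard errors
print(np.sqrt(np.diag(V_hc)))    # robust standard errors, larger here
```

Because the noisy observations sit at high-leverage values of x in this design, the robust standard error on the slope exceeds the conventional one; the point estimates themselves are untouched, which is exactly the "don't fix what ain't broke" posture.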