# Optimal Moments and Nearly Uninformative Moments

Throughout the analysis of the GMM estimator, we have taken the population moment condition as given. In practice, however, a researcher typically faces a large set of alternatives from which $q$ elements are chosen to make up the population moment condition. In this section we consider the two extreme scenarios in which the "best" choice is made and the "worst" choice is made. To understand what best and worst mean in this context, it is useful to consider two ways in which the choice of population moment condition affects the asymptotic analysis. Theorem 1 establishes that the consistency of GMM depends crucially on the identification condition in Assumption 4. Theorem 2 reveals that the asymptotic variance of the estimator depends directly on the choice of moment condition via both $S$ and $G_0$. Therefore, the best choice is the population moment condition which leads to the estimator with the smallest asymptotic variance. Section 8.1 summarizes the main results in the literature on the best, or optimal, choice of population moment condition. The worst case scenario is when the population moment condition does not, or nearly does not, provide enough information to identify $\theta_0$. Section 8.2 describes both the consequences of (nearly) uninformative population moment conditions for the inference techniques discussed above and how these problems can be circumvented.

## The Optimal Choice of Population Moment Condition

In its most general form, we have already answered this question in Section 7. It is shown there that MLE can be interpreted as a GMM estimator based on (11.35). Since the MLE is known to be asymptotically efficient in the class of consistent uniformly asymptotically normal estimators, the optimal choice of population moment condition is just the score function associated with the true probability distribution of the data. Unfortunately, in many cases of interest in economics, the true probability distribution is unknown. One solution is to choose a distribution arbitrarily, but this strategy can have undesirable consequences if the wrong choice is made: the estimator is then no longer asymptotically efficient and may also be inconsistent in nonlinear models.24 In many cases where MLE is infeasible, GMM is applied using a population moment condition which takes the form

$$E[z_t \otimes u_t(\theta_0)] = 0, \qquad (11.36)$$

where $z_t$ is a vector of observable instruments and $u_t(\theta_0)$ is a vector of functions which depend on both the data and the unknown parameter vector. Hansen and Singleton (1982) refer to GMM estimation based on (11.36) as generalized instrumental variables. Notice that both our examples from the introduction fit into this class. Within this framework $u_t(\theta_0)$ is usually determined by the model, and so the only difference between choices of moment condition arises from the choice of instrument vector. Therefore, the optimal moment condition is characterized by finding the optimal choice of instrument vector.
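To fix ideas, here is a minimal Python sketch of GMM estimation based on a moment condition of the form (11.36), for the illustrative special case in which $u_t(\theta_0)$ is a scalar linear residual $y_t - x_t'\theta_0$. The function names and data layout are assumptions for this sketch, not part of the text:

```python
import numpy as np

def giv_moments(theta, y, x, z):
    """Sample moment g_T(theta) = T^{-1} sum_t z_t u_t(theta), with the
    residual function specialized to the linear case u_t = y_t - x_t'theta."""
    u = y - x @ theta            # (T,) vector of residuals u_t(theta)
    return z.T @ u / len(y)      # (q,) vector of averaged moment conditions

def gmm_estimate(y, x, z, W):
    """One-step GMM: minimize g_T(theta)' W g_T(theta).  Because u_t is
    linear in theta here, the minimizer has a closed form."""
    A = x.T @ z @ W @ z.T @ x
    b = x.T @ z @ W @ z.T @ y
    return np.linalg.solve(A, b)
```

Here `z` plays the role of the instrument vector $z_t$; with `W` set to the inverse of the sample second-moment matrix of the instruments, this reduces to familiar linear IV estimation.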

In the literature on optimal instruments, it is customary to work with a slightly modified version of the population moment condition.25 Instead of (11.36), the population moment condition takes the form

$$E[f(v_t, \theta_0)] = E[Z(v_{2t}) u_t(\theta_0)] = 0, \qquad (11.37)$$

where $u_t(\theta_0)$ is an $(s \times 1)$ vector of functions which satisfies $E[u_t(\theta_0) \mid \Omega_t] = 0$, $\Omega_t$ represents the information set at time $t$, $Z(v_{2t})$ is a $(q \times s)$ matrix whose elements are functions of $v_{2t} \in \Omega_t$, and we have partitioned $v_t = (v_{1t}, v_{2t})'$. The problem is then to find the optimal choice of $Z(v_{2t})$. This question is typically broken down into two parts: what is the optimal choice of $Z(\cdot)$ for a given choice of $v_{2t}$, and what is the optimal choice of $v_{2t}$? The answer to the second question depends on the model in question, and so we do not address it here. Instead we focus entirely on the first question.

It turns out that the optimal instrument is relatively easy to characterize in static models, but is much more difficult in time series models. We therefore introduce the following restriction.

Assumption 12. Independence. $\{v_t;\ t = 1, 2, \ldots, T\}$ forms an independent sequence.

Notice that Assumptions 1 and 12 imply $v_t$ forms an iid process.

If GMM estimation is based on the population moment condition in (11.37) with the optimal choice of weighting matrix, then from Theorems 2 and 4 the asymptotic covariance matrix of $\hat{\theta}_T$ is

$$V(Z) = \left\{ E[Z_t \, \partial u_t(\theta_0)/\partial \theta']' \, S_Z^{-1} \, E[Z_t \, \partial u_t(\theta_0)/\partial \theta'] \right\}^{-1}, \qquad (11.38)$$

where for simplicity we have set $Z_t = Z(v_{2t})$ and $S_Z = E[Z_t u_t(\theta_0) u_t(\theta_0)' Z_t']$. The optimal choice of $Z(\cdot)$ given $v_{2t}$ is the function which minimizes $V(Z)$ in a matrix sense, and this is given by the next theorem.26

Theorem 7. The optimal choice of instrument in static models. If (i) Assumptions 1–10, 12 and certain other regularity conditions hold; (ii) the population moment condition is given by (11.37); then the optimal choice of $Z(\cdot)$ given $v_{2t}$ is

$$Z^0(v_{2t}) = E[\partial u_t(\theta_0)/\partial \theta' \mid v_{2t}]' \, \Sigma_{u|v_2}^{-1},$$

where $\Sigma_{u|v_2} = E[u_t(\theta_0) u_t(\theta_0)' \mid v_{2t}]$, and this choice leads to a GMM estimator with asymptotic covariance matrix

$$V(Z^0) = \left\{ E[Z^0(v_{2t}) \, \Sigma_{u|v_2} \, Z^0(v_{2t})'] \right\}^{-1}.$$
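The content of Theorem 7 can be illustrated numerically. The sketch below evaluates the scalar-parameter version of $V(Z) = S_Z / \{E[Z_t D_t]\}^2$ by Monte Carlo for a hypothetical design (the specific functional forms, $E[\partial u_t/\partial\theta \mid v_{2t}] = -v_{2t}$ and conditional variance $1 + v_{2t}^2$, are assumptions for illustration) and checks that the instrument prescribed by Theorem 7 yields a smaller asymptotic variance than a naive choice:

```python
import numpy as np

def asy_var(Z_fn, v2, d_cond, var_cond):
    """Scalar-parameter version of V(Z) = S_Z / {E[Z_t D_t]}^2, with the
    population expectations approximated by averages over draws of v2."""
    Z = Z_fn(v2)                          # instrument evaluated at v2
    G = np.mean(Z * d_cond(v2))           # E[Z_t E(du_t/dtheta | v2t)]
    S_Z = np.mean(Z**2 * var_cond(v2))    # E[Z_t^2 E(u_t^2 | v2t)]
    return S_Z / G**2
```

With these assumed forms, the Theorem 7 instrument is $Z^0(v) = d(v)/\mathrm{var}(v)$, i.e. the conditional derivative scaled down where the residual is noisier, and `asy_var` returns a strictly smaller value for it than for the naive instrument $Z(v) = v$.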

An intuition for this result can be derived by relating the optimal estimator to the familiar case of two-stage least squares (2SLS) estimation in the linear model. For expositional simplicity, we consider the case in which $s = 1$, and so set $\Sigma_{u|v_2} = \sigma^2$. To set up the analogy to 2SLS, it is necessary to return to the asymptotic behavior of $T^{1/2}(\hat{\theta}_T - \theta_0)$. Our previous analysis indicates that $T^{1/2}(\hat{\theta}_T - \theta_0)$ is asymptotically equivalent to the function of the data in (11.26). Using (11.37), (11.26) becomes

$$T^{1/2}(\hat{\theta}_T - \theta_0) = -\left\{ [T^{-1} D_T(\theta_0)' Z_T] \, W_T \, [T^{-1} Z_T' D_T(\theta_0)] \right\}^{-1} [T^{-1} D_T(\theta_0)' Z_T] \, W_T \, T^{-1/2} Z_T' U_T(\theta_0) + o_p(1), \qquad (11.39)$$

where $D_T(\theta_0)$ is the $T \times p$ matrix with $t$th row $\partial u_t(\theta_0)/\partial \theta'$, $Z_T$ is the $T \times q$ matrix with $t$th row $Z_t'$, and $U_T(\theta_0)$ is the $T \times 1$ vector with $t$th element $u_t(\theta_0)$. Assumption 12 implies that the optimal choice of weighting matrix is $S^{-1} = \sigma^{-2}\{E[Z_t Z_t']\}^{-1}$. Since the scaling factor $\sigma^{-2}$ cancels out in the formula for the estimator, the two-step GMM estimator can be obtained by setting $W_T = (T^{-1} Z_T' Z_T)^{-1}$. Making this substitution in (11.39), we obtain

$$T^{1/2}(\hat{\theta}_T - \theta_0) = -\left\{ [T^{-1} D_T(\theta_0)' Z_T][T^{-1} Z_T' Z_T]^{-1}[T^{-1} Z_T' D_T(\theta_0)] \right\}^{-1} [T^{-1} D_T(\theta_0)' Z_T][T^{-1} Z_T' Z_T]^{-1} T^{-1/2} Z_T' U_T(\theta_0) + o_p(1). \qquad (11.40)$$

In the linear regression model, $u_t(\theta_0) = y_t - x_t'\theta_0$ and so $D_T(\theta_0) = -X$, the matrix of observations on $x_t$. In this case, (11.40) reduces to the formula for the linear IV estimator, and the optimal instrument is $E[x_t \mid Z_t]$. If $x_t$ is assumed to be a linear function of $Z_t$, then the feasible optimal IV estimator is just the two-stage least squares estimator.27
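A compact sketch of the 2SLS construction just described, under the assumption that $E[x_t \mid Z_t]$ is linear in the instruments (the function name and data layout are illustrative):

```python
import numpy as np

def two_stage_least_squares(y, x, z):
    """2SLS: replace the regressors by their fitted values from a first-stage
    regression on the instruments (an estimate of E[x_t | Z_t] under
    linearity), then run the second-stage regression of y on those fits."""
    x_hat = z @ np.linalg.lstsq(z, x, rcond=None)[0]   # first stage
    beta, *_ = np.linalg.lstsq(x_hat, y, rcond=None)   # second stage
    return beta
```

In a simulation with an endogenous regressor, this estimator recovers the true coefficient while OLS on the raw regressor does not, which is exactly the role the instruments play in the moment condition (11.36).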

Now let us return to the original nonlinear setting. By analogy to the linear model, (11.40) implies that $T^{1/2}(\hat{\theta}_T - \theta_0)$ behaves asymptotically like an IV estimator in a linear model with regressor vector $x_t = -\partial u_t(\theta_0)/\partial \theta'$ and error $u_t(\theta_0)$. Now we have just argued that the optimal instrument in a linear model is given by the conditional expectation of the regressor. Therefore, applying that logic here, the optimal instrument is given by $-E[\partial u_t(\theta_0)/\partial \theta' \mid Z_t]$, which is identical to the result in Theorem 7 except for the presence of the scaling factor $-\sigma^2$. This difference is inconsequential because, as remarked above, the scaling factor cancels out and so does not affect the estimator. To construct a feasible optimal instrument, it is possible to follow a similar strategy to 2SLS and assume a model for $\partial u_t(\theta_0)/\partial \theta'$. However, this is likely to require an assumption about the distribution of $v_{2t}$ in order to evaluate the expectation. This is undesirable here because it is the absence of this information which led us to generalized IV estimation in the first place. An alternative solution is to estimate $Z^0(v_{2t})$ nonparametrically, and Newey (1993) provides a survey of various methods which have been proposed in this context.
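As one concrete, purely illustrative version of the nonparametric route, the conditional expectation $E[\partial u_t(\theta_0)/\partial\theta \mid v_{2t}]$ can be approximated by a series regression on polynomial terms in $v_{2t}$, evaluated at a preliminary estimate of $\theta_0$. The basis and degree below are arbitrary choices for this sketch, not a prescription from Newey (1993):

```python
import numpy as np

def series_conditional_mean(d, v2, degree=3):
    """Approximate E[d_t | v2t] by projecting d_t on a polynomial basis in
    v2t; the fitted values can then be plugged into the Theorem 7 formula
    to form a feasible optimal instrument."""
    basis = np.vander(v2, degree + 1)                 # columns v2^k, k = degree..0
    coef, *_ = np.linalg.lstsq(basis, d, rcond=None)
    return basis @ coef
```

A kernel regression or any other nonparametric smoother could be substituted for the polynomial projection; the point is only that no distributional assumption on $v_{2t}$ is needed.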

The above discussion gives an intuition for the part of $Z^0(v_{2t})$ involving the partial derivative, but does not explain the presence of $\Sigma_{u|v_2}^{-1}$ in the formula, because the $\sigma^2$ factor canceled out. However, if $s > 1$ and $\Sigma_{u|v_2} \neq \sigma^2 I_s$, then it is necessary to employ a correction in the construction of the optimal instrument for either the unequal variances or any contemporaneous correlation (or both) of the elements of $u_t(\theta_0)$. It is for this reason that the conditional expectation in $Z^0(v_{2t})$ is transformed by $\Sigma_{u|v_2}^{-1}$.28

The matrix $V(Z^0)$ can be interpreted as a lower bound on the asymptotic covariance matrix for this class of estimators. It should be remembered that the optimal IV estimator is likely to be less efficient than maximum likelihood, because (11.37) does not typically contain all the information in the true score function of the data. However, there is a sense in which $V(Z^0)$ is the best we can do given the information available. Chamberlain (1987) shows that $V(Z^0)$ is also the lower bound on the asymptotic covariance matrix of any consistent and asymptotically normal estimator of $\theta_0$ in which the only substantive information used in estimation is the population moment condition in (11.37).29

It would be desirable to extend this theorem to time series, but so far there has been only limited success in this direction. Hansen and Singleton (1982), Hayashi and Sims (1983), Hansen (1985), and Hansen, Heaton, and Ogaki (1988) have all provided characterizations of a lower bound on the asymptotic variance under different assumptions about the functional form of $u_t(\theta_0)$ and its dynamic structure. However, as yet, these results have not been translated into general algorithms for the calculation of a feasible optimal instrument in dynamic nonlinear models.30
