# Regression Case

Let us generalize some of the estimation methods discussed earlier to the regression situation.

**M Estimator.** The M estimator is easily generalized to the regression model: it minimizes $\sum_{t=1}^{T}\rho[(y_t - x_t'b)/s]$ with respect to the vector $b$. Its asymptotic variance-covariance matrix is given by $s_0^2(X'AX)^{-1}X'BX(X'AX)^{-1}$, where $A$ and $B$ are diagonal matrices with the $t$th diagonal elements equal to $E\rho''[(y_t - x_t'\beta)/s_0]$ and $E\{\rho'[(y_t - x_t'\beta)/s_0]^2\}$, respectively.

Hill and Holland (1977) did a Monte Carlo study on the regression generalization of Andrews' M estimator described in (2.3.5). They used $s = (2.1)\,\mathrm{Median}\{\text{largest } T - K + 1 \text{ of } |y_t - x_t'\hat{\beta}|\}$ as the scale factor in the $\rho$ function, where $\hat{\beta}$ is the value of $b$ that minimizes $\sum_{t=1}^{T}|y_t - x_t'b|$. Actually, their estimator, which they called the one-step sine estimator, is defined as

$$\hat{\beta}_S = (X'DX)^{-1}X'Dy, \qquad (2.3.10)$$

where $D$ is the diagonal matrix whose $t$th diagonal element $d_t$ is defined by

$$d_t = \begin{cases} \dfrac{\sin[(y_t - x_t'\hat{\beta})/s]}{(y_t - x_t'\hat{\beta})/s} & \text{if } |y_t - x_t'\hat{\beta}| \le s\pi \\[2ex] 0 & \text{if } |y_t - x_t'\hat{\beta}| > s\pi. \end{cases} \qquad (2.3.11)$$
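As an illustration, the following pure-Python sketch applies (2.3.10) and (2.3.11) to simulated single-regressor data. For simplicity the preliminary estimate is least squares rather than the L1 estimate used by Hill and Holland, and the sample size, true slope, and contamination scheme are all invented for the example:

```python
import math
import random
import statistics

random.seed(0)

# Simulated data: y_t = 2*x_t + u_t, u_t ~ N(0, 1), with two gross outliers.
T, K = 50, 1
x = [random.uniform(1.0, 5.0) for _ in range(T)]
y = [2.0 * xi + random.gauss(0.0, 1.0) for xi in x]
y[0] += 50.0
y[1] += 40.0

# Preliminary estimate: least squares through the origin (a stand-in for
# the L1 estimate of the text).
beta_hat = sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)

# Scale factor: 2.1 times the median of the largest T - K + 1 absolute residuals.
abs_res = sorted(abs(yi - xi * beta_hat) for xi, yi in zip(x, y))
s = 2.1 * statistics.median(abs_res[-(T - K + 1):])

# One-step sine estimator (2.3.10): weighted LS with the weights of (2.3.11).
def d(r):
    if abs(r) > s * math.pi:
        return 0.0                      # gross outliers get zero weight
    z = r / s
    return 1.0 if z == 0.0 else math.sin(z) / z

w = [d(yi - xi * beta_hat) for xi, yi in zip(x, y)]
beta_s = (sum(wi * xi * yi for wi, xi, yi in zip(w, x, y))
          / sum(wi * xi * xi for wi, xi in zip(w, x)))
print(round(beta_s, 3))
```

Because the two contaminated observations have residuals far beyond $s\pi$, their weights are exactly zero and they drop out of the weighted least squares step.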

This is approximately the first step of the so-called Newton-Raphson iteration designed to minimize (2.3.5), as we shall show next.

Put $g(\beta) = \sum_{t=1}^{T}\rho(z_t)$, where $z_t = (y_t - x_t'\beta)/s$, and expand $g(\beta)$ around $\beta = \hat{\beta}$ in a Taylor series as

$$g(\beta) \cong g(\hat{\beta}) + \left(\frac{\partial g}{\partial \beta}\right)'(\beta - \hat{\beta}) + \frac{1}{2}(\beta - \hat{\beta})'\,\frac{\partial^2 g}{\partial \beta\,\partial \beta'}\,(\beta - \hat{\beta}), \qquad (2.3.12)$$

where the derivatives are evaluated at $\hat{\beta}$. Let $\tilde{\beta}$ be the value of $\beta$ that minimizes the right-hand side of (2.3.12). Thus

$$\tilde{\beta} = \hat{\beta} - \left[\frac{\partial^2 g}{\partial \beta\,\partial \beta'}\right]^{-1}\frac{\partial g}{\partial \beta}. \qquad (2.3.13)$$

This is the first step in the Newton-Raphson iteration (see Section 4.4.1). Inserting

$$\frac{\partial g}{\partial \beta} = -s^{-1}\sum_{t=1}^{T}\rho' x_t \qquad (2.3.14)$$

and

$$\frac{\partial^2 g}{\partial \beta\,\partial \beta'} = s^{-2}\sum_{t=1}^{T}\rho'' x_t x_t', \qquad (2.3.15)$$

where $\rho'$ and $\rho''$ are evaluated at $(y_t - x_t'\hat{\beta})/s$, into (2.3.13), we obtain

$$\tilde{\beta} = \left(\sum_{t=1}^{T} s^{-2}\rho'' x_t x_t'\right)^{-1}\sum_{t=1}^{T}\left(s^{-1}\rho' x_t + s^{-2}\rho'' x_t x_t'\hat{\beta}\right). \qquad (2.3.16)$$

Finally, inserting the Taylor approximation $\rho' \cong \rho'(0) + s^{-1}(y_t - x_t'\hat{\beta})\rho''$ into the right-hand side of (2.3.16), we obtain the estimator (2.3.10). (For Andrews' $\rho$ we have $\rho'(0) = 0$, so the approximation amounts to replacing $\rho''$ by $s\rho'/(y_t - x_t'\hat{\beta})$, which is precisely the weight $d_t$ of (2.3.11).)
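The derivation can be checked numerically. The sketch below takes a single step of (2.3.16) for a single-regressor model on simulated data; the Huber $\rho$ is used here purely as an illustrative choice (the text's $\rho$ is Andrews' sine function), and the data and starting value are invented for the example:

```python
import random

random.seed(6)

# One Newton-Raphson step (2.3.16) for g(b) = sum rho(z_t),
# z_t = (y_t - b*x_t)/s, in a single-regressor model.
k = 1.345
def rho(z):  return 0.5 * z * z if abs(z) <= k else k * abs(z) - 0.5 * k * k
def rho1(z): return max(-k, min(k, z))             # rho'
def rho2(z): return 1.0 if abs(z) <= k else 0.0    # rho''

T, s = 50, 1.0
x = [random.uniform(1.0, 5.0) for _ in range(T)]
y = [2.0 * xi + random.gauss(0.0, 1.0) for xi in x]

def g(b):
    return sum(rho((yi - b * xi) / s) for xi, yi in zip(x, y))

b_hat = 2.2                                        # starting value
z = [(yi - b_hat * xi) / s for xi, yi in zip(x, y)]
num = sum(rho1(zi) * xi / s + rho2(zi) * xi * xi * b_hat / s ** 2
          for zi, xi in zip(z, x))
den = sum(rho2(zi) * xi * xi / s ** 2 for zi, xi in zip(z, x))
b_tilde = num / den                                # equation (2.3.16)
print(round(b_tilde, 3))
```

A single step already moves the estimate from the deliberately poor starting value toward the minimizer of $g$.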

In their Monte Carlo study, Hill and Holland assumed that the error term is $N(0, 1)$ with probability $1 - \alpha$ and $N(0, c^2)$ with probability $\alpha$, with various values for $c$ and $\alpha$. The regressor matrix was artificially generated with various degrees of outlying observations and with the number of regressors ranging from 1 to 6. The sample size was chosen to be 20. The reader should look at the table in their article (Hill and Holland, 1977) to see the striking improvement of $\hat{\beta}$ or $\hat{\beta}_S$ over the least squares estimator and the minor improvement of $\hat{\beta}_S$ over $\hat{\beta}$.

**$L_p$ Estimator.** This class also can easily be generalized to the regression model. It is the value of $b$ that minimizes $\sum_{t=1}^{T}|y_t - x_t'b|^p$. The special case $p = 1$, which was already defined as $\hat{\beta}$ in the discussion of the M estimator, will be more fully discussed later as an L estimator. For $p \ne 1$, the asymptotic variance of the $L_p$ estimator can be obtained by the same formula as that used for the asymptotic variance of the M estimator. Forsythe (1972) conducted a Monte Carlo study of $L_p$ estimators for $p = 1.25$, 1.5, 1.75, and 2 (least squares) in a regression model with one fixed regressor and an intercept, where the error term is distributed as $GN(0, 1) + (1 - G)N(S, R)$ for several values of $G$, $S$, and $R$. His conclusion: the more "contamination," the smaller $p$ should be.
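Since $|r|^p = |r|^{p-2}r^2$, the $L_p$ minimand can be attacked by iteratively reweighted least squares. The following pure-Python sketch (one regressor plus intercept, $p = 1.5$; the data and all settings are invented for the example) is a minimal implementation of that idea:

```python
import random

random.seed(1)

# Simulated data: y_t = 1 + 2*x_t + u_t with a contaminated error term.
T, p = 100, 1.5
x = [random.uniform(0.0, 10.0) for _ in range(T)]
y = [1.0 + 2.0 * xi + random.gauss(0.0, 1.0) for xi in x]
for t in range(5):
    y[t] += random.gauss(0.0, 15.0)     # heavy contamination on 5 points

def weighted_ls(w):
    """Weighted LS for y = a + b*x via the 2x2 normal equations."""
    sw = sum(w)
    swx = sum(wi * xi for wi, xi in zip(w, x))
    swxx = sum(wi * xi * xi for wi, xi in zip(w, x))
    swy = sum(wi * yi for wi, yi in zip(w, y))
    swxy = sum(wi * xi * yi for wi, xi, yi in zip(w, x, y))
    det = sw * swxx - swx * swx
    return (swxx * swy - swx * swxy) / det, (sw * swxy - swx * swy) / det

# Because |r|^p = |r|^(p-2) * r^2, minimizing the Lp sum is a weighted
# least squares problem whose weights depend on the residuals; iterate.
a, b = weighted_ls([1.0] * T)           # start from ordinary LS
for _ in range(50):
    w = [max(abs(yi - a - b * xi), 1e-8) ** (p - 2.0)
         for xi, yi in zip(x, y)]
    a, b = weighted_ls(w)
print(round(a, 3), round(b, 3))
```

The floor `1e-8` on the absolute residual keeps the weight $|r|^{p-2}$ finite when a residual is near zero, in the same spirit as the $\epsilon$ devices discussed below for the $p = 1$ case.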

**L Estimator.** The $\theta$th sample quantile can be generalized to the regression situation by simply replacing $b$ by $x_t'b$ in the minimand (2.3.7), as noted by Koenker and Bassett (1978). We shall call the minimizing value the $\theta$th sample regression quantile and shall denote it by $\hat{\beta}(\theta)$. They investigated the conditions for the unique solution of the minimization problem and extended Mosteller's result to the regression case. They established that $\hat{\beta}(\theta_1), \hat{\beta}(\theta_2), \ldots, \hat{\beta}(\theta_n)$ are asymptotically normal with the means equal to $\beta + \mu(\theta_1), \beta + \mu(\theta_2), \ldots, \beta + \mu(\theta_n)$, where $\mu(\theta_i) = [\mu(\theta_i), 0, 0, \ldots, 0]'$, and the variance-covariance matrix is given by

$$\mathrm{Cov}[\hat{\beta}(\theta_i), \hat{\beta}(\theta_j)] = \omega_{ij}(X'X)^{-1}, \quad i \le j, \qquad (2.3.18)$$

where $\omega_{ij}$ is given in (2.3.8). A proof for the special case $\hat{\beta}(\tfrac{1}{2}) = \hat{\beta}$ is also given in Bassett and Koenker (1978) (see Section 4.6.2).
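In the degenerate case of an intercept-only "regression," the Koenker-Bassett minimand reduces to that of the ordinary $\theta$th sample quantile, which the following pure-Python sketch verifies by brute force (the sample size and $\theta$ are chosen arbitrarily for the example):

```python
import random

random.seed(2)

theta = 0.25
T = 99
y = [random.gauss(0.0, 1.0) for _ in range(T)]

def check_loss(b):
    """The quantile minimand with x_t'b equal to a constant b."""
    total = 0.0
    for yi in y:
        u = yi - b
        total += u * (theta - (1.0 if u < 0.0 else 0.0))
    return total

# The minimand is piecewise linear in b, so a minimizer is attained at one
# of the observed y values; search them all.
b_star = min(y, key=check_loss)

# With T = 99 and theta = 0.25 (so T*theta is not an integer), the unique
# minimizer is the 25th order statistic, i.e. the sample 0.25 quantile.
q = sorted(y)[24]
print(b_star == q)
```

For a genuine regressor matrix the same minimand is typically solved as a linear program, but the piecewise-linear structure exploited here carries over.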

Blattberg and Sargent (1971) conducted a Monte Carlo study and compared $\hat{\beta}$, the least squares estimator, and one other estimator in a model with one regressor and no intercept, assuming the error term has the characteristic function $\exp(-|\sigma\lambda|^{\alpha})$ for $\alpha = 1.1$, 1.3, 1.5, 1.7, 1.9, and 2.0. Note that $\alpha = 2$ gives the normal distribution and $\alpha = 1$ the Cauchy distribution. They found that $\hat{\beta}$ did best in general.

Schlossmacher (1973) proposed the following iterative scheme to obtain $\hat{\beta}$:

$$\hat{\beta}_{(i)} = (X'DX)^{-1}X'Dy, \qquad (2.3.19)$$

where $D$ is the diagonal matrix whose $t$th diagonal element is given by

$$d_t = \begin{cases} |y_t - x_t'\hat{\beta}_{(i-1)}|^{-1} & \text{if } |y_t - x_t'\hat{\beta}_{(i-1)}| > \epsilon \\ 0 & \text{otherwise,} \end{cases} \qquad (2.3.20)$$

where $\hat{\beta}_{(i)}$ is the estimate obtained at the $i$th iteration and $\epsilon$ is some predefined small number (say, $\epsilon = 10^{-7}$). Schlossmacher offered no theory of convergence but found good convergence in a particular example.

Fair (1974) proposed the following alternative to (2.3.20):

$$d_t = \min\left(|y_t - x_t'\hat{\beta}_{(i-1)}|^{-1},\ \epsilon^{-1}\right). \qquad (2.3.21)$$

Thus Fair bounded the weights from above, whereas Schlossmacher threw out observations that should be getting the greatest weight possible. It is clear that Fair's method is preferable because his weights are continuous and nonincreasing functions of the residuals, whereas Schlossmacher's are not.
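A minimal sketch of the iteration (2.3.19) with Fair's weights (2.3.21), for a single regressor with no intercept and simulated contaminated data (all settings invented for the example):

```python
import random

random.seed(3)

# Simulated data: y_t = 2*x_t + u_t, no intercept, one gross outlier.
T = 60
x = [random.uniform(1.0, 5.0) for _ in range(T)]
y = [2.0 * xi + random.gauss(0.0, 1.0) for xi in x]
y[0] += 40.0

eps = 1e-7      # the small constant of (2.3.20) and (2.3.21)

def fair_weight(r):
    # Fair's (2.3.21): the L1 weight 1/|r|, bounded above by 1/eps instead
    # of being set to zero for near-zero residuals.
    return 1.0 / max(abs(r), eps)

b = sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)  # LS start
for _ in range(200):
    w = [fair_weight(yi - b * xi) for xi, yi in zip(x, y)]
    b = (sum(wi * xi * yi for wi, xi, yi in zip(w, x, y))
         / sum(wi * xi * xi for wi, xi in zip(w, x)))
print(round(b, 3))
```

The outlier keeps a small but continuous weight $1/|r|$ throughout, illustrating why Fair's bounded weights behave better than a hard cutoff.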

A generalization of the trimmed mean to the regression model has been proposed by several authors. We shall consider two such methods. The first, which we shall call $\hat{\beta}(\alpha)$ for $0 < \alpha < \tfrac{1}{2}$, requires a preliminary estimate, which we shall denote by $\hat{\beta}_0$. The estimation process involves calculating the residuals from $\hat{\beta}_0$, throwing away those observations corresponding to the $[T\alpha]$ smallest and $[T\alpha]$ largest residuals, and then calculating $\hat{\beta}(\alpha)$ by least squares applied to the remaining observations.

The second method uses the $\theta$th regression quantile $\hat{\beta}(\theta)$ of Koenker and Bassett (1978) mentioned earlier. This method involves removing from the sample any observations that have a residual from $\hat{\beta}(\alpha)$ that is negative or a residual from $\hat{\beta}(1 - \alpha)$ that is positive and then calculating the LS estimator using the remaining observations. This estimator is denoted by $\hat{\beta}^*(\alpha)$.

Ruppert and Carroll (1980) derived the asymptotic distribution of the two estimators and showed that the properties of $\hat{\beta}(\alpha)$ are sensitive to the choice of the initial estimate $\hat{\beta}_0$ and can be inefficient relative to $\hat{\beta}^*(\alpha)$. However, if $\hat{\beta}_0 = \tfrac{1}{2}[\hat{\beta}(\alpha) + \hat{\beta}(1 - \alpha)]$ and the distribution of the error term is symmetric, $\hat{\beta}(\alpha)$ is asymptotically equivalent to $\hat{\beta}^*(\alpha)$.
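A sketch of the first method, $\hat{\beta}(\alpha)$, on simulated data, using ordinary least squares as the preliminary estimate $\hat{\beta}_0$ (an arbitrary choice for the example; as just noted, the estimator is sensitive to this choice):

```python
import random

random.seed(4)

alpha = 0.1
T = 80

# Simulated data: y_t = 1 + 2*x_t + u_t with a contaminated normal error.
x = [random.uniform(0.0, 10.0) for _ in range(T)]
def err():
    return random.gauss(0.0, 9.0) if random.random() < 0.1 else random.gauss(0.0, 1.0)
y = [1.0 + 2.0 * xi + err() for xi in x]

def ls(pairs):
    """Ordinary LS of y on (1, x) over the given (x, y) pairs."""
    n = len(pairs)
    sx = sum(xi for xi, _ in pairs)
    sy = sum(yi for _, yi in pairs)
    sxx = sum(xi * xi for xi, _ in pairs)
    sxy = sum(xi * yi for xi, yi in pairs)
    det = n * sxx - sx * sx
    return (sxx * sy - sx * sxy) / det, (n * sxy - sx * sy) / det

# Preliminary estimate beta_0: here, ordinary LS on the full sample.
a0, b0 = ls(list(zip(x, y)))

# Discard the [T*alpha] smallest and [T*alpha] largest residuals from the
# preliminary fit, then recompute LS on what remains: beta_hat(alpha).
k = int(T * alpha)
by_resid = sorted(zip(x, y), key=lambda p: p[1] - a0 - b0 * p[0])
kept = by_resid[k:T - k]
a_trim, b_trim = ls(kept)
print(round(a_trim, 3), round(b_trim, 3))
```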

**R Estimator.** This type of regression estimator was proposed by Jaeckel (1972). He wrote the regression model as $y = \beta_0 l + X\beta + u$, where $X$ is now the usual regressor matrix except for $l$, the column of ones. Jaeckel's estimator minimizes

$$D(y - Xb) = \sum_{t=1}^{T}\left[R_t - \frac{T+1}{2}\right](y_t - x_t'b),$$

where $R_t = \operatorname{rank}(y_t - x_t'b)$. Note that $R_t$ is also a function of $b$. Jaeckel proved that $D$ is a nonnegative, continuous, and convex function of $b$ and that his estimator is asymptotically normal with mean $\beta$ and variance-covariance matrix

$$\tau^2(X'X)^{-1}, \qquad (2.3.22)$$

where $\tau^2 = 12^{-1}\left[\int f^2(u)\,du\right]^{-2}$. The ratio $\sigma^2/\tau^2$ is known as the Pitman efficiency of the Wilcoxon rank test and is equal to 0.955 if $f$ is normal and greater than 0.864 for any symmetric distribution, whereas its upper bound is infinity. Because the derivative of $D$ exists almost everywhere, any iterative scheme of minimization that uses only the first derivatives can be used. (Second derivatives are identically zero.) The intercept $\beta_0$ may be estimated by the Hodges-Lehmann estimator, $\mathrm{Median}\{(\hat{u}_i + \hat{u}_j)/2\}$, $1 \le i \le j \le T$, where $\hat{u}$ is the vector of the least squares residuals. See the articles by McKean and Hettmansperger (1976) and Hettmansperger and McKean (1977) for tests of linear hypotheses using Jaeckel's estimator.
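A pure-Python sketch of Jaeckel's estimator for a single regressor: since $D$ is convex in $b$, a simple ternary search suffices, and the intercept is then estimated by the Hodges-Lehmann median of pairwise averages, applied here to the residuals from the rank-based slope (simulated data; all settings invented for the example):

```python
import random
import statistics

random.seed(5)

# Simulated data: y_t = 1 + 2*x_t + u_t; D ignores the intercept entirely.
T = 60
x = [random.uniform(0.0, 5.0) for _ in range(T)]
y = [1.0 + 2.0 * xi + random.gauss(0.0, 1.0) for xi in x]

def D(b):
    """Jaeckel's dispersion: sum of [R_t - (T+1)/2] * (y_t - b*x_t)."""
    r = [yi - b * xi for xi, yi in zip(x, y)]
    order = sorted(range(T), key=lambda t: r[t])
    rank = [0] * T
    for pos, t in enumerate(order):
        rank[t] = pos + 1               # R_t = rank of the t-th residual
    return sum((rank[t] - (T + 1) / 2.0) * r[t] for t in range(T))

# D is convex and piecewise linear in b, so ternary search on a
# bracketing interval converges to the minimizing slope.
lo, hi = 0.0, 4.0
for _ in range(80):
    m1, m2 = lo + (hi - lo) / 3.0, hi - (hi - lo) / 3.0
    if D(m1) < D(m2):
        hi = m2
    else:
        lo = m1
b_j = (lo + hi) / 2.0

# Intercept: Hodges-Lehmann estimator, the median of pairwise averages,
# applied to the residuals from the rank-based slope fit.
e = [yi - b_j * xi for xi, yi in zip(x, y)]
b0_hl = statistics.median((e[i] + e[j]) / 2.0
                          for i in range(T) for j in range(i, T))
print(round(b_j, 3), round(b0_hl, 3))
```

Adding any constant to $y$ leaves $D$ unchanged (the rank weights sum to zero), which is why the intercept must be recovered separately.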

Exercises

1. (Section 2.1.3)

Show that the Bayesian minimand (2.1.5) is minimized when $S$ is chosen to be the set of $y$ that satisfies the inequality (2.1.3).

2. (Section 2.1.5)

A weakness of PC is that it does not choose the right model with probability 1 when $T$ goes to infinity. (The weakness, however, is not serious.) Suppose we must choose between regressor matrices $X_1$ and $X$ such that $X_1 \subset X$. Show that

$$\lim_{T \to \infty} P[\text{PC chooses } X_1 \mid X_1 \text{ is true}] = P[\chi^2_{K-K_1} < 2(K - K_1)] < 1.$$

3. (Section 2.1.5)

Schwarz's (1978) criterion minimizes $T \log y'M_1 y + K_1 \log T$. Show that this criterion chooses the correct model with probability 1 as $T$ goes to $\infty$.

4. (Section 2.2.3)

If $F'\beta$ is estimable, show that $F'\hat{\beta}$ is the BLUE of $F'\beta$, where $\hat{\beta}$ is the LS estimator.

5. (Section 2.2.3)

Show that $\hat{\beta}$ defined in (2.2.10) is a solution of (2.2.6).

6. (Section 2.2.4)

Show that for any square matrix $A$ there exists a positive constant $\gamma_0$ such that for all $\gamma > \gamma_0$, $A + \gamma I$ is nonsingular.