# Multicollinearity and Principal Components

In Model 1 we assumed that $X$ is of full rank [that is, rank$(X) = K$], or, equivalently, that $X'X$ is nonsingular. If it is not, $X'X$ cannot be inverted, and therefore the least squares estimator cannot be uniquely defined. In other words, there is no unique solution to the normal equation

$$X'X\hat{\beta} = X'y. \qquad (2.2.6)$$

Even then, however, a subset of the regression parameters still may be estimated by (1.2.14).

We shall now turn to a more general question: Can $F'\beta$ be estimated by least squares, where $F$ is an arbitrary $K \times f$ matrix of rank $f$ ($f \le K$)? To make sense of this question, we must first define the least squares estimator of $F'\beta$. We say that the least squares estimator of $F'\beta$ is $F'\hat{\beta}$, where $\hat{\beta}$ is any solution (which may not be unique) of the normal equation (2.2.6), provided $F'\hat{\beta}$ is unique. If $F'\hat{\beta}$ is unique, we also say that $F'\beta$ is estimable. It is then easy to prove that $F'\beta$ is estimable if and only if we can write $F = X'A$ for some $T \times f$ matrix $A$, or, equivalently, if and only if we can write $F = X'XB$ for some $K \times f$ matrix $B$. (See Rao, 1973, pp. 223-224, for the proof.) If $F'\beta$ is estimable, it can be shown that $F'\hat{\beta}$ is the best linear unbiased estimator of $F'\beta$.
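The rank condition $F = X'A$ can be checked numerically: the columns of $F$ must lie in the row space of $X$. Below is a minimal NumPy sketch with a made-up rank-deficient design; the helper name `is_estimable` is ours, not from the text.

```python
import numpy as np

# Hypothetical rank-deficient design: the third column is the sum of the
# first two, so rank(X) = 2 < K = 3 and X'X is singular.
X = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0],
              [1.0, 1.0, 2.0],
              [2.0, 1.0, 3.0]])
assert np.linalg.matrix_rank(X.T @ X) == 2

def is_estimable(F, X, tol=1e-8):
    # F'beta is estimable iff F = X'A for some A, i.e. iff the columns of F
    # lie in the row space of X; projecting them onto that row space must
    # then leave them unchanged.  (rcond set explicitly for a safe cutoff.)
    P = np.linalg.pinv(X, rcond=1e-8) @ X   # projector onto the row space of X
    return np.allclose(P @ F, F, atol=tol)

F1 = np.array([[1.0], [1.0], [2.0]])  # lies in the row space -> estimable
F2 = np.array([[0.0], [0.0], [1.0]])  # picks out beta_3 alone -> not estimable
print(is_estimable(F1, X), is_estimable(F2, X))
```

Note that $\beta_3$ alone is not estimable here, while the combination $\beta_1 + \beta_2 + 2\beta_3$ is, even though $X'X$ cannot be inverted.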

The estimability of $F'\beta$ can be reduced to the estimability of a subset of the regression parameters in the sense of the previous paragraph by the following observation. Let $G$ be a $K \times (K-f)$ matrix of rank $K-f$ such that $G'F = 0$. (We defined a similar matrix in Section 1.4.2.) Then we can write Model 1 as

$$y = X\beta + u = XF(F'F)^{-1}F'\beta + XG(G'G)^{-1}G'\beta + u \equiv Z_1\gamma_1 + Z_2\gamma_2 + u,$$

where the identity defines $Z_1$, $Z_2$, $\gamma_1$, and $\gamma_2$. Then the estimability of $F'\beta$ is equivalent to the estimability of $\gamma_1$.

If $X'X$ is singular, $\beta$ is not estimable in the sense defined above (that is, a solution of Eq. 2.2.6 is not unique). This fact does not mean that we should not attempt to estimate $\beta$. We can still meaningfully talk about a class of estimators and study the relative merits of the members of the class. One such class may be the totality of solutions of (2.2.6), which are infinite in number. Another class may be the constrained least squares estimators satisfying linear constraints $Q'\beta = c$. From Eq. (1.4.11) it is clear that this estimator can be defined even when $X'X$ is singular. A third class is the class of Bayes estimators with prior $Q\beta = c + v$ formed by varying $Q$, $c$, and the distribution of $v$. We should mention an important member of the first class that also happens to be a member of the second class: the principal components estimator.

Suppose we arrange the diagonal elements of $\Lambda$ defined in Section 2.2.2 in descending order, $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_K$, and let the corresponding characteristic vectors be $h_1, h_2, \ldots, h_K$, so that $H = (h_1, h_2, \ldots, h_K)$. Then we call $Xh_i$ the $i$th principal component of $X$. If $X'X$ is singular, some of its characteristic roots are 0. Partition

$$\Lambda = \begin{bmatrix} \Lambda_1 & 0 \\ 0 & \Lambda_2 \end{bmatrix} \qquad (2.2.8)$$

so that the diagonal elements of $\Lambda_1$ are positive and those of $\Lambda_2$ are all 0, and partition $H = (H_1, H_2)$ conformably. Furthermore, define $X_1^* = XH_1$ and $X_2^* = XH_2$ and partition $\alpha' = (\alpha_1', \alpha_2')$ conformably. Then $X_2^* = 0$, and hence $\alpha_2$ cannot be estimated. Suppose we estimate $\alpha_1$ by

$$\hat{\alpha}_1 = (X_1^{*\prime} X_1^*)^{-1} X_1^{*\prime} y \qquad (2.2.9)$$

and set $\hat{\alpha}_2 = 0$. (It is arbitrary to choose 0 here; any other constant will do.) Transforming $\hat{\alpha}' = (\hat{\alpha}_1', \hat{\alpha}_2')$ into an estimator of $\beta$, we obtain the principal components estimator of $\beta$ by the formula⁶

$$\hat{\beta}_P = H\hat{\alpha} = H_1 \Lambda_1^{-1} H_1' X' y. \qquad (2.2.10)$$

It is easy to show that $\hat{\beta}_P$ satisfies (2.2.6); hence, it is a member of the first class. It is also a member of the second class because it is the constrained least squares estimator subject to $H_2'\beta = 0$.
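These claims are easy to verify numerically. The NumPy sketch below (with a made-up rank-deficient design) computes (2.2.10) from the characteristic roots and vectors of $X'X$ and checks that the result solves the singular normal equation (2.2.6).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical rank-deficient design: K = 3, rank 2 (col 3 = col 1 + col 2).
Z = rng.normal(size=(20, 2))
X = np.column_stack([Z, Z[:, 0] + Z[:, 1]])
y = rng.normal(size=20)

# Eigen-decomposition of X'X: H orthogonal, roots in descending order
# as in the text.
lam, H = np.linalg.eigh(X.T @ X)
order = np.argsort(lam)[::-1]
lam, H = lam[order], H[:, order]

K1 = int(np.sum(lam > 1e-10))          # number of nonzero roots (here 2)
H1, lam1 = H[:, :K1], lam[:K1]

# Principal components estimator (2.2.10): beta_P = H1 Lambda1^{-1} H1' X' y
beta_P = H1 @ np.diag(1.0 / lam1) @ H1.T @ X.T @ y

# beta_P solves the (singular) normal equation X'X beta = X'y
assert np.allclose(X.T @ X @ beta_P, X.T @ y)
print(beta_P)
```

As a side observation, $\hat{\beta}_P$ coincides here with the minimum-norm least squares solution computed by a pseudoinverse, which is one way to see that it is a particular member of the infinite class of solutions of (2.2.6).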

Fomby, Hill, and Johnson (1978) showed that the principal components estimator (or constrained least squares subject to $H_2'\beta = 0$) has a smaller variance-covariance matrix than any constrained least squares estimator obtained subject to the constraints $Q'\beta = c$, where $Q$ and $c$ can be arbitrary except that $Q$ has an equal or smaller number of columns than $H_2$.

We shall now consider a situation where $X'X$ is nonsingular but nearly singular. The near singularity of $X'X$ is commonly referred to as multicollinearity. Another way of characterizing it is to say that the determinant of $X'X$ is close to 0 or that the smallest characteristic root of $X'X$ is small. (The question of how small is "small" will be better understood later.) We now ask, How precisely or imprecisely can we estimate a linear combination of the regression parameters $c'\beta$ by least squares?⁷ Because the matrix $H$ is nonsingular, we can write $c = Hd$ for some vector $d$. Then we have

$$V(c'\hat{\beta}) = \sigma^2 d' \Lambda^{-1} d, \qquad (2.2.11)$$

which gives the answer to the question. In other words, the closer $c$ is to the direction of the first (last) principal component, the more precisely (imprecisely) one can estimate $c'\beta$. In particular, we note from (2.2.5) that the precision of the estimator of an element of $\alpha$ is directly proportional to the corresponding diagonal element of $\Lambda$.
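Formula (2.2.11) can be illustrated with a deliberately multicollinear design. In the sketch below (our own construction, not from the text), two nearly identical regressors produce one large and one tiny characteristic root, and the variance of $c'\hat{\beta}$ explodes along the last principal component.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical nearly multicollinear design: two strongly correlated columns.
Z = rng.normal(size=(50, 1))
X = np.column_stack([Z, Z + 0.01 * rng.normal(size=(50, 1))])
sigma2 = 1.0

lam, H = np.linalg.eigh(X.T @ X)
order = np.argsort(lam)[::-1]
lam, H = lam[order], H[:, order]       # roots in descending order

def var_lincomb(c):
    # (2.2.11): for c = Hd, Var(c'beta_hat) = sigma^2 d' Lambda^{-1} d
    d = H.T @ c                        # c = Hd  <=>  d = H'c (H orthogonal)
    return sigma2 * d @ (d / lam)

# Agrees with the direct formula sigma^2 c'(X'X)^{-1} c:
direct = lambda c: sigma2 * c @ np.linalg.solve(X.T @ X, c)
c1 = H[:, 0]                           # direction of the first principal component
c2 = H[:, 1]                           # direction of the last principal component
assert np.isclose(var_lincomb(c1), direct(c1))
assert np.isclose(var_lincomb(c2), direct(c2))

# Variance along the last principal component is far larger.
print(var_lincomb(c1), var_lincomb(c2))
```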

Suppose we partition $\Lambda$ as in (2.2.8), but this time include all the "large" elements in $\Lambda_1$ and the "small" elements in $\Lambda_2$. The decision of which roots to include in $\Lambda_2$ should depend on a subjective evaluation by the statistician regarding the magnitude of the variance to be tolerated. It makes sense to use the principal components estimator (2.2.10) in the present situation as well, because $\alpha_2$ can be only imprecisely estimated. (Here, the principal components estimator is not unique because the choice of $\Lambda_2$ is a matter of subjective judgment. Therefore it is more precise to call this estimator by a name such as the $K_1$ principal components estimator, specifying the number of elements chosen in $\alpha_1$.)

## 2.2.2 Ridge Regression

Hoerl and Kennard (1970a, b) proposed the class of estimators defined by

$$\hat{\beta}(\gamma) = (X'X + \gamma I)^{-1} X' y, \qquad (2.2.12)$$

called the ridge estimators. Hoerl and Kennard chose these estimators because they hoped to alleviate the instability of the least squares estimator due to the near singularity of $X'X$ by adding a positive scalar $\gamma$ to the characteristic roots of $X'X$. They proved that given $\beta$ there exists $\gamma^*$, which depends upon $\beta$, such that

$$E[\hat{\beta}(\gamma^*) - \beta]'[\hat{\beta}(\gamma^*) - \beta] < E(\hat{\beta} - \beta)'(\hat{\beta} - \beta),$$

where $\hat{\beta} = \hat{\beta}(0)$. Because $\gamma^*$ depends on $\beta$, $\hat{\beta}(\gamma^*)$ is not a practical estimator. But the existence of $\hat{\beta}(\gamma^*)$ gives rise to the hope that one can determine $\gamma$, either as a constant or as a function of the sample, in such a way that $\hat{\beta}(\gamma)$ is better than the least squares estimator $\hat{\beta}$ with respect to the risk function (2.2.2) over a wide range of the parameter space.

Hoerl and Kennard proposed the ridge trace method to determine the value of $\gamma$. The ridge trace is a graph of $\hat{\beta}_i(\gamma)$, the $i$th element of $\hat{\beta}(\gamma)$, drawn as a function of $\gamma$. They proposed that $\gamma$ be determined as the smallest value at which the ridge trace stabilizes. The method suffers from two weaknesses: (1) the point at which the ridge trace starts to stabilize cannot always be determined objectively; (2) the method lacks theoretical justification inasmuch as its major justification is derived from certain Monte Carlo studies, which, though favorable, are not conclusive. Although several variations of the ridge trace method and many analogous procedures for determining $\gamma$ have been proposed, we shall discuss only the empirical Bayes method, which seems to be the only method based on theoretical grounds. We shall present a variant of it in the next paragraph and more in the next two subsections.
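A crude ridge trace can be computed directly from (2.2.12). The NumPy sketch below (with a made-up ill-conditioned design and made-up true parameters) evaluates $\hat{\beta}(\gamma)$ along a grid of $\gamma$ values; one would plot each element against $\gamma$ and look for the point of stabilization.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical ill-conditioned design: two nearly identical regressors.
Z = rng.normal(size=(50, 1))
X = np.column_stack([Z, Z + 0.01 * rng.normal(size=(50, 1))])
beta_true = np.array([1.0, 1.0])       # made-up true parameter
y = X @ beta_true + rng.normal(size=50)

def ridge(gamma):
    # Ridge estimator (2.2.12): beta(gamma) = (X'X + gamma I)^{-1} X' y
    K = X.shape[1]
    return np.linalg.solve(X.T @ X + gamma * np.eye(K), X.T @ y)

# A crude ridge trace: beta(gamma) along a grid of gamma values.
for gamma in [0.0, 0.01, 0.1, 1.0, 10.0]:
    print(gamma, ridge(gamma))
```

Note that $\hat{\beta}(0)$ is the (unstable) least squares estimator, and that the length of $\hat{\beta}(\gamma)$ shrinks monotonically as $\gamma$ grows, which is the mechanism behind the hoped-for variance reduction.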

Several authors have interpreted the ridge estimator (more precisely, the class of estimators) as a Bayes estimator and proposed the empirical Bayes method of determining $\gamma$; we shall follow the discussion of Sclove (1973).⁸ Suppose that the prior distribution of $\beta$ is $N(\mu, \sigma_\beta^2 I)$, distributed independently of $u$. Then from (1.4.22) the Bayes estimator of $\beta$ is given by

$$\beta^* = (X'X + \gamma I)^{-1}(X'y + \gamma\mu), \qquad (2.2.13)$$

where $\gamma = \sigma^2/\sigma_\beta^2$. Therefore, the ridge estimator (2.2.12) is obtained by putting $\mu = 0$ in (2.2.13). By the empirical Bayes method we mean the estimation of the parameters (in our case, $\sigma_\beta^2$) of the prior distribution using the sample observations. The empirical Bayes method may be regarded as a compromise between the Bayesian and the classical analysis. From the marginal distribution (that is, not conditional on $\beta$) of $y$, we have $E\,y'y = \sigma_\beta^2 \operatorname{tr} X'X + T\sigma^2$,

which suggests that we can estimate $\sigma_\beta^2$ by

$$\hat{\sigma}_\beta^2 = \frac{y'y - T\hat{\sigma}^2}{\operatorname{tr} X'X},$$

where $\hat{\sigma}^2 = T^{-1} y'[I - X(X'X)^{-1}X']y$ as usual. Finally, we can estimate $\gamma$ by $\hat{\gamma} = \hat{\sigma}^2/\hat{\sigma}_\beta^2$.
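The moment calculation above translates directly into code. The sketch below (our own simulation, with made-up variances) draws $\beta$ from the prior, forms $\hat{\sigma}^2$, $\hat{\sigma}_\beta^2$, and $\hat{\gamma}$ as in the text; note that $y'y - T\hat{\sigma}^2 = y'X(X'X)^{-1}X'y \ge 0$, so the estimate of $\sigma_\beta^2$ is automatically nonnegative.

```python
import numpy as np

rng = np.random.default_rng(3)

T, K = 100, 4
X = rng.normal(size=(T, K))
sigma2_true, sigma2_beta_true = 1.0, 0.5   # made-up true variances

# Simulate from the model with prior beta ~ N(0, sigma_beta^2 I).
beta = rng.normal(scale=np.sqrt(sigma2_beta_true), size=K)
y = X @ beta + rng.normal(scale=np.sqrt(sigma2_true), size=T)

# sigma^2_hat = T^{-1} y'[I - X(X'X)^{-1}X']y, as in the text.
M = np.eye(T) - X @ np.linalg.solve(X.T @ X, X.T)
sigma2_hat = (y @ M @ y) / T

# E y'y = sigma_beta^2 tr(X'X) + T sigma^2 suggests the moment estimator
# sigma_beta^2_hat = (y'y - T sigma^2_hat) / tr(X'X).
sigma2_beta_hat = (y @ y - T * sigma2_hat) / np.trace(X.T @ X)

# Empirical Bayes choice of the ridge parameter: gamma_hat = sigma^2 / sigma_beta^2
gamma_hat = sigma2_hat / sigma2_beta_hat
print(sigma2_hat, sigma2_beta_hat, gamma_hat)
```

The resulting $\hat{\gamma}$ can then be plugged into (2.2.12) to obtain an operational ridge estimator.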

In the next two subsections we shall discuss many more varieties of ridge estimators and what we call generalized ridge estimators, some of which involve the empirical Bayes method of determining $\gamma$. Throughout, the canonical model presented in Section 2.2.2 will be considered.