Ridge Regression and Stein’s Estimator
We proved in Section 1.2.5 that the LS estimator is best linear unbiased in Model 1 and proved in Section 1.3.3 that it is best unbiased in Model 1 with normality. In either case a biased estimator may be better than LS (in the sense of having a smaller mean squared error) for some parameter values. In this section we shall consider a variety of biased estimators and compare them to LS in Model 1 with normality.
The biased estimators we shall consider here are either the constrained least squares estimator discussed in Section 1.4.1 or the Bayes estimator discussed in Section 1.4.4 or their variants. If the linear constraints (1.4.1) are true, the constrained least squares estimator is best linear unbiased. Similarly, the Bayes estimator has optimal properties if the regression vector $\beta$ is indeed random and generated according to the prior distribution. In this section, however, we shall investigate the properties of these estimators assuming that the constraints do not necessarily hold. Hence, we have called them biased estimators. Even so, it is not at all surprising that such a biased estimator can beat the least squares estimator over some region of the parameter space. For example, 0 can beat any estimator when the true value of the parameter in question is indeed 0. What is surprising is that there exists a biased estimator that dominates the least squares estimator over the whole parameter space when the risk function is the sum of the mean squared errors, as we shall show. Such an estimator was first discovered by Stein (see James and Stein, 1961) and has since attracted the attention of many statisticians, some of whom have extended Stein’s results in various directions.
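The dominance claim can be checked numerically before it is proved. The following is a minimal Monte Carlo sketch, not the derivation given later in the text. It assumes X'X = I and a known error variance, so that the least squares estimator b is distributed N(β, σ²I), and it uses the familiar shrinkage form (1 − (p − 2)σ²/b'b)b for the Stein-type estimator. The function name `simulate_risks` and its parameters are hypothetical, chosen only for illustration.

```python
import random


def simulate_risks(beta, sigma=1.0, reps=20000, seed=0):
    """Monte Carlo risk (sum of mean squared errors) of LS and of a
    James-Stein-type estimator.

    Assumptions (for illustration only): X'X = I and sigma is known,
    so the LS estimator is b ~ N(beta, sigma^2 I).
    """
    rng = random.Random(seed)
    p = len(beta)
    ls_loss = 0.0
    js_loss = 0.0
    for _ in range(reps):
        # Draw the LS estimator directly from its sampling distribution.
        b = [bj + rng.gauss(0.0, sigma) for bj in beta]
        # Shrink b toward the origin by a data-dependent factor.
        shrink = 1.0 - (p - 2) * sigma ** 2 / sum(v * v for v in b)
        js = [shrink * v for v in b]
        ls_loss += sum((v - t) ** 2 for v, t in zip(b, beta))
        js_loss += sum((v - t) ** 2 for v, t in zip(js, beta))
    return ls_loss / reps, js_loss / reps


if __name__ == "__main__":
    for beta in ([0.0] * 5, [2.0] * 5):
        ls, js = simulate_risks(beta)
        print(f"beta = {beta}: LS risk ~ {ls:.3f}, Stein-type risk ~ {js:.3f}")
```

With five coefficients the simulated Stein-type risk should fall below the least squares risk at both parameter points, and the gap should be largest at β = 0, where the theoretical risks under these assumptions are 5 and 2.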
In this section we shall discuss simultaneously two closely related and yet separate ideas: One is the aforementioned idea that a biased estimator can dominate least squares, for which the main result is Stein’s, and the other is the idea of ridge regression originally developed by Hoerl and Kennard (1970a, b) to cope with the problem of multicollinearity. Although the two ideas were initially developed independently of each other, the resulting estimators are close cousins; in fact, the term Stein-type estimators and the term ridge estimators are synonymous and may be used to describe the same class of estimators. Nevertheless, it is important to recognize them as separate ideas. We might be tempted to combine the two ideas by asserting that a biased estimator can be good and is especially so if there is multicollinearity. The statement can be proved wrong simply by noting that Stein’s original model assumes X’X = I, the opposite of multicollinearity. The correct characterization of the two ideas is as follows: (1) Some form of constraint is useful in estimation. (2) Some form of constraint is necessary if there is multicollinearity.
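Point (2) can be illustrated with a small calculation. The sketch below assumes the standard ridge form (X'X + kI)^{-1}X'y, which this passage has not yet defined, and works in canonical coordinates: `eigvals` holds the characteristic roots of X'X and `alpha` holds the regression vector rotated by the corresponding characteristic vectors. Under those assumptions the risk of ridge is the sum over i of (σ²λ_i + k²α_i²)/(λ_i + k)², and setting k = 0 gives the least squares risk, the sum of σ²/λ_i. The function names and the numerical values are hypothetical.

```python
def ls_risk(eigvals, sigma2=1.0):
    """Sum of mean squared errors of least squares: sigma^2 * tr (X'X)^{-1}."""
    return sum(sigma2 / lam for lam in eigvals)


def ridge_risk(eigvals, alpha, k, sigma2=1.0):
    """Sum of mean squared errors of the ridge estimator with constant k.

    Each term is variance (sigma2 * lam / (lam + k)^2) plus squared bias
    (k^2 * a^2 / (lam + k)^2) in canonical coordinates.
    """
    return sum(
        (sigma2 * lam + k ** 2 * a ** 2) / (lam + k) ** 2
        for lam, a in zip(eigvals, alpha)
    )


if __name__ == "__main__":
    alpha = [1.0, 1.0]
    k = 0.1
    # Let the smaller characteristic root approach zero (near collinearity).
    for small_root in (1.0, 0.1, 0.01, 0.001):
        roots = [2.0, small_root]
        print(small_root, ls_risk(roots), ridge_risk(roots, alpha, k))
```

As the smaller root approaches zero the least squares risk grows without bound, whereas the ridge risk stays bounded for any fixed k > 0. A fixed k does not dominate least squares, however: with both roots equal to 1, α = (10, 10) and k = 1, the ridge risk in this sketch exceeds the least squares risk.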
The risk function we shall use throughout this section is the scalar
$$E(\hat\beta - \beta)'(\hat\beta - \beta), \tag{2.2.1}$$
where $\hat\beta$ is an estimator in question. This choice of the risk function is as general as
$$E(\hat\beta - \beta)'A(\hat\beta - \beta), \tag{2.2.2}$$
where $A$ is an arbitrary (known) positive definite matrix, because we can always reduce (2.2.2) to (2.2.1) by transforming Model 1 to
$$y = X\beta + u = XA^{-1/2}A^{1/2}\beta + u \tag{2.2.3}$$
and considering the transformed parameter vector $A^{1/2}\beta$. Note, however, that (2.2.1) is not as general as the mean squared error matrix $E(\hat\beta - \beta)(\hat\beta - \beta)'$, which we used in Section 1.2.4, since (2.2.1) is the trace of the mean squared error matrix.
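The two claims in this paragraph can be written out step by step. The symbols $Z$, $\gamma$, and $\hat\gamma$ below are labels introduced here only for illustration, and $A^{1/2}$ is taken to be the symmetric square root of $A$.

```latex
\begin{align*}
y &= X\beta + u = (XA^{-1/2})(A^{1/2}\beta) + u \equiv Z\gamma + u,
  \qquad \hat\gamma \equiv A^{1/2}\hat\beta, \\
E(\hat\beta - \beta)'A(\hat\beta - \beta)
  &= E\,[A^{1/2}(\hat\beta - \beta)]'[A^{1/2}(\hat\beta - \beta)]
   = E(\hat\gamma - \gamma)'(\hat\gamma - \gamma), \\
E(\hat\beta - \beta)'(\hat\beta - \beta)
  &= E \operatorname{tr}\,(\hat\beta - \beta)(\hat\beta - \beta)'
   = \operatorname{tr} E(\hat\beta - \beta)(\hat\beta - \beta)'.
\end{align*}
```

The second line shows that the weighted risk for $\beta$ is the unweighted risk for the transformed parameter $\gamma$. The third line shows that the scalar risk retains only the diagonal elements of the mean squared error matrix, which is why it is the less general criterion.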