# LEAST SQUARES ESTIMATORS

10.2.1 Definition

In this section we study the estimation of the parameters $\alpha$, $\beta$, and $\sigma^2$ in the bivariate linear regression model (10.1.1). We first consider estimating $\alpha$ and $\beta$. The $T$ observations on $y$ and $x$ can be plotted in a so-called scatter diagram, as in Figure 10.1. In that figure each dot represents a vector of observations on $y$ and $x$. We have labeled one dot as the vector $(y_t, x_t)$. We have also drawn a straight line through the scattered dots and labeled the point of intersection between the line and the dashed perpendicular line that goes through $(y_t, x_t)$ as $(\hat{y}_t, x_t)$. Then the problem of estimating $\alpha$ and $\beta$ can be geometrically interpreted as the problem of drawing a straight line such that its slope is an estimate of $\beta$ and its intercept is an estimate of $\alpha$.

FIGURE 10.1 Scatter diagram

Since $Eu_t = 0$, a reasonable person would draw a line somewhere through a configuration of the scattered dots, but there are a multitude of ways to draw such a line. Gauss, in a publication dated 1821, proposed the least squares method, in which a line is drawn in such a way that the sum of squares of the vertical distances between the line and each dot is minimized. In Figure 10.1, the vertical distance between the line and the point $(y_t, x_t)$ is indicated by $h$. Minimizing the sum of squares of distances in any other direction would result in a different line. Alternatively, we can draw a line so as to minimize the sum of absolute deviations, or the sum of the fourth power of the deviations, and so forth. Another simple method would be to connect the two dots signifying the largest and smallest values of $x$. We can go on forever defining different lines; how shall we choose one method?
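To make the dependence on the criterion concrete, the following minimal sketch scores two candidate lines under two of the criteria just mentioned. The five data points, the function names, and the hand-picked second line are illustrative assumptions and do not come from the text.

```python
# Illustrative data (an assumption for this sketch, not from the text).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]


def sum_squared_deviations(a, b, x, y):
    """Sum of squared vertical distances between the line a + b*x and the dots."""
    return sum((yi - a - b * xi) ** 2 for xi, yi in zip(x, y))


def sum_absolute_deviations(a, b, x, y):
    """Sum of absolute vertical distances between the line a + b*x and the dots."""
    return sum(abs(yi - a - b * xi) for xi, yi in zip(x, y))


def line_through_extreme_x(x, y):
    """Intercept and slope of the line joining the dots with smallest and largest x."""
    i_min = min(range(len(x)), key=lambda i: x[i])
    i_max = max(range(len(x)), key=lambda i: x[i])
    slope = (y[i_max] - y[i_min]) / (x[i_max] - x[i_min])
    intercept = y[i_min] - slope * x[i_min]
    return intercept, slope


extreme_line = line_through_extreme_x(x, y)
other_line = (2.2, 0.6)  # hand-picked (intercept, slope), purely for comparison

for name, (a, b) in [("extreme-x line", extreme_line), ("other line", other_line)]:
    print(name,
          "squared:", sum_squared_deviations(a, b, x, y),
          "absolute:", sum_absolute_deviations(a, b, x, y))
```

On these made-up points the two criteria disagree: the line through the extreme-$x$ dots has the smaller sum of absolute deviations (about 3.0 against 3.2), while the other line has the smaller sum of squared deviations (about 2.4 against 3.875). Which line is "better" therefore depends on the criterion chosen.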

The least squares method has proved to be by far the most popular method for estimating $\alpha$, $\beta$, and $\sigma^2$ in the linear regression model because of its computational simplicity and certain other desirable properties, which we shall show below. Still, it should by no means be regarded as the best estimator in every situation. In the subsequent discussion the reader should pay special attention to the following question: In what sense and under what conditions is the least squares estimator the best estimator?

Algebraically, the least squares (LS) estimators of $\alpha$ and $\beta$, denoted by $\hat{\alpha}$ and $\hat{\beta}$, can be defined as the values of $\alpha$ and $\beta$ which minimize the sum of squared residuals

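(10.2.1) $S(\alpha, \beta) = \sum_{t=1}^{T} (y_t - \alpha - \beta x_t)^2.$

Differentiating $S$ with respect to $\alpha$ and $\beta$ and equating the partial derivatives to 0, we obtain

(10.2.2) $\dfrac{\partial S}{\partial \alpha} = -2 \sum (y_t - \alpha - \beta x_t) = 0$

and

(10.2.3) $\dfrac{\partial S}{\partial \beta} = -2 \sum (y_t - \alpha - \beta x_t)\, x_t = 0,$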

where $\sum$ should be understood to mean $\sum_{t=1}^{T}$ unless otherwise noted. Solving (10.2.2) and (10.2.3) simultaneously for $\alpha$ and $\beta$ yields the following solutions:

(10.2.4) $\hat{\beta} = \dfrac{s_{xy}}{s_x^2},$

(10.2.5) $\hat{\alpha} = \bar{y} - \hat{\beta}\bar{x},$

where we have defined $\bar{y} = T^{-1}\sum y_t$, $\bar{x} = T^{-1}\sum x_t$, $s_x^2 = T^{-1}\sum x_t^2 - \bar{x}^2$, and $s_{xy} = T^{-1}\sum x_t y_t - \bar{x}\bar{y}$. Note that $\bar{y}$ and $\bar{x}$ are sample means, $s_x^2$ is the sample variance of $x$, and $s_{xy}$ is the sample covariance.
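For readers who want the intermediate algebra, one way to carry out the simultaneous solution of (10.2.2) and (10.2.3) is sketched below. It uses only the two first-order conditions and the sample moments just defined, and it assumes $s_x^2 > 0$, that is, that the $x_t$ are not all equal.

```latex
\begin{aligned}
\text{From (10.2.2), dividing by } -2T:\quad
  & \bar{y} - \alpha - \beta\bar{x} = 0
    \;\Longrightarrow\; \alpha = \bar{y} - \beta\bar{x}. \\
\text{From (10.2.3), dividing by } -2T:\quad
  & T^{-1}\sum x_t y_t - \alpha\bar{x} - \beta\, T^{-1}\sum x_t^2 = 0. \\
\text{Substituting for } \alpha:\quad
  & \Bigl(T^{-1}\sum x_t y_t - \bar{x}\bar{y}\Bigr)
    - \beta\Bigl(T^{-1}\sum x_t^2 - \bar{x}^2\Bigr) = 0, \\
\text{that is,}\quad
  & s_{xy} - \beta s_x^2 = 0
    \;\Longrightarrow\; \hat{\beta} = \frac{s_{xy}}{s_x^2},
    \qquad \hat{\alpha} = \bar{y} - \hat{\beta}\bar{x}.
\end{aligned}
```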

It is interesting to note that (10.2.4) and (10.2.5) can be obtained by substituting sample moments for the corresponding population moments in the formulae (4.3.8) and (4.3.9), which defined the best linear unbiased predictor. Thus the least squares estimates can be regarded as the natural estimates of the coefficients of the best linear predictor of $y$ given $x$.

We define

(10.2.6) $\hat{y}_t = \hat{\alpha} + \hat{\beta} x_t, \quad t = 1, 2, \ldots, T,$

and call it the least squares predictor of $y_t$. We define the error made by the least squares predictor as

(10.2.7) $\hat{u}_t = y_t - \hat{y}_t, \quad t = 1, 2, \ldots, T,$

and call it the least squares residual. In Section 10.2.6 below we discuss the prediction of a “future” value of $y$; that is, $y_t$ where $t$ is not included in the sample period $(1, 2, \ldots, T)$.
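The definitions so far translate directly into a few lines of code. The following is a minimal sketch, assuming a small made-up data set; the function name `least_squares_fit` and the variable names are hypothetical choices for this illustration. It computes the estimates from the sample-moment formulas (10.2.4) and (10.2.5), then the predictors (10.2.6) and residuals (10.2.7).

```python
def least_squares_fit(x, y):
    """Return (alpha_hat, beta_hat) using the sample-moment formulas (10.2.4)-(10.2.5)."""
    T = len(x)
    x_bar = sum(x) / T
    y_bar = sum(y) / T
    s_x2 = sum(xi * xi for xi in x) / T - x_bar ** 2                   # sample variance of x
    s_xy = sum(xi * yi for xi, yi in zip(x, y)) / T - x_bar * y_bar    # sample covariance
    # Exact-zero check only; nearly constant x is not guarded against in this sketch.
    if s_x2 == 0:
        raise ValueError("all x values are equal, so the slope is not determined")
    beta_hat = s_xy / s_x2
    alpha_hat = y_bar - beta_hat * x_bar
    return alpha_hat, beta_hat


# Illustrative data (an assumption for this sketch, not from the text).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

alpha_hat, beta_hat = least_squares_fit(x, y)
y_hat = [alpha_hat + beta_hat * xi for xi in x]      # least squares predictors (10.2.6)
u_hat = [yi - yh for yi, yh in zip(y, y_hat)]        # least squares residuals (10.2.7)

print("alpha_hat:", alpha_hat, "beta_hat:", beta_hat)
print("residuals:", u_hat)
```

For these points the sample moments are $\bar{x} = 3$, $\bar{y} = 4$, $s_x^2 = 2$, and $s_{xy} = 1.2$, so the estimates are approximately $\hat{\beta} = 0.6$ and $\hat{\alpha} = 2.2$, up to floating-point rounding.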

So far we have treated $\alpha$ and $\beta$ in a nonsymmetric way, regarding $\beta$ as the slope coefficient on the only independent variable of the model, namely $x_t$, and calling $\alpha$ the intercept. But as long as we can regard $\{x_t\}$ as known constants, we can treat $\alpha$ and $\beta$ on an equal basis by regarding $\alpha$ as the coefficient on another sequence of known constants, namely a sequence of $T$ ones. We shall call this sequence of ones the unity regressor. This symmetric treatment is useful in understanding the mechanism of the least squares method.

Under this symmetric treatment we should call $\hat{y}_t$ as defined in (10.2.6) the least squares predictor of $y_t$ based on the unity regressor and $\{x_t\}$. There is an important relationship between the error of the least squares prediction and the regressors: the sum of the product of the least squares residual and a regressor is zero. We shall call this fact the orthogonality between the least squares residual and a regressor. (See the general definition in Section 11.2.) Mathematically, we can express the orthogonality as

(10.2.8) $\sum \hat{u}_t = 0$

and

(10.2.9) $\sum \hat{u}_t x_t = 0.$

Note that (10.2.8) and (10.2.9) follow from (10.2.2) and (10.2.3), respectively.
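The two orthogonality conditions are easy to check numerically. The sketch below assumes the same kind of made-up data as elsewhere and hypothetical variable names; it also evaluates the same two sums for an arbitrary non-least-squares line to show that orthogonality is specific to the least squares residuals.

```python
# Illustrative data (an assumption for this sketch, not from the text).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
T = len(x)

# Least squares estimates from the sample-moment formulas (10.2.4)-(10.2.5).
x_bar = sum(x) / T
y_bar = sum(y) / T
s_x2 = sum(xi * xi for xi in x) / T - x_bar ** 2
s_xy = sum(xi * yi for xi, yi in zip(x, y)) / T - x_bar * y_bar
beta_hat = s_xy / s_x2
alpha_hat = y_bar - beta_hat * x_bar

# Least squares residuals (10.2.7).
u_hat = [yi - (alpha_hat + beta_hat * xi) for xi, yi in zip(x, y)]

sum_u = sum(u_hat)                                   # left-hand side of (10.2.8)
sum_ux = sum(ui * xi for ui, xi in zip(u_hat, x))    # left-hand side of (10.2.9)
print("sum of residuals:", sum_u)
print("sum of residuals times x:", sum_ux)

# Contrast: residuals from an arbitrary line (intercept 1.25, slope 0.75).
other_resid = [yi - (1.25 + 0.75 * xi) for xi, yi in zip(x, y)]
sum_other = sum(other_resid)
print("sum of residuals, arbitrary line:", sum_other)
```

Both least squares sums come out as zero up to floating-point rounding, whereas the residuals from the arbitrary line do not sum to zero.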

We shall present a useful interpretation of the least squares estimators $\hat{\alpha}$ and $\hat{\beta}$ by means of the above-mentioned symmetric treatment. The least squares estimator $\hat{\beta}$ can be interpreted as measuring the effect of $\{x_t\}$ on $\{y_t\}$ after the effect of the unity regressor has been removed, and $\hat{\alpha}$ as measuring the effect of the unity regressor on $\{y_t\}$ after the effect of $\{x_t\}$ has been removed. The precise meaning of the statement above is as follows.

Define the least squares predictor of $x_t$ based on the unity regressor as

(10.2.10) $\hat{x}_t = \hat{\gamma}, \quad t = 1, 2, \ldots, T,$

where $\hat{\gamma}$ is the value that minimizes $\sum (x_t - \gamma)^2$, that is, $\hat{\gamma} = \bar{x}$. In other words, we are predicting $x_t$ by the sample mean. Define the error of the predictor as

(10.2.11) $x_t^* = x_t - \hat{x}_t, \quad t = 1, 2, \ldots, T,$

which is actually the deviation of $x_t$ from the sample mean since $\hat{x}_t = \bar{x}$. Then $\hat{\beta}$, defined in (10.2.4), can be interpreted as the least squares estimator of the coefficient of the regression of $y_t$ on $x_t^*$ without the intercept: that is, $\hat{\beta}$ minimizes $\sum (y_t - \beta x_t^*)^2$. In this interpretation it is more natural to write $\hat{\beta}$ as

(10.2.12) $\hat{\beta} = \dfrac{\sum x_t^* y_t}{\sum (x_t^*)^2}.$
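The equivalence of (10.2.12) and (10.2.4) can be confirmed numerically. The following is a minimal sketch on made-up data with hypothetical variable names: it computes the slope once from the sample moments and once from the no-intercept regression of $y_t$ on the deviations $x_t^*$.

```python
# Illustrative data (an assumption for this sketch, not from the text).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
T = len(x)

x_bar = sum(x) / T
y_bar = sum(y) / T

# Slope from the sample moments, as in (10.2.4).
s_x2 = sum(xi * xi for xi in x) / T - x_bar ** 2
s_xy = sum(xi * yi for xi, yi in zip(x, y)) / T - x_bar * y_bar
beta_from_moments = s_xy / s_x2

# Slope from the no-intercept regression of y on x* = x - x_bar, as in (10.2.12).
x_star = [xi - x_bar for xi in x]
beta_from_deviations = (
    sum(xs * yi for xs, yi in zip(x_star, y)) / sum(xs * xs for xs in x_star)
)

print("from sample moments:", beta_from_moments)
print("from deviations    :", beta_from_deviations)
```

The two numbers agree up to rounding because $\sum x_t^* y_t = T s_{xy}$ and $\sum (x_t^*)^2 = T s_x^2$, so the factor $T$ cancels in the ratio.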

Reversing the roles of $\{x_t\}$ and the unity regressor, we define the least squares predictor of the unity regressor based on $\{x_t\}$ as

(10.2.13) $\hat{1}_t = \hat{\delta} x_t, \quad t = 1, 2, \ldots, T,$

where $\hat{\delta}$ minimizes $\sum (1 - \delta x_t)^2$. Therefore

(10.2.14) $\hat{\delta} = \dfrac{\sum x_t}{\sum x_t^2}.$

We call $\hat{1}_t$ the predictor of 1 for the sake of symmetric treatment, even though there is of course no need to predict 1 in the usual sense. Then, if we define

(10.2.15) $1_t^* = 1 - \hat{1}_t,$

we can show that $\hat{\alpha}$, defined in (10.2.5), is the least squares estimator of the coefficient of the regression of $y_t$ on $1_t^*$ without the intercept. In other words, $\hat{\alpha}$ minimizes $\sum (y_t - \alpha 1_t^*)^2$ so that

(10.2.16) $\hat{\alpha} = \dfrac{\sum 1_t^* y_t}{\sum (1_t^*)^2}.$

Note that this formula for $\hat{\alpha}$ has a form similar to that of $\hat{\beta}$ as given in (10.2.12).
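The claim that the no-intercept regression of $y_t$ on $1_t^*$ reproduces the intercept estimate of (10.2.5) can also be checked numerically. The sketch below uses made-up data and hypothetical variable names; it is an illustration of the identity, not a proof.

```python
# Illustrative data (an assumption for this sketch, not from the text).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
T = len(x)

# Intercept from (10.2.5), using the slope from (10.2.4).
x_bar = sum(x) / T
y_bar = sum(y) / T
s_x2 = sum(xi * xi for xi in x) / T - x_bar ** 2
s_xy = sum(xi * yi for xi, yi in zip(x, y)) / T - x_bar * y_bar
alpha_from_moments = y_bar - (s_xy / s_x2) * x_bar

# Predict the unity regressor from x without an intercept, as in (10.2.13)-(10.2.14).
delta_hat = sum(x) / sum(xi * xi for xi in x)
one_star = [1.0 - delta_hat * xi for xi in x]        # prediction errors, as in (10.2.15)

# No-intercept regression of y on the prediction errors of the unity regressor.
alpha_from_unity = (
    sum(os * yi for os, yi in zip(one_star, y)) / sum(os * os for os in one_star)
)

print("from (10.2.5)           :", alpha_from_moments)
print("from regression on 1*   :", alpha_from_unity)
```

For these points $\hat{\delta} = 15/55 = 3/11$, and both routes give an intercept of approximately 2.2.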

The orthogonality between the least squares residual and a regressor is also true in the regression of $\{x_t\}$ on the unity regressor or in the regression of the unity regressor on $\{x_t\}$, as we can easily verify that

(10.2.17) $\sum x_t^* = 0$

and

(10.2.18) $\sum 1_t^* x_t = 0.$
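The verification takes one line each. The sketch below uses only the definitions of $x_t^*$ and $1_t^*$ given above, together with $\bar{x} = T^{-1}\sum x_t$ and the expression for $\hat{\delta}$ in (10.2.14).

```latex
\begin{aligned}
\sum x_t^* &= \sum (x_t - \bar{x}) = \sum x_t - T\bar{x} = 0, \\
\sum 1_t^* x_t &= \sum (1 - \hat{\delta} x_t)\, x_t
  = \sum x_t - \hat{\delta} \sum x_t^2
  = \sum x_t - \frac{\sum x_t}{\sum x_t^2} \sum x_t^2 = 0.
\end{aligned}
```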