Least Squares Estimation and the Classical Assumptions
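Least squares minimizes the residual sum of squares, where the residuals are given by
\[ e_i = Y_i - \hat{\alpha} - \hat{\beta} X_i, \qquad i = 1, 2, \ldots, n \]
and \(\hat{\alpha}\) and \(\hat{\beta}\) denote guesses on the regression parameters \(\alpha\) and \(\beta\), respectively. The residual sum of squares, denoted by \(RSS = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (Y_i - \hat{\alpha} - \hat{\beta} X_i)^2\), is minimized by the two first-order conditions:
\[ \partial\Big(\sum_{i=1}^{n} e_i^2\Big)/\partial\alpha = -2\sum_{i=1}^{n} e_i = 0; \quad \text{or} \quad \sum_{i=1}^{n} Y_i - n\hat{\alpha} - \hat{\beta}\sum_{i=1}^{n} X_i = 0 \tag{3.2} \]
\[ \partial\Big(\sum_{i=1}^{n} e_i^2\Big)/\partial\beta = -2\sum_{i=1}^{n} e_i X_i = 0; \quad \text{or} \quad \sum_{i=1}^{n} Y_i X_i - \hat{\alpha}\sum_{i=1}^{n} X_i - \hat{\beta}\sum_{i=1}^{n} X_i^2 = 0 \tag{3.3} \]
Solving the least squares normal equations given in (3.2) and (3.3) for \(\alpha\) and \(\beta\) one gets
\[ \hat{\alpha}_{OLS} = \bar{Y} - \hat{\beta}_{OLS}\bar{X} \quad \text{and} \quad \hat{\beta}_{OLS} = \sum_{i=1}^{n} x_i y_i \Big/ \sum_{i=1}^{n} x_i^2 \]
where \(\bar{Y} = \sum_{i=1}^{n} Y_i/n\), \(\bar{X} = \sum_{i=1}^{n} X_i/n\), \(y_i = Y_i - \bar{Y}\), \(x_i = X_i - \bar{X}\), \(\sum_{i=1}^{n} x_i^2 = \sum_{i=1}^{n} X_i^2 - n\bar{X}^2\), \(\sum_{i=1}^{n} y_i^2 = \sum_{i=1}^{n} Y_i^2 - n\bar{Y}^2\) and \(\sum_{i=1}^{n} x_i y_i = \sum_{i=1}^{n} X_i Y_i - n\bar{X}\bar{Y}\).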
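The closed-form solution above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function name `ols_simple` and the four data points are hypothetical, chosen so that the points lie exactly on the line Y = 10 + 0.5X.

```python
def ols_simple(X, Y):
    """Return (alpha_hat, beta_hat) from the closed-form simple-regression formulas."""
    n = len(X)
    x_bar = sum(X) / n
    y_bar = sum(Y) / n
    # Sums of cross-products and squares in deviations from the sample means
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(X, Y))
    sxx = sum((xi - x_bar) ** 2 for xi in X)
    if sxx == 0:
        raise ValueError("X must take at least two distinct values")
    beta_hat = sxy / sxx                  # slope: sum(x_i y_i) / sum(x_i^2)
    alpha_hat = y_bar - beta_hat * x_bar  # intercept: Y_bar - beta_hat * X_bar
    return alpha_hat, beta_hat


# Hypothetical data lying exactly on Y = 10 + 0.5 X (no disturbance)
X = [10.0, 20.0, 30.0, 40.0]
Y = [10.0 + 0.5 * xi for xi in X]
alpha_hat, beta_hat = ols_simple(X, Y)
print(alpha_hat, beta_hat)
```

Because the data contain no disturbance, the formulas recover the intercept and slope of the line that generated them.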
These estimators are subscripted by OLS, denoting the ordinary least squares estimators. The OLS residuals \(e_i = Y_i - \hat{\alpha}_{OLS} - \hat{\beta}_{OLS} X_i\) automatically satisfy the two numerical relationships given by (3.2) and (3.3). The first relationship states that (i) \(\sum_{i=1}^{n} e_i = 0\), the residuals sum to zero. This is true as long as there is a constant in the regression. This numerical property of the least squares residuals also implies that the estimated regression line passes through the sample means \((\bar{X}, \bar{Y})\). To see this, average the residuals, or equation (3.2); this gives immediately \(\bar{Y} = \hat{\alpha}_{OLS} + \hat{\beta}_{OLS}\bar{X}\). The second relationship states that (ii) \(\sum_{i=1}^{n} e_i X_i = 0\), the residuals and the explanatory variable are uncorrelated. Other numerical properties that the OLS estimators satisfy are the following: (iii) \(\sum_{i=1}^{n} \hat{Y}_i = \sum_{i=1}^{n} Y_i\) and (iv) \(\sum_{i=1}^{n} e_i \hat{Y}_i = 0\). Property (iii) states that the sum of the estimated \(Y_i\)'s, i.e., the predicted \(\hat{Y}_i\)'s from the sample, is equal to the sum of the actual \(Y_i\)'s. Property (iv) states that the OLS residuals and the predicted \(\hat{Y}_i\)'s are uncorrelated. The proof of (iii) and (iv) follows from (i) and (ii); see problem 1.

Figure 3.1 'True' Consumption Function
Figure 3.2 Estimated Consumption Function

Of course, underlying our estimation of (3.1) is the assumption that (3.1) is the true model generating the data. In this case, (3.1) is linear in the parameters \(\alpha\) and \(\beta\), and contains only one explanatory variable \(X_i\) besides the constant. The inclusion of other explanatory variables in the model will be considered in Chapter 4, and the relaxation of the linearity assumption will be considered in Chapters 8 and 13. In order to study the statistical properties of the OLS estimators of \(\alpha\) and \(\beta\), we need to impose some statistical assumptions on the model generating the data.
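Before turning to those assumptions, the numerical properties (i) through (iv) stated earlier can be checked on any sample. The sketch below uses a hypothetical data set (20 income values, normal disturbances, an arbitrary seed); none of these choices come from the text, and the properties should hold up to floating-point rounding whatever sample is used, provided the regression includes a constant.

```python
import random

random.seed(0)  # arbitrary seed, for reproducibility only

# Hypothetical sample: 20 income values and noisy consumption
X = [10.0 + 5.0 * i for i in range(20)]
Y = [10.0 + 0.5 * xi + random.gauss(0.0, 3.0) for xi in X]

n = len(X)
x_bar, y_bar = sum(X) / n, sum(Y) / n
beta_hat = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(X, Y))
            / sum((xi - x_bar) ** 2 for xi in X))
alpha_hat = y_bar - beta_hat * x_bar

Y_hat = [alpha_hat + beta_hat * xi for xi in X]  # predicted values
e = [yi - fi for yi, fi in zip(Y, Y_hat)]         # OLS residuals

prop_i = sum(e)                                     # (i)   sum of e_i
prop_ii = sum(ei * xi for ei, xi in zip(e, X))      # (ii)  sum of e_i * X_i
prop_iii = sum(Y_hat) - sum(Y)                      # (iii) sum of Y_hat_i minus sum of Y_i
prop_iv = sum(ei * fi for ei, fi in zip(e, Y_hat))  # (iv)  sum of e_i * Y_hat_i

# Each quantity should be zero up to rounding error
print(prop_i, prop_ii, prop_iii, prop_iv)
```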
Assumption 1: The disturbances have zero mean, i.e., \(E(u_i) = 0\) for every \(i = 1, 2, \ldots, n\). This assumption is needed to ensure that on the average we are on the true line.
To see what happens if \(E(u_i) \neq 0\), consider the case where households consistently under-report their consumption by a constant amount of \(\delta\) dollars, while their income is measured accurately, say by cross-referencing it with their IRS tax forms. In this case,
\[ (\text{Observed Consumption}) = (\text{True Consumption}) - \delta \]
and our regression equation is really
\[ (\text{True Consumption})_i = \alpha + \beta(\text{Income})_i + u_i \]
But we observe,
\[ (\text{Observed Consumption})_i = \alpha + \beta(\text{Income})_i + u_i - \delta \]
This can be thought of as the old regression equation with a new disturbance term \(u_i^* = u_i - \delta\). Using the fact that \(\delta > 0\) and \(E(u_i) = 0\), one gets \(E(u_i^*) = -\delta < 0\). This says that for all households with the same income, say $20,000, their observed consumption will be on the average below that predicted from the true line \([\alpha + \beta(\$20,000)]\) by an amount \(\delta\). Fortunately, one can deal with this problem of constant but non-zero mean of the disturbances by reparameterizing the model as
\[ (\text{Observed Consumption})_i = \alpha^* + \beta(\text{Income})_i + u_i \]
where \(\alpha^* = \alpha - \delta\). In this case, \(E(u_i) = 0\) and \(\alpha^*\) and \(\beta\) can be estimated from the regression. Note that while \(\alpha^*\) is estimable, \(\alpha\) and \(\delta\) are non-estimable. Also note that for all $20,000 income households, their average consumption is \([(\alpha - \delta) + \beta(\$20,000)]\).
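The under-reporting argument can be illustrated with a small simulation. The sketch below is only an illustration under assumed values: the amount `delta = 2.0`, the parameter values, the normal disturbances, and the helper `ols_simple` are all hypothetical. It fits the same line to true and to under-reported consumption and compares the estimates.

```python
import random


def ols_simple(X, Y):
    """Closed-form OLS intercept and slope for a simple regression."""
    n = len(X)
    x_bar, y_bar = sum(X) / n, sum(Y) / n
    beta_hat = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(X, Y))
                / sum((xi - x_bar) ** 2 for xi in X))
    return y_bar - beta_hat * x_bar, beta_hat


random.seed(1)  # arbitrary seed
alpha, beta = 10.0, 0.5  # assumed "true" parameters
delta = 2.0              # hypothetical constant under-reporting amount

income = [10.0 + 5.0 * i for i in range(20)]
u = [random.gauss(0.0, 3.0) for _ in income]
true_cons = [alpha + beta * x + ui for x, ui in zip(income, u)]
observed_cons = [c - delta for c in true_cons]  # every household under-reports by delta

a_true, b_true = ols_simple(income, true_cons)
a_obs, b_obs = ols_simple(income, observed_cons)

# The slope is unchanged; the intercept drops by exactly delta,
# so the regression on observed data estimates alpha* = alpha - delta.
print(b_obs - b_true, a_true - a_obs)
```

Only the difference between the two intercepts is recoverable here because the simulation knows the true consumption; with observed data alone, the regression delivers a single intercept estimate, which is why \(\alpha\) and \(\delta\) cannot be separated.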
Assumption 2: The disturbances have a constant variance, i.e., \(\text{var}(u_i) = \sigma^2\) for every \(i = 1, 2, \ldots, n\). This ensures that every observation is equally reliable.
To see what this assumption means, consider the case where \(\text{var}(u_i) = \sigma_i^2\), for \(i = 1, 2, \ldots, n\). In this case, each observation has a different variance. An observation with a large variance is less reliable than one with a smaller variance. But how can this differing variance happen? In the case of consumption, households with large disposable income (a large \(X_i\), say $100,000) may be able to save more (or borrow more to spend more) than households with smaller income (a small \(X_i\), say $10,000). In this case, the variation in consumption for the $100,000 income household will be much larger than that for the $10,000 income household. Therefore, the corresponding variance for the $100,000 observation will be larger than that for the $10,000 observation. Consequences of different variances for different observations will be studied more rigorously in Chapter 5.
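One way to picture differing variances is to draw disturbances whose spread grows with income. The sketch below assumes, purely for illustration, that the standard deviation of the disturbance is 10% of income (in thousands of dollars) and that the disturbances are normal; neither assumption comes from the text.

```python
import random
import statistics

random.seed(2)  # arbitrary seed

SD_PER_UNIT_INCOME = 0.1  # hypothetical: sd of u_i equals 10% of income X_i


def draw_disturbances(income, n_draws):
    """Draw zero-mean normal disturbances whose sd is proportional to income."""
    sd = SD_PER_UNIT_INCOME * income
    return [random.gauss(0.0, sd) for _ in range(n_draws)]


u_low = draw_disturbances(10.0, 5000)    # households with X = 10 (i.e., $10,000)
u_high = draw_disturbances(100.0, 5000)  # households with X = 100 (i.e., $100,000)

var_low = statistics.pvariance(u_low)
var_high = statistics.pvariance(u_high)

# The high-income disturbances are far more dispersed than the low-income ones
print(var_low, var_high)
```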
Assumption 3: The disturbances are not correlated, i.e., \(E(u_i u_j) = 0\) for \(i \neq j\), \(i, j = 1, 2, \ldots, n\). Knowing the i-th disturbance does not tell us anything about the j-th disturbance, for \(i \neq j\).
For the consumption example, the unforeseen disturbance which caused the i-th household to consume more (like a visit of a relative) has nothing to do with the unforeseen disturbances of any other household. This is likely to hold for a random sample of households. However, it is less likely to hold for a time series study of consumption for the aggregate economy, where a disturbance in 1945, a war year, is likely to affect consumption for several years after that. In this case, we say that the disturbance in 1945 is related to the disturbances in 1946, 1947, and so on. Consequences of correlated disturbances will be studied in Chapter 5.
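The contrast between unrelated and persistent disturbances can be sketched numerically. The example below assumes a first-order autoregressive scheme, \(u_t = \rho u_{t-1} + \varepsilon_t\) with \(\rho = 0.8\), as one simple way for a shock to carry over into later years. This specific scheme, the value of \(\rho\), and the helper `first_order_autocorr` are illustrative assumptions and are not taken from the text.

```python
import random

random.seed(3)  # arbitrary seed

RHO = 0.8  # hypothetical persistence of a shock from one period to the next
T = 5000   # number of simulated periods


def first_order_autocorr(series):
    """Sample correlation between a series and its own one-period lag."""
    n = len(series)
    mean = sum(series) / n
    num = sum((series[t] - mean) * (series[t - 1] - mean) for t in range(1, n))
    den = sum((s - mean) ** 2 for s in series)
    return num / den


eps = [random.gauss(0.0, 1.0) for _ in range(T)]

u_iid = list(eps)  # disturbances satisfying assumption 3

u_ar = []          # disturbances violating assumption 3
prev = 0.0
for e_t in eps:
    prev = RHO * prev + e_t  # today's disturbance carries part of yesterday's
    u_ar.append(prev)

r_iid = first_order_autocorr(u_iid)
r_ar = first_order_autocorr(u_ar)
print(r_iid, r_ar)
```

The first correlation should be close to zero, while the second should be close to the assumed value of \(\rho\).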
Assumption 4: The explanatory variable X is nonstochastic, i.e., fixed in repeated samples, and hence, not correlated with the disturbances. Also, \(\sum_{i=1}^{n} x_i^2/n \neq 0\) and has a finite limit as n tends to infinity.
This assumption states that we have at least two distinct values for X. This makes sense, since we need at least two distinct points to draw a straight line. Otherwise \(X_i = \bar{X}\), the common value, and \(x_i = X_i - \bar{X} = 0\), which violates \(\sum_{i=1}^{n} x_i^2 \neq 0\). In practice, one always has several distinct values of X. More importantly, this assumption implies that X is not a random variable and hence is not correlated with the disturbances.
In section 5.3, we will relax the assumption of a non-stochastic X. Basically, X becomes a random variable and our assumptions have to be recast conditional on the set of X's that are observed. This is the more realistic case with economic data. The zero mean assumption becomes \(E(u_i/X) = 0\), the constant variance assumption becomes \(\text{var}(u_i/X) = \sigma^2\), and the no serial correlation assumption becomes \(E(u_i u_j/X) = 0\) for \(i \neq j\). The conditional expectation here is with respect to every observation on \(X_i\) from \(i = 1, 2, \ldots, n\). Of course, one can show that if \(E(u_i/X) = 0\) for all i, then \(X_i\) and \(u_i\) are not correlated. The reverse is not necessarily true; see problem 3 of Chapter 2. That problem shows that two random variables, say \(u_i\) and \(X_i\), could be uncorrelated, i.e., not linearly related, when in fact they are nonlinearly related with \(u_i = X_i^2\). Hence, \(E(u_i/X_i) = 0\) is a stronger assumption than \(u_i\) and \(X_i\) being uncorrelated. By the law of iterated expectations given in the Appendix of Chapter 2, \(E(u_i/X) = 0\) implies that \(E(u_i) = 0\). It also implies that \(u_i\) is uncorrelated with any function of \(X_i\), which is a stronger requirement than \(u_i\) being uncorrelated with \(X_i\) itself. Therefore, conditional on \(X_i\), the mean of the disturbances is zero and does not depend on \(X_i\). In this case, \(E(Y_i/X_i) = \alpha + \beta X_i\) is linear in \(\alpha\) and \(\beta\) and is assumed to be the true conditional mean of Y given X.

Figure 3.3 Consumption Function with Cov(X, u) > 0
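The gap between "uncorrelated" and "zero conditional mean" can be seen in a small exact calculation. The sketch below assumes, for illustration only, that X takes the five values -2, -1, 0, 1, 2 with equal probability and that the disturbance is the centered quadratic \(u = X^2 - E(X^2)\), so that \(E(u) = 0\). These specific values are assumptions made here, not details quoted from the text.

```python
# Hypothetical discrete distribution: X takes each value with probability 1/5
X = [-2, -1, 0, 1, 2]
n = len(X)

mean_x2 = sum(x ** 2 for x in X) / n  # E(X^2)
U = [x ** 2 - mean_x2 for x in X]     # u = X^2 - E(X^2), so that E(u) = 0

mean_x = sum(X) / n
mean_u = sum(U) / n

# Covariance between u and X: zero, so u and X are uncorrelated
cov_x_u = sum((x - mean_x) * (u - mean_u) for x, u in zip(X, U)) / n

# Covariance between u and the function X^2: not zero
cov_x2_u = sum((x ** 2 - mean_x2) * (u - mean_u) for x, u in zip(X, U)) / n

# E(u / X = x) equals x^2 - E(X^2) here, because u is an exact function of X
cond_mean_u = {x: x ** 2 - mean_x2 for x in X}

print(mean_u, cov_x_u, cov_x2_u, cond_mean_u)
```

Here u has zero mean and zero covariance with X, yet its conditional mean varies with X and it is correlated with \(X^2\), so \(E(u/X) = 0\) fails even though u and X are uncorrelated.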
To see what a violation of assumption 4 means, suppose that X is a random variable and that X and u are positively correlated. Then, in the consumption example, households with income above the average income will be associated with disturbances above their mean of zero, and hence positive disturbances. Similarly, households with income below the average income will be associated with disturbances below their mean of zero, and hence negative disturbances. This means that the disturbances are systematically affected by values of the explanatory variable, and the scatter of the data will look like Figure 3.3. Note that if we now erase the true line \((\alpha + \beta X)\) and estimate this line from the data, the least squares line drawn through the data is going to have a smaller intercept and a larger slope than those of the true line. When assumption 4 holds, the scatter should instead look like Figure 3.4, where the disturbances are random variables, not correlated with the \(X_i\)'s, drawn from a distribution with zero mean and constant variance. Assumptions 1 and 4 ensure that \(E(Y_i/X_i) = \alpha + \beta X_i\), i.e., on the average we are on the true line. Several economic models will be studied where X and u are correlated. The consequences of this correlation will be studied in Chapters 5 and 11.
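The claim that a positive correlation between X and u tilts the fitted line (larger slope, smaller intercept) can be sketched with a simulation. Everything specific below is an assumption made for illustration: the income process, the normal draws, the sample size, and the helper `ols_simple`.

```python
import random


def ols_simple(X, Y):
    """Closed-form OLS intercept and slope for a simple regression."""
    n = len(X)
    x_bar, y_bar = sum(X) / n, sum(Y) / n
    beta_hat = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(X, Y))
                / sum((xi - x_bar) ** 2 for xi in X))
    return y_bar - beta_hat * x_bar, beta_hat


random.seed(4)  # arbitrary seed
ALPHA, BETA = 10.0, 0.5  # assumed "true" line
N = 5000

u = [random.gauss(0.0, 3.0) for _ in range(N)]  # disturbances with variance 9

# Hypothetical income process: part of X moves with u,
# so that cov(X, u) = 2 * var(u) = 18 > 0
X = [50.0 + 10.0 * random.gauss(0.0, 1.0) + 2.0 * ui for ui in u]
Y = [ALPHA + BETA * xi + ui for xi, ui in zip(X, u)]

alpha_hat, beta_hat = ols_simple(X, Y)

# Compare the estimates with the true values 10 and 0.5
print(alpha_hat, beta_hat)
```

In this design the estimated slope should exceed 0.5 and the estimated intercept should fall below 10, matching the direction described for Figure 3.3.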
We now generate a data set which satisfies all four classical assumptions. Let \(\alpha\) and \(\beta\) take the arbitrary values, say 10 and 0.5 respectively, and consider a set of 20 fixed X's, say income classes from $10 to $105 (in thousands of dollars), in increments of $5, i.e., $10, $15, $20, $25, …, $105. Our consumption variable \(Y_i\) is constructed as \((10 + 0.5X_i + u_i)\) where \(u_i\) is a disturbance which is a random draw from a distribution with zero mean and constant variance, say \(\sigma^2 = 9\). Computers generate random numbers with various distributions.

Figure 3.4 Random Disturbances Around the Regression

In this case, Figure 3.4 would depict our data, with the true line being \((10 + 0.5X)\) and the \(u_i\)'s being random draws from the computer which are by construction independent and identically distributed with mean zero and variance 9. For every set of 20 \(u_i\)'s randomly generated, given the fixed \(X_i\)'s, we obtain a corresponding set of 20 \(Y_i\)'s from our linear regression model. This is what we mean in assumption 4 when we say that the X's are fixed in repeated samples. Monte Carlo experiments generate a large number of samples, say 1000, in the fashion described above. For each data set generated, least squares can be performed and the properties of the resulting estimators, which are derived analytically in the remainder of this chapter, can be verified. For example, the average of the 1000 estimates of \(\alpha\) and \(\beta\) can be compared to their true values to see whether these least squares estimates are unbiased.

Note what will happen to Figure 3.4 if \(E(u_i) = -\delta\) where \(\delta > 0\), or \(\text{var}(u_i) = \sigma_i^2\) for \(i = 1, 2, \ldots, n\). In the first case, the mean of \(f(u)\), the probability density function of u, will shift off the true line \((10 + 0.5X)\) by \(-\delta\). In other words, we can think of the distributions of the \(u_i\)'s, shown in Figure 3.4, being centered on a new imaginary line parallel to the true line but lower by a distance \(\delta\). This means that one is more likely to draw negative disturbances than positive disturbances, and the observed \(Y_i\)'s are more likely to be below the true line than above it. In the second case, each \(f(u_i)\) will have a different variance, hence the spread of this probability density function will vary with each observation. In this case, Figure 3.4 will have a distribution for the \(u_i\)'s which has a different spread for each observation. In other words, if the u's are, say, normally distributed, then \(u_1\) is drawn from a \(N(0, \sigma_1^2)\) distribution, whereas \(u_2\) is drawn from a \(N(0, \sigma_2^2)\) distribution, and so on.
Violation of the classical assumptions can also be studied using Monte Carlo experiments, see Chapter 5.
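The baseline experiment described above (\(\alpha = 10\), \(\beta = 0.5\), 20 fixed X's from 10 to 105, \(\sigma^2 = 9\), 1000 replications) can be sketched as follows. The text only requires disturbances with zero mean and constant variance; the normal distribution, the seed, and the helper `ols_simple` are assumptions made for this illustration.

```python
import random


def ols_simple(X, Y):
    """Closed-form OLS intercept and slope for a simple regression."""
    n = len(X)
    x_bar, y_bar = sum(X) / n, sum(Y) / n
    beta_hat = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(X, Y))
                / sum((xi - x_bar) ** 2 for xi in X))
    return y_bar - beta_hat * x_bar, beta_hat


random.seed(42)  # arbitrary seed

ALPHA, BETA = 10.0, 0.5  # true parameter values from the text
SIGMA = 3.0              # standard deviation, so that sigma^2 = 9
REPS = 1000              # number of Monte Carlo samples

# Fixed in repeated samples: 10, 15, ..., 105 (thousands of dollars)
X = [10.0 + 5.0 * i for i in range(20)]

alpha_draws, beta_draws = [], []
for _ in range(REPS):
    # New disturbances each replication; the X's never change
    u = [random.gauss(0.0, SIGMA) for _ in X]  # normality is an assumption
    Y = [ALPHA + BETA * xi + ui for xi, ui in zip(X, u)]
    a_hat, b_hat = ols_simple(X, Y)
    alpha_draws.append(a_hat)
    beta_draws.append(b_hat)

mean_alpha = sum(alpha_draws) / REPS
mean_beta = sum(beta_draws) / REPS

# Averages of the estimates, to be compared with 10 and 0.5
print(mean_alpha, mean_beta)
```

To explore a violation, one could replace the line that draws `u` with a draw whose mean is shifted or whose standard deviation depends on `xi`, and compare the averages again.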