# Nonparametric Regression

Consider the regression model

$$y_i = m(x_i) + u_i,$$

where $i = 1, \ldots, n$, $y_i$ is the dependent variable, $x_i = (x_{i1}, \ldots, x_{iq})$ are $q$ regressors, $m(x_i) = E(y_i \mid x_i)$ is the true but unknown regression function, and $u_i$ is the error term such that $E(u_i \mid x_i) = 0$ and $V(u_i \mid x_i) = \sigma^2(x_i)$.

If $m(x_i) = f(\beta, x_i)$ is a correctly specified family of parametric regression functions, then $y_i = f(\beta, x_i) + u_i$ is a correct model and, in this case, one can construct a consistent least squares (LS) estimator of $m(x_i)$ given by $f(\hat\beta, x_i)$, where $\hat\beta$ is the LS estimator of the parameter vector $\beta$. This $\hat\beta$ is obtained by minimizing $\sum_i u_i^2 = \sum_i (y_i - f(\beta, x_i))^2$ with respect to $\beta$. For example, if $f(\beta, x_i) = \alpha + x_i\beta = X_i\delta$, with $\delta = (\alpha, \beta')'$, is linear, we can obtain the LS estimator of $\delta$ as $\hat\delta = (X'X)^{-1}X'y$, where $X$ is an $n \times (q+1)$ matrix with the $i$th row given by $X_i = (1, x_i)$. Further, the predicted (fitted) values are $\hat y_i = X_i\hat\delta = X_i(X'X)^{-1}X'y$. In general, if the parametric regression $f(\beta, x)$ is incorrect or the form of $m(x)$ is unknown, then $f(\hat\beta, x)$ may not be a consistent estimate of $m(x)$.
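
The linear LS formula above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the function name `ls_fit` and the simulated data are hypothetical, and the normal equations are solved directly rather than by forming the inverse.

```python
import numpy as np

def ls_fit(x, y):
    """Global least squares for y_i = alpha + x_i beta + u_i.

    x: (n, q) regressors, y: (n,) responses.
    Returns (delta_hat, fitted) with delta_hat = (alpha_hat, beta_hat')'.
    """
    x = np.asarray(x, dtype=float).reshape(len(y), -1)
    X = np.column_stack([np.ones(len(y)), x])   # i-th row is X_i = (1, x_i)
    # Solve the normal equations (X'X) delta = X'y without forming an inverse.
    delta_hat = np.linalg.solve(X.T @ X, X.T @ y)
    return delta_hat, X @ delta_hat

# Simulated data (hypothetical): true alpha = 1, beta = (2, -0.5).
rng = np.random.default_rng(0)
x = rng.normal(size=(200, 2))
y = 1.0 + x @ np.array([2.0, -0.5]) + 0.1 * rng.normal(size=200)
delta_hat, y_fit = ls_fit(x, y)
```

Because the model includes an intercept, the residuals `y - y_fit` sum to zero up to rounding, which is a quick sanity check on the fit.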

An alternative approach is to estimate the unknown $m(x)$ nonparametrically, using techniques such as kernel, series, and spline methods, among others; see Newey (1997), Härdle (1990), and Pagan and Ullah (1999) for details. Here we consider kernel estimation since it is simple to implement and its asymptotic properties are well established. Essentially, the kernel estimator is a local LS (LLS) estimator obtained by minimizing $\sum_i u_i^2 K\left(\frac{x_i - x}{h}\right)$, where $u_i = y_i - f(\beta, x_i)$, the weights $K_{i,x} = K\left(\frac{x_i - x}{h}\right)$ are a decreasing function of the distance of the regressor vector $x_i$ from the point $x = (x_1, \ldots, x_q)$, and $h > 0$ is the window width (smoothing parameter), which determines how rapidly the weights decrease as the distance of $x_i$ from $x$ increases. When $h = \infty$, $K_{i,x} = K(0)$ is a constant, so that minimizing $K(0)\sum_i u_i^2$ is the same as minimizing $\sum_i u_i^2$; that is, the LLS estimator becomes the global LS estimator described above. In general, while the nonparametric LLS estimation fits $f(\beta, x_i)$ to the points in the interval of length $h$ around $x$, the parametric LS estimator fits $f(\beta, x_i)$ globally to the entire scatter of the data.
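
To make the role of the window width concrete, the sketch below computes the kernel weights for a scalar regressor at two values of $h$. The text does not fix a particular kernel, so the Gaussian density used here is an assumption, and the names `gaussian_kernel` and `kernel_weights` are illustrative only.

```python
import numpy as np

def gaussian_kernel(t):
    """Standard normal density, one common choice of K (an assumption here)."""
    return np.exp(-0.5 * t**2) / np.sqrt(2.0 * np.pi)

def kernel_weights(x_obs, x0, h):
    """K_{i,x} = K((x_i - x)/h) for a scalar regressor."""
    return gaussian_kernel((np.asarray(x_obs, dtype=float) - x0) / h)

x_obs = np.array([0.0, 0.5, 1.0, 2.0])
w_small = kernel_weights(x_obs, x0=0.0, h=0.5)   # weights fall off quickly with distance
w_large = kernel_weights(x_obs, x0=0.0, h=1e6)   # weights are all nearly K(0)
```

With the very large window width every observation receives essentially the same weight $K(0)$, which is the sense in which the local fit collapses to the global LS fit.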

When one treats $m(x_i) = f(\beta, x_i)$ locally (around $x$) as $X_i\delta(x)$, where $\delta(x) = (\alpha(x), \beta(x)')'$, an explicit expression of the LLS estimator of $\delta(x)$ is

$$\hat\delta(x) = (X'K(x)X)^{-1}X'K(x)y,$$

and the predicted values of $m(x_i)$ are

$$\hat y_i = \hat m(x_i) = X_i\hat\delta(x_i) = X_i(X'K(x_i)X)^{-1}X'K(x_i)y = w_i y$$

or $\hat y = Wy = \hat m$, where $w_i = X_i(X'K(x_i)X)^{-1}X'K(x_i)$ is the $i$th row ($1 \times n$) of the $n \times n$ matrix $W$, $K(x)$ is the diagonal matrix with diagonal elements $K\left(\frac{x_1 - x}{h}\right), \ldots, K\left(\frac{x_n - x}{h}\right)$, and $\hat m = [\hat m(x_1), \ldots, \hat m(x_n)]'$. The estimator $\hat\delta(x)$ is the local linear LS (LLLS) or simply the local linear (LL) estimator. One can also take $f(\beta, x_i)$ to be a polynomial in $x_i$ of degree $d$, in which case the matrix $X$ contains the polynomial terms and the estimator $\hat\delta(x)$ is the local polynomial LS (LPLS) estimator. For more details, see Fan and Gijbels (1996).
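
The local linear formula is a weighted LS problem solved once per evaluation point. The sketch below handles a scalar regressor and assumes Gaussian kernel weights (the normalizing constant is dropped because it cancels); `local_linear` and `local_linear_fit` are hypothetical helper names, and the data are simulated.

```python
import numpy as np

def local_linear(x_obs, y, x0, h):
    """Local linear LS at x0 for a scalar regressor.

    Returns delta_hat(x0) = (alpha_hat(x0), beta_hat(x0)), obtained from
    (X'K(x0)X)^{-1} X'K(x0)y with Gaussian kernel weights (an assumption).
    """
    x_obs = np.asarray(x_obs, dtype=float)
    X = np.column_stack([np.ones_like(x_obs), x_obs])   # rows X_i = (1, x_i)
    k = np.exp(-0.5 * ((x_obs - x0) / h) ** 2)           # diagonal of K(x0)
    XtK = X.T * k                                        # X'K(x0) without an n x n matrix
    return np.linalg.solve(XtK @ X, XtK @ y)

def local_linear_fit(x_obs, y, h):
    """Fitted values m_hat(x_i) = X_i delta_hat(x_i) at every sample point."""
    return np.array([np.array([1.0, xi]) @ local_linear(x_obs, y, xi, h)
                     for xi in x_obs])

# Simulated data (hypothetical): m(x) = sin(x) on [0, 3].
rng = np.random.default_rng(1)
x_obs = np.sort(rng.uniform(0.0, 3.0, size=150))
y = np.sin(x_obs) + 0.1 * rng.normal(size=150)
m_hat = local_linear_fit(x_obs, y, h=0.3)
```

Calling `local_linear(x_obs, y, 1.5, 1e6)` with a very large window width returns, up to rounding, the ordinary LS intercept and slope, matching the limiting case described earlier.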

When one treats $m(x)$ locally (around $x$) as a scalar constant $\alpha(x)$, the LLS estimator of $\alpha(x)$ is

$$\hat\alpha(x) = (\iota'K(x)\iota)^{-1}\iota'K(x)y = \frac{\sum_i y_i K_{i,x}}{\sum_i K_{i,x}},$$

and the predicted values are

$$\hat y_i = \hat m(x_i) = \hat\alpha(x_i) = w_i y = \frac{\sum_j y_j K_{ji}}{\sum_j K_{ji}} = \sum_j y_j w_{ji},$$

where $\iota$ is an $n \times 1$ vector of unit elements, $K_{ji} = K\left(\frac{x_j - x_i}{h}\right)$, $w_i = (\iota'K(x_i)\iota)^{-1}\iota'K(x_i)$ is the $i$th row of $W$, and $w_{ji} = K_{ji}/\sum_j K_{ji}$. The estimator $\hat\alpha(x)$ is the local constant LS (LCLS) estimator, and it was first introduced by Nadaraya (1964) and Watson (1964) (N-W).
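
The N-W estimator is a kernel-weighted average, so it can be written either pointwise or through the smoother matrix $W$. The following is a rough sketch for a scalar regressor, again assuming a Gaussian kernel; the function names and the simulated data are hypothetical.

```python
import numpy as np

def nadaraya_watson(x_obs, y, x0, h):
    """Local constant estimate: sum_i y_i K_{i,x} / sum_i K_{i,x}."""
    k = np.exp(-0.5 * ((np.asarray(x_obs, dtype=float) - x0) / h) ** 2)
    return np.sum(k * y) / np.sum(k)

def nw_smoother_matrix(x_obs, h):
    """n x n matrix W with W[i, j] = K_ji / sum_j K_ji, so y_hat = W @ y."""
    x_obs = np.asarray(x_obs, dtype=float)
    K = np.exp(-0.5 * ((x_obs[None, :] - x_obs[:, None]) / h) ** 2)
    return K / K.sum(axis=1, keepdims=True)   # normalize each row to sum to one

# Simulated data (hypothetical): m(x) = sin(x) on [0, 3].
rng = np.random.default_rng(2)
x_obs = rng.uniform(0.0, 3.0, size=100)
y = np.sin(x_obs) + 0.1 * rng.normal(size=100)
W = nw_smoother_matrix(x_obs, h=0.3)
y_hat = W @ y
```

Each row of `W` is nonnegative and sums to one, so every fitted value is a convex combination of the observed $y_j$ and cannot leave their range.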

The traditional approach to the LLLS estimator (Fan and Gijbels, 1996) is to take a first-order Taylor series expansion of $m(x_i)$ around $x$, so that $y_i = m(x_i) + u_i = m(x) + (x_i - x)m^{(1)}(x) + v_i = \alpha(x) + x_i\beta(x) + v_i = X_i\delta(x) + v_i$, where $m^{(1)}(x) = \beta(x) = \partial m(x)/\partial x$ is the first derivative of $m(x)$, $\alpha(x) = m(x) - x\beta(x)$, and $X_i$ and $\delta(x)$ are as given above. The LLLS estimator $\hat\delta(x)$ is then obtained by minimizing $\sum_i v_i^2 K\left(\frac{x_i - x}{h}\right)$, and it is equivalent to the $\hat\delta(x)$ given above. Furthermore, $\hat m(x_i) = \hat\alpha(x_i) + x_i\hat\beta(x_i) = X_i\hat\delta(x_i)$ is an estimator of $m(x_i) = \alpha(x_i) + x_i\beta(x_i) = X_i\delta(x_i)$. We note that while LLLS provides estimates of the unknown function $m(x)$ and its derivative $\beta(x)$ simultaneously, the LCLS estimator of N-W provides an estimator of $m(x)$ only. The first analytical derivative of $\hat m(x)$ is then taken to obtain $\hat\beta(x)$; see Pagan and Ullah (1999, ch. 4).
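
The equivalence between the Taylor form and the $(1, x_i)$ parameterization can be checked numerically: regressing on $(1, x_i - x)$ returns $\hat m(x)$ and $\hat\beta(x)$ directly, while regressing on $(1, x_i)$ returns $\hat\alpha(x)$ and the same slope. The sketch below assumes a scalar regressor and a Gaussian kernel; `lll_fit` and its `center` flag are hypothetical names.

```python
import numpy as np

def lll_fit(x_obs, y, x0, h, center=False):
    """Local linear LS at x0; regress on (1, x_i) or, if center, on (1, x_i - x0)."""
    x_obs = np.asarray(x_obs, dtype=float)
    z = x_obs - x0 if center else x_obs
    X = np.column_stack([np.ones_like(z), z])
    k = np.exp(-0.5 * ((x_obs - x0) / h) ** 2)   # Gaussian weights (assumed)
    XtK = X.T * k
    return np.linalg.solve(XtK @ X, XtK @ y)

# Simulated data (hypothetical): m(x) = sin(x), so the derivative is cos(x).
rng = np.random.default_rng(3)
x_obs = rng.uniform(0.0, 3.0, size=400)
y = np.sin(x_obs) + 0.05 * rng.normal(size=400)

x0 = 1.0
alpha_hat, beta_hat = lll_fit(x_obs, y, x0, h=0.25)                # (1, x_i) form
m_hat_c, beta_hat_c = lll_fit(x_obs, y, x0, h=0.25, center=True)   # Taylor form
m_hat = alpha_hat + x0 * beta_hat   # m_hat(x) = alpha_hat(x) + x * beta_hat(x)
```

Both parameterizations span the same local regression space, so `m_hat` agrees with `m_hat_c` and the two slope estimates coincide up to rounding.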

The LLS results provide point-wise estimates of $\beta$ which vary with $x$. In some situations one may be interested in knowing how the parameters change with respect to a vector of variables $z_i$ which is not necessarily in the model. That is, the model to be estimated is, say, $y_i = f(\beta(z_i), x_i) + u_i$ or, in the linear case, $y_i = x_i\beta(z_i) + u_i$. This can be estimated by minimizing $\sum_i u_i^2 K\left(\frac{z_i - z}{h}\right) = \sum_i [y_i - x_i\beta]^2 K\left(\frac{z_i - z}{h}\right)$, which gives $\hat\beta(z) = (X'K(z)X)^{-1}X'K(z)y$. For examples and applications of these models, see Cai, Fan, and Yao (1998), Robinson (1989), and Li, Huang, and Fu (1998).
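
A small simulation may help fix ideas for the linear case $y_i = x_i\beta(z_i) + u_i$. The sketch below assumes a scalar smoothing variable $z$, a Gaussian kernel, and a made-up coefficient function $\beta(z) = 1 + z$; none of these choices come from the cited papers, and `varying_coef` is a hypothetical name.

```python
import numpy as np

def varying_coef(X, y, z_obs, z0, h):
    """beta_hat(z0) = (X'K(z0)X)^{-1} X'K(z0)y for a scalar smoothing variable z."""
    X = np.asarray(X, dtype=float)
    k = np.exp(-0.5 * ((np.asarray(z_obs, dtype=float) - z0) / h) ** 2)
    XtK = X.T * k                      # weights depend on z_i, not on x_i
    return np.linalg.solve(XtK @ X, XtK @ y)

# Hypothetical data: the slope on x drifts with z as beta(z) = 1 + z.
rng = np.random.default_rng(4)
n = 500
z_obs = rng.uniform(0.0, 1.0, size=n)
X = rng.normal(size=(n, 1))
y = X[:, 0] * (1.0 + z_obs) + 0.1 * rng.normal(size=n)

beta_low = varying_coef(X, y, z_obs, z0=0.2, h=0.1)    # true value 1.2
beta_high = varying_coef(X, y, z_obs, z0=0.8, h=0.1)   # true value 1.8
```

The only change relative to the local linear sketch is that the kernel weights are computed from $z_i$ while the regression is still on $x_i$.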

The above results extend to the estimation of $E(g(y_i) \mid x_i)$, where $g(y_i)$ is a function of $y_i$ such that $E|g(y_i)| < \infty$; for example, $E(y_i^2 \mid x_i)$, where $g(y_i) = y_i^2$.
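
As an illustration of this extension, the same kernel average can be applied to $g(y_i) = y_i^2$ in place of $y_i$. The sketch below uses the N-W form with an assumed Gaussian kernel and simulated data; combining the two conditional moments into a conditional variance via $V(y \mid x) = E(y^2 \mid x) - E(y \mid x)^2$ is added here for illustration and is not a step taken in the text.

```python
import numpy as np

def nw(x_obs, target, x0, h):
    """Nadaraya-Watson estimate of E(target | x = x0), Gaussian kernel assumed."""
    k = np.exp(-0.5 * ((np.asarray(x_obs, dtype=float) - x0) / h) ** 2)
    return np.sum(k * target) / np.sum(k)

# Simulated data (hypothetical): E(y|x) = x and V(y|x) = 0.25.
rng = np.random.default_rng(5)
x_obs = rng.uniform(0.0, 2.0, size=2000)
y = x_obs + 0.5 * rng.normal(size=2000)

x0 = 1.0
m1 = nw(x_obs, y, x0, h=0.1)        # estimates E(y | x0)
m2 = nw(x_obs, y**2, x0, h=0.1)     # estimates E(y^2 | x0), i.e. g(y) = y^2
var_hat = m2 - m1**2                # implied conditional variance estimate
```

Because both moments use the same nonnegative weights, `var_hat` is a weighted sample variance and is never negative.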

The asymptotic normality results for the LLS and N-W (LCLS) estimators are similar, and they are well established in the literature. Their finite sample approximate bias expressions to $O(h^2)$ differ, however, while the variance expressions are the same. These are now well known; see Pagan and Ullah (1999, chs. 3 and 4) for details. These results, accompanied by several simulation studies (see Fan and Gijbels, 1996), indicate that the MSE performance of the LLLS estimator is much better than that of the N-W estimator, especially in the tails where data are fewer. In particular, the bias of the N-W estimator is much worse than that of LLLS in the tails, and while the LLLS estimator is unbiased when the true regression function $m(x)$ is linear, the N-W estimator is biased. Intuitively this makes sense, since the N-W estimator fits a constant to the data around each point $x$ whereas the LLLS estimator fits a line around $x$. These properties, together with the fact that the LLLS estimator provides the derivatives (elasticities) and the regression estimates simultaneously and is simple to calculate, make LLLS an appealing estimation technique. In the future, however, more research is needed to compare the performances of LPLS and LLLS.
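
The bias comparison for a linear $m(x)$ can be seen without any noise at all. The sketch below evaluates both estimators on an evenly spaced grid with $y_i = 1 + 2x_i$ exactly, so any deviation from the true value is pure bias. The Gaussian kernel, the grid, and the window width are assumptions chosen for illustration.

```python
import numpy as np

def gauss_weights(x_obs, x0, h):
    return np.exp(-0.5 * ((x_obs - x0) / h) ** 2)

def nw_at(x_obs, y, x0, h):
    """Local constant (N-W) estimate of m(x0)."""
    k = gauss_weights(x_obs, x0, h)
    return np.sum(k * y) / np.sum(k)

def ll_at(x_obs, y, x0, h):
    """Local linear estimate of m(x0), using regressors centered at x0."""
    k = gauss_weights(x_obs, x0, h)
    X = np.column_stack([np.ones_like(x_obs), x_obs - x0])
    XtK = X.T * k
    return np.linalg.solve(XtK @ X, XtK @ y)[0]   # intercept = m_hat(x0)

# Noise-free linear data isolate the bias term: m(x) = 1 + 2x on [0, 1].
x_obs = np.linspace(0.0, 1.0, 101)
y = 1.0 + 2.0 * x_obs

h = 0.1
nw_boundary_error = nw_at(x_obs, y, 0.0, h) - 1.0   # evaluated at the left edge
ll_boundary_error = ll_at(x_obs, y, 0.0, h) - 1.0
nw_interior_error = nw_at(x_obs, y, 0.5, h) - 2.0   # evaluated at the center
```

At the left edge all neighbors lie to the right, so the N-W average is pulled upward, while the local line reproduces the linear function exactly. At the center of this symmetric grid the N-W error vanishes because the weights balance on both sides.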

An important implication of the asymptotic results for the LLS and N-W estimators of $m(x)$ and $\beta(x)$ is that the rate of convergence of $\hat m(x)$ is $(nh^q)^{1/2}$ and that of $\hat\beta(x)$ is $(nh^{q+2})^{1/2}$, both of which are slower than the parametric rate of $n^{1/2}$. In fact, as the dimension of the regressors $q$ increases, the rates become worse. This is the well known problem of the "curse of dimensionality" in purely nonparametric regression. In recent years there have been several attempts to resolve this problem. One idea is to calculate the average regression coefficients, e.g. $\sum_{i=1}^{n}\hat\beta(x_i)/n$, or weighted average coefficients, which give an $n^{1/2}$ convergence rate. This, however, may not have much economic meaning in general, except in the case of single index models used in labor econometrics; see Powell, Stock, and Stoker (1989). Another solution is to use additive regression models, which improve the $(nh^q)^{1/2}$ rate to a univariate rate of $(nh)^{1/2}$. This is described in Section 4.
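
To see how quickly the rate deteriorates with $q$, one can express $(nh^q)^{1/2}$ as a power of $n$ once a window width sequence is fixed. The sketch below assumes $h \propto n^{-1/(q+4)}$, the usual MSE-balancing choice for second-order kernels; that choice is not stated in the text and is labeled as an assumption in the code. The function names are hypothetical.

```python
def pointwise_rate_exponent(q, bandwidth_exponent):
    """Exponent r such that (n h^q)^{1/2} = n^r when h = n^(-bandwidth_exponent)."""
    return 0.5 * (1.0 - q * bandwidth_exponent)

def rate_under_assumed_bandwidth(q):
    # Assumption (not from the text): h proportional to n^(-1/(q+4)),
    # the usual MSE-balancing choice for second-order kernels.
    return pointwise_rate_exponent(q, 1.0 / (q + 4))

# Under this assumption the exponent simplifies to 2/(q+4),
# compared with 0.5 for the parametric rate n^{1/2}.
exponents = {q: rate_under_assumed_bandwidth(q) for q in (1, 2, 5, 10)}
```

The exponent falls monotonically in $q$ and stays below the parametric value of 0.5 for every $q \geq 1$, which is the curse of dimensionality expressed as a single number.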

The asymptotic results described above are well established for independently and identically distributed (iid) observations and for weakly dependent time series data. For extensions of the results to nonparametric kernel estimation with nonstationary data, see Phillips and Park (1998), and for the case of a purely nonparametric single index model $y_i = m(x_i\beta) + u_i$, see Lewbel and Linton (1998).