# The Gauss-Newton Regression

Associated with every nonlinear regression model is a somewhat nonstandard artificial regression which is probably more widely used than any other. Consider the univariate, nonlinear regression model

y_t = x_t(β) + u_t,  u_t ~ iid(0, σ²),  t = 1, …, n, (1.2)

where y_t is the t-th observation on the dependent variable, and β is a k-vector of parameters to be estimated. The scalar function x_t(β) is a nonlinear regression function. It determines the mean value of y_t as a function of the unknown parameters β and, usually, of explanatory variables, which may include lagged dependent variables. The explanatory variables are not shown explicitly in (1.2), but the t subscript on x_t(β) reminds us that they are present. The model (1.2) may also be written as

y = x(β) + u,  u ~ iid(0, σ²I), (1.3)

where y is an n-vector with typical element y_t, and x(β) is an n-vector of which the t-th element is x_t(β).

The nonlinear least squares (NLS) estimator β̂ for model (1.3) minimizes the sum of squared residuals. It is convenient to use this sum divided by 2. Thus we define Q(β) = ½(y − x(β))^T(y − x(β)).

The Gauss-Newton regression can be derived as an approximation to Newton's Method for the minimization of Q(β). In this case, Newton's Method consists of the following iterative procedure. One starts from some suitably chosen starting value, β_(0). At step m of the procedure, β_(m) is updated by the formula

β_(m+1) = β_(m) − H_(m)^{-1} g_(m),

where the k × 1 vector g_(m) and the k × k matrix H_(m) are, respectively, the gradient and the Hessian of Q(β) with respect to β, evaluated at β_(m). For general β, we have

g(β) = −X^T(β)(y − x(β)), (1.4)

where X(β) is an n × k matrix with ti-th element the derivative of x_t(β) with respect to β_i, the i-th component of β. A typical element of the Hessian H(β) is

H_ij(β) = −∑_{t=1}^{n} (y_t − x_t(β)) ∂X_ti(β)/∂β_j + (X^T(β)X(β))_ij. (1.5)
The Gauss-Newton procedure is one of the set of so-called quasi-Newton procedures, in which the exact Hessian is replaced by an approximation. Here, only the second term in (1.5) is used, so that the H(β) of Newton's method is replaced by the matrix X^T(β)X(β). Thus the Gauss-Newton updating formula is

β_(m+1) = β_(m) + (X_(m)^T X_(m))^{-1} X_(m)^T (y − x_(m)), (1.6)

where we write X_(m) = X(β_(m)) and x_(m) = x(β_(m)). The updating term on the right-hand side of (1.6) is the set of OLS parameter estimates from the Gauss-Newton regression, or GNR,

y − x(β) = X(β)b + residuals, (1.7)

where the variables r(β) = y − x(β) and R(β) = X(β) are evaluated at β_(m). Notice that there is no regressor in (1.7) corresponding to the parameter σ², because the criterion function Q(β) does not depend on σ². This is one of the features of the GNR that makes it a nonstandard artificial regression.
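The iteration (1.6) amounts to running one OLS regression per step. The following sketch, using NumPy, illustrates this; the function names and the example model x_t(β) = β_1 z_t^{β_2} are our own assumptions for demonstration, not from the text.

```python
import numpy as np

def gauss_newton(x, X, y, beta0, max_iter=100, tol=1e-12):
    """Classical Gauss-Newton iteration (1.6): at each step, regress the
    current residuals y - x(beta) on the Jacobian X(beta) and add the
    resulting OLS coefficients to beta."""
    beta = np.asarray(beta0, dtype=float)
    for _ in range(max_iter):
        b, *_ = np.linalg.lstsq(X(beta), y - x(beta), rcond=None)
        beta = beta + b                # updating term = OLS estimates from the GNR
        if np.max(np.abs(b)) < tol:
            break
    return beta

# Hypothetical example: x_t(beta) = beta_1 * z_t**beta_2, noiseless data,
# so NLS should recover the true parameters.
rng = np.random.default_rng(0)
z = rng.uniform(0.5, 2.0, 50)
beta_true = np.array([1.5, 0.7])
x_fun = lambda b: b[0] * z ** b[1]
X_fun = lambda b: np.column_stack([z ** b[1], b[0] * z ** b[1] * np.log(z)])
y_obs = x_fun(beta_true)
beta_hat = gauss_newton(x_fun, X_fun, y_obs, np.array([1.0, 1.0]))
```

With noiseless data the sum-of-squares minimum is exactly zero, so the iteration should drive the estimate to the true parameter vector.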

The GNR is clearly a linearization of the nonlinear regression model (1.3) around the point β. In the special case in which the original model is linear, x(β) = Xβ, where X is the matrix of independent variables. Since X(β) is equal to X for all β in this special case, the GNR will simply be a regression of the vector y − Xβ on the matrix X.

An example is provided by the nonlinear regression model

y_t = β_1 Z_t1^{β_2} Z_t2^{1−β_2} + u_t,  u_t ~ iid(0, σ²), (1.8)

where Z_t1 and Z_t2 are independent variables. The regression function here is nonlinear and has the form of a Cobb-Douglas production function. In many cases, of course, it would be reasonable to assume that the error term is multiplicative, and it would then be possible to take logarithms of both sides and use ordinary least squares. But if we wish to estimate (1.8) as it stands, we must use nonlinear least squares. The GNR that corresponds to (1.8) is

y_t − β_1 Z_t1^{β_2} Z_t2^{1−β_2} = b_1 Z_t1^{β_2} Z_t2^{1−β_2} + b_2 β_1 Z_t1^{β_2} Z_t2^{1−β_2} log(Z_t1/Z_t2) + residual. (1.9)

The regressand is y_t minus the regression function, the first regressor is the derivative of the regression function with respect to β_1, and the second regressor is the derivative of the regression function with respect to β_2.
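A quick way to check the two regressors just described is to compare them with numerical derivatives of the regression function. The sketch below builds the GNR regressand and regressors for model (1.8) in NumPy; the function name and the made-up data are our assumptions.

```python
import numpy as np

def cobb_douglas_gnr(beta, y, Z1, Z2):
    """Regressand and regressors of the GNR for model (1.8),
    y_t = beta1 * Z1^beta2 * Z2^(1 - beta2) + u_t."""
    b1, b2 = beta
    fitted = b1 * Z1 ** b2 * Z2 ** (1.0 - b2)
    r = y - fitted                               # GNR regressand
    R1 = Z1 ** b2 * Z2 ** (1.0 - b2)             # derivative w.r.t. beta1
    R2 = fitted * np.log(Z1 / Z2)                # derivative w.r.t. beta2
    return r, np.column_stack([R1, R2])

# Check the analytic regressors against finite differences (made-up data).
rng = np.random.default_rng(1)
Z1, Z2 = rng.uniform(1.0, 3.0, (2, 20))
beta = np.array([2.0, 0.3])
y = rng.normal(size=20)
r, R = cobb_douglas_gnr(beta, y, Z1, Z2)
h = 1e-6
fd = np.column_stack([
    (r - cobb_douglas_gnr(beta + np.array([h, 0.0]), y, Z1, Z2)[0]) / h,
    (r - cobb_douglas_gnr(beta + np.array([0.0, h]), y, Z1, Z2)[0]) / h,
])
```

The finite-difference columns `fd` should agree with the analytic regressor matrix `R` to the order of the step size.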

Now consider the defining conditions of an artificial regression. We have

R^T(β)r(β) = X^T(β)(y − x(β)),

which is just minus the gradient of Q(β). Thus condition (1′) is satisfied.

Next, consider condition (3). Let β̀ denote a vector of initial estimates, which are assumed to be root-n consistent. The GNR (1.7) evaluated at these estimates is

y − x̀ = X̀b + residuals,

where x̀ = x(β̀) and X̀ = X(β̀). The estimate of b from this regression is

b̀ = (X̀^T X̀)^{-1} X̀^T(y − x̀). (1.10)

The one-step efficient estimator is then defined to be

β́ = β̀ + b̀. (1.11)
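Computing (1.11) requires only a single OLS pass through the GNR. A minimal sketch (the function name is ours; NumPy assumed):

```python
import numpy as np

def one_step(x, X, y, beta_init):
    """One-step efficient estimator (1.11): run the GNR (1.7) once at a
    root-n consistent initial estimate and add the OLS coefficients."""
    b, *_ = np.linalg.lstsq(X(beta_init), y - x(beta_init), rcond=None)
    return beta_init + b

# In the linear special case x(beta) = Z beta, a single step from any
# starting point reproduces the OLS estimator exactly.
rng = np.random.default_rng(2)
Z = rng.normal(size=(30, 2))
y = Z @ np.array([1.0, -2.0]) + rng.normal(size=30)
beta_onestep = one_step(lambda b: Z @ b, lambda b: Z, y, np.zeros(2))
beta_ols, *_ = np.linalg.lstsq(Z, y, rcond=None)
```

The linear check mirrors the special case discussed above: when X(β) does not depend on β, the GNR update lands on the least squares solution in one step.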

By Taylor expanding the expression n^{-1/2}X̀^T(y − x̀) around β = β_0, where β_0 is the true parameter vector, and using standard asymptotic arguments, it can be shown that, to leading order,

n^{-1/2}X̀^T(y − x̀) = n^{-1/2}X_0^T u − n^{-1}X_0^T X_0 n^{1/2}(β̀ − β_0),

where X_0 ≡ X(β_0). This relation can be solved to yield

n^{1/2}(β̀ − β_0) = (n^{-1}X_0^T X_0)^{-1}(n^{-1/2}X_0^T u − n^{-1/2}X̀^T(y − x̀)). (1.12)

Now it is a standard result that, asymptotically,

n^{1/2}(β̂ − β_0) = (n^{-1}X_0^T X_0)^{-1} n^{-1/2}X_0^T u; (1.13)

see, for example, Davidson and MacKinnon (1993, Section 5.4). By (1.10), the second term on the right-hand side of (1.12) is asymptotically equivalent to −n^{1/2}b̀. Thus (1.12) implies that

n^{1/2}(β̀ − β_0) = n^{1/2}(β̂ − β_0) − n^{1/2}b̀.

Rearranging this and using the definition (1.11), we see that, to leading order asymptotically,

n^{1/2}(β́ − β_0) = n^{1/2}(β̀ + b̀ − β_0) = n^{1/2}(β̂ − β_0).

In other words, after both are centered and multiplied by n^{1/2}, the one-step estimator β́ and the NLS estimator β̂ tend to the same random variable asymptotically. This is just another way of writing condition (3) for model (1.3).

Finally, consider condition (2). Since X(β) plays the role of R(θ), we see that

n^{-1}R^T(θ)R(θ) = n^{-1}X^T(β)X(β). (1.14)

If the right-hand side of (1.14) is evaluated at any root-n consistent estimator β̀, it must tend to the same probability limit as n^{-1}X_0^T X_0. It is a standard result, following straightforwardly from (1.13), that, if β̂ denotes the NLS estimator for the model (1.3), then

lim var(n^{1/2}(β̂ − β_0)) = σ_0² (plim n^{-1}X_0^T X_0)^{-1}, (1.15)

where σ_0² is the true variance of the error terms; see, for example, Davidson and MacKinnon (1993, ch. 5). Thus the GNR would satisfy condition (2) except that there is a factor of σ_0² missing. However, this factor is automatically supplied by the regression package. The estimated covariance matrix will be

V̂ar(b̀) = s²(X̀^T X̀)^{-1}, (1.16)

where s² = SSR/(n − k) is the estimate of σ² from the artificial regression. It is not hard to show that s² estimates σ² consistently, and so it is clear from (1.15) that (1.16) provides a reasonable way to estimate the covariance matrix of β̂.

It is easy to modify the GNR so that it actually satisfies condition (2). We just need to divide both the regressand and the regressors by s, the standard error from the original, nonlinear regression. When this is done, (1.14) becomes

n^{-1}R^T(θ)R(θ) = (ns²)^{-1} X^T(β)X(β),

and condition (2) is seen to be satisfied. However, there is rarely any reason to do this in practice.

Although the GNR is the most commonly encountered artificial regression, it differs from most artificial regressions in one key respect: there is one parameter, σ², for which there is no regressor. This happens because the criterion function, Q(β), depends only on β. The GNR therefore has only as many regressors as β has components. This feature of the GNR is responsible for the fact that it does not quite satisfy condition (2). The fact that Q(β) does not depend on σ² also causes the asymptotic covariance matrix to be block diagonal between the k × k block that corresponds to β and the 1 × 1 block that corresponds to σ².

The GNR, like other artificial regressions, has several uses, depending on the parameter values at which the regressand and regressors are evaluated. If we evaluate them at β̂, the vector of NLS parameter estimates, regression (1.7) becomes

y − x̂ = X̂b + residuals, (1.17)

where x̂ = x(β̂) and X̂ = X(β̂). By condition (1), which follows from the first-order conditions for NLS estimation, the OLS estimate b from this regression is a zero vector. In consequence, the explained sum of squares, or ESS, from regression (1.17) will be 0, and the SSR will be equal to

‖y − x̂‖² = (y − x̂)^T(y − x̂),

which is the SSR from the original nonlinear regression.

Although it may seem curious to run an artificial regression all the coefficients of which are known in advance to be zero, there can be two very good reasons for doing so. The first reason is to check that the vector β̂ reported by a program for NLS estimation really does satisfy the first-order conditions. Computer programs for calculating NLS estimates do not yield reliable answers in every case; see McCullough (1999). The GNR provides an easy way to see whether the first-order conditions are actually satisfied. If any of the t-statistics for the GNR is not less than about 10⁻⁴, or the R² is not less than about 10⁻⁸, then the value of β̂ reported by the program should be regarded with suspicion.
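This check is easy to automate. The sketch below runs the GNR at a reported estimate and flags it when any t-statistic or the R² is not negligibly small; the function name and the use of the uncentered R² are our assumptions.

```python
import numpy as np

def gnr_foc_check(r, J, t_tol=1e-4, r2_tol=1e-8):
    """Run the GNR (1.17) at a reported NLS estimate and test the
    first-order conditions.  `r` is y - x(beta_hat), `J` is X(beta_hat).
    Returns (t_stats, R2, ok); a trustworthy estimate should make every
    |t| and the R^2 essentially zero."""
    n, k = J.shape
    b, *_ = np.linalg.lstsq(J, r, rcond=None)
    resid = r - J @ b
    s2 = resid @ resid / (n - k)
    se = np.sqrt(s2 * np.diag(np.linalg.inv(J.T @ J)))
    t_stats = b / se
    R2 = 1.0 - (resid @ resid) / (r @ r)   # uncentered R^2 of the GNR
    ok = bool(np.all(np.abs(t_stats) < t_tol) and R2 < r2_tol)
    return t_stats, R2, ok

# Linear illustration: at the exact OLS solution the check passes;
# at a perturbed parameter vector it fails.
rng = np.random.default_rng(3)
Z = rng.normal(size=(40, 2))
y = Z @ np.array([1.0, 1.0]) + rng.normal(size=40)
beta_hat, *_ = np.linalg.lstsq(Z, y, rcond=None)
_, _, ok_exact = gnr_foc_check(y - Z @ beta_hat, Z)
_, _, ok_wrong = gnr_foc_check(y - Z @ (beta_hat + 0.1), Z)
```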

The second reason to run the GNR (1.17) is to calculate an estimate of var(β̂), the covariance matrix of the NLS estimates. The usual OLS covariance matrix from regression (1.17) is

V̂ar(b) = s²(X̂^T X̂)^{-1}, (1.18)

which is similar to (1.16) except that everything is now evaluated at β̂. Thus running the GNR (1.17) provides an easy way to calculate what is arguably the best estimate of var(β̂). Of course, for (1.18) to provide an asymptotically valid covariance matrix estimate, it is essential that the error terms in (1.2) be independent and identically distributed, as we have assumed so far. We will discuss ways to drop this assumption in Section 7.
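In practice the estimate (1.18) can be read off the GNR output or computed directly, as in this sketch (the function name is ours; at the NLS solution b = 0, so the GNR's SSR is simply r^T r).

```python
import numpy as np

def gnr_covariance(r, J):
    """Covariance estimate (1.18): s^2 (X^T X)^{-1} from the GNR run at
    the NLS estimates, with r = y - x(beta_hat) and J = X(beta_hat).
    Because b = 0 at the NLS solution, the GNR's SSR equals r @ r."""
    n, k = J.shape
    s2 = r @ r / (n - k)
    return s2 * np.linalg.inv(J.T @ J)

# Linear check: the estimate coincides with the usual OLS covariance matrix.
rng = np.random.default_rng(4)
Z = rng.normal(size=(25, 2))
y = Z @ np.array([0.5, -0.5]) + rng.normal(size=25)
beta_hat, *_ = np.linalg.lstsq(Z, y, rcond=None)
V = gnr_covariance(y - Z @ beta_hat, Z)
```

The result should be a symmetric, positive definite k × k matrix.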

Since the GNR satisfies the one-step property, it and other artificial regressions can evidently be used to obtain one-step efficient estimates. However, although one-step estimation is of considerable theoretical interest, it is generally of modest practical interest, for two reasons. First, we often do not have a root-n consistent estimator to start from and, secondly, modern computers are so fast that the savings from stopping after just one step are rarely substantial.

What is often of great practical interest is the use of the GNR as part of a numerical minimization algorithm to find the NLS estimates β̂ themselves. In practice, the classical Gauss-Newton updating procedure (1.6) should generally be replaced by

β_(m+1) = β_(m) + α_(m) b_(m),

where b_(m) is the vector of OLS estimates from the GNR evaluated at β_(m), and α_(m) is a scalar that is chosen in various ways by different algorithms, but always in such a way that Q(β_(m+1)) < Q(β_(m)). Numerical optimization methods are discussed by Press et al. (1992), among many others. Artificial regressions other than the GNR allow these methods to be used more widely than just in the least squares context.
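A line-search variant of the iteration can be sketched as follows, with α_(m) chosen by simple step halving; the halving rule and the example model are our assumptions, standing in for the more sophisticated choices that real algorithms make.

```python
import numpy as np

def damped_gauss_newton(x, X, y, beta0, max_iter=200, tol=1e-10):
    """Gauss-Newton with step halving: alpha is cut until the criterion
    Q(beta) = 0.5 * ||y - x(beta)||^2 actually decreases."""
    Q = lambda b: 0.5 * np.sum((y - x(b)) ** 2)
    beta = np.asarray(beta0, dtype=float)
    for _ in range(max_iter):
        b, *_ = np.linalg.lstsq(X(beta), y - x(beta), rcond=None)
        if np.max(np.abs(b)) < tol:        # full GN step negligible: converged
            break
        alpha = 1.0
        while Q(beta + alpha * b) >= Q(beta) and alpha > 1e-8:
            alpha *= 0.5                   # halve the step until Q falls
        beta = beta + alpha * b
    return beta

# Hypothetical example: x_t(beta) = beta_1 * exp(beta_2 * z_t) with noise.
rng = np.random.default_rng(5)
z = rng.uniform(0.5, 2.0, 60)
x_fun = lambda b: b[0] * np.exp(b[1] * z)
X_fun = lambda b: np.column_stack([np.exp(b[1] * z),
                                   b[0] * z * np.exp(b[1] * z)])
y_obs = x_fun(np.array([2.0, 0.5])) + 0.1 * rng.normal(size=60)
beta_hat = damped_gauss_newton(x_fun, X_fun, y_obs, np.array([1.0, 0.1]))
grad = X_fun(beta_hat).T @ (y_obs - x_fun(beta_hat))   # ~ 0 at the NLS solution
```

At convergence the GNR coefficients, and hence the gradient X^T(β)(y − x(β)), should be essentially zero, which is exactly the first-order-condition check described earlier.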