Neyman-Pearson generalized lemma and its applications
The lemma can be stated as follows:
Let g1, g2, …, gm, gm+1 be integrable functions and φ be a test function over S such that 0 ≤ φ ≤ 1, and

∫ φgi dy = ci, i = 1, 2, …, m, (2.6)

where c1, c2, …, cm are given constants. Further, let there exist a φ* and constants k1, k2, …, km such that φ* satisfies (2.6), and

φ* = 1 if gm+1 > Σᵢ₌₁ᵐ ki gi (2.7)

φ* = 0 if gm+1 < Σᵢ₌₁ᵐ ki gi. (2.8)

Then

∫ φ*gm+1 dy ≥ ∫ φgm+1 dy.

To obtain the most powerful (MP) test of H0 : θ = θ0 against H1 : θ = θ1, put m = 1, g1 = L(θ0), g2 = L(θ1) and c1 = α. Then the function φ*(y) defined as

φ*(y) = 1 when L(θ1) > kL(θ0)
      = 0 when L(θ1) < kL(θ0),

with ∫ φ*(y)L(θ0) dy = α, satisfies

∫ φ*(y)L(θ1) dy ≥ ∫ φ(y)L(θ1) dy;

that is, φ*(y) will provide the MP test. Therefore, in terms of a critical region,

ω = {y : L(θ1)/L(θ0) > k}, (2.11)

where k is such that Pr{ω | H0} = α, is the MP critical region.
The NP lemma also provides the logical basis for the LR test. To see this, consider a general form of the null hypothesis, H0 : h(θ) = c, where h(θ) is an r × 1 vector function of θ with r < p and c a known constant vector. It is assumed that H(θ) = ∂h(θ)′/∂θ has full column rank, that is, rank[H(θ)] = r. We denote the maximum likelihood estimator (MLE) of θ by θ̂, and by θ̃ the restricted MLE of θ; that is, θ̃ is obtained by maximizing the log-likelihood function l(θ) = ln L(θ) subject to the restriction h(θ) = c. Neyman and Pearson (1928) suggested their LR test as
LR = 2[l(θ̂) − l(θ̃)]. (2.12)
Their suggestion did not result from any search procedure satisfying an optimality criterion; it was based purely on intuitive grounds and Fisher's (1922) likelihood principle.1 Comparing the MP critical region in (2.11) with (2.12), we can see the logical basis of the LR test.
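To make (2.12) concrete, here is a minimal numerical sketch (an added illustration, not from the original text): the LR statistic for H0 : μ = 0 when Y ~ N(μ, 1), where the unrestricted MLE is the sample mean and the restricted MLE is zero.

```python
import random

# Sketch: LR = 2[l(mu_hat) - l(0)] for an N(mu, 1) sample; the seed and
# true mean below are arbitrary choices for the demonstration.
random.seed(0)
y = [0.3 + random.gauss(0.0, 1.0) for _ in range(100)]  # true mu = 0.3
n = len(y)
ybar = sum(y) / n  # unrestricted MLE

def loglik(mu):
    # N(mu, 1) log-likelihood, dropping constants common to all mu
    return -0.5 * sum((yi - mu) ** 2 for yi in y)

LR = 2.0 * (loglik(ybar) - loglik(0.0))

# For this model LR simplifies algebraically to n * ybar^2,
# which is chi-square(1) under H0.
print(LR, n * ybar ** 2)
```

The algebraic simplification in the last comment follows because l(ȳ) − l(0) = n ȳ²/2 for the unit-variance normal.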
Locally MP (LMP) and Rao’s (1948) score tests
Let us consider a simple case, say p = 1, and test H0 : θ = θ0. Assuming that the power function γ(θ) in (2.3) admits a Taylor series expansion, we have

γ(θ) = γ(θ0) + (θ − θ0)γ′(θ0) + ((θ − θ0)²/2)γ″(θ*), (2.13)

where θ* is a value in between θ and θ0. If we consider local alternatives of the form θ = θ0 + δ/√n, 0 < δ < ∞, the third term will be of order O(n⁻¹). To obtain the highest power, we need to maximize

γ′(θ0) = ∫ φ(y)(∂L(θ0)/∂θ) dy

for θ > θ0. Therefore, for an LMP test of size α we should have

∫ φ(y)L(θ0) dy = α

and maximize ∫ φ(y)(∂L(θ0)/∂θ) dy. In the NP generalized lemma, let us put m = 1, g1 = L(θ0), g2 = ∂L(θ0)/∂θ, c1 = α and k1 = k. Then from (2.7) and (2.8), the LMP test will have critical region

∂L(θ0)/∂θ > kL(θ0), or

∂l(θ)/∂θ |θ=θ0 > k. (2.15)
The quantity s(θ) = ∂l(θ)/∂θ is known as the score function. The above result was first discussed in Rao and Poti (1946), who stated that an LMP test for H0 : θ = θ0 is given by

l1 s(θ0) ≥ l2, (2.16)

where l2 is so determined that the size of the test is equal to a preassigned value α, with l1 as +1 or −1, respectively, for the alternatives θ > θ0 and θ < θ0. Test criterion (2.16) is a precursor to Rao's score (RS) or the Lagrange multiplier (LM) test that has been very useful in econometrics for developing various model diagnostic procedures, as we will discuss later.
The LMP test can also be obtained directly from the NP lemma (2.11), by expanding L(θ1) around θ0 as

L(θ1) = L(θ0) + (θ1 − θ0)(∂L(θ*)/∂θ),

where θ* is in between θ0 and θ1. Therefore, according to (2.11), we reject H0 if

L(θ1)/L(θ0) = 1 + (θ1 − θ0)(∂L(θ*)/∂θ)/L(θ0) > k.
Now, as θ1 → θ0, it is clear that this critical region reduces to that of (2.15) [see Gourieroux and Monfort, 1995, p. 32].
Example 1. As an example of an LMP test, consider testing for the median of a Cauchy distribution with probability density

f(y; θ) = 1/{π[1 + (y − θ)²]}.

We test H0 : θ = 0 against H1 : θ > 0. For simplicity, take n = 1; therefore, we reject H0 for large values of

s(0) = ∂l(θ)/∂θ |θ=0 = 2y/(1 + y²).

As constructed, this will provide an optimal test for θ close to zero (local alternatives). Now suppose θ >> 0; we can see that as θ → ∞,

Pr[2Y/(1 + Y²) > k | θ] → 0.

Therefore, for distant alternatives the power of the test will be zero.
Therefore, what works for local alternatives may not work at all for not-so-local alternatives. The situation, however, is not so grim universally. Consider the following standard example.
Example 2. Let Y ~ N(μ, 1) and test H0 : μ = 0 against H1 : μ > 0 based on a sample of size 1. We have

∂ln f(y; μ)/∂μ = y − μ,

so that s(0) = y. Therefore, we reject H0 if y > k, where k = Zα, the upper α cutoff point of the standard normal distribution. The power of this test is 1 − Φ(Zα − μ), where Φ(·) is the distribution function of the standard normal density. As μ → ∞, the power of the test goes to 1. Therefore, the test y > Zα is not only LMP, it is also uniformly most powerful (UMP) for all μ > 0.

Now let us consider what happens to the power of this test when μ < 0. The power function Pr(y > Zα | μ < 0) still remains 1 − Φ(Zα − μ), but it is now less than α, the size of the test. Therefore, the test is not MP for all μ ≠ 0. To get an MP test for two-sided alternatives, we need to add unbiasedness as an extra condition in our requirements.
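The power function of Example 2 can be evaluated directly (an added numerical sketch using the standard library's normal distribution):

```python
from statistics import NormalDist

# Power of the one-sided test "reject H0 if y > Z_alpha" for Y ~ N(mu, 1):
# power(mu) = 1 - Phi(Z_alpha - mu).
nd = NormalDist()
alpha = 0.05
z_alpha = nd.inv_cdf(1 - alpha)

def power(mu):
    return 1.0 - nd.cdf(z_alpha - mu)

print(power(0.0))   # equals the size alpha at mu = 0
print(power(3.0))   # approaches 1 as mu grows: UMP for mu > 0
print(power(-1.0))  # falls below alpha for mu < 0: the test is biased there
```

The last line is the bias noted in the text: for μ < 0 the one-sided test rejects less often than its nominal size.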
Locally most powerful unbiased (LMPU) test
A test φ(y) of size α is unbiased for H0 : θ ∈ Ω0 against H1 : θ ∈ Ω1 if Eθ[φ(y)] ≤ α for θ ∈ Ω0 and Eθ[φ(y)] ≥ α for θ ∈ Ω1. Suppose we want to find an LMPU test for testing H0 : θ = θ0 against H1 : θ ≠ θ0. By expanding the power function γ(θ) in (2.3) around θ = θ0 for local alternatives, we have

γ(θ) = γ(θ0) + (θ − θ0)γ′(θ0) + ((θ − θ0)²/2)γ″(θ0) + o((θ − θ0)²)
     = α + ((θ − θ0)²/2)γ″(θ0) + o(n⁻¹).

Unbiasedness requires that the power be minimized at θ = θ0 and, hence, γ′(θ0) = 0. To maximize the local power, we therefore need to maximize γ″(θ0) for both θ > θ0 and θ < θ0, and this leads to the LMPU test. Neyman and Pearson (1936, p. 9) called the corresponding critical region a "type-A region"; it requires maximization of γ″(θ0) subject to the two side-conditions γ(θ0) = α and γ′(θ0) = 0. In the NP generalized lemma, let us put m = 2, c1 = 0, c2 = α, g1 = ∂L(θ0)/∂θ, g2 = L(θ0) and g3 = ∂²L(θ0)/∂θ²; then from (2.7) and (2.8), the optimal test function φ* = 1 if

∂²L(θ0)/∂θ² > k1(∂L(θ0)/∂θ) + k2 L(θ0), (2.23)

and φ* = 0, otherwise. Critical region (2.23) can be expressed in terms of the derivatives of the log-likelihood function as

∂²l(θ0)/∂θ² + [∂l(θ0)/∂θ]² > k1(∂l(θ0)/∂θ) + k2. (2.24)

In terms of the score function s(θ) = ∂l(θ)/∂θ and its derivative s′(θ), (2.24) can be written as

s′(θ0) + [s(θ0)]² > k1 s(θ0) + k2. (2.25)
Example 2. (continued) For this example, consider now testing H0 : μ = 0 against H1 : μ ≠ 0. It is easy to see that s(0) = y and s′(0) = −1. Therefore, from (2.25), a uniformly most powerful unbiased (UMPU) test will reject H0 if

y² + k′1 y + k′2 > 0,

or

y < k″1 or y > k″2,

where k′1, k′2, k″1 and k″2 are constants determined from the size and unbiasedness conditions. After some simplification, the LMPU principle leads to a symmetric critical region of the form y < −Zα/2 or y > Zα/2.
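A quick check (an added sketch) that the symmetric region |y| > Zα/2 indeed has size α and is unbiased on both sides of μ = 0:

```python
from statistics import NormalDist

# Size and power of the two-sided region |y| > Z_{alpha/2} for Y ~ N(mu, 1).
nd = NormalDist()
alpha = 0.05
z_half = nd.inv_cdf(1 - alpha / 2)

def power_two_sided(mu):
    # Pr(Y < -z) + Pr(Y > z) for Y ~ N(mu, 1)
    return nd.cdf(-z_half - mu) + (1.0 - nd.cdf(z_half - mu))

print(power_two_sided(0.0))                         # the size alpha
print(power_two_sided(1.0), power_two_sided(-1.0))  # equal, both above alpha
```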
In many situations, s′(θ) can be expressed as a linear function of the score s(θ). In those cases, LMPU tests will be based on the score function only, just like the LMP test in (2.15). Also, for certain testing problems s(θ0) vanishes; then from (2.25) we see that an LMPU test can be constructed using the second derivative of the log-likelihood function.
Example 3. (Godfrey, 1988, p. 92). Let yi ~ N(0, 1 + θ²zi), i = 1, 2, …, n, where the zi are given positive constants. We are interested in testing H0 : θ = 0, that is, that yi has constant variance. The log-likelihood function and the score function are, respectively, given by

l(θ) = const − (1/2)Σᵢ₌₁ⁿ ln(1 + θ²zi) − (1/2)Σᵢ₌₁ⁿ yi²/(1 + θ²zi)

and

s(θ) = θ Σᵢ₌₁ⁿ zi[yi²/(1 + θ²zi)² − 1/(1 + θ²zi)].

It is clear that s(θ) = 0 at H0 : θ = 0. However,

s′(0) = Σᵢ₌₁ⁿ zi(yi² − 1),

and from (2.25), the LMPU test could be based on this quantity. In fact, it can be shown that (Godfrey, 1988, p. 92)

Σᵢ₌₁ⁿ zi(yi² − 1) / √(2Σᵢ₌₁ⁿ zi²) →d N(0, 1), (2.29)

where →d denotes convergence in distribution.
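A Monte Carlo sketch of (2.29) (an added illustration; the zi below are arbitrary positive constants chosen for the demonstration):

```python
import math
import random

# Under H0 (theta = 0), y_i ~ N(0, 1), and
# T = sum z_i*(y_i^2 - 1) / sqrt(2 * sum z_i^2)
# should be approximately N(0, 1).
random.seed(1)
n = 200
z = [0.5 + 0.3 * (i % 10) for i in range(n)]  # illustrative positive z_i
denom = math.sqrt(2.0 * sum(zi * zi for zi in z))

def T():
    y = [random.gauss(0.0, 1.0) for _ in range(n)]
    return sum(zi * (yi * yi - 1.0) for zi, yi in zip(z, y)) / denom

draws = [T() for _ in range(5000)]
m = sum(draws) / len(draws)
v = sum((d - m) ** 2 for d in draws) / len(draws)
print(round(m, 2), round(v, 2))  # sample mean near 0, variance near 1
```

The scaling works because each term zi(yi² − 1) has mean 0 and variance 2zi² under H0.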
Neyman’s smooth test
Pearson (1900) suggested his goodness-of-fit test to see whether an assumed probability model adequately described the data at hand. Suppose we divide the data into p classes, where the probability of the jth class is θj, j = 1, 2, …, p, with Σⱼ₌₁ᵖ θj = 1. Suppose that according to the assumed probability model θj = θj0; our null hypothesis can then be stated as H0 : θj = θj0, j = 1, 2, …, p. Let nj denote the observed frequency of the jth class, with Σⱼ₌₁ᵖ nj = n. Pearson (1900) suggested the goodness-of-fit statistic

P = Σⱼ₌₁ᵖ (nj − nθj0)²/(nθj0) = Σⱼ₌₁ᵖ (Oj − Ej)²/Ej, (2.30)

where Oj and Ej denote, respectively, the observed and expected frequencies for the jth class.

Neyman's (1937) criticism of Pearson's test was that (2.30) does not depend on the order of the positive and negative differences (Oj − Ej). Neyman (1980) gives an extreme example represented by two cases. In the first, the signs of the consecutive differences (Oj − Ej) are not the same; in the other, there is a run of, say, a number of "negative" differences, followed by a sequence of "positive" differences. These two possibilities might lead to similar values of P, but Neyman (1937, 1980) argued that in the second case the goodness-of-fit should be more in doubt, even if the value of P happens to be small.
Suppose we want to test the null hypothesis (H0) that f(y; θ) is the true density function for the random variable Y. The specification of f(y; θ) will differ depending on the problem at hand. Let us denote the alternative hypothesis by H1 : Y ~ g(y). Neyman (1937) transformed any hypothesis-testing problem of this type into testing only one kind of hypothesis. Let z = F(y) denote the distribution function of Y under H0; then the density of the random variable Z = F(Y) is given by

h(z) = g(y)(dy/dz) = g(y)/f(y; θ). (2.31)

When H0 : Y ~ f(y; θ) holds,

h(z) = 1, 0 ≤ z ≤ 1. (2.32)
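The probability integral transform behind (2.31)-(2.32) can be checked numerically (an added sketch; the normal null model and shifted alternative are hypothetical choices):

```python
import math
import random

# If H0 holds, Z = F(Y) is uniform on (0, 1); under an alternative it is not.
# Here F is the standard normal cdf, written with math.erf.
random.seed(7)

def F(y):
    return 0.5 * (1.0 + math.erf(y / math.sqrt(2.0)))

z_null = [F(random.gauss(0.0, 1.0)) for _ in range(20000)]  # H0 true
z_alt = [F(random.gauss(1.0, 1.0)) for _ in range(20000)]   # shifted normal

m_null = sum(z_null) / len(z_null)
m_alt = sum(z_alt) / len(z_alt)
print(round(m_null, 2), round(m_alt, 2))  # near 0.5 versus pushed toward 1
```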
Therefore, testing H0 is equivalent to testing whether Z has a uniform distribution on the interval (0, 1), irrespective of the specification of f(y; θ). As for the specific alternative to the uniform distribution, Neyman (1937) suggested a smooth class. By smooth alternatives Neyman meant densities that have few intersections with the null density function and that are close to the null. He specified the alternative density as

h(z) = C(δ) exp[Σⱼ₌₁ʳ δj πj(z)], (2.33)

where C(δ) is the constant of integration that depends on the δj values, and the πj(z) are orthogonal polynomials satisfying

∫₀¹ πj(z)πk(z) dz = 1 for j = k
                  = 0 for j ≠ k.
Under the hypothesis H0 : δ1 = δ2 = … = δr = 0, C(δ) = 1 and h(z) in (2.33) reduces to the uniform density (2.32). Using the generalized NP lemma, Neyman (1937) derived a locally most powerful symmetric unbiased test for H0, and the test statistic is given by

Ψ² = Σⱼ₌₁ʳ [(1/√n) Σᵢ₌₁ⁿ πj(zi)]², (2.34)

where zi = F(yi). The test is symmetric in the sense that the asymptotic power of the test depends only on the "distance" Σⱼ₌₁ʳ δj² between the null and alternative hypotheses.
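An added sketch of the smooth test with r = 2, using the first two normalized shifted Legendre polynomials on (0, 1) as the orthonormal system πj(z):

```python
import math
import random

# Psi^2 = sum_j [n^{-1/2} sum_i pi_j(z_i)]^2 is approximately chi-square(2)
# under H0 (z_i uniform); the alternative below is an arbitrary smooth
# departure chosen for illustration.
random.seed(3)

def pi1(z):
    return math.sqrt(3.0) * (2.0 * z - 1.0)

def pi2(z):
    return math.sqrt(5.0) * (6.0 * z * z - 6.0 * z + 1.0)

def psi2(zs):
    n = len(zs)
    u1 = sum(pi1(z) for z in zs) / math.sqrt(n)
    u2 = sum(pi2(z) for z in zs) / math.sqrt(n)
    return u1 * u1 + u2 * u2

z_null = [random.random() for _ in range(500)]       # H0: uniform
z_alt = [random.random() ** 2 for _ in range(500)]   # smooth departure

stat_null, stat_alt = psi2(z_null), psi2(z_alt)
print(round(stat_null, 2), round(stat_alt, 2))  # compare with chi2(2): 5.99
```

The two polynomials integrate to zero over (0, 1) and have unit norm, so they satisfy the orthonormality conditions above.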
2.2 Tests based on score function and Wald’s test
We have already discussed Rao's (1948) score principle of testing as an LMP test in (2.15) for a scalar parameter θ (p = 1). For the p ≥ 2 case there will be a score for each individual parameter, and the problem is to combine them in an "optimal" way. Let H0 : θ = θ0, where now θ = (θ1, θ2, …, θp)′ and θ0 = (θ10, θ20, …, θp0)′, and let the (local) alternative hypothesis be H1 : θ = θδ, where θδ = (θ10 + δ1, θ20 + δ2, …, θp0 + δp)′. The proportionate change in the log-likelihood function in moving from θ0 to θδ is given by δ′s(θ0), where δ = (δ1, δ2, …, δp)′ and s(θ0) is the score function evaluated at θ = θ0. Let us define the information matrix as

I(θ) = E[−∂²l(θ)/∂θ∂θ′] = E[s(θ)s(θ)′].

Then, the asymptotic variance of δ′s(θ0) is δ′I(θ0)δ and, if δ were known, a test could be based on

[δ′s(θ0)]² / δ′I(θ0)δ, (2.37)

which under H0 will be asymptotically distributed as χ²₁. To eliminate the δs and to obtain a linear function that would yield maximum discrimination, Rao (1948) maximized (2.37) with respect to δ and obtained²

max over δ of [δ′s(θ0)]² / δ′I(θ0)δ = s(θ0)′I(θ0)⁻¹s(θ0), (2.38)

with optimal value δ = I(θ0)⁻¹s(θ0). In a sense, δ = I(θ0)⁻¹s(θ0) signals the optimal direction of the alternative hypothesis that we should consider. For example, when p = 1, δ = +1 or −1, as we have seen in (2.16). Asymptotically, under the null, the statistic in (2.38) follows a χ²ₚ distribution, in contrast to (2.37), which follows χ²₁. When the null hypothesis is composite, like H0 : h(θ) = c with r < p restrictions, the general form of Rao's score (RS) statistic is
RS = s(θ̃)′I(θ̃)⁻¹s(θ̃), (2.39)

where θ̃ is the restricted MLE of θ. Under H0, RS →d χ²ᵣ. Therefore, we observe two optimality principles behind the RS test: first, in terms of the LMP test as given in (2.15), and second, in deriving the "optimal" direction for the multiparameter case.
Rao (1948) suggested the score test as an alternative to the Wald (1943) statistic, which for testing H0 : h(θ) = c is given by

W = (h(θ̂) − c)′[H(θ̂)′I(θ̂)⁻¹H(θ̂)]⁻¹(h(θ̂) − c). (2.40)
Rao (1948, p. 53) stated that his test "besides being simpler than Wald’s has some theoretical advantages," such as invariance under transformation of parameters. Rao (2000) recollects the motivation and background behind the development of the score test.
The three statistics LR, W, and RS given, respectively, in (2.12), (2.40), and (2.39) are referred to as the "holy trinity." We can look at these statistics in terms of different measures of distance between the null and alternative hypotheses. When the null hypothesis is true, we would expect the restricted and unrestricted MLEs of θ, θ̃ and θ̂, to be close to each other, and likewise the corresponding log-likelihood functions. The LR statistic therefore measures the distance through the log-likelihood function and is based on the difference l(θ̂) − l(θ̃). To see the intuitive basis of the score test, note that s(θ̂) is zero by construction, and we should expect s(θ̃) to be close to zero if H0 is true. Hence the RS test exploits the distance through the score function and can be viewed as being based on s(θ̃) − s(θ̂). Lastly, the W test considers the distance directly in terms of h(θ) and is based on [h(θ̂) − c] − [h(θ̃) − c], where by construction h(θ̃) = c. This reveals a duality between the Wald and score tests. At the unrestricted MLE θ̂, s(θ̂) = 0, and the Wald test checks whether h(θ̂) is away from c. On the other hand, at the restricted MLE θ̃, h(θ̃) = c by construction, and the score test verifies whether s(θ̃) is far from a null vector.³
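As an added numerical sketch (not from the original text), the trinity can be computed in a Bernoulli model, testing H0 : p = 0.5; the seed, sample size, and true p are arbitrary choices:

```python
import math
import random

# With p_hat = ybar and I(p) = n / (p(1 - p)):
#   LR = 2n[p_hat*ln(p_hat/p0) + (1-p_hat)*ln((1-p_hat)/(1-p0))]
#   W  = n(p_hat - p0)^2 / (p_hat*(1 - p_hat))   # information at the MLE
#   RS = n(p_hat - p0)^2 / (p0*(1 - p0))         # information at the null
random.seed(5)
p0, p_true, n = 0.5, 0.6, 400
y = [1 if random.random() < p_true else 0 for _ in range(n)]
p_hat = sum(y) / n

LR = 2 * n * (p_hat * math.log(p_hat / p0)
              + (1 - p_hat) * math.log((1 - p_hat) / (1 - p0)))
W = n * (p_hat - p0) ** 2 / (p_hat * (1 - p_hat))
RS = n * (p_hat - p0) ** 2 / (p0 * (1 - p0))

print(round(LR, 2), round(W, 2), round(RS, 2))  # close, but not identical
```

The three statistics agree to first order near the null but use the information (and the distance measure) evaluated at different points, which is exactly the duality described above.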
Example 4. Consider a multinomial distribution with p classes, and let the probability of an observation belonging to the jth class be θj, so that Σⱼ₌₁ᵖ θj = 1. Denote the frequency of the jth class by nj, with Σⱼ₌₁ᵖ nj = n. We are interested in testing H0 : θj = θj0, j = 1, 2, …, p, where the θj0 are known constants. It can be shown that for this problem the score statistic is given by

s(θ0)′I(θ0)⁻¹s(θ0) = Σⱼ₌₁ᵖ (nj − nθj0)²/(nθj0), (2.41)

where θ0 = (θ10, …, θp0)′. Therefore, the RS statistic is the same as Pearson's P given in (2.30). It is quite a coincidence that Pearson (1900) suggested a score test, mostly on intuitive grounds, almost 50 years before Rao (1948). For this problem, the other two test statistics, LR and W, are given by
LR = 2 Σⱼ₌₁ᵖ nj ln[nj/(nθj0)]

and

W = Σⱼ₌₁ᵖ (nj − nθj0)²/nj.
The equivalence of the score and Pearson's tests and their local optimality has not been fully recognized in the statistics literature. Many researchers considered the LR statistic superior to P. Asymptotically, both statistics are locally optimal and equivalent, and, in terms of finite-sample performance, P performs better [see, for example, Rayner and Best, 1989, pp. 26-7].
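A small numerical sketch for Example 4 (the frequencies and null probabilities below are hypothetical, and the W form with observed frequencies in the denominator is one common version, used here as an assumption):

```python
import math

# Pearson's P (= the RS statistic), the LR statistic, and a Wald-type
# statistic for a 3-class multinomial under H0: theta_j = theta_j0.
n_obs = [48, 62, 90]           # hypothetical observed class frequencies
theta0 = [0.25, 0.30, 0.45]    # null class probabilities
n = sum(n_obs)
expected = [n * t for t in theta0]

P = sum((o - e) ** 2 / e for o, e in zip(n_obs, expected))
LR = 2.0 * sum(o * math.log(o / e) for o, e in zip(n_obs, expected))
W = sum((o - e) ** 2 / o for o, e in zip(n_obs, expected))

print(round(P, 4), round(LR, 4), round(W, 4))  # all referred to chi2(p - 1)
```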
The three tests LR, W, and RS are based on the (efficient) maximum likelihood estimates. When consistent (rather than efficient) estimators are used, there is another attractive way to construct a score-type test, due to Neyman (1954, 1959). In the literature this is known as the C(α), effective score, or Neyman-Rao test. Following Neyman (1959), let us partition θ as θ = (θ1′, θ2)′, where θ2 is a scalar, and test H0 : θ2 = θ20. Therefore, θ1 is the nuisance parameter, with dimension (p − 1) × 1. Neyman's fundamental contribution is the derivation of an asymptotically optimal test using consistent estimators of the nuisance parameters. He achieved this in two steps. First, he started with a class of functions g(y; θ1, θ2) satisfying the regularity conditions of Cramér (1946, p. 500).⁴

For simplicity, let us start with a normed Cramér function, that is, a g(y; θ1, θ2) with zero mean and unit variance. We denote a √n-consistent estimator of θ under H0 by θ⁺ = (θ1⁺′, θ20)′. Neyman asked what property g(·) should have so that replacing θ by θ⁺ in the test statistic would make no difference asymptotically, and his Theorem 1 proved that g(·) must satisfy

Cov[g(y; θ1, θ20), s1j(y; θ1, θ20)] = 0, (2.44)
where s1j = ∂l(θ)/∂θ1j, the score for the jth component of θ1, j = 1, 2, …, p − 1. In other words, the function g(y; θ) should be orthogonal to s1 = ∂l(θ)/∂θ1. Starting from a normed Cramér function, let us construct

ḡ(y; θ1, θ20) = g(y; θ1, θ20) − Σⱼ₌₁ᵖ⁻¹ bj s1j(θ1, θ20), (2.45)

where bj, j = 1, 2, …, p − 1, are the regression coefficients from regressing g(y; θ1, θ20) on s11, s12, …, s1,p−1. Denote by σ²(θ1, θ20) the minimum variance of ḡ(y; θ1, θ20), and define

g*(y; θ1, θ20) = ḡ(y; θ1, θ20)/σ(θ1, θ20).
Note that g*(y; θ1, θ20) is also a normed Cramér function, and the covariance between g*(y; θ1, θ20) and s1j(θ1, θ20) is also zero, j = 1, 2, …, p − 1. Therefore, a class of C(α) tests can be based on Zn(θ⁺, θ20) = (1/√n) Σᵢ₌₁ⁿ g*(yi; θ1⁺, θ20). Condition (2.44) ensures that Zn(θ1, θ20) − Zn(θ⁺, θ20) = op(1). Neyman's second step was to find the starting function g(y; θ) itself. Theorem 2 of Neyman (1959) states that under the sequence of local alternatives H1n : θ2 = θ20 + δ/√n, 0 < δ < ∞, Zn(θ⁺, θ20) is asymptotically distributed as normal with mean δρσ2 and variance unity, where ρ = ρ(θ1, θ20) is the correlation coefficient between g(y; θ1, θ20) and s2(y; θ1, θ20), the score for θ2, and σ2² is the variance of s2. The asymptotic power of the test is governed entirely by ρ, and to maximize the power we should select the function g(y; θ) so that ρ = 1; that is, the optimal choice is g(y; θ) = ∂l(θ)/∂θ2 = s2(θ1, θ20), say, the score for the tested parameter θ2.
Therefore, from (2.45), an asymptotically and locally optimal test should be based on the part of the score for the parameter tested that is orthogonal to the score for the nuisance parameter, namely,

s2(θ⁺, θ20) − Σⱼ₌₁ᵖ⁻¹ bj s1j(θ⁺, θ20). (2.48)

In (2.48), bj, j = 1, 2, …, p − 1, are now the regression coefficients from regressing s2(θ⁺, θ20) on s11, s12, …, s1,p−1, and we can express (2.48) as

s2(θ⁺) − I21(θ⁺)I11(θ⁺)⁻¹s1(θ⁺) = s2*(θ⁺), say, (2.49)

where the Iij(θ) are the appropriate blocks of the information matrix I(θ) corresponding to θ1 and θ2. s2*(θ) is called the effective score for θ2, and its variance, I2*(θ) = I22(θ) − I21(θ)I11(θ)⁻¹I12(θ), is termed the effective information. Note that, since s2*(θ) is the residual score obtained from regressing s2(θ) on s1(θ), it is orthogonal to the score for θ1. The operational form of Neyman's C(α) test is

C(α) = s2*(θ⁺)′I2*(θ⁺)⁻¹s2*(θ⁺). (2.50)
Bera and Billias (2000) derived this test using the Rao (1948) framework [see equations (2.38) and (2.39)]. If we replace the √n-consistent estimator θ⁺ by the restricted MLE θ̃, then s2*(θ̃) reduces to s2(θ̃) (since s1(θ̃) = 0), and the C(α) test becomes the standard RS test.
Example 5. (Neyman, 1959) Let us consider testing H0 : θ2 = 0 in the Cauchy density

f(y; θ1, θ2) = θ1/{π[θ1² + (y − θ2)²]}, (2.51)

where θ1 > 0 and −∞ < θ2 < ∞. It is easy to see that

s2(θ1, 0) = Σᵢ₌₁ⁿ 2yi/(θ1² + yi²), (2.52)

and that I12(θ) = 0 and I22(θ) = n/(2θ1²) under H0. Therefore, s2*(θ) = s2(θ) and I2*(θ) = I22(θ). Hence the C(α) statistic (2.50) based on a sample y = (y1, y2, …, yn)′ is given by

C(α) = (2θ1⁺²/n)[Σᵢ₌₁ⁿ 2yi/(θ1⁺² + yi²)]². (2.53)

For θ1⁺, we can use any √n-consistent estimator, such as one based on the difference between the third and first sample quartiles. Since I21(θ) = 0 under H0, the RS test will have the same algebraic form as (2.53), but for θ1 we need to use the restricted MLE, θ̃1.
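A Monte Carlo sketch of Example 5 (an added illustration; the true scale, the seed, and the half-interquartile-range estimator are choices made here):

```python
import math
import random

# Under H0 (theta2 = 0) the statistic (2.53) with an estimated scale
# should behave like a chi-square(1) draw.
random.seed(11)
n = 400
theta1 = 2.0   # true scale; theta2 = 0, so H0 is true
y = sorted(theta1 * math.tan(math.pi * (random.random() - 0.5))
           for _ in range(n))

# sqrt(n)-consistent scale estimate: half the interquartile range,
# since the quartiles of this Cauchy density sit at theta2 +/- theta1
t1 = 0.5 * (y[3 * n // 4] - y[n // 4])

s2 = sum(2.0 * yi / (t1 * t1 + yi * yi) for yi in y)
c_alpha = (2.0 * t1 * t1 / n) * s2 * s2

print(round(t1, 2), round(c_alpha, 2))  # t1 near 2; c_alpha typically < 3.84
```

Because I12 = 0 here, plugging in the quartile-based estimate in place of the MLE of θ1 does not change the asymptotic behavior of the statistic, which is exactly the point of the C(α) construction.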
Neyman's C(α) approach provides an attractive way to take into account the nuisance parameter θ1. Bera and Yoon (1993) applied this approach to develop tests that are valid under a locally misspecified model. They showed that replacing θ1, even by a null vector, in the final form of the test would lead to a valid test procedure.
This ends our discussion of the test principles proposed in the statistics literature. We have covered only those tests that have some relevance to testing and evaluating econometric models. In the next section we discuss some of their applications.
John Maynard Keynes was skeptical about applying statistical techniques to economic data, as can be seen in his review of Tinbergen's book. It was left to Haavelmo (1944) to successfully defend the application of statistical methodology to economic data within the framework of the joint probability distribution of variables. Trygve Haavelmo was clearly influenced by Jerzy Neyman,⁶ and Haavelmo (1944) contains a seven-page account of the Neyman-Pearson theory. He clearly stated the limitations of the standard hypothesis-testing approach and explicitly mentioned that a test is, in general, constructed on the basis of a given fixed set of possible alternatives, which he called the a priori admissible hypotheses. Whenever this a priori admissible set deviates from the data-generating process, the test loses its optimality [for more on this, see Bera and Yoon (1993) and Bera (2000)]. Haavelmo, however, did not himself formally apply the Neyman-Pearson theory to econometric testing problems. That was left to Anderson (1948) and Durbin and Watson (1950).