# Multiple Endogenous Regressors and the Cragg-Donald F-test

Cragg and Donald (1993) have proposed a test statistic that can be used to test for weak identification (i.e., weak instruments). To compute it manually, you need a set of canonical correlations. Gretl does not compute these directly, so we will use another free software package, R, to do part of the computations. Gretl prints the value of the Cragg-Donald statistic by default, so you won’t have to go to all of this trouble. Still, computing part of the statistic in R illustrates a very powerful feature of gretl.

One solution to identifying weak instruments in models with more than one endogenous regressor is based on the use of canonical correlations. Canonical correlations are a generalization of the usual concept of a correlation between two variables and attempt to describe the association between two sets of variables.

Let N denote the sample size, B the number of right-hand-side endogenous variables, G the number of exogenous variables included in the equation (including the intercept), and L the number of external instruments, i.e., ones not included in the regression. If we have two variables in the first set of variables and two variables in the second set, then there are two canonical correlations, $r_1$ and $r_2$. More generally, with B endogenous regressors and at least as many external instruments there are B canonical correlations, and $r_B$ denotes the smallest of them.

A test for weak identification (meaning that the instruments are correlated with the endogenous regressors, but not very highly) is based on the Cragg-Donald F-test statistic:

$$\text{Cragg-Donald } F = \frac{N-G-B}{L}\times\frac{r_B^2}{1-r_B^2} \tag{10.6}$$

The Cragg-Donald statistic reduces to the usual weak instruments F-test when the number of endogenous variables is B = 1. Critical values for this test statistic have been tabulated by Stock and Yogo (2005), so that we can test the null hypothesis that the instruments are weak, against the alternative that they are not, for two particular consequences of weak instruments.
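Equation (10.6) is simple enough to wrap in a small helper. The following is a minimal hansl sketch, not part of the original script: the function name `cd_fstat` and every input value are made up for illustration.

```hansl
# Sketch of equation (10.6); the name cd_fstat is hypothetical
function scalar cd_fstat (scalar N, scalar G, scalar B, scalar L, scalar rB)
    # rB is the smallest canonical correlation
    return ((N - G - B)/L) * (rB^2/(1 - rB^2))
end function

# made-up inputs, for illustration only
scalar F = cd_fstat(100, 3, 2, 2, 0.5)
printf "Cragg-Donald F = %.4f\n", F
```

With these inputs the first factor is 95/2 = 47.5 and the second is 0.25/0.75, so the printed value is 15.8333.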

Note: The computations in this section use R. Refer to Appendix D for some hints about using R.

The problem with weak instruments is summarized by Hill et al. (2011, p. 435):

Relative Bias: In the presence of weak instruments the amount of bias in the IV estimator can become large. Stock and Yogo consider the bias when estimating the coefficients of the endogenous variables. They examine the maximum IV estimator bias relative to the bias of the least squares estimator. Stock and Yogo give the illustration of estimating the return to education. If a researcher believes that the least squares estimator suffers a maximum bias of 10%, and if the relative bias is 0.1, then the maximum bias of the IV estimator is 1%.

Rejection Rate (Test Size): When estimating a model with endogenous regressors, testing hypotheses about the coefficients of the endogenous variables is frequently of interest. If we choose the α = 0.05 level of significance we expect that a true null hypothesis is rejected 5% of the time in repeated samples. If instruments are weak, then the actual rejection rate of the null hypothesis, also known as the test size, may be larger. Stock and Yogo’s second criterion is the maximum rejection rate of a true null hypothesis if we choose α = 0.05. For example, we may be willing to accept a maximum rejection rate of 10% for a test at the 5% level, but we may not be willing to accept a rejection rate of 20% for a 5% level test.

The script to compute the statistic manually is given below:

    1 open "@gretldir/data/poe/mroz.gdt"
    2 smpl wage>0 --restrict
    3 logs wage
    4 square exper
    5 series nwifeinc = (faminc-wage*hours)/1000
    6 list x = mtr educ kidsl6 nwifeinc const
    7 list z = kidsl6 nwifeinc mothereduc fathereduc const
    8 tsls hours x ; z
    9 scalar df = $df

This first section includes much that we’ve seen before. The data are loaded, the sample is restricted to the wage earners, the log of wage is taken, and the square of experience is added to the data. Then a new variable is computed to measure family income from all other members of the household. The next part estimates a model of hours as a function of mtr, educ, kidsl6, nwifeinc, and a constant. Two of the regressors are endogenous: mtr and educ. The external instruments are mothereduc and fathereduc; these join the internal ones (const, kidsl6, nwifeinc) in the instrument list. The degrees of freedom from this regression are saved to compute (N - G - B)/L.
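For this particular model the ingredients of the statistic can be counted directly from the lists x and z. The following only restates (10.6) under those counts; it is not an additional result:

$$
G = 3 \ (\text{const, kidsl6, nwifeinc}), \qquad B = 2 \ (\text{mtr, educ}), \qquad L = 2 \ (\text{mothereduc, fathereduc})
$$

$$
\frac{N-G-B}{L}\times\frac{r_2^2}{1-r_2^2} \;=\; \frac{N-5}{2}\times\frac{r_2^2}{1-r_2^2}
$$

Since tsls reports residual degrees of freedom equal to the sample size minus the five regressors in x, the saved scalar df equals N - 5, which is why the script later divides df by 2.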

The next set of lines partials out the influence of the internal instruments on each of the endogenous regressors and on the external instruments.

    10 list w = const kidsl6 nwifeinc
    11 ols mtr w --quiet
    12 series e1 = $uhat
    13 ols educ w --quiet
    14 series e2 = $uhat
    15 ols mothereduc w --quiet
    16 series e3 = $uhat
    17 ols fathereduc w --quiet
    18 series e4 = $uhat
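The four regressions above repeat one pattern, so the same partialling-out step could also be written as a loop. This is only a sketch: it assumes the POE data files are installed where line 1 of the script expects them, and the list names `targets` and `resids` and the residual names `r_<variable>` are hypothetical choices, not names used elsewhere in this section.

```hansl
# Assumes the POE data set is available at this path
open "@gretldir/data/poe/mroz.gdt"
smpl wage>0 --restrict
series nwifeinc = (faminc-wage*hours)/1000

list w = const kidsl6 nwifeinc                     # internal instruments
list targets = mtr educ mothereduc fathereduc      # variables to be purged
list resids = null                                 # will collect the residual series

loop foreach i targets --quiet
    ols $i w --quiet              # regress each variable on the internal instruments
    series r_$i = $uhat           # keep the residual, e.g. r_mtr
    list resids += r_$i
endloop

list resids print
```

The advantage of the loop is that adding another endogenous regressor or instrument only requires changing the `targets` list.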

Now this is where it gets interesting. From here we are going to call a separate piece of software called R to do the computation of the canonical correlations. Lines 19-25 hold what gretl refers to as a foreign block.

    19 foreign language=R --send-data --quiet
    20   set1 <- gretldata[,29:30]
    21   set2 <- gretldata[,31:32]
    22   cc1 <- cancor(set1,set2)
    23   cc <- as.matrix(cc1$cor)
    24   gretl.export(cc)
    25 end foreign
    26 matrix vars = mread("@dotdir/cc.mat")
    27 print vars
    28 scalar mincc = minc(vars)
    29 scalar cd = df*(mincc^2)/(2*(1-mincc^2))
    30 printf "\nThe Cragg-Donald Statistic is %10.4f.\n",cd

A foreign block takes the form

Foreign block syntax:

    foreign language=R [--send-data] [--quiet]
        ... R commands ...
    end foreign

and achieves the same effect as submitting the enclosed R commands via the GUI in noninteractive mode (see section 30.3 of the Gretl User's Guide). In other words, it allows you to use R commands from within gretl. Of course, you have to have installed R separately, but this greatly expands what can be done using gretl. The --send-data option arranges for auto-loading of the data from the current gretl session. The --quiet option prevents the output from R from being echoed in the gretl output. The block is closed with an end foreign command.
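To see the round trip in isolation, here is a minimal sketch of a foreign block on artificial data. It assumes R is installed and visible to gretl; the series x and y and the matrix name m are invented for the example.

```hansl
# Artificial data: index runs from 1 to 20
nulldata 20
series x = index
series y = index^2

foreign language=R --send-data --quiet
    # gretldata is the object gretl passes to R
    m <- as.matrix(colMeans(gretldata[, c("x", "y")]))
    gretl.export(m)
end foreign

# gretl.export writes m.mat into gretl's "dot" directory
matrix m = mread("@dotdir/m.mat")
print m
```

Selecting columns by name, as done here, is less fragile than selecting them by position, because the column numbers change whenever series are added to the dataset.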

Inside our foreign block we create two sets of variables. The first set includes the residuals e1 and e2 computed above. These are obtained from a data frame called gretldata, which is the name that gretl gives to the dataset it passes into R. You have to pull the desired variables out of gretldata. In this case I used a rather inelegant but effective means of doing so. These two variables are located in the 29th and 30th columns of gretldata, which also happen to be their ID numbers in gretl. Line 20 puts these two variables into set1.

The second set of residuals is put into set2. Then, R’s cancor function is used to find the canonical correlations between the two sets of residuals. The entire set of results is stored in R as cc1. This object contains many results, but we only need the actual canonical correlations between the two sets. They are stored within cc1 as cor, retrieved as cc1\$cor, and put into a matrix named cc with R’s as.matrix command. This matrix is exported to gretl as cc.mat; the .mat suffix is added during the export. cc.mat is placed in gretl's "dot" directory (@dotdir).

The next step is to read cc.mat into gretl as the matrix vars. Then line 28 takes the smallest canonical correlation, and line 29 uses it to compute the Cragg-Donald statistic. The result printed to the screen is:

    ? printf "\nThe Cragg-Donald Statistic is %6.4f.\n",cd
    The Cragg-Donald Statistic is 0.1006.

It matches the value that tsls reports automatically, shown below. Because 0.1006 is far below even the smallest critical value listed (3.63), it also shows that these instruments are very weak.

    Weak instrument test -
      Cragg-Donald minimum eigenvalue = 0.100568
      Critical values for desired TSLS maximal size, when running
      tests at a nominal 5% significance level:

         size      10%      15%      20%      25%
        value     7.03     4.58     3.95     3.63

      Maximal size may exceed 25%
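The way this table is read can be spelled out in a few lines of hansl. The sketch below simply hard-codes the statistic and the four critical values printed above; the variable names are hypothetical, and the rule applied is the Stock-Yogo one of rejecting the null of weak instruments when the statistic exceeds the critical value.

```hansl
# Values copied from the tsls output above
scalar cd = 0.100568
matrix sizes = {10, 15, 20, 25}          # maximal size, in percent
matrix crit  = {7.03, 4.58, 3.95, 3.63}  # matching critical values
scalar n = cols(crit)
scalar found = 0

# Walk from the strictest criterion to the loosest
loop i = 1..n
    if cd > crit[i]
        printf "Reject weak instruments: maximal size is at most %d%%\n", sizes[i]
        found = 1
        break
    endif
endloop

if !found
    printf "Cannot reject weak instruments: maximal size may exceed %d%%\n", sizes[n]
endif
```

With the statistic from this example no critical value is exceeded, which agrees with the message that the maximal size may exceed 25%.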

Of course, you can do this exercise without using R as well. Gretl’s matrix language is very powerful, and you can easily get the canonical correlations between two sets of variables. The following function does just that.

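    function matrix cc(list Y, list X)
        # center each column around its mean
        matrix mY = cdemean({Y})
        matrix mX = cdemean({X})

        # cross-product matrices
        matrix YX = mY'mX
        matrix XX = mX'mX
        matrix YY = mY'mY

        # eigenvalues solving |Q - lambda*YY| = 0, where Q = YX*inv(XX)*YX'
        matrix ret = eigsolve(qform(YX, invpd(XX)), YY)
        # the canonical correlations are the square roots of the eigenvalues
        return sqrt(ret)
    end function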

The function is called cc and takes two arguments, just as the one in R does. Feed the function two lists, each containing the variable names to be included in each set for which the canonical correlations are needed. The variables in each set are demeaned using the very handy cdemean function, which centers the columns of its matrix argument around the column means. Then the various cross-products are taken (YX, XX, YY) and the eigenvalues $\lambda$ that solve $|Q - \lambda\, YY| = 0$, where $Q = (YX)(XX)^{-1}(YX)^{T}$, are computed. The function returns their square roots, which are the canonical correlations.
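One way to build confidence in a function like this is to try it on a case where the answer is known. When each set contains a single variable, the only eigenvalue is the squared simple correlation, so the canonical correlation should equal the absolute value of the ordinary correlation coefficient. The sketch below checks this on simulated data; the seed, the coefficient 0.6, and the series and list names are arbitrary assumptions, and the function is repeated only so that the block runs on its own.

```hansl
# Simulated data, for illustration only
nulldata 200
set seed 4321
series a = normal()
series b = 0.6*a + normal()

function matrix cc (list Y, list X)
    matrix mY = cdemean({Y})
    matrix mX = cdemean({X})
    matrix YX = mY'mX
    matrix XX = mX'mX
    matrix YY = mY'mY
    matrix ret = eigsolve(qform(YX, invpd(XX)), YY)
    return sqrt(ret)
end function

list S1 = a
list S2 = b
matrix r = cc(S1, S2)

# The two printed numbers should agree
printf "canonical correlation = %.6f\n", r[1]
printf "|corr(a,b)|           = %.6f\n", abs(corr(a, b))
```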

Then, to get the value of the Cragg-Donald F, assemble the two sets of residuals and use the cc function to get the canonical correlations.

    list E1 = e1 e2
    list E2 = e3 e4

    matrix l = cc(E1, E2)
    scalar mincc = minc(l)
    scalar cd = df*(mincc^2)/(2*(1-mincc^2))
    printf "\nThe Cragg-Donald Statistic is %10.4f.\n",cd