Cars Example
The data set cars.gdt is included in the package of datasets distributed with this manual. In most cases it is a good idea to print summary statistics for any new dataset you work with. This serves several purposes. First, if there is some problem with the dataset, the summary statistics may give you some indication: Is the sample size as expected? Are the means, minimums, and maximums reasonable? If not, you'll need to do some investigative work. The second reason is important as well: by looking at the summary statistics you'll gain an idea of how the variables have been scaled, which is vitally important when it comes to making economic sense of the results. Do the magnitudes of the coefficients make sense? It also puts you on the lookout for discrete variables, which require some care in interpretation.
The summary command is used to obtain summary statistics: mean, minimum, maximum, standard deviation, coefficient of variation, skewness, and excess kurtosis. The corr command computes the simple correlations among your variables. These can be helpful in gaining an initial impression of whether variables are highly collinear or not. Other measures are more useful, but it never hurts to look at the correlations. Either of these commands can be followed by a variable list to limit the set of variables summarized or correlated.
Consider the cars example from POE4. The script is
open "c:\Program Files\gretl\data\poe\cars.gdt"
summary
corr
ols mpg const cyl eng wgt
vif
The summary statistics appear below:
Summary Statistics, using the observations 1-392

  Variable      Mean      Median    Minimum    Maximum
  mpg         23.4459    22.7500    9.00000    46.6000
  cyl         5.47194    4.00000    3.00000    8.00000
  eng         194.412    151.000    68.0000    455.000
  wgt         2977.58    2803.50    1613.00    5140.00

  Variable   Std. Dev.     C.V.     Skewness   Ex. kurtosis
  mpg         7.80501    0.332894   0.455341     -0.524703
  cyl         1.70578    0.311733   0.506163     -1.39570
  eng         104.644    0.538259   0.698981     -0.783692
  wgt         849.403    0.285266   0.517595     -0.814241
and the correlation matrix
Correlation coefficients, using the observations 1-392
5% critical value (two-tailed) = 0.0991 for n = 392

         mpg       cyl       eng       wgt
      1.0000   -0.7776   -0.8051   -0.8322   mpg
                1.0000    0.9508    0.8975   cyl
                          1.0000    0.9330   eng
                                    1.0000   wgt
The variables are quite highly correlated in the sample. For instance, the correlation between weight and engine displacement is 0.933. Cars with big engines are heavy. What a surprise!
The regression results are:
OLS, using observations 1-392
Dependent variable: mpg

              Coefficient    Std. Error     t-ratio    p-value
  const       44.3710        1.48069        29.9665    0.0000
  cyl         -0.267797      0.413067       -0.6483    0.5172
  eng         -0.0126740     0.00825007     -1.5362    0.1253
  wgt         -0.00570788    0.000713919    -7.9951    0.0000
The tests of the individual significance of cyl and eng can be read from the table of regression results. Neither is significant at the 5% level. The joint test of their significance is performed using the omit statement. The F-statistic is 4.298 and has a p-value of 0.0142. The null hypothesis is rejected in favor of their joint significance.
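The arithmetic behind this joint test is worth seeing once. The sketch below implements the standard restricted-versus-unrestricted formula F = ((SSE_r - SSE_u)/J)/(SSE_u/df_u) that omit automates; the SSE values are hypothetical placeholders, not numbers from the cars data.

```python
# F-test of J linear restrictions: compare the restricted and
# unrestricted sums of squared errors. Inputs are made-up placeholders.

def f_statistic(sse_r, sse_u, J, df_u):
    """F = ((SSE_r - SSE_u)/J) / (SSE_u/df_u)."""
    return ((sse_r - sse_u) / J) / (sse_u / df_u)

# Dropping J = 2 regressors raises SSE from 90 to 100; the
# unrestricted model has 96 degrees of freedom.
print(round(f_statistic(100.0, 90.0, 2, 96), 4))  # 5.3333
```

The statistic is then compared to the F(J, df_u) distribution, which is exactly what gretl's pvalue command (or omit itself) does.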
The new statement that requires explanation is vif. vif stands for variance inflation factor and it is used as a collinearity diagnostic by many programs, including gretl. The vif is closely related to the statistic suggested by Hill et al. (2011), who suggest using the R2 from auxiliary regressions to determine the extent to which each explanatory variable can be explained as a linear function of the others. They suggest regressing xj on all of the other independent variables and comparing the R2 from this auxiliary regression to 0.80. If the R2 exceeds 0.80, then there is evidence of a collinearity problem.
The vif actually reports the same information, but in a less straightforward way. The vif associated with the jth regressor is computed as

    vif_j = 1/(1 - Rj2)

which is, as you can see, simply a function of the Rj2 from the jth auxiliary regression. Notice that when Rj2 = 0.90, vif_j = 10, so the two rules of thumb are closely related: a vif_j greater than 10 is equivalent to an R2 greater than 0.90 from the auxiliary regression, a slightly stricter cutoff than the 0.80 suggested for the auxiliary R2.
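To see where the two rules of thumb sit on the same scale, here is a minimal sketch of the vif formula (the function name is mine, not part of gretl):

```python
# vif_j = 1/(1 - Rj^2): a monotonic transformation of the auxiliary R^2.

def vif(r2):
    """Variance inflation factor implied by an auxiliary R-squared."""
    return 1.0 / (1.0 - r2)

print(round(vif(0.80), 6))  # 5.0  -> an auxiliary R^2 of .8 gives vif = 5
print(round(vif(0.90), 6))  # 10.0 -> vif = 10 corresponds to R^2 = .9
```

So the vif > 10 screen is just the R2 screen re-expressed, with the threshold sitting at 0.90 rather than 0.80.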
The rest of the output from gretl is shown below. First, the remaining statistics from the ols regression:

Mean dependent var   23.44592   Sum squared resid    7162.549
S.E. of regression   4.296531   R-squared            0.699293
Adjusted R-squared   0.696967   F(3, 388)            300.7635
P-value(F)           7.6e-101   Log-likelihood      -1125.674
Akaike criterion     2259.349   Schwarz criterion    2275.234
Hannan-Quinn         2265.644

and then the output from vif:

Variance Inflation Factors
Minimum possible value = 1.0
Values > 10.0 may indicate a collinearity problem

   cyl   10.516
   eng   15.786

VIF(j) = 1/(1 - R(j)^2), where R(j) is the multiple correlation
coefficient between variable j and the other independent variables

Properties of matrix X'X:

 1-norm = 4.0249836e+009
 Determinant = 6.6348526e+018
 Reciprocal condition number = 1.7766482e-009
Once again, the gretl output is very informative. It gives you the threshold for high collinearity (vif_j > 10) and the relationship between vif_j and R2. Clearly, these data are highly collinear: two variance inflation factors are above the threshold, and the one associated with wgt is fairly large as well.
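As a cross-check on gretl's numbers: with three regressors, vif_j is the jth diagonal element of the inverse of the regressors' correlation matrix, so the vifs can be recovered (approximately) from the printed correlations. The sketch below does this by hand with the 3x3 cofactor formula, so no linear-algebra library is assumed:

```python
# Recover the vifs from the sample correlations of the regressors.
# vif_j is the j-th diagonal element of the inverse of the 3x3
# correlation matrix of (cyl, eng, wgt); computed via cofactors.

r_ce, r_cw, r_ew = 0.9508, 0.8975, 0.9330  # cyl-eng, cyl-wgt, eng-wgt

det = (1.0 * (1.0 - r_ew**2)
       - r_ce * (r_ce - r_ew * r_cw)
       + r_cw * (r_ce * r_ew - r_cw))

vif_cyl = (1.0 - r_ew**2) / det
vif_eng = (1.0 - r_cw**2) / det
vif_wgt = (1.0 - r_ce**2) / det

print(round(vif_cyl, 2), round(vif_eng, 2), round(vif_wgt, 2))
# approximately 10.51 15.78 7.79
```

The small discrepancies from gretl's 10.516 and 15.786 come from the correlations being rounded to four decimals; the third value shows why wgt, though below the threshold of 10, still counts as fairly large.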
6.4 Script
square advert
ols sales const price advert sq_advert
restrict
  b[2] = 0
  b[3] = 0
  b[4] = 0
end restrict
ols sales const price advert sq_advert
scalar sseu = $ess
scalar unrest_df = $df
ols sales const
scalar sser = $ess
scalar rest_df = $df
scalar J = rest_df - unrest_df
scalar Fstat = ((sser-sseu)/J)/(sseu/unrest_df)
pvalue F J unrest_df Fstat
# t-test
ols sales const price advert sq_advert
omit price

# optimal advertising
open "@gretldir\data\poe\andy.gdt"
square advert
ols sales const price advert sq_advert
scalar Ao = (1-$coeff(advert))/(2*$coeff(sq_advert))

# test of optimal advertising
restrict
  b[3]+3.8*b[4]=1
end restrict
open "@gretldir\data\poe\andy.gdt"
square advert
ols sales const price advert sq_advert
scalar Ao = (1-$coeff(advert))/(2*$coeff(sq_advert))

# One-sided t-test
ols sales const price advert sq_advert --vcv
scalar r = $coeff(advert)+3.8*$coeff(sq_advert)-1
scalar v = $vcv[3,3]+((3.8)^2)*$vcv[4,4]+2*(3.8)*$vcv[3,4]
scalar t = r/sqrt(v)
pvalue t $df t
# joint test
ols sales const price advert sq_advert
restrict
  b[3]+3.8*b[4]=1
  b[1]+6*b[2]+1.9*b[3]+3.61*b[4]=80
end restrict
# restricted estimation
open "@gretldir\data\poe\beer.gdt"
logs q pb pl pr i
ols l_q const l_pb l_pl l_pr l_i --quiet
restrict
  b[2]+b[3]+b[4]+b[5]=0
end restrict
# model specification - relevant and irrelevant vars
open "@gretldir\data\poe\edu_inc.gdt"
ols faminc const he we
omit we
corr
list all_x = const he we kl6 xtra_x5 xtra_x6
ols faminc all_x
# reset test
ols faminc const he we kl6
reset --quiet --squares-only
reset --quiet

# model selection rules and a function
function matrix modelsel (series y, list xvars)
    ols y xvars --quiet
    scalar sse = $ess
    scalar N = $nobs
    scalar K = nelem(xvars)
    scalar aic = ln(sse/N)+2*K/N
    scalar bic = ln(sse/N)+K*ln(N)/N
    scalar rbar2 = 1-((1-$rsq)*(N-1)/$df)
    matrix A = { K, N, aic, bic, rbar2 }
    printf "\nRegressors: %s\n", varname(xvars)
    printf "K = %d, N = %d, AIC = %.4f, SC = %.4f, and Adjusted R2 = %.4f\n", K, N, aic, bic, rbar2
    return A
end function
list x1 = const he
list x2 = const he we
list x3 = const he we kl6
list x4 = const he we xtra_x5 xtra_x6
matrix a = modelsel(faminc, x1)
matrix b = modelsel(faminc, x2)
matrix c = modelsel(faminc, x3)
matrix d = modelsel(faminc, x4)
matrix MS = a|b|c|d
colnames(MS, "K N AIC SC Adj_R2")
printf "%10.5g", MS
function modelsel clear
ols faminc all_x
modeltab add
omit xtra_x5 xtra_x6
modeltab add
omit kl6
modeltab add
omit we
modeltab add
modeltab show
ols faminc x3 --quiet
reset
# collinearity
open "@gretldirdatapoecars. gdt"
summary
corr
ols mpg const cyl
ols mpg const cyl eng wgt --quiet
omit cyl
ols mpg const cyl eng wgt --quiet
omit eng
ols mpg const cyl eng wgt --quiet
omit eng cyl
# Auxiliary regressions for collinearity
# Check: r2 > .8 means severe collinearity
ols cyl const eng wgt
scalar r1 = $rsq
ols eng const wgt cyl
scalar r2 = $rsq
ols wgt const eng cyl
scalar r3 = $rsq
printf "R-squares for the auxiliary regressions\nDependent Variable:\n cylinders %3.3g\n engine displacement %3.3g\n weight %3.3g\n", r1, r2, r3
128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 
ols mpg const cyl eng wgt
vif