Heteroskedasticity in the Linear Probability Model
In chapter 7 we introduced the linear probability model and showed that the indicator variable y_i is heteroskedastic. That is,

    var(y_i) = π_i(1 − π_i)        (8.10)

where π_i is the probability that the dependent variable is equal to 1 (the choice is made). The estimated variance is

    var(y_i) = π̂_i(1 − π̂_i)        (8.11)
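Equation (8.10) follows from the Bernoulli distribution: for an indicator y with P(y = 1) = π, E[y] = E[y²] = π, so var(y) = π − π² = π(1 − π). A quick pure-Python simulation confirms this (the value π = 0.45 is illustrative only, not taken from the data):

```python
import random
import statistics

pi = 0.45                          # illustrative choice probability
theoretical = pi * (1 - pi)        # pi(1 - pi), as in equation (8.10)

random.seed(12345)
draws = [1 if random.random() < pi else 0 for _ in range(100_000)]
empirical = statistics.pvariance(draws)

print(round(theoretical, 4))            # 0.2475
print(abs(empirical - theoretical) < 0.005)  # True
```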
This can be used to perform feasible GLS. The cola marketing data coke.gdt is the basis for this example. The dependent variable, coke, takes the value 1 if the individual purchases Coca-Cola and 0 if not. The decision to purchase Coca-Cola depends on the ratio of the Coca-Cola price to the Pepsi price (pratio), and on whether displays for Coca-Cola or Pepsi were present: disp_coke=1 if a Coca-Cola display was present, otherwise 0; disp_pepsi=1 if a Pepsi display was present, otherwise 0.
First, the data are loaded and the summary statistics are produced.

open "@gretldir\data\poe\coke.gdt"
summary --simple
list x = const pratio disp_coke disp_pepsi
The --simple option is used for the summary command. Then a list is created that contains the names of the independent variables to be used in the estimated models. The basic summary statistics are:
Summary statistics, using the observations 1-1140

              Mean     Minimum   Maximum   Std. Dev.
  coke        0.44737  0.00000   1.0000    0.49744
  pr_pepsi    1.2027   0.68000   1.7900    0.30073
  pr_coke     1.1901   0.68000   1.7900    0.29992
  disp_pepsi  0.36404  0.00000   1.0000    0.48137
  disp_coke   0.37895  0.00000   1.0000    0.48534
  pratio      1.0272   0.49721   2.3247    0.28661
Everything looks good. There are no negative prices, and the indicator variables are all contained between 0 and 1. The magnitudes of the means are reasonable.
Next, least squares is used to estimate the model twice: once with the usual standard errors and again with the HCCME standard errors produced by the --robust option. Each is added to a model table using modeltab add.
# OLS
ols coke x
modeltab add
# OLS w/robust
ols coke x --robust
modeltab add
Feasible GLS will be estimated in two ways. In the first regression, we will omit any observation that has a negative estimated variance. Remember that one of the problems with the linear probability model is that predictions are not constrained to lie between 0 and 1. If ŷ_i < 0 or ŷ_i > 1, then the variance estimate ŷ_i(1 − ŷ_i) is negative. In the first line below a new series is created to check this condition. If the variance, varp, is greater than zero, pos is equal to 1; otherwise it is zero. The second line creates a weight for wls by multiplying the indicator variable pos by the reciprocal of the variance. In this way, any observation with a nonpositive estimated variance receives a weight of zero.
Remove observations with negative variance:

series p = $yhat
series varp = p*(1-p)
series pos = (varp > 0)
series w = pos * 1/varp
# omit regression
wls w coke x
modeltab add
The first line uses the accessor for the predicted values from a linear regression, $yhat, and therefore it must follow least squares estimation of the linear probability model; in this model, the predictions are interpreted as probabilities. Once again, a trick is being used to eliminate observations from the model: any observation that has a zero weight in w is dropped from the computation. There are equivalent ways to do this in gretl, as shown below.
Two other ways to drop observations:

smpl varp>0 --restrict
setmiss 0 w
Restricting the sample is probably the most straightforward method. The second uses the setmiss command, which changes the missing value code to 0 for elements of w; any observation where w=0 is then treated as missing and is not used to estimate the model.
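The equivalence between a zero weight and dropping the observation can be verified outside gretl. The sketch below (in Python, with made-up numbers) computes a weighted least squares slope both ways:

```python
# Hypothetical data; demonstrates that a zero WLS weight excludes an
# observation exactly as deleting it from the sample would.
def wls_slope(x, y, w):
    sw = sum(w)
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw   # weighted mean of x
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw   # weighted mean of y
    num = sum(wi * (xi - xbar) * (yi - ybar) for wi, xi, yi in zip(w, x, y))
    den = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
    return num / den

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [0.1, 0.4, 0.3, 0.9, 1.2]

w_zero = [1.0, 1.0, 0.0, 1.0, 1.0]      # third observation gets zero weight
keep = [0, 1, 3, 4]                      # same sample with it deleted

b_zeroed = wls_slope(x, y, w_zero)
b_dropped = wls_slope([x[i] for i in keep], [y[i] for i in keep], [1.0] * 4)
print(abs(b_zeroed - b_dropped) < 1e-12)   # True
```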
Finally, another feasible GLS estimation is done. This time, the predicted probability is truncated at 0.01 if ŷ_i < 0.01 and at 0.99 if ŷ_i > 0.99. The code to do this is
WLS with truncated variances for observations out of bounds:

series b = (p<.01) || (p>.99)
series pt = b*0.01 + p*(1-b)
series varp_t = pt*(1-pt)
series w_t = 1/varp_t
wls w_t coke x
modeltab add
modeltab show
The first line creates another indicator variable that takes the value 1 if the predicted probability falls outside the boundaries. The || is a logical operator that takes the union of the two conditions ("OR"). The second line creates the truncated value of the probability using the indicator variable. Notice that this line assigns 0.01 in both tails; since only the variance pt(1 − pt) enters the weight, and 0.01(1 − 0.01) = 0.99(1 − 0.99), the resulting weight is the same as if 0.99 were used in the upper tail.
b(0.01) + p(1 − b) = 0.01  when b = 1
b(0.01) + p(1 − b) = p     when b = 0
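A pure-Python rendering of this formula makes the mechanics explicit (illustrative values only):

```python
def truncate(p):
    # b = 1 when p is out of bounds, 0 otherwise, mirroring the gretl series
    b = 1 if (p < 0.01 or p > 0.99) else 0
    return b * 0.01 + p * (1 - b)

print(truncate(-0.05))   # 0.01  (b = 1)
print(truncate(0.45))    # 0.45  (b = 0)
print(truncate(1.20))    # 0.01  (both tails map to 0.01 here)

# the implied variance is the same at either boundary:
print(abs(0.01 * (1 - 0.01) - 0.99 * (1 - 0.99)) < 1e-15)   # True
```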
There is another, less transparent, way to generate the truncated probabilities: use the ternary conditional assignment operator. This operates like an if statement and can be used to save a line of script. This syntax would create the series as
The conditional assignment operator:

series pt = ((p<.01) || (p>.99)) ? 0.01 : p
Basically, the bound condition in parentheses, (p < .01) || (p > .99), is checked; that is what the question mark operates on. If the condition is true, pt is set to the value between the question mark and the colon, 0.01. If false, it is set to the value to the right of the colon, p. It works much like a traditional if statement in a spreadsheet program. This method is also more efficient computationally, which could save some time if used in a loop to perform simulations.
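In Python the same pattern is a conditional expression; the sketch below (illustrative values only) checks that the one-line form and the two-line indicator construction agree:

```python
def pt_indicator(p):
    # two-step construction: indicator b, then b*0.01 + p*(1 - b)
    b = 1 if (p < 0.01 or p > 0.99) else 0
    return b * 0.01 + p * (1 - b)

def pt_conditional(p):
    # one-line analogue of gretl's ( cond ) ? 0.01 : p
    return 0.01 if (p < 0.01 or p > 0.99) else p

values = [-0.30, 0.0, 0.005, 0.50, 0.99, 1.40]
print(all(pt_indicator(p) == pt_conditional(p) for p in values))   # True
```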
Once the truncated probabilities are created, then the usual weighted least squares estimation can proceed. The model table appears below:
Dependent variable: coke
[model table]
Standard errors in parentheses
* indicates significance at the 10 percent level
** indicates significance at the 5 percent level
Columns (1) and (2) are the OLS estimates with usual and robust standard errors, respectively. Column (3) uses WLS with the negative variance observations omitted from the sample. Column (4) is WLS with the negative predictions truncated. These results are quite a bit different from the others. This no doubt occurs because of the large weight placed on the 16 observations whose weights were constructed by truncation. The truncated variance is 0.01(1 − 0.01) = 0.0099, and the square root of its reciprocal is approximately 10, a large weight to place on these 16 observations via WLS. Since these extreme observations carry a large weight relative to the others, they exert considerable influence on the estimated regression.
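The arithmetic behind that weight can be checked directly (pure Python, no data needed):

```python
import math

var_trunc = 0.01 * (1 - 0.01)        # truncated variance, 0.01(1 - 0.01)
wls_weight = 1 / var_trunc           # variance-scale weight, about 101
sd_scale = math.sqrt(wls_weight)     # SD-scale weight, about 10

print(round(var_trunc, 4))           # 0.0099
print(round(sd_scale, 2))            # 10.05
```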
The complete script for this chapter follows.

open "@gretldir\data\poe\food.gdt"
set echo off
ols food_exp const income
gnuplot food_exp income --linearfit
# see section 1.4 of this manual for commands to view these plots.

# ols with HCCME standard errors
ols food_exp const income --robust
# confidence intervals (Robust)
scalar lb = $coeff(income) - critical(t,$df,0.025) * $stderr(income)
scalar ub = $coeff(income) + critical(t,$df,0.025) * $stderr(income)
printf "\nThe 95%% confidence interval is (%.3f, %.3f).\n", lb, ub
# residual plot
ols food_exp const income --robust
series res = $uhat
setinfo res -d "Least Squares Residuals" -n "Residual"
gnuplot res income --output=c:\Temp\olsres
# launch gnuplot (Windows only)
launch wgnuplot
# To view graph, type: load 'c:\Temp\olsres' at prompt
# residual magnitude plot with loess fit
series abs_e = abs(res)
setinfo abs_e -d "Absolute value of the LS Residuals" -n "Absolute Value of Residual"
gnuplot abs_e income --loessfit --output=c:\temp\loessfit.plt
# LM test for heteroskedasticity
ols food_exp const income
series sq_ehat = $uhat*$uhat
ols sq_ehat const income
scalar NR2 = $trsq
pvalue X 1 NR2

# built-in LM test
ols food_exp const income
modtest income --breusch-pagan

# White test
ols food_exp const income
series sq_ehat = $uhat*$uhat
series sq_income = income^2
ols sq_ehat const income sq_income
scalar NR2 = $trsq
pvalue X 2 NR2

# built-in White test
ols food_exp const income --quiet
modtest --white --quiet
# grouped data -- Goldfeld-Quandt
open "@gretldir\data\poe\cps2.gdt"
ols wage const educ exper metro
# Use only metro observations
smpl metro=1 --restrict
ols wage const educ exper
scalar stdm = $sigma
scalar df_m = $df
# Restore the full sample
smpl full
# Use only rural observations
smpl metro=0 --restrict
ols wage const educ exper
scalar stdr = $sigma
scalar df_r = $df
# GQ statistic
gq = stdm^2/stdr^2
scalar pv = pvalue(F, df_m, df_r, gq)
printf "\nThe F(%d, %d) statistic = %.3f. The right side p-value is %.4g.\n", df_m, df_r, gq, pv
# Goldfeld-Quandt for food expenditure
open "@gretldir\data\poe\food.gdt"
dataset sortby income
list x = const income
ols food_exp x
# large variance observations
smpl 21 40
ols food_exp x
scalar stdL = $sigma
scalar df_L = $df
# Restore the full sample
smpl full
# small variance observations
smpl 1 20
ols food_exp x
scalar stdS = $sigma
scalar df_S = $df
# GQ statistic
gq = stdL^2/stdS^2
scalar pv = pvalue(F, df_L, df_S, gq)
printf "\nThe F(%d, %d) statistic = %.3f. The right side p-value is %.4g.\n", df_L, df_S, gq, pv
# compare ols with and without HCCME
list x = const income
ols food_exp x --quiet
modeltab add
ols food_exp x --robust --quiet
modeltab add
modeltab show

# hypothesis test
ols food_exp x --robust
omit income
ols food_exp x --quiet
restrict
  b[2]=0
end restrict
ols food_exp x --robust --quiet
restrict
  b[2]=0
end restrict
open "@gretldir\data\poe\food.gdt"
# GLS using built-in function
series w = 1/income
wls w food_exp const income
scalar lb = $coeff(income) - critical(t,$df,0.025) * $stderr(income)
scalar ub = $coeff(income) + critical(t,$df,0.025) * $stderr(income)
printf "\nThe 95%% confidence interval is (%.3f, %.3f).\n", lb, ub

# GLS using OLS on transformed data
series wi = 1/sqrt(income)
series ys = wi*food_exp
series xs = wi*income
series cs = wi
ols ys cs xs
# Wage Example
open "@gretldir\data\poe\cps2.gdt"
ols wage const educ exper metro
# Use only metro observations
smpl metro --dummy
ols wage const educ exper
scalar stdm = $sigma
smpl full
# Create a dummy variable for rural
series rural = 1-metro
# Restrict sample to rural observations
smpl rural --dummy
ols wage const educ exper
scalar stdr = $sigma
# Restore the full sample
smpl full
# Generate standard deviations for each metro and rural obs
series wm = metro*stdm
series wr = rural*stdr
series w = 1/(wm + wr)^2
# Weighted least squares
wls w wage const educ exper metro

# heteroskedastic model
open "@gretldir\data\poe\food.gdt"
ols food_exp const income
series lnsighat = log($uhat*$uhat)
series z = log(income)
ols lnsighat const z
series predsighat = exp($yhat)
series w = 1/predsighat
wls w food_exp const income
# linear probability model
open "@gretldir\data\poe\coke.gdt"
summary --simple
list x = const pratio disp_coke disp_pepsi
# OLS
ols coke x
modeltab add
# OLS w/robust
ols coke x --robust
modeltab add
series p = $yhat
series varp = p*(1-p)
series pos = (varp > 0)
series w = pos * 1/varp
# omit regression
wls w coke x
modeltab add
# smpl varp>0 --restrict
# setmiss 0 w
series b = (p<.01) || (p>.99)
series pt = b*0.01 + p*(1-b)
series varp_t = pt*(1-pt)
series w_t = 1/varp_t
# trunc regression
wls w_t coke x
modeltab add
modeltab show
Figure 8.4: Plot of food expenditures against income with least squares fit.