# Grouped Data

In our discussion of the Goldfeld-Quandt test we decided that wages in rural and metropolitan areas showed different amounts of variation. When the heteroskedasticity occurs between groups, it is relatively straightforward to estimate the GLS corrections; this is referred to as feasible GLS (FGLS).

There are a couple of ways to select each subsample for estimation. The first was used in the Goldfeld-Quandt test example, where the metro subsample was chosen using `smpl metro=1 --restrict` and the rural one with `smpl metro=0 --restrict`. The second, used in the grouped GLS script below, restricts the sample with the `--dummy` option:

```
 1 open "@gretldir\data\poe\cps2.gdt"
 2 list x = const educ exper
 3 ols wage x metro
 4 smpl metro --dummy
 5 ols wage x
 6 scalar stdm = $sigma
 7 smpl full
 8 series rural = 1-metro
 9 smpl rural --dummy
10 ols wage x
11 scalar stdr = $sigma
12 # Restore the full sample
13 smpl full
14 series wm = metro*stdm
15 series wr = rural*stdr
16 series w = 1/(wm + wr)^2
17 wls w wage x metro
```

The `smpl` command is used in a new way here. In line 4, `smpl metro --dummy` restricts the sample based on the indicator variable metro, so that only observations for which metro=1 are retained. The wage equation is estimated in line 5 for the metro dwellers and the standard error of the regression is saved in line 6.

The next lines restore the full sample and create a new indicator variable for rural dwellers. Its value is just 1-metro. We generate this in order to use the `smpl rural --dummy` syntax. We could have skipped generating rural and simply used `smpl metro=0 --restrict`. In line 10 the model is estimated for rural dwellers and the standard error of the regression is saved in line 11.
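The `--restrict` alternative mentioned above can be sketched as follows. This is only an illustration under stated assumptions, not the listing from the text: it assumes cps2.gdt sits under gretl's data\poe directory, writes the equality test as `==`, adds `--quiet` to suppress the subsample regression output, and builds the weights with a conditional expression instead of the two-series construction.

```hansl
# Sketch: two-group FGLS with subsamples chosen via --restrict
# (file location is an assumption; adjust to your installation)
open "@gretldir\data\poe\cps2.gdt"
list x = const educ exper

smpl metro==1 --restrict     # keep metro observations only
ols wage x --quiet
scalar stdm = $sigma         # S.E. of regression, metro subsample

smpl full                    # undo the restriction before imposing a new one
smpl metro==0 --restrict     # keep rural observations only
ols wage x --quiet
scalar stdr = $sigma         # S.E. of regression, rural subsample

smpl full
# weight = reciprocal of the group's estimated error variance
series w = (metro == 1) ? 1/stdm^2 : 1/stdr^2
wls w wage x metro
```

Restoring the full sample between the two restrictions matters because successive `--restrict` calls are cumulative: without it, the second restriction would be applied to the metro subsample and leave no observations.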

The full sample must be restored (line 13) before the two sets of weights are created and combined. In line 14 the statement `series wm = metro*stdm` multiplies the metro standard error of the regression by the indicator variable. Its values are stdm for metro dwellers and 0 for rural dwellers. We do the same for rural dwellers in line 15. Adding these two series together creates a single variable that contains only two distinct values, $\hat{\sigma}_M$ for metro dwellers and $\hat{\sigma}_R$ for rural ones. Squaring this and taking the reciprocal (line 16) provides the necessary weights for the weighted least squares regression.
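The weight construction just described can be written out explicitly. The notation here is an assumption introduced for illustration: $e_i$ denotes the regression error for observation $i$ and $w_i$ the weight passed to `wls`, while $\hat{\sigma}_M$ and $\hat{\sigma}_R$ correspond to the scalars stdm and stdr.

```latex
\[
\widehat{\operatorname{var}}(e_i) =
\begin{cases}
\hat{\sigma}_M^2 & \text{if } \mathit{metro}_i = 1,\\
\hat{\sigma}_R^2 & \text{if } \mathit{metro}_i = 0,
\end{cases}
\qquad
w_i = \frac{1}{\left(\mathit{metro}_i\,\hat{\sigma}_M + \mathit{rural}_i\,\hat{\sigma}_R\right)^2}
=
\begin{cases}
1/\hat{\sigma}_M^2 & \text{if } \mathit{metro}_i = 1,\\
1/\hat{\sigma}_R^2 & \text{if } \mathit{metro}_i = 0.
\end{cases}
\]
```

Because exactly one of the two indicators equals 1 for every observation, the sum inside the parentheses always picks out the standard error of that observation's own group.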

WLS, using observations 1-1000
Dependent variable: wage

| Variable | Coefficient | Std. Error | t-ratio | p-value |
|---|---|---|---|---|
| const | -9.39836 | 1.01967 | -9.2170 | 0.0000 |
| educ | 1.19572 | 0.0685080 | 17.4537 | 0.0000 |
| exper | 0.132209 | 0.0145485 | 9.0874 | 0.0000 |
| metro | 1.53880 | 0.346286 | 4.4437 | 0.0000 |

Statistics based on the weighted data:

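| Statistic | Value | Statistic | Value |
|---|---|---|---|
| Sum squared resid | 998.425 | S.E. of regression | 1.00122 |
| $R^2$ | 0.271528 | Adjusted $R^2$ | 0.269334 |
| F(3, 996) | 123.749 | P-value(F) | 3.99e-68 |
| Log-likelihood | -1418.15 | Akaike criterion | 2844.3 |
| Schwarz criterion | 2863.93 | Hannan-Quinn | 2851.76 |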

Statistics based on the original data:

| Statistic | Value | Statistic | Value |
|---|---|---|---|
| Mean dependent var | 10.21302 | S.D. dependent var | 6.246641 |
| Sum squared resid | 28585.82 | S.E. of regression | 5.357296 |