# Repeated Sampling

Perhaps the best way to illustrate the sampling properties of least squares is through an experiment. In section 2.4.3 of your book you are presented with results from 10 different regressions (POE4 Table 2.2). These were obtained using the dataset table2_2.gdt, which is included in the gretl datasets that accompany this manual. To reproduce the results in this table you could estimate 10 separate regressions:

    open "@gretldir\data\poe\table2_2.gdt"
    ols y1 const x
    ols y2 const x
    # ... and so on, through ...
    ols y10 const x

The ten regressions can be estimated more compactly using one of gretl’s loop constructs. The first step is to create a list that contains the names of the dependent variables, as in line 2 of the script below. The list statement puts data series into a collection called ylist; each of the series y1, y2, …, y10 is included. Such named lists make your scripts less verbose and repetitious, and easier to modify. Since lists are in fact lists of series ID numbers, they can be used only when a dataset is in place. The foreach loop in line 3 uses an index variable, i, to step through a specified list of strings. The loop is executed once for each string in the list. The numerical value of the index starts at 1 and is incremented by 1 at each iteration. To refer to elements of the list, use the syntax listname.\$i. Be sure to close the loop using endloop.

    1 open "@gretldir\data\poe\table2_2.gdt"
    2 list ylist = y1 y2 y3 y4 y5 y6 y7 y8 y9 y10
    3 loop foreach i ylist
    4     ols ylist.$i 0 1
    5 endloop
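As a minimal sketch of how this loop could be extended, the script below collects the intercept and slope from each of the ten regressions in a matrix so that they can be compared side by side. The names b and j are hypothetical and chosen only for illustration, and the regressor is assumed to be the series x used in the separate regressions above.

```hansl
# Sketch only: b and j are illustrative names, not part of the original script
open "@gretldir\data\poe\table2_2.gdt"
list ylist = y1 y2 y3 y4 y5 y6 y7 y8 y9 y10

matrix b = zeros(nelem(ylist), 2)   # one row per regression: intercept, slope
scalar j = 1                        # row counter
loop foreach i ylist
    ols ylist.$i const x --quiet    # estimate without printing each regression
    b[j,] = $coeff'                 # $coeff is a column vector, so transpose it
    j += 1
endloop
print b
```

Keeping the estimates in one matrix makes it easy to see how much the intercept and slope vary from one sample to the next.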

In the gretl GUI, named lists can be inspected and edited under the Data menu in the main window, via the item Define or edit list. This dialog is shown in Figure 2.14.

You can also generate your own random samples and conduct a Monte Carlo experiment using gretl. In this exercise you will generate 100 samples of data from the food expenditure data, estimate the slope and intercept parameters with each data set, and then study how the least squares estimator performed over those 100 different samples. What will become clear is this: the outcome from any single sample is a poor indicator of the true value of the parameters. Keep this humbling thought in mind whenever you estimate a model with what is invariably only one sample or instance of the true (but always unknown) data generation process.

$$food\_exp_t = \beta_1 + \beta_2\, income_t + e_t \qquad (2.6)$$

where $food\_exp_t$ is total food expenditure for the given time period and $income_t$ is income. Suppose further that we know how much income each of 40 households earns in a week. Additionally, we know that on average a household spends at least \$80 on food whether it has income or not and that an average household will spend ten cents of each new dollar of income on additional food. With income measured in units of \$100, this translates into parameter values of $\beta_1 = 80$ and $\beta_2 = 10$.

Our knowledge of any particular household is considerably less. We don’t know how much it actually spends on food in any given week and, other than differences based on income, we don’t know how its food expenditures might otherwise differ. Food expenditures are sure to vary for reasons other than differences in family income. Some families are larger than others, tastes and preferences differ, and some may travel more often or farther, making food consumption more costly. For whatever reasons, it is impossible for us to know beforehand exactly how much any household will spend on food, even if we know how much income it earns. All of this uncertainty is captured by the error term in the model. For the sake of experimentation, suppose we also know that $e_t \sim N(0, 88^2)$.
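Putting the assumed parameter values and error distribution together with equation (2.6), the data generation process for this experiment can be summarized as follows. This is only a restatement of the values given above, with $t$ indexing the 40 households:

$$food\_exp_t = 80 + 10\, income_t + e_t, \qquad e_t \sim N(0, 88^2), \qquad t = 1, \ldots, 40$$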

With this knowledge, we can study the properties of the least squares estimator by generating samples of size 40 using the known data generation mechanism. We generate 100 samples using the known parameter values, estimate the model for each using least squares, and then use summary statistics to determine whether least squares, on average anyway, is either very accurate or precise. So in this instance, we know how much each household earns, how much the average household spends on food that is not related to income ($\beta_1 = 80$), and how much that expenditure rises on average as income rises. What we do not know is how any particular household’s expenditures respond to income or how much is autonomous.

A single sample can be generated in the following way. The systematic component of food expenditure for the $t$th household is $80 + 10 \times income_t$. This differs from its actual food expenditure by a random amount that varies according to a normal distribution having zero mean and standard deviation equal to 88. So, we use computer generated random numbers to generate a random error, $e_t$, from that particular distribution. We repeat this for the remaining 39 individuals. This generates one Monte Carlo sample, which is then used to estimate the parameters of the model. The results are saved, and then another Monte Carlo sample is generated and used to estimate the model, and so on.

In this way, we can generate as many different samples of size 40 as we desire. Furthermore, since we know what the underlying parameters are for these samples, we can later see how close our estimators get to revealing these true values.

Now, computer generated sequences of random numbers are not actually random in the true sense of the word; they can be replicated exactly if you know the mathematical formula used to generate them and the ‘key’ that initiates the sequence. In most cases, these numbers behave as if they were randomly generated by a physical process.
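The replicability of pseudo-random numbers can be checked directly. The sketch below, which assumes the food expenditure dataset is installed in the poe folder of the data directory, draws the errors twice from the same seed and compares them. The series names u1 and u2 are hypothetical, and the seed value is arbitrary.

```hansl
# Sketch only: u1 and u2 are illustrative names
open "@gretldir\data\poe\food.gdt"

set seed 3213789               # the 'key' that initiates the sequence
series u1 = normal(0, 88)

set seed 3213789               # restart the generator at the same point
series u2 = normal(0, 88)

# The two draws should be identical, so the largest gap should be zero
printf "largest absolute difference = %g\n", max(abs(u1 - u2))
```

Omitting the second set seed statement lets the generator continue from where it left off, so u2 would then be a fresh set of draws.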

To conduct an experiment using least squares in gretl, use the script below:

    open "@gretldir\data\poe\food.gdt"
    set seed 3213789
    loop 100 --progressive --quiet
        series u = normal(0,88)
        series y1 = 80+10*income+u
        ols y1 const income
    endloop

Let’s look at what each line accomplishes. The first line opens the food expenditure data set that resides in the poe folder of the data directory. The next line, which is actually not necessary to do the experiments, initiates the pseudo-random numbers at a specific point. This is useful, since it will allow you to get the same results each time you run the script.

In Monte Carlo experiments, loops are used to estimate a model using many different samples that the experimenter generates and to collect the results. The simplest loop construct in gretl begins with the command loop NMC --progressive --quiet and ends with endloop. This is called a count loop. NMC in this case is the number of Monte Carlo samples you want to use. The --progressive option suppresses the printing of the individual output at each iteration; the --quiet option suppresses some printing to the screen as well.

Optionally you could add a couple of additional commands. The print command collects (scalar) statistics that you have computed and finds their averages and standard deviations. The store command allows you to store these in a gretl data file. These are discussed further below.
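As a rough sketch of how these two optional commands might be added to the experiment, the script below saves the intercept and slope from each iteration as scalars, prints their summary, and stores them. The scalar names b1 and b2 and the output file name coef.gdt are hypothetical choices made for illustration; @workdir refers to gretl's working directory.

```hansl
# Sketch only: b1, b2 and coef.gdt are illustrative names
open "@gretldir\data\poe\food.gdt"
set seed 3213789
loop 100 --progressive --quiet
    series u = normal(0, 88)
    series y1 = 80 + 10*income + u
    ols y1 const income
    scalar b1 = $coeff(const)          # intercept estimate for this sample
    scalar b2 = $coeff(income)         # slope estimate for this sample
    print b1 b2                        # averages and std. deviations over the loop
    store "@workdir\coef.gdt" b1 b2    # one row per Monte Carlo sample
endloop
```

The stored file can be opened afterwards like any other gretl dataset, which makes it possible to plot or summarize the 100 estimates.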

Within the loop itself, you tell gretl how to generate each sample and state how you want that sample to be used. The series command is used to generate new variables. In the first line, u is generated using the gretl function normal(), which when used without arguments produces a computer generated standard normal random variable. In this case, the function is given two arguments (i.e., series u = normal(0,88)): the first is the desired mean of the random normal and the second is its standard deviation. The next line adds this random element to the systematic portion of the model to generate a new sample for food expenditures (using the known values of income from the dataset).
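To see the effect of the two arguments, the following sketch compares draws from normal() with and without them. It uses an empty dataset of 1000 observations, an arbitrary size chosen here only so that the sample standard deviations land near their theoretical values. The series names z, u, and v are hypothetical.

```hansl
# Sketch only: the sample size and series names are illustrative
nulldata 1000
set seed 3213789
series z = normal()          # standard normal: mean 0, standard deviation 1
series u = normal(0, 88)     # mean 0, standard deviation 88
series v = 88 * z            # rescaled standard normal, same distribution as u
summary z u v --simple
```

The summary should show standard deviations near 1 for z and near 88 for both u and v, since scaling a standard normal by 88 is another way to obtain errors with that standard deviation.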

Next, the model is estimated using least squares. After executing the script, gretl prints out some summary statistics to the screen. These appear as a result of using the --progressive loop option. The result appears in Figure 2.15. Note that the average value of the intercept is about 88.147. This is getting close to the truth. The average value of the slope is 9.55972, also reasonably close to the true value. If you were to repeat the experiments with larger numbers of Monte Carlo iterations, you would find that these averages get closer to the values of the parameters used to generate the data. This is what it means to be unbiased. Unbiasedness only has meaning within the context of repeated sampling. In your experiments, you generated many samples and averaged results over those samples to get close to finding the truth. In actual practice, you do not have this luxury; you have one sample and the proximity of your estimates to the true values of the parameters is always unknown.
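The link between the Monte Carlo averages and unbiasedness can be stated compactly. The notation $b_k^{(m)}$ for the least squares estimate of $\beta_k$ from the $m$th sample is introduced here only for illustration:

$$\bar{b}_k = \frac{1}{NMC} \sum_{m=1}^{NMC} b_k^{(m)}, \qquad E(b_k) = \beta_k, \qquad k = 1, 2$$

Because the samples are generated independently, the standard deviation of the average $\bar{b}_k$ is the standard deviation of $b_k$ divided by $\sqrt{NMC}$. This is why the averages tend to settle nearer to 80 and 10 as the number of Monte Carlo samples grows.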

In section 2.8 and in the script at the end of this chapter, you will find another example of Monte Carlo that is discussed in POE4. In this example, a sample of regressors is generated using a simple loop and the properties of least squares are examined using 1000 samples. The use of the print and store commands will be examined in section 2.8 as well.