# The t-Statistic and the Central Limit Theorem

Having laid out a simple scheme to measure variability using standard errors, it remains to interpret this measure. The simplest interpretation uses a t-statistic. Suppose the data at hand come from a distribution for which we believe the population mean, E[Yi], takes on a particular value, μ (read this Greek letter as “mu”). This value constitutes a working hypothesis. A t-statistic for the sample mean under the working hypothesis that E[Yi] = μ is constructed as

$$t(\mu) = \frac{\bar{Y} - \mu}{\widehat{SE}(\bar{Y})},$$

where $\bar{Y}$ is the sample mean and $\widehat{SE}(\bar{Y})$ is its estimated standard error.

The working hypothesis is a reference point that is often called the null hypothesis. When the null hypothesis is μ = 0, the t-statistic is the ratio of the sample mean to its estimated standard error.
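As a minimal sketch of this construction, the snippet below computes the t-statistic for a sample mean under a chosen null value. The function name `t_statistic` and the data in `y` are made up for illustration. It also assumes the estimated standard error is the sample standard deviation (with an n − 1 divisor) divided by the square root of the sample size, which is one common convention and may differ slightly from the formula used earlier in the text.

```python
import math
import statistics

def t_statistic(sample, mu):
    """t-statistic for the sample mean under the working hypothesis E[Yi] = mu."""
    n = len(sample)
    mean = statistics.fmean(sample)
    # Assumed convention: SE = sample standard deviation (n - 1 divisor) / sqrt(n).
    se = statistics.stdev(sample) / math.sqrt(n)
    return (mean - mu) / se

# Hypothetical data, invented purely for illustration.
y = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]

print(f"t(0.8) = {t_statistic(y, 0.8):.3f}")  # null hypothesis: mu = 0.8
print(f"t(0)   = {t_statistic(y, 0.0):.3f}")  # null hypothesis: mu = 0
```

With μ = 0, the function returns the ratio of the sample mean to its estimated standard error. The function fails with a division by zero if every observation is identical, because the estimated standard error is then zero.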

Many people think the science of statistical inference is boring, but in fact it’s nothing short of miraculous. One miraculous statistical fact is that if E[Yi] is indeed equal to μ, then, as long as the sample is large enough, the quantity t(μ) has a sampling distribution that is very close to a bell-shaped standard normal distribution, sketched in Figure 1.1. This property, which applies regardless of whether Yi itself is normally distributed, is called the Central Limit Theorem (CLT). The CLT allows us to make an empirically informed decision as to whether the available data support or cast doubt on the hypothesis that E[Yi] equals μ.

FIGURE 1.1 A standard normal distribution

The CLT is an astonishing and powerful result. Among other things, it implies that the (large-sample) distribution of a t-statistic is independent of the distribution of the underlying data used to calculate it. For example, suppose we measure health status with a dummy variable distinguishing healthy people from sick and that 20% of the population is sick. The distribution of this dummy variable has two spikes, one of height .8 at the value 1 and one of height .2 at the value 0. The CLT tells us that with enough data, the distribution of the t-statistic is smooth and bell-shaped even though the distribution of the underlying data has only two values.

We can see the CLT in action through a sampling experiment. In sampling experiments, we use the random number generator in our computer to draw random samples of different sizes over and over again. We did this for a dummy variable that equals one 80% of the time and for samples of size 10, 40, and 100. For each sample size, we calculated the t-statistic in half a million random samples using .8 as our value of μ.

Figures 1.2–1.4 plot the distribution of 500,000 t-statistics calculated for each of the three sample sizes in our experiment, with the standard normal distribution superimposed. With only 10 observations, the sampling distribution is spiky, though the outlines of a bell-shaped curve also emerge. As the sample size increases, the fit to a normal distribution improves. With 100 observations, the standard normal is just about bang on.
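A scaled-down version of this sampling experiment can be sketched in a few lines. This is not the code behind the figures. It assumes 10,000 replications per sample size rather than 500,000, an n − 1 divisor in the variance, and it drops samples in which every draw is identical, since the estimated standard error is then zero and the t-statistic is undefined. The names `t_stat`, `simulate`, and `REPS` are illustrative.

```python
import math
import random

def t_stat(sample, mu):
    """t-statistic for the sample mean; returns None when the SE is zero."""
    n = len(sample)
    mean = sum(sample) / n
    var = sum((y - mean) ** 2 for y in sample) / (n - 1)
    if var == 0:
        return None  # all draws identical, so the t-statistic is undefined
    return (mean - mu) / math.sqrt(var / n)

def simulate(n, reps, p=0.8, seed=0):
    """Draw `reps` samples of size n from a dummy that equals 1 with probability p."""
    rng = random.Random(seed)
    t_values = []
    for _ in range(reps):
        sample = [1 if rng.random() < p else 0 for _ in range(n)]
        t = t_stat(sample, p)  # the null value is the true mean, p
        if t is not None:
            t_values.append(t)
    return t_values

REPS = 10_000  # assumption: far fewer replications than in the text, for speed
results = {n: simulate(n, REPS) for n in (10, 40, 100)}

for n, ts in results.items():
    share = sum(abs(t) > 2 for t in ts) / len(ts)
    print(f"n = {n:3d}: {len(ts)} usable samples, share with |t| > 2 = {share:.3f}")
```

If the t-statistic were exactly standard normal, the printed share would be close to 5%. With a sample size of 10 the distribution is lumpy, so the share can sit well away from that benchmark, while a sample size of 100 should come much closer.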

The standard normal distribution has a mean of 0 and standard deviation of 1. With any standard normal variable, values beyond ±2 are highly unlikely. In fact, realizations larger than 2 in absolute value appear only about 5% of the time. Because the t-statistic is close to normally distributed, we similarly expect it to fall between about ±2 most of the time. Therefore, it’s customary to judge any t-statistic larger than about 2 (in absolute value) as too unlikely to be consistent with the null hypothesis used to construct it. When the null hypothesis is μ = 0 and the t-statistic exceeds 2 in absolute value, we say the sample mean is significantly different from zero. Otherwise, it’s not. Similar language is used for other values of μ as well.
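The "about 5%" figure can be checked directly from the standard normal distribution. The sketch below uses `NormalDist` from Python's standard library. The helper `significantly_different` is a hypothetical name that simply encodes the rule of thumb described above.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0.0, sigma=1.0)

# Probability that a standard normal variable exceeds 2 in absolute value.
tail_beyond_2 = 2 * (1 - std_normal.cdf(2.0))
print(f"P(|Z| > 2) = {tail_beyond_2:.4f}")  # 0.0455

# The cutoff that leaves exactly 5% in the two tails combined.
exact_cutoff = std_normal.inv_cdf(0.975)
print(f"exact 5% cutoff = {exact_cutoff:.3f}")  # 1.960

def significantly_different(t_stat, cutoff=2.0):
    """Rule of thumb: flag a t-statistic larger than the cutoff in absolute value."""
    return abs(t_stat) > cutoff

print(significantly_different(-2.5))  # True
print(significantly_different(1.3))   # False
```

The cutoff of 2 is a rounded version of 1.96, which is why the tail probability comes out slightly below 5%.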

FIGURE 1.2 The distribution of the t-statistic for the mean in a sample of size 10

FIGURE 1.3 The distribution of the t-statistic for the mean in a sample of size 40

FIGURE 1.4 The distribution of the t-statistic for the mean in a sample of size 100

Note: This figure shows the distribution of the sample mean of a dummy variable that equals 1 with probability .8.

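Instead of checking whether the data are consistent with one particular value of μ, we can also collect all the values of μ that the data do not reject. By the rule of thumb above, these are the values within two estimated standard errors of the sample mean. In repeated samples, the interval running from two estimated standard errors below the sample mean to two estimated standard errors above it should contain E[Yi] about 95% of the time. This interval is therefore said to be a 95% confidence interval for the population mean. By describing the set of parameter values consistent with our data, confidence intervals provide a compact summary of the information these data contain about the population from which they were sampled.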
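The 95% coverage claim can be illustrated with a small simulation. The sketch below assumes the interval in question is the sample mean plus or minus two estimated standard errors, which is what the ±2 rule for the t-statistic implies. It also assumes an n − 1 divisor in the variance, a dummy variable that equals 1 with probability .8, samples of size 100, and 5,000 replications. The names `confidence_interval`, `P`, `N`, and `REPS` are illustrative.

```python
import math
import random

def confidence_interval(sample, width=2.0):
    """Interval of `width` estimated standard errors on each side of the sample mean."""
    n = len(sample)
    mean = sum(sample) / n
    var = sum((y - mean) ** 2 for y in sample) / (n - 1)
    se = math.sqrt(var / n)
    return mean - width * se, mean + width * se

# Assumed setup: dummy variable with mean 0.8, samples of size 100.
P, N, REPS = 0.8, 100, 5_000
rng = random.Random(1)

covered = 0
for _ in range(REPS):
    sample = [1 if rng.random() < P else 0 for _ in range(N)]
    low, high = confidence_interval(sample)
    covered += low <= P <= high  # does this interval contain the true mean?

coverage = covered / REPS
print(f"share of intervals containing {P}: {coverage:.3f}")
```

The printed share should land in the neighborhood of 95%, though not exactly on it, because the normal approximation is only approximate for a sample of 100 draws from a two-valued variable.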
