WHAT IS STATISTICS?
In our everyday life we must continuously make decisions in the face of uncertainty, and in making decisions it is useful for us to know the probability of certain events. For example, before deciding to gamble, we would want to know the probability of winning. We want to know the probability of rain when we decide whether or not to take an umbrella in the morning. In determining the discount rate, the Federal Reserve Board needs to assess the probabilistic impact of a change in the rate on the unemployment rate and on inflation. It is advisable to determine these probabilities in a reasonable way; otherwise we will lose in the long run, although in the short run we may be lucky and avoid the consequences of a haphazard decision. A reasonable way to determine a probability should take into account the past record of the event in question or, whenever possible, the results of a deliberate experiment.
We are ready for our first working definition of statistics: Statistics is the science of assigning a probability to an event on the basis of experiments.
Consider estimating the probability of heads by tossing a particular coin many times. Most people will think it reasonable to use the ratio of the number of heads to the number of tosses as an estimate. In statistics we study whether it is indeed reasonable and, if so, in what sense.
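The estimate described above can be tried out on a simulated coin. The following is a minimal sketch, not a definitive procedure: the function name estimate_heads_probability, the true probability 0.3, and the fixed seed are all assumptions made for illustration.

```python
import random

def estimate_heads_probability(p, n_tosses, seed=0):
    """Toss a simulated coin with P(heads) = p and return heads / tosses."""
    rng = random.Random(seed)  # fixed seed (an assumption) so runs are repeatable
    heads = sum(1 for _ in range(n_tosses) if rng.random() < p)
    return heads / n_tosses

# Hypothetical true probability of heads, chosen only for illustration.
TRUE_P = 0.3

for n in (10, 100, 10_000, 100_000):
    print(n, estimate_heads_probability(TRUE_P, n))
```

With few tosses the ratio can be far from the true value; with many tosses it tends to settle near it. In what sense this makes the ratio a reasonable estimate is precisely the question statistics takes up.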
Tossing a coin with the probability of heads equal to p is identical to choosing a ball at random from a box that contains two types of balls, one of which corresponds to heads and the other to tails, with p being the proportion of heads balls. The statistician regards every event whose outcome is unknown to be like drawing a ball at random from a box that contains various types of balls in certain proportions.
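The box-ball analogy can be made concrete with a short sketch. It assumes that each ball is returned to the box after it is drawn, so that every draw faces the same proportions; the helper names make_box and draw_proportion and the box of 3 heads balls and 7 tails balls are hypothetical.

```python
import random

def make_box(n_heads_balls, n_tails_balls):
    """Build a box holding the given numbers of heads ("H") and tails ("T") balls."""
    return ["H"] * n_heads_balls + ["T"] * n_tails_balls

def draw_proportion(box, n_draws, seed=0):
    """Draw n_draws balls at random, with replacement, and return the share of "H"."""
    rng = random.Random(seed)
    draws = [rng.choice(box) for _ in range(n_draws)]
    return draws.count("H") / n_draws

# Hypothetical box: 3 heads balls and 7 tails balls, so p = 0.3.
box = make_box(3, 7)
print("proportion of heads balls in the box:", box.count("H") / len(box))
print("proportion of heads balls drawn:", draw_proportion(box, 10_000))
```

Here the proportion in the box plays the role of p, and the proportion among the drawn balls plays the role of the ratio of heads to tosses.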
For example, consider the question of whether or not cigarette smoking is associated with lung cancer. First, we need to paraphrase the question to make it more readily accessible to statistical analysis. One way is to ask: What is the probability that a person who smokes more than ten cigarettes a day will contract lung cancer? (This may not be the optimal way, but we choose it for the sake of illustration.) To apply the box-ball analogy to this example, we should imagine a box that contains balls corresponding to cigarette smokers; some of the balls have lung cancer marked on them and the rest do not. Drawing a ball at random corresponds to choosing a cigarette smoker at random and observing him until he dies to see whether or not he contracts lung cancer. (Such an experiment would be a costly one. If we asked a related but different question, namely, what is the probability that a man who died of lung cancer was a cigarette smoker, the experiment would be simpler.)
This example differs from the example of coin tossing in that in coin tossing we create our own sample, whereas in this example it is as though God (or a god) has tossed a coin and we simply observe the outcome. This is not an essential difference. Its only significance is that we can toss a coin as many times as we wish, whereas in the present example the statistician must work with whatever sample God has provided. In the physical sciences we are often able to conduct our own experiments, but in economics or other behavioral sciences we must often work with a limited sample, which may require specific tools of analysis.
A statistician looks at the world as full of balls that have been drawn by God and examines the balls in order to estimate the characteristics (the “proportions”) of the boxes from which the balls have been drawn. This mode of thinking is indispensable for a statistician. Thus we state a second working definition of statistics: Statistics is the science of observing data and making inferences about the characteristics of a random mechanism that has generated the data.
Coin tossing is an example of a random mechanism whose outcomes are objects called heads and tails. In order to facilitate mathematical analysis, the statistician assigns numbers to objects: for example, 1 to heads and 0 to tails. A random mechanism whose outcomes are real numbers is called a random variable. The random mechanism whose outcome is the height (measured in feet) of a Stanford student is another random variable. The coin-toss variable is called a discrete random variable, and the height variable, a continuous random variable (assuming hypothetically that height can be measured to an infinite degree of accuracy). A discrete random variable is characterized by the probabilities of its outcomes. The characteristics of a continuous random variable are captured by a density function, which is defined in such a way that the probability of any interval is equal to the area under the density function over that interval. We use the term probability distribution as a broader concept which refers to either a set of discrete probabilities or a density function. Now we can compose a third and final definition: Statistics is the science of estimating the probability distribution of a random variable on the basis of repeated observations drawn from the same random variable.
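The two kinds of probability distribution can be illustrated with a short sketch. Everything numerical in it is an assumption made for illustration only: the fair coin, the choice of a normal density for height, and the hypothetical mean of 5.6 feet and standard deviation of 0.3 feet are not taken from any data. The sketch approximates the area under the density over an interval by the trapezoidal rule and compares it with the value implied by the normal distribution function.

```python
import math

# Discrete random variable: 1 for heads, 0 for tails, with assumed probabilities.
coin = {1: 0.5, 0: 0.5}

# Hypothetical parameters of a height density, in feet (illustration only).
MEAN_FT, SD_FT = 5.6, 0.3

def normal_density(x, mean, sd):
    """Density function of a normal random variable."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def height_density(x):
    """Assumed density of height, for the sake of the example."""
    return normal_density(x, MEAN_FT, SD_FT)

def area_under_density(density, a, b, n_steps=10_000):
    """Approximate the area under density between a and b (trapezoidal rule)."""
    width = (b - a) / n_steps
    total = 0.5 * (density(a) + density(b))
    for i in range(1, n_steps):
        total += density(a + i * width)
    return total * width

def normal_cdf(x, mean, sd):
    """Probability that a normal random variable is at most x."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))

print("sum of discrete probabilities:", sum(coin.values()))
print("area under density over [5.5, 6.0]:",
      area_under_density(height_density, 5.5, 6.0))
print("same probability from the distribution function:",
      normal_cdf(6.0, MEAN_FT, SD_FT) - normal_cdf(5.5, MEAN_FT, SD_FT))
```

The discrete probabilities sum to one, and the area under the density over an interval gives the probability that the height falls in that interval; the area over the whole real line is likewise one.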