# BAYESIAN METHOD

We have stated earlier that the goal of statistical inference is not merely to obtain an estimator but to be able to say, using the estimator, where the true value of the parameter is likely to lie. This is accomplished by constructing confidence intervals, but a shortcoming of this method is that confidence can be defined only for certain restricted sets of intervals. In the Bayesian method this problem is alleviated, because in it we can treat a parameter as a random variable and therefore define a probability distribution for it. If the parameter space is continuous, as is usually the case, we can define a density function over the parameter space and thereby consider the probability that a parameter lies in any given interval. This probability distribution, called the posterior distribution, defined over the parameter space, embodies all the information an investigator can obtain from the sample as well as from the a priori information. It is derived by Bayes’ theorem, which was proved in Theorem 2.4.2. We shall subsequently show examples of the posterior distribution and how to derive it. Note that in classical statistics an estimator is defined first and then confidence intervals are constructed using the estimator, whereas in Bayesian statistics the posterior distribution is obtained directly from the sample without defining any estimator. After the posterior distribution has been obtained, we can define estimators using the posterior distribution if we wish, as will be shown below. The two methods are thus opposite in this respect. For more discussion of the Bayesian method, see DeGroot (1970) and Zellner (1971).

EXAMPLE 8.3.1 Suppose there is a sack containing a mixture of red marbles and white marbles. The fraction of the marbles that are red is known to be either p = 1/3 or p = 1/2. We are to guess the value of p after taking a sample of five drawings (replacing each marble drawn before drawing another). The Bayesian expresses the subjective a priori belief about the value of p, which he has before he draws any marble, in the form of what is called the prior distribution. Suppose he believes that p = 1/3 is three times as likely as p = 1/2, so that his prior distribution is

(8.3.1) P(p = 1/3) = 3/4,

P(p = 1/2) = 1/4.

Suppose he obtains three red marbles and two white marbles in five drawings. Then the posterior distribution of p given the sample, denoted by A, is calculated via Bayes’ theorem as follows:

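(8.3.2) $P(p = 1/3 \mid A) = \dfrac{P(A \mid p = 1/3)\,P(p = 1/3)}{P(A \mid p = 1/3)\,P(p = 1/3) + P(A \mid p = 1/2)\,P(p = 1/2)} = \dfrac{C_3^5 (1/3)^3 (2/3)^2 \cdot (3/4)}{C_3^5 (1/3)^3 (2/3)^2 \cdot (3/4) + C_3^5 (1/2)^3 (1/2)^2 \cdot (1/4)} = 0.61,$

$P(p = 1/2 \mid A) = 1 - 0.61 = 0.39.$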

This calculation shows how the prior information embodied in (8.3.1) has been modified by the sample. It indicates a higher value of p than the Bayesian’s a priori beliefs: it has yielded the posterior distribution (8.3.2), which assigns a larger probability to the event p = 1/2.
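The posterior probabilities in (8.3.2) can be checked with exact rational arithmetic. The following is only an illustrative sketch: the function name `discrete_posterior` and the dictionary representation of the prior are our own assumptions, not part of the text.

```python
from fractions import Fraction
from math import comb

def discrete_posterior(prior, reds, whites):
    """Posterior over candidate values of p after drawing with replacement.

    prior maps each candidate value of p to its prior probability
    (a hypothetical representation chosen for this sketch).
    """
    n = reds + whites
    # Joint probability of the sample and each candidate value of p.
    joint = {
        p: comb(n, reds) * p**reds * (1 - p)**whites * weight
        for p, weight in prior.items()
    }
    total = sum(joint.values())  # marginal probability of the sample
    return {p: value / total for p, value in joint.items()}

# Prior (8.3.1) and the sample A: three red and two white marbles.
prior = {Fraction(1, 3): Fraction(3, 4), Fraction(1, 2): Fraction(1, 4)}
posterior_a = discrete_posterior(prior, reds=3, whites=2)
for p, prob in posterior_a.items():
    print(f"P(p = {p} | A) = {prob} ~ {float(prob):.2f}")
```

The exact values are 128/209 and 81/209, which round to the 0.61 and 0.39 used in the text.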

Suppose we change the question slightly as follows. There are four sacks containing red marbles and white marbles. One of them contains an equal number of red and white marbles and three of them contain twice as many white marbles as red marbles. We are to pick one of the four sacks at random and draw five marbles. If three red and two white marbles are drawn, what is the probability that the sack with the equal number of red and white marbles was picked? Answering this question using Bayes’ theorem, we obtain 0.39 as before. The reader should recognize the subtle difference between this and the previous question. In the wording of the present question, the event (p = 1/2) means the event that we pick the sack that contains the equal number of red marbles and white marbles. Since this is a repeatable event, the classical statistician can talk meaningfully about the probability of this event. In contrast, in the previous question, there is only one sack; hence, the classical statistician must view the event (p = 1/2) as a statement which is either true or false and cannot assign a probability to it. The Bayesian, however, is free to assign a probability to it, because probability to him merely represents the degree of uncertainty. The prior probability in the previous question is purely subjective, whereas the corresponding probability in the present question has an objective basis.

TABLE 8.1 Loss matrix in estimation

| Decision | State of nature: p = 1/3 | State of nature: p = 1/2 |
|---|---|---|
| p = 1/3 | 0 | $\gamma_2$ |
| p = 1/2 | $\gamma_1$ | 0 |

Given the posterior distribution (8.3.2), the Bayesian may or may not wish to pick either p = 1/3 or p = 1/2 as his point estimate. If he simply wanted to know the truth of the situation, (8.3.2) would be sufficient, because it contains all he could possibly know about the situation. If he wanted to make a point estimate, he would consider the loss he would incur in making a wrong decision, as given in Table 8.1. For example, if he chooses p = 1/3 when p = 1/2 is in fact true, he incurs a loss $\gamma_2$. Thus the Bayesian regards the act of choosing a point estimate as a game played against nature. He chooses the decision for which the expected loss is the smallest, where the expectation is calculated using the posterior distribution. In the present example, therefore, he chooses p = 1/3 as his point estimate if

(8.3.3) $0.39\,\gamma_2 < 0.61\,\gamma_1.$

For simplicity, let us assume $\gamma_1 = \gamma_2$. In this case the Bayesian’s point estimate will be p = 1/3. This estimate is different from the maximum likelihood estimate obtained by the classical statistician under the same circumstances. The difference occurs because the classical statistician obtains information only from the sample, which indicates a greater likelihood that p = 1/2 than p = 1/3, whereas the Bayesian allows his conclusion to be influenced by a strong prior belief indicating a greater probability that p = 1/3. If the Bayesian’s prior distribution assigned equal probability to p = 1/3 and p = 1/2 instead of (8.3.1), then his estimate would be the same as the maximum likelihood estimate.

What if we drew five red marbles and two white marbles, instead? Denoting this event by B, the posterior distribution now becomes

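(8.3.4) $P(p = 1/3 \mid B) = \dfrac{C_5^7 (1/3)^5 (2/3)^2 \cdot (3/4)}{C_5^7 (1/3)^5 (2/3)^2 \cdot (3/4) + C_5^7 (1/2)^5 (1/2)^2 \cdot (1/4)} = 0.41,$

so that $P(p = 1/2 \mid B) = 0.59$.

In this case the Bayesian would also pick p = 1/2 as his estimate, as the classical statistician would, assuming $\gamma_1 = \gamma_2$ as before, because the information contained in the sample has now dominated his a priori information.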
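The choice of a point estimate as a game against nature can be sketched in a few lines. The code below is an illustration under stated assumptions: the helper names `posterior` and `bayes_decision` and the dictionary layout of the loss matrix are hypothetical, and the two losses are set equal to 1 to match the simplifying assumption made in the text.

```python
from fractions import Fraction
from math import comb

def posterior(prior, reds, whites):
    """Posterior probabilities of each candidate p given the drawings."""
    n = reds + whites
    joint = {p: comb(n, reds) * p**reds * (1 - p)**whites * w
             for p, w in prior.items()}
    total = sum(joint.values())
    return {p: v / total for p, v in joint.items()}

def bayes_decision(post, loss):
    """Pick the decision with the smallest expected posterior loss.

    loss[(d, s)] is the loss from deciding d when the true state is s.
    """
    expected = {d: sum(loss[(d, s)] * prob for s, prob in post.items())
                for d in post}
    return min(expected, key=expected.get), expected

THIRD, HALF = Fraction(1, 3), Fraction(1, 2)
prior = {THIRD: Fraction(3, 4), HALF: Fraction(1, 4)}
gamma1 = gamma2 = 1  # assumption: equal losses, as in the text
loss = {(THIRD, THIRD): 0, (THIRD, HALF): gamma2,
        (HALF, THIRD): gamma1, (HALF, HALF): 0}

# Sample A: three red, two white. Sample B: five red, two white.
choice_a, loss_a = bayes_decision(posterior(prior, reds=3, whites=2), loss)
choice_b, loss_b = bayes_decision(posterior(prior, reds=5, whites=2), loss)
print("sample A:", choice_a, {str(d): float(v) for d, v in loss_a.items()})
print("sample B:", choice_b, {str(d): float(v) for d, v in loss_b.items()})
```

With equal losses the rule reduces to picking the value of p with the larger posterior probability, which is 1/3 for the first sample and 1/2 for the second. Changing `gamma1` and `gamma2` shows how an asymmetric loss can reverse either choice.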

EXAMPLE 8.3.2 Let X be distributed as B(n, p). In Example 8.3.1, for the purpose of illustration, we assumed that p could take only two values. It is more realistic to assume that p can take any real number between 0 and 1. Suppose we a priori think any value of p between 0 and 1 is equally likely. This situation can be expressed by the prior density

(8.3.5) $f(p) = 1, \quad 0 < p < 1.$

Suppose the observed value of X is k and we want to derive the posterior density of p, that is, the conditional density of p given X = k. Using the result of Section 3.7, we can write Bayes’ formula in this example as

(8.3.6) $f(p \mid X = k) = \dfrac{P(X = k \mid p)\, f(p)}{\int_0^1 P(X = k \mid p)\, f(p)\, dp},$

where the denominator is the marginal probability that X = k. Therefore we have

(8.3.7) $f(p \mid X = k) = \dfrac{C_k^n\, p^k (1 - p)^{n-k}}{\int_0^1 C_k^n\, p^k (1 - p)^{n-k}\, dp} = \dfrac{(n + 1)!}{k!\,(n - k)!}\, p^k (1 - p)^{n-k}, \quad 0 < p < 1,$

where the second equality above follows from the identity

(8.3.8) $\int_0^1 z^n (1 - z)^m\, dz = \dfrac{n!\, m!}{(n + m + 1)!}$

for nonnegative integers n and m.

Using (8.3.7), the Bayesian can evaluate the probability that p falls into any given interval. We shall assume that n = 10 and k = 5 in the model of Example 8.2.1 and compare the Bayesian posterior probability with the confidence obtained there. In (8.2.9) we obtained the 95% confidence interval as (0.2366 < p < 0.7634). We have from (8.3.7)

(8.3.9) $P(0.2366 < p < 0.7634) = 2772 \int_{0.2366}^{0.7634} p^5 (1 - p)^5\, dp \approx 0.947.$

From (8.2.8) we can calculate the 80% confidence interval

(8.3.10) C(0.3124 < p < 0.6876) = 0.8.

We have from (8.3.7)

(8.3.11) P(0.3124 < p < 0.6876) = 0.8138.

These calculations show that the Bayesian inference based on the uniform prior density leads to results similar to those obtained in classical inference.
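The posterior probabilities of the two intervals can be reproduced without numerical integration. Under the uniform prior, the posterior density (8.3.7) is that of a beta distribution with integer parameters k + 1 and n - k + 1, and its distribution function can be written as a binomial tail sum. The sketch below relies on that standard identity; the function names are hypothetical.

```python
from math import comb

def beta_cdf_integer(x, a, b):
    """CDF of a Beta(a, b) distribution with positive integer a and b.

    Uses the identity P(Beta(a, b) <= x) = P(Binomial(a + b - 1, x) >= a).
    """
    m = a + b - 1
    return sum(comb(m, j) * x**j * (1 - x)**(m - j) for j in range(a, m + 1))

def posterior_interval_prob(lo, hi, n, k):
    """P(lo < p < hi | X = k) under the uniform prior on p."""
    a, b = k + 1, n - k + 1  # parameters of the beta posterior
    return beta_cdf_integer(hi, a, b) - beta_cdf_integer(lo, a, b)

# The 95% and 80% confidence intervals quoted in the text, with n = 10, k = 5.
print(round(posterior_interval_prob(0.2366, 0.7634, n=10, k=5), 4))
print(round(posterior_interval_prob(0.3124, 0.6876, n=10, k=5), 4))
```

The second value agrees with (8.3.11), and both are close to the corresponding confidence levels of 0.95 and 0.8.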

We shall now consider the general problem of choosing a point estimate of a parameter $\theta$ given its posterior density, say, $f_1(\theta)$. This problem is a generalization of the game against nature considered earlier. Let $\hat\theta$ be the estimate and assume that the loss of making a wrong estimate is given by

(8.3.12) $\text{Loss} = (\hat\theta - \theta)^2.$

Then the Bayesian chooses $\hat\theta$ so as to minimize

(8.3.13) $E(\hat\theta - \theta)^2 = \int_{-\infty}^{\infty} (\hat\theta - \theta)^2 f_1(\theta)\, d\theta.$

Note that the expectation is taken with respect to $\theta$ in the above equation, since $\theta$ is the random variable and $\hat\theta$ is the control variable. Equating the derivative of (8.3.13) with respect to $\hat\theta$ to 0, we obtain

(8.3.14) $2 \int_{-\infty}^{\infty} (\hat\theta - \theta) f_1(\theta)\, d\theta = 0.$

Note that in obtaining (8.3.14) we have assumed that it is permissible to differentiate the integrand in (8.3.13). Therefore, since $f_1(\theta)$ integrates to 1, we finally obtain

(8.3.15) $\hat\theta = \int_{-\infty}^{\infty} \theta f_1(\theta)\, d\theta.$

We call this the Bayes estimator (or, more precisely, the Bayes estimator under the squared error loss). In words, the Bayes estimator is the expected value of $\theta$ where the expectation is taken with respect to its posterior distribution.

Let us apply the result (8.3.15) to our example by putting $\theta = p$ and $f_1(p) = f(p \mid X = k)$ in (8.3.7). Using the formula (8.3.8) again, we obtain

(8.3.16) $E(p \mid X = k) = \dfrac{(n + 1)!}{k!\,(n - k)!} \int_0^1 p^{k+1} (1 - p)^{n-k}\, dp = \dfrac{k + 1}{n + 2}.$

This is exactly the estimator Z that was defined in Example 7.2.2 and found to have a smaller mean squared error than the maximum likelihood estimator k/n over a relatively wide range of the parameter value. It gives a more reasonable estimate of p than the maximum likelihood estimator when n is small. For example, if a head comes up in a single toss (k = 1, n = 1), the Bayesian estimate $\hat p = 2/3$ seems more reasonable than the maximum likelihood estimate, $\hat p = 1$. As n approaches infinity, however, both estimators converge to the true value of p in probability.
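The claim about mean squared error can be examined by direct enumeration over the binomial distribution. The sketch below is illustrative only: the sample size n = 10 and the grid of values of p are arbitrary choices, and the function names are hypothetical.

```python
from math import comb

def exact_mse(estimator, n, p):
    """Mean squared error of estimator(k, n) when k ~ B(n, p), by enumeration."""
    return sum(
        comb(n, k) * p**k * (1 - p)**(n - k) * (estimator(k, n) - p) ** 2
        for k in range(n + 1)
    )

def mle(k, n):
    """Maximum likelihood estimator k/n."""
    return k / n

def bayes(k, n):
    """Bayes estimator (k + 1)/(n + 2) under the uniform prior."""
    return (k + 1) / (n + 2)

n = 10  # hypothetical sample size
for p in (0.05, 0.2, 0.5, 0.8, 0.95):
    print(f"p = {p:.2f}  MLE: {exact_mse(mle, n, p):.5f}  "
          f"Bayes: {exact_mse(bayes, n, p):.5f}")
```

Comparing the two closed-form mean squared errors shows that the Bayes estimator is the better of the two exactly when p(1 - p) > n/(8n + 4); for n = 10 this is roughly 0.14 < p < 0.86, and the maximum likelihood estimator does better only near the ends of the unit interval.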

As this example shows, the Bayesian method is sometimes a useful tool of analysis even for the classical statistician, because it can give her an estimator which may prove to be desirable by her own standard. Nothing prevents the classical statistician from using an estimator derived following the Bayesian principle, as long as the estimator is judged to be a good one by the standard of classical statistics.

Note that if the prior density is uniform, as in (8.3.5), the posterior density is proportional to the likelihood function, as we can see from (8.3.7). In this case the difference between the maximum likelihood estimator and the Bayes estimator can be characterized by saying that the former chooses the maximum of the posterior density, whereas the latter chooses its average. Classical statistics may therefore be criticized, from the Bayesian point of view, for ignoring the shape of the posterior density except for the location of its maximum. Although the classical statistician uses an intuitive word, “likelihood,” she is not willing to make full use of its implication. The likelihood principle, an intermediate step between classical and Bayesian principles, was proposed by Birnbaum (1962).
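The contrast between the maximum and the average of the posterior density can be seen on a grid. The sketch below approximates the posterior under the uniform prior by evaluating the binomial likelihood at the midpoints of a fine partition of (0, 1); the values n = 10 and k = 3 and the grid size are arbitrary assumptions made for illustration.

```python
def posterior_summaries(n, k, grid_size=100_000):
    """Grid approximation to the posterior of p under a uniform prior.

    Returns (mode, mean); the posterior is proportional to p^k (1 - p)^(n - k).
    """
    step = 1.0 / grid_size
    grid = [(i + 0.5) * step for i in range(grid_size)]  # midpoints of (0, 1)
    unnormalized = [p**k * (1 - p)**(n - k) for p in grid]
    total = sum(unnormalized)
    mean = sum(p * w for p, w in zip(grid, unnormalized)) / total
    mode = grid[max(range(grid_size), key=unnormalized.__getitem__)]
    return mode, mean

mode, mean = posterior_summaries(n=10, k=3)
print("posterior mode:", mode, " maximum likelihood estimate k/n:", 3 / 10)
print("posterior mean:", mean, " Bayes estimate (k+1)/(n+2):", 4 / 12)
```

The mode coincides with k/n up to the grid resolution, while the mean is pulled toward 1/2, the mean of the uniform prior.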

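EXAMPLE 8.3.3 Let $\{X_i\}$ be independent and identically distributed as $N(\mu, \sigma^2)$, $i = 1, 2, \ldots, n$, where $\sigma^2$ is assumed known. Let $x_i$ be the observed value of $X_i$. Then the likelihood function of $\mu$ given the vector $x = (x_1, x_2, \ldots, x_n)$ is given by

(8.3.17) $f(x \mid \mu) = \dfrac{1}{(\sqrt{2\pi}\,\sigma)^n} \exp\left[-\dfrac{1}{2\sigma^2} \sum_{i=1}^n (x_i - \mu)^2\right].$

Suppose the prior density of $\mu$ is $N(\mu_0, \lambda^2)$, that is,

(8.3.18) $f(\mu) = \dfrac{1}{\sqrt{2\pi}\,\lambda} \exp\left[-\dfrac{1}{2\lambda^2} (\mu - \mu_0)^2\right].$

Then the posterior density of $\mu$ given x is by the Bayes rule

(8.3.19) $f(\mu \mid x) = \dfrac{f(x \mid \mu)\, f(\mu)}{\int_{-\infty}^{\infty} f(x \mid \mu)\, f(\mu)\, d\mu} = c_1 \exp\left[-\dfrac{1}{2\sigma^2} \sum_{i=1}^n (x_i - \mu)^2\right] \exp\left[-\dfrac{1}{2\lambda^2} (\mu - \mu_0)^2\right],$

where $c_1$ is chosen to satisfy $\int_{-\infty}^{\infty} f(\mu \mid x)\, d\mu = 1$. We shall write the exponent part successively as

(8.3.20)
$$\begin{aligned}
-\frac{1}{2\sigma^2} \sum_{i=1}^n (x_i - \mu)^2 - \frac{1}{2\lambda^2} (\mu - \mu_0)^2
&= -\frac{n\lambda^2 + \sigma^2}{2\sigma^2\lambda^2} \left[\mu^2 - 2\, \frac{\lambda^2 \sum_{i=1}^n x_i + \sigma^2 \mu_0}{n\lambda^2 + \sigma^2}\, \mu\right] - \frac{1}{2\sigma^2} \sum_{i=1}^n x_i^2 - \frac{1}{2\lambda^2} \mu_0^2 \\
&= -\frac{n\lambda^2 + \sigma^2}{2\sigma^2\lambda^2} \left(\mu - \frac{\lambda^2 \sum_{i=1}^n x_i + \sigma^2 \mu_0}{n\lambda^2 + \sigma^2}\right)^2 + \text{terms not involving } \mu.
\end{aligned}$$

Therefore we have

(8.3.21) $f(\mu \mid x) = c_2 \exp\left[-\dfrac{n\lambda^2 + \sigma^2}{2\sigma^2\lambda^2} \left(\mu - \dfrac{\lambda^2 \sum_{i=1}^n x_i + \sigma^2 \mu_0}{n\lambda^2 + \sigma^2}\right)^2\right],$

where $c_2 = (1/\sqrt{2\pi})(\sqrt{n\lambda^2 + \sigma^2}/\sigma\lambda)$ in order for $f(\mu \mid x)$ to be a density. Therefore we conclude that the posterior distribution of $\mu$ given x is

(8.3.22) $\mu \mid x \sim N\left(\dfrac{\lambda^2 \sum_{i=1}^n x_i + \sigma^2 \mu_0}{n\lambda^2 + \sigma^2},\ \dfrac{\sigma^2\lambda^2}{n\lambda^2 + \sigma^2}\right).$

$E(\mu \mid x)$ in the above formula is suggestive, as it is the optimally weighted average of $\bar x$ and $\mu_0$. (Cf. Example 7.2.4.) As we let $\lambda$ approach infinity in (8.3.22), the prior distribution approaches that which represents total prior ignorance. Then (8.3.22) becomes

(8.3.23) $\mu \mid x \sim N(\bar x,\ n^{-1}\sigma^2).$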

Note that (8.3.23) is what we mentioned as one possible confidence density in Example 8.2.2. The probability calculated by (8.3.23) coincides with the confidence given by (8.2.12) whenever the latter is defined.
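The behavior of the posterior distribution (8.3.22) as the prior variance grows can be traced numerically. In the sketch below the observations, the known variance, and the prior mean are hypothetical numbers chosen for illustration; `lam2` stands for the prior variance $\lambda^2$ and `sigma2` for $\sigma^2$.

```python
def normal_posterior(xs, sigma2, mu0, lam2):
    """Posterior mean and variance of mu for an N(mu, sigma2) sample
    with known sigma2 and an N(mu0, lam2) prior on mu."""
    n = len(xs)
    denom = n * lam2 + sigma2
    mean = (lam2 * sum(xs) + sigma2 * mu0) / denom
    var = sigma2 * lam2 / denom
    return mean, var

xs = [4.8, 5.6, 5.1, 4.9, 5.4]  # hypothetical observations
sigma2, mu0 = 1.0, 0.0          # hypothetical known variance and prior mean
xbar = sum(xs) / len(xs)

for lam2 in (0.1, 1.0, 100.0, 1e8):
    mean, var = normal_posterior(xs, sigma2, mu0, lam2)
    # Weight placed on the sample mean; the rest goes to the prior mean.
    weight = len(xs) * lam2 / (len(xs) * lam2 + sigma2)
    print(f"lam2 = {lam2:g}: mean = {mean:.4f} "
          f"(weight on xbar = {weight:.4f}), var = {var:.4f}")
```

As `lam2` increases, the weight on the sample mean approaches 1, the posterior mean approaches the sample mean, and the posterior variance approaches `sigma2 / n`, in line with (8.3.23).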

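Note that the right-hand side of (8.3.22) depends on the sample x only through $\bar x$. This result is a consequence of the fact that $\bar x$ is a sufficient statistic for the estimation of $\mu$. (Cf. Example 7.3.1.) Since we have $\bar X \sim N(\mu, n^{-1}\sigma^2)$, we have

(8.3.24) $f(\bar x \mid \mu) = \dfrac{\sqrt{n}}{\sqrt{2\pi}\,\sigma} \exp\left[-\dfrac{n}{2\sigma^2} (\bar x - \mu)^2\right].$

Using (8.3.24), we could have obtained a result identical to (8.3.21) by calculating

(8.3.25) $f(\mu \mid \bar x) = \dfrac{f(\bar x \mid \mu)\, f(\mu)}{\int_{-\infty}^{\infty} f(\bar x \mid \mu)\, f(\mu)\, d\mu}.$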

EXAMPLE 8.3.4 Let X be uniformly distributed over the interval $(\theta, 10.5)$. Assuming that the prior density of $\theta$ is given by

$f(\theta) = 5$ for $9.5 < \theta < 9.7$,

$= 0$ otherwise,

calculate the posterior density of $\theta$ given that an observation of X is 10. We have

$f(\theta \mid x = 10) = \dfrac{\dfrac{5}{10.5 - \theta}}{\displaystyle\int_{9.5}^{9.7} \dfrac{5}{10.5 - \theta}\, d\theta} = -\dfrac{1}{(\log 0.8)(10.5 - \theta)}$ for $9.5 < \theta < 9.7$,

$= 0$ otherwise.
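The posterior density of this example can be verified numerically by applying Bayes' formula with a simple midpoint rule and comparing the result with the closed form $1/[(\log 1.25)(10.5 - \theta)]$, which is the same expression written with a positive logarithm. The helper names and the number of integration steps below are assumptions made for this sketch.

```python
from math import log

def likelihood(x, theta):
    """Density of X ~ U(theta, 10.5) evaluated at x."""
    return 1.0 / (10.5 - theta) if theta < x < 10.5 else 0.0

def prior(theta):
    """Prior density: uniform on (9.5, 9.7)."""
    return 5.0 if 9.5 < theta < 9.7 else 0.0

def integrate(f, lo, hi, steps=20_000):
    """Midpoint rule on [lo, hi]."""
    h = (hi - lo) / steps
    return h * sum(f(lo + (i + 0.5) * h) for i in range(steps))

x = 10.0  # the observed value
marginal = integrate(lambda t: likelihood(x, t) * prior(t), 9.5, 9.7)

def posterior(theta):
    """Posterior density of theta given X = 10, by Bayes' formula."""
    return likelihood(x, theta) * prior(theta) / marginal

def closed_form(theta):
    """Closed-form posterior density derived in the example."""
    return 1.0 / (log(1.25) * (10.5 - theta)) if 9.5 < theta < 9.7 else 0.0

print(posterior(9.6), closed_form(9.6))
print(integrate(posterior, 9.5, 9.7))  # should be close to 1
```

Note that the posterior is no longer uniform: values of $\theta$ closer to the observation receive a higher density, because they imply a shorter interval and hence a larger likelihood.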

One weakness of Bayesian statistics is the possibility that a prior distribution, which is the product of a researcher’s subjective judgment, might unduly influence statistical inference. The classical school, in fact, was developed by R. A. Fisher and his followers in an effort to establish statistics as an objective science. This weakness could be eliminated if statisticians could agree upon a reasonable prior distribution which represents total prior ignorance (such as the one considered in Examples 8.3.2 and 8.3.3) in every case. This, however, is not always possible. We might think that a uniform density over the whole parameter domain is the right prior that represents total ignorance, but this is not necessarily so. For example, if parameters $\theta$ and $\mu$ are related by $\theta = \mu^{-1}$, a uniform prior over $\mu$,

$f(\mu) = 1$, for $1 < \mu < 2$,

$= 0$ otherwise,

implies a nonuniform prior over $\theta$:

$f(\theta) = \theta^{-2}$, for $1/2 < \theta < 1$,

$= 0$ otherwise.
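The implied prior over $\theta$ follows from the distribution function of $\theta = \mu^{-1}$, and the result can be checked by differentiating that function numerically. The function names and the step size in the sketch below are assumptions made for illustration.

```python
def cdf_theta(t):
    """P(theta <= t) when theta = 1/mu and mu is uniform on (1, 2)."""
    if t <= 0.5:
        return 0.0
    if t >= 1.0:
        return 1.0
    # theta <= t is the event mu >= 1/t, and P(mu >= m) = 2 - m for 1 < m < 2.
    return 2.0 - 1.0 / t

def implied_density(t, h=1e-6):
    """Central-difference derivative of the distribution function of theta."""
    return (cdf_theta(t + h) - cdf_theta(t - h)) / (2 * h)

for t in (0.55, 0.7, 0.9):
    print(t, implied_density(t), t**-2)
```

The two columns agree, so a prior that is flat in $\mu$ puts four times as much density near $\theta = 1/2$ as near $\theta = 1$.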

Table 8.2 summarizes the advantages and disadvantages of the Bayesian school vis-à-vis classical statistics. An asterisk marks an advantage of the school in question.

TABLE 8.2 Comparison of Bayesian and classical schools

| Bayesian school | Classical school |
|---|---|
| \*Can make exact inference using posterior distribution. | Use confidence intervals as substitute. |
| \*Bayes estimator is good, even by classical standard. | If sample size is large, maximum likelihood estimator is just as good. |
| Bayes inference may be robust against misspecification of distribution. | \*Can use good estimator such as sample mean without assuming any distribution. |
| Use prior distribution that represents total ignorance. | \*Objective inference. |
| \*No need to obtain distribution of estimator. | \*No need to calculate complicated integrals. |

EXERCISES

1. (Section 8.2)

Suppose you have a coin for which the probability of a head is the unknown parameter p. How many times should you toss the coin in order that the 95% confidence interval for p is less than or equal to 0.5 in length?

2. (Section 8.2)

The heights (in feet) of five randomly chosen male Stanford students were 6.3, 6.1, 5.7, 5.8, and 6.2. Find a 90% confidence interval for the mean height, assuming the height is normally distributed.

3. (Section 8.2)

Suppose $X_i \sim N(\theta, \theta^2)$, $i = 1, 2, \ldots, 100$. Obtain an 80% confidence interval for $\theta$ assuming $\bar x = 10$.

4. (Section 8.2)

A particular drug was given to a group of 100 patients (Group 1), and no drug was given to another group of 100 patients (Group 2). Assuming that 60 patients of Group 1 and 50 patients of Group 2 recovered, construct an 80% confidence interval on the difference of the mean rates of recovery of the two groups ($\mu_1 - \mu_2$).

5. (Section 8.2)

If 50 students in an econometrics class took on the average 35 minutes to solve an exam problem with a variance of 10 minutes, construct a 90% confidence interval for the true standard deviation of the time it takes students to solve the given problem. Answer using both exact and asymptotic methods.

6. (Section 8.3)

Let $X_1$ and $X_2$ be independent and let each be B(1, p). Let the prior probabilities of p be given by P(p = 1/2) = 0.5 and P(p = %) = 0.5. Calculate the posterior probabilities $P_1(p) = P(p \mid X_1 = 1)$ and $P_2(p) = P(p \mid X_1 = 1, X_2 = 1)$. Also calculate $P(p \mid X_2 = 1)$ using $P_1(p)$ as the prior probabilities. Compare it with $P_2(p)$.

7. (Section 8.3)

A Bayesian is to estimate the probability of a head, p, of a particular coin. If her prior density is $f(p) = 6p(1 - p)$, $0 \le p \le 1$, and two heads appear in two tosses, what is her estimate of p?

8. (Section 8.3)

Suppose the density of X is given by

$f(x \mid \theta) = 1/\theta$ for $0 < x < \theta$,

$= 0$ otherwise,

and the prior density of $\theta$ is given by

$f(\theta) = 1/\theta^2$ for $\theta > 1$,

$= 0$ otherwise.

Obtain the Bayes estimate of $\theta$, assuming that the observed value of X is 2.

9. (Section 8.3)

Suppose that a head comes up in one toss of a coin. If your prior probability distribution of the probability of a head, p, is given by P(p = 1/2) = 1/3, P(p = %) = 1/3, and P(p = %) = 1/3 and your loss function is given by $|p - \hat p|$, what is your estimate $\hat p$? What if your prior density of p is given by f(p) = 1 for 0 < p < 1?

10. (Section 8.3)

Let X ~ B(1, p) and let the prior density of p be given by f(p) = 1 for 0 < p < 1. Suppose the loss function L(·) is given by

L(e) = −2e if −1 < e < 0,

= e if 0 < e < 1,

where e = p — p. Obtain the Bayes estimate of p, given X = 1.

11. (Section 8.3)

In the preceding exercise, change the loss function to L(e) = |e|. Obtain the Bayes estimate of p, given X = 1.

12. (Section 8.3)

Suppose the density of X is given by

$f(x \mid \theta) = \theta + 2(1 - \theta)x$ for $0 < x < 1$,

$= 0$ otherwise,

where we assume $0 < \theta < 2$. Suppose we want to estimate $\theta$ on the basis of one observation on X.

(a) Find the maximum likelihood estimator of $\theta$ and obtain its exact mean squared error.

(b) Find the Bayes estimator of $\theta$ using the uniform prior density of $\theta$ given by

$f(\theta) = 0.5$ for $0 < \theta < 2$,

= 0 otherwise.

Obtain its exact mean squared error.

13. (Section 8.3)

Let $\{X_i\}$ be i.i.d. with the density

$f(x \mid \theta) = 1/\theta$, $0 < x < \theta$, $1 < \theta$,

and define $\{Y_i\}$ by

$Y_i = 1$ if $X_i > 1$,

$= 0$ if $X_i \le 1$.

Suppose we observe $\{Y_i\}$, i = 1 and 2, and find $Y_1 = 1$ and $Y_2 = 0$. We do not observe $\{X_i\}$.

(a) Find the maximum likelihood estimator of $\theta$.

(b) Assuming the prior density of $\theta$ is $f(\theta) = \theta^{-2}$ for $\theta \ge 1$, find the Bayes estimate of $\theta$.

14. (Section 8.3)

The density of X, given an unknown parameter $\lambda \in [0, 1]$, is given by

$f(x \mid \lambda) = \lambda f_1(x) + (1 - \lambda) f_2(x),$

where $f_1(\cdot)$ and $f_2(\cdot)$ are known density functions. Derive the maximum likelihood estimator of $\lambda$ based on one observation on X. Assuming the prior density of $\lambda$ is uniform over the interval [0, 1], derive the Bayes estimator of $\lambda$ based on one observation on X.

15. (Section 8.3)

Let the density function of X be given by

$f(x) = 2x/\theta$ for $0 \le x \le \theta$,

$= 2(x - 1)/(\theta - 1)$ for $\theta < x \le 1$,

where $0 < \theta < 1$. Assuming the prior density $f(\theta) = 6\theta(1 - \theta)$, derive the Bayes estimator of $\theta$ based on a single observation of X.

16. (Section 8.3)

We have a coin for which the probability of a head is p. In the experiment of tossing the coin until a head appears, we observe that a head appears in the kth toss. Assuming the uniform prior density, find the Bayes estimator of p.

9.1 INTRODUCTION

There are two kinds of hypotheses: one concerns the form of a probability distribution, and the other concerns the parameters of a probability distribution when its form is known. The hypothesis that a sample follows the normal distribution rather than some other distribution is an example of the first, and the hypothesis that the mean of a normally distributed sample is equal to a certain value is an example of the second. Throughout this chapter we shall deal with tests of hypotheses of the second kind only.

The purpose of estimation is to consider the whole parameter space and guess what values of the parameter are more likely than others. In hypothesis testing we pay special attention to a particular set of values of the parameter space and decide if that set is likely or not, compared with some other set.

In hypothesis tests we choose between two competing hypotheses: the null hypothesis, denoted $H_0$, and the alternative hypothesis, denoted $H_1$. We make the decision on the basis of the sample $(X_1, X_2, \ldots, X_n)$, denoted simply as X. Thus X is an n-variate random variable taking values in $E_n$, n-dimensional Euclidean space. Then a test of the hypothesis $H_0$ mathematically means determining a subset R of $E_n$ such that we reject $H_0$ (and therefore accept $H_1$) if $X \in R$, and we accept $H_0$ (and therefore reject $H_1$) if $X \in \bar R$, the complement of R in $E_n$. The set R is called the region of rejection or the critical region of the test. Thus the question of hypothesis testing mathematically concerns how we determine the critical region.

As we shall show in Section 9.3, a test of a hypothesis is often based on the value of a real function of the sample (a statistic). If T(X) is such a statistic, the critical region is a subset R of the real line such that we reject $H_0$ if $T(X) \in R$. In Chapter 7 we called a statistic used to estimate a parameter an estimator. A statistic which is used to test a hypothesis is called a test statistic. In the general discussion that follows, we shall treat a critical region as a subset of $E_n$, because the event $T(X) \in R$ can always be regarded as defining a subset of the space of X.
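The equivalence between a critical region on the real line and a critical region in the space of X can be made concrete with a small discrete case. The sketch below is purely illustrative: it assumes n = 3 coin tosses coded as 0 or 1, takes the number of heads as the test statistic, and uses a hypothetical cutoff of 2; none of these choices comes from the text.

```python
from itertools import product

def statistic(sample):
    """Test statistic T(X): the number of heads in the sample."""
    return sum(sample)

def in_real_line_region(t, cutoff):
    """Critical region on the real line: reject the null hypothesis when T(X) >= cutoff."""
    return t >= cutoff

n, cutoff = 3, 2  # hypothetical sample size and cutoff
sample_space = list(product((0, 1), repeat=n))  # all possible samples

# The same test written as a subset R of the space of X.
critical_region = [x for x in sample_space
                   if in_real_line_region(statistic(x), cutoff)]
print(critical_region)
```

Each point of `critical_region` is a complete sample, so the rule "reject if T(X) is at least 2" and the rule "reject if X falls in this list" describe the same test.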

A hypothesis may be either simple or composite.

DEFINITION 9.1.1 A hypothesis is called simple if it specifies the values of all the parameters of a probability distribution. Otherwise, it is called composite.

For example, the assumption that p = 1/2 in the binomial distribution is a simple hypothesis and the assumption that p > 1/2 is a composite hypothesis. Specifying the mean of a normal distribution is a composite hypothesis if its variance is unspecified.

In Sections 9.2 and 9.3 we shall assume that both the null and the alternative hypotheses are simple. Sections 9.4 and 9.5 will deal with the case where one or both of the two competing hypotheses may be composite. In practice, the most interesting case is testing a composite hypothesis against a composite hypothesis. Most textbooks, however, devote the greatest amount of space to the study of the simple against simple case. There are two reasons: one is that we can learn about a more complicated realistic case by studying a simpler case; the other is that the classical theory of hypothesis testing is woefully inadequate for the realistic case.