# CONDITIONAL MEAN AND VARIANCE

In Chapters 2 and 3 we noted that conditional probability and conditional density satisfy all the requirements for probability and density. Therefore, we can define the conditional mean in a way similar to that of Definitions 4.1.1 and 4.1.2, using the conditional probability defined in Section 3.2.2 and the various types of conditional densities defined in Sections 3.3.2 and 3.4.3. Here we shall give two definitions: one for discrete bivariate random variables and the other concerning the conditional density given in Theorem 3.4.3.

DEFINITION 4.4.1 Let (X, Y) be a bivariate discrete random variable taking values (x_i, y_j), i, j = 1, 2, … . Let P(y_j | X) be the conditional probability of Y = y_j given X. Let φ(·, ·) be an arbitrary function. Then the conditional mean of φ(X, Y) given X, denoted by E[φ(X, Y) | X] or by E_{Y|X} φ(X, Y), is defined by

(4.4.1) $E_{Y|X}\,\varphi(X, Y) = \sum_{j=1}^{\infty} \varphi(X, y_j)\,P(y_j \mid X).$

DEFINITION 4.4.2 Let (X, Y) be a bivariate continuous random variable with conditional density f(y | x). Let φ(·, ·) be an arbitrary function. Then the conditional mean of φ(X, Y) given X is defined by

(4.4.2) $E_{Y|X}\,\varphi(X, Y) = \int_{-\infty}^{\infty} \varphi(X, y)\,f(y \mid X)\,dy.$

The conditional mean E_{Y|X} φ(X, Y) is a function only of X. It may be evaluated at a particular value that X assumes, or it may be regarded as a random variable, being a function of the random variable X. If we treat it as a random variable, we can take a further expectation of it using the probability distribution of X. The following theorem shows what happens.

THEOREM 4.4.1 (law of iterated means)

(4.4.3) $E\varphi(X, Y) = E_X E_{Y|X}\,\varphi(X, Y).$

(The symbol E_X indicates that the expectation is taken treating X as a random variable.)

Proof. We shall prove it only for the case of continuous random variables; the proof is easier for the case of discrete random variables.

$E\varphi(X, Y) = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} \varphi(x, y)\,f(x, y)\,dx\,dy = \int_{-\infty}^{\infty}\left[\int_{-\infty}^{\infty} \varphi(x, y)\,f(y \mid x)\,dy\right]f(x)\,dx = E_X E_{Y|X}\,\varphi(X, Y).$ □
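The law of iterated means is easy to check numerically. The sketch below is our own illustration, not part of the text: it takes X uniform on (0, 1), Y given X = x uniform on (0, x), and φ(x, y) = xy, so that both sides of the identity equal E(X²)/2 = 1/6.

```python
import random

# Monte Carlo check of the law of iterated means (illustrative sketch):
# X ~ U(0, 1), Y | X = x ~ U(0, x), phi(x, y) = x * y.
random.seed(0)
n = 200_000

# Left-hand side: E[phi(X, Y)] from draws of the joint distribution.
lhs = 0.0
for _ in range(n):
    x = random.random()
    y = x * random.random()      # Y | X = x is uniform on (0, x)
    lhs += x * y
lhs /= n

# Right-hand side: E_X E_{Y|X} phi(X, Y); here E_{Y|X}(XY) = X**2 / 2.
rhs = sum(random.random() ** 2 / 2 for _ in range(n)) / n

print(lhs, rhs)   # both close to 1/6
```

The two averages agree to Monte Carlo accuracy, as the theorem requires.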

The following theorem is sometimes useful in computing variance. It says that the variance is equal to the mean of the conditional variance plus the variance of the conditional mean.

THEOREM 4.4.2

(4.4.4) $V\varphi(X, Y) = E_X V_{Y|X}\,\varphi(X, Y) + V_X E_{Y|X}\,\varphi(X, Y).$

Proof. Since

(4.4.5) $V_{Y|X}\,\varphi = E_{Y|X}\,\varphi^2 - (E_{Y|X}\,\varphi)^2,$

we have

(4.4.6) $E_X V_{Y|X}\,\varphi = E\varphi^2 - E_X(E_{Y|X}\,\varphi)^2.$

But we have

(4.4.7) $V_X E_{Y|X}\,\varphi = E_X(E_{Y|X}\,\varphi)^2 - (E\varphi)^2,$

where the last term follows because $E_X E_{Y|X}\,\varphi = E\varphi$ by Theorem 4.4.1. Therefore, adding both sides of (4.4.6) and (4.4.7),

(4.4.8) $E_X V_{Y|X}\,\varphi + V_X E_{Y|X}\,\varphi = E\varphi^2 - (E\varphi)^2 = V\varphi.$ □
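The decomposition (4.4.4) can be verified exactly on a small discrete distribution. The joint table below is our own illustrative choice, not one from the text:

```python
# Exact check of (4.4.4): V(Y) = E_X V(Y|X) + V_X E(Y|X), on a
# hypothetical 2x2 joint distribution (illustration only).
joint = {  # (x, y): probability
    (0, 0): 0.1, (0, 1): 0.3,
    (1, 0): 0.2, (1, 1): 0.4,
}

# Marginal distribution of X.
px = {}
for (x, y), p in joint.items():
    px[x] = px.get(x, 0.0) + p

# Unconditional mean and variance of Y.
ey = sum(p * y for (x, y), p in joint.items())
vy = sum(p * y * y for (x, y), p in joint.items()) - ey ** 2

# Conditional mean and variance of Y given each value of X.
cond_mean, cond_var = {}, {}
for x in px:
    m1 = sum(p * y for (xx, y), p in joint.items() if xx == x) / px[x]
    m2 = sum(p * y * y for (xx, y), p in joint.items() if xx == x) / px[x]
    cond_mean[x], cond_var[x] = m1, m2 - m1 ** 2

ex_vyx = sum(px[x] * cond_var[x] for x in px)                    # E_X V_{Y|X} Y
e_cm = sum(px[x] * cond_mean[x] for x in px)
vx_eyx = sum(px[x] * cond_mean[x] ** 2 for x in px) - e_cm ** 2  # V_X E_{Y|X} Y

print(vy, ex_vyx + vx_eyx)   # the two numbers agree
```

For this table VY = 0.21, and the two terms on the right-hand side sum to the same value.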

The following examples show the advantage of using the right-hand sides of (4.4.3) and (4.4.4) in computing the unconditional mean and variance.

EXAMPLE 4.4.1 Suppose f(x) = 1 for 0 < x < 1 and = 0 otherwise, and f(y | x) = $x^{-1}$ for 0 < y < x and = 0 otherwise. Calculate EY.

This problem may be solved in two ways. First, use Theorem 4.4.1: since $E(Y \mid X) = X/2$, we have $EY = E_X(X/2) = 1/4$. Second, use Definition 4.1.2: the marginal density of Y is $f(y) = \int_y^1 x^{-1}\,dx = -\log y$ for 0 < y < 1, so $EY = \int_0^1 y(-\log y)\,dy = 1/4$.
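Both routes through Example 4.4.1 can be checked numerically. The sketch below is our own illustration using simple midpoint quadrature:

```python
import math

# Midpoint-rule check of Example 4.4.1 (illustration only).
n = 100_000
h = 1.0 / n

# First way (Theorem 4.4.1): E(Y | X = x) = x/2, so EY = integral of
# x/2 over (0, 1).
ey_iterated = sum((i + 0.5) * h / 2 for i in range(n)) * h

# Second way (Definition 4.1.2): the marginal density of Y is
# f(y) = -log(y) on (0, 1), so EY = integral of -y * log(y).
ey_marginal = 0.0
for i in range(n):
    y = (i + 0.5) * h
    ey_marginal += -y * math.log(y)
ey_marginal *= h

print(ey_iterated, ey_marginal)   # both close to 0.25
```

Both quadratures return values close to 1/4, agreeing with the analytic answer.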

EXAMPLE 4.4.2 The marginal density of X is given by f(x) = 1, 0 < x < 1. The conditional probability of Y given X is given by

P(Y = 1 | X = x) = x,
P(Y = 0 | X = x) = 1 − x.

Find EY and VY.

$EY = E_X E_{Y|X} Y = E_X X = \tfrac{1}{2}.$

$VY = V_X E_{Y|X} Y + E_X V_{Y|X} Y = V_X X + E_X(X - X^2) = E_X X - (E_X X)^2 = \tfrac{1}{4}.$

The result obtained in Example 4.4.2 can be alternatively obtained using the result of Section 3.7, as follows. We have

$P(Y = 1) = P(Y = 1, 0 < X < 1) = P(0 < X < 1 \mid Y = 1)\,P(Y = 1)$
$= \int_0^1 f(x \mid Y = 1)\,P(Y = 1)\,dx$
$= \int_0^1 P(Y = 1 \mid x)\,f(x)\,dx \quad \text{by (3.7.2)}$
$= \int_0^1 x\,dx = \tfrac{1}{2}.$

Therefore the mean and variance of Y can be shown to be the same as obtained above, using the definition of the mean and the variance of a discrete random variable which takes on either 1 or 0.
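The numbers in Example 4.4.2 can also be reproduced by simulation. The sketch below is our own illustration:

```python
import random

# Simulation of Example 4.4.2 (illustration only): X ~ U(0, 1) and
# Y | X = x ~ Bernoulli(x); the theorems give EY = 1/2 and VY = 1/4.
random.seed(2)
n = 200_000

total = 0
for _ in range(n):
    x = random.random()
    if random.random() < x:      # Y = 1 with conditional probability x
        total += 1

ey = total / n
vy = ey * (1 - ey)               # Y takes only 0 and 1, so VY = EY(1 - EY)
print(ey, vy)                    # close to 0.5 and 0.25
```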

In the previous section we solved the problem of optimally predicting Y by a linear function of X. Here we shall consider the problem of optimally predicting Y by a general function of X. The problem can be mathematically formulated as

(4.4.10) Minimize $E[Y - \varphi(X)]^2$ with respect to $\varphi$.

Despite the apparent complexity of the problem, there is a simple solution. We have

(4.4.11) $E[Y - \varphi(X)]^2 = E\{[Y - E(Y \mid X)] + [E(Y \mid X) - \varphi(X)]\}^2$
$= E[Y - E(Y \mid X)]^2 + E[E(Y \mid X) - \varphi(X)]^2,$

where the cross product has dropped out because

$E_{Y|X}\{[Y - E(Y \mid X)][E(Y \mid X) - \varphi(X)]\} = 0.$

Therefore (4.4.11) is clearly minimized by choosing φ(X) = E(Y | X). Thus we have proved

THEOREM 4.4.3 The best predictor (or, more exactly, the minimum mean-squared-error predictor) of Y based on X is given by E(Y | X).

In the next example we shall compare the best predictor with the best linear predictor.

EXAMPLE 4.4.3 A fair die is rolled. Let Y be the face number showing. Define X by the rule

X = Y if Y is even,

= 0 if Y is odd.

Find the best predictor and the best linear predictor of Y based on X. The following table gives E(Y | X).

| X | 0 | 2 | 4 | 6 |
|---|---|---|---|---|
| E(Y \| X) | 3 | 2 | 4 | 6 |

To compute the best linear predictor, we must first compute the moments that appear on the right-hand sides of (4.3.8) and (4.3.9): EY = 7/2, EX = 2, EX2 = EXY = 28/3, EY2 = 91/6, VX = 16/3, VY = 35/12, Cov = 7/3. Therefore

α* = 21/8, β* = 7/16.

Put Ŷ = (21/8) + (7/16)X. Thus we have

| X | 0 | 2 | 4 | 6 |
|---|---|---|---|---|
| Ŷ | 2.625 | 3.5 | 4.375 | 5.25 |

where the values taken by Ŷ and X are indicated by empty circles in Figure 4.2.

We shall compute the mean squared error of prediction for each predictor:

$E(Y - \hat{Y})^2 = VY - [\text{Cov}(X, Y)]^2/VX = 35/12 - 49/48 = 91/48 \approx 1.90.$

$E[Y - E(Y \mid X)]^2 = (1/6)\cdot 4 + (1/6)\cdot 4 = 4/3 \approx 1.33.$
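The whole die example can be reproduced exactly with rational arithmetic. The sketch below is our own illustration:

```python
from fractions import Fraction as F

# Exact computation for Example 4.4.3: Y is a fair die, X = Y if Y is
# even, X = 0 if Y is odd; each of the six outcomes has probability 1/6.
pts = [(y if y % 2 == 0 else 0, y) for y in range(1, 7)]   # (x, y) pairs
p = F(1, 6)

# Best predictor: E(Y | X = x) is the average of the y's sharing that x.
cond = {}
for x in sorted({xx for xx, _ in pts}):
    ys = [y for xx, y in pts if xx == x]
    cond[x] = F(sum(ys), len(ys))
# cond gives E(Y | X) = 3, 2, 4, 6 at X = 0, 2, 4, 6.

# Best linear predictor: alpha* + beta* X from the moments in the text.
ex = sum(p * x for x, _ in pts)
ey = sum(p * y for _, y in pts)
vx = sum(p * x * x for x, _ in pts) - ex ** 2
cov = sum(p * x * y for x, y in pts) - ex * ey
beta = cov / vx                  # 7/16
alpha = ey - beta * ex           # 21/8

mspe_best = sum(p * (y - cond[x]) ** 2 for x, y in pts)             # 4/3
mspe_linear = sum(p * (y - alpha - beta * x) ** 2 for x, y in pts)  # 91/48
print(alpha, beta, mspe_best, mspe_linear)
```

The exact fractions confirm both tables and both mean squared errors in the text.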

FIGURE 4.2 Comparison of best predictor and best linear predictor

EXAMPLE 4.4.4 Let the joint probability distribution of X and Y be given as follows: P(X = 1, Y = 1) = P₁₁ = 0.3, P(X = 1, Y = 0) = P₁₀ = 0.2, P(X = 0, Y = 1) = P₀₁ = 0.2, P(X = 0, Y = 0) = P₀₀ = 0.3. Obtain the best predictor and the best linear predictor of Y as functions of X and calculate the mean squared prediction error of each predictor.

We have

$E(Y \mid X = 1) = P_{11}/(P_{11} + P_{10})$

and

$E(Y \mid X = 0) = P_{01}/(P_{01} + P_{00}).$

Both equations can be combined into one as

(4.4.12) $E(Y \mid X) = [P_{11}/(P_{11} + P_{10})]X + [P_{01}/(P_{01} + P_{00})](1 - X),$

which is a linear function of X. This result shows that the best predictor is identical with the best linear predictor, but as an illustration we shall obtain the two predictors separately.

Best predictor. From (4.4.12) we readily obtain E(Y | X) = 0.4 + 0.2X. Its mean squared prediction error (MSPE) can be calculated as follows:

(4.4.13) $\text{MSPE} = E[Y - E(Y \mid X)]^2$
$= E_X E_{Y|X}[Y^2 + E(Y \mid X)^2 - 2YE(Y \mid X)]$
$= EY^2 - E_X[E(Y \mid X)^2]$
$= 0.5 - 0.26 = 0.24.$

Best linear predictor. The moments of X and Y can be calculated as follows: EX = EY = 0.5, VX = VY = 0.25, and Cov(X, Y) = 0.05. Inserting these values into equations (4.3.8) and (4.3.9) yields β* = 0.2 and α* = 0.4. From (4.3.11) we obtain

(4.4.14) $\text{MSPE} = VY - [\text{Cov}(X, Y)]^2/VX = 0.24.$
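Again the numbers can be checked exactly. The sketch below is our own illustration:

```python
from fractions import Fraction as F

# Exact check of Example 4.4.4: joint table P11 = 0.3, P10 = 0.2,
# P01 = 0.2, P00 = 0.3 (first index is x, second is y).
P = {(1, 1): F(3, 10), (1, 0): F(1, 5), (0, 1): F(1, 5), (0, 0): F(3, 10)}

# Best predictor E(Y | X): 3/5 at X = 1 and 2/5 at X = 0, i.e. 0.4 + 0.2 X.
e_y_given = {
    1: P[(1, 1)] / (P[(1, 1)] + P[(1, 0)]),
    0: P[(0, 1)] / (P[(0, 1)] + P[(0, 0)]),
}

mspe = sum(p * (y - e_y_given[x]) ** 2 for (x, y), p in P.items())
print(e_y_given[1], e_y_given[0], mspe)   # 3/5, 2/5, and 6/25 = 0.24
```

Since the best predictor is linear here, 6/25 = 0.24 matches the MSPE of the best linear predictor, as the text notes.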

EXERCISES

1. (Section 4.1)

A station is served by two independent bus lines going to the same destination. In the first line buses come at a regular interval of five minutes, and in the second line ten minutes. You get on the first bus that comes. What is the expected waiting time?

2. (Section 4.2)

Let the probability distribution of (X, Y) be given by

%

%

Find V(X | Y).

3. (Section 4.2)

Let X be the number of tosses required until a head comes up. Compute EX and VX assuming the probability of heads is equal to p.

4. (Section 4.2)

Let the density of X be given by

f(x) = x for 0 < x < 1,

= 2 − x for 1 < x < 2,

= 0 otherwise.

Calculate VX.

5. (Section 4.3)

Let the probability distribution of (X, Y) be given by

 7 x 1 0 1 % % 0 Ув %

(a) Show that X + Y and X − (20/19)Y are uncorrelated.

(b) Are X + Y and X − (20/19)Y independent?

6. (Section 4.3)

Let (X, Y) have joint density f(x, y) = x + y for 0 < x < 1 and 0 < y < 1. Compute Cov(X, Y).

7. (Section 4.3)

Let (X, Y) have joint density f(x, y) = 2 for 0 < x < 1 and 0 < y < x. Compute VX and Cov(X, Y).

8. (Section 4.3)

Let EX = EY = 0, VX = VY = 2, and Cov(X, Y) = 1. Determine α and β so that V(αX + βY) = 1 and Cov(αX + βY, X) = 0.

9. (Section 4.3)

Suppose X and Y are independent with EX = 1, VX = 1, EY = 2, and VY = 1. Define Z = X + Y and W = XY. Calculate Cov(Z, W).

10. (Section 4.4)

Let X, Y, and Z be random variables, each of which takes only two values: 1 and 0. Given P(X = 1) = 0.5, P(Y = 1 | X = 1) = 0.6, P(Y = 1 | X = 0) = 0.4, P(Z = 1 | Y = 1) = 0.7, P(Z = 1 | Y = 0) = 0.3, find EZ and E(Z | X = 1). Assume that the probability distribution of Y depends only on X and that of Z only on Y.

11. (Section 4.4)

Let X = 1 with probability p and 0 with probability 1 — p. Let the conditional density of Y given X = 1 be uniform over 0 < у < 1 and given X = 0 be uniform over 0 < у < 2. Obtain Cov(X, Y).

12. (Section 4.4)

Let f(x, y) = 1 for 0 < x < 1 and 0 < y < 1. Obtain E(X | X < Y).

13. (Section 4.4)

With the same density as in Exercise 6, obtain E(X | Y = X + 0.5).

14. (Section 4.4-Prediction)

Let the joint probability distribution of X and Y be given by

| X \ Y | 2 | 1 | 0 |
|---|---|---|---|
| 2 | 0.2 | 0.1 | 0 |
| 1 | 0.1 | 0.2 | 0.1 |
| 0 | 0 | 0.1 | 0.2 |

Obtain the best predictor and the best linear predictor of Y as functions of X and calculate the mean squared prediction error for each predictor.

15. (Section 4.4-Prediction)

Suppose EX = EY = 0, VX = VY = 1, and Cov(X, Y) = 0.5. If we define Z = X + Y, find the best linear predictor of Y based on Z.

16. (Section 4.4-Prediction)

Give an example in which X can be used to predict Y perfectly, but Y is of no value in predicting X. Supply your own definition of the phrase “no value in predicting X.”

17. (Section 4.4-Prediction)

Let X be uniformly distributed over [0, 1] and for some c in (0, 1) define

Y = 1 if X > c,

= 0 if X < c.

Find the best predictor of X given Y, denoted X̂, and compare the variances V(X̂) and V(U), where U = X − X̂.

18. (Section 4.4-Prediction)

Suppose U and V are independent, each having the exponential distribution with parameter λ. (T is exponentially distributed with parameter λ if its density is given by λ exp(−λt) for t > 0.) Define X = U + V and Y = UV. Find the best predictor and the best linear predictor of Y given X.

19. (Section 4.4-Prediction)

Suppose that X and Y are independent, each distributed as B(1, p). (See Section 5.1 for the definition.) Find the best predictor and the best linear predictor of X + Y given X − Y. Compute their respective mean squared prediction errors and compare them directly.