COVARIANCE AND CORRELATION

Covariance, denoted by Cov(X, Y) or $\sigma_{XY}$, is a measure of the relationship between two random variables X and Y and is defined by

DEFINITION 4.3.1 Cov(X, Y) = E[(X - EX)(Y - EY)] = EXY - EX EY.

The second equality follows from expanding (X - EX)(Y - EY) as the sum of four terms and then applying Theorem 4.1.6. Note that, because of Theorem 4.1.6, the covariance can also be written as E[(X - EX)Y] or E[(Y - EY)X].
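For reference, the four-term expansion mentioned above can be written out as follows. This is only a sketch of the step the text attributes to Theorem 4.1.6; it uses nothing beyond the facts that EX and EY are constants and that expectation is linear. The `align*` environment assumes the amsmath package.

```latex
\begin{align*}
E[(X - EX)(Y - EY)]
  &= E[XY - X\,EY - EX\,Y + EX\,EY] \\
  &= EXY - EX\,EY - EX\,EY + EX\,EY \\
  &= EXY - EX\,EY, \\[4pt]
E[(X - EX)Y]
  &= EXY - EX\,EY .
\end{align*}
```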

Let $(X_1, Y_1), (X_2, Y_2), \ldots, (X_n, Y_n)$ be mutually independent in the sense of Definition 3.5.4 and identically distributed as (X, Y). Then we define the sample covariance by $n^{-1}\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})$, where $\bar{X}$ and $\bar{Y}$ are the sample means of X and Y, respectively. Using the results of Chapter 6, we can show that the sample covariance converges to the population covariance in probability.

It is apparent from the definition that Cov > 0 if X - EX and Y - EY tend to have the same sign and that Cov < 0 if they tend to have opposite signs, which is illustrated by Example 4.3.1.

Since EX = EY = 0,

Cov(X, Y) = EXY = a – (1 – a) = 2a – 1.

Note that in this example Cov = 0 if a = 1/2, which is the case of independence between X and Y. More generally, we have

THEOREM 4.3.1 If X and Y are independent, Cov(X, Y) = 0 provided that VX and VY exist.

The proof follows immediately from the second formula of Definition 4.3.1 and Theorem 4.1.7. The next example shows that the converse of Theorem 4.3.1 is not necessarily true.

EXAMPLE 4.3.2 Let the joint probability distribution of (X, Y) be given by

  Y \ X     -1      0       1
    1      1/6    1/12    1/6
    0      1/12    0      1/12
   -1      1/6    1/12    1/6

Clearly, X and Y are not independent by Theorem 3.2.1, but we have Cov(X, Y) = EXY = 0.
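The claim of Example 4.3.2 can be checked with exact arithmetic. The sketch below assumes the joint table reads 1/6 at the four corners, 1/12 at the four cells where exactly one variable is zero, and 0 at the center; that reading, and the helper names `expect`, `covariance`, and `is_independent`, are assumptions made for illustration.

```python
from fractions import Fraction as F
from itertools import product

# Joint pmf keyed by (x, y). Assumed reading of the table in Example 4.3.2.
pmf = {
    (-1, 1): F(1, 6), (0, 1): F(1, 12), (1, 1): F(1, 6),
    (-1, 0): F(1, 12), (0, 0): F(0), (1, 0): F(1, 12),
    (-1, -1): F(1, 6), (0, -1): F(1, 12), (1, -1): F(1, 6),
}

def expect(g, pmf):
    """E[g(X, Y)] for a finite joint pmf."""
    return sum(p * g(x, y) for (x, y), p in pmf.items())

def covariance(pmf):
    """Cov(X, Y) = EXY - EX * EY (second formula of Definition 4.3.1)."""
    ex = expect(lambda x, y: x, pmf)
    ey = expect(lambda x, y: y, pmf)
    return expect(lambda x, y: x * y, pmf) - ex * ey

def is_independent(pmf):
    """True if P(x, y) = P(x) * P(y) holds in every cell."""
    xs = sorted({x for x, _ in pmf})
    ys = sorted({y for _, y in pmf})
    px = {x: sum(pmf[(x, y)] for y in ys) for x in xs}
    py = {y: sum(pmf[(x, y)] for x in xs) for y in ys}
    return all(pmf[(x, y)] == px[x] * py[y] for x, y in product(xs, ys))

total = sum(pmf.values())
cov = covariance(pmf)
independent = is_independent(pmf)
print(total, cov, independent)  # prints: 1 0 False
```

The center cell is what breaks independence here: P(X = 0, Y = 0) = 0, while the product of the marginals is 1/36.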

Examples 4.3.3 and 4.3.4 illustrate the use of the second formula of Definition 4.3.1 for computing the covariance.

EXAMPLE 4.3.3 Let the joint probability distribution of (X, Y) be given by

  X \ Y      1      0
    1       1/4    1/4    1/2
    0       1/8    3/8    1/2
            3/8    5/8

where the marginal probabilities are also shown. We have EX = 1/2, EY = 3/8, and EXY = 1/4. Therefore, by Definition 4.3.1, Cov(X, Y) = 1/4 - 3/16 = 1/16.

EXAMPLE 4.3.4 Let the joint density be

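$f(x, y) = x + y$ for $0 < x < 1$ and $0 < y < 1$,

$f(x, y) = 0$ otherwise.

Calculate Cov(X, Y).

We have

$f(x) = \int_0^1 (x + y)\,dy = x + \frac{1}{2}$, $0 < x < 1$,

$EX = \int_0^1 x\left(x + \frac{1}{2}\right)dx = \frac{7}{12}$,

$EY = \frac{7}{12}$ by symmetry,

$EXY = \int_0^1\!\int_0^1 (x^2 y + y^2 x)\,dx\,dy = 2\int_0^1 x^2\,dx \int_0^1 y\,dy = \frac{1}{3}$.

Therefore Cov(X, Y) = 1/3 - 49/144 = -1/144.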
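The integrals in Example 4.3.4 can be verified exactly without a symbolic package. The sketch below relies on the closed form of the double integral of $x^a y^b (x + y)$ over the unit square, namely $1/((a+2)(b+1)) + 1/((a+1)(b+2))$; the function name `moment` is a hypothetical helper introduced here.

```python
from fractions import Fraction as F

def moment(a, b):
    """E[X^a Y^b] under the density f(x, y) = x + y on the unit square.

    Integrating x^(a+1) y^b and x^a y^(b+1) term by term over (0, 1) x (0, 1)
    gives 1/((a+2)(b+1)) + 1/((a+1)(b+2)).
    """
    return F(1, (a + 2) * (b + 1)) + F(1, (a + 1) * (b + 2))

total = moment(0, 0)   # equals 1 when f is a valid density
ex = moment(1, 0)
ey = moment(0, 1)
exy = moment(1, 1)
cov = exy - ex * ey    # second formula of Definition 4.3.1
print(total, ex, ey, exy, cov)  # prints: 1 7/12 7/12 1/3 -1/144
```

The small negative value says that X and Y are only weakly related under this density, with a slight tendency to move in opposite directions.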

Theorem 4.3.2 gives a useful formula for computing the variance of the sum or the difference of two random variables.

THEOREM 4.3.2 V(X ± Y) = VX + VY ± 2 Cov(X, Y).

The proof follows immediately from the definitions of variance and covariance.
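Spelled out, the argument for Theorem 4.3.2 might run as follows. This is a sketch that uses only the definitions of variance and covariance and the linearity of expectation; the `align*` environment assumes the amsmath package.

```latex
\begin{align*}
V(X \pm Y)
  &= E\bigl[(X \pm Y) - E(X \pm Y)\bigr]^2 \\
  &= E\bigl[(X - EX) \pm (Y - EY)\bigr]^2 \\
  &= E(X - EX)^2 + E(Y - EY)^2 \pm 2E[(X - EX)(Y - EY)] \\
  &= VX + VY \pm 2\operatorname{Cov}(X, Y).
\end{align*}
```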

Combining Theorems 4.3.1 and 4.3.2, we can easily show that the variance of the sum of independent random variables is equal to the sum of the variances, which we state as

THEOREM 4.3.3 Let $X_i$, $i = 1, 2, \ldots, n$, be pairwise independent. Then

$V\left(\sum_{i=1}^{n} X_i\right) = \sum_{i=1}^{n} VX_i$.

It is clear from Theorem 4.3.2 that the conclusion of Theorem 4.3.3 holds if we merely assume $\operatorname{Cov}(X_i, X_j) = 0$ for every pair such that $i \ne j$. As an application of Theorem 4.3.3, consider

EXAMPLE 4.3.5 There are five stocks, each of which sells for \$100 per share and has the same expected annual return per share, $\mu$, and the same variance of return, $\sigma^2$. Assume that the returns from the five stocks are pairwise independent. (a) If you buy ten shares of one stock, what will be the mean and variance of the annual return on your portfolio? (b) What if you buy two shares of each stock?

Let $X_i$ be the return per share from the ith stock. Then, (a) $E(10X_i) = 10\mu$ by Theorem 4.1.6, and $V(10X_i) = 100\sigma^2$ by Theorem 4.2.1. (b) $E(2\sum_{i=1}^{5} X_i) = 10\mu$ by Theorem 4.1.6, and $V(2\sum_{i=1}^{5} X_i) = 20\sigma^2$ by Theorem 4.2.1 and Theorem 4.3.3.
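The two portfolios of Example 4.3.5 can be compared with a few lines of arithmetic. The sketch below assumes, as the example does, a common mean and variance and zero covariance between every pair of returns; the function name `portfolio_mean_var` and the choice to measure in units of the mean and variance (both set to 1) are illustrative assumptions.

```python
def portfolio_mean_var(shares, mu, sigma2):
    """Mean and variance of the sum over i of shares[i] * X_i.

    Assumes every X_i has mean mu and variance sigma2 and that the X_i are
    pairwise uncorrelated, so the variances of the terms simply add.
    """
    mean = sum(n * mu for n in shares)
    var = sum(n * n * sigma2 for n in shares)
    return mean, var

mu, sigma2 = 1, 1   # report results in units of mu and sigma^2
concentrated = portfolio_mean_var([10, 0, 0, 0, 0], mu, sigma2)
diversified = portfolio_mean_var([2, 2, 2, 2, 2], mu, sigma2)
print(concentrated, diversified)  # prints: (10, 100) (10, 20)
```

Spreading the ten shares over five stocks leaves the expected return unchanged and cuts the variance to one fifth.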

A weakness of covariance as a measure of relationship is that it depends on the units with which X and Y are measured. For example, Cov(Income, Consumption) is 10,000 times as large if both variables are measured in cents rather than in dollars, because each variable is scaled by 100. This weakness is remedied by considering the correlation (coefficient), defined by

DEFINITION 4.3.2

$\operatorname{Correlation}(X, Y) = \dfrac{\operatorname{Cov}(X, Y)}{\sqrt{VX}\,\sqrt{VY}}$.

Correlation is often denoted by $\rho_{XY}$ or simply $\rho$. It is easy to prove

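THEOREM 4.3.4 If $\alpha$ and $\beta$ are nonzero constants of the same sign,

Correlation($\alpha X$, $\beta Y$) = Correlation(X, Y).

(If $\alpha$ and $\beta$ have opposite signs, the correlation changes sign but keeps its magnitude.)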
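A small numerical check can make the contrast between covariance and correlation concrete. The income and consumption figures below are hypothetical, and the two helper functions are a minimal sketch that treats the five pairs as an equally weighted distribution.

```python
import math

def covariance(xs, ys):
    """Covariance of two equally weighted lists of values."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n

def correlation(xs, ys):
    """Cov(X, Y) divided by the product of the standard deviations."""
    return covariance(xs, ys) / math.sqrt(covariance(xs, xs) * covariance(ys, ys))

# Hypothetical income and consumption figures, in dollars.
income = [100.0, 200.0, 300.0, 400.0, 500.0]
consumption = [90.0, 150.0, 260.0, 300.0, 420.0]
income_cents = [100 * x for x in income]
consumption_cents = [100 * y for y in consumption]

# Covariance picks up a factor of 100 * 100 when both units change.
cov_ratio = covariance(income_cents, consumption_cents) / covariance(income, consumption)
# Correlation is unchanged by positive rescaling ...
r_dollars = correlation(income, consumption)
r_cents = correlation(income_cents, consumption_cents)
# ... but multiplying one variable by a negative constant reverses its sign.
r_flipped = correlation([-x for x in income], consumption)
print(cov_ratio, r_dollars, r_cents, r_flipped)
```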

We also have

THEOREM 4.3.5 $|\rho| \le 1$.

Proof. Since the expected value of a nonnegative random variable is nonnegative, we have

(4.3.1) $E[(X - EX) - \lambda(Y - EY)]^2 \ge 0$ for any $\lambda$.

Expanding the squared term, we have

(4.3.2) $VX + \lambda^2 VY - 2\lambda\operatorname{Cov} \ge 0$ for any $\lambda$.

In particular, putting $\lambda = \operatorname{Cov}/VY$ into the left-hand side of (4.3.2), we obtain the Cauchy-Schwarz inequality

(4.3.3) $VX - \dfrac{\operatorname{Cov}^2}{VY} \ge 0$.

The theorem follows immediately from (4.3.3). □
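The substitution step in the proof can be written out as follows. The sketch writes $\lambda$ for the free constant in (4.3.1) and (4.3.2), abbreviates Cov(X, Y) as Cov, and assumes VX > 0 and VY > 0 so that the divisions are legitimate; the `align*` environment assumes the amsmath package.

```latex
\begin{align*}
VX + \lambda^2 VY - 2\lambda\operatorname{Cov}
  \;\Big|_{\lambda = \operatorname{Cov}/VY}
  &= VX + \frac{\operatorname{Cov}^2}{(VY)^2}\,VY
        - 2\,\frac{\operatorname{Cov}^2}{VY}
   = VX - \frac{\operatorname{Cov}^2}{VY} \;\ge\; 0, \\[4pt]
\text{so}\quad \operatorname{Cov}^2 &\le VX \cdot VY
  \quad\text{and}\quad
  \rho^2 = \frac{\operatorname{Cov}^2}{VX \cdot VY} \le 1 .
\end{align*}
```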

If $\rho = 0$, we say X and Y are uncorrelated. If $\rho > 0$ ($\rho < 0$), we say X and Y are positively (negatively) correlated.

We next consider the problem of finding the best predictor of one random variable, Y, among all the possible linear functions of another random variable, X. This problem has a bearing on the correlation coefficient because the proportion of the variance of Y that is explained by the best linear predictor based on X turns out to be equal to the square of the correlation coefficient between Y and X, as we shall see below.

We shall interpret the word best in the sense of minimizing the mean squared error of prediction. The problem can be mathematically formulated as

(4.3.4) Minimize $E(Y - \alpha - \beta X)^2$ with respect to $\alpha$ and $\beta$.

We shall solve this problem by calculus. Expanding the squared term, we can write the minimand, denoted by S, as

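(4.3.5) $S = EY^2 + \alpha^2 + \beta^2 EX^2 - 2\alpha EY - 2\beta EXY + 2\alpha\beta EX$.

Equating the derivatives to zero, we obtain

(4.3.6) $\dfrac{\partial S}{\partial \alpha} = 2\alpha - 2EY + 2\beta EX = 0$

and

(4.3.7) $\dfrac{\partial S}{\partial \beta} = 2\beta EX^2 - 2EXY + 2\alpha EX = 0$.

Solving (4.3.6) and (4.3.7) simultaneously for $\alpha$ and $\beta$ and denoting the optimal values by $\alpha^*$ and $\beta^*$, we get

(4.3.8) $\beta^* = \dfrac{\operatorname{Cov}(X, Y)}{VX}$

and

(4.3.9) $\alpha^* = EY - \beta^* EX$.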
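One way to carry out the simultaneous solution of the two first-order conditions is sketched here. In this block $EX^2$ denotes $E(X^2)$ and $(EX)^2$ the square of the mean, so that $VX = EX^2 - (EX)^2$; VX > 0 is assumed, and the `align*` environment assumes the amsmath package.

```latex
\begin{align*}
\text{From (4.3.6):}\quad
  & \alpha = EY - \beta\,EX . \\
\text{Substituting into (4.3.7):}\quad
  & \beta\,EX^2 - EXY + (EY - \beta\,EX)\,EX = 0 , \\
  & \beta\,\bigl[EX^2 - (EX)^2\bigr] = EXY - EX\,EY , \\
\text{hence}\quad
  & \beta^* = \frac{\operatorname{Cov}(X, Y)}{VX}, \qquad
    \alpha^* = EY - \beta^*\,EX .
\end{align*}
```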

Thus we have proved

THEOREM 4.3.6 The best linear predictor (or more exactly, the minimum mean-squared-error linear predictor) of Y based on X is given by $\alpha^* + \beta^* X$, where $\beta^*$ and $\alpha^*$ are defined by (4.3.8) and (4.3.9), respectively.
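Theorem 4.3.6 can be illustrated on a small discrete distribution. The sketch below uses an assumed reading of the joint table of Example 4.3.3 (probabilities 1/4, 1/4, 1/8, 3/8), computes the slope as Cov(X, Y)/VX and the intercept as EY minus slope times EX, and then checks that nudging either coefficient raises the mean squared error. The names `expect`, `mse`, and `nudges` are hypothetical.

```python
from fractions import Fraction as F

# Joint pmf keyed by (x, y); an assumed reading of Example 4.3.3.
pmf = {(1, 1): F(1, 4), (1, 0): F(1, 4), (0, 1): F(1, 8), (0, 0): F(3, 8)}

def expect(g):
    """E[g(X, Y)] under the pmf above."""
    return sum(p * g(x, y) for (x, y), p in pmf.items())

ex = expect(lambda x, y: x)
ey = expect(lambda x, y: y)
vx = expect(lambda x, y: x * x) - ex ** 2
cov = expect(lambda x, y: x * y) - ex * ey

beta_star = cov / vx               # slope: Cov(X, Y) / VX
alpha_star = ey - beta_star * ex   # intercept: EY - beta* EX

def mse(alpha, beta):
    """Mean squared prediction error E(Y - alpha - beta X)^2."""
    return expect(lambda x, y: (y - alpha - beta * x) ** 2)

best = mse(alpha_star, beta_star)
# Move both coefficients a little in each direction and recompute the error.
nudges = [F(-1, 10), F(1, 10)]
perturbed = [mse(alpha_star + da, beta_star + db) for da in nudges for db in nudges]
print(alpha_star, beta_star, best)  # prints: 1/4 1/4 7/32
```

Because X takes only two values in this example, the fitted line passes through both conditional means of Y, which is a special feature of binary X rather than a general property.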

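Next we shall ask what proportion of VY is explained by $\alpha^* + \beta^* X$ and what proportion is left unexplained. Define $\hat{Y} = \alpha^* + \beta^* X$ and $U = Y - \hat{Y}$. The latter will be called either the prediction error or the residual. We have

(4.3.10) $V\hat{Y} = (\beta^*)^2 VX$ by Theorem 4.2.1

    $= \dfrac{\operatorname{Cov}(X, Y)^2}{VX}$ by (4.3.8)

    $= \rho^2 VY$ by Definition 4.3.2.

Similarly,

(4.3.11) $VU = V(Y - \alpha^* - \beta^* X) = VY + (\beta^*)^2 VX - 2\beta^* \operatorname{Cov}(X, Y)$

    $= VY - \dfrac{\operatorname{Cov}(X, Y)^2}{VX}$ by (4.3.8)

    $= (1 - \rho^2) VY$ by Definition 4.3.2.

We call VU the mean squared prediction error of the best linear predictor. We also have

(4.3.12) $\operatorname{Cov}(\hat{Y}, U) = \operatorname{Cov}(\hat{Y}, Y - \hat{Y})$

    $= \operatorname{Cov}(\hat{Y}, Y) - V\hat{Y}$ by Definition 4.3.1

    $= \beta^* \operatorname{Cov}(X, Y) - V\hat{Y}$ by Definition 4.3.1

    $= 0$ by (4.3.8) and (4.3.10).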

Combining (4.3.10), (4.3.11), and (4.3.12), we can say that any random variable Y can be written as the sum of two parts: the part which is expressed as a linear function of another random variable X (namely, $\hat{Y}$) and the part which is uncorrelated with X (namely, U). A $\rho^2$ proportion of the variance of Y is attributable to the first part and a $1 - \rho^2$ proportion to the second part. This result suggests that the correlation coefficient is a measure of the degree of a linear relationship between a pair of random variables.
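The decomposition just described can be confirmed with exact fractions on any finite joint distribution. The pmf below is hypothetical and chosen only for illustration, and the helpers `expect`, `cov`, `Y_hat`, and `U` are names introduced for this sketch.

```python
from fractions import Fraction as F

# Hypothetical joint pmf keyed by (x, y), chosen only for illustration.
pmf = {
    (0, 0): F(3, 10), (1, 0): F(1, 10), (1, 2): F(2, 10),
    (2, 2): F(1, 10), (2, 3): F(3, 10),
}

def expect(g):
    """E[g(X, Y)] under the pmf above."""
    return sum(p * g(x, y) for (x, y), p in pmf.items())

def cov(g, h):
    """Cov(g, h) = E[gh] - E[g] E[h]; cov(g, g) is the variance of g."""
    return expect(lambda x, y: g(x, y) * h(x, y)) - expect(g) * expect(h)

def X(x, y):
    return x

def Y(x, y):
    return y

beta = cov(X, Y) / cov(X, X)
alpha = expect(Y) - beta * expect(X)

def Y_hat(x, y):
    """Best linear predictor alpha* + beta* X."""
    return alpha + beta * x

def U(x, y):
    """Residual Y - Y_hat."""
    return y - Y_hat(x, y)

vy = cov(Y, Y)
rho_sq = cov(X, Y) ** 2 / (cov(X, X) * vy)
explained = cov(Y_hat, Y_hat)      # V(Y_hat), the rho^2 share of VY
unexplained = cov(U, U)            # VU, the 1 - rho^2 share of VY
print(rho_sq, explained / vy, unexplained / vy, cov(Y_hat, U))
```

With exact arithmetic the explained share equals the squared correlation, the two shares sum to one, and the covariance between predictor and residual is exactly zero.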

As a further illustration of the point that $\rho$ is a measure of linear dependence, consider Example 4.3.1 again. Since VX = VY = 1 in that example, $\rho = 2a - 1$. When a = 1, there is an exact linear relationship between X and Y with a positive slope. When a = 0, there is an exact linear relationship with a negative slope. When a = 1/2, the degree of linear dependence is at the minimum.
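A short simulation can show the sample covariance settling near 2a - 1 for the three values of a just discussed. The sketch assumes one distribution consistent with the stated facts (EX = EY = 0, VX = VY = 1, EXY = 2a - 1): X and Y each take the values -1 and 1, with X = Y with probability a. That pmf, the seed, and the sample size are assumptions made for illustration.

```python
import random

def draw_pair(a, rng):
    """One draw of (X, Y) with X and Y in {-1, 1}.

    Assumed pmf: X is -1 or 1 with equal probability, and Y equals X with
    probability a and -X with probability 1 - a.
    """
    x = rng.choice([-1, 1])
    y = x if rng.random() < a else -x
    return x, y

def sample_covariance(pairs):
    """n^{-1} times the sum of (X_i - Xbar)(Y_i - Ybar)."""
    n = len(pairs)
    xbar = sum(x for x, _ in pairs) / n
    ybar = sum(y for _, y in pairs) / n
    return sum((x - xbar) * (y - ybar) for x, y in pairs) / n

rng = random.Random(0)   # fixed seed so the run is repeatable
results = {}
for a in (0.0, 0.5, 1.0):
    pairs = [draw_pair(a, rng) for _ in range(20000)]
    results[a] = sample_covariance(pairs)
    print(a, round(results[a], 3), "population value:", 2 * a - 1)
```

Since VX = VY = 1 under this pmf, the same numbers can be read as approximate correlations.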

A nonlinear dependence may imply a very small value of $\rho$. Suppose that there is an exact nonlinear relationship between X and Y defined by $Y = X^2$, and also suppose that X has a symmetric density around EX = 0. Then $\operatorname{Cov}(X, Y) = EXY = EX^3 = 0$. Therefore $\rho = 0$. This may be thought of as another example where no correlation does not imply independence. In the next section we shall obtain the best predictor and compare it with the best linear predictor.
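To make the quadratic example concrete, the sketch below takes X uniform on (-1, 1), which is one hypothetical choice of a density symmetric around zero, and evaluates the relevant moments exactly. The function name `moment` is an assumption introduced here.

```python
from fractions import Fraction as F

def moment(k):
    """E[X^k] for X uniform on (-1, 1): 0 for odd k, 1/(k + 1) for even k."""
    return F(0) if k % 2 else F(1, k + 1)

ex, ex2, ex3, ex4 = (moment(k) for k in (1, 2, 3, 4))

# With Y = X^2: EY = E(X^2), EXY = E(X^3), VY = E(X^4) - (E(X^2))^2.
cov_xy = ex3 - ex * ex2
vy = ex4 - ex2 ** 2
beta_star = cov_xy / ex2            # VX = E(X^2) because EX = 0
alpha_star = ex2 - beta_star * ex   # intercept EY - beta* EX
print(cov_xy, alpha_star, beta_star, vy)  # prints: 0 1/3 0 4/45
```

The best linear predictor collapses to the constant 1/3, so it explains none of the variance 4/45 of Y even though Y is an exact function of X.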