# Descriptive Statistics

In Chapter 4, we will consider the estimation of a simple wage equation based on 595 individuals drawn from the Panel Study of Income Dynamics for 1982. This data is available on the Springer web site as EARN.ASC. Table 2.1 gives the descriptive statistics using EViews for a subset of the variables in this data set. The average log wage is 6.95 for this sample, with a minimum of 5.68 and a maximum of 8.54. The standard deviation of log wage is 0.44. A plot of the log wage histogram is given in Figure 2.8. Weeks worked vary between 5 and 52 with an average of 46.5 and a standard deviation of 5.2. This variable is highly skewed, as evidenced by the histogram in Figure 2.9. Years of education vary between 4 and 17 with an average of 12.8 and a standard deviation of 2.79. There is the usual bunching up at 12 years, which is also the median, as is clear from Figure 2.10.

Experience varies between 7 and 51 with an average of 22.9 and a standard deviation of 10.79. The distribution of this variable is skewed to the left, as shown in Figure 2.11.

Marital status is a qualitative variable indicating whether the individual is married or not. This information is recoded as a numeric (1, 0) variable: one if the individual is married and zero otherwise. This recoded variable is also known as a dummy variable. It is basically a switch, turning on when the individual is married and off when he or she is not. Female is another dummy variable, taking the value one when the individual is a female and zero otherwise. Black is a dummy variable taking the value one when the individual is black and zero otherwise. Union is a dummy variable taking the value one if the individual's wage is set by a union contract and zero otherwise. The minimum and maximum values for these dummy variables are obvious, but if they were not zero and one, respectively, we would know that something is wrong. The average is a meaningful statistic, indicating the percentage of married individuals, females, blacks and union-contracted wages in the sample. These are 80.5, 11.3, 7.2 and 30.6%, respectively. We would like to investigate the following claims: (i) women are paid less than men; (ii) blacks are paid less than non-blacks; (iii) married individuals earn more than non-married individuals; and (iv) union-contracted wages are higher than non-union wages.

Table 2.2 Test for the Difference in Means

| Group | Average log wage | Difference | t-statistic |
|---|---|---|---|
| Male | 7.004 | -0.474 | (-8.86) |
| Female | 6.530 | | |
| Non-Black | 6.978 | -0.377 | (-5.57) |
| Black | 6.601 | | |
| Not Married | 6.664 | 0.356 | (8.28) |
| Married | 7.020 | | |
| Non-Union | 6.945 | 0.017 | (0.45) |
| Union | 6.962 | | |

Table 2.3 Correlation Matrix

|       | LWAGE | WKS | ED | EX | MS | FEM | BLK | UNION |
|-------|-------|-----|----|----|----|-----|-----|-------|
| LWAGE | 1.0000 | 0.0403 | 0.4566 | 0.0873 | 0.3218 | -0.3419 | -0.2229 | 0.0183 |
| WKS | 0.0403 | 1.0000 | 0.0002 | -0.1061 | 0.0782 | -0.0875 | -0.0594 | -0.1721 |
| ED | 0.4566 | 0.0002 | 1.0000 | -0.2219 | 0.0184 | -0.0012 | -0.1196 | -0.2719 |
| EX | 0.0873 | -0.1061 | -0.2219 | 1.0000 | 0.1570 | -0.0938 | 0.0411 | 0.0689 |
| MS | 0.3218 | 0.0782 | 0.0184 | 0.1570 | 1.0000 | -0.7104 | -0.2231 | 0.1189 |
| FEM | -0.3419 | -0.0875 | -0.0012 | -0.0938 | -0.7104 | 1.0000 | 0.2086 | -0.1274 |
| BLK | -0.2229 | -0.0594 | -0.1196 | 0.0411 | -0.2231 | 0.2086 | 1.0000 | 0.0302 |
| UNION | 0.0183 | -0.1721 | -0.2719 | 0.0689 | 0.1189 | -0.1274 | 0.0302 | 1.0000 |

A simple first check could be based on computing the average log wage for each of these categories and testing whether the difference in means is significantly different from zero. This can be done using a t-test, see Table 2.2. The average log wage for males and females is given along with their difference and the corresponding t-statistic for the significance of this difference. Other rows of Table 2.2 give similar statistics for other groupings. In Chapter 4, we will show that this t-test can be obtained from a simple regression of log wage on the categorical dummy variable distinguishing the two groups, in this case the Female dummy variable. From Table 2.2, it is clear that only the difference between union and non-union contracted wages is insignificant.
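As a quick numerical illustration of this equivalence, a pooled two-sample t-statistic can be compared with the t-statistic on a group dummy in a simple regression. The sketch below uses small invented log-wage values (not the EARN.ASC data); the two statistics coincide exactly:

```python
import math

# Hypothetical log wages for two groups (values invented for illustration)
male   = [7.2, 6.9, 7.5, 7.0, 7.3, 6.8]
female = [6.5, 6.7, 6.4, 6.9, 6.6]

def pooled_t(a, b):
    """Two-sample t-statistic with a pooled variance estimate."""
    na, nb = len(a), len(b)
    ma, mb = sum(a)/na, sum(b)/nb
    ssa = sum((v - ma)**2 for v in a)
    ssb = sum((v - mb)**2 for v in b)
    s2 = (ssa + ssb)/(na + nb - 2)          # pooled variance, n-2 df
    return (ma - mb)/math.sqrt(s2*(1/na + 1/nb))

# OLS of y on a constant and a dummy d (1 = male): the slope equals the
# difference in group means and its t-statistic equals the pooled t above.
y = male + female
d = [1]*len(male) + [0]*len(female)
n = len(y)
dbar, ybar = sum(d)/n, sum(y)/n
sxy = sum((di - dbar)*(yi - ybar) for di, yi in zip(d, y))
sxx = sum((di - dbar)**2 for di in d)
b = sxy/sxx                                  # slope = difference in means
a = ybar - b*dbar
resid = [yi - a - b*di for di, yi in zip(d, y)]
s2 = sum(e**2 for e in resid)/(n - 2)
t_reg = b/math.sqrt(s2/sxx)

t_pool = pooled_t(male, female)
print(round(t_pool, 6), round(t_reg, 6))     # identical by construction
```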

One can also plot log wage versus experience, see Figure 2.12, log wage versus education, see Figure 2.13, and log wage versus weeks, see Figure 2.14.

The data show that, in general, log wage increases with education level and weeks worked, but that it exhibits a rising and then a declining pattern with more years of experience. Note that the t-tests based on the difference in log wage across two groupings of individuals, by sex, race or marital status, or the figures plotting log wage versus education and log wage versus weeks worked, are based on pairs of variables in each case. A nice summary statistic, based also on pairwise comparisons of these variables, is the correlation matrix across the data. This is given in Table 2.3.

The signs of this correlation matrix give the direction of the linear relationship between the corresponding two variables, while the magnitude gives the strength of this correlation. In Chapter 3, we will see that these simple correlations, when squared, give the percentage of variation that one of these variables explains in the other. For example, the simple correlation coefficient between log wage and marital status is 0.32. This means that marital status explains $(0.32)^2$, or about 10%, of the variation in log wage.
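This squared-correlation interpretation can be verified numerically: for any paired data, the variance of the best linear predictor equals the squared correlation times the variance of y. A minimal sketch with invented numbers:

```python
import math

# Hypothetical paired data (invented for illustration only)
x = [1, 2, 3, 4, 5, 6]
y = [2.0, 2.5, 3.5, 3.2, 4.8, 5.1]

n = len(x)
mx, my = sum(x)/n, sum(y)/n
sxx = sum((a - mx)**2 for a in x)
syy = sum((b - my)**2 for b in y)
sxy = sum((a - mx)*(b - my) for a, b in zip(x, y))
r = sxy / math.sqrt(sxx * syy)              # simple correlation coefficient

# best linear predictor of y from x
beta = sxy / sxx
yhat = [my + beta*(a - mx) for a in x]
explained = sum((h - my)**2 for h in yhat) / syy   # share of variation explained

print(round(r**2, 4), round(explained, 4))  # the two numbers coincide
```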

One cannot emphasize enough how important it is to check one’s data. It is important to compute the descriptive statistics, simple plots of the data and simple correlations. A wrong minimum or maximum could indicate some possible data entry errors. Troughs or peaks in these plots may indicate important events for time series data, like wars or recessions, or influential observations. More on this in Chapter 8. Simple correlation coefficients that equal one indicate perfectly collinear variables and warn of the failure of a linear regression that has both variables included among the regressors, see Chapter 4.

Notes

1. Actually, $E(s^2) = \sigma^2$ does not need the normality assumption. This fact, along with the proof that $(n-1)s^2/\sigma^2 \sim \chi^2_{n-1}$ under normality, can easily be shown using matrix algebra and is deferred to Chapter 7.

2. This can be proven using Chebyshev's inequality, see Hogg and Craig (1995).

3. See Hogg and Craig (1995) for the type of regularity conditions needed for these distributions.

Problems

1. Variance and Covariance of Linear Combinations of Random Variables. Let $a, b, c, d, e$ and $f$ be arbitrary constants and let X and Y be two random variables.

(a) Show that $\text{var}(a + bX) = b^2\,\text{var}(X)$.

(b) Show that $\text{var}(a + bX + cY) = b^2\,\text{var}(X) + c^2\,\text{var}(Y) + 2bc\,\text{cov}(X, Y)$.

(c) Show that $\text{cov}[(a + bX + cY), (d + eX + fY)] = be\,\text{var}(X) + cf\,\text{var}(Y) + (bf + ce)\,\text{cov}(X, Y)$.

2. Independence and Simple Correlation.

(a) Show that if X and Y are independent, then $E(XY) = E(X)E(Y) = \mu_X\mu_Y$, where $\mu_X = E(X)$ and $\mu_Y = E(Y)$. Therefore, $\text{cov}(X, Y) = E[(X - \mu_X)(Y - \mu_Y)] = 0$.

(b) Show that if $Y = a + bX$, where a and b are arbitrary constants, then $\rho_{XY} = 1$ if $b > 0$ and $\rho_{XY} = -1$ if $b < 0$.

3. Zero Covariance Does Not Necessarily Imply Independence. Let $X = -2, -1, 0, 1, 2$ with $\Pr[X = x] = 1/5$. Assume a perfect quadratic relationship between Y and X, namely $Y = X^2$. Show that $\text{cov}(X, Y) = E(X^3) = 0$. Deduce that $\rho_{XY} = \text{correlation}(X, Y) = 0$. The simple correlation coefficient $\rho_{XY}$ measures the strength of the linear relationship between X and Y. For this example, it is zero even though there is a perfect nonlinear relationship between X and Y. This is also an example of the fact that if $\rho_{XY} = 0$, then X and Y are not necessarily independent. $\rho_{XY} = 0$ is a necessary but not sufficient condition for X and Y to be independent. The converse, however, is true, i.e., if X and Y are independent, then $\rho_{XY} = 0$, see problem 2.
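This problem can be checked directly with a few lines of arithmetic; the covariance vanishes even though Y is an exact function of X:

```python
# Discrete example from problem 3: X uniform on {-2,-1,0,1,2}, Y = X^2
xs = [-2, -1, 0, 1, 2]
p = 1/5                                  # Pr[X = x] for each support point

ex  = sum(p * x for x in xs)             # E(X)  = 0
ey  = sum(p * x * x for x in xs)         # E(Y)  = E(X^2) = 2
exy = sum(p * x * (x * x) for x in xs)   # E(XY) = E(X^3) = 0

cov = exy - ex * ey                      # cov(X, Y) = 0 despite Y = X^2
print(cov)
```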

4. The Binomial Distribution is defined as the number of successes in $n$ independent Bernoulli trials with probability of success $\theta$. This discrete probability function is given by

$$f(X;\theta) = \binom{n}{X}\theta^X(1-\theta)^{n-X} \qquad X = 0, 1, \ldots, n$$

and zero elsewhere, with $\binom{n}{X} = n!/[X!(n-X)!]$.

(a) Suppose there are 20 candidates for a job and the probability of hiring is 0.1. Compute the probability of getting $X = 5$ or $X = 6$ people hired.

(b) Show that $\binom{n}{X} = \binom{n}{n-X}$ and use that to conclude that $b(n, X, \theta) = b(n, n-X, 1-\theta)$.

(c) Verify that $E(X) = n\theta$ and $\text{var}(X) = n\theta(1-\theta)$.

(d) For a random sample of size $n$ drawn from the Bernoulli distribution with parameter $\theta$, show that $\bar{X}$ is the MLE of $\theta$.

(e) Show that $\bar{X}$, in part (d), is unbiased and consistent for $\theta$.

(f) Show that $\bar{X}$, in part (d), is sufficient for $\theta$.

(g) Derive the Cramér-Rao lower bound for any unbiased estimator of $\theta$. Is $\bar{X}$, in part (d), MVU for $\theta$?

(h) For $n = 20$, derive the uniformly most powerful critical region of size $\alpha \leq 0.05$ for testing $H_0: \theta = 0.2$ versus $H_1: \theta = 0.6$. What is the probability of type II error for this test criterion?

(i) Form the Likelihood Ratio test for testing $H_0: \theta = 0.2$ versus $H_1: \theta \neq 0.2$. Derive the Wald and LM test statistics for testing $H_0$ versus $H_1$. When is the Wald statistic greater than the LM statistic?
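Parts (a) and (b) of this problem can be verified numerically. The sketch below computes $\Pr[X = 5] + \Pr[X = 6]$ for $n = 20$ and $\theta = 0.1$, and checks the symmetry $b(n, X, \theta) = b(n, n-X, 1-\theta)$:

```python
from math import comb

def b(n, x, theta):
    """Binomial probability f(x; theta) = C(n,x) theta^x (1-theta)^(n-x)."""
    return comb(n, x) * theta**x * (1 - theta)**(n - x)

# part (a): probability of hiring exactly 5 or 6 out of 20 candidates
p = b(20, 5, 0.1) + b(20, 6, 0.1)
print(round(p, 4))  # about 0.0408

# part (b): symmetry of the binomial probabilities
print(abs(b(20, 5, 0.1) - b(20, 15, 0.9)) < 1e-12)
```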

5. For a random sample of size $n$ drawn from the Normal distribution with mean $\mu$ and variance $\sigma^2$:

(a) Show that $s^2$ is a sufficient statistic for $\sigma^2$.

(b) Using the fact that $(n-1)s^2/\sigma^2 \sim \chi^2_{n-1}$ (without proof), verify that $E(s^2) = \sigma^2$ and that $\text{var}(s^2) = 2\sigma^4/(n-1)$, as shown in the text.

(c) Given that $\sigma^2$ is unknown, form the Likelihood Ratio test statistic for testing $H_0: \mu = 2$ versus $H_1: \mu \neq 2$. Derive the Wald and Lagrange Multiplier statistics for testing $H_0$ versus $H_1$. Verify that they are given by the expressions in example 4.

(d) Another derivation of the $W \geq LR \geq LM$ inequality for the null hypothesis given in part (c) can be obtained as follows: Let $\tilde{\mu}, \tilde{\sigma}^2$ be the restricted maximum likelihood estimators under $H_0: \mu = \mu_0$. Let $\hat{\mu}, \hat{\sigma}^2$ be the corresponding unrestricted maximum likelihood estimators under the alternative $H_1: \mu \neq \mu_0$. Show that $W = -2\log[L(\tilde{\mu}, \hat{\sigma}^2)/L(\hat{\mu}, \hat{\sigma}^2)]$ and $LM = -2\log[L(\tilde{\mu}, \tilde{\sigma}^2)/L(\hat{\mu}, \tilde{\sigma}^2)]$, where $L(\mu, \sigma^2)$ denotes the likelihood function. Conclude that $W \geq LR \geq LM$, see Breusch (1979). This is based on Baltagi (1994).

(e) Given that $\mu$ is unknown, form the Likelihood Ratio test statistic for testing $H_0: \sigma^2 = 3$ versus $H_1: \sigma^2 \neq 3$.

(f) Form the Likelihood Ratio test statistic for testing $H_0: \mu = 2, \sigma^2 = 4$ against the alternative that $H_1: \mu \neq 2$ or $\sigma^2 \neq 4$.

(g) For $n = 20$ and $s^2 = 9$, construct a 95% confidence interval for $\sigma^2$.
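Part (g) can be computed directly: the interval follows from pivoting $(n-1)s^2/\sigma^2 \sim \chi^2_{n-1}$. A sketch using scipy for the $\chi^2$ quantiles:

```python
from scipy.stats import chi2

n, s2 = 20, 9
df = n - 1

# pivot: (n-1)s^2/chi2_upper  <=  sigma^2  <=  (n-1)s^2/chi2_lower
lower = df * s2 / chi2.ppf(0.975, df)
upper = df * s2 / chi2.ppf(0.025, df)
print(round(lower, 2), round(upper, 2))  # roughly (5.21, 19.20)
```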

6. The Poisson distribution can be defined as the limit of a Binomial distribution as $n \to \infty$ and $\theta \to 0$ such that $n\theta = \lambda$ is a positive constant. For example, this could be the probability of a rare disease and we are randomly sampling a large number of inhabitants, or it could be the rare probability of finding oil and $n$ is the large number of drilling sites. This discrete probability function is given by

$$f(X; \lambda) = \frac{e^{-\lambda}\lambda^X}{X!} \qquad X = 0, 1, 2, \ldots$$

For a random sample from this Poisson distribution

(a) Show that $E(X) = \lambda$ and $\text{var}(X) = \lambda$.

(b) Show that the MLE of $\lambda$ is $\hat{\lambda}_{MLE} = \bar{X}$.

(c) Show that the method of moments estimator of $\lambda$ is also $\bar{X}$.

(d) Show that $\bar{X}$ is unbiased and consistent for $\lambda$.

(e) Show that $\bar{X}$ is sufficient for $\lambda$.

(f) Derive the Cramér-Rao lower bound for any unbiased estimator of $\lambda$. Show that $\bar{X}$ attains that bound.

(g) For $n = 9$, derive the Uniformly Most Powerful critical region of size $\alpha \leq 0.05$ for testing $H_0: \lambda = 2$ versus $H_1: \lambda = 4$.

(h) Form the Likelihood Ratio test for testing $H_0: \lambda = 2$ versus $H_1: \lambda \neq 2$. Derive the Wald and LM statistics for testing $H_0$ versus $H_1$. When is the Wald test statistic greater than the LM statistic?
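Part (g) of this problem can be checked numerically: under $H_0$, $Y = \sum X_i \sim$ Poisson$(n\lambda_0) =$ Poisson(18), and the UMP test rejects for large Y. The sketch below uses scipy to find the smallest cutoff whose size does not exceed 0.05:

```python
from scipy.stats import poisson

n, lam0, lam1 = 9, 2, 4
mu0, mu1 = n * lam0, n * lam1

# P(Y >= c) under H0 is poisson.sf(c - 1, mu0); find smallest c of size <= 0.05
c = 0
while poisson.sf(c - 1, mu0) > 0.05:
    c += 1

size = poisson.sf(c - 1, mu0)      # actual size of the test
beta = poisson.cdf(c - 1, mu1)     # type II error probability at lambda = 4
print(c, round(size, 4), round(beta, 4))
```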

7. The Geometric distribution is known as the probability of waiting for the first success in independent repeated trials of a Bernoulli process. This could occur on the 1st, 2nd, 3rd, ... trial. Its probability function is

$$g(X; \theta) = \theta(1-\theta)^{X-1} \qquad X = 1, 2, 3, \ldots$$

(a) Show that $E(X) = 1/\theta$ and $\text{var}(X) = (1-\theta)/\theta^2$.

(b) Given a random sample from this Geometric distribution of size $n$, find the MLE of $\theta$ and the method of moments estimator of $\theta$.

(c) Show that $\bar{X}$ is unbiased and consistent for $1/\theta$.

(d) For $n = 20$, derive the Uniformly Most Powerful critical region of size $\alpha \leq 0.05$ for testing $H_0: \theta = 0.5$ versus $H_1: \theta = 0.3$.

(e) Form the Likelihood Ratio test for testing $H_0: \theta = 0.5$ versus $H_1: \theta \neq 0.5$. Derive the Wald and LM statistics for testing $H_0$ versus $H_1$. When is the Wald statistic greater than the LM statistic?

8. The Uniform density, defined over the unit interval [0,1], assigns a density of one to all values of X in that interval. It is like a roulette wheel that has an equal chance of stopping anywhere between 0 and 1:

$$f(X) = \begin{cases} 1 & 0 \leq X \leq 1 \\ 0 & \text{elsewhere} \end{cases}$$

Computers are equipped with a Uniform (0,1) random number generator so it is important to understand these distributions.

(a) Show that E(X) = 1/2 and var(X) = 1/12.

(b) What is $\Pr[0.1 < X < 0.3]$? Does it matter if we ask instead for $\Pr[0.1 \leq X \leq 0.3]$?

9. The Exponential distribution is given by

$$f(X; \theta) = \frac{1}{\theta}\, e^{-X/\theta} \qquad X > 0, \; \theta > 0$$

This is a skewed and continuous distribution defined only over the positive part of the real line.

(a) Show that $E(X) = \theta$ and $\text{var}(X) = \theta^2$.

(b) Show that $\hat{\theta}_{MLE} = \bar{X}$.

(c) Show that the method of moments estimator of $\theta$ is also $\bar{X}$.

(d) Show that $\bar{X}$ is an unbiased and consistent estimator of $\theta$.

(e) Show that $\bar{X}$ is sufficient for $\theta$.

(f) Derive the Cramér-Rao lower bound for any unbiased estimator of $\theta$. Is $\bar{X}$ MVU for $\theta$?

(g) For $n = 20$, derive the Uniformly Most Powerful critical region of size $\alpha \leq 0.05$ for testing $H_0: \theta = 1$ versus $H_1: \theta = 2$.

(h) Form the Likelihood Ratio test for testing $H_0: \theta = 1$ versus $H_1: \theta \neq 1$. Derive the Wald and LM statistics for testing $H_0$ versus $H_1$. When is the Wald statistic greater than the LM statistic?

10. The Gamma distribution is given by

$$f(X; \alpha, \beta) = \frac{1}{\Gamma(\alpha)\beta^\alpha}\, X^{\alpha-1} e^{-X/\beta} \qquad X > 0$$

and zero elsewhere, where $\alpha, \beta > 0$ and $\Gamma(\alpha) = (\alpha - 1)!$ for positive integer $\alpha$. This is a skewed and continuous distribution.

(a) Show that $E(X) = \alpha\beta$ and $\text{var}(X) = \alpha\beta^2$.

(b) For a random sample drawn from this Gamma density, what are the method of moments estimators of $\alpha$ and $\beta$?

(c) Verify that for $\alpha = 1$ and $\beta = \theta$, the Gamma probability density function reverts to the Exponential p.d.f. considered in problem 9.

(d) We state without proof that for $\alpha = r/2$ and $\beta = 2$, this Gamma density reduces to a $\chi^2$ distribution with $r$ degrees of freedom, denoted by $\chi^2_r$. Show that $E(\chi^2_r) = r$ and $\text{var}(\chi^2_r) = 2r$.

(e) For a random sample from the $\chi^2_r$ distribution, show that $(X_1 X_2 \cdots X_n)$ is a sufficient statistic for $r$.

(f) One can show that the square of a N(0,1) random variable is a $\chi^2$ random variable with 1 degree of freedom, see the Appendix to the chapter. Also, one can show that the sum of independent $\chi^2$'s is a $\chi^2$ random variable with degrees of freedom equal to the sum of the corresponding degrees of freedom of the individual $\chi^2$'s, see problem 15. This will prove useful for testing later on. Using these results, verify that the sum of squares of $m$ independent N(0,1) random variables is a $\chi^2$ with $m$ degrees of freedom.
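Part (f) can be illustrated by simulation: squaring and summing m independent N(0,1) draws should produce a variable with mean m and variance 2m. A sketch (seeded for reproducibility):

```python
import numpy as np

rng = np.random.default_rng(0)
m, reps = 5, 200_000

z = rng.standard_normal((reps, m))   # reps draws of m independent N(0,1)
q = (z ** 2).sum(axis=1)             # should behave like chi-square with m df

# a chi-square with m df has mean m and variance 2m
print(round(q.mean(), 3), round(q.var(), 3))  # near 5 and 10
```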

11. The Beta distribution is defined by

$$f(X; \alpha, \beta) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\, X^{\alpha-1}(1-X)^{\beta-1} \qquad 0 < X < 1$$

and zero elsewhere, where $\alpha > 0$ and $\beta > 0$. This is a skewed continuous distribution.

(a) For $\alpha = \beta = 1$, this reverts back to the Uniform (0,1) probability density function. Show that $E(X) = \alpha/(\alpha+\beta)$ and $\text{var}(X) = \alpha\beta/[(\alpha+\beta)^2(\alpha+\beta+1)]$.

(b) Suppose that $\alpha = 1$; find the estimators of $\beta$ using the method of moments and the method of maximum likelihood.

12. The t-distribution with $r$ degrees of freedom can be defined as the ratio of two independent random variables: the numerator is a N(0,1) random variable and the denominator is the square root of a $\chi^2_r$ random variable divided by its degrees of freedom. The t-distribution is a symmetric distribution like the Normal distribution but with fatter tails. As $r \to \infty$, the t-distribution approaches the Normal distribution.

(a) Verify that if $X_1, \ldots, X_n$ are a random sample drawn from a $N(\mu, \sigma^2)$ distribution, then $z = (\bar{X} - \mu)/(\sigma/\sqrt{n})$ is N(0,1).

(b) Use the fact that $(n-1)s^2/\sigma^2 \sim \chi^2_{n-1}$ to show that $t = z/\sqrt{s^2/\sigma^2} = (\bar{X} - \mu)/(s/\sqrt{n})$ has a t-distribution with $(n-1)$ degrees of freedom. We use the fact that $s^2$ is independent of $\bar{X}$, without proving it.

(c) For $n = 16$, $\bar{x} = 20$ and $s^2 = 4$, construct a 95% confidence interval for $\mu$.
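Part (c) is a standard t-interval, $\bar{x} \pm t_{0.025,\,n-1}\, s/\sqrt{n}$. A sketch using scipy for the t quantile:

```python
import math
from scipy.stats import t

n, xbar, s2 = 16, 20, 4
half = t.ppf(0.975, n - 1) * math.sqrt(s2 / n)   # t quantile times s/sqrt(n)
lo, hi = xbar - half, xbar + half
print(round(lo, 2), round(hi, 2))  # roughly (18.93, 21.07)
```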

13. The F-distribution can be defined as the ratio of two independent $\chi^2$ random variables, each divided by its corresponding degrees of freedom. It is commonly used to test the equality of variances. Let $s_1^2$ be the sample variance from a random sample of size $n_1$ drawn from $N(\mu_1, \sigma_1^2)$ and let $s_2^2$ be the sample variance from another random sample of size $n_2$ drawn from $N(\mu_2, \sigma_2^2)$. We know that $(n_1 - 1)s_1^2/\sigma_1^2$ is $\chi^2_{n_1-1}$ and $(n_2 - 1)s_2^2/\sigma_2^2$ is $\chi^2_{n_2-1}$. Taking the ratio of these two independent $\chi^2$ random variables, each divided by its appropriate degrees of freedom, yields

$$F = \frac{s_1^2/\sigma_1^2}{s_2^2/\sigma_2^2}$$

which under the null hypothesis $H_0: \sigma_1^2 = \sigma_2^2$ gives $F = s_1^2/s_2^2$, distributed as F with $(n_1 - 1)$ and $(n_2 - 1)$ degrees of freedom. Both $s_1^2$ and $s_2^2$ are observable, so F can be computed and compared to critical values for the F-distribution with the appropriate degrees of freedom. Two inspectors drawing two random samples of size 25 and 31 from two shifts of a factory producing steel rods find that the sample variances of the lengths of these rods are 15.6 and 18.9 inches squared. Test whether the variances of the two shifts are the same.
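The numerical exercise just stated can be carried out as follows; placing the larger sample variance in the numerator, the ratio is compared with the upper 2.5% critical value of the F-distribution with (30, 24) degrees of freedom:

```python
from scipy.stats import f

n1, s1 = 25, 15.6   # shift 1: sample size and sample variance
n2, s2 = 31, 18.9   # shift 2

F = s2 / s1                              # larger variance in the numerator
df_num, df_den = n2 - 1, n1 - 1
crit = f.ppf(0.975, df_num, df_den)      # two-sided test at the 5% level
print(round(F, 3), round(crit, 3), F > crit)
```

Since the computed F falls well below the critical value, the equality of the two shift variances is not rejected at the 5% level.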

14. Moment Generating Function (MGF).

(a) Derive the MGF of the Binomial distribution defined in problem 4. Show that it is $[(1-\theta) + \theta e^t]^n$.

(b) Derive the MGF of the Normal distribution defined in problem 5. Show that it is $e^{\mu t + \frac{1}{2}\sigma^2 t^2}$.

(c) Derive the MGF of the Poisson distribution defined in problem 6. Show that it is $e^{\lambda(e^t - 1)}$.

(d) Derive the MGF of the Geometric distribution defined in problem 7. Show that it is $\theta e^t/[1 - (1-\theta)e^t]$.

(e) Derive the MGF of the Exponential distribution defined in problem 9. Show that it is $1/(1 - \theta t)$.

(f) Derive the MGF of the Gamma distribution defined in problem 10. Show that it is $(1 - \beta t)^{-\alpha}$. Conclude that the MGF of a $\chi^2_r$ is $(1 - 2t)^{-r/2}$.

(g) Obtain the mean and variance of each distribution by differentiating the corresponding MGF derived in parts (a) through (f).
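Part (g) can be spot-checked numerically: differentiating an MGF at $t = 0$ yields moments. The sketch below applies central finite differences to the Poisson MGF $e^{\lambda(e^t - 1)}$ from part (c):

```python
import math

lam = 3.0
M = lambda t: math.exp(lam * (math.exp(t) - 1.0))   # Poisson MGF

h = 1e-5
m1 = (M(h) - M(-h)) / (2 * h)             # ~ M'(0)  = E(X)   = lam
m2 = (M(h) - 2 * M(0) + M(-h)) / h**2     # ~ M''(0) = E(X^2) = lam + lam^2
var = m2 - m1 ** 2                        # ~ var(X) = lam

print(round(m1, 4), round(var, 4))
```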

15. Moment Generating Function Method.

(a) Show that if $X_1, \ldots, X_n$ are independent Poisson distributed with parameters $(\lambda_i)$ respectively, then $Y = \sum_{i=1}^n X_i$ is Poisson with parameter $\sum_{i=1}^n \lambda_i$.

(b) Show that if $X_1, \ldots, X_n$ are independent Normally distributed with parameters $(\mu_i, \sigma_i^2)$, then $Y = \sum_{i=1}^n X_i$ is Normal with mean $\sum_{i=1}^n \mu_i$ and variance $\sum_{i=1}^n \sigma_i^2$.

(c) Deduce from part (b) that if $X_1, \ldots, X_n$ are IIN$(\mu, \sigma^2)$, then $\bar{X} \sim N(\mu, \sigma^2/n)$.

(d) Show that if $X_1, \ldots, X_n$ are independent $\chi^2$ distributed with parameters $(r_i)$ respectively, then $Y = \sum_{i=1}^n X_i$ is $\chi^2$ distributed with parameter $\sum_{i=1}^n r_i$.

16. Best Linear Prediction. (Problems 16 and 17 are based on Amemiya (1994)). Let X and Y be two random variables with means $\mu_X$ and $\mu_Y$ and variances $\sigma_X^2$ and $\sigma_Y^2$, respectively. Suppose that

$$\rho = \text{correlation}(X, Y) = \sigma_{XY}/\sigma_X\sigma_Y$$

where $\sigma_{XY} = \text{cov}(X, Y)$. Consider the linear relationship $Y = \alpha + \beta X$, where $\alpha$ and $\beta$ are scalars:

(a) Show that the best linear predictor of Y based on X, where best in this case means the minimum mean squared error predictor which minimizes $E(Y - \alpha - \beta X)^2$ with respect to $\alpha$ and $\beta$, is given by $\hat{Y} = \hat{\alpha} + \hat{\beta}X$, where $\hat{\alpha} = \mu_Y - \hat{\beta}\mu_X$ and $\hat{\beta} = \sigma_{XY}/\sigma_X^2 = \rho\sigma_Y/\sigma_X$.

(b) Show that $\text{var}(\hat{Y}) = \rho^2\sigma_Y^2$ and that $u = Y - \hat{Y}$, the prediction error, has mean zero and variance equal to $(1 - \rho^2)\sigma_Y^2$. Therefore, $\rho^2$ can be interpreted as the proportion of $\sigma_Y^2$ that is explained by the best linear predictor $\hat{Y}$.

(c) Show that $\text{cov}(\hat{Y}, u) = 0$.

17. The Best Predictor. Let X and Y be the two random variables considered in problem 16. Now consider predicting Y by a general, possibly non-linear, function of X denoted by h(X).

(a) Show that the best predictor of Y based on X, where best in this case means the minimum mean squared error predictor that minimizes $E[Y - h(X)]^2$, is given by $h(X) = E(Y/X)$. Hint: Write $E[Y - h(X)]^2$ as $E\{[Y - E(Y/X)] + [E(Y/X) - h(X)]\}^2$. Expand the square and show that the cross-product term has zero expectation. Conclude that this mean squared error is minimized at $h(X) = E(Y/X)$.

(b) If X and Y are bivariate Normal, show that the best predictor of Y based on X is identical to the best linear predictor of Y based on X.

18. Descriptive Statistics. Using the data considered in section 2.6, based on 595 individuals drawn from the Panel Study of Income Dynamics for 1982 and available on the Springer web site as EARN.ASC, replicate the tables and graphs given in that section. More specifically:

(a) Replicate Table 2.1, which gives the descriptive statistics for a subset of the variables in this data set.

(b) Replicate Figures 2.8–2.11, which plot the histograms for log wage, weeks worked, education and experience.

(c) Replicate Table 2.2 which gives the average log wage for various groups and test the difference between these averages using a t-test.

(d) Replicate Figure 2.12, which plots log wage versus experience; Figure 2.13, which plots log wage versus education; and Figure 2.14, which plots log wage versus weeks worked.

(e) Replicate Table 2.3 which gives the correlation matrix among a subset of these variables.

19. Conflict Among Criteria for Testing Hypotheses: Examples from Non-normal Distributions. This is based on Baltagi (2000). Berndt and Savin (1977) showed that $W \geq LR \geq LM$ for the case of a multivariate regression model with normal disturbances. Ullah and Zinde-Walsh (1984) showed that this inequality is not robust to non-normality of the disturbances. In the spirit of the latter article, this problem considers simple examples from non-normal distributions and illustrates how this conflict among criteria is affected.

(a) Consider a random sample $x_1, x_2, \ldots, x_n$ from a Poisson distribution with parameter $\lambda$. Show that testing $H_0: \lambda = 3$ versus $H_1: \lambda \neq 3$ yields $W > LM$ for $\bar{x} < 3$ and $W < LM$ for $\bar{x} > 3$.

(b) Consider a random sample $x_1, x_2, \ldots, x_n$ from an Exponential distribution with parameter $\theta$. Show that testing $H_0: \theta = 3$ versus $H_1: \theta \neq 3$ yields $W > LM$ for $0 < \bar{x} < 3$ and $W < LM$ for $\bar{x} > 3$.

(c) Consider a random sample $x_1, x_2, \ldots, x_n$ from a Bernoulli distribution with parameter $\theta$. Show that for testing $H_0: \theta = 0.5$ versus $H_1: \theta \neq 0.5$, we will always get $W \geq LM$. Show also that for testing $H_0: \theta = 2/3$ versus $H_1: \theta \neq 2/3$, we get $W < LM$ for $1/3 < \bar{x} < 2/3$ and $W > LM$ for $2/3 < \bar{x} < 1$ or $0 < \bar{x} < 1/3$.
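For the Poisson case in part (a), the two statistics differ only in where the information $I(\lambda) = n/\lambda$ is evaluated: at the MLE $\bar{x}$ for Wald and at $\lambda_0$ for LM. A sketch verifying the direction of the inequality:

```python
def wald_lm_poisson(n, xbar, lam0):
    """Wald and LM statistics for H0: lambda = lam0 in a Poisson sample."""
    W = n * (xbar - lam0) ** 2 / xbar    # information evaluated at the MLE xbar
    LM = n * (xbar - lam0) ** 2 / lam0   # information evaluated at lam0
    return W, LM

# xbar below the null value: W exceeds LM
W1, LM1 = wald_lm_poisson(10, 2.0, 3.0)
# xbar above the null value: the ranking reverses
W2, LM2 = wald_lm_poisson(10, 4.0, 3.0)
print(W1 > LM1, W2 < LM2)
```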

References

More detailed treatment of the material in this chapter may be found in:

Amemiya, T. (1994), Introduction to Statistics and Econometrics (Harvard University Press: Cambridge).

Baltagi, B. H. (1994), “The Wald, LR, and LM Inequality,” Econometric Theory, Problem 94.1.2, 10: 223-224.

Baltagi, B. H. (2000), "Conflict Among Criteria for Testing Hypotheses: Examples from Non-Normal Distributions," Econometric Theory, Problem 00.2.4, 16: 288.

Bera, A. K. and G. Premaratne (2001), "General Hypothesis Testing," Chapter 2 in Baltagi, B. H. (ed.), A Companion to Theoretical Econometrics (Blackwell: Massachusetts).

Berndt, E. R. and N. E. Savin (1977), “Conflict Among Criteria for Testing Hypotheses in the Multivariate Linear Regression Model,” Econometrica, 45: 1263-1278.

Breusch, T. S. (1979), “Conflict Among Criteria for Testing Hypotheses: Extensions and Comments,” Econometrica, 47: 203-207.

Buse, A. (1982), “The Likelihood Ratio, Wald, and Lagrange Multiplier Tests: An Expository Note,” The American Statistician, 36 :153-157.

DeGroot, M. H. (1986), Probability and Statistics (Addison-Wesley: Mass.).

Freedman, D., R. Pisani, R. Purves and A. Adhikari (1991), Statistics (Norton: New York).

Freund, J. E. (1992), Mathematical Statistics (Prentice-Hall: New Jersey).

Hogg, R. V. and A. T. Craig (1995), Introduction to Mathematical Statistics (Prentice Hall: New Jersey).

Jolliffe, I. T. (1995), "Sample Sizes and the Central Limit Theorem: The Poisson Distribution as an Illustration," The American Statistician, 49: 269.

Kennedy, P. (1992), A Guide to Econometrics (MIT Press: Cambridge).

Mood, A. M., F. A. Graybill and D. C. Boes (1974), Introduction to the Theory of Statistics (McGraw-Hill: New York).

Spanos, A. (1986), Statistical Foundations of Econometric Modelling (Cambridge University Press: Cam­bridge).

Ullah, A. and V. Zinde-Walsh (1984), “On the Robustness of LM, LR and W Tests in Regression Models,” Econometrica, 52: 1055-1065.

Zellner, A. (1971), An Introduction to Bayesian Inference in Econometrics (Wiley: New York).

Appendix

Score and Information Matrix: The likelihood function of a sample $X_1, \ldots, X_n$ drawn from $f(X_i; \theta)$ is really the joint probability density function written as a function of $\theta$:

$$L(\theta) = f(X_1, \ldots, X_n; \theta)$$

This probability density function has the property that $\int L(\theta)\,dx = 1$, where the integral is over all $X_1, \ldots, X_n$, written compactly as one integral over $x$. Differentiating this multiple integral with respect to $\theta$, one gets

$$\int \frac{\partial L}{\partial \theta}\,dx = \int \frac{\partial \log L}{\partial \theta}\, L\,dx = 0$$

But the score is by definition $S(\theta) = \partial \log L/\partial \theta$. Hence, $E[S(\theta)] = 0$. Differentiating again with respect to $\theta$, one gets

$$\int \frac{\partial^2 \log L}{\partial \theta^2}\, L\,dx + \int \left(\frac{\partial \log L}{\partial \theta}\right)^2 L\,dx = 0$$

or

$$-E\left[\frac{\partial^2 \log L}{\partial \theta^2}\right] = E[S(\theta)]^2$$

But $\text{var}[S(\theta)] = E[S(\theta)]^2$ since $E[S(\theta)] = 0$. Hence, the information $I(\theta) = -E[\partial^2 \log L/\partial \theta^2] = \text{var}[S(\theta)]$.

Moment Generating Function (MGF): For the random variable X, the expected value of a special function of X, namely $e^{Xt}$, is denoted by

$$M_X(t) = E(e^{Xt}) = E\left(1 + Xt + X^2\frac{t^2}{2!} + X^3\frac{t^3}{3!} + \cdots\right)$$

where the second equality follows from the Taylor series expansion of $e^{Xt}$ around zero. Therefore,

$$M_X(t) = 1 + E(X)t + E(X^2)\frac{t^2}{2!} + E(X^3)\frac{t^3}{3!} + \cdots$$

This function of t generates the moments of X as coefficients of an infinite polynomial in t. For example, $\mu = E(X)$ is the coefficient of $t$, and $E(X^2)/2$ is the coefficient of $t^2$, etc. Alternatively, one can differentiate this MGF with respect to t and obtain $\mu = E(X) = M_X'(0)$, i.e., the first derivative of $M_X(t)$ with respect to t evaluated at $t = 0$. Similarly, $E(X^r) = M_X^{(r)}(0)$, the r-th derivative of $M_X(t)$ with respect to t evaluated at $t = 0$. For example, for the Bernoulli distribution:

$$M_X(t) = E(e^{Xt}) = \sum_{X=0}^{1} e^{Xt}\theta^X(1-\theta)^{1-X} = \theta e^t + (1-\theta)$$

so that $M_X'(t) = \theta e^t$ and $M_X'(0) = \theta = E(X)$. Also $M_X''(t) = \theta e^t$, which means that $E(X^2) = M_X''(0) = \theta$. Hence,

$$\text{var}(X) = E(X^2) - [E(X)]^2 = \theta - \theta^2 = \theta(1-\theta)$$

For the Normal distribution, see problem 14, it is easy to show that if $X \sim N(\mu, \sigma^2)$, then $M_X(t) = e^{\mu t + \frac{1}{2}\sigma^2 t^2}$, with $M_X'(0) = E(X) = \mu$ and $M_X''(0) = E(X^2) = \sigma^2 + \mu^2$.

There is a one-to-one correspondence between the MGF, when it exists, and the corresponding p.d.f. This means that if Y has an MGF given by $e^{2t + 4t^2}$, then Y is normally distributed with mean 2 and variance 8. Similarly, if Z has an MGF given by $(e^t + 1)/2$, then Z is Bernoulli distributed with $\theta = 1/2$.

Change of Variable: If $X \sim N(0,1)$, then one can find the distribution function of $Y = |X|$ by using the Distribution Function method. By definition, the distribution function of Y is

$$G(y) = \Pr[Y \leq y] = \Pr[|X| \leq y] = \Pr[-y \leq X \leq y] = \Pr[X \leq y] - \Pr[X \leq -y] = F(y) - F(-y)$$

so that the distribution function of Y, G(y), can be obtained from the distribution function of X, F(x). Since the N(0,1) distribution is symmetric around zero, $F(-y) = 1 - F(y)$, and substituting that in G(y) we get $G(y) = 2F(y) - 1$. Recall that the p.d.f. of Y is given by $g(y) = G'(y)$. Hence, $g(y) = f(y) + f(-y)$, and this reduces to $2f(y)$ if the distribution is symmetric around zero. So if $f(x) = e^{-x^2/2}/\sqrt{2\pi}$ for $-\infty < x < +\infty$, then $g(y) = 2f(y) = 2e^{-y^2/2}/\sqrt{2\pi}$ for $y \geq 0$.

Let us now find the distribution of $Z = X^2$, the square of a N(0,1) random variable. Note that $dZ/dX = 2X$, which is positive when $X > 0$ and negative when $X < 0$. The change of variable method cannot be applied since $Z = X^2$ is not a monotonic transformation over the entire domain of X. However, using $Y = |X|$, we get $Z = Y^2$ and $dZ/dY = 2Y$, which is always non-negative since Y is non-negative. In this case, the change of variable method states that the p.d.f. of Z is obtained from that of Y by substituting the inverse transformation $Y = \sqrt{Z}$ into the p.d.f. of Y and multiplying it by the absolute value of the derivative of the inverse transformation:

$$h(z) = g(\sqrt{z})\left|\frac{dy}{dz}\right| = \frac{2e^{-z/2}}{\sqrt{2\pi}} \cdot \frac{1}{2\sqrt{z}} = \frac{z^{-1/2}e^{-z/2}}{\sqrt{2\pi}} \qquad z > 0$$

It is clear why this transformation will not work for X, since $Z = X^2$ has two solutions for the inverse transformation, $X = \pm\sqrt{Z}$, whereas there is one unique solution for $Y = \sqrt{Z}$ since it is non-negative. Using the results of problem 10, one can deduce that Z has a Gamma distribution with $\alpha = 1/2$ and $\beta = 2$. This special Gamma density function is a $\chi^2$ distribution with 1 degree of freedom. Hence, we have shown that the square of a N(0,1) random variable has a $\chi^2_1$ distribution.

Finally, if $X_1, \ldots, X_n$ are independently distributed, then the distribution function of $Y = \sum_{i=1}^n X_i$ can be obtained from that of the $X_i$'s using the Moment Generating Function (MGF) method:

$$M_Y(t) = E(e^{Yt}) = E\left[e^{(\sum_{i=1}^n X_i)t}\right] = E(e^{X_1 t})E(e^{X_2 t})\cdots E(e^{X_n t}) = M_{X_1}(t)M_{X_2}(t)\cdots M_{X_n}(t)$$

If in addition these $X_i$'s are identically distributed, then $M_{X_i}(t) = M_X(t)$ for $i = 1, \ldots, n$ and

$$M_Y(t) = [M_X(t)]^n$$

For example, if $X_1, \ldots, X_n$ are IID Bernoulli($\theta$), then $M_{X_i}(t) = M_X(t) = \theta e^t + (1-\theta)$ for $i = 1, \ldots, n$. Hence the MGF of $Y = \sum_{i=1}^n X_i$ is given by

$$M_Y(t) = [M_X(t)]^n = [\theta e^t + (1-\theta)]^n$$

This can easily be shown to be the MGF of the Binomial distribution given in problem 14. This proves that the sum of $n$ independent and identically distributed Bernoulli random variables with parameter $\theta$ is a Binomial random variable with the same parameter $\theta$.

Central Limit Theorem: If $X_1, \ldots, X_n$ are IID$(\mu, \sigma^2)$ from an unknown distribution, then $Z = (\bar{X} - \mu)/(\sigma/\sqrt{n})$ is asymptotically distributed as N(0,1).

Proof: We assume that the MGF of the $X_i$'s exists and derive the MGF of Z. Next, we show that $\lim_{n \to \infty} M_Z(t) = e^{t^2/2}$, which is the MGF of the N(0,1) distribution. First, note that

$$Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} = \frac{Y - n\mu}{\sigma\sqrt{n}}$$

where $Y = \sum_{i=1}^n X_i$ with $M_Y(t) = [M_X(t)]^n$. Therefore,

$$M_Z(t) = E(e^{Zt}) = E\left(e^{(Yt - n\mu t)/\sigma\sqrt{n}}\right) = e^{-n\mu t/\sigma\sqrt{n}}\, E\left(e^{Yt/\sigma\sqrt{n}}\right) = e^{-\sqrt{n}\mu t/\sigma}\left[M_X\!\left(\frac{t}{\sigma\sqrt{n}}\right)\right]^n$$

Taking the log of both sides, we get

$$\log M_Z(t) = -\frac{\sqrt{n}\mu t}{\sigma} + n\log M_X\!\left(\frac{t}{\sigma\sqrt{n}}\right)$$

Using the Taylor series expansion $\log(1+s) = s - \frac{s^2}{2} + \frac{s^3}{3} - \cdots$ with $s = E(X)\frac{t}{\sigma\sqrt{n}} + E(X^2)\frac{t^2}{2\sigma^2 n} + \cdots$, and collecting powers of t, the terms in $t$ cancel and the coefficient of $t^2$ is $[E(X^2) - \mu^2]/2\sigma^2 = 1/2$. Therefore

$$\log M_Z(t) = \frac{1}{2}t^2 + \frac{E(X^3) - 3\mu E(X^2) + 2\mu^3}{6\sigma^3\sqrt{n}}\,t^3 + \cdots$$

Note that the coefficient of $t^3$ is $1/\sqrt{n}$ times a constant. Therefore, this coefficient goes to zero as $n \to \infty$. Similarly, it can be shown that the coefficient of $t^r$ is $1/n^{(r-2)/2}$ times a constant for $r > 3$. Hence,

$$\lim_{n \to \infty} \log M_Z(t) = \frac{1}{2}t^2 \quad \text{and} \quad \lim_{n \to \infty} M_Z(t) = e^{\frac{1}{2}t^2}$$

which is the MGF of a standard normal distribution.

The Central Limit Theorem is a powerful tool for asymptotic inference. In real life we do not know what distribution we are sampling from, but as long as the sample drawn is random and we average (or sum) and standardize, then as $n \to \infty$, the resulting standardized statistic has an asymptotic N(0,1) distribution that can be used for inference.

Using a random number generator from say the uniform distribution on the computer, one can generate samples of size n = 20, 30, 50 from this distribution and show how the sampling distribution of the sum (or average) when it is standardized closely approximates the N(0, 1) distribution.
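The experiment just described can be sketched in a few lines: draw Uniform(0,1) samples of size n = 50, standardize the sample mean, and compare the simulated distribution with N(0,1) (seeded here; exact figures vary with the seed):

```python
import numpy as np

rng = np.random.default_rng(42)
n, reps = 50, 100_000

x = rng.random((reps, n))                     # Uniform(0,1): mean 1/2, variance 1/12
z = (x.mean(axis=1) - 0.5) / np.sqrt((1 / 12) / n)

# the standardized means should look approximately standard normal
print(round(z.mean(), 3), round(z.std(), 3), round(np.mean(np.abs(z) < 1.96), 3))
```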

The real question for the applied researcher is how large n should be to invoke the Central Limit Theorem. This depends on the distribution we are drawing from. For a Bernoulli distribution, a larger n is needed the more asymmetric this distribution is, i.e., if $\theta = 0.1$ rather than 0.5.

In fact, Figure 2.15 shows the Poisson distribution with mean 15. This looks like a good approximation to a Normal distribution even though it is a discrete probability function. Problem 15 shows that the sum of n independent identically distributed Poisson random variables with parameter $\lambda$ is a Poisson random variable with parameter $n\lambda$. This means that if $\lambda = 0.15$, an n of 100 will lead to the distribution of the sum being Poisson($n\lambda = 15$), and the Central Limit Theorem seems well approximated.

Figure 2.15 Poisson Probability Distribution, Mean = 15

However, if $\lambda = 0.015$, an n of 100 will lead to the distribution of the sum being Poisson($n\lambda = 1.5$), which is given in Figure 2.16. This Poisson probability function is skewed and discrete and does not approximate well a normal density. This shows that one has to be careful in concluding that $n = 100$ is a large enough sample for the Central Limit Theorem to apply. We showed in this simple example that this depends on the distribution we are sampling from. This is true for Poisson($\lambda = 0.15$) but not Poisson($\lambda = 0.015$), see Jolliffe (1995). The same idea can be illustrated with a skewed Bernoulli distribution.

Conditional Mean and Variance: Two random variables X and Y are bivariate Normal if they have the following joint distribution:

$$f(x, y) = \frac{1}{2\pi\sigma_X\sigma_Y\sqrt{1-\rho^2}} \exp\left\{-\frac{1}{2(1-\rho^2)}\left[\left(\frac{x-\mu_X}{\sigma_X}\right)^2 - 2\rho\left(\frac{x-\mu_X}{\sigma_X}\right)\left(\frac{y-\mu_Y}{\sigma_Y}\right) + \left(\frac{y-\mu_Y}{\sigma_Y}\right)^2\right]\right\}$$

where $-\infty < x < +\infty$, $-\infty < y < +\infty$, $E(X) = \mu_X$, $E(Y) = \mu_Y$, $\text{var}(X) = \sigma_X^2$, $\text{var}(Y) = \sigma_Y^2$ and $\rho = \text{correlation}(X, Y) = \text{cov}(X, Y)/\sigma_X\sigma_Y$. This joint density can be rewritten as

$$f(x, y) = f_1(x)f(y/x)$$

where $f_1(x)$ is the marginal density of X and $f(y/x)$ is the conditional density of Y given X. In this case, $X \sim N(\mu_X, \sigma_X^2)$ and Y/X is Normal with mean $E(Y/X) = \mu_Y + \rho\frac{\sigma_Y}{\sigma_X}(x - \mu_X)$ and variance given by $\text{var}(Y/X) = \sigma_Y^2(1 - \rho^2)$.

By symmetry, the roles of X and Y can be interchanged and one can write $f(x, y) = f(x/y)f_2(y)$, where $f_2(y)$ is the marginal density of Y. In this case, $Y \sim N(\mu_Y, \sigma_Y^2)$ and X/Y is Normal with mean $E(X/Y) = \mu_X + \rho\frac{\sigma_X}{\sigma_Y}(y - \mu_Y)$ and variance given by $\text{var}(X/Y) = \sigma_X^2(1 - \rho^2)$. If $\rho = 0$, then $f(y/x) = f_2(y)$ and $f(x, y) = f_1(x)f_2(y)$, proving that X and Y are independent. Therefore, if $\text{cov}(X, Y) = 0$ and X and Y are bivariate Normal, then X and Y are independent. In general, $\text{cov}(X, Y) = 0$ alone does not necessarily imply independence, see problem 3.

One important and useful property is the law of iterated expectations. This says that the expectation of any function of X and Y, say h(X, Y), can be obtained as follows:

$$E[h(X, Y)] = E_X E_{Y/X}[h(X, Y)]$$

where the subscript Y/X on E means the conditional expectation of Y given that X is treated as a constant. The next expectation $E_X$ treats X as a random variable. The proof is simple:

$$E[h(X, Y)] = \int_{-\infty}^{+\infty}\int_{-\infty}^{+\infty} h(x, y)f(x, y)\,dx\,dy = \int_{-\infty}^{+\infty}\left[\int_{-\infty}^{+\infty} h(x, y)f(y/x)\,dy\right]f_1(x)\,dx = E_X E_{Y/X}[h(X, Y)]$$

Example: This law of iterated expectations can be used to show that, for the bivariate Normal density, the parameter $\rho$ is indeed the correlation coefficient of X and Y. In fact, let $h(X, Y) = XY$; then

$$E(XY) = E_X E_{Y/X}(XY/X) = E_X[X\,E(Y/X)] = E_X\, X\left[\mu_Y + \rho\frac{\sigma_Y}{\sigma_X}(X - \mu_X)\right] = \mu_X\mu_Y + \rho\frac{\sigma_Y}{\sigma_X}\sigma_X^2 = \mu_X\mu_Y + \rho\sigma_Y\sigma_X$$

Rearranging terms, one gets $\rho = [E(XY) - \mu_X\mu_Y]/\sigma_X\sigma_Y = \sigma_{XY}/\sigma_X\sigma_Y$, as required.

Another useful result pertains to the unconditional variance of h(X, Y) being the sum of the mean of the conditional variance and the variance of the conditional mean:

$$\text{var}[h(X, Y)] = E_X \text{var}_{Y/X}[h(X, Y)] + \text{var}_X E_{Y/X}[h(X, Y)]$$

Proof: We will write h(X, Y) as h to simplify the presentation. First,

$$\text{var}_{Y/X}(h) = E_{Y/X}(h^2) - [E_{Y/X}(h)]^2$$

and taking expectations with respect to X yields

$$E_X\text{var}_{Y/X}(h) = E_X E_{Y/X}(h^2) - E_X[E_{Y/X}(h)]^2 = E(h^2) - E_X[E_{Y/X}(h)]^2$$

Also,

$$\text{var}_X E_{Y/X}(h) = E_X[E_{Y/X}(h)]^2 - \left(E_X[E_{Y/X}(h)]\right)^2 = E_X[E_{Y/X}(h)]^2 - [E(h)]^2$$

Adding these two terms yields

$$E(h^2) - [E(h)]^2 = \text{var}(h)$$
