# The EM Algorithm

The EM algorithm is a general iterative method for obtaining the MLE; it was first proposed by Hartley (1958) and was generalized by Dempster, Laird, and Rubin (1977) to a form that is especially suited for censored regression models such as Tobit models. We shall first present the definition and the properties of the EM algorithm under a general setting and then apply it to the standard Tobit model. It can also be applied to general Tobit models.

We shall explain the EM algorithm in a general model where a vector of observable variables z is related to a vector of unobservable variables y* in such a way that the value of y* uniquely determines the value of z but not vice versa. In the Tobit model, {yf} defined in (10.2.3) constitute the elements of y*, and {y,) and {wt) defined in (10.2.4) and (10.4.3), respectively, constitute the elements of z. Let the joint density or probability of y* be /(у*) and let the joint density or probability of z be g(z). Also, define k(y*|z) = f(y*)/g(z). Note that /(y*, z) = /(z|y*)/(y*) =/(y*) because /(z|y*) = 1 inasmuch as y* uniquely determines z. We implicitly assume that / g, and к depend on a vector of parameters в. The purpose is to maximize

Ц0) = n~l logg(z) = n~1 log/(y*) — n~l log k(y*z) (10.4.42)

with respect to в. Define

G(0|0,) = Щпlog/(y*|0)|z, 0,], (10.4.43)

where we are taking the expectation assuming 0, is the true parameter value and doing this conditional on z. Then the EM algorithm purports to maximize L(0) by maximizing Q(0|0,) with respect to 0 where 0, is given at each step of the iteration. The “E” of the name “EM” refers to the expectation taken in (10.4.43) and the “M” refers to the maximization of (10.4.43).

Consider the convergence properties of the EM algorithm. Define

#(0|0,) = E[n~l log fc(y*|z, 0)|z, 0,]. (10.4.44)

Then we have from (10.4.42), (10.4.43), and (10.4.44) and the fact that L(0|0,) – E[n-1 log g(z)|z, 0,] = L(0) L(0) = G(0|0,)-tf(0|0,).

But we have by Jensen’s inequality (4.2.6)

Н(вв1)<Щ61в1) for 0#0,. (10.4.46)

Now, given 0,, let M(0,) maximize G(0|0,) with respect to 0. Then we have

L(M) = <2(M|0,) – Я(М|0,). (10.4.47)

But, because <2(М |0!) S 6(010,) by definition and Я(М|0,) ё Я(0,|0І) by (10.4.46), we have from (10.4.45) and (10.4.47)

L(M) г L(0t). (10.4.48)

Thus we have proved the desirable property that L always increases or stays constant at each step of the EM algorithm.

The preceding result implies that if L is bounded, then lim^» L(0r) exists. Let 0* satisfy the equality lim^,. L(0r) = L(0*). (0* exists if the range of L (0) is closed.) We shall show that if 0* is a stationary point of the EM algorithm, then 0* is a stationary point of L. For this we assume L is twice differentiable. Differentiating (10.4.45) with respect to 0 and evaluating the derivative at 0 = 0t, we obtain

But the last term of the right-hand side of (10.4.49) is 0 because of (10.4.46). Therefore, if 0! is a stationary point of 6(0|0,), it is a stationary point of L.

Unfortunately, a local maximum of 0(0|0,) with respect to 0 may not be a local maximum (let alone the global maximum) of L(0) because the negative definiteness of [02<2(0|0,)/0000/]ві does not imply the negative definiteness of [д1Ь/д0д6′]ві. However, this is true of any iterative method commonly used. See Wu (1983) for further discussion of the convergence properties of the EM algorithm.

Now consider an application of the algorithm to the Tobit model.9 Define 0 = (/?’, a2) as before. Then in the Tobit model we have

log/(y*|0) – log <72 – 2^ І (У? ~ (Ю.4.50)

respect to fi and a2

ЦЪ*Лу*№у, w, 0,1 (10.4.51)

= ~ log а2-^(Г "2^1 Д(У? – *;/№, = 0, б,]

—- log <x2-2^2 £ O’, " x;/d2 ~ 2^2 ^ [WK, = 0, 0,) – x;/42

-ylWkrO, Q,

where WK – = 0,0,) = XJA – Y^- – y?   and

where фп = ф{х’,р, /a,) and Фа = Ф(х;Д, /а,).

From (10.4.51) it is clear that the second-round estimate of)? in the EM algorithm, denoted p2, is obtained as follows: Assume without loss of general­ity that the first n, observations of y, are positive and call the vector of those observations y. Next, define an (n — n, )-vector y° the elements of which are the y° defined in (10.4.52). Then we have

A = (X,2D-‘X'[^)], (10.4.54)

where X was defined after (10.2.4). In other words, the EM algorithm amounts to predicting all the unobservable values of yf by their conditional expecta­tions and treating the predicted values as if they were the observed values. The second-round estimate of a2, denoted a, is given by

o = и-1 [x (и – + (10.4.55)

L і о

+ X WK-0,0,)}

We can directly show that the MLE 0 is the equilibrium solution of the iteration defined by (10.4.54) and (10.4.55). Partition X = (X’, X0′)’ so that

X is multiplied by у and X° by y°. Then inserting § into both sides of (10.4.54) yields, after collecting terms, (10.4.56)

where the last bracket denotes an (n — n,)-dimensional vector the typical element of which is given inside. Now, setting the derivative of log L with respect to fi equal to 0 yields (10.4.57)

But, clearly, (10.4.56) is equivalent to (10.4.57). Similarly, the normal equa­tion for a2 can be shown to be equivalent to (10.4.55).

Schmee and Hahn (1979) performed a simulation study of the EM algo­rithm applied to a censored regression model (a survival model) defined by

Уі = УЇ if yf^c = c if y*> c,

where y* ~ N(a + Дх,, a2). They generally obtained rapid convergence.