Ph.D. Dissertation: Repeated Measures with Censored Data

Chapter 6   Sampling in MCEM Algorithm

In this chapter, we detail the sampling schemes needed for the MCEM algorithm implementations outlined in Theorems 4.2 and 4.5. The augmented Gibbs sampler developed in Theorem 6.3 is very efficient and is particularly well suited to sampling from truncated distributions and to EM-style algorithms.

 
6.1   Sampling Scheme

Let us first briefly review some widely used sampling methods and algorithms, which are relevant to our implementation schemes.

 
Inverse Cumulative Distribution Method

This is a well known method for generating a (univariate) variable X with cumulative distribution function F. It is particularly attractive in designing efficient Monte Carlo methods (Johnson, 1987; Olkin et al., 1980). The method is stated simply as

Lemma 6.1.   A variate X under distribution function F can be generated as follows.

(1). Generate a variate U from uniform distribution U(0,1).

(2). Let X = sup{x: F(x) ≤ U}. If F is strictly increasing, this gives X = F⁻¹(U).
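
For illustration, the following short Python sketch applies Lemma 6.1 to the exponential distribution, where F(x) = 1 - exp(-λx) can be inverted in closed form; the rate and sample size are arbitrary choices for the example.

    import numpy as np

    rng = np.random.default_rng(0)

    def draw_exponential(lam, size):
        """Inverse-CDF draws from Exp(lam): solve F(x) = 1 - exp(-lam*x) = u for x."""
        u = rng.uniform(size=size)          # step (1): U ~ U(0, 1)
        return -np.log(1.0 - u) / lam       # step (2): x = F^{-1}(u)

    sample = draw_exponential(lam=2.0, size=10_000)   # arbitrary rate and sample size
    print(sample.mean())                    # should be close to 1/lam = 0.5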

 
Method of Composition

The method of composition (Tanner, 1996) provides a way to generate multivariate random variables using only univariate sampling steps.

Lemma 6.2.   Suppose ƒ(y|x) and g(x) are density functions where x and y may be vectors. Repeat the following two steps m times.

(1). Draw x* ~ g(x).

(2). Draw y*~ ƒ(y|x*).

The pairs (x1, y1), ..., (xm, ym) are an i.i.d. sample from the joint density h(x, y) = ƒ(y|x)g(x), while y1, ..., ym are an i.i.d. sample from the marginal density ∫ƒ(y|x)g(x) dx.

In principle, we can generate random variables of any dimension with this lemma. The method is a key technique underlying data-augmentation algorithms.
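
As a minimal Python sketch of Lemma 6.2, take g(x) to be a gamma density and ƒ(y|x) a Poisson density (an arbitrary choice for this example), so that the y's form an i.i.d. sample from the resulting negative-binomial marginal.

    import numpy as np

    rng = np.random.default_rng(0)
    m = 10_000

    # step (1): x* ~ g(x), here Gamma(shape=3, scale=1)
    x = rng.gamma(shape=3.0, scale=1.0, size=m)
    # step (2): y* ~ f(y | x*), here Poisson with mean x*
    y = rng.poisson(lam=x)

    # (x_j, y_j) are i.i.d. from the joint h(x, y) = f(y|x) g(x);
    # y_1, ..., y_m are i.i.d. from the marginal (a negative binomial).
    print(y.mean(), y.var())   # marginal mean 3, variance 3 + 3 = 6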

 
Gibbs Sampler

The Gibbs sampler is a powerful Markov chain Monte Carlo (MCMC) data augmentation technique (Geman & Geman, 1984; Gelfand & Smith, 1990). It has been widely utilized in image processing and in other large-scale models such as neural networks and expert systems. The general problem is to draw (dependent) samples from a very complex multivariate density. Suppose that y is a k×1 random vector with density ƒ(y) and that the following conditional distributions are known:

yi|(y1, ..., yi-1, yi+1, ..., yk) ~ ƒi(yi|y1, ..., yi-1, yi+1, ..., yk),    (i = 1, ..., k).

Lemma 6.3. (Gibbs Sampler)   Given an arbitrary starting point y(0) = (y1(0), ..., yk(0))' in the support of ƒ(y), iteration i+1 obtains y(i+1) = (y1(i+1), ..., yk(i+1))' as follows.

For j = 1, ..., k, draw yj(i+1) from ƒj(yj|y1(i+1), ..., yj-1(i+1), yj+1(i), ..., yk(i)).

The vectors y(0), y(1), y(2), ... form a realization of a Markov chain whose transition kernel from y(i) to y(i+1) is the product of the full conditional densities used above. Under very mild regularity conditions, the joint distribution of (y1(i), y2(i), ..., yk(i)) converges geometrically to the unique invariant distribution ƒ(y) in the L1 norm (Chan, 1993; Geman & Geman, 1984). More importantly, though these y(i)'s are not independent, the following property still holds. For any integrable function g(y),

(1/m) ∑i=1m g(y(i)) → ∫ g(y) ƒ(y) dy   almost surely as m → ∞.
These results suggest that if the chain is run to equilibrium, one can use the simulated data as a basis for summarizing ƒ(y), and the average of a function of interest over values from a chain yields a consistent estimator of its expectation.

In practice, we need to discard some initial iterations as "burn-in" steps so that the retained draws are close to the target distribution. If an approximately independent sample is desired, multiple independent chains with different starting points, or suitable spacing between retained realizations, can be employed. Also, determining run length and assessing convergence are important practical issues (Brooks & Gelman, 1998; Tanner, 1996). A good exposition of the Gibbs sampler can be found in Casella and George (1992).
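
For illustration, the following Python sketch runs the Gibbs sampler of Lemma 6.3 on a bivariate normal target with correlation ρ, using the standard full conditionals y1|y2 ~ N(ρy2, 1-ρ²) and y2|y1 ~ N(ρy1, 1-ρ²), and discards a burn-in period as discussed above; the value of ρ and the run length are arbitrary.

    import numpy as np

    rng = np.random.default_rng(0)
    rho, n_iter, burn_in = 0.8, 20_000, 1_000   # arbitrary settings for the example
    sd = np.sqrt(1.0 - rho**2)

    y1, y2 = 0.0, 0.0                       # arbitrary starting point
    draws = np.empty((n_iter, 2))
    for i in range(n_iter):
        y1 = rng.normal(rho * y2, sd)       # draw y1 | y2
        y2 = rng.normal(rho * y1, sd)       # draw y2 | y1
        draws[i] = (y1, y2)

    draws = draws[burn_in:]                 # discard burn-in iterations
    print(draws.mean(axis=0))               # ergodic averages, close to (0, 0)
    print(np.corrcoef(draws.T)[0, 1])       # close to rho = 0.8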

Two other widely used sampling algorithms are data augmentation (Tanner & Wong, 1987) and sampling-importance-resampling (Rubin, 1987). Gelfand and Smith (1990) compared them with the Gibbs sampler for the calculation of marginal densities and reported findings favorable to the Gibbs sampler and the data-augmentation algorithm. As they argued, the two iterative methods are closely related, and the latter can be somewhat faster than the Gibbs sampler when further reduced conditional densities are available.

 
6.2   Sampling from Conditional Densities k1 and k2

In this section, we discuss sampling schemes for general distributions. From the next section on, we deal with the multivariate normal distribution in particular, for which a rather attractive Gibbs sampler is developed; the construction extends to many other distributions.

Sampling from Conditional Distribution k1

When some components of y are not completely known, we need to generate random samples of y from the conditional distribution k1(y|u, θ) given in equation (4.6).

This is a density function over 𝒴*, the observed range of the incomplete components of y. It is easy to rewrite it as a conditional distribution,

k1(y*|y+, u, θ) = h1(y*|y+, u, θ) I(y* ∈ 𝒴*) / ∫𝒴* h1(y*|y+, u, θ) dy*,    (6.1)

which is a truncated version of the density h1(y*|y+, u, θ). Sampling from such a distribution, especially the truncated multivariate normal distribution, has been very well studied in the literature and can usually be handled by the Gibbs sampler or by other Markov chain Monte Carlo (MCMC) methods such as the Metropolis-Hastings algorithm (Metropolis et al., 1953; Hastings, 1970). As an added advantage of these MCMC methods, the integral calculation in the denominator of (6.1) can usually be avoided.

Sampling from Conditional Distribution k2

The key step is to generate random samples of u from the conditional distribution k2(u|𝒴, θ) given in equation (4.7), which can be written as

k2(u|𝒴, θ) ∝ [∏i=1n ∫𝒴i h1(yi|u, θ) dyi] h2(u|θ),    (6.2)

where 𝒴i denotes the observed range of yi and 𝒴 = (𝒴1, ..., 𝒴n).

For simplicity, assume that u is a scalar; it is not difficult to extend the argument to a vector. Denote the cumulative distribution function of k2(u|𝒴, θ) by

K2(v) = ∫u ≤ v k2(u|𝒴, θ) du,

which is continuous in our context.

By the inverse cumulative distribution method in Lemma 6.1, we have

Lemma 6.4.   To generate a sample of size m from k2(u|𝒴, θ), proceed as follows.

(1). Draw a sample a(1), a(2), ..., a(m) from uniform distribution U(0, 1).

(2). Solve the equation K2(u) = a(i) for u to get u(i).

Then u(1), u(2), ..., u(m) is a random sample from k2(u|𝒴, θ).
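
Step (2) of Lemma 6.4 only requires the ability to evaluate K2 at a point. The following Python sketch illustrates the bisection solve with an arbitrary continuous stand-in CDF (a two-component normal mixture); the bracketing interval and tolerance are assumptions of the example.

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(0)

    def K(u):                                  # stand-in for K2(u): a normal-mixture CDF
        return 0.3 * norm.cdf(u, -2, 1) + 0.7 * norm.cdf(u, 1, 0.5)

    def invert_by_bisection(a, lo=-10.0, hi=10.0, tol=1e-8):
        """Solve K(u) = a for u, assuming [lo, hi] brackets the solution."""
        while hi - lo > tol:
            mid = 0.5 * (lo + hi)
            if K(mid) < a:
                lo = mid
            else:
                hi = mid
        return 0.5 * (lo + hi)

    a = rng.uniform(size=5)                    # step (1) of Lemma 6.4
    u = np.array([invert_by_bisection(ai) for ai in a])   # step (2)
    print(np.column_stack([a, K(u)]))          # K(u) reproduces a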

The equations in step (2) can be solved by simple numerical algorithms such as the bisection method or Newton's method. The problem now becomes how to calculate K2(u) for a given u. Denote

h0(u|𝒴, θ) = ∏i=1n ∫𝒴i h1(yi|u, θ) dyi;    (6.3)

then k2(u|𝒴, θ) ∝ h0(u|𝒴, θ) h2(u|θ). Let I(.) be the indicator function. For a given v,

K2(v) = E[h0(u|𝒴, θ) I(u ≤ v)] / E[h0(u|𝒴, θ)].    (6.4)

Now we need to calculate the two expectations above, both taken with respect to h2(u|θ).

Lemma 6.5.   For any given v, the two expectations in (6.4) can be approximated as follows with a sample of size m.

(1). Draw a sample u(1), u(2), ..., u(m) from h2(u|θ).

(2). Calculate the expectations as

E[h0(u|𝒴, θ) I(u ≤ v)] ≈ (1/m) ∑j=1m h0(u(j)|𝒴, θ) I(u(j) ≤ v)   and   E[h0(u|𝒴, θ)] ≈ (1/m) ∑j=1m h0(u(j)|𝒴, θ),

and approximate K2(v) by the ratio of the two averages.

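As a small numerical illustration of Lemma 6.5, the Python sketch below takes h2(u|θ) to be standard normal and uses an arbitrary positive function in place of h0(u|𝒴, θ); K2(v) is then estimated by the ratio of the two Monte Carlo averages.

    import numpy as np

    rng = np.random.default_rng(0)
    m = 100_000

    def h0(u):                      # arbitrary stand-in for h0(u | Y, theta) in (6.3)
        return np.exp(-0.5 * (u - 1.0) ** 2)

    u = rng.normal(size=m)          # step (1): u^(1), ..., u^(m) ~ h2(u | theta) = N(0, 1)
    w = h0(u)

    def K2_hat(v):                  # step (2): ratio of the two Monte Carlo averages
        return np.mean(w * (u <= v)) / np.mean(w)

    print(K2_hat(0.0), K2_hat(0.5), K2_hat(2.0))   # increasing in v, between 0 and 1
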
Now we need to calculate h0(u|𝒴, θ) when u is known, that is, to calculate each factor ∫𝒴i h1(yi|u, θ) dyi in the product (6.3). Based on Definition 3.1, we carry out the calculation according to the three different cases of incompleteness.

Case (a): if all components of yi are complete, the factor is simply

∫𝒴i h1(yi|u, θ) dyi = h1(yi|u, θ).

Case (c): if all components of yi are incomplete, i.e. yi ∈ 𝒴i, we have

∫𝒴i h1(yi|u, θ) dyi = E[I(yi ∈ 𝒴i) | u, θ],

where the expectation is taken under h1(yi|u, θ). The calculation can therefore be done by the following simulation.

(1). Draw a sample yi(1), yi(2), ..., yi(m) from h1(yi|u, θ).

(2). Calculate ∫𝒴i h1(yi|u, θ) dyi as (1/m) ∑j=1m I(yi(j) ∈ 𝒴i).

Case (b): if some components of yi are incomplete, say yi = (yi+, yi*) ∈ {yi+} × 𝒴i*, then

∫𝒴i h1(yi|u, θ) dyi = ∫𝒴i* h1(yi+, yi*|u, θ) dyi*.

That is, we plug the completely observed components yi+ into the integrand and integrate over 𝒴i* for the incomplete components.

Since h1(yi+, yi*|u, θ), viewed as a function of yi*, is in general no longer a density function, it is standardized using the marginal density of yi+|(u, θ),

ci = ∫ h1(yi+, yi*|u, θ) dyi*,

where the integral is over the whole sampling space of yi*, and ci can be calculated by a Monte Carlo method if no closed-form solution is available. This yields the conditional density function of yi* given (yi+, u, θ) as h1(yi*|yi+, u, θ) = h1(yi+, yi*|u, θ)/ci. Then

Lemma 6.6.   The calculation of ∫𝒴i h1(yi|u, θ) dyi, and hence of h0(u|𝒴, θ), can be done by the following simulation.

(1). Draw a sample yi*(1), yi*(2), ..., yi*(m) from h1(yi*|yi+, u, θ).

(2). Calculate ∫𝒴i h1(yi|u, θ) dyi as ci (1/m) ∑j=1m I(yi*(j) ∈ 𝒴i*).

Note that case (c) is a special case where ci = 1, and case (a) is also a special case with ci = h1(yi|u, θ) in which no drawing is necessary. In summary, assuming u is a scalar, we have the following.

Theorem 6.1.   A sample of u from the conditional distribution k2(u|𝒴, θ) can be drawn as follows.

(1) Draw a sample a(1), a(2), ..., a(m) from uniform distribution U(0, 1).

(2) Draw a sample u0(1), u0(2), ..., u0(m) from h2(u|θ).

(3) Calculate h0(u0(j)|𝒴, θ) for j = 1, 2, ..., m as follows.

(3.1) For i = 1, ..., n and j = 1, ..., m, calculate ∫𝒴i h1(yi|u0(j), θ) dyi as below.

(a) If yi is completely known, let ∫𝒴i h1(yi|u0(j), θ) dyi = h1(yi|u0(j), θ).

(b) If some components of yi are incomplete, draw a sample yi*(j,1), yi*(j,2), ..., yi*(j,m) from h1(yi*|yi+, u0(j), θ) and let

∫𝒴i h1(yi|u0(j), θ) dyi = ci (1/m) ∑l=1m I(yi*(j,l) ∈ 𝒴i*),

where ci is the marginal density of yi+ given (u0(j), θ).

(c) If all components of yi are incomplete, draw a sample yi(j,1), yi(j,2), ..., yi(j,m) from h1(yi|u0(j), θ) and let

∫𝒴i h1(yi|u0(j), θ) dyi = (1/m) ∑l=1m I(yi(j,l) ∈ 𝒴i).

(3.2) Calculate h0(u0(j)|𝒴, θ) = ∏i=1n ∫𝒴i h1(yi|u0(j), θ) dyi.

(4) For i = 1, 2, ..., m, solve the equation K2(u) = a(i) for u to get u(i), where K2(u) is approximated by

K2(u) ≈ ∑j=1m h0(u0(j)|𝒴, θ) I(u0(j) ≤ u) / ∑j=1m h0(u0(j)|𝒴, θ).

Then u(1), u(2), ..., u(m) is a random sample from k2(u|𝒴, θ).
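
To illustrate how the pieces of Theorem 6.1 fit together, the following Python sketch works through a deliberately small case: scalar u, scalar right-censored yi, and normal h1 and h2. All data values, variances, and sample sizes are made up for the example, and inverting the estimated K2 reduces to inverting a weighted empirical CDF of the u0 draws.

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(0)

    # --- toy data: scalar u, scalar y_i, right censoring (illustration only) ---
    sigma1, sigma2 = 1.0, 1.5            # play the role of theta
    y_obs  = np.array([0.2, 1.1, -0.4])  # completely observed y_i (case (a))
    c_cens = np.array([1.5, 2.0])        # censored y_i: only y_i > c_i is known (case (c))
    m = 5_000                            # Monte Carlo sample size

    # step (2): u0^(1), ..., u0^(m) ~ h2(u | theta) = N(0, sigma2^2)
    u0 = rng.normal(0.0, sigma2, size=m)

    # step (3): h0(u0^(j)) = product over i of the integral of h1(y_i | u0^(j)) over Y_i
    h0 = np.ones(m)
    for y in y_obs:                      # case (a): the factor is h1(y_i | u, theta) itself
        h0 *= norm.pdf(y, loc=u0, scale=sigma1)
    for c in c_cens:                     # case (c): draw from h1, take the fraction in Y_i
        draws = rng.normal(loc=u0[:, None], scale=sigma1, size=(m, 200))
        h0 *= (draws > c).mean(axis=1)

    # step (4): invert the Monte Carlo estimate of K2, a weighted empirical CDF of the u0's
    order = np.argsort(u0)
    u_sorted = u0[order]
    cdf = np.cumsum(h0[order]) / h0.sum()

    a = rng.uniform(size=m)              # step (1)
    idx = np.minimum(np.searchsorted(cdf, a), m - 1)
    u_sample = u_sorted[idx]             # generalized inverse of the step CDF
    print(u_sample.mean(), u_sample.std())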

 
6.3   Mixed Models with Normal Distribution

Under the multivariate normal distribution, we can design a very attractive Gibbs sampler employing the idea of latent variables (Damien & Walker, 1996). This method is remarkably well suited to truncated densities and EM-style algorithms. It is very easy to implement and bypasses the need for rejection sampling and for algorithms such as Metropolis-Hastings and sampling-importance-resampling. With the introduction of one or more latent variables, most conditional distributions can be sampled via uniform variates, and the approach applies to a very broad range of distributions.

6.3.1   Normal Model

For the mixed model given in equation (4.2), yi = Xi β + Zi u + ei, with u ~ Nq(0, Σ2) and ei ~ Nk(0, Σ1), we have yi|u ~ Nk(μi, Σ1), where μi = Xi β + Zi u is a function of u.

For the multivariate normal distribution yi|(u, θ) ~ Nk(μi, Σ1), both the marginal distribution of yi+|(u, θ) and the conditional distribution h1(yi*|yi+, u, θ) of yi* given (yi+, u, θ) are also normal. Given u and θ, both μi and Σ1 are known. If we partition yi into its complete and incomplete components, yi = (yi+', yi*')', and partition μi and Σ1 conformably as

μi = (μ+', μ*')',    Σ1 = ( Σ+, Σ+* ; Σ*+, Σ* ),    (6.5)

then the marginal distribution of yi+|(u, θ) is multivariate normal with mean μ+ and covariance matrix Σ+, and the conditional distribution of yi*|(yi+, u, θ) is multivariate normal with mean μ* + Σ*+Σ+⁻¹(yi+ - μ+) and covariance matrix Σ* - Σ*+Σ+⁻¹Σ+*. Therefore

yi+|(u, θ) ~ Nk+(μ+, Σ+),    yi*|(yi+, u, θ) ~ Nk*(μ* + Σ*+Σ+⁻¹(yi+ - μ+), Σ* - Σ*+Σ+⁻¹Σ+*),    (6.6)

where k+ and k* are the numbers of complete and incomplete components of yi, respectively.
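
The partition formulas (6.5)-(6.6) translate directly into code. The Python sketch below computes the conditional mean and covariance of the incomplete components given the complete ones; the function and variable names, and the numerical values, are our own illustrative choices.

    import numpy as np

    def conditional_normal(mu, Sigma, obs_idx, y_obs):
        """Mean and covariance of y_* | y_+ for y ~ N(mu, Sigma).

        obs_idx flags the complete ('+') components; the rest are incomplete ('*')."""
        obs = np.asarray(obs_idx)
        mis = ~obs
        S_pp = Sigma[np.ix_(obs, obs)]          # Sigma_+
        S_mp = Sigma[np.ix_(mis, obs)]          # Sigma_*+
        S_mm = Sigma[np.ix_(mis, mis)]          # Sigma_*
        B = S_mp @ np.linalg.inv(S_pp)          # Sigma_*+ Sigma_+^{-1}
        mu_cond = mu[mis] + B @ (y_obs - mu[obs])
        Sigma_cond = S_mm - B @ Sigma[np.ix_(obs, mis)]
        return mu_cond, Sigma_cond

    # arbitrary example values
    mu = np.array([1.0, 0.0, -1.0])
    Sigma = np.array([[2.0, 0.6, 0.3],
                      [0.6, 1.0, 0.4],
                      [0.3, 0.4, 1.5]])
    obs_idx = np.array([True, False, True])     # components 1 and 3 complete
    print(conditional_normal(mu, Sigma, obs_idx, y_obs=np.array([1.2, -0.5])))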

 
6.3.2   Sampling from k1(yi*|yi+, u, θ)

We now discuss sampling from k1(yi*|yi+, u, θ), a truncated multivariate normal distribution, when yi has incomplete components yi*. As in (6.1), given u and yi* ∈ 𝒴i*,

k1(yi*|yi+, u, θ) ∝ h1(yi*|yi+, u, θ) I(yi* ∈ 𝒴i*).

We will construct a general Gibbs sampler for truncated multivariate normal distributions, with sampling from k1(yi*|yi+, u, θ) as a direct application. We first give a simple lemma about the solution set of a quadratic inequality.

Given matrix A = (aij), let vector a(j) be the jth column of A without the jth element ajj, and matrix Ajj be the sub-matrix of A without the jth row and column.

Lemma 6.7   Suppose x = (x1, ..., xk)' and A = (aij)k×k is symmetric. Given xi (i ≠ j) and y, if ajj > 0 and the inequality has a solution, the set of xj satisfying the quadratic inequality x'Ax < y is the interval xj(l) < xj < xj(r), where

xj(l) = [-a(j)'x(j) - √D] / ajj,    xj(r) = [-a(j)'x(j) + √D] / ajj,    D = (a(j)'x(j))² - ajj (x(j)'Ajj x(j) - y),    (6.7)

and x(j) is the vector x without the jth element xj.

If A is diagonal, A = diag{a11, ..., akk}, the roots in equation (6.7) become

xj(l) = -√[(y - ∑i≠j aii xi²)/ajj],    xj(r) = +√[(y - ∑i≠j aii xi²)/ajj].    (6.8)
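
A direct Python transcription of equation (6.7) is given below (the function name and test values are arbitrary); it returns the interval of xj values satisfying x'Ax < y when the other components are held fixed.

    import numpy as np

    def quadratic_interval(A, x, j, y):
        """Interval (x_j(l), x_j(r)) of values of x_j with x' A x < y, per equation (6.7).

        A is assumed symmetric with a_jj > 0; the other components of x are held fixed."""
        a_jj = A[j, j]
        mask = np.arange(len(x)) != j
        a_j = A[mask, j]                          # j-th column of A without a_jj
        x_j = x[mask]                             # x without its j-th element
        b = a_j @ x_j
        c = x_j @ A[np.ix_(mask, mask)] @ x_j - y
        disc = b * b - a_jj * c
        if disc <= 0:                             # empty (or degenerate) solution set
            return None
        root = np.sqrt(disc)
        return ((-b - root) / a_jj, (-b + root) / a_jj)

    # arbitrary test values
    A = np.array([[2.0, 0.5], [0.5, 1.0]])
    print(quadratic_interval(A, x=np.array([0.0, 0.3]), j=0, y=4.0))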

Now we state the following augmented Gibbs sampler for truncated multivariate normal distributions, which is implied in Cumbus et al. (1996).

Theorem 6.2   Suppose x = (x1, ..., xk)' ~ Nk(0, Σ) truncated to x ∈ A = A1 × ⋯ × Ak, say Aj = (aj, bj). Then we have the following augmented Gibbs sampler for drawing x.

(1). Given an arbitrary starting point x(0) = (x1(0), ..., xk(0))' ∈ A, set the counter i = 0.

(2). Draw a ~ U(0, 1) and let y(i) = x(i)'Σ⁻¹x(i) - 2 ln(1 - a).

(3). For each j, in the order from 1 to k, draw xj(i+1) from the uniform distribution over the set {xj: x'Σ⁻¹x < y(i)} ∩ Aj, where x = (x1(i+1), ..., xj-1(i+1), xj, xj+1(i), ..., xk(i))'. At the end of this step, we have x(i+1) = (x1(i+1), x2(i+1), ..., xk(i+1))'.

When i is large enough (after an initial burn-in period), x(i+1) is approximately a draw from Nk(0, Σ) restricted to A. Repeat (2)-(3) to sample more points. Note that the bounding interval in step (3) can be obtained from equation (6.7) with the matrix there taken to be Σ⁻¹.

Proof.   With a latent variate y, it is easy to verify that the joint density

ƒ(x, y) ∝ exp(-y/2) I(x'Σ⁻¹x < y) I(x ∈ A)

has as its marginal density for x the Nk(0, Σ) density truncated by x ∈ A. In order to show that the theorem is a Gibbs sampling of (x, y) from the joint density ƒ(x, y), we only need to show that step (2) in fact samples y(i) from ƒ(y|x(i)), and that step (3) samples xj(i+1) from its full conditional given the other components and y(i).

Based on the joint distribution ƒ(x, y), given x, the variate y has the shifted (truncated) exponential density ƒ(y|x) ∝ exp(-y/2) I(y > x'Σ⁻¹x), which can be sampled by the inverse cumulative distribution method: solving 1 - exp{-(y - x'Σ⁻¹x)/2} = a for y gives y = x'Σ⁻¹x - 2 ln(1 - a). This is done in step (2).

On the other hand, given y, the vector x has a multivariate uniform distribution over the region {x: x'Σ⁻¹x < y} ∩ A. In particular, each component of x has conditional density

ƒj(xj|x1, ..., xj-1, xj+1, ..., xk, y) ∝ I(x'Σ⁻¹x < y, xj ∈ Aj),

that is, a univariate uniform distribution over the set {xj: x'Σ⁻¹x < y} ∩ Aj. This is step (3), and the proof is complete.

Note that the intervals (aj, bj) may have aj = -∞ or bj = +∞. In fact, the bounding set A could be far more complex than a product of intervals.
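
The following Python sketch implements the sampler of Theorem 6.2 for a box-shaped truncation region A = A1 × ⋯ × Ak; the function name, the choice of Σ, and the bounds are illustrative only.

    import numpy as np

    def sample_truncated_mvn(Sigma, bounds, x0, n_draws, burn_in=500, seed=0):
        """Theorem 6.2 sampler for N_k(0, Sigma) restricted to a box of intervals A_j.

        bounds is a (k, 2) array with rows (a_j, b_j); x0 must lie inside the box."""
        rng = np.random.default_rng(seed)
        P = np.linalg.inv(Sigma)                       # the quadratic form uses Sigma^{-1}
        k = Sigma.shape[0]
        x = np.array(x0, dtype=float)
        out = np.empty((n_draws, k))
        for it in range(burn_in + n_draws):
            # step (2): latent y given x is a shifted exponential
            y = x @ P @ x - 2.0 * np.log(1.0 - rng.uniform())
            # step (3): each x_j given the rest is uniform on an interval, equation (6.7)
            for j in range(k):
                mask = np.arange(k) != j
                b = P[mask, j] @ x[mask]
                c = x[mask] @ P[np.ix_(mask, mask)] @ x[mask] - y
                root = np.sqrt(max(b * b - P[j, j] * c, 0.0))
                lo = max((-b - root) / P[j, j], bounds[j, 0])
                hi = min((-b + root) / P[j, j], bounds[j, 1])
                x[j] = rng.uniform(lo, hi)
            if it >= burn_in:
                out[it - burn_in] = x
        return out

    # illustrative covariance and box bounds
    Sigma = np.array([[1.0, 0.7], [0.7, 1.0]])
    bounds = np.array([[0.5, np.inf], [-np.inf, 0.0]])   # x1 > 0.5 and x2 < 0
    draws = sample_truncated_mvn(Sigma, bounds, x0=[1.0, -1.0], n_draws=2000)
    print(draws.mean(axis=0))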

Since k1(yi*|yi+, u, θ) ∝ h1(yi*|yi+, u, θ) I(yi* ∈ 𝒴i*), and h1(yi*|yi+, u, θ) is normal with mean μ* + Σ*+Σ+⁻¹(yi+ - μ+) and covariance matrix Σ* - Σ*+Σ+⁻¹Σ+*, sampling from k1(yi*|yi+, u, θ) is now straightforward.

Corollary 6.1   A variate yi* ~ k1(yi*|yi+, u, θ) can be drawn as follows.

(1). Calculate μ0 = μ* + Σ*+Σ+⁻¹(yi+ - μ+) and Σ0 = Σ* - Σ*+Σ+⁻¹Σ+*.

(2). Draw y0 ~ Nk*(0, Σ0) restricted to the region A = {y0: y0 + μ0 ∈ 𝒴i*} by Theorem 6.2.

(3). Let yi* = y0 + μ0; then yi* is a drawing from k1(yi*|yi+, u, θ) over 𝒴i*.
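
A short illustration of Corollary 6.1, assuming the sample_truncated_mvn function from the sketch after Theorem 6.2 is available; the numerical values of μ0, Σ0, and the observed range are made up.

    import numpy as np

    # step (1) would compute these from (6.5)-(6.6); here they are illustrative values
    mu_0 = np.array([0.4, -0.2])                         # mu_* + Sigma_*+ Sigma_+^{-1} (y_+ - mu_+)
    Sigma_0 = np.array([[1.0, 0.3], [0.3, 0.8]])         # Sigma_* - Sigma_*+ Sigma_+^{-1} Sigma_+*
    region = np.array([[1.0, np.inf], [-np.inf, 0.5]])   # observed range for y_i*

    bounds = region - mu_0[:, None]                      # step (2): shift the region by -mu_0
    y0 = sample_truncated_mvn(Sigma_0, bounds, x0=[1.0, -0.5], n_draws=1000)
    y_star = y0 + mu_0                                   # step (3): y_i* = y_0 + mu_0
    print(y_star.mean(axis=0))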

 
6.3.3   Sampling from k2(u|𝒴, θ) and k(y*, u|𝒴, θ)

Now we employ a similar idea of latent variates to implement a Gibbs sampler for drawing u from the conditional distribution k2(u|𝒴, θ) given in equation (4.7), which under the normal model can be written as

k2(u|𝒴, θ) ∝ [∏i=1n ∫𝒴i* h1(yi+, yi*|u, θ) dyi*] h2(u|θ).    (6.9)

In fact, we will sample from the joint conditional distribution k(y*, u|𝒴, θ) given by (4.8), where y* = (y1*, ..., yn*),

k(y*, u|𝒴, θ) ∝ [∏i=1n h1(yi+, yi*|u, θ) I(yi* ∈ 𝒴i*)] h2(u|θ).

Let the latent variables w1, ..., wn, v, together with y1*, ..., yn*, u, have the joint density

ƒ(y*, u, w1, ..., wn, v) ∝ [∏i=1n exp(-wi/2) I(wi > (yi - μi)'Σ1⁻¹(yi - μi)) I(yi* ∈ 𝒴i*)] exp(-v/2) I(v > u'Σ2⁻¹u).

Note that yi = (yi1, ..., yik)', yi* denotes its incomplete components with yi* ∈ 𝒴i*, and μi is a function of u = (u1, ..., uq)', namely μi(u) = Xi β + Zi u.

It is easy to verify that, based on this joint distribution, the marginal density of u is k2(u|𝒴, θ) and the conditional density of yi* given u is k1(yi*|yi+, u, θ). Thus sampling from this joint distribution yields a sample of u ~ k2(u|𝒴, θ) and a sample of yi* ~ k1(yi*|yi+, u, θ) as well; in fact, we get a sample from the conditional distribution k(y*, u|𝒴, θ). Therefore, this scheme provides all the samples needed in steps (1.1) and (1.2) of the MCEM algorithm in Theorem 4.5.

The following conditional densities are easy to obtain and are essential for the Gibbs sampler. Given all other variables, each incomplete yij and each uj has a uniform distribution, while each wi and v have truncated exponential distributions. All of these distributions are very easy to sample from.

If yij is incomplete, then given all the other incomplete y's as well as u, v, and w1, ..., wn, its conditional density is proportional to I(yij ∈ Aij) I(yi* ∈ 𝒴i*), where Aij = {yij: (yi - μi)'Σ1⁻¹(yi - μi) < wi}, which can be obtained by (6.7). This is a univariate uniform distribution over the set Aij ∩ {yij: yi* ∈ 𝒴i*}.

For each component uj of u, the conditional density given all the other variables is proportional to

I(u'Σ2⁻¹u < v) ∏i=1n I((yi - μi(u))'Σ1⁻¹(yi - μi(u)) < wi).

It is a uniform distribution over the set

{uj: u'Σ2⁻¹u < v} ∩ {uj: (yi - μi(u))'Σ1⁻¹(yi - μi(u)) < wi, i = 1, ..., n},

which can be obtained by equation (6.7).

Given y* = (y1*, ..., yn*) and u, the latent variables wi and v have truncated exponential distributions,

ƒ(wi|y*, u) ∝ exp(-wi/2) I(wi > (yi - μi)'Σ1⁻¹(yi - μi))   and   ƒ(v|y*, u) ∝ exp(-v/2) I(v > u'Σ2⁻¹u),

with support on wi > (yi - μi)'Σ1⁻¹(yi - μi) and v > u'Σ2⁻¹u, respectively. These can be sampled easily by the inverse cumulative distribution method. Thus we have a Gibbs sampler for k(y*, u|𝒴, θ), k2(u|𝒴, θ), and k1(yi*|yi+, u, θ), as below.

Theorem 6.3   Assume θ is known. An augmented Gibbs sampler for k(y*, u|𝒴, θ) and k2(u|𝒴, θ) can be formulated as follows.

(1). Select starting points for u and for yj, j = 1, ..., n, and set the counter i = 0.
(1a). Select an arbitrary starting point u(0), and calculate μj(0) = μj(u(0)).
(1b). Get the starting point yj(0) as follows. For l = 1, ..., k, if yjl is known, set yjl(0) = yjl; otherwise select an arbitrary value of yjl(0) such that yj(0) ∈ 𝒴j.

(2). Draw a ~ U(0, 1). Let

v(i) = u(i)'Σ2⁻¹u(i) - 2 ln(1 - a).

(3). For j = 1, ..., n, draw aj ~ U(0, 1). Let

wj(i) = (yj(i) - μj(i))'Σ1⁻¹(yj(i) - μj(i)) - 2 ln(1 - aj).

(4). In the order of j from 1 to q, draw uj(i+1) from the uniform distribution over the set

{uj: u'Σ2⁻¹u < v(i)} ∩ {uj: (yl(i) - μl(u))'Σ1⁻¹(yl(i) - μl(u)) < wl(i), l = 1, ..., n},

where u = (u1(i+1), ..., uj-1(i+1), uj, uj+1(i), ..., uq(i))' and μl(u) = Xl β + Zl u.

Let u(i+1) = (u1(i+1), ..., uq(i+1))' and obtain μj(i+1) = μj(u(i+1)) for j = 1, ..., n.

(5). For j = 1, ..., n, get yj(i+1) = (yj1(i+1), yj2(i+1), ..., yjk(i+1))' as below. In the order of l from 1 to k, if yjl is completely known, set yjl(i+1) = yjl; otherwise draw yjl(i+1) from the uniform distribution over the set

{yjl: (yj - μj(i+1))'Σ1⁻¹(yj - μj(i+1)) < wj(i)} ∩ {yjl: yj* ∈ 𝒴j*},

where yj = (yj1(i+1), ..., yj,l-1(i+1), yjl, yj,l+1(i), ..., yjk(i))'.
When i is large enough (after an initial burn-in period), u(i+1) is approximately a draw from k2(u|𝒴, θ), yj(i+1) approximately from k1(yj|u(i+1), θ), and (y1(i+1), ..., yn(i+1), u(i+1)) approximately from k(y*, u|𝒴, θ). Repeat steps (2)-(5) with the counter increased to i+1 to sample more points.

Note that the bounding sets in steps (4) and (5) can be obtained by equation (6.7).
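
To make Theorem 6.3 concrete, the following Python sketch runs the sampler on a tiny made-up mixed model (one random intercept, two measurements per subject, one right-censored component), with θ = (β, Σ1, Σ2) held fixed as the theorem assumes; every data value, dimension, and name below is an illustrative choice of ours.

    import numpy as np

    rng = np.random.default_rng(0)

    # --- made-up data: n = 3 subjects, k = 2 measurements, q = 1 random effect ---
    n, k, q = 3, 2, 1
    beta = np.array([0.5])
    X = [np.ones((k, 1)) for _ in range(n)]              # intercept-only fixed design
    Z = [np.ones((k, q)) for _ in range(n)]              # random intercept
    Sigma1, Sigma2 = np.eye(k), np.array([[2.0]])
    P1, P2 = np.linalg.inv(Sigma1), np.linalg.inv(Sigma2)

    y = [np.array([0.3, 1.2]), np.array([-0.1, 0.4]), np.array([1.0, 0.8])]
    censored = [np.array([False, True]), np.zeros(k, bool), np.zeros(k, bool)]
    lower = [np.array([-np.inf, 1.0]), np.full(k, -np.inf), np.full(k, -np.inf)]
    upper = [np.full(k, np.inf)] * n                     # y[0][1] only known to exceed 1.0

    def quad_roots(a, b, c):
        """Interval of t with a*t**2 + 2*b*t + c < 0 (a > 0, real roots), as in (6.7)."""
        root = np.sqrt(max(b * b - a * c, 0.0))
        return (-b - root) / a, (-b + root) / a

    def quad_interval(P, vec, j, bound):
        """Values for component j of vec keeping vec' P vec < bound, others held fixed."""
        mask = np.arange(len(vec)) != j
        b = P[mask, j] @ vec[mask]
        c = vec[mask] @ P[np.ix_(mask, mask)] @ vec[mask] - bound
        return quad_roots(P[j, j], b, c)

    u = np.zeros(q)                                      # step (1): starting values
    n_iter, burn_in = 3000, 500
    u_draws = []

    for it in range(n_iter):
        mu = [X[i] @ beta + Z[i] @ u for i in range(n)]
        # step (2): latent v given the rest
        v = u @ P2 @ u - 2.0 * np.log(1.0 - rng.uniform())
        # step (3): latent w_i given the rest
        w = [(y[i] - mu[i]) @ P1 @ (y[i] - mu[i]) - 2.0 * np.log(1.0 - rng.uniform())
             for i in range(n)]
        # step (4): each u_j given the rest is uniform on an intersection of intervals
        for j in range(q):
            lo, hi = quad_interval(P2, u, j, v)
            for i in range(n):
                zj = Z[i][:, j]
                c0 = y[i] - X[i] @ beta - Z[i] @ u + zj * u[j]   # residual with u_j removed
                a_q = zj @ P1 @ zj
                if a_q < 1e-12:
                    continue
                lo2, hi2 = quad_roots(a_q, -(zj @ P1 @ c0), c0 @ P1 @ c0 - w[i])
                lo, hi = max(lo, lo2), min(hi, hi2)
            u[j] = rng.uniform(lo, hi)
        mu = [X[i] @ beta + Z[i] @ u for i in range(n)]
        # step (5): each censored y_il given the rest is uniform on an interval
        for i in range(n):
            for l in range(k):
                if not censored[i][l]:
                    continue
                d = y[i] - mu[i]
                lo, hi = quad_interval(P1, d, l, w[i])
                lo = max(lo + mu[i][l], lower[i][l])
                hi = min(hi + mu[i][l], upper[i][l])
                y[i][l] = rng.uniform(lo, hi)
        if it >= burn_in:
            u_draws.append(u.copy())

    print(np.mean(u_draws, axis=0))   # Monte Carlo estimate of E(u | observed data)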

This Gibbs sampler is very efficient and will be used in our subsequent simulations and data analysis. More importantly, the use of latent variables in constructing the augmented sampler is an ideal match for sampling incomplete data in EM-style algorithms. Its potential in sampling schemes for very complicated distributions, and in other areas as well, could be enormous.

 
