
3.1 The Dirichlet Process

Eq. (3.21b) is the combined probability for assigning xi to one of the infinitely many empty components. That is, the prior probability of assigning a data point to a component that is associated with other data points is proportional to the number of data points that have already been assigned to that component, and the probability of it having its own component is proportional to α. This approach is in accordance with the suggestion in Section 4 of Kingman (1975). Note the similarity of the above probabilities to eq. (3.8) and the correspondence of the indicator variables in the CRP and the infinite mixture model.
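These seating probabilities can be sketched directly. Below is a minimal simulation of the CRP prior over component assignments: an existing component k is chosen with probability proportional to its current count, and a new component with probability proportional to α. The function name and the choice of α are illustrative, not from the text.

```python
import random

def crp_assignments(n, alpha, seed=0):
    """Sample component assignments for n points from the
    Chinese restaurant process with concentration alpha."""
    rng = random.Random(seed)
    counts = []       # counts[k] = number of points assigned to component k
    assignments = []
    for i in range(n):
        # P(existing component k) ∝ counts[k]; P(new component) ∝ alpha
        weights = counts + [alpha]
        r = rng.uniform(0, sum(weights))
        acc, k = 0.0, 0
        for k, w in enumerate(weights):
            acc += w
            if r <= acc:
                break
        if k == len(counts):
            counts.append(1)   # open a new, previously empty component
        else:
            counts[k] += 1
        assignments.append(k)
    return assignments, counts

assignments, counts = crp_assignments(100, alpha=2.0)
print(len(counts), sum(counts))
```

Note that the first data point always opens a new component, since no component yet has any counts.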

3.1.6 Properties of the Distribution

The different approaches for defining the DP summarized above reveal interesting properties of the random measure and allow the development of different inference algorithms for models using DPs. Here, we summarize some properties of the DP that are important for building DPM models and developing inference techniques.

The posterior distribution given samples θ1, . . . , θn from the process is again a DP. This property makes inference for DP models easier to handle than for most other nonparametric models.

The Pólya urn scheme and the CRP show that it is possible to sample from the DP without having to represent the full nonparametric distribution. Thus, inference in DPM models is possible by representing only samples from the nonparametric distribution rather than the infinite-dimensional distribution itself.
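As an illustrative sketch of this point, the Pólya urn scheme can be simulated by tracking only the finitely many values drawn so far: θi is a fresh draw from G0 with probability α/(α + i − 1), and otherwise repeats a uniformly chosen earlier draw. The Gaussian choice for G0 here is an arbitrary assumption for the example, not from the text.

```python
import random

def polya_urn(n, alpha, base_draw, seed=0):
    """Draw theta_1, ..., theta_n from a DP(alpha, G0) via the Polya urn:
    theta_i is a new draw from G0 with probability alpha/(alpha + i - 1),
    otherwise a copy of a uniformly chosen earlier draw."""
    rng = random.Random(seed)
    thetas = []
    for i in range(1, n + 1):
        if rng.random() < alpha / (alpha + i - 1):
            thetas.append(base_draw(rng))      # new value from G0
        else:
            thetas.append(rng.choice(thetas))  # repeat an earlier value
    return thetas

# G0 = standard normal, chosen only for illustration
draws = polya_urn(500, alpha=1.0, base_draw=lambda rng: rng.gauss(0, 1))
print(len(draws), len(set(draws)))  # far fewer unique values than draws
```

The many repeated values in `draws` also illustrate the discreteness of samples from a DP discussed next.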

Draws from a DP are discrete with probability 1. As a consequence, there is positive probability of draws from a DP being identical. This property may lead to undesirable posterior properties for some models (Petrone and Raftery, 1997), but it can also be exploited for defining powerful models such as the hierarchical Dirichlet process (HDP) (Teh et al., 2006), which allows sharing statistical strength between groups of data.

The DP is defined by two parameters, namely the base distribution G0 and the concentration parameter α. The mean of the distribution is G0; therefore, G0 can be seen as the prior distribution on which we center our nonparametric model. The concentration parameter represents the "strength of belief" in this prior distribution, similar to the scale of the parameters of the Dirichlet distribution.

The concentration parameter α controls both the smoothness (or discreteness) of the random distributions and the size of the neighborhood (or variability) of G around G0. In DPM models, the prior probability of selecting a new component to represent the ith data point is proportional to α, as given in eq. (3.21b). Thus, the value of α influences the number of active components, that is, the number of components that have data assigned to them.

The prior expected number of active components for a set of n data points is given by ∑_{i=1}^{n} α / (α + i − 1) ≈ α log(n/α + 1), (Antoniak, 1974).
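As a quick sanity check, the exact sum and its logarithmic approximation can be compared numerically; the values of n and α below are arbitrary.

```python
import math

def expected_components(n, alpha):
    """Exact prior expectation of the number of active components:
    sum_{i=1}^{n} alpha / (alpha + i - 1)."""
    return sum(alpha / (alpha + i - 1) for i in range(1, n + 1))

n, alpha = 1000, 2.0
exact = expected_components(n, alpha)
approx = alpha * math.log(n / alpha + 1)
print(exact, approx)  # the two values agree closely for large n
```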

For small α, the model will have a few active components drawn from the base distribution to represent the data, and for large α there will be many active components; in the latter case the distribution of the parameters will be concentrated around G0, hence the name. It is not clear what value of α corresponds to a less informative choice, since α → 0 results in representing the data with a single component and α → ∞ expresses strong belief that the parameters are distributed according to G0.
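A small simulation under the CRP representation illustrates this effect of α on the prior number of active components; the sample size and the grid of α values here are arbitrary.

```python
import random

def num_components(n, alpha, rng):
    """Count the components opened under the CRP prior for n points:
    point i opens a new component with probability alpha/(alpha + i - 1)."""
    k = 0
    for i in range(1, n + 1):
        if rng.random() < alpha / (alpha + i - 1):
            k += 1
    return k

rng = random.Random(1)
n, trials = 200, 300
means = {}
for alpha in (0.1, 1.0, 10.0):
    means[alpha] = sum(num_components(n, alpha, rng)
                       for _ in range(trials)) / trials
    print(alpha, means[alpha])  # mean number of components grows with alpha
```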

