24.12.2012 Views

Online Model Selection Based on the Variational Bayes


the posterior parameter distribution as a coefficient matrix. Namely, the VB method is a type of natural gradient method (Amari, 1998). By combining sequential model selection procedures, the online VB algorithm provides a fully online learning method with a model selection mechanism. It can be applied to real-time applications. In preliminary experiments using synthetic data, the online VB method was able to adapt the model structure to dynamic environments. We also found that the introduction of a discount factor was crucial for a fast convergence of the online VB method.

We study the VB method for general exponential family models with hidden variables (EFH models) (Amari, 1985), although the VB method can be applied to more general graphical models (Attias, 1999). The use of the EFH models makes the calculations transparent. Moreover, the EFH models include many interesting models, such as normalized gaussian networks (Sato & Ishii, 2000), hidden Markov models (Rabiner, 1989), mixtures of gaussian models (Roberts, Husmeier, Rezek, & Penny, 1998), mixtures of factor analyzers (Ghahramani & Beal, 2000), mixtures of probabilistic principal component analyzers (Tipping & Bishop, 1999), and others (Roweis & Ghahramani, 1999; Titterington, Smith, & Makov, 1985).

2 Variational Bayes Method

In this section, we review the VB method (Waterhouse et al., 1996; Attias, 1999, 2000; Ghahramani & Beal, 2000; Bishop, 1999) for EFH models (Amari, 1985).

2.1 Bayesian Method. In the maximum likelihood (ML) approach, the objective is to find the ML estimator that maximizes the likelihood for a given data set. The ML approach, however, suffers from overfitting and the inability to determine the best model structure.
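The second limitation can be illustrated with a small sketch. It assumes a toy family of histogram density models on [0, 1) with a growing number of equal-width bins; this family, the helper `max_log_likelihood`, and the synthetic Beta(2, 5) data are illustrative assumptions and do not appear in the paper. Because each finer histogram contains the coarser one as a special case, the maximized likelihood can only stay the same or increase, so the ML criterion alone always favors the most complex structure.

```python
import math
import random


def max_log_likelihood(data, num_bins):
    """Maximized log-likelihood of an equal-width histogram density on [0, 1)."""
    n = len(data)
    width = 1.0 / num_bins
    counts = [0] * num_bins
    for x in data:
        # Clamp so that a value of exactly 1.0 falls in the last bin.
        counts[min(int(x / width), num_bins - 1)] += 1
    # The ML estimate of the density inside a bin is count / (n * width).
    return sum(c * math.log(c / (n * width)) for c in counts if c > 0)


random.seed(0)
# Hypothetical data set: 200 draws from a Beta(2, 5) distribution.
data = [random.betavariate(2.0, 5.0) for _ in range(200)]

# Doubling the bin count yields nested models (each bin is split in two).
bin_counts = [1, 2, 4, 8, 16, 32, 64]
scores = [max_log_likelihood(data, k) for k in bin_counts]

for k, s in zip(bin_counts, scores):
    print(f"bins={k:3d}  max log-likelihood={s:8.2f}")
```

The printed scores never decrease as the number of bins grows, even though the finest histograms mostly fit sampling noise. Comparing structures therefore needs a criterion beyond the maximized likelihood.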

In the Bayesian method (Bishop, 1995; Mackay, 1992a, 1992b), a set of models, $\{M_m \mid m = 1, \ldots, N\}$, where $M_m$ denotes a model with a fixed structure, is considered. A probability distribution for a model $M_m$ with a model parameter $\mu_m$ is denoted by $P(x \mid \mu_m, M_m)$, where $x$ represents an observed stochastic variable. For a set of observed data, $X\{T\} = \{x(t) \mid t = 1, \ldots, T\}$, and a prior probability distribution $P(\mu_m \mid M_m)$, the Bayesian method calculates the posterior probability over the parameter,

$$P(\mu_m \mid X\{T\}, M_m) = \frac{P(X\{T\} \mid \mu_m, M_m)\, P(\mu_m \mid M_m)}{P(X\{T\} \mid M_m)}.$$
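As a numerical illustration of this posterior, the following minimal sketch evaluates Bayes' rule on a parameter grid. It assumes a single model with a scalar parameter `mu`: a Gaussian with unknown mean and known unit variance, independent observations, and a broad Gaussian prior. The data values, the grid, and the helper names `log_likelihood` and `grid_posterior` are hypothetical choices made for this example and are not taken from the paper.

```python
import math


def log_likelihood(data, mu, sigma=1.0):
    """Log-likelihood of independent Gaussian observations with known sigma."""
    return sum(
        -0.5 * math.log(2.0 * math.pi * sigma ** 2) - 0.5 * ((x - mu) / sigma) ** 2
        for x in data
    )


def grid_posterior(data, grid, log_prior):
    """Posterior weights over a parameter grid, computed with Bayes' rule."""
    log_joint = [log_likelihood(data, mu) + log_prior(mu) for mu in grid]
    shift = max(log_joint)  # subtract the maximum for numerical stability
    unnormalized = [math.exp(v - shift) for v in log_joint]
    # The normalizer plays the role of the denominator P(X{T} | M) on this grid.
    evidence = sum(unnormalized)
    return [u / evidence for u in unnormalized]


def log_prior(mu):
    """Unnormalized log density of a Gaussian prior with standard deviation 10."""
    return -0.5 * (mu / 10.0) ** 2


data = [0.8, 1.3, 0.9, 1.6, 1.1]             # hypothetical observations x(1..5)
grid = [i * 0.01 for i in range(-300, 501)]  # candidate values of mu

posterior = grid_posterior(data, grid, log_prior)
posterior_mean = sum(mu * p for mu, p in zip(grid, posterior))
print(f"posterior mean of mu on the grid: {posterior_mean:.3f}")
```

For this conjugate toy case the posterior mean is also available in closed form, sum(x) / (T + 1/100), which gives a simple check on the grid approximation. In models with hidden variables the normalizer is generally intractable, which is the setting the VB method addresses.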

Here, the data likelihood is defined by

$$P(X\{T\} \mid \mu_m, M_m) = \prod_{t=1}^{T} P(x(t) \mid \mu_m, M_m),$$
