Online Model Selection Based on the Variational Bayes

Masa-aki Sato

$$\Delta\alpha = \alpha_{\mathrm{new}} - \alpha = \frac{1}{\gamma}\left(T\,\langle r(x,z)\rangle_{\bar{\theta}} + \gamma_0\alpha_0 - \gamma\alpha\right) = \frac{1}{\gamma^{2}}\,V^{-1}_{\alpha,\alpha}(\alpha,\gamma)\,\frac{\partial F}{\partial\alpha}\left(X\{T\},\bar{\theta},\alpha,\gamma\right), \tag{2.33}$$

together with $\gamma = T + \gamma_0$. Substituting the VB E-step equation, 2.20, into the free energy equation, 2.26, the VB algorithm is further rewritten as

$$\Delta\alpha = \frac{1}{\gamma^{2}}\,V^{-1}_{\alpha,\alpha}(\alpha,\gamma)\,\frac{\partial F}{\partial\alpha}\left(X\{T\},\bar{\theta} = \langle\theta\rangle_{\alpha},\alpha,\gamma\right). \tag{2.34}$$

This shows that the VB algorithm is a gradient method with the inverse of the Fisher information matrix as its coefficient matrix. Namely, it is a type of natural gradient method (Amari, 1998), which gives the optimal asymptotic convergence. This fact is proved for the first time in this article. The natural gradient gives the steepest direction in the hyperparameter space, which has a Riemannian structure according to information geometry. It should be noted that the learning rate in the VB algorithm, 2.34, is automatically determined by the inverse of the Fisher information matrix.

When the VB algorithm converges, the free energy equation, 2.26, can be written in a simple form:

$$F(X\{T\}) = \log\left(P(X\{T\}\mid\bar{\theta})\,P_0(\bar{\theta})\right) - \int d\mu(\theta)\,Q_\theta(\theta)\log Q_\theta(\theta) + \gamma\left[\psi(\bar{\theta}) - \langle\psi(\theta)\rangle_{\alpha}\right]. \tag{2.35}$$

The first term on the right-hand side is the log likelihood together with the prior, evaluated at the ensemble average of the parameters. The second term is the entropy of the posterior parameter distribution. It penalizes complex models and overfitting. The third term represents the deviation from the ensemble average of the parameters and becomes negligible in the large sample limit.

2.6 Predictive Distribution. If the posterior parameter distribution is obtained by using the VB algorithm, one can calculate the predictive distribution for the observed variable $x$. The predictive distribution for $x$ is given by

$$P(x\mid X\{T\}) = \int d\mu(\theta)\,Q_\theta(\theta)\,P(x\mid\theta) = \int d\mu(\theta)\int d\mu(z)\,\exp\left[(r(x,z) + \gamma\alpha)\cdot\theta + r_0(x,z) - (1+\gamma)\,\psi(\theta) - \Phi(\alpha,\gamma)\right]. \tag{2.36}$$
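As a numerical sanity check of the predictive integral in equation 2.36, the following sketch evaluates it for a deliberately simple case that is not taken from the article: a Bernoulli model with no hidden variables, written in its natural parameter so that $r(x) = x$, $r_0 = 0$, and $\psi(\theta) = \log(1 + e^{\theta})$. It further assumes the flat measure $d\mu(\theta) = d\theta$, under which $\Phi(\alpha,\gamma) = \log B(\gamma\alpha, \gamma(1-\alpha))$ and the posterior is a Beta distribution over the success probability. All function names are hypothetical, and the values of `alpha` and `gamma` are arbitrary.

```python
import math


def log_beta(a, b):
    """Logarithm of the Beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def psi(theta):
    """Bernoulli log-partition log(1 + e^theta), computed stably."""
    return max(theta, 0.0) + math.log1p(math.exp(-abs(theta)))


def phi(alpha, gamma):
    """Phi(alpha, gamma) for the Bernoulli case, ASSUMING dmu(theta) = dtheta."""
    return log_beta(gamma * alpha, gamma * (1.0 - alpha))


def predictive_numeric(x, alpha, gamma, lo=-40.0, hi=40.0, n=20001):
    """Trapezoid-rule estimate of the integral of Q_theta(theta) P(x | theta)."""
    step = (hi - lo) / (n - 1)
    log_norm = phi(alpha, gamma)
    total = 0.0
    for i in range(n):
        theta = lo + i * step
        # Exponent of eq. 2.36 without hidden variables:
        # (x + gamma*alpha)*theta - (1 + gamma)*psi(theta) - Phi(alpha, gamma)
        exponent = (x + gamma * alpha) * theta - (1.0 + gamma) * psi(theta) - log_norm
        weight = 0.5 if i in (0, n - 1) else 1.0
        total += weight * math.exp(exponent)
    return total * step


def predictive_closed_form(x, alpha, gamma):
    """Beta-Bernoulli predictive: B(a + x, b + 1 - x) / B(a, b)."""
    a, b = gamma * alpha, gamma * (1.0 - alpha)
    return math.exp(log_beta(a + x, b + 1 - x) - log_beta(a, b))


if __name__ == "__main__":
    alpha, gamma = 0.3, 12.0  # arbitrary illustrative hyperparameters
    for x in (0, 1):
        print(x, predictive_numeric(x, alpha, gamma),
              predictive_closed_form(x, alpha, gamma))
```

In this special case the predictive probability of $x = 1$ reduces to $\alpha$, so the numerical integral and the closed form should agree with each other and with the hyperparameter itself.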
By interchanging the integration with respect to $\theta$ and $z$, one can get

$$P(x\mid X\{T\}) = \int d\mu(z)\,\exp\left[r_0(x,z) + \Phi(\hat{\alpha}(x,z),\gamma+1) - \Phi(\alpha,\gamma)\right], \qquad \hat{\alpha}(x,z) = \frac{\gamma\alpha + r(x,z)}{1+\gamma}. \tag{2.37}$$

For a finite $T$, this predictive distribution has a different functional form from the model distribution $P(x\mid\theta)$, equation 2.1.

2.7 Large Sample Limit. When the amount of observed data becomes large ($T \gg 1$, and hence $\gamma \gg 1$), the solution of the VB algorithm becomes the ML estimator (Attias, 1999). In this limit, the integration over the parameters with respect to the posterior parameter distribution can be approximated by using a stationary point approximation:

$$\exp\left[\Phi(\alpha,\gamma)\right] = \int d\mu(\theta)\,\exp\left[\gamma\left(\alpha\cdot\theta - \psi(\theta)\right)\right] \approx \exp\left[\gamma\left(\alpha\cdot\hat{\theta} - \psi(\hat{\theta})\right) - \frac{1}{2}\log\left|\gamma\,\frac{\partial^{2}\psi}{\partial\theta\,\partial\theta}(\hat{\theta})\right| + O(1/\gamma)\right], \tag{2.38}$$

where $\hat{\theta}$ maximizes the exponent $\left(\alpha\cdot\theta - \psi(\theta)\right)$, that is,

$$\frac{\partial\psi}{\partial\theta}(\hat{\theta}) = \alpha. \tag{2.39}$$

Therefore, $\Phi$ can be approximated as

$$\Phi(\alpha,\gamma) \approx \gamma\left(\alpha\cdot\hat{\theta} - \psi(\hat{\theta})\right) - \frac{1}{2}\log\left|\gamma\,\frac{\partial^{2}\psi}{\partial\theta\,\partial\theta}(\hat{\theta})\right| + O(1/\gamma). \tag{2.40}$$

Consequently, the ensemble average of the parameter, $\bar{\theta}$, can be approximated as

$$\bar{\theta} = \frac{1}{\gamma}\,\frac{\partial\Phi}{\partial\alpha}(\alpha,\gamma) \approx \frac{1}{\gamma}\,\frac{\partial}{\partial\alpha}\left(\gamma\left(\alpha\cdot\hat{\theta} - \psi(\hat{\theta})\right)\right) = \hat{\theta}. \tag{2.41}$$

The relations 2.39 and 2.41 imply that the posterior hyperparameter $\alpha$ is equal to the expectation parameter of the EFH model, $\phi$ (see equation 2.3), in this limit. Furthermore, equations 2.18, 2.39, and 2.41 are equivalent to the
Communicated by Hagai Attias