22.06.2013 Views

a la physique de l'information - Lisa - Université d'Angers


Author's personal copy

3972 F. Chapeau-Blondeau, D. Rousseau / Physica A 388 (2009) 3969–3984

following way. The complete coding of the data should here include two parts. The first part is the coding of the data based on a definite probability density model to assign the code lengths. For a given data set x, the description length needed by this first part is Ldata of Eq. (7), which we can also write Ldata ≡ L(x|M), the description length of the data given a definite model M of probability density. The second part needed for a complete coding of the data is the description of the parameters that completely specify the underlying probability density model M. These parameters include the number of bins K along with the K values fk for k = 1 to K. The description length needed by this second part, in charge of coding the parameters of the model M, is denoted Lmodel ≡ L(M); we shall soon see how to explicitly quantify this description length L(M). Now the complete coding of the data set x has a total description length Ltotal which sums up the two parts as

$$L_{\mathrm{total}} \equiv L(x|M) + L(M), \tag{10}$$

signifying that the total description length of the data is the description length of the data given the model plus the description length of the model.

For a given data set x, the MDL principle then dictates selecting the model parameters {K; fk, k = 1, . . . , K} so as to minimize the total description length Ltotal of Eq. (10), i.e.

$$\{K; f_k, k = 1, \ldots, K\} = \arg\min_{\{K; f_k\}} L_{\mathrm{total}} = \arg\min_{\{K; f_k\}} \left[ L(x|M) + L(M) \right]. \tag{11}$$

This is an optimization principle based on optimal coding and information theory. In a prescribed class of models (histograms with regular bins here), the best model for the data is the model that, when known, enables the most efficient (shortest) coding of these data.

5. Description length for the data<br />

As already stated, the description length L(x|M) for the data given the model is supplied by Eq. (7). The term −log(dx^N) in Eq. (7) is a constant common to all models. For the purpose of discriminating among models, this constant −log(dx^N) is often omitted from the description length, with no impact on the final result concerning the model choice. Here, however, we prefer to maintain this term, in order to keep track of the complete value of the description length, and to convey some additional insight into the modeling process beyond the choice of the model itself. So equivalently, the description length of Eq. (7) for the data given the model is written as

$$L(x|M) = -\sum_{k=1}^{K} N_k \log(f_k\, dx). \tag{12}$$
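As a minimal numerical sketch of Eq. (12), the Python snippet below evaluates L(x|M) for a regular histogram. Several things here are assumptions made only for illustration and are not taken from the text: the function name `data_description_length`, the synthetic Gaussian sample, the resolution `dx = 1e-3`, the use of natural logarithms (lengths in nats), and the choice fk = Nk/(N × bin width) for the bin values.

```python
import numpy as np

def data_description_length(counts, f, dx):
    """Evaluate Eq. (12): L(x|M) = -sum_k N_k log(f_k dx), in nats.

    Empty bins (N_k = 0) contribute nothing to the sum and are skipped,
    which also avoids evaluating log(0).
    """
    counts = np.asarray(counts, dtype=float)
    f = np.asarray(f, dtype=float)
    occupied = counts > 0
    return -np.sum(counts[occupied] * np.log(f[occupied] * dx))

# Hypothetical data: 1000 Gaussian samples, K = 20 regular bins.
rng = np.random.default_rng(0)
x = rng.normal(size=1000)
K = 20
counts, edges = np.histogram(x, bins=K)
bin_width = edges[1] - edges[0]
N = counts.sum()

# Assumed bin values: f_k = N_k / (N * bin_width).
f = counts / (N * bin_width)

dx = 1e-3  # assumed data resolution
L_data = data_description_length(counts, f, dx)

# Same quantity split into a model-dependent part and the constant -N log(dx).
occupied = counts > 0
L_split = -np.sum(counts[occupied] * np.log(f[occupied])) - N * np.log(dx)

print(L_data, L_split)
```

The second evaluation illustrates why the constant can be dropped when comparing models: it adds the same amount −N log(dx) = −log(dx^N) to every candidate.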

Next, we have to address the quantification of the description length L(M) for the model.

6. Description length for the model parameters as independent real variables

To quantify the description length L(M) of the model, one possibility is to use a procedure derived from Ref. [28]. This approach considers the K model parameters fk as K independent real (continuously-valued) variables, which need to be quantized to finite precision in order to allow their coding. The histogram model for the density of the data assigns a probability pk = fkδx to bin k with width δx. Under this model also, the number Nk of data points falling in bin k has expected value E(Nk) = Npk = Nfkδx and standard deviation σ(Nk) = [Nfkδx(1 − fkδx)]^(1/2), according to the properties of the binomial distribution [40]. Therefore, since fk = E(Nk)/(Nδx) for all k, estimating fk is equivalent to estimating the mean E(Nk) of the random variable Nk with standard deviation σ(Nk). The value σ(fk) = σ(Nk)/(Nδx) = [fk(1 − fkδx)/(Nδx)]^(1/2) fixes a natural precision with which fk can be estimated and needs to be coded. This determines σ(fk) as the quantization step relevant for coding the model parameters fk. One has the probability pk ∈ [0, 1] and the density fk = pkδx^(−1) ∈ [0, δx^(−1)]. The parameter fk can therefore take its values in the interval [0, δx^(−1)] and is estimated and quantized with the precision σ(fk). Accordingly, a total number δx^(−1)/σ(fk) of different values for fk can be distinguished and need to be coded separately, at a code length log[δx^(−1)/σ(fk)]. For the K parameters fk the code length results as

$$L(\{f_k\}) = \sum_{k=1}^{K} \log\left[\frac{\delta x^{-1}}{\sigma(f_k)}\right] = \frac{K}{2}\log(N) - \frac{1}{2}\sum_{k=1}^{K} \log[f_k \delta x (1 - f_k \delta x)]. \tag{13}$$
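The two forms of Eq. (13) can be checked against each other numerically. The sketch below is only an illustration under stated assumptions: the function names, the four-bin probability vector, the bin width 0.5, N = 1000 and natural logarithms are all hypothetical choices, and every bin probability is required to lie strictly between 0 and 1 so that the logarithms stay finite.

```python
import numpy as np

def sigma_f(f, N, bin_width):
    """Quantization step sigma(f_k) = [f_k (1 - f_k dx) / (N dx)]^(1/2)."""
    p = f * bin_width
    return np.sqrt(f * (1.0 - p) / (N * bin_width))

def param_code_length_direct(f, N, bin_width):
    """First form of Eq. (13): sum_k log[(1/dx) / sigma(f_k)]."""
    return np.sum(np.log((1.0 / bin_width) / sigma_f(f, N, bin_width)))

def param_code_length_closed(f, N, bin_width):
    """Second form of Eq. (13): (K/2) log N - (1/2) sum_k log[p_k (1 - p_k)]."""
    p = f * bin_width
    if np.any(p <= 0.0) or np.any(p >= 1.0):
        raise ValueError("each bin probability must lie strictly in (0, 1)")
    K = f.size
    return 0.5 * K * np.log(N) - 0.5 * np.sum(np.log(p * (1.0 - p)))

# Hypothetical model: K = 4 bins of width 0.5 with probabilities summing to 1.
p = np.array([0.1, 0.2, 0.3, 0.4])
bin_width = 0.5
f = p / bin_width   # density values f_k = p_k / bin width
N = 1000            # assumed number of data points

L_direct = param_code_length_direct(f, N, bin_width)
L_closed = param_code_length_closed(f, N, bin_width)
print(L_direct, L_closed)
```

Both evaluations should agree up to rounding, since each term log[δx^(−1)/σ(fk)] reduces to (1/2) log(N) − (1/2) log[fkδx(1 − fkδx)].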

An alternative, comparable approach to quantify the cost of coding continuously-valued parameters is described in Ref. [1], based on a slightly more involved mathematical formulation. It turns out that quantifying the coding cost of continuously-valued model parameters is an important and recurrent step when applying the MDL principle. We review this alternative approach from Ref. [1] in the Appendix, for better appreciation of different existing variants for applying the MDL principle. With the present approach derived from Ref. [28] and proceeding through Eq. (13), the description length
