31.10.2014 Views

Dictionary-based methods for information extraction - Sapienza

Dictionary-based methods for information extraction - Sapienza

Dictionary-based methods for information extraction - Sapienza

SHOW MORE
SHOW LESS

You also want an ePaper? Increase the reach of your titles

YUMPU automatically turns print PDFs into web optimized ePapers that Google loves.

A. Baronchelli et al. / Physica A 342 (2004) 294 – 300 299<br />

approach [17]). Our results appear as comparable with those obtainedthrough other<br />

completely dierent “whole-genome” analysis (see, <strong>for</strong> instance, [18]). Closely relatedspecies<br />

are correctly grouped(as in the case of E. coli and S. typhimurium,<br />

C. pneumoniae and C. trachomatis, P. abyssi and P. horikoshii, etc.), andsome main<br />

groups of organisms are identied. It is known that the mono-nucleotide composition<br />

is a specie-specic property <strong>for</strong> a genome. This compositional property couldaect our<br />

method: namely two genomes could appear as similar simply because of their similar<br />

C+G content. In order to rule out this hypothesis we per<strong>for</strong>med a new analysis after<br />

shuing genomic sequences andwe noticedthat the resulting new tree was completely<br />

dierent with respect to the one <strong>based</strong> on real sequences.<br />

In conclusion we have dened the dictionary of a sequence and we have shown<br />

how it can be helpful <strong>for</strong> in<strong>for</strong>mation <strong>extraction</strong> purposes. Dictionaries are intrinsically<br />

interesting anda statistical study of their properties can be a useful tool to investigate<br />

the strings they have been extractedfrom. In particular new results regarding the statistical<br />

study of DNA sequences have been presented here. On the other hand, we have<br />

proposedan integration of the string comparison procedure presentedin Ref. [11] that<br />

exploits dictionaries by means of articial texts. This methodgives very goodresults<br />

in several contexts andwe have focusedhere on self-classication problems, showing<br />

two similarity trees <strong>for</strong> corpora of written texts andDNA sequences.<br />

References<br />

[1] T. Jiang, Y. Xu, M. Zang (Eds.), Current Topics in Computational Molecular Biology, MIT Press,<br />

Cambridge, MA, London, England.<br />

[2] C.E. Shannon, A mathematical theory of communication, The Bell System Tech. J. 27 (1948)<br />

379–423, 623–656.<br />

[3] W.H. Zurek (Ed.), Complexity, Entropy and Physics of In<strong>for</strong>mation, Addison-Wesley, Redwood City,<br />

1990.<br />

[4] M. Li, P.M.B. Vitanyi, An Introduction to Kolmogorov Complexity andits Applications, 2ndEdition,<br />

Springer, Berlin, 1997.<br />

[5] A. Lempel, J. Ziv, A universal algorithm <strong>for</strong> sequential data compression, IEEE Trans. In<strong>for</strong>m. Theory<br />

IT-23 (1977) 337–343.<br />

[6] A.D. Wyner, J. Ziv, The sliding-window Lempel-Ziv algorithm is asymptotically optimal, Proc. IEEE<br />

82 (1994) 872–877.<br />

[7] A. Baronchelli, V. Loreto, A data compression approach to in<strong>for</strong>mation retrieval, unpublished,<br />

http://arxiv.org/abs/cond-mat/0403233.<br />

[8] A.D. Wyner, Shannon lecture, Typical sequences andall that: entropy, pattern matching anddata<br />

compression, IEEE In<strong>for</strong>mation Theory Society Newsletter, 1944.<br />

[9] D. Loewenstern, H. Hirsh, P. Yianilos, M. Noordewieret, DNA sequence classication using<br />

compression-<strong>based</strong>induction, DIMACS Technical Report 95-04, 1995.<br />

[10] O.V. Kukushkina, A.A. Polikarpov, D.V. Khmelev, Using Literal andGrammatical Statistics <strong>for</strong><br />

Authorship Attribution. Problemy Peredachi In<strong>for</strong>matsii, 37 (2000) 96–108 (in Russian). Translated<br />

in English, in: Problems of In<strong>for</strong>mation Transmission, Vol. 37, 2001, pp. 172–184.<br />

[11] D. Benedetto, E. Caglioti, V. Loreto, Language trees and zipping, Phys. Rev. Lett. 88 (2002)<br />

048702–048705.<br />

[12] A. Puglisi, D. Benedetto, E. Caglioti, V. Loreto, A. Vulpiani, Data compression and learning in time<br />

sequences analysis, Physica D 180 (2003) 92.<br />

[13] J. Ziv, N. Merhav, A measure of relative entropy between individual sequences with applications to<br />

universal classication, IEEE Trans. In<strong>for</strong>m. Theory 39 (1993) 1280–1292.

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!