96 • 2011 - Hamburger Zentrum für Sprachkorpora - Universität ...
Multilingual Resources and Multilingual Applications - Regular Papers
extensible to more countries. We tested the system on
three countries whose standards for expressing location
information diverge strongly (Germany, Australia
and Japan). New countries can be added within a few
hours, as only certain country-specific files have to be
edited and the corresponding OpenStreetMap knowledge
base has to be plugged in. Most European countries are
similar to Germany, and the U.S. and Canada are almost
identical to the Australian system, so that a large part of
the world can easily be covered.
The system was shown to successfully improve the
address element segmentation in a company-internal
database with high variation in orthography and
formatting, even containing translated names.
Moreover, the system almost always correctly guesses
the country to which textual location information can
be attributed.
In future work, the system can be further improved to
deal with a greater variety of typographical or
transcription errors by using phonetic indexing
algorithms such as Soundex for English or Traphoty matching
rules (Lisbach, 2010) for international languages.
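To illustrate the kind of phonetic indexing mentioned above, the following is a minimal sketch of the standard American Soundex code in Python. It is not part of the system described in this paper; the function name `soundex` and the table `CODES` are assumptions made for this example, and non-ASCII letters (e.g. umlauts) are simply dropped, which is one reason Soundex is best suited to English names.

```python
# Minimal sketch of American Soundex (illustrative only, not the paper's code).
# Consonant groups and the digits they map to; vowels, h, w and y have no code.
CODES = {}
for digit, letters in {"1": "bfpv", "2": "cgjkqsxz", "3": "dt",
                       "4": "l", "5": "mn", "6": "r"}.items():
    for ch in letters:
        CODES[ch] = digit


def soundex(name: str) -> str:
    """Return the 4-character Soundex key of `name` ('' if it has no a-z letters)."""
    letters = [c for c in name.lower() if "a" <= c <= "z"]
    if not letters:
        return ""
    key = letters[0].upper()
    prev = CODES.get(letters[0], "")  # code of the previously seen letter
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with the same code
        code = CODES.get(ch, "")
        if code and code != prev:
            key += code
            if len(key) == 4:
                break
        prev = code  # a vowel resets prev, so a repeated code is kept
    return key.ljust(4, "0")  # pad short keys with zeros


# Spelling variants of the same name collapse to one index key.
print(soundex("Meier"), soundex("Meyer"))
print(soundex("Robert"), soundex("Rupert"))
```

Indexing a name field of a knowledge base by such keys would let a misspelled token be looked up by its key rather than by exact string match; whether this helps for a given language depends on how well the consonant classes fit its orthography.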
6. Acknowledgements
We would like to thank all external annotators who
helped gather and annotate the test data, and
LSS R&D GmbH for making a company-internal
address database available to us in order to test the
system.
7. References
Agichtein, E., Ganti, V. (2004): Mining Reference<br />
Tables for Automatic Text Segmentation. In KDD ’04:<br />
Proceedings of the tenth ACM SIGKDD international<br />
conference on Knowledge discovery and data mining,<br />
Seattle, WA, USA, ACM.<br />
Appelt, D.E., Hobbs, J.R., Bear, J., Israel, D., Tyson, M.<br />
(1992): FASTUS: A Finite-state Processor for<br />
Information Extraction from Real-world Text.<br />
Borkar, V., Deshmukh, K., Sarawagi, S. (2001):
Automatic segmentation of text into structured
records.
Christen, P., Belacic, D. (2005): Automated Probabilistic<br />
Address Standardisation and Verification.
Australasian Data Mining Conference 2005<br />
(AusDM05).<br />
Christen, P., Churches, T., Zhu, J.X. (2002):
Probabilistic Name and Address Cleaning and
Standardisation. The Australasian Data Mining<br />
Workshop 2002.<br />
Cortez, E., De Moura, E.S. (2010): ONDUX: On-<br />
Demand Unsupervised Learning for Information<br />
Extraction. In Proceedings of the 2010 international<br />
conference on Management of data (SIGMOD ’10),
pp. 807–818.<br />
Lisbach, B. (2010): Linguistisches Identity Matching.<br />
Vieweg+Teubner. ISBN 978-3-8348-9791-6. URL
http://dx.doi.org/10.1007/978-3-8348-9791-6_11.
Manning, C.D., Schütze, H. (1999): Foundations of<br />
Statistical Natural Language Processing. The MIT<br />
Press, Cambridge, Massachusetts.<br />
Marques, N.C., Gonçalves, S. (2004): Applying a Part-of-Speech
Tagger to Postal Address Detection on the<br />
Web, 2004.<br />
Peng, F., McCallum, A. (2003): Accurate Information
Extraction from Research Papers using Conditional<br />
Random Fields. In: Information Processing<br />
Management.<br />
Riloff, E. (1993): Automatically Constructing a Dictionary<br />
for Information Extraction Tasks, AAAI<br />
Press / MIT Press. pp. 811–816.<br />
Wolf, J. (2011): Classifying the components of textual
location information. Diploma Thesis, Department für
Linguistik, Universität Potsdam.