
96 • 2011 - Hamburger Zentrum für Sprachkorpora - Universität ...


Multilingual Resources and Multilingual Applications - Regular Papers

extensible to more countries. We tested the system on three countries whose conventions for expressing location information diverge strongly (Germany, Australia and Japan). New countries can be added within a few hours, since only certain country-specific files have to be edited and the corresponding OpenStreetMap knowledge base plugged in. Most European countries follow conventions similar to Germany's, and those of the U.S. and Canada are almost identical to the Australian system, so a large part of the world can be covered easily.

The system was shown to improve address element segmentation in a company-internal database with high variation in orthography and formatting, which even contained translated names. Moreover, the system almost always correctly guesses the country to which a piece of textual location information belongs.

In future work, the system can be further improved to deal with a greater variety of typographical or transcription errors by using phonetic indexing algorithms such as Soundex for English or Traphoty matching rules (Lisbach, 2010) for international languages.
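To illustrate what phonetic indexing would contribute, the following is a minimal sketch of the classic American Soundex code in Python. It is not part of the system described here. The function name `soundex` and the example names are assumptions chosen for illustration, and the sketch handles ASCII letters only, which matches the restriction to English noted above.

```python
# Minimal sketch of American Soundex (illustrative only, not the paper's code).
# Letters that sound alike share a digit; vowels and h/w/y carry no digit.
_CODES = {}
for _letters, _digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                         ("l", "4"), ("mn", "5"), ("r", "6")):
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(name: str) -> str:
    """Return the 4-character Soundex key of `name` ('' if it has no ASCII letters)."""
    # Keep ASCII letters only; accented characters are dropped in this sketch.
    letters = [ch for ch in name.lower() if "a" <= ch <= "z"]
    if not letters:
        return ""
    first = letters[0]
    key = first.upper()
    prev = _CODES.get(first, "")  # the first letter's digit suppresses a repeat
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with the same digit
        digit = _CODES.get(ch, "")
        if digit and digit != prev:
            key += digit
            if len(key) == 4:
                break
        prev = digit  # a vowel resets prev, so a repeated digit is kept after it
    return key.ljust(4, "0")


# Spelling variants that sound alike receive the same key.
print(soundex("Meier"), soundex("Meyer"))
print(soundex("Robert"), soundex("Rupert"))
```

Indexing knowledge-base entries by such a key would let a misspelled or differently transcribed name retrieve candidates that share the key, at the cost of some false matches that a later comparison step would have to filter out.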

6. Acknowledgements

We would like to thank all external annotators who helped gather and annotate the test data, and LSS R&D GmbH for making a company-internal address database available to us for testing the system.

7. References

Agichtein, E., Ganti, V. (2004): Mining Reference Tables for Automatic Text Segmentation. In KDD ’04: Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, Seattle, WA, USA, ACM.

Appelt, D.E., Hobbs, J.R., Bear, J., Israel, D., Tyson, M. (1992): FASTUS: A Finite-state Processor for Information Extraction from Real-world Text.

Borkar, V., Deshmukh, K., Sarawagi, S. (2001): Automatic segmentation of text into structured records.

Christen, P., Belacic, D. (2005): Automated Probabilistic Address Standardisation and Verification. Australasian Data Mining Conference 2005 (AusDM05).

Christen, P., Churches, T., Zhu, J.X. (2002): Probabilistic Name and Address Cleaning and Standardisation. The Australasian Data Mining Workshop 2002.

Cortez, E., De Moura, E.S. (2010): ONDUX: On-Demand Unsupervised Learning for Information Extraction. In Proceedings of the 2010 international conference on Management of data (SIGMOD ’10), pp. 807–818.

Lisbach, B. (2010): Linguistisches Identity Matching. Vieweg+Teubner. ISBN 978-3-8348-9791-6. URL http://dx.doi.org/10.1007/978-3-8348-9791-6_11.

Manning, C.D., Schütze, H. (1999): Foundations of Statistical Natural Language Processing. The MIT Press, Cambridge, Massachusetts.

Marques, N.C., Gonçalves, S. (2004): Applying a Part-of-Speech Tagger to Postal Address Detection on the Web, 2004.

Peng, F., McCallum, A. (2003): Accurate Information Extraction from Research Papers using Conditional Random Fields. In: Information Processing & Management.

Riloff, E. (1993): Automatically Constructing a Dictionary for Information Extraction Tasks. AAAI Press / MIT Press, pp. 811–816.

Wolf, J. (2011): Classifying the components of textual location information. Diploma Thesis, Department für Linguistik, Universität Potsdam.
