05.12.2012 Views

96 • 2011 - Hamburger Zentrum für Sprachkorpora - Universität ...


Multilingual Resources and Multilingual Applications - Regular Papers

Tackling the Variation in International Location Information Data: An Approach Using Open Semantic Databases

Janine Wolf 1, Manfred Stede 2, Michaela Atterer 1

1 Linguistic Search Solutions R&D GmbH, Rosenstraße 2, 10178 Berlin
2 Universität Potsdam, Karl-Liebknecht-Straße 24-25, 14476 Potsdam

E-mail: janine@wolf-velten.de, stede@uni-potsdam.de, michaela.atterer@lssrd.de

Abstract

International location information ranges from mere relational descriptions of places or buildings, through semi-structured address-like information, to fully structured postal address data. To be usable, e.g. for associating events or people with geographical information, these location descriptions have to be decomposed and the relevant semantic information units have to be identified. However, they show a high degree of variation in the order, occurrence and presentation of these units. In this work we present a new approach that uses a semantic database and a rule-based algorithm to tackle the variation in such data and to segment semi-structured location information strings into pre-defined elements. We show that our method is highly suitable for data cleansing and for classifying address data into countries, reaching an f-score of up to 97 for the segmentation task, an f-score of 91 for the labelled segmentation task, and a success rate of 99% in the classification task.

Keywords: address parsing, OpenStreetMap, address segmentation, data cleansing

1. Introduction

Databases of international location information, as maintained by most companies, often contain incomplete address data, variation in the order of elements, a mixture of international address-formatting conventions, or even semi-translated address parts. Moreover, the address data may be structured insufficiently or erroneously with respect to the database fields, which makes the data unusable for further classification, querying and data cleansing tasks. Table 1 shows a number of possible variations of the same German address.

address string                              problem description
Willy-Brandt Street 1, Berlin               partial translations
#1 Willy-Brandt Street, Berlin 1000         non-standard format
Willy-Brand-Str. 1                          incorrect spelling
Willy-Brandt-Str. 1, 1000 Berlin 20         politically outdated
Willy-Brandt-Str.1, Haus 1 3.Et., Zi. 101   presence of more detailed information
In der Willy-Brandt-Str in Berlin           incomplete, e.g. extracted from free text

Table 1: Examples of variation in postal addresses based on the German address Willy-Brandt-Str. 1, 10557 Berlin

Apart from this kind of variation, we also face variation in the description of location objects, such as colloquial variants (Big Apple for New York), historical variants (Chemnitz/Karl-Marx-Stadt), transcription variants (Peking/Beijing) or translation variants (München/Munich).
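Name variants of this kind can be mapped to a single canonical form by a lookup table. The following is a minimal sketch, assuming a small hand-written table that contains only the examples mentioned in the text; the table `VARIANTS`, the function `canonical_name` and the choice of which form counts as canonical are illustrative assumptions, not the semantic database used in this work.

```python
# Toy variant table built from the examples in the text.
# Which form is treated as canonical is an arbitrary choice for illustration.
VARIANTS = {
    "big apple": "New York",        # colloquial variant
    "karl-marx-stadt": "Chemnitz",  # historical variant
    "peking": "Beijing",            # transcription variant
    "munich": "München",            # translation variant
}


def canonical_name(name: str) -> str:
    """Return the canonical name for a known variant, else the trimmed input."""
    cleaned = name.strip()
    return VARIANTS.get(cleaned.casefold(), cleaned)
```

A name that is not in the table, such as Berlin, is returned unchanged, so the lookup can be applied to every location element without a prior check.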

International addresses introduce further variation in address data, as the typical Japanese address shown in Table 2 exemplifies.

part of description string   element type
11-1                         street number (mixed information: estate and building no.)
Kamitoba-hokotate-cho        city district
Minami-ku                    ward of a city (town)
Kyoto                        city (here: also prefecture)
601-8501                     postal code

Table 2: Address elements: Japanese postal address example 11-1 Kamitoba-hokotate-cho, Minami-ku, Kyoto 601-8501
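The decomposition in Table 2 can be written down as a sequence of labelled segments. The sketch below is hand-labelled from the table and only illustrates what a labelled segmentation result might look like; the type `Segment`, the label names and the helper `render` are assumptions made for this example and do not describe the output format of the method presented in this work.

```python
from typing import List, NamedTuple, Sequence


class Segment(NamedTuple):
    text: str   # substring of the address
    label: str  # element type (label names are illustrative)


# Hand-labelled segments of the Japanese example address from Table 2.
JAPANESE_EXAMPLE: List[Segment] = [
    Segment("11-1", "street_number"),
    Segment("Kamitoba-hokotate-cho", "city_district"),
    Segment("Minami-ku", "ward"),
    Segment("Kyoto", "city"),
    Segment("601-8501", "postal_code"),
]


def render(segments: Sequence[Segment], order: Sequence[str]) -> str:
    """Join the segments whose labels appear in `order`, in that order."""
    by_label = {segment.label: segment.text for segment in segments}
    return ", ".join(by_label[label] for label in order if label in by_label)
```

Once the elements carry labels, they can be re-emitted in any target order regardless of the order in the source string, for example `render(JAPANESE_EXAMPLE, ["postal_code", "city", "ward"])`.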

All these variations pose major problems for data warehousing tasks such as deduplication, record linkage and identity matching.

In this work we propose a method which is highly suitable for data cleansing. Tests on German, Australian and Japanese data show that it is also suitable for classifying address data into countries.
