96 • 2011 - Hamburger Zentrum für Sprachkorpora - Universität ...
Multilingual Resources and Multilingual Applications - Regular Papers
Tackling the Variation in International Location Information Data: An Approach Using Open Semantic Databases
Janine Wolf 1, Manfred Stede 2, Michaela Atterer 1
1 Linguistic Search Solutions R&D GmbH, Rosenstraße 2, 10178 Berlin
2 Universität Potsdam, Karl-Liebknecht-Straße 24-25, 14476 Potsdam
E-mail: janine@wolf-velten.de, stede@uni-potsdam.de, michaela.atterer@lssrd.de
Abstract<br />
International location information ranges from mere relational descriptions of places or buildings through semi-structured address-like
information to fully structured postal address data. To be usable, e.g. for associating events or people with
geographical information, these location descriptions have to be decomposed and the relevant semantic information units have to
be identified. However, they show a high degree of variation in the order, occurrence and presentation of these semantic information
units. In this work we present a new approach that uses a semantic database and a rule-based algorithm to tackle the variation in
such data and to segment semi-structured location information strings into pre-defined elements. We show that our method is highly
suitable for data cleansing and for classifying address data into countries, reaching an f-score of up to 97 for the segmentation task, an
f-score of 91 for the labelled segmentation task, and a success rate of 99% in the classification task.
Keywords: address parsing, OpenStreetMap, address segmentation, data cleansing<br />
1. Introduction<br />
Databases of international location information, as
maintained by most companies, often contain
incomplete address data, variation in the order of
elements, a mixture of international conventions for
address formatting or even semi-translated address parts.
Moreover, the address data can be insufficiently or
erroneously structured with respect to the database
fields, which makes the data unusable for further
classification, querying and data cleansing tasks.
Table 1 shows a number of possible variations of the<br />
same German address.<br />
address string                            | problem description
Willy-Brandt Street 1, Berlin             | partial translations
#1 Willy-Brandt Street, Berlin 1000       | non-standard format
Willy-Brand-Str. 1                        | incorrect spelling
Willy-Brandt-Str. 1, 1000 Berlin 20       | politically outdated
Willy-Brandt-Str.1, Haus 1 3.Et., Zi. 101 | presence of more detailed information
In der Willy-Brandt-Str in Berlin         | incomplete, e.g. extracted from free text
Table 1: Examples of variation in postal addresses based on the German address Willy-Brandt-Str. 1, 10557 Berlin
Apart from this kind of variation, we also face variation
in the names of location objects themselves: colloquial
variants (Big Apple for New York), historical
variants (Chemnitz/Karl-Marx-Stadt), transcription
variants (Peking/Beijing) and translation variants
(München/Munich).
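As a minimal illustration of what resolving such name variants involves, the following Python sketch maps the examples above onto one canonical name each. The lookup table, the choice of canonical forms and the function name are assumptions made for illustration only; they do not reproduce the semantic database or the algorithm used in this work.

```python
# Hypothetical variant table: lower-cased variant -> canonical name.
# Which form counts as "canonical" is an arbitrary choice made for this sketch.
CANONICAL_NAMES = {
    "big apple": "New York",        # colloquial variant
    "karl-marx-stadt": "Chemnitz",  # historical variant
    "peking": "Beijing",            # transcription variant
    "munich": "München",            # translation variant
}


def canonical_name(name: str) -> str:
    """Return the canonical form of a place name, or the trimmed input if unknown."""
    cleaned = name.strip()
    return CANONICAL_NAMES.get(cleaned.casefold(), cleaned)
```

A flat table like this only covers exact, case-insensitive matches; spelling errors such as Willy-Brand-Str. in Table 1 would require approximate matching on top of it.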
International addresses introduce further variation into
address data, as the typical Japanese address shown in
Table 2 exemplifies.
part of description string | element type
11-1                       | street number (mixed information: estate and building no.)
Kamitoba-hokotate-cho      | city district
Minami-ku                  | ward of a city (town)
Kyoto                      | city (here: also prefecture)
601-8501                   | postal code
Table 2: Address elements: Japanese postal address example 11-1 Kamitoba-hokotate-cho, Minami-ku, Kyoto 601-8501
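One way to picture the target of a labelled segmentation is as an ordered list of (text, label) pairs. The Python sketch below encodes the decomposition from Table 2 in that form. The `Segment` type, the label strings and the `covers` helper are hypothetical names introduced here for illustration; they are not the data model of the system described in this work.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    text: str   # substring of the address
    label: str  # element type, worded as in Table 2


address = "11-1 Kamitoba-hokotate-cho, Minami-ku, Kyoto 601-8501"

segments = [
    Segment("11-1", "street number"),
    Segment("Kamitoba-hokotate-cho", "city district"),
    Segment("Minami-ku", "ward of a city"),
    Segment("Kyoto", "city"),
    Segment("601-8501", "postal code"),
]


def covers(address: str, segments: list[Segment]) -> bool:
    """Check that all segments occur in the address, in order and without overlap."""
    pos = 0
    for seg in segments:
        idx = address.find(seg.text, pos)
        if idx < 0:
            return False
        pos = idx + len(seg.text)  # continue searching after this segment
    return True
```

Note that the element order here (street number first, postal code last) differs from the German example in Table 1, which is the kind of cross-country variation a segmentation method has to handle.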
All these variations pose major problems for data
warehousing tasks such as deduplication, record linkage and
identity matching.
In this work we propose a method which is highly
suitable for data cleansing. Tests on German, Australian
and Japanese data show that it is also suitable for
classifying address data into countries.