96 • 2011 - Hamburger Zentrum für Sprachkorpora - Universität ...
Multilingual Resources and Multilingual Applications - Regular Papers<br />
The following step maps OSM types to address<br />
elements. In the OpenStreetMap project, every country-<br />
specific subproject, e.g. the Japanese or the German<br />
OSM project, has its own guidelines about how to tag<br />
locations according to their administrative unit status (as<br />
being a city, town or hamlet [4]). We therefore use country-<br />
specific mappings from OSM-internal types to one or<br />
more of the target address element types we<br />
define.<br />
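A minimal sketch of such a country-specific mapping is given below. The OSM type names, the target element names and the country codes are illustrative assumptions, not the mappings actually used in the system.<br />

```python
# Hypothetical country-specific mappings from OSM-internal types to
# target address element types. All keys and values are illustrative.
OSM_TYPE_MAPPINGS = {
    "DE": {
        "city": ["city"],
        "town": ["city"],
        "suburb": ["city_district"],
        "hamlet": ["city", "city_district"],  # one OSM type, several targets
    },
    "JP": {
        "city": ["city"],
        "town": ["town"],
        "village": ["village"],
    },
}


def map_osm_type(country, osm_type):
    """Return the possible address element types for an OSM type.

    Unknown countries or types yield an empty list, so later steps
    can still try to label the token.
    """
    return OSM_TYPE_MAPPINGS.get(country, {}).get(osm_type, [])
```

Returning a list rather than a single value reflects that one OSM type may map to more than one target address element type.<br />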
The enrichment step provides rules for labelling<br />
address elements which have not been attributed a tag<br />
by a previous step because they were not found in the<br />
knowledge base due to spelling errors, for instance. The<br />
completion rules are of the following form:<br />
(type_1, type_2, ..., type_n) --> targetAddressElement<br />
If, for each x = 1, ..., n, type_x is among the possible<br />
types of the token at index x, the tokens in the sequence<br />
are grouped and labelled with the type<br />
targetAddressElement. Examples of language-specific<br />
completion token types are given in Table 3.<br />
A token tagged with one of these affix types indicates a<br />
(possibly still unlabelled) preceding or following location<br />
name; the token group, including the marker token, is<br />
labelled accordingly.<br />
completion type | examples | description<br />
town_suf | ku | Suffix marking a town/ward (Japan)<br />
station_suf | Station, Ekimae, Meieki | Word marking a train station (Japan)<br />
village_suf | mura, son | Suffix marking a village (Japan)<br />
city_dist_pref | Aza, Koaza | Prefix usually preceding a city district or sub-district (Japan)<br />
street_suf | Avenue, Road | Suffix marking a street name (Australia)<br />
state_pref | Freistaat | Prefix marking a state name (Germany)<br />
Table 3: Completion types<br />
Some examples of completion rules are listed in Table 4.<br />
The left hand side of the rules specifies the token type<br />
pattern, the right hand side defines the target address<br />
element. An @ means that the token at the respective<br />
position must have no possible types other than the<br />
specified one.<br />
[4] A hamlet is a small town or village.<br />
completion rule | matching example<br />
(city_prefix,city) --> city | Hansestadt Hamburg<br />
(orientation_prefix,other,street_suffix) --> street_name | Lower Geoge Street (instead of George Street)<br />
(orientation_prefix,city) | East Launcheston<br />
(contains_street_suffix) --> street_name | Ratausstraße (instead of Rathausstraße)<br />
(city,loc_suffix) --> city_district | Berlin Mitte<br />
(state_prefix,state) --> state | Freistaat Bayern<br />
(@city,@city) --> city | Munich (München)<br />
(street_number,street_number_ext) --> street_number | 34a<br />
(street_number,sep_last_alphanum) --> street_number | 34 - 36<br />
Table 4: Example completion rules<br />
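The way a completion rule is matched against a token sequence can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the `Token` class, the function names and the left-to-right, non-overlapping matching strategy are all hypothetical.<br />

```python
from dataclasses import dataclass, field


@dataclass
class Token:
    text: str
    # Possible types attributed by earlier steps (assumed representation).
    types: set = field(default_factory=set)


def matches(token, pattern_type):
    """Check one pattern slot against one token.

    A leading '@' requires that the token has no possible type other
    than the specified one.
    """
    if pattern_type.startswith("@"):
        return token.types == {pattern_type[1:]}
    return pattern_type in token.types


def apply_rule(tokens, pattern, target):
    """Group each non-overlapping window of tokens that matches `pattern`.

    Returns (text, label) pairs; tokens outside a match keep label None.
    """
    out, i, n = [], 0, len(pattern)
    while i < len(tokens):
        window = tokens[i:i + n]
        if len(window) == n and all(
            matches(tok, p) for tok, p in zip(window, pattern)
        ):
            out.append((" ".join(tok.text for tok in window), target))
            i += n
        else:
            out.append((tokens[i].text, None))
            i += 1
    return out


# Rule (state_prefix,state) --> state applied to "Freistaat Bayern".
tokens = [Token("Freistaat", {"state_prefix"}), Token("Bayern", {"state", "other"})]
print(apply_rule(tokens, ["state_prefix", "state"], "state"))
# prints [('Freistaat Bayern', 'state')]
```

With the strict pattern `@city`, a token whose possible types are both city and state would not match, whereas the plain pattern `city` would accept it.<br />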
The final disambiguation step provides rules which<br />
decide which of the attributed types for each element is<br />
selected. In the aforementioned example, Brandenburg<br />
would thus be tagged as a state rather than a city.<br />
The disambiguation rules take the form<br />
(leftNeighbourType, currentType, rightNeighbourType)<br />
where currentType is the target address element type of<br />
the token group under consideration. Either<br />
rightNeighbourType or leftNeighbourType may be empty<br />
(i.e. any type is allowed). If such a rule can be applied,<br />
the token group under consideration will be labelled<br />
with currentType.<br />
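The disambiguation step might be sketched as below. Several details are assumptions that the text does not specify: an empty neighbour is represented as `None`, neighbours are checked against their sets of possible types, the first applicable rule wins, and a group stays unlabelled (`None`) if no rule applies. The two rules in the example are hypothetical.<br />

```python
def rule_applies(rule, groups, i):
    """Test a (left, current, right) rule on the token group at index i.

    `groups` is a list of (text, set_of_possible_types) pairs.
    A neighbour type of None means any type (or no neighbour) is allowed.
    """
    left, current, right = rule
    if current not in groups[i][1]:
        return False
    if left is not None:
        if i == 0 or left not in groups[i - 1][1]:
            return False
    if right is not None:
        if i == len(groups) - 1 or right not in groups[i + 1][1]:
            return False
    return True


def disambiguate(groups, rules):
    """Label each group with the currentType of the first applicable rule."""
    result = []
    for i, (text, _types) in enumerate(groups):
        label = next((r[1] for r in rules if rule_applies(r, groups, i)), None)
        result.append((text, label))
    return result


# Hypothetical rules: a state may follow a city; otherwise prefer city.
rules = [("city", "state", None), (None, "city", None)]
groups = [("Potsdam", {"city"}), ("Brandenburg", {"city", "state"})]
print(disambiguate(groups, rules))
# prints [('Potsdam', 'city'), ('Brandenburg', 'state')]
```

Rule order matters in this sketch: placing the more specific rule first lets Brandenburg be labelled a state when it follows a city, while a lone ambiguous token falls back to the more general rule.<br />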
4. Experiments<br />
4.1. Data<br />
We conducted our experiments using two different<br />
datasets. The first dataset was collected from the<br />
Internet; the second was a company-internal<br />
database. Eleven external annotators collected variations<br />
of location information data from the Internet and<br />
annotated them according to the annotation guidelines<br />
given in Wolf (2011). They collected 154 strings for<br />
German, 35 of which were used for development and the<br />
rest for testing. For Australia they collected 143 strings,<br />