05.12.2012 Views

96 • 2011 - Hamburger Zentrum für Sprachkorpora - Universität ...


Multilingual Resources and Multilingual Applications - Regular Papers

The following step maps OSM types to address elements. In the OpenStreetMap project, every country-specific subproject, e.g. the Japanese or the German OSM project, has its own guidelines on how to tag locations according to their administrative unit status (city, town or hamlet⁴). We therefore use country-specific mappings from OSM-internal types to one or more of the target address element types we define.
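As a rough illustration of such a country-specific mapping, the following Python sketch looks up the target address element types for an OSM type. The table entries, the country codes and the function name `map_osm_type` are assumptions made for this example; they are not the mappings actually used in the system.

```python
# Hypothetical country-specific mappings from OSM-internal place types to
# one or more target address element types. All entries are illustrative
# assumptions, not the mappings described in the paper.
OSM_TYPE_MAPPINGS = {
    "DE": {
        "city": ["city"],
        "town": ["city"],
        "suburb": ["city_district"],
        "hamlet": ["city", "city_district"],
    },
    "JP": {
        "city": ["city"],
        "town": ["city", "city_district"],
        "hamlet": ["city_district"],
    },
}


def map_osm_type(country, osm_type):
    """Return the target address element types for an OSM type.

    Unknown countries or OSM types yield an empty list.
    """
    return OSM_TYPE_MAPPINGS.get(country, {}).get(osm_type, [])


if __name__ == "__main__":
    print(map_osm_type("DE", "hamlet"))
    print(map_osm_type("JP", "town"))
```

Keeping one table per country means the same OSM type can map to different address elements depending on the local tagging guidelines, and a single OSM type can yield several candidate types that a later step has to disambiguate.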

The enrichment step provides rules for labelling address elements that were not assigned a tag by a previous step, for instance because spelling errors prevented them from being found in the knowledge base. The completion rules have the following form:

(type_1, type_2, ..., type_n) --> targetAddressElement

If, for each x = 1..n, type_x is among the possible types of the token at index x, the tokens in the sequence are grouped and labelled with the type targetAddressElement. Examples of language-specific completion token types are given in Table 3. A token tagged with one of these affix types indicates a (possibly still unlabelled) preceding or following location name, and the token group, including the marker token, is labelled accordingly.

compl. type       examples                   description
town_suf          ku                         Suffix marking a town/ward (Japan)
station_suf       Station, Ekimae, Meieki    Word marking a train station (Japan)
village_suf       mura, son                  Suffix marking a village (Japan)
city_dist_pref    Aza, Koaza                 Prefix usually preceding a city district or sub-district (Japan)
street_suf        Avenue, Road               Suffix marking a street name (Australia)
state_pref        Freistaat                  Prefix marking a state name (Germany)

Table 3: Completion types

Some examples of completion rules are listed in Table 4. The left-hand side of a rule specifies the token type pattern, the right-hand side defines the target address element. An @ means that the token at the respective position must not have any possible type other than the specified one.

4 A hamlet is a small town or village.

completion rule                                               matching example
(city_prefix,city) --> city                                   Hansestadt Hamburg
(orientation_prefix,other,street_suffix) --> street_name      Lower Geoge Street (instead of George Street)
(orientation_prefix,city)                                     East Launcheston
(contains_street_suffix) --> street_name                      Ratausstraße (instead of Rathausstraße)
(city,loc_suffix) --> city_district                           Berlin Mitte
(state_prefix,state) --> state                                Freistaat Bayern
(@city,@city) --> city                                        Munich (München)
(street_number,street_number_ext) --> street_number           34a
(street_number,sep_last_alphanum) --> street_number           34 - 36

Table 4: Example completion rules
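The rule format can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the `Token` class, the function names, the left-to-right scan and the "first matching rule wins" policy are assumptions made for this example. Only the rule patterns and the example strings are taken from Table 4.

```python
from dataclasses import dataclass, field


@dataclass
class Token:
    text: str
    types: set = field(default_factory=set)  # all possible types of the token


def rule_matches(pattern, window):
    """Check a type pattern against a window of tokens of the same length.

    A pattern element prefixed with '@' requires that the token has no
    possible type other than the specified one.
    """
    if len(window) != len(pattern):
        return False
    for element, token in zip(pattern, window):
        exclusive = element.startswith("@")
        wanted = element.lstrip("@")
        if wanted not in token.types:
            return False
        if exclusive and token.types != {wanted}:
            return False
    return True


def apply_completion_rules(tokens, rules):
    """Group and label token sequences; unmatched tokens get the label None.

    Assumption: rules are tried in order and the first match wins.
    """
    result = []
    i = 0
    while i < len(tokens):
        for pattern, target in rules:
            window = tokens[i:i + len(pattern)]
            if rule_matches(pattern, window):
                result.append((" ".join(t.text for t in window), target))
                i += len(pattern)
                break
        else:
            result.append((tokens[i].text, None))
            i += 1
    return result


RULES = [
    (("city_prefix", "city"), "city"),
    (("state_prefix", "state"), "state"),
    (("@city", "@city"), "city"),
]

if __name__ == "__main__":
    tokens = [Token("Hansestadt", {"city_prefix"}),
              Token("Hamburg", {"city", "state"})]
    print(apply_completion_rules(tokens, RULES))
```

In this sketch the plain pattern `(city_prefix,city)` accepts "Hamburg" even though the token could also be a state, whereas `(@city,@city)` would reject any token that carries a second possible type.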

The final disambiguation step provides rules that decide which of the attributed types is selected for each element. In the aforementioned example, Brandenburg would thus be tagged as a state and not as a city. The disambiguation rules take the form

(leftNeighbourType, currentType, rightNeighbourType)

where currentType is the target address element type of the token group under consideration. Either rightNeighbourType or leftNeighbourType may be empty (i.e. any type is allowed). If such a rule can be applied, the token group under consideration is labelled with currentType.
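A disambiguation pass of this kind might look like the Python sketch below. Several details are assumptions, since the text does not specify them: an empty neighbour slot is written as `None`, a neighbour satisfies a rule if the required type is among its candidate types, rules are tried in order, and a group with exactly one candidate type keeps it when no rule applies. The rule and the input string in the example are hypothetical.

```python
def disambiguate(groups, rules):
    """Select one type per token group.

    groups: list of (text, set_of_candidate_types)
    rules:  list of (left_type, current_type, right_type); None in a
            neighbour slot means that any type is allowed there.
    """
    labelled = []
    for i, (text, candidates) in enumerate(groups):
        left = groups[i - 1][1] if i > 0 else set()
        right = groups[i + 1][1] if i + 1 < len(groups) else set()
        chosen = None
        for left_type, current_type, right_type in rules:
            if current_type not in candidates:
                continue
            if left_type is not None and left_type not in left:
                continue
            if right_type is not None and right_type not in right:
                continue
            chosen = current_type  # assumption: first applicable rule wins
            break
        # Assumption: an unambiguous group keeps its only candidate type.
        if chosen is None and len(candidates) == 1:
            chosen = next(iter(candidates))
        labelled.append((text, chosen))
    return labelled


if __name__ == "__main__":
    # Hypothetical rule: a group that follows a city is labelled as a state.
    rules = [("city", "state", None)]
    groups = [("Potsdam", {"city"}), ("Brandenburg", {"city", "state"})]
    print(disambiguate(groups, rules))
```

With this hypothetical rule, "Brandenburg" is labelled as a state because its left neighbour can be a city, which mirrors the behaviour described for the Brandenburg example.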

4. Experiments

4.1. Data

We conducted our experiments using two different datasets. The first dataset was collected from the Internet; the second was a company-internal database. Eleven external annotators collected variations of location information data from the Internet and annotated them according to the annotation guidelines given in Wolf (2011). They collected 154 strings for German, 35 of which were used for development and the rest for testing. For Australia they collected 143 strings,
