96 • 2011 - Hamburger Zentrum für Sprachkorpora - Universität ...


Multilingual Resources and Multilingual Applications - Regular Papers

34 of which were used for development. The Japanese data were collected and annotated by the first author. 76 of the 242 data points were used for development. The company-internal database contained 57 examples for Germany, 162 examples for Australia and 56 for Japan. They were already (sometimes incorrectly) assigned to three database fields: address, postal code and city. To obtain a gold standard, the first author manually re-ordered the elements correctly.

4.2. Segmentation

Our first experiment consisted of correctly segmenting the internet data with our system. As a baseline we used unsophisticated systems for each language, each of which took about 1.5 hours to program and uses patterns for postal codes, a small list of street-name endings and knowledge about the typical order of address elements in the country. Our evaluation should thus reflect the advantage of a full-fledged system over an ad hoc solution.

Tables 5, 6 and 7 show the evaluation results for the segmentation task for each country, using F-scores based on recall and precision as computed by the PARSEVAL measures (cf. Manning & Schütze, 1999), which are suitable for evaluating systems that generate bracketed structures with labels.

F-score type    baseline    system
unlabelled      87.36       96.91
labelled        70.23       91.36
Table 5: Evaluation results for German data

F-score type    baseline    system
unlabelled      68.05       95.85
labelled        64.93       86.60
Table 6: Evaluation results for Australian data

F-score type    baseline    system
unlabelled      75.45       91.80
labelled        45.47       73.50
Table 7: Evaluation results for Japanese data

The baseline systems struggled above all with multi-token address elements (e.g. Frankfurt (Main), Bad Homburg) and with addresses that did not conform to the standard ordering.
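To make the difference between the unlabelled and labelled scores concrete, the following is a minimal Python sketch of PARSEVAL-style scoring over address segments. The span representation, the label names, the function name f_score and the example address are assumptions made for illustration; they are not taken from the evaluated system.

```python
from typing import List, Tuple

# A span is (start_token, end_token_exclusive, label).
# This representation is hypothetical, chosen only for the illustration.
Span = Tuple[int, int, str]


def f_score(gold: List[Span], predicted: List[Span], labelled: bool) -> float:
    """Return the F-score (in percent) of predicted spans against gold spans.

    With labelled=False only the span boundaries must match;
    with labelled=True the label must match as well.
    """
    def key(span: Span):
        start, end, label = span
        return (start, end, label) if labelled else (start, end)

    gold_set = {key(s) for s in gold}
    pred_set = {key(s) for s in predicted}
    correct = len(gold_set & pred_set)
    if correct == 0:
        return 0.0
    precision = correct / len(pred_set)
    recall = correct / len(gold_set)
    return 100 * 2 * precision * recall / (precision + recall)


# Tokens of a made-up address: ["Hauptstr.", "5", "61348", "Bad", "Homburg"]
gold = [(0, 1, "street"), (1, 2, "number"), (2, 3, "postal_code"), (3, 5, "city")]

# Hypothetical system output: all boundaries correct, one label wrong.
predicted = [(0, 1, "street"), (1, 2, "postal_code"), (2, 3, "postal_code"), (3, 5, "city")]

unlabelled = f_score(gold, predicted, labelled=False)
labelled = f_score(gold, predicted, labelled=True)
print(f"unlabelled F = {unlabelled:.2f}, labelled F = {labelled:.2f}")
```

In this example every boundary matches, so the unlabelled score is 100, while the single wrong label lowers the labelled score to 75. The same effect explains why the labelled rows in Tables 5, 6 and 7 are lower than the unlabelled rows.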
The full-fledged system clearly outperforms the baselines by a difference in F-score (when counting correct labels and not only correct element boundaries) of 21 points for Germany, 12 for Australia and 28 for Japan. The contribution of the completion patterns was an increase in F-score of up to 13.03 points for the Japanese data (unlabelled) and a minimum of 0.28 for Australia (labelled).

4.3. Data cleansing

In a second experiment we tested whether the system is suitable for data cleansing. A problem already mentioned in the introduction is data that is wrongly structured with respect to the fields of a database. By using the system to assign address elements to database fields, we could reduce the rate of elements in an incorrect field of the company-internal database by 16.77 percentage points (pp) for Germany, 19.31pp for Australia and 29.84pp for Japan.

4.4. Address classification

We also conducted an experiment to find out whether the system is able to correctly guess the country of a location information string. Our testing method ignores country information (Japan, Germany, Australia) if present. For each candidate country, it computes the rate of tokens in the input that the system could not classify, either via the database or via the country-specific patterns for suffixes, prefixes, special words or alphanumeric strings. The system then selects the country with the lowest rate of unlabelled tokens. For this experiment, we used 518 location information strings from both the Internet and the company-internal data (166 from Germany, 271 from Australia, 81 from Japan), 99.22% of which were correctly attributed to their country.

5. Discussion and Future Work

We present a system that successfully deals with the high variability in international textual location information by classifying the components of location strings. The implemented system is robust and easily extensible to more countries. We tested the system with three countries with strongly diverging standards for the expression of location information (Germany, Australia and Japan). New countries can be added within a few hours, as only certain country-specific files have to be edited and the corresponding OpenStreetMap knowledge base has to be plugged in. Most European countries are similar to Germany, and the U.S. and Canada are almost identical to the Australian system, so that a large part of the world can easily be covered.

The system was shown to successfully improve the address element segmentation in a company-internal database with high variation in orthography and formatting, even containing translated names. Moreover, the system is almost always able to correctly guess the country that textual location information can be attributed to.

In future work, the system can be further improved to deal with a greater variety of typographical or transcription errors by using phonetic indexing algorithms such as Soundex for English or Traphoty matching rules (Lisbach, 2010) for international languages.

6. Acknowledgements

We would like to thank all external annotators who helped gather and annotate the test data, and LSS R&D GmbH for making a company-internal address database available to us in order to test the system.

7. References

Agichtein, E., Ganti, V. (2004): Mining Reference Tables for Automatic Text Segmentation. In KDD ’04: Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, Seattle, WA, USA, ACM.

Appelt, D.E., Hobbs, J.R., Bear, J., Israel, D., Tyson, M. (1992): FASTUS: A Finite-state Processor for Information Extraction from Real-world Text.

Borkar, V., Deshmukh, K., Sarawagi, S. (2001): Automatic segmentation of text into structured records.

Christen, P., Belacic, D. (2005): Automated Probabilistic Address Standardisation and Verification. Australasian Data Mining Conference 2005 (AusDM05).

Christen, P., Churches, T., Zhu, J.X. (2002): Probabilistic Name and Address Cleaning and Standardisation. The Australasian Data Mining Workshop 2002.

Cortez, E., De Moura, E.S. (2010): ONDUX: On-Demand Unsupervised Learning for Information Extraction. In Proceedings of the 2010 international conference on Management of data (SIGMOD ’10), pp. 807–818.

Lisbach, B. (2010): Linguistisches Identity Matching. Vieweg+Teubner. ISBN 978-3-8348-9791-6. URL http://dx.doi.org/10.1007/978-3-8348-9791-6_11.

Manning, C.D., Schütze, H. (1999): Foundations of Statistical Natural Language Processing. The MIT Press, Cambridge, Massachusetts.

Marques, N.C., Gonçalves, S. (2004): Applying a Part-of-Speech Tagger to Postal Address Detection on the Web, 2004.

Peng, F., McCallum, A. (2003): Accurate Information Extraction from Research Papers using Conditional Random Fields. In: Information Processing Management.

Riloff, E. (1993): Automatically Constructing a Dictionary for Information Extraction Tasks, AAAI Press / MIT Press. pp. 811–816.

Wolf, J. (2011): Classifying the components of textual location information. Diploma Thesis, Department für Linguistik, Universität Potsdam.