PhD thesis - School of Informatics - University of Edinburgh

Chapter 3. Tracking English Inclusions in German

TreeTagger’s performance when trained on either the NEGRA or the TIGER corpus. As reported in the system description in Section 3.3, the final full English inclusion classifier incorporates TnT_TIGER as the POS tagger in the pre-processing step and Yahoo in the search engine module. TnT_TIGER was chosen because it yields a marked performance increase over TnT_NEGRA on the space travel and EU domains, of 7.91 and 6.73 points in F-score, respectively. On the internet data, the TnT_TIGER and TnT_NEGRA modules perform very similarly. It can therefore be concluded that a POS tagger trained on a sufficiently large corpus is a vital component of the English inclusion classifier. The following section motivates the decision to use Yahoo in the search engine module.

3.5.2 Task-based Evaluation of Different Search Engines

Tokens that are not found in the German or English lexical database are passed to a back-off search engine module (Section 3.3.4). Each such token is queried twice, once restricted to German webpages and once restricted to English webpages, a language preference that most search engines offer. The token is then classified according to the maximum normalised score of the number of hits returned for each language. This module therefore relies to some extent on the search engine’s internal language identification algorithm.

During the initial stages of developing the English inclusion classifier, Google was used in the search engine module (Alex and Grover, 2004; Alex, 2005). The main reason for opting for Google was that it was the largest search engine available at the time, spanning over 8 billion webpages. It also offers the language preference setting, which is essential for determining counts. Moreover, queries can be automated by means of the Google SOAP Search API (beta), which allows 1,000 queries per day.
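The back-off decision described at the start of Section 3.5.2 can be illustrated with a minimal sketch. It assumes that the normalised score is the hit count divided by the estimated web corpus size for that language; the exact normalisation is defined in Section 3.3.4 and is not reproduced here. The function names, the tie-breaking rule and the hit counts in the usage lines are hypothetical; the corpus sizes are the Yahoo estimates quoted later in this section.

```python
# Estimated Yahoo web corpus sizes in tokens, as quoted in this section.
CORPUS_SIZE = {"de": 53.3e9, "en": 638.9e9}


def normalised_score(hits: int, corpus_size: float) -> float:
    """Hit count relative to the estimated web corpus size of one language.

    Assumption: this simple ratio stands in for the normalisation
    defined in Section 3.3.4.
    """
    return hits / corpus_size


def classify_token(hits: dict, corpus_size: dict = CORPUS_SIZE) -> str:
    """Return the language ("de" or "en") with the higher normalised score."""
    scores = {
        lang: normalised_score(hits[lang], corpus_size[lang])
        for lang in ("de", "en")
    }
    # Assumption: ties, including zero hits for both languages,
    # fall back to German, the base language of the text.
    return "en" if scores["en"] > scores["de"] else "de"


# Hypothetical hit counts for two tokens.
print(classify_token({"de": 1_000, "en": 5_000}))    # de
print(classify_token({"de": 1_000, "en": 50_000}))   # en
```

Because English web content is far larger than German web content, a raw hit count of 5,000 for English against 1,000 for German still yields a German classification once the counts are normalised. Multiplying both hit counts by the same factor leaves the decision unchanged under this sketch.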
During the course of developing the English inclusion classifier, Yahoo, another search engine, also made an API available, which allows 5,000 searches per day [12]. In Yahoo, searches can also be restricted to webpages of a particular language. The only difference between the two search engines is the number of webpages they index. In August 2005, Yahoo announced that it indexes more than 19.2 billion documents [13], which amounts to more than double the number of webpages (8.2 billion) indexed by Google.

[11] Google SOAP Search API: http://www.google.com/apis/
[12] http://developer.yahoo.com/
[13] http://www.ysearchblog.com/archives/000172.html

A discussion on the Corpora List in May 2005 [14] and a series of studies carried out by Jean Véronis [15] signal that the real number of webpages indexed by a search engine is not necessarily in line with the one that is advertised. Therefore, it is difficult to rely on such quoted figures. Possible artificial inflation of the number of returned hits does not, however, affect the performance of the English inclusion classifier as long as this inflation occurs consistently for each language-specific query. For example, for Yahoo the estimated English corpus size is 638.9bn tokens whereas that for German is 53.3bn tokens. These estimates illustrate that the English web content is much larger than the German one, which is also reflected in the percentages of English and German internet users presented in Figure 2.1 (Chapter 2). The ratio between the estimated web corpora for English and German amounts to nearly 12 to 1 in this case. This ratio is similar to those obtained by Grefenstette and Nioche (2000) and Kilgarriff and Grefenstette (2003) (15 to 1 and 11 to 1, respectively), who performed the same estimation using Altavista as the underlying search engine. Before evaluating the use of different search engines with regard to the performance of the English inclusion classifier, the results of a time comparison experiment are reported. [16]

3.5.2.1 Time Comparison Experiment

Comparing the time required to run the Yahoo module with that required to run the Google module reveals that the former is considerably faster. Table 3.18 shows the time it takes to estimate the size of the web corpus for three different languages using either Yahoo or Google (Section 3.3.4). The experiment was performed on a 2.4GHz Pentium 4. This estimation involves 16 queries to the search engine API per language. Yahoo is faster than Google for all three languages, by a factor of up to 6.1.
Web Corpus   German   French   English
Yahoo        6.8s     7.2s     7.6s
Google       35.9s    33.0s    46.4s

Table 3.18: Time required for web corpus estimation using Yahoo and Google.

[14] http://torvald.aksis.uib.no/corpora/2005-1/0191.html
[15] http://aixtal.blogspot.com/2005/08/yahoo-19-billion-pages.html
[16] All task-based search engine evaluation experiments were conducted in April 2006.
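The speed-up factors implied by Table 3.18 can be recomputed with a short sketch. The timings and the figure of 16 queries per language are taken from the text; the function names are hypothetical, and the per-query average assumes that the measured time is spread evenly over the 16 queries, which the source does not state.

```python
# Timings in seconds from Table 3.18.
TIMINGS = {
    "Yahoo": {"German": 6.8, "French": 7.2, "English": 7.6},
    "Google": {"German": 35.9, "French": 33.0, "English": 46.4},
}
QUERIES_PER_LANGUAGE = 16  # number of API queries per language, as stated in the text


def speedup(timings: dict, language: str) -> float:
    """Factor by which Yahoo is faster than Google for one language."""
    return timings["Google"][language] / timings["Yahoo"][language]


def seconds_per_query(timings: dict, engine: str, language: str,
                      n_queries: int = QUERIES_PER_LANGUAGE) -> float:
    """Average time per query, assuming the total is spread evenly."""
    return timings[engine][language] / n_queries


for language in TIMINGS["Yahoo"]:
    print(f"{language}: Yahoo is {speedup(TIMINGS, language):.1f} times faster")
# German: Yahoo is 5.3 times faster
# French: Yahoo is 4.6 times faster
# English: Yahoo is 6.1 times faster
```

The largest factor, 6.1 for English, matches the figure given in the text. Under the even-spread assumption, a single Yahoo query for German takes about 0.43 seconds on average.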