05.03.2013 Views

PhD thesis - School of Informatics - University of Edinburgh


Chapter 3. Tracking English Inclusions in German

TreeTagger’s performance when trained on either the NEGRA or the TIGER corpus.

As reported in the system description in Section 3.3, the final full English inclusion classifier incorporates TnTTIGER as the POS tagger in the pre-processing step and Yahoo in the search engine module. TnTTIGER was chosen because it yields a drastic performance increase over the TnTNEGRA module for the space travel and EU domains, of 7.91 and 6.73 points in F-score, respectively. On the internet data, the TnTTIGER and TnTNEGRA modules perform very similarly. It can therefore be concluded that a POS tagger trained on a sufficiently large corpus is a vital component of the English inclusion classifier. The following section motivates the decision to use Yahoo in the search engine module.

3.5.2 Task-based Evaluation of Different Search Engines

Tokens that are not found in the German or English lexical database are passed to a back-off search engine module (Section 3.3.4). Each such token is queried twice, once restricted to German webpages and once restricted to English webpages, using the language preference offered by most search engines. The token is then classified based on the maximum normalised score of the number of hits returned for each language. This module therefore relies to some extent on the search engine’s internal language identification algorithm.
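The classification step can be illustrated with a minimal sketch. It assumes that the hit counts are normalised by dividing them by an estimate of the number of webpages indexed for each language; this section does not spell out the normalisation, so that choice, the tie-breaking rule, and all names used (`HitCounts`, `classify_token`, the index size parameters) are hypothetical. The hit counts themselves would come from the two language-restricted queries, which are not shown here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class HitCounts:
    """Hit counts for one token: one query per language restriction."""
    german: int
    english: int


def classify_token(hits: HitCounts,
                   german_index_size: float,
                   english_index_size: float) -> Optional[str]:
    """Return "EN" or "DE" for a token, or None if no hits were returned.

    ASSUMPTION: each count is normalised by an estimate of the number of
    indexed webpages in that language. The token receives the language
    with the maximum normalised score.
    """
    if hits.german == 0 and hits.english == 0:
        return None  # the search engine offers no evidence either way

    score_de = hits.german / german_index_size
    score_en = hits.english / english_index_size

    # ASSUMPTION: ties fall back to German, the base language of the text.
    return "EN" if score_en > score_de else "DE"
```

With illustrative, made-up numbers, `classify_token(HitCounts(german=100, english=5000), 1_000_000, 10_000_000)` labels the token as English, since the normalised English score is higher even though the English index is assumed to be ten times larger.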

During the initial stages of developing the English inclusion classifier, Google was used in the search engine module (Alex and Grover, 2004; Alex, 2005). The main reason for opting for Google was that it was the biggest search engine available at the time, spanning over 8 billion webpages. It also offers the language preference setting, which is essential for determining counts. Moreover, queries can be automated by means of the Google SOAP Search API (beta), which allows 1,000 queries per day.¹¹ During the course of developing the English inclusion classifier, Yahoo, another search engine, also made an API available, which allows 5,000 searches per day.¹² In Yahoo, searches can also be restricted to webpages of a particular language. The only difference between the two search engines is the number of indexed webpages. In August 2005, Yahoo announced that it indexes more than 19.2 billion documents,¹³ which amounts to more than double the number of webpages (8.2 billion) indexed by

11 http://www.google.com/apis/
12 http://developer.yahoo.com/
13 http://www.ysearchblog.com/archives/000172.html
