PhD thesis - School of Informatics - University of Edinburgh
Chapter 3. Tracking English Inclusions in German
TreeTagger’s performance when trained on either the NEGRA or the TIGER corpus.
As reported in the system description in Section 3.3, the final full English inclusion
classifier incorporates TnTTIGER as the POS tagger in the pre-processing step and
Yahoo in the search engine module. TnTTIGER was chosen because
this module results in a drastic performance increase over the TnTNEGRA
module for the space travel and EU domains, of 7.91 and 6.73 points in F-score,
respectively. On the internet data, the TnTTIGER and the TnTNEGRA modules perform
very similarly. It can therefore be concluded that a POS tagger trained on a sufficiently
large corpus is a vital component of the English inclusion classifier. The following
section motivates the decision to use Yahoo in the search engine module.
3.5.2 Task-based Evaluation of Different Search Engines
Tokens that are not found in the German or English lexical database are passed to
a back-off search engine module (Section 3.3.4). Each such token is queried twice,
once restricted to German webpages and once restricted to English webpages, a
language preference that is offered by most search engines. The token is then
classified based on the maximum normalised score of the number
of hits returned for each language. This module therefore relies to some extent on the
search engine’s internal language identification algorithm.
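The decision rule described above can be illustrated with a minimal sketch. It is not the implementation used in the classifier: the function names, the `hit_count` callback standing in for a language-restricted search engine query, the per-language corpus size estimates used for normalisation, and the fall-back to German for ties and zero-hit tokens are all assumptions made for illustration.

```python
from typing import Callable, Dict

# Assumption: tokens with no hits, or with tied scores, default to the base language.
BASE_LANGUAGE = "de"


def normalised_scores(
    token: str,
    hit_count: Callable[[str, str], int],
    corpus_size: Dict[str, float],
) -> Dict[str, float]:
    """Divide the hits for each language by an assumed size estimate for that language."""
    return {
        lang: hit_count(token, lang) / size
        for lang, size in corpus_size.items()
    }


def classify_token(
    token: str,
    hit_count: Callable[[str, str], int],
    corpus_size: Dict[str, float],
) -> str:
    """Return the language with the maximum normalised score.

    hit_count(token, lang) is a hypothetical stand-in for a search engine
    query restricted to webpages in the language lang.
    """
    scores = normalised_scores(token, hit_count, corpus_size)
    best = max(scores.values())
    winners = [lang for lang, score in scores.items() if score == best]
    if best == 0 or len(winners) > 1:
        return BASE_LANGUAGE
    return winners[0]
```

In this sketch, a caller would supply a `hit_count` function wrapping whichever search engine API is in use, together with a dictionary such as `{"de": ..., "en": ...}` of size estimates; the language identification itself remains delegated to the search engine.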
During the initial stages of developing the English inclusion classifier, Google was
used in the search engine module (Alex and Grover, 2004; Alex, 2005). The main
reason for opting for Google was that it was the biggest search engine available at
the time, spanning over 8 billion webpages. It also offers the language preference
setting, which is essential for determining counts. Moreover, queries can be automated
by means of the Google SOAP Search API (beta), which allows 1,000 queries per day. 11
During the course of developing the English inclusion classifier, Yahoo, another search
engine, also made an API available, which allows 5,000 searches per day. 12 In Yahoo,
searches can also be restricted to webpages of a particular language. The only
difference between the two search engines is their number of indexed webpages. In
August 2005, Yahoo announced that it indexes more than 19.2 billion documents, 13
which amounts to more than double the number of webpages (8.2 billion) indexed by
11 http://www.google.com/apis/<br />
12 http://developer.yahoo.com/<br />
13 http://www.ysearchblog.com/archives/000172.html