PhD thesis - School of Informatics - University of Edinburgh
Chapter 3. Tracking English Inclusions in German
3.3.1 Processing Paradigm
The underlying processing paradigm of the English inclusion classifier is XML-based. As a markup language for NLP tasks, XML is expressive and flexible yet constrainable. Furthermore, there exists a wide range of XML-based tools for NLP applications which lend themselves to a modular, pipelined approach to processing whereby linguistic knowledge is computed and added incrementally as XML annotations. Moreover, XML's character encoding capabilities facilitate multilingual processing. As illustrated in Figure 3.1, the system for processing German text is essentially a UNIX pipeline which converts HTML files to XML and applies a sequence of modules: a pre-processing module for tokenisation and POS tagging, followed by a lexicon lookup, a search engine module, post-processing and an optional document consistency check, all of which add linguistic markup and classify tokens as either German or English. The pipeline is composed partly of calls to LT-TTT2 and LT-XML2 (Grover et al., 2006)4 for tokenisation and sentence splitting. In addition, non-XML public-domain tools such as the TnT tagger (Brants, 2000b) were integrated and their output incorporated into the XML markup. The primary advantage of this architecture is the ability to integrate the output of already existing tools with that of new modules specifically tailored to the task in an organised fashion. The XML output can be searched to find specific instances or to acquire counts of occurrences using the LT-XML2 tools.
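The incremental-annotation idea can be sketched in a few lines of Python. The element names and the toy POS tagset below are illustrative assumptions, not the actual LT-XML2 markup: each pipeline stage reads the XML produced so far and adds one further layer of attributes without disturbing the earlier ones.

```python
import xml.etree.ElementTree as ET

# Hypothetical token markup, as an earlier tokenisation stage might emit it.
doc = ET.fromstring("<s><w>Der</w><w>Computer</w><w>ist</w><w>neu</w></s>")

# Stage 1: a toy POS tagger adds one annotation layer (pos attributes).
toy_pos = {"Der": "ART", "Computer": "NN", "ist": "VAFIN", "neu": "ADJD"}
for w in doc.iter("w"):
    w.set("pos", toy_pos.get(w.text, "UNK"))

# Stage 2: a toy lexicon lookup adds a language attribute on top,
# leaving the stage-1 markup untouched.
english_lexicon = {"computer"}
for w in doc.iter("w"):
    w.set("lang", "en" if w.text.lower() in english_lexicon else "de")

print(ET.tostring(doc, encoding="unicode"))
```

Because each stage only adds attributes, the enriched document can still be queried downstream, e.g. to count all tokens classified as English.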
3.3.2 Pre-processing Module
All downloaded Web documents are first cleaned with TIDY5 to remove HTML markup and any non-textual information, and then converted into XML. Alternatively, the input to the classifier can be plain text, which is subsequently converted into XML format. The resulting XML pages contain only the textual information of each article. Subsequently, all documents are passed through a series of pre-processing steps implemented using the LT-XML2 and LT-TTT2 tools (Grover et al., 2006), with the output of each step encoded in XML.
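The effect of this cleanup step can be sketched in Python using the standard-library HTML parser; this is an illustration of what the TIDY stage achieves (discarding tags and non-textual content, keeping running text), not the tool itself, and the element name `text` is an assumption.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

class TextExtractor(HTMLParser):
    """Collects only textual content, skipping script/style elements."""
    def __init__(self):
        super().__init__()
        self.skip_depth = 0
        self.chunks = []
    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip_depth += 1
    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skip_depth:
            self.skip_depth -= 1
    def handle_data(self, data):
        if not self.skip_depth and data.strip():
            self.chunks.append(data.strip())

html = ("<html><head><script>x=1</script></head>"
        "<body><p>Ein Text.</p></body></html>")
parser = TextExtractor()
parser.feed(html)

# Wrap the surviving text in a minimal XML document for the pipeline.
root = ET.Element("text")
root.text = " ".join(parser.chunks)
print(ET.tostring(root, encoding="unicode"))
```

The resulting XML carries only the article text, which is what the subsequent tokenisation and tagging steps operate on.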
Two rule-based grammars which were developed specifically for German are used<br />
4 These tools are improved upgrades of the LT-TTT and LT-XML toolsets (Grover et al., 2000; Thompson et al., 1997) and are available under GPL as LT-TTT2 and LT-XML2 at: http://www.ltg.ed.ac.uk
5 http://tidy.sourceforge.net