12.07.2015 Views

Corpus-Based Thesaurus Construction for Image Retrieval in ...

Corpus-Based Thesaurus Construction for Image Retrieval in ...

Corpus-Based Thesaurus Construction for Image Retrieval in ...

SHOW MORE
SHOW LESS

Create successful ePaper yourself

Turn your PDF publications into a flip-book with our unique Google optimized e-Paper software.

1990-2001 were used. Our corpus comprises 1451 texts (891 written <strong>in</strong> BritishEnglish and 560 <strong>in</strong> American English) conta<strong>in</strong><strong>in</strong>g a total of 610,197 tokens. Thegenres <strong>in</strong>clude <strong>in</strong><strong>for</strong>mative texts, like journal papers, <strong>in</strong>structive texts, <strong>for</strong> example,handbooks and imag<strong>in</strong>ative texts, primarily advertisements. These texts were gatheredfrom the Web us<strong>in</strong>g the Google Search Eng<strong>in</strong>e by key<strong>in</strong>g <strong>in</strong> the terms <strong>for</strong>ensic andscience. We analyzed the frequency distribution of compound terms <strong>in</strong> the mothercorpus and consulted our expert <strong>in</strong><strong>for</strong>mants –scene of crime officers (SOCOs)regard<strong>in</strong>g more specialized texts: We found two sub-doma<strong>in</strong>s, crime scenephotography (CSP) and footwear impressions (FI) and collected texts from the Webby key<strong>in</strong>g <strong>in</strong> the two terms and follow<strong>in</strong>g the l<strong>in</strong>ks <strong>in</strong>dicated by the texts. (The CSPcorpus has 63328 tokens and the FI 11332). Furthermore, the SOCOs provided 53crime-scene <strong>for</strong>ms compris<strong>in</strong>g 6580 tokens. As part of the SoCIS project, 10 SOCOsprovided a subsequently transcribed commentary on 66 scene-of-crime imagescompris<strong>in</strong>g a total of 5,000 tokens. In this way we had built a mother corpus ofForensic Science and four progeny corpora representative of sub-doma<strong>in</strong>s of ForensicScience that could be used <strong>for</strong> a comparative analysis.2.1 Lexical Signature of the Doma<strong>in</strong>The major build<strong>in</strong>g blocks of a thesaurus are the <strong>in</strong>dividual words, lexical units orterms that <strong>for</strong>m its backbone. The frequency of occurrence of open class words(OCWs) with<strong>in</strong> a corpus can be an <strong>in</strong>dication of terms that are accepted as part of thatlanguage’s register. A frequency analysis was conducted on the <strong>for</strong>ensic sciencecorpus to determ<strong>in</strong>e its lexical signature: the key s<strong>in</strong>gle words used frequently tosituate the text <strong>in</strong> the doma<strong>in</strong> of <strong>for</strong>ensic science. Typically, the first hundred mostfrequent tokens <strong>in</strong> a text corpus comprise over 40% of the total text: this is true of the100-million word British National <strong>Corpus</strong> (BNC) [23], the Longman <strong>Corpus</strong> ofContemporary English (LCCE), as is true of a number of specialist corpora we havebuilt over the last 15 years [20]. The key difference is that <strong>for</strong> the general languagecorpora the first 100 most frequent tokens are essentially the so-called closed classwords (CCW) or grammatical words: <strong>in</strong> the specialist corpora, <strong>in</strong> contrast, as much as20% of the first hundred most frequent tokens comprise the so-called open-class orlexical words. The import of this f<strong>in</strong>d<strong>in</strong>g, <strong>for</strong> us, is that these frequent words are usedmore productively <strong>in</strong> the morphological sense <strong>in</strong> that their <strong>in</strong>flections (plurals ma<strong>in</strong>ly)and compounds based on these frequent words also tend to dom<strong>in</strong>ate the text corpus.The frequent use attests to the acceptability of the s<strong>in</strong>gle and compound words: this<strong>for</strong> us is crucial <strong>in</strong> build<strong>in</strong>g a thesaurus. A look at the 100 most frequent words <strong>in</strong> theForensic Science corpus shows that the first 20 most frequent words are <strong>in</strong>deed theCCWs compris<strong>in</strong>g just under 30% of the total corpus – a figure similar to the BNC.The next 10 most frequent tokens comprise three open class words –evidence, crime,and scene: these 10 words comprise 3.78% of the total corpus and three open classwords contribute a 1.0% to this 3.78%. In itself 1% is a small number, but studies ofword frequency suggest otherwise: <strong>for</strong> every set of 100 words <strong>in</strong> a <strong>for</strong>ensic text it isstatistically possible that 1 word would be either of the three. The total contribution ofthe 21 open class words amongst the 100 most frequent comes to about 6% - this isthe frequency of the most frequent word <strong>in</strong> written English – the determ<strong>in</strong>er the.

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!