
Corpus-Based Thesaurus Construction for Image Retrieval in Specialist Domains

Khurshid Ahmad, Mariam Tariq, Bogdan Vrusias and Chris Handy
Department of Computing, School of Electronics and Physical Sciences, University of Surrey, Guildford, GU2 7XH, United Kingdom
{k.ahmad, m.tariq, b.vrusias, c.j.handy}@surrey.ac.uk

Abstract. This paper explores the use of texts that are related to an image collection, also known as collateral texts, for building thesauri in specialist domains to aid in image retrieval. Corpus linguistic and information extraction methods are used for identifying key terms and conceptual relationships in specialist texts that may be used for query expansion purposes. The specialist domain context imposes certain constraints on the language used in the texts, which makes the texts computationally more tractable. The effectiveness of such an approach is demonstrated through a prototype system that has been developed for the storage and retrieval of images and texts, applied in the forensic science domain.

1 Introduction

A visual scene in a specialist domain may contain a range of information that usually cannot be detected by an untrained person.
An art critic can discern several aspects of an object in a Cubist painting and convince us that what appears to be a juxtaposition of geometrical elements is a striking portrait, or a bicycle, or indeed a group of soldiers creating mayhem. An experienced scene-of-crime officer can identify and describe the 'existence' of various (parts of) human beings and objects in a scene-of-crime image, particularly the physical attributes and relative locations of these objects, which may not be obvious to 12 adults sitting on a jury panel. However, once the officer describes the objects, the untrained person can generally discern immediately the attributes and locations that, hitherto, were not so obvious to them. The link between an image and its verbal description is one of the central issues in most disciplines that deal with human vision and, by implication, human intelligence. The Scene of Crime Information System (SoCIS) project 1 has attempted to explore this link by analyzing the use of texts related to images, also known as collateral texts, for the indexing and retrieval of specialist images. We have adopted a corpus-based approach and used information extraction techniques in developing a prototype text-enhanced image storage and retrieval system. This paper reports work in progress on

1 A three-year EPSRC-sponsored project (Grant No. GR/M89041) jointly undertaken by the Universities of Sheffield and Surrey and supported by five police forces in the UK.


the construction of a thesaurus from domain-specific texts for query expansion purposes.

The development of digital visual archives brings with it the problem of indexing the images in the archives. These indices act as the equivalent of keywords used to index text documents. The currently available digital archives range from medical image archives to archives of the images of paintings and press-agency photo collections. During the last three decades significant effort has been spent on systems that focus exclusively on vision-specific features such as colour distribution, shape and texture, often called content-based image retrieval (CBIR) systems, and this term can be used to describe many research as well as commercially available systems [1]. However, CBIR systems have an implicit limitation in that visual properties cannot, in themselves, be used to identify arbitrary classes of objects. Indeed there are theoretical limitations on using the visual features for describing an image [2] and, recently, some practical limitations of such an approach have been outlined as well [3]. Earlier image retrieval systems, as well as the images retrieved by search engines, rely almost exclusively on keywords. The problem here is that appending keywords to an image is not only quite time consuming, the estimates vary from minutes to hours [4][5], but the choice of keywords may in itself show the bias of the indexer.
This has led to the so-called multimodal systems, which essentially use linguistic features extracted from textual captions or descriptions together with the visual features for storing and retrieving images in databases. The terminology used in the description of such systems is indicative of the multimodal nature: Picard's Visual Thesaurus [6]; Srihari's arguments on 'texts' that are collateral to an image [7]. Retrieval is shown to be more effective when textual features are used together with the visual features; for example, [8] show a mild improvement in precision and recall statistics when the combined features were used to query an image database compared to when either text or visual features were used alone. There are still limitations where the keywords are concerned, however, in that the use of synonyms, abbreviations or related words, as well as broader or narrower words, is not taken into account. The issue of inter-indexer variability [4], the variation in the verbal outputs of different indexers for the same image, has shown a use of related terms. The term query expansion 2 refers to the addition of search terms to a user's search for improving precision and/or recall. The additional terms may be taken either from a thesaurus or from documents that the user has specified as being relevant. A thesaurus is a "controlled and dynamic vocabulary of semantically and generically related terms, which covers a specific domain of knowledge" [9]. The most common relationships in a thesaurus are related terms (RT), broader terms (BT) and narrower terms (NT).
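The RT/BT/NT expansion just described can be sketched in a few lines. The following is a minimal illustration, assuming a thesaurus stored as a dictionary; the entries shown are hypothetical examples invented for this sketch, not drawn from the SoCIS thesaurus:

```python
# Hypothetical thesaurus entries for illustration only (not from SoCIS).
THESAURUS = {
    "firearm": {"RT": ["ballistics"], "BT": ["weapon"], "NT": ["handgun", "rifle"]},
    "handgun": {"RT": ["cartridge"], "BT": ["firearm"], "NT": []},
}

def expand_query(terms, relations=("RT", "BT", "NT")):
    """Add related (RT), broader (BT) and narrower (NT) terms to a query."""
    expanded = list(terms)
    for term in terms:
        entry = THESAURUS.get(term, {})
        for relation in relations:
            for related in entry.get(relation, []):
                if related not in expanded:
                    expanded.append(related)
    return expanded

print(expand_query(["firearm"]))
# ['firearm', 'ballistics', 'weapon', 'handgun', 'rifle']
```

Restricting `relations` to a subset (for example only `("NT",)`) lets the same routine trade recall for precision, which matches the precision/recall motivation above.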
There do exist some general-purpose thesauri, Roget's thesaurus [10] being the classic example, as well as lexical resources such as WordNet 3, but the problem is that their coverage is too broad, rendering them inadequate for use in specialized domains such as forensic science or medicine. Hence, when images of a specialist nature are to be stored, there is a problem in that the thesauri and relevant documents for query expansion are not readily available, establishing the need to create domain-specific thesauri. These could be manually built by expert

2 http://wombat.doc.ic.ac.uk/foldoc/ (Site visited 13 Nov 2002)
3 http://www.cogsci.princeton.edu/~wn/


lexicographers, of which there are a number of examples like the NASA thesaurus 4 and the Arts and Architecture Thesaurus (AAT) 5; however, handcrafted thesauri face similar problems to those of manual keyword indexing, being time-consuming to build, subjective and error-prone, as well as having the additional issue of inadequate domain coverage. A solution to this is the automatic generation of thesauri for specialized domains from representative text documents.

Automatic thesaurus generation was initially addressed by [11][12], as far back as the 1970s, through the use of association matrices, which use statistical term-to-term co-occurrence measures as a basis for identifying related terms. This method has a number of drawbacks: many unrelated terms will co-occur due to being highly frequent or general; synonyms are hardly used together; only single-word terms are considered, whereas in a number of specialist domains multi-word terms are used frequently; and a cluster of associated terms is produced with no knowledge of the kinds of relationships between the terms. [13] addressed the fact that synonyms were more likely to have similar co-occurrence patterns rather than co-occur in a document or document collection, by associating a term with a phrase based on its contextual information. The SEXTANT system [14] uses weak syntactic analysis methods on texts to generate thesauri under the assumption that similar terms will appear in similar syntactic relationships, and groups them according to the grammatical context in which they appear.
Both the methods above are viable approaches but still do not address the shortcoming of undefined relationships between terms.

In the hope of addressing some of the issues mentioned above, we intend to explore a different approach to thesaurus construction, based on the context of specialist languages within recent developments in corpus linguistics, and in particular corpus-based lexicography and corpus-based terminology. The proponents of corpus linguistics claim that a text corpus, a randomly selected but systematically organized collection of texts, can be used to derive empirical knowledge about language, which can supplement, and frequently supplant, information from reference sources and introspection [15]. A significant practical application of this empirical approach has been found in dictionary making, or lexicography: this is facilitated in large measure by the computation of the frequency and collocation of tokens in the text. Expert lexicographers can then elaborate on the elicited vocabulary [16]. The random selection involves selecting equal chunks from every text collected or selecting randomly from a catalogue of 'books in print.' The systematic organization refers to selecting different genres of texts: formal and informal types, for example journal articles and popular science articles; instructive and informative types, for example advanced texts, basic textbooks and instruction manuals; and so on.
There is much discussion in corpus linguistics about what constitutes a representative corpus. This indeed is an issue for texts written in general language, or language of everyday use. If one wishes to use the methods of corpus linguistics for specialized subjects, the question of representativeness is not as vexatious. This is, perhaps, because the linguistic output of a specialist community is limited, in sheer volume and in genre, as compared to that of the broader general language community.

4 http://www.sti.nasa.gov/thesfrm1.htm
5 http://www.getty.edu/research/tools/vocabulary/aat/


Specialist languages, considered variants of natural language [17], are restricted syntactically and semantically, which makes them easier to process at the lexical, morphological and semantic levels [17, 18, 19]. There is a preponderance of open class words in specialist languages, particularly nouns and nominalizations, as they deal with objects and named events, actions and states. Specialist languages tend to use a large number of compound terms, and these compounds also relate to named entities within the domain. Again, as science and technology deal with named entities, researchers aim to create structures to organize the interrelationships between these named entities. This organization is argued for and reported in the literature of the domain, through the use of lexical semantic relationships. It has been suggested that not only can one extract keywords from a specialist corpus [18, 20, 21] but one can also extract semantic relations of taxonomy and meronymy (part-whole relations) from free texts [22]. Thus, for us, the extraction of terms and their interrelationships from a text corpus to start building a thesaurus is an attractive proposition. Section 2 of this paper discusses the issue of the association between images and texts under the idiosyncratic term collateral texts and how one goes about building a representative corpus of collateral texts in order to construct a thesaurus. The issue of domain coverage has been investigated through the comparison of terms extracted from a progeny corpus (representative of a sub-domain) to those extracted from a mother corpus.
This functionality has been incorporated in a text-enhanced image retrieval system (Section 3). A brief outline of on-going work is given in Section 4.

2 Thesauri Construction from Specialist Text Corpora

Fig. 1. Closely and broadly collateral texts (the figure shows example text types arranged from closely collateral, such as captions, descriptions and crime scene reports, to broadly collateral, such as newspaper articles, dictionary definitions and encyclopaedic definitions)

An image may be associated in various ways with the different texts that may exist collateral to it. These texts may contain a full or partial description of the content of the image or they may contain metadata information. Texts could be closely collateral, like the caption of an image, which will describe only what is depicted in the image, or


broadly collateral, such as the report describing a crime scene, where the content of the image would be discussed together with other information related to the crime; the closer the collateral text to the image, the higher the co-dependency. The degree of co-dependency between an image and its collateral text can be exploited for indexing and retrieving images at different levels of abstraction.

The computer-based analysis of closely collateral texts, written/spoken for a restricted readership, say by scene of crime officers for describing scene of crime images, will help in the identification of objects and their relationships in an image of a specific scene of crime. This information can then be used for indexing that particular image. An analysis of broadly collateral texts, texts written to instruct and inform a broader readership, for instance a scene of crime manual dealing with the collection of evidence written by experts for scene-of-crime officers, or a new technique described in a journal paper, will help in the identification and elaboration of broader terms of the domain, which can be useful in the construction of a thesaurus. Such a thesaurus has to be updated at regular intervals, or indeed very frequently in specialisms that are undergoing rapid change.
Forensic science may be a good example: here developments in computer vision, in analytical chemistry and molecular biology, together with developments in law regarding indirectly-derived evidence, for example DNA fingerprinting and digital photography, not only affect the administration of justice but bring a plethora of new terms that have been adopted by forensic scientists and used by police officers. Many of these terms are introduced in the broadly collateral texts and, in due course, find their way into closely collateral texts.

A corpus-based approach can be used to investigate the language of forensic science for thesaurus building purposes, where the corpus can be said to consist of broadly and closely collateral texts with respect to a typical scene of crime image collection. The aim is to study the behavior of the language at the lexical, morphological and lexical-syntactic levels to determine whether it has a discernible structure that can be used to extract terms and their relationships. Typically, following work on corpus-based lexicography [15], terminologists collect and analyze a (random) sample of free texts in a given domain. The text is analyzed at the lexical and morphological level and the frequency of tokens and their morphological forms is noted. Candidate terms are produced by contrasting the frequency of tokens in the specialist corpus with the frequency of the same tokens in a representative corpus of general language.
This ratio, sometimes referred to as a weirdness measure [18], is a good indication that the candidate will be approved by an expert to be a term.

weirdness coefficient = (f_s / N_s) / (f_g / N_g)    (1)

where
f_s = frequency of the term in a specialist corpus
f_g = frequency of the term in general language
N_s = total number of terms in the specialist corpus
N_g = total number of terms in the general language

A forensic science corpus of over half a million words has been created. To ensure that the corpus is representative of the domain, a variety of text types ranging from


1990-2001 were used. Our corpus comprises 1451 texts (891 written in British English and 560 in American English) containing a total of 610,197 tokens. The genres include informative texts, like journal papers; instructive texts, for example handbooks; and imaginative texts, primarily advertisements. These texts were gathered from the Web using the Google search engine by keying in the terms forensic and science. We analyzed the frequency distribution of compound terms in the mother corpus and consulted our expert informants, scene of crime officers (SOCOs), regarding more specialized texts. We found two sub-domains, crime scene photography (CSP) and footwear impressions (FI), and collected texts from the Web by keying in the two terms and following the links indicated by the texts. (The CSP corpus has 63,328 tokens and the FI corpus 11,332.) Furthermore, the SOCOs provided 53 crime-scene forms comprising 6,580 tokens. As part of the SoCIS project, 10 SOCOs provided a subsequently transcribed commentary on 66 scene-of-crime images comprising a total of 5,000 tokens. In this way we had built a mother corpus of Forensic Science and four progeny corpora representative of sub-domains of Forensic Science that could be used for a comparative analysis.

2.1 Lexical Signature of the Domain

The major building blocks of a thesaurus are the individual words, lexical units or terms that form its backbone.
The frequency of occurrence of open class words (OCWs) within a corpus can be an indication of terms that are accepted as part of that language's register. A frequency analysis was conducted on the forensic science corpus to determine its lexical signature: the key single words used frequently to situate a text in the domain of forensic science. Typically, the first hundred most frequent tokens in a text corpus comprise over 40% of the total text: this is true of the 100-million-word British National Corpus (BNC) [23] and the Longman Corpus of Contemporary English (LCCE), as it is true of a number of specialist corpora we have built over the last 15 years [20]. The key difference is that for the general language corpora the first 100 most frequent tokens are essentially the so-called closed class words (CCWs), or grammatical words; in the specialist corpora, in contrast, as much as 20% of the first hundred most frequent tokens comprise the so-called open-class or lexical words. The import of this finding, for us, is that these frequent words are used more productively in the morphological sense, in that their inflections (plurals mainly) and compounds based on these frequent words also tend to dominate the text corpus. The frequent use attests to the acceptability of the single and compound words: this for us is crucial in building a thesaurus.
A look at the 100 most frequent words in the Forensic Science corpus shows that the first 20 most frequent words are indeed CCWs, comprising just under 30% of the total corpus, a figure similar to the BNC. The next 10 most frequent tokens include three open class words, evidence, crime, and scene: these 10 words comprise 3.78% of the total corpus, and the three open class words contribute 1.0% to this 3.78%. In itself 1% is a small number, but studies of word frequency suggest otherwise: for every set of 100 words in a forensic text it is statistically possible that one word would be one of these three. The total contribution of the 21 open class words amongst the 100 most frequent comes to about 6%; this is the frequency of the most frequent word in written English, the determiner the.
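The frequency analysis described above can be sketched as follows. This is a toy illustration: the sample "corpus" and the closed-class word list are invented for the sketch, not the Forensic Science corpus or a full CCW inventory:

```python
from collections import Counter

# Toy closed-class (grammatical) word list; a real analysis would use a
# full inventory of determiners, prepositions, pronouns, auxiliaries, etc.
CLOSED_CLASS = {"the", "of", "and", "to", "a", "in", "is", "that", "for",
                "on", "was", "at", "with", "by"}

def lexical_signature(tokens, top_n=100):
    """Return the share of the corpus covered by the top_n most frequent
    tokens, and the open-class words among them with relative frequencies."""
    counts = Counter(tokens)
    total = len(tokens)
    top = counts.most_common(top_n)
    share = sum(freq for _, freq in top) / total
    open_class = [(word, freq / total)
                  for word, freq in top if word not in CLOSED_CLASS]
    return share, open_class

# Toy "corpus" for illustration only.
tokens = ("the crime scene evidence was found at the crime scene "
          "and the evidence of the crime is forensic").split()
share, ocws = lexical_signature(tokens, top_n=5)
# In this toy text the top-5 tokens cover two thirds of the tokens and
# three of them (crime, scene, evidence) are open-class.
```

On a real corpus the open-class entries in `ocws` are the candidates for the kind of lexical signature shown in Table 1.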


The arguments of Halliday and Martin [17] can also be partially attested by noting four derivations in Table 1 (identification, analysis, information, and investigation, from the verbs to identify, to analyse, to inform and to investigate). The lexical signature can be identified more vividly by comparing the distribution of these tokens in the BNC. The CCWs are used with the same (relative) frequency, but the open class words occur far more frequently: the token forensic is 471 times more frequent in our corpus as compared to the BNC, followed by crime (53 times), scene (38 times), and evidence (20 times).

Table 1. Lexical Signature of the Forensic Science Corpus (open class words shown with their frequency rank in parentheses)

Tokens                                                                             Cumulative Relative Frequency
the, of, and, to, a, in, is, be, that, for                                         23.30%
or, on, as, was, by, s, with, are, from, it                                         5.97%
this, an, evidence(23), at, not, crime(26), can, have, which, scene(30)             3.78%
were, forensic(32), should, he, will, when, police(37), may, if, de                 2.39%
has, been, other, one, they, all, identification(47), had, used, these              1.97%
but, case(52), also, their, his, there, any, found(58), court(59), such             1.59%
two, analysis(62), more, what, body(65), i, no, who, use, d                         1.35%
some, where, blood(73), time(74), information(75), l, you, only, into, victim(80)   1.19%
must, m, dna(83), would, her, then, science(87), sample(88), most, than             1.06%
we, cases(92), after, test(94), made, about, investigation(96), its, new, each      0.98%

Weirdness shows the skewness in the distribution of the words in two corpora.
A higher weirdness indicates significant use of the word in the specialist corpus as compared to a general language corpus, an extreme example being a weirdness of infinity, indicating that the term is not present in the BNC at all. This might be indicative of neologisms in the text, which have not yet been adopted in general language. The frequency ratio is a good indicator of termhood: Table 2 lists a number of terms that have infinite weirdness; note some are single-word technical terms, but many are compounded terms like bitemark or toolmark, or plurals of terms like shoeprints, the singular of which does exist in the BNC.

Table 2. Terms with a weirdness of infinity ordered on relative frequency (f/N), N = 610,197

Single Term  f/N       Compound Term   f/N       Compound Term  f/N
rifling      0.0139%   bitemark        0.0174%   spectroscopy   0.0092%
pyrolysis    0.0124%   earprint        0.0122%   handguns       0.0090%
accelerant   0.0105%   nightlead       0.0105%   shoeprints     0.0070%
polygraph    0.0081%   handgun         0.0105%   toolmark       0.0045%
accelerants  0.0079%   fingerprinting  0.0093%   earprints      0.0040%

Tokens that are absent in the BNC are a part of the potential signature of a specialist domain. The other tokens that form part of the signature are tokens with significantly high relative frequency, the italicised open class words in Table 1.
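The weirdness computation of equation (1), including the infinite-weirdness case for terms absent from the general-language corpus, can be sketched as follows. The counts used here are hypothetical illustrations, not the actual Forensic Science corpus or BNC figures:

```python
import math

def weirdness(term, specialist_counts, general_counts, n_s, n_g):
    """Weirdness measure (equation 1): relative frequency in the specialist
    corpus divided by relative frequency in the general-language corpus.
    Absence from the general corpus gives infinite weirdness."""
    f_s = specialist_counts.get(term, 0)
    f_g = general_counts.get(term, 0)
    if f_g == 0:
        return math.inf if f_s > 0 else 0.0
    return (f_s / n_s) / (f_g / n_g)

# Hypothetical counts for illustration; not the real corpus/BNC figures.
specialist = {"forensic": 1563, "bitemark": 106, "evidence": 2757}
general = {"forensic": 530, "bitemark": 0, "evidence": 22000}
n_s, n_g = 610_197, 100_000_000

w_forensic = weirdness("forensic", specialist, general, n_s, n_g)
w_bitemark = weirdness("bitemark", specialist, general, n_s, n_g)  # math.inf
```

Sorting the vocabulary by this score, with infinite-weirdness terms first, reproduces the kind of ranking shown in Table 2.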
These tokens/terms are used productively to make compound terms, something which we show is crucial for a thesaurus to be used in query expansion. Table 3 comprises 10 of the frequently used tokens that are amongst the 100 most used open class words, together with an exemplar set of (relatively) high-frequency compounds that comprise two of the 10 single tokens.


Table 3. Highly frequent terms and their compounds (Mother Corpus, total 610,197 tokens)

Token     f     f/N     Weirdness        Compound            Singular  Plural
analysis  862   0.0014  10.54            blood spatter       17        12
blood     781   0.0013  12.43            crime scene         495       69
crime     2366  0.0038  53.52            dna analysis        39        Not found
dna       676   0.0011  33.05            forensic science    229       82
evidence  2757  0.0045  20.77            physical evidence   161       NA
forensic  1563  0.0025  471.04
homicide  237   0.0004  228.30
physical  382   0.0006  6.45
scene     1605  0.0026  38.18
science   634   0.0010  9.68

Indeed, the two most frequent tokens, crime and scene, are used to form over 90 different compounds, some comprising up to three other (high frequency) tokens, for instance crime scene investigator, crime scene photography, crime scene processing, crime scene technician, crime scene sketch, and crime scene photography personnel agency. Much the same is true of other tokens in Table 3: blood, dna and so on are used just as productively. Note also the tendency of the authors of these texts to use plurals of terms as well as the singulars. The identification of a lexical signature for a domain, containing both single and compound terms, from a randomly selected corpus will help to initiate the development of a thesaurus for the domain.

The weirdness measure can also be used to determine the lexical signature of a progeny corpus as compared to a mother corpus. In the above example the mother corpus was the BNC and the forensic science corpus the progeny corpus. A statistical sampling was done of the OCWs for three of the progeny corpora in comparison to the mother corpus.
The ratio r refers to the relative frequency, f/N, where f is the frequency of the token and N the total number of tokens. Note that the highly frequent OCWs in the Footwear and CS Photography progeny corpora generally have a much higher weirdness when compared to the FS mother corpus than that of the same words in the FS corpus compared to the BNC. Table 4 shows a comparison of selected high-weirdness terms in the three progeny corpora. The analysis of progeny corpora may yield more specialized terms, indicating they have their own lexical signature, and these either could be added to the main thesaurus or kept separately as a sub-domain thesaurus.

Table 4. Comparison of three progeny corpora (r_F, r_CS and r_T are relative frequencies in the Footwear, CS Photography and Transcripts corpora; r_M in the FS mother corpus; r_BNC in the BNC)

Footwear     r_F/r_M  r_M/r_BNC   CS Photos     r_CS/r_M  r_M/r_BNC   Transcripts    r_T/r_M  r_M/r_BNC
footwear     126      40          lens          67        8           footwear       41       40
reebok       55       2           underexposed  49        INF         fingermarks    18       INF
molding      55       INF         lenses        43        3           ricochet       18       19
gatekeeping  27       INF         tripod        43        12          strangulation  16       21
impressions  25       31          enlargements  39        17          splatter       16       35

Single word terms, with some key exceptions, are often used as carrier terms in that they form compounds to give more specific meaning to an existing concept: in themselves the single word terms are usually too generic. An automatic extraction and


analysis of compound terms may perhaps lead to the conceptual structure of the domain in question. Compound terms typically have a nominal head qualified by an adjective or compounded with another noun or noun phrase, following the pattern

    NP → [Adjective | Noun] NP

They are usually not interspersed with closed-class words. For example, gunshot residue analysis tool may be written as a tool for the analysis of residues left by a gunshot, but that will be rare.

A heuristic to identify compound nouns in free text is as follows: given a sequence of two or more consecutive words, it is possible to assume that this sequence is a compound term provided that there are no closed-class words in the sequence. For the development of our method for extracting terms from an arbitrary domain, this is another important heuristic for searching for compound terms and perhaps also for validating them. Church and Mercer have suggested that one may use the Student's t-test for testing whether or not a collocation, or a compound token found in the analysis of a text or corpus, is due to random chance [24]. The authors have simplified the computation of the t-score and suggest that for a collocate x+y:

    t ≈ (f(x,y) - f(x)f(y)/N) / sqrt(f(x,y))        (2)

where f(x,y) is the frequency of the compound token, and f(x) and f(y) those of the single tokens comprising the compound.
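Equation (2) can be sanity-checked against the published figures for crime scene. The sketch below is ours; it scales the relative frequencies reported in Table 5 back to raw counts using N = 610,197 (Table 3):

```java
// Sketch of the simplified t-score of equation (2): for a candidate
// compound x+y, t ~ (f(x,y) - f(x)*f(y)/N) / sqrt(f(x,y)).
public class TScore {

    public static double tScore(double fxy, double fx, double fy, double n) {
        return (fxy - fx * fy / n) / Math.sqrt(fxy);
    }

    public static void main(String[] args) {
        // Relative frequencies for "crime scene~" as reported in Table 5,
        // scaled back to raw counts with N = 610,197 taken from Table 3.
        double n = 610_197;
        double t = tScore(0.000912 * n, 0.003825 * n, 0.002858 * n, n);
        // Rounding in the published relative frequencies makes this ~23.3,
        // close to the 23.46 reported in Table 5.
        System.out.printf("t(crime, scene) = %.2f%n", t);
    }
}
```

On Church and Mercer's criterion, a candidate compound would be retained when its t-score exceeds roughly 2.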
Church and Mercer have suggested that 'if the t-score is larger than 1.65 standard deviations then we ought to believe that the co-occurrences are significant and we can reject the null hypothesis with 95% confidence, though in practice we might look for a t-score of 2 or more standard deviations'. In Table 5 we show the t-scores for a number of high-frequency compound tokens in the FS corpus. We have tabulated both the singular and plural forms of the token where relevant, hence the ~ sign; for example, scene~ indicates both singular and plural.

Table 5. 10 most frequent compound terms

  x           y              f(x,y)/N   f(x)/N     f(y)/N     t(x,y)   t(x,y)/2
  crime       scene~         0.000912   0.003825   0.002858   23.46    11.73
  forensic    science~       0.000503   0.002527   0.00124    17.53    8.76
  workplace   homicide       0.000145   0.00032    0.000462   9.48     4.74
  crime       lab(orator)~   0.000149   0.003825   0.00167    9.18     4.59
  cartridge   case~          0.000131   0.000414   0.002656   8.92     4.46
  body        fluid~         0.000112   0.001352   0.000236   8.28     4.14
  criminal    justice        9.21E-05   0.000689   0.00047    7.52     3.76
  fire        scene~         8.89E-05   0.000774   0.002858   7.23     3.62
  dna         analysis       6.3E-05    0.001093   0.001394   6.09     3.05
  blood       spatter~       4.69E-05   0.001263   9.7E-05    5.37     2.69

2.2 Discovery of Conceptual Relations

Every language has its own vocabulary, where lexical units or words represent certain concepts and these words are grammatically arranged in certain patterns that convey a


meaning. A range of semantic relations may exist between these different lexical units. [25] presents a model that illustrates some basic relationships between classes of entities. The first is Identity, where class X and class Y have exactly the same members; the corresponding lexical relation is synonymy, for example "fingerprint" and "lift" are synonyms in that they are syntactically equal. The second is Inclusion, where class Y is entirely included within class X; the corresponding lexical relation is hyponymy, which is most commonly illustrated by the construct 'Y is a kind/type of X', as in 'a gun is a type of firearm'. The most common types of lexical hierarchy are a taxonomy, which reflects the hyponymy relationship, also known as the supertype/subtype or subsumption relationship, and a meronomy, which models the part-whole relationship. This section attempts to discover frequently occurring patterns that demonstrate these two types of relationships.

Phrasal structures such as compound words convey a certain semantic relationship between their constituent lexical units. Headwords such as scene are usually weak semantically in that their meaning cannot easily be ascertained out of context unless they are part of a compound; for example, crime scene and movie scene specify what type of scene is meant. Compounding tends to specialize the meaning of the headword, and this can be used to create a hierarchy. For example, taking the three compounds blood sample, fingerprint sample and DNA sample shows that blood, dna and fingerprint are different types of samples.
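The compound-to-hierarchy step just described can be sketched by grouping compounds under their final (head) token; the class and method names below are ours:

```java
import java.util.*;

// Sketch of compound-structure analysis: grouping compounds by their
// nominal head yields broader-term/narrower-term candidates, e.g.
// "sample" as the broader term of "blood sample", "dna sample", etc.
public class HeadGrouping {

    /** Maps each head (last token) to the compounds that end with it. */
    public static Map<String, List<String>> groupByHead(List<String> compounds) {
        Map<String, List<String>> hierarchy = new TreeMap<>();
        for (String c : compounds) {
            String[] tokens = c.trim().split("\\s+");
            String head = tokens[tokens.length - 1];
            hierarchy.computeIfAbsent(head, k -> new ArrayList<>()).add(c);
        }
        return hierarchy;
    }

    public static void main(String[] args) {
        System.out.println(groupByHead(List.of(
                "blood sample", "fingerprint sample", "dna sample",
                "crime scene", "movie scene")));
    }
}
```

Each key is then a candidate broader term (sample, scene) and its associated list the candidate narrower terms.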
Similarly there may be certain lexical cues such as kind of and part of, which convey hyponymic or meronymic relationships between the lexical units they are syntactically associated with. In the following we discuss the possible elicitation of conceptual structures from texts, which can be used to define broader and narrower terms in the thesaurus.

The hypernymy/hyponymy relationship, also known as the supertype/subtype or subsumption relationship, is the semantic relationship that is used to build taxonomies for various purposes; a classic example is the biological classification of species. At the higher levels of a taxonomy more general/broader concepts are encountered; for example, knife is a general concept for a dagger or stiletto. There are a number of linguistic patterns that can be used to identify the hyponymic relationship in texts. The cues is a and is a type of are the most common patterns, but a number of others are typically associated with this relationship.

There are also certain enumerative cues [22] that can be used to derive hyponymic relationships. For example, from the sentence "All automatic weapons such as machine guns must be registered" one can derive hyponym('machine gun', 'automatic weapon'). Typical hyponymic and enumerative cues listed in the literature on lexical semantic analysis include:

Table 6.
List of hyponymic and enumerative cues

  Hyponymic cues:    is a; kind of; type of; set of; class of; belongs to
  Enumerative cues:  like; such as; such * as; or/and other; including; especially

The aim was to study the patterns in which these cues occur, as well as to find out the proportion of valid phrases returned (i.e. those that depict a hypernym/hyponym relationship or a meronym/holonym relationship). The cue or other was the most productive, with 80% of the elicited phrases being valid. The cue belongs to picked up


a single correct sentence ("chrysotile belongs to the serpentine group of minerals that are layer silicates") out of only two sentences returned. It should be noted that the percentage of valid phrases calculated from the total phrases returned was based on the judgment of this author. It was interesting to note that for the forensic science domain the enumerative cues had around 60% productivity for a total of 1224 clauses comprising the enumerative cue and the potential compound term, while the typical hyponymic cues had only 10% for a total of 400 clauses. A few example sentences extracted from the corpus for some of the cues listed above are:

Table 7. Examples of sentences containing lexical-syntactic cues

  Trace evidence, such as hair and fibers, is collected off the body [...]
  In the case of shootings or other fatal assaults the forensic pathologist, [...] important trace evidence.
  [...] to search for latent fingerprints, hairs, fibers, blood and other bodily fluids.
  [...] the investigation of computer crimes including computer intrusions, component theft and information theft.

Each set of sentences shows a certain similar grammatical pattern. For example, in the sentences containing the cue such as, such as acts as a conjunction between two phrases P1 and P2, such that P1 on the left-hand side is the superordinate and P2 on the right-hand side is a list of subordinate types.
Typically these sentences display a local grammar [26] comprising an NP followed by the adjective such, the preposition as, and a comma-separated list of NPs with a coordinating conjunction and or or appearing before the final NP. The example pattern shown below can be used to validate sentences that contain the such as cue and are representative of the hyponymic relationship:

    [P1] [R] [P2]
    P1 → NP
    R  → such as
    P2 → NP_1, NP_2, ..., NP_(n-1), (or | and) NP_n
    NP → [Adjective | Noun] NP

    If this pattern is matched, then each NP_i in P2 is a hyponym/NT for P1.

Figure 3 below shows the process of tagging and parsing an example sentence containing the cue such as. Compound-structure analysis, discussed above, was used to elicit that evidence is the broader term for trace evidence. After the sentence is tagged, regular expressions are used to check that it is a valid pattern. The sentence is then parsed to extract the hypernym {hyponym list} pairs. These partial structures are then represented in XML to be used by the SoCIS system for query expansion purposes (see section 3). This module has been developed in JAVA and makes use of the MXPOST tagger (http://www.cis.upenn.edu/~adwait/statnlp.html). The whole process is fully automated.


Cued sentence:
    Contamination problems are not unique to firearms and explosives residues but apply equally to all forms of trace evidence such as fibres, paint, glass and DNA.

(TAGGER)

Tagged cued sentence:
    Contamination/NNP problems/NNS are/VBP not/RB unique/JJ to/TO firearms/NNS and/CC explosives/NNS residues/NNS but/CC apply/VB equally/RB to/TO all/DT forms/NNS of/DT trace/NN evidence/NN such/JJ as/IN fibres/NNS ,/, paint/NN ,/, glass/NN and/CC DNA/NNP ./.

(PARSER)

Parsed sentence:
    evidence {trace evidence}
    trace evidence {fibre, paint, glass, DNA}

(XML)
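For untagged text, the extraction illustrated above can be approximated with a single regular expression. This is only a sketch: it caps the hypernym NP at two words and omits the POS-sequence validation that the actual module performs on the MXPOST-tagged sentence.

```java
import java.util.*;
import java.util.regex.*;

// Simplified sketch of the "such as" pattern of section 2.2, applied
// directly to untagged text. The class and method names are ours.
public class SuchAsExtractor {

    // Hypernym NP of at most two words, then "such as", then a
    // comma/and-separated list of single-word hyponyms.
    private static final Pattern CUE = Pattern.compile(
            "([a-z]+(?: [a-z]+)?) such as ((?:[a-z]+(?:, | and ))*[a-z]+)",
            Pattern.CASE_INSENSITIVE);

    /** Returns hypernym -> list of hyponyms for every "such as" clause. */
    public static Map<String, List<String>> extract(String sentence) {
        Map<String, List<String>> pairs = new LinkedHashMap<>();
        Matcher m = CUE.matcher(sentence);
        while (m.find()) {
            pairs.put(m.group(1),
                    new ArrayList<>(Arrays.asList(m.group(2).split(", | and "))));
        }
        return pairs;
    }

    public static void main(String[] args) {
        System.out.println(extract(
                "Contamination problems are not unique to firearms and explosives "
                + "residues but apply equally to all forms of trace evidence "
                + "such as fibres, paint, glass and DNA."));
    }
}
```

Applied to the cued sentence, it returns trace evidence as the hypernym of fibres, paint, glass and DNA; the broader term evidence then comes from compound-structure analysis.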


to associate the keywords used in the description of the image with the vision-specific features [29]. These keywords are extracted automatically from the description using a combination of TF*IDF and weirdness [17] measures, as well as filtering carried out using the domain-specific thesaurus. Automation is also a key aspect of SoCIS, as most of its features are fully automated. Text analysis and terminology extraction in SoCIS have been facilitated by the integration of two existing systems, System Quirk [17] and GATE [28], which gives SoCIS a powerful text-processing mechanism through the combination of a term-based statistical approach with a semantics-based approach. The system was evaluated using 66 scene-of-crime images used in the training of scene-of-crime officers. Experts in training as well as serving officers provided descriptions of the images, and these texts and images formed the input for the system.

A display tool has been developed that can be used to visualize the thesaurus for validation and editing purposes. The XML generated by the method described in the previous section is parsed to display the hierarchies in a tree structure. The user can add, delete or move a node (with its sub-hierarchy, unless it is a leaf node) as well as add synonyms. This display tool has been integrated into the SoCIS search interface, a screen shot of which is shown in figure 4.
The user can perform interactive query expansion by selecting or deleting nodes as appropriate.

Fig. 4. Screen shot of the search interface

4 Conclusions and Future Work

Content-based image retrieval systems increasingly use keywords collateral to an image for indexing and retrieval. One outstanding problem is that of building thesauri


that can be used during the indexation phase and latterly in the query expansion phase. We have attempted to outline a corpus-based method for building a thesaurus. We have demonstrated, through the use of frequency metrics for single tokens and for compound tokens, that a text corpus, randomly selected and systematically organized, can perhaps be used to initiate the development of such a thesaurus. We have developed a series of programs (in the JAVA programming language) to compute the frequency distribution of single and compound tokens and to automatically index an image. The lexical-syntactic pattern analysis carried out has shown that broader and narrower relations can be explicitly extracted, which makes it easier to construct a hierarchy and to improve query expansion, perhaps by limiting the expansion to narrower terms. Our comparative analysis of progeny and mother corpora has shown that the inter-variation of terms between sub-domains can be used to study differences in language usage; one method of ensuring proper domain coverage in the construction of a domain-specific thesaurus is perhaps to combine the sub-domain-specific thesauri built from various progeny corpora.

The current method is based on certain lexical-syntactic patterns depictive of hyponymic and meronymic relationships. This work can be continued to discover synonyms as well as other domain-specific relationships, for example by attempting to identify roles, attributes of entities, and events.
In a multi-modal domain it would be interesting to consider a visual representation as well as linguistic labels to represent a concept in the thesaurus, which could act as a link between the two modalities. There has been work on creating multimedia thesauri, but it would be interesting to investigate how images could be automatically analyzed to discover relationships between them, based on Picard's work [6], and then how the visual representations of these images could be automatically linked to the concepts with the corresponding linguistic labels. This could perhaps provide even more effective multi-modal query expansion.

References

1. Veltkamp, R.C., Tanase, M.: Content-Based Image Retrieval Systems: A Survey. Technical Report UU-CS-2000-34. Institute of Information and Computing Sciences, University of Utrecht, The Netherlands (2000)
2. Marr, D.: Vision. W.H. Freeman, San Francisco (1982)
3. Squire, McG.D., Muller, W., Muller, H., Pun, T.: Content-Based Query of Image Databases: Inspirations from Text Retrieval. Pattern Recognition Letters 21. Elsevier Science B.V. (2000) 1193-1198
4. Eakins, J.P., Graham, M.E.: Content-based Image Retrieval: A Report to the JISC Technology Applications Programme. Image Data Research Institute, Newcastle, Northumbria (1999). (http://www.unn.ac.uk/iidr/report.html, visited 19/06/02)
5. Ogle, V.E., Stonebraker, M.: Chabot: Retrieval from a Relational Database of Images. IEEE Computer Magazine, Vol. 28(9). IEEE (1995) 40-48
6. Picard, R.W.: Towards a Visual Thesaurus.
In: Ruthven, I. (ed.): Springer Verlag Workshops in Computing, MIRO 95, Glasgow, Scotland (1995)
7. Srihari, R.K.: Use of Collateral Text in Understanding Photos. Artificial Intelligence Review, Special Issue on Integrating Language and Vision, Vol. 8 (1995) 409-430
8. Paek, S., Sable, C.L., Hatzivassiloglou, V., Jaimes, A., Schiffman, B.H., Chang, S.F., McKeown, K.R.: Integration of Visual and Text-Based Approaches for the Content


Labeling and Classification of Photographs. ACM SIGIR'99 Workshop on Multimedia Indexing and Retrieval, Berkeley, CA (1999)
9. Foskett, D.J.: Thesaurus. In: Sparck Jones, K., Willet, P. (eds.): Readings in Information Retrieval. Morgan Kaufmann Publishers, San Francisco, California (1997) 111-134
10. Roget, P.: Thesaurus of English Words and Phrases. Longmans, Green and Company, London (1911)
11. Salton, G.: Experiments in Automatic Thesaurus Construction for Information Retrieval. In: Proceedings of the IFIP Congress, Vol. TA-2. Ljubljana, Yugoslavia (1971) 43-49
12. Sparck Jones, K.: Automatic Keyword Classification for Information Retrieval. Butterworths, London, UK (1971)
13. Jing, Y., Croft, W.B.: An Association Thesaurus for Information Retrieval. In: Bretano, F., Seitz, F. (eds.): Proceedings of the RIAO'94 Conference. CIS-CASSIS, Paris, France (1994) 146-160
14. Grefenstette, G.: Explorations in Automatic Thesaurus Discovery. Kluwer Academic Publishers, Boston, USA (1994)
15. Leech, G.: The State of the Art in Corpus Linguistics. In: Aijmer, K., Altenberg, B. (eds.): English Corpus Linguistics: In Honour of Jan Svartvik. Longman, London (1991)
16. Sinclair, J.McH. (ed.): Looking Up. Collins, London, UK (1987) 1-40
17. Halliday, M.A.K., Martin, J.R.: Writing Science: Literacy and Discursive Power. The Falmer Press, London and Washington D.C.
(1993)
18. Ahmad, K.: Pragmatics of Specialist Terms and Terminology Management. In: Steffens, P. (ed.): Machine Translation and the Lexicon. Springer-Verlag, Heidelberg (1995) 51-76
19. Harris, Z.S.: Language and Information. In: Nevin, B. (ed.): Computational Linguistics, Vol. 14, No. 4. Columbia University Press, New York (1988) 87-90
20. Ahmad, K., Rogers, M.A.: Corpus-based Terminology Extraction. In: Budin, G., Wright, S.A. (eds.): Handbook of Terminology Management, Vol. 2. John Benjamins Publishers, Amsterdam (2000) 725-760
21. Bourigault, D., Jacquemin, C., L'Homme, M-C. (eds.): Recent Advances in Computational Terminology. John Benjamins Publishers, Amsterdam (2001)
22. Hearst, M.: Automatic Acquisition of Hyponyms from Large Text Corpora. In: Proceedings of the Fourteenth International Conference on Computational Linguistics. Nantes, France (1992)
23. Leech, G., Rayson, P., Wilson, A.: Word Frequencies in Written and Spoken English: Based on the British National Corpus. Pearson Education Limited, Great Britain (2001)
24. Church, K.W., Mercer, R.L.: Introduction. In: Armstrong, S. (ed.): Using Large Corpora. The MIT Press, Mass., USA (1993) 1-24
25. Cruse, D.A.: Lexical Semantics. Cambridge University Press, Avon, Great Britain (1986)
26. Gross, M.: Local Grammars and their Representation by Finite Automata. In: Hoey, M.P. (ed.): Data, Description, Discourse. HarperCollins, London (1993) 26-38
27.
Pastra, K., Saggion, H., Wilks, Y.: Extracting Relational Facts for Indexing and Retrieval of Crime-Scene Photographs. To appear in Knowledge-Based Systems (2002)
28. Cunningham, H., Maynard, D., Bontcheva, K., Tablan, V.: GATE: A Framework and Graphical Development Environment for Robust NLP Tools and Applications. In: Proceedings of the 40th Anniversary Meeting of the Association for Computational Linguistics (2002)
29. Ahmad, K., Vrusias, B., Tariq, M.: Co-operative Neural Networks and 'Integrated' Classification. IJCNN 2002, Honolulu, Hawaii, USA (2002)
