Specialist languages, considered variants of natural language [17], are restrictedsyntactically and semantically, which makes them easier to process at the lexical,morphological and semantic levels [17, 18, 19]. There is a preponderance of openclass words <strong>in</strong> specialist languages, particularly nouns and nom<strong>in</strong>alizations, as theydeal with objects and named events, actions and states. Specialist languages tend touse a large number of compound terms and these compounds also relate to namedentities with<strong>in</strong> the doma<strong>in</strong>. Aga<strong>in</strong>, as science and technology deals with namedentities researchers aim to create structures to organize the <strong>in</strong>terrelationships betweenthese named entities. This organization is argued <strong>for</strong> and reported <strong>in</strong> the literature ofthe doma<strong>in</strong>, through the use of lexical semantic relationships. It has been suggestedthat not only can one extract keywords from a specialist corpus [18, 20, 21,] but onecan also extract semantic relations of taxonomy and meronymy (part-whole relations)from free texts [22]. Thus, <strong>for</strong> us, the extraction of terms and their <strong>in</strong>terrelationshipsfrom a text corpus to start build<strong>in</strong>g a thesaurus is an attractive proposition. Section 2of this paper discusses the issue of the association between images and texts under theidiosyncratic term collateral texts and how one goes about build<strong>in</strong>g a representativecorpus of collateral texts <strong>in</strong> order to construct a thesaurus. The issue of doma<strong>in</strong>coverage has been <strong>in</strong>vestigated through the comparison of terms extracted from aprogeny corpus (representative of a sub-doma<strong>in</strong>) to those extracted from a mothercorpus. This functionality has been <strong>in</strong>corporated <strong>in</strong> a text-enhanced image retrievalsystem – (Section 3). A brief outl<strong>in</strong>e of on-go<strong>in</strong>g work is given <strong>in</strong> Section 4.2 Thesauri <strong>Construction</strong> from Specialist Text CorporaTHIS IS A CLOSELYCOLLATERAL TEXTTHAT COULD BETHE DESCRIPTIONOR THE CAPTIONDICTIONARYDEFINITIONTHIS IS A BROADLYCOLLATERAL TEXTWHICH COULD BE ANEWSPAPER ARTICLEOR ENCYCLOPAEDICDEFINITIONCLOSELY COLLATERALTEXTSCAPTIONCRIME SCENEREPORTBROADLY COLLATERALTEXTSTHIS IS ABROADLYCOLLATERALTEXT WHICHCOULD BE AREPORTNEWSPAPERARTICLEFig. 1. Closely and broadly collateral textsAn image may be associated <strong>in</strong> various ways with the different texts that may existcollateral to it. These texts may conta<strong>in</strong> a full or partial description of the content ofthe image or they may conta<strong>in</strong> metadata <strong>in</strong><strong>for</strong>mation. Texts could be closely collaterallike the caption of an image which will describe only what is depicted <strong>in</strong> the image, or
oadly collateral such as the report describ<strong>in</strong>g a crime scene where the content of theimage would be discussed together with other <strong>in</strong><strong>for</strong>mation related to the crime; thecloser the collateral text to the image the higher the co-dependency. The degree of codependencebetween an image and its collateral text can be exploited <strong>for</strong> <strong>in</strong>dex<strong>in</strong>g andretriev<strong>in</strong>g images at different levels of abstraction.The computer-based analysis of closely collateral texts, written/spoken <strong>for</strong> arestricted readership, say by scene of crime officers <strong>for</strong> describ<strong>in</strong>g scene of crimeimages, will help <strong>in</strong> the identification of objects and their relationships <strong>in</strong> an image ofa specific scene of crime. This <strong>in</strong><strong>for</strong>mation can then be used <strong>for</strong> <strong>in</strong>dex<strong>in</strong>g thatparticular image. An analysis of broadly collateral texts, texts written to <strong>in</strong>struct and<strong>in</strong><strong>for</strong>m a broader readership, <strong>for</strong> <strong>in</strong>stance a scene of crime manual deal<strong>in</strong>g with thecollection of evidence written by experts <strong>for</strong> scene-of-crime officers or a newtechnique developed <strong>in</strong> a journal paper, will help <strong>in</strong> the identification and elaborationof broader terms of the doma<strong>in</strong>, which can be useful <strong>in</strong> the construction of athesaurus. Such a thesaurus has to be updated at regular <strong>in</strong>tervals or <strong>in</strong>deed veryfrequently <strong>in</strong> specialisms that are undergo<strong>in</strong>g rapid change. Forensic science may be agood example – here developments <strong>in</strong> computer vision, <strong>in</strong> analytical chemistry andmolecular biology together with developments <strong>in</strong> law regard<strong>in</strong>g <strong>in</strong>directly-derivedevidence, <strong>for</strong> example, DNA f<strong>in</strong>gerpr<strong>in</strong>t<strong>in</strong>g, and digital photography, not only affectthe adm<strong>in</strong>istration of justice but br<strong>in</strong>g a plethora of new terms that have been adoptedby <strong>for</strong>ensic scientists and used by police officers. Many of these terms are <strong>in</strong>troduced<strong>in</strong> the broadly collateral texts and, <strong>in</strong> due course, f<strong>in</strong>d their way <strong>in</strong>to closely collateraltexts.A corpus-based approach can be used to <strong>in</strong>vestigate the language of <strong>for</strong>ensicscience <strong>for</strong> thesaurus build<strong>in</strong>g purposes, where the corpus can be said to consist ofbroadly and closely collateral texts with respect to a typical scene of crime imagecollection. The aim is to study the behavior of the language at the lexical,morphological and lexical-syntactic levels to determ<strong>in</strong>e whether it has a discernablestructure that can be used to extract terms and their relationships. Typically, follow<strong>in</strong>gwork on corpus-based lexicography [15], the term<strong>in</strong>ologists collect and analyze a(random) sample of free texts <strong>in</strong> a given doma<strong>in</strong>. The text is analyzed at the lexicaland morphological level and the frequency of tokens and their morphological <strong>for</strong>ms isnoted. Candidate terms are produced by contrast<strong>in</strong>g the frequency of tokens <strong>in</strong> thespecialist corpus with that of the frequency of the same tokens <strong>in</strong> a representativecorpus of general language. This ratio, sometimes referred to as a weirdness measure[18], is a good <strong>in</strong>dication that the candidate will be approved by an expert to be aterm.weirdness coefficient =Wheref s = frequency of term <strong>in</strong> a specialist corpusf g = frequency of term <strong>in</strong> general languageN s = total number of terms <strong>in</strong> the specialist corpusAnd N g = total number of terms <strong>in</strong> the general languageA <strong>for</strong>ensic science corpus of over half a million words has been created. To ensurethat the corpus is representative of the doma<strong>in</strong>, a variety of text types rang<strong>in</strong>g fromffsgNNsg(1)