Semi-Automatic Indexing of Documents with a Multilingual Thesaurus

Ulrich Schiel, Universidade Federal de Campina Grande, Campina Grande / BRAZIL, ulrich@dsc.ufpb.br
Ianna M. S. F. de Sousa, Universidade Federal de Campina Grande, Campina Grande / BRAZIL, ianna@dsc.ufpb.br

Abstract

With the growing significance of digital libraries and the Internet, more and more electronic texts become accessible to a wide and geographically dispersed public. This requires adequate tools to facilitate indexing, storage, and retrieval of documents written in different languages. We present a method for semi-automatic indexing of electronic documents and construction of a multilingual thesaurus, which can be used for query formulation and information retrieval. We use special dictionaries and user interaction to resolve ambiguities and to find an adequate canonical term in the document's language and an adequate abstract, language-independent term. The abstract thesaurus is updated incrementally by newly indexed documents and is used to search for documents using adequate terms.

1. Introduction

The growing relevance of Digital Libraries is generally recognized [Ha97]. A Digital Library typically contains hundreds or thousands of documents. It is also recognized that, although English is the dominant language, documents in other languages are of great significance and, moreover, users want to retrieve documents in several languages associated with a topic stated in their own language [Ha97, Go02]. This is especially true in regions such as the European Community or Asia. Therefore a multilingual environment is needed to serve user requests to Digital Libraries. The multilingual communication between users and the library can be realized in two ways:

• The user query is translated into the several languages of the existing documents and submitted to the library;
• The documents are indexed and the extracted terms are converted to a language-neutral thesaurus (called multilingual thesaurus).
  The same occurs with the query, and the correspondence between query terms and documents is obtained via the neutral thesaurus.

The first solution is the most widely used in the Cross-Language Information Retrieval (CLIR) community [Go02, OD00, Oa99]. It also applies to other information retrieval environments, such as the World Wide Web. For Digital Libraries with thousands of documents, the indexing of incoming documents and a good association structure between index terms and documents can become crucial for efficient document retrieval. This paper presents a system based on the second approach.

To obtain extensive and precise retrieval of textual information, a correct and consistent analysis of incoming documents is necessary. The most broadly used technique for this analysis is indexing. An index file becomes an intermediate representation between a query and the document base.

One of the most popular structures for complex indexes is a semantic net of lexical terms of a

language, called a thesaurus. The nodes are single or compound terms and the links are pre-defined semantic relationships between these terms, such as synonyms, hyponyms and metonyms.

Although the importance of multilingual thesauri has been recognized [Go02], nearly all research effort in Cross-Lingual Information Retrieval has been devoted to the query side and not to the indexing of incoming documents [OD00, Oa99, Ha97].

Indexing in a multilingual environment can be divided into three steps: (1) language-dependent canonical term extraction (including stop-word elimination, stemming, and word-sense disambiguation), (2) language-neutral term finding, and (3) update of the term-document association lattice.

Bruandet [Br89] developed an automatic indexing technique for electronic documents, which was extended by Gammoudi [Ga93] to optimal thesaurus generation for a given set of documents. The nodes of the thesaurus are bipartite rectangles where the left side contains a set of terms and the right side the set of documents indexed by these terms. Higher rectangles in the thesaurus contain broader term sets and fewer documents. One extension of this technique is the algorithm of Pinto [Pi97], which permits the incremental addition of index terms of newly incoming documents, updating the thesaurus.

We show in this paper how this extended version of Gammoudi's technique can be used in an environment with multilingual documents and queries whose language need not be the same as that of the searched documents. The main idea is to use monolingual dictionaries in order to eliminate ambiguities, with the user's help, and to associate with each unambiguous term an abstract, language-independent term. The terms of a query are also converted to abstract terms in order to find the corresponding documents.

In the next section we introduce the mathematical background needed to understand the technique, whereas section 3 introduces our multilingual rectangular thesaurus.
Section 4 shows the procedure of term extraction from documents, finding the abstract concept and the term-document association, and the inclusion of the new rectangle in the existing rectangular thesaurus. Section 5 shows the query and retrieval environment and, finally, section 6 discusses some related work and concludes the paper.

2. Rectangular Thesaurus: Basic Concepts

The main structure used for the indexing of documents is the binary relation. A binary relation can be decomposed into a minimal set of optimal rectangles by the method of Rectangular Decomposition of a Binary Relation [Ga93, BBG94]. The extraction of rectangles from a finite binary relation has been extensively studied in the context of Lattice Theory and has proven to be very useful in many computer science applications.

A rectangle of a binary relation R is a pair of sets (A, B) such that A × B ⊆ R. More precisely:

Definition 1: Rectangle
Let R be a binary relation defined from E to F. A rectangle of R is a pair of sets (A, B) such that A ⊆ E, B ⊆ F and A × B ⊆ R. A is the domain of the rectangle and B is the co-domain.

A rectangle (A, B) of a relation R is maximal iff, for each rectangle (A', B'),
A × B ⊆ A' × B' ⊆ R → A = A' and B = B'.

A binary relation can always be represented by sets of rectangles, but this representation is not unique. In order to save storage space, the following coefficient is important for the selection of an optimal set of rectangles representing the relation.

Definition 2: Gain in storage space
The gain in storage space of a rectangle RE = (A, B) is given by:
g(RE) = [Card(A) × Card(B)] - [Card(A) + Card(B)]
where Card(A) is the cardinality of the set A.

The gain becomes significant if Card(A) > 2 and Card(B) > 2: then g(RE) > 0, and it grows with Card(A) and Card(B). On the other hand, there is no gain (g(RE)
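
To make Definitions 1 and 2 concrete, the following Python sketch checks whether a pair of sets is a rectangle of a relation and computes its gain. The function names `is_rectangle` and `gain` and the sample relation `R` are illustrative assumptions and do not come from the paper.

```python
from itertools import product


def is_rectangle(relation, domain, codomain):
    """Definition 1: (A, B) is a rectangle of R iff A x B is a subset of R."""
    return all(pair in relation for pair in product(domain, codomain))


def gain(domain, codomain):
    """Definition 2: g(RE) = Card(A) * Card(B) - (Card(A) + Card(B))."""
    return len(domain) * len(codomain) - (len(domain) + len(codomain))


# Hypothetical relation between two terms and three document numbers.
R = {("x", 1), ("x", 2), ("x", 3), ("y", 1), ("y", 2), ("y", 3)}

print(is_rectangle(R, {"x", "y"}, {1, 2, 3}))  # True
print(gain({"x", "y"}, {1, 2, 3}))             # 2*3 - (2+3) = 1
print(gain({"y"}, {1, 2, 3}))                  # 1*3 - (1+3) = -1
```

One way to read the formula: the first term counts the pairs that would be stored individually, and the second counts the elements stored when the rectangle is kept as two sets.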

A maximal rectangle containing an element (x, y) of a relation R is called optimal if it produces a maximal gain with respect to the other maximal rectangles containing (x, y).

Figure 2.1(a) presents an example of a relation R, and Figures 2.1(b), 2.1(c) and 2.1(d) represent three maximal rectangles containing the element (y, 3). The corresponding gains are 1, 0 and -1. Therefore, the optimal rectangle containing (y, 3) of R is the rectangle of Figure 2.1(b).

[Figure 2.1: Finding the optimal rectangle. Panels: (a) relation R; (b) RE1(y, 3), with g(RE1(y, 3)) = 1; (c) RE2(y, 3), with g(RE2(y, 3)) = 0; (d) RE3(y, 3), with g(RE3(y, 3)) = -1.]

Definition 4: Rectangular Graph
Let "≤≤" be a relation defined over a set of rectangles of a binary relation R, as follows: for any two rectangles (A1, B1) and (A2, B2) of R,
(A1, B1) ≤≤ (A2, B2) ⇔ A1 ⊆ A2 and B2 ⊆ B1.
We call (R, ≤≤) a Rectangular Graph.

Note that "≤≤" defines a partial order over the set of rectangles.

Definition 5: Lattice
A partially ordered set (R,
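
The notions of maximal rectangle, optimal rectangle and the order "≤≤" can be illustrated with a brute-force Python sketch. It is only a sketch under stated assumptions: the relation `R` below is hypothetical (the contents of Figure 2.1(a) are not reproduced here), the helper names are invented for illustration, and the exhaustive enumeration is not the decomposition method of [Ga93]; it is practical only for very small relations.

```python
from itertools import combinations


def image(relation, a):
    """All elements related to a."""
    return frozenset(b for (x, b) in relation if x == a)


def maximal_rectangles(relation, x, y):
    """Enumerate the maximal rectangles of `relation` containing (x, y)."""
    if (x, y) not in relation:
        return set()
    candidates = sorted({a for (a, _) in relation if y in image(relation, a)})
    others = [a for a in candidates if a != x]
    found = set()
    for size in range(len(others) + 1):
        for extra in combinations(others, size):
            dom = frozenset((x,) + extra)
            # Largest co-domain shared by every element of dom.
            cod = frozenset.intersection(*(image(relation, a) for a in dom))
            # Close the domain: add every element related to all of cod.
            closed = frozenset(a for (a, _) in relation if cod <= image(relation, a))
            found.add((closed, cod))
    return found


def gain(rect):
    dom, cod = rect
    return len(dom) * len(cod) - (len(dom) + len(cod))


def optimal_rectangle(relation, x, y):
    """A maximal rectangle containing (x, y) with the largest gain."""
    return max(maximal_rectangles(relation, x, y), key=gain)


def precedes(r1, r2):
    """Definition 4: (A1, B1) <=<= (A2, B2) iff A1 is a subset of A2 and B2 of B1."""
    return r1[0] <= r2[0] and r2[1] <= r1[1]


# Hypothetical relation: term -> documents it indexes.
rows = {"x": [1, 2, 3], "y": [1, 2, 3, 5], "z": [3, 5]}
R = {(a, b) for a, docs in rows.items() for b in docs}

best = optimal_rectangle(R, "y", 3)
print(sorted(best[0]), sorted(best[1]), gain(best))
```

For this hypothetical relation the rectangle ({x, y}, {1, 2, 3}) has gain 1, which is larger than the gains of the other maximal rectangles containing (y, 3), so it is the one selected.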

[...] containing the relation between sets of index terms and sets of documents. Each term in a rectangle is the representative of an equivalence class of synonyms. The terms in a rectangle are called pseudo-synonyms. Note that two terms are pseudo-synonyms if they index the same set of documents.

For a lattice of rectangles with the hierarchic relation (A1, B1) ≤≤ (A2, B2) defined above, we consider a simplified representation in order to save space: we eliminate the repetition of terms in the hierarchy. If (A1, B1) ≤≤ (A2, B2), the rectangle (A2, B2) is represented by (A2 - A1, B1 - B2), without loss of information content.

Figure 2.3 illustrates a lattice of four rectangles, in its full and its simplified versions, disregarding language independence for the moment.

[Figure 2.3: Example lattice of rectangles, full and simplified version. Full version: R2 = ({System}, {D1, D2, D3, D4}); R4 = ({System, Development}, {D1, D4}); R5 = ({System, Information}, {D1, D2}); R8 = ({System, Information, Development}, {D1}). Simplified version: R2: System, D3; R4: Development, D4; R5: Information, D2; R8: D1.]

Figure 2.4 shows a relation between synonyms and pseudo-synonyms. The terms Information, Retrieval and Document are pseudo-synonyms representing, respectively, fact and data; search and query; and report, formulary and article.

[Figure 2.4: Example of synonyms and pseudo-synonyms.]

Finally, we define a Rectangular Thesaurus as a graph where the nodes are optimal rectangles and the edges are the semantic relations defined above.

3. Multilingual Rectangular Thesaurus

The idea of creating a language-independent conceptual thesaurus was introduced by Sosoaga [So91]. The textual structure of specific languages is mapped to a conceptual thesaurus (see Fig. 3.1).

[Figure 3.1: Multilingual Thesaurus. The lexical structures of language 1 and language 2 are both mapped to the Conceptual Thesaurus.]

One problem of this mapping is the elimination of multiple senses of terms. For instance, the term 'word' has 10 distinct meanings in the WordNet system [WN].
In general, each meaning occurs in a different context of use of the word. We extend Sosoaga's definition of multilingual thesaurus to include a notion of contexts that permits the elimination of ambiguities.

A multilingual thesaurus is a classification system defined as

MTh = (V, n, r; L1, ..., Ln, C1, ..., Cm, t1, ..., tn*m)

composed of a unique set of abstract concepts (V), a set of lexicons {L1, ..., Ln}, a set of contexts {C1, ..., Cm}, and a set of functions tk : Li × Cj → V, each of which associates with each term of a lexicon in a given context a unique abstract term. The hierarchic and non-hierarchic relationships are given by n (narrower term) and r (related term). Therefore, both n and r are subsets of V × V.
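
The structure MTh can be pictured with a small Python sketch. This is an illustration under assumptions rather than the system's actual data model: the class name, the method names, the concept codes and the sample entries are all hypothetical, and the family of functions tk is collapsed into a single lookup table keyed by (language, term, context).

```python
from dataclasses import dataclass, field


@dataclass
class MultilingualThesaurus:
    """Toy model of MTh = (V, n, r; lexicons, contexts, t)."""
    concepts: set = field(default_factory=set)   # V: abstract concept codes
    narrower: set = field(default_factory=set)   # n, a subset of V x V
    related: set = field(default_factory=set)    # r, a subset of V x V
    t: dict = field(default_factory=dict)        # (language, term, context) -> concept

    def add_term(self, language, term, context, concept):
        self.concepts.add(concept)
        self.t[(language, term.lower(), context)] = concept

    def concept_of(self, language, term, context):
        """Abstract concept of a term in a context, or None if unknown."""
        return self.t.get((language, term.lower(), context))

    def terms_for(self, concept, language):
        """All terms of one language mapped to the given concept."""
        return sorted(term for (lang, term, _), c in self.t.items()
                      if lang == language and c == concept)


# Hypothetical entries; the codes are placeholders.
mth = MultilingualThesaurus()
mth.add_term("en", "database", "Computer Science", 10125)
mth.add_term("de", "Datenbank", "Computer Science", 10125)
mth.add_term("en", "data", "Computer Science", 10230)
mth.add_term("en", "data", "Linguistics", 30001)
mth.narrower.add((10230, 10125))  # assumed narrower-term pair, for illustration only

print(mth.concept_of("de", "Datenbank", "Computer Science"))
print(mth.terms_for(10125, "en"))
```

Because terms of different languages map to the same concept code, a query term in one language can reach documents indexed from another language, and the same word in two contexts ('data' above) maps to two different concepts.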

A rectangular multilingual thesaurus is a rectangular thesaurus as defined in the previous section, where the terms on the left side of each rectangle are elements of the abstract concepts (V) of a multilingual thesaurus, and the elements on the right side are identifiers of documents.

Note that the rectangular multilingual thesaurus of a set of documents is obtained by (1) indexing each document, (2) defining the correct meaning of each term, and (3) finding the abstract concept by applying the corresponding function t.

In order to construct rectangles with representative terms, we can decompose each function t into two parts t0 and t1, with t(x, c) = t1(t0(x, c)). The function t0 is responsible for the selection of the canonical representative for the synonyms in a given context, and t1 is an injective function that determines the abstract term associated with the original term in a given language.

4. Construction and maintenance of a Rectangular Multilingual Thesaurus

In a digital library each incoming document (which can be a full document, an abstract or an index card) must be indexed and integrated into the library. Instead of indexing documents one by one, we can consider batches of incoming documents to be inserted into the library.

The construction of a Rectangular Multilingual Thesaurus is done in three steps:

• Term extraction and disambiguation from one or more documents using one or more monolingual dictionaries (semi-automatic indexing), and determination of the abstract concept;
• Generation of one or more optimal rectangles;
• Optimal insertion of the new rectangles in the existing abstract thesaurus.

Figure 4.1 shows the main modules of the system, called SIM (System for Indexing of Multilingual Documents).

4.1 Indexing, disambiguation and abstraction

The construction of a rectangular thesaurus referencing a set of electronic documents in natural language begins with the extraction of the relevant terms contained in each document. Our semi-automatic method allows the user to eliminate ambiguities interactively.
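
The interactive disambiguation, combined with the decomposition t(x, c) = t1(t0(x, c)) introduced at the end of Section 3, might be sketched as follows. Everything named here is an assumption made for illustration: the tables `T0` and `T1`, their entries and codes, and the `choose_context` callback that stands in for the user's choice.

```python
# Hypothetical t0: (term, context) -> canonical representative.
T0 = {
    ("data base", "Computer Science"): "database",
    ("data", "Computer Science"): "information",
    ("data", "Linguistics"): "data",
}

# Hypothetical t1: canonical representative -> abstract concept code (injective).
T1 = {"database": 10125, "information": 10230, "data": 30001}


def contexts_of(term):
    """Contexts in which the dictionary knows the term."""
    return sorted(c for (t, c) in T0 if t == term)


def abstract_term(term, choose_context):
    """Apply t1 after t0, asking the user only when the term is ambiguous."""
    term = term.lower()
    candidates = contexts_of(term)
    if not candidates:
        return None  # unknown word: left for later treatment
    if len(candidates) == 1:
        context = candidates[0]
    else:
        context = choose_context(term, candidates)  # stands in for the user
    return T1[T0[(term, context)]]


# A callback that always answers "Linguistics" simulates one user decision.
print(abstract_term("Data Base", lambda term, options: options[0]))
print(abstract_term("data", lambda term, options: "Linguistics"))
```

Keeping the user's choice behind a callback means the same function can be driven by a dialog, by a stored 'apply to all' decision, or by a test.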
In the current version of the system, incoming documents can be in plain text (txt), Microsoft Word (doc) or html format. Other formats must first be converted to plain text.

[Figure 4.1: Main modules of the SIM system. Recoverable labels: Document, Indexing, Rectangle, Updating, Thesaurus, Dictionary, New Words, context classification, librarian, WordNet.]

4.1 Semi-automatic indexing

The first step consists of word selection, stop-word elimination and, for significant words, finding the abstract term. As shown in Figure 4.2, two dictionaries are used for this step. First, the dictionary of term variations contains all lexical variations of the words of a language and determines their normal form. Compound words, such as 'Data Base' or 'Operating System', must be considered as single terms, since in other languages they are written as a single word (e.g. 'Datenbank' and 'Betriebssystem' in German) and should be associated with a single abstract term code. These are identified by a look-ahead step when the dictionary identifies a candidate compound word.

Having found the normal form of a term, the main dictionary is then used to find the abstract language-independent term, depending on the context.

In the main dictionary, the column "Representative" and the list of "Related terms" will be used in the construction of the thesaurus in a given language for query formulation (see section 5 below).

Each rectangle obtained in the previous step relates a set of terms to a set of documents. If we are processing a single

document, one rectangle is generated with the significant terms indexing that document. We must now insert the new rectangles in the existing abstract rectangular thesaurus.

[Term Variations table: columns Term and Compound Term; recoverable entries: Data, 1, Base, Database.]

Main Dictionary
Term       Category  Context     Concept  Represent.
Data Base  noun      C. Science  10125    Database
Data       noun      C. Science  10230    Information
Data       noun      Ling.                Data
Date       noun      History              age

Figure 4.2: Dictionaries

4.2 Updating the rectangular thesaurus

Figure 4.3 shows the rectangular thesaurus for a document base, where the abstract terms in the domains of the rectangles have been replaced by their representatives in English. Since it is in the simplified form, term redundancy has been eliminated.

[Figure 4.3: Document thesaurus. Recoverable rectangles in simplified form (terms: documents); the figure also shows two ∅ entries:
  Distributed: 2, 11
  Object-oriented, library, programming, Database, Software, development, inheritance: 2, 11, 12
  Tool, object, class, model, concept, system: 2, 11, 12, 14
  Structured: 11
  Relational, client-server: 11, 12]

Note that in a rectangular thesaurus we can identify several cardinality levels, given by the cardinality of the rectangle domain. This level ranges from 0 to n, where n is the total number of terms in the thesaurus. Each new rectangle must be placed in the corresponding level. In the example, the second level has cardinality 5 and the third level has cardinality 12, since 7 terms have been added to the level below.

The following algorithm, from [Pi97], inserts a new rectangle into an existing rectangular thesaurus. We consider the thesaurus in its original definition, without simplification; the simplification is straightforward.

1. Check if the cardinality level of the new rectangle exists in the thesaurus
   1.1. If it does not exist, create the new level for the rectangle
   1.2. else
        1.2.1. If the domain of the new rectangle coincides with that of an existing rectangle, then add the new document to the co-domain of the existing rectangle, else insert the new rectangle in the level
2. If a new rectangle has been created, establish the hierarchic connections, searching for the higher-level rectangles containing the new terms.
3. New terms not occurring in the supremum are inserted there.
4. If the descendants of the new rectangle are empty, connect it to the infimum.

5. Information Retrieval

The purpose of an information retrieval system is to return to the user a set of documents that match the keywords expressed in a query. In our system we present an interface, in the user's preferred language, representing the thesaurus of concepts occurring in the document database. This thesaurus includes synonyms, related terms and hierarchic relations. Using pre-existing terms, obtained from the documents in the library, helps users formulate their queries in an adequate terminology, significantly reducing natural language interpretation problems.

The dictionary-like form of the interface guarantees fast access to the documents searched for. As reported by LYCOS, typical user queries are only two or three words long. Figure

5.1 shows the prototype's interface [So98] with terms defined in Portuguese.

In a rectangular thesaurus, the retrieval process consists of finding a rectangle Ri = Ci × Di such that Ci is a minimal domain containing the set of terms from the query Q. If Ci ≠ Q, the user can receive feedback from the system concerning other terms which index the retrieved documents. This fact is identified as a Galois connection in [Ga93]. Note that we can obtain several rectangles matching the query. On the other hand, the user can eliminate some terms from the query in order to obtain more documents.

As can be seen in the figure, the prototype allows the user to choose a language and, as terms are selected, the system lists the corresponding documents.

[Figure 5.1: Query interface]

The prototype has been implemented in the Delphi Programming Environment and, in its first release, recognizes Word (.doc), html and text (.txt) documents.

5. Related work

The model HiMeD [RN01, Li98] deals with the indexing and retrieval of documents. It is specialized in medical documents and the indexing step is completely automatic. Since the domain is restricted, polysemous terms occur less frequently than in general digital libraries. As a language-neutral ontology of medical terms, it uses the medical vocabulary MeSH [NLM00] combined with a generic dictionary.

Gilarranz, Gonzalo and Verdejo [GGV02] proposed an approach to indexing documents using the information stored in the EuroWordNet database. From this database they take the language-neutral InterLingual Index. For the association of queries to documents they use the well-known vector space approach.

The MULINEX project [ENU99] is a European Union effort to develop tools for cross-language text retrieval for the WWW, concept-based indexing, navigation tools and facilities for multilingual WWW sites.
The project considers several alternatives for document treatment, such as translation of the documents, translation of index terms and queries, and relevance feedback with translation.

6. Conclusion

Most work on thesaurus construction uses automatic indexing of a given set of documents [Br89, Ga93] and, in the case of a multilingual framework, uses machine translation techniques applied to the query [Ya98]. In [Pi97] an incremental version of the approach to automatic generation of rectangular thesauri was developed. The main contribution of this paper is to integrate the incremental approach with dictionary-based multilingual indexing and information retrieval, including interactive ambiguity resolution. This approach eliminates problems of automatic indexing, linguistic variations of a single concept and restrictions of monolingual systems. Furthermore, the problem of terms composed of more than one word has been solved with a look-ahead algorithm for candidate words found in the dictionary.

It is clear that the interaction with the user is very time consuming. However, it seems to be a good trade-off between the precision of manual thesaurus construction and the efficiency of automatic systems. With an 'apply to all' option one can avoid repeating the same conflict resolution.

Lexical databases, such as WordNet and the forthcoming EuroWordNet, can be useful to offer a semantically richer query interface using the hierarchic relations between terms. These hierarchies must also be included in the abstract conceptual thesaurus.

References

[BBG94] Belkhiter, N., Bourhfir, C., Gammoudi, M.M., Jaoua, A., Le Thanh, N. and Reguig, M. Décomposition Rectangulaire Optimale d’une Relation Binaire: Application aux Bases de Données Documentaires. INFOR vol. 32, n° 1, pp. 33-54, 1994.
[Br89] Bruandet, M.-F. ‘Outline of a Knowledge-Base Model for an Intelligent Information Retrieval System’. Information Processing & Management. Vol. 25, No. 1, pp. 89-115, 1989.
[Ga93] Gammoudi, M. M. Méthode de Décomposition Rectangulaire d'une Relation Binaire: Une base formelle et uniforme pour la génération automatique des thesaurus et la recherche documentaire. Thèse de doctorat. Université de Nice - Sophia Antipolis, Ecole Doctorale des Sciences pour L'Ingenieur, 1993.
[ENU99] G. Erbach, G. Neumann & H. Uszkoreit ‘MULINEX Multilingual Indexing, Navigation and Editing Extensions for the World-Wide Web’, Proceedings of the Third DELOS Workshop -- Cross-Language Information Retrieval and Proceedings, Zürich/CH, 1997
[Fe97] Ferneda, E. ‘Construção Automática de um Thesaurus Retangular’, Master thesis, Departamento de Sistemas e Computação, Universidade Federal da Paraíba, Campina Grande, 1997.
[Fr02] Freitas-Jr, H.R., Laender, A. H., Lima, L., Ribeiro-Neto, B., Vale, R. Recuperação de Informação Médica Interlínguas, Proc. of the XVII Brazilian Symposium on Databases, Gramado/Brazil, 2002, pp. 209-223
[GGV01] J. Gilarranz, J. Gonzalo & F. Verdejo ‘An approach to conceptual text retrieval using the EuroWordNet Multilingual Semantic Database’ in Working Notes of the AAAI Symposium on Cross Language Text and Speech Retrieval, 1997
[Ha97] H. Haddouti ‘Survey: Multilingual Text Retrieval and Access’ in Working Notes of the AAAI Symposium on Cross Language Text and Speech Retrieval, 1997
[Li98] L.R.S. Lima, A.F. Laender & B. Ribeiro-Neto ‘A hierarchical approach to the automatic categorization of medical documents’ in Proc. of the 7th Intl. Conf. on Information Knowledge Management (1998) pp. 132-138
[NLM00] National Library of Medicine-USA ‘Tree Structures & Alphabetic List – 12th Edition’ (2000)
[Oa99] D. Oard ‘Global Access to Multilingual Information’ keynote address at the Fourth International Workshop on Information Retrieval with Asian Languages-IRAL99, Taipei, Taiwan, 1999
[OD00] W. Ogden & M. Davis ‘Improving Cross-language Text Retrieval with Human Interaction’ Proc. of the 33rd Hawaii Intl. Conference on System Sciences, Maui, HI/USA, 2000
[Pi97] Pinto, W. S. ‘Sistema de Recuperação de Informação com navegação através de Pseudo Thesaurus’. Master Thesis. Universidade Federal do Maranhão, 1997.
[RN01] B. Ribeiro-Neto, A.F. Laender & R.S. Lima ‘An experimental study in automatically categorizing medical documents’ Journal of the American Society for Information Science and Technology (2001) pp. 391-401
[So98] Sodré, I. M. ‘SISMULT - Sistema de Indexação Semi-automática Multilíngüe’. Master Thesis, Universidade Federal da Paraíba/COPIN, Campina Grande, 1998.
[So91] Sosoaga, C. L. De ‘Multilingual access to Documentary Databases’ In A. Lichnerowicz, Editor. Proceedings of a Conference on Intelligent Text and Image Handling (RIAO91), Amsterdam, April 1991, pp. 774-778
[WN] WordNet – a lexical database for the English language, Princeton, http://www.cogsci.princeton.edu/~wn/
[Ya98] Yang, Y., Carbonell, J., Brown, R. & Frederking, R. ‘Translingual information retrieval: learning from bilingual corpora’, Artificial Intelligence 103 (1998) 323-345
