sense tagging: don't look for the meaning but for the use

and annotators were asked to mark the senses in additional columns. They therefore had all occurrences of the same word available on the screen. They could mark them in any order, and revise their judgement as they went along. Annotators were instructed to choose either one sense, or several if they felt that more than one was appropriate in the given context. They could also choose no sense at all, if they felt that no sense in the dictionary was appropriate in the context. In the latter case, they were instructed to write a question mark in the sense column. In the subsequent study, the question mark was treated as an additional sense for each word, grouping all meanings that were not found in the dictionary.

3.3. Results

The annotators gave more senses per context for verbs than for adjectives and nouns (Table 3, column Nsen). This is likely to be a result of the larger number of senses offered for verbs in the dictionary (see discussion above and Table 2). The average number of senses (used by a single judge in a given context) per POS category is not very high, which shows that annotators tend to avoid multiple answers (as said above, the “no sense” answer is counted as a special sense). However, the average per POS category masks important differences between words: the average number of responses per word ranges from 1 to 1.311 (verb comprendre). In some cases, annotators used up to six senses in a single response for a given context.

Agreement was computed according to several measures (summarised in Table 3):

(1) Full agreement among the six annotators. Two variants were computed:

Min: counts agreement when judges agree on all senses proposed for a given context
Max: counts agreement when judges agree on at least one of the senses proposed for a given context

The difference between the min and max measures is small, apart from a few words (sûr, comprendre, importer).
This is due to the fact that the average number of senses given by judges is close to 1 (Table 3, column Nsen). Of course, these measures are biased by the number of judges, as mentioned above. It is however striking to note that for some words (correct, historique, économie, comprendre) there was full agreement on none of the contexts for that word.

(2) Pairwise agreement. Three variants were computed:

Min: counts agreement when judges agree on all senses proposed for a given context
Max: counts agreement when judges agree on at least one of the senses proposed for a given context
Weighted: accounts for partial agreement using the Dice coefficient, where A and B are the sets of senses given by the two judges:

$$\mathrm{Dice} = \frac{2\,|A \cap B|}{|A| + |B|}$$

Again, there is not much difference between the measures, apart from a few words, interestingly enough not exactly the same as before (chef, comprendre, connaître).

(3) Agreement corrected for chance. The measures above are not completely satisfactory, because they do not enable comparison of observed agreement and agreement that would be obtained by pure chance. The κ statistic mentioned above enables such a comparison. In order to account for partial agreement, κ was computed on the weighted pairwise measure using the extension proposed in [8].

It is interesting to note that κ ranges between 0.92 (noun détention) and 0.007 (adjective correct). In other terms, there is no more agreement than chance for some words. The average κ values are low, below 0.5, which indicates a great amount of disagreement among judges.

            Full          Pairwise
POS  Nsen   min   max   min   max   wgh    κ
A    1.013  0.43  0.46  0.69  0.72  0.71  0.41
N    1.009  0.44  0.45  0.72  0.74  0.73  0.46
V    1.045  0.29  0.34  0.60  0.65  0.63  0.41

Table 3. Agreement measures per POS category

It is possible that the sense divisions contained in dictionaries are too fine-grained for NLP purposes. This argument has been made many times, and many WSD systems have been restricted to homograph-level or broad sense distinctions.

In order to test this hypothesis, I have computed the degree of inter-annotator agreement when their responses are reduced to the top-level distinctions made in the dictionary (French dictionaries are much more hierarchical than English ones, due to different lexicographic traditions). The improvement was measured as the reduction of disagreement once corrected for chance, i.e.:

$$\Delta = 1 - \frac{1 - \kappa_{\mathrm{top}}}{1 - \kappa}$$

where κ is the value obtained on the full sense list (Table 3) and κ_top the value obtained on the top-level divisions (Table 4).

The results are disappointing: the disagreement reduction is only 8% for adjectives and 9% for verbs. It is higher for nouns, but reaches only 25% (Table 4).

            Full          Pairwise
POS  Nsen   min   max   min   max   wgh    κ     ∆(%)
A    1.010  0.55  0.57  0.78  0.80  0.79  0.46   7.9
N    1.003  0.70  0.70  0.86  0.86  0.86  0.60  25.2
V    1.018  0.54  0.56  0.77  0.80  0.79  0.46   8.9

Table 4. Agreement on top-level divisions

4. DISCUSSION

4.1. Summary of results

Experiment One showed that judges disagree widely on whether a given word is polysemous or not in a corpus. Experiment Two showed that they also disagree enormously when they have to tag corpus examples according to the sense list provided by a common dictionary. The rate of disagreement is so high that for some words, there was no more agreement than what would be obtained by mere chance. It cannot be argued that sense distinctions are too fine-grained for WSD, since, somewhat surprisingly, most disagreement between annotators spans the top-level divisions of entries.

These results shed a new light on automated sense tagging. The dictionary chosen (Petit Larousse) is not at fault. It is a very respectable medium-size dictionary which builds on a century and a half of lexicographic tradition. I am convinced that the results would be similar with any other traditional dictionary.

4.2. An example of difficulty

The word degré (=degree) exemplifies the type of difficulty that annotators are faced with. At the top level, the dictionary gives the following divisions and definitions (I translate roughly and skip the sub-senses for lack of space):

DEGRÉ. I. Literary: step/stair. II. each of the intermediary states leading from one state to another. III. relative intensity (of an affective, moral or pathological state). IV. each of the divisions, corresponding to a unit, of a scale of measurement.

While divisions I and IV are (almost) straightforward, the distinction between II and III is extremely confusing for annotators. In sentences such as:

...les trois principaux degrés de cette élimination étatique: le génocide, la déportation en masse et l'assimilation forcée... (...the three main steps/levels of this state elimination: genocide, mass deportation and forced assimilation...)

Ils s'inquiètent de ce qu'ils perçoivent comme un degré croissant d'anarchie... (they are concerned about what they perceive to be an increasing level of lawlessness...)

it is very unclear whether degré refers to “an intermediary state leading from one state to another” or a “relative intensity of an affective, moral or pathological state”³.

In this example, it would however be very easy to split uses according to syntactic criteria. A first set of uses accepts cardinal determiners (un, deux, trois / =one, two, three, etc.) as well as ordinal qualifiers such as premier, second, dernier (=first, second, last):

...les trois principaux degrés de cette élimination étatique → le premier degré, le second degré, etc. (the first step, the second step, etc.)

On the other hand, another, disjoint, set of uses accepts intensifying qualifiers whose prototype is the fort/faible (=high/low) pair:

un degré croissant d'anarchie → un faible degré, un fort degré d'anarchie (a low level, a high level of lawlessness)

Other adjectives in the paradigm are alarmant (=alarming), élevé (=high), minimal (=minimal), différent (=different), croissant (=increasing), etc. In other words, one set of uses is discrete and countable, the other set is continuous and intensifiable.
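The two syntactic tests for degré can be sketched as a simple cue lookup. This is only an illustration under stated assumptions: the cue lists are copied from the adjectives and determiners quoted above, while the function name `classify_degre`, the two-word context window and the class labels are hypothetical choices. The determiner un is left out of the discrete cues because it also occurs as a plain article in the intensifiable uses (un degré croissant), and inflected forms (forts, premiers) are not handled.

```python
import re

# Cue words quoted in the text. "un" is deliberately omitted: it also appears
# as an indefinite article in the intensifiable uses.
DISCRETE_CUES = {"deux", "trois", "premier", "second", "dernier"}
INTENSITY_CUES = {"fort", "faible", "alarmant", "élevé", "minimal",
                  "différent", "croissant"}


def classify_degre(sentence: str, window: int = 2) -> str:
    """Guess the usage class of 'degré' from cue words within `window` tokens."""
    tokens = re.findall(r"\w+", sentence.lower())
    discrete = intensity = 0
    for i, token in enumerate(tokens):
        if token not in ("degré", "degrés"):
            continue
        # Words immediately before and after the target word.
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        discrete += sum(word in DISCRETE_CUES for word in context)
        intensity += sum(word in INTENSITY_CUES for word in context)
    if discrete > intensity:
        return "discrete"
    if intensity > discrete:
        return "intensity"
    return "unknown"
```

Applied to the two corpus sentences quoted above, the sketch puts les trois principaux degrés in the discrete class and un degré croissant d'anarchie in the intensifiable class; a sentence with no cue word near degré is left undecided.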
Annotators would have little trouble using these tests, and machines could use the presence of the appropriate adjectives or determiners as a reliable disambiguating clue. However, none of the French dictionaries that I examined use or mention this rather simple syntactic property. Worse yet, the Petit Larousse definitions II and III, which at first glance could correspond to this division, are in fact at odds with it, as the examples and sub-senses reveal.

³ WordNet 1.6 proposes a similar distinction for the English degree, resulting in exactly the same kind of indecision: 1. a position on a scale of intensity or amount or quality (e.g.: “a moderate degree of intelligence” etc.) 2. a specific identifiable position in a continuum or series or especially in a process (e.g.: “a remarkable degree of frankness” etc.). It is hard to see why “degree of intelligence” and “degree of frankness” should be treated differently.

4.3. From meaning to use

It is always easy to point out weaknesses and errors in entries, in any dictionary. However, my criticism is of a different nature. I am not trying to spot occasional flaws, but questioning the very style and organisation of entries. For almost all of the 60 words used in Experiment Two, the definitions (which are after all the only information that annotators have at their disposal in order to match individual senses with corpus contexts) do not contain enough clues to perform the task safely. Worse yet, the division of entries itself rarely takes into account (and is often contradictory with) distributional facts. Annotators all commented on the vagueness of definitions and the lack of clear-cut distinctions among senses, which they had never fully realised until they were confronted with the systematic tagging task. This vagueness is particularly apparent in abstract, very polysemous words, such as degré, économie (=economy, economics, saving, etc.), communication (=communication, report, telephone call, etc.), formation (=education, training, forming, formation, etc.), which constitute a large part of most texts.

The reason for this is probably to be found in a lexicographic tradition that has its roots in the Aristotelian approach to meaning and definition. For several centuries, dictionaries have primarily tried to give an account of meaning, not of usage (apart from occasional indications of register or domain). As a result, they rarely provide the surface distributional clues that would enable sense discrimination. Only recently have some dictionaries (e.g. Cobuild, LDOCE, OALD) started incorporating detailed syntactic, collocational and paradigmatic information, using corpus evidence instead of lexicographers' introspection.
This trend is however very new compared to the four-century dictionary building tradition, and distributional information in modern dictionaries is still very far from being systematic and precise enough for computer use. More computer-oriented resources such as WordNet unfortunately also almost totally lack this type of information.

A major departure from traditional lexicography has to be made if we want to accomplish significant progress in sense tagging and other sense-related activities. We have to radically shift from the description of meaning to that of uses. The dictionaries cited above go one step in that direction, but distributional information is still very much conceived as an add-on on top of traditional foundations. I will take the radical stance that distributional information can provide the very foundations of dictionary organisation, and that entries can be divided up into coherent usage classes (which one can think of as senses) on the sole basis of that information, with no resort to meaning analysis and the more or less introspective or psychological considerations that such analysis usually requires.

Although never implemented fully and systematically in lexicographic work and computer applications, this point of view is not entirely new. It can be traced back at least to Meillet [15]:

“Le sens d'un mot ne se laisse définir que par une moyenne entre [ses] emplois linguistiques.” (The sense of a word is defined only by the average of its linguistic uses.)

Wittgenstein [18] popularised a similar position in the well-known aphorism⁴:

“Don't look for the meaning, but for the use”,

and Harris made it part of his linguistic programme, by defining “meaning as a function of distribution” [10:155-158].

4.4. Distributional information

In this section, I will show that entries can be divided up using various types of distributional information with no resort to meaning analysis. At the same time, this information is of primary importance for human annotators and tagging systems.
I will use the word barrage (=dam, blocking, roadblock, barrier, etc.) as an example, since while being polysemous, it is not too complex for the space constraints of this paper.

⁴ In the Philosophische Untersuchungen; he had previously defended the opposite view in the Tractatus.

4.4.1. Syntactic information

Syntax provides an extremely powerful tool for splitting entries. For example, some uses of barrage are an active nominalisation of the verb barrer, others are not. By active, I mean that the nominalisation is a strict synonym of the verb, by which it can be replaced provided the construction is changed. At the same time, the valency of the verb is kept in the noun; in particular, the noun has (or can have) an agent (corresponding to the verb subject):

le barrage de la rivière [par les castors] → les castors ont barré la rivière (the blocking up of the river [by the beavers] → the beavers blocked up the river)

This use, although given as the core sense by most dictionaries (“the act of blocking”), is in fact very rare in corpora. It must not be confused with the other uses of barrage which, although etymologically formed through a nominalisation, have lost the direct relationship to the verb.

le barrage sur le Rhône (the dam on the Rhône river) → *quelqu'un barre le Rhône

This second set of uses has developed its own valency over the centuries, in different ways: a first subset takes a complement introduced by the preposition sur (=on) and a second subset a complement introduced by à (=to):

le barrage sur le Rhône, sur l'autoroute (the dam on the Rhône river, the roadblock on the highway)

le barrage à la loi sur l'avortement (the opposition to the abortion law)

At this stage, the entry is structured as follows:

barrage
  1. barrage de X par Y (= Y barre X)
  2.
    2.1. barrage sur X
    2.2. barrage à X

4.4.2. Paradigmatic information

Another type of information is of paradigmatic nature. For example, one set of uses of barrage has a hypernym, ouvrage (≈civil engineering structure, no exact translation), while the others have no hypernym. This assertion may seem odd, since a long Aristotelian tradition, and the recent upsurge of ontology development, have contributed to the widespread feeling that all words, and all senses of these words, have a hypernym and that the lexical space is organised as a giant taxonomy. It all depends, of course, on what we want to call a hypernym, and how lax we want this definition to be.
In the distributional perspective that I am advocating here, I will adopt a very strict view of hypernymy and restrict it to the cases where there is syntagmatic evidence of the relationship, for instance in enumerations or anaphoras:

les ouvrages "lourds" du GAP, comme le barrage Atatürk ou les tunnels jumeaux d'Urfa... (the “heavy” structures built by the GAP, such as the Atatürk dam or the twin tunnels of Urfa...)

le barrage d'Assouan … cet ouvrage géant, monstrueux (the Aswan dam ... this giant, monstrous structure)

The usual tests such as “is a kind of” simply do not work on most senses, unless we are willing to distort language use in the laxest and most unnatural way. For example, it is impossible to find a natural filler for the pattern “is a kind of ...” for the sense “roadblock” of barrage. Despite extended search in large corpora, I was unable to find any syntagmatic evidence of a term that could be a satisfactory hypernym.

This differential behaviour with respect to hypernyms enables us to subdivide further the use 2.1 (barrage sur):

2.1. barrage sur X
  2.1.1. ⇑ OUVRAGE
  2.1.2. (others)

Other types of paradigmatic information can be used as well, such as the presence or absence of synonyms. Here again, I restrict the notion to strict synonyms, i.e. those which can be substituted with no change or loss in a given context. The substitutability can be established by assessing whether the contexts of the candidate synonyms are similar in terms of distribution (valency, etc.). For example, use 2.2 of barrage (barrage à) accepts a strict synonym, obstacle:

la volonté de faire barrage (=obstacle) à une probable expansion du communisme (the desire to block a probable expansion of communism)

while no other subset has any strict synonym. This does not enable us to subdivide classes further, but confirms that 2.2 should be a separate class.

4.4.3. Collocational information

Collocational information is at the crossroads of syntagmatic and paradigmatic information.
On the one hand it has a syntactic base, since it expresses the “preferences” of syntactically bound terms (verb-object, etc.); on the other hand, it enables the grouping of words in paradigms that can fulfil a given syntactic place (e.g.: read a ). This information can relatively easily be extracted from corpora using grammatical and statistical filters, and manual checking. In the barrage example, this information is quite productive. It does not impose further dividing, but strongly confirms the classes established so far. For instance, frequent verbs with barrage 2.1.1 as object are construire (=build), édifier (=erect), démolir (=demolish), etc., while verbs associated with barrage 2.1.2 are a totally disjoint subset: dresser (=put up), franchir (=cross), démanteler (=dismantle), etc.

Figure 2 shows the most frequent collocations associated with the various classes of uses for barrage, roughly grouped by syntactic category. Glosses are provided between square brackets only for the sake of readability. It is important to note that the meanings that they refer to were not used in the splitting process, which was done only on distributional grounds. However, interestingly enough, the classes of uses obtained this way are also coherent from a cognitive point of view.

5. CONCLUSION

In this paper, I have shown that inter-annotator agreement is very low in a straightforward sense tagging task, using a traditional dictionary. For some words, agreement was no better than chance. A careful analysis reveals that the main difficulties come from the lack of distributional information in traditional dictionaries. Building on several centuries of lexicographic tradition, dictionaries mainly attempt to describe and define meaning, and only marginally give information about word uses and distributional data. Only very recently have lexicographers started making systematic use of corpora, and dictionaries still do not systematically contain the surface clues (syntactic, collocational, etc.) that are required to match a given sense with a given corpus occurrence.
I tried to show that distributional information can provide the very foundations of dictionary organisation, and that entries can be divided up into coherent usage classes (which one can think of as senses) on the sole basis of that information, with no resort to meaning analysis and the more or less introspective or psychological considerations that such analysis usually requires. I am convinced that large-scale lexicons organised this way, and containing detailed distributional information, are necessary in order for fundamental progress to be made in sense tagging and other sense-related language processing.

[Figure 2: most frequent collocations for each class of uses of barrage. First entry: BARRAGE 1. [act of blocking] barrage de X par Y (+Nomin. ...]
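The usage classes obtained for barrage in Section 4.4 can be sketched as a small data structure that a tagger might filter by surface cues. This is a minimal sketch under assumptions, not the paper's implementation: the names `UsageClass`, `BARRAGE` and `candidate_classes` are hypothetical, each class is reduced to the preposition heading its complement, and only the verbs, hypernym and synonym quoted in the text are recorded.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional


@dataclass(frozen=True)
class UsageClass:
    label: str                           # class number used in Section 4.4
    preposition: str                     # preposition introducing the complement
    verbs: FrozenSet[str] = frozenset()  # verbs quoted with this use as object
    hypernym: Optional[str] = None       # strict hypernym, if attested
    synonym: Optional[str] = None        # strict synonym, if any


# Values copied from the running example; whatever the text does not list
# is left empty rather than guessed.
BARRAGE = [
    UsageClass("1", "de"),
    UsageClass("2.1.1", "sur",
               frozenset({"construire", "édifier", "démolir"}),
               hypernym="ouvrage"),
    UsageClass("2.1.2", "sur",
               frozenset({"dresser", "franchir", "démanteler"})),
    UsageClass("2.2", "à", synonym="obstacle"),
]


def candidate_classes(preposition: Optional[str] = None,
                      verb: Optional[str] = None) -> List[str]:
    """Return the labels of the usage classes compatible with the observed cues."""
    labels = []
    for use in BARRAGE:
        if preposition is not None and use.preposition != preposition:
            continue
        # A verb only rules out classes for which a verb list is recorded.
        if verb is not None and use.verbs and verb not in use.verbs:
            continue
        labels.append(use.label)
    return labels
```

With these assumptions, `candidate_classes(preposition="sur")` leaves both 2.1.1 and 2.1.2 open, and adding `verb="franchir"` narrows the choice to 2.1.2, which mirrors how the collocational evidence confirms the split made on paradigmatic grounds.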

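The uncorrected agreement measures of Section 3.3 can be sketched for a single context as follows. This is a hedged illustration: the function names are hypothetical, each judge's response is assumed to be a non-empty set of sense labels (with "?" standing for the “no sense” answer), and the max variant of full agreement is read here as "at least one sense is shared by all judges", which is one possible interpretation of the definition. The chance-corrected κ is not covered.

```python
from itertools import combinations


def dice(a: set, b: set) -> float:
    """Dice coefficient between two non-empty sets of senses."""
    return 2 * len(a & b) / (len(a) + len(b))


def full_agreement(responses):
    """Return (min, max) full agreement over all judges for one context."""
    sets = [set(r) for r in responses]
    agree_min = all(s == sets[0] for s in sets)    # identical sense sets
    agree_max = bool(set.intersection(*sets))      # one sense common to all
    return agree_min, agree_max


def pairwise_agreement(responses):
    """Return (min, max, weighted) agreement averaged over all pairs of judges."""
    pairs = list(combinations([set(r) for r in responses], 2))
    n = len(pairs)
    p_min = sum(a == b for a, b in pairs) / n       # same senses
    p_max = sum(bool(a & b) for a, b in pairs) / n  # at least one shared sense
    p_wgh = sum(dice(a, b) for a, b in pairs) / n   # partial credit
    return p_min, p_max, p_wgh
```

For three judges answering {1}, {1} and {1, 2}, full agreement holds under the max variant but not under the min variant, while the pairwise values are 1/3 (min), 1 (max) and 7/9 (weighted). Averaging such per-context values over all contexts of a word would give figures comparable in kind to those of Table 3.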
Proceedings of the Sixth Midwest Artificial Intelligence and Cognitive Society Conference, Carbondale, Illinois, April 1995, 73-78.

[3] Ahlswede, T. E., & Lorand, D. (1993). The Ambiguity Questionnaire: A Study of Lexical Disambiguation by Human Informants. Proceedings of the Fifth Midwest Artificial Intelligence and Cognitive Society Conference, Chesterton, Indiana, 21-25.

[4] Amsler, R. A., & White, J. S. (1979). Development of a computational methodology for deriving natural language semantic structures via analysis of machine-readable dictionaries. Final report on NSF project MCS77-01315. University of Texas at Austin, Austin, Texas.

[5] Bruce, R., & Wiebe, J. (1998). Word sense distinguishability and inter-coder agreement. Proceedings of the 3rd Conference on Empirical Methods in Natural Language Processing (EMNLP-98). Association for Computational Linguistics SIGDAT, Granada, Spain, June 1998.

[6] Carletta, J. (1996). Assessing agreement on classification tasks: the kappa statistic. Computational Linguistics, 22(2), 249-254.

[7] Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20, 37-46.

[8] Cohen, J. (1968). Weighted kappa: nominal scale agreement with provision for scaled disagreement or partial credit. Psychological Bulletin, 70(4), 213-220.

[9] Fellbaum, C., Grabowski, J., & Landes, S. (1998). Performance and confidence in a semantic annotation task. In C. Fellbaum (Ed.), WordNet: An electronic database (pp. 217-237). Cambridge, Massachusetts: The MIT Press.

[10] Harris, Z. S. (1954). Distributional Structure. Word, 10, 146-162.

[11] Ide, N., & Véronis, J. (1998). Introduction to the special issue on word sense disambiguation: the state of the art. Computational Linguistics, 24(1), 1-40.

[12] Jorgensen, J. (1990). The psychological reality of word senses. Journal of Psycholinguistic Research, 19, 167-190.

[13] Kilgarriff, A. (1998). SENSEVAL: An Exercise in Evaluating Word Sense Disambiguation Programs. Proceedings of the Language Resources and Evaluation Conference (pp. 581-588). Granada, Spain.

[14] Krippendorff, K. (1980). Content Analysis: An Introduction to Its Methodology. Sage Publications.

[15] Meillet, A. (1926). Linguistique historique et linguistique générale. Vol. 1. Champion, Paris, 351 pp. (2nd edition).

[16] Véronis, J. (2000). Evaluation of parallel text alignment systems: the ARCADE project. In J. Véronis (Ed.), Parallel text processing: Alignment and use of translation corpora (pp. 369-388). Dordrecht: Kluwer Academic Publishers.

[17] Weaver, W. (1949). Translation. Mimeographed, 12 pp., July 15, 1949. Reprinted in Locke, William N. and Booth, A. Donald (1955) (Eds.), Machine translation of languages. John Wiley & Sons, New York, 15-23.

[18] Wittgenstein, L. (1953). Philosophische Untersuchungen [Philosophical Investigations, translated by G.E.M. Anscombe, New York, Macmillan].
