The SenSem Project: syntactico-semantic annotation of sentences in Spanish

Alonso, Laura*; Capilla, Joan Antoni†; Castellón, Irene*; Fernández-Montraveta, Ana**; Vázquez, Gloria†

* Department of Linguistics, Universitat de Barcelona, Spain
  {lalonso,icastellon}@ub.edu
** Department of English and German Philology, Universitat Autònoma de Barcelona, Spain
  ana.fernandez@uab.es
† Department of English and Linguistics, Universitat de Lleida, Spain
  {jcapilla,gvazquez}@dal.udl.es

Abstract

This paper presents SenSem, a project [1] that aims to systematize the behavior of verbs in Spanish at the lexical, syntactic and semantic levels. As part of the project, two resources are being built: a corpus in which sentences are associated with their syntactico-semantic interpretation, and a lexicon in which each verb sense is linked to the corresponding annotated examples in the corpus. Some tendencies that can be observed in the current state of development are also discussed.

1 Introduction

The SenSem project [2] aims to build a databank of Spanish verbs based on a lexicon that links each verb sense to a significant number of manually analyzed corpus examples. This databank will reflect the syntactic and semantic behavior of Spanish verbs in naturally occurring text.

We analyze the 250 verbs that occur most frequently in Spanish. Annotation is carried out at three different levels: the verb as a lexical item, the constituents of the sentence, and the sentence as a whole.
The annotation process includes verb sense disambiguation, syntactic structure analysis (syntagmatic categories, including the annotation of phrasal heads, and syntactic functions), interpretation of semantic roles, and analysis of various kinds of sentential semantics. It is precisely this last area of investigation which sets our project apart from others currently being carried out for Spanish (Subirats and Petruck, 2003; García de Miguel and Comesaña, 2004).

Abstracting from the analysis of a significant number of examples, the prototypical behavior of verb senses will be systematized and encoded in a lexicon. The description of verb senses will focus on their properties at the syntactico-semantic interface, and will include information such as the list of syntactico-semantic frames in which a verb can occur. In addition, selectional restrictions will be automatically inferred from the words marked as heads of the constituents. Finally, the usage of prepositions will be studied.

The conjunction of all this information will provide a very fine-grained description of the syntactico-semantic interface at the sentence level, useful for applications that require an understanding of sentences beyond shallow parsing.

[1] Databank Sentential Semantics: "Creación de una Base de Datos de Semántica Oracional" (Creation of a Sentential Semantics Database). MCyT (BFF2003-06456).
[2] http://grial.uab.es/projectes/sensem.php
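The inference of selectional restrictions from annotated heads amounts to collecting, for each verb sense and argument slot, the head lemmas attested in the corpus. The following is a minimal sketch of that idea, assuming a flat list of (verb sense, semantic role, head lemma) records; the record format and the helper name are illustrative, not the project's actual implementation:

```python
from collections import Counter, defaultdict

# Hypothetical annotated records: (verb sense, semantic role, head lemma).
annotations = [
    ("abrir_1", "agent", "alcalde"),
    ("abrir_1", "theme", "expediente"),
    ("abrir_1", "theme", "vertedero"),
    ("abrir_1", "theme", "expediente"),
]

def selectional_preferences(records):
    """Group head lemmas by (sense, role) and rank them by frequency."""
    prefs = defaultdict(Counter)
    for sense, role, head in records:
        prefs[(sense, role)][head] += 1
    return {key: counter.most_common() for key, counter in prefs.items()}

prefs = selectional_preferences(annotations)
print(prefs[("abrir_1", "theme")])  # [('expediente', 2), ('vertedero', 1)]
```

Ranked head lists of this kind could then be generalized (e.g., via WordNet classes) into proper selectional restrictions.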
A resource of this type will be especially valuable in the fields of automatic understanding, semantic representation and automatic learning systems.

In the rest of the paper we describe the corpus annotation process in more detail and provide examples. Section 2 offers a general overview of other projects similar to SenSem. In section 3, the levels of annotation are discussed, and the process of annotation is described in section 4. We then proceed to present


the results obtained to date and the current state of annotation, and we put forward some tentative conclusions drawn from the annotation thus far.

2 Related Work

As shown by Levin (1993) and others (Jones et al., 1994; Jones, 1995; Kipper et al., 2000; Saint-Dizier, 1999; Vázquez et al., 2000), syntax and semantics are highly interrelated. By describing the way linguistic layers interrelate, we can provide better verb descriptions, since generalizations that previously belonged to the grammar level of linguistic description can be established from the lexicon (the lexicalist approach).

Within the area of Computational Linguistics, it is common to deal with both fields independently (Grishman et al., 1994; Corley et al., 2001). In other cases, the relationship established between the syntactic and semantic components is not fully exploited and only basic correlations are established (Dorr et al., 1998; McCarthy, 2000). We believe this approach is interesting even though it does not take full advantage of the existing link between syntax and semantics. Furthermore, we think that in order to coherently characterize the syntactico-semantic interface, it is necessary to start by describing linguistic data from real language. Thus, a corpus annotated at the syntactic and semantic levels plays a crucial role in acquiring this information appropriately.

In recent years, a number of projects related to the syntactico-semantic annotation of corpora have been carried out.
The length of the present paper does not allow us to consider them all here, but we will mention a few of the most significant ones.

FrameNet (Johnson and Fillmore, 2000) is a lexicographic resource that describes approximately 2,000 items, including verbs, nouns and adjectives belonging to diverse semantic domains (communication, cognition, perception, movement, space, time, transaction, etc.). Each lexical entry has manually annotated examples extracted from the British National Corpus. The annotation reflects argument structure and, in some cases, also adjuncts.

PropBank (Kingsbury and Palmer, 2002; Kingsbury et al., 2002) is a project based on the manual semantic annotation of a subset of the Penn Treebank II (a syntactically annotated corpus). This project aims to identify predicate-argument relations. In contrast with FrameNet, the sentences to be annotated have not been pre-selected, so the examples are more varied.

Both FrameNet and PropBank work with corpora, although their objectives differ somewhat. In FrameNet, a corpus is used to find evidence about linguistic behavior and to associate examples with lexical entries, whereas in PropBank the objective is to enrich a corpus that has already been annotated at the syntactic level so that it can be exploited in more ambitious NLP applications.

For Spanish, only a few initiatives address the syntactico-semantic analysis of corpora. The database "Base de Datos Sintácticos del Español Actual" (Muñiz et al., 2003) provides the syntactic analysis of 160,000 sentences extracted from part of the ARTHUS corpus of contemporary texts.
Syntactic positions are currently being labeled with semantic roles (García de Miguel and Comesaña, 2004).

FrameNet-Spanish (Subirats and Petruck, 2003) is the application of the FrameNet methodology to Spanish. Its goal is to develop semantic frames and lexical entries for this language. Each verb sense is associated with its possible combinations of participants, grammatical functions and phrase types, as attested in the corpus.

The SenSem project provides a different approach to the description of verb behavior. In contrast with FrameNet, its aim is not to provide examples for a pre-existing lexicon, but to shape the lexicon with the annotated corpus examples. Another difference from the FrameNet approach is that the semantic roles we use are far more general: they are related to syntactic functions and are less class-dependent.

Finally, to the best of our knowledge, no large-scale corpus annotation initiative associates sentences with semantic information such as their aspectual interpretation or type of causativity.

3 Levels of annotation

As mentioned previously, we are describing verb behavior, so only constituents directly related to the verb are analyzed. Elements beyond the scope of the verb (i.e., extra-sentential elements such as logical linkers, some adverbs, etc.) are disregarded. The following is an example of the scope of annotation:

...El presidente, que ayer inició una visita oficial a la capital francesa, hizo estas declaraciones...
...The president, who began an official visit to the French capital yesterday, stated...


Were we annotating the verb iniciar (begin), we would ignore the participants of the main sentence and only take into account the elements within the clause. If we were annotating the verb hacer (make), we would annotate the subject so as to include the entire relative clause, with the word "president" as the head of the whole structure. The relative clause would not be further analyzed.

Sentences are annotated at three levels: sentential semantics, the lexical level and the constituent level.

3.1 Sentential semantics level

At this level, different aspects of sentential semantics are accounted for. With regard to aspectual information, a distinction is made among three types of meaning (eventive, procedural and stative), as in the following examples:

event: ...El diálogo acabará hoy...
...The conversations will finish today...

process: ...cuando le preguntaron de qué había vivido hasta aquel momento...
...when he was asked what he had been living on until then...

state: ...El gasto de personal se acerca a los 2.990 millones de euros...
...Personnel expenses come close to 2,990 million euros...

Apart from aspectual information, we also annotate sentential-level meanings using labels like anticausative, antiagentive, impersonal, reflexive, reciprocal or habitual. This feature is useful to account for the variation in the syntactic realizations of the argument structures of each verb sense.
For example, the following sentence with the verb abrir (open) presents an antiagentive interpretation:

agentive: El alcalde de Calafell [...] abrirá un expediente...
...The mayor of Calafell [...] will open administrative proceedings...

antiagentive: ...el vertedero de Tivissa no se abrirá sin consenso.
...Tivissa's rubbish dump will not be opened without a consensus.

3.2 Lexical level

At the lexical level, each example of a verb is assigned a sense. We have developed a verb lexicon in which the possible senses of each verb are defined, together with their prototypical event structure and thematic grid, a list of synonyms and antonyms, and the related synsets in WordNet (Fellbaum, 1998).

Various lexicographic sources have been taken as references to build the inventory of senses for each verb, mainly the Diccionario de la Real Academia de la Lengua Española and the Diccionario Salamanca de la Lengua Española. Less frequent meanings are discarded, together with archaic and restricted uses. This inventory of senses for each verb is only preliminary, and can be modified whenever the examples found in the corpus indicate the existence of a distinct sense which has not been considered. Different senses are distinguished by different thematic grids, event structures, selectional restrictions or subcategorizations.

3.3 Constituent level

Finally, at the constituent level, each participant in the clause is tagged with its constituent type (e.g., noun phrase, completive clause, prepositional phrase) and syntactic function (e.g., subject, direct object, prepositional object).

Arguments and adjuncts are also distinguished. Arguments are defined as those participants that are part of the verb's lexical semantics. Arguments are assigned a semantic role describing their relation with the verb (e.g., agent, theme, initiator, etc.).
In SenSem, each sense is associated with a prototypical thematic grid describing the possible arguments a verb may take, but, as in the case of senses, this thematic grid is only preliminary and is modified when corpus examples provide enough evidence.

The head of each phrase is also marked, in order to acquire selectional restrictions for that verb sense. Information considered relevant because it may alter information declared at a different level has also been included; for example, negative polarity and negative adverbs are indicated.

4 Annotation process

The SenSem corpus will describe the 250 most frequently occurring verbs in Spanish. Frequency has been calculated over a journalistic corpus. For each of these verbs, 100 examples are extracted randomly from 13 million words of corpora obtained from the electronic versions of the Spanish newspapers La Vanguardia and El Periódico de Catalunya. The corpus has been automatically tagged, and a shallow parsing analysis has been carried out to detect the personal forms of the verbs under consideration. We


do not take into account uses of the verb as an auxiliary. We also disregard any collocations or idioms in which the verb might participate.

The manual annotation of examples is carried out via a graphical interface, shown in Figure 1, where the three levels are clearly distinguished.

Figure 1. Screenshot of the annotation interface.

The interface displays one sentence at a time. First, when a verb sense is selected from the list of possible senses, its prototypical event structure and semantic roles are displayed for the annotator to take into account. Then, the clause is assigned its aspectual semantics, and constituents are identified and analyzed by selecting the words that belong to them. The heads of the arguments and their possible metaphorical usage are also marked, in order to facilitate the future automatic extraction of selectional restrictions. Finally, annotators specify any applicable clause-level semantics (e.g., anticausative, reflexive, stative, etc.), and note any particular fact that they consider might be of use in future revision and correction processes.

The distribution of the corpus among annotators has evolved since the earlier stages of the project. In an initial stage, when the annotation guidelines were not yet consolidated, each of the 4 annotators was given 24 different sentences of the same verb, plus 4 common sentences that were separately annotated by all of them. These common sentences were later compared in order to identify those aspects of the annotation that were unclear or prone to subjectivity, as explained in the following section.
In the current stage, the annotation guidelines have been well established. Annotators work with sets of 100 sentences corresponding to a single verb. All annotations are revised and any errors are corrected.

The final corpus will be made available to the linguistic community by means of a web-based interface that is currently under development.

5 Preliminary Results of Annotation

At this stage of the project, 77 verbs have already been annotated, which means that the corpus currently comprises 7,700 sentences (199,190 words). A total of 900 of these 7,700 sentences


have already been validated, which means that a corpus of approximately 25,000 words has already undergone the complete annotation process.

5.1 Data analysis

In this section we describe the information about verb behavior that can be extracted from the corpus in its present state. We have found that, out of the 199,190 words that have already been annotated, 182,303 belong to phrases that are arguments of the verb and 16,887 to adjuncts.

With regard to aspectuality, there is a clear predominance of events (74.26% of the sentences) over processes (20.67%) and states (8.96%). This skewed distribution of clause types, with a clear predominance of events, may be specific to the journalistic genre; we have yet to investigate its distribution in other genres.

As concerns syntactic functions, shown in Table 1, the most frequent category is direct object, followed at a significant distance by subjects. This is not surprising if we take into account that Spanish is a pro-drop language. However, prepositional objects are less frequent than subjects, and indirect objects are also scarce. Thus, the clausal core appears to be predominantly populated by the least marked constituents.

Function               Ratio
direct object          39.83 %
subject                22.57 %
circumstantial         23.16 %
prepositional object   12.65 %
indirect object         1.97 %

Table 1. Distribution of syntactic functions in the annotated examples.

The distribution of semantic roles can be seen in Table 2. Themes are predominant, as would be expected given that the most common syntactic function is that of direct object, and that there is a high presence of antiagentive, anticausative and passive constructions.
Within the different types of the semantic role theme, non-affected themes (moved objects) appear most frequently.

At the constituent level, the semantic role chosen for each phrase is often predictive of the other labels of that phrase, following what was expected from linguistic introspection: agents tend to be noun phrases with subject function, themes tend to be noun phrases with subject or object function (the former if they occur in a passive, antiagentive, anticausative or stative sentence), etc.

Role                 Ratio
non-affected theme   53.47 %
affected theme       14.36 %
agent and cause      14.02 %
initiator             2.97 %

Table 2. Distribution of semantic roles in the annotated examples.

Thus, the associations between labels at different levels have been used as a first step to semi-automate the annotation process: once a role is selected, the category and function most frequently associated with it, and its status as a verb argument, are pre-selected so that the annotator only has to validate the information.

5.2 Inter-annotator agreement

In order to measure inter-annotator agreement, four sentences of each of 59 verbs were annotated by 4 different judges so that divergences in criteria could be detected. These common sentences were used in the preliminary phase with the aim of both training the annotators and detecting points of disagreement among them. This comparison has helped us refine and settle the annotation guidelines and facilitate the subsequent revision of the corpus.

In order to detect these problematic issues, we calculated inter-annotator agreement for all levels of annotation.
An overview of the most representative values for annotator agreement can be seen in Table 3. We computed pairwise proportions of overall agreement, that is, the ratio of cases in which two annotators agreed with respect to all cases.

In addition, we also computed the kappa coefficient (Cohen, 1960), which gives an indication of the stability and reproducibility of human judgments in corpus annotation. The main advantage of this measure is that it factors out the possibility that judges agree by chance. Kappa values range from k=-1 to k=1, with k=0 when there is no agreement other than what would be expected by chance, k=1 when agreement is perfect, and k=-1 when there is systematic disagreement. Following the interpretation proposed by Krippendorf (1980) and Carletta et al. (1996) for corpus annotation, kappa > 0.8 indicates good stability and reproducibility of the results, while lower values allow only tentative conclusions to be drawn.
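The two measures can be sketched as follows: observed agreement is the fraction of items on which two annotators coincide, and kappa corrects it by the agreement expected if each annotator chose labels independently according to their own label distribution. The label lists below are hypothetical toy data, not SenSem annotations:

```python
from collections import Counter

def observed_agreement(a, b):
    """Proportion of items on which two annotators chose the same label."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cohen_kappa(a, b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    po = observed_agreement(a, b)
    ca, cb = Counter(a), Counter(b)
    n = len(a)
    # Chance agreement: probability that both annotators independently
    # pick the same label, given their individual label distributions.
    pe = sum(ca[label] * cb[label] for label in ca) / (n * n)
    return (po - pe) / (1 - pe)

# Hypothetical aspectual labels from two annotators on six sentences.
ann1 = ["event", "event", "state", "process", "event", "state"]
ann2 = ["event", "process", "state", "process", "event", "event"]
print(round(observed_agreement(ann1, ann2), 2))  # 0.67
print(round(cohen_kappa(ann1, ann2), 2))  # 0.48
```

Note how kappa (0.48) is well below raw agreement (0.67) once chance coincidence on the frequent "event" label is factored out, which is the effect discussed below for the SenSem figures.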


Category                    Agreement   Kappa

Eventual semantics
  event                     66%         .11
  state                     90%         .33
  process                   76%         .06

Argumentality
  argument                  82%         .54
  adjunct                   64%         .46

Semantic role
  initiator                 70%         .37
  agent                     84%         .81
  cause                     91%         .89
  experiencer               97%         .92
  theme                     68%         .43
  affected theme            74%         .55
  non-affected theme        70%         .34
  goal                      79%         .70

Syntactic function
  agentive complement       100%        1.00
  subject                   87%         .83
  direct object             80%         .63
  indirect object           77%         .79
  prepositional object 1    67%         .65
  prepositional object 2    66%         .28
  prepositional object 3    78%         .24
  circumstantial            62%         .42
  predicative               76%         .16

Syntactic category
  noun phrase               78%         .67
  prepositional phrase      72%         .53
  adjectival phrase         88%         .69
  negative adverbial        100%        1.00
  adverbial phrase          77%         .54
  adverbial clause          68%         .66
  gerund clause             72%         .65
  relative clause           82%         .16
  completive clause         95%         .93
  direct speech             96%         .95
  infinitive clause         94%         .98
  prep. completive clause   96%         .44
  prep. infinitive clause   81%         .57
  personal pronoun          97%         .81
  relative pronoun          98%         .96
  other pronouns            94%         .82

Table 3. Inter-annotator agreement for a selection of annotated categories.

As a general remark, agreement is comparable to what is reported in similar projects. For example, Kingsbury et al. (2002) report agreement between 60% and 100% for predicate-argument tagging within PropBank, noting that agreement tends to increase as annotators become more trained. In SenSem, the level of annotation that is comparable to predicate-argument relations, semantic role annotation, is clearly within this 60%-100% range.

It is noteworthy that the values obtained for the kappa coefficient are rather low. After close inspection, we found that these low values are mainly due to the fact that the annotation guidelines were still not well established at that point of the annotation, and that the annotators were still in training.
This led us to further describe and exemplify cases detected as having a low agreement value once the preliminary exploration of the corpus had concluded. As a result, we expect kappa values to increase in evaluations that will be carried out at a later stage of the project.

Agreement on the aspectual interpretation of sentences is very close to chance. The stative interpretation seems to be more clearly perceived than the rest, while events and processes are at times confused. In order to reduce this source of disagreement, each verb sense was associated with its prototypical aspectual semantics, as determined by its lexical meaning. For example, the verb aceptar (accept) is associated with the semantics "event" for its sense "to receive (something offered), especially with gladness or approval", and with the semantics "state" for its sense "to be able to endure, hold or admit (in an ordered system)". We expect that this change in the annotation procedure will dramatically increase inter-annotator agreement for this feature.

Agreement is also low in the categorization of constituents as arguments or adjuncts. To improve consistency in the annotation of arguments, the prototypical subcategorization frame for each verb sense has been provided, making it easier for annotators to identify the arguments associated with a verb and to label the remaining constituents as adjuncts. For example, in the case of the verb aceptar, the eventive sense is associated with a subcategorization frame of the kind [agent, theme], while the stative sense is associated with [theme, theme].
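Associating each verb sense with a default aspect and a prototypical subcategorization frame amounts to a lexicon lookup that pre-fills the annotation for the annotator to validate. The sketch below illustrates this under the assumption of a small dictionary-based lexicon; the entries and helper name are hypothetical, not the project's actual database:

```python
# Hypothetical sense lexicon: each verb sense carries a default aspectual
# semantics and a prototypical subcategorization frame (thematic grid).
SENSE_LEXICON = {
    "aceptar_1": {"aspect": "event", "frame": ["agent", "theme"]},
    "aceptar_2": {"aspect": "state", "frame": ["theme", "theme"]},
}

def prefill_annotation(sense_id):
    """Pre-select the default aspect and frame for a sense; the annotator
    then only validates or overrides these values."""
    entry = SENSE_LEXICON[sense_id]
    return {"sense": sense_id,
            "aspect": entry["aspect"],
            "frame": list(entry["frame"]),
            "validated": False}

print(prefill_annotation("aceptar_1"))
# {'sense': 'aceptar_1', 'aspect': 'event', 'frame': ['agent', 'theme'], 'validated': False}
```

Pre-filling from sense defaults, rather than asking annotators to choose freely each time, is what the text predicts will raise agreement on the aspect and argumentality features.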
The criteria for distinguishing constituents dominated by a verb (arguments or adjuncts) from those beyond clausal scope have also been clarified.

In contrast, the tagging of semantic roles appears to rely on linguistic intuition much more than the above features. There seems to be perfect agreement for very infrequent roles (indirect cause, instrument, location). More frequent roles show a higher level of disagreement: initiators are significantly less clearly


perceived than agents or causes (note the differences in kappa agreement). It is also clear that fine-grained distinctions are more difficult to perceive than coarse-grained ones, as exemplified by the low agreement within the superclass of theme.

Among syntactic functions, the agentive complement of passives presents perfect agreement. Agreement is also high for subjects and indirect objects, but the distinction between different kinds of prepositional objects and circumstantial complements is not clearly perceived. Therefore, a clearer decision-making procedure for distinguishing among these was established in the annotation guidelines. We expect that these changes will improve consistency significantly.

Finally, agreement is rather high for some syntactic categories: pronouns, adverbs of negation, adjectival complements, completive clauses, infinitive clauses and direct speech present k > .7 and agreement ratios over 90%. However, the major categories present a rather high ratio of disagreement, as do those categories that are mostly considered adjuncts. This seems to be a direct effect of the variability in the assignment of the argumentality, semantic role and syntactic function features. We expect that a thorough inspection of the relation between variability in roles and functions and variability in categories will provide a clearer view of this aspect.

After the study of inter-annotator agreement, the annotation guidelines have been settled (Vázquez et al., 2005). These guidelines serve as a reference for annotators, and we believe they will increase the overall consistency of the resulting corpus.
Later in the annotation process, another study of inter-annotator agreement will be carried out to determine the consistency achieved in corpus annotation.

6 Conclusions and Future Work

The linguistic resource we have presented constitutes an important source of linguistic information, useful in several natural language processing areas as well as in linguistic research. The fact that the corpus has been annotated at several levels increases its value and versatility.

The project is in its second year of development, with a year and a half still to go. During this time we intend to continue with the annotation process and to develop a lexical database that will reflect the information found in the corpus. We are aware that the guidelines established in the annotation process will, to a certain extent, bias the resulting resources, but we nevertheless believe that both tools are of interest to the NLP community.

All tools developed in the project, as well as the corpus and lexicon themselves, will be made available to all researchers interested in exploiting them.

References

(Carletta et al. 1996) J. Carletta, A. Isard, S. Isard, J. C. Kowtko, G. Doherty-Sneddon and A. H. Anderson. HCRC Dialogue Structure Coding Manual. HCRC Technical Report HCRC/TR-82, 1996.

(Cohen 1960) J. Cohen. A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20, 1960, pp. 37-46.

(Corley et al. 2001) S. Corley, M. Corley, F. Keller, M. W. Crocker and S. Trewin. Finding Syntactic Structure in Unparsed Corpora. Computers and the Humanities, 35, 2001, pp. 81-94.

(Dorr et al. 1998) B. Dorr, M. A. Martí and I. Castellón. Spanish EuroWordNet and LCS-Based Interlingual MT. Proceedings of the ELRA Congress, Granada, 1998.

(Fellbaum 1998) C. Fellbaum. A Semantic Network of English Verbs. In C. Fellbaum (ed.), WordNet: An Electronic Lexical Database, MIT Press, 1998.

(García de Miguel & Comesaña 2004) J. M. García de Miguel and S. Comesaña. Verbs of Cognition in Spanish: Constructional Schemas and Reference Points. In A. Silva, A. Torres and M. Gonçalves (eds.), Linguagem, Cultura e Cogniçao: Estudos de Linguística Cognitiva, Almedina, 2004, pp. 399-420.

(Grishman et al. 1994) R. Grishman, C. Macleod and A. Meyers. Comlex Syntax: Building a Computational Lexicon. Proceedings of COLING, 1994.

(Johnson & Fillmore 2000) C. Johnson and C. J. Fillmore. The FrameNet Tagset for Frame-Semantic and Syntactic Coding of Predicate-Argument Structure. Proceedings of NAACL 2000, Seattle, WA, USA, 2000, pp. 56-62.

(Jones 1994) D. Jones (ed.). Verb Classes and Alternations in Bangla, German, English and Korean. Memo no. 1517, MIT Artificial Intelligence Laboratory, 1994.

(Jones 1995) D. Jones. Predicting Semantics from Syntactic Cues: Evaluating Levin's English Verb Classes and Alternations. UMIACS TR-95-121, University of Maryland, 1995.

(Kingsbury & Palmer 2002) P. Kingsbury and M. Palmer. From Treebank to PropBank. Third International Conference on Language Resources and Evaluation, LREC-02, Las Palmas, Spain, 2002.

(Kingsbury et al. 2002) P. Kingsbury, M. Palmer and M. Marcus. Adding Semantic Annotation to the Penn Treebank. Proceedings of the Human Language Technology Conference, San Diego, California, 2002.

(Kipper et al. 2000) K. Kipper, H. T. Dang and M. Palmer. Class-Based Construction of a Verb Lexicon. AAAI-2000, Seventeenth National Conference on Artificial Intelligence, Austin, TX, USA, 2000.


(Krippendorf 1980) K. Krippendorf. Content Analysis: An Introduction. Sage, 1980.

(Levin 1993) B. Levin. English Verb Classes and Alternations: A Preliminary Investigation. The University of Chicago Press, 1993.

(McCarthy 2000) D. McCarthy. Using Semantic Preferences to Identify Verbal Participation in Role Switching Alternations. Proceedings of NAACL 2000, Seattle, WA, USA, 2000.

(Muñiz et al. 2003) E. Muñiz, M. Rebolledo, G. Rojo, M. P. Santalla and S. Sotelo. Description and Exploitation of BDS: A Syntactic Database about Verb Government in Spanish. In G. Angelova, K. Bontcheva, R. Mitkov, N. Nicolov and N. Nikolov (eds.), Proceedings of RANLP 2003, Borovets, Bulgaria, 2003, pp. 297-303.

(Saint-Dizier 1999) P. Saint-Dizier. Alternations and Verb Semantic Classes for French: Analysis and Class Formation. In P. Saint-Dizier (ed.), Predicative Forms in Natural Languages and Lexical Knowledge Bases, Kluwer, The Netherlands, 1999, pp. 139-170.

(Subirats-Rüggeberg & Petruck 2003) C. Subirats-Rüggeberg and M. R. L. Petruck. Surprise: Spanish FrameNet! Presentation at the Workshop on Frame Semantics, Proceedings of the International Congress of Linguists, Prague, 2003.

(Vázquez et al. 2000) G. Vázquez, A. Fernández and M. A. Martí. Clasificación verbal. Alternancias de diátesis (Verb Classification: Diathesis Alternations). Universitat de Lleida, 2000.

(Vázquez et al. 2005) G. Vázquez, A. Fernández and L. Alonso. Guidelines for the Syntactico-Semantic Annotation of a Corpus in Spanish. RANLP, Bulgaria, 2005.
