Componential Analysis for Recognizing Textual Entailment

Ken Litkowski
CL Research
9208 Gue Road
Damascus, MD 20872
http://www.clres.com
ken@clres.com

Abstract

CL Research participated in the PASCAL Challenge for Recognizing Textual Entailment using components from its Knowledge Management System. These components, particularly those involved in summarization, proved useful. CL Research submitted two test runs, obtaining accuracy scores of 0.581 and 0.566, largely consistent with results obtained using the development set. The core routine used in assessing entailment was based on an overlap metric that used a test for preponderance of similarity between the hypothesis and the text. Several other mechanisms, e.g., syntactic and semantic tests, or using WordNet, a Roget-style thesaurus, and FrameNet alternation patterns, were considered, but none appeared likely to yield improvements, suggesting the need for an integrated solution. Qualitative observations about these mechanisms suggest future work.

1 Introduction

CL Research participated in the second PASCAL Challenge for Recognizing Textual Entailment (RTE-2) to assess the extent to which experience gained in summarization, question answering, and other similar NLP technologies could be applied to this task. CL Research's Knowledge Management System (KMS) provides several modules for summarization, question answering, information extraction, and document exploration. In participating in RTE, we explored whether and how these modules could be combined to perform the basic task and identified issues that emerged in working with the RTE-2 development set. After developing our system to perform the task, we processed the RTE-2 test set and submitted two runs, with results similar to those that had been achieved with the development set.

In section 2, we briefly describe KMS and the primary modules used to perform the RTE task. In section 3, we describe the system that was constructed for performing the task and that allowed examination of the underlying issues. Section 4 provides the official results from our submission and compares them to the results from using the development set. In section 5, we assess the different tasks, and in section 6, we describe our preliminary attempts to use such resources as WordNet, FrameNet, and machine-readable thesauruses, i.e., future work.

2 The Knowledge Management System

The CL Research KMS is a graphical interface that enables users to create repositories of files (of several file types) and to perform a variety of tasks against the files. The tasks include question answering, summarization, information extraction, document exploration, semantic category analysis, and ontology creation. The text portions of files (selected according to DTD elements) are processed into an XML representation; each task is then performed with an XML-based analysis of the texts.

KMS uses lexical resources as an integral component in performing the various tasks. Specifically, KMS employs dictionaries developed using CL Research's DIMAP dictionary maintenance programs, available for rapid lookup of lexical items. CL Research has created DIMAP dictionaries for a machine-readable version of the Oxford Dictionary of English, WordNet, the Unified Medical Language System (UMLS) Specialist Lexicon (which provides a considerable amount of syntactic information about lexical items), The Macquarie Thesaurus, and specialized verb and preposition dictionaries. These lexical resources are used seamlessly in a variety of ways in performing the various tasks.

The KMS text processing component consists of three elements: (1) a sentence splitter that separates the source documents into individual sentences; (2) a full sentence parser which produces a parse tree containing the constituents of the sentence; and (3) a parse tree analyzer that identifies important discourse constituents (sentences and clauses, discourse entities, verbs, and prepositions) and creates an XML-tagged version of the document.

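KMS itself is not reproduced here; the following minimal sketch only illustrates the three-stage flow just described. It substitutes a naive sentence splitter and a crude capitalized-run heuristic for the full sentence parser and parse tree analyzer, and emits a small XML rendition of a single "document"; the element names are illustrative rather than KMS's actual schema.

```python
import re
import xml.etree.ElementTree as ET

def split_sentences(text):
    """Stage 1 stand-in: split on sentence-final punctuation."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def discourse_entities(sentence):
    """Stages 2-3 stand-in: capitalized runs as a crude proxy for the
    noun phrases a full parse tree analysis would deliver."""
    return re.findall(r"[A-Z][\w'-]*(?:\s+[A-Z][\w'-]*)*", sentence)

def to_xml(doc_id, text):
    """Build an XML-tagged rendition of one 'document'."""
    root = ET.Element("document", id=doc_id)
    for i, sent in enumerate(split_sentences(text), start=1):
        s = ET.SubElement(root, "sentence", n=str(i))
        s.text = sent
        for entity in discourse_entities(sent):
            ET.SubElement(s, "entity").text = entity
    return ET.tostring(root, encoding="unicode")

if __name__ == "__main__":
    print(to_xml("t1", "Condoleezza Rice said North Korea would rejoin the talks."))
```
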


The XML representations of the documents are used in performing the various KMS tasks. To perform the RTE task, we made use of summarization and question answering modules, each of which employs lower-level modules for dictionary lookup, WordNet analysis, linguistic testing, and XML functions. Litkowski (2006), Litkowski (2005a), and Litkowski (2005b) provide more details on the methods used in TREC question answering and DUC summarization.

3 System for Assessing Textual Entailment

To perform the RTE task, we developed a graphical user interface on top of various modules from KMS, as appropriate. The development of this interface is in itself illuminating about factors that appear relevant to the task.

KMS is document-centric, so it was first necessary to create an appropriate framework for analyzing each instance of the RTE data sets (working initially with only the development set). Since these data were available in XML, we were able to exploit KMS' underlying XML functionality to read the files. We first created a list box for displaying information about each instance as the file was read. Initially, this list box contained a checkbox for each item (so that subsets of the data could be analyzed), its ID, its task, its entailment, an indication of whether the text and the hypothesis were properly parsed, the results of our evaluation, and a confidence score (used initially, but then discarded since we did not develop this aspect further). Subsequently, we added columns to record and characterize any problem with our evaluation and to identify the main verb in the hypothesis.

The interface was designed with text boxes so that an item could be selected from the instances and both the text and the hypothesis could be displayed. We associated a menu of options with the list box so that we could perform various tasks. Initially, the options consisted of (1) selecting all items, (2) clearing all selections, and (3) parsing all items.

The first step in performing the RTE task was to parse the texts and hypotheses and to create XML representations for further analysis. We were able to incorporate KMS' routines for processing each text and each hypothesis as a distinct "document" (applying KMS' sentence splitting, parsing, discourse analysis, and XML representation routines).[1] After performing this step (taking about 15 minutes for the full set), it was found that several texts had not been parsed, due to a bug in a sentence splitting routine. As a result, another option was added to reparse selected items, useful when corrections were made to underlying routines. The result of this parsing step was the creation of an XML rendition of the entire RTE set, approximately 10 times the size of the original data.[2]

[1] Since only a relatively small number of the RTE texts consisted of more than one sentence, the use of KMS discourse analysis functionality was minimal.

[2] The developers of the RTE data sets are to be commended for the integrity of the data. Processing of the data proceeded quite smoothly, enabling us to focus on the task, rather than dealing with problems in the underlying data.

The next extension of the interface was the addition of an option to make our evaluation of whether the texts entailed the hypotheses. Our initial implementation of this evaluation was drawn from the KMS summarization functionality. As used in multi-document DUC summarization (see Litkowski, 2005b, for details), KMS extracts top sentences that have a high match with either the terms in the documents or the terms in a topic description. KMS has generally performed quite well in DUC, primarily through its use of an overlap assessment that excludes relevant sentences that are highly repetitive of what has already been included in a growing summary. A key feature of that success is the use of anaphoric references in place of the anaphors. While this feature is significant in multi-document summarization, it is less so for the RTE task. Notwithstanding, this overlap assessment is the basis for the RTE judgment.

The overlap analysis is not strict, but rather based on an assessment of "preponderance." In RTE, the analysis looks at each discourse entity in the hypothesis and compares it to the discourse entities in the texts (with all anaphors and coreferents replaced by their antecedents). As used in KMS, discourse entities are essentially noun phrases, including gerundial phrases. Since the overlap analysis is based only on discourse entities, other sentence components, specifically verbs and prepositions, are not considered. And, while discourse entities are further analyzed into lexical components (i.e., nouns, adjectives, adverbs, and conjunctions), the overlap analysis does not make use of these distinctions. Since our summarization performance in DUC has proved adequate without consideration of other leaf nodes, we have not attempted to develop overlap metrics which take them into account, nor have we assessed whether they are important.

Each discourse entity in the hypothesis is compared to the full set of discourse entities in the texts, one by one. In an individual comparison, both discourse entities are lowercased and then split into constituent words. Words on a stop list are ignored. If at least one word in a discourse entity from the hypothesis is contained in a discourse entity from the text, the test returns true.[3] If a match does not occur, a counter of "new" discourse entities is incremented; if a match does occur, a counter of "old" discourse entities is incremented. When all discourse entities from the hypothesis have been tested, the number of new discourse entities is compared to the number of old discourse entities. If there are more new entities than old entities, a sentence (in this case, the hypothesis) is judged to provide sufficient new information so as to be said not to be overlapping. In this case, the judgment is made that the hypothesis is not entailed by the text. If the number of old entities is greater than or equal to the number of new entities, the judgment is made that the hypothesis is entailed by the text.

[3] A test is made of whether there is an exact match between two discourse entities, but this result is not currently used.

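The preponderance test is simple enough to sketch directly. The following is a minimal reconstruction of the procedure just described, assuming the discourse entities (noun phrases, with anaphors already replaced by their antecedents) have been extracted from the XML representations; the stop list shown is only a placeholder.

```python
STOP_WORDS = {"the", "a", "an", "of", "to", "in", "on", "for", "and", "or"}

def content_words(entity):
    """Lowercase a discourse entity and drop stop-list words."""
    return {w for w in entity.lower().split() if w not in STOP_WORDS}

def entities_overlap(hyp_entity, text_entity):
    """True if at least one content word of the hypothesis entity
    appears in the text entity."""
    return bool(content_words(hyp_entity) & content_words(text_entity))

def judge_entailment(hyp_entities, text_entities):
    """Preponderance test: hypothesis entities that match something in
    the text count as 'old', the rest as 'new'; the judgment is YES
    unless new entities outnumber old ones."""
    old = new = 0
    for h in hyp_entities:
        if any(entities_overlap(h, t) for t in text_entities):
            old += 1
        else:
            new += 1
    return "YES" if old >= new else "NO"

# Entity lists as they might come from the parsed XML of one item.
text_entities = ["Condoleezza Rice", "North Korea", "the nuclear talks"]
hyp_entities = ["North Korea", "nuclear talks"]
print(judge_entailment(hyp_entities, text_entities))  # -> YES
```
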


After selecting and making entailment judgments on the full set of instances, the interface shows the score, i.e., the number of judgments that match the entailments in the development set.[4] The full evaluation for all instances in the development set took about 10 minutes. Thus, in summary, a full run for an RTE data set of 800 items takes less than 30 minutes.

[4] The interface also shows the confidence weighted score, but as mentioned, this aspect was not further developed and all scores were set to the same value, so that this score is equal to the accuracy.

Having made the judgments and computed the accuracy, the next steps of our process involved extending the interface to permit a more detailed analysis of the results. Two major components were added to the interface: (1) the ability to look in detail at the XML representations of the texts and the hypotheses and (2) the ability to examine results for subsets of the full set.

We added a button to view details about a particular item. This displays the XML representation of the text and the hypothesis, as well as the list of discourse entities for each. The display also shows the entailment and our evaluation. It also contains a drop-down list of "problems." If our evaluation is incorrect, we can assign a reason (drawing on a growing list of problem assessments). When this display is closed, any problem that has been assigned is then listed next to the item.

To examine subsets for more in-depth analysis, we added a set of five lists of selection criteria. The lists are (1) the subtask, (2) the official entailment, (3) our evaluation, (4) our problem assessment, and (5) the main verb of the hypothesis. Any combination of these selection criteria can be made (including "All"). The set of options was then expanded to select all items meeting those criteria. The subset can then be scored by itself (e.g., to determine our score on just the QA task).

Finally, the interface was extended to permit an assessment of any changes that were made to the underlying system. Thus, given a current evaluation, and then making some change in an underlying component, we could determine changes in the evaluation (YES to NO or NO to YES) and changes to our score (CORRECT to INCORRECT or INCORRECT to CORRECT).[5]

[5] In using the interface, it has become clear that a further extension is desirable to assess the effects of different changes. This would involve the modularization of different underlying components in making different tests so that the effect of various combinations could be tested. The interface would enable the selection of different combinations.

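This change assessment is essentially a diff over two evaluation runs. A minimal sketch, assuming each run (and the gold standard) is available as a mapping from item IDs to "YES"/"NO" judgments; the function name and data layout are illustrative, not the interface code:

```python
def compare_runs(gold, before, after):
    """Tally judgment flips (YES->NO, NO->YES) and scoring flips
    (CORRECT->INCORRECT, INCORRECT->CORRECT) between two runs."""
    flips = {"YES->NO": 0, "NO->YES": 0,
             "CORRECT->INCORRECT": 0, "INCORRECT->CORRECT": 0}
    for item_id, answer in gold.items():
        b, a = before[item_id], after[item_id]
        if b == a:
            continue
        flips[f"{b}->{a}"] += 1
        if b == answer and a != answer:
            flips["CORRECT->INCORRECT"] += 1
        elif b != answer and a == answer:
            flips["INCORRECT->CORRECT"] += 1
    return flips
```
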


4 Results

Most of our efforts have been spent on examining the results of our system on the development set. We used this set to examine and consider various modifications to our system before making our official submissions. We made two official runs, with only one major change. After our official submission, we made a full run using the RTE-1 test set. Table 1 provides the summary results over all 800 test items in each of these runs. Tables 2 and 3 break down the results by subtask for the development set and for the first official run.

Table 1. Summary Results
  Run                    Accuracy
  RTE-2 Development      0.590
  RTE-2 Test (run1)      0.581
  RTE-2 Test (run2)      0.566
  RTE-1 Test             0.549

Table 2. RTE-2 Development Subtask Results
  Subtask                        Accuracy
  Information Extraction (IE)    0.550
  Information Retrieval (IR)     0.570
  Question Answering (QA)        0.560
  Summarization (SUM)            0.690

Table 3. RTE-2 Test (run1) Subtask Results
  Subtask                        Accuracy
  Information Extraction (IE)    0.500
  Information Retrieval (IR)     0.615
  Question Answering (QA)        0.520
  Summarization (SUM)            0.690

As shown in Table 1, the initial accuracy[6] achieved in the RTE-2 development set was 0.590, higher than what any full run obtained in RTE-1 (Dagan et al., 2005). This was somewhat encouraging, particularly since the results were based on making a positive decision for each item (as opposed to making a default decision based on chance except when a positive decision could be made). For run1 of the RTE-2 test set, we used the identical system and the overall results were roughly consistent. However, as shown in Tables 2 and 3, there was significant variation by subtask between the development and the test sets. We have not yet examined the reasons for these differences.

[6] Since all items were answered, accuracy is equivalent to precision.

For run2 of the RTE-2 test set, one major modification was made to the underlying system, the addition of a test for subject mismatch. The RTE-1 test set was run after our official submissions, in preparation for this paper, and after a change in our underlying XML rendition routines, which decreased the likelihood of a positive overlap assessment, thus resulting in a lower accuracy. These are discussed further in the next section.

5 Interpretation and Analysis of Results

In all runs, a considerable majority of the 800 entailment judgments were in the affirmative, as shown in Table 4. Our system is clearly erring on the side of making the judgment that the hypotheses overlap with the texts. This reflects the reliance of our method on assessing only the noun phrases in the hypotheses against those in the texts. For the most part, it is to be expected that the test items used terms in the hypotheses that appeared in the texts, perhaps modifying the way that they appeared in relation to one another. It is noteworthy that our accuracy on the positive answers was somewhat lower than for the negative answers (0.578 vs. 0.614). That is, when our method asserts that the preponderance of discourse entities in the hypotheses is toward new items, the resulting judgment that the text does not entail the hypothesis is more likely to be correct. The lower number of positive answers for RTE-1, and the lower accuracy shown in Table 1, reflect the inclusion of adverbs as discourse entities, without a modification to the overlap assessment that should have excluded these items in the test.

Table 4. Number of Positive Answers
  Run                    Positive
  RTE-2 Development      476
  RTE-2 Test (run1)      515
  RTE-2 Test (run2)      475
  RTE-1 Test             439

The observation that our system was providing more positive judgments than negative judgments is also reflected in an overall assessment of the errors. Table 5 shows the error types by subtask, e.g., YES-NO indicates a YES entailment, but our system judged a NO entailment. The differences by subtask are noteworthy, suggesting that where differences in discourse entities are likely to reflect real differences in the text (IR and SUM), the errors are generally equal, whereas in cases where differences in ordering are likely to be significant, considerably more errors were made in asserting entailment when it was not present.

Table 5. Error Types by Subtask (RTE-2 Development Set)
  Subtask    YES-NO    NO-YES
  IE         22        68
  IR         48        38
  QA         21        67
  SUM        34        28

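The grouping in Table 5 is a straightforward tally of (gold, system) disagreements per subtask. A small sketch, assuming each item is available as a (subtask, gold, judged) triple:

```python
from collections import Counter

def error_types_by_subtask(items):
    """items: iterable of (subtask, gold, judged) triples with judgments
    'YES' or 'NO'.  Keys such as ('IE', 'YES-NO') mean the gold answer
    was YES but the system judged NO."""
    errors = Counter()
    for subtask, gold, judged in items:
        if gold != judged:
            errors[(subtask, f"{gold}-{judged}")] += 1
    return errors
```
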


Having grouped the error types in this general way, we were able to focus the error analysis in ways that otherwise would not have been possible. In particular, it quickly became clear that there were significant differences in the types of errors and that different approaches were necessary. In general, YES answers require different types of analysis from NO answers. YES answers imply that there is sufficient overlap in the discourse entities, but that after this assessment, it is necessary to determine if the discourse entities bear similar syntactic and semantic relations to one another. NO answers, on the other hand, require a further analysis to determine if we have overlooked synonyms or paraphrases.

In examining YES answers which should have been NO answers, we were able to observe many cases where the difference was in the subject of a verb. That is, the subject of the verb in the hypothesis was different from the subject of the same verb in the text (even though this subject appeared somewhere in the text). We termed this a case of "subject mismatch" and implemented this test for those cases where an initial assessment was entailment. We modified the underlying code to make this test, working with one item (126, with the hypothesis, "North Korea says it will rejoin nuclear talks", where the subject of "say" in the text was "Condoleezza Rice").

After making this change on the basis of one item, we reran our evaluations for all items. This is the difference between run1 and run2. As indicated in Table 4, the effect of this change was a reduction in the number of positive answers from 515 to 475. However, as shown in Table 1, the effect on the accuracy was a decline from 0.581 to 0.566, a net decline of 12 correct answers. Of the 40 changed answers, 26 were changed from correct to incorrect. We were able to investigate these cases in detail, making a further assessment of where our system had made an incorrect change. Several problems emerged (e.g., incorrect use of a genitive as the main verb). Making slight changes would have improved our results slightly, but these were not made because of time limitations and because it seemed that they ought to be part of more general solutions.

As indicated, inclusion of the subject mismatch test (also observed in working with the development set) seemed to decrease our performance. As a result, it was not included in run1, but only in run2, with the expectation of a decline, borne out when the score was computed.

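To make the subject mismatch test concrete, the following is a hypothetical reconstruction rather than the KMS code: it assumes (verb, subject) pairs have already been extracted from the parses, and it demotes a YES judgment when a hypothesis verb occurs in the text but never with the hypothesis's subject.

```python
def subject_mismatch(hyp_clauses, text_clauses):
    """hyp_clauses / text_clauses: lists of (verb_lemma, subject) pairs
    taken from the parses.  Returns True when a hypothesis verb also
    occurs in the text but only with different subjects, as in item 126
    ('say' with subject 'North Korea' vs. 'Condoleezza Rice')."""
    for verb, hyp_subject in hyp_clauses:
        text_subjects = [s for v, s in text_clauses if v == verb]
        if text_subjects and all(s.lower() != hyp_subject.lower()
                                 for s in text_subjects):
            return True
    return False

def judge_with_mismatch(initial_judgment, hyp_clauses, text_clauses):
    """Apply the mismatch test only when the overlap test said YES."""
    if initial_judgment == "YES" and subject_mismatch(hyp_clauses, text_clauses):
        return "NO"
    return initial_judgment
```
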
6 Considerations for Future Work

Similar results to those of the subject mismatch test seemed likely when considering other possible modifications to our system, namely, that although the result for some specific items would be changed from incorrect to correct, the effect over the full set would likely be negative overall. We describe the avenues we investigated.

As mentioned earlier, we observed the need for different types of tests depending on the initial result returned by the system. We also observed that the subtask is significant, and perhaps has a bearing on some of the test items. The main concern is the plausibility of a hypothesis and how this might be encountered in a real system.

For summarization, the hypotheses appear to be valid sentences retrieved from other documents, overlapping to some extent. This task appears to be well drawn. In this subtask, the key is the recognition of novel elements (similar to the novelty task in TREC in 2003 and 2004). For our system, this would mean a more detailed analysis of the novel elements in a hypothesis. The information retrieval task appears to be somewhat similar, in that the hypotheses might be drawn from real texts.

For question answering, the task appears to be less well drawn. Many of the non-entailed hypotheses are unlikely to appear in real texts. We believe this is reflected in the many NO-YES errors that appeared in our results. A similar situation occurs for the information extraction task, where it is unlikely that non-entailed hypotheses would be found in real text, since they are essentially counterfactual. (The non-entailed hypotheses in item 209, "Biodiesel produces the 'Beetle'", and item 226, "Seasonal Affective Disorder (SAD) is a worldwide disorder", are unlikely to occur in real text.)

We reviewed material from RTE-1 (Dagan et al., 2005) to identify areas for exploration. As indicated, this led to our major approach of using overlap analysis as employed in KMS' summarization routines. We considered the potential for various forms of syntactic analysis, particularly as described in Vanderwende et al. (2005), since many of these constructs are similar to what KMS employs in its question answering routines (i.e., appositives, copular constructions, predicate arguments, and active-passive alternations). However, when we examined instances where these occurred in the development set and were relevant to the determination of entailment, we found that, similar to the subject mismatch test, many answers were already correctly assessed and that it was likely that implementation of general routines would lead to an overall decline in performance.

We next turned to questions of synonymy and paraphrasing. We considered the use of WordNet on several levels. First, we observed cases where some general categorization schema (to capture classes of words, e.g., the tops in WordNet) would provide useful results. Similarly, the use of synsets appeared as if it would provide some benefit, but these did not appear to provide sufficiently broad categories. The addition of derivational links in WordNet is also too sporadic to use systematically.

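The kinds of synset and coarse-category checks we considered can be illustrated with NLTK's WordNet interface; NLTK is used here purely for illustration (KMS accesses WordNet through its own DIMAP dictionaries), and the example words are arbitrary.

```python
# Requires NLTK with the WordNet corpus installed: nltk.download("wordnet")
from nltk.corpus import wordnet as wn

def share_synset(word1, word2):
    """Synonymy check: true when the two words occur in a common synset."""
    return bool(set(wn.synsets(word1)) & set(wn.synsets(word2)))

def top_categories(word):
    """Coarse 'tops'-style categories: root hypernyms of the noun senses,
    usable for grouping words into broad classes."""
    return {root.name()
            for syn in wn.synsets(word, pos=wn.NOUN)
            for root in syn.root_hypernyms()}

print(share_synset("rejoin", "return"))
print(top_categories("disorder"))
```
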


Limitations in WordNet led to consideration of using a broader category analysis provided in a Roget-style thesaurus. Such a thesaurus would provide a grouping to recognize more instances of synonymy as well as nominalizations (such as direct and director). However, we were not able to implement use of the thesaurus at this time.

Finally, we considered the use of FrameNet. Although it does not provide a considerable range of lexical items, it may have some information that could be employed effectively in entailment analysis. In The Preposition Project (Litkowski & Hargraves, 2005), syntactic alternation patterns for particular frame elements are being identified. Thus, for example, the verb work has a frame element Employer associated with the preposition for (there are 19 instances where work is the main verb of the hypothesis in the development set). The Preposition Project identifies other lexical items and syntactic patterns that instantiate the Employer frame element. Thus, we find that an Employer can be found as a noun modifier (ECB spokesman, Bank of Italy governor), the object of the preposition by following the verb employed (employed by the United States), the object of the preposition of (of FEMA), and the object of the preposition at (press official at the United States embassy). These patterns can be used to look in the text associated with the hypotheses to determine whether the particular pattern is present. However, once again, implementation of such strategies will not necessarily result in an overall net improvement. Thus, for example, our score for the cases involving work was 0.526 without the inclusion of such tests. Further exploration of the possible use of FrameNet seems worthwhile.

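How such alternation patterns might be checked against a text can be sketched as follows. The pattern inventory is taken from the Employer examples above, but the regular expressions are purely illustrative surface approximations; an actual test would operate over the parses and the alternation data from The Preposition Project.

```python
import re

NAME = r"[A-Z][\w'&-]*(?:\s+[A-Z][\w'&-]*)*"  # crude proper-name matcher

# Surface approximations of Employer alternations: "work for X",
# "employed by X", "official at X", "of FEMA"-style acronyms, and
# noun-modifier uses such as "ECB spokesman".
EMPLOYER_PATTERNS = [
    rf"\bwork(?:s|ed|ing)?\s+for\s+(?:the\s+)?(?P<employer>{NAME})",
    rf"\bemployed\s+by\s+(?:the\s+)?(?P<employer>{NAME})",
    rf"\bofficial\s+at\s+(?:the\s+)?(?P<employer>{NAME})",
    rf"\bof\s+(?P<employer>[A-Z]{{2,}})\b",
    rf"\b(?P<employer>{NAME})\s+(?:spokesman|governor)\b",
]

def find_employers(text):
    """Return candidate Employer fillers instantiated in the text."""
    fillers = []
    for pattern in EMPLOYER_PATTERNS:
        for match in re.finditer(pattern, text):
            fillers.append(match.group("employer"))
    return fillers

print(find_employers("He is employed by the United States."))  # ['United States']
```

A pattern hit in the text could then be compared against the Employer asserted in the hypothesis, along the lines described above.
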
7 Conclusions

The PASCAL Challenge for Recognizing Textual Entailment has provided a useful mechanism for examining basic strategies involved in assessing equivalence in meaning. While basic overlap analysis as used in summarization has proved worthwhile, RTE has revealed some shortcomings in its use and suggests some possible avenues for making improvements. Attempts to make use of lexical resources for making assessments of synonymy and paraphrase have suggested that WordNet does not provide adequate levels of granularity. Possible improvements might be gained from a Roget-style thesaurus and syntactic alternation patterns derived from FrameNet. In general, it appears that an integrated approach is needed to provide consistent improvements.

References

Ido Dagan, Bernardo Magnini, and Oren Glickman. 2005. The PASCAL Recognizing Textual Entailment Challenge. In PASCAL: Proceedings of the First Challenge Workshop on Recognizing Textual Entailment, Southampton, U.K., pp. 1-8.

Kenneth C. Litkowski. 2005a. Evolving XML and Dictionary Strategies for Question Answering and Novelty Tasks. In E. M. Voorhees & L. P. Buckland (eds.), The Twelfth Text REtrieval Conference (TREC 2004). NIST Special Publication 500-261, Gaithersburg, MD, TREC 2004 Proceedings CD.

Kenneth C. Litkowski. 2005b. Evolving XML Summarization Strategies in DUC 2005. Available at http://duc.nist.gov/pubs.html.

Kenneth C. Litkowski. 2006. Exploring Document Content with XML. In E. Voorhees and L. Buckland (eds.), Proceedings of the Thirteenth Text REtrieval Conference, TREC 2004. NIST Special Publication 500-261, Gaithersburg, MD, pp. 52-62.

Kenneth C. Litkowski and Orin Hargraves. 2005. The Preposition Project. ACL-SIGSEM Workshop on "The Linguistic Dimensions of Prepositions and their Use in Computational Linguistic Formalisms and Applications", University of Essex, Colchester, United Kingdom, pp. 171-179.

Lucy Vanderwende, Deborah Coughlin, and Bill Dolan. 2005. What Syntax Can Contribute in the Entailment Task. In PASCAL: Proceedings of the First Challenge Workshop on Recognizing Textual Entailment, Southampton, U.K., pp. 13-16.
