Componential Analysis for Recognizing Textual Entailment

entities in the texts (with all anaphors and coreferents replaced by their antecedents). As used in KMS, discourse entities are essentially noun phrases, including gerundial phrases. Since the overlap analysis is based only on discourse entities, other sentence components, specifically verbs and prepositions, are not considered. And, while discourse entities are further analyzed into lexical components (i.e., nouns, adjectives, adverbs, and conjunctions), the overlap analysis does not make use of these distinctions. Since our summarization performance in DUC has proved adequate without consideration of other leaf nodes, we have not attempted to develop overlap metrics which take them into account, nor have we assessed whether they are important.

Each discourse entity in the hypothesis is compared to the full set of discourse entities in the texts, one by one. In an individual comparison, both discourse entities are lowercased and then split into constituent words. Words on a stop list are ignored. If at least one word in a discourse entity from the hypothesis is contained in a discourse entity from the text, the test returns true.[3] If a match does not occur, a counter of "new" discourse entities is incremented; if a match does occur, a counter of "old" discourse entities is incremented. When all discourse entities from the hypothesis have been tested, the number of new discourse entities is compared to the number of old discourse entities. If there are more new entities than old entities, a sentence (in this case, the hypothesis) is judged to provide sufficient new information so as to be said not to be overlapping. In this case, the judgment is made that the hypothesis is not entailed by the text. If the preponderance of old entities is greater than or equal to the number of new entities, the judgment is made that the hypothesis is entailed by the text.

After selecting and making entailment judgments on the full set of instances, the interface shows the score, i.e., the number of judgments that match the entailments in the development set.[4] The full evaluation for all instances in the development set took about 10 minutes. Thus, in summary, a full run for an RTE data set of 800 items takes less than 30 minutes.

[3] A test is made of whether there is an exact match between two discourse entities, but this result is not currently used.

[4] The interface also shows the confidence-weighted score, but as mentioned, this aspect was not further developed and all scores were set to the same value, so that this score is equal to the accuracy.
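The decision procedure just described can be summarized in a short sketch. The code below is an illustrative reconstruction, not the KMS implementation; the stop list and the treatment of discourse entities as plain strings are simplifying assumptions.

```python
# Illustrative sketch of the discourse-entity overlap test described above.
# Not the KMS implementation: entity extraction and anaphora resolution are
# assumed to have been done elsewhere, and the stop list is a placeholder.

STOP_WORDS = {"the", "a", "an", "of", "to", "in"}  # assumed stop list

def content_words(entity: str) -> set[str]:
    """Lowercase an entity, split it into words, and drop stop-list words."""
    return {w for w in entity.lower().split() if w not in STOP_WORDS}

def overlaps(hyp_entity: str, text_entities: list[str]) -> bool:
    """True if at least one word of the hypothesis entity occurs in some text entity."""
    hyp_words = content_words(hyp_entity)
    return any(hyp_words & content_words(te) for te in text_entities)

def is_entailed(hyp_entities: list[str], text_entities: list[str]) -> bool:
    """Count "old" (overlapping) vs. "new" hypothesis entities; entailed unless new > old."""
    old = sum(1 for he in hyp_entities if overlaps(he, text_entities))
    new = len(hyp_entities) - old
    return new <= old
```

As in the description above, a tie between new and old entities is judged entailed.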
Having made the judgments and computed the accuracy, the next steps of our process involved extending the interface to permit a more detailed analysis of the results. Two major components were added to the interface: (1) the ability to look in detail at the XML representations of the texts and the hypotheses and (2) the ability to examine results for subsets of the full set.

We added a button to view details about a particular item. This displays the XML representation of the text and the hypothesis, as well as the list of discourse entities for each. The display also shows the entailment and our evaluation. It also contains a drop-down list of "problems." If our evaluation is incorrect, we can assign a reason (and use a growing list of problem assessments). When this display is closed, any problem that has been assigned is then listed next to the item.

To examine subsets for more in-depth analysis, we added a set of five lists of selection criteria. The lists are (1) the subtask, (2) the official entailment, (3) our evaluation, (4) our problem assessment, and (5) the main verb of the hypothesis. Any combination of these selection criteria can be made (including "All"). The set of options was then expanded to select all items meeting those criteria. The subset can then be scored by itself (e.g., to determine our score on just the QA task).

Finally, the interface was extended to permit an assessment of any changes that were made to the underlying system. Thus, given a current evaluation, and then making some change in an underlying component, we could determine changes in the evaluation (YES to NO or NO to YES) and changes to our score (CORRECT to INCORRECT or INCORRECT to CORRECT).[5]

[5] In using the interface, it has become clear that a further extension is desirable to assess the effects of different changes. This would involve the modularization of different underlying components in making different tests so that the effect of various combinations could be tested. The interface would enable the selection of different combinations.
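The subset scoring and the before-and-after comparison just described can be sketched as follows. This is an illustrative reconstruction, not the actual interface code; the item fields (task, gold, judgment) and the representation of a run as a mapping from item ids to YES/NO judgments are assumptions.

```python
# Illustrative sketch of subset scoring and run comparison; not the actual
# interface code. Each item is assumed to be a dict with hypothetical fields
# such as "task", "gold" (official entailment), and "judgment" (our evaluation).

def score_subset(items, **criteria):
    """Accuracy over the items matching all given criteria (e.g., task="QA")."""
    subset = [it for it in items if all(it.get(k) == v for k, v in criteria.items())]
    correct = sum(1 for it in subset if it["judgment"] == it["gold"])
    return correct / len(subset) if subset else 0.0

def compare_runs(before, after, gold):
    """Report judgment flips (YES<->NO) and the resulting score flips."""
    changes = []
    for item_id, old in before.items():
        new = after[item_id]
        if new != old:
            old_ok = "CORRECT" if old == gold[item_id] else "INCORRECT"
            new_ok = "CORRECT" if new == gold[item_id] else "INCORRECT"
            changes.append((item_id, f"{old}->{new}", f"{old_ok}->{new_ok}"))
    return changes

# Example: one item flips from NO to YES and thereby from INCORRECT to CORRECT.
print(compare_runs({"42": "NO"}, {"42": "YES"}, {"42": "YES"}))
```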
4 Results

Most of our efforts have been spent on examining the results of our system on the development set. We used this set to examine and consider various modifications to our system before making our