Componential Analysis for Recognizing Textual Entailment

entities in the texts (with all anaphors and coreferents replaced by their antecedents). As used in KMS, discourse entities are essentially noun phrases, including gerundial phrases. Since the overlap analysis is based only on discourse entities, other sentence components, specifically verbs and prepositions, are not considered. And, while discourse entities are further analyzed into lexical components (i.e., nouns, adjectives, adverbs, and conjunctions), the overlap analysis does not make use of these distinctions. Since our summarization performance in DUC has proved adequate without consideration of other leaf nodes, we have not attempted to develop overlap metrics which take them into account, nor have we assessed whether they are important.

Each discourse entity in the hypothesis is compared to the full set of discourse entities in the texts, one by one. In an individual comparison, both discourse entities are lowercased and then split into constituent words. Words on a stop list are ignored. If at least one word in a discourse entity from the hypothesis is contained in a discourse entity from the text, the test returns true.[3] If a match does not occur, a counter of "new" discourse entities is incremented; if a match does occur, a counter of "old" discourse entities is incremented. When all discourse entities from the hypothesis have been tested, the number of new discourse entities is compared to the number of old discourse entities. If there are more new entities than old entities, a sentence (in this case, the hypothesis) is judged to provide sufficient new information so as to be said not to be overlapping. In this case, the judgment is made that the hypothesis is not entailed by the text. If the preponderance of old entities is greater than or equal to the number of new entities, the judgment is made that the hypothesis is entailed by the text.

After selecting and making entailment judgments on the full set of instances, the interface shows the score, i.e., the number of judgments that match the entailments in the development set.[4] The full evaluation for all instances in the development set took about 10 minutes. Thus, in summary, a full run for an RTE data set of 800 items takes less than 30 minutes.

[3] A test is made of whether there is an exact match between two discourse entities, but this result is not currently used.

[4] The interface also shows the confidence-weighted score, but as mentioned, this aspect was not further developed and all scores were set to the same value, so that this score is equal to the accuracy.
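The decision procedure just described can be summarized in a short sketch. The code below is an illustrative reconstruction, not the KMS implementation; the stop list and the treatment of discourse entities as plain strings are simplifying assumptions.

```python
# Illustrative sketch of the discourse-entity overlap test described above.
# Not the KMS implementation: entity extraction and anaphora resolution are
# assumed to have been done elsewhere, and the stop list is a placeholder.

STOP_WORDS = {"the", "a", "an", "of", "to", "in"}  # assumed stop list

def content_words(entity: str) -> set[str]:
    """Lowercase an entity, split it into words, and drop stop-list words."""
    return {w for w in entity.lower().split() if w not in STOP_WORDS}

def overlaps(hyp_entity: str, text_entities: list[str]) -> bool:
    """True if at least one word of the hypothesis entity occurs in some text entity."""
    hyp_words = content_words(hyp_entity)
    return any(hyp_words & content_words(te) for te in text_entities)

def is_entailed(hyp_entities: list[str], text_entities: list[str]) -> bool:
    """Count "old" (overlapping) vs. "new" hypothesis entities; entailed unless new > old."""
    old = sum(1 for he in hyp_entities if overlaps(he, text_entities))
    new = len(hyp_entities) - old
    return new <= old
```

As in the description above, a tie between new and old entities is judged entailed.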
Having made the judgments and computed the accuracy, the next steps of our process involved extending the interface to permit a more detailed analysis of the results. Two major components were added to the interface: (1) the ability to look in detail at the XML representations of the texts and the hypotheses and (2) the ability to examine results for subsets of the full set.

We added a button to view details about a particular item. This displays the XML representation of the text and the hypothesis, as well as the list of discourse entities for each. The display also shows the entailment and our evaluation. It also contains a drop-down list of "problems." If our evaluation is incorrect, we can assign a reason (and use a growing list of problem assessments). When this display is closed, any problem that has been assigned is then listed next to the item.

To examine subsets for more in-depth analysis, we added a set of five lists of selection criteria. The lists are (1) the subtask, (2) the official entailment, (3) our evaluation, (4) our problem assessment, and (5) the main verb of the hypothesis. Any combination of these selection criteria can be made (including "All"). The set of options was then expanded to select all items meeting those criteria. The subset can then be scored by itself (e.g., to determine our score on just the QA task).

Finally, the interface was extended to permit an assessment of any changes that were made to the underlying system. Thus, given a current evaluation, and then making some change in an underlying component, we could determine changes in the evaluation (YES to NO or NO to YES) and changes to our score (CORRECT to INCORRECT or INCORRECT to CORRECT).[5]

[5] In using the interface, it has become clear that a further extension is desirable to assess the effects of different changes. This would involve the modularization of different underlying components in making different tests so that the effect of various combinations could be tested. The interface would enable the selection of different combinations.
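The subset scoring and the before-and-after comparison just described can be sketched as follows. This is an illustrative reconstruction, not the actual interface code; the item fields (task, gold, judgment) and the representation of a run as a mapping from item ids to YES/NO judgments are assumptions.

```python
# Illustrative sketch of subset scoring and run comparison; not the actual
# interface code. Each item is assumed to be a dict with hypothetical fields
# such as "task", "gold" (official entailment), and "judgment" (our evaluation).

def score_subset(items, **criteria):
    """Accuracy over the items matching all given criteria (e.g., task="QA")."""
    subset = [it for it in items if all(it.get(k) == v for k, v in criteria.items())]
    correct = sum(1 for it in subset if it["judgment"] == it["gold"])
    return correct / len(subset) if subset else 0.0

def compare_runs(before, after, gold):
    """Report judgment flips (YES<->NO) and the resulting score flips."""
    changes = []
    for item_id, old in before.items():
        new = after[item_id]
        if new != old:
            old_ok = "CORRECT" if old == gold[item_id] else "INCORRECT"
            new_ok = "CORRECT" if new == gold[item_id] else "INCORRECT"
            changes.append((item_id, f"{old}->{new}", f"{old_ok}->{new_ok}"))
    return changes

# Example: one item flips from NO to YES and thereby from INCORRECT to CORRECT.
print(compare_runs({"42": "NO"}, {"42": "YES"}, {"42": "YES"}))
```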
4 Results

Most of our efforts have been spent on examining the results of our system on the development set. We used this set to examine and consider various modifications to our system before making our