Componential Analysis for Recognizing Textual Entailment

Having grouped the error types in this general way, we were able to focus the error analysis in ways that otherwise would not have been possible. In particular, it quickly became clear that there were significant differences among the types of errors and that different approaches were necessary. In general, YES answers require different types of analysis from NO answers. A YES answer implies that there is sufficient overlap in the discourse entities; after this assessment, it is necessary to determine whether the discourse entities bear similar syntactic and semantic relations to one another. A NO answer, on the other hand, requires further analysis to determine whether we have overlooked synonyms or paraphrases.

In examining YES answers which should have been NO answers, we observed many cases where the difference lay in the subject of a verb. That is, the subject of the verb in the hypothesis differed from the subject of the same verb in the text (even though this subject appeared somewhere in the text). We termed this a case of "subject mismatch" and implemented a test for it in those cases where the initial assessment was entailment. We modified the underlying code to make this test, working from one item (126, with the hypothesis "North Korea says it will rejoin nuclear talks", where the subject of "say" in the text was "Condoleezza Rice").
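To make the mechanics of such a test concrete, the following is a minimal sketch of a subject-mismatch check. The paper does not show the system's code, so the use of spaCy dependency parses and the function names (subjects_of, subject_mismatch) are illustrative assumptions, not the implementation actually used for run2.

```python
# A minimal sketch of a subject-mismatch test of the kind described
# above (the system's actual code is not shown in the paper). It
# assumes dependency parses from spaCy; the function names are
# illustrative only.
import spacy

nlp = spacy.load("en_core_web_sm")

def subjects_of(doc):
    """Map each verb lemma to the lemmas of its grammatical subjects."""
    subjects = {}
    for tok in doc:
        if tok.pos_ == "VERB":
            subs = {c.lemma_ for c in tok.children
                    if c.dep_ in ("nsubj", "nsubjpass")}
            if subs:
                subjects.setdefault(tok.lemma_, set()).update(subs)
    return subjects

def subject_mismatch(text, hypothesis):
    """True when a verb shared by text and hypothesis takes disjoint
    subjects in the two, in which case an initial YES assessment
    would be demoted to NO."""
    t_subjects = subjects_of(nlp(text))
    h_subjects = subjects_of(nlp(hypothesis))
    return any(verb in t_subjects and not subs & t_subjects[verb]
               for verb, subs in h_subjects.items())
```

On item 126, for instance, the verb "say" is shared by text and hypothesis but takes different subjects, so a check along these lines would flag the pair.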
After making this change on the basis of one item, we reran our evaluations for all items. This is the difference between run1 and run2. As indicated in Table 5, the effect of the change was a reduction in the number of positive answers from 515 to 475. However, as shown in Table 1, accuracy declined from 0.581 to 0.566. Of the 40 changed answers, 26 went from correct to incorrect and 14 from incorrect to correct, a net loss of 12 correct answers. We investigated these cases in detail, making a further assessment of where our system had made an incorrect change. Several problems emerged (e.g., incorrect use of a genitive as the main verb). Slight changes would have improved our results, but they were not made because of time limitations and because it seemed that they ought to be part of more general solutions.

As indicated, inclusion of the subject-mismatch test (a need also observed in working with the development set) seemed to decrease our performance. As a result, it was not included in run1, but only in run2, with the expectation of a decline, borne out when the score was computed.

6 Considerations for Future Work

Results similar to those of the subject-mismatch test seemed likely for other possible modifications to our system: although the results for some specific items would change from incorrect to correct, the effect over the full set would likely be negative. We describe below the avenues we investigated.

As mentioned earlier, we observed the need for different types of tests depending on the initial result returned by the system. We also observed that the subtask is significant and perhaps has a bearing on some of the test items. The main concern is the plausibility of a hypothesis and how it might be encountered in a real system.

For summarization, the hypotheses appear to be valid sentences retrieved from other documents, overlapping to some extent. This task appears to be well drawn. In this subtask, the key is the recognition of novel elements (similar to the novelty task in TREC in 2003 and 2004). For our system, this would mean a more detailed analysis of the novel elements in a hypothesis. The information retrieval task appears to be somewhat similar, in that the hypotheses might be drawn from real texts.

For question answering, the task appears to be less well drawn. Many of the non-entailed hypotheses are unlikely to appear in real texts. We believe this is reflected in the many NO-YES errors that appeared in our results. A similar situation occurs for the information extraction task, where it is unlikely that non-entailed hypotheses would be found in real text, since they are essentially counterfactual. (The non-entailed hypotheses in item 209, "Biodiesel produces the 'Beetle'", and item 226, "Seasonal Affective Disorder (SAD) is a worldwide disorder", are unlikely to occur in real text.)

We reviewed material from RTE-1 (Dagan et al.) to identify areas for exploration. As indicated, this led to our major approach of using overlap analysis as employed in KMS' summarization routines. We considered the potential for various forms of syntactic analysis, particularly as described in Vanderwende et al. (2005), since many of these constructs are similar to what KMS employs in its question answering routines (i.e., appositives, copular constructions, predicate arguments, and active-passive alternations). However, when we examined instances where these occurred in the development set and were relevant
