Componential Analysis for Recognizing Textual Entailment


official submissions. We made two official runs, with only one major change between them. After our official submission, we made a full run using the RTE-1 test set. Table 1 provides the summary results over all 800 test items in each of these runs. Tables 2 and 3 break down the results by subtask for the development set and for the first official run.

Table 1. Summary Results

  Run                  Accuracy
  RTE-2 Development    0.590
  RTE-2 Test (run1)    0.581
  RTE-2 Test (run2)    0.566
  RTE-1 Test           0.549

Table 2. RTE-2 Development Subtask Results

  Subtask                        Accuracy
  Information Extraction (IE)    0.550
  Information Retrieval (IR)     0.570
  Question Answering (QA)        0.560
  Summarization (SUM)            0.690

Table 3. RTE-2 Test (run1) Subtask Results

  Subtask                        Accuracy
  Information Extraction (IE)    0.500
  Information Retrieval (IR)     0.615
  Question Answering (QA)        0.520
  Summarization (SUM)            0.690

As shown in Table 1, the initial accuracy[6] achieved on the RTE-2 development set was 0.590, higher than what any full run obtained in RTE-1 (Dagan et al., 2005). This was somewhat encouraging, particularly since the results were based on making a positive decision for each item (as opposed to making a default decision based on chance except when a positive decision could be made). For run1 of the RTE-2 test set, we used the identical system, and the overall results were roughly consistent. However, as shown in Tables 2 and 3, there was significant variation by subtask between the development and the test sets. We have not yet examined the reasons for these differences. For run2 of the RTE-2 test set, one major modification was made to the underlying system: the addition of a test for subject mismatch.
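The paper does not describe how the subject-mismatch test was implemented. As a rough sketch of the idea (all function and variable names here are hypothetical, and head-word overlap is an assumed matching criterion), such a test might compare the grammatical subject of the hypothesis against the subjects found in the text:

```python
# Hedged sketch of a subject-mismatch test; the authors' actual
# implementation is not specified in this section.

def normalize(phrase):
    """Lowercase and strip articles so 'The court' matches 'court'."""
    words = phrase.lower().split()
    return " ".join(w for w in words if w not in {"a", "an", "the"})

def subject_mismatch(text_subjects, hypothesis_subject):
    """Return True when the hypothesis subject matches no text subject.

    `text_subjects` is assumed to come from an upstream parser;
    shared head words serve as a crude matching criterion.
    """
    hyp = set(normalize(hypothesis_subject).split())
    for subj in text_subjects:
        if hyp & set(normalize(subj).split()):
            return False  # shared word -> no mismatch
    return True           # no overlap with any text subject -> mismatch

# Example with subjects assumed to be extracted by a parser:
print(subject_mismatch(["The prosecutors", "the court"], "The judge"))  # True
print(subject_mismatch(["The prosecutors", "the court"], "The court"))  # False
```

A mismatch detected this way could then override an otherwise positive overlap judgment, which is consistent with run2 producing fewer positive answers than run1 (Table 4).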
The RTE-1 test set was run after our official submissions, in preparation for this paper, and after a change in our underlying XML rendition routines. This change decreased the likelihood of a positive overlap assessment, resulting in a lower accuracy. These runs are discussed further in the next section.

[6] Since all items were answered, accuracy is equivalent to precision.

5 Interpretation and Analysis of Results

In all runs, a considerable majority of the 800 entailment judgments were in the affirmative, as shown in Table 4. Our system is clearly erring on the side of judging that the hypotheses overlap with the texts. This reflects the reliance of our method on assessing only the noun phrases in the hypotheses against those in the texts. For the most part, it is to be expected that the test items used terms in the hypotheses that appeared in the texts, perhaps modifying the way they appeared in relation to one another. It is noteworthy that our accuracy on the positive answers was somewhat lower than on the negative answers (0.578 vs. 0.614). That is, when our method asserts that the preponderance of discourse entities in the hypotheses is toward new items, our system is more likely to judge that the text does not entail the hypothesis. The lower number of positive answers for RTE-1, and the lower accuracy shown in Table 1, reflect the inclusion of adverbs as discourse entities without a modification to the overlap assessment that should have excluded these items in the test.

Table 4. Number of Positive Answers

  Run                  Positive
  RTE-2 Development    476
  RTE-2 Test (run1)    515
  RTE-2 Test (run2)    475
  RTE-1 Test           439

The observation that our system was providing more positive judgments than negative judgments is also reflected in an overall assessment of the errors. Table 5 shows the error types by subtask; e.g., YES-NO indicates a YES gold entailment for which our system judged NO.
The differences by subtask are noteworthy. They suggest that where differences in discourse entities are likely to reflect real differences in the text (IR and SUM), the errors are roughly balanced, whereas in cases where differences in ordering are likely to be significant (IE and QA), considerably more errors were made in asserting entailment when it was not present.

Table 5. Error Types by Subtask (RTE-2 Development Set)

  Subtask   YES-NO   NO-YES
  IE          22       68
  IR          48       38
  QA          21       67
  SUM         34       28
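The overlap-based judgment discussed above can be sketched in code. This is a minimal illustration rather than the authors' actual system: the entity extraction, the decision threshold, and all names are assumptions.

```python
# Sketch of a noun-phrase-overlap entailment heuristic: judge YES by
# default, and judge NO only when the preponderance of discourse
# entities in the hypothesis is new (absent from the text).

def judge_entailment(text_entities, hypothesis_entities, new_threshold=0.5):
    """Return 'YES' unless too many hypothesis entities are 'new'.

    An entity in the hypothesis counts as 'new' when it does not
    appear among the text's discourse entities. `new_threshold` is
    an assumed parameter, not a value from the paper.
    """
    text_set = {e.lower() for e in text_entities}
    hyp = [e.lower() for e in hypothesis_entities]
    if not hyp:
        return "YES"  # nothing to contradict the default positive bias
    new = sum(1 for e in hyp if e not in text_set)
    return "NO" if new / len(hyp) > new_threshold else "YES"

# Mostly shared entities -> YES; mostly new entities -> NO.
print(judge_entailment(["court", "ruling", "appeal"], ["court", "ruling"]))  # YES
print(judge_entailment(["court", "ruling"], ["senate", "budget", "vote"]))   # NO
```

A heuristic of this shape answers YES unless the evidence against overlap is strong, which matches the positive bias visible in Table 4 and the asymmetric NO-YES error counts for IE and QA in Table 5.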
