

Table 7
Performance of SO-CAL in other domains (percent correct by corpus).

SO-CAL options                                   MPQA   MySpace  News   Blogs  Headlines  Headlines (coarse)
SO-CAL (back-off)                                73.64  81.93    71.57  75.31  62.17      74.63
Baseline (most frequent class)                   64.64  71.08    50.00  50.00  52.92      62.20
SO-CAL (excluding empty)                         79.38  78.69    77.76  82.33  79.83      88.98
Baseline (most frequent class, excluding empty)  66.94  69.93    51.10  50.00  59.87      67.37
% SO-empty                                       28.61  21.12    25.25  21.70  54.00      43.98

The second method used in evaluating SO-CAL on SO-empty texts is to classify only those texts for which it has direct evidence to make a judgment. Thus, we exclude such SO-empty texts from the evaluation. The second part of Table 7 shows the results of this evaluation. The results are strikingly similar to the performance we saw on full review texts, with most corpora attaining at least 75–80% accuracy. Although missing vocabulary (domain-specific or otherwise) undoubtedly plays a role, the results provide strong evidence that relative text size is the primary cause of SO-empty texts in these data sets. When the SO-empty texts are removed, the results are entirely comparable to those that we saw in the previous section. Although sentence-level polarity detection is a more difficult task, and not one that SO-CAL was specifically designed for, the system has performed well on this task, here and in related work (Murray et al. 2008; Brooke and Hurst 2009).
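To make the two evaluation settings behind Table 7 concrete, the following is a minimal sketch (hypothetical code, not SO-CAL's actual implementation) of how per-corpus accuracy can be computed when SO-empty texts are backed off to a default class versus excluded entirely, together with the most-frequent-class baseline. The input format, the label names, and the choice of back-off class are assumptions made purely for illustration.

```python
from collections import Counter

# Sketch of the two Table 7 evaluation settings (illustrative assumptions only).
# `texts` is a list of (so_value, gold_label) pairs, with so_value set to None
# for SO-empty texts, i.e., texts containing no dictionary words.

def most_frequent_class_baseline(texts):
    """Accuracy of always predicting the most frequent gold label."""
    gold = [label for _, label in texts]
    return 100.0 * Counter(gold).most_common(1)[0][1] / len(gold)

def evaluate(texts, backoff_label="positive"):
    """Accuracy with back-off, accuracy excluding SO-empty texts,
    and the proportion of SO-empty texts."""
    def predict(so):
        return "positive" if so > 0 else "negative"

    # First setting: SO-empty texts are assigned a default (back-off) label.
    backoff_correct = sum(
        (predict(so) if so is not None else backoff_label) == gold
        for so, gold in texts
    )

    # Second setting: only texts with direct lexical evidence are classified.
    scored = [(so, gold) for so, gold in texts if so is not None]
    excluding_correct = sum(predict(so) == gold for so, gold in scored)

    return {
        "backoff_accuracy": 100.0 * backoff_correct / len(texts),
        "excluding_empty_accuracy":
            100.0 * excluding_correct / len(scored) if scored else None,
        "percent_so_empty": 100.0 * (len(texts) - len(scored)) / len(texts),
    }
```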

3. Validating the Dictionaries

To a certain degree, acceptable performance across a variety of data sets, and, in particular, improved performance when the full granularity of the dictionaries is used (see Table 5), provides evidence for the validity of SO-CAL's dictionary rankings. Recall also that the individual word ratings provided by a single researcher were reviewed by a larger committee, mitigating some of the subjectivity involved. Nevertheless, some independent measure of how well the dictionary rankings correspond to the intuitions of English speakers would be valuable, particularly if we wish to compare our dictionaries with automatically generated ones.

The most straightforward way to investigate this problem would be to ask one or more annotators to re-rank our dictionaries, and compute the inter-annotator agreement. However, besides the difficulty and time-consuming nature of the task, any simple metric derived from such a process would provide information that was useful only in the context of the absolute values of our −5 to +5 scale. For instance, if our re-ranker is often more conservative than our original rankings (ranking most SO 5 words as 4, SO 4 words as 3, etc.), the absolute agreement might approach zero, but we would like to be able to claim that the rankings actually show a great deal of consistency given
