Lexicon-Based Methods for Sentiment Analysis - Simon Fraser ...
Lexicon-Based Methods for Sentiment Analysis - Simon Fraser ...
Lexicon-Based Methods for Sentiment Analysis - Simon Fraser ...
Create successful ePaper yourself
Turn your PDF publications into a flip-book with our unique Google optimized e-Paper software.
Computational Linguistics Volume 37, Number 2<br />
Table 7<br />
Per<strong>for</strong>mance of SO-CAL in other domains.<br />
SO-CAL options Percent correct by corpus<br />
MPQA MySpace News Blogs Headlines Headlines (coarse)<br />
SO-CAL (back-off) 73.64 81.93 71.57 75.31 62.17 74.63<br />
Baseline<br />
(most frequent class)<br />
64.64 71.08 50.00 50.00 52.92 62.20<br />
SO-CAL<br />
(excluding empty)<br />
Baseline<br />
(mostfrequentclass,<br />
excluding empty)<br />
79.38 78.69 77.76 82.33 79.83 88.98<br />
66.94 69.93 51.10 50.00 59.87 67.37<br />
% SO-empty 28.61 21.12 25.25 21.70 54.00 43.98<br />
The second method used in evaluating SO-CAL on SO-empty texts is to only classify<br />
texts <strong>for</strong> which it has direct evidence to make a judgment. Thus, we exclude such<br />
SO-empty texts from the evaluation. The second part of Table 7 shows the results of<br />
this evaluation. The results are strikingly similar to the per<strong>for</strong>mance we saw on full<br />
review texts, with most attaining a minimum of 75–80% accuracy. Although missing<br />
vocabulary (domain-specific or otherwise) undoubtedly plays a role, the results provide<br />
strong evidence that relative text size is the primary cause of SO-empty texts in these<br />
data sets. When the SO-empty texts are removed, the results are entirely comparable to<br />
those that we saw in the previous section. Although sentence-level polarity detection is<br />
a more difficult task, and not one that SO-CAL was specifically designed <strong>for</strong>, the system<br />
has per<strong>for</strong>med well on this task, here, and in related work (Murray et al. 2008; Brooke<br />
and Hurst 2009).<br />
3. Validating the Dictionaries<br />
To a certain degree, acceptable per<strong>for</strong>mance across a variety of data sets, and, in particular,<br />
improved per<strong>for</strong>mance when the full granularity of the dictionaries is used (see<br />
Table 5), provides evidence <strong>for</strong> the validity of SO-CAL’s dictionary rankings. Recall also<br />
that the individual word ratings provided by a single researcher were reviewed by a<br />
larger committee, mitigating some of the subjectivity involved. Nevertheless, some independent<br />
measure of how well the dictionary rankings correspond to the intuitions of<br />
English speakers would be valuable, particularly if we wish to compare our dictionaries<br />
with automatically generated ones.<br />
The most straight<strong>for</strong>ward way to investigate this problem would be to ask one or<br />
more annotators to re-rank our dictionaries, and compute the inter-annotator agreement.<br />
However, besides the difficulty and time-consuming nature of the task, any<br />
simple metric derived from such a process would provide in<strong>for</strong>mation that was useful<br />
only in the context of the absolute values of our −5 to+5 scale. For instance, if our reranker<br />
is often more conservative than our original rankings (ranking most SO 5 words<br />
as 4, SO 4 words as 3, etc.), the absolute agreement might approach zero, but we would<br />
like to be able to claim that the rankings actually show a great deal of consistency given<br />
286