Lexicon-Based Methods for Sentiment Analysis
Computational Linguistics Volume 37, Number 2
different approach, instead supposing that negative expressions, being relatively rare, are given more cognitive weight when they do appear. Thus we increase the final SO of any negative expression (after other modifiers have applied) by a fixed amount (currently 50%). This seems to have essentially the same effect in our experiments, and is more theoretically satisfying.
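As a rough sketch (not the actual SO-CAL code), the fixed negative boost described above amounts to a single multiplier applied after all other modifiers; the 1.5 factor corresponds to the 50% increase mentioned in the text, and the function name is our own:

```python
NEGATIVE_BOOST = 1.5  # the "currently 50%" increase described in the text


def boost_negative(so: float, boost: float = NEGATIVE_BOOST) -> float:
    """Amplify the SO of a negative expression by a fixed amount,
    applied after other modifiers (negation, intensification, etc.)
    have already adjusted the value. Positive SO is left unchanged."""
    return so * boost if so < 0 else so
```

Because the boost applies only to negative values, it leaves the scale of positive expressions intact while giving rare negative ones extra weight.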
Pang, Lee, and Vaithyanathan (2002) found that their machine-learning classifier performed better when a binary feature was used indicating the presence of a unigram in the text, instead of a numerical feature indicating the number of appearances. Counting each word only once does not seem to work equally well for word-counting models. We have, however, improved overall performance by decreasing the weight of words that appear more often in the text: The nth appearance of a word in the text will have only 1/n of its full SO value.17 Consider the following invented example.
(17) Overall, the film was excellent. The acting was excellent, the plot was excellent, and the direction was just plain excellent.
Pragmatically, the repetition of excellent suggests that the writer lacks additional substantive commentary, and is simply using a generic positive word. We could also impose an upper limit on the distance of repetitions, and decrease the weight only when they appear close to each other. Repetitive weighting does not apply to words that have been intensified, the rationale being that the purpose of the intensifier is to draw special attention to them.
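A minimal sketch of the 1/n repetition discount, including the exemption for intensified words, might look as follows; the function name and the representation of intensified positions as a set of indices are our own assumptions, not SO-CAL's interface:

```python
from collections import Counter


def repetition_weighted_so(words, lexicon, intensified=frozenset()):
    """Sum SO values over a token list, giving the nth occurrence of
    a word only 1/n of its full SO value. Token positions listed in
    `intensified` keep their full SO and are exempt from the discount,
    since an intensifier draws special attention to the word."""
    seen = Counter()
    total = 0.0
    for pos, word in enumerate(words):
        if word not in lexicon:
            continue  # word carries no SO value
        if pos in intensified:
            total += lexicon[word]  # full value, no repetition discount
        else:
            seen[word] += 1
            total += lexicon[word] / seen[word]
    return total
```

On example (17), four occurrences of excellent with a full SO of 5 would contribute 5 × (1 + 1/2 + 1/3 + 1/4) ≈ 10.4 rather than 20.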
Another reason to tone down words that appear often in a text is that a word that appears regularly is more likely to have a neutral sense. This is particularly true of nouns. In one example from our corpus, the words death, turmoil, and war each appear twice. A single use of any of these words might indicate a comment (e.g., I was bored to death), but repeated use suggests a descriptive narrative.
2.7 Other Features of SO-CAL
Two other features merit discussion: weighting and multiple cut-offs. First of all, SO-CAL incorporates an option to assign different weights to sentences or portions of a text. Taboada and Grieve (2004) improved performance of an earlier version of the SO calculator by assigning the most weight at the two-thirds mark of a text, and significantly less at the beginning. The current version has a user-configurable form of this weighting system, allowing any span of the text (with the end points represented by fractions of the entire text) to be given a certain weight. An even more flexible and powerful system is provided by the XML weighting option. When this option is enabled, XML tag pairs in the text can be used as a signal to the calculator that any words appearing between these tags should be multiplied by a certain given weight. This gives SO-CAL an interface to outside modules. For example, one module could pre-process the text and tag spans that are believed to be topic sentences, another module could provide discourse information such as rhetorical relations (Mann and Thompson 1988), and a third module could label the sentences that seem to be subjective. Armed with this information, SO-CAL can disregard or de-emphasize parts of the text that are less relevant to sentiment analysis. This weighting feature is used in Taboada, Brooke, and
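The XML weighting mechanism can be sketched as a single pass over a token stream in which tags push and pop multipliers; the tag name "topic", the (word, SO) pair layout, and the function name are illustrative assumptions, not SO-CAL's actual format:

```python
def weighted_so(stream, tag_weights):
    """Sum SO over a stream in which scored words appear as (word, so)
    pairs and XML-style tags appear as bare strings, e.g. "<topic>" ...
    "</topic>" ("topic" is an illustrative tag name). Words between a
    tag pair have their SO multiplied by that tag's weight; nested
    tags compound multiplicatively."""
    multipliers = [1.0]  # stack of active weights; base weight is 1
    total = 0.0
    for token in stream:
        if isinstance(token, str) and token.startswith("</"):
            multipliers.pop()  # closing tag: leave the innermost span
        elif isinstance(token, str) and token.startswith("<"):
            name = token.strip("<>")
            multipliers.append(multipliers[-1] * tag_weights.get(name, 1.0))
        else:
            _, so = token
            total += so * multipliers[-1]
    return total
```

Setting a tag's weight above 1 emphasizes the span; setting it to 0 makes the calculator disregard it entirely, matching the de-emphasis behavior described above.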
17 One of the reviewers points out that this is similar to the use of term frequency (tf-idf) in information retrieval (Salton and McGill 1983). See also Paltoglou and Thelwall (2010) for a use of information retrieval techniques in sentiment analysis.