23.03.2013 Views

Lexicon-Based Methods for Sentiment Analysis - Simon Fraser ...

Lexicon-Based Methods for Sentiment Analysis - Simon Fraser ...

Lexicon-Based Methods for Sentiment Analysis - Simon Fraser ...

SHOW MORE
SHOW LESS

Create successful ePaper yourself

Turn your PDF publications into a flip-book with our unique Google optimized e-Paper software.

Computational Linguistics Volume 37, Number 2<br />

different approach, instead supposing that negative expressions, being relatively rare,<br />

are given more cognitive weight when they do appear. Thus we increase the final SO<br />

of any negative expression (after other modifiers have applied) by a fixed amount<br />

(currently 50%). This seems to have essentially the same effect in our experiments, and<br />

is more theoretically satisfying.<br />

Pang, Lee, and Vaithyanathan (2002) found that their machine-learning classifier<br />

per<strong>for</strong>med better when a binary feature was used indicating the presence of a unigram<br />

in the text, instead of a numerical feature indicating the number of appearances. Counting<br />

each word only once does not seem to work equally well <strong>for</strong> word-counting models.<br />

We have, however, improved overall per<strong>for</strong>mance by decreasing the weight of words<br />

that appear more often in the text: The nth appearance of a word in the text will have<br />

only 1/n of its full SO value. 17 Consider the following invented example.<br />

(17) Overall, the film was excellent. The acting was excellent, the plot was excellent,<br />

and the direction was just plain excellent.<br />

Pragmatically, the repetition of excellent suggests that the writer lacks additional substantive<br />

commentary, and is simply using a generic positive word. We could also impose<br />

an upper limit on the distance of repetitions, and decrease the weight only when they<br />

appear close to each other. Repetitive weighting does not apply to words that have<br />

been intensified, the rationale being that the purpose of the intensifier is to draw special<br />

attention to them.<br />

Another reason to tone down words that appear often in a text is that a word that<br />

appears regularly is more likely to have a neutral sense. This is particularly true of<br />

nouns. In one example from our corpus, the words death, turmoil, andwar each appear<br />

twice. A single use of any of these words might indicate a comment (e.g., Iwasboredto<br />

death), but repeated use suggests a descriptive narrative.<br />

2.7 Other Features of SO-CAL<br />

Two other features merit discussion: weighting and multiple cut-offs. First of all, SO-<br />

CAL incorporates an option to assign different weights to sentences or portions of a text.<br />

Taboada and Grieve (2004) improved per<strong>for</strong>mance of an earlier version of the SO calculator<br />

by assigning the most weight at the two-thirds mark of a text, and significantly<br />

less at the beginning. The current version has a user-configurable <strong>for</strong>m of this weighting<br />

system, allowing any span of the text (with the end points represented by fractions of<br />

the entire text) to be given a certain weight. An even more flexible and powerful system<br />

is provided by the XML weighting option. When this option is enabled, XML tag pairs in<br />

the text (e.g., , ) can be used as a signal to the calculator that any words<br />

appearing between these tags should be multiplied by a certain given weight. This gives<br />

SO-CAL an interface to outside modules. For example, one module could pre-process<br />

the text and tag spans that are believed to be topic sentences, another module could<br />

provide discourse in<strong>for</strong>mation such as rhetorical relations (Mann and Thompson 1988),<br />

and a third module could label the sentences that seem to be subjective. Armed with<br />

this in<strong>for</strong>mation, SO-CAL can disregard or de-emphasize parts of the text that are less<br />

relevant to sentiment analysis. This weighting feature is used in Taboada, Brooke, and<br />

17 One of the reviewers points out that this is similar to the use of term frequency (tf-idf) in in<strong>for</strong>mation<br />

retrieval (Salton and McGill 1983). See also Paltoglou and Thelwall (2010) <strong>for</strong> a use of in<strong>for</strong>mation<br />

retrieval techniques in sentiment analysis.<br />

280

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!