Lexicon-Based Methods for Sentiment Analysis - Simon Fraser ...
Lexicon-Based Methods for Sentiment Analysis - Simon Fraser ...
Lexicon-Based Methods for Sentiment Analysis - Simon Fraser ...
Create successful ePaper yourself
Turn your PDF publications into a flip-book with our unique Google optimized e-Paper software.
Computational Linguistics Volume 37, Number 2<br />
The “Maryland” dictionary (Mohammad, Dorr, and Dunne 2009) is a very large<br />
collection of words and phrases (around 70,000) extracted from the Macquarie Thesaurus.<br />
The dictionary is not classified according to part of speech, and only contains<br />
in<strong>for</strong>mation on whether the word is positive or negative. To integrate it into our system,<br />
we assigned all positive words an SO value of 3, and all negative words a value of −3. 33<br />
We used the same type of quantification <strong>for</strong> the General Inquirer (GI; Stone et al.<br />
1966), which also has only positive and negative tags; a word was included in the<br />
dictionary if any of the senses listed in the GI were polar.<br />
The Subjectivity dictionary is the collection of subjective expressions compiled by<br />
Wilson, Wiebe, and Hoffmann (2005), also used in our Mechanical Turk experiments in<br />
the previous section. The Subjectivity dictionary only contains a distinction between<br />
weak and strong opinion words. For our tests, weak words were assigned 2 or −2<br />
values, depending on whether they were positive or negative, and strong words were<br />
assigned 4 or −4.<br />
The SentiWordNet dictionary (Esuli and Sebastiani 2006; Baccianella, Esuli, and<br />
Sebastiani 2010), also used in the previous section, was built using WordNet (Fellbaum<br />
1998), and retains its synset structure. There are two main versions of SentiWordNet<br />
available, 1.0 and 3.0, and two straight<strong>for</strong>ward methods to calculate an SO value: 34<br />
Use the first sense SO, or average the SO across senses. For the first sense method,<br />
we calculate the SO value of a word w (of a given POS) based on its first sense f as<br />
follows:<br />
SO(w) = 5 × (Pos( f ) − Neg( f ))<br />
For the averaging across senses method, SO is calculated as<br />
SO(w) =<br />
<br />
5<br />
|senses| x∈senses (Pos(x) − Neg(x))<br />
that is, the difference between the positive and negative scores provided by SentiWord-<br />
Net (each in the 0–1 range), averaged across all word senses <strong>for</strong> the desired part of<br />
speech, and multiplied by 5 to provide SO values in the −5 to 5 range. Table 9 contains a<br />
comparison of per<strong>for</strong>mance (using simple word averaging, no SO-CAL features) in our<br />
various corpora <strong>for</strong> each version of SentiWordNet using each method. What is surprising<br />
is that the best dictionary using just basic word counts (1.0, first sense) is actually<br />
the worst dictionary when using SO-CAL, and the best dictionary using all SO-CAL<br />
features (3.0, average across senses) is the worst dictionary when features are disabled.<br />
We believe this effect is due almost entirely to the degree of positive bias in the various<br />
dictionaries. The 3.0 average dictionary is the most positively biased, which results in<br />
degraded basic per<strong>for</strong>mance (in the Camera corpus, only 20.5% of negative reviews<br />
are correctly identified). When negative weighting is applied, however, it reaches an<br />
almost perfect balance between positive and negative accuracy (70.7% to 71.3% in the<br />
Camera corpus), which optimizes overall per<strong>for</strong>mance. We cannot there<strong>for</strong>e conclude<br />
definitively that any of the SentiWordNet dictionaries is superior <strong>for</strong> our task; in fact,<br />
33 We chose 3 as a value because it is the middle value between 2 and 4 that we assigned to strong and weak<br />
words in the Subjectivity Dictionary, as explained subsequently.<br />
34 A third alternative would be to calculate a weighted average using sense frequency in<strong>for</strong>mation;<br />
SentiWordNet does not include such in<strong>for</strong>mation, however. Integrating this in<strong>for</strong>mation from other<br />
sources, though certainly possible, would take us well beyond “off-the-shelf” usage, and, we believe,<br />
would provide only marginal benefit.<br />
298