Full proceedings volume - Australasian Language Technology ...
Figure 1: Average classification accuracy for NB+Lex, compared with NB4000 and MCST; all datasets.
Music3, contains three classes: 860 low reviews (≤ 3 stars), 1409 medium (4 stars) and 1770 high (5 stars).

• PostOffice: contains 3966 reviews of post office outlets written by “mystery shoppers” hired by a contractor. The reviews are very short, typically comprising one to three sentences, and focus on specific aspects of the service, e.g., attitude of the staff and cleanliness of the stores. The reviews were originally associated with a seven-point rating scale. However, as for the Music dataset, owing to the low numbers of reviews with low ratings (≤ 5 stars), we grouped the reviews into three balanced classes denoted Post3: 1277 low reviews (≤ 5 stars), 1267 medium (6 stars), and 1422 high (7 stars).
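The grouping of the seven-point scale into three roughly balanced classes can be sketched as follows (the function name and the (text, stars) data layout are hypothetical; the paper does not specify its preprocessing code):

```python
# Sketch of the rating-to-class grouping described above.
# Assumes each review is a (text, stars) pair; this layout is an
# illustrative assumption, not the authors' actual format.
def group_post_office(reviews):
    """Map the original seven-point PostOffice scale onto three classes:
    low (<= 5 stars), medium (6 stars), high (7 stars)."""
    classes = {"low": [], "medium": [], "high": []}
    for text, stars in reviews:
        if stars <= 5:
            classes["low"].append(text)
        elif stars == 6:
            classes["medium"].append(text)
        else:  # 7 stars
            classes["high"].append(text)
    return classes
```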
5.2 Results
Figure 1 shows the average accuracy obtained by the hybrid approach (NB+Lex using NB4000),8 compared with the accuracy obtained by the best-performing version of MCST (Bickerstaffe and Zukerman, 2010) (which was evaluated on the Movies dataset, using the algorithms presented in (Pang and Lee, 2005) as baselines), and by NB4000 (NB plus feature selection with 4000 features selected using information gain). All trials employed 10-fold cross-validation. For the NB+Lex method, we investigated different combinations of settings (with and without negations, modifiers, sentence connectives, and inter-sentence weighting). However, these variations had a marginal effect on performance, arguably owing to the low coverage of the lexicon. Here we report the results obtained with all the options turned on. Statistical significance was computed using a two-tailed paired t-test for each fold with p = 0.05 (we mention only statistically significant differences in our discussion).

8 We omit the results obtained with the lexicon alone, as its coverage is too low.
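The significance check over paired per-fold accuracies can be sketched as below (the helper name and accuracy lists are hypothetical; the critical value 2.262 is for df = 9, i.e., 10 folds, two-tailed, p = 0.05):

```python
import math
import statistics

def paired_t_significant(acc_a, acc_b, t_crit=2.262):
    """Two-tailed paired t-test over per-fold accuracies of two methods.
    t_crit = 2.262 is the critical value for df = 9 (10 folds) at
    p = 0.05; scipy.stats.ttest_rel could be used for exact p-values."""
    diffs = [a - b for a, b in zip(acc_a, acc_b)]
    mean_d = statistics.mean(diffs)
    sd_d = statistics.stdev(diffs)  # sample std of the differences
    t = mean_d / (sd_d / math.sqrt(len(diffs)))
    return abs(t) > t_crit
```

The paired (rather than unpaired) test is the appropriate choice here because both methods are evaluated on the same cross-validation folds.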
• NB+Lex outperforms MCST on three datasets (AuthorA3, AuthorA4 and Music3), while the inverse happens only for AuthorB4. NB+Lex also outperforms NB4000 on AuthorD4 and Music3. No other differences are statistically significant.

• Interestingly, NB4000 outperforms MCST for AuthorA3 and Music3, with no other statistically significant differences, which highlights the importance of judicious baseline selection.
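NB4000's feature-selection step (keeping the 4000 terms with the highest information gain) can be sketched as follows; the function names and the representation of documents as term sets are illustrative assumptions, not the authors' implementation:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(docs, labels, term):
    """IG of a binary term-presence feature: H(C) - H(C | term in doc).
    docs is a list of term sets, labels the parallel class labels."""
    with_t = [l for d, l in zip(docs, labels) if term in d]
    without = [l for d, l in zip(docs, labels) if term not in d]
    n = len(labels)
    cond = sum(len(part) / n * entropy(part)
               for part in (with_t, without) if part)
    return entropy(labels) - cond

def select_features(docs, labels, k):
    """Keep the k terms with the highest information gain (k = 4000
    in NB4000)."""
    vocab = set().union(*docs)
    return sorted(vocab, key=lambda t: information_gain(docs, labels, t),
                  reverse=True)[:k]
```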
Although the combination approach shows some promise, it is somewhat disappointing that it does not yield significantly better results than NB4000 on all the datasets. The small contribution of the lexicon to the performance of NB+Lex may be partially attributed to its low coverage of the vocabulary of the datasets compared with the coverage of NB4000 alone. Specifically, only a small