[Figure 3 (plots not reproduced): accuracy (y-axis, 0.4-0.8) against (a) Threshold and (b) Coverage (x-axes, 0.2-1); one curve per dataset: AuthorA4, AuthorB4, AuthorC4, AuthorD4, Kitchen4, Music3, Post3.]

Figure 3: Relationship between probability of the most probable star rating and accuracy, and between lexicon/NB coverage and accuracy; datasets AuthorA4-D4, Kitchen4, Music3 and Post3. (a) Probability of the most probable star rating. (b) Lexicon and NB coverage.
improvement is steadier and sharper for Kitchen4, Music3 and Post3, which, as seen in Figure 3(b), have higher lexicon and NB coverage than the Authors datasets. As one would expect, performance improves for these three datasets as coverage increases from 50% to 80%. Outside this range, however, the results are counter-intuitive: overall, accuracy decreases between 20% and 50% coverage, and also drops for Post3 at 85% coverage (a level of coverage not reached by any other dataset); and high accuracy is obtained at very low coverage (≤ 25%) for AuthorA4 and AuthorC4. These observations indicate that other factors, such as style and vocabulary, should be considered in conjunction with coverage, and that the use of coverage in Equations 2 and 4 may require fine-tuning that takes the level of coverage into account.
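For concreteness, "coverage" can be read here as the fraction of a review's tokens that the lexicon (or the NB feature set) recognises. The sketch below computes coverage under that assumed reading; the function name and the whitespace tokenisation are hypothetical, and the paper's Equations 2 and 4 are not reproduced in this excerpt.

```python
def coverage(tokens, known_vocab):
    """Fraction of a review's tokens found in a given vocabulary,
    e.g. the sentiment lexicon or the NB classifier's selected
    features. (Assumed reading of "coverage"; Equations 2 and 4,
    which use it, are not reproduced in this excerpt.)
    """
    if not tokens:
        return 0.0
    return sum(t in known_vocab for t in tokens) / len(tokens)

# Example: 2 of 5 tokens are in the lexicon, so coverage is 0.4.
print(coverage("this blender is surprisingly great".split(),
               {"surprisingly", "great"}))
```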
6 Conclusion
We have examined the performance of three methods based on supervised machine learning applied to multi-way sentiment analysis: (1) a sentiment lexicon combined with NB with feature selection, (2) NB with feature selection alone, and (3) MCST (which considers label similarity). The lexicon is harnessed by applying a probabilistic procedure that combines words, phrases and sentences, and performs adjustments for negations, modifiers and sentence connectives. This information is combined with corpus-based information by taking into account the uncertainty arising from the extent to which the text is “understood”.
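Since Equations 2 and 4 are not reproduced in this excerpt, the following is only a minimal sketch of one plausible reading of the combination step, in which the lexicon-based and NB star-rating distributions are interpolated with a weight derived from each source's coverage; all names and the interpolation scheme are assumptions, not the paper's actual formulas.

```python
import numpy as np

def combine_distributions(p_lex, p_nb, lex_coverage, nb_coverage):
    """Blend lexicon-based and NB distributions over star ratings,
    weighting each source by how much of the review it "understood"
    (its coverage). Hypothetical reading of the paper's combination.
    """
    w_lex = lex_coverage / (lex_coverage + nb_coverage + 1e-12)
    combined = w_lex * np.asarray(p_lex) + (1.0 - w_lex) * np.asarray(p_nb)
    return combined / combined.sum()  # renormalise

# Example on a 4-way star scale: NB dominates because it covers
# more of the review than the lexicon does.
print(combine_distributions([0.10, 0.20, 0.40, 0.30],
                            [0.05, 0.15, 0.30, 0.50],
                            lex_coverage=0.2, nb_coverage=0.7).round(3))
```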
Our methods were evaluated on seven datasets of different sizes, review lengths and writing styles. The results of our evaluation show that the combination of lexicon- and corpus-based information performs at least as well as state-of-the-art systems. The fact that this performance is achieved with only a small contribution from the lexicon suggests that there may be promise in increasing lexicon coverage and improving the domain specificity of lexicons. At the same time, the observation that NB+Lex (with a small lexicon), NB4000 and MCST exhibit similar performance on several datasets leads us to posit that pure n-gram-based statistical systems have plateaued, reinforcing the point that additional factors must be brought to bear to achieve significant performance improvements.
The negative result that an NB classifier with feature selection achieves state-of-the-art performance indicates that careful baseline selection is warranted when evaluating new algorithms.
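As a concrete illustration of such a baseline, the sketch below assembles an NB classifier over unigram counts with chi-squared feature selection, in the spirit of the NB4000 system mentioned above. That k = 4000 matches the system's name is an assumption, as is the use of scikit-learn rather than the authors' implementation.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Unigram counts -> keep the 4,000 features most associated with
# the star rating (chi-squared test) -> multinomial naive Bayes.
baseline = Pipeline([
    ("counts", CountVectorizer()),
    ("select", SelectKBest(chi2, k=4000)),
    ("nb", MultinomialNB()),
])

# reviews: list of review strings; stars: list of integer ratings.
# baseline.fit(reviews, stars)
# predictions = baseline.predict(test_reviews)
```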
Finally, we studied the effect of three factors on the reliability of a sentiment-analysis algorithm: (1) number of bimodal sentences in a review; (2) probability of the most probable star rating; and (3) coverage provided by the lexicon and the NB classifier. Our results show that these factors may be used to predict regions of reliable sentiment-analysis performance, but further investigation is required regarding the interaction between coverage and the stylistic features of a review.
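Factor (2) amounts to a reject option: a prediction is accepted only when the most probable star rating is sufficiently probable, trading the fraction of reviews answered against accuracy, as in Figure 3(a). Below is a minimal sketch of that sweep, with hypothetical names, under the assumption that the classifier outputs a probability distribution over star ratings.

```python
import numpy as np

def accuracy_at_threshold(probs, gold, threshold):
    """Accuracy over the reviews whose most probable star rating has
    probability >= threshold, plus the fraction of reviews answered.

    probs -- (n_reviews, n_ratings) array of predicted distributions
    gold  -- (n_reviews,) array of true rating indices
    """
    accepted = probs.max(axis=1) >= threshold
    if not accepted.any():
        return float("nan"), 0.0
    correct = probs.argmax(axis=1)[accepted] == gold[accepted]
    return correct.mean(), accepted.mean()

# Sweeping the threshold maps out the accuracy/coverage trade-off,
# as in Figure 3(a):
# for t in np.arange(0.2, 1.0, 0.1):
#     acc, answered = accuracy_at_threshold(probs, gold, t)
```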