Full proceedings volume - Australasian Language Technology ...
ation metrics robust, if they are validated via low-quality human judgments of translation quality?

While one valid response to this question is that the automatic evaluation measures are no worse than human assessment, a more robust approach is to increase the reliability of the human judgments we use as the yardstick for automatic metrics, by finding better ways of collecting and assessing translation quality. Given how important human assessment of translation quality is to empirical machine translation, it is notable that, while there is a significant body of research into developing metrics that correlate with human judgments of translation quality, the underlying question of how to increase the reliability of those judgments has to date received limited attention (Callison-Burch et al., 2007; Callison-Burch et al., 2008; Przybocki et al., 2009; Callison-Burch et al., 2009; Callison-Burch et al., 2010; Denkowski and Lavie, 2010).
4 Human Assessment of Quality
To really improve the consistency of human judgments of translation quality, we may need to take a step back and ask what we are really asking human judges to do when we require them to assess translation quality. In the philosophy of science, the concept of translation quality would be considered a (hypothetical) construct. MacCorquodale and Meehl (1948) describe a construct as follows:

. . . constructs involve terms which are not wholly reducible to empirical terms; they refer to processes or entities that are not directly observed (although they need not be in principle unobservable); the mathematical expression of them cannot be formed simply by a suitable grouping of terms in a direct empirical equation; and the truth of the empirical laws involved is a necessary but not a sufficient condition for the truth of these conceptions.
Translation quality is an abstract notion that exists in theory and can be observed in practice, but cannot be measured directly. Psychology often deals with the measurement of such abstract notions, and provides established methods of measurement, together with methods for validating those measurement techniques. Although “translation quality” is not a psychological construct as such, we believe these methods of measurement and validation could be used to develop more reliable and valid measures of translation quality.
Psychological constructs are measured indirectly; the task of defining and measuring a construct is known as operationalizing the construct. It requires examining the mutual or common-sense understanding of the construct in order to come up with a set of items that together can be used to measure it indirectly. In psychology, the term construct validity refers to the degree to which inferences can legitimately be made from the operationalizations in a study to the theoretical constructs on which those operationalizations were based.
Given some data, it is then possible to examine each pair of constructs within the semantic net: evidence of convergence between theoretically similar constructs supports the inclusion of both (Campbell and Fiske, 1959). Put more simply, when two constructs that should (in theory) relate to one another do in fact correlate highly on the data, this is evidence to support their use. Similarly, when a lack of correlation is observed for a pair of constructs that theoretically should not relate to each other, this also validates their use. This is just one example of a range of methods used in psychology to validate techniques for measuring psychological constructs (see Trochim (1999) for a general introduction to construct validity).
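The convergent/discriminant reasoning above can be sketched numerically. The sketch below computes Pearson correlations over entirely hypothetical 5-point ratings of ten translations: a high correlation between theoretically similar measures (here, adequacy and fluency) would count as convergent evidence, while a near-zero correlation with a theoretically unrelated measure (here, rated source difficulty) would count as discriminant evidence. The measure names and scores are illustrative assumptions, not data from any study.

```python
def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Hypothetical 5-point ratings of the same ten translations on two
# theoretically related measures (adequacy, fluency) and one
# theoretically unrelated measure (rated source-sentence difficulty).
adequacy   = [5, 4, 4, 2, 1, 3, 5, 2, 4, 1]
fluency    = [5, 5, 3, 2, 2, 3, 4, 1, 4, 2]
difficulty = [2, 4, 1, 5, 3, 2, 4, 1, 5, 3]

# Convergent evidence: similar constructs should correlate highly
# (on these made-up numbers, r is about +0.85).
print(f"adequacy vs fluency:    r = {pearson(adequacy, fluency):+.2f}")
# Discriminant evidence: dissimilar constructs should not
# (here r is close to zero).
print(f"adequacy vs difficulty: r = {pearson(adequacy, difficulty):+.2f}")
```

In a real validation study the correlations would of course come from many judges' ratings, and a multitrait-multimethod matrix in the style of Campbell and Fiske (1959) would examine all construct pairs at once; the sketch only isolates the core arithmetic.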
5 Past and Current Methodologies
The ALPAC Report (ALPAC, 1966) was one of the earliest published attempts to perform cross-system MT evaluation, in determining whether progress had been made over the preceding decade. The (somewhat anecdotal) conclusion was that:

(t)he reader will find it instructive to compare the samples above with the results obtained on simple, selected, text 10 years earlier . . . in that the earlier samples are more readable than the later ones.
The DARPA Machine Translation Initiative of the 1990s incorporated MT evaluation as a central tenet, and periodically evaluated the three MT