
Using a Grammar Checker for Evaluation and Postprocessing of Statistical Machine Translation

Sara Stymne, Lars Ahrenberg
Department of Computer and Information Science
Linköping University, Sweden
{sarst,lah}@ida.liu.se

Abstract

One problem in statistical machine translation (SMT) is that the output is often ungrammatical. To address this issue, we have investigated the use of a grammar checker for two purposes in connection with SMT: as an evaluation tool and as a postprocessing tool. As an evaluation tool the grammar checker gives a complementary picture to standard metrics such as Bleu, which do not account for grammaticality. We use the grammar checker as a postprocessing tool by applying the error correction suggestions it gives. The postprocessing yields only small overall improvements on automatic metrics, but the sentences that are affected by the changes are improved, as shown both by automatic metrics and by a human error analysis. These results indicate that grammar checker techniques are a useful complement to SMT.

1. Introduction

One problem with standard statistical machine translation systems is that their output tends to be ungrammatical, since generally no linguistic knowledge is used in the systems. We investigate how a grammar checker can be used to address this issue. Grammar checkers are developed to find errors in texts produced by humans, but in this study we investigate whether they can also be used to find errors made by machines. We identify two novel usages of the grammar checker for machine translation (MT): as an evaluation tool and as a postprocessing tool.

We have performed experiments for English-Swedish translation using a factored phrase-based statistical machine translation (PBSMT) system based on Moses (Koehn et al., 2007) and the mainly rule-based Swedish grammar checker Granska (Domeij et al., 2000; Knutsson, 2001). The combination of a grammar checker and an MT system could, however, be used for other architectures and language pairs as well. We have performed experiments on six translation systems that differ on two dimensions: the amount of training data, and the amount of linguistic knowledge used in the system.

To be able to use the grammar checker as an evaluation tool, we performed an error analysis of the grammar checker on SMT output. We then defined three crude measures based on the error identification by the grammar checker. All three measures are error rates based on the grammar checker error categories; the difference between them is that they use different subsets of the categories. All three measures give a complementary picture to two standard MT metrics, since they are better at accounting for the fluency and grammaticality of the machine translation output.

We used the grammar checker as an automatic postprocessing tool on the SMT output, by applying the correction suggestions given for many errors. We applied the suggestions only for categories that had a high precision in the error analysis on SMT output. A human error analysis showed that the corrections were successful in most cases. There were only small improvements on automatic metrics on the full test sets, however; this was mainly because the postprocessing only affected a small proportion of the sentences.

The remainder of the paper is organized as follows: Section 2 describes the translation systems used in the study. In Section 3 the grammar checker Granska is introduced, and an error analysis of Granska on SMT output is described. The experiments and results are presented in Section 4 for evaluation and in Section 5 for postprocessing. Section 6 contains a discussion of related work, and Section 7 contains the conclusion and suggestions for future work.

2. SMT System

The translation system used is a standard PBSMT setup using the Moses decoder (Koehn et al., 2007) and the SRILM toolkit for sequence models (Stolcke, 2002). We take advantage of the factored translation framework in Moses (Koehn and Hoang, 2007), where factors other than surface form can be used to represent words, which allows the inclusion of linguistic knowledge such as lemmas and part-of-speech tags. To tune feature weights, minimum error rate training is used (Och, 2003).

The system is trained and tested on the Europarl corpus (Koehn, 2005). The Swedish target side of the training corpus is part-of-speech tagged using the Granska tagger (Carlberger and Kann, 1999). The training corpus is filtered to remove sentences longer than 40 words and sentence pairs with a length ratio of more than 1 to 7. 500 sentences are used for tuning and 2000 sentences for testing.
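As an illustration of this filtering step, the sketch below applies the same two criteria (at most 40 words per sentence, length ratio of at most 1 to 7) to a sentence-aligned corpus. The function name and the whitespace tokenization are our own assumptions; the paper does not specify the exact implementation.

def filter_corpus(pairs, max_len=40, max_ratio=7.0):
    """Keep sentence pairs that satisfy the length criteria described above.

    pairs: iterable of (source, target) sentence strings.
    Assumes simple whitespace tokenization; the actual preprocessing
    used in the paper may differ.
    """
    kept = []
    for src, trg in pairs:
        src_len, trg_len = len(src.split()), len(trg.split())
        if src_len == 0 or trg_len == 0:
            continue
        if src_len > max_len or trg_len > max_len:
            continue
        if max(src_len, trg_len) / min(src_len, trg_len) > max_ratio:
            continue
        kept.append((src, trg))
    return kept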

In order to evaluate the use of the grammar checker we trained six different systems that are varied on two dimensions: the amount of training data and the amount of linguistic knowledge in the systems, in the form of extra factors on the output side. For the corpus size we have used two sizes of the training corpus for the translation model: a large training corpus with 701,157 sentences, and a small training corpus with 100,000 sentences. In both cases the sequence models were trained on the large corpus.


Text: Averaging vore med tre timmar per dag , det är den mest omfattande mänskliga aktivitet efter sover och - för vuxna - arbete .
Rule: stav1@stavning Span: 1-1 Words: Averaging
Rule: kong10E@kong Span: 14-15 Words: mänskliga aktivitet
    mänskliga aktiviteten
    mänsklig aktivitet

Figure 1: Example of filtered Granska output for a sentence with two errors

The linguistic knowledge used in the systems is varied by the use of different factors on the output side. There are three system setups for each corpus size: the none system with only a standard language model on surface form, and two systems with an additional sequence model. The POS system uses standard part-of-speech tags and the morph system uses morphologically enriched part-of-speech tags. An example of the annotation is shown in (1).

(1) EN: my question is important .
    none: min fråga är viktig .
    POS: min|ps fråga|nn är|vb viktig|jj .|mad
    morph: min|ps.utr.sin.def fråga|nn.utr.sin.ind.nom är|vb.prs.akt.kop viktig|jj.pos.utr.sin.ind.nom .|mad

Using part-of-speech tags can be expected to improve word order. The use of morphological tags can also improve agreement (Stymne et al., 2008).
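To make the factored format concrete, the sketch below joins each token with its tags using the | separator shown in (1). The helper name and the pre-tagged input are illustrative assumptions; in the experiments the tags come from the Granska tagger.

def to_factored(tokens):
    """Turn a list of (word, tag) or (word, tag, morph) tuples into a
    Moses-style factored sentence, e.g. 'fråga|nn' with POS factors or
    'fråga|nn.utr.sin.ind.nom' with morphologically enriched tags."""
    return " ".join("|".join(parts) for parts in tokens)

# Example corresponding to (1), POS factors only.
pos_tagged = [("min", "ps"), ("fråga", "nn"), ("är", "vb"), ("viktig", "jj"), (".", "mad")]
print(to_factored(pos_tagged))
# -> min|ps fråga|nn är|vb viktig|jj .|mad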

3. Grammar Checker

A grammar checker is a tool that can identify grammar errors in texts. Often it also covers other error types, such as spelling errors and stylistic errors. Grammar checkers tend to be authoring tools or writing aids, that is, they are designed to be used by a human who assesses the alarms and suggestions given by the tool, rather than as software that can be applied automatically.

For MT output, both types of grammar checkers could be considered useful. A grammar checker that is an authoring tool could be used to highlight suspicious problems and to suggest changes in translations that are sent to a human posteditor. An automatic grammar checking system, on the other hand, could be used to automatically improve MT output, regardless of whether the translations are used directly, e.g. for gisting, or sent for further postediting by humans. If the purpose of using the grammar checker is evaluation, an automatic grammar checker is clearly preferable.

3.1. Granska

We use the Swedish grammar checker Granska (Domeij et al., 2000; Knutsson, 2001), which is a hybrid, mainly rule-based grammar checker. The main modules in Granska are: a tokenizer; a probabilistic Hidden Markov model-based tagger, which tags texts both with part-of-speech and morphology (Carlberger and Kann, 1999); the spell checker Stava (Kann et al., 2001); and a rule matcher, which identifies errors and generates error descriptions and correction suggestions. The rule matcher contains hand-written rules in an object-oriented rule language developed for Granska.

Granska finds both grammar and spelling errors, and in many cases it also gives correction suggestions. The grammar errors are divided into thirteen main categories, many of which are in turn further divided into a number of subcategories.

Granska can output XML, from which we extracted the necessary information for our purposes, exemplified in Figure 1. In the output we see the whole sentence (Text:), and a list of errors that were found in the sentence. For each error we know the main rule type and rule subcategory that applied (Rule:), the position in the sentence where it occurs (Span:), the words that are wrong (Words:), and possibly one or several correction suggestions. For the sentence in Figure 1, Granska detected a spelling error, the unknown English word Averaging, for which it has no correction suggestions, and a noun phrase agreement error, for which it has two suggestions, of which the first one is correct in this context. There are also other errors in the sentence which Granska does not find, mainly because it is not designed for this type of malformed output.
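A minimal sketch, under our own assumptions about the exact line format, of how the filtered output in Figure 1 can be parsed into error records (rule, span, words, suggestions). In the study the information was extracted from Granska's XML output rather than from this filtered form.

import re

def parse_filtered_output(lines):
    """Parse filtered Granska output (cf. Figure 1) into a dict with the
    sentence text and a list of errors with their correction suggestions."""
    result = {"text": None, "errors": []}
    rule_re = re.compile(r"Rule:\s*(\S+)\s+Span:\s*(\d+)-(\d+)\s+Words:\s*(.*)")
    for line in lines:
        if line.startswith("Text:"):
            result["text"] = line[len("Text:"):].strip()
        elif line.lstrip().startswith("Rule:"):
            m = rule_re.search(line)
            if m:
                result["errors"].append({
                    "rule": m.group(1),
                    "span": (int(m.group(2)), int(m.group(3))),
                    "words": m.group(4).strip(),
                    "suggestions": [],
                })
        elif line.strip() and result["errors"]:
            # Indented lines following an error are its correction suggestions.
            result["errors"][-1]["suggestions"].append(line.strip())
    return result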

3.2. Error Analysis of Granska on SMT Output

In order to use the grammar checker for SMT it is useful to know for which categories the grammar checker produces good results and good correction suggestions on SMT output. Granska has been evaluated on human output in previous studies, but with different results on different text types (Knutsson, 2001). Applying Granska to a different type of text than it was designed for can also affect its performance; for instance, its precision for a subset of its error categories degrades from 92% on texts written by adults to 35% on texts written by primary school children (Sofkova Hashemi, 2007). Machine translation output is another very different text type, on which we cannot expect the behavior of Granska to be the same as for human texts. Thus we performed an error analysis of the performance of the grammar checker on the translation output from the tuning process, a total of 11,574 words, from Europarl. The evaluation was performed by one native Swedish speaker.

The grammar errors found in the sample belonged to only five of the thirteen error categories in Granska:

• Agreement NP – Errors in noun phrase agreement of determiners, nouns and adjectives in number, gender and definiteness
• Agreement predicatives – Errors in agreement of predicatives with the subject or object in number, gender and definiteness


• Split compounds – Compounds that are written as two separate words, instead of as one word
• Verb – Errors that are related to verbs, such as missing verbs, wrong form of verbs, or two finite verbs
• Word order – Wrong word order

Type                     Error identification   Correction suggestions
                         Correct   False        Correct1   Correct2+   Wrong   None
Agreement NP             64        10           48         10          4+10    2+0
Agreement predicatives   21        1            20         –           1+1     –
Split compounds          12        14           8          –           3+13    1+1
Verb                     31        18           11         2           –       18+18
Word order               9         0            8          –           1+0     –

Table 1: Analysis of grammar errors that are identified by Granska on tuning data. Correct1 means that the top suggestion is correct, and Correct2+ that some other suggestion is correct. For cases where suggestions are wrong or not given, the numbers are divided between errors that are correctly identified and errors that are erroneously identified.

Type                     Granska evaluation on human text        Granska evaluation on SMT output
                         Recall   Precision   Precision range    Precision
Agreement NP             0.83     0.44        0.11–0.72          0.86
Agreement predicatives   0.69     0.32        0.00–0.44          0.95
Split compounds          0.46     0.39        0.00–0.67          0.46
Verb                     0.97     0.83        0.71–0.91          0.63
Word order               0.75     0.38        0.00–1.00          1.00

Table 2: Comparison of an evaluation of Granska on human text (Knutsson, 2001) with the error analysis on SMT output. Precision range gives the extreme precision values on the five different text types.

Table 1 summarizes the results for these five categories. The performance varies a lot between the error categories, with good performance of error identification on agreement and word order, but worse on split compounds and verbs. Looking into the different subcategories shows that the false alarms for verbs mostly belong to three categories: missing verb, missing finite verb, and infinitive marker without a verb. When these categories are excluded, 17 verb errors remain, of which only 1 is a false alarm.

The quality of the correction suggestions also varies between the categories. In the verb category, suggestions are never wrong, but they are given in only a minority of the cases where errors are correctly identified, and never for the false alarms. For split compounds, on the other hand, the majority of the correction suggestions are incorrect, mainly since suggestions are given for nearly all false alarms. For NP agreement, all false alarms have correction suggestions, but the majority of the correction suggestions are still correct. Predicative agreement and verb errors have correct suggestions for nearly all identified errors.

There are very few pure spelling errors in the translation output, since all words come from the corpus the SMT system is trained on. The grammar checker still identifies 161 spelling errors, of which the majority are untranslated foreign words (49.0%) and proper names (32.9%). The only correction suggestions that are useful for spelling errors are the capitalizations of lower-cased proper names, which occur in 9 cases.

The error analysis on SMT output can be contrasted with an earlier evaluation of Granska on human texts performed by Knutsson (2001). That evaluation was performed on 201,019 words from five different text types: sport news, foreign news, government texts, popular science and student essays. The results from that evaluation on the error categories found in the SMT sample are contrasted with the SMT error analysis in Table 2. The performance of Granska varies a lot between the five text types, as shown by the precision range. The precision on SMT output is better than the average precision on human texts for all error categories except verb errors. This is promising for the use of Granska on SMT output. The human texts were annotated with all present errors, which meant that recall could be calculated. The recall is rather high for all categories except split compounds. No annotation of all errors was done in our SMT error analysis, and thus we cannot calculate recall. It could be expected to be much lower than on human texts, however, since Granska was not developed with SMT errors in mind.

4. Grammar Checker as an Evaluation Tool

A disadvantage of most current evaluation metrics for MT is that they do not take grammar into account. Thus, the grammar checker could complement existing metrics by indicating the grammaticality of a text, simply by counting the number of errors. This only accounts for the fluency of the output, however, so it needs to be used in addition to other metrics that can account for adequacy. In this study we compare the translation results on three crude measures based on the grammar checker with the results on two standard metrics: Bleu (Papineni et al., 2002), which mainly measures n-gram precision, and TER (Snover et al., 2006), which is an error rate based on a modified Levenshtein distance.[1] The scores for all metrics are calculated based on a single reference translation.

[1] Percent notation is used for all Bleu and TER scores.


Size    Factors   Bleu    TER     GER1    GER2    SGER
Large   none      22.18   66.42   0.196   0.293   0.496
        POS       21.63   66.88   0.228   0.304   0.559
        morph     22.04   66.63   0.125   0.195   0.446
Small   none      21.16   67.49   0.244   0.359   0.664
        POS       20.79   67.79   0.282   0.375   0.718
        morph     19.45   69.52   0.121   0.245   0.600

Table 3: Results for the six basic systems. Bleu has higher values for better systems, whereas TER and the grammar checker metrics are error rates, and have lower values for better systems.

Based on the error analysis in Section 3.2 we define three metrics based on the grammar checker. Grammar error ratio 1, GER1, is the average number per sentence of the grammar errors in the categories that had a high precision on error categorization on the development set (all errors except split compounds and three verb categories), and grammar error ratio 2, GER2, is the average number of all grammar errors per sentence. Spelling and grammar error ratio, SGER, is the average total number of identified errors per sentence. All three metrics are error rates, with a value of 0 for systems with no identified errors, and higher values when more errors are identified. There is no upper bound on the metrics.
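The three measures can be computed directly from the per-sentence error lists returned by the grammar checker. The sketch below uses simplified, hypothetical category labels; in the actual setup the categories are Granska's own rule categories, with split compounds and the three low-precision verb subcategories excluded from GER1.

# Hypothetical, simplified category labels; Granska's real rule identifiers differ.
SPELLING = "spelling"
LOW_PRECISION = {"split_compound", "verb_missing", "verb_missing_finite", "verb_inf_marker"}

def grammar_metrics(sentences):
    """sentences: one list of error-category labels per translated sentence,
    as produced by the grammar checker (spelling errors labelled 'spelling').
    Returns (GER1, GER2, SGER): average errors per sentence over different
    subsets of the categories."""
    n = len(sentences)
    grammar = [[e for e in errs if e != SPELLING] for errs in sentences]
    ger1 = sum(1 for errs in grammar for e in errs if e not in LOW_PRECISION) / n
    ger2 = sum(len(errs) for errs in grammar) / n      # all grammar errors
    sger = sum(len(errs) for errs in sentences) / n    # grammar + spelling errors
    return ger1, ger2, sger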

Table 3 shows the results of the evaluation. As expected, the systems trained on the large corpus are all better than the small systems on the standard metrics. On the new grammar checker metrics the large systems are all better on SGER but not always on the GER metrics. The ranking of the six systems is the same on the two standard metrics, but differs for the Granska metrics. The morph system is best for the small corpus on the Granska metrics, but markedly worse on the standard metrics. We believe that this is because the small corpus is too small for the morphology to be useful, due to sparsity issues, so that the grammar is improved, but the word selection is worse. Also on the large corpus, the morph sequence model does improve the grammaticality of the output, but here it performs nearly on par with the none system, and better than the POS system, on the standard metrics. Using a POS sequence model gives the worst performance on all metrics for both corpus sizes.

There are some interesting differences between the three Granska metrics. SGER indicates markedly worse performance on the small corpus, which is not the case for the two GER metrics. This is mainly due to the fact that SGER is the only metric that includes spelling errors, and many of the spelling errors are English out-of-vocabulary words, which are just passed through the translation system. With a smaller training corpus, the number of out-of-vocabulary words increases. If we want the metric to give a picture of the coverage of the system, this information is clearly useful. GER1 gives the best value for the morph system with the small corpus. This could be an indication that using only these few error categories makes the metric less robust to bad translations, since the grammar checker performance most likely degrades when the SMT output is too bad for it to analyze in a meaningful way. On the other hand, GER1 gives a larger difference between the none and POS systems than GER2, which also uses error categories with bad performance on SMT output.

Size    Factors   Bleu    TER     Changes
Large   none      22.34   66.37   382
        POS       21.81   66.84   429
        morph     22.17   66.54   259
Small   none      21.30   67.47   456
        POS       20.95   67.75   514
        morph     19.52   69.48   249

Table 4: Results and number of changes for the six systems when Granska is used for postprocessing

5. Grammar Checker for Postprocessing

We wanted to see if the suggestions of the grammar checker could be used to improve MT by automatic postprocessing, where we apply the suggestions from the grammar checker. We have chosen to accept the correction suggestions for the categories where a large majority of the suggestions were correct in the error analysis in Section 3.2. In the test systems there were between 8 and 16 errors in categories that did not appear in the error analysis. These errors were ignored. When there were several correction suggestions for an error, we always chose the first suggestion. For most of the errors, such as agreement errors, the corrections were performed by changing one or several word forms. For other categories, the word order was changed, or words were deleted or inserted.
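As a sketch of this postprocessing step, the function below applies the first suggestion of each error whose rule belongs to an accepted category, working right to left so that span positions stay valid while the sentence changes length. The accepted rule prefixes and the error-record format (the same as in the earlier parsing sketch) are our own illustrative assumptions; the exact set of accepted Granska categories follows the error analysis but is not reproduced here.

# Hypothetical rule prefixes for the accepted categories, e.g. agreement rules;
# the real rule names are Granska-internal.
ACCEPTED_RULE_PREFIXES = ("kong",)

def postprocess(sentence, errors):
    """Apply the first correction suggestion of each accepted error.

    sentence: the tokenized MT output as a list of words.
    errors: list of dicts as produced by parse_filtered_output(), with
    1-based, inclusive 'span' positions."""
    tokens = list(sentence)
    # Apply corrections from right to left so earlier spans remain valid.
    for err in sorted(errors, key=lambda e: e["span"][0], reverse=True):
        if not err["suggestions"]:
            continue
        if not err["rule"].startswith(ACCEPTED_RULE_PREFIXES):
            continue
        start, end = err["span"]
        tokens[start - 1:end] = err["suggestions"][0].split()
    return tokens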

Table 4 shows the results for the systems with postprocessing. There are consistent but very small improvements for all systems on both metrics, compared to the unprocessed scores in Table 3. The Bleu scores improved by at most 0.18 absolute, and the TER scores by at most 0.09. As could be expected, the systems with bad scores on the Granska metrics have more corrections. More interestingly, the improvements as measured by the metrics were not generally larger on the none and POS systems than on the morph systems, which have far fewer corrections.

One of the likely reasons for the small improvements on the two metrics is that only a relatively small proportion of the sentences were affected by the postprocessing. To investigate this, we calculated scores on only the subset of sentences that were affected by the changes, shown in Table 5. The subsets of sentences are all different, so the scores cannot be directly compared between the systems, or to the scores on the full test sets. Not surprisingly, the improvements are much larger on the subsets than on the full test set for all systems. The improvement is much larger on Bleu, with an absolute gain of around 0.70 for the large systems and of around 0.50 for the small systems.


Size    Factors   Bleu                  TER                   No. sentences
                  Basic     Postproc.   Basic     Postproc.
Large   none      19.44     20.12       68.99     68.77       335
        POS       18.87     19.61       69.42     69.26       373
        morph     18.47     19.29       70.28     69.69       238
Small   none      18.72     19.26       69.96     69.88       395
        POS       17.74     18.27       71.00     70.88       452
        morph     16.79     17.24       72.01     71.77       241

Table 5: Results on the subsets of sentences where Granska postprocessing led to a change. Note that these subsets are different for the six systems, so these scores are not comparable between the systems.

Size    Factors   Good   Neutral   Bad
Large   none      73     19        8
        POS       77     17        6
        morph     68     19        13
Small   none      74     19        7
        POS       73     17        10
        morph     68     20        12

Table 6: Error analysis of the first 100 Granska-based changes for each system

The difference on TER is still relatively small, except for the morph systems, especially the large system, which has a TER improvement of 0.59.

To further investigate the quality of the error corrections, an error analysis was performed by one native Swedish speaker on the first 100 error corrections for each system. The corrections were classified into three categories: good, bad, and neutral. The good changes were improvements compared to not changing anything, and the bad changes resulted in a worse translation than before the change. For the neutral category the translations before and after the change were of equal quality.

The results of the analysis are summarized in Table 6. A large majority of the changes were improvements, which indicates that the two standard metrics used are not that good at capturing this type of improvement. The corrections for the morph systems are slightly worse than for the other systems. This is to a large degree due to the fact that these systems have fewer agreement errors than the other systems, and agreement errors are generally the easiest errors to correct.

Table 7 shows some examples of the different types of corrections. In the good example an NP agreement error is fixed by switching the indefinite article so that it gets the correct gender, and a verb with the wrong tense, present, is changed to perfect tense. In the first neutral example an NP agreement error, which is mixed between definite and indefinite, is changed and becomes syntactically correct, but indefinite instead of the preferred definite. The second neutral example concerns agreement of an adjectival predicative with a collective noun, where both the original plural adjective and the changed singular adjective are acceptable. In the first bad example an attributive adjective has been mistaken for a head noun, resulting in a change from a correct NP to an incorrect NP with two noun forms. The second bad example contains an untranslated English plural genitive noun, which is given Swedish plural inflection by Granska.

Even though the performed corrections are generally good, the number of errors with useful suggestions is low, which makes the overall effect of the corrections small. There are many more actual errors in the output which are not found by Granska. For the postprocessing technique to be even more useful we need to be able to identify and correct more of the errors, either by modifying the grammar checker or by developing a custom SMT output checker.

6. Related Work

Automatic metrics are usually based on the matching of words in the translation hypothesis to words in one or several human reference translations in some way, as is done by Bleu (Papineni et al., 2002) and TER (Snover et al., 2006). These types of metrics do not generalize any linguistic knowledge; they only rely on the matching of surface strings. There are metrics with extended matching, e.g. Meteor (Lavie and Agarwal, 2007), which uses stemming and synonyms from WordNet, and TERp (Snover et al., 2009), which uses paraphrases. There are also some metrics that incorporate other linguistic levels, such as part-of-speech (Popović and Ney, 2009), dependency structures (Owczarzak et al., 2007), or deeper linguistic knowledge such as semantic roles and discourse representation structure (Giménez and Márquez, 2008).

Controlled language checkers, which can be viewed as a type of grammar checker, have been suggested in connection with MT, but for preprocessing of the source language; see Nyberg et al. (2003) for an overview. The controlled language checkers tend to be authoring tools, as in Mitamura (1999) and de Koning (1996) for English and Sågvall Hein (1997) for Swedish, which are used by humans before feeding a text to a usually rule-based MT system.

Automatic postprocessing has been suggested before for MT, but not by using a grammar checker. Often postprocessing has targeted specific phenomena, such as correcting English determiners (Knight and Chander, 1994), merging German compounds (Stymne, 2009), or applying word substitution (Elming, 2006). The combination of a statistical MT system and a mainly rule-based grammar checker can also be viewed as a hybrid MT system, an area in which there has been much research; see e.g. Thurmair (2009) for an overview. Carter and Monz (2009) discuss the issue of applying tools that are developed for human texts, in their case statistical parsers, to SMT output.


Good      Original   Att ge nya befogenheter till en kommitté av ministrar främjas genom en oansvarigt sekretariat skulle inte utgör någon typ av framsteg . . .
          Changed    Att ge nya befogenheter till en kommitté av ministrar främjas genom ett oansvarigt sekretariat skulle inte ha utgjort någon typ av framsteg . . .
Neutral   Original   Det är viktigt att fylla den kulturella vakuum mellan våra två regioner
          Changed    Det är viktigt att fylla ett kulturellt vakuum mellan våra två regioner
          Correct    Det är viktigt att fylla det kulturella vakuumet mellan våra två regioner
          Original   Jag hör ibland sägas att rådet är så engagerade i Berlin . . .
          Changed    Jag hör ibland sägas att rådet är så engagerat i Berlin . . .
Bad       Original   Skulle det inte vara värt att ansvar på alla nivåer i den beslutsfattande processen tydligare, snarare än att försöka gå framåt . . .
          Changed    Skulle det inte vara värt att ansvar på alla nivåer i det beslutsfattandet processen tydligare, snarare än att försöka gå framåt . . .
          Original   Dokumentet kommer att överlämnas till europeiska rådet i Biarritz i några days' tid .
          Changed    Dokumentet kommer att överlämnas till europeiska rådet i Biarritz i några daysar tid .
          Correct    Dokumentet kommer att överlämnas till europeiska rådet i Biarritz i några dagars tid .

Table 7: Some examples of error corrections from the different categories

7. Conclusion and Future Work

We have explored the use of a grammar checker for MT, and shown that it can be useful both for evaluation and for postprocessing. A larger-scale investigation with a more thorough error analysis would be useful, both for Swedish with the Granska grammar checker and for other languages and tools. In particular we want to see if the results are as useful for other architectures of the MT system and of the grammar checker as for the combination of a statistical MT system and a mainly rule-based grammar checker. We also plan to apply grammar checkers for other languages on standard datasets such as that of the WMT shared task[2] (Callison-Burch et al., 2009), where we could correlate the grammar checker performance with human judgments and several automatic metrics for many different MT systems.

[2] The shared translation task at the Workshop on Statistical Machine Translation, http://www.statmt.org/wmt09/translation-task.html

Using the grammar checker for evaluation gives a complementary picture to the Bleu and TER metrics, since Granska accounts for fluency to a higher extent. Granska needs to be combined with some other metric to account for adequacy, however. A possibility for future research would be to combine a grammar checker and some measure of adequacy into one metric.

In the postprocessing scenario we used the grammar checker as a black box, with good results. The grammar checker was, however, only able to find a small proportion of all errors present in the SMT output, since it was not developed with SMT in mind. One possibility to find more errors is to extend the rules in the grammar checker. Another possibility would be a tighter integration between the SMT system and the grammar checker. As an example, the grammar checker tags the translation output, which is error-prone. Instead, part-of-speech tags from factored translation systems could be used directly by a grammar checker, without re-tagging. A third option would be to develop a new grammar checker targeted at MT errors rather than human errors, which could be either a stand-alone tool or a module in an MT system.

8. References

Chris Callison-Burch, Philipp Koehn, Christof Monz, and Josh Schroeder. 2009. Findings of the 2009 Workshop on Statistical Machine Translation. In Proceedings of the Fourth Workshop on Statistical Machine Translation, pages 1–28, Athens, Greece.

Johan Carlberger and Viggo Kann. 1999. Implementing an efficient part-of-speech tagger. Software Practice and Experience, 29:815–832.

Simon Carter and Christof Monz. 2009. Parsing statistical machine translation output. In Proceedings of the 4th Language & Technology Conference, Poznań, Poland.

Michiel de Koning. 1996. Bringing controlled language support to the desktop. In Proceedings of the EAMT Machine Translation Workshop, pages 11–20, Vienna, Austria.

Rickard Domeij, Ola Knutsson, Johan Carlberger, and Viggo Kann. 2000. Granska – an efficient hybrid system for Swedish grammar checking. In Proceedings of the 12th Nordic Conference on Computational Linguistics (Nodalida'99), pages 49–56, Trondheim, Norway.

Jakob Elming. 2006. Transformation-based correction of rule-based MT. In Proceedings of the 11th Annual Conference of the European Association for Machine Translation, pages 219–226, Oslo, Norway.

Jesús Giménez and Lluís Márquez. 2008. A smorgasbord of features for automatic MT evaluation. In Proceedings of the Third Workshop on Statistical Machine Translation, pages 195–198, Columbus, Ohio, USA.

Viggo Kann, Rickard Domeij, Joachim Hollman, and Mikael Tillenius. 2001. Implementation aspects and applications of a spelling correction algorithm. Quantitative Linguistics, 60:108–123.

Kevin Knight and Ishwar Chander. 1994. Automated postediting of documents. In Proceedings of the 12th National Conference of the American Association for Artificial Intelligence, Seattle, Washington, USA.

Ola Knutsson. 2001. Automatisk språkgranskning av svensk text. Licentiate thesis, Royal Institute of Technology (KTH), Stockholm, Sweden.

Philipp Koehn and Hieu Hoang. 2007. Factored translation models. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 868–876, Prague, Czech Republic.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the ACL, demonstration session, pages 177–180, Prague, Czech Republic.

Philipp Koehn. 2005. Europarl: A parallel corpus for statistical machine translation. In Proceedings of MT Summit X, pages 79–86, Phuket, Thailand.

Alon Lavie and Abhaya Agarwal. 2007. METEOR: An automatic metric for MT evaluation with high levels of correlation with human judgments. In Proceedings of the Second Workshop on Statistical Machine Translation, pages 228–231, Prague, Czech Republic.

Teruko Mitamura. 1999. Controlled language for multilingual machine translation. In Proceedings of MT Summit VII, pages 46–52, Singapore.

Eric Nyberg, Teruko Mitamura, and Willem-Olaf Huijsen. 2003. Controlled language for authoring and translation. In Harold Somers, editor, Computers and Translation. A translator's guide, pages 245–282. John Benjamins, Amsterdam.

Franz Josef Och. 2003. Minimum error rate training in statistical machine translation. In Proceedings of the 41st Annual Meeting of the ACL, pages 160–167, Sapporo, Japan.

Karolina Owczarzak, Josef van Genabith, and Andy Way. 2007. Dependency-based automatic evaluation for machine translation. In Proceedings of the Workshop on Syntax and Structure in Statistical Translation, pages 80–87, Rochester, New York, USA.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the ACL, pages 311–318, Philadelphia, Pennsylvania, USA.

Maja Popović and Hermann Ney. 2009. Syntax-oriented evaluation measures for machine translation output. In Proceedings of the Fourth Workshop on Statistical Machine Translation, pages 29–32, Athens, Greece.

Anna Sågvall Hein. 1997. Language control and machine translation. In Proceedings of the 7th International Conference on Theoretical and Methodological Issues in Machine Translation, pages 103–110, Santa Fe, New Mexico, USA.

Matthew Snover, Bonnie Dorr, Richard Schwartz, Linnea Micciulla, and John Makhoul. 2006. A study of translation edit rate with targeted human annotation. In Proceedings of the 7th Conference of the Association for Machine Translation in the Americas, pages 223–231, Cambridge, Massachusetts, USA.

Matthew Snover, Nitin Madnani, Bonnie Dorr, and Richard Schwartz. 2009. Fluency, adequacy, or HTER? Exploring different human judgments with a tunable MT metric. In Proceedings of the Fourth Workshop on Statistical Machine Translation, pages 259–268, Athens, Greece.

Sylvana Sofkova Hashemi. 2007. Ambiguity resolution by reordering rules in text containing errors. In Proceedings of the 10th International Conference on Parsing Technologies, pages 69–79, Prague, Czech Republic.

Andreas Stolcke. 2002. SRILM – an extensible language modeling toolkit. In Proceedings of the Seventh International Conference on Spoken Language Processing, pages 901–904, Denver, Colorado, USA.

Sara Stymne, Maria Holmqvist, and Lars Ahrenberg. 2008. Effects of morphological analysis in translation between German and English. In Proceedings of the Third Workshop on Statistical Machine Translation, pages 135–138, Columbus, Ohio, USA.

Sara Stymne. 2009. A comparison of merging strategies for translation of German compounds. In Proceedings of the EACL 2009 Student Research Workshop, pages 61–69, Athens, Greece.

Gregor Thurmair. 2009. Comparing different architectures of hybrid machine translation systems. In Proceedings of MT Summit XII, pages 340–347, Ottawa, Ontario, Canada.
