22.11.2014 Views

Designing a Machine Translation System for Canadian Weather ...



Designing a Weather Warning Translation System

English:
SUMMARY FORECAST FOR WESTERN QUEBEC ISSUED BY ENVIRONMENT CANADA
MONTREAL AT 4.30 PM EST MONDAY 31 DECEMBER 2001 FOR TUESDAY 01 JANUARY 2002.
VARIABLE CLOUDINESS WITH FLURRIES. HIGH NEAR MINUS 7.

French:
RESUME DES PREVISIONS POUR L OUEST DU QUEBEC EMISES PAR ENVIRONNEMENT CANADA
MONTREAL 16H30 HNE LE LUNDI 31 DECEMBRE 2001 POUR MARDI LE 01 JANVIER 2002.
CIEL VARIABLE AVEC AVERSES DE NEIGE. MAX PRES DE MOINS 7.

Fig. 4. An example of an English weather forecast and its French translation.

text format typically seen on EC’s public warning website. If the client needs a file in MTCN format, however, it can be trivially derived from Watt’s result.

Watt’s design started in 2009 within the Multi-format Environmental Information Dissemination project outlined earlier. Preliminary discussions with Environment Canada allowed RALI to identify the needs of the government and to propose the principles of the solution presented in this paper. Afterward, we started gathering and preparing the corpora needed to build Watt’s first prototype. This conceptually simple task proved exceedingly complex and required 80% of the man-hours devoted to the project. The reasons for this are explained in the following section. Suffice it to say that putting together enough textual material for our needs proved challenging because of cryptic file formats and the state of some of the warning archives we had to work with. We then trained a few variants of a statistical MT engine on this data and populated a sentence-based translation memory. Five prototypes were successively submitted to our client, and two of those were formally tested in order to validate our design and their performance.

4 Data Preparation

As we mentioned in the previous section, data preparation was both critical and challenging in this study. This data is used to train the statistical MT engine and to populate a sentence-based translation memory. The latter is an alternative to the MT engine when a source sentence has already been encountered and translated by humans. Naturally, as with any corpus-based approach, gathering as much data as possible is important. In our case, we were interested in creating a bitext, i.e. an aligned corpus of corresponding sentences in French and English.
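The interplay between the translation memory and the MT engine can be illustrated with a minimal sketch. It assumes the memory is a plain dictionary keyed on whitespace-normalized source sentences and that the MT engine is any callable from a source string to a target string. The names `build_translation_memory`, `translate` and `mt_engine` are hypothetical and do not come from Watt's implementation.

```python
from typing import Callable, Dict, Iterable, Tuple


def _normalize(sentence: str) -> str:
    """Collapse runs of whitespace so trivially different spacings match."""
    return " ".join(sentence.split())


def build_translation_memory(bitext: Iterable[Tuple[str, str]]) -> Dict[str, str]:
    """Index a bitext (pairs of aligned source/target sentences) by source sentence.

    Assumption: when a source sentence occurs more than once, the first
    human translation seen is kept.
    """
    memory: Dict[str, str] = {}
    for source, target in bitext:
        memory.setdefault(_normalize(source), target)
    return memory


def translate(
    sentence: str,
    memory: Dict[str, str],
    mt_engine: Callable[[str], str],
) -> Tuple[str, str]:
    """Return (translation, origin), where origin is "memory" or "mt".

    The memory is consulted first; the MT engine is only called for
    sentences that have never been translated by humans.
    """
    key = _normalize(sentence)
    if key in memory:
        return memory[key], "memory"
    return mt_engine(sentence), "mt"


# Toy bitext taken from the forecast shown in Fig. 4.
bitext = [
    ("VARIABLE CLOUDINESS WITH FLURRIES.", "CIEL VARIABLE AVEC AVERSES DE NEIGE."),
    ("HIGH NEAR MINUS 7.", "MAX PRES DE MOINS 7."),
]
memory = build_translation_memory(bitext)
```

Calling `translate("HIGH NEAR MINUS 7.", memory, mt_engine)` returns the stored human translation, while an unseen sentence is passed to `mt_engine`. Exact matching is the simplest possible lookup policy and is used here only for illustration.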

Two types of Canadian meteorological texts were made available to us by EC: weather forecasts and weather warnings.

Weather forecasts predicting meteorological conditions for a given region of Canada are written in a telegraphic style, consisting of highly repetitive turns of phrase. An example of such a forecast taken from (Langlais et al. 2005) is shown in Fig. 4. Weather warnings are written in a “looser” style.

Since we want to translate the discussion part of these warnings, illustrated in Fig. 2, our preferred raw material was MTCN files.

For both forecasts and warnings, the text is in capital letters and uses the ASCII character set; most of the time, it includes neither diacritics nor apostrophes.
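To make this convention concrete, the following sketch converts ordinary French text into the uppercase, diacritic-free ASCII style described above. It is an illustration only: the function name `to_bulletin_style` is hypothetical, and replacing apostrophes with a space is an assumption inferred from "L OUEST" in Fig. 4, not a documented EC rule.

```python
import unicodedata


def to_bulletin_style(text: str) -> str:
    """Approximate the uppercase ASCII style of the forecasts and warnings.

    Steps (all assumptions made for illustration):
      1. decompose accented letters and drop the combining marks;
      2. replace straight and curly apostrophes with a space;
      3. drop any remaining non-ASCII character (note that ligatures
         such as "œ" are simply removed by this step);
      4. uppercase and collapse whitespace.
    """
    decomposed = unicodedata.normalize("NFD", text)
    no_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    no_apostrophes = no_marks.replace("'", " ").replace("\u2019", " ")
    ascii_only = no_apostrophes.encode("ascii", "ignore").decode("ascii")
    return " ".join(ascii_only.upper().split())


example = to_bulletin_style("Résumé des prévisions pour l'ouest du Québec")
```

Here `example` holds `RESUME DES PREVISIONS POUR L OUEST DU QUEBEC`, which matches the opening of the French forecast in Fig. 4. Going in the opposite direction, i.e. restoring diacritics and apostrophes, is much harder and is not attempted by this sketch.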
