Designing a Machine Translation System for Canadian Weather ...
Designing a Weather Warning Translation System
English:
SUMMARY FORECAST FOR WESTERN QUEBEC ISSUED BY ENVIRONMENT CANADA
MONTREAL AT 4.30 PM EST MONDAY 31 DECEMBER 2001 FOR TUESDAY 01 JANUARY 2002.
VARIABLE CLOUDINESS WITH FLURRIES. HIGH NEAR MINUS 7.

French:
RESUME DES PREVISIONS POUR L OUEST DU QUEBEC EMISES PAR ENVIRONNEMENT CANADA
MONTREAL 16H30 HNE LE LUNDI 31 DECEMBRE 2001 POUR MARDI LE 01 JANVIER 2002.
CIEL VARIABLE AVEC AVERSES DE NEIGE. MAX PRES DE MOINS 7.

Fig. 4. An example of an English weather forecast and its French translation.
text format typically seen on EC’s public warning website. If the client needs a file
in MTCN format, however, it can be trivially derived from Watt’s result.
Watt’s design started in 2009 within the Multi-format Environmental Information
Dissemination project outlined earlier. Preliminary discussions with Environment<br />
Canada allowed RALI to identify the needs of the government and to propose<br />
the principles of the solution presented in this paper. Afterward, we started gathering<br />
and preparing the corpora, which would be necessary to build Watt’s first<br />
prototype. This conceptually simple task proved exceedingly complex and required<br />
80% of the man-hours devoted to the project. The reasons for this are explained in
the following section. Suffice it to say that putting together enough textual material
for our needs proved challenging because of cryptic file formats and the state of
some of the warning archives we had to work with. We then trained a few variants
of a statistical MT engine with this data, and populated a sentence-based translation
memory. Five prototypes were successively submitted to our client and two of
those were formally tested in order to validate our design and their performance.
4 Data Preparation<br />
As we mentioned in the previous section, data preparation was both critical and<br />
challenging in this study. This data is used both to train the statistical MT engine
and to populate a sentence-based translation memory. The latter is an alternative
to the MT engine when a source sentence has already been encountered and translated
by humans. Naturally, when employing corpus-based approaches, as we do here,
gathering as much data as possible is important. In our case, we were interested in<br />
creating a bitext, i.e. an aligned corpus of corresponding sentences in French and<br />
English.<br />
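
The division of labour between the translation memory and the MT engine can be illustrated with a minimal Python sketch. It is not Watt’s implementation: the class name `TranslationMemory`, the helper `normalize`, and the `mt_engine` callback are hypothetical names introduced here, and exact matching after whitespace and case normalization is an assumption about how a sentence counts as "already encountered".

```python
from typing import Callable, Dict, Iterable, Tuple


def normalize(sentence: str) -> str:
    """Collapse whitespace and upper-case so trivially different copies match."""
    return " ".join(sentence.split()).upper()


class TranslationMemory:
    """Exact-match, sentence-based translation memory built from a bitext."""

    def __init__(self, bitext: Iterable[Tuple[str, str]]) -> None:
        self._memory: Dict[str, str] = {}
        for source, target in bitext:
            # Keep the first human translation seen for each source sentence.
            self._memory.setdefault(normalize(source), target)

    def translate(self, sentence: str, mt_engine: Callable[[str], str]) -> str:
        """Return the stored human translation, or fall back to the MT engine."""
        key = normalize(sentence)
        if key in self._memory:
            return self._memory[key]
        return mt_engine(sentence)


# Sentence pairs taken from the forecast in Fig. 4.
bitext = [
    ("VARIABLE CLOUDINESS WITH FLURRIES.", "CIEL VARIABLE AVEC AVERSES DE NEIGE."),
    ("HIGH NEAR MINUS 7.", "MAX PRES DE MOINS 7."),
]
memory = TranslationMemory(bitext)
```

A call such as `memory.translate("HIGH NEAR MINUS 7.", mt_engine)` returns the stored human translation without invoking `mt_engine`, whereas an unseen sentence is passed to the engine unchanged.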
Two types of Canadian meteorological texts were made available to us by EC:
weather forecasts and weather warnings.
Weather forecasts predicting meteorological conditions for a given region of
Canada are written in a telegraphic style, consisting of highly repetitive turns of
phrase. An example of such a forecast taken from (Langlais et al. 2005) is shown
in Fig. 4. Weather warnings are written in a “looser” style.
Since we want to translate the discussion part of these warnings, illustrated in Fig. 2,
our preferred raw material was MTCN files.
For both forecasts and warnings, the text is in capital letters, uses the ASCII
character set, and, most of the time, does not include diacritics or apostrophes.
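
To make this convention concrete, the following Python sketch converts ordinary French text into the capitalized, unaccented style seen in Fig. 4. It is only an illustration: the function name `to_bulletin_style` is hypothetical, and replacing apostrophes with a space is an assumption inferred from "L OUEST" in the figure, not a documented rule of the archives.

```python
import unicodedata


def to_bulletin_style(text: str) -> str:
    """Upper-case text and drop diacritics and apostrophes, as in Fig. 4."""
    # Decompose accented characters (e.g. "é" becomes "e" plus a combining accent),
    # then drop the combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Assumption: apostrophes (straight or typographic) become a space.
    for apostrophe in ("'", "\u2019"):
        stripped = stripped.replace(apostrophe, " ")
    # Upper-case and collapse any repeated whitespace.
    return " ".join(stripped.upper().split())
```

Characters without a canonical decomposition, such as the ligature "œ", are left untouched by this sketch and would need an explicit mapping.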