Designing a Weather Warning Translation System
sentences contained at least one error. While preparing the corpus, we also found unfaithful translations, although they were extremely rare and can be considered negligible.

Therefore, while the translations are of high quality overall, a significant proportion of sentences contain errors, a fact that should be taken into account when relying on the corpus for translation.
4.4.2 Inconsistencies in Format and Phrasing
There are some phrasing inconsistencies in the corpus that required manual intervention. For instance, numbers are usually written with digits, but are sometimes written out as French or English numerals (e.g. “TEN”). Additionally, negative numbers are usually prefixed with the word “MINUS”, but sometimes appear with a minus sign instead. Hours presented a similar problem: for certain warnings, the separator between hours and minutes changed from a colon to a period. Units were either spelled out in full (“MILLIMETRE”) or written with their corresponding symbol. Floating-point numbers sometimes used a comma instead of a period to mark the fractional part.
The text format of the MTCNs also proved inconsistent depending on the source we used (see Section 4.3). For instance, the apostrophe may or may not appear in some words (such as “AUJOURD’HUI” in French, or after elided words like “D’ORAGES”).
We attempted to normalize the text by including standardization rules in the tokenizer we wrote for this study. Multiple iterations of the code were necessary, as we received more text over time and discovered new problems.
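The kinds of standardization rules described above can be sketched as follows. This is a minimal illustration, not the study's actual tokenizer: the rule set, the placeholder word lists, and the regular expressions are assumptions, and a real implementation would need context to disambiguate, e.g., hours from decimal numbers.

```python
import re

# Partial, illustrative word lists (assumptions, not the study's actual tables).
NUMERALS = {
    "ZERO": "0", "ONE": "1", "TWO": "2", "TEN": "10",   # English numerals
    "DIX": "10", "VINGT": "20",                          # French numerals
}
UNITS = {"MILLIMETRE": "MM", "MILLIMETRES": "MM", "CENTIMETRE": "CM"}

def normalize(text: str) -> str:
    # Spell negative numbers uniformly with "MINUS" rather than a minus sign.
    text = re.sub(r"-(\d)", r"MINUS \1", text)
    # Standardize the hour/minute separator to a colon (e.g. "16.30" -> "16:30").
    # NOTE: a period can also mark a fraction; real code needs more context here.
    text = re.sub(r"\b(\d{1,2})\.([0-5]\d)\b", r"\1:\2", text)
    # Standardize the fractional-part marker to a period (e.g. "3,5" -> "3.5").
    text = re.sub(r"\b(\d+),(\d+)\b", r"\1.\2", text)
    # Map spelled-out numerals and unit names to their canonical forms.
    out = []
    for token in text.split():
        token = NUMERALS.get(token, token)
        token = UNITS.get(token, token)
        out.append(token)
    return " ".join(out)
```

As the text notes, such rules accumulate iteratively: each new batch of warnings tends to surface variants the current rule set does not yet cover.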
4.4.3 Serialization Differences
When serializing the corpus using the procedure explained in Section 2, we discovered discrepancies between the serialized tokens produced for each member of a sentence pair. For instance, the French phrase “DE LA GRELE DE LA TAILLE D UNE BALLE DE GOLF” is paired with the English phrase “45 MM HAIL”. The latter is serialized to “NUM MM HAIL”, while the French version is left untouched. This serialization disagreement becomes a problem when deserializing a translated sentence using the procedure explained in Section 4.1: we are forced to produce a translation that cannot be deserialized, because there is no corresponding source material.
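The disagreement can be made concrete with a small sketch. The "NUM" placeholder follows the example given above, but the exact placeholder format and the check itself are assumptions for illustration, not the study's implementation.

```python
import re

# Matches integers and floating-point numbers (period as fractional marker).
NUM_RE = re.compile(r"\b\d+(?:\.\d+)?\b")

def serialize(sentence: str) -> str:
    """Replace each number with a NUM placeholder (placeholder format assumed)."""
    return NUM_RE.sub("NUM", sentence)

def serialization_agrees(src: str, tgt: str) -> bool:
    """A pair agrees when both sides contain the same count of serializable
    numbers, so every placeholder in a translation has source material
    from which it can be deserialized."""
    return len(NUM_RE.findall(src)) == len(NUM_RE.findall(tgt))

# The hail-size example from the text: the English side contains a number,
# while the French side describes the size by comparison, so the counts differ.
fr = "DE LA GRELE DE LA TAILLE D UNE BALLE DE GOLF"
en = "45 MM HAIL"
```

A check along these lines flags pairs whose two sides would serialize differently, which is exactly the situation that makes deserialization of a translated sentence impossible.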
We initially encountered many such discrepancies and solved most of them by adding rules to the tokenizer. However, we did not solve every problem: 10,000 sentence pairs could not be serialized and were added verbatim to the bitext in order to reduce the number of out-of-vocabulary tokens. Most of these failures were caused by hail sizes and sentence alignment errors. We provide further details about the problems caused by hail sizes in Section 7.1.