Designing a Machine Translation System for Canadian Weather ...

Designing a Weather Warning Translation System 12

sentences contained at least one error. While preparing the corpus, we also found unfaithful translations, although they were extremely rare and should be considered negligible.

Therefore, it does seem that while the translations are of high quality overall, a significant proportion of sentences contain errors, a fact that should be taken into account when relying on the corpus for translating.

4.4.2 Inconsistencies in Format and Phrasing

There are some phrasing inconsistencies in the corpus that required manual intervention. For instance, numbers are usually written with digits, but are sometimes written using French and English numerals (e.g. “TEN”). Additionally, negative numbers are usually prefixed with the word “MINUS”, but they are sometimes written with a minus sign in front. Hours presented a similar problem: the separator between hours and minutes sometimes changed from a colon to a period for certain warnings. Units were either spelled out in full (“MILLIMETRE”) or written using their corresponding symbol. Floating-point numbers were sometimes found to use a comma instead of a period to mark the fractional part.

The text format of the MTCNs also proved to be inconsistent depending on the source we used (see Section 4.3). For instance, the apostrophe may or may not appear in some words (like “AUJOURD’HUI” in French, or after elided words like “D’ORAGES”).

We attempted to normalize the text by including standardization rules in the tokenizer we wrote for this study. Multiple iterations of the code were necessary as we received more and more text over time and discovered new problems.
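Standardization rules of the kind described above could be sketched as a small set of regular-expression rewrites. The function and tables below are hypothetical illustrations, not the study's actual tokenizer; the numeral and unit tables are assumed subsets.

```python
import re

# Spelled-out numerals mapped to digits (hypothetical subset; the full
# table a real tokenizer would need is much larger).
NUMERALS = {"TEN": "10", "DIX": "10", "TWENTY": "20", "VINGT": "20"}

# Unit words mapped to their symbols (again, an assumed subset).
UNITS = {"MILLIMETRES": "MM", "MILLIMETRE": "MM", "CENTIMETRES": "CM"}

def normalize(text: str) -> str:
    """Standardize one uppercase warning sentence (illustrative only)."""
    # Numerals -> digits ("TEN" -> "10").
    for word, digit in NUMERALS.items():
        text = re.sub(rf"\b{word}\b", digit, text)
    # Leading minus sign -> the word "MINUS" ("-5" -> "MINUS 5").
    # The lookbehind avoids rewriting hyphens inside ranges like "5-10".
    text = re.sub(r"(?<!\w)-(\d)", r"MINUS \1", text)
    # Period between hours and minutes -> colon, only when an AM/PM
    # marker follows, to avoid clobbering decimals ("4.30 PM" -> "4:30 PM").
    text = re.sub(r"\b(\d{1,2})\.(\d{2})(?=\s+[AP]M\b)", r"\1:\2", text)
    # Decimal comma -> decimal period ("2,5" -> "2.5").
    text = re.sub(r"(\d),(\d)", r"\1.\2", text)
    # Unit words -> symbols, longest names first so "MILLIMETRES" is not
    # partially matched by the shorter "MILLIMETRE" rule.
    for word in sorted(UNITS, key=len, reverse=True):
        text = re.sub(rf"\b{word}\b", UNITS[word], text)
    return text

print(normalize("RAINFALL OF TWENTY MILLIMETRES NEAR -5"))
```

Ordering matters in such a pipeline: the time-separator rule must run before the decimal-comma rule, for example, so that each rewrite sees the notation it expects.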

4.4.3 Serialization Differences

When serializing the corpus using the procedure explained in Section 2, we discovered discrepancies between the serialized tokens produced by each member of a sentence pair. For instance, the phrase “DE LA GRELE DE LA TAILLE D UNE BALLE DE GOLF” is associated with the English phrase “45 MM HAIL”. The latter will be serialized to “NUM MM HAIL” while the French version will be left untouched. This serialization disagreement becomes a problem when deserializing a translated sentence using the procedure explained in Section 4.1. We are then forced to produce a translation which cannot be deserialized because there is no corresponding source material.
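A pair with this kind of disagreement can be detected mechanically: both members should yield the same number of placeholders after serialization. The sketch below uses a deliberately minimal stand-in serializer (a single `NUM` placeholder for digit strings); the study's real serializer handles more token classes, and the function names here are illustrative.

```python
import re

def serialize(sentence: str) -> str:
    """Minimal stand-in serializer: every digit string becomes NUM."""
    return re.sub(r"\d+(\.\d+)?", "NUM", sentence)

def placeholder_count(sentence: str) -> int:
    """Number of NUM placeholders produced for one sentence."""
    return serialize(sentence).split().count("NUM")

def is_consistent(src: str, tgt: str) -> bool:
    """True when both members of a pair yield the same number of
    placeholders, so the pair can be serialized and later deserialized."""
    return placeholder_count(src) == placeholder_count(tgt)

# "45 MM HAIL" serializes to "NUM MM HAIL", but the French member spells
# the hail size out, so it yields no placeholder: the pair is flagged.
print(is_consistent("45 MM HAIL",
                    "DE LA GRELE DE LA TAILLE D UNE BALLE DE GOLF"))
```

A check like this makes it cheap to flag asymmetric pairs automatically rather than discovering them at deserialization time.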

We initially encountered quite a lot of these discrepancies and solved most of them by adding rules to the tokenizer. However, we did not solve every problem: 10k sentence pairs could not be serialized and were added verbatim to the bitext in order to reduce the number of out-of-vocabulary tokens. Most of these failures were caused by hail sizes and sentence alignment errors. We provide further details about the problems caused by hail sizes in Section 7.1.
