Designing a Machine Translation System for Canadian Weather ...
Designing a Machine Translation System for Canadian Weather ...
Designing a Machine Translation System for Canadian Weather ...
You also want an ePaper? Increase the reach of your titles
YUMPU automatically turns print PDFs into web optimized ePapers that Google loves.
<strong>Designing</strong> a <strong>Weather</strong> Warning <strong>Translation</strong> <strong>System</strong> 26<br />
Strong easterly winds with gusts up to 100 km/h are <strong>for</strong>ecast to<br />
develop in the Pincher Creek region this morning and then spread<br />
eastward throughout the day. The strong winds will gradually<br />
diminish this evening but will redevelop on Wednesday.<br />
Fig. 8. An excerpt of a discussion as it appeared on EC’s public weather warning<br />
website. We underlined mixed case phrases.<br />
6.2 Truecasing<br />
6.2.1 Principle<br />
An example of the end result of the truecasing step is shown in Fig. 1, whose text<br />
is reproduced in Fig. 8 <strong>for</strong> convenience, in which the tokens with a mixed case are<br />
underlined. On the one hand, we observe that case restoration involves conceptually<br />
simple rule-based trans<strong>for</strong>mations, like the capitalization of the first word of a<br />
sentence or that of a weekday or month. We readily implemented these rules. On<br />
the other hand, truecasing the numerous names of places and other named entities is<br />
trickier. It actually consists of two problems: per<strong>for</strong>ming named entity recognition,<br />
and then properly truecasing these named entities. Named entity recognition is<br />
made difficult by the lack of case in<strong>for</strong>mation.<br />
For these reasons, and <strong>for</strong> the sake of simplicity, we chose a gazetteer-based<br />
approach to truecasing. We started by compiling a list as exhaustive as possible of<br />
all mixed case expressions susceptible to appear in Watt’s translations. We will<br />
describe this preliminary step in detail in Section 6.2.2.<br />
We compiled two lists, one <strong>for</strong> each output language, and made sure these lists<br />
spelled each place name with its correct case. We added to these lists other capitalized<br />
words (e.g. “Environment Canada”), extracted from sources described in<br />
Section 6.2.2. An excerpt of the English entries is shown in Table 9.<br />
With these lists at our disposal, it is then possible to scan the raw translations<br />
in order to find these expressions in their raw <strong>for</strong>m. Whenever a match is found,<br />
the raw translation is replaced with the truecased expression. Each list is compiled<br />
into a trie in order to speed up the matching step. We use a greedy algorithm that<br />
scans the raw translation from left to right to find the longest matching list entry.<br />
For instance, the raw “AN AREA NEAR METSO A CHOOT INDIAN RESERVE 23” will<br />
only match Table 9’s “Metso A Choot Indian Reserve 23” entry, and not shorter<br />
expressions.<br />
It is noteworthy that our lists hold entries that contain diacritical marks (e.g.<br />
“Metro Montréal” in Table 9). This allows a few final additions to the accenting<br />
step of French text, described earlier (see Section 6.1). As <strong>for</strong> English text, this<br />
single step is sufficient to restore the few diacritical marks necessary and no further<br />
processing was deemed necessary.