22.11.2014 Views

Designing a Machine Translation System for Canadian Weather ...

Designing a Machine Translation System for Canadian Weather ...

Designing a Machine Translation System for Canadian Weather ...

SHOW MORE
SHOW LESS

You also want an ePaper? Increase the reach of your titles

YUMPU automatically turns print PDFs into web optimized ePapers that Google loves.

<strong>Designing</strong> a <strong>Weather</strong> Warning <strong>Translation</strong> <strong>System</strong> 26<br />

Strong easterly winds with gusts up to 100 km/h are <strong>for</strong>ecast to<br />

develop in the Pincher Creek region this morning and then spread<br />

eastward throughout the day. The strong winds will gradually<br />

diminish this evening but will redevelop on Wednesday.<br />

Fig. 8. An excerpt of a discussion as it appeared on EC’s public weather warning<br />

website. We underlined mixed case phrases.<br />

6.2 Truecasing<br />

6.2.1 Principle<br />

An example of the end result of the truecasing step is shown in Fig. 1, whose text<br />

is reproduced in Fig. 8 <strong>for</strong> convenience, in which the tokens with a mixed case are<br />

underlined. On the one hand, we observe that case restoration involves conceptually<br />

simple rule-based trans<strong>for</strong>mations, like the capitalization of the first word of a<br />

sentence or that of a weekday or month. We readily implemented these rules. On<br />

the other hand, truecasing the numerous names of places and other named entities is<br />

trickier. It actually consists of two problems: per<strong>for</strong>ming named entity recognition,<br />

and then properly truecasing these named entities. Named entity recognition is<br />

made difficult by the lack of case in<strong>for</strong>mation.<br />

For these reasons, and <strong>for</strong> the sake of simplicity, we chose a gazetteer-based<br />

approach to truecasing. We started by compiling a list as exhaustive as possible of<br />

all mixed case expressions susceptible to appear in Watt’s translations. We will<br />

describe this preliminary step in detail in Section 6.2.2.<br />

We compiled two lists, one <strong>for</strong> each output language, and made sure these lists<br />

spelled each place name with its correct case. We added to these lists other capitalized<br />

words (e.g. “Environment Canada”), extracted from sources described in<br />

Section 6.2.2. An excerpt of the English entries is shown in Table 9.<br />

With these lists at our disposal, it is then possible to scan the raw translations<br />

in order to find these expressions in their raw <strong>for</strong>m. Whenever a match is found,<br />

the raw translation is replaced with the truecased expression. Each list is compiled<br />

into a trie in order to speed up the matching step. We use a greedy algorithm that<br />

scans the raw translation from left to right to find the longest matching list entry.<br />

For instance, the raw “AN AREA NEAR METSO A CHOOT INDIAN RESERVE 23” will<br />

only match Table 9’s “Metso A Choot Indian Reserve 23” entry, and not shorter<br />

expressions.<br />

It is noteworthy that our lists hold entries that contain diacritical marks (e.g.<br />

“Metro Montréal” in Table 9). This allows a few final additions to the accenting<br />

step of French text, described earlier (see Section 6.1). As <strong>for</strong> English text, this<br />

single step is sufficient to restore the few diacritical marks necessary and no further<br />

processing was deemed necessary.

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!