05.01.2017 Views

December 11-16 2016 Osaka Japan

W16-46

W16-46

SHOW MORE
SHOW LESS

Create successful ePaper yourself

Turn your PDF publications into a flip-book with our unique Google optimized e-Paper software.

Chinese-to-<strong>Japan</strong>ese Patent Machine Translation based on Syntactic<br />

Pre-ordering for WAT 20<strong>16</strong><br />

Katsuhito Sudoh and Masaaki Nagata<br />

Communication Science Laboratories, NTT Corporation<br />

2-4 Hikaridai, Seika-cho, Soraku-gun, Kyoto 619-0237, <strong>Japan</strong><br />

sudoh.katsuhito@lab.ntt.co.jp<br />

Abstract<br />

This paper presents our Chinese-to-<strong>Japan</strong>ese patent machine translation system for WAT 20<strong>16</strong><br />

(Group ID: ntt) that uses syntactic pre-ordering over Chinese dependency structures. Chinese<br />

words are reordered by a learning-to-rank model based on pairwise classification to obtain word<br />

order close to <strong>Japan</strong>ese. In this year’s system, two different machine translation methods are compared:<br />

traditional phrase-based statistical machine translation and recent sequence-to-sequence<br />

neural machine translation with an attention mechanism. Our pre-ordering showed a significant<br />

improvement over the phrase-based baseline, but, in contrast, it degraded the neural machine<br />

translation baseline.<br />

1 Introduction<br />

Patent documents, which are well-structured written texts that describe the technical details of inventions,<br />

are expected to have almost no semantic ambiguities caused by indirect or rhetorical expressions. Therefore,<br />

they are good candidates for literal translation, which most machine translation (MT) approaches<br />

aim to do.<br />

One technical challenge for patent machine translation is the complex syntactic structure of patent<br />

documents, which typically have long sentences that complicate MT reordering, especially for word<br />

order in distant languages. Chinese and <strong>Japan</strong>ese have similar word order in noun modifiers but different<br />

subject-verb-object order, requiring long distance reordering in translation. In the WAT 20<strong>16</strong> evaluation<br />

campaign (Nakazawa et al., 20<strong>16</strong>), we participated in a Chinese-to-<strong>Japan</strong>ese patent translation task and<br />

tackled long distance reordering by syntactic pre-ordering based on Chinese dependency structures, as in<br />

our last year’s system (Sudoh and Nagata, 2015). We also use a recent neural MT as the following MT<br />

implementation for comparison with a traditional phrase-based statistical MT.<br />

Our system basically consists of three components: Chinese syntactic analysis (word segmentation,<br />

part-of-speech (POS) tagging, and dependency parsing) adapted to patent documents; dependency-based<br />

syntactic pre-ordering with hand-written rules or a learning-to-rank model; and the following MT component<br />

(phrase-based MT or neural MT). This paper describes our system’s details and discusses our<br />

evaluation results.<br />

2 System Overview<br />

Figure 1 shows a brief workflow of our Chinese-to-<strong>Japan</strong>ese MT system. Its basic architecture is standard<br />

with syntactic pre-ordering. Input sentences are first applied to word segmentation and POS tagging,<br />

parsed into dependency trees, reordered using pre-ordering rules or a pre-ordering model, and finally<br />

translated into <strong>Japan</strong>ese by MT.<br />

3 Chinese Syntactic Analysis: Word Segmentation, Part-of-Speech Tagging, and<br />

Dependency Parsing<br />

Word segmentation and POS tagging are solved jointly (Suzuki et al., 2012) for better Chinese word segmentation<br />

based on POS tag sequences. The dependency parser produces untyped dependency trees. The<br />

2<strong>11</strong><br />

Proceedings of the 3rd Workshop on Asian Translation,<br />

pages 2<strong>11</strong>–215, <strong>Osaka</strong>, <strong>Japan</strong>, <strong>December</strong> <strong>11</strong>-17 20<strong>16</strong>.

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!