
tekom-Jahrestagung 2012 - ActiveDoc



Sprachtechnologie / Language Technology<br />

A hardened service layer architecture<br />

The first and most important task was to create a stable production environment.<br />

To achieve this, a service layer was put on top of Moses SMT,<br />

providing redundancy, load balancing, asynchronous processing, failover<br />

support, industry-standard document format support, alignment-based<br />

tag-handling, improved normalization and hardened (de)tokenization<br />

and (lower/real)casing.<br />
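To illustrate the failover and load-balancing part of such a service layer, here is a minimal sketch. It is not the implementation described in this paper. It assumes that each redundant decoder instance is reachable through a plain callable; the names `Backend` and `BackendPool` are hypothetical.

```python
import itertools
from typing import Callable, Sequence

# Hypothetical type: a backend takes a source sentence and returns a translation.
Backend = Callable[[str], str]


class BackendPool:
    """Round-robin load balancing with failover over redundant backends."""

    def __init__(self, backends: Sequence[Backend]):
        if not backends:
            raise ValueError("at least one backend is required")
        self._backends = list(backends)
        self._next = itertools.cycle(range(len(self._backends)))

    def translate(self, sentence: str) -> str:
        errors = []
        # Start at the next backend in rotation, then try each one at most once.
        start = next(self._next)
        for offset in range(len(self._backends)):
            backend = self._backends[(start + offset) % len(self._backends)]
            try:
                return backend(sentence)
            except Exception as exc:  # a real service would catch narrower errors
                errors.append(exc)
        raise RuntimeError(f"all {len(errors)} backends failed")
```

Successive requests start at different backends, which spreads the load, and a failing backend is skipped in favour of the next one. Asynchronous processing and health checks are deliberately left out of this sketch.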

To achieve a separation of concerns, linguistic processing by Moses SMT<br />

and text-engineering capabilities were strictly separated. As a result,<br />

the hardened Moses SMT set-up allows the deployment of third-party<br />

translation and language models, while still providing the text engineering<br />

capabilities built on top of the translation workflow. Named entity<br />

recognition (NER) and terminology management services, for example,<br />

can be added without disrupting the Moses SMT models. From a technical<br />

point of view, tagging and annotation are considered to be engineering<br />

issues that must not interfere with the linguistic issues<br />

addressed by the SMT models. From a commercial point of view, clients<br />

can have third parties focus on linguistic quality, while the framework<br />

can still take care of immediate production use.<br />
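The separation between text engineering and linguistic processing can be sketched for the case of alignment-based tag handling. The example below is a simplified illustration under stated assumptions, not the framework's actual code: inline tags are assumed to be well nested and already tokenized, the decoder is assumed to return a source-to-target word alignment, and `strip_tags` and `reinsert_tags` are hypothetical helper names.

```python
import re
from typing import Dict, List, Tuple

TAG = re.compile(r"</?\w+>")
Span = Tuple[str, int, int]  # (tag name, start, end) over the plain source tokens


def strip_tags(tokens: List[str]) -> Tuple[List[str], List[Span]]:
    """Separate plain words from inline tags before translation."""
    plain: List[str] = []
    spans: List[Span] = []
    open_tags: List[Tuple[str, int]] = []
    for tok in tokens:
        if TAG.fullmatch(tok):
            if tok.startswith("</"):
                name, start = open_tags.pop()
                spans.append((name, start, len(plain)))  # end is exclusive
            else:
                open_tags.append((tok.strip("</>"), len(plain)))
        else:
            plain.append(tok)
    return plain, spans


def reinsert_tags(target: List[str], spans: List[Span],
                  alignment: Dict[int, int]) -> List[str]:
    """Wrap the target tokens aligned to each source span in that span's tag."""
    opens: Dict[int, List[str]] = {}
    closes: Dict[int, List[str]] = {}
    for name, start, end in spans:  # inner spans come first in this list
        aligned = [alignment[i] for i in range(start, end) if i in alignment]
        if not aligned:
            continue  # unaligned span: a real system would need a fallback
        opens.setdefault(min(aligned), []).insert(0, f"<{name}>")
        closes.setdefault(max(aligned), []).append(f"</{name}>")
    out: List[str] = []
    for j, tok in enumerate(target):
        out.extend(opens.get(j, []))
        out.append(tok)
        out.extend(closes.get(j, []))
    return out
```

The decoder only ever sees the plain tokens returned by `strip_tags`, so the translation and language models do not need to know about markup. Tags whose target spans would cross after reordering are not handled here.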

A pipeline architecture<br />

If we take a closer look at the evolution of Moses SMT and MT in general,<br />

it becomes clear that investing in a consistent design that follows the<br />

very principles of modern MT is well worth the effort: it reduces the<br />

costs related to the implementation of difficult languages and it streamlines<br />

processing workflows.<br />

To clarify this, we need to address a common misunderstanding among<br />

non-academic users: Moses SMT is a toolkit, not a monolithic application<br />

that will reach the 1.0 milestone sometime in the near future. It<br />

consists of data training scripts and a decoder that can handle so-called<br />

factored input. It is designed primarily to work with phrase alignments<br />

and n-gram models, but it can handle far more information. To date,<br />

however, the factored model (which may include, for example, syntactic<br />

information) has not lived up to expectations, mainly due to the data<br />

sparseness introduced by extra information and the locality issues inherent<br />

to n-gram models.<br />
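The sparseness effect can be made concrete with a small sketch. In Moses factored text, the factors of a token are joined with a `|` character, for example `houses|house|NNS`. The helper names `to_factored` and `vocabulary_size` and the part-of-speech labels below are illustrative assumptions, and escaping of special characters is left out.

```python
from typing import Iterable, Sequence

FACTOR_SEP = "|"  # separator between the factors of one token


def to_factored(tokens: Iterable[Sequence[str]]) -> str:
    """Join per-token factor tuples, e.g. (surface, POS), into one line."""
    words = []
    for factors in tokens:
        for factor in factors:
            if not factor or FACTOR_SEP in factor or " " in factor:
                raise ValueError(f"factor needs escaping first: {factor!r}")
        words.append(FACTOR_SEP.join(factors))
    return " ".join(words)


def vocabulary_size(lines: Iterable[str]) -> int:
    """Count distinct whitespace-separated token types."""
    return len({word for line in lines for word in line.split()})


plain = ["I book a flight", "I read a book"]
factored = [
    to_factored([("I", "PRP"), ("book", "VB"), ("a", "DT"), ("flight", "NN")]),
    to_factored([("I", "PRP"), ("read", "VB"), ("a", "DT"), ("book", "NN")]),
]
print(vocabulary_size(plain), vocabulary_size(factored))  # 5 6
```

The same two sentences yield more distinct token types once a factor is attached, because `book|VB` and `book|NN` no longer share statistics. With richer factors and real corpora this effect grows, which is one way to picture the sparseness mentioned above.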

Still, the original concept of processing information other than phrase<br />

alignments remains useful, but it may be more fruitful to incorporate it outside<br />

of the decoder and the training. This is exactly what most researchers<br />

do when developing so-called “hybrid” systems. For agglutinative<br />

languages, for example, morphological analyzers are used to transform<br />

training and input data before they are fed to Moses SMT. In other<br />

words: modelling is executed outside of the training scripts and the<br />

decoder. For this reason, there is a growing consensus in the research<br />

community that MT is increasingly evolving towards a pipeline process<br />

in which Moses SMT plays an important role. This is in contrast to<br />

what most people picture when making a Moses SMT purchasing decision.<br />
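The pipeline idea described above can be sketched as a chain of stages around the decoder. This is a toy illustration only: the suffix list stands in for a real morphological analyzer, the decoder is passed in as an arbitrary callable, and all names (`segment`, `decoder_stage`, `run_pipeline`) are hypothetical.

```python
from typing import Callable, List, Sequence

Stage = Callable[[List[str]], List[str]]

SUFFIXES = ("ler", "de")  # toy suffix inventory, NOT a real morphological analyzer


def segment(tokens: List[str]) -> List[str]:
    """Pre-processing stage: split known suffixes off each token."""
    out: List[str] = []
    for tok in tokens:
        parts: List[str] = []
        stripped = True
        while stripped:
            stripped = False
            for suffix in SUFFIXES:
                if tok.endswith(suffix) and len(tok) > len(suffix):
                    parts.insert(0, "+" + suffix)
                    tok = tok[: -len(suffix)]
                    stripped = True
                    break
        out.extend([tok] + parts)
    return out


def decoder_stage(decode: Callable[[str], str]) -> Stage:
    """Wrap a line-based decoder (e.g. a call to an SMT server) as a stage."""
    def stage(tokens: List[str]) -> List[str]:
        return decode(" ".join(tokens)).split()
    return stage


def run_pipeline(tokens: List[str], stages: Sequence[Stage]) -> List[str]:
    """Pass the token list through each stage in order."""
    for stage in stages:
        tokens = stage(tokens)
    return tokens
```

With this toy analyzer, `segment(["evlerde"])` returns `["ev", "+ler", "+de"]`, so the decoder is trained on and fed the segmented form. The same transformation would have to be applied to the training data, and further stages (tag handling, terminology, recasing) can be added to the list without touching the decoder.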

With the pipeline architecture, the stage is set for developing difficult<br />

language pairs. In the meantime, it can be used to host text engineering<br />

tools, which basically also add extra information to the translation<br />

process.<br />
