13.07.2015 Views

WWW/Internet - Portal do Software Público Brasileiro




IADIS International Conference WWW/Internet 2010

The original shallow-transfer Apertium system consists of a de-formatter, a morphological analyzer, a categorical disambiguator, a structural and lexical transfer module, a morphological generator, a post-generator and a reformatter (Tyers, 2009).

Adding a new language pair to Apertium is a challenging task that requires advanced knowledge of Linguistics in addition to specific Computing knowledge. A researcher or user who wishes to develop translation pairs for Apertium must rely on non-specific tools and on very few interface resources and scripts, which are fully textual and cover only parts of the process.

The architecture of Apertium consists of a generic translation engine that may be used in association with different knowledge bases, such as dictionaries and shallow-transfer rules between languages. It is the combination of these elements that translates text through a process of shallow transfer, whose main distinctive feature is that it does not carry out a full syntactic analysis, but rather operates over lexical units.

Therefore, one must master certain key concepts so as to understand both the knowledge model and the basic structure of the XML files to be created.

Some of these key concepts are lexemes, symbols and paradigms. A “lexeme” is the fundamental unit of the lexicon of a language (a word). A “symbol” is a classifying tag, which may be either grammatical or linked to the execution stages of the engine. A “paradigm” is a script containing the morphological inflections for a given group of words.

Moreover, there are three stages one must go through in order to add a new language pair to Apertium, each of which gives rise to XML files.
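
To make the concepts of lexeme, symbol and paradigm more concrete, the sketch below expands a single lexeme through a paradigm into its inflected forms. It is a simplified illustration under our own assumptions, not Apertium code: the names `PARADIGMS` and `expand`, the paradigm name `house__n` and the symbols `n`, `sg` and `pl` are hypothetical, and the `lemma<symbol>` notation used for the output is assumed here for illustration only.

```python
# Hypothetical, simplified model of lexemes, symbols and paradigms.
# Nothing here is part of Apertium itself; all names are illustrative.

# A paradigm lists, for a group of words, the suffix added to the stem
# and the symbols (classifying tags) attached to each inflected form.
PARADIGMS = {
    "house__n": [
        ("", ["n", "sg"]),   # singular: no suffix
        ("s", ["n", "pl"]),  # plural: suffix "s"
    ],
}


def expand(stem, paradigm_name):
    """Return (surface form, analysis) pairs for one lexeme."""
    forms = []
    for suffix, symbols in PARADIGMS[paradigm_name]:
        surface = stem + suffix
        tags = "".join(f"<{symbol}>" for symbol in symbols)
        forms.append((surface, stem + tags))
    return forms


forms = expand("house", "house__n")
for surface, analysis in forms:
    print(surface, "->", analysis)
```

Because the inflections live in the paradigm rather than in each entry, a new regular noun only needs its stem and a reference to an existing paradigm, which is why paradigms keep large dictionaries manageable.
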
It is important to stress that the entire process is based on text files, and that the references among the tags, for example, must be kept consistent manually, by a human.

Table 1 shows these three defining stages, namely the development of a monolingual dictionary, a bilingual dictionary and translation rules. We have developed solutions for the first two stages.

In order to create and manage the files that make up the dictionaries and rules, one must know and understand the syntax, the labels, the internal structure and the tagging hierarchy. Typical monolingual dictionaries, for example, currently comprise more than 10 thousand lexemes spread over 150 thousand lines of XML, which makes knowledge management too complex a task to be carried out manually, without specific techniques and support tools.

In order to make terminology clear, in the present paper we will refer to fixed Apertium XML tags as “elements”, reserving the term “tag” for flexible language features. Below we present excerpts of Apertium files, explaining the meaning of the elements mentioned whenever necessary. The information presented here is largely based on tutorials provided by the platform.

Table 1. Data for the Apertium platform: development stages

Development Stage | Content
Monolingual Dictionary of language “xx” and “yy” | Contains the rules of how words in the language are inflected.
Bilingual Dictionary for each direction “xx to yy” and “yy to xx” | Contains correspondences between words and symbols in the two languages in the direction defined.
Transfer Rules for each direction | Contains rules for how language xx will be changed into language yy.

4.1 Dictionary Format

Both monolingual and bilingual dictionaries share the same XML specifications.
We shall now present an overview of the standard structure of these files.

The element “dictionary” comprises the entire content of the file and is basically divided into four parts, each delimited by tags and their respective content, as follows:

- alphabet: set of characters used to perform tokenization;
- sdefs: markings that can be applied to structures during the process;
- pardefs: inflection paradigms applied to the lexemes;
- section: lexeme definitions and special symbols belonging to the language, possibly involving paradigms.

Example of the Basic Schema of an XML Dictionary:
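
The four-part layout described above can be sketched as a small dictionary file that is then inspected programmatically. This is our own minimal illustration, not the example from the paper: only the elements `dictionary`, `alphabet`, `sdefs`, `pardefs` and `section` are named in the text, while the inner elements (`sdef`, `pardef`, `e`, `p`, `l`, `r`, `s`, `i`, `par`), the attribute values and the sample entry are assumptions modeled on the Apertium tutorials and should be checked against the platform documentation.

```python
import xml.etree.ElementTree as ET

# Illustrative dictionary skeleton. The four top-level parts come from the
# text; the inner elements and the "house" entry are assumptions.
DICTIONARY_XML = """
<dictionary>
  <alphabet>abcdefghijklmnopqrstuvwxyz</alphabet>
  <sdefs>
    <sdef n="n"/>
    <sdef n="sg"/>
    <sdef n="pl"/>
  </sdefs>
  <pardefs>
    <pardef n="house__n">
      <e><p><l></l><r><s n="n"/><s n="sg"/></r></p></e>
      <e><p><l>s</l><r><s n="n"/><s n="pl"/></r></p></e>
    </pardef>
  </pardefs>
  <section id="main" type="standard">
    <e lm="house"><i>house</i><par n="house__n"/></e>
  </section>
</dictionary>
"""

root = ET.fromstring(DICTIONARY_XML)

# The four parts of the dictionary, in document order.
parts = [child.tag for child in root]

# Symbols declared in sdefs and paradigms declared in pardefs.
symbols = [sdef.get("n") for sdef in root.find("sdefs")]
paradigms = [pardef.get("n") for pardef in root.find("pardefs")]

# Each lexeme in the section, with the paradigm it refers to.
lexemes = [(e.get("lm"), e.find("par").get("n")) for e in root.find("section")]

# A paradigm reference with no matching pardef is the kind of broken
# cross-reference that otherwise has to be caught by hand.
dangling = [name for _, name in lexemes if name not in paradigms]

print(parts)
print(lexemes, dangling)
```

Even this small check hints at why tool support matters: in a file with thousands of entries, every `par` reference and every symbol must match a definition elsewhere in the same file.
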
