12.07.2015 Views

The current state of work on the Polish-Ukrainian Parallel Corpus

We developed a common morphosyntactic tagset for Polish and Ukrainian for the needs of the corpus, based on the mentioned resources; see [Kotsyba et al. 2008, Коциба 2009] for details. Language-specific categories and values are preserved, as our intention was not to lose any information. Not all the details will be visible at the GUI search level, but they will be accessible to advanced users through self-defined regex-based corpus queries. The basic changes we had to introduce include a finer POS granularity for Ukrainian and a regrouping of some word classes for Polish to fit a more traditional understanding of the parts of speech. These quasi-changes are realized with the help of the mechanism of aliases and affect only the GUI search level. Information about the degree of Ukrainian adjectives and adverbs has also been moved from the lexical to the grammatical level, in keeping with both traditional grammars and the currently accepted NLP treatment of degree as a grammatical category.
The special treatment of predicatives, which we followed as well, is described in detail in [Derzhanski, Kotsyba 2008]. The above format was also used for the Ukrainian language when converting the original annotated files.

Alignment

Presently the parallel texts are aligned at the paragraph level dynamically, i.e. paragraphs are enumerated during the search procedure, and the paragraphs with the same order number as the ones where the searched fragment is found are shown along with the KWICs. The differences in paragraph division had to be removed manually, so that the order numbers and content were equal. This situation is provisional: paragraph-level alignment is unsatisfactory, as most paragraphs are too lengthy for the searched equivalent to be spotted easily. The intended alignment levels are sentences and, eventually, words.

One of the freely available programs that align parallel texts at the sentence level is the language-independent HunAlign.
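
Before turning to sentence-level alignment, the provisional paragraph-level lookup described above can be illustrated with a minimal Python sketch. It is not the corpus search engine itself: the function names, the blank-line paragraph separator, and the plain substring match are all assumptions made for illustration, and the sample strings are placeholders rather than corpus data.

```python
import re


def split_paragraphs(text):
    """Enumerate paragraphs, assuming they are separated by blank lines."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def find_parallel(query, source_text, target_text):
    """Return (order number, source paragraph, target paragraph) for each hit.

    Paragraphs are numbered from 1 at search time. The lookup only works if
    both texts have the same paragraph division, which is why differences
    have to be removed beforehand.
    """
    source = split_paragraphs(source_text)
    target = split_paragraphs(target_text)
    if len(source) != len(target):
        raise ValueError("paragraph counts differ; equalize the division first")
    return [
        (number, src, tgt)
        for number, (src, tgt) in enumerate(zip(source, target), start=1)
        if query in src
    ]


# Placeholder texts (hypothetical, not taken from the corpus).
SOURCE = "first source paragraph\n\nsecond source paragraph"
TARGET = "first target paragraph\n\nsecond target paragraph"

hits = find_parallel("second", SOURCE, TARGET)
```

The sketch also makes the drawback visible: a hit returns the whole counterpart paragraph, so the reader still has to locate the equivalent inside it.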
HunAlign records the result of the alignment either as an intertwined text or as sets of corresponding sentences, so-called link groups, represented by sentence numbers or other identifiers. Additional numeric information about the accuracy of the alignment can be included as well. The program foresees the use of a corresponding bilingual dictionary to ensure a higher accuracy of the alignment. If no such dictionary is available, the program can also generate one itself from the bitexts currently fed in. The results of aligning Polish and Ukrainian texts without a dictionary were far from satisfactory. For the purpose of a more accurate alignment we have developed a bilingual dictionary structured according to the HunAlign requirements. It is recorded as plain text where each entry takes a separate line: the original word or expression, an @-sign, the equivalent word or expression.
Since many words and expressions have several equivalents due to polysemy, the same entry on the left side can be repeated with different equivalents. The alignment dictionary was generated automatically from the database version of the Polish-Ukrainian dictionary that is currently being developed as a joint project of ULIF NASU and ISS PAS, and contains 31088 entries.

Fragment of the dictionary:

białokrusz @ окис свинцю
białolicy @ білолиций
białoramienny @ білоплечий
białoruszczyzna @ білорущина
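
The entry format described above (one entry per line, left side, @-sign, equivalent, with repeated left sides for polysemous words) can be read with a few lines of Python. This is a minimal sketch under stated assumptions, not part of HunAlign or of the project's tooling: the function name is hypothetical, and the separator is assumed to be an @-sign surrounded by single spaces, as in the fragment.

```python
from collections import defaultdict


def parse_alignment_dictionary(text):
    """Map each left-side entry to the list of its equivalents.

    Assumes one entry per line in the form 'original @ equivalent'.
    A repeated left side simply accumulates further equivalents.
    """
    entries = defaultdict(list)
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue  # skip blank lines
        original, separator, equivalent = line.partition(" @ ")
        if not separator:
            raise ValueError(f"missing ' @ ' separator: {line!r}")
        entries[original.strip()].append(equivalent.strip())
    return dict(entries)


# Two entries taken from the fragment above.
SAMPLE = "białokrusz @ окис свинцю\nbiałoruszczyzna @ білорущина\n"

dictionary = parse_alignment_dictionary(SAMPLE)
```

Keeping the equivalents in a list per entry preserves the order in which repeated lines appear in the file, which may matter if the first equivalent is meant to be the primary one.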
