12.07.2015 Views

The current state of work on the Polish-Ukrainian Parallel Corpus



It can be seen from the example above that sentence borders are defined based on punctuation marks alone, without considering common abbreviations that end with full stops, which can produce wrong sentence segmentation.

Example of a manual splitting procedure with the help of TextAlign.

At the moment we are developing a program, PLUczeK, that will combine the features of HunAlign and TextAlign. It will include an editable plug-in module for text segmentation at the paragraph and sentence levels, which is intended to ensure the language independence of the program. The sentence segmentation module is rule-based; it presupposes the use of such heuristics as a stop list of common abbreviations, combinations and sequences of abbreviations and punctuation marks, and the forms in which reported speech is presented (which can also differ across languages), cf. also [Rudolf, 2004].
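The stop-list heuristic described above can be illustrated with a minimal sketch. This is not the PLUczeK implementation: the function name `split_sentences`, the abbreviation list, and the extra rule that a boundary must not be followed by a lowercase letter are all assumptions made for illustration.

```python
import re

# Hypothetical stop list of abbreviations that end with a full stop.
# A real module would load an editable, language-specific list.
ABBREVIATIONS = {"np.", "prof.", "dr.", "ul.", "tzw.", "etc.", "cf."}

# Candidate boundary: one or more of . ! ? followed by whitespace.
_CANDIDATE = re.compile(r"([.!?]+)(\s+)")


def split_sentences(text, abbreviations=ABBREVIATIONS):
    """Split text at punctuation marks, skipping stop-listed abbreviations."""
    sentences = []
    start = 0
    for match in _CANDIDATE.finditer(text):
        # Last token of the current segment, including its punctuation.
        token = text[start:match.end(1)].split()[-1]
        token = token.lstrip("(\"'").lower()
        if token in abbreviations:
            continue  # full stop belongs to an abbreviation, not a boundary
        # Assumed extra rule: a lowercase continuation is not a new sentence.
        if text[match.end()].islower():
            continue
        sentences.append(text[start:match.end(1)].strip())
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


if __name__ == "__main__":
    sample = "Prof. Kowalski mieszka przy ul. Długiej. To jest drugie zdanie."
    for sentence in split_sentences(sample):
        print(sentence)
```

Passing the abbreviation set as a parameter mirrors the idea of an editable, per-language module: swapping in a Ukrainian list would not require changing the splitting logic.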
The program will work with both plain texts and morphologically annotated XML files, drawing on either the actual form of the token or its lemma, and it will also use grammatical information for sentence segmentation (a verb or a preposition cannot be a proper name; hence, when written with a capital letter, it signals the beginning of a sentence, etc.). The program will also have a GUI and enable editing of the segmentation.

We have chosen the XCES format for alignment records. The information about corresponding sentences is stored in a separate file. An example fragment of an alignment file is below (sentences 1 and 2 of the second link group are translated as one sentence).......
