
The current state of work on the Polish-Ukrainian Parallel Corpus


Natalia Kotsyba
Institute of Slavic Studies PAS (Warsaw)

Objectives of creating the corpus

PolUKR¹, a Polish-Ukrainian parallel corpus, was launched as a pilot corpus project at the Institute of Slavic Studies of the Polish Academy of Sciences in 2004. The corpus is intended as a tool for both human and machine users and as language material for compiling bilingual Polish-Ukrainian dictionaries and a contrastive grammar of Polish and Ukrainian. It can also serve as a translation database and as language learning material.

Acquisition and preprocessing of parallel texts

Currently the corpus contains about 2 million tokens (about 500K tokens in 70 parallel texts are publicly available for search through the web interface at http://corpus.domeczek.pl), representing mostly modern Ukrainian and Polish literature (the second half of the 20th century). Some of the texts were received from the translators²; the corresponding originals were then sought out and prepared accordingly. Another group was downloaded from existing digital libraries³. The quality of the texts was often unsatisfactory, since in most cases the electronic texts had been produced by scanning paper editions and running automatic Optical Character Recognition (OCR), and they needed further correction.
A large group of the texts existed originally only as hard copies. They were scanned; cleaned of images, page numbers and other unnecessary information; OCRed with FineReader 9.0; checked for mistakes caused by poor OCR; saved as MS Word documents; and converted into simple UTF-8 encoded XML files. These files contain information about the division into paragraphs, extracted from the doc files with the help of the AutoReplace function.

The text metadata are recorded in a MySQL database on the server. They include (if available): author, title, language, year of creation, place and year of publication, publishing house, genre, translator, year of translation, source and original format of the text, etc.
This information may be used to restrict the scope of the search, e.g. one can choose only the texts created after a specific date or by a specific author.

¹ The current, extended version of the corpus is partially supported by MNiSW (Polish Ministry of Science and Higher Education) grant N N 104 0403 33 in 2007-2009.
² We would like to thank Katarzyna Kotyńska, Anna Łazar, Ola Hnatiuk and Helena Krasowska for sharing their texts.
³ Some of the libraries used can be found at: http://lib.ru, http://www.ae-lib.org.ua/, http://www.4shared.com/dir/3997557/7fe59813/ebooki.html, http://exlibris.org.ua/, http://ukrcenter.com/library/default.asp, http://www.share.net.ua/, http://lib.proza.com.ua.

Structural annotation

The texts are segmented into chunks of two types: paragraphs and sentences. Sentences are always parts of paragraphs. This document structure is encoded in a corresponding Document Type Definition file.

Morphological annotation

To add morphosyntactic information to the Polish texts we use the freely available TaKIPI toolset developed by Marcin Woliński, Adam Przepiórkowski, Adam Radziszewski and Maciej Piasecki, which includes a text chunker, a lemmatizer, a morphological tagger and a disambiguator. Morphological tags are stored as value lists containing the morphological class and the grammatical categories relevant for that class; e.g., the grammatical characteristics of jedziecie ('you (pl.) go') are fin:pl:sec:imperf (finite verb form, plural, second person, imperfective aspect). If a given segment is ambiguous, several tags are listed.
After the disambiguation procedure, the most plausible “candidate” is given the disambiguation value “1”.

An example of a tagged chunk, “Dokąd jedziecie?” (form, lemma, tag):

dokąd        dokąd     qub
jedziecie    jechać    fin:pl:sec:imperf
?            ?         interp

For the Ukrainian language we use the UGS (Ukrainian Grammatical Dictionary) developed at the ULIF NASU by Igor Shevchenko and Oleksandr Rabulets, which enables lemmatization and morphological annotation of texts, although it does not support disambiguation at the moment.
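The colon-separated tag notation used in this section (morphological class first, then the grammatical categories) and the disambiguation value can be modelled in a few lines of Python. This is only a sketch under stated assumptions: the names Interpretation, parse_tag and chosen_reading are hypothetical, and the ambiguous form "mam" is an illustrative example that is not taken from the corpus files.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Interpretation:
    """One morphological reading of a token (hypothetical structure)."""
    lemma: str
    word_class: str         # first field of the tag, e.g. "fin"
    categories: List[str]   # remaining fields, e.g. ["pl", "sec", "imperf"]
    disamb: bool = False    # True for the reading marked with disambiguation value "1"


def parse_tag(lemma: str, tag: str, disamb: bool = False) -> Interpretation:
    """Split a tag such as 'fin:pl:sec:imperf' into class and categories."""
    word_class, *categories = tag.split(":")
    return Interpretation(lemma, word_class, categories, disamb)


def chosen_reading(readings: List[Interpretation]) -> Optional[Interpretation]:
    """Return the reading selected by the disambiguator, if any."""
    for reading in readings:
        if reading.disamb:
            return reading
    return None


# The unambiguous example from the text.
jedziecie = [parse_tag("jechać", "fin:pl:sec:imperf", disamb=True)]

# Illustrative ambiguous segment (assumption): "mam" as a verb form or a noun form.
mam = [
    parse_tag("mieć", "fin:sg:pri:imperf", disamb=True),
    parse_tag("mama", "subst:pl:gen:f"),
]

print(chosen_reading(jedziecie))
print(chosen_reading(mam).lemma)
```

Keeping every reading and only flagging the selected one mirrors the description above, where several tags are listed for an ambiguous segment and one of them receives the value "1". For Ukrainian, where disambiguation is not yet supported, chosen_reading would simply return None.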


A common morphosyntactic tagset for Polish and Ukrainian was developed for the needs of the corpus on the basis of the resources mentioned above; see [Kotsyba et al. 2008, Коциба 2009] for details. Language-specific categories and values are preserved, as our intention was not to lose any information. Not all the details are visible at the GUI search level, but they are accessible to advanced users through self-defined regex-based corpus queries. The basic changes we had to introduce include a finer POS granularity for Ukrainian and a regrouping of some word classes for Polish to fit a more traditional understanding of the parts of speech. These quasi-changes are implemented through a mechanism of aliases and affect only the GUI search level. Information about degree for Ukrainian adjectives and adverbs has also been moved from the lexical to the grammatical level, in line with both traditional grammars and the currently accepted NLP treatment of degree as a grammatical category.
The special treatment of predicatives, which we follow as well, is described in detail in [Derzhanski, Kotsyba 2008]. The above format was also used for the Ukrainian language when converting the original annotated files.

Alignment

Presently the parallel texts are aligned dynamically at the paragraph level: paragraphs are numbered during the search procedure, and the paragraphs with the same order number as those in which the searched fragment is found are shown along with the KWICs. Differences in paragraph division had to be removed manually, so that the order numbers and the content of the paragraphs matched. This situation is provisional: paragraph-level alignment is unsatisfactory, as most paragraphs are too long for the searched equivalent to be spotted easily. The intended alignment levels are sentences and, eventually, words.

One of the freely available programs that align parallel texts at the sentence level is the language-independent HunAlign.
The result of the alignment is recorded either as an intertwined text or as sets of corresponding sentences, so-called link groups, represented by sentence numbers or other identifiers. Additional numeric information about the accuracy of the alignment can be included as well. The program provides for the use of a bilingual dictionary to ensure higher alignment accuracy. If no such dictionary is available, the program can generate one itself from the bitexts being processed. The results of aligning Polish and Ukrainian texts without a dictionary were far from satisfactory. To make the alignment more accurate, we have developed a bilingual dictionary structured according to the HunAlign requirements. It is recorded as plain text in which each entry takes a separate line: the original word or expression, an @-sign, and the equivalent word or expression.
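The one-entry-per-line layout just described can be read with a short Python function. This is a minimal sketch that follows the layout as stated in this paper (left side, @-sign, right side); the function name parse_alignment_dictionary is hypothetical, and the code is not a description of HunAlign's own dictionary loader.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def parse_alignment_dictionary(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Map each left-hand entry to the list of its right-hand equivalents."""
    entries = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        left, separator, right = line.partition("@")
        if not separator:
            raise ValueError(f"missing @-sign in entry: {line!r}")
        # A repeated left-hand side simply accumulates further equivalents.
        entries[left.strip()].append(right.strip())
    return dict(entries)


# Entries taken from the dictionary fragment quoted in this paper.
sample = [
    "bibułka @ папіросний папір",
    "bibułka @ цигарковий папір",
    "bić @ бити",
]
dictionary = parse_alignment_dictionary(sample)
print(dictionary["bibułka"])
```

Storing the equivalents in a list per entry keeps the plain-text file flat while still letting a lookup return every equivalent of a polysemous word.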
Since many words and expressions have several equivalents due to polysemy, the same entry on the left side can be repeated with different equivalents. The alignment dictionary was generated automatically from the database version of the Polish-Ukrainian dictionary that is currently being developed as a joint project of ULIF NASU and ISS PAS; it contains 31088 entries.

Fragment of the dictionary:

białokrusz @ окис свинцю
białolicy @ білолиций
białoramienny @ білоплечий
białoruszczyzna @ білорущина
białoruszczyzna @ все, що білоруське
bibułka @ папіросний папір
bibułka @ цигарковий папір
bibułomania @ манія збирати старі рукописи
biczować @ батожити
biczowanie @ батоження
biczyk @ батіжок
biczykowaty @ подібний до батіжка
bić @ бити
biję @ б'ю
biec @ бігти

Since both Polish and Ukrainian are highly inflected languages, basic dictionary forms are not enough. We need either lemmatized texts or a dictionary with all possible forms generated. The first option seems easier to implement, but it requires adjusting the alignment algorithm and working with already annotated texts.

Another option for aligning is TextAlign, user-friendly software with a GUI and editing facilities. The only possible input format is RTF (Rich Text Format); the output is a TMX file with an intertwined parallel text.
The main problem, the unequal number of sentences in parallel texts, which affected the quality of the results produced by the fully automatic and hardly controllable HunAlign, is compensated for by the possibility of easy and quick alignment editing in TextAlign. However, the sentence segmentation algorithm in TextAlign is too simple to give satisfactory results.

[Figure: Example of alignment results by TextAlign, pre-editing phase]


It can be seen from the example above that sentence borders are determined from punctuation marks alone, without considering common abbreviations that end with full stops, which can lead to wrong sentence segmentation.

[Figure: Example of a manual splitting procedure with the help of TextAlign]

At the moment we are developing a program, PLUczeK, that will combine the features of HunAlign and TextAlign. It will include an editable plug-in module for text segmentation at the paragraph and sentence levels, which is meant to ensure the language independence of the program. The sentence segmentation module is rule-based: it relies on heuristics such as a list of common abbreviations functioning as a stop list, combinations and sequences of abbreviations and punctuation marks, and the forms in which reported speech is presented (which can also differ across languages); cf. also [Rudolf, 2004].
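The abbreviation stop-list heuristic mentioned above can be illustrated with a small rule-based splitter. The sketch below is not the PLUczeK implementation: the function name split_sentences, the contents of the stop list and the sample text are all assumptions made for illustration, and only two rules are shown (terminal punctuation followed by a capitalised token, and the abbreviation stop list).

```python
# Illustrative stop list of Polish abbreviations (an assumption, not PLUczeK's list).
ABBREVIATIONS = {"r.", "np.", "prof.", "ul."}


def split_sentences(text, abbreviations=ABBREVIATIONS):
    """Split text into sentences using punctuation plus an abbreviation stop list."""
    sentences, current = [], []
    tokens = text.split()
    for i, token in enumerate(tokens):
        current.append(token)
        ends_with_terminator = token.endswith((".", "!", "?"))
        is_abbreviation = token.lower() in abbreviations
        # A boundary also needs a capitalised next token (or the end of the text).
        is_last = i + 1 == len(tokens)
        next_capitalised = is_last or tokens[i + 1][0].isupper()
        if ends_with_terminator and not is_abbreviation and next_capitalised:
            sentences.append(" ".join(current))
            current = []
    if current:  # trailing material without a final terminator
        sentences.append(" ".join(current))
    return sentences


# Illustrative text: "r." (year) is followed by a capitalised proper name.
text = "W 1848 r. Francja była epicentrum rewolucji. Ruchy objęły również Wiedeń."
print(split_sentences(text))                       # stop list applied
print(split_sentences(text, abbreviations=set()))  # punctuation only
```

With the stop list, the text is split into two sentences; without it, the full stop after "r." produces a spurious third segment, which is the kind of error described for punctuation-only segmentation. A stop list alone cannot handle an abbreviation that really does end a sentence, which is why further heuristics and manual editing remain necessary.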
The program will work with both plain texts and morphologically annotated XML files, addressing either the actual form of the token or its lemma, and using grammatical information for sentence segmentation (a verb or a preposition cannot be a proper name; hence, when written with a capital letter, it signals the beginning of a sentence, etc.). The program will also have a GUI and enable editing of the segmentation.

We have chosen the XCES format for alignment records. The information about corresponding sentences is stored in a separate file. An example fragment of an alignment file is below (sentences 1 and 2 of the second link group are translated as one sentence).

.......


Even sentence alignment cannot reach 100% accuracy, for objective reasons. In the table below, fragments that are parts of one sentence are highlighted with the same shade.

No. | Polish | Ukrainian
1 | Dokonany przez Namiera wybór 1848 r. jako punktu wyjścia był jak najbardziej uzasadniony. | Нейм’єр обирає як вихідний пункт 1848 рік, і цей вибір добре обґрунтований.
2 | Zgodnie z obiegową opinią, rok ów stanowił punkt zwrotny w historii, w którym historii nie udało się dokonać zwrotu. | Існує відоме затерте кліше, що 1848 рік був поворотним пунктом історії, але тоді історія не змогла повернути в інший бік,
3 | Pogląd ten jest jednak błędny – | проте це не так.
4 | to właśnie wtedy wybuchły pierwsze europejskie rewolucje. | 1848 – це рік перших европейських революцій:
5 | Ich epicentrum stanowiła Francja, ale ruchy rewolucyjne objęły również Palermo, Neapol, Wiedeń, Berlin, Budę i Poznań, żeby wymienić tylko kilka miast. | їхнім епіцентром була Франція, також революції відбулися у Палермо, Неаполі, Відні, Берліні, Буді й Познані – і це ще далеко не всі міста.

This means that mistakes are practically unavoidable, especially with large amounts of text, but it is still possible to keep the general quality of the corpus sufficient for working with it and obtaining objective results.

Conclusions and further work


The current state of PolUKR already enables searching for translation equivalents, and the corpus can be used as a translation memory database by human translators, researchers and machines alike. The corpus can still be enhanced in a number of ways, such as a finer alignment level and further annotation of different types, including semantic and referential information. Automatic word-level alignment can be of significant help in compiling bilingual dictionaries. The search engine will have to be adjusted to enable searching for the new information as well.

Literature

Broda Bartosz, Piasecki Maciej & Radziszewski Adam. Towards a Set of General Purpose Morphosyntactic Tools for Polish. Proceedings of Intelligent Information Systems, Zakopane, Poland, 2008. Institute of Computer Science PAS, 2008.

Ivan Derzhanski and Natalia Kotsyba. The Category of Predicatives in the Light of Consistent Morphosyntactic Tagging of Slavic Languages. Proceedings of the International Workshop within MONDILEX project, Moscow, 2-4 October 2008.

Hunalign - sentence level aligner: http://mokk.bme.hu/resources/hunalign.

Natalia Kotsyba, Olha Shypnivska and Magdalena Turska. Linguistic principles of organizing a common morphological tagset for PolUKR (Polish-Ukrainian Parallel Corpus). Proceedings of Intelligent Information Systems, Zakopane, Poland, 2008. Institute of Computer Science PAS, 2008.

Adam Przepiórkowski and Marcin Woliński. A Flexemic Tagset for Polish. In: The Proceedings of the Workshop on Morphological Processing of Slavic Languages, EACL 2003. http://nlp.ipipan.waw.pl/~adamp/Papers/2003-eacl-ws12/ws12.pdf

Michał Rudolf. Metody automatycznej analizy korpusu tekstów polskich. Pozyskiwanie, wzbogacanie i przetwarzanie informacji lingwistycznych. Warszawa, 2004.

TextAlign in MT2007 (Memory Translation Computer Aided Tool): http://mt2007-cat.ru/index.html.

Magdalena Turska and Natalia Kotsyba. Polsko-Ukraiński korpus równoległy (PolUKR). „Materiały LXIII Zjazdu Polskiego Towarzystwa Językoznawczego”, Warszawa.

Magdalena Turska and Natalia Kotsyba. Polish-Ukrainian Parallel Corpus and its Possible Applications. Proceedings of the International Conference "Practical Applications in Language and Computers, 7-9 April, Łódź", Peter Lang GmbH, 2007.

v. Waldenfels, R. Compiling a parallel corpus of slavic languages. Text strategies, tools and the question of lemmatization in alignment. In: Brehmer, B., Zdanova, V., Zimny, R. (Hrsg.); Beiträge der Europäischen Slavistischen Linguistik (POLYSLAV) 9. München, 123-138, 2006.

Коциба Наталія. Принципи морфосинтактичного таґування польсько-українського паралельного корпусу (PolUKR). Proceedings of the International Conference “MegaLing'2008. Horizons of Applied Linguistics and Linguistic Technologies, Parthenit – Crimea, Ukraine, September 2008”, 2009 (in preparation).


Широков В.А., О.В. Бугаков, Т.О. Грязнухіна, О.М. Костишин, М.Ю. Кригін, Т.П. Любченко, О.Г. Рабулець, О.О. Сидоренко, Н.М. Сидорчук, І.В. Шевченко, О.О. Шипнівська, К.М. Якименко. Корпусна лінгвістика. Київ: Довіра, 2005.

Abstract

The article describes the present state of work on PolUKR, the Polish-Ukrainian parallel corpus, developed at the Institute of Slavic Studies of the Polish Academy of Sciences since 2004. It presents the ways in which the bitexts were acquired, their structure and pre-processing stages; the solutions adopted for a common morphosyntactic annotation scheme for Polish and Ukrainian, as well as the annotation methods; and the alignment format and the software used or developed for the needs of the corpus.

Recommendations

One of the objectives of the current project is to develop a scheme for creating a parallel corpus for any pair of Slavic languages. At the moment a researcher who deals with Slavic parallel corpora faces several major problems that need to be attended to. One of the still unresolved issues is a common morphological annotation tagset for Slavic languages that would ensure uniform search through both parts of a corpus at the same time. Technical bilingual dictionaries for sentence alignment, as well as a user-friendly alignment editor, are necessary to enable controllable, high-quality alignment. A free, platform-independent search engine for parallel corpora is also needed.
