Automatic text processing and deeply annotated text corpora of ...


Summary

The main focus will be on SynTagRus, a corpus of Russian texts annotated with dependency-type syntactic structures, lexical meanings, and lexical functions. Statistical data collected from the corpus are used to improve lexical and syntactic disambiguation in automatic parsing. Other uses of SynTagRus include the construction of dependency parsers by machine learning techniques and regression testing of the rule-based parser.

June 8, 2012, Bratislava: Slovak corpus anniversary conference


THE Russian corpus

National Corpus of Russian (НКРЯ), www.ruscorpora.ru (hosted by Yandex; this is to change soon, and a non-commercial partnership is to be founded)




National Corpus of Russian

3. Newspaper corpus
4. Parallel aligned corpora (English-Russian, Russian-English, German-Russian, Ukrainian-Russian, Russian-Ukrainian, Belarusian-Russian, Russian-Belarusian, multilingual)


National Corpus of Russian

5. Dialectal corpus
6. Poetic corpus
7. Learners' corpus
8. Oral speech corpus
9. Accentological corpus
10. Multimedia corpus
11. Church Slavonic corpus


SynTagRus

The Syntactic Corpus of the NRC and SynTagRus: the former is a subcorpus of the latter. It is updated 3-4 times a year and does not show lexical meanings or lexical-functional annotation.


SynTagRus

The corpus is created semi-automatically: first, every sentence is processed by the ETAP parser, and then the result is manually corrected by at least two expert linguists.


SynTagRus

Currently the treebank contains over 52,000 sentences belonging to texts of a variety of genres (contemporary fiction, popular science, newspaper and journal articles dated between 1960 and 2012, texts of online news, Wikipedia articles, etc.) and is steadily growing.


SynTagRus

SynTagRus adopts a dependency-based annotation scheme, in a way parallel to the Prague Dependency Treebank but, in contrast, relying on the Meaning-Text theory of Igor Mel'čuk.


SynTagRus

A sentence:

Naibol'šee vozmuščenie učastnikov mitinga vyzval prodolžajuščijsja rost cen na benzin, ustanavlivaemyx neftjanymi kompanijami
'It was the continuing growth of petrol prices set by oil companies that caused the greatest indignation of the participants of the meeting'.


SynTagRus

Nodes represent words (lemmas) assigned morphological and part-of-speech tags, whilst arcs are labeled with names of syntactic links. The tagging uses about 75 syntactic links, half of them proposed by Mel'čuk (1988).
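The node-and-arc description above can be sketched as a small data structure. This is only an illustration, not the SynTagRus storage format: the `Node` class, the `is_tree` helper, the POS tags, the link names (`link-1` and so on) and the attachments chosen for the fragment rost cen na benzin are all placeholders introduced here.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    idx: int                 # position of the node in the sentence (1-based)
    form: str                # word form as it appears in the text
    lemma: str               # dictionary form
    pos: str                 # part-of-speech tag (placeholder tag set)
    head: Optional[int]      # idx of the governing node, None for the root
    link: Optional[str]      # name of the syntactic link on the incoming arc


def is_tree(nodes: List[Node]) -> bool:
    """Check that the arcs form a tree: one root, known heads, no cycles."""
    by_idx = {n.idx: n for n in nodes}
    roots = [n for n in nodes if n.head is None]
    if len(roots) != 1:
        return False
    for n in nodes:
        seen = set()
        cur = n
        # Walk up towards the root; revisiting a node means a cycle.
        while cur.head is not None:
            if cur.idx in seen or cur.head not in by_idx:
                return False
            seen.add(cur.idx)
            cur = by_idx[cur.head]
    return True


# Fragment "rost cen na benzin" ('growth of petrol prices');
# link names and attachments are hypothetical placeholders.
fragment = [
    Node(1, "rost", "rost", "N", None, None),
    Node(2, "cen", "cena", "N", 1, "link-1"),
    Node(3, "na", "na", "PREP", 2, "link-2"),
    Node(4, "benzin", "benzin", "N", 3, "link-3"),
]

print(is_tree(fragment))
```

Storing the head index and the link name on the dependent keeps one record per node, which matches the "one labeled arc into each node" reading of a dependency tree.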


SynTagRus

Normally, one token corresponds to one node in the dependency tree. There are certain exceptions:
• composite words like pjatidesjatiètažnyj 'fifty-storeyed', where one token corresponds to two or more nodes;
• multiword expressions like po krajnej mere 'at least', where several tokens correspond to one node;


SynTagRus

• so-called phantom nodes for the representation of hard cases of ellipsis, or gapping, which do not correspond to any particular token in the sentence: cf. Ja kupil rubašku, a on galstuk 'I bought a shirt and he a tie', which is expanded into Ja kupil rubašku, a on kupilPHANTOM galstuk 'I bought a shirt and he boughtPHANTOM a tie'.
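The token-to-node correspondences listed above can be modelled by letting each node record which surface tokens it covers. The sketch below is a hypothetical representation, not the corpus format: `TreeNode`, `alignment_kinds` and the category names are introduced here for illustration, and punctuation tokens are simply left without nodes as a simplifying assumption.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class TreeNode:
    lemma: str
    token_ids: Tuple[int, ...]   # indices of covered tokens; empty for a phantom node


def alignment_kinds(nodes: List[TreeNode]) -> List[str]:
    """Classify every node by how it relates to the surface tokens."""
    # How many nodes refer to each token index.
    use = Counter(t for n in nodes for t in n.token_ids)
    kinds = []
    for n in nodes:
        if not n.token_ids:
            kinds.append("phantom")          # no token at all (ellipsis, gapping)
        elif len(n.token_ids) > 1:
            kinds.append("multiword")        # several tokens, one node
        elif use[n.token_ids[0]] > 1:
            kinds.append("composite-part")   # one token shared by several nodes
        else:
            kinds.append("one-to-one")
    return kinds


# Tokens: 0 Ja, 1 kupil, 2 rubašku, 3 ",", 4 a, 5 on, 6 galstuk
gapping = [
    TreeNode("ja", (0,)),
    TreeNode("kupit'", (1,)),
    TreeNode("rubaška", (2,)),
    TreeNode("a", (4,)),
    TreeNode("on", (5,)),
    TreeNode("kupit'", ()),      # phantom copy of the elided verb
    TreeNode("galstuk", (6,)),
]

print(alignment_kinds(gapping))
```

A multiword expression such as po krajnej mere would be one node with `token_ids=(0, 1, 2)`, and a composite word would be two nodes sharing the same single token index.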


SynTagRus and ETAP

Morphological tagging of SynTagRus is based on a comprehensive morphological dictionary of Russian that contains about 130,000 entries (over 4 million word forms).

Recently, the dictionary was supplemented with full phonetic stress marking, which is used in a Russian text-to-speech synthesis system.

The ETAP-3 morphological analyzer uses the dictionary to produce the morphological annotation of the words in the corpus, which includes the lemma, POS tags, and, depending on the POS, a set of morphological features.


SynTagRus and ETAP

The current version of SynTagRus contains partial lexical-functional annotation. For collocations that can be described with the apparatus of lexical functions, the tagging includes information on the values and attributes of those lexical functions.



SynTagRus and ETAP

The current version of SynTagRus displays word senses as they are presented in the combinatorial dictionary of ETAP.


SynTagRus and ETAP

Пока же исследователи по-разному толкуют "первоисточник", а в качестве доказательства своей правоты приводят отдельные археологические находки последних лет.
'So far, researchers interpret the "primary source" in different ways, and as proof that they are right they cite isolated archaeological finds of recent years.'



SynTagRus and ETAP

Толкуют о сооружении местного Сити, почти как в Москве, и, разумеется, с высоченным небоскребом напротив Смольного на невском правобережье.
'They talk about the construction of a local City, almost like in Moscow, and, naturally, with a very high skyscraper opposite Smolny on the right bank of the Neva.'



SynTagRus and ETAP

SynTagRus is not only a linguistic resource but also a computational resource, which is used
• to collect various statistical data;
• to create training sets for machine learning;
• to develop automatic parsers (Nivre-Boguslavsky-Iomdin 2008).


SynTagRus beyond ETAP

• Over 30 free licenses provided to universities and academic institutions;
• 2 commercial licenses provided to big IT companies.


Three main uses of SynTagRus within ETAP

1. It provides statistics on the different syntactic constructions, lexical co-occurrences, patterns of ambiguity, etc., which are used at several points of the algorithm if the statistical component is activated.
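A minimal sketch of how such statistics might be gathered from a treebank and consulted during disambiguation. The slides do not describe ETAP's statistical component in detail, so everything here is an assumption: arcs are given as hypothetical `(head_lemma, link, dependent_lemma)` triples, and `prefer` simply picks the attachment seen most often in the corpus.

```python
from collections import Counter


def collect_stats(treebank):
    """Count syntactic links and head-dependent lemma pairs over all sentences."""
    link_counts = Counter()
    cooc_counts = Counter()
    for sentence in treebank:
        for head_lemma, link, dep_lemma in sentence:
            link_counts[link] += 1
            cooc_counts[(head_lemma, dep_lemma)] += 1
    return link_counts, cooc_counts


def prefer(candidates, cooc_counts):
    """Among competing (head, dependent) attachments, pick the most frequent one."""
    return max(candidates, key=lambda pair: cooc_counts[pair])


# Toy treebank with placeholder lemmas and link names.
toy_treebank = [
    [("buy", "obj", "shirt"), ("buy", "subj", "I")],
    [("buy", "obj", "tie"), ("buy", "subj", "he")],
    [("price", "attr", "petrol"), ("rise", "subj", "price")],
    [("price", "attr", "petrol")],
]

link_counts, cooc_counts = collect_stats(toy_treebank)
print(link_counts.most_common())
print(prefer([("price", "petrol"), ("rise", "petrol")], cooc_counts))
```

Raw counts are the simplest possible choice; a real component would more plausibly use smoothed or normalized scores, which the slides do not specify.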


Three main uses of SynTagRus within ETAP

2. It serves as an efficient and accurate evaluation resource, which is used to evaluate the performance of the ETAP parser and in this way to find and resolve some of the system's bottlenecks.
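The slides do not say which measures are used for this evaluation. As an assumption, the sketch below computes two measures that are common in dependency parsing: the share of nodes with the correct head (unlabeled attachment score) and the share with both the correct head and the correct link name (labeled attachment score). The function name and the input format are hypothetical.

```python
def attachment_scores(gold, parsed):
    """Compare two analyses of one sentence, given as (head, link) per node.

    Returns (unlabeled, labeled) attachment scores as fractions of nodes.
    """
    if len(gold) != len(parsed):
        raise ValueError("gold and parsed analyses must cover the same nodes")
    if not gold:
        raise ValueError("cannot score an empty sentence")
    total = len(gold)
    correct_heads = sum(g[0] == p[0] for g, p in zip(gold, parsed))
    correct_arcs = sum(g == p for g, p in zip(gold, parsed))
    return correct_heads / total, correct_arcs / total


# Heads are node indices, 0 marks the root; link names are placeholders.
gold = [(2, "a"), (0, "root"), (2, "b"), (3, "c")]
parsed = [(2, "a"), (0, "root"), (2, "x"), (2, "c")]

uas, las = attachment_scores(gold, parsed)
print(uas, las)
```

Sentences or link types with low scores are natural candidates when looking for the bottlenecks mentioned above.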


Three main uses of SynTagRus within ETAP

3. It is used for regression testing of ETAP.


Regression Testing

• Periodically, ETAP is run on the whole corpus. Sentences that receive parses exactly equivalent to those stored in the corpus (between 30 and 35 percent of the corpus) are selected as the basis for regression testing.
• ETAP is then run on this test set to see whether changes introduced in the dictionary, rules, or software have affected the state of the test set.
• Regression testing has proven helpful in ensuring the stability of the parser and eventually improving it. It also helps improve SynTagRus itself: sometimes the discrepancies in parses detected by regression test runs point to erroneous annotation in the corpus, which is then corrected.
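The procedure above can be sketched in a few lines. This is a simplified illustration, not the actual ETAP tooling: the corpus is assumed to be a dictionary from a sentence id to a `(text, gold_parse)` pair, a parser is any function from text to a parse, and the two toy parsers stand in for the system before and after a change.

```python
def build_regression_set(corpus, parse):
    """Keep the sentences whose current parse exactly equals the stored one."""
    return {
        sid: gold
        for sid, (text, gold) in corpus.items()
        if parse(text) == gold
    }


def run_regression(corpus, regression_set, parse):
    """Return ids of test-set sentences whose parse no longer matches."""
    return [
        sid
        for sid, expected in regression_set.items()
        if parse(corpus[sid][0]) != expected
    ]


def make_parser(outputs):
    """Toy stand-in for a parser: looks up a fixed parse for each text."""
    return lambda text: outputs[text]


# Parses are tuples of (head, link) pairs; all values are placeholders.
corpus = {
    "s1": ("text one", ((0, "root"),)),
    "s2": ("text two", ((2, "x"), (0, "root"))),
    "s3": ("text three", ((0, "root"), (1, "y"))),
}

old_parser = make_parser({
    "text one": ((0, "root"),),
    "text two": ((2, "x"), (0, "root")),
    "text three": ((0, "root"), (1, "z")),   # already differs from the corpus
})
new_parser = make_parser({
    "text one": ((0, "root"),),
    "text two": ((2, "w"), (0, "root")),     # changed after a rule edit
    "text three": ((0, "root"), (1, "y")),
})

regression_set = build_regression_set(corpus, old_parser)
print(sorted(regression_set))
print(run_regression(corpus, regression_set, new_parser))
```

Each reported id then needs manual inspection, since a mismatch may point either to a regression in the parser or to an annotation error in the corpus.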
