Proceedings - Natural Language Server - IJS

More documents

Recommendations

Info

considered as a prerequisite for bilingual lexicon extraction from parallel corpora. Moreover, it is clear that disambiguating the vectors improves the quality of the extracted lexicons and manages to beat the simpler, but yet powerful, most frequent sense/alignment heuristic. These encouraging results pave the way towards pure data-driven methods for bilingual lexicon extraction from comparable corpora. This knowledge-light approach can be applied to languages and domains that do not dispose of large-scale seed dictionaries but for which parallel corpora are available. Moreover, the use of a data-driven crosslingual WSD method, such as the one proposed in this paper, can contribute to obtain less noisy translated vectors, which is important especially when lexicon extraction is performed from general language comparable corpora. The experiments carried out till now focus on a health comparable corpus. Although this is not a very specialized corpus but a rather popular one, cases of true polysemy are still less frequent than in a general corpus. We would thus like to extend this work by applying the method to a more general comparable corpus, for instance a corpus built from Wikipedia texts. We expect that the effect of applying the WSD method on a general corpus will be highly beneficial, as ambiguity problems will be more prevalent. Another avenue that we want to explore is to use second order co-occurrences for disambiguation. For the moment, the context used to disambiguate vector features consists of the other features that appear in the same vector. However, these features are direct co-occurrences of the headword, which does not necessarily mean that the features themselves co-occur with each other in the corpus. We consider that it would be preferable to replace this context with the co-occurrences of the features in the corpus for disambiguation, which would correspond to the second order co-occurrences of the English nouns, and investigate the effect of using this type of context on lexicon extraction. Last but not least, we would like to apply the method to the opposite direction (i.e. from Slovene to English) and compare the results obtained in both directions. 6. References M. Apidianaki. 2008. Translation-oriented Word Sense Induction based on Parallel Corpora. Proc. of LREC, Marrakech, Morocco. M. Apidianaki, Y. He. 2010. An algorithm for cross-lingual sense clustering tested in a MT evaluation setting. Proc. of the 7th IWSLT, 219–226, Paris, France. M. Apidianaki. 2009. Data-driven Semantic Analysis for Multilingual WSD and Lexical Selection in Translation. Proc. of the 12th EACL-09, 77–85, Athens, Greece. Š. Arhar, V. Gorjanc, S. Krek 2007. FidaPLUS corpus of Slovenian: the new generation of the Slovenian reference corpus: its design and tools, Proc. of the Corpus Linguistics conference, Birmingham, UK. T. Brants. 2000. Tnt a statistical part-of-speech tagger. In <strong>Proceedings</strong> of the 6th ANLP, Seattle, WA. H. Déjean, E. Gaussier, J. Renders, F. Sadat. 2005. Automatic processing of multilingual medical terminology: applications to thesaurus enrichment and cross-language information retrieval. Artif. Intell. Med., 33(2):111–124. 15 T. Erjavec, D. Fišer, S. Krek, N. Ledinek. 2010. The JOS linguistically tagged corpus of Slovene. Proc. of 7th LREC, Valletta, Malta. A. Ferraresi, E. Zanchetta, M. Baroni, S. Bernardini. 2008. Introducing and evaluating ukWaC, a very large web-derived corpus of English. Proc. of the 4th WAC: Can we beat Google, 47–54. D. Fišer, N. Ljubešić. 2011. Bilingual lexicon extraction from comparable corpora for closely related languages. Proc. of RANLP, 125–131, Hissar, Bulgaria. D. Fišer, N. Ljubešić,Š. Vintar, S. Pollak. 2011. Building and using comparable corpora for domain-specific bilingual lexicon extraction. Proc. of the 4th BUCC: Comparable Corpora and the Web, 19–26, Portland, Oregon, USA. P. Fung. 1998. Machine translation and the information soup, third conference of the association for machine translation in the Americas, AMTA, Vol. 1529 of LNCS. Springer. G. Grefenstette. 1994. Explorations in Automatic Thesaurus Discovery. Kluwer Academic Publishers, Norwell, MA. H. Kaji. 2003. Word sense acquisition from bilingual comparable corpora. Proc. of HLT-NAACL. P. Koehn, K. Knight. 2002. Learning a translation lexicon from monolingual corpora. Proc. of ACL Workshop on Unsupervised Lexical Acquisition, 9–16. P. Koehn. 2005. Europarl: A Parallel Corpus for Statistical Machine Translation. Proc. of MT Summit X, 79–86, Phuket, Thailand. N. Ljubešić, T. Erjavec. 2011. hrWaC and slWaC: Compiling web corpora for Croatian and Slovene. TSD Vol. 6836 of LNCS, 395–402. Springer. N. Ljubešić, D. Fišer, Š. Vintar, S. Pollak. 2011. Bilingual lexicon extraction from comparable corpora: A comparative study. Proc. of WOLER, Ljubljana, Slovenia August 1-5, 2011. E. Marsi, E. Krahmer. 2010. Automatic analysis of semantic similarity in comparable text through syntactic tree matching. Proc. of COLING, 752–760, Beijing, China, August. F.J. Och, H. Ney. 2003. A Systematic Comparison of Various Statistical Alignment Models. Computational Linguistics, 29(1):19–51. P. Gamallo Otero. 2007. Learning bilingual lexicons from comparable English and Spanish corpora. Proc. of MT Summit XI, 191–198. R. Rapp. 1999. Automatic identification of word translations from unrelated English and German corpora. Proc. of the 37th ACL, 519–526, College Park, Maryland, USA. Z. Saralegi, I. San Vicente, A. Gurrutxaga. 2008. Automatic extraction of bilingual terms from comparable corpora in a popular science domain. Proc. of the 1st Building and Using Comparable Corpora (BUCC) workshop, Marrakech, Morocco. H. Schmid. 1994. Probabilistic Part-of-Speech Tagging Using Decision Trees. Proc. of the International Conference on New Methods in <strong>Language</strong> Processing, 44–49, Manchester, UK. L. Shao, H. Tou Ng. 2004. Mining new word translations from comparable corpora. Proc. of COLING, 618–624, Geneva, Switzerland, Aug 23–Aug 27. R. Steinberger, B. Pouliquen, A. Widiger, C. Ignat, T. Erjavec, D. Tufiş. 2006. The JRC-Acquis: A multilingual aligned parallel corpus with 20+ languages. Proc. of the 5th LREC, 2142– 2147. K. Yu, J. Tsujii. 2009. Extracting bilingual dictionary from comparable corpora with dependency heterogeneity. Proc. of ACL, 121–124, Boulder, Colorado, USA.
, Iztok Kosem**, Berginc*** * Trojina, zavod za uporabno slovenistiko Partizanska cesta 5, SI- spela.arhar@trojina.si ** Trojina, zavod za uporabno slovenistiko Partizanska cesta 5, SI- iztok.kosem@trojina.si * -1000 Ljubljana natasa.logar@fdv.uni-lj.si Povzetek V projektu Sporazumevanje v slovenskem jeziku . Gre za javno dostopni lematizirani i korpus pisnih besedil. Vzporedno z izdelavo korpusa je potekala priprava novih korpusnih orodij z intuitivno zasnovanim spletnim vmesnikom, ki bi tudi nejezikoslovnim uporabnikom enostavno korpusno iskanje. V prispevku tako z dvojim: (a) z zbiranjem gradiva in vsebino korpusa ter h kategorijah in (b) z prijazne iskalne i, samodejna lematizacija iskalnega pogoja, uvedba podatkovnih filtrov itd. The development of the Gigafida corpus and its online interface The paper describes the building of a new reference corpus of Slovene, called Gigafida, as part of the Communication in Slovene project. The Gigafida corpus is freely available corpus of written texts, which has been lemmatized and POS-tagged. In addition to building the corpus, we have developed new corpus tools with intuitive online interface, which enables easy corpus searches to language users. The paper describes the design of the corpus, and decisions made in terms of a) collection of texts, corpus contents, and its final structure according to taxonomic categories, and b) new features in interface functionality such as: user-friendly search options, automatic lemmatization of query, introduction of filters etc. 1. Uvod V sklopu projekta Sporazumevanje v slovenskem jeziku (http://www.slovenscina.eu/; SSJ) 1 je med drugim potekala izdelava oz. posodobitev in nadgradnja besedilnih korpusov Nastalo jih ccGigafida in ccKRES. 2 V prvem delu prispevka se bomo posvetili predvsem zbiranju gradiva in vsebini ga korpusa Gigafida, ki je uporabnikom od leta 2011 prosto dostopen na spletu (http://demo.gigafida.net/); prosta dostopnost v istem tudi -milijonski podkorpus KRES ( v Logar Berginc et al., 2012), medtem ko sta drugi dve izvedenki, ccGigafida in ccKRES, prosto dostopni kot podatkovna baza za prenos pod licenco Creative o cc-korpusih v Erjavec, Logar Berginc, 2012). 1 Operacijo, v okviru katere je nastala raziskava, delno financira Evropska unija iz Evropskega socialnega sklada ter Ministrstvo Operacija se izvaja v okviru Operativnega programa razvoja 2013, razvojne prioritete: 2013. 2 P GOS (http://www.korpus-gos.net/; Verdonik, Zwitter Vitez, Rozman et al., 2010). 16 novi nadaljevanje svojih dveh predhodnikov, korpusov FIDA in FidaPLUS (konec koncev bo vseboval vsa njuna besedila), smo si v projektu zastavili dva glavna cilja: na eni strani izgradnjo javno in prosto dostopnega z obsegom milijarde besed in na drugi strani pripravo zmogljivih prosto dostopnih enostavno uporabo zainteresirani javnosti. V nadaljevanju predstavljamo del rezultatov teh dveh projektnih delu sprejeli. 2. Gigafida: zbiranje gradiva, vsebina Ob pripravi na izdelavo novega korpusa, ki smo ga vsebine. Tako lahko o Gigafidi predstavimo ne le e sumarne podatke, ampak tudi vse glavne sezname in merila, ki so nas vodili pri zbiranju; dalje popoln seznam del, ki so v korpusu; seznam ta besedila izdali, oz. spletnih strani, s katerih smo jih pridobili; ter en obseg vseh , skupaj s taksonomsko kategorijo, ki smo jim jo pripisali (zlasti podrobno smo tak prikaz pripravili za korpus KRES). V Gigafidi je poleg velike gradivo, ki smo ga zbrali do 29. maja 2010 (tisk) oz. do 11. aprila 2012 (internet). , kar
Page 2 and 3: Zbornik 15. mednarodne multikonfere
Page 4 and 5: PREDGOVOR MULTIKONFERENCI INFORMACI
Page 6 and 7: KONFERENČNI ODBORI CONFERENCE COMM
Page 8 and 9: KAZALO / TABLE OF CONTENTS Jezikovn
Page 10: Zbornik 15. mednarodne multikonfere
Page 13 and 14: RECENZENTI • doc. dr. Simon Dobri
Page 15 and 16: language similarity as a feature of
Page 17 and 18: ually annotating 345 sentences and
Page 19 and 20: Disambiguating vectors for bilingua
Page 21 and 22: Language POS Source word Slovene se
Page 23: 4. Evaluation and discussion of the
Page 27 and 28: vseh besed, se pravi, da bi ta
Page 29 and 30: 3. Spletni vmesnik za dostop do kor
Page 31 and 32: SPOOK.Sem: semantično označevanje
Page 33 and 34: uporabili samo za angleški del kor
Page 35 and 36: Vsekakor pa nas je pri označevanju
Page 37 and 38: Sistem vsebinskega priporočanja do
Page 39 and 40: poljubno, avtor pa svetuje vrednost
Page 41 and 42: Tabela 5: Filtriranje z dinamično
Page 43 and 44: Luka *, * Artificial Intelligence
Page 45 and 46: 4. Technical approaches and algorit
Page 47 and 48: Tehnologije govorjenega jezika v pa
Page 49 and 50: zma. Možno jih je namre uporabiti
Page 51 and 52: Kaja Dobrovoljc, 1 Simon Krek, 2 Ja
Page 53 and 54: 3.1.4. fazah: v prvi fazi s
Page 55 and 56: (del, dol, vez, skup) in pri poveza
Page 57 and 58: ˇSirjenje slovarja in dvoprehodni
Page 59 and 60: Uporabili smo orodje s katerim smo
Page 61 and 62: zbirka besedil, korpus, slovar Toma
Page 63 and 64: -8"?> -c.org/ns/1.0" type="pb" xml:
Page 65 and 66: 6. Dostopnost virov dejanska možn
Page 67 and 68: prinašajo 185 milijonov besed oz.
Page 69 and 70: Ugotovimo lahko, da po številu pol
Page 71 and 72: predlog, zadeva, podatek, podlaga,
Page 73 and 74: Their main differences between them
Page 75 and 76:
The corpus was PoS-tagged and lemma
Page 77 and 78:
5. Conclusions In this paper we pre
Page 79 and 80:
2000) is an English computer tutor
Page 81 and 82:
Table 5: Classifier performance wit
Page 83 and 84:
forms of Croatian proper names, but
Page 85 and 86:
on the output of the first CRF. We
Page 87 and 88:
Table 4: CroNER performance dependi
Page 89 and 90:
Table 1: Description of the picture
Page 91 and 92:
syntactic movement out of the objec
Page 93 and 94:
and other agreement markers are use
Page 95 and 96:
Figure 2. A FST representing a simp
Page 97 and 98:
encoded text representation as seen
Page 99 and 100:
Izhodišče segmentacijsko-tokeniza
Page 101 and 102:
3. Učni korpus ssj500k 4 Učni kor
Page 103 and 104:
azreševanje stavčne ali celo meds
Page 105:
težavnosti; pri oceni berljivosti
Page 110 and 111:
Kako dobro programi popravljajo vej
Page 112 and 113:
okviru projekta ESS Uspešno vklju
Page 114 and 115:
Besana opozarja na manjkajoče veji
Page 116 and 117:
Umetno tvorjenje slovenskega govora
Page 118 and 119:
Tabela 1: Analiza čistopisa zbirke
Page 120 and 121:
Distributional Semantics Approach t
Page 122 and 123:
3.2.1. Random indexing Random index
Page 124 and 125:
Table 2: Accuracy for all considere
Page 126 and 127:
Avtomatsko luščenje leksikalnih p
Page 128 and 129:
3.1. Metodologija in zaporedje post
Page 130 and 131:
začetku povedi in upoštevanje t.
Page 132 and 133:
Izdelava XML-shem za slovarske proj
Page 134 and 135:
standardi še niso bili vzpostavlje
Page 136 and 137:
nepričakovan zapis ali napačen za
Page 138 and 139:
Building Named Entity Recognition M
Page 140 and 141:
corpus token transfer type transfer
Page 142 and 143:
morphological information, for Slov
Page 144 and 145:
Luščenje terminoloških kandidato
Page 146 and 147:
3.1. Enobesedni terminološki kandi
Page 148 and 149:
Rezultati priklica so dobri. Od 109
Page 150 and 151:
Event and Temporal Relation Extract
Page 152 and 153:
Table 1: Event annotation summary E
Page 154 and 155:
Table 3: Event extraction performan
Page 156 and 157:
Korpusna analiza slovenskega delež
Page 158 and 159:
Vrsta besedila Izbrana področja (i
Page 160 and 161:
primerih (11) in (12), ne pa za del
Page 162 and 163:
Identifying Fear Related Content in
Page 164 and 165:
not present” by 23 annotators. Th
Page 166 and 167:
A Web Service Implementation of Lin
Page 168 and 169:
equired to install the specific sof
Page 170 and 171:
In both figures the same workflow i
Page 172 and 173:
Termania - prosto dostopni spletni
Page 174 and 175:
informacije so poleg same vsebine g
Page 176 and 177:
Izdelava slovensko-srbskega vzpored
Page 178 and 179:
dovoljeno odstopanje do 1 sekunde.
Page 180 and 181:
Poleg navedenih težav so se pojavl
Page 182 and 183:
Topic ontology construction from En
Page 184 and 185:
Fig. 2: English topic ontology afte
Page 186 and 187:
4. Topic ontologies constructed fro
Page 188 and 189:
Translating news to CycL using the
Page 190 and 191:
and the Cyc concept, which correspo
Page 192 and 193:
without parsing. One type of such s
Page 194 and 195:
Guessing the Correct Inflectional P
Page 196 and 197:
noted that the distribution of LPPs
Page 198 and 199:
Table 1: Feature selection analysis
Page 200 and 201:
Razpoznavanje imenskih entitet v sl
Page 202 and 203:
N − log Z(x (i) ) − i=1 K k=1
Page 204 and 205:
F1 1.0 0.8 0.6 0.4 0.2 0.0 0.63 0.4
Page 206 and 207:
sloWCrowd: orodje za popravljanje w
Page 208 and 209:
2.2. Uporabniški vmesnik Osnovna f
Page 210 and 211:
uporabnikov z večanjem stopnje zan
Page 212 and 213:
Korpus slovenskega znakovnega jezik
Page 214 and 215:
Posnetki se v anonimizirani obliki
Page 216 and 217:
ZEN: zasnova glasovnih e-storitev v
Page 218 and 219:
Rešitev je bila izvedena na podlag
Page 220 and 221:
predpisane aktivnosti v poteku zdra
Page 222 and 223:
Indeks avtorjev / Author index Agi
show all

Proceedings - Natural Language Server - IJS

You also want an ePaper? Increase the reach of your titles

Delete template?

Save as template?