Proceedings - Natural Language Server - IJS

More documents

Recommendations

Info

noted that the distribution of LPPs in the lexicon with respect to the paradigms is very uneven; the ten least frequent paradigms appear only 40 times in the lexicon, whereas the ten most frequent paradigms appear over 50,000 times. 4. Features Given an LPP, we compute a set of features based on which the LPP can be classified as either correct or incorrect. We distinguish between two groups of features: string-based and corpus-based. 4.1. String-based features The string-based features are based on the orthographic properties of the lemma or the stem. The intuition behind this is that incorrect LPPs tend to generate ill-formed (or somewhat odd-formed) stems and lemmas. For example, there is no adjective in Croatian language that ends in -kč; an LPP that would generate such a stem could be discarded immediately. In fact, many paradigms defined in traditional grammar books are conditioned on the stem ending, requiring that it belongs to a certain group of phonemes or that it forms a consonant group. Similarly, there are paradigms that are applicable only to one-syllable stems. We use the following string-based features: 1. EndsIn – the ending character of the stem; 2. EndsInCgr – a binary feature indicating whether the word-forms ends in a consonant group (two consecutive consonants); 3. EndsInCons – a binary feature indicating whether the word-form ends in a consonant; 4. EndsInNonPals – a binary feature indicating whether the word-form ends in a non-palatal (v, r, l, m, n, p, b, f, t, d, s, z, c, k, g, or h); 5. EndsInPals – a binary feature indicating whether the word-form ends in a palatal (lj, nj, ć, d, č, dˇz, ˇs, ˇz, or j); 6. EndsInVelars – a binary feature indicating whether the word-form ends in a velar (k, g, or h); 7. LemmaSuffixProb – the probability P (sl|p) of lemma l having a three-letter suffix sl given inflectional paradigm p; 8. StemSuffixProb – the probability P (ss|p) of stem s having a three-letter suffix ss given inflectional paradigm p; 9. StemLength – the number of characters in the stem; 10. NumSyllables – the number of syllables in the stem (determined heuristically); 11. OneSyllable – a binary feature indicating whether NumSyllables equals 1. 187 4.2. Corpus-based features The corpus-based features are calculated based on the frequencies of word-forms attested in the corpus. The general idea is that a correct LPP should have more of its wordforms attested in the corpus than an incorrect LPP. Instead of only looking at total counts of attested word-forms, one can also look at the distributions of attested word-forms across the morphological tags. The intuition behind this is that every inflectional paradigm has its own distribution of morphological tags, and that a correct LPP will generate word-forms that obey such a distribution. For instance, in case of a noun paradigm, we can expect a genitive wordform to be far more frequent than a vocative word-form. Hence, an LPP that generates more vocative word-forms than genitive word-forms is unlikely to be correct. In what follows, we use #(w, C) to denote the number of occurrences of word-form w in corpus C. Set T (p) denotes the set of morphological tags of inflectional paradigm p. Let P (t|p) denote the probability distribution of morphological tag t conditioned on the inflectional paradigm p, and let P (t|l, p) denote the probability of morphological tag t generated by LPP (l, p). We obtain these distributions as maximum likelihood estimates using the LPPs from the inflectional lexicon L and word-form frequencies from corpus C: (l,p P (t|p) = ′ )∈L;p ′ =p;(w,t ′ )∈wfs(l,p);t ′ =t #(w, C) (l,p ′ )∈L;p ′ =p;w∈wfs ′ (l,p) #(w, C) (w,t P (t|l, p) = ′ )∈wfs(l,p);t ′ =t #(w, C) #(w, C) w∈wfs ′ (l,p) where wfs ′ is a simpler version of the wfs function that only returns the word-forms. Notice that, because we do not perform POS tagging of the corpus, we count the ambiguous word-forms (inner and outer homographs) multiple times. We use the following corpus-based features: 1. LemmaAttested – a binary feature indicating whether the lemma is attested in the corpus, i.e., #(l, C) > 0; 2. Score0 – the number of corpus-attested word-form types generated by the LPP: score0(l, p) = |wfs ′ (l, p) ∩ C| 3. Score1 – the sum of corpus frequencies of word-forms generated by the LPP: score1(l, p) = #(w, C) w∈wfs ′ (l,p) 4. Score2 – the proportion of corpus-attested word-form types generated by the LPP: score2(l, p) = |wfs′ (l, p) ∩ C| |wfs ′ (l, p)| 5. Score3 – the product of paradigm-conditioned distribution of morphological tags and the distribution of tags generated by the LPP: score3(l, p) = P (t|p) × P (t|l, p) t∈T (p)
6. Score4 – the expected number of corpus-attested word-form types generated by the LPP: score4(l, p) = P (t|p) × min 1, #(w, C) t∈T (p) 7. Score5 – the Kullback-Leibler divergence between the paradigm-conditioned distribution of morphological tags, p1(t) = P (t|p), and the distribution of tags generated by the LPP, p2(t) = P (t|l, p): score5(l, p) = KL(p1||p2) 8. Score6 – the Jensen-Shannon divergence between the aforementioned distributions: score6(l, p) = KL(p1||p2) + KL(p2||p1) 9. Score7 – the cosine similarity between the aforementioned distributions: t∈T (p) score7(l, p) = p1(t) × p2(t) t∈T (p) p1(t) 2 × t∈T (p) p2(t) 2 We computed the above features on the Vjesnik newspaper corpus totaling 23 million word-form tokens and 330,298 word-form types (the same corpus was that used for lexicon acquisition in (ˇSnajder et al., 2008)). 4.3. Other features Besides the string- and corpus-based features, we also use the following two features: 1. ParadigmId – a nominal feature denoting the LPP’s inflectional paradigm; 2. POS – the part-of-speech of the LPP’s inflectional paradigm (noun, adjective, or verb). 5. Evaluation The purpose of evaluation is twofold: apart from determining how accurately we can guess the inflectional paradigms, we also wish to analyze what features are most useful for this task. 5.1. Data set We compiled the data set for training and testing from the aforementioned inflectional lexicon (ˇSnajder et al., 2008). We sampled from the lexicon 5,000 LPPs for training and 5,000 LPPs for testing. Because the distribution of paradigms is very uneven, we used stratified sampling with respect to the inflectional paradigms. Moreover, we ensured that there is no LPP that appears in the test set, but does not appear in the training set (otherwise the probability distributions would be undefined). To generate the negative training and testing examples, we proceeded as follows. For each LPP, we generate all word-forms using the function wfs. Then, for all corpus-attested obtained wordforms, we generate the candidate LPPs using the function lm, and filter out those LPPs that exist in the lexicon. This 188 generates a large number of incorrect LPPs, from which we again sample 5,000 for training and 5,000 for testing. Thus we end up with 10,000 LPPs (5,000 correct and 5,000 incorrect) in each the training and the test set. Given the number of classes and features (a total of 146 binary-encoded features), the amount of training data ought to be sufficient; a larger training set would unnecessary increase the time required for training. Notice that the training set contains correct and some incorrect LPPs for each selected word-form, while the test set contains LPPs obtained from word-forms that did not appear in the training set. 5.2. Feature analysis Some of the features we defined are redundant or perhaps irrelevant for LPP classification. Because in absolute terms the number of features is not large, we need not perform feature analysis in order to reduce this number. Instead, the purpose of our feature analysis is to gain insight into what features are useful for paradigm guessing. For feature analysis we used the open source tool Weka (Hall et al., 2009). Table 1 summarizes the results. We used three univariate filtering methods: information gain (IG), gain ratio (GR), and RELIEF method (Kononenko, 1994). We lists feature rankings obtained on the training set, with first five ranks shown in bold. The first two methods produced similar rankings: among string-based features, suffix probabilities are ranked the highest, while among corpusbased features, feature Score5 is often ranked high, while ranks of other features vary. There are a number of features that are low-ranked (rank > 10) by each of the three methods: the five EndsIn* features, NumSyllables, OneSyllable, StemLength, Score1, Score3, and POS. The univariate methods do not measure the dependencies between the features, thus they cannot detect feature redundancy. We therefore also analyzed the features using two multivariate feature subset selection (FSS) methods: correlation-based feature selection (CFS) (Hall, 1998) and consistency subset selection (CSS) (Liu and R., 1996), both with greedy forward search as the optimization method. Table 1 shows the optimal subset selection obtained with each of these methods. Notice that both selected subsets contain both string- and corpus-based features. 5.3. Classification accuracy For training and testing of models, we used the LIB- SVM implementation of the SVM algorithm (Chang and Lin, 2011). We trained eight models using different feature subsets. We optimized the parameters of each model separately using 5-fold cross-validation on the training set. Classification accuracy on the test set is shown in Table 2. The reliability of probability estimates used for some of the corpus-based features depends on the frequencies of word-forms in the corpus. In a realistic setting, the unknown words tend to be less frequent in corpus. The last two columns of Table 2 show the classification accuracy for LPPs for which the frequency of word-forms in the corpus is less than or equal to 100 (rare words, accounting for 66% of the test set) and less than or equal to 10 (very rare words, accounting for 22% of the test set). The performance baseline is the majority class in each test set.
Page 2 and 3:
Zbornik 15. mednarodne multikonfere
Page 4 and 5:
PREDGOVOR MULTIKONFERENCI INFORMACI
Page 6 and 7:
KONFERENČNI ODBORI CONFERENCE COMM
Page 8 and 9:
KAZALO / TABLE OF CONTENTS Jezikovn
Page 10:
Zbornik 15. mednarodne multikonfere
Page 13 and 14:
RECENZENTI • doc. dr. Simon Dobri
Page 15 and 16:
language similarity as a feature of
Page 17 and 18:
ually annotating 345 sentences and
Page 19 and 20:
Disambiguating vectors for bilingua
Page 21 and 22:
Language POS Source word Slovene se
Page 23 and 24:
4. Evaluation and discussion of the
Page 25 and 26:
, Iztok Kosem**, Berginc*** * Troj
Page 27 and 28:
vseh besed, se pravi, da bi ta
Page 29 and 30:
3. Spletni vmesnik za dostop do kor
Page 31 and 32:
SPOOK.Sem: semantično označevanje
Page 33 and 34:
uporabili samo za angleški del kor
Page 35 and 36:
Vsekakor pa nas je pri označevanju
Page 37 and 38:
Sistem vsebinskega priporočanja do
Page 39 and 40:
poljubno, avtor pa svetuje vrednost
Page 41 and 42:
Tabela 5: Filtriranje z dinamično
Page 43 and 44:
Luka *, * Artificial Intelligence
Page 45 and 46:
4. Technical approaches and algorit
Page 47 and 48:
Tehnologije govorjenega jezika v pa
Page 49 and 50:
zma. Možno jih je namre uporabiti
Page 51 and 52:
Kaja Dobrovoljc, 1 Simon Krek, 2 Ja
Page 53 and 54:
3.1.4. fazah: v prvi fazi s
Page 55 and 56:
(del, dol, vez, skup) in pri poveza
Page 57 and 58:
ˇSirjenje slovarja in dvoprehodni
Page 59 and 60:
Uporabili smo orodje s katerim smo
Page 61 and 62:
zbirka besedil, korpus, slovar Toma
Page 63 and 64:
-8"?> -c.org/ns/1.0" type="pb" xml:
Page 65 and 66:
6. Dostopnost virov dejanska možn
Page 67 and 68:
prinašajo 185 milijonov besed oz.
Page 69 and 70:
Ugotovimo lahko, da po številu pol
Page 71 and 72:
predlog, zadeva, podatek, podlaga,
Page 73 and 74:
Their main differences between them
Page 75 and 76:
The corpus was PoS-tagged and lemma
Page 77 and 78:
5. Conclusions In this paper we pre
Page 79 and 80:
2000) is an English computer tutor
Page 81 and 82:
Table 5: Classifier performance wit
Page 83 and 84:
forms of Croatian proper names, but
Page 85 and 86:
on the output of the first CRF. We
Page 87 and 88:
Table 4: CroNER performance dependi
Page 89 and 90:
Table 1: Description of the picture
Page 91 and 92:
syntactic movement out of the objec
Page 93 and 94:
and other agreement markers are use
Page 95 and 96:
Figure 2. A FST representing a simp
Page 97 and 98:
encoded text representation as seen
Page 99 and 100:
Izhodišče segmentacijsko-tokeniza
Page 101 and 102:
3. Učni korpus ssj500k 4 Učni kor
Page 103 and 104:
azreševanje stavčne ali celo meds
Page 105:
težavnosti; pri oceni berljivosti
Page 110 and 111:
Kako dobro programi popravljajo vej
Page 112 and 113:
okviru projekta ESS Uspešno vklju
Page 114 and 115:
Besana opozarja na manjkajoče veji
Page 116 and 117:
Umetno tvorjenje slovenskega govora
Page 118 and 119:
Tabela 1: Analiza čistopisa zbirke
Page 120 and 121:
Distributional Semantics Approach t
Page 122 and 123:
3.2.1. Random indexing Random index
Page 124 and 125:
Table 2: Accuracy for all considere
Page 126 and 127:
Avtomatsko luščenje leksikalnih p
Page 128 and 129:
3.1. Metodologija in zaporedje post
Page 130 and 131:
začetku povedi in upoštevanje t.
Page 132 and 133:
Izdelava XML-shem za slovarske proj
Page 134 and 135:
standardi še niso bili vzpostavlje
Page 136 and 137:
nepričakovan zapis ali napačen za
Page 138 and 139:
Building Named Entity Recognition M
Page 140 and 141:
corpus token transfer type transfer
Page 142 and 143:
morphological information, for Slov
Page 144 and 145:
Luščenje terminoloških kandidato
Page 146 and 147: 3.1. Enobesedni terminološki kandi
Page 148 and 149: Rezultati priklica so dobri. Od 109
Page 150 and 151: Event and Temporal Relation Extract
Page 152 and 153: Table 1: Event annotation summary E
Page 154 and 155: Table 3: Event extraction performan
Page 156 and 157: Korpusna analiza slovenskega delež
Page 158 and 159: Vrsta besedila Izbrana področja (i
Page 160 and 161: primerih (11) in (12), ne pa za del
Page 162 and 163: Identifying Fear Related Content in
Page 164 and 165: not present” by 23 annotators. Th
Page 166 and 167: A Web Service Implementation of Lin
Page 168 and 169: equired to install the specific sof
Page 170 and 171: In both figures the same workflow i
Page 172 and 173: Termania - prosto dostopni spletni
Page 174 and 175: informacije so poleg same vsebine g
Page 176 and 177: Izdelava slovensko-srbskega vzpored
Page 178 and 179: dovoljeno odstopanje do 1 sekunde.
Page 180 and 181: Poleg navedenih težav so se pojavl
Page 182 and 183: Topic ontology construction from En
Page 184 and 185: Fig. 2: English topic ontology afte
Page 186 and 187: 4. Topic ontologies constructed fro
Page 188 and 189: Translating news to CycL using the
Page 190 and 191: and the Cyc concept, which correspo
Page 192 and 193: without parsing. One type of such s
Page 194 and 195: Guessing the Correct Inflectional P
Page 198 and 199: Table 1: Feature selection analysis
Page 200 and 201: Razpoznavanje imenskih entitet v sl
Page 202 and 203: N − log Z(x (i) ) − i=1 K k=1
Page 204 and 205: F1 1.0 0.8 0.6 0.4 0.2 0.0 0.63 0.4
Page 206 and 207: sloWCrowd: orodje za popravljanje w
Page 208 and 209: 2.2. Uporabniški vmesnik Osnovna f
Page 210 and 211: uporabnikov z večanjem stopnje zan
Page 212 and 213: Korpus slovenskega znakovnega jezik
Page 214 and 215: Posnetki se v anonimizirani obliki
Page 216 and 217: ZEN: zasnova glasovnih e-storitev v
Page 218 and 219: Rešitev je bila izvedena na podlag
Page 220 and 221: predpisane aktivnosti v poteku zdra
Page 222 and 223: Indeks avtorjev / Author index Agi
show all

Proceedings - Natural Language Server - IJS

Create successful ePaper yourself

Delete template?

Save as template?