Proceedings - Natural Language Server - IJS

Topic ontology construction from English and Slovene language technologies corpora

Jasmina Smailović¹, Senja Pollak¹,²
¹ Jožef Stefan Institute, Jamova cesta 39, Ljubljana, Slovenia
² Faculty of Arts, University of Ljubljana, Aškerčeva 2, Ljubljana, Slovenia
{jasmina.smailovic, senja.pollak}@ijs.si

Abstract
This paper presents the OntoGen topic ontology construction tool and the process of building topic ontologies from English and Slovene research papers in the domain of language technologies. We were interested in how cleaning the documents (e.g. removing the references section), manually moving and renaming concepts, and using supervised active learning affect the ontologies.

Gradnja ontologij tematik iz angleškega in slovenskega korpusa jezikovnih tehnologij
The article presents the OntoGen tool and the process of building topic ontologies from English and Slovene scientific articles in the field of language technologies. We were interested in how cleaning the articles (e.g. deleting the references section), manually renaming and moving concepts, and using the active learning method affect the topic ontologies.

1. Introduction
Manual construction of taxonomies and ontologies represents a significant investment of human resources when modeling a new domain. Methods for (semi-)automatic extraction of domain knowledge from unstructured texts have therefore been developed. Automatic taxonomy construction was addressed, for example, by Navigli et al. (2011) and Kozareva and Hovy (2010). While an ontology is a "formal, explicit specification of a shared conceptualization" (Gruber, 1993), represented as a set of domain concepts and the relationships between them, a topic ontology is a set of domain topics or concepts¹, each formed of related documents, represented by the most characteristic topic keywords and related by the subconcept-of relationship (Fortuna et al., 2005; 2007).
The task addressed in this paper is to semi-automatically construct a topic ontology from documents in the area of language technologies. This domain has already been modeled in previous research: the main domain publications were collected by Bird (2008), Joseph and Radev (2007) performed a citation analysis, and domain topic and trend analyses were carried out by Hall et al. (2008) and Paul and Girju (2009) using LDA (Blei et al., 2003).

In this work, our goal is to get an overview of the topics covered at the Slovene language technologies conference. To do so, we semi-automatically constructed two topic ontologies, one from papers written in Slovene and the other from papers written in English. The constructed topic ontology is corpus-driven and represents only the concepts covered in the given corpus. For this task, we used OntoGen² (Fortuna et al., 2005; 2007), a data-driven ontology editor that focuses on extracting and editing topic ontologies. We investigated how removing information specific to scientific articles, such as references or authors' names, affects the ontology. A particularly interesting part of this research was comparing the topic ontologies obtained with and without supervised active learning (Cohn et al., 1994; Settles, 2009).

We continue the research presented in Smailović and Pollak (2011), where the English articles were modeled into a topic ontology. In addition to the English topic ontology, in this paper we also model the Slovene part of the corpus as a separate topic ontology.

The paper is structured as follows. Section 2 describes the corpus, data preparation and the ontology editing tool. In Sections 3 and 4 the topic ontology construction process is presented. Section 5 provides the conclusions.

¹ In this paper the words concept and topic are used as synonyms.
² http://ontogen.ijs.si/

2. The corpus, data preparation and the OntoGen topic ontology construction tool
The articles for this case study were taken from the proceedings of the Slovene Language Technologies Conference (proceedings of seven conference editions are available online: http://www.sdjt.si/konference.html). As the papers (79 in English and 109 in Slovene) were available as PDF documents, we had to transform them into a textual format suitable for the OntoGen tool, i.e. the named-line document format.

The first step was to transform the documents to a text-only format. PDF-to-text conversion was performed using PDFBox³ and Nitro PDF Reader⁴, and the text files were converted to UTF-8 encoding. Next, we separated the English articles from the Slovene ones.

We present two settings: in the first, the topic ontology is constructed without cleaning the documents; in the second, semi-automatic data preprocessing is performed first. For the latter, we used Perl scripts to discard parts of the articles such as authors' names, institutions, references, section numbers, tables and page numbers, which gave us the “cleaned documents”.

Once the documents were represented in the named-line document format, OntoGen was used to build a topic ontology. OntoGen is a semi-automatic and data-driven ontology editor. Semi-automatic means that the system is an interactive tool that aids the user during the topic ontology construction process. Data-driven means that most of the aid provided by the system is based on the underlying text data (document corpus) provided by the user. The system combines text mining techniques (K-means document clustering) with an efficient user interface to reduce both the time spent on manual ontology construction and its complexity for the user.

³ http://www.codeproject.com/KB/string/pdf2text.aspx
⁴ http://www.nitropdf.com/

3. Topic ontology on raw documents
We first examined what the topic ontology looks like if we do not perform any additional cleaning of the corpus. In this case, each text document in the named-line format consists of an ID, followed by the title, the names of the authors, the main text of the article and the references.

OntoGen uses K-means clustering, a method of cluster analysis that aims to partition N instances (documents, in our case) into K clusters such that each instance belongs to the cluster with the nearest mean. If we build a topic ontology automatically, by only suggesting to OntoGen the number K of concepts at each node of the concept hierarchy, we obtain the result shown for the English articles in Figure 1. For every concept, we tried different values of K and chose the one that splits the concept best according to the user's understanding of the area. Neither active learning nor concept renaming was used in this topic ontology construction process.

As one can see from the figure, the names of the concepts/topics are not intuitive, and in some cases it is hard to understand what they represent. This happens because OntoGen names each concept with the three most frequent words from its automatically constructed keyword list.
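The clustering and naming steps described above can be illustrated with a minimal sketch. It is not OntoGen's implementation: the bag-of-words weighting, the farthest-first choice of initial centroids, the use of centroid weight to rank keywords, and all function names (`to_vector`, `kmeans`, `concept_name`) are assumptions made for illustration, and the four toy documents are invented.

```python
import math
from collections import Counter


def to_vector(text):
    """Bag-of-words vector (word -> weight), scaled to unit length."""
    counts = Counter(text.lower().split())
    norm = math.sqrt(sum(c * c for c in counts.values())) or 1.0
    return {word: c / norm for word, c in counts.items()}


def squared_distance(a, b):
    """Squared Euclidean distance between two sparse vectors."""
    return sum((a.get(w, 0.0) - b.get(w, 0.0)) ** 2 for w in set(a) | set(b))


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of sparse vectors."""
    total = Counter()
    for vec in vectors:
        total.update(vec)  # adds the weights word by word
    return {word: weight / len(vectors) for word, weight in total.items()}


def initial_centroids(vectors, k):
    """Assumed seeding: start from the first vector, then repeatedly add the
    vector whose nearest already-chosen centroid is farthest away."""
    centroids = [vectors[0]]
    while len(centroids) < k:
        centroids.append(max(
            vectors,
            key=lambda v: min(squared_distance(v, c) for c in centroids)))
    return centroids


def kmeans(vectors, k, max_iterations=50):
    """Partition the vectors into k clusters; returns (assignment, centroids)."""
    centroids = initial_centroids(vectors, k)
    assignment = None
    for _ in range(max_iterations):
        # Assignment step: every document goes to the nearest mean.
        new_assignment = [
            min(range(k), key=lambda j: squared_distance(vec, centroids[j]))
            for vec in vectors
        ]
        if new_assignment == assignment:  # nothing moved: converged
            break
        assignment = new_assignment
        # Update step: each centroid becomes the mean of its members.
        for j in range(k):
            members = [v for v, a in zip(vectors, assignment) if a == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = mean_vector(members)
    return assignment, centroids


def concept_name(centroid, n=3):
    """Join the n highest-weighted centroid words (ties broken alphabetically)."""
    ranked = sorted(centroid, key=lambda w: (-centroid[w], w))
    return ", ".join(ranked[:n])


# Invented toy documents, only to show the calls.
docs = [
    "speech synthesis speech vowel",
    "speech recognition speech speakers",
    "corpus tagging corpus lemma",
    "corpus annotation tagging corpus",
]
assignment, centroids = kmeans([to_vector(d) for d in docs], k=2)
for j, centroid in enumerate(centroids):
    print(j, concept_name(centroid))
```

Because the name is simply the top of the keyword list, words that happen to be frequent in a cluster end up in its label whether or not they describe the topic well.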
For example, if a concept is described by the keywords slovenian, translation, vowel, speakers, synthesis, speech, corpus, tagging etc., OntoGen will name this concept “slovenian, translation, vowel”. A better way of naming concepts is to involve an expert, who can quickly find an appropriate concept name after inspecting all the topic keywords. With this approach, the topic above could be called Speech technologies. All the concepts in the English and Slovene topic ontologies were therefore manually renamed based on the automatically extracted topic keywords.

Next, we observed that several topics/concepts were missing from the topic ontology. For terms that often occur in the keyword lists of different concepts but are not among the three main keywords of any topic, we decided to use the supervised method for adding topics, which is based on OntoGen's Support Vector Machine (SVM) active learning method. For the English corpus, we entered queries for the Speech recognition and Speech translation concepts and answered a number of automatically proposed questions of the type “Would you classify the document number 41 as an article on the topic of Speech recognition?”, which enabled the system to label the instances. After the concept node was constructed, it was added to the ontology as a sub-concept of the selected concept, in our case Speech technologies.

Similarly, we performed active learning on the Slovene corpus. We entered queries for Prevajanje govora (Speech translation), trying to identify the most common and important words for the missing sub-concept and putting them in the query. Then, as for the English articles, we answered several questions, and a new sub-concept was added to the Slovene topic ontology.

After manually renaming the concepts, using active learning to add concepts, and manually moving some documents from one concept to another, we obtained an improved topic ontology.
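The query-and-question loop described above can be sketched as pool-based uncertainty sampling. This is a simplified illustration rather than OntoGen's method: a mean-difference linear scorer stands in for the SVM, the two seed labels are assumed from the query ranking instead of being asked for, and the names `build_concept` and `ask_user` as well as the toy documents are hypothetical.

```python
import math
from collections import Counter


def to_vector(text):
    """Bag-of-words vector (word -> weight), scaled to unit length."""
    counts = Counter(text.lower().split())
    norm = math.sqrt(sum(c * c for c in counts.values())) or 1.0
    return {word: c / norm for word, c in counts.items()}


def dot(a, b):
    """Dot product of two sparse vectors."""
    return sum(weight * b.get(word, 0.0) for word, weight in a.items())


def mean_vector(vectors):
    """Component-wise mean of sparse vectors; empty dict for an empty list."""
    total = Counter()
    for vec in vectors:
        total.update(vec)
    return {w: x / len(vectors) for w, x in total.items()} if vectors else {}


def make_scorer(vectors, labels):
    """Linear stand-in for the SVM: similarity to the labelled positives
    minus similarity to the labelled negatives; 0 is the decision boundary."""
    positive = mean_vector([vectors[i] for i, y in labels.items() if y])
    negative = mean_vector([vectors[i] for i, y in labels.items() if not y])
    return lambda vec: dot(vec, positive) - dot(vec, negative)


def build_concept(docs, query, ask_user, questions=2):
    """Return the indices of the documents placed in the new concept.
    Expects at least two documents."""
    vectors = [to_vector(d) for d in docs]
    query_vec = to_vector(query)
    ranked = sorted(range(len(docs)), key=lambda i: dot(vectors[i], query_vec))
    # Assumption: the best and the worst query match seed the two classes.
    labels = {ranked[-1]: True, ranked[0]: False}
    for _ in range(questions):
        unlabeled = [i for i in range(len(docs)) if i not in labels]
        if not unlabeled:
            break
        score = make_scorer(vectors, labels)
        # Ask about the document the current model is least sure of.
        uncertain = min(unlabeled, key=lambda i: abs(score(vectors[i])))
        labels[uncertain] = ask_user(uncertain, docs[uncertain])
    score = make_scorer(vectors, labels)
    return [i for i in range(len(docs))
            if (labels[i] if i in labels else score(vectors[i]) > 0)]


# Invented toy pool; the lambda simulates the user's yes/no answers.
pool = [
    "speech recognition acoustic model speech",
    "speech recognition errors",
    "speech synthesis vowel",
    "corpus tagging lemma",
    "parallel corpus alignment",
]
members = build_concept(pool, "speech recognition",
                        ask_user=lambda i, text: "recognition" in text)
print(members)
```

In interactive use, `ask_user` would show the document and wait for the answer to a question such as the one quoted above; the resulting member list is what would be attached to the ontology as the new sub-concept.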
The resulting English topic ontology is shown in Figure 2. This ontology is more intuitive and easier to understand. The figure shows that language technologies have Computational linguistics and Speech technologies as their core concepts. This matches the general division of the field of language technology; Wikipedia, for example, defines it as follows: “Language technology is often called human language technology (HLT) or natural language processing (NLP) and consists of computational linguistics (or CL) and speech technology as its core but includes also many application oriented aspects of them.”

Fig. 1: English topic ontology built without cleaning the text documents and without renaming the concepts.
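As a rough illustration of the structure discussed in this section, a topic ontology can be written down as a tree of concepts linked by the subconcept-of relation. The `Concept` class below is a hypothetical sketch, not OntoGen's data model, and it contains only the concepts named in the text; the ontology in Figure 2 may contain more.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


@dataclass
class Concept:
    """One topic: a name, its characteristic keywords, the IDs of the
    documents it groups, and its sub-concepts."""
    name: str
    keywords: List[str] = field(default_factory=list)
    documents: List[str] = field(default_factory=list)
    subconcepts: List["Concept"] = field(default_factory=list)

    def add(self, child: "Concept") -> "Concept":
        """Attach child as a sub-concept and return it for chaining."""
        self.subconcepts.append(child)
        return child

    def walk(self, depth: int = 0) -> Iterator[Tuple[int, "Concept"]]:
        """Yield (depth, concept) pairs in depth-first order."""
        yield depth, self
        for child in self.subconcepts:
            yield from child.walk(depth + 1)


root = Concept("Language technologies")
root.add(Concept("Computational linguistics"))
speech = root.add(Concept(
    "Speech technologies",
    # Keyword list quoted in the text for this concept.
    keywords=["slovenian", "translation", "vowel", "speakers",
              "synthesis", "speech", "corpus", "tagging"],
))
# Sub-concepts added through active learning; document lists left empty here.
speech.add(Concept("Speech recognition"))
speech.add(Concept("Speech translation"))

for depth, concept in root.walk():
    print("  " * depth + concept.name)
```

Moving a document from one concept to another, as done manually in the paper, would amount to moving its ID between two `documents` lists.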