Proceedings - Natural Language Server - IJS

More documents

Recommendations

Info

Translating news to CycL using the XLE parser Janez Starc, Blaˇz Fortuna Artificial Intelligence Laboratory Joˇzef Stefan Institute Jamova 39, SI-1000 Ljubljana {janez.starc, blaz.fortuna}@ijs.si Abstract We built a pipeline for translating text into logical representation, which falls in the category of Machine Reading (MR). The essential component of the pipeline and our main contribution is the non-probabilistic rule-based framework. Other components are: syntatic parser, XLE, to extract gramamtical features from text, Enrycher, for additional natural language processing, and Cyc for semantic resources, reasoning, and its language CycL to respresent the translated knowledge. We defined and implemented several rules on the framework. We evaluated them on business news. In the discussion we identified several challenges in MR. Prevajanje novic v jezik CycL s pomočjo razčlenjevalnika XLE Zgradili smo cevovod za prevajanje besdila v logično predstavitev. Naˇse delo spada v področje strojnega branja. Glavna komponenta cevovoda in naˇs glavni prispevek je neprobabilistično programsko ogrodje, ki temelji na pravilih. Ostale komponente so: XLE – sintaktični razčlenjevalnik, Enrycher – storitev za procesiranje naravenega jezika in Cyc – semantični vir ter avtomatski pojasnjevalnik. Za predstavitev prevedenega znanja smo uporabili jezik CycL. Na programskem ogrodju smo definirali in implementirali nekaj pravil. Ocenili smo, kako prevajajo besedila iz poslovnih novic. V zaključku smo odkrili ˇstevilne izzive v strojnem branju. 1. Introduction Vast amounts of text are published online every minute. People all over the world are communicating through Facebook updates, Tweets, blogs, etc., or more formal discourses like news articles, academic papers, books, etc. The tendency is to extract as much information as possible from these sources and express this information in a logical representation. This data can be used to populate a particular knowledge base. Consequently, new knowledge can be inferred from the knowledge base. We built a non-probabilistic rule-based system, which does several tasks sequentially. In the beginning, it obtains English textual data from a news article stream, IJS Newsfeed. Then it processes the data with natural language processing tools XLE parser (Maxwell and Kaplan, 1996) and Enrycher (ˇStajner et al., 2010). In the next step, the system translates the processed text into a logic language of Cyc system (Lenat, 1995), CycL. These logical statements are then asserted into the knowledge base of Cyc, CycKB. The main contribution of our work is a framework for implementing rules, which translate the output of the XLE parser, f-structures, to CycL. We named this component the Translator. We also defined, implemented and evaluated a few rules. Our system exploits external resources to semantically enrich its output. Our paper falls into category of Machine Reading (MR). A study of requirements for MR has been done by (Clark and Thompson, 2009). Similar work to ours was done in (Witbrock et al., 2004) using different parsers. Also similar to our work is the approach taken by researchers at Parc (Crouch, 2005) (Crouch and King, 2006) (Bobrow et al., 2007). Many others took the semi-supervised approach, for e.g. (Etzioni, 2007) (Carlson et al., 2010). In (Ghosh et al., 2011) Markov Logic Networks (MLNs) were used for MR. The rest of this paper is organized in the following way. 179 We describe all third-party components, especially parts that we use in our system in Section 2. We present the workflow of our system in Section 3. In Section 4., we evaluate our system. We finish with conclusions and proposals for future work in the last section. 2. Description of components 2.1. IJS Newsfeed To obtain online news we have used software IJS newsfeed 1 . It periodically crawls a list of RRS feeds and a subset of Google News to obtain links to news articles. Then it downloads and parses the articles to extract cleartext version of the article body, which can be further processed by Enrycher. This data is then available on two streams, cleartext and Enrycher processed. We captured news from the stream that is processed by Enrycher, which we present in the next subsection. 2.2. Enrycher Enrycher 2 (ˇStajner et al., 2010) is a service that provides shallow and deep text processing at the document level. Main features of the system include topic and keyword detection, named entity extraction, word sense disambiguation, triplet extraction. We have used sentence splitting, topic detection, named entity resolution, co-reference and anaphora resolution in our system. 2.3. XLE parser This software 3 (Maxwell and Kaplan, 1996) parses text with Lexical Functional Grammars (LFGs). It is a part of larger platform, XLE, which also provides rich graphical user interface for writing and debugging LFGs and facility 1 newsfeed.ijs.si 2 http://enrycher.ijs.si 3 www.parc.com
Slika 1: F-structure for generating text from LFGs. The main part of the platform is ordered rewrite rule system, which was used to produce semantic representations from the syntactic output of the parser (Crouch and King, 2006). This work has similarities to ours. We both use XLE parser, but different system to generate semantic representations. The XLE parser produces two mutually constraining types of output, c-structures and f-structure. C-structures or structures of syntactic constituents are constituency trees. The f-structure analysis, on the other hand, treats the sentence as being composed of attributes, which include features such as number and tense, or functional units such as subject, predicate, or object. For example, f-structure of the sentence “John drives a car.” is depicted on Figure 1. This sentence has only one f-structure. Since the system does not use any semantics there can be more than one solution to one sentence, and the parser produces a packed representation of all possible solutions. Using a form of Optimality Theory (Prince and Smolensky, 2004) solutions are ranked, based on ordered violable constraints. We use the highest ranked solution, which is written in Prolog format, for further processing. 2.4. Cyc The last third-party component that we use is Cyc 4 (Lenat, 1995). This system includes CycKB, which is an ontology and knowledge base of every day common sense knowledge. The knowledge base contains nearly 500.000 terms, about 15.000 types of relations, and about 5 million assertions. The knowledge is represented by a formal language CycL. The CycL representation of the sentence “John drives a car.” is presented on Listing 1. Listing 1: CycL sentence (#$thereExists ?ACTION (#$thereExists ?CAR (#$and (#$isa ?ACTION #$TransportInvolvingADriver) (#$isa ?CAR #$Automobile) (#$vehicle ?ACTION ?CAR) (#$driverActor ?ACTION #$John) ))) 4 www.cyc.com Slika 2: The Translator Pipeline Cyc concepts start with #$ and Cyc variables start with ?. CycL sentences are split into microtheories. Each microtheory must not contain any contradictions. Furthermore, Cyc offers an inference engine, which derives answers on the queries using the knowledge base. It is also used to reject assertions that would be inconsistent with the knowledge base. We use Cyc to assert the translated sentence into its knowledge base. We use its inference engine to answer the CycL queries of the corresponding interrogative sentences. We also utilize data about subcategorization frames, which are stored in CycKB. There are two versions of Cyc technology available: the open source OpenCyc and the more complete version, ResearchCyc, which we use in our work. 3. The Workflow In this section, we will present how our system works. The workflow is shown on Figure 2. 3.1. From news stream to f-structures In the first step, we listen to IJS newsfeed stream for some time to download a sample of articles together with additional information produced by Enrycher. Each article is tagged with several topics keywords from SIOC ontology. We built a filter, which passes through articles that have been tagged with the selected topic. Using Enrycher’s sentence and token splitting feature, another filter was built to filter out either too long or too short sentences. These types of sentences can be ungrammatical or the possibility of correct parse is very low. The parameters for this filter are minimal and maximal number of tokens in a sentence. All the filtered sentences are concatenated, and gathered in the plaintext file. This file is then sent as input to the XLE parser. The parser produces one Prolog file for each sentence. Each file includes several attributes including the actual sentence, metadata, the most probable f-structure, and c-structure. 3.2. Semantic resources from Enrycher Enrycher annotates every named instance with its type. The following types are resolved: person, location, organization, date, percentage, amount of money. For every named instance we generated one corresponding Prolog compound term, which is available to Translator. One term has four arguments: the sentence, in which the instance appears, the mention of the instance, the name of the instance, 180
Page 2 and 3:
Zbornik 15. mednarodne multikonfere
Page 4 and 5:
PREDGOVOR MULTIKONFERENCI INFORMACI
Page 6 and 7:
KONFERENČNI ODBORI CONFERENCE COMM
Page 8 and 9:
KAZALO / TABLE OF CONTENTS Jezikovn
Page 10:
Zbornik 15. mednarodne multikonfere
Page 13 and 14:
RECENZENTI • doc. dr. Simon Dobri
Page 15 and 16:
language similarity as a feature of
Page 17 and 18:
ually annotating 345 sentences and
Page 19 and 20:
Disambiguating vectors for bilingua
Page 21 and 22:
Language POS Source word Slovene se
Page 23 and 24:
4. Evaluation and discussion of the
Page 25 and 26:
, Iztok Kosem**, Berginc*** * Troj
Page 27 and 28:
vseh besed, se pravi, da bi ta
Page 29 and 30:
3. Spletni vmesnik za dostop do kor
Page 31 and 32:
SPOOK.Sem: semantično označevanje
Page 33 and 34:
uporabili samo za angleški del kor
Page 35 and 36:
Vsekakor pa nas je pri označevanju
Page 37 and 38:
Sistem vsebinskega priporočanja do
Page 39 and 40:
poljubno, avtor pa svetuje vrednost
Page 41 and 42:
Tabela 5: Filtriranje z dinamično
Page 43 and 44:
Luka *, * Artificial Intelligence
Page 45 and 46:
4. Technical approaches and algorit
Page 47 and 48:
Tehnologije govorjenega jezika v pa
Page 49 and 50:
zma. Možno jih je namre uporabiti
Page 51 and 52:
Kaja Dobrovoljc, 1 Simon Krek, 2 Ja
Page 53 and 54:
3.1.4. fazah: v prvi fazi s
Page 55 and 56:
(del, dol, vez, skup) in pri poveza
Page 57 and 58:
ˇSirjenje slovarja in dvoprehodni
Page 59 and 60:
Uporabili smo orodje s katerim smo
Page 61 and 62:
zbirka besedil, korpus, slovar Toma
Page 63 and 64:
-8"?> -c.org/ns/1.0" type="pb" xml:
Page 65 and 66:
6. Dostopnost virov dejanska možn
Page 67 and 68:
prinašajo 185 milijonov besed oz.
Page 69 and 70:
Ugotovimo lahko, da po številu pol
Page 71 and 72:
predlog, zadeva, podatek, podlaga,
Page 73 and 74:
Their main differences between them
Page 75 and 76:
The corpus was PoS-tagged and lemma
Page 77 and 78:
5. Conclusions In this paper we pre
Page 79 and 80:
2000) is an English computer tutor
Page 81 and 82:
Table 5: Classifier performance wit
Page 83 and 84:
forms of Croatian proper names, but
Page 85 and 86:
on the output of the first CRF. We
Page 87 and 88:
Table 4: CroNER performance dependi
Page 89 and 90:
Table 1: Description of the picture
Page 91 and 92:
syntactic movement out of the objec
Page 93 and 94:
and other agreement markers are use
Page 95 and 96:
Figure 2. A FST representing a simp
Page 97 and 98:
encoded text representation as seen
Page 99 and 100:
Izhodišče segmentacijsko-tokeniza
Page 101 and 102:
3. Učni korpus ssj500k 4 Učni kor
Page 103 and 104:
azreševanje stavčne ali celo meds
Page 105:
težavnosti; pri oceni berljivosti
Page 110 and 111:
Kako dobro programi popravljajo vej
Page 112 and 113:
okviru projekta ESS Uspešno vklju
Page 114 and 115:
Besana opozarja na manjkajoče veji
Page 116 and 117:
Umetno tvorjenje slovenskega govora
Page 118 and 119:
Tabela 1: Analiza čistopisa zbirke
Page 120 and 121:
Distributional Semantics Approach t
Page 122 and 123:
3.2.1. Random indexing Random index
Page 124 and 125:
Table 2: Accuracy for all considere
Page 126 and 127:
Avtomatsko luščenje leksikalnih p
Page 128 and 129:
3.1. Metodologija in zaporedje post
Page 130 and 131:
začetku povedi in upoštevanje t.
Page 132 and 133:
Izdelava XML-shem za slovarske proj
Page 134 and 135:
standardi še niso bili vzpostavlje
Page 136 and 137:
nepričakovan zapis ali napačen za
Page 138 and 139: Building Named Entity Recognition M
Page 140 and 141: corpus token transfer type transfer
Page 142 and 143: morphological information, for Slov
Page 144 and 145: Luščenje terminoloških kandidato
Page 146 and 147: 3.1. Enobesedni terminološki kandi
Page 148 and 149: Rezultati priklica so dobri. Od 109
Page 150 and 151: Event and Temporal Relation Extract
Page 152 and 153: Table 1: Event annotation summary E
Page 154 and 155: Table 3: Event extraction performan
Page 156 and 157: Korpusna analiza slovenskega delež
Page 158 and 159: Vrsta besedila Izbrana področja (i
Page 160 and 161: primerih (11) in (12), ne pa za del
Page 162 and 163: Identifying Fear Related Content in
Page 164 and 165: not present” by 23 annotators. Th
Page 166 and 167: A Web Service Implementation of Lin
Page 168 and 169: equired to install the specific sof
Page 170 and 171: In both figures the same workflow i
Page 172 and 173: Termania - prosto dostopni spletni
Page 174 and 175: informacije so poleg same vsebine g
Page 176 and 177: Izdelava slovensko-srbskega vzpored
Page 178 and 179: dovoljeno odstopanje do 1 sekunde.
Page 180 and 181: Poleg navedenih težav so se pojavl
Page 182 and 183: Topic ontology construction from En
Page 184 and 185: Fig. 2: English topic ontology afte
Page 186 and 187: 4. Topic ontologies constructed fro
Page 190 and 191: and the Cyc concept, which correspo
Page 192 and 193: without parsing. One type of such s
Page 194 and 195: Guessing the Correct Inflectional P
Page 196 and 197: noted that the distribution of LPPs
Page 198 and 199: Table 1: Feature selection analysis
Page 200 and 201: Razpoznavanje imenskih entitet v sl
Page 202 and 203: N − log Z(x (i) ) − i=1 K k=1
Page 204 and 205: F1 1.0 0.8 0.6 0.4 0.2 0.0 0.63 0.4
Page 206 and 207: sloWCrowd: orodje za popravljanje w
Page 208 and 209: 2.2. Uporabniški vmesnik Osnovna f
Page 210 and 211: uporabnikov z večanjem stopnje zan
Page 212 and 213: Korpus slovenskega znakovnega jezik
Page 214 and 215: Posnetki se v anonimizirani obliki
Page 216 and 217: ZEN: zasnova glasovnih e-storitev v
Page 218 and 219: Rešitev je bila izvedena na podlag
Page 220 and 221: predpisane aktivnosti v poteku zdra
Page 222 and 223: Indeks avtorjev / Author index Agi
show all

Proceedings - Natural Language Server - IJS

You also want an ePaper? Increase the reach of your titles

Delete template?

Save as template?