Onto.PT: Towards the Automatic Construction of a Lexical Ontology ...


Chapter 3. Related Work

Due to the scalability issues of KnowItAll, its authors proposed the paradigm of Open Information Extraction (OIE; Banko et al., 2007; see section 2.3.2 for more details). OIE systems make a single data-driven pass over a corpus and extract a large set of relational tuples, without requiring any human input. TextRunner (Banko et al., 2007) is a fully-implemented OIE system. First, a small corpus sample is given as input, in order to train a classifier that labels candidate extractions as trustworthy or not. Then, all tuples that are potential relations are extracted from the corpus. In the last step, relation names are normalised and each tuple is assigned a probability. TextRunner is more scalable than KnowItAll, has a lower error rate and, considering only a set of 10 relation types, both systems extract an identical number of relations. However, since TextRunner does not take the names of the relations as input, its complete set of extractions contains more types of relations.

More recently, ReVerb (Etzioni et al., 2011; Fader et al., 2011), a new and more efficient OIE system that does not need a classifier, was presented. ReVerb is based solely on two constraints: (i) a syntactic constraint requires that the relation phrase matches a POS regular expression (verb | verb prep | verb word* prep); (ii) a lexical constraint requires that each relevant relation phrase occurs in the corpus with different arguments. The following illustrate ReVerb extractions:

• {Calcium, prevents, osteoporosis}
• {A galaxy, consists of, stars and stellar remnants}
• {Most galaxies, appear to be, dwarf galaxies, which are small}

The Never Ending Language Learner (NELL, Carlson et al. (2010a)) learns from reading contents on the Web and gets better at reading as it reads the same text multiple times. NELL's starting point is: (i) a set of fundamental categories (e.g. person, sportsTeam, fruit, emotion) and relation types (e.g. playsOnTeam(athlete,sportsTeam), playsInstrument(musician,instrument)), which constitute an ontology; and (ii) a set of 10 to 15 seed examples for each category and relation. Then, NELL reads web pages continuously, 24 hours a day, to extract new category instances and new relations between instances, which are used to populate the ontology. The extracted contents serve as a self-supervised collection of training examples, used in the acquisition of new discriminating patterns. NELL employs coupled training (Carlson et al., 2010b), which combines the simultaneous training of many extraction methods. The following are examples of NELL extractions:

• musicArtistGenre(Nirvana, Grunge)
• tvStationInCity(WLS-TV, Chicago)
• sportUsesEquip(soccer, balls)

The main difference between NELL and OIE systems is that NELL learns extractors for a fixed set of known relations, while an OIE system can extract meaningful information from any kind of corpus, on any domain, as relations are not given as a starting point (Etzioni et al., 2011). This also has an impact on the quantity of extracted knowledge. Still, Mohamed et al. (2011) have recently reported how a system like NELL can learn new relation types between already extracted categories.
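The two ReVerb constraints described above can be illustrated with a minimal Python sketch. It is not ReVerb's implementation: the coarse tags (VERB, PREP), the one-letter encoding, the function names and the threshold `min_distinct_args` are all assumptions made for this example, and every tag other than VERB or PREP is simply treated as a generic word.

```python
import re

# Coarse tag -> one-letter symbol. The tag names are assumptions made for this
# sketch; any tag that is not listed here is treated as a generic word ("W").
TAG_SYMBOL = {"VERB": "V", "PREP": "P"}

# verb | verb prep | verb word* prep, written over the one-letter symbols.
# The longest alternative comes first so that it is preferred when it applies.
RELATION_PATTERN = re.compile(r"VW*P|VP|V")


def relation_phrases(tagged_tokens):
    """Return the token spans that satisfy the syntactic constraint."""
    symbols = "".join(TAG_SYMBOL.get(tag, "W") for _, tag in tagged_tokens)
    phrases = []
    for match in RELATION_PATTERN.finditer(symbols):
        words = [word for word, _ in tagged_tokens[match.start():match.end()]]
        phrases.append(" ".join(words))
    return phrases


def lexically_valid(extractions, min_distinct_args=2):
    """Keep relation phrases seen with enough distinct argument pairs.

    The threshold is a hypothetical value chosen for this example.
    """
    args_by_phrase = {}
    for arg1, phrase, arg2 in extractions:
        args_by_phrase.setdefault(phrase, set()).add((arg1, arg2))
    return {
        phrase
        for phrase, args in args_by_phrase.items()
        if len(args) >= min_distinct_args
    }
```

For instance, a sentence tagged as `[("A", "DET"), ("galaxy", "NOUN"), ("consists", "VERB"), ("of", "PREP"), ("stars", "NOUN")]` is encoded as `WWVPW`, and the only span matching the pattern is "consists of".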
Kozareva and Hovy (2010) present a minimally-supervised method to learn domain taxonomies from the Web. It starts by extracting the terms of a given domain, and then induces their taxonomic organisation, without any initial taxonomic information. The acquisition of hypernymy relations relies on two variations of Hearst patterns, which provide higher precision and require only a root concept and one seed hyponym. The following patterns are used for collecting more relations:

1. <root> such as <seed> and *
2. * such as <term1> and <term2>

where term1 and term2 are hyponyms acquired with the first pattern. In the taxonomy induction stage, other Hearst patterns are used to find evidence on the position of each concept in the taxonomy. In order to identify the hierarchic levels, an algorithm finds the longest path between the root and the other concepts.

Still looking at the Web, a specific resource that has been receiving more and more attention from the IE community is Wikipedia, the free collaborative encyclopedia, which is constantly growing. Medelyan et al. (2009) present a survey on IE from Wikipedia. Among the works using Wikipedia, Zesch et al. (2008a) introduce an API for the extraction of knowledge from its English and German versions, as well as from Wiktionary; and Wu and Weld (2010) use the Wikipedia infoboxes for training an OIE classifier. Concerning LSIE, Herbelot and Copestake (2006) investigate the extraction of hypernymy relations from Wikipedia; and Veale (2006) captures neologisms from this resource. Most of these neologisms are hyponyms of one of their parts (e.g. superhero is-a hero), or, at least, can be seen as such. Moreover, there are several public knowledge bases automatically extracted from Wikipedia, including WikiNet (Nastase et al., 2010), YAGO (Suchanek et al., 2007; Hoffart et al., 2011) and DBpedia (Bizer et al., 2009). Although created automatically, DBpedia is manually supervised by the community.
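The level-identification step mentioned for the taxonomy induction of Kozareva and Hovy (2010), where the level of a concept is given by the longest path from the root, can be sketched as follows. This is only an illustration under stated assumptions: the hypernymy graph is assumed to be acyclic, and the function name `hierarchy_levels`, the `hyponyms_of` mapping and the toy concepts are hypothetical, not taken from the original work.

```python
def hierarchy_levels(root, hyponyms_of):
    """Map each concept reachable from `root` to its longest-path depth.

    `hyponyms_of` maps a concept to its direct hyponyms. The graph is
    assumed to be acyclic (an assumption of this sketch).
    """
    # Depth-first traversal that records concepts in post-order.
    post_order = []
    visited = set()

    def visit(concept):
        if concept in visited:
            return
        visited.add(concept)
        for hyponym in hyponyms_of.get(concept, ()):
            visit(hyponym)
        post_order.append(concept)

    visit(root)

    # Reversed post-order is a topological order, so every concept is
    # processed only after all of its reachable hypernyms.
    levels = {root: 0}
    for concept in reversed(post_order):
        for hyponym in hyponyms_of.get(concept, ()):
            levels[hyponym] = max(levels.get(hyponym, 0), levels[concept] + 1)
    return levels


# Toy example: "lion" is linked both to the root and to "feline".
toy_taxonomy = {
    "animal": ["mammal", "lion"],
    "mammal": ["feline"],
    "feline": ["lion"],
}
levels = hierarchy_levels("animal", toy_taxonomy)
```

In the toy example, "lion" is placed below "feline" rather than directly under "animal", because the longest path from the root (animal, mammal, feline, lion) determines its level.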
For Portuguese, the first works using Wikipedia as an external source of knowledge include Ferreira et al. (2008), who exploited the first sentences of Wikipedia articles to classify named entities. We (Gonçalo Oliveira et al., 2010a) have also made some experiments on the acquisition of synonymy, hypernymy, part-of, purpose-of, and causation relations from the first sentences of the articles, using a set of predefined discriminating patterns. Recently, Págico (Mota et al., 2012; Santos et al., 2012), a joint evaluation on the retrieval of non-trivial information from the Portuguese Wikipedia, was organised. Of the seven participants, two were automatic systems (Rodrigues et al., 2012; Cardoso, 2012) and five were human (three individuals and two teams).

3.3 Enrichment and Integration of Lexical Knowledge Bases

It was noticed early on (Hearst, 1992; Riloff and Shepherd, 1997; Caraballo, 1999) that, even though WordNet is a broad-coverage resource, it is limited and incomplete in many domains, and therefore not enough for several NLP tasks. Since WordNet is an LKB, most of its information is about words and their meanings. Therefore, more than aiming at the creation of new knowledge bases, works on the automatic acquisition of semantic relations used WordNet as a reference for
