Back Room Front Room 2

More documents

Recommendations

Info

INFORMATION ACCESS VIA TOPIC HIERARCHIES AND THEMATIC ANNOTATIONS FROM DOCUMENT COLLECTIONS Hermine Njike Fotzo, Patrick Gallinari Université de Paris6 – LIP6, 8 rue du Capitaine Scott, 75015 Paris, France Email : Hermine.Njike-Fotzo@lip6.fr, Patrick.Gallinari@lip6.fr Keywords: Concept hierarchies, typed hyperlinks generation, thematic annotations, text segmentation. Abstract: With the development and the availability of large textual corpora, there is a need for enriching and organizing these corpora so as to make easier the research and navigation among the documents. The Semantic Web research focuses on augmenting ordinary Web pages with semantics. Indeed, wealth of information exists today in electronic form, they cannot be easily processed by computers due to lack of external semantics. Furthermore, the semantic addition is an help for user to locate, process information and compare documents contents. For now, Semantic Web research has been focused on the standardization, internal structuring of pages, and sharing of ontologies in a variety of domains. Concerning external structuring, hypertext and information retrieval communities propose to indicate relations between documents via hyperlinks or by organizing documents into concepts hierarchies, both being manually developed. We consider here the problem of automatically structuring and organizing corpora in a way that reflects semantic relations between documents. We propose an algorithm for automatically inferring concepts hierarchies from a corpus. We then show how this method may be used to create specialization/generalization links between documents leading to document hierarchies. As a byproduct, documents are annotated with keywords giving the main concepts present in the documents. We also introduce numerical criteria for measuring the relevance of the automatically generated hierarchies and describe some experiments performed on data from the LookSmart and New Scientist web sites. 1 INTRODUCTION Large textual and multimedia databases are nowadays widely available but their exploitation is restricted by the lack of meta-information about their structure and semantics. Many such collections like those gathered by most search engines are loosely structured. Some have been manually structured, at the expense of an important effort. This is the case of hierarchies like those of internet portals (Yahoo, LookSmart, Infoseek, etc) or of large collections like MEDLINE: documents are gathered into topics, which are themselves organized into a hierarchy going from the most general to the most specific [G. Källgren, 1988]. Hypertext multimedia products are another example of structured collections: documents are usually grouped into different topics and subtopics with links between the different entities. Generally speaking, structuring collections makes easier navigating the collection, accessing information parts, maintaining and enriching the collection. Manual structuring relies on a large amount of qualified human I. Seruca et al. (eds.), Enterprise Information Systems VI, © 2006 Springer. Printed in the Netherlands. 143 143–150. resources and can be performed only in the context of large collaborative projects like e.g. in medical classification systems or for specific commercial products. In order to help this process it would be most needful to rely on automatic or semi-automatic tools for structuring document collections. The Semantic Web whose goal is to help users to locate, organize, process information and compare documents contents, has for now focalised on the standardization, internal structuring of documents and sharing of ontologies in a variety of domains. The short-term goal is to transform existing sources (stored as HTML pages, in databases…) into a machineunderstandable form. RDF (resources description form) has been created for computers but Semantic Web should be equally accessible by computers using specialized languages and interchange formats, and humans using natural language. Although the general framework of Semantic Web includes provisions for natural language technology, such techniques have largely been ignored. Nevertheless we can quote [B. Katz, J. Lin, 2002] who propose a method to augment
144 ENTERPRISE INFORMATION SYSTEMS VI RDF with natural language to be more familiar for users. In this context, we study here how to automatically structure collections by deriving concept hierarchies from a document collection and how to automatically generate from that a document hierarchy. The concept hierarchy relies on the discovering of “specialization/generalization” relations between the concepts which appear in the documents of a corpus. Concepts are themselves automatically identified from the set of documents. This method creates “specialization / generalization” links between documents and document parts. It can be considered as a technique for the automatic creation of specific typed links between information parts. Such typed links have been advocated by different authors as a mean for structuring and navigating collections. It also associates to each document a set of keyword representative of the main concepts in the document. The proposed method is fully automatic and the hierarchies are directly extracted from the corpus, and could be used for any document collection. It could also serve as a basis for a manual organization. The paper is organized as follows. In section 2 we introduce previous related work. In section 3, we describe our algorithm for the automatic generation of typed relations “specialization/generalization” between concepts and documents and the corresponding hierarchies. In section 4 we discuss how our algorithm answers some questions of the Web Semantic research. In section 5 we propose numerical criteria for measuring the relevance of our method. Section 6, describes experiments performed on small corpus extracted from Looksmart and New Scientists hierarchies. 2 PREVIOUS WORK In this section we present related work on automatically structuring document collection. We discuss work on the generation of concept hierarchies and on the discovering of typed links between document parts. Since for identifying concepts, we perform document segmentation into homogeneous themes, we also briefly present this problematic and describe the segmentation method we use. We also give some pointers on work on natural language annotations for the Semantic Web. Many authors agree about the importance of typed links in hypertext systems. Such links might prove useful for providing a navigation context or for improving research engines performances. Some authors have developed links typologies. [Randall Trigg, 1983] proposes a set of useful types for scientific corpora, but many of the types can be adapted to other corpora. [C. Cleary, R. Bareiss, 1996] propose a set of types inspired by the conversational theory. These links are usually manually created. [J. Allan, 1996] proposes an automatic method for inferring a few typed links (revision, abstract/expansion links). His philosophy is close to the one used in this paper, in that he chose to avoid complex text analysis techniques. He deduces the type of a link between two documents by analysing the similarity graph of their subparts (paragraphs). We too use similarity graphs (although of different nature) and corpus statistics to infer a relation between concepts and documents. The generation of hierarchies is a classical problem in information retrieval. In most cases the hierarchies are manually built and only the classification of documents into the hierarchy is automatic. Clustering techniques have been used to create hierarchies automatically like in the Scatter/Gather algorithm [D. R. Cutting et al. 1992]. Using related ideas but by using a probabilistic formalism, [A. Vinokourov, M. Girolami, 2000], propose a model which allows to infer a hierarchical structure for unsupervised organization of documents collection. The techniques of hierarchical clustering were largely used to organize corpora and to help information retrieval. All these methods cluster documents according to their similarity. They cannot be used to produce topic hierarchies or to infer generalization/specialization relations. Recently, it has been proposed to develop topic hierarchies similar to those found in e.g. Yahoo. As in Yahoo, each topic is identified by a single term. These term hierarchies are built from “specialization/generalization” relations between the terms, automatically discovered from the corpus. [Lawrie and Croft 2000, Sanderson and Croft 1999] propose to build term hierarchies based on the notion of subsumption between terms. Given a set of documents, some terms will frequently occur among the documents, while others will only occur in a few documents. Some of the frequently occurring terms provide a lot of information about topics within the documents. There are some terms that broadly define the topics, while others which co-occur with such a general term explain aspects of a topic. Subsumption attempts to harness the power of these words. A subsumption hierarchy reflects the topics covered within the documents, a parent term is more general than its child. The key idea of Croft and co-workers has been to use a very simple but efficient subsumption measure. Term x subsumes term y if the following relation holds : P(x|y) > t and P(y|x)
Page 1 and 2:
www.dbeBooks.com - An Ebook Library
Page 3 and 4:
A C.I.P. Catalogue record for this
Page 5 and 6:
vi TABLE OF CONTENTS PART 1 - DATAB
Page 7 and 8:
viii TABLE OF CONTENTS A WIRELESS A
Page 9 and 10:
CONFERENCE COMMITTEE Honorary Presi
Page 11 and 12:
Hung, P. (AUSTRALIA) Jahankhani, H.
Page 13 and 14:
Invited Papers
Page 15 and 16:
2 ENTERPRISE INFORMATION SYSTEMS VI
Page 17 and 18:
Page 19 and 20:
Page 21 and 22:
Page 23 and 24:
10 ENTERPRISE INFORMATION SYSTEMS V
Page 25 and 26:
Page 27 and 28:
Page 29 and 30:
Page 31 and 32:
Page 33 and 34:
Page 35 and 36:
Page 37 and 38:
EVOLUTIONARY PROJECT MANAGEMENT: MU
Page 39 and 40:
Page 41 and 42:
Page 43 and 44:
MANAGING COMPLEXITY OF ENTERPRISE I
Page 45 and 46:
Page 47 and 48:
Page 49 and 50:
Page 51 and 52:
Page 53 and 54:
Page 55 and 56:
Page 57 and 58:
Page 59 and 60:
Page 61 and 62:
Page 63 and 64:
Page 65 and 66:
Page 67 and 68:
ASSESSING EFFORT PREDICTION MODELS
Page 69 and 70:
Page 71 and 72:
Page 73 and 74:
Page 75 and 76:
ORGANIZATIONAL AND TECHNOLOGICAL CR
Page 77 and 78:
Page 79 and 80:
ASAP Processes W Project Kickoff A
Page 81 and 82:
Page 83 and 84:
since it means changing the relevan
Page 85 and 86:
ACME-DB: AN ADAPTIVE CACHING MECHAN
Page 87 and 88:
ACME-DB: AN ADAPTIVE CACHING MECHAN
Page 89 and 90:
Hit Rate (%) ACME-DB: AN ADAPTIVE C
Page 91 and 92:
5.3.6 Effect on all Policies by Int
Page 93 and 94:
ACKNOWLEDGEMENTS We would like to t
Page 95 and 96:
This paper describes the data sampl
Page 97 and 98:
The major limit A of the operation
Page 99 and 100:
RELATIONAL SAMPLING FOR DATA QUALIT
Page 101 and 102:
TOWARDS DESIGN RATIONALES OF SOFTWA
Page 103 and 104: ((Brown et al., 1998), pp. 159-190,
Page 105 and 106: sive data filtering, presentation (
Page 107 and 108: • implementation changes in servi
Page 109 and 110: MEMORY MANAGEMENT FOR LARGE SCALE D
Page 111 and 112: Data Rate [bytes/sec] 1600000 14000
Page 113 and 114: E.g., DV Camcorder Camera Packets (
Page 115 and 116: Procedure CheckOptimal (n rs 1 ...n
Page 117 and 118: MEMORY MANAGEMENT FOR LARGE SCALE D
Page 119 and 120: PART 2 Artificial Intelligence and
Page 121 and 122: 110 ENTERPRISE INFORMATION SYSTEMS
Page 127 and 128: DYNAMIC MULTI-AGENT BASED VARIETY F
Page 153: 142 ENTERPRISE INFORMATION SYSTEMS
Page 169 and 170: AN EXPERIENCE IN MANAGEMENT OF IMPR
Page 179 and 180: ANALYSIS AND RE-ENGINEERING OF WEB
Page 181 and 182: Module p0 Customer Itinerary Send I
Page 183 and 184: departments: the marketing dept. se
Page 185 and 186: • An acyclic module M is usable,
Page 187 and 188: BALANCING STAKEHOLDER’S PREFERENC
Page 189 and 190: The scenarios will provide an abstr
Page 191 and 192: BALANCING STAKEHOLDER'S PREFERENCES
Page 193 and 194: BALANCING STAKEHOLDER'S PREFERENCES
Page 195 and 196: A POLYMORPHIC CONTEXT FRAME TO SUPP
Page 197 and 198: A POLYMORPHIC CONTE XT FRAME TO SUP
Page 203 and 204: FEATURE MATCHING IN MODEL-BASED SOF
Page 205 and 206:
combination of solution domains - a
Page 207 and 208:
• features for all the concepts:
Page 209 and 210:
Existence In both steps it is possi
Page 211 and 212:
matching strategies are maximal, mi
Page 213 and 214:
TOWARDS A META MODEL FOR DESCRIBING
Page 215 and 216:
elementary message (ae-message) can
Page 217 and 218:
problems on a pragmatic level, whic
Page 219 and 220:
TOWARDS A META MODEL FOR DESCRIBING
Page 221 and 222:
INTRUSION DETECTION SYSTEMS USING A
Page 223 and 224:
INTRUSION DETECTION SYSTEMS USING A
Page 225 and 226:
(k� 1) (k) (k�1) (k) p � w
Page 227 and 228:
Table 5: Performance Comparison of
Page 229 and 230:
A USER-CENTERED METHODOLOGY TO GENE
Page 231 and 232:
carries out the second phase of the
Page 233 and 234:
1 1 1 1 2 PROCESS STORE ENTITY EDGE
Page 235 and 236:
constraints rules. KOGGE (Ebert et
Page 237 and 238:
PART 4 Software Agents and Internet
Page 239 and 240:
230 ENTERPRISE INFORMATION SYSTEMS
Page 241 and 242:
Page 243 and 244:
Page 245 and 246:
Page 247 and 248:
Page 249 and 250:
Page 251 and 252:
Page 253 and 254:
Page 255 and 256:
Page 257 and 258:
Page 259 and 260:
Page 261 and 262:
Page 263 and 264:
Page 265 and 266:
Page 267 and 268:
Page 269 and 270:
Page 271 and 272:
Page 273 and 274:
Page 275 and 276:
Page 277 and 278:
Page 279 and 280:
Page 281 and 282:
Page 283 and 284:
Page 285 and 286:
CABA 2 L A BLISS PREDICTIVE COMPOSI
Page 287 and 288:
y disables evidencing severe mental
Page 289 and 290:
Figure 4: Symbols emission in DAR-H
Page 291 and 292:
they evidence a generalization erro
Page 293 and 294:
A METHODOLOGY FOR INTERFACE DESIGN
Page 295 and 296:
and mobility loss coupled with redu
Page 297 and 298:
Tradeoff: The default input is poss
Page 299 and 300:
Sessions on the VABS which are held
Page 301 and 302:
A CONTACT RECOMMENDER SYSTEM FOR A
Page 303 and 304:
A CONTACT RECOMMENDER SYSTEM FOR A
Page 305 and 306:
- "Going to the sources". The user
Page 307 and 308:
Here are the matrix W and P for our
Page 309 and 310:
EMOTION SYNTHESIS IN VIRTUAL ENVIRO
Page 311 and 312:
theme turns out to be surprisingly
Page 313 and 314:
6 FROM FEATURES TO SYMBOLS 6.1 Face
Page 315 and 316:
sions are described by different em
Page 317 and 318:
Electronic Imaging 2000 Conference
Page 319 and 320:
to the accessibility problem for th
Page 321 and 322:
Figure 3: Hierarchical View of the
Page 323 and 324:
4 CONCLUSIONS AND FUTURE WORK On th
Page 325 and 326:
PERSONALISED RESOURCE DISCOVERY SEA
Page 327 and 328:
Page 329 and 330:
profile, the user defines his curre
Page 331 and 332:
Page 333 and 334:
Abdelkafi, N. .....................
show all

Back Room Front Room 2

You also want an ePaper? Increase the reach of your titles

Delete template?

Save as template?