7 - Indira Gandhi Centre for Atomic Research


Automatic Extraction of Proper names and Keywords from Web Resources
Velumani G. and Sivasamy K.

Abstract
One of the major challenges facing information professionals today relates to effective mechanisms for retrieval of information from the Web. Traditionally, librarians have used metadata as a tool for the management of information resources and their effective retrieval. However, given the volume of information on the Web and the rate at which it keeps growing, only mechanisms that can automatically extract such data will be useful. Existing mechanisms such as search engines appear to rely on full text and index practically every word appearing in a text. This leads to unacceptably low levels of precision during searches. This research is based on the premise that names of persons, names of corporate bodies and keywords present in a text are important in terms of their value as search keys for a document. This paper describes methodologies that have been developed, and are under evaluation, for the automatic identification and extraction of names of persons, names of corporate bodies and keywords from web resources. Some of the problems encountered that are being addressed are also discussed.

1. Introduction:
An issue that has attracted considerable attention in recent years relates to indexing web documents effectively so as to facilitate acceptable levels of precision in retrieval. Much of this research and discussion is based on the assumption that existing tools for information retrieval from the Web are inadequate and suffer from major limitations. Search engines and subject directories are the major tools currently available for resource discovery on the Web. The major limitations of search engines in facilitating effective information retrieval derive largely from the limitations of the mechanisms they employ for creating their databases.
Search engines do differ with regard to what portion of a web document is used to derive index terms. However, by and large, most search engines derive their index terms from web documents with the help of specially developed programs referred to variously in the literature as robots, spiders, etc. In reality this process of extracting index terms has several major limitations:
• It could, and often does, result in extracting several index terms that are not relevant search keys for the document in question, leading to unacceptably low levels of precision.
• The existing mechanisms also have no means of distinguishing between different kinds of terms that may be present in the text being indexed. For example, there is no way of identifying whether an extracted term denotes the name of a person or is a keyword indicating the subject matter of the web document.

In this paper we discuss the results of the experiment carried out to identify and automatically extract
• Proper names and
• Keywords / key phrases
from web documents.

Footnote: Research Scholar, Dept. Inf. Sci., University of Madras, Chennai.

2. Scope and Objectives:
The important elements of description (metadata) of any documentary resource, from the point of view of their utility as search keys, are:
• The proper names associated with the resource. This would include names of persons, names of institutions and other corporate bodies; and
• Keywords and key phrases representing the subject content of the document.

If effective mechanisms could be developed and implemented to identify these elements and extract them from electronic resources, they would be useful metadata. The experiments reported in this paper were carried out to develop and test algorithms to identify and extract proper names and keywords from web documents. The mechanism uses reasonably fast and robust heuristics to identify proper names and keywords and to extract them from web resources.

3. The experiment:
There are basically two aspects to this study. The first focused on identification and extraction of proper names and the second on keywords. A corpus of HTML documents on the Web was used as the test bed to experiment with the algorithms developed. The first step was to arrive at a corpus of electronic texts to experiment with. In planning the study it was decided, keeping in mind a wide variety of factors, to conduct the experiments with documents falling within a subject domain. This was necessary because one component of the study focused on extraction of keywords.

3.1 Identification of Proper names:
For the purpose of this experiment and study, ‘name extraction’ is defined as the process of identifying and extracting personal names from unstructured web texts in the English language. A set of rules that enables a computer to identify names in all their variations is an essential component of the kind of text processing application described in this paper. Literature on the subject has references to a few rules identified and applied to extract names.
Most of these rules are derived from standard conventions that are widely employed by authors of properly edited texts in English. Some of the key indicators that were used included:
‣ Use of legend words such as Mr., Miss, Ms., etc. is a good indicator that the following word or words denote a name.
‣ Use of corporate trigger words such as ‘University’, ‘College’, ‘School’, etc. is a good indicator for identifying organization names. Prepositions are widely used in corporate names to link legend words with other words (e.g. University of California); this has been exploited to identify complete corporate names.
‣ Initial capitalization is also an indicator of names, since the convention of the English language requires that each word of a name start with an uppercase letter. In this module, therefore, identification of initial capitalization has been used in formulating the algorithm. Example: Powell
‣ First name, middle name and last name starting with uppercase letters is also a good indicator for names [2]. Example: Mohandas Karamchand Gandhi
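The indicators listed above can be illustrated with a short Python sketch. This is not the authors' implementation, only a minimal approximation under stated assumptions: the word lists contain just the examples quoted in the text, the preposition list ("of", "for") and the function name extract_names are hypothetical, and a bare capitalized word is accepted as a name only when it follows a legend word or forms a run of two or three capitalized words (a choice made here to avoid matching every sentence-initial word).

```python
import re

# Illustrative lists only: the paper names these examples but does not
# give its complete lists. The prepositions are an assumption.
LEGEND_WORDS = ("Mr.", "Miss", "Ms.")
TRIGGER_WORDS = ("University", "College", "School")
PREPOSITIONS = ("of", "for")

CAP = r"[A-Z][a-z]+"  # one word with initial capitalization

legend = "|".join(re.escape(w) for w in LEGEND_WORDS)
trigger = "|".join(TRIGGER_WORDS)
prep = "|".join(PREPOSITIONS)

# Trigger word + preposition + capitalized word(s), e.g. "University of California".
CORPORATE = re.compile(rf"\b(?:{trigger})\s+(?:{prep})\s+{CAP}(?:\s+{CAP})*")
# Legend word followed by one or more capitalized words, e.g. "Mr. Powell".
PERSON_LEGEND = re.compile(rf"\b(?:{legend})\s+({CAP}(?:\s+{CAP})*)")
# Two or three consecutive capitalized words, e.g. first, middle and last name.
CAP_SEQUENCE = re.compile(rf"\b{CAP}(?:\s+{CAP}){{1,2}}\b")


def extract_names(text):
    """Return candidate corporate and personal names found in plain text."""
    corporate = [m.group(0) for m in CORPORATE.finditer(text)]
    # Blank out corporate names so their words are not counted again as persons.
    rest = CORPORATE.sub(" ", text)

    persons = [m.group(1) for m in PERSON_LEGEND.finditer(rest)]
    rest = PERSON_LEGEND.sub(" ", rest)

    persons += [m.group(0) for m in CAP_SEQUENCE.finditer(rest)]
    return {"persons": persons, "corporate": corporate}


sample = "Mr. Powell met Mohandas Karamchand Gandhi at the University of California."
names = extract_names(sample)
```

For the sample sentence, the sketch yields "University of California" as a corporate name and "Powell" and "Mohandas Karamchand Gandhi" as personal names. Corporate names are matched first so that a word such as "California" is not also reported as a personal name; real web text would additionally need HTML stripping and much larger word lists.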
