23.01.2014 Views

7 - Indira Gandhi Centre for Atomic Research



Automatic Extraction of Proper Names and Keywords from Web Resources

Velumani G. and Sivasamy K.

Abstract

One of the major challenges facing information professionals today relates to effective mechanisms for retrieval of information from the Web. Traditionally, librarians have used metadata as a tool for the management of information resources and their effective retrieval. However, given the volume of information on the Web and the rate at which it keeps growing, only mechanisms that can automatically extract such data will be useful. Existing mechanisms such as search engines appear to rely on full text and index practically every word appearing in a text. This leads to unacceptably low levels of precision during searches. This research is based on the premise that the names of persons, the names of corporate bodies, and keywords present in a text are important in terms of their value as search keys for a document. This paper describes methodologies that have been developed and are under evaluation for the automatic identification and extraction of names of persons, names of corporate bodies, and keywords from web resources. Some of the problems encountered that are being addressed are also discussed.

1. Introduction

An issue that has attracted considerable attention in recent years is how to index web documents effectively so as to achieve acceptable levels of precision in retrieval. Much of this research and discussion is based on the assumption that existing tools for information retrieval from the web are inadequate and suffer from major limitations. Search engines and subject directories are the major tools currently available for resource discovery on the Web. The major limitations of search engines in facilitating effective information retrieval derive largely from the limitations of the mechanisms they employ for creating their databases. Search engines do differ in which portion of a web document they use to derive index terms. By and large, however, most search engines derive their index terms from web documents with the help of specially developed programs referred to variously in the literature as robots, spiders, etc. In practice, this process of extracting index terms has several major limitations:

• It could, and often does, result in extracting several index terms that are not relevant search keys for the document in question, leading to unacceptably low levels of precision.

• The existing mechanisms also do not have any means of distinguishing between different kinds of terms that may be present in the text being indexed. For example, there is no way of identifying whether a term that is extracted denotes the name of a person or is a keyword indicating the subject matter of the web document.

In this paper we discuss the results of the experiment carried out to identify and automatically extract

Research Scholar, Dept. Inf. Sci., University of Madras, Chennai.

