7 - Indira Gandhi Centre for Atomic Research


• Any initial letters that precede the name are transferred to the end of the name, and a comma is inserted between the name and the initials.

All the words / phrases are now alphabetized.
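The initials-transfer and alphabetization steps above can be sketched as follows. This is an illustrative Python fragment, not the authors' code, and it assumes names arrive as plain strings whose leading single-letter tokens (with or without periods) are initials:

```python
import re


def normalize_name(name: str) -> str:
    """Move any leading single-letter initials (e.g. 'A. K. Sen')
    to the end of the name, separated by a comma ('Sen, A. K.')."""
    parts = name.split()
    initials = []
    # Collect leading tokens that look like initials ('A.' or 'A').
    while parts and re.fullmatch(r"[A-Za-z]\.?", parts[0]):
        initials.append(parts.pop(0))
    if not initials:
        return name
    return f"{' '.join(parts)}, {' '.join(initials)}"


# Normalize a small sample list, then alphabetize it.
names = ["A. K. Sen", "Ranganathan", "S R Ranganathan"]
normalized = sorted(normalize_name(n) for n in names)
```

After normalization, a plain lexicographic sort yields the desired name-first ordering.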

3.2 Identification of Keywords / Key Phrases:

In OPACs and similar databases, the elements representing the subject of a resource usually take the form of a set of data fields that may include keywords, descriptors, subject headings, an abstract, classification codes, etc. In automatic extraction of keywords from a document, however, it is necessary to look for appropriate lexical clues. The major type of lexical clue to the subject of a document is the set of domain terms the document contains, which usually take the form of keywords. The experiment reported in this paper uses a set of simple heuristics to identify keywords and key phrases in HTML documents. The module involves the following major inputs:

• One or more HTML files constituting the documentary information resources in a domain from which keywords / key phrases are to be extracted automatically; the program requires that all the HTML files be in a single folder.

• A database which is, in effect, a list of domain terms in the subject area / discipline of the HTML files; in this study the ASIS thesaurus was employed.

• A database of ‘stop words’ consisting of all non-noun words taken from the Pocket English dictionary, which is itself derived from the New Oxford Dictionary of English.

The principal output of the program is an HTML page consisting of the extracted keywords / key phrases, with hyperlinks to the HTML pages from which they were extracted.
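The overall flow, from a folder of HTML files to an output page of hyperlinked keywords, might look roughly like the following Python sketch. The paper does not specify its implementation, so the tag-stripping regex, the word pattern, and the output page layout here are assumptions for illustration only:

```python
import glob
import html
import os
import re


def extract_keywords(folder: str, domain_terms: set[str],
                     stop_words: set[str]) -> str:
    """Scan every HTML file in `folder`, keep words that appear in the
    domain-term list (and not in the stop-word list), and return an
    HTML page linking each keyword to the files it occurred in."""
    occurrences: dict[str, set[str]] = {}
    for path in glob.glob(os.path.join(folder, "*.htm*")):
        with open(path, encoding="utf-8") as fh:
            # Crude tag stripping: replace anything in <...> with a space.
            text = re.sub(r"<[^>]+>", " ", fh.read())
        for word in re.findall(r"[A-Za-z][A-Za-z-]+", text.lower()):
            if word in domain_terms and word not in stop_words:
                occurrences.setdefault(word, set()).add(os.path.basename(path))
    # Build one list item per keyword, hyperlinked to its source pages.
    items = "".join(
        f"<li>{html.escape(kw)}: "
        + " ".join(f'<a href="{fn}">{fn}</a>' for fn in sorted(files))
        + "</li>"
        for kw, files in sorted(occurrences.items())
    )
    return f"<html><body><ul>{items}</ul></body></html>"
```

In practice an HTML parser would be preferable to the regex-based tag stripping shown here, but the sketch keeps the two-list validation idea visible.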

Keyword Extraction: The major problems involved in keyword extraction (KWE) are the extraction of keywords and the omission of non-significant words. Experience with techniques such as those adopted by the popular search engines clearly brings out the need for a different approach. In this study it was decided to experiment with a validation process, using two databases of terms to assist in identifying keywords and non-significant words in the input file. The validation process employed made certain assumptions:

• It was assumed that a word / phrase in the input HTML file that is also part of a controlled vocabulary in the subject domain concerned is a keyword / key phrase with a high probability of indicating the subject content of the input file.

• Non-noun words in the input file were assumed to be non-significant words.

In the present experiment the following inputs / tools were employed:

• A paper entitled ‘Information Retrieval and Cognitive Research’ was used as the input HTML document to test the utility and limitations of the program. An idea of the paper can be had from the details given in Table 1 below.

• For identifying keywords and key phrases in the input file, online tools were used:

• The ASIS thesaurus (http://www.asis.org/Publications/Thesaurus/isframe.htm)
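The two-database validation described above can be sketched as follows, with small illustrative term sets standing in for the ASIS thesaurus and the non-noun stop-word list (the classification labels are this sketch's own, not the paper's):

```python
def validate(words: list[str], thesaurus: set[str],
             stop_words: set[str]) -> dict[str, str]:
    """Classify each word using the two term databases: a thesaurus
    match is a probable keyword, a stop-word (non-noun) match is
    non-significant, and anything else is left undecided."""
    labels = {}
    for w in (w.lower() for w in words):
        if w in thesaurus:
            labels[w] = "keyword"
        elif w in stop_words:
            labels[w] = "non-significant"
        else:
            labels[w] = "undecided"
    return labels
```

The "undecided" bucket reflects the limitation the assumptions imply: a domain term missing from the controlled vocabulary cannot be recognized as a keyword by this process.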

i. Stop-word Terms (ST): Uncontrolled vocabularies have always presented problems in IR. The most common words in English may account for 50% or more of any given text. Their semantic content, measured in terms of their value in describing / indicating the subject matter of the text, is minimal. Further, such words tend to lessen the impact of frequency differences among the less common words. In addition, they necessitate a large amount of unnecessary processing. In all methods of automatic indexing such less significant words are ignored on the basis of a stop-word list. As already mentioned, in the present experiment all non-noun words were
