7 - Indira Gandhi Centre for Atomic Research
• Any initial letters that precede a name are transferred to the end of the name,<br />
with a comma inserted between the name and the initials.<br />
All the words / phrases are then alphabetized.<br />
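This normalization step can be sketched in Python as follows (a hypothetical helper, not the authors' code; the example entries are illustrative):

```python
def normalize_entry(entry):
    """Move any leading single-letter initials to the end of the entry,
    separated from the name by a comma, e.g. "J. K. Rowling" -> "Rowling, J. K."
    """
    words = entry.split()
    initials = []
    # Collect leading initials: single letters, optionally followed by a dot.
    while words and len(words[0].rstrip(".")) == 1:
        initials.append(words.pop(0))
    name = " ".join(words)
    if initials:
        return f"{name}, {' '.join(initials)}"
    return name

# Illustrative entries; phrases without leading initials pass through unchanged.
entries = ["J. K. Rowling", "information retrieval", "A. Einstein"]
normalized = sorted(normalize_entry(e) for e in entries)
```

Sorting the normalized forms then yields the alphabetized list described above.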
3.2 Identification of Keywords / Key Phrases:<br />
In OPACs and similar databases the elements representing the subject of the resource<br />
usually take the form of a set of data fields that may include keywords, descriptors, subject<br />
headings, abstracts, classification codes, etc. However, in automatic extraction of keywords<br />
from a document it is necessary to look for appropriate lexical clues. The major type of<br />
lexical clue to the subject of a document is the set of domain terms the document contains,<br />
usually in the form of keywords. The experiment reported in this paper uses a set of<br />
simple heuristics to identify keywords and key phrases in HTML documents.<br />
The module involves the following major inputs:<br />
• One or more HTML files constituting the documentary information resources in a<br />
domain from which keywords / key phrases are to be extracted automatically. The<br />
program requires that all the HTML files be in a single folder.<br />
• A database which is in effect a list of domain terms in the subject area / discipline of the<br />
HTML files. In this study the ASIS thesaurus was employed.<br />
• A database of ‘stop words’ consisting of all non-noun words taken from the Pocket<br />
English Dictionary, which itself is derived from the New Oxford Dictionary of<br />
English.<br />
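Gathering these three inputs might look like the following sketch (folder layout, file names, and one-term-per-line format for the two databases are all assumptions, not details from the paper):

```python
import glob
import os

def load_inputs(folder, thesaurus_file, stopword_file):
    """Collect the HTML files in a single folder, plus the domain-term
    list (e.g. thesaurus terms) and the non-noun stop-word list.
    Both term files are assumed to hold one term per line."""
    html_files = sorted(glob.glob(os.path.join(folder, "*.html")))
    with open(thesaurus_file, encoding="utf-8") as f:
        domain_terms = {line.strip().lower() for line in f if line.strip()}
    with open(stopword_file, encoding="utf-8") as f:
        stop_words = {line.strip().lower() for line in f if line.strip()}
    return html_files, domain_terms, stop_words
```

Lower-casing the term lists at load time makes the later membership tests case-insensitive.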
The principal output of the program is an HTML page consisting of the extracted keywords / key<br />
phrases, with hyperlinks to the HTML pages from which they were extracted.<br />
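One way to render such an output page is sketched below (a minimal illustration; the data structure and sample file names are assumed, not taken from the experiment):

```python
def build_index_page(keyword_map):
    """Render an HTML page listing each extracted keyword / key phrase,
    hyperlinked to the source HTML file(s) it was extracted from.

    keyword_map: dict mapping a keyword to a list of source file names.
    """
    rows = []
    for term in sorted(keyword_map):
        links = ", ".join(
            f'<a href="{fname}">{fname}</a>' for fname in keyword_map[term]
        )
        rows.append(f"<li>{term}: {links}</li>")
    return "<html><body><ul>\n" + "\n".join(rows) + "\n</ul></body></html>"

# Illustrative data only:
page = build_index_page({"information retrieval": ["paper1.html"]})
```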
Keyword Extraction: The major problems involved in keyword extraction (KWE) are the<br />
identification of keywords and the omission of non-significant words. Experience with techniques such as those adopted<br />
by the popular search engines clearly brings out the need for a different approach. In this<br />
study it was decided to experiment with a validation process using two databases of terms to<br />
assist in identifying keywords and non-significant words in the input file. The<br />
validation process employed made certain assumptions:<br />
• It was assumed that a word / phrase in the input HTML file that is also part of a<br />
controlled vocabulary in the relevant subject domain is a keyword / key phrase<br />
with a high probability of indicating the subject content of the input file.<br />
• Non-noun words in the input file are assumed to be non-significant words.<br />
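Under these two assumptions, the validation step can be sketched as follows (the term sets here are tiny stand-ins for the ASIS thesaurus and the non-noun stop-word list):

```python
def classify_terms(candidates, thesaurus_terms, stop_words):
    """Split candidate words / phrases into probable keywords (found in
    the controlled vocabulary) and non-significant words (found in the
    non-noun stop-word list); anything matching neither list is left
    unresolved for later heuristics."""
    keywords, ignored, unresolved = [], [], []
    for term in candidates:
        t = term.lower()
        if t in thesaurus_terms:
            keywords.append(term)    # high probability of indicating subject content
        elif t in stop_words:
            ignored.append(term)     # assumed non-significant (non-noun)
        else:
            unresolved.append(term)  # decided by neither database
    return keywords, ignored, unresolved

# Stand-in vocabularies, illustrative only:
thesaurus = {"information retrieval", "indexing"}
stops = {"the", "very", "quickly"}
kw, ig, un = classify_terms(["Indexing", "very", "cats"], thesaurus, stops)
```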
In the present experiment the following inputs / tools were employed:<br />
• A paper entitled ‘Information Retrieval and Cognitive Research’ was used as the<br />
input HTML document to test the utility and limitations of the program. An idea of<br />
the paper can be had from the details given in Table 1 below.<br />
• For identifying keywords and key phrases in the input file, online tools were used:<br />
• The ASIS thesaurus (http://www.asis.org/Publications/Thesaurus/isframe.htm)<br />
i. Stop-word Terms (ST): Uncontrolled vocabularies have always presented<br />
problems in IR. The most common words in English may account for 50% or<br />
more of any given text, yet their semantic content, measured in terms of their<br />
value in describing / indicating the subject matter of the text, is minimal.<br />
Further, such words tend to lessen the impact of frequency differences among<br />
the less common words, and they necessitate a large amount of<br />
unnecessary processing. In all methods of automatic indexing such less<br />
significant words are ignored on the basis of a stop-word list. As<br />
already mentioned, in the present experiment all non-noun words were<br />