7 - Indira Gandhi Centre for Atomic Research
• Proper names and
• Keywords / key phrases from web documents
2. Scope and Objectives:
The important elements of description (metadata) of any documentary resource, from the
point of view of their utility as search keys, are:
• The proper names associated with the resource. This would include names of
persons, names of institutions and other corporate bodies; and
• Keywords and key phrases representing the subject content of the document.
If effective mechanisms could be developed and implemented to identify these elements and
extract them from electronic resources, they would serve as useful metadata. The experiments
reported in this paper were carried out to develop and test algorithms to identify and extract
proper names and keywords from web documents. The mechanism uses reasonably fast and robust
heuristics to identify proper names and keywords and to extract them from web resources.
3. The experiment:
There are two aspects to this study. The first focused on the identification and
extraction of proper names and the second on keywords. A corpus of HTML documents
on the Web was used as the test bed for the algorithms developed. The first
step was to arrive at a corpus of electronic texts to experiment with. In planning the study
it was decided, keeping in mind a wide variety of factors, to conduct the experiments with
documents falling within a single subject domain. This was necessary because one component
of the study focused on the extraction of keywords.
3.1 Identification of Proper names: For the purpose of this experiment and study, ‘name
extraction’ is defined as the process of identifying and extracting personal names from
unstructured web texts in the English language. A set of rules that enables a computer to
identify names in all their variations is an essential component of the kind of text processing
application described in this paper. The literature on the subject refers to a few rules
that have been identified and applied to extract names. Most of these rules are derived from
standard conventions that are widely employed by authors of properly edited texts in English.
Some of the key indicators that were used included:
‣ Use of legend words such as Mr., Miss, Ms., etc. is a good indicator that the following
word or words denote a name;
‣ Use of corporate trigger words such as ‘University’, ‘College’, ‘School’, etc., is a
good indicator for identifying organization names. Prepositions are widely used in
corporate names to link trigger words with other words (e.g. University of
California); this has been exploited to identify complete corporate names;
‣ Initial capitalization is also an indicator of names, since the convention of the English
language requires that each word of a name start with an uppercase letter. In this
module, therefore, identification of initial capitalization has been used in formulating
the algorithm.
‣ Example: Powell
‣ First Name, Middle Name and Last Name starting with uppercase letters is also a
good indicator for names 2.
‣ Example: Mohandas Karamchand Gandhi
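The indicators listed above can be illustrated with a minimal Python sketch based on regular expressions. This is not the algorithm used in the study, whose details are not given here; the word lists, the pattern definitions and the function names (`find_titled_names`, `find_corporate_names`, `find_capitalized_runs`) are assumptions introduced only for illustration.

```python
import re

# Illustrative word lists (assumptions); the text names only a few of these.
LEGEND_WORDS = ("Mr.", "Mrs.", "Miss", "Ms.")
CORPORATE_TRIGGERS = ("University", "College", "School")
LINKING_PREPOSITIONS = ("of", "for", "at")

# A word with an initial uppercase letter followed by lowercase letters.
CAP_WORD = r"[A-Z][a-z]+"


def _alternation(words):
    """Build a regex alternation that matches any of the given literal words."""
    return "|".join(re.escape(w) for w in words)


# Legend word followed by one to three capitalized words (captured as the name).
PERSON_AFTER_LEGEND = re.compile(
    r"(?<!\w)(?:%s)\s+(%s(?:\s+%s){0,2})\b"
    % (_alternation(LEGEND_WORDS), CAP_WORD, CAP_WORD)
)

# Trigger word, optionally linked by a preposition to further capitalized words.
CORPORATE_NAME = re.compile(
    r"\b(?:%s)(?:\s+(?:%s)\s+%s(?:\s+%s)*)?"
    % (_alternation(CORPORATE_TRIGGERS), _alternation(LINKING_PREPOSITIONS),
       CAP_WORD, CAP_WORD)
)

# Run of two or three consecutive capitalized words (first, middle, last name).
CAPITALIZED_RUN = re.compile(r"\b%s(?:\s+%s){1,2}\b" % (CAP_WORD, CAP_WORD))


def find_titled_names(text):
    """Return the capitalized words that follow a legend word."""
    return [m.group(1) for m in PERSON_AFTER_LEGEND.finditer(text)]


def find_corporate_names(text):
    """Return trigger words together with any preposition-linked continuation."""
    return [m.group(0) for m in CORPORATE_NAME.finditer(text)]


def find_capitalized_runs(text):
    """Return runs of two or three capitalized words as candidate names."""
    return CAPITALIZED_RUN.findall(text)


if __name__ == "__main__":
    sentence = "Mr. Colin Powell spoke at the University of California."
    print(find_titled_names(sentence))
    print(find_corporate_names(sentence))
    print(find_capitalized_runs("Mohandas Karamchand Gandhi was born in Porbandar."))
```

Capitalization alone also flags sentence-initial words and other capitalized phrases, so a sketch like this would treat capitalized runs only as candidates and rely on the legend-word and trigger-word rules for stronger evidence.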