23.01.2014 Views

7 - Indira Gandhi Centre for Atomic Research


• Proper names and

• Keywords / key phrases from web documents

2. Scope and Objectives:

The important elements of description (metadata) of any documentary resource, from the point of view of their utility as search keys, are:

• The proper names associated with the resource; this would include names of persons, names of institutions and other corporate bodies; and

• Keywords and key phrases representing the subject content of the document.

If effective mechanisms could be developed and implemented to identify these elements and extract them from electronic resources, they would serve as useful metadata. The experiments reported in this paper were carried out to develop and test algorithms to identify and extract proper names and keywords from web documents. The mechanism uses reasonably fast and robust heuristics to identify proper names and keywords and to extract them from web resources.

3. The experiment:

There are basically two aspects to this study. The first focused on the identification and extraction of proper names, and the second on keywords. A corpus of HTML documents on the Web was used as the test bed for the algorithms developed. The first step was to arrive at a corpus of electronic texts to experiment with. In planning the study it was decided, keeping in mind a wide variety of factors, to conduct the experiments with documents falling within a subject domain. This was necessary because one component of the study focused on the extraction of keywords.

3.1 Identification of Proper names: For the purpose of this experiment and study, ‘name extraction’ is defined as the process of identifying and extracting personal names from unstructured web texts in the English language. A set of rules that enables a computer to identify names in all their variations is an essential component of the kind of text processing application described in this paper. The literature on the subject refers to a few rules that have been identified and applied to extract names. Most of these rules are derived from standard conventions that are widely followed by authors of properly edited texts in English. Some of the key indicators used were:

‣ Use of legend words such as Mr., Miss, Ms., etc. is a good indicator that the following word or words denote a name.

‣ Use of corporate trigger words such as ‘University’, ‘College’, ‘School’, etc. is a good indicator for identifying organization names. Prepositions are widely used in corporate names to link these trigger words with other words (e.g. University of California); this has been exploited to identify complete corporate names.

‣ Initial capitalization is also an indicator of names, since the convention of the English language requires that each word of a name start with an uppercase letter. In this module, therefore, identification of initial capitalization has been used in formulating the algorithm. Example: Powell

‣ A first name, middle name and last name each starting with an uppercase letter is also a good indicator for names². Example: Mohandas Karamchand Gandhi
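The indicators listed above can be illustrated with a minimal Python sketch. It is not the algorithm used in the study, which this section does not spell out. The word lists, the preposition set, the regular expressions and the function names (`titled_names`, `corporate_names`, `capitalized_sequences`) are all assumptions made for illustration only.

```python
import re

# Hypothetical word lists: the paper names only a few of these as examples.
LEGEND_WORDS = ("Mr", "Mrs", "Miss", "Ms", "Dr", "Prof")
CORPORATE_TRIGGERS = ("University", "College", "School", "Institute")
LINKING_PREPOSITIONS = ("of", "for", "at")

CAP = r"[A-Z][a-z]+"  # one word with initial capitalization (ASCII only)

# Indicator 1: a legend word followed by one or more capitalized words.
TITLE_RE = re.compile(
    rf"\b(?:{'|'.join(LEGEND_WORDS)})\.?\s+({CAP}(?:\s+{CAP})*)"
)

# Indicator 2: a corporate trigger word, optionally preceded by capitalized
# words and optionally linked by a preposition to further capitalized words.
CORPORATE_RE = re.compile(
    rf"\b((?:{CAP}\s+)*(?:{'|'.join(CORPORATE_TRIGGERS)})\b"
    rf"(?:\s+(?:{'|'.join(LINKING_PREPOSITIONS)})\s+{CAP}(?:\s+{CAP})*)?)"
)

# Indicators 3 and 4: a run of two or more capitalized words.
CAP_SEQUENCE_RE = re.compile(rf"\b{CAP}(?:\s+{CAP})+\b")


def titled_names(text):
    """Return the capitalized words that follow a legend word."""
    return TITLE_RE.findall(text)


def corporate_names(text):
    """Return candidate organization names built around a trigger word."""
    return CORPORATE_RE.findall(text)


def capitalized_sequences(text):
    """Return runs of two or more capitalized words as name candidates."""
    return CAP_SEQUENCE_RE.findall(text)


if __name__ == "__main__":
    sample = ("Mr. Powell met Mohandas Karamchand Gandhi "
              "at the University of California.")
    print(titled_names(sample))
    print(corporate_names(sample))
    print(capitalized_sequences(sample))
```

In this sketch a single capitalized word is accepted as a name only when a legend word precedes it, because a lone capitalized word at the start of a sentence is ambiguous. Any real implementation would also need to handle initials, hyphenated names and non-ASCII letters, which these patterns ignore.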
