
Semi Automatic Indexing State of the Art - FTP Directory Listing - Nato



SEMI-AUTOMATIC INDEXING

State of the Art

by

Hermann Fangmeyer

Diplom-Mathematiker

C.C.R.- EURATOM/CETIS

21020 ISPRA (VA) Italy

Summary

After an intensive period of research in information science from the late fifties to the late sixties, a lull can now be observed, in which people seem to be engaged in developing operational IR systems at a relatively low level of sophistication. In these systems the bulk of the work is still done by man, and the methodology applied differs only in unimportant details. They are often evaluated by applying short-sighted economic criteria, without taking into account prospective user needs, which tend towards increasingly specific and exhaustive information. Since re-designing a system and re-compiling manually built data bases is a prohibitively expensive operation, it is believed that most of these systems will not survive for very long.

Indexing is a fundamental part of an IR system, since it directly affects the quality of retrieval. Thus the most advanced indexing techniques, those which tend towards automatic full-text processing, should be chosen.

Computer-assisted indexing has great advantages over manual methods. Especially preferable are those techniques which require the text to be indexed in machine-readable form, since they allow more automated indexing processes to be integrated easily when they become available, by reindexing the collection at relatively low cost.

1. INTRODUCTION

Semi-automatic indexing has not been strictly defined; there exist as many interpretations as synonyms. The intervention of a computer may save the indexer from having to perform routine work, or the indexer may help the computer to make decisions where no sophisticated algorithms are available. Hence, semi-automatic indexing methods must be arranged within the wide spectrum between two extremes. At one end is manual indexing with a minimum of computer assistance (e.g. the New York Times System, in which the indexer works at a video terminal on which the document to be indexed is displayed using a closed-circuit television system, without having access to the data bases [98(1972)]). At the other end is quasi-automatic indexing with a minimum of human intervention. The latter can be imagined as a process in which the indexer changes a threshold value, depending upon the number of index terms to be automatically assigned.
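The threshold-controlled process just described can be illustrated with a minimal sketch. It is not taken from any system covered in this report: it assumes that candidate terms are scored by raw frequency after removing a small, hypothetical stop list, and it automates the indexer's adjustment of the threshold for brevity. The names `STOP_WORDS`, `assign_terms` and `tune_threshold` are illustrative only.

```python
import re
from collections import Counter

# Hypothetical stop list; an operational system would use a far larger one.
STOP_WORDS = {"the", "of", "a", "an", "and", "in", "to", "is", "by", "for", "on", "with"}


def term_counts(text):
    """Count every non-stop-word token (raw frequency stands in for a term weight)."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(w for w in words if w not in STOP_WORDS)


def assign_terms(text, threshold):
    """Return, in alphabetical order, the terms occurring at least `threshold` times."""
    counts = term_counts(text)
    return sorted(term for term, n in counts.items() if n >= threshold)


def tune_threshold(text, max_terms, start=1):
    """Raise the threshold until no more than `max_terms` terms are assigned.

    This loop plays the role of the indexer, who would normally inspect the
    proposed terms and change the threshold by hand.
    """
    threshold = start
    while len(assign_terms(text, threshold)) > max_terms:
        threshold += 1
    return threshold, assign_terms(text, threshold)
```

For example, `tune_threshold(abstract, 5)` would return the lowest threshold at which no more than five terms are proposed for the string `abstract`, together with those terms. Raw frequency is used only because it is the simplest weight to show; any other scoring function could be substituted without changing the role of the threshold.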

The terms computer-assisted indexing, computer- or machine-aided indexing, man-machine indexing, computer aids to indexing, computer-based or computerized indexing, and similar terms will be treated as synonyms in this report.

Semi-automatic indexing should be distinguished from automatic indexing, which is defined by Stevens [110(p.3), (1970)] as the use of machines to extract or assign index terms without human intervention, once programs or procedural rules have been established.

Another, contradictory definition of automatic indexing can be derived from Caras' [24(1968)] statement:

The primary aim in automatic indexing is to derive index terms directly from the text with a minimum of human intervention.

Unlike Stevens' definition, this one provides for an intellectual operation, since a minimum of human intervention remains.

For Maron [75(1961)] the term 'automatic indexing' denotes the problem of deciding in a mechanical way to which category (subject or field of knowledge) a given document belongs.

Thus, in his opinion, it concerns the problem of automatic recognition of the contents of a given document. From these non-uniform definitions, no precise definition of machine-assisted indexing can be derived. The author therefore decided upon the following:

The indexing process will be called semi-automatic within this report if it consists of a combination of the intellectual efforts of scientific subject specialists and advanced computer techniques.

Thus, semi-automatic indexing is restricted here to those machine-aided methods which require the qualified intervention of both a computer and an indexer. Furthermore, methods which cannot be applied to an operational system are also excluded.

Semi-automatic indexing is divided into conversational and symbiotic indexing, in order to distinguish between indexing by continuous contact with the computer and indexing by integration of the computer in the indexing process for the purpose of performing certain clerical tasks, respectively.

This report covers the state of the art up to December 1972 in

- semi-automatic derivative indexing,

- machine-aided assignment indexing (including automatic assignment indexing techniques which are based on previously created manual or semi-automatic indexing aids),

- semi-automatic dictionary construction, since the indexing techniques often involve the setting up of thesauri.

(Since dictionary construction is often closely linked to indexing and often employs similar methods, exact distinctions cannot always be made.)

Evaluation, in the sense of measuring the retrieval efficiency of the different approaches described in the literature, is not covered here. The reason for this is that the authors usually content themselves with general and often contradictory statements.

For some computer-aided indexing techniques, computer analysis of text is the fundamental step [103(1969)]. For these, the data is needed in machine-readable form:

There are essentially two principal methods for obtaining a machine-readable text for computer indexing:

- as a by-product of the printing process; and

- through some kind of conversion procedure using keyboard devices to produce cards or tape, or using optical scanning devices. [9(1968)]

The transfer of data in natural language into machine-readable form can be extremely expensive in relation to the application for which it will be used. That is probably why automation in natural language processing is not as well developed as other fields of computer application. (Indeed, there are only a few original approaches. Most applications are simple modifications of these, i.e. theoretical advances have been minimal in the last decade.)
