Semi Automatic Indexing State of the Art - FTP Directory Listing - Nato
SEMI-AUTOMATIC INDEXING
State of the Art
by
Hermann Fangmeyer
Diplom-Mathematiker
C.C.R.- EURATOM/CETIS
21020 ISPRA (VA) Italy
Summary
After an intensive period of research in information science from the late fifties to the late sixties, a lull can now be
observed, in which people seem to be engaged in developing operational IR systems at a relatively low level of sophistication. In
these systems the bulk of the work is still done by man and the methodology applied differs only in unimportant details. They
are often evaluated by applying short-sighted economic criteria without taking into account prospective user needs, which
tend towards increasingly specific and exhaustive information. Since re-designing a system and re-compiling manually built data
bases is a prohibitively expensive operation, it is believed that most of these systems will not survive for very long.
Indexing is a fundamental part of IR systems, since it directly affects their quality. Thus the most advanced indexing techniques,
which tend towards automatic full-text processing, should be chosen.
Computer-assisted indexing has great advantages over manual methods. Especially preferable are those techniques which
require the text to be indexed in machine-readable form, since they allow more automated indexing processes to be integrated
easily, when available, by reindexing the collection at relatively low cost.
1. INTRODUCTION
Semi-automatic indexing has not been strictly defined; there exist as many interpretations as synonyms. The intervention of
a computer may save the indexer from having to perform routine work, or the indexer may help the computer to make
decisions where no sophisticated algorithms are available. Hence, semi-automatic indexing methods must be arranged within the
wide spectrum between manual indexing with a minimum of computer assistance (e.g. the New York Times System, in which the
indexer works at a video terminal on which the document to be indexed is displayed using a closed-circuit television system,
without having access to the data bases [98(1972)]) and quasi-automatic indexing with a minimum of human intervention. The latter
can be imagined as a process in which the indexer changes a threshold value, depending upon the number of index terms to be
automatically assigned.
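The quasi-automatic end of this spectrum can be illustrated by a minimal Python sketch. It is not taken from any system cited in this report: the stop list, the use of raw word frequency as the score, and the function names are assumptions made purely for illustration.

```python
import re
from collections import Counter

# Hypothetical stop list; an operational system would use a far larger one.
STOP_WORDS = {"the", "of", "a", "an", "and", "in", "to", "is", "for", "on", "by"}

def score_terms(text):
    """Count how often each non-stop word occurs in the text."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(w for w in words if w not in STOP_WORDS)

def assign_terms(scores, threshold):
    """Return, in alphabetical order, the terms whose count reaches the threshold."""
    return sorted(term for term, count in scores.items() if count >= threshold)

text = ("Indexing of documents by computer. The computer counts words; "
        "the indexer reviews the indexing and adjusts the threshold.")
scores = score_terms(text)

# The only human decision in this sketch is the threshold value.
strict = assign_terms(scores, threshold=2)
loose = assign_terms(scores, threshold=1)
print(strict)
print(len(loose))
```

Lowering the threshold enlarges the set of assigned terms, so the indexer controls the number of index terms without selecting any of them individually.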
The terms computer-assisted indexing, computer- or machine-aided indexing, man-machine indexing, computer aids to
indexing, computer-based or computerized indexing, and similar terms will be treated as synonyms in this report.
Semi-automatic indexing should be distinguished from automatic indexing, which is defined by Stevens [110(p.3), (1970)] as
the use of machines to extract or assign index terms without human intervention, once programs or procedural rules have been
established.
Another, contradictory definition for automatic indexing can be derived from Caras' [24(1968)] statement:
The primary aim in automatic indexing is to derive index terms directly from the text with a minimum of human intervention.
This definition provides for an intellectual operation.
For Maron [75(1961)] the term 'automatic indexing' denotes the problem of deciding in a mechanical way to which
category (subject or field of knowledge) a given document belongs.
Thus, in his opinion, it concerns the problem of automatic recognition of the contents of a given document. From these
non-uniform definitions, no precise definition for machine-assisted indexing can be derived. The author therefore decided upon
the following:
The indexing process will be called semi-automatic within this report if it consists of a combination of the intellectual efforts of
scientific subject specialists and advanced computer techniques.
Thus, semi-automatic indexing is restricted here to those machine-aided methods which require the qualified intervention of
both a computer and an indexer. Furthermore, methods which cannot be applied to an operational system are also excluded.
Semi-automatic indexing is divided into conversational and symbiotic indexing in order to distinguish between indexing by
continuous contact with the computer and indexing by integration of the computer in the indexing process for the purpose of
performing certain clerical tasks, respectively.
This report comprises the state-of-the-art up to December 1972 in
- semi-automatic derivative indexing,
- machine-aided assignment indexing (including automatic assignment indexing techniques, which are based on previously
created manual or semi-automatic indexing aids),
- semi-automatic dictionary construction, since the indexing techniques often involve the setting up of thesauri.
(Since dictionary construction is often closely linked to indexing and often employs similar methods, exact distinctions cannot
always be made.)
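The difference between the derivative and assignment indexing named in the list above can be sketched in Python as follows. The stop list, the miniature thesaurus and all function names are hypothetical and invented for this illustration; none of them comes from a system covered in this report.

```python
import re

# Hypothetical stop list and thesaurus, invented for this illustration only.
STOP_WORDS = {"the", "of", "a", "in", "and", "for", "is", "are", "by"}
THESAURUS = {
    "reactor": "NUCLEAR REACTORS",
    "reactors": "NUCLEAR REACTORS",
    "neutron": "NEUTRONS",
    "neutrons": "NEUTRONS",
}

def tokenize(text):
    """Split the text into lower-case alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())

def derivative_index(text):
    """Derivative indexing: index terms are words taken from the text itself."""
    return sorted({w for w in tokenize(text) if w not in STOP_WORDS})

def assignment_index(text):
    """Assignment indexing: index terms come from a controlled vocabulary."""
    return sorted({THESAURUS[w] for w in tokenize(text) if w in THESAURUS})

title = "Measurement of neutron flux in the reactor"
derived = derivative_index(title)
assigned = assignment_index(title)
print(derived)
print(assigned)
```

In this sketch the thesaurus is the manually or semi-automatically built indexing aid on which the assignment step depends, which is why dictionary construction and indexing are hard to separate.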
Evaluation, in the sense of measuring the retrieval efficiency of the different approaches described in the literature, is not
involved here. The reason for this is that the authors usually content themselves with general and often contradictory
statements.
For some computer-aided indexing techniques computer analysis of text is the fundamental step [103(1969)]. For these the
data is needed in machine-readable form:
There are essentially two principal methods for obtaining a machine-readable text for computer indexing:
- as a by-product of the printing process; and
- through some kind of conversion procedure using keyboard devices to produce cards or tape, or using optical scanning
devices. [9(1968)]
The transfer of data in natural language into machine-readable form can be extremely expensive in relation to the
application for which it will be used. That is probably why automation in natural language processing is not as well developed as
other fields of computer application. (Indeed, there are only a few original approaches. Most applications are simple
modifications of these, i.e. theoretical advances have been minimal in the last decade.)