Semi Automatic Indexing State of the Art - FTP Directory Listing - Nato
- alphabetically adjacent index terms and/or those semantically or hierarchically related to a given one. (The latter display
requires structured thesauri.)
The user then proceeds to narrow the range of his search by imposing a set of constraints and observing, interactively, the effects of these
constraints [27(1970)].
He can proceed on different levels depending upon the data bases to which he has access. The query can be
evaluated for precision and ambiguity using the thesaurus file, and against the user's real need using the search
file in an interactive mode (feedback).
It should be observed that the use of these conversational systems is restricted to query formulation. Indexing in most of the
mentioned and similar systems is still done completely manually. But what is the difference between the formulation of a query and
the indexing of a document? The problems inherent in both are very similar.
In fact, Herr [48, 49(1970)] observes that subject indexers often compare a new document with items under various subject
headings to determine the most appropriate slot for the new document. Hence, indexing could be done faster and more
consistently by using a conversational indexing system. Bennet [14(1969)] goes still further by requiring that:
when adding a document to a collection an indexer should choose a representation which makes evident both the content of the
document and its relation to other documents already in the collection.
This requirement is based on the observation that
users, on the average, are dissatisfied if more than 50 documents are presented in response to a subject search. This might
suggest that no individual content identifier should characterize more than 50 documents. The system can inform the indexer
when he uses an identifier which is beyond this threshold, whereupon he can consider an alternative, more refined, subdivision.
This re-assessment of content identification, occurring in a planned and continuous manner, could benefit both librarian and
user.
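The threshold check described above can be sketched roughly as follows. This is only an illustration, assuming a simple in-memory mapping from content identifiers to document numbers; the class name `PostingIndex`, the method `assign` and the wording of the warning are hypothetical and are not taken from any of the systems cited.

```python
from collections import defaultdict

MAX_DOCS_PER_IDENTIFIER = 50  # threshold suggested in the text


class PostingIndex:
    """Hypothetical in-memory index: content identifier -> set of document ids."""

    def __init__(self, limit=MAX_DOCS_PER_IDENTIFIER):
        self.limit = limit
        self.postings = defaultdict(set)

    def assign(self, doc_id, identifier):
        """Assign `identifier` to `doc_id`.

        Returns a warning string if the identifier now characterizes more
        than `limit` documents, otherwise None.
        """
        self.postings[identifier].add(doc_id)
        count = len(self.postings[identifier])
        if count > self.limit:
            return (f"'{identifier}' now covers {count} documents "
                    f"(limit {self.limit}); consider a finer subdivision")
        return None
```

The indexer would see the warning at the moment of assignment and could then choose a more specific identifier. How the existing documents under the overloaded identifier are redistributed is left open here, as it is in the text.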
As another approach, Markus [74(1962)] suggests that:
each choice of an indexing term could place in front of the indexer a display of questions or possible additional indexing
terms. These would be arranged to guide his thinking to the next logical choice of an indexing term
(see also [45(1973)]).
According to Herr [48, 49(1970)], further advantages of access to the data bases during indexing are that
decentralized indexers can communicate through their work and that new indexers can be trained with minimal contact with
experienced indexers by attempting to duplicate indexing patterns in the system.
Access to the data base during indexing also permits re-indexing, which could be most desirable to improve discrimination
between similar documents.
The most considerable barrier inhibiting the use of on-line systems for indexing may still be their cost, although this has come
down considerably in recent years. However, the most economic indexing process does not necessarily give the best results.
Therefore quality considerations should be taken into account too. The best index will be the most economic one in the
long term, as the prospective user of information will have more sophisticated requirements [109(1972)].
2.2 Symbiotic Indexing Techniques
Symbiotic or off-line indexing means the integration of the computer into the indexing process without its being permanently
in contact with the indexer.
The computer or indexer furnishes data which can be used for decision making at a later moment by the indexer or computer
respectively. The process can alternate repeatedly. This kind of indexing is applied for economic reasons, preferably in cases
where large amounts of text are to be processed, such as the construction of primary indexes and dictionaries of any kind.
Most techniques can be defined as computer controlled, since the text is needed in machine-readable form and since the
computer makes the choice of index terms. The final decision on this choice most often remains, however, a human prerogative,
but it is usually a binary decision whether to accept or reject an index term selected by the computer.
Intellectually directed semi-automatic indexing techniques could be defined as those techniques which require a 'go-word'
dictionary. In this special application the indexer has made his choice of index terms a priori and the computer is used to
find their occurrences in the text.
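As an illustration of the go-word approach, the following sketch looks up a fixed list of index terms in a text and reports where each one occurs. The function name `find_go_words`, the case-insensitive whole-word matching and the use of character offsets are assumptions made for this example; the text does not prescribe a particular matching rule.

```python
import re


def find_go_words(text, go_words):
    """Return {go_word: [character offsets]} for each go-word found in `text`.

    Matching is case-insensitive and restricted to whole words, so 'radar'
    does not match 'radars'. Multi-word terms are allowed.
    """
    hits = {}
    for term in go_words:
        pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
        offsets = [m.start() for m in pattern.finditer(text)]
        if offsets:
            hits[term] = offsets
    return hits
```

A real system would probably also conflate inflected forms (plural, genitive) with the dictionary entry. That step is omitted here to keep the sketch short.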
The computer's reliability and speed as a searching, matching, comparing and arithmetic device can be exploited in two
extremely useful ways. The computer can be used to edit the work of the indexer; it can also help to redesign an index so that it
is sensitive to, and responds to, changes in the information content of a collection [12(1965)].
The editing function of the computer is expressed as follows: Since the computer is to take over the role of the editor, the
indexer or author can now freely assign terms to a document and allow the computer to determine whether or not an assigned
term is allowed by the index, whether or not the spelling of the term is acceptable, and whether the format of the term meets
specifications. If desired, cross-references can also be added automatically [78(1968), 77(1969)].
The methods adopted for this task consist of simple dictionary comparisons. For error detection, the terms not found in the
dictionary can be checked for simple errors such as a missing letter or the transposition of two adjacent letters. If the error
cannot be automatically corrected, the term is displayed in order to be rectified manually [78(1968), 77(1969), 117(1970)].
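A minimal sketch of such a dictionary comparison is given below. It covers only the two simple errors named in the text, a missing letter and the transposition of two adjacent letters. The function names, the lowercase Latin alphabet and the rule that a term is corrected only when exactly one dictionary entry fits are assumptions of this example, not details of the cited systems.

```python
ALPHABET = "abcdefghijklmnopqrstuvwxyz"


def candidates(term):
    """Strings obtained from `term` by undoing one simple error:
    re-inserting one missing letter, or swapping two adjacent letters."""
    out = set()
    for i in range(len(term) + 1):          # one letter was dropped
        for c in ALPHABET:
            out.add(term[:i] + c + term[i:])
    for i in range(len(term) - 1):          # two adjacent letters swapped
        out.add(term[:i] + term[i + 1] + term[i] + term[i + 2:])
    return out


def check_term(term, dictionary):
    """Compare an assigned term with a set of allowed (lowercase) terms.

    Returns ('ok', term), ('corrected', fix), or ('manual', term) when the
    term cannot be corrected unambiguously and must be shown to the indexer.
    """
    t = term.lower()
    if t in dictionary:
        return ("ok", t)
    matches = sorted(candidates(t) & dictionary)
    if len(matches) == 1:
        return ("corrected", matches[0])
    return ("manual", t)
```

For example, with the dictionary `{"thesaurus", "index", "radar"}`, the input `"thesarus"` is corrected to `"thesaurus"` and `"idnex"` to `"index"`, while `"sonar"` is passed on for manual treatment.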
Redesigning an index with the aid of a computer capitalizes on the arithmetic features of the machine. Using these, it is
possible to keep a running tally on all the activities of the system, e.g., how often a term has been assigned to the documents of
the collection, how many questions have used a given term, and so on. When specified thresholds on such empirical data are
reached, a computer can indicate that a revision of the index is necessary and can determine the documents that will be affected
by the revision. For example, as a document collection grows, when a given index term is assigned to too large a proportion of
documents, that term loses power as a discriminator during search. This implies that the concept needs to be subdivided into
more specific categories and that the original term should be used to designate a class. To control such circumstances one might
specify, for example, that whenever a subject heading or an index term is assigned to one percent of the document collection,
once the size of the collection has reached the range of 10,000 to 12,500 documents, the computer program must provide a
print-out of the subject heading together with a list of the accession numbers of the documents to which that heading has been
assigned. The use of a range rather than an absolute number would allow the system to continue effectively where the
documents being added, and therefore now subject to revision, had already been indexed under the old heading. It would
further accommodate the transition period, which always accompanies revision [12(1965)].
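The one-percent rule can be sketched as a small reporting function. The source does not say exactly how the range of 10,000 to 12,500 documents is applied; this example assumes that the check is active only while the collection size lies within that range. The function name `revision_report` and its parameters are hypothetical.

```python
def revision_report(postings, collection_size,
                    size_range=(10_000, 12_500), fraction=0.01):
    """List the headings that are due for revision.

    postings: dict mapping heading -> list of accession numbers.
    Returns {heading: sorted accession numbers} for every heading assigned
    to at least `fraction` of the collection. The check is only active while
    `collection_size` lies within `size_range` (inclusive); this reading of
    the range is an assumption. Outside the range an empty dict is returned.
    """
    low, high = size_range
    if not (low <= collection_size <= high):
        return {}
    cutoff = fraction * collection_size
    return {heading: sorted(numbers)
            for heading, numbers in postings.items()
            if len(numbers) >= cutoff}
```

The returned mapping corresponds to the print-out described above: each overloaded heading together with the accession numbers of the documents that a revision would affect.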
Symbiotic indexing, as it is defined here, often also requires an intellectually performed editing function to prepare the input
for (semi-)automatic processing (pre-editing), or to decide on the index terms chosen by the computer (post-editing). Text
preparation may be at any level, for all kinds of indexes (see also [85(1964)]):
1. Addition of special codes (escape sequences) for special signs, such as the integral sign, or codes to represent uppercase, italics,
boldface, etc.
2. Marking of document places, which means assigning to each word and non-verbal text expression its place, such as title,
abstract, summary, heading, main text, footnote, etc.