23.03.2013 Views

Semi Automatic Indexing State of the Art - FTP Directory Listing - Nato


- alphabetically adjacent index terms and/or those semantically or hierarchically related to a given one. (The latter display requires structured thesauri.)

The user then proceeds to narrow the range of his search by imposing a set of constraints and observing, interactively, the effects of these constraints [27(1970)].

He can proceed on different levels, depending upon the data bases to which he has access. The query can be evaluated for precision and ambiguity using the thesaurus file, and for its fit to the user's real need using the search file in an interactive mode (feedback).

It should be observed that the use of these conversational systems is restricted to query formulation. Indexing in most of the mentioned and similar systems is still done entirely manually. But what is the difference between formulating a query and indexing a document? The problems inherent in both are very similar.

In fact, Herr [48, 49(1970)] observes that subject indexers often compare a new document with items under various subject headings to determine the most appropriate slot for the new document. Hence, indexing could be done faster and more consistently by using a conversational indexing system. Bennet [14(1969)] goes still further by requiring that:

when adding a document to a collection an indexer should choose a representation which makes evident both the content of the document and its relation to other documents already in the collection.

This requirement is based on the observation that

users, on the average, are dissatisfied if more than 50 documents are presented in response to a subject search. This might suggest that no individual content identifier should characterize more than 50 documents. The system can inform the indexer when he uses an identifier which is beyond this threshold, whereupon he can consider an alternative, more refined, subdivision. This re-assessment of content identification, occurring in a planned and continuous manner, could benefit both librarian and user.
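
The threshold check described above might be sketched as follows. This is only an illustration under stated assumptions: the class `PostingFile`, its method `assign`, and the in-memory layout are hypothetical and not taken from any of the cited systems; only the limit of 50 documents comes from the text.

```python
from collections import defaultdict

MAX_DOCS_PER_IDENTIFIER = 50  # threshold suggested in the text


class PostingFile:
    """Toy in-memory index (hypothetical): content identifier -> document numbers."""

    def __init__(self, limit=MAX_DOCS_PER_IDENTIFIER):
        self.limit = limit
        self.postings = defaultdict(set)

    def assign(self, identifier, doc_id):
        """Assign an identifier to a document.

        Returns a warning string when the identifier now characterizes
        more documents than the limit, otherwise None.
        """
        self.postings[identifier].add(doc_id)
        count = len(self.postings[identifier])
        if count > self.limit:
            return (f"'{identifier}' now covers {count} documents "
                    f"(limit {self.limit}); consider a finer subdivision")
        return None
```

On receiving a warning, the indexer would decide whether to keep the identifier or to subdivide it; the sketch deliberately leaves that decision to the human.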

As another approach, Markus [74(1962)] suggests that:

each choice of an indexing term could place in front of the indexer a display of questions or possible additional indexing terms. These would be arranged to guide his thinking to the next logical choice of an indexing term (see also [45(1973)]).

J. Herr [48, 49(1970)] sees further advantages in access to the data bases during indexing, namely that

decentralized indexers can communicate through their work and that new indexers can be trained with minimal contact with experienced indexers by attempting to duplicate indexing patterns in the system.

Access to the data base during indexing also permits re-indexing, which could be most desirable for improving discrimination between similar documents.

The most considerable barrier to the use of on-line systems for indexing may still be their cost, although this has come down considerably in recent years. However, the most economical indexing process does not necessarily give the best results, so quality considerations should be taken into account too. The best index will be the most economical one in the long term, as the prospective user of information will have more sophisticated requirements [109(1972)].

2.2 Symbiotic Indexing Techniques

Symbiotic or off-line indexing means integrating the computer into the indexing process without its being permanently in contact with the indexer.

The computer or the indexer furnishes data which can be used for decision making at a later moment by the indexer or the computer, respectively. The process can alternate repeatedly. This kind of indexing is applied for economic reasons, preferably in cases where large amounts of text are to be processed, such as the construction of primary indexes and dictionaries of any kind.

Most techniques can be defined as computer controlled, since the text is needed in machine-readable form and since the computer makes the choice of index terms. The final decision on this choice most often remains a human prerogative, but it is usually a binary one: whether to accept or reject an index term selected by the computer.

Intellectually directed semi-automatic indexing techniques could be defined as those techniques which require a 'go-word' dictionary. In this special application the indexer has chosen the index terms a priori, and the computer is used to find their occurrences in the text.
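
A go-word lookup of this kind can be sketched in a few lines. The function name `find_go_words`, the case-insensitive whole-word matching, and the use of character offsets are assumptions made for illustration; the text does not specify how occurrences are matched or reported.

```python
import re


def find_go_words(text, go_words):
    """Return {go_word: [character offsets]} for every go-word found in text.

    Matching is case-insensitive and limited to whole words, so the
    go-word 'index' does not match inside 'indexing' (an assumption).
    """
    hits = {}
    for term in go_words:
        pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
        positions = [m.start() for m in pattern.finditer(text)]
        if positions:
            hits[term] = positions
    return hits
```

Go-words that never occur are simply absent from the result, so the output lists only the terms the indexer chose in advance and the places where the computer found them.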

The computer's reliability and speed as a searching, matching, comparing and arithmetic device can be exploited in two extremely useful ways. The computer can be used to edit the work of the indexer; it can also help to redesign an index so that it is sensitive to, and responds to, changes in the information content of a collection [12(1965)].

The editing function of the computer is expressed as follows: Since the computer is to take over the role of the editor, the indexer or author can now freely assign terms to a document and allow the computer to determine whether or not an assigned term is allowed by the index, whether or not the spelling of the term is acceptable, and whether the format of the term meets specifications. If desired, cross-references can also be added automatically [78(1968), 77(1969)].

The methods adopted for this task consist of simple dictionary comparisons. For error detection, the terms not found in the dictionary can be checked for simple errors such as a missing letter or the transposition of two adjacent letters. If the error cannot be corrected automatically, the term is displayed in order to be rectified manually [78(1968), 77(1969), 117(1970)].
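
The dictionary comparison with these two simple error classes might look like the sketch below. The function names and the rule that a term is corrected automatically only when exactly one candidate exists are assumptions for illustration, not the procedure of the cited systems.

```python
def correction_candidates(term, dictionary):
    """Dictionary entries reachable from term by restoring one missing
    letter or by swapping two adjacent letters."""
    candidates = set()
    # Transposition of two adjacent letters.
    for i in range(len(term) - 1):
        swapped = term[:i] + term[i + 1] + term[i] + term[i + 2:]
        if swapped in dictionary:
            candidates.add(swapped)
    # One missing letter: deleting one letter of an entry yields the term.
    for entry in dictionary:
        if len(entry) == len(term) + 1:
            if any(entry[:i] + entry[i + 1:] == term for i in range(len(entry))):
                candidates.add(entry)
    return candidates


def check_term(term, dictionary):
    """Return (status, value) with status 'ok', 'corrected' or 'manual'."""
    if term in dictionary:
        return ("ok", term)
    candidates = correction_candidates(term, dictionary)
    if len(candidates) == 1:  # unambiguous, so correct automatically (assumption)
        return ("corrected", candidates.pop())
    return ("manual", term)  # display the term for manual rectification
```

Terms with no candidate, or with several, fall through to the manual case, which mirrors the display step described above.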

Redesigning an index with the aid of a computer capitalizes on the arithmetic features of the machine. Using these, it is possible to keep a running tally on all the activities of the system, e.g., how often a term has been assigned to the documents of the collection, how many questions have used a given term, and so on. When specified thresholds on such empirical data are reached, a computer can indicate that a revision of the index is necessary and can determine the documents that will be affected by the revision. For example, as a document collection grows, when a given index term is assigned to too large a proportion of documents, that term loses power as a discriminator during search. This implies that the concept needs to be subdivided into more specific categories and that the original term should be used to designate a class. To control such circumstances one might specify, for example, that whenever a subject heading or an index term is assigned to one percent of the document collection, once the size of the collection has reached the range of 10,000 to 12,500 documents, the computer program must provide a print-out of the subject heading together with a list of the accession numbers of the documents to which that heading has been assigned. The use of a range rather than an absolute number would allow the system to continue effectively where the documents being added (and therefore now subject to revision) had already been indexed under the old heading. It would further accommodate the transition period which always accompanies revision [12(1965)].
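
The revision rule in this example can be sketched as a small report function. This is a simplification under stated assumptions: the name `revision_report` and its parameters are hypothetical, and the 10,000 to 12,500 range is reduced to its lower bound, so the transition period that the range is meant to accommodate is not modelled.

```python
def revision_report(postings, collection_size, share=0.01, min_collection=10_000):
    """Return {heading: sorted accession numbers} for headings due for revision.

    postings maps each subject heading to the accession numbers of the
    documents it has been assigned to. No heading is reported before the
    collection holds min_collection documents (lower bound of the range
    given in the text; a simplifying assumption).
    """
    if collection_size < min_collection:
        return {}
    limit = share * collection_size
    return {
        heading: sorted(accessions)
        for heading, accessions in postings.items()
        if len(accessions) >= limit
    }
```

Each entry of the result corresponds to one print-out: the heading, followed by the accession numbers of the documents that a revision would affect.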

Symbiotic indexing, as it is defined here, often also requires an intellectually performed editing function to prepare the input for (semi-)automatic processing (pre-editing), or to decide on the index terms chosen by the computer (post-editing). Text preparation may be at any level, for all kinds of indexes (see also [85(1964)]).

1. Addition of special codes (escape sequences) for special signs, such as the integral sign, or codes to represent uppercase, italics, boldface, etc.

2. Marking of document places, which means assigning to each word and non-verbal text expression its place, such as title, abstract, summary, heading, main text, footnote, etc.
