RR_03_02

Indexing and Retrieval of Document Images 

Using Term Positions and Physical Structures 

Koichi Kise, Keinosuke Matsumoto 

Dept. of Computer and Systems Sciences, Osaka Prefecture University 

1-1 Gakuencho, Sakai, Osaka 599-8531, Japan

kise@cs.osakafu-u.ac.jp 

Abstract 

This paper presents methods of indexing and retrieving document images
based on their physical (layout) structures and the positions of terms
in pages. Documents are divided into physically defined blocks for the
purpose of indexing with terms. The simple vector space model (VSM) and
latent semantic indexing (LSI) are employed as retrieval models.
Experimental results on the retrieval of 129 documents show that LSI
with blocks consisting of overlapping pages outperforms an ordinary
method of retrieval based on the VSM.

1. Introduction 

Document image databases (DIDBs) are databases that provide efficient
storage of and access to document images [1]. As it becomes more common
to equip copiers and printers with devices for storing document images,
DIDBs are gaining importance in our society.

An open issue for DIDBs is how to achieve content-based retrieval based
on queries given by users. If queries are given as image features such
as the layout of documents [2], indexing by layout analysis should be
applied. If queries are given as keywords, OCR must be applied for
indexing. In addition to the issue of OCR errors [3], there is the
further issue of how to index document images based on the recognized
characters and words.

A simple way is to employ the Bag of Words model, which is common in the
field of information retrieval. In this model, documents are regarded as
collections of words. In the context of DIDBs, therefore, the rest of
the information obtained through the process of OCR, e.g., the positions
of characters and words and the physical (layout) structures of
documents, is discarded.
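To make concrete what the Bag of Words model discards, the following is a
minimal sketch in Python. The tuple layout (term, page, x, y) for OCR
output and the sample values are assumptions made for illustration only;
they are not a format or data taken from this paper.

```python
from collections import Counter

# Hypothetical OCR output: (term, page number, x, y).
# This layout and these values are illustrative assumptions.
ocr_words = [
    ("retrieval", 1, 120.0, 80.0),
    ("document", 1, 300.0, 80.0),
    ("retrieval", 2, 95.0, 410.0),
]


def bag_of_words(words):
    """Count term occurrences, ignoring page numbers and coordinates."""
    return Counter(term for term, _page, _x, _y in words)


bag = bag_of_words(ocr_words)
```

Only the term counts survive in `bag`; the page numbers and coordinates
that OCR also provided are no longer available for indexing.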

This paper presents methods of utilizing some of the discarded
information in addition to the words (terms) themselves so as to improve
the accuracy of retrieval. To be precise, documents are indexed based on
information about physical structures such as pages and columns, as well
as the positions of terms in pages. As an example of such methods, we
have already proposed a method called density distributions of terms
[4, 5], which is an application of passage retrieval in the IR field to
document images. The methods proposed in this paper can be viewed as
simplified versions of this method that reduce computational costs.

Figure 1. Units of indexing: document (D), page (P), column (C), and
half of a column (H).

2. Indexing 

Our methods of indexing are defined by units and blocks of indexing. As
units of indexing, we consider the page (P), column (C), and half of a
column (H) in addition to the document (D), as shown in Fig. 1.

Columns are obtained in a brute-force manner in order to avoid
complicated problems in layout analysis: each page region is cut into
two pieces at a fixed position. As a result, regions laid out in a
single-column format, as well as some wider figures and tables, are
split into different pieces. Halves of columns are likewise defined by
splitting at the physical vertical center of columns.
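The brute-force assignment of terms to units can be sketched as follows.
This is a minimal illustration under stated assumptions, not the authors'
implementation: the function names, the (term, page, x, y) input layout,
the default cut at half the page width, and the treatment of a column as
spanning the full page height are all hypothetical choices. The y
coordinate is assumed to be measured from the top of the page.

```python
from collections import Counter, defaultdict


def unit_id(page, x, y, page_width, page_height, unit, split_x=None):
    """Return an identifier of the indexing unit that contains (x, y).

    unit is one of "D" (document), "P" (page), "C" (column),
    or "H" (half of a column).
    """
    if split_x is None:
        # Assumption: the fixed cut position is the horizontal page center.
        split_x = page_width / 2.0
    if unit == "D":
        return ("D",)
    if unit == "P":
        return ("P", page)
    column = 0 if x < split_x else 1
    if unit == "C":
        return ("C", page, column)
    if unit == "H":
        # Assumption: a column spans the full page height, so its
        # vertical center coincides with that of the page.
        half = 0 if y < page_height / 2.0 else 1
        return ("H", page, column, half)
    raise ValueError(f"unknown unit: {unit!r}")


def index_by_unit(words, page_width, page_height, unit):
    """Group term counts by indexing unit."""
    index = defaultdict(Counter)
    for term, page, x, y in words:
        key = unit_id(page, x, y, page_width, page_height, unit)
        index[key][term] += 1
    return dict(index)


# Hypothetical OCR output on 600 x 800 pages: (term, page, x, y).
sample_words = [
    ("retrieval", 1, 120.0, 80.0),
    ("document", 1, 300.0, 80.0),
    ("retrieval", 2, 95.0, 410.0),
]
column_index = index_by_unit(sample_words, 600.0, 800.0, "C")
```

Because the cut is made at a fixed position regardless of the actual
layout, a term in a single-column region is still assigned to the left or
right piece according to its x coordinate alone, which mirrors the
splitting behavior described above.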
