29.05.2013 Views

RR_03_02


Indexing and Retrieval of Document Images Using Term Positions and Physical Structures

Koichi Kise, Keinosuke Matsumoto

Dept. of Computer and Systems Sciences, Osaka Prefecture University

1-1 Gakuencho, Sakai, Osaka 599-8531, Japan

kise@cs.osakafu-u.ac.jp

Abstract

This paper presents methods for indexing and retrieving document images based on their physical (layout) structures and the positions of terms in pages. Documents are divided into physically defined blocks for the purpose of indexing with terms. The simple vector space model (VSM) and latent semantic indexing (LSI) are employed as retrieval models. Experimental results on the retrieval of 129 documents show that LSI with blocks consisting of overlapping pages outperforms an ordinary retrieval method based on the VSM.

1. Introduction

Document image databases (DIDs) are databases that provide efficient storage of and access to document images [1]. As it becomes more popular to equip copiers and printers with devices for the storage of document images, DIDs gain importance in our society.

An open issue of DIDs is how to achieve content-based retrieval based on queries given by users. When queries are given as image features such as the layout of documents [2], indexing by layout analysis should be applied. If queries are given as keywords, OCR must be applied for indexing. In addition to the issue of OCR errors [3], there is the further issue of how to index document images based on the recognized characters and words.

A simple way is to employ the Bag of Words model, which is common in the field of information retrieval. In this model, documents are regarded as collections of words. In the context of DIDs, therefore, the rest of the information obtained through the process of OCR, e.g., the positions of characters and words and the physical (layout) structures of documents, is discarded.
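
To make the loss of information concrete, the following is a minimal sketch of the Bag of Words reduction described above. The record layout `(term, page, x, y)` and the sample values are hypothetical and only stand in for OCR output; they are not taken from the paper.

```python
from collections import Counter

# Hypothetical OCR output: (term, page number, x, y), with x and y
# normalized to [0, 1] within the page. Illustrative values only.
ocr_words = [
    ("retrieval", 1, 0.12, 0.08),
    ("document", 1, 0.30, 0.08),
    ("retrieval", 2, 0.71, 0.55),
]


def bag_of_words(words):
    """Count term occurrences, ignoring page numbers and coordinates."""
    return Counter(term for term, _page, _x, _y in words)


bag = bag_of_words(ocr_words)
```

After the reduction, only the term counts remain; the page on which a term appeared and its position in that page can no longer be recovered from `bag`.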

This paper presents methods of utilizing some of the discarded information in addition to the words (terms) themselves so as to improve the accuracy of retrieval. To be precise, documents are indexed based on physical structures such as pages and columns, as well as the positions of terms in pages. As an example of such methods, we have already proposed the method called density distributions of terms [4, 5], which is an application of passage retrieval in the IR field to document images. The methods proposed in this paper can be viewed as simplified versions of this method that reduce computational costs.

Figure 1. Units of indexing: document (D), page (P), column (C), and half of a column (H).

2. Indexing

Our methods of indexing are defined by units and blocks of indexing. As units of indexing, we consider the page (P), column (C), and half of a column (H) in addition to the document (D), as shown in Fig. 1.

Columns are obtained in a brute-force manner in order to avoid complicated problems in layout analysis: each page region is cut into two pieces at a fixed position. As a result, regions laid out in a single-column format, as well as some wider figures and tables, are split into different pieces. Halves of columns are likewise defined by splitting at the physical vertical center of columns.
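
The brute-force splitting described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: term positions are assumed to be normalized to [0, 1] within a page, the fixed cut position for columns is assumed to be the horizontal midpoint (`split_x=0.5`), and the function names `unit_key` and `index_by_unit` are hypothetical.

```python
from collections import Counter, defaultdict


def unit_key(page, x, y, unit, split_x=0.5, split_y=0.5):
    """Return the key of the indexing unit that contains a term.

    unit is one of "D" (document), "P" (page), "C" (column),
    "H" (half of a column). split_x is the assumed fixed cut position
    for columns; split_y is the vertical center of a column.
    """
    col = 0 if x < split_x else 1    # left or right column
    half = 0 if y < split_y else 1   # upper or lower half of the column
    if unit == "D":
        return ()
    if unit == "P":
        return (page,)
    if unit == "C":
        return (page, col)
    if unit == "H":
        return (page, col, half)
    raise ValueError(f"unknown unit: {unit!r}")


def index_by_unit(words, unit):
    """Group term counts by indexing unit for one document."""
    index = defaultdict(Counter)
    for term, page, x, y in words:
        index[unit_key(page, x, y, unit)][term] += 1
    return dict(index)


# Hypothetical OCR output for one document: (term, page, x, y).
words = [
    ("layout", 1, 0.2, 0.1),
    ("layout", 1, 0.8, 0.1),
    ("ocr", 1, 0.2, 0.9),
    ("ocr", 2, 0.2, 0.1),
]
counts = {unit: index_by_unit(words, unit) for unit in "DPCH"}
```

For the sample data, the same four term occurrences fall into one unit at the document level, two at the page level, three at the column level, and four at the half-column level, which shows how finer units keep progressively more positional information. Because the cut is made at a fixed position regardless of the actual layout, a single-column region is split across both column keys, as the text notes.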
