RR_03_02
RR_03_02
RR_03_02
You also want an ePaper? Increase the reach of your titles
YUMPU automatically turns print PDFs into web optimized ePapers that Google loves.
Indexing and Retrieval of Document Images<br />
Using Term Positions and Physical Structures<br />
Koichi Kise, Keinosuke Matsumoto<br />
Dept. of Computer and Systems Sciences, Osaka Prefecture University<br />
1-1 Gakuencho, Sakai, Osaka 599-853 1, Japan<br />
kise@cs.osakafu-u.ac.jp<br />
Abstract<br />
This paper presents some methods of indexing and retrieval<br />
of document images based on their physical (layout)<br />
structures and term positions in pages. Documents are divided<br />
into blocks that are physically de ned for the purpose<br />
of indexing with terms. The simple vector space model<br />
(VSM) and the latent semantic indexing (LSI) are employed<br />
as retrieval models. Experimental results on the retrieval of<br />
129 documents show that LSI with blocks consisting of overlapping<br />
pages outperforms an ordinary method of retrieval<br />
based on the VSM.<br />
1. Introduction<br />
Document image databases (OrBs) are the databases that<br />
provide ef cient storage of and access to document images<br />
[I]. As it becomes more popular to equip copiers and<br />
printers with devices for the storage of document images,<br />
OrBs gain importance in our society.<br />
An open issue of OrBs is how to achieve content-based<br />
retrieval based on queries given by users. In the case that<br />
queries are given as image features such as layout of documents<br />
[2], indexing by layout analysis should be applied. [f<br />
queries are given as keywords, it is required to apply OCR<br />
for indexing. [n addition to the issue of OCR errors [3], we<br />
have another issue of how to index document images based<br />
on recognized characters and words.<br />
A simple way is to employ the Bag of Words model,<br />
which is common in the eld of information retrieval. [n<br />
this model, documents are regarded as collections of words.<br />
[n the context of OrBs, therefore, the rest of information<br />
obtained through the process of OCR, e.g., positions of<br />
characters / words, and physical (layout) structures of documents,<br />
is discarded.<br />
This paper presents methods of util izing some of the discarded<br />
information in addition to words (terms) themselves<br />
19<br />
o··· DO<br />
document (D) page (P)<br />
II ··· D···<br />
column (C) half of a column (H)<br />
Figure 1. Units of indexing.<br />
so as to improve the accuracy of retrieval. To be precise,<br />
documents are indexed based on the information of physical<br />
structures such as pages and columns, as well as positions<br />
of terms in pages. As an example of such methods, we<br />
have already proposed the method called density distributions<br />
of terms [4, 5], which is an application of passage<br />
retrieval in the [R eld to document images. The methods<br />
proposed in this paper can be viewed as simpli ed versions<br />
of this method for the reduction of computational costs.<br />
2. Indexing<br />
Our methods of indexing are de ned by units and blocks<br />
of indexing. As units of indexing, we consider page (P),<br />
column (C) and half of a column (H) in addition to document<br />
(D) as shown in Fig. l.<br />
Columns are obtained in a brute-force manner in order to<br />
avoid complicated problems in layout analysis: each page<br />
region is cut into two pieces at a xed position. As a result,<br />
regions laid out in a si ngle column format as well as<br />
some wider gures and tables are split into different pieces.<br />
Halves of columns are likewise de ned by splitting at the<br />
physical vertical center of columns.