26.04.2013 Views

Handwritten Word Spotting in Old Manuscript Images using Shape ...

Handwritten Word Spotting in Old Manuscript Images using Shape ...

Handwritten Word Spotting in Old Manuscript Images using Shape ...

SHOW MORE
SHOW LESS

You also want an ePaper? Increase the reach of your titles

YUMPU automatically turns print PDFs into web optimized ePapers that Google loves.

There is a high number of different words <strong>in</strong> the selected ground truth documents, and the<br />

literal transcription (word by word) is an expensive process, consequently, only a few words are<br />

labelled. The words selected are shown <strong>in</strong> figure 4. These words are selected because of their high<br />

frequency of apparition <strong>in</strong> the selected documents.The ground truth has all the labelled words (20<br />

classes). The subset of the ground truth is composed by the first 10 labelled words (10 classes).<br />

(a) Barna (b) de Barna (c) en Barna (d) de (e) pages (f) reberè (g) dia<br />

(h) dit (i) fill (j) filla (k) viuda (l) ab (m) habitant (n) donsella<br />

(o) dilluns (p) dimarts (q) dimecres (r) dijous (s) divendres (t) dissapte (u) viudo<br />

Figure 4: An example of each selected word<br />

For the experiments we have employed a cross-validation technique[11]. Suppose that we have<br />

a data set Z of size N x n, conta<strong>in</strong><strong>in</strong>g n-dimensional feature vectors describ<strong>in</strong>g N objects. We<br />

choose an <strong>in</strong>teger K (preferably a factor of N ) and randomly divide Z <strong>in</strong>to K subsets of size N/K.<br />

Then we use one subset to test the performance of D tra<strong>in</strong>ed on the union of the rema<strong>in</strong><strong>in</strong>g K -<br />

1 subsets. This procedure is repeated K times, choos<strong>in</strong>g a different part for test<strong>in</strong>g each time. To<br />

get the f<strong>in</strong>al result we average the K estimates. We have chosen K = 5 <strong>in</strong> our experiments.<br />

3. Related work<br />

<strong>Word</strong>-spott<strong>in</strong>g was orig<strong>in</strong>ally formulated to detect words <strong>in</strong> speech messages [10]. Later it was<br />

used <strong>in</strong> text documents [20] for match<strong>in</strong>g and <strong>in</strong>dex<strong>in</strong>g handwritten words of several documents.<br />

<strong>in</strong> this context, it was first proposed by Manmatha [13], and later, a number of different word<br />

match<strong>in</strong>g algorithms were <strong>in</strong>vestigated. This technique needs word segmentation, and many word<br />

segmentation approaches can be found <strong>in</strong> the literature. Relevant examples are: a scale-space word<br />

segmentation process was proposed <strong>in</strong> [14] and a neural network work segment<strong>in</strong>g algorithm is also<br />

presented <strong>in</strong> [26].<br />

Rath [21] proposed an automatic retrieval system for historical handwritten documents us<strong>in</strong>g<br />

relevance models. The method describes two statistical models for retrieval <strong>in</strong> large collections of<br />

handwritten manuscripts given a text query. Both use a set of transcribed page images to learn a<br />

jo<strong>in</strong>t probability distribution between features computed from word images and their transcriptions.<br />

7

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!