26.04.2013 Views

Handwritten Word Spotting in Old Manuscript Images using Shape ...


6. Preliminary steps

6.1 Pre-processing

Modelling the human cognitive process to obtain a comparable computational methodology for handwritten word segmentation is difficult, for the following reasons. The handwriting style is usually either cursive or discrete. In the case of discrete handwriting, characters are joined to form words but, unlike machine-printed text, handwritten text is not uniformly spaced. The size of the characters varies from word to word across the document (a scale problem). Ascenders and descenders are regularly connected, and words appear in different orientations. Documents are often degraded due to ageing or other causes. A further difficulty is the presence of the show-through or bleed-through effects explained above.

Some of the main problems of our historical documents are that they were written by several authors (the writer changes every two years), that they are noisy (stains, shadows, bleed-through, etc.), and that they contain margins, among others.

The documents used in our experiments exhibit some of the drawbacks discussed above, such as connected ascenders and descenders, varying character sizes, etc. A good characteristic of these documents, however, is that they are well structured. As noted in Section 2, each document has three parts, and the objective is to work with the marriage licenses.

The pre-processing steps are: binarization of the documents, page segmentation, layout segmentation, line segmentation and, as the last step, word segmentation (Fig. 6). The following subsections describe these steps in detail.

Figure 6: Pre-processing steps.
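To make the order of these steps concrete, the following is a minimal sketch of three of them (binarization, line segmentation and word segmentation) on a grayscale image stored as a list of pixel rows. It is an illustration under stated assumptions, not the method used in this work: the fixed global threshold, the projection-profile segmentation, the `min_gap` parameter and all function names are hypothetical choices made for the example, and page and layout segmentation are omitted.

```python
from typing import List, Tuple

Image = List[List[int]]   # rows of grayscale pixels, 0 = black, 255 = white
Span = Tuple[int, int]    # half-open interval [start, end)


def binarize(gray: Image, threshold: int = 128) -> Image:
    """Fixed global threshold (an assumption): 1 = ink, 0 = background."""
    return [[1 if px < threshold else 0 for px in row] for row in gray]


def runs(profile: List[int], min_gap: int = 1) -> List[Span]:
    """Find the non-empty stretches of a projection profile.

    Stretches separated by fewer than `min_gap` empty positions are merged,
    so small gaps between characters do not split a word.
    """
    spans: List[Span] = []
    start = None
    for i, value in enumerate(profile):
        if value > 0 and start is None:
            start = i
        elif value == 0 and start is not None:
            spans.append((start, i))
            start = None
    if start is not None:
        spans.append((start, len(profile)))

    merged: List[Span] = []
    for s, e in spans:
        if merged and s - merged[-1][1] < min_gap:
            merged[-1] = (merged[-1][0], e)
        else:
            merged.append((s, e))
    return merged


def segment_lines(binary: Image) -> List[Span]:
    """Text lines are the row ranges whose horizontal projection has ink."""
    return runs([sum(row) for row in binary])


def segment_words(binary: Image, line: Span, min_gap: int = 2) -> List[Span]:
    """Words are the column ranges of one line, split at gaps >= min_gap."""
    top, bottom = line
    width = len(binary[0])
    profile = [sum(binary[y][x] for y in range(top, bottom))
               for x in range(width)]
    return runs(profile, min_gap)
```

A plain projection profile of this kind assumes roughly horizontal, well-separated lines. The characteristics listed above (connected ascenders and descenders, varying orientation, non-uniform spacing) are exactly the cases in which such a simple scheme would be expected to fail.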
