26.04.2013 Views

Handwritten Word Spotting in Old Manuscript Images using Shape ...


Figure 2: Llibre d’esposalles (Archive of Barcelona Cathedral, ACB): (a) 1617, index of volume 69; (b) 1729, volume 127; (c) 1860, volume 200.

segmentation step has to be done. Other good characteristics of the documents are that the text is hardly cursive, the documents are very clean, the words are connected, and the grammatical structure is similar in all the registers: the day of the wedding, the name and job of the husband, the husband’s parents, the name of the wife, the wife’s parents, and the place where the wedding was held.

Figure 3: Structure of the documents: (a) husband’s surname, (b) wedding, (c) tax of the wedding

Our work is part of a large project led by the Center for Demographic Studies (CED), Department of Geography, Universitat Autònoma de Barcelona (UAB). This project brings together researchers from the social sciences and computer science. From the perspective of scholars in the social sciences, this collection is a rich source of information for constructing the genealogy of people over centuries. Thus, the first aim is to construct a database of marriages. Doing this by hand would be a painstaking and time-consuming task, so word spotting techniques can help to semi-automate the process.

Ground truth

We have a ground truth composed of 500 documents from volume 69 of the Barcelona Cathedral volumes. We also have a second ground truth, which is a subset of the first and is composed of 30 of the same documents. These documents have been labelled manually, using an application program for labelling documents. This program allows us to select an area of the image and label it with a word. Each word (and the points marking the selected area) is automatically saved in an XML file.
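As an illustration of how such word labels could be stored, the following minimal Python sketch writes and reads annotations as XML. The source does not describe the actual file layout, so the element and attribute names used here (`page`, `word`, `point`, `image`, `label`, `x`, `y`), the file name and the sample word are all assumptions made for this example, not the format produced by the labelling program.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class WordAnnotation:
    """One labelled word: its transcription and the points of the selected area."""
    label: str
    points: List[Tuple[int, int]]


def annotations_to_xml(image_name: str, annotations: List[WordAnnotation]) -> str:
    """Serialize the annotations of one page image (hypothetical schema)."""
    page = ET.Element("page", {"image": image_name})
    for ann in annotations:
        word = ET.SubElement(page, "word", {"label": ann.label})
        for x, y in ann.points:
            ET.SubElement(word, "point", {"x": str(x), "y": str(y)})
    return ET.tostring(page, encoding="unicode")


def annotations_from_xml(xml_text: str) -> Tuple[str, List[WordAnnotation]]:
    """Parse XML written by annotations_to_xml back into annotation objects."""
    page = ET.fromstring(xml_text)
    annotations = []
    for word in page.findall("word"):
        points = [(int(p.get("x")), int(p.get("y"))) for p in word.findall("point")]
        annotations.append(WordAnnotation(word.get("label"), points))
    return page.get("image"), annotations


# Example: one word labelled with a rectangular area (made-up coordinates).
sample = [WordAnnotation("pages", [(120, 40), (210, 40), (210, 75), (120, 75)])]
xml_text = annotations_to_xml("volume69_page001.png", sample)
image_name, restored = annotations_from_xml(xml_text)
```

Storing the points of each area rather than only a bounding box keeps the sketch usable for non-rectangular selections, which a word spotting evaluation may or may not need.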
