
MASTER IN COMPUTER VISION AND ARTIFICIAL INTELLIGENCE

REPORT OF THE RESEARCH PROJECT

OPTION: COMPUTER VISION

Handwritten Word Spotting in Old Manuscript Images using Shape Descriptors

Author: David Fernández

Date: 08/09/2010

Advisors: Josep Lladós & Alicia Fornés


Handwritten Word Spotting in Old Manuscript Images using Shape Descriptors

David Fernández dfernandez@cvc.uab.es
Computer Vision Center (CVC)
Campus UAB - Edifici O
08193 Bellaterra, Barcelona, Spain

Supervisors: Josep Lladós & Alicia Fornés

Abstract

There are many historical handwritten documents containing information that can be used for several studies and projects. The Document Image Analysis and Recognition community is interested in preserving these documents and extracting all the valuable information from them. Handwritten word-spotting is the pattern classification task that consists in detecting word images in handwritten documents. In this work, we have used query-by-example: we match an input image with one or multiple query images to determine a distance that might indicate a correspondence. We have developed two approaches. The first approach consists of a hierarchical process that uses two different features organized in layers (basic features in the first layer and BSM features in the second layer). The second approach employs characteristic Loci features. Marriage licenses of the Cathedral of Barcelona are used as the benchmarking database. We have searched for several words selected according to their frequency of appearance in the documents. The results are evaluated using two different types of measures: the first one evaluates how well the observations are clustered in the learning process, and precision-recall curves are used to evaluate the retrieval step.

Keywords: Word-Spotting, BSM, Loci, k-means, Anisotropic Gaussian Filter

1. Introduction

Context and motivation

Despite the growing use of electronic documents in our daily life, paper documents still play an important role. Current technologies provide us with convenient and inexpensive means to capture, store, compress and transfer digitized images of documents. Nevertheless, (semi)automatic document processing requires specialized technology to extract document contents. Information retrieval from Digital Libraries is primarily done using typed textual queries. Hence, document images are transcribed to ASCII codes using Optical Character Recognition (OCR) systems, and querying and indexing are performed by sequence comparison of ASCII strings. This solution is constrained to machine-printed text, but documents contain other forms of information such as handwritten text, symbols and graphical structures. One of the main purposes of the area of Document Image Analysis and Recognition (DIAR) is the extraction of information, either textual, pictorial or structural, from document images. The understanding of such information represents a step forward towards shortening the semantic gap between recognizing individual visual objects and understanding the whole document content in a given context. It does not involve the pure transcription of documents, but the retrieval and linkage of semantic knowledge from large collections of document images stored in digital repositories.

There is an increasing interest in digitally preserving and providing access to historical document collections in libraries, museums and archives. The conversion of historical document collections to digital archives is of prime importance to society, both in terms of information accessibility and long-term preservation. Handwritten documents are typically found in historical archives. Examples are unique manuscripts written by well-known scientists, artists or writers; letters, trade forms or administrative documents kept by parishes or municipalities that help to reconstruct historical sequences in a given place or time, etc. While machine-printed documents, under a minimum of conditions, are easy to read by OCR systems, the recognition of handwriting is still a scientific challenge. The state of the art achieves good performance only in constrained domains or with small vocabularies.

Mass digitization of historical documents is performed using specialized scanners. These scanners allow obtaining good-quality images without physically damaging the documents. After that, image processing steps are usually applied to enhance the images and ease visual inspection. The problem is the degradation of the documents caused by a lifetime of use. Degradation can appear for several reasons: non-stationary noise due to illumination changes, curvature of the document, ink stains and holes in the document, ink show-through (the appearance of the verso side text or graphics on the scanned image of the recto side), low contrast, warping effects, etc. Some centuries ago, the ink used for writing contained oxide particles, which contribute to degrading the paper of the document and cause the words on the verso of the page to be visible on the analysed side. This effect is known as bleed-through. Nowadays, several methods have been developed to improve the quality of the images [7; 27].

There are many historical handwritten documents containing information that can be used for several studies and projects. The Document Image Analysis and Recognition community is interested in preserving these documents and extracting all the valuable information from them. There are two ways to extract the information: transcribing documents (word by word) and word-spotting. Handwritten word-spotting refers to the problem of detecting specific keywords in handwritten document images. A model is provided as a query, and the goal is to retrieve all the occurrences in a word image database (or regions of a document collection) that are close to the query in terms of a specific dissimilarity measure. However, one of the problems with these documents is access: the majority of the material is only physically accessible, and only a few authorized people can consult it.

Nowadays, thousands of digitized documents remain unused because they are not indexed. There are several levels of indexation in terms of meta-data, from the name of the author and a brief history of the book to a full-text transcription. Nevertheless, there is no single technique that allows us to index the documents correctly. During the last decades these techniques have experienced great improvements, and the error rates have dropped to a level that makes commercial applications feasible. Traditional optical character recognition (OCR) systems fail to process handwritten documents and are only suitable for modern printed documents. Off-line handwritten text recognition systems, which take an image of a piece of handwriting as input, only work properly with restricted vocabularies.


Handwritten word-spotting is the pattern classification task that consists in detecting words in handwritten document images. In this dissertation, we are concerned with the detection of several words in our documents.

In documents where all pages are written by the same author (or few authors), the images of multiple instances of the same word are likely to look similar. Word-spotting [20] treats a collection of documents as a collection of words. The first step consists in segmenting the document into word images; then, pairwise "distances" between word images are calculated and used to cluster all words with similar features. Ideally, each cluster contains all the samples of the same word.

There are two types of word-spotting approaches, depending on how the input is specified: query-by-string and query-by-example. In query-by-string, character models are trained in advance and, at execution time, they are combined to form words and the probability of each word is evaluated. In query-by-example, the input is an image of the word to search for, and the output is a set of the most representative images of the query word.

Problem statement

This work addresses the problem of handwritten word spotting in historical manuscripts. While most existing approaches are based on contextual methods like Hidden Markov Models (HMM) or Dynamic Time Warping (DTW), which use the sequential information of graphemes in a word, we propose a holistic approach using shape matching techniques. We propose two approaches. The first one uses a pixel-based descriptor tolerant to distortions. The second one is inspired by characteristic Loci features and allows aggregating pseudo-structural information in the descriptor. A handwritten collection of documents, which will be explained in more detail in the following sections, is used in this work.

Objectives

As stated above, this work aims to develop shape descriptors for handwritten word spotting. In particular, the objectives are:

• To investigate different shape descriptors that allow describing handwritten words with invariance to variations in writer, acquisition conditions, etc. We aim to focus on pixel-based descriptors and structural ones.

• Based on the above descriptors, to define clustering criteria allowing us to build indexation structures for word spotting purposes.

• To define an experimental framework and construct a ground truth from a collection of a real application (Barcelona marriage records).

Outline of the approach

In our work we have used query-by-example. It consists in matching an input image with one or multiple query images to determine the distance that might indicate a correspondence.

A spotting architecture consists of four tasks. First, a pre-processing step is done. Second, a fast rejection of the segmented words is done. Third, a normalization step is done. And fourth, a classification of the training set is done. The last step of our work is a retrieval step. Figure 1 outlines the architecture of our approach.

The quality of old documents can be affected by degradations. We perform a pre-processing step in order to obtain better results (Fig. 1). The first task consists in improving the quality of the document. For this purpose we binarize the document. Then, we remove margins of the document that are likely to interfere with subsequent operations. The page is then segmented into lines using projection analysis techniques [18]. Once the lines are segmented, word segmentation is done using a similar technique. The projection function is smoothed with an Anisotropic Gaussian Filter [14].

In our approach, for each considered word, we extract the bounding box and perform a fast rejection of the words that are very big or very small with regard to the mean of all the words of the document. In addition, bounding boxes with few pixels of information are ruled out. This allows drastically reducing the search space.

The next step consists in word normalization. It is necessary to extract the word and discard the pixels that do not belong to it. The normalization is done using the Anisotropic Gaussian Filter and the upper and lower contours of the word.

We have developed two approaches for the learning step. The first approach consists of a hierarchical process. It uses two different features organized in layers. In the first layer, we use basic features, like aspect ratio, height and width. In the second layer, we use the Blurred Shape Model (BSM) features. In the first layer the words are clustered according to their basic features, and then each cluster of the first layer is clustered with BSM features. The second approach employs characteristic Loci features.

The rest of this dissertation is organized as follows. Section 2 describes the corpus of this work. Section 3 discusses related work in this field. Section 4 presents different methods to evaluate a clustering process. Sections 5 to 8 explain the different methods proposed in this work. Section 9 shows the experimental results. The last sections present the conclusions of this work and future work.

2. The corpus of Barcelona marriage records. A social science perspective.

Between 1451 and 1905, a centralised register called Llibres d'Esposalles was kept. It recorded all the marriages and the fees imposed on them according to their social class. It is conserved at the Archives of the Barcelona Cathedral and comprises 244 books with information on approximately 550,000 marriages celebrated in over 250 parishes. Each book contains the marriages of two years, and each book was written by a different writer.

All the books of the collection consist of two parts. The first one is an index with all the husbands' surnames that appear in the volume and the page number where each one appears (Fig. 2a). The indexes of the books have the same structure: several columns, where each column is composed of a surname, several dots and the number of the page where this surname appears. The second part contains the marriage licences (Fig. 2b and 2c). This work has been developed using the second part of the document, the marriage licences.

Marriage licences have a structured layout (Fig. 3). The document is divided into three parts. In the left part we can find the husband's surname.



Figure 1: General process

Each surname is next to the record of the wedding. In the right part we can find the tax of the wedding. The central part corresponds to the record. In general, it is a quite regular structure that can be represented by a syntactic model. In this work, the query words used for word spotting are searched for in the central part, so first a layout segmentation step has to be done.


Figure 2: Llibre d'esposalles (Archive of Barcelona Cathedral, ACB): (a) 1617: index of volume 69; (b) 1729: volume 127; (c) 1860: volume 200

Other good characteristics of the documents are that the text is hardly cursive, the documents are very clean, the words are connected, and the grammatical structure is similar in all the registers: day of the wedding, name and occupation of the husband, the husband's parents, name of the wife, the wife's parents and the place where the wedding took place.

Figure 3: Structure of the documents: (a) husband's surname, (b) wedding record, (c) tax of the wedding

Our work is part of a larger project led by the Center for Demographic Studies (CED), Department of Geography, Universitat Autònoma de Barcelona (UAB). This project brings together researchers from the social sciences and computer science. From the perspective of scholars in the social sciences, this collection is a rich source of information to construct genealogies of people over centuries. Thus, the first aim is to construct a database of marriages. Doing this by hand would be an artisanal and time-consuming task, so word spotting techniques can help in pseudo-automatizing this process.

Ground truth

We have a ground truth composed of 500 documents from volume 69 of the Cathedral of Barcelona collection. We also have a second ground truth, which is a subset of the first one: it is composed of 30 documents taken from the first ground truth. These documents have been labelled manually, using an application program designed to label documents. This program allows us to select an area of the image and label it with a word. Each word (and the corner points of the selected area) is automatically saved in an XML file.



There is a high number of different words in the selected ground truth documents, and the literal transcription (word by word) is an expensive process; consequently, only a few words are labelled. The selected words are shown in figure 4. These words were selected because of their high frequency of appearance in the selected documents. The ground truth has all the labelled words (20 classes). The subset of the ground truth is composed of the first 10 labelled words (10 classes).

Figure 4: An example of each selected word: (a) Barna, (b) de Barna, (c) en Barna, (d) de, (e) pages, (f) reberè, (g) dia, (h) dit, (i) fill, (j) filla, (k) viuda, (l) ab, (m) habitant, (n) donsella, (o) dilluns, (p) dimarts, (q) dimecres, (r) dijous, (s) divendres, (t) dissapte, (u) viudo

For the experiments we have employed a cross-validation technique [11]. Suppose that we have a data set Z of size N x n, containing n-dimensional feature vectors describing N objects. We choose an integer K (preferably a factor of N) and randomly divide Z into K subsets of size N/K. Then we use one subset to test the performance of the classifier D trained on the union of the remaining K - 1 subsets. This procedure is repeated K times, choosing a different part for testing each time. To get the final result we average the K estimates. We have chosen K = 5 in our experiments.
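A minimal sketch of this K-fold procedure, assuming scikit-learn is available; the feature matrix `Z`, labels `y` and the 1-NN classifier standing in for D are placeholders, not the authors' exact implementation:

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsClassifier

def cross_validate(Z, y, K=5):
    """Average accuracy over K folds (K = 5 in our experiments)."""
    kf = KFold(n_splits=K, shuffle=True, random_state=0)
    scores = []
    for train_idx, test_idx in kf.split(Z):
        clf = KNeighborsClassifier(n_neighbors=1)            # D: the classifier under evaluation
        clf.fit(Z[train_idx], y[train_idx])                   # train on the union of K-1 subsets
        scores.append(clf.score(Z[test_idx], y[test_idx]))    # test on the held-out subset
    return np.mean(scores)                                    # final result: average of the K estimates
```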

3. Related work

Word-spotting was originally formulated to detect words in speech messages [10]. Later it was used in text documents [20] for matching and indexing handwritten words across several documents. In this context, it was first proposed by Manmatha [13], and later a number of different word matching algorithms were investigated. This technique needs word segmentation, and many word segmentation approaches can be found in the literature. Relevant examples are a scale-space word segmentation process proposed in [14] and a neural network word segmentation algorithm presented in [26].

Rath [21] proposed an automatic retrieval system for historical handwritten documents using relevance models. The method describes two statistical models for retrieval in large collections of handwritten manuscripts given a text query. Both use a set of transcribed page images to learn a joint probability distribution between features computed from word images and their transcriptions. The models can then be used to retrieve unlabelled images of handwritten documents given a text query.

Handwriting recognition of large vocabularies in historical documents is still a very challenging task. Nagy in [15] discusses the papers published in PAMI on document analysis during the last 20 years.

A word can be represented with different kinds of features. A feature is a measurement of the object to study, and allows reducing all the characteristics of the image to a few that preserve the main information in a more manageable size. There are three types of features: quantitative (numeric) features, qualitative (symbolic) features and structured features. Quantitative features can be discrete values (e.g. weight, the number of computers) or interval values (e.g. the duration of an event). Qualitative features can be nominal or unordered (e.g. colour) or ordinal (e.g. sound intensity: "quiet" or "loud"). Structured features represent relational and/or hierarchical attributes among a set of primitive patterns (e.g. a parent node can be a generalization of children labelled "cars", "trucks" and "motorbikes") [28].

There are different ways to match words, depending on the kind of features used. For example, words can be matched directly by computing a distance such as XOR, Euclidean Distance Mapping (EDM), Sum of Square Differences (SSD), SLH, Hausdorff distance, etc. The problem of these methods is that they are very sensitive to spatial variation.

One of the most widely used feature comparison algorithms in handwriting recognition is Dynamic Time Warping (DTW) [19; 9]. DTW is an algorithm for measuring similarity between two sequences which may vary in time or speed. It has been widely used in the speech processing, bio-informatics and on-line handwriting communities to match 1-D signals. Even though image features are in general two-dimensional, it is possible to recast them in one dimension, although the association between column features of the images may be lost. The DTW algorithm tries to minimize the variations between the feature vectors. In general, it is a method that allows a computer to find an optimal match between two given sequences.
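As an illustration, a minimal dynamic-programming sketch of DTW between two 1-D sequences (not the authors' implementation; column features of word images would be the typical input in this context):

```python
import numpy as np

def dtw_distance(a, b):
    """Classic DTW between two 1-D sequences a and b, with absolute-difference local cost."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])           # local distance between samples
            D[i, j] = cost + min(D[i - 1, j],         # insertion
                                 D[i, j - 1],         # deletion
                                 D[i - 1, j - 1])     # match
    return D[n, m]

# Two sequences that differ in speed still align with a small cumulative cost.
print(dtw_distance([0, 1, 2, 3, 2, 1], [0, 1, 1, 2, 3, 3, 2, 1]))
```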

In holistic approaches the word image is not segmented into smaller parts, but is considered as a whole shape [3]. Thus, the recognition is usually performed by a shape matching algorithm in terms of the features computed at some key points of interest. A comparative study of a number of interest point detectors is presented in [25]. For example, corners can be detected with the Harris detector [23], but a drawback of this detector is its sensitivity to noise.

Cohesive Elastic Matching [12] is based on zoning, and it can be applied to the whole text image; it is not necessary to segment the words of the text. It is a good method to compare zones of interest (ZOI), and the algorithm is independent of the ZOI extraction method.

Hidden Markov Models (HMM) are sometimes used in word-spotting [1] to match words in documents, but they are usually applied to documents with a reduced vocabulary and need a considerable learning stage.

4. Choosing the number of clusters

There are different methods in the literature to choose the number of clusters. They can be classified into two big groups depending on how the number of clusters is chosen. The first one is a manual method: the number of clusters is chosen based on experimentation. In this case it is the experience of the user that allows choosing the best number of clusters.

The second group comprises automatic, or pseudo-automatic, methods to choose the best number of clusters. Algorithms of this group use an index, or several indices, to obtain a measure that allows choosing the best number of clusters. There are several validity indices. They can be classified into two groups: external and internal validity indices.

External validity indices are used when true class labels are known. Some examples of external validity indices are: the Rand index, which measures the similarity between two data clusterings; the Adjusted Rand index, which is the corrected-for-chance version of the Rand index; and the Mirkin index, which considers only object pairs in different clusters for both partitions and finds the dissimilarity.

Internal validity indices are used when true class labels are unknown. Some examples of internal validity indices are: the Silhouette index [24], which computes the average distance of a point from the other points of the cluster to which the point is assigned; the Davies-Bouldin index [2], which is a function of the ratio of the sum of within-cluster scatter to between-cluster separation; and the Calinski-Harabasz index, which computes the sum of the squares of the distances between the cluster centroids and the mean of all the points in all classes.

There is a large number of different validity indices to choose the best number of clusters. It is therefore possible to choose only one of them to select the number of clusters, or several of them in order to validate the different indices against each other.

5. Word-spotting approach

The objective of this work is word spotting. Thus, given a query word image, we intend to locate instances of the same word class in the documents to be indexed. Word-spotting is used in many works to search for words in images. In this work, inspired by some approaches in the literature, words are considered as shapes, and spotting is achieved through shape dissimilarity functions.

Word spotting needs to define a descriptor, or several descriptors, that represents our observations and allows us to group and organize their features. Once the observations are grouped and organized, an indexation structure is needed. This structure organizes and groups all the observations of our experiment, and it is later used to find the words in the documents that are similar to a given query.

A general spotting architecture consists of two major modules, namely the learning stage and the retrieval one. Learning consists in clustering similar features in the search space (target images) to construct the indexation structure. Retrieval consists in finding the best approximation of the observations of the classification set with observations of the training set. In this work we propose two approaches (Fig. 5).

The first approach is oriented to pixel-based descriptors. It uses two different features as descriptors of the observations: basic features and BSM features. The indexation structure is constructed using hierarchical clustering. It consists in segmenting the words from the images and organizing them in several clusters, using two descriptors based on the distribution of the pixels of the image. In the first level of the hierarchical cluster structure, basic shape features are considered. Afterwards, the clusters are refined in the second level using the Blurred Shape Model (BSM) features [5].



Figure 5: We present two approaches to word spotting. Both have the same first steps.

This organization of the search space allows, when a query word is searched, first to quickly reject a large number of non-similar words (first level) and then to perform the intensive search with more discriminant features (BSM) in the second level on a reduced number of target words.

The second approach is oriented to pseudo-structural features. The descriptor used in this approach is the characteristic Loci feature, and the indexation structure is constructed using a table, where each column corresponds to an observation of the documents and the rows are the features of the words. Each word, or character, is composed of several features, and it is not significant where they appear inside the image. This approach uses features based on characteristic Loci [3; 4; 8]. Given a word image, a feature vector based on Loci characteristics is computed at some characteristic points. Some approaches in the literature have used the background pixels of the image, other approaches have used the foreground pixels, and some have even used the contour or the skeleton of the images. Characteristic Loci encode the frequency of intersection counts for a given characteristic point along different direction paths starting from this point. Loci vectors extracted from the words of the image database are stored in a hashing structure. Afterwards, word spotting is performed by a voting process after the Loci vectors from the query word are indexed in the hashing table.

Let us describe the different steps of the two developed approaches. Both approaches have the same preliminary steps. They consist of a pre-processing step, where the documents are segmented and the words are extracted from them; a fast rejection, where bad words are discarded; and noise removal, where the noise of the image is removed and the bounding box is fitted to the contour of the image. These preliminary steps are explained in section 6. Section 7 explains the first approach developed and section 8 the second one.



6. Preliminary steps

6.1 Pre-processing

Modelling the human cognitive process to obtain a similar computational methodology for handwritten word segmentation is quite difficult due to the following characteristics. The handwriting style is usually cursive or discrete. In the case of discrete handwriting, characters are joined to form words, but, unlike machine-printed text, handwritten text is not uniformly spaced. The size of the characters along the words of the document varies (this is a scale problem). Ascenders and descenders are regularly connected and words present different orientations. Documents are often degraded due to ageing or other reasons. Another reason is the presence of the show-through or bleed-through effects explained above.

Some of the main problems of our historical documents are that they have been written by several authors (the writer changes every two years), they are noisy (stains, shadows, bleed-through, etc.), they contain margins, etc.

The documents to be used in our experiments present some of the drawbacks mentioned above, like connected ascenders and descenders, different character sizes, etc. But a good characteristic of these documents is that they are well structured. As we have commented in section 2, each document has three parts, and the objective is to work with the marriage licenses.

The steps of the pre-processing are: binarization of the documents, page segmentation, layout segmentation, segmentation of the lines and, as the last step, word segmentation (Fig. 6). In the following subsections we describe the details of these steps.

Figure 6: Pre-processing steps.



6.1.1 Binarization

The binarization (Fig. 6(b)) of an image is the process that converts a digital image into a black-and-white image while preserving its main properties. In Document Image Analysis the objective is to classify each pixel as background or relevant information.

The simplest way to binarize an image is to choose a threshold value and to classify all pixels with values above this threshold as white and all other pixels as black (global image threshold). The problem then is how to select the threshold. In many cases, finding a threshold valid for the entire image is very difficult, and sometimes even impossible. Therefore, adaptive image binarization is needed: an optimal threshold is chosen for each image area (local image threshold).

In our work we have applied two different methods of binarization. The Otsu method [17] is a global method that chooses the threshold that minimizes the intraclass variance of the thresholded values. It has the advantage of not requiring input parameters, but it assumes that histograms are bimodal and illumination is uniform. Niblack's algorithm [16] is a local thresholding method. This algorithm calculates a threshold value for each pixel based on the mean and standard deviation of all the pixels in a local neighbourhood. The critical point of this algorithm is the size of the neighbourhood area. The main disadvantage of this approach is the computational time, which depends strongly on the size of the neighbourhood window. The size should be small enough to preserve local details and large enough to suppress noise.

The method selected in our work is the Otsu method because the documents of our corpus have good quality and present a uniform background. The Otsu method works better with documents of good quality (Fig. 7a). Niblack usually works better on historical documents when they present a high level of degradation (shadows, bleed-through, stains, etc.), but in such cases a perfect binarization is difficult to achieve, so the algorithm cannot avoid the presence of noise in the resulting image (Fig. 7b).
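A minimal sketch of both binarization methods, assuming scikit-image is available; the file name and the Niblack parameters are hypothetical, and the timings in Fig. 7 presumably come from the authors' own implementation:

```python
from skimage import io, filters

# Load a document page as a grey-level image (hypothetical file name).
page = io.imread("marriage_page.png", as_gray=True)

# Global thresholding (Otsu): a single threshold for the whole page.
ink_otsu = page <= filters.threshold_otsu(page)          # True where there is ink (dark pixels)

# Local thresholding (Niblack): one threshold per pixel, computed from the mean and
# standard deviation of a local window; the window size is the critical parameter.
ink_niblack = page <= filters.threshold_niblack(page, window_size=25, k=0.2)
```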

Figure 7: Methods of binarization applied to a piece of a sheet of the marriage database: (a) Otsu method (0.078 sec. of computing time); (b) Niblack method (204.099 sec. of computing time).



6.1.2 Page segmentation

The handwritten manuscripts have been subjected to degradation during all the time they have been used and stored, but the digitization process also adds degradations to the document, like the warping effect in the margins. The purpose of this step is to remove some of these margins and lines so that they will not interfere with later stages (Fig. 6(c)).

The method proposed in this work is based on the blob properties of the image. We know that the margins are located at the borders of the document; therefore, in these parts, we extract the properties of the blobs of the image after it has been binarized. The biggest blobs are the margins, and they are removed from the document.
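An illustrative sketch of this blob-based margin removal, assuming scikit-image; the border width and area threshold are hypothetical parameters, not values from the original work:

```python
from skimage import measure

def remove_margins(binary_ink, border=50, min_area=5000):
    """Remove large connected components (blobs) that touch a border strip of the page."""
    labels = measure.label(binary_ink)                   # connected components of the ink mask
    cleaned = binary_ink.copy()
    h, w = binary_ink.shape
    for region in measure.regionprops(labels):
        r0, c0, r1, c1 = region.bbox
        touches_border = r0 < border or c0 < border or r1 > h - border or c1 > w - border
        if touches_border and region.area > min_area:    # big blobs near the border: margins/lines
            cleaned[labels == region.label] = False
    return cleaned
```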

6.1.3 Layout segmentation

The documents of our case study have a similar structure (explained in section 2). The word spotting in this work is centred on the central block of text of the document. Projection profile techniques [14] have been widely used in line and word segmentation for machine-printed documents. The idea is to obtain a 1D function of the pixel values by projecting the binary image onto the horizontal axis. The distinct local peaks in the profile correspond to the white space between the columns and the distinct local minima correspond to the text.

Before segmenting the lines (Fig. 6(d)), it is necessary to extract the central block of text of each page of the documents. The aim is to delete the zones of the page that can interfere with the line segmentation. A morphological dilation with a vertical structuring element is applied to the input document, and then it is smoothed with a Gaussian filter to discard false local minima and reduce sensitivity to noise. The local minima are obtained by setting the derivative of the projection profile to zero.
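A minimal sketch of the projection-profile idea, assuming numpy and scipy; the smoothing sigma is a hypothetical parameter:

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d
from scipy.signal import argrelextrema

def column_profile(binary_ink):
    """Projection of background (white) pixels onto the horizontal axis, smoothed.
    Peaks correspond to white space between columns, minima to text (cf. the text above)."""
    background = ~binary_ink                              # binary_ink: True where there is ink
    profile = background.sum(axis=0).astype(float)        # one value per image column
    return gaussian_filter1d(profile, sigma=15)

def column_separators(binary_ink):
    """Candidate column separators: local maxima of the smoothed background profile."""
    smoothed = column_profile(binary_ink)
    return argrelextrema(smoothed, np.greater)[0]
```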

6.1.4 Line segmentation

The documents used in this work contain lines which are approximately straight and close to horizontal. The projection profile techniques used before are also used in this step. In this case the projection is done onto the vertical axis.

Lines are segmented in the same way as in the previous step (Fig. 6(e)). The central block is dilated with a horizontal structuring element and smoothed with a Gaussian filter, and a horizontal projection is computed. Although we apply the smoothing function to discard false lines, we also check the size of the lines to discard possible false ones. The rejection is done by checking the height of the lines; small lines are discarded.

6.1.5 Word segmentation

The segmented lines obtained in the last process are examined to extract the words of the document (Fig. 6(f)). A word image is composed of discrete characters, connected characters, or a combination of both. The idea is to merge all these components into a single entity, which is a word. This may be achieved by forming a blob-like representation of the image. A blob is considered as a connected region in space. Our approach is based on the Laplacian of Gaussian (LoG) operator for creating a multi-scale representation for blob detection [14]. The idea is to combine second-order partial Gaussian derivatives along the two orientations at different scales to merge the components of a word.

An anisotropic Gaussian filter (Fig. 8) is defined as:

G(x, y; σx, σy) = (1 / (2πσxσy)) e^(−(x²/σx² + y²/σy²))   (1)

From the filter (1), the Laplacian of Gaussian operator is based on the addition of the second derivatives in x and y as follows:

L(x, y; σx, σy) = Gxx(x, y; σx, σy) + Gyy(x, y; σx, σy)   (2)

A scale space representation of the line images is constructed by convolving the image with L from (2). Consider a two-dimensional image f(x, y); then, the corresponding output image is

I(x, y; σx, σy) = L(x, y; σx, σy) ∗ f(x, y)   (3)

As we can see in figure 8, the output is a grey-scale image, where the background has a middle grey level and the words are light grey. It is very difficult to determine a threshold for selecting the pixels that correspond to words. We have observed that most words have a black contour. Our improvement allows, using this mask, splitting each word image into three areas: background, word and contours of the word. The mask converts the thin black contours into thick contours; the rest of the image is considered background. This thickening of the contours causes letters that are close together to join. The improvement thus allows merging the characters of a word, making it easier to split different words. The words, which are extracted from a scale space representation, are blob-like, but, to make sure that the blob merges all the parts of the word, we apply a closing operator to each word.
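An illustrative sketch of the anisotropic LoG filtering of equations (1)-(3), assuming scipy; the sigma values and response threshold are hypothetical (they would be tuned to line height and inter-character spacing), and the three-way background/word/contour split described above is only roughly approximated here:

```python
import numpy as np
from scipy.ndimage import gaussian_filter, binary_closing

def anisotropic_log(line_image, sigma_x=12.0, sigma_y=3.0):
    """I = L * f with L = Gxx + Gyy, using different scales along x and y (eqs. 1-3)."""
    f = line_image.astype(float)
    gxx = gaussian_filter(f, sigma=(sigma_y, sigma_x), order=(0, 2))  # 2nd derivative along x
    gyy = gaussian_filter(f, sigma=(sigma_y, sigma_x), order=(2, 0))  # 2nd derivative along y
    return gxx + gyy

def word_blobs(ink_mask, response_threshold=0.05):
    """Very rough word-blob extraction: keep pixels with a strong filter response
    (word bodies and contours) and close the gaps between neighbouring characters."""
    response = anisotropic_log(ink_mask)
    blobs = np.abs(response) > response_threshold
    return binary_closing(blobs, structure=np.ones((3, 15)))
```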

6.2 Fast rejection

The previous process produces one blob for each word in the document, but sometimes these components do not represent words, because they are stains, lines or small parts of a word that have not been merged with the original word. The selection of the suitable words is done in two steps. First, the blobs which are very small with regard to the height and width of the segmented line are rejected. From the remaining blobs, we keep those with more pixels than an experimentally set threshold.
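A sketch of this two-step rejection; the relative-size and ink-count thresholds are hypothetical, standing in for the experimentally set values mentioned above:

```python
def fast_rejection(blobs, line_height, line_width, min_rel_size=0.2, min_ink_pixels=150):
    """Keep only blobs that are plausibly words (cf. the two-step selection above).
    Each blob is assumed to be a dict with 'height', 'width' and 'ink' (pixel count)."""
    kept = []
    for blob in blobs:
        too_small = (blob["height"] < min_rel_size * line_height or
                     blob["width"] < min_rel_size * line_width)
        if not too_small and blob["ink"] >= min_ink_pixels:
            kept.append(blob)
    return kept
```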

6.3 Noise removal

The images remaining after the fast rejection step are subjected to a normalization process to reduce their variability. Our proposal allows cleaning the image and fitting the bounding box to the word (Fig. 9).

The first step consists in binarizing the word image (Fig. 9b). Then, we apply the anisotropic Gaussian filter explained before to merge the different parts of the same word (Fig. 9c). Once applied, the image is composed of several blobs, as we can see in figure 9d.



Figure 8: Anisotropic Gaussian Filter

The next step is to delete the blobs that do not belong to the word. The biggest blob is chosen and its contour is computed (Fig. 9e). The contour is the frontier that separates the pixels of the word from the background. The last step consists in projecting vertically and horizontally to fit the bounding box.

Figure 9: Normalization process: (a) original image, (b) binarized image, (c) anisotropic Gaussian filter, (d) biggest blob, (e) blob contour, (f) final image.



7. Pixel-based descriptors organized in a hierarchical structure

The first approach of this work is based on two pixel-based descriptors (basic features and BSM features), organized using a hierarchical structure of clusters. The objective is to build several layers of clusters using diverse features. The top layer is based on basic features. The bottom layer consists of features based on pixel distribution, in particular BSM.

This approach groups the words into clusters with similar features. When we move down a layer, we only use the observations of the chosen cluster to cluster the words with the new kind of features (Fig. 10). In each layer we reduce the number of observations for which features of the new layer are computed, so the classification process is faster.

7.1 Feature extraction

In pattern recognition and image processing, feature extraction is a special form of dimensionality reduction. The objective is to transform the input data into a reduced representation set of features (a feature vector). The observations of the experiments can be represented in different ways using different features; the objective is to select the features that best describe the image.

The marriage licences corpus of the Cathedral of Barcelona is composed of 244 volumes, too much information to be indexed directly, and the computational cost increases with the number of documents in the corpus.

Retrieval time can be reduced using a hierarchical indexation structure. The features of our corpus are divided into several groups (clusters) in each layer, and then each group is divided into other groups using other features (Fig. 10).

Figure 10: The structure in layers of the feature extraction.

In this work two types of features have been used: basic features and BSM features. The first layer uses basic features to do a rough separation of the word classes. In the second layer we use features based on pixel distribution: each cluster of the first layer is split using BSM features.



Basic features

Basic features are based on shape features of the images [28]. These features are extracted from the contour and the region of the shapes.

For each normalized word, a vector of basic features is obtained. The features used in this work are: height, width, aspect ratio, centroid, filled area, perimeter, eccentricity and Euler number.

The objective of this first layer is to separate all the words of our corpus into groups with similar basic features.
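A sketch of how these basic features could be computed with scikit-image's region properties; this is an assumption for illustration, not necessarily the original implementation:

```python
from skimage import measure

def basic_features(word_mask):
    """Basic shape features of a binarized word image (True where there is ink)."""
    region = max(measure.regionprops(measure.label(word_mask)), key=lambda r: r.area)
    r0, c0, r1, c1 = region.bbox
    height, width = r1 - r0, c1 - c0
    return [height, width,
            width / height,            # aspect ratio
            *region.centroid,          # centroid (row, col)
            region.filled_area,        # filled area
            region.perimeter,
            region.eccentricity,
            region.euler_number]
```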

Blurred Shape Model (BSM) features

The words are described by a probability density function of the Blurred Shape Model (BSM) [5] that encodes the probability of pixel densities of image regions: the image is divided into a grid of n x n equal-sized subregions, and each bin receives votes from the shape points in it and also from the shape points in the neighbouring bins. Thus, each shape point contributes to a density measure of its bin and its neighbouring ones. The output descriptor is a vector histogram where each position corresponds to the density in the context of the sub-region (Fig. 11).

The objective of this second layer is to extract features based on pixel distributions. Once the words have been clustered according to their size, this layer groups the words with similar pixel distributions into different clusters.
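An approximate sketch of the BSM descriptor as described above; the inverse-distance weighting of the votes is an assumption for illustration and not necessarily the exact weighting of [5]:

```python
import numpy as np

def bsm_descriptor(word_mask, n=8):
    """Blurred Shape Model: an n x n histogram of shape-point densities."""
    h, w = word_mask.shape
    centers_y = (np.arange(n) + 0.5) * h / n
    centers_x = (np.arange(n) + 0.5) * w / n
    hist = np.zeros((n, n))
    ys, xs = np.nonzero(word_mask)                    # shape (ink) points
    for y, x in zip(ys, xs):
        i, j = min(int(y * n / h), n - 1), min(int(x * n / w), n - 1)
        for di in (-1, 0, 1):                         # vote in the bin and its neighbours
            for dj in (-1, 0, 1):
                ii, jj = i + di, j + dj
                if 0 <= ii < n and 0 <= jj < n:
                    d = np.hypot(y - centers_y[ii], x - centers_x[jj])
                    hist[ii, jj] += 1.0 / (1.0 + d)   # closer bin centroids get larger votes
    total = hist.sum()
    return (hist / total).ravel() if total > 0 else hist.ravel()
```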

Figure 11: Blurred Shape Model (BSM): (a) original image, (b) shape pixel distance estimation with respect to neighbouring centroids, (c) 16-region blurred shape. Extracted from [6].

7.2 Learning and retrieval

Organizing features

The goal of a learning stage is to acquire new knowledge, behaviours, skills, values, preferences or understanding, and it may involve synthesizing different types of information. The learning process of this work consists in extracting features from the words, calculating the distance between them, and grouping them with respect to that distance (the clustering process).

Clustering processes have a drawback: knowing the number of clusters with which the observations of our experiments are best grouped. This approach has a hierarchical structure of clusters, and each layer of the structure has to be clustered using one of the methods explained in section 4. The first layer uses the direct method to choose the number of clusters.
17


layer uses an automatic method. It uses the Davies-Bould<strong>in</strong>g <strong>in</strong>dex to choose the best number of<br />

clusters. Both layers uses k-means as cluster<strong>in</strong>g algorithm.<br />

The first layer uses direct method because we have obta<strong>in</strong>ed several hierarchical structures,<br />

more concretely we have compute from 3 to 30 clusters. Second layer uses an <strong>in</strong>dex to choose<br />

the best number of clusters, because the number of experiments is exponential as the number of<br />

clusters.<br />
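A minimal sketch of this cluster-number selection, assuming scikit-learn (not necessarily the toolkit of the original experiments), could look as follows.

```python
# Run k-means for a range of k and keep the k with the lowest (best)
# Davies-Bouldin index, as done for the second layer.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score

def cluster_with_davies_bouldin(features: np.ndarray, k_range=range(3, 31)):
    best_k, best_score, best_labels = None, np.inf, None
    for k in k_range:
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(features)
        score = davies_bouldin_score(features, labels)   # lower is better
        if score < best_score:
            best_k, best_score, best_labels = k, score, labels
    return best_k, best_labels
```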

Searching words

The process of searching a word consists in, given a query word image, classifying it with respect to the different clusters. Once the most similar cluster is selected, the method "spots" the word instances in the document images. Classification is a procedure in which individual items are placed into groups based on quantitative information about one or more characteristics (features) inherent in the items, and on a training set (from the learning process) of previously labelled items.

To classify words we have used a k-NN approach: an algorithm that assigns each sample (the feature vector extracted from the word) to one of the groups created in the learning process, using the nearest-neighbour method.

The classification process is used in both levels of this work, and k-NN is used in both levels to classify our observations into the clusters established previously.
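A minimal sketch of this nearest-neighbour assignment, again assuming scikit-learn, is shown below.

```python
# Assign each query feature vector to the cluster of its closest training
# sample (nearest-neighbour rule, k = 1).
from sklearn.neighbors import KNeighborsClassifier

def assign_to_clusters(train_features, cluster_labels, query_features):
    knn = KNeighborsClassifier(n_neighbors=1)        # nearest-neighbour method
    knn.fit(train_features, cluster_labels)
    return knn.predict(query_features)               # cluster id per query word
```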

8. Pseudo-Structural descriptor organized in a hash structure

The second approach of this work is feature-oriented: it does not matter where a feature appears in the image. This approach uses an indexation table to organize the observations of the experiments. It is characteristic-point centred, i.e. the indexation terms are individual features, so words are detected through a voting process, whereas in the previous approach the feature vectors vary depending on the position of the word content.

The features used in this approach are invariant under translation of the word. There is no need to centre or left-justify all the observations of the same word to obtain good results.

8.1 Feature extraction

The characteristic Loci features were devised by Glucksman and applied to the classification of mixed-font alphabetic characters, as described in [8]. A characteristic Loci feature is composed of the number of intersections in the four directions (up, down, right and left). For each background pixel in a binary image, and for each direction, we count the number of intersections (an intersection is a black/white transition between two consecutive pixels). We then obtain a number composed of the intersection counts in the four directions (Fig. 12). The feature vector consists of the histogram of these intersection counts.

This work presents a new feature descriptor based on the characteristic Loci features. We have introduced three variations of the basic descriptor:

• We have added the two diagonal directions, as shown in figure 12. This gives more information to the feature and more robustness to the method.

• The number of intersections is quantized: we have bounded the number of intersections into intervals, and each direction has a different interval. This bounding makes the feature more robust.

• Two modes are implemented to compute the feature vector, using either background or foreground pixels as reference.

To obtain the number of intersections for each direction, a thinning operator is previously applied to the image. Thinning extracts the skeleton of the image, consisting of lines one pixel wide.
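As a small illustration of this step, assuming scikit-image (an assumption of the sketch, not the original toolkit):

```python
# Reduce the binary word image to one-pixel-wide strokes before counting
# intersections.
from skimage.morphology import skeletonize

def thin_word(word_img):
    return skeletonize(word_img > 0)   # boolean skeleton, 1-pixel-wide lines
```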

Figure 12: Characteristic Loci feature of a single point of the word "page".

The feature vector is computed by assigning a number to each background (or foreground) pixel, as shown in Fig. 12. The features are computed according to the number of intersections counted from each background pixel of the image in the rightward, upward, leftward and downward directions. In previous works, the characteristic Loci method has been applied to digit and isolated letter recognition. In this work, to reduce the dimension of the feature space, the number of intersections has been quantized to 3 values (0, 1 and 2). By limiting the number of possible values we reduce the number of combinations. The length of the feature vector grows exponentially with the number of possible values: for example, with 3 possible values and 8 directions we obtain 3^8 = 6561 combinations, whereas with 4 possible values we would have 4^8 = 65536. The computational cost (and time) increases in the same way.

The characteristic Loci feature was designed for digit and isolated letter recognition, and the number of intersections was bounded; the original approach uses the same interval in all directions. In this work we have also bounded and normalized the number of intersections, but for each direction we have defined a different interval for each value. The horizontal direction has a wider interval than the vertical one: in the original approach the digits or characters have a similar height and width, whereas in our case the width of the words is usually larger than the height, so the ranges of the intervals are chosen in accordance with the word dimensions. The intervals of the diagonal directions are a combination of the other two. Table 1 shows the intervals for each direction.

Table 1: Intervals for each direction in the characteristic Loci feature.

                 Value 0    Value 1    Value 2
    Vertical     {0}        [1, 2]     [3, +∞)
    Horizontal   {0}        [1, 4]     [5, +∞)
    Diagonal     {0}        [1, 3]     [4, +∞)

According to the above encoding, an eight-digit number in base 3 is obtained for each background pixel. For instance, the locus number of point P in Fig. 12 is (22111122)_3 = (6200)_10. The locus numbers therefore range from 0 to 6560 (3^8 - 1). This is done for all background pixels, so the dimension of the feature space becomes 6561. Each element of the feature vector represents the total number of background pixels whose locus number corresponds to that element.
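The sketch below illustrates this encoding for the background-pixel mode; the direction order and the Table 1 intervals are taken from the text, but the implementation details are an illustrative reading rather than the original code.

```python
# For every background pixel of a thinned binary word image, count stroke
# crossings along 8 directions, quantize them per direction with the Table 1
# intervals, and encode the result as an 8-digit base-3 locus number.
import numpy as np

# (dy, dx) for the 8 directions and the start of the "value 2" interval for
# each of them (vertical: 3, horizontal: 5, diagonals: 4, as in Table 1).
DIRECTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1),          # vertical, horizontal
              (-1, -1), (-1, 1), (1, -1), (1, 1)]        # diagonals
VALUE2_START = [3, 3, 5, 5, 4, 4, 4, 4]

def loci_histogram(skeleton: np.ndarray) -> np.ndarray:
    h, w = skeleton.shape
    hist = np.zeros(3 ** 8, dtype=np.int64)
    for y in range(h):
        for x in range(w):
            if skeleton[y, x]:                           # only background pixels
                continue
            locus = 0
            for d, (dy, dx) in enumerate(DIRECTIONS):
                crossings, prev, cy, cx = 0, False, y, x
                while 0 <= cy < h and 0 <= cx < w:       # walk to the image border
                    cur = bool(skeleton[cy, cx])
                    crossings += cur and not prev        # entering a stroke = one crossing
                    prev, cy, cx = cur, cy + dy, cx + dx
                value = 0 if crossings == 0 else (1 if crossings < VALUE2_START[d] else 2)
                locus = locus * 3 + value                # base-3 encoding
            hist[locus] += 1
    return hist
```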

8.2 Learning and retrieval

Organizing features

The retrieval process of this approach consists in organizing the features in a look-up table M (Fig. 13). The columns of M represent the words (w) of the documents used in this experiment. The rows correspond to all the possible combinations that can appear using characteristic Loci features (f). M(f, w) indicates that feature f is present in word w. In this work we have 8 directions and each one has three different values, so there are 3^8 (= 6561) possible combinations. The feature vector is the histogram over all the possible combinations.
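A minimal sketch of this organization, where M(f, w) stores how many times locus number f occurs in word w (assuming the histograms have already been computed, e.g. with the loci sketch above):

```python
# Stack one characteristic-Loci histogram per segmented word into the
# look-up table M: rows = 3^8 possible locus numbers, columns = words.
import numpy as np

def build_lookup_table(word_histograms):
    """word_histograms: list of length-6561 histograms, one per word."""
    M = np.column_stack(word_histograms)     # M[f, w] = count of feature f in word w
    return M
```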

Figure 13: Steps of the pseudo-structural descriptor organized in a hash structure.


Searching words

The classification process consists in searching for the best match between the query and all the words of M (Fig. 13). The chosen query is used to extract its feature vector, which is then matched against all the words of the ground truth. We have used the Euclidean distance for the matching. Once all the distances are computed, we select the words under a selected threshold.
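A minimal sketch of this matching step; the threshold is a free parameter (the experiments in section 9.4 vary it between 25 and 600):

```python
# Euclidean distance between the query histogram and every column of M,
# followed by thresholding to "spot" the matching words.
import numpy as np

def spot_word(M: np.ndarray, query_hist: np.ndarray, threshold: float):
    dists = np.linalg.norm(M - query_hist[:, None], axis=0)   # one distance per word
    return np.flatnonzero(dists < threshold)                  # indices of spotted words
```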

9. Experimental results

In order to validate the proposed methodology, we describe our performance evaluation protocol in terms of the data used, comparatives, metrics, and experiments.

9.1 Data

Our approach has been evaluated with a ground truth composed of 50 documents extracted from volume 69. In these documents, all the instances of 21 words are labelled. All the documents of the ground truth were written by the same author. The difficulties that we may face with these documents are: illumination changes, partial occlusions, warping effects in the document, ink bleed-through, etc. Some samples of the documents are shown in figure 14.

Figure 14: Samples of the documents of the ground truth.

We have also used a subset of the ground truth. It consists of the first 20 documents of our ground truth and contains the first 10 classes of the original one. This subset has been used in some experiments in order to facilitate the data analysis: with a reduced ground truth and fewer classes, some visual results are easier to understand.

9.2 Comparatives

The experiments of this work are separated into 3 groups: those that evaluate the segmentation process, the ones that evaluate the first approach and, finally, the experiments that evaluate the second approach.


The segmentation experiments evaluate the accuracy of the word segmentation. The segmented word and the labelled word are overlapped in order to check whether they are the same word. Different thresholds of overlapping percentage are used to evaluate the accuracy of the segmentation process.

The first approach has two types of experiments. The first one evaluates how the clustering process behaves; the second one evaluates the accuracy of the retrieval process:

• One experiment shows the relation between the chosen basic features using 2D plots.

• By means of visual results, we observe the distribution of the observations of our ground truth in the clusters.

• We evaluate the accuracy, homogeneity and completeness of the clustering using the V-measure (explained in section 9.3).

• The accuracy of the retrieval process is evaluated by means of a precision-recall curve.

The second approach is evaluated by means of precision-recall curves:

• Two experiments assess the accuracy of this approach using different characteristic pixels (background and foreground pixels).

• Both types of characteristic pixels are compared against each other.

9.3 Metrics

One drawback of the clustering process is the proper selection of the number of clusters. The learning process consists in grouping the observations into different clusters. The ideal solution is achieved when all the instances of the same word are in the same cluster, and each cluster contains instances of only one word. The results of the retrieval process depend on the accuracy of the clustering process.

The evaluation of the clustering process has been done using the V-measure [22]. The V-measure is an entropy-based measure which explicitly measures how successfully the criteria of homogeneity and completeness have been satisfied. It is computed as the "mean" of the distinct homogeneity and completeness scores, and it can be weighted to favour the contribution of homogeneity or completeness. A clustering result satisfies homogeneity if each of its clusters contains only data points which are members of a single class, and it satisfies completeness if all the data points that are members of a given class are elements of the same cluster.

The retrieval process is evaluated using precision-recall curves:

    recall = (number of relevant items retrieved) / (number of relevant items in the collection)        (4)

    precision = (number of relevant items retrieved) / (total number of items retrieved)                (5)
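As a sketch of how these measures could be computed, assuming scikit-learn (an assumption of this example):

```python
# V-measure, homogeneity and completeness for the clustering evaluation,
# and a precision-recall curve for the retrieval evaluation.
from sklearn.metrics import (homogeneity_score, completeness_score,
                             v_measure_score, precision_recall_curve)

def evaluate_clustering(true_classes, cluster_labels):
    return (homogeneity_score(true_classes, cluster_labels),
            completeness_score(true_classes, cluster_labels),
            v_measure_score(true_classes, cluster_labels))

def evaluate_retrieval(is_relevant, scores):
    # is_relevant: 1 if a retrieved word matches the query class, else 0;
    # scores: similarity (e.g. negated distance) used to rank the retrieved words.
    precision, recall, _ = precision_recall_curve(is_relevant, scores)
    return precision, recall
```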


9.4 Experiments

We now present the experiments done in this work. We show the results for the pre-processing step and for the two approaches developed.

Pre-processing

The pre-processing step has the objective of segmenting the words of the documents. The performance of the next steps depends on the results obtained from this stage. The segmentation process is evaluated in terms of the words found with respect to the ground truth.

We have used both the complete and the reduced ground truth in order to evaluate the segmentation process. After segmenting the words from the documents, we have matched these words with the labelled words of the ground truth. Each segmented word is compared with the words of the ground truth by observing the percentage of overlapping (Fig. 15). Words whose bounding boxes overlap by more than 40% are considered the same word.

Figure 15: Examples of a correct overlapping (left) and an incorrect overlapping (right).
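A minimal sketch of the overlap criterion; since the text does not spell out the exact definition of the overlapping percentage, it is read here as the intersection area divided by the ground-truth box area, which is an assumption of this example.

```python
# Bounding-box overlap check for matching segmented words against the
# labelled ground-truth words.
def overlap_ratio(seg_box, gt_box):
    """Boxes given as (x0, y0, x1, y1)."""
    ix0, iy0 = max(seg_box[0], gt_box[0]), max(seg_box[1], gt_box[1])
    ix1, iy1 = min(seg_box[2], gt_box[2]), min(seg_box[3], gt_box[3])
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
    gt_area = (gt_box[2] - gt_box[0]) * (gt_box[3] - gt_box[1])
    return inter / gt_area if gt_area else 0.0

def same_word(seg_box, gt_box, threshold=0.4):
    return overlap_ratio(seg_box, gt_box) > threshold
```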

Table 2 shows the results of applying our method to the different ground truths using different thresholds. In both cases the same behaviour is observed: with a small overlapping threshold the percentage of words found is high, but as the threshold increases the percentage decreases. We observe that, using the ground truth with 50 documents and 20 classes and a threshold of 0.1, the accuracy is over 100%: a word label may contain part of the next word, so when the two segmented words are compared, both match the same label.

We observe that the results stay stable until a threshold value of 40% is reached, and then the accuracy decreases. In order to obtain good results and reduce the number of errors explained above, we have used 40% as the threshold in our experiments.

Pixel-based descriptors organized in a hierarchical structure

The main problem with a clustering algorithm is choosing the number of clusters that groups the observations best. We have carried out some experiments to find the best number of clusters.

The first layer of our architecture is formed by clusters constructed in terms of basic features. The second layer uses BSM features. A key parameter in the BSM feature computation is the number of bins of the grid used to obtain the histogram. In the first experiment we evaluate the performance depending on the number of bins, measured in terms of the V-measure.


Table 2: Pre-processing results. The ground truth is composed of 50 documents and 20 classes and has 6718 labelled words. The subset of the ground truth is composed of 30 documents and 10 classes and has 3101 labelled words.

                 Subset (3101 words)        Ground truth (6718 words)
    Threshold    found       accuracy       found       accuracy
    0.1          3129        100.90%        6712        99.91%
    0.2          3026        97.58%         6540        97.35%
    0.3          2987        96.32%         6447        95.97%
    0.4          2937        94.71%         6351        94.54%
    0.5          2806        90.49%         6058        90.18%
    0.6          2407        77.62%         5052        75.20%
    0.7          1495        44.57%         2994        44.57%
    0.8          466         15.03%         959         14.28%
    0.9          35          1.13%          85          1.27%
    1.0          0           0.00%          0           0.00%

Figure 16 shows the V-measure with different numbers of bins. We see that between 14 and 17 bins the function stops increasing and stabilizes, so we have selected 17 bins for all the remaining experiments of this work; increasing the number of bins further does not lead to better results.

Figure 16: V-measure increases as we increase the number of bins in the BSM algorithm.

The second experiment shows the relation between all the possible combinations of the selected features using 2D plots. We have chosen 7 different basic features: height, width, filled area, centroid, perimeter, eccentricity and Euler number. The results are similar for all the combinations: the points of the different observations lie close together, but in the plots we see that the classes are separated. The best result corresponds to the combination of the features height and width (Fig. 17): the observations are close together, yet separated into different zones.

Figure 17: Distribution of the basic features width and height.

The third experiment shows the distributions of the observations over the clusters. We have used the 7 basic features in this experiment and clustered with 3 to 30 clusters. Figure 18 represents how the observations of each class have been distributed over 20 clusters using the 7 basic features. Each column represents a cluster and each row a class (word). The classes are spread over different clusters and, because of that, the classification results are affected and can be confusing. By reducing the number of clusters we observe that all the observations of each class fall in the same cluster, but each cluster then contains more than one class. Conversely, increasing the number of clusters makes the dispersion of the observations greater.

The next experiment is similar to the previous one, but in this case we have evaluated the BSM features, which form the second layer of our architecture. Figure 19 shows how the observations of the classes are distributed over the clusters. We observe that the observations of each class are more concentrated in a single cluster than when using basic features. Increasing or reducing the number of clusters causes the same problems as with the basic features.

In the following experiment we have evaluated the performance of the clustering process using the V-measure. Figure 20 compares the V-measure for different combinations of basic features and ground truths. We observe that using a smaller ground truth with fewer classes the results are worse. Using the same ground truth, the results obtained with only height and width are similar to those obtained with all the basic features. We conclude that the more observations and classes we use, the better the accuracy of the clustering, and that using BSM features in the clustering yields better results than using basic features.


Figure 18: Distribution of the observations in the clusters using basic features.

The ideal solution in the clustering process is to obtain 100% in both completeness and homogeneity. In our case we have not obtained an ideal solution, so we have to choose a trade-off between both measures. In figure 21 we observe two plots for each experiment: β = 0 means that the plot measures homogeneity and β = 1 means that the plot measures completeness. For each experiment we observe that with a small number of clusters the homogeneity is low and the completeness is good. As the number of clusters increases, the homogeneity increases and the completeness decreases. The best number of clusters for each experiment is where both plots cross; for example, the best number of clusters for the BSM features is 15.

The retrieval experiment evaluates the accuracy of the retrieval process. We have run several experiments using different combinations of basic features, the subset of the ground truth and the BSM features (Fig. 22). We observe that the worst results are obtained when we use the ground truth with all the basic features. The best results are obtained with the BSM features, followed by the experiment using the basic features height and width; using all the basic features gives the worst results.

In the last experiment we evaluate the performance in terms of scalability (an increasing number of documents and classes) and of the descriptor. Using the same descriptor with different numbers of documents and classes, the accuracy is better with fewer classes. We also observe that the BSM descriptor is more descriptive and more accurate: the performance improves even when using the larger ground truth, compared with the best result on the smaller ground truth.


Figure 19: Distribution of the observations in the clusters using BSM features.

In conclusion, the performance is more sensitive to the accuracy (descriptive power) of the descriptor. With the same descriptor, the larger the number of classes, the higher the confusion, so the performance decreases.

Pseudo-Structural descriptor organized in a Hash Structure

Our second approach is evaluated using precision-recall curves. These experiments are done by tuning two parameters: the mask size and the threshold used to decide whether an observation is a member of a class or not. To obtain the Loci features we have used masks of different sizes to obtain the number of intersections for each pixel in all directions (Fig. 23).

There are two options in the feature extraction step: the first one uses the background pixels as reference to obtain the feature vector, and the second one uses the foreground pixels as reference.

In figure 24b we observe the results using background pixels as reference, for different mask sizes and varying the threshold over the following values: 25, 50, 100, 200, 300, 400, 500 and 600. We observe that the results improve as the mask size increases, but once the mask reaches 80 pixels or more the results are very similar: since the mean height of the words is about 80 pixels, increasing the mask size further does not add information.

The use of the foreground as reference pixels gives a similar performance (Fig. 24a). However, if we compare the results using foreground and background pixels as reference (Fig. 25), we observe that the results are better with background pixels: using background pixels as reference, the number of pixels contributing information to the feature vector is higher than when using foreground pixels.

Figure 20: Comparative using different features.

9.5 Discussions

In this work we have used a ground truth composed of 50 documents and 20 classes, and a subset of the first one composed of 30 documents and 10 classes. As expected, we have observed that the results improve when a larger number of observations per class is used.

We have used two different approaches in this work. In the hierarchical approach the first layer is created using basic features. The features that give the best classification performance are width and height, and the optimal number of clusters is three: small, medium and big words. Using this simple clustering we separate the observations into three categories, and the classification process chooses the cluster taking into account only the size of the word. The second layer uses BSM features: the observations of each cluster of the first layer are grouped using BSM features. The results show that there is some confusion between some words, and a third layer, using another kind of feature, would help the classification to obtain better results.

The second approach has obtained the best results. This process does not depend on how good the segmentation process was and is more stable than the first one, as shown by the results reported in this work.


Figure 21: Choosing the best number of clusters. β = 0 means homogeneity; β = 1 means completeness.

We have observed that the clustering algorithm used in the first approach does not perform well on the selected corpus. We have run some experiments with the Self-Organizing Map (SOM, http://www.cis.hut.fi/somtoolbox/) as introductory work for future endeavours. A SOM is a type of artificial neural network that is trained using unsupervised learning to produce a low-dimensional, discrete representation of the input space of the training examples, called a map. Appendix A contains some figures with the results of this algorithm. Figure 26 shows a map of the observations of the training set using BSM features; each cell represents a different cluster, and each colour a different class. We observe that the observations of each class are grouped in nearby clusters. Figure 27 shows a similar map, but using characteristic Loci features; in this case the observations are more concentrated in the same clusters.
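The experiments above relied on the SOM Toolbox for MATLAB linked in the text; purely as a hedged illustration, a similar map could be trained in Python with the MiniSom package, which is an assumed substitute rather than the tool used in this work:

```python
# Train a self-organizing map on word descriptors and map each observation
# to its best-matching cell of the map.
import numpy as np
from minisom import MiniSom

def train_som(features: np.ndarray, grid=(10, 10), iterations=5000):
    som = MiniSom(grid[0], grid[1], features.shape[1], sigma=1.0, learning_rate=0.5)
    som.random_weights_init(features)
    som.train_random(features, iterations)
    return som, [som.winner(x) for x in features]   # best-matching unit per word
```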

10. Conclusions

Word-spotting appears to be an attractive alternative to the seemingly obvious recognize-then-retrieve approach to historical manuscript retrieval. With the capability of matching word images in a quick and accurate way, partial transcriptions of a collection can be achieved with reasonable accuracy and little human interaction, and the results improve as the number of observations in the training set increases. Word-spotting can automatically identify indexing terms, making it possible to use costly human labour more sparingly than a full transcription would require.

Figure 22: Classification process using different basic features and BSM features.

Figure 23: Examples with different mask sizes: (a) 10 pixels; (b) 100 pixels.

In this work we have introduced two approaches to word spotting. Both approaches use predefined descriptors, and the observations are stored in different structures. Our first approach is based on pixel-based descriptors organized in a hierarchical structure. Our second approach is based on a pseudo-structural descriptor organized in a hash structure.

Experimental results of the first approach show that the selected basic features cannot discriminate the observations well enough. The results are quite confused, but they can be used to perform a fast rejection, separating, for example, small, medium and big words.


Figure 24: Different mask sizes using different characteristic pixels as reference: (a) foreground pixels as reference; (b) background pixels as reference.

Figure 25: Comparative results using background and foreground pixels as reference.

Comparing the descriptors used in the first approach (basic and BSM features), we conclude that BSM features work better than basic features when using k-means as clustering algorithm.

Concerning the results of the second approach, we have obtained results using both the background and the foreground as characteristic pixels. In both cases the performance is similar, but if we compare them we conclude that the results obtained with background pixels are better than those obtained with foreground pixels, because the number of background pixels is higher and the amount of stored information is bigger.

Comparing both approaches, we can conclude that the second one, using a pseudo-structural descriptor organized in a hash structure, leads to better results. The reason is that in this case we use a descriptor whose performance does not depend on the localization of the characteristic pixels in the image, which makes it more robust than the descriptors of the first approach.

11. Future Work

The results obtained in this work, although preliminary, are encouraging enough to continue with further research in different directions. Let us sketch the major ones.

The corpus of this work is composed only of documents of volume 69, and the writer is the same in all the documents. The whole Barcelona Marriage Records collection is composed of 244 books. One continuation path is to extend this work in order to manage the resulting scalability problem, which raises issues such as different handwriting patterns, different document structures, or variations even between instances of the same word. Our further research has to be oriented to large databases, like the Barcelona Marriage Records collection or other large collections of handwritten documents.

Another research line is focused on searching for a more robust descriptor, i.e. structural shape descriptors, and on using other clustering algorithms in order to reduce the dimensionality of our dataset, with the final objective of improving our current results. As observed in section 9.5, we have run several experiments using Self-Organizing Maps with promising results.

One final research line, perhaps more oriented to a concrete corpus, could be to use the structural information of the documents. The idea here is to use structural and incremental learning methods to predict the words that may appear in conjunction with some others (generation of dictionaries).

12. Acknowledgements

First and foremost, I would like to thank my supervisors, Josep Lladós for his valuable guidance and advice, and Alicia Fornés for her help and for the several times that she has read my report (thank you!). They have helped me learn a lot about computer vision and, more concretely, about document analysis, and they have had enough patience to show me everything I have needed for this work.

An honourable mention goes to my family and friends for their understanding and support in developing this project. I want to make a special mention of Ekain, Jon, Jorge, Lluis, Monica and Toni, the best master partners (and now my friends). They have always been there when I was lost or needed somebody to talk to.


References

[1] Ch. Choisy. Dynamic handwritten keyword spotting based on the NSHP-HMM. Ninth International Conference on Document Analysis and Recognition (ICDAR 2007), 1:242–246, September 2007.

[2] D.L. Davies and D.W. Bouldin. A cluster separation measure. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1:224–227, 1979.

[3] A. Ebrahimi and E. Kabir. A pictorial dictionary for printed Farsi subwords. Pattern Recognition Letters, 29:656–663, April 2008.

[4] R. Ebrahimpour, M. R. Moradian, A. Esmkhani, and M. Farzad. Recognition of Persian handwritten digits using Characterization Loci and Mixture of Experts. International Journal of Digital Content Technology and its Applications, 3:42–46, 2009.

[5] S. Escalera, A. Fornés, O. Pujol, P. Radeva, G. Sánchez, and J. Lladós. Blurred Shape Model for binary and grey-level symbol recognition. Pattern Recognition Letters, 30:1424–1433, November 2009.

[6] A. Fornés, S. Escalera, J. Lladós, G. Sánchez, and O. Pujol. Handwritten symbol recognition by a boosted blurred shape model with error correction. Iberian Conference on Pattern Recognition and Image Analysis (IbPRIA), 1:13–21, 2007.

[7] B. Gatos, I. Pratikakis, and S. Perantonis. Adaptive degraded document image binarization. Pattern Recognition, 39:317–327, March 2006.

[8] H.A. Glucksman. Classification of mixed-font alphabets by characteristic loci. Proc. IEEE Comput. Conf., pages 138–141, September 1967.

[9] E. Keogh. Exact indexing of dynamic time warping. Knowledge and Information Systems, 7:358–386, May 2005.

[10] K.M. Knill and S.J. Young. Speaker dependent keyword spotting for accessing stored speech. Engineering, 1994.

[11] L.I. Kuncheva. Combining Pattern Classifiers: Methods and Algorithms. Wiley-Interscience, 2004.

[12] Y. Leydier, F. Lebourgeois, and H. Emptoz. Text search for medieval manuscript images. Pattern Recognition, 40:3552–3567, December 2007.

[13] R. Manmatha, C. Han, E.M. Riseman, and W.B. Croft. Indexing handwriting using word matching. Proceedings of the First ACM International Conference on Digital Libraries, 1:159, 1996.

[14] R. Manmatha and J.L. Rothfeder. A scale space approach for automatically segmenting words from historical handwritten documents. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27:1212–1225, 2005.

[15] G. Nagy. Twenty years of document image analysis in PAMI. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22:38–62, 2000.

[16] W. Niblack. An Introduction to Digital Image Processing. Strandberg Publishing Company, Birkeroed, Denmark, 1985.

[17] N. Otsu. A threshold selection method from gray-level histograms. IEEE Transactions on Systems, Man, and Cybernetics, 9:62–66, 1979.

[18] V. Papavassiliou, T. Stafylakis, V. Katsouros, and G. Carayannis. Handwritten document image segmentation into text lines and words. Pattern Recognition, 43:369–377, January 2010.

[19] T.M. Rath and R. Manmatha. Word image matching using dynamic time warping. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'03), 2:521–527, 2003.

[20] T.M. Rath and R. Manmatha. Word spotting for historical documents. International Journal of Document Analysis and Recognition (IJDAR), 9:139–152, August 2006.

[21] T.M. Rath, R. Manmatha, and V. Lavrenko. A search engine for historical manuscript images. Proceedings of the 27th Annual International Conference on Research and Development in Information Retrieval (SIGIR '04), 1:369, 2004.

[22] A. Rosenberg and J. Hirschberg. V-measure: A conditional entropy-based external cluster evaluation measure. Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), 1:410–420, 2007.

[23] J.L. Rothfeder, S. Feng, and T.M. Rath. Using corner feature correspondences to rank word images by similarity. Computer Vision and Pattern Recognition Workshop (CVPRW'03), 3:30–35, 2003.

[24] P. Rousseeuw. Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics, 20:53–65, November 1987.

[25] C. Schmid, R. Mohr, and C. Bauckhage. Evaluation of interest point detectors. International Journal of Computer Vision, 37:151–172, 2000.

[26] S. Srihari, H. Srinivasan, P. Babu, and Ch. Bhole. Handwritten Arabic word spotting using the CEDARABIC document analysis system. Proceedings of the 2005 Symposium on Document Image Understanding Technology, 1:123, 2005.

[27] C. Wolf. Document ink bleed-through removal with two hidden Markov random fields and a single observation field. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32:431–447, March 2010.

[28] D. Zhang and G. Lu. Review of shape representation and description techniques. Pattern Recognition, 37:1–19, January 2004.


Appendix A. Self-Organizing Maps (SOM) results

Figure 26: SOM using BSM features.

Figure 27: SOM using characteristic Loci features.
