
MASTER IN COMPUTER VISION AND ARTIFICIAL INTELLIGENCE

REPORT OF THE RESEARCH PROJECT

OPTION: COMPUTER VISION

Handwritten Word Spotting in Old Manuscript Images using Shape Descriptors

Author: David Fernández

Date: 08/09/2010

Advisors: Josep Lladós & Alicia Fornés


Handwritten Word Spotting in Old Manuscript Images using Shape Descriptors

David Fernández dfernandez@cvc.uab.es
Computer Vision Center (CVC)
Campus UAB - Edifici O
08193 Bellaterra, Barcelona, Spain

Supervisors: Josep Lladós & Alicia Fornés

Abstract

There are many historical handwritten documents containing information that can be used for several studies and projects. The Document Image Analysis and Recognition community is interested in preserving these documents and extracting all the valuable information from them. Handwritten word-spotting is the pattern classification task that consists in detecting word images in handwritten documents. In this work, we have used query-by-example: we match an input image with one or multiple query images to determine a distance that might indicate a correspondence. We have developed two approaches. The first approach consists of a hierarchical process that uses two different features organized in layers (basic features in the first layer and BSM features in the second layer). The second approach employs characteristic Loci features. Marriage licenses of the Cathedral of Barcelona are used as the benchmarking database. We have searched for several words selected according to their frequency of appearance in the documents. The results are evaluated using two different types of measures: the first one evaluates how well the observations are clustered in the learning process, and precision-recall curves are used to evaluate the retrieval step.

Keywords: Word-Spotting, BSM, Loci, k-means, Anisotropic Gaussian Filter

1. Introduction

Context and motivation

Despite the growing use of electronic documents in our daily life, paper documents still play an important role. Current technologies provide us with convenient and inexpensive means to capture, store, compress and transfer digitized images of documents. Nevertheless, (semi)automatic document processing requires specialized technology to extract document contents. Information retrieval from Digital Libraries is primarily done using typed textual queries. Hence, document images are transcribed to ASCII codes using Optical Character Recognition (OCR) systems, and querying and indexing are performed by sequence comparison of ASCII strings. This solution is constrained to machine-printed text, but documents contain other forms of information such as handwritten text, symbols and graphical structures. One of the main purposes of the area of Document Image Analysis and Recognition (DIAR) is the extraction of information, either textual, pictorial or structural, from document images. The understanding of such information represents a step forward towards shortening the semantic gap between recognizing individual visual objects and understanding the whole document content in a given context. It does not involve the pure transcription of documents, but the retrieval and linkage of semantic knowledge from large collections of document images stored in digital repositories.

There is an increasing interest in digitally preserving and providing access to historical document collections in libraries, museums and archives. The conversion of historical document collections to digital archives is of prime importance to society, both in terms of information accessibility and long-term preservation. Handwritten documents are typically found in historical archives. Examples are unique manuscripts written by well-known scientists, artists or writers; letters, trade forms or administrative documents kept by parishes or municipalities that help to reconstruct historical sequences in a given place or time, etc. While machine-printed documents, under a minimum of conditions, are easy to read by OCR systems, the recognition of handwriting is still a scientific challenge. The state of the art achieves good performance only in constrained domains or with small vocabularies.

Mass digitization of historical documents is performed using specialized scanners. These scanners allow obtaining good-quality images without physically damaging the documents. After that, image processing steps are usually applied to enhance the images and ease visual inspection. The problem is the degradation of the documents caused by a lifetime of use. Degradation can appear for several reasons: non-stationary noise due to illumination changes, curvature of the document, ink stains and holes in the document, ink show-through (the appearance of the verso side text or graphics on the scanned image of the recto side), low contrast, warping effects, etc. Some centuries ago, the ink used for writing contained oxide particles, which contribute to degrading the paper of the document and cause the words on the verso of the page to be visible on the analysed side. This effect is known as bleed-through. Nowadays, several methods have been developed to improve the quality of the images [7; 27].

There are many historical handwritten documents containing information that can be used for several studies and projects. The Document Image Analysis and Recognition community is interested in preserving these documents and extracting all the valuable information from them. There are two ways to extract the information: transcribing documents (word by word) and word-spotting. Handwritten word-spotting refers to the problem of detecting specific keywords in handwritten document images. A model is provided as a query, and the goal is to retrieve all the occurrences in a word image database (or regions of a document collection) that are close to the query in terms of a specific dissimilarity measure. However, one of the problems with these documents is access: the majority of the material is only physically accessible, and only a few authorized people can consult it.

Nowadays, thousands of digitized documents remain unused because they are not indexed. There are several levels of indexation in terms of meta-data, from the name of the author and a brief history of the book to a full-text transcription. Nevertheless, there is no single technique that allows us to index the documents correctly. During the last decades these techniques have experienced great improvements, and the error rates have dropped to a level that makes commercial applications feasible. Traditional optical character recognition (OCR) systems fail to process handwritten documents and are only suitable for modern printed documents. Off-line handwritten text recognition systems, which take an image of a piece of handwriting as input, only work properly with restricted vocabularies.


Handwritten word-spotting is the pattern classification task that consists in detecting words in handwritten document images. In this dissertation, we are concerned with the detection of several words in our documents.

In documents where all pages are written by the same author (or few authors), the images of multiple instances of the same word are likely to look similar. Word-spotting [20] treats a collection of documents as a collection of words. The first step consists in segmenting the document into word images; then, pairwise "distances" between word images are calculated and used to cluster all words with similar features. Ideally, each cluster contains all the samples of the same word.

There are two types of word-spotting approaches, depending on how the input is specified: query-by-string and query-by-example. In query-by-string, character models are trained in advance and, at execution time, they are combined to form words and the probability of each word is evaluated. In query-by-example, the input is an image of the word to search for, and the output is a set of the most representative images of the query word.

Problem statement

This work addresses the problem of handwritten word spotting in historical manuscripts. While most existing approaches are based on contextual methods like Hidden Markov Models (HMM) or Dynamic Time Warping (DTW), which use the sequential information of graphemes in a word, we propose a holistic approach using shape matching techniques. We propose two approaches. The first one uses a pixel-based descriptor tolerant to distortions. The second one is inspired by characteristic Loci features and allows aggregating pseudo-structural information in the descriptor. A handwritten collection of documents, which will be explained in more detail in the following sections, is used in this work.

Objectives

As stated above, this work aims to develop shape descriptors for handwritten word spotting. In particular, the objectives are:

• To investigate different shape descriptors that allow describing handwritten words with invariance to variations in writer, acquisition conditions, etc. We aim to focus on pixel-based descriptors and structural ones.

• Based on the above descriptors, to define clustering criteria allowing us to build indexation structures for word spotting purposes.

• To define an experimental framework and construct a ground truth from a collection of a real application (Barcelona marriage records).

Outline of the approach

In our work we have used query-by-example. It consists in matching an input image with one or multiple query images to determine the distance that might indicate a correspondence.

A spotting architecture consists of four tasks. First, a pre-processing step is done. Second, a fast rejection of the segmented words is done. Third, a normalization step is done. And fourth, a classification of the training set is done. The last step of our work is a retrieval step. Figure 1 outlines the architecture of our approach.

The quality of old documents can be affected by degradations. We perform a pre-processing step in order to obtain better results (Fig. 1). The first task consists in improving the quality of the document. For this purpose we binarize the document. Then, we remove margins of the document that are likely to interfere with subsequent operations. The page is then segmented into lines using projection analysis techniques [18]. Once the lines are segmented, word segmentation is done using a similar technique. The projection function is smoothed with an Anisotropic Gaussian Filter [14].

In our approach, for each considered word, we extract the bounding box and perform a fast rejection of the words that are very big or very small with regard to the mean of all the words of the document. In addition, bounding boxes with few pixels of information are ruled out. This allows drastically reducing the search space.

The next step consists in word normalization. It is necessary to extract the word and discard the pixels that do not belong to it. The normalization is done using the Anisotropic Gaussian Filter and the upper and lower contours of the word.

We have developed two approaches for the learning step. The first approach consists of a hierarchical process. It uses two different features organized in layers. In the first layer, we use basic features, like aspect ratio, height and width. In the second layer, we use the Blurred Shape Model (BSM) features. In the first layer the words are clustered according to their basic features, and then each cluster of the first layer is clustered with BSM features. The second approach employs characteristic Loci features.

The rest of this dissertation is organized as follows. Section 2 describes the corpus of this work. Section 3 discusses related work in this field. Section 4 presents different methods to evaluate a clustering process. Sections 5 to 8 explain the different methods proposed in this work. Section 9 shows the experimental results. The last sections present the conclusions of this work and future work.

2. The corpus of Barcelona marriage records. A social science perspective.

Between 1451 and 1905, a centralised register called Llibres d'Esposalles was kept. It recorded all the marriages and the fees imposed on them according to their social class. It is conserved at the Archives of the Barcelona Cathedral and comprises 244 books with information on approximately 550,000 marriages celebrated in over 250 parishes. Each book contains the marriages of two years, and each book was written by a different writer.

All the books of the collection consist of two parts. The first one is an index with all the husbands' surnames that appear in the volume and the page number where each one appears (Fig. 2a). The indexes of the books have the same structure: several columns, where each column is composed of a surname, several dots and the number of the page where this surname appears. The second part contains the marriage licences (Fig. 2b and 2c). This work has been developed using the second part of the document, the marriage licences.

Marriage licences have a structured layout (Fig. 3). The document is divided into three parts. In the left part we can find the husband's surname.



Figure 1: General process

Each surname is next to the record of the wedding. In the right part we can find the tax of the wedding. The central part corresponds to the record. In general, it is a quite regular structure that can be represented by a syntactic model. In this work, the query words used for word spotting are searched for in the central part, so first a layout segmentation step has to be done.


Figure 2: Llibre d'esposalles (Archive of Barcelona Cathedral, ACB): (a) 1617: index of volume 69; (b) 1729: volume 127; (c) 1860: volume 200

Other good characteristics of the documents are that the text is hardly cursive, the documents are very clean, the words are connected, and the grammatical structure is similar in all the registers: day of the wedding, name and occupation of the husband, the husband's parents, name of the wife, the wife's parents and the place where the wedding took place.

Figure 3: Structure of the documents: (a) husband's surname, (b) wedding record, (c) tax of the wedding

Our work is part of a larger project led by the Center for Demographic Studies (CED), Department of Geography, Universitat Autònoma de Barcelona (UAB). This project brings together researchers from the social sciences and computer science. From the perspective of scholars in the social sciences, this collection is a rich source of information to construct genealogies of people over centuries. Thus, the first aim is to construct a database of marriages. Doing this by hand would be an artisanal and time-consuming task, so word spotting techniques can help in pseudo-automatizing this process.

Ground truth

We have a ground truth composed of 500 documents from volume 69 of the Cathedral of Barcelona collection. We also have a second ground truth, which is a subset of the first one: it is composed of 30 documents taken from the first ground truth. These documents have been labelled manually, using an application program designed to label documents. This program allows us to select an area of the image and label it with a word. Each word (and the corner points of the selected area) is automatically saved in an XML file.



There is a high number of different words in the selected ground truth documents, and the literal transcription (word by word) is an expensive process; consequently, only a few words are labelled. The selected words are shown in figure 4. These words were selected because of their high frequency of appearance in the selected documents. The ground truth has all the labelled words (20 classes). The subset of the ground truth is composed of the first 10 labelled words (10 classes).

Figure 4: An example of each selected word: (a) Barna, (b) de Barna, (c) en Barna, (d) de, (e) pages, (f) reberè, (g) dia, (h) dit, (i) fill, (j) filla, (k) viuda, (l) ab, (m) habitant, (n) donsella, (o) dilluns, (p) dimarts, (q) dimecres, (r) dijous, (s) divendres, (t) dissapte, (u) viudo

For the experiments we have employed a cross-validation technique [11]. Suppose that we have a data set Z of size N x n, containing n-dimensional feature vectors describing N objects. We choose an integer K (preferably a factor of N) and randomly divide Z into K subsets of size N/K. Then we use one subset to test the performance of the classifier D trained on the union of the remaining K - 1 subsets. This procedure is repeated K times, choosing a different part for testing each time. To get the final result we average the K estimates. We have chosen K = 5 in our experiments.
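A minimal sketch of this K-fold procedure, assuming scikit-learn is available; the feature matrix `Z`, labels `y` and the 1-NN classifier standing in for D are placeholders, not the authors' exact implementation:

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsClassifier

def cross_validate(Z, y, K=5):
    """Average accuracy over K folds (K = 5 in our experiments)."""
    kf = KFold(n_splits=K, shuffle=True, random_state=0)
    scores = []
    for train_idx, test_idx in kf.split(Z):
        clf = KNeighborsClassifier(n_neighbors=1)            # D: the classifier under evaluation
        clf.fit(Z[train_idx], y[train_idx])                   # train on the union of K-1 subsets
        scores.append(clf.score(Z[test_idx], y[test_idx]))    # test on the held-out subset
    return np.mean(scores)                                    # final result: average of the K estimates
```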

3. Related work

Word-spotting was originally formulated to detect words in speech messages [10]. Later it was used in text documents [20] for matching and indexing handwritten words across several documents. In this context, it was first proposed by Manmatha [13], and later a number of different word matching algorithms were investigated. This technique needs word segmentation, and many word segmentation approaches can be found in the literature. Relevant examples are a scale-space word segmentation process proposed in [14] and a neural network word segmentation algorithm presented in [26].

Rath [21] proposed an automatic retrieval system for historical handwritten documents using relevance models. The method describes two statistical models for retrieval in large collections of handwritten manuscripts given a text query. Both use a set of transcribed page images to learn a joint probability distribution between features computed from word images and their transcriptions. The models can then be used to retrieve unlabelled images of handwritten documents given a text query.

Handwriting recognition of large vocabularies in historical documents is still a very challenging task. Nagy in [15] discusses the papers published in PAMI on document analysis during the last 20 years.

A word can be represented with different kinds of features. A feature is a measurement of the object to study, and allows reducing all the characteristics of the image to a few that preserve the main information in a more manageable size. There are three types of features: quantitative (numeric) features, qualitative (symbolic) features and structured features. Quantitative features can be discrete values (e.g. weight, the number of computers) or interval values (e.g. the duration of an event). Qualitative features can be nominal or unordered (e.g. colour) or ordinal (e.g. sound intensity: "quiet" or "loud"). Structured features represent relational and/or hierarchical attributes among a set of primitive patterns (e.g. a parent node can be a generalization of children labelled "cars", "trucks" and "motorbikes") [28].

There are different ways to match words, depending on the kind of features used. For example, words can be matched directly by computing a distance such as XOR, Euclidean Distance Mapping (EDM), Sum of Square Differences (SSD), SLH, Hausdorff distance, etc. The problem of these methods is that they are very sensitive to spatial variation.

One of the most widely used feature comparison algorithms in handwriting recognition is Dynamic Time Warping (DTW) [19; 9]. DTW is an algorithm for measuring similarity between two sequences which may vary in time or speed. It has been widely used in the speech processing, bio-informatics and on-line handwriting communities to match 1-D signals. Even though image features are in general two-dimensional, it is possible to recast them in one dimension, although the association between column features of the images may be lost. The DTW algorithm tries to minimize the variations between the feature vectors. In general, it is a method that allows a computer to find an optimal match between two given sequences.
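As an illustration, a minimal dynamic-programming sketch of DTW between two 1-D sequences (not the authors' implementation; column features of word images would be the typical input in this context):

```python
import numpy as np

def dtw_distance(a, b):
    """Classic DTW between two 1-D sequences a and b, with absolute-difference local cost."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])           # local distance between samples
            D[i, j] = cost + min(D[i - 1, j],         # insertion
                                 D[i, j - 1],         # deletion
                                 D[i - 1, j - 1])     # match
    return D[n, m]

# Two sequences that differ in speed still align with a small cumulative cost.
print(dtw_distance([0, 1, 2, 3, 2, 1], [0, 1, 1, 2, 3, 3, 2, 1]))
```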

In holistic approaches the word image is not segmented into smaller parts, but is considered as a whole shape [3]. Thus, the recognition is usually performed by a shape matching algorithm in terms of the features computed at some key points of interest. A comparative study of a number of interest point detectors is presented in [25]. For example, corners can be detected with the Harris detector [23], but a drawback of this detector is its sensitivity to noise.

Cohesive Elastic Matching [12] is based on zoning, and it can be applied to the whole text image; it is not necessary to segment the words of the text. It is a good method to compare zones of interest (ZOI), and the algorithm is independent of the ZOI extraction method.

Hidden Markov Models (HMM) are sometimes used in word-spotting [1] to match words in documents, but they are usually applied to documents with a reduced vocabulary and need a considerable learning stage.

4. Choosing the number of clusters

There are different methods in the literature to choose the number of clusters. They can be classified into two big groups depending on how the number of clusters is chosen. The first one is a manual method: the number of clusters is chosen based on experimentation. In this case it is the experience of the user that allows choosing the best number of clusters.

The second group comprises automatic, or pseudo-automatic, methods to choose the best number of clusters. Algorithms of this group use an index, or several indices, to obtain a measure that allows choosing the best number of clusters. There are several validity indices. They can be classified into two groups: external and internal validity indices.

External validity indices are used when true class labels are known. Some examples of external validity indices are: the Rand index, which measures the similarity between two data clusterings; the Adjusted Rand index, which is the corrected-for-chance version of the Rand index; and the Mirkin index, which considers only object pairs in different clusters for both partitions and finds the dissimilarity.

Internal validity indices are used when true class labels are unknown. Some examples of internal validity indices are: the Silhouette index [24], which computes the average distance of a point from the other points of the cluster to which the point is assigned; the Davies-Bouldin index [2], which is a function of the ratio of the sum of within-cluster scatter to between-cluster separation; and the Calinski-Harabasz index, which computes the sum of the squares of the distances between the cluster centroids and the mean of all the points in all classes.

There is a large number of different validity indices to choose the best number of clusters. It is therefore possible to choose only one of them to select the number of clusters, or several of them in order to validate the different indices against each other.

5. Word-spotting approach

The objective of this work is word spotting. Thus, given a query word image, we intend to locate instances of the same word class in the documents to be indexed. Word-spotting is used in many works to search for words in images. In this work, inspired by some approaches in the literature, words are considered as shapes, and spotting is achieved through shape dissimilarity functions.

Word spotting needs to define a descriptor, or several descriptors, that represents our observations and allows us to group and organize their features. Once the observations are grouped and organized, an indexation structure is needed. This structure organizes and groups all the observations of our experiment, and it is later used to find the words in the documents that are similar to a given query.

A general spotting architecture consists of two major modules, namely the learning stage and the retrieval one. Learning consists in clustering similar features in the search space (target images) to construct the indexation structure. Retrieval consists in finding the best approximation of the observations of the classification set with observations of the training set. In this work we propose two approaches (Fig. 5).

The first approach is oriented to pixel-based descriptors. It uses two different features as descriptors of the observations: basic features and BSM features. The indexation structure is constructed using hierarchical clustering. It consists in segmenting the words from the images and organizing them in several clusters, using two descriptors based on the distribution of the pixels of the image. In the first level of the hierarchical cluster structure, basic shape features are considered. Afterwards, the clusters are refined in the second level using the Blurred Shape Model (BSM) features [5].



Figure 5: We present two approaches to word spotting. Both have the same first steps.

This organization of the search space allows, when a query word is searched, first to quickly reject a large number of non-similar words (first level) and then to perform the intensive search with more discriminant features (BSM) in the second level on a reduced number of target words.

The second approach is oriented to pseudo-structural features. The descriptor used in this approach is the characteristic Loci feature, and the indexation structure is constructed using a table, where each column corresponds to an observation of the documents and the rows are the features of the words. Each word, or character, is composed of several features, and it is not significant where they appear inside the image. This approach uses features based on characteristic Loci [3; 4; 8]. Given a word image, a feature vector based on Loci characteristics is computed at some characteristic points. Some approaches in the literature have used the background pixels of the image, other approaches have used the foreground pixels, and some have even used the contour or the skeleton of the images. Characteristic Loci encode the frequency of intersection counts for a given characteristic point along different direction paths starting from this point. Loci vectors extracted from the words of the image database are stored in a hashing structure. Afterwards, word spotting is performed by a voting process after the Loci vectors from the query word are indexed in the hashing table.

Let us describe the different steps of the two developed approaches. Both approaches have the same preliminary steps. They consist of a pre-processing step, where the documents are segmented and the words are extracted from them; a fast rejection, where bad words are discarded; and noise removal, where the noise of the image is removed and the bounding box is fitted to the contour of the image. These preliminary steps are explained in section 6. Section 7 explains the first approach developed and section 8 the second one.



6. Preliminary steps

6.1 Pre-processing

Modelling the human cognitive process to obtain a similar computational methodology for handwritten word segmentation is quite difficult due to the following characteristics. The handwriting style is usually cursive or discrete. In the case of discrete handwriting, characters are joined to form words, but, unlike machine-printed text, handwritten text is not uniformly spaced. The size of the characters along the words of the document varies (this is a scale problem). Ascenders and descenders are regularly connected and words present different orientations. Documents are often degraded due to ageing or other reasons. Another reason is the presence of the show-through or bleed-through effects explained above.

Some of the main problems of our historical documents are that they have been written by several authors (the writer changes every two years), they are noisy (stains, shadows, bleed-through, etc.), they contain margins, etc.

The documents to be used in our experiments present some of the drawbacks mentioned above, like connected ascenders and descenders, different character sizes, etc. But a good characteristic of these documents is that they are well structured. As we have commented in section 2, each document has three parts, and the objective is to work with the marriage licenses.

The steps of the pre-processing are: binarization of the documents, page segmentation, layout segmentation, segmentation of the lines and, as the last step, word segmentation (Fig. 6). In the following subsections we describe the details of these steps.

Figure 6: Pre-processing steps.



6.1.1 Binarization

The binarization (Fig. 6(b)) of an image is the process that converts a digital image into a black-and-white image while preserving its main properties. In Document Image Analysis the objective is to classify each pixel as background or relevant information.

The simplest way to binarize an image is to choose a threshold value and to classify all pixels with values above this threshold as white and all other pixels as black (global image threshold). The problem then is how to select the threshold. In many cases, finding a threshold valid for the entire image is very difficult, and sometimes even impossible. Therefore, adaptive image binarization is needed: an optimal threshold is chosen for each image area (local image threshold).

In our work we have applied two different methods of binarization. The Otsu method [17] is a global method that chooses the threshold that minimizes the intraclass variance of the thresholded values. It has the advantage of not requiring input parameters, but it assumes that histograms are bimodal and illumination is uniform. Niblack's algorithm [16] is a local thresholding method. This algorithm calculates a threshold value for each pixel based on the mean and standard deviation of all the pixels in a local neighbourhood. The critical point of this algorithm is the size of the neighbourhood area. The main disadvantage of this approach is the computational time, which depends strongly on the size of the neighbourhood window. The size should be small enough to preserve local details and large enough to suppress noise.

The method selected in our work is the Otsu method because the documents of our corpus have good quality and present a uniform background. The Otsu method works better with documents of good quality (Fig. 7a). Niblack usually works better on historical documents when they present a high level of degradation (shadows, bleed-through, stains, etc.), but in such cases a perfect binarization is difficult to achieve, so the algorithm cannot avoid the presence of noise in the resulting image (Fig. 7b).
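A minimal sketch of both binarization methods, assuming scikit-image is available; the file name and the Niblack parameters are hypothetical, and the timings in Fig. 7 presumably come from the authors' own implementation:

```python
from skimage import io, filters

# Load a document page as a grey-level image (hypothetical file name).
page = io.imread("marriage_page.png", as_gray=True)

# Global thresholding (Otsu): a single threshold for the whole page.
ink_otsu = page <= filters.threshold_otsu(page)          # True where there is ink (dark pixels)

# Local thresholding (Niblack): one threshold per pixel, computed from the mean and
# standard deviation of a local window; the window size is the critical parameter.
ink_niblack = page <= filters.threshold_niblack(page, window_size=25, k=0.2)
```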

Figure 7: Methods of binarization applied to a piece of a sheet of the marriage database: (a) Otsu method (0.078 sec. of computing time); (b) Niblack method (204.099 sec. of computing time).



6.1.2 Page segmentation

The handwritten manuscripts have been subjected to degradation during all the time they have been used and stored, but the digitization process also adds degradations to the document, like the warping effect in the margins. The purpose of this step is to remove some of these margins and lines so that they will not interfere with later stages (Fig. 6(c)).

The method proposed in this work is based on the blob properties of the image. We know that the margins are located at the borders of the document; therefore, in these parts, we extract the properties of the blobs of the image after it has been binarized. The biggest blobs are the margins, and they are removed from the document.
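An illustrative sketch of this blob-based margin removal, assuming scikit-image; the border width and area threshold are hypothetical parameters, not values from the original work:

```python
from skimage import measure

def remove_margins(binary_ink, border=50, min_area=5000):
    """Remove large connected components (blobs) that touch a border strip of the page."""
    labels = measure.label(binary_ink)                   # connected components of the ink mask
    cleaned = binary_ink.copy()
    h, w = binary_ink.shape
    for region in measure.regionprops(labels):
        r0, c0, r1, c1 = region.bbox
        touches_border = r0 < border or c0 < border or r1 > h - border or c1 > w - border
        if touches_border and region.area > min_area:    # big blobs near the border: margins/lines
            cleaned[labels == region.label] = False
    return cleaned
```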

6.1.3 Layout segmentation

The documents of our case study have a similar structure (explained in section 2). The word spotting in this work is centred on the central block of text of the document. Projection profile techniques [14] have been widely used in line and word segmentation for machine-printed documents. The idea is to obtain a 1D function of the pixel values by projecting the binary image onto the horizontal axis. The distinct local peaks in the profile correspond to the white space between the columns and the distinct local minima correspond to the text.

Before segmenting the lines (Fig. 6(d)), it is necessary to extract the central block of text of each page of the documents. The aim is to delete the zones of the page that can interfere with the line segmentation. A morphological dilation with a vertical structuring element is applied to the input document, and then it is smoothed with a Gaussian filter to discard false local minima and reduce sensitivity to noise. The local minima are obtained by setting the derivative of the projection profile to zero.
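A minimal sketch of the projection-profile idea, assuming numpy and scipy; the smoothing sigma is a hypothetical parameter:

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d
from scipy.signal import argrelextrema

def column_profile(binary_ink):
    """Projection of background (white) pixels onto the horizontal axis, smoothed.
    Peaks correspond to white space between columns, minima to text (cf. the text above)."""
    background = ~binary_ink                              # binary_ink: True where there is ink
    profile = background.sum(axis=0).astype(float)        # one value per image column
    return gaussian_filter1d(profile, sigma=15)

def column_separators(binary_ink):
    """Candidate column separators: local maxima of the smoothed background profile."""
    smoothed = column_profile(binary_ink)
    return argrelextrema(smoothed, np.greater)[0]
```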

6.1.4 Line segmentation

The documents used in this work contain lines which are approximately straight and close to horizontal. The projection profile techniques used before are also used in this step. In this case the projection is done onto the vertical axis.

Lines are segmented in the same way as in the previous step (Fig. 6(e)). The central block is dilated with a horizontal structuring element and smoothed with a Gaussian filter, and a horizontal projection is computed. Although we apply the smoothing function to discard false lines, we also check the size of the lines to discard possible false ones. The rejection is done by checking the height of the lines; small lines are discarded.

6.1.5 Word segmentation

The segmented lines obtained in the last process are examined to extract the words of the document (Fig. 6(f)). A word image is composed of discrete characters, connected characters, or a combination of both. The idea is to merge all these components into a single entity, which is a word. This may be achieved by forming a blob-like representation of the image. A blob is considered as a connected region in space. Our approach is based on the Laplacian of Gaussian (LoG) operator for creating a multi-scale representation for blob detection [14]. The idea is to combine second-order partial Gaussian derivatives along the two orientations at different scales to merge the components of a word.

An anisotropic Gaussian filter (Fig. 8) is defined as:

G(x, y; σx, σy) = (1 / (2πσxσy)) e^(−(x²/σx² + y²/σy²))   (1)

From the filter (1), the Laplacian of Gaussian operator is based on the addition of the second derivatives in x and y as follows:

L(x, y; σx, σy) = Gxx(x, y; σx, σy) + Gyy(x, y; σx, σy)   (2)

A scale space representation of the line images is constructed by convolving the image with L from (2). Consider a two-dimensional image f(x, y); then, the corresponding output image is

I(x, y; σx, σy) = L(x, y; σx, σy) ∗ f(x, y)   (3)

As we can see in figure 8, the output is a grey-scale image, where the background has a middle grey level and the words are light grey. It is very difficult to determine a threshold for selecting the pixels that correspond to words. We have observed that most words have a black contour. Our improvement allows, using this mask, splitting each word image into three areas: background, word and contours of the word. The mask converts the thin black contours into thick contours; the rest of the image is considered background. This thickening of the contours causes letters that are close together to join. The improvement thus allows merging the characters of a word, making it easier to split different words. The words, which are extracted from a scale space representation, are blob-like, but, to make sure that the blob merges all the parts of the word, we apply a closing operator to each word.
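An illustrative sketch of the anisotropic LoG filtering of equations (1)-(3), assuming scipy; the sigma values and response threshold are hypothetical (they would be tuned to line height and inter-character spacing), and the three-way background/word/contour split described above is only roughly approximated here:

```python
import numpy as np
from scipy.ndimage import gaussian_filter, binary_closing

def anisotropic_log(line_image, sigma_x=12.0, sigma_y=3.0):
    """I = L * f with L = Gxx + Gyy, using different scales along x and y (eqs. 1-3)."""
    f = line_image.astype(float)
    gxx = gaussian_filter(f, sigma=(sigma_y, sigma_x), order=(0, 2))  # 2nd derivative along x
    gyy = gaussian_filter(f, sigma=(sigma_y, sigma_x), order=(2, 0))  # 2nd derivative along y
    return gxx + gyy

def word_blobs(ink_mask, response_threshold=0.05):
    """Very rough word-blob extraction: keep pixels with a strong filter response
    (word bodies and contours) and close the gaps between neighbouring characters."""
    response = anisotropic_log(ink_mask)
    blobs = np.abs(response) > response_threshold
    return binary_closing(blobs, structure=np.ones((3, 15)))
```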

6.2 Fast rejection

The previous process produces one blob for each word in the document, but sometimes these components do not represent words, because they are stains, lines or small parts of a word that have not been merged with the original word. The selection of the suitable words is done in two steps. First, the blobs which are very small with regard to the height and width of the segmented line are rejected. From the remaining blobs, we keep those with more pixels than an experimentally set threshold.
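A sketch of this two-step rejection; the relative-size and ink-count thresholds are hypothetical, standing in for the experimentally set values mentioned above:

```python
def fast_rejection(blobs, line_height, line_width, min_rel_size=0.2, min_ink_pixels=150):
    """Keep only blobs that are plausibly words (cf. the two-step selection above).
    Each blob is assumed to be a dict with 'height', 'width' and 'ink' (pixel count)."""
    kept = []
    for blob in blobs:
        too_small = (blob["height"] < min_rel_size * line_height or
                     blob["width"] < min_rel_size * line_width)
        if not too_small and blob["ink"] >= min_ink_pixels:
            kept.append(blob)
    return kept
```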

6.3 Noise removal

The images remaining after the fast rejection step are subjected to a normalization process to reduce their variability. Our proposal allows cleaning the image and fitting the bounding box to the word (Fig. 9).

The first step consists in binarizing the word image (Fig. 9b). Then, we apply the anisotropic Gaussian filter explained before to merge the different parts of the same word (Fig. 9c). Once applied, the image is composed of several blobs, as we can see in figure 9d.



Figure 8: Anisotropic Gaussian Filter

The next step is to delete the blobs that do not belong to the word. The biggest blob is chosen and its contour is computed (Fig. 9e). The contour is the frontier that separates the pixels of the word from the background. The last step consists in projecting vertically and horizontally to fit the bounding box.

Figure 9: Normalization process: (a) original image, (b) binarized image, (c) anisotropic Gaussian filter, (d) biggest blob, (e) blob contour, (f) final image.



7. Pixel-based descriptors organized in a hierarchical structure

The first approach of this work is based on two pixel-based descriptors (basic features and BSM features), organized using a hierarchical structure of clusters. The objective is to build several layers of clusters using diverse features. The top layer is based on basic features. The bottom layer consists of features based on pixel distribution, in particular BSM.

This approach groups the words into clusters with similar features. When we move down a layer, we only use the observations of the chosen cluster to cluster the words with the new kind of features (Fig. 10). In each layer we reduce the number of observations for which features of the new layer are computed, so the classification process is faster.

7.1 Feature extraction

In pattern recognition and image processing, feature extraction is a special form of dimensionality reduction. The objective is to transform the input data into a reduced representation set of features (a feature vector). The observations of the experiments can be represented in different ways using different features; the objective is to select the features that best describe the image.

The marriage licences corpus of the Cathedral of Barcelona is composed of 244 volumes, too much information to be indexed directly, and the computational cost increases with the number of documents in the corpus.

Retrieval time can be reduced using a hierarchical indexation structure. The features of our corpus are divided into several groups (clusters) in each layer, and then each group is divided into other groups using other features (Fig. 10).

Figure 10: The structure in layers of the feature extraction.

In this work two types of features have been used: basic features and BSM features. The first layer uses basic features to do a rough separation of the word classes. In the second layer we use features based on pixel distribution: each cluster of the first layer is split using BSM features.



Basic features

Basic features are based on shape features of the images [28]. These features are extracted from the contour and the region of the shapes.

For each normalized word, a vector of basic features is obtained. The features used in this work are: height, width, aspect ratio, centroid, filled area, perimeter, eccentricity and Euler number.

The objective of this first layer is to separate all the words of our corpus into groups with similar basic features.
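A sketch of how these basic features could be computed with scikit-image's region properties; this is an assumption for illustration, not necessarily the original implementation:

```python
from skimage import measure

def basic_features(word_mask):
    """Basic shape features of a binarized word image (True where there is ink)."""
    region = max(measure.regionprops(measure.label(word_mask)), key=lambda r: r.area)
    r0, c0, r1, c1 = region.bbox
    height, width = r1 - r0, c1 - c0
    return [height, width,
            width / height,            # aspect ratio
            *region.centroid,          # centroid (row, col)
            region.filled_area,        # filled area
            region.perimeter,
            region.eccentricity,
            region.euler_number]
```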

Blurred Shape Model (BSM) features

The words are described by a probability density function of the Blurred Shape Model (BSM) [5] that encodes the probability of pixel densities of image regions: the image is divided into a grid of n x n equal-sized subregions, and each bin receives votes from the shape points in it and also from the shape points in the neighbouring bins. Thus, each shape point contributes to a density measure of its bin and its neighbouring ones. The output descriptor is a vector histogram where each position corresponds to the density in the context of the sub-region (Fig. 11).

The objective of this second layer is to extract features based on pixel distributions. Once the words have been clustered according to their size, this layer groups the words with similar pixel distributions into different clusters.
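An approximate sketch of the BSM descriptor as described above; the inverse-distance weighting of the votes is an assumption for illustration and not necessarily the exact weighting of [5]:

```python
import numpy as np

def bsm_descriptor(word_mask, n=8):
    """Blurred Shape Model: an n x n histogram of shape-point densities."""
    h, w = word_mask.shape
    centers_y = (np.arange(n) + 0.5) * h / n
    centers_x = (np.arange(n) + 0.5) * w / n
    hist = np.zeros((n, n))
    ys, xs = np.nonzero(word_mask)                    # shape (ink) points
    for y, x in zip(ys, xs):
        i, j = min(int(y * n / h), n - 1), min(int(x * n / w), n - 1)
        for di in (-1, 0, 1):                         # vote in the bin and its neighbours
            for dj in (-1, 0, 1):
                ii, jj = i + di, j + dj
                if 0 <= ii < n and 0 <= jj < n:
                    d = np.hypot(y - centers_y[ii], x - centers_x[jj])
                    hist[ii, jj] += 1.0 / (1.0 + d)   # closer bin centroids get larger votes
    total = hist.sum()
    return (hist / total).ravel() if total > 0 else hist.ravel()
```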

Figure 11: Blurred Shape Model (BSM): (a) original image, (b) shape pixel distance estimation with respect to neighbouring centroids, (c) 16-region blurred shape. Extracted from [6].

7.2 Learning and retrieval

Organizing features

The goal of a learning stage is to acquire new knowledge, behaviours, skills, values, preferences or understanding, and it may involve synthesizing different types of information. The learning process of this work consists in extracting features from the words, calculating the distance between them, and grouping them with respect to that distance (the clustering process).

Clustering processes have a drawback: knowing the number of clusters with which the observations of our experiments are best grouped. This approach has a hierarchical structure of clusters, and each layer of the structure has to be clustered using one of the methods explained in section 4. The first layer uses the direct method to choose the number of clusters.
17


layer uses an automatic method. It uses the Davies-Bould<strong>in</strong>g <strong>in</strong>dex to choose the best number of<br />

clusters. Both layers uses k-means as cluster<strong>in</strong>g algorithm.<br />

The first layer uses direct method because we have obta<strong>in</strong>ed several hierarchical structures,<br />

more concretely we have compute from 3 to 30 clusters. Second layer uses an <strong>in</strong>dex to choose<br />

the best number of clusters, because the number of experiments is exponential as the number of<br />

clusters.<br />
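A minimal sketch of this cluster-number selection, assuming scikit-learn (not necessarily the toolkit of the original experiments), could look as follows.

```python
# Run k-means for a range of k and keep the k with the lowest (best)
# Davies-Bouldin index, as done for the second layer.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score

def cluster_with_davies_bouldin(features: np.ndarray, k_range=range(3, 31)):
    best_k, best_score, best_labels = None, np.inf, None
    for k in k_range:
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(features)
        score = davies_bouldin_score(features, labels)   # lower is better
        if score < best_score:
            best_k, best_score, best_labels = k, score, labels
    return best_k, best_labels
```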

Searching words

The process of searching a word consists in, given a query word image, classifying it with respect to the different clusters. Once the most similar cluster is selected, the method "spots" the word instances in the document images. Classification is a procedure in which individual items are placed into groups based on quantitative information about one or more characteristics (features) inherent in the items, and on a training set (from the learning process) of previously labelled items.

To classify words we have used a k-NN approach: an algorithm that assigns each sample (the feature vector extracted from the word) to one of the groups created in the learning process, using the nearest-neighbour method.

The classification process is used in both levels of this work, and k-NN is used in both levels to classify our observations into the clusters established previously.
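A minimal sketch of this nearest-neighbour assignment, again assuming scikit-learn, is shown below.

```python
# Assign each query feature vector to the cluster of its closest training
# sample (nearest-neighbour rule, k = 1).
from sklearn.neighbors import KNeighborsClassifier

def assign_to_clusters(train_features, cluster_labels, query_features):
    knn = KNeighborsClassifier(n_neighbors=1)        # nearest-neighbour method
    knn.fit(train_features, cluster_labels)
    return knn.predict(query_features)               # cluster id per query word
```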

8. Pseudo-Structural descriptor organized in a hash structure

The second approach of this work is feature-oriented: it does not matter where a feature appears in the image. This approach uses an indexation table to organize the observations of the experiments. It is characteristic-point centred, i.e. the indexation terms are individual features, so words are detected through a voting process, whereas in the previous approach the feature vectors vary depending on the position of the word content.

The features used in this approach are invariant under translation of the word. There is no need to centre or left-justify all the observations of the same word to obtain good results.

8.1 Feature extraction

The characteristic Loci features were devised by Glucksman and applied to the classification of mixed-font alphabetic characters, as described in [8]. A characteristic Loci feature is composed of the number of intersections in the four directions (up, down, right and left). For each background pixel in a binary image, and for each direction, we count the number of intersections (an intersection is a black/white transition between two consecutive pixels). We then obtain a number composed of the intersection counts in the four directions (Fig. 12). The feature vector consists of the histogram of these intersection counts.

This work presents a new feature descriptor based on the characteristic Loci features. We have introduced three variations of the basic descriptor:

• We have added the two diagonal directions, as shown in figure 12. This gives more information to the feature and more robustness to the method.

• The number of intersections is quantized: we have bounded the number of intersections into intervals, and each direction has a different interval. This bounding makes the feature more robust.

• Two modes are implemented to compute the feature vector, using either background or foreground pixels as reference.

To obtain the number of intersections for each direction, a thinning operator is previously applied to the image. Thinning extracts the skeleton of the image, consisting of lines one pixel wide.
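As a small illustration of this step, assuming scikit-image (an assumption of the sketch, not the original toolkit):

```python
# Reduce the binary word image to one-pixel-wide strokes before counting
# intersections.
from skimage.morphology import skeletonize

def thin_word(word_img):
    return skeletonize(word_img > 0)   # boolean skeleton, 1-pixel-wide lines
```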

Figure 12: Characteristic Loci feature of a single point of the word "page".

The feature vector is computed by assigning a number to each background (or foreground) pixel, as shown in Fig. 12. The features are computed according to the number of intersections counted from each background pixel of the image in the rightward, upward, leftward and downward directions. In previous works, the characteristic Loci method has been applied to digit and isolated letter recognition. In this work, to reduce the dimension of the feature space, the number of intersections has been quantized to 3 values (0, 1 and 2). By limiting the number of possible values we reduce the number of combinations. The length of the feature vector grows exponentially with the number of possible values: for example, with 3 possible values and 8 directions we obtain 3^8 = 6561 combinations, whereas with 4 possible values we would have 4^8 = 65536. The computational cost (and time) increases in the same way.

The characteristic Loci feature was designed for digit and isolated letter recognition, and the number of intersections was bounded; the original approach uses the same interval in all directions. In this work we have also bounded and normalized the number of intersections, but for each direction we have defined a different interval for each value. The horizontal direction has a wider interval than the vertical one: in the original approach the digits or characters have a similar height and width, whereas in our case the width of the words is usually larger than the height, so the ranges of the intervals are chosen in accordance with the word dimensions. The intervals of the diagonal directions are a combination of the other two. Table 1 shows the intervals for each direction.

Table 1: Intervals for each direction in the characteristic Loci feature.

                 Value 0    Value 1    Value 2
    Vertical     {0}        [1, 2]     [3, +∞)
    Horizontal   {0}        [1, 4]     [5, +∞)
    Diagonal     {0}        [1, 3]     [4, +∞)

According to the above encoding, an eight-digit number in base 3 is obtained for each background pixel. For instance, the locus number of point P in Fig. 12 is (22111122)_3 = (6200)_10. The locus numbers therefore range from 0 to 6560 (3^8 - 1). This is done for all background pixels, so the dimension of the feature space becomes 6561. Each element of the feature vector represents the total number of background pixels whose locus number corresponds to that element.
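The sketch below illustrates this encoding for the background-pixel mode; the direction order and the Table 1 intervals are taken from the text, but the implementation details are an illustrative reading rather than the original code.

```python
# For every background pixel of a thinned binary word image, count stroke
# crossings along 8 directions, quantize them per direction with the Table 1
# intervals, and encode the result as an 8-digit base-3 locus number.
import numpy as np

# (dy, dx) for the 8 directions and the start of the "value 2" interval for
# each of them (vertical: 3, horizontal: 5, diagonals: 4, as in Table 1).
DIRECTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1),          # vertical, horizontal
              (-1, -1), (-1, 1), (1, -1), (1, 1)]        # diagonals
VALUE2_START = [3, 3, 5, 5, 4, 4, 4, 4]

def loci_histogram(skeleton: np.ndarray) -> np.ndarray:
    h, w = skeleton.shape
    hist = np.zeros(3 ** 8, dtype=np.int64)
    for y in range(h):
        for x in range(w):
            if skeleton[y, x]:                           # only background pixels
                continue
            locus = 0
            for d, (dy, dx) in enumerate(DIRECTIONS):
                crossings, prev, cy, cx = 0, False, y, x
                while 0 <= cy < h and 0 <= cx < w:       # walk to the image border
                    cur = bool(skeleton[cy, cx])
                    crossings += cur and not prev        # entering a stroke = one crossing
                    prev, cy, cx = cur, cy + dy, cx + dx
                value = 0 if crossings == 0 else (1 if crossings < VALUE2_START[d] else 2)
                locus = locus * 3 + value                # base-3 encoding
            hist[locus] += 1
    return hist
```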

8.2 Learning and retrieval

Organizing features

The retrieval process of this approach consists in organizing the features in a look-up table M (Fig. 13). The columns of M represent the words (w) of the documents used in this experiment. The rows correspond to all the possible combinations that can appear using characteristic Loci features (f). M(f, w) indicates that feature f is present in word w. In this work we have 8 directions and each one has three different values, so there are 3^8 (= 6561) possible combinations. The feature vector is the histogram over all the possible combinations.
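A minimal sketch of this organization, where M(f, w) stores how many times locus number f occurs in word w (assuming the histograms have already been computed, e.g. with the loci sketch above):

```python
# Stack one characteristic-Loci histogram per segmented word into the
# look-up table M: rows = 3^8 possible locus numbers, columns = words.
import numpy as np

def build_lookup_table(word_histograms):
    """word_histograms: list of length-6561 histograms, one per word."""
    M = np.column_stack(word_histograms)     # M[f, w] = count of feature f in word w
    return M
```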

Figure 13: Steps of the pseudo-structural descriptor organized in a hash structure.


Searching words

The classification process consists in searching for the best match between the query and all the words of M (Fig. 13). The chosen query is used to extract its feature vector, which is then matched against all the words of the ground truth. We have used the Euclidean distance for the matching. Once all the distances are computed, we select the words under a selected threshold.
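A minimal sketch of this matching step; the threshold is a free parameter (the experiments in section 9.4 vary it between 25 and 600):

```python
# Euclidean distance between the query histogram and every column of M,
# followed by thresholding to "spot" the matching words.
import numpy as np

def spot_word(M: np.ndarray, query_hist: np.ndarray, threshold: float):
    dists = np.linalg.norm(M - query_hist[:, None], axis=0)   # one distance per word
    return np.flatnonzero(dists < threshold)                  # indices of spotted words
```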

9. Experimental results

In order to validate the proposed methodology, we describe our performance evaluation protocol in terms of the data used, comparatives, metrics, and experiments.

9.1 Data

Our approach has been evaluated with a ground truth composed of 50 documents extracted from volume 69. In these documents, all the instances of 21 words are labelled. All the documents of the ground truth were written by the same author. The difficulties that we may face with these documents are: illumination changes, partial occlusions, warping effects in the document, ink bleed-through, etc. Some samples of the documents are shown in figure 14.

Figure 14: Samples of the documents of the ground truth.

We have also used a subset of the ground truth. It consists of the first 20 documents of our ground truth and contains the first 10 classes of the original one. This subset has been used in some experiments in order to facilitate the data analysis: with a reduced ground truth and fewer classes, some visual results are easier to understand.

9.2 Comparatives

The experiments of this work are separated into 3 groups: those that evaluate the segmentation process, the ones that evaluate the first approach and, finally, the experiments that evaluate the second approach.


The segmentation experiments evaluate the accuracy of the word segmentation. The segmented word and the labelled word are overlapped in order to check whether they are the same word. Different thresholds of overlapping percentage are used to evaluate the accuracy of the segmentation process.

The first approach has two types of experiments. The first one evaluates how the clustering process behaves; the second one evaluates the accuracy of the retrieval process:

• One experiment shows the relation between the chosen basic features using 2D plots.

• By means of visual results, we observe the distribution of the observations of our ground truth in the clusters.

• We evaluate the accuracy, homogeneity and completeness of the clustering using the V-measure (explained in section 9.3).

• The accuracy of the retrieval process is evaluated by means of a precision-recall curve.

The second approach is evaluated by means of precision-recall curves:

• Two experiments assess the accuracy of this approach using different characteristic pixels (background and foreground pixels).

• Both types of characteristic pixels are compared against each other.

9.3 Metrics

One drawback of the clustering process is the proper selection of the number of clusters. The learning process consists in grouping the observations into different clusters. The ideal solution is achieved when all the instances of the same word are in the same cluster, and each cluster contains instances of only one word. The results of the retrieval process depend on the accuracy of the clustering process.

The evaluation of the clustering process has been done using the V-measure [22]. The V-measure is an entropy-based measure which explicitly measures how successfully the criteria of homogeneity and completeness have been satisfied. It is computed as the "mean" of the distinct homogeneity and completeness scores, and it can be weighted to favour the contribution of homogeneity or completeness. A clustering result satisfies homogeneity if each of its clusters contains only data points which are members of a single class, and it satisfies completeness if all the data points that are members of a given class are elements of the same cluster.

The retrieval process is evaluated using precision-recall curves:

    recall = (number of relevant items retrieved) / (number of relevant items in the collection)        (4)

    precision = (number of relevant items retrieved) / (total number of items retrieved)                (5)
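As a sketch of how these measures could be computed, assuming scikit-learn (an assumption of this example):

```python
# V-measure, homogeneity and completeness for the clustering evaluation,
# and a precision-recall curve for the retrieval evaluation.
from sklearn.metrics import (homogeneity_score, completeness_score,
                             v_measure_score, precision_recall_curve)

def evaluate_clustering(true_classes, cluster_labels):
    return (homogeneity_score(true_classes, cluster_labels),
            completeness_score(true_classes, cluster_labels),
            v_measure_score(true_classes, cluster_labels))

def evaluate_retrieval(is_relevant, scores):
    # is_relevant: 1 if a retrieved word matches the query class, else 0;
    # scores: similarity (e.g. negated distance) used to rank the retrieved words.
    precision, recall, _ = precision_recall_curve(is_relevant, scores)
    return precision, recall
```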


9.4 Experiments

We now present the experiments done in this work. We show the results for the pre-processing step and for the two approaches developed.

Pre-processing

The pre-processing step has the objective of segmenting the words of the documents. The performance of the next steps depends on the results obtained from this stage. The segmentation process is evaluated in terms of the words found with respect to the ground truth.

We have used both the complete and the reduced ground truth in order to evaluate the segmentation process. After segmenting the words from the documents, we have matched these words with the labelled words of the ground truth. Each segmented word is compared with the words of the ground truth by observing the percentage of overlapping (Fig. 15). Words whose bounding boxes overlap by more than 40% are considered the same word.

Figure 15: Examples of a correct overlapping (left) and an incorrect overlapping (right).
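A minimal sketch of the overlap criterion; since the text does not spell out the exact definition of the overlapping percentage, it is read here as the intersection area divided by the ground-truth box area, which is an assumption of this example.

```python
# Bounding-box overlap check for matching segmented words against the
# labelled ground-truth words.
def overlap_ratio(seg_box, gt_box):
    """Boxes given as (x0, y0, x1, y1)."""
    ix0, iy0 = max(seg_box[0], gt_box[0]), max(seg_box[1], gt_box[1])
    ix1, iy1 = min(seg_box[2], gt_box[2]), min(seg_box[3], gt_box[3])
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
    gt_area = (gt_box[2] - gt_box[0]) * (gt_box[3] - gt_box[1])
    return inter / gt_area if gt_area else 0.0

def same_word(seg_box, gt_box, threshold=0.4):
    return overlap_ratio(seg_box, gt_box) > threshold
```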

Table 2 shows the results of applying our method to the different ground truths using different thresholds. In both cases the same behaviour is observed: with a small overlapping threshold the percentage of words found is high, but as the threshold increases the percentage decreases. We observe that, using the ground truth with 50 documents and 20 classes and a threshold of 0.1, the accuracy is over 100%: a word label may contain part of the next word, so when the two segmented words are compared, both match the same label.

We observe that the results stay stable until a threshold value of 40% is reached, and then the accuracy decreases. In order to obtain good results and reduce the number of errors explained above, we have used 40% as the threshold in our experiments.

Pixel-based descriptors organized in a hierarchical structure

The main problem with a clustering algorithm is choosing the number of clusters that groups the observations best. We have carried out some experiments to find the best number of clusters.

The first layer of our architecture is formed by clusters constructed in terms of basic features. The second layer uses BSM features. A key parameter in the BSM feature computation is the number of bins of the grid used to obtain the histogram. In the first experiment we evaluate the performance depending on the number of bins, measured in terms of the V-measure.


Table 2: Pre-processing results. The ground truth is composed of 50 documents and 20 classes and has 6718 labelled words. The subset of the ground truth is composed of 30 documents and 10 classes and has 3101 labelled words.

                 Subset (3101 words)        Ground truth (6718 words)
    Threshold    found       accuracy       found       accuracy
    0.1          3129        100.90%        6712        99.91%
    0.2          3026        97.58%         6540        97.35%
    0.3          2987        96.32%         6447        95.97%
    0.4          2937        94.71%         6351        94.54%
    0.5          2806        90.49%         6058        90.18%
    0.6          2407        77.62%         5052        75.20%
    0.7          1495        44.57%         2994        44.57%
    0.8          466         15.03%         959         14.28%
    0.9          35          1.13%          85          1.27%
    1.0          0           0.00%          0           0.00%

Figure 16 shows the V-measure with different numbers of bins. We see that between 14 and 17 bins the function stops increasing and stabilizes, so we have selected 17 bins for all the remaining experiments of this work; increasing the number of bins further does not lead to better results.

Figure 16: V-measure increases as we increase the number of bins in the BSM algorithm.

The second experiment shows the relation between all the possible combinations of the selected features using 2D plots. We have chosen 7 different basic features: height, width, filled area, centroid, perimeter, eccentricity and Euler number. The results are similar for all the combinations: the points of the different observations lie close together, but in the plots we see that the classes are separated. The best result corresponds to the combination of the features height and width (Fig. 17): the observations are close together, yet separated into different zones.

Figure 17: Distribution of the basic features width and height.

The third experiment shows the distributions of the observations over the clusters. We have used the 7 basic features in this experiment and clustered with 3 to 30 clusters. Figure 18 represents how the observations of each class have been distributed over 20 clusters using the 7 basic features. Each column represents a cluster and each row a class (word). The classes are spread over different clusters and, because of that, the classification results are affected and can be confusing. By reducing the number of clusters we observe that all the observations of each class fall in the same cluster, but each cluster then contains more than one class. Conversely, increasing the number of clusters makes the dispersion of the observations greater.

The next experiment is similar to the previous one, but in this case we have evaluated the BSM features, which form the second layer of our architecture. Figure 19 shows how the observations of the classes are distributed over the clusters. We observe that the observations of each class are more concentrated in a single cluster than when using basic features. Increasing or reducing the number of clusters causes the same problems as with the basic features.

In the following experiment we have evaluated the performance of the clustering process using the V-measure. Figure 20 compares the V-measure for different combinations of basic features and ground truths. We observe that using a smaller ground truth with fewer classes the results are worse. Using the same ground truth, the results obtained with only height and width are similar to those obtained with all the basic features. We conclude that the more observations and classes we use, the better the accuracy of the clustering, and that using BSM features in the clustering yields better results than using basic features.


Figure 18: Distribution of the observations in the clusters using basic features.

The ideal solution in the clustering process is to obtain 100% in both completeness and homogeneity. In our case we have not obtained an ideal solution, so we have to choose a trade-off between both measures. In figure 21 we observe two plots for each experiment: β = 0 means that the plot measures homogeneity and β = 1 means that the plot measures completeness. For each experiment we observe that with a small number of clusters the homogeneity is low and the completeness is good. As the number of clusters increases, the homogeneity increases and the completeness decreases. The best number of clusters for each experiment is where both plots cross; for example, the best number of clusters for the BSM features is 15.

The retrieval experiment evaluates the accuracy of the retrieval process. We have run several experiments using different combinations of basic features, the subset of the ground truth and the BSM features (Fig. 22). We observe that the worst results are obtained when we use the ground truth with all the basic features. The best results are obtained with the BSM features, followed by the experiment using the basic features height and width; using all the basic features gives the worst results.

In the last experiment we evaluate the performance in terms of scalability (an increasing number of documents and classes) and of the descriptor. Using the same descriptor with different numbers of documents and classes, the accuracy is better with fewer classes. We also observe that the BSM descriptor is more descriptive and more accurate: the performance improves even when using the larger ground truth, compared with the best result on the smaller ground truth.


Figure 19: Distribution of the observations in the clusters using BSM features.

In conclusion, the performance is more sensitive to the accuracy (descriptive power) of the descriptor. With the same descriptor, the larger the number of classes, the higher the confusion, so the performance decreases.

Pseudo-Structural descriptor organized in a Hash Structure

Our second approach is evaluated using precision-recall curves. These experiments are done by tuning two parameters: the mask size and the threshold used to decide whether an observation is a member of a class or not. To obtain the Loci features we have used masks of different sizes to obtain the number of intersections for each pixel in all directions (Fig. 23).

There are two options in the feature extraction step: the first one uses the background pixels as reference to obtain the feature vector, and the second one uses the foreground pixels as reference.

In figure 24b we observe the results using background pixels as reference, for different mask sizes and varying the threshold over the following values: 25, 50, 100, 200, 300, 400, 500 and 600. We observe that the results improve as the mask size increases, but once the mask reaches 80 pixels or more the results are very similar: since the mean height of the words is about 80 pixels, increasing the mask size further does not add information.

The use of the foreground as reference pixels gives a similar performance (Fig. 24a). However, if we compare the results using foreground and background pixels as reference (Fig. 25), we observe that the results are better with background pixels: using background pixels as reference, the number of pixels contributing information to the feature vector is higher than when using foreground pixels.

Figure 20: Comparative using different features.

9.5 Discussions

In this work we have used a ground truth composed of 50 documents and 20 classes, and a subset of the first one composed of 30 documents and 10 classes. As expected, we have observed that the results improve when a larger number of observations per class is used.

We have used two different approaches in this work. In the hierarchical approach the first layer is created using basic features. The features that give the best classification performance are width and height, and the optimal number of clusters is three: small, medium and big words. Using this simple clustering we separate the observations into three categories, and the classification process chooses the cluster taking into account only the size of the word. The second layer uses BSM features: the observations of each cluster of the first layer are grouped using BSM features. The results show that there is some confusion between some words, and a third layer, using another kind of feature, would help the classification to obtain better results.

The second approach has obtained the best results. This process does not depend on how good the segmentation process was and is more stable than the first one, as shown by the results reported in this work.


Figure 21: Choosing the best number of clusters. β = 0 means homogeneity; β = 1 means completeness.

We have observed that the clustering algorithm used in the first approach does not perform well on the selected corpus. We have run some experiments with the Self-Organizing Map (SOM, http://www.cis.hut.fi/somtoolbox/) as introductory work for future endeavours. A SOM is a type of artificial neural network that is trained using unsupervised learning to produce a low-dimensional, discrete representation of the input space of the training examples, called a map. Appendix A contains some figures with the results of this algorithm. Figure 26 shows a map of the observations of the training set using BSM features; each cell represents a different cluster, and each colour a different class. We observe that the observations of each class are grouped in nearby clusters. Figure 27 shows a similar map, but using characteristic Loci features; in this case the observations are more concentrated in the same clusters.
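The experiments above relied on the SOM Toolbox for MATLAB linked in the text; purely as a hedged illustration, a similar map could be trained in Python with the MiniSom package, which is an assumed substitute rather than the tool used in this work:

```python
# Train a self-organizing map on word descriptors and map each observation
# to its best-matching cell of the map.
import numpy as np
from minisom import MiniSom

def train_som(features: np.ndarray, grid=(10, 10), iterations=5000):
    som = MiniSom(grid[0], grid[1], features.shape[1], sigma=1.0, learning_rate=0.5)
    som.random_weights_init(features)
    som.train_random(features, iterations)
    return som, [som.winner(x) for x in features]   # best-matching unit per word
```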

10. Conclusions

Word-spotting appears to be an attractive alternative to the seemingly obvious recognize-then-retrieve approach to historical manuscript retrieval. With the capability of matching word images in a quick and accurate way, partial transcriptions of a collection can be achieved with reasonable accuracy and little human interaction, and the results improve as the number of observations in the training set increases. Word-spotting can automatically identify indexing terms, making it possible to use costly human labour more sparingly than a full transcription would require.

Figure 22: Classification process using different basic features and BSM features.

Figure 23: Examples with different mask sizes: (a) 10 pixels; (b) 100 pixels.

In this work we have introduced two approaches to word spotting. Both approaches use predefined descriptors, and the observations are stored in different structures. Our first approach is based on pixel-based descriptors organized in a hierarchical structure. Our second approach is based on a pseudo-structural descriptor organized in a hash structure.

Experimental results of the first approach show that the selected basic features cannot discriminate the observations well enough. The results are quite confused, but they can be used to perform a fast rejection, separating, for example, small, medium and big words.


Figure 24: Different mask sizes using different characteristic pixels as reference: (a) foreground pixels as reference; (b) background pixels as reference.

Figure 25: Comparative results using background and foreground pixels as reference.

Comparing the descriptors used in the first approach (basic and BSM features), we conclude that BSM features work better than basic features when using k-means as clustering algorithm.

Concerning the results of the second approach, we have obtained results using both the background and the foreground as characteristic pixels. In both cases the performance is similar, but if we compare them we conclude that the results obtained with background pixels are better than those obtained with foreground pixels, because the number of background pixels is higher and the amount of stored information is bigger.

Comparing both approaches, we can conclude that the second one, using a pseudo-structural descriptor organized in a hash structure, leads to better results. The reason is that in this case we use a descriptor whose performance does not depend on the localization of the characteristic pixels in the image, which makes it more robust than the descriptors of the first approach.

11. Future Work

The results obtained in this work, although preliminary, are encouraging enough to continue with further research in different directions. Let us sketch the major ones.

The corpus of this work is composed only of documents of volume 69, and the writer is the same in all the documents. The whole Barcelona Marriage Records collection is composed of 244 books. One continuation path is to extend this work in order to manage the resulting scalability problem, which raises issues such as different handwriting patterns, different document structures, or variations even between instances of the same word. Our further research has to be oriented to large databases, like the Barcelona Marriage Records collection or other large collections of handwritten documents.

Another research line is focused on searching for a more robust descriptor, i.e. structural shape descriptors, and on using other clustering algorithms in order to reduce the dimensionality of our dataset, with the final objective of improving our current results. As observed in section 9.5, we have run several experiments using Self-Organizing Maps with promising results.

One final research line, perhaps more oriented to a concrete corpus, could be to use the structural information of the documents. The idea here is to use structural and incremental learning methods to predict the words that may appear in conjunction with some others (generation of dictionaries).

12. Acknowledgements

First and foremost, I would like to thank my supervisors, Josep Lladós for his valuable guidance and advice, and Alicia Fornés for her help and for the several times that she has read my report (thank you!). They have helped me learn a lot about computer vision and, more concretely, about document analysis, and they have had enough patience to show me everything I have needed for this work.

An honourable mention goes to my family and friends for their understanding and support in developing this project. I want to make a special mention of Ekain, Jon, Jorge, Lluis, Monica and Toni, the best master partners (and now my friends). They have always been there when I was lost or needed somebody to talk to.


References

[1] Ch. Choisy. Dynamic handwritten keyword spotting based on the NSHP-HMM. Ninth International Conference on Document Analysis and Recognition (ICDAR 2007), 1:242–246, September 2007.

[2] D.L. Davies and D.W. Bouldin. A cluster separation measure. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1:224–227, 1979.

[3] A. Ebrahimi and E. Kabir. A pictorial dictionary for printed Farsi subwords. Pattern Recognition Letters, 29:656–663, April 2008.

[4] R. Ebrahimpour, M. R. Moradian, A. Esmkhani, and M. Farzad. Recognition of Persian handwritten digits using Characterization Loci and Mixture of Experts. International Journal of Digital Content Technology and its Applications, 3:42–46, 2009.

[5] S. Escalera, A. Fornés, O. Pujol, P. Radeva, G. Sánchez, and J. Lladós. Blurred Shape Model for binary and grey-level symbol recognition. Pattern Recognition Letters, 30:1424–1433, November 2009.

[6] A. Fornés, S. Escalera, J. Lladós, G. Sánchez, and O. Pujol. Handwritten symbol recognition by a boosted blurred shape model with error correction. Iberian Conference on Pattern Recognition and Image Analysis (IbPRIA), 1:13–21, 2007.

[7] B. Gatos, I. Pratikakis, and S. Perantonis. Adaptive degraded document image binarization. Pattern Recognition, 39:317–327, March 2006.

[8] H.A. Glucksman. Classification of mixed-font alphabets by characteristic loci. Proc. IEEE Comput. Conf., pages 138–141, September 1967.

[9] E. Keogh. Exact indexing of dynamic time warping. Knowledge and Information Systems, 7:358–386, May 2005.

[10] K.M. Knill and S.J. Young. Speaker dependent keyword spotting for accessing stored speech. Engineering, 1994.

[11] L.I. Kuncheva. Combining Pattern Classifiers: Methods and Algorithms. Wiley-Interscience, 2004.

[12] Y. Leydier, F. Lebourgeois, and H. Emptoz. Text search for medieval manuscript images. Pattern Recognition, 40:3552–3567, December 2007.

[13] R. Manmatha, C. Han, E.M. Riseman, and W.B. Croft. Indexing handwriting using word matching. Proceedings of the First ACM International Conference on Digital Libraries, 1:159, 1996.

[14] R. Manmatha and J.L. Rothfeder. A scale space approach for automatically segmenting words from historical handwritten documents. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27:1212–1225, 2005.

[15] G. Nagy. Twenty years of document image analysis in PAMI. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22:38–62, 2000.

[16] W. Niblack. An Introduction to Digital Image Processing. Strandberg Publishing Company, Birkeroed, Denmark, 1985.

[17] N. Otsu. A threshold selection method from gray-level histograms. IEEE Transactions on Systems, Man, and Cybernetics, 9:62–66, 1979.

[18] V. Papavassiliou, T. Stafylakis, V. Katsouros, and G. Carayannis. Handwritten document image segmentation into text lines and words. Pattern Recognition, 43:369–377, January 2010.

[19] T.M. Rath and R. Manmatha. Word image matching using dynamic time warping. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'03), 2:521–527, 2003.

[20] T.M. Rath and R. Manmatha. Word spotting for historical documents. International Journal of Document Analysis and Recognition (IJDAR), 9:139–152, August 2006.

[21] T.M. Rath, R. Manmatha, and V. Lavrenko. A search engine for historical manuscript images. Proceedings of the 27th Annual International Conference on Research and Development in Information Retrieval (SIGIR '04), 1:369, 2004.

[22] A. Rosenberg and J. Hirschberg. V-measure: A conditional entropy-based external cluster evaluation measure. Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), 1:410–420, 2007.

[23] J.L. Rothfeder, S. Feng, and T.M. Rath. Using corner feature correspondences to rank word images by similarity. Computer Vision and Pattern Recognition Workshop (CVPRW'03), 3:30–35, 2003.

[24] P. Rousseeuw. Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics, 20:53–65, November 1987.

[25] C. Schmid, R. Mohr, and C. Bauckhage. Evaluation of interest point detectors. International Journal of Computer Vision, 37:151–172, 2000.

[26] S. Srihari, H. Srinivasan, P. Babu, and Ch. Bhole. Handwritten Arabic word spotting using the CEDARABIC document analysis system. Proceedings of the 2005 Symposium on Document Image Understanding Technology, 1:123, 2005.

[27] C. Wolf. Document ink bleed-through removal with two hidden Markov random fields and a single observation field. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32:431–447, March 2010.

[28] D. Zhang and G. Lu. Review of shape representation and description techniques. Pattern Recognition, 37:1–19, January 2004.


Appendix A. Self-Organizing Maps (SOM) results

Figure 26: SOM using BSM features.

Figure 27: SOM using characteristic Loci features.
