Handwritten Word Spotting in Old Manuscript Images using Shape ...

a step forward towards shortening the semantic gap between recognizing individual visual objects and understanding the whole document content in a given context. It does not involve the pure transcription of documents, but the retrieval and linkage of semantic knowledge from large collections of document images stored in digital repositories.

There is an increasing interest in digitally preserving and providing access to historical document collections in libraries, museums and archives. The conversion of historical document collections to digital archives is of prime importance to society, both in terms of information accessibility and long-term preservation. Handwritten documents are commonly found in historical archives. Examples are unique manuscripts written by well-known scientists, artists or writers, as well as letters, trade forms or administrative documents kept by parishes or municipalities, which help to reconstruct historical sequences in a given place or time. While machine-printed documents, under minimal conditions, are easy to read by OCR systems, the recognition of handwriting is still a scientific challenge. The state of the art achieves good performance only in constrained domains or with small vocabularies.

Mass digitization of historical documents is performed using specialized scanners. These scanners produce good-quality images without physically damaging the documents. Afterwards, image processing is usually applied to enhance the images and ease visual inspection. The problem is the degradation of the documents caused by a lifetime of use. Degradation can appear for several reasons: non-stationary noise due to illumination changes, curvature of the document, ink and holes in the document, ink show-through (the appearance of the verso-side text or graphics on the scanned image of the recto side), low contrast, warping effects, etc.
Some centuries ago, the ink used for writing contained oxide particles that contribute to degrading the paper, so that the words on the verso of the page can be seen on the side being analysed. This effect is known as bleed-through. Some methods have been developed for improving the quality of such images [7; 27].

There are many historical handwritten documents with information that can be used for several studies and projects. The Document Image Analysis and Recognition community is interested in preserving these documents and extracting all the valuable information from them. There are two ways to extract the information: transcribing documents (word by word) and word-spotting. Handwritten word-spotting refers to the problem of detecting specific keywords in handwritten document images. A model is provided as a query, and the goal is to retrieve all the occurrences in a word image database (or regions of a document collection) that are close to the query in terms of a specific dissimilarity measure.

One of the problems with these documents, however, is access to them. The majority of the material is only physically accessible, and only a few authorized people can access it. Nowadays, thousands of digitized documents remain unused because they are not indexed. There are several levels of indexing in terms of metadata, from the name of the author and a brief history of the book to a full text transcription. Nevertheless, no single technique allows us to index the documents correctly. During the last decades these techniques have experienced great improvements, and error rates have dropped to a level that makes commercial applications feasible. Traditional optical character recognition (OCR) systems fail to process handwritten documents, and they are only suitable for modern printed documents.
However, off-line handwritten text recognition systems, which take an image of a piece of handwriting as input, work properly with restricted vocabularies.
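The retrieval idea described above (a query model compared against a word image database under a dissimilarity measure) can be sketched in a few lines. This is only an illustrative sketch under stated assumptions: each word image is assumed to be already reduced to a fixed-length descriptor vector, the Euclidean distance stands in for the unspecified dissimilarity measure, and the names `spot`, `database` and the toy vectors are hypothetical rather than taken from this work.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length descriptor vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def spot(query, database, top_k=3):
    """Rank database words by dissimilarity to the query descriptor.

    `database` maps a word-image identifier to its descriptor vector
    (hypothetical layout, assumed for illustration only).
    Returns the top_k (identifier, distance) pairs, closest first.
    """
    scored = [(word_id, euclidean(query, desc)) for word_id, desc in database.items()]
    scored.sort(key=lambda pair: pair[1])
    return scored[:top_k]


# Toy descriptors: w1 and w3 are meant to be instances of the same word.
database = {
    "w1": [0.90, 0.10, 0.30],
    "w2": [0.20, 0.80, 0.50],
    "w3": [0.85, 0.15, 0.35],
}
query = [0.88, 0.12, 0.30]
results = spot(query, database, top_k=2)
for word_id, dist in results:
    print(word_id, round(dist, 4))
```

In a real system the ranked list would typically be cut off by a distance threshold rather than a fixed `top_k`, since the number of occurrences of the query word is not known in advance.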
Handwriting word-spotting is the pattern classification task that consists of detecting words in handwritten document images. In this dissertation, we are concerned with the detection of several words in our documents. In documents where all pages are written by the same author (or a few authors), the images of multiple instances of the same word are likely to look similar. Word-spotting [20] treats a collection of documents as a collection of words. The first step consists of segmenting the document into word images; then pairwise “distances” between word images are calculated and used to cluster all words with similar features. Ideally, each cluster contains all the samples of the same word.

There are two types of word-spotting approaches, depending on how the input is specified: query-by-string and query-by-example. In query-by-string, character models are trained in advance; at run time, the character models are combined to form words and the probability of each word is evaluated. In query-by-example, the input is an image of the word to search for, and the output is a set of the images most representative of the query word.

Problem statement

This work addresses the problem of handwritten word spotting in historical manuscripts. Whereas earlier approaches are based on contextual methods such as Hidden Markov Models (HMM) or Dynamic Time Warping (DTW), which use the sequential information of graphemes in a word, we propose a holistic approach using shape matching techniques. We propose two approaches. The first one uses a pixel-based descriptor tolerant to distortions. The second one is inspired by the characteristic Loci feature and allows pseudo-structural information to be aggregated in the descriptor. A collection of handwritten documents, explained in more detail in the following sections, is used in this work.
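For reference, the sequential DTW baseline mentioned above can be illustrated with a minimal sketch. It is not the holistic method proposed in this work: it is the textbook dynamic-programming formulation, and it assumes each word image has been reduced to a one-dimensional sequence of column features (for example, an ink-count profile). The function name `dtw_distance` and the toy profiles are hypothetical.

```python
def dtw_distance(seq_a, seq_b):
    """Classic DTW cost between two 1-D feature sequences.

    cost[i][j] is the minimal accumulated cost of aligning the first i
    elements of seq_a with the first j elements of seq_b.
    """
    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(seq_a[i - 1] - seq_b[j - 1])
            # Extend the cheapest of the three admissible predecessor cells.
            cost[i][j] = local + min(
                cost[i - 1][j],      # seq_a advances, seq_b element repeats
                cost[i][j - 1],      # seq_b advances, seq_a element repeats
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]


# Hypothetical column profiles: the second is a horizontally stretched
# version of the first, as if the same word were written wider.
profile_a = [0, 2, 5, 2, 0]
profile_b = [0, 2, 2, 5, 5, 2, 0]
profile_c = [5, 5, 0, 0, 5]

stretched = dtw_distance(profile_a, profile_b)
different = dtw_distance(profile_a, profile_c)
print(stretched, different)
```

The stretched profile aligns at zero cost because DTW lets one element match several consecutive ones, which is the elasticity that makes it attractive for handwriting of variable width. A holistic descriptor instead compares the word shape as a whole.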
Objectives

As stated above, this work aims to develop shape descriptors for handwritten word spotting. In particular, the objectives are:

• To investigate different shape descriptors that describe handwritten words with invariance to variations in writer, acquisition conditions, etc. We aim to focus on pixel-based descriptors and structural ones.
• Based on the above descriptors, to define clustering criteria that allow building indexing structures for word spotting purposes.
• To define an experimental framework and construct a ground truth from the collection of a real application (Barcelona marriage records).

Outline of the approach

In our work we have used query-by-example. It consists of matching an input image with one or multiple query images to determine a distance that might indicate a correspondence. A spotting architecture consists of four tasks. First, a pre-processing step is performed. Second, a fast rejection is applied to the segmented words. Third, a normalization step is performed. And fourth,