CS315A Final Project Proposal May 3, 2010 - Stanford PPL

CS315A Final Project Proposal 

May 3, 2010 

Group Members: 

Ajay Gupta (agupta74) 

Scott Green (sagreen1) 

Tushar Sawant (tsawant) 

Tahrina Rumu (trumu) 

Project Option Selection: 

Programming 

Description of Topic: 

Optical character recognition (OCR) is the mechanical or electronic translation of
scanned images of handwritten, typewritten, or printed text into machine-encoded text.
OCR has been in development for almost 80 years: the first patent for an OCR machine
was filed in Germany by Gustav Tauschek in 1929, and an American patent followed in
1935. OCR has many applications, including the postal service, language translation,
and digital libraries, and it is even in the hands of the general public in the form of
mobile applications.

We are using an open source OCR engine called Tesseract as the basis for parallelization.
Development of Tesseract started in 1985 at Hewlett-Packard, further development
was done at the University of Nevada, Las Vegas, and the code was eventually released
as open source under the Apache 2.0 license. Google has used Tesseract extensively in
its Google Books project, which attempts to digitize the world's libraries.

Statement of Why: 

For many years OCR has been considered a solved problem. Major work on Tesseract
was completed around 1996, and only small modifications were made over the next ten
years. The emergence of commercially available multicore processors has opened up a
new field within computer science, and it is worth revisiting old problems in order to
achieve speed gains. In the case of Tesseract, we intend to parallelize the recognition
process in order to convert multi-page documents quickly and accurately. Speeding up
the recognition process frees computing resources, allowing more complex comparison
algorithms to be implemented. Beyond parallelization on multicore processors, there is
further room for improvement by running OCR code on GPUs. Our assertion is that OCR
is not a solved problem.

Description of Methods to Evaluate: 

Here is a brief description of the algorithm: 

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 

READING INPUT: 

The image is read and thresholded using an adaptive thresholding algorithm to
create a binary image.

******************Relevant Function in source code********************** 

// Copy the given image rectangle to Tesseract, with adaptive thresholding
// if the image is not already binary (in baseapi.cpp file)
void TessBaseAPI::CopyImageToTesseract(const unsigned char* imagedata,
                                       int bytes_per_pixel,
                                       int bytes_per_line,
                                       int left, int top,
                                       int width, int height)

******************************************************************** 
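To make the READING INPUT step concrete, here is a minimal sketch of one common form of adaptive thresholding: each pixel is compared with the mean of its local neighborhood. This is not Tesseract's implementation. The function name `AdaptiveThreshold` and the `radius` and `offset` parameters are our own assumptions for illustration.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical helper, not part of Tesseract: binarize an 8-bit grayscale
// image by comparing each pixel with the mean of its local neighborhood.
// Returns 1 for foreground (dark ink) and 0 for background.
std::vector<std::uint8_t> AdaptiveThreshold(const std::vector<std::uint8_t>& gray,
                                            int width, int height,
                                            int radius, int offset) {
  std::vector<std::uint8_t> binary(gray.size(), 0);
  for (int y = 0; y < height; ++y) {
    for (int x = 0; x < width; ++x) {
      // Clamp the window to the image borders.
      const int x0 = std::max(0, x - radius);
      const int x1 = std::min(width - 1, x + radius);
      const int y0 = std::max(0, y - radius);
      const int y1 = std::min(height - 1, y + radius);
      long sum = 0;
      for (int yy = y0; yy <= y1; ++yy)
        for (int xx = x0; xx <= x1; ++xx)
          sum += gray[yy * width + xx];
      const long count = static_cast<long>(x1 - x0 + 1) * (y1 - y0 + 1);
      // Ink if the pixel is darker than the local mean by more than offset
      // (written without division: pixel * count < sum - offset * count).
      const long pixel = gray[y * width + x];
      binary[y * width + x] = (pixel * count < sum - offset * count) ? 1 : 0;
    }
  }
  return binary;
}
```

Each output pixel depends only on the input image, so the rows can be computed independently. That independence is what makes this step a natural candidate for parallelization.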

OUTLINES/BLOBS: 

Connected-component analysis is performed to extract outlines, which are then nested
together to form blobs.
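As a simplified stand-in for this step, the sketch below groups connected foreground pixels by flood fill. Tesseract itself extracts outlines and nests them, so this only illustrates the idea. The name `LabelComponents` and the choice of 8-connectivity are assumptions of ours.

```cpp
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical helper, not Tesseract code: label 8-connected foreground
// regions of a binary image (1 = ink). labels receives 0 for background and
// 1..N for components; the return value is N.
int LabelComponents(const std::vector<std::uint8_t>& binary, int width,
                    int height, std::vector<int>* labels) {
  labels->assign(binary.size(), 0);
  int next_label = 0;
  std::vector<std::pair<int, int> > stack;  // Pixels waiting to be expanded.
  for (int y = 0; y < height; ++y) {
    for (int x = 0; x < width; ++x) {
      if (!binary[y * width + x] || (*labels)[y * width + x]) continue;
      // Unlabeled ink pixel: start a new component and flood-fill it.
      ++next_label;
      (*labels)[y * width + x] = next_label;
      stack.push_back(std::make_pair(x, y));
      while (!stack.empty()) {
        const int cx = stack.back().first;
        const int cy = stack.back().second;
        stack.pop_back();
        for (int dy = -1; dy <= 1; ++dy) {
          for (int dx = -1; dx <= 1; ++dx) {
            const int nx = cx + dx;
            const int ny = cy + dy;
            if (nx < 0 || ny < 0 || nx >= width || ny >= height) continue;
            const int idx = ny * width + nx;
            if (binary[idx] && !(*labels)[idx]) {
              (*labels)[idx] = next_label;
              stack.push_back(std::make_pair(nx, ny));
            }
          }
        }
      }
    }
  }
  return next_label;
}
```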

LINE FINDING/BASELINE FITTING: 

The line finding algorithm finds the text lines and assigns each blob to a unique text
line by sorting and processing the blobs by x-coordinate while keeping track of the slope
(to account for any skew). The baseline fitting algorithm (a least median of squares fit)
then fits a baseline to each of the lines.

*******Relevant Function in source code(in baseapi.cpp file)*********** 

// Find lines from the image making the BLOCK_LIST
// creates a full-page block and then runs connected component analysis and
// text line creation
void TessBaseAPI::FindLines(BLOCK_LIST* block_list)

******************************************************************** 
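The least median of squares fit mentioned in the baseline fitting step can be sketched as follows. This is an exhaustive textbook version, not the Tesseract routine. It assumes each blob contributes one point (for example the bottom center of its bounding box). `Point`, `Line`, and `FitBaselineLMedS` are hypothetical names.

```cpp
#include <algorithm>
#include <cstddef>
#include <limits>
#include <vector>

struct Point { double x, y; };
struct Line { double slope, intercept; };

// Hypothetical helper, not Tesseract code: exhaustive least-median-of-squares
// fit of y = slope * x + intercept. Every pair of points proposes a candidate
// line; the candidate with the smallest median squared residual wins.
Line FitBaselineLMedS(const std::vector<Point>& pts) {
  Line best = {0.0, 0.0};
  double best_median = std::numeric_limits<double>::infinity();
  std::vector<double> residuals(pts.size());
  for (std::size_t i = 0; i < pts.size(); ++i) {
    for (std::size_t j = i + 1; j < pts.size(); ++j) {
      if (pts[i].x == pts[j].x) continue;  // Vertical pair: no finite slope.
      const double slope = (pts[j].y - pts[i].y) / (pts[j].x - pts[i].x);
      const double intercept = pts[i].y - slope * pts[i].x;
      for (std::size_t k = 0; k < pts.size(); ++k) {
        const double r = pts[k].y - (slope * pts[k].x + intercept);
        residuals[k] = r * r;
      }
      // A partial sort is enough to find the median element.
      std::vector<double>::iterator mid = residuals.begin() + residuals.size() / 2;
      std::nth_element(residuals.begin(), mid, residuals.end());
      if (*mid < best_median) {
        best_median = *mid;
        best.slope = slope;
        best.intercept = intercept;
      }
    }
  }
  return best;
}
```

Because the median ignores the largest residuals, a few points that sit well off the baseline (descenders, for instance) should not pull the fitted line away from the majority. This is the usual motivation for preferring it over an ordinary least squares fit.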

WORD/CHARACTER SEGMENTATION: 

Each line is then segmented into words or characters, depending on whether the
line is fixed pitch or not. For fixed pitch text, each word is chopped into
characters. For text lines with non-fixed (proportional) pitch, only the words are
segmented out, and the chopping of words into characters is done later in the word
recognition step.
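For the fixed pitch case, the chopping can be illustrated with a small sketch. It assumes the pitch (character cell width in pixels) is already known for the line. `Cell` and `ChopFixedPitch` are hypothetical names and not part of Tesseract.

```cpp
#include <vector>

struct Cell { int left, right; };  // Half-open pixel range [left, right).

// Hypothetical helper, not Tesseract code: split the horizontal extent of a
// word into character cells of a known, constant pitch (in pixels).
std::vector<Cell> ChopFixedPitch(int word_left, int word_right, int pitch) {
  std::vector<Cell> cells;
  if (pitch <= 0) return cells;  // No valid pitch: nothing to chop.
  for (int x = word_left; x < word_right; x += pitch) {
    Cell cell;
    cell.left = x;
    // The last cell is clipped to the right edge of the word.
    cell.right = (x + pitch < word_right) ? x + pitch : word_right;
    cells.push_back(cell);
  }
  return cells;
}
```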

WORD RECOGNITION: 

In this step, words with non-fixed pitch spacing are segmented into characters by
chopping joined characters and associating broken characters.

STATIC CHARACTER CLASSIFICATION: 

In this step, features are extracted from all the segmented characters, which are then
classified using a static character classifier.

********Relevant Functions in source code(in baseapi.cpp file)********* 

// Recognize the tesseract global image and return the result as Tesseract
// internal structures.
PAGE_RES* TessBaseAPI::Recognize(BLOCK_LIST* block_list, ETEXT_DESC* monitor)

// Make a text string from the internal data structures.
// The input page_res is deleted.
char* TessBaseAPI::TesseractToText(PAGE_RES* page_res)

******************************************************************** 

QUALITY: 

The quality of the recognized words and letters is checked.

WRITING OUTPUT: 

The recognized words are written out to a .txt file.

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 

Parallelization: We plan to parallelize various functions at each step of the
algorithm. Here are some of the functions that can be parallelized:

1. Binarizing the image by applying the thresholds (calculated by adaptive
thresholding) in parallel on multiple cores
2. Parallelizing the connected-component labeling and line finding functions
3. Parallelizing the word/character segmentation and recognition routines by processing
lines on each of the cores in parallel, using a chunking or interleaved scheme
4. Parallelizing the feature extraction and classification of characters by dividing the
entire set of characters equally among all the available cores
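The chunking and interleaved schemes named in the list above can be sketched with `std::thread`. This is a sketch under the assumption that the per-item work (a text line or a character) touches no shared mutable state, which would have to be verified against the Tesseract code. `ChunkedItems`, `InterleavedItems`, and `ProcessInParallel` are hypothetical names.

```cpp
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Chunking: worker w gets one contiguous range of item indices.
std::vector<std::size_t> ChunkedItems(std::size_t worker, std::size_t workers,
                                      std::size_t items) {
  const std::size_t begin = items * worker / workers;
  const std::size_t end = items * (worker + 1) / workers;
  std::vector<std::size_t> out;
  for (std::size_t i = begin; i < end; ++i) out.push_back(i);
  return out;
}

// Interleaving: worker w gets items w, w + workers, w + 2 * workers, ...
std::vector<std::size_t> InterleavedItems(std::size_t worker,
                                          std::size_t workers,
                                          std::size_t items) {
  std::vector<std::size_t> out;
  for (std::size_t i = worker; i < items; i += workers) out.push_back(i);
  return out;
}

// Runs process(i) once for every item index, using one thread per worker.
// The caller must ensure process is safe to call concurrently on different
// indices (an assumption, not something this sketch checks).
void ProcessInParallel(std::size_t workers, std::size_t items, bool interleaved,
                       const std::function<void(std::size_t)>& process) {
  std::vector<std::thread> threads;
  for (std::size_t w = 0; w < workers; ++w) {
    threads.emplace_back([=, &process] {
      const std::vector<std::size_t> mine =
          interleaved ? InterleavedItems(w, workers, items)
                      : ChunkedItems(w, workers, items);
      for (std::size_t i : mine) process(i);
    });
  }
  for (std::thread& t : threads) t.join();
}
```

Chunking keeps neighboring lines on the same core, which may help locality. Interleaving tends to spread expensive and cheap lines more evenly when cost varies down the page. Which one is faster for Tesseract is something we would have to measure.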

References: 

- Recognition of Handwritten Roman Script Using Tesseract OCR
  http://arxiv.org/ftp/arxiv/papers/1003/1003.5891.pdf
- An Overview of the Tesseract OCR Engine
  http://tesseract-ocr.repairfaq.org/downloads/tesseract_overview.pdf
- Optical Character Recognition Reference
  http://www.nr.no/~eikvil/OCR.pdf
- Character Recognition Under Severe Distortion
  http://www.computer.org/portal/web/csdl/doi/10.1109/ICDAR.2009.86
- Source Code
  http://code.google.com/p/tesseract-ocr/

Resources Used / Required: 

- Multicore Personal Computers 

- Time on Cyclades and Niagara
