CS315A Final Project Proposal May 3, 2010 - Stanford PPL

<strong>CS315A</strong> <strong>Final</strong> <strong>Project</strong> <strong>Proposal</strong><br />

<strong>May</strong> 3, <strong>2010</strong><br />

Group Members:<br />

Ajay Gupta (agupta74)<br />

Scott Green (sagreen1)<br />

Tushar Sawant (tsawant)<br />

Tahrina Rumu (trumu)<br />

<strong>Project</strong> Option Selection:<br />

Programming<br />

Description of Topic:<br />

Optical character recognition (OCR) is the mechanical or electronic translation of<br />

scanned images of handwritten, typewritten, or printed text into machine-encoded text.<br />

OCR has been in development for about 80 years: the first patent for an OCR machine<br />

was filed in Germany by Gustav Tauschek in 1929, and an American patent was<br />

filed subsequently in 1935. OCR has many applications, including postal services,<br />

language translation, and digital libraries, and it has reached the general public<br />

in the form of mobile applications.<br />

We are using an open-source OCR engine called Tesseract as the basis for parallelization.<br />

Hewlett-Packard began developing Tesseract in 1985, further development<br />

was done at the University of Nevada, Las Vegas, and the code was eventually released<br />

as open source under the Apache 2.0 license. Google has used Tesseract extensively in<br />

its Google Books project, which aims to digitize the world's libraries.<br />

Statement of Why:<br />

For many years OCR has been considered a solved problem. Major work on Tesseract<br />

was completed around 1996, and only small modifications were made over the next ten years.<br />

The emergence of commercially available multicore processors has opened up a new<br />

field within computer science, and it is worth reconsidering old problems in order to<br />

achieve speed gains. In the case of Tesseract, we intend to parallelize the<br />

recognition process in order to convert multi-page documents quickly and accurately.<br />

Speeding up the recognition process frees computing resources, allowing more complex<br />

comparison algorithms to be implemented. Beyond parallelization on multicore<br />

processors, there is further room for improvement by running OCR code on GPUs.<br />

Our assertion is that OCR is not a solved problem.<br />


Description of Methods to Evaluate:<br />

Here is a brief description of the algorithm:<br />

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++<br />

READING INPUT:<br />

The image is read and thresholded using an adaptive threshold algorithm to<br />

create a binary image.<br />

******************Relevant Function in source code**********************<br />

// Copy the given image rectangle to Tesseract, with adaptive thresholding<br />

// if the image is not already binary (in baseapi.cpp file)<br />

void TessBaseAPI::CopyImageToTesseract(const unsigned char* imagedata,<br />

int bytes_per_pixel,<br />

int bytes_per_line,<br />

int left, int top,<br />

int width, int height)<br />

********************************************************************<br />
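As a rough illustration of the thresholding step, the sketch below picks one global threshold with Otsu's method and applies it to every pixel. This is an assumption made for illustration only: the proposal does not say which adaptive algorithm Tesseract uses, and the names `OtsuThreshold` and `Binarize` are hypothetical helpers, not Tesseract functions. Each pixel is thresholded independently, so the loop carries an OpenMP pragma; when compiled without `-fopenmp` the pragma is ignored and the loop runs serially.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not a Tesseract function): choose a global threshold
// for an 8-bit grayscale image with Otsu's method, which maximizes the
// between-class variance of the "dark" and "bright" pixel classes.
int OtsuThreshold(const std::vector<std::uint8_t>& gray) {
  if (gray.empty()) return 0;
  std::vector<long long> hist(256, 0);
  for (std::uint8_t v : gray) ++hist[v];

  const double total = static_cast<double>(gray.size());
  double sum_all = 0.0;
  for (int i = 0; i < 256; ++i) sum_all += static_cast<double>(i) * hist[i];

  double sum_low = 0.0, weight_low = 0.0, best_var = -1.0;
  int best_t = 0;
  for (int t = 0; t < 256; ++t) {
    weight_low += static_cast<double>(hist[t]);
    if (weight_low == 0.0) continue;       // no dark pixels yet
    const double weight_high = total - weight_low;
    if (weight_high == 0.0) break;         // no bright pixels left
    sum_low += static_cast<double>(t) * hist[t];
    const double mean_low = sum_low / weight_low;
    const double mean_high = (sum_all - sum_low) / weight_high;
    const double diff = mean_low - mean_high;
    const double between_var = weight_low * weight_high * diff * diff;
    if (between_var > best_var) {
      best_var = between_var;
      best_t = t;
    }
  }
  return best_t;
}

// Pixels at or below the threshold become 1 (ink); brighter pixels become 0.
std::vector<std::uint8_t> Binarize(const std::vector<std::uint8_t>& gray,
                                   int threshold) {
  std::vector<std::uint8_t> binary(gray.size());
  const long long n = static_cast<long long>(gray.size());
  #pragma omp parallel for
  for (long long i = 0; i < n; ++i) {
    const std::size_t k = static_cast<std::size_t>(i);
    binary[k] = gray[k] <= threshold ? 1 : 0;
  }
  return binary;
}
```

A caller would compute `OtsuThreshold(gray)` once and pass the result to `Binarize`; only the second step is parallel here, since the histogram pass is cheap by comparison.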

OUTLINES/BLOBS:<br />

Connected-component analysis is performed to extract outlines, which are then nested<br />

together to form blobs.<br />
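The grouping idea behind connected-component analysis can be sketched as a breadth-first flood fill over the binary image. This is a minimal sketch under stated assumptions: it labels pixels with 4-connectivity rather than extracting and nesting outlines as described above, and `LabelComponents` is a hypothetical name, not a Tesseract function.

```cpp
#include <cstddef>
#include <queue>
#include <utility>
#include <vector>

// Hypothetical helper (not Tesseract code): label the 4-connected components
// of foreground pixels (value 1) in a row-major binary image.
// Returns the number of components. On return, (*labels)[i] is 0 for
// background pixels and a value in 1..count for foreground pixels.
int LabelComponents(const std::vector<int>& binary, int width, int height,
                    std::vector<int>* labels) {
  labels->assign(binary.size(), 0);
  const int dx[4] = {1, -1, 0, 0};
  const int dy[4] = {0, 0, 1, -1};
  int count = 0;
  for (int y = 0; y < height; ++y) {
    for (int x = 0; x < width; ++x) {
      const std::size_t start = static_cast<std::size_t>(y) * width + x;
      if (binary[start] == 0 || (*labels)[start] != 0) continue;
      // Found an unlabeled foreground pixel: start a new component.
      ++count;
      (*labels)[start] = count;
      std::queue<std::pair<int, int> > frontier;
      frontier.push(std::make_pair(x, y));
      while (!frontier.empty()) {
        const std::pair<int, int> cur = frontier.front();
        frontier.pop();
        for (int d = 0; d < 4; ++d) {
          const int nx = cur.first + dx[d];
          const int ny = cur.second + dy[d];
          if (nx < 0 || ny < 0 || nx >= width || ny >= height) continue;
          const std::size_t idx = static_cast<std::size_t>(ny) * width + nx;
          if (binary[idx] == 1 && (*labels)[idx] == 0) {
            (*labels)[idx] = count;
            frontier.push(std::make_pair(nx, ny));
          }
        }
      }
    }
  }
  return count;
}
```

A serial flood fill like this is the baseline; parallelizing it typically means labeling image strips independently and then merging labels across strip boundaries, which this sketch does not attempt.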

LINE FINDING/BASELINE FITTING:<br />

The line finding algorithm finds the text lines and assigns each blob to a unique text line by<br />

sorting and processing the blobs by x-coordinate while keeping track of the slope (to<br />

account for any skew). The baseline fitting algorithm (a least median of squares fit) then<br />

fits a baseline to each of the lines.<br />

*******Relevant Function in source code(in baseapi.cpp file)***********<br />

// Find lines from the image making the BLOCK_LIST<br />

//creates a full-page block and then runs connected component analysis and<br />

//text line creation<br />

void TessBaseAPI::FindLines(BLOCK_LIST* block_list)<br />

********************************************************************<br />
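To make the least median of squares idea concrete, the sketch below fits a straight baseline by trying the line through every pair of points and keeping the one whose median squared residual is smallest. It is a brute-force illustration under stated assumptions, not Tesseract's implementation: `Point`, `Line`, and `FitBaselineLMedS` are hypothetical names, and the points are assumed to be one (x, y) sample per blob, such as the bottom of its bounding box.

```cpp
#include <algorithm>
#include <cstddef>
#include <limits>
#include <vector>

struct Point { double x; double y; };
struct Line { double slope; double intercept; };

// Hypothetical helper: least-median-of-squares line fit by exhaustive search
// over all lines through two of the input points (O(n^3) time).
// Returns {0, 0} if fewer than two points with distinct x are given.
Line FitBaselineLMedS(const std::vector<Point>& pts) {
  Line best = {0.0, 0.0};
  double best_median = std::numeric_limits<double>::infinity();
  std::vector<double> sq;
  for (std::size_t i = 0; i < pts.size(); ++i) {
    for (std::size_t j = i + 1; j < pts.size(); ++j) {
      const double dx = pts[j].x - pts[i].x;
      if (dx == 0.0) continue;  // skip vertical candidate lines
      const double slope = (pts[j].y - pts[i].y) / dx;
      const double intercept = pts[i].y - slope * pts[i].x;

      // Squared vertical residual of every point against this candidate.
      sq.clear();
      for (std::size_t k = 0; k < pts.size(); ++k) {
        const double r = pts[k].y - (slope * pts[k].x + intercept);
        sq.push_back(r * r);
      }
      // Median via partial sort (upper median for even sizes).
      std::nth_element(sq.begin(), sq.begin() + sq.size() / 2, sq.end());
      const double median = sq[sq.size() / 2];
      if (median < best_median) {
        best_median = median;
        best.slope = slope;
        best.intercept = intercept;
      }
    }
  }
  return best;
}
```

Because the score is a median rather than a sum, fewer than half of the points can be outliers (for example descenders or noise blobs) without pulling the fitted line away from the majority.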

WORD/CHARACTER SEGMENTATION:<br />

Each line is then segmented into words or characters, depending on whether the<br />

line is fixed pitch or not. For fixed-pitch spacing, each word is chopped<br />

into characters. For text lines with non-fixed (proportional) pitch spacing, however, only<br />

the words are segmented out, and the chopping of words into characters is done later<br />

in the word recognition step.<br />
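For the fixed-pitch case, chopping a word reduces to placing cuts at regular intervals. The sketch below shows only that arithmetic, assuming the word's horizontal extent and the pitch in pixels are already known; `FixedPitchCuts` is a hypothetical helper, and the real segmentation is more involved than this.

```cpp
#include <vector>

// Hypothetical helper: x-coordinates at which to cut a fixed-pitch word that
// spans [left, right) into character cells of width `pitch` pixels.
// Returns an empty list if the pitch is not positive.
std::vector<int> FixedPitchCuts(int left, int right, int pitch) {
  std::vector<int> cuts;
  if (pitch <= 0) return cuts;
  for (int x = left + pitch; x < right; x += pitch) {
    cuts.push_back(x);
  }
  return cuts;
}
```

For example, a word spanning x = 100 to 140 with a pitch of 10 yields cuts at 110, 120, and 130, giving four character cells.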


WORD RECOGNITION:<br />

In this step, the words of non-fixed pitch spacing are segmented into characters by<br />

chopping joined characters and associating broken characters.<br />

STATIC CHARACTER CLASSIFICATION:<br />

In this step, feature extraction is done for all the segmented characters, which are then<br />

classified using a static character classifier.<br />

********Relevant Functions in source code(in baseapi.cpp file)*********<br />

// Recognize the tesseract global image and return the result as Tesseract<br />

// internal structures.<br />

PAGE_RES* TessBaseAPI::Recognize(BLOCK_LIST* block_list, ETEXT_DESC*<br />

monitor)<br />

// Make a text string from the internal data structures.<br />

// The input page_res is deleted.<br />

char* TessBaseAPI::TesseractToText(PAGE_RES* page_res)<br />

********************************************************************<br />
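The classification step can be pictured as matching each character's feature vector against a fixed set of prototypes. The sketch below uses nearest-prototype matching with squared Euclidean distance as a stand-in; it is not Tesseract's classifier, and `Prototype`, `ClassifyOne`, and `ClassifyAll` are hypothetical names. Each character is classified independently, so the outer loop carries an OpenMP pragma, which is ignored (and the loop runs serially) unless the code is compiled with `-fopenmp`.

```cpp
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical prototype: a class label plus a reference feature vector.
struct Prototype {
  char label;
  std::vector<double> features;
};

// Return the label of the prototype closest to `features` in squared
// Euclidean distance, or '?' if no prototype has a matching dimension.
char ClassifyOne(const std::vector<double>& features,
                 const std::vector<Prototype>& protos) {
  char best_label = '?';
  double best_dist = std::numeric_limits<double>::infinity();
  for (std::size_t p = 0; p < protos.size(); ++p) {
    if (protos[p].features.size() != features.size()) continue;
    double dist = 0.0;
    for (std::size_t k = 0; k < features.size(); ++k) {
      const double d = features[k] - protos[p].features[k];
      dist += d * d;
    }
    if (dist < best_dist) {
      best_dist = dist;
      best_label = protos[p].label;
    }
  }
  return best_label;
}

// Classify every character. Iterations are independent and each writes only
// its own output slot, so the loop can be split across cores.
std::vector<char> ClassifyAll(const std::vector<std::vector<double> >& chars,
                              const std::vector<Prototype>& protos) {
  std::vector<char> out(chars.size(), '?');
  const long long n = static_cast<long long>(chars.size());
  #pragma omp parallel for
  for (long long i = 0; i < n; ++i) {
    const std::size_t k = static_cast<std::size_t>(i);
    out[k] = ClassifyOne(chars[k], protos);
  }
  return out;
}
```

The prototype set is read-only during classification, which is what makes the per-character split safe without locking.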

QUALITY:<br />

The quality of words and letters is checked.<br />

WRITING OUTPUT:<br />

Words are written out to a .txt file.<br />

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++<br />

Parallelization: We are planning to parallelize various functions at each step of the<br />

algorithm. Here are some of the functions that can be parallelized:<br />

1. Binarize the image by applying the thresholds (calculated by the adaptive threshold algorithm)<br />

in parallel using multiple cores<br />

2. Parallelize the connected-component labeling and line finding functions<br />

3. Parallelize the word/character segmentation and recognition routines by processing<br />

lines on each of the cores in parallel using a chunked or interleaved scheme<br />

4. Parallelize the feature extraction and classification of characters by dividing the<br />

entire set of characters equally among all the available cores<br />
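The two work-distribution schemes named in the list can be sketched as index calculations. The helpers below are hypothetical and only show which text lines (or characters) each worker would process under each scheme; they do not launch any threads.

```cpp
#include <algorithm>
#include <vector>

// Chunked scheme: worker w gets one contiguous block of items. Blocks differ
// in size by at most one item, so the split is as even as possible.
std::vector<int> ChunkedAssignment(int num_items, int num_workers, int worker) {
  std::vector<int> items;
  if (num_items <= 0 || num_workers <= 0 || worker < 0 ||
      worker >= num_workers) {
    return items;
  }
  const int base = num_items / num_workers;
  const int extra = num_items % num_workers;  // first `extra` workers get +1
  const int begin = worker * base + std::min(worker, extra);
  const int count = base + (worker < extra ? 1 : 0);
  for (int i = begin; i < begin + count; ++i) items.push_back(i);
  return items;
}

// Interleaved scheme: worker w gets items w, w + P, w + 2P, ... for P workers.
std::vector<int> InterleavedAssignment(int num_items, int num_workers,
                                       int worker) {
  std::vector<int> items;
  if (num_items <= 0 || num_workers <= 0 || worker < 0 ||
      worker >= num_workers) {
    return items;
  }
  for (int i = worker; i < num_items; i += num_workers) items.push_back(i);
  return items;
}
```

With 10 text lines and 3 workers, the chunked scheme gives worker 0 lines 0 to 3, worker 1 lines 4 to 6, and worker 2 lines 7 to 9, while the interleaved scheme gives worker 1 lines 1, 4, and 7. Chunking keeps neighboring lines on the same core, whereas interleaving tends to balance load better when cost varies gradually down the page. In OpenMP these correspond roughly to `schedule(static)` and `schedule(static, 1)` on a parallel loop.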

References:<br />

- Recognition of Handwritten Roman Script Using Tesseract OCR<br />

http://arxiv.org/ftp/arxiv/papers/1003/1003.5891.pdf<br />

- An Overview of the Tesseract OCR Engine<br />

http://tesseract-ocr.repairfaq.org/downloads/tesseract_overview.pdf<br />

- Optical Character Recognition Reference<br />

http://www.nr.no/~eikvil/OCR.pdf<br />

- Character Recognition Under Severe Distortion<br />

http://www.computer.org/portal/web/csdl/doi/10.1109/ICDAR.2009.86<br />

- Source Code<br />

http://code.google.com/p/tesseract-ocr/<br />

Resources Used / Required:<br />

- Multicore Personal Computers<br />

- Time on Cyclades and Niagara
