CS315A Final Project Proposal May 3, 2010 - Stanford PPL
CS315A Final Project Proposal<br />
May 3, 2010<br />
Group Members:<br />
Ajay Gupta (agupta74)<br />
Scott Green (sagreen1)<br />
Tushar Sawant (tsawant)<br />
Tahrina Rumu (trumu)<br />
Project Option Selection:<br />
Programming<br />
Description of Topic:<br />
Optical character recognition (OCR) is the mechanical or electronic translation of<br />
scanned images of handwritten, typewritten, or printed text into machine-encoded text.<br />
OCR has been in development for about 80 years: the first patent for an OCR machine<br />
was obtained in Germany by Gustav Tauschek in 1929, and a U.S. patent followed<br />
in 1935. OCR has many applications, including postal services,<br />
language translation, and digital libraries, and it has even reached the general public<br />
in the form of mobile applications.<br />
We are using Tesseract, an open-source OCR engine, as the basis for parallelization.<br />
Hewlett-Packard began developing Tesseract in 1985; further development<br />
was done at the University of Nevada, Las Vegas, and the code was eventually released<br />
as open source under the Apache 2.0 license. Google has used Tesseract extensively in<br />
its Google Books project, which aims to digitize the world's libraries.<br />
Statement of Why:<br />
For many years OCR has been considered a solved problem. Major work on Tesseract<br />
was completed around 1996 and small modifications were made over the next ten years.<br />
The emergence of commercially available multicore processors has opened up a new<br />
field within computer science, and it is worth reconsidering old problems in order to<br />
achieve speed gains. In the case of Tesseract, we intend to parallelize the<br />
recognition process in order to convert multi-page documents quickly and accurately.<br />
Speeding up the recognition process frees computing resources, allowing more complex<br />
comparison algorithms to be implemented. Beyond parallelization on multicore<br />
processors, there is further room for improvement by running OCR code on GPUs.<br />
Our assertion is that OCR is not a solved problem.
Description of Methods to Evaluate:<br />
Here is a brief description of the algorithm:<br />
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++<br />
READING INPUT:<br />
The image is read and thresholded using an adaptive thresholding algorithm to<br />
create a binary image.<br />
******************Relevant Function in source code**********************<br />
// Copy the given image rectangle to Tesseract, with adaptive thresholding<br />
// if the image is not already binary (in baseapi.cpp file)<br />
void TessBaseAPI::CopyImageToTesseract(const unsigned char* imagedata,<br />
int bytes_per_pixel,<br />
int bytes_per_line,<br />
int left, int top,<br />
int width, int height)<br />
********************************************************************<br />
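To make the thresholding step concrete, here is a minimal sketch of one common adaptive scheme, a local-mean threshold computed from an integral image. It is an illustration only and is not taken from Tesseract; the function name `AdaptiveThreshold` and the `radius` and `bias` parameters are assumptions made for this example.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper, not part of Tesseract: binarize an 8-bit grayscale
// image (row-major, width * height bytes) with a local-mean threshold.
// A pixel becomes 1 (ink) if it is darker than the mean of its clamped
// (2 * radius + 1)^2 neighborhood minus `bias`, otherwise 0 (background).
std::vector<uint8_t> AdaptiveThreshold(const std::vector<uint8_t>& gray,
                                       int width, int height,
                                       int radius, int bias) {
  const int stride = width + 1;
  // Integral image: sum[(y + 1) * stride + (x + 1)] holds the sum of all
  // pixels in the rectangle from (0, 0) to (x, y) inclusive.
  std::vector<long long> sum(
      static_cast<std::size_t>(width + 1) * (height + 1), 0LL);
  for (int y = 0; y < height; ++y) {
    for (int x = 0; x < width; ++x) {
      sum[(y + 1) * stride + (x + 1)] =
          gray[y * width + x] + sum[y * stride + (x + 1)] +
          sum[(y + 1) * stride + x] - sum[y * stride + x];
    }
  }
  std::vector<uint8_t> binary(gray.size(), 0);
  for (int y = 0; y < height; ++y) {
    for (int x = 0; x < width; ++x) {
      // Clamp the window to the image borders.
      const int x0 = std::max(0, x - radius);
      const int x1 = std::min(width - 1, x + radius);
      const int y0 = std::max(0, y - radius);
      const int y1 = std::min(height - 1, y + radius);
      const long long area =
          static_cast<long long>(x1 - x0 + 1) * (y1 - y0 + 1);
      const long long total = sum[(y1 + 1) * stride + (x1 + 1)] -
                              sum[y0 * stride + (x1 + 1)] -
                              sum[(y1 + 1) * stride + x0] +
                              sum[y0 * stride + x0];
      // Compare pixel * area with (mean - bias) * area to stay in integers.
      if (static_cast<long long>(gray[y * width + x]) * area <
          total - static_cast<long long>(bias) * area) {
        binary[y * width + x] = 1;
      }
    }
  }
  return binary;
}
```

In this sketch every output pixel reads only the shared integral image, so the rows of the second loop are independent of one another and could be handed to different cores.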
OUTLINES/BLOBS:<br />
Connected-component analysis is performed to extract outlines, which are then nested<br />
together to form blobs.<br />
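As a rough illustration of connected-component analysis, the sketch below labels 8-connected regions of ink with a breadth-first flood fill and reports one bounding box per region. This is a simplification: Tesseract traces and nests outlines, whereas this example only finds bounding boxes. The `Box` struct and `FindComponents` function are hypothetical names introduced here.

```cpp
#include <algorithm>
#include <cstdint>
#include <queue>
#include <utility>
#include <vector>

// Hypothetical bounding box of one connected component (inclusive bounds).
struct Box {
  int left, top, right, bottom;
};

// Find 8-connected regions of ink (non-zero pixels) in a row-major binary
// image and return one bounding box per region, in scan order.
std::vector<Box> FindComponents(const std::vector<uint8_t>& binary,
                                int width, int height) {
  std::vector<char> visited(binary.size(), 0);
  std::vector<Box> boxes;
  for (int y = 0; y < height; ++y) {
    for (int x = 0; x < width; ++x) {
      if (!binary[y * width + x] || visited[y * width + x]) continue;
      // New component: flood-fill outward from its first pixel.
      Box box = {x, y, x, y};
      std::queue<std::pair<int, int> > frontier;
      frontier.push(std::make_pair(x, y));
      visited[y * width + x] = 1;
      while (!frontier.empty()) {
        const int cx = frontier.front().first;
        const int cy = frontier.front().second;
        frontier.pop();
        box.left = std::min(box.left, cx);
        box.right = std::max(box.right, cx);
        box.top = std::min(box.top, cy);
        box.bottom = std::max(box.bottom, cy);
        // Visit the eight neighbors (the center pixel is already visited).
        for (int dy = -1; dy <= 1; ++dy) {
          for (int dx = -1; dx <= 1; ++dx) {
            const int nx = cx + dx;
            const int ny = cy + dy;
            if (nx < 0 || ny < 0 || nx >= width || ny >= height) continue;
            const int idx = ny * width + nx;
            if (binary[idx] && !visited[idx]) {
              visited[idx] = 1;
              frontier.push(std::make_pair(nx, ny));
            }
          }
        }
      }
      boxes.push_back(box);
    }
  }
  return boxes;
}
```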
LINE FINDING/BASELINE FITTING:<br />
The line-finding algorithm finds the text lines and assigns each blob to a unique text line by<br />
sorting and processing the blobs by x-coordinate while keeping track of the slope (to<br />
account for any skew). The baseline-fitting algorithm (a least-median-of-squares fit) then<br />
fits a baseline to each line.<br />
*******Relevant Function in source code(in baseapi.cpp file)***********<br />
// Find lines from the image, making the BLOCK_LIST.<br />
// Creates a full-page block and then runs connected-component analysis and<br />
// text line creation.<br />
void TessBaseAPI::FindLines(BLOCK_LIST* block_list)<br />
********************************************************************<br />
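The least-median-of-squares idea behind the baseline fit can be sketched as follows. The example tries the line through every pair of points and keeps the one with the smallest median squared residual, which makes the fit tolerant of outliers such as descenders. It is an exhaustive version written for clarity (cubic in the number of points), not Tesseract's implementation; `Point`, `Line`, and `FitBaselineLMedS` are hypothetical names, and representing each blob by a single point is an assumption of this sketch.

```cpp
#include <algorithm>
#include <cstddef>
#include <limits>
#include <vector>

struct Point { double x, y; };           // e.g. bottom-center of a blob
struct Line { double slope, intercept; };

// Exhaustive least-median-of-squares fit. Returns {0, 0} if fewer than two
// points with distinct x-coordinates are supplied.
Line FitBaselineLMedS(const std::vector<Point>& pts) {
  Line best = {0.0, 0.0};
  double best_median = std::numeric_limits<double>::infinity();
  std::vector<double> residuals(pts.size());
  for (std::size_t i = 0; i < pts.size(); ++i) {
    for (std::size_t j = i + 1; j < pts.size(); ++j) {
      if (pts[i].x == pts[j].x) continue;  // Vertical pair: slope undefined.
      const double slope = (pts[j].y - pts[i].y) / (pts[j].x - pts[i].x);
      const double intercept = pts[i].y - slope * pts[i].x;
      // Squared vertical distance of every point from the candidate line.
      for (std::size_t k = 0; k < pts.size(); ++k) {
        const double r = pts[k].y - (slope * pts[k].x + intercept);
        residuals[k] = r * r;
      }
      // nth_element places the (upper) median at the middle position.
      const std::size_t mid = residuals.size() / 2;
      std::nth_element(residuals.begin(), residuals.begin() + mid,
                       residuals.end());
      if (residuals[mid] < best_median) {
        best_median = residuals[mid];
        best.slope = slope;
        best.intercept = intercept;
      }
    }
  }
  return best;
}
```

Because only the median residual is scored, a minority of points lying far from the baseline does not pull the fitted line toward them, unlike an ordinary least-squares fit.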
WORD/CHARACTER SEGMENTATION:<br />
Each line is then segmented into words or characters, depending on whether the<br />
line is fixed pitch. For fixed-pitch text, each word is chopped<br />
into characters. For text lines with non-fixed (proportional) pitch, however, only<br />
the words are segmented out; chopping the words into characters is done later,<br />
in the word recognition step.
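For the fixed-pitch case, chopping a word into characters amounts to cutting its horizontal extent into equal-width cells. The sketch below assumes the pitch (character cell width in pixels) has already been estimated; `Span` and `ChopFixedPitch` are hypothetical names used only for illustration.

```cpp
#include <algorithm>
#include <vector>

// Horizontal extent of one character cell; `right` is exclusive.
struct Span {
  int left, right;
};

// Cut the range [word_left, word_right) into cells of `pitch` pixels.
// The last cell is shortened if the word width is not a multiple of pitch.
std::vector<Span> ChopFixedPitch(int word_left, int word_right, int pitch) {
  std::vector<Span> cells;
  if (pitch <= 0) return cells;  // No valid pitch: nothing to chop.
  for (int x = word_left; x < word_right; x += pitch) {
    Span cell = {x, std::min(x + pitch, word_right)};
    cells.push_back(cell);
  }
  return cells;
}
```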
WORD RECOGNITION:<br />
In this step, words with non-fixed pitch are segmented into characters by<br />
chopping joined characters and associating broken characters.<br />
STATIC CHARACTER CLASSIFICATION:<br />
In this step, features are extracted from all the segmented characters, which are then<br />
classified using a static character classifier.<br />
********Relevant Functions in source code(in baseapi.cpp file)*********<br />
// Recognize the tesseract global image and return the result as Tesseract<br />
// internal structures.<br />
PAGE_RES* TessBaseAPI::Recognize(BLOCK_LIST* block_list,<br />
ETEXT_DESC* monitor)<br />
// Make a text string from the internal data structures.<br />
// The input page_res is deleted.<br />
char* TessBaseAPI::TesseractToText(PAGE_RES* page_res)<br />
********************************************************************<br />
QUALITY:<br />
The quality of words and letters is checked.<br />
WRITING OUTPUT:<br />
Words are written out to a .txt file.<br />
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++<br />
Parallelization: We plan to parallelize various functions at each step of the<br />
algorithm. Here are some of the functions that can be parallelized:<br />
1. Binarize the image by applying the thresholds (calculated by adaptive thresholding)<br />
in parallel using multiple cores<br />
2. Parallelize the connected-component labelling and line-finding functions<br />
3. Parallelize the word/character segmentation and recognition routines by processing<br />
lines on each of the cores in parallel, using a chunked or interleaved scheme<br />
4. Parallelize the feature extraction and classification of characters by dividing<br />
the entire set of characters equally among all the available cores<br />
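The difference between the chunked and interleaved schemes mentioned in item 3 can be sketched as below. The example uses `std::thread` as a stand-in, since the proposal does not name a threading library; `ParallelChunked`, `ParallelInterleaved`, and the `work` callback are hypothetical, and `work(i)` stands for whatever per-line or per-character routine is being parallelized. The sketch assumes `num_threads` is at least 1 and that calls to `work` for different indices do not touch shared mutable state.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Chunked scheme: thread t processes one contiguous block of indices.
void ParallelChunked(std::size_t count, unsigned num_threads,
                     const std::function<void(std::size_t)>& work) {
  std::vector<std::thread> threads;
  // Round up so that all `count` items are covered.
  const std::size_t chunk = (count + num_threads - 1) / num_threads;
  for (unsigned t = 0; t < num_threads; ++t) {
    const std::size_t begin = std::min<std::size_t>(count, t * chunk);
    const std::size_t end = std::min<std::size_t>(count, begin + chunk);
    threads.emplace_back([begin, end, &work] {
      for (std::size_t i = begin; i < end; ++i) work(i);
    });
  }
  for (std::thread& th : threads) th.join();
}

// Interleaved scheme: thread t processes indices t, t + num_threads, ...
void ParallelInterleaved(std::size_t count, unsigned num_threads,
                         const std::function<void(std::size_t)>& work) {
  std::vector<std::thread> threads;
  for (unsigned t = 0; t < num_threads; ++t) {
    threads.emplace_back([t, count, num_threads, &work] {
      for (std::size_t i = t; i < count; i += num_threads) work(i);
    });
  }
  for (std::thread& th : threads) th.join();
}
```

Chunking keeps neighboring items on the same core, which tends to help cache locality, while interleaving spreads expensive and cheap items more evenly when the cost varies along the page. Which one performs better for Tesseract is something the project would have to measure.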
References:<br />
- Recognition of Handwritten Roman Script Using Tesseract OCR<br />
http://arxiv.org/ftp/arxiv/papers/1003/1003.5891.pdf<br />
- An Overview of the Tesseract OCR Engine<br />
http://tesseract-ocr.repairfaq.org/downloads/tesseract_overview.pdf<br />
- Optical Character Recognition Reference
http://www.nr.no/~eikvil/OCR.pdf<br />
- Character Recognition Under Severe Distortion<br />
http://www.computer.org/portal/web/csdl/doi/10.1109/ICDAR.2009.86<br />
- Source code<br />
http://code.google.com/p/tesseract-ocr/<br />
Resources Used / Required:<br />
- Multicore Personal Computers<br />
- Time on Cyclades and Niagara