CHARACTER EXTRACTION AND RECOGNITION FROM DOCUMENT IMAGES USING SEGMENTATION AND FEATURE EXTRACTION

REHNA. V. J 1, R. NEHA 2 & SAMPADA. H. K 3
1 Department of E & C Engineering, HKBK College of Engineering, Bangalore, India
2&3 Department of Telecommunication Engineering, Atria Institute of Technology, Bangalore, India
E-mail : rehna_vj@yahoo.co.in 1, r.neha0401@gmail.com 2

Abstract - Existing algorithms for converting document images are very expensive; in this paper, a cost-effective tool is introduced. Optical Character Recognition is the process of automatic conversion of machine-printed text into computer-processable codes. This work emphasizes extracting uppercase letters, lowercase letters and numerals from document images using segmentation and feature extraction in the MATLAB Image Processing Toolbox. The proposed method can extract characters from a document image (scanned or camera-captured) of any font size, colour and spacing, and the recognized text can be written to an editable window such as Notepad or WordPad, where the characters can even be edited; this improves accuracy and saves time.

Keywords - Document image analysis; optical character recognition; segmentation; classification.

I. INTRODUCTION

Replicating human functions by machines, making the machine able to perform tasks like reading, is an ancient dream. In the past few decades, however, despite advances in computing, it remains unclear whether the computer has decreased or increased the amount of paper.
The objective of document image analysis is to recognize the text and graphic components in images and to extract the intended information as a human would. Tasks that support recognizing the text by OCR include determining the skew, columns, text lines and words. Character extraction is the extraction of characters from document images and the analysis of the same. It is one of the key tasks in document image analysis and a subtask of page segmentation. It involves segmentation, feature extraction and classification. Extraction of characters embedded in complex coloured document images is a very challenging problem.

Character extraction is done in order to speed up data entry; one of the main reasons for employing this method is the reduction in manual data-entry errors. The automatic data retrieval systems currently prevailing in the market use indirect methods of reading data from the document, for example watermark-based systems to read confidential data and bar-code systems to read the price of goods in retail commercial outlets. A system for direct reading of text from documents is introduced here which can replace the existing technology.

This paper aims at performing character extraction, recognition and interpretation of characters from document images. The result is then exported to spreadsheets, where it can even be edited. There are many potential uses of character extraction in banking and healthcare.
Character extraction from images finds many useful applications: document processing; analysis of technical papers with tables, maps, charts and electric circuits; identification of parts in industrial automation; and content-based image/video retrieval from image/video databases, since educational and training videos and TV programs such as news contain mixed text-picture-graphics regions. Other applications include document retrieval and camera-based document analysis.

The rest of the paper is organised as follows: section II deals with the basics of document image analysis, section III describes the implementation, section IV presents the flow diagram, followed by the methodology in section V; section VI comprises the results obtained, future prospects are discussed in section VII, and finally the conclusion of the paper is presented in section VIII.

II. DOCUMENT IMAGE ANALYSIS

DIA is the theory and practice of recovering the symbol structures of digital images scanned from paper or produced by computer. Studying the content and structure of documents, identifying and naming the components of some class of documents, and specifying their interrelationships and properties is known as document image analysis. Document analysis systems will become increasingly evident in the form of everyday document systems. Data capture of documents by optical scanning or by digital video yields a file of picture elements, or pixels, that is the raw input to document analysis. The first step in document analysis is to perform processing on this image to prepare it for further analysis.
Such processing includes thresholding to reduce a gray-scale or colour image to a binary image, noise reduction to remove extraneous data, and subsequent detection of pertinent features and objects of interest. Document image analysis and recognition (DIAR) is a research field that has its roots in the first Optical Character Recognition (OCR) systems.

IRNet Transactions on Electrical and Electronics Engineering (ITEEE) ISSN 2319 – 2577, Vol-1, Iss-2, 2012

Two categories of document image analysis can be defined. Textual processing deals with the text components of a document image. Some tasks here are determining the skew (any tilt at which the document may have been scanned into the computer); finding columns, paragraphs, text lines and words; and finally recognizing the text (and possibly its attributes such as size, font etc.) by optical character recognition (OCR). Graphics processing deals with the non-textual line and symbol components that make up line diagrams, delimiting lines between text sections, company logos etc. Pictures are a third major component of documents, but except for recognizing their location on a page, further analysis of these is usually the task of other image processing and machine vision techniques.

There are two main types of analysis that are applied to text in documents.
One is optical character recognition (OCR), to derive the meaning of the characters and words from their bit-mapped images; the other is page-layout analysis, to determine the formatting of the text and from that to derive the meaning associated with the positional and functional blocks (titles, subtitles, bodies of text, footnotes etc.) in which the text is located. Depending on the arrangement of these text blocks, a page of text may be the title page of a paper, the table of contents of a journal, a business form, or the face of a mail piece. OCR and page-layout analysis may be performed separately, or the results of one analysis may be used to aid or correct the other.

OCR deals with the automatic conversion of scanned images of machine-printed or handwritten documents into computer-processable codes. The goal of Optical Character Recognition (OCR) is to classify optical patterns (often contained in a digital image) corresponding to alphanumeric or other characters. The process of OCR involves several steps, namely segmentation, feature extraction and classification. Segmentation is the process of partitioning a digital image into multiple segments and is divided into two levels: in the first level, text and graphics are separated and sent for subsequent processing; in the second level, the rows, columns, paragraphs and words are recognized. Segmentation is of two types, namely implicit segmentation, in which words are recognized, and explicit segmentation, in which individual characters are recognized. Explicit segmentation is employed here.

The attributes of the image are recovered in the feature extraction stage.
The number of lines, the line spacing and the number of crossings are noted here. Classification is the process of labelling undefined objects. There are two approaches to classification, namely statistical classification and structural classification. In the statistical classification approach, character image patterns are represented by points; in the structural classification approach, strokes, cross points and end points are noted. The structural classification approach is adopted.

III. IMPLEMENTATION

Character extraction is mainly done to ease the process of machine reading. Detection of characters from documents in which text is embedded in complex coloured document images is a very challenging task [16]. It involves segmentation, feature extraction, classification, recognition and interpretation of the characters. Execution is divided into two main stages, namely training and testing, as shown in figure 1. Training data is the data used to create the database, and test data is the input image under consideration. These images are made to undergo the following steps:

A. Pre-processing:
A scanned document image is taken as input and converted from RGB to binary in the pre-processing stage; this process is known as binarization, after which the background is removed. The image resulting from the scanning process may contain a certain amount of noise, hence pre-processing is required.

B. Feature extraction:
The relevant information about the characters is extracted from the pre-processed image in feature extraction, and the image is partitioned into multiple segments in segmentation for subsequent processing. Errors in the segmentation process may also result in confusion between text and graphics or between text and noise.

C. Classification:
These features are then interpreted and classified accordingly in classification. The structural classification approach is adopted here. Incorrect classification may also be due to poor design of the classifier.

D. Database Creation:
The images of uppercase letters A to Z, lowercase letters a to z and numerals 0 to 9 are recorded; these are called sample images. They are resized to the dimensions of the extracted characters, recognised and assigned a label depending upon their features.

Fig. 1 : Block diagram of proposed method
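As an illustration of the binarization performed in pre-processing, the sketch below derives a global threshold in the spirit of Otsu's method and marks ink pixels as 1. It is written in Python with NumPy for brevity (the paper's own implementation uses the MATLAB Image Processing Toolbox, where `graythresh` and `im2bw` do the same job); the function name `binarize` is ours.

```python
import numpy as np

def binarize(gray):
    """Global Otsu-style threshold: pick the cut that maximizes
    between-class variance, then map dark ink to 1, background to 0."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(float)
    total = hist.sum()
    cum = np.cumsum(hist)                       # cumulative pixel counts
    cum_mean = np.cumsum(hist * np.arange(256)) # cumulative intensity sums
    best_t, best_var = 0, -1.0
    for t in range(1, 256):
        w0 = cum[t - 1] / total                 # background class weight
        w1 = 1.0 - w0                           # foreground class weight
        if w0 == 0 or w1 == 0:
            continue
        m0 = cum_mean[t - 1] / cum[t - 1]                            # mean below t
        m1 = (cum_mean[-1] - cum_mean[t - 1]) / (total - cum[t - 1]) # mean above t
        var = w0 * w1 * (m0 - m1) ** 2          # between-class variance
        if var > best_var:
            best_var, best_t = var, t
    # dark ink on light paper: pixels below the threshold are text
    return (gray < best_t).astype(np.uint8)
```

This assumes dark characters on a light background; the opposite polarity would flip the final comparison.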


The extracted features are then stored in the database for further processing. In the identification stage, features of the extracted characters are compared with those in the database. Based on the matched features, the characters are recognised, after which each character is interpreted depending on the label assigned previously. After interpretation, the identified text is formatted and displayed in an editable window such as Notepad.

IV. FLOW DIAGRAM

The input document image is read and converted from RGB to binary in the pre-processing stage; this process is called binarization, after which the background is removed. The input image may contain objects smaller than 15 pixels, or irregular or discontinuous characters, which are considered noise and are eliminated. The most important reason to remove noise is that the extraneous features may otherwise cause errors during recognition.

The relevant information about the characters is then extracted from the pre-processed image in feature extraction, partitioned into multiple segments in segmentation, and sent for ensuing processing. Two fundamental functions, namely line crop and letter crop, are used for character extraction. These features are then interpreted and classified accordingly. Each extracted individual character is resized to the dimensions of the sample images and compared with those in the database. Based on the matched features, the identified characters are formatted and displayed in an editable window.

V. METHODOLOGY

In this paper the scanned/camera-captured document image is considered as the input.
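The noise-elimination step in the flow above, discarding objects smaller than 15 pixels, corresponds to MATLAB's `bwareaopen`. A minimal pure-Python equivalent is sketched below; the function name and the choice of 4-connectivity are our assumptions, since the paper does not specify them.

```python
import numpy as np

def remove_small_objects(binary, min_pixels=15):
    """Flood-fill each connected component of ink pixels and erase
    components smaller than min_pixels, treating them as noise."""
    out = binary.astype(np.uint8).copy()
    h, w = out.shape
    seen = np.zeros((h, w), dtype=bool)
    for i in range(h):
        for j in range(w):
            if out[i, j] and not seen[i, j]:
                # collect this 4-connected component with an explicit stack
                stack, comp = [(i, j)], []
                seen[i, j] = True
                while stack:
                    y, x = stack.pop()
                    comp.append((y, x))
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if 0 <= ny < h and 0 <= nx < w \
                                and out[ny, nx] and not seen[ny, nx]:
                            seen[ny, nx] = True
                            stack.append((ny, nx))
                if len(comp) < min_pixels:
                    for y, x in comp:   # component too small: erase it
                        out[y, x] = 0
    return out
```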
The document image may consist of uppercase letters, lowercase letters and numerals. The first step is to read the input image, which in this case is scanned or camera-captured; the scanned copy is loaded into the system as a colour image in RGB format. The image is then converted from RGB to binary in the pre-processing stage (binarization), and the background is removed. The two fundamental functions, line crop and letter crop, are applied in the feature extraction and segmentation stage. For analysing lines, the algorithm treats a continuous array of white pixels as the start of a line, and when dark pixels are encountered it considers the line ended. This procedure is continued until the last line of the image is displayed. In the letter crop function, each character is cropped from the individual line, resized to the required dimensions and displayed. This process is continued until the last character of the first line is displayed, and the same course of action is applied to the remaining lines of the input image. After letter crop, the extracted characters are compared with those in the database; if a match is found, the character is interpreted, formatted and displayed. If no exact match is found, the maximum-matched character is displayed.

VI. RESULTS AND DISCUSSION

In this paper the scanned/camera-captured document image is considered as the input. The colour image in RGB format is converted to a binary image using binarization. Considering figure 3.a and figure 4.a as the input images, the binary images are shown in figure 3.b and figure 4.b.

A. First Result

Fig. 2 : Flow chart for character extraction

Fig. 3.a : Input image
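The line crop and letter crop functions described in the methodology can be sketched with projection profiles: rows containing ink form a line, and within a line, columns containing ink form a character. The Python sketch below assumes a binary image with ink pixels equal to 1; the names `crop_lines` and `crop_letters` are illustrative, not the paper's own identifiers.

```python
import numpy as np

def runs(profile):
    """Return (start, end) index pairs of non-zero runs in a 1-D profile."""
    on = np.concatenate(([0], (profile > 0).astype(int), [0]))
    edges = np.flatnonzero(np.diff(on))
    return list(zip(edges[::2], edges[1::2]))

def crop_lines(binary):
    """Line crop: consecutive rows whose horizontal projection
    contains ink make up one text line."""
    return [binary[r0:r1, :] for r0, r1 in runs(binary.sum(axis=1))]

def crop_letters(line):
    """Letter crop: consecutive columns with ink within a single
    line make up one character."""
    return [line[:, c0:c1] for c0, c1 in runs(line.sum(axis=0))]
```

Applied in sequence, `crop_letters(crop_lines(img)[0])` yields the cropped characters of the first line; each would then be resized to the sample-image dimensions before matching.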


Fig. 3.b : Binary image

Result obtained:
CRYPTOGRAPHY AND
NETORK SECURATY
PWOPLE5 AND PRACT4CES
FOURTH EDSTSON

Output:
CRYPTOGRAPHY AND
NETWORK SECURITY
PRINCIPLES AND PRACTICES
FOURTH EDITION

B. Second Result

Fig. 4.a : Input image
Fig. 4.b : Binary image

Result obtained:
AA B9 CC D0 FE FF GG HH 9P
11 KK LL MM NN 00 PP 00 RP
SS TT UU VV WW XX YV ZZ
T234567S90 6 C C C Y
1Z34567890 1234567890
1Z3456789O 1334567890
123456789O 123456789O
1234BB789D Q234BB7B9O

Output:
AA BB CC DD EE FF GG HH II
JJ KK LL MM NN 00 PP QQ RR
SS TT UU VV WW XX YV ZZ
1234567S90 $ ¢ € £ ¥
1234567890 1234567890
1234567890 1334567890
1234567890 1234567890
1234567890 1234567890

When the results of the two images are compared, the second result is more accurate than the first, as irregularities from misalignment due to manual errors while scanning are reduced. The output is displayed irrespective of colour, spacing, font size and intensity in an editable window such as Notepad or WordPad, where it can even be edited.

VII. FUTURE WORK

Direct reading of text from a binary or colour image is a challenging task because of the complex background or degradations introduced during the scanning of a paper document. In this paper a simple solution to this problem based on optical character recognition is presented. The algorithm gives promising results on a number of images, in which almost all character lines are retrieved, and gives 90 percent accuracy for all printed characters. In the future, the proposed work can be enhanced by increasing the database to contain alphabets and numbers of any font style, which even includes different handwritten styles.
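The best-match step behind outputs like those above ("if not matching, the maximum matched character is displayed") can be sketched as follows. The paper does not specify its matching metric, so the pixel-agreement score and the function name `classify` here are our assumptions; the character and each database template are assumed already resized to the same dimensions.

```python
import numpy as np

def classify(char_img, templates):
    """Compare a cropped binary character against every labelled sample
    in the database and return the label with the highest pixel
    agreement, so the closest character is reported even without an
    exact match."""
    scores = {}
    for label, tmpl in templates.items():
        # fraction of pixel positions where character and template agree
        scores[label] = float(np.mean(char_img == tmpl))
    return max(scores, key=scores.get)
```

This is why visually similar characters such as '0'/'O' can be confused: their templates score almost equally against the same probe.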
However, one disadvantage of having both numbers and alphabets in the database is the possibility of misinterpreting characters with similar features, such as '0' and 'O' or '1' and 'l'.

New methods for character recognition are still expected to appear as computer technology develops and decreasing computational restrictions open up new approaches. There might, for instance, be potential in performing character recognition directly on grey-level images. However, the greatest potential seems to lie in the exploitation of existing methods, by mixing methodologies and making more use of context; generally there is potential in using context to a larger extent than is done today. In addition, combinations of multiple independent feature sets and classifiers, where the weakness of one method is compensated by the strength of another, may improve the recognition of individual characters. The frontiers of research within character recognition have now moved towards the recognition of cursive script, that is, handwritten connected or calligraphic characters. Promising techniques within this area deal with the recognition of entire words instead of individual characters.

VIII. CONCLUSION

The method adopted can read text from scanned document images as well as camera-captured images of all major image file formats, such as jpg, bmp, gif, tif and png, in any font size, intensity, spacing and colour, and display it in an editable window such as Notepad, WordPad or MS Word with good accuracy.

IX. ACKNOWLEDGMENT

The authors extend their thanks to the Management of Atria Institute of Technology, Bangalore; Dr. M. S. Shivakumar, Principal, Atria Institute of Technology, Bangalore; and Prof. Vasanthi. S, Head of the Department of Telecommunication Engineering, Atria Institute of Technology, Bangalore, for their support and co-operation in carrying out the work successfully in the institute.

REFERENCES

[1] Lawrence O'Gorman & Rangachar Kasturi, "Document Image Analysis", IEEE Computer Society Executive Briefings.
[2] Rafael C. Gonzalez & Richard E. Woods, "Digital Image Processing", Pearson Education, 2001, 2nd edition.
[3] Rafael C. Gonzalez & Richard E. Woods, "Digital Image Processing using MATLAB", Pearson Education, 2001, 2nd edition.
[4] Anil K. Jain, "Fundamentals of Digital Image Processing", Pearson Education, 2001.
[5] L. O'Gorman, "Binarization and Multi-Thresholding of Document Images using Connectivity", CVGIP: Graphical Models and Image Processing, Vol. 56, 1994.
[6] L. O'Gorman, "Image and Document Processing Techniques for the RightPages Electronic Library System", Int. Conf. Pattern Recognition (ICPR), The Netherlands, Sept. 1992.
[7] A. K. Jain, "Fundamentals of Digital Image Processing", Prentice Hall, New Jersey, 1989.
[8] G. Nagy, S. Seth, "Hierarchical representation of optically scanned documents", Proceedings of the 7th International Conference on Pattern Recognition (ICPR), Montreal, Canada, 1984.
[9] G. Nagy, "Optical Character Recognition and Document Image Analysis", Rensselaer Video, Clifton Park, New York, 1992.
[10] Matti Pietikäinen and Oleg Okun, "Text Extraction from Grey Scale Page Images by Simple Edge Detectors".
[11] F. Wahl, K. Wong and R. Casey, "Block segmentation and text extraction in mixed text/image documents", Computer Vision, Graphics and Image Processing, 20:375-390, 2001.
[12] J. Ohya, A. Shio, and S. Akamatsu, "Recognizing characters in scene images", IEEE Trans. Pattern Anal. Machine Intell., vol. 16, pp. 215–220, Feb. 1999.
[13] T. K. Ho, J. J. Hull, S. N. Srihari, "A computational model for recognition of multifont word images", Machine Vision and Applications, Vol. 5, 1992.
[14] H. M. Suen and J. F. Wang, "Text string extraction from images of color printed documents", Proc. Inst. Elect. Eng. Vis., Image, Signal Process., vol. 143, no. 4, pp. 210–216, 2006.
[15] L. Wang and T. Pavlidis, "Direct gray-scale extraction of features for character recognition", IEEE Trans. Pattern Anal. Machine Intell., vol. 15, pp. 1053–1067, Oct. 2003.
[16] Jagath Samarabandu and Xiaoqing Liu, "An Edge-based Text Region Extraction Algorithm for Indoor Mobile Robot Navigation", International Journal of Signal Processing, 3(4), 2007.
[17] Yassin M. Y. Hasan and Lina J. Karam, "Morphological Text Extraction from Images", IEEE Transactions on Image Processing, vol. 9, no. 11, Nov. 2000.
[18] F. Wahl, K. Wong and R. Casey, "Block segmentation and text extraction in mixed text/image documents", Computer Vision, Graphics and Image Processing, 20:375-390, 2001.
[19] O. Hori, A. Okazaki, "High quality vectorization based on a generic object model", in Structured Document Image Analysis, H. S. Baird, H. Bunke, K. Yamamoto (eds.), Springer-Verlag, 1992, pp. 325-339.
[20] Williams, P. S. and Alder, M. D., "Generic texture analysis applied to newspaper segmentation", IEEE International Conference on Neural Networks, 3, 3-6: 1664-1669, 1996.
[21] Zhong, Yu, Karu, K., and Jain, A. K., "Locating text in complex color images", Proceedings of the Third International Conference on Document Analysis and Recognition, 1, 14-16: 146-149, 1995.
