CHARACTER EXTRACTION AND RECOGNITION FROM DOCUMENT IMAGES USING SEGMENTATION AND FEATURE EXTRACTION

REHNA. V. J 1, R. NEHA 2 & SAMPADA. H. K 3
1 Department of E & C Engineering, HKBK College of Engineering, Bangalore, India
2&3 Department of Telecommunication Engineering, Atria Institute of Technology, Bangalore, India
E-mail : rehna_vj@yahoo.co.in 1, r.neha0401@gmail.com 2

Abstract - Existing algorithms for converting document images are very expensive; in this paper, a cost-effective tool is introduced. Optical Character Recognition is the process of automatic conversion of machine-printed text into computer-processable codes. This work emphasizes extracting uppercase letters, lowercase letters and numerals from document images using segmentation and feature extraction in the MATLAB Image Processing Toolbox. The proposed method can extract characters from a document image (scanned or camera-captured) of any font size, colour and spacing, and the recognized text can be written to an editable window such as Notepad or WordPad, where the characters can even be edited; this improves accuracy and saves time.

Keywords - Document image analysis; optical character recognition; segmentation; classification.

I. INTRODUCTION

Replicating human functions by machines, making the machine able to perform tasks like reading, is an ancient dream. In the past few decades, however, despite advances in computing, it remains unclear whether the computer has decreased or increased the amount of paper.
The objective of document image analysis is to recognize the text and graphic components in images and to extract the intended information as a human would. Tasks that support recognizing the text by OCR include determining the skew, columns, text lines and words. Character extraction is the extraction of characters from document images and the analysis of the same. It is one of the key tasks in document image analysis and a subtask of page segmentation. It involves segmentation, feature extraction and classification. Extraction of characters embedded in complex coloured document images is a very challenging problem.

Character extraction is done in order to speed up data entry; one of the main reasons for employing this method is the reduction in manual data-entry errors. The automatic data retrieval systems currently prevailing in the market use indirect methods of reading data from the document, for example watermark-based systems to read confidential data and bar-code systems to read the price of goods in retail commercial outlets. A system for direct reading of text from documents is introduced here which can replace the existing technology.

This paper aims at performing character extraction, recognition and interpretation of characters from document images. The result is then exported to spreadsheets, where it can even be edited. There are many potential uses of character extraction in banking and healthcare.
Character extraction from images finds many useful applications: document processing; analysis of technical papers with tables, maps, charts and electric circuits; identification of parts in industrial automation; and content-based image/video retrieval from image/video databases, since educational and training videos and TV programs such as news contain mixed text-picture-graphics regions. Other applications include document retrieval and camera-based document analysis.

The rest of the paper is organised as follows: section II deals with the basics of document image analysis, section III describes the implementation, section IV presents the flow diagram, followed by the methodology in section V; section VI comprises the results obtained, future prospects are discussed in section VII, and finally the conclusion of the paper is presented in section VIII.

II. DOCUMENT IMAGE ANALYSIS

DIA is the theory and practice of recovering the symbol structures of digital images scanned from paper or produced by computer. Studying the content and structure of documents, identifying and naming the components of some class of documents, and specifying their interrelationships and properties is known as document image analysis. Document analysis systems will become increasingly evident in the form of everyday document systems. Data capture of documents by optical scanning or by digital video yields a file of picture elements, or pixels, that is the raw input to document analysis. The first step in document analysis is to perform processing on this image to prepare it for further analysis.
Such processing includes thresholding to reduce a gray-scale or colour image to a binary image, noise reduction to remove extraneous data, and subsequent detection of pertinent features and objects of interest. Document image analysis and recognition (DIAR) is a research field that has its roots in the first Optical Character Recognition (OCR) systems.

IRNet Transactions on Electrical and Electronics Engineering (ITEEE) ISSN 2319 – 2577, Vol-1, Iss-2, 2012

Two categories of document image analysis can be defined. Textual processing deals with the text components of a document image. Some tasks here are determining the skew (any tilt at which the document may have been scanned into the computer); finding columns, paragraphs, text lines and words; and finally recognizing the text (and possibly its attributes such as size, font etc.) by optical character recognition (OCR). Graphics processing deals with the non-textual line and symbol components that make up line diagrams, delimiting lines between text sections, company logos etc. Pictures are a third major component of documents, but except for recognizing their location on a page, further analysis of these is usually the task of other image processing and machine vision techniques.

There are two main types of analysis that are applied to text in documents.
One is optical character recognition (OCR), to derive the meaning of the characters and words from their bit-mapped images; the other is page-layout analysis, to determine the formatting of the text and from that to derive the meaning associated with the positional and functional blocks (titles, subtitles, bodies of text, footnotes etc.) in which the text is located. Depending on the arrangement of these text blocks, a page of text may be the title page of a paper, the table of contents of a journal, a business form, or the face of a mail piece. OCR and page-layout analysis may be performed separately, or the results of one analysis may be used to aid or correct the other.

OCR deals with the automatic conversion of scanned images of machine-printed or handwritten documents into computer-processable codes. The goal of Optical Character Recognition (OCR) is to classify optical patterns (often contained in a digital image) corresponding to alphanumeric or other characters. The process of OCR involves several steps, namely segmentation, feature extraction and classification. Segmentation is the process of partitioning a digital image into multiple segments and is divided into two levels: in the first level, text and graphics are separated and sent for subsequent processing; in the second level, the rows, columns, paragraphs and words are recognized. Segmentation is of two types, namely implicit segmentation, in which words are recognized, and explicit segmentation, in which individual characters are recognized. Explicit segmentation is employed here.

The attributes of the image are recovered in the feature extraction stage.
The number of lines, the line spacing and the number of crossings are noted here. Classification is the process of labelling undefined objects. There are two approaches to classification, namely statistical classification and structural classification. In the statistical classification approach, character image patterns are represented by points; in the structural classification approach, strokes, cross points and end points are noted. The structural classification approach is adopted.

III. IMPLEMENTATION

Character extraction is mainly done to ease the process of machine reading. Detection of characters from documents in which text is embedded in complex coloured document images is a very challenging task [16]. It involves segmentation, feature extraction, classification, recognition and interpretation of the characters. Execution is divided into two main stages, namely training and testing, as shown in figure 1. Training data is the data used to create the database, and test data is the input image under consideration. These images are made to undergo the following steps:

A. Pre-processing:
A scanned document image is taken as input and converted from RGB to binary in the pre-processing stage; this process is known as binarization, after which the background is removed. The image resulting from the scanning process may contain a certain amount of noise, hence pre-processing is required.

B. Feature extraction:
The relevant information about the characters is extracted from the pre-processed image in feature extraction, and the image is partitioned into multiple segments in segmentation for subsequent processing. Errors in the segmentation process may also result in confusion between text and graphics or between text and noise.

C. Classification:
These features are then interpreted and classified accordingly in classification. The structural classification approach is adopted here. Incorrect classification may also be due to poor design of the classifier.

D. Database Creation:
The images of uppercase letters A to Z, lowercase letters a to z and numerals 0 to 9 are recorded; these are called sample images. They are resized to the dimensions of the extracted characters, recognised and assigned a label depending upon their features.

Fig. 1 : Block diagram of proposed method
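As an illustration of the binarization performed in pre-processing, the sketch below derives a global threshold in the spirit of Otsu's method and marks ink pixels as 1. It is written in Python with NumPy for brevity (the paper's own implementation uses the MATLAB Image Processing Toolbox, where `graythresh` and `im2bw` do the same job); the function name `binarize` is ours.

```python
import numpy as np

def binarize(gray):
    """Global Otsu-style threshold: pick the cut that maximizes
    between-class variance, then map dark ink to 1, background to 0."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(float)
    total = hist.sum()
    cum = np.cumsum(hist)                       # cumulative pixel counts
    cum_mean = np.cumsum(hist * np.arange(256)) # cumulative intensity sums
    best_t, best_var = 0, -1.0
    for t in range(1, 256):
        w0 = cum[t - 1] / total                 # background class weight
        w1 = 1.0 - w0                           # foreground class weight
        if w0 == 0 or w1 == 0:
            continue
        m0 = cum_mean[t - 1] / cum[t - 1]                            # mean below t
        m1 = (cum_mean[-1] - cum_mean[t - 1]) / (total - cum[t - 1]) # mean above t
        var = w0 * w1 * (m0 - m1) ** 2          # between-class variance
        if var > best_var:
            best_var, best_t = var, t
    # dark ink on light paper: pixels below the threshold are text
    return (gray < best_t).astype(np.uint8)
```

This assumes dark characters on a light background; the opposite polarity would flip the final comparison.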


The extracted features are then stored in the database for further processing. In the identification stage, features of the extracted characters are compared with those in the database. Based on the matched features, the characters are recognised, after which each character is interpreted depending on the label assigned previously. After interpretation, the identified text is formatted and displayed in an editable window such as Notepad.

IV. FLOW DIAGRAM

The input document image is read and converted from RGB to binary in the pre-processing stage; this process is called binarization, after which the background is removed. The input image may contain objects smaller than 15 pixels, or irregular or discontinuous characters, which are considered noise and are eliminated. The most important reason to remove noise is that the extraneous features may otherwise cause errors during recognition.

The relevant information about the characters is then extracted from the pre-processed image in feature extraction, partitioned into multiple segments in segmentation, and sent for ensuing processing. Two fundamental functions, namely line crop and letter crop, are used for character extraction. These features are then interpreted and classified accordingly. Each extracted individual character is resized to the dimensions of the sample images and compared with those in the database. Based on the matched features, the identified characters are formatted and displayed in an editable window.

V. METHODOLOGY

In this paper the scanned/camera-captured document image is considered as the input.
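The noise-elimination step in the flow above, discarding objects smaller than 15 pixels, corresponds to MATLAB's `bwareaopen`. A minimal pure-Python equivalent is sketched below; the function name and the choice of 4-connectivity are our assumptions, since the paper does not specify them.

```python
import numpy as np

def remove_small_objects(binary, min_pixels=15):
    """Flood-fill each connected component of ink pixels and erase
    components smaller than min_pixels, treating them as noise."""
    out = binary.astype(np.uint8).copy()
    h, w = out.shape
    seen = np.zeros((h, w), dtype=bool)
    for i in range(h):
        for j in range(w):
            if out[i, j] and not seen[i, j]:
                # collect this 4-connected component with an explicit stack
                stack, comp = [(i, j)], []
                seen[i, j] = True
                while stack:
                    y, x = stack.pop()
                    comp.append((y, x))
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if 0 <= ny < h and 0 <= nx < w \
                                and out[ny, nx] and not seen[ny, nx]:
                            seen[ny, nx] = True
                            stack.append((ny, nx))
                if len(comp) < min_pixels:
                    for y, x in comp:   # component too small: erase it
                        out[y, x] = 0
    return out
```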
The document image may consist of uppercase letters, lowercase letters and numerals. The first step is to read the input image, which in this case is scanned or camera-captured; the scanned copy is loaded into the system as a colour image in RGB format. The image is then converted from RGB to binary in the pre-processing stage (binarization), and the background is removed. The two fundamental functions, line crop and letter crop, are applied in the feature extraction and segmentation stage. For analysing lines, the algorithm treats a continuous array of white pixels as the start of a line, and when dark pixels are encountered it considers the line ended. This procedure is continued until the last line of the image is displayed. In the letter crop function, each character is cropped from the individual line, resized to the required dimensions and displayed. This process is continued until the last character of the first line is displayed, and the same course of action is applied to the remaining lines of the input image. After letter crop, the extracted characters are compared with those in the database; if a match is found, the character is interpreted, formatted and displayed. If no exact match is found, the maximum-matched character is displayed.

VI. RESULTS AND DISCUSSION

In this paper the scanned/camera-captured document image is considered as the input. The colour image in RGB format is converted to a binary image using binarization. Considering figure 3.a and figure 4.a as the input images, the binary images are shown in figure 3.b and figure 4.b.

A. First Result

Fig. 2 : Flow chart for character extraction

Fig. 3.a : Input image
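The line crop and letter crop functions described in the methodology can be sketched with projection profiles: rows containing ink form a line, and within a line, columns containing ink form a character. The Python sketch below assumes a binary image with ink pixels equal to 1; the names `crop_lines` and `crop_letters` are illustrative, not the paper's own identifiers.

```python
import numpy as np

def runs(profile):
    """Return (start, end) index pairs of non-zero runs in a 1-D profile."""
    on = np.concatenate(([0], (profile > 0).astype(int), [0]))
    edges = np.flatnonzero(np.diff(on))
    return list(zip(edges[::2], edges[1::2]))

def crop_lines(binary):
    """Line crop: consecutive rows whose horizontal projection
    contains ink make up one text line."""
    return [binary[r0:r1, :] for r0, r1 in runs(binary.sum(axis=1))]

def crop_letters(line):
    """Letter crop: consecutive columns with ink within a single
    line make up one character."""
    return [line[:, c0:c1] for c0, c1 in runs(line.sum(axis=0))]
```

Applied in sequence, `crop_letters(crop_lines(img)[0])` yields the cropped characters of the first line; each would then be resized to the sample-image dimensions before matching.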


Fig. 3.b : Binary image

Result obtained:
CRYPTOGRAPHY AND
NETORK SECURATY
PWOPLE5 AND PRACT4CES
FOURTH EDSTSON

Output:
CRYPTOGRAPHY AND
NETWORK SECURITY
PRINCIPLES AND PRACTICES
FOURTH EDITION

B. Second Result

Fig. 4.a : Input image
Fig. 4.b : Binary image

Result obtained:
AA B9 CC D0 FE FF GG HH 9P
11 KK LL MM NN 00 PP 00 RP
SS TT UU VV WW XX YV ZZ
T234567S90 6 C C C Y
1Z34567890 1234567890
1Z3456789O 1334567890
123456789O 123456789O
1234BB789D Q234BB7B9O

Output:
AA BB CC DD EE FF GG HH II
JJ KK LL MM NN 00 PP QQ RR
SS TT UU VV WW XX YV ZZ
1234567S90 $ ¢ € £ ¥
1234567890 1234567890
1234567890 1334567890
1234567890 1234567890
1234567890 1234567890

When the results of the two images are compared, the second result is more accurate than the first, as irregularities from misalignment due to manual errors while scanning are reduced. The output is displayed irrespective of colour, spacing, font size and intensity in an editable window such as Notepad or WordPad, where it can even be edited.

VII. FUTURE WORK

Direct reading of text from a binary or colour image is a challenging task because of the complex background or degradations introduced during the scanning of a paper document. In this paper a simple solution to this problem based on optical character recognition is presented. The algorithm gives promising results on a number of images, in which almost all character lines are retrieved, and gives 90 percent accuracy for all printed characters. In the future, the proposed work can be enhanced by increasing the database to contain alphabets and numbers of any font style, which even includes different handwritten styles.
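The best-match step behind outputs like those above ("if not matching, the maximum matched character is displayed") can be sketched as follows. The paper does not specify its matching metric, so the pixel-agreement score and the function name `classify` here are our assumptions; the character and each database template are assumed already resized to the same dimensions.

```python
import numpy as np

def classify(char_img, templates):
    """Compare a cropped binary character against every labelled sample
    in the database and return the label with the highest pixel
    agreement, so the closest character is reported even without an
    exact match."""
    scores = {}
    for label, tmpl in templates.items():
        # fraction of pixel positions where character and template agree
        scores[label] = float(np.mean(char_img == tmpl))
    return max(scores, key=scores.get)
```

This is why visually similar characters such as '0'/'O' can be confused: their templates score almost equally against the same probe.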
However, one disadvantage of having both numbers and alphabets in the database is the possibility of misinterpreting characters with similar features, such as '0' and 'O' or '1' and 'l'.

New methods for character recognition are still expected to appear as computer technology develops and decreasing computational restrictions open up new approaches. There might, for instance, be potential in performing character recognition directly on grey-level images. However, the greatest potential seems to lie in the exploitation of existing methods, by mixing methodologies and making more use of context; generally there is potential in using context to a larger extent than is done today. In addition, combinations of multiple independent feature sets and classifiers, where the weakness of one method is compensated by the strength of another, may improve the recognition of individual characters. The frontiers of research within character recognition have now moved towards the recognition of cursive script, that is, handwritten connected or calligraphic characters. Promising techniques within this area deal with the recognition of entire words instead of individual characters.

VIII. CONCLUSION

The method adopted can read text from scanned document images as well as camera-captured images of all major image file formats, such as jpg, bmp, gif, tif and png, in any font size, intensity, spacing and colour, and display it in an editable window such as Notepad, WordPad or MS Word with good accuracy.

IX. ACKNOWLEDGMENT

The authors extend their thanks to the Management of Atria Institute of Technology, Bangalore; Dr. M. S. Shivakumar, Principal, Atria Institute of Technology, Bangalore; and Prof. Vasanthi. S, Head of the Department of Telecommunication Engineering, Atria Institute of Technology, Bangalore, for their support and co-operation in carrying out the work successfully in the institute.

REFERENCES

[1] Lawrence O'Gorman & Rangachar Kasturi, "Document Image Analysis", IEEE Computer Society Executive Briefings.
[2] Rafael C. Gonzalez & Richard E. Woods, "Digital Image Processing", Pearson Education, 2001, 2nd edition.
[3] Rafael C. Gonzalez & Richard E. Woods, "Digital Image Processing using MATLAB", Pearson Education, 2001, 2nd edition.
[4] Anil K. Jain, "Fundamentals of Digital Image Processing", Pearson Education, 2001.
[5] L. O'Gorman, "Binarization and Multi-Thresholding of Document Images using Connectivity", CVGIP: Graphical Models and Image Processing, Vol. 56, 1994.
[6] L. O'Gorman, "Image and Document Processing Techniques for the RightPages Electronic Library System", Int. Conf. Pattern Recognition (ICPR), The Netherlands, Sept. 1992.
[7] A. K. Jain, "Fundamentals of Digital Image Processing", Prentice Hall, New Jersey, 1989.
[8] G. Nagy, S. Seth, "Hierarchical representation of optically scanned documents", Proceedings of the 7th International Conference on Pattern Recognition (ICPR), Montreal, Canada, 1984.
[9] G. Nagy, "Optical Character Recognition and Document Image Analysis", Rensselaer Video, Clifton Park, New York, 1992.
[10] Matti Pietikäinen and Oleg Okun, "Text Extraction from Grey Scale Page Images by Simple Edge Detectors".
[11] F. Wahl, K. Wong and R. Casey, "Block segmentation and text extraction in mixed text/image documents", Computer Vision, Graphics and Image Processing, 20:375-390, 2001.
[12] J. Ohya, A. Shio, and S. Akamatsu, "Recognizing characters in scene images", IEEE Trans. Pattern Anal. Machine Intell., vol. 16, pp. 215–220, Feb. 1999.
[13] T. K. Ho, J. J. Hull, S. N. Srihari, "A computational model for recognition of multifont word images", Machine Vision and Applications, Vol. 5, 1992.
[14] H. M. Suen and J. F. Wang, "Text string extraction from images of color printed documents", Proc. Inst. Elect. Eng. Vis., Image, Signal Process., vol. 143, no. 4, pp. 210–216, 2006.
[15] L. Wang and T. Pavlidis, "Direct gray-scale extraction of features for character recognition", IEEE Trans. Pattern Anal. Machine Intell., vol. 15, pp. 1053–1067, Oct. 2003.
[16] Jagath Samarabandu and Xiaoqing Liu, "An Edge-based Text Region Extraction Algorithm for Indoor Mobile Robot Navigation", International Journal of Signal Processing, 3(4), 2007.
[17] Yassin M. Y. Hasan and Lina J. Karam, "Morphological Text Extraction from Images", IEEE Transactions on Image Processing, vol. 9, no. 11, Nov. 2000.
[18] F. Wahl, K. Wong and R. Casey, "Block segmentation and text extraction in mixed text/image documents", Computer Vision, Graphics and Image Processing, 20:375-390, 2001.
[19] O. Hori, A. Okazaki, "High quality vectorization based on a generic object model", in Structured Document Image Analysis, H. S. Baird, H. Bunke, K. Yamamoto (eds.), Springer-Verlag, 1992, pp. 325-339.
[20] Williams, P. S. and Alder, M. D., "Generic texture analysis applied to newspaper segmentation", IEEE International Conference on Neural Networks, 3, 3-6: 1664-1669, 1996.
[21] Zhong, Yu, Karu, K., and Jain, A. K., "Locating text in complex color images", Proceedings of the Third International Conference on Document Analysis and Recognition, 1, 14-16: 146-149, 1995.
