Optical Character Recognition of Amharic Documents - CVIT - IIIT ...

Optical Character Recognition of Amharic Documents

Million Meshesha    C. V. Jawahar
Center for Visual Information Technology
International Institute of Information Technology
Hyderabad - 500 032, India
million@research.iiit.ac.in    jawahar@iiit.ac.in

Abstract— In Africa, around 2,500 languages are spoken. Some of these languages have their own indigenous scripts, and accordingly there is a bulk of printed documents available in libraries, information centers, museums and offices. Digitizing these documents makes it possible to harness already available information technologies for local information needs and development. This paper presents an Optical Character Recognition (OCR) system for converting digitized documents in local languages. An extensive literature survey reveals that this is the first attempt to report the challenges in recognizing indigenous African scripts, together with a possible solution for Amharic script. Research in the recognition of African indigenous scripts faces major challenges due to (i) the use of a large number of characters in the writing and (ii) the existence of a large set of visually similar characters.
In this paper, we propose a novel feature extraction scheme using principal component and linear discriminant analysis, followed by a decision directed acyclic graph based support vector machine classifier. Recognition results are presented on real-life degraded documents such as books, magazines and newspapers to demonstrate the performance of the recognizer.

Index Terms— Optical Character Recognition; African Scripts; Feature Extraction; Classification; Amharic Documents.

I. INTRODUCTION

Nowadays, it is becoming increasingly important to have information available in digital format for greater efficiency in data storage and retrieval, and optical character recognition (OCR) is recognized as one of the valuable input technologies in this respect [1], [2]. This is also supported by the emergence of large digital libraries and the advancement of information technology.

Digital libraries such as the Universal Digital Library, the Digital Library of India, etc. are established to digitize paper documents and make them available to users via the Internet. Advances in information technology provide faster, more powerful processors and high-speed output devices (printers and others) that generate more information at a faster rate. These days, the availability of relatively inexpensive document scanners and optical character recognition software has made OCR an attractively priced data entry methodology [1].

Optical character recognition technology was invented in the early 1800s, when it was patented as a reading aid for the blind. In 1870, C. R. Carey patented an image transmission system using photocells, and in 1890 P. G. Nipkow invented sequential scanning.
However, practical OCR technology for reading characters was introduced in the early 1950s as a replacement for keypunching systems. A year later, D. H. Shephard developed the first commercial OCR for typewritten data. The 1980s saw the emergence of OCR systems intended for use with personal


computers [3]. Nowadays, it is common to find PC-based OCR systems that are commercially available. However, most of these systems are developed to work with Latin-based scripts [4].

Optical character recognition converts scanned images of printed, typewritten or handwritten documents into a computer-readable format (such as ASCII, Unicode, etc.) so as to ease on-line data processing. The potential of OCR for data entry applications is obvious: it offers a faster, more automated, and presumably less expensive alternative to manual data entry, thereby improving the accuracy and speed of transcribing data into the computer system. Consequently, it increases efficiency and effectiveness (by reducing cost, time and labor) in information storage and retrieval.

Major applications of OCR include: (i) library and office automation, (ii) form and bank check processing, (iii) document reader systems for the visually impaired, (iv) postal automation, and (v) database and corpus development for language modeling, text mining and information retrieval [2], [5].

While the use and application of OCR systems is well developed for most languages in the world that use both Latin and non-Latin scripts [6], an extensive literature survey reveals that only a few conference papers are available on the indigenous scripts of African languages. Lately, some research reports have been published on Amharic OCR. Amharic character recognition is discussed in [7], with emphasis on designing a suitable feature extraction scheme for script representation. Recognition using the direction field tensor as a tool for Amharic character segmentation is also reported [8]. Worku and Fuchs [9] present handwritten Amharic bank check recognition.
In contrast, no similar works are found for other indigenous African scripts. Therefore, much effort is needed to come up with better, workable OCR technologies for African scripts in order to satisfy the need for digitized information processing in local languages.

II. AMHARIC SCRIPTS

In Africa, more than 2,500 languages, including regional dialects, are spoken. Some are indigenous languages, while others were introduced by conquerors of the past. English, French, Portuguese, Spanish and Arabic are official languages of many African countries. As a result, most African languages with a writing system use a modification of the Latin or Arabic script. There are also many languages with their own indigenous scripts and writing systems. Some of these scripts include the Amharic script (Ethiopia), Vai script (West Africa), Hieroglyphic script (Egypt), Bassa script (Liberia), Mende script (Sierra Leone), Nsibidi/Nsibiri script (Nigeria and Cameroon) and Meroitic script (Sudan) [10].

Amharic, which belongs to the Semitic language family, became a dominant language in Ethiopia early in its history. It is the official working language of Ethiopia and the most commonly learnt language next to English throughout the country. Accordingly, there is a bulk of information available in printed form that needs to be converted into electronic form for easy searching and retrieval as per users' needs. Suffice it to mention the huge amount of documents piled high in information centers, libraries, museums, and government and private


offices in the form of correspondence letters, magazines, newspapers, pamphlets, books, etc. Converting these documents into electronic format is a must in order to (i) preserve historical documents, (ii) save storage space, and (iii) enhance retrieval of relevant information via the Internet. This makes it possible to harness existing information technologies for local information needs and development.

Fig. 1: Amharic alphabet (FIDEL) with its seven orders arranged row-wise. The second column lists the basic characters; the others are vowel forms, each derived from the basic symbol.

Those African languages using a modified version of the Latin or Arabic script can easily be integrated into the existing Latin and Arabic OCR technologies. It is worth mentioning here some of the related work reported at home and in the diaspora to preserve African languages digitally (predominantly in languages that use Latin scripts): corpora projects in Swahili (East Africa), open source software for languages like Zulu, Sepedi and Afrikaans (South Africa), an indigenous web browser in Luganda (Uganda), improvisation of keyboard characters for some African languages, the presence of African languages on the Internet, etc. [11].

Therefore, we need to give more emphasis to these indigenous African scripts. This is the motivation behind the present report. To the best of our knowledge, this is the first work that reports a robust Amharic character recognizer for conversion of printed document images of varying fonts, sizes, styles and quality.

Amharic is written in the unique and ancient Ethiopic script (inherited from Geez, a Semitic language), now effectively a syllabary requiring over 300 glyph shapes for representation. As shown in Fig.
1, the Amharic script has 33 core characters, each of which occurs in seven orders: one basic form and six non-basic forms consisting of a consonant and a following vowel. The vowel forms are derived from the basic characters with some modification (such as attaching strokes at the middle or end of the base character, elongating one of the legs of the base character, etc.). Other symbols are also available that represent labialization, numerals and punctuation marks. These bring the total number of characters in the script to 310, as shown in Table 1.
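This base-plus-order structure is mirrored in the Unicode Ethiopic block, where the order forms of each consonant occupy consecutive code points. A minimal sketch (assuming the standard Unicode layout starting at U+1200 with eight code points reserved per consonant row; this is an illustration, not part of the paper's system):

```python
# Sketch: derive the seven order variants of an Ethiopic base character
# from its Unicode code point. Assumes the Unicode Ethiopic block layout
# (from U+1200), where each consonant row spans eight consecutive code
# points and the first seven are the seven orders.

ETHIOPIC_START = 0x1200  # first code point of the Ethiopic block

def order_variants(base_char: str) -> list[str]:
    """Return the seven order forms of an Ethiopic base (1st-order) character."""
    cp = ord(base_char)
    row_start = ETHIOPIC_START + ((cp - ETHIOPIC_START) // 8) * 8
    return [chr(row_start + order) for order in range(7)]

# ለ (LA, U+1208) expands to its seven orders: ለ ሉ ሊ ላ ሌ ል ሎ
print("".join(order_variants("ለ")))
```

The eighth slot of each row, skipped here, holds additional forms such as the labialized variants mentioned above.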


D. Visual similarity of most characters in the script

There are a number of very similar characters in the Amharic script that are difficult even for humans to distinguish (examples are presented in Fig. 2). Robust discriminant features need to be extracted to classify each character into its proper category or class.

E. Language related issues

African indigenous languages pose many additional challenges. Some of these are: (i) lack of a standard representation for fonts and encoding, (ii) lack of support from operating systems, browsers and keyboards, and (iii) lack of language processing routines. These issues add to the complexity of the design and implementation of an optical character recognition system.

Fig. 3: Overview of the Amharic OCR design (block diagram: raw document images → preprocessing → segmentation into components → feature extraction → classifier/recognizer models → converted text → post-processing → corrected document text).

III. RECOGNITION OF AMHARIC SCRIPT

We built an Amharic OCR on top of earlier work for Indian languages [12], and reported a preliminary result of our work in [13]. Fig. 3 shows the general framework of the Amharic OCR design.

The system accepts scanned document images as input. The resolution, or number of dots per inch (dpi), produced by the scanner determines how closely the reproduced image corresponds to the original. The higher the resolution, the more information is recorded and, therefore, the greater the file size, and vice versa. The resolution most suitable for character recognition ranges from 200 dpi to 600 dpi, depending on document quality.
In practice, most systems recommend scanning documents at a resolution of 300 dpi for optimum OCR accuracy. We also use this resolution to digitize documents for the experiment.

Once images are captured, they are preprocessed before individual components


different representations of the characters. Some of these features consider profiles, structural descriptors and transform-domain representations [2], [16]. Alternatively, one could consider the entire image as the feature. The former methods are highly language specific and become very complex when representing all the characters in the script. The latter scheme provides excellent results for printed character recognition.

As a result, we extract features from the entire image by concatenating all its rows to form a single contiguous vector. This feature vector consists of zeros (0s) and ones (1s) representing background and foreground pixels in the image, respectively. With such a representation, memory and computational requirements are very intensive for languages like Amharic that have a large number of characters in the writing.

Therefore, we need to transform the features to obtain a lower-dimensional representation. Various methods are employed in the pattern recognition literature for reducing the dimension of feature vectors [17]. In the present work we consider Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA).

4.1 Principal Component Analysis

Principal component analysis (PCA) can help identify new features (in a lower-dimensional subspace) that are most useful for representation [17]. This should be done without losing valuable information. Principal components can give superior performance for font-independent OCRs, easy adaptation across languages, and scope for extension to handwritten documents.

Consider the i-th image sample represented as an M-dimensional (column) vector x_i, where M depends on the image size.
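The whole-image representation described above, where the rows of a normalized glyph are concatenated into a single 0/1 vector x_i, can be sketched as follows (NumPy; the toy glyph and the threshold value are illustrative assumptions, not the paper's data):

```python
import numpy as np

# Sketch: turn a normalized character image into the binary feature
# vector described above - rows concatenated into one contiguous vector
# of 0s and 1s, 1 for foreground (ink) pixels and 0 for background.

def to_feature_vector(glyph: np.ndarray, threshold: int = 128) -> np.ndarray:
    """Flatten a grayscale glyph image into a binary row-major vector."""
    binary = (glyph < threshold).astype(np.uint8)  # dark pixels = foreground
    return binary.reshape(-1)  # concatenate rows: M = height * width

# A toy 20x20 "glyph": white background with a dark vertical stroke.
glyph = np.full((20, 20), 255, dtype=np.uint8)
glyph[2:18, 9:11] = 0
x = to_feature_vector(glyph)
print(x.shape, int(x.sum()))  # (400,) 32 foreground pixels
```

For a 20×20 normalized glyph this gives M = 400, the dimension used in the experiments reported later.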
From a large training set x_1, …, x_N, we compute the covariance matrix as:

    Σ = (1/N) ∑_{i=1}^{N} (x_i − μ)(x_i − μ)^T        (1)

Then we need to identify the minimal dimension, say K, such that:

    1 − ∑_{i=1}^{K} λ_i / Trace(Σ) ≤ α        (2)

where λ_i is the i-th largest eigenvalue of the covariance matrix Σ and α is a limiting value in percent.

Eigenvectors corresponding to the largest K eigenvalues are the directions of greatest variance. The k-th eigenvector is the direction of greatest variation perpendicular to the first through (k−1)-th eigenvectors. The eigenvectors are arranged as rows in a matrix A, and this transformation matrix is used to compute the new feature vector by projecting y_i = A x_i. With this we obtain a compact lower-dimensional representation of the component images with reduced feature size.

Principal component analysis (PCA) yields projection directions that maximize the total scatter across all classes. In choosing the projection which maximizes total scatter, PCA retains not only the between-class scatter that is useful for classification, but also the within-class scatter that is unwanted information for classification purposes. Much of the variation seen among document images is due to printing variations and degradations. If PCA is applied to such images, the transformation


matrix will contain principal components that retain these variations in the projected feature space. Consequently, the points in the projected space will not be well separated, and the classes may be smeared together.

Thus, while the PCA projections are optimal for representation in a low-dimensional space, they may not be optimal from a discrimination standpoint for classification. Hence, in our work, we further apply linear discriminant analysis to extract useful features from the dimensionally reduced space produced by PCA.

4.2. Linear Discriminant Analysis

Linear discriminant analysis (LDA) selects features based entirely upon their discriminatory potential. Let the projection be y = Dx, where x is the input image and D is the transformation matrix. The objective of LDA is to find a projection that maximizes the ratio of the between-class scatter to the within-class scatter [15].

We apply the algorithm proposed by Foley and Sammon [18] for linear discriminant analysis. The algorithm extracts a set of optimal discriminant features for a two-class problem, which suits the Support Vector Machine (SVM) classifier we use for the classification task.

Consider the i-th image sample represented as an M-dimensional (column) vector x_i, where M is the dimension after reduction with PCA. From the given sets of training components, x_1, …, x_N, we compute the within-class scatter matrix (W_i) and the between-class difference (Δ) as:

    W_i = ∑_{j=1}^{N_i} (x_ij − μ_i)(x_ij − μ_i)^T        (3)

    Δ = μ_1 − μ_2        (4)

where x_ij is the j-th sample in the i-th class. The weighted sum of the within-class scatter matrices is determined by:

    A = c W_1 + (1 − c) W_2,    0 ≤ c ≤ 1        (5)

The discriminant vectors are then computed recursively from A and Δ (Eq. 6): the first vector is d_1 = α_1 A^{-1} Δ, each discriminant value is ω_n = (d_n^T Δ)^2 / (d_n^T A d_n), and subsequent vectors d_n are obtained from the partitioned scatter matrices S_n as given in [18].

Fig. 4: Algorithm for computing the L best discriminant feature vectors. The vertical and horizontal lines partition the scatter matrix S_n^{-1} such that the result obtained is a 2×2 block matrix.

The algorithm presented in Fig. 4 is called recursively to extract an optimal set of discriminant vectors (d_n) corresponding to the first L highest discriminant values, such that ω_1 ≥ ω_2 ≥ … ≥ ω_L ≥ 0 [18]. Here, K is the number of iterations for computing the discriminant vectors and discriminant values, and α_n is chosen such that:

    d_n^T d_n = 1        (7)


The quantities y_n and c_n are determined recursively from the partitioned scatter matrix S_n (Eq. 8; see [18]).

Since the principal purpose of classification is discrimination, the discriminant vector subspace offers considerable potential as a feature extraction transformation. Hence, the use of linear discriminant features for each pair of classes in each component classification is an attractive solution for designing better classifiers.

However, the problem with this approach is that classification becomes slower, because LDA extracts features for each pair of classes. This requires storage space for N(N−1)/2 transformation matrices. For instance, for a dataset with 200 classes, LDA generates around 20 thousand pairwise transformation matrices, which is very expensive for large class problems.

Hence, we propose a two-stage feature extraction scheme: PCA followed by LDA for optimal discriminant feature extraction. We first reduce the feature dimension using PCA and then run LDA on the reduced lower-dimensional space to extract the most discriminant feature vectors. This greatly reduces the storage and computational complexity, while enhancing the performance of the SVM-based decision directed acyclic graph (DDAG) classifier.

V. CLASSIFICATION

Training and testing are the two basic phases of any pattern classification problem. During the training phase, the classifier learns the association between samples and their labels from labeled samples. The testing phase involves analysis of errors in the classification of unlabeled samples in order to evaluate the classifier's performance. In general, it is desirable to have a classifier with minimal test error.

We use Support Vector Machines (SVMs) for the classification task [19]. SVMs are pairwise discriminating classifiers with the ability to identify the decision boundary with maximal margin.
Maximal margin results in better generalization [20], which is a highly desirable property for a classifier to perform well on novel data. Support vector machines are less complex and perform better (lower actual error) with limited training data.

Identification of the optimal hyperplane for separation involves maximization of an appropriate objective function. The result of the training phase is the identification of a set of labeled support vectors x_i and a set of coefficients α_i. Support vectors are the samples near the decision boundary. They have class labels y_i ∈ {±1}. The decision is made from:

    f(x) = sgn( ∑_{i=1}^{l} α_i y_i K(x_i, x) )        (9)

where K is the kernel function, which is defined by:

    K(x, y) = φ(x) · φ(y)        (10)

where φ : R^d → H maps the data points from the lower-dimensional space to a higher-dimensional space H.

Binary classifiers like SVMs are basically designed for two-class classification problems. However, because of the large number of characters in any script, the optical character recognition problem is inherently multi-class in nature. The field of binary classification is mature, and


provides a variety of approaches to solve the problem of multi-class classification [21]. Most existing multi-class algorithms address the problem by first dividing it into smaller sets of class pairs and then combining the results of the binary classifiers using a suitable voting method, such as majority or weighted majority approaches [21], [22].

Fig. 5: A rooted binary decision directed acyclic graph (DDAG) multi-class classifier for an eight-class classification problem.

5.1 Multi-class SVMs

Multi-class SVMs are usually implemented as combinations of two-class solutions using majority voting methods. It has been shown that integrating pairwise classifiers using a decision directed acyclic graph (DDAG) results in better performance compared to other popular techniques such as decision trees, Bayesian classifiers, etc. [19].

We construct a decision directed acyclic graph (DDAG) where each node in the graph corresponds to a two-class classifier for a pair of classes. The multi-class classifier built using a DDAG for an 8-class classification problem is shown in Fig. 5. It can be observed that the number of binary classifiers built for an N-class classification problem is N(N−1)/2. The input vector is presented at the root node of the DDAG and moves through the DDAG until it reaches a leaf node, where the output (class label) is obtained. At each node, a decision is made concerning which class the input does not belong to.

For example, at the root node of Fig. 5, a decision is made whether the input belongs to class 1 or class 8. If it does not belong to class 1, it moves to the left child; otherwise, if it does not belong to class 8, it moves to the right child.
This process is repeated at each of the remaining nodes until a leaf node is reached, where the decision is made to classify the input as one of the eight classes.
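The elimination traversal just described can be sketched with a toy stand-in for the pairwise classifiers (a nearest-centroid rule on 1-D data; the trained pairwise SVMs of the actual system would take its place):

```python
# Sketch: class-elimination traversal of a DDAG multi-class classifier.
# Each node holds a binary classifier for a pair of classes; the class
# rejected at a node is eliminated, so N - 1 decisions reach a leaf.
# `decide` below is a toy stand-in, not a trained SVM.

def ddag_classify(x, classes, decide):
    """Return the surviving class label after DDAG elimination."""
    candidates = list(classes)
    while len(candidates) > 1:
        a, b = candidates[0], candidates[-1]        # node for the pair (a, b)
        winner = decide(x, a, b)                    # binary decision
        candidates.remove(a if winner == b else b)  # eliminate the loser
    return candidates[0]

# Toy eight-class problem on a 1-D feature: class c has centroid c.
centroids = {c: float(c) for c in range(1, 9)}

def decide(x, a, b):
    return a if abs(x - centroids[a]) <= abs(x - centroids[b]) else b

print(ddag_classify(5.2, range(1, 9), decide))  # classifies the input as 5
```

Note that although N(N−1)/2 pairwise classifiers are trained, only N−1 of them are evaluated for any single input, which keeps classification fast.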


VI. RESULTS AND DISCUSSIONS

We have a recognition system that converts given document images into an equivalent textual format. The system accepts already scanned document images, or scans a given Amharic text document at a resolution of 300 dpi on a flat-bed scanner (an HP 7670 Scanjet). The scanned document is binarized, noise-removed, skew-corrected and scaled before individual components are extracted. Preprocessed pages are then segmented into character components for feature extraction and classification. Following the shape formation of Amharic letters, we first decompose the lines in a text page into words, and the words into appropriate character components for recognition; we then recompose the recognized components to determine the order of characters, words and lines in the text page.

Once character components are identified, optimal discriminant features (in a lower-dimensional space) are extracted for classification. We use a two-stage dimensionality reduction scheme based on 99 percent principal component analysis, followed by 15 percent linear discriminant analysis.

The original dimension of the feature vector of each character image (after normalization) is 400 (20 × 20). Principal component analysis reduces the dimensionality from 400 to 295, and linear discriminant analysis reduces it further to 50. We deal with such reduced optimal discriminant feature vectors. Both methods perform well in feature dimensionality reduction. The use of a two-stage feature extraction scheme further alleviates the similarity problem encountered when PCA is applied alone.
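The two-stage reduction can be sketched in NumPy on synthetic data (the low-rank Gaussian classes and all sizes here are illustrative assumptions, not the paper's 20×20 glyph vectors):

```python
import numpy as np

# Sketch: two-stage feature reduction - PCA keeping at least 99 percent
# of the total variance, then a Fisher/Foley-Sammon style discriminant
# direction for one class pair. Synthetic low-rank data stands in for
# the real glyph feature vectors.

rng = np.random.default_rng(0)
B = rng.normal(size=(10, 400))  # latent-to-observed map (rank-10 structure)
X1 = rng.normal(0.0, 1.0, (100, 10)) @ B + 0.01 * rng.normal(size=(100, 400))
X2 = rng.normal(0.6, 1.0, (100, 10)) @ B + 0.01 * rng.normal(size=(100, 400))
X = np.vstack([X1, X2])         # two classes, 400-D features

# Stage 1: PCA - eigendecompose the covariance and keep the smallest K
# whose eigenvalues cover 99 percent of Trace(cov), cf. Eq. (2).
mu = X.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov((X - mu).T))
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
K = int(np.searchsorted(np.cumsum(eigvals) / eigvals.sum(), 0.99)) + 1
A_pca = eigvecs[:, :K].T        # K x 400 transformation, rows = eigenvectors
Y1, Y2 = (X1 - mu) @ A_pca.T, (X2 - mu) @ A_pca.T

# Stage 2: first discriminant direction in the K-D PCA subspace,
# d ~ A^{-1} (mu_1 - mu_2), normalized so d^T d = 1, cf. Eqs. (3)-(7).
A_w = 0.5 * np.cov(Y1.T) + 0.5 * np.cov(Y2.T)  # weighted within-class scatter
delta = Y1.mean(axis=0) - Y2.mean(axis=0)      # between-class difference
d = np.linalg.solve(A_w, delta)
d /= np.linalg.norm(d)

print(f"feature dimension: 400 -> {K} (PCA) -> 1 per class pair (LDA)")
```

The pairwise discriminant direction at stage 2 is what makes the scheme a natural fit for the pairwise SVM nodes of the DDAG classifier.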
This is because the new scheme extracts optimal features that discriminate between the pair of characters being classified by the support vector machine.

We conducted extensive experiments to evaluate the performance of the recognition process on various datasets of Amharic script. The experiments are organized in a systematic manner, considering the various situations encountered in real-life printed documents. Our datasets are of two types: one set covers printing variations (such as fonts, styles and sizes); the other consists of degraded documents such as newspapers, magazines and books. We report the performance of the recognizer on all these datasets, and the results obtained are promising enough to extend the approach to other indigenous African scripts.

Result        |              Fonts                  |       Point Size        |      Style
              | PowerGeez VisualGeez Agafari  Alpas | 10    12    14    16    | Normal Bold   Italic
Datasets      | 7850      7850       7850     7850  | 7680  7680  7680  7680  | 7680   7680   7680
Accuracy (%)  | 99.08     96.24      95.53    95.16 | 98.64 99.08 98.06 98.21 | 99.08  98.21  89.67

Table 2: Performance of the Amharic OCR on pages that vary in font, size and style.

In the first experiment, we consider the printing variations. We test on the most popular fonts used for typing and printing (PowerGeez, VisualGeez, Agafari and Alpas), four point sizes (10, 12, 14 and 16) and three font styles (normal, bold and italic). Performance results are shown in Table 2.
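For reference, the per-column figures of Table 2 can be combined into a single dataset-size-weighted accuracy (a reader's convenience calculation, not a figure reported in the paper):

```python
# Sketch: dataset-size-weighted mean accuracy over the Table 2 columns
# (four fonts, four point sizes, three styles). The sizes and accuracies
# are copied from Table 2; the aggregate itself is illustrative only.

sizes      = [7850, 7850, 7850, 7850, 7680, 7680, 7680, 7680, 7680, 7680, 7680]
accuracies = [99.08, 96.24, 95.53, 95.16, 98.64, 99.08, 98.06, 98.21, 99.08, 98.21, 89.67]

weighted = sum(s * a for s, a in zip(sizes, accuracies)) / sum(sizes)
print(f"weighted mean accuracy: {weighted:.2f}%")
```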


[6] Suen, C. Y., Mori, S., Kim, S. H. and Leung, C. H., "Analysis and recognition of Asian scripts - the state of the art", in International Conference on Document Analysis and Recognition, 2005, pp. 866-878.
[7] Cowell, J. and Hussain, F., "Amharic character recognition using a fast signature based algorithm," in Proceedings of the Seventh International Conference on Information Visualization, 2003, pp. 384-389.
[8] Yaregal, A. and Bigun, J., "Ethiopic character recognition using direction field tensor," in 18th International Conference on Pattern Recognition (ICPR), 2006, pp. 284-287.
[9] Worku, A. and Fuchs, S., "Handwritten Amharic bank check recognition using hidden Markov random field", in Document Image Analysis and Retrieval Workshop, 2003, p. 28.
[10] Mafundikwa, S., "African alphabets," November 2000 [Online]. Available: http://www.ziva.org.zw/afrikan.htm [Accessed April 23, 2007].
[11] Osborn, D., "African languages and information and communication technologies: literacy, access, and the future", in Proceedings of the 35th Annual Conference on African Linguistics, 2006, pp. 86-93.
[12] Jawahar, C. V., Pavan Kumar, M. N. S. S. K. and Ravi Kiran, S. S., "A bilingual OCR for Hindi-Telugu documents and its applications", in International Conference on Document Analysis and Recognition (ICDAR), 2003, pp. 408-412.
[13] Million Meshesha and Jawahar, C. V., "Recognition of printed Amharic documents," in Proceedings of the Eighth International Conference on Document Analysis and Recognition (ICDAR), 2005, pp. 784-788.
[14] Casey, R. G.
and Lecolinet, E., "A survey of methods and strategies in character segmentation," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 18, no. 7, July 1996, pp. 690-706.
[15] Trier, O., Jain, A. and Taxt, T., "Feature extraction methods for character recognition - a survey", Pattern Recognition, vol. 29, no. 4, 1996, pp. 641-662.
[16] Gonzalez, R. C. and Woods, R. E., Digital Image Processing, 2nd ed., India: Pearson Education, 2002.
[17] Duda, R. O., Hart, P. E. and Stork, D. G., Pattern Classification, New York: John Wiley & Sons, Inc., 2001.
[18] Foley, D. H. and Sammon, J. W., "An optimal set of discriminant vectors", IEEE Transactions on Computers, vol. 24, March 1975, pp. 271-278.
[19] Platt, J. C., Cristianini, N. and Shawe-Taylor, J., "Large margin DAGs for multiclass classification", in Advances in Neural Information Processing Systems 12, 2000, pp. 547-553.
[20] Burges, C. J. C., "A tutorial on support vector machines for pattern recognition," Data Mining and Knowledge Discovery, vol. 2, no. 2, June 1998, pp. 121-167.
[21] Allwein, E. L., Schapire, R. E. and Singer, Y., "Reducing multiclass to binary: a unifying approach for margin classifiers," in Proceedings of the 17th International Conference on Machine Learning, 2000, pp. 9-16.
[22] Dietterich, T. G., "Ensemble methods in machine learning," Lecture Notes in Computer Science, vol. 1857, 2000, pp. 1-15.

Million Meshesha received his M.Sc. from Addis Ababa University (AAU), Ethiopia, in 2000, and has since been with the Faculty of Informatics at AAU. Presently, he is a PhD candidate at the International Institute of Information Technology (IIIT) Hyderabad, India. His research interests include Document Image Analysis, Artificial Intelligence and Information Retrieval.

C. V. Jawahar received his PhD in 1998 from IIT Kharagpur, India.
He worked with the Center for Artificial Intelligence and Robotics until December 2000, and since then he has been with IIIT Hyderabad, India, where he is presently an Associate Professor. His areas of research are Document Image Analysis, Content Based Information Retrieval and Computer Vision.
