10.07.2015 Views

Optical Character Recognition of Amharic Documents - CVIT - IIIT ...

VI. RESULTS AND DISCUSSIONS

We have a recognition system that converts a given document image into an equivalent textual format. The system accepts either already scanned document images or scans a given Amharic text document at a resolution of 300 dpi on a flat-bed scanner (an HP7670 Scanjet). The scanned document is binarized, noise-removed, skew-corrected and scaled before individual components are extracted. Preprocessed pages are then segmented into character components for feature extraction and classification. Following the shape formation of Amharic letters, we can first decompose lines in a text page into words and then words into appropriate character components for recognition, and then recompose the recognized components to determine the order of characters, words and lines in the text page.

Once character components are identified, optimal discriminant features (in a lower-dimensional space) are extracted for classification. We use a two-stage dimensionality reduction scheme based on 99 percent principal component analysis (PCA), which is followed by 15 percent linear discriminant analysis (LDA).

The original dimension of the feature vector of each character image (after normalization) is 400 (20 × 20). Principal component analysis reduces the dimensionality from 400 to 295, and linear discriminant analysis reduces it further to 50. We work with these reduced optimal discriminant feature vectors. Both methods perform well in feature dimensionality reduction. The use of a two-stage feature extraction scheme further solves the similarity problem encountered during the application of PCA alone.
This is because the new scheme extracts optimal features that discriminate between a pair of characters employed for classification using a support vector machine.

We conduct extensive experiments to evaluate the performance of the recognition process on various datasets of Amharic script. The experiments are organized systematically to cover the various situations encountered in real-life printed documents. Our datasets are of two types. One set considers printing variations (such as fonts, styles and sizes). The other consists of degraded documents such as newspapers, magazines and books. We report the performance of the recognizer on all these datasets, and the results obtained are promising for extending it to other indigenous African scripts.

Result        Variation     Datasets   Accuracy (%)
Fonts         PowerGeez     7850       99.08
              VisualGeez    7850       96.24
              Agafari       7850       95.53
              Alpas         7850       95.16
Point Size    10            7680       98.64
              12            7680       99.08
              14            7680       98.06
              16            7680       98.21
Style         Normal        7680       99.08
              Bold          7680       98.21
              Italic        7680       89.67

Table 2: Performance results of the Amharic OCR on pages that vary in fonts, sizes and styles.

In the first experiment, we consider printing variation. We test on the most popular fonts used for typing and printing (such as PowerGeez, VisualGeez, Agafari and Alpas), four point sizes (10, 12, 14 and 16) and font styles (such as normal, bold and italic). Performance results are shown in Table 2.
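
The 400-dimensional input to the dimensionality reduction stage comes from normalizing each character image to 20 × 20 pixels and reading the pixels out as a vector. The following is a minimal sketch of that step only, not the authors' implementation. The text does not say how the images are resampled, so nearest-neighbor sampling is an assumption here, and the names normalize_to_grid and to_feature_vector, as well as the tiny sample component, are hypothetical.

```python
from typing import List

Image = List[List[int]]  # binary image: 1 = ink, 0 = background


def normalize_to_grid(img: Image, size: int = 20) -> Image:
    """Resample a binary image to size x size pixels.

    Nearest-neighbor sampling is an assumption; the paper only states
    that normalized character images are 20 x 20.
    """
    rows, cols = len(img), len(img[0])
    return [
        [img[r * rows // size][c * cols // size] for c in range(size)]
        for r in range(size)
    ]


def to_feature_vector(img: Image) -> List[int]:
    """Flatten a 2-D image row by row into a 1-D feature vector."""
    return [pixel for row in img for pixel in row]


# Hypothetical 4 x 3 character component (not taken from the paper's data).
component = [
    [1, 1, 1],
    [1, 0, 0],
    [1, 0, 0],
    [1, 1, 1],
]

normalized = normalize_to_grid(component)
features = to_feature_vector(normalized)
print(len(features))  # 400
```

Vectors of this form are what the PCA stage would receive before reducing the dimensionality from 400 to 295.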
