Optical Character Recognition of Amharic Documents - CVIT - IIIT ...
Different representations of the characters have been proposed. Some of these features consider profiles, structural descriptors and transform-domain representations [2], [16]. Alternatively, one could consider the entire image as the feature. The former methods are highly language specific and become very complex when representing all the characters in the script. The latter scheme provides excellent results for printed character recognition.

As a result, we extract features from the entire image by concatenating all the rows to form a single contiguous vector. This feature vector consists of zeros (0s) and ones (1s) representing background and foreground pixels in the image, respectively. With such a representation, the memory and computational requirements are very intensive for languages like Amharic that have a large number of characters in the writing system. Therefore we need to transform the features to obtain a lower dimensional representation. Various methods are employed in the pattern recognition literature for reducing the dimension of feature vectors [17]. In the present work we consider Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA).

4.1 Principal Component Analysis

Principal component analysis (PCA) can help identify new features (in a lower dimensional subspace) that are most useful for representation [17]. This should be done without losing valuable information. Principal components can give superior performance for font-independent OCRs, easy adaptation across languages, and scope for extension to handwritten documents.

Consider the $i^{th}$ image sample represented as an $M$-dimensional (column) vector $x_i$, where $M$ depends on the image size. From a large training set $x_1, \ldots, x_N$ we compute the covariance matrix as:

$$\Sigma = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)(x_i - \mu)^T \qquad (1)$$

Then we need to identify the minimal dimension, say $K$, such that:

$$\sum_{i=1}^{K} \lambda_i \Big/ \operatorname{Trace}(\Sigma) \geq \alpha \qquad (2)$$

where $\lambda_i$ is the $i^{th}$ largest eigenvalue of the covariance matrix $\Sigma$ and $\alpha$ is a limiting value in percent.

Eigenvectors corresponding to the largest $K$ eigenvalues are the directions of greatest variance. The $k^{th}$ eigenvector is the direction of greatest variation perpendicular to the first through $(k-1)^{st}$ eigenvectors. The eigenvectors are arranged as rows in a matrix $A$, and this transformation matrix is used to compute the new feature vector by projecting as $y_i = A x_i$. With this we get the best lower-dimensional representation of the component images with reduced feature size.

Principal component analysis yields projection directions that maximize the total scatter across all classes. In choosing the projection which maximizes total scatter, PCA retains not only the between-class scatter that is useful for classification, but also the within-class scatter that is unwanted information for classification purposes. Much of the variation seen among document images is due to printing variations and degradations. If PCA is applied on such images, the transformation retains these unwanted variations as well.
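To make the feature pipeline concrete, the following is a minimal sketch of the whole-image feature extraction and the PCA reduction of equations (1) and (2). It is written in Python with NumPy; the function names (`fit_pca`, `project`) are illustrative, and centering the data before projection is a common PCA convention assumed here rather than spelled out in the text.

```python
import numpy as np

def fit_pca(images, alpha=0.95):
    """Fit a PCA transform on binarized character images.

    images : list of 2D 0/1 arrays, all of the same size.
    alpha  : fraction of total variance to retain, as in Eq. (2).
    Returns the mean vector mu and the K x M projection matrix A.
    """
    # Concatenate the rows of each image into one contiguous vector of
    # 0s (background) and 1s (foreground), as described in the text.
    X = np.stack([img.reshape(-1).astype(float) for img in images])  # N x M

    # Covariance matrix, Eq. (1).
    mu = X.mean(axis=0)
    Xc = X - mu
    cov = (Xc.T @ Xc) / len(X)  # M x M

    # eigh handles the symmetric covariance matrix; it returns
    # eigenvalues in ascending order, so flip to sort them largest-first.
    eigvals, eigvecs = np.linalg.eigh(cov)
    eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]

    # Minimal K such that the top-K eigenvalues account for at least an
    # alpha fraction of the total variance Trace(cov), Eq. (2).
    ratios = np.cumsum(eigvals) / np.trace(cov)
    K = int(np.searchsorted(ratios, alpha)) + 1

    # Eigenvectors arranged as rows of the transformation matrix A.
    A = eigvecs[:, :K].T  # K x M
    return mu, A

def project(image, mu, A):
    """Project one image to the reduced feature space, y = A(x - mu)."""
    x = image.reshape(-1).astype(float)
    return A @ (x - mu)
```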

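A small usage example, assuming 16x16 binarized glyph images; the data below is random and only demonstrates the shapes involved, whereas a real run would use segmented Amharic character images:

```python
rng = np.random.default_rng(0)
glyphs = [(rng.random((16, 16)) > 0.5).astype(int) for _ in range(500)]

mu, A = fit_pca(glyphs, alpha=0.95)
y = project(glyphs[0], mu, A)
print(A.shape, y.shape)  # (K, 256) and (K,), with K <= 256
```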