10.07.2015 Views

Optical Character Recognition of Amharic Documents - CVIT - IIIT ...



offices in the form of correspondence letters, magazines, newspapers, pamphlets, books, etc. Converting these documents into electronic format is a must in order to (i) preserve historical documents, (ii) save storage space, and (iii) enhance retrieval of relevant information via the Internet. This makes it possible to harness existing information technologies for local information needs and developments.

Fig. 1: Amharic alphabet (FIDEL) with its seven orders arranged row-wise. The 2nd column lists the basic characters; the others are vowel forms, each derived from the basic symbol.

African languages that use a modified version of the Latin or Arabic script can easily be integrated into existing Latin and Arabic OCR technologies. It is worth mentioning some of the related work reported at home and in the Diaspora to preserve African languages digitally (predominantly languages that use Latin scripts): corpora projects in Swahili (East Africa), open source software for languages like Zulu, Sepedi and Afrikaans (South Africa), an indigenous web browser in Luganda (Uganda), improvisation of keyboard characters for some African languages, the presence of African languages on the Internet, etc. [11].

Therefore, we need to give more emphasis to indigenous African scripts. This is the motivation behind the present report. To the best of our knowledge, this is the first work that reports a robust Amharic character recognizer for conversion of printed document images of varying fonts, sizes, styles and quality.

Amharic is written in the unique and ancient Ethiopic script (inherited from Geez, a Semitic language), now effectively a syllabary requiring over 300 glyph shapes for representation. As shown in Fig. 1, Amharic script has 33 core characters, each of which occurs in seven orders: one basic form and six non-basic forms consisting of a consonant and a following vowel. Vowel forms are derived from the basic characters with some modification (for example, by attaching strokes at the middle or end of the base character, or by elongating one of the legs of the base character). Other symbols represent labialization, numerals, and punctuation marks. These bring the total number of characters in the script to 310, as shown in Table 1.
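The seven-order structure described above can be illustrated with a minimal sketch. It assumes the layout of the Unicode Ethiopic block (U+1200 to U+137F), which is not discussed in the text: there, each consonant row starts at a multiple of eight and its first seven code points hold the seven orders. The function name `seven_orders` is hypothetical and introduced only for illustration.

```python
ORDERS_PER_CHARACTER = 7   # one basic form plus six non-basic forms (from the text)
UNICODE_ROW_STRIDE = 8     # assumption: Unicode allots 8 code points per consonant row
ETHIOPIC_START = 0x1200    # assumption: start of the Unicode Ethiopic block
ETHIOPIC_END = 0x137F      # assumption: end of the Unicode Ethiopic block


def seven_orders(base: str) -> list[str]:
    """Return the seven orders of a basic (first-order) Ethiopic character.

    Hypothetical helper: relies on the Unicode code-point layout, not on
    anything specified in the paper.
    """
    if len(base) != 1:
        raise ValueError("expected a single character")
    code = ord(base)
    in_block = ETHIOPIC_START <= code <= ETHIOPIC_END
    if not in_block or code % UNICODE_ROW_STRIDE != 0:
        raise ValueError(f"{base!r} is not a first-order Ethiopic character")
    return [chr(code + i) for i in range(ORDERS_PER_CHARACTER)]


if __name__ == "__main__":
    # Print the seven orders of two basic characters, one row per character.
    for basic in ("ሀ", "ለ"):
        print(" ".join(seven_orders(basic)))

    # Core syllables implied by the text: 33 core characters in 7 orders each.
    print(33 * ORDERS_PER_CHARACTER)
```

By this count the core characters account for 33 × 7 = 231 of the 310 characters; the rest are the labialization, numeral and punctuation symbols mentioned above. Some rows of the Unicode block (notably labialized ones) are only partly populated, so the sketch is meant for the regular core rows only.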
