computers [3]. Nowadays, it is common to find commercially available PC-based OCR systems. However, most of these systems are developed to work with Latin-based scripts [4].

Optical character recognition converts scanned images of printed, typewritten or handwritten documents into a computer-readable format (such as ASCII or Unicode) so as to ease on-line data processing. The potential of OCR for data entry applications is obvious: it offers a faster, more automated, and presumably less expensive alternative to manual data entry, thereby improving the accuracy and speed of transcribing data into the computer system. Consequently, it increases efficiency and effectiveness (by reducing cost, time and labor) in information storage and retrieval.

Major applications of OCR include: (i) library and office automation, (ii) form and bank check processing, (iii) document reader systems for the visually impaired, (iv) postal automation, and (v) database and corpus development for language modeling, text mining and information retrieval [2], [5].

While the use and application of OCR systems is well developed for most languages in the world, covering both Latin and non-Latin scripts [6], an extensive literature survey reveals that few conference papers are available on the indigenous scripts of African languages. Lately, some research reports have been published on Amharic OCR. Amharic character recognition is discussed in [7], with emphasis on designing a suitable feature extraction scheme for script representation. Recognition using the direction field tensor as a tool for Amharic character segmentation is also reported [8]. Worku and Fuchs [9] present handwritten Amharic bank check recognition.
In contrast, no similar work was found for other indigenous African scripts. Therefore, much effort is needed to develop better and workable OCR technologies for African scripts in order to satisfy the need for digitized information processing in local languages.

II. AMHARIC SCRIPTS

In Africa, more than 2,500 languages, including regional dialects, are spoken. Some are indigenous languages, while others were introduced by past conquerors. English, French, Portuguese, Spanish and Arabic are official languages of many African countries. As a result, most African languages with a writing system use a modification of the Latin or Arabic script. There are also many languages with their own indigenous scripts and writing systems. These scripts include the Amharic script (Ethiopia), Vai script (West Africa), Hieroglyphic script (Egypt), Bassa script (Liberia), Mende script (Sierra Leone), Nsibidi/Nsibiri script (Nigeria and Cameroon) and Meroitic script (Sudan) [10].

Amharic, which belongs to the Semitic language family, has long been a dominant language in Ethiopia. It is the official and working language of Ethiopia and the most commonly learnt language next to English throughout the country. Accordingly, there is a bulk of information available in printed form that needs to be converted into electronic form for easy searching and retrieval according to users' needs. Suffice it to mention the huge volume of documents piled high in information centers, libraries, museums, and government and private
offices in the form of correspondence letters, magazines, newspapers, pamphlets, books, etc. Converting these documents into electronic format is a must in order to (i) preserve historical documents, (ii) save storage space, and (iii) enhance retrieval of relevant information via the Internet. This makes it possible to harness existing information technologies for local information needs and development.

Fig. 1: Amharic alphabets (FIDEL) with their seven orders row-wise. The 2nd column shows the list of basic characters; the others are vowels, each of which is derived from the basic symbol.

African languages that use a modified version of the Latin or Arabic script can easily be integrated into the existing Latin and Arabic OCR technologies. It is worth mentioning here some of the related works reported at home and in the Diaspora to preserve African languages digitally (predominantly languages that use Latin scripts). These include corpora projects in Swahili (East Africa), open source software for languages like Zulu, Sepedi and Afrikaans (South Africa), an indigenous web browser in Luganda (Uganda), improvisation of keyboard characters for some African languages, and the presence of African languages on the Internet [11].

Therefore, we need to give more emphasis to the indigenous African scripts. This is the motivation behind the present report. To the best of our knowledge, this is the first work that reports a robust Amharic character recognizer for conversion of printed document images of varying fonts, sizes, styles and quality.

Amharic is written in the unique and ancient Ethiopic script (inherited from Geez, a Semitic language), now effectively a syllabary requiring over 300 glyph shapes for representation. As shown in Fig.
1, Amharic script has 33 core characters, each of which occurs in seven orders: one basic form and six non-basic forms consisting of a consonant and a following vowel. The vowel forms are derived from the basic characters with some modification (for example, by attaching strokes at the middle or end of the base character, or by elongating one of the legs of the base character). Other symbols are also available that represent labialization, numerals, and punctuation marks. These bring the total number of characters in the script to 310, as shown in Table 1.
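
The character counts above can be checked with a short sketch. It is illustrative only and is not part of the recognizer described in this paper. It assumes the Unicode encoding of the Ethiopic script, in which the seven orders of a base character occupy consecutive code points starting at the base form; the paper itself does not refer to Unicode code points, and the helper name `seven_orders` is hypothetical.

```python
import unicodedata

# Counts taken from the text above.
CORE_CHARACTERS = 33     # core (base) characters
ORDERS = 7               # one basic form plus six non-basic forms
TOTAL_CHARACTERS = 310   # total number of characters reported for the script


def seven_orders(base):
    """Return the seven orders of a base character.

    Assumption: Unicode Ethiopic layout, where the seven orders of a base
    character sit at consecutive code points beginning with the base form.
    """
    start = ord(base)
    return [chr(start + i) for i in range(ORDERS)]


if __name__ == "__main__":
    # Three base characters used purely as examples: HA, LA and MA.
    for base in ("\u1200", "\u1208", "\u1218"):
        print(unicodedata.name(base), " ".join(seven_orders(base)))

    syllables = CORE_CHARACTERS * ORDERS
    print(syllables)                     # 231 core syllable forms
    print(TOTAL_CHARACTERS - syllables)  # 79 remaining symbols
```

Under these assumptions, the 33 core characters account for 231 of the 310 characters, leaving 79 for the labialized forms, numerals and punctuation marks mentioned above.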