28.06.2014 Views

Brugia Malayi - Clark Science Center - Smith College

Handwriting Recognition in Classical Syriac

Cordelia Nowak

Much of human history and literature is preserved only in handwritten texts, some dating back thousands of years and written in languages that few or no modern people understand. Collecting this knowledge in easily accessible databases that researchers can study at their leisure is vastly important for academics in many fields of the humanities. Libraries hold a wealth of information and literature written in Classical Syriac, an ancient dialect of Middle Aramaic. Handwriting recognition is a difficult endeavor for a computer scientist: what a human mind can do in a second is enormously difficult for a computer. Programs built to recognize handwriting must accommodate human error and a thousand variations in size and style from writer to writer and even within the same text.

Our goal in this project was to get the program to recognize individual characters from a Syriac document. For the program to understand a handwritten word, it needed a classifier that determines whether or not a letter is in a particular word. The classifier we chose is a Support Vector Machine (SVM) which, when given positive and negative examples of a certain letter, can be trained to identify whether or not there is a patch on the word that is a positive match for the letter in question. By going through the alphabet and testing whether each letter is in the word, the program can eventually recreate the word and store it as a digital book.
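The per-letter scan described above can be illustrated with a toy sketch. Everything in it is an assumption made for illustration: the word is reduced to a one-dimensional list of numbers rather than an image, the patches are fixed-width windows, and `make_template_detector` is a hypothetical stand-in for a trained SVM decision function. It is not the project's code.

```python
from typing import Callable, Dict, List, Tuple

Patch = List[float]
Detector = Callable[[Patch], float]  # a score > 0 means "this patch is my letter"


def make_template_detector(template: Patch, tolerance: float = 0.5) -> Detector:
    """Hypothetical stand-in for a trained per-letter SVM decision function."""
    def score(patch: Patch) -> float:
        # Squared distance to the template; a close match gives a positive score.
        dist = sum((p - t) ** 2 for p, t in zip(patch, template))
        return tolerance - dist
    return score


def detect_letters(signal: List[float], detectors: Dict[str, Detector],
                   width: int) -> List[Tuple[int, str]]:
    """Run every letter detector over every window and keep the positive matches."""
    hits = []
    for start in range(len(signal) - width + 1):
        patch = signal[start:start + width]
        for letter, detector in detectors.items():
            if detector(patch) > 0:
                hits.append((start, letter))
    return hits


def rebuild_word(hits: List[Tuple[int, str]]) -> List[str]:
    """Order the matched letters by their position in the word."""
    return [letter for _, letter in sorted(hits)]


# Toy templates for three Syriac letter names (the values are made up).
templates = {
    "alaph": [1.0, 2.0, 1.0],
    "beth": [3.0, 3.0, 4.0],
    "gamal": [5.0, 1.0, 5.0],
}
detectors = {name: make_template_detector(t) for name, t in templates.items()}

# A toy "word": beth, alaph, gamal laid side by side.
signal = templates["beth"] + templates["alaph"] + templates["gamal"]
word = rebuild_word(detect_letters(signal, detectors, width=3))
print(word)  # ['beth', 'alaph', 'gamal']
```

In this toy setting every window is tested against every letter, which mirrors the idea of going through the alphabet for each word. A real system would also have to resolve overlapping or conflicting matches.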

Our work used scans from a Classical Syriac book of Genesis. Initially, when provided with a correct transcript of a section of the book of Genesis containing roughly 1800 words, the program was able to superimpose a model (known as a PSM, or Part Structured Model) onto the text and alter it to fit. The PSM is a general model of what a letter should look like, composed of nodes bound together by stretchable pointers so that the model can accurately fit over a letter. Once the program reached this level of functionality, it became necessary to train the program to work without a transcript. At this point we trained an SVM for each of the twenty-two letters in the Syriac alphabet. For each letter, we compiled a list of instances of that letter, labeled as positive examples, and a list of instances of all the other letters, labeled as negative examples. Once training was complete, the SVM was able to accurately predict what a letter was a large portion of the time, although some letters are still classified more reliably than others. Most are well over 90% accurate.
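The one-classifier-per-letter training scheme can be sketched as follows. The abstract does not say which SVM implementation, kernel, or features were used, so this sketch assumes a linear SVM trained by sub-gradient descent on the regularized hinge loss, three letters instead of twenty-two, and made-up three-dimensional feature vectors. The function names and hyperparameters are hypothetical.

```python
from typing import Dict, List, Tuple

Vector = List[float]
Model = Tuple[Vector, float]  # (weights, bias)


def decision(model: Model, x: Vector) -> float:
    """Signed score; positive means 'this sample is my letter'."""
    w, b = model
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def train_linear_svm(xs: List[Vector], ys: List[int], lam: float = 0.01,
                     eta: float = 0.1, epochs: int = 200) -> Model:
    """Sub-gradient descent on the regularized hinge loss; ys are +1 or -1."""
    w = [0.0] * len(xs[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            margin = y * decision((w, b), x)
            w = [wi * (1.0 - eta * lam) for wi in w]  # shrink from the L2 penalty
            if margin < 1.0:  # hinge loss is active for this sample
                w = [wi + eta * y * xi for wi, xi in zip(w, x)]
                b += eta * y
    return w, b


def train_one_vs_rest(samples: Dict[str, List[Vector]]) -> Dict[str, Model]:
    """One SVM per letter: its own samples are positive, all others negative."""
    models = {}
    for letter in samples:
        xs, ys = [], []
        for other, vectors in samples.items():
            for v in vectors:
                xs.append(v)
                ys.append(1 if other == letter else -1)
        models[letter] = train_linear_svm(xs, ys)
    return models


def predict(models: Dict[str, Model], x: Vector) -> str:
    """Pick the letter whose classifier gives the highest score."""
    return max(models, key=lambda letter: decision(models[letter], x))


# Made-up feature vectors for three letters (assumption: real features
# would be derived from the scanned page).
samples = {
    "alaph": [[1.0, 0.1, 0.0], [0.9, 0.0, 0.1]],
    "beth": [[0.1, 1.0, 0.0], [0.0, 0.9, 0.1]],
    "gamal": [[0.0, 0.1, 1.0], [0.1, 0.0, 0.9]],
}
models = train_one_vs_rest(samples)
print(predict(models, [0.95, 0.05, 0.0]))
```

Taking the highest-scoring classifier is one common way to turn per-letter yes/no decisions into a single label. Per-letter accuracy, as reported above, would be measured on labeled samples held out from training.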

Although the project is by no means finished, we now have the correctly formatted data necessary to classify the characters, and we have a Support Vector Machine that is quite capable of forming the backbone of the program. This is a big improvement over our starting point, when we had done very little with the text other than some manually completed transcripts and a file breaking the text into lines and words. With relatively little work, we can soon have a fully functioning program capable of reliably recognizing Syriac text. This program is well on its way towards being a serious asset for Syriac scholars. (Supported by the Schultz Foundation)

Adviser: Nicolas Howe

2012
