Brugia Malayi - Clark Science Center - Smith College
Handwriting Recognition in Classical Syriac<br />
Cordelia Nowak<br />
Much of human history and literature is preserved only in handwritten texts; these texts date back thousands of years and are<br />
written in languages which few or no modern people understand. Collecting knowledge in easily accessible databases which can<br />
be studied at researchers’ leisure is vastly important for academics in many different fields of the humanities. There is a wealth of<br />
information and literature in libraries written in Classical Syriac, an ancient dialect of Middle Aramaic. Handwriting recognition<br />
is a difficult endeavor for a computer scientist: what a human mind can do in a second is enormously difficult for a computer.<br />
Programs built to recognize handwriting must be able to accommodate human error and a thousand variations in size and style<br />
from writer to writer and even within the same text.<br />
Our goal in this project was to enable the program to recognize individual characters in a Syriac document. In order<br />
for the program to understand a handwritten word, it was necessary to use a classifier which would determine whether or not a<br />
letter is in a particular word. The classifier which we chose is a Support Vector Machine which, when given positive and negative<br />
examples of a certain letter, can be trained to identify whether or not there is a patch on the word which is a positive match for the<br />
letter in question. By going through the alphabet and performing the test to identify whether or not the letter is in the word, the<br />
program can eventually recreate the word and store it as a digital book.<br />
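The letter-by-letter test described above can be sketched as follows. This is a minimal illustration, not the project's code: the `detectors` mapping, the `(x_position, score)` return format, and the `threshold` parameter are assumptions introduced here, and the stand-in detectors in the usage lines take the place of trained SVMs. Ordering the hits by horizontal position, rightmost first, is likewise an assumption based on Syriac being written right to left.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical detector type: given a word image, return (x_position, score)
# pairs for every patch the letter's classifier considers a candidate match.
Detector = Callable[[object], List[Tuple[float, float]]]


def reconstruct_word(word_image: object,
                     detectors: Dict[str, Detector],
                     threshold: float = 0.0) -> str:
    """Run every letter's detector on the word and assemble the hits."""
    hits = []
    for letter, detect in detectors.items():
        for x_position, score in detect(word_image):
            if score > threshold:          # keep positive matches only
                hits.append((x_position, letter))
    # Syriac reads right to left, so the rightmost hit comes first.
    hits.sort(key=lambda hit: hit[0], reverse=True)
    return "".join(letter for _, letter in hits)


# Usage with stand-in detectors (real ones would wrap trained SVMs).
toy_detectors = {
    "A": lambda img: [(90.0, 1.3)],
    "B": lambda img: [(50.0, 0.8), (10.0, 0.4)],
    "G": lambda img: [(70.0, -0.6)],       # score below threshold: rejected
}
word = reconstruct_word(None, toy_detectors)
```

Placeholder Latin letters stand in for Syriac characters here; a real system would also need to resolve overlapping or conflicting hits, which this sketch does not attempt.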
Our work utilized scans from a Classical Syriac book of Genesis. Initially, when provided with a correct transcript of a section<br />
of the book of Genesis containing roughly 1800 words, the program was able to superimpose a model (known as a PSM, or Part<br />
Structured Model) onto the text and deform it to fit. The PSM is a general model of what a letter should look like, composed of nodes<br />
bound together by stretchable pointers so that the model can accurately fit over a letter. Once the program reached this level of<br />
functionality, it became necessary to train the program to work without a provided transcript. At this point we trained an SVM<br />
for each of the twenty-two letters of the Syriac alphabet. For each letter, we compiled a list of instances of that letter, which were<br />
labeled as positive examples, and a list of instances of all the other letters, which were labeled as negative examples. Once training was<br />
complete, the SVMs were able to accurately predict what a letter was a large portion of the time, although some letters are still<br />
recognized more reliably than others. Most are well over 90% accurate.<br />
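The training setup described above, with one classifier per letter, that letter's instances as positives, and all other letters as negatives, can be sketched as below. This is an illustration under assumptions, not the project's code: the function names, the `(features, letter)` sample format, and the toy data are hypothetical, and the SVM training itself is left to whichever library is used. The second function shows one way per-letter accuracy figures of the kind quoted above might be computed.

```python
from typing import Dict, List, Sequence, Tuple

Sample = Tuple[Sequence[float], str]   # (feature vector, letter label)


def split_one_vs_rest(samples: List[Sample], target: str):
    """Return (positives, negatives) feature lists for one letter's SVM."""
    positives = [feats for feats, letter in samples if letter == target]
    negatives = [feats for feats, letter in samples if letter != target]
    return positives, negatives


def per_letter_accuracy(truth: List[str],
                        predicted: List[str]) -> Dict[str, float]:
    """Fraction of each letter's instances that were predicted correctly."""
    totals: Dict[str, int] = {}
    correct: Dict[str, int] = {}
    for actual, guess in zip(truth, predicted):
        totals[actual] = totals.get(actual, 0) + 1
        if actual == guess:
            correct[actual] = correct.get(actual, 0) + 1
    return {letter: correct.get(letter, 0) / n for letter, n in totals.items()}


# Toy data: three samples of two hypothetical letters.
samples = [([0.1, 0.9], "A"), ([0.8, 0.2], "B"), ([0.2, 0.7], "A")]
pos, neg = split_one_vs_rest(samples, "A")
accuracy = per_letter_accuracy(["A", "A", "B", "B"], ["A", "B", "B", "B"])
```

Repeating `split_one_vs_rest` for each of the twenty-two letters would yield the twenty-two training sets, one per SVM.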
Although the project is by no means finished, we now have the correctly formatted data necessary to classify the characters<br />
and we have a Support Vector Machine that is quite capable of forming the backbone of the program. This is a big improvement<br />
over our starting point, when we had done very little with the text beyond some manually completed transcripts and a file breaking<br />
the text into lines and words. With relatively little work, we can soon have a fully functioning program capable of reliably recognizing Syriac text.<br />
This program is well on its way towards being a serious asset for Syriac scholars. (Supported by the Schultz Foundation)<br />
Adviser: Nicolas Howe<br />
2012<br />