
which include the following steps:

a. Removing punctuation and special characters.
b. Converting upper case to lower case for generalization.
c. Stop word removal: removing common words, such as "a" and "the", which occur very often and are not of significant importance.
d. Stemming: morphological variants of a word have similar semantic interpretations, so a stemming algorithm is used to reduce each word to its stem or root form. The algorithm [11] uses a technique called suffix stripping, in which an explicit suffix list is provided along with a condition under which the suffix should be removed or replaced to form the stem of the word, which is common to all its variations. For example, after suffix stripping the word "reading" is reduced to "read". A sketch of these steps is given below.

Fig. 4. Answer Processor Architecture.
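The following is a minimal sketch of steps a-d in Python. The small stop-word list and the handful of suffix rules are illustrative stand-ins for a complete stop list and for the full Porter algorithm [11]; they are not part of the original system.

```python
import re

# Illustrative stop-word list; a real system would use a much fuller one.
STOP_WORDS = {"a", "an", "and", "are", "as", "in", "is", "it",
              "of", "on", "the", "to", "was", "were", "which"}

# A few suffix-stripping rules in the spirit of the Porter algorithm [11];
# the full algorithm has many more rules and conditions.
SUFFIX_RULES = ["ing", "ed", "es", "s"]

def stem(word: str) -> str:
    """Reduce a word to an approximate stem by stripping a known suffix."""
    for suffix in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: len(word) - len(suffix)]
    return word

def preprocess(text: str) -> list[str]:
    """Apply steps a-d: strip punctuation, lower-case, drop stop words, stem."""
    text = re.sub(r"[^A-Za-z\s]", " ", text)             # a. punctuation
    tokens = text.lower().split()                        # b. lower case
    tokens = [t for t in tokens if t not in STOP_WORDS]  # c. stop words
    return [stem(t) for t in tokens]                     # d. stemming

print(preprocess("The student was reading the passage."))
# -> ['student', 'read', 'passage']
```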


3.2 AES

When a large number of student answers are available, as is the case in statewide assessments, a holistic approach will assist in providing computer-assigned grades close to those that would be assigned by expert human graders. The Latent Semantic Analysis (LSA) approach can take advantage of the availability of a large training corpus. The training corpus consists of human-scored answer documents. The potential of LSA to imitate human judgment has been explored and found to have a strong correlation with human scores [4].

The underlying semantics of the training corpus are extracted using LSA, without the use of any other external knowledge. The method captures how variations in term choices and variations in answer document meanings are related. However, it does not take into consideration the order of occurrence of words. This implies that even if two students have used different words to convey the same message, LSA can capture the correlation between the two documents. This is because LSA depicts the meaning of a word as an average of the connotations of the documents in which it occurs. It can similarly judge the correctness of an answer document as an average of the measure of correctness of all the words it contains.

Mathematically, this can be explained as the simultaneous representation of all the answer documents in the training corpus as points in semantic space, with an initial dimensionality of the order of the number of terms in the document. This dimensionality is reduced to an optimal value: large enough to represent the structure of the answer documents and small enough to eliminate irrelevant representations.
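As a concrete illustration of the bag-of-words property described above, the sketch below (illustrative only, not from the paper) represents two answers as term-frequency vectors; the order of occurrence of words is discarded, so only term choice matters.

```python
from collections import Counter

# Two answers that use the same terms in a different order; after the
# tokenization of Section 3.1, their bag-of-words vectors are identical.
doc_a = ["martha", "washington", "help", "soldier"]
doc_b = ["soldier", "help", "martha", "washington"]

vocab = sorted(set(doc_a) | set(doc_b))
vec_a = [Counter(doc_a)[t] for t in vocab]
vec_b = [Counter(doc_b)[t] for t in vocab]

print(vec_a == vec_b)  # True: word order plays no role in the representation
```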
The answer document to be graded is also placed in the reduced-dimensionality semantic space, and the broadly term-based similarity between this document and each of those in the training corpus can then be determined by measuring the cosine of the angle between the two document vectors at the origin.

A good approximation of the computer score to a human score depends heavily on the optimal reduced dimensionality. This optimal dimension is related to the features that determine term meaning, from which the hidden correlations between terms and answer documents can be derived. However, a general method for determining this optimal dimension is still an open research problem; currently a brute-force approach is adopted. Reducing the dimensions is done by omitting inconsequential relations and retaining only significant ones. A factor analysis method such as Singular Value Decomposition (SVD) reduces the dimensionality to a desired approximation.

The first step in LSA is to construct a t x n term-by-document matrix M whose entries are term frequencies. SVD, or two-mode factor analysis, decomposes this rectangular matrix into three matrices [1]. The SVD of a rectangular matrix M is defined as

M = TSD′,   (1)

where prime (′) indicates matrix transposition; M is the rectangular term-by-document matrix with t rows and n columns; T is the t x m matrix that describes the rows of M as left singular vectors of derived orthogonal factor values; D is the n x m matrix that describes the columns of M as right singular vectors of derived orthogonal factor values; S is the m x m diagonal matrix of singular values such that when T, S and D′ are matrix-multiplied M is reconstructed; and m is the rank of M, equal to min(t, n).

To reduce the dimensionality to a value k, we delete m − k rows and columns from the matrix S, starting with those containing the smallest singular values, to form the matrix S1. The corresponding columns in T and rows in D′ are also deleted, to form matrices T1 and D′1 respectively. The matrix M1 is then an approximation of the matrix M with reduced dimensions:

M1 = T1S1D′1.   (2)

Standard algorithms are available to perform SVD; a sketch follows.
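The following minimal sketch (assuming NumPy; the matrix values are invented for illustration) computes the SVD of a small term-by-document matrix and forms the rank-k approximation M1 of Eq. (2).

```python
import numpy as np

# Toy term-by-document frequency matrix M (t = 5 terms, n = 4 documents);
# the entries are invented for illustration.
M = np.array([
    [2, 1, 0, 1],
    [0, 1, 3, 1],
    [1, 0, 0, 2],
    [0, 2, 1, 0],
    [1, 1, 0, 1],
], dtype=float)

# Full decomposition M = T S D' (Eq. 1); NumPy returns the singular
# values s in descending order and D' directly as the third factor.
T, s, Dt = np.linalg.svd(M, full_matrices=False)

# Delete the m - k smallest singular values, keeping only the k largest.
k = 2
T1, S1, Dt1 = T[:, :k], np.diag(s[:k]), Dt[:k, :]

# Rank-k approximation M1 = T1 S1 D1' (Eq. 2).
M1 = T1 @ S1 @ Dt1
print(np.round(M1, 2))
```

In practice t and n are much larger (154 x 31 in the example of Table 1 below), but the truncation step is identical.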


To illustrate, a term-by-document matrix constructed from 31 essays from the American First Ladies example shown in Fig. 1 and Fig. 2 is given in Table 1. Since the corpus contains 31 documents with 154 unique words, M has dimensions t = 154 and m = 31.

Term/Doc   D1   D2   D3   D4   D5   D6   D7   D8   ...   D31
T1          0    0    0    0    0    0    0    0   ...     0
T2          2    1    2    1    3    1    2    1   ...     2
T3          0    1    0    1    3    1    2    1   ...     1
T4          0    1    0    0    0    1    0    0   ...     0
T5          0    1    0    0    0    0    1    0   ...     0
T6          0    1    1    0    0    1    2    0   ...     0
T7          0    0    0    0    1    0    0    0   ...     0
T8          0    0    0    0    1    0    1    0   ...     0
...       ...  ...  ...  ...  ...  ...  ...  ...  ...   ...
T154        0    0    0    0    0    0    0    0   ...     0

Table 1. An example 154 x 31 term-by-document matrix M, where M_ij is the frequency of the i-th term in the j-th answer document.

Training Phase. The following steps are performed in the training phase:
1) Answer documents are preprocessed and tokenized into a list of words or terms, using the document preprocessing steps described in Section 3.1.
2) An Answer Dictionary is created, which assigns a unique file ID to each answer document in the corpus.
3) A Word Dictionary is created, which assigns a unique word ID to each word in the corpus.
4) An index is created recording each word ID and the number of times the word occurs (its word frequency) in each of the 31 documents.
5) A term-by-document matrix M is created from the index, where M_ij is the frequency of the i-th term in the j-th answer document.

Validation Phase. A set of human-graded documents, known as the validation set, is used to determine the optimal value of k. Each document is passed as a query vector and compared with the training corpus documents. The following steps are repeated for each document (a sketch of this loop is given after the list):
1) A vector Q of term frequencies in the query document is created, in the same way M was created.
2) Q is added as the 0th column of the matrix M to give a matrix Mq.
3) SVD is performed on the matrix Mq to give the T, S and D′ matrices.
4) Steps 5-10 are repeated for dimension values k = 1 to min(t, m).
5) m − k rows and columns are deleted from the S matrix, starting from the smallest singular value, to form the matrix S1. The corresponding columns in T and rows in D′ are also deleted, to form matrices T1 and D′1 respectively.
6) The matrix Mq1 is constructed by multiplying the matrices T1S1D′1.
7) The similarity between the query document x (the 0th column of the matrix Mq1) and each of the other documents y in the training corpus (the subsequent columns of Mq1) is determined by the cosine similarity measure, defined as

CosineSimilarity(x, y) = ( Σ_{i=1..n} x_i y_i ) / ( √(Σ_{i=1..n} x_i²) · √(Σ_{i=1..n} y_i²) ).   (3)

8) The training document with the highest similarity score to the query answer document is selected, and the human score associated with that document is assigned to the query document.
9) The mean difference between the LSA-assigned scores and the scores assigned to the queries by a human grader is calculated for each dimension over all the queries.
10) Return to step 4.
11) The dimension with the least mean difference is selected as the optimal dimension k, which is used in the testing phase.
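The following is a minimal sketch of this validation loop, assuming NumPy; the function and variable names (rank_k_space, choose_k, train_M, and so on) are illustrative, not from the original system.

```python
import numpy as np

def cosine_similarity(x, y):
    """Eq. (3): cosine of the angle between two document vectors."""
    return x @ y / (np.linalg.norm(x) * np.linalg.norm(y))

def rank_k_space(Mq, k):
    """Rank-k reconstruction Mq1 = T1 S1 D1' of a term-by-document matrix."""
    T, s, Dt = np.linalg.svd(Mq, full_matrices=False)
    return T[:, :k] @ np.diag(s[:k]) @ Dt[:k, :]

def choose_k(train_M, train_scores, queries, query_scores):
    """Brute-force search for the dimension k with least mean score difference.

    train_M: t x n term-by-document matrix; train_scores: n human scores;
    queries: t-dimensional term-frequency vectors of the validation set,
    paired with their human scores in query_scores.
    """
    t, n = train_M.shape
    best_k, best_diff = None, np.inf
    for k in range(1, min(t, n) + 1):                        # step 4
        diffs = []
        for q, human in zip(queries, query_scores):
            Mq = np.column_stack([q, train_M])               # step 2
            Mq1 = rank_k_space(Mq, k)                        # steps 3, 5, 6
            x = Mq1[:, 0]                                    # query column
            sims = [cosine_similarity(x, Mq1[:, j + 1])      # step 7
                    for j in range(n)]
            lsa_score = train_scores[int(np.argmax(sims))]   # step 8
            diffs.append(abs(lsa_score - human))
        mean_diff = float(np.mean(diffs))                    # step 9
        if mean_diff < best_diff:
            best_k, best_diff = k, mean_diff                 # step 11
    return best_k, best_diff
```

The testing phase then reuses the same comparison with the single value of k returned here.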


Testing Phase. The testing set consists of a set of scored essays not used in the training and validation phases. The term-by-document matrix constructed in the training phase and the value of k determined in the validation phase are used to determine the scores of the test set.

4 EXPERIMENTAL RESULTS

The corpus for experimentation consisted of 71 handwritten answer essays. Of these, 48 were by students and 23 were by teachers. Each of the 71 answer essays was manually assigned a score by education researchers. The essays were divided into 47 training samples, 12 validation samples and 12 testing samples. The training set had a human-score distribution on the seven-point scale of 1, 8, 9, 10, 2, 9, 8. Both the validation and testing sets had human-score distributions of 0, 2, 2, 3, 1, 2, 2.

Two different sets of 71 transcribed essays were created, the first by manual transcription (MT) and the second by the OHR system. The lexicon for the OHR system consisted of the unique words from the passage to be read, and had a size of 274. Separate training and validation phases were conducted for the MT and OHR essays. For the MT essays, the term-by-document matrix M had t = 490 and m = 47, and the optimal value of k was determined to be 5. For the OHR essays, the corresponding values were t = 154, m = 47 and k = 8.
The smaller number of terms in the OHR case is explained by the fact that several words were not recognized.

Comparisons of the human-assigned scores (the gold standard) with (i) automatically assigned scores based on MT are shown in Fig. 5(a) and (ii) automatically assigned scores based on OHR are shown in Fig. 5(b). In both plots the human scores are shown as stars (*) and the machine-assigned scores as open circles. Using MT, the human-machine mean difference was 1.17 (Fig. 5(a)). This is consistent with the one-point difference between LSA and the gold standard previously established in large-scale testing. Using OHR, the human-machine mean difference was 1.75 (Fig. 5(b)). Thus a 0.58 difference is observed between MT and OHR using LSA scoring. Although the testing set is small, these preliminary results demonstrate the potential of the method for holistic scoring and its robustness to OHR errors.


Fig. 5. Comparison of human scores, MT-LSA scores and OHR-LSA scores on 12 student responses to the American First Ladies question: (a) MT-LSA scores (open circles) are within 1.17 of human scores (stars), and (b) OHR-LSA scores (open circles) are within 1.75 of human scores (stars).

5 SUMMARY AND DISCUSSION

An approach to automatically evaluating handwritten essays in reading comprehension tests has been described. The design is based on optical handwriting recognition (OHR) and automatic essay scoring (AES). The lexicon for OHR is obtained from the passage to be read by the students. The AES method is based on latent semantic analysis (LSA), a holistic method of scoring that has shown much promise. Results on a small testing set show that with manually transcribed (MT) essays, LSA scoring has about a one-point difference from human scoring, which is consistent with large-scale testing. With the same test set, OHR-LSA scoring has about a half-point difference from MT-LSA scoring.

The results indicate that despite errors in word recognition the overall scoring performance is good enough to have practical value. This shows that when an OHR system is evaluated not so much on word recognition rates as on the overall application in which it is used, its performance can be quite acceptable. The same phenomenon has been observed in other OHR applications such as postal address reading, where the goal is not so much to read every word correctly as to achieve a correct sortation.

The LSA approach has some disadvantages for use in a classroom setting since it is based on a holistic approach. Like other IR approaches based on a "bag of words" model, LSA ignores linguistic structure. While the scoring is based on content, it is not based on idea development, organization, cohesion, style, grammar, or usage conventions. An approach that does account for these, known as analytic scoring, will need to take linguistic structure into account. The results of analytic scoring will be more useful to teachers and education researchers.
Future work will involve improving the robustness of OHR, particularly in segmenting touching text lines, and the use of information extraction techniques for tighter integration between OHR and AES.


References

1. Baeza-Yates, R. and Ribeiro-Neto, B.: Modern Information Retrieval. New York: Addison-Wesley (1999).
2. Burstein, J.: The E-rater scoring engine: Automated essay scoring with natural language processing. In: Automated Essay Scoring (2003).
3. Hull, J.J.: Incorporation of a Markov model of syntax in a text recognition algorithm. In: Proceedings of the Symposium on Document Analysis and Information Retrieval, pp. 174-183 (1992).
4. Landauer, T., Laham, D. and Foltz, P.: Automated scoring and annotation of essays with the Intelligent Essay Assessor. In: Automated Essay Scoring (2003).
5. Landauer, T.K., Foltz, P.W. and Laham, D.: An introduction to latent semantic analysis. Discourse Processes, 25, pp. 259-284 (1998).
6. Larkey, L.S.: Automatic essay grading using text categorization techniques. In: Proceedings of the ACM-SIGIR Conference on Research and Development in Information Retrieval, Melbourne, Australia, pp. 90-95 (1998).
7. Mahadevan, U. and Srihari, S.N.: Parsing and recognition of city, state and ZIP codes in handwritten addresses. In: Proceedings of the Fifth International Conference on Document Analysis and Recognition (ICDAR), Bangalore, India, pp. 325-328 (1999).
8. Page, E.B.: Computer grading of student prose using modern concepts and software. Journal of Experimental Education, 62, pp. 127-142 (1994).
9. Palmer, J., Williams, R. and Dreher, H.: Automated essay grading system applied to a first year university subject: how can we do better? Informing Science, pp. 1221-1229 (June 2002).
10. Plamondon, R. and Srihari, S.N.: On-line and off-line handwriting recognition: A comprehensive survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(1), pp. 63-84 (2000).
11. Porter, M.F.: An algorithm for suffix stripping. Program, 14(3), pp. 130-137 (1980).
12. Srihari, R.K., Ng, S., Baltus, C.M. and Kud, J.: Use of language models in on-line sentence/phrase recognition. In: Proceedings of the International Workshop on Frontiers in Handwriting Recognition, Buffalo, pp. 284-294 (1993).
13. Srihari, S.N. and Kim, G.: PENMAN: A system for reading unconstrained handwritten page images. In: Proceedings of the Symposium on Document Image Understanding Technology (SDIUT 97), Annapolis, MD, pp. 142-153 (1997).
14. Srihari, S.N., Zhang, B., Tomai, C., Lee, S., Shi, Z. and Shin, Y.C.: A system for handwriting matching and recognition. In: Proceedings of the Symposium on Document Image Understanding Technology (SDIUT 03), Greenbelt, MD, pp. 67-75 (2003).
15. Srihari, S.N. and Keubert, E.J.: Integration of handwritten address interpretation technology into the United States Postal Service Remote Computer Reader System. In: Proceedings of the Fourth International Conference on Document Analysis and Recognition (ICDAR 97), Ulm, Germany, pp. 892-896 (1997).
