
2-dimensional space that can be made into a visual representation. In our topic flow visualizations, textual units (e.g. paragraphs) are represented in a two-dimensional space, with their distances being proportional to the distance between topics (in the reduced dimensionality space). This visualization can provide the means to see how the parts of a document interrelate and follow each other. Once the vector data is in a lower dimensional representation and can be represented visually, we need to understand how the visualization (produced using optimal dimensionality reduction techniques) will improve an aspect of a learning activity such as reading and assessment. This is the second important question addressed in this paper.
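As a concrete illustration of this idea, the minimal sketch below (toy data, not the authors' implementation) uses Multidimensional Scaling to place paragraph vectors in two dimensions so that plotted distances approximate the distances between their topic representations:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.manifold import MDS

# Hypothetical paragraphs of a single essay.
paragraphs = [
    "Renewable energy reduces carbon emissions.",
    "Solar panels convert sunlight into electricity.",
    "Essay assessment benefits from visual feedback.",
]

# Represent each paragraph as a term-weight vector.
X = TfidfVectorizer().fit_transform(paragraphs).toarray()

# MDS places paragraphs in 2D so that distances in the plot
# approximate the distances between their vectors.
coords = MDS(n_components=2, random_state=0).fit_transform(X)
print(coords)  # one (x, y) point per paragraph
```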

The new techniques for document visualization and for quantifying topic flow are described in the next section: first the mathematical framework to represent text segments as vectors (i.e. the vector space model), and then the combination of the three dimensionality reduction techniques and Multidimensional Scaling used to bring these representations to a 2-dimensional visualization. Since the algorithms need to be validated as producing data that represents the actual semantics of the documents, we show how they can be used to quantify topic flow in a collection of real documents assessed by experts. We then explore the value of the actual visual representations by measuring the impact they have on tutors assessing a collection of essays. This evaluation is made on a new corpus of student essays and assesses to what extent semantic features can be captured using these techniques.
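One simple way to quantify topic flow is sketched below; this assumes flow is measured as the distance between consecutive textual units, which is only an illustration (the actual metric is defined in the next section):

```python
import numpy as np
from scipy.spatial.distance import cosine

def topic_flow(vectors):
    """Cosine distance between consecutive paragraph vectors.

    Small values mean adjacent paragraphs stay on topic;
    large values mark abrupt topic shifts.
    """
    return [cosine(vectors[i], vectors[i + 1])
            for i in range(len(vectors) - 1)]

# Hypothetical 2D paragraph coordinates after dimensionality reduction.
coords = np.array([[0.9, 0.1], [0.8, 0.2], [0.1, 0.9]])
print(topic_flow(coords))  # the second transition is far larger
```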

Visualizing Topic Flow

The Mathematical Framework<br />

Documents can be represented in high dimensional spaces (e.g. all their possible topics or terms), but their visualization can only be done in 2 or 3 dimensions. It is therefore essential to choose dimensions that are meaningful representations of quality features found in the text. Textual data mining approaches have been applied to analyze the semantic structure of texts (Graesser, McNamara, Louwerse, & Cai, 2004) and to build automatic feedback tools (Villalon, Kearney, Calvo, & Reimann, 2008; Wiemer-Hastings & Graesser, 2000). Such approaches typically involve performing computations on a text's parts to uncover latent information that is not directly visible in the surface structure of the text. These approaches have been applied to solve a number of problems in educational settings, including automatic summarization, automatic assessment, automatic tutoring and plagiarism detection (Dessus, 2009). These applications often use measures of semantic similarity to compare text units. The most common approach involves the creation of a term-by-document matrix, derived from frequency vectors of the distinct terms in each document, to create a topic model, which describes the mixture of different topics in the document corpus.
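A minimal sketch of this first step (using scikit-learn on a hypothetical mini-corpus, rather than any specific implementation from the paper) builds the term-by-document matrix from raw term frequencies:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical mini-corpus of three documents.
docs = [
    "students write essays about energy",
    "solar energy and wind energy",
    "tutors assess student essays",
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)  # documents x terms

# Transpose to the term-by-document orientation used in the text.
term_by_doc = X.T.toarray()
print(vectorizer.get_feature_names_out())
print(term_by_doc)  # rows: terms, columns: documents
```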

The most representative dimensionality reduction techniques (a.k.a. topic models) are Latent Semantic Analysis (LSA) (Deerwester, Dumais, Furnas, Landauer, & Harshman, 1990), Non-negative Matrix Factorization (NMF) (Lee & Seung, 1999), and Probabilistic LSA (PLSA) (Hofmann, 2001). These dimensionality reduction techniques produce topic models, where individual terms in the term-by-document matrix are weighted according to their significance in the topic (McNamara, 2010). The aim of dimensionality reduction is to eliminate unimportant details in the data and to allow the latent underlying semantic structure of the topics to become more evident. These topic modeling methods do not take into consideration many important components of a text, such as word order, syntax, morphology, or other features typically associated with text semantics, such as linking words and anaphora. Nevertheless, the topic models built using these methods from many source documents have long been shown to be reliable in the domain of information retrieval and highly similar to topic models extracted by humans (cf. Landauer, McNamara, Dennis, & Kintsch, 2007). Similarly, automatic essay scoring systems have also been shown to have agreement comparable to that of human assessors with the use of pre-scored essays (Wang & Brown, 2007) and course materials (Kakkonen & Sutinen, 2004) to build the mathematical models.
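The sketch below shows how two of these reductions could be applied to a term-by-document matrix: truncated SVD is the standard computation behind LSA, and scikit-learn's NMF stands in for the Lee & Seung formulation (PLSA has no scikit-learn implementation, so it is omitted here; the corpus is the same toy data as above):

```python
from sklearn.decomposition import NMF, TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "students write essays about energy",
    "solar energy and wind energy",
    "tutors assess student essays",
]
X = CountVectorizer().fit_transform(docs)

# LSA: truncated SVD of the term-by-document matrix.
lsa_docs = TruncatedSVD(n_components=2).fit_transform(X)

# NMF: factorization into non-negative topic mixtures.
nmf_docs = NMF(n_components=2, init="nndsvd").fit_transform(X)

print(lsa_docs)  # each document as a mixture of 2 latent topics
print(nmf_docs)
```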

While the above-mentioned topic modeling techniques have been successfully used on a large scale, they can be somewhat problematic in the case of providing feedback on a single essay. The corpus used to create a topic model is essentially a set of baseline texts, which are used to define the topic semantics against which a document will be compared. Thus, the distances between the term vectors in the semantic space of a topic model are entirely dependent on the corpus upon which it is built. A document is compared according to how it relates to the tendency of words to co-occur in the source documents, which could bias its meaning.
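This corpus dependence is easy to demonstrate: in the hypothetical sketch below (toy data, not from the paper), the similarity between the same pair of documents changes when the LSA space is built from a different background corpus:

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def lsa_similarity(corpus, doc_a, doc_b):
    """Similarity of two documents in an LSA space built from `corpus`."""
    vec = CountVectorizer().fit(corpus + [doc_a, doc_b])
    svd = TruncatedSVD(n_components=2).fit(vec.transform(corpus))
    a, b = svd.transform(vec.transform([doc_a, doc_b]))
    return cosine_similarity([a], [b])[0, 0]

doc_a = "solar power for homes"
doc_b = "wind power for farms"

# The same pair of documents, scored against two background corpora.
energy = ["solar and wind power", "power plants and energy", "solar farms"]
cooking = ["homes need kitchens", "farms grow food", "food for homes"]
print(lsa_similarity(energy, doc_a, doc_b))
print(lsa_similarity(cooking, doc_a, doc_b))
```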

