Book of Abstracts - phase 14 - elektroninen.indd - Oulu

Digital Humanities 2008

2,000 sentences), and thus to develop a model that can classify narration and description in “the wild”: a corpus of over 800 American novels from 1789 to 1875. This paper will discuss how we have refined our research problem and developed our classifying model (trial-and-error processes both); our initial results in “the wild”; and finally how macro-analysis of this kind leads to new problems for literary history.

Existing scholarship suggests that “realist” description enters the American novel with the historical novel; thus our initial training set of samples was taken from 10 American historical novels from the 1820s, 1830s, and 1840s (by J.F. Cooper and his rivals). Participants in the Beyond Search workshop have tagged random selections from these 10 novels. The unit of selection is, for convenience, the chapter. Selected chapters are broken (tokenized) into individual sentences and human-tagged using a custom XML schema that allows for a “type” attribute for each sentence element. Possible values for the type attribute include “Description,” “Narration,” “Both,” “Speech,” and “Other.” Any disagreement about tagging the training set has been resolved via consensus. (Since the signals for description may change over time, itself no small problem for this study, we plan to add a further training sample from later in the corpus.)
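As a rough illustration of how sentences tagged this way might be read back for training, the sketch below parses a small XML fragment and collects (type, sentence) pairs. The abstract does not publish its schema, so the element name `s`, the `chapter` wrapper, and the sample sentences are all assumptions made for the example; only the “type” attribute and its five values come from the text.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical markup: the <s> element, the <chapter> wrapper, and the
# sample sentences are invented for illustration. Only the "type"
# attribute and its possible values are taken from the abstract.
SAMPLE = """<chapter>
  <s type="Description">The lake lay dark beneath the pines.</s>
  <s type="Narration">He lifted the paddle and pushed off.</s>
  <s type="Speech">"Hold fast," said the guide.</s>
</chapter>"""

ALLOWED_TYPES = {"Description", "Narration", "Both", "Speech", "Other"}


def read_tagged_sentences(xml_text):
    """Return (type, sentence text) pairs, rejecting unknown type values."""
    root = ET.fromstring(xml_text)
    pairs = []
    for element in root.iter("s"):
        label = element.get("type")
        if label not in ALLOWED_TYPES:
            raise ValueError(f"unexpected type attribute: {label!r}")
        pairs.append((label, (element.text or "").strip()))
    return pairs


pairs = read_tagged_sentences(SAMPLE)
# Tally how many sentences carry each tag.
label_counts = Counter(label for label, _ in pairs)
```

Rejecting unknown values at read time is one way to surface tagging slips before they reach a classifier; whether the workshop's own tooling does this is not stated.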

Using a maximum-entropy classifier we have begun to investigate the qualities of the evolving training set and to identify the stylistic “signals” that are unique to, or most prevalent in, narrative and descriptive sentences. In the case of description, for example, we find a marked presence of spatial prepositions, an above-average percentage of nouns and adjectives, a relatively low percentage of first- and second-person pronouns, above-average sentence lengths, and a high percentage of diverse words (greater lexical richness). From this initial work it has become clear, however, that our final model will need to include not simply word usage data, but also grammatical and lexical information, as well as contextualizing information (i.e., the kinds of sentence that precede and follow a given sentence, the sentence’s location in a paragraph). We are in the process of developing a model that makes use of part-of-speech sequences and syntactic tree structures, as well as contextualizing information.
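Some of the surface signals listed above can be computed with very little machinery. The following is a minimal sketch, not the project's feature extractor: the word lists are short illustrative stand-ins (the abstract does not give its lexicons), the tokenizer is deliberately crude, and the noun/adjective percentage is left out because it would require a part-of-speech tagger. A feature dictionary of this kind could then be passed to a maximum-entropy (multinomial logistic regression) learner.

```python
import re

# Illustrative word lists only; these are assumptions, not the lexicons
# used in the study.
SPATIAL_PREPOSITIONS = {
    "above", "across", "behind", "below", "beneath", "beside",
    "between", "beyond", "near", "over", "through", "under", "upon",
}
FIRST_SECOND_PRONOUNS = {
    "i", "me", "my", "mine", "we", "us", "our", "ours",
    "you", "your", "yours", "thou", "thee", "thy",
}


def tokenize(sentence):
    """Lowercase the sentence and keep runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", sentence.lower())


def surface_features(sentence):
    """Compute a few per-sentence signals named in the abstract."""
    tokens = tokenize(sentence)
    n = len(tokens)
    if n == 0:
        return {"length": 0, "spatial_prep_rate": 0.0,
                "pronoun_rate": 0.0, "type_token_ratio": 0.0}
    return {
        "length": n,  # sentence length in tokens
        "spatial_prep_rate": sum(t in SPATIAL_PREPOSITIONS for t in tokens) / n,
        "pronoun_rate": sum(t in FIRST_SECOND_PRONOUNS for t in tokens) / n,
        # Distinct tokens over total tokens, a simple lexical-richness proxy.
        "type_token_ratio": len(set(tokens)) / n,
    }


features = surface_features("The lake lay dark beneath the pines.")
```

Type-token ratio is sensitive to sentence length, so a real model would probably normalize it or compute it over a window; that choice is left open here.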

After a suitable training set has been completed and an accurate classifying model has been constructed, our intention is to “auto-tag” the entire corpus at the level of the sentence. Once the entire corpus has been tagged, a straightforward quantitative analysis of the relative frequency of sentence types within the corpus will follow. Here, the emphasis will be placed on a time-based evaluation of description as a feature of 19th-century American fiction. But then, if a pattern emerges, we will have to explain it, and strictly quantitative analysis will need to be supplemented by qualitative analysis, as we interrogate not just what mode is prevalent when, but what the modes might mean at any given time and how the modes themselves undergo mutation.
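The time-based frequency analysis described here could be sketched as follows, assuming each auto-tagged sentence is available as a (publication year, sentence type) pair. The grouping by decade, the function name, and the toy rows are assumptions for illustration; the rows are made up and are not results from the study.

```python
from collections import Counter, defaultdict


def relative_frequency_by_decade(tagged):
    """Map each decade to the share of sentences carrying each type.

    `tagged` is an iterable of (publication_year, sentence_type) pairs.
    """
    counts = defaultdict(Counter)
    for year, label in tagged:
        decade = (year // 10) * 10
        counts[decade][label] += 1

    shares = {}
    for decade, counter in sorted(counts.items()):
        total = sum(counter.values())
        shares[decade] = {label: count / total
                          for label, count in counter.items()}
    return shares


# Made-up rows for illustration only; these are not findings.
toy_rows = [
    (1823, "Narration"),
    (1826, "Description"),
    (1826, "Narration"),
    (1841, "Description"),
]
shares = relative_frequency_by_decade(toy_rows)
```

Counting sentences treats a long novel as weightier than a short one; averaging per-novel shares within each decade would be an alternative design.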

Abstract 3: “Tracking the ‘Voice of Doxa’ in the Victorian Novel.”

Sarah Allison<br />

The nineteenth-century British novel is known for its moralizing: is it possible to define this “voice of doxa,” or conventional wisdom, in terms of computable, sentence-level stylistic features? A familiar version of the voice of doxa is the “interrupting narrator,” who addresses the reader in the second person in order to clarify the meaning of the story. This project seeks to go beyond simple narrative interruption to the explication of ethical signals emitted in the process of characterization (how the portrayal of a character unfolds over the course of a novel). It also takes a more precise look at the shift in tense noted by narratologists from the past tense of the story to the present tense of the discourse, in which meaning can be elaborated in terms of proverbs or truisms. (An example from Middlemarch: “[Fred] had gone to his father and told him one vexatious affair, and he had left another untold: in such cases the complete revelation always produces the impression of a previous duplicity” 23). This project is the first attempt to generate a set of micro-stylistic features that indicate the presence of ethical judgment.

Through an analysis of data derived through ad hoc harvesting of frequently occurring lexical and syntactic patterns (e.g. word frequencies, part-of-speech saturation, frequent grammatical patterns, etc.) and data derived through the application of supervised classification algorithms, this research attempts to determine a set of grammatical features that tend to cluster around these moments of direct narrative discourse.

This research is an alternative application of the method developed in Joe Shapiro’s Beyond Search workshop project (see abstract above), which seeks to identify computable stylistic differences between narrative and descriptive prose in 19th-century American fiction. In this work we seek to create another category within the “descriptive,” a subcategory that captures moments of explicitly moralized description: the Voice of Doxa. We identify the formal aspects of this authorial “voice” in order to “hunt” for similar moments, or occurrences, in a series of novels. The work begins with a limited search for patterns of characterization evident among characters in a single George Eliot novel; from this we develop a model, which we will then apply to the entire Eliot corpus. In the end, the target corpus is extended to include 250 19th-century British novels wherein we roughly chart the evolutionary course of the “ethical signal.”
