Book of Abstracts - phase 14 - elektroninen.indd - Oulu
Digital Humanities 2008
2,000 sentences), and thus to develop a model that can classify
narration and description in “the wild”: a corpus of over 800
American novels from 1789 to 1875. This paper will discuss
how we have refined our research problem and developed
our classifying model (trial-and-error processes both); our
initial results in “the wild”; and finally how macro-analysis of
this kind leads to new problems for literary history.
Existing scholarship suggests that “realist” description enters
the American novel with the historical novel; thus our initial
training set of samples was taken from 10 American historical
novels from the 1820s, 1830s, and 1840s (by J.F. Cooper and
his rivals). Participants in the Beyond Search workshop have
tagged random selections from these 10 novels. The unit of
selection is, for convenience, the chapter. Selected chapters
are broken (tokenized) into individual sentences and human-tagged
using a custom XML schema that allows for a “type”
attribute for each sentence element. Possible values for the
type attribute include “Description,” “Narration,” “Both,”
“Speech,” and “Other.” Any disagreement about tagging the
training set has been resolved via consensus. (Since the signals
for description may change over time, which is no small
problem for this study, we plan to add a further training
sample from later in the corpus.)
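
The tagging scheme just described can be pictured with a small example. This is an illustrative sketch only: the abstract does not reproduce the schema, so the element names `chapter` and `s` and the sample sentences are assumptions. Only the “type” attribute and its five values come from the text.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Allowed values for the "type" attribute, as listed in the abstract.
ALLOWED_TYPES = {"Description", "Narration", "Both", "Speech", "Other"}

# Hypothetical markup: the element names <chapter> and <s> are assumed.
SAMPLE = """<chapter novel="example" n="1">
  <s type="Description">The valley lay broad and green beneath the hills.</s>
  <s type="Narration">He mounted his horse and rode toward the fort.</s>
  <s type="Speech">"Who goes there?" cried the sentinel.</s>
</chapter>"""

def read_tagged_sentences(xml_text):
    """Return (type, text) pairs, rejecting unknown type values."""
    root = ET.fromstring(xml_text)
    pairs = []
    for s in root.iter("s"):
        label = s.get("type")
        if label not in ALLOWED_TYPES:
            raise ValueError(f"unexpected type attribute: {label!r}")
        pairs.append((label, (s.text or "").strip()))
    return pairs

pairs = read_tagged_sentences(SAMPLE)
counts = Counter(label for label, _ in pairs)  # sentences per type
```

Validating the attribute while reading keeps stray labels out of the training set before any disagreement is resolved by hand.
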
Using a maximum-entropy classifier we have begun to
investigate the qualities of the evolving training set and to
identify the stylistic “signals” that are unique to, or most
prevalent in, narrative and descriptive sentences. In the case
of description, for example, we find a marked presence of
spatial prepositions, an above-average percentage of nouns
and adjectives, a relatively low percentage of first- and second-person
pronouns, above-average sentence lengths, and a high
percentage of diverse words (greater lexical richness). From
this initial work it has become clear, however, that our final
model will need to include not simply word usage data, but also
grammatical and lexical information, as well as contextualizing
information (i.e., the kinds of sentence that precede and follow
a given sentence, and the sentence’s location in a paragraph). We
are in the process of developing a model that makes use of
part-of-speech sequences and syntactic tree structures, as well
as contextualizing information.
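
Some of the signals named above can be computed without any linguistic tooling. The sketch below is a minimal illustration, not the project's feature set: the word lists, function names, and sample sentence are assumptions, and the noun and adjective percentages are left out because they would require a part-of-speech tagger.

```python
import re

# Small illustrative word lists; assumptions, not the project's lexicon.
SPATIAL_PREPOSITIONS = {"above", "below", "beneath", "beside", "between",
                        "behind", "beyond", "near", "under", "over", "across",
                        "along", "around", "through", "toward", "within"}
FIRST_SECOND_PRONOUNS = {"i", "me", "my", "mine", "we", "us", "our", "ours",
                         "you", "your", "yours"}

def tokenize(sentence):
    """Lowercase word tokens; a crude stand-in for a real tokenizer."""
    return re.findall(r"[a-z']+", sentence.lower())

def surface_features(sentence):
    """Length, preposition and pronoun ratios, and lexical richness."""
    tokens = tokenize(sentence)
    n = len(tokens) or 1  # avoid division by zero on empty input
    return {
        "length": len(tokens),
        "spatial_prep_ratio": sum(t in SPATIAL_PREPOSITIONS for t in tokens) / n,
        "pronoun_1_2_ratio": sum(t in FIRST_SECOND_PRONOUNS for t in tokens) / n,
        "type_token_ratio": len(set(tokens)) / n,
    }

def contextual_features(labels, index, paragraph_length):
    """Types of the neighbouring sentences plus relative position."""
    return {
        "prev_type": labels[index - 1] if index > 0 else "START",
        "next_type": labels[index + 1] if index + 1 < len(labels) else "END",
        "position": index / max(paragraph_length - 1, 1),
    }

sentence = "The river ran beneath the bridge and beyond the old mill."
feats = surface_features(sentence)
context = contextual_features(["Narration", "Description", "Speech"], 1, 3)
```

Feature dictionaries of this kind could then be passed to a maximum-entropy classifier (equivalently, multinomial logistic regression). When auto-tagging, the neighbouring types would themselves be predictions rather than human tags.
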
After a suitable training set has been completed and an
accurate classifying model has been constructed, our intention
is to “auto-tag” the entire corpus at the level of the sentence.
Once the entire corpus has been tagged, a straightforward
quantitative analysis of the relative frequency of sentence
types within the corpus will follow. Here, the emphasis will be
placed on a time-based evaluation of description as a feature of
19th-century American fiction. But then, if a pattern emerges,
we will have to explain it, and strictly quantitative analysis
will need to be supplemented by qualitative analysis, as we
interrogate not just what mode is prevalent when, but what
the modes might mean at any given time and how the modes
themselves undergo mutation.
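
The time-based frequency analysis described above might look roughly like the following. The records are invented for illustration, and grouping by decade is an assumption; the abstract does not say what time unit will be used.

```python
from collections import Counter, defaultdict

# Hypothetical auto-tagged output: (publication year, sentence type) pairs.
tagged = [
    (1823, "Description"), (1823, "Narration"), (1826, "Narration"),
    (1826, "Speech"), (1851, "Description"), (1851, "Description"),
    (1852, "Narration"), (1855, "Speech"),
]

def relative_frequency_by_decade(records):
    """Map each decade to {sentence type: share of that decade's sentences}."""
    counts = defaultdict(Counter)
    for year, label in records:
        counts[year // 10 * 10][label] += 1
    return {
        decade: {label: n / sum(c.values()) for label, n in c.items()}
        for decade, c in sorted(counts.items())
    }

shares = relative_frequency_by_decade(tagged)
```

Shares rather than raw counts are used so that decades with more surviving novels do not dominate the comparison.
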
Abstract 3: “Tracking the ‘Voice of Doxa’ in the Victorian Novel.”
Sarah Allison
The nineteenth-century British novel is known for its
moralizing: is it possible to define this “voice of doxa,” or
conventional wisdom, in terms of computable, sentence-level
stylistic features? A familiar version of the voice of doxa is
the “interrupting narrator,” who addresses the reader in the
second person in order to clarify the meaning of the story.
This project seeks to go beyond simple narrative interruption
to the explication of ethical signals emitted in the process of
characterization (how the portrayal of a character unfolds
over the course of a novel). It also takes a more precise
look at the shift in tense noted by narratologists from the
past tense of the story to the present tense of the discourse,
in which meaning can be elaborated in terms of proverbs or
truisms. (An example from Middlemarch: “[Fred] had gone
to his father and told him one vexatious affair, and he had left
another untold: in such cases the complete revelation always
produces the impression of a previous duplicity” 23). This
project is the first attempt to generate a set of micro-stylistic
features that indicate the presence of ethical judgment.
Through an analysis of data derived through ad hoc harvesting
of frequently occurring lexical and syntactic patterns (e.g. word
frequencies, part-of-speech saturation, frequent grammatical
patterns, etc.) and data derived through the application of
supervised classification algorithms, this research attempts to
determine a set of grammatical features that tend to cluster
around these moments of direct narrative discourse.
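
As a rough illustration of what harvesting such patterns could involve, the sketch below scans the Middlemarch example for two kinds of cue: generalizing adverbs and second-person pronouns. Both cue lists and the clause-splitting rule are assumptions made for this example; the abstract does not list the features the project actually uses.

```python
import re

# Assumed cue lists for illustration only; not the project's features.
GENERALIZING_CUES = {"always", "never", "often", "seldom", "generally"}
SECOND_PERSON = {"you", "your", "yours", "yourself"}

def doxa_cues(sentence):
    """Split on colons and semicolons; report cue words found in each clause."""
    report = []
    for clause in re.split(r"[:;]", sentence):
        tokens = re.findall(r"[a-z']+", clause.lower())
        cues = sorted(t for t in set(tokens)
                      if t in GENERALIZING_CUES or t in SECOND_PERSON)
        report.append((clause.strip(), cues))
    return report

# The Middlemarch sentence quoted in the abstract (editorial brackets dropped).
example = ("Fred had gone to his father and told him one vexatious affair, "
           "and he had left another untold: in such cases the complete "
           "revelation always produces the impression of a previous duplicity")
report = doxa_cues(example)
```

Here the narrated first clause yields no cues, while the generalizing second clause is flagged by “always.” Detecting the tense shift itself would need part-of-speech tagging, which this sketch does not attempt.
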
This research is an alternative application of the method
developed in Joe Shapiro’s Beyond Search workshop project
(see abstract above), which seeks to identify computable
stylistic differences between narrative and descriptive prose in
19th-century American fiction. In this work we seek to create
another category within the “descriptive,” a subcategory that
captures moments of explicitly moralized description: the
Voice of Doxa. We identify the formal aspects of this authorial
“voice” in order to “hunt” for similar moments, or occurrences,
in a series of novels. The work begins with a limited search
for patterns of characterization evident among characters in a
single George Eliot novel; from this we develop a model, which
we will then apply to the entire Eliot corpus. In the end, the
target corpus is extended to include 250 19th-century British
novels wherein we roughly chart the evolutionary course of
the “ethical signal.”