Connectionist Modeling of Experience-based Effects in Sentence Comprehension

Master's thesis (Magisterarbeit) in Computational Linguistics
at the Institute of Linguistics, University of Potsdam

Submitted by
Felix Engelmann
Matriculation number: 716604
Potsdam, February 2009

First reviewer: Prof. Dr. Shravan Vasishth
Second reviewer: Dr. Heiner Drenhaus
Abstract

This thesis addresses the role of individual linguistic experience in computational models in psycholinguistics. Recent research increasingly reveals language-specific and speaker-specific processing differences, which pose a challenge for language-independent universal models. Starting from the assumption that individual and language-specific abilities derive from the speaker's linguistic environment, experience-based theories attempt to capture the highly complex relationships between corpus-based regularities and language ability. While explicitly designed symbolic models can only provide strongly simplified representations, connectionist methods make it possible to build models capable of learning, with which functional relations between experience and linguistic abilities can be established.

This thesis examines the explanatory power of such connectionist models, in comparison to traditional approaches, by means of two example phenomena. First, the recent literature on subject and object relative clauses in Mandarin Chinese is discussed. Although the relevant reading studies show very mixed results, they suggest that object relative clauses in Mandarin are processed more easily than subject relative clauses, in contrast to the otherwise cross-linguistic subject preference.

The second object of investigation is the grammaticality illusion in ungrammatical center embeddings. While in English the omission of an embedded verb yields higher acceptability than the grammatically correct version, this illusion does not arise for German readers.

Building on a connectionist model by MacDonald and Christiansen (2002), which makes consistent predictions for individual differences in the processing of English relative clauses, new simulations are carried out to compare both phenomena with model predictions. The simulation results predict an object preference for Mandarin and the absence of the grammaticality illusion in German. The empirical consistency of the results, however, is only superficial and does not withstand closer analysis.
Acknowledgments

I am grateful to Bei Wang, who was so kind as to teach me about Chinese relative clause grammar for this work. For revisions and helpful comments I owe special thanks to Pavel Logaçev and Titus von der Malsburg. Most of all, I want to thank my supervisor Shravan Vasishth for all his support and patience.
Contents

List of Figures
List of Tables

1 Preliminaries
1.1 Introduction
1.2 Relative Clauses and Complexity
1.3 Psycholinguistic Aspects
1.3.1 Memory
1.3.2 Expectation
1.3.3 Canonicity
1.3.4 Experience

2 Issues in Relative Clause Processing
2.1 The Subject/Object Difference
2.2 Chinese Relative Clauses
2.3 Predicting RC Extraction Preferences Cross-linguistically
2.3.1 Memory
2.3.2 Expectation
2.3.3 Canonicity
2.3.4 Experience
2.3.5 Other Explanations
2.3.6 Summary
2.4 The RC Extraction Preference in Mandarin
2.5 Forgetting Effects
2.5.1 The Grammaticality Illusion
2.5.2 Explaining the Forgetting Effect

3 Connectionist Modelling of Language Comprehension
3.1 Structure and Learning
3.2 Recursion and Complexity
3.3 A Model of RC Processing
3.3.1 MacDonald and Christiansen (2002)
3.3.2 Critique and Relation to other Approaches
3.3.3 What is learned?
3.3.4 Summary

4 Two SRN Prediction Studies
4.1 The Model
4.1.1 Network Architecture
4.1.2 Grammar and Corpora
4.1.3 Training and Testing
4.2 Replication of Previous Simulations
4.3 RC Extraction in Mandarin
4.3.1 Simulation 1: Regularity
4.3.2 Simulation 2: Frequency
4.3.3 Discussion
4.4 Forgetting Effects
4.4.1 The Model
4.4.2 Simulation 3: English
4.4.3 Simulation 4: German
4.4.4 Discussion
4.5 Conclusion

Bibliography

A Statistics

B Grammars
B.1 English
B.2 German
B.3 Mandarin
List of Figures

1.1 DLT memory cost for English RC
2.1 English RC reading times (King and Just, 1991)
2.2 DLT memory cost for English RCs
2.3 DLT memory cost for Mandarin RCs
2.4 CC-READER simulation on English RCs (Just and Carpenter, 1992)
2.5 DLT memory cost for the three VPs in a doubly embedded ORC
2.6 Ungrammaticality in English (Vasishth et al., 2008)
2.7 Ungrammaticality in German (Vasishth et al., 2008)
3.1 Simple recurrent network (Elman, 1990)
3.2 Ungrammaticality simulation (Christiansen and Chater, 1999)
3.3 Frequency × Regularity simulation (MacDonald and Christiansen, 2002)
3.4 Wells et al. (2009)
3.5 Wells et al. (2009) compared with MacDonald and Christiansen (2002)
4.1 Replication of MacDonald and Christiansen (2002)
4.2 Replication of Konieczny and Ruh (2003)
4.3 Output node activations on the relativizer in Mandarin
4.4 Simulation 1: Mandarin ORC regularity
4.5 Simulation 2: Mandarin SRC frequency
4.6 Simulation 3a: Forgetting effect in English without commas
4.7 Simulation 3b: Forgetting effect in English with commas
4.8 Simulation 4a: Forgetting effect in German
4.9 Simulation 4b: Forgetting effect in German without commas
List of Tables

2.1 Languages with subject preference (Lin and Bever, 2006b)
2.2 Mandarin corpus study (Kuo and Vasishth, 2007)
2.3 RC extraction preference predictions
2.4 Studies of Mandarin RC extraction
A.1 Statistics for simulation 1
A.2 Statistics for simulation 2
A.3 Statistics for simulation 3a
A.4 Statistics for simulation 3b
A.5 Statistics for simulation 4a
A.6 Statistics for simulation 4b
B.1 Mandarin lexicon
Chapter 1
Preliminaries

1.1 Introduction
Psycholinguistic models of human language processing in the tradition of competence theory (Chomsky, 1965) are anchored in generative grammar theories. So-called strong type-transparency approaches (Berwick and Weinberg, 1984) assume that parsing processes are directly driven by the underlying grammatical structure. There are currently a number of competing grammar-based approaches that are based on different grammatical principles. Examples are categorial grammars, head-driven phrase structure grammar, minimalism, and optimality theory (for an overview of current symbolic processing models see Vasishth and Lewis, 2006a). However, empirical data from self-paced reading, eyetracking, and brain imaging studies exhibit difficulty patterns that cannot be explained by a strong linking to competence. A precise model of human performance must account for three aspects of cognition: a) biological constraints, b) gradedness, and c) experience.

a) In contrast to the abstract logical nature of competence theories, human processing performance results from an interaction of linguistic and biological factors. Cognitive psychology is centered around the resource-bounded nature of human cognition. Especially important for real-time cognitive tasks are the limits of short-term memory. Common properties of short-term or working memory are a limited capacity, decay over time, and memory interference. These insights from general cognitive psychology suggest that language processing performance is not only constrained by the principles of working memory, but also relies on processing strategies adapted to these constraints. The latter conclusion is addressed in a symbolically abstracted fashion by ambiguity resolution principles like minimal attachment and late closure (Frazier, 1979) or special low versus high preferences in NP attachment ambiguities (Frazier and Clifton, 1996). A theory addressing processing difficulties caused by capacity and decay is the dependency locality theory (DLT; Gibson, 1998; 2000).

b) Especially important for psycholinguistic models of working memory processes is to account for the "continuous, graded nature of human performance" (Vasishth and Lewis, 2006a). Current comprehension models accounting for this aspect are implemented in activation-based architectures, for example CC-READER (Just and Carpenter, 1992) and the ACT-R-based sentence processing model (Lewis and Vasishth, 2005).
c) The third aspect characterizing human cognition, which is increasingly acknowledged in psycholinguistics, is the influence of experience. Probabilistic models like Jurafsky (1996) use corpus-extracted likelihoods to construct probabilistic grammars, thereby capturing aspects of frequency and plausibility. The scope of these models, however, is mostly confined to the prediction of ambiguity resolution and acceptability ratings. A related theory predicting word-by-word difficulty is the expectation-based approach by Levy (2008). The tuning hypothesis by Mitchell et al. (1995), in contrast, considers higher-level structural regularities the main aspect of experience. Experience, however, is a vague term that relates to a complex interaction of frequencies of words and structures, plausibility, semantic context, and structural regularities. Moreover, the connection between exposure and processing skill is mediated by a learning process, which is in turn constrained by biological factors. Thus the linking from corpus regularities and the like to observable effects is not trivial.

A type of model that provides a promising approach to experience is the connectionist network model (also called a model of parallel distributed processing, PDP). In the past, connectionist models have predominantly been used for word-level tasks like spoken and visual word recognition (McClelland and Elman, 1984; Seidenberg and McClelland, 1989), lexical ambiguity resolution (Kawamoto, 1993), phoneme production (Dell et al., 2002), and past tense acquisition (Rumelhart and McClelland, 1985). However, more and more connectionist approaches are starting to tackle the domain of comprehension (Christiansen and Chater, 1999; Rohde, 2002; Tabor et al., 1997). MacDonald and Christiansen (2002) (henceforth MC02) successfully used a simple recurrent network (SRN; Elman, 1990) to implement their account of skill-through-experience, proposing an interaction of structural regularities with biological factors. Using an SRN, they took advantage of special properties that emerge from interactive, activation-based, parallel distributed connectionism. The inherent architectural properties of such models cause emergent behavior that can be described in terms of limited memory capacity, decay, interference, and continuously graded performance. In addition, their behavior and internal representations are entirely determined by learning. That means that connectionist networks are essentially pure grammar-independent performance models (Christiansen and Chater, 1999; p. 3) that incorporate all three aspects of human language processing mentioned above.
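The kind of architecture MC02 build on can be illustrated with a minimal sketch of an Elman-style SRN forward pass. The code below is a toy, not a reproduction of MC02's network: the vocabulary, layer sizes, and weight initialization are arbitrary assumptions, and training by backpropagation is omitted. It only shows where the network's graded, experience-shaped memory comes from: the context units feed a copy of the previous hidden state back into the current step.

```python
import numpy as np

# Toy Elman-style simple recurrent network (SRN) forward pass.
# Sizes, vocabulary, and initialization are illustrative assumptions;
# training via backpropagation is omitted for brevity.

class SRN:
    def __init__(self, vocab_size, hidden_size, seed=0):
        rng = np.random.default_rng(seed)
        self.W_xh = rng.normal(0.0, 0.1, (hidden_size, vocab_size))   # input -> hidden
        self.W_ch = rng.normal(0.0, 0.1, (hidden_size, hidden_size))  # context -> hidden
        self.W_hy = rng.normal(0.0, 0.1, (vocab_size, hidden_size))   # hidden -> output
        self.context = np.zeros(hidden_size)  # context units hold a copy of h(t-1)

    def step(self, x):
        # The hidden state mixes the current input with the previous hidden
        # state, giving the network a graded, decaying memory of the sequence.
        h = np.tanh(self.W_xh @ x + self.W_ch @ self.context)
        self.context = h
        y = self.W_hy @ h
        e = np.exp(y - y.max())
        return e / e.sum()  # softmax: probability distribution over next words

vocab = ["the", "reporter", "that", "attacked", "senator", "admitted", "error", "."]
net = SRN(len(vocab), hidden_size=16)
probs = None
for word in ["the", "reporter", "that"]:
    x = np.eye(len(vocab))[vocab.index(word)]
    probs = net.step(x)  # prediction for the next word after each step
```

An untrained network like this produces near-uniform next-word distributions; only exposure to a corpus shapes the output activations into the frequency- and regularity-sensitive predictions that the experience-based account relies on.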
A special advantage of experience-based models is that they can account for individual performance differences and language-specific effects in a similar way. Individual differences are addressed in symbolic models like Just and Carpenter (1992) by differences in the capacity limit, but these models do not provide a comprehensive explanation for the origin of those limits. Language-specific effects are different performance patterns on comparable structures in different languages. Evidence for that phenomenon comes from effects of antilocality in head-final languages (Konieczny, 2000) and the forgetting effect in complex center-embeddings in English versus German (Vasishth et al., 2008). Language-specific effects usually exceed the scope of models that do not incorporate experience factors. MC02's SRN model has proven to make consistent predictions concerning global and individual differences in the comprehension of subject and object relative clauses. The aim of the work at hand is to assess this model's predictions on two related phenomena that are most probably the result of language-specific experience: the subject/object difference in Mandarin Chinese and the forgetting effect in multiply embedded object relative clauses in English and German.
S/O Difference in Mandarin Studies on Mandarin relative clauses are inconclusive regarding the preferred extraction type. While all other languages investigated in this respect show a subject preference, Mandarin is a potential exception from that cross-linguistic consistency. This is also what MC02's account of structural regularity is claimed to predict (Hsiao and Gibson, 2003; Kuo and Vasishth, 2007).

Forgetting Effect The forgetting effect refers to a grammaticality illusion in ungrammatical center-embeddings that is present in English but not in German. Grammatical differences between German and English here also suggest an explanation based on experience with structural regularities.

In the remainder of this chapter, the syntactic properties of relative clauses will briefly be introduced. Then four relevant explanatory aspects of psycholinguistic models will be discussed: memory, expectation, canonicity, and experience. Chapter 2 will lay out the two issues of the subject/object difference in Mandarin and the forgetting effect in English and German, and discuss potential explanations. Chapter 3 will then explain the properties of simple recurrent networks and discuss MC02's account in detail. Finally, in chapter 4, SRNs will be used to simulate the two issues addressed here and the resulting predictions will be discussed.
1.2 Relative Clauses and Complexity
The focus of the work at hand is NP-modifying restrictive relative clauses (RCs) like those shown in example (1). The embedded RC is missing an NP, here represented by e_i, which transformational syntax theories interpret as an unpronounced trace of an extraction movement (e.g., Chomsky, 1981). The trace appears either in subject or object position in the embedded clause and is co-indexed with the relative pronoun that, which binds it to the preceding head noun. The position of the trace depends on the extraction type of the RC. In a subject-extracted relative clause (subject relative clause, subject relative, or SRC) as in (1a), the embedded subject reporter is extracted as the subject of the main clause. In an object-extracted relative clause (object relative clause, object relative, or ORC) as in (1b), the extracted element serves as object of the embedded clause and subject of the matrix clause. Hence, in the ORC the noun reporter fulfills two roles.
3
Chapter 1 Prelim<strong>in</strong>aries<br />
(1) a. The reporter that_i e_i attacked the senator admitted the error. (SRC)
    b. The reporter that_i the senator attacked e_i admitted the error. (ORC)
The example shows subject-modifying relative clauses, i.e., they are attached to the subject noun phrase of the matrix clause. RCs can just as well attach to the object. In the object-modifying case the modified noun fulfills two roles in the SRC (object in the main clause and subject in the RC) but not in the ORC. In language comprehension theories the embedded extraction traces are called gaps, which have to be filled to reconstruct the underlying argument structure. That involves identifying a filler (the head noun) and finding the appropriate gap. Theories like the Active Filler Strategy (Frazier and Clifton, 1989; Frazier and Flores d'Arcais, 1989) or the top-down gap-searching mechanism by Lin, Fong, and Bever (2005) deal with this problem in different ways. Theories based on memory assume that unbounded dependencies like non-integrated arguments or fillers have to be stored in linguistic working memory (WM) until the element is reached that is necessary to integrate the dependent into the sentential structure (King and Just, 1991; Just and Carpenter, 1992; 2002; Gibson, 1998; Lewis et al., 2006; Lewis and Vasishth, 2005; Vasishth and Lewis, 2006b). In example (1) the filler noun reporter must be held in memory until the verb attacked is reached, which signals the gap. In (1b) the distance between the filler reporter and its gap is relatively large compared to (1a). The distances between dependent elements can be increased further by multiple RC embedding.

The abovementioned properties make relative clauses especially interesting for psycholinguistic studies of memory processes and gap-filling strategies. An additional point of interest is recursion, which will be discussed now. There are crucial differences in the complexity of embedded SRCs and ORCs. While object relatives in English can be recursively center-embedded in the main clause, multiple embedding of subject relatives results in an iterative right-branching structure. In consequence, ORC embedding causes longer distances between the dependents, illustrated in examples (2a) and (2b) by co-indexation.
(2) Doubly embedded RCs:
    a. The reporter that_1 e_1 attacked the senator that_2 e_2 recognized the officer admitted the error. (SRC)
    b. The reporter that_1 the senator that_2 the officer recognized e_2 attacked e_1 admitted the error. (ORC)
Center-embedded dependencies as in example (2) result from so-called mirror recursion, which can be characterized by the following phrase structure rules:

    X → aXa;  X → bXb;  X → ε    (1.1)

The grammar in (1.1) generates unboundedly deep embedding, producing mirror strings such as abba. Allowing for infinite recursion introduces substantial formal complexity. Unlike the right-branching found in English SRCs, center-embedding recursion exceeds finite-state expressibility.
Mirror recursion is one of three basic recursion types defined by Chomsky (1957) as relevant for natural language. The second type is identity recursion, which produces cross-dependencies of the form abab, as found for example in Swiss German or Dutch relative clauses. Cross-dependencies are very rare but still problematic, because they suggest that language is not context-free. A type even harder to find, if it exists at all, is counting recursion. It is characterized in terms of an artificial language like a^n b^n, where the number of occurrences of b depends on the number of occurrences of a. It is debatable whether this type of recursion exists in natural language (cf. Christiansen and Chater, 1999), but that question is not relevant for this thesis.
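The three dependency patterns can be made concrete with toy string generators over the abstract symbols a and b. The sketch below is purely illustrative; the function names are ad hoc, and the generators make no claims about any particular natural language — they only show which matching pattern each recursion type imposes.

```python
# Toy generators for the three recursion types, using abstract symbols.
# Function names are ad hoc labels for the types discussed above.

def mirror(symbols):
    """Mirror recursion (center-embedding): nested dependencies, e.g. abba.
    The i-th opening symbol is matched by the i-th-from-last closing symbol."""
    return "".join(symbols) + "".join(reversed(symbols))

def identity(symbols):
    """Identity recursion: cross-serial dependencies of the form abab,
    as in Swiss German or Dutch relative clauses."""
    return "".join(symbols) * 2

def counting(n):
    """Counting recursion: a^n b^n, where the number of b's must match
    the number of a's."""
    return "a" * n + "b" * n

print(mirror(["a", "b"]))    # -> abba
print(identity(["a", "b"]))  # -> abab
print(counting(3))           # -> aaabbb
```

In linguistic terms, mirror(["N1", "N2"]) corresponds to the noun-noun-verb-verb nesting of a center-embedded ORC, where the innermost noun must be paired with the first verb encountered.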
Interestingly, human language performance seems unable to handle recursion beyond double embedding. Double center-embedding already poses severe comprehension difficulties and is often rated as ungrammatical (Blumenthal, 1966). In addition, some studies of center-embedding show a grammaticality illusion for ungrammatical structures (Frazier, 1985; Gibson and Thomas, 1999; Vasishth et al., 2008; Christiansen and Chater, 1999). In spite of the fact that cross-dependency is formally most complex, it is center-embedding which is the hardest recursion type for the human comprehender (Bach et al., 1986). The highly demanding dependencies involved in center-embedding and the potential ambiguities produced by multiple gap positions yield processing difficulties that have been extensively investigated in psycholinguistic studies. The human difficulties in processing recursive structures are also evident language-independently: for example, center-embedded dependencies between digits and letters produce difficulties similar to those in language processing (Larkin and Burns, 1977). Current symbolic psycholinguistic theories close the obvious gap between grammatical competence and empirical performance by invoking memory limitations, decay, and attention span, or by explicitly defining a limit on the number of recursion levels. However, a question that arises is whether recursion should be assumed in the human processor at all. Since competence cannot be directly assessed, it can be empirically accessed only through the link to performance. But that link in turn depends on the underlying competence theory. This results in the non-falsifiability of an infinite competence, which Christiansen (1992) calls the Chomskian paradox.
“In particular, I suggest that recursion is a conceptual artifact of the competence/performance distinction [...], instead of a necessary characteristic of the underlying computational mechanism.” (Christiansen, 1992, p. 1)
As will be shown in chapter 3, connectionist models are performance models that account for memory limitations, recursion limits, and the characteristics of different recursion types with respect to human comprehension performance. The following section introduces four aspects of psycholinguistics relevant for this thesis.
1.3 Psycholinguistic Aspects
Chapter 1 Preliminaries
Four major aspects considered in sentence processing theories are of relevance for the work at hand: memory, expectation, canonicity, and experience. I will briefly present each of them here, together with the major related theories.
1.3.1 Memory

Locality and Antilocality
Complex sentences can contain increased distances between dependent constituents, which result in processing difficulty. For example, Grodner and Gibson (2005) found in a self-paced reading experiment that an increased distance between a verb and its subject correlates with increased reading times at the main verb (example 3).

(3) a. The nurse supervised the administrator while . . .
    b. The nurse from the clinic supervised the administrator while . . .
    c. The nurse who was from the clinic supervised the administrator while . . .

The dependency relations in example (3) are indicated in bold type face. In (3a) the dependent subject (the noun nurse) can be integrated with its head (the verb supervised) immediately, since no material intervenes before the verb. In (3b) and (3c) the noun-verb distance increases because other dependents or heads intervene. This leads to increased processing time at the integration site, called a locality effect.
Locality effects are most commonly attributed to the limited nature of linguistic working memory, which constrains the process of integration. In order to bind two dependent elements in the sentence structure, the dependent has to be held in memory until its head is reached. Independently motivated properties of working memory are capacity limits, decay, and interference. Capacity poses an upper limit on memory usage and hence on the number of elements held in memory. Furthermore, the representation of an item decays, either as a function of time (Lewis and Vasishth, 2005) or of the complexity of intervening material (Gibson, 1998; Just and Carpenter, 1992), which makes it harder to retrieve properly from memory. Finally, memorized elements that are similar in certain features can be confused. For example, two nouns that are similar in animacy cause similarity-based interference. The best-documented effect attributed to similarity is retrieval interference (Gordon et al., 2001; 2002; 2004; 2006; Lewis and Vasishth, 2005; Van Dyke and Lewis, 2003; Van Dyke and McElree, 2006). Under this interpretation, the process of retrieving a dependent noun at the head region is impaired because access is mediated by so-called retrieval cues that correspond to feature-value pairs. When several nouns share similar features, they cannot be distinguished by the retrieval cues. Encoding a noun that shares features with an already encoded noun can be subject to encoding interference (Gordon et al., 2004). And
finally, similarity-based interference is sometimes also said to affect processing between encoding and integration; this is called storage interference (Lewis et al., 2006).
In contrast to the locality effects found in English, opposite effects have been observed in head-final languages. For example, Konieczny (2000), Vasishth and Lewis (2006b), and Gibson et al. (2005b) found so-called antilocality effects in German, Hindi, and Japanese, respectively. Konieczny (2000), for instance, used German stimuli (4) comparable to those used by Grodner and Gibson (2005) in their English study.

(4) a. Er hat den Abgeordneten begleitet, und . . .
       He has the delegate escorted, and . . .
       “He escorted the delegate, and . . . ”
    b. Er hat den Abgeordneten ans Rednerpult begleitet, und . . .
       He has the delegate to_the lectern escorted, and . . .
       “He escorted the delegate to the lectern, and . . . ”
    c. Er hat den Abgeordneten an das große Rednerpult begleitet, und . . .
       He has the delegate to the big lectern escorted, and . . .
       “He escorted the delegate to the large lectern, and . . . ”

Reading times at the main verb begleitet were fastest in (4c) and slowest in (4a), contradicting locality predictions. Since antilocality effects were at first discovered only in head-final languages, it is commonly assumed that language-specific word order regularities cause the divergent effects. Whereas theories based on integration cost (e.g., the DLT: Gibson, 1998) cannot account for antilocality, expectation-based theories (Hale, 2001; Levy, 2008) predict antilocality effects to be caused by an increasing expectation for the verb as more intervening material is read. Notably, expectation theories predict antilocality language-independently. Their predictions receive support from recent evidence (Jaeger, Fedorenko, Hofmeister, and Gibson, 2008) for the presence of antilocality even in non-head-final languages (for an overview of possible explanations concerning locality and antilocality see Vasishth, 2008).
The Dependency Locality Theory
Gibson (1998; 2000) formulated a theory of capacity and decay in working memory in a discrete symbolic fashion. Its cost predictions were based on dependency-induced predictions of syntactic nodes and their distances, which is why the original theory was called the Syntactic Prediction Locality Theory (SPLT). The later, revised version is called the Dependency Locality Theory, referred to as the DLT. The DLT assigns a memory cost to each word in a sentence on the basis of two discrete functions: Integration Cost and Storage Cost. Integration Cost directly accounts for locality effects through a discrete distance measure. It predicts the amount of processing difficulty at the integration site and is defined by the number of discourse referents intervening between the dependent and its head. Valid discourse referents in Gibson’s sense are
referential constituents like nouns and main verbs, as they refer to objects and events, respectively. Pronouns, however, do not induce memory cost because they are assumed to be immediately accessible. The assumption behind Integration Cost is that every stored item receives an activation which decays with the number of discourse referents newly encoded while it is maintained in memory. Integrating an element, i.e., relating it to its head, requires more processing effort when the element has less activation. Thus the Integration Cost is a function that increases monotonically with the number of intervening discourse referents. The cost accounts for decay over time only implicitly, since time is represented discretely by successive discourse referents. Integration Cost is measured in energy units (EUs).
The memory capacity limit is accounted for by the second principle of the DLT: Storage Cost. It rests on the assumption that the parser constantly predicts the most probable complete sentence structure given the previous material and keeps it in memory. Structural complexity is measured by the number of syntactic heads the predicted structure contains: the more complex the structure, the more syntactic heads. Every predicted head uses up memory resources, so-called memory units (MUs). Memory load also affects processing, because storage and processing use the same resources (Just and Carpenter, 1992). Consequently, the more heads are predicted, the higher the processing cost. The important difference between the two costs is the location of their effects. While Integration Cost accounts for processing differences only at the integration site, Storage Cost for a predicted structure affects the processing of every following part of the sentence. Figure 1.1 shows the Integration Cost C(I) and the Storage Cost C(S) at each point in an English object relative clause. Seeing the sentence-initial determiner
ORC     The  reporter  who_i  the  senator  attacked e_i  admitted  the  error
C(I)     0      0        0     0      0        1+2           3        0    0+1
C(S)     2      1        3     4      3         1            1        1     0
Total    2      1        3     4      3         4            4        1     1

Figure 1.1: DLT cost metrics for an English ORC according to Gibson (1998).
induces the prediction of a main clause; hence predictions for an NP and a main verb have to be stored. Note that the DLT considers the prediction of the main verb to be cost-free, but in the literature it is mostly assigned a cost. For simplicity, in this work a Storage Cost for the main verb will be assumed consistently. Once the NP is complete, only the verb remains predicted. At the relative pronoun who a Storage Cost of 3 is assigned because an embedded SRC is predicted, containing two heads: the embedded verb and a subject gap. Seeing another determiner changes the prediction into an ORC, which contains one more head, namely the embedded subject. At senator only the embedded verb, the object gap, and the main verb remain predicted. On the embedded verb attacked, two integrations then take place. The subject integration of attacked costs 1 EU because the
verb counts as a new discourse referent. Establishing the relation between the relative pronoun who and the empty element consumes 2 EUs because two discourse referents (senator and attacked) have been processed in the meantime. The biggest cost is assigned to the main verb (admitted): here the subject reporter is integrated after three new discourse referents have been processed (the embedded subject, the embedded verb, and the main verb itself). Additionally, an NP head is predicted because the verb is transitive. The last integration takes place at the final word, error. The integration of the determiner produces no cost, whereas building the structural relation with its head admitted consumes 1 EU. Altogether, the total cost predicts the highest difficulty at the main verb, resulting from the long distance to its dependent arguments. Using the total cost as a reading-time predictor, the DLT predicts locality effects perfectly.
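The bookkeeping behind Figure 1.1 can be reproduced in a few lines. The sketch below simply encodes the figure's per-word costs and sums them; it is a toy transcription of the cost table, not an implementation of the DLT's linguistic machinery:

```python
# DLT costs for "The reporter who the senator attacked admitted the error",
# transcribed from Figure 1.1. Each entry gives the word, its integration
# cost components (in EUs), and its storage cost (in MUs).
ORC_COSTS = [
    ("The",          [0],    2),
    ("reporter",     [0],    1),
    ("who",          [0],    3),
    ("the",          [0],    4),
    ("senator",      [0],    3),
    ("attacked e_i", [1, 2], 1),  # verb as new referent (1) + who-gap (2)
    ("admitted",     [3],    1),  # subject integrated over three referents
    ("the",          [0],    1),
    ("error",        [0, 1], 0),  # free determiner + head integration (1)
]

def total_costs(entries):
    """Total cost per word: summed integration components plus storage cost."""
    return [(word, sum(integration) + storage)
            for word, integration, storage in entries]
```

The peak totals fall on the two verbs, where the long-distance integrations take place, which is exactly the locality prediction discussed above.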
CC-READER
A computational model whose basic assumptions are consistent with the DLT's cost assignment is the Capacity Constrained READER, or CC-READER (Just and Carpenter, 1992; 2002), the successor of READER, a sentence reading model implemented in a framework called CAPS (Collaborative Activation-based Production System). CC-READER is an activation-based simulation of linguistic working memory processes, with limited capacity as the explanatory factor for memory load effects and individual differences. The constituting mechanism of CAPS is activation propagation caused by symbolic production rules. Productions and stored elements (words, structures, propositions, etc.) use the same resources. The availability of elements, as well as the applicability of productions, depends on their received activation exceeding a certain threshold. The condition for a production rule is met when the activation threshold of the respective source element is reached. An important architectural property of CC-READER is the use of processing cycles. In each processing cycle all currently applicable production rules fire simultaneously, meaning that they propagate weighted activation from the source to a “target element”. In this sense, capacity is defined as the maximum amount of activation available to all productions and stored elements per processing cycle. In case of an activation shortage, thresholds can still be reached by incremental production firing, resulting in more processing cycles. This is what happens when the total activation has to be “scaled back” because it exceeds the capacity limit. Such back-scaling affects the activation of both productions and stored items: “Any shortfall of activation is assessed against both storage and processing, in proportion to the amount of activation they are currently consuming” (Just and Carpenter, 1992, p. 135). Reading word by word, the parsing process also depends on lexical access and constructs a representation of the sentence that includes thematic role information. Slow-downs in processing are represented by an increased number of processing cycles. The two concepts of activation propagation and processing cycles enable CC-READER to predict (a) reading slow-downs in demanding sentence regions (due to the storage of many elements or the firing of many productions) and (b) region-specific individual differences in reading, both
empirically consistent, as comparisons with studies like King and Just (1991) show. All predicted processing difficulties are explained by the demand for activation. This also covers individual differences, by ascribing them to different limits on the total amount of activation. The effect of decay is accounted for only indirectly, depending on the number of newly activated elements, which is very similar to the DLT's Integration Cost. The capacity limit causes newly needed activation to be drawn preferentially from older elements. This results in a continuously graded decay which, however, is not temporally dependent but instead depends on storage and processing demands, just as in the DLT.
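The proportional "scaling back" quoted above amounts to simple arithmetic: when the total demanded activation exceeds the cap, every consumer, whether production or stored element, is cut by the same factor, so reaching a threshold takes more cycles. The function below is my own sketch of that idea, not CAPS code:

```python
def scale_back(demands: dict, capacity: float) -> dict:
    """Proportionally rescale activation demands to fit a capacity cap.

    Mirrors Just and Carpenter's (1992) idea that any shortfall of
    activation is assessed against storage and processing in proportion
    to the activation each is currently consuming.
    """
    total = sum(demands.values())
    if total <= capacity:
        return dict(demands)  # no shortage: demands are met as-is
    factor = capacity / total
    return {name: amount * factor for name, amount in demands.items()}
```

For instance, with demands of 30 units for a stored NP and 50 and 20 units for two firing productions under a cap of 80, every consumer is scaled by 0.8; the larger production delivers only 40 units this cycle and therefore needs additional cycles to push its target over threshold.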
ACT-R Sentence Processing Model
Another computationally implemented sentence processing model (Lewis and Vasishth, 2005; Lewis et al., 2006; Vasishth and Lewis, 2006b) is built on the cognitive architecture ACT-R (Anderson and Lebiere, 1998; Anderson et al., 2004). Like the CAPS model, ACT-R works on the basis of sub-symbolic activation propagation. Rule application, however, follows a more symbolic condition-action scheme. Processing difficulties are predominantly retrieval-based. Elements (memory chunks) involved in a production, such as lexical entries, need to be retrieved from declarative memory. The success of the retrieval process depends on the chunk's current activation level and its match with the retrieval cues specified in the production condition. Retrieval cues are feature-value pairs that increase the activation of chunks depending on the number of matched features (associative activation). The total activation of a memory chunk, calculated from its activation level and the cue-based activation, determines its probability of being retrieved as well as its retrieval latency. The possibility that several chunks match the retrieval cues partially enables the model to simulate associative retrieval interference. Retrieval interference distributes associative activation over several lexical entries, causing latencies and potentially the retrieval of the wrong chunk. How severely interference affects retrieval depends on the aforementioned activation level. Activation is a fluctuating value, a function of usage and of decay over time. Cue-based activation and retrieval of a particular element reactivate it, which slows down the decay process. The parsing process of the ACT-R sentence processing model (Lewis and Vasishth, 2005) is a combination of a left-corner incremental structure building mechanism and a top-down, goal-guided syntactic expectation that specifies the phrasal category of the structure to be constructed. A very unconventional assumption of the model is that, in spite of incremental parsing, the memory representation contains no serial order information that could guide retrieval and attachment preferences. Recency is accounted for only implicitly, by the decay function that affects parsing decisions in addition to cue matching. What differentiates this model from CC-READER and the DLT is its account of interference effects and its temporal decay function. Furthermore, processing difficulty is represented not by processing cycles but directly by estimated processing time. Setting retrieval cues, structural attachment, and shifting attention to the next word have fixed time
values that, in combination with the activation- and interference-based retrieval latency, constitute the predicted reading time for a word. Thus, while CC-READER and the DLT make predictions based on resource management, the predictions of the ACT-R model are based on language-independent, psychologically motivated latencies. The ACT-R sentence processing model has shown considerable consistency with empirical data regarding ambiguity resolution, reanalysis, center-embedding complexity, and extraction preferences in relative clauses (Lewis and Vasishth, 2005). Additionally, the cue-based reactivation mechanism accounts for antilocality effects in certain contexts: additional material containing pronominals or other expressions referring to the previously mentioned dependent can reactivate the chunk representing that dependent and thus boost its activation, making it faster to retrieve at the head.
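The interplay of decay, reactivation, and cue-based boosting described above can be illustrated with ACT-R's standard activation equations. The sketch below is a simplification: activation noise and the mismatch penalty are omitted, and the latency factor F is an illustrative value, not a parameter taken from Lewis and Vasishth (2005):

```python
import math

def base_level(times_since_use, d=0.5):
    """ACT-R base-level activation: log of summed power-law decayed
    traces of past uses; d is the decay rate (0.5 by convention).
    Each reactivation adds a trace and so slows effective decay."""
    return math.log(sum(t ** -d for t in times_since_use))

def retrieval_latency(base_activation, cue_boost, F=0.14):
    """Retrieval latency F * exp(-A), where the total activation A
    combines the decaying base level with associative activation from
    matching retrieval cues. Noise and mismatch penalties are omitted;
    F is an illustrative latency factor."""
    return F * math.exp(-(base_activation + cue_boost))
```

A chunk reactivated recently (e.g., by an intervening pronoun referring to it) has a higher base level and is therefore retrieved faster at the head than a chunk whose last use lies further back, which is the mechanism behind the antilocality account just described.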
1.3.2 Expectation
A very different approach from serial, memory-based resource management theories is to ascribe reading time effects to the context-dependent plausibility of the evolving structure. For example, the predictability of a word in a given context, empirically quantified by the Cloze completion task (Taylor, 1953), has a considerable effect on eyetracking and ERP measures in sentence processing (Ehrlich and Rayner, 1981). A theoretically related measure is surprisal (Hale, 2001). Surprisal as used in Levy (2008) is a probabilistic, grammar-based approximation to the negative log Cloze probability, but in fact yields a better fit to the data due to its logarithmic scaling of effects. Levy (2008) proposes a theory of probabilistic ambiguity resolution by parallel plausibility ranking of possible structures. For a partial string, all complete structures that include the input seen so far as a prefix are considered. At every word w_i a probability distribution P_i(T) is assigned over all possible continuations T, ranking the most probable structures highest. Following Levy, the probability distribution can most straightforwardly be based on a probabilistic grammar extracted from annotated corpora; but he makes no binding commitment in this respect, because the source of the probability or plausibility distribution could in principle also be semantic, phonological, or the like. Since in incremental parsing the predictions change over time, the distribution P_i(T) has to be updated with every new input word. A re-ranking of the preferences during the update process due to an unexpected word is regarded by Levy as a kind of reanalysis, inducing difficulty. In this sense the concept of difficulty prediction is equivalent to surprisal. Levy, however, defines difficulty as the relative entropy between the two probability distributions before and after the update. Consequently, the more the re-ranked distribution differs from the original, the higher the processing cost. This is related to reanalysis as in Frazier et al. (1983) and other approaches, with the difference that Levy's (2008) expectation-based theory is not serial but assumes the parallel maintenance of all possible (or at least the most probable) structures.
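Both difficulty metrics mentioned here can be stated compactly. The sketch below uses toy probability dictionaries in place of a real probabilistic grammar; it is my own illustration of the two formulas, not Levy's implementation:

```python
import math

def surprisal(p):
    """Surprisal of a word given its context: -log2 P(w | context), in bits."""
    return -math.log2(p)

def relative_entropy(p_after, p_before):
    """KL divergence D(after || before) between the distributions over
    structures before and after an update -- Levy's (2008) measure of
    how much the plausibility ranking shifts on a new word."""
    return sum(q * math.log2(q / p_before[t])
               for t, q in p_after.items() if q > 0)
```

An unexpected word with P = 0.05 carries about 4.3 bits of surprisal, and an update that reverses an 80/20 preference between two candidate structures yields a relative entropy of 1.2 bits, whereas an update that leaves the ranking unchanged costs nothing.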
In contrast to connectionist prediction models, in which performance depends on how well the network has extracted probabilistic constraints from the input material, in Levy's approach the structural probabilities of the grammar are perfectly known to the parser. Consequently, Levy's parallel probabilistic resource allocation theory constitutes a sort of competence model, with surprisal or relative entropy as a “bottleneck” to comprehension, thereby yielding performance-related predictions. The predictions of frequency-based approaches like the tuning hypothesis (Mitchell, Cuetos, Corley, and Brysbaert, 1995) are quite similar to surprisal most of the time but differ fundamentally in head-final structures. Similarly, the DLT and surprisal make comparable predictions only in structures that are not head-final. In head-final constructions the preceding dependents provide statistical information about the nature of the head, thus narrowing the prediction. According to the theory, a better prediction (i.e., lower surprisal) facilitates integration at the head. Thus an expectation-based theory predicts antilocality effects in head-final structures language-independently.
1.3.3 Canonicity
In the literature, the term canonicity with respect to word order is often used as a synonym for regularity and structural frequency. Here these terms shall be distinguished in order to formulate the respective theories clearly. A theory of canonicity has to answer two questions:

1. Which categorial domain is the focus of the canonicity?
2. What makes specific structures canonical?

The categorial focus of canonicity can be grammatical functions, thematic roles, letter sequences, prosody, and the like. The specific structures counting as canonical in these domains can be chosen on the basis of structural regularity, complexity, or simply convention.
The most common canonicity account goes back to Greenberg (1963); it relates to the basic grammatical functions subject, object, and predicate and is justified by structural regularities. Greenberg classified languages in terms of their canonical word order. He and the subsequent literature count English as a subject-verb-object (SVO) language because simple sentences and most subordinate constructions follow that order. English thus belongs to the second largest class (41.79% of the world's languages), preceded by the SOV order, attributed to 44.78% of languages (according to Tomlin, 1986). However, the classification is not equally clear for all languages. German is arguably an SOV language, although its simplest sentence structure is built with SVO order, as in English. Erdmann (1990), for example, concludes that German does not fulfill all requirements for an SOV language and should therefore be categorized as SVO.
As mentioned above, structural regularity based on corpus occurrences is not the only possible grounding for a canonicity account. A generative grammar-based account that relates word order canonicity to language processing assumes the language-specific canonical structure to be an internal representation underlying the surface structure (Lin et al., 2005). Thus, in order to comprehend a non-canonically structured sentence, the parser has to transform it back into the canonical order. This extra processing makes non-canonical structures harder to comprehend than sentences mirroring the underlying order. Supported by evidence from Ferreira (2003), this theory extends to thematic roles, which seem to be assigned by heuristics based on a canonical argument structure.
A fully heuristic language processing account has been proposed by Bever (1970). He defined several comprehension strategies that rely on superficial structural similarities. One of them (Strategy D) concerned thematic role assignment:

Strategy D: "Any Noun-Verb-Noun (NVN) sequence within a potential internal unit in the surface structure corresponds to 'actor-action-object'." (Bever, 1970)

Bever's strategies cover several categorial levels of structural regularity, reaching from phonemes to complex phrases and clauses. The strategies designate specific structures that seem to appear regularly at different levels as basic or canonical structures, which are predicted or expected by default. Concerning processing difficulty, this account predicts harder processing for structures that do not fit the strategy templates.
Canonicity accounts are closely related to structural frequency. Although it is not obligatory, most accounts that use heuristics base their choice of canonical structures on the frequencies of those structures in language usage. Greenberg's original classification of languages into SOV, SVO, etc. was clearly based on frequently occurring orderings. On the other hand, a theory assuming base-generated orderings in the deep structure of a generative grammar can, of course, argue completely independently of frequencies and, for instance, refer to universal grammar specifications. Likewise, Bever's strategic preferences, although apparently correlating with frequent structures, could be claimed to stem from innate universal principles that help children learn a language. Detailed predictions of a word order canonicity account will be discussed in 2.3.
1.3.4 Experience

Experience-based theories assume that parsing strategies and processing efficiency are shaped by exposure to language.

Structural Frequency

The literature shows that corpus frequencies can be a good predictor of comprehension difficulty. A natural assumption is that structures that are used more often should be easier to comprehend than structures that are rarely produced. This assumption implies a parallelism between language production and comprehension. There are roughly two possibilities to explain that parallelism. One explanation posits a causal relation between production and comprehension, meaning that exposure to particular structures shapes the ability to comprehend them. The other explanation assumes that the underlying processes of production and comprehension are basically the same and hence are limited by the same constraints. An experience-based account clearly favors the former explanation, which does not exclude the second possibility but does not depend on it.
The Structural Grain Size

A serious problem for symbolic theories of exposure-based parsing decisions is the question of grain size. As in canonicity accounts, the question to answer is at which structural level information should be considered to affect parsing decisions. A symbolic exposure-based account like that of Mitchell, Cuetos, Corley, and Brysbaert (1995) tabulates the frequencies of specific structures. For each relevant structure there is a table listing its different interpretations (e.g., the attachment site in complex noun phrases). When the parser processes an ambiguous construction, the most frequent of the relevant recorded structures (frames or partial syntactic representations) is merged with the current sentence structure to yield a predicted disambiguated structure.

"The success of this process depends upon establishing a useful link between aspects of the current material and corresponding features of the established records. This is essentially a category selection or pattern-matching problem." (Mitchell et al., 1995, p. 470)
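A minimal sketch of such a record-and-match mechanism may help. The frame labels, function names, and counts below are invented for illustration; this shows the general idea, not Mitchell et al.'s actual implementation:

```python
from collections import Counter

# Each record pairs a structural frame (at some chosen grain size)
# with one of its interpretations, e.g. an attachment site.
records = Counter()

def observe(frame, interpretation):
    """Store one disambiguated occurrence of a structural frame."""
    records[(frame, interpretation)] += 1

def resolve(frame):
    """Pattern-match the current material against the established
    records and return the most frequent interpretation, if any."""
    options = {interp: n for (f, interp), n in records.items() if f == frame}
    return max(options, key=options.get) if options else None

# Hypothetical Spanish-like exposure: RC attachment in NP-PP-RC frames
for _ in range(60):
    observe("NP-PP-RC", "high")
for _ in range(40):
    observe("NP-PP-RC", "low")

print(resolve("NP-PP-RC"))  # high
```

Note that the chosen grain size is baked into the frame labels: a coarser model would pool NP-PP-RC and NP-PP-PP occurrences under one frame, a finer one would split them further.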
The recorded structures can be specified in deep detail, say, on a lexical level, or more abstractly, e.g., on the level of phrasal categories. Example (5) (Mitchell et al., 1995) contains a global ambiguity. In (5a) the RC who was outside the house can be attached to either the first noun wife or the second noun football star. The same is true for the PP outside the house in (5b).

(5) a. Someone stabbed the wife of the football star who was outside the house.
b. Someone stabbed the estranged wife of the movie star outside the house.

In an exposure-based account the parser's decision for noun one (high) or noun two (low) attachment depends on the corpus frequencies of both possibilities. These frequencies could be calculated at several possible structural levels. For example, frequencies could be tabulated individually for each construction, recording attachment preferences for NP-PP-RC structures separately from NP-PP-PP structures. Alternatively, the preferences could be tabulated for both constructions pooled together by recording the occurrences of the more abstract NP-PP-(modifying constituent) structure. The choice of grain size crucially affects the theory's predictions. A record level that is too fine-grained is in danger of missing some affected constructions. A very abstract level, on the other hand, can lead to overgeneralization. Mitchell, Cuetos, and colleagues categorize existing exposure-based models into a) fine-grained (Spivey-Knowlton, 1994), b) coarse-grained (Cuetos et al., 1996), and c) mixed-grain models. Connectionist network models like MacDonald et al. (1994) and Juliano and Tanenhaus (1994) are counted into the third category. The account of Bever (1970) is also a mixed-grain account. Mitchell et al. basically argue for
a coarse-grained approach. They present empirical evidence against sub-classifying structures by noun type, animacy, or the like. In French, for example, the statistics over all NP-PP-RC structures make the correct attachment prediction (a high-attachment preference), while including statistical information about definiteness and other aspects of the noun phrases leads to the wrong predictions. Fine-grained information evidently has to be ignored in that case. In a sentence completion study, Corley and Corley (1995) found evidence that noun phrase attachment preferences in English do not rely on lexical data. They analyzed the by-subject variance of two studies involving the same structures but different lexical items. The lexical alternation did not affect the (low) attachment preference. Interestingly, for noun phrase attachments with two potential attachment sites there is a high-attachment preference in most languages. Exceptions are English, German, Italian, and Swedish, where low attachment is generally preferred. Mitchell et al. (1995) argue that such general preferences are only explainable in coarse-grained models. Cuetos et al. (1996) report that corpus frequencies predict the attachment preferences of two-NP-site ambiguities¹ for Spanish and English. For Spanish, which shows a high-attachment preference, Cuetos and colleagues found that 60% of the NP-PP-RC constructions in the corpus had the RC attached to the first NP. For English (low-attachment preference) they found only 38% high attachment in the
corpus. Desmet and Gibson (2003) argue against committing a model to one grain size level by showing that in some cases NP attachment preferences are affected at the lexical level. Desmet and Gibson studied human preferences in eyetracking as well as corpus frequencies of three-NP-site noun conjunction ambiguities (see example 6). They found that the corpus frequencies support the empirical results, which show a preference for middle over high attachment. Replacing the noun inside the attached phrase with the pronoun one, however, turned the preference into high > middle rather than middle > high (Gibson and Schütze, 1999).

(6) A column about a soccer team from the suburbs and. . .

a. an article/one about a baseball team from the city were published in the Sunday edition. (high)
b. a baseball team/one from the city was published in the Sunday edition. (middle)

A similar effect was obtained for German two-NP-site ambiguities (Hemforth et al., 2000): an ambiguity containing an anaphoric binding (e.g., a relative clause) produces a high-attachment preference, while without the anaphoric binding it results in a low-attachment preference. This leads Desmet and Gibson (2003) to conclude that in addition to structural information the occurrences of pronouns have to be tabulated as a
predictor. This shows, as Mitchell et al. (1995) also admit, that the exposure-based approach has to find a balance between coarse and more fine-grained measures, and that different structures might require different grain sizes for the tabulation of frequencies.

¹ Two-NP-site ambiguities refer to constructions in which there are two preceding NPs to potentially attach to.
Structural Frequency in a Connectionist Network

One type of model that specifically bases its predictions on records of structural frequencies is the connectionist network model. Mitchell et al. (1995) state that "in a connectionist system the records would consist of a set of activation weights distributed throughout the network" (p. 472). This is only partly accurate. The network does not explicitly count frequencies, nor are frequencies stored anywhere in the network. Rather, every exposure of the network to a specific structure immediately changes the weight distributions and thus the whole behavior of the network. One could therefore say that the weight distributions contain implicit structural knowledge. This can be observed, for example, in the activations of hidden layers: inputs of similar structures result in similar activation patterns in the hidden layer of simple recurrent networks (SRN; Elman, 1990). The comparison of these patterns reveals the structural generalization levels that drive the network's predictions. In contrast to symbolic exposure-based accounts, there is no explicitly fixed structural grain size that the network is sensitive to. There is, of course, a lower limit to the grain size, defined by the encoding level of the input. If the input string is encoded at the word level, the network has no information below that level to work with. The upper limit depends on the network's architecture and can be affected by the size of the hidden layer, the learning mechanism, and, specifically for recurrent networks, by their "memory span". Which levels the network actually chooses is hard to say in advance. Learning is a walk through a state space in search of the optimal solution to the desired input-output pairing. The choice of grain size is part of that optimization process and can change during the learning phase. A commitment to a specific grain size implicitly involves a commitment to the number of structures to distinguish. A fine-grained model will consequently have to keep apart many structural representations, while a very coarse-grained model has only few structures to deal with. This relation means that a network with very few hidden nodes through which the information has to be passed will only be able to make very high-level generalizations. The final choice of grain size will ideally be the most useful structural level of the internal input representation for meeting the output requirements, given the network's architecture. In chapter 3 the properties of connectionist networks will be discussed in more detail.
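The claim that weights encode frequency information only implicitly can be illustrated with a miniature Elman-style SRN. All numbers, the three-word toy vocabulary, and the truncated one-step learning rule are simplifying assumptions made for illustration only:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["a", "b", "c"]
V, H = len(vocab), 8

# Elman-style SRN: input and previous hidden state feed a hidden layer,
# which predicts the next word. No frequency counts are stored anywhere.
Wxh = rng.normal(0.0, 0.1, (H, V))
Whh = rng.normal(0.0, 0.1, (H, H))
Who = rng.normal(0.0, 0.1, (V, H))

def one_hot(word):
    x = np.zeros(V)
    x[vocab.index(word)] = 1.0
    return x

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def train_step(word, nxt, h, lr=0.1):
    """One next-word prediction with a truncated (1-step) gradient update."""
    global Wxh, Whh, Who
    x, target = one_hot(word), one_hot(nxt)
    h_new = np.tanh(Wxh @ x + Whh @ h)
    y = softmax(Who @ h_new)
    dy = y - target                        # cross-entropy gradient
    dh = (Who.T @ dy) * (1.0 - h_new**2)   # backprop into the hidden layer
    Who -= lr * np.outer(dy, h_new)
    Wxh -= lr * np.outer(dh, x)
    Whh -= lr * np.outer(dh, h)
    return h_new

# Exposure: the continuation "a b" is four times as frequent as "a c".
h = np.zeros(H)
for _ in range(200):
    h = train_step("a", "b" if rng.random() < 0.8 else "c", h)

# The frequency asymmetry now lives only in the weight distributions:
h_test = np.tanh(Wxh @ one_hot("a") + Whh @ np.zeros(H))
probs = softmax(Who @ h_test)
print(probs[vocab.index("b")] > probs[vocab.index("c")])  # True
```

After training, asking the network what follows "a" yields a stronger activation for the frequent continuation, even though no count was ever recorded; the "record" exists only as a weight configuration, exactly in the sense discussed above.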
Frequency and Regularity

Structural regularity is the occurrence of similarities between different structures at a certain grain level. For example, the English SRC is more regular than the ORC because at the level of functional categories (SVO) the SRC is similar to many other structures. In contrast, the corpus frequency of OSV, as in the ORC, is very low. In that sense, regularity is nothing other than frequency. One can speak of a regularity effect when, for instance, structures or tokens that are not highly frequent themselves receive a sort of "neighbor benefit" from frequent structures that are similar at a certain level. Benefit is meant here in the sense of a facilitating frequency effect. For example, in word recognition, regular words (with respect to orthography-pronunciation correspondence) are easier to recognize than exceptional words, although a regular word is not necessarily more frequent. It merely shares sub-regularities (i.e., similarities at a lower level) with other words, which has a facilitating effect. Since irregular words do not have such a "neighbor benefit", this leads to the frequency × regularity interaction that is implemented, for example, in Seidenberg and McClelland (1989). The interaction refers to the fact that there is a recognition performance difference between high- and low-frequency irregular words, while this difference is absent for regular words. Seidenberg and McClelland's model is a connectionist architecture, which predicts that interaction due to its learning mechanism. It is the same interaction that the SRN in MacDonald and Christiansen (2002) predicts for English subject and object relative clauses, with ORCs regarded as irregular.
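The shape of this interaction can be mimicked with a deliberately simple toy computation. The frequencies, the neighbor-support term, and the saturating activation below are all invented assumptions; this is a sketch of the ceiling-effect logic, not Seidenberg and McClelland's actual architecture:

```python
import math

def familiarity(own_freq, neighbor_freqs):
    """Toy recognition score: log-compressed support from the word's own
    frequency plus its similar "neighbor" patterns, squashed through a
    saturating nonlinearity so that strong support hits a ceiling."""
    support = math.log1p(own_freq) + sum(math.log1p(f) for f in neighbor_freqs)
    return math.tanh(0.2 * support)

neighbors = [500, 400, 300, 250]      # a regular word's rime neighbors
hf_regular = familiarity(1000, neighbors)
lf_regular = familiarity(10, neighbors)
hf_irregular = familiarity(1000, [])  # irregular words lack such support
lf_irregular = familiarity(10, [])

# Frequency x regularity interaction: the frequency effect is large for
# irregular words but nearly absent for regular ones (ceiling effect).
print((hf_irregular - lf_irregular) > (hf_regular - lf_regular))  # True
```

Regular words sit near ceiling thanks to their neighbor support, so their frequency difference is compressed; irregular words have only their own frequency to rely on, so the frequency effect surfaces, which is the interaction described above.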
Chapter 2

Issues in Relative Clause Processing

2.1 The Subject/Object Difference

One of the issues that shall be addressed in this work is the processing difference between subject and object relatives. The subject/object difference is a phenomenon extensively discussed in the literature. Studies in many languages show that subject relatives are easier to comprehend than object relatives (see table 2.1 for an overview). The studies are cross-linguistically consistent enough to speak of a universal subject preference. An exemplary study of English RCs that will be of further relevance for the work at hand is King and Just (1991). For that reason I will briefly describe their experiment.
King and Just (1991) King and Just conducted a self-paced reading¹ study of English RCs like the one in example (1), repeated here as (7). Before the experiment the participants were grouped by their reading span value, obtained with a reading span test (Daneman and Carpenter, 1980). The span value is assumed to be associated with individual memory capacity. The test value was used to group participants into high-, mid-, and low-span readers. The reading time analysis yielded the following results: a) there was a global memory span effect, with increased reading times for participants with lower span values; b) the memory span effect was larger on the ORC; c) the regions of greatest difficulty were the embedded verb (attacked) and the main verb (admitted); and d) the ORC was read more slowly overall than the SRC. The results showed that readers spent more time on the embedded and the main verb in object relative clauses compared to subject relative clauses. Additional comprehension questions yielded significantly lower accuracy for low-span compared to high-span readers, showing that not only was processing slower but comprehension was also worse for participants with lower span values.

¹ Self-paced reading (Just et al., 1982) is a method for recording word-by-word reading times in online sentence comprehension. Participants read a sentence word by word, pressing a button to make the next word appear. Only the current word is shown; the rest of the sentence may optionally be represented by masking characters.

Note that the large extraction type difference on the main verb may be a spillover effect. Grodner and Gibson (2005) carried out a study that used stimuli with intervening material between the verbs to prevent spillover. This study showed that there is indeed
no reading time difference on the main verb caused by extraction type. Also, there was an extraction type × verb (embedded/matrix) interaction. In particular, the reading times on the embedded and the main verb yielded the pattern V_emb < V_main in the SRC and the opposite pattern (V_main < V_emb) in the ORC.

Figure 2.1: Reading times for English relative clauses by reading span value (low, mid, and high) (King and Just, 1991). [The figure itself is not reproduced here; it plots reading time per word for successive areas of subject- and object-relative sentences and shows that the span-group differences are larger for the object-relative construction, particularly at the verbs.]

(7) English SRC and ORC (King and Just, 1991):

a. The reporter that_i e_i attacked the senator admitted the error. (SRC)
b. The reporter that_i the senator attacked e_i admitted the error. (ORC)

(8) German SRC and ORC (Konieczny and Ruh, 2003):

a. Der Wärter, der_i e_i den Häftling beleidigte, entdeckte den Tunnel.
   The guard, who.NOM the.ACC prisoner insulted, discovered the tunnel.
   'The guard who insulted the prisoner discovered the tunnel.' (SRC)

b. Der Wärter, den_i der Häftling e_i beleidigte, entdeckte den Tunnel.
   The guard, who.ACC the.NOM prisoner insulted, discovered the tunnel.
   'The guard who the prisoner insulted discovered the tunnel.' (ORC)

Similar results concerning the subject/object difference were obtained in German, French, Hindi, Japanese, Korean, and other languages, involving different sorts of paradigms like eye-tracking, self-paced reading, and brain imaging techniques (see table 2.1 for references). Explanations for the processing differences between the two types of RCs must ideally
Chapter 2 Issues in Relative Clause Processing<br />
Language Task References<br />
Brazilian Portuguese RSVP Gouvea (2003)<br />
Dutch SPR Frazier (1987)<br />
SPR, eye-tracking Mak et al. (2002)<br />
English Lexical Decision Ford (1983)<br />
SPR King and Just (1991); Gibson<br />
et al. (2005a)<br />
ERP King and Kutas (1995)<br />
fMRI Caplan et al. (2002); Just<br />
et al. (1996)<br />
PET Stromswold et al. (1996)<br />
French phoneme-monitoring Frauenfelder and Segui<br />
(1980)<br />
click-monitoring Cohen and Mehler (1996)<br />
eye-tracking Holmes and O’Regan (1981)<br />
German SPR Schriefers et al. (1995)<br />
ERP Mecklinger et al. (1995)<br />
Hindi SPR Vasishth and Lewis (2006b)<br />
Japanese SPR Miyamoto and Nakamura<br />
(2003)<br />
Korean SPR Kwon et al. (2004)<br />
Table 2.1: A selection of papers reporting a subject preference (extended table originally<br />
from Lin and Bever, 2006b).<br />
cover the global preference in several languages. Hence, inherent differences between<br />
the two constructions have to be found that could account for the diverse processing effects.<br />
The most reliable cross-linguistic difference between SRCs and ORCs is word order.<br />
For example, there is a greater distance between the head noun and the gap in the ORC<br />
in English and German (cf. examples 7 and 8). The dependencies involved are assumed<br />
to be particularly memory demanding and to produce locality effects (e.g. Grodner and<br />
Gibson, 2005) through integration difficulty. A challenge for a cross-linguistic explanation is<br />
that in some languages, e.g. Korean, Japanese, and Chinese, RCs are prenominal, i.e.,<br />
they precede the head noun. Others, like Hindi, use both possibilities (Vasishth and Lewis,<br />
2006b). In most cases the position of the RC before or after the head noun does not<br />
seem to be a confounding factor, as Korean and Japanese align with post-nominal languages<br />
in showing a subject preference. Popular locality-independent word order explanations are<br />
canonicity and frequency: in most languages SRCs have a more canonical word order<br />
than ORCs and, furthermore, a higher corpus frequency. Apart from syntactic properties,<br />
semantic information also plays an important role in the subject/object difference. For<br />
example, experiments by Traxler et al. (2002) showed that animacy and verb-induced<br />
plausibility are crucial predictors for difficulty differences between both constructions.<br />
Although the global subject preference appears robust, there is at least one<br />
exception reported so far: Mandarin Chinese, where RCs precede the head noun, just<br />
as in Japanese and Korean. Hsiao and Gibson (2003) found in an SPR experiment that<br />
in Mandarin subject relatives are in fact harder to comprehend than object relatives.<br />
Interestingly, the literature on Chinese relative clauses has since reported mixed<br />
results. While Lin and Garnsey (2007) and Qiao and Forster (2008) confirmed Hsiao and<br />
Gibson’s results, Kuo and Vasishth (2007) and Lin and Bever (2006b) found a subject<br />
preference. The apparently unresolved question about Mandarin Chinese might tip the<br />
scales in the search for a globally consistent theory of relative clause comprehension.<br />
Theories like Gibson’s (1998) Dependency Locality Theory, which favors an OR advantage<br />
for Mandarin, or the Accessibility Hypothesis (Keenan and Comrie, 1977; Lin<br />
et al., 2005), which favors a global subject preference, might rise and fall as candidates<br />
for a theory consistent across languages. Before other theories that are based on<br />
canonicity or word order frequency can make reasonable predictions, further investigations<br />
of the Mandarin relative clause structure are necessary.<br />
The following section will discuss the structure of Mandarin RCs. Then the relevant<br />
theories are assessed on their predictions concerning English and Mandarin. After that,<br />
recent studies on the Chinese SRC/ORC difference will be discussed and<br />
their results will be compared to the predictions of the outlined theories. Finally, I will<br />
turn to the second topic: language-specific forgetting effects in center-embedding.<br />
2.2 Chinese Relative Clauses<br />
Relative clauses in Mandarin Chinese are head final, i.e., they precede the modified<br />
noun. The RC is attached to the noun with the intervening genitive marker de (gen),<br />
which here serves as a relativizer.<br />
(9) a. Mandarin SRC:<br />
[ei yaoqing fuhao dei] guanyuani xinhuaibugui.<br />
invite tycoon gen official have bad intentions<br />
V O S<br />
’The official who invited the tycoon has bad intentions.’<br />
b. Mandarin ORC:<br />
[fuhao yaoqing ei dei] guanyuani xinhuaibugui.<br />
tycoon invite gen official have bad intentions<br />
S V O<br />
’The official who the tycoon invited has bad intentions.’<br />
Subject extracted RCs (example 9a) start with the embedded verb, before which a<br />
subject gap is assumed that has to be filled with the head noun. The SRC’s surface<br />
structure is ‘V NO de NS’, where NO is the embedded object and NS the head noun<br />
serving as the RC subject. Object relatives (example 9b) start with the embedded<br />
subject and the object gap is assumed just before the relativizer. The general structure<br />
is ‘NS V de NO’, where the head noun (NO) serves as the RC object.<br />
The pre-nominal nature of Chinese RCs has three major structural consequences that<br />
distinguish these constructions from RCs in English and other languages and hence<br />
could lead to different theory predictions. The first difference is the position of the<br />
gap. In English the filler-gap distance is shorter in the subject relative, while the head-final<br />
nature of Chinese yields a shorter distance in object relatives. This and the fact<br />
that the gap precedes the filler should make a difference for memory-based accounts<br />
and gap-searching algorithms. Secondly, the head-final structure produces a temporary<br />
ambiguity, especially in the Chinese ORC. In English the start of a non-reduced RC is<br />
marked by a relative pronoun (e.g. that). In Chinese, because the relativizer follows<br />
the RC, the reader is not necessarily aware of the RC while reading it. Initially, the<br />
Chinese ORC has the form of a simple sentence. This should have consequences for<br />
parsing and prediction. Finally, the canonicity properties of object and subject RCs are<br />
swapped in Chinese. In contrast to English and other languages, where the SRC exhibits<br />
the canonical word order, it is the ORC in Chinese which resembles the SVO word<br />
order of simple sentences. Another consequence of the noun-preceding RC concerns the<br />
complexity of deeper embedding. Interestingly, in Chinese an SRC embedding produces<br />
the assumedly more complex center-embedding structure while ORC embedding results<br />
in an iterative linear structure.<br />
(10) a. Doubly embedded SRC (Hsiao and Gibson, 2003):<br />
[ei yaoqing [ej gojie faguan dej ] fuhaoj dei] guanyuani<br />
gap invite gap conspire judge gen tycoon gen official . . .<br />
V1 V2 N1 de1 N2 de2 N3<br />
’The official who invited the tycoon who conspired with the judge. . . ’<br />
b. Doubly embedded ORC (Hsiao and Gibson, 2003):<br />
[[fuhao yaoqing ei dei] faguani gojie ej dej ] guanyuanj<br />
tycoon invite gap gen judge conspire gap gen official . . .<br />
N1 V1 de1 N2 V2 de2 N3<br />
’The official who the judge who the tycoon invited conspired with. . . ’<br />
As can be seen in example (10), the doubly embedded SRC shows a recursive<br />
center-embedding dependency between the head noun and the related gap. In the doubly<br />
embedded ORC the dependency is linear. In a head-initial language like English, embedding<br />
results in the opposite complexity pattern.<br />
Psychological Reality and Locality The semantic interpretation of the Mandarin<br />
RC structure is the same as in other languages. However, their dramatic syntactic<br />
difference from post-nominal RCs raises the question of whether head-final constructions<br />
are syntactically comparable to head-initial RCs. A cross-linguistic processing theory<br />
for RCs should capture all kinds of RCs. This requires the captured structures to<br />
induce similar parsing processes. If Chinese did in fact not use comparable<br />
syntactic realizations, there would be no need for existing syntactic RC theories to fit the<br />
Chinese data. Indeed some researchers treat head-final RCs as adjective-like adjuncts not<br />
containing any gap (e.g. for Japanese: Matsumoto, 1997). This would make a filler-gap<br />
resolution process unnecessary in parsing Chinese RCs. However, Lin et al. (2005) (also<br />
reported in Lin, 2007) provided empirical evidence that Mandarin RC constructions are<br />
indeed gap-containing structures that are processed differently from adjunctive phrases.<br />
Their experiment contained Chinese possessor relative clauses (PRCs), which are similar<br />
to adjunctive phrases in surface structure. In contrast to RCs, PRCs do not contain an<br />
overt gap. In a canonical PRC like example (11) the region before the relativizer has<br />
the structure ‘N1 V _ N2’, with the covert possessor gap lying between the verb and<br />
the possessee.<br />
(11) huairen bangjia _ laopo de zongcai jueding baojing<br />
bad guys kidnap wife DE chairman decide call police<br />
‘The chairman whose wife some bad guys kidnapped decided to call the police.’<br />
The construction can be slightly changed to alter the gap position. By using the<br />
marker ba, gap and object can appear pre-verbally. Inserting the passive marker bei<br />
allows the possessor gap to be in the sentence-initial subject position. In a self-paced<br />
reading experiment Lin and colleagues controlled the material for three different positions<br />
of the potential gap and compared the reading times to adjunctive clauses. The results<br />
show that the reading speed on the head noun was dependent on the gap position only<br />
for the possessive RCs but not for the adjunctive clauses. The processing differences are<br />
interpreted in Lin (2007) as evidence for filler-gap dependencies in Chinese pre-nominal<br />
relative clauses. This makes them “psychologically real” (Lin, 2007; p. 9) and, hence,<br />
comparable to post-nominal RCs. Another crucial finding was that the reading time on the<br />
head noun of the PRCs was fastest in the bei condition. Notably, this is the condition<br />
having the possessor gap in subject position, making the filler-gap distance longer than<br />
in the other two conditions. This clearly contradicts a locality account.<br />
Elided Subject or Gap Assumption For some of the examined theories it is important<br />
to know whether the reader is aware of the initial subject gap in the Chinese SRC<br />
with the form ‘gap V N1 de N2’. The knowledge of the gap can affect integration and<br />
memory processes as well as structural predictions and the gap-searching mechanism.<br />
For example, if the SRC were the only construction in Mandarin that starts with ‘V N’,<br />
the reader would know immediately at the first word that he or she is reading an SRC. Addressing<br />
that question, Kuo and Vasishth (2007) performed a corpus study on the Sinica<br />
Corpus 3.0 (5 million words), which is summarized in table 2.2. Kuo and Vasishth found<br />
639 SRC-like structures (V N1 de N2), of which only 19% (119) were in fact subject<br />
relatives. The majority of the structures were possessive modifiers with an inanimate<br />
head noun (see example 12).<br />
(12) a. tisheng qiye de jingzhengli<br />
increase company gen competitiveness<br />
‘To increase the company’s competitiveness.’<br />
b. guyong yuangong de chengben. . .<br />
hire employee gen cost<br />
‘The cost of hiring an employee. . . ’<br />
SRC-like ORC-like<br />
V N1 de N2 Predicate N1 V de N2 Predicate<br />
N2 animate N2 inanimate N2 animate N2 inanimate<br />
N1 animate 13 51 3 42<br />
N1 inanimate 106 469 1 71<br />
Table 2.2: Table from Kuo and Vasishth (2007), summarizing their corpus study on<br />
RC-like structures. Bona fide RCs are indicated by numbers in bold.<br />
Considering these non-gapped structures and the existence of further non-gapped<br />
constructions in Mandarin that start with ‘V N’, the reader’s awareness of the gap is<br />
questionable. In Chinese, in an appropriate context, even mono-clausal structures with<br />
an elided subject (_ V N) are possible. However, since it is not clear how frequent these<br />
structures are, and since they mostly need special contexts, it is not clear how pro-drop<br />
mono-clauses affect the parser’s predictions. Because the question is not clearly resolved,<br />
Kuo and Vasishth consider both the Gap Assumption and the Elided Subject Assumption<br />
as competing hypotheses. Under the Gap Assumption the reader knows immediately that<br />
he or she is reading a subject relative. Under the Elided Subject Assumption, which<br />
receives stronger support from the corpus study, the predictions are rather unclear but<br />
should not involve a gapped structure.<br />
Garden Path Effects Lin (2007) reports evidence for garden path effects due to<br />
temporary ambiguities in both head-final object and subject RCs. Several reading time<br />
studies of Japanese and Mandarin Chinese show a facilitating effect on the relativizer<br />
and the head noun when the RC region is disambiguated earlier. The disambiguation<br />
was achieved by explicit marking, classifier mismatch (e.g. Hsu et al., 2006; Yoshida<br />
et al., 2004), RC-inducing contexts (e.g. Ishizuka et al., 2006), or explicit participant<br />
information (Lin and Bever, 2007). The facilitating effect suggests that without disambiguation<br />
a reanalysis happens at the region of the relativizer and the head noun because<br />
the parser expects a main clause. Seeing this effect not only in the ORC but also in the<br />
SRC provides further evidence for the Elided Subject Assumption favored by Kuo and<br />
Vasishth (2007).<br />
How are the special characteristics of the Mandarin RC involved in processing differences<br />
between subject and object extraction? Do they lead to a subject preference<br />
prediction or do they account for a deviance from a cross-linguistically consistent theory?<br />
What further properties of Mandarin Chinese have to be taken into account to gain<br />
useful predictions? I will now briefly review the theories concerning the subject/object<br />
difference that were presented in chapter 1 and make their predictions for Mandarin and<br />
English relative clauses explicit. Then, empirical studies concerning Mandarin RCs<br />
will be examined and related to the theoretical predictions.<br />
2.3 Predicting RC Extraction Preferences<br />
Cross-linguistically<br />
2.3.1 Memory<br />
DLT<br />
In the SRC in examples (1a) and (8a) the distance between the filler (the head noun)<br />
and its gap is short. The ORC (examples 1b and 8b), on the other hand, contains a<br />
distant dependency. The cost metrics as illustrated for English in figure 2.2 show that<br />
the Dependency Locality Theory accounts for higher difficulties in the ORC.<br />
SRC The reporter whoi ei attacked the senator admitted the error<br />
C(I) 0 0 0 0+1 0 0+1 3 0 0+1<br />
C(S) 2 1 3 2 2 1 1 1 0<br />
Total 2 1 3 3 2 2 4 1 1<br />
ORC The reporter whoi the senator attacked ei admitted the error<br />
C(I) 0 0 0 0 0 1+2 3 0 0+1<br />
C(S) 2 1 3 4 3 1 1 1 0<br />
Total 2 1 3 4 3 3 4 1 1<br />
Figure 2.2: DLT cost metrics for English object and subject relative clauses<br />
The example shows that Integration Cost on the embedded verb attacked is higher<br />
for the ORC because two integrations take place at that position. First, attacked is<br />
integrated with its subject senator, consuming 1 EU. Establishing the relation between<br />
the relative pronoun who and the empty element consumes 2 EU because two discourse<br />
referents (senator and attacked) have been processed meanwhile. In the SRC, integrating<br />
the empty element is cost-free, while the subject attachment of attacked uses 1 EU<br />
because attacked counts as one discourse referent. This results in a 3:1 cost<br />
disadvantage for the ORC, partly due to the fact that in the ORC all arguments are integrated<br />
at once. In the SRC the object integration of the RC happens later, at senator, consuming<br />
1 EU. Still, the difference with respect to the whole RC region is 3:2 and increases when<br />
Storage Cost is counted in. Storage Cost differs from position four on, where the word<br />
the in the ORC predicts four heads, while attacked in the SRC predicts only two. After<br />
The reporter who the, four heads are predicted: an embedded subject (since reporter is the<br />
object), a transitive verb (since the sentence has an object), an object gap, and the main<br />
verb. After The reporter who attacked, only the main verb and a direct object for the<br />
embedded clause are predicted. Summing the cost metrics together, we obtain a total cost of 12 units for<br />
the ORC compared to 9 units for the SRC throughout the RC. Relating processing cost<br />
to processing time, the DLT clearly predicts a processing advantage for SRCs on the whole<br />
embedded clause. Consistent with the King and Just (1991) study and the correction<br />
by Grodner and Gibson (2005), the DLT predicts no clause-type difference on the main<br />
verb. However, both studies show the longest reading times at the embedded verb of the<br />
ORC, whereas the DLT locates the peak difficulty at the main verb.<br />
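The Integration Cost computation walked through above can be made concrete. The sketch below is my own encoding, not from the thesis: integrating two elements is charged 1 EU per new discourse referent (noun or verb) processed since the dependent, counting the integrating word itself, and gaps are included as zero-cost tokens.

```python
# Sketch of DLT Integration Cost for the example sentences. Each token is
# flagged 1 if it introduces a new discourse referent (nouns and verbs).
# Each integration (dependent_index, site_index) is charged at the site:
# 1 EU per referent strictly after the dependent, up to and including the site.

def integration_costs(referent_flags, integrations):
    costs = [0] * len(referent_flags)
    for dep, site in integrations:
        costs[site] += sum(referent_flags[dep + 1 : site + 1])
    return costs

# SRC: The reporter who e(gap) attacked the senator admitted the error
src_flags = [0, 1, 0, 0, 1, 0, 1, 1, 0, 1]
src_arcs  = [(2, 3), (3, 4), (4, 6), (1, 7), (7, 9)]

# ORC: The reporter who the senator attacked e(gap) admitted the error
orc_flags = [0, 1, 0, 0, 1, 1, 0, 1, 0, 1]
orc_arcs  = [(4, 5), (2, 6), (1, 7), (7, 9)]

src = integration_costs(src_flags, src_arcs)   # [0,0,0,0,1,0,1,3,0,1]
orc = integration_costs(orc_flags, orc_arcs)   # [0,0,0,0,0,1,2,3,0,1]
print(sum(src[2:7]), sum(orc[2:7]))            # RC region: 2 3
```

With the gap folded into the adjacent verb column, these per-token costs reproduce the C(I) rows of figure 2.2, including the 3:2 RC-region difference and the equal cost of 3 EU at the main verb.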
All in all, a subject preference is predicted for English. Now, what does the DLT have<br />
to say about Chinese? Figure 2.3 shows the costs for Mandarin subject and object<br />
relative clauses as assumed by Hsiao and Gibson (2003).<br />
SRC ei yaoqing fuhao dei guanyuan xinhuaibugui.<br />
gap invite tycoon gen official have bad intentions<br />
C(I) 0 1 1 3 1<br />
C(S) 3 2 2 1 0<br />
Total 3 3 3 4 1<br />
ORC fuhao yaoqing ei dei guanyuan xinhuaibugui.<br />
tycoon invite gap gen official have bad intentions<br />
C(I) 0 1 0 1 1<br />
C(S) 1 1 2 1 0<br />
Total 1 2 2 2 1<br />
Figure 2.3: DLT Integration and Storage Cost for Mandarin relative clauses.<br />
Integration Cost mainly predicts a difference on the head noun guanyuan “official”.<br />
The reason is the greater distance between the head noun and the embedded gap position<br />
in the SRC. Also on the relativizer there is an ORC advantage, because at this point<br />
in the SRC the referred subject is integrated with the embedded verb. Storage Cost<br />
predicts higher difficulty for the SRC from the first word on. Hsiao and Gibson assume<br />
that at that position the reader already knows that an RC is following because the<br />
sentential subject is missing. In other words, the gap is overtly recognized and affects<br />
the prediction. For the same reason the second word predicts more heads in the SRC<br />
than in the ORC, namely the missing subject and a main verb. In the ORC, on the<br />
other hand, only a direct object is predicted due to the temporary ambiguity of the initial<br />
RC resembling the beginning of a main clause. The resulting prediction of the DLT is<br />
a higher processing cost on the first two words of the SRC compared to the ORC. This<br />
means an object preference for Mandarin head-final RCs. However, the SRC Storage<br />
Cost predictions on the first two words are questionable. They rest on the assumption<br />
that the RC structure is the most probable one, given a missing subject. This would be<br />
the case under the Gap Assumption. However, assuming that the DLT’s prediction choices<br />
correlate with corpus frequencies, it is very unlikely that the parser would predict a<br />
relative clause before seeing the relativizer. Consequently, under the Elided Subject<br />
Assumption the DLT would predict fewer heads on the first two words of the RC. Under<br />
this assumption the object preference should disappear in that region.<br />
Computational Models<br />
The CC-READER model rests on assumptions similar to the DLT’s, namely capacity limitation and<br />
integration-based decay. A simulation of King and Just’s study produced comparable<br />
results (see figure 2.4 from Just and Carpenter, 1992). However, the span × RC type<br />
interaction on the main verb and the last word in the RC region was not predicted<br />
(MacDonald and Christiansen, 2002). Specifically, there is a greater reading span effect<br />
in the ORC than in the SRC in the data, whereas the simulation results show no such<br />
difference. Additionally, as Lewis and Vasishth (2005) point out, CC-READER underestimates<br />
the difficulty on the ORC embedded verb compared to the main verb. All in all,<br />
a subject preference is predicted for English. For Mandarin, in accordance with the DLT,<br />
an object preference can be expected. The same would apply to the ACT-R sentence<br />
processing model, which makes predictions similar to those of CC-READER with respect to<br />
English RCs, albeit fitting the results slightly better.<br />
2.3.2 Expectation<br />
In expectation-based theories like Levy’s (2008), the subject/object difference is accounted<br />
for similarly to experience-based theories. On the surface, the highly frequent SRC structure<br />
receives a higher ranking in the probability distribution of continuations than the<br />
ORC. Consequently, an expectation-based approach would predict a subject preference<br />
in all languages where the SRC corpus frequency exceeds the ORC frequency, as is the<br />
case in English, German, and Mandarin. However, like experience-based theories, expectation<br />
is a framework without definite commitments. To gain detailed predictions,<br />
it is necessary to know the exact word-by-word likelihoods with respect to the grammar.<br />
Here again the Gap versus Elided Subject Assumptions play a theoretical role. How-<br />
Figure 2.4: CC-READER simulation results on English subject and object extracted<br />
relative clauses (figure from Just and Carpenter, 1992; p. 140).<br />
ever, in Mandarin both assumptions would most likely result in the same predictions.<br />
Assuming possible elided subjects in a main clause, both a main clause and an embedded<br />
RC could be expected in either construction. Thus the mere frequency of subject versus<br />
object extractions could indeed be decisive in this case, predicting a subject preference.<br />
On the other hand, an overt subject gap in the SRC as assumed in the Gap Assumption<br />
would lower the cost at the relativizer for the SRC. Due to the main clause ambiguity<br />
in the ORC a costly update of the plausibility ranking would happen at the relativizer,<br />
where the ORC becomes more likely than a main clause. This cost is lower in the subject<br />
extraction due to the higher ranking of an embedded SRC. Consequently, under the Gap<br />
Assumption a syntactic expectation theory would also predict a subject preference.<br />
S<br />
attacked<br />
the<br />
2.3.3 Canonicity<br />
the<br />
senator<br />
Figure 9. The number <strong>of</strong> cycles expended on various parts <strong>of</strong> the subject-relative sentences (on the left)<br />
and object-relative sentences (on the right) when the simulation, CC READER, is operat<strong>in</strong>g with more or<br />
less work<strong>in</strong>g memory capacity. (The bottom graph presents the human data for comparison with the<br />
simulation.)<br />
Considering Greenberg's classification as a basis, what would a canonical word order account predict for English and German relative clauses? English subject relative clauses exhibit the canonical SVO structure whereas object relatives use an OSV ordering. Therefore, a heuristic or base-generative canonicity theory would assign a higher processing cost to ORCs in English, which is consistent with empirical evidence. The same applies to German, agreeing with the widely accepted SOV classification. As illustrated in example (13), German SRCs have SOV ordering and would be preferred, whereas ORCs have an OSV ordering. If we considered an SVO basis, no clear predictions would be possible.
2.3 Predicting RC Extraction Preferences Cross-linguistically
(13) German SRC and ORC (Konieczny and Ruh, 2003):
a. Der Wärter, [der_i e_i]_S [den Häftling]_O [beleidigte]_V, entdeckte den Tunnel. (SRC)
'The guard who insulted the prisoner discovered the tunnel.'
b. Der Wärter, [den_i]_O [der Häftling]_S e_i [beleidigte]_V, entdeckte den Tunnel. (ORC)
'The guard whom the prisoner insulted discovered the tunnel.'
Where would the difficulties show up in reading studies? This is not easy to answer for such a general theory. The location of the effects depends mainly on the underlying parsing theory and how it deals with unexpected structures. Generally, one can assume that slowdowns appear as soon as the reader realizes that he or she is reading a non-canonical structure. In Mandarin this would be the initial verb. Considering a transformational account with canonical base-generation at deep structure, some of the difficulties would most likely appear after having read the whole embedded structure, because it then has to be reordered and integrated. Like English, Mandarin Chinese is also claimed to be SVO (Hsiao and Gibson, 2003; Kuo and Vasishth, 2007). Since ORCs resemble the canonical SVO order but SRCs have VOS, an object preference would be predicted for Mandarin. That means that a canonicity account based on the conventional SVO classification would not speak for a cross-linguistic subject preference but rather confirm Hsiao and Gibson's claim of Mandarin being an exception. As stated earlier, the location of the effects is not easy to determine.
2.3.4 Experience
Considering only RC type frequencies, there is a clear subject bias in most languages. But a mere comparison of SRC and ORC corpus frequencies is a rather abstract method and not psychologically motivated. A more comprehensive theory of experience would be driven by complex factors. Without a granularity commitment, clear predictions are impossible. As discussed in 1.3, the implementation in a connectionist network could shed light on the complex structural relations. MacDonald and Christiansen (2002) used a simple recurrent prediction network to predict individual and global differences in English subject and object relative clauses. The study will be discussed in detail in chapter 3 and shall only be briefly mentioned here. The network was trained on a simplified grammar of English to make word-by-word continuation predictions. It performed better on SRCs. Furthermore, the interactions found in the results are comparable to the reading span × RC type × region interactions in King and Just (1991) (see figure 3.3 in chapter 3 for details). MacDonald and Christiansen call this a frequency × regularity interaction, because the word-by-word predictions for the SRC benefited from its regularity, specifically its similarity in word order with main clauses. Thus, in this case the experience account can be seen as equivalent to the canonicity account, which makes the same predictions for the same reason. Here the connectionist implementation justifies the canonicity approach on the basis of experience. But this connection is not inherent. The experience account, specifically when implemented as a connectionist network, may well make divergent predictions with respect to canonicity assumptions. Regarding Mandarin RCs, Hsiao and Gibson (2003) are confident that regularity predicts an object preference. They also say that "[...] it remains an open question how to formalize this theory so that it makes more detailed predictions." (p. 14). Hsiao and Gibson go on to suggest implementing a theory of that kind in a connectionist system like that of MacDonald and Christiansen (2002) to settle the question. The discussion about SRC-like structures and elided subjects will be relevant for such a model. The modeling of Mandarin RC processing is the focus of this thesis and will be addressed in chapter 4.
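To make the architecture concrete, the following is a minimal Elman-style simple recurrent network (SRN) trained to predict the next word, the kind of model MacDonald and Christiansen (2002) used. The vocabulary, toy corpus, layer size, and training regime here are invented stand-ins for illustration, not the materials of the original study:

```python
import numpy as np

# Minimal Elman-style SRN for next-word prediction. The hidden state from
# the previous word serves as the "context" layer, as in classic SRNs.
rng = np.random.default_rng(0)

vocab = ["boy", "girl", "sees", "chases", "."]
idx = {w: i for i, w in enumerate(vocab)}
V, H = len(vocab), 10

# Invented toy corpus of grammatical word sequences.
corpus = [["boy", "sees", "girl", "."], ["girl", "chases", "boy", "."]] * 50

W_xh = rng.normal(0, 0.5, (V, H))   # input -> hidden
W_hh = rng.normal(0, 0.5, (H, H))   # context (previous hidden) -> hidden
W_hy = rng.normal(0, 0.5, (H, V))   # hidden -> output

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

lr = 0.1
for epoch in range(30):
    for sent in corpus:
        h = np.zeros(H)                      # context units reset per sentence
        for cur, nxt in zip(sent, sent[1:]):
            x = np.zeros(V)
            x[idx[cur]] = 1.0
            h_new = np.tanh(x @ W_xh + h @ W_hh)
            y = softmax(h_new @ W_hy)
            # One-step backprop with copy-back context (no full BPTT),
            # in the spirit of classic SRN training.
            t = np.zeros(V)
            t[idx[nxt]] = 1.0
            dy = y - t
            dh = (W_hy @ dy) * (1 - h_new ** 2)
            W_hy -= lr * np.outer(h_new, dy)
            W_xh -= lr * np.outer(x, dh)
            W_hh -= lr * np.outer(h, dh)
            h = h_new

def predict(prefix):
    """Return the network's next-word distribution after reading prefix."""
    h = np.zeros(H)
    for w in prefix:
        x = np.zeros(V)
        x[idx[w]] = 1.0
        h = np.tanh(x @ W_xh + h @ W_hh)
    return softmax(h @ W_hy)

dist = predict(["boy", "sees"])
print(vocab[int(np.argmax(dist))])  # the trained continuation of the toy corpus
```

Prediction error on such a network (e.g. cross-entropy against the actually occurring next word) is what gets mapped onto word-by-word reading difficulty; structures whose continuations the network has learned well, such as frequent and regular word orders, incur lower error.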
2.3.5 Other Explanations

Active Filler Strategy

The Active Filler Strategy (Frazier and Flores d'Arcais, 1989; Frazier and Clifton, 1989) accounts for difficulties and garden-path effects in the region between a filler and its gap. As soon as a filler is identified, the parser starts an active search for the appropriate gap. Intervening potential gap positions produce resource-consuming ambiguities. Following the strategy, in a relative clause the parser would try to insert the head noun as filler immediately after the relativizer that, because this is the first possible gap position. That is a successful strategy in the SRC but results in the need for reanalysis in ORCs. Therefore, the Active Filler Strategy predicts a higher processing cost on the noun and verb in ORCs due to reanalysis.

It is not clear what the Active Filler Strategy would predict for head-final relative clauses. Since the filler follows the potential gap positions, the gap search has to happen in a post-processing stage or perhaps by re-reading the embedded clause. Another question is then whether the search proceeds from the beginning of the RC or backwards, starting from the head noun. A reasonable assumption could be that the Active Filler Strategy would not apply at all when the whole phrase has already been seen. Other structural strategies like the one proposed by Lin and Bever (2006b) probably fit better into a head-final situation. The question of subject/object preference is then handed over to accessibility concerns.
Accessibility

A theory of accessibility comes in several versions. What is shared between them, and important for predictions concerning the subject/object difference, is that subjects are more easily accessible than objects. For example, Keenan and Comrie (1977) introduce an accessibility ordering for grammatical functions like the following:

subject > direct object > indirect object > . . .

This hierarchy is based on observed preferences for relativized NPs in a number of languages. The explanation is that subjects are more obligatory for predicates than objects are, and are therefore more predictable.
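Read as a partial order over grammatical functions, the hierarchy yields a trivial decision procedure for which relativization is predicted easier. A minimal encoding (only the functions named above are included):

```python
# Keenan and Comrie's (1977) accessibility hierarchy as an ordered list;
# an earlier position means more accessible, hence easier to relativize.
HIERARCHY = ["subject", "direct object", "indirect object"]

def easier_to_relativize(a, b):
    """True if relativizing on function a is predicted easier than on b."""
    return HIERARCHY.index(a) < HIERARCHY.index(b)

print(easier_to_relativize("subject", "direct object"))  # True
```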
Lin, Fong, and Bever (2005) and Lin and Bever (2006b) suggest that the subject position is higher up in the syntactic tree structure than the object position. They propose an incremental minimalist parser (IMP) that performs a top-down search through the tree from the filler to the gap. The search starts at the head noun and proceeds downward, looking for a c-commanded trace. Since subjects are higher in the tree, this mechanism makes sure that subject traces are always accessed first, irrespective of filler-gap distances or specific word orders: "This top-down searching mechanism overrides the effect of NP recency (i.e. linear locality), and passive complexity (i.e. canonicity)." (Lin et al., 2005; p. 11). Since an Accessibility Theory is independent of word order and locality, it would predict a cross-linguistic preference for subject extractions.
The top-down gap-searching mechanism makes the same predictions for pre- and post-nominal RCs. Consequently a subject preference is also predicted for Mandarin RCs. The facilitation effect in subject extraction would occur on the head noun, where the gap-searching mechanism is initiated. Supporting evidence for this structural account is provided by the PRC experiment by Lin et al. (2005) reported above: the easiest condition was the one where the gap is in subject position. There was no significant difference on the head noun between the other two conditions.
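The effect of the top-down search can be sketched with a toy tree walk: because the subject attaches higher than the object, a depth-first walk from the root reaches a subject gap before an object gap, whatever the linear distance to the filler. The tree encoding and node labels below are a simplification of my own, not Lin et al.'s actual parser:

```python
# Toy clause structure: the subject NP sits above the VP, so a top-down
# depth-first, left-to-right walk visits the subject gap position before
# the object gap position, regardless of linear word order.
clause = ("IP",
          ("NP", "subject-gap"),      # structurally higher position
          ("VP",
           ("V", "verb"),
           ("NP", "object-gap")))     # structurally lower position

def first_gap(node):
    """Return the first gap label found in a top-down depth-first walk."""
    if isinstance(node, str):
        return node if node.endswith("-gap") else None
    for child in node[1:]:
        found = first_gap(child)
        if found:
            return found
    return None

print(first_gap(clause))  # the subject gap is reached first
```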
Perspective Shift

A more pragmatic explanation is provided by the theory of perspective shift (MacWhinney and Pleh, 1988; MacWhinney, 1982; 1977). The comprehender preferentially adopts the perspective of a sentential subject. Consequently, when the subject changes, the comprehender has to shift his or her perspective. An object relative clause demands a perspective shift from the main clause subject to the RC's subject and then back to the main clause subject after completion of the embedded structure. In subject relatives the subject is the same for both clauses and hence no shifting is required. Perspective shifting demands processing resources. That makes ORCs costlier to process, because two shifts are necessary there while none are needed in SRCs. A slowdown in ORC reading is predicted on the embedded NP (first shift) and the main verb (second shift).
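The predicted cost difference amounts to counting perspective changes over the sequence of clause subjects the reader adopts. The sequences below paraphrase the English SRC/ORC examples discussed above and are my own encoding of the account:

```python
def count_shifts(perspectives):
    """Count how often the adopted perspective changes across the sentence."""
    return sum(1 for a, b in zip(perspectives, perspectives[1:]) if a != b)

# English SRC "the reporter that attacked the senator admitted the error":
# the perspective stays on the main clause subject throughout.
src = ["reporter", "reporter", "reporter"]

# English ORC "the reporter that the senator attacked admitted the error":
# shift to the embedded subject and back to the main clause subject.
orc = ["reporter", "senator", "reporter"]

print(count_shifts(src), count_shifts(orc))  # 0 vs. 2 shifts
```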
The perspective shift account predicts a subject preference in most languages with post-nominal RCs. For pre-nominal RCs in a language like Mandarin the pattern would change. In Mandarin ORCs the subject of the RC is in initial position, followed by a verb. Before the head noun is seen, a main clause is predicted. Thus one shift is necessary to change perspective from the embedded noun to the head noun, which is now the subject. As for Mandarin SRCs, the locus of perspective depends on the predicted structure when reading the 'V N' sequence. Predicting a gapped structure, i.e., being aware of the SRC, would create the expectation of a subject head noun. This would not require any shifting. However, what happens when the gap is not recognized? Could perspective possibly center on the RC object, which is the only NP available, wrongly interpreting it as an object? If so, reading an SRC would also require one shift, namely from the embedded NP to the sentential subject. In the number of shifts required, this makes SRCs as hard as ORCs. In addition to the perspective shift, a reanalysis would be expected in the SRC. The answers to these questions concerning perspective shift in Chinese depend on the mechanism guiding the reader's perspective when the subject is ambiguous or absent. To sum up, the perspective shift account could probably account for a subject advantage in Mandarin, but this is not clear. If so, an effect is expected on the head noun.
2.3.6 Summary

Table 2.3 shows an overview of the theories addressed here and their predictions regarding English and Mandarin RCs. All mentioned theories agree on a subject preference for English. However, a heterogeneous picture appears on the Mandarin side. There is a slight bias in the prediction pattern in favor of a subject preference, which would integrate nicely into an otherwise universal consistency. Accessibility, Expectation, Perspective Shift, and pure RC type frequency predict a clear subject preference, whereas Canonicity, Integration Cost, and Storage Cost under the Gap Assumption predict an ORC advantage. Storage Cost under the Elided Subject Assumption would predict a subject advantage on the RC region and an object advantage on the head noun. The predictions of the Active Filler Strategy are unclear. As for experience, the predictions are not clear due to the granularity problem. A connectionist implementation, as follows in chapter 4, is believed to make more specific predictions. Anticipating the results, the simulations predicted a weak ORC preference, which appeared, however, only at the relativizer. Accounting for the corpus data by Kuo and Vasishth (2007) even caused a subject preference in the RC region. To assess the compatibility of the theories just discussed with empirical data, the next section reports important studies on the subject/object difference in Chinese.
2.4 The RC Extraction Preference in Mandarin

Hsiao and Gibson (2003)
The self-paced reading study by Hsiao and Gibson (2003) was the first to report results addressing the subject/object difference in Chinese. It had great impact on the discussion about the universality of the subject preference across languages, because Chinese was the first exception discovered. Hsiao and Gibson studied singly and doubly embedded Mandarin relative clauses like the ones in examples (2.2) and (10) in a self-paced reading task. For single embedding they found an advantage for ORCs on the region before the relativizer (N1 V1 / V1 N1). For the double-embedded RCs the relevant regions were the 3rd and 4th word (de1 N2 / N1 de1) and the 5th and 6th word (V2
de2 / N2 de2). On both regions an object advantage was measured. Both singly and doubly embedded RCs show an object preference.

Theory              E-S    C-S    C-O
Canonicity           √             √
IC                   √             √
SC + Gap             √             √
Accessibility        √      √
Expectation          √      √
Perspective          √      √
RC Frequency         √      √
Active Filler        √      ?      ?
SC + Elided Subj.    √      √     (√)
Experience           √     (√)    (√)

Table 2.3: Extraction preference predictions for English and Mandarin RCs. E-S = English subject preference. C-S and C-O = Chinese subject or object preference, respectively. IC = Integration Cost, SC = Storage Cost.

According to Lin (2007) a garden
path effect is expected in both RC types due to an initial misinterpretation as a main clause. A reanalysis should take place at the relativizer/head noun region, especially in the ORC, leading to higher reading times. Interestingly, Hsiao and Gibson's data do not show such an effect. Nevertheless, Hsiao and Gibson argue for an initial misanalysis in the ORC but not in the SRC. Under these premises their result is consistent with the Storage Cost account when interpreted in terms of the Gap Assumption: in the SRC more heads are predicted, since an RC is expected, while in the ORC fewer heads are predicted according to a main clause interpretation. Hsiao and Gibson assume that no reanalysis is necessary because the already constructed main clause structure in the ORC does not have to be modified to be attached as a relative clause. Concerning Integration Cost, the results do not support the theory. As there is no significant effect on the head noun, the results do not show the predicted Integration Cost due to a longer filler-gap distance in the SRC. A naive application of the canonicity account would fit the data well. The more canonical ORC structure, which resembles the main clause word order, seems indeed easier to process. However, as Hsiao and Gibson state themselves, it remains unclear what canonicity and structural frequency accounts predict without reasonable evidence for what counts as canonical or what structural level feeds into frequency calculations.
There are five major issues in Hsiao and Gibson's study that can be criticized.
a) Animacy Hsiao and Gibson only used animate NPs in their stimuli. Kuo and Vasishth's corpus study (reported in 2.2) revealed that hardly any RC-like structures actually contain two animate NPs. In cases with an animate embedded noun, the most frequent continuation contains an inanimate head noun. Only 2% of the SRC-like structures (V N1 de N2) were SRCs with two animate nouns. Of the ORC-like structures found (N1 V1 de N2), about 39% were ORCs, and about 93% of these involved an inanimate head noun, leaving only 7% to the structures Hsiao and Gibson studied. This has two consequences. First, the stimuli used by Hsiao and Gibson (2003) are not natural relative clauses but rarely occurring constructions that most readers never come across. This may cause a confound in the results. Second, the overall unexpected animacy of the head noun would induce a surprisal effect at this position, which has not been found by Hsiao and Gibson.
b) Clause Type vs. Embedding Confound in Double-embedding As mentioned in 1.2 and 2.2, double embedding introduces special complexities which go beyond the word order differences of single embedding. In particular, Mandarin SRC recursion produces a center-embedding structure while object-extracted embedding results in serial dependencies. This makes multiply embedded ORCs easier than SRCs. Hence, effects discovered in multiple embedding might be due to a difference in complexity which does not exist in single embedding. Lin and Bever (2006b) interpret this as a confound, questioning the contribution of double-embedding studies, whereas Kuo and Vasishth (2007) do not agree with that view.
c) Gap Assumption Hsiao and Gibson's explanations for their results rest on the Gap Assumption. However, the evidence provided by the corpus study of Kuo and Vasishth makes that assumption appear unlikely. Relative clauses make up only about 20% of structurally similar occurrences. Judging by corpus frequencies, a gapped RC structure seems to be the least probable expectation when reading either of the two RC types.
d) Syntactically Ambiguous Verbs Lin and Bever (2006b) criticize Hsiao and Gibson's study for using verbs that were unbalanced regarding their syntactic arguments: in addition to direct objects, 7 of the verbs took sentential complements and 13 took verbal complements.
e) Inconclusive RC Region Lin and Bever (2006b) claim that the pre-relativizer region is inconclusive with respect to differences concerning relative clauses because the reader is not yet aware of the RC at that point. Lin and Bever attribute the difficulty in SRCs to the missing subject in an allegedly regular sentence. However, it is not so clear whether it is right to call this a confound rather than a possible explanation for processing difficulties in the RC. Attributing specific effects to relative clause processing does not require the reader to know what structure he or she is actually reading.
Lin and Bever (2006a)

Addressing the ambiguous-verbs confound in Hsiao and Gibson (2003), Lin and Bever (2006a) conducted a self-paced reading experiment with verbs that only took nominal objects. They controlled for RC type and also for whether the RC modified the subject or the object of the matrix clause. In all conditions there was a subject preference on the relativizer and the head noun. No effect was observed on the RC region. The result contradicts the study of Hsiao and Gibson, as the overall preference is the opposite. Strictly speaking, however, it is not necessarily a contradiction, since the locations of the effects do not overlap in the two experiments. It is therefore theoretically possible that both results are consistent.
Lin and Garnsey (2007)

The reading time study by Lin and Garnsey (2007) provided evidence that ORCs are easier to comprehend. In addition, they showed that animacy information is an important factor in the comprehension process and is used very early by the reader. Their stimuli were Mandarin RCs with another noun following the head noun. The head noun could optionally be omitted. When the head noun was dropped, the second noun could ambiguously be interpreted as the head noun. The confusability of the two nouns was controlled for by animacy. The plausibility of an animate RC head noun compared to an inanimate one was increased by semantic implications of the embedded verb. The results show that animacy information was used immediately to resolve the ambiguity in both conditions (with and without head noun). Subject extraction was more difficult than object extraction in all conditions. The conditions with missing head nouns were the most difficult. When the nouns were confusable, the differences between subject and object extraction were also found in regions after the head noun, pointing to an interaction of RC type with similarity-based interference (Gordon et al., 2006).
Kuo and Vasishth (2007)

Kuo and Vasishth (2007) conducted a self-paced reading experiment that used the singly embedded stimuli of Hsiao and Gibson (2003). Additionally, Kuo and Vasishth added two more conditions to further assess the validity of the Gap Assumption as adopted by Hsiao and Gibson (2003) and of the Storage Cost predictions. The two extra conditions are shown in example (14). In (14b) the ORC is fronted by the passivization marker bei, which is ungrammatical in front of a main clause out of context. Thus, inserting bei removed the main clause ambiguity in the ORC. On the other hand, inserting the demonstrative zheige pre-verbally in the SRC (example 14a) makes the subject gap obvious. The demonstrative in combination with the verb raises the expectation for a relative clause, because a noun is needed to fill the gap. Hence the possibility of the structure continuing as a main clause with an elided subject is excluded.
35
Chapter 2 Issues <strong>in</strong> Relative Clause Process<strong>in</strong>g<br />
(14) a. Mandarin SRC with demonstrative:
[Zheige [e_i yaoqing fuhao de_i] guanyuan_i] xinhuaibugui.
this-CL invite tycoon GEN official have-bad-intentions
'The official who invited the tycoon has bad intentions.'
b. Passivized Mandarin ORC:
[Bei fuhao yaoqing e_i de_i] guanyuan_i xinhuaibugui.
BEI tycoon invite GEN official have-bad-intentions
'The official who the tycoon invited has bad intentions.'
If the Gap Assumption is correct and the reader is aware of the gap in the SRC, no
difference in reading time is predicted between the demonstrative condition and the one
without zheige. As for the ORC, inserting the passivization marker should increase the
difficulty in the RC region if the Storage Cost predictions are correct. The reason
is that more syntactic heads would be stored in memory when an RC is predicted rather
than a main clause, as assumed by Hsiao and Gibson. In short, under Hsiao and Gibson's
assumptions the zheige condition should have no effect, whereas the bei condition should
increase the difficulty in the ORC. The results showed an overall subject preference
in the total reading times. The preference was mainly found on the relativizer and
the head noun (de N2). No effect was found on the RC region before the relativizer.
This is consistent with Lin and Bever (2006a). On the relativizer (and on the preceding
region, but not significantly) the SRC condition without the determiner (SR-no-det) was
easier than the condition with the determiner (SR-det). This is not consistent with the
Gap Assumption, but it is consistent with the Elided Subject Assumption: the Storage Cost
hypothesis under the Elided Subject Assumption would predict fewer syntactic heads in the
SR-no-det condition, since not an RC but a main clause is expected. This confirms the
prediction of the corpus frequency data. The initial misinterpretation in the SR-no-det
condition is supported by another finding: on the regions following the head noun,
increased difficulty was observed in the SR-no-det condition. This could be caused by
reanalysis following the misinterpretation as a main clause. A difference between the
OR-bei and OR-no-bei conditions was only observed on the relativizer; in particular,
the no-bei condition was harder. This points to a reanalysis process due to initially
misinterpreting the ORC as a main clause. However, no effects pointing to a misanalysis
were observed in the pre-relativizer region.

To summarize, the ORC data of Kuo and Vasishth's study do not support the Storage
Cost Account, since being aware of an upcoming ORC did not increase difficulty in
the RC region. However, evidence for an initial misinterpretation was found in the
reanalysis effect on the relativizer. The SRC data support a Storage Cost Account only
under the Elided Subject Assumption, meaning that readers initially interpret SRCs as
main clauses with an elided subject. Altogether, a subject preference effect was found,
located on the relativizer and head noun region, as in Lin and Bever (2006a).
2.4 The RC Extraction Preference in Mandarin
Qiao and Forster (2008)

The divergent results from the studies discussed so far are not necessarily contradictory,
because no exactly opposite effects were found in comparable regions. Rather, some
studies found effects where others did not. Across the studies, a subject
preference was only found at the relativizer and head noun region, while an object
preference was mainly observed on the region before the relativizer in the embedded RC.
Qiao and Forster (2008) claim that both findings are consistent. They argue that the readers
in Hsiao and Gibson's and Kuo and Vasishth's experiments adopted different strategies
that led to the contradicting results. According to Qiao and Forster, two strategies are
possible in SPR experiments: a) a "wait-and-see" strategy, in which readers do not commit
to a specific structure early in the sentence, and b) a more careful processing of the RC.
Effects that may be delayed under the first strategy would show up under the second. In
particular, under the careful strategy the SRC structure should be recognized, causing higher
difficulty. This would, on the other hand, decrease the difficulty at the relativizer and
the head noun because no reanalysis takes place. Qiao and Forster used the Maze Task 2
(Forster, Guerrera, and Elliot, 2008), which forced readers to adopt the more careful
strategy. The results showed an object preference in the relative clause (as in Hsiao
and Gibson, 2003) and a subject preference at the relativizer (consistent with Kuo and
Vasishth, 2007). Overall, a slight advantage for the ORC was found. The explanation
offered by Qiao and Forster is that in Kuo and Vasishth's study readers adopted
the wait-and-see strategy, which avoided the difficulties in the SRC. This means that
readers do not really predict a main clause during the online reading of an SRC, as stated
by the Elided Subject Assumption; rather, readers do not make any prediction at all.
Considering that the subject preference on the relativizer was not significant by items,
the results are consistent with Hsiao and Gibson's data. It is, however, not clear why
the participants in Hsiao and Gibson's study adopted a different parsing strategy than
the readers in Kuo and Vasishth's study, when both used the same method.
Summary

Many aspects of Hsiao and Gibson's initial study were subject to critique. Subsequent
results were equally distributed between object and subject preference. See
table 2.4 for a summary of the results on the RC extraction preference in Mandarin
Chinese. The table also shows which of the discussed theories are consistent with the
studies. Storage Cost under the Elided Subject Assumption (SC+ES) is not consistent
with any study. Recall that the theory would predict no effect in the RC region and an
object advantage on the relativizer and head noun. This does not fit any of the empirical
findings. The Active Filler Strategy is omitted in the table because it does not make
clear predictions for head-final RCs.

2 In the Maze Task the reader has to choose between two words at each point in the sentence. Only
one of the two words is a grammatical continuation. The reader is thus forced to predict a complete
structure.
             HG03    LB06    LG07    KV07    QF08
Preference    O       S       O       S      O (S)
Region        RC     de N     RC     de N   RC (de)

[The per-study consistency checkmarks for the theory rows (Canonicity, IC, SC+GA,
Access., Expectation, Persp.+ES, Frequency, SC+ES, Experience) could not be recovered
from the source layout; SC+ES receives no checkmark in any column, and Experience is
marked as partially consistent, (√), in three columns.]
Table 2.4: Studies of the RC extraction preference in Mandarin and their consistency
with the discussed theories. HG03 = Hsiao and Gibson (2003), LB06 = Lin and Bever
(2006a), LG07 = Lin and Garnsey (2007), KV07 = Kuo and Vasishth (2007), and
QF08 = Qiao and Forster (2008). IC = Integration Cost, SC = Storage Cost, GA =
Gap Assumption, and ES = Elided Subject Assumption.

I turn now to the second phenomenon to be addressed in this thesis: effects of forgetting
during the processing of complex nested structures.
2.5 Forgetting Effects

2.5.1 The Grammaticality Illusion

Complex nested structures like center-embedded relative clauses are very difficult to
process. Grammaticality rating studies show that these structures are often judged as
ungrammatical. Memory-based theories (Gibson, 1998; 2000; Just and Carpenter, 1992;
Lewis and Vasishth, 2005; Lewis et al., 2006) explain this by the excessive capacity load
evoked by a number of unbounded dependencies that have to be held in memory. The
DLT (Gibson, 2000) predicts parsing slow-downs due to the storage of complex predictions
and decay processes in distant dependencies. Capacity limitations are commonly seen
as cross-linguistic constraints that underlie all sorts of language processing. Hence the
predictions of memory-based theories are language-independent. However, a study by
Vasishth et al. (2008) casts doubt on that claim's validity. Their experiment suggests
that the robustness of memorized representations and related decay effects may well
depend on language-specific grammatical properties. The experiment concerned
the so-called grammaticality illusion in ungrammatical center-embedded structures.

Example (15) shows a sentence pair discussed in Frazier (1985). (15a) is a grammatical
sentence containing a doubly embedded ORC. The center-embedding produces
three consecutive verb phrases (VPs) completing the three clauses from the innermost
to the outermost. In (15b) the second verb phrase was cleaning every week is dropped,
which makes the sentence ungrammatical. I will call the condition in (15b) the drop-V2
condition.

(15) a. The apartment that the maid who the service had sent over was cleaning
every week was well decorated.
b. * The apartment that the maid who the service had sent over was well
decorated. 3

The surprising observation (attributed to Janet Fodor) was that the ungrammatical
sentence (15b) does not only appear grammatical to English readers; most readers
even judge it better than the grammatical version of the sentence. The finding gained
support from an acceptability rating study by Gibson and Thomas (1999). For their
study Gibson and Thomas used stimuli that either contained all three VPs or lacked one
of them. The results showed that omitting the second VP caused readers
to rate the sentence as acceptable as the grammatical one containing all VPs. A further
rating experiment conducted by Christiansen and MacDonald (1999) even showed
a higher acceptability for the drop-V2 condition than for the grammatical condition.
The qualitative difference from Gibson and Thomas' study is explainable by the method
Christiansen and MacDonald used. They carried out a so-called "stop-making-sense"
task, which is self-paced word-by-word reading with periodic requests for a grammaticality
rating. The SPR task prevents the participants from re-reading the sentence. This
kind of quasi-online measure may be the cause of the lower rating of the grammatical
but complex center-embedding. As an explanation for the grammaticality illusion, Gibson
and Thomas (1999) propose that the high memory load causes the reader to forget
the second NP (the maid) and with it the prediction of the second VP (was cleaning
every week). Gibson and Thomas basically offer two hypotheses: a) the high memory
cost pruning hypothesis and b) the recency/primacy account. The two approaches
are restated by Vasishth et al. (2008) as the VP-forgetting Hypothesis and the
NP-forgetting Hypothesis, respectively.

a) The VP-forgetting Hypothesis The original High Memory Cost Pruning Hypothesis
rests on the assumptions of the SPLT (Gibson, 1998), the predecessor of the DLT. The
major proposition as stated by Gibson and Thomas (1999) is the following:

(16) The high memory cost pruning hypothesis:
At points of high memory complexity, forget the syntactic prediction(s) with the
most memory load.

3 By convention an asterisk (*) indicates ungrammaticality of a sentence.
According to Gibson and Thomas, exceeding a theoretical memory capacity limit through
excessive load causes a loss of costly predictions. A successful parse is possible as long as
memory demands throughout the sentence stay within a certain capacity range. However,
when high complexity causes the load to exceed the limit, a breakdown of the
parser has to be prevented by pruning activation. Given the discrete nature of the
SPLT, this means that the predictions of certain syntactic categories have to be dropped.
The pruning hypothesis assumes that the predictions to be forgotten are those causing
the biggest part of the SPLT memory cost at the current point in the sentence. In example
(17) the point of highest memory cost is the most deeply embedded subject the clinic (NP3).
At this point two predictions are held in memory: VP2 predicted by NP2 and VP3
predicted by NP3. Since VP2 is further up in the sentence and has to be held in memory
longer than the subsequent VP3, it causes more memory cost. Consequently, the
prediction of the second VP gets pruned and is therewith forgotten.

(17) a. [The patient]_NP1 who_i [the nurse]_NP2 who_j [the clinic]_NP3 [had hired e_j]_VP3
[admitted e_i]_VP2 [met Jack]_VP1.
b. * [The patient]_NP1 who_i [the nurse]_NP2 who_j [the clinic]_NP3 [had hired e_j]_VP3
[met Jack]_VP1.
Vasishth et al. (2008) restate the pruning hypothesis in terms of decay as defined in
the DLT (Gibson, 2000) and refer to it as the VP-forgetting Hypothesis. Vasishth et al.
calculate Integration and Storage Cost at the three VPs to determine the "point of greatest
difficulty" in the sentence. The DLT cost predictions for example (17) are illustrated
in figure 2.5. At the first VP (VP3) two integrations take place: the object the nurse,
with two intervening discourse referents (clinic and hired), and the subject the clinic, with
one intervening discourse referent (hired), are integrated. At this moment two
active predictions are held in memory: the predicate of the upper RC (admitted), triggered
by reading nurse, and the main verb. This makes a total cost of 4. At the second verb
(admitted) the object the patient and the subject the nurse are integrated. The patient
has a distance of four discourse referents (nurse, clinic, hired, and admitted) from the
verb, the subject nurse is separated by three, and just the matrix verb is predicted. This
makes a total memory cost of 8 at the VP2 site. Finally, on the third VP, integrating
the patient and predicting a direct object incurs a cost of 6. Concluding from the
calculations, VP2 has the highest memory cost and, hence, is forgotten.
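The distance counting behind the Integration Cost values in figure 2.5 can be sketched in a few lines. The following is my own illustrative simplification, not Vasishth et al.'s implementation: the cost of integrating a dependent at a verb is taken to be the number of discourse referents introduced after the dependent, up to and including the integrating verb.

```python
# Illustrative sketch (not Vasishth et al.'s code): DLT-style Integration
# Cost for "The patient who the nurse who the clinic had hired admitted
# met Jack."  Cost = number of discourse referents introduced after the
# dependent, up to and including the integrating verb (figure 2.5, C(I)).

# discourse referents (nouns and verbs) in linear order
referents = ["patient", "nurse", "clinic", "hired", "admitted", "met"]

def integration_cost(dependent: str, verb: str) -> int:
    """Referents between `dependent` and `verb`, verb included."""
    return referents.index(verb) - referents.index(dependent)

# at "had hired" (VP3): object `nurse` and subject `clinic` are integrated
print(integration_cost("nurse", "hired"), integration_cost("clinic", "hired"))        # 2 1
# at "admitted" (VP2): object `patient` and subject `nurse` are integrated
print(integration_cost("patient", "admitted"), integration_cost("nurse", "admitted"))  # 4 3
# at "met" (VP1): subject `patient` is integrated
print(integration_cost("patient", "met"))                                              # 5
```

Adding the number of open predictions (the Storage Cost row) to the 4+3 at admitted and the 5 at met Jack reproduces the totals of 8 and 6 discussed above.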
The difference between Vasishth et al.'s and Gibson and Thomas' accounts is that
the latter added Storage Cost on the noun and Integration Cost of the predicted verb,
whereas Vasishth et al. simply use the total cost on the verb. The predictions, however, are
the same. Let me try to reformulate the decay approach more intuitively. The central
measure of the decay approach is Integration Cost. By counting the number of intervening
discourse referents, it is a discrete, indirect measure of time. Or, as Vasishth et al. put
it: it is "a discretized abstraction over some activation decay function that determines
the strength of a memorial representation." Hence, decay could be described as a function
of time and intervening memory load, with the assumption that a high memory load
The patient who the nurse who the clinic had hired admitted met Jack.

         had hired   admitted   met Jack
C(I)        2+1         4+3         5
C(S)         2           1          1
Total        4           8          6

Figure 2.5: DLT memory cost for the three VPs in a doubly embedded ORC.
increases the speed of decay. In our example, high memory load due to non-integrated
discourse referents arises after the source of the VP2 prediction (nurse). This results
in a steeper slope of the decay function, causing the representation of the VP2 prediction
to fall below a certain threshold. The forgetting of the VP2 prediction would account
for the good rating of the ungrammatical condition in the following way: the absence
of the VP in the drop-V2 condition goes unnoticed and causes no surprise. In addition,
the distance between VP1 and its dependent is smaller, which facilitates retrieval. In
the grammatical condition, on the other hand, the occurrence of the unpredicted VP2
causes parsing failure. Comparing both conditions in a reading time study should show
differences at the matrix verb and the following region.
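This intuition can be made concrete with a toy decay function. The sketch below is my own illustration, not part of the DLT: a prediction's activation decays exponentially at a rate scaled by the concurrent memory load, so the long-distance VP2 prediction falls below a retrieval threshold while the short-distance VP3 prediction survives. All constants are arbitrary.

```python
# Toy illustration (not part of the DLT): exponential decay whose rate
# scales with concurrent memory load.  A prediction held over a long
# stretch under high load drops below a retrieval threshold.
import math

def activation(elapsed: int, load: int, base_rate: float = 0.25) -> float:
    """Strength of a stored prediction after `elapsed` time steps
    (e.g. intervening discourse referents) under memory `load`."""
    return math.exp(-base_rate * load * elapsed)

THRESHOLD = 0.3  # arbitrary retrieval threshold

# VP2 prediction (made at `nurse`): three referents elapse before its verb
vp2 = activation(elapsed=3, load=2)
# VP3 prediction (made at `clinic`): only one referent elapses
vp3 = activation(elapsed=1, load=2)
assert vp2 < THRESHOLD < vp3  # VP2 is forgotten, VP3 is retained
```

With these (arbitrary) constants, the same elapsed distance under a lower load would leave the VP2 prediction above threshold, which is the sense in which load is claimed to steepen the decay slope.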
b) The NP-forgetting Hypothesis Gibson and Thomas (1999) and Vasishth et al.
(2008) additionally mention a possible serial order effect on maintaining several NPs in
memory. Evidence from cross-domain studies on human short-term memory (e.g. Henson,
1998; Baddeley, 1997; Lewis, 1996) shows a recency/primacy preference, making
the most recent and earliest items easier to recall. This suggests that the representational
strength of memorized items exhibits a U-shaped pattern with respect to their serial
order, making middle items harder to maintain than the rest. Assuming that a
recency/primacy preference applies to the memorizing of noun phrases, this account leads to
the NP-forgetting Hypothesis as follows. High memory load causes the lowest activated
middle NP (NP2) to be forgotten. This results in retrieval failure at VP2 in the grammatical
condition, whereas in the drop-V2 condition no retrieval can be triggered, since
the respective verb is missing. As a result, the grammatical sentence is perceived as more
difficult. The effects of the NP-forgetting Hypothesis should also occur on the matrix
verb and beyond.
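A U-shaped serial-position curve of this kind can be sketched as the sum of a primacy boost decaying from the first item and a recency boost decaying from the last; the weights below are illustrative, not empirically fitted.

```python
# Illustrative sketch of a U-shaped serial-position curve: primacy and
# recency boosts fall off geometrically from the edges, so of the three
# stored NPs the middle one (NP2) ends up weakest.  Weights are arbitrary.
def strength(pos: int, n: int, primacy: float = 0.5, recency: float = 0.6,
             falloff: float = 0.5) -> float:
    """Representational strength of the item at position `pos` of `n`."""
    return primacy * falloff ** pos + recency * falloff ** (n - 1 - pos)

nps = [strength(p, 3) for p in range(3)]  # NP1, NP2, NP3
assert nps[1] == min(nps)  # the middle NP is the weakest, hence forgotten first
```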
English

Vasishth et al. (2008) assessed NP-forgetting and VP-forgetting by measuring online reading
time in SPR and eyetracking. Besides the grammaticality control, NP-similarity contrasts
as in example (18) were used to detect NP-forgetting effects. In (18a) all three NPs
are highly confusable, whereas in (18b) the second noun is inanimate, which reduces the
similarity. Following Vasishth et al., high similarity (18a) predicts encoding interference
at NP3 and storage and retrieval interference at VP3. If NP2 is not forgotten,
interference effects should also be seen in the rest of the sentence in the high-interference
condition. However, assuming that the representation of the second NP has decayed
by the middle verb at the latest, no further interference effect should occur. Consequently,
the NP-forgetting Hypothesis predicts that differences between the high-interference
condition (18a) and the low-interference condition (18b) should disappear after the first
verb.

(18) a. The carpenter who the craftsman that the peasant carried hurt supervised
the apprentice. (high-interference)
b. The carpenter who the pillar that the peasant carried hurt supervised the
apprentice. (low-interference)
The results of the English SPR and eyetracking experiments showed weak, mostly
non-significant support for the NP-forgetting Hypothesis. Although non-significant, there was a
clear numerical reading time effect of similarity-based interference, which disappeared at
V2, pointing to a forgetting of NP2 that reduced the interference. The VP-forgetting
Hypothesis was fully confirmed, as can be seen in figure 2.6. The drop-V2 condition was
significantly faster at the main verb and the following region, suggesting a forgetting of
the VP2 prediction and possibly additional difficulty at the main verb in the grammatical
condition. In eyetracking there was also a surprising drop-V2 facilitation effect on the
first verb (V3) that is not predicted by the forgetting hypothesis. This, however, is explained
by Vasishth and colleagues as an artifact of complexity-induced re-reading behavior in
the grammatical condition.
German

Experiments identical to those laid out above were also carried out by Vasishth et al. (2008)
in German. An example stimulus pair for the grammaticality manipulation is shown in
example (19). The resulting structure of the German ORC double-embedding is identical to
the one in English except for the commas. The comma issue will be addressed after the
study results for German have been presented. The investigation of the NP-forgetting
Hypothesis yielded results analogous to the English study. Surprisingly, however, the
VP-forgetting Hypothesis was not confirmed. On the contrary, it was the grammatical
condition that showed faster reading times at V1 and post-V1 in both SPR and eyetracking.
The eyetracking results for the German grammaticality manipulation are shown in
figure 2.7.

(19) a. Der Anwalt, den der Zeuge, den der Spion betrachtete, schnitt, überzeugte
den Richter. (grammatical)
'The lawyer who the witness who the spy watched snubbed convinced the judge.'
b. Der Anwalt, den der Zeuge, den der Spion betrachtete, überzeugte den
Richter. (drop-V2)
Figure 2.6: Effect of the grammaticality manipulation in the English eyetracking study
by Vasishth et al. (2008, experiment 2, p. 18): mean reading times and 95% confidence
intervals for the verb regions V3, V2, V1 and the post-V1 region, grammatical vs.
ungrammatical.
German readers do not seem to forget the VP prediction. In fact, they seem to notice
the ungrammaticality of the drop-V2 condition, which leads to increased reading time.
The surprising result points to the assumption that linguistic memory processes are not
language-independent but rather affected by language-specific grammatical properties.
The head-final nature of German (SOV) subordinate clauses causes verbs to appear
clause-finally more frequently than is the case in English, an SVO language. An objection
might be that commas in German facilitate the recognition of a completed clause:
a double-embedding involves a comma after each embedded main verb. Vasishth and
colleagues addressed this issue with a fifth experiment involving English sentences enriched
with commas. However, the comma inclusion did not show any effect. Nevertheless, as
Vasishth et al. note, this result does not exclude the possibility of a comma-based
facilitation. An important fact is that German readers are trained on using commas, while
English readers are not, which suggests that commas were of no use for the English
participants.

I will now investigate potential explanations for a language-specific forgetting effect.
[Figure: mean reading time (ms) per region (V3, V2, V1, Post-V1), grammatical vs. ungrammatical condition]

Figure 2.7: Mean reading times and 95% confidence intervals for the critical regions in the German eyetracking study by Vasishth et al. (2008) (experiment 4). The figure shows the effect of grammaticality.
2.5.2 Explaining the Forgetting Effect

Capacity

Just and Carpenter (1992) explicitly mention the possibility of forgetting certain predictions in the CC-READER model ("forgetting by displacement"). The underlying mechanism is equivalent to the pruning hypothesis of Gibson and Thomas (1999), thus making basically the same predictions as the DLT-based approach described above. As has just been laid out, the VP-forgetting hypothesis is seemingly not cross-linguistically valid. Specifically, the hypothesis has been confirmed for English but disconfirmed for German. A memory-based explanation could account for the language-specific difference in two ways: either by postulating a language-dependent capacity limit or a language-dependent robustness of VP predictions. However, theories like the DLT regard memory processes as universally valid. Thus, assuming that the DLT-based hypothesis applies in both languages could only mean that German readers possess a higher memory capacity than English readers. There should then be evidence for that capacity difference from reading span tasks and, considering working memory capacity as domain-unspecific, there should also be evidence from non-linguistic working-memory-related tasks. Since there is no such evidence for a language-specific memory span, a pure DLT-based hypothesis cannot account for the non-existence of the forgetting effect in German. There have to be additional factors that affect the robustness of the VP2 prediction representation.
The most promising explanation is that processing is affected by certain language-specific grammatical properties. Vasishth et al. (2008) mention two possibilities as to how this could come about: (a) the robustness of the verb representation is directly specified by the same parameters that shape the grammar and hence the production-based corpus regularities; (b) alternatively, the more robust representation could be due to more effective processing caused by reading skill, which is affected by the mentioned corpus regularities and not by the parameters directly. The first possibility is matched by an expectation-based account that directly depends on grammatical properties. A canonicity account would likewise predict SOV structures to be easier in German than in English. The alternative of reading skill is accounted for by an experience-based approach.
Expectation
A door to language-specific effects could be antilocality. Antilocality has been observed predominantly in head-final languages like German (Konieczny, 2000) and Hindi (Vasishth and Lewis, 2006b). The seeming restrictedness of these effects to head-final languages has led to the suggestion that the sentence-final verb in these languages is more strongly expected than in non-head-final languages. However, a recent study by Jaeger et al. (2008) shows antilocality effects in English, which is not head-final. The cross-linguistic explanation is that the expectation for a verb increases with more intervening material: the longer the distance between the dependent and the expected head, the less likely it becomes that even more adjunctive material will intervene before the head. Additionally, in most cases the intervening material narrows down the possible candidates for the head, which lowers surprisal even more. The fact that the associated speed-up at the verb shows slightly different patterns in English and German encourages an expectation-based account of a language-specific forgetting effect. For example, it is conceivable that in head-final languages the prediction is more precise regarding the exact location of the verb, whereas in other languages head-finality is too rare to provide exact verb-location statistics.
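The logic of this account can be made concrete with a toy surprisal computation. The continuation counts below are invented for illustration only (they come from no corpus); they merely encode the claim that, after more intervening material, a verb becomes the more probable next constituent:

```python
import math

# Hypothetical counts of what follows a clause prefix that already
# contains k intervening phrases (invented numbers for illustration).
continuations_after = {
    0: {"verb": 20, "adjunct": 50, "argument": 30},
    1: {"verb": 45, "adjunct": 30, "argument": 25},
    2: {"verb": 80, "adjunct": 15, "argument": 5},
}

def verb_surprisal(k):
    """Surprisal (in bits) of seeing the verb after k intervening phrases."""
    counts = continuations_after[k]
    p_verb = counts["verb"] / sum(counts.values())
    return -math.log2(p_verb)

for k in (0, 1, 2):
    print(k, round(verb_surprisal(k), 2))
```

As the distance grows, the verb's conditional probability rises and its surprisal falls, which is the antilocality speed-up described above.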
Experience
The robustness of representations could be shaped by experience. An experience-based account assumes that the reader adapts processing strategies to frequently occurring structures. As a result, German readers should be more skilled at head-final structures than English readers. An explanation based on coarse-grained corpus frequencies would be equivalent to a canonicity approach: German, being an SOV language, exhibits more head-final structures than English, predicting easier processing. But earlier discussions
have hopefully made clear that these theories rest on weakly justified fundamental commitments. The more comprehensive approach pursued here is experience with as few commitments as possible with respect to grain size and structural frames. In this sense, a connectionist implementation in a model like that of MacDonald and Christiansen (2002) promises to show a systematic difference in the performance on English and German double center-embedding. In fact, Christiansen and Chater (1999) have shown that such a model, trained on center-embedding and right-branching structures, shows better prediction performance on the ungrammatical 2VP embedding than on the grammatical 3VP embedding (see chapter 3 for details). Since right-branching and center-embedding reflect the dependency structure in English SRCs and ORCs, respectively, Christiansen and Chater's simulation shows that connectionist experience models can in fact account for forgetting effects comparable to human data. But would this model show a different performance when trained on a German grammar? This depends on the involved word order regularities. In contrast to English, German SRCs and ORCs both exhibit center-embedded dependencies and hence are verb-final. This is a considerable bias and should have an effect on the model's performance.
To anticipate the simulation results in chapter 4: the word order effects are present but weak. In contrast, the usage of commas seems to have a greater impact.
Chapter 3
Connectionist Modelling of Language Comprehension

3.1 Structure and Learning
Connectionist networks are prototypical exposure-based models. More precisely, they are the implementation of non-committed exposure-based accounts. "Non-committed" here means accounts without any specific assumptions about structural levels or grain sizes, nor about the linking between corpus regularities and behavior. In the literature there does not seem to be agreement about which models to call connectionist. It must be mentioned that there are, on the one hand, hybrid models that use parallel distributed activation spreading between symbolic entities (e.g. Just and Carpenter, 1992; Lewis and Vasishth, 2005) and, on the other hand, connectionist models that use hand-designed architectures and local representations (e.g. Dell et al., 2002; McClelland and Elman, 1984; Rohde, 2002). I am concerned here only with "fully connectionist models" using fully distributed representations and no pre-designed internal structuring.

The most important feature that distinguishes a connectionist network model of that kind from symbolic models is its architecturally constrained, highly adaptive learning ability. Connectionist models are functional problem-solving machines that, depending on the specific learning algorithm and certain architectural properties, are able to find an optimal solution to any task representable as input-output pairs. The design of symbolic models mostly involves many assumptions about the desired processes, which are hard-coded into the system. For example, it has to be specified how to categorize and represent the input. A connectionist system, on the other hand, starts from zero, without any presumptions. The structure of the internal input representation is shaped during the learning process depending on the task requirements. Obviously, the structural information that linguists annotate on word strings of natural language is already there in the plain strings. Extracting the underlying structure of the input requires information about sequential and temporal relations between input chunks. For that reason, time is an important component of cognitive tasks. Language in particular transports highly structured information while being entirely sequential. A memory of earlier input and the representation of temporal relations between input chunks provide much contextual information that helps to interpret the current input. The context of an utterance has a great influence on ambiguity resolution and predictions of content.

There have been some accounts of providing connectionist networks with a temporal representation of an explicit nature, which posed limits to the number and richness of representations. Elman (1990) describes a simple way to provide a connectionist network with memory, called a simple recurrent network (SRN, figure 3.1). The hidden representations in the network are copied into a so-called context layer, which influences the hidden representations in the next step through weighted activation feeding. This memory loop goes without any explicit representation of time or of relations between input chunks. It is the iterative procedure of copying and back-feeding itself that produces temporal relations on an implicit level. Because every copy of the activation pattern has been influenced by all earlier copies, the contextual memory reaches into the "past" in a continuously graded way over several input steps. The information of earlier input representations is still in the representation as a trace, but newer input has more influential power. Elman (1990) writes:

“In this account, memory is neither passive nor a separate subsystem. One cannot properly speak of a memory for sequences; that memory is inextricably bound up with the rest of the processing mechanism.”

This very simple way of supplying memory yields architecturally determined, plausible properties that can abstractly be described as storage limitations, memory span, or decay of memorized representations over time. These are properties explicitly accounted for in symbolic models like ACT-R or CC-READER.
[Figure: input, context, hidden, and output units of the SRN]

Figure 3.1: Architecture of a simple recurrent network (SRN; Elman, 1990). The solid line represents fixed one-to-one connections to the context layer. Dashed lines represent trainable connections.
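The copy-and-feed-back loop just described can be sketched in a few lines. This is a minimal illustration with invented layer sizes and random, untrained weights, not a reconstruction of Elman's original setup:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical layer sizes: 5 input, 8 hidden, 5 output units.
n_in, n_hid, n_out = 5, 8, 5

# Trainable weights (the dashed lines in figure 3.1).
W_ih = rng.normal(scale=0.1, size=(n_hid, n_in))   # input -> hidden
W_ch = rng.normal(scale=0.1, size=(n_hid, n_hid))  # context -> hidden
W_ho = rng.normal(scale=0.1, size=(n_out, n_hid))  # hidden -> output

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def run_sequence(inputs):
    """Process a sequence step by step; the context layer is a one-to-one
    copy of the previous hidden activations (the solid line in figure 3.1)."""
    context = np.full(n_hid, 0.5)  # initial context, as in Elman (1990)
    outputs = []
    for x in inputs:
        hidden = sigmoid(W_ih @ x + W_ch @ context)
        outputs.append(sigmoid(W_ho @ hidden))
        context = hidden.copy()    # the copy itself is fixed, not learned
    return outputs

# Three one-hot "words" as a toy input sequence.
seq = [np.eye(n_in)[i] for i in (0, 2, 1)]
preds = run_sequence(seq)
print(len(preds), preds[0].shape)  # prints: 3 (5,)
```

Training would adjust W_ih, W_ch, and W_ho by backpropagation on a word-prediction task; the hidden-to-context copy always keeps its fixed weight of 1.0.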
The architecturally emergent nature of the task solution is also often the subject of criticism concerning connectionist models. The argument goes that the explanatory value of connectionist models is low since there is no way to know what the network has really learned. Hence, these models do not provide any insight into parsing mechanisms or memory representations. Addressing that issue, Elman (1990) showed that an SRN in a sense learns the lexical classes as we use them. For that purpose, Elman trained the network on a simple word prediction task that involved simplified natural sentences such as Noun-Verb or Noun-Verb-Noun sequences. The comparison of hidden unit activation vectors yielded a hierarchical clustering of words into natural classes like verbs and nouns, which were sub-clustered by transitivity and animacy, respectively. The categorial clustering of representations depends directly on the "behavioral" relations of the words: words often occurring in similar environments end up with similar representations in the hidden layer activation. When an input word is replaced after training by a new word that the network has never seen, the network will internally represent that word in the same way as the replaced word, since it behaves in the same way.

An advantage of distributed representations with a continuous activation range is that they theoretically provide infinite memory capacity. That, however, does not mean that connectionist systems are able to store infinitely many equally treated representations, as would be the case in theoretical symbolic models with infinite capacity. Representations of higher grain size, i.e. involving a higher activation range, have more influential power with respect to the network's behavior than more fine-grained activation patterns do. So what grades memory (and hence shapes memory capacity) is the relative importance of the stored representations. This relates to the type-token issue. As explained above, connectionist networks are able to discern types like lexical categories, but they do not ascribe any meaning to single tokens. However, that does not mean that connectionist networks do not functionally distinguish between different tokens. As Elman also showed, tokens of the same type are represented very similarly but with subtle differences. Notably, these differences exist not only between tokens but also within tokens, distinguishing usages of a token in slightly differing contexts. In this way a token is a fuzzy category on a continuous scale. This makes sense philosophically, since the boundary between type and token is often a matter of point of view, which is why tokens, too, can often be analyzed as having an internal structure.
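The comparison of hidden unit activation vectors that underlies this kind of analysis can be sketched with cosine similarity (Elman himself used hierarchical clustering over such vectors). The vectors below are invented stand-ins for averaged hidden activations, not output of a trained network:

```python
import numpy as np

# Invented averaged hidden activations for four words (stand-ins for
# vectors that would be recorded from a trained SRN's hidden layer).
hidden = {
    "dog":   np.array([0.90, 0.10, 0.20, 0.80]),
    "cat":   np.array([0.85, 0.15, 0.25, 0.75]),
    "eat":   np.array([0.10, 0.90, 0.70, 0.20]),
    "chase": np.array([0.15, 0.85, 0.75, 0.25]),
}

def cosine(a, b):
    """Cosine similarity between two activation vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Words with similar distributional behavior end up close together.
print(cosine(hidden["dog"], hidden["cat"]))  # high
print(cosine(hidden["dog"], hidden["eat"]))  # low
```

Clustering such similarities hierarchically recovers the noun/verb split and the subclasses, which is the essence of Elman's analysis.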
3.2 Recursion and Complexity
Recursion is considered to be a language-independent structural phenomenon which causes processing difficulty. Besides the limit on recursion depth in the human language processor, there are two rather surprising properties of human processing ability on recursive structures. First, in a comprehensibility rating study, Bach et al. (1986) compared the difficulty of Dutch cross-dependency versus German center-embedding relative to right-branching and found cross-dependency to be easier than center-embedding at deeper levels of embedding. Second, a number of studies (Bach et al., 1986; Blaubergs and Braine, 1974; Christiansen and MacDonald, 1999) showed increasing difficulty for iterated right-branching. These performance patterns are not predicted by grammatical complexity.
Cross-dependency is seen as especially complex because it is not representable by context-free phrase structure rules. Repeated right-branching, on the other hand, is not predicted to affect comprehension in most symbolic models.
Christiansen and Chater (1999) showed in a series of simulations that the word-by-word prediction performance of simple recurrent networks on recursive structures exhibits exactly the three properties just mentioned for the human comprehender. Christiansen and Chater assessed the performance of SRNs on different types of grammatical complexity. In particular, they used the three types of recursion proposed by Chomsky (1957): counting recursion, mirror recursion (center-embedding) and identity recursion (cross-dependency). As a baseline they also included iterative right-branching structures. Christiansen and Chater constructed three artificial languages that each covered one of the three recursion types plus right-branching. These were used to create training corpora containing half right-branching, half recursive structures. The SRNs trained on these corpora performed best on counting recursion during testing, followed by cross-dependency, and had the most trouble with center-embedding. Similar to the human results, the preference for cross-dependency over center-embedding occurred only at embedding levels of two and upward. In single embedding the two recursion types were handled equally well, resembling the human results of Bach et al. Also consistent with the human data, the SRNs showed declining performance with deeper embedding on right-branching structures.
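The three recursion types can be made concrete with small string generators. The two-word vocabulary and the depth parameter are hypothetical, intended only to show how the agreement dependencies differ:

```python
import random

NOUNS = ["N1", "N2"]
VERBS = {"N1": "V1", "N2": "V2"}  # each noun agrees with "its" verb

def counting(n):
    """Counting recursion: n nouns, then n verbs, with no agreement."""
    return ["N"] * n + ["V"] * n

def mirror(n):
    """Mirror recursion (center-embedding): verbs agree in reverse order."""
    nouns = [random.choice(NOUNS) for _ in range(n)]
    return nouns + [VERBS[x] for x in reversed(nouns)]

def identity(n):
    """Identity recursion (cross-dependency): verbs agree in serial order."""
    nouns = [random.choice(NOUNS) for _ in range(n)]
    return nouns + [VERBS[x] for x in nouns]
```

If the randomly chosen nouns happen to be ['N1', 'N2'], mirror(2) returns ['N1', 'N2', 'V2', 'V1'] (nested dependencies), whereas identity(2) returns ['N1', 'N2', 'V1', 'V2'] (crossed dependencies).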
The consistency of the SRNs' predictions on recursive structures even included the phenomenon of the grammaticality illusion in double center-embedding. The SRN simulation was able to replicate the forgetting effect described in section 2.5. After training on the center-embedding language, the networks were tested on grammatical 3VP and ungrammatical 2VP constructions. Mean error scores showed that the network favored the ungrammatical NNNVV sequence over the grammatical NNNVVV sequence. Figure 3.2 shows the SRN's output node activations after seeing the second verb. The high activation of the end-of-sentence marker (EOS) clearly indicates that the network expects the sentence to be complete at this point.
Interestingly, in all the simulations the number of hidden units did not affect the SRNs' performance on recursion, given a number higher than 15. Thus, contrary to a criticism that is often brought forward, the hidden layer size is not comparable to a capacity parameter in symbolic models, which can theoretically be increased to capture recursion depth into infinity. But what, then, is memory capacity in a connectionist network? MacDonald and Christiansen (2002) say the following:
“To the extent that it is useful to talk about work<strong>in</strong>g memory with<strong>in</strong> these<br />
systems, it is the network itself; it is not some separate entity that can vary<br />
<strong>in</strong>dependently <strong>of</strong> the architecture and experience that governs the network’s<br />
process<strong>in</strong>g efficiency.” (p. 38)<br />
[Figure: mean output activation by lexical category (singular and plural nouns, singular and plural verbs, and EOS), with bars marked as indicating a 2VP preference, a 3VP preference, or erroneous activation.]

Figure 3.2: Ungrammaticality simulation by Christiansen and Chater (1999) (their figure 10). Mean output activation of lexical categories with error bars indicating standard error.
In their view, capacity is only a higher-level description of an SRN’s behavior: changing aspects of the architecture or the training always affects both memory and processing. In contrast, Just and Varma (2002) claim that noisy input to a network would be comparable to changing the capacity limit in symbolic models like CC-READER; in the network, they say, the representational quality would be affected while the grammatical knowledge would stay constant. Whichever view is correct, one can probably say that the greatest influence on memory capacity comes from the mechanism of the temporal loop and the backpropagation procedure. Using a different learning algorithm, for example backpropagation through time, can increase the network’s memory span.
3.3 A Model of RC Processing

3.3.1 MacDonald and Christiansen (2002)

MacDonald and Christiansen (2002) (MC02) presented a connectionist model covering individual and global differences in relative clause comprehension in English. By mapping the word-by-word prediction performance of an Elman network onto reading times, they showed an impressively accurate fit to the results of King and Just (1991). In doing so, MC02 directly attacked the importance of a discrete memory component for the subject/object difference and for individual differences in the comprehension of relative clauses. Since MC02’s model serves as the basis for the simulations in chapter 4, I will here describe
their experiment in detail. MC02 used a standard simple recurrent network (SRN) with a hidden and a context layer of 60 units each. Input and output layers of 31 units each represented 30 words plus an end-of-sentence (EOS) symbol. The corpora each consisted of 10,000 English sentences constructed randomly from a simple artificial probabilistic context-free grammar (PCFG). Subject- or object-modifying relative clauses were contained in 5% of the sentences; half were subject-extracted and half were object-extracted RCs. The rest of each corpus consisted of simple mono-clausal sentences. Verbs differed in transitivity and agreed in number with their subject nouns. Each corpus consisted of about 55,000 words; sentence length ranged from 3 to 27 words with a mean of 4.5. Notably, relative clauses could be embedded recursively in each noun phrase, with the RC attachment probability in the PCFG (0.05) limiting the embedding depth. MC02 trained 10 networks with randomly distributed initial weights¹, each on a different corpus. The learning rate was set to 0.1. The training phase covered only three epochs, each consisting of one pass through the corpus. The networks learned to predict the next word in a sentence without being provided with any probabilistic information. The output unit activations were calculated by a cross-entropy algorithm which ensured that all activation values summed to one; in that way the networks’ output was comparable to continuation likelihoods assigned to each possible word. After training, the networks were assessed on 10 sentences of each of the three types (SRC, ORC, and simple clause).
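The architecture just described can be sketched in a few lines. The following is a minimal, self-contained Elman network for next-word prediction. It is not MC02’s implementation: the toy vocabulary, the layer sizes, the training corpus, and the use of a softmax output (standing in for the normalization that makes the activations sum to one) are illustrative assumptions; only the initial weight range and the learning rate follow the values reported above.

```python
import numpy as np

# Minimal Elman SRN sketch for next-word prediction (illustrative, not MC02's
# exact setup). One-hot inputs, tanh hidden layer with copy-back context,
# softmax output as a stand-in for activations that sum to one.

rng = np.random.default_rng(0)

VOCAB = ["boy", "girl", "runs", "sees", "EOS"]
V, H = len(VOCAB), 10  # vocabulary size and hidden/context layer size

# Initial weights drawn uniformly from [-0.15, 0.15], as reported for MC02
W_ih = rng.uniform(-0.15, 0.15, (H, V))  # input   -> hidden
W_hh = rng.uniform(-0.15, 0.15, (H, H))  # context -> hidden (copy-back loop)
W_ho = rng.uniform(-0.15, 0.15, (V, H))  # hidden  -> output

def one_hot(i):
    v = np.zeros(V)
    v[i] = 1.0
    return v

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def train_sentence(words, lr=0.1):
    """One pass over a sentence, predicting each next word. Gradients are
    truncated to the current time step (no backpropagation through time)."""
    global W_ih, W_hh, W_ho
    h = np.zeros(H)  # context starts empty at sentence onset
    idx = [VOCAB.index(w) for w in words]
    for t in range(len(idx) - 1):
        x = one_hot(idx[t])
        h_prev = h
        h = np.tanh(W_ih @ x + W_hh @ h_prev)  # new hidden state
        y = softmax(W_ho @ h)                  # next-word distribution
        dy = y - one_hot(idx[t + 1])           # softmax + cross-entropy grad
        dh = (W_ho.T @ dy) * (1.0 - h ** 2)    # backprop through tanh
        W_ho -= lr * np.outer(dy, h)
        W_ih -= lr * np.outer(dh, x)
        W_hh -= lr * np.outer(dh, h_prev)

def predict(words):
    """Distribution over the word following the given prefix."""
    h = np.zeros(H)
    for w in words:
        h = np.tanh(W_ih @ one_hot(VOCAB.index(w)) + W_hh @ h)
    return softmax(W_ho @ h)

# Tiny training corpus: a noun is followed by a verb, a verb by EOS
corpus = [["boy", "runs", "EOS"], ["girl", "sees", "EOS"],
          ["boy", "sees", "EOS"], ["girl", "runs", "EOS"]]
for _ in range(200):
    for sentence in corpus:
        train_sentence(sentence)

print(VOCAB[int(np.argmax(predict(["boy"])))])          # a verb should win
print(VOCAB[int(np.argmax(predict(["boy", "runs"])))])  # EOS expected here
```

The copy-back loop (`W_hh @ h_prev`) is what gives the network its implicit memory: the hidden state at each step is a compressed trace of the whole preceding word sequence, which is exactly why “capacity” cannot be separated from the rest of the architecture.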
For interpreting the network output in terms of processing difficulty, MC02 calculated the so-called grammatical prediction error² (GPE). The GPE value is a measure of the network’s difficulty in making the correct prediction at each word. The measure was then used to map the relative word-by-word differences between the conditions onto reading times from the study by King and Just (1991). Besides RC type, MC02 used training epochs as a second factor: the network performances after one, two, and three epochs of training were compared to the reading speed of low-, mid-, and high-span readers.
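The exact GPE computation is described in chapter 4. As a rough, hypothetical illustration of the underlying idea, one can measure how much output activation the network assigns to continuations the grammar does not license; the function name and the example values below are invented for illustration and simplify away the full GPE formula.

```python
# Rough illustration of a GPE-style difficulty measure: the share of output
# activation assigned to continuations the grammar does NOT license.
# This is a simplification, not MC02's exact GPE (see chapter 4), and the
# words and activation values are invented for illustration.

def prediction_error(activations, licit):
    """activations: dict word -> output activation (values sum to 1).
    licit: set of grammatically possible next words at this position."""
    hits = sum(a for word, a in activations.items() if word in licit)
    return 1.0 - hits  # 0.0 = all mass on licit words, 1.0 = none

# After an N N N V V prefix the grammar still requires a third verb, so the
# high EOS activation seen in figure 3.2 counts as error:
output = {"verb": 0.2, "EOS": 0.7, "noun": 0.1}
print(prediction_error(output, {"verb"}))  # 0.8
```

A high score at a word thus signals that the network’s expectations diverge from the grammar at that point, which is the quantity mapped onto word-by-word reading times.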
The results of MC02’s network simulation are shown in figure 3.3. Pooled over all three epochs, the results show a clear subject preference on the main verb (praised) and the preceding region (the embedded object in the SRC and the embedded verb in the ORC). Furthermore, the ORC performance shows significant improvement on the embedded and main verb over the three epochs of training. Notably, the SRC data do not show such an improvement; rather, performance was relatively good from the start, with no change during training. This indicates a clause type × exposure interaction. The same interaction (in this case clause type × reading span) is seen in King and Just’s empirical data (figure 2.1). Notably, the simple SRN model seems to make better predictions than the CC-READER model by Just and Carpenter (1992), since CC-READER captures the span effect but not the interaction with clause type (see figure 2.4). Importantly, MC02 call this interaction a Frequency × Regularity interaction. Specifically, the regular nature of English SRCs with respect to word order (SVO) serves
¹ Between -0.15 and 0.15.
² See chapter 4 for a detailed description.
as a sort of familiarity boost that makes them easier to handle. As described in section 3.1, the representational similarity (and with it the related prediction behavior) of words is bound to their occurrence behavior in context. Since the representation of each input word also includes traces of each preceding word (i.e. the currently preceding structure), similar structures result in similar internal representations, which again result in similar prediction behavior. Consequently, the relatively quickly gained “knowledge”³ about simple sentences (95% of the corpus) also influences the “knowledge” about the structurally similar SRCs.

[Figure: reproduction of figure 2 in MacDonald and Christiansen (2002). Grammatical prediction error scores averaged over 10 novel sentences of each kind and grouped into four regions to facilitate comparison with the human data (see Just & Carpenter, 1992, Figure 9, p. 140).]

Figure 3.3: Frequency × Regularity interaction of SRCs and ORCs. Performance of the model in MacDonald and Christiansen (2002). Error bars show standard error.
Theoretically one could invent a structure which is similar to SRCs but to which the network has not been exposed in training; the SRN would nonetheless be able to handle it, benefiting from previous experience. The reason for the absence of the regularity benefit in ORCs is their deviation from all other structures in the corpus. While in the SRC reporter, attacked, and senator appear in the regular NVN ordering, the ORC exhibits an NNV ordering. Specifically, number agreement serves as a structural hint for differentiating the word orders and for preventing wrong regularities from being learned: the network gains implicit information about subject-object relations from the input data through the structural fact that verbs agree in number with subjects but not with objects.
³ I put knowledge in quotes here because what is meant in a connectionist system is not declarative or explicit knowledge but rather implicit behavioral knowledge provided by the connection weights.

The simulations show that a purely exposure-based model is able to predict complex interactions of complexity-related reading difficulties and individual differences at the word level. This evidence led MC02 to the formulation of their Skill-through-Experience Account (p. 44), which attacks the modular picture of knowledge and memory. The crucial claim of MC02 is that differences in performance result from processing skill as a function of experience, not from a separable WM capacity.
“In our view, neither knowledge nor capacity are primitives that can vary independently in theory or computational models; rather they emerge from the interaction of network architecture and experience.” (p. 37)
The subsymbolic (and behavioristic) nature of connectionist networks makes grammatical knowledge and processing indistinguishable. A change in parameters like weight vectors or hidden layer size is not attributable to one of the two components; rather, it affects the behavior of the whole network.
3.3.2 Critique and Relation to other Approaches

MC02 see their model mainly as an opposition to models like Just and Carpenter (1992) and Waters and Caplan (1996), which explicitly account for memory capacity limitations. MC02’s SRN simulations have important implications with respect to the biological plausibility of processing models. They demonstrated that there is no need to assume separable working memory and knowledge modules in order to account for effects attributed to these; rather, experience shapes the whole system, and capacity is a property emergent from the system’s architecture. This emphasizes the role of symbolic models like Just and Carpenter (1992) as merely higher-level descriptions of underlying processes. There is, of course, nothing wrong with symbolic descriptions. What is in question, however, is the justification of explicit numerical limits on capacity. In Just and Carpenter’s account the capacity limit is defined as the maximal amount of activation attributed to productions (processing rules), and this limit can be varied without touching the rest of the system. Following MC02, however, this value is in fact emergent and inseparable from the entire system. As a consequence, sentence comprehension and reading span measure the same thing, namely reading skill, which is the experience-shaped efficiency of linguistic processes.
Non-Linguistic Working Memory

Not convinced by this view, Roberts and Gibson (2002) note that a pure skill-via-experience account would not be able to explain the correlations of comprehension skill with non-linguistic working memory tasks that do not involve sentence reading, and they provide empirical evidence for correlations of sentence memory with several memory load tasks that do not involve reading sentences. Addressing these correlations, MC02 propose that reading skill is tied to phonological representations. These representations play the crucial role in all sorts of memory load tasks and account for individual differences. Regarding phonological representations, MC02 formulate four important claims (p. 45):
(a) “Phonological and articulatory representations must be activated in order to utter the words for the load task”.

(b) “Phonological activation is an important component of written and spoken sentence comprehension, particularly for certain difficult sentence structures”.

(c) “The extent to which phonological representations are important during comprehension of difficult syntactic structures is likely to vary inversely with experience, such that phonological information is more crucial for less experienced comprehenders”.

(d) “There appear to be notable individual differences in the ‘precision’ of phonological representations computed during language comprehension, and these differences are thought to owe both to reading experience and to biological factors.”
As becomes clear, MC02 do not completely deny an influence of biological factors on processing skill. These factors, however, concern the precision of representations, not capacity limitations, and that precision is itself subject to experience-driven variance. Moreover, individual differences are assumed to lie primarily in the degree of dependence on these phonological representations, meaning that highly skilled readers exhibit more efficient processing that relies less on phonological information. For example, in extrinsic load tasks⁴ both the stored items and the sentence comprehension processes make use of shared phonological representations. Thus MC02 explain load effects by activation interference rather than activation limits. This is seen as evolving naturally from evidence that articulatory planning involves strict activation and inhibition of phonological units (Bock, 1987; Dell and O’Seaghdha, 1992). During extrinsic load tasks, then, activation and inhibition processes from both the load and the comprehension mechanism operate on the same representations, interfering with each other. The more efficient processing of highly experienced readers makes less use of these representations and thus reduces difficulties due to interference. The same processes also occur in the reading span task (which is basically the same task as extrinsic load). The conclusion is that reading span is a function of experience and not of memory capacity. This account is also superior to Waters and Caplan (1996), whose theory assumes two separate working memories and hence does not predict an interaction of comprehension and extrinsic load. Furthermore, RC type differences are explained in the same way, namely that “object relatives, which are more challenging than subject relatives, are likely to rely more on phonological information than subject relatives” (p. 45).
⁴ In these tasks participants are asked to memorize a set of words or digits and retain it while reading sentences. The extrinsic load influences sentence comprehension performance in a way that correlates with the participants’ reading span value (Just and Carpenter, 1992; King and Just, 1991).
German Word Order
Konieczny and Ruh (2003) ran simulations on German relative clauses using the model parameters of MC02. The results are inconsistent with the empirical subject preference: German ORCs clearly exhibit lower error rates on the embedded verb. In addition, the results do not show a frequency × regularity interaction. This is not so surprising considering German word order properties. In English the regularity effect is attributed to the shared SVO ordering of main clauses and SRCs, which separates both structures from the SOV-ordered ORCs. In German, by contrast, while main clauses commonly exhibit an SVO ordering as in English, the order in SRCs and ORCs is SOV and OSV, respectively. In addition, the free word order in German also allows an OVS structure in main clauses. This makes four different possible word orders in German, which are not expected to give rise to a regularity preferring one of the two RC types. As pointed out in 1.3.3, a canonicity account based on thematic orderings cannot make clear predictions assuming SVO as the canonical ordering; an SOV-canonicity account, on the other hand, would make the correct predictions. However, to derive a structural frequency-based SOV regularity, the structural scope of the model would have to be extended, since simple main clauses do not provide the desired regular structures as they do in English. I suspect that including sentential complements and other subordinate clauses exhibiting an SOV pattern would result in the desired frequency × regularity interaction with an advantage for SRCs. The exact reason for the actual advantage of ORCs over SRCs in the current model will be discussed at the end of the next section, which deals in more detail with the SRN’s structure-based learning in RC processing.
3.3.3 What is learned?

As laid out in the previous section, the model by MacDonald and Christiansen (2002) has been criticized on several issues. It is not completely evident how much of the networks’ predictions can be attributed to a frequency × regularity effect and how much is merely an artifact. A mere correlation between network experience and human reading span is not sufficient evidence for an experience effect in human readers. Similarly, the effect of structural regularity differences between SRCs and ORCs does not necessarily cause the preference pattern for human readers. Also, assuming the conclusions drawn from the mentioned correlations are correct, the question remains what exactly the learning effect in relative clauses is based upon. An exposure-based theory driven by structural frequency needs to say something about the specific structural cues essential for shaping the efficiency of RC processing.
Assessing Experience

Wells et al. (2009) designed an experiment that took up the challenge of replicating the effects of experience with human readers, in order to have a basis for assessing the accuracy of
MC02’s SRN predictions. For that reason two participant groups were formed: an RC<br />
experience group and a control experience group, both matched on read<strong>in</strong>g span. Both<br />
groups received read<strong>in</strong>g tra<strong>in</strong><strong>in</strong>g over 3-4 weeks. The RC experience group received<br />
tra<strong>in</strong><strong>in</strong>g ma<strong>in</strong>ly on SRC and ORC structures whereas the control experience group was<br />
exposed to other structures. A pre- and a post-test carried out <strong>in</strong> the SPR paradigm<br />
assessed both groups RC read<strong>in</strong>g performance. The results show an overall improvement<br />
<strong>of</strong> RC read<strong>in</strong>g. Most importantly, however, the data revealed an <strong>in</strong>teraction <strong>of</strong><br />
session × experience group × clause type × region as can be seen <strong>in</strong> figure 3.4.<br />
There was a reliable difference between SRC and ORC reading times in both groups in the pre-test; in particular, the ORC condition was read more slowly. The subject/object difference, however, decreased significantly between pre- and post-test for the RC experience group, whereas in the control group it stayed the same. Wells et al. attribute the global reading improvement to increased familiarity with the SPR task. The observed pattern in the experience group is similar, first, to the span × clause type interaction in the study by King and Just (1991) reported in 2.1, and second, it resembles the frequency × regularity interaction of the connectionist model in MC02 (see figure 3.5). Statistically, the SRNs' mean GPE scores predicted the within-sentence and experience-based variance in the human data extraordinarily well, with a total fit of GPE and reading times of R² = .75. In a hierarchical regression, Wells and colleagues also predicted the SRN simulation results using the human reading times. The RC experience group data accounted for 75% of the overall variance in the SRNs' GPE scores. Using the control group data as a predictor, which involved no experience factor but merely within-sentence variance, the regression accounted for only about 65% of the GPE variance.
The impressive result of Wells et al.'s study delivers empirical support for the implications MC02 drew from the SRN simulations, namely that experience can account for individual differences in reading skill. Notably, the Wells et al. study demonstrated a significant experience effect after a minimal amount of exposure (the training sets contained only 160 sentences in total). Furthermore, lexical and structural short-term priming effects were excluded by, first, a four-week distance between the tests, second, the absence of lexical overlap between the tests, and third, the use of RC constructions during training that were structurally different from the test items. Task-related adaptation can also be excluded, since a full-sentence display was used during the training phase instead of word-by-word reading.
Now, given the evidence that reading skill on certain sentence structures is affected by previous experience with these structures, the question remains what has been learned; a question that evolves from the granularity problem of exposure-based theories described in section 1.3.4. Wells and colleagues identify verb transitivity as a crucial factor driving the learning process of the SRN. In ORCs, on the one hand, embedded verbs are necessarily transitive because the head noun has to fill an object role. On the other hand, simple sentences and SRCs occur with transitive and intransitive verbs. As a consequence, only for continuation predictions of ORCs does the network have to learn to differentiate verb transitivity, whereas for the continuation
57
Length-Adjusted Read<strong>in</strong>g Time (ms)<br />
Chapter 3 <strong>Connectionist</strong> Modell<strong>in</strong>g <strong>of</strong> Language Comprehension<br />
Figure 1: Self-Paced Read<strong>in</strong>g Patterns at Pre- and Posttest<br />
175<br />
150<br />
125<br />
100<br />
75<br />
50<br />
25<br />
0<br />
-25<br />
-50<br />
-75<br />
1 2 3 4<br />
OR: (The) clerk that the typist tra<strong>in</strong>ed told the truth<br />
SR: (The) clerk that tra<strong>in</strong>ed the typist told the truth<br />
36<br />
Relative Clause <strong>Experience</strong><br />
RC <strong>Experience</strong> Group (n=32) Control <strong>Experience</strong> Group (n=32)<br />
Pretest, Object Relatives<br />
Pretest, Subject Relatives<br />
Posttest, Object Relatives<br />
Posttest, Subject Relatives<br />
175<br />
150<br />
125<br />
100<br />
75<br />
50<br />
25<br />
0<br />
-25<br />
-50<br />
-75<br />
1 2 3 4<br />
(The) clerk that the typist tra<strong>in</strong>ed told the truth<br />
(The) clerk that tra<strong>in</strong>ed the typist told the truth<br />
Figure 3.4: Wells et al. (2009) read<strong>in</strong>g times for pre- and post-test by group and RC<br />
type.<br />
of other structures transitive and intransitive verbs are equally probable. Example (20) shows structural prefixes in main clauses (20a), SRCs (20b), and ORCs (20c) with the possible transitivity properties of the predicted verbs.

(20) a. Simple: EOS the N . . . {trans/intrans}
     b. SRC: the N that . . . {trans/intrans}
     c. ORC: (the N) that the N . . . {trans}
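The distributional asymmetry in (20) can be made concrete with a simple count over labeled continuations. The following sketch is purely illustrative: the observation list and counts are invented, not drawn from the actual training grammars.

```python
# Estimate P(transitive | structural prefix) from toy labeled continuations.
# Prefix types follow example (20); the observation counts are invented.
from collections import Counter

# (prefix_type, verb_class) pairs: simple sentences and SRCs mix both
# verb classes, while the ORC prefix is always continued transitively.
observations = [
    ("simple", "trans"), ("simple", "intrans"),
    ("SRC", "trans"), ("SRC", "intrans"),
    ("ORC", "trans"), ("ORC", "trans"), ("ORC", "trans"),
]

def p_transitive(prefix):
    counts = Counter(vc for pt, vc in observations if pt == prefix)
    return counts["trans"] / sum(counts.values())

print(p_transitive("simple"))  # mixed continuations
print(p_transitive("SRC"))     # mixed continuations
print(p_transitive("ORC"))     # 1.0: the ORC prefix forces a transitive verb
```

Only the ORC prefix is deterministic with respect to transitivity, which is exactly the regularity a prediction network could exploit.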
The experience effect in human readers, however, can be affected by many more structural cues in the input. A crucial factor, also mentioned for Chinese RCs in section 2.4, is animacy. In English, as in Mandarin, ORCs mostly contain inanimate head nouns, whereas in SRCs head nouns are commonly animate. Since Wells et al. used only animate head nouns for both SRCs and ORCs, the participants might have learned to handle the non-canonical animate-headed ORCs. Race and MacDonald (2003), Reali and Christiansen (2007a), and Reali and Christiansen (2007b) identify further probabilistic constraints that correlate with SRC/ORC corpus distributions. For example, pronominal ORCs mostly contain personal pronouns, whereas impersonal pronouns occur more frequently in SRCs. Furthermore, there are differences in the NP type of the embedded subjects that separate ORCs from other structures exhibiting an 'NP that NPSubj VP' sequence. The use of a pronoun in the NPSubj position is highly correlated with ORCs. The Wells et al. study used only common nouns, and all RCs were headed by the impersonal pronoun that; both are potential properties subject to probabilistic learning, since they are deviations from natural frequency patterns.
58
3.3 A Model <strong>of</strong> RC Process<strong>in</strong>g<br />
Relative Clause <strong>Experience</strong><br />
Figure 2. Comparison <strong>of</strong> the Human Self-Paced Read<strong>in</strong>g Patterns at Pre- and Post-test to the GPE Patterns Obta<strong>in</strong>ed for the SRNs.<br />
Figure 3.5: Wells et al. (2009) read<strong>in</strong>g times compared to model<strong>in</strong>g results <strong>of</strong> MacDonald<br />
and Christiansen (2002)<br />
The unnatural ORC patterns are probably the reason an experience effect emerged over such a short interval of only four weeks. Additionally, the unnaturalness might have inflated the clause type effect in reading times. I suspect that a study with completely natural sentences would show a smaller clause type difference and would need a much longer training period to show experience effects. However, this small-scale study clearly demonstrates the effect of structural probabilistic constraints on reading skill. Comparability with the SRN simulations is given by the transitivity constraint in ORCs, which is a plausible factor both in the simulations and in the human study.
A Detailed Prediction Analysis
The assumption that the SRN mainly learns verb transitivity in ORCs is, however, only speculation. Konieczny and Ruh (2003) carried out a more detailed analysis of the word-by-word predictions of MC02's SRN. They found that the high GPE scores on the embedded verb in the ORC and on the matrix verb in both RC types were caused by lexical misclassifications and local coherence. In particular, upon seeing the ORC sequence 'the N that the N', the SRN predicts the end-of-sentence marker EOS in early training. In later epochs the activation on EOS is reduced and increased on the correct verbs. Notably, the predictions for incorrect verbs (in this case all intransitive verbs plus those with non-matching number) do not change over training. This pattern is inconsistent with the transitivity hypothesis. The experience-based decrease of error on the ORC embedded verb is seemingly not caused by learning the distinction between transitive and intransitive verbs but rather by learning to separate the pronoun that from verbs.

The other region showing effects of clause type and experience is the main verb. At this point the SRN strongly predicts a determiner after seeing an ORC. This points to a locally consistent interpretation of the embedded '. . . the N Vtrans' sequence as a main clause prefix continuing with an NP. This misinterpretation is reduced in later training epochs. An additional and very stable false prediction on the main verb is the EOS after embedded SRCs and ORCs. Concerning the SRC, this is consistent with a locally coherent interpretation of the SRC sequence '. . . Vtrans the N' as part of a main clause. In the ORC, on the other hand, the EOS prediction after the '. . . the N Vtrans' sequence is only locally consistent when the transitive verb is interpreted as intransitive. At first sight this seems consistent with the assumption of Wells et al. (2009) that the SRN has to learn the trans/intrans difference. But surprisingly, the wrong EOS prediction increases with further training, indicating that the network does not recognize the transitivity of the embedded verb. Summarizing the analysis of Konieczny and Ruh, the effects on the embedded and main verb are caused by a) the interpretation of that as a verb, b) the prediction, due to local coherence, that the sentence ends after an embedded RC, and c) the failure to classify transitive and intransitive verbs. Konieczny and Ruh suggest removing verbs that can be both transitive and intransitive from the lexicon in order to separate the two classes more clearly. Furthermore, the grammar should allow the use of pronominal NPs to move the classification of that nearer to nouns than to verbs.
Concerning the German RC simulations, the explanation of the effects is quite simple. German SRCs and ORCs differ only in the serial order of the relative pronoun and the determiner of the embedded NP. Consequently, the SRC contains in this region a NOM-ACC sequence whereas the ORC contains an ACC-NOM sequence. The embedded verb always agrees with the nominative (der). This produces a locally consistent structure of 'detnom N V' in the ORC but not in the SRC. Following Konieczny and Ruh, this local consistency effect produces the correct predictions for the embedded verb in the ORC, which is the reason for the lower error. In the SRC the verb is bound to the relative pronoun, which shares its number with the matrix subject. The SRN's verb predictions, however, seem to be more influenced by the number of the intervening object than by the distant dependency.
3.3.4 Summary
In using SRNs, MacDonald and Christiansen (2002) take advantage of a simple mechanism that, without any architectural predesign, makes excellent predictions concerning the functional relation between exposure to certain structures and processing skill. The model's behavior is interpretable in terms of memory and decay, but due to its temporal loop and learning mechanism it is sensitive to context and experience. The King and Just data were fitted well, especially with respect to individual differences. The model results in combination with the study by Wells et al. (2009) provide a comprehensive skill-through-experience account that includes individual and language-specific differences. Konieczny and Ruh (2003) and others question the model's validity, partly because a detailed analysis shows that the learned constraints are of a local nature and not comparable to human learning. However, Christiansen and Chater (1999) showed that an SRN is an outstanding predictor, especially for complex embedding issues. Chapter 4 will now show useful SRN predictions on two further topics.
Chapter 4
Two SRN Prediction Studies
This chapter is concerned with the connectionist simulation of the subject/object difference in Mandarin and the forgetting effect in English and German. As the previous chapters have shown, detailed predictions of an experience-based account are important both for the subject/object difference in Chinese and for the language-specific forgetting effect. For the forgetting effect, a structural experience account seems to be a promising predictor capturing the divergence of the effects in different languages. Similarly, in the question of the Mandarin extraction preference, structural regularities are considered an important explanatory factor. As discussed in 1.3, the problem of theories involving structural experience or canonicity is the justification of certain granularity commitments. Chapter 3 emphasized that an implementation of the experience account in a connectionist network model deals with the granularity problem in a natural way, leaving it to the learning process to extract structural information at the granularity level that best serves the solution of the task. The network model by MacDonald and Christiansen (2002) (MC02), discussed in chapter 3 as a prototypical structural experience implementation, has proven to make empirically consistent predictions regarding the subject/object difference and individual differences. On that basis I rebuilt the model and used it to address the above-mentioned issues that were waiting for a connectionist answer. In order to verify my implementation of the model, I present replications of the English and German simulations before reporting the new simulations. But first I will briefly introduce the structure of the network.
4.1 The Model

4.1.1 Network Architecture
As in MC02, the connectionist architecture used was a simple recurrent network (SRN; Elman, 1990) as described in chapter 3. All networks were built, trained, and tested in the Tlearn simulator (Elman, 1992) on a Windows platform. The SRN consisted of four layers.
62
4.1 The Model<br />
The Input Layer In a localist input representation, the number of input nodes depends on the number of words in the lexicon. Each input node represents one word. A word is encoded by setting the activation of the node representing the input word to 1 and all others to 0. The MC02 replication model used 31 input nodes for 30 words and the end-of-sentence marker (EOS).

The Output Layer The output layer had the same number of units as the input layer. Output units, however, could take activations on a continuous scale between zero and one. The output calculation used the cross-entropy method, which guarantees that all output activations sum to one. This makes it possible to map output activations of units directly onto continuation probabilities.
The Hidden Layer The hidden layer holds the internal representations of the network. It consisted of 60 units, which receive an all-to-all connection from the input units and connect in the same way to the output units. Depending on the weights of the incoming connections, hidden units received activations between zero and one. The hidden layer size was held constant over all simulations: as Christiansen and Chater (1999) demonstrated, with layer sizes from about 15 units upward the number of hidden units does not significantly influence the performance of the network on recursive embedding. Hence, for the simulations presented here the size of the hidden and context layers stayed untouched.
The Context Layer The context layer contained 60 units that received a one-to-one connection from the hidden units. To obtain the copy mechanism, the link weights connecting from the hidden layer were fixed to 1. In that way, in every time step the context units received an exact copy of the hidden layer's activations. In an SRN the back-projection into the hidden layer happens in an all-to-all fashion, thus providing the next input step with indirect context from previous calculations.
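The interplay of the four layers can be sketched as a single time step in code. This is a minimal illustration of the architecture described above, not the Tlearn implementation; the weight names, the random seed, and the constant bias term are my own choices.

```python
# Minimal sketch of one SRN time step (illustrative, not the Tlearn code).
# Sizes follow the architecture above: 31 input/output units, 60 hidden
# and 60 context units.
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hid = 31, 60

# Small random weights, here drawn from [-0.15, 0.15] (cf. section 4.1.3).
W_ih = rng.uniform(-0.15, 0.15, (n_hid, n_in))    # input   -> hidden
W_ch = rng.uniform(-0.15, 0.15, (n_hid, n_hid))   # context -> hidden (all-to-all)
W_ho = rng.uniform(-0.15, 0.15, (n_in, n_hid))    # hidden  -> output

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def step(word_index, context):
    """Process one word; return the output distribution and new context."""
    x = np.zeros(n_in)
    x[word_index] = 1.0                                 # localist (one-hot) input
    hidden = sigmoid(W_ih @ x + W_ch @ context + 0.5)   # 0.5: bias (illustrative)
    output = softmax(W_ho @ hidden)                     # sums to 1: continuation probs
    return output, hidden.copy()                        # context = copy of hidden

context = np.zeros(n_hid)
for w in [3, 7, 12]:                                    # arbitrary word indices
    output, context = step(w, context)

print(round(float(output.sum()), 6))  # 1.0: a probability distribution
```

The copy into the context layer with fixed weights of 1 appears here simply as `hidden.copy()`; the all-to-all back-projection is the `W_ch @ context` term of the next step.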
4.1.2 Grammar and Corpora

Simple probabilistic context-free grammars (PCFGs) were used, covering simple sentences and subject- and object-extracted RCs in all three languages: English, German, and Chinese. For standard English and German training I used the grammars designed by Lars Konieczny (English) and by Daniel Müller and Lars Konieczny (German). The relative clause distribution was adjusted across different experiments. For generating corpora and likelihood predictions, the Simple Language Generator (SLG; Rohde, 1999) was used. The three training grammars as represented in SLG can be found in the Appendix. Every training corpus consisted of 10,000 randomly generated sentences. Test corpora were generated for every condition, consisting of 10 test sentences each.
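The corpus generation procedure can be sketched as sampling from a PCFG. The toy grammar below is a stand-in of my own invention, not one of the actual training grammars (which are given in the Appendix); it merely shows the sampling mechanism in the spirit of SLG.

```python
# Sketch of PCFG-based sentence sampling; the grammar is a toy stand-in.
import random

random.seed(1)

# Each nonterminal maps to (probability, expansion) pairs; symbols not in
# the grammar are terminals.
grammar = {
    "S":  [(1.0, ["NP", "V", "EOS"])],
    "NP": [(0.9, ["the", "N"]),
           (0.1, ["the", "N", "RC"])],            # RC attachment probability 0.1
    "RC": [(0.5, ["that", "V", "the", "N"]),      # subject relative
           (0.5, ["that", "the", "N", "V"])],     # object relative
    "N":  [(0.5, ["boy"]), (0.5, ["girl"])],
    "V":  [(0.5, ["sees"]), (0.5, ["sleeps"])],
}

def generate(symbol="S"):
    """Expand a symbol top-down, choosing rules by their probabilities."""
    if symbol not in grammar:
        return [symbol]                            # terminal word
    r, acc = random.random(), 0.0
    for p, expansion in grammar[symbol]:
        acc += p
        if r <= acc:
            return [w for s in expansion for w in generate(s)]
    return []

corpus = [" ".join(generate()) for _ in range(5)]
for sentence in corpus:
    print(sentence)
```

Scaling the loop to 10,000 sentences yields a training corpus of the kind described above, with the RC rate controlled by a single probability in the grammar.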
63
4.1.3 Tra<strong>in</strong><strong>in</strong>g and Test<strong>in</strong>g<br />
Chapter 4 Two SRN Prediction Studies<br />
Prior to training, all networks were initialized with random connection weights in the range [-0.15, 0.15], and the hidden units received an initial bias activation of 0.5. Each training included 10 individually initialized networks that were trained on 10 different corpora, respectively. In doing so, statistical justification was achieved by simulating subjects of differing disposition exposed to non-identical material. The networks were trained for three epochs, where one epoch corresponded to a full run through a corpus.

In the following, the training mechanism is briefly described. With every input word, the weighted connections propagate activation and inhibition through the net, forming an activation pattern in the hidden layer, which in turn is responsible for the output pattern. The error is then calculated with respect to a target that activates only the node corresponding to the subsequent word in the current sentence, with value 1. Thus, the network is trained to deterministically predict the next word, which is, of course, impossible to achieve. Similar input will have different continuations, but the teaching mechanism claims every continuation to be the one and only for the current context. Consequently, after some examples the network will activate several words, with activation strengths qualitatively corresponding to the "teacher's" previous single lessons. In combination with the cross-entropy error calculation (output activations sum to 1), the activation distribution over the output nodes becomes comparable to a probability distribution over words. Here the grammatical prediction error (GPE; Christiansen and Chater, 1999) comes into play. The GPE algorithm is based on the numerical differences between the desired PCFG-corresponding probabilities and the actual output. The GPE value is a difficulty measure for every word in the sentence, which can be used as a reading time predictor.
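The claim that one-hot next-word targets push the outputs toward conditional probabilities can be checked directly, independently of any network: averaging the "deterministic" teaching signals collected for one context recovers the continuation distribution of the corpus. The toy corpus below is invented for illustration.

```python
# Averaging one-hot next-word targets for a fixed context yields the
# conditional continuation distribution -- the signal an error-minimizing
# network converges toward. Toy corpus; the words are illustrative.
from collections import defaultdict

corpus = [
    ["the", "boy", "sees", "EOS"],
    ["the", "boy", "sleeps", "EOS"],
    ["the", "girl", "sees", "EOS"],
    ["the", "boy", "sees", "EOS"],
]

# Collect the one-hot teaching signals, indexed by the preceding word.
targets = defaultdict(lambda: defaultdict(float))
counts = defaultdict(int)
for sentence in corpus:
    for prev, nxt in zip(sentence, sentence[1:]):
        targets[prev][nxt] += 1.0   # each signal says: nxt, with activation 1
        counts[prev] += 1

# The average teaching signal after "boy" is a probability distribution.
after_boy = {w: a / counts["boy"] for w, a in targets["boy"].items()}
print(after_boy)  # 'sees': 2/3, 'sleeps': 1/3
```

Each individual target is wrong as a probability claim, but their average is exactly the corpus-conditional distribution, which is why the trained outputs can be compared against grammar-derived likelihoods.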
The Grammatical Prediction Error The GPE measure as described in Christiansen and Chater (1999) assigns an error score between 0 and 1 to every output activation pattern. With all correct units receiving the correct amount of activation and no incorrect units being activated, the output would receive a GPE value of 0. If, on the other hand, no units are correctly activated, the GPE would be 1. The score is calculated by formulas (4.1)-(4.5), where u_i is the activation of unit i; G and U are the sets of grammatical and ungrammatical units, respectively; and t_i is the desired target activation of unit i. The formulas ensure that incorrectly or insufficiently activated output units are penalized by an increasing score. Since all activations sum to 1, over-activation is also penalized, because the excess activation is missing at another unit. Correctly activated units H (hits) and incorrectly activated units F (false alarms) sum to the total activation. Additionally, M is the sum of all missing activation m_i, which is the discrepancy between an under-predicted unit's activation u_i and the target activation t_i. The target activation t_i of a unit i is given by the likelihood of the respective word in the current context string, calculated from the probabilistic grammar. The GPE is then defined by the proportion of correctly activated units (H) to the total activation plus a penalty for misses (4.5).
Hits (correctly activated units):              H = Σ_{i∈G} u_i                    (4.1)

False alarms (incorrectly activated units):    F = Σ_{i∈U} u_i                    (4.2)

Missing activation of unit i:                  m_i = t_i − u_i if t_i − u_i > 0,
                                                     0 otherwise                  (4.3)

Misses (units with underestimated activation): M = Σ_{i∈G} m_i                    (4.4)

Grammatical prediction error:                  GPE = 1 − H / (H + F + M)          (4.5)
A Perl routine controlled the training and testing of the ten networks and then calculated the region-specific GPEs. The correct functioning of the process will now be validated by the replication of two previous studies.
4.2 Replication of English and German RC Processing
I built the model with the parameters specified by MacDonald and Christiansen (2002) and tried to replicate their results (see figure 3.3 for MC02's results). MC02 report an RC probability of 0.05. However, the replication fitted their data better when the RC probability was set to 0.1. Konieczny and Ruh (2003) also replicated MC02 with an RC probability of 0.1. Figure 4.1 shows the replication result. The pattern of MC02 was matched more exactly in epochs 3, 4, and 5, but the relevant interactions were also found in epochs 1, 2, and 3. Only the training effect on the main verb in the ORC was not very pronounced. The differences were, however, significant.

I used the simplified German grammar from Konieczny and Ruh (2003) to replicate their results. Compared to the original study, I obtained lower error rates for the main verb in both conditions. Additionally, the replication showed a significant experience effect in all regions of the SRC, which was not the case in the original. The pattern by region was successfully matched.

I will not go into details regarding the two replication studies. They merely build the basis for the following simulations, making sure that the model used here has properties similar to those of the models in MacDonald and Christiansen (2002) and Konieczny and Ruh (2003).
65
Figure 4.1: Replication of MacDonald and Christiansen (2002). GPE (0-1) by region for the English SRC ('rep. that attacked the senator praised the judge') and the English ORC ('rep. that the sen. attacked praised the judge') over training epochs 1-3.
4.3 RC Extraction in Mandarin

4.3.1 Simulation 1: Regularity

Model Parameters
The first simulation was intended to assess the degree of the regularity advantage the ORC receives due to its canonical word order. A regularity effect is assessable only when the frequencies of both RC types in the corpora are identical. Therefore, the SRC and the ORC received the same probability in the generation grammar. Although the replications were done with an RC probability of 0.1, I used the original value of 0.05, reported in MacDonald and Christiansen (2002), for the Mandarin regularity simulation. Compared to English, the Chinese grammar used here is very simple. Setting the RC probability too high could speed up the learning process in a way that conceals training effects.
The Grammar
The grammar used to generate the corpora covered simple regular Mandarin SVO sentences as well as SR and OR clauses. A relative clause could attach to every noun with a probability of 0.05. The embedding depth was theoretically unlimited, but with this small attachment probability the longest sentence in the corpora had a length of 16 words. The 17-word lexicon consisted of 9 plural and singular nouns, three transitive and four intransitive verbs, of which one (lijie "understand") belongs to both categories, the relativizer de, and the EOS. Note, however, that there is no number agreement between nouns and verbs in Mandarin. The full lexicon is given in the Appendix. Note further that in normal Mandarin intransitive
[Figure 4.2: Replication of Konieczny and Ruh (2003). GPE (0.0-1.0) by region over training epochs 1-3; left panel: German SRC ("der den Passant trifft verspottet"), right panel: German ORC ("den der Passant trifft verspottet").]
main clauses are closed by the marker le (see example 21). I did not use it in the grammar, however, so as not to blur the transitivity constraints, since le could appear to the network to be a noun.
(21) faguan youyu le .
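The generation procedure described above can be sketched as a small probabilistic rewrite program. The following is a hypothetical reconstruction, not the implementation actually used in the thesis: the lexicon is abbreviated, and only the parameters stated in the text (RC attachment probability 0.05, equal SRC/ORC split) are taken from the description.

```python
import random

# Hypothetical reconstruction of the Mandarin generation grammar
# (abbreviated lexicon): simple SVO main clauses, and every noun may
# take a prenominal relative clause ending in the relativizer "de"
# with probability 0.05.
NOUNS = ["faguan", "lushi", "fayanren", "yinhangjia"]
TRANS = ["biaoyang", "gongji", "lijie"]      # transitive verbs
INTRANS = ["youyu", "sahuang", "lijie"]      # intransitive ("lijie" is both)

RC_PROB = 0.05   # probability of RC attachment at a noun
SRC_PROB = 0.5   # equal SRC/ORC split, as in simulation 1

def noun_phrase():
    """A noun, optionally preceded by a relative clause; RCs may nest,
    so the embedding depth is theoretically unlimited."""
    head = random.choice(NOUNS)
    if random.random() < RC_PROB:
        if random.random() < SRC_PROB:
            rc = [random.choice(TRANS)] + noun_phrase() + ["de"]  # SRC: V N de
        else:
            rc = noun_phrase() + [random.choice(TRANS), "de"]     # ORC: N V de
        return rc + [head]
    return [head]

def sentence():
    """One SVO (or SV) sentence, closed by the end-of-sentence marker."""
    verb = random.choice(TRANS + INTRANS)
    words = noun_phrase() + [verb]
    if verb in TRANS:
        words += noun_phrase()
    return words + ["."]

corpus = [sentence() for _ in range(10000)]
```

With these settings, sentences longer than a dozen words are rare, in line with the 16-word maximum reported above.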
Ten networks were trained on ten randomly generated corpora. Three test corpora were randomly generated, containing simple transitive SVO sentences, SRCs, and ORCs, respectively. The RC test corpora contained only singly embedded RCs with transitive verbs. It was ensured that none of the test sentences appeared in the training sets.
(22) Test set examples:
a. gongji fayanren de lvshi biaoyang lvshimen . (SRC)
b. fayanren lijie de lvshi biaoyang yinhangjia . (ORC)
The SRN's task was to predict the next word in a sentence. For example, in an SRC, when the verb biaoyang "praise" has been seen and the noun lushi "lawyer" is now presented, the target activation for the relativizer de is 0.975. The activation pattern is shown in figure 4.3. Besides the relativizer, the transitive verbs are expected to be activated, because there is a low but nonzero probability of an ORC modifying the object inside the SRC (see example 23); this is in fact the only possible continuation apart from de following a 'V N' sequence. In the ORC, by contrast, following the first two words 'N V', all nodes apart from the EOS are activation targets because many different continuations are possible.
(23) [V1 [N1 V2 de ORC ] N2 de SRC ] N3
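The prediction setup can be illustrated with a minimal Elman-style SRN. This sketch is generic (numpy, one hidden layer whose state is copied back as context, a softmax output layer trained with one-step backpropagation on next-word prediction); the network size, learning parameters, and training regime of the thesis are not reproduced here.

```python
import numpy as np

class SimpleSRN:
    """Minimal Elman-style simple recurrent network: the hidden state is
    copied back as context input at the next time step, and the output
    is a softmax distribution over the lexicon (next-word prediction)."""

    def __init__(self, vocab_size, hidden_size=10, lr=0.1, seed=0):
        rng = np.random.default_rng(seed)
        self.W_ih = rng.normal(0.0, 0.1, (hidden_size, vocab_size))   # input -> hidden
        self.W_hh = rng.normal(0.0, 0.1, (hidden_size, hidden_size))  # context -> hidden
        self.W_ho = rng.normal(0.0, 0.1, (vocab_size, hidden_size))   # hidden -> output
        self.lr = lr
        self.h = np.zeros(hidden_size)

    def reset(self):
        """Clear the context units, e.g. at a sentence boundary."""
        self.h = np.zeros_like(self.h)

    def step(self, x):
        """One time step: x is a one-hot input vector; returns the
        predicted distribution over the next word."""
        self.h = np.tanh(self.W_ih @ x + self.W_hh @ self.h)
        z = self.W_ho @ self.h
        e = np.exp(z - z.max())
        return e / e.sum()

    def train_step(self, x, target):
        """Forward step plus one-step (truncated) backpropagation of the
        cross-entropy error against the target distribution."""
        h_prev = self.h.copy()
        y = self.step(x)
        d_out = y - target                        # softmax/cross-entropy gradient
        d_h = (self.W_ho.T @ d_out) * (1.0 - self.h ** 2)
        self.W_ho -= self.lr * np.outer(d_out, self.h)
        self.W_ih -= self.lr * np.outer(d_h, x)
        self.W_hh -= self.lr * np.outer(d_h, h_prev)
        return y
```

After each sentence the context is reset; the output vector after a given prefix can then be compared against the grammar-derived target probabilities, as in figure 4.3.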
Rec-Act  Target-Act  Miss        Chin.            Engl.
0.688    0           0           (EOS)            (EOS)
0.003    0.00833333  0.00533333  (biaoyang)       (praise)
0.000    0           0           (dadianhua)      (phone)
0.713    0.975       0.262       (de)             (gen)
0.001    0           0           (faguan)         (judge)
0.001    0           0           (faguanmen)      (judges)
0.001    0           0           (fayanren)       (reporter/reporters)
0.003    0.00833333  0.00533333  (gongji)         (attack)
0.001    0           0           (guanyuan)       (senator)
0.001    0           0           (guanyuanmen)    (senators)
0.002    0.00833333  0.00633333  (lijie)          (understand)
0.001    0           0           (lushi)          (lawyer)
0.001    0           0           (lushimen)       (lawyers)
0.000    0           0           (sahuang)        (lie)
0.001    0           0           (yinhangjia)     (banker)
0.001    0           0           (yinhangjiamen)  (bankers)
0.000    0           0           (youyu)          (hesitate)
H: 0.721, F: 0.697, GPE: 0.6025
Figure 4.3: Output node activations on the relativizer in Mandarin after the input sequence 'biaoyang lushi'.
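The quantities reported in figure 4.3 (hits H, false alarms F, misses, and the resulting GPE) follow Christiansen and Chater's scoring scheme. The sketch below is a simplified formulation, not a verbatim reproduction of their formula: hits are the activations of grammatical (target > 0) units, false alarms are the activations of ungrammatical units, and each miss is the shortfall of an under-activated grammatical unit, matching the miss values listed in the figure. The exact miss weighting in the original may differ, so this version need not reproduce the reported GPE of 0.6025 exactly.

```python
def gpe(activations, targets):
    """Grammatical Prediction Error, in a simplified form of Christiansen
    and Chater's (1999) hits/false-alarms/misses scheme:
        GPE = 1 - hits / (hits + false_alarms + misses).
    `activations` are the network's output values; `targets` are the
    grammar-derived next-word probabilities (zero for ungrammatical words)."""
    hits = false_alarms = misses = 0.0
    for a, t in zip(activations, targets):
        if t > 0:                       # grammatical continuation
            hits += a
            misses += max(t - a, 0.0)   # shortfall below the target activation
        else:                           # ungrammatical continuation
            false_alarms += a
    return 1.0 - hits / (hits + false_alarms + misses)
```

A perfect match between activations and targets yields a GPE of 0; concentrating all activation on ungrammatical words yields 1.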
Predictions
The aim of the first simulation was to verify the regularity (or canonicity) argument for Mandarin object relatives. As Hsiao and Gibson (2003) have stated, ORCs have a more canonical word order than SRCs and should therefore be easier. The expected outcome would be a pattern reversed relative to the English and German results: the performance on the ORC should be relatively good and stable from the first training cycle on, whereas the SRC performance should improve throughout training.
Results
Words 2 to 5 (N1/V1 de N2 V2 N3) were selected as regions of interest. Since the assessed performance is prediction, there is no point in testing the initial word of a sentence: the network's prediction for the initial word will always be the same, and its divergence from the desired prediction tells us nothing about structural matters.
The GPE score measured on a given word only tells us how well the predictions based on the previous words fit the probabilistic grammar. It does not include any effect of the current word itself.
[Figure 4.4: Simulation 1: Mandarin ORC regularity. GPE (0.0-1.0) by region over training epochs 1-3; left panel: Mandarin SRC (regions N1, de, N2, V2, N3), right panel: Mandarin ORC (regions V1, de, N2, V2, N3).]
See figure 4.4 for GPE scores of SRCs and ORCs by training epoch. For means and standard errors see table A.1 in the appendix. Collapsing over all regions and epochs, there was a significant advantage for object relatives. The difference shrank with increasing epochs. For the ORC there was significant improvement on the main verb over the three epochs; the SRC improved on the main verb and the relativizer. On the first region (N1/V1) there was a marginal advantage for the ORC in the first epoch. The second region (de) showed a significant object advantage in all epochs. There was also an object advantage on region 4 (V2), which, however, disappeared after the second epoch due to SRC improvement. Regions three and five did not show any effect.
The results of experiment 1 showed the predicted frequency × regularity interaction. In contrast to the English results of MC02, the regularity effect in Mandarin is seen in object relatives. The effect, however, is not located on the embedded RC but mainly on the relativizer. It seems that the predictions for position 4 (here the relativizer) are easier for the ORC because of the familiarity with the sequence 'N V ...', where the relativizer should have a quite low continuation probability due to the small RC frequency in the corpus. The SRC sequence 'V N', on the other hand, occurs very rarely at the beginning of a sentence, making more training necessary to learn the correct predictions. Over training, the network has to learn to assign a high activation to the relativizer after a 'V N' sequence and to exclude almost all other words as continuations.
Experiment 1 superficially confirms the ORC regularity hypothesis. However, as the
corpus study by Kuo and Vasishth (2007) revealed, there are many more structures in the corpus that resemble the SRC-typical 'V N de N' sequence. It must be recognized that the granularity problem also applies to connectionist networks, namely in the choice of input structures. A structure may be regular with respect to main clauses; but once a large number of structures that differ fundamentally from the main clause are taken into account, their word order has considerable influence on the respective regularity. German, for example, is considered to exhibit an SOV regularity although most main clauses are built with SVO. Experiment 2 will assess the qualitative impact of a displaced regularity in favor of the SRC structure.
4.3.2 Simulation 2: Frequency
Model Parameters
Hsiao and Gibson (2003) performed a corpus study yielding only a small difference in the occurrence of subject vs. object relatives: of all RCs they found, about 57.5% were SRCs and 42.5% were ORCs. However, as reported in section 2.2, there are more constructions with an RC-like pattern that are in fact not RCs. The purely syntactic account of structural frequency addressed here does not distinguish between homomorphic sequences yielding different interpretations. Consequently, the corpus frequencies of all RC-like structures are considered in this experiment. The corpus study by Kuo and Vasishth (2007) found 639 occurrences of SRC-like sequences such as 'V N1 de N2' and only 117 of 'N1 V de N2' (ORC). That makes a total of 756 RC-like structures, of which about 84.5% exhibit the SRC pattern. In addition, the possibility of main clauses with elided subjects is considered an influence on early parsing decisions (Kuo and Vasishth, 2007). This could further increase the familiarity with 'V N ...' sequences and hence facilitate the SRC. To account for the high number of SRC-like structures, the probability of RC attachment in the grammar was set to 0.1 and the SRC probability was set to 0.85. In this way, the number of ORCs in the corpus was only slightly lowered, whereas the number of SRCs increased by about 60%. Extra main clauses with missing subjects were not added to the grammar. No further changes were made.
Predictions
The implemented discrepancy between ORC and SRC frequency should in principle account for the structural properties of the corpus, decreasing the familiarity effect on ORCs. The SRC pattern should be easier to predict on the embedded noun and the relativizer. Thus, the object advantage seen in simulation 1 should decrease. The size of the effect is expected to be small, because the distributional changes to the RC probability are only a statistical approximation of the given corpus data.
Results
Figure 4.5 shows the GPE scores by region and epoch for SRCs and ORCs. Table A.2 (in the appendix) shows means and standard errors for the first two regions. The improvement on the main verb in the ORC over epochs was comparatively low. There was no improvement on the SRC main verb. The training improvement on the relativizer in the SRC happened predominantly between the first and the second epoch. An object advantage on the relativizer was present only in the first epoch and, slightly, in the second. The third epoch did not reveal an ORC advantage. In addition, a subject preference was found on the pre-relativizer region in the second and third epochs (p < 0.001).
[Figure 4.5: Simulation 2: Mandarin SRC frequency. GPE (0.0-1.0) by region over training epochs 1-3; left panel: Mandarin SRC (85%) (regions N1, de, N2, V2, N3), right panel: Mandarin ORC (15%) (regions V1, de, N2, V2, N3).]
The greatly improved predictions for the relativizer in the SRC imply an increased familiarity effect due to the corpus containing more SRCs than before. The result of simulation 2 suggests that the regularity effect on object relatives is weak and can easily be overridden by a slight distributional disproportion in favor of the SRC structure.
4.3.3 Discussion
Simulation 1 confirms the regularity advantage for Mandarin ORCs relative to simple sentences, with the greatest effect on the relativizer. However, the location of the effect is not consistent with human data. Recall that Hsiao and Gibson (2003) found an object preference on the pre-relativizer region, whereas Kuo and Vasishth (2007) found a subject preference on the relativizer and the head noun. Apart from the very small effect on N1/V1 in the simulation, there is no region-specific consistency with empirical studies.
The impact of structural regularity is rather disconfirmed by the several studies finding a subject advantage on the relativizer.
Changing the RC type proportions in favor of the SRC in simulation 2 decreased the object advantage dramatically. The RC region showed a subject advantage after two training epochs. This finding is also inconsistent with human data: in empirical studies, only an object preference was found on the RC region (Hsiao and Gibson, 2003; Lin and Garnsey, 2007; Qiao and Forster, 2008). The assessment of frequency effects in simulation 2 should be understood as a tentative approach to capturing the complex interplay of statistical constraints that drive learning. Direct predictions for sentence processing patterns may, however, not be justified. An SRN-based regularity test like that in simulation 1 is more or less straightforward as long as the structures in question are clearly defined. But the structural choice may not reflect the regularity relations that actually influence skill in human readers. In order to obtain more precise predictions, further corpus inspection is necessary. For example, the exact proportion of RC-like structures or elided-subject clauses relative to the whole corpus was neglected in the present study but could have influenced the results.
Note that the regularity pattern in Mandarin as revealed by the simulations is not easily comparable to the English simulation. In English, difficulty effects occurred mainly on the verbs. This is due to the number agreement between subject and predicate; notably, no agreement is necessary between the verb and its direct object. This agreement pattern delivers, as a side effect, a sort of semantic information comparable to thematic roles. Agreement thereby gives rise to a simulation of integration difficulty effects, arising from the need to relate verbs to their subjects. I hypothesize that these "integration effects" are the main reason for the good by-region fit to human data. Mandarin, on the other hand, does not contain specific noun-verb dependencies. In a sense, the network merely needs to count nouns and verbs instead of establishing pairwise relationships. Thus, the Mandarin-trained network is not required to deal with the concept of a sentential subject. Consequently, no "integration difficulty" comparable to English is expected. Of course, this does not correspond to human processing of Mandarin: predicates and their arguments are indeed involved in dependencies such as thematic roles and other semantic relationships. It is conceivable that the lack of specific noun-verb relationships is the reason for the absent pattern match between the Mandarin simulation and human data. Implementing the missing dependencies along the lines of the simplified English grammar seems a straightforward way to test that hypothesis. The interpretability of the results of such a simulation would, however, be questionable.
A possible interpretation of the overall contradictory simulation results with respect to human data is that the effects observed here do not in fact reflect experience-relevant regularities. Assuming, on the other hand, that the simulation results do show regularity properties that play a role in human sentence comprehension, there are two possible interpretations: a) if regularity plays a role in the extraction preference, the very weak regularity effect on the ORC in the simulations could be one of the reasons for the inconclusive empirical results; b) on
the other hand, it is possible that regularity has no relevant impact in empirical studies of Mandarin extraction preferences, and the explanation is left to other factors.
4.4 Forgetting Effects
4.4.1 The Model
As presented in chapter 3, the forgetting effect in center-embedded structures was addressed in a connectionist study by Christiansen and Chater (1999). They trained an SRN on right-branching and center-embedding structures and then assessed the output node activations after the network had seen the sequence NNNVV. The activations showed a clear 2VP preference, consistent with empirical data from English speakers. The artificial language, which covered center-embedding abba and right-branching aabb dependency patterns, is directly comparable to the simple English grammar of object and subject relative clauses used by MacDonald and Christiansen (2002). Thus, it should be possible to replicate the effect with the SRNs trained on the English grammar for the replication in section 4.2. In German RCs, however, no real right-branching occurs, since the embedded RC is always attached to its head noun. Hence, in the German grammar used in section 4.2, both ORC and SRC exhibit a center-embedding abba pattern. As a consequence, an SRN exposed to the German grammar is trained more extensively on verb-final center-embedding structures than its English counterpart, which may result in different predictions for an NNNVV sequence. Supposing that the difference in SRC realization in the corpora approximately reflects an essential word-order regularity difference between German and English, the SRN predictions will shed light on the part that experience plays in the explanation of the forgetting effect.
I extended the study by Christiansen and Chater (1999) to obtain GPE values for both conditions on all regions after the missing verb. This required a grammar that simulates the forgetting effect, i.e., one that allows NNNVV sequences to be complete. Thus, in the probability table for the drop-V2 testing corpus, the column referring to the position of V2 was deleted. In consequence, the testing probabilities corresponded to an 'N1 N2 N3 V3 V1' grammar, with the first verb (V3) bound to N3 by number agreement and the second verb (V1) to N1. This is equivalent to forgetting the prediction induced by N2. The GPE for the ungrammatical condition was calculated against these drop-V2 probabilities. So if the network makes grammatical predictions, the error values for V1 and subsequent regions should be higher in the drop-V2 condition: on V1 the SRN would predict a verb in number agreement with N2; then the network would predict another verb, whereas the test grammar predicts the determiner. After this point, the network's predictions should be completely confused, because the sequence just observed is inconsistent with any structural generalizations developed during training. If the network's predictions are not too locally dependent, the predictions should be wrong for the last word (the direct object of the main clause), too.
Under the forgetting hypothesis, however, the GPE values would look different. For the SRN, the forgetting hypothesis means that it is unable to make correct predictions based on long-distance dependencies and instead bases its predictions on locally consistent sequences. For example, after seeing V3 the network predicts only one more verb, because the observation of N1 is too weakly encoded in the hidden representations to influence the predictions. Consequently, on V1 the error for the drop-V2 condition should be lower, because in the grammatical condition V1 is the third verb, which is inconsistent with the SRN's predictions. The 2VP preference should continue on the post-V1 regions, because a locally coherent context with two verbs is easier to handle than a context with three verbs.
Vasishth et al. (2008) mentioned comma insertion as a potential factor: commas could serve as structural cues alerting the reader to a missing verb. Empirically, however, it is hardly possible to separate the comma effect from word-order effects. Vasishth and colleagues did test English readers on comma-containing stimuli; but since English readers are not trained on commas used in this way, they cannot draw as much information from comma positions as German readers do. In order to test whether the commas in fact influence structural predictions, the following study tested SRNs trained on German and English corpora both with and without commas.
4.4.2 Simulation 3: English
Model Parameters
For the forgetting-effect simulation of English without commas (simulation 3a), no new training was necessary: the SRNs trained on the English corpora were tested on the grammatical and the ungrammatical condition in their states after one, two, and three epochs. For simulation 3b, the English grammars for the training and testing corpora were enriched with commas, and the SRNs were trained and tested in the usual way.
3a: English without commas
Given the equivalence of Christiansen and Chater's training language and the English training grammar used here, the effects should be similar. In particular, the V1 and post-V1 regions should receive lower GPE values in the drop-V2 condition.
(24) Example test sentences:
a. the judge that the reporters that the senators understand praise attacked the senators . (no-drop)
b. the judges that the reporters that the lawyer praised attacked the senators . (drop-V2)
Results for 3a
As in the experiment in Vasishth et al. (2008), the assessed regions in the simulation were the three verbs V3, V2, V1 and the post-V1 region. The V2 region contains no data point in the ungrammatical condition, because the verb is dropped in the testing stimuli. Figure 4.6 shows GPE values for the SRNs trained and tested on the simplified English grammar without commas. The left panel shows both conditions after two training epochs; the right panel shows the results after three full corpus runs. The ungrammatical condition is labeled drop-V2 and the grammatical condition no-drop. The pattern was exactly as expected: the SRNs predicted a drop-V2 advantage on V1 and post-V1. No effect was predicted on V3, because at this point there is no difference in stimuli or probability between the conditions.
[Figure 4.6: Simulation 3a: English doubly-embedded ORC. GPE (0.0-1.0) on the three verbs (V3, V2, V1) and the subsequent region (post-V1) for the grammatical (no-drop) and ungrammatical (drop-V2) conditions; left panel: after two epochs of training, right panel: after three epochs.]
3b: English with commas
The commas serve as clause-boundary markers. In English SRCs they appear only after nouns. In the ORC, on the other hand, commas appear after the nouns at the beginning of the sentence and after the verbs at the end. In a doubly-embedded ORC there would be a comma after V3 and V2. Thus, the grammatical/ungrammatical sequence pair is no longer NNNVVV vs. NNNVV but rather N,N,NV,V,V vs. N,N,NV,V. The comma is a category with only one token, which attaches to nouns or verbs and is not involved in long-distance dependencies; hence, the activation pattern representing it should not be too complex. In fact, learning comma usage in ORCs can be reduced to a counting-recursion problem of the pattern aabb instead of abba. As discussed in chapter 3, counting recursion is the easiest of the three recursion types for both humans and connectionist networks (Christiansen and Chater, 1999). Thus, it is very likely that the inclusion of commas facilitates processing in the grammatical condition, lowering the respective GPE values.
(25) English with commas:<br />
a. SRC: S1 , V2 O2 , V3 O3 , V1 O1<br />
b. ORC: S1 , S2 , S3 V3 , V2 , V1 O1<br />
(26) Example test sentences:<br />
a. the banker , that the banker , that the senators phone , understands , attacks<br />
the reporters . (no-drop)<br />
b. the lawyer , that the senator , that the judges attack , praises the judge .<br />
(drop-V2)<br />
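The grammatical/ungrammatical sequence pair illustrated in (25) and (26) can be generated schematically; the following sketch uses an invented helper name and only reproduces the category patterns, not the lexicalized test items:<br />

```python
def english_orc(depth=2, drop_v2=False):
    """Category sequence of an English ORC with `depth` embeddings:
    commas follow all but the last subject noun and all but the last verb;
    drop_v2=True omits the middle verb (the ungrammatical condition)."""
    nouns = ['N', ','] * depth + ['N']           # S1 , S2 , S3
    n_verbs = depth if drop_v2 else depth + 1    # V3 , V2 , V1 (or one fewer)
    verbs = []
    for i in range(n_verbs):
        verbs.append('V')
        if i < n_verbs - 1:
            verbs.append(',')
    return nouns + verbs + ['N']                 # final object O1

# the pair tested in simulation 3b:
no_drop = english_orc()              # N , N , N V , V , V N
drop_v2 = english_orc(drop_v2=True)  # N , N , N V , V N
```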
Results for 3b<br />
See figure 4.7 for the results of simulation 3b after two (left panel) and three epochs (right panel). Compared to simulation 3a there was a global improvement in both conditions. The most dramatic improvement occurred on V3, which is predicted almost without error after three epochs. In the earlier epoch, comma insertion produced more improvement on V1 in the grammatical condition; as a result, the V1 error was the same in both conditions. With further training, however, the no-drop condition did not change on V1, whereas the drop-V2 condition improved further, resulting in a drop-V2 preference on V1. The opposite happened on post-V1, where training affected the no-drop condition more; here, training did not affect the ungrammatical condition at all. In summary, there was a comma insertion × condition × training interaction, resulting in a drop-V2 preference after completed training. The stable error on post-V1 in the drop-V2 condition can be interpreted as a floor effect: the prediction of the determiner and the noun is already very good, with a GPE value around 0.1, and it is very unlikely that the SRN would learn the perfectly correct probabilities, yielding a GPE value of zero, even after many epochs. Therefore, on the post-V1 region, improvement through training is only possible for the slightly worse grammatical condition, which is why the two conditions settle on the same error value after three epochs. In conclusion, the insertion of commas clearly helps the network to make better predictions. However, training effects seem to be driven by rather local consistency, affecting the ungrammatical condition more than the grammatical one. Thus, judging from V1 after three epochs, the drop-V2 preference seems to be stable for English center-embedding.<br />
Figure 4.7: Simulation 3b: English doubly-embedded ORC with added commas. The graphic shows the GPE value on the three verbs and the subsequent region of the grammatical (no-drop) and ungrammatical (drop-V2) condition. The left panel shows GPE after two epochs of training, the right panel after three epochs.<br />
4.4.3 Simulation 4: German<br />
Model Parameters<br />
Simulation 4a tested German center-embedding with commas, using the already trained networks from section 4.2. For simulation 4b, training corpora created from a German grammar without commas were used. The testing corpora were built analogously to simulation 3.<br />
4a: German with commas<br />
The German grammar exhibits a regularity of verb-finality in RCs. This differs from the English grammar and should enable the SRN to distinguish 2VP and 3VP embeddings better than in English. As seen in the English simulation, commas have a facilitating effect, although the drop-V2 preference returned after further training. In German, commas could have an even greater facilitating effect, because the counting-recursion pattern aabb is applicable not only in the ORC, as in English, but also in the SRC, since both are center-embedded in German. As example (27) illustrates, SRC and ORC contain the exact same pattern of nouns, verbs, and commas. Consequently, the SRN trained on the German corpus should be very skilled at center-embedding recursion and the comma counting-recursion and hence should have much lower error rates in the grammatical condition.<br />
(27) German with commas:<br />
a. SRC: S1 , O2 , O3 V3 , V2 , V1 O1<br />
b. ORC: S1 , S2 , S3 V3 , V2 , V1 O1<br />
(28) Example test sentences:<br />
a. der Polizist , den der Mensch , den der Polizist verspottet , ruft , verspottet<br />
den Jungen . (no-drop)<br />
b. der Polizist , den der Junge , den der Polizist verspottet , ruft den Menschen<br />
. (drop-V2)<br />
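The point that German SRCs and ORCs share a single comma pattern, while the English patterns diverge, can be stated directly from (25) and (27); the patterns below are copied from those examples:<br />

```python
# Category patterns from examples (25) and (27);
# N = noun, V = verb, ',' = clause boundary marker.
english = {
    'SRC': 'N , V N , V N , V N'.split(),    # S1 , V2 O2 , V3 O3 , V1 O1
    'ORC': 'N , N , N V , V , V N'.split(),  # S1 , S2 , S3 V3 , V2 , V1 O1
}
german = {
    'SRC': 'N , N , N V , V , V N'.split(),  # S1 , O2 , O3 V3 , V2 , V1 O1
    'ORC': 'N , N , N V , V , V N'.split(),  # S1 , S2 , S3 V3 , V2 , V1 O1
}

# In German both relative clause types reduce to the same
# counting-recursion pattern, so the comma is a consistent
# structural cue; in English the two types diverge.
assert german['SRC'] == german['ORC']
assert english['SRC'] != english['ORC']
```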
Results <strong>of</strong> 4a<br />
The results of simulation 4a (German with commas) are shown in figure 4.8. There was a dramatic improvement on V2 and V1 compared to English. Interestingly, the comparison by condition did not reveal any difference on the main verb. However, a slight difference between conditions was found on the post-V1 region; this drop-V2 preference was significant (p < 0.001).<br />
Figure 4.8: Simulation 4a: German doubly-embedded ORC with commas. The graphic shows the GPE value on the three verbs and the subsequent region of the grammatical (no-drop) and ungrammatical (drop-V2) condition after two and three epochs of training.<br />
4b: German without commas<br />
Since simulation 3b provided evidence for a comma effect, the removal of commas should make the SRN's predictions more error-prone. The verb-finality regularity in German, however, should still cause better predictions for the grammatical condition in German than in English. Simulation 4b tested SRNs trained on a comma-free German grammar.<br />
Figure 4.9: Simulation 4b: German doubly-embedded ORC without commas. The graphic shows the GPE value on the three verbs and the subsequent region of the grammatical (no-drop) and ungrammatical (drop-V2) condition after two and three epochs of training.<br />
(29) Example test sentences:<br />
a. der Polizist den der Mensch den die Passanten treffen ruft verspottet den<br />
Jungen . (no-drop)<br />
b. der Passant den der Junge den der Polizist ruft beschimpft die Passanten .<br />
(drop-V2)<br />
Results <strong>of</strong> 4b<br />
The GPE values of the simulation involving German without commas (figure 4.9) show a pattern similar to English without commas. In the earlier epoch a drop-V2 preference was found, with a small effect on V1 and a very pronounced effect on the following region. After completed training, V1 and post-V1 show a drop-V2 advantage of similar size. Overall, the presence of commas appears to be the most relevant factor. Surprisingly, the regularity of verb-final structures does not seem to support correct predictions in German more than in English. Rather, the more regular application of commas in German has a strongly facilitating effect on both conditions, slightly more so on the grammatical one. The impact of comma usage and a comparison of the results to human data will be the topic of the next section.<br />
4.4.4 Discussion<br />
The results <strong>of</strong> simulation 3a (English without commas) and 4a (German with commas)<br />
were consistent with empirical studies (Gibson and Thomas, 1999; Christiansen and<br />
MacDonald, 1999; Vasishth et al., 2008), suggest<strong>in</strong>g a difference <strong>in</strong> forgett<strong>in</strong>g behavior<br />
between German and English. A reliable grammaticality preference <strong>in</strong> German as<br />
observed <strong>in</strong> Vasishth et al. (2008) could, however, not be replicated. Simulation 3a is perfectly<br />
consistent with the simulation <strong>of</strong> the forgett<strong>in</strong>g effect <strong>in</strong> Christiansen and Chater<br />
(1999) and the human data from Vasishth et al. The results prove that not only the<br />
limited grammar used <strong>in</strong> Christiansen and Chater (1999) predicts the forgett<strong>in</strong>g effect<br />
but also the more complex grammar used here. The <strong>in</strong>herent architectural constra<strong>in</strong>ts<br />
<strong>of</strong> SRNs predict a forgett<strong>in</strong>g effect <strong>in</strong> English doubly embedded ORCs. The <strong>in</strong>sertion <strong>of</strong><br />
commas <strong>in</strong> simulation 3b had an effect on the predictions, whereas <strong>in</strong> contrast, the study<br />
by Vasishth and colleagues showed no effect. The miss<strong>in</strong>g effect <strong>of</strong> commas <strong>in</strong> their study<br />
could be expla<strong>in</strong>ed by the fact that English readers are not familiar with us<strong>in</strong>g commas<br />
<strong>in</strong> that way. On the other hand, the SRNs were tra<strong>in</strong>ed on corpora conta<strong>in</strong><strong>in</strong>g commas<br />
and, thus, had learned how to use them as structural cues. The rema<strong>in</strong><strong>in</strong>g drop-V2<br />
preference on V1 still shows a certa<strong>in</strong> consistency between the simulation and empirical<br />
data. More importantly, the model makes different predictions <strong>in</strong> German. Simulation<br />
4a shows a similar performance on grammatical and ungrammatical sentences. In comparison,<br />
the Vasishth et al. study found <strong>in</strong> fact faster read<strong>in</strong>g times <strong>in</strong> the grammatical<br />
condition for German readers. This is not predicted by the model, but it is the difference<br />
to English that is important here. Surpris<strong>in</strong>gly, the SRNs tra<strong>in</strong>ed on German without<br />
commas performed no better than the SRNs tra<strong>in</strong>ed on English without commas. This<br />
yields a comma × language <strong>in</strong>teraction. The greater effect <strong>of</strong> commas <strong>in</strong> German is<br />
expla<strong>in</strong>able by the different comma patterns <strong>in</strong> both languages (see examples 25 and<br />
27). So, the different predictions regard<strong>in</strong>g the forgett<strong>in</strong>g effect seem to be caused only<br />
<strong>in</strong>directly by word order regularities. In particular, the consistent center-embedd<strong>in</strong>g (or<br />
count<strong>in</strong>g-recursion) <strong>of</strong> commas <strong>in</strong> German makes them a reliable predictor, whereas this<br />
is not the case <strong>in</strong> English. The word order itself, however, did not have the expected<br />
effect, as experiment 4b shows. Controll<strong>in</strong>g for the comma effect the head-f<strong>in</strong>iteness <strong>of</strong><br />
SRCs <strong>in</strong> the simplified German grammar does not <strong>in</strong>crease the performance on double<br />
center-embedd<strong>in</strong>g on V1 and post-V1. Maybe doubly-embedded RCs are too rare <strong>in</strong><br />
the corpus to cause an effect. Another explanation could be the particularly <strong>in</strong>creased<br />
complexity <strong>in</strong> the prediction <strong>of</strong> German embedded RCs. In a German RC the cues establish<strong>in</strong>g<br />
the agreement <strong>of</strong> the embedded verb are very subtle compared to English.<br />
In English the word order <strong>of</strong> ‘who NP VP’ versus ‘who VP NP’ decides the agreement,<br />
whereas <strong>in</strong> German there are several possible pair<strong>in</strong>gs <strong>of</strong> der, den, and die, that determ<strong>in</strong>e<br />
the verb agreement. Thus, the verb agreement prediction requires a complex<br />
representation <strong>of</strong> previous <strong>in</strong>put. Given the architectural limits <strong>of</strong> the network, highly<br />
complex representations are <strong>in</strong> a trade<strong>of</strong>f with memory span. That means that distant<br />
dependencies and verb predictions are very hard to ma<strong>in</strong>ta<strong>in</strong>. In other words the trace<br />
80
4.5 Conclusion<br />
<strong>of</strong> the VP-predict<strong>in</strong>g NP <strong>in</strong> the representational cycle <strong>of</strong> the SRN decays faster. This <strong>in</strong><br />
turn is compensated by <strong>in</strong>creased tra<strong>in</strong><strong>in</strong>g on center-embedd<strong>in</strong>g compared to the English<br />
simulation, result<strong>in</strong>g <strong>in</strong> comparable error values when tra<strong>in</strong>ed without commas. This<br />
is, <strong>of</strong> course, an ad-hoc hypothesis and needs further <strong>in</strong>vestigation, which is beyond the<br />
scope <strong>of</strong> this thesis.<br />
4.5 Conclusion<br />
This thesis investigated the explanatory power of a particular implementation of the experience account. The well-established SRN modeling approach of MacDonald and Christiansen (2002) was adopted to test its predictions on two phenomena currently discussed in the literature: the RC extraction type preference in Mandarin and the forgetting effect in complex center-embedding. At first, the two problems were approached theoretically, reviewing results of empirical studies and discussing potential predictions of available theories. Concerning the Mandarin relative clauses, the studies showed exceptionally mixed results. However, an observed object advantage always appeared on the RC region, whereas a subject advantage was found only on the relativizer/head noun region. That fact and the experiment by Qiao and Forster (2008) suggest that Mandarin Chinese might have to be counted as an exception to a universal subject preference. The results for the forgetting effect, on the other hand, were very clear and best explained by language-specific experience.<br />
In chapter 3 the simple recurrent network was introduced and its properties were discussed. An SRN is a very simple and domain-unspecific model, but it accounts for the three necessities introduced at the beginning of this thesis: a) biological factors (architectural limits), b) continuous gradedness in performance, and c) experience.<br />
In chapter 4 the experience-theory predictions regarding the two sample problems were assessed in practice. Just as in the discussion and critique of MC02, the simulation results presented here looked promising at first sight, but sub-experiments and detailed data analysis revealed considerable inconsistencies with respect to human data. The Mandarin RC simulation predicted an object preference, but the location of the effect was not consistent with human data. In addition, the second simulation demonstrated that the regularity effect was very weak. It becomes clear that the training material must be carefully chosen in order to guarantee comparability with other simulations and empirical studies. The forgetting effect was predicted to be present in English but not in German, consistent with human data. However, further simulations revealed comma insertion to be the most important factor.<br />
Of course, it has to be clear that a simple network trained on a simple grammar does not learn the same constraints as humans do. These simulations are rather approximations pointing in a certain direction. A noticeable problem of the SRN predictions is their dependence on local coherence, which can also be described as a low memory span. This is, however, mainly a consequence of the specific properties of the learning mechanism and the context loop. As mentioned in the previous chapter, there are other learning mechanisms that can increase the span, although they may be cognitively unmotivated. Interestingly, however, there is evidence that even human readers rely on local coherence in certain structures (Tabor et al., 2004). Another finding is that the simulations reported in Christiansen and Chater (1999), as well as the comma results in simulations 3 and 4 presented here, show that the SRN handles counting recursion better than the other recursion types. That may be the reason for the strong facilitating effect of comma insertion compared to verb-finality. In this connection it should be noted that Rodriguez (2001) claims that SRNs can in fact carry out explicit symbolic counting procedures.<br />
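To make the term concrete: the context loop is simply the copy-back of the previous hidden state that an Elman-style SRN feeds in alongside the current input. A minimal forward-pass sketch follows; the layer sizes and random weights are arbitrary illustrations, not the parameters used in the simulations reported here:<br />

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hid, n_out = 10, 5, 10           # arbitrary layer sizes
W_ih = rng.normal(size=(n_hid, n_in))    # input   -> hidden
W_ch = rng.normal(size=(n_hid, n_hid))   # context -> hidden (the loop)
W_ho = rng.normal(size=(n_out, n_hid))   # hidden  -> output (next-word prediction)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def srn_step(x, context):
    """One forward step: hidden state depends on the current input
    AND the previous hidden state fed back as context."""
    hidden = sigmoid(W_ih @ x + W_ch @ context)
    output = sigmoid(W_ho @ hidden)
    return output, hidden                # hidden is copied back as next context

context = np.zeros(n_hid)
for t in range(3):                       # feed a three-word category sequence
    x = np.zeros(n_in)
    x[t] = 1.0                           # one-hot coded word category
    prediction, context = srn_step(x, context)
```

Because the only record of earlier words is this recycled hidden vector, representational complexity and memory span necessarily compete, which is the trade-off discussed in the previous section.<br />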
This work argued for a uniform account of individual and language-specific differences as well as language-independent processing skill. All three can to a considerable extent be attributed to experience with the individual linguistic environment in interaction with architectural preconditions. It can be concluded that much work is necessary before fine-grained experience-based predictions can be obtained for the highly complex task of sentence comprehension. Nevertheless, the literature shows a promising trend towards PDP models of language comprehension, accompanied by the integration of corpus analyses and acquisition data.<br />
Bibliography<br />
J. R. Anderson and C. Lebiere. The Atomic Components of Thought. Lawrence Erlbaum Associates, 1998.<br />
J. R. Anderson, D. Bothell, M. D. Byrne, S. Douglass, C. Lebiere, and Y. Qin. An integrated theory of the mind. Psychological Review, 111(4):1036–1060, 2004.<br />
E. Bach, C. Brown, and W. Marslen-Wilson. Crossed and nested dependencies in German and Dutch: A psycholinguistic study. Language and Cognitive Processes, 1(4):249–262, 1986.<br />
A. D. Baddeley. Human Memory: Theory and Practice. Psychology Press, 1997.<br />
R. C. Berwick and A. S. Weinberg. The Grammatical Basis of Linguistic Performance. MIT Press, 1984.<br />
T. G. Bever. The cognitive basis for linguistic structures. In Cognition and the Development of Language, 279, 1970.<br />
M. S. Blaubergs and M. D. S. Braine. Short-term memory limitations on decoding self-embedded sentences. Journal of Experimental Psychology, 102(4):745–748, 1974.<br />
A. L. Blumenthal. Observations with self-embedded sentences. Psychonomic Science, 6(10):453–454, 1966.<br />
J. K. Bock. An effect of the accessibility of word forms on sentence structures. Journal of Memory and Language, 26(2):119–137, 1987.<br />
D. Caplan, S. Vijayan, G. Kuperberg, C. West, G. Waters, D. Greve, and A. M. Dale. Vascular responses to syntactic processing: Event-related fMRI study of relative clauses. Human Brain Mapping, 15(1):26–38, 2002.<br />
N. Chomsky. Lectures on Government and Binding: The Pisa Lectures. Studies in Generative Grammar. Foris Publications, 1981.<br />
N. Chomsky. Aspects of the Theory of Syntax. MIT Press, Cambridge, 1965.<br />
N. Chomsky. Syntactic Structures. Mouton, The Hague, 1957.<br />
M. H. Christiansen. The (non)necessity of recursion in natural language. In Proceedings of the Fourteenth Annual Conference of the Cognitive Science Society. Lawrence Erlbaum Associates, 1992.<br />
M. H. Christiansen and N. Chater. Toward a connectionist model of recursion in human linguistic performance. Cognitive Science, 23(2):157–205, 1999.<br />
M. H. Christiansen and M. C. MacDonald. Processing of recursive sentence structure: Testing predictions from a connectionist model. Manuscript in preparation, 1999.<br />
L. Cohen and J. Mehler. Click monitoring revisited: An on-line study of sentence comprehension. Memory & Cognition, 24(1):94–102, 1996.<br />
M. Corley and S. Corley. Cross-linguistic and intra-linguistic evidence for the use of statistics in human sentence processing. Unpublished manuscript, University of Exeter, 1995.<br />
F. Cuetos, D. Mitchell, and M. Corley. Parsing in different languages. In Language Processing in Spanish, pages 145–187, 1996.<br />
M. Daneman and P. A. Carpenter. Individual differences in working memory and reading. Journal of Verbal Learning and Verbal Behavior, 19(4):450–466, 1980.<br />
G. S. Dell and P. G. O’Seaghdha. Stages of lexical access in language production. Cognition, 42(1-3):287–314, 1992.<br />
G. S. Dell, L. K. Burger, and W. R. Svec. Language production and serial order: A functional analysis and a model. Cognitive Modeling, 2002.<br />
T. Desmet and E. Gibson. Disambiguation preferences and corpus frequencies in noun phrase conjunction. Journal of Memory and Language, 49(3):353–374, 2003.<br />
J. A. Van Dyke and R. L. Lewis. Distinguishing effects of structure and decay on attachment and repair: A cue-based parsing account of recovery from misanalyzed ambiguities. Journal of Memory and Language, 49(3):285–316, 2003.<br />
J. A. Van Dyke and B. McElree. Retrieval interference in sentence comprehension. Journal of Memory and Language, 55(2):157–166, 2006.<br />
S. F. Ehrlich and K. Rayner. Contextual effects on word perception and eye movements during reading. Journal of Verbal Learning and Verbal Behavior, 20:641–655, 1981.<br />
J. L. Elman. Tlearn simulator. Software available at http://crl.ucsd.edu/innate/tlearn.html, 1992.<br />
J. L. Elman. Finding structure in time. Cognitive Science, 14(2):179–211, 1990.<br />
P. Erdmann. Ist das Deutsche eine SOV-Sprache? Zeitschrift für deutsche Sprache und Literatur, 1990.<br />
F. Ferreira. The misinterpretation of noncanonical sentences. Cognitive Psychology, 47(2):164–203, 2003.<br />
M. Ford. A method for obtaining measures of local parsing complexity throughout sentences. Journal of Verbal Learning and Verbal Behavior, 22:203–218, 1983.<br />
K. I. Forster, C. Guerrera, and L. Elliot. The maze task: Measuring forced incremental sentence processing time. Manuscript submitted for publication, 2008.<br />
U. Frauenfelder and J. Segui. Monitoring around the relative clause. Journal of Verbal Learning and Verbal Behavior, 19:328–337, 1980.<br />
L. Frazier. Syntactic processing: Evidence from Dutch. Natural Language and Linguistic Theory, 5(4):519–559, 1987.<br />
L. Frazier. On Comprehending Sentences: Syntactic Parsing Strategies. PhD thesis, University of Connecticut, 1979.<br />
L. Frazier. Syntactic complexity. In D. R. Dowty, L. Kartunnen, and A. M. Zwicky, editors, Natural Language Parsing: Psychological, Computational, and Theoretical Perspectives, pages 129–189. Cambridge University Press, 1985.<br />
L. Frazier and C. Clifton. Successive cyclicity in the grammar and the parser. Language and Cognitive Processes, 4:93–126, 1989.<br />
L. Frazier and C. Clifton. Construal. MIT Press, 1996.<br />
L. Frazier and G. B. Flores d’Arcais. Filler-driven parsing: A study of gap filling in Dutch. Journal of Memory and Language, 28:331–344, 1989.<br />
L. Frazier, C. Clifton, and J. Randall. Filling gaps: Decision principles and structure in sentence comprehension. Cognition, 13(2):187–222, 1983.<br />
E. Gibson. The dependency locality theory: A distance-based theory of linguistic complexity. In Image, Language, Brain: Papers from the First Mind Articulation Project Symposium, pages 95–126, 2000.<br />
E. Gibson. Linguistic complexity: Locality of syntactic dependencies. Cognition, 68(1):1–76, 1998.<br />
E. Gibson and C. T. Schütze. Disambiguation preferences in noun phrase conjunction do not mirror corpus frequency. Journal of Memory and Language, 40(2):263–279, 1999.<br />
E. Gibson and J. Thomas. Memory limitations and structural forgetting: The perception of complex ungrammatical sentences as grammatical. Language and Cognitive Processes, 14(3):225–248, 1999.<br />
E. Gibson, T. Desmet, D. Grodner, D. Watson, and K. Ko. Reading relative clauses in English. Cognitive Linguistics, 16(2):313–353, 2005a.<br />
E. Gibson, K. Nakatani, and E. Chen. Distinguishing theories of syntactic storage cost in sentence comprehension: Evidence from Japanese. To appear, 2005b.<br />
P. C. Gordon, R. Hendrick, and M. Johnson. Memory interference during language processing. Journal of Experimental Psychology: Learning, Memory, and Cognition, 27(6):1411–1423, 2001.<br />
P. C. Gordon, R. Hendrick, and W. H. Levine. Memory-load interference in syntactic processing. Psychological Science, 13(5):425–430, 2002.<br />
P. C. Gordon, R. Hendrick, and M. Johnson. Effects of noun phrase type on sentence complexity. Journal of Memory and Language, 51(1):97–114, 2004.<br />
P. C. Gordon, R. Hendrick, M. Johnson, and Y. Lee. Similarity-based interference during language comprehension: Evidence from eye tracking during reading. Journal of Experimental Psychology: Learning, Memory, and Cognition, 32(6):1304, 2006.<br />
A. Gouvea. Processing Syntactic Complexity: Cross-linguistic Differences and ERP Evidence. PhD thesis, University of Maryland, College Park, 2003.<br />
J. H. Greenberg. Some universals of grammar with particular reference to the order of meaningful elements. In J. H. Greenberg, editor, Universals of Language, pages 73–113. MIT Press, London, 1963.<br />
D. Grodner and E. Gibson. Consequences of the serial nature of linguistic input for sentential complexity. Cognitive Science, 29(2):261–290, 2005.<br />
J. Hale. A probabilistic Earley parser as a psycholinguistic model. In Proceedings of the North American Chapter of the Association for Computational Linguistics, pages 1–8, 2001.<br />
B. Hemforth, L. Konieczny, and C. Scheepers. Syntactic attachment and anaphor resolution: Two sides of relative clause attachment. In Architectures and Mechanisms for Language Processing, pages 259–281, 2000.<br />
R. N. A. Henson. Short-term memory for serial order: The Start-End model. Cognitive Psychology, 36(2):73–137, 1998.<br />
V. M. Holmes and J. K. O’Regan. Eye fixation patterns during the reading of relative clause sentences. Journal of Verbal Learning and Verbal Behavior, 20(4):1, 1981.<br />
F. Hsiao and E. Gibson. Processing relative clauses in Chinese. Cognition, 90(1):3–27, 2003.

C.-C. Hsu, F. Hurewitz, and C. Phillips. Contextual and syntactic cues for head-final relative clauses in Chinese. In The 19th Annual CUNY Conference on Human Sentence Processing, New York, NY, 2006.

T. Ishizuka, K. Nakatani, and E. Gibson. Processing Japanese relative clauses in context. In The 19th Annual CUNY Conference on Human Sentence Processing, CUNY Graduate Center, New York, NY, 2006.

T. F. Jaeger, E. Fedorenko, P. Hofmeister, and E. Gibson. Expectation-based syntactic processing: Antilocality outside of head-final languages. CUNY Sentence Processing Conference, North Carolina, 2008.

C. Juliano and M. K. Tanenhaus. A constraint-based lexicalist account of the subject/object attachment preference. Journal of Psycholinguistic Research, 23(6):459–471, 1994.

D. Jurafsky. A probabilistic model of lexical and syntactic access and disambiguation. Cognitive Science, 20(2):137–194, 1996.

M. A. Just and P. A. Carpenter. A capacity theory of comprehension: Individual differences in working memory. Psychological Review, 99(1):122–149, 1992.

M. A. Just and P. A. Carpenter. A capacity theory of comprehension: Individual differences in working memory. In Cognitive Modeling, pages 131–177, 2002.

M. A. Just and S. Varma. A hybrid architecture for working memory: Reply to MacDonald and Christiansen (2002). Psychological Review, 109(1):55–65, 2002.

M. A. Just, P. A. Carpenter, and J. D. Woolley. Paradigms and processes in reading comprehension. Journal of Experimental Psychology: General, 111(2):228–238, 1982.

M. A. Just, P. A. Carpenter, T. A. Keller, W. F. Eddy, and K. R. Thulborn. Brain activation modulated by sentence comprehension. Science, 274(5284):114, 1996.

A. Kawamoto. Nonlinear dynamics in the resolution of lexical ambiguity: A parallel distributed processing account. Journal of Memory and Language, 32(4):474–516, 1993.

E. L. Keenan and B. Comrie. Noun phrase accessibility and universal grammar. Linguistic Inquiry, pages 63–99, 1977.

J. King and M. A. Just. Individual differences in syntactic processing: The role of working memory. Journal of Memory and Language, 30(5):580–602, 1991.
J. W. King and M. Kutas. Who did what and when? Using word- and clause-level ERPs to monitor working memory usage in reading. Journal of Cognitive Neuroscience, 7(3):376–395, 1995.

L. Konieczny. Locality and parsing complexity. Journal of Psycholinguistic Research, 29(6):627–645, 2000.

L. Konieczny and N. Ruh. What's in an error? A reply to MacDonald and Christiansen (2002). Manuscript submitted, University of Freiburg, 2003.

K. Kuo and S. Vasishth. Processing Chinese relative clauses: Evidence for the universal subject preference. Manuscript submitted, 2007.

N. Kwon, M. Polinsky, and R. Kluender. Processing of relative clause sentences in Korean. Poster presented at the AMLaP Conference, 2004.

W. Larkin and D. Burns. Sentence comprehension and memory for embedded structure. Memory and Cognition, 5(1):17–22, 1977.

R. Levy. Expectation-based syntactic comprehension. Cognition, 106(3):1126–1177, 2008.

R. Lewis. A theory of grammatical but unacceptable embeddings. Journal of Psycholinguistic Research, 25:93–116, 1996.

R. L. Lewis and S. Vasishth. An activation-based model of sentence processing as skilled memory retrieval. Cognitive Science, 29:1–45, May 2005.

R. L. Lewis, S. Vasishth, and J. Van Dyke. Computational principles of working memory in sentence comprehension. Trends in Cognitive Sciences, 10(10):447–454, October 2006.

C. C. Lin. The psychological reality of head-final relative clauses. Paper presented at the International Workshop on Relative Clauses, Academia Sinica, Taipei, 2007.

C. C. Lin and T. G. Bever. Chinese is no exception: Universal subject preference of relative clause processing. Paper presented at The 19th Annual CUNY Conference on Human Sentence Processing, CUNY Graduate Center, New York, NY, 2006a.

C. C. Lin and T. G. Bever. Subject preference in the processing of relative clauses in Chinese. In D. Baumer, D. Montero, and M. Scanlon, editors, Proceedings of the 25th West Coast Conference on Formal Linguistics, pages 254–260. Cascadilla Proceedings Project, Somerville, MA, 2006b.

C.-J. C. Lin and T. G. Bever. Processing head-final relative clauses without garden paths. Paper presented at the International Conference on Processing Head-Final Structures, Rochester Institute of Technology, Rochester, NY, September 21–22, 2007.
C. J. C. Lin, S. Fong, and T. G. Bever. Constructing filler-gap dependencies in Chinese possessor relative clauses. In Proceedings of PACLIC, 2005.

Y. Lin and S. Garnsey. Plausibility and the resolution of temporary ambiguity in relative clause comprehension in Mandarin. In Proceedings of the 20th Annual CUNY Conference on Human Sentence Processing, 2007.

M. C. MacDonald and M. H. Christiansen. Reassessing working memory: Comment on Just and Carpenter (1992) and Waters and Caplan (1996). Psychological Review, 109(1):35–54, 2002.

M. C. MacDonald, N. J. Pearlmutter, and M. S. Seidenberg. Lexical nature of syntactic ambiguity resolution. Psychological Review, 101(4):676–703, October 1994.

B. MacWhinney. Starting points. Language, 53(1):152–168, 1977.

B. MacWhinney. Basic syntactic processes. Language Development, 1:73–136, 1982.

B. MacWhinney and C. Pleh. The processing of restrictive relative clauses in Hungarian. Cognition, 29(2):95–141, 1988.

W. M. Mak, W. Vonk, and H. Schriefers. The influence of animacy on relative clause processing. Journal of Memory and Language, 47(1):50–68, 2002.

Y. Matsumoto. Noun-modifying Constructions in Japanese: A Frame-semantic Approach. John Benjamins, 1997.

J. L. McClelland and J. L. Elman. The TRACE Model of Speech Perception. Center for Research in Language, University of California, San Diego, La Jolla, 1984.

A. Mecklinger, H. Schriefers, K. Steinhauer, and A. D. Friederici. Processing relative clauses varying on syntactic and semantic dimensions: An analysis with event-related potentials. Memory & Cognition, 23(4):477–494, 1995.

D. C. Mitchell, F. Cuetos, M. M. B. Corley, and M. Brysbaert. Exposure-based models of human parsing: Evidence for the use of coarse-grained (nonlexical) statistical records. Journal of Psycholinguistic Research, 24(6):469–488, 1995.

E. Miyamoto and M. Nakamura. Subject/object asymmetries in the processing of relative clauses in Japanese. In Proceedings of WCCFL, volume 22, pages 342–355, 2003.

X. Qiao and K. I. Forster. Object relatives are easier than subject relatives in Chinese. In Proceedings of the AMLaP Conference, 2008.

D. S. Race and M. C. MacDonald. The use of "that" in the production and comprehension of object relative clauses. In Proceedings of the 25th Annual Meeting of the Cognitive Science Society, pages 946–951, 2003.
F. Reali and M. H. Christiansen. Processing of relative clauses is made easier by frequency of occurrence. Journal of Memory and Language, 57(1):1–23, 2007a.

F. Reali and M. H. Christiansen. Word chunk frequencies affect the processing of pronominal object-relative clauses. The Quarterly Journal of Experimental Psychology, 60(2):161–170, 2007b.

R. Roberts and E. Gibson. Individual differences in sentence memory. Journal of Psycholinguistic Research, 31(6):573–598, November 2002.

P. Rodriguez. Simple recurrent networks learn context-free and context-sensitive languages by counting. Neural Computation, 13, 2001.

D. L. T. Rohde. The Simple Language Generator: Encoding complex languages with simple grammars. Technical Report CMU-CS-99-123, Carnegie Mellon University, Department of Computer Science, 1999.

D. L. T. Rohde. A Connectionist Model of Sentence Comprehension and Production. PhD thesis, Carnegie Mellon University, 2002.

D. E. Rumelhart and J. L. McClelland. On Learning the Past Tenses of English Verbs. Center for Research in Language, University of California, San Diego, La Jolla, 1985.

H. Schriefers, A. D. Friederici, and K. Kuhn. The processing of locally ambiguous relative clauses in German. Journal of Memory and Language, 34(4):499–520, 1995.

M. S. Seidenberg and J. L. McClelland. A distributed, developmental model of word recognition and naming. Psychological Review, 96(4):523–568, 1989.

M. Spivey-Knowlton. Quantitative predictions from a constraint-based theory of syntactic ambiguity resolution. In Proceedings of the 1993 Connectionist Models Summer School, pages 130–137. Lawrence Erlbaum Associates, 1994.

K. Stromswold, D. Caplan, N. Alpert, and S. Rauch. Localization of syntactic comprehension by positron emission tomography. Brain and Language, 52(3):452–473, 1996.

W. Tabor, C. Juliano, and M. K. Tanenhaus. Parsing in a dynamical system: An attractor-based account of the interaction of lexical and structural constraints in sentence processing. Language and Cognitive Processes, 12(2/3):211–271, 1997.

W. Tabor, B. Galantucci, and D. Richardson. Effects of merely local syntactic coherence on sentence processing. Journal of Memory and Language, 50(4):355–370, May 2004.

W. L. Taylor. Cloze procedure: A new tool for measuring readability. Journalism Quarterly, 30(4):415–433, 1953.
R. S. Tomlin. Basic Word Order: Functional Principles. Croom Helm, London, 1986.

M. J. Traxler, R. K. Morris, and R. E. Seely. Processing subject and object relative clauses: Evidence from eye movements. Journal of Memory and Language, 47(1):69–90, July 2002.

S. Vasishth. Integration and prediction in head-final structures. In Processing and Producing Head-Final Structures, 2008.

S. Vasishth and R. L. Lewis. Human language processing: Symbolic models. In K. Brown, editor, Encyclopedia of Language and Linguistics, volume 5, pages 410–419. Elsevier, 2006a.

S. Vasishth and R. L. Lewis. Argument-head distance and processing complexity: Explaining both locality and antilocality effects. Language, 82(4):767–794, 2006b.

S. Vasishth, K. Suckow, R. Lewis, and S. Kern. Short-term forgetting in sentence comprehension: Crosslinguistic evidence from head-final structures. Submitted to Language and Cognitive Processes, 2008.

G. S. Waters and D. Caplan. The capacity theory of sentence comprehension: Critique of Just and Carpenter (1992). Psychological Review, 103(4):761–772, 1996.

J. B. Wells, M. H. Christiansen, D. S. Race, D. J. Acheson, and M. C. MacDonald. Experience and sentence processing: Statistical learning and relative clause comprehension. Cognitive Psychology, 58(2):250–271, 2009.

M. Yoshida, S. Aoshima, and C. Phillips. Relative clause prediction in Japanese. In Proceedings of the 17th Annual CUNY Conference on Human Sentence Processing, College Park, Maryland, 2004.
Appendix A

Statistics
                  SRC                       ORC
region       mean        se           mean        se
N1/V1        0.1746577   0.0568403    0.2183628   0.2644270
de           0.3303811   0.1897248    0.1207579   0.0726347
N2           0.1001319   0.0391884    0.1064459   0.0342768

Table A.1: Statistics for simulation 1
                  SRC                       ORC
region       mean        se           mean        se
N1/V1        0.1319658   0.06477893   0.2172967   0.2631532
de           0.0870274   0.07931393   0.1096769   0.0772652
N2           0.1001319   0.03918843   0.1064459   0.0342768

Table A.2: Statistics for simulation 2
                  drop-V2                   no-drop
region       mean        se           mean        se
V3           0.7976032   0.1091270    0.7977388   0.1090123
V1           0.8639397   0.0581128    0.9801276   0.0140664
post-V1      0.1610184   0.1042171    0.2658784   0.1700682

Table A.3: Statistics for simulation 3a
                  drop-V2                   no-drop
region       mean        se           mean        se
V3           0.1794197   0.0801674    0.1797425   0.08047316
V1           0.6870279   0.0550648    0.7735128   0.05198529
post-V1      0.1365552   0.1125236    0.1522183   0.0880624

Table A.4: Statistics for simulation 3b
                  drop-V2                   no-drop
region       mean        se           mean        se
V3           0.1376888   0.1044064    0.1375375   0.1043905
V1           0.5554193   0.2136721    0.5564368   0.2350018
post-V1      0.3160993   0.122175     0.2946364   0.1392607

Table A.5: Statistics for simulation 4a
                  drop-V2                   no-drop
region       mean        se           mean        se
V3           0.1462594   0.1230590    0.1465719   0.1232603
V1           0.8691347   0.1514235    0.9761057   0.02860683
post-V1      0.3004169   0.2266813    0.4357311   0.1730854

Table A.6: Statistics for simulation 4b
Appendix B

Grammars

B.1 English

(written by Lars Konieczny, 2003)
S : NP VP "." |
    {num1, NP N, VP Vi} |   # number agreement in matrix clause
    {num2, NP N, VP Vt} ;

NP : det N | det N Rel (0.05) |
    {num1, N, Rel SRC VP Vi} |   # number agreement in subject RCs
    {num2, N, Rel SRC VP Vt} ;

Rel : SRC | ORC ;
SRC : that VP ;
ORC : that NP Vt |
    {num2, NP N, Vt} ;   # number agreement in object RCs

VP : Vi | Vt NP ;
N : Nsing | Nplur ;
Vi : VIsing | VIplur ;
Vt : VTsing | VTplur ;

### LEXICON ###

Nsing : lawyer | senator | reporter | banker | judge ;
Nplur : lawyers | senators | reporters | bankers | judges ;
VIsing : lies | lied | hesitates | hesitated | phones |
    phoned | understands | understood ;
VTsing : praises | praised | attacks | attacked |
    phones | phoned | understands | understood ;
VIplur : lie | lied | hesitate | hesitated | phone |
    phoned | understand | understood ;
VTplur : praise | praised | attack | attacked |
    phone | phoned | understand | understood ;
det : the ;

### RULES ###

num1 {   # for intransitive verbs
    Nsing : VIsing;
    Nplur : VIplur;
}
num2 {   # for transitive verbs
    Nsing : VTsing;
    Nplur : VTplur;
}
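The rules above form a small probabilistic context-free grammar with agreement constraints. As an illustration only (this is not the SLG tool itself), the following sketch samples sentences from a simplified version of the grammar: it keeps the core NP/VP and SRC/ORC rules and the 0.05 relative-clause weight, but drops the num1/num2 agreement constraints and adds a depth bound of my own to cap recursion; all function names are hypothetical.

```python
import random

# Simplified lexicon taken from the grammar above (singular forms only).
NSING = ["lawyer", "senator", "reporter", "banker", "judge"]
VI = ["lies", "hesitates", "phones"]
VT = ["praises", "attacks", "understands"]

def np(depth=0):
    # NP : det N | det N Rel (0.05) -- the depth bound caps recursion
    words = ["the", random.choice(NSING)]
    if depth < 2 and random.random() < 0.05:
        words += rel(depth + 1)
    return words

def rel(depth):
    # Rel : SRC | ORC ; SRC : that VP ; ORC : that NP Vt
    if random.random() < 0.5:
        return ["that"] + vp(depth)                    # subject RC
    return ["that"] + np(depth) + [random.choice(VT)]  # object RC

def vp(depth=0):
    # VP : Vi | Vt NP
    if random.random() < 0.5:
        return [random.choice(VI)]
    return [random.choice(VT)] + np(depth)

def sentence():
    # S : NP VP "."
    return " ".join(np() + vp() + ["."])

random.seed(1)
for _ in range(3):
    print(sentence())
```

Such samples (e.g. "the judge praises the lawyer that attacks the senator .") are the kind of input the simulations train on; the real SLG grammars additionally enforce the number-agreement rules listed above.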
B.2 German

(written by Daniel Müller and Lars Konieczny, 2004)
S : NPnom VP "." |
    {numNoun, NPnom Nnom, VP V} |   # number agreement, normal (N-V)
    {numNoun, NPnomx Nnom, V} ;     # number agreement, topicalized

NPnomx : DETnom Nnom RCx |
    {numDET, DETnom, Nnom} |                        # number agreement DET-N
    {numREL, Nnom, RCx RCpure RCsub RELnom} |       # num N-RelPron nom
    {numREL, Nnom, RCx RCpure RCobj RELakk} |       # num N-RelPron akk
    {numNoun, Nnom, RCx RCpure RCsub V} ;           # num N-V, embedded SRC

NPnom : DETnom Nnom RC |
    {numDET, DETnom, Nnom} |                        # number agreement DET-N
    {numREL, Nnom, RC RCpure RCsub RELnom} |        # num N-RelPron nom
    {numREL, Nnom, RC RCpure RCobj RELakk} |        # num N-RelPron akk
    {numNoun, Nnom, RC RCpure RCsub V} ;            # num N-V, embedded SRC

NPakkx : DETakk Nakk RCx |
    {numDET, DETakk, Nakk} |
    {numREL, Nakk, RCx RCpure RCsub RELnom} |
    {numREL, Nakk, RCx RCpure RCobj RELakk} |
    {numNoun, Nakk, RCx RCpure RCsub V} ;

NPakk : DETakk Nakk RC |
    {numDET, DETakk, Nakk} |
    {numREL, Nakk, RC RCpure RCsub RELnom} |
    {numREL, Nakk, RC RCpure RCobj RELakk} |
    {numNoun, Nakk, RC RCpure RCsub V} ;

VP : V NPakkx ;
RCx : "," RCpure | "" (0.9) ;
RC : "," RCpure "," | "" (0.9) ;
RCpure : RCsub | RCobj ;
RCsub : RELnom NPakk V ;
RCobj : RELakk NPnom V |
    {numNoun, NPnom Nnom, V} ;

Nnom : Nnom_pl | Nnom_sing ;
Nakk : Nakk_pl | Nakk_sing ;
V : V_pl | V_sing ;
DETnom : DETnom_pl | DETnom_sing (0.7) ;
DETakk : DETakk_pl | DETakk_sing (0.7) ;
RELnom : RELnom_pl | RELnom_sing ;
RELakk : RELakk_pl | RELakk_sing ;

Nnom_pl : Jungen | Polizisten | Passanten | Menschen ;
Nakk_pl : Jungen | Polizisten | Passanten | Menschen ;
Nnom_sing : Junge | Polizist | Passant | Mensch ;
Nakk_sing : Jungen | Polizisten | Passanten | Menschen ;
V_pl : beschimpfen | treffen | rufen | verspotten ;
V_sing : beschimpft | trifft | ruft | verspottet ;
DETnom_pl : die ;
DETnom_sing : der ;
DETakk_pl : die ;
DETakk_sing : den ;
RELnom_pl : die ;
RELnom_sing : der ;
RELakk_pl : die ;
RELakk_sing : den ;

numNoun {
    Nnom_sing : V_sing;
    Nnom_pl : V_pl;
    Nakk_sing : V_sing;
    Nakk_pl : V_pl;
}
numDET {
    DETnom_pl : Nnom_pl;
    DETakk_pl : Nakk_pl;
    DETnom_sing : Nnom_sing;
    DETakk_sing : Nakk_sing;
}
numREL {
    Nnom_pl : RELnom_pl | RELakk_pl;
    Nakk_pl : RELnom_pl | RELakk_pl;
    Nnom_sing : RELnom_sing | RELakk_sing;
    Nakk_sing : RELnom_sing | RELakk_sing;
}
B.3 Mandarin

S : NP VP END ;
VP : Vt NP | Vi ;
NP : N | Rel N (0.5) ;
Rel : SRC (0.85) | ORC ;
SRC : VP GEN ;
ORC : NP Vt GEN ;
N : Nsing | Nplur ;

### LEXICON ###

Nsing : lushi | guanyuan | fayanren |
    yinhangjia | faguan ;
Nplur : lushimen | guanyuanmen | fayanren |
    yinhangjiamen | faguanmen ;
Vi : sahuang | youyu | dadianhua | lijie ;
Vt : biaoyang | gongji | lijie ;
GEN : de ;
END : "." ;

biaoyang        praise
dadianhua       phone
de              gen
faguan          judge
faguanmen       judges
fayanren        reporter/reporters
gongji          attack
guanyuan        senator
guanyuanmen     senators
lijie           understand
lushi           lawyer
lushimen        lawyers
sahuang         lie
yinhangjia      banker
yinhangjiamen   bankers
youyu           hesitate

Table B.1: Mandarin lexicon
Declaration of Authorship

I hereby declare in lieu of an oath that I have written this thesis without the help of third parties and without the use of any aids other than those specified; all ideas taken directly or indirectly from other sources are identified as such. This thesis has not previously been submitted in the same or a similar form to any other examination authority, and it has not yet been published.

Place, date                                        Signature