
NOTE 

Negative evidence and the raw frequency fallacy* 

ANATOL STEFANOWITSCH 

Introduction 

There is little that is more completely accepted in the conventional wisdom 

of modern linguistics than the assumption that corpora do not 

contain negative evidence and that, therefore, intuition-based acceptability 

judgments are an indispensable part of linguistic methodology. 

This assumption goes back at least to Chomsky’s discussion of grammaticality 

in Syntactic Structures (Chomsky 1957: 15 ff.), whose claims 

can perhaps be excused on the basis that he was writing before the advent 

of modern corpus linguistics. More worrying, however, is that many 

modern corpus linguists still share this assumption. 

For example, in what is otherwise one of the most thorough and most 

thoughtful textbooks on corpus linguistics currently available, McEnery 

and Wilson (2001: 11) ask: 

Without recourse to introspective judgments, how can ungrammatical 

utterances be distinguished from ones that simply haven’t occurred 

yet? If our finite corpus does not contain the sentence: 

*He shines Tony books. 

how do we conclude that it is ungrammatical? 

And without any discussion of potential alternatives they promptly 

give the following answer (McEnery and Wilson 2001: 12): 

It is only by asking a native or expert speaker of a language for their 

opinion of the grammaticality of a sentence that we can hope to differentiate 

unseen but grammatical constructions from those which are

simply ungrammatical. 

They conclude their discussion by stating that “we [corpus linguists] 

must not eschew introspection entirely. If we do, detecting ungrammatical 

structures and ambiguous structures becomes difficult and, indeed, 

may be impossible.” (McEnery and Wilson 2001: 12). 

Corpus Linguistics and Linguistic Theory 2-1 (2006), 61-77 

DOI 10.1515/CLLT.2006.003 

1613-7027/06/0002-0061 

Walter de Gruyter


In this note, I would like to take issue with (a large part of) their 

argument. I will argue that the idea that corpora do not contain negative 

evidence is simply a special case of what I have termed the observed-frequency
(or raw-frequency) fallacy, i. e., the belief that “[o]bserved frequencies
of occurrence represent relevant facts for scientific analysis”
(Stefanowitsch 2005: 296). When approached with the right methodological
tools, corpora do provide negative evidence, i. e., evidence that
allows us, in principle, to distinguish between constructions that did not
occur but could have (these can be referred to as ‘accidentally absent’
structures) and constructions that did not occur and could not have (these can be
referred to as ‘significantly absent’ structures). Thus, while I do agree 

that linguists cannot (and should not) ‘eschew introspection entirely’, I 

will argue that they can (and largely should) eschew introspective judgments 

of acceptability. 

Collostructional analysis and the significance of absence 

In this section, I will address the general issue of how significant absences 

of a particular configuration of linguistic elements can be distinguished 

from accidental ones, using as an example the ‘ability’ or ‘inability’ of 

English verbs to occur with ditransitive complementation. The choice of 

this example is motivated primarily by practical considerations: as will 

presently become clear, the method I will use requires the researcher to 

extract exhaustively from a corpus all occurrences of the grammatical 

phenomenon in question. Ditransitive complementation happens to be 

one of the features that is relatively uncontroversially tagged in the 

largest grammatically annotated balanced corpus currently available, the 

British component of the International Corpus of English (ICE-GB, cf. 

Nelson et al. 2002). However, it is a welcome coincidence that this is 

precisely the complementation pattern that McEnery and Wilson chose 

to demonstrate the need for grammaticality judgments. 1 

The relevant method is one of several that Gries and I have developed 

in a series of publications specifically for the purpose of investigating 

the relationship between grammatical constructions and the words occurring 

in them, and that we refer to collectively as collostructional 

analysis (cf. e. g., Stefanowitsch and Gries 2003, 2005, to appear a; Gries 

and Stefanowitsch 2004a, b, to appear). 2 The most basic of these methods, 

simple collexeme analysis, allows the researcher to identify words 

that occur significantly more or less frequently than expected in a given 

slot of a construction. This is done on the basis of a standard 2-by-2 

contingency table containing four observed frequencies: (a) the frequency 

of a given word in a particular slot of a given construction, (b) 

the frequency of the same word in the corresponding slots of all other
constructions, (c) the frequency of all other words in the relevant slot of
the construction under investigation, and (d) the frequency of all other
words in the corresponding slot of all other constructions. From these
frequencies, we can derive the expected frequency of occurrence of the
word in the construction, which allows us to determine whether and in
what direction the observed frequency deviates from the expected frequency
and whether this deviation is statistically significant. As an example,
consider Table 1, which shows the relevant contingency table for the
verb give and the ditransitive complementation pattern in the British
Component of the International Corpus of English (ICE-GB) (expected
frequencies are shown in parentheses).

Table 1. Give with ditransitive complementation in the ICE-GB

            Ditransitive    ¬Ditransitive    Total
give        560             531              1,091
            (14.57)         (1,076.43)
¬give       1,264           134,196          135,460
            (1,809.43)      (133,650.57)
Total       1,824           134,727          136,551

As Table 1 shows, give occurs vastly more frequently than expected 

with ditransitive complementation; the Fisher-Yates exact test shows that 

this difference is highly significant (p < 4.94e-324, the smallest number 

a typical current home-issue computer can handle). In collostructional 

analysis, we usually take the p-value directly as a measure of association 

strength (cf. Pedersen 1996 and Stefanowitsch and Gries 2003: 238 f. for 

justification). In other words, the extremely small p-value is taken to be 

an indication of an extremely strong association between give and the 

ditransitive complementation pattern. 
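The expected frequencies in Table 1 follow from the marginal totals alone: each expected cell count is its row total multiplied by its column total, divided by the table total. A minimal sketch of that arithmetic is given here; the function name `expected_frequencies` is purely illustrative and is not taken from any published collostructional script.

```python
def expected_frequencies(a, b, c, d):
    """Expected cell counts of a 2-by-2 table under independence.

    a: word in the construction      b: word in other constructions
    c: other words in construction   d: other words in other constructions
    """
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    # Each expected count is (row total * column total) / table total.
    return [[row * col / n for col in col_totals] for row in row_totals]


# Observed frequencies for give and the ditransitive, as in Table 1.
expected = expected_frequencies(560, 531, 1264, 134196)
for row in expected:
    print([round(cell, 2) for cell in row])
```

The observed 560 ditransitive uses of give can then be set against an expected frequency of roughly 14.57.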

Repeating this procedure for all verbs occurring with ditransitive complementation 

in the ICE-GB allows us to rank all verbs first, by whether 

they occur more or less frequently than expected, and second, by association 

strength. Words that occur more frequently than expected are referred 

to as attracted collexemes (the strength of their positive association 

can be referred to as attraction strength), words that occur less frequently 

are referred to as repelled collexemes (with a corresponding repulsion 

strength). For example, all verbs occurring significantly more frequently 

than expected are shown in Table 2. The significance level of 0.05 

was corrected for multiple testing using a simple Bonferroni correction 

(Bonferroni 1936) whereby the significance level is divided by the 

number of tests. Since the ICE-GB contains 4,856 verb types, this gives 

us 0.05/4,856 = 1.03E-05. 3
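The classification into attracted and repelled collexemes can be sketched as follows. This is an illustration under stated assumptions, not the published PerlClx or CollAnalysis code: it assumes that the Fisher-Yates exact test is applied one-tailed in the direction of the deviation, computes the hypergeometric probabilities in log space with the standard library only, and uses illustrative names (`collexeme`, `hypergeom_pmf`, `ALPHA`). The frequencies are those reported for the ICE-GB in this note.

```python
from math import exp, lgamma

CORPUS = 136551        # all complementation patterns (verb tokens) in the ICE-GB
DITRANSITIVE = 1824    # ditransitive tokens in the ICE-GB
ALPHA = 0.05 / 4856    # Bonferroni correction: one test per verb type


def log_choose(n, k):
    """Natural log of the binomial coefficient C(n, k)."""
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)


def hypergeom_pmf(x, word_freq):
    """P(X = x): the verb fills exactly x of the ditransitive slots by chance."""
    return exp(log_choose(DITRANSITIVE, x)
               + log_choose(CORPUS - DITRANSITIVE, word_freq - x)
               - log_choose(CORPUS, word_freq))


def collexeme(word_freq, observed):
    """Return (label, expected frequency, one-tailed exact p-value)."""
    expected = word_freq * DITRANSITIVE / CORPUS
    if observed >= expected:
        label = "attracted"   # upper tail: this many occurrences or more
        xs = range(observed, min(word_freq, DITRANSITIVE) + 1)
    else:
        label = "repelled"    # lower tail: this few occurrences or fewer
        xs = range(0, observed + 1)
    return label, expected, sum(hypergeom_pmf(x, word_freq) for x in xs)


# (corpus frequency, frequency in the ditransitive) for four verbs
verbs = {"give": (1091, 560), "hand": (16, 5), "make": (1865, 3), "keep": (374, 1)}
results = {verb: collexeme(freq, obs) for verb, (freq, obs) in verbs.items()}
for verb, (label, expected, p) in results.items():
    flag = "significant" if p < ALPHA else "not significant after correction"
    print(f"{verb:5} {label:9} expected={expected:6.2f} p={p:.2E} {flag}")
```

For give, the probabilities are smaller than a double-precision float can represent, so the sketch reports a p-value of zero.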


Table 2. Significantly attracted collexemes of the ditransitive in the ICE-GB

Collexeme    F(Corpus)   F_O(Ditr)   F_E(Ditr)   FYE p-value
give         1,091       560         14.57       0.00E+000
tell         792         493         10.58       0.00E+000
send         295         78          3.94        4.13E-076
ask          504         92          6.73        9.65E-074
show         628         84          8.39        5.15E-056
offer        196         54          2.62        3.73E-054
convince     32          23          0.43        1.70E-036
cost         65          23          0.87        9.04E-027
inform       55          20          0.73        9.57E-024
teach        92          23          1.23        7.94E-023
assure       19          13          0.25        1.04E-020
remind       41          16          0.55        7.25E-020
lend         31          12          0.41        3.48E-015
promise      43          12          0.57        3.26E-013
owe          25          9           0.33        2.24E-011
grant        26          9           0.35        3.38E-011
warn         38          10          0.51        5.94E-011
award        16          7           0.21        7.72E-010
persuade     33          8           0.44        1.03E-008
allow        326         20          4.35        2.59E-008
guarantee    27          7           0.36        5.27E-008
deny         51          8           0.68        3.82E-007
earn         56          8           0.75        8.03E-007
hand         16          5           0.21        1.63E-006
pay          395         18          5.28        8.66E-006
give back    4           3           0.05        9.42E-006

The list of verbs in this table could now serve as a basis for a variety 

of observations, for example about the meaning of the ditransitive complementation 

pattern. I will not pursue this issue here (but cf. Stefanowitsch 

and Gries 2003, Section 3.2.2). 4 Instead, let me point out two 

facts about the way that the label ‘ditransitive’ is applied in the ICE-GB. 

First, structures with nominal and with clausal direct objects are included 

under this label (i. e., uses like She told me that she wants to be
free of lawyers and doctors [ICE-GB s2a062 133] or I told him to drive
the forklift truck [ICE-GB s2a067 050] as well as the more obvious I’ve
told you the truth [ICE-GB w2f006 213]). Second, some verbs are
tagged as ditransitive whose second object might be better analyzed as
an oblique argument (e. g., cost, as in It cost them three quid [ICE-GB
s1a007 054]). 5 In other words, the label is applied rather generously. 

Next, consider Table 3, which shows the significantly repelled collexemes, 

sorted by repulsion strength (only the first two are significant at 

corrected levels).


Table 3. Significantly repelled collexemes of the ditransitive in the ICE-GB

Collexeme    F(Corpus)   F_O(Ditr)   F_E(Ditr)   FYE p-value
make         1,865       3           24.91       3.39E-008
do           2,937       12          39.23       2.56E-007
find         854         2           11.41       7.96E-004
call         616         1           8.23        2.32E-003
keep         374         1           5.00        3.95E-002

One could now ask why verbs that occur in a given construction might 

do so less frequently than expected. There are several reasons, some of 

them more interesting than others. First, verbs may appear on this list 

because they are incorrectly tagged (in this case, call, which is tagged as 

‘ditransitive’ in the utterance And the person who’s being called [ICE- 

GB s1a030 003]). Obviously, such incorrect tags are hard to eliminate 

completely once a corpus reaches a certain size. Second, some verbs 

appear on this list because their ditransitive uses are very restricted, in 

some cases to a single fixed expression (in this case, keep s. o. company). 

Finally, most verbs appear on this list because they occur very frequently 

with other complementation patterns (this is most obvious for the high-frequency

verbs make and do, but it is also true of find). What one can 

take away from a discussion of such cases is, first, that fixed expressions 

must be taken into account in any linguistic analysis, and second, that 

complementation patterns exhibit a certain amount of productivity, occurring 

at least occasionally with verbs whose dominant patterns are 

others (both facts are unsurprising from the perspective of construction 

grammar, in which collostructional analysis first developed). 

However, the data in Table 3 do not speak directly to the issue of 

negative evidence yet: a further step is necessary. In our previous work, 

we have referred to as ‘repelled’ only those words which do occur in a 

given construction but do so less frequently than expected; however, as 

we noted in passing in our first paper (cf. Stefanowitsch and Gries 2003: 

238), it is possible and perhaps logical to include in this category 

words that would have been expected to occur in the construction based 

on their overall frequency in the corpus, but did not, in fact, occur in 

the construction at all. This is the step that finally takes us to the issue 

of negative evidence: The range of frequencies of occurrence that can be 

evaluated for statistical significance includes the limiting case of zero; in 

other words, the non-occurrence of a particular configuration of linguistic 

categories (for example, of a particular verb in a particular construction) 

can be compared to its expected frequency of occurrence. This will
allow us, in many cases, to determine whether an unseen construction is
likely to be a possible construction of a language or not.

Table 4. Three verbs that do not occur with ditransitive complementation in the ICE-GB

a. say
            Ditransitive    ¬Ditransitive    Total
say         0               3,333            3,333
            (44.52)
¬say        1,824           131,394          133,218
Total       1,824           134,727          136,551

b. explain
            Ditransitive    ¬Ditransitive    Total
explain     0               172              172
            (2.30)
¬explain    1,824           134,555          136,379
Total       1,824           134,727          136,551

c. whisper
            Ditransitive    ¬Ditransitive    Total
whisper     0               5                5
            (0.07)
¬whisper    1,824           134,722          136,546
Total       1,824           134,727          136,551

Consider Table 4, which shows the contingency tables for three verbs 

that do not occur with ditransitive complementation in the ICE-GB, say, 

explain, and whisper. 

On a priori grounds, we might expect all three verbs to allow ditransitive 

complementation, since they are all reasonably close in meaning to 

one of the most strongly attracted collexemes of this pattern, tell (and 

other verbs of communication occurring among the significantly 

attracted collexemes; e. g. ask, inform, teach, assure). On the other hand, 

they are textbook cases in the linguistic literature of verbs not allowing 

ditransitive complementation (cf. e. g., Pinker 1989). 

Table 4a provides conclusive evidence that the linguistic literature is 

right in the case of say, whose repulsion strength meets the corrected
level of significance (p = 1.96E-20; < 1.03E-05). We can confidently
claim that the combination [say + ditransitive] is significantly absent. In
the case of explain, the repulsion strength does not meet even the uncorrected
level (p = 0.099; > 0.05), although it is not too far off. It is simply
not frequent enough in the ICE-GB to let us determine whether its non-occurrence
is accidental or significant, although its marginal statistical
significance may lead us to suspect the latter. No such suspicion would
be warranted in the case of whisper, whose non-occurrence is well within
the range of accidental variation (p = 0.935; > 0.05). 
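For an observed frequency of zero, the one-tailed exact test reduces to a single hypergeometric probability: the chance that none of the verb's n tokens falls among the K ditransitive tokens in a corpus of N tokens, i. e., C(N - K, n) / C(N, n). The following sketch recomputes the three p-values from the figures in Table 4; the function name `p_absent` is hypothetical, and the use of log-gamma arithmetic is simply one convenient way to avoid overflow.

```python
from math import exp, lgamma


def p_absent(word_freq, construction_freq=1824, corpus_size=136551):
    """Probability of zero co-occurrences under independence.

    Hypergeometric P(X = 0) = C(N - K, n) / C(N, n), evaluated in log space:
    (N - K)! (N - n)! / ((N - K - n)! N!).
    """
    n, k, total = word_freq, construction_freq, corpus_size
    log_p = (lgamma(total - k + 1) + lgamma(total - n + 1)
             - lgamma(total - k - n + 1) - lgamma(total + 1))
    return exp(log_p)


# Corpus frequencies of the three verbs in the ICE-GB (Table 4).
p_values = {verb: p_absent(freq)
            for verb, freq in {"say": 3333, "explain": 172, "whisper": 5}.items()}
for verb, p in p_values.items():
    print(f"{verb:8} p = {p:.3g}")
```

Only say falls below the corrected significance level of 1.03E-05; explain and whisper do not reach even the uncorrected level of 0.05.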

Before discussing these issues any further, let us take a look at the 

results we get when we apply simple collexeme analysis exploratively to 

all verbs that occur in the ICE-GB but not in the ditransitive. There are 

4,856 verb types in the ICE-GB (according to my definition, which lists 

phrasal verbs as separate types, see Footnote 4). Of these, 4,782 do not 

occur in the ditransitive. In turn, this non-occurrence is significant only 

for 53 verbs (of which only 11 meet the corrected level of significance). 

Table 5 shows the significantly repelled collexemes. 

Table 5. Significantly repelled collexemes of the ditransitive in the ICE-GB

Collexeme    F(Corpus)   F_O(Ditr)   F_E(Ditr)   FYE p-value
be           25,416      0           340.00      4.29E-165
be|have      6,261       0           83.63       3.66E-038
have         4,303       0           57.48       2.90E-026
think        3,335       0           44.55       1.90E-020
say          3,333       0           44.52       1.96E-020
know         2,120       0           28.32       3.32E-013
see          1,971       0           26.33       2.54E-012
go           1,900       0           25.38       6.69E-012
want         1,256       0           16.78       4.27E-008
use          1,222       0           16.32       6.77E-008
come         1,140       0           15.23       2.06E-007
look         1,099       0           14.68       3.59E-007

Significant at uncorrected significance levels:

try          749         0           10.00       4.11E-005
mean         669         0           8.94        1.21E-004
work         646         0           8.63        1.65E-004
like         600         0           8.01        3.08E-004
feel         593         0           7.92        3.38E-004
become       577         0           7.71        4.20E-004
happen       523         0           6.99        8.70E-004
put          513         0           6.85        9.96E-004
talk         490         0           6.55        1.36E-003
hear         483         0           6.45        1.49E-003
need         420         0           5.61        3.49E-003
believe      397         0           5.30        4.76E-003
provide      380         0           5.08        5.99E-003
live         378         0           5.05        6.16E-003
remember     373         0           4.98        6.59E-003
produce      328         0           4.38        1.21E-002
speak        323         0           4.31        1.29E-002
hope         316         0           4.22        1.42E-002
run          309         0           4.13        1.56E-002
change       306         0           4.09        1.63E-002
meet         303         0           4.05        1.69E-002
help         301         0           4.02        1.74E-002
start        294         0           3.93        1.91E-002
move         291         0           3.89        1.99E-002
seem         285         0           3.81        2.16E-002
agree        279         0           3.73        2.34E-002
lead         271         0           3.62        2.60E-002
expect       265         0           3.54        2.82E-002
consider     264         0           3.53        2.86E-002
suggest      259         0           3.46        3.06E-002
describe     259         0           3.46        3.06E-002
decide       259         0           3.46        3.06E-002
understand   250         0           3.34        3.46E-002
hold         249         0           3.33        3.50E-002
require      244         0           3.26        3.75E-002
involve      242         0           3.23        3.85E-002
suppose      241         0           3.22        3.90E-002
include      236         0           3.15        4.17E-002
occur        233         0           3.11        4.35E-002
develop      233         0           3.11        4.35E-002
go on        231         0           3.09        4.46E-002
follow       227         0           3.03        4.71E-002

Two things about this table require discussion. First, it demonstrates 

that even a one-million-word corpus is too small to allow us to identify 

significant absences for more than a handful of cases (at least for a 

relatively rare pattern such as ditransitive complementation). I will discuss 

this problem in the remainder of this section and in the next section. 

Second, the results only tell us that a particular structure is significantly

absent; they do not, as pointed out in the introduction, tell us why it is 

significantly absent. I will return to this problem in the final section. 
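The corpus-size problem can be made concrete by asking how frequent a verb must be before its complete absence from the ditransitive can reach significance at all. The sketch below searches for that minimum frequency, given the construction frequency and corpus size of the ICE-GB; it is an illustration only, with hypothetical function names, and is not the zclx.pl script mentioned in Note 2.

```python
from math import exp, lgamma


def p_absent(word_freq, construction_freq, corpus_size):
    """Hypergeometric probability of zero co-occurrences, C(N-K, n) / C(N, n)."""
    n, k, total = word_freq, construction_freq, corpus_size
    return exp(lgamma(total - k + 1) + lgamma(total - n + 1)
               - lgamma(total - k - n + 1) - lgamma(total + 1))


def min_frequency_for_significant_absence(alpha, construction_freq=1824,
                                          corpus_size=136551):
    """Smallest corpus frequency at which zero occurrences yield p < alpha."""
    freq = 1
    # p_absent shrinks as the verb becomes more frequent, so scan upwards.
    while p_absent(freq, construction_freq, corpus_size) >= alpha:
        freq += 1
    return freq


uncorrected = min_frequency_for_significant_absence(0.05)
corrected = min_frequency_for_significant_absence(0.05 / 4856)
print("minimum frequency at the uncorrected level:", uncorrected)
print("minimum frequency at the corrected level:  ", corrected)
```

Any verb rarer than the first threshold cannot be shown to be significantly absent from the ditransitive in a corpus of this size, however unacceptable the combination may be.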

The problem of insufficient corpus size can ultimately only be solved 

by the creation of larger grammatically annotated corpora. However, in 

many individual cases it is possible to arrive at a fairly safe conclusion 

using currently available non-annotated corpora. Take the case of explain.
In the 100-million-word British National Corpus (the largest balanced 

corpus of British English currently available), the verb explain 

occurs 18,334 times, but not once with ditransitive complementation. In 

Stefanowitsch and Gries (2003: 219), we estimated that the BNC contains 

10,206,300 complementation patterns overall. If we assume that 

the proportion of ditransitives in the BNC is the same as in the ICE- 

GB, then the BNC should contain 136,332 ditransitives. Given these 

figures, we can now calculate the expected frequency of occurrence of 

explain in the ditransitive: 245. The difference between this and the observed 

frequency of zero is highly significant (at uncorrected levels of 

significance; p = 6.73E-108; < 0.001). Thus, the combination [explain +
ditransitive] can be categorized as a significantly absent structure
based on negative corpus evidence. This strategy works even with a low-frequency
verb like whisper, which occurs only 2,976 times in the BNC,
but, again, does not occur in the ditransitive. Under the assumptions
just outlined, the expected frequency of [whisper + ditransitive] is 40.
Again, the difference is highly significant (at uncorrected levels;
p = 4.139019E-18; < 0.001). 6 Repeating individual tests on a larger 

corpus will, of course, not invariably lead to the conclusion that a given 

structure is significantly absent. In many cases, the frequency of a verb 

will remain too low to yield significant results. For example, the verb 

oxidise, like whisper, occurs five times in the ICE-GB, never in the ditransitive. 

In the BNC, it occurs 99 times, also never in the ditransitive. The 

expected frequency in this case is 1 (1.32, to be precise), and the difference 

between this and the observed frequency of zero is still far too 

small to reach statistical significance (p = 0.26; > 0.05). In other cases, 

extending a search to a larger corpus will fail to replicate the zero occurrence 

in the smaller corpus. For example, the BNC contains one clear 

example of donate with ditransitive complementation (1a), and a second 

potential one (1b): 

(1) a. Saudi king donates Laura transplant money. The king of Saudi 

Arabia has donated a hundred and fifty thousand pounds to Laura 

Davies ... (K1N) 

b. ... if the villagers hadn’t so kindly donated her furnishings, she’d 

probably still be existing in empty rooms ... (H95). 7 

Faced with such examples, there is no longer any reason to believe 

that [donate + ditransitive] is a significantly absent structure. 
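The BNC estimates above rest on a simple extrapolation, which can be retraced step by step. The sketch below assumes, as in the text, that the proportion of ditransitives in the BNC equals that in the ICE-GB and that the BNC contains 10,206,300 complementation patterns; the variable and function names are illustrative, and the exact test for zero occurrences is again computed as a single hypergeometric probability.

```python
from math import exp, lgamma

ICE_GB_DITRANSITIVES = 1824
ICE_GB_PATTERNS = 136551
BNC_PATTERNS = 10206300   # estimate from Stefanowitsch and Gries (2003: 219)

# Assumption: the BNC has the same proportion of ditransitives as the ICE-GB.
bnc_ditransitives = round(BNC_PATTERNS * ICE_GB_DITRANSITIVES / ICE_GB_PATTERNS)


def p_absent(word_freq, construction_freq, corpus_size):
    """Hypergeometric probability of zero co-occurrences, C(N-K, n) / C(N, n)."""
    n, k, total = word_freq, construction_freq, corpus_size
    return exp(lgamma(total - k + 1) + lgamma(total - n + 1)
               - lgamma(total - k - n + 1) - lgamma(total + 1))


bnc_frequencies = {"explain": 18334, "whisper": 2976, "oxidise": 99}
expected = {verb: freq * bnc_ditransitives / BNC_PATTERNS
            for verb, freq in bnc_frequencies.items()}
p_values = {verb: p_absent(freq, bnc_ditransitives, BNC_PATTERNS)
            for verb, freq in bnc_frequencies.items()}

print("estimated ditransitives in the BNC:", bnc_ditransitives)
for verb in bnc_frequencies:
    print(f"{verb:8} expected = {expected[verb]:7.2f}  p = {p_values[verb]:.3g}")
```

The sketch makes visible how strongly the conclusion depends on the verb's overall frequency: explain and whisper are frequent enough in the BNC for their absence to be significant, whereas oxidise is not.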

Thus, the methodological problem of insufficiently large corpora is 

not, in principle, an argument against replacing intuitive grammaticality 

judgments by negative corpus evidence (in practice it may be, a point 

which I will return to below). Instead, the preceding discussion shows


that negative corpus evidence can be adduced for a pet case of syntactic 

theorizing (note that among the significantly absent collexemes of the 

ditransitive complementation pattern in Table 5, there are a large 

number of famously non-ditransitive verbs, e. g., suggest, provide, say, 

describe, etc.). 

One may now argue that even if such negative corpus evidence can be
obtained, it does not add any insights that could not also be arrived at
by introspective acceptability judgments: at best (e. g., for say, explain,
whisper), it will confirm what we know from intuition anyway; in the
worst case, it will never yield enough data to decide the issue (e. g., oxidise)
or contradict generally agreed-upon acceptability judgments (e. g.,
donate). There are two reasons why this argument is wrong: first, unlike 

acceptability judgments, negative corpus evidence meets the standards of 

scientific research. Second, it is only such corpus evidence that will allow 

us to make principled statements about what is and is not possible: if we 

hypothesize, based on acceptability judgments, that [whisper + ditransitive]

is impossible, a single counterexample can prove this wrong. While 

such counterexamples may not occur even in very large corpora, they 

are still easy to come by in the age of the Internet. A web search quickly 

turns up counterexamples produced by native speakers of (British) English, 

both with clausal and, more crucially, with nominal objects: 

(2) a. ... when I first beheld you the instinct of Nature whispered me 

that we were in some degree related ... (Jane Austen, Love and 

Friendship, Letter 11) 

b. She had not been allowed to ... to bury the two people she had 

loved most in the world ... to whisper them a last goodbye. (Meg 

Hutchinson, Peppercorn Woman) 

Of course, these examples can in turn be questioned; they may reflect 

older stages of the language (Jane Austen wrote Love and Friendship 

around 1790), they may reflect regional dialects (Meg Hutchinson lives 

in Staffordshire), etc. However, the fact remains that we have objective 

authentic examples pitched against subjective intuition. In contrast, the 

negative corpus evidence obtained from the BNC gives us an objective 

basis for arguing that, even though utterances like (2) can occur, they 

are very marginal. This allows us to uphold the useful generalization 

that communication verbs can be used with ditransitive syntax if they 

refer to the type of message communicated, but not if they refer to the 

manner in which it is communicated (Pinker 1989), even though we have 

to reformulate it as a strong statistical tendency instead of a categorical 

constraint. The same holds for donate. Even if we generously admit both 

(1a and 1b) as counterexamples, we can still point out that donate would


have been expected to occur 14 times (based on the assumptions above), 

and thus still constitutes a strongly repelled collexeme (p 0.0001; 


whether zero deviates significantly from this expected frequency (the sufficient 

condition for upholding the hypothesis). This information will 

likely be more difficult to obtain or estimate than information about 

complementation patterns, but to do so is by no means impossible. 

Any hypothesis about possible and impossible structures in language 

is ultimately a hypothesis about the incompatibility of two (or more) 

linguistic categories. As long as these categories can be operationalized 

in such a way that they can be exhaustively annotated (or identified 

spontaneously) in a corpus of naturally occurring language, and as long 

as the corpus is large enough, this corpus can provide both positive and 

negative evidence. The first condition should always be met: if a category 

cannot be operationalized for objective identification, it has no place in 

a linguistic theory. The second condition is not currently met. There are 

several syntactically annotated corpora (for example, the Penn Treebank, 

Sampson’s SUSANNE and CHRISTINE corpora, and the ICE-GB used in this 

note), but they are either too small for many research questions, or their 

annotation scheme is too coarse or too unreliable, or both. However, 

this cannot seriously be used as a defense of the introspective method. 

Instead, it must be used as an argument for the funding and the human 

resources necessary for the construction of large grammatically annotated 

corpora. A discipline can only get so far by thought experiments (if 

that is what acceptability judgments are). It begins to make substantial 

headway only when it faces up to the problem of data scarcity and solves 

it. Astronomers have built radio telescopes, physicists have built particle 

colliders, and geneticists have sequenced the human genome; linguists 

should be able to construct large, balanced, syntactically annotated corpora

of at least the world’s major languages. But even until this goal is

reached (or, more likely, in case it is never reached), corpora can yield 

both positive and negative evidence for the construction of linguistic 

theories. 

Final remarks: the occurring and the non-occurring 

The main point of this note was to show that corpora contain negative 

evidence and that this negative corpus evidence can, and should, replace 

introspective acceptability judgments. It seems appropriate, however, to 

discuss the most important theoretical implications of such a step. 

First, from the perspective advocated here, the non-occurrence of a 

particular linguistic structure is merely the limiting case; it is not qualitatively 

different from very rare occurrences. This may seem to be a problem 

for an approach that argues for an absolute distinction between 

possible and impossible configurations of linguistic categories (for example, 

between grammatical and ungrammatical structures). This problem


may be more apparent than real, however. The continuum between significantly 

rare and significantly absent structures is not fundamentally 

different from the continuum between various degrees of unacceptability 

that is regularly found for acceptability ratings. In both cases, the data 

must be viewed in light of one’s theory of language in order to make 

sense of this continuum. Also, it may well be possible to identify a degree 

of improbability that is close enough to impossibility to be indistinguishable 

from it. 

Second, while the statistically significant absence (or rareness) of a 

particular configuration of grammatical categories can be taken as evidence 

that this configuration is impossible (i. e., very improbable), it does 

not, in itself, provide any clues as to why this should be the case. Again, 

the same is true of introspective judgments. Chomsky pointed this out 

early on: “The notion ‘acceptable’ is not to be confused with ‘grammatical’. 

Acceptability belongs to the study of performance whereas grammaticalness 

belongs to the study of competence” (Chomsky 1965: 11). A 

linguistic structure may give rise to introspective judgments of unacceptability 

for a number of reasons, of which ungrammaticality (or, more

generally, failure to conform to general linguistic rules) is just one. What 

that reason is must be determined independently of the acceptability 

judgment. The same is true of significantly absent (or rare) structures: 

determining significant absence/rareness is just the first step of a linguistic 

analysis. The second step is to determine the reasons for the significant 

absence/rareness. This step can be much closer to traditional linguistic 

argumentation. First, it may involve the search for authentic counterexamples 

(as in the case of whisper above) in order to test the extent of 

this absence. This may uncover variation in the data (diachronic, regional, 

social, etc.) or particular contexts in which seemingly impossible 

structures become possible. Second, it may involve constructing examples 

in order to determine whether the significant absence is semantically 

determined. If the constructed examples are not interpretable, the absence 

may simply be due to semantic incompatibility. For example, no 

interpretation can be assigned to He knew her the answer or She saw 

him the light. If the constructed examples are interpretable, their absence 

cannot be due to semantic incompatibility but may instead have purely 

formal reasons. For example, He said her the answer or She put him the 

book are straightforwardly interpretable (of course, there may be more 

fine-grained semantic restrictions as the huge literature on ditransitives 

shows). In other words, while I argue against the use of acceptability 

judgments as a linguistic method, I do not argue against the use of interpretation. 

There is good reason for this distinction, which I am not the 

first to point out: interpreting utterances is a natural human activity, 

judging their acceptability is not.


Third, while it is plausible to speak of different degrees of attraction 

or repulsion in the case of combinations that do occur, it is less clear 

whether it makes sense to speak of different degrees of absence, as the 

ranking of significantly absent collexemes in Table 5 suggests. Methodologically, 

this ranking merely reflects the certainty with which we can 

say that a structure is impossible. One may (but need not) argue, though, 

that this certainty reflects the certainty of a native speaker, in which case 

the ‘degrees of absence’ do become relevant to theoretical considerations. 

Whether the predictions of such a view are borne out by empirical data 

remains to be seen. 

More generally, it seems to me that accepting the methodology I have 

argued for here may lead to a slight but pervasive reorientation of linguistic 

theory. If we accept significant presence and significant absence 

(as well as significant frequency and rareness) as the primary facts that 

a linguistic theory must explain, then this theory will have to be broader 

than most current theories. Rather than focusing exclusively on grammaticality, 

such a theory would have to uncover the whole range of 

causes for the presence and absence of linguistic structures and investigate 

all of them with the same degree of rigor and explicitness. The aim 

of linguistic analysis would no longer be “to separate the grammatical 

sequences which are the sentences of [a language] L from the ungrammatical 

sequences which are not sentences of L and to study the structure 

of the grammatical sentences” (Chomsky 1957: 13). Instead, the 

aim would be to provide, for individual languages and, ultimately, for

language in general, a comprehensive theory of the occurring and the 

non-occurring. 8 

Received January 2006 

Revisions received March 2006 

Final acceptance March 2006 

University of Bremen 

Notes 

* I would like to thank Stefan Gries, Arne Zeschel and the participants of the 7. 

Norddeutsches Linguistisches Kolloquium for their comments on the ideas presented 

in this paper. Any conceptual errors are mine alone. 

1. Actually, there are several potential reasons for the oddness of McEnery and Wilson’s 

example (for example, the use of the simple present and the potential violation 

of the selection restrictions of the verb shine by the direct object NP books). 

Their discussion suggests, however, that they are concerned with complementation. 

2. An overview of this method and its place in the corpus-based study of grammatical

patterns will be provided in Stefanowitsch and Gries (to appear b); meanwhile,

an introduction can be found on my website at . This website also provides a number of Perl scripts for doing collostructional

analysis (PerlClx 1.0); cf. also Gries’ R script CollAnalysis 3, available

from his website at . Incidentally, both scripts can provide the corpus frequency 

that a word not occurring in a particular construction must have in order 

for its absence to be significant given the frequency of the construction and the 

size of the corpus. CollAnalysis provides this information as part of a collostructional 

analysis; PerlClx contains a script (zclx.pl) exclusively dedicated to this purpose. 
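
The kind of calculation described in this note can be approximated as follows. This is a minimal Python sketch and not the code of zclx.pl or CollAnalysis: it assumes a simple binomial model in which every token of the word has the same chance, cxn_freq / corpus_size, of occurring in the construction. The function name, its parameters, and the default level of 0.05 are hypothetical choices made for illustration.

```python
import math


def min_freq_for_significant_absence(cxn_freq, corpus_size, alpha=0.05):
    """Smallest word frequency at which zero co-occurrences with a
    construction would be significant at level alpha.

    Assumption of this sketch: a binomial approximation, not the exact
    test used by the scripts cited in the note.
    """
    if not 0 < cxn_freq < corpus_size:
        raise ValueError("need 0 < cxn_freq < corpus_size")
    if not 0 < alpha < 1:
        raise ValueError("alpha must lie strictly between 0 and 1")
    rate = cxn_freq / corpus_size
    # P(zero hits among w tokens) = (1 - rate) ** w.
    # Solve (1 - rate) ** w < alpha for the smallest integer w.
    bound = math.log(alpha) / math.log(1.0 - rate)
    return math.floor(bound) + 1
```

A word that never occurs in the construction but is rarer than the returned value cannot be called significantly absent under this model; its absence is what chance alone would lead one to expect.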

3. The Bonferroni correction is meant to place stricter requirements on statistical 

significance in situations where multiple tests are performed on the same data set: 

obviously, the more tests you perform, the more chances there are for a seemingly 

significant result to come about by accident. However, some have argued (for 

example, Perneger 1998) that this correction does more harm than good because 

it removes many results that are significant. I will not place too much emphasis 

here on this correction. In a sense, whether one has to apply it in the context of 

collostructional analysis or not depends on one’s view of what one is doing: is 

one testing individual word-construction pairs (in which case each test can stand 

on its own and one could be less concerned with correcting) or is one testing a 

construction and all words occurring in it (in which case one could be more concerned 

with correcting). 

4. These results differ from those that Gries and I have presented previously, mainly 

because we have focused exclusively on ditransitives with two nominal objects, 

whereas I have included here all uses tagged as ‘ditransitive’ in the ICE-GB, including 

those with a clausal direct object. Furthermore, here and throughout the 

following discussion I have used regular expressions to search the corpus files 

directly, rather than using ICECUP, the software tool that accompanies the ICE-GB. 

I have also discarded all verbs marked as ‘ignored’ in the corpus annotation 

and I have discarded all unclear words. I have manually lemmatized the verbs 

and standardized spelling variants. Finally, I treat phrasal verbs as lemmas in 

their own right (cf. give back in Table 2); they were identified by searching for 

verbs that were followed (and in some cases preceded) by a particle annotated 

as such. 

5. Note that cost does not behave like a typical transitive verb (whether in its monotransitive 

or its ditransitive use). For example, it cannot be passivized: *Three quid 

were cost (them), *They were cost three quid. Thus, the apparent direct object may 

be better analyzed as an oblique (for example, a subject complement or an adjunct, 

cf. e. g., Quirk et al. 1985, § 16.27). 

6. Since we are conducting individual tests here based on hypotheses about specific 

verbs, we could argue that the levels of significance do not have to be adjusted 

for multiple testing. However, since there are thousands of tests of the same kind 

(verb ditransitive) that we could have performed, it might be a good idea to 

correct for multiple testing anyway. According to Leech et al.’s (2001) frequency 

list, there are 38,019 verb types in the BNC; if anything, this is an overestimation 

(since inaccurately tokenized forms like ["see] [—see] [see?] etc. are all 

counted as their own lemmas), so if we correct on this basis, we are on the safe 

side. The corrected level of significance is 1.32E-06; both explain and whisper 

clear this level by several orders of magnitude. 
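
The arithmetic behind the corrected level can be checked with a short Python sketch. It assumes the conventional uncorrected level of 0.05, which this note does not state explicitly but which reproduces the figure given; the function name is hypothetical.

```python
def bonferroni_alpha(alpha, n_tests):
    """Bonferroni-corrected significance level for n_tests tests."""
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    return alpha / n_tests


# 38,019 verb types in the BNC according to Leech et al. (2001)
corrected = bonferroni_alpha(0.05, 38019)
print(f"{corrected:.2E}")  # 1.32E-06
```

Any p-value below this threshold remains significant even on the assumption that every verb type in the list had been tested.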

7. Due to the ambiguity of the pronominal form her, it is not clear whether this 

example is monotransitive (They donated [NP her furnishings]) or ditransitive 

(They donated [NP her] [NP furnishings]). However, a web search turns up additional 

clear (if rare) examples of ditransitive uses of donate, for example, In May 

2004, Cycle Heaven, a local retailer, and City of York Council threw down the


gauntlet to local schools, saying ‘achieve your target increases in walking and cycling 

by summer 2005, and we will donate you a free, high quality children’s bike! (http://www.york.gov.uk/cgi-bin/wn_document.pl?type=5927). 

8. I do not want to conclude this note without applying the by now familiar reasoning 

to McEnery and Wilson’s question of how the ungrammaticality of *He shines 

Tony books could be determined without intuition judgments. Shine (in all its 

senses) occurs 2,258 times in the BNC. On the assumptions made above, the expected 

frequency of shine with ditransitive complementation would be 30; the 

observed frequency is zero. This difference is highly significant (p = 6.47E-14) 

even if we correct for multiple testing. Thus, without resorting to introspection, 

we have proved that [shine ditransitive] is significantly absent. Whether this is 

strictly due to ungrammaticality is doubtful: first, McEnery and Wilson’s sentence 

is interpretable (albeit weird); second, it is possible to find authentic ditransitive 

uses by certified native speakers of English for both senses of shine: (i) Shine me 

a light from your eyes dear (Christine McVie, Show me a Smile [performed by 

Fleetwood Mac]); (ii) He smiles telling him to shine him a metallic Purple armor 

(Jimi Hendrix, Bold as Love). Thus, we could hypothesize that ditransitive uses of 

shine are semantically so restricted that they occur only in very specific circumstances 

(e. g., the ‘light’ reading can only occur ditransitively when the direct object 

is light) or that they only occur in certain dialects (e. g., Lancashire [McVie] 

and Seattle [Hendrix]) or registers (e. g., rock lyrics). 
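
The reasoning applied in this note can be sketched in Python. The sketch assumes an exact test on a two-by-two table with fixed margins (the hypergeometric model); under that assumption the one-tailed p-value for an observed frequency of zero is simply the probability that none of the word's tokens falls into the construction. The corpus size and the frequency of the ditransitive are not repeated in this note, so the parameters below are placeholders and the function names are hypothetical.

```python
import math


def log_binom(n, k):
    """Natural log of the binomial coefficient C(n, k)."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)


def expected_frequency(word_freq, cxn_freq, corpus_size):
    """Co-occurrence frequency expected if word and construction are independent."""
    return word_freq * cxn_freq / corpus_size


def p_zero_cooccurrence(word_freq, cxn_freq, corpus_size):
    """Probability of observing zero co-occurrences by chance.

    Hypergeometric model: P(0) = C(N - C, W) / C(N, W), where N is the
    corpus size, C the construction frequency and W the word frequency.
    """
    if word_freq + cxn_freq > corpus_size:
        # Word and construction cannot avoid each other entirely.
        return 0.0
    log_p = (log_binom(corpus_size - cxn_freq, word_freq)
             - log_binom(corpus_size, word_freq))
    return math.exp(log_p)
```

With toy values such as p_zero_cooccurrence(2, 3, 10) the function returns 21/45; with counts of the order of magnitude discussed here (an expected frequency of about 30), the probability falls far below any conventional or corrected level of significance.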

References 

Bonferroni, Carlo E. 

1936 Teoria statistica delle classi e calcolo delle probabilità. Pubblicazioni del 

R Istituto Superiore di Scienze Economiche e Commerciali di Firenze 8, 

3–62. 

Chomsky, Noam 

1957 Syntactic Structures. The Hague: Mouton. 

1965 Aspects of the Theory of Syntax. Cambridge, MA: MIT Press. 

Gries, Stefan Th. and Anatol Stefanowitsch 

2004a Extending collostructional analysis: a corpus-based perspective on ‘alternations’. 

International Journal of Corpus Linguistics 9(1), 97–129. 

2004b Co-varying collexemes in the into-causative. In: Achard, Michel and Suzanne 

Kemmer (eds.), Language, Culture, and Mind. Stanford: CSLI, 

225–236. 

To appear Cluster analysis and the identification of collexeme classes. In John Newman 

and Sally Rice (eds.), Empirical and Experimental Methods in Cognitive/Functional 

Research. Stanford: CSLI. 

Leech, Geoffrey, Paul Rayson, and Andrew Wilson 

2001 Word Frequencies in Written and Spoken English: Based on the British 

National Corpus. London: Longman. 

McEnery, Tony, and Andrew Wilson 

2001 Corpus Linguistics. An Introduction. Second edition. Edinburgh: Edinburgh 

University Press. 

Nelson, Gerald, Sean Wallis and Bas Aarts (eds.) 

2002 Exploring Natural Language: Working with the British Component of the 

International Corpus of English. Amsterdam and Philadelphia: John Benjamins.


Pedersen, Ted 

1996 Fishing for exactness. Proceedings of the South Central SAS User’s Group 

Conference, Austin, TX, 188–200. 

Perneger, Thomas V. 

1998 What’s wrong with Bonferroni adjustments. British Medical Journal 316, 

1236–1238. 

Pinker, Steven 

1989 Learnability and Cognition. The Acquisition of Argument Structure. Cambridge, 

MA: MIT Press. 

Quirk, Randolph, Sidney Greenbaum, Geoffrey Leech, and Jan Svartvik 

1985 A Comprehensive Grammar of the English Language. London: Longman. 

Stefanowitsch, Anatol 

2005 New York, Dayton (Ohio), and the Raw Frequency Fallacy. Corpus Linguistics 

and Linguistic Theory 1(2), 295–301. 

Stefanowitsch, Anatol and Stefan Th. Gries 

2003 Collostructions: investigating the interaction of words and constructions. 

International Journal of Corpus Linguistics 8(2), 209–243. 

2005 Covarying Collexemes. Corpus Linguistics and Linguistic Theory 1(1), 

1–43. 

To appear a Channel and constructional meaning: A collostructional case study. In: 

Kristiansen, Gitte and René Dirven (eds.), Cognitive Sociolinguistics: 

Language Variation, Cultural Models, Social Systems. Berlin and New 

York: Mouton de Gruyter. 

To appear b Corpora and Grammar. In: Anke Lüdeling, Merja Kytö, and Tony 

McEnery (eds.), Corpus Linguistics (Handbooks of Linguistics and Communication 

Science/HSK). Berlin: Mouton de Gruyter.
