
Tom Ruette, Dirk Speelman, and Dirk Geeraerts<br />

Lexical variation in aggregate perspective<br />

Abstract: If one aims to study a pluricentric language with the goal of making general<br />

assertions about linguistic levels, an aggregate perspective in which many linguistic<br />

items that represent the linguistic level are considered is necessary. The current paper<br />

presents and compares two methodologies for aggregating lexical variation so that the<br />

similarity or dissimilarity between language varieties such as the centers of a pluricentric<br />

language can be quantitatively measured. The two methodologies differ with<br />

respect to the treatment of the semantic relation between words: whereas one method<br />

simply ignores the semantic relation between words, the other method incorporates<br />

the knowledge that some words are alternative means of naming a single concept. The<br />

question of which method is most suitable for measuring the similarity or dissimilarity<br />

between language varieties is raised and empirically tested in a corpus-based case<br />

study on the pluricentric language Dutch, as used in Belgium and the Netherlands. It<br />

will be shown that the method that incorporates semantic knowledge manages to go<br />

beyond possible conceptual variation between language varieties, clearly revealing<br />

an expected distinction between Dutch as used in Belgium and in the Netherlands. In<br />

contrast with this, the semantically non-informed method is disturbed by conceptual<br />

variation and is not able to convincingly show the distinction between Dutch as used<br />

in Belgium and in the Netherlands, although the set of linguistic items clearly suggests<br />

that such a national pattern should emerge.<br />

Keywords: aggregate perspective, sociolectometry, lexical variation, Dutch<br />

1 Introduction<br />

The current paper shows how a sociolectometric approach is needed to disentangle the<br />

multidimensional structure of the varieties <strong>in</strong> a pluricentric language. There are different<br />

sociolectometric approaches, i.e. corpus-based methods, perception experiments,<br />

or attitude questionnaires; we will perform a corpus-based case study. Although the focus<br />

of a sociolectometric approach is on the varieties, the choice of the variables under<br />

analysis is crucial; we focus on lexical <strong>variation</strong>. Furthermore, <strong>in</strong> this paper we compare<br />

two quantitative corpus-based methods, which differ <strong>in</strong> their conceptual control<br />

of lexical variables: on the one hand, we take a method that ignores the conceptual<br />

relationship between the lexemes <strong>in</strong> the variable set. On the other hand, there is a<br />

method that <strong>in</strong>corporates knowledge about conceptual identity between lexemes. The<br />

importance and difficulties of conceptual control when study<strong>in</strong>g <strong>variation</strong> <strong>in</strong> the lexicon<br />

as a whole is shown by means of a case-study on the pluricentric language Dutch.<br />

The pluricentric character of Dutch is now widely accepted: Dutch is used both in Belgium<br />

and in the Netherlands, but each nation has its own norm-generating center (cf.<br />

Clyne 1992). This is different from the imposed situation <strong>in</strong> earlier years, especially<br />

the sixties, where Dutch <strong>in</strong> Belgium was supposed to be exogenically modeled on the<br />

norms of the Netherlands. Recently, by means of empirical work of e.g. Geeraerts et al.<br />

(1999) and experimental work of e.g. Impe et al. (2008), this historical view had to be<br />

adjusted to the current view, as described <strong>in</strong> Auer (2005).<br />

Rather than provid<strong>in</strong>g further empirical proof of the pluricentric character of<br />

the Dutch lexicon, the case-study aims to show the pert<strong>in</strong>ence of a sociolectometric<br />

methodology that can <strong>aggregate</strong> patterns of non-categorical lexical <strong>variation</strong> while <strong>in</strong>corporat<strong>in</strong>g<br />

an appropriate amount of conceptual control – <strong>in</strong> contrast to a methodology<br />

that discards any conceptual knowledge. As such, the study touches upon two<br />

general issues <strong>in</strong> the broader field of <strong>variation</strong>ist l<strong>in</strong>guistics: on the level of words, we<br />

look at the problematic status of lexical <strong>variation</strong> and the difficulty of del<strong>in</strong>eat<strong>in</strong>g word<br />

mean<strong>in</strong>g; on the level of structure, we run <strong>in</strong>to the methodological issue of aggregat<strong>in</strong>g<br />

the probabilistic <strong>variation</strong>al patterns of many words <strong>in</strong> order to reach a general view<br />

on the lexicon, rather than on <strong>in</strong>dividual words.<br />

Let us start, however, more generally with the status of <strong>variation</strong> <strong>in</strong> a l<strong>in</strong>guistic<br />

system. Attempts at incorporating variational rules in the linguistic system have been<br />

criticized (e.g. Bickerton 1971) on the argument that <strong>variation</strong> has no place <strong>in</strong> the search<br />

for an abstract and idealized l<strong>in</strong>guistic system of competence and langue. However, a<br />

paradigm-shift <strong>in</strong> l<strong>in</strong>guistics towards usage-based approaches turned the ubiquity of<br />

<strong>variation</strong> <strong>in</strong>to someth<strong>in</strong>g that should not be ignored. Nonetheless, even <strong>in</strong> usage-based<br />

Cognitive L<strong>in</strong>guistics, which studies parole by def<strong>in</strong>ition and can therefore hardly escape<br />

<strong>variation</strong>, there has been a tendency to overestimate the homogeneity of language<br />

communities and the consequent non-variability. Recently, Cognitive Linguistics has<br />

taken up the challenge of incorporating variational dimensions in the study of linguistic<br />

phenomena. Evidence for this is provided by two collected volumes by Kristiansen and Dirven<br />

(2008) and Geeraerts et al. (2010) on Cognitive Sociol<strong>in</strong>guistics, which comb<strong>in</strong>e theoretical,<br />

methodological and empirical studies that <strong>in</strong>corporate cognitive, semantic and<br />

lectal dimensions <strong>in</strong> their l<strong>in</strong>guistic descriptions. Of course, one does not need to commit<br />

to a cognitive framework to combine language-internal variables and language-external<br />

variables, but Cognitive Sociol<strong>in</strong>guistics is currently at the cutt<strong>in</strong>g edge when<br />

it comes to multivariate analyses of l<strong>in</strong>guistic phenomena. The idea of Cognitive Sociol<strong>in</strong>guistics<br />

is best expla<strong>in</strong>ed by look<strong>in</strong>g at an exemplar case-study of Szmrecsanyi<br />

(2010). In that study, the English genitive alternation between an of -construction and<br />

an ’s-construction is approached <strong>in</strong> the well-known Cognitive L<strong>in</strong>guistic fashion, with<br />

semantic, pragmatic, psychol<strong>in</strong>guistic, structural and functional predictors. In addition<br />

to these typical Cognitive L<strong>in</strong>guistic predict<strong>in</strong>g factors, however, extra-l<strong>in</strong>guistic<br />

factors are <strong>in</strong>cluded as well: e.g. register (newspaper versus <strong>in</strong>formal), medium (spoken<br />

versus written) and geography (British versus American English). Based on many<br />

observations of genitive constructions <strong>in</strong> corpora that are representative of these lectal<br />

factors, it appears that “the magnitude of the effect that individual conditioning factors<br />

[e.g. semantic and pragmatic factors] may have on genitive choice […] is demonstrably<br />

mediated by language-external [i.e. lectal] factors” (Szmrecsanyi 2010).<br />

The example given above – representative of a widespread trend in Cognitive<br />

Linguistics – studies a single linguistic phenomenon very closely. And although the<br />

insights gained from these single-feature studies are at the very heart of the linguistic<br />

enterprise, they hardly allow for extrapolations and abstractions about the linguistic<br />

system in general: that lectal factors have an important mediating influence on the choice of a specific genitive form (in English) does not imply that they have the same effect<br />

on other l<strong>in</strong>guistic items (<strong>in</strong> other languages). In order to reach a more general level<br />

of that k<strong>in</strong>d, the behavior of many l<strong>in</strong>guistic variables needs to be <strong>aggregate</strong>d so that<br />

idiosyncratic differences are averaged out, structures emerge and systematicity can be<br />

induced. This aggregate perspective also ties in with the answer Geeraerts (2010) gives to<br />

his question about the plausibility of a system when variation is rampant: finding a linguistic<br />

system is an empirical question that can be answered by look<strong>in</strong>g for statistically<br />

recurr<strong>in</strong>g structural patterns <strong>in</strong> <strong>variation</strong>al data. Or <strong>in</strong> other words, assum<strong>in</strong>g a system<br />

that is able to predict l<strong>in</strong>guistic choices, we should f<strong>in</strong>d a probabilistic model that fits<br />

observed <strong>variation</strong>.<br />

Return<strong>in</strong>g to the topic of the current paper (lexical <strong>variation</strong> <strong>in</strong> a pluricentric language),<br />

how can these theoretical <strong>in</strong>sights be applied? To answer this question, we<br />

will address lexical <strong>variation</strong> <strong>in</strong> Section 2 and aggregation <strong>in</strong> Section 3. In Section 4,<br />

we will perform a case-study on <strong>aggregate</strong>d lexical <strong>variation</strong> <strong>in</strong> the pluricentric language<br />

Dutch. F<strong>in</strong>ally, we br<strong>in</strong>g together the theoretical <strong>in</strong>sight and the results of the<br />

case-study in the conclusion of this paper.<br />

2 Lexical variation<br />

Harder (2010: 270) claims that there are three stages in the emergence of a sociodynamic<br />

perspective on the linguistic system. The first stage consists of mere fluctuations,<br />

comparable to the babbling of a toddler. From these fluctuations a structure emerges<br />

consist<strong>in</strong>g of categories that conta<strong>in</strong> the fluctuation, but this structure is an <strong>in</strong>complete<br />

abstraction of the fluctuations. The abstraction goes only so far as the language<br />

user deems appropriate, i.e. until communication is enabled. This is the second stage<br />

of emerg<strong>in</strong>g structure. The third stage consists of the <strong>in</strong>itial stage fluctuations that<br />

turn <strong>in</strong>to systematic <strong>variation</strong> with<strong>in</strong> the emerged structural category. Although the<br />

three stages are presented by means of a developmental example (i.e. the babbling<br />

toddler), these stages might well have a more general ontogenetic status that may explain<br />

language <strong>variation</strong> and change. Abandon<strong>in</strong>g the dynamic character of these three<br />

stages, and look<strong>in</strong>g at every stage <strong>in</strong>dependently, we could say that <strong>variation</strong>ist research<br />

zooms <strong>in</strong> on the third stage, assum<strong>in</strong>g the categories from the second stage. As<br />

an example, Harder gives the sem<strong>in</strong>al Labovian study on the structural stage two category<br />

“postvocalic -r”, with its category-bound stage three variants, which appeared



to be related to social classes <strong>in</strong> New York (Labov 1966). Scholars of the l<strong>in</strong>guistic system<br />

have traditionally removed stage three (<strong>variation</strong>, or rather variable usage) and<br />

focused on the abstract and idealized stage two structural categories. However, an<br />

adequate study of the l<strong>in</strong>guistic system must not ignore the stage three <strong>variation</strong>, as<br />

structure and <strong>variation</strong> cannot exist without each other. Structure without <strong>variation</strong><br />

is stripped of linguistic reality, and variation without structure is mere fluctuation,<br />

<strong>in</strong>capable of enabl<strong>in</strong>g communication.<br />

Although this idea of system is primarily geared towards l<strong>in</strong>guistic categories such<br />

as consonants or Germanic strong verbs, it can conveniently be “translated” towards<br />

the conceptual categories of the lexicon. There is, however, an important question related<br />

to the level of abstraction <strong>in</strong> stage two, when consider<strong>in</strong>g the lexicon. If on the<br />

one hand the categories are chosen to be as narrow as a single word (or symbol), the<br />

<strong>variation</strong> with<strong>in</strong> these categories is semasiological <strong>variation</strong>. This means that one studies<br />

the different senses or aspects of mean<strong>in</strong>g of a s<strong>in</strong>gle word. If on the other hand<br />

the categories are chosen to be as broad as “concepts”, the <strong>variation</strong> <strong>in</strong> nam<strong>in</strong>g these<br />

categories (i.e. that different words may name the same concept) is onomasiological<br />

<strong>variation</strong>. This means that one studies the different ways of express<strong>in</strong>g (with words)<br />

the conceptual category. Obviously, this very old dist<strong>in</strong>ction between a semasiological<br />

or an onomasiological approach is related to the study of polysemy versus the study<br />

of synonymy.<br />

In this paper, we restrict ourselves to the onomasiological <strong>perspective</strong>, yet fully<br />

aware of the semasiological issues wait<strong>in</strong>g around the corner. We refer to Geeraerts<br />

(2009) for an overview of research on lexical <strong>variation</strong>, and zoom <strong>in</strong> here briefly on<br />

a dist<strong>in</strong>ction between Formal Onomasiological Variation (FOV) and Conceptual Onomasiological<br />

Variation (COV). A FOV approach resembles the sociol<strong>in</strong>guistic variable:<br />

FOV grasps a quality of a set of words that express the same concept, and just like <strong>in</strong> a<br />

sociol<strong>in</strong>guistic variable, each word <strong>in</strong> the set may have a specific socio-stylistic correlation.<br />

COV, on the other hand, l<strong>in</strong>ks up to the more subtle <strong>variation</strong> <strong>in</strong> concepts that<br />

are being used in language. Most obviously, at a very high level, an example could be<br />

that one can use specific words to talk about “beer” or about “semantics”. At a more<br />

f<strong>in</strong>e-gra<strong>in</strong>ed level, one could say that “fiddle” and “viol<strong>in</strong>” are an example of FOV, but<br />

because “fiddle” has a slightly more ord<strong>in</strong>ary tone to it than the more prestigious “viol<strong>in</strong>”,<br />

there is also COV between these words. In the case-study to this paper, we will<br />

show that this dist<strong>in</strong>ction between FOV <strong>in</strong> choos<strong>in</strong>g a word to express a concept versus<br />

COV when us<strong>in</strong>g words to talk <strong>in</strong> a certa<strong>in</strong> way crops up <strong>in</strong> a methodological difference<br />

between the two sociolectometric approaches that we compare.<br />

3 Aggregation<br />

As said above, aggregation of many variables is necessary when the goal is to describe<br />

general patterns in a system. In order to find underlying dimensions of variation in<br />

a large set of (lexical) variables, the individual patterns of the variables thus need to<br />

be <strong>aggregate</strong>d. Aggregation of many features is already applied <strong>in</strong> e.g. dialectometry<br />

and text categorization. However, we f<strong>in</strong>d problems <strong>in</strong> both dialectometry and text<br />

categorization when it comes to deal<strong>in</strong>g with lexical <strong>variation</strong>.<br />

In dialectometry (Séguy 1971; Goebl 1975; Nerbonne and Kretzschmar 2003), lexical<br />

<strong>variation</strong> is almost always considered to be categorical per location (except e.g.<br />

Grieve et al. 2011): either a certa<strong>in</strong> location – or at best a s<strong>in</strong>gle <strong>in</strong>terviewee per location<br />

– is attributed the use of word a or the use of word b. This categorical approach is<br />

ma<strong>in</strong>ly due to the type of <strong>in</strong>put data, i.e. a lexical dialect atlas, used <strong>in</strong> most dialectometric<br />

studies. Dialect atlases have been pa<strong>in</strong>stak<strong>in</strong>gly constructed <strong>in</strong> earlier years by<br />

the efforts of dialectologists that visited pert<strong>in</strong>ent locations for their purposes and accumulated<br />

data through <strong>in</strong>terviews and questionnaires. Categorical word choices per<br />

location were a necessary (but no longer acceptable) methodological decision.<br />

Because dialectometric methodology is tailored around the categorical dialect<br />

atlas <strong>in</strong>put format, their quantitative aggregation methods cannot straightforwardly<br />

be applied to corpus-driven <strong>in</strong>put, where lexical <strong>variation</strong> is a probabilistic matter.<br />

In contrast to dialectometry, an aggregation method that incorporates probabilistic<br />

word preferences in an onomasiological approach was introduced in Geeraerts et al.<br />

(1999) and further formalized <strong>in</strong> Speelman et al. (2003). This so-called profile-based<br />

approach – where “profile” stands for the (relative frequencies of a) set of words <strong>in</strong><br />

a conceptual category – is formally introduced below. The rationale of the method,<br />

as with most aggregation methods, is to measure the “distance” between pairs of subcorpora<br />

on the basis of their probabilistic overlap <strong>in</strong> onomasiological word preferences<br />

for express<strong>in</strong>g an underly<strong>in</strong>g conceptual category. A small distance between subcorpora<br />

implies a general agreement <strong>in</strong> word choice, whereas a large distance implies a<br />

general disagreement <strong>in</strong> word choice.<br />

Profile-based distances between subcorpora are calculated by means of the follow<strong>in</strong>g<br />

method. Given two subcorpora $V_1$ and $V_2$, a conceptual category $L$ (e.g. SUBTERRANEAN<br />

PUBLIC TRANSPORT) and $x_1$ to $x_n$ the exhaustive list of variants (e.g. {subway,<br />

underground}) as the profile, we refer to the absolute frequency $F$ of the usage of<br />

$x_i$ for $L$ in $V_j$ (see footnote 1) with:<br />

$$F_{V_j,L}(x_i) \tag{1}$$<br />

To make this methodological explanation more tangible, we provide a fictional example<br />

on the basis of the absolute frequencies for two concepts SUBTERRANEAN PUBLIC<br />

TRANSPORT and SMALL INSTRUMENT PLAYED WITH A BOW as used <strong>in</strong> American and British<br />

English, cf. Table 1.<br />

1 The follow<strong>in</strong>g <strong>in</strong>troduction to the City-Block distance method is based on Speelman et al. (2003:<br />

Section 2.2).



Tab. 1: Fictional absolute frequencies for the variants of two concepts in two language varieties<br />

| Concept | Variant | Am. Eng. | Br. Eng. |
|---|---|---|---|
| SUBTERRANEAN PUBLIC TRANSPORT | subway | 70 | 20 |
| | underground | 10 | 50 |
| SMALL INSTRUMENT PLAYED WITH A BOW | violin | 50 | 30 |
| | fiddle | 40 | 35 |

Subsequently, we introduce the relative frequency $R$:<br />

$$R_{V_j,L}(x_i) = \frac{F_{V_j,L}(x_i)}{\sum_{k=1}^{n} F_{V_j,L}(x_k)} \tag{2}$$<br />

The absolute frequencies from Table 1 now become the relative frequencies in Table 2<br />

by means of applying Equation 2.<br />

Tab. 2: Fictional relative frequencies for the variants of two concepts in two language varieties,<br />

based on Table 1<br />

| Concept | Variant | Am. Eng. | Br. Eng. |
|---|---|---|---|
| SUBTERRANEAN PUBLIC TRANSPORT | subway | 0.875 | 0.286 |
| | underground | 0.125 | 0.714 |
| SMALL INSTRUMENT PLAYED WITH A BOW | violin | 0.556 | 0.462 |
| | fiddle | 0.444 | 0.538 |

Now we can define the (City-Block) distance $D_{CB}$ between $V_1$ and $V_2$ on the basis of the<br />

profile for $L$ as follows (the division by two is for normalization, mapping the results<br />

to the interval [0,1]):<br />

$$D_{CB,L}(V_1, V_2) = \frac{1}{2} \sum_{i=1}^{n} \left| R_{V_1,L}(x_i) - R_{V_2,L}(x_i) \right| \tag{3}$$<br />

The City-Block distance is a straightforward descriptive dissimilarity measure that assumes<br />

the absolute frequencies <strong>in</strong> the sample-based profile to be large enough for the<br />

relative frequencies to be good estimates for the relative frequencies <strong>in</strong> the underly<strong>in</strong>g<br />

population-based profiles. If however the samples are rather small, the relative frequencies<br />

become unreliable, and a supplementary control is needed. For this we use<br />

a measure that takes as its basis the confidence of there be<strong>in</strong>g an actual difference between<br />

two profiles: the Fisher Exact test. This time, unlike with DCB , we look at the<br />

absolute frequencies <strong>in</strong> the profiles we compare. When we compare a profile <strong>in</strong> one<br />

subcorpus to the profile for the same concept <strong>in</strong> a second subcorpus, we use a Fisher<br />

Exact test to check the hypothesis that both samples are drawn from the same population.<br />

We use the p-value from the Fisher Exact test as a filter for $D_{CB}$. We set the<br />

dissimilarity between subcorpora at zero if p > 0.05, and we use $D_{CB}$ if p < 0.05 (see footnote 2).<br />

If we now apply this step to the fictional data from Table 1 and 2, we must first<br />

calculate the Fisher Exact p value for every concept, verify<strong>in</strong>g that the absolute frequencies<br />

for American and British English are sampled from different populations. For<br />

SUBTERRANEAN PUBLIC TRANSPORT, the p value is much smaller than 0.05, so we can accept that British English is different from American English when it comes to this concept.<br />

Therefore, we calculate the City-Block distance by means of Equation 3 for SUBTERRANEAN<br />

PUBLIC TRANSPORT. Filling in the equation, we get 0.5 × [(|0.875 − 0.286|) +<br />

(|0.125 − 0.714|)] = 0.589. For the concept of a SMALL INSTRUMENT PLAYED WITH A BOW we<br />

f<strong>in</strong>d a p value for the Fisher Exact test larger than 0.05, so we can say that British English<br />

is statistically speak<strong>in</strong>g not a different population than American English. Therefore,<br />

we can set the distance between these varieties for this concept at zero.<br />
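The filtering step and the per-concept City-Block distance can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the Fisher Exact test is written out for 2×2 tables only (profiles with exactly two variants, as in Table 1), the two-sided p value sums all tables that are at most as probable as the observed one, and treating p = 0.05 as non-significant is an assumption, since the text leaves that boundary case open.

```python
from math import comb


def relative_frequencies(counts):
    """Equation 2: turn absolute variant frequencies into relative ones."""
    total = sum(counts)
    return [count / total for count in counts]


def city_block(counts_1, counts_2):
    """Equation 3: half the summed absolute differences between two profiles."""
    r1 = relative_frequencies(counts_1)
    r2 = relative_frequencies(counts_2)
    return 0.5 * sum(abs(a - b) for a, b in zip(r1, r2))


def fisher_exact_2x2(a, b, c, d):
    """Two-sided Fisher Exact p value for the table [[a, b], [c, d]]."""
    row_1, row_2, col_1 = a + b, c + d, a + c
    n = row_1 + row_2

    def probability(k):
        # Hypergeometric probability of seeing k in the top-left cell.
        return comb(row_1, k) * comb(row_2, col_1 - k) / comb(n, col_1)

    observed = probability(a)
    lowest, highest = max(0, col_1 - row_2), min(row_1, col_1)
    # Sum every table that is no more probable than the observed one.
    return sum(
        p
        for p in (probability(k) for k in range(lowest, highest + 1))
        if p <= observed * (1 + 1e-9)
    )


def profile_distance(counts_1, counts_2, alpha=0.05):
    """City-Block distance, set to zero when the profiles do not differ significantly."""
    p_value = fisher_exact_2x2(counts_1[0], counts_1[1], counts_2[0], counts_2[1])
    return city_block(counts_1, counts_2) if p_value < alpha else 0.0


# Table 1, per variety: (subway, underground) and (violin, fiddle).
subway_distance = profile_distance([70, 10], [20, 50])
violin_distance = profile_distance([50, 40], [30, 35])
print(round(subway_distance, 3), violin_distance)
```

With the fictional counts from Table 1, the sketch reproduces the two outcomes of the worked example: a distance of about 0.589 for SUBTERRANEAN PUBLIC TRANSPORT and a distance of zero for SMALL INSTRUMENT PLAYED WITH A BOW. For profiles with more than two variants, a Fisher Exact test for larger tables would be needed instead of this 2×2 version.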

To calculate the dissimilarity between subcorpora on the basis of many profiles,<br />

we just sum the dissimilarities for the <strong>in</strong>dividual profiles. In other words, given a set of<br />

profiles $L_1$ to $L_m$, then the global dissimilarity $D$ between two subcorpora $V_1$ and $V_2$<br />

on the basis of $L_1$ up to $L_m$ can be calculated as:<br />

$$D_{CB}(V_1, V_2) = \sum_{i=1}^{m} \left( D_{CB,L_i}(V_1, V_2)\, W(L_i) \right) \tag{4}$$<br />

The W <strong>in</strong> the formula is a weight<strong>in</strong>g factor. We use weights to ensure that concepts<br />

which have a relatively higher frequency (summed over the size of the two subcorpora<br />

that are be<strong>in</strong>g compared) 3 also have a greater impact on the distance measurement. In<br />

other words, <strong>in</strong> the case of a weighted calculation, concepts that are more common <strong>in</strong><br />

everyday life and language are treated as more important. Apply<strong>in</strong>g this to the fictional<br />

example from Table 1, we can calculate the $W$ per concept by dividing the sum of the<br />

absolute frequencies of all variants for one concept by the sum of the absolute frequencies of all variants of all concepts.<br />

For SUBTERRANEAN PUBLIC TRANSPORT this equals (70 + 10 + 20 + 50)/(70 + 10 + 20 + 50 +<br />

50 + 40 + 30 + 35) = 0.492. There is no need to calculate the $W$ for SMALL INSTRUMENT<br />

PLAYED WITH A BOW as its distance is already set to zero. Fill<strong>in</strong>g out equation 4, we f<strong>in</strong>d<br />

that the distance between British English and American English <strong>aggregate</strong>d over both<br />

concepts is (0.589 × 0.492) + 0 = 0.29.<br />
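The weighting and summation can be sketched in a few lines of Python. The names `TABLE_1`, `concept_weights` and `aggregate_distance` are hypothetical, and the per-concept distances are simply taken over from the worked example (0.589 and 0) rather than recomputed. The sketch also assumes that every profile passes the frequency threshold mentioned in the footnotes, which holds for both fictional concepts.

```python
# Table 1: absolute frequencies per variant as (American, British) pairs.
TABLE_1 = {
    "SUBTERRANEAN PUBLIC TRANSPORT": {"subway": (70, 20), "underground": (10, 50)},
    "SMALL INSTRUMENT PLAYED WITH A BOW": {"violin": (50, 30), "fiddle": (40, 35)},
}


def concept_weights(table):
    """Weight per concept: its summed frequency divided by the summed frequency of all concepts."""
    totals = {
        concept: sum(american + british for american, british in variants.values())
        for concept, variants in table.items()
    }
    grand_total = sum(totals.values())
    return {concept: total / grand_total for concept, total in totals.items()}


def aggregate_distance(distances, weights):
    """Equation 4: weighted sum of the per-concept distances."""
    return sum(distances[concept] * weights[concept] for concept in distances)


weights = concept_weights(TABLE_1)

# Per-concept distances as given in the worked example (after the Fisher filter).
distances = {
    "SUBTERRANEAN PUBLIC TRANSPORT": 0.589,
    "SMALL INSTRUMENT PLAYED WITH A BOW": 0.0,
}

total = aggregate_distance(distances, weights)
print(round(weights["SUBTERRANEAN PUBLIC TRANSPORT"], 3), round(total, 2))
```

Because the weights sum to one, the aggregated distance stays within the interval [0,1], just like the per-concept distances.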

2 If the frequency of the profile was lower than 30 in the two varieties that are being compared, that<br />

profile was excluded from the comparison.<br />

3 The size of the two subcorpora is not the actual amount of words in the two subcorpora, but the sum<br />

of all profiles in these two subcorpora with a frequency higher than 30.<br />

Now, we contrast text categorization with the profile-based approach, which<br />

incorporates probabilistic information of word choice. In text categorization, non-categorical<br />

(probabilistic) word choice is well accounted for (unlike dialectometric approaches),<br />

but text categorization totally ignores the onomasiological perspective on<br />

lexical <strong>variation</strong>. This is primarily due to the fact that text categorization often zooms<br />

<strong>in</strong> on topical categorization, and the onomasiological approach to lexical <strong>variation</strong><br />

with<strong>in</strong> conceptual categories is exactly a way of downplay<strong>in</strong>g thematic bias <strong>in</strong> the <strong>variation</strong>al<br />

patterns (Speelman et al. 2003). However, other forms of text categorization,<br />

e.g. authorship attribution or l<strong>in</strong>guistic profil<strong>in</strong>g, quite the opposite of topic classification,<br />

also ignore onomasiological <strong>variation</strong> and use mere (relative) occurrence frequencies<br />

of the features <strong>in</strong> the aggregation step. This is problematic, especially given<br />

the recent trend <strong>in</strong> authorship attribution studies to use content words.<br />

Whereas the profile-based approach will be the quantitative method that <strong>in</strong>corporates<br />

conceptual control in our comparison of methods, we will use the text categorization<br />

approach as the quantitative method that ignores conceptual similarity<br />

between the words in the variable set. Except for the distance metric used, the two approaches<br />

are identical. The underlying metaphor of both the profile-based and the text categorization<br />

approach is spatial: subcorpora are represented as points in an n-dimensional<br />

space by means of the occurrence frequencies of n words. A made-up example in a<br />

two-dimensional space, i.e. with two words, containing two text types might make<br />

this rather abstract metaphor clearer. Given two subcorpora representing the text<br />

types “academic articles” and “computer mediated communication”, and given two<br />

words “hence” (a l<strong>in</strong>k<strong>in</strong>g word used <strong>in</strong> academic articles) and “LOL” (an abbreviation<br />

of “Laugh<strong>in</strong>g Out Loud”, commonly used <strong>in</strong> IRC), one might construct the “space” <strong>in</strong><br />

Figure 1. The position of the academic articles <strong>in</strong> the bottom right part is due to the high<br />

frequency of “hence” and the low frequency of “LOL” <strong>in</strong> these texts. The position of<br />

the computer-mediated communication <strong>in</strong> the top left part is due to the low frequency<br />

of “hence” and the high frequency of “LOL” <strong>in</strong> these texts. Obviously, these data are<br />

made up for the sake of the argument. Now, two lines can be drawn through the origin of the space and the position of the text types (on the basis of the frequencies of<br />

the words that make up the dimensions), yield<strong>in</strong>g an angle, for which the cos<strong>in</strong>e can<br />

be calculated. A small angle implies high similarity between the text types, and will<br />

yield a high cos<strong>in</strong>e value; a large angle implies low similarity, and will yield a low cos<strong>in</strong>e<br />

value. More <strong>in</strong>formation on the cos<strong>in</strong>e metric can be found <strong>in</strong> Baeza-Yates and<br />

Ribeiro-Neto (1999: 27).<br />

Fig. 1: Two-dimensional example of the vector model
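The made-up example can also be worked through numerically. The following Python sketch is an illustration only: the function name `cosine_distance` and all frequencies are invented for the purpose, and the distance is taken to be one minus the cosine of the angle between the two frequency vectors, as described in the text.

```python
import math


def cosine_distance(x, y):
    """Return 1 - cos(x, y) for two equal-length frequency vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("the cosine is undefined for an all-zero vector")
    return 1.0 - dot / (norm_x * norm_y)


# Invented frequencies, in the order ("hence", "LOL").
academic = [90, 2]        # many "hence", few "LOL": bottom right in the space
cmc = [3, 80]             # few "hence", many "LOL": top left in the space
academic_half = [45, 1]   # same proportions as `academic`, half the size

print(round(cosine_distance(academic, cmc), 3))
print(round(cosine_distance(academic, academic_half), 3))
```

Because only the angle matters, a subcorpus with the same proportions but a different size (here `academic_half`) lies at a distance of practically zero from `academic`, whereas the two text types are almost maximally distant.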



Formally, given two subcorpora V1 and V2 in which the frequencies of a large number of words were counted and stored in the respective vectors x and y, we calculate the distance between the subcorpora by means of Equation 5.

\[
D_{\cos}(V_1, V_2) = 1 - \cos(x, y) = 1 - \frac{x \cdot y}{|x|\,|y|} = 1 - \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2}\,\sqrt{\sum_{i=1}^{n} y_i^2}} \tag{5}
\]

4 Case study

The case study of this paper is an analysis of aggregated lexical variation in the pluricentric language Dutch. It consists of a comparison between the state-of-the-art text categorization distance metric, which ignores conceptual control, and the profile-based distance metric, which includes conceptual control. In order to guarantee an objective comparison, we will apply both methods to the same dataset, which is tailored to contain a specific constitution of variational dimensions. The method that best approaches the expected structure will be considered superior. In what follows, we first introduce the dataset by describing the set of lexical features and the corpus in which these features will be counted. Second, we apply the profile-based method to this dataset. Then, the state-of-the-art text categorization method is also applied to the dataset. Finally, it will be concluded that the profile-based onomasiological approach grasps the a priori constitution of variational dimensions much better than the text categorization method.

The lexical input features are derived from the “Referentiebestand Belgisch Nederlands” (Martin 2005, Eng. Reference List of Belgian Dutch, abbreviation “RBBN”). This reference list contains words or expressions that exclusively appear in Belgian Dutch, and have no occurrences in The Netherlands, according to dictionaries, corpora and informants. The list contains about 4000 items, ranging from colloquial items, over culturally linked items (e.g. Belgian institutes) to register-specific and freely varying items. As an example, a small selection of items is listed in Table 3, but the whole list can be downloaded freely from the website of the “Instituut voor Nederlandse Lexicologie”. For each Belgian Dutch item, the list provides an alternative from general Dutch, or sometimes typically Netherlandic Dutch. From the 4000 items on the list, we only retained 1455 items for which the Belgian Dutch item itself and its alternative consist of one single word. If we restrict the RBBN list to these single-word items, thus excluding multi-word units and expressions, these items can be counted accurately in an automatic way by merely keeping track of the occurrence frequency of the words in the subcorpora. 4 Indeed, expressions and multi-word units may be distributed over the sentence because of syntactic constructions, making automatic counting very hard. All (single) words on the list were analyzed with the Alpino parser, so that accurate counts of the lemmata could be made, while controlling for the part-of-speech. Linking back to the issue of conceptual categories in Section 2, we accept the conceptual categories of the makers of the RBBN in their equivalence judgement between the Belgian Dutch item and its alternative.

4 We address possible polysemy issues and the need for word sense disambiguation in automatic counting in the conclusions.

Tab. 3: Selected examples from the RBBN

Belgian Dutch    General Dutch    Translation of concept
suikerboon       doopsuiker       candy to honor the birth of a baby
appelsien        sinaasappel      orange (fruit)
unaniem          eenparig         unanimous
ambras           ruzie            a row
confituur        jam              marmalade
binnenkoer       binnenplaats     atrium

Because we know that this list contains Belgian Dutch words and an alternative, we can predict that the main variation in the list will be due to a national pattern. Indeed, even the non-national variation which is present in the list (e.g. colloquialisms) is still embedded in the Belgian Dutch point of view of the RBBN. In other words, every variable in the variable set is at least nationally patterned. Therefore, we expect the results of our method to show a clear distinction between the two national varieties, and other variational dimensions will only appear after that.

In our corpus, we incorporate samples from the two national varieties of Dutch, taken from two registers (quality newspapers and Usenet), and from two topics (politics and economy). We collected a total of 6 million words, which were evenly split over the nations, registers and topics. The quality newspaper articles were sampled from two large newspaper corpora that are available for both Netherlandic and Belgian newspapers. From these two corpora, we selected four newspapers that are deemed to be quality newspapers: “De Standaard” and “De Morgen” for Belgium, and “Volkskrant” and “NRC” for The Netherlands. For most of the articles that appeared in the newspapers, there is access to the category in which they were published. This categorization was used to select the articles on the topics “politics” and “economy”.

The Usenet posts were downloaded from a large Usenet archive, available online at Google Groups, and automatically stripped of meta-information (headers and html code) and duplicated content (quotes from previous posts). Only posts from the groups “be.politics”, “be.finance”, “nl.politiek” and “nl.financieel.*” were downloaded, where the country affiliation of the group was taken to be an indication of the nationality of the author of the post, and where the topical restriction of the group indicates the topic of the post. All texts were lemmatized and tagged with part-of-speech information by the Alpino parser (Bouma et al. 2001).



With these three dimensions (country, register, topic) and two levels for each dimension, 8 combinations are possible. These combinations, e.g. Belgian quality newspapers on economy (abbreviated as qnp.be.e), will be represented by the subcorpora, for which we will calculate the pairwise distances. However, to increase the number of data points and in order to verify the internal consistency of the subcorpora, we divided every subcorpus into two equally sized groups (abbreviated as e.g. qnp.be.e.0 and qnp.be.e.1). In total then, we counted the frequencies of the linguistic characteristics introduced above in 16 subcorpora. A snippet of this input data is presented in the appendix to this paper.
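The design of the subcorpora can be spelled out in a few lines of Python. This is a sketch for illustration only: the label scheme follows the abbreviations qnp.be.e.0 and qnp.be.p.0 that occur in the text and the appendix, but the abbreviation `use` for Usenet is a hypothetical placeholder, since the Usenet label is not given here.

```python
from itertools import product

# Levels of the three variational dimensions, plus the two halves per subcorpus.
registers = ["qnp", "use"]   # "qnp" = quality newspapers; "use" is a made-up label for Usenet
countries = ["be", "nl"]
topics = ["e", "p"]          # economy, politics
halves = ["0", "1"]

# Every combination of register, country, topic and half yields one subcorpus label.
labels = [".".join(parts) for parts in product(registers, countries, topics, halves)]


def differs_in_country(label_a, label_b):
    """True if the two subcorpus labels have a different country component."""
    return label_a.split(".")[1] != label_b.split(".")[1]


# All unordered pairs of subcorpora, i.e. the cells of the distance matrix.
pairs = [(a, b) for i, a in enumerate(labels) for b in labels[i + 1:]]
cross_national = [pair for pair in pairs if differs_in_country(*pair)]

print(len(labels), len(pairs), len(cross_national))
```

The count of cross-national pairs is the number of pairwise comparisons for which, according to the expectations formulated in the text, the distance should be larger than for pairs with the same national affiliation.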

Given the omnipresent country dimension in the input features, the primary variational dimension that could be expected to be revealed among the subcorpora is the Belgian Dutch versus Netherlandic Dutch dimension. In terms that relate to the distance measurement method: in a pairwise comparison of subcorpora with a national difference, the distance will be bigger than in a comparison of two subcorpora with the same national affiliation. Because the typical Belgian Dutch words are sometimes restricted to a specific register, e.g. colloquialisms, a register distinction should emerge as well. And as words and their conceptual categories are inevitably sensitive to topic, we would expect the difference between political and economic subcorpora to emerge, too. However, the register and topic dimensions should be secondary to the country dimension.

4.1 Results of the profile-based method

We first look into the results of the profile-based approach, introduced above. To the selected Belgian Dutch items on the RBBN list, we added the knowledge of which General Dutch alternatives are conceptually equivalent. In other words, we introduce conceptually controlled profile information to the distance metric. A profile thus consists of a Belgian Dutch word from the RBBN list, together with its general Dutch alternative. Remember that the underlying distance metric is basically a City-Block distance measure (see Formula 4). Now, we zoom in on the two- and three-dimensional visualizations of all the pairwise profile-based distances between the subcorpora, made by means of non-metric two-way one-mode Multidimensional Scaling (Cox and Cox 2001), as can be seen in Figure 2. 5

5 The coordinates of a Multidimensional Scaling solution can be scaled freely, as long as the same scaling is applied to all dimensions. Therefore, we omitted the scale on the axes, as these numbers would not be insightful. However, we made sure that the x and y (and z for three-dimensional solutions) axes always have the same scale, so that the distances between the subcorpora on the different dimensions can be interpreted.
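The idea of a profile can be illustrated with a small Python sketch. It is a simplification under stated assumptions, not the metric of Formula 4: the weighting term W mentioned in footnote 7 is left out, concepts are averaged without weights, and all counts are invented (only the word pairs zever/onzin and weeral/alweer are taken from the appendix). The function names are hypothetical.

```python
def profile(counts):
    """Relative frequencies of the variants that name one concept (None if unattested)."""
    total = sum(counts.values())
    if total == 0:
        return None
    return {word: count / total for word, count in counts.items()}


def profile_distance(counts_a, counts_b):
    """City-Block distance between the two profiles of one concept."""
    prof_a, prof_b = profile(counts_a), profile(counts_b)
    if prof_a is None or prof_b is None:
        return None  # the concept is not attested in one of the subcorpora
    words = set(prof_a) | set(prof_b)
    return sum(abs(prof_a.get(w, 0.0) - prof_b.get(w, 0.0)) for w in words)


def aggregate_distance(subcorpus_a, subcorpus_b):
    """Unweighted mean of the profile distances over all shared, attested concepts."""
    distances = [
        profile_distance(subcorpus_a[concept], subcorpus_b[concept])
        for concept in subcorpus_a
        if concept in subcorpus_b
    ]
    distances = [d for d in distances if d is not None]
    return sum(distances) / len(distances) if distances else None


# Invented counts for two concepts in two hypothetical subcorpora.
belgian = {"nonsense": {"zever": 15, "onzin": 45}, "again": {"weeral": 9, "alweer": 21}}
netherlandic = {"nonsense": {"zever": 0, "onzin": 40}, "again": {"weeral": 1, "alweer": 9}}

print(round(aggregate_distance(belgian, netherlandic), 3))
```

Because each profile is normalized within its own concept, a subcorpus that simply talks about a concept more often (a topic effect) does not move: multiplying both counts of a concept by the same factor leaves the distance unchanged. This is the sense in which the profile-based measure controls for conceptual variation.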



Fig. 2: Linguistic distance between subcorpora (profile-based, two-dimensional)

Multidimensional Scaling is a dimension reduction technique which is applied here to a matrix holding all the pairwise profile-based distances between the subcorpora. Because the result of a Multidimensional Scaling analysis is a reduction of the original input, a certain error is introduced. The error rate is captured by a “stress” value, with 0% stress equal to no error at all. It is generally acceptable to present Multidimensional Scaling solutions up to a stress level of 10–15%. Usually, Multidimensional Scaling is used to return one-, two-, or three-dimensional reductions, so that visualization is possible. With every added dimension, the error rate goes down, as the reduction becomes less severe. The fall of the error rate with added dimensions is shown in a so-called screeplot. The screeplot in Figure 3 shows a stress difference of about 7% between a one-dimensional and a two-dimensional Multidimensional Scaling solution. Therefore, we first interpret the horizontal dimension (of an unrotated solution) as it represents the most important variation in Figure 2. In this case, the profile-based approach makes a distinction between Belgian subcorpora (black font) and Netherlandic subcorpora (grey font) on the first dimension. The grey zero-line divides the two countries perfectly. The vertical dimension makes a distinction between quality newspapers (normal font) and Usenet articles (bold font). Here again, the grey zero-line marks a perfect distinction between the two registers. Overall, there is a very clear grouping of the subcorpora, with a clear separation of the topics only in the Belgian Usenet. The range of Belgian register variation is also somewhat larger than the Netherlandic range, but this probably has to do with the focus on Belgian Dutch variation in the input features. Most importantly, however, the profile-based approach yields a visualization that complies with our expectations of finding a national pattern first, followed by register variation on the second dimension.
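Why the error goes down with every added dimension can be made concrete with a toy example. The Python sketch below is a simplification under stated assumptions: it computes a raw, metric version of stress that compares the configuration distances directly with the input dissimilarities, whereas non-metric Multidimensional Scaling first replaces the dissimilarities by monotonically transformed disparities. The four points and the function name `stress1` are invented, and the one-dimensional configuration is merely a candidate, not an optimized solution.

```python
import math
from itertools import combinations


def stress1(dissimilarities, configuration):
    """Raw stress: sqrt(sum((d - delta)^2) / sum(d^2)) over all pairs of points.

    `dissimilarities` is a square matrix of input distances (delta) and
    `configuration` a list of coordinate tuples in the reduced space (giving d).
    """
    numerator = 0.0
    denominator = 0.0
    for i, j in combinations(range(len(configuration)), 2):
        d = math.dist(configuration[i], configuration[j])
        numerator += (d - dissimilarities[i][j]) ** 2
        denominator += d ** 2
    if denominator == 0:
        raise ValueError("all points coincide; stress is undefined")
    return math.sqrt(numerator / denominator)


# Four invented items whose true distances are those of the corners of a unit square.
corners = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
delta = [[math.dist(p, q) for q in corners] for p in corners]

one_dimensional = [(x,) for x, _ in corners]   # keep only the first coordinate
two_dimensional = corners                      # the full configuration

print(round(stress1(delta, one_dimensional), 3))
print(round(stress1(delta, two_dimensional), 3))
```

The one-dimensional configuration collapses pairs of corners onto each other and therefore has a considerable stress, whereas the two-dimensional configuration reproduces the input distances without error. A screeplot displays this fall in stress as dimensions are added.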



Fig. 3: Screeplot for non-metric Multidimensional Scaling solution (profile-based)

The screeplot suggests that a three-dimensional solution might improve the quality of the visualization by another 5 or 6%. Therefore, we calculated a three-dimensional solution, which is represented in Figure 4. 6 Instead of rendering a three-dimensional plot, we drew the scatterplot of dimension 1 versus dimension 2, and the scatterplot of dimension 1 versus dimension 3. This shows us how, even in a three-dimensional solution, dimension 1 still divides Belgian and Netherlandic subcorpora, and that dimension 2 divides the quality newspaper articles from Usenet. However, this register division in the three-dimensional solution is not as neat as in the two-dimensional solution, because one of the Netherlandic Usenet fragments crosses over into the quadrant of the Netherlandic quality newspaper fragments. For dimension 3, we can see a split for the topics of the Belgian subcorpora, with economy-related subcorpora (marked with an e) at the top left of dimension 3, and politics fragments at the bottom. On the Netherlandic side, the register (dimension 2) and topic (dimension 3) split is muddled. The register and topic divisions of the Belgian subcorpora, however, are perfect for dimension 2 and dimension 3 respectively. The quality of the grouping on the Belgian side is obviously due to the input variables, which are specifically sensitive to Belgian Dutch variation. This indicates that the choice for a Belgian Dutch term is not only nationally but also stylistically patterned.

Fig. 4: Linguistic distance between subcorpora (profile-based, three-dimensional)

6 Note that a two-dimensional non-metric Multidimensional Scaling solution is not a subset of a three-dimensional non-metric Multidimensional Scaling solution. Therefore, the first two dimensions of the three-dimensional solution of Figure 4 are not necessarily identical to the two dimensions of the two-dimensional solution of Figure 2.

4.2 Results of the categorization method

Now, we present the method and the results of the state-of-the-art categorization approach, which uses the cosine similarity metric, instead of the adapted City-Block distance that is used in the profile-based approach.

In the current case-study, we take the RBBN items (and the alternatives) as individual features and remove the knowledge of conceptual categorization. If we calculate the similarities (and consequent distances) with these input features between the subcorpora in our dataset, and then produce the two-dimensional visualization with Multidimensional Scaling, we get the plot in Figure 5. If we create a screeplot (Figure 6) to show us how much stress difference there is between the first and the second dimension, we see that the second dimension reduces the stress of a one-dimensional solution by about 8%. Therefore, we will interpret the two dimensions in their own right, knowing however that the first dimension contains more pronounced distances than the second dimension.

Fig. 5: Linguistic distance between subcorpora (profile-based, three-dimensional)

Fig. 6: Linguistic distance between subcorpora (cosine, two-dimensional)

In Figure 6 we see on the horizontal axis (from left to right, dimension 1) a distinction between the Usenet articles (bold font) and the quality newspaper articles (regular font). The light grey vertical line indicates the zero-line of the horizontal dimension. Normally, that line demarcates the boundary between two areas. Whereas we would expect the most important variation (thus, on the horizontal dimension) to be related to country, we encounter a distinction between registers. The vertical dimension (from bottom to top) tends to divide Belgium (black font) from The Netherlands (grey font), but not very clearly. The (politics) Netherlandic Usenet articles sink below the horizontal zero-line, and the (economy) Belgian Usenet articles rise above that line. Moreover, we notice that the topics are set apart in groups as well, except for the quality newspapers from The Netherlands. All in all, the categorization approach yields a somewhat unclear grouping of subcorpora and an unexpected promotion of register variation as the most important variation in the input features.

The screeplot shows that a three-dimensional solution would reduce the stress even more, to an almost optimal level. Therefore, we calculated a three-dimensional solution and represent the three dimensions in Figure 7. We apply the same idea as for the profile-based approach to plot dimension 1 and 2, and then dimension 1 and 3. Just like in the two-dimensional solution, we see that dimension 1 divides quality newspaper fragments from Usenet fragments, and that dimension 2 tends to divide the national subcorpora. The three-dimensional solution does a slightly better job than the two-dimensional solution, because the nation division on dimension 2 is now almost correct. Dimension 3 largely divides the topics, with politics-related fragments at the top, and economy-related fragments at the bottom. This division is almost perfect, although the grouping of the subcorpora is not so neat. Overall, though, the categorization method yields messier output than the profile-based approach.

Fig. 7: Screeplot for non-metric Multidimensional Scaling solution (cosine)

5 Conclusion

The two main theoretical questions of this paper have been (a) how important is the notion of a conceptual category in an aggregate study of variation in the lexicon and (b) what is the status of conceptual categories for lexical variation? Moreover, we have claimed that sociolectometric methodology, of which the current study is an example, is needed to study a pluricentric language. The link with pluricentric languages, in this case Dutch, is also made in the case-study, which shows how conceptual categories and their consequent conceptual control are necessary to reveal the national dimension in the lexicon. In other words, the national varieties of Dutch do not differ so much in their use of words (both Belgium and the Netherlands use different words for different topics and registers), but they do differ in their choice of words for expressing a conceptual category. This latter point is made clear in the case-study by means of the comparison between a profile-based onomasiological approach and a text categorization approach. The text categorization approach grasped the mere use of individual words and compared the use of words in two subcorpora by means of the cosine similarity metric, which was not informed about the conceptual similarity between words. Consequently, the text categorization showed that there was a pattern of register and topic in the input features, stronger than the anticipated national pattern. The onomasiological approach, on the contrary, revealed a strong national dimension in word choice for naming a conceptual category.

Of course, in order to have an expected ranking in the variational dimensions, and in order to compare the outcome of the aggregation approaches, the dataset had to be manipulated so that a certain pattern could convincingly be assumed. With that goal in mind, the variable set was taken from a reference list of Belgian Dutch, so that national variation is built into the dataset. As such, the two aggregation approaches could be compared by assessing how well they retrieve the national variation. It is important to understand, though, that an actual descriptive sociolectometric study can by no means rely on such a biased input variable set. Therefore, the results of this paper can only be of methodological value. Given the a priori known pattern of national variation in the dataset used in the case-study, though, one might jump to the conclusion that an onomasiological approach is better suited for finding variational patterns in the lexicon, and the preferred method for any sociolectometric study. However, there are a number of problems with this conclusion.

First of all, perhaps we are wrong in the assumption that national variation is the strongest dimension in the lexical variable set and the available subcorpora; it could well be that word use, as shown in the categorization approach, is actually more strongly influenced by a register or topic dimension, and that the onomasiological approach artificially weakens these dimensions. 7 In that case, we would have to tone down the conclusion, and say that an onomasiological approach with conceptual control is a methodological means of revealing and boosting specific underlying dimensions of variation. Moreover, we would like to point out that our corpus only sampled two topics and two registers, which is not enough to support strong generalizations. Further research is therefore needed with more topics and registers. All this, of course, does not weaken the strength of a profile-based approach, but it rather points out the importance of knowing what is being measured. Our claim now is that the profile-based approach allows for much more control over what is measured than the text categorization method, and should therefore be preferred.

7 Although the profile-based City-Block distance incorporates a W term that brings the frequency of the conceptual category into play.

Second, the onomasiological approach assumes a relation of identity of (conceptual) meaning between the variants and this is theoretically problematic. Following Edmonds and Hirst (2002), we agree that perfect synonymy (the highest possible level of detail in describing a conceptual category, and still finding multiple words that fit the category) is extremely rare. By admitting this, our notion of semantics or word meaning follows the Cognitive Linguistic view that encyclopedic knowledge is indispensable. Translating the idea of Peter Harder that structural categories need not be complete, and that the abstraction goes only as far as is functional for language users (here we link up to the prototype theory of word meaning, cf. Rosch and Mervis 1975), we can reach near-synonymy by slightly relaxing the level of detail of the conceptual category: not every language user has an identical representation of a word in his head, but nonetheless two language users can communicate with that word. Idealized Cognitive Models (Lakoff 1987) or Frames (Fillmore 1994) are examples of describing meaning, while balancing semasiological detail and operational functionality. In future research, we will operationalize the bottom-up creation of conceptual categories by applying Word Space Models (Turney and Pantel 2010).

Third, an onomasiological approach requires prior semasiological analysis to exclude contextual nuances or polysemy. In the case-study of this paper, the lemmatized forms of the RBBN words were naively counted in the corpus, without further checking the context of each occurrence. Closer inspection revealed that the RBBN list does not contain many potentially polysemous items, so that we can ignore the small error that must be present in the frequencies for the purposes of the current paper. However, as we want to perform the above analyses in future research with a naturalistic sample of lexical variation, instead of an a priori list of national variation, a semasiological study for every occurrence needs to be done in order to establish the conceptual control. As this would be an infeasible manual task when using a large number of variables, we will rely further on the advances being made in the field of Word Space Models to automate this task.

To conclude this paper, we try to answer our initial questions. How important is the notion of a conceptual category in an aggregate study of the lexicon? The case-study has shown that conceptual control is necessary to reveal variational dimensions that are hidden in the overwhelming content (topic) function of words. Without conceptual control, the conclusion of the categorization approach would have been that different words are used to refer to different content, and that they may also signal register and perhaps national differences. This observation, albeit true and undeniable, is not the goal of an aggregation study: it is obvious that an aggregation of many words will be sensitive to content differences among subcorpora. Therefore, conceptual control, in the form of conceptual categories that group together similar words, is needed. And this brings us to the second question: what is the status of conceptual categories for lexical variation? Although practical as a methodological and heuristic device, the conceptual categories remain somewhat artificial because of the flexibility in their definition. In the current case study, the makers of the RBBN clearly had referential equivalence in mind for most categories. However, conceptual categories can be defined more strictly or less strictly at the whim of the researcher, because there is no consensus over the appropriate level of detail in the definition, especially since the incorporation of encyclopedic knowledge in word meaning. The level of detail that is operational in the language community can only be retrieved by studying the actual use of words.

And then we are back at variation.


Appendix<br />


Tab. 4: Snippet of the input data for both aggregation methods. Pairs of rows make up lexical variables.<br />

word qnp.be.e.0 qnp.be.e.1 qnp.be.p.0 qnp.be.p.1 qnp.nl.e.0 qnp.nl.e.1 qnp.nl.p.0 qnp.nl.p.1 use.be.e.0 use.be.e.1 use.be.p.0 use.be.p.1 use.nl.e.0 use.nl.e.1 use.nl.p.0 use.nl.p.1<br />
leefbaar 9 3 8 11 1 0 0 0 0 1 9 4 0 0 24 18<br />
levensvatbaar 2 4 2 0 2 1 3 2 0 0 1 1 0 0 4 4<br />
hangar 0 1 0 1 0 0 1 2 0 0 1 1 0 0 1 1<br />
loods 8 6 4 18 4 11 5 2 0 0 0 2 0 1 1 6<br />
schoon 7 10 10 12 29 21 13 11 0 2 7 3 2 4 66 85<br />
mooi 153 122 114 110 110 76 53 42 42 33 73 67 52 74 449 475<br />
dagorde 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0<br />
agenda 29 26 100 90 29 21 39 24 2 1 14 14 1 1 17 33<br />
knook 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0<br />
been 13 15 43 41 39 29 14 20 10 12 14 12 21 18 76 65<br />
zever 0 0 1 0 0 0 0 0 6 2 15 15 0 0 4 14<br />
onzin 7 1 23 30 8 5 5 3 5 10 44 61 26 43 451 485<br />
draad 4 6 14 10 6 13 2 3 1 2 31 32 9 10 90 87<br />
snoer 2 0 2 1 1 5 1 1 0 0 3 1 0 0 21 28<br />
weeral 0 0 2 0 0 0 0 0 9 3 9 9 0 1 4 1<br />
alweer 19 22 32 22 21 30 11 17 5 1 21 22 12 9 98 98<br />
fel 27 23 33 35 17 19 31 42 6 1 5 10 0 1 19 31<br />
erg 331 268 208 217 117 112 76 68 21 36 143 131 99 94 830 835<br />
strop 4 2 1 3 26 18 4 3 0 0 1 0 0 0 3 3<br />
strik 1 2 2 3 5 6 1 0 0 0 2 0 0 2 1 2<br />
verdiep 2 1 4 3 8 2 4 11 0 0 2 3 3 4 20 26<br />
verdieping 0 6 6 7 5 4 10 11 0 0 1 0 0 0 12 10<br />
stamp 6 2 9 5 5 1 0 2 1 0 5 5 0 0 11 10<br />
duw 27 16 42 34 20 25 13 16 1 1 13 8 0 5 27 28<br />
spaarzaam 0 1 0 1 2 2 1 2 0 0 0 0 0 0 1 0<br />
zuinig 3 10 5 12 18 21 4 1 0 0 2 3 0 0 10 13<br />
hospitaal 0 4 4 3 0 0 0 0 0 0 1 1 0 0 0 2<br />
ziekenhuis 26 34 82 60 11 40 11 11 0 1 15 15 0 2 61 92<br />
micro 1 1 2 3 0 0 0 0 0 1 0 0 1 1 2 1<br />
microfoon 1 1 2 10 2 3 3 7 0 0 0 0 0 0 34 28<br />
buis 7 2 2 1 4 1 6 3 0 0 2 1 0 0 18 12<br />
onvoldoende 57 56 38 60 36 29 18 28 4 4 2 7 3 8 23 23<br />
toelage 3 2 3 2 2 5 0 1 0 0 5 0 0 0 1 1<br />
subsidie 33 41 13 15 35 22 29 49 1 0 14 15 2 4 122 137<br />
woonst 1 2 3 3 0 0 0 0 0 0 1 1 0 0 0 0<br />
woning 47 60 45 54 47 70 2 21 17 15 8 9 23 17 54 91<br />
uitbater 13 11 3 8 1 1 2 4 0 0 3 2 0 0 6 4<br />
exploitant 2 2 2 2 15 13 3 5 0 0 0 0 0 0 1 1<br />
tussenkomst 19 8 17 13 3 3 0 1 1 2 0 1 2 2 0 6<br />
bijdrage 40 64 23 23 37 25 34 30 3 9 6 16 14 26 90 80<br />
tegenstrever 1 1 6 8 2 1 0 1 0 0 0 1 0 0 0 0<br />
tegenstander 24 19 70 77 16 17 38 32 0 0 18 16 5 5 63 64<br />
aanvang 5 5 3 3 7 8 2 2 0 0 1 3 1 2 3 4<br />
begin 635 550 499 507 637 554 322 341 78 71 139 201 100 102 706 712<br />
aanduiding 7 3 6 4 1 1 1 0 1 1 2 5 1 1 5 4<br />
benoeming 34 14 19 17 46 22 35 43 0 0 7 5 3 2 16 10<br />
tevergeefs 8 2 12 7 10 7 7 5 2 0 1 2 0 1 3 4<br />
vergeefs 2 0 0 2 3 7 4 14 0 0 0 4 0 0 0 4<br />
tewerkstelling 8 7 4 16 0 0 0 0 0 0 4 0 0 0 0 0<br />
werkgelegenheid 79 80 17 24 25 16 7 5 0 0 4 6 7 5 13 27<br />
zetel 42 61 91 62 25 23 42 43 1 0 34 32 1 1 193 195<br />
fauteuil 0 0 3 0 0 0 0 3 0 0 0 0 0 0 0 0<br />
verslaggever 11 10 29 43 3 1 8 5 0 0 0 0 0 0 21 28<br />
rapporteur 1 1 9 5 0 0 2 0 0 0 1 0 0 0 0 1<br />
verlieslatend 10 6 1 0 1 2 0 0 0 1 0 0 0 0 0 0<br />
verliesgevend 1 0 0 0 31 14 9 9 0 0 0 0 1 3 4 6<br />
vermits 4 5 1 4 0 0 0 0 19 12 16 20 0 0 1 2<br />
aangezien 95 81 32 43 24 28 2 3 33 25 45 36 33 26 161 148<br />
universitair 10 5 7 30 2 1 4 6 2 0 1 2 0 0 5 5<br />
academicus 6 1 13 9 2 0 1 2 0 0 1 1 0 0 4 6<br />
vaststelling 30 27 42 44 4 3 1 4 0 0 5 10 2 1 6 6<br />
constatering 1 0 0 1 15 6 0 4 0 0 1 0 1 2 11 12<br />
verhoog 184 178 25 38 107 112 36 34 8 11 12 12 23 22 39 41<br />
podium 1 1 20 25 3 2 4 7 0 0 4 1 0 0 7 5<br />
wedde 2 6 2 5 0 0 0 0 0 0 1 1 0 0 2 1<br />
salaris 13 13 1 0 96 83 25 26 0 0 3 0 6 4 49 44<br />
objectief 21 25 19 18 8 10 4 7 2 4 22 27 5 4 64 42<br />
doel 66 67 57 112 80 91 63 63 7 11 35 33 24 30 198 174<br />
nakend 9 15 12 10 1 1 0 1 0 1 3 1 1 1 0 0<br />
nabij 35 33 27 40 11 13 8 8 3 9 2 2 3 6 19 16<br />
nijverheid 18 14 1 0 0 0 0 0 0 0 0 1 0 0 0 0<br />
industrie 75 65 22 32 25 26 37 29 1 0 11 8 6 4 40 39<br />
inbreuk 21 25 6 17 3 2 1 3 0 1 4 3 1 0 8 5<br />
overtreding 15 14 25 40 6 8 4 9 1 0 9 10 2 2 12 26<br />
job 141 140 59 78 2 0 0 1 4 6 21 16 0 2 4 9<br />
baan 133 122 31 39 150 117 111 78 4 5 11 13 9 6 139 117<br />
maximum 10 12 4 4 6 19 2 6 12 6 6 4 11 16 29 21<br />
maximaal 47 35 25 30 79 76 20 16 21 11 5 7 35 36 38 39<br />
minimum 26 20 8 14 14 11 12 10 13 13 17 15 8 5 20 22<br />
minimaal 28 19 15 25 73 59 19 28 6 3 2 5 37 28 62 46<br />
merkwaardig 19 14 30 37 7 15 4 4 1 0 2 0 0 0 48 28<br />
opmerkelijk 47 52 66 57 67 56 20 20 2 0 6 4 1 0 28 11<br />
effectief 36 34 35 36 45 59 11 20 8 8 24 15 13 12 51 57<br />
daadwerkelijk 19 16 21 13 59 54 24 21 1 1 4 1 11 9 49 55<br />
stock 12 12 2 3 6 0 0 1 45 40 0 0 34 25 0 1<br />
voorraad 65 40 13 3 27 25 4 9 4 0 0 1 19 25 7 18<br />
stilaan 48 49 57 53 1 2 0 0 2 3 6 6 3 0 1 2<br />
langzamerhand 2 4 1 3 30 27 3 13 0 0 0 3 0 0 29 32<br />
serieus 24 20 40 16 41 32 56 53 30 27 63 56 40 29 196 197<br />
ernstig 72 52 101 88 31 24 23 28 3 1 27 37 4 3 94 119<br />
politieker 0 0 0 0 0 0 0 0 0 1 18 14 0 0 13 8<br />
politicus 48 81 321 275 52 37 47 58 1 2 89 93 7 6 289 221<br />
gerechtshof 2 3 4 2 17 16 9 7 0 0 2 1 1 0 3 13<br />
rechtbank 122 112 61 70 15 27 9 13 1 4 11 21 2 2 52 64<br />
prof 1 2 3 3 1 2 0 0 1 1 5 5 0 3 8 6<br />
professor 39 33 70 72 3 8 6 6 0 0 9 3 7 3 27 36<br />
fout 74 84 154 158 51 65 25 43 38 17 92 74 87 75 326 299<br />
overtreding 15 14 25 40 6 8 4 9 1 0 9 10 2 2 12 26<br />
publiciteit 9 6 5 6 16 18 9 11 0 0 4 5 2 1 17 14<br />
reclame 60 45 17 32 21 21 15 12 11 5 18 11 30 43 46 51<br />
proper 8 10 14 20 0 0 0 0 3 5 0 3 2 2 1 4<br />
schoon 7 10 10 12 29 21 13 11 0 2 7 3 2 4 66 85<br />
fier 1 4 15 13 1 4 0 1 3 0 5 6 0 0 1 1<br />
trots 15 19 25 25 22 32 11 16 2 0 9 9 2 3 69 63<br />
schepen 11 14 49 24 7 4 2 1 0 0 11 3 0 0 4 1<br />
wethouder 0 0 1 4 9 13 11 14 0 0 2 2 0 0 22 22<br />
schrijvelaar 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0<br />
Rekenhof 12 15 6 11 0 0 0 0 0 0 1 0 0 0 0 0<br />
Rekenkamer 6 7 10 3 17 33 4 65 0 0 0 0 0 0 0 1<br />
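
To make the caption's remark that pairs of rows make up lexical variables more concrete, the following is a minimal sketch of how such input could be turned into per-variable profiles and compared across two subcorpora. The counts are copied from Tab. 4 (columns qnp.be.p.0 and qnp.nl.p.0). The function names and the halved city-block distance averaged over variables are assumptions made for illustration only; they are not necessarily the measure used in the case study.

```python
# Illustrative sketch only: helper names and the distance measure are assumptions.

# (word, count in qnp.be.p.0, count in qnp.nl.p.0), copied from Tab. 4.
# Consecutive rows form one lexical variable (one concept, two variants).
ROWS = [
    ("hospitaal", 4, 0), ("ziekenhuis", 82, 11),
    ("wedde", 2, 0), ("salaris", 1, 25),
    ("job", 59, 0), ("baan", 31, 111),
]


def pair_rows(rows):
    """Group consecutive rows into pairs, one pair per lexical variable."""
    return [(rows[i], rows[i + 1]) for i in range(0, len(rows), 2)]


def profile(pair, col):
    """Relative frequency of each variant within one variable for one subcorpus.

    Returns None when the concept is not attested in that subcorpus.
    """
    total = sum(row[col] for row in pair)
    if total == 0:
        return None
    return [row[col] / total for row in pair]


def profile_distance(rows):
    """Mean halved city-block distance between the two subcorpus profiles."""
    distances = []
    for pair in pair_rows(rows):
        p, q = profile(pair, 1), profile(pair, 2)
        if p is None or q is None:
            continue  # skip variables unattested in either subcorpus
        distances.append(0.5 * sum(abs(a - b) for a, b in zip(p, q)))
    return sum(distances) / len(distances)


print(f"profile-based distance: {profile_distance(ROWS):.3f}")
```

Because each profile is normalised within its own concept, a subcorpus that simply talks more about hospitals or salaries does not by itself increase the distance; only the choice between the competing variants does.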

References<br />

Auer, Peter. 2005. Europe’s sociolinguistic unity, or: A typology of European dialect/standard constellations. In Nicole Delbecque, Johan van der Auwera & Dirk Geeraerts (eds.), Perspectives on variation, 7–42. Berlin and New York: Mouton de Gruyter.<br />

Baeza-Yates, Ricardo and Berthier Ribeiro-Neto. 1999. Modern information retrieval. New York: ACM Press & Addison-Wesley.<br />

Bickerton, Derek. 1971. Inherent variability and variable rules. Foundations of Language and Cognitive Processes 7(4). 457–492.<br />

Bouma, Gerlof, Gertjan van Noord, and Rob Malouf. 2001. Alpino: wide-coverage computational analysis of Dutch. In Walter Daelemans, K. Sima’an, J. Veenstra & J. Zavrel (eds.), Computational Linguistics in the Netherlands 2000, 45–59. Amsterdam: Rodopi.<br />

Clyne, Michael. 1992. Pluricentric languages: Differing norms in different nations. Berlin and New York: Mouton de Gruyter.<br />

Cox, Trevor and Michael Cox. 2001. Multidimensional scaling. London and New York: Chapman and Hall.<br />

Edmonds, Philip and Graeme Hirst. 2002. Near-synonymy and lexical choice. Computational Linguistics 28(2). 105–144.<br />

Fillmore, Charles. 1994. Starting where dictionaries stop: the challenge of corpus lexicography. In Beryl T. Sue Atkins & Antonio Zampolli (eds.), Computational approaches to the lexicon, 349–393. Oxford: Oxford University Press.<br />

Geeraerts, Dirk. 2009. Lexical variation in space. In Jürgen Erich Schmidt & Peter Auer (eds.), Language and space I: Theories and methods, 821–837. Berlin and New York: Mouton de Gruyter.<br />

Geeraerts, Dirk. 2010. Schmidt redux: How systematic is the linguistic system if variation is rampant? In Kasper Boye & Elisabeth Engberg-Pedersen (eds.), Language usage and language structure, 237–262. Berlin and New York: Mouton de Gruyter.<br />


Geeraerts, Dirk, Stefan Grondelaers and Dirk Speelman. 1999. Convergentie en divergentie in de Nederlandse woordenschat. Een onderzoek naar kleding- en voetbaltermen. Amsterdam: Meertens Instituut.<br />

Geeraerts, Dirk, Gitte Kristiansen, and Yves Peirsman (eds.). 2010. Advances in Cognitive Sociolinguistics. Berlin and New York: Mouton de Gruyter.<br />

Goebl, Hans. 1975. Dialektometrie. Grazer linguistische Studien. 32–38.<br />

Grieve, Jack, Dirk Speelman, and Dirk Geeraerts. 2011. A statistical method for the identification and aggregation of regional linguistic variation. Language Variation and Change 23. 193–221.<br />

Harder, Peter. 2010. Meaning in mind and society: A functional contribution to the social turn in Cognitive Linguistics. Berlin and New York: Mouton de Gruyter.<br />

Impe, Leen, Dirk Geeraerts, and Dirk Speelman. 2008. Mutual intelligibility of standard and regional Dutch language varieties. International Journal of Humanities and Arts Computing 2. 101–117.<br />

Kristiansen, Gitte and René Dirven (eds.). 2008. Cognitive Sociolinguistics: Language variation, cultural models, social systems. Berlin and New York: Mouton de Gruyter.<br />

Labov, William. 1966. The social stratification of English in New York City. Washington, D.C.: Center for Applied Linguistics.<br />

Lakoff, George. 1987. Women, fire and dangerous things: What categories reveal about the mind. Chicago: University of Chicago Press.<br />

Martin, Willy. 2005. Het Belgisch-Nederlands anders bekeken: het Referentiebestand Belgisch-Nederlands (RBBN). Technical report. Amsterdam: Vrije Universiteit Amsterdam.<br />

Nerbonne, John and William Kretzschmar. 2003. Introducing computational techniques in Dialectometry. Computers and the Humanities 37. 245–255.<br />

Rosch, Eleanor and Carolyne Mervis. 1975. Family resemblances: Studies in the internal structure of categories. Cognitive Psychology 7(4). 573–605.<br />

Séguy, Jean. 1971. La relation entre la distance spatiale et la distance lexicale. Revue de Linguistique Romane 35. 335–357.<br />

Speelman, Dirk, Stefan Grondelaers, and Dirk Geeraerts. 2003. Profile-based linguistic uniformity as a generic method for comparing language varieties. Computers and the Humanities 37. 317–337.<br />

Szmrecsanyi, Benedikt. 2010. The English genitive alternation in a cognitive sociolinguistics perspective. In Dirk Geeraerts, Gitte Kristiansen & Yves Peirsman (eds.), Advances in Cognitive Sociolinguistics, 141–166. Berlin and New York: Mouton de Gruyter.<br />

Turney, Peter and Patrick Pantel. 2010. From frequency to meaning: vector space models of semantics. Journal of Artificial Intelligence Research 37. 141–188.
