03.09.2013 Views

Lexical variation in aggregate perspective


Tom Ruette, Dirk Speelman, and Dirk Geeraerts

Lexical variation in aggregate perspective

Abstract: If one aims to study a pluricentric language with the goal of making general assertions about linguistic levels, an aggregate perspective in which many linguistic items that represent the linguistic level are considered is necessary. The current paper presents and compares two methodologies for aggregating lexical variation so that the similarity or dissimilarity between language varieties such as the centers of a pluricentric language can be quantitatively measured. The two methodologies differ with respect to the treatment of the semantic relation between words: whereas one method simply ignores the semantic relation between words, the other method incorporates the knowledge that some words are alternative means of naming a single concept. The question of which method is most suitable for measuring the similarity or dissimilarity between language varieties is raised and empirically tested in a corpus-based case study on the pluricentric language Dutch, as used in Belgium and the Netherlands. It will be shown that the method that incorporates semantic knowledge manages to go beyond possible conceptual variation between language varieties, clearly revealing an expected distinction between Dutch as used in Belgium and in the Netherlands. In contrast with this, the semantically non-informed method is disturbed by conceptual variation and is not able to convincingly show the distinction between Dutch as used in Belgium and in the Netherlands, although the set of linguistic items clearly suggests that such a national pattern should emerge.

Keywords: aggregate perspective, sociolectometry, lexical variation, Dutch

1 Introduction

The current paper shows how a sociolectometric approach is needed to disentangle the multidimensional structure of the varieties in a pluricentric language. There are different sociolectometric approaches, namely corpus-based methods, perception experiments, or attitude questionnaires; we will perform a corpus-based case study. Although the focus of a sociolectometric approach is on the varieties, the choice of the variables under analysis is crucial; we focus on lexical variation. Furthermore, in this paper we compare two quantitative corpus-based methods, which differ in their conceptual control of lexical variables: on the one hand, we take a method that ignores the conceptual relationship between the lexemes in the variable set; on the other hand, there is a method that incorporates knowledge about conceptual identity between lexemes. The importance and difficulties of conceptual control when studying variation in the lexicon as a whole are shown by means of a case-study on the pluricentric language Dutch.

The pluricentric character of Dutch is now widely accepted: Dutch is used both in Belgium and in the Netherlands, but each nation has its own norm-generating center (cf. Clyne 1992). This is different from the imposed situation in earlier years, especially the sixties, where Dutch in Belgium was supposed to be exogenously modeled on the norms of the Netherlands. Recently, by means of empirical work by e.g. Geeraerts et al. (1999) and experimental work by e.g. Impe et al. (2008), this historical view had to be adjusted to the current view, as described in Auer (2005).

Rather than providing further empirical proof of the pluricentric character of the Dutch lexicon, the case-study aims to show the pertinence of a sociolectometric methodology that can aggregate patterns of non-categorical lexical variation while incorporating an appropriate amount of conceptual control, in contrast to a methodology that discards any conceptual knowledge. As such, the study touches upon two general issues in the broader field of variationist linguistics: on the level of words, we look at the problematic status of lexical variation and the difficulty of delineating word meaning; on the level of structure, we run into the methodological issue of aggregating the probabilistic variational patterns of many words in order to reach a general view on the lexicon, rather than on individual words.

Let us start, however, more generally with the status of variation in a linguistic system. Attempts at incorporating variational rules in the linguistic system have been criticized (e.g. Bickerton 1971) on the argument that variation has no place in the search for an abstract and idealized linguistic system of competence and langue. However, a paradigm shift in linguistics towards usage-based approaches turned the ubiquity of variation into something that should not be ignored. Nonetheless, even in usage-based Cognitive Linguistics, which studies parole by definition and can therefore hardly escape variation, there has been a tendency to overestimate the homogeneity of language communities and consequent non-variability. Recently, Cognitive Linguistics has taken up the challenge of incorporating variational dimensions in the study of linguistic phenomena. Evidence for this comes from two collected volumes by Kristiansen and Dirven (2008) and Geeraerts et al. (2010) on Cognitive Sociolinguistics, which combine theoretical, methodological and empirical studies that incorporate cognitive, semantic and lectal dimensions in their linguistic descriptions. Of course, one does not need to commit to a cognitive framework to combine language-internal variables and language-external variables, but Cognitive Sociolinguistics is currently at the cutting edge when it comes to multivariate analyses of linguistic phenomena. The idea of Cognitive Sociolinguistics is best explained by looking at an exemplary case-study by Szmrecsanyi (2010). In that study, the English genitive alternation between an of-construction and an ’s-construction is approached in the well-known Cognitive Linguistic fashion, with semantic, pragmatic, psycholinguistic, structural and functional predictors. In addition to these typical Cognitive Linguistic predicting factors, however, extra-linguistic factors are included as well: e.g. register (newspaper versus informal), medium (spoken versus written) and geography (British versus American English). Based on many observations of genitive constructions in corpora that are representative of these lectal factors, it appears that “the magnitude of the effect that individual conditioning factors [e.g. semantic and pragmatic factors] may have on genitive choice […] is demonstrably mediated by language-external [i.e. lectal] factors” (Szmrecsanyi 2010).

The example given above, representative of a widespread trend in Cognitive Linguistics, studies a single linguistic phenomenon very closely. And although the insights gained from these single-feature studies are at the very heart of the linguistic enterprise, they hardly allow for extrapolations and abstractions about the linguistic system in general: it is not because lectal factors have an important mediating influence on the choice of a specific genitive form (in English) that they have the same effect on other linguistic items (in other languages). In order to reach a more general level of that kind, the behavior of many linguistic variables needs to be aggregated so that idiosyncratic differences are averaged out, structures emerge and systematicity can be induced. This aggregate perspective also ties in with the answer of Geeraerts (2010) to his own question on the plausibility of a system when variation is rampant: finding a linguistic system is an empirical question that can be answered by looking for statistically recurring structural patterns in variational data. Or in other words, assuming a system that is able to predict linguistic choices, we should find a probabilistic model that fits observed variation.

Returning to the topic of the current paper (lexical variation in a pluricentric language), how can these theoretical insights be applied? To answer this question, we will address lexical variation in Section 2 and aggregation in Section 3. In Section 4, we will perform a case-study on aggregated lexical variation in the pluricentric language Dutch. Finally, we bring together the theoretical insight and the results of the case-study in the conclusion of this paper.

2 Lexical variation

Harder (2010: 270) claims that there are three stages in the coming about of a sociodynamic perspective on the linguistic system. The first stage consists of mere fluctuations, comparable to the babbling of a toddler. From these fluctuations a structure emerges consisting of categories that contain the fluctuation, but this structure is an incomplete abstraction of the fluctuations. The abstraction goes only so far as the language user deems appropriate, that is, until communication is enabled. This is the second stage of emerging structure. The third stage consists of the initial stage fluctuations that turn into systematic variation within the emerged structural category. Although the three stages are presented by means of a developmental example (i.e. the babbling toddler), these stages might well have a more general ontogenetic status that may explain language variation and change. Abandoning the dynamic character of these three stages, and looking at every stage independently, we could say that variationist research zooms in on the third stage, assuming the categories from the second stage. As an example, Harder gives the seminal Labovian study on the structural stage two category “postvocalic -r”, with its category-bound stage three variants, which appeared to be related to social classes in New York (Labov 1966). Scholars of the linguistic system have traditionally removed stage three (variation, or rather variable usage) and focused on the abstract and idealized stage two structural categories. However, an adequate study of the linguistic system must not ignore the stage three variation, as structure and variation cannot exist without each other. Structure without variation is stripped of linguistic reality, and variation without structure is mere fluctuation, incapable of enabling communication.

Although this idea of system is primarily geared towards linguistic categories such as consonants or Germanic strong verbs, it can conveniently be “translated” towards the conceptual categories of the lexicon. There is, however, an important question related to the level of abstraction in stage two, when considering the lexicon. If on the one hand the categories are chosen to be as narrow as a single word (or symbol), the variation within these categories is semasiological variation. This means that one studies the different senses or aspects of meaning of a single word. If on the other hand the categories are chosen to be as broad as “concepts”, the variation in naming these categories (i.e. that different words may name the same concept) is onomasiological variation. This means that one studies the different ways of expressing (with words) the conceptual category. Obviously, this very old distinction between a semasiological and an onomasiological approach is related to the study of polysemy versus the study of synonymy.

In this paper, we restrict ourselves to the onomasiological perspective, while remaining fully aware of the semasiological issues waiting around the corner. We refer to Geeraerts (2009) for an overview of research on lexical variation, and zoom in here briefly on a distinction between Formal Onomasiological Variation (FOV) and Conceptual Onomasiological Variation (COV). A FOV approach resembles the sociolinguistic variable: FOV grasps a quality of a set of words that express the same concept, and just like in a sociolinguistic variable, each word in the set may have a specific socio-stylistic correlation. COV, on the other hand, links up to the more subtle variation in concepts that are being used in language. Most obviously, at a very high level, an example could be that one can use specific words to talk about “beer” or about “semantics”. At a more fine-grained level, one could say that “fiddle” and “violin” are an example of FOV, but because “fiddle” has a slightly more ordinary tone to it than the more prestigious “violin”, there is also COV between these words. In the case-study of this paper, we will show that this distinction between FOV in choosing a word to express a concept versus COV when using words to talk in a certain way crops up in a methodological difference between the two sociolectometric approaches that we compare.

3 Aggregation

As said above, aggregation of many variables is necessary when the goal is to describe general patterns in a system. In order to find underlying dimensions of variation in a large set of (lexical) variables, the individual patterns of the variables thus need to be aggregated. Aggregation of many features is already applied in e.g. dialectometry and text categorization. However, we find problems in both dialectometry and text categorization when it comes to dealing with lexical variation.

In dialectometry (Séguy 1971; Goebl 1975; Nerbonne and Kretzschmar 2003), lexical variation is almost always considered to be categorical per location (an exception is e.g. Grieve et al. 2011): a certain location, or at best a single interviewee per location, is attributed either the use of word a or the use of word b. This categorical approach is mainly due to the type of input data, i.e. a lexical dialect atlas, used in most dialectometric studies. Dialect atlases have been painstakingly constructed in earlier years by the efforts of dialectologists who visited pertinent locations for their purposes and accumulated data through interviews and questionnaires. Categorical word choices per location were a necessary (but no longer acceptable) methodological decision. Because dialectometric methodology is tailored around the categorical dialect atlas input format, its quantitative aggregation methods cannot straightforwardly be applied to corpus-driven input, where lexical variation is a probabilistic matter.

Unlike dialectometry, an aggregation method that incorporates probabilistic word preferences in an onomasiological approach was introduced in Geeraerts et al. (1999) and further formalized in Speelman et al. (2003). This so-called profile-based approach, where “profile” stands for the (relative frequencies of a) set of words in a conceptual category, is formally introduced below. The rationale of the method is, as in most aggregation methods, to measure the “distance” between pairs of subcorpora, here on the basis of their probabilistic overlap in onomasiological word preferences for expressing an underlying conceptual category. A small distance between subcorpora implies a general agreement in word choice, whereas a large distance implies a general disagreement in word choice.

Profile-based distances between subcorpora are calculated by means of the following method. Given two subcorpora $V_1$ and $V_2$, a conceptual category $L$ (e.g. SUBTERRANEAN PUBLIC TRANSPORT) and $x_1$ to $x_n$ the exhaustive list of variants (e.g. {subway, underground}) as the profile, we refer to the absolute frequency $F$ of the usage of $x_1$ for $L$ in $V_j$ with:¹

$$F_{V_j,L}(x_1) \tag{1}$$

To make this methodological explanation more tangible, we provide a fictional example on the basis of the absolute frequencies for two concepts SUBTERRANEAN PUBLIC TRANSPORT and SMALL INSTRUMENT PLAYED WITH A BOW as used in American and British English, cf. Table 1.

¹ The following introduction to the City-Block distance method is based on Speelman et al. (2003: Section 2.2).

Tab. 1: Fictional absolute frequencies for the variants of two concepts in two language varieties

Concept                               Variant       Am. Eng.   Br. Eng.
SUBTERRANEAN PUBLIC TRANSPORT         subway        70         20
SUBTERRANEAN PUBLIC TRANSPORT         underground   10         50
SMALL INSTRUMENT PLAYED WITH A BOW    violin        50         30
SMALL INSTRUMENT PLAYED WITH A BOW    fiddle        40         35

Subsequently, we introduce the relative frequency $R$:

$$R_{V_j,L}(x_i) = \frac{F_{V_j,L}(x_i)}{\sum_{k=1}^{n} F_{V_j,L}(x_k)} \tag{2}$$

The absolute frequencies from Table 1 now become the relative frequencies in Table 2 by applying Equation 2.

Tab. 2: Fictional relative frequencies for the variants of two concepts in two language varieties, based on Table 1

Concept                               Variant       Am. Eng.   Br. Eng.
SUBTERRANEAN PUBLIC TRANSPORT         subway        0.875      0.286
SUBTERRANEAN PUBLIC TRANSPORT         underground   0.125      0.714
SMALL INSTRUMENT PLAYED WITH A BOW    violin        0.556      0.462
SMALL INSTRUMENT PLAYED WITH A BOW    fiddle        0.444      0.538

Now we can define the (City-Block) distance $D_{CB}$ between $V_1$ and $V_2$ on the basis of the profile for $L$ as follows (the division by two is for normalization, mapping the results to the interval [0,1]):

$$D_{CB,L}(V_1, V_2) = \frac{1}{2} \sum_{i=1}^{n} \left| R_{V_1,L}(x_i) - R_{V_2,L}(x_i) \right| \tag{3}$$
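As a minimal sketch of the relative-frequency and City-Block steps, the fictional counts of Table 1 can be run through a few lines of Python. The dictionary layout and the names `TABLE_1`, `relative_frequencies` and `city_block_distance` are illustrative assumptions, not part of the original method description.

```python
# Fictional absolute frequencies from Table 1 (variant -> count per variety).
# The nested-dict layout is an assumption made for this illustration only.
TABLE_1 = {
    "SUBTERRANEAN PUBLIC TRANSPORT": {
        "am": {"subway": 70, "underground": 10},
        "br": {"subway": 20, "underground": 50},
    },
    "SMALL INSTRUMENT PLAYED WITH A BOW": {
        "am": {"violin": 50, "fiddle": 40},
        "br": {"violin": 30, "fiddle": 35},
    },
}


def relative_frequencies(profile):
    """Relative frequency: each variant count divided by the profile total."""
    total = sum(profile.values())
    return {variant: count / total for variant, count in profile.items()}


def city_block_distance(profile_1, profile_2):
    """Half the summed absolute differences between relative frequencies."""
    rel_1 = relative_frequencies(profile_1)
    rel_2 = relative_frequencies(profile_2)
    variants = set(rel_1) | set(rel_2)
    return 0.5 * sum(abs(rel_1.get(v, 0.0) - rel_2.get(v, 0.0)) for v in variants)


for concept, varieties in TABLE_1.items():
    distance = city_block_distance(varieties["am"], varieties["br"])
    print(f"{concept}: {distance:.3f}")
```

Because each profile's relative frequencies sum to one, the halved sum always falls in the interval [0,1], which is the normalization the text mentions.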

The City-Block distance is a straightforward descriptive dissimilarity measure that assumes the absolute frequencies in the sample-based profile to be large enough for the relative frequencies to be good estimates for the relative frequencies in the underlying population-based profiles. If however the samples are rather small, the relative frequencies become unreliable, and a supplementary control is needed. For this we use a measure that takes as its basis the confidence of there being an actual difference between two profiles: the Fisher Exact test. This time, unlike with $D_{CB}$, we look at the absolute frequencies in the profiles we compare. When we compare a profile in one subcorpus to the profile for the same concept in a second subcorpus, we use a Fisher Exact test to check the hypothesis that both samples are drawn from the same population. We use the p-value from the Fisher Exact test as a filter for $D_{CB}$. We set the dissimilarity between subcorpora at zero if p > 0.05, and we use $D_{CB}$ if p < 0.05.²

If we now apply this step to the fictional data from Tables 1 and 2, we must first calculate the Fisher Exact p-value for every concept, checking whether the absolute frequencies for American and British English are sampled from different populations. For SUBTERRANEAN PUBLIC TRANSPORT, the p-value is much smaller than 0.05, so we can accept that British English is different from American English when it comes to this concept. Therefore, we calculate the City-Block distance by means of Equation 3 for SUBTERRANEAN PUBLIC TRANSPORT. Filling in the equation, we get 0.5 × (|0.875 − 0.286| + |0.125 − 0.714|) = 0.589. For the concept of a SMALL INSTRUMENT PLAYED WITH A BOW we find a p-value for the Fisher Exact test larger than 0.05, so we cannot say that British English is, statistically speaking, a different population from American English. Therefore, we set the distance between these varieties for this concept at zero.
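The filtering step can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: it handles only profiles with exactly two variants (a 2×2 table), computes a two-sided Fisher Exact p-value directly from the hypergeometric distribution, and treats p exactly equal to 0.05 as non-significant, a case the text leaves open. The names `fisher_exact_2x2` and `filtered_city_block` are hypothetical.

```python
from math import comb


def fisher_exact_2x2(a, b, c, d):
    """Two-sided Fisher Exact p-value for the table [[a, b], [c, d]]."""
    row_1, row_2, col_1 = a + b, c + d, a + c
    total = row_1 + row_2

    def probability(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row_1, x) * comb(row_2, col_1 - x) / comb(total, col_1)

    observed = probability(a)
    low, high = max(0, col_1 - row_2), min(row_1, col_1)
    # Sum over every table that is at most as probable as the observed one;
    # the small tolerance guards against floating-point ties.
    return sum(
        p
        for p in (probability(x) for x in range(low, high + 1))
        if p <= observed * (1 + 1e-7)
    )


def filtered_city_block(profile_1, profile_2, alpha=0.05):
    """City-Block distance for a two-variant profile, zeroed if p >= alpha."""
    (a, b), (c, d) = profile_1, profile_2
    if fisher_exact_2x2(a, b, c, d) >= alpha:
        return 0.0
    rel_1 = (a / (a + b), b / (a + b))
    rel_2 = (c / (c + d), d / (c + d))
    return 0.5 * sum(abs(r1 - r2) for r1, r2 in zip(rel_1, rel_2))


# Counts from Table 1: (first variant, second variant) per variety.
print(filtered_city_block((70, 10), (20, 50)))  # subway / underground
print(filtered_city_block((50, 40), (30, 35)))  # violin / fiddle
```

Profiles with more than two variants would need a Fisher Exact test for larger tables, which this sketch does not cover.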

To calculate the dissimilarity between subcorpora on the basis of many profiles, we sum the weighted dissimilarities for the individual profiles. In other words, given a set of profiles $L_1$ to $L_m$, the global dissimilarity $D_{CB}$ between two subcorpora $V_1$ and $V_2$ on the basis of $L_1$ up to $L_m$ can be calculated as:

$$D_{CB}(V_1, V_2) = \sum_{i=1}^{m} D_{CB,L_i}(V_1, V_2)\, W(L_i) \tag{4}$$

The $W$ in the formula is a weighting factor. We use weights to ensure that concepts which have a relatively higher frequency (summed over the size of the two subcorpora that are being compared)³ also have a greater impact on the distance measurement. In other words, in the case of a weighted calculation, concepts that are more common in everyday life and language are treated as more important. Applying this to the fictional example from Table 1, we can calculate the $W$ per concept by dividing the sum of the absolute frequencies of all variants for one concept by the sum of the absolute frequencies of all variants of all concepts. For SUBTERRANEAN PUBLIC TRANSPORT this equals (70 + 10 + 20 + 50)/(70 + 10 + 20 + 50 + 50 + 40 + 30 + 35) = 0.492. There is no need to calculate the $W$ for SMALL INSTRUMENT PLAYED WITH A BOW as its distance is already set to zero. Filling out Equation 4, we find that the distance between British English and American English aggregated over both concepts is (0.589 × 0.492) + 0 = 0.29.
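The weighted aggregation can be illustrated with a short sketch that reuses the per-concept values worked out in the text. The function name `aggregate_distance` and the input format (a mapping from concept name to a pair of distance and summed frequency) are assumptions made for this example.

```python
def aggregate_distance(concepts):
    """Sum per-concept distances, each weighted by relative concept frequency.

    `concepts` maps a concept name to (distance, total_frequency), where
    total_frequency is the summed count of all variants of that concept
    in the two varieties being compared.
    """
    grand_total = sum(frequency for _, frequency in concepts.values())
    return sum(
        distance * frequency / grand_total
        for distance, frequency in concepts.values()
    )


# Per-concept distances after the Fisher Exact filter, with Table 1 counts.
example = {
    "SUBTERRANEAN PUBLIC TRANSPORT": (0.589, 70 + 10 + 20 + 50),
    "SMALL INSTRUMENT PLAYED WITH A BOW": (0.0, 50 + 40 + 30 + 35),
}

print(round(aggregate_distance(example), 2))  # 0.29, as in the worked example
```

Since the weights sum to one and every per-concept distance lies in [0,1], the aggregated distance stays in the same interval.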

Now, we put text categorization in contrast with the profile-based approach, which incorporates probabilistic information of word choice. In text categorization, non-categorical (probabilistic) word choice is well accounted for (unlike dialectometric approaches), but text categorization totally ignores the onomasiological perspective on lexical variation. This is primarily due to the fact that text categorization often zooms in on topical categorization, and the onomasiological approach to lexical variation within conceptual categories is exactly a way of downplaying thematic bias in the variational patterns (Speelman et al. 2003). However, other forms of text categorization, e.g. authorship attribution or linguistic profiling, quite the opposite of topic classification, also ignore onomasiological variation and use mere (relative) occurrence frequencies of the features in the aggregation step. This is problematic, especially given the recent trend in authorship attribution studies to use content words.

² If the frequency of the profile was lower than 30 in the two varieties that are being compared, that profile was excluded from the comparison.

³ The size of the two subcorpora is not the actual number of words in the two subcorpora, but the sum of all profiles in these two subcorpora with a frequency higher than 30.

Whereas the profile-based approach will be the quantitative method that incorporates conceptual control in our comparison of methods, we will use the text categorization approach as the quantitative method that ignores conceptual similarity between the words in the variable set. Except for the distance metric used, the two approaches are identical. The underlying metaphor of both the profile-based and the categorization approach is spatial: subcorpora are represented as points in an n-dimensional space by means of the occurrence frequencies of n words. A made-up example in a two-dimensional space, i.e. with two words, containing two text types might make this rather abstract metaphor clearer. Given two subcorpora representing the text types “academic articles” and “computer-mediated communication”, and given two words “hence” (a linking word used in academic articles) and “LOL” (an abbreviation of “Laughing Out Loud”, commonly used in IRC), one might construct the “space” in Figure 1. The position of the academic articles in the bottom right part is due to the high frequency of “hence” and the low frequency of “LOL” in these texts. The position of the computer-mediated communication in the top left part is due to the low frequency of “hence” and the high frequency of “LOL” in these texts. Obviously, these data are made up for the sake of the argument. Now, two lines can be drawn through the origin of the space and the position of the text types (on the basis of the frequencies of the words that make up the dimensions), yielding an angle, for which the cosine can be calculated. A small angle implies high similarity between the text types, and will yield a high cosine value; a large angle implies low similarity, and will yield a low cosine value. More information on the cosine metric can be found in Baeza-Yates and Ribeiro-Neto (1999: 27).
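
The made-up “hence”/“LOL” example can also be worked through numerically. The following Python sketch is only an illustration: the four frequencies are invented here (the text gives no numbers), and the function name is our own. It computes the cosine of the angle between the two frequency vectors, and one minus that cosine as a distance.

```python
import math

def cosine_similarity(x, y):
    """Cosine of the angle between two frequency vectors of equal length.

    Assumes neither vector consists only of zeros.
    """
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

# Invented frequencies, ordered as ("hence", "LOL").
academic_articles = [40, 1]   # many "hence", hardly any "LOL"
cmc = [2, 35]                 # hardly any "hence", many "LOL"

similarity = cosine_similarity(academic_articles, cmc)
distance = 1 - similarity     # large angle -> low cosine -> large distance

print(f"cosine similarity: {similarity:.3f}")
print(f"cosine distance:   {distance:.3f}")
```

Because the cosine depends only on the angle, two subcorpora of very different sizes but with the same proportions of “hence” and “LOL” would receive a similarity of 1.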

Fig. 1: Two-dimensional example of the vector model



Formally, given two subcorpora V1 and V2 in which the frequencies of a large number of words were counted and stored in the respective vectors x and y, we calculate the distance between the subcorpora by means of Equation 5.

$$D_{\cos}(V_1, V_2) = 1 - \cos(x, y) = 1 - \frac{x \cdot y}{|x|\,|y|} = 1 - \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2}\,\sqrt{\sum_{i=1}^{n} y_i^2}} \qquad (5)$$

4 Case study

The case study of this paper is an analysis of aggregated lexical variation in the pluricentric language Dutch. It consists of a comparison between the state-of-the-art text categorization distance metric, which ignores conceptual control, and the profile-based distance metric, which includes conceptual control. In order to guarantee an objective comparison, we will apply both methods to the same dataset, which is tailored to contain a specific constitution of variational dimensions. The method that best approaches the expected structure will be considered superior. In what follows, we first introduce the dataset by describing the set of lexical features and the corpus in which these features will be counted. Second, we apply the profile-based method to this dataset. Then, the state-of-the-art text categorization method is also applied to the dataset. Finally, it will be concluded that the profile-based onomasiological approach grasps the a priori constitution of variational dimensions much better than the text categorization method.

The lexical input features are derived from the “Referentiebestand Belgisch Nederlands” (Martin 2005, Eng. Reference List of Belgian Dutch, abbreviation “RBBN”). This reference list contains words or expressions that exclusively appear in Belgian Dutch, and have no occurrences in The Netherlands, according to dictionaries, corpora and informants. The list contains about 4000 items, ranging from colloquial items, over culturally linked items (e.g. Belgian institutes) to register-specific and freely varying items. As an example, a small selection of items is listed in Table 3, but the whole list can be downloaded freely from the website of the “Instituut voor Nederlandse Lexicologie”.

Tab. 3: Selected examples from the RBBN

Belgian Dutch   General Dutch   Translation of concept
suikerboon      doopsuiker      candy to honor the birth of a baby
appelsien       sinaasappel     orange (fruit)
unaniem         eenparig        unanimous
ambras          ruzie           a row
confituur       jam             marmalade
binnenkoer      binnenplaats    atrium

For each Belgian Dutch item, the list provides an alternative from general Dutch, or sometimes typically Netherlandic Dutch. From the 4000 items on the list, we only retained the 1455 items for which the Belgian Dutch item itself and its alternative consist of one single word. If we restrict the RBBN list to these single-word items, thus excluding multi-word units and expressions, these items can be counted accurately in an automatic way by merely keeping track of the occurrence frequency of the words in the subcorpora.⁴ Indeed, expressions and multi-word units may be distributed over the sentence because of syntactic constructions, making automatic counting very hard. All (single) words on the list were analyzed with the Alpino parser, so that accurate counts of the lemmata could be performed, while controlling for the part of speech. Linking back to the issue of conceptual categories in Section 2, we accept the conceptual categories of the makers of the RBBN in their equivalence judgement between the Belgian Dutch item and its alternative.

⁴ We address possible polysemy issues and the need for word sense disambiguation when doing automatic counting in the conclusions.
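
The automatic counting of single-word items on lemma and part of speech can be sketched as follows. This is a minimal illustration under stated assumptions: the token format (lemma, part-of-speech pairs), the tag names and the tiny subcorpus are all invented here, and do not reproduce the actual output format of the Alpino parser.

```python
from collections import Counter

# Hypothetical single-word items as (lemma, part-of-speech) pairs. The tag
# name "noun" is an illustrative assumption, not Alpino's actual tag set.
TARGET_ITEMS = [
    ("appelsien", "noun"), ("sinaasappel", "noun"),
    ("confituur", "noun"), ("jam", "noun"),
]

def count_items(tokens, items=TARGET_ITEMS):
    """Count how often each target (lemma, pos) pair occurs in a subcorpus."""
    wanted = set(items)
    counts = Counter(token for token in tokens if token in wanted)
    # Report zero counts explicitly so that every subcorpus yields a full vector.
    return {item: counts[item] for item in items}

# Invented subcorpus that is already lemmatized and tagged.
subcorpus = [
    ("ik", "pron"), ("eten", "verb"), ("een", "det"), ("appelsien", "noun"),
    ("met", "prep"), ("confituur", "noun"), ("en", "conj"),
    ("nog", "adv"), ("een", "det"), ("appelsien", "noun"),
    ("jam", "verb"),  # same lemma, different part of speech: not counted
]

frequencies = count_items(subcorpus)
for (lemma, pos), frequency in frequencies.items():
    print(lemma, pos, frequency)
```

Matching on the pair rather than on the word form alone is what “controlling for the part of speech” amounts to in this sketch; it does not address polysemy within one part of speech.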

Because we know that this list contains Belgian Dutch words and an alternative, we can predict that the main variation in the list will be due to a national pattern. Indeed, even the non-national variation which is present in the list (e.g. colloquialisms) is still embedded in the Belgian Dutch point of view of the RBBN. In other words, every variable in the variable set is at least nationally patterned. Therefore, we expect the results of our method to show a clear distinction between the two national varieties, and other variational dimensions will only appear after that.

In our corpus, we incorporate samples from the two national varieties of Dutch, taken from two registers (quality newspapers and Usenet), and from two topics (politics and economy). We collected a total of 6 million words, which were evenly split over the nations, registers and topics. The quality newspaper articles were sampled from two large newspaper corpora that are available for both Netherlandic and Belgian newspapers. From these two corpora, we selected four newspapers that are deemed to be quality newspapers: “De Standaard” and “De Morgen” for Belgium, and “Volkskrant” and “NRC” for The Netherlands. For most of the articles that appeared in the newspapers, there is access to the category in which they were published. This categorization was used to select the articles on the topics “politics” and “economy”.

The Usenet posts were downloaded from a large Usenet archive, available online at Google Groups, and automatically stripped of meta-information (headers and html code) and duplicated content (quotes from previous posts). Only posts from the groups “be.politics”, “be.finance”, “nl.politiek” and “nl.financieel.*” were downloaded, where the country affiliation of the group was taken to be an indication of the nationality of the author of the post, and where the topical restriction of the group indicates the topic of the post. All texts were lemmatized and tagged with part-of-speech information by the Alpino parser (Bouma et al. 2001).



With these three dimensions (country, register, topic) and two levels for each dimension, 8 combinations are possible. These combinations, e.g. Belgian quality newspapers on economy (abbreviated as qnp.be.e), will be represented by the subcorpora, for which we will calculate the pairwise distances. However, to increase the number of data points and in order to verify the internal consistency of the subcorpora, we divided every subcorpus into two equally sized groups (abbreviated as e.g. qnp.be.e.0 and qnp.be.e.1). In total then, we counted the frequencies of the linguistic characteristics introduced above in 16 subcorpora. A snippet of this input data is presented in the appendix to this paper.
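
The design of the subcorpora can be enumerated mechanically. The sketch below is illustrative only: the labels follow the qnp.be.e.0 pattern given in the text, but the abbreviation "use" for the Usenet register is a hypothetical placeholder, since the text only spells out the quality newspaper label.

```python
from itertools import combinations, product

REGISTERS = ["qnp", "use"]   # "use" is an assumed label for Usenet
COUNTRIES = ["be", "nl"]
TOPICS = ["e", "p"]          # economy, politics
HALVES = ["0", "1"]          # the two equally sized groups per subcorpus

# Every combination of register, country, topic and half is one subcorpus.
labels = [".".join(parts)
          for parts in product(REGISTERS, COUNTRIES, TOPICS, HALVES)]

# Every unordered pair of subcorpora receives one distance.
pairs = list(combinations(labels, 2))

print(len(labels), "subcorpora, e.g.", labels[0])
print(len(pairs), "pairwise distances")
```

With 16 subcorpora there are 16 × 15 / 2 = 120 pairwise distances, which form the input of the visualizations discussed in the following sections.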

Given the omnipresent country dimension in the input features, the primary variational dimension that could be expected to be revealed among the subcorpora is the Belgian Dutch versus Netherlandic Dutch dimension. Or in terms that relate to the distance measurement method: in a pairwise comparison of subcorpora with a national difference, the distance will be bigger than in a comparison of two subcorpora with the same national affiliation. Because the typical Belgian Dutch words are sometimes restricted to a specific register, e.g. colloquialisms, a register distinction should emerge as well. And as words and their conceptual categories are inevitably sensitive to topic, we would expect the difference between political and economic subcorpora to emerge, too. However, the register and topic dimensions should be secondary to the country dimension.

4.1 Results of the profile-based method

We first look into the results of the profile-based approach, introduced above. To the selected Belgian Dutch items on the RBBN list, we added the knowledge of which alternatives are conceptually equivalent General Dutch words. In other words, we introduce conceptually controlled profile information to the distance metric. A profile thus consists of a Belgian Dutch word from the RBBN list, together with its general Dutch alternative. Remember that the underlying distance metric is basically a City-Block distance measure (see Formula 4). Now, we zoom in on the two- and three-dimensional visualizations of all the pairwise profile-based distances between the subcorpora, made by means of non-metric two-way one-mode Multidimensional Scaling (Cox and Cox 2001), as can be seen in Figure 2.⁵
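
To make the idea of a profile concrete, the following Python sketch computes a simplified profile-based City-Block distance. It is not a reproduction of Formula 4: the weighting of profiles by the frequency of the conceptual category is left out, and profiles are simply averaged, which is an assumption made here for brevity. The counts are taken from the snippet in the appendix, assuming that the first and fifth numeric columns correspond to qnp.be.e.0 and qnp.nl.e.0.

```python
def relative_frequencies(counts):
    """Turn the raw counts of the variants in one profile into proportions."""
    total = sum(counts)
    return [count / total for count in counts]

def profile_city_block(counts_a, counts_b):
    """City-Block distance between two subcorpora for a single profile."""
    rel_a = relative_frequencies(counts_a)
    rel_b = relative_frequencies(counts_b)
    return sum(abs(a - b) for a, b in zip(rel_a, rel_b))

def mean_profile_distance(subcorpus_a, subcorpus_b):
    """Unweighted mean over all profiles attested in both subcorpora.

    Simplification: the original metric weights profiles; this sketch does not.
    Assumes at least one profile is attested in both subcorpora.
    """
    distances = []
    for concept, counts_a in subcorpus_a.items():
        counts_b = subcorpus_b[concept]
        if sum(counts_a) == 0 or sum(counts_b) == 0:
            continue  # concept not attested: no proportions can be computed
        distances.append(profile_city_block(counts_a, counts_b))
    return sum(distances) / len(distances)

# (Belgian Dutch variant, alternative) counts, assumed column assignment.
QNP_BE_E_0 = {"leefbaar/levensvatbaar": (9, 2),
              "schoon/mooi": (7, 153),
              "fel/erg": (27, 331)}
QNP_NL_E_0 = {"leefbaar/levensvatbaar": (1, 2),
              "schoon/mooi": (29, 110),
              "fel/erg": (17, 117)}

print(mean_profile_distance(QNP_BE_E_0, QNP_NL_E_0))
```

Because each profile is reduced to proportions before the subcorpora are compared, a concept that is simply mentioned more often in one subcorpus does not by itself increase the distance.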

⁵ The coordinates of a Multidimensional Scaling solution can be scaled freely, as long as the same scaling is applied to all dimensions. Therefore, we omitted the scale on the axes, as these numbers would not be insightful. However, we made sure that the x and y (and z for three-dimensional solutions) axes are always equal, so that the distances between the subcorpora on the different dimensions can be interpreted.



Fig. 2: L<strong>in</strong>guistic distance between subcorpora (profile-based, two-dimensional)<br />

Multidimensional Scaling is a dimension reduction technique which is applied here to a matrix holding all the pairwise profile-based distances between the subcorpora. Because the result of a Multidimensional Scaling analysis is a reduction of the original input, a certain error is introduced. This error is captured by a “stress” value, with 0% stress equal to no error at all. It is generally acceptable to present Multidimensional Scaling solutions up to a stress level of 10–15%. Usually, Multidimensional Scaling is used to return one-, two-, or three-dimensional reductions, so that visualization is possible. With every added dimension, the error rate goes down, as the reduction becomes less severe. The fall of the error rate with added dimensions is shown in a so-called screeplot.

The screeplot in Figure 3 shows a stress difference of about 7% between a one-dimensional and a two-dimensional Multidimensional Scaling solution. Therefore, we first interpret the horizontal dimension (of an unrotated solution), as it represents the most important variation in Figure 2. In this case, the profile-based approach makes a distinction between Belgian subcorpora (black font) and Netherlandic subcorpora (grey font) on the first dimension. The grey zero-line divides the two countries perfectly. The vertical dimension makes a distinction between quality newspapers (normal font) and Usenet articles (bold font). Here again, the grey zero-line marks a perfect distinction between the two registers. Overall, there is a very clear grouping of the subcorpora, although a clear separation of the topics is found only in the Belgian Usenet subcorpora. The range of Belgian register variation is also somewhat larger than the Netherlandic range, but this probably has to do with the focus on Belgian Dutch variation in the input features. Most importantly, however, the profile-based approach yields a visualization that complies with our expectations of finding a national pattern first, followed by register variation on the second dimension.
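
The notion of stress discussed above can be illustrated with a small sketch. It is a simplification under explicit assumptions: the three dissimilarities and both layouts are invented, the one-dimensional layout is merely one possible (not optimized) configuration, and the raw dissimilarities are used as targets, whereas non-metric Multidimensional Scaling would first replace them by monotonically transformed disparities.

```python
import math

def euclidean(p, q):
    """Euclidean distance between two points given as coordinate tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def stress_1(dissimilarities, configuration):
    """Kruskal-style stress-1 of a configuration.

    dissimilarities: dict mapping (label_i, label_j) to a target value.
    configuration:   dict mapping each label to its coordinates.
    Simplification: targets are the raw dissimilarities, not disparities.
    """
    squared_error = 0.0
    squared_distances = 0.0
    for (i, j), target in dissimilarities.items():
        d = euclidean(configuration[i], configuration[j])
        squared_error += (d - target) ** 2
        squared_distances += d ** 2
    return math.sqrt(squared_error / squared_distances)

# Invented dissimilarities between three hypothetical subcorpora A, B and C.
dissimilarities = {("A", "B"): 3.0, ("A", "C"): 4.0, ("B", "C"): 5.0}

one_dimensional = {"A": (0.0,), "B": (3.0,), "C": (-4.0,)}
two_dimensional = {"A": (0.0, 0.0), "B": (3.0, 0.0), "C": (0.0, 4.0)}

print(f"stress, one dimension:  {stress_1(dissimilarities, one_dimensional):.1%}")
print(f"stress, two dimensions: {stress_1(dissimilarities, two_dimensional):.1%}")
```

In this toy case the three dissimilarities form a triangle that cannot be laid out on a single line without distortion, while two dimensions reproduce them exactly; a screeplot records this kind of drop in stress per added dimension.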



Fig. 3: Screeplot for non-metric Multidimensional Scal<strong>in</strong>g solution (profile-based)<br />

The screeplot suggests that a three-dimensional solution might improve the quality of the visualization by another 5 or 6%. Therefore, we calculated a three-dimensional solution, which is represented in Figure 4.⁶ Instead of rendering a three-dimensional plot, we drew the scatterplot of dimension 1 versus dimension 2, and the scatterplot of dimension 1 versus dimension 3. This shows us how, even in a three-dimensional solution, dimension 1 still divides Belgian and Netherlandic subcorpora, and that dimension 2 divides the quality newspaper articles from Usenet. However, this register division in the three-dimensional solution is not as neat as in the two-dimensional solution, because one of the Netherlandic Usenet fragments crosses over into the quadrant of the Netherlandic quality newspaper fragments. For dimension 3, we can see a split for the topics of the Belgian subcorpora, with the economy-related subcorpora (marked with an e) at the top left of dimension 3, and the politics fragments at the bottom. On the Netherlandic side, the register (dimension 2) and topic (dimension 3) split is muddled. The register and topic divisions of the Belgian subcorpora, however, are perfect for dimension 2 and dimension 3, respectively. The quality of the grouping on the Belgian side is obviously due to the input variables, which are specifically sensitive to Belgian Dutch variation. This indicates that the choice for a Belgian Dutch term is not only nationally patterned, but also stylistically.

Fig. 4: Linguistic distance between subcorpora (profile-based, three-dimensional)

⁶ Note that a two-dimensional non-metric Multidimensional Scaling solution is not a subset of a three-dimensional non-metric Multidimensional Scaling solution. Therefore, the first two dimensions of the three-dimensional solution of Figure 4 are not necessarily identical to the two dimensions of the two-dimensional solution of Figure 2.

4.2 Results of the categorization method

Now, we present the method and the results of the state-of-the-art categorization approach, which uses the cosine similarity metric, instead of the adapted City-Block distance that is used in the profile-based approach.

In the current case study, we take the RBBN items (and the alternatives) as individual features and remove the knowledge of conceptual categorization. If we calculate the similarities (and consequent distances) with these input features between the subcorpora in our dataset, and then produce the two-dimensional visualization with Multidimensional Scaling, we get the plot in Figure 5. If we create a screeplot (Figure 6) to show us how much stress difference there is between the first and the second dimension, we see that the second dimension reduces the stress of a one-dimensional solution by about 8%. Therefore, we will interpret the two dimensions in their own right, knowing however that the first dimension contains more pronounced distances than the second dimension.

Fig. 5: Linguistic distance between subcorpora (profile-based, three-dimensional)

Fig. 6: Linguistic distance between subcorpora (cosine, two-dimensional)
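
The step from individual word frequencies to pairwise distances can be sketched as follows. The sketch uses four rows and four columns of the snippet in the appendix, assuming that the numeric columns follow the order of the column labels listed there; the function names are our own, and the resulting numbers only illustrate the procedure, not the results reported in this paper.

```python
import math

# Each word is an independent feature: no pairing into profiles.
WORDS = ["leefbaar", "levensvatbaar", "schoon", "mooi"]
FREQUENCIES = {
    "qnp.be.e.0": [9, 2, 7, 153],
    "qnp.be.e.1": [3, 4, 10, 122],
    "qnp.nl.e.0": [1, 2, 29, 110],
    "qnp.nl.e.1": [0, 1, 21, 76],
}

def cosine_distance(x, y):
    """One minus the cosine of the angle between two frequency vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return 1 - dot / (norm_x * norm_y)

def distance_matrix(frequencies):
    """Nested dict with the distance for every ordered pair of subcorpora."""
    labels = list(frequencies)
    return {
        a: {b: cosine_distance(frequencies[a], frequencies[b]) for b in labels}
        for a in labels
    }

matrix = distance_matrix(FREQUENCIES)
for label, row in matrix.items():
    print(label, " ".join(f"{value:.4f}" for value in row.values()))
```

A matrix of this kind, computed over all features and all 16 subcorpora, is what a Multidimensional Scaling routine takes as input.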

In Figure 6 we see on the horizontal axis (from left to right, dimension 1) a distinction between the Usenet articles (bold font) and the quality newspaper articles (regular font). The light grey vertical line indicates the zero-line of the horizontal dimension. Normally, that line demarcates the boundary between two areas. Whereas we would expect the most important variation (thus, on the horizontal dimension) to be related to country, we encounter a distinction between registers. The vertical dimension (from bottom to top) tends to divide Belgium (black font) from The Netherlands (grey font), but not very clearly. The (politics) Netherlandic Usenet articles sink below the horizontal zero-line, and the (economy) Belgian Usenet articles rise above that line. Moreover, we notice that the topics are set apart in groups as well, except for the quality newspapers from The Netherlands. All in all, the categorization approach yields a somewhat unclear grouping of subcorpora and an unexpected promotion of register variation as the most important variation in the input features.

The screeplot shows that a three-dimensional solution would reduce the stress even further, to an almost optimal level. Therefore, we calculated a three-dimensional solution and represent the three dimensions in Figure 7. We apply the same idea as for the profile-based approach and plot dimensions 1 and 2, and then dimensions 1 and 3. Just like in the two-dimensional solution, we see that dimension 1 divides quality newspaper fragments from Usenet fragments, and that dimension 2 tends to divide the national subcorpora. The three-dimensional solution does a slightly better job than the two-dimensional solution, because the nation division on dimension 2 is now almost correct. Dimension 3 largely divides the topics, with politics-related fragments at the top, and economy-related fragments at the bottom. This division is almost perfect, although the grouping of the subcorpora is not so neat. Overall, though, the categorization method yields messier output than the profile-based approach.

Fig. 7: Screeplot for non-metric Multidimensional Scaling solution (cosine)

5 Conclusion

The two main theoretical questions of this paper have been (a) how important is the notion of a conceptual category in an aggregate study of variation in the lexicon and (b) what is the status of conceptual categories for lexical variation? Moreover, we have claimed that sociolectometric methodology, of which the current study is an example, is needed to study a pluricentric language. The link with pluricentric languages, in this case Dutch, is also made in the case study, which shows how conceptual categories and their consequent conceptual control are necessary to reveal the national dimension in the lexicon. In other words, the national varieties of Dutch do not differ so much in their use of words (both Belgium and the Netherlands use different words for different topics and registers), but they do differ in their choice of words for expressing a conceptual category. This latter point is made clear in the case study by means of the comparison between a profile-based onomasiological approach and a text categorization approach. The text categorization approach grasped the mere use of individual words and compared the use of words in two subcorpora by means of the cosine similarity metric, which was not informed about the conceptual similarity between words. Consequently, the text categorization showed that there was a pattern of register and topic in the input features, stronger than the anticipated national pattern. The onomasiological approach, by contrast, revealed a strong national dimension in word choice for naming a conceptual category.
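
The difference between the use of words and the choice of words for a concept can be illustrated with a toy computation. Everything in the sketch below is invented for illustration: the counts, the subcorpus names and the simplified (unweighted) profile distance are assumptions, not the data or the exact metric of the case study.

```python
import math

def cosine_distance(x, y):
    """One minus the cosine between two vectors of individual word counts."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return 1 - dot / (norm_x * norm_y)

def profile_distance(profiles_a, profiles_b):
    """Unweighted mean City-Block distance over profiles (a simplification)."""
    total = 0.0
    for counts_a, counts_b in zip(profiles_a, profiles_b):
        rel_a = [c / sum(counts_a) for c in counts_a]
        rel_b = [c / sum(counts_b) for c in counts_b]
        total += sum(abs(a - b) for a, b in zip(rel_a, rel_b))
    return total / len(profiles_a)

def flatten(profiles):
    """Drop the profile structure: every word becomes a separate feature."""
    return [count for profile in profiles for count in profile]

# Invented counts; each pair is (Belgian Dutch variant, alternative) for the
# profiles appelsien/sinaasappel and dagorde/agenda.
be_topic_1 = [(90, 10), (9, 1)]   # mostly talks about the first concept
be_topic_2 = [(9, 1), (90, 10)]   # same word choices, other topic
nl_topic_1 = [(10, 90), (1, 9)]   # same topic as be_topic_1, other choices

for name, other in [("topic differs", be_topic_2), ("variety differs", nl_topic_1)]:
    print(name,
          f"cosine={cosine_distance(flatten(be_topic_1), flatten(other)):.3f}",
          f"profile={profile_distance(be_topic_1, other):.3f}")
```

In this constructed case the cosine distance is about as large for a pure topic difference as for a pure difference in word choice, whereas the profile-based distance is zero for the former and large for the latter.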

Of course, in order to have an expected ranking in the variational dimensions, and in order to compare the outcome of the aggregation approaches, the dataset had to be manipulated so that a certain pattern could convincingly be assumed. With that goal in mind, the variable set was taken from a reference list of Belgian Dutch, so that national variation is built into the dataset. As such, the two aggregation approaches could be compared by assessing how well they retrieve the national variation. It is important to understand, though, that an actual descriptive sociolectometric study can by no means rely on such a biased input variable set. Therefore, the results of this paper can only be of methodological value. Given the a priori known pattern of national variation in the dataset used in the case study, though, one might jump to the conclusion that an onomasiological approach is better suited for finding variational patterns in the lexicon, and the preferred method for any sociolectometric study. However, there are a number of problems with this conclusion.

First of all, perhaps we are wrong in the assumption that national variation is the strongest dimension in the lexical variable set and the available subcorpora; it could well be possible that word use, as shown in the categorization approach, is actually more strongly influenced by a register or topic dimension, and that the onomasiological approach artificially weakens these dimensions.⁷ In that case, we would have to tone down the conclusion, and say that an onomasiological approach with conceptual control is a methodological means of revealing and boosting specific underlying dimensions of variation. Moreover, we would like to point out that our corpus only sampled two topics and two registers, which is not enough to support strong generalizations. Further research is therefore needed with more topics and registers. All this, of course, does not weaken the strength of a profile-based approach, but it rather points out the importance of knowing what is being measured. Our claim now is that the profile-based approach allows for much more control over what is measured than the text categorization method, and should therefore be preferred.

⁷ Note, however, that the profile-based City-Block distance incorporates a W term that brings the frequency of the conceptual category into play.

Second, the onomasiological approach assumes a relation of identity of (conceptual) meaning between the variants and this is theoretically problematic. Following Edmonds and Hirst (2002), we agree that perfect synonymy (the highest possible level of detail in describing a conceptual category, and still finding multiple words that fit the category) is extremely rare. By admitting this, our notion of semantics or word meaning follows the Cognitive Linguistic view that encyclopedic knowledge is indispensable. Translating the idea of Peter Harder that structural categories need not be complete, and that the abstraction goes only as far as is functional for language users (here we link up to the prototype theory of word meaning, cf. Rosch and Mervis 1975), we can reach near-synonymy by slightly relaxing the level of detail of the conceptual category: not every language user has an identical representation of a word in his head, but nonetheless two language users can communicate with that word. Idealized Cognitive Models (Lakoff 1987) or Frames (Fillmore 1994) are examples of describing meaning while balancing semasiological detail and operational functionality. In future research, we will operationalize the bottom-up creation of conceptual categories by applying Word Space Models (Turney and Pantel 2010).

Third, an onomasiological approach requires prior semasiological analysis to exclude contextual nuances or polysemy. In the case study of this paper, the lemmatized forms of the RBBN words were naively counted in the corpus, without further checking the context of each occurrence. Closer inspection revealed that the RBBN list does not contain many potentially polysemous items, so that we can ignore the small error that must be present in the frequencies for the purposes of the current paper. However, as we want to perform the above analyses in future research with a naturalistic sample of lexical variation, instead of an a priori list of national variation, a semasiological study for every occurrence needs to be done in order to establish the conceptual control. As this would be an unfeasible manual task when using a large number of variables, we will rely further on the advances being made in the field of Word Space Models to automate this task.

To conclude this paper, we try to answer our initial questions. How important is the notion of a conceptual category in an aggregate study of the lexicon? The case study has shown that conceptual control is necessary to reveal variational dimensions that are hidden in the overwhelming content (topic) function of words. Without conceptual control, the conclusion of the categorization approach would have been that different words are used to refer to different content, and that they may also signal register and perhaps national differences. This observation, albeit true and undeniable, is not the goal of an aggregation study: it is obvious that an aggregation of many words will be sensitive to content differences among subcorpora. Therefore, conceptual control, in the form of conceptual categories that group together similar words, is needed. And this brings us to the second question: what is the status of conceptual categories for lexical variation? Although practical as a methodological and heuristic device, the conceptual categories remain somewhat artificial because of the flexibility in their definition. In the current case study, the makers of the RBBN clearly had referential equivalence in mind for most categories. However, conceptual categories can be defined more strictly or less strictly at the whim of the researcher, because there is no consensus on the appropriate level of detail in the definition, especially since the incorporation of encyclopedic knowledge in word meaning. The level of detail that is operational in the language community can only be retrieved by studying the actual use of words.

And then we are back at variation.


Appendix


Tab. 4: Snippet of the input data for both aggregation methods. Pairs of rows make up lexical variables.

| | qnp.be.e.0 | qnp.be.e.1 | qnp.be.p.0 | qnp.be.p.1 | qnp.nl.e.0 | qnp.nl.e.1 | qnp.nl.p.0 | qnp.nl.p.1 | use.be.e.0 | use.be.e.1 | use.be.p.0 | use.be.p.1 | use.nl.e.0 | use.nl.e.1 | use.nl.p.0 | use.nl.p.1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| leefbaar | 9 | 3 | 8 | 11 | 1 | 0 | 0 | 0 | 0 | 1 | 9 | 4 | 0 | 0 | 24 | 18 |
| levensvatbaar | 2 | 4 | 2 | 0 | 2 | 1 | 3 | 2 | 0 | 0 | 1 | 1 | 0 | 0 | 4 | 4 |
| hangar | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 2 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 1 |
| loods | 8 | 6 | 4 | 18 | 4 | 11 | 5 | 2 | 0 | 0 | 0 | 2 | 0 | 1 | 1 | 6 |
| schoon | 7 | 10 | 10 | 12 | 29 | 21 | 13 | 11 | 0 | 2 | 7 | 3 | 2 | 4 | 66 | 85 |
| mooi | 153 | 122 | 114 | 110 | 110 | 76 | 53 | 42 | 42 | 33 | 73 | 67 | 52 | 74 | 449 | 475 |
| dagorde | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| agenda | 29 | 26 | 100 | 90 | 29 | 21 | 39 | 24 | 2 | 1 | 14 | 14 | 1 | 1 | 17 | 33 |
| knook | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| been | 13 | 15 | 43 | 41 | 39 | 29 | 14 | 20 | 10 | 12 | 14 | 12 | 21 | 18 | 76 | 65 |
| zever | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 6 | 2 | 15 | 15 | 0 | 0 | 4 | 14 |
| onzin | 7 | 1 | 23 | 30 | 8 | 5 | 5 | 3 | 5 | 10 | 44 | 61 | 26 | 43 | 451 | 485 |
| draad | 4 | 6 | 14 | 10 | 6 | 13 | 2 | 3 | 1 | 2 | 31 | 32 | 9 | 10 | 90 | 87 |
| snoer | 2 | 0 | 2 | 1 | 1 | 5 | 1 | 1 | 0 | 0 | 3 | 1 | 0 | 0 | 21 | 28 |
| weeral | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 9 | 3 | 9 | 9 | 0 | 1 | 4 | 1 |
| alweer | 19 | 22 | 32 | 22 | 21 | 30 | 11 | 17 | 5 | 1 | 21 | 22 | 12 | 9 | 98 | 98 |
| fel | 27 | 23 | 33 | 35 | 17 | 19 | 31 | 42 | 6 | 1 | 5 | 10 | 0 | 1 | 19 | 31 |
| erg | 331 | 268 | 208 | 217 | 117 | 112 | 76 | 68 | 21 | 36 | 143 | 131 | 99 | 94 | 830 | 835 |
| strop | 4 | 2 | 1 | 3 | 26 | 18 | 4 | 3 | 0 | 0 | 1 | 0 | 0 | 0 | 3 | 3 |
| strik | 1 | 2 | 2 | 3 | 5 | 6 | 1 | 0 | 0 | 0 | 2 | 0 | 0 | 2 | 1 | 2 |
| verdiep | 2 | 1 | 4 | 3 | 8 | 2 | 4 | 11 | 0 | 0 | 2 | 3 | 3 | 4 | 20 | 26 |
| verdieping | 0 | 6 | 6 | 7 | 5 | 4 | 10 | 11 | 0 | 0 | 1 | 0 | 0 | 0 | 12 | 10 |
| stamp | 6 | 2 | 9 | 5 | 5 | 1 | 0 | 2 | 1 | 0 | 5 | 5 | 0 | 0 | 11 | 10 |
| duw | 27 | 16 | 42 | 34 | 20 | 25 | 13 | 16 | 1 | 1 | 13 | 8 | 0 | 5 | 27 | 28 |
| spaarzaam | 0 | 1 | 0 | 1 | 2 | 2 | 1 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| zuinig | 3 | 10 | 5 | 12 | 18 | 21 | 4 | 1 | 0 | 0 | 2 | 3 | 0 | 0 | 10 | 13 |
| hospitaal | 0 | 4 | 4 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 2 |
| ziekenhuis | 26 | 34 | 82 | 60 | 11 | 40 | 11 | 11 | 0 | 1 | 15 | 15 | 0 | 2 | 61 | 92 |
| micro | 1 | 1 | 2 | 3 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 2 | 1 |
| microfoon | 1 | 1 | 2 | 10 | 2 | 3 | 3 | 7 | 0 | 0 | 0 | 0 | 0 | 0 | 34 | 28 |
| buis | 7 | 2 | 2 | 1 | 4 | 1 | 6 | 3 | 0 | 0 | 2 | 1 | 0 | 0 | 18 | 12 |
| onvoldoende | 57 | 56 | 38 | 60 | 36 | 29 | 18 | 28 | 4 | 4 | 2 | 7 | 3 | 8 | 23 | 23 |
| toelage | 3 | 2 | 3 | 2 | 2 | 5 | 0 | 1 | 0 | 0 | 5 | 0 | 0 | 0 | 1 | 1 |
| subsidie | 33 | 41 | 13 | 15 | 35 | 22 | 29 | 49 | 1 | 0 | 14 | 15 | 2 | 4 | 122 | 137 |
| woonst | 1 | 2 | 3 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 |
| woning | 47 | 60 | 45 | 54 | 47 | 70 | 2 | 21 | 17 | 15 | 8 | 9 | 23 | 17 | 54 | 91 |
| uitbater | 13 | 11 | 3 | 8 | 1 | 1 | 2 | 4 | 0 | 0 | 3 | 2 | 0 | 0 | 6 | 4 |
| exploitant | 2 | 2 | 2 | 2 | 15 | 13 | 3 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
| tussenkomst | 19 | 8 | 17 | 13 | 3 | 3 | 0 | 1 | 1 | 2 | 0 | 1 | 2 | 2 | 0 | 6 |
| bijdrage | 40 | 64 | 23 | 23 | 37 | 25 | 34 | 30 | 3 | 9 | 6 | 16 | 14 | 26 | 90 | 80 |
| tegenstrever | 1 | 1 | 6 | 8 | 2 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| tegenstander | 24 | 19 | 70 | 77 | 16 | 17 | 38 | 32 | 0 | 0 | 18 | 16 | 5 | 5 | 63 | 64 |
| aanvang | 5 | 5 | 3 | 3 | 7 | 8 | 2 | 2 | 0 | 0 | 1 | 3 | 1 | 2 | 3 | 4 |
| begin | 635 | 550 | 499 | 507 | 637 | 554 | 322 | 341 | 78 | 71 | 139 | 201 | 100 | 102 | 706 | 712 |
| aanduiding | 7 | 3 | 6 | 4 | 1 | 1 | 1 | 0 | 1 | 1 | 2 | 5 | 1 | 1 | 5 | 4 |
| benoeming | 34 | 14 | 19 | 17 | 46 | 22 | 35 | 43 | 0 | 0 | 7 | 5 | 3 | 2 | 16 | 10 |
| tevergeefs | 8 | 2 | 12 | 7 | 10 | 7 | 7 | 5 | 2 | 0 | 1 | 2 | 0 | 1 | 3 | 4 |
| vergeefs | 2 | 0 | 0 | 2 | 3 | 7 | 4 | 14 | 0 | 0 | 0 | 4 | 0 | 0 | 0 | 4 |
| tewerkstelling | 8 | 7 | 4 | 16 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 0 | 0 | 0 | 0 | 0 |
| werkgelegenheid | 79 | 80 | 17 | 24 | 25 | 16 | 7 | 5 | 0 | 0 | 4 | 6 | 7 | 5 | 13 | 27 |
| zetel | 42 | 61 | 91 | 62 | 25 | 23 | 42 | 43 | 1 | 0 | 34 | 32 | 1 | 1 | 193 | 195 |
| fauteuil | 0 | 0 | 3 | 0 | 0 | 0 | 0 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| verslaggever | 11 | 10 | 29 | 43 | 3 | 1 | 8 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 21 | 28 |
| rapporteur | 1 | 1 | 9 | 5 | 0 | 0 | 2 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
| verlieslatend | 10 | 6 | 1 | 0 | 1 | 2 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| verliesgevend | 1 | 0 | 0 | 0 | 31 | 14 | 9 | 9 | 0 | 0 | 0 | 0 | 1 | 3 | 4 | 6 |
| vermits | 4 | 5 | 1 | 4 | 0 | 0 | 0 | 0 | 19 | 12 | 16 | 20 | 0 | 0 | 1 | 2 |
| aangezien | 95 | 81 | 32 | 43 | 24 | 28 | 2 | 3 | 33 | 25 | 45 | 36 | 33 | 26 | 161 | 148 |
| universitair | 10 | 5 | 7 | 30 | 2 | 1 | 4 | 6 | 2 | 0 | 1 | 2 | 0 | 0 | 5 | 5 |
| academicus | 6 | 1 | 13 | 9 | 2 | 0 | 1 | 2 | 0 | 0 | 1 | 1 | 0 | 0 | 4 | 6 |
| vaststelling | 30 | 27 | 42 | 44 | 4 | 3 | 1 | 4 | 0 | 0 | 5 | 10 | 2 | 1 | 6 | 6 |
| constatering | 1 | 0 | 0 | 1 | 15 | 6 | 0 | 4 | 0 | 0 | 1 | 0 | 1 | 2 | 11 | 12 |
| verhoog | 184 | 178 | 25 | 38 | 107 | 112 | 36 | 34 | 8 | 11 | 12 | 12 | 23 | 22 | 39 | 41 |
| podium | 1 | 1 | 20 | 25 | 3 | 2 | 4 | 7 | 0 | 0 | 4 | 1 | 0 | 0 | 7 | 5 |
| wedde | 2 | 6 | 2 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 2 | 1 |
| salaris | 13 | 13 | 1 | 0 | 96 | 83 | 25 | 26 | 0 | 0 | 3 | 0 | 6 | 4 | 49 | 44 |
| objectief | 21 | 25 | 19 | 18 | 8 | 10 | 4 | 7 | 2 | 4 | 22 | 27 | 5 | 4 | 64 | 42 |
| doel | 66 | 67 | 57 | 112 | 80 | 91 | 63 | 63 | 7 | 11 | 35 | 33 | 24 | 30 | 198 | 174 |
| nakend | 9 | 15 | 12 | 10 | 1 | 1 | 0 | 1 | 0 | 1 | 3 | 1 | 1 | 1 | 0 | 0 |
| nabij | 35 | 33 | 27 | 40 | 11 | 13 | 8 | 8 | 3 | 9 | 2 | 2 | 3 | 6 | 19 | 16 |
| nijverheid | 18 | 14 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| industrie | 75 | 65 | 22 | 32 | 25 | 26 | 37 | 29 | 1 | 0 | 11 | 8 | 6 | 4 | 40 | 39 |
| inbreuk | 21 | 25 | 6 | 17 | 3 | 2 | 1 | 3 | 0 | 1 | 4 | 3 | 1 | 0 | 8 | 5 |
| overtreding | 15 | 14 | 25 | 40 | 6 | 8 | 4 | 9 | 1 | 0 | 9 | 10 | 2 | 2 | 12 | 26 |
| job | 141 | 140 | 59 | 78 | 2 | 0 | 0 | 1 | 4 | 6 | 21 | 16 | 0 | 2 | 4 | 9 |
| baan | 133 | 122 | 31 | 39 | 150 | 117 | 111 | 78 | 4 | 5 | 11 | 13 | 9 | 6 | 139 | 117 |
| maximum | 10 | 12 | 4 | 4 | 6 | 19 | 2 | 6 | 12 | 6 | 6 | 4 | 11 | 16 | 29 | 21 |
| maximaal | 47 | 35 | 25 | 30 | 79 | 76 | 20 | 16 | 21 | 11 | 5 | 7 | 35 | 36 | 38 | 39 |
| minimum | 26 | 20 | 8 | 14 | 14 | 11 | 12 | 10 | 13 | 13 | 17 | 15 | 8 | 5 | 20 | 22 |
| minimaal | 28 | 19 | 15 | 25 | 73 | 59 | 19 | 28 | 6 | 3 | 2 | 5 | 37 | 28 | 62 | 46 |
| merkwaardig | 19 | 14 | 30 | 37 | 7 | 15 | 4 | 4 | 1 | 0 | 2 | 0 | 0 | 0 | 48 | 28 |
| opmerkelijk | 47 | 52 | 66 | 57 | 67 | 56 | 20 | 20 | 2 | 0 | 6 | 4 | 1 | 0 | 28 | 11 |
| effectief | 36 | 34 | 35 | 36 | 45 | 59 | 11 | 20 | 8 | 8 | 24 | 15 | 13 | 12 | 51 | 57 |
| daadwerkelijk | 19 | 16 | 21 | 13 | 59 | 54 | 24 | 21 | 1 | 1 | 4 | 1 | 11 | 9 | 49 | 55 |
| stock | 12 | 12 | 2 | 3 | 6 | 0 | 0 | 1 | 45 | 40 | 0 | 0 | 34 | 25 | 0 | 1 |
| voorraad | 65 | 40 | 13 | 3 | 27 | 25 | 4 | 9 | 4 | 0 | 0 | 1 | 19 | 25 | 7 | 18 |
| stilaan | 48 | 49 | 57 | 53 | 1 | 2 | 0 | 0 | 2 | 3 | 6 | 6 | 3 | 0 | 1 | 2 |
| langzamerhand | 2 | 4 | 1 | 3 | 30 | 27 | 3 | 13 | 0 | 0 | 0 | 3 | 0 | 0 | 29 | 32 |
| serieus | 24 | 20 | 40 | 16 | 41 | 32 | 56 | 53 | 30 | 27 | 63 | 56 | 40 | 29 | 196 | 197 |
| ernstig | 72 | 52 | 101 | 88 | 31 | 24 | 23 | 28 | 3 | 1 | 27 | 37 | 4 | 3 | 94 | 119 |
| politieker | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 18 | 14 | 0 | 0 | 13 | 8 |
| politicus | 48 | 81 | 321 | 275 | 52 | 37 | 47 | 58 | 1 | 2 | 89 | 93 | 7 | 6 | 289 | 221 |
| gerechtshof | 2 | 3 | 4 | 2 | 17 | 16 | 9 | 7 | 0 | 0 | 2 | 1 | 1 | 0 | 3 | 13 |
| rechtbank | 122 | 112 | 61 | 70 | 15 | 27 | 9 | 13 | 1 | 4 | 11 | 21 | 2 | 2 | 52 | 64 |
| prof | 1 | 2 | 3 | 3 | 1 | 2 | 0 | 0 | 1 | 1 | 5 | 5 | 0 | 3 | 8 | 6 |
| professor | 39 | 33 | 70 | 72 | 3 | 8 | 6 | 6 | 0 | 0 | 9 | 3 | 7 | 3 | 27 | 36 |
| fout | 74 | 84 | 154 | 158 | 51 | 65 | 25 | 43 | 38 | 17 | 92 | 74 | 87 | 75 | 326 | 299 |
| overtreding | 15 | 14 | 25 | 40 | 6 | 8 | 4 | 9 | 1 | 0 | 9 | 10 | 2 | 2 | 12 | 26 |
| publiciteit | 9 | 6 | 5 | 6 | 16 | 18 | 9 | 11 | 0 | 0 | 4 | 5 | 2 | 1 | 17 | 14 |
| reclame | 60 | 45 | 17 | 32 | 21 | 21 | 15 | 12 | 11 | 5 | 18 | 11 | 30 | 43 | 46 | 51 |
| proper | 8 | 10 | 14 | 20 | 0 | 0 | 0 | 0 | 3 | 5 | 0 | 3 | 2 | 2 | 1 | 4 |
| schoon | 7 | 10 | 10 | 12 | 29 | 21 | 13 | 11 | 0 | 2 | 7 | 3 | 2 | 4 | 66 | 85 |
| fier | 1 | 4 | 15 | 13 | 1 | 4 | 0 | 1 | 3 | 0 | 5 | 6 | 0 | 0 | 1 | 1 |
| trots | 15 | 19 | 25 | 25 | 22 | 32 | 11 | 16 | 2 | 0 | 9 | 9 | 2 | 3 | 69 | 63 |
| schepen | 11 | 14 | 49 | 24 | 7 | 4 | 2 | 1 | 0 | 0 | 11 | 3 | 0 | 0 | 4 | 1 |
| wethouder | 0 | 0 | 1 | 4 | 9 | 13 | 11 | 14 | 0 | 0 | 2 | 2 | 0 | 0 | 22 | 22 |
| schrijvelaar | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Rekenhof | 12 | 15 | 6 | 11 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| Rekenkamer | 6 | 7 | 10 | 3 | 17 | 33 | 4 | 65 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
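To make the role of the row pairs in Tab. 4 concrete, the following minimal sketch contrasts the two ways of aggregating such counts on three lexical variables and three subcorpora copied from the table. It is an illustration under stated assumptions, not the procedure of the case study: the halved City-Block distance between profiles and the cosine distance between raw frequency vectors are choices made here for the example, and the names `profile`, `profile_distance` and `word_distance` are hypothetical.

```python
from math import sqrt

# Counts copied from three columns of Tab. 4; the selection of rows and
# columns is arbitrary and only serves the illustration.
SUBCORPORA = ["qnp.be.e.0", "qnp.be.e.1", "qnp.nl.e.0"]
COUNTS = {
    "wedde": [2, 6, 0],
    "salaris": [13, 13, 96],
    "job": [141, 140, 2],
    "baan": [133, 122, 150],
    "stilaan": [48, 49, 1],
    "langzamerhand": [2, 4, 30],
}
# Pairs of rows make up lexical variables (alternative names for one concept).
VARIABLES = [("wedde", "salaris"), ("job", "baan"), ("stilaan", "langzamerhand")]


def profile(variable, subcorpus):
    """Relative frequencies of the variants of one variable in one subcorpus."""
    col = SUBCORPORA.index(subcorpus)
    freqs = [COUNTS[word][col] for word in variable]
    total = sum(freqs)
    return [f / total for f in freqs] if total else None


def profile_distance(a, b):
    """With conceptual control: mean halved City-Block distance between profiles.

    The distance measure is an assumption made for this sketch.
    """
    distances = []
    for variable in VARIABLES:
        pa, pb = profile(variable, a), profile(variable, b)
        if pa is None or pb is None:
            continue  # concept not attested in one of the subcorpora
        distances.append(sum(abs(x - y) for x, y in zip(pa, pb)) / 2)
    if not distances:
        raise ValueError("no variable is attested in both subcorpora")
    return sum(distances) / len(distances)


def word_distance(a, b):
    """Without conceptual control: cosine distance between raw word counts.

    Every word is an independent dimension; the pairing is ignored.
    """
    ia, ib = SUBCORPORA.index(a), SUBCORPORA.index(b)
    va = [row[ia] for row in COUNTS.values()]
    vb = [row[ib] for row in COUNTS.values()]
    dot = sum(x * y for x, y in zip(va, vb))
    norm = sqrt(sum(x * x for x in va)) * sqrt(sum(y * y for y in vb))
    return 1 - dot / norm


if __name__ == "__main__":
    for other in SUBCORPORA[1:]:
        print(other,
              round(profile_distance("qnp.be.e.0", other), 3),
              round(word_distance("qnp.be.e.0", other), 3))
```

In the profile-based variant, a frequent concept such as the one named by job and baan weighs no more than a rare one, because each variable contributes only the relative preference among its variants. In the word-based variant, the absolute frequencies of all words enter the comparison directly, which is where sensitivity to content differences among subcorpora comes from.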

References

Auer, Peter. 2005. Europe’s sociolinguistic unity, or: A typology of European dialect/standard constellations. In Nicole Delbecque, Johan van der Auwera & Dirk Geeraerts (eds.), Perspectives on variation, 7–42. Berlin and New York: Mouton de Gruyter.

Baeza-Yates, Ricardo and Berthier Ribeiro-Neto. 1999. Modern information retrieval. New York: ACM Press & Addison-Wesley.

Bickerton, Derek. 1971. Inherent variability and variable rules. Foundations of Language and Cognitive Processes 7(4). 457–492.

Bouma, Gerlof, Gertjan van Noord, and Rob Malouf. 2001. Alpino: wide-coverage computational analysis of Dutch. In Walter Daelemans, K. Sima’an, J. Veenstra & J. Zavrel (eds.), Computational Linguistics in the Netherlands 2000, 45–59. Amsterdam: Rodopi.

Clyne, Michael. 1992. Pluricentric languages: Differing norms in different nations. Berlin and New York: Mouton de Gruyter.

Cox, Trevor and Michael Cox. 2001. Multidimensional scaling. London and New York: Chapman and Hall.

Edmonds, Philip and Graeme Hirst. 2002. Near-synonymy and lexical choice. Computational Linguistics 28(2). 105–144.

Fillmore, Charles. 1994. Starting where dictionaries stop: the challenge of corpus lexicography. In Beryl T. Sue Atkins & Antonio Zampolli (eds.), Computational approaches to the lexicon, 349–393. Oxford: Oxford University Press.

Geeraerts, Dirk. 2009. Lexical variation in space. In Jürgen Erich Schmidt & Peter Auer (eds.), Language and space I: Theories and methods, 821–837. Berlin and New York: Mouton de Gruyter.

Geeraerts, Dirk. 2010. Schmidt redux: How systematic is the linguistic system if variation is rampant? In Kasper Boye & Elisabeth Engberg-Pedersen (eds.), Language usage and language structure, 237–262. Berlin and New York: Mouton de Gruyter.




Geeraerts, Dirk, Stefan Grondelaers, and Dirk Speelman. 1999. Convergentie en divergentie in de Nederlandse woordenschat. Een onderzoek naar kleding- en voetbaltermen. Amsterdam: Meertens Instituut.

Geeraerts, Dirk, Gitte Kristiansen, and Yves Peirsman (eds.). 2010. Advances in Cognitive Sociolinguistics. Berlin and New York: Mouton de Gruyter.

Goebl, Hans. 1975. Dialektometrie. Grazer linguistische Studien. 32–38.

Grieve, Jack, Dirk Speelman, and Dirk Geeraerts. 2011. A statistical method for the identification and aggregation of regional linguistic variation. Language Variation and Change 23. 193–221.

Harder, Peter. 2010. Meaning in mind and society: A functional contribution to the social turn in Cognitive Linguistics. Berlin and New York: Mouton de Gruyter.

Impe, Leen, Dirk Geeraerts, and Dirk Speelman. 2008. Mutual intelligibility of standard and regional Dutch language varieties. International Journal of Humanities and Arts Computing 2. 101–117.

Kristiansen, Gitte and René Dirven (eds.). 2008. Cognitive Sociolinguistics: Language variation, cultural models, social systems. Berlin and New York: Mouton de Gruyter.

Labov, William. 1966. The social stratification of English in New York City. Washington, D.C.: Center for Applied Linguistics.

Lakoff, George. 1987. Women, fire and dangerous things: What categories reveal about the mind. Chicago: University of Chicago Press.

Martin, Willy. 2005. Het Belgisch-Nederlands anders bekeken: het Referentiebestand Belgisch-Nederlands (RBBN). Technical report. Amsterdam: Vrije Universiteit Amsterdam.

Nerbonne, John and William Kretzschmar. 2003. Introducing computational techniques in Dialectometry. Computers and the Humanities 37. 245–255.

Rosch, Eleanor and Carolyne Mervis. 1975. Family resemblances: Studies in the internal structure of categories. Cognitive Psychology 7(4). 573–605.

Séguy, Jean. 1971. La relation entre la distance spatiale et la distance lexicale. Revue de Linguistique Romane 35. 335–357.

Speelman, Dirk, Stefan Grondelaers, and Dirk Geeraerts. 2003. Profile-based linguistic uniformity as a generic method for comparing language varieties. Computers and the Humanities 37. 317–337.

Szmrecsanyi, Benedikt. 2010. The English genitive alternation in a cognitive sociolinguistics perspective. In Dirk Geeraerts, Gitte Kristiansen & Yves Peirsman (eds.), Advances in Cognitive Sociolinguistics, 141–166. Berlin and New York: Mouton de Gruyter.

Turney, Peter and Patrick Pantel. 2010. From frequency to meaning: vector space models of semantics. Journal of Artificial Intelligence Research 37. 141–188.
