Lexical variation in aggregate perspective

Tom Ruette, Dirk Speelman, and Dirk Geeraerts 

Abstract: If one aims to study a pluricentric language with the goal of making general 

assertions about linguistic levels, an aggregate perspective in which many linguistic 

items that represent the linguistic level are considered is necessary. The current paper 

presents and compares two methodologies for aggregating lexical variation so that the 

similarity or dissimilarity between language varieties such as the centers of a pluricentric 

language can be quantitatively measured. The two methodologies differ with 

respect to the treatment of the semantic relation between words: whereas one method 

simply ignores the semantic relation between words, the other method incorporates 

the knowledge that some words are alternative means of naming a single concept. The 

question of which method is most suitable for measuring the similarity or dissimilarity 

between language varieties is raised and empirically tested in a corpus-based case 

study on the pluricentric language Dutch, as used in Belgium and the Netherlands. It 

will be shown that the method that incorporates semantic knowledge manages to go 

beyond possible conceptual variation between language varieties, clearly revealing 

an expected distinction between Dutch as used in Belgium and in the Netherlands. In 

contrast with this, the semantically non-informed method is disturbed by conceptual 

variation and is not able to convincingly show the distinction between Dutch as used 

in Belgium and in the Netherlands, although the set of linguistic items clearly suggests 

that such a national pattern should emerge. 

Keywords. aggregate perspective, sociolectometry, lexical variation, Dutch 

1 Introduction 

The current paper shows how a sociolectometric approach is needed to disentangle the 

multidimensional structure of the varieties in a pluricentric language. There are different 

sociolectometric approaches, i.e. corpus-based methods, perception experiments, 

or attitude questionnaires; we will perform a corpus-based case study. Although the focus 

of a sociolectometric approach is on the varieties, the choice of the variables under 

analysis is crucial; we focus on lexical variation. Furthermore, in this paper we compare 

two quantitative corpus-based methods, which differ in their conceptual control 

of lexical variables: on the one hand, we take a method that ignores the conceptual 

relationship between the lexemes in the variable set. On the other hand, there is a 

method that incorporates knowledge about conceptual identity between lexemes. The 

importance and difficulties of conceptual control when studying variation in the lexicon 

as a whole is shown by means of a case-study on the pluricentric language Dutch. 

The pluricentric character of Dutch is now widely accepted: Dutch is used both in Belgium

and in the Netherlands, but each nation has its own norm-generating center (cf.

Clyne 1992). This is different from the imposed situation in earlier years, especially 

the sixties, where Dutch in Belgium was supposed to be exogenically modeled on the 

norms of the Netherlands. More recently, empirical work by e.g. Geeraerts et al.

(1999) and experimental work by e.g. Impe et al. (2008) have led to this historical view being

adjusted to the current view, as described in Auer (2005).

Rather than providing further empirical proof of the pluricentric character of 

the Dutch lexicon, the case-study aims to show the pertinence of a sociolectometric 

methodology that can aggregate patterns of non-categorical lexical variation while incorporating 

an appropriate amount of conceptual control – in contrast to a methodology 

that discards any conceptual knowledge. As such, the study touches upon two 

general issues in the broader field of variationist linguistics: on the level of words, we 

look at the problematic status of lexical variation and the difficulty of delineating word 

meaning; on the level of structure, we run into the methodological issue of aggregating 

the probabilistic variational patterns of many words in order to reach a general view 

on the lexicon, rather than on individual words. 

Let us start, however, more generally with the status of variation in a linguistic 

system. Attempts at incorporating variational rules in the linguistic system have been

criticized (e.g. Bickerton 1971) on the argument that variation has no place in the search 

for an abstract and idealized linguistic system of competence and langue. However, a 

paradigm-shift in linguistics towards usage-based approaches turned the ubiquity of 

variation into something that should not be ignored. Nonetheless, even in usage-based 

Cognitive Linguistics, which studies parole by definition and can therefore hardly escape 

variation, there has been a tendency to overestimate the homogeneity of language 

communities and their consequent non-variability. More recently, Cognitive Linguistics has

taken up the challenge of incorporating variational dimensions in the study of linguistic

phenomena. Evidence for this comes from two collected volumes by Kristiansen and Dirven

(2008) and Geeraerts et al. (2010) on Cognitive Sociolinguistics, which combine theoretical, 

methodological and empirical studies that incorporate cognitive, semantic and 

lectal dimensions in their linguistic descriptions. Of course, one does not need to commit 

to a cognitive framework to combine language-internal variables and language-external

variables, but Cognitive Sociolinguistics is currently at the cutting edge when

it comes to multivariate analyses of linguistic phenomena. The idea of Cognitive Sociolinguistics

is best explained by looking at an exemplary case study by Szmrecsanyi

(2010). In that study, the English genitive alternation between an of -construction and 

an ’s-construction is approached in the well-known Cognitive Linguistic fashion, with 

semantic, pragmatic, psycholinguistic, structural and functional predictors. In addition 

to these typical Cognitive Linguistic predicting factors, however, extra-linguistic 

factors are included as well: e.g. register (newspaper versus informal), medium (spoken 

versus written) and geography (British versus American English). Based on many 

observations of genitive constructions in corpora that are representative of these lectal 

factors, it appears that “the magnitude of the effect that individual conditioning factors

[e.g. semantic and pragmatic factors] may have on genitive choice […] is demonstrably

mediated by language-external [i.e. lectal] factors” (Szmrecsanyi 2010). 

The example given above – representative of a wide-spread trend in Cognitive 

Linguistics – studies a single linguistic phenomenon very closely. Although the

insights gained from such single-feature studies are at the very heart of the linguistic

enterprise, they hardly allow for extrapolations and abstractions about the linguistic

system in general: the fact that lectal factors have an important mediating influence

on the choice of a specific genitive form (in English) does not imply that they have the same effect

on other linguistic items (in other languages). In order to reach a more general level

of that kind, the behavior of many linguistic variables needs to be aggregated so that

idiosyncratic differences are averaged out, structures emerge and systematicity can be

induced. This aggregate perspective also ties in with the answer that Geeraerts (2010) gives to

his question about the plausibility of a system when variation is rampant: finding a linguistic

system is an empirical question that can be answered by looking for statistically 

recurring structural patterns in variational data. Or in other words, assuming a system 

that is able to predict linguistic choices, we should find a probabilistic model that fits 

observed variation. 

Returning to the topic of the current paper (lexical variation in a pluricentric language), 

how can these theoretical insights be applied? To answer this question, we 

will address lexical variation in Section 2 and aggregation in Section 3. In Section 4, 

we will perform a case-study on aggregated lexical variation in the pluricentric language 

Dutch. Finally, we bring together the theoretical insight and the results of the 

case-study in the conclusion of this paper. 

2 Lexical variation 

Harder (2010: 270) claims that there are three stages in the emergence of a linguistic

system, seen from a sociodynamic perspective. The first stage consists of mere fluctuations,

comparable to the babbling of a toddler. From these fluctuations a structure emerges

consisting of categories that contain the fluctuation, but this structure is an incomplete

abstraction of the fluctuations. The abstraction goes only so far as the language

user deems appropriate, i.e. until communication is enabled. This is the second stage

of emerging structure. In the third stage, the fluctuations of the initial stage

turn into systematic variation within the emerged structural category. Although the

three stages are presented by means of a developmental example (i.e. the babbling

toddler), these stages might well have a more general ontogenetic status that may explain

language variation and change. Abandoning the dynamic character of these three

stages, and looking at every stage independently, we could say that variationist research 

zooms in on the third stage, assuming the categories from the second stage. As 

an example, Harder gives the seminal Labovian study on the structural stage two category 

“postvocalic -r”, with its category-bound stage three variants, which appeared


to be related to social classes in New York (Labov 1966). Scholars of the linguistic system 

have traditionally removed stage three (variation, or rather variable usage) and 

focused on the abstract and idealized stage two structural categories. However, an 

adequate study of the linguistic system must not ignore the stage three variation, as 

structure and variation cannot exist without each other. Structure without variation 

is stripped of linguistic reality, and variation without structure is mere fluctuation,

incapable of enabling communication. 

Although this idea of system is primarily geared towards linguistic categories such 

as consonants or Germanic strong verbs, it can conveniently be “translated” towards 

the conceptual categories of the lexicon. There is, however, an important question related 

to the level of abstraction in stage two, when considering the lexicon. If, on the

one hand, the categories are chosen to be as narrow as a single word (or symbol), the

variation within these categories is semasiological variation. This means that one studies 

the different senses or aspects of meaning of a single word. If on the other hand 

the categories are chosen to be as broad as “concepts”, the variation in naming these 

categories (i.e. that different words may name the same concept) is onomasiological 

variation. This means that one studies the different ways of expressing (with words) 

the conceptual category. Obviously, this very old distinction between a semasiological 

or an onomasiological approach is related to the study of polysemy versus the study 

of synonymy. 

In this paper, we restrict ourselves to the onomasiological perspective, yet fully 

aware of the semasiological issues waiting around the corner. We refer to Geeraerts 

(2009) for an overview of research on lexical variation, and zoom in here briefly on 

a distinction between Formal Onomasiological Variation (FOV) and Conceptual Onomasiological 

Variation (COV). A FOV approach resembles the sociolinguistic variable: 

FOV grasps a quality of a set of words that express the same concept, and just like in a 

sociolinguistic variable, each word in the set may have a specific socio-stylistic correlation. 

COV, on the other hand, links up to the more subtle variation in concepts that 

are being used in language. Most obviously, at a very high level, an example could be

that one can use specific words to talk about “beer” or about “semantics”. At a more 

fine-grained level, one could say that “fiddle” and “violin” are an example of FOV, but 

because “fiddle” has a slightly more ordinary tone to it than the more prestigious “violin”, 

there is also COV between these words. In the case-study to this paper, we will 

show that this distinction between FOV in choosing a word to express a concept versus 

COV when using words to talk in a certain way crops up in a methodological difference 

between the two sociolectometric approaches that we compare. 

3 Aggregation 

As said above, aggregation of many variables is necessary when the goal is to describe 

general patterns in a system. In order to find underlying dimensions of variation in


a large set of (lexical) variables, the individual patterns of the variables thus need to 

be aggregated. Aggregation of many features is already applied in e.g. dialectometry 

and text categorization. However, we find problems in both dialectometry and text 

categorization when it comes to dealing with lexical variation. 

In dialectometry (Séguy 1971; Goebl 1975; Nerbonne and Kretzschmar 2003), lexical 

variation is almost always considered to be categorical per location (except e.g. 

Grieve et al. 2011): either a certain location – or at best a single interviewee per location 

– is attributed the use of word a or the use of word b. This categorical approach is 

mainly due to the type of input data, i.e. a lexical dialect atlas, used in most dialectometric 

studies. Dialect atlases were painstakingly constructed in earlier years through

the efforts of dialectologists who visited the locations relevant to their purposes and accumulated

data through interviews and questionnaires. Categorical word choices per

location were a necessary (but no longer acceptable) methodological decision.

Because dialectometric methodology is tailored around the categorical dialect 

atlas input format, their quantitative aggregation methods cannot straightforwardly 

be applied to corpus-driven input, where lexical variation is a probabilistic matter. 

In contrast to dialectometry, an aggregation method that combines probabilistic

word preferences with an onomasiological approach was introduced in Geeraerts et al.

(1999) and further formalized in Speelman et al. (2003). This so-called profile-based

approach – where “profile” stands for the (relative frequencies of a) set of words in

a conceptual category – is formally introduced below. As with most aggregation methods,

the rationale of the method is to measure the “distance” between pairs of subcorpora

on the basis of their probabilistic overlap in onomasiological word preferences 

for expressing an underlying conceptual category. A small distance between subcorpora 

implies a general agreement in word choice, whereas a large distance implies a 

general disagreement in word choice. 

Profile-based distances between subcorpora are calculated by means of the following

method. Given two subcorpora V1 and V2, a conceptual category L (e.g. SUBTERRANEAN

PUBLIC TRANSPORT) and x1 to xn the exhaustive list of variants (e.g. {subway,

underground}) as the profile, we refer to the absolute frequency F of the usage of

xi for L in Vj with: 1

$$F_{V_j,L}(x_i) \tag{1}$$

To make this methodological explanation more tangible, we provide a fictional example 

on the basis of the absolute frequencies for two concepts SUBTERRANEAN PUBLIC 

TRANSPORT and SMALL INSTRUMENT PLAYED WITH A BOW as used in American and British 

English, cf. Table 1. 

1 The following introduction to the City-Block distance method is based on Speelman et al. (2003: 

Section 2.2).


Tab. 1: Fictional absolute frequencies for the variants of two concepts in two language varieties

Concept                              Variant       Am. Eng.   Br. Eng.
SUBTERRANEAN PUBLIC TRANSPORT        subway        70         20
                                     underground   10         50
SMALL INSTRUMENT PLAYED WITH A BOW   violin        50         30
                                     fiddle        40         35

Subsequently, we introduce the relative frequency R:

$$R_{V_j,L}(x_i) = \frac{F_{V_j,L}(x_i)}{\sum_{k=1}^{n} F_{V_j,L}(x_k)} \tag{2}$$

The absolute frequencies from Table 1 now become the relative frequencies in Table 2 

by means of applying Equation 2. 

Tab. 2: Fictional relative frequencies for the variants of two concepts in two language varieties, based on Table 1

Concept                              Variant       Am. Eng.   Br. Eng.
SUBTERRANEAN PUBLIC TRANSPORT        subway        0.875      0.286
                                     underground   0.125      0.714
SMALL INSTRUMENT PLAYED WITH A BOW   violin        0.556      0.462
                                     fiddle        0.444      0.538
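To make Equation 2 concrete, the following minimal Python sketch recomputes the relative frequencies of Table 2 from the fictional absolute frequencies of Table 1. The dictionary layout and the function names `relative_profile` and `profile_for` are illustrative assumptions; they are not taken from the original method description.

```python
# Fictional absolute frequencies from Table 1:
# concept -> variant -> (Am. Eng. count, Br. Eng. count)
TABLE_1 = {
    "SUBTERRANEAN PUBLIC TRANSPORT": {"subway": (70, 20), "underground": (10, 50)},
    "SMALL INSTRUMENT PLAYED WITH A BOW": {"violin": (50, 30), "fiddle": (40, 35)},
}


def relative_profile(abs_freqs):
    """Equation 2: divide each variant's count by the total count of the profile."""
    total = sum(abs_freqs.values())
    if total == 0:
        raise ValueError("the profile is unattested in this variety")
    return {variant: count / total for variant, count in abs_freqs.items()}


def profile_for(concept, variety_index):
    """Select the counts of one variety (0 = Am. Eng., 1 = Br. Eng.) for a concept."""
    return {v: counts[variety_index] for v, counts in TABLE_1[concept].items()}


for concept in TABLE_1:
    for index, variety in enumerate(("Am. Eng.", "Br. Eng.")):
        rel = relative_profile(profile_for(concept, index))
        print(concept, variety, {v: round(r, 3) for v, r in rel.items()})
```

Rounded to three decimals, the printed values correspond to the cells of Table 2.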

Now we can define the (City-Block) distance DCB between V1 and V2 on the basis of the

profile for L as follows (the division by two is for normalization, mapping the results

to the interval [0,1]):

$$D_{CB,L}(V_1,V_2) = \frac{1}{2}\sum_{i=1}^{n} \left| R_{V_1,L}(x_i) - R_{V_2,L}(x_i) \right| \tag{3}$$

The City-Block distance is a straightforward descriptive dissimilarity measure that assumes 

the absolute frequencies in the sample-based profile to be large enough for the 

relative frequencies to be good estimates for the relative frequencies in the underlying 

population-based profiles. If however the samples are rather small, the relative frequencies 

become unreliable, and a supplementary control is needed. For this we use 

a measure that takes as its basis the confidence of there being an actual difference between 

two profiles: the Fisher Exact test. This time, unlike with DCB , we look at the 

absolute frequencies in the profiles we compare. When we compare a profile in one 

subcorpus to the profile for the same concept in a second subcorpus, we use a Fisher 

Exact test to check the hypothesis that both samples are drawn from the same population.

We use the p-value from the Fisher Exact test as a filter for DCB. We set the

dissimilarity between subcorpora at zero if p > 0.05, and we use DCB if p < 0.05. 2

If we now apply this step to the fictional data from Tables 1 and 2, we must first

calculate the Fisher Exact p-value for every concept, to check whether the absolute frequencies

for American and British English are sampled from different populations. For

SUBTERRANEAN PUBLIC TRANSPORT, the p-value is much smaller than 0.05, so we can accept

that British English is different from American English when it comes to this concept.

Therefore, we calculate the City-Block distance by means of Equation 3 for SUBTERRANEAN

PUBLIC TRANSPORT. Filling in the equation, we get 0.5 × (|0.875 − 0.286| +

|0.125 − 0.714|) = 0.589. For the concept SMALL INSTRUMENT PLAYED WITH A BOW, we

find a p-value for the Fisher Exact test that is larger than 0.05, so British English

cannot be statistically distinguished from American English for this concept. Therefore,

we set the distance between these varieties for this concept at zero.
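The filtering step and the per-concept City-Block distance can be illustrated with a short Python sketch, under several stated assumptions. The sketch only handles profiles with exactly two variants, so that a 2×2 Fisher Exact test applies. The two-sided p-value is obtained by summing the probabilities of all tables that are at most as probable as the observed one, which is one common convention; the text does not say which convention or software was used. The function names are hypothetical.

```python
from math import comb


def fisher_exact_2x2(a, b, c, d):
    """Two-sided Fisher Exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    p_observed = prob(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    # Sum over all tables that are no more probable than the observed table
    return sum(
        p for p in (prob(x) for x in range(lowest, highest + 1))
        if p <= p_observed * (1 + 1e-9)
    )


def city_block_distance(freq_v1, freq_v2, alpha=0.05):
    """City-Block distance for one profile, set to zero if Fisher p > alpha."""
    variants = sorted(freq_v1)
    if sorted(freq_v2) != variants or len(variants) != 2:
        raise ValueError("this sketch only handles two-variant profiles")
    x1, x2 = variants
    p = fisher_exact_2x2(freq_v1[x1], freq_v2[x1], freq_v1[x2], freq_v2[x2])
    # The text leaves p == alpha unspecified; it is treated as significant here
    if p > alpha:
        return 0.0
    n1, n2 = sum(freq_v1.values()), sum(freq_v2.values())
    return 0.5 * sum(abs(freq_v1[x] / n1 - freq_v2[x] / n2) for x in variants)


# Fictional counts from Table 1: (Am. Eng. profile, Br. Eng. profile)
transport = ({"subway": 70, "underground": 10}, {"subway": 20, "underground": 50})
bowed = ({"violin": 50, "fiddle": 40}, {"violin": 30, "fiddle": 35})
print(round(city_block_distance(*transport), 3))  # 0.589, as in the worked example
print(city_block_distance(*bowed))                # 0.0, filtered out by the test
```

Profiles with more than two variants would need a generalization of the exact test to larger tables, which is outside the scope of this sketch.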

To calculate the dissimilarity between subcorpora on the basis of many profiles,

we sum the weighted dissimilarities for the individual profiles. In other words, given a set of

profiles L1 to Lm, the global dissimilarity D between two subcorpora V1 and V2

on the basis of L1 up to Lm can be calculated as:

$$D_{CB}(V_1,V_2) = \sum_{i=1}^{m} D_{CB,L_i}(V_1,V_2)\, W(L_i) \tag{4}$$

The W in the formula is a weighting factor. We use weights to ensure that concepts 

which have a relatively higher frequency (summed over the size of the two subcorpora 

that are being compared) 3 also have a greater impact on the distance measurement. In 

other words, in the case of a weighted calculation, concepts that are more common in 

everyday life and language are treated as more important. Applying this to the fictional 

example from Table 1, we can calculate the W per concept by dividing the sum of the 

absolute frequencies of all variants for one concept by the sum of the absolute frequencies of all variants of all concepts.

For SUBTERRANEAN PUBLIC TRANSPORT this equals (70 + 10 + 20 + 50)/(70 + 10 + 20 + 50 +

50 + 40 + 30 + 35) = 0.492. There is no need to calculate the W for SMALL INSTRUMENT

PLAYED WITH A BOW as its distance is already set to zero. Filling out Equation 4, we find

that the distance between British English and American English aggregated over both 

concepts is (0.589 × 0.492) + 0 = 0.29. 
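The weighted aggregation of Equation 4 can be sketched in a few lines of Python. The per-concept distances are taken as given from the worked example (0.589 and 0), and the function names `concept_weights` and `aggregate_distance` are hypothetical. The frequency threshold of 30 mentioned in the footnotes is not implemented in this sketch.

```python
def concept_weights(profile_totals):
    """W(L_i): share of each concept in the summed frequency of all profiles."""
    grand_total = sum(profile_totals.values())
    return {concept: total / grand_total for concept, total in profile_totals.items()}


def aggregate_distance(distances, profile_totals):
    """Equation 4: sum of the per-concept distances, each multiplied by its weight."""
    weights = concept_weights(profile_totals)
    return sum(distances[concept] * weights[concept] for concept in distances)


# Summed absolute frequencies per concept over both varieties (Table 1)
totals = {
    "SUBTERRANEAN PUBLIC TRANSPORT": 70 + 10 + 20 + 50,
    "SMALL INSTRUMENT PLAYED WITH A BOW": 50 + 40 + 30 + 35,
}
# Per-concept distances from the worked example (the second one was filtered to zero)
distances = {
    "SUBTERRANEAN PUBLIC TRANSPORT": 0.589,
    "SMALL INSTRUMENT PLAYED WITH A BOW": 0.0,
}

print(round(concept_weights(totals)["SUBTERRANEAN PUBLIC TRANSPORT"], 3))  # 0.492
print(round(aggregate_distance(distances, totals), 2))                     # 0.29
```

Because the weights sum to one and each per-concept distance lies in [0,1], the aggregated distance also stays within [0,1].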

2 If the frequency of the profile was lower than 30 in the two varieties that are being compared, that

profile was excluded from the comparison.

3 The size of the two subcorpora is not the actual number of words in the two subcorpora, but the sum

of all profiles in these two subcorpora with a frequency higher than 30.

Now, we contrast text categorization with the profile-based approach, which

incorporates probabilistic information on word choice. In text categorization, non-categorical

(probabilistic) word choice is well accounted for (unlike in dialectometric

approaches), but text categorization totally ignores the onomasiological perspective on

lexical variation. This is primarily due to the fact that text categorization often zooms 

in on topical categorization, and the onomasiological approach to lexical variation 

within conceptual categories is exactly a way of downplaying thematic bias in the variational 

patterns (Speelman et al. 2003). However, other forms of text categorization, 

e.g. authorship attribution or linguistic profiling, quite the opposite of topic classification, 

also ignore onomasiological variation and use mere (relative) occurrence frequencies 

of the features in the aggregation step. This is problematic, especially given 

the recent trend in authorship attribution studies to use content words. 

Whereas the profile-based approach will be the quantitative method that incorporates 

conceptual control in our comparison of methods, we will use the text categorization

approach as the quantitative method that ignores conceptual similarity

between the words in the variable set. Except for the distance metric used, the two approaches

are identical. The underlying metaphor of both the profile-based and the categorization

approach is spatial: subcorpora are represented as points in an n-dimensional

space by means of the occurrence frequencies of n words. A made-up example in a

two-dimensional space, i.e. with two words, containing two text types might make

this rather abstract metaphor clearer. Given two subcorpora representing the text

types “academic articles” and “computer mediated communication”, and given two 

words “hence” (a linking word used in academic articles) and “LOL” (an abbreviation 

of “Laughing Out Loud”, commonly used in IRC), one might construct the “space” in 

Figure 1. The position of the academic articles in the bottom right part is due to the high 

frequency of “hence” and the low frequency of “LOL” in these texts. The position of 

the computer-mediated communication in the top left part is due to the low frequency 

of “hence” and the high frequency of “LOL” in these texts. Obviously, these data are 

made up for the sake of the argument. Now, two lines can be drawn through the origin of the space and the position of the text types (on the basis of the frequencies of

the words that make up the dimensions), yielding an angle, for which the cosine can 

be calculated. A small angle implies high similarity between the text types, and will 

yield a high cosine value; a large angle implies low similarity, and will yield a low cosine 

value. More information on the cosine metric can be found in Baeza-Yates and 

Ribeiro-Neto (1999: 27). 

Fig. 1: 2 Dimensional example of Vector Model
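The spatial metaphor can be made concrete with a minimal Python sketch. It turns the cosine of the angle between two frequency vectors into a distance by subtracting it from one, so that similar text types receive a small value. No frequencies are given for Figure 1, so the counts for “hence” and “LOL” below are invented purely for illustration, and the function name `cosine_distance` is hypothetical.

```python
from math import sqrt


def cosine_distance(x, y):
    """One minus the cosine of the angle between frequency vectors x and y."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = sqrt(sum(a * a for a in x))
    norm_y = sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("a zero vector has no direction")
    return 1 - dot / (norm_x * norm_y)


# Invented frequencies of ("hence", "LOL") in two text types
academic_articles = (40, 1)
computer_mediated = (2, 50)

print(cosine_distance(academic_articles, computer_mediated))  # large angle, close to 1
print(cosine_distance(academic_articles, (80, 2)))            # same direction, close to 0
```

The second call shows that the measure depends only on the direction of the vectors: doubling all frequencies of a subcorpus does not change its distance to any other subcorpus.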


Formally, given two subcorpora V1 and V2 in which the frequencies of a large number

of words were counted and stored in the respective vectors x and y, we calculate

the distance between the subcorpora by means of Equation 5.

$$D_{\cos}(V_1,V_2) = 1 - \cos(\mathbf{x},\mathbf{y}) = 1 - \frac{\mathbf{x}\cdot\mathbf{y}}{|\mathbf{x}||\mathbf{y}|} = 1 - \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2}\,\sqrt{\sum_{i=1}^{n} y_i^2}} \tag{5}$$

4 Case study

The case study of this paper is an analysis of aggregated lexical variation in the pluricentric 

language Dutch. It consists of a comparison between the state-of-the-art text 

categorization distance metric, which ignores conceptual control, and the profile-based

distance metric, which includes conceptual control. In order to guarantee an 

objective comparison, we will apply both methods to the same dataset, which is tailored 

to contain a specific constitution of variational dimensions. The method that 

best approaches the expected structure will be considered superior. In what follows, 

we first introduce the dataset by describing the set of lexical features and the corpus 

in which these features will be counted. Second, we apply the profile-based method to 

this dataset. Then, the state-of-the-art text categorization method is also applied to the 

dataset. Finally, it will be concluded that the profile-based onomasiological approach 

grasps the a priori constitution of variational dimensions much better than the text 

categorization method. 

The lexical input features are derived from the “Referentiebestand Belgisch Nederlands” 

(Martin 2005, Eng. Reference List of Belgian Dutch, abbreviation “RBBN”). This 

reference list contains words or expressions that exclusively appear in Belgian Dutch, 

and have no occurrences in The Netherlands, according to dictionaries, corpora and 

informants. The list contains about 4000 items, ranging from colloquial items, over 

culturally linked (e.g. Belgian institutes) to register-specific and freely varying items. 

As an example, a small selection of items is listed in Table 3, but the whole list can 

be downloaded freely from the website of the “Instituut voor Nederlandse Lexicologie”. 

For each Belgian Dutch item, the list provides an alternative from general Dutch, 

or sometimes typically Netherlandic Dutch. From the 4000 items on the list, we only 

retained 1455 items for which the Belgian Dutch item itself and its alternative consist 

of one single word. By restricting the RBBN list to these single-word items, and

thus excluding multi-word units and expressions, we can count these items accurately

in an automatic way by merely keeping track of the occurrence frequency

of the words in the subcorpora. 4 Indeed, expressions and multi-word units may be

distributed over the sentence because of syntactic constructions, making automatic

counting very hard.

4 We address possible polysemy issues and the need for word sense disambiguation when

doing automatic counting in the conclusions.

Tab. 3: Selected examples from the RBBN

Belgian Dutch   General Dutch   Translation of concept
suikerboon      doopsuiker      candy to honor the birth of a baby
appelsien       sinaasappel     orange (fruit)
unaniem         eenparig        unanimous
ambras          ruzie           a row
confituur       jam             marmalade
binnenkoer      binnenplaats    atrium

All (single) words on the list were analyzed with the Alpino parser,

so that accurate counts of the lemmata could be performed, while controlling for

the part-of-speech. Linking back to the issue of conceptual categories in Section 2, we 

accept the conceptual categories of the makers of the RBBN in their equivalence judgement 

between the Belgian Dutch item and its alternative. 

Because we know that this list contains Belgian Dutch words and an alternative, 

we can predict that the main variation in the list will be due to a national pattern. Indeed, 

even the non-national variation which is present in the list (e.g. colloquialisms) 

is still embedded in the Belgian Dutch point-of-view of the RBBN. Or in other words, 

every variable in the variable set is at least nationally patterned. Therefore, we expect 

the results of our method to show a clear distinction between the two national varieties, 

and other variational dimensions will only appear after that. 

In our corpus, we incorporate samples from the two national varieties of Dutch, 

taken from two registers (quality newspapers and Usenet), and from two topics (politics 

and economy). We collected a total of 6 million words, which were evenly split 

over the nations, registers and topics. The quality newspaper articles were sampled 

from two large newspaper corpora that are available for both Netherlandic and Belgian 

newspapers. From these two corpora, we selected four newspapers that are deemed 

to be quality newspapers: “De Standaard” and “De Morgen” for Belgium, and “Volkskrant” 

and “NRC” for The Netherlands. For most of the articles that appeared in the 

newspapers, there is access to the category in which it was published. This categorization 

was used to filter out the articles on the topics “politics” and “economy”. 

The Usenet posts were downloaded from a large Usenet archive, available online 

at Google Groups and automatically stripped from meta-information (headers and 

html code) and reduplicated content (quotes from previous posts). Only posts from 

the groups “be.politics”, “be.finance”, “nl.politiek” and “nl.financieel.*” were downloaded, 

where the country affiliation of the group was taken to be an indication of the 

nationality of the author of the post, and where the topical restriction of the group indicates 

the topic of the post. All texts were lemmatized and tagged with part-of-speech 

information by the Alpino parser (Bouma et al. 2001).


With these three dimensions (country, register, topic) and two levels for each dimension,

eight combinations are possible. These combinations, e.g. Belgian quality newspapers

on economy (abbreviated as qnp.be.e), are represented by the subcorpora

for which we will calculate the pairwise distances. However, to increase the number

of data points and to verify the internal consistency of the subcorpora, we divided

every subcorpus into two equally sized groups (abbreviated as e.g. qnp.be.e.0

and qnp.be.e.1). In total, then, we counted the frequencies of the linguistic characteristics

introduced above in 16 subcorpora. A snippet of this input data is presented

in the appendix to this paper.
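The resulting design can be enumerated with a small Python sketch, which also shows how many pairwise distances have to be calculated. Only the labels `qnp`, `be` and `e` are attested in the text (in qnp.be.e); the labels `use`, `nl` and `p` are hypothetical placeholders chosen for this illustration.

```python
from itertools import combinations, product

REGISTERS = ("qnp", "use")  # quality newspapers, Usenet ("use" is a hypothetical label)
COUNTRIES = ("be", "nl")    # Belgium, the Netherlands ("nl" is a hypothetical label)
TOPICS = ("e", "p")         # economy, politics ("p" is a hypothetical label)
HALVES = ("0", "1")         # the two equally sized groups per subcorpus

# Every subcorpus label, e.g. "qnp.be.e.0"
labels = [".".join(parts) for parts in product(REGISTERS, COUNTRIES, TOPICS, HALVES)]

# Every unordered pair of subcorpora needs one distance
pairs = list(combinations(labels, 2))

# Pairs whose members differ in country (second field of the label)
cross_national = [(a, b) for a, b in pairs if a.split(".")[1] != b.split(".")[1]]

print(len(labels), len(pairs), len(cross_national))  # 16 120 64
```

The cross-national pairs are the ones for which a larger distance is expected if the national pattern is the primary dimension of variation.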

Given the omnipresent country dimension in the input features, the primary variational 

dimension that could be expected to be revealed among the subcorpora is the 

Belgian Dutch versus Netherlandic Dutch dimension. Or in terms that relate to the 

distance measurement method: in a pairwise comparison of subcorpora with a national

difference, the distance will be larger than in a comparison of two subcorpora

with the same national affiliation. Because the typical Belgian Dutch words are sometimes

restricted to a specific register, e.g. colloquialisms, a register distinction should

emerge as well. And as words and their conceptual categories are inevitably sensitive

to topic, we would expect the difference between political and economic subcorpora

to emerge, too. However, the register and topic dimension should be secondary to the 

country dimension. 

4.1 Results of the profile-based method 

We first look into the results of the profile-based approach, introduced above. To the 

selected Belgian Dutch items on the RBBN list, we added the knowledge of which General Dutch

words are their conceptually equivalent alternatives. In other words, we introduce

conceptually controlled profile information to the distance metric. A profile thus consists

of a Belgian Dutch word from the RBBN list, together with its General Dutch alternative.

Remember that the underlying distance metric is basically a City-Block distance

measure (see Equation 4). Now, we zoom in on the two- and three-dimensional visualizations

of all the pairwise profile-based distances between the subcorpora, made

by means of non-metric two-way one-mode Multidimensional Scaling (Cox and Cox 

2001), as can be seen in Figure 2. 5 

5 The coordinates of a Multidimensional Scaling solution can be scaled freely, as long as the same 

scaling is applied to all dimensions. Therefore, we discarded a scale on the axes, as these numbers 

would not be insightful. However, we made sure that the x and y (and z for three-dimensional solutions) 

axes are always equal, so that the distances between the subcorpora on the different dimensions 

can be interpreted.
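To make the idea of a profile-based City-Block distance more tangible, the following Python sketch computes a simplified version of it for three subcorpora. It is not a reproduction of Formula 4: the weighting of profiles is left out and replaced by a plain unweighted mean, which is our assumption. The counts are copied from the appendix (Tab. 4); the concept labels (JOB, SALARY, GRADUALLY) and the function names are hypothetical and only serve the illustration.

```python
# Counts copied from the appendix (Tab. 4) for three quality newspaper subcorpora.
# Each profile pairs a Belgian Dutch word with its General Dutch alternative.
SUBCORPORA = ("qnp.be.e.0", "qnp.be.e.1", "qnp.nl.e.0")
PROFILES = {
    "JOB": {"job": (141, 140, 2), "baan": (133, 122, 150)},
    "SALARY": {"wedde": (2, 6, 0), "salaris": (13, 13, 96)},
    "GRADUALLY": {"stilaan": (48, 49, 1), "langzamerhand": (2, 4, 30)},
}


def relative_frequencies(profile, idx):
    """Share of each variant within one profile for the subcorpus at position idx."""
    counts = {word: freqs[idx] for word, freqs in profile.items()}
    total = sum(counts.values())
    if total == 0:
        return None  # the concept is not attested in this subcorpus
    return {word: n / total for word, n in counts.items()}


def profile_distance(a, b):
    """Unweighted mean City-Block distance over all profiles (a simplification)."""
    ia, ib = SUBCORPORA.index(a), SUBCORPORA.index(b)
    per_profile = []
    for profile in PROFILES.values():
        pa = relative_frequencies(profile, ia)
        pb = relative_frequencies(profile, ib)
        if pa is None or pb is None:
            continue  # skip profiles that cannot be compared
        per_profile.append(sum(abs(pa[w] - pb[w]) for w in profile))
    if not per_profile:
        raise ValueError("no comparable profiles")
    return sum(per_profile) / len(per_profile)


d_within = profile_distance("qnp.be.e.0", "qnp.be.e.1")
d_across = profile_distance("qnp.be.e.0", "qnp.nl.e.0")
print(round(d_within, 3), round(d_across, 3))
```

On these three profiles, the distance between the two Belgian halves is much smaller than the distance between a Belgian and a Netherlandic subcorpus. Because the frequencies are made relative within each profile, the overall frequency of a concept in a subcorpus does not influence the distance in this simplified version.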


Fig. 2: Linguistic distance between subcorpora (profile-based, two-dimensional) 

Multidimensional Scaling is a dimension reduction technique which is applied here to a matrix holding all the pairwise profile-based distances between the subcorpora. Because the result of a Multidimensional Scaling analysis is a reduction of the original input, a certain error is introduced. The error rate is captured by a “stress” value, with 0% stress equal to no error at all. It is generally acceptable to present Multidimensional Scaling solutions up to a stress level of 10–15%. Usually, Multidimensional Scaling is used to return one-, two-, or three-dimensional reductions, so that visualization is possible. With every added dimension, the error rate goes down, as the reduction becomes less severe. The decrease in error rate with added dimensions is captured in a so-called screeplot. The screeplot in Figure 3 shows a stress difference of about 7% between a one-dimensional and a two-dimensional Multidimensional Scaling solution. Therefore, we first interpret the horizontal dimension (of an unrotated solution), as it represents the most important variation in Figure 2. In this case, the profile-based approach makes a distinction between Belgian subcorpora (black font) and Netherlandic subcorpora (grey font) on the first dimension. The grey zero-line divides the two countries perfectly. The vertical dimension makes a distinction between quality newspapers (normal font) and Usenet articles (bold font). Here again, the grey zero-line marks a perfect distinction between the two registers. Overall, there is a very clear grouping of the subcorpora, although the topics are clearly separated only in the Belgian Usenet. The range of Belgian register variation is also somewhat larger than the Netherlandic range, but this probably has to do with the focus on Belgian Dutch variation in the input features. Most importantly, however, the profile-based approach yields a visualization that complies with our expectations of finding a national pattern first, followed by register variation on the second dimension.
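The text does not state which stress formula was used. A common choice for non-metric Multidimensional Scaling, and the one we assume here purely for illustration, is Kruskal's stress-1. In the formula below, d_ij is the distance between subcorpora i and j in the low-dimensional configuration, and \hat{d}_ij is the corresponding disparity, i.e. the value obtained from a monotone regression on the original profile-based distances.

```latex
\[
\text{Stress-1} \;=\;
\sqrt{\frac{\sum_{i<j} \left( d_{ij} - \hat{d}_{ij} \right)^{2}}
           {\sum_{i<j} d_{ij}^{2}}}
\]
```

Under this definition, a stress of 0 means that the rank order of the plotted distances reproduces the rank order of the original distances exactly, which corresponds to the 0% stress mentioned above. A screeplot then simply plots this value against the number of dimensions of the solution.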


Fig. 3: Screeplot for non-metric Multidimensional Scaling solution (profile-based) 

The screeplot suggests that a three-dimensional solution might improve the quality of the visualization by another 5 or 6%. Therefore, we calculated a three-dimensional solution, which is represented in Figure 4.⁶ Instead of rendering a three-dimensional plot, we drew the scatterplot of dimension 1 versus dimension 2, and the scatterplot of dimension 1 versus dimension 3. This shows us how, even in a three-dimensional solution, dimension 1 still divides Belgian and Netherlandic subcorpora, and that dimension 2 divides the quality newspaper articles from Usenet. However, this register division in the three-dimensional solution is not as neat as in the two-dimensional solution, because one of the Netherlandic Usenet fragments crosses over into the quadrant of the Netherlandic quality newspaper fragments. For dimension 3, we can see a split for the topics of the Belgian subcorpora, with the economy-related subcorpora (marked with an e) at the top left of dimension 3, and the politics fragments at the bottom. On the Netherlandic side, the register (dimension 2) and topic (dimension 3) splits are muddled. The register and topic divisions of the Belgian subcorpora, however, are perfect for dimension 2 and dimension 3, respectively. The quality of the grouping on the Belgian side is obviously due to the input variables, which are specifically sensitive to Belgian Dutch variation. This indicates that the choice for a Belgian Dutch term is patterned not only nationally, but also stylistically. 

⁶ Note that a two-dimensional non-metric Multidimensional Scaling solution is not a subset of a three-dimensional non-metric Multidimensional Scaling solution. Therefore, the first two dimensions of the three-dimensional solution of Figure 4 are not necessarily identical to the two dimensions of the two-dimensional solution of Figure 2.

Fig. 4: Linguistic distance between subcorpora (profile-based, three-dimensional) 

4.2 Results of the categorization method 

Now, we present the method and the results of the state-of-the-art categorization approach, 

which uses the cosine similarity metric, instead of the adapted City-Block distance 

that is used in the profile-based approach. 
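As an illustration of the categorization approach, the sketch below treats each word as an individual feature and compares subcorpora with the cosine similarity metric. It is a minimal example under our own assumptions: raw counts are used without any weighting or normalization, and similarity is converted to a distance as 1 minus the cosine, which the text does not specify. The counts are copied from the appendix (Tab. 4); the function names are hypothetical.

```python
import math

# Each word is an individual feature; no information about which words
# name the same concept is available to the metric.
FEATURES = ("job", "baan", "wedde", "salaris", "stilaan", "langzamerhand")
VECTORS = {
    "qnp.be.e.0": (141, 133, 2, 13, 48, 2),
    "qnp.be.e.1": (140, 122, 6, 13, 49, 4),
    "qnp.nl.e.0": (2, 150, 0, 96, 1, 30),
}


def cosine_similarity(u, v):
    """Cosine of the angle between two frequency vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


def cosine_distance(a, b):
    """Assumed conversion from similarity to distance: 1 - cosine."""
    return 1.0 - cosine_similarity(VECTORS[a], VECTORS[b])


for a, b in (("qnp.be.e.0", "qnp.be.e.1"), ("qnp.be.e.0", "qnp.nl.e.0")):
    print(a, b, round(cosine_distance(a, b), 3))
```

Because the vectors contain raw frequencies, highly frequent words dominate the outcome, regardless of whether their frequency reflects a lexical choice or simply the content of the subcorpus.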

In the current case-study, we take the RBBN items (and the alternatives) as individual features and remove the knowledge of conceptual categorization. If we calculate the similarities (and consequent distances) with these input features between the subcorpora in our dataset, and then produce the two-dimensional visualization with Multidimensional Scaling, we get the plot in Figure 5. If we create a screeplot (Figure 6) to show us how much stress difference there is between the first and the second dimension, we see that the second dimension reduces the stress of a one-dimensional solution by about 8%. Therefore, we will interpret the two dimensions in their own right, knowing however that the first dimension contains more pronounced distances than the second dimension. 

Fig. 5: Linguistic distance between subcorpora (profile-based, three-dimensional)

Fig. 6: Linguistic distance between subcorpora (cosine, two-dimensional) 

In Figure 6 we see on the horizontal axis (from left to right, dimension 1) a distinction 

between the Usenet articles (bold font) and the quality newspaper articles 

(regular font). The light grey vertical line indicates the zero-line of the horizontal dimension. 

Normally, that line demarcates the boundary between two areas. Whereas 

we would expect the most important variation (thus, on the horizontal dimension) to 

be related to country, we encounter a distinction between registers. The vertical dimension 

(from bottom to top) tends to divide Belgium (black font) from The Netherlands 

(grey font), but not very clearly. The (politics) Netherlandic Usenet articles sink 

below the horizontal zero-line, and the (economy) Belgian Usenet articles rise above 

that line. Moreover, we notice that the topics are set apart in groups, as well, except for 

the quality newspapers from The Netherlands. All in all, the categorization approach 

yields a somewhat unclear grouping of subcorpora and an unexpected promotion of register 

variation as the most important variation in the input features. 

The screeplot shows that a three-dimensional solution would reduce the stress even further, to an almost optimal level. Therefore, we calculated a three-dimensional solution and represent the three dimensions in Figure 7. We apply the same idea as for the profile-based approach and plot dimensions 1 and 2, and then dimensions 1 and 3. Just as in the two-dimensional solution, we see that dimension 1 divides quality newspaper fragments from Usenet fragments, and that dimension 2 tends to divide the national subcorpora. The three-dimensional solution does a slightly better job than the two-dimensional solution, because the nation division on dimension 2 is now almost correct. Dimension 3 largely divides the topics, with politics-related fragments at the top, and economy-related fragments at the bottom. This division is almost perfect, although the grouping of the subcorpora is not so neat. Overall, though, the categorization method yields messier output than the profile-based approach. 

Fig. 7: Screeplot for non-metric Multidimensional Scaling solution (cosine) 

5 Conclusion 

The two main theoretical questions of this paper have been (a) how important is the 

notion of a conceptual category in an aggregate study of variation in the lexicon and 

(b) what is the status of conceptual categories for lexical variation? Moreover, we have 

claimed that sociolectometric methodology, of which the current study is an example, 

is needed to study a pluricentric language. The link with pluricentric languages, in this case 

Dutch, is also made in the case-study, which shows how conceptual categories and 

their consequent conceptual control are necessary to reveal the national dimension in 

the lexicon. In other words, the national varieties of Dutch do not differ so much in 

their use of words – both Belgium and the Netherlands use different words for different 

topics and registers –, but they do differ in their choice of words for expressing a 

conceptual category. This latter point is made clear in the case-study by means of the 

comparison between a profile-based onomasiological approach and a text categorization 

approach. The text categorization approach captured the mere use of individual 

words and compared the use of words in two subcorpora by means of the cosine similarity 

metric, which was not informed about the conceptual similarity between words. 

Consequently, the text categorization approach showed that there was a pattern of register and topic in the input features, stronger than the anticipated national pattern. The onomasiological approach, on the contrary, revealed a strong national dimension in word choice for naming a conceptual category. 

Of course, in order to have an expected ranking in the variational dimensions, 

and in order to compare the outcome of the aggregation approaches, the dataset had 

to be manipulated so that a certain pattern could convincingly be assumed. With that 

goal in mind, the variable set was taken from a reference list of Belgian Dutch, so that 

national variation is built into the dataset. As such, the two aggregation approaches 

could be compared by assessing how well they retrieve the national variation. It is important 

to understand, though, that an actual descriptive sociolectometric study can 

by no means rely on such a biased input variable set. Therefore, the results of this paper 

can only be of methodological value. Given the a priori known pattern of national 

variation in the dataset used in the case-study, though, one might jump to the conclusion 

that an onomasiological approach is better suited for finding variational patterns 

in the lexicon, and the preferred method for any sociolectometric study. However, there 

are a number of problems with this conclusion. 

First of all, perhaps we are wrong in the assumption that national variation is the strongest dimension in the lexical variable set and the available subcorpora; it could well be that word use, as shown in the categorization approach, is actually more strongly influenced by a register or topic dimension, and that the onomasiological approach artificially weakens these dimensions.⁷ In that case, we would have to tone down the conclusion, and say that an onomasiological approach with conceptual control is a methodological means of revealing and boosting specific underlying dimensions of variation. Moreover, we would like to point out that our corpus only sampled two topics and two registers, which is not enough to support strong generalizations. Further research is therefore needed with more topics and registers. All this, of course, does not weaken the strength of a profile-based approach, but rather points out the importance of knowing what is being measured. Our claim now is that the profile-based approach allows for much more control over what is measured than the text categorization method, and should therefore be preferred. 

⁷ Although the profile-based City-Block distance incorporates a W term that brings the frequency of the conceptual category into play.

Second, the onomasiological approach assumes a relation of identity of (conceptual) meaning between the variants, and this is theoretically problematic. Following Edmonds and Hirst (2002), we agree that perfect synonymy (the highest possible level of detail in describing a conceptual category, while still finding multiple words that fit the category) is extremely rare. By admitting this, our notion of semantics or word meaning follows the Cognitive Linguistic view that encyclopedic knowledge is indispensable. Translating the idea of Peter Harder that structural categories need not be complete, and that the abstraction goes only as far as is functional for language users (here we link up to the prototype theory of word meaning, cf. Rosch and Mervis 1975), we can reach near-synonymy by slightly relaxing the level of detail of the conceptual category: not every language user has an identical representation of a word in his or her head, but nonetheless two language users can communicate with that word. Idealized Cognitive Models (Lakoff 1987) or Frames (Fillmore 1994) are examples of describing meaning, while balancing semasiological detail and operational functionality. In future research, we will operationalize the bottom-up creation of conceptual categories by applying Word Space Models (Turney and Pantel 2010). 

Third, an onomasiological approach requires prior semasiological analysis to exclude 

contextual nuances or polysemy. In the case-study of this paper, the lemmatized 

forms of the RBBN words were naively counted in the corpus, without further checking 

the context of each occurrence. Closer inspection revealed that the RBBN list does not 

contain many potentially polysemous items, so that we can ignore the small error that 

must be present in the frequencies for the purposes of the current paper. However, as 

we want to perform the above analyses in future research with a naturalistic sample of 

lexical variation, instead of an a priori list of national variation, a semasiological study 

for every occurrence needs to be done in order to establish the conceptual control. As 

this would be an infeasible manual task when using a large number of variables, we 

will rely further on the advances being made in the field of Word Space Models to automate 

this task. 

To conclude this paper, we try to answer our initial questions. How important is 

the notion of a conceptual category in an aggregate study of the lexicon? The case-study 

has shown that conceptual control is necessary to reveal variational dimensions 

that are hidden in the overwhelming content (topic) function of words. Without conceptual 

control, the conclusion of the categorization approach would have been that 

different words are used to refer to different content, and that they may also signal 

register and perhaps national differences. This observation, albeit true and undeniable, 

is not the goal of an aggregation study: it is obvious that an aggregation of many 

words will be sensitive to content differences among subcorpora. Therefore, conceptual 

control, in the form of conceptual categories that group together similar words, 

is needed. And this brings us to the second question: what is the status of conceptual 

categories for lexical variation? Although practical as a methodological and heuristic 

device, the conceptual categories remain somewhat artificial because of the flexibility 

in their definition. In the current case study, the makers of the RBBN clearly had referential 

equivalence in mind for most categories. However, conceptual categories can 

be defined more strictly or less strictly at the whim of the researcher, because there is 

no consensus on the appropriate level of detail in the definition, especially since the 

incorporation of encyclopedic knowledge in word-meaning. The level of detail that is 

operational in the language community can only be retrieved by studying the actual 

use of words. 

And then we are back at variation.

Appendix 


Tab. 4: Snippet of the input data for both aggregation methods. Pairs of rows make up lexical 

variables. 

word qnp.be.e.0 qnp.be.e.1 qnp.be.p.0 qnp.be.p.1 qnp.nl.e.0 qnp.nl.e.1 qnp.nl.p.0 qnp.nl.p.1 use.be.e.0 use.be.e.1 use.be.p.0 use.be.p.1 use.nl.e.0 use.nl.e.1 use.nl.p.0 use.nl.p.1 

leefbaar 9 3 8 11 1 0 0 0 0 1 9 4 0 0 24 18 

levensvatbaar 2 4 2 0 2 1 3 2 0 0 1 1 0 0 4 4 

hangar 0 1 0 1 0 0 1 2 0 0 1 1 0 0 1 1 

loods 8 6 4 18 4 11 5 2 0 0 0 2 0 1 1 6 

schoon 7 10 10 12 29 21 13 11 0 2 7 3 2 4 66 85 

mooi 153 122 114 110 110 76 53 42 42 33 73 67 52 74 449 475 

dagorde 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 

agenda 29 26 100 90 29 21 39 24 2 1 14 14 1 1 17 33 

knook 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 

been 13 15 43 41 39 29 14 20 10 12 14 12 21 18 76 65 

zever 0 0 1 0 0 0 0 0 6 2 15 15 0 0 4 14 

onzin 7 1 23 30 8 5 5 3 5 10 44 61 26 43 451 485 

draad 4 6 14 10 6 13 2 3 1 2 31 32 9 10 90 87 

snoer 2 0 2 1 1 5 1 1 0 0 3 1 0 0 21 28 

weeral 0 0 2 0 0 0 0 0 9 3 9 9 0 1 4 1 

alweer 19 22 32 22 21 30 11 17 5 1 21 22 12 9 98 98 

fel 27 23 33 35 17 19 31 42 6 1 5 10 0 1 19 31 

erg 331 268 208 217 117 112 76 68 21 36 143 131 99 94 830 835 

strop 4 2 1 3 26 18 4 3 0 0 1 0 0 0 3 3 

strik 1 2 2 3 5 6 1 0 0 0 2 0 0 2 1 2 

verdiep 2 1 4 3 8 2 4 11 0 0 2 3 3 4 20 26 

verdieping 0 6 6 7 5 4 10 11 0 0 1 0 0 0 12 10 

stamp 6 2 9 5 5 1 0 2 1 0 5 5 0 0 11 10 

duw 27 16 42 34 20 25 13 16 1 1 13 8 0 5 27 28 

spaarzaam 0 1 0 1 2 2 1 2 0 0 0 0 0 0 1 0 

zuinig 3 10 5 12 18 21 4 1 0 0 2 3 0 0 10 13 

hospitaal 0 4 4 3 0 0 0 0 0 0 1 1 0 0 0 2 

ziekenhuis 26 34 82 60 11 40 11 11 0 1 15 15 0 2 61 92 

micro 1 1 2 3 0 0 0 0 0 1 0 0 1 1 2 1 

microfoon 1 1 2 10 2 3 3 7 0 0 0 0 0 0 34 28 

buis 7 2 2 1 4 1 6 3 0 0 2 1 0 0 18 12 

onvoldoende 57 56 38 60 36 29 18 28 4 4 2 7 3 8 23 23 

toelage 3 2 3 2 2 5 0 1 0 0 5 0 0 0 1 1 

subsidie 33 41 13 15 35 22 29 49 1 0 14 15 2 4 122 137 

woonst 1 2 3 3 0 0 0 0 0 0 1 1 0 0 0 0 

woning 47 60 45 54 47 70 2 21 17 15 8 9 23 17 54 91 

uitbater 13 11 3 8 1 1 2 4 0 0 3 2 0 0 6 4 

exploitant 2 2 2 2 15 13 3 5 0 0 0 0 0 0 1 1 

tussenkomst 19 8 17 13 3 3 0 1 1 2 0 1 2 2 0 6 

bijdrage 40 64 23 23 37 25 34 30 3 9 6 16 14 26 90 80 

tegenstrever 1 1 6 8 2 1 0 1 0 0 0 1 0 0 0 0 

tegenstander 24 19 70 77 16 17 38 32 0 0 18 16 5 5 63 64 

aanvang 5 5 3 3 7 8 2 2 0 0 1 3 1 2 3 4 

begin 635 550 499 507 637 554 322 341 78 71 139 201 100 102 706 712 



aanduiding 7 3 6 4 1 1 1 0 1 1 2 5 1 1 5 4 

benoeming 34 14 19 17 46 22 35 43 0 0 7 5 3 2 16 10 

tevergeefs 8 2 12 7 10 7 7 5 2 0 1 2 0 1 3 4 

vergeefs 2 0 0 2 3 7 4 14 0 0 0 4 0 0 0 4 

tewerkstelling 8 7 4 16 0 0 0 0 0 0 4 0 0 0 0 0 

werkgelegenheid 79 80 17 24 25 16 7 5 0 0 4 6 7 5 13 27 

zetel 42 61 91 62 25 23 42 43 1 0 34 32 1 1 193 195 

fauteuil 0 0 3 0 0 0 0 3 0 0 0 0 0 0 0 0 

verslaggever 11 10 29 43 3 1 8 5 0 0 0 0 0 0 21 28 

rapporteur 1 1 9 5 0 0 2 0 0 0 1 0 0 0 0 1 

verlieslatend 10 6 1 0 1 2 0 0 0 1 0 0 0 0 0 0 

verliesgevend 1 0 0 0 31 14 9 9 0 0 0 0 1 3 4 6 

vermits 4 5 1 4 0 0 0 0 19 12 16 20 0 0 1 2 

aangezien 95 81 32 43 24 28 2 3 33 25 45 36 33 26 161 148 

universitair 10 5 7 30 2 1 4 6 2 0 1 2 0 0 5 5 

academicus 6 1 13 9 2 0 1 2 0 0 1 1 0 0 4 6 

vaststelling 30 27 42 44 4 3 1 4 0 0 5 10 2 1 6 6 

constatering 1 0 0 1 15 6 0 4 0 0 1 0 1 2 11 12 

verhoog 184 178 25 38 107 112 36 34 8 11 12 12 23 22 39 41 

podium 1 1 20 25 3 2 4 7 0 0 4 1 0 0 7 5 

wedde 2 6 2 5 0 0 0 0 0 0 1 1 0 0 2 1 

salaris 13 13 1 0 96 83 25 26 0 0 3 0 6 4 49 44 

objectief 21 25 19 18 8 10 4 7 2 4 22 27 5 4 64 42 

doel 66 67 57 112 80 91 63 63 7 11 35 33 24 30 198 174 

nakend 9 15 12 10 1 1 0 1 0 1 3 1 1 1 0 0 

nabij 35 33 27 40 11 13 8 8 3 9 2 2 3 6 19 16 

nijverheid 18 14 1 0 0 0 0 0 0 0 0 1 0 0 0 0 

industrie 75 65 22 32 25 26 37 29 1 0 11 8 6 4 40 39 

inbreuk 21 25 6 17 3 2 1 3 0 1 4 3 1 0 8 5 

overtreding 15 14 25 40 6 8 4 9 1 0 9 10 2 2 12 26 

job 141 140 59 78 2 0 0 1 4 6 21 16 0 2 4 9 

baan 133 122 31 39 150 117 111 78 4 5 11 13 9 6 139 117 

maximum 10 12 4 4 6 19 2 6 12 6 6 4 11 16 29 21 

maximaal 47 35 25 30 79 76 20 16 21 11 5 7 35 36 38 39 

minimum 26 20 8 14 14 11 12 10 13 13 17 15 8 5 20 22 

minimaal 28 19 15 25 73 59 19 28 6 3 2 5 37 28 62 46 

merkwaardig 19 14 30 37 7 15 4 4 1 0 2 0 0 0 48 28 

opmerkelijk 47 52 66 57 67 56 20 20 2 0 6 4 1 0 28 11 

effectief 36 34 35 36 45 59 11 20 8 8 24 15 13 12 51 57 

daadwerkelijk 19 16 21 13 59 54 24 21 1 1 4 1 11 9 49 55 

stock 12 12 2 3 6 0 0 1 45 40 0 0 34 25 0 1 

voorraad 65 40 13 3 27 25 4 9 4 0 0 1 19 25 7 18 

stilaan 48 49 57 53 1 2 0 0 2 3 6 6 3 0 1 2 

langzamerhand 2 4 1 3 30 27 3 13 0 0 0 3 0 0 29 32 

serieus 24 20 40 16 41 32 56 53 30 27 63 56 40 29 196 197 

ernstig 72 52 101 88 31 24 23 28 3 1 27 37 4 3 94 119 

politieker 0 0 0 0 0 0 0 0 0 1 18 14 0 0 13 8 

politicus 48 81 321 275 52 37 47 58 1 2 89 93 7 6 289 221 

gerechtshof 2 3 4 2 17 16 9 7 0 0 2 1 1 0 3 13 


rechtbank 122 112 61 70 15 27 9 13 1 4 11 21 2 2 52 64 

prof 1 2 3 3 1 2 0 0 1 1 5 5 0 3 8 6 

professor 39 33 70 72 3 8 6 6 0 0 9 3 7 3 27 36 

fout 74 84 154 158 51 65 25 43 38 17 92 74 87 75 326 299 

overtreding 15 14 25 40 6 8 4 9 1 0 9 10 2 2 12 26 

publiciteit 9 6 5 6 16 18 9 11 0 0 4 5 2 1 17 14 

reclame 60 45 17 32 21 21 15 12 11 5 18 11 30 43 46 51 

proper 8 10 14 20 0 0 0 0 3 5 0 3 2 2 1 4 

schoon 7 10 10 12 29 21 13 11 0 2 7 3 2 4 66 85 

fier 1 4 15 13 1 4 0 1 3 0 5 6 0 0 1 1 

trots 15 19 25 25 22 32 11 16 2 0 9 9 2 3 69 63 

schepen 11 14 49 24 7 4 2 1 0 0 11 3 0 0 4 1 

wethouder 0 0 1 4 9 13 11 14 0 0 2 2 0 0 22 22 

schrijvelaar 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 

Rekenhof 12 15 6 11 0 0 0 0 0 0 1 0 0 0 0 0 

Rekenkamer 6 7 10 3 17 33 4 65 0 0 0 0 0 0 0 1 
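For readers who want to work with the snippet, the following Python sketch shows one possible way of reading such rows and grouping them into lexical variables. It assumes that the 16 counts of each row follow the column order of the table headers (qnp.be.e.0 through use.nl.p.1) and that consecutive rows form a pair; the function names parse_rows and pair_up are hypothetical. Only four rows are copied here for brevity.

```python
# Column labels in the order in which they appear in the table headers.
COLUMNS = [
    f"{register}.{country}.{topic}.{half}"
    for register in ("qnp", "use")
    for country in ("be", "nl")
    for topic in ("e", "p")
    for half in ("0", "1")
]

# Four rows copied from Tab. 4: a word followed by its 16 frequencies.
ROWS = """\
wedde 2 6 2 5 0 0 0 0 0 0 1 1 0 0 2 1
salaris 13 13 1 0 96 83 25 26 0 0 3 0 6 4 49 44
fier 1 4 15 13 1 4 0 1 3 0 5 6 0 0 1 1
trots 15 19 25 25 22 32 11 16 2 0 9 9 2 3 69 63
"""


def parse_rows(text):
    """Map each word to a dictionary of subcorpus label -> frequency."""
    table = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        word, counts = fields[0], [int(x) for x in fields[1:]]
        if len(counts) != len(COLUMNS):
            raise ValueError(f"expected {len(COLUMNS)} counts for {word!r}")
        table[word] = dict(zip(COLUMNS, counts))
    return table


def pair_up(text):
    """Group consecutive rows into (first word, second word) pairs."""
    words = [line.split()[0] for line in text.splitlines() if line.strip()]
    return list(zip(words[0::2], words[1::2]))


table = parse_rows(ROWS)
print(pair_up(ROWS))
print(table["salaris"]["qnp.nl.e.0"])
```

The positional pairing assumes that every variable has exactly two rows. A row without a partner in the snippet (such as schrijvelaar) would shift all later pairs and therefore needs separate handling.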

References 

Auer, Peter. 2005. Europe’s sociolinguistic unity, or: A typology of European dialect/standard 

constellations. In Nicole Delbecque, Johan van der Auwera & Dirk Geeraerts (eds.), Perspectives 

on variation, 7–42. Berlin and New York: Mouton de Gruyter. 

Baeza-Yates, Ricardo and Berthier Ribeiro-Neto. 1999. Modern information retrieval. New York: 

ACM Press & Addison-Wesley. 

Bickerton, Derek. 1971. Inherent variability and variable rules. Foundations of Language and Cognitive 

Processes 7(4). 457–492. 

Bouma, Gerlof, Gertjan van Noord, and Rob Malouf. 2001. Alpino: wide-coverage computational 

analysis of Dutch. In Walter Daelemans, K. Sima’an, J. Veenstra & J. Zavrel (eds.), Computational 

Linguistics in the Netherlands 2000, 45–59. Amsterdam: Rodopi. 

Clyne, Michael. 1992. Pluricentric languages: Differing norms in different nations. Berlin and New 

York: Mouton de Gruyter. 

Cox, Trevor and Michael Cox. 2001. Multidimensional scaling. London and New York: Chapman 

and Hall. 

Edmonds, Philip and Graeme Hirst. 2002. Near-synonymy and lexical choice. Computational Linguistics 

28(2). 105–144. 

Fillmore, Charles. 1994. Starting where dictionaries stop: the challenge of corpus lexicography. 

In Beryl T. Sue Atkins & Antonio Zampolli (eds.), Computational approaches to the lexicon, 

349–393. Oxford: Oxford University Press. 

Geeraerts, Dirk. 2009. Lexical variation in space. In Jürgen Erich Schmidt & Peter Auer (eds.), 

Language and space I: Theories and methods, 821–837. Berlin and New York: Mouton de 

Gruyter. 

Geeraerts, Dirk. 2010. Schmidt redux: How systematic is the linguistic system if variation is rampant? 

In Kasper Boye & Elisabeth Engberg-Pedersen (eds.), Language usage and language 

structure, 237–262. Berlin and New York: Mouton de Gruyter. 



Geeraerts, Dirk, Stefan Grondelaers and Dirk Speelman. 1999. Convergentie en divergentie in 

de Nederlandse woordenschat. Een onderzoek naar kleding- en voetbaltermen. Amsterdam: 

Meertens Instituut. 

Geeraerts, Dirk, Gitte Kristiansen, and Yves Peirsman (eds.). 2010. Advances in Cognitive Sociolinguistics. 

Berlin and New York: Mouton de Gruyter. 

Goebl, Hans. 1975. Dialektometrie. Grazer linguistische Studien. 32–38. 

Grieve, Jack, Dirk Speelman, and Dirk Geeraerts. 2011. A statistical method for the identification 

and aggregation of regional linguistic variation. Language Variation and Change 23. 193–221. 

Harder, Peter. 2010. Meaning in mind and society: A functional contribution to the social turn in 

Cognitive Linguistics. Berlin and New York: Mouton de Gruyter. 

Impe, Leen, Dirk Geeraerts, and Dirk Speelman. 2008. Mutual intelligibility of standard and regional 

Dutch language varieties. International Journal of Humanities and Arts Computing 2. 

101–117. 

Kristiansen, Gitte and René Dirven (eds.). 2008. Cognitive Sociolinguistics: Language variation, 

cultural models, social systems. Berlin and New York: Mouton de Gruyter. 

Labov, William. 1966. The social stratification of English in New York City. Washington, D.C.: Center 

for Applied Linguistics. 

Lakoff, George. 1987. Women, fire and dangerous things: What categories reveal about the mind. 

Chicago: University of Chicago Press. 

Martin, Willy. 2005. Het Belgisch-Nederlands anders bekeken: het Referentiebestand Belgisch-Nederlands 

(RBBN). Technical report. Amsterdam: Vrije Universiteit Amsterdam. 

Nerbonne, John and William Kretzschmar. 2003. Introducing computational techniques in Dialectometry. 

Computers and the Humanities 37. 245–255. 

Rosch, Eleanor and Carolyne Mervis. 1975. Family resemblances: Studies in the internal structure 

of categories. Cognitive Psychology 7(4). 573–605. 

Séguy, Jean. 1971. La relation entre la distance spatiale et la distance lexicale. Revue de Linguistique 

Romane 35. 335–357. 

Speelman, Dirk, Stefan Grondelaers, and Dirk Geeraerts. 2003. Profile-based linguistic uniformity 

as a generic method for comparing language varieties. Computers and the Humanities 37. 

317–337. 

Szmrecsanyi, Benedikt. 2010. The English genitive alternation in a cognitive sociolinguistics perspective. 

In Dirk Geeraerts, Gitte Kristiansen & Yves Peirsman (eds.), Advances in Cognitive 

Sociolinguistics, 141–166. Berlin and New York: Mouton de Gruyter. 

Turney, Peter and Patrick Pantel. 2010. From frequency to meaning: vector space models of semantics. 

Journal of Artificial Intelligence Research 37. 141–188.
