Tom Ruette, Dirk Speelman, and Dirk Geeraerts
Lexical variation in aggregate perspective
Abstract: If one aims to study a pluricentric language with the goal of making general
assertions about linguistic levels, an aggregate perspective in which many linguistic
items that represent the linguistic level are considered is necessary. The current paper
presents and compares two methodologies for aggregating lexical variation so that the
similarity or dissimilarity between language varieties such as the centers of a pluricentric
language can be quantitatively measured. The two methodologies differ with
respect to the treatment of the semantic relation between words: whereas one method
simply ignores the semantic relation between words, the other method incorporates
the knowledge that some words are alternative means of naming a single concept. The
question of which method is most suitable for measuring the similarity or dissimilarity
between language varieties is raised and empirically tested in a corpus-based case
study on the pluricentric language Dutch, as used in Belgium and the Netherlands. It
will be shown that the method that incorporates semantic knowledge manages to go
beyond possible conceptual variation between language varieties, clearly revealing
an expected distinction between Dutch as used in Belgium and in the Netherlands. In
contrast with this, the semantically non-informed method is disturbed by conceptual
variation and is not able to convincingly show the distinction between Dutch as used
in Belgium and in the Netherlands, although the set of linguistic items clearly suggests
that such a national pattern should emerge.
Keywords: aggregate perspective, sociolectometry, lexical variation, Dutch
1 Introduction
The current paper shows how a sociolectometric approach is needed to disentangle the
multidimensional structure of the varieties in a pluricentric language. There are different
sociolectometric approaches, namely corpus-based methods, perception experiments,
or attitude questionnaires; we will perform a corpus-based case study. Although the focus
of a sociolectometric approach is on the varieties, the choice of the variables under
analysis is crucial; we focus on lexical variation. Furthermore, in this paper we compare
two quantitative corpus-based methods, which differ in their conceptual control
of lexical variables: on the one hand, there is a method that ignores the conceptual
relationship between the lexemes in the variable set; on the other hand, there is a
method that incorporates knowledge about conceptual identity between lexemes. The
importance and difficulties of conceptual control when studying variation in the lexicon
as a whole are shown by means of a case study on the pluricentric language Dutch.
The pluricentric character of Dutch is now widely accepted: Dutch is used both in
Belgium and in the Netherlands, but each nation has its own norm-generating center (cf.
Clyne 1992). This is different from the imposed situation in earlier years, especially
the sixties, when Dutch in Belgium was supposed to be exogenously modeled on the
norms of the Netherlands. More recently, empirical work by e.g. Geeraerts et al.
(1999) and experimental work by e.g. Impe et al. (2008) have led to this historical view
being adjusted to the current view, as described in Auer (2005).
Rather than providing further empirical proof of the pluricentric character of
the Dutch lexicon, the case study aims to show the pertinence of a sociolectometric
methodology that can aggregate patterns of non-categorical lexical variation while
incorporating an appropriate amount of conceptual control, in contrast to a methodology
that discards any conceptual knowledge. As such, the study touches upon two
general issues in the broader field of variationist linguistics: on the level of words, we
look at the problematic status of lexical variation and the difficulty of delineating word
meaning; on the level of structure, we run into the methodological issue of aggregating
the probabilistic variational patterns of many words in order to reach a general view
on the lexicon, rather than on individual words.
Let us start, however, more generally with the status of variation in a linguistic
system. Attempts to incorporate variational rules in the linguistic system have been
criticized (e.g. Bickerton 1971) on the argument that variation has no place in the search
for an abstract and idealized linguistic system of competence and langue. However, a
paradigm shift in linguistics towards usage-based approaches turned the ubiquity of
variation into something that should not be ignored. Nonetheless, even in usage-based
Cognitive Linguistics, which studies parole by definition and can therefore hardly escape
variation, there has been a tendency to overestimate the homogeneity of language
communities and, consequently, their non-variability. Recently, Cognitive Linguistics has
taken up the challenge of incorporating variational dimensions in the study of linguistic
phenomena. Evidence for this comes from two collected volumes on Cognitive
Sociolinguistics, Kristiansen and Dirven (2008) and Geeraerts et al. (2010), which combine
theoretical, methodological and empirical studies that incorporate cognitive, semantic and
lectal dimensions in their linguistic descriptions. Of course, one does not need to commit
to a cognitive framework to combine language-internal variables and language-external
variables, but Cognitive Sociolinguistics is currently at the cutting edge when
it comes to multivariate analyses of linguistic phenomena. The idea of Cognitive
Sociolinguistics is best explained by looking at an exemplary case study by Szmrecsanyi
(2010). In that study, the English genitive alternation between an of-construction and
an 's-construction is approached in the well-known Cognitive Linguistic fashion, with
semantic, pragmatic, psycholinguistic, structural and functional predictors. In addition
to these typical Cognitive Linguistic predicting factors, however, extra-linguistic
factors are included as well: e.g. register (newspaper versus informal), medium (spoken
versus written) and geography (British versus American English). Based on many
observations of genitive constructions in corpora that are representative of these lectal
factors, it appears that “the magnitude of the effect that individual conditioning
factors [e.g. semantic and pragmatic factors] may have on genitive choice […] is
demonstrably mediated by language-external [i.e. lectal] factors” (Szmrecsanyi 2010).
The example given above, representative of a widespread trend in Cognitive
Linguistics, studies a single linguistic phenomenon very closely. Although the
insights gained from such single-feature studies are at the very heart of the linguistic
enterprise, they hardly allow for extrapolations and abstractions about the linguistic
system in general: the fact that lectal factors have an important mediating influence
on the choice of a specific genitive form (in English) does not mean that they have the
same effect on other linguistic items (in other languages). In order to reach a more
general level of that kind, the behavior of many linguistic variables needs to be
aggregated so that idiosyncratic differences are averaged out, structures emerge and
systematicity can be induced. This aggregate perspective also ties in with the answer
that Geeraerts (2010) gives to his own question about the plausibility of a system when
variation is rampant: finding a linguistic system is an empirical question that can be
answered by looking for statistically recurring structural patterns in variational data.
In other words, assuming a system that is able to predict linguistic choices, we should
find a probabilistic model that fits the observed variation.
Returning to the topic of the current paper (lexical variation in a pluricentric language),
how can these theoretical insights be applied? To answer this question, we
will address lexical variation in Section 2 and aggregation in Section 3. In Section 4,
we will perform a case study on aggregated lexical variation in the pluricentric language
Dutch. Finally, we bring together the theoretical insights and the results of the
case study in the conclusion of this paper.
2 Lexical variation
Harder (2010: 270) claims that there are three stages in the emergence of a sociodynamic
perspective on the linguistic system. The first stage consists of mere fluctuations,
comparable to the babbling of a toddler. From these fluctuations a structure emerges,
consisting of categories that contain the fluctuation, but this structure is an incomplete
abstraction of the fluctuations. The abstraction goes only so far as the language
user deems appropriate, i.e. until communication is enabled. This is the second stage
of emerging structure. In the third stage, the initial-stage fluctuations
turn into systematic variation within the emerged structural category. Although the
three stages are presented by means of a developmental example (i.e. the babbling
toddler), these stages might well have a more general ontogenetic status that may explain
language variation and change. Abandoning the dynamic character of these three
stages, and looking at every stage independently, we could say that variationist research
zooms in on the third stage, assuming the categories from the second stage. As
an example, Harder gives the seminal Labovian study on the structural stage-two category
“postvocalic -r”, with its category-bound stage-three variants, which appeared
to be related to social classes in New York (Labov 1966). Scholars of the linguistic system
have traditionally removed stage three (variation, or rather variable usage) and
focused on the abstract and idealized stage-two structural categories. However, an
adequate study of the linguistic system must not ignore the stage-three variation, as
structure and variation cannot exist without each other. Structure without variation
is divorced from linguistic reality, and variation without structure is mere fluctuation,
incapable of enabling communication.
Although this idea of a system is primarily geared towards linguistic categories such
as consonants or Germanic strong verbs, it can conveniently be “translated” to
the conceptual categories of the lexicon. There is, however, an important question related
to the level of abstraction in stage two when considering the lexicon. If, on the
one hand, the categories are chosen to be as narrow as a single word (or symbol), the
variation within these categories is semasiological variation. This means that one studies
the different senses or aspects of meaning of a single word. If, on the other hand,
the categories are chosen to be as broad as “concepts”, the variation in naming these
categories (i.e. that different words may name the same concept) is onomasiological
variation. This means that one studies the different ways of expressing (with words)
the conceptual category. Obviously, this very old distinction between a semasiological
and an onomasiological approach is related to the study of polysemy versus the study
of synonymy.
In this paper, we restrict ourselves to the onomasiological perspective, while remaining
fully aware of the semasiological issues waiting around the corner. We refer to Geeraerts
(2009) for an overview of research on lexical variation, and zoom in here briefly on
a distinction between Formal Onomasiological Variation (FOV) and Conceptual
Onomasiological Variation (COV). An FOV approach resembles the sociolinguistic variable:
FOV grasps a quality of a set of words that express the same concept, and just as in a
sociolinguistic variable, each word in the set may have a specific socio-stylistic
correlation. COV, on the other hand, links up to the more subtle variation in concepts that
are being used in language. Most obviously, at a very high level, an example could be
that one can use specific words to talk about “beer” or about “semantics”. At a more
fine-grained level, one could say that “fiddle” and “violin” are an example of FOV, but
because “fiddle” has a slightly more ordinary tone to it than the more prestigious
“violin”, there is also COV between these words. In the case study of this paper, we will
show that this distinction between FOV in choosing a word to express a concept and
COV when using words to talk in a certain way crops up in a methodological difference
between the two sociolectometric approaches that we compare.
3 Aggregation
As said above, aggregation of many variables is necessary when the goal is to describe
general patterns in a system. In order to find underlying dimensions of variation in
a large set of (lexical) variables, the individual patterns of the variables thus need to
be aggregated. Aggregation of many features is already applied in e.g. dialectometry
and text categorization. However, both dialectometry and text categorization run into
problems when it comes to dealing with lexical variation.
In dialectometry (Séguy 1971; Goebl 1975; Nerbonne and Kretzschmar 2003), lexical
variation is almost always considered to be categorical per location (an exception is e.g.
Grieve et al. 2011): a certain location, or at best a single interviewee per location,
is attributed either the use of word a or the use of word b. This categorical approach is
mainly due to the type of input data used in most dialectometric studies, i.e. a lexical
dialect atlas. Dialect atlases were painstakingly constructed in earlier years by
dialectologists who visited the locations pertinent to their purposes and accumulated
data through interviews and questionnaires. Categorical word choices per
location were a necessary (but currently no longer acceptable) methodological decision.
Because dialectometric methodology is tailored to the categorical dialect
atlas input format, its quantitative aggregation methods cannot straightforwardly
be applied to corpus-driven input, where lexical variation is a probabilistic matter.
In contrast to dialectometric practice, an aggregation method that combines probabilistic
word preferences with an onomasiological approach was introduced in Geeraerts et al.
(1999) and further formalized in Speelman et al. (2003). This so-called profile-based
approach, where “profile” stands for the (relative frequencies of a) set of words in
a conceptual category, is formally introduced below. The rationale of the method is,
as in most aggregation methods, to measure the “distance” between pairs of subcorpora,
here on the basis of their probabilistic overlap in onomasiological word preferences
for expressing an underlying conceptual category. A small distance between subcorpora
implies a general agreement in word choice, whereas a large distance implies a
general disagreement in word choice.
Profile-based distances between subcorpora are calculated by means of the following
method. Given two subcorpora V1 and V2, a conceptual category L (e.g. SUBTERRANEAN
PUBLIC TRANSPORT) and x1 to xn the exhaustive list of variants (e.g. {subway,
underground}) as the profile, we refer to the absolute frequency F of the usage of
a variant xi for L in Vj with:¹

$$F_{V_j,L}(x_i) \tag{1}$$
To make this methodological explanation more tangible, we provide a fictional example
on the basis of the absolute frequencies for two concepts, SUBTERRANEAN PUBLIC
TRANSPORT and SMALL INSTRUMENT PLAYED WITH A BOW, as used in American and British
English, cf. Table 1.
1 The following introduction to the City-Block distance method is based on Speelman et al. (2003:
Section 2.2).
Tab. 1: Fictional absolute frequencies for the variants of two concepts in two language varieties

Concept                              Variant       Am. Eng.   Br. Eng.
SUBTERRANEAN PUBLIC TRANSPORT        subway              70         20
                                     underground         10         50
SMALL INSTRUMENT PLAYED WITH A BOW   violin              50         30
                                     fiddle              40         35
Subsequently, we introduce the relative frequency R:

$$R_{V_j,L}(x_i) = \frac{F_{V_j,L}(x_i)}{\sum_{k=1}^{n} F_{V_j,L}(x_k)} \tag{2}$$
The absolute frequencies from Table 1 now become the relative frequencies in Table 2
by applying Equation 2.
Tab. 2: Fictional relative frequencies for the variants of two concepts in two language varieties,
based on Table 1

Concept                              Variant       Am. Eng.   Br. Eng.
SUBTERRANEAN PUBLIC TRANSPORT        subway           0.875      0.286
                                     underground      0.125      0.714
SMALL INSTRUMENT PLAYED WITH A BOW   violin           0.556      0.462
                                     fiddle           0.444      0.538
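The step from Table 1 to Table 2 can be sketched in a few lines of Python. This is only an illustration of Equation 2 on the fictional data; the dictionary layout and the function name `relative_profile` are our own assumptions, not part of the original method description.

```python
# Fictional absolute frequencies from Table 1, keyed by concept, then variety.
# The nested-dict layout is an assumption made for this sketch.
ABSOLUTE = {
    "SUBTERRANEAN PUBLIC TRANSPORT": {
        "Am. Eng.": {"subway": 70, "underground": 10},
        "Br. Eng.": {"subway": 20, "underground": 50},
    },
    "SMALL INSTRUMENT PLAYED WITH A BOW": {
        "Am. Eng.": {"violin": 50, "fiddle": 40},
        "Br. Eng.": {"violin": 30, "fiddle": 35},
    },
}


def relative_profile(counts):
    """Equation 2: divide each variant's count by the total of its profile."""
    total = sum(counts.values())
    return {variant: freq / total for variant, freq in counts.items()}


for concept, varieties in ABSOLUTE.items():
    for variety, counts in varieties.items():
        # Round only for display; the unrounded values are kept in the profile.
        rounded = {v: round(r, 3) for v, r in relative_profile(counts).items()}
        print(concept, "|", variety, "|", rounded)
```

Rounded to three decimals, the printed profiles correspond to the cells of Table 2.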
Now we can define the (City-Block) distance DCB between V1 and V2 on the basis of the
profile for L as follows (the division by two is for normalization, mapping the results
to the interval [0,1]):

$$D_{CB,L}(V_1, V_2) = \frac{1}{2} \sum_{i=1}^{n} \left| R_{V_1,L}(x_i) - R_{V_2,L}(x_i) \right| \tag{3}$$
The City-Block distance is a straightforward descriptive dissimilarity measure that assumes
the absolute frequencies in the sample-based profile to be large enough for the
relative frequencies to be good estimates of the relative frequencies in the underlying
population-based profiles. If, however, the samples are rather small, the relative
frequencies become unreliable, and a supplementary control is needed. For this we use
a measure that is based on the confidence that there is an actual difference between
two profiles: the Fisher Exact test. This time, unlike with DCB, we look at the
absolute frequencies in the profiles we compare. When we compare a profile in one
subcorpus to the profile for the same concept in a second subcorpus, we use a Fisher
Exact test to check the hypothesis that both samples are drawn from the same
population. We use the p-value from the Fisher Exact test as a filter for DCB. We set the
dissimilarity between subcorpora at zero if p > 0.05, and we use DCB if p < 0.05.²
If we now apply this step to the fictional data from Tables 1 and 2, we must first
calculate the Fisher Exact p-value for every concept, to verify whether the absolute
frequencies for American and British English are sampled from different populations. For
SUBTERRANEAN PUBLIC TRANSPORT, the p-value is much smaller than 0.05, so we can accept
that British English is different from American English when it comes to this concept.
Therefore, we calculate the City-Block distance by means of Equation 3 for
SUBTERRANEAN PUBLIC TRANSPORT. Filling in the equation, we get 0.5 × (|0.875 − 0.286| +
|0.125 − 0.714|) = 0.589. For the concept SMALL INSTRUMENT PLAYED WITH A BOW we
find a p-value for the Fisher Exact test larger than 0.05, so we cannot conclude that
British English and American English represent different populations. Therefore,
we set the distance between these varieties for this concept at zero.
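The filtering step just described can be sketched as follows. This is a minimal illustration, assuming profiles with exactly two variants so that the Fisher Exact test reduces to a 2×2 table; the function names are hypothetical, and the two-sided p-value is computed in the common way, by summing the hypergeometric probabilities of all tables that are no more probable than the observed one.

```python
from math import comb


def fisher_exact_2x2(a, b, c, d):
    """Two-sided Fisher Exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability that the top-left cell equals x.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    observed = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Small relative tolerance so that ties with the observed table are counted.
    return sum(p for p in map(prob, range(lo, hi + 1)) if p <= observed * (1 + 1e-7))


def city_block(counts1, counts2):
    """Equation 3: half the summed absolute differences of relative frequencies."""
    total1, total2 = sum(counts1.values()), sum(counts2.values())
    variants = set(counts1) | set(counts2)
    return 0.5 * sum(
        abs(counts1.get(v, 0) / total1 - counts2.get(v, 0) / total2) for v in variants
    )


def filtered_distance(counts1, counts2, alpha=0.05):
    """City-Block distance, set to zero when the Fisher Exact test is not significant.

    Assumption of this sketch: the profile has exactly two variants.
    """
    first, second = sorted(set(counts1) | set(counts2))
    p = fisher_exact_2x2(
        counts1.get(first, 0), counts1.get(second, 0),
        counts2.get(first, 0), counts2.get(second, 0),
    )
    return city_block(counts1, counts2) if p < alpha else 0.0


# Fictional absolute frequencies from Table 1 (American vs. British English).
transport = filtered_distance({"subway": 70, "underground": 10},
                              {"subway": 20, "underground": 50})
instrument = filtered_distance({"violin": 50, "fiddle": 40},
                               {"violin": 30, "fiddle": 35})
print(round(transport, 3), instrument)
```

Profiles with more than two variants would need an exact test for larger contingency tables, which this sketch does not attempt.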
To calculate the dissimilarity between subcorpora on the basis of many profiles,
we sum the weighted dissimilarities for the individual profiles. In other words, given a
set of profiles L1 to Lm, the global dissimilarity D between two subcorpora V1 and V2
on the basis of L1 up to Lm can be calculated as:

$$D_{CB}(V_1, V_2) = \sum_{i=1}^{m} \left( D_{CB,L_i}(V_1, V_2) \, W(L_i) \right) \tag{4}$$
The W in the formula is a weighting factor. We use weights to ensure that concepts
which have a relatively higher frequency (summed over the size of the two subcorpora
that are being compared)³ also have a greater impact on the distance measurement. In
other words, in the case of a weighted calculation, concepts that are more common in
everyday life and language are treated as more important. Applying this to the fictional
example from Table 1, we can calculate the W per concept by dividing the sum of the
absolute frequencies of all variants of one concept by the sum of the absolute
frequencies of all variants of all concepts.
For SUBTERRANEAN PUBLIC TRANSPORT this equals (70 + 10 + 20 + 50)/(70 + 10 + 20 + 50 +
50 + 40 + 30 + 35) = 0.492. There is no need to calculate the W for SMALL INSTRUMENT
PLAYED WITH A BOW, as its distance is already set to zero. Filling out Equation 4, we find
that the distance between British English and American English aggregated over both
concepts is (0.589 × 0.492) + 0 = 0.29.
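The weighted aggregation can be sketched as follows. The per-concept distances (0.589 and 0) are simply copied from the worked example in the text, and the names `concept_weights` and `global_distance` are our own hypothetical choices; in the toy data both profiles exceed the frequency threshold of 30, so no profile is excluded.

```python
# Summed absolute frequencies per concept over both varieties (from Table 1).
PROFILE_TOTALS = {
    "SUBTERRANEAN PUBLIC TRANSPORT": 70 + 10 + 20 + 50,
    "SMALL INSTRUMENT PLAYED WITH A BOW": 50 + 40 + 30 + 35,
}

# Per-concept distances after the Fisher Exact filter, copied from the worked example.
PROFILE_DISTANCES = {
    "SUBTERRANEAN PUBLIC TRANSPORT": 0.589,
    "SMALL INSTRUMENT PLAYED WITH A BOW": 0.0,
}


def concept_weights(totals):
    """Weight W of each concept: its share of the summed frequency of all profiles."""
    grand_total = sum(totals.values())
    return {concept: total / grand_total for concept, total in totals.items()}


def global_distance(distances, weights):
    """Equation 4: sum of the per-concept distances multiplied by their weights."""
    return sum(distances[concept] * weights[concept] for concept in distances)


weights = concept_weights(PROFILE_TOTALS)
aggregate = global_distance(PROFILE_DISTANCES, weights)
print(round(weights["SUBTERRANEAN PUBLIC TRANSPORT"], 3), round(aggregate, 2))
```

Because the weights sum to one, the aggregated distance stays within the interval [0,1], like the per-concept distances.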
2 If the frequency of the profile was lower than 30 in the two varieties that are being compared, that
profile was excluded from the comparison.
3 The size of the two subcorpora is not the actual number of words in the two subcorpora, but the sum
of all profiles in these two subcorpora with a frequency higher than 30.
Now, we contrast text categorization with the profile-based approach, which
incorporates probabilistic information on word choice. In text categorization,
non-categorical (probabilistic) word choice is well accounted for (unlike in dialectometric
approaches), but text categorization totally ignores the onomasiological perspective on
lexical variation. This is primarily due to the fact that text categorization often zooms
in on topical categorization, whereas the onomasiological approach to lexical variation
within conceptual categories is exactly a way of downplaying thematic bias in the
variational patterns (Speelman et al. 2003). However, other forms of text categorization,
e.g. authorship attribution or linguistic profiling, which are quite the opposite of topic
classification, also ignore onomasiological variation and use mere (relative) occurrence
frequencies of the features in the aggregation step. This is problematic, especially given
the recent trend in authorship attribution studies to use content words.
Whereas the profile-based approach will be the quantitative method that incorporates
conceptual control in our comparison of methods, we will use the text categorization
approach as the quantitative method that ignores conceptual similarity
between the words in the variable set. Except for the distance metric used, the two
approaches are identical. The underlying metaphor of both the profile-based and the text
categorization approach is spatial: subcorpora are represented as points in an n-dimensional
space by means of the occurrence frequencies of n words. A made-up example in a
two-dimensional space, i.e. with two words, containing two text types might make
this rather abstract metaphor clearer. Given two subcorpora representing the text
types “academic articles” and “computer-mediated communication”, and given two
words “hence” (a linking word used in academic articles) and “LOL” (an abbreviation
of “Laughing Out Loud”, commonly used in IRC), one might construct the “space” in
Figure 1. The position of the academic articles in the bottom right part is due to the high
frequency of “hence” and the low frequency of “LOL” in these texts. The position of
the computer-mediated communication in the top left part is due to the low frequency
of “hence” and the high frequency of “LOL” in these texts. Obviously, these data are
made up for the sake of the argument. Now, two lines can be drawn through the origin
of the space and the position of the text types (on the basis of the frequencies of
the words that make up the dimensions), yielding an angle, for which the cosine can
be calculated. A small angle implies high similarity between the text types, and will
yield a high cosine value; a large angle implies low similarity, and will yield a low cosine
value. More information on the cosine metric can be found in Baeza-Yates and
Ribeiro-Neto (1999: 27).
Fig. 1: 2 Dimensional example of Vector Model
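The two-word example can be made concrete with a minimal Python sketch. The frequencies are invented for illustration, just like the figure, and the function name `cosine_distance` is ours; it computes one minus the cosine of the angle between two frequency vectors.

```python
import math

def cosine_distance(x, y):
    """Return 1 - cos(x, y) for two equal-length frequency vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("cosine is undefined for an all-zero vector")
    return 1.0 - dot / (norm_x * norm_y)

# Made-up frequencies, ordered as ("hence", "LOL").
academic = [90, 2]
cmc = [3, 80]

# Large angle between the two text types: distance close to 1.
d_between = cosine_distance(academic, cmc)

# Same proportions in a corpus twice the size: same direction, distance ~0.
d_same = cosine_distance(academic, [180, 4])
```

Because only the angle matters, doubling every frequency leaves the distance unchanged, so subcorpora of different sizes can be compared directly.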
Formally, given two subcorpora V1 and V2 in which the frequencies of a large number
of words were counted and stored in the respective vectors x and y, we calculate
the distance between the subcorpora by means of Equation 5.

\[
D_{\cos}(V_1, V_2) = 1 - \cos(x, y) = 1 - \frac{x \cdot y}{|x|\,|y|} = 1 - \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2}\,\sqrt{\sum_{i=1}^{n} y_i^2}} \tag{5}
\]

4 Case study
The case study of this paper is an analysis of aggregated lexical variation in the pluricentric
language Dutch. It consists of a comparison between the state-of-the-art text
categorization distance metric, which ignores conceptual control, and the profile-based
distance metric, which includes conceptual control. In order to guarantee an
objective comparison, we will apply both methods to the same dataset, which is tailored
to contain a specific constitution of variational dimensions. The method that
best approaches the expected structure will be considered superior. In what follows,
we first introduce the dataset by describing the set of lexical features and the corpus
in which these features will be counted. Second, we apply the profile-based method to
this dataset. Then, the state-of-the-art text categorization method is also applied to the
dataset. Finally, it will be concluded that the profile-based onomasiological approach
grasps the a priori constitution of variational dimensions much better than the text
categorization method.
The lexical input features are derived from the “Referentiebestand Belgisch Nederlands”
(Martin 2005, Eng. Reference List of Belgian Dutch, abbreviated “RBBN”). This
reference list contains words or expressions that exclusively appear in Belgian Dutch,
and have no occurrences in The Netherlands, according to dictionaries, corpora and
informants. The list contains about 4000 items, ranging from colloquial items and
culturally linked items (e.g. Belgian institutes) to register-specific and freely varying items.
As an example, a small selection of items is listed in Table 3, but the whole list can
be downloaded freely from the website of the “Instituut voor Nederlandse Lexicologie”.
For each Belgian Dutch item, the list provides an alternative from general Dutch,
or sometimes typically Netherlandic Dutch. From the 4000 items on the list, we only
retained 1455 items for which the Belgian Dutch item itself and its alternative consist
of one single word. If we restrict the RBBN list to these single-word items (thus
excluding multi-word units and expressions), these items can be counted accurately
in an automatic way by merely keeping track of the occurrence frequency
of the words in the subcorpora.⁴ Indeed, expressions and multi-word units may be
distributed over the sentence because of syntactic constructions, making automatic
counting very hard. All (single) words on the list were analyzed with the Alpino parser,
so that accurate counts of the lemmata could be performed, while controlling for
the part-of-speech. Linking back to the issue of conceptual categories in Section 2, we
accept the conceptual categories of the makers of the RBBN in their equivalence judgement
between the Belgian Dutch item and its alternative.

4 We address possible polysemy and the need for word sense disambiguation in
automatic counting in the conclusions.

Tab. 3: Selected examples from the RBBN
Belgian Dutch   General Dutch   Translation of concept
suikerboon      doopsuiker      candy to honor the birth of a baby
appelsien       sinaasappel     orange (fruit)
unaniem         eenparig        unanimous
ambras          ruzie           a row
confituur       jam             marmalade
binnenkoer      binnenplaats    atrium
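The counting step can be pictured with a small Python sketch. It assumes, hypothetically, that the parser output has already been reduced to one (lemma, part-of-speech) pair per token; the tag names and the helper `count_variants` are illustrative and do not reflect Alpino's actual output format.

```python
from collections import Counter

def count_variants(tokens, variants):
    """Count occurrences of the (lemma, pos) pairs listed in `variants`."""
    wanted = set(variants)
    return Counter(tok for tok in tokens if tok in wanted)

# Hypothetical parser output: one (lemma, part-of-speech) pair per token.
tokens = [
    ("de", "det"), ("confituur", "noun"), ("zijn", "verb"),
    ("mooi", "adj"), ("jam", "noun"), ("confituur", "noun"),
]

# One lexical variable from Tab. 3: Belgian Dutch item and its alternative.
variants = [("confituur", "noun"), ("jam", "noun")]

counts = count_variants(tokens, variants)
```

Keying on the pair rather than on the bare lemma is what keeps, say, a noun reading apart from a homographic verb.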
Because we know that this list contains Belgian Dutch words and an alternative,
we can predict that the main variation in the list will be due to a national pattern. Indeed,
even the non-national variation which is present in the list (e.g. colloquialisms)
is still embedded in the Belgian Dutch point of view of the RBBN. In other words,
every variable in the variable set is at least nationally patterned. Therefore, we expect
the results of our method to show a clear distinction between the two national varieties,
with other variational dimensions appearing only after that.
In our corpus, we incorporate samples from the two national varieties of Dutch,
taken from two registers (quality newspapers and Usenet), and from two topics (politics
and economy). We collected a total of 6 million words, which were evenly split
over the nations, registers and topics. The quality newspaper articles were sampled
from two large newspaper corpora that are available for both Netherlandic and Belgian
newspapers. From these two corpora, we selected four newspapers that are deemed
to be quality newspapers: “De Standaard” and “De Morgen” for Belgium, and “Volkskrant”
and “NRC” for The Netherlands. For most of the articles that appeared in the
newspapers, there is access to the category in which they were published. This categorization
was used to filter out the articles on the topics “politics” and “economy”.
The Usenet posts were downloaded from a large Usenet archive, available online
at Google Groups, and automatically stripped of meta-information (headers and
html code) and duplicated content (quotes from previous posts). Only posts from
the groups “be.politics”, “be.finance”, “nl.politiek” and “nl.financieel.*” were downloaded,
where the country affiliation of the group was taken to be an indication of the
nationality of the author of the post, and where the topical restriction of the group indicates
the topic of the post. All texts were lemmatized and tagged with part-of-speech
information by the Alpino parser (Bouma et al. 2001).
With these three dimensions (country, register, topic) and two levels for each dimension,
8 combinations are possible. These combinations, e.g. Belgian quality newspapers
on economy (abbreviated as qnp.be.e), will be represented by the subcorpora,
for which we will calculate the pairwise distances. However, to increase the number
of data points and to verify the internal consistency of the subcorpora, we divided
every subcorpus into two equally sized groups (abbreviated as e.g. qnp.be.e.0
and qnp.be.e.1). In total, then, we counted the frequencies of the linguistic characteristics
introduced above in 16 subcorpora. A snippet of this input data is presented
in the appendix to this paper.
Given the omnipresent country dimension in the input features, the primary variational
dimension that could be expected to be revealed among the subcorpora is the
Belgian Dutch versus Netherlandic Dutch dimension. Or, in terms that relate to the
distance measurement method: in a pairwise comparison of subcorpora with a national
difference, the distance will be bigger than in a comparison of two subcorpora
with the same national affiliation. Because the typical Belgian Dutch words are sometimes
restricted to a specific register, e.g. colloquialisms, a register distinction should
emerge as well. And as words and their conceptual categories are inevitably sensitive
to topic, we would expect the difference between political and economic subcorpora
to emerge, too. However, the register and topic dimensions should be secondary to the
country dimension.
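The design described above can be enumerated mechanically. The following Python sketch builds the 16 subcorpus labels and splits all unordered pairs into same-country and cross-country comparisons, which is the partition that the expectation refers to. The label `use` for Usenet is an assumption (the text only shows `qnp` labels), as are the variable names.

```python
from itertools import combinations, product

registers = ["qnp", "use"]  # "use" is an assumed label for Usenet
countries = ["be", "nl"]
topics = ["e", "p"]         # economy, politics
halves = ["0", "1"]         # the two equally sized groups per subcorpus

labels = [".".join(p) for p in product(registers, countries, topics, halves)]
pairs = list(combinations(labels, 2))

def country(label):
    """Return the country part of a label such as 'qnp.be.e.0'."""
    return label.split(".")[1]

cross_country = [(a, b) for a, b in pairs if country(a) != country(b)]
same_country = [(a, b) for a, b in pairs if country(a) == country(b)]

print(len(labels), len(pairs), len(cross_country), len(same_country))
# prints: 16 120 64 56
```

Under the stated expectation, the distances for the 64 cross-country pairs should tend to exceed those for the 56 same-country pairs.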
4.1 Results of the profile-based method<br />
We first look into the results of the profile-based approach, introduced above. To the
selected Belgian Dutch items on the RBBN list, we added the knowledge of which alternatives
are conceptually equivalent general Dutch words. In other words, we introduce
conceptually controlled profile information to the distance metric. A profile thus consists
of a Belgian Dutch word from the RBBN list, together with its general Dutch alternative.
Remember that the underlying distance metric is basically a City-Block distance
measure (see Formula 4). Now, we zoom in on the two- and three-dimensional visualizations
of all the pairwise profile-based distances between the subcorpora, made
by means of non-metric two-way one-mode Multidimensional Scaling (Cox and Cox
2001), as can be seen in Figure 2.⁵
5 The coordinates of a Multidimensional Scaling solution can be scaled freely, as long as the same
scaling is applied to all dimensions. Therefore, we omitted the scale on the axes, as these numbers
would not be insightful. However, we made sure that the x and y (and z for three-dimensional solutions)
axes are always scaled equally, so that the distances between the subcorpora on the different dimensions
can be interpreted.
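To illustrate the idea of a profile, the Python sketch below computes a City-Block distance between relative-frequency profiles for three lexical variables taken from Tab. 4, reading the first and fifth count columns as qnp.be.e.0 and qnp.nl.e.0. It is a simplification under stated assumptions: the concepts are averaged without weights, whereas Formula 4 includes a weighting term W that is not reproduced here, and concepts that are unattested in one subcorpus are simply skipped.

```python
def profile(counts):
    """Turn raw variant counts for one concept into relative frequencies."""
    total = sum(counts)
    return [c / total for c in counts] if total else None

def profile_distance(counts_a, counts_b):
    """City-Block distance between two profiles of the same concept."""
    p, q = profile(counts_a), profile(counts_b)
    if p is None or q is None:
        return None  # concept unattested in one subcorpus: no comparison
    return sum(abs(a - b) for a, b in zip(p, q))

# Counts (Belgian Dutch variant, general Dutch variant) from Tab. 4,
# first value pair read as qnp.be.e.0, second as qnp.nl.e.0.
concepts = {
    "leefbaar/levensvatbaar": ((9, 2), (1, 2)),
    "schoon/mooi": ((7, 153), (29, 110)),
    "hospitaal/ziekenhuis": ((0, 26), (0, 11)),
}

per_concept = {name: profile_distance(a, b) for name, (a, b) in concepts.items()}
valid = [d for d in per_concept.values() if d is not None]
aggregate = sum(valid) / len(valid)  # unweighted mean: an assumption
```

Note that the absolute frequency of a concept drops out of its profile: “ziekenhuis” is more than twice as frequent in the Belgian column, yet the hospitaal/ziekenhuis profiles are identical and contribute a distance of zero.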
Fig. 2: Linguistic distance between subcorpora (profile-based, two-dimensional)
Multidimensional Scaling is a dimension reduction technique which is applied here
to a matrix holding all the pairwise profile-based distances between the subcorpora.
Because the result of a Multidimensional Scaling analysis is a reduction of the original
input, a certain error is introduced. The error rate is captured by a “stress” value,
with 0% stress equal to no error at all. It is generally acceptable to present Multidimensional
Scaling solutions up to a stress level of 10–15%. Usually, Multidimensional
Scaling is used to return one-, two-, or three-dimensional reductions, so that visualization
is possible. With every added dimension, the error rate goes down, as the reduction
becomes less severe. The fall in error rate with added dimensions is captured in a
so-called screeplot. The screeplot in Figure 3 shows a stress difference of about 7% between
a one-dimensional and a two-dimensional Multidimensional Scaling solution.
Therefore, we first interpret the horizontal dimension (of an unrotated solution), as it
represents the most important variation in Figure 2. In this case, the profile-based approach
makes a distinction between Belgian subcorpora (black font) and Netherlandic
subcorpora (grey font) on the first dimension. The grey zero-line divides the two countries
perfectly. The vertical dimension makes a distinction between quality newspapers
(normal font) and Usenet articles (bold font). Here again, the grey zero-line marks
a perfect distinction between the two registers. Overall, there is a very clear grouping
of the subcorpora, with a clear separation of the topics only in the Belgian Usenet.
The range of Belgian register variation is also somewhat larger than the Netherlandic
range, but this probably has to do with the focus on Belgian Dutch variation in the
input features. Most importantly, however, the profile-based approach yields a visualization
that complies with our expectations of finding a national pattern first, followed
by register variation on the second dimension.
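The relation between stress and the number of dimensions can be illustrated with a toy computation in Python. This is a simplified sketch, not the procedure used for the figures: it evaluates Kruskal's stress-1 for hand-picked configurations and uses the input dissimilarities directly as disparities, whereas non-metric Multidimensional Scaling first fits disparities by monotone regression and optimizes the configuration. All numbers are made up.

```python
import math
from itertools import combinations

def stress1(points, dissimilarities):
    """Kruskal's stress-1, using the dissimilarities directly as disparities."""
    num = den = 0.0
    for i, j in combinations(range(len(points)), 2):
        d = math.dist(points[i], points[j])
        num += (d - dissimilarities[(i, j)]) ** 2
        den += d ** 2
    return math.sqrt(num / den)

# Made-up dissimilarities for three items forming a 3-4-5 right triangle.
delta = {(0, 1): 3.0, (0, 2): 4.0, (1, 2): 5.0}

# A hand-picked placement on a line (not an optimized fit) and in a plane.
one_dim = [(0.0,), (3.0,), (-4.0,)]
two_dim = [(0.0, 0.0), (3.0, 0.0), (0.0, 4.0)]

stress_1d = stress1(one_dim, delta)
stress_2d = stress1(two_dim, delta)
```

On the line, one of the three dissimilarities cannot be honoured, so the stress is above zero; in the plane the triangle fits exactly and the stress is zero. A screeplot records this kind of drop for each added dimension.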
Fig. 3: Screeplot for non-metric Multidimensional Scaling solution (profile-based)
The screeplot suggests that a three-dimensional solution might improve the quality
of the visualization by another 5 or 6%. Therefore, we calculated a three-dimensional
solution, which is represented in Figure 4.⁶ Instead of rendering a three-dimensional
plot, we drew the scatterplot of dimension 1 versus dimension 2, and the
scatterplot of dimension 1 versus dimension 3. This shows us how, even in a three-dimensional
solution, dimension 1 still divides Belgian and Netherlandic subcorpora,
and that dimension 2 divides the quality newspaper articles from Usenet. However,
this register division in the three-dimensional solution is not as neat as in the two-dimensional
solution, because one of the Netherlandic Usenet fragments crosses over
into the quadrant of the Netherlandic quality newspaper fragments. For dimension 3,
we can see a split for the topics of the Belgian subcorpora, with the economy-related
subcorpora (marked with an e) at the top left of dimension 3 and the politics fragments
at the bottom. On the Netherlandic side, the register (dimension 2) and topic (dimension
3) split is muddled. The register and topic divisions of the Belgian subcorpora,
however, are perfect for dimension 2 and dimension 3 respectively. The quality of the
grouping on the Belgian side is obviously due to the input variables, which are specifically
sensitive to Belgian Dutch variation. This indicates that the choice for a Belgian
Dutch term is not only nationally but also stylistically patterned.
Fig. 4: Linguistic distance between subcorpora (profile-based, three-dimensional)

6 Note that a two-dimensional non-metric Multidimensional Scaling solution is not a subset of a three-dimensional
non-metric Multidimensional Scaling solution. Therefore, the first two dimensions of the
three-dimensional solution of Figure 4 are not necessarily identical to the two dimensions of the two-dimensional
solution of Figure 2.
4.2 Results of the categorization method<br />
Now, we present the method and the results of the state-of-the-art categorization approach,
which uses the cosine similarity metric instead of the adapted City-Block distance
that is used in the profile-based approach.
In the current case study, we take the RBBN items (and the alternatives) as individual
features and remove the knowledge of conceptual categorization. If we calculate
the similarities (and consequent distances) with these input features between the
subcorpora in our dataset, and then produce the two-dimensional visualization with
Multidimensional Scaling, we get the plot in Figure 5. If we create a screeplot (Figure 6)
to show us how much stress difference there is between the first and the second
dimension, we see that the second dimension reduces the stress of a one-dimensional
solution by about 8%. Therefore, we will interpret the two dimensions in their own
right, knowing however that the first dimension contains more pronounced distances
than the second dimension.
Fig. 5: Linguistic distance between subcorpora (profile-based, three-dimensional)
Fig. 6: Linguistic distance between subcorpora (cosine, two-dimensional)
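For comparison with the profile-based computation, the Python sketch below applies the cosine distance to a few rows of Tab. 4, treating every word as an independent feature and ignoring which words are alternatives for the same concept. As before, the first and fifth count columns are read as qnp.be.e.0 and qnp.nl.e.0; the selection of words and the variable names are ours.

```python
import math

def cosine_distance(x, y):
    """Return 1 - cos(x, y) for two equal-length frequency vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norms = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return 1.0 - dot / norms

# Rows from Tab. 4, each word treated as an independent feature.
words = ["leefbaar", "levensvatbaar", "schoon", "mooi", "hospitaal", "ziekenhuis"]
be = [9, 2, 7, 153, 0, 26]    # column read as qnp.be.e.0
nl = [1, 2, 29, 110, 0, 11]   # column read as qnp.nl.e.0

distance = cosine_distance(be, nl)

# Share of the dot product contributed by each word.
dot = sum(a * b for a, b in zip(be, nl))
share = {w: a * b / dot for w, a, b in zip(words, be, nl)}
```

In this small snippet, the frequent word “mooi” accounts for almost the entire dot product, so the resulting distance mostly reflects how often that one word occurs rather than the choice between conceptually equivalent words.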
In Figure 6 we see on the horizontal axis (from left to right, dimension 1) a distinction
between the Usenet articles (bold font) and the quality newspaper articles
(regular font). The light grey vertical line indicates the zero-line of the horizontal dimension.
Normally, that line demarcates the boundary between two areas. Whereas
we would expect the most important variation (thus, on the horizontal dimension) to
be related to country, we encounter a distinction between registers. The vertical dimension
(from bottom to top) tends to divide Belgium (black font) from The Netherlands
(grey font), but not very clearly. The (politics) Netherlandic Usenet articles sink
below the horizontal zero-line, and the (economy) Belgian Usenet articles rise above
that line. Moreover, we notice that the topics are set apart in groups as well, except for
the quality newspapers from The Netherlands. All in all, the categorization approach
yields a somewhat unclear grouping of subcorpora and an unexpected promotion of register
variation as the most important variation in the input features.
The screeplot shows that a three-dimensional solution would reduce the stress
even more, to an almost optimal level. Therefore, we calculated a three-dimensional
solution and represent the three dimensions in Figure 7. We apply the same idea as for
the profile-based approach to plot dimension 1 and 2, and then dimension 1 and 3. Just
like in the two-dimensional solution, we see that dimension 1 divides quality newspaper
fragments from Usenet fragments, and that dimension 2 tends to divide the national
subcorpora. The three-dimensional solution does a slightly better job than the
two-dimensional solution, because the nation division on dimension 2 is now almost
correct. Dimension 3 largely divides the topics, with politics-related fragments at the
top, and economy-related fragments at the bottom. This division is almost perfect, although
the grouping of the subcorpora is not so neat. Overall, though, the categorization
method yields messier output than the profile-based approach.
Fig. 7: Screeplot for non-metric Multidimensional Scaling solution (cosine)
5 Conclusion<br />
The two main theoretical questions of this paper have been (a) how important is the
notion of a conceptual category in an aggregate study of variation in the lexicon and
(b) what is the status of conceptual categories for lexical variation? Moreover, we have
claimed that sociolectometric methodology, of which the current study is an example,
is needed to study a pluricentric language. The link with pluricentric languages, in this case
Dutch, is also made in the case study, which shows how conceptual categories and
their consequent conceptual control are necessary to reveal the national dimension in
the lexicon. In other words, the national varieties of Dutch do not differ so much in
their use of words (both Belgium and the Netherlands use different words for different
topics and registers), but they do differ in their choice of words for expressing a
conceptual category. This latter point is made clear in the case study by means of the
comparison between a profile-based onomasiological approach and a text categorization
approach. The text categorization approach grasped the mere use of individual
words and compared the use of words in two subcorpora by means of the cosine similarity
metric, which was not informed about the conceptual similarity between words.
Consequently, the text categorization showed that there was a pattern of register and
topic in the input features, stronger than the anticipated national pattern. The onomasiological
approach, on the contrary, revealed a strong national dimension in word
choice for naming a conceptual category.
Of course, in order to have an expected ranking in the variational dimensions,
and in order to compare the outcome of the aggregation approaches, the dataset had
to be manipulated so that a certain pattern could convincingly be assumed. With that
goal in mind, the variable set was taken from a reference list of Belgian Dutch, so that
national variation is built into the dataset. As such, the two aggregation approaches
could be compared by assessing how well they retrieve the national variation. It is important
to understand, though, that an actual descriptive sociolectometric study can
by no means rely on such a biased input variable set. Therefore, the results of this paper
can only be of methodological value. Given the a priori known pattern of national
variation in the dataset used in the case study, though, one might jump to the conclusion
that an onomasiological approach is better suited for finding variational patterns
in the lexicon, and the preferred method for any sociolectometric study. However, there
are a number of problems with this conclusion.
First of all, perhaps we are wrong in the assumption that national variation is the
strongest dimension in the lexical variable set and the available subcorpora; it may
well be that word use, as shown in the categorization approach, is actually
more strongly influenced by a register or topic dimension, and that the onomasiological
approach artificially weakens these dimensions.⁷ In that case, we would have
to tone down the conclusion, and say that an onomasiological approach with conceptual
control is a methodological means of revealing and boosting specific underlying
dimensions of variation. Moreover, we would like to point out that our corpus
only sampled two topics and two registers, which is not enough to support strong generalizations.
Further research is therefore needed with more topics and registers. All
this, of course, does not weaken the strength of a profile-based approach, but it rather
points out the importance of knowing what is being measured. Our claim now is that
the profile-based approach allows for much more control over what is measured than
the text categorization method, and should therefore be preferred.
Second, the onomasiological approach assumes a relation of identity of (conceptual)
meaning between the variants and this is theoretically problematic. Following
Edmonds and Hirst (2002), we agree that perfect synonymy (the highest possible level
of detail in describing a conceptual category, while still finding multiple words that fit
the category) is extremely rare. By admitting this, our notion of semantics or word
meaning follows the Cognitive Linguistic view that encyclopedic knowledge is indispensable.
Translating the idea of Peter Harder that structural categories need not be
complete, and that the abstraction goes only as far as is functional for language users
(here we link up to the prototype theory of word meaning, cf. Rosch and Mervis 1975),
we can reach near-synonymy by slightly relaxing the level of detail of the conceptual
category: not every language user has an identical representation of a word in his
head, but nonetheless two language users can communicate with that word. Idealized
Cognitive Models (Lakoff 1987) or Frames (Fillmore 1994) are examples of describing
meaning, while balancing semasiological detail and operational functionality. In future
research, we will operationalize the bottom-up creation of conceptual categories
by applying Word Space Models (Turney and Pantel 2010).

7 Although the profile-based City-Block distance incorporates a W term that brings the frequency of
the conceptual category into play.
Third, an onomasiological approach requires prior semasiological analysis to exclude
contextual nuances or polysemy. In the case study of this paper, the lemmatized
forms of the RBBN words were naively counted in the corpus, without further checking
the context of each occurrence. Closer inspection revealed that the RBBN list does not
contain many potentially polysemous items, so that we can ignore the small error that
must be present in the frequencies for the purposes of the current paper. However, as
we want to perform the above analyses in future research with a naturalistic sample of
lexical variation, instead of an a priori list of national variation, a semasiological study
for every occurrence needs to be done in order to establish the conceptual control. As
this would be an unfeasible manual task when using a large number of variables, we
will rely further on the advances being made in the field of Word Space Models to automate
this task.
To conclude this paper, we try to answer our initial questions. How important is
the notion of a conceptual category in an aggregate study of the lexicon? The case
study has shown that conceptual control is necessary to reveal variational dimensions
that are hidden in the overwhelming content (topic) function of words. Without conceptual
control, the conclusion of the categorization approach would have been that
different words are used to refer to different content, and that they may also signal
register and perhaps national differences. This observation, albeit true and undeniable,
is not the goal of an aggregation study: it is obvious that an aggregation of many
words will be sensitive to content differences among subcorpora. Therefore, conceptual
control, in the form of conceptual categories that group together similar words,
is needed. And this brings us to the second question: what is the status of conceptual
categories for lexical variation? Although practical as a methodological and heuristic
device, the conceptual categories remain somewhat artificial because of the flexibility
in their definition. In the current case study, the makers of the RBBN clearly had referential
equivalence in mind for most categories. However, conceptual categories can
be defined more strictly or less strictly at the whim of the researcher, because there is
no consensus over the appropriate level of detail in the definition, especially since the
incorporation of encyclopedic knowledge in word meaning. The level of detail that is
operational in the language community can only be retrieved by studying the actual
use of words.
And then we are back at variation.
Appendix<br />
<strong>Lexical</strong> <strong>variation</strong> <strong>in</strong> <strong>aggregate</strong> <strong>perspective</strong> 113<br />
Tab. 4: Snippet of the input data for both aggregation methods. Pairs of rows make up lexical<br />
variables.<br />
qnp.be.e.0 qnp.be.e.1 qnp.be.p.0 qnp.be.p.1 qnp.nl.e.0 qnp.nl.e.1 qnp.nl.p.0 qnp.nl.p.1 use.be.e.0 use.be.e.1 use.be.p.0 use.be.p.1 use.nl.e.0 use.nl.e.1 use.nl.p.0 use.nl.p.1<br />
leefbaar 9 3 8 11 1 0 0 0 0 1 9 4 0 0 24 18<br />
levensvatbaar 2 4 2 0 2 1 3 2 0 0 1 1 0 0 4 4<br />
hangar 0 1 0 1 0 0 1 2 0 0 1 1 0 0 1 1<br />
loods 8 6 4 18 4 11 5 2 0 0 0 2 0 1 1 6<br />
schoon 7 10 10 12 29 21 13 11 0 2 7 3 2 4 66 85<br />
mooi 153 122 114 110 110 76 53 42 42 33 73 67 52 74 449 475<br />
dagorde 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0<br />
agenda 29 26 100 90 29 21 39 24 2 1 14 14 1 1 17 33<br />
knook 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0<br />
been 13 15 43 41 39 29 14 20 10 12 14 12 21 18 76 65<br />
zever 0 0 1 0 0 0 0 0 6 2 15 15 0 0 4 14<br />
onzin 7 1 23 30 8 5 5 3 5 10 44 61 26 43 451 485<br />
draad 4 6 14 10 6 13 2 3 1 2 31 32 9 10 90 87<br />
snoer 2 0 2 1 1 5 1 1 0 0 3 1 0 0 21 28<br />
weeral 0 0 2 0 0 0 0 0 9 3 9 9 0 1 4 1<br />
alweer 19 22 32 22 21 30 11 17 5 1 21 22 12 9 98 98<br />
fel 27 23 33 35 17 19 31 42 6 1 5 10 0 1 19 31<br />
erg 331 268 208 217 117 112 76 68 21 36 143 131 99 94 830 835<br />
strop 4 2 1 3 26 18 4 3 0 0 1 0 0 0 3 3<br />
strik 1 2 2 3 5 6 1 0 0 0 2 0 0 2 1 2<br />
verdiep 2 1 4 3 8 2 4 11 0 0 2 3 3 4 20 26<br />
verdieping 0 6 6 7 5 4 10 11 0 0 1 0 0 0 12 10<br />
stamp 6 2 9 5 5 1 0 2 1 0 5 5 0 0 11 10<br />
duw 27 16 42 34 20 25 13 16 1 1 13 8 0 5 27 28<br />
spaarzaam 0 1 0 1 2 2 1 2 0 0 0 0 0 0 1 0<br />
zuinig 3 10 5 12 18 21 4 1 0 0 2 3 0 0 10 13<br />
hospitaal 0 4 4 3 0 0 0 0 0 0 1 1 0 0 0 2<br />
ziekenhuis 26 34 82 60 11 40 11 11 0 1 15 15 0 2 61 92<br />
micro 1 1 2 3 0 0 0 0 0 1 0 0 1 1 2 1<br />
microfoon 1 1 2 10 2 3 3 7 0 0 0 0 0 0 34 28<br />
buis 7 2 2 1 4 1 6 3 0 0 2 1 0 0 18 12<br />
onvoldoende 57 56 38 60 36 29 18 28 4 4 2 7 3 8 23 23<br />
toelage 3 2 3 2 2 5 0 1 0 0 5 0 0 0 1 1<br />
subsidie 33 41 13 15 35 22 29 49 1 0 14 15 2 4 122 137<br />
woonst 1 2 3 3 0 0 0 0 0 0 1 1 0 0 0 0<br />
woning 47 60 45 54 47 70 2 21 17 15 8 9 23 17 54 91<br />
uitbater 13 11 3 8 1 1 2 4 0 0 3 2 0 0 6 4<br />
exploitant 2 2 2 2 15 13 3 5 0 0 0 0 0 0 1 1<br />
tussenkomst 19 8 17 13 3 3 0 1 1 2 0 1 2 2 0 6<br />
bijdrage 40 64 23 23 37 25 34 30 3 9 6 16 14 26 90 80<br />
tegenstrever 1 1 6 8 2 1 0 1 0 0 0 1 0 0 0 0<br />
tegenstander 24 19 70 77 16 17 38 32 0 0 18 16 5 5 63 64<br />
aanvang 5 5 3 3 7 8 2 2 0 0 1 3 1 2 3 4<br />
begin 635 550 499 507 637 554 322 341 78 71 139 201 100 102 706 712<br />
aanduiding 7 3 6 4 1 1 1 0 1 1 2 5 1 1 5 4<br />
benoeming 34 14 19 17 46 22 35 43 0 0 7 5 3 2 16 10<br />
tevergeefs 8 2 12 7 10 7 7 5 2 0 1 2 0 1 3 4<br />
vergeefs 2 0 0 2 3 7 4 14 0 0 0 4 0 0 0 4<br />
tewerkstelling 8 7 4 16 0 0 0 0 0 0 4 0 0 0 0 0<br />
werkgelegenheid 79 80 17 24 25 16 7 5 0 0 4 6 7 5 13 27<br />
zetel 42 61 91 62 25 23 42 43 1 0 34 32 1 1 193 195<br />
fauteuil 0 0 3 0 0 0 0 3 0 0 0 0 0 0 0 0<br />
verslaggever 11 10 29 43 3 1 8 5 0 0 0 0 0 0 21 28<br />
rapporteur 1 1 9 5 0 0 2 0 0 0 1 0 0 0 0 1<br />
verlieslatend 10 6 1 0 1 2 0 0 0 1 0 0 0 0 0 0<br />
verliesgevend 1 0 0 0 31 14 9 9 0 0 0 0 1 3 4 6<br />
vermits 4 5 1 4 0 0 0 0 19 12 16 20 0 0 1 2<br />
aangezien 95 81 32 43 24 28 2 3 33 25 45 36 33 26 161 148<br />
universitair 10 5 7 30 2 1 4 6 2 0 1 2 0 0 5 5<br />
academicus 6 1 13 9 2 0 1 2 0 0 1 1 0 0 4 6<br />
vaststelling 30 27 42 44 4 3 1 4 0 0 5 10 2 1 6 6<br />
constatering 1 0 0 1 15 6 0 4 0 0 1 0 1 2 11 12<br />
verhoog 184 178 25 38 107 112 36 34 8 11 12 12 23 22 39 41<br />
podium 1 1 20 25 3 2 4 7 0 0 4 1 0 0 7 5<br />
wedde 2 6 2 5 0 0 0 0 0 0 1 1 0 0 2 1<br />
salaris 13 13 1 0 96 83 25 26 0 0 3 0 6 4 49 44<br />
objectief 21 25 19 18 8 10 4 7 2 4 22 27 5 4 64 42<br />
doel 66 67 57 112 80 91 63 63 7 11 35 33 24 30 198 174<br />
nakend 9 15 12 10 1 1 0 1 0 1 3 1 1 1 0 0<br />
nabij 35 33 27 40 11 13 8 8 3 9 2 2 3 6 19 16<br />
nijverheid 18 14 1 0 0 0 0 0 0 0 0 1 0 0 0 0<br />
industrie 75 65 22 32 25 26 37 29 1 0 11 8 6 4 40 39<br />
inbreuk 21 25 6 17 3 2 1 3 0 1 4 3 1 0 8 5<br />
overtreding 15 14 25 40 6 8 4 9 1 0 9 10 2 2 12 26<br />
job 141 140 59 78 2 0 0 1 4 6 21 16 0 2 4 9<br />
baan 133 122 31 39 150 117 111 78 4 5 11 13 9 6 139 117<br />
maximum 10 12 4 4 6 19 2 6 12 6 6 4 11 16 29 21<br />
maximaal 47 35 25 30 79 76 20 16 21 11 5 7 35 36 38 39<br />
m<strong>in</strong>imum 26 20 8 14 14 11 12 10 13 13 17 15 8 5 20 22<br />
m<strong>in</strong>imaal 28 19 15 25 73 59 19 28 6 3 2 5 37 28 62 46<br />
merkwaardig 19 14 30 37 7 15 4 4 1 0 2 0 0 0 48 28<br />
opmerkelijk 47 52 66 57 67 56 20 20 2 0 6 4 1 0 28 11<br />
effectief 36 34 35 36 45 59 11 20 8 8 24 15 13 12 51 57<br />
daadwerkelijk 19 16 21 13 59 54 24 21 1 1 4 1 11 9 49 55<br />
stock 12 12 2 3 6 0 0 1 45 40 0 0 34 25 0 1<br />
voorraad 65 40 13 3 27 25 4 9 4 0 0 1 19 25 7 18<br />
stilaan 48 49 57 53 1 2 0 0 2 3 6 6 3 0 1 2<br />
langzamerhand 2 4 1 3 30 27 3 13 0 0 0 3 0 0 29 32<br />
serieus 24 20 40 16 41 32 56 53 30 27 63 56 40 29 196 197<br />
ernstig 72 52 101 88 31 24 23 28 3 1 27 37 4 3 94 119<br />
politieker 0 0 0 0 0 0 0 0 0 1 18 14 0 0 13 8<br />
politicus 48 81 321 275 52 37 47 58 1 2 89 93 7 6 289 221<br />
gerechtshof 2 3 4 2 17 16 9 7 0 0 2 1 1 0 3 13<br />
rechtbank 122 112 61 70 15 27 9 13 1 4 11 21 2 2 52 64<br />
prof 1 2 3 3 1 2 0 0 1 1 5 5 0 3 8 6<br />
professor 39 33 70 72 3 8 6 6 0 0 9 3 7 3 27 36<br />
fout 74 84 154 158 51 65 25 43 38 17 92 74 87 75 326 299<br />
overtreding 15 14 25 40 6 8 4 9 1 0 9 10 2 2 12 26<br />
publiciteit 9 6 5 6 16 18 9 11 0 0 4 5 2 1 17 14<br />
reclame 60 45 17 32 21 21 15 12 11 5 18 11 30 43 46 51<br />
proper 8 10 14 20 0 0 0 0 3 5 0 3 2 2 1 4<br />
schoon 7 10 10 12 29 21 13 11 0 2 7 3 2 4 66 85<br />
fier 1 4 15 13 1 4 0 1 3 0 5 6 0 0 1 1<br />
trots 15 19 25 25 22 32 11 16 2 0 9 9 2 3 69 63<br />
schepen 11 14 49 24 7 4 2 1 0 0 11 3 0 0 4 1<br />
wethouder 0 0 1 4 9 13 11 14 0 0 2 2 0 0 22 22<br />
schrijvelaar 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0<br />
Rekenhof 12 15 6 11 0 0 0 0 0 0 1 0 0 0 0 0<br />
Rekenkamer 6 7 10 3 17 33 4 65 0 0 0 0 0 0 0 1<br />
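As an illustration of how the rows of Tab. 4 can be read, the following Python sketch parses two rows copied from the table, treats them as one lexical variable, and compares the relative frequencies of the two variants in two subcorpora. The function names (parse_rows, profile, city_block) are hypothetical, and the use of a city-block distance between relative-frequency profiles is an assumption made for this example only; it is not necessarily the measure used in the case study.

```python
# Column labels copied from Tab. 4; each row holds one word and 16 counts.
COLUMNS = [
    "qnp.be.e.0", "qnp.be.e.1", "qnp.be.p.0", "qnp.be.p.1",
    "qnp.nl.e.0", "qnp.nl.e.1", "qnp.nl.p.0", "qnp.nl.p.1",
    "use.be.e.0", "use.be.e.1", "use.be.p.0", "use.be.p.1",
    "use.nl.e.0", "use.nl.e.1", "use.nl.p.0", "use.nl.p.1",
]

# One lexical variable (a pair of rows) taken verbatim from Tab. 4.
ROWS = """\
hospitaal 0 4 4 3 0 0 0 0 0 0 1 1 0 0 0 2
ziekenhuis 26 34 82 60 11 40 11 11 0 1 15 15 0 2 61 92
"""


def parse_rows(text):
    """Map each word to a dict {column label: absolute frequency}."""
    table = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        word, counts = fields[0], [int(x) for x in fields[1:]]
        if len(counts) != len(COLUMNS):
            raise ValueError(f"expected {len(COLUMNS)} counts for {word!r}")
        table[word] = dict(zip(COLUMNS, counts))
    return table


def profile(table, words, column):
    """Relative frequencies of the variants in one subcorpus.

    Returns None when none of the variants occurs in that subcorpus.
    """
    counts = [table[w][column] for w in words]
    total = sum(counts)
    if total == 0:
        return None
    return [c / total for c in counts]


def city_block(p, q):
    """City-block (L1) distance between two profiles (an assumed measure)."""
    return sum(abs(a - b) for a, b in zip(p, q))


if __name__ == "__main__":
    table = parse_rows(ROWS)
    variable = ["hospitaal", "ziekenhuis"]
    p = profile(table, variable, "qnp.be.e.1")
    q = profile(table, variable, "qnp.nl.e.0")
    print(p, q, city_block(p, q))
```

For this pair of rows, hospitaal accounts for 4 of the 38 tokens in qnp.be.e.1 and for 0 of the 11 tokens in qnp.nl.e.0, so the sketch yields a distance of 8/38, roughly 0.21. Subcorpora in which neither variant occurs (for example use.be.e.0 here) give no profile and would have to be skipped or handled separately.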
References<br />
Auer, Peter. 2005. Europe’s sociolinguistic unity, or: A typology of European dialect/standard<br />
constellations. In Nicole Delbecque, Johan van der Auwera & Dirk Geeraerts (eds.), Perspectives<br />
on variation, 7–42. Berlin and New York: Mouton de Gruyter.<br />
Baeza-Yates, Ricardo and Berthier Ribeiro-Neto. 1999. Modern information retrieval. New York:<br />
ACM Press & Addison-Wesley.<br />
Bickerton, Derek. 1971. Inherent variability and variable rules. Foundations of Language and Cognitive<br />
Processes 7(4). 457–492.<br />
Bouma, Gerlof, Gertjan van Noord, and Rob Malouf. 2001. Alpino: wide-coverage computational<br />
analysis of Dutch. In Walter Daelemans, K. Sima’an, J. Veenstra & J. Zavrel (eds.), Computational<br />
Linguistics in the Netherlands 2000, 45–59. Amsterdam: Rodopi.<br />
Clyne, Michael. 1992. Pluricentric languages: Differing norms in different nations. Berlin and New<br />
York: Mouton de Gruyter.<br />
Cox, Trevor and Michael Cox. 2001. Multidimensional scaling. London and New York: Chapman<br />
and Hall.<br />
Edmonds, Philip and Graeme Hirst. 2002. Near-synonymy and lexical choice. Computational Linguistics<br />
28(2). 105–144.<br />
Fillmore, Charles. 1994. Starting where dictionaries stop: the challenge of corpus lexicography.<br />
In Beryl T. Sue Atkins & Antonio Zampolli (eds.), Computational approaches to the lexicon,<br />
349–393. Oxford: Oxford University Press.<br />
Geeraerts, Dirk. 2009. Lexical variation in space. In Jürgen Erich Schmidt & Peter Auer (eds.),<br />
Language and space I: Theories and methods, 821–837. Berlin and New York: Mouton de<br />
Gruyter.<br />
Geeraerts, Dirk. 2010. Schmidt redux: How systematic is the linguistic system if variation is rampant?<br />
In Kasper Boye & Elisabeth Engberg-Pedersen (eds.), Language usage and language<br />
structure, 237–262. Berlin and New York: Mouton de Gruyter.<br />
Geeraerts, Dirk, Stefan Grondelaers and Dirk Speelman. 1999. Convergentie en divergentie in<br />
de Nederlandse woordenschat. Een onderzoek naar kleding- en voetbaltermen. Amsterdam:<br />
Meertens Instituut.<br />
Geeraerts, Dirk, Gitte Kristiansen, and Yves Peirsman (eds.). 2010. Advances in Cognitive Sociolinguistics.<br />
Berlin and New York: Mouton de Gruyter.<br />
Goebl, Hans. 1975. Dialektometrie. Grazer linguistische Studien. 32–38.<br />
Grieve, Jack, Dirk Speelman, and Dirk Geeraerts. 2011. A statistical method for the identification<br />
and aggregation of regional linguistic variation. Language Variation and Change 23. 193–221.<br />
Harder, Peter. 2010. Meaning in mind and society: A functional contribution to the social turn in<br />
Cognitive Linguistics. Berlin and New York: Mouton de Gruyter.<br />
Impe, Leen, Dirk Geeraerts, and Dirk Speelman. 2008. Mutual intelligibility of standard and regional<br />
Dutch language varieties. International Journal of Humanities and Arts Computing 2.<br />
101–117.<br />
Kristiansen, Gitte and René Dirven (eds.). 2008. Cognitive Sociolinguistics: Language variation,<br />
cultural models, social systems. Berlin and New York: Mouton de Gruyter.<br />
Labov, William. 1966. The social stratification of English in New York City. Washington, D.C.: Center<br />
for Applied Linguistics.<br />
Lakoff, George. 1987. Women, fire and dangerous things: What categories reveal about the mind.<br />
Chicago: University of Chicago Press.<br />
Martin, Willy. 2005. Het Belgisch-Nederlands anders bekeken: het Referentiebestand<br />
Belgisch-Nederlands (RBBN). Technical report. Amsterdam: Vrije Universiteit Amsterdam.<br />
Nerbonne, John and William Kretzschmar. 2003. Introducing computational techniques in Dialectometry.<br />
Computers and the Humanities 37. 245–255.<br />
Rosch, Eleanor and Carolyne Mervis. 1975. Family resemblances: Studies in the internal structure<br />
of categories. Cognitive Psychology 7(4). 573–605.<br />
Séguy, Jean. 1971. La relation entre la distance spatiale et la distance lexicale. Revue de Linguistique<br />
Romane 35. 335–357.<br />
Speelman, Dirk, Stefan Grondelaers, and Dirk Geeraerts. 2003. Profile-based linguistic uniformity<br />
as a generic method for comparing language varieties. Computers and the Humanities 37.<br />
317–337.<br />
Szmrecsanyi, Benedikt. 2010. The English genitive alternation in a cognitive sociolinguistics perspective.<br />
In Dirk Geeraerts, Gitte Kristiansen & Yves Peirsman (eds.), Advances in Cognitive<br />
Sociolinguistics, 141–166. Berlin and New York: Mouton de Gruyter.<br />
Turney, Peter and Patrick Pantel. 2010. From frequency to meaning: vector space models of semantics.<br />
Journal of Artificial Intelligence Research 37. 141–188.