Lexical variation in aggregate perspective
a large set of (lexical) variables, the individual patterns of the variables thus need to be aggregated. Aggregation of many features is already applied in e.g. dialectometry and text categorization. However, we find problems in both dialectometry and text categorization when it comes to dealing with lexical variation.
In dialectometry (Séguy 1971; Goebl 1975; Nerbonne and Kretzschmar 2003), lexical variation is almost always considered to be categorical per location (an exception is e.g. Grieve et al. 2011): a certain location – or at best a single interviewee per location – is attributed either the use of word a or the use of word b. This categorical approach is mainly due to the type of input data used in most dialectometric studies, i.e. a lexical dialect atlas. Dialect atlases were painstakingly constructed in earlier years by dialectologists who visited locations pertinent to their purposes and accumulated data through interviews and questionnaires. Categorical word choices per location were a necessary (but no longer acceptable) methodological decision. Because dialectometric methodology is tailored to the categorical dialect atlas input format, its quantitative aggregation methods cannot straightforwardly be applied to corpus-driven input, where lexical variation is a probabilistic matter.
Unlike the methods of dialectometry, an aggregation method that incorporates probabilistic word preferences in an onomasiological approach was introduced in Geeraerts et al. (1999) and further formalized in Speelman et al. (2003). This so-called profile-based approach – where “profile” stands for the (relative frequencies of a) set of words in a conceptual category – is formally introduced below. The rationale of the method is, as with most aggregation methods, to measure the “distance” between pairs of subcorpora on the basis of their probabilistic overlap in onomasiological word preferences for expressing an underlying conceptual category. A small distance between subcorpora implies a general agreement in word choice, whereas a large distance implies a general disagreement in word choice.
Profile-based distances between subcorpora are calculated by means of the following method. Given two subcorpora $V_1$ and $V_2$, a conceptual category $L$ (e.g. SUBTERRANEAN PUBLIC TRANSPORT) and $x_1$ to $x_n$ the exhaustive list of variants (e.g. {subway, underground}) as the profile, we refer to the absolute frequency $F$ of the usage of $x_1$ for $L$ in $V_j$ with:¹
$$F_{V_j,L}(x_1) \tag{1}$$
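The notation in (1) can be read as a simple count over corpus observations. The following minimal Python sketch illustrates this, assuming that each token has already been annotated with its subcorpus, its concept and the variant used. The function names `count_frequencies` and `absolute_frequency`, the data layout and the observations themselves are hypothetical and serve only as illustration.

```python
from collections import Counter, defaultdict

# Invented observations, one (subcorpus, concept, variant) triple per token.
# These are NOT data from the study; they only illustrate the notation.
observations = [
    ("V1", "SUBTERRANEAN PUBLIC TRANSPORT", "subway"),
    ("V1", "SUBTERRANEAN PUBLIC TRANSPORT", "subway"),
    ("V1", "SUBTERRANEAN PUBLIC TRANSPORT", "underground"),
    ("V2", "SUBTERRANEAN PUBLIC TRANSPORT", "underground"),
    ("V2", "SUBTERRANEAN PUBLIC TRANSPORT", "underground"),
]


def count_frequencies(observations):
    """Map each (subcorpus, concept) pair to a Counter of variant counts."""
    freq = defaultdict(Counter)
    for subcorpus, concept, variant in observations:
        freq[(subcorpus, concept)][variant] += 1
    return freq


def absolute_frequency(freq, subcorpus, concept, variant):
    """F_{V_j,L}(x_i): how often `variant` expresses `concept` in `subcorpus`.

    Unattested combinations yield 0 rather than raising an error.
    """
    return freq.get((subcorpus, concept), Counter())[variant]


freq = count_frequencies(observations)
f_v1_subway = absolute_frequency(
    freq, "V1", "SUBTERRANEAN PUBLIC TRANSPORT", "subway"
)
```

Keying the counts by the pair (subcorpus, concept) mirrors the double subscript of $F_{V_j,L}$; other layouts, such as a table with one row per variant, would serve equally well.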
To make this methodological explanation more tangible, we provide a fictional example on the basis of the absolute frequencies for two concepts, SUBTERRANEAN PUBLIC TRANSPORT and SMALL INSTRUMENT PLAYED WITH A BOW, as used in American and British English, cf. Table 1.
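For readers who prefer code to notation, the step from absolute frequencies to a distance can be sketched as follows. The sketch assumes that the City-Block distance between two profiles is the plain sum of absolute differences between the relative frequencies of each variant; any weighting or normalization in the formal definition is not reproduced here. The counts are invented and are not the values of Table 1, and the function names `relative_profile` and `city_block_distance` are hypothetical.

```python
def relative_profile(counts):
    """Turn absolute variant counts for one concept into relative frequencies."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("profile has no observations")
    return {variant: n / total for variant, n in counts.items()}


def city_block_distance(counts_a, counts_b):
    """Sum of absolute differences between two relative-frequency profiles.

    Assumption: unweighted and unnormalized, so values range from 0
    (identical preferences) to 2 (no shared variants).
    """
    profile_a = relative_profile(counts_a)
    profile_b = relative_profile(counts_b)
    variants = set(profile_a) | set(profile_b)
    return sum(
        abs(profile_a.get(v, 0.0) - profile_b.get(v, 0.0)) for v in variants
    )


# Invented counts for SUBTERRANEAN PUBLIC TRANSPORT in two subcorpora.
american = {"subway": 90, "underground": 10}
british = {"subway": 20, "underground": 80}

distance = city_block_distance(american, british)  # roughly 1.4 here
```

With these invented counts the two subcorpora prefer different variants, so the distance is large; identical relative preferences would give a distance of 0 regardless of corpus size, because only relative frequencies enter the comparison.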
¹ The following introduction to the City-Block distance method is based on Speelman et al. (2003: Section 2.2).