03.09.2013 Views

Lexical variation in aggregate perspective



ulation. We use the p-value from the Fisher Exact test as a filter for DCB. We set the dissimilarity between subcorpora at zero if p > 0.05, and we use DCB if p < 0.05.²

If we now apply this step to the fictional data from Tables 1 and 2, we must first calculate the Fisher Exact p-value for every concept, verifying whether the absolute frequencies for American and British English are sampled from different populations. For SUBTERRANEAN PUBLIC TRANSPORT, the p-value is much smaller than 0.05, so we can accept that British English is different from American English when it comes to this concept. Therefore, we calculate the City-Block distance by means of Equation 5 for SUBTERRANEAN PUBLIC TRANSPORT. Filling in the equation, we get 0.5 × [(|0.875 – 0.286|) + (|0.125 – 0.714|)] = 0.589. For the concept of a SMALL INSTRUMENT PLAYED WITH A BOW, we find a p-value for the Fisher Exact test larger than 0.05, so we can say that British English is, statistically speaking, not a different population than American English. Therefore, we can set the distance between these varieties for this concept at zero.
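The filtering step can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation. It assumes that each concept has exactly two variants, that the frequencies 70/10 and 50/40 belong to one variety and 20/50 and 30/35 to the other (the tables themselves are not reproduced here), and that the two-sided p-value is the sum of the probabilities of all tables no more likely than the observed one. The function names are hypothetical.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher Exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell,
        # with all row and column totals held fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    # Sum over all tables that are at most as likely as the observed one
    # (small tolerance for floating-point ties).
    return sum(p for p in map(prob, range(lowest, highest + 1))
               if p <= p_obs * (1 + 1e-7))


def city_block(freqs_a, freqs_b):
    """Half the summed absolute differences between relative frequencies."""
    total_a, total_b = sum(freqs_a), sum(freqs_b)
    return 0.5 * sum(abs(fa / total_a - fb / total_b)
                     for fa, fb in zip(freqs_a, freqs_b))


def filtered_distance(freqs_a, freqs_b, alpha=0.05):
    """City-Block distance if the Fisher test is significant, else zero.

    Assumption: the boundary case p == alpha is treated as not significant.
    """
    (a, b), (c, d) = freqs_a, freqs_b
    p = fisher_exact_two_sided(a, b, c, d)
    return city_block(freqs_a, freqs_b) if p < alpha else 0.0


# Fictional frequencies from the worked example (assignment to varieties assumed).
print(round(filtered_distance([70, 10], [20, 50]), 3))  # 0.589
print(filtered_distance([50, 40], [30, 35]))            # 0.0
```

For concepts with more than two variants, the 2x2 test above would have to be replaced by a test for larger tables, which this sketch does not cover.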

To calculate the dissimilarity between subcorpora on the basis of many profiles, we sum the dissimilarities for the individual profiles. In other words, given a set of profiles L1 to Lm, the global dissimilarity D between two subcorpora V1 and V2 on the basis of L1 up to Lm can be calculated as:

$$
D_{CB}(V_1, V_2) = \sum_{i=1}^{m} \bigl( D_{CB,L_i}(V_1, V_2)\, W(L_i) \bigr) \tag{4}
$$

The W in the formula is a weighting factor. We use weights to ensure that concepts which have a relatively higher frequency (summed over the size of the two subcorpora that are being compared)³ also have a greater impact on the distance measurement. In other words, in the case of a weighted calculation, concepts that are more common in everyday life and language are treated as more important. Applying this to the fictional example from Table 1, we can calculate the W per concept by dividing the sum of the absolute frequencies of all variants for one concept by the sum of the absolute frequencies of all variants of all concepts. For SUBTERRANEAN PUBLIC TRANSPORT this equals (70 + 10 + 20 + 50)/(70 + 10 + 20 + 50 + 50 + 40 + 30 + 35) = 0.492. There is no need to calculate the W for SMALL INSTRUMENT PLAYED WITH A BOW, as its distance is already set to zero. Filling in Equation 4, we find that the distance between British English and American English aggregated over both concepts is (0.589 × 0.492) + 0 = 0.29.
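The weighted aggregation can be checked with a short Python sketch. It is only an illustration under stated assumptions: the per-concept distances (0.589 and 0) are taken directly from the worked example, the grouping of the frequencies per variety is assumed, and the names `profile_weights` and `aggregate_distance` are hypothetical.

```python
def profile_weights(profiles):
    """Weight per concept: its total frequency divided by the grand total."""
    totals = {name: sum(freqs_a) + sum(freqs_b)
              for name, (freqs_a, freqs_b) in profiles.items()}
    grand_total = sum(totals.values())
    return {name: total / grand_total for name, total in totals.items()}


def aggregate_distance(distances, weights):
    """Sum of the per-concept distances, each multiplied by its weight."""
    return sum(distances[name] * weights[name] for name in distances)


# Fictional absolute frequencies per concept: (variety 1, variety 2).
profiles = {
    "SUBTERRANEAN PUBLIC TRANSPORT": ([70, 10], [20, 50]),
    "SMALL INSTRUMENT PLAYED WITH A BOW": ([50, 40], [30, 35]),
}

# Per-concept distances after the Fisher Exact filter, as given in the text.
distances = {
    "SUBTERRANEAN PUBLIC TRANSPORT": 0.589,
    "SMALL INSTRUMENT PLAYED WITH A BOW": 0.0,
}

weights = profile_weights(profiles)
print(round(weights["SUBTERRANEAN PUBLIC TRANSPORT"], 3))   # 0.492
print(round(aggregate_distance(distances, weights), 2))     # 0.29
```

Note that the weights of all concepts sum to one, so a concept whose distance was filtered to zero still lowers the aggregate distance relative to the unfiltered concepts.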

Now, we contrast text categorization with the profile-based approach, which incorporates probabilistic information about word choice. In text categorization, non-categorical (probabilistic) word choice is well accounted for (unlike dialectometric ap-

2 If the frequency of the profile was lower than 30 in the two varieties that are being compared, that profile was excluded from the comparison.

3 The size of the two subcorpora is not the actual number of words in the two subcorpora, but the sum of all profiles in these two subcorpora with a frequency higher than 30.
