Collocation Collocations & colligations Defining ... - Indiana University


Corpus Linguistics (L615)
Application #2: Collocations
Markus Dickinson
Department of Linguistics, Indiana University
Spring 2013

Lexicography

Corpora for lexicography:
◮ Can extract authentic & typical examples, with frequency information
◮ With sociolinguistic meta-data, can get an accurate description of usage and, with monitor corpora, its change over time
◮ Can complement intuitions about meanings

The study of loanwords, for example, can be bolstered by corpus studies.

Lexical studies: Collocation

Collocations are characteristic co-occurrence patterns of two (or more) lexical items:
◮ They tend to occur with greater than random chance
◮ The meaning tends to be more than the sum of its parts

These are extremely hard to define by intuition:
◮ Pro: Corpora have been able to reveal connections previously unseen
◮ Con: It may not be clear what the theoretical basis of collocations is
◮ Pro & con: how do they fit into grammar?

Collocations & colligations

A colligation is a slightly different concept:
◮ the collocation of a node word with a particular class of words (e.g., determiners)

Colligations often create "noise" in a list of collocations:
◮ e.g., this house, because this is so common on its own and determiners appear before nouns
◮ Thus, people sometimes use stop words to filter out non-collocations

Defining a collocation

"People disagree on collocations":
◮ Intuition does not seem to be a completely reliable way to figure out what a collocation is
◮ Many collocations are overlooked: people notice unusual words & structures, but not ordinary ones
◮ There is some notion that they are more than the sum of their parts

What your collocations are depends on exactly how you calculate them.

What a collocation is

Collocations are expressions of two or more words that are in some sense conventionalized as a group:
◮ strong tea (cf. ??powerful tea)
◮ international best practice
◮ kick the bucket

Importance of the context: "You shall know a word by the company it keeps" (Firth 1957)
◮ There are lexical properties that more general syntactic properties do not capture

So, how do we practically define a collocation? ...

(This slide and the next three are adapted from Manning and Schütze (1999), Foundations of Statistical Natural Language Processing.)
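The stop-word filtering mentioned under colligations can be sketched in a few lines. This is a hypothetical illustration, not the course's code; the toy stop list and sentence are invented:

```python
from collections import Counter

# Illustrative stop-word list; a real filter would use a much longer one.
STOP_WORDS = {"the", "a", "an", "of", "in", "to", "on", "and", "this", "that"}

def candidate_bigrams(tokens):
    """Count adjacent word pairs, skipping any pair that contains a stop word."""
    pairs = zip(tokens, tokens[1:])
    return Counter(p for p in pairs
                   if p[0] not in STOP_WORDS and p[1] not in STOP_WORDS)

tokens = "we drank strong tea in the house and more strong tea".split()
print(candidate_bigrams(tokens).most_common(1))  # [(('strong', 'tea'), 2)]
```

Note that a colligational pair like the house never surfaces as a candidate, exactly because the is on the stop list.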


Prototypical collocations

Prototypically, collocations meet the following criteria:
◮ Non-compositional: the meaning of kick the bucket is not composed of the meaning of its parts
◮ Non-substitutable: orange hair is just as accurate as red hair, but we don't say it
◮ Non-modifiable: often we cannot modify a collocation, even though we normally could modify one of those words: ??kick the red bucket

Compositionality tests

The previous properties are good tests, but hard to verify with corpus data.

(At least) two tests we can use with corpora:
◮ Is the collocation translated word-by-word into another language?
  ◮ e.g., the collocation make a decision is not translated literally into French
◮ Do the two words co-occur more frequently together than we would otherwise expect?
  ◮ e.g., of the is frequent, but both words are frequent, so we might expect this

Kinds of Collocations

Collocations come in different guises:
◮ Light verbs: verbs that convey very little meaning but must be the right one:
  ◮ make a decision vs. *take a decision, take a walk vs. *make a walk
◮ Phrasal verbs: a main verb and particle combination, often translated as a single word:
  ◮ to tell off, to call up
◮ Proper nouns: slightly different than the others, but each refers to a single idea (e.g., Brooks Brothers)
◮ Terminological expressions: technical terms that form a unit (e.g., hydraulic oil filter)

Semantic prosody & preference

Semantic prosody = "a form of meaning which is established through the proximity of a consistent series of collocates" (Louw 2000)
◮ Idea: you can tell the semantic prosody of a word by the types of words it frequently co-occurs with
◮ These are typically negative: e.g., peddle, ripe for, get oneself verbed
◮ This type of co-occurrence often leads to general semantic preferences
  ◮ e.g., utterly, totally, etc. typically have a feature of 'absence or change of state'

Collocation: from silly ass to lexical sets (Krishnamurthy 2000)

Firth 1957: "You shall know a word by the company it keeps"
◮ Collocational meaning is a syntagmatic type of meaning, not a conceptual one
◮ e.g., in this framework, one of the meanings of night is the fact that it co-occurs with dark

Example: ass is associated with a particular set of adjectives (think of goose if you prefer):
◮ silly, obstinate, stupid, awful
◮ We can see a lexical set associated with this word

Lexical sets & collocations vary across genres, subcorpora, etc.

Notes on a collocation's definition (Krishnamurthy 2000)

We often look for words which are adjacent to make up a collocation, but this is not always true:
◮ e.g., computers run, but these two words may only be in the same proximity

We can also speak of upward/downward collocations:
◮ downward: involves a more frequent node word A with a less frequent collocate B
◮ upward: a weaker relationship, tending to be more of a grammatical property
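The non-adjacency point (computers ... run) is why collocation tools usually count co-occurrence within a window around the node word rather than strict bigrams. A hypothetical sketch, with the window size and toy sentence my own:

```python
from collections import Counter

def window_collocates(tokens, node, window=4):
    """Count words occurring within `window` tokens of each occurrence of `node`,
    skipping the node word itself."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            counts.update(t for t in tokens[lo:hi] if t != node)
    return counts

tokens = "the computers in the lab run quickly and the computers never crash".split()
print(window_collocates(tokens, "computers")["run"])  # 2
```

Here run is picked up as a collocate of computers even though the two words are never adjacent.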


Corpus linguistics (Krishnamurthy 2000)

Where collocations fit into corpus linguistics:
1. Pattern recognition: recognize lexical and grammatical units
2. Frequency list generation: rank words
3. Concordancing: observe word behavior
4. Collocations: take concordancing a step further ...

Calculating collocations

Simplest approach: use frequency counts
◮ Two words appearing together a lot are a collocation

The problem is that we get lots of uninteresting pairs of function words (M&S 1999, table 5.1):

C(w1, w2)   w1   w2
80871       of   the
58841       in   the
26430       to   the
21842       on   the

(The slides from here on are based on Manning & Schütze (M&S) 1999.)

POS filtering

To remove frequent pairings which are uninteresting, we can use a POS filter (Justeson and Katz 1995):
◮ only examine word sequences which fit a particular part-of-speech pattern:
  A N, N N, A A N, A N N, N A N, N N N, N P N

  A N     linear function
  N A N   mean squared error
  N P N   degrees of freedom

◮ Crucially, all other sequences are removed:

  P D     of the
  V V     has been

POS filtering (2)

Some results after tag filtering (M&S 1999, table 5.3):

C(w1, w2)   w1       w2        Tag pattern
11487       New      York      A N
7261        United   States    A N
5412        Los      Angeles   N N
3301        last     year      A N

⇒ Fairly simple, but surprisingly effective
◮ Needs to be refined to handle verb-particle collocations
◮ Kind of inconvenient to write out the patterns you want

Determining strength of collocation

We want to compare the likelihood of two words next to each other being a chance event vs. being a surprise:
◮ Do the two words appear next to each other more than we might expect, based on what we know about their individual frequencies?
◮ Is this an accidental pairing or not?
◮ The more data we have, the more confident we will be in our assessment of whether something is a collocation or not

We'll look at bigrams, but the techniques work for words within five words of each other, translation pairs, phrases, etc.

(Pointwise) Mutual Information

One way to see if two words are strongly connected is to compare:
◮ the probability of the two words appearing together if they are independent (p(w1)p(w2))
◮ the actual probability of the two words appearing together (p(w1 w2))

The pointwise mutual information is a measure that does this:

(1) I(w1, w2) = log [ p(w1 w2) / (p(w1) p(w2)) ]
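The frequency-count and POS-filter steps above can be sketched together. The toy tagged corpus and the single-letter tags are invented for illustration, and only two of the Justeson & Katz patterns are included:

```python
from collections import Counter

# Allowed tag patterns for bigrams (a subset of the slide's list).
ALLOWED = {("A", "N"), ("N", "N")}

def filtered_bigrams(tagged):
    """Count adjacent word pairs whose tag sequence matches an allowed pattern."""
    pairs = zip(tagged, tagged[1:])
    return Counter((w1, w2) for (w1, t1), (w2, t2) in pairs
                   if (t1, t2) in ALLOWED)

# Toy (word, tag) corpus; A=adjective, N=noun, P=preposition, D=determiner, V=verb
tagged = [("of", "P"), ("the", "D"), ("linear", "A"), ("function", "N"),
          ("has", "V"), ("been", "V"), ("a", "D"), ("linear", "A"),
          ("function", "N")]
print(filtered_bigrams(tagged))  # Counter({('linear', 'function'): 2})
```

The A N pair linear function survives with its raw count, while the P D pair of the and the V V pair has been are filtered out, mirroring the slide's before/after tables.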


Pointwise Mutual Information Equation

Our probabilities (p(w1 w2), p(w1), p(w2)) are all basically calculated in the same way:

(2) p(x) = C(x) / N

◮ N is the number of words in the corpus
◮ The number of bigrams ≈ the number of unigrams

(3) I(w1, w2) = log [ p(w1 w2) / (p(w1) p(w2)) ]
             = log [ (C(w1 w2)/N) / ((C(w1)/N)(C(w2)/N)) ]
             = log [ N · C(w1 w2) / (C(w1) C(w2)) ]

Mutual Information example

We want to know if Ayatollah Ruhollah is a collocation in a data set we have:
◮ C(Ayatollah) = 42
◮ C(Ruhollah) = 20
◮ C(Ayatollah Ruhollah) = 20
◮ N = 14,307,668

(4) I(Ayatollah, Ruhollah) = log2 [ (20/N) / ((42/N) × (20/N)) ] = log2 [ N · 20 / (42 × 20) ] ≈ 18.38

To see how good a collocation this is, we need to compare it to others.

Problems for Mutual Information

The formula we have also has the following equivalencies:

(5) I(w1, w2) = log [ p(w1 w2) / (p(w1) p(w2)) ] = log [ P(w1|w2) / P(w1) ] = log [ P(w2|w1) / P(w2) ]

Mutual information tells us how much more information we have for a word, knowing the other word
◮ But a decrease in uncertainty isn't quite right ...

A few problems:
◮ Sparse data: infrequent bigrams for infrequent words get high scores
◮ Tends to measure independence (value of 0) better than dependence
◮ Doesn't account for how often the words do not appear together (M&S 1999, table 5.15)

Motivating Contingency Tables

What we can instead get at is: which bigrams are likely, out of a range of possibilities?

Looking at the Arthur Conan Doyle story A Case of Identity, we find the following possibilities for one particular bigram:
◮ sherlock followed by holmes
◮ sherlock followed by some word other than holmes
◮ some word other than sherlock preceding holmes
◮ two words: the first not being sherlock, the second not being holmes

These are all the relevant situations for examining this bigram.

Contingency Tables

We can count up these different possibilities and put them into a contingency table (or 2x2 table):

                B = holmes   B ≠ holmes   Total
A = sherlock    7            0            7
A ≠ sherlock    39           7059         7098
Total           46           7059         7105

The Total row and Total column are the marginals
◮ The values in this chart are the observed frequencies (f_o)

Observed bigram probabilities

Because each cell indicates a bigram, divide each of the cells by the total number of bigrams (7105) to get probabilities:

             holmes    ¬ holmes   Total
sherlock     0.00099   0.0        0.00099
¬ sherlock   0.00549   0.99353    0.99901
Total        0.00647   0.99353    1.0

The marginal probabilities indicate the probabilities for a given word, e.g., p(sherlock) = 0.00099 and p(holmes) = 0.00647.
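The count form of equation (3) can be checked directly against the Ayatollah Ruhollah example. This sketch just plugs the slides' counts into the formula, using log base 2 as in the worked example:

```python
import math

def pmi(c_pair, c_w1, c_w2, n):
    """Pointwise mutual information from raw counts:
    log2( N * C(w1 w2) / (C(w1) * C(w2)) )."""
    return math.log2(n * c_pair / (c_w1 * c_w2))

# Counts from the slides' example
score = pmi(c_pair=20, c_w1=42, c_w2=20, n=14_307_668)
print(round(score, 2))  # 18.38
```

Note the sparse-data problem in miniature: because C(Ayatollah Ruhollah) = C(Ruhollah), the score reduces to log2(N/42), so a rarer word with the same perfect co-occurrence would score even higher.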


Expected bigram probabilities

If we assumed that sherlock and holmes are independent (i.e., the probability of one is unaffected by the probability of the other), we would get the following table:

             holmes              ¬ holmes            Total
sherlock     0.00647 × 0.00099   0.99353 × 0.00099   0.00099
¬ sherlock   0.00647 × 0.99901   0.99353 × 0.99901   0.99901
Total        0.00647             0.99353             1.0

◮ This is simply p_e(w1, w2) = p(w1) p(w2)

Expected bigram frequencies

Multiplying by 7105 (the total number of bigrams) gives us the expected number of times we should see each bigram:

             holmes   ¬ holmes   Total
sherlock     0.05     6.95       7
¬ sherlock   45.95    7052.05    7098
Total        46       7059       7105

◮ The values in this chart are the expected frequencies (f_e)

Pearson's chi-square test

The chi-square (χ²) test measures how far the observed values are from the expected values:

(6) χ² = Σ (f_o − f_e)² / f_e

(7) χ² = (7 − 0.05)²/0.05 + (0 − 6.95)²/6.95 + (39 − 45.95)²/45.95 + (7059 − 7052.05)²/7052.05
       = 966.05 + 6.95 + 1.05 + 0.006
       = 974.06

If you look this up in a table, you'll see that it's unlikely to be chance.

NB: The χ² test does not work well for rare events, i.e., f_e < 6

Working with collocations

The question is:
◮ What significant collocations are there that start with the word sweet?
◮ Specifically, what nouns tend to co-occur after sweet?

What do your intuitions say?

Next time, we will work on how to calculate collocations ...
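The chi-square computation generalizes to any 2x2 table of observed counts. A minimal sketch over the sherlock/holmes table, not the course's code; note that the slides round the smallest expected count before dividing, which gives roughly 974, while unrounded expected frequencies give a value near 1075. Either way the statistic is far beyond the 3.84 critical value for one degree of freedom at p = 0.05:

```python
def chi_square(observed):
    """Pearson's chi-square for a 2x2 table of observed bigram counts,
    computing expected counts from the marginals."""
    rows = [sum(r) for r in observed]
    cols = [sum(c) for c in zip(*observed)]
    n = sum(rows)
    total = 0.0
    for i in range(2):
        for j in range(2):
            expected = rows[i] * cols[j] / n  # product of marginals over N
            total += (observed[i][j] - expected) ** 2 / expected
    return total

#            holmes  not holmes
observed = [[7,      0],       # sherlock
            [39,     7059]]    # not sherlock
print(round(chi_square(observed)))  # far above the 3.84 critical value (1 df, p = 0.05)
```

Following the NB on the slide, a real implementation should refuse (or switch tests) when any expected cell falls below about 6, as the sherlock-holmes cell does here.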
