
Corpus Linguistics - IU Computational Linguistics Program - Indiana ...


Applications: Linguistic searching

Corpora can investigate questions such as:
◮ How does one order different types of adjectives in English?
◮ In what contexts are split constituents allowed in German?
◮ With what frequency do parasitic gaps occur in academic language?

Applications: Language variation

For any of the questions mentioned above, we can compare for different language groups
◮ Do Indian speakers of English reduplicate words (more than other groups)?
◮ With what frequency do older speakers in the midwest use cool?
  ◮ vs. younger speakers
  ◮ vs. in the south
  ◮ vs. written language

Applications: Lexicography

How many senses does the word line have?
14 (Webster’s New Encyclopedic Dictionary, 1994):
1. a comparatively strong slender cord
2. a cord, wire, or tape used in measuring and leveling
3. piping for conveying a fluid
4. a row of words, letters, numbers or symbols that are written, printed, or displayed
5. something that is distinct, elongated, and narrow
6. a state of agreement (bring ideas into line)
7. a course of conduct, action, or thought (a political line)
8. limit, restraint (overstep the line of good taste)
. . .
A corpus can provide examples & help re-define senses

Applications: Language learning

What do you say in English: think about or think on?
According to Google (1/3/13):
◮ 290,000,000 hits for "think about"
◮ 8,050,000 hits for "think on"

Applications: Corpora for computational linguistics

Corpora are useful for linguistics research, but have also revolutionized computational linguistics (CL)
◮ With annotation, CL can train and evaluate new algorithms
◮ Technology has become more robust and more efficient since the early 1990s
◮ All sorts of new annotations (with practical focuses, e.g., biomedical annotation) have taken off
We will investigate computational applications at various points this semester

Approaches to corpora: Corpus-based & Corpus-driven

◮ Corpus-based research: corpora expound upon theories that were formulated before corpora
  ◮ May have to do away with particular pieces of evidence
◮ Corpus-driven research: strictly committed to corpus data
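
The "think about" vs. "think on" comparison from the language learning slide comes down to counting how often each phrase occurs and comparing the counts. Below is a minimal Python sketch of that idea. The function name `count_phrases` and the toy corpus are assumptions made up for illustration; they are not part of the course materials, and a real study would run against an actual corpus rather than search-engine hit counts.

```python
import re
from collections import Counter


def count_phrases(text, phrases):
    """Count case-insensitive, whole-word occurrences of each phrase in text."""
    counts = Counter()
    lowered = text.lower()
    for phrase in phrases:
        # \b anchors keep "think on" from matching inside "think once".
        pattern = r"\b" + re.escape(phrase.lower()) + r"\b"
        counts[phrase] = len(re.findall(pattern, lowered))
    return counts


# Toy corpus, invented for illustration only (not real corpus data).
toy_corpus = (
    "I think about it every day. Think about the cost. "
    "She will think on it tonight. We think about lunch a lot."
)

counts = count_phrases(toy_corpus, ["think about", "think on"])
print(counts["think about"], counts["think on"])

# Ratio of the two hit counts quoted on the slide (Google, 1/3/13).
google_ratio = 290_000_000 / 8_050_000
print(round(google_ratio, 1))
```

With the slide's figures, "think about" outnumbers "think on" by roughly 36 to 1. Raw counts like these are only comparable within one corpus; comparing across speaker groups or registers would need frequencies normalized by corpus size.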

Corpus-based vs. Corpus-driven: Differences

1. Type of corpus data
  ◮ representativeness: important for corpus-based approaches
  ◮ corpus size: very large corpora (supposed to be balanced) argued for in corpus-driven approaches
  ◮ annotation: corpus-driven approaches want to be pre-theoretical (which annotation is not) and derive categories completely from corpus
2. Attitude towards existing theories & intuitions
  ◮ Corpus-based approaches use existing theory as a starting point
3. Research focus:
  ◮ Corpus-based: uses standard linguistic levels
  ◮ Corpus-driven: holistic view, with a functional view of meaning

A brief history

There is a long history of using empirical, observed data, even before the advent of computers
◮ 1940s: structuralism, ‘shoebox corpora’
◮ late 1950s, 1960s: generativism, almost no corpus linguistics
  ◮ Chomsky had several arguments against corpora (see next slides), some geared towards shoebox corpora
  ◮ notable exception: Brown corpus
◮ 1980s and beyond: increased interest in corpus linguistics
  ◮ opened new areas of research

Bad start for corpus linguistics

Noam Chomsky (1957) Syntactic Structures:
◮ p. 15: “. . . it is obvious that the set of grammatical sentences cannot be identified with any particular corpus of utterances. . . . a grammar mirrors the behavior of the speaker, who, on the basis of a finite and accidental experience with language, can produce or understand an indefinite number of new sentences.”

Bad start for corpus linguistics (2)

Noam Chomsky (1957) Syntactic Structures:
◮ p. 16/17: “. . . one’s ability to produce and recognize grammatical utterances is not based on notions of statistical approximations or the like. . . . If we rank the sequences of a given length in order of statistical approximation to English, we will find both grammatical and ungrammatical sequences scattered throughout the list; there appears to be no particular relation between the order of approximations and grammaticalness.”

Chomsky recently

József Andor (2004) The master and his performance: An interview with Noam Chomsky. Intercultural Pragmatics 1:1.
◮ “Corpus linguistics doesn’t mean anything. It’s like saying suppose a physicist decides, suppose physics and chemistry decide that instead of relying on experiments, what they’re going to do is take videotapes of things happening in the world and they’ll collect huge videotapes of everything that’s happening and from that maybe they’ll come up with some generalizations or insights.”

Chomsky recently (2)

◮ question: “Think of the occurrence of ‘Can you . . .’ or, ‘Could you . . .’ rather than ‘Are you able to . . .’ in polite requests in given communicative situations (a domain studied extensively by speech act theorists). Such chunks of linguistic expressions can be traced by the researcher via the application of corpus linguistic methods. It is from a corpus that one can identify their frequency and trace shifts in their meaning and use. Would you attribute significance to such data in your approach to linguistic analysis and description?”
◮ answer: “People who work seriously in this particular area do not rely on corpus linguistics. They may begin by looking at facts about frequency and shifts in frequency and so on, but if they want to move on to some understanding of what’s happening they will very quickly, and in fact do, shift to the experimental framework. Where you design situations, you enquire into how people will act in those situations. You design them within a framework of theoretical inquiry which has already suggested that these are likely to be important questions and I want the answers to them. But that’s not corpus linguistics.”
