24.12.2013 Views

Slides - University of Washington

Slides - University of Washington

Slides - University of Washington

SHOW MORE
SHOW LESS

Create successful ePaper yourself

Turn your PDF publications into a flip-book with our unique Google optimized e-Paper software.

<strong>University</strong> <strong>of</strong> <strong>Washington</strong><br />

Ling472 Introduction to Computational Linguistics<br />

Lecture 1:<br />

Introduction<br />

Ling472 / CSE 472<br />

Fundamentals <strong>of</strong> Computational Linguistics<br />

Lecture 1<br />

4/1/2013<br />

Monday, April 1, 2013<br />

1


<strong>University</strong> <strong>of</strong> <strong>Washington</strong><br />

Ling472 Introduction to Computational Linguistics<br />

Lecture 1:<br />

Introduction<br />

Calendar<br />

Class Meetings: Monday, Wednesday 1:30 –2:50 pm<br />

T.A. Section: Friday 1:30 – 2:20pm MGH 234<br />

(April 1 ‐ June 5, 2013)<br />

Midterm: May 1, 2013<br />

Monday, April 1, 2013<br />

2


<strong>University</strong> <strong>of</strong> <strong>Washington</strong><br />

Ling472 Introduction to Computational Linguistics<br />

Lecture 1:<br />

Introduction<br />

Course topics<br />

• Finite state morphology<br />

• Regular expressions<br />

• Formal grammars; Chomsky hierarchy; Context‐free<br />

grammars<br />

• Bayes' theorem<br />

• N‐grams and Language Modeling<br />

• Part‐<strong>of</strong>‐speech tagging<br />

• Semantic representations<br />

• Clustering and classifiers<br />

• Evaluation: Precision and Recall<br />

• Algorithms for corpus processing<br />

• Feature‐structures and unification‐based grammars<br />

Monday, April 1, 2013<br />

3


<strong>University</strong> <strong>of</strong> <strong>Washington</strong><br />

Ling472 Introduction to Computational Linguistics<br />

Lecture 1:<br />

Introduction<br />

Glenn Slayden<br />

Instructor<br />

B.S. Econ. <strong>University</strong> Pennsylvania<br />

B.Mus. Cornish College <strong>of</strong> the Arts<br />

M.Sci Computational Linguistics, <strong>University</strong> <strong>of</strong> <strong>Washington</strong> (2012)<br />

Experience:<br />

S<strong>of</strong>tware Design Engineer, Micros<strong>of</strong>t<br />

Micros<strong>of</strong>t Research – Machine Translation<br />

Research Interests:<br />

Thai language: precision grammar, lexicography, and online<br />

language learning; analytical MT: parsing, generation, and<br />

semantic transfer<br />

Monday, April 1, 2013<br />

4


<strong>University</strong> <strong>of</strong> <strong>Washington</strong><br />

Ling472 Introduction to Computational Linguistics<br />

Lecture 1:<br />

Introduction<br />

T.A.<br />

• Sanghoun Song<br />

– sanghoun@uw.edu<br />

Monday, April 1, 2013<br />

5


<strong>University</strong> <strong>of</strong> <strong>Washington</strong><br />

Ling472 Introduction to Computational Linguistics<br />

Lecture 1:<br />

Introduction<br />

Textbook<br />

Daniel Jurafsky and James H. Martin. (2008) Speech<br />

and Language Processing (2nd edition). New<br />

Jersey: Prentice‐Hall.<br />

Monday, April 1, 2013<br />

6


<strong>University</strong> <strong>of</strong> <strong>Washington</strong><br />

Ling472 Introduction to Computational Linguistics<br />

Lecture 1:<br />

Introduction<br />

Grading<br />

• Midterm: 20%<br />

• Programming Projects: 65%<br />

• Writing Assignment: 10%<br />

• Class Participation: 5%<br />

• As you can see, the programming projects are<br />

the most important part.<br />

Monday, April 1, 2013<br />

7


<strong>University</strong> <strong>of</strong> <strong>Washington</strong><br />

Ling472 Introduction to Computational Linguistics<br />

Lecture 1:<br />

Introduction<br />

Programming Languages<br />

• You may use any <strong>of</strong> the following programming languages<br />

C<br />

C++<br />

C#<br />

Java<br />

Python<br />

others: see me first<br />

Assignment solutions will be provided in Python or C#.<br />

• This is not a basic programming class. We will not cover how<br />

to create, edit, compile, and run programs.<br />

Monday, April 1, 2013<br />

8


<strong>University</strong> <strong>of</strong> <strong>Washington</strong><br />

Ling472 Introduction to Computational Linguistics<br />

Lecture 1:<br />

Introduction<br />

• Code must run on Patas<br />

Programming Projects<br />

– or you will receive no credit. This is because some projects<br />

may reference licensed corpora which you may not copy<br />

– I won’t spend time figuring out why your code doesn’t run.<br />

– You can develop on a home machine, but make sure you<br />

test thoroughly on the cluster<br />

• Please follow instructions<br />

• Always include a text description (write‐up) <strong>of</strong> your<br />

work<br />

• Submit source code and write‐up in ZIP or TAR file<br />

Monday, April 1, 2013<br />

9


<strong>University</strong> <strong>of</strong> <strong>Washington</strong><br />

Ling472 Introduction to Computational Linguistics<br />

Lecture 1:<br />

Introduction<br />

Today’s lecture<br />

• A quick survey and history <strong>of</strong> the field <strong>of</strong><br />

computational linguistics<br />

• Why is language hard?<br />

• Finding patterns in text: Regular Expressions<br />

• Project 1: Eliza‐like<br />

Monday, April 1, 2013<br />

10


<strong>University</strong> <strong>of</strong> <strong>Washington</strong><br />

Ling472 Introduction to Computational Linguistics<br />

Lecture 1:<br />

Introduction<br />

What is Computational Linguistics?<br />

Applying quantitative techniques to the analysis and<br />

processing <strong>of</strong> human (“natural”) languages.<br />

• Naturally, computers are well‐suited to this sort <strong>of</strong> Natural<br />

Language Processing (NLP)<br />

• Cross‐disciplinary<br />

– Linguistics<br />

– Computer science<br />

– Mathematics<br />

– Electrical Engineering<br />

• Computational linguists come from all <strong>of</strong> these<br />

backgrounds<br />

Monday, April 1, 2013<br />

11


<strong>University</strong> <strong>of</strong> <strong>Washington</strong><br />

Ling472 Introduction to Computational Linguistics<br />

Lecture 1:<br />

Introduction<br />

Linguistics<br />

• Nearly all research areas in linguistics can benefit<br />

from computational techniques:<br />

– Phonetics<br />

– Phonology<br />

– Morphology<br />

– Syntax<br />

– Semantics<br />

– Pragmatics<br />

– Discourse Analysis<br />

– Information Structure<br />

– Typology<br />

Monday, April 1, 2013<br />

12


<strong>University</strong> <strong>of</strong> <strong>Washington</strong><br />

Ling472 Introduction to Computational Linguistics<br />

Lecture 1:<br />

Introduction<br />

Computer Science<br />

• All fundamental CS data structures and<br />

principles apply: Landau (“Big‐O”) notation<br />

• Large scale text processing; RegEx; strings;<br />

character encoding; data conversion<br />

• Databases; high‐throughput computing<br />

• Specialized data structures and techniques<br />

– String hashes<br />

– Trie<br />

– Dynamic programming<br />

Monday, April 1, 2013<br />

13


<strong>University</strong> <strong>of</strong> <strong>Washington</strong><br />

Ling472 Introduction to Computational Linguistics<br />

Lecture 1:<br />

Introduction<br />

Mathematics<br />

• Probability<br />

– Modeling the occurrence(s) <strong>of</strong> events<br />

• Statistics<br />

– Application <strong>of</strong> probabilistic models<br />

• Set Theory<br />

– Intersection, Union, Exclusion<br />

• Logic<br />

– Boolean logic<br />

– First order logic (predicate calculus)<br />

– Markov (probabilistic) logic, entailment<br />

• Information Theory<br />

– Entropy<br />

– Shannon channels<br />

• Advanced Math<br />

– Kernel functions (for Support Vector Machines…)<br />

– Matrix decomposition (i.e. Singular Value Decomposition…)<br />

Monday, April 1, 2013<br />

14


<strong>University</strong> <strong>of</strong> <strong>Washington</strong><br />

Ling472 Introduction to Computational Linguistics<br />

Lecture 1:<br />

Introduction<br />

• Two ambitions are at different stages <strong>of</strong><br />

realization<br />

– Natural Language Processing (NLP)<br />

– Natural Language Understanding (NLU)<br />

Monday, April 1, 2013<br />

15


<strong>University</strong> <strong>of</strong> <strong>Washington</strong><br />

Ling472 Introduction to Computational Linguistics<br />

Lecture 1:<br />

Introduction<br />

State <strong>of</strong> the Art<br />

ambition<br />

NLP<br />

NLU<br />

current<br />

realization<br />

Practical<br />

Applications<br />

Pure<br />

Linguistic<br />

Research<br />

Monday, April 1, 2013<br />

16


<strong>University</strong> <strong>of</strong> <strong>Washington</strong><br />

Ling472 Introduction to Computational Linguistics<br />

Lecture 1:<br />

Introduction<br />

Natural Language Understanding<br />

• Still largely in the research stage<br />

• Hybrid <strong>of</strong> linguistically motivated analytical (“rulebased”)<br />

systems with sensible application <strong>of</strong><br />

stochastic methods seems necessary<br />

• Deep processing applications continue to mature<br />

– ERG: English Resource Grammar (Flickinger 2002)<br />

• http://www.delph‐in.net/erg/<br />

– World knowledge, semantics, intricate grammatical<br />

knowledge<br />

Monday, April 1, 2013<br />

17


<strong>University</strong> <strong>of</strong> <strong>Washington</strong><br />

Ling472 Introduction to Computational Linguistics<br />

Lecture 1:<br />

Introduction<br />

NLP / NLU Subfields<br />

• IE: Information Extraction<br />

• IR: Information Retrieval<br />

• MT: Machine Translation<br />

• Automatic Document Summarization<br />

• QA: Question Answering<br />

• ASR: Automatic Speech Recognition<br />

• NLG: Natural Language Generation<br />

• Speech Synthesis<br />

• CALL: Computer‐Assisted Language Learning<br />

• Alternative Input Methods<br />

Monday, April 1, 2013<br />

18


<strong>University</strong> <strong>of</strong> <strong>Washington</strong><br />

Ling472 Introduction to Computational Linguistics<br />

Lecture 1:<br />

Introduction<br />

NLP / NLU Subtasks<br />

• Tokenization (word breaking)<br />

• Sentence Breaking<br />

• Morphological Analysis<br />

– POS tagging<br />

– Stemming<br />

• NER: Named Entity Recognition<br />

• WSD: Word Sense Disambiguation<br />

• Anaphora and reference resolution<br />

• Parsing and generation<br />

• Dialogue management and discourse analysis<br />

• Clustering<br />

• Classification<br />

• Treebanking and Corpora curatorship<br />

• …<br />

Monday, April 1, 2013<br />

19


<strong>University</strong> <strong>of</strong> <strong>Washington</strong><br />

Ling472 Introduction to Computational Linguistics<br />

Lecture 1:<br />

Introduction<br />

NLP/NLU Approaches<br />

• Analytical “rule‐based”<br />

– Intuit a set <strong>of</strong> rules<br />

– Implement them<br />

– Evaluate<br />

• Statistical<br />

– Train a statistical model on a large set <strong>of</strong><br />

observations<br />

– Use the model to make predictions about unseen<br />

inputs<br />

Monday, April 1, 2013<br />

20


<strong>University</strong> <strong>of</strong> <strong>Washington</strong><br />

Ling472 Introduction to Computational Linguistics<br />

Lecture 1:<br />

Introduction<br />

Analytical (Rule‐Based) Techniques<br />

Intuition: In human brains, language operates by a set <strong>of</strong> rules.<br />

Let’s infer these and implement them in a computer.<br />

Q: Any problems with this?<br />

1. We don’t know how language operates in human brains<br />

2. How will we infer these rules? What form will they have?<br />

3. Will these putative rules be amenable to procedural<br />

implementation?<br />

4. How will we test and evaluate our progress?<br />

Monday, April 1, 2013<br />

21


<strong>University</strong> <strong>of</strong> <strong>Washington</strong><br />

Ling472 Introduction to Computational Linguistics<br />

Lecture 1:<br />

Introduction<br />

Analytical MT: a case study<br />

Machine Translation (MT) was the first goal: naïve ebullience!<br />

When I look at an article in Russian, I say: “This is really written in<br />

English, but it has been coded in some strange symbols.”<br />

Weaver 1947<br />

Monday, April 1, 2013<br />

22


<strong>University</strong> <strong>of</strong> <strong>Washington</strong><br />

Ling472 Introduction to Computational Linguistics<br />

Lecture 1:<br />

Introduction<br />

Analytical MT: a case study<br />

Another 19 Years Later: wet‐blanket pessimism<br />

There has been no machine translation <strong>of</strong> general<br />

scientific text, and none is in immediate prospect… After 8 years<br />

<strong>of</strong> work, the project] had to resort to post‐editing [which] took<br />

slightly longer to do and was more expensive than conventional<br />

human translation.”<br />

ALPAC Report 1966<br />

Recommendation <strong>of</strong> the ALPAC Report: suspend<br />

government funding for Machine Translation<br />

Monday, April 1, 2013<br />

23


<strong>University</strong> <strong>of</strong> <strong>Washington</strong><br />

Ling472 Introduction to Computational Linguistics<br />

Lecture 1:<br />

Introduction<br />

Analytical MT: a case study<br />

41 Years Later: realistic progress<br />

Progress on combining rule‐based and data‐driven approaches to<br />

MT will depend on a sustained stream <strong>of</strong> state‐<strong>of</strong>‐the‐art, MToriented<br />

linguistics research... Despite frequent cycles <strong>of</strong> overly<br />

high hopes and subsequent disillusionment, [MT] is the type <strong>of</strong><br />

application that may demand knowledge‐heavy, ‘deep’<br />

approaches to NLP for its ultimate, long‐term success.<br />

Oepen et al. 2007<br />

Monday, April 1, 2013<br />

24


<strong>University</strong> <strong>of</strong> <strong>Washington</strong><br />

Ling472 Introduction to Computational Linguistics<br />

Lecture 1:<br />

Introduction<br />

Why is language so hard?<br />

• As a simplification, let’s just consider text<br />

(speech processing has a separate set <strong>of</strong> problems)<br />

• Text representation <strong>of</strong> language<br />

I saw a man with a telescope.<br />

– This is a sequence <strong>of</strong> symbols<br />

– This is a “string” <strong>of</strong> characters or Unicode code points<br />

– This is an ordered collection <strong>of</strong> “words”<br />

– This is a “sentence”<br />

– This is a “surface” representation <strong>of</strong> a speech act<br />

– This is a surface representation <strong>of</strong> a semantic proposition<br />

Monday, April 1, 2013<br />

25


<strong>University</strong> <strong>of</strong> <strong>Washington</strong><br />

Ling472 Introduction to Computational Linguistics<br />

Lecture 1:<br />

Introduction<br />

Words<br />

Isa wa manwi thate lesco pe.<br />

I saw a man with a telescope.<br />

Observation: it appears that, in this language,<br />

one role <strong>of</strong> the space character is to partition<br />

symbols (letters) into units called “words.”<br />

Note: we always try to be aware <strong>of</strong>, and explicitly state, any<br />

assumptions that may not be cross‐linguistically valid<br />

Monday, April 1, 2013<br />

26


<strong>University</strong> <strong>of</strong> <strong>Washington</strong><br />

Ling472 Introduction to Computational Linguistics<br />

Lecture 1:<br />

Introduction<br />

Words<br />

It is not always the case that identifying “words” is straightforward:<br />

ผมเห็นผู ้ชายกับโทรทรรศน์<br />

ผม เห็น ผู ้ชาย กับ โทรทรรศน์<br />

pʰǒm hěn pʰûː tɕʰaːj kàp tʰoː rá tʰát<br />

1‐sg see man with telescope<br />

“I saw (a) man (who was) with a telescope.”<br />

The International Phonetic<br />

Alphabet (IPA), is the<br />

standard form <strong>of</strong> phonetic<br />

transcription.<br />

1 word or 2?<br />

This type <strong>of</strong> formatting for linguistic examples is called “Interlinear<br />

Glossed Text,” or IGT<br />

Monday, April 1, 2013<br />

27


<strong>University</strong> <strong>of</strong> <strong>Washington</strong><br />

Ling472 Introduction to Computational Linguistics<br />

Lecture 1:<br />

Introduction<br />

Word Classes<br />

• Can we identify a closed set <strong>of</strong> word classes to<br />

generalize about them?<br />

• These are called Parts <strong>of</strong> Speech.<br />

• In computational linguistics: POS tags<br />

• Closed and open classes<br />

Monday, April 1, 2013<br />

28


<strong>University</strong> <strong>of</strong> <strong>Washington</strong><br />

Ling472 Introduction to Computational Linguistics<br />

Lecture 1:<br />

Introduction<br />

• A closed class<br />

Closed Word Classes<br />

– Has a limited number <strong>of</strong> members<br />

– Is generally not open to new member production<br />

• Closed word classes<br />

– Conjunction<br />

– Determiner<br />

– Pronoun<br />

– Auxiliary verb<br />

– …and others<br />

Monday, April 1, 2013<br />

29


<strong>University</strong> <strong>of</strong> <strong>Washington</strong><br />

Ling472 Introduction to Computational Linguistics<br />

Lecture 1:<br />

Introduction<br />

• An open class:<br />

Open Word Classes<br />

– Has a large number <strong>of</strong> members<br />

– May accept new members over time<br />

– May allow new members to be heuristically generated<br />

• Open word classes: may allow compounding, inflecting, or<br />

productive morphology<br />

– Noun<br />

– Verb<br />

– Adjective<br />

– Adverb<br />

– …and others<br />

Monday, April 1, 2013<br />

30


<strong>University</strong> <strong>of</strong> <strong>Washington</strong><br />

Ling472 Introduction to Computational Linguistics<br />

Lecture 1:<br />

Introduction<br />

Sentences<br />

*the carnival. I saw a man with a<br />

Observation: it appears that, in this language, one role <strong>of</strong> the<br />

period is to partition words into sentences.<br />

Ok, let’s dismiss those issues too. We’ll assume we’re given the<br />

text <strong>of</strong> one sentence, unambiguously composed <strong>of</strong> words. Good<br />

enough?<br />

Monday, April 1, 2013<br />

31


<strong>University</strong> <strong>of</strong> <strong>Washington</strong><br />

Ling472 Introduction to Computational Linguistics<br />

Lecture 1:<br />

Introduction<br />

Convention for linguistic examples<br />

• Sentences marked ungrammatical are marked with an asterisk<br />

*Sentence is ungrammatical this.<br />

• Sentences judged marginal are marked with a question mark<br />

?Everyone doesn’t like their car.<br />

• Grammatical sentences can be semantic nonsense<br />

Colorless green ideas sleep furiously.<br />

• Grammatical sentences which don’t convey the intended<br />

meaning are marked with a hash<br />

#It’s raining outside but it’s not raining.<br />

Monday, April 1, 2013<br />

32


<strong>University</strong> <strong>of</strong> <strong>Washington</strong><br />

Ling472 Introduction to Computational Linguistics<br />

Lecture 1:<br />

Introduction<br />

Test the model<br />

*Man a with saw telescope I a.<br />

Observation: the ordering <strong>of</strong> words appears to<br />

be important in this language.<br />

English is generally Subject‐Verb‐Object (SVO), which is less common among<br />

world languages than SOV. In Russian, word order is more restricted in<br />

transitive clauses. Some languages, such as Datooga in Tanzania have free<br />

word order. Hixkaryana and Tamil are a few <strong>of</strong> examples <strong>of</strong> the rare OVS<br />

languages.<br />

Monday, April 1, 2013<br />

33


<strong>University</strong> <strong>of</strong> <strong>Washington</strong><br />

Ling472 Introduction to Computational Linguistics<br />

Lecture 1:<br />

Introduction<br />

More data<br />

With a telescope I saw a man.<br />

#With a man I saw a telescope.<br />

#I saw a telescope with a man.<br />

*A man saw a telescope with I.<br />

I, with a telescope saw a man.<br />

Observation: some orderings have more felicity than others<br />

Can we find consistent generalizations to capture this? If<br />

so, what would we call such a set?<br />

Monday, April 1, 2013<br />

34


<strong>University</strong> <strong>of</strong> <strong>Washington</strong><br />

Ling472 Introduction to Computational Linguistics<br />

Lecture 1:<br />

Introduction<br />

Constituents<br />

• Can sentences be analyzed as containing sub‐units which<br />

consist <strong>of</strong> one or more words?<br />

This is a phrase structure tree.<br />

(Hausser 1998)<br />

Hypothesis: The class <strong>of</strong> grammatical constituents is closed.<br />

Monday, April 1, 2013<br />

35


<strong>University</strong> <strong>of</strong> <strong>Washington</strong><br />

Ling472 Introduction to Computational Linguistics<br />

Lecture 1:<br />

Introduction<br />

Constituent Types<br />

• Constituents are characterized by the part‐<strong>of</strong>‐speech <strong>of</strong> their<br />

main word, the “head.”<br />

• Thus, we notice that for many languages, a sentence<br />

comprises a:<br />

– subject (a Noun Phrase or NP)<br />

– predicate (a Verb Phrase or VP)<br />

• These may be composed <strong>of</strong> other constituents<br />

– Prepositional phrase (PP)<br />

– Determiner phrase (DP)<br />

Monday, April 1, 2013<br />

36


<strong>University</strong> <strong>of</strong> <strong>Washington</strong><br />

Ling472 Introduction to Computational Linguistics<br />

Lecture 1:<br />

Introduction<br />

• Noun phrases (NPs)<br />

Constituent Construction<br />

(DET NN)<br />

The ostrich<br />

(NNP)<br />

Kim<br />

(NN NN)<br />

container ship<br />

(DET JJ NN) A purple lawnmower<br />

(DET JJ NN) That darn cat<br />

• Verb phrases (VPs)<br />

(VB) tango<br />

(VBD NP NP) gave the dog a bone<br />

(VBD NP PP) gave a bone to the dog<br />

Monday, April 1, 2013<br />

37


<strong>University</strong> <strong>of</strong> <strong>Washington</strong><br />

Ling472 Introduction to Computational Linguistics<br />

Lecture 1:<br />

Introduction<br />

Syntax<br />

The set <strong>of</strong> rules governing permissible constructions in a language.<br />

• Syntax constrains the ways in which words may be<br />

combined to form constituents and sentences.<br />

• Syntax forms one part <strong>of</strong> the description, or<br />

grammar, <strong>of</strong> a language.<br />

Monday, April 1, 2013<br />

38


<strong>University</strong> <strong>of</strong> <strong>Washington</strong><br />

Ling472 Introduction to Computational Linguistics<br />

Lecture 1:<br />

Introduction<br />

Prescriptive v. descriptive grammar<br />

• Prescriptive<br />

– Rules against certain usages. Few if any rules for<br />

what is allowed.<br />

• Prepositions are not for ending sentences with.<br />

• Descriptive<br />

– Rules characterizing what people do say.<br />

– Goal is to characterize all and only what speakers<br />

find acceptable.<br />

– Based on the scientific method<br />

Slide: Emily Bender<br />

Monday, April 1, 2013<br />

39


<strong>University</strong> <strong>of</strong> <strong>Washington</strong><br />

Ling472 Introduction to Computational Linguistics<br />

Lecture 1:<br />

Introduction<br />

Artificiality <strong>of</strong> prescriptive rules<br />

• Fill in the blanks: he/his, they/their, or<br />

something else?<br />

Everyone insisted that __ record was unblemished.<br />

Everyone drives __ own car to work.<br />

Everyone was happy because __ passed the test.<br />

Everyone left the room, didn’t __ ?<br />

Everyone left early. __ seemed happy to get home.<br />

Slide: Bender, Sag, Wasow 2003<br />

Monday, April 1, 2013<br />

40


<strong>University</strong> <strong>of</strong> <strong>Washington</strong><br />

Ling472 Introduction to Computational Linguistics<br />

Lecture 1:<br />

Introduction<br />

Two kinds <strong>of</strong> ambiguity<br />

1. Lexical ambiguity<br />

The bank is crumbling.<br />

?<br />

Monday, April 1, 2013<br />

41


<strong>University</strong> <strong>of</strong> <strong>Washington</strong><br />

Ling472 Introduction to Computational Linguistics<br />

Lecture 1:<br />

Introduction<br />

Two kinds <strong>of</strong> ambiguity<br />

2. Structural ambiguity<br />

I saw a man with a telescope.<br />

? ?<br />

Monday, April 1, 2013<br />

42


<strong>University</strong> <strong>of</strong> <strong>Washington</strong><br />

Ling472 Introduction to Computational Linguistics<br />

Lecture 1:<br />

Introduction<br />

Is that all?<br />

I (<strong>of</strong>ten) saw a man with a telescope.<br />

Monday, April 1, 2013<br />

43


<strong>University</strong> <strong>of</strong> <strong>Washington</strong><br />

Ling472 Introduction to Computational Linguistics<br />

Lecture 1:<br />

Introduction<br />

Ambiguity<br />

Q: What kind <strong>of</strong> ambiguity does the following<br />

sentence illustrate?<br />

Have that report on my desk by Friday.<br />

A: Both structural and lexical ambiguity.<br />

In English, prosody in speech provides<br />

disambiguation by, for example, marking topic<br />

and focus.<br />

Monday, April 1, 2013<br />

44


<strong>University</strong> <strong>of</strong> <strong>Washington</strong><br />

Ling472 Introduction to Computational Linguistics<br />

Lecture 1:<br />

Introduction<br />

Constituents<br />

•Constituents help us understand and classify syntactic structure<br />

•Here, phrase structure trees show us how constituent structure<br />

helps us characterize ambiguity<br />

man<br />

man<br />

Trees: Dan Jinguji<br />

Monday, April 1, 2013<br />

45


<strong>University</strong> <strong>of</strong> <strong>Washington</strong><br />

Ling472 Introduction to Computational Linguistics<br />

Lecture 1:<br />

Introduction<br />

Garden Path Sentences<br />

These curiosities are another kind <strong>of</strong> ambiguity which are more relevant to<br />

psycholinguistics.<br />

The old man the boat.<br />

The generic NP “the old” and the verb “to man” are much more rarely used<br />

than the NP “the old man.”<br />

The horse raced past the barn fell.<br />

The relative clause “raced past the barn” is not introduced by “which.”<br />

Monday, April 1, 2013<br />

46


<strong>University</strong> <strong>of</strong> <strong>Washington</strong><br />

Ling472 Introduction to Computational Linguistics<br />

Lecture 1:<br />

Introduction<br />

Analytical NLP: Summary<br />

• Language has significant complexity<br />

• Ambiguity is an inherent feature <strong>of</strong> language<br />

• Inferring the syntactic rules <strong>of</strong> a language is difficult<br />

• It appears that syntactic rules are amenable to<br />

computational approximation, but there are a great<br />

number <strong>of</strong> them, and they are subtle<br />

• See next slide for the obligatory optimism…<br />

Monday, April 1, 2013<br />

47


<strong>University</strong> <strong>of</strong> <strong>Washington</strong><br />

Ling472 Introduction to Computational Linguistics<br />

Lecture 1:<br />

Introduction<br />

Latest Work: Hybrid analytical/statistical systems<br />

• Examples:<br />

– Stephan Oepen, Erik Velldal, Jan Tore Lønning, Paul Meurer, Victoria<br />

Rosén, and Dan Flickinger. 2007. “Towards hybrid quality‐oriented<br />

machine translation”<br />

http://share.emmtee.net/pub/bscw.cgi/d23044/tmi07.pdf<br />

– Parse ranking in the English Resource Grammar<br />

– Unsupervised learning <strong>of</strong> rules<br />

• Poon and Domingos 2009 “Unsupervised Semantic<br />

Parsing” (UW CSE)<br />

http://www.aclweb.org/anthology‐new/D/D09/D09‐1001.pdf<br />

Monday, April 1, 2013<br />

48


<strong>University</strong> <strong>of</strong> <strong>Washington</strong><br />

Ling472 Introduction to Computational Linguistics<br />

Lecture 1:<br />

Introduction<br />

Statistical Natural Language Processing<br />

• Large gains in practical application <strong>of</strong> stochastic methods in<br />

the past 15 years<br />

• Example: MT<br />

– Micros<strong>of</strong>t Bing Translator<br />

– Google translate<br />

– GIZA++/Moses SMT toolkit<br />

• Insight: clever math—with limited linguistic motivation—can<br />

work surprisingly well<br />

These improvements are not just due to Moore’s law<br />

Monday, April 1, 2013<br />

49


<strong>University</strong> <strong>of</strong> <strong>Washington</strong><br />

Ling472 Introduction to Computational Linguistics<br />

Lecture 1:<br />

Introduction<br />

The mid‐1990s<br />

• Advances in computing power<br />

• Perceived lack <strong>of</strong> progress in analytical NLP<br />

• What if the “rules” <strong>of</strong> linguistics as we intuitively imagine<br />

them are too hard (or too incorrect, or too biased…) to<br />

capture?<br />

• Let’s forget any preconceived notion <strong>of</strong> what the rules should<br />

be and use math to characterize what the rules appear to be<br />

in practice.<br />

• German Verbmobil project had success with statistical<br />

machine translation (This large project also developed rulebased<br />

systems)<br />

Source: Koehn 2010 “Statistical Machine Translation”<br />

Monday, April 1, 2013<br />

50


<strong>University</strong> <strong>of</strong> <strong>Washington</strong><br />

Ling472 Introduction to Computational Linguistics<br />

Lecture 1:<br />

Introduction<br />

Corpus Linguistics<br />

The study <strong>of</strong> language as expressed in samples (corpora) or “real<br />

world” text.<br />

• Wikipedia<br />

This can be thought <strong>of</strong> as a variant <strong>of</strong> analytical NLP where<br />

the form, substance, and quantity <strong>of</strong> “rules” are generated<br />

automatically according to the maximization <strong>of</strong> an empirical<br />

objective function.<br />

Monday, April 1, 2013<br />

51


<strong>University</strong> <strong>of</strong> <strong>Washington</strong><br />

Ling472 Introduction to Computational Linguistics<br />

Lecture 1:<br />

Introduction<br />

Regular Expressions<br />

• Regular expressions: a syntax for matching<br />

patterns in text<br />


<strong>University</strong> <strong>of</strong> <strong>Washington</strong><br />

Ling472 Introduction to Computational Linguistics<br />

Lecture 1:<br />

Introduction<br />

Basic RegEx<br />

^ matches the start <strong>of</strong> a line<br />

$ matches the end <strong>of</strong> a line<br />

. matches any one character (except newline)<br />

[xyz] matches any one character from the set<br />

[^pdq] matches any one character not in the set<br />

| accepts either its left or its right side<br />

\ escape to specify special characters<br />

anything else: must match exactly<br />

Monday, April 1, 2013<br />

53


<strong>University</strong> <strong>of</strong> <strong>Washington</strong><br />

Ling472 Introduction to Computational Linguistics<br />

Lecture 1:<br />

Introduction<br />

More RegEx<br />

* accepts zero or more <strong>of</strong> the preceding element<br />

this is the canonical ‘greedy’ operator<br />

? accepts zero or one <strong>of</strong> the preceding element(s)<br />

+ accepts one or more <strong>of</strong> the preceding element(s)<br />

{n} accepts n <strong>of</strong> the preceding element(s)<br />

{n,} accepts n or more <strong>of</strong> the preceding element(s)<br />

{n,m} accepts n to m <strong>of</strong> the preceding element(s)<br />

(pattern)<br />

defines a capture group which can be referred to<br />

later via \1<br />

Monday, April 1, 2013<br />

54


<strong>University</strong> <strong>of</strong> <strong>Washington</strong><br />

Ling472 Introduction to Computational Linguistics<br />

Lecture 1:<br />

Introduction<br />

RegEx examples<br />

• abc<br />

• a|bc<br />

• (a|bb)c<br />

• a[bc]<br />

• a*b<br />

• a?b<br />

• [^a]*th[aeiou]+[a‐z]*<br />

• ^a<br />

• ^a.*z$<br />

• \bthe\b<br />

Monday, April 1, 2013<br />

55


<strong>University</strong> <strong>of</strong> <strong>Washington</strong><br />

Ling472 Introduction to Computational Linguistics<br />

Lecture 1:<br />

Introduction<br />

Capture groups<br />

• Use parentheses to ‘capture’ part <strong>of</strong> a match<br />

and copy it to the output<br />

• () creates a group you can reference<br />

• \n references the nth group<br />

• Example: to search for repeated words:<br />

([a‐z]+) \1 \1<br />

“I called kitty kitty kitty.”<br />

Monday, April 1, 2013<br />

56


<strong>University</strong> <strong>of</strong> <strong>Washington</strong><br />

Ling472 Introduction to Computational Linguistics<br />

Lecture 1:<br />

Introduction<br />

• Note: capture groups make “regular<br />

expressions” non‐regular in a formal sense.<br />

We will study this more in the next lecture.<br />

Monday, April 1, 2013<br />

57


<strong>University</strong> <strong>of</strong> <strong>Washington</strong><br />

Ling472 Introduction to Computational Linguistics<br />

Lecture 1:<br />

Introduction<br />

RegEx Examples<br />

• Find the English stops followed by a liquid<br />

grep [PDKBDGpdkbdg][lr]<br />

• Find any two vowels together<br />

grep [aeiou][aeiou]<br />

• Find the same letter, repeated<br />

egrep '([a‐z])\1'<br />

• Lines where sentences end with ‘to’<br />

egrep '( |^)to.'<br />

Monday, April 1, 2013<br />

58


<strong>University</strong> <strong>of</strong> <strong>Washington</strong><br />

Ling472 Introduction to Computational Linguistics<br />

Lecture 1:<br />

Introduction<br />

Primitive tokenization<br />

$ cat moby_dick.html | # echo the text<br />

tr [:upper:] [:lower:] | # convert to lower case<br />

tr ' ' '\n' |<br />

# put each word on a line<br />

grep ‐v ^$ |<br />

# get rid <strong>of</strong> blank lines<br />

grep ‐v '


<strong>University</strong> <strong>of</strong> <strong>Washington</strong><br />

Ling472 Introduction to Computational Linguistics<br />

Lecture 1:<br />

Introduction<br />

• Search<br />

• Search and replace<br />

• In python:<br />

RegEx<br />

import re<br />

re.sub(pattern, replacement, string)<br />

Monday, April 1, 2013<br />

60


<strong>University</strong> <strong>of</strong> <strong>Washington</strong><br />

Ling472 Introduction to Computational Linguistics<br />

Lecture 1:<br />

Introduction<br />

Python RegEx<br />

• input = ‘The Kiwis is a team.’<br />

• input = re.sub(‘is’, ‘are’, input)<br />

• print input<br />

‘The Kiwis are a team’<br />

Monday, April 1, 2013<br />

61


<strong>University</strong> <strong>of</strong> <strong>Washington</strong><br />

Ling472 Introduction to Computational Linguistics<br />

Lecture 1:<br />

Introduction<br />

• Project 1<br />

End <strong>of</strong> today’s lecture<br />

– Review J&M section 2.1.6<br />

Monday, April 1, 2013<br />

62

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!