Slides - University of Washington
University of Washington<br />
Ling472 Introduction to Computational Linguistics<br />
Lecture 1:<br />
Introduction<br />
Ling472 / CSE 472<br />
Fundamentals of Computational Linguistics<br />
Lecture 1<br />
4/1/2013<br />
Monday, April 1, 2013<br />
Calendar<br />
Class Meetings: Monday, Wednesday 1:30 – 2:50 pm<br />
T.A. Section: Friday 1:30 – 2:20 pm MGH 234<br />
(April 1 ‐ June 5, 2013)<br />
Midterm: May 1, 2013<br />
Course topics<br />
• Finite state morphology<br />
• Regular expressions<br />
• Formal grammars; Chomsky hierarchy; Context‐free<br />
grammars<br />
• Bayes' theorem<br />
• N‐grams and Language Modeling<br />
• Part‐of‐speech tagging<br />
• Semantic representations<br />
• Clustering and classifiers<br />
• Evaluation: Precision and Recall<br />
• Algorithms for corpus processing<br />
• Feature‐structures and unification‐based grammars<br />
Glenn Slayden<br />
Instructor<br />
B.S. Econ., University of Pennsylvania<br />
B.Mus., Cornish College of the Arts<br />
M.S. Computational Linguistics, University of Washington (2012)<br />
Experience:<br />
Software Design Engineer, Microsoft<br />
Microsoft Research – Machine Translation<br />
Research Interests:<br />
Thai language: precision grammar, lexicography, and online<br />
language learning; analytical MT: parsing, generation, and<br />
semantic transfer<br />
T.A.<br />
• Sanghoun Song<br />
– sanghoun@uw.edu<br />
Textbook<br />
Daniel Jurafsky and James H. Martin. (2008) Speech<br />
and Language Processing (2nd edition). New<br />
Jersey: Prentice‐Hall.<br />
Grading<br />
• Midterm: 20%<br />
• Programming Projects: 65%<br />
• Writing Assignment: 10%<br />
• Class Participation: 5%<br />
• As you can see, the programming projects are<br />
the most important part.<br />
Programming Languages<br />
• You may use any of the following programming languages:<br />
C<br />
C++<br />
C#<br />
Java<br />
Python<br />
others: see me first<br />
Assignment solutions will be provided in Python or C#.<br />
• This is not a basic programming class. We will not cover how<br />
to create, edit, compile, and run programs.<br />
Programming Projects<br />
• Code must run on Patas<br />
– or you will receive no credit. This is because some projects<br />
may reference licensed corpora which you may not copy<br />
– I won’t spend time figuring out why your code doesn’t run.<br />
– You can develop on a home machine, but make sure you<br />
test thoroughly on the cluster<br />
• Please follow instructions<br />
• Always include a text description (write‐up) of your<br />
work<br />
• Submit source code and write‐up in a ZIP or TAR file<br />
Today’s lecture<br />
• A quick survey and history of the field of<br />
computational linguistics<br />
• Why is language hard?<br />
• Finding patterns in text: Regular Expressions<br />
• Project 1: Eliza‐like<br />
What is Computational Linguistics?<br />
Applying quantitative techniques to the analysis and<br />
processing of human (“natural”) languages.<br />
• Naturally, computers are well‐suited to this sort of Natural<br />
Language Processing (NLP)<br />
• Cross‐disciplinary<br />
– Linguistics<br />
– Computer science<br />
– Mathematics<br />
– Electrical Engineering<br />
• Computational linguists come from all of these<br />
backgrounds<br />
Linguistics<br />
• Nearly all research areas in linguistics can benefit<br />
from computational techniques:<br />
– Phonetics<br />
– Phonology<br />
– Morphology<br />
– Syntax<br />
– Semantics<br />
– Pragmatics<br />
– Discourse Analysis<br />
– Information Structure<br />
– Typology<br />
Computer Science<br />
• All fundamental CS data structures and<br />
principles apply: Landau (“Big‐O”) notation<br />
• Large scale text processing; RegEx; strings;<br />
character encoding; data conversion<br />
• Databases; high‐throughput computing<br />
• Specialized data structures and techniques<br />
– String hashes<br />
– Trie<br />
– Dynamic programming<br />
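To make one of the specialized structures above concrete, here is a minimal sketch of a trie, which stores strings by shared prefix so that prefix lookups over a lexicon stay cheap. The class and method names are illustrative assumptions, not part of the course materials.

```python
class Trie:
    """A minimal prefix tree over characters (illustrative sketch)."""

    def __init__(self):
        self.children = {}    # maps a character to a child Trie node
        self.is_word = False  # True if a stored word ends at this node

    def insert(self, word):
        node = self
        for ch in word:
            node = node.children.setdefault(ch, Trie())
        node.is_word = True

    def _find(self, prefix):
        # Walk down the tree; return None if the path does not exist
        node = self
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return None
        return node

    def contains(self, word):
        node = self._find(word)
        return node is not None and node.is_word

    def has_prefix(self, prefix):
        return self._find(prefix) is not None


lexicon = Trie()
for w in ["tele", "telescope", "telephone", "man"]:
    lexicon.insert(w)
```

Lookup cost depends on the length of the query string rather than on the number of stored words, which is one reason tries are popular for lexicons.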
Mathematics<br />
• Probability<br />
– Modeling the occurrence(s) of events<br />
• Statistics<br />
– Application of probabilistic models<br />
• Set Theory<br />
– Intersection, Union, Exclusion<br />
• Logic<br />
– Boolean logic<br />
– First order logic (predicate calculus)<br />
– Markov (probabilistic) logic, entailment<br />
• Information Theory<br />
– Entropy<br />
– Shannon channels<br />
• Advanced Math<br />
– Kernel functions (for Support Vector Machines…)<br />
– Matrix decomposition (e.g. Singular Value Decomposition…)<br />
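As a small illustration of the information-theory item above, the following sketch computes Shannon entropy, in bits per symbol, from the empirical distribution of a sequence. The function name and the sample strings are assumptions chosen for the example.

```python
import math
from collections import Counter


def entropy(symbols):
    """Shannon entropy (bits per symbol) of the empirical distribution."""
    counts = Counter(symbols)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


uniform = entropy("abcd")   # four equally likely symbols
skewed = entropy("aaab")    # one symbol dominates
constant = entropy("aaaa")  # no uncertainty at all
```

A uniform distribution over four symbols needs two bits per symbol; the more predictable the sequence, the lower the entropy.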
• Two ambitions are at different stages of<br />
realization<br />
– Natural Language Processing (NLP)<br />
– Natural Language Understanding (NLU)<br />
State of the Art<br />
[Figure: chart of ambition versus current realization for NLP and NLU; other labels: Practical Applications, Pure Linguistic Research]<br />
Natural Language Understanding<br />
• Still largely in the research stage<br />
• Hybrid of linguistically motivated analytical (“rule‐based”)<br />
systems with sensible application of<br />
stochastic methods seems necessary<br />
• Deep processing applications continue to mature<br />
– ERG: English Resource Grammar (Flickinger 2002)<br />
• http://www.delph-in.net/erg/<br />
– World knowledge, semantics, intricate grammatical<br />
knowledge<br />
NLP / NLU Subfields<br />
• IE: Information Extraction<br />
• IR: Information Retrieval<br />
• MT: Machine Translation<br />
• Automatic Document Summarization<br />
• QA: Question Answering<br />
• ASR: Automatic Speech Recognition<br />
• NLG: Natural Language Generation<br />
• Speech Synthesis<br />
• CALL: Computer‐Assisted Language Learning<br />
• Alternative Input Methods<br />
NLP / NLU Subtasks<br />
• Tokenization (word breaking)<br />
• Sentence Breaking<br />
• Morphological Analysis<br />
– POS tagging<br />
– Stemming<br />
• NER: Named Entity Recognition<br />
• WSD: Word Sense Disambiguation<br />
• Anaphora and reference resolution<br />
• Parsing and generation<br />
• Dialogue management and discourse analysis<br />
• Clustering<br />
• Classification<br />
• Treebanking and Corpora curatorship<br />
• …<br />
NLP/NLU Approaches<br />
• Analytical “rule‐based”<br />
– Intuit a set of rules<br />
– Implement them<br />
– Evaluate<br />
• Statistical<br />
– Train a statistical model on a large set of<br />
observations<br />
– Use the model to make predictions about unseen<br />
inputs<br />
Analytical (Rule‐Based) Techniques<br />
Intuition: In human brains, language operates by a set of rules.<br />
Let’s infer these and implement them in a computer.<br />
Q: Any problems with this?<br />
1. We don’t know how language operates in human brains<br />
2. How will we infer these rules? What form will they have?<br />
3. Will these putative rules be amenable to procedural<br />
implementation?<br />
4. How will we test and evaluate our progress?<br />
Analytical MT: a case study<br />
Machine Translation (MT) was the first goal: naïve ebullience!<br />
When I look at an article in Russian, I say: “This is really written in<br />
English, but it has been coded in some strange symbols.”<br />
Weaver 1947<br />
19 Years Later: wet‐blanket pessimism<br />
“There has been no machine translation of general<br />
scientific text, and none is in immediate prospect… After 8 years<br />
of work, [the project] had to resort to post‐editing [which] took<br />
slightly longer to do and was more expensive than conventional<br />
human translation.”<br />
ALPAC Report 1966<br />
Recommendation of the ALPAC Report: suspend<br />
government funding for Machine Translation<br />
41 Years Later: realistic progress<br />
Progress on combining rule‐based and data‐driven approaches to<br />
MT will depend on a sustained stream of state‐of‐the‐art, MT‐oriented<br />
linguistics research... Despite frequent cycles of overly<br />
high hopes and subsequent disillusionment, [MT] is the type of<br />
application that may demand knowledge‐heavy, ‘deep’<br />
approaches to NLP for its ultimate, long‐term success.<br />
Oepen et al. 2007<br />
Why is language so hard?<br />
• As a simplification, let’s just consider text<br />
(speech processing has a separate set of problems)<br />
• Text representation of language<br />
I saw a man with a telescope.<br />
– This is a sequence of symbols<br />
– This is a “string” of characters or Unicode code points<br />
– This is an ordered collection of “words”<br />
– This is a “sentence”<br />
– This is a “surface” representation of a speech act<br />
– This is a surface representation of a semantic proposition<br />
Words<br />
Isa wa manwi thate lesco pe.<br />
I saw a man with a telescope.<br />
Observation: it appears that, in this language,<br />
one role of the space character is to partition<br />
symbols (letters) into units called “words.”<br />
Note: we always try to be aware of, and explicitly state, any<br />
assumptions that may not be cross‐linguistically valid<br />
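The observation about the space character can be checked directly. This sketch splits both versions of the sentence on whitespace and confirms that they differ only in where the spaces fall; the variable names are illustrative.

```python
good = "I saw a man with a telescope."
scrambled = "Isa wa manwi thate lesco pe."

# Whitespace splitting treats every space-delimited chunk as a "word"
good_tokens = good.split()
scrambled_tokens = scrambled.split()

# With the spaces removed, both lines are the same sequence of symbols
same_letters = good.replace(" ", "") == scrambled.replace(" ", "")
```

Note that the final token keeps its period attached (`telescope.`), a first hint that splitting on spaces alone is only a primitive form of tokenization.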
It is not always the case that identifying “words” is straightforward:<br />
ผมเห็นผู้ชายกับโทรทรรศน์<br />
ผม เห็น ผู้ชาย กับ โทรทรรศน์<br />
pʰǒm hěn pʰûː tɕʰaːj kàp tʰoː rá tʰát<br />
1‐sg see man with telescope<br />
“I saw (a) man (who was) with a telescope.”<br />
The International Phonetic Alphabet (IPA) is the standard form of phonetic transcription.<br />
1 word or 2?<br />
This type of formatting for linguistic examples is called “Interlinear<br />
Glossed Text,” or IGT<br />
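A short sketch of why whitespace splitting is not cross-linguistically valid: applied to the unsegmented Thai sentence from this slide it yields a single token, and it recovers five units only once someone has inserted the spaces by hand. The variable names are assumptions for illustration.

```python
unsegmented = "ผมเห็นผู้ชายกับโทรทรรศน์"
segmented = "ผม เห็น ผู้ชาย กับ โทรทรรศน์"

# Thai is normally written without spaces between words
raw_tokens = unsegmented.split()
manual_tokens = segmented.split()
```

A real Thai word breaker needs a lexicon or a trained model; the hand-segmented line simply stands in for its output here.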
Word Classes<br />
• Can we identify a closed set of word classes to<br />
generalize about them?<br />
• These are called Parts of Speech.<br />
• In computational linguistics: POS tags<br />
• Closed and open classes<br />
Closed Word Classes<br />
• A closed class<br />
– Has a limited number of members<br />
– Is generally not open to new member production<br />
• Closed word classes<br />
– Conjunction<br />
– Determiner<br />
– Pronoun<br />
– Auxiliary verb<br />
– …and others<br />
Open Word Classes<br />
• An open class:<br />
– Has a large number of members<br />
– May accept new members over time<br />
– May allow new members to be heuristically generated<br />
• Open word classes: may allow compounding, inflecting, or<br />
productive morphology<br />
– Noun<br />
– Verb<br />
– Adjective<br />
– Adverb<br />
– …and others<br />
Sentences<br />
*the carnival. I saw a man with a<br />
Observation: it appears that, in this language, one role of the<br />
period is to partition words into sentences.<br />
Ok, let’s dismiss those issues too. We’ll assume we’re given the<br />
text of one sentence, unambiguously composed of words. Good<br />
enough?<br />
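Before dismissing the issue, it may help to see how quickly the period heuristic breaks. The sketch below splits after any period followed by whitespace; the function name and test sentences are assumptions for illustration, not a recommended sentence breaker.

```python
import re


def naive_sentences(text):
    """Split after each period that is followed by whitespace (naive heuristic)."""
    return [s for s in re.split(r"(?<=\.)\s+", text.strip()) if s]


simple = naive_sentences("I saw a man with a telescope. He waved.")
tricky = naive_sentences("Dr. Smith saw a man. He waved.")
```

The abbreviation in the second input is wrongly treated as a sentence boundary, so the period has more than one role even in English.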
Convention for linguistic examples<br />
• Ungrammatical sentences are marked with an asterisk<br />
*Sentence is ungrammatical this.<br />
• Sentences judged marginal are marked with a question mark<br />
?Everyone doesn’t like their car.<br />
• Grammatical sentences can be semantic nonsense<br />
Colorless green ideas sleep furiously.<br />
• Grammatical sentences which don’t convey the intended<br />
meaning are marked with a hash<br />
#It’s raining outside but it’s not raining.<br />
Test the model<br />
*Man a with saw telescope I a.<br />
Observation: the ordering of words appears to<br />
be important in this language.<br />
English is generally Subject‐Verb‐Object (SVO), which is less common among<br />
world languages than SOV. In Russian, word order is more restricted in<br />
transitive clauses. Some languages, such as Datooga in Tanzania, have free<br />
word order. Hixkaryana and Tamil are examples of the rare OVS<br />
languages.<br />
More data<br />
With a telescope I saw a man.<br />
#With a man I saw a telescope.<br />
#I saw a telescope with a man.<br />
*A man saw a telescope with I.<br />
I, with a telescope saw a man.<br />
Observation: some orderings have more felicity than others<br />
Can we find consistent generalizations to capture this? If<br />
so, what would we call such a set?<br />
Constituents<br />
• Can sentences be analyzed as containing sub‐units which<br />
consist of one or more words?<br />
This is a phrase structure tree.<br />
(Hausser 1998)<br />
Hypothesis: The class <strong>of</strong> grammatical constituents is closed.<br />
Constituent Types<br />
• Constituents are characterized by the part‐of‐speech of their<br />
main word, the “head.”<br />
• Thus, we notice that for many languages, a sentence<br />
comprises a:<br />
– subject (a Noun Phrase or NP)<br />
– predicate (a Verb Phrase or VP)<br />
• These may be composed of other constituents<br />
– Prepositional phrase (PP)<br />
– Determiner phrase (DP)<br />
Constituent Construction<br />
• Noun phrases (NPs)<br />
(DET NN) The ostrich<br />
(NNP) Kim<br />
(NN NN) container ship<br />
(DET JJ NN) A purple lawnmower<br />
(DET JJ NN) That darn cat<br />
• Verb phrases (VPs)<br />
(VB) tango<br />
(VBD NP NP) gave the dog a bone<br />
(VBD NP PP) gave a bone to the dog<br />
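The tag sequences above can be read as templates. As a rough sketch, assuming the input is already tagged with the same labels used on this slide, a phrase counts as an NP when its tag sequence is one of the listed patterns. The names `NP_PATTERNS` and `is_np` are hypothetical.

```python
# Tag patterns copied from the noun phrase examples on this slide
NP_PATTERNS = {
    ("DET", "NN"),
    ("NNP",),
    ("NN", "NN"),
    ("DET", "JJ", "NN"),
}


def is_np(tagged):
    """tagged: list of (word, tag) pairs; True if the tags form a known NP pattern."""
    return tuple(tag for _, tag in tagged) in NP_PATTERNS


ostrich = is_np([("The", "DET"), ("ostrich", "NN")])
cat = is_np([("That", "DET"), ("darn", "JJ"), ("cat", "NN")])
not_np = is_np([("gave", "VBD"), ("the", "DET"), ("dog", "NN")])
```

A flat pattern list like this cannot express recursion (an NP containing a PP that contains another NP), which is part of the motivation for grammars.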
Syntax<br />
The set of rules governing permissible constructions in a language.<br />
• Syntax constrains the ways in which words may be<br />
combined to form constituents and sentences.<br />
• Syntax forms one part of the description, or<br />
grammar, of a language.<br />
Prescriptive v. descriptive grammar<br />
• Prescriptive<br />
– Rules against certain usages. Few if any rules for<br />
what is allowed.<br />
• Prepositions are not for ending sentences with.<br />
• Descriptive<br />
– Rules characterizing what people do say.<br />
– Goal is to characterize all and only what speakers<br />
find acceptable.<br />
– Based on the scientific method<br />
Slide: Emily Bender<br />
Artificiality of prescriptive rules<br />
• Fill in the blanks: he/his, they/their, or<br />
something else?<br />
Everyone insisted that __ record was unblemished.<br />
Everyone drives __ own car to work.<br />
Everyone was happy because __ passed the test.<br />
Everyone left the room, didn’t __ ?<br />
Everyone left early. __ seemed happy to get home.<br />
Slide: Bender, Sag, Wasow 2003<br />
Two kinds of ambiguity<br />
1. Lexical ambiguity<br />
The bank is crumbling.<br />
2. Structural ambiguity<br />
I saw a man with a telescope.<br />
Is that all?<br />
I (often) saw a man with a telescope.<br />
Ambiguity<br />
Q: What kind of ambiguity does the following<br />
sentence illustrate?<br />
Have that report on my desk by Friday.<br />
A: Both structural and lexical ambiguity.<br />
In English, prosody in speech provides<br />
disambiguation by, for example, marking topic<br />
and focus.<br />
Constituents<br />
• Constituents help us understand and classify syntactic structure<br />
• Here, phrase structure trees show us how constituent structure<br />
helps us characterize ambiguity<br />
[Figure: two phrase structure trees; only the node label “man” survives from each]<br />
Trees: Dan Jinguji<br />
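One way to see how constituent structure characterizes ambiguity is to write two bracketings of the telescope sentence as nested tuples. This is an illustrative sketch, not a reconstruction of the trees on the slide; the labels and the `leaves` helper are assumptions.

```python
# Each node is (label, child, child, ...); leaves are plain strings
pp = ("PP", ("P", "with"), ("NP", ("DET", "a"), ("NN", "telescope")))

# Reading 1: the PP attaches to the noun phrase (the man has the telescope)
np_attach = ("S", ("NP", "I"),
             ("VP", ("VBD", "saw"),
              ("NP", ("NP", ("DET", "a"), ("NN", "man")), pp)))

# Reading 2: the PP attaches to the verb phrase (the telescope was used to see)
vp_attach = ("S", ("NP", "I"),
             ("VP", ("VBD", "saw"),
              ("NP", ("DET", "a"), ("NN", "man")), pp))


def leaves(tree):
    """Return the words of a tree in left-to-right order."""
    if isinstance(tree, str):
        return [tree]
    words = []
    for child in tree[1:]:
        words.extend(leaves(child))
    return words
```

Both trees yield the same word sequence, so the surface string alone cannot tell the two readings apart.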
Garden Path Sentences<br />
These curiosities are another kind of ambiguity, one more relevant to<br />
psycholinguistics.<br />
The old man the boat.<br />
The generic NP “the old” and the verb “to man” are much more rarely used<br />
than the NP “the old man.”<br />
The horse raced past the barn fell.<br />
The reduced relative clause “raced past the barn” is not introduced by “which was.”<br />
Analytical NLP: Summary<br />
• Language has significant complexity<br />
• Ambiguity is an inherent feature of language<br />
• Inferring the syntactic rules of a language is difficult<br />
• It appears that syntactic rules are amenable to<br />
computational approximation, but there are a great<br />
number of them, and they are subtle<br />
• See next slide for the obligatory optimism…<br />
Latest Work: Hybrid analytical/statistical systems<br />
• Examples:<br />
– Stephan Oepen, Erik Velldal, Jan Tore Lønning, Paul Meurer, Victoria<br />
Rosén, and Dan Flickinger. 2007. “Towards hybrid quality‐oriented<br />
machine translation”<br />
http://share.emmtee.net/pub/bscw.cgi/d23044/tmi07.pdf<br />
– Parse ranking in the English Resource Grammar<br />
– Unsupervised learning <strong>of</strong> rules<br />
• Poon and Domingos 2009 “Unsupervised Semantic<br />
Parsing” (UW CSE)<br />
http://www.aclweb.org/anthology-new/D/D09/D09-1001.pdf<br />
Statistical Natural Language Processing<br />
• Large gains in practical application of stochastic methods in<br />
the past 15 years<br />
• Example: MT<br />
– Micros<strong>of</strong>t Bing Translator<br />
– Google translate<br />
– GIZA++/Moses SMT toolkit<br />
• Insight: clever math—with limited linguistic motivation—can<br />
work surprisingly well<br />
These improvements are not just due to Moore’s law<br />
The mid‐1990s<br />
• Advances in computing power<br />
• Perceived lack <strong>of</strong> progress in analytical NLP<br />
• What if the “rules” of linguistics as we intuitively imagine<br />
them are too hard (or too incorrect, or too biased…) to<br />
capture?<br />
• Let’s forget any preconceived notion of what the rules should<br />
be and use math to characterize what the rules appear to be<br />
in practice.<br />
• German Verbmobil project had success with statistical<br />
machine translation (This large project also developed rule‐based<br />
systems)<br />
Source: Koehn 2010 “Statistical Machine Translation”<br />
Corpus Linguistics<br />
The study of language as expressed in samples (corpora) of “real<br />
world” text.<br />
Source: Wikipedia<br />
This can be thought of as a variant of analytical NLP where<br />
the form, substance, and quantity of “rules” are generated<br />
automatically according to the maximization of an empirical<br />
objective function.<br />
Regular Expressions<br />
• Regular expressions: a syntax for matching<br />
patterns in text<br />
Basic RegEx<br />
^ matches the start of a line<br />
$ matches the end of a line<br />
. matches any one character (except newline)<br />
[xyz] matches any one character from the set<br />
[^pdq] matches any one character not in the set<br />
| accepts either its left or its right side<br />
\ escape to specify special characters<br />
anything else: must match exactly<br />
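The operators above can be tried out with Python's `re` module. This is a small sketch; the sample line is an assumption chosen so that each operator has something to match.

```python
import re

line = "a cat sat."

starts_with_a = bool(re.search(r"^a", line))   # ^ anchors at the start
ends_with_t = bool(re.search(r"t$", line))     # $ anchors at the end
ends_with_dot = bool(re.search(r"\.$", line))  # \. is a literal period
any_char = re.findall(r".at", line)            # . matches any one character
from_set = re.findall(r"[cs]at", line)         # one character from the set
not_in_set = re.findall(r"[^c]at", line)       # one character not in the set
either = re.findall(r"cat|sat", line)          # left side or right side
```

The line ends with a period rather than `t`, so the `t$` search fails while the escaped `\.$` search succeeds.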
More RegEx<br />
* accepts zero or more of the preceding element(s);<br />
this is the canonical ‘greedy’ operator<br />
? accepts zero or one of the preceding element(s)<br />
+ accepts one or more of the preceding element(s)<br />
{n} accepts n of the preceding element(s)<br />
{n,} accepts n or more of the preceding element(s)<br />
{n,m} accepts n to m of the preceding element(s)<br />
(pattern) defines a capture group which can be referred to<br />
later via \1<br />
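A short sketch of the quantifiers in Python. The sample strings are assumptions, and the non-greedy `*?` form in the last line is an extension that does not appear on the slide; it is included only to show what "greedy" means by contrast.

```python
import re

zero_or_more = re.fullmatch(r"ab*", "a") is not None  # b may be absent
one_or_more = re.fullmatch(r"ab+", "a") is not None   # needs at least one b
optional = re.findall(r"colou?r", "color colour")     # u is optional
exactly_two = re.findall(r"\bo{2}\b", "o oo ooo")      # exactly two o's
two_to_three = re.findall(r"a{2,3}", "a aa aaaa")      # between 2 and 3 a's

# Greedy * takes as much as it can; *? (not on the slide) takes as little
greedy = re.search(r"<.*>", "<a><b>").group(0)
lazy = re.search(r"<.*?>", "<a><b>").group(0)
```

In the `a{2,3}` case the run of four a's yields one match of three; the single leftover `a` is too short to match again.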
RegEx examples<br />
• abc<br />
• a|bc<br />
• (a|bb)c<br />
• a[bc]<br />
• a*b<br />
• a?b<br />
• [^a]*th[aeiou]+[a-z]*<br />
• ^a<br />
• ^a.*z$<br />
• \bthe\b<br />
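To check your reading of the patterns above, each one can be paired with a test string and the text it is expected to match. The test strings are assumptions chosen for illustration.

```python
import re

# (pattern, test string, text the first match should cover)
examples = [
    (r"abc", "xabcx", "abc"),
    (r"a|bc", "bc", "bc"),
    (r"(a|bb)c", "bbc", "bbc"),
    (r"a[bc]", "ac", "ac"),
    (r"a*b", "aaab", "aaab"),
    (r"a?b", "aab", "ab"),
    (r"[^a]*th[aeiou]+[a-z]*", "bother", "bother"),
    (r"^a", "apple", "a"),
    (r"^a.*z$", "a to z", "a to z"),
    (r"\bthe\b", "see the cat", "the"),
]

results = [re.search(pattern, text).group(0) for pattern, text, _ in examples]

# \b requires a word boundary, so "the" inside "other" does not count
inside_word = re.search(r"\bthe\b", "other")
```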
Capture groups<br />
• Use parentheses to ‘capture’ part of a match<br />
and copy it to the output<br />
• () creates a group you can reference<br />
• \n references the nth group<br />
• Example: to search for a word repeated three times:<br />
([a-z]+) \1 \1<br />
“I called kitty kitty kitty.”<br />
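The same backreference works in Python. The sketch below finds the tripled word and, as an assumed extension, collapses any run of a repeated lowercase word to a single copy with `re.sub`.

```python
import re

sentence = "I called kitty kitty kitty."

# \1 must match exactly the text captured by the first group
triple = re.compile(r"([a-z]+) \1 \1")
m = triple.search(sentence)
found = m.group(0)  # the whole match
word = m.group(1)   # just the captured word

# Collapse a word followed by one or more copies of itself
collapsed = re.sub(r"\b([a-z]+)( \1\b)+", r"\1", sentence)
```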
• Note: capture groups make “regular<br />
expressions” non‐regular in a formal sense.<br />
We will study this more in the next lecture.<br />
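A concrete case of the note above: with a backreference, a pattern can accept exactly the strings made of some substring followed by a copy of itself. That "copy" language is a standard example of a language that is not regular, yet Python's `re` handles it. The name `square` is illustrative.

```python
import re

# Matches strings of the form ww: any non-empty w followed by the same w
square = re.compile(r"^(.+)\1$")

doubled = square.match("abcabc") is not None
shortest = square.match("aa") is not None
near_miss = square.match("abcabd") is not None
odd_length = square.match("aba") is not None
```

No finite-state machine can remember an arbitrarily long first half, which is why this goes beyond formal regular expressions.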
RegEx Examples<br />
• Find the English stops followed by a liquid<br />
grep '[PTKBDGptkbdg][lr]'<br />
• Find any two vowels together<br />
grep '[aeiou][aeiou]'<br />
• Find the same letter, repeated<br />
egrep '([a-z])\1'<br />
• Lines where sentences end with ‘to’<br />
egrep '( |^)to\.'<br />
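The same searches can be sketched in Python over a small word list. The word list is an assumption for illustration, and the stop class here is written with the six English stops p, t, k, b, d, g. The last two lines show why the period after "to" should be escaped: an unescaped `.` matches any character.

```python
import re

words = ["play", "tree", "idea", "book", "stop", "glow"]

stop_liquid = [w for w in words if re.search(r"[ptkbdg][lr]", w, re.IGNORECASE)]
two_vowels = [w for w in words if re.search(r"[aeiou][aeiou]", w)]
doubled = [w for w in words if re.search(r"([a-z])\1", w)]

# Escaped period: only a literal "." after "to" counts
ends_with_to = bool(re.search(r"( |^)to\.", "Who did you give it to."))
# Unescaped period: "." matches the "m" in "tomorrow"
false_hit = bool(re.search(r"( |^)to.", "tomorrow"))
escaped_no_hit = bool(re.search(r"( |^)to\.", "tomorrow"))
```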
Primitive tokenization<br />
$ cat moby_dick.html |          # echo the text<br />
tr '[:upper:]' '[:lower:]' |    # convert to lower case<br />
tr ' ' '\n' |                   # put each word on a line<br />
grep -v '^$' |                  # get rid of blank lines<br />
grep -v '
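For comparison, here is a rough Python sketch of the steps that are legible in the pipeline above: lower-casing, splitting on whitespace, and dropping empty items. The truncated final step is not reproduced, and the function name and sample text are assumptions.

```python
from collections import Counter


def primitive_tokens(text):
    """Lowercase the text and split it on whitespace.

    str.split() with no argument also discards empty strings, which plays
    the role of the blank-line filter in the shell version.
    """
    return text.lower().split()


sample = "Call me Ishmael.  Some years ago"
tokens = primitive_tokens(sample)
counts = Counter(tokens)
```

As in the shell version, punctuation stays attached to the neighboring word (`ishmael.`), which is what makes this tokenization primitive.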
RegEx<br />
• Search<br />
• Search and replace<br />
• In Python:<br />
import re<br />
re.sub(pattern, replacement, string)<br />
Python RegEx<br />
• text = 'The Kiwis is a team.'<br />
• text = re.sub(r'\bis\b', 'are', text)<br />
• print(text)<br />
The Kiwis are a team.<br />
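One detail is worth checking when substituting short words: a bare pattern also matches inside longer words. The sketch below contrasts a plain `is` pattern with one wrapped in `\b` word boundaries; the variable names are illustrative.

```python
import re

sentence = "The Kiwis is a team."

# Without boundaries, the "is" at the end of "Kiwis" is replaced too
plain = re.sub("is", "are", sentence)

# With \b on both sides, only the standalone word "is" is replaced
bounded = re.sub(r"\bis\b", "are", sentence)
```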
End of today’s lecture<br />
• Project 1<br />
– Review J&M section 2.1.6<br />
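Since Project 1 is described as Eliza-like, here is a minimal sketch of the general idea: a list of regular-expression rules whose capture groups are copied into canned replies. The rules, replies, and function name are hypothetical and are not the Project 1 specification.

```python
import re

# Hypothetical rules for illustration only: (pattern, reply template)
RULES = [
    (re.compile(r"\bI am (sad|depressed|unhappy)\b", re.IGNORECASE),
     r"I am sorry to hear you are \1."),
    (re.compile(r"\bmy (mother|father)\b", re.IGNORECASE),
     r"Tell me more about your \1."),
]
DEFAULT = "Please go on."


def respond(utterance):
    """Return the reply for the first rule that matches, else a default."""
    for pattern, template in RULES:
        m = pattern.search(utterance)
        if m:
            # expand() fills \1 in the template with the captured text
            return m.expand(template)
    return DEFAULT
```

For example, `respond("Well, I am unhappy.")` returns `I am sorry to hear you are unhappy.`, and an input that matches no rule falls through to the default reply. Rule order matters, because the first matching pattern wins.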