Resources required - Department of Linguistics and English Language
1<br />
Parsing Overview
2<br />
Parsing<br />
Assigning constituent structure to<br />
unstructured linguistic input<br />
Grammar engineering: field that deals with<br />
parser development, implementation<br />
Can be done top-down, bottom-up, or both<br />
Many techniques from CS, Linguistics
3<br />
Grammar specification formats<br />
Production rules<br />
if-then<br />
DCGs<br />
CFGs<br />
Transition networks<br />
Slot/filler specifications
4<br />
Parsers<br />
Grammar formalism<br />
Grammar<br />
Algorithm<br />
Interpreting parser: separate grammar/parser<br />
Procedural parser: integrated grammar/parser<br />
Compiled parser: declarative grammar is<br />
compiled into procedural form
5<br />
Parsers<br />
Usually tied to a particular linguistic theory<br />
(XTAG, LFG-WB, ALE)<br />
Advantage: principled coverage<br />
Disadvantage: inflexible<br />
Encoding of phrase-structure component is<br />
costly, complex<br />
Ungrammaticality is almost always avoided<br />
At best, explicitly code up expected<br />
ungrammaticalities as rules
6<br />
Components required<br />
Grammar, rule base<br />
Data structures<br />
Strings, symbols<br />
Hash tables<br />
Trees<br />
On-line lexical information, dictionaries<br />
Corpora<br />
Tagged corpora<br />
Parsed corpora<br />
Corpus annotation
7<br />
Other considerations<br />
Features and unification<br />
Parsing direction: TD, BU, DF, BF<br />
Input processing: L-R, R-L, single-pass,<br />
multiple-pass<br />
Ambiguity processing: backtracking,<br />
parallelism, lookahead-based reduction
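To make "features and unification" concrete, here is a minimal sketch of unifying two feature structures. It assumes feature structures are plain nested dictionaries with atomic values at the leaves; the function name `unify` and the sample features are illustrative only, and shared (reentrant) values are not modelled.

```python
def unify(a, b):
    """Unify two feature structures given as nested dicts.

    Returns the merged structure, or None if two atomic values clash.
    """
    result = dict(a)
    for key, b_val in b.items():
        if key not in result:
            result[key] = b_val            # feature only present in b
            continue
        a_val = result[key]
        if isinstance(a_val, dict) and isinstance(b_val, dict):
            merged = unify(a_val, b_val)   # recurse into sub-structures
            if merged is None:
                return None
            result[key] = merged
        elif a_val != b_val:
            return None                    # conflicting values: failure
    return result


# Hypothetical agreement features for a subject NP and a verb.
subject_agr = {"num": "sg", "per": 3}
verb_agr = {"num": "sg"}
print(unify(subject_agr, verb_agr))             # compatible: merged structure
print(unify({"num": "sg"}, {"num": "pl"}))      # clash: None
```

Unification succeeds when the two structures carry compatible information and fails on a clash, which is how agreement constraints are usually enforced in unification-based grammars.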
8<br />
Parsing efficiency aspects<br />
Deterministic vs. non-deterministic<br />
Avoiding dead-ends<br />
Backtracking style<br />
Overgeneration<br />
Undergeneration<br />
Quality of grammar<br />
(Re)usability of grammar
9<br />
Parsing strategies<br />
Top-down vs. bottom-up vs. mixed<br />
Search parameters<br />
Depth-first vs. breadth-first<br />
Single-path vs. multiple-path<br />
Backtracking?<br />
Granularity of word senses<br />
Left-to-right vs. bidirectional<br />
Contextual representation<br />
Interaction with other knowledge sources<br />
Learning vs. not
10<br />
Bottom-Up Parsing<br />
Start from data, incrementally building up<br />
constituents<br />
POS-tagged words<br />
Phrases<br />
Clauses<br />
Sentences<br />
Paragraphs, documents
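The bottom-up idea can be sketched as a greedy shift-reduce loop over POS-tagged words. The three-rule grammar, the tag names, and the function name below are assumptions chosen for illustration; a greedy reducer like this one does no backtracking, so it can fail on grammars where the first applicable reduction is the wrong one.

```python
# Toy grammar (assumed for this sketch): each rule is (parent, right-hand side).
RULES = [
    ("NP", ("DT", "NN")),
    ("VP", ("VBD", "NP")),
    ("S", ("NP", "VP")),
]


def shift_reduce(tagged_words):
    """Greedy bottom-up parse of (word, tag) pairs; returns the final stack."""
    stack = []
    for word, tag in tagged_words:
        stack.append((tag, word))                  # shift the next tagged word
        reduced = True
        while reduced:                             # reduce while any rule fits
            reduced = False
            for parent, rhs in RULES:
                n = len(rhs)
                top_labels = tuple(node[0] for node in stack[-n:])
                if len(stack) >= n and top_labels == rhs:
                    children = stack[-n:]
                    del stack[-n:]
                    stack.append((parent, children))
                    reduced = True
                    break
    return stack


sentence = [("the", "DT"), ("move", "NN"), ("followed", "VBD"),
            ("a", "DT"), ("round", "NN")]
result = shift_reduce(sentence)
print(result[0][0], len(result))   # label of the single remaining constituent
```

A parse succeeds when the stack ends with exactly one constituent labelled with the start category.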
11<br />
Top-Down Parsing<br />
Start with expectations, topmost category<br />
Split off increasingly deeper subcategories<br />
Match input to these<br />
Done when all input accounted for
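The top-down steps above can be sketched as a small recursive-descent recognizer with backtracking. The grammar, tag names, and function names are assumptions for illustration. Note that this naive strategy would loop forever on left-recursive rules such as VP → VP PP, so the toy grammar avoids them.

```python
# Toy grammar (assumed for this sketch); symbols without an entry are
# treated as POS tags that must match the input directly.
GRAMMAR = {
    "S": [["NP", "VP"]],
    "NP": [["DT", "NN"]],
    "VP": [["VBD", "NP"], ["VBD"]],
}


def expand(symbol, tags, pos):
    """Yield every input position reachable after deriving `symbol` at `pos`."""
    if symbol not in GRAMMAR:
        if pos < len(tags) and tags[pos] == symbol:
            yield pos + 1                      # matched one input tag
        return
    for rhs in GRAMMAR[symbol]:                # try each expansion in turn
        positions = [pos]
        for child in rhs:
            positions = [end for p in positions
                         for end in expand(child, tags, p)]
        yield from positions


def recognize(tags):
    """True if the start category S can account for all of the input."""
    return any(end == len(tags) for end in expand("S", tags, 0))


print(recognize(["DT", "NN", "VBD", "DT", "NN"]))
print(recognize(["NN", "DT"]))
```

The final check in `recognize` corresponds to the last bullet: the parse is done only when every input token has been accounted for.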
12<br />
How linguistic?<br />
Most traditional parsers are built on language<br />
theories<br />
Try to consider (at least some) human processing<br />
phenomena<br />
Traditional notions of constituency<br />
This isn’t the only way to do this!
13<br />
Dependency parsers<br />
Inter-word dependencies are the principal<br />
features, not structure<br />
No large superstructure, attendant<br />
decisions<br />
All relations grounded directly in words<br />
Generally more robust<br />
Often used in IR, partial parsing, shallow<br />
parsing, Q/A
14<br />
The LG parser<br />
Freely available for research purposes<br />
Robust (e.g. information retrieval, MT)<br />
Calculates simple, explicit relations<br />
Fast<br />
Written in C<br />
More appropriate for the task than traditional<br />
phrase-structure grammars
15<br />
Parsing ungrammaticality<br />
An LFG mal-rule:<br />
S → NP (agr ?a) VP (agr ~?a)<br />
Computational complexity increases as such<br />
rules are added<br />
Maintaining a knowledge base of such<br />
information is complicated, never-ending
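As a rough illustration of the mal-rule above, the sketch below reads `?a` as a variable over agreement values and `~?a` as "any value other than ?a". That reading, the function name, and the string encoding of agreement features are assumptions made for illustration; they are not taken from any LFG implementation.

```python
def parse_s(np_agr, vp_agr):
    """Combine NP and VP into S, flagging agreement errors via a mal-rule.

    Returns (category, diagnosis); diagnosis is None for a grammatical S.
    """
    # Regular rule: S -> NP (agr ?a) VP (agr ?a)
    if np_agr == vp_agr:
        return ("S", None)
    # Mal-rule: S -> NP (agr ?a) VP (agr ~?a)
    return ("S", "subject-verb agreement mismatch")


print(parse_s("3sg", "3sg"))   # grammatical sentence
print(parse_s("3sg", "3pl"))   # accepted only through the mal-rule
```

Every anticipated error type needs a rule of its own in this style, which is why the rule base and the search space keep growing.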
16<br />
Exploring Link Grammar<br />
What is a link?<br />
Two parts, + and –<br />
Shows a relationship between pairs of words<br />
Subject + verb<br />
Verb + object<br />
Preposition + object<br />
Adjective + adverbial modifier<br />
Auxiliary + main verb<br />
Labels each relationship<br />
Potential links are specified by technical<br />
rules<br />
Possible to score linkages, penalize links
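A minimal sketch of how the two halves of a link meet: a word offering `X+` looks to its right for a word offering `X-`. The three-word lexicon is a simplification assumed for illustration, using the Ss and Pv labels that appear in the sample parse of "He was killed by the Indians". The real parser also matches connector subscripts more flexibly and enforces constraints (no crossing links, a connected linkage, each connector used once) that this sketch ignores.

```python
# Hypothetical, heavily simplified lexicon: word -> list of connectors.
LEXICON = {
    "he": ["Ss+"],
    "was": ["Ss-", "Pv+"],
    "killed": ["Pv-"],
}


def find_links(words):
    """Return (left word, label, right word) for every matching +/- pair."""
    links = []
    for i, left in enumerate(words):
        for right in words[i + 1:]:
            for connector in LEXICON[left]:
                label = connector[:-1]
                # A "+" connector on the left word needs the same label
                # with "-" on some word to its right.
                if connector.endswith("+") and label + "-" in LEXICON[right]:
                    links.append((left, label, right))
    return links


print(find_links(["he", "was", "killed"]))
```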
17<br />
Sample link parse<br />
Linkage 1, cost vector = (UNUSED=0 DIS=1 AND=0 LEN=20)<br />
+-------------------------------Xp------------------------------+<br />
+--------------Wd--------------+ |<br />
| +----------CO---------+ |<br />
| +--------Xc--------+ | |<br />
| +-----Jp----+ | | +------Op-----+ |<br />
| | +--Dmu-+ | +-Sp*i+--PPf-+ +--Dmc-+ |<br />
| | | | | | | | | | |<br />
LEFT-WALL during my schooling.n , I.p have.v taken.v many classes.n .
18<br />
Sample link parse<br />
He was killed by the Indians 15 March 1698.<br />
+-----------------Xc----------------+<br />
+------------MVp-----------+ |<br />
| +----Jp---+ | |<br />
+-Ss+---Pv--+-MVp-+ +--Dmc-+ +-TM+--TY-+ |<br />
| | | | | | | | | |<br />
he was.v killed.v by the Indians.n 15 March 1698 .
19<br />
LG parser’s robustness (1)<br />
Linkage 1, cost vector = (UNUSED=4 DIS=0 AND=0 LEN=11)<br />
+--------------------------------Xp-------------------------------+<br />
+-----Wd-----+ |<br />
| +-D*u-+------------Ss-----------+---Ost--+ |<br />
| | | | | |<br />
LEFT-WALL the class.n [most] [important] is.v Mathematical [for] [my] .
20<br />
LG parser’s robustness (2)<br />
Linkage 1, cost vector = (UNUSED=1 DIS=0 AND=0 LEN=17)<br />
+----------------------------------Xp----------------------------------+<br />
| +--------------MVp-------------+ |<br />
| +----I----+------MVp------+ +----Js----+ |<br />
+------Wi-----+-Ox-+ +---Op--+ +--Jp--+ | +--Ds-+ |<br />
| | | | | | | | | | |<br />
LEFT-WALL [it] help.v me make.v friends.n with people.p around the world.n .
21<br />
LG example parses<br />
Linkage 1, cost vector = (UNUSED=0 DIS=2 AND=0 LEN=23)<br />
+-----------------------------------------Xp----------------------------------------+<br />
| +-----------------------MVp-----------------------+ |<br />
| +---------------MVp--------------+ | |<br />
| | +-------Jp-------+ +----Js---+ | |<br />
+--Wd--+Sp*+-PPf-+--Pg*b--+--MVp-+ +----AN----+ | +---D--+ +-Js+ |<br />
| | | | | | | | | | | | | |<br />
LEFT-WALL I.p 've been.v majoring.v in Material engineering.n at my University in Korea .<br />
Linkage 1, cost vector = (UNUSED=0 DIS=2 AND=0 LEN=27)<br />
+----------------------------------------------Xp----------------------------------------------+<br />
| +-----------Wdc-----------+ +------------------Opt-----------------+ |<br />
| | +--------CO--------+ | +--------------AN-------------+ |<br />
| | | +-----D*u----+-------Ss------+ | +-------AN-------+ |<br />
+--Wc--+ | +--La-+ +--Mp--+--J-+ | | | +----AN---+ |<br />
| | | | | | | | | | | | | |<br />
LEFT-WALL but probably the best.a class.n for.p me was.v medicine.n and first.n aid.n principles.n .
22<br />
LG parser’s robustness (1)<br />
Thomas Smith, Haverhill, married at Andover 6 January 1659, Unice Singletary of Salisbury.<br />
+-------------------------------Xc------------------------------+<br />
+---------------------Osn--------------------+ |<br />
+----------Ss---------+---------------Xc--------------+ | |<br />
+----MX---+ +---------MVp---------+ | | |<br />
+--G--+ +--Xd-+--Xc-+ +--MVp-+-Js-+ +-TM-+--TY--+ | +----G---+--MG--+--JG-+ |<br />
| | | | | | | | | | | | | | | | |<br />
Thomas Smith , Haverhill , married.v at Andover 6 January 1659 , Unice Singletary of Salisbury .
23<br />
LG parser’s robustness (2)<br />
Mary married I think, 23 November 1661, Samuel Gay.<br />
No complete linkages found.<br />
+-------------------------Xc------------------------+<br />
+-----------------------Osn----------------------+ |<br />
+------------------Xc------------------+ | |<br />
+-------------MVp------------+ | | |<br />
+--Ss--+ +--TM-+--TY--+ | +--G-+ |<br />
| | | | | | | | |<br />
Mary married.v [I] [think] [,] 23 November 1661 , Samuel Gay .
24<br />
Sample LG rule entries<br />
words/words.y: % year numbers<br />
NN+ or NIa- or AN+ or MV- or ((Xd- & TY- & Xc+) or TY-)<br />
or ({EN- or NIc-} & (ND+ or OD- or ({{@L+} & DD-} &<br />
([[Dmcn+]] or (( or TA-) & (JT- or IN-<br />
or ))))));<br />
: ((K+ & {[[@MV+]]} & O*n+) or ({O+ or B-} & {K+}) or<br />
[[@MV+ & {Xc+} & O*n+]]) & {Xc+} & {@MV+};
25<br />
Syntax isn’t enough<br />
Linkage 1, cost vector = (UNUSED=0 DIS=1 AND=0 LEN=13)<br />
+--------------------------------Xp--------------------------------+<br />
+------Wd------+---------Ss---------+ +---Jp---+ |<br />
| +--D*u--+--Mp--+--Jp-+ +--Pg*b--+---MVp--+ +-D*u-+ |<br />
| | | | | | | | | | |<br />
LEFT-WALL the practice.n in English.n is.v progressing.v in the life.n .
26<br />
Evaluating parsing<br />
Tree accuracy, exact match (0 or 1)<br />
Partial credit<br />
PARSEVAL<br />
Standard used to score parses<br />
Precision: fraction of brackets in the parse that match the gold standard<br />
Recall: fraction of gold-standard brackets that appear in the parse<br />
Crossings: number of parse brackets that cross gold-standard brackets
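The three PARSEVAL measures can be sketched as follows, assuming each bracket is a (label, start, end) triple over word positions and that both bracket sets are non-empty. The function name and the example brackets (two PP-attachment analyses of a five-word sentence) are illustrative assumptions. Evaluation tools typically apply extra conventions, such as ignoring punctuation, that are left out here.

```python
def parseval(parse, gold):
    """Return (precision, recall, crossing count) for two bracket sets."""
    matched = parse & gold
    precision = len(matched) / len(parse)
    recall = len(matched) / len(gold)

    def crosses(a, b):
        # Two spans cross when they overlap but neither contains the other.
        return a[1] < b[1] < a[2] < b[2] or b[1] < a[1] < b[2] < a[2]

    crossing = sum(1 for p in parse if any(crosses(p, g) for g in gold))
    return precision, recall, crossing


# Gold: the PP attaches to the object NP. Parse: the PP attaches to the VP.
gold = {("S", 0, 5), ("NP", 0, 1), ("VP", 1, 5), ("NP", 2, 5),
        ("NP", 2, 3), ("PP", 3, 5), ("NP", 4, 5)}
parse = {("S", 0, 5), ("NP", 0, 1), ("VP", 1, 5), ("VP", 1, 3),
         ("NP", 2, 3), ("PP", 3, 5), ("NP", 4, 5)}
print(parseval(parse, gold))
```

In this example six of the seven brackets on each side match, and the one wrong bracket, VP over words 1 to 3, crosses the gold NP over words 2 to 5.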
Treebanks, parsing, etc.
28<br />
WordNet subcat frames<br />
1 Something ----s<br />
2 Somebody ----s<br />
3 It is ----ing<br />
4 Something is ----ing PP<br />
5 Something ----s something Adjective/Noun<br />
6 Something ----s Adjective/Noun<br />
7 Somebody ----s Adjective<br />
8 Somebody ----s something<br />
9 Somebody ----s somebody<br />
10 Something ----s somebody<br />
11 Something ----s something<br />
12 Something ----s to somebody<br />
13 Somebody ----s on something<br />
14 Somebody ----s somebody something<br />
15 Somebody ----s something to somebody<br />
16 Somebody ----s something from somebody<br />
17 Somebody ----s somebody with something<br />
18 Somebody ----s somebody of something<br />
19 Somebody ----s something on somebody<br />
20 Somebody ----s somebody PP<br />
21 Somebody ----s something PP<br />
22 Somebody ----s PP<br />
23 Somebody's (body part) ----s<br />
24 Somebody ----s somebody to INFINITIVE<br />
25 Somebody ----s somebody INFINITIVE<br />
26 Somebody ----s that CLAUSE<br />
27 Somebody ----s to somebody<br />
28 Somebody ----s to INFINITIVE<br />
29 Somebody ----s whether INFINITIVE<br />
30 Somebody ----s somebody into V-ing something<br />
31 Somebody ----s something with something<br />
32 Somebody ----s INFINITIVE<br />
33 Somebody ----s VERB-ing<br />
34 It ----s that CLAUSE<br />
35 Something ----s INFINITIVE<br />
Soar 2003 Tutorial
English LCS lexicon<br />
Theta-grid information for verbs<br />
Derive ucat features<br />
used to build syntactic structure<br />
Co-referenced with WordNet 2.0<br />
theta-grids are aligned with ucat features and<br />
word sense information
English LCS lexicon data<br />
10.6.a#1#_ag_th,mod-poss(of)#exonerate#exonerate#exonerate#exonerate+ed#<br />
(2.0,00874318_exonerate%2:32:00::)<br />
"10.6.a" :NAME "Verbs of Possessional Deprivation: Cheat Verbs / -of"<br />
WORDS (absolve acquit balk bereave bilk bleed burgle cheat cleanse con<br />
cull cure defraud denude deplete depopulate deprive despoil<br />
disabuse disarm disencumber dispossess divest drain ease<br />
exonerate fleece free gull milk mulct pardon plunder purge purify ransack<br />
relieve render rid rifle rob sap strip swindle unburden void wean)<br />
THETA_ROLES ((1 "_ag_th,mod-poss()")<br />
(1 "_ag_th,mod-poss(from)")<br />
(1 "_ag_th,mod-poss(of)"))<br />
SENTENCES "He !!+ed the people (of their rights); He !!+ed him of his<br />
sins"
31<br />
Grammars<br />
Before you can parse you need a grammar.<br />
So where do grammars come from?<br />
Grammar Engineering<br />
Lovingly hand-crafted decades-long efforts by humans to<br />
write grammars (typically in some particular grammar<br />
formalism of interest to the linguists developing the<br />
grammar).<br />
TreeBanks<br />
Semi-automatically generated sets of parse trees for the<br />
sentences in some corpus. Typically in a generic lowest common<br />
denominator formalism (of no particular interest to any<br />
modern linguist).<br />
CSCI 5832 Spring 2006<br />
9/26/2011
32<br />
TreeBanks<br />
TreeBanks provide a grammar (<strong>of</strong> a sort).<br />
As we’ll see they also provide the training data for<br />
various ML approaches to parsing.<br />
But they can also provide useful data for more purely<br />
linguistic pursuits.<br />
You might have a theory about whether or not something can<br />
happen in a particular language.<br />
Or a theory about the contexts in which something can<br />
happen.<br />
TreeBanks can give you the means to explore those theories.<br />
If you can formulate the questions in the right way and get<br />
the data you need.<br />
The rise of annotated data:<br />
The Penn Treebank<br />
( (S<br />
(NP-SBJ (DT The) (NN move))<br />
(VP (VBD followed)<br />
(NP<br />
(NP (DT a) (NN round))<br />
(PP (IN of)<br />
(NP<br />
(NP (JJ similar) (NNS increases))<br />
(PP (IN by)<br />
(NP (JJ other) (NNS lenders)))<br />
(PP (IN against)<br />
(NP (NNP Arizona) (JJ real) (NN estate) (NNS loans))))))<br />
(, ,)<br />
(S-ADV<br />
(NP-SBJ (-NONE- *))<br />
(VP (VBG reflecting)<br />
(NP<br />
(NP (DT a) (VBG continuing) (NN decline))<br />
(PP-LOC (IN in)<br />
(NP (DT that) (NN market)))))))<br />
(. .)))
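Bracketed trees like the one above are straightforward to read programmatically, and the rules they contain can be counted directly, which is one sense in which a treebank provides a grammar. The sketch below is an assumption-laden illustration: the reader expects every bracket to carry a label (so the empty outer wrapper of Penn Treebank files would need stripping first), and the example string is an abbreviated fragment of the tree above.

```python
import re
from collections import Counter


def read_tree(text):
    """Parse one labelled bracketing into nested (label, children) tuples."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def parse():
        nonlocal pos
        pos += 1                           # consume "("
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(parse())   # nested constituent
            else:
                children.append(tokens[pos])   # a word
                pos += 1
        pos += 1                           # consume ")"
        return (label, children)

    return parse()


def productions(tree, counts=None):
    """Count every rule parent -> children used in the tree."""
    counts = Counter() if counts is None else counts
    label, children = tree
    rhs = tuple(c[0] if isinstance(c, tuple) else c for c in children)
    counts[(label, rhs)] += 1
    for child in children:
        if isinstance(child, tuple):
            productions(child, counts)
    return counts


fragment = ("(S (NP-SBJ (DT The) (NN move)) "
            "(VP (VBD followed) (NP (DT a) (NN round))))")
for (parent, rhs), n in productions(read_tree(fragment)).items():
    print(parent, "->", " ".join(rhs), n)
```

Dividing each count by the total for its parent category would give relative-frequency estimates of rule probabilities.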
A simple grammar<br />
S → NP VP 1.0<br />
VP → V NP 0.7<br />
VP → VP PP 0.3<br />
PP → P NP 1.0<br />
P → with 1.0<br />
V → saw 1.0<br />
NP → NP PP 0.4<br />
NP → astronomers 0.1<br />
NP → ears 0.18<br />
NP → saw 0.04<br />
NP → stars 0.18<br />
NP → telescope 0.1
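With the grammar above, the probability of a sentence and of its best parse can be computed by a probabilistic CKY pass. The sketch below is one way to do this under stated assumptions: the rule tables are transcribed from the slide, the function name `cky` is illustrative, and the example sentence "astronomers saw stars with ears" is chosen here because it uses only words in the lexicon and has two PP attachments.

```python
from collections import defaultdict

# (parent, left child, right child, probability), copied from the slide.
BINARY_RULES = [
    ("S", "NP", "VP", 1.0),
    ("VP", "V", "NP", 0.7),
    ("VP", "VP", "PP", 0.3),
    ("PP", "P", "NP", 1.0),
    ("NP", "NP", "PP", 0.4),
]
LEXICAL_RULES = {
    "with": [("P", 1.0)],
    "saw": [("V", 1.0), ("NP", 0.04)],
    "astronomers": [("NP", 0.1)],
    "ears": [("NP", 0.18)],
    "stars": [("NP", 0.18)],
    "telescope": [("NP", 0.1)],
}


def cky(words):
    """Return (sum over all parses, probability of best parse) for S."""
    n = len(words)
    inside = defaultdict(float)   # (i, j, category) -> total probability
    best = defaultdict(float)     # (i, j, category) -> best single tree
    for i, word in enumerate(words):
        for cat, p in LEXICAL_RULES[word]:
            inside[i, i + 1, cat] = p
            best[i, i + 1, cat] = p
    for length in range(2, n + 1):            # widen spans step by step
        for i in range(n - length + 1):
            j = i + length
            for k in range(i + 1, j):         # every split point
                for parent, left, right, p in BINARY_RULES:
                    inside[i, j, parent] += (
                        p * inside[i, k, left] * inside[k, j, right])
                    best[i, j, parent] = max(
                        best[i, j, parent],
                        p * best[i, k, left] * best[k, j, right])
    return inside[0, n, "S"], best[0, n, "S"]


total, viterbi = cky("astronomers saw stars with ears".split())
print(total, viterbi)
```

For this sentence the two parses have probabilities of about 0.0009072 (PP attached to the noun phrase) and 0.0006804 (PP attached to the verb phrase), so the total is about 0.0015876 and the grammar prefers the noun attachment.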
Classical parsing<br />
Wrote symbolic grammar and lexicon<br />
Grammar: S → NP VP; NP → (DT) NN; NP → NN NNS; NP → NNP; VP → V NP<br />
Lexicon: NN → interest; NNS → rates; NNS → raises; VBP → interest; VBZ → rates<br />
Simple 10-rule grammar: 592 parses<br />
Real-size broad-coverage grammar: millions of<br />
parses
Two possible PP attachments
39<br />
Ambiguity<br />
40<br />
Ambiguity<br />
Local ambiguity means that we have to deal<br />
with multiple plausible choices during the<br />
parsing process.<br />
Global ambiguity means that the grammar can’t<br />
tell us which of several (many?) possible parses<br />
is the correct one.<br />
41<br />
TreeBanks<br />
42<br />
TreeBanks<br />
The bad effects of V/N<br />
ambiguities
44<br />
Sample Rules<br />
How many rules?
Example<br />
A sample parsed sentence
52<br />
Background
TiGer Treebank<br />
Example sentence: Im nächsten Jahr will die Regierung ihre Reformpläne umsetzen. ("Next year the government wants to implement its reform plans.")<br />
Crossing branches for discontinuous constituency types<br />
Node labels (phrase categories): S, VP, NP, PP<br />
Edge labels (syntactic functions): HD, SB, OC, OA, MO, AC, NK<br />
Annotation on word level (part-of-speech, morphology, lemma):<br />
Im: APPRART, Dat, in<br />
nächsten: ADJA, Sup.Dat.Sg.Neut, nahe<br />
Jahr: NN, Dat.Pl.Neut, Jahr<br />
will: VMFIN, 3.Sg.Pres.Ind, wollen<br />
die: ART, Nom.Sg.Fem, die<br />
Regierung: NN, Nom.Sg.Fem, Regierung<br />
ihre: PPOSAT, Acc.Pl.Masc, ihr<br />
Reformpläne: NN, Acc.Pl.Masc, Plan<br />
umsetzen: VVINF, Inf, umsetzen<br />
.: $.
Why is NLU difficult? The hidden structure of<br />
language is hugely ambiguous<br />
Tree for: Fed raises interest rates 0.5% in<br />
effort to control inflation (NYT headline 5/17/00)
Why parsing is difficult:<br />
Newspaper headlines<br />
Iraqi Head Seeks Arms<br />
Juvenile Court to Try Shooting Defendant<br />
Teacher Strikes Idle Kids<br />
Stolen Painting Found by Tree<br />
Local High School Dropouts Cut in Half<br />
Red Tape Holds Up New Bridges<br />
Clinton Wins on Budget, but More Lies Ahead<br />
Hospitals Are Sued by 7 Foot Doctors<br />
Kids Make Nutritious Snacks
Aligning parses
Sample trees
59<br />
Tgrep<br />
You might for example like to search through a<br />
file filled with trees.<br />
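To give a feel for this kind of search, here is a small sketch that finds every node with a given label that immediately dominates a child with another label, roughly what a tgrep pattern such as `NP < PP` asks for. It is not tgrep itself: the tree reader, function names, and example tree (an abbreviated Penn-style bracketing) are assumptions for illustration, and the reader expects every bracket to carry a label.

```python
import re


def read_tree(text):
    """Parse one labelled bracketing into nested (label, children) tuples."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def parse():
        nonlocal pos
        pos += 1                           # consume "("
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(parse())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1                           # consume ")"
        return (label, children)

    return parse()


def subtrees(tree):
    """Yield the tree and every constituent below it."""
    yield tree
    for child in tree[1]:
        if isinstance(child, tuple):
            yield from subtrees(child)


def leaves(tree):
    """Return the words covered by a constituent, left to right."""
    words = []
    for child in tree[1]:
        words.extend(leaves(child) if isinstance(child, tuple) else [child])
    return words


def find_dominating(tree, parent, child):
    """All `parent` nodes that have a `child` node as an immediate daughter."""
    return [t for t in subtrees(tree)
            if t[0] == parent
            and any(isinstance(c, tuple) and c[0] == child for c in t[1])]


tree = read_tree(
    "(S (NP-SBJ (DT The) (NN move)) (VP (VBD followed) "
    "(NP (NP (DT a) (NN round)) (PP (IN of) "
    "(NP (JJ similar) (NNS increases))))))")
for match in find_dominating(tree, "NP", "PP"):
    print(" ".join(leaves(match)))
```

Running the same function over every tree in a file would collect all noun phrases with an attached prepositional phrase.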
Searching treebanks
Searching treebanks online<br />
The VISL website<br />
The NCLT website
Parallel treebanks<br />
Translation training and studies<br />
Machine translation (MT) research &<br />
development
Dependency structure<br />
Dependency structure shows which words depend on<br />
(modify or are arguments of) which other words.<br />
The boy put the tortoise on the rug<br />
[Figure: dependency tree over the words of the sentence above]
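A dependency structure can be stored very compactly as one head index per word. The sketch below does this for "The boy put the tortoise on the rug". The particular head assignments are one plausible analysis assumed for illustration (verb as root, preposition heading its noun), and the function names are hypothetical.

```python
WORDS = ["The", "boy", "put", "the", "tortoise", "on", "the", "rug"]
# HEADS[i] is the index of the word that WORDS[i] depends on; -1 marks
# the root. Assumed analysis: put is the root; boy, tortoise and on
# depend on put; rug depends on on; each determiner depends on its noun.
HEADS = [1, 2, -1, 4, 2, 2, 7, 5]


def dependents(head_index):
    """Words that depend directly on the word at `head_index`."""
    return [WORDS[i] for i, h in enumerate(HEADS) if h == head_index]


def is_tree():
    """Check for exactly one root and no cycles among the head links."""
    if HEADS.count(-1) != 1:
        return False
    for start in range(len(HEADS)):
        seen = set()
        node = start
        while node != -1:                  # walk up towards the root
            if node in seen:
                return False               # came back to a visited word
            seen.add(node)
            node = HEADS[node]
    return True


print(dependents(2))   # direct dependents of "put"
print(is_tree())
```

Every relation here holds between two words, with no phrase-level nodes in between.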
Dependency Grammar/Parsing<br />
A sentence is parsed by relating each word to other words in the<br />
sentence which depend on it.<br />
The idea of dependency structure goes back a long way<br />
To Pāṇini’s grammar (c. 5th century BCE)<br />
Constituency is a new-fangled invention<br />
20th century invention<br />
Modern work often linked to work of L. Tesnière (1959)<br />
Dominant approach in “East” (Eastern bloc/East Asia)<br />
Among the earliest kinds of parsers in NLP, even in US:<br />
David Hays, one of the founders of computational linguistics, built early<br />
(first?) dependency parser (Hays 1962)
Prague Dependency Bank<br />
[Figure: dependency tree for a Czech sentence; only the labels below survive]<br />
Words with English glosses: Kdo (who), ste (hundred), korun (crowns), chce (wants), investovat (to-invest), do (to), automobilu (car)<br />
Syntactic function labels in the figure: Sb, Obj, Atr, AuxP, Adv<br />
Semantic labels in the figure: ACT.T, RESTR.F, PAT.F, ACT.VOL.T, DIR.F<br />
Annotation on word level: lemmata, morphology<br />
Syntactic functions<br />
Dependency structure<br />
Semantic information on constituent roles, theme/rheme, etc.