20.07.2013 Views

Resources required - Department of Linguistics and English Language


1

Parsing Overview


2

Parsing

Assigning constituent structure to unstructured linguistic input

Grammar engineering: field that deals with parser development, implementation

Can be done top-down, bottom-up, or both

Many techniques from CS, linguistics


3

Grammar specification formats

Production rules

if-then

DCGs

CFGs

Transition networks

Slot/filler specifications


4

Parsers

Grammar formalism

Grammar

Algorithm

Interpreting parser: separate grammar/parser

Procedural parser: integrated grammar/parser

Compiled parser: declarative grammar is compiled into procedural form


5

Parsers

Usually tied to a particular linguistic theory (XTAG, LFG-WB, ALE)

Advantage: principled coverage

Disadvantage: inflexible

Encoding of phrase-structure component is costly, complex

Ungrammaticality is almost always avoided

At best, explicitly code up expected ungrammaticalities as rules


6

Components required

Grammar, rule base

Data structures

Strings, symbols

Hash tables

Trees

On-line lexical information, dictionaries

Corpora

Tagged corpora

Parsed corpora

Corpus annotation


7

Other considerations

Features and unification

Parsing direction: TD, BU, DF, BF

Input processing: L-R, R-L, single-pass, multiple-pass

Ambiguity processing: backtracking, parallelism, lookahead-based reduction


8

Parsing efficiency aspects

Deterministic vs. non-deterministic

Avoiding dead-ends

Backtracking style

Overgeneration

Undergeneration

Quality of grammar

(Re)usability of grammar


9

Parsing strategies

Top-down vs. bottom-up vs. mixed

Search parameters

Depth-first vs. breadth-first

Single-path vs. multiple-path

Backtracking?

Granularity of word senses

Left-to-right vs. bidirectional

Contextual representation

Interaction with other knowledge sources

Learning vs. not


10

Bottom-Up Parsing

Start from data, incrementally building up constituents

POS-tagged words

Phrases

Clauses

Sentences

Paragraphs, documents


11

Top-Down Parsing

Start with expectations, topmost category

Split off increasingly deeper subcategories

Match input to these

Done when all input accounted for
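The top-down steps above can be sketched as a small recursive-descent recognizer. This is a minimal illustration, not a method from the slides: the toy grammar, the lexicon, and the names `expand` and `recognize` are all assumptions. It starts from the topmost category `S`, expands categories into subcategories, matches words against lexical categories, and succeeds when all input is accounted for.

```python
# Toy grammar and lexicon: illustrative assumptions, not taken from the slides.
GRAMMAR = {
    "S": [["NP", "VP"]],
    "NP": [["Det", "N"], ["Name"]],
    "VP": [["V", "NP"], ["V"]],
}
LEXICON = {
    "the": "Det", "a": "Det",
    "dog": "N", "cat": "N",
    "Kim": "Name",
    "saw": "V", "slept": "V",
}


def expand(symbols, words):
    """Yield the leftover input after matching `symbols` top-down, depth-first."""
    if not symbols:
        yield words  # every expectation is satisfied; report what input remains
        return
    first, rest = symbols[0], symbols[1:]
    if first in GRAMMAR:
        # Split the expected category into each of its possible subcategory lists.
        for rhs in GRAMMAR[first]:
            yield from expand(rhs + rest, words)
    elif words and LEXICON.get(words[0]) == first:
        # A lexical category: match it against the next input word.
        yield from expand(rest, words[1:])


def recognize(sentence):
    """True if some expansion of S accounts for all of the input."""
    return any(left == [] for left in expand(["S"], sentence.split()))


print(recognize("the dog saw Kim"))
print(recognize("saw the dog"))
```

The generator explores alternatives depth-first and backtracks implicitly when a branch yields nothing. The toy grammar has no left-recursive rules, which is what keeps this naive strategy from looping.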


12

How linguistic?

Most traditional parsers are built on language theories

Try to consider (at least some) human processing phenomena

Traditional notions of constituency

This isn’t the only way to do this!


13

Dependency parsers

Inter-word dependencies are the principal features, not structure

No large superstructure, attendant decisions

All relations grounded directly in words

Generally more robust

Often used in IR, partial parsing, shallow parsing, Q/A


14

The LG parser

Freely available for research purposes

Robust (e.g. information retrieval, MT)

Calculates simple, explicit relations

Fast

Written in C

More appropriate for the task than traditional phrase-structure grammars


15

Parsing ungrammaticality

An LFG mal-rule:

S → NP (agr ?a) VP (agr ~?a)

Computational complexity increases as such rules are added

Maintaining a knowledge base of such information is complicated, never-ending
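One way to read the mal-rule above is that it licenses a sentence whose subject and verb phrase carry different agreement values, so the parser can still build a structure and flag the error. The sketch below illustrates only that idea; the tiny agreement lexicons and the labels `S` and `S-mal-agr` are hypothetical and not part of any real LFG system.

```python
# Hypothetical agreement lexicons (assumptions for illustration only).
NP_AGR = {"he": "3sg", "she": "3sg", "they": "pl", "we": "pl"}
VP_AGR = {"walks": "3sg", "sleeps": "3sg", "walk": "pl", "sleep": "pl"}


def parse_s(subject, verb):
    """Return a labeled analysis for a two-word sentence, or None if unknown."""
    np_agr = NP_AGR.get(subject)
    vp_agr = VP_AGR.get(verb)
    if np_agr is None or vp_agr is None:
        return None  # a word is outside the toy lexicon
    if np_agr == vp_agr:
        # Regular rule: both agr values unify to the same value.
        return ("S", subject, verb)
    # Mal-rule: the agr values differ, so record an expected error type.
    return ("S-mal-agr", subject, verb)


print(parse_s("she", "walks"))
print(parse_s("they", "walks"))
```

Each anticipated error type needs its own rule of this kind, which is why the rule base grows and parsing gets more expensive as coverage of ungrammatical input increases.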


16

Exploring Link Grammar

What is a link?

Two parts, + and -

Shows a relationship between pairs of words

Subject + verb

Verb + object

Preposition + object

Adjective + adverbial modifier

Auxiliary + main verb

Labels each relationship

Potential links are specified by technical rules

Possible to score linkages, penalize links


17

Sample link parse

Linkage 1, cost vector = (UNUSED=0 DIS=1 AND=0 LEN=20)
+-------------------------------Xp------------------------------+
+--------------Wd--------------+ |
| +----------CO---------+ |
| +--------Xc--------+ | |
| +-----Jp----+ | | +------Op-----+ |
| | +--Dmu-+ | +-Sp*i+--PPf-+ +--Dmc-+ |
| | | | | | | | | | |
LEFT-WALL during my schooling.n , I.p have.v taken.v many classes.n .


18

Sample link parse

He was killed by the Indians 15 March 1698.

+-----------------Xc----------------+
+------------MVp-----------+ |
| +----Jp---+ | |
+-Ss+---Pv--+-MVp-+ +--Dmc-+ +-TM+--TY-+ |
| | | | | | | | | |
he was.v killed.v by the Indians.n 15 March 1698 .


19

LG parser’s robustness (1)

Linkage 1, cost vector = (UNUSED=4 DIS=0 AND=0 LEN=11)
+--------------------------------Xp-------------------------------+
+-----Wd-----+ |
| +-D*u-+------------Ss-----------+---Ost--+ |
| | | | | |
LEFT-WALL the class.n [most] [important] is.v Mathematical [for] [my] .


20

LG parser’s robustness (2)

Linkage 1, cost vector = (UNUSED=1 DIS=0 AND=0 LEN=17)
+----------------------------------Xp----------------------------------+
| +--------------MVp-------------+ |
| +----I----+------MVp------+ +----Js----+ |
+------Wi-----+-Ox-+ +---Op--+ +--Jp--+ | +--Ds-+ |
| | | | | | | | | | |
LEFT-WALL [it] help.v me make.v friends.n with people.p around the world.n .


21

LG example parses

Linkage 1, cost vector = (UNUSED=0 DIS=2 AND=0 LEN=23)
+-----------------------------------------Xp----------------------------------------+
| +-----------------------MVp-----------------------+ |
| +---------------MVp--------------+ | |
| | +-------Jp-------+ +----Js---+ | |
+--Wd--+Sp*+-PPf-+--Pg*b--+--MVp-+ +----AN----+ | +---D--+ +-Js+ |
| | | | | | | | | | | | | |
LEFT-WALL I.p 've been.v majoring.v in Material engineering.n at my University in Korea .

Linkage 1, cost vector = (UNUSED=0 DIS=2 AND=0 LEN=27)
+----------------------------------------------Xp----------------------------------------------+
| +-----------Wdc-----------+ +------------------Opt-----------------+ |
| | +--------CO--------+ | +--------------AN-------------+ |
| | | +-----D*u----+-------Ss------+ | +-------AN-------+ |
+--Wc--+ | +--La-+ +--Mp--+--J-+ | | | +----AN---+ |
| | | | | | | | | | | | | |
LEFT-WALL but probably the best.a class.n for.p me was.v medicine.n and first.n aid.n principles.n .


22

LG parser’s robustness (1)

Thomas Smith, Haverhill, married at Andover 6 January 1659, Unice Singletary of Salisbury.

+-------------------------------Xc------------------------------+
+---------------------Osn--------------------+ |
+----------Ss---------+---------------Xc--------------+ | |
+----MX---+ +---------MVp---------+ | | |
+--G--+ +--Xd-+--Xc-+ +--MVp-+-Js-+ +-TM-+--TY--+ | +----G---+--MG--+--JG-+ |
| | | | | | | | | | | | | | | | |
Thomas Smith , Haverhill , married.v at Andover 6 January 1659 , Unice Singletary of Salisbury .


23

LG parser’s robustness (2)

Mary married I think, 23 November 1661, Samuel Gay.

No complete linkages found.

+-------------------------Xc------------------------+
+-----------------------Osn----------------------+ |
+------------------Xc------------------+ | |
+-------------MVp------------+ | | |
+--Ss--+ +--TM-+--TY--+ | +--G-+ |
| | | | | | | | |
Mary married.v [I] [think] [,] 23 November 1661 , Samuel Gay .


24

Sample LG rule entries

words/words.y: % year numbers
NN+ or NIa- or AN+ or MV- or ((Xd- & TY- & Xc+) or TY-)
or ({EN- or NIc-} & (ND+ or OD- or ({{@L+} & DD-} &
([[Dmcn+]] or (( or TA-) & (JT- or IN-
or ))))));

: ((K+ & {[[@MV+]]} & O*n+) or ({O+ or B-} & {K+}) or
[[@MV+ & {Xc+} & O*n+]]) & {Xc+} & {@MV+};


25

Syntax isn’t enough

Linkage 1, cost vector = (UNUSED=0 DIS=1 AND=0 LEN=13)
+--------------------------------Xp--------------------------------+
+------Wd------+---------Ss---------+ +---Jp---+ |
| +--D*u--+--Mp--+--Jp-+ +--Pg*b--+---MVp--+ +-D*u-+ |
| | | | | | | | | | |
LEFT-WALL the practice.n in English.n is.v progressing.v in the life.n .


26

Evaluating parsing

Tree accuracy, exact match (0 or 1)

Partial credit

PARSEVAL

Standard used to score parses

Precision: how many brackets in the parse match the standard

Recall: how many brackets in the standard are in the parse

Crossings: how many brackets in the parse cross brackets in the standard
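The three PARSEVAL measures can be sketched over sets of labeled brackets. This is a simplified illustration under stated assumptions: each bracket is a hypothetical `(label, start, end)` triple over word positions, duplicate brackets are ignored, the crossing count is the number of parse brackets that cross at least one standard bracket (one common convention), and the toy bracket sets are made up for the example.

```python
def crosses(a, b):
    """True if spans a and b overlap without either containing the other."""
    (_, s1, e1), (_, s2, e2) = a, b
    return s1 < s2 < e1 < e2 or s2 < s1 < e2 < e1


def parseval(parse, gold):
    """Return (precision, recall, crossing count) for two bracket sets."""
    matched = len(parse & gold)
    precision = matched / len(parse)   # share of parse brackets found in gold
    recall = matched / len(gold)       # share of gold brackets found in parse
    crossing = sum(any(crosses(p, g) for g in gold) for p in parse)
    return precision, recall, crossing


# Made-up brackets for a five-word sentence: the two analyses differ in
# where a final two-word phrase attaches.
GOLD = {("S", 0, 5), ("VP", 1, 5), ("VP", 1, 3), ("PP", 3, 5)}
PARSE = {("S", 0, 5), ("VP", 1, 5), ("NP", 2, 5), ("PP", 3, 5)}

print(parseval(PARSE, GOLD))  # (0.75, 0.75, 1)
```

Three of the four parse brackets match the standard, and the `NP` spanning positions 2 to 5 crosses the standard's `VP` spanning 1 to 3, so a single attachment error costs both a match and a crossing.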


Treebanks, parsing, etc.


28

WordNet subcat frames

1 Something ----s
2 Somebody ----s
3 It is ----ing
4 Something is ----ing PP
5 Something ----s something Adjective/Noun
6 Something ----s Adjective/Noun
7 Somebody ----s Adjective
8 Somebody ----s something
9 Somebody ----s somebody
10 Something ----s somebody
11 Something ----s something
12 Something ----s to somebody
13 Somebody ----s on something
14 Somebody ----s somebody something
15 Somebody ----s something to somebody
16 Somebody ----s something from somebody
17 Somebody ----s somebody with something
18 Somebody ----s somebody of something
19 Somebody ----s something on somebody
20 Somebody ----s somebody PP
21 Somebody ----s something PP
22 Somebody ----s PP
23 Somebody's (body part) ----s
24 Somebody ----s somebody to INFINITIVE
25 Somebody ----s somebody INFINITIVE
26 Somebody ----s that CLAUSE
27 Somebody ----s to somebody
28 Somebody ----s to INFINITIVE
29 Somebody ----s whether INFINITIVE
30 Somebody ----s somebody into V-ing something
31 Somebody ----s something with something
32 Somebody ----s INFINITIVE
33 Somebody ----s VERB-ing
34 It ----s that CLAUSE
35 Something ----s INFINITIVE

Soar 2003 Tutorial


English LCS lexicon

Theta-grid information for verbs

Derive ucat features used to build syntactic structure

Co-referenced with WordNet2.0

theta-grids are aligned with ucat features and word sense information


English LCS lexicon data

10.6.a#1#_ag_th,mod-poss(of)#exonerate#exonerate#exonerate#exonerate+ed#
(2.0,00874318_exonerate%2:32:00::)

"10.6.a" :NAME "Verbs of Possessional Deprivation: Cheat Verbs / -of"
WORDS (absolve acquit balk bereave bilk bleed burgle cheat cleanse con
cull cure defraud denude deplete depopulate deprive despoil
disabuse disarm disencumber dispossess divest drain ease
exonerate fleece free gull milk mulct pardon plunder purge purify ransack
relieve render rid rifle rob sap strip swindle unburden void wean)
THETA_ROLES ((1 "_ag_th,mod-poss()")
(1 "_ag_th,mod-poss(from)")
(1 "_ag_th,mod-poss(of)"))
SENTENCES "He !!+ed the people (of their rights); He !!+ed him of his sins"


31

Grammars

Before you can parse you need a grammar.

So where do grammars come from?

Grammar Engineering

Lovingly hand-crafted decades-long efforts by humans to write grammars (typically in some particular grammar formalism of interest to the linguists developing the grammar).

TreeBanks

Semi-automatically generated sets of parse trees for the sentences in some corpus. Typically in a generic lowest common denominator formalism (of no particular interest to any modern linguist).

CSCI 5832 Spring 2006

9/26/2011


32

TreeBanks

TreeBanks provide a grammar (of a sort).

As we’ll see they also provide the training data for various ML approaches to parsing.

But they can also provide useful data for more purely linguistic pursuits.

You might have a theory about whether or not something can happen in a particular language.

Or a theory about the contexts in which something can happen.

TreeBanks can give you the means to explore those theories, if you can formulate the questions in the right way and get the data you need.



The rise of annotated data: The Penn Treebank

( (S
    (NP-SBJ (DT The) (NN move))
    (VP (VBD followed)
      (NP
        (NP (DT a) (NN round))
        (PP (IN of)
          (NP
            (NP (JJ similar) (NNS increases))
            (PP (IN by)
              (NP (JJ other) (NNS lenders)))
            (PP (IN against)
              (NP (NNP Arizona) (JJ real) (NN estate) (NNS loans))))))
      (, ,)
      (S-ADV
        (NP-SBJ (-NONE- *))
        (VP (VBG reflecting)
          (NP
            (NP (DT a) (VBG continuing) (NN decline))
            (PP-LOC (IN in)
              (NP (DT that) (NN market)))))))
    (. .)))
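Bracketed trees like the one above are straightforward to read into a data structure. The following is a minimal sketch, not the Penn Treebank tooling: the names `parse_tree` and `leaves` are hypothetical, each tree is a `(label, children)` pair, and the reader expects every bracket to carry a label, so the unlabeled outer wrapper `( ... )` of a treebank file entry would need to be stripped first. The example string is a shortened fragment of the slide's tree.

```python
import re


def parse_tree(text):
    """Read one labeled bracketing into nested (label, children) pairs."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def read():
        nonlocal pos
        assert tokens[pos] == "(", "expected an opening bracket"
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read())        # nested constituent
            else:
                children.append(tokens[pos])   # a word
                pos += 1
        pos += 1  # consume the closing bracket
        return (label, children)

    return read()


def leaves(tree):
    """Return the words of a tree, skipping empty elements under -NONE-."""
    label, children = tree
    if label == "-NONE-":
        return []
    words = []
    for child in children:
        words.extend([child] if isinstance(child, str) else leaves(child))
    return words


fragment = "(S (NP-SBJ (DT The) (NN move)) (VP (VBD followed) (NP (DT a) (NN round))))"
tree = parse_tree(fragment)
print(tree[0])
print(leaves(tree))
```

With trees in this form, questions such as "which labels dominate a PP" become simple recursive traversals.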


A simple grammar

S → NP VP 1.0
VP → V NP 0.7
VP → VP PP 0.3
PP → P NP 1.0
P → with 1.0
V → saw 1.0
NP → NP PP 0.4
NP → astronomers 0.1
NP → ears 0.18
NP → saw 0.04
NP → stars 0.18
NP → telescope 0.1
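The probabilities in this grammar can be put to work with a CKY-style chart that sums the probability of every parse of a sentence (the inside probability). The sketch below encodes the rules exactly as listed above; the function name `inside_probability`, the chart layout, and the example sentence are assumptions made for illustration.

```python
from collections import defaultdict

# Rules from the grammar above, keyed as (parent, left, right) or (category, word).
BINARY = {
    ("S", "NP", "VP"): 1.0,
    ("VP", "V", "NP"): 0.7,
    ("VP", "VP", "PP"): 0.3,
    ("PP", "P", "NP"): 1.0,
    ("NP", "NP", "PP"): 0.4,
}
LEXICAL = {
    ("P", "with"): 1.0,
    ("V", "saw"): 1.0,
    ("NP", "astronomers"): 0.1,
    ("NP", "ears"): 0.18,
    ("NP", "saw"): 0.04,
    ("NP", "stars"): 0.18,
    ("NP", "telescope"): 0.1,
}


def inside_probability(words, root="S"):
    """Sum the probabilities of all parses of `words` rooted in `root`."""
    n = len(words)
    chart = defaultdict(float)  # (start, end, category) -> summed probability
    for i, w in enumerate(words):
        for (cat, word), p in LEXICAL.items():
            if word == w:
                chart[i, i + 1, cat] += p
    for width in range(2, n + 1):
        for start in range(n - width + 1):
            end = start + width
            for split in range(start + 1, end):
                for (parent, left, right), p in BINARY.items():
                    chart[start, end, parent] += (
                        p * chart[start, split, left] * chart[split, end, right]
                    )
    return chart[0, n, root]


print(inside_probability("astronomers saw stars with ears".split()))
```

For this sentence the grammar allows two parses: attaching the PP to the noun phrase gives 0.1 × 0.7 × 0.4 × 0.18 × 0.18 = 0.0009072, and attaching it to the verb phrase gives 0.1 × 0.3 × 0.7 × 0.18 × 0.18 = 0.0006804, so the function returns about 0.0015876.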


Classical parsing

Wrote symbolic grammar and lexicon

Grammar:
S → NP VP
NP → (DT) NN
NP → NN NNS
NP → NNP
VP → V NP

Lexicon:
NN → interest
NNS → rates
NNS → raises
VBP → interest
VBZ → rates

Simple 10-rule grammar: 592 parses

Real-size broad-coverage grammar: millions of parses


Two possible PP attachments


39

Ambiguity



40

Ambiguity

Local ambiguity means that we have to deal with multiple plausible choices during the parsing process.

Global ambiguity means that the grammar can’t tell us which of several (many?) possible parses is the correct one.



41

TreeBanks



42

TreeBanks



The bad effects of V/N ambiguities


44

Sample Rules



How many rules?


Example


A sample parsed sentence


52

Background


TiGer Treebank

[Figure: TiGer annotation of the sentence "Im nächsten Jahr will die Regierung ihre Reformpläne umsetzen." ("Next year the government wants to implement its reform plans.")]

Crossing branches for discontinuous constituency types

Edge labels: syntactic functions (MO, HD, SB, OC, OA, AC, NK)

Node labels: phrase categories (S, VP, NP, PP)

Annotation on word level: part-of-speech, morphology, lemmata

Im: APPRART, Dat, lemma in
nächsten: ADJA, Sup.Dat.Sg.Neut, lemma nahe
Jahr: NN, Dat.Pl.Neut, lemma Jahr
will: VMFIN, 3.Sg.Pres.Ind, lemma wollen
die: ART, Nom.Sg.Fem, lemma die
Regierung: NN, Nom.Sg.Fem, lemma Regierung
ihre: PPOSAT, Acc.Pl.Masc, lemma ihr
Reformpläne: NN, Acc.Pl.Masc, lemma Plan
umsetzen: VVINF, Inf, lemma umsetzen
.: $.


Why is NLU difficult? The hidden structure of language is hugely ambiguous

Tree for: Fed raises interest rates 0.5% in effort to control inflation (NYT headline 5/17/00)


Why parsing is difficult: Newspaper headlines

Iraqi Head Seeks Arms

Juvenile Court to Try Shooting Defendant

Teacher Strikes Idle Kids

Stolen Painting Found by Tree

Local High School Dropouts Cut in Half

Red Tape Holds Up New Bridges

Clinton Wins on Budget, but More Lies Ahead

Hospitals Are Sued by 7 Foot Doctors

Kids Make Nutritious Snacks


Aligning parses


Sample trees


59

Tgrep

You might, for example, like to search through a file filled with trees.



Searching treebanks


Searching treebanks online

The VISL website

The NCLT website


Parallel treebanks

Translation training and studies

Machine translation (MT) research & development


Dependency structure

Dependency structure shows which words depend on (modify or are arguments of) which other words.

The boy put the tortoise on the rug

[Figure: dependency tree over the words of this sentence]
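A dependency structure for the example sentence can be stored as one head index per word. The head assignments below are an assumption, since the slide's figure did not survive: they follow one conventional analysis in which determiners depend on their nouns, the nouns and the preposition depend on the verb, and "rug" depends on "on". The function names are hypothetical.

```python
WORDS = ["The", "boy", "put", "the", "tortoise", "on", "the", "rug"]
# HEADS[i] is the index of the word that WORDS[i] depends on; -1 marks the root.
# These assignments are an assumed analysis, not read off the original figure.
HEADS = [1, 2, -1, 4, 2, 2, 7, 5]


def dependents(head):
    """Return the words that depend directly on the word at index `head`."""
    return [WORDS[i] for i, h in enumerate(HEADS) if h == head]


def is_tree(heads):
    """True if there is exactly one root and every word reaches it without a cycle."""
    if sum(1 for h in heads if h == -1) != 1:
        return False
    for i in range(len(heads)):
        seen = set()
        while i != -1:
            if i in seen:
                return False  # followed a cycle instead of reaching the root
            seen.add(i)
            i = heads[i]
    return True


print(dependents(2))    # ['boy', 'tortoise', 'on']
print(is_tree(HEADS))   # True
```

Every relation is stated directly between two words, with no phrase-level nodes in between.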


Dependency Grammar/Parsing

A sentence is parsed by relating each word to other words in the sentence which depend on it.

The idea of dependency structure goes back a long way

To Pāṇini’s grammar (c. 5th century BCE)

Constituency is a new-fangled invention

20th century invention

Modern work often linked to work of L. Tesnière (1959)

Dominant approach in “East” (Eastern bloc/East Asia)

Among the earliest kinds of parsers in NLP, even in US:

David Hays, one of the founders of computational linguistics, built early (first?) dependency parser (Hays 1962)


Prague Dependency Bank

[Figure: annotated dependency tree for a Czech sentence, with English glosses: Kdo (who), chce (wants), investovat (to-invest), ste (hundred), korun (crowns), do (to), automobilu (car)]

Labels shown in the figure: Sb, Obj, Atr, AuxP, Adv; ACT.T, RESTR.F, PAT.F, ACT.VOL.T, DIR.F

Annotation on word level: lemmata, morphology

Syntactic functions

Dependency structure

Semantic information on constituent roles, theme/rheme, etc.
