Projectivity

◮ Most theoretical frameworks do not assume projectivity.
◮ Non-projective structures are needed to account for
  ◮ long-distance dependencies,
  ◮ free word order.

[Figure: non-projective dependency graph for "What did economic news have little effect on ?" with arcs sbj, obj, vg, pc, p, and three nmod arcs]

Introduction

Where we're going

◮ Dependency parsing:
  ◮ Input: sentence x = w1, ..., wn
  ◮ Output: dependency graph G
◮ Focus today:
  ◮ Computational methods for dependency parsing
  ◮ Resources for dependency parsing (parsers, treebanks)

Parsing Methods

◮ Three main traditions:
  ◮ Deterministic parsing (specifically: transition-based parsing)
  ◮ Dynamic programming (specifically: graph-based parsing)
  ◮ Constraint satisfaction (not covered today)
◮ Special issue:
  ◮ Non-projective dependency parsing

Deterministic Parsing

◮ Basic idea:
  ◮ Derive a single syntactic representation (dependency graph) through a deterministic sequence of elementary parsing actions
  ◮ Sometimes combined with backtracking or repair
◮ Motivation:
  ◮ Psycholinguistic modeling
  ◮ Efficiency
  ◮ Simplicity

Covington's Incremental Algorithm

◮ Deterministic incremental parsing in O(n²) time by trying to link each new word to each preceding one [Covington(2001)]:

  PARSE(x = (w1, ..., wn))
  1  for i = 1 up to n
  2    for j = i - 1 down to 1
  3      LINK(wi, wj)

  LINK(wi, wj) =
    E ← E ∪ {(i, j)}   if wj is a dependent of wi
    E ← E ∪ {(j, i)}   if wi is a dependent of wj
    E ← E              otherwise

◮ Different conditions, such as Single-Head and Projectivity, can be incorporated into the LINK operation.

Transition-Based Parsing

◮ Data structures:
  ◮ Stack [..., wi]S of partially processed tokens
  ◮ Queue [wj, ...]Q of remaining input tokens
◮ Parsing actions built from atomic actions:
  ◮ Adding arcs (wi → wj, wi ← wj)
  ◮ Stack and queue operations
◮ Left-to-right parsing in O(n) time
◮ Restricted to projective dependency graphs
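Covington's LINK scheme above is easy to render as running code. The sketch below is a minimal Python version; the `is_dependent(head, dep)` test stands in for the grammar/oracle decision, which the algorithm itself leaves abstract, so it is a caller-supplied function here.

```python
def covington_parse(words, is_dependent):
    """Covington-style incremental parsing: try to link each new word
    to every preceding word, giving O(n^2) LINK operations overall.

    `is_dependent(h, d)` is a hypothetical oracle returning True if
    word d should be a dependent of word h (indices into `words`).
    """
    edges = set()  # E: set of (head_index, dependent_index) pairs
    for i in range(1, len(words)):          # each new word w_i ...
        for j in range(i - 1, -1, -1):      # ... scans leftward over w_j
            if is_dependent(i, j):          # w_j is a dependent of w_i
                edges.add((i, j))
            elif is_dependent(j, i):        # w_i is a dependent of w_j
                edges.add((j, i))
            # otherwise: E is left unchanged
    return edges
```

Conditions such as Single-Head or Projectivity would be extra guards inside the loop before an edge is added; they are omitted here to keep the sketch close to the basic algorithm.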
Yamada's Algorithm

◮ Three parsing actions:

  Shift:  [...]S [wi, ...]Q  ⇒  [..., wi]S [...]Q
  Left:   [..., wi, wj]S [...]Q  ⇒  [..., wi]S [...]Q   (wi → wj)
  Right:  [..., wi, wj]S [...]Q  ⇒  [..., wj]S [...]Q   (wi ← wj)

◮ Algorithm variants:
  ◮ Originally developed for Japanese (strictly head-final) with only the Shift and Right actions [Kudo and Matsumoto(2002)]
  ◮ Adapted for English (with mixed headedness) by adding the Left action [Yamada and Matsumoto(2003)]
  ◮ Multiple passes over the input give time complexity O(n²)

Nivre's Algorithm

◮ Four parsing actions:

  Shift:        [...]S [wi, ...]Q  ⇒  [..., wi]S [...]Q
  Reduce:       [..., wi]S [...]Q  ⇒  [...]S [...]Q                 (only if ∃wk : wk → wi)
  Left-Arc(r):  [..., wi]S [wj, ...]Q  ⇒  [...]S [wj, ...]Q        adds wi ←r wj  (only if ¬∃wk : wk → wi)
  Right-Arc(r): [..., wi]S [wj, ...]Q  ⇒  [..., wi, wj]S [...]Q    adds wi →r wj  (only if ¬∃wk : wk → wj)

◮ Characteristics:
  ◮ Integrated labeled dependency parsing
  ◮ Arc-eager processing of right-dependents
  ◮ Single pass over the input gives time complexity O(n)

Example

[Figure: step-by-step arc-eager parse of "Economic news had little effect on financial markets ." with arc labels pred, sbj, obj, nmod, pc, p]

Shift, Left-Arc(nmod), Shift, Left-Arc(sbj), Right-Arc(pred), Shift, Left-Arc(nmod), Right-Arc(obj), Right-Arc(nmod), Shift, Left-Arc(nmod), Right-Arc(pc), Reduce, Reduce, Reduce, Reduce, Right-Arc(p)

Classifier-Based Parsing

◮ Data-driven deterministic parsing:
  ◮ Deterministic parsing requires an oracle.
  ◮ An oracle can be approximated by a classifier.
  ◮ A classifier can be trained using treebank data.
◮ Learning methods:
  ◮ Support vector machines (SVM) [Kudo and Matsumoto(2002), Yamada and Matsumoto(2003), Isozaki et al.(2004), Cheng et al.(2004), Nivre et al.(2006)]
  ◮ Memory-based learning (MBL) [Nivre et al.(2004), Nivre and Scholz(2004)]
  ◮ Maximum entropy modeling (MaxEnt) [Cheng et al.(2005)]

Feature Models

◮ Learning problem:
  ◮ Approximate a function from parser states (represented by feature vectors) to parser actions, given a training set of gold-standard derivations.
◮ Typical features:
  ◮ Tokens:
    ◮ Target tokens
    ◮ Linear context (neighbors in S and Q)
    ◮ Structural context (parents, children, siblings in G)
  ◮ Attributes:
    ◮ Word form (and lemma)
    ◮ Part-of-speech (and morpho-syntactic features)
    ◮ Dependency type (if labeled)
    ◮ Distance (between target tokens)

Comparing Algorithms

◮ Parsing algorithm:
  ◮ Nivre's algorithm gives higher accuracy than Yamada's algorithm for parsing the Chinese CKIP treebank [Cheng et al.(2004)].
◮ Learning algorithm:
  ◮ SVM gives higher accuracy than MaxEnt for parsing the Chinese CKIP treebank [Cheng et al.(2004)].
  ◮ SVM gives higher accuracy than MBL with lexicalized feature models for three languages [Hall et al.(2006)]:
    ◮ Chinese (Penn)
    ◮ English (Penn)
    ◮ Swedish (Talbanken)
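The arc-eager transition system described above (Nivre's algorithm) can be sketched compactly in Python. This is an unlabeled-core illustration with a label field added, not a full parser: it simply applies a given transition sequence to a configuration, with the same preconditions as the slide (Reduce and Left-Arc check whether the stack top already has a head). The action names ("SH", "RE", "LA", "RA") and tuple format are my own conventions for the sketch.

```python
from collections import deque

def arc_eager(tokens, transitions):
    """Apply an arc-eager transition sequence to a sentence.

    `tokens` is the list of input words; index 0 is an artificial root.
    `transitions` is a list of tuples: ("SH",), ("RE",), ("LA", label),
    ("RA", label). Returns the set of (head, dependent, label) arcs.
    """
    stack = [0]                                   # artificial root node
    queue = deque(range(1, len(tokens) + 1))      # remaining input tokens
    arcs = set()
    head = {}                                     # dependent -> head
    for action, *label in transitions:
        if action == "SH":                        # Shift
            stack.append(queue.popleft())
        elif action == "RE":                      # Reduce: top must have a head
            assert stack[-1] in head
            stack.pop()
        elif action == "LA":                      # Left-Arc(r): wi <-r wj
            assert stack[-1] not in head          # top must not have a head
            dep = stack.pop()
            arcs.add((queue[0], dep, label[0]))
            head[dep] = queue[0]
        elif action == "RA":                      # Right-Arc(r): wi ->r wj
            dep = queue.popleft()
            arcs.add((stack[-1], dep, label[0]))
            head[dep] = stack[-1]
            stack.append(dep)                     # arc-eager: dep goes on stack
    return arcs
```

Replaying the transition sequence from the example slide on "Economic news had little effect on financial markets ." yields one head per token in a single left-to-right pass, which is the O(n) behavior the slide claims.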
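To make the feature models above concrete, here is a small sketch of how a parser configuration might be mapped to a feature vector for a classifier. The feature templates (stack-top and queue-front form/POS, dependency type, distance) follow the categories listed in the slide, but the exact template set and string encoding are illustrative choices, not any particular parser's model.

```python
def extract_features(stack, queue, forms, tags, deprel):
    """Map a (stack, queue) configuration to a list of feature strings.

    `stack`/`queue` hold token indices; `forms` and `tags` give the word
    form and part-of-speech per index; `deprel` maps an index to the label
    of its incoming arc, if one has been assigned so far.
    """
    feats = []
    if stack:
        s0 = stack[-1]                                  # target token on stack
        feats.append(f"s0.form={forms[s0]}")            # word form
        feats.append(f"s0.pos={tags[s0]}")              # part-of-speech
        feats.append(f"s0.deprel={deprel.get(s0, 'NONE')}")  # dependency type
    if queue:
        q0 = queue[0]                                   # target token in queue
        feats.append(f"q0.form={forms[q0]}")
        feats.append(f"q0.pos={tags[q0]}")
        if len(queue) > 1:                              # linear context in Q
            feats.append(f"q1.pos={tags[queue[1]]}")
    if stack and queue:
        feats.append(f"dist={queue[0] - stack[-1]}")    # distance feature
    return feats
```

A trained classifier (SVM, MBL, MaxEnt) would then map such feature vectors to parser actions, approximating the oracle as described in the classifier-based parsing slide.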