Vector Space Semantic Parsing: A Framework for Compositional ...

More documents

Recommendations

Info

NP NP ) , ADJP , NP ( NP - NP English divine James Hervey February 26 , 1714 December 25 , 1758 Figure 2: Labeling nodes to be followed to select “December 25, 1758.” NP 46 NP 40 ) , ADJP , 13 14 41 17 P ( NP 31 3 32 - 8 NP English divin 33 15 vey February 26 , 1714 December 25 , 2 4 5 6 7 9 10 11 1758 12 v33= x32 x(q) x40 x8 x(q) x40 x33 x(q) x40 0 x(q) x40 0 x(q) x40 Figure 3: Assembling features from the question, the parent, and siblings, to decide whether to follow node 33 the item to be classified (Waibel et al., 1989). Convolving over token sequences has achieved state-ofthe-art performance in part-of-speech tagging, named entity recognition, and chunking, and competitive performance in semantic role labeling and parsing, using one basic architecture (Collobert et al., 2011). Moreover, at classification time, the approach is 200 times faster at POS tagging than next-best systems such as (Shen et al., 2007) and 100 times faster at semantic role labeling (Koomen et al., 2005). Classifying tokens to answer questions involves not only information from nearby tokens, but long range syntactic dependencies. In most work utilizing parse trees as input, a systematic description of the whole parse tree has not been used. Some state-of-the-art semantic role labeling systems require multiple parse trees (alternative candidates for parsing the same sentence) as input, but they measure many ad-hoc features describing path lengths, head words of prepositional phrases, clause-based path features, etc., encoded in a sparse feature vector (Pradhan et al., 2005). By using feature representations from our RNN and performing convolutions across siblings inside the tree, instead of token sequences in the text, we can utilize the parse tree information in a more principled way. We start at the root of the parse tree and select branches to follow, working down. At each step, the entire question is visible, via the representation at its root, and we decide whether or not to follow each branch of the support sentence. Ideally, irrelevant information will be cut at the point where syntactic information indicates it is no longer needed. The point at which we reach a terminal node may be too late to cut out the corresponding word; the context that indicates it is the wrong answer may have been visible only at a higher level in the parse tree. The classifier must cut words out earlier, though we do not specify exactly where. Our classifier uses three pieces of information to decide whether to follow a node in the support sentence or not, given that its parent was followed: 1. The representation of the question at its root 2. The representation of the support sentence at the parent of the current node 3. The representations of the current node and a frame of k of its siblings on each side, in the order induced by the order of words in the sentence 114
Algorithm 2: Training the convolutional neural network for question answering Data: Ξ, a set of triples (Q, S, T ), with Q a parse tree of a question, S a parse tree of a support sentence, and T ⊂ W(S) a ground truth answer substring, and parse tree features ⃗x(p) attached by the recursive autoencoder for all p ∈ Q or p ∈ S Let n = dim ⃗x(p) Let h be the cross-entropy loss (equation (2)) Data: Φ : ( R 3n) 2k+1 → R 2 a convolutional neural network over frames of size 2k + 1, with parameters to be trained for question-answering Result: Parameters of Φ trained begin while stopping criterion not satisfied do Randomly choose (Q, S, T ) ∈ Ξ Let q, r = root(Q), root(S) Let X = {r} (the set of nodes to follow) Let A(T ) ⊂ S be the set of ancestor nodes of T in S while X ≠ ∅ do Pop an element p from X if p is not terminal then Let c 1 , . . . , c m be the children of p Let ⃗x j = ⃗x(c j ) for j ∈ {1, . . . , m} Let ⃗x j = ⃗0 for j /∈ {1, . . . , m} for i=1, . . . m do Let t = 1 if c i ∈ A(T ), or 0 otherwise Let v ci = ⊕ i+k j=i−k (⃗x j ⊕ ⃗x(p) ⊕ ⃗x(q)) Compute the cross-entropy loss h (Φ (v ci ) , t) if exp(−h (Φ (v ci ) , 1)) > 1 2 then Let X = X ∪ {c i } (the network predicts c i should be followed) Update parameters of Φ by backpropagation Each of these representations is n-dimensional. The convolutional neural network concatenates them together (denoted by ⊕) as a 3n-dimensional feature at each node position, and considers a frame enclosing k siblings on each side of the current node. The CNN consists of a convolutional layer mapping the 3n inputs to an r-dimensional space, a sigmoid function (such as tanh), a linear layer mapping the r-dimensional space to two outputs, and another sigmoid. We take k = 2 and r = 30 in the experiments. Application of the CNN begins with the children of the root, and proceeds in breadth first order through the children of the followed nodes. Sliding the CNN’s frame across siblings allows it to decide whether to follow adjacent siblings faster than a non-convolutional classifier, where the decisions would be computed without exploiting the overlapping features. A followed terminal node becomes part of the short answer of the system. The training of the question-answering convolutional neural network is detailed in Algorithm 2. Only visited nodes, as predicted by the classifier, are used for training. For ground truth, we say that a node should be followed if it is the ancestor of some token that is part of the desired answer. For example, to select the death date “December 25, 1758” from the support sentence (displayed on page one) about James Hervey, nodes of the tree would be attached ground truth values according to the coloring in Figure 2. At classification time, some unnecessary (negatively labeled) nodes may be followed without mistaking the final answer. For example, when deciding whether to follow the “NP” in node 33 on the third row of Figure 3, the classifier would see features of node 32 (NP) and node 8 (‘-’) on its left, its own features, and nothing on its right. Since there are no siblings to the right of node 33, zero vectors, used for padding, would be placed in the two empty slots. To each of these feature vectors, features from the parent and the question root would be concatenated. The combination of recursive autoencoders with convolutions inside the tree affords flexibility and generality. The ordering of children would be immeasurable by a classifier relying on path-based features alone. For instance, our classifier may consider a branch of a parse tree as in Figure 2, in which the birth date and death date have isomorphic connections to the rest of the parse tree. It can distinguish them by the ordering of nodes in a parenthetical expression (see examples in Table 2). 115
Page 1 and 2:
ACL 2013 51st Annual Meeting of the
Page 3:
Introduction In recent years, there
Page 7:
Table of Contents Vector Space Sema
Page 10 and 11:
Poster session (continued) 12:30 Lu
Page 12 and 13:
Input: Log. Form: “red ball”
Page 14 and 15:
the NP/N λx.x red N/N λx.A red x
Page 16 and 17:
dicted and target distributions, an
Page 18 and 19:
typed objects lie in the same vecto
Page 20 and 21:
Peter D. Turney. 2006. Similarity o
Page 22 and 23:
(1999), who suggested that the real
Page 24 and 25:
mation. In addition, using a new se
Page 26 and 27:
Figure 6: Positive rate, negative r
Page 28 and 29:
References Rens Bod, Remko Scha, an
Page 30 and 31:
A Structured Distributional Semanti
Page 32 and 33:
Figure 1: Sample sentences & triple
Page 34 and 35:
el as its three axes. We explore th
Page 36 and 37:
as well as our framing of word comp
Page 38 and 39:
References Marco Baroni and Roberto
Page 40 and 41:
Letter N-Gram-based Input Encoding
Page 42 and 43:
Hidden my house my house Visibl
Page 44 and 45:
… m my y h se us Hidden Visible m
Page 46 and 47:
line of the systems presented in Ta
Page 48 and 49:
References Yoshua Bengio, Réjean D
Page 50 and 51:
Transducing Sentences to Syntactic
Page 52 and 53:
t s “We booked the flight” →
Page 54 and 55:
D 3(s) = ❀ PRP + ❀ V + DT ❀ +
Page 56 and 57:
Output Model λ = 0 λ = 0.2 λ = 0
Page 58 and 59:
References James A. Anderson. 1973.
Page 60 and 61:
General estimation and evaluation o
Page 62 and 63:
Guevara (2010) and Zanzotto et al.
Page 64 and 65:
⃗p = A u ⃗v where A u is a matr
Page 66 and 67:
stituents. For this reason, we must
Page 68 and 69:
References Hervé Abdi and Lynne Wi
Page 70 and 71:
GOOD VERTICAL safe out raise level
Page 72 and 73:
HLBL HLBL SENNA 4lang original scal
Page 74 and 75: Determining Compositionality of Wor
Page 76 and 77: ability of a word in the investigat
Page 78 and 79: terminer) wheel” and “reinvent
Page 80 and 81: ing LSA are depicted. Discussion As
Page 82 and 83: WSM Measure wAvg(of ρ) ρAN-VO-SV
Page 84 and 85: “Not not bad” is not “bad”:
Page 86 and 87: activation functions in recursive n
Page 88 and 89: logic experiment in Socher et al. (
Page 90 and 91: tinction without the need to move t
Page 92 and 93: T. K. Landauer and S. T. Dumais. 19
Page 94 and 95: This differs from the classical Ran
Page 96 and 97: Hungarian Algorithm First cosine si
Page 98 and 99: Features: MSRpar MSRvid SMTeuroparl
Page 100 and 101: Joseph Reisinger and Raymond J. Moo
Page 102 and 103: phrase meaning in vector space. •
Page 104 and 105: By Bayes’ rule, p(x|a, n; Θ) ∝
Page 106 and 107: 4.2 Learning from distributional re
Page 108 and 109: W = (w 1 , w 2 , · · · , w n ) a
Page 110 and 111: Aggregating Continuous Word Embeddi
Page 112 and 113: trieval, Sivic and Zisserman propos
Page 114 and 115: Another advantage is that the propo
Page 116 and 117: e 50 100 200 300 400 500 CLEF 4.0 6
Page 118 and 119: Y. Bengio, J. Louradour, R. Collobe
Page 120 and 121: Answer Extraction by Recursive Pars
Page 122 and 123: Algorithm 1: Auto-encoders co-train
Page 126 and 127: System Short Answer Short Answer Sh
Page 128 and 129: References Y. Bengio and R. Ducharm
Page 130 and 131: problem of paragraphs, of tence, mo
Page 132 and 133: m1 k 4 only on the length of s and
Page 134 and 135: Dialogue Act Label Example Train (%
Page 136 and 137: [Calhoun et al.2010] Sasha Calhoun,
show all

Vector Space Semantic Parsing: A Framework for Compositional ...

Create successful ePaper yourself

Delete template?

Save as template?