Vector Space Semantic Parsing: A Framework for Compositional ...

More documents

Recommendations

Info

mation. In addition, using a new set of weight matrices for unary branchings makes our RNN more expressive without facing the problem of sparsity thanks to a large number of unary branchings in the treebank. We also encode syntactic tags by binary vectors but put an extra bit at the end of each vector to mark if the corresponding tag is extra or not (i.e., @P or P ). Figure 3: An RNN attached to the parse tree shown in the top-right of Figure 2. All unary branchings share a set of weight matrices, and all binary branchings share another set of weight matrices (see Figure 4). An RNN processes a tree structure by repeatedly applying itself at each internal node. Thus, walking bottom-up from the leaves of the tree to the root, we compute for every node a vector based on the vectors of its children. Because of this process, those vectors have to have the same dimension. It is worth noting that, because information at leaves, i.e. lexical semantics, is composed according to a given syntactic parse, what a vector at each internal node captures is some aspects of compositional semantics of the corresponding phrase. In the remainder of this subsection, we describe in more detail how to construct compositional vector-based semantics geared towards the parse-correctness classification task. Similar to Socher et al. (2010), and Collobert (2011), given a string of words (w 1 , ..., w l ), we first compute a string of vectors (x 1 , ..., x l ) representing those words by using a look-up table (i.e., word embeddings) L ∈ R n×|V | , where |V | is the size of the vocabulary and n is the dimensionality of the vectors. This look-up table L could be seen as a storage of lexical semantics where each column is a vector representation of a word. Hence, let b i be the binary representation of word w i (i.e., all of the entries of b i are zero except the one corresponding to the index of the word in the dictionary), then x i = Lb i ∈ R n (1) Figure 4: Details about our RNN for a unary branching (top) and a binary branching (bottom). The bias is not shown for the simplicity. Then, given a unary branching P [w h ] → C, we can compute the vector at the node P by (see Figure 4-top) p = f ( W u c + W h x h + W −1 x −1 + W +1 x +1 + W t t p + b u ) where c, x h are vectors representing the child C and the head word, x −1 , x +1 are the left and right neighbouring words of P , t p encodes the syntactic tag of P , W u , W h , W −1 , W +1 ∈ R n×n , W t ∈ R n×(|T |+1) , |T | is the size of the set of syntactic tags, b u ∈ R n , and f can be any activation function (tanh is used in this case). With a binary branching P [w h ] → C 1 C 2 , we simply change the way the children’s vectors added (see Figure 4-bottom) p = f ( W b1 c 1 + W b2 c 2 + W h x h + W −1 x −1 + W +1 x +1 + W t t p + b b ) Finally, we put a sigmoid neural unit on the top of each internal node (except pre-terminal nodes because we are not concerned with POStagging) to detect the correctness of the subparse tree rooted at that node y = sigmoid(W cat p + b cat ) (2) where W cat ∈ R 1×n , b cat ∈ R. 14
3.1 Learning The error on a parse tree is computed as the sum of classification errors of all subparses. Hence, the learning is to minimise the objective J(θ) = 1 N ∑ T ∑ (y(θ),t)∈T 1 2 (t − y(θ))2 + λ‖θ‖ 2 (3) where θ are the model parameters, N is the number of trees, λ is a regularisation hyperparameter, T is a parse tree, y(θ) is given by Equation 2, and t is the class of the corresponding subparse (t = 1 means correct). The gradient ∂J ∂θ is computed efficiently thanks to backpropagation through the structure (Goller and Kuchler, 1996). L-BFGS (Liu and Nocedal, 1989) is used to minimise the objective function. 3.2 Experiments We implemented our classifier in Torch7 3 (Collobert et al., 2011a), which is a powerful Matlablike environment for machine learning. In order to save time, we only trained the classifier on 10-best parses of WSJ02-21. The training phase took six days on a computer with 16 800MHz CPU-cores and 256GB RAM. The word embeddings given by Collobert et al. (2011b) 4 were used as L in Equation 1. Note that these embeddings, which are the result of training a language model neural network on the English Wikipedia and Reuters, have been shown to capture many interesting semantic similarities between words. We tested the classifier on the development set WSJ22, which contains 1, 700 sentences, and measured the performance in positive rate and negative rate pos-rate = neg-rate = #true pos #true pos + #false neg #true neg #true neg + #false pos The positive/negative rate tells us the rate at which positive/negative examples are correctly labelled positive/negative. In order to achieve high performance in the reranking task, the classifier must have a high positive rate as well as a high negative rate. In addition, percentage of positive examples is also interesting because it shows the unbalancedness of the data. Because the accuracy is not 3 http://www.torch.ch/ 4 http://ronan.collobert.com/senna/ a reliable measurement when the dataset is highly unbalanced, we do not show it here. Table 1, Figure 5, and Figure 6 show the classification results. pos-rate (%) neg-rate (%) %-Pos gold-std 75.31 - 1 1-best 90.58 64.05 71.61 10-best 93.68 71.24 61.32 50-best 95.00 73.76 56.43 Table 1: Classification results on the WSJ22 and the k-best lists. Figure 5: Positive rate, negative rate, and percentage of positive examples w.r.t. subtree depth. 3.3 Discussion Table 1 shows the classification results on the gold-standard, 1-best, 10-best, and 50-best lists. The positive rate on the gold-standard parses, 75.31%, gives us the upper bound of %-pos when this classifier is used to yield 1-best lists. On the 1- best data, the classifier missed less than one tenth positive subtrees and correctly found nearly two third of the negative ones. That is, our classifier might be useful for avoiding many of the mistakes made by the Berkeley parser, whilst not introducing too many new mistakes of its own. This fact gave us hope to improve parsing performance when using this classifier for reranking. Figure 5 shows positive rate, negative rate, and percentage of positive examples w.r.t. subtree depth on the 50-best data. We can see that the positive rate is inversely proportional to the subtree depth, unlike the negative rate. That is because the 15
Page 1 and 2: ACL 2013 51st Annual Meeting of the
Page 3: Introduction In recent years, there
Page 7: Table of Contents Vector Space Sema
Page 10 and 11: Poster session (continued) 12:30 Lu
Page 12 and 13: Input: Log. Form: “red ball”
Page 14 and 15: the NP/N λx.x red N/N λx.A red x
Page 16 and 17: dicted and target distributions, an
Page 18 and 19: typed objects lie in the same vecto
Page 20 and 21: Peter D. Turney. 2006. Similarity o
Page 22 and 23: (1999), who suggested that the real
Page 26 and 27: Figure 6: Positive rate, negative r
Page 28 and 29: References Rens Bod, Remko Scha, an
Page 30 and 31: A Structured Distributional Semanti
Page 32 and 33: Figure 1: Sample sentences & triple
Page 34 and 35: el as its three axes. We explore th
Page 36 and 37: as well as our framing of word comp
Page 38 and 39: References Marco Baroni and Roberto
Page 40 and 41: Letter N-Gram-based Input Encoding
Page 42 and 43: Hidden my house my house Visibl
Page 44 and 45: … m my y h se us Hidden Visible m
Page 46 and 47: line of the systems presented in Ta
Page 48 and 49: References Yoshua Bengio, Réjean D
Page 50 and 51: Transducing Sentences to Syntactic
Page 52 and 53: t s “We booked the flight” →
Page 54 and 55: D 3(s) = ❀ PRP + ❀ V + DT ❀ +
Page 56 and 57: Output Model λ = 0 λ = 0.2 λ = 0
Page 58 and 59: References James A. Anderson. 1973.
Page 60 and 61: General estimation and evaluation o
Page 62 and 63: Guevara (2010) and Zanzotto et al.
Page 64 and 65: ⃗p = A u ⃗v where A u is a matr
Page 66 and 67: stituents. For this reason, we must
Page 68 and 69: References Hervé Abdi and Lynne Wi
Page 70 and 71: GOOD VERTICAL safe out raise level
Page 72 and 73: HLBL HLBL SENNA 4lang original scal
Page 74 and 75:
Determining Compositionality of Wor
Page 76 and 77:
ability of a word in the investigat
Page 78 and 79:
terminer) wheel” and “reinvent
Page 80 and 81:
ing LSA are depicted. Discussion As
Page 82 and 83:
WSM Measure wAvg(of ρ) ρAN-VO-SV
Page 84 and 85:
“Not not bad” is not “bad”:
Page 86 and 87:
activation functions in recursive n
Page 88 and 89:
logic experiment in Socher et al. (
Page 90 and 91:
tinction without the need to move t
Page 92 and 93:
T. K. Landauer and S. T. Dumais. 19
Page 94 and 95:
This differs from the classical Ran
Page 96 and 97:
Hungarian Algorithm First cosine si
Page 98 and 99:
Features: MSRpar MSRvid SMTeuroparl
Page 100 and 101:
Joseph Reisinger and Raymond J. Moo
Page 102 and 103:
phrase meaning in vector space. •
Page 104 and 105:
By Bayes’ rule, p(x|a, n; Θ) ∝
Page 106 and 107:
4.2 Learning from distributional re
Page 108 and 109:
W = (w 1 , w 2 , · · · , w n ) a
Page 110 and 111:
Aggregating Continuous Word Embeddi
Page 112 and 113:
trieval, Sivic and Zisserman propos
Page 114 and 115:
Another advantage is that the propo
Page 116 and 117:
e 50 100 200 300 400 500 CLEF 4.0 6
Page 118 and 119:
Y. Bengio, J. Louradour, R. Collobe
Page 120 and 121:
Answer Extraction by Recursive Pars
Page 122 and 123:
Algorithm 1: Auto-encoders co-train
Page 124 and 125:
NP NP ) , ADJP , NP ( NP - NP Engli
Page 126 and 127:
System Short Answer Short Answer Sh
Page 128 and 129:
References Y. Bengio and R. Ducharm
Page 130 and 131:
problem of paragraphs, of tence, mo
Page 132 and 133:
m1 k 4 only on the length of s and
Page 134 and 135:
Dialogue Act Label Example Train (%
Page 136 and 137:
[Calhoun et al.2010] Sasha Calhoun,
show all

Vector Space Semantic Parsing: A Framework for Compositional ...

Create successful ePaper yourself

Delete template?

Save as template?