
Guevara (2010) and Zanzotto et al. (2010) explore a full form of the additive model (Fulladd), where the two vectors entering a composition process are pre-multiplied by weight matrices before being added, so that each output component is a weighted sum of all input components: $\vec{p} = W_1\vec{u} + W_2\vec{v}$.
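For concreteness, a minimal numpy sketch of Fulladd composition (the 300-dimensional vectors and random weight matrices are purely illustrative assumptions; in practice $W_1$ and $W_2$ are estimated as discussed below):

```python
import numpy as np

def fulladd(u, v, W1, W2):
    """Fulladd composition: p = W1 u + W2 v, so each output
    component is a weighted sum of all input components."""
    return W1 @ u + W2 @ v

# illustrative toy setup with random vectors and weights
rng = np.random.default_rng(0)
d = 300
u, v = rng.standard_normal(d), rng.standard_normal(d)
W1, W2 = rng.standard_normal((d, d)), rng.standard_normal((d, d))
p = fulladd(u, v, W1, W2)  # composed phrase vector, shape (d,)
```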

Baroni and Zamparelli (2010) and Coecke et al. (2010), taking inspiration from formal semantics, characterize composition as function application. For example, Baroni and Zamparelli model adjective-noun phrases by treating the adjective as a function from nouns onto (modified) nouns. Given that linear functions can be expressed by matrices and their application by matrix-by-vector multiplication, a functor (such as the adjective) is represented by a matrix $A_u$ to be composed with the argument vector $\vec{v}$ (e.g., the noun) by multiplication, returning the lexical function (Lexfunc) representation of the phrase: $\vec{p} = A_u\vec{v}$.
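A corresponding sketch of Lexfunc composition (the adjective matrix below is random only for illustration; its estimation from corpus data is discussed later in this section and in Section 3):

```python
import numpy as np

def lexfunc(A_u, v):
    """Lexfunc composition: the functor word (e.g., an adjective) is a
    matrix A_u applied to the argument vector v, giving p = A_u v."""
    return A_u @ v

rng = np.random.default_rng(0)
d = 300
A_green = rng.standard_normal((d, d))  # stand-in for the learned matrix of 'green'
jacket = rng.standard_normal(d)        # stand-in for the corpus vector of 'jacket'
green_jacket = lexfunc(A_green, jacket)
```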

The method proposed by Socher et al. (2012) (see Socher et al. (2011) for an earlier proposal from the same team) can be seen as a combination and non-linear extension of Fulladd and Lexfunc (that we thus call Fulllex), in which both phrase elements act as functors (matrices) and arguments (vectors). Given input terms u and v represented by $(\vec{u}, A_u)$ and $(\vec{v}, A_v)$, respectively, their composition vector is obtained by applying first a linear transformation and then the hyperbolic tangent function to the concatenation of the products $A_u\vec{v}$ and $A_v\vec{u}$ (see Table 1 for the equation). Socher and colleagues also present a way to construct matrix representations for specific phrases, needed to scale this composition method to larger constituents. We ignore it here since we focus on the two-word case.
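A sketch of Fulllex composition for the two-word case, following the description above (the random parameters are illustrative assumptions; the exact formulation is the Table 1 equation):

```python
import numpy as np

def fulllex(u, A_u, v, A_v, W):
    """Fulllex composition: a linear map W over the concatenation of
    A_u v and A_v u, followed by the hyperbolic tangent."""
    return np.tanh(W @ np.concatenate([A_u @ v, A_v @ u]))

rng = np.random.default_rng(0)
d = 300
u, v = rng.standard_normal(d), rng.standard_normal(d)
A_u, A_v = rng.standard_normal((d, d)), rng.standard_normal((d, d))
W = rng.standard_normal((d, 2 * d))  # maps the 2d-dimensional concatenation back to d
p = fulllex(u, A_u, v, A_v, W)
```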

Estimating composition parameters
If we have manually labeled example data for a target task, we can use supervised machine learning to optimize parameters. Mitchell and Lapata (2008; 2010), since their models have just a few parameters to optimize, use a direct grid search for the parameter setting that performs best on the training data. Socher et al. (2012) train their models using multinomial softmax classifiers.
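As an illustration of the grid-search strategy, the sketch below tunes the two scalar weights of a weighted-additive model $\vec{p} = \alpha\vec{u} + \beta\vec{v}$; the scoring function used here (average cosine to gold phrase vectors) is only a stand-in for whatever objective the target task defines:

```python
import numpy as np

def weighted_add(u, v, alpha, beta):
    return alpha * u + beta * v

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def grid_search(train_pairs, grid=np.arange(0.0, 1.01, 0.05)):
    """Exhaustively try (alpha, beta) settings and keep the one whose
    composed vectors score best against the training targets."""
    best, best_score = None, -np.inf
    for alpha in grid:
        for beta in grid:
            score = np.mean([cosine(weighted_add(u, v, alpha, beta), p)
                             for u, v, p in train_pairs])
            if score > best_score:
                best, best_score = (alpha, beta), score
    return best
```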

If our goal is to develop a cDSM optimized for a specific task, supervised methods are undoubtedly the most promising approach. However, every time we face a new task, parameters must be re-estimated from scratch, which goes against the idea of distributional semantics as a general similarity resource (Baroni and Lenci, 2010). Moreover, supervised methods are highly composition-model-dependent, and for models such as Fulladd and Lexfunc we are not aware of proposals about how to estimate them in a supervised manner.

Socher et al. (2011) propose an autoencoding strategy. Given a decomposition function that reconstructs the constituent vectors from a phrase vector (e.g., it re-generates the green and jacket vectors from the composed green jacket vector), the composition parameters minimize the distance between the original and reconstructed input vectors. This method does not require hand-labeled training data, but it is restricted to cDSMs for which an appropriate decomposition function can be defined, and even in this case the learning problem might lack a closed-form solution.
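The sketch below is not Socher et al.'s actual recursive autoencoder; it only makes the reconstruction objective concrete, assuming an additive-style encoder with a tanh nonlinearity and a single linear decoder D as the decomposition function:

```python
import numpy as np

def reconstruction_loss(u, v, W1, W2, D):
    """Compose u and v into p, decode p back into candidate constituent
    vectors, and return the squared distance to the original inputs."""
    p = np.tanh(W1 @ u + W2 @ v)   # composition (encoding) step
    rec = D @ p                    # decomposition (decoding) step, gives [u'; v']
    u_rec, v_rec = rec[:len(u)], rec[len(u):]
    return np.sum((u - u_rec) ** 2) + np.sum((v - v_rec) ** 2)
```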

Guevara (2010) and Baroni and Zamparelli (2010) optimize parameters using examples of what the output vectors should look like, extracted directly from the corpus. To learn, say, a Lexfunc matrix representing the adjective green, we extract from the corpus example vectors of 〈N, green N〉 pairs that occur with sufficient frequency (〈car, green car〉, 〈jacket, green jacket〉, 〈politician, green politician〉, . . . ). We then use least-squares methods to find weights for the green matrix that minimize the distance between the green N vectors generated by the model given the input N and the corresponding corpus-observed phrase vectors. This is a very general approach: it does not require hand-labeled data, and it has the nice property that corpus-harvested phrase vectors provide direct evidence of the polysemous behaviour of functors (the green jacket vs. politician contexts, for example, will be very different). In the next section, we extend the corpus-extracted phrase approximation method to all cDSMs described above, with closed-form solutions for all but the Fulllex model, for which we propose a rapidly converging iterative estimation method.
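A sketch of this estimation for a single Lexfunc matrix, using plain (unregularized) least squares over corpus-observed 〈noun, adjective-noun〉 vector pairs; the actual estimation may add regularization:

```python
import numpy as np

def estimate_lexfunc(noun_vectors, phrase_vectors):
    """Least-squares estimate of a matrix A such that A n_i is as close
    as possible to p_i for each corpus-observed <noun, phrase> pair.

    noun_vectors, phrase_vectors: arrays of shape (k, d), one row per pair.
    """
    N = np.asarray(noun_vectors)    # k x d input noun vectors
    P = np.asarray(phrase_vectors)  # k x d corpus-observed phrase vectors
    # Row-wise, A n_i ~ p_i stacks into N A^T ~ P; lstsq solves for A^T.
    A_T, *_ = np.linalg.lstsq(N, P, rcond=None)
    return A_T.T                    # the d x d matrix for the adjective
```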

3 Least-squares model estimation using corpus-extracted phrase vectors³

Notation
Given two matrices $X, Y \in \mathbb{R}^{m \times n}$, we denote their inner product by $\langle X, Y \rangle$, where $\langle X, Y \rangle = \sum_{i=1}^{m} \sum_{j=1}^{n} x_{ij} y_{ij}$. Similarly, we denote by $\langle u, v \rangle$ the dot product of two vectors $u, v \in \mathbb{R}^{m \times 1}$, and by $\|u\|$ the Euclidean norm of a vector: $\|u\| = \sqrt{\langle u, u \rangle}$.
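As a quick concrete check of the notation, in numpy terms:

```python
import numpy as np

X = np.arange(6.0).reshape(2, 3)
Y = np.ones((2, 3))
inner_XY = np.sum(X * Y)   # <X, Y> = sum_ij x_ij * y_ij  -> 15.0

u = np.array([3.0, 4.0])
v = np.array([1.0, 2.0])
dot_uv = u @ v             # <u, v> -> 11.0
norm_u = np.sqrt(u @ u)    # ||u|| = sqrt(<u, u>) -> 5.0
```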

³ Proofs omitted due to space constraints.
