Vector Space Semantic Parsing: A Framework for Compositional ...

Figure 3: Convolution of a vector $\mathbf{m}$ with a kernel $\mathbf{k}$ of size 4.

2.2.1 Kernel and One-dimensional Convolution

Given a sentence $s$ and its paired matrix $\mathbf{M}^s$, let $\mathbf{m}$ be a feature that is a row in $\mathbf{M}^s$. Before defining kernels and the convolution operation, let us consider the underlying operation of local weighted addition. Let $w_1, \ldots, w_k$ be a sequence of $k$ weights; given the feature $\mathbf{m}$, local weighted addition over the first $k$ values of $\mathbf{m}$ gives:

\[ y = w_1 m_1 + \cdots + w_k m_k \tag{1} \]

Then, a kernel simply defines the value of $k$ by specifying the sequence of weights $w_1, \ldots, w_k$, and the one-dimensional convolution applies local weighted addition with the $k$ weights to each subsequence of values of $\mathbf{m}$.

More precisely, let a one-dimensional kernel $\mathbf{k}$ be a vector of weights and assume $|\mathbf{k}| \le |\mathbf{m}|$, where $|\cdot|$ is the number of elements in a vector. Then we define the discrete, valid, one-dimensional convolution $(\mathbf{k} * \mathbf{m})$ of kernel $\mathbf{k}$ and feature $\mathbf{m}$ by:

\[ (\mathbf{k} * \mathbf{m})_i := \sum_{j=1}^{k} \mathbf{k}_j \cdot \mathbf{m}_{k+i-j} \tag{2} \]

where $k = |\mathbf{k}|$ and $|\mathbf{k} * \mathbf{m}| = |\mathbf{m}| - k + 1$. Each value in $\mathbf{k} * \mathbf{m}$ is a sum of $k$ values of $\mathbf{m}$ weighted by values in $\mathbf{k}$ (Fig. 3). To define the hierarchical architecture of the model, we need to define a sequence of kernel sizes and associated weights. To this we turn next.

2.2.2 Sequence of Kernel Sizes

Let $l$ be the number of words in the sentence $s$. The sequence of kernel sizes $\langle k_i^l \rangle_{i \le t}$ depends only on the length of $s$ and itself has length $t = \lceil \sqrt{2l} \rceil - 1$. It is given recursively by:

\[ k_1^l = 2, \qquad k_{i+1}^l = k_i^l + 1, \qquad k_t^l = l - \sum_{j=1}^{t-1} (k_j^l - 1) \tag{3} \]

That is, kernel sizes increase by one until the resulting convolved vector is smaller than or equal to the last kernel size; see for example the kernel sizes in Fig. 1. Note that, for a sentence of length $l$, the number of layers in the HCNN including the input layer will be $t + 1$, as convolution with the corresponding kernel is applied at every layer of the model. Let us now proceed to define the hierarchy of layers in the HCNN.
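The valid convolution of Equation (2) and the kernel-size recursion of Equation (3) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `valid_conv1d` and `kernel_sizes` are hypothetical and do not come from the paper, vectors are plain Python lists, and the sketch assumes a sentence of at least two words.

```python
import math


def valid_conv1d(k, m):
    """Discrete, valid, one-dimensional convolution of kernel k with feature m (Eq. 2)."""
    size = len(k)
    if size > len(m):
        raise ValueError("the kernel must not be longer than the feature")
    out = []
    # i and j are 1-based as in Eq. 2; the trailing "- 1" converts to 0-based indexing.
    for i in range(1, len(m) - size + 2):
        out.append(sum(k[j - 1] * m[size + i - j - 1] for j in range(1, size + 1)))
    return out


def kernel_sizes(l):
    """Kernel sizes k_1, ..., k_t for a sentence of l words (Eq. 3)."""
    if l < 2:
        raise ValueError("this sketch assumes a sentence of at least two words")
    t = math.ceil(math.sqrt(2 * l)) - 1
    # k_1 = 2 and each following size grows by one, for the first t - 1 kernels.
    sizes = [i + 1 for i in range(1, t)]
    # The last kernel spans whatever is left, so the final convolved vector has length 1.
    sizes.append(l - sum(s - 1 for s in sizes))
    return sizes


if __name__ == "__main__":
    print(valid_conv1d([1, 2], [1, 2, 3]))
    print(kernel_sizes(7))
```

For a sentence of seven words, for instance, the recursion gives three kernels, and applying them in turn shortens a feature from length 7 to 6, then 4, then 1.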
2.2.3 Composition Operation in a HCNN

Given a sentence $s$, its length $l$ and a sequence of kernel sizes $\langle k_i^l \rangle_{i \le t}$, we may now give the recursive definition that yields the hierarchy of one-dimensional convolution operations applied to each feature $f$ that is a row in $\mathbf{M}^s$. Specifically, for each feature $f$, let $\mathbf{K}_i^f$ be a sequence of $t$ kernels, where the size of the kernel is $|\mathbf{K}_i^f| = k_i^l$. Then we have the hierarchy of matrices and corresponding features as follows:

\[ \mathbf{M}^1_{f,:} = \mathbf{M}^s_{f,:} \tag{4} \]

\[ \mathbf{M}^{i+1}_{f,:} = \sigma\left( \mathbf{K}_i^f * \mathbf{M}^i_{f,:} + \mathbf{b}_f^i \right) \tag{5} \]

for some non-linear sigmoid function $\sigma$ and bias $\mathbf{b}_f^i$, where $i$ ranges over $1, \ldots, t$. In sum, one-dimensional convolution is applied feature-wise to each feature of a matrix at a certain layer, where the kernel weights depend both on the layer and the feature at hand (Fig. 1). A hierarchy of matrices is thus generated, with the top matrix being a single vector for the sentence.

2.2.4 Multiple merged HCNNs

Optionally one may consider multiple parallel HCNNs that are merged according to different strategies, either at the top sentence vector layer or at intermediate layers. The weights in the word vectors may be tied across different HCNNs. Although potentially useful, multiple merged HCNNs are not used in the experiment below.

This concludes the description of the sentence model. Let us now proceed to the discourse model.
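Returning to the composition in Equations (4) and (5), the feature-wise hierarchy of the HCNN can be sketched as follows. The sketch rests on several assumptions that are not taken from the paper: the name `hcnn_sentence_vector` and the nested-list layout of kernels and biases are hypothetical, the logistic function stands in for the unspecified sigmoid $\sigma$, and each bias $\mathbf{b}_f^i$ is treated as a single scalar added to every position.

```python
import math


def sigmoid(x):
    """Logistic function, used here as one possible choice of sigma."""
    return 1.0 / (1.0 + math.exp(-x))


def valid_conv1d(k, m):
    """Valid 1-D convolution (Eq. 2), written with 0-based indices."""
    size = len(k)
    return [
        sum(k[j] * m[i + size - 1 - j] for j in range(size))
        for i in range(len(m) - size + 1)
    ]


def hcnn_sentence_vector(M_s, kernels, biases):
    """Apply Eqs. 4-5 to every feature row of the sentence matrix M_s.

    M_s        -- list of feature rows, each of length l (assumed layout)
    kernels[f] -- the t kernels for feature f, sized as in Eq. 3
    biases[f]  -- one scalar bias per layer for feature f (an assumption)
    """
    sentence_vector = []
    for f, row in enumerate(M_s):
        layer = list(row)  # Eq. 4: the first layer is the input feature itself
        for kernel, bias in zip(kernels[f], biases[f]):
            # Eq. 5: convolve, add the bias, apply the non-linearity
            layer = [sigmoid(v + bias) for v in valid_conv1d(kernel, layer)]
        if len(layer) != 1:
            raise ValueError("kernel sizes must reduce each feature to one value")
        sentence_vector.append(layer[0])
    return sentence_vector


if __name__ == "__main__":
    # One feature, a three-word sentence, kernel sizes (2, 2) as given by Eq. 3.
    M_s = [[0.0, 0.0, 0.0]]
    kernels = [[[1.0, 1.0], [1.0, 1.0]]]
    biases = [[0.0, 0.0]]
    print(hcnn_sentence_vector(M_s, kernels, biases))
```

Because every feature has its own kernels at every layer, the loop over features is independent, and the top of the hierarchy holds one value per feature, which together form the sentence vector.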
Figure 4: Unravelling of a RCNN discourse model to depth $d = 2$. The recurrent $\mathbf{H}^i$ and output $\mathbf{O}^i$ weights are conditioned on the respective agents $a_i$. (The figure uses the example utterances $s_i$ = "Open the pod bay doors HAL" and $s_{i+1}$ = "I'm afraid I can't do that Dave".)

3 Discourse Model

The discourse model adapts a RNN architecture in order to capture central properties of discourse. We here first describe such properties and then define the model itself.

3.1 Discourse Compositionality

The meaning of discourse - and of words and utterances within it - is often a result of a rich ensemble of context, of speakers' intentions and actions and of other relevant surrounding circumstances (Korta and Perry, 2012; Potts, 2011). Far from capturing all aspects of discourse meaning, we aim at capturing in the model at least two of the most prominent ones: the sequentiality of the utterances and the interactions between the speakers.

Concerning sequentiality, just the way the meaning of a sentence generally changes if words in it are permuted, so does the meaning of a paragraph or dialogue change if one permutes the sentences or utterances within. The change of meaning is more marked the larger the shift in the order of the sentences. Especially in tasks where one is concerned with a specific sentence within the context of the previous discourse, capturing the order of the sentences preceding the one at hand may be particularly crucial.

Concerning the speakers' interactions, the meaning of a speaker's utterance within a discourse is differentially affected by the speaker's previous utterances as opposed to other speakers' previous utterances. Where applicable we aim at making the computed meaning vectors reflect the current speaker and the sequence of interactions with the previous speakers. With these two aims in mind, let us now proceed to define the model.
3.2 Recurrent Convolutional Neural Network

The discourse model coupled to the sentence model is based on a RNN architecture with inputs from a HCNN and with the recurrent and output weights conditioned on the respective speakers. We take as given a sequence of sentences or utterances $s_1, \ldots, s_T$, each in turn being a sequence of words $s_i = y_1^i \ldots y_l^i$, a sequence of labels $x_1, \ldots, x_T$ and a sequence of speakers or agents $a_1, \ldots, a_T$, in such a way that the $i$-th utterance is performed by the $i$-th agent and has label $x_i$. We denote by $\mathbf{s}_i$ the sentence vector computed by way of the sentence model for the sentence $s_i$. The RCNN computes probability distributions $\mathbf{p}_i$ for the label at step $i$ by iterating the following equations:

\[ \mathbf{h}_i = \sigma\left( \mathbf{I} \mathbf{x}_{i-1} + \mathbf{H}^{i-1} \mathbf{h}_{i-1} + \mathbf{S} \mathbf{s}_i + \mathbf{b}_h \right) \tag{6} \]

\[ \mathbf{p}_i = \mathrm{softmax}\left( \mathbf{O}^i \mathbf{h}_i + \mathbf{b}_o \right) \tag{7} \]

where $\mathbf{I}$, $\mathbf{H}^i$, $\mathbf{O}^i$ are corresponding weight matrices for each agent $a_i$ and $\mathrm{softmax}(\mathbf{y})_k = \frac{e^{y_k}}{\sum_j e^{y_j}}$ returns a probability distribution. Thus $\mathbf{p}_i$ is taken to model the following predictive distribution: p_i = P(x_i | x