
Likelihood (ML) criterion using the Expectation-Maximization (EM) algorithm.

Let us assume that a document contains T words and let us denote by X = {x_1, ..., x_T} the set of continuous word embeddings extracted from the document. We wish to derive a fixed-length representation (i.e. a vector whose dimensionality is independent of T) that characterizes X with respect to u_λ. A natural framework to achieve this goal is the FK (Jaakkola and Haussler, 1999). In what follows, we use the notation of (Perronnin et al., 2010).

Given u_λ, one can characterize the sample X using the score function:

\[ G_\lambda^X = \nabla_\lambda \log u_\lambda(X). \quad (3) \]

This is a vector whose size depends only on the number of parameters in λ. Intuitively, it describes in which direction the parameters λ of the model should be modified so that the model u_λ better fits the data. Assuming that the word embeddings x_t are iid (a simplifying assumption), we get:

\[ G_\lambda^X = \sum_{t=1}^{T} \nabla_\lambda \log u_\lambda(x_t). \quad (4) \]

Jaakkola and Haussler proposed to measure the similarity between two samples X and Y using the FK:

\[ K(X, Y) = {G_\lambda^X}' \, F_\lambda^{-1} \, G_\lambda^Y \quad (5) \]

where F_λ is the Fisher Information Matrix (FIM) of u_λ:

\[ F_\lambda = E_{x \sim u_\lambda}\!\left[ \nabla_\lambda \log u_\lambda(x) \, \nabla_\lambda \log u_\lambda(x)' \right]. \quad (6) \]

As F_λ is symmetric and positive definite, its inverse has a Cholesky decomposition F_λ^{-1} = L_λ' L_λ, and K(X, Y) can be rewritten as a dot-product between normalized vectors 𝒢_λ with:

\[ \mathcal{G}_\lambda^X = L_\lambda \, G_\lambda^X. \quad (7) \]

(Perronnin et al., 2010) refers to 𝒢_λ^X as the Fisher Vector (FV) of X. Using a diagonal approximation of the FIM, we obtain the following formula for the gradient with respect to µ_i:¹

\[ \mathcal{G}_i^X = \frac{1}{\sqrt{\theta_i}} \sum_{t=1}^{T} \gamma_t(i) \left( \frac{x_t - \mu_i}{\sigma_i} \right) \quad (8) \]

where the division by the vector σ_i should be understood as a term-by-term operation and γ_t(i) = p(i | x_t, λ) is the soft assignment of x_t to Gaussian i (i.e. the probability that x_t was generated by Gaussian i), which can be computed using Bayes' formula. The FV 𝒢_λ^X is the concatenation of the 𝒢_i^X, ∀i. Let e be the dimensionality of the continuous word descriptors and K be the number of Gaussians. The resulting vector is e × K dimensional.

¹ We only consider the partial derivatives with respect to the mean vectors since the partial derivatives with respect to the mixture weights and variance parameters carry little additional information (we confirmed this fact in preliminary experiments).
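For concreteness, here is a minimal NumPy sketch of Equation (8). It assumes a diagonal-covariance GMM u_λ whose weights θ, means µ and standard deviations σ have already been fitted by EM, and it keeps only the mean gradients, as above; the function and variable names are ours, not the authors'.

```python
import numpy as np

def fisher_vector(X, theta, mu, sigma):
    """Fisher Vector of one document (Eq. 8), mean gradients only.

    X     : (T, e) array of the document's word embeddings.
    theta : (K,)   mixture weights of the GMM u_lambda.
    mu    : (K, e) Gaussian means.
    sigma : (K, e) per-dimension standard deviations (diagonal covariances).
    Returns the (K * e,) concatenation of the G^X_i.
    """
    T, e = X.shape
    K = theta.shape[0]

    # Soft assignments gamma_t(i) = p(i | x_t, lambda) via Bayes' formula,
    # computed in the log domain for numerical stability.
    log_p = np.empty((T, K))
    for i in range(K):
        z = (X - mu[i]) / sigma[i]
        log_p[:, i] = (np.log(theta[i])
                       - 0.5 * np.sum(z ** 2, axis=1)
                       - np.sum(np.log(sigma[i]))
                       - 0.5 * e * np.log(2.0 * np.pi))
    log_p -= log_p.max(axis=1, keepdims=True)
    gamma = np.exp(log_p)
    gamma /= gamma.sum(axis=1, keepdims=True)            # (T, K)

    # G^X_i = (1 / sqrt(theta_i)) * sum_t gamma_t(i) * (x_t - mu_i) / sigma_i
    G = np.empty((K, e))
    for i in range(K):
        G[i] = gamma[:, i] @ ((X - mu[i]) / sigma[i]) / np.sqrt(theta[i])
    return G.ravel()                                     # e * K dimensional
```

With scikit-learn, for instance, θ, µ and σ could be taken from a fitted `GaussianMixture(covariance_type='diag')` as `weights_`, `means_` and `np.sqrt(covariances_)`.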

3.2 Relationship with LSI, PLSA, LDA

Relationship with LSI. Let n be the number of documents in the collection and t be the number of indexing terms. Let A be the t × n document matrix. In LSI (or NMF), A decomposes as:

\[ A \approx U \Sigma V' \quad (9) \]

where U ∈ R^{t×e}, Σ ∈ R^{e×e} is diagonal, V ∈ R^{n×e} and e is the size of the embedding space. If we choose VΣ as the LSI document embedding matrix – which makes sense if we accept the dot-product as a measure of similarity between documents, since A'A ≈ (VΣ)(VΣ)' – then we have VΣ ≈ A'U. This means that the LSI embedding of a document is approximately the sum of the embeddings of its words, weighted by the number of occurrences of each word.
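To make the VΣ ≈ A'U relationship concrete, here is a small NumPy sketch on a made-up term-document count matrix; the sizes and random counts are ours, chosen only for illustration.

```python
import numpy as np

# Toy term-document count matrix A (t terms x n documents).
rng = np.random.default_rng(0)
A = rng.poisson(1.0, size=(500, 40)).astype(float)   # t = 500, n = 40
e = 10                                               # embedding size

# Truncated SVD: A ~ U Sigma V'   (Eq. 9)
U_full, s, Vt = np.linalg.svd(A, full_matrices=False)
U, Sigma, V = U_full[:, :e], np.diag(s[:e]), Vt[:e].T

doc_emb = V @ Sigma          # LSI document embeddings, one row per document

# The same embeddings as a count-weighted sum of word embeddings:
# row j of A' holds the term counts of document j, the rows of U are the
# word embeddings.
doc_emb_as_sum = A.T @ U
print(np.allclose(doc_emb, doc_emb_as_sum))          # True (up to precision)
```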

Similarly, from equations (4) and (7), it is clear that the FV 𝒢_λ^X is a sum of non-linear mappings:

\[ x_t \rightarrow L_\lambda \nabla_\lambda \log u_\lambda(x_t) = \left[ \gamma_t(1) \, \frac{x_t - \mu_1}{\sqrt{\theta_1}\,\sigma_1}, \; \ldots, \; \gamma_t(K) \, \frac{x_t - \mu_K}{\sqrt{\theta_K}\,\sigma_K} \right] \quad (10) \]

computed for each embedded word x_t. When the number of Gaussians K = 1, the mapping simplifies to a linear one:

\[ x_t \rightarrow \frac{x_t - \mu_1}{\sigma_1} \quad (11) \]

and the FV is simply a whitened version of the sum of word embeddings. Therefore, if we choose LSI to perform the word embeddings in our framework, the Fisher-based representation is similar to the LSI document embedding in the one-Gaussian case. This does not come as a surprise in the case of LSI, since Singular Value Decomposition (SVD) can be viewed as the limit case of a probabilistic model with a Gaussian noise assumption (Salakhutdinov and Mnih, 2007). Hence, the proposed framework makes it possible to model documents even when the word embeddings are non-Gaussian.
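A quick numerical illustration of the K = 1 case, with illustrative single-Gaussian parameters rather than ML-fitted ones (the data and values below are made up for the example):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 5))              # T = 30 embedded words, e = 5

# Single-Gaussian model: illustrative parameters; in the framework they would
# come from ML fitting on the whole corpus, not from one document.
theta = np.array([1.0])                   # mixture weight of the lone Gaussian
mu = X.mean(axis=0)
sigma = X.std(axis=0)

# Eq. 8 with K = 1: every soft assignment gamma_t(1) equals 1 and theta_1 = 1.
gamma = np.ones(len(X))
fv = (gamma[:, None] * (X - mu) / sigma).sum(axis=0) / np.sqrt(theta[0])

# Eq. 11 summed over the document: a whitened sum of the word embeddings.
whitened_sum = ((X - mu) / sigma).sum(axis=0)
print(np.allclose(fv, whitened_sum))      # True
```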

