Vector Space Semantic Parsing: A Framework for Compositional ...
us to account for both relation r and knowledge K obtained from the PropStore within the compositional framework c = f(a, b, r, K).

The general outline for obtaining the composition of two words is given in Algorithm 1. Here, we first determine the sentence indices where the two words w1 and w2 occur with relation r. Then, we return the expectations around the two words within these sentences. Note that the entire algorithm can conveniently be written in the form of database queries to our PropStore.
Algorithm 1 ComposePair(w1, r, w2)
  M1 ← queryMatrix(w1)
  M2 ← queryMatrix(w2)
  SentIDs ← M1(r) ∩ M2(r)
  return ((M1 ∩ SentIDs) ∪ (M2 ∩ SentIDs))
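As an illustration, Algorithm 1 could be sketched in Python over a toy in-memory PropStore. The dictionary layout (word → relation → sentence-id counts) and the function names here are assumptions for exposition, not the paper's actual implementation:

```python
def query_matrix(propstore, word):
    """Return the word's matrix: {relation: {sentence_id: count}}."""
    return propstore.get(word, {})

def compose_pair(propstore, w1, r, w2):
    """Sketch of Algorithm 1: intersect the sentence ids where w1 and w2
    occur with relation r, then merge both words' counts restricted to
    those sentences."""
    m1 = query_matrix(propstore, w1)
    m2 = query_matrix(propstore, w2)
    # Sentences where both words participate in relation r
    sent_ids = set(m1.get(r, {})) & set(m2.get(r, {}))
    # Union of the two restricted matrices, summing overlapping counts
    composed = {}
    for m in (m1, m2):
        for rel, sents in m.items():
            for sid, count in sents.items():
                if sid in sent_ids:
                    rel_counts = composed.setdefault(rel, {})
                    rel_counts[sid] = rel_counts.get(sid, 0) + count
    return composed
```

In a real system these set intersections would map directly onto database queries against the PropStore, as the text notes.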
Similar to the two-word composition process, given a parse subtree T of a phrase, we obtain its matrix representation of empirical counts over word-relation contexts. This procedure is described in Algorithm 2. Let E = {e1 . . . en} be the set of edges in T, with ei = (wi1, ri, wi2) for all i = 1 . . . n.
Algorithm 2 ComposePhrase(T)
  SentIDs ← All sentences in corpus
  for i = 1 → n do
    Mi1 ← queryMatrix(wi1)
    Mi2 ← queryMatrix(wi2)
    SentIDs ← SentIDs ∩ (Mi1(ri) ∩ Mi2(ri))
  end for
  return ((M11 ∩ SentIDs) ∪ (M12 ∩ SentIDs) ∪ · · · ∪ (Mn1 ∩ SentIDs) ∪ (Mn2 ∩ SentIDs))
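A possible Python sketch of Algorithm 2 follows, under the same assumed PropStore layout (word → relation → sentence-id counts). One interpretation choice is made explicit here: the final union merges each participating word's restricted matrix once, even if the word appears in several edges:

```python
def compose_phrase(propstore, edges, all_sent_ids):
    """Sketch of Algorithm 2: given edges (w_i1, r_i, w_i2) of a parse
    subtree, intersect the sentence ids licensed by every edge, then
    merge each participating word's matrix restricted to them."""
    sent_ids = set(all_sent_ids)  # start from all sentences in the corpus
    words = set()
    for wi1, ri, wi2 in edges:
        m1 = propstore.get(wi1, {})
        m2 = propstore.get(wi2, {})
        words.update([wi1, wi2])
        # Keep only sentences where this edge's relation holds for both words
        sent_ids &= set(m1.get(ri, {})) & set(m2.get(ri, {}))
    # Union of restricted matrices, summing counts per (relation, sentence)
    composed = {}
    for w in words:
        for rel, sents in propstore.get(w, {}).items():
            for sid, count in sents.items():
                if sid in sent_ids:
                    rel_counts = composed.setdefault(rel, {})
                    rel_counts[sid] = rel_counts.get(sid, 0) + count
    return composed
```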
3.4 Tackling Sparsity

The SDSM model reflects syntactic properties of language through preferential filler constraints. But because counts are distributed over a set of relations, the resulting SDSM representation is much sparser than the DSM representation for the same word. In this section we present some ways to address this problem.
3.4.1 Sparse Back-off

The first technique for tackling sparsity is to back off to progressively more general levels of linguistic granularity when sparse matrix representations for words or compositional units are encountered, or when the word or unit is not in the lexicon. For example, the composition "Balthazar eats" cannot be directly computed if the named entity "Balthazar" does not occur in the PropStore's knowledge base. In this case, a query for a supersense substitute, "Noun.person eat", can be issued instead. When supersenses themselves fail to provide numerically significant distributions for words or word combinations, a second back-off step involves querying for POS tags. With coarser levels of linguistic representation, the expressive power of the distributions becomes diluted, but this is often necessary to handle rare words. Note that this is an issue for DSMs too.
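The back-off chain described above might be sketched as follows. The supersense and POS lookup tables are illustrative assumptions standing in for whatever tagging resources the system uses:

```python
# Hypothetical lookup tables; in practice these would come from a
# supersense tagger and a POS tagger, not hand-written dicts.
SUPERSENSE = {"Balthazar": "Noun.person"}
POS_TAG = {"Balthazar": "NNP"}

def query_with_backoff(propstore, word):
    """Return the first non-empty matrix found when querying the surface
    word, then its supersense, then its POS tag."""
    for key in (word, SUPERSENSE.get(word), POS_TAG.get(word)):
        if key is not None and propstore.get(key):
            return propstore[key]
    return {}  # all back-off levels failed
```

Each back-off step trades specificity for coverage: the "Noun.person" distribution says much less about Balthazar in particular, but it is rarely empty.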
3.4.2 Densification

In addition to the back-off method, we also propose a secondary method for "densifying" distributions. A concept's distribution is modified by using words encountered in its syntactic neighborhood to infer counts for other semantically similar words. In other words, given the matrix representation of a concept, densification seeks to populate its null columns (each of which represents a word-dimension in the structured distributional context) with values weighted by their scaled similarities to words (or, effectively, word-dimensions) that actually occur in the syntactic neighborhood.
For example, suppose the word "play" had an "nsubj" preferential vector containing the following counts: [cat:4 ; Jane:2]. One might then populate the column for "dog" in this vector with a count proportional to its similarity to the word cat (say 0.8), resulting in the vector [cat:4 ; Jane:2 ; dog:3.2]. These counts could just as well be probability values or PMI associations (suitably normalized). In this manner, the k most similar word-dimensions can be densified for each word that actually occurs in a syntactic context. As with sparse back-off, there is an inherent trade-off between the degree of densification k and the expressive power of the resulting representation.
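A minimal sketch of this densification step, assuming an external similarity resource that lists each word's nearest neighbors sorted most-similar first:

```python
def densify(vector, neighbors, k=1):
    """Sketch of densification. `vector` maps observed word-dimensions
    to counts; `neighbors` maps a word to [(similar_word, similarity),
    ...] sorted by decreasing similarity (an assumed external resource,
    e.g. cosine similarities from a DSM). Null columns among each
    observed word's k nearest neighbors are filled with
    similarity-scaled counts; observed columns are left untouched."""
    dense = dict(vector)
    for word, count in vector.items():
        for sim_word, sim in neighbors.get(word, [])[:k]:
            if sim_word not in dense:
                dense[sim_word] = sim * count
    return dense
```

On the running example, the "nsubj" vector of "play", [cat:4 ; Jane:2], with sim(cat, dog) = 0.8 and k = 1, densifies to [cat:4 ; Jane:2 ; dog:3.2].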
3.4.3 Dimensionality Reduction

The final method tackles the problem of sparsity by reducing the representation to a dense low-dimensional word embedding using singular value decomposition (SVD). In a typical term-document matrix, SVD finds a low-dimensional approximation of the original matrix in which columns become latent concepts while the similarity structure between rows is preserved. The PropStore, as described in Section 3.1, is an order-3 tensor with w1, w2 and