
most commonly used (Turney and Pantel, 2010). A simple way of obtaining a raw word co-occurrence count representation for $N$ target words is to consider $C$ context words that occur inside a window of length $l$ positioned around each occurrence of the target words. An accumulation of the co-occurrences creates a word-occurrence matrix $X_{C \times N}$. Different context sizes yield representations with different information: Sahlgren (2006) notes that small contexts (of a few words around the target word) give rise to paradigmatic relationships between words, whereas longer contexts find words with a syntagmatic relationship between them. For a review of the current state of the art for vector space models using word-document, word-context or pair-pattern matrices and singular value decomposition-based approaches to dimensionality reduction, see Turney and Pantel (2010).

2.2 Word spaces with SVD, ICA and SENNA

The standard co-occurrence vectors for words can be very high-dimensional even if the intrinsic dimensionality of word context information is actually low (Karlgren et al., 2008; Kivimäki et al., 2010), which calls for an informed way to reduce the data dimensionality while retaining enough information. In our experiments, we apply two computational methods, singular value decomposition (SVD) and ICA, to reduce the dimensionality of the data vectors and to restructure the word space.

Both the SVD and ICA methods extract components that are linear mixtures of the original dimensions. SVD is a general dimension reduction method, applied in the linguistic domain for example in latent semantic analysis (LSA) (Landauer and Dumais, 1997). The LSA method represents word vectors in an orthogonal basis. ICA finds statistically independent components, which is a stronger requirement, and the emerging features are easier to interpret than the SVD features (Honkela et al., 2010).

Truncated SVD approximates the matrix $X_{C \times N}$ as a product $U D V^T$, in which $D_{d \times d}$ is a diagonal matrix with the square roots of the $d$ largest eigenvalues of $X^T X$ (or $X X^T$), $U_{C \times d}$ has the $d$ corresponding eigenvectors of $X X^T$, and $V_{N \times d}$ has the $d$ corresponding eigenvectors of $X^T X$. The rows of $V_{N \times d}$ give a $d$-dimensional representation for the target words.
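For concreteness, a sketch of this step with a plain SVD (for a large vocabulary a truncated solver such as scipy's svds would be used instead; the random count matrix here is only a stand-in):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.poisson(1.0, size=(2000, 500)).astype(float)  # stand-in for the C x N count matrix
d = 100                                               # reduced dimensionality

U, s, Vt = np.linalg.svd(X, full_matrices=False)
# s[:d] are the d largest singular values (square roots of the eigenvalues of X^T X)
word_vectors = Vt[:d].T   # rows of V_{N x d}: one d-dimensional vector per target word
```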

ICA (Comon, 1994; Hyvärinen et al., 2001) represents the matrix $X_{C \times N}$ as a product $A S$, where $A_{C \times d}$ is a mixing matrix and $S_{d \times N}$ contains the independent components. The columns of the matrix $S_{d \times N}$ give a $d$-dimensional representation for the target words. The FastICA algorithm for ICA estimates the model in two stages: 1) dimensionality reduction and whitening (decorrelation and variance normalization), and 2) rotation to maximize the statistical independence of the components (Hyvärinen and Oja, 1997). The dimensionality reduction and decorrelation step can be computed, for instance, with principal component analysis or SVD.
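A corresponding sketch with scikit-learn's FastICA, which performs the dimensionality reduction and whitening stage internally; the orientation (words as samples via the transpose) and the whitening argument (a recent scikit-learn is assumed) are our choices:

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)
X = rng.poisson(1.0, size=(2000, 500)).astype(float)  # stand-in for the C x N count matrix
d = 100

# Words are treated as samples (X.T), so the rows returned below correspond
# to the columns of S_{d x N} in the text: one independent-component vector per word.
ica = FastICA(n_components=d, whiten="unit-variance", random_state=0)
word_vectors = ica.fit_transform(X.T)   # shape (N, d)
```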

We compare the results obtained with dimension reduction to a set of 50-dimensional feature vectors from a system called SENNA (Collobert et al., 2011)¹. SENNA is a labeling system suitable for several tagging tasks: part-of-speech tagging, named entity recognition, chunking and semantic role labeling. The feature vectors for a vocabulary of 130 000 words are obtained by using large amounts of unlabeled data from Wikipedia. In training, the unlabeled data is used in a supervised setting: the system is presented with a target word in its context of 5+5 (preceding+following) words together with a 'correct' class label, and an 'incorrect' sample is constructed by substituting the target word with a random one while keeping the context otherwise intact. The results in the tagging tasks are at the level of the state of the art, which is why we want to compare these representations with the direct evaluation tests.
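A minimal sketch of this corrupted-sample construction, as an illustration of the idea only and not SENNA's actual implementation (the function name is hypothetical):

```python
import random

def make_pair(tokens, i, vocabulary, window=5):
    """Return a (correct, incorrect) pair of 5+5-word windows around tokens[i]:
    the incorrect sample replaces the target word with a random vocabulary word
    while keeping the surrounding context intact."""
    correct = tokens[max(0, i - window): i + window + 1]
    incorrect = list(correct)
    incorrect[min(i, window)] = random.choice(vocabulary)
    return correct, incorrect
```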

2.3 Direct evaluation

In addition to indirect evaluation of vector space models in applications, several tests for direct evaluation of word vector spaces have been proposed; see, e.g., Sahlgren (2006) and Bullinaria and Levy (2007). First we describe the semantic and syntactic category tests. Here, a category means a group of words with a given class label. The precision $P$ in the category task is calculated according to Levy et al. (1998). A centroid for each category is calculated as the arithmetic mean of the word vectors belonging to that category.
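As one possible reading of this procedure, a sketch in which a word counts as correctly classified when its nearest category centroid (by cosine similarity) is that of its own category; the exact scoring rule of Levy et al. (1998), including whether a word is excluded from its own centroid, is an assumption here:

```python
import numpy as np

def category_precision(vectors, labels):
    """Precision P over all words: the fraction whose nearest category centroid
    (by cosine similarity) belongs to their own category."""
    cats = sorted(set(labels))
    centroids = np.array([vectors[[i for i, l in enumerate(labels) if l == c]].mean(axis=0)
                          for c in cats])
    v = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    c = centroids / np.linalg.norm(centroids, axis=1, keepdims=True)
    nearest = (v @ c.T).argmax(axis=1)
    return float(np.mean([cats[k] == l for k, l in zip(nearest, labels)]))
```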

¹ http://ronan.collobert.com/senna/
