GWC 2008


Verification of Valency Frame Structures by Means of Automatic Context Clustering in RussNet

Irina V. Azarova 1, Anna S. Marina 2, and Anna A. Sinopalnikova 3

1 Department of Applied Linguistics, St-Petersburg State University, Universitetskaya nab. 11, 199034 St-Petersburg, Russia.

2 Department of Lexicography, Institute of Linguistic Studies, Tuchkov pereulok 9, 199053 Saint-Petersburg, Russia.

3 Brno University of Technology, Bozetechova 2, 61266 Brno, Czech

ivazarova@gmail.com, a_s_marina@rambler.ru, sino@fit.vutbr.cz

Abstract. The central feature of the RussNet technique is the specification of valency frames for synsets. Valency frame parameters are used to differentiate word meanings and synsets, both during thesaurus construction and in automatic text analysis for word sense disambiguation. A valency description is calculated from statistically stable context features in the text corpus: morphological, syntactic, and semantic. This paper discusses the automatic classification of verb contexts with unambiguous morphological annotation. The goal of the technique is to differentiate the semantic types of verbs. The procedure relies on morphological tag distributions within a context window for verbs from different semantic trees of RussNet. The optimal width of the distribution window, an appropriate tag set, and the clustering results are discussed. The procedure may be helpful at various stages of analysis, especially for verifying valency frames within a semantic tree.
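As an illustration of what a morphological tag distribution in a context window might look like, the following Python sketch counts the tags that occur within a fixed number of tokens on either side of a verb and normalizes the counts to relative frequencies. It is a minimal sketch under stated assumptions, not the authors' implementation: the function name `tag_distribution`, the tag labels, and the window width are hypothetical, and the clustering step itself is not shown.

```python
from collections import Counter


def tag_distribution(tags, verb_pos, width):
    """Relative frequency of morphology tags within `width` tokens of the verb.

    `tags` holds one tag per token; the verb's own tag is excluded.
    """
    lo = max(0, verb_pos - width)
    hi = min(len(tags), verb_pos + width + 1)
    window = tags[lo:verb_pos] + tags[verb_pos + 1:hi]
    counts = Counter(window)
    total = sum(counts.values())
    return {tag: n / total for tag, n in counts.items()} if total else {}


# Hypothetical tagged context (tag labels are assumptions); verb at index 2.
context = ["ADJ", "NOUN.nom", "VERB", "PREP", "NOUN.acc", "PUNCT"]
dist = tag_distribution(context, verb_pos=2, width=2)
print(dist)
```

Distributions of this kind, averaged over many contexts of a verb, could serve as feature vectors for a clustering procedure; varying `width` corresponds to the question of the optimal window width raised in the abstract.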

1 Introduction

The computer thesaurus RussNet¹, developed at the Department of Applied Mathematical Linguistics of Saint-Petersburg State University, inherited the main principles of the WordNet construction method [1]. RussNet is based on a corpus of modern texts (dated from 1985 to the present) comprising 21 million words, the majority of which (60%) are newspaper and magazine articles on various topics, covering the thematic diversity of common Russian [2].

RussNet was not translated from the WordNet prototype; its structure includes some additional components oriented toward its use in automatic text analysis [3].

The basic node in RussNet – the synset – may include several members (words or<br />

multiword expressions), which are ordered by their frequency of appearance in the<br />

corpus contexts in the particular sense described by the synset. This frequency is<br />

¹ http://www.phil.pu.ru/depts/12/RN
