Russian Speech Corpora Framework for Linguistic Purposes

Pavel Skrelin, Daniil Kocharov
Department of Phonetics, Saint-Petersburg State University, Universitetskaya emb., 11, 199034, Saint-Petersburg, Russia
skrelin@phonetics.pu.ru, kocharov@phonetics.pu.ru

Abstract

The paper introduces a comprehensive speech corpora framework for linguistic purposes developed at the Department of Phonetics, Saint Petersburg State University. It was designed especially for phoneticians, providing them with access to speech corpora and convenient tools for speech data selection and analysis. The framework consists of three major parts: speech data, linguistic annotation, and software tools for processing and automatic annotation of speech data. The framework was designed for the Russian language. The paper presents the underlying ideas of framework development and describes its architecture.

Keywords: corpora annotation, transcription, Russian, corpora application

1. Introduction

Most of the large speech corpora used in speech technology are intended for automatic collection and processing of statistical data, not for linguistic analysis of the speech data. For these purposes, it is enough to have sound recordings and their orthographic transcription or a simple phonetic transcription. A linguist, however, often uses a corpus to test a hypothesis or to explain errors of automatic speech processing.
The interrelationship between different levels of the language system and the way it shows up through corresponding sound patterns are of particular interest. Obviously, the corpus should contain high-quality annotated speech data that provide researchers with a wide range of linguistic information. Good examples of such resources are the corpora developed for Dutch (Grønnum, 2009). Some data processing results may be so essential or useful that it becomes desirable to add them to the corpus as new annotation data for further use. Thus, the corpus annotation scheme should be scalable, so that new annotations can be added to the speech data.

When studying a specific linguistic or speech phenomenon, a user would like to deal only with the parts of the corpus that are relevant to the subject of the research, not with its whole content. To make this possible, the speech corpus needs to be accompanied by software enabling customizable search and extraction of segments with specific annotation according to given criteria.

The framework developed at the Department of Phonetics, Saint Petersburg State University consists of three major parts: speech data, linguistic annotation, and software tools for processing and automatic annotation of speech data as well as for complex search of relevant speech data using multiple search criteria. At present, two corpora of fully annotated Russian speech are used within this framework. The first one was used as material for a cross-linguistic phonetic study of spontaneous speech (Bondarko, 2009).
The second is used for a study of read-aloud speech (Skrelin et al., 2010). The paper presents the underlying ideas of framework development and describes its architecture.

2. General Architecture

The framework consists of three major modules. The first is speech data. The second, and the most essential, is speech annotation: segmentation information and data labeling. The third is a set of built-in tools for speech processing, basic feature extraction, statistical processing, extending the corpus with linguistic and automatically generated annotation, searching within a corpus for specific data, and extracting slices of these data.

2.1. Annotation Scheme

The annotation is designed to capture as much phonetically and prosodically relevant information as possible. Our primary objective is to ensure that the annotation of the corpus covers a wide range of information that may be of interest to those involved in most areas of linguistic research, and in phonetics in particular. For example, suppose the linguistic goal is to determine the spectral characteristics of [u] pronounced by a female speaker with a high pitch on a prosodic rise consistent with question intonation patterns.
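A selection of this kind combines constraints from several annotation levels at once. The following is only a minimal sketch of that idea, not the framework's actual data model or API: the `Interval` type, the tier names `phonetic_manual` and `intonation`, the label `rise_question`, the speaker code `"F"` and the 250 Hz threshold are all hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Interval:
    """A labelled stretch of signal on one annotation tier (hypothetical model)."""
    start: float  # seconds
    end: float    # seconds
    label: str


def overlaps(a: Interval, b: Interval) -> bool:
    """True if the two intervals share any stretch of time."""
    return a.start < b.end and b.start < a.end


def mean_f0(f0: List[Tuple[float, float]], seg: Interval) -> float:
    """Mean of the (time, Hz) samples that fall inside the segment; 0.0 if none."""
    values = [hz for t, hz in f0 if seg.start <= t < seg.end]
    return sum(values) / len(values) if values else 0.0


def select_segments(
    tiers: Dict[str, List[Interval]],
    f0: List[Tuple[float, float]],
    speaker_sex: str,
    phone: str = "u",
    tone_type: str = "rise_question",  # assumed label, not the corpus's own
    min_f0: float = 250.0,             # assumed "high pitch" threshold in Hz
) -> List[Interval]:
    """Return segments of `phone` that satisfy all cross-tier criteria."""
    if speaker_sex != "F":
        return []
    groups = [g for g in tiers["intonation"] if g.label == tone_type]
    hits = []
    for seg in tiers["phonetic_manual"]:
        if seg.label != phone:
            continue
        if not any(overlaps(seg, g) for g in groups):
            continue
        if mean_f0(f0, seg) >= min_f0:
            hits.append(seg)
    return hits


# Toy data: two [u] tokens, only the second lies in a question-type rise.
tiers = {
    "phonetic_manual": [
        Interval(0.10, 0.20, "u"),
        Interval(0.50, 0.62, "u"),
        Interval(0.62, 0.70, "a"),
    ],
    "intonation": [
        Interval(0.00, 0.30, "fall"),
        Interval(0.40, 0.80, "rise_question"),
    ],
}
f0 = [(0.12, 210.0), (0.16, 205.0), (0.52, 280.0), (0.58, 310.0), (0.65, 320.0)]

selected = select_segments(tiers, f0, speaker_sex="F")
```

In this toy example only the second [u] token is kept, since the first lies outside the question-type tone group. A real query would also need the word and stress tiers, which the sketch leaves out for brevity.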
When selecting experimental data for this task, it is necessary to take into account various levels of annotation: 1) the canonical phonetic transcription; 2) the manual phonetic transcription; 3) the word level and the position of the stressed syllable, since vowel quality in Russian depends on the vowel's position relative to word stress, which may be the reason behind a discrepancy between the phonemic and phonetic transcriptions; 4) the intonation transcription level, with the type of tone group and the position of the head, where the maximum rise may be expected; 5) the fundamental frequency level, which makes it possible to generate a melodic curve and thus determine the parameters of the melodic rise.

There are two kinds of annotation. The first one is segmentation, i.e. information about boundaries between segmental units and their transcription labels. There are 8 main levels of segmentation, which are arranged in hierarchical order (see figure 1):
1. pitch marks;
2. manual phonetic events;