Best Practices for Speech Corpora in Linguistic Research Workshop ...

Russian Speech Corpora Framework for Linguistic Purposes

Pavel Skrelin, Daniil Kocharov
Department of Phonetics, Saint-Petersburg State University, Universitetskaya emb., 11, 199034, Saint-Petersburg, Russia
skrelin@phonetics.pu.ru, kocharov@phonetics.pu.ru

Abstract

The paper introduces a comprehensive speech corpora framework for linguistic purposes developed at the Department of Phonetics, Saint Petersburg State University. It was designed especially for phoneticians, giving them access to speech corpora and convenient tools for selecting and analysing speech data. The framework consists of three major parts: speech data, linguistic annotation, and software tools for processing and automatic annotation of speech data. The framework was designed for the Russian language. The paper presents the underlying ideas of framework development and describes its architecture.

Keywords: corpora annotation, transcription, Russian, corpora application

1. Introduction

Most of the large speech corpora used in speech technology are intended for automatic collection and processing of statistical data, not for linguistic analysis of the speech data. For these purposes, it is enough to have sound recordings and their orthographic transcription or a simple phonetic transcription. A linguist, however, often uses a corpus to test a hypothesis or to explain errors of automatic speech processing. The interrelationship between different levels of the language system, and the way it shows up in the corresponding sound patterns, is of particular interest. The corpus should therefore contain high-quality annotated speech data that provides researchers with a wide range of linguistic information. A good example of such a resource is the corpus developed for Danish (Grønnum, 2009).

Some data processing results may be so essential or useful that it becomes desirable to add them to the corpus as new annotation data for further use.
Thus, the corpus annotation scheme should be scalable, so that new annotations can be added to the speech data.

When studying a specific linguistic or speech phenomenon, a user would like to deal only with the parts of the corpus that are relevant to the subject of her/his research, not with the whole content. To make this possible, the speech corpus needs to be accompanied by software that enables customizable search and extraction of segments with specific annotation according to given criteria.

The framework developed at the Department of Phonetics, Saint Petersburg State University consists of three major parts: speech data, linguistic annotation, and software tools for processing and automatic annotation of speech data as well as for complex search of relevant speech data using multiple search criteria. At present, two corpora of fully annotated Russian speech are used within this framework. The first was used as material for a cross-linguistic phonetic study of spontaneous speech (Bondarko, 2009). The second is used for a study of read-aloud speech (Skrelin et al., 2010).

The paper presents the underlying ideas of framework development and describes its architecture.

2. General Architecture

The framework consists of three major modules. The first is the speech data. The second, and the most essential one, is the speech annotation: segmentation information and data labeling. The third is a set of built-in tools for speech processing, basic feature extraction, statistical processing, extending the corpus with linguistic and automatically generated annotation, searching within a corpus for specific data, and extracting slices of these data.

2.1. Annotation Scheme

The annotation captures the maximum amount of phonetically and prosodically relevant information. Our primary objective is to ensure that the annotation of the corpus covers a wide range of information that may be of interest to those involved in most areas of linguistic research, and in phonetics in particular.
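The two requirements stated so far, an annotation scheme that can be extended with new levels and a search that combines several criteria, can be illustrated with a minimal Python sketch. It is not the framework's actual implementation: the class names, the level names ("phonetic", "tone_group", "speaker_sex"), the labels and the time values are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Segment:
    """One labelled stretch of a recording on a single annotation level."""
    start: float  # seconds
    end: float    # seconds
    label: str


@dataclass
class AnnotatedRecording:
    """Annotation levels stored by name, so new levels can be added later.

    Hypothetical container; the real framework's data model is not
    described at this level of detail in the paper.
    """
    tiers: Dict[str, List[Segment]] = field(default_factory=dict)

    def add_tier(self, name: str, segments: List[Segment]) -> None:
        # Adding a level never touches the existing ones.
        self.tiers[name] = sorted(segments, key=lambda s: s.start)

    def enclosing(self, tier: str, seg: Segment) -> Optional[Segment]:
        # Segment on `tier` that fully contains `seg`, if there is one.
        for candidate in self.tiers.get(tier, []):
            if candidate.start <= seg.start and seg.end <= candidate.end:
                return candidate
        return None

    def _matches(self, seg: Segment, context: Dict[str, str]) -> bool:
        # True if every context level has the wanted label around `seg`.
        for tier, wanted in context.items():
            found = self.enclosing(tier, seg)
            if found is None or found.label != wanted:
                return False
        return True

    def search(self, tier: str, label: str, **context: str) -> List[Segment]:
        # Segments on `tier` with `label`, filtered by the context levels.
        return [
            seg for seg in self.tiers.get(tier, [])
            if seg.label == label and self._matches(seg, context)
        ]


# Illustrative data only (not taken from either corpus).
rec = AnnotatedRecording()
rec.add_tier("phonetic", [
    Segment(0.10, 0.18, "u"),
    Segment(0.18, 0.25, "t"),
    Segment(1.30, 1.41, "u"),
])
rec.add_tier("tone_group", [
    Segment(0.00, 1.00, "falling"),
    Segment(1.00, 2.00, "rising"),
])

# Multi-criteria search: [u] segments inside a rising tone group.
hits = rec.search("phonetic", "u", tone_group="rising")

# A level added later becomes available as a search criterion at once.
rec.add_tier("speaker_sex", [Segment(0.00, 2.00, "female")])
all_female_u = rec.search("phonetic", "u", speaker_sex="female")
```

Keeping each level as an independent, named list of time-aligned segments is one simple way to make the scheme extensible: a search criterion is then a containment test between levels, and a newly added level needs no change to the existing data.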
For example, suppose the linguistic goal is to determine the spectral characteristics of [u] pronounced by a female speaker with a high pitch on a prosodic rise consistent with question intonation patterns. When selecting experimental data for this task, it is necessary to take into account various levels of annotation:
1) the canonical phonetic transcription;
2) the manual phonetic transcription;
3) the word level and the position of the stressed syllable, since vowel quality in Russian depends on its place relative to word stress and may be the reason behind a discrepancy between the phonemic and phonetic transcriptions;
4) the intonation transcription level, with the type of tone group and the position of the head, where the maximum rise may be expected;
5) the fundamental frequency level, which makes it possible to generate a melodic curve and thus determine the parameters of the melodic rise.

There are two kinds of annotation. The first is segmentation, i.e. information about the boundaries between segmental units and their transcription labels. There are eight main levels of segmentation, arranged in hierarchical order (see Figure 1):
1. pitch marks;
2. manual phonetic events;