Best Practices for Speech Corpora in Linguistic Research Workshop ...

Figure 6: Web experiment on regional variation of speech sounds in German using the Percy software. a) registration web form, b) experiment item screen, c) final display of geographic distribution of recordings.

3. Data exchange with tools that use different underlying data models may require manual adaptation.
4. It is debatable whether SQL is a suitable query language for phoneticians or linguists, and whether it is sufficiently powerful to express typical research questions.

5.1. Missing data

Missing data can be controlled by enforcing a minimum standard set of metadata and log data to be collected during the speech database creation. However, this may not be feasible when the speech database comes from an external partner. It is thus necessary to manually check external databases before incorporating them into the WikiSpeech system, and to document the data manipulations applied.

Database systems provide a default null value for unknown data, which at least ensures that queries can be run. If queries return unexpected results, default null data may be the reason. In a relational database the database schema can be inspected either via queries or via a graphical user interface, so it is in general quite straightforward to find out whether missing data is a possible reason for unexpected query results.

5.2. Incorporating new tools

A database schema is intended to remain stable over time so that predefined views, proceduralized queries or scripts continue to work. However, new tools and services may use data items not present in the database. If the new data cannot be represented in the given database schema, the schema has to be extended.

In general, simply extending the data model, e.g. by adding new attributes to the relation tables, or even adding new tables, is not critical. Critical changes include changing the relationships between data items, or removing attributes or relational tables.
Such changes, however, will be very rare because the data model has by now reached a very stable state. Any change to the data model can be performed only by the database administrator, and it may entail the modification of existing scripts and queries.

5.3. Manual adaptation of data

For tools that use a different underlying data model, any data that is exported to the tool and then reimported must be processed to minimize the loss of information. For example, in a data model that uses symbolic references between elements on different annotation tiers, all items must be given explicit time stamps and unique ids before they can be used in a purely time-based annotation tool such as Praat. Upon reimporting the data after processing in Praat, these time stamps have to be removed for non-time-aligned annotations.

Such modifications are in general implemented using a parameterized import and export script. Several such scripts may be necessary for the different tools used in the workflow.

5.4. SQL as a query language

SQL is the de facto standard query language for relational databases. However, it is quite verbose and lacks many of the operators needed in phonetic or linguistic research, namely sequence and dominance operators. Both sequence and dominance operators can be expressed by providing explicit position attributes or linking relations between data records, but doing so makes queries even longer and more complex.

The SQL view mechanism is a simple way of formulating simple queries. For example, in the web experiment, where experiment setup and input are spread over 8 relational tables, a single view contains all data fields of the experiment and appears to the user as one single wide table, which can be directly imported into R or Excel.

As an alternative to the direct use of SQL, high-level application domain specific query languages can be envisaged, which are then automatically translated to SQL for execution.
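
As a rough illustration of this idea, the following minimal sketch compiles a sequence of segment labels into an SQL self-join that relies on an explicit position attribute. It is a sketch under stated assumptions, not the WikiSpeech implementation: the table `segment`, its columns `utterance_id`, `position` and `label`, the function `compile_sequence`, and the sample data are all hypothetical, and Python's standard `sqlite3` module stands in for the actual database system.

```python
import sqlite3


def compile_sequence(labels):
    """Translate a label sequence such as ["a", "n"] into an SQL self-join.

    Hypothetical helper: consecutive segments are matched through an
    explicit integer position attribute within the same utterance.
    Returns the SQL text and the list of query parameters.
    """
    if not labels:
        raise ValueError("at least one label is required")
    tables = ", ".join(f"segment s{i}" for i in range(len(labels)))
    conditions = []
    for i in range(len(labels)):
        conditions.append(f"s{i}.label = ?")
        if i > 0:
            # The i-th segment must directly follow the (i-1)-th one.
            conditions.append(f"s{i}.utterance_id = s0.utterance_id")
            conditions.append(f"s{i}.position = s0.position + {i}")
    sql = (
        f"SELECT s0.utterance_id, s0.position FROM {tables} "
        f"WHERE {' AND '.join(conditions)} "
        "ORDER BY s0.utterance_id, s0.position"
    )
    return sql, list(labels)


def demo():
    """Run the compiled query on a tiny, made-up in-memory table."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE segment (utterance_id INTEGER, position INTEGER, label TEXT)"
    )
    rows = [
        (1, 0, "b"), (1, 1, "a"), (1, 2, "n"), (1, 3, "a"),
        (2, 0, "a"), (2, 1, "n"), (2, 2, "t"),
    ]
    con.executemany("INSERT INTO segment VALUES (?, ?, ?)", rows)
    sql, params = compile_sequence(["a", "n"])
    return con.execute(sql, params).fetchall()
```

Calling `demo()` returns the utterance id and start position of every "a" directly followed by "n". The researcher only writes the two-element label list; the generated SQL already needs five conditions, which gives an impression of how quickly hand-written sequence queries grow.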

This separation of the high-level query language from the SQL query evaluation is desirable, because it opens the possibility of providing many different query languages for the same underlying database. Such query languages can be text-based or graphical, very specific to a particular application domain or quite abstract. In fact, many graphical front-ends to database systems already offer form-like query languages or explorative interactive graphical query languages.
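
The view mechanism mentioned in this section, together with the effect of null values discussed in Section 5.1, can be sketched as follows. This is an illustrative example only: the three tables `speaker`, `item` and `response`, the view `experiment_view`, and all data are hypothetical stand-ins for the eight tables of the actual web experiment, and Python's standard `sqlite3` module is assumed as the database.

```python
import sqlite3

# Hypothetical, much reduced schema: three tables instead of eight.
SCHEMA = """
CREATE TABLE speaker (id INTEGER PRIMARY KEY, region TEXT);
CREATE TABLE item (id INTEGER PRIMARY KEY, prompt TEXT);
CREATE TABLE response (
    speaker_id INTEGER REFERENCES speaker(id),
    item_id INTEGER REFERENCES item(id),
    answer TEXT
);
-- The view presents the joined tables as one single wide table.
CREATE VIEW experiment_view AS
    SELECT r.speaker_id, s.region, i.prompt, r.answer
    FROM response r
    JOIN speaker s ON s.id = r.speaker_id
    JOIN item i ON i.id = r.item_id;
"""


def build():
    """Create an in-memory database with made-up sample data."""
    con = sqlite3.connect(":memory:")
    con.executescript(SCHEMA)
    # Speaker 2 has no region on record, so it is stored as NULL.
    con.executemany("INSERT INTO speaker VALUES (?, ?)", [(1, "north"), (2, None)])
    con.executemany("INSERT INTO item VALUES (?, ?)", [(1, "Tag"), (2, "Zeit")])
    con.executemany(
        "INSERT INTO response VALUES (?, ?, ?)",
        [(1, 1, "a"), (1, 2, "b"), (2, 1, "a")],
    )
    return con


def count(con, where="1 = 1"):
    """Count the rows of the view that satisfy a WHERE condition."""
    sql = f"SELECT COUNT(*) FROM experiment_view WHERE {where}"
    return con.execute(sql).fetchone()[0]
```

With this data, `count(con)` reports three responses, but `region = 'north'` and `region <> 'north'` together account for only two of them: the response of the speaker with an unknown region matches neither condition and is found only with `region IS NULL`. This is the kind of unexpected result that null data can cause.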

6. Conclusion

The global corpus data model for speech databases is a pragmatic approach to supporting phonetic and linguistic research. It provides a means to exchange data with a multitude of speech processing tools, and allows proceduralizing often-needed research and development tasks. The two case studies show that quite different tasks can be performed on the same database in parallel.

The global corpus data model will slowly evolve, and it will be slightly different in different research and development labs, because of different requirements and tools used. However, the mere existence of a global corpus data model with visible and formally specified data structures, e.g. in a relational schema description, will lead to a much higher degree of consistency and coverage in speech database creation.

One major challenge is the development of a query language suitable for linguistic and phonetic researchers. This query language must be much closer to the application domain than SQL can ever be, and it must be sufficiently powerful to express the queries that researchers in phonetics or linguistics ask. A promising approach is to provide a query language such as XQuery or a graphical query interface to the database, and to compile queries into SQL for efficient execution in the database.

Acknowledgements

The author thanks the Statistical Consulting Unit, Department of Statistics, Ludwig-Maximilians-Universität, Munich for its support of the analysis of the German dialect perception experiment data. Thanks also go to the pupils and schools in Scotland who participated in the Scottish English VOYS recordings, to Catherine Dickie and Felix Schaeffler for organizing the recordings in Scotland, and to the students at LMU who transcribed and annotated these recordings.

7. References

St. Bird and M. Liberman. 2001. A Formal Framework for Linguistic Annotation. Speech Communication, 33(1,2):23–60.
P. Boersma. 2001. Praat, a system for doing phonetics by computer. Glot International, 5(9/10):341–345.
C. Dickie, F. Schaeffler, Chr. Draxler, and K. Jänsch. 2009. Speech recordings via the internet: An overview of the VOYS project in Scotland. In Proc. of Interspeech, pages 1807–1810, Brighton.
Chr. Draxler and K. Jänsch. 2004. SpeechRecorder – a Universal Platform Independent Multi-Channel Audio Recording Software. In Proc. of LREC, pages 559–562, Lisbon.
Chr. Draxler and K. Jänsch. 2008. Wikispeech – a content management system for speech databases. In Proc. of Interspeech, pages 1646–1649, Brisbane.
Chr. Draxler. 2005. Webtranscribe – an extensible web-based speech annotation framework. In Proc. of TSD 2005, pages 61–68, Karlsbad, Czech Republic.
Chr. Draxler. 2006. Exploring the Unknown – Collecting 1000 speakers over the Internet for the Ph@ttSessionz Database of Adolescent Speakers. In Proc. of Interspeech, pages 173–176, Pittsburgh, PA.
Chr. Draxler. 2011. Percy – an HTML5 framework for media rich web experiments on mobile devices. In Proc. of Interspeech, pages 3339–3340, Florence, Italy.
F. Keller, G. Subahshini, N. Mayo, and M. Corley. 2009. Timing accuracy of web experiments: A case study using the webexp software package. Behavior Research Methods, Instruments and Computers, 41(1):1–12.
U. Reips. 2002. Standards for internet-based experimenting. Experimental Psychology, 49(4):243–256.
F. Schiel, Chr. Heinrich, S. Barfüßer, and Th. Gilg. 2010. Alcohol Language Corpus – a publicly available large corpus of alcoholized speech. In IAFPA Annual Conference, Trier, Germany.
F. Schiel. 2004. MAUS goes iterative. In Proc. of LREC, pages 1015–1018, Lisbon, Portugal.
Th. Schmidt and K. Wörner. 2005. Erstellen und Analysieren von Gesprächskorpora mit EXMARaLDA. Gesprächsforschung, 6:171–195.
T. Schmidt, S. Duncan, O. Ehmer, J. Hoyt, M. Kipp, D. Loehr, M. Magnusson, T. Rose, and H. Sloetjes. 2009. An exchange format for multimodal annotations. In Multimodal Corpora, volume 5509 of Lecture Notes in Computer Science, pages 207–221. Springer Verlag.
H. Sloetjes, A. Russel, and A. Klassmann. 2007. ELAN: a free and open-source multimedia annotation tool. In Proc. of Interspeech, pages 4015–4016, Antwerp.