19.11.2012 Views

Best Practices for Speech Corpora in Linguistic Research Workshop ...


Toward the Harmonization of Metadata Practice for Spoken Languages Resources

Christopher Cieri ❄, Malcah Yaeger-Dror ❄●

❄ Linguistic Data Consortium, University of Pennsylvania, ● University of Arizona

❄ 3600 Market Street, Suite 810, Philadelphia, PA 19104, USA

E-mail: ccieri@ldc.upenn.edu, malcah@email.arizona.edu

Abstract

This paper addresses issues related to the elicitation and encoding of demographic, situational and attitudinal metadata for sociolinguistic research with an eye toward standardization to facilitate data sharing. The discussion results from a series of workshops that have recently taken place at the NWAV and LSA conferences. These discussions have focused principally on the granularity of the metadata and the subset of categories that could be considered required for sociolinguistic fieldwork generally. Although a great deal of research on quantitative sociolinguistics has taken place in the United States, the workshop participants actually represent research conducted in North and South America, Europe, Asia, the Middle East, Africa and Oceania. Although the paper does not attempt to consider the metadata necessary to characterize every possible speaker population, we present evidence that the methodological issues and findings apply generally to speech collections concerned with the demographics and attitudes of the speaker pools and the situations under which speech is elicited.

Keywords: metadata, sociol<strong>in</strong>guistics, standards<br />

1. Introduction

The brief history of building digital, shareable language resources (LRs) to support language-related education, research and technology development is marked by numerous attempts to create and enforce standards. The motivations behind the standards are numerous. For example, standards offer the possibility of making explicit the process by which LRs are created, establishing minimum quality levels and facilitating sharing. Nevertheless, there have been instances in which the premature or inappropriate promulgation or adoption of standards has led to its own set of problems (Osborn 2010, p. 74ff; Mah et al. 1997) as researchers struggle to apply to their use cases standards that were not truly representative and perhaps not intended to be. To reduce the potential effort expended in developing, promoting and using proposed standards that may subsequently be found difficult to sustain, we propose that standardization is a late step in a multipart process that begins with understanding and progresses to documentation that may itself encourage consistency in practice within small groups, at which point the question of standardization begins to ripen.

2. Background

The present workshop seeks to survey current initiatives in speech corpus creation with an eye toward standardization across sub-disciplines. Such standardization could permit resource sharing among researchers working in conversation and discourse analysis, sociolinguistics and dialectology, among others, and between those fields and others who depend upon similar kinds of data, including language engineers (Popescu-Belis, Zufferey 2007). Coincidentally, the authors have been involved in a number of workshops on related themes, including a series taking place at the annual NWAV (New Ways of Analyzing Variation) meetings on speech data collection, annotation and distribution, including documentation and metadata description. More recently they led a workshop funded by the U.S. National Science Foundation at the 2012 winter meeting of the Linguistic Society of America 1. The principal topics of the latter were metadata description and related legal issues in the creation of spoken language corpora for sociolinguistics. This paper constitutes a summary of efforts within that community to begin understanding metadata encoding practice as a first step toward consistency, sharing and standardization.

3. Towards Standardization

Before metadata practice can be standardized, individual researchers must first understand their practices, the variations among them, the causes for variation, the tradeoffs of different approaches and their potential uses. In particular, researchers need to know if they can apply their metadata categories consistently, a question that is not frequently asked but must be if the goal is to adopt a standard that will be used by many independent groups with the intent of sharing corpora. Once the practice is understood, it must be documented so that potential users can evaluate it and competing practices can be harmonized to permit appropriate comparisons. With adequate documentation, independent researchers can decide if they want to adopt consistent practices.
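One way to ask whether a metadata category can be applied consistently is to have two coders label the same speakers independently and measure their agreement beyond chance. The sketch below uses Cohen's kappa for that purpose. The choice of kappa, the function name `cohen_kappa`, the "occupation" category and its three labels are all assumptions made for illustration; the workshops described here do not prescribe a particular agreement measure or label set.

```python
from collections import Counter


def cohen_kappa(coder_a, coder_b):
    """Cohen's kappa for two coders labelling the same items (hypothetical helper)."""
    if len(coder_a) != len(coder_b) or not coder_a:
        raise ValueError("both coders must label the same non-empty set of items")
    n = len(coder_a)
    # Observed agreement: share of items given the same label by both coders.
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    # Chance agreement: probability that both coders pick the same label
    # independently, estimated from each coder's own label frequencies.
    freq_a, freq_b = Counter(coder_a), Counter(coder_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both coders used one identical label throughout
    return (observed - expected) / (1 - expected)


# Invented codes for an "occupation" category assigned to six speakers.
coder_a = ["manual", "clerical", "professional", "manual", "clerical", "manual"]
coder_b = ["manual", "clerical", "professional", "clerical", "clerical", "manual"]
kappa = cohen_kappa(coder_a, coder_b)
print(round(kappa, 3))  # 0.739
```

Here the coders agree on five of six speakers (observed agreement 5/6), while their label frequencies alone would produce agreement 13/36 of the time, giving a kappa of 17/23. A category whose kappa stays low across coders is probably not yet well enough understood or documented to be a candidate for standardization.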

4. Metadata

Within sociolinguistics, some researchers’ position is that each study requires its own set of demographics. However, the ultimate consensus at the workshops was that cross-community comparative corpus-based studies are only possible if there is a shared set of specific coding choices. Some of the demographic information is generally accepted within the larger sociolinguistic community: sex, birth year, years of education, and some designation of job description are fairly common

1 http://projects.ldc.upenn.edu/NSF_Coding_Workshop_LSA/index.html
