19.11.2012 Views

Best Practices for Speech Corpora in Linguistic Research Workshop ...

Figure 6: Web experiment on regional variation of speech sounds in German using the Percy software. a) registration web form, b) experiment item screen, c) final display of geographic distribution of recordings.

3. Data exchange with tools with different underlying data models may require manual adaptation.

4. It is debatable whether SQL is a suitable query language for phoneticians or linguists, and whether it is sufficiently powerful to express typical research questions.

5.1. Missing data

Missing data can be controlled by enforcing a minimum standard set of metadata and log data to be collected during the speech database creation. However, this may not be feasible when the speech database comes from an external partner. It is thus necessary to manually check external databases before incorporating them into the WikiSpeech system, and to document the data manipulations applied. Database systems provide a default null value for unknown data, which at least ensures that queries can be run. If queries return unexpected results, default null values may be the reason. In a relational database the database schema can be inspected either via queries or via a graphical user interface, so it is in general quite straightforward to find out whether missing data is a possible reason for unexpected query results.
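
The effect of null values on query results can be illustrated with a minimal sketch. It uses Python's built-in sqlite3 module as a stand-in for the database system; the table `speaker`, its columns, and the three rows are invented for illustration and are not taken from the WikiSpeech schema.

```python
import sqlite3

# Hypothetical table and data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE speaker (id INTEGER PRIMARY KEY, name TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO speaker (name, age) VALUES (?, ?)",
    [("spk1", 30), ("spk2", 45), ("spk3", None)],  # the age of spk3 is unknown
)

# A comparison with NULL is neither true nor false, so spk3 is silently
# missing from the result although its age is "not 30" as far as we know.
not_30 = [row[0] for row in conn.execute(
    "SELECT name FROM speaker WHERE age <> 30 ORDER BY name")]

# Counting NULLs shows whether missing data can explain the result.
missing_age = conn.execute(
    "SELECT COUNT(*) FROM speaker WHERE age IS NULL").fetchone()[0]

# The schema itself can be inspected with a query (SQLite-specific PRAGMA).
columns = [row[1] for row in conn.execute("PRAGMA table_info(speaker)")]
```

Here `not_30` contains only `spk2`, and the null count of 1 points to the missing row. Other database systems expose the schema through different catalog queries, so the `PRAGMA` call is specific to SQLite.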

5.2. Incorporating new tools

A database schema is intended to remain stable over time so that predefined views, proceduralized queries or scripts continue to work. However, new tools and services may use data items not present in the database. If the new data cannot be represented in the given database schema, the schema has to be extended. In general, simply extending the data model, e.g. by adding new attributes to the relation tables, or even adding new tables, is not critical. Critical changes include changing the relationships between data items, or removing attributes or relational tables. Such changes, however, will be very rare because the data model has by now reached a very stable state.

Any change to the data model can only be performed by the database administrator, and it may entail the modification of existing scripts and queries.
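
The difference between a non-critical extension and a critical change can be sketched as follows, again with sqlite3 as a stand-in. The table `recording`, the view `recording_overview`, and the added attribute `sample_rate` are hypothetical names chosen for this example.

```python
import sqlite3

# Hypothetical schema, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE recording (id INTEGER PRIMARY KEY, speaker TEXT, duration REAL)")
conn.execute("INSERT INTO recording (speaker, duration) VALUES ('spk1', 2.5)")

# A predefined view that names its columns explicitly.
conn.execute(
    "CREATE VIEW recording_overview AS "
    "SELECT id, speaker, duration FROM recording")
before = conn.execute("SELECT * FROM recording_overview").fetchall()

# Non-critical extension: a new attribute leaves the existing view untouched.
conn.execute("ALTER TABLE recording ADD COLUMN sample_rate INTEGER")
after = conn.execute("SELECT * FROM recording_overview").fetchall()

# Critical change: removing the table breaks the view that depends on it.
conn.execute("DROP TABLE recording")
try:
    conn.execute("SELECT * FROM recording_overview").fetchall()
    view_still_works = True
except sqlite3.OperationalError:
    view_still_works = False
```

The view returns the same rows before and after the new attribute is added, but fails once the underlying table is removed. Naming the columns in the view, rather than selecting all of them, is what keeps it independent of later extensions.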

5.3. Manual adaptation of data

For tools that use a different underlying data model, any data that is exported to the tool and then reimported must be processed to minimize the loss of information. For example, in a data model that uses symbolic references between elements on different annotation tiers, all items must be given explicit time stamps and unique ids before they can be used in a purely time-based annotation tool such as Praat. Upon reimporting the data after processing in Praat, these time stamps have to be removed for non-time-aligned annotations. Such modifications are generally implemented using a parameterized import and export script. Several such scripts may be necessary for the different tools used in the workflow.
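
The export and reimport steps can be sketched roughly as follows. This is not the actual WikiSpeech script: the `Item` class, the `id|label` convention for carrying ids through the external tool, and the plain dictionaries standing in for the tool's tier format are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Item:
    """Hypothetical annotation item; start/end of None means not time-aligned."""
    id: str
    label: str
    start: Optional[float] = None
    end: Optional[float] = None
    children: tuple = ()  # symbolic references to items on another tier


def export_for_time_based_tool(items):
    """Give every item explicit times and carry its id along in the label."""
    by_id = {it.id: it for it in items}
    rows = []
    for it in items:
        if it.start is None:
            # Derive times from the referenced items (assumed to be time-aligned).
            start = min(by_id[c].start for c in it.children)
            end = max(by_id[c].end for c in it.children)
        else:
            start, end = it.start, it.end
        rows.append({"label": f"{it.id}|{it.label}", "start": start, "end": end})
    return rows


def reimport(rows, originals):
    """Restore items and drop the time stamps of non-time-aligned annotations."""
    by_id = {it.id: it for it in originals}
    restored = []
    for row in rows:
        item_id, label = row["label"].split("|", 1)
        orig = by_id[item_id]
        aligned = orig.start is not None
        restored.append(Item(
            item_id, label,
            row["start"] if aligned else None,
            row["end"] if aligned else None,
            orig.children))
    return restored


items = [
    Item("w1", "Haus", children=("p1", "p2")),  # word tier, symbolic links only
    Item("p1", "h", 0.0, 0.1),
    Item("p2", "aUs", 0.1, 0.4),
]
rows = export_for_time_based_tool(items)
restored = reimport(rows, items)
```

In this sketch the word item is exported with the time span of the segments it references, and the round trip restores it without time stamps. A real script would additionally have to read and write the tool's own file format.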

5.4. SQL as a query language

SQL is the de facto standard query language for relational databases. However, it is quite verbose and lacks operators needed in phonetic or linguistic research, namely sequence and dominance operators. Both can be expressed by providing explicit position attributes or linking relations between data records, but doing so makes queries even longer and more complex.
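
A minimal sketch of both workarounds is given below. The tables `word` and `segment`, the `position` attribute, and the sample labels are hypothetical; sqlite3 again serves as a stand-in database.

```python
import sqlite3

# Hypothetical two-tier schema: words dominate segments via word_id,
# and segments are ordered within a word by an explicit position attribute.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE word (id INTEGER PRIMARY KEY, label TEXT);
CREATE TABLE segment (word_id INTEGER, position INTEGER, label TEXT);
INSERT INTO word VALUES (1, 'hallo'), (2, 'alt'), (3, 'tot');
INSERT INTO segment VALUES
  (1, 1, 'h'), (1, 2, 'a'), (1, 3, 'l'), (1, 4, 'o'),
  (2, 1, 'a'), (2, 2, 'l'), (2, 3, 't'),
  (3, 1, 't'), (3, 2, 'o'), (3, 3, 't');
""")

# Sequence: 'a' immediately followed by 'l', as a self-join on position.
sequence_matches = conn.execute("""
SELECT s1.word_id, s1.position
FROM segment AS s1
JOIN segment AS s2
  ON s2.word_id = s1.word_id
 AND s2.position = s1.position + 1
WHERE s1.label = 'a' AND s2.label = 'l'
ORDER BY s1.word_id, s1.position
""").fetchall()

# Dominance: words that dominate a segment 't', via the linking attribute.
dominating_words = [row[0] for row in conn.execute("""
SELECT DISTINCT w.label
FROM word AS w
JOIN segment AS s ON s.word_id = w.id
WHERE s.label = 't'
ORDER BY w.label
""")]
```

Even this two-element sequence needs a self-join with two join conditions, and every further element in the pattern adds another join, which illustrates how quickly such queries grow.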

The SQL view mechanism is a straightforward way of keeping user queries simple. For example, in the web experiment, where experiment setup and input are spread over 8 relational tables, a single view contains all data fields of the experiment and appears to the user as one single wide table, which can be directly imported into R or Excel.
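
The idea can be sketched with three tables instead of eight. All table, column, and view names below, as well as the sample rows, are invented for illustration and do not reproduce the actual web experiment schema.

```python
import csv
import io
import sqlite3

# Hypothetical, heavily reduced experiment schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE participant (id INTEGER PRIMARY KEY, region TEXT);
CREATE TABLE item (id INTEGER PRIMARY KEY, prompt TEXT);
CREATE TABLE response (participant_id INTEGER, item_id INTEGER, answer TEXT);
INSERT INTO participant VALUES (1, 'north'), (2, 'south');
INSERT INTO item VALUES (10, 'prompt-1');
INSERT INTO response VALUES (1, 10, 'A'), (2, 10, 'B');

-- The view hides the joins and presents one wide table.
CREATE VIEW experiment_wide AS
SELECT p.id AS participant_id, p.region, i.prompt, r.answer
FROM response AS r
JOIN participant AS p ON p.id = r.participant_id
JOIN item AS i ON i.id = r.item_id;
""")

# The user's query no longer mentions the underlying tables.
cursor = conn.execute("SELECT * FROM experiment_wide ORDER BY participant_id")
header = [col[0] for col in cursor.description]
rows = cursor.fetchall()

# Write the result as CSV text, a format that R and Excel can read.
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(header)
writer.writerows(rows)
csv_text = buffer.getvalue()
```

The joins are written once, in the view definition, so the user-facing query shrinks to a single `SELECT` over one table.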

As an alternative to the direct use of SQL, high-level application domain specific query languages can be envisaged, which are then automatically translated to SQL for execution. This separation of high-level query language and SQL query evaluation is desirable because it makes it possible to provide many different query languages for the same underlying database. Such query languages can be text-based or graphical, very specific to a particular application domain or quite abstract. In fact, many graphical front-ends to database systems already allow form-like query languages or exploratory interactive graphical query languages.
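
As a rough illustration of such a translation layer, the sketch below turns a toy sequence query into SQL. The query syntax (segment labels separated by spaces, meaning "immediately followed by"), the function `translate_sequence`, and the `segment` table are all hypothetical and far simpler than a real domain-specific query language.

```python
import sqlite3


def translate_sequence(query):
    """Translate e.g. 'a l t' into a SQL self-join over a position attribute."""
    labels = query.split()
    if not labels:
        raise ValueError("empty query")
    joins, conditions = [], []
    for i in range(len(labels)):
        if i > 0:
            # Each further label adds one self-join at the next position.
            joins.append(
                f"JOIN segment AS s{i} ON s{i}.utterance_id = s0.utterance_id "
                f"AND s{i}.position = s0.position + {i}")
        conditions.append(f"s{i}.label = ?")
    sql = ("SELECT s0.utterance_id, s0.position FROM segment AS s0 "
           + " ".join(joins)
           + " WHERE " + " AND ".join(conditions)
           + " ORDER BY s0.utterance_id, s0.position")
    return sql, labels  # labels are passed as parameters, not pasted into SQL


# Hypothetical sample data.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE segment (utterance_id INTEGER, position INTEGER, label TEXT)")
conn.executemany("INSERT INTO segment VALUES (?, ?, ?)", [
    (1, 1, "h"), (1, 2, "a"), (1, 3, "l"), (1, 4, "o"),
    (2, 1, "a"), (2, 2, "l"), (2, 3, "t"),
])

sql_two, params_two = translate_sequence("a l")
two_matches = conn.execute(sql_two, params_two).fetchall()

sql_three, params_three = translate_sequence("a l t")
three_matches = conn.execute(sql_three, params_three).fetchall()
```

The user writes `a l t`, while the generated SQL contains two self-joins. Because the translation is a separate step, a graphical or form-based front-end could produce the same SQL from a different input notation.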
