19.11.2012 Views

Best Practices for Speech Corpora in Linguistic Research Workshop ...


Figure 2: Grammatical tags are visible for each word

But given the written language bias it is fair to say that there is room for improvement with regard to the tagging. Transcriptions may also be erroneous. Given these factors, the possibility to check the audio is crucial.

5. Links between audio/video and transcriptions

Even if a part of the corpus is phonetically transcribed, and all of it is orthographically transcribed, it is still important to have access to the audio (and in some cases, video). There are many features that are not marked in the transcriptions, such as intonation and stress. We therefore have a clickable button next to each line in the search result. This gives the linguist direct access to exactly that point in the sound file represented by the transcription. Figure 3 shows the audio and video buttons.

Figure 3: Transcription with audio/video buttons<br />
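The text does not describe how the buttons locate the right point in the recording. As a minimal sketch of one possible approach, assume each transcribed line stores the start and end time of its stretch of audio; a link to that stretch can then be built with a temporal media fragment (`#t=start,end`). The `Segment` class, the `playback_link` function, the base URL and the file name are all hypothetical and are not taken from the corpus software.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """One transcribed line aligned to its media file (hypothetical layout)."""
    text: str
    media_file: str
    start: float  # seconds from the beginning of the recording
    end: float    # seconds from the beginning of the recording


def playback_link(seg: Segment, base_url: str = "https://example.org/media/") -> str:
    """Build a URL that points at the stretch of audio behind one line.

    The '#t=start,end' suffix is a temporal media fragment; whether the
    real system uses this mechanism is an assumption.
    """
    return f"{base_url}{seg.media_file}#t={seg.start:.2f},{seg.end:.2f}"


# Invented example data.
line = Segment(text="(transcribed line)", media_file="rec01.wav", start=12.5, end=15.8)
print(playback_link(line))
# https://example.org/media/rec01.wav#t=12.50,15.80
```

Storing the alignment per line, rather than per recording, is what would make it possible to jump from a single search hit to the matching point in the sound file.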

6. Search interface

The search interface for the corpus is designed to be maximally simple for the linguists who use it. It fits into one screen and is divided into three parts (see Figure 4). We use the system Glossa (Johannessen et al. 2008) for the search facilities.


Figure 4: The search window

The top part is the linguistic user interface, where a search can specify a word, several words, or parts of words, as well as grammatical tags, phonetic or orthographic transcription, exclusion of words, etc. Any number of words in sequence can be chosen, and any part of a word. We use the Stuttgart CQP system for the text search (Christ 1994, Evert 2005).
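To make the relation between the menu choices and the underlying text search more concrete, the following sketch assembles a CQP-style query string from per-word constraints. In CQP, each token is written as a bracketed list of attribute conditions, and a sequence of tokens is written as the brackets in order. The attribute name `pos`, the tag value `verb`, and the helper functions are assumptions made for illustration; the text does not say how Glossa builds its queries.

```python
from typing import Optional


def token_query(word: Optional[str] = None,
                pos: Optional[str] = None,
                exclude_word: Optional[str] = None) -> str:
    """Build one CQP token expression, e.g. [word="x" & pos="y"].

    Values are inserted as-is; escaping of quotes is left out of this sketch.
    """
    conditions = []
    if word is not None:
        conditions.append(f'word="{word}"')
    if pos is not None:
        conditions.append(f'pos="{pos}"')  # 'pos' is an assumed attribute name
    if exclude_word is not None:
        conditions.append(f'word!="{exclude_word}"')
    # With no conditions this yields "[]", which matches any token in CQP.
    return "[" + " & ".join(conditions) + "]"


def sequence_query(*tokens: str) -> str:
    """Join token expressions into a query for words in sequence."""
    return " ".join(tokens)


query = sequence_query(
    token_query(word="ikke"),
    token_query(word=".*e", pos="verb"),  # a part of a word plus a tag
)
print(query)
# [word="ikke"] [word=".*e" & pos="verb"]
```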

The middle part of the screen is used for the desired representation of search results, for example the number of results wanted, or which transcription is to be displayed.

The bottom part of the screen is used for filtering the search through metadata. Here the user can choose to specify country, area or place, informant age or sex, recording year, etc.
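Metadata filtering of this kind amounts to keeping only those recordings whose metadata satisfy every constraint the user has set. The sketch below illustrates the idea; the `Recording` fields, the `filter_recordings` function and the sample values are invented for the example and do not reflect the corpus's actual metadata schema.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Recording:
    """Metadata for one recording (hypothetical field names)."""
    informant: str
    country: str
    place: str
    age: int
    sex: str
    year: int


def filter_recordings(recordings: List[Recording],
                      country: Optional[str] = None,
                      place: Optional[str] = None,
                      sex: Optional[str] = None,
                      min_age: Optional[int] = None,
                      max_age: Optional[int] = None,
                      year: Optional[int] = None) -> List[Recording]:
    """Keep the recordings that satisfy every constraint that is not None."""
    selected = []
    for rec in recordings:
        if country is not None and rec.country != country:
            continue
        if place is not None and rec.place != place:
            continue
        if sex is not None and rec.sex != sex:
            continue
        if min_age is not None and rec.age < min_age:
            continue
        if max_age is not None and rec.age > max_age:
            continue
        if year is not None and rec.year != year:
            continue
        selected.append(rec)
    return selected


# Invented sample metadata.
recordings = [
    Recording("inf01", "Norway", "PlaceA", 67, "F", 2009),
    Recording("inf02", "Norway", "PlaceB", 24, "M", 2010),
    Recording("inf03", "Sweden", "PlaceC", 71, "M", 2009),
]
older_norwegians = filter_recordings(recordings, country="Norway", min_age=50)
print([rec.informant for rec in older_norwegians])
# ['inf01']
```

Leaving a constraint unset (`None`) means it does not restrict the search, which mirrors a menu left at its default.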

The interface is based completely on boxes and pull-down menus. It is, however, also possible to perform searches using regular expressions, i.e., a formal language used in computer science, if necessary. We will illustrate this here. While the menu-based system allows the user to choose a part of a word followed by a word, it does not allow a list of alternative characters. The system allows alternatives by adding one set of search boxes for each, but this can be a very cumbersome solution if there are many alternatives. If a user wants to specify that a word should end in any vowel, she can embed all vowels in a single search using square brackets in the following way:

(1)<br />

.*[aeiouyæøå]<br />

This regular expression will give as results anything that ends in a (Norwegian) vowel.
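The behaviour of expression (1) can be checked outside the corpus interface. The sketch below applies the same pattern with Python's `re` module, using `re.fullmatch` because CQP matches a pattern against the whole token. The sample tokens are invented and are not corpus data, and the pattern as written covers lower-case vowels only.

```python
import re

# Pattern (1) from the text: any characters, then one Norwegian vowel.
ENDS_IN_VOWEL = re.compile(r".*[aeiouyæøå]")


def ends_in_vowel(word: str) -> bool:
    """Return True if the whole word matches pattern (1)."""
    return ENDS_IN_VOWEL.fullmatch(word) is not None


# Hypothetical sample tokens.
tokens = ["huset", "jente", "på", "snakker", "bygda"]
matches = [t for t in tokens if ends_in_vowel(t)]
print(matches)
# ['jente', 'på', 'bygda']
```

The square brackets list the alternatives for a single character position, so one pattern replaces nine separate sets of search boxes.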

7. Presentation of search results

We have seen part of the presentations in Figures 1, 2 and 3. In Figure 5 we show more of a search results window, with not just the
