EasyAlign Spanish: an (semi-)automatic segmentation<br />
tool under Praat<br />
Goldman, Jean-Philippe; Schwab, Sandra<br />
Université de Genève<br />
The purpose of phonetic alignment is to determine the time position of phone<br />
boundaries in a speech corpus on the basis of the audio recording and its<br />
orthographic transcription. Aligned corpora are widely used in various speech<br />
applications such as automatic speech recognition and speech synthesis, as well as in<br />
prosodic and phonetic research. Although manual segmentation is the<br />
most accurate method, it demands a great deal of the human labeller's time.<br />
Thus, various automatic methods are now used as they are not only much<br />
quicker, but their results are also reproducible and consistent throughout a large<br />
corpus.<br />
EasyAlign has been developed within Praat in order to provide an ergonomic<br />
automatic segmentation tool that is easy to use for non-specialists in computer science.<br />
It is freely distributed as a self-installable plug-in and it is available for French,<br />
English, Brazilian Portuguese and, more recently, Castilian Spanish. EasyAlign<br />
consists of a group of tools that successively perform three automatic steps,<br />
with some minor manual verifications and adjustments to ensure better quality:<br />
utterance segmentation, grapheme-to-phoneme conversion and phonetic<br />
segmentation. From the orthographic transcription, the utterance segmentation<br />
process, which is language-independent, generates a TextGrid with a single<br />
tier, in which each interval encloses one utterance. Then, the grapheme-to-phoneme<br />
conversion step creates a second tier with the phonetic transcription<br />
of the utterances. For Castilian Spanish, this language-specific step uses<br />
the phonetizer SAGA (Moreno & Mariño 1998), which provides a<br />
detailed phonetic transcription in SAMPA. Finally, in the phonetic segmentation<br />
step, the Viterbi-based HVite tool (within HTK) is called to align each utterance<br />
to its phonetic sequence. For Castilian Spanish, the acoustic models were<br />
trained on the basis of about 360 minutes of unaligned multi-speaker speech for<br />
which a verified phonetic transcription was provided. The phonetic<br />
segmentation process simultaneously generates a phone tier and a word tier.<br />
Additionally, a syllable tier is created on the basis of sonority-based rules for<br />
syllable segmentation.<br />
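The phonetic segmentation step can be illustrated with a toy forced alignment. The sketch below is not HTK's HVite implementation: it assumes one state per phone, no transition probabilities, and a hypothetical precomputed table `loglik[t][j]` holding the log-likelihood of audio frame `t` under the `j`-th phone of the utterance's phonetic sequence. Under those assumptions, a Viterbi pass finds the left-to-right assignment of frames to phones with the highest total score, and the points where the phone changes are the phone boundaries.

```python
import math


def forced_align(loglik):
    """Toy Viterbi forced alignment (one state per phone, assumed here).

    loglik[t][j] is the log-likelihood of frame t under the j-th phone
    of the known phone sequence. Returns one (start, end) frame span
    per phone, with `end` exclusive.
    """
    n_frames = len(loglik)
    n_phones = len(loglik[0])
    if n_frames < n_phones:
        raise ValueError("need at least one frame per phone")

    neg_inf = -math.inf
    score = [[neg_inf] * n_phones for _ in range(n_frames)]
    back = [[0] * n_phones for _ in range(n_frames)]
    score[0][0] = loglik[0][0]  # the path must start in the first phone

    for t in range(1, n_frames):
        for j in range(n_phones):
            stay = score[t - 1][j]                        # remain in phone j
            move = score[t - 1][j - 1] if j > 0 else neg_inf  # enter phone j
            if move > stay:
                score[t][j] = move + loglik[t][j]
                back[t][j] = j - 1
            else:
                score[t][j] = stay + loglik[t][j]
                back[t][j] = j

    # Backtrace from the last phone at the last frame.
    path = [n_phones - 1]
    for t in range(n_frames - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()

    # Collapse the frame-level path into one span per phone.
    segments = []
    start = 0
    for t in range(1, n_frames + 1):
        if t == n_frames or path[t] != path[t - 1]:
            segments.append((start, t))
            start = t
    return segments


# Six frames, three phones; 0 marks the best-matching phone for each frame.
example = [
    [0, -5, -5], [0, -5, -5],
    [-5, 0, -5], [-5, 0, -5], [-5, 0, -5],
    [-5, -5, 0],
]
print(forced_align(example))  # [(0, 2), (2, 5), (5, 6)]
```

Multiplying the frame indices by the frame step (for example 10 ms, also an assumption) turns these spans into interval times of the kind stored in a phone tier.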
EasyAlign's performance was evaluated on a 12-minute corpus<br />
(one minute from each of 12 speakers: 6 "internal" speakers from the training<br />
corpus and 6 new "external" speakers), which was manually annotated by<br />
phonetic experts and compared with the automatic alignment. Evaluation was<br />
performed according to three approaches. Firstly, a boundary-based evaluation<br />
showed that 60% of the differences between automatic and manual boundaries<br />
lie within 10 ms (and 86% within 20 ms). Little difference was observed between<br />
internal and external speakers, which suggests that the trained models generalize<br />
well to new speakers. Secondly, a duration-based evaluation revealed that the<br />
difference between automatic and manual phone durations, which has a standard deviation<br />
of 20 ms, is similar for internal and external speakers. Finally, a segment-based<br />
evaluation showed that the median of the so-called "overlapping-rate", a<br />
measure independent of speech rate, reaches 0.74, with little difference between<br />
internal (0.75) and external (0.72) speakers.<br />
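The three evaluation approaches can be sketched as small functions. This is a minimal illustration, not the authors' evaluation script. The function names and the millisecond units are assumptions, and so is the overlapping-rate formula, which the abstract does not spell out: here it is taken to be the duration shared by the two segments divided by the total duration they cover together.

```python
import statistics


def boundary_agreement(auto, manual, tol_ms):
    """Share of boundaries whose automatic/manual difference is within tol_ms."""
    diffs = [abs(a - m) for a, m in zip(auto, manual)]
    return sum(d <= tol_ms for d in diffs) / len(diffs)


def duration_diff_sd(auto_segs, manual_segs):
    """Standard deviation of (automatic - manual) phone durations.

    Uses the sample standard deviation; the abstract does not say
    which variant was used.
    """
    diffs = [
        (a_end - a_start) - (m_end - m_start)
        for (a_start, a_end), (m_start, m_end) in zip(auto_segs, manual_segs)
    ]
    return statistics.stdev(diffs)


def overlap_rate(auto_seg, manual_seg):
    """Assumed definition: shared duration / duration covered by either segment."""
    a_start, a_end = auto_seg
    m_start, m_end = manual_seg
    common = max(0, min(a_end, m_end) - max(a_start, m_start))
    covered = (a_end - a_start) + (m_end - m_start) - common
    return common / covered


# Hypothetical boundaries and segments, in milliseconds.
auto_bounds = [100, 205, 330, 415]
manual_bounds = [100, 200, 300, 400]
print(boundary_agreement(auto_bounds, manual_bounds, 10))  # 0.5
print(boundary_agreement(auto_bounds, manual_bounds, 20))  # 0.75

auto_segs = [(0, 100), (100, 210), (210, 300)]
manual_segs = [(0, 110), (110, 200), (200, 300)]
rates = [overlap_rate(a, m) for a, m in zip(auto_segs, manual_segs)]
print(statistics.median(rates))
```

Because the overlapping rate is a ratio of durations, a 10 ms shift penalizes a short phone more than a long one, which is what makes it comparable across speech rates.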