
3.4 Transfer to running speech


The voice analysis was originally based on stationary vowels. In clinical diagnostics, however, a voice analysis from continuous speech is required in order to assess the voice disorder objectively under normal vocal load and to be able to treat it optimally. Stationary phonation corresponds more to a singing voice than to the more natural running speech. For a comprehensive description of voice quality, the analysis of running speech is therefore an essential extension of the analysis of stationary phonation. The methods of vowel analysis should be partially transferable to voiced intervals in running speech. For this purpose a method was developed to recognize such intervals automatically. The main difficulty is that, for strong voice disorders, the linguistically voiced sounds are not necessarily realized as voiced.

3.4.1 Determination of voiced and unvoiced intervals

A voiced/unvoiced classification applied directly to the speech signal, e.g. by zero-crossing and correlation techniques, would recognize too few voiced intervals for strongly disturbed voices. Considering the spectral envelope (formant structure) instead is preferable, since it depends little on the actual glottal excitation.
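For contrast, a minimal Python sketch of such a time-domain baseline is given below; the decision threshold (0.1 crossings per sample) is an illustrative assumption, not a value from the text.

```python
import numpy as np

def zcr_is_voiced(frame, threshold=0.1):
    """Toy zero-crossing-rate voicing decision (threshold assumed).

    Voiced speech is dominated by low frequencies and crosses zero
    rarely per sample; unvoiced sounds cross often. For strongly
    disturbed voices this cue fails, which motivates the
    envelope-based classifier described next.
    """
    signs = np.signbit(frame)
    zcr = np.count_nonzero(signs[1:] != signs[:-1]) / len(frame)
    return zcr < threshold
```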

The method uses a 3-layer perceptron with sigmoid activation function (values 0 to 1) as classifier. The template vectors for its input were formed as follows (numbers refer to Fig. 3):

The speech signals, digitized at 48 kHz, were downsampled to 12 kHz and decomposed into overlapping Hann-windowed 40 ms intervals with 10 ms frame shift (3,4). Pauses are eliminated based on an empirical energy threshold. An LPC analysis of 12th order (autocorrelation method; preemphasis 0.9735) yields a model spectrum (5), which is converted to 19 critical bands (Bark scale) by summation in overlapping trapezoidal windows (6). It is compressed with exponent 0.23 and normalized by its maximum over time and critical bands (7,8). The LPC order and method were optimized to yield minimal misclassification.
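A minimal Python sketch of this front end might look as follows; the FFT size, the pause-energy threshold, and the exact trapezoid shape (1 Bark flat top, 0.5 Bark skirts) are assumptions, since the text does not specify them.

```python
import numpy as np
from scipy.signal import resample_poly

def lpc_autocorr(x, order=12):
    """LPC by the autocorrelation method (Levinson-Durbin recursion)."""
    r = np.correlate(x, x, mode="full")[len(x) - 1 : len(x) + order]
    a = np.zeros(order + 1)
    a[0], err = 1.0, r[0]
    for i in range(1, order + 1):
        k = -(r[i] + a[1:i] @ r[1:i][::-1]) / err
        a_prev = a.copy()
        a[1:i + 1] = a_prev[1:i + 1] + k * a_prev[:i][::-1]
        err *= 1.0 - k * k
    return a, err

def bark(f):
    """Hz -> Bark (Zwicker/Traunmueller-style formula)."""
    return 13.0 * np.arctan(0.00076 * f) + 3.5 * np.arctan((f / 7500.0) ** 2)

def bark_filterbank(freqs, n_bands=19, skirt=0.5):
    """Overlapping trapezoidal windows on the Bark scale (shape assumed)."""
    z = bark(freqs)
    centers = np.arange(1, n_bands + 1, dtype=float)
    d = np.abs(z[None, :] - centers[:, None])
    # flat top for |d| <= 0.5 Bark, linear skirts of `skirt` Bark
    return np.clip((0.5 + skirt - d) / skirt, 0.0, 1.0)

def template_vectors(signal_48k, nfft=512, energy_floor=1e-6):
    """Bark-spectrum template vectors for the perceptron input."""
    x = resample_poly(signal_48k, 1, 4)            # 48 kHz -> 12 kHz
    x = np.append(x[0], x[1:] - 0.9735 * x[:-1])   # preemphasis 0.9735
    flen, hop = 480, 120                           # 40 ms frames, 10 ms shift
    win = np.hanning(flen)
    fb = bark_filterbank(np.fft.rfftfreq(nfft, d=1.0 / 12000.0))
    rows = []
    for start in range(0, len(x) - flen + 1, hop):
        frame = x[start:start + flen] * win
        if np.mean(frame ** 2) < energy_floor:     # crude pause elimination
            continue
        a, err = lpc_autocorr(frame, order=12)
        spec = err / np.abs(np.fft.rfft(a, nfft)) ** 2  # LPC model spectrum
        rows.append((fb @ spec) ** 0.23)           # Bark sums, compression
    T = np.asarray(rows)
    return T / T.max()    # normalize by maximum over time and bands
```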

The optimal values of the perceptron parameters (number of hidden cells, learning rate, classification threshold, number of iterations, training material) were determined in extensive experiments (6750 different cases, about 12000 spectra). Twelve hidden cells worked best. As training method of the perceptron (9), an accelerated backpropagation [24] was employed with learning rate 0.01 and momentum term 0.8. The classification threshold at the output is 0.45; the desired net outputs for training are 0.1 (unvoiced) and 0.9 (voiced). The weights are initialized with random numbers in the range 0 to 1. Three perceptrons with different initial weights were used in parallel, averaging their recognition scores.
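A rough Python sketch of this classifier stage under the stated settings could look as follows; plain backpropagation with a momentum term stands in here for the accelerated variant of [24], and the squared-error cost is an assumption.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class VoicingNet:
    """3-layer perceptron: 19 Bark inputs, 12 hidden cells, 1 output."""

    def __init__(self, n_in=19, n_hidden=12, seed=None):
        rng = np.random.default_rng(seed)
        # weights initialized with random numbers in the range 0 to 1
        self.W1 = rng.random((n_hidden, n_in + 1))   # +1 bias column
        self.W2 = rng.random((1, n_hidden + 1))
        self.V1 = np.zeros_like(self.W1)             # momentum buffers
        self.V2 = np.zeros_like(self.W2)

    def forward(self, x):
        h = sigmoid(self.W1 @ np.append(x, 1.0))
        y = sigmoid(self.W2 @ np.append(h, 1.0))[0]
        return h, y

    def train_step(self, x, voiced, lr=0.01, momentum=0.8):
        target = 0.9 if voiced else 0.1              # desired net outputs
        h, y = self.forward(x)
        d_out = (y - target) * y * (1.0 - y)         # squared-error delta
        d_hid = d_out * self.W2[0, :-1] * h * (1.0 - h)
        self.V2 = momentum * self.V2 - lr * d_out * np.append(h, 1.0)[None, :]
        self.V1 = momentum * self.V1 - lr * np.outer(d_hid, np.append(x, 1.0))
        self.W2 += self.V2
        self.W1 += self.V1

def is_voiced(nets, x, threshold=0.45):
    """Average the scores of three differently initialized nets."""
    return np.mean([net.forward(x)[1] for net in nets]) > threshold

# three perceptrons with different initial weights, used in parallel
nets = [VoicingNet(seed=s) for s in (0, 1, 2)]
```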

Since our own speech data were unlabeled, they could not serve for training. Instead, the training set consisted initially of 32 phonetically segmented texts (16 times "Nordwind und Sonne", 16 times "Berlingeschichte") from 16 different normal speakers in the German Phondat database, a total of 154550 labeled Bark spectra, excluding pauses. The training used up to 5000 iterations. For testing, different subsets of 30 of the 32 texts were used in training and the remaining 2 in the test. The error score amounted to 4.8%. With the above threshold (0.45), only 25% of these errors are falsely classified as voiced; a false unvoiced classification is less detrimental. As mis-
