
a definite time interval. The use of these two methods permits proper extraction of the audio signal and correct execution of the remaining stages of the system. A suitable combination of both methods makes it possible to correctly delimit the beginning and end of the signal. After the beginning and end of a Polish word have been delimited, the signal is passed to further processing without the superfluous silence.
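The trimming of the detected word can be pictured with a short sketch. The two detection methods themselves are described earlier in the paper; the simple short-time-energy threshold used below is only a generic stand-in, and the function name, threshold value and use of NumPy are assumptions made for the example.

import numpy as np

def trim_silence(signal, frame_len=240, threshold=1e-3):
    # Generic stand-in for the paper's endpoint detection: keep the part of the
    # signal between the first and last frame whose energy exceeds the threshold.
    n_frames = len(signal) // frame_len
    energies = np.array([np.sum(signal[i * frame_len:(i + 1) * frame_len] ** 2)
                         for i in range(n_frames)])
    active = np.where(energies > threshold)[0]
    if len(active) == 0:
        return signal[:0]                       # nothing above the threshold
    start = active[0] * frame_len
    end = (active[-1] + 1) * frame_len
    return signal[start:end]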

The non-stationary nature of the speech signal, caused by the dynamic properties of human speech, makes the next stage dependent on dividing the input signal into stationary frames [5]. The signal is stationary over short time intervals (10-30 ms) [7]. In the process of creating the observation vectors, every such stationary frame was replaced by an observation symbol. In the created system it was assumed that the length of every frame equals 30 ms, which at the given sampling rate (8 kHz) corresponds to 240 samples. For speech recognition, in order to preserve the stationarity of the signal, it was assumed that every successive frame overlaps the previous one: the last 80 samples of the previous frame are simultaneously the first 80 samples of the next frame. For speaker verification, in order to keep all the details of the signal, the frames do not overlap.
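The framing rules above can be sketched as follows; the function name and the use of NumPy are assumptions made for the example, not part of the described system.

import numpy as np

def frame_signal(signal, frame_len=240, overlap=80):
    # Split the signal into frames of frame_len samples; consecutive frames
    # share `overlap` samples (80 for speech recognition, 0 for speaker verification).
    step = frame_len - overlap
    n_frames = 1 + (len(signal) - frame_len) // step
    return np.stack([signal[i * step:i * step + frame_len]
                     for i in range(n_frames)])

# 30 ms frames at an 8 kHz sampling rate give 240 samples per frame
fs = 8000
speech = np.random.randn(fs)                        # 1 s of dummy signal
frames_asr = frame_signal(speech, 240, overlap=80)  # speech recognition
frames_ver = frame_signal(speech, 240, overlap=0)   # speaker verification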

The mechanism of cepstral speech analysis

Speech processing applications require specific representations of speech information. A wide range of possibilities exists for parametrically representing the speech signal. Among these, the most important parametric representation of speech is the short-time spectral envelope [5,7]. Linear Predictive Coding (LPC) and Mel Frequency Cepstral Coefficients (MFCC) spectral analysis models have been used widely for speech recognition applications. Usually, together with the MFCC coefficients, first- and second-order derivatives are also used to take into account the dynamic evolution of the speech signal, which carries relevant information for speech recognition.

In the mel-cepstrum, the spectrum is first passed through mel-frequency bandpass filters before being transformed back to the cepstral domain [8].

The characteristics of the filters followed the characteristics of the human auditory system [8]. The filters had triangular bandpass frequency responses. For speech recognition, filters with a width of 300 mels, shifted by 150 mels, were used. For speaker verification, filters with a width of 200 mels, shifted by 100 mels, were used, to allow a more detailed analysis of the low frequencies. The filter bands were spaced linearly below 1000 Hz and logarithmically above 1000 Hz. In the mel-frequency scale all the filter bands had the same width, which corresponded to the intended characteristics of the filters when converted back to the linear frequency scale. The spectrum of the signal in every frame, obtained with the Fast Fourier Transform (FFT, 512 points), was then filtered by the bank of filters. The next step was to calculate the contribution of each filter by multiplying the filter's amplitude by the average power spectrum of the voice input at the corresponding frequency. The sum of all contributions of a filter is:

(1)

(2)

Finally, the mel-frequency cepstrum coefficients (MFCC) were derived by taking the logarithm of the mel-power-spectrum coefficients S_k and converting them back to the time (cepstral) domain using the Discrete Cosine Transform (DCT). The number of mel coefficients K used for speaker recognition purposes was usually from 12 to 20 [8]. In practice, removing this term from the formula gave better performance, both for speech recognition and for verification of the user.
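The chain described above (FFT of each frame, triangular mel filter bank, logarithm, DCT) can be illustrated with a minimal sketch. It is not the code of the described system: the function names, the use of NumPy and SciPy, the mel-scale conversion mel(f) = 2595*log10(1 + f/700), and the spacing of the filters from the number of bands (rather than the fixed 300-mel or 200-mel widths used in the paper) are all assumptions made only for the example.

import numpy as np
from scipy.fftpack import dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters=20, nfft=512, fs=8000):
    # Triangular filters spaced uniformly on the mel scale up to fs/2.
    mel_points = np.linspace(0.0, hz_to_mel(fs / 2.0), n_filters + 2)
    bins = np.floor((nfft + 1) * mel_to_hz(mel_points) / fs).astype(int)
    fbank = np.zeros((n_filters, nfft // 2 + 1))
    for k in range(1, n_filters + 1):
        left, center, right = bins[k - 1], bins[k], bins[k + 1]
        for i in range(left, center):
            fbank[k - 1, i] = (i - left) / max(center - left, 1)
        for i in range(center, right):
            fbank[k - 1, i] = (right - i) / max(right - center, 1)
    return fbank

def mfcc(frame, n_filters=20, n_coeffs=20, nfft=512, fs=8000):
    # One 240-sample frame: power spectrum -> mel filter-bank energies S_k -> log -> DCT.
    power = np.abs(np.fft.rfft(frame, nfft)) ** 2 / nfft
    s = mel_filterbank(n_filters, nfft, fs) @ power
    return dct(np.log(s + 1e-12), type=2, norm='ortho')[:n_coeffs]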

In this work, twenty-dimensional MFCC vectors were used as the standard audio features for speech coding. For speech recognition, the aim of the audio signal analysis was to code the audio signal and to obtain the input data in the form of observation vectors. The Polish language contains 37 different phonemes, therefore a codebook containing 37 code symbols was applied to code the audio signal. At a sampling frequency of 8 kHz, the audio signal is then coded by about 50 values per second instead of 8000. The Lloyd algorithm was applied for vector quantization. One of the basic operations during vector quantization is determining the distance of the next observation vector from all centroids of the codebook. The Euclidean measure was applied for the distance. For speaker verification, the cepstrum coefficients obtained for all frames were summed up appropriately. In this way, each independent utterance is coded by twenty cepstral coefficients.
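The encoding step just described, i.e. assigning every observation vector to the nearest of the 37 codebook centroids in the Euclidean sense, can be sketched as follows; the array names and the use of NumPy are assumptions made for the example.

import numpy as np

def encode_observations(vectors, codebook):
    # vectors:  (n_frames, 20) MFCC observation vectors of one utterance
    # codebook: (37, 20) code-word centroids
    # Returns one observation symbol (index 0..36) per frame.
    dists = np.linalg.norm(vectors[:, None, :] - codebook[None, :, :], axis=2)
    return np.argmin(dists, axis=1)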

Vector quantization using the Lloyd algorithm

In lossy compression, the data generated by a source have to be represented by one of a small number of code words. The number of possible different data values is generally much larger than the number of code words assigned to represent them. The process of representing a large collection of values by a considerably smaller collection is called quantization [9]. A vector quantizer Q of dimension M and size N is a mapping from a vector x in the M-dimensional Euclidean space R^M into a finite set Y containing N M-dimensional outputs or reproduction points, called code vectors or code words. Thus:

Q: R^M → Y (3)

where:

Y = {y_1, y_2, ..., y_N}, y_i ∈ R^M, i = 1, 2, ..., N (4)

Y is known as the codebook of the quantizer. The mapping action is written as:

Q(x) = y_i if d(x, y_i) ≤ d(x, y_j) for every j = 1, 2, ..., N (5)

where d is the Euclidean distance:

d(x, y) = ( Σ_{m=1}^{M} (x_m − y_m)² )^{1/2} (6)
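The paper only names the Lloyd algorithm for training the codebook, so a minimal generic sketch of a Lloyd (nearest-neighbour / centroid) iteration under the Euclidean distance is given below. It is not the authors' implementation; the function name, initialisation, stopping rule and use of NumPy are assumptions made for the example.

import numpy as np

def lloyd_codebook(training_vectors, n_codewords=37, n_iter=50, seed=0):
    # training_vectors: (n_samples, M) MFCC vectors; n_codewords: codebook size N.
    rng = np.random.default_rng(seed)
    # Initialise the code words with randomly chosen training vectors.
    codebook = training_vectors[rng.choice(len(training_vectors),
                                           n_codewords, replace=False)]
    for _ in range(n_iter):
        # Nearest-neighbour condition: assign every vector to its closest code word.
        dists = np.linalg.norm(training_vectors[:, None, :] - codebook[None, :, :],
                               axis=2)
        labels = np.argmin(dists, axis=1)
        # Centroid condition: move each code word to the mean of its assigned vectors.
        for k in range(n_codewords):
            members = training_vectors[labels == k]
            if len(members) > 0:
                codebook[k] = members.mean(axis=0)
    return codebook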
