05.06.2013 Views

A Study of the ITU-T G.729 Speech Coding Algorithm ...

A Study of the ITU-T G.729 Speech Coding Algorithm ...

A Study of the ITU-T G.729 Speech Coding Algorithm ...

SHOW MORE
SHOW LESS

Create successful ePaper yourself

Turn your PDF publications into a flip-book with our unique Google optimized e-Paper software.

2.2.2 <strong>Speech</strong> Sounds<br />

Open<br />

MASTER THESIS<br />

Figure 2: Human speech production.<br />

Datum - Date Rev Dokumentnr - Document no.<br />

04-09-28 PA1<br />

One way <strong>of</strong> analyzing speech is to use phonetics, which is <strong>the</strong> study <strong>of</strong> speech sounds and<br />

<strong>the</strong>ir production. Every language has limited linguistic units called phonemes. A phoneme<br />

is a sound and several consecutive phonemes create words. For example, <strong>the</strong> word dog<br />

consists <strong>of</strong> <strong>the</strong> three phonemes d/ao/g. Most languages have 20-50 phonemes, each<br />

phoneme representing a sound. The 42 phonemes in <strong>the</strong> English language are listed in<br />

Table 2 [15].<br />

A phone is an acoustic realization <strong>of</strong> a phoneme. If <strong>the</strong> realization <strong>of</strong> a phoneme is<br />

context dependent, it is called an Allophone. In most languages, <strong>the</strong> phonemes can be<br />

divided into two groups: vowels and consonants. The set <strong>of</strong> vowels and consonants is<br />

language specific. The hierarchy <strong>of</strong> <strong>the</strong> English phonemes is depicted in Figure 3 [25].<br />

2.2.3 <strong>Speech</strong>-Signal Waveform Characteristics<br />

The range <strong>of</strong> frequencies that humans are able to hear is called <strong>the</strong> audio spectrum. The<br />

bandwidth in <strong>the</strong> audio spectrum ranges hereby from 20 Hz to 20 kHz, although most<br />

humans have a much narrower bandwidth for hearing, especially with increasing age. Frequencies<br />

above <strong>the</strong> human hearing limit are called ultrasonic, while sounds below <strong>the</strong> limit<br />

are called infrasonic.<br />

<strong>Speech</strong>-signal waveform characteristics are constant for short time-periods. A speech<br />

signal is commonly assumed to be wide sense stationary and ergodic in <strong>the</strong> autocorrelation<br />

for segments <strong>of</strong> 10-30 ms <strong>of</strong> speech. In speech analysis, it is <strong>the</strong>refore common to extract<br />

short-time segments <strong>of</strong> speech, called frames to enable simple speech modeling. For<br />

smoo<strong>the</strong>r transaction between analysis frames, <strong>the</strong> latter are extracted with overlap, which<br />

16 (78)

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!