VOWELS RECOGNITION USING VIDEO AND AUDIO
DATA WITH AN APPLICATION TO LARYNGECTOMEES’
VOICE ANALYSIS
Rafal Pietruch
Institute of Electronic Systems, Warsaw University of Technology, ul. Nowowiejska 15/19, 00-665 Warsaw, Poland,
email: rpietruc@elka.pw.edu.pl
Antoni Grzanka
Institute of Electronic Systems, Warsaw University of Technology, ul. Nowowiejska 15/19, 00-665 Warsaw, Poland,
email: antoni.grzanka@ise.pw.edu.pl
Wieslaw Konopka
Department of Audiology, Phoniatrics and Otoneurology, Medical University of Lodz, ul. Zeromskiego 113, 90-549 Lodz, Poland,
email: wieslaw.konopka@umed.lodz.pl
In this paper we present methods for speech analysis that combine audio and video data, with an application to laryngectomees’ voice recognition and quality evaluation. Facial expression measurements are applied to support the analysis of pathological speech. It is demonstrated that the visual parameters increase the recognition rate of Polish vowels.
Keywords: laryngectomy, formants tracking, facial expression analysis
1. Introduction
Several difficulties with the evaluation of acoustical descriptors of laryngectomees’ speech were reported in earlier research. In the pathological speech called pseudo-whisper, noises from the tracheostoma play a significant role in masking the speech spectrum [6]. Many works concerning esophageal speech showed differences in average formant frequencies between post-laryngectomy and natural voice; higher formant frequencies of vowels in esophageal speech were reported in the literature [6, 1, 8]. Our research showed that video data is a promising candidate for supporting the analysis of laryngectomees’ speech [4, 5]. The authors developed a system to extract parameters of laryngectomees’ speech and to evaluate the progress of the patient’s rehabilitation process.
2. Aims
The aim was to compare the vowel recognition rate for subjects using natural voice with that of experimental groups of esophageal and pseudo-whisper speakers after total laryngectomy. The recognition
was based on the analysis of acoustical and facial expression parameters.
3. Methods
3.1 Computer system
The presented system runs on Windows XP. The program is based on DirectShow filter technology. It extracts speech parameters from digital movie files (mostly in MPEG format), visualizes them and provides methods for archiving. We implemented an automatic face detection algorithm for frontal-view image sequences. The system automatically tracks facial elements in real time, extracts visual parameters including the lip shape, compares them with acoustical descriptors and updates the vocal tract model parameters. Neural network training and simulation were carried out on Linux (Ubuntu 8.04) using GNU Octave 3.0.0 with the nnet-0.1.9 package.
3.2 Acoustical model
In this work we used a vocal tract model divided into 10 sections of resonance cavities. We assumed that the soft palate is closed and that the waves do not pass through the nasal cavity. Energy loss is neglected and only reflection effects are taken into consideration. With these assumptions the vocal tract model is equivalent to a lattice filter with PARCOR parameters equal to the negative reflection coefficients [7]. The inverse lattice filter can then be transformed into a transversal filter [3], whose parameters can be estimated from the speech signal using the linear prediction method. An adaptive recursive least squares (RLS) algorithm was used to estimate the transversal filter (LPC) coefficients [3]. There is an inverse transformation that derives the related cross-sectional diameters of the vocal tract sections from given LPC parameters. The transversal filter coefficients were used to track the formant frequencies of six Polish vowels [6]. The first two formants, F1 and F2, were chosen as the acoustical descriptors in the audio recognition system.
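As an illustration only, the following Python/numpy sketch shows one way an adaptive RLS estimator of the transversal (LPC) coefficients can be written. It is not the authors' implementation; the forgetting factor, the initialization constant and all names are assumptions made here.

    import numpy as np

    def rls_lpc(signal, order=10, lam=0.99, delta=100.0):
        # Adaptive RLS estimation of the transversal-filter (LPC) coefficients.
        # lam is an assumed forgetting factor, delta an assumed initialization
        # constant; neither value is taken from the paper.
        w = np.zeros(order)             # predictor coefficients a_1 .. a_p
        P = np.eye(order) * delta       # inverse correlation matrix estimate
        for n in range(order, len(signal)):
            x = signal[n - order:n][::-1]       # past samples s(n-1) .. s(n-p)
            e = signal[n] - w @ x               # a priori prediction error
            k = P @ x / (lam + x @ P @ x)       # RLS gain vector
            w = w + k * e                       # coefficient update
            P = (P - np.outer(k, x @ P)) / lam  # inverse correlation update
        return w

The returned predictor coefficients can then be converted to reflection (PARCOR) coefficients with the standard step-down recursion, from which the cross-sections of the tube model follow [3, 7].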
3.3 Formant tracking
A formant tracking algorithm was implemented as part of the vowel recognition functionality of the system. The Christensen algorithm was used to find formant candidates from the spectrum [2]: all minima of the second derivative of the spectrum are chosen. For the 4 kHz bandwidth assumed in our study, up to 5 candidates $f^K = \{f^K_1, f^K_2, \cdots, f^K_{n_K}\}$ can be found, and they have to be assigned to 4 formant numbers $F = \{F_1, F_2, \cdots, F_{n_F}\}$. Three situations can occur:
1. $n_K = n_F$
   The candidates are then assigned one-to-one to the formant numbers by the function $f_{K\to F}$ according to equation (1):
   $$f_{K\to F}: f^K_j \to F_i \Leftrightarrow i = j, \qquad i, j \in \{1, 2, \cdots, n_F\} \quad (1)$$
2. $n_K < n_F$
   In this situation $n_F - n_K$ formants will be rejected. The function (2) specifies which formants remain unassigned. The function can be given using a matrix $B_F$, where each row represents a candidate and each column represents one formant (equation (3)). This transformation is found by a best-fit, minimum-cost algorithm subject to the ordering constraint (7):
   $$f_{K\to F}: f^K_j \to F_i \quad (2)$$
   $$B_F(j, i) = 0 \Leftrightarrow f_{K\to F}(j) = i \quad (3)$$
3. $n_K > n_F$
   In this situation $n_K - n_F$ candidates will stay unassigned; in our setting there can be at most one extra candidate. We look for the function (4) that assigns formant numbers to the reduced set of candidates according to the matrix $B^T_F$, where each column represents a candidate and each row is related to a formant number. The matrix elements are found according to equation (5):
   $$f_{F\to K}: F_i \to f^K_j \quad (4)$$
   $$B^T_F(i, j) = 0 \Leftrightarrow f_{F\to K}(i) = j \quad (5)$$

The best-fit, minimum-cost algorithm used to find the transformation takes into account the candidate frequencies $f^K$ and the mean frequencies of the previously assigned formants $\bar F = \{\bar F_1, \bar F_2, \cdots, \bar F_{n_F}\}$, where for the n-th sample $\bar F_i(n) = \frac{1}{M}\sum_{m=1}^{M} F_i(n - m)$. In our program we assume $M = 9$.
In the algorithm the cost of changing the previous formant frequencies ($\bar F_i$) into the new ones ($f^K_j$) is minimized; this is expressed by equation (6):
$$\max_{B_F} \sum_{i=1}^{n_F} \sum_{j=1}^{n_K} B_F(i, j) \cdot C_F(i, j), \qquad \text{where } C_F(i, j) = |\bar F_i - f^K_j| \quad (6)$$
Since an entry of the assignment matrix equals 0 for an assigned pair (equations (3) and (5)), maximizing the summed cost over the non-assigned pairs in equation (6) is equivalent to minimizing the summed cost of the assigned pairs.

There is a natural assumption that formant candidates should be assigned to formants with respect to their frequency order (equations (7) and (8)):
$$\forall_{j_1, j_2 \in \{1, \ldots, n_K\}}\; \forall_{i_1, i_2 \in \{1, \ldots, n_F\}} \text{ for which } f_{K\to F}(j_1) = i_1 \wedge f_{K\to F}(j_2) = i_2: \quad f^K_{j_1} < f^K_{j_2} \Leftrightarrow i_1 < i_2 \quad (7)$$
$$\forall_{j_1, j_2 \in \{1, \ldots, n_K\}}\; \forall_{i_1, i_2 \in \{1, \ldots, n_F\}} \text{ for which } f_{F\to K}(i_1) = j_1 \wedge f_{F\to K}(i_2) = j_2: \quad i_1 < i_2 \Leftrightarrow f^K_{j_1} < f^K_{j_2} \quad (8)$$
Hence we only need to check all combinations of $\min\{n_F, n_K\}$ elements chosen from the set of $\max\{n_F, n_K\}$ elements; a sketch of this search is given below.
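The following Python sketch illustrates how such an order-preserving, minimum-cost search over combinations could look. It is an illustration under the assumptions stated in the comments, not the authors' code; the function name, argument names and the handling of ties are ours.

    from itertools import combinations

    def assign_formants(candidates, mean_formants):
        # candidates: formant candidate frequencies f^K (ascending);
        # mean_formants: running means of previously assigned formants (ascending).
        # Returns the cheapest order-preserving pairing as a list of
        # (candidate index, formant index) tuples, i.e. the assignment that
        # minimizes the summed cost |F_bar_i - f^K_j| over the assigned pairs.
        n_k, n_f = len(candidates), len(mean_formants)
        n = min(n_k, n_f)
        best_cost, best_pairs = float("inf"), None
        if n_k <= n_f:
            # choose which formant numbers receive the candidates (case n_K < n_F)
            for idx in combinations(range(n_f), n):
                cost = sum(abs(mean_formants[i] - candidates[j])
                           for j, i in enumerate(idx))
                if cost < best_cost:
                    best_cost, best_pairs = cost, list(zip(range(n), idx))
        else:
            # choose which candidates are kept (case n_K > n_F)
            for idx in combinations(range(n_k), n):
                cost = sum(abs(mean_formants[i] - candidates[j])
                           for i, j in enumerate(idx))
                if cost < best_cost:
                    best_cost, best_pairs = cost, list(zip(idx, range(n)))
        return best_pairs

Because both input lists are sorted by frequency and combinations preserves index order, every examined pairing automatically satisfies constraints (7) and (8).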
3.4 Video methods
Within our system, novel computer vision techniques were used to automatically segment the eye, mouth and nose regions [5]. The following parameters are tracked: the distance between the eyes $L_0$, the lip height $L_H$ and width $L_W$, and the distance $L_J$ between the line joining the eyes and the bottom of the jaw. The mouth opening area is estimated from the lip shape. The jaw angle and the mouth opening area were chosen as the facial expression descriptors.
3.5 Face element tracking
The facial element tracking algorithm starts with a color space conversion from RGB to HSV. Then, for each pixel (within the assumed margins), the values of 5 features are computed. Binarization into 5 regions is performed with adaptive thresholds, which are updated according to the region sizes. The resulting regions specify red-colored pixels (RED), dark pixels (DARK), moving objects (MOV), vertical differences (VDIFF) and horizontal differences (HDIFF). The features are defined below; an illustrative implementation is sketched after the list.
• The red color intensity $I_{RED}(x, y)$ for pixel coordinates $(x, y)$, where $x \in \{1, \cdots, W\}$, $y \in \{1, \cdots, H\}$ ($W$ - picture width, $H$ - picture height), is computed from the Hue component (H) according to equation (9):
$$I_{RED}(x, y) = \begin{cases} G_{RED}\,(B_{RED} - H(x, y)) & \text{for } H(x, y) < B_{RED} \\ 0 & \text{for } B_{RED} \le H(x, y) \le H_{MAX} - B_{RED} \\ G_{RED}\,(H(x, y) + B_{RED} - H_{MAX}) & \text{for } H(x, y) > H_{MAX} - B_{RED} \end{cases} \quad (9)$$
where $H_{MAX} = 2^8$, $G_{RED} = 2^4$, $B_{RED} = 15$.
• Local, dynamic changes are computed for every pixel $(x, y)$ and the n-th frame according to equation (10):
$$I_{MOV}(n) = \lambda\, I_{MOV}(n - 1) + (1 - \lambda)\, |V(n) - V(n - 1)|, \quad (10)$$
where $\lambda = 0.875$.
• Horizontal differences for the pixel at $(x, y)$ are computed from the Value component (V) using equation (11):
$$I_{HDIFF}(x, y) = |V(x, y) - V(x - 1, y)| \quad (11)$$
• Vertical differences for the pixel at $(x, y)$ are computed from the V component using equation (12):
$$I_{VDIFF}(x, y) = \tfrac{1}{4}\,|V(x - 1, y - 1) - V(x - 1, y)| + \tfrac{1}{2}\,|V(x, y - 1) - V(x, y)| + \tfrac{1}{4}\,|V(x + 1, y - 1) - V(x + 1, y)| \quad (12)$$
• The darkness of the pixel $(x, y)$ is computed using equation (13):
$$I_{DARK}(x, y) = 2^8 - V(x, y) - 1 \quad (13)$$
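As a rough illustration only, the five features could be computed on whole frames as in the Python/numpy sketch below; it assumes H and V are the hue and value planes of an HSV frame scaled to 0..255 and indexed [row, column]. Only the constants come from the paper, the array handling and function names are ours.

    import numpy as np

    H_MAX, G_RED, B_RED, LAM = 2 ** 8, 2 ** 4, 15, 0.875

    def red_intensity(H):
        # eq. (9): strong response for hue near 0 or near H_MAX (red wrap-around)
        Hf = np.asarray(H, dtype=float)
        I = np.zeros_like(Hf)
        low, high = Hf < B_RED, Hf > H_MAX - B_RED
        I[low] = G_RED * (B_RED - Hf[low])
        I[high] = G_RED * (Hf[high] + B_RED - H_MAX)
        return I

    def motion_intensity(I_prev, V, V_prev):
        # eq. (10): exponential smoothing of inter-frame Value differences
        return LAM * I_prev + (1.0 - LAM) * np.abs(
            np.asarray(V, float) - np.asarray(V_prev, float))

    def horizontal_diff(V):
        Vf = np.asarray(V, dtype=float)
        I = np.zeros_like(Vf)
        I[:, 1:] = np.abs(Vf[:, 1:] - Vf[:, :-1])        # eq. (11)
        return I

    def vertical_diff(V):
        Vf = np.asarray(V, dtype=float)
        d = np.zeros_like(Vf)
        d[1:, :] = np.abs(Vf[1:, :] - Vf[:-1, :])        # |V(x, y) - V(x, y-1)|
        I = np.zeros_like(Vf)
        I[:, 1:-1] = 0.25 * d[:, :-2] + 0.5 * d[:, 1:-1] + 0.25 * d[:, 2:]  # eq. (12)
        return I

    def darkness(V):
        return (H_MAX - 1) - np.asarray(V, dtype=float)  # eq. (13)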
Examples of the extracted regions are presented in figure 1. The face shape is extracted from the HDIFF and RED regions: the face is the most red region in the picture and there should be horizontal contrast at its left and right sides. The face borders are found from assumptions (14) and (15).
For every $y$ coordinate:
• the nearest point, starting from the left margin, that fulfils the following assumption is chosen:
$$x \in \{S_{MAX}, \cdots, \tfrac{1}{2}W\} \;\wedge\; Z_{HDIFF}(x, y) \;\wedge\; \sum_{i=1}^{S_{MAX}} Z_{RED}(x + i, y) > S_{MIN} \quad (14)$$
• the nearest point, starting from the right margin, that fulfils the following assumption is chosen:
$$x \in \{\tfrac{1}{2}W, \cdots, W - S_{MAX}\} \;\wedge\; Z_{HDIFF}(x, y) \;\wedge\; \sum_{i=1}^{S_{MAX}} Z_{RED}(x - i, y) > S_{MIN} \quad (15)$$
where $S_{MAX} = 32$ and $S_{MIN} = 16$.
Figure 1. Results of video processing: (a) dark region, (b) moving region, (c) horizontal differences, (d) vertical differences, (e) red region, (f) face shape, (g) face elements, (h) face borders, elements, symmetry lines, eye points and mouth contour.
An example of the evaluated face region is shown in figure 1(f), indicated by black pixels.
Face element segmentation is done according to a combination of the region sets (equation (16)), for every coordinate $(x, y)$:
$$Z_{ELEM} = (Z_{RED} \wedge Z_{VDIFF}) \vee (Z_{DARK} \wedge Z_{MOV}) \quad (16)$$
An example of the evaluated face elements is shown in figure 1(g), indicated by black pixels.
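A minimal sketch of the thresholding and region combination is given below, assuming boolean numpy masks for the Z regions; the adaptive-threshold update rule shown here is a simplification of ours, since the paper only states that the thresholds follow the region sizes.

    def binarize(I, thr):
        # threshold one feature image into a boolean region mask Z
        return I > thr

    def adapt_threshold(thr, Z, target_fraction=0.05, step=1.0):
        # hypothetical update: nudge the threshold so the region keeps
        # roughly a target size (fraction of all pixels)
        return thr + step if Z.mean() > target_fraction else thr - step

    def face_element_region(Z_RED, Z_VDIFF, Z_DARK, Z_MOV):
        # combination of the region sets, eq. (16)
        return (Z_RED & Z_VDIFF) | (Z_DARK & Z_MOV)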
The middle points between the left and right borders are taken as potential points of face symmetry. First, the line that covers the most middle points is found using the Hough transform. Points that lie at a distance from this line greater than a defined threshold are removed from the set. The face symmetry line is then estimated from the remaining points using the least squares criterion.
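The sketch below illustrates the outlier-rejection and refit idea; for brevity it replaces the Hough-transform initialisation used in the paper with a plain least-squares fit, and the distance threshold is an arbitrary assumption.

    import numpy as np

    def symmetry_line(y_coords, x_left, x_right, max_dist=5.0):
        # midpoints between the left and right face borders per image row
        y = np.asarray(y_coords, dtype=float)
        mid = 0.5 * (np.asarray(x_left, float) + np.asarray(x_right, float))
        a, b = np.polyfit(y, mid, 1)                    # initial line x = a*y + b
        keep = np.abs(mid - (a * y + b)) <= max_dist    # drop distant midpoints
        a, b = np.polyfit(y[keep], mid[keep], 1)        # least-squares refit
        return a, b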
From the face symmetry line we create orthogonal lines between the borders. For every such line the middle point of the face element regions is found. The symmetry line of the facial elements is estimated from the set of facial element symmetry points with the same algorithm as the face symmetry line. The result of the described methods is shown in figure 1(h).
Symmetrical face elements can consist of regions that lie on the symmetry line or of two symmetrical regions on both sides of it. Symmetry points with the same parity of symmetrical regions are grouped using binary filtering with the mask $h_B = [1\ 1\ 1\ 1\ 1]$. The largest sets with an odd number of symmetric regions are taken as nose and mouth candidates, and sets with an even number of symmetrical regions are potential eye regions. Then assumptions about the relative positions of the regions are taken into account, and the symmetrical regions are assigned to the mouth, eye and nose elements. The lip contour is evaluated by inflation from the symmetry middle point.
3.6 Neural network
We used neural networks (NN) for both the audio and the visual recognition systems. A feed-forward structure with two-dimensional input vectors was used. The ’tansig’ transfer function was used in the first layer of the NN and a linear transfer function was chosen for the output layer. We used the mean squared error (MSE) performance function and a back-propagation training function. The network was trained by gradient descent with momentum.
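The authors used the GNU Octave nnet package; purely as an illustration of this architecture and training rule, a minimal numpy version might look as follows (layer sizes shown for the visual network; the learning rate and momentum values are assumptions).

    import numpy as np

    class TwoLayerNet:
        # feed-forward net: 'tansig' (tanh) hidden layer, linear output layer
        def __init__(self, n_in=2, n_hidden=3, n_out=4, seed=0):
            rng = np.random.default_rng(seed)
            self.W1 = rng.normal(0.0, 0.5, (n_hidden, n_in))
            self.b1 = np.zeros(n_hidden)
            self.W2 = rng.normal(0.0, 0.5, (n_out, n_hidden))
            self.b2 = np.zeros(n_out)

        def forward(self, x):
            self.h = np.tanh(self.W1 @ x + self.b1)
            return self.W2 @ self.h + self.b2

        def train_step(self, x, target, state, lr=0.05, mom=0.9):
            # one back-propagation step of gradient descent with momentum;
            # state is a dict that persists the momentum terms between calls,
            # constant factors of the MSE gradient are absorbed into lr
            y = self.forward(x)
            e = y - target
            dh = (self.W2.T @ e) * (1.0 - self.h ** 2)
            grads = {"W2": np.outer(e, self.h), "b2": e,
                     "W1": np.outer(dh, x), "b1": dh}
            for name, g in grads.items():
                v = mom * state.get(name, 0.0) - lr * g
                state[name] = v
                setattr(self, name, getattr(self, name) + v)
            return float(np.mean(e ** 2))

Targets are +1 for the output related to the spoken vowel and -1 for the others, as described below.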
Both the audio and the video parameters were limited to 2 dimensions. The NN was trained to assign the maximum value (1) to the output related to the spoken vowel and the minimum value (-1) to the other outputs. During NN simulation the output with the maximum value was chosen and the vowel assigned to it was recognized. For visual data the hidden layer size was 3 and the output layer size was 4. Only 4 groups of vowels, {’a’, ’e’}, {’i’, ’y’}, {’u’} and {’o’}, were represented by the output neurons $n_{1:4}$ respectively. The input vectors $[x_1, x_2]$ were formed from the facial parameters as follows: $x_1 = 4 L_H L_W / L_0^2 - 1$, $x_2 = (L_J - L_{Jm})/L_0$, where $L_{Jm}$ is the jaw opening in the neutral position. For acoustical parameters the first layer had 5 neurons and the output layer had 6 neurons. Each output $n_{1:6}$ was related to one Polish vowel: ’a’, ’i’, ’e’, ’y’, ’o’, ’u’, respectively. The input vectors $[x_1, x_2]$ were normalized to values in the range (-1; 2) as follows: $x_1 = F_2/1000 - 1.5$, $x_2 = F_1/1000 - 0.7$.
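For completeness, the two input normalizations can be written directly from the formulas above; a small sketch, with function names of our own choosing:

    import numpy as np

    def visual_features(LH, LW, LJ, L0, LJm):
        # x1 = 4*LH*LW/L0^2 - 1, x2 = (LJ - LJm)/L0 (Sec. 3.6)
        return np.array([4.0 * LH * LW / L0 ** 2 - 1.0, (LJ - LJm) / L0])

    def acoustic_features(F1, F2):
        # x1 = F2/1000 - 1.5, x2 = F1/1000 - 0.7 (Sec. 3.6)
        return np.array([F2 / 1000.0 - 1.5, F1 / 1000.0 - 0.7])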
3.7 Subjects
A control group (C) of 10 laryngeal speakers formed the training set for the neural networks, while experimental groups of 10 esophageal speakers (E) and 10 pseudo-whisper patients (P) served as test sets. The system was evaluated in an experimental study of 34 subjects articulating 6 isolated Polish vowels: ’a’, ’i’, ’e’, ’y’, ’o’, ’u’ (see the Appendix for the related IPA symbols). The control group formed the training set in both cases: the NN for visual descriptors (NN-v) and the NN for acoustical parameters (NN-a). The experimental groups of 10 esophageal speakers (Ev) and 10 pseudo-whisper patients (Pv) formed the test sets for the visual analyses. The mean age of group Ev was 64 years and the group consisted of men only.

For the acoustical analyses the experimental group consisted of 10 esophageal speakers (Ea). The mean age of group Ea was also 64, but it differed from group Ev by 4 subjects, including one woman. The pseudo-whisper group was not taken into consideration, since the extraction of formants in this group failed, as discussed in [6].
4. Results
The neural network parameters were reported in [5]. From the audio data we achieved a vowel recognition rate of 98 percent in the C group and 66 percent in the Ea group. The recognition rate for the visual data was 94 percent in the C group, 76 percent in the Pv group and 76 percent in the Ev group.
5. Discussion
In its present state the algorithm is sensitive to variable face conditions (beard, glasses and long hair). The video data measurements were supported by an operator. No algorithm for extraction of the jaw opening length has been implemented in our system yet. Every visual parameter result was compared with a subjective, manual measurement made on a picture grabbed from the related video frame. Our methodology demonstrated an improvement of vowel recognition, especially for pseudo-whisper speakers, compared with formant analysis [6]. We expect that further improvement of recognition will be achieved in future work by extracting and evaluating hybrid audio-video parameters using the presented system.
6. Conclusion
The proposed hybrid framework of joint audio and video feature extraction is expected to achieve higher recognition accuracies, and its integration should yield a significant improvement of speech analysis. The system is easy to extend. In the future we are going to analyze real-time images from a video camera.
7. Appendix
7.1 IPA symbols for Polish vowels
The Polish vowel transcriptions are given in figure 2.
Figure 2. IPA symbols of Polish vowels.
8. Acknowledgement
The work described in this paper is funded by the Polish Ministry of Science and Higher Education under grant number N N518 0929 33.
REFERENCES
1 T. Cervera, J. L. Miralles, and J. González A. Acoustical analysis of Spanish vowels produced by laryngectomized subjects. Journal of Speech, Language, and Hearing Research, 44:988–996, 2001.
2 J. M. Christensen and B. Weinberg. Vowel duration characteristics of esophageal speech. Journal of Speech and Hearing Research, 19:678–689, 1976.
3 S. Haykin. Adaptive filter theory. Prentice Hall, Inc., Upper Saddle River, 1991.
4 R. Pietruch, M. Michalska, W. Konopka, and A. Grzanka. An analysis of face expression images for evaluation of laryngectomees’ voice quality. In Mirjana Sovilj and Dimitris Skanavis, editors, First European Congress on Prevention, Detection and Diagnostics of Verbal Communication Disorders, CD-ROM, Patras, Greece, December 15–17, 2006.
5 R. Pietruch, M. Michalska, W. Konopka, and A. Grzanka. Evaluation of laryngectomees’ voice quality using correlations with facial expression. In Seiji Niimi, editor, Proceedings of the 5th International Conference on Voice Physiology and Biomechanics, pages 96–99, Tokyo, Japan, July 12–14, 2006.
6 R. Pietruch, M. Michalska, W. Konopka, and A. Grzanka. Methods for formant extraction in speech
of patients after total laryngectomy. Biomedical Signal Processing and Control, 1/2:107–112, 2006.
7 S. Saito. Speech Science and Technology. Ohmsha, Ltd., Tokyo, 1992.
8 M. Sisty and B. Weinberg. Formant frequency characteristics of esophageal speech. Journal of
Speech and Hearing Research, 15:439–448, 1972.