
VOWELS RECOGNITION USING VIDEO AND AUDIO DATA WITH AN APPLICATION TO LARYNGECTOMEES’ VOICE ANALYSIS

Rafal Pietruch
Institute of Electronic Systems, Warsaw University of Technology, ul. Nowowiejska 15/19, 00-665 Warsaw, Poland
email: rpietruc@elka.pw.edu.pl

Antoni Grzanka
Institute of Electronic Systems, Warsaw University of Technology, ul. Nowowiejska 15/19, 00-665 Warsaw, Poland
email: antoni.grzanka@ise.pw.edu.pl

Wieslaw Konopka
Department of Audiology, Phoniatrics and Otoneurology, Medical University of Lodz, ul. Zeromskiego 113, 90-549 Lodz, Poland
email: wieslaw.konopka@umed.lodz.pl

In this paper we present methods for speech analysis that combine audio and video data, applied to laryngectomees’ vowel recognition and voice quality evaluation. Facial expression measurements are used to support the analysis of pathological speech. It is demonstrated that visual parameters increase the recognition rate of Polish vowels.

Keywords: laryngectomy, formants tracking, facial expression analysis

1. Introduction

Several difficulties with the evaluation of acoustical descriptors for laryngectomees’ speech were reported in earlier research. In the pathological speech called pseudo-whisper, noise from the tracheostoma plays a significant role in masking the speech spectrum [6]. Many works on esophageal speech analysis showed differences in average formant frequencies between post-laryngectomy and natural voice; higher formant frequencies of vowels in esophageal speech were reported in the literature [6, 1, 8]. Our research showed that video data is a promising candidate for supporting laryngectomees’ speech analysis [4, 5]. The authors developed a system to extract parameters of laryngectomees’ speech and to evaluate the progress of the patient’s rehabilitation process.

2. Aims

The aim was to compare the vowel recognition rate of subjects using natural voice with experimental groups of esophageal and pseudo-whisper speakers after total laryngectomy. The recognition was based on the analysis of acoustical and facial expression parameters.

3. Methods

3.1 Computer system

The presented system runs on the Windows XP platform and is based on DirectShow filter technology. It extracts speech parameters from digital movie files (mostly in MPEG format), visualizes them, and provides methods for archiving. We implemented an automatic face detection algorithm for frontal-view image sequences. The system tracks facial elements in real time, extracts visual parameters including the lip shape, compares them with acoustical descriptors, and updates the vocal tract model parameters. Neural network simulation and training were performed on the Linux platform (Ubuntu 8.04 distribution) using GNU Octave 3.0.0 with the nnet-0.1.9 package.

3.2 Acoustical model

In this work we used a vocal tract model divided into 10 sections of resonance cavities. We assumed that the palate is closed and that no waves pass through the nasal cavity; energy losses are neglected and only reflection effects are taken into consideration. With these assumptions the vocal tract model is equivalent to a lattice filter whose PARCOR parameters equal the negative reflection coefficients [7]. The inverse lattice filter can then be transformed into a transversal filter [3], whose parameters can be estimated from the speech signal using linear prediction. An adaptive recursive least squares algorithm was used to estimate the transversal filter (LPC) coefficients [3]. An inverse transformation exists that derives the related cross-sectional diameters of the vocal tract sections from given LPC parameters. The transversal filter coefficients were used to track the formant frequencies of six Polish vowels [6]. The first two formants, F1 and F2, were chosen as the acoustical descriptors in the audio recognition system.
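As a concrete illustration of this processing chain, the following Python sketch (not the authors’ DirectShow/Octave implementation) estimates the LPC coefficients of a single vowel frame with the batch autocorrelation method and Levinson-Durbin recursion instead of the adaptive RLS estimator used in the paper, derives the PARCOR coefficients and relative section areas of the lossless tube model, and reads formant candidates from the LPC polynomial roots. The function names, the frame windowing and the sign convention of the area recursion are assumptions.

import numpy as np

def levinson_durbin(r, order):
    """Return LPC coefficients a (a[0] = 1) and PARCOR (reflection) coefficients k."""
    a = np.zeros(order + 1)
    a[0] = 1.0
    k = np.zeros(order)
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        ki = -acc / err
        k[i - 1] = ki
        a[1:i] = a[1:i] + ki * a[i - 1:0:-1]   # update previous coefficients
        a[i] = ki
        err *= (1.0 - ki * ki)
    return a, k

def vowel_frame_parameters(frame, fs=8000.0, order=10):
    """Batch sketch: LPC of one windowed frame, tube areas and formant candidates."""
    frame = frame * np.hamming(len(frame))
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    a, k = levinson_durbin(r[:order + 1], order)

    # Relative cross-sectional areas of the 10 lossless tube sections
    # (the sign convention of the reflection coefficients is an assumption).
    areas = [1.0]
    for ki in k:
        areas.append(areas[-1] * (1.0 - ki) / (1.0 + ki))

    # Formant candidates: angles of the complex LPC poles, excluding near-DC/Nyquist roots.
    roots = [z for z in np.roots(a) if z.imag > 0]
    freqs = sorted(np.angle(z) * fs / (2.0 * np.pi) for z in roots)
    freqs = [f for f in freqs if 90.0 < f < fs / 2 - 90.0]
    return freqs[:2], np.array(areas)   # [F1, F2] candidates and relative section areas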

3.3 Formants tracking

A formant tracking algorithm was implemented in the system with the vowel recognition functionality. The Christensen algorithm was used to find formant candidates from the spectrum [2]: all minima of the second derivative of the spectrum are chosen. For the 4 kHz bandwidth assumed in our study, up to 5 candidates $f^K = \{f^K_1, f^K_2, \cdots, f^K_{n_K}\}$ can be found, and we need to assign $n_F = 4$ formant numbers $F = \{F_1, F_2, \cdots, F_{n_F}\}$. There are three possible situations:


1. $n_K = n_F$

Candidates are assigned directly to formant numbers by the function $f_{K \to F}$ according to equation (1):
$$f_{K \to F}: f^K_j \to F_i \Leftrightarrow i = j, \qquad i, j \in \{1, 2, \cdots, n_F\} \quad (1)$$

2. $n_K < n_F$

In this situation $n_F - n_K$ formants will be rejected. Function (2) specifies which formants remain unassigned. The function can be represented by the matrix $B_F$, in which each row corresponds to a candidate and each column to a formant (equation 3). The transformation is found with the best-fit, minimum-cost algorithm subject to the ordering constraint (7).
$$f_{K \to F}: f^K_j \to F_i \quad (2)$$
$$B_F(j, i) = 0 \Leftrightarrow f_{K \to F}(j) = i \quad (3)$$



3. $n_K > n_F$

In this situation $n_K - n_F$ candidates remain unassigned (in our setting there can be at most one extra candidate). We look for the function (4) that assigns formant numbers to the limited set of candidates according to the matrix $B^T_F$, in which each column corresponds to a candidate and each row to a formant number; its elements are set according to equation (5).
$$f_{F \to K}: F_i \to f^K_j \quad (4)$$
$$B^T_F(i, j) = 0 \Leftrightarrow f_{F \to K}(i) = j \quad (5)$$

The best-fit, minimum-cost algorithm that finds the related transformation takes into account the candidate frequencies $f^K$ and the mean frequencies of the previously assigned formants $\bar{F} = \{\bar{F}_1, \bar{F}_2, \cdots, \bar{F}_{n_F}\}$, where for the $n$-th sample $\bar{F}_i(n) = \frac{1}{M} \sum_{m=1}^{M} F_i(n - m)$. In our program we assume $M = 9$. The sum of the costs of changing the previous formant frequencies $\bar{F}_i$ into the new ones $f^K_j$ is minimized according to equation (6),
$$\max_{B_F} \sum_{i=1}^{n_F} \sum_{j=1}^{n_K} B_F(i, j) \cdot C_F(i, j), \qquad C_F(i, j) = |\bar{F}_i - f^K_j| \quad (6)$$
where maximizing over $B_F$, whose zero entries mark the assigned pairs, is equivalent to minimizing the total cost of the assigned candidate-formant pairs.
$$\forall_{j_1, j_2 \in \{1, \ldots, n_K\}}\ \forall_{i_1, i_2 \in \{1, \ldots, n_F\}} \ \text{for which}\ f_{K \to F}(j_1) = i_1 \wedge f_{K \to F}(j_2) = i_2: \quad f^K_{j_1} < f^K_{j_2} \Leftrightarrow i_1 < i_2 \quad (7)$$
$$\forall_{j_1, j_2 \in \{1, \ldots, n_K\}}\ \forall_{i_1, i_2 \in \{1, \ldots, n_F\}} \ \text{for which}\ f_{F \to K}(i_1) = j_1 \wedge f_{F \to K}(i_2) = j_2: \quad i_1 < i_2 \Leftrightarrow f^K_{j_1} < f^K_{j_2} \quad (8)$$

It is natural to assume that formant candidates are assigned to formants with respect to their order (equations 7 and 8). We therefore need to check all $\min\{n_F, n_K\}$-element combinations of the $\max\{n_F, n_K\}$-element set; a minimal sketch of this search is given below.
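The following Python sketch illustrates this order-preserving, minimum-cost assignment; it performs the brute-force search over combinations directly instead of building the binary matrix $B_F$, and the function and variable names are illustrative assumptions rather than the authors’ code.

from itertools import combinations

def assign_formants(candidates, formant_means):
    """candidates: sorted candidate frequencies f^K (up to 5 values).
    formant_means: running means Fbar of previously assigned formants (length n_F).
    Returns a list of length n_F with the assigned candidate frequency or None."""
    n_k, n_f = len(candidates), len(formant_means)
    best_cost, best_assignment = float("inf"), [None] * n_f

    if n_k >= n_f:
        # Choose which n_F candidates to keep, in frequency order (constraint 8).
        for keep in combinations(range(n_k), n_f):
            cost = sum(abs(formant_means[i] - candidates[j]) for i, j in enumerate(keep))
            if cost < best_cost:
                best_cost = cost
                best_assignment = [candidates[j] for j in keep]
    else:
        # Choose which n_F formant slots receive a candidate, in order (constraint 7).
        for slots in combinations(range(n_f), n_k):
            cost = sum(abs(formant_means[i] - candidates[j]) for j, i in enumerate(slots))
            if cost < best_cost:
                best_cost = cost
                best_assignment = [None] * n_f
                for j, i in enumerate(slots):
                    best_assignment[i] = candidates[j]
    return best_assignment

# Example: four running formant means and five candidates (one spurious peak at 750 Hz).
print(assign_formants([300.0, 750.0, 1200.0, 2400.0, 3500.0],
                      [320.0, 1150.0, 2300.0, 3400.0]))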

3.4 Video methods

Within our system, novel computer vision techniques were used to automatically segment the eye, mouth and nose regions [5]. The following parameters are tracked: the distance between the eyes $L_0$, the lip height $L_H$ and width $L_W$, and the distance $L_J$ between the line joining the eyes and the bottom of the jaw. The mouth opening area is estimated from the lip shape. The jaw angle and the mouth opening area were chosen as the facial expression descriptors.

3.5 Face elements tracking

The facial element tracking algorithm starts with a color space conversion from RGB to HSV. Then, for each pixel (within the assumed margins), the values of 5 features are computed. Binarization into 5 regions is performed with adaptive thresholds, which are updated according to the region sizes. The resulting regions mark red-colored pixels (RED), dark pixels (DARK), moving objects (MOV), vertical differences (VDIFF) and horizontal differences (HDIFF).




• Red color intensity $I_{RED}(x, y)$ for pixel coordinates $(x, y)$, where $x \in \{1, \cdots, W\}$, $y \in \{1, \cdots, H\}$ ($W$ is the picture width, $H$ the picture height), is computed from the Hue component $H$ according to equation (9):
$$I_{RED}(x, y) = \begin{cases} G_{RED}\,(B_{RED} - H(x, y)) & \text{for } H(x, y) < B_{RED} \\ 0 & \text{for } B_{RED} \le H(x, y) \le H_{MAX} - B_{RED} \\ G_{RED}\,(H(x, y) + B_{RED} - H_{MAX}) & \text{for } H(x, y) > H_{MAX} - B_{RED} \end{cases} \quad (9)$$
where $H_{MAX} = 2^8$, $G_{RED} = 2^4$, $B_{RED} = 15$.

• Local dynamic changes are computed for every pixel $(x, y)$ at the $n$-th sample according to equation (10):
$$I_{MOV}(n) = \lambda I_{MOV}(n - 1) + (1 - \lambda)\,|V(n) - V(n - 1)|, \quad (10)$$
where $\lambda = 0.875$.

• Horizontal differences at pixel $(x, y)$ are computed from the Value component $V$ using equation (11):
$$I_{HDIFF}(x, y) = |V(x, y) - V(x - 1, y)| \quad (11)$$

• Vertical differences at pixel $(x, y)$ are computed from the $V$ component using equation (12):
$$I_{VDIFF}(x, y) = \tfrac{1}{4}\,|V(x - 1, y - 1) - V(x - 1, y)| + \tfrac{1}{2}\,|V(x, y - 1) - V(x, y)| + \tfrac{1}{4}\,|V(x + 1, y - 1) - V(x + 1, y)| \quad (12)$$

• Darkness of pixel $(x, y)$ is computed using equation (13):
$$I_{DARK}(x, y) = 2^8 - V(x, y) - 1 \quad (13)$$

Examples of the extracted regions are presented in figure 1.
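The following Python sketch shows how the five feature maps (equations 9-13) and their adaptive binarization could be computed with numpy. The array layout, the 0-255 hue range and the concrete threshold update rule are assumptions made for illustration, not the authors’ implementation.

import numpy as np

H_MAX, G_RED, B_RED, LAMBDA = 2 ** 8, 2 ** 4, 15, 0.875

def pixel_features(hsv, hsv_prev, i_mov_prev):
    """hsv, hsv_prev: H x W x 3 arrays with hue scaled to 0-255; i_mov_prev: previous I_MOV map."""
    H = hsv[..., 0].astype(np.float32)
    V = hsv[..., 2].astype(np.float32)
    V_prev = hsv_prev[..., 2].astype(np.float32)

    # Equation (9): redness from the circular hue component.
    i_red = np.where(H < B_RED, G_RED * (B_RED - H),
             np.where(H > H_MAX - B_RED, G_RED * (H + B_RED - H_MAX), 0.0))

    # Equation (10): exponentially smoothed frame difference.
    i_mov = LAMBDA * i_mov_prev + (1.0 - LAMBDA) * np.abs(V - V_prev)

    # Equations (11)-(12): horizontal and vertical contrast of the V component.
    i_hdiff = np.abs(V - np.roll(V, 1, axis=1))
    i_vdiff = (0.25 * np.abs(np.roll(V, (1, 1), (0, 1)) - np.roll(V, 1, 1))
               + 0.5 * np.abs(np.roll(V, 1, 0) - V)
               + 0.25 * np.abs(np.roll(V, (1, -1), (0, 1)) - np.roll(V, -1, 1)))

    # Equation (13): darkness.
    i_dark = (H_MAX - 1.0) - V

    return {"RED": i_red, "MOV": i_mov, "HDIFF": i_hdiff,
            "VDIFF": i_vdiff, "DARK": i_dark}

def binarize(features, thresholds, target_fraction=0.05, gain=0.05):
    """Threshold every feature map and nudge each threshold so that the binary
    region keeps roughly a target fraction of the image (assumed update rule)."""
    regions = {}
    for name, img in features.items():
        z = img > thresholds[name]
        thresholds[name] *= 1.0 + gain * (z.mean() - target_fraction) / target_fraction
        regions[name] = z
    return regions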

The face shape is extracted from the HDIFF and RED regions: the face is the most red region in the picture, and there should be horizontal contrast at its left and right sides. The face borders are found from assumptions (14) and (15).

For every $y$ coordinate:

• the nearest point, starting from the left margin, that fulfils the assumption below is chosen:
$$x \in \{S_{MAX}, \cdots, \tfrac{1}{2}W\} \ \wedge\ Z_{HDIFF}(x, y) \ \wedge\ \sum_{i=1}^{S_{MAX}} Z_{RED}(x + i, y) > S_{MIN} \quad (14)$$

• the nearest point, starting from the right margin, that fulfils the assumption below is chosen:
$$x \in \{\tfrac{1}{2}W, \cdots, W - S_{MAX}\} \ \wedge\ Z_{HDIFF}(x, y) \ \wedge\ \sum_{i=1}^{S_{MAX}} Z_{RED}(x - i, y) > S_{MIN} \quad (15)$$

where $S_{MAX} = 32$ and $S_{MIN} = 16$.



Figure 1. Results of video processing: (a) dark region, (b) moving region, (c) horizontal differences, (d) vertical differences, (e) red region, (f) face shape, (g) face elements, (h) face borders, elements, symmetry lines, eye points and mouth contour.




An example of the evaluated face region is shown in figure 1(f), indicated by black pixels.
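A minimal Python sketch of the border search of equations (14) and (15) is given below, assuming boolean region maps Z_HDIFF and Z_RED produced by the binarization step above; the scanning loops and the return convention are illustrative assumptions.

import numpy as np

S_MAX, S_MIN = 32, 16

def face_borders(z_hdiff, z_red):
    """z_hdiff, z_red: boolean H x W region maps. Returns left/right border x for every row (-1 if none)."""
    height, width = z_hdiff.shape
    left = np.full(height, -1)
    right = np.full(height, -1)
    for y in range(height):
        # Equation (14): nearest contrast point from the left margin backed by a run of red pixels.
        for x in range(S_MAX, width // 2):
            if z_hdiff[y, x] and z_red[y, x + 1:x + S_MAX + 1].sum() > S_MIN:
                left[y] = x
                break
        # Equation (15): nearest contrast point from the right margin.
        for x in range(width - S_MAX - 1, width // 2, -1):
            if z_hdiff[y, x] and z_red[y, x - S_MAX:x].sum() > S_MIN:
                right[y] = x
                break
    return left, right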

Face element segmentation is done according to the combination of region sets given in equation (16), evaluated for every coordinate $(x, y)$:
$$Z_{ELEM} = (Z_{RED} \wedge Z_{VDIFF}) \vee (Z_{DARK} \wedge Z_{MOV}) \quad (16)$$
An example of the evaluated face elements is shown in figure 1(g), indicated by black pixels.

The middle points between the left and right borders are taken as potential points of face symmetry. First, the line covering the most middle points is found using the Hough transform. Points lying farther from this line than a defined distance are removed from the set. The face symmetry line is then estimated from the remaining points using the least squares (MLS) criterion.

From the face symmetry line we create orthogonal lines between the borders. For every such line, the middle point of symmetry of the face elements region is found. The symmetry lines of the facial elements are estimated with the same algorithm as the facial symmetry line, from the set of facial element symmetry points. The result of the described methods is shown in figure 1(h).

Symmetrical face elements can consist of regions that lie on the symmetry line or of two symmetrical regions on both sides of the line. Symmetry points with the same parity of symmetrical regions are grouped using binary filtering with the kernel $h_B = [1\ 1\ 1\ 1\ 1]$. The largest sets with an odd number of symmetric regions are taken as nose and mouth candidates, and sets with even numbers of symmetrical regions are potential eye regions. Assumptions about the relative positions of the regions are then taken into account, and the symmetrical regions are assigned to the mouth, eyes and nose. The lip contour is evaluated by inflation from the symmetry middle point.
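As an illustration of the symmetry line estimation, the following sketch votes over near-vertical candidate lines x = a·y + b through the border midpoints in a coarse Hough-style accumulator, removes outlying points, and refits the line by least squares. The parameter grids and the distance threshold are assumptions, not the authors’ settings.

import numpy as np

def symmetry_line(mid_x, ys, max_dist=4.0):
    """mid_x: midpoint x-coordinates between left and right borders; ys: their row indices."""
    mid_x = np.asarray(mid_x, dtype=float)
    ys = np.asarray(ys, dtype=float)

    # Hough-style voting: count points close to each candidate near-vertical line.
    slopes = np.linspace(-0.3, 0.3, 31)
    offsets = np.linspace(mid_x.min(), mid_x.max(), 64)
    best_votes, best_a, best_b = -1, 0.0, float(np.median(mid_x))
    for a in slopes:
        for b in offsets:
            votes = int(np.sum(np.abs(mid_x - (a * ys + b)) < max_dist))
            if votes > best_votes:
                best_votes, best_a, best_b = votes, a, b

    # Remove points that lie too far from the winning line.
    keep = np.abs(mid_x - (best_a * ys + best_b)) < max_dist

    # Least squares refit of x = a*y + b on the remaining points.
    A = np.column_stack([ys[keep], np.ones(keep.sum())])
    a, b = np.linalg.lstsq(A, mid_x[keep], rcond=None)[0]
    return a, b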

3.6 Neural network

We used a neural network (NN) for both the audio and the visual recognition system. A feed-forward structure with two-dimensional input vectors was used. The ’tansig’ transfer function was used in the first layer of the NN and a linear transfer function was chosen for the output layer. We used the mean squared error (MSE) performance function and a back-propagation training function; the network was trained with gradient descent with momentum.

Both the audio and the video parameters were limited to 2 dimensions. The NN was trained to assign the maximum value (1) to the output related to the spoken vowel and the minimum value (-1) to the other outputs. During simulation the output with the maximum value was chosen and the vowel assigned to it was taken as recognized. For the visual data the hidden layer size was 3 and the output layer size was 4; only 4 groups of vowels, {’a’, ’e’}, {’i’, ’y’}, {’u’} and {’o’}, were represented by the output neurons $n_{1:4}$, respectively. The input vectors $[x_1, x_2]$ were formed from the facial parameters as $x_1 = 4 L_H L_W / L_0^2 - 1$ and $x_2 = (L_J - L_{Jm})/L_0$, where $L_{Jm}$ is the jaw opening in the neutral position. For the acoustical parameters the first layer had 5 neurons and the output layer 6 neurons; every output $n_{1:6}$ was related to one Polish vowel: ’a’, ’i’, ’e’, ’y’, ’o’, ’u’, respectively. The input vectors $[x_1, x_2]$ were normalized to values in $(-1, 2)$ as $x_1 = F_2/1000 - 1.5$ and $x_2 = F_1/1000 - 0.7$.
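For concreteness, a minimal numpy sketch of such a network and of the input formation described above is given below; the learning rate, momentum value, epoch count and initialization are assumptions (the original experiments used the GNU Octave nnet package).

import numpy as np

def visual_input(LH, LW, LJ, L0, LJm):
    return np.array([4.0 * LH * LW / L0 ** 2 - 1.0, (LJ - LJm) / L0])

def acoustic_input(F1, F2):
    return np.array([F2 / 1000.0 - 1.5, F1 / 1000.0 - 0.7])

class VowelNet:
    """Feed-forward net: tanh ('tansig') hidden layer, linear output, MSE loss,
    gradient descent with momentum."""
    def __init__(self, n_in=2, n_hidden=3, n_out=4, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(0, 0.5, (n_hidden, n_in)); self.b1 = np.zeros(n_hidden)
        self.W2 = rng.normal(0, 0.5, (n_out, n_hidden)); self.b2 = np.zeros(n_out)
        self.vel = [np.zeros_like(p) for p in (self.W1, self.b1, self.W2, self.b2)]

    def forward(self, x):
        h = np.tanh(self.W1 @ x + self.b1)      # 'tansig' hidden layer
        return h, self.W2 @ h + self.b2         # linear output layer

    def train(self, X, T, lr=0.05, momentum=0.9, epochs=2000):
        for _ in range(epochs):
            grads = [np.zeros_like(p) for p in (self.W1, self.b1, self.W2, self.b2)]
            for x, t in zip(X, T):
                h, y = self.forward(x)
                e = y - t                                    # MSE gradient (up to a constant)
                grads[2] += np.outer(e, h); grads[3] += e
                dh = (self.W2.T @ e) * (1.0 - h ** 2)        # back-propagate through tanh
                grads[0] += np.outer(dh, x); grads[1] += dh
            for i, (p, g) in enumerate(zip((self.W1, self.b1, self.W2, self.b2), grads)):
                self.vel[i] = momentum * self.vel[i] - lr * g / len(X)
                p += self.vel[i]

    def recognize(self, x):
        return int(np.argmax(self.forward(x)[1]))   # index of the winning output

# Targets: +1 on the output of the spoken vowel group, -1 elsewhere,
# e.g. t = np.array([1, -1, -1, -1]) for the group {'a', 'e'}.
# Usage: net = VowelNet(n_out=4); net.train(X_train, T_train); group = net.recognize(x)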

3.7 Subjects

The system was evaluated in an experimental study of 34 subjects articulating 6 isolated Polish vowels: ’a’, ’i’, ’e’, ’y’, ’o’, ’u’ (see the Appendix for the related IPA symbols). A control group (C) of 10 laryngeal speakers formed the training set for the neural networks in both cases: the NN for visual descriptors (NN-v) and the NN for acoustical parameters (NN-a). Experimental groups of 10 esophageal speakers (Ev) and 10 pseudo-whisper patients (Pv) formed the test sets for the visual analyses; the mean age of group Ev was 64 and the group consisted of men.

Figure 2. IPA symbols of Polish vowels.

For the acoustical analyses the experimental group consisted of 10 esophageal speakers (Ea). The mean age of group Ea was also 64, but it differed from the Ev group in 4 subjects, including one woman. The pseudo-whisper group was not taken into consideration, as the extraction of formants in this group failed, as discussed in [6].

4. Results

The neural network parameters were reported in [5]. From the audio data we achieved a vowel recognition rate of 98 percent in the C group and 66 percent in the Ea group. The recognition rate for the visual data was 94 percent in the C group, 76 percent in the Pv group and 76 percent in the Ev group.

5. Discussion

In its present state the algorithm is sensitive to variable facial conditions (beard, glasses and long hair), and the video measurements were supported by an operator. No algorithm for extracting the jaw opening length has been implemented in our system yet; every visual parameter was compared with a subjective, manual measurement made on a picture grabbed from the related video frame. Our methodology demonstrated an improvement of vowel recognition, especially for pseudo-whisper speakers, compared with formant analysis [6]. We expect that further improvement of recognition will be achieved in our future work by extracting and evaluating hybrid audio-video parameters using the presented system.

6. Conclusion

The proposed hybrid framework of joint audio and video feature extraction is expected to achieve higher recognition accuracies, and its integration should yield a significant improvement of speech analysis. The system is easy to extend. In the future we are going to analyze real-time images from a video camera.

7. Appendix

7.1 IPA symbols for Polish vowels

Polish vowel transcriptions are given in figure 2.




8. Acknowledgement

The work described in this paper was funded by the Polish Ministry of Science and Higher Education under grant number N N518 0929 33.

REFERENCES

1. T. Cervera, J. L. Miralles, and J. González A. Acoustical analysis of Spanish vowels produced by laryngectomized subjects. Journal of Speech, Language, and Hearing Research, 44:988–996, 2001.

2. J. M. Christensen and B. Weinberg. Vowel duration characteristics of esophageal speech. Journal of Speech and Hearing Research, 19:678–689, 1976.

3. S. Haykin. Adaptive Filter Theory. Prentice Hall, Upper Saddle River, 1991.

4. R. Pietruch, M. Michalska, W. Konopka, and A. Grzanka. An analysis of face expression images for evaluation of laryngectomees’ voice quality. In M. Sovilj and D. Skanavis, editors, First European Congress on Prevention, Detection and Diagnostics of Verbal Communication Disorders, CD-ROM, Patras, Greece, December 15-17, 2006.

5. R. Pietruch, M. Michalska, W. Konopka, and A. Grzanka. Evaluation of laryngectomees’ voice quality using correlations with facial expression. In S. Niimi, editor, Proceedings of the 5th International Conference on Voice Physiology and Biomechanics, pages 96–99, Tokyo, Japan, July 12-14, 2006.

6. R. Pietruch, M. Michalska, W. Konopka, and A. Grzanka. Methods for formant extraction in speech of patients after total laryngectomy. Biomedical Signal Processing and Control, 1(2):107–112, 2006.

7. S. Saito. Speech Science and Technology. Ohmsha, Tokyo, 1992.

8. M. Sisty and B. Weinberg. Formant frequency characteristics of esophageal speech. Journal of Speech and Hearing Research, 15:439–448, 1972.

