VOWELS RECOGNITION USING VIDEO AND AUDIO
DATA WITH AN APPLICATION TO LARYNGECTOMEES’
VOICE ANALYSIS
Rafal Pietruch
Institute of Electronic Systems, Warsaw University of Technology, ul. Nowowiejska 15/19, 00-665 Warsaw, Poland,
email: rpietruc@elka.pw.edu.pl
Antoni Grzanka
Institute of Electronic Systems, Warsaw University of Technology, ul. Nowowiejska 15/19, 00-665 Warsaw, Poland,
email: antoni.grzanka@ise.pw.edu.pl
Wieslaw Konopka
Department of Audiology, Phoniatrics and Otoneurology, Medical University of Lodz, ul. Zeromskiego 113, 90-549 Lodz, Poland,
email: wieslaw.konopka@umed.lodz.pl
In this paper we present methods for speech analysis that combine audio and video data, with an application to laryngectomees’ voice recognition and quality evaluation. Facial expression measurements are applied to support the analysis of pathological speech. It is demonstrated that the visual parameters increase the recognition rate of Polish vowels.
Keywords: laryngectomy, formants tracking, facial expression analysis
1. Introduction
Several difficulties with the evaluation of acoustical descriptors of laryngectomees’ speech were reported in earlier research. In the pathological speech called pseudo-whisper, noises from the tracheostoma play a significant role in masking the speech spectrum [6]. Many works concerning esophageal speech showed differences in average formant frequencies between post-laryngectomy and natural voice; higher formant frequencies of vowels in esophageal speech were reported in the literature [6, 1, 8]. Our research showed that video data is a promising candidate for supporting the analysis of laryngectomees’ speech [4, 5]. The authors developed a system to extract parameters of laryngectomees’ speech and to evaluate the progress of the patient’s rehabilitation process.
2. Aims
The aim was to compare the vowel recognition rate for subjects using natural voice with that of experimental groups of esophageal and pseudo-whisper speakers after total laryngectomy. The recognition
was based on the analysis of acoustical and facial expression parameters.
3. Methods
3.1 Computer system
The presented system runs on Windows XP. The program is based on DirectShow filter technology. It extracts speech parameters from digital movie files (mostly in MPEG format), visualizes them and provides methods for archiving. We implemented an automatic face detection algorithm for frontal-view image sequences. The system automatically tracks facial elements in real time, extracts visual parameters including the lip shape, compares them with acoustical descriptors and updates the vocal tract model parameters. Neural network training and simulation were carried out on Linux (Ubuntu 8.04) using GNU Octave 3.0.0 with the nnet-0.1.9 package.
3.2 Acoustical model
In this work we used a vocal tract model divided into 10 sections of resonance cavities. We assumed that the soft palate is closed and that the waves do not pass through the nasal cavity. Energy loss is neglected and only reflection effects are taken into consideration. With these assumptions the vocal tract model is equivalent to a lattice filter with PARCOR parameters equal to the negative reflection coefficients [7]. The inverse lattice filter can then be transformed into a transversal filter [3], whose parameters can be estimated from the speech signal using the linear prediction method. An adaptive recursive least squares (RLS) algorithm was used to estimate the transversal filter (LPC) coefficients [3]. There is an inverse transformation that derives the related cross-sectional diameters of the vocal tract sections from given LPC parameters. The transversal filter coefficients were used to track the formant frequencies of six Polish vowels [6]. The first two formants, F1 and F2, were chosen as the acoustical descriptors in the audio recognition system.
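As an illustration only, the following Python/numpy sketch shows one way an adaptive RLS estimator of the transversal (LPC) coefficients can be written. It is not the authors' implementation; the forgetting factor, the initialization constant and all names are assumptions made here.

    import numpy as np

    def rls_lpc(signal, order=10, lam=0.99, delta=100.0):
        # Adaptive RLS estimation of the transversal-filter (LPC) coefficients.
        # lam is an assumed forgetting factor, delta an assumed initialization
        # constant; neither value is taken from the paper.
        w = np.zeros(order)             # predictor coefficients a_1 .. a_p
        P = np.eye(order) * delta       # inverse correlation matrix estimate
        for n in range(order, len(signal)):
            x = signal[n - order:n][::-1]       # past samples s(n-1) .. s(n-p)
            e = signal[n] - w @ x               # a priori prediction error
            k = P @ x / (lam + x @ P @ x)       # RLS gain vector
            w = w + k * e                       # coefficient update
            P = (P - np.outer(k, x @ P)) / lam  # inverse correlation update
        return w

The returned predictor coefficients can then be converted to reflection (PARCOR) coefficients with the standard step-down recursion, from which the cross-sections of the tube model follow [3, 7].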
3.3 Formant tracking
A formant tracking algorithm was implemented as part of the vowel recognition functionality of the system. The Christensen algorithm was used to find formant candidates from the spectrum [2]: all minima of the second derivative of the spectrum are chosen. For the 4 kHz bandwidth assumed in our study, up to 5 candidates $f^K = \{f^K_1, f^K_2, \cdots, f^K_{n_K}\}$ can be found, and they have to be assigned to 4 formant numbers $F = \{F_1, F_2, \cdots, F_{n_F}\}$. Three situations can occur:
1. $n_K = n_F$
   The candidates are then assigned one-to-one to the formant numbers by the function $f_{K\to F}$ according to equation (1):
   $$f_{K\to F}: f^K_j \to F_i \Leftrightarrow i = j, \qquad i, j \in \{1, 2, \cdots, n_F\} \quad (1)$$
2. $n_K < n_F$
   In this situation $n_F - n_K$ formants will be rejected. The function (2) specifies which formants remain unassigned. The function can be given using a matrix $B_F$, where each row represents a candidate and each column represents one formant (equation (3)). This transformation is found by a best-fit, minimum-cost algorithm subject to the ordering constraint (7):
   $$f_{K\to F}: f^K_j \to F_i \quad (2)$$
   $$B_F(j, i) = 0 \Leftrightarrow f_{K\to F}(j) = i \quad (3)$$
3. $n_K > n_F$
   In this situation $n_K - n_F$ candidates will stay unassigned; in our setting there can be at most one extra candidate. We look for the function (4) that assigns formant numbers to the reduced set of candidates according to the matrix $B^T_F$, where each column represents a candidate and each row is related to a formant number. The matrix elements are found according to equation (5):
   $$f_{F\to K}: F_i \to f^K_j \quad (4)$$
   $$B^T_F(i, j) = 0 \Leftrightarrow f_{F\to K}(i) = j \quad (5)$$

The best-fit, minimum-cost algorithm used to find the transformation takes into account the candidate frequencies $f^K$ and the mean frequencies of the previously assigned formants $\bar F = \{\bar F_1, \bar F_2, \cdots, \bar F_{n_F}\}$, where for the n-th sample $\bar F_i(n) = \frac{1}{M}\sum_{m=1}^{M} F_i(n - m)$. In our program we assume $M = 9$.
In the algorithm the cost of changing the previous formant frequencies ($\bar F_i$) into the new ones ($f^K_j$) is minimized; this is expressed by equation (6):
$$\max_{B_F} \sum_{i=1}^{n_F} \sum_{j=1}^{n_K} B_F(i, j) \cdot C_F(i, j), \qquad \text{where } C_F(i, j) = |\bar F_i - f^K_j| \quad (6)$$
Since an entry of the assignment matrix equals 0 for an assigned pair (equations (3) and (5)), maximizing the summed cost over the non-assigned pairs in equation (6) is equivalent to minimizing the summed cost of the assigned pairs.

There is a natural assumption that formant candidates should be assigned to formants with respect to their frequency order (equations (7) and (8)):
$$\forall_{j_1, j_2 \in \{1, \ldots, n_K\}}\; \forall_{i_1, i_2 \in \{1, \ldots, n_F\}} \text{ for which } f_{K\to F}(j_1) = i_1 \wedge f_{K\to F}(j_2) = i_2: \quad f^K_{j_1} < f^K_{j_2} \Leftrightarrow i_1 < i_2 \quad (7)$$
$$\forall_{j_1, j_2 \in \{1, \ldots, n_K\}}\; \forall_{i_1, i_2 \in \{1, \ldots, n_F\}} \text{ for which } f_{F\to K}(i_1) = j_1 \wedge f_{F\to K}(i_2) = j_2: \quad i_1 < i_2 \Leftrightarrow f^K_{j_1} < f^K_{j_2} \quad (8)$$
Hence we only need to check all combinations of $\min\{n_F, n_K\}$ elements chosen from the set of $\max\{n_F, n_K\}$ elements; a sketch of this search is given below.
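The following Python sketch illustrates how such an order-preserving, minimum-cost search over combinations could look. It is an illustration under the assumptions stated in the comments, not the authors' code; the function name, argument names and the handling of ties are ours.

    from itertools import combinations

    def assign_formants(candidates, mean_formants):
        # candidates: formant candidate frequencies f^K (ascending);
        # mean_formants: running means of previously assigned formants (ascending).
        # Returns the cheapest order-preserving pairing as a list of
        # (candidate index, formant index) tuples, i.e. the assignment that
        # minimizes the summed cost |F_bar_i - f^K_j| over the assigned pairs.
        n_k, n_f = len(candidates), len(mean_formants)
        n = min(n_k, n_f)
        best_cost, best_pairs = float("inf"), None
        if n_k <= n_f:
            # choose which formant numbers receive the candidates (case n_K < n_F)
            for idx in combinations(range(n_f), n):
                cost = sum(abs(mean_formants[i] - candidates[j])
                           for j, i in enumerate(idx))
                if cost < best_cost:
                    best_cost, best_pairs = cost, list(zip(range(n), idx))
        else:
            # choose which candidates are kept (case n_K > n_F)
            for idx in combinations(range(n_k), n):
                cost = sum(abs(mean_formants[i] - candidates[j])
                           for i, j in enumerate(idx))
                if cost < best_cost:
                    best_cost, best_pairs = cost, list(zip(idx, range(n)))
        return best_pairs

Because both input lists are sorted by frequency and combinations preserves index order, every examined pairing automatically satisfies constraints (7) and (8).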
3.4 Video methods
Within our system, novel computer vision techniques were used to automatically segment the eye, mouth and nose regions [5]. The following parameters are tracked: the distance between the eyes $L_0$, the lip height $L_H$ and width $L_W$, and the distance $L_J$ between the line joining the eyes and the bottom of the jaw. The mouth opening area is estimated from the lip shape. The jaw angle and the mouth opening area were chosen as the facial expression descriptors.
3.5 Face element tracking
The facial element tracking algorithm starts with a color space conversion from RGB to HSV. Then, for each pixel (within the assumed margins), the values of 5 features are computed. Binarization into 5 regions is performed with adaptive thresholds, which are updated according to the region sizes. The resulting regions specify red-colored pixels (RED), dark pixels (DARK), moving objects (MOV), vertical differences (VDIFF) and horizontal differences (HDIFF). The features are defined below; an illustrative implementation is sketched after the list.
• The red color intensity $I_{RED}(x, y)$ for pixel coordinates $(x, y)$, where $x \in \{1, \cdots, W\}$, $y \in \{1, \cdots, H\}$ ($W$ - picture width, $H$ - picture height), is computed from the Hue component (H) according to equation (9):
$$I_{RED}(x, y) = \begin{cases} G_{RED}\,(B_{RED} - H(x, y)) & \text{for } H(x, y) < B_{RED} \\ 0 & \text{for } B_{RED} \le H(x, y) \le H_{MAX} - B_{RED} \\ G_{RED}\,(H(x, y) + B_{RED} - H_{MAX}) & \text{for } H(x, y) > H_{MAX} - B_{RED} \end{cases} \quad (9)$$
where $H_{MAX} = 2^8$, $G_{RED} = 2^4$, $B_{RED} = 15$.
• Local, dynamic changes are computed for every pixel $(x, y)$ and the n-th frame according to equation (10):
$$I_{MOV}(n) = \lambda\, I_{MOV}(n - 1) + (1 - \lambda)\, |V(n) - V(n - 1)|, \quad (10)$$
where $\lambda = 0.875$.
• Horizontal differences for the pixel at $(x, y)$ are computed from the Value component (V) using equation (11):
$$I_{HDIFF}(x, y) = |V(x, y) - V(x - 1, y)| \quad (11)$$
• Vertical differences for the pixel at $(x, y)$ are computed from the V component using equation (12):
$$I_{VDIFF}(x, y) = \tfrac{1}{4}\,|V(x - 1, y - 1) - V(x - 1, y)| + \tfrac{1}{2}\,|V(x, y - 1) - V(x, y)| + \tfrac{1}{4}\,|V(x + 1, y - 1) - V(x + 1, y)| \quad (12)$$
• The darkness of the pixel $(x, y)$ is computed using equation (13):
$$I_{DARK}(x, y) = 2^8 - V(x, y) - 1 \quad (13)$$
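As a rough illustration only, the five features could be computed on whole frames as in the Python/numpy sketch below; it assumes H and V are the hue and value planes of an HSV frame scaled to 0..255 and indexed [row, column]. Only the constants come from the paper, the array handling and function names are ours.

    import numpy as np

    H_MAX, G_RED, B_RED, LAM = 2 ** 8, 2 ** 4, 15, 0.875

    def red_intensity(H):
        # eq. (9): strong response for hue near 0 or near H_MAX (red wrap-around)
        Hf = np.asarray(H, dtype=float)
        I = np.zeros_like(Hf)
        low, high = Hf < B_RED, Hf > H_MAX - B_RED
        I[low] = G_RED * (B_RED - Hf[low])
        I[high] = G_RED * (Hf[high] + B_RED - H_MAX)
        return I

    def motion_intensity(I_prev, V, V_prev):
        # eq. (10): exponential smoothing of inter-frame Value differences
        return LAM * I_prev + (1.0 - LAM) * np.abs(
            np.asarray(V, float) - np.asarray(V_prev, float))

    def horizontal_diff(V):
        Vf = np.asarray(V, dtype=float)
        I = np.zeros_like(Vf)
        I[:, 1:] = np.abs(Vf[:, 1:] - Vf[:, :-1])        # eq. (11)
        return I

    def vertical_diff(V):
        Vf = np.asarray(V, dtype=float)
        d = np.zeros_like(Vf)
        d[1:, :] = np.abs(Vf[1:, :] - Vf[:-1, :])        # |V(x, y) - V(x, y-1)|
        I = np.zeros_like(Vf)
        I[:, 1:-1] = 0.25 * d[:, :-2] + 0.5 * d[:, 1:-1] + 0.25 * d[:, 2:]  # eq. (12)
        return I

    def darkness(V):
        return (H_MAX - 1) - np.asarray(V, dtype=float)  # eq. (13)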
Examples of the extracted regions are presented in figure 1. The face shape is extracted from the HDIFF and RED regions: the face is the most red region in the picture and there should be horizontal contrast at its left and right sides. The face borders are found from assumptions (14) and (15).
For every $y$ coordinate:
• the nearest point, starting from the left margin, that fulfils the following assumption is chosen:
$$x \in \{S_{MAX}, \cdots, \tfrac{1}{2}W\} \;\wedge\; Z_{HDIFF}(x, y) \;\wedge\; \sum_{i=1}^{S_{MAX}} Z_{RED}(x + i, y) > S_{MIN} \quad (14)$$
• the nearest point, starting from the right margin, that fulfils the following assumption is chosen:
$$x \in \{\tfrac{1}{2}W, \cdots, W - S_{MAX}\} \;\wedge\; Z_{HDIFF}(x, y) \;\wedge\; \sum_{i=1}^{S_{MAX}} Z_{RED}(x - i, y) > S_{MIN} \quad (15)$$
where $S_{MAX} = 32$ and $S_{MIN} = 16$.
Figure 1. Results of video processing: (a) dark region, (b) moving region, (c) horizontal differences, (d) vertical differences, (e) red region, (f) face shape, (g) face elements, (h) face borders, elements, symmetry lines, eye points and mouth contour.
An example of the evaluated face region is shown in figure 1(f), indicated by black pixels.
Face element segmentation is done according to a combination of the region sets (equation (16)), for every coordinate $(x, y)$:
$$Z_{ELEM} = (Z_{RED} \wedge Z_{VDIFF}) \vee (Z_{DARK} \wedge Z_{MOV}) \quad (16)$$
An example of the evaluated face elements is shown in figure 1(g), indicated by black pixels.
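A minimal sketch of the thresholding and region combination is given below, assuming boolean numpy masks for the Z regions; the adaptive-threshold update rule shown here is a simplification of ours, since the paper only states that the thresholds follow the region sizes.

    def binarize(I, thr):
        # threshold one feature image into a boolean region mask Z
        return I > thr

    def adapt_threshold(thr, Z, target_fraction=0.05, step=1.0):
        # hypothetical update: nudge the threshold so the region keeps
        # roughly a target size (fraction of all pixels)
        return thr + step if Z.mean() > target_fraction else thr - step

    def face_element_region(Z_RED, Z_VDIFF, Z_DARK, Z_MOV):
        # combination of the region sets, eq. (16)
        return (Z_RED & Z_VDIFF) | (Z_DARK & Z_MOV)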
The middle points between the left and right borders are taken as potential points of face symmetry. First, the line that covers the most middle points is found using the Hough transform. Points that lie at a distance from this line greater than a defined threshold are removed from the set. The face symmetry line is then estimated from the remaining points using the least squares criterion.
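The sketch below illustrates the outlier-rejection and refit idea; for brevity it replaces the Hough-transform initialisation used in the paper with a plain least-squares fit, and the distance threshold is an arbitrary assumption.

    import numpy as np

    def symmetry_line(y_coords, x_left, x_right, max_dist=5.0):
        # midpoints between the left and right face borders per image row
        y = np.asarray(y_coords, dtype=float)
        mid = 0.5 * (np.asarray(x_left, float) + np.asarray(x_right, float))
        a, b = np.polyfit(y, mid, 1)                    # initial line x = a*y + b
        keep = np.abs(mid - (a * y + b)) <= max_dist    # drop distant midpoints
        a, b = np.polyfit(y[keep], mid[keep], 1)        # least-squares refit
        return a, b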
From the face symmetry line we create orthogonal lines between the borders. For every such line the middle point of the face element regions is found. The symmetry line of the facial elements is estimated from the set of facial element symmetry points with the same algorithm as the face symmetry line. The result of the described methods is shown in figure 1(h).
Symmetrical face elements can consist of regions that lie on the symmetry line or of two symmetrical regions on both sides of it. Symmetry points with the same parity of symmetrical regions are grouped using binary filtering with the mask $h_B = [1\ 1\ 1\ 1\ 1]$. The largest sets with an odd number of symmetric regions are taken as nose and mouth candidates, and sets with an even number of symmetrical regions are potential eye regions. Then assumptions about the relative positions of the regions are taken into account, and the symmetrical regions are assigned to the mouth, eye and nose elements. The lip contour is evaluated by inflation from the symmetry middle point.
3.6 Neural network
We used neural networks (NN) for both the audio and the visual recognition systems. A feed-forward structure with two-dimensional input vectors was used. The ’tansig’ transfer function was used in the first layer of the NN and a linear transfer function was chosen for the output layer. We used the mean squared error (MSE) performance function and a back-propagation training function. The network was trained by gradient descent with momentum.
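The authors used the GNU Octave nnet package; purely as an illustration of this architecture and training rule, a minimal numpy version might look as follows (layer sizes shown for the visual network; the learning rate and momentum values are assumptions).

    import numpy as np

    class TwoLayerNet:
        # feed-forward net: 'tansig' (tanh) hidden layer, linear output layer
        def __init__(self, n_in=2, n_hidden=3, n_out=4, seed=0):
            rng = np.random.default_rng(seed)
            self.W1 = rng.normal(0.0, 0.5, (n_hidden, n_in))
            self.b1 = np.zeros(n_hidden)
            self.W2 = rng.normal(0.0, 0.5, (n_out, n_hidden))
            self.b2 = np.zeros(n_out)

        def forward(self, x):
            self.h = np.tanh(self.W1 @ x + self.b1)
            return self.W2 @ self.h + self.b2

        def train_step(self, x, target, state, lr=0.05, mom=0.9):
            # one back-propagation step of gradient descent with momentum;
            # state is a dict that persists the momentum terms between calls,
            # constant factors of the MSE gradient are absorbed into lr
            y = self.forward(x)
            e = y - target
            dh = (self.W2.T @ e) * (1.0 - self.h ** 2)
            grads = {"W2": np.outer(e, self.h), "b2": e,
                     "W1": np.outer(dh, x), "b1": dh}
            for name, g in grads.items():
                v = mom * state.get(name, 0.0) - lr * g
                state[name] = v
                setattr(self, name, getattr(self, name) + v)
            return float(np.mean(e ** 2))

Targets are +1 for the output related to the spoken vowel and -1 for the others, as described below.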
Both the audio and the video parameters were limited to 2 dimensions. The NN was trained to assign the maximum value (1) to the output related to the spoken vowel and the minimum value (-1) to the other outputs. During NN simulation the output with the maximum value was chosen and the vowel assigned to it was recognized. For visual data the hidden layer size was 3 and the output layer size was 4. Only 4 groups of vowels, {’a’, ’e’}, {’i’, ’y’}, {’u’} and {’o’}, were represented by the output neurons $n_{1:4}$ respectively. The input vectors $[x_1, x_2]$ were formed from the facial parameters as follows: $x_1 = 4 L_H L_W / L_0^2 - 1$, $x_2 = (L_J - L_{Jm})/L_0$, where $L_{Jm}$ is the jaw opening in the neutral position. For acoustical parameters the first layer had 5 neurons and the output layer had 6 neurons. Each output $n_{1:6}$ was related to one Polish vowel: ’a’, ’i’, ’e’, ’y’, ’o’, ’u’, respectively. The input vectors $[x_1, x_2]$ were normalized to values in the range (-1; 2) as follows: $x_1 = F_2/1000 - 1.5$, $x_2 = F_1/1000 - 0.7$.
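For completeness, the two input normalizations can be written directly from the formulas above; a small sketch, with function names of our own choosing:

    import numpy as np

    def visual_features(LH, LW, LJ, L0, LJm):
        # x1 = 4*LH*LW/L0^2 - 1, x2 = (LJ - LJm)/L0 (Sec. 3.6)
        return np.array([4.0 * LH * LW / L0 ** 2 - 1.0, (LJ - LJm) / L0])

    def acoustic_features(F1, F2):
        # x1 = F2/1000 - 1.5, x2 = F1/1000 - 0.7 (Sec. 3.6)
        return np.array([F2 / 1000.0 - 1.5, F1 / 1000.0 - 0.7])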
3.7 Subjects
A control group (C) of 10 laryngeal speakers formed the training set for the neural networks, while experimental groups of 10 esophageal speakers (E) and 10 pseudo-whisper patients (P) served as test sets. The system was evaluated in an experimental study of 34 subjects articulating 6 isolated Polish vowels: ’a’, ’i’, ’e’, ’y’, ’o’, ’u’ (see the Appendix for the related IPA symbols). The control group formed the training set in both cases: the NN for visual descriptors (NN-v) and the NN for acoustical parameters (NN-a). The experimental groups of 10 esophageal speakers (Ev) and 10 pseudo-whisper patients (Pv) formed the test sets for the visual analyses. The mean age of group Ev was 64 years and the group consisted of men only.

For the acoustical analyses the experimental group consisted of 10 esophageal speakers (Ea). The mean age of group Ea was also 64, but it differed from group Ev by 4 subjects, including one woman. The pseudo-whisper group was not taken into consideration, since the extraction of formants in this group failed, as discussed in [6].
4. Results
The neural network parameters were reported in [5]. From the audio data we achieved a vowel recognition rate of 98 percent in the C group and 66 percent in the Ea group. The recognition rate for the visual data was 94 percent in the C group, 76 percent in the Pv group and 76 percent in the Ev group.
5. Discussion
In its present state the algorithm is sensitive to variable face conditions (beard, glasses and long hair). The video data measurements were supported by an operator. No algorithm for extraction of the jaw opening length has been implemented in our system yet. Every visual parameter result was compared with a subjective, manual measurement made on a picture grabbed from the related video frame. Our methodology demonstrated an improvement of vowel recognition, especially for pseudo-whisper speakers, compared with formant analysis [6]. We expect that further improvement of recognition will be achieved in future work by extracting and evaluating hybrid audio-video parameters using the presented system.
6. Conclusion
The proposed hybrid framework of joint audio and video feature extraction is expected to achieve higher recognition accuracies, and its integration should yield a significant improvement of speech analysis. The system is easy to extend. In the future we are going to analyze real-time images from a video camera.
7. Appendix
7.1 IPA symbols for Polish vowels
The Polish vowel transcriptions are given in figure 2.
Figure 2. IPA symbols of Polish vowels.
8. Acknowledgement
The work described in this paper is funded by the Polish Ministry of Science and Higher Education under grant number N N518 0929 33.
REFERENCES
1 T. Cervera, J. L. Miralles, and J. González A. Acoustical analysis of Spanish vowels produced by laryngectomized subjects. Journal of Speech, Language, and Hearing Research, 44:988–996, 2001.
2 J. M. Christensen and B. Weinberg. Vowel duration characteristics of esophageal speech. Journal of Speech and Hearing Research, 19:678–689, 1976.
3 S. Haykin. Adaptive filter theory. Prentice Hall, Inc., Upper Saddle River, 1991.
4 R. Pietruch, M. Michalska, W. Konopka, and A. Grzanka. An analysis of face expression images for evaluation of laryngectomees’ voice quality. In Mirjana Sovilj and Dimitris Skanavis, editors, First European Congress on Prevention, Detection and Diagnostics of Verbal Communication Disorders, CD-ROM, Patras, Greece, December 15–17, 2006.
5 R. Pietruch, M. Michalska, W. Konopka, and A. Grzanka. Evaluation of laryngectomees’ voice quality using correlations with facial expression. In Seiji Niimi, editor, Proceedings of the 5th International Conference on Voice Physiology and Biomechanics, pages 96–99, Tokyo, Japan, July 12–14, 2006.
6 R. Pietruch, M. Michalska, W. Konopka, and A. Grzanka. Methods for formant extraction in speech
of patients after total laryngectomy. Biomedical Signal Processing and Control, 1/2:107–112, 2006.
7 S. Saito. Speech Science and Technology. Ohmsha, Ltd., Tokyo, 1992.
8 M. Sisty and B. Weinberg. Formant frequency characteristics of esophageal speech. Journal of
Speech and Hearing Research, 15:439–448, 1972.