Pietruch R., Grzanka A., Konopka W.: Vowels recognition

16th International Congress on Sound and Vibration, 5–9 July 2009, Kraków, Poland

was made using analysis of acoustical and facial expression parameters.

3. Methods

3.1 Computer system

The presented system runs on Windows XP. The program is based on DirectShow filter technology. It extracts speech parameters from digital movie files (mostly in MPEG format), visualizes them, and provides methods for archiving them. We implemented an automatic face detection algorithm for frontal-view image sequences. The system automatically tracks facial elements in real time, extracts visual parameters including the lip shape, compares them with acoustical descriptors, and updates the vocal tract model parameters. Neural network simulation and training were carried out on Linux (Ubuntu 8.04 distribution) using GNU Octave version 3.0.0 with the nnet-0.1.9 package.

3.2 Acoustical model

In this work we used a model of the vocal tract divided into 10 sections of resonance cavities. We assumed that the palate is closed, so the waves do not pass through the nasal cavity. We also assumed that there is no energy loss, so only the reflection effect is taken into consideration. Under these assumptions the vocal tract model is equivalent to a lattice filter whose PARCOR parameters are equal to the negated reflection coefficients [7]. The reversed lattice filter can then be transformed into a transversal filter [3], whose parameters can be estimated from the speech signal using the linear prediction method. An adaptive recursive least squares algorithm was used to estimate the LPC coefficients of the transversal filter [3]. A reverse transformation derives the cross-sectional diameters of the vocal tract sections from given LPC parameters. The transversal filter coefficients were used to track the formant frequencies of six Polish vowels [6]. The first two formants, F1 and F2, were chosen as the acoustical descriptors in the audio recognition system.
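The relation between the lattice (reflection) parameters, the transversal (LPC) coefficients and the section sizes described in Section 3.2 can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes the prediction-error filter convention A(z) = 1 + a_1 z^-1 + ... + a_p z^-p, the standard step-up and step-down recursions, and the area ratio A_{m+1}/A_m = (1 - k_m)/(1 + k_m); the sign convention, the function names and the `first_area` parameter are all assumptions made for illustration.

```python
import math


def reflection_to_lpc(k):
    """Step-up recursion: reflection coefficients -> [1, a_1, ..., a_p]."""
    a = [1.0]
    for km in k:
        m = len(a)                       # filter order after this step
        prev = a + [0.0]
        a = [prev[i] + km * prev[m - i] for i in range(m + 1)]
    return a


def lpc_to_reflection(a):
    """Step-down recursion: the inverse of reflection_to_lpc."""
    a = [c / a[0] for c in a]            # normalise so that a[0] == 1
    k = []
    for m in range(len(a) - 1, 0, -1):
        km = a[m]                        # last coefficient of the order-m filter
        if abs(km) >= 1.0:
            raise ValueError("unstable filter: |k| >= 1")
        k.append(km)
        a = [(a[i] - km * a[m - i]) / (1.0 - km * km) for i in range(m)]
    return k[::-1]


def section_diameters(k, first_area=1.0):
    """Diameters of circular tube sections (assumed sign convention)."""
    areas = [first_area]
    for km in k:
        areas.append(areas[-1] * (1.0 - km) / (1.0 + km))
    return [2.0 * math.sqrt(area / math.pi) for area in areas]
```

With p reflection coefficients this sketch yields p + 1 section diameters, and only their ratios are meaningful because `first_area` is arbitrary.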
3.3 Formants tracking

In the system with vowel recognition functionality, a formant tracking algorithm was implemented. The Christensen algorithm was used to find formant candidates from the spectrum [2]: all minima of the second derivative of the spectrum are chosen. For the 4 kHz bandwidth assumed in our study, we can find up to 5 candidates $f^K = \{f^K_1, f^K_2, \dots, f^K_{n_K}\}$ and we need to assign 4 formant numbers $F = \{F_1, F_2, \dots, F_{n_F}\}$. Three situations can arise:

1. $n_K = n_F$. The candidates are assigned in order to the formant numbers by a function $f_{K \to F}$ according to equation 1:

\[ f_{K \to F} : f^K_j \to F_i \iff i = j \tag{1} \]

where $i, j \in \{1, 2, \dots, n_F\}$.

2. $n_K < n_F$. In this situation $n_F - n_K$ formants will be rejected. Function 2 specifies which formants are left unassigned. The function can be given by a matrix $B_F$ in which each row represents a candidate and each column represents a formant (equation 3). This transformation is found using the best-fit, minimum-cost algorithm (see equation 7).

\[ f_{K \to F} : f^K_j \to F_i \tag{2} \]

\[ B_F(j, i) = 0 \iff f_{K \to F}(j) = i \tag{3} \]
3. $n_K > n_F$. In this situation $n_K - n_F$ candidates will stay unassigned. In our setting there can be one extra candidate. We are looking for the function 4 that assigns formant numbers to a limited set of candidates according to the matrix $B_F^T$, whose columns represent the candidates and whose rows correspond to the formant numbers. The matrix elements are found according to equation 5.

\[ f_{F \to K} : F_i \to f^K_j \tag{4} \]

\[ B_F^T(i, j) = 0 \iff f_{F \to K}(i) = j \tag{5} \]

The best-fit, minimum-cost algorithm that finds this transformation takes into account the candidate frequencies $f^K$ and the mean frequencies of the previously assigned formants, $\bar{F} = \{\bar{F}_1, \bar{F}_2, \dots, \bar{F}_{n_F}\}$, where for the $n$-th sample:

\[ \bar{F}_i(n) = \frac{1}{M} \sum_{m=1}^{M} F_i(n - m) \]

In our program we assume $M = 9$. The algorithm minimizes the sum of the costs of changing the previous formant frequencies ($\bar{F}_i$) into the new ones ($f^K_j$), according to equation 6:

\[ \max_{B_F} \sum_{i=1}^{n_F} \sum_{j=1}^{n_K} B_F(i, j) \cdot C_F(i, j) \tag{6} \]

where $C_F(i, j) = |\bar{F}_i - f^K_j|$.

\[ \forall j_1, j_2 \in \{1, 2, \dots, n_K\}\ \forall i_1, i_2 \in \{1, 2, \dots, n_F\} \text{ for which } f_{K \to F}(j_1) = i_1 \wedge f_{K \to F}(j_2) = i_2 : \quad f^K_{j_1} < f^K_{j_2} \iff i_1 < i_2 \tag{7} \]

\[ \forall j_1, j_2 \in \{1, 2, \dots, n_K\}\ \forall i_1, i_2 \in \{1, 2, \dots, n_F\} \text{ for which } f_{F \to K}(i_1) = j_1 \wedge f_{F \to K}(i_2) = j_2 : \quad i_1 < i_2 \iff f^K_{j_1} < f^K_{j_2} \tag{8} \]

It is natural to assume that formant candidates should be assigned to formants with respect to their order (equations 7, 8). We then need to check all $\min\{n_F, n_K\}$-element subsets of a $\max\{n_F, n_K\}$-element set.

3.4 Video methods

Within our system, novel computer vision techniques were used to automatically segment the eye, mouth and nose regions [5]. The following parameters are tracked within the system: the distance between the eyes L0, the lips height LH and width LW, and the distance LJ between the line joining the eyes and the bottom of the jaw. The mouth opening area is estimated from the lips shape. The jaw angle and the area of the mouth opening were chosen as the facial expression descriptors.
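Returning to the formant assignment of Section 3.3, the order-preserving, minimum-cost search over subsets can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function name `assign_formants` is hypothetical, the cost is the sum of |mean formant - candidate| over the assigned pairs as described in the text, and the matrix B_F is not built explicitly. Unassigned formants are returned as `None`.

```python
from itertools import combinations


def assign_formants(candidates, mean_formants):
    """Return one frequency (or None) per formant number, keeping order."""
    cands = sorted(candidates)
    n_k, n_f = len(cands), len(mean_formants)
    best_cost, best = None, None

    if n_k <= n_f:
        # Choose which formant slots receive the n_k candidates.
        # combinations() yields increasing indices, so order is preserved.
        for slots in combinations(range(n_f), n_k):
            cost = sum(abs(mean_formants[i] - cands[j])
                       for j, i in enumerate(slots))
            if best_cost is None or cost < best_cost:
                best_cost = cost
                best = [None] * n_f
                for j, i in enumerate(slots):
                    best[i] = cands[j]
    else:
        # Choose which n_f candidates are kept; the others stay unassigned.
        for chosen in combinations(range(n_k), n_f):
            cost = sum(abs(mean_formants[i] - cands[j])
                       for i, j in enumerate(chosen))
            if best_cost is None or cost < best_cost:
                best_cost = cost
                best = [cands[j] for j in chosen]
    return best


# Hypothetical mean formant frequencies (Hz) and five candidates.
means = [500.0, 1500.0, 2500.0, 3500.0]
print(assign_formants([520.0, 1450.0, 2000.0, 2600.0, 3400.0], means))
```

For at most 5 candidates and 4 formants the exhaustive search visits only a handful of subsets, so no more elaborate assignment algorithm is needed in this sketch.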
3.5 Face elements tracking

The facial element tracking algorithms start with a color space conversion from RGB to HSV. Then, for each pixel (with respect to assumed margins), the values of 5 features are computed. The 5 regions are binarized according to adaptive thresholds, which are updated according to the region sizes. The regions identify red-colored pixels (RED), dark pixels (DARK), moving objects (MOV), vertical differences (VDIFF) and horizontal differences (HDIFF).
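The conversion, binarization and threshold update steps can be illustrated for a single region. The sketch below is only a guess at one possible form: the text does not specify the feature definitions or the update rule, so the `red_score` feature, the `target_fraction` parameter and the fixed `step` are all hypothetical choices made for illustration.

```python
import colorsys


def red_score(r, g, b):
    """Hypothetical RED feature in [0, 1] computed from 8-bit RGB via HSV."""
    h, s, _v = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
    hue_dist = min(h, 1.0 - h)           # circular distance to red (hue 0)
    return (1.0 - 2.0 * hue_dist) * s


def binarize(image, threshold):
    """Mark pixels whose RED score exceeds the threshold."""
    return [[1 if red_score(*px) > threshold else 0 for px in row]
            for row in image]


def update_threshold(threshold, mask, target_fraction, step=0.02):
    """Nudge the threshold so the region size approaches the target size."""
    total = sum(len(row) for row in mask)
    fraction = sum(sum(row) for row in mask) / total
    if fraction > target_fraction:
        threshold += step                # region too large: be stricter
    elif fraction < target_fraction:
        threshold -= step                # region too small: be more lenient
    return min(1.0, max(0.0, threshold))


# Tiny 2x2 test image with hypothetical pixel values.
image = [[(255, 0, 0), (0, 0, 255)],
         [(128, 128, 128), (200, 30, 30)]]
mask = binarize(image, 0.5)
new_threshold = update_threshold(0.5, mask, target_fraction=0.25)
```

In a frame-by-frame loop the updated threshold would be used for the next frame, which is one way to keep the region size stable under changing lighting.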