
Proceedings, FONETIK 2009, Dept. of Linguistics, Stockholm University

… about the difficulties in the mapping for different phonemes.

Clustering the acoustic and articulatory spaces separately, as was done in previous studies by Qin and Carreira-Perpiñán (2007) as well as Neiberg et al. (2008), leads to hard boundaries in the clusters. The cluster labels for instances near these boundaries may be estimated incorrectly, which may cause an overestimation of the uncertainty. This situation is illustrated in Fig. 1 using synthetic data, where we can see both the distributions of the synthetic data and the Maximum A-posteriori Probability (MAP) estimates for the clusters. We can see that, because of the incorrect clustering, it appears as if data belonging to one cluster in mode A belongs to more than one cluster in mode B.

In order to mitigate this problem, we have suggested a method of cross-modal clustering in which both of the available modalities are used, by allowing soft boundaries for the clusters in each modality. Cross-modal clustering has been dealt with in detail in several contexts of combining multi-modal data. Coen (2005) proposed a self-supervised method in which acoustic and visual features were used to learn perceptual structures based on temporal correlations between the two modalities, using the concept of slices, which are topological manifolds encoding dynamic states. Similarly, Belolli et al. (2007) proposed a clustering algorithm using Support Vector Machines (SVMs) for clustering inter-related text datasets.

The method proposed in this paper does not make use of correlations, but mainly uses co-clustering properties between the two modalities in order to perform the cross-modal clustering. Thus, even non-linear (uncorrelated) dependencies may be modeled using this simple method.

Theory

We assume that the data follows a Gaussian Mixture Model (GMM). The acoustic space Y = {y_1, y_2, …, y_N} with N data points is modelled using I Gaussians, {λ_1, λ_2, …, λ_I}, and the articulatory space X = {x_1, x_2, …, x_N} is modelled using K Gaussians, {γ_1, γ_2, …, γ_K}. I and K are obtained by minimizing the Bayesian Information Criterion (BIC). If we know which articulatory Gaussian a particular data point belongs to, say γ_k, the correct acoustic Gaussian λ_n for the n-th data point, having acoustic features y_n and articulatory features x_n, is given by the maximum cross-modal a-posteriori probability

\lambda_n = \arg\max_{1 \le i \le I} P(\lambda_i \mid x_n, y_n, \gamma_k)
          = \arg\max_{1 \le i \le I} p(x_n, y_n \mid \lambda_i, \gamma_k) \, P(\lambda_i \mid \gamma_k) \, P(\gamma_k)    (1)

Figure 1. The figures above show a synthesized example of data in two modalities. The figures below show how MAP hard clustering may bring about an effect of uncertainty in the correspondence between clusters in the two modalities.

The knowledge about the articulatory cluster can then be used to improve the estimate of the correct acoustic cluster, and vice versa, as shown below:

\gamma_n = \arg\max_{1 \le k \le K} p(x_n, y_n \mid \lambda_i, \gamma_k) \, P(\gamma_k \mid \lambda_i) \, P(\lambda_i)    (2)

where P(λ | γ) is the cross-modal prior and p(x, y | λ, γ) is the joint cross-modal distribution. If the first estimates of the correct clusters are MAP, then the estimates of the correct clusters of the speech segments are improved.
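To make the estimates in equations (1) and (2) concrete, the following is a minimal sketch in Python; the paper does not provide an implementation, so everything here is illustrative. In particular, the factorization of the joint cross-modal density p(x, y | λ_i, γ_k) into per-modality terms p(y | λ_i) p(x | γ_k) is a simplifying assumption of this sketch, and the names cross_modal_map_acoustic, acoustic_gmm, articulatory_gmm and cross_prior are hypothetical.

import numpy as np
from scipy.stats import multivariate_normal

def cross_modal_map_acoustic(y_n, x_n, k, acoustic_gmm, articulatory_gmm, cross_prior):
    """Eq. (1): pick the acoustic Gaussian lambda_i maximizing
    p(x_n, y_n | lambda_i, gamma_k) * P(lambda_i | gamma_k) * P(gamma_k),
    given that the articulatory Gaussian is gamma_k.

    acoustic_gmm / articulatory_gmm: lists of (weight, mean, cov) triples.
    cross_prior: (I, K) matrix with cross_prior[i, k] = P(lambda_i | gamma_k).
    """
    w_k, mu_k, cov_k = articulatory_gmm[k]
    # Simplifying assumption (not stated in the paper): the joint density
    # factorizes as p(x, y | lambda_i, gamma_k) ~ p(y | lambda_i) * p(x | gamma_k).
    log_px = multivariate_normal.logpdf(x_n, mean=mu_k, cov=cov_k)
    scores = np.empty(len(acoustic_gmm))
    for i, (_, mu_i, cov_i) in enumerate(acoustic_gmm):
        log_py = multivariate_normal.logpdf(y_n, mean=mu_i, cov=cov_i)
        # log_px and log(w_k) are constant in i and do not change the argmax;
        # they are kept to mirror equation (1) term by term.
        scores[i] = log_py + log_px + np.log(cross_prior[i, k]) + np.log(w_k)
    return int(np.argmax(scores))

The symmetric estimate of equation (2) follows by swapping the roles of the two modalities, scoring each γ_k with P(γ_k | λ_i) and the acoustic component weight. The cross-modal prior itself could, for instance, be estimated from co-occurrence counts of the initial hard assignments; the paper does not specify this estimator.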
Figure 2. The figures show improved performance and soft boundaries for the synthetic data using cross-modal clustering; here the effect of uncertainty in the correspondences is reduced. (Panels show Mode A and Mode B.)
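As stated in the Theory section, the mixture orders I and K are obtained by minimizing the BIC. Below is a minimal sketch of that model-selection step; the use of scikit-learn's GaussianMixture and the helper name fit_gmm_by_bic are assumptions, as the paper does not name an implementation.

import numpy as np
from sklearn.mixture import GaussianMixture

def fit_gmm_by_bic(data, max_components=20, seed=0):
    """Fit GMMs with 1..max_components components and keep the one
    with the lowest BIC."""
    best_gmm, best_bic = None, np.inf
    for n in range(1, max_components + 1):
        gmm = GaussianMixture(n_components=n, covariance_type='full',
                              random_state=seed).fit(data)
        bic = gmm.bic(data)
        if bic < best_bic:
            best_gmm, best_bic = gmm, bic
    return best_gmm

# Usage (Y: N x d_y acoustic features, X: N x d_x articulatory features):
# acoustic_gmm = fit_gmm_by_bic(Y)        # its n_components plays the role of I
# articulatory_gmm = fit_gmm_by_bic(X)    # its n_components plays the role of K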
