KCCA, Kernel K-Means - LASA


1 – MACHINE LEARNING – 2012
Kernel CCA, Kernel K-means, Spectral Clustering


2 – MACHINE LEARNING – 2012
Change in timetable: we have a practical session next week!


4 – MACHINE LEARNING – 2012
Canonical Correlation Analysis (CCA)

Given a pair of descriptions of the same data, $x \in \mathbb{R}^N$ and $y \in \mathbb{R}^P$ (e.g. a video description and an audio description), CCA solves:

$$\max_{w_x, w_y} \operatorname{corr}\left(w_x^T x,\; w_y^T y\right)$$

Determine features in two (or more) separate descriptions of the dataset that best explain each datapoint.
Extract the hidden structure that maximizes the correlation across two different projections.


5 – MACHINE LEARNING – 2012
Canonical Correlation Analysis (CCA)

Pair of multidimensional zero-mean variables; we have M instances of the pairs:
$X = \{x^i\}_{i=1}^{M},\; x^i \in \mathbb{R}^N$ and $Y = \{y^i\}_{i=1}^{M},\; y^i \in \mathbb{R}^q$.

Search for two projections $w_x$ and $w_y$, $X' = w_x^T X$ and $Y' = w_y^T Y$, solutions of:

$$\max_{w_x, w_y} \operatorname{corr}\left(X', Y'\right) = \max_{w_x, w_y} \operatorname{corr}\left(w_x^T X,\; w_y^T Y\right)$$


6 – MACHINE LEARNING – 2012
Canonical Correlation Analysis (CCA)

$$\max_{w_x, w_y} \operatorname{corr}\left(X', Y'\right) = \max_{w_x, w_y} \frac{w_x^T C_{xy} w_y}{\sqrt{w_x^T C_{xx} w_x}\,\sqrt{w_y^T C_{yy} w_y}}$$

Cross-covariance matrix: $C_{xy} = E\left[XY^T\right]$ is $N \times q$; it measures the cross-correlation between X and Y.
Covariance matrices: $C_{xx} = E\left[XX^T\right]$ is $N \times N$; $C_{yy} = E\left[YY^T\right]$ is $q \times q$.


7 – MACHINE LEARNING – 2012
Canonical Correlation Analysis (CCA)

$$\max_{w_x, w_y} \operatorname{corr}\left(X', Y'\right) = \max_{w_x, w_y} \frac{w_x^T C_{xy} w_y}{\sqrt{w_x^T C_{xx} w_x}\,\sqrt{w_y^T C_{yy} w_y}}$$

The correlation is not affected by rescaling the norm of the vectors, so we can require $w_x^T C_{xx} w_x = w_y^T C_{yy} w_y = 1$ and solve:

$$\max_{w_x, w_y} w_x^T C_{xy} w_y \quad \text{u.c.} \quad w_x^T C_{xx} w_x = w_y^T C_{yy} w_y = 1$$


8 – MACHINE LEARNING – 2012
Canonical Correlation Analysis (CCA)

To determine the optimum (maximum), solve by Lagrange:

$$\max_{w_x, w_y} \; w_x^T C_{xy} w_y - \frac{\lambda_1}{2}\left(w_x^T C_{xx} w_x - 1\right) - \frac{\lambda_2}{2}\left(w_y^T C_{yy} w_y - 1\right)
\qquad \text{u.c.} \quad w_x^T C_{xx} w_x = w_y^T C_{yy} w_y = 1$$

Setting the derivatives to zero yields $C_{xy} C_{yy}^{-1} C_{yx} w_x = \lambda^2 C_{xx} w_x$: a generalized eigenvalue problem, which can be reduced to a classical eigenvalue problem if $C_{xx}$ is invertible.
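For illustration, here is a minimal NumPy/SciPy sketch of linear CCA solved through this generalized eigenvalue problem. The function name `linear_cca`, the small ridge term `reg`, and the assumption that X and Y are already zero-mean are my choices for the sketch; this is not the course demo (demo_CCA-KCCA.m).

```python
import numpy as np
from scipy.linalg import eigh

def linear_cca(X, Y, reg=1e-6):
    """First pair of canonical directions (w_x, w_y) and the canonical correlation.

    X : (N, M) array, Y : (q, M) array, both assumed zero-mean across the M samples.
    reg is a small ridge added to Cxx and Cyy for numerical stability (an assumption).
    """
    M = X.shape[1]
    Cxx = X @ X.T / M + reg * np.eye(X.shape[0])
    Cyy = Y @ Y.T / M + reg * np.eye(Y.shape[0])
    Cxy = X @ Y.T / M

    # Generalized eigenvalue problem  Cxy Cyy^{-1} Cyx w_x = rho^2 Cxx w_x
    A = Cxy @ np.linalg.solve(Cyy, Cxy.T)
    rho2, Wx = eigh(A, Cxx)                 # eigenvalues in ascending order
    wx = Wx[:, -1]                          # direction with the largest rho^2
    wy = np.linalg.solve(Cyy, Cxy.T @ wx)   # w_y follows from the stationarity conditions
    wx /= np.sqrt(wx @ Cxx @ wx)            # enforce w_x^T Cxx w_x = 1
    wy /= np.sqrt(wy @ Cyy @ wy)            # enforce w_y^T Cyy w_y = 1
    return wx, wy, np.sqrt(max(rho2[-1], 0.0))
```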


12 – MACHINE LEARNING – 2012
Kernel CCA

$X = \{x^i\}_{i=1}^{M},\; x^i \in \mathbb{R}^N$ and $Y = \{y^i\}_{i=1}^{M},\; y^i \in \mathbb{R}^q$.

Project into a feature space: $\phi_x(x^i)$ and $\phi_y(y^i)$, with $\sum_{i=1}^{M}\phi_x(x^i)=0$ and $\sum_{i=1}^{M}\phi_y(y^i)=0$.

Construct the associated kernel matrices: $K_x = F_x^T F_x$, $K_y = F_y^T F_y$, where the columns of $F_x$, $F_y$ are the $\phi_x(x^i)$, $\phi_y(y^i)$.

The projection vectors can be expressed as a linear combination in feature space:
$w_x = F_x \alpha_x$ and $w_y = F_y \alpha_y$.


13 – MACHINE LEARNING – 2012
Kernel CCA

Kernel CCA then becomes an optimization problem over the coefficient vectors $\alpha_x, \alpha_y$:

$$\max_{\alpha_x, \alpha_y} \frac{\alpha_x^T K_x K_y \alpha_y}{\left(\alpha_x^T K_x^2 \alpha_x\right)^{1/2}\left(\alpha_y^T K_y^2 \alpha_y\right)^{1/2}} \quad \text{u.c.} \quad \alpha_x^T K_x^2 \alpha_x = \alpha_y^T K_y^2 \alpha_y = 1$$

In linear CCA, we were solving for:

$$\max_{w_x, w_y} \frac{w_x^T C_{xy} w_y}{\sqrt{w_x^T C_{xx} w_x}\,\sqrt{w_y^T C_{yy} w_y}} \quad \text{u.c.} \quad w_x^T C_{xx} w_x = w_y^T C_{yy} w_y = 1$$


15 – MACHINE LEARNING – 2012
Kernel CCA

Generalized eigenvalue problem:

$$\begin{pmatrix} 0 & K_x K_y \\ K_y K_x & 0 \end{pmatrix}\begin{pmatrix} \alpha_x \\ \alpha_y \end{pmatrix} = \lambda \begin{pmatrix} K_x^2 & 0 \\ 0 & K_y^2 \end{pmatrix}\begin{pmatrix} \alpha_x \\ \alpha_y \end{pmatrix}$$

Add a regularization term to increase the rank of the matrix and make it invertible:

$$K_x^2 \;\rightarrow\; \left(K_x + \frac{\epsilon M}{2} I\right)^2$$

Several methods have been proposed to choose the regularizing term carefully, so as to get projections that are as close as possible to the "true" projections.


16 – MACHINE LEARNING – 2012
Kernel CCA

$$\underbrace{\begin{pmatrix} 0 & K_x K_y \\ K_y K_x & 0 \end{pmatrix}}_{A}\begin{pmatrix} \alpha_x \\ \alpha_y \end{pmatrix} = \lambda \underbrace{\begin{pmatrix} \left(K_x + \frac{\epsilon M}{2} I\right)^2 & 0 \\ 0 & \left(K_y + \frac{\epsilon M}{2} I\right)^2 \end{pmatrix}}_{B}\begin{pmatrix} \alpha_x \\ \alpha_y \end{pmatrix}$$

Set $B = C C^T$; the problem then becomes a classical eigenvalue problem, $C^{-1} A C^{-T} \beta = \lambda \beta$ with $\beta = C^T \alpha$.
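A NumPy/SciPy sketch of this regularized kernel CCA problem, solved directly in block form (scipy.linalg.eigh performs the reduction to a classical eigenvalue problem internally). The function name and the default value of `eps` are assumptions, and the Gram matrices are assumed to be centered as on slide 12; this is a sketch, not the course implementation.

```python
import numpy as np
from scipy.linalg import eigh

def kernel_cca(Kx, Ky, eps=0.1):
    """First pair of KCCA coefficient vectors (alpha_x, alpha_y).

    Kx, Ky : (M, M) centered Gram matrices of the two views.
    eps    : regularization constant (the epsilon of the slides).
    """
    M = Kx.shape[0]
    ridge = 0.5 * eps * M * np.eye(M)
    Rx = (Kx + ridge) @ (Kx + ridge)     # (Kx + eps*M/2 I)^2
    Ry = (Ky + ridge) @ (Ky + ridge)

    # Generalized eigenvalue problem  A alpha = lambda B alpha  (block form of the slides)
    A = np.block([[np.zeros((M, M)), Kx @ Ky],
                  [Ky @ Kx,          np.zeros((M, M))]])
    B = np.block([[Rx,               np.zeros((M, M))],
                  [np.zeros((M, M)), Ry]])
    lams, V = eigh(A, B)                 # eigenvalues in ascending order
    alpha = V[:, -1]                     # largest canonical correlation
    return alpha[:M], alpha[M:], lams[-1]
```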


17 – MACHINE LEARNING – 2012
Kernel CCA

Two-dataset case: $X = \{x^i\}_{i=1}^{M},\; x^i \in \mathbb{R}^N$ and $Y = \{y^i\}_{i=1}^{M},\; y^i \in \mathbb{R}^q$.

Can be extended to multiple datasets:
L datasets $X_1, \ldots, X_L$ with M observations each and dimensions $N_1, \ldots, N_L$, i.e. $X_i : N_i \times M$.
Applying a non-linear transformation $\phi_i$ to $X_1, \ldots, X_L$, construct L Gram matrices $K_1, \ldots, K_L$.


18 – MACHINE LEARNING – 2012
Kernel CCA

The two-dataset case was formulated as a generalized eigenvalue problem:

$$\underbrace{\begin{pmatrix} 0 & K_x K_y \\ K_y K_x & 0 \end{pmatrix}}_{A}\begin{pmatrix} \alpha_x \\ \alpha_y \end{pmatrix} = \lambda \underbrace{\begin{pmatrix} \left(K_x + \frac{\epsilon M}{2} I\right)^2 & 0 \\ 0 & \left(K_y + \frac{\epsilon M}{2} I\right)^2 \end{pmatrix}}_{B}\begin{pmatrix} \alpha_x \\ \alpha_y \end{pmatrix}$$

It can be extended to multiple datasets:

$$\begin{pmatrix} 0 & K_1 K_2 & \cdots & K_1 K_L \\ K_2 K_1 & 0 & \cdots & K_2 K_L \\ \vdots & & \ddots & \vdots \\ K_L K_1 & K_L K_2 & \cdots & 0 \end{pmatrix}\begin{pmatrix} \alpha_1 \\ \alpha_2 \\ \vdots \\ \alpha_L \end{pmatrix}
= \lambda \begin{pmatrix} \left(K_1 + \frac{\epsilon_1 M}{2} I\right)^2 & & \\ & \ddots & \\ & & \left(K_L + \frac{\epsilon_L M}{2} I\right)^2 \end{pmatrix}\begin{pmatrix} \alpha_1 \\ \alpha_2 \\ \vdots \\ \alpha_L \end{pmatrix}$$


19 – MACHINE LEARNING – 2012
Kernel CCA

M. Kuss and T. Graepel, "The Geometry of Kernel Canonical Correlation Analysis", Technical Report, Max Planck Institute, 2003.


20 – MACHINE LEARNING – 2012
Kernel CCA

Matlab example file: demo_CCA-KCCA.m


21 – MACHINE LEARNING – 2012
Applications of Kernel CCA

Goal: to measure the correlation between heterogeneous datasets and to extract sets of genes which share similarities with respect to multiple biological attributes.

Kernel matrices K1, K2 and K3 correspond to gene-gene similarities in pathways, genome position, and microarray expression data respectively. An RBF kernel with fixed kernel width is used.

[Figure: correlation scores in multiple-kernel KCCA (MKCCA): pathway vs. genome vs. expression.]

Y. Yamanishi, J.-P. Vert, A. Nakaya and M. Kanehisa, "Extraction of correlated gene clusters from multiple genomic data by generalized kernel canonical correlation analysis", Bioinformatics, 2003.


22 – MACHINE LEARNING – 2012
Applications of Kernel CCA

Goal: to measure the correlation between heterogeneous datasets and to extract sets of genes which share similarities with respect to multiple biological attributes.

[Figure: correlation scores in MKCCA, pathway vs. genome vs. expression; the plots give the pairwise correlation between K1 and K2 and between K1 and K3.]

Two clusters correspond to genes close to each other with respect to their positions in the pathways, in the genome, and to their expression. A readout of the entries with equal projection onto the first canonical vectors gives the genes which belong to each cluster.

Y. Yamanishi, J.-P. Vert, A. Nakaya and M. Kanehisa, "Extraction of correlated gene clusters from multiple genomic data by generalized kernel canonical correlation analysis", Bioinformatics, 2003.


23 – MACHINE LEARNING – 2012
Applications of Kernel CCA

Goal: to construct appearance models for estimating an object's pose from raw brightness images.
X: set of images.
Y: pose parameters (pan and tilt angle of the object w.r.t. the camera, in degrees).

[Figure: example of two image datapoints with different poses.]

A linear kernel is used on X and an RBF kernel on Y, and the performance is compared to applying PCA on the (X, Y) dataset directly.

T. Melzer, M. Reiter and H. Bischof, "Appearance models based on kernel canonical correlation analysis", Pattern Recognition 36 (2003), pp. 1961–1971.


24 – MACHINE LEARNING – 2012
Applications of Kernel CCA

Goal: to construct appearance models for estimating an object's pose from raw brightness images.

[Figure: pose estimation error as a function of the testing/training ratio.]

Kernel-CCA performs better than PCA, especially for small k = testing/training ratio (i.e., for larger training sets). The kernel-CCA estimators tend to produce fewer outliers (gross errors) and consequently yield a smaller standard deviation of the pose estimation error than their PCA-based counterparts. For very small training sets, the performance of both approaches becomes similar.

T. Melzer, M. Reiter and H. Bischof, "Appearance models based on kernel canonical correlation analysis", Pattern Recognition 36 (2003), pp. 1961–1971.


26 – MACHINE LEARNING – 2012
Kernel K-means
Spectral Clustering


27 – MACHINE LEARNING – 2012
Structure Discovery: Clustering

Clustering groups pairs of points according to how similar they are.
Density-based clustering methods (soft K-means, kernel K-means, Gaussian Mixture Models) compare the relative distributions.


28 – MACHINE LEARNING – 2012
K-means

K-means is a hard partitioning of the space through K clusters, equidistant according to a norm-2 measure.
The distribution of the data within each cluster is encapsulated in a sphere.

[Figure: three spherical clusters with centroids m1, m2, m3.]


29 – MACHINE LEARNING – 2012
K-means Algorithm

1. Initialization: pick K centroids $m_k$, $k = 1, \ldots, K$.


30 – MACHINE LEARNING – 2012
K-means Algorithm

Iterative method (variant on Expectation-Maximization)
2. Calculate the distance from each data point to each centroid: $d\left(x^j, m_k\right) = \left\|x^j - m_k\right\|_p$.
3. Assignment step (E-step): assign each data point to its "closest" centroid, $k^j = \arg\min_k d\left(x^j, m_k\right)$. If a tie happens (i.e. two centroids are equidistant to a data point), one assigns the data point to the smallest winning centroid.


31 – MACHINE LEARNING – 2012
K-means Algorithm

Iterative method (variant on Expectation-Maximization)
2. Calculate the distance from each data point to each centroid: $d\left(x^j, m_k\right) = \left\|x^j - m_k\right\|_p$.
3. Assignment step (E-step): assign each data point to its "closest" centroid, $k^j = \arg\min_k d\left(x^j, m_k\right)$. If a tie happens (i.e. two centroids are equidistant to a data point), one assigns the data point to the smallest winning centroid.
4. Update step (M-step): adjust the centroids to be the means of all data points assigned to them.
5. Go back to step 2 and repeat the process until the clusters are stable.
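A minimal NumPy sketch of these steps with the norm-2 metric. The function name, the random initialization from K distinct datapoints and the iteration cap are illustrative assumptions, not the MLDemos implementation.

```python
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    """Plain K-means with the norm-2 metric, following steps 1-5 of the slides.

    X : (M, N) array of datapoints. Returns labels (M,) and centroids (K, N).
    """
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=K, replace=False)].astype(float)  # step 1
    labels = np.full(len(X), -1)
    for it in range(n_iter):
        # steps 2-3: distance of every point to every centroid, assign to the closest
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = d.argmin(axis=1)          # argmin breaks ties by the smallest index
        if np.array_equal(new_labels, labels):
            break                              # step 5: clusters are stable
        labels = new_labels
        # step 4 (M-step): centroids become the mean of their assigned points
        for k in range(K):
            if np.any(labels == k):
                centroids[k] = X[labels == k].mean(axis=0)
    return labels, centroids
```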


32 – MACHINE LEARNING – 2012
K-means Clustering: Weaknesses

Two hyperparameters: the number of clusters K and the power p of the metric.
Very sensitive to the choice of the number of clusters K and to the initialization.


33 – MACHINE LEARNING – 2012
K-means Clustering: Hyperparameters

Two hyperparameters: the number of clusters K and the power p of the metric.
The choice of the power determines the form of the decision boundaries.

[Figure: decision boundaries for p = 1, 2, 3, 4.]


34 – MACHINE LEARNING – 2012
Kernel K-means

The K-means algorithm consists of the minimization of:

$$J\left(m_1, \ldots, m_K\right) = \sum_{k=1}^{K} \sum_{x^j \in C_k} \left\|x^j - m_k\right\|^p \quad \text{with} \quad m_k = \frac{1}{|C_k|}\sum_{x^i \in C_k} x^i$$

$|C_k|$: number of datapoints in cluster $C_k$.

Project into a feature space:

$$J\left(m_1, \ldots, m_K\right) = \sum_{k=1}^{K} \sum_{x^j \in C_k} \left\|\phi\left(x^j\right) - \phi\left(m_k\right)\right\|^2$$

We cannot observe the mean in feature space ⇒ construct the mean in feature space using the images of the points in the same cluster.


35 – MACHINE LEARNING – 2012
Kernel K-means

$$J\left(m_1, \ldots, m_K\right) = \sum_{k=1}^{K} \sum_{x^j \in C_k} \left\|\phi\left(x^j\right) - \mu_k\right\|^2, \quad \mu_k = \frac{1}{|C_k|}\sum_{x^i \in C_k} \phi\left(x^i\right)$$

Expanding the squared distance and replacing inner products in feature space by the kernel:

$$\left\|\phi\left(x^j\right) - \mu_k\right\|^2 = k\left(x^j, x^j\right) - \frac{2}{|C_k|}\sum_{x^i \in C_k} k\left(x^j, x^i\right) + \frac{1}{|C_k|^2}\sum_{x^i, x^l \in C_k} k\left(x^i, x^l\right)$$


36 – MACHINE LEARNING – 2012
Kernel K-means

The kernel K-means algorithm is also an iterative procedure:
1. Initialization: pick K clusters.
2. Assignment step (E-step): assign each data point to its "closest" centroid by computing the distance in feature space. If a tie happens (i.e. two centroids are equidistant to a data point), one assigns the data point to the smallest winning centroid.

$$\min_k d\left(x^i, C_k\right) = \min_k \left( k\left(x^i, x^i\right) - \frac{2}{|C_k|}\sum_{x^j \in C_k} k\left(x^i, x^j\right) + \frac{1}{|C_k|^2}\sum_{x^j, x^l \in C_k} k\left(x^j, x^l\right) \right)$$

3. Update step (M-step): update the list of points belonging to each centroid.
4. Go back to step 2 and repeat the process until the clusters are stable.
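A minimal NumPy sketch of this procedure, operating directly on a precomputed Gram matrix so that the feature-space means never need to be formed explicitly. The function name, random label initialization and iteration cap are assumptions for the sketch, not the course implementation.

```python
import numpy as np

def kernel_kmeans(K, n_clusters, n_iter=100, seed=0):
    """Kernel K-means on a precomputed Gram matrix K (M x M), as on the slides.

    Works entirely with kernel evaluations; returns a label vector of length M.
    """
    rng = np.random.default_rng(seed)
    M = K.shape[0]
    labels = rng.integers(n_clusters, size=M)        # step 1: random initial clusters
    for _ in range(n_iter):
        dist = np.zeros((M, n_clusters))
        for k in range(n_clusters):
            members = np.flatnonzero(labels == k)
            if len(members) == 0:
                dist[:, k] = np.inf                   # empty cluster: never chosen
                continue
            # ||phi(x_i) - mu_k||^2 = K_ii - 2/|C_k| sum_j K_ij + 1/|C_k|^2 sum_{j,l} K_jl
            second = 2.0 * K[:, members].sum(axis=1) / len(members)
            third = K[np.ix_(members, members)].sum() / len(members) ** 2
            dist[:, k] = np.diag(K) - second + third
        new_labels = dist.argmin(axis=1)              # step 2: assignment in feature space
        if np.array_equal(new_labels, labels):
            break                                     # step 4: stable clusters
        labels = new_labels                           # step 3: update cluster memberships
    return labels
```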


37 – MACHINE LEARNING – 2012
Kernel K-means

With an RBF kernel:

$$\min_k d\left(x^i, C_k\right) = \min_k \left( k\left(x^i, x^i\right) - \frac{2}{|C_k|}\sum_{x^j \in C_k} k\left(x^i, x^j\right) + \frac{1}{|C_k|^2}\sum_{x^j, x^l \in C_k} k\left(x^j, x^l\right) \right)$$

• The first term is a constant of value 1.
• Second term: if $x^i$ is close to all points in cluster k, the normalized sum is close to 1.
• Third term: if the points are well grouped in cluster k, this sum is close to 1.

With a homogeneous polynomial kernel?


38 – MACHINE LEARNING – 2012
Kernel K-means

With a polynomial kernel:

$$\min_k d\left(x^i, C_k\right) = \min_k \left( k\left(x^i, x^i\right) - \frac{2}{|C_k|}\sum_{x^j \in C_k} k\left(x^i, x^j\right) + \frac{1}{|C_k|^2}\sum_{x^j, x^l \in C_k} k\left(x^j, x^l\right) \right)$$

• The first term is a positive value.
• Second term: some of the terms change sign depending on their position with respect to the origin.
• Third term: if the points are aligned in the same quadrant, the sum is maximal.


39 – MACHINE LEARNING – 2012
Kernel K-means: Examples

RBF kernel, 2 clusters.


40 – MACHINE LEARNING – 2012
Kernel K-means: Examples

RBF kernel, 2 clusters.


41 – MACHINE LEARNING – 2012
Kernel K-means: Examples

RBF kernel, 2 clusters; kernel widths of 0.5 and 0.05.


42 – MACHINE LEARNING – 2012
Kernel K-means: Examples

Polynomial kernel, 2 clusters.


43 – MACHINE LEARNING – 2012
Kernel K-means: Examples

Polynomial kernel (p=8), 2 clusters.


44 – MACHINE LEARNING – 2012
Kernel K-means: Examples

Polynomial kernel, 2 clusters; orders 2, 4 and 6.


45 – MACHINE LEARNING – 2012
Kernel K-means: Examples

Polynomial kernel, 2 clusters.
The separating line will always be perpendicular to the line passing through the origin (which is located at the mean of the datapoints) and parallel to the axis of the ordinates, because of the change of sign of the cosine function in the inner product.
⇒ No better than linear K-means!


46 – MACHINE LEARNING – 2012
Kernel K-means: Examples

Polynomial kernel, 4 clusters: solutions found with kernel K-means vs. solutions found with K-means.
Kernel K-means can only group datapoints that do not overlap across quadrants with respect to the origin (careful: the data are centered!).
⇒ No better than linear K-means! (except that it is less sensitive to random initialization)


47 – MACHINE LEARNING – 2012
Kernel K-means: Limitations

The choice of the number of clusters in kernel K-means is important.


48 – MACHINE LEARNING – 2012
Kernel K-means: Limitations

The choice of the number of clusters in kernel K-means is important.


49 – MACHINE LEARNING – 2012
Kernel K-means: Limitations

The choice of the number of clusters in kernel K-means is important.


50 – MACHINE LEARNING – 2012
Limitations of Kernel K-means

Raw data.


51 – MACHINE LEARNING – 2012
Limitations of Kernel K-means

Kernel K-means with K=3, RBF kernel.


52 – MACHINE LEARNING – 2012
From Non-Linear Manifolds (Laplacian Eigenmaps, Isomap)
to Spectral Clustering


53 – MACHINE LEARNING – 2012
Non-Linear Manifolds

PCA and kernel PCA belong to a more general class of methods that create non-linear manifolds based on spectral decomposition. (Spectral decomposition of matrices is more frequently referred to as an eigenvalue decomposition.)

Depending on which matrix we decompose, we get a different set of projections:
• PCA decomposes the covariance matrix of the dataset ⇒ generates a rotation and projection in the original space.
• Kernel PCA decomposes the Gram matrix ⇒ partitions or regroups the datapoints.
• The Laplacian matrix is a matrix representation of a graph. Its spectral decomposition can be used for clustering.


54 – MACHINE LEARNING – 2012
Embed Data in a Graph

[Figure: original dataset and its graph representation.]

• Build a similarity graph.
• Each vertex of the graph is a datapoint.


55 – MACHINE LEARNING – 2012
Measure Distances in Graph

Construct the similarity matrix S to denote whether points are close to or far away from one another, and use it to weight the edges of the graph.

[Figure: example similarity matrix S with continuous entries between 0 and 1 (e.g. 0.9, 0.8, 0.7, 0.6, 0.2).]


56 – MACHINE LEARNING – 2012
Disconnected Graphs

Disconnected graph: two datapoints are connected if:
a) the similarity between them is higher than a threshold,
b) or if they are k-nearest neighbors (according to the similarity metric).

[Figure: block-structured binary similarity matrix S with entries 0 or 1.]


57 – MACHINE LEARNING – 2012
Graph Laplacian

Given the similarity matrix S, construct the diagonal matrix D composed of the sum over each row of S:

$$D_{ii} = \sum_{j} S_{ij}$$

and then build the Laplacian matrix:

$$L = D - S$$

L is positive semi-definite ⇒ a spectral decomposition is possible.
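A small NumPy sketch of this construction, covering both the continuous Gaussian similarity of the later slides and the k-nearest-neighbor graph of slide 56. The function name, the symmetrization of the k-NN graph and the default parameters are assumptions for the sketch.

```python
import numpy as np

def graph_laplacian(X, sigma=1.0, knn=None):
    """Build the similarity matrix S, the degree matrix D and the Laplacian L = D - S.

    X     : (M, N) array of datapoints.
    sigma : width of the Gaussian similarity S_ij = exp(-||x_i - x_j||^2 / (2 sigma^2)).
    knn   : if given, keep only the k nearest neighbours (symmetrized), giving a
            sparse / possibly disconnected graph as on the slides.
    """
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)
    S = np.exp(-sq / (2.0 * sigma ** 2))
    if knn is not None:
        keep = np.zeros_like(S, dtype=bool)
        order = np.argsort(-S, axis=1)[:, 1:knn + 1]   # skip the self-similarity entry
        np.put_along_axis(keep, order, True, axis=1)
        S = np.where(keep | keep.T, S, 0.0)            # symmetrize the neighbourhood graph
    D = np.diag(S.sum(axis=1))
    L = D - S
    return S, D, L
```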


58 – MACHINE LEARNING – 2012
Graph Laplacian

Eigenvalue decomposition of the Laplacian matrix: $L = U \Lambda U^T$.

All eigenvalues of L are non-negative and the smallest eigenvalue of L is zero. If we order the eigenvalues in increasing order: $0 = \lambda_1 \le \lambda_2 \le \ldots \le \lambda_M$.

If the graph has k connected components, then the eigenvalue $\lambda = 0$ has multiplicity k.


59 – MACHINE LEARNING – 2012
Spectral Clustering

The multiplicity of the eigenvalue 0 determines the number of connected components in a graph. The associated eigenvectors identify these connected components.

For an eigenvalue $\lambda_i = 0$, the corresponding eigenvector $e^i$ has the same value for all vertices in one component, and a different value for each of the other components.

Identifying the clusters is then trivial when the similarity matrix is composed of zeros and ones (as when using k-nearest neighbors).

What happens when the similarity matrix is full?


60 – MACHINE LEARNING – 2012
Spectral Clustering

Similarity matrix $S$ on $\mathbb{R}^{N} \times \mathbb{R}^{N}$: S can either be binary (k-nearest neighbor) or continuous, e.g. with a Gaussian kernel:

$$S\left(x^i, x^j\right) = e^{-\frac{\left\|x^i - x^j\right\|^2}{2\sigma^2}}$$

[Figure: example of a continuous similarity matrix S.]


61 – MACHINE LEARNING – 2012
Spectral Clustering

1) Build the Laplacian matrix $L = D - S$.
2) Do an eigenvalue decomposition of the Laplacian matrix: $L = U \Lambda U^T$.
3) Order the eigenvalues in increasing order: $0 = \lambda_1 \le \lambda_2 \le \ldots \le \lambda_M$.

With a full similarity matrix (e.g. the Gaussian kernel $S(x^i, x^j) = e^{-\|x^i - x^j\|^2 / (2\sigma^2)}$), the first eigenvalue is still zero, but with multiplicity 1 only (fully connected graph)!

Idea: the smallest eigenvalues are close to zero and hence also provide information on the partitioning of the graph (see exercise session).


62 – MACHINE LEARNING – 2012
Spectral Clustering

Eigenvalue decomposition of the Laplacian matrix: $L = U \Lambda U^T$, where the columns of U are the eigenvectors $e^1, e^2, \ldots$

Construct an embedding of each of the M datapoints $x^i$ through

$$y^i = \left(e^1_i, \ldots, e^K_i\right)$$

i.e. the i-th entries of the first K eigenvectors. Reduce the dimensionality by picking K < M.




64 – MACHINE LEARNING – 2012
Spectral Clustering

Example: 3 datapoints in a graph composed of 2 partitions (x1 and x2 connected, x3 on its own).

The similarity matrix is
$$S = \begin{pmatrix} 1 & 1 & 0 \\ 1 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}$$

L has eigenvalue $\lambda = 0$ with multiplicity two.

One solution for the two associated eigenvectors is $e^1 \propto (1, 1, 0)^T$, $e^2 = (0, 0, 1)^T$.
Another solution is $e^1 \approx (0.33, 0.33, 0.88)^T$, $e^2 \approx (-0.1, -0.1, 0.99)^T$.

The entries in the eigenvectors for the two first datapoints are equal:
$y^1 = y^2 = (1, 0)$ for the 1st set of eigenvectors, and $y^1 = y^2 = (0.33, -0.1)$ for the 2nd set.


65 – MACHINE LEARNING – 2012
Spectral Clustering

Example: 3 datapoints in a fully connected graph. The similarity matrix is
$$S = \begin{pmatrix} 1 & 0.9 & 0.01 \\ 0.9 & 1 & 0.02 \\ 0.01 & 0.02 & 1 \end{pmatrix}$$

L has eigenvalue $\lambda = 0$ with multiplicity 1. The second eigenvalue is small, $\approx 0.04$, whereas the 3rd one is large, $\approx 1.81$, with associated eigenvectors:

$$e^1 \propto \begin{pmatrix}1\\1\\1\end{pmatrix}, \quad e^2 \approx \begin{pmatrix}0.41\\0.40\\-0.81\end{pmatrix}, \quad e^3 \approx \begin{pmatrix}0.71\\-0.70\\0.02\end{pmatrix}$$

The entries in the 2nd eigenvector for the two first datapoints are almost equal: $y^1 = (1, 0.41)$, $y^2 = (1, 0.40)$.
The first two points have almost the same coordinates in the y embedding.
Reduce the dimensionality by considering only the eigenvectors with the smallest eigenvalues.
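As a sanity check, the short NumPy snippet below rebuilds the Laplacian of this 3-point example and reproduces the shape of its spectrum. The exact off-diagonal entries (0.01 / 0.02) are as best recoverable from the slide, the slide's numbers are rounded, and eigenvectors are only defined up to sign.

```python
import numpy as np

# 3-point, fully connected example from the slide above
S = np.array([[1.0,  0.9,  0.01],
              [0.9,  1.0,  0.02],
              [0.01, 0.02, 1.0]])
D = np.diag(S.sum(axis=1))
L = D - S

eigvals, eigvecs = np.linalg.eigh(L)   # eigenvalues in increasing order
print(eigvals)                         # one eigenvalue is 0, one is small (~0.04), one is large (~1.8)
print(eigvecs)                         # first column ~ (1,1,1)/sqrt(3), up to sign

# Embedding of each point on the two eigenvectors with smallest eigenvalues:
Y = eigvecs[:, :2]
print(Y)                               # the first two rows (points x1, x2) are nearly identical
```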


66 – MACHINE LEARNING – 2012
Spectral Clustering

Example: 3 datapoints in a fully connected graph. The similarity matrix is
$$S = \begin{pmatrix} 1 & 0.9 & 0.8 \\ 0.9 & 1 & 0.7 \\ 0.8 & 0.7 & 1 \end{pmatrix}$$

L has eigenvalue $\lambda = 0$ with multiplicity 1. The second and third eigenvalues are both large, $\approx 2.23$ and $\approx 2.57$, with associated eigenvectors:

$$e^1 \propto \begin{pmatrix}1\\1\\1\end{pmatrix}, \quad e^2 \approx \begin{pmatrix}-0.21\\-0.57\\0.79\end{pmatrix}, \quad e^3 \approx \begin{pmatrix}0.78\\-0.57\\-0.21\end{pmatrix}$$

The entries in the 2nd eigenvector for the two first datapoints are no longer equal: $y^1 = (1, -0.21)$, $y^2 = (1, -0.57)$.
The 3rd point is now closer to the two other points, and the first two points no longer have the same coordinates in the y embedding.


67 – MACHINE LEARNING – 2012
Spectral Clustering

[Figure: graph vertices x1 ... x6 with edge weights (e.g. w12, w21) and their embedding y1 ... y6.]

Idea: points close to one another have almost the same coordinates on the eigenvectors of L with small eigenvalues.

Step 1 (embedding in y): do an eigenvalue decomposition of the Laplacian matrix L and project the datapoints onto the K eigenvectors with the smallest eigenvalues (hence reducing the dimensionality).


68 – MACHINE LEARNING – 2012
Spectral Clustering

[Figure: graph vertices x1 ... x6 and their embedding y1 ... y6.]

Step 2: perform K-means on the set of vectors $y^1, \ldots, y^M$.
Cluster the datapoints x according to their clustering in y.
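Putting steps 1 and 2 together, here is a compact sketch of unnormalized spectral clustering on a precomputed similarity matrix. SciPy's `kmeans2` stands in for the K-means step; the function name and defaults are illustrative, not the course implementation.

```python
import numpy as np
from scipy.cluster.vq import kmeans2

def spectral_clustering(S, n_clusters):
    """Unnormalized spectral clustering, following steps 1-2 of the slides.

    S : (M, M) symmetric similarity matrix (binary k-NN graph or Gaussian similarities).
    Returns a label vector of length M.
    """
    D = np.diag(S.sum(axis=1))
    L = D - S                                   # graph Laplacian
    eigvals, eigvecs = np.linalg.eigh(L)        # eigenvalues in increasing order
    Y = eigvecs[:, :n_clusters]                 # step 1: embed each point on the K
                                                # eigenvectors with smallest eigenvalues
    _, labels = kmeans2(Y, n_clusters, minit='++')   # step 2: K-means on y^1 ... y^M
    return labels
```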


70 – MACHINE LEARNING – 2012
Equivalency to Other Non-Linear Embeddings

Spectral decomposition of the similarity matrix (which is already positive semi-definite) yields the embedding
$$y^i = \left(e^1_i, \ldots, e^K_i\right)$$

In Isomap, the embedding is normalized by the eigenvalues and uses the geodesic distance to build the similarity matrix; see the supplementary material.


71 – MACHINE LEARNING – 2012
Laplacian Eigenmaps

Solve the generalized eigenvalue problem: $L y = \lambda D y$.

This is the solution to the optimization problem $\min_y y^T L y$ such that $y^T D y = 1$, which ensures minimal distortion while preventing arbitrary scaling.

The vectors $y^i$, $i = 1, \ldots, M$, form an embedding of the datapoints.

[Figure: Swiss-roll example; projections on each Laplacian eigenvector. Image courtesy of A. Singh.]


72 – MACHINE LEARNING – 2012
Equivalency to Other Non-Linear Embeddings

Kernel PCA: eigenvalue decomposition of the similarity matrix S, $S = U D U^T$.

The choice of parameters in kernel K-means can be initialized by doing a readout of the Gram matrix after kernel PCA.


73 – MACHINE LEARNING – 2012
Kernel K-means and Kernel PCA

The optimization problem of kernel K-means is equivalent to:

$$\max_{H} \operatorname{tr}\left(H^T K H\right), \quad H = Y D^{-1/2}$$

(see the paper by M. Welling, supplementary document on the website)

Y is $M \times K$: each entry of Y is 1 if the datapoint belongs to cluster k, otherwise zero.
D is $K \times K$ and diagonal: the element on the diagonal is the number of datapoints in cluster k.

Since $\operatorname{tr}\left(H^T K H\right) \le \sum_{i=1}^{K} \lambda_i$, with $\lambda_1 \ge \ldots \ge \lambda_M$ the eigenvalues resulting from the eigenvalue decomposition of the Gram matrix:
⇒ look at the eigenvalues of the Gram matrix to determine the optimal number of clusters.
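A hypothetical helper illustrating this readout: it returns the cumulative sum of the leading Gram-matrix eigenvalues, which bounds the kernel K-means objective above. Looking for a plateau in this cumulative sum is one plausible way to exploit the bound; the function name and heuristic are assumptions, not necessarily the exact procedure used in the course tools.

```python
import numpy as np

def gram_eigenvalue_readout(K_gram, max_clusters=10):
    """Readout of the Gram-matrix spectrum used to guess the number of clusters.

    Returns the cumulative sum of the leading eigenvalues; a plateau after K terms
    suggests K clusters, since the kernel K-means objective is bounded by this sum.
    """
    eigvals = np.linalg.eigvalsh(K_gram)[::-1]          # eigenvalues, largest first
    top = np.clip(eigvals[:max_clusters], 0.0, None)    # keep the leading part of the spectrum
    return np.cumsum(top)

# Example usage with an RBF Gram matrix on data X (X and sigma are placeholders):
# sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
# K_gram = np.exp(-sq / (2 * sigma ** 2))
# print(gram_eigenvalue_readout(K_gram))
```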


74 – MACHINE LEARNING – 2012
Kernel PCA

Kernel PCA projections can also help determine the kernel width.

[Figure, from top to bottom: kernel widths of 0.8, 1.5 and 2.5.]


75 – MACHINE LEARNING – 2012
Kernel PCA

Kernel PCA projections can help determine the kernel width.
The sum of the eigenvalues grows as we get a better clustering.


76 – MACHINE LEARNING – 2012
Quick Recap of the Gaussian Mixture Model


77 – MACHINE LEARNING – 2012
Clustering with Mixtures of Gaussians

An alternative to K-means: soft partitioning with elliptic clusters instead of spheres.

[Figure: clustering with mixtures of Gaussians using spherical Gaussians (left) and non-spherical Gaussians, i.e. with full covariance matrix (right). Notice how the clusters become elongated along the direction of the clusters; the grey circles represent the first and second variances of the distributions.]


78 – MACHINE LEARNING – 2012
Gaussian Mixture Model (GMM)

Using a set of M N-dimensional training datapoints $\{x^j\}_{j=1}^{M}$, the pdf of X is modeled through a mixture of K Gaussians:

$$p\left(x^j\right) = \sum_{i=1}^{K} \alpha_i \, p\left(x^j \mid \mu_i, \Sigma_i\right), \quad \text{with} \quad p\left(x^j \mid \mu_i, \Sigma_i\right) = \mathcal{N}\left(x^j; \mu_i, \Sigma_i\right)$$

$\mu_i, \Sigma_i$: mean and covariance matrix of Gaussian i.
Mixing coefficients: $\sum_{i=1}^{K} \alpha_i = 1$; $\alpha_i$ is the probability that the data was explained by Gaussian i, $\alpha_i = \frac{1}{M}\sum_{j=1}^{M} p\left(i \mid x^j\right)$.
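A minimal sketch of evaluating this mixture density at a single point, assuming the parameters are already known (the function name and argument layout are my choices):

```python
import numpy as np

def gmm_pdf(x, alphas, means, covs):
    """Evaluate the GMM density p(x) = sum_i alpha_i N(x; mu_i, Sigma_i) at one point x.

    alphas : (K,) mixing coefficients summing to 1
    means  : (K, N) component means
    covs   : (K, N, N) component covariance matrices
    """
    x = np.asarray(x, dtype=float)
    N = x.shape[0]
    total = 0.0
    for alpha, mu, cov in zip(alphas, means, covs):
        diff = x - mu
        quad = diff @ np.linalg.solve(cov, diff)            # (x - mu)^T Sigma^{-1} (x - mu)
        norm = np.sqrt((2 * np.pi) ** N * np.linalg.det(cov))
        total += alpha * np.exp(-0.5 * quad) / norm
    return total
```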


79 – MACHINE LEARNING – 2012
Gaussian Mixture Modeling

The parameters of a GMM are the means, covariance matrices and prior pdf:
$$\Theta = \left\{\mu_1, \ldots, \mu_K,\; \Sigma_1, \ldots, \Sigma_K,\; \alpha_1, \ldots, \alpha_K\right\}$$

Estimation of all the parameters can be done through Expectation-Maximization (E-M). E-M tries to find the optimum of the likelihood of the model given the data, i.e.:
$$\max_{\Theta} L\left(\Theta \mid X\right) = \max_{\Theta} p\left(X \mid \Theta\right)$$

See lecture notes for details.
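For concreteness, one possible EM loop for these updates is sketched below under standard assumptions (full covariances, random initialization from the data, a small ridge `reg` on the covariances for stability). It is a sketch of the textbook update equations, not the course implementation; see the lecture notes for the derivation.

```python
import numpy as np

def gmm_em(X, K, n_iter=100, seed=0, reg=1e-6):
    """Simple EM for a GMM with full covariances. X: (M, N). Returns (alphas, means, covs)."""
    rng = np.random.default_rng(seed)
    M, N = X.shape
    means = X[rng.choice(M, size=K, replace=False)].astype(float)
    covs = np.array([np.cov(X.T) + reg * np.eye(N) for _ in range(K)])
    alphas = np.full(K, 1.0 / K)
    for _ in range(n_iter):
        # E-step: responsibilities p(i | x_j)
        resp = np.zeros((M, K))
        for i in range(K):
            diff = X - means[i]
            quad = np.einsum('mn,nk,mk->m', diff, np.linalg.inv(covs[i]), diff)
            norm = np.sqrt((2 * np.pi) ** N * np.linalg.det(covs[i]))
            resp[:, i] = alphas[i] * np.exp(-0.5 * quad) / norm
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: update mixing coefficients, means and covariances
        Nk = resp.sum(axis=0)
        alphas = Nk / M
        means = (resp.T @ X) / Nk[:, None]
        for i in range(K):
            diff = X - means[i]
            covs[i] = (resp[:, i, None] * diff).T @ diff / Nk[i] + reg * np.eye(N)
    return alphas, means, covs
```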


80 – MACHINE LEARNING – 2012
Gaussian Mixture Model


81 – MACHINE LEARNING – 2012
Gaussian Mixture Model

GMM using 4 Gaussians with random initialization.


82 – MACHINE LEARNING – 2012
Gaussian Mixture Model

Expectation-Maximization is very sensitive to initial conditions: GMM using 4 Gaussians with a new random initialization.


83 – MACHINE LEARNING – 2012
Gaussian Mixture Model

Very sensitive to the choice of the number of Gaussians. The number of Gaussians can be optimized iteratively using AIC or BIC, as for K-means. Here, GMM using 8 Gaussians.


84 – MACHINE LEARNING – 2012
Evaluation of Clustering Methods


85 – MACHINE LEARNING – 2012
Evaluation of Clustering Methods

Clustering methods rely on hyperparameters:
• number of clusters
• kernel parameters
⇒ Need to determine the goodness of these choices.

Clustering is unsupervised classification:
⇒ We do not know the real number of clusters or the data labels.
⇒ It is difficult to evaluate these choices without ground truth.


86 – MACHINE LEARNING – 2012
Evaluation of Clustering Methods

Two types of measures: internal versus external measures.

Internal measures rely on a measure of similarity (e.g. intra-cluster distance versus inter-cluster distance).
E.g. the Residual Sum of Squares (RSS) is an internal measure (available in MLDemos); it gives the squared distance of each vector from its centroid, summed over all vectors:

$$\mathrm{RSS} = \sum_{k=1}^{K} \sum_{x \in C_k} \left\|x - m_k\right\|^2$$

⇒ Internal measures are problematic, as the metric of similarity is often already optimized by the clustering algorithm.
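This measure is a one-liner given a clustering result; a small NumPy sketch (function name and argument layout are my choices):

```python
import numpy as np

def residual_sum_of_squares(X, labels, centroids):
    """RSS = sum over all points of the squared distance to their assigned centroid.

    X: (M, N) data, labels: (M,) cluster indices, centroids: (K, N).
    """
    return float(np.sum((X - centroids[labels]) ** 2))
```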


87 – MACHINE LEARNING – 2012
Evaluation of Clustering Methods

K-means, soft K-means and GMM have several hyperparameters (number of clusters, beta, number of Gaussian functions).
⇒ We need a measure to determine how well a choice of hyperparameters fits the dataset (maximum-likelihood measure).

X: dataset; M: number of datapoints; B: number of free parameters; L: maximum likelihood of the model given its B parameters.
• Akaike Information Criterion: $\mathrm{AIC} = -2\ln L + 2B$
• Bayesian Information Criterion: $\mathrm{BIC} = -2\ln L + B\ln M$
The second term penalizes the increase in model complexity (computational costs).

Choosing AIC versus BIC depends on the application: is the purpose of the analysis to make predictions, or to decide which model best represents reality?
• AIC may have better predictive ability than BIC, but BIC finds a computationally more efficient solution.
• A lower BIC implies either fewer explanatory variables, a better fit, or both.
• As the number of datapoints (observations) increases, BIC assigns more weight to simpler models than AIC.
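Both criteria are direct to compute once the maximized log-likelihood is available; a small sketch (the parameter-count convention in the comment is an assumption, as conventions differ):

```python
import numpy as np

def aic_bic(log_likelihood, n_params, n_points):
    """AIC and BIC from the maximized log-likelihood ln L, the number of free
    parameters B and the number of datapoints M, using the formulas on the slide."""
    aic = -2.0 * log_likelihood + 2.0 * n_params
    bic = -2.0 * log_likelihood + n_params * np.log(n_points)
    return aic, bic

# For a GMM with K full-covariance Gaussians in N dimensions, one common parameter
# count (an assumption; conventions differ) is:
#   B = K*N (means) + K*N*(N+1)/2 (covariances) + (K-1) (mixing coefficients)
```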


88 – MACHINE LEARNING – 2012
Evaluation of Clustering Methods

Two types of measures: internal versus external measures.

External measures assume that a subset of datapoints has class labels and measure how well these datapoints are clustered.
⇒ One needs to have an idea of the classes and to have labeled some datapoints.
⇒ Interesting mainly in cases when labeling is highly time-consuming and the data is very large (e.g. in speech recognition).


89 – MACHINE LEARNING – 2012
Evaluation of Clustering Methods

Raw data.


90 – MACHINE LEARNING – 2012
Semi-Supervised Learning

Clustering F-measure (careful: similar to, but not the same as, the F-measure we will see for classification!)
Trade-off between clustering correctly all datapoints of the same class in the same cluster and making sure that each cluster contains points of only one class.

M: number of datapoints; C = {c_i}: the set of classes; K: number of clusters; $n_{ik}$: number of members of class $c_i$ in cluster k.

$$F\left(C, K\right) = \sum_{c_i \in C} \frac{|c_i|}{M} \max_{k} F\left(c_i, k\right)$$
(picks for each class the cluster with the maximal number of its datapoints)

$$F\left(c_i, k\right) = \frac{2\, R\left(c_i, k\right) P\left(c_i, k\right)}{R\left(c_i, k\right) + P\left(c_i, k\right)}$$

Recall (proportion of datapoints correctly clustered): $R\left(c_i, k\right) = \dfrac{n_{ik}}{|c_i|}$
Precision (proportion of datapoints of the same class in the cluster): $P\left(c_i, k\right) = \dfrac{n_{ik}}{n_k}$
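A small sketch of this measure, assuming integer class and cluster labels (the function name and loop structure are my choices):

```python
import numpy as np

def clustering_f_measure(class_labels, cluster_labels):
    """Clustering F-measure of the slide: for each class, take the best-matching
    cluster and weight by the class size. Both inputs are integer arrays of length M."""
    M = len(class_labels)
    total = 0.0
    for c in np.unique(class_labels):
        in_class = class_labels == c
        best = 0.0
        for k in np.unique(cluster_labels):
            in_cluster = cluster_labels == k
            n_ik = np.sum(in_class & in_cluster)
            if n_ik == 0:
                continue
            recall = n_ik / in_class.sum()       # n_ik / |c_i|
            precision = n_ik / in_cluster.sum()  # n_ik / n_k
            best = max(best, 2 * recall * precision / (recall + precision))
        total += in_class.sum() / M * best
    return total
```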


91 – MACHINE LEARNING – 2012
Evaluation of Clustering Methods

RSS with K-means can find the true optimal number of clusters, but is very sensitive to the random initialization (left and right: two different runs).
RSS finds an optimum for K=4 and K=5 across the two runs.


92 – MACHINE LEARNING – 2012
Evaluation of Clustering Methods

BIC (left) and AIC (right) perform very poorly here, splitting some clusters into two halves.


93 – MACHINE LEARNING – 2012
Evaluation of Clustering Methods

BIC (left) and AIC (right) perform much better for picking the right number of clusters in GMM.


94 – MACHINE LEARNING – 2012
Evaluation of Clustering Methods

Raw data.


95 – MACHINE LEARNING – 2012
Evaluation of Clustering Methods

Optimization with BIC using K-means.


96 – MACHINE LEARNING – 2012
Evaluation of Clustering Methods

Optimization with AIC using K-means. AIC tends to find more clusters.


97 – MACHINE LEARNING – 2012
Evaluation of Clustering Methods

Raw data.


98 – MACHINE LEARNING – 2012
Evaluation of Clustering Methods

Optimization with AIC using kernel K-means with an RBF kernel.


99 – MACHINE LEARNING – 2012
Evaluation of Clustering Methods

Optimization with BIC using kernel K-means with an RBF kernel.


100 – MACHINE LEARNING – 2012
Semi-Supervised Learning

Raw data: 3 classes.


101 – MACHINE LEARNING – 2012
Semi-Supervised Learning

Clustering with RBF kernel K-means after optimization with BIC.


102 – MACHINE LEARNING – 2012
Semi-Supervised Learning

After semi-supervised learning.


103 – MACHINE LEARNING – 2012
Summary

We have seen several methods for extracting structure in data using the notion of kernel, and discussed their similarities, differences and complementarity:
• Kernel CCA is a generalization of kernel PCA to determine partial groupings in different dimensions.
• Kernel PCA can be used to bootstrap the choice of hyperparameters in kernel K-means.

We have compared the geometrical division of space yielded by RBF and polynomial kernels.

We have seen that simpler techniques than kernel K-means, such as K-means with a norm-p metric and mixtures of Gaussians, can also yield complex non-linear clusterings.


104 – MACHINE LEARNING – 2012
Summary

When to use what? E.g. K-means versus kernel K-means versus GMM.
When using any machine learning algorithm, you have to balance a number of factors:
• computing time at training and at testing
• number of open parameters
• curse of dimensionality (order of growth with the number of datapoints, dimension, etc.)
• robustness to initial conditions, optimality of the solution
