SINCO - an Efficient Greedy Method for Learning Sparse INverse COvariance Matrix

Katya Scheinberg
IEOR Department, Columbia University, New York, NY

Irina Rish
Computational Biology Department, IBM T.J. Watson Research Center, Yorktown Heights, NY

1 Introduction

In many practical applications of statistical learning the objective is not simply to construct an accurate predictive model, but rather to discover meaningful interactions among the variables. For example, in applications such as reverse-engineering of gene networks, discovery of functional brain connectivity patterns from brain-imaging data, or analysis of social interactions, the main focus is on reconstructing the network structure representing dependencies among multiple variables, such as genes, brain areas, or individuals. Probabilistic graphical models, such as Markov networks (or Markov Random Fields), provide a statistical tool for multivariate data analysis that allows one to capture variable interactions explicitly, as conditional (in)dependence relationships. We focus on learning the structure of a Markov network over Gaussian random variables, which is equivalent to learning the zero-pattern of the inverse covariance matrix. To that end we consider the convex optimization formulation based on l1 regularization studied in [8, 13, 14, 10, 4].

Herein, we propose a simple greedy algorithm (SINCO) for solving this optimization problem. SINCO solves the primal problem (unlike its predecessors such as COVSEL [10] and glasso [4]), using coordinate ascent in a greedy manner, thus naturally preserving the sparsity of the solution. As demonstrated by our empirical results, SINCO is better at reducing the false-positive error rate than glasso [4] (while maintaining a similar true-positive rate when networks are sufficiently sparse), because of its greedy incremental nature.

The structure reconstruction accuracy is known to be quite sensitive to the choice of the regularization parameter, which we denote by λ, and the problem of selecting the "best" (i.e., the value giving the most accurate structure) value of this parameter in practical settings remains open. (Most of the recent theoretical work in this area focuses on asymptotic consistency results [8, 14, 10, 13, 11]; alternative approaches include a Bayesian treatment of the unknown λ parameter, but are limited to empirical explorations [1].) We investigate the behavior of SINCO versus glasso over a range of λ values. What we observe is that SINCO's greedy steps introduce "important" nonzero elements in the same manner as is achieved by reducing the value of λ. Hence, SINCO can reproduce the regularization-path behavior without actually varying the value of the regularization parameter, instead following the "greedy solution path", i.e., sequentially introducing non-zero elements.

Moreover, while glasso [4] is comparable to, or faster than, SINCO for a relatively small number of variables p, SINCO scales better on sparse or structured problems as p increases, and can outperform glasso.
Finally, experiments on real-life brain imaging (fMRI) data demonstrate that SINCO reconstructs Markov networks that achieve the same or better classification accuracy than its competitors while using a much smaller fraction of edges (non-zero entries of the inverse covariance matrix). Further advantages of our approach include simplicity, efficiency, and relatively straightforward massive parallelization.


2 Problem Formulation

Consider the multivariate Gaussian probability density function over a set of p random variables X = {X_1, ..., X_p},

    p(x) = (2π)^{-p/2} det(Σ)^{-1/2} exp( -(1/2) (x - µ)^T Σ^{-1} (x - µ) ),

where µ is the mean and Σ is the covariance matrix of the distribution. (Without loss of generality we assume that the data is scaled so that µ = 0.) A Markov network (a Markov Random Field, or MRF) represents the conditional independence structure of P(X): an edge (i, j) exists if and only if X_i is conditionally dependent on X_j given all remaining variables [6]. Missing edges correspond to zero entries in the inverse covariance matrix C = Σ^{-1}, and vice versa [6]; thus the problem of structure learning for the above probabilistic graphical model is equivalent to the problem of learning the zero-pattern of the inverse covariance matrix. Note that the inverse of the maximum-likelihood estimate of the covariance matrix Σ (i.e., the empirical covariance matrix A = (1/n) ∑_{i=1}^n x_i^T x_i, where x_i is the i-th sample, i = 1, ..., n), even if it exists, does not typically contain any elements that are exactly zero. Therefore an explicit sparsity-enforcing constraint needs to be added to the estimation process.

A common approach to enforcing sparsity of C is to include as a penalty the (vector) l1-norm of C, which is equivalent to imposing a Laplace prior on C in the maximum-likelihood framework [4, 10, 14] (see [14] for the derivation details). Without loss of generality we make a slightly more general assumption about p(C), allowing different elements of C to have different parameters λ_ij, as is done in [3]. Hence we consider the following formulation:

    max_{C ≻ 0}  (n/2) [ln det(C) - tr(AC)] - ||C||_S                              (1)

Here ||C||_S denotes the sum of absolute values of the elements of the matrix S · C, where · denotes the element-wise product. For example, if S is the product of ρ = (n/2)λ and the matrix of all ones, then the problem reduces to the problem addressed in [4, 10, 14].

The dual of this problem is

    max_{W ≻ 0}  (n/2) ln det(W) - np/2    s.t.  -S ≤ (n/2)(W - A) ≤ S,            (2)

where the inequalities involving the matrices W, A and S are element-wise. The optimality conditions for this pair of primal and dual problems imply that W = C^{-1}, and that (n/2)(W_ij - A_ij) = S_ij if C_ij > 0 and (n/2)(W_ij - A_ij) = -S_ij if C_ij < 0.
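To make the formulation concrete, the following is a minimal NumPy sketch (our illustration, not the authors' code) that forms the empirical covariance A from centered data, builds a uniform weight matrix S with ρ = (n/2)λ, and evaluates the objective of (1); the synthetic data and names such as X and lam are assumptions made for this example only.

```python
# A minimal sketch (not the authors' code): set up A and S for formulation (1)
# and evaluate its objective at a candidate C.
import numpy as np

rng = np.random.default_rng(0)
n, p = 500, 20
X = rng.standard_normal((n, p))        # n centered samples of p variables (mu = 0)

A = (X.T @ X) / n                      # empirical covariance A = (1/n) sum_i x_i^T x_i
lam = 0.1
S = (n * lam / 2.0) * np.ones((p, p))  # uniform weights rho = (n/2)*lambda recover the
                                       # single-parameter l1 penalty of [4, 10, 14]

def objective(C, A, S, n):
    """Objective of (1): (n/2) [ln det C - tr(AC)] - sum_ij S_ij |C_ij|."""
    sign, logdet = np.linalg.slogdet(C)
    if sign <= 0:                      # outside the feasible cone C > 0
        return -np.inf
    return 0.5 * n * (logdet - np.trace(A @ C)) - np.sum(S * np.abs(C))

print(objective(np.eye(p), A, S, n))   # value at the identity starting point
```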
3 The SINCO method

Problem (1) is a special case of a semidefinite programming problem (SDP) [5], which can be solved in polynomial time by interior point methods (IPMs). (In fact, problem (1) is somewhat simpler than a standard SDP because the positive semidefiniteness constraint is enforced by the log det term; from the IPM perspective, however, the problem is similar.) However, each IPM iteration requires O(p^6) operations and O(p^4) memory space, which is very costly. Another reason that using an IPM is undesirable for our problem is that the sparsity pattern is recovered only in the limit, so numerical inaccuracy can interfere with the structure recovery.

As an alternative to IPMs, more efficient approaches were developed for problem (1) in [10], [4], [3] and [7]. The first two are block-coordinate ascent methods applied to the dual formulation, and the last two are first-order gradient ascent methods. The gradient-based methods require O(p^3) operations to compute the gradient at each iteration, and they recover the solution sparsity only in the limit, since gradient steps do not necessarily preserve sparsity. On the other hand, they can have global complexity bounds (see [7]) and perform as well on dense problems as they do on sparse ones. The per-iteration complexity of the block-coordinate descent method in [4] depends on the sparsity of the solutions and in general is not well established; the overall complexity is not known.

Our method, which we refer to as SINCO (for Sparse INverse COvariance), solves the primal problem and uses coordinate ascent, which naturally preserves the sparsity of the solution, by optimizing one diagonal or two (symmetric) off-diagonal entries of the matrix C at each step.


The iteration complexity is not known, just as in the case of [4]. The advantage of this approach is that the solution to each subproblem is available in closed form as a root of a quadratic equation, which can be obtained in a constant number of arithmetic operations, independent of p. Hence, in O(p^2) operations a potential step can be computed for all pairs of symmetric elements (i.e., for all pairs (i, j)). Then the step which provides the best improvement in function value can be chosen, which is the essence of the greedy nature of our approach. Once the step is taken, the update of the gradient information requires O(p^2) operations. Hence, overall, each iteration takes O(p^2) operations. Note that this procedure differs from selecting the step based on the largest element of the gradient (as is done in a traditional Gauss-Southwell method). The fact that each line search is very cheap allows us to find the best step in the same number of operations that it takes to update the gradient. In this property lies the greedy nature of our method and its difference from all other block-coordinate descent methods applied to this problem. This also distinguishes our algorithm from other greedy algorithms, such as Matching Pursuit for, say, the l1-regularized regression problem. We benefit here from the combination of relatively expensive gradient computations and relatively cheap line search steps.

As we will show in our numerical experiments, SINCO, in a serial mode, is comparable to glasso, which is orders of magnitude faster than COVSEL [4]. Also, as demonstrated by our computational experiments, SINCO leads to a lower false-positive error than glasso, since it introduces nonzero elements greedily. On the other hand, it may suffer from introducing too few nonzeros and thus missing some of the true positives, especially on denser networks. (In the limit SINCO produces the same solution as glasso; we exploit here SINCO's ability to stop early with a good solution.)

Perhaps the most interesting consequence of SINCO's greedy nature is that it reproduces the regularization path behavior while using only one value of the regularization parameter λ. We will discuss this property further in Section 4.

3.1 Algorithm description

We now present the method. First, let us consider an equivalent reformulation of problem (1):

    max_{C', C''}  (n/2) [ln det(C' - C'') - tr(A(C' - C''))] - tr(S(C' + C''))
    s.t.  C' ≥ 0,  C'' ≥ 0,  C' - C'' ≻ 0                                          (3)

At each iteration, the matrix C' or the matrix C'' is updated by changing one element on the diagonal or two symmetric off-diagonal elements. This corresponds to a change in C that can be written as C + θ(e_i e_j^T + e_j e_i^T), where i and j are the indices of the elements being changed. The key observation is that, given the matrix W = C^{-1}, the exact line search that optimizes the objective function f(θ) of problem (1) along the direction e_i e_j^T + e_j e_i^T reduces to the solution of a quadratic equation. Hence each such line search takes a constant number of operations. Moreover, given the starting objective value, the new function value at each step can be computed in a constant number of operations.
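To see why each line search is this cheap, one can expand the determinant along the rank-two direction. The short derivation below is our own sketch for the off-diagonal case (with W = C^{-1} and the step sign fixed by the C', C'' split); it is meant to illustrate the claim, not to reproduce the paper's exact update formula.

```latex
% Line search along the off-diagonal direction e_i e_j^T + e_j e_i^T, with W = C^{-1}.
% Matrix determinant lemma (rank-two update) and linearity of the trace give:
\[
\det\!\bigl(C + \theta(e_i e_j^T + e_j e_i^T)\bigr)
   = \det(C)\,\bigl[(1+\theta W_{ij})^2 - \theta^2 W_{ii}W_{jj}\bigr],
\qquad
\mathrm{tr}\!\bigl(A\,\theta(e_i e_j^T + e_j e_i^T)\bigr) = 2\theta A_{ij}.
\]
% Hence, for a step whose sign is fixed by the C', C'' split (so the penalty is linear in theta):
\[
f(\theta) - f(0)
   = \tfrac{n}{2}\Bigl[\ln\!\bigl((1+\theta W_{ij})^2 - \theta^2 W_{ii}W_{jj}\bigr)
     - 2\theta A_{ij}\Bigr] - 2 S_{ij}\,\sigma\,\theta,
\qquad \sigma \in \{-1,+1\}.
\]
% Setting the derivative in theta to zero and multiplying through by the (quadratic, positive)
% determinant factor leaves a quadratic equation in theta, solvable from W_ii, W_jj, W_ij,
% A_ij and S_ij alone; the diagonal case is analogous, with determinant factor (1 + theta W_ii).
```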
This me<strong>an</strong>s that we c<strong>an</strong> per<strong>for</strong>m such line search <strong>for</strong> all (i, j) pairs in O(p 2 ) time,which is linear in the number of unknown variables C ij . We then c<strong>an</strong> choose the step that gives thebest improvement in the value of the objective function. After the step is chosen, the dual matrixW = C −1 <strong>an</strong>d, hence, the objective function gradient, are updated in O(p 2 ) operations.The method we propose works as follows:Algorithm 10. Initialize C ′ = I, C ′′ = 0, W = I1. Form the gradient G ′ = n 2 (W − A) − S <strong>an</strong>d G′′ = −S − n (W + A)22. For each pair (i, j) such that(i) G ′ ij > 0, C ij ′′ = 0, compute the maximum off(θ) <strong>for</strong> θ > 0.(ii) G ′ ij < 0, C ij ′ > 0, compute the maximum off(θ) <strong>for</strong> θ < 0 subject to C ′ ≥ 0(iii) G ′′ij > 0, C ij ′ = 0, compute the maximum off(θ) <strong>for</strong> θ > 0.(iv) G ′′ij < 0, C ij ′′ > 0, compute the maximum of f(θ) <strong>for</strong> θ < 0 subject to C ′′ ≥ 0.4 In the limit <strong>SINCO</strong> produces the same solution as glasso. We exploit here <strong>SINCO</strong>’s ability to stop earlywith a good solution.(3)3


Figure 1: Scale-free networks and E. coli data: SINCO paths when varying tolerance and λ. (Each panel plots the true-positive rate, TP, against the false-positive rate, FP, for the tolerance path and the regularization path; the first panels correspond to scale-free (SF) networks.)

The tolerance path is presented by a line with "o"s. We also applied SINCO with a fixed tolerance of 10^-6 to a range of λ values from 300 to 0.01; the corresponding ROC curves are denoted by lines with "x"s. One may wonder whether SINCO's solution path is actually the same as the regularization path. It is not necessarily the case. The lower curve on the third plot of Figure 1 shows the percentage of positives in solutions on the SINCO path that are not present in the corresponding solution on the regularization path; the higher curve represents the percentage of true positives. We observe that the SINCO and regularization paths largely coincide step by step until the TPs reach their maximum; further nonzeros then fall into the FP category and hence, in a way, are random noise. The final picture in Figure 1 is a comparison of the two paths for the E. coli data set (reduced to p = 133 and N = 300).

Our observations imply that SINCO can be used to greedily select the elements of the graphical model until the desired trade-off between FPs and TPs, the desired number of nonzero elements, or the allocated CPU time is reached. In the limit SINCO solves the same problem as glasso, and hence the limiting numbers of true and false positives are dictated by the choice of λ. But since the real goal is to recover the true nonzero structure of the covariance matrix, it is not necessary to solve problem (1) accurately.
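For reference, the TP and FP rates discussed here (and in the next section) are support-recovery rates over the off-diagonal entries of the estimated inverse covariance matrix; one simple way to compute them is sketched below (an illustrative helper with an assumed numerical tolerance, not code from the paper).

```python
# Support-recovery true/false-positive rates against a known ground truth
# (off-diagonal entries only); an illustrative helper, not the authors' code.
import numpy as np

def support_rates(C_hat, C_true, tol=1e-8):
    p = C_true.shape[0]
    off = ~np.eye(p, dtype=bool)                 # mask out the diagonal
    est = np.abs(C_hat[off]) > tol               # estimated edges
    true = np.abs(C_true[off]) > tol             # true edges
    tp_rate = (est & true).sum() / max(true.sum(), 1)
    fp_rate = (est & ~true).sum() / max((~true).sum(), 1)
    return tp_rate, fp_rate
```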
5 Empirical complexity

Here we discuss the empirical dependence of the runtime of the SINCO algorithm on the choice of stopping tolerance and on the problem size p. We also investigate the effect that increasing p has on the results produced by SINCO and glasso. Both methods were executed on an Intel Core 2 Duo T7700 processor (2.40 GHz).

We consider the situation when p increases. If the number of nonzeros in the true inverse covariance increases with p, then for a proper comparison we need to increase n accordingly. This, in turn, affects the contribution of λ, since the problem scaling changes. We consider two simple settings where we can account for these effects. In the first setting, we increase p while keeping the number of off-diagonal nonzero elements in the randomly generated unstructured network constant (around 300); we do not, therefore, increase n or λ. The CPU time necessary to compute the entire path for λ ranging from 300 to 0.01 is plotted for p values from 100 to 800 in the first plot of Figure 2.

In the second case, we generated block-diagonal matrices of sizes p from 100 to 1100, with 100×100 diagonal blocks, each of which equals the inverse covariance matrix of a 21%-dense structured (scale-free) network from the previous section. We increased n and the appropriate range of λ linearly with p, since the number of nonzero elements also grows linearly with p. The CPU time for this case is shown in the second plot of Figure 2.

The last two plots in Figure 2 explain why the CPU time (in seconds) for SINCO scales up more slowly than that of glasso: for a similarly high true-positive rate, glasso's final solution tends to have a much higher false-positive rate than SINCO's, thus producing a less sparse solution overall.

Finally, we investigated the behavior of SINCO for a fixed value of p as n grows. We expect to obtain larger TP values and a smaller FP error with increasing n. We observed that SINCO achieves in the limit nearly 0% false-positive error and nearly 100% true-positive rate, while glasso's FP error grows with increasing n. This result is, again, a consequence of the path-generating greedy approach utilized by SINCO and its ability (and in some cases tendency) to stop the optimization early, before the FP rate increases.
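As an illustration of this experimental setup (not the authors' generator), a block-diagonal ground-truth precision matrix with a sample size that scales linearly with p can be constructed along the following lines, assuming some base precision block B is given; the small tridiagonal B in the example is a stand-in for the paper's 100×100 scale-free block, and all names are ours.

```python
# Illustrative construction of block-diagonal test problems (not the authors' code):
# tile a base precision block B along the diagonal and scale the sample size with p.
import numpy as np

def make_block_problem(B, num_blocks, samples_per_block, seed=0):
    """Ground-truth precision = block-diag(B, ..., B); data sampled from N(0, inv(C_true))."""
    rng = np.random.default_rng(seed)
    p = B.shape[0] * num_blocks
    n = samples_per_block * num_blocks            # n grows linearly with p
    C_true = np.kron(np.eye(num_blocks), B)       # block-diagonal precision matrix
    cov = np.linalg.inv(C_true)
    X = rng.multivariate_normal(np.zeros(p), cov, size=n)
    A = (X.T @ X) / n                             # empirical covariance of the sample
    return C_true, A, n

# Example with a small stand-in block (positive definite tridiagonal matrix):
B = np.eye(5) + 0.4 * np.diag(np.ones(4), 1) + 0.4 * np.diag(np.ones(4), -1)
C_true, A, n = make_block_problem(B, num_blocks=3, samples_per_block=50)
```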


Figure 2: CPU time: SINCO vs glasso on (a) random networks (N = 500, fixed range of λ) and (b) scale-free networks (density 21%, N and λ scaled with p by the same factor, N = 500 for p = 100). ROC curves: SINCO vs glasso on (c) random networks (N = 500, fixed range of λ) and (d) scale-free networks (density 21%, N and λ scaled with p by the same factor, N = 500 for p = 100). (Panels (a) and (b) plot CPU time in seconds against p; panels (c) and (d) plot true positives against false positives.)

We also applied SINCO to fMRI data for the mind-state prediction problem described in [9] (for more details, see the StarPlus website http://www.cs.cmu.edu/afs/cs.cmu.edu/project/theo-81/www/). The purpose of that experiment was to show that the fewer nonzero elements selected by SINCO were sufficient to achieve the same prediction accuracy as was obtained by the much denser COVSEL solution to the same problem, which suggests that COVSEL learns many links that are not essential for the discriminative ability of the classifier.

References

[1] N. Bani Asadi, I. Rish, K. Scheinberg, D. Kanevsky, and B. Ramabhadran. A MAP approach to learning sparse Gaussian Markov networks. In Proc. of ICASSP 2009, April 2009.
[2] A.-L. Barabasi and R. Albert. Emergence of scaling in random networks. Science, 286:509-512, 1999.
[3] J. Duchi, S. Gould, and D. Koller. Projected subgradient methods for learning sparse Gaussians. In Proc. of UAI-08, 2008.
[4] J. Friedman, T. Hastie, and R. Tibshirani. Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 2007.
[5] H. Wolkowicz, R. Saigal, and L. Vandenberghe, editors. Handbook of Semidefinite Programming. Kluwer Academic Publishers, 2000.
[6] S. Lauritzen. Graphical Models. Oxford University Press, 1996.
[7] Z. Lu. Smooth optimization approach for sparse covariance selection. SIAM Journal on Optimization, 19(4):1807-1827, 2009.
[8] N. Meinshausen and P. Buhlmann.
High dimensional graphs and variable selection with the Lasso. Annals of Statistics, 34(3):1436-1462, 2006.
[9] T.M. Mitchell, R. Hutchinson, R.S. Niculescu, F. Pereira, X. Wang, M. Just, and S. Newman. Learning to decode cognitive states from brain images. Machine Learning, 57:145-175, 2004.
[10] O. Banerjee, L. El Ghaoui, and A. d'Aspremont. Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data. Journal of Machine Learning Research, 9:485-516, March 2008.
[11] P. Ravikumar, M. J. Wainwright, G. Raskutti, and B. Yu. Model selection in Gaussian graphical models: high-dimensional consistency of l1-regularized MLE. In NIPS-08, 2008.
[12] G. Stolovitzky, R.J. Prill, and A. Califano. Lessons from the DREAM2 challenges. Annals of the New York Academy of Sciences, 1158:159-195, 2009.
[13] M. Wainwright, P. Ravikumar, and J. Lafferty. High-dimensional graphical model selection using l1-regularized logistic regression. In NIPS 19, pages 1465-1472, 2007.
[14] M. Yuan and Y. Lin. Model selection and estimation in the Gaussian graphical model. Biometrika, 94(1):19-35, 2007.
