SINCO - an Efficient Greedy Method for Learning Sparse INverse COvariance Matrix

Katya Scheinberg
IEOR Department, Columbia University, New York, NY

Irina Rish
Computational Biology Department, IBM T.J. Watson Research Center, Yorktown Heights, NY

1 Introduction

In many practical applications of statistical learning the objective is not simply to construct an accurate predictive model, but rather to discover meaningful interactions among the variables. For example, in applications such as reverse-engineering of gene networks, discovery of functional brain connectivity patterns from brain-imaging data, or analysis of social interactions, the main focus is on reconstructing the network structure representing dependencies among multiple variables, such as genes, brain areas, or individuals. Probabilistic graphical models, such as Markov networks (or Markov Random Fields), provide a statistical tool for multivariate data analysis that allows one to capture variable interactions explicitly, as conditional (in)dependence relationships. We focus on learning the structure of a Markov network over Gaussian random variables, which is equivalent to learning the zero-pattern of the inverse covariance matrix. To that end we consider the convex optimization formulation based on l1 regularization studied in [8, 13, 14, 10, 4].

Herein, we propose a simple greedy algorithm (SINCO) for solving this optimization problem. SINCO solves the primal problem (unlike its predecessors such as COVSEL [10] and glasso [4]), using coordinate ascent in a greedy manner, thus naturally preserving the sparsity of the solution. As demonstrated by our empirical results, SINCO is better at reducing the false-positive error rate than glasso [4] (while maintaining a similar true-positive rate when networks are sufficiently sparse), because of its greedy incremental nature.

The structure reconstruction accuracy is known to be quite sensitive to the choice of the regularization parameter, which we denote by λ, and the problem of selecting the "best" (i.e., the value giving the most accurate structure) value of this parameter in practical settings remains open. (Most of the recent theoretical work in this area focuses on asymptotic consistency results [8, 14, 10, 13, 11]; alternative approaches include a Bayesian treatment of the unknown λ parameter, but are limited to empirical explorations [1].) We investigate the behavior of SINCO versus glasso over a range of λ values. What we observe is that SINCO's greedy steps introduce "important" nonzero elements in the same manner as is achieved by reducing the value of λ. Hence, SINCO can reproduce the regularization-path behavior without actually varying the value of the regularization parameter, instead following the "greedy solution path", i.e., sequentially introducing non-zero elements.

Moreover, while glasso [4] is comparable to, or faster than, SINCO for a relatively small number of variables p, SINCO scales better on sparse or structured problems as p increases, and can outperform glasso.
Finally, experiments on real-life brain imaging (fMRI) data demonstrate that SINCO reconstructs Markov networks that achieve the same or better classification accuracy than its competitors while using a much smaller fraction of edges (non-zero entries of the inverse covariance matrix). Further advantages of our approach include simplicity, efficiency, and relatively straightforward massive parallelization.


2 Problem Formulation

Consider the multivariate Gaussian probability density function over a set of p random variables X = {X_1, ..., X_p},

    p(x) = (2π)^{-p/2} det(Σ)^{-1/2} exp( -(1/2) (x - µ)^T Σ^{-1} (x - µ) ),

where µ is the mean and Σ is the covariance matrix of the distribution. (Without loss of generality we assume that the data is scaled so that µ = 0.) A Markov network (a Markov Random Field, or MRF) represents the conditional independence structure of P(X): an edge (i, j) exists if and only if X_i is conditionally dependent on X_j given all remaining variables [6]. Missing edges correspond to zero entries in the inverse covariance matrix C = Σ^{-1}, and vice versa [6]; thus the problem of structure learning for the above probabilistic graphical model is equivalent to the problem of learning the zero-pattern of the inverse covariance matrix. Note that the inverse of the maximum-likelihood estimate of the covariance matrix Σ (i.e., the empirical covariance matrix A = (1/n) ∑_{i=1}^n x_i^T x_i, where x_i is the i-th sample, i = 1, ..., n), even if it exists, does not typically contain any elements that are exactly zero. Therefore an explicit sparsity-enforcing constraint needs to be added to the estimation process.

A common approach to enforcing sparsity of C is to include as a penalty the (vector) l1-norm of C, which is equivalent to imposing a Laplace prior on C in the maximum-likelihood framework [4, 10, 14] (see [14] for the derivation details). Without loss of generality we make a slightly more general assumption about p(C), allowing different elements of C to have different parameters λ_ij, as is done in [3]. Hence we consider the following formulation:

    max_{C ≻ 0}  (n/2) [ln det(C) - tr(AC)] - ||C||_S                              (1)

Here ||C||_S denotes the sum of absolute values of the elements of the matrix S · C, where · denotes the element-wise product. For example, if S is the product of ρ = (n/2)λ and the matrix of all ones, then the problem reduces to the problem addressed in [4, 10, 14].

The dual of this problem is

    max_{W ≻ 0}  (n/2) ln det(W) - np/2    s.t.  -S ≤ (n/2)(W - A) ≤ S,            (2)

where the inequalities involving the matrices W, A and S are element-wise. The optimality conditions for this pair of primal and dual problems imply that W = C^{-1}, and that (n/2)(W_ij - A_ij) = S_ij if C_ij > 0 and (n/2)(W_ij - A_ij) = -S_ij if C_ij < 0.
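To make the formulation concrete, the following is a minimal NumPy sketch (our illustration, not the authors' code) that forms the empirical covariance A from centered data, builds a uniform weight matrix S with ρ = (n/2)λ, and evaluates the objective of (1); the synthetic data and names such as X and lam are assumptions made for this example only.

```python
# A minimal sketch (not the authors' code): set up A and S for formulation (1)
# and evaluate its objective at a candidate C.
import numpy as np

rng = np.random.default_rng(0)
n, p = 500, 20
X = rng.standard_normal((n, p))        # n centered samples of p variables (mu = 0)

A = (X.T @ X) / n                      # empirical covariance A = (1/n) sum_i x_i^T x_i
lam = 0.1
S = (n * lam / 2.0) * np.ones((p, p))  # uniform weights rho = (n/2)*lambda recover the
                                       # single-parameter l1 penalty of [4, 10, 14]

def objective(C, A, S, n):
    """Objective of (1): (n/2) [ln det C - tr(AC)] - sum_ij S_ij |C_ij|."""
    sign, logdet = np.linalg.slogdet(C)
    if sign <= 0:                      # outside the feasible cone C > 0
        return -np.inf
    return 0.5 * n * (logdet - np.trace(A @ C)) - np.sum(S * np.abs(C))

print(objective(np.eye(p), A, S, n))   # value at the identity starting point
```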
3 The SINCO method

Problem (1) is a special case of a semidefinite programming problem (SDP) [5], which can be solved in polynomial time by interior point methods (IPMs). (In fact, problem (1) is somewhat simpler than a standard SDP because the positive semidefiniteness constraint is enforced by the log det term; from the IPM perspective, however, the problem is similar.) However, each IPM iteration requires O(p^6) operations and O(p^4) memory space, which is very costly. Another reason that using an IPM is undesirable for our problem is that the sparsity pattern is recovered only in the limit, so numerical inaccuracy can interfere with the structure recovery.

As an alternative to IPMs, more efficient approaches were developed for problem (1) in [10], [4], [3] and [7]. The first two are block-coordinate ascent methods applied to the dual formulation, and the last two are first-order gradient ascent methods. The gradient-based methods require O(p^3) operations to compute the gradient at each iteration, and they recover the solution sparsity only in the limit, since gradient steps do not necessarily preserve sparsity. On the other hand, they can have global complexity bounds (see [7]) and perform as well on dense problems as they do on sparse ones. The per-iteration complexity of the block-coordinate descent method in [4] depends on the sparsity of the solutions and in general is not well established; the overall complexity is not known.

Our method, which we refer to as SINCO (for Sparse INverse COvariance), solves the primal problem and uses coordinate ascent, which naturally preserves the sparsity of the solution, by optimizing one diagonal or two (symmetric) off-diagonal entries of the matrix C at each step.


The iteration complexity is not known, just as in the case of [4]. The advantage of this approach is that the solution to each subproblem is available in closed form as a root of a quadratic equation, which can be obtained in a constant number of arithmetic operations, independent of p. Hence, in O(p^2) operations a potential step can be computed for all pairs of symmetric elements (i.e., for all pairs (i, j)). Then the step which provides the best improvement in function value can be chosen, which is the essence of the greedy nature of our approach. Once the step is taken, the update of the gradient information requires O(p^2) operations. Hence, overall, each iteration takes O(p^2) operations. Note that this procedure differs from selecting the step based on the largest element of the gradient (as is done in a traditional Gauss-Southwell method). The fact that each line search is very cheap allows us to find the best step in the same number of operations that it takes to update the gradient. In this property lies the greedy nature of our method and its difference from all other block-coordinate descent methods applied to this problem. This also distinguishes our algorithm from other greedy algorithms, such as Matching Pursuit for, say, the l1-regularized regression problem. We benefit here from the combination of relatively expensive gradient computations and relatively cheap line search steps.

As we will show in our numerical experiments, SINCO, in a serial mode, is comparable to glasso, which is orders of magnitude faster than COVSEL [4]. Also, as demonstrated by our computational experiments, SINCO leads to a lower false-positive error than glasso, since it introduces nonzero elements greedily. On the other hand, it may suffer from introducing too few nonzeros and thus missing some of the true positives, especially on denser networks. (In the limit SINCO produces the same solution as glasso; we exploit here SINCO's ability to stop early with a good solution.)

Perhaps the most interesting consequence of SINCO's greedy nature is that it reproduces the regularization path behavior while using only one value of the regularization parameter λ. We will discuss this property further in Section 4.

3.1 Algorithm description

We now present the method. First, let us consider an equivalent reformulation of problem (1):

    max_{C', C''}  (n/2) [ln det(C' - C'') - tr(A(C' - C''))] - tr(S(C' + C''))
    s.t.  C' ≥ 0,  C'' ≥ 0,  C' - C'' ≻ 0                                          (3)

At each iteration, the matrix C' or the matrix C'' is updated by changing one element on the diagonal or two symmetric off-diagonal elements. This corresponds to a change in C that can be written as C + θ(e_i e_j^T + e_j e_i^T), where i and j are the indices of the elements being changed. The key observation is that, given the matrix W = C^{-1}, the exact line search that optimizes the objective function f(θ) of problem (1) along the direction e_i e_j^T + e_j e_i^T reduces to the solution of a quadratic equation. Hence each such line search takes a constant number of operations. Moreover, given the starting objective value, the new function value at each step can be computed in a constant number of operations.
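To see why each line search is this cheap, one can expand the determinant along the rank-two direction. The short derivation below is our own sketch for the off-diagonal case (with W = C^{-1} and the step sign fixed by the C', C'' split); it is meant to illustrate the claim, not to reproduce the paper's exact update formula.

```latex
% Line search along the off-diagonal direction e_i e_j^T + e_j e_i^T, with W = C^{-1}.
% Matrix determinant lemma (rank-two update) and linearity of the trace give:
\[
\det\!\bigl(C + \theta(e_i e_j^T + e_j e_i^T)\bigr)
   = \det(C)\,\bigl[(1+\theta W_{ij})^2 - \theta^2 W_{ii}W_{jj}\bigr],
\qquad
\mathrm{tr}\!\bigl(A\,\theta(e_i e_j^T + e_j e_i^T)\bigr) = 2\theta A_{ij}.
\]
% Hence, for a step whose sign is fixed by the C', C'' split (so the penalty is linear in theta):
\[
f(\theta) - f(0)
   = \tfrac{n}{2}\Bigl[\ln\!\bigl((1+\theta W_{ij})^2 - \theta^2 W_{ii}W_{jj}\bigr)
     - 2\theta A_{ij}\Bigr] - 2 S_{ij}\,\sigma\,\theta,
\qquad \sigma \in \{-1,+1\}.
\]
% Setting the derivative in theta to zero and multiplying through by the (quadratic, positive)
% determinant factor leaves a quadratic equation in theta, solvable from W_ii, W_jj, W_ij,
% A_ij and S_ij alone; the diagonal case is analogous, with determinant factor (1 + theta W_ii).
```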
This me<strong>an</strong>s that we c<strong>an</strong> per<strong>for</strong>m such line search <strong>for</strong> all (i, j) pairs in O(p 2 ) time,which is linear in the number of unknown variables C ij . We then c<strong>an</strong> choose the step that gives thebest improvement in the value of the objective function. After the step is chosen, the dual matrixW = C −1 <strong>an</strong>d, hence, the objective function gradient, are updated in O(p 2 ) operations.The method we propose works as follows:Algorithm 10. Initialize C ′ = I, C ′′ = 0, W = I1. Form the gradient G ′ = n 2 (W − A) − S <strong>an</strong>d G′′ = −S − n (W + A)22. For each pair (i, j) such that(i) G ′ ij > 0, C ij ′′ = 0, compute the maximum off(θ) <strong>for</strong> θ > 0.(ii) G ′ ij < 0, C ij ′ > 0, compute the maximum off(θ) <strong>for</strong> θ < 0 subject to C ′ ≥ 0(iii) G ′′ij > 0, C ij ′ = 0, compute the maximum off(θ) <strong>for</strong> θ > 0.(iv) G ′′ij < 0, C ij ′′ > 0, compute the maximum of f(θ) <strong>for</strong> θ < 0 subject to C ′′ ≥ 0.4 In the limit <strong>SINCO</strong> produces the same solution as glasso. We exploit here <strong>SINCO</strong>’s ability to stop earlywith a good solution.(3)3


Figure 1: Scale-free networks and E. coli data: SINCO paths when varying tolerance and λ. (Each panel plots the true-positive rate, TP, against the false-positive rate, FP, for the tolerance path and the regularization path; the first panels correspond to scale-free (SF) networks.)

The tolerance path is presented by a line with "o"s. We also applied SINCO with a fixed tolerance of 10^-6 to a range of λ values from 300 to 0.01; the corresponding ROC curves are denoted by lines with "x"s. One may wonder whether SINCO's solution path is actually the same as the regularization path. It is not necessarily the case. The lower curve on the third plot of Figure 1 shows the percentage of positives in solutions on the SINCO path that are not present in the corresponding solution on the regularization path; the higher curve represents the percentage of true positives. We observe that the SINCO and regularization paths largely coincide step by step until the TPs reach their maximum; further nonzeros then fall into the FP category and hence, in a way, are random noise. The final picture in Figure 1 is a comparison of the two paths for the E. coli data set (reduced to p = 133 and N = 300).

Our observations imply that SINCO can be used to greedily select the elements of the graphical model until the desired trade-off between FPs and TPs, the desired number of nonzero elements, or the allocated CPU time is reached. In the limit SINCO solves the same problem as glasso, and hence the limiting numbers of true and false positives are dictated by the choice of λ. But since the real goal is to recover the true nonzero structure of the covariance matrix, it is not necessary to solve problem (1) accurately.
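For reference, the TP and FP rates discussed here (and in the next section) are support-recovery rates over the off-diagonal entries of the estimated inverse covariance matrix; one simple way to compute them is sketched below (an illustrative helper with an assumed numerical tolerance, not code from the paper).

```python
# Support-recovery true/false-positive rates against a known ground truth
# (off-diagonal entries only); an illustrative helper, not the authors' code.
import numpy as np

def support_rates(C_hat, C_true, tol=1e-8):
    p = C_true.shape[0]
    off = ~np.eye(p, dtype=bool)                 # mask out the diagonal
    est = np.abs(C_hat[off]) > tol               # estimated edges
    true = np.abs(C_true[off]) > tol             # true edges
    tp_rate = (est & true).sum() / max(true.sum(), 1)
    fp_rate = (est & ~true).sum() / max((~true).sum(), 1)
    return tp_rate, fp_rate
```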
5 Empirical complexity

Here we discuss the empirical dependence of the runtime of the SINCO algorithm on the choice of stopping tolerance and on the problem size p. We also investigate the effect that increasing p has on the results produced by SINCO and glasso. Both methods were executed on an Intel Core 2 Duo T7700 processor (2.40 GHz).

We consider the situation when p increases. If the number of nonzeros in the true inverse covariance increases with p, then for a proper comparison we need to increase n accordingly. This, in turn, affects the contribution of λ, since the problem scaling changes. We consider two simple settings where we can account for these effects. In the first setting, we increase p while keeping the number of off-diagonal nonzero elements in the randomly generated unstructured network constant (around 300); we do not, therefore, increase n or λ. The CPU time necessary to compute the entire path for λ ranging from 300 to 0.01 is plotted for p values from 100 to 800 in the first plot of Figure 2.

In the second case, we generated block-diagonal matrices of sizes p from 100 to 1100, with 100×100 diagonal blocks, each of which equals the inverse covariance matrix of a 21%-dense structured (scale-free) network from the previous section. We increased n and the appropriate range of λ linearly with p, since the number of nonzero elements also grows linearly with p. The CPU time for this case is shown in the second plot of Figure 2.

The last two plots in Figure 2 explain why the CPU time (in seconds) for SINCO scales up more slowly than that of glasso: for a similarly high true-positive rate, glasso's final solution tends to have a much higher false-positive rate than SINCO's, thus producing a less sparse solution overall.

Finally, we investigated the behavior of SINCO for a fixed value of p as n grows. We expect to obtain larger TP values and a smaller FP error with increasing n. We observed that SINCO achieves in the limit nearly 0% false-positive error and nearly 100% true-positive rate, while glasso's FP error grows with increasing n. This result is, again, a consequence of the path-generating greedy approach utilized by SINCO and its ability (and in some cases tendency) to stop the optimization early, before the FP rate increases.
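As an illustration of this experimental setup (not the authors' generator), a block-diagonal ground-truth precision matrix with a sample size that scales linearly with p can be constructed along the following lines, assuming some base precision block B is given; the small tridiagonal B in the example is a stand-in for the paper's 100×100 scale-free block, and all names are ours.

```python
# Illustrative construction of block-diagonal test problems (not the authors' code):
# tile a base precision block B along the diagonal and scale the sample size with p.
import numpy as np

def make_block_problem(B, num_blocks, samples_per_block, seed=0):
    """Ground-truth precision = block-diag(B, ..., B); data sampled from N(0, inv(C_true))."""
    rng = np.random.default_rng(seed)
    p = B.shape[0] * num_blocks
    n = samples_per_block * num_blocks            # n grows linearly with p
    C_true = np.kron(np.eye(num_blocks), B)       # block-diagonal precision matrix
    cov = np.linalg.inv(C_true)
    X = rng.multivariate_normal(np.zeros(p), cov, size=n)
    A = (X.T @ X) / n                             # empirical covariance of the sample
    return C_true, A, n

# Example with a small stand-in block (positive definite tridiagonal matrix):
B = np.eye(5) + 0.4 * np.diag(np.ones(4), 1) + 0.4 * np.diag(np.ones(4), -1)
C_true, A, n = make_block_problem(B, num_blocks=3, samples_per_block=50)
```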


Figure 2: CPU time: SINCO vs glasso on (a) random networks (N = 500, fixed range of λ) and (b) scale-free networks (density 21%, N and λ scaled with p by the same factor, N = 500 for p = 100). ROC curves: SINCO vs glasso on (c) random networks (N = 500, fixed range of λ) and (d) scale-free networks (density 21%, N and λ scaled with p by the same factor, N = 500 for p = 100). (Panels (a) and (b) plot CPU time in seconds against p; panels (c) and (d) plot true positives against false positives.)

We also applied SINCO to fMRI data for the mind-state prediction problem described in [9] (for more details, see the StarPlus website http://www.cs.cmu.edu/afs/cs.cmu.edu/project/theo-81/www/). The purpose of that experiment was to show that the fewer nonzero elements selected by SINCO were sufficient to achieve the same prediction accuracy as was obtained by the much denser COVSEL solution to the same problem, which suggests that COVSEL learns many links that are not essential for the discriminative ability of the classifier.

References

[1] N. Bani Asadi, I. Rish, K. Scheinberg, D. Kanevsky, and B. Ramabhadran. A MAP approach to learning sparse Gaussian Markov networks. In Proc. of ICASSP 2009, April 2009.
[2] A.-L. Barabasi and R. Albert. Emergence of scaling in random networks. Science, 286:509-512, 1999.
[3] J. Duchi, S. Gould, and D. Koller. Projected subgradient methods for learning sparse Gaussians. In Proc. of UAI-08, 2008.
[4] J. Friedman, T. Hastie, and R. Tibshirani. Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 2007.
[5] H. Wolkowicz, R. Saigal, and L. Vandenberghe, editors. Handbook of Semidefinite Programming. Kluwer Academic Publishers, 2000.
[6] S. Lauritzen. Graphical Models. Oxford University Press, 1996.
[7] Z. Lu. Smooth optimization approach for sparse covariance selection. SIAM Journal on Optimization, 19(4):1807-1827, 2009.
[8] N. Meinshausen and P. Buhlmann.
High dimensional graphs and variable selection with the Lasso. Annals of Statistics, 34(3):1436-1462, 2006.
[9] T.M. Mitchell, R. Hutchinson, R.S. Niculescu, F. Pereira, X. Wang, M. Just, and S. Newman. Learning to decode cognitive states from brain images. Machine Learning, 57:145-175, 2004.
[10] O. Banerjee, L. El Ghaoui, and A. d'Aspremont. Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data. Journal of Machine Learning Research, 9:485-516, March 2008.
[11] P. Ravikumar, M. J. Wainwright, G. Raskutti, and B. Yu. Model selection in Gaussian graphical models: high-dimensional consistency of l1-regularized MLE. In NIPS-08, 2008.
[12] G. Stolovitzky, R.J. Prill, and A. Califano. Lessons from the DREAM2 challenges. Annals of the New York Academy of Sciences, 1158:159-195, 2009.
[13] M. Wainwright, P. Ravikumar, and J. Lafferty. High-dimensional graphical model selection using l1-regularized logistic regression. In NIPS 19, pages 1465-1472, 2007.
[14] M. Yuan and Y. Lin. Model selection and estimation in the Gaussian graphical model. Biometrika, 94(1):19-35, 2007.
