Convex Optimization Overview (cnt'd)

…her the fewest points. Here, (P) is called the primal optimization problem whereas (D) is called the dual optimization problem. Rows containing +∞ values correspond to strategies where the student has flagrantly violated the TAs' demand that all problems be attempted; for reasons which will become clear later, we refer to these rows as being primal-infeasible.

In the example, the value of the dual problem is lower than that of the primal problem, i.e., d∗ = 0 < 5 = p∗. This intuitively makes sense: the second player in this adversarial game has the advantage of knowing his/her opponent's strategy. This principle, however, holds more generally:

Theorem 2.1 (Weak duality). For any matrix A = (a_ij) ∈ R^{m×n}, it is always the case that

$$\max_j \min_i a_{ij} = d^* \le p^* = \min_i \max_j a_{ij}.$$

Proof. Let (i_d, j_d) be the row and column associated with d∗, and let (i_p, j_p) be the row and column associated with p∗. We have

$$d^* = a_{i_d j_d} \le a_{i_p j_d} \le a_{i_p j_p} = p^*.$$

Here, the first inequality follows from the fact that a_{i_d j_d} is the smallest element in the j_d-th column (i.e., i_d was the strategy chosen by the student after the TAs chose problem j_d, and hence it must correspond to the fewest points lost in that column). Similarly, the second inequality follows from the fact that a_{i_p j_p} is the largest element in the i_p-th row (i.e., j_p was the problem chosen by the TAs after the student picked strategy i_p, so it must correspond to the most points lost in that row).

2.2 Duality in optimization

The task of constrained optimization, it turns out, relates closely to the adversarial game described in the previous section. To see the connection, first recall our original optimization problem,

$$\begin{aligned} \underset{x}{\text{minimize}} \quad & f(x) \\ \text{subject to} \quad & g_i(x) \le 0, \quad i = 1, \dots, m, \\ & h_i(x) = 0, \quad i = 1, \dots, p. \end{aligned}$$

Define the generalized Lagrangian to be

$$L(x, \lambda, \nu) := f(x) + \sum_{i=1}^{m} \lambda_i g_i(x) + \sum_{i=1}^{p} \nu_i h_i(x).$$

Here, the variables λ and ν are called the dual variables (or Lagrange multipliers). Analogously, the variables x are known as the primal variables.
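As a quick sanity check of Theorem 2.1, the following minimal sketch (my own illustration, using a randomly generated payoff matrix rather than the grading example) verifies weak duality numerically:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.integers(0, 10, size=(4, 5))  # arbitrary 4x5 payoff matrix

# d* = max_j min_i a_ij: the column player commits first.
d_star = A.min(axis=0).max()
# p* = min_i max_j a_ij: the row player commits first.
p_star = A.max(axis=1).min()

assert d_star <= p_star  # weak duality always holds
print(d_star, p_star)
```

Running this with any matrix (not just this seed) never trips the assertion, which is exactly the content of the theorem.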


The correspondence between primal/dual optimization and game playing can be pictured informally using an infinite matrix whose rows are indexed by x ∈ R^n and whose columns are indexed by (λ, ν) ∈ R^m_+ × R^p (i.e., λ_i ≥ 0 for i = 1, . . . , m). In particular, we have

$$A = \begin{bmatrix} \ddots & \vdots & \\ \cdots & L(x, \lambda, \nu) & \cdots \\ & \vdots & \ddots \end{bmatrix}.$$

Here, the "student" manipulates the primal variables x in order to minimize the Lagrangian L(x, λ, ν), while the "TAs" manipulate the dual variables (λ, ν) in order to maximize the Lagrangian.

To see the relationship between this game and the original optimization problem, we formulate the following primal problem:

$$p^* = \min_x \max_{\lambda, \nu : \lambda_i \ge 0} L(x, \lambda, \nu) = \min_x \theta_P(x) \tag{P'}$$

where θ_P(x) := max_{λ,ν : λ_i ≥ 0} L(x, λ, ν). Computing p∗ is equivalent to our original convex optimization primal in the following sense: for any candidate solution x,

• if g_i(x) > 0 for some i ∈ {1, . . . , m}, then setting λ_i = ∞ gives θ_P(x) = ∞;
• if h_i(x) ≠ 0 for some i ∈ {1, . . . , p}, then setting ν_i = ∞ · Sign(h_i(x)) gives θ_P(x) = ∞;
• if x is feasible (i.e., x obeys all the constraints of our original optimization problem), then θ_P(x) = f(x), where the maximum is obtained, for example, by setting all of the λ_i's and ν_i's to zero.

Intuitively then, θ_P(x) behaves conceptually like an "unconstrained" version of the original constrained optimization problem, in which the infeasible region of f is "carved away" by forcing θ_P(x) = ∞ for any infeasible x; thus, only points in the feasible region are left as candidate minimizers (a small numerical illustration of this behavior appears at the end of this subsection). This idea of using penalties to ensure that minimizers stay in the feasible region will come up later when we talk about barrier algorithms for convex optimization.

By analogy to the CS 229 grading example, we can form the following dual problem:

$$d^* = \max_{\lambda, \nu : \lambda_i \ge 0} \min_x L(x, \lambda, \nu) = \max_{\lambda, \nu : \lambda_i \ge 0} \theta_D(\lambda, \nu) \tag{D'}$$

where θ_D(λ, ν) := min_x L(x, λ, ν). Dual problems can often be easier to solve than their corresponding primal problems. In the case of SVMs, for instance, SMO is a dual optimization algorithm which considers joint optimization of pairs of dual variables. Its simple form derives largely from the simplicity of the dual objective and of the corresponding constraints on the dual variables. Primal-based SVM solutions are indeed possible, but when the number of training examples is large and the kernel matrix K of inner products K_ij = K(x^(i), x^(j)) is large, dual-based optimization can be considerably more efficient.
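To make the three cases above concrete, consider the toy problem of minimizing x² subject to 1 − x ≤ 0 (my own example, not from the notes). For feasible x the inner maximization is attained at λ = 0, so θ_P(x) = f(x); for infeasible x, letting λ grow without bound drives θ_P(x) toward ∞. A minimal sketch that approximates the supremum by scanning increasingly large values of λ:

```python
import numpy as np

f = lambda x: x**2         # objective
g = lambda x: 1.0 - x      # inequality constraint, g(x) <= 0

def theta_P(x, lambdas=np.linspace(0.0, 1e6, 7)):
    """Approximate max over lambda >= 0 of L(x, lambda) = f(x) + lambda * g(x).

    The true sup is f(x) when g(x) <= 0 and +inf otherwise; scanning a
    grid of large lambdas makes the blow-up visible numerically.
    """
    return max(f(x) + lam * g(x) for lam in lambdas)

print(theta_P(2.0))   # feasible: g(2) = -1, so theta_P(2) = f(2) = 4.0
print(theta_P(0.5))   # infeasible: g(0.5) = 0.5, so theta_P explodes
```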


Using an argument essentially identical to that presented in Theorem 2.1, we can show that in this setting, we again have d∗ ≤ p∗. This is the property of weak duality for general optimization problems. Weak duality can be particularly useful in the design of optimization algorithms. For example, suppose that during the course of an optimization algorithm we have a candidate primal solution x and a dual-feasible vector (λ, ν) such that θ_P(x) − θ_D(λ, ν) ≤ ε. From weak duality, we have that

$$\theta_D(\lambda, \nu) \le d^* \le p^* \le \theta_P(x),$$

implying that x and (λ, ν) must be ε-optimal (i.e., their objective values differ by no more than ε from the objective values of the true optima x∗ and (λ∗, ν∗)).

In practice, the dual objective θ_D(λ, ν) can often be found in closed form, thus allowing the dual problem (D') to depend only on the dual variables λ and ν. When the Lagrangian is differentiable with respect to x, a closed form for θ_D(λ, ν) can often be found by setting the gradient of the Lagrangian to zero, so as to ensure that the Lagrangian is minimized with respect to x.² An example derivation of the dual problem for the ℓ₁ soft-margin SVM is shown in the Appendix.

² Often, differentiating the Lagrangian with respect to x leads to the generation of additional requirements on the dual variables that must hold at any fixed point of the Lagrangian with respect to x. When these constraints are not satisfied, one can show that the Lagrangian is unbounded below (i.e., θ_D(λ, ν) = −∞). Since such points are clearly not optimal solutions for the dual problem, we can simply exclude them from the domain of the dual problem altogether by adding the derived constraints to the existing constraints of the dual problem. An example of this is the derived constraint $\sum_{i=1}^{m} \alpha_i y^{(i)} = 0$ in the SVM formulation. This procedure of incorporating derived constraints into the dual problem is known as making dual constraints explicit (see [1], page 224).
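The duality gap therefore gives a certifiable stopping rule. A minimal sketch (the callables theta_P and theta_D are hypothetical problem-specific functions, not part of the notes):

```python
def is_eps_optimal(theta_P, theta_D, x, lam, nu, eps=1e-6):
    """Certify eps-optimality from the duality gap.

    Since theta_D(lam, nu) <= d* <= p* <= theta_P(x), a gap of at most
    eps proves both candidate solutions are within eps of optimal.
    """
    return theta_P(x) - theta_D(lam, nu) <= eps
```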


2.3 Strong duality

For any primal/dual optimization problems, weak duality will always hold. In some cases, however, the inequality d∗ ≤ p∗ may be replaced with equality, i.e., d∗ = p∗; this latter condition is known as strong duality. Strong duality does not hold in general. When it does, however, the lower-bound property described in the previous section provides a useful termination criterion for optimization algorithms. In particular, we can design algorithms which simultaneously optimize both the primal and dual problems. Once the candidate solutions x of the primal problem and (λ, ν) of the dual problem obey θ_P(x) − θ_D(λ, ν) ≤ ε, then we know that both solutions are ε-accurate. This is guaranteed to happen provided our optimization algorithm works properly, since strong duality guarantees that the optimal primal and dual values are equal.

Conditions which guarantee strong duality for convex optimization problems are known as constraint qualifications. The most commonly invoked constraint qualification, for example, is Slater's condition:

Theorem 2.2. Consider a convex optimization problem of the form (1), whose corresponding primal and dual problems are given by (P') and (D'). If there exists a primal feasible x for which each inequality constraint is strictly satisfied (i.e., g_i(x) < 0), then d∗ = p∗.³

The proof of this theorem is beyond the scope of this course. We will, however, point out its application to the soft-margin SVMs described in class. Recall that soft-margin SVMs were found by solving

$$\begin{aligned} \underset{w,b,\xi}{\text{minimize}} \quad & \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{m} \xi_i \\ \text{subject to} \quad & y^{(i)}(w^T x^{(i)} + b) \ge 1 - \xi_i, \quad i = 1, \dots, m, \\ & \xi_i \ge 0, \quad i = 1, \dots, m. \end{aligned}$$

Slater's condition applies provided we can find at least one primal feasible setting of w, b, and ξ where all inequalities are strict. It is easy to verify that w = 0, b = 0, ξ = 2·1 satisfies these conditions (where 0 and 1 denote the vectors of all 0's and all 1's, respectively), since

$$y^{(i)}(w^T x^{(i)} + b) = y^{(i)}(\mathbf{0}^T x^{(i)} + 0) = 0 > -1 = 1 - 2 = 1 - \xi_i, \quad i = 1, \dots, m,$$

and the remaining m inequalities (ξ_i = 2 > 0) are trivially strictly satisfied. Hence, strong duality holds, so the optimal values of the primal and dual soft-margin SVM problems will be equal.

2.4 The KKT conditions

In the case of differentiable unconstrained convex optimization problems, setting the gradient to "zero" provides a simple means for identifying candidate local optima. For constrained convex programming, do similar criteria exist for characterizing the optima of primal/dual optimization problems? The answer, it turns out, is provided by a set of requirements known as the Karush-Kuhn-Tucker (KKT) necessary and sufficient conditions (see [1], pages 242-244).

Suppose that the constraint functions g_1, . . . , g_m, h_1, . . . , h_p are not only convex (the h_i's must be affine) but also differentiable.

Theorem 2.3. If x̃ is primal feasible and (λ̃, ν̃) are dual feasible, and if

$$\nabla_x L(\tilde{x}, \tilde{\lambda}, \tilde{\nu}) = 0, \tag{KKT1}$$
$$\tilde{\lambda}_i g_i(\tilde{x}) = 0, \quad i = 1, \dots, m, \tag{KKT2}$$

then x̃ is primal optimal, (λ̃, ν̃) are dual optimal, and strong duality holds.

Theorem 2.4. If Slater's condition holds, then the conditions of Theorem 2.3 are necessary for any (x∗, λ∗, ν∗) such that x∗ is primal optimal and (λ∗, ν∗) are dual optimal.

³ One can actually show a more general version of Slater's condition, which requires only strict satisfaction of non-affine inequality constraints (while allowing affine inequalities to be satisfied with equality). See [1], page 226.
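As a minimal numerical illustration of Theorem 2.3 (the toy problem is my own, not from the notes): for minimizing x² subject to 1 − x ≤ 0, the Lagrangian is L(x, λ) = x² + λ(1 − x), and the candidate pair x̃ = 1, λ̃ = 2 passes both KKT checks, certifying optimality:

```python
import numpy as np

g = lambda x: 1.0 - x                  # inequality constraint, g(x) <= 0
grad_L = lambda x, lam: 2 * x - lam    # d/dx [x^2 + lam * (1 - x)]

x_tilde, lam_tilde = 1.0, 2.0

assert g(x_tilde) <= 0 and lam_tilde >= 0         # primal/dual feasibility
assert np.isclose(grad_L(x_tilde, lam_tilde), 0)  # KKT1: stationarity
assert np.isclose(lam_tilde * g(x_tilde), 0)      # KKT2: complementary slackness
print("KKT conditions hold; x =", x_tilde, "is optimal")
```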


3 Algorithms for convex optimization

Thus far, we have talked about convex optimization problems and their properties. But how does one solve a convex optimization problem in practice? In this section, we describe a generic strategy for solving convex optimization problems known as the interior-point method. This method combines a safeguarded variant of Newton's algorithm with a "barrier" technique for enforcing inequality constraints.

3.1 Unconstrained optimization

We consider first the problem of unconstrained optimization, i.e.,

$$\underset{x}{\text{minimize}} \quad f(x).$$

In Newton's algorithm for unconstrained optimization, we consider the Taylor approximation f̃ of the function f, centered at the current iterate x_t. Discarding terms of order higher than two, we have

$$\tilde{f}(x) = f(x_t) + \nabla_x f(x_t)^T (x - x_t) + \frac{1}{2} (x - x_t)^T \nabla_x^2 f(x_t) (x - x_t).$$

To minimize f̃(x), we can set its gradient to zero. In particular, if x_nt denotes the minimum of f̃(x), then

$$\begin{aligned} \nabla_x f(x_t) + \nabla_x^2 f(x_t)(x_{nt} - x_t) &= 0 \\ \nabla_x^2 f(x_t)(x_{nt} - x_t) &= -\nabla_x f(x_t) \\ x_{nt} - x_t &= -\nabla_x^2 f(x_t)^{-1} \nabla_x f(x_t) \\ x_{nt} &= x_t - \nabla_x^2 f(x_t)^{-1} \nabla_x f(x_t) \end{aligned}$$

assuming ∇²ₓf(x_t) is positive definite (and hence, full rank). This, of course, is the standard Newton algorithm for unconstrained minimization.

While Newton's method converges quickly if given an initial point near the minimum, for points far from the minimum, Newton's method can sometimes diverge (as you may have discovered in problem 1 of Problem Set #1 if you picked an unfortunate initial point!). A simple fix for this behavior is to use a line-search procedure. Define the search direction d to be

$$d := \nabla_x^2 f(x_t)^{-1} \nabla_x f(x_t).$$

A line-search procedure is an algorithm for finding an appropriate step size γ ≥ 0 such that the iteration

$$x_{t+1} = x_t - \gamma \cdot d$$

will ensure that the function f decreases by a sufficient amount (relative to the size of the step taken) during each iteration.
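Before adding the line search, here is a minimal sketch of the pure Newton iteration (the quadratic test function and starting point are my own choices; a careful implementation solves the linear system rather than forming the Hessian inverse):

```python
import numpy as np

def newton(grad, hess, x0, tol=1e-8, max_iter=50):
    """Pure Newton's method: x <- x - H(x)^{-1} grad(x)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        d = np.linalg.solve(hess(x), g)  # Newton direction, H d = g
        x = x - d
    return x

# Example: f(x) = x1^2 + 5*x2^2 has its minimum at the origin.
grad = lambda x: np.array([2 * x[0], 10 * x[1]])
hess = lambda x: np.diag([2.0, 10.0])
print(newton(grad, hess, [3.0, -4.0]))  # one step suffices for a quadratic
```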


One simple yet effective method for doing this is called a backtracking line search. In this method, one initially sets γ to 1 and then iteratively reduces γ by a multiplicative factor β until f(x_t − γ·d) is sufficiently smaller than f(x_t):

Backtracking line search
• Choose α ∈ (0, 0.5), β ∈ (0, 1).
• Set γ ← 1.
• While f(x_t − γ·d) > f(x_t) − γ·α·∇_x f(x_t)^T d, do γ ← βγ.
• Return γ.

Since the function f is known to decrease locally near x_t in the direction of −d, such a step will be found, provided γ is small enough. For more details, see [1], pages 464-466 (a runnable sketch pairing this search with the Newton direction appears below).

In order to use Newton's method, one must be able to compute and invert the Hessian matrix ∇²ₓf(x_t), or equivalently, compute the search direction d indirectly without forming the Hessian. For some problems, the number of primal variables x is sufficiently large that computing the Hessian can be very difficult. In many cases, this can be dealt with by clever use of linear algebra. In other cases, however, we can resort to other nonlinear minimization schemes, such as quasi-Newton methods, which initially behave like gradient descent but gradually construct approximations of the inverse Hessian based on the gradients observed throughout the course of the optimization.⁵ Alternatively, nonlinear conjugate gradient schemes (which augment the standard conjugate gradient (CG) algorithm for solving linear least-squares systems with a line search) provide another generic blackbox tool for multivariable function minimization which is simple to implement, yet highly effective in practice.⁶

3.2 Inequality-constrained optimization

Using our tools for unconstrained optimization described in the previous section, we now tackle the (slightly) harder problem of constrained optimization. For simplicity, we consider convex optimization problems without equality constraints⁷, i.e., problems of the form

$$\begin{aligned} \underset{x}{\text{minimize}} \quad & f(x) \\ \text{subject to} \quad & g_i(x) \le 0, \quad i = 1, \dots, m. \end{aligned}$$

⁵ For more information on quasi-Newton methods, the standard reference is Jorge Nocedal and Stephen J. Wright's textbook, Numerical Optimization.
⁶ For an excellent tutorial on the conjugate gradient method, see Jonathan Shewchuk's tutorial, available at: http://www.cs.cmu.edu/~quake-papers/painless-conjugate-gradient.pdf
⁷ In practice, there are many ways of dealing with equality constraints. Sometimes, we can eliminate equality constraints by either reparameterizing the original primal problem, or converting to the dual problem. A more general strategy is to rely on equality-constrained variants of Newton's algorithm which ensure that the equality constraints are satisfied at every iteration of the optimization. For a more complete treatment of this topic, see [1], Chapter 10.
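Following up on the backtracking line search above, here is a minimal sketch of the safeguarded (damped) Newton iteration it enables; the parameter defaults α = 0.25, β = 0.5 are my own illustrative choices:

```python
import numpy as np

def backtracking(f, grad, x, d, alpha=0.25, beta=0.5):
    """Shrink gamma until the sufficient-decrease (Armijo) condition
    holds for the damped Newton step x - gamma * d."""
    gamma = 1.0
    g = grad(x)
    while f(x - gamma * d) > f(x) - gamma * alpha * g @ d:
        gamma *= beta
    return gamma

def damped_newton(f, grad, hess, x0, tol=1e-8, max_iter=100):
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        d = np.linalg.solve(hess(x), g)      # Newton direction
        gamma = backtracking(f, grad, x, d)  # safeguarded step size
        x = x - gamma * d
    return x
```

On the quadratic example from the previous sketch, the full step γ = 1 already satisfies the sufficient-decrease condition, so the damped iteration reduces to pure Newton; on poorly scaled or non-quadratic functions, the backtracking loop is what prevents divergence.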


Log-barrier optimization
• Choose µ > 1, t > 0.
• x ← x₀.
• Repeat until convergence:
  (a) Compute $x' = \arg\min_x \; f(x) - \frac{1}{t} \sum_{i=1}^{m} \log(-g_i(x))$ using x as the initial point.
  (b) t ← µ·t, x ← x'.

One might expect that as t increases, the difficulty of solving each unconstrained minimization problem also increases due to numerical issues or ill-conditioning of the optimization problem. Surprisingly, Nesterov and Nemirovski showed in 1994 that this is not the case for certain types of barrier functions, including the log-barrier; in particular, by using an appropriate barrier function, one obtains a general convex optimization algorithm which takes time polynomial in the dimensionality of the optimization variables and the desired accuracy!

4 Directions for further exploration

In many real-world tasks, 90% of the challenge involves figuring out how to write an optimization problem in a convex form. Once the correct form has been found, a number of pre-existing software packages for convex optimization have been well-tuned to handle different specific types of optimization problems. The following constitute a small sample of the available tools:

• commercial packages: CPLEX, MOSEK
• MATLAB-based: CVX, Optimization Toolbox (linprog, quadprog), SeDuMi
• libraries: CVXOPT (Python), GLPK (C), COIN-OR (C)
• SVMs: LIBSVM, SVM-light
• machine learning: Weka (Java)

In particular, we specifically point out CVX as an easy-to-use generic tool for solving convex optimization problems using MATLAB, and CVXOPT as a powerful Python-based library which runs independently of MATLAB.⁹ If you're interested in looking at some of the other packages listed above, they are easy to find with a web search. In short, if you need a specific convex optimization algorithm, pre-existing software packages provide a rapid way to prototype your idea without having to deal with the numerical trickiness of implementing your own complete convex optimization routines.

⁹ CVX is available at http://www.stanford.edu/~boyd/cvx and CVXOPT is available at http://www.ee.ucla.edu/~vandenbe/cvxopt/.
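As a quick taste of these tools, here is a minimal CVXOPT sketch (assuming the cvxopt package is installed; the toy QP instance and its numbers are my own illustrative choices). CVXOPT's QP solver minimizes (1/2)xᵀPx + qᵀx subject to Gx ≤ h:

```python
from cvxopt import matrix, solvers

# Toy QP: minimize x1^2 + x2^2 - 2*x1 - 6*x2  subject to  x >= 0.
# cvxopt.matrix is column-major: each inner list below is one column.
P = matrix([[2.0, 0.0], [0.0, 2.0]])    # P = 2*I
q = matrix([-2.0, -6.0])
G = matrix([[-1.0, 0.0], [0.0, -1.0]])  # -x <= 0 encodes x >= 0
h = matrix([0.0, 0.0])

sol = solvers.qp(P, q, G, h)
print(sol['x'])   # optimal point, here approximately (1, 3)
print(sol['z'])   # dual variables (Lagrange multipliers) for Gx <= h
```

Note how the solver returns the dual variables alongside the primal solution; the duality gap between the two is exactly the ε-optimality certificate discussed in Section 2.2.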


Also, if you find this material fascinating, make sure to check out Stephen Boyd's class, EE364: Convex Optimization I, which will be offered during the Winter Quarter. The textbook for the class (listed as [1] in the References) has a wealth of information about convex optimization and is available for browsing online.

References

[1] Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2004. Online: http://www.stanford.edu/~boyd/cvxbook/

Appendix: The soft-margin SVM

To see the primal/dual action in practice, we derive the dual of the soft-margin SVM primal presented in class, and the corresponding KKT complementarity conditions. We have

$$\begin{aligned} \underset{w,b,\xi}{\text{minimize}} \quad & \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{m} \xi_i \\ \text{subject to} \quad & y^{(i)}(w^T x^{(i)} + b) \ge 1 - \xi_i, \quad i = 1, \dots, m, \\ & \xi_i \ge 0, \quad i = 1, \dots, m. \end{aligned}$$

First, we put this into our standard form, with "≤ 0" inequality constraints and no equality constraints. That is,

$$\begin{aligned} \underset{w,b,\xi}{\text{minimize}} \quad & \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{m} \xi_i \\ \text{subject to} \quad & 1 - \xi_i - y^{(i)}(w^T x^{(i)} + b) \le 0, \quad i = 1, \dots, m, \\ & -\xi_i \le 0, \quad i = 1, \dots, m. \end{aligned}$$

Next, we form the generalized Lagrangian,¹⁰

$$L(w, b, \xi, \alpha, \beta) = \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{m} \xi_i + \sum_{i=1}^{m} \alpha_i \left(1 - \xi_i - y^{(i)}(w^T x^{(i)} + b)\right) - \sum_{i=1}^{m} \beta_i \xi_i,$$

which gives the primal and dual optimization problems:

$$\min_{w,b,\xi} \theta_P(w, b, \xi) \quad \text{where} \quad \theta_P(w, b, \xi) := \max_{\alpha, \beta : \alpha_i \ge 0, \beta_i \ge 0} L(w, b, \xi, \alpha, \beta), \tag{SVM-P}$$

$$\max_{\alpha, \beta : \alpha_i \ge 0, \beta_i \ge 0} \theta_D(\alpha, \beta) \quad \text{where} \quad \theta_D(\alpha, \beta) := \min_{w,b,\xi} L(w, b, \xi, \alpha, \beta). \tag{SVM-D}$$

To get the dual problem in the form shown in the lecture notes, however, we still have a little more work to do. In particular:

¹⁰ Here, it is important to note that (w, b, ξ) collectively play the role of the x primal variables. Similarly, (α, β) collectively play the role of the λ dual variables used for inequality constraints. There are no "ν" dual variables here since there are no equality constraints in this problem.


1. Eliminating the primal variables. To eliminate the primal variables from the dual problem, we compute θ_D(α, β) by noticing that

$$\theta_D(\alpha, \beta) = \min_{w,b,\xi} L(w, b, \xi, \alpha, \beta)$$

is an unconstrained optimization problem, where the objective function L(w, b, ξ, α, β) is differentiable. Therefore, for any fixed (α, β), if (ŵ, b̂, ξ̂) minimize the Lagrangian, it must be the case that

$$\nabla_w L(\hat{w}, \hat{b}, \hat{\xi}, \alpha, \beta) = \hat{w} - \sum_{i=1}^{m} \alpha_i y^{(i)} x^{(i)} = 0 \tag{5}$$

$$\frac{\partial}{\partial b} L(\hat{w}, \hat{b}, \hat{\xi}, \alpha, \beta) = -\sum_{i=1}^{m} \alpha_i y^{(i)} = 0 \tag{6}$$

$$\frac{\partial}{\partial \xi_i} L(\hat{w}, \hat{b}, \hat{\xi}, \alpha, \beta) = C - \alpha_i - \beta_i = 0. \tag{7}$$

Adding (6) and (7) to the constraints of our dual optimization problem, we obtain

$$\begin{aligned} \theta_D(\alpha, \beta) &= L(\hat{w}, \hat{b}, \hat{\xi}, \alpha, \beta) \\ &= \frac{1}{2}\|\hat{w}\|^2 + C\sum_{i=1}^{m} \hat{\xi}_i + \sum_{i=1}^{m} \alpha_i \left(1 - \hat{\xi}_i - y^{(i)}(\hat{w}^T x^{(i)} + \hat{b})\right) - \sum_{i=1}^{m} \beta_i \hat{\xi}_i \\ &= \frac{1}{2}\|\hat{w}\|^2 + C\sum_{i=1}^{m} \hat{\xi}_i + \sum_{i=1}^{m} \alpha_i \left(1 - \hat{\xi}_i - y^{(i)}(\hat{w}^T x^{(i)})\right) - \sum_{i=1}^{m} \beta_i \hat{\xi}_i \\ &= \frac{1}{2}\|\hat{w}\|^2 + \sum_{i=1}^{m} \alpha_i \left(1 - y^{(i)}(\hat{w}^T x^{(i)})\right), \end{aligned}$$

where the first equality follows from the optimality of (ŵ, b̂, ξ̂) for fixed (α, β), the second equality uses the definition of the generalized Lagrangian, and the third and fourth equalities follow from (6) and (7), respectively. Finally, to use (5), observe that

$$\begin{aligned} \frac{1}{2}\|\hat{w}\|^2 + \sum_{i=1}^{m} \alpha_i \left(1 - y^{(i)}(\hat{w}^T x^{(i)})\right) &= \sum_{i=1}^{m} \alpha_i + \frac{1}{2}\|\hat{w}\|^2 - \hat{w}^T \sum_{i=1}^{m} \alpha_i y^{(i)} x^{(i)} \\ &= \sum_{i=1}^{m} \alpha_i + \frac{1}{2}\|\hat{w}\|^2 - \|\hat{w}\|^2 \\ &= \sum_{i=1}^{m} \alpha_i - \frac{1}{2}\|\hat{w}\|^2 \\ &= \sum_{i=1}^{m} \alpha_i - \frac{1}{2}\sum_{i=1}^{m}\sum_{j=1}^{m} \alpha_i \alpha_j y^{(i)} y^{(j)} \langle x^{(i)}, x^{(j)} \rangle. \end{aligned}$$


Therefore, our dual problem (with no more primal variables) is simply

$$\begin{aligned} \underset{\alpha,\beta}{\text{maximize}} \quad & \sum_{i=1}^{m} \alpha_i - \frac{1}{2}\sum_{i=1}^{m}\sum_{j=1}^{m} \alpha_i \alpha_j y^{(i)} y^{(j)} \langle x^{(i)}, x^{(j)} \rangle \\ \text{subject to} \quad & \alpha_i \ge 0, \quad i = 1, \dots, m, \\ & \beta_i \ge 0, \quad i = 1, \dots, m, \\ & \alpha_i + \beta_i = C, \quad i = 1, \dots, m, \\ & \sum_{i=1}^{m} \alpha_i y^{(i)} = 0. \end{aligned}$$

2. KKT complementarity. KKT complementarity requires that for any primal optimal (w∗, b∗, ξ∗) and dual optimal (α∗, β∗),

$$\alpha_i^* \left(1 - \xi_i^* - y^{(i)}(w^{*T} x^{(i)} + b^*)\right) = 0,$$
$$\beta_i^* \xi_i^* = 0,$$

for i = 1, . . . , m. From the first condition, we see that if α_i∗ > 0, then in order for the product to be zero, we must have 1 − ξ_i∗ − y^(i)(w∗ᵀx^(i) + b∗) = 0. It follows that

$$y^{(i)}(w^{*T} x^{(i)} + b^*) \le 1,$$

since ξ∗ ≥ 0 by primal feasibility. Similarly, if β_i∗ > 0, then ξ_i∗ = 0 to ensure complementarity. From the primal constraint, y^(i)(wᵀx^(i) + b) ≥ 1 − ξ_i, it follows that

$$y^{(i)}(w^{*T} x^{(i)} + b^*) \ge 1.$$

Finally, since β_i∗ > 0 is equivalent to α_i∗ < C (since α_i∗ + β_i∗ = C), we can summarize the KKT conditions as follows:

$$\begin{aligned} \alpha_i^* = 0 &\;\Rightarrow\; y^{(i)}(w^{*T} x^{(i)} + b^*) \ge 1, \\ 0 < \alpha_i^* < C &\;\Rightarrow\; y^{(i)}(w^{*T} x^{(i)} + b^*) = 1, \\ \alpha_i^* = C &\;\Rightarrow\; y^{(i)}(w^{*T} x^{(i)} + b^*) \le 1. \end{aligned}$$

3. Simplification. We can tidy up our dual problem slightly by observing that each pair of constraints of the form

$$\beta_i \ge 0, \qquad \alpha_i + \beta_i = C$$

is equivalent to the single constraint α_i ≤ C; that is, if we solve the optimization problem

$$\begin{aligned} \underset{\alpha}{\text{maximize}} \quad & \sum_{i=1}^{m} \alpha_i - \frac{1}{2}\sum_{i=1}^{m}\sum_{j=1}^{m} \alpha_i \alpha_j y^{(i)} y^{(j)} \langle x^{(i)}, x^{(j)} \rangle \\ \text{subject to} \quad & 0 \le \alpha_i \le C, \quad i = 1, \dots, m, \\ & \sum_{i=1}^{m} \alpha_i y^{(i)} = 0, \end{aligned} \tag{8}$$

and subsequently set β_i = C − α_i, then it follows that (α, β) will be optimal for the previous dual problem above. This last form, indeed, is the form of the soft-margin SVM dual given in the lecture notes.
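To close the loop, here is a minimal sketch that solves the dual (8) on a tiny synthetic dataset using CVXOPT's QP solver (assuming cvxopt is installed; the data, value of C, and variable names are my own illustrative choices). Since the solver minimizes, we negate the dual objective:

```python
import numpy as np
from cvxopt import matrix, solvers

# Tiny linearly separable dataset: two positives, two negatives.
X = np.array([[2.0, 2.0], [2.0, 3.0], [-2.0, -2.0], [-2.0, -3.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
m, C = len(y), 1.0

# Dual (8) as a QP in standard form: minimize (1/2) a'Pa + q'a,
# with P_ij = y_i y_j <x_i, x_j> and q = -1 (negated maximization).
P = matrix(np.outer(y, y) * (X @ X.T))
q = matrix(-np.ones(m))
# Box constraints 0 <= alpha_i <= C, stacked as G alpha <= h.
G = matrix(np.vstack([-np.eye(m), np.eye(m)]))
h = matrix(np.hstack([np.zeros(m), C * np.ones(m)]))
# Equality constraint: sum_i alpha_i y_i = 0.
A = matrix(y.reshape(1, -1))
b = matrix(0.0)

alpha = np.array(solvers.qp(P, q, G, h, A, b)['x']).ravel()
w = (alpha * y) @ X   # recover w via (5): w = sum_i alpha_i y_i x_i
print(alpha, w)
```

The recovered w can then be checked against the KKT summary above: support vectors (0 < α_i < C) should sit exactly on the margin.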
