The Split Bregman Method for L1-Regularized Problems
Tom Goldstein
May 22, 2008
Some Common L1 Regularized Problems

◮ TV Denoising:
  min_u ‖u‖_BV + (µ/2)‖u − f‖₂²
◮ De-Blurring/Deconvolution:
  min_u ‖u‖_BV + (µ/2)‖Ku − f‖₂²
◮ Basis Pursuit/Compressed Sensing MRI:
  min_u ‖u‖_BV + (µ/2)‖Fu − f‖₂²
What Makes these Problems Hard?

◮ Some "easy" problems:
  arg min_u ‖Au − f‖₂²   (differentiable)
  arg min_u |u|₁ + ‖u − f‖₂²   (solvable by shrinkage)
◮ Some "hard" problems:
  arg min_u |Φu|₁ + ‖u − f‖₂²
  arg min_u |u|₁ + ‖Au − f‖₂²
◮ What makes these problems hard is the "coupling" between the L1 and L2 terms
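The shrinkage step referenced above has a one-line closed form. A minimal NumPy sketch (the helper name `shrink` and the implementation are ours, not from the slides):

```python
import numpy as np

def shrink(x, gamma):
    """Soft-thresholding: the elementwise closed-form minimizer of
    gamma*|d| + (1/2)*(d - x)^2."""
    return np.sign(x) * np.maximum(np.abs(x) - gamma, 0.0)
```

Elementwise, values inside [−gamma, gamma] are set to 0 and everything else moves toward 0 by gamma; this is exactly why the second "easy" problem costs only one pass over the data.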
A Better Formulation

◮ We want to solve the general L1 regularization problem:
  arg min_u |Φu| + ‖Ku − f‖²
◮ We need to "split" the L1 and L2 components of this energy
◮ Introduce a new variable: let d = Φu
◮ We wish to solve the constrained problem
  arg min_{u,d} ‖d‖₁ + H(u)  such that  d = Φ(u)
Solving the Constrained Problem

arg min_{u,d} ‖d‖₁ + H(u)  such that  d = Φ(u)

◮ We add an L2 penalty term to get an unconstrained problem:
  arg min_{u,d} ‖d‖₁ + H(u) + (λ/2)‖d − Φ(u)‖²
◮ This splitting was independently introduced by Wang and Yin Zhang (FTVd)
◮ We need a way of modifying this problem to get exact enforcement of the constraint
◮ The most obvious approach is continuation: let λₙ → ∞
◮ But continuation ruins the condition number of the sub-problems
A Better Solution: Use Bregman Iteration

◮ We group the first two energy terms together,
  E(u, d) = ‖d‖₁ + H(u),
  to get
  arg min_{u,d} E(u, d) + (λ/2)‖d − Φ(u)‖²
◮ We now define the "Bregman distance" of this convex functional as
  D_E^p(u, d, u^k, d^k) = E(u, d) − E(u^k, d^k) − ⟨p_u^k, u − u^k⟩ − ⟨p_d^k, d − d^k⟩
A Better Solution: Use Bregman Iteration

◮ Rather than solve min_{u,d} E(u, d) + (λ/2)‖d − Φ(u)‖², we recursively solve
  (u^{k+1}, d^{k+1}) = arg min_{u,d} D_E^p(u, d, u^k, d^k) + (λ/2)‖d − Φ(u)‖₂²
◮ or, dropping terms that are constant in (u, d),
  arg min_{u,d} E(u, d) − ⟨p_u^k, u − u^k⟩ − ⟨p_d^k, d − d^k⟩ + (λ/2)‖d − Φ(u)‖₂²
◮ where p_u and p_d lie in the subgradient of E with respect to the variables u and d
Why does this work?

◮ Because of the convexity of the functionals we are using, it can be shown that
  ‖d − Φu‖ → 0 as k → ∞
◮ Furthermore, it can be shown that the limiting values u* = lim_{k→∞} u^k and d* = lim_{k→∞} d^k satisfy the original constrained optimization problem
  arg min_{u,d} ‖d‖₁ + H(u)  such that  d = Φ(u)
◮ It therefore follows that u* is a solution to the original L1 regularization problem
  u* = arg min_u |Φu| + ‖Ku − f‖²
Don't Worry! This isn't as complicated as it looks

◮ As is done for Bregman iterative denoising, we can get explicit formulas for p_u and p_d, and use them to simplify the iteration
◮ This gives us the simplified iteration
  (u^{k+1}, d^{k+1}) = arg min_{u,d} ‖d‖₁ + H(u) + (λ/2)‖d − Φ(u) − b^k‖²
  b^{k+1} = b^k + (Φ(u^{k+1}) − d^{k+1})
◮ This is the analog of "adding the noise back" when we use Bregman iteration for denoising
Summary of what we have so far

◮ We began with the L1-regularized problem
  u* = arg min_u |Φu| + ‖Ku − f‖²
◮ We form the "Split Bregman" formulation
  min_{u,d} ‖d‖₁ + H(u) + (λ/2)‖d − Φ(u) − b‖²
◮ For the optimal value b* of the Bregman parameter, these two problems are equivalent
◮ We solve the optimization problem by iterating
  (u^{k+1}, d^{k+1}) = arg min_{u,d} ‖d‖₁ + H(u) + (λ/2)‖d − Φ(u) − b^k‖²
  b^{k+1} = b^k + (Φ(u^{k+1}) − d^{k+1})
Why is this better?

◮ We can break this algorithm down into three easy steps:
  Step 1: u^{k+1} = arg min_u H(u) + (λ/2)‖d^k − Φ(u) − b^k‖₂²
  Step 2: d^{k+1} = arg min_d |d|₁ + (λ/2)‖d − Φ(u^{k+1}) − b^k‖₂²
  Step 3: b^{k+1} = b^k + Φ(u^{k+1}) − d^{k+1}
◮ Because of the decoupled form, Step 1 is now a differentiable optimization problem, so we can solve it directly with tools like the Fourier transform, Gauss-Seidel, CG, etc.
◮ Step 2 can be solved efficiently using shrinkage:
  d^{k+1} = shrink(Φ(u^{k+1}) + b^k, 1/λ)
◮ Step 3 is explicit, and easy to evaluate
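The three steps above can be sketched in code. For a concrete, testable instance we take Φ = I and H(u) = (µ/2)‖u − f‖₂², so that Step 1 has a closed form; this toy setup, and every name in it, is our assumption rather than the slides' implementation:

```python
import numpy as np

def shrink(x, gamma):
    # Soft-thresholding (Step 2 solver)
    return np.sign(x) * np.maximum(np.abs(x) - gamma, 0.0)

def split_bregman_identity(f, mu=1.0, lam=1.0, n_iter=100):
    """Split Bregman sketch for min_u |u|_1 + (mu/2)||u - f||^2
    with Phi = I, so all three steps are closed-form."""
    u = f.copy()
    d = np.zeros_like(f)
    b = np.zeros_like(f)
    for _ in range(n_iter):
        # Step 1: differentiable subproblem
        #   min_u (mu/2)||u - f||^2 + (lam/2)||d - u - b||^2
        u = (mu * f + lam * (d - b)) / (mu + lam)
        # Step 2: shrinkage in d
        d = shrink(u + b, 1.0 / lam)
        # Step 3: Bregman update of b
        b = b + u - d
    return u
```

With Φ = I the minimizer of the original problem is itself a soft-threshold of f with threshold 1/µ, which gives a direct check on the iteration.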
Example: Fast TV Denoising

◮ We begin by considering the anisotropic ROF denoising problem
  arg min_u |∇_x u| + |∇_y u| + (µ/2)‖u − f‖₂²
◮ We then write down the Split Bregman formulation
  arg min_{u,d_x,d_y} |d_x| + |d_y| + (µ/2)‖u − f‖₂² + (λ/2)‖d_x − ∇_x u − b_x‖₂² + (λ/2)‖d_y − ∇_y u − b_y‖₂²
Example: Fast TV Denoising

◮ The TV algorithm then breaks down into these steps:
  Step 1: u^{k+1} = G(u^k)
  Step 2: d_x^{k+1} = shrink(∇_x u^{k+1} + b_x^k, 1/λ)
  Step 3: d_y^{k+1} = shrink(∇_y u^{k+1} + b_y^k, 1/λ)
  Step 4: b_x^{k+1} = b_x^k + (∇_x u^{k+1} − d_x^{k+1})
  Step 5: b_y^{k+1} = b_y^k + (∇_y u^{k+1} − d_y^{k+1})
  where G(u^k) represents the result of one Gauss-Seidel sweep for the corresponding L2 optimization problem.
◮ This is very cheap: each step is only a few operations per pixel
Isotropic TV

◮ This method can handle isotropic TV using the following decoupled formulation:
  arg min_{u,d_x,d_y} √(d_x² + d_y²) + (µ/2)‖u − f‖₂² + (λ/2)‖d_x − ∇_x u − b_x‖₂² + (λ/2)‖d_y − ∇_y u − b_y‖₂²
◮ We now solve for (d_x, d_y) jointly using the generalized shrinkage formula (Yin et al.):
  d_x^{k+1} = max(s^k − 1/λ, 0) (∇_x u^k + b_x^k) / s^k
  d_y^{k+1} = max(s^k − 1/λ, 0) (∇_y u^k + b_y^k) / s^k
  where
  s^k = √( (∇_x u^k + b_x^k)² + (∇_y u^k + b_y^k)² )
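The generalized shrinkage formula above can be sketched directly; the small guard against division by zero when s = 0 is our addition:

```python
import numpy as np

def isotropic_shrink(x, y, gamma):
    """Generalized (vector) shrinkage: shrink the magnitude of the
    vector field (x, y) by gamma while preserving its direction."""
    s = np.sqrt(x**2 + y**2)
    scale = np.maximum(s - gamma, 0.0) / np.maximum(s, 1e-12)  # guard s = 0
    return scale * x, scale * y
```

Unlike the componentwise shrink used in the anisotropic case, this operator couples d_x and d_y through the shared magnitude s, which is what makes the regularizer rotation-invariant.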
Time Trials

◮ Time trials were done on an Intel Core 2 Duo desktop (3 GHz)
◮ Linux platform, compiled with g++

Anisotropic:
  Image             | Time/cycle (sec) | Total time (sec)
  256 × 256 Blocks  | 0.0013           | 0.068
  512 × 512 Lena    | 0.0054           | 0.27

Isotropic:
  Image             | Time/cycle (sec) | Total time (sec)
  256 × 256 Blocks  | 0.0018           | 0.0876
  512 × 512 Lena    | 0.011            | 0.55
This can be made even faster...

◮ Most of the denoising takes place in the first 10 iterations
◮ "Staircases" form quickly, but then take some time to flatten out
◮ If we are willing to accept a "visual" convergence criterion, we can denoise in about 10 iterations (0.054 sec) for Lena, and 20 iterations (0.024 sec) for the blocky image
Compressed Sensing for MRI

◮ Many authors (Donoho, Yin, etc.) get superior reconstruction using both TV and Besov regularizers
◮ We wish to solve
  arg min_u |∇u| + |Wu| + (µ/2)‖RFu − f‖₂²
  where R comprises a subset of rows of the identity, and W is an orthogonal wavelet transform (Haar)
◮ Apply the "Split Bregman" method: let w ← Wu, d_x ← ∇_x u, and d_y ← ∇_y u:
  arg min_{u,d_x,d_y,w} √(d_x² + d_y²) + |w| + (µ/2)‖RFu − f‖₂² + (λ/2)‖d_x − ∇_x u − b_x‖₂² + (λ/2)‖d_y − ∇_y u − b_y‖₂² + (γ/2)‖w − Wu − b_w‖₂²
Compressed Sensing for MRI

◮ The optimality condition for u is circulant:
  (µF^T R^T RF − λΔ + γI) u^{k+1} = rhs^k
◮ The resulting algorithm:

  Unconstrained CS Optimization Algorithm
  u^{k+1} = F⁻¹ K⁻¹ F rhs^k
  (d_x^{k+1}, d_y^{k+1}) = shrink(∇_x u + b_x, ∇_y u + b_y, 1/λ)
  w^{k+1} = shrink(Wu + b_w, 1/γ)
  b_x^{k+1} = b_x^k + (∇_x u − d_x)
  b_y^{k+1} = b_y^k + (∇_y u − d_y)
  b_w^{k+1} = b_w^k + (Wu − w)
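The circulant u-update u^{k+1} = F⁻¹ K⁻¹ F rhs^k above costs two FFTs. A hypothetical NumPy sketch, under our assumptions of a periodic 5-point Laplacian and a 0/1 k-space sampling mask representing R^T R (taking the real part at the end assumes a negation-symmetric sampling pattern):

```python
import numpy as np

def u_update(rhs, mask, mu, lam, gamma):
    """Solve (mu*F^T R^T R F - lam*Lap + gamma*I) u = rhs with two FFTs.
    R^T R is diagonal in k-space (the 0/1 mask), and the periodic
    5-point -Laplacian has Fourier symbol 4 - 2cos(wy) - 2cos(wx),
    so the whole operator is diagonal (K) in the Fourier domain."""
    ny, nx = rhs.shape
    wx = 2.0 * np.pi * np.fft.fftfreq(nx)
    wy = 2.0 * np.pi * np.fft.fftfreq(ny)
    K = (mu * mask
         + lam * (4.0 - 2.0 * np.cos(wy)[:, None] - 2.0 * np.cos(wx)[None, :])
         + gamma)
    return np.real(np.fft.ifft2(np.fft.fft2(rhs) / K))
```

Because γ > 0, K is strictly positive even at unsampled frequencies, so the division is always well defined.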
Compressed Sensing for MRI

◮ To solve the constrained problem
  arg min_u |∇u| + |Wu|  such that  ‖RFu − f‖₂ < σ
  we use "double Bregman"
◮ First, solve the unconstrained problem
  arg min_u |∇u| + |Wu| + (µ/2)‖RFu − f^k‖₂²
  by performing "inner" iterations
◮ Then, update
  f^{k+1} = f^k + f − RFu^{k+1}
  This is an "outer" iteration
Compressed Sensing

◮ 256 × 256 MRI of phantom, 30% (figure not shown)
Bregman Iteration vs. Continuation

◮ As λ → ∞, the condition number of each sub-problem goes to ∞
◮ This is okay if we have a direct solver for each sub-problem (such as the FFT)
◮ Drawback: direct solvers are slower than iterative solvers, or may not be available
◮ With Bregman iteration, the condition number stays constant, so we can use efficient iterative solvers
Example: Direct Solvers May Be Inefficient

◮ TV-L1:
  arg min_u |∇u| + µ|u − f|
◮ Split Bregman formulation (splitting both terms, with v ← u):
  arg min_{u,d,v} |d| + µ|v − f| + (λ/2)‖d − ∇u − b_d‖₂² + (γ/2)‖u − v − b_v‖₂²
◮ We must solve the sub-problem
  (µI − λΔ)u = RHS
◮ If λ ≈ µ, this system is strongly diagonally dominant: use Gauss-Seidel (cheap)
◮ If λ ≫ µ, we must use a direct solver: 2 FFTs per iteration (expensive)
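A Gauss-Seidel solver for the sub-problem (µI − λΔ)u = RHS can be sketched as below; the periodic boundary handling and loop structure are our assumptions. The diagonal entry is µ + 4λ, which is why λ ≈ µ gives strong diagonal dominance and fast convergence:

```python
import numpy as np

def gauss_seidel(u0, rhs, mu, lam, n_sweeps=30):
    """Plain Gauss-Seidel sweeps for (mu*I - lam*Lap) u = rhs on a
    periodic grid with the 5-point Laplacian. Each pixel update uses
    the freshest neighbor values (in-place), the defining feature of
    Gauss-Seidel as opposed to Jacobi."""
    u = u0.copy()
    ny, nx = u.shape
    for _ in range(n_sweeps):
        for i in range(ny):
            for j in range(nx):
                nb = (u[(i - 1) % ny, j] + u[(i + 1) % ny, j]
                      + u[i, (j - 1) % nx] + u[i, (j + 1) % nx])
                u[i, j] = (rhs[i, j] + lam * nb) / (mu + 4.0 * lam)
    return u
```

In practice these sweeps would be interleaved with the shrinkage steps, so only one or two sweeps per outer iteration are needed.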
Example: Direct Solvers May Not Exist

◮ Total-variation based inpainting:
  arg min_u ∫_Ω |∇u| + µ ∫_{Ω∖D} (u − f)²
  which in discrete form is
  arg min_u |∇u| + µ‖Ru − f‖²
  where R consists of rows of the identity matrix
◮ The optimization sub-problem is
  (µR^T R − λΔ)u = RHS
◮ Not circulant! We have to use an iterative solver (e.g. Gauss-Seidel)
Generalizations

◮ Bregman iteration can be used to solve a wide range of non-L1 problems:
  arg min_u J(u)  such that  A(u) = 0
  where J and ‖A(·)‖₂ are convex
◮ We can use a Bregman-like penalty function:
  u^{k+1} = arg min_u J(u) + (λ/2)‖A(u) − b^k‖²
  b^{k+1} = b^k − A(u^{k+1})
◮ Theorem: any fixed point of the above algorithm is a solution to the original constrained problem
◮ Convergence can be proved for a broad class of problems: if J is strictly convex and twice differentiable, then there exists λ₀ > 0 such that the algorithm converges for any λ < λ₀
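The generalized scheme above can be sketched on a concrete, testable instance: J(u) = (1/2)‖u − f‖₂² with a linear constraint A(u) = Mu − c, so each u-step is a small linear solve. The instance and all names are our assumptions, not from the slides:

```python
import numpy as np

def bregman_linear_constraint(f, M, c, lam=1.0, n_iter=100):
    """Bregman iteration for min (1/2)||u - f||^2  s.t.  M u = c.
    u-step: argmin_u (1/2)||u - f||^2 + (lam/2)||M u - c - b||^2,
    then b is decremented by the constraint residual A(u) = M u - c."""
    n = f.size
    u = f.copy()
    b = np.zeros(M.shape[0])
    A = np.eye(n) + lam * (M.T @ M)   # normal matrix of the quadratic u-step
    for _ in range(n_iter):
        u = np.linalg.solve(A, f + lam * M.T @ (c + b))
        b = b - (M @ u - c)           # Bregman update b^{k+1} = b^k - A(u)
    return u
```

For this strictly convex J, any fixed point of the loop satisfies Mu = c exactly, and the iterates converge to the Euclidean projection of f onto the constraint set.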
Conclusion

◮ The Split Bregman formulation is a fast tool that can solve almost any L1 regularization problem
◮ Small memory footprint
◮ Easily parallelized for large problems
◮ Easy to code
Acknowledgments

We thank Jie Zheng for his helpful discussions regarding MR image processing. This publication was made possible by the support of the National Science Foundation's GRFP program, as well as ONR grant N000140710810 and the Department of Defense.