
Exploiting Redundancy for Aerial Image Fusion using Convex Optimization ⋆

Stefan Kluckner, Thomas Pock and Horst Bischof

Institute for Computer Graphics and Vision
Graz University of Technology, Austria
{kluckner,pock,bischof}@icg.tugraz.at

Abstract. Image fusion in high-resolution aerial imagery poses a challenging problem due to fine details and complex textures. In particular, color image fusion using virtual orthographic cameras offers a common representation of overlapping yet perspective aerial images. This paper proposes a variational formulation for a tight integration of redundant image data showing urban environments. We introduce an efficient wavelet regularization which enables a natural-appearing recovery of fine details in the images by performing joint inpainting and denoising from a given set of input observations. Our framework is first evaluated on a setting with synthetic noise. Then, we apply our proposed approach to orthographic image generation in aerial imagery. In addition, we discuss an exemplar-based inpainting technique for an integrated removal of non-stationary objects like cars.

1 Introduction

In general, image fusion integrates information from multiple images of the same scene in order to obtain an improved result with respect to noise, outliers, illumination changes, etc. Fusion from multiple observations is an active topic in computer vision and photogrammetry, since scene information can be taken from different viewpoints without additional cost. In particular, modern aerial imaging technology provides multi-spectral images which map every visible spot of urban environments from many overlapping camera viewpoints. Typically, a point on the ground is visible in at least ten cameras. The provided, highly redundant data enables efficient techniques for height field generation [1], but also methods for resolution and quality enhancement [2-6]. On the one hand, taking into account redundant observations of corresponding points in a common 3D world, the localization accuracy can be significantly improved using an integration of range data, e.g. for 3D reconstruction [7]. On the other hand, accurate height fields can also be exploited to align data, such as the corresponding color information, within a common coordinate system.

⋆ This work was financed by the Austrian Research Promotion Agency within the projects vdQA (No. 816003) and APAFA (No. 813397).

Fig. 1. An observed scene with overlapping camera positions. The scene is taken from different viewpoints. We exploit computed range images to transform the point cloud into a common orthographic aerial view. Our joint inpainting and denoising approach takes redundant observations as input data and generates an improved fused image. Note that some observations include many undefined areas (black pixels) caused by occlusions and non-stationary objects.

In our approach we exploit derived range data to compute geometric transformations between the original images

and an orthographic view, which is related to novel view synthesis [2-4, 8]. Due to missing data in the individual height fields (e.g. caused by non-stationary objects or occlusions), the initial alignment causes undefined areas, artifacts or outliers in the novel view. Figure 1 depicts an urban scene taken from different camera positions and a set of redundant images geometrically transformed to a common view. Some image tiles show large areas of missing information and erroneous pixel values. Our task can thus also be interpreted as image fusion from multiple input observations of the same scene by joint inpainting and denoising: while the inpainting fills undefined areas, the denoising removes strong outliers and noise by exploiting the high redundancy in the input data.

This paper makes several contributions. First, we present a novel variational framework for gray and color image fusion which provides a smooth solution over the image domain by exploiting redundant input images. In order to compute natural-appearing images, we further introduce a wavelet transform [9], providing an improved texture prior for regularization, into our convex optimization framework (Section 3). In the experimental section we show that our framework can be successfully applied to image recovery and orthographic image generation in high-resolution aerial imagery (Section 4). In addition, we present results for exemplar-based inpainting, which enables an integrated removal of undesired, non-stationary objects like cars. Finally, Section 5 concludes our work and gives an outlook on future work.

2 Related Work

The challenging task of reconstructing an original image from given (noisy) observations is known to be ill-posed. Although fast mean or median computation over multiple pixel observations will suppress noisy or undefined areas, each pixel in the result is treated independently. A variety of proposed algorithms for fusing redundant information is based on image priors [2, 8], image transforms [10], Markov random field optimization procedures [3] and generative models [4]. Variational formulations are well suited for finding smooth and consistent solutions of the inverse problem by exploiting different types of regularization [7, 11-14]. The quadratic model [11] uses the L2 norm for regularization, which causes smoothed edges. Introducing a total variation (TV) norm instead leads to the edge-preserving denoising model proposed by Rudin, Osher and Fatemi (ROF) [12]. The authors of [13] proposed to also use an L1 norm in the data term to estimate the deviation between the sought solution and the input observation. The resulting TV-L1 model is thus more effective than the ROF model in removing impulse noise containing strong outliers. Zach et al. [7] applied the TV-L1 model to robust range image integration from multiple views. Although TV-based methods are well suited for tasks like range data integration, in texture inpainting the regularization produces results that look unnatural near recovered edges (too much contrast). To overcome this synthetic appearance, natural image priors based on multi-level transforms like wavelets [9, 15, 16] or curvelets [17] can be used within the inpainting and fusion model [14, 18]. These transforms provide a compact yet sparse image representation obtained at low computational cost. Similar to [14], we exploit a wavelet transform for natural regularization within our proposed variational fusion framework, which is capable of handling multiple input observations.

3 Convex Fusion Model

In this section we describe our generic fusion model, which takes into account multiple observations of the same scene. For clarity, we derive our model for gray-valued images; however, the formulation can easily be extended to vector-valued data like color images.

3.1 The Proposed Model

We consider a discrete image domain Ω as a regular grid of size W × H pixels with Ω = {(i, j) : 1 ≤ i ≤ W, 1 ≤ j ≤ H}, where the tuple (i, j) denotes a pixel position in the domain Ω.

Our fusion model, which takes into account multiple observations and a wavelet-based regularization, can be seen as an extension of the TV-L1 denoising model proposed by Nikolova [13]. In the discrete setting, the minimization problem of the common TV-L1 model for an image domain Ω is formulated as

\min_{u \in X} \Big\{ \|\nabla u\|_1 + \lambda \sum_{(i,j) \in \Omega} |u_{i,j} - f_{i,j}| \Big\}, \qquad (1)

where X = R^{WH} is a finite-dimensional vector space equipped with the scalar product ⟨u, v⟩_X = \sum_{i,j} u_{i,j} v_{i,j}, u, v ∈ X. The first term denotes the TV of the sought solution u and enforces the regularization in terms of a smooth solution. The second term accounts for the summed errors between u and the (noisy) input data f. The scalar λ controls the trade-off between data fitting and regularization. In the following we derive our model for the task of image fusion from multiple observations.

As a first modification of the TV-L1 model defined in (1), we extend the convex minimization problem to handle a set of K scene observations (f^1, ..., f^K). Introducing multiple input images can be accomplished by summing the deviations between the sought solution u and the available observations f^k, k = 1 ... K, according to

\min_{u \in X} \Big\{ \|\nabla u\|_1 + \lambda \sum_{k=1}^{K} \sum_{(i,j) \in \Omega} |u_{i,j} - f^k_{i,j}| \Big\}. \qquad (2)

Since orthographic image generation from gray or color information with sampling distances of approximately 10 cm requires an accurate recovery of fine details and complex textures, we replace the TV-based regularization with a dual-tree complex wavelet transform (DTCWT) [9, 16]. The DTCWT is nearly invariant to rotation, which is important for regularization, but also to translation, and can be computed efficiently using separable filter banks. The transform is based on analyzing the signal with two separate wavelet decompositions, where one provides the real-valued part and the other yields the complex part. Due to the redundancy of the proposed decomposition, the directionality is improved compared to standard discrete wavelets [9]. In order to include the linear wavelet-based regularization in our generic formulation, we replace the gradient operator ∇ by the linear transform Ψ : X → C. The space C ⊆ \mathbb{C}^D denotes the real- and complex-valued transform coefficients c ∈ C; the dimensionality D depends directly on parameters like the image dimensions and the number of levels and orientations. The adjoint operator of the transform Ψ, required for signal reconstruction, is denoted Ψ^* and is defined through the identity ⟨Ψu, c⟩_C = ⟨u, Ψ^*c⟩_X.
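As an illustration (not part of the paper's own implementation), the operator pair (Ψ, Ψ^*) can be realized with the open-source dtcwt Python package. We assume its 'near_sym_b' (13,19-tap) and 'qshift_b' (14-tap) filter options correspond to the kernels reported in Section 4.1; and since the DTCWT is an approximately tight frame, we follow the common practice of substituting the inverse transform for the exact adjoint:

```python
import dtcwt  # pip install dtcwt

# 13,19-tap level-1 and Q-shift 14-tap filters: our assumed mapping to
# the kernels named in Section 4.1.
transform = dtcwt.Transform2d(biort='near_sym_b', qshift='qshift_b')

def Psi(u, nlevels=3):
    """Forward DTCWT: image -> pyramid of complex subband coefficients."""
    return transform.forward(u, nlevels=nlevels)

def Psi_adj(pyramid):
    """Approximate adjoint Psi*: pyramid -> image, realized by the
    inverse DTCWT (exact only up to the frame bounds)."""
    return transform.inverse(pyramid)
```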

As the L1 norm in the data term is known to be sensitive to Gaussian noise (of which we expect a small amount), we instead use the robust Huber norm [19] to estimate the error between the sought solution and the observations. The Huber norm is quadratic for small values, which is appropriate for handling Gaussian noise, and linear for larger errors, which amounts to a median-like behavior. It is defined as

|t|_\epsilon = \begin{cases} \dfrac{t^2}{2\epsilon} & : \ 0 \le t \le \epsilon \\ t - \dfrac{\epsilon}{2} & : \ \epsilon < t \end{cases} \qquad (3)
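A minimal element-wise sketch of (3), with ε = 0.1 as used in Section 4.1:

```python
import numpy as np

def huber(t, eps=0.1):
    """Element-wise Huber norm |t|_eps of Eq. (3): quadratic for small
    residuals, linear beyond eps."""
    a = np.abs(t)
    return np.where(a <= eps, a ** 2 / (2.0 * eps), a - eps / 2.0)
```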

Because of the height-field-driven alignment of the appearance information, undefined areas can simply be determined in advance for a geometrically transformed image f^k. Therefore, we support our formulation with a spatially varying weight w^k ∈ {0, 1}^{WH}, which encodes the inpainting domain: the choice w^k_{i,j} = 0 corresponds to pure inpainting at a pixel location (i, j).
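Assuming the convention of Fig. 1 that undefined pixels of a warped observation are marked black, the weight map w^k can be sketched as:

```python
import numpy as np

def inpainting_weights(f_k):
    """w^k of Eq. (4): 0 on undefined (black) pixels of the warped
    observation f^k, 1 where valid color was observed."""
    valid = f_k.max(axis=-1) > 0 if f_k.ndim == 3 else f_k > 0
    return valid.astype(np.float64)
```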

Considering the wavelet-based regularization, the encoded inpainting domain and the Huber norm, our extended energy minimization problem for redundant observations can now be formulated for the image domain Ω as

\min_{u \in X} \Big\{ \|\Psi u\|_1 + \lambda \sum_{k=1}^{K} \sum_{(i,j) \in \Omega} w^k_{i,j} \, |u_{i,j} - f^k_{i,j}|_\epsilon \Big\}. \qquad (4)
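Putting the pieces together, the objective (4) can be evaluated as sketched below (using Psi and huber from the sketches above). The text does not state whether the lowpass band enters the L1 term, so we follow the common choice of leaving it unpenalized:

```python
import numpy as np

def fusion_energy(u, fs, ws, lam=1.2, eps=0.1, nlevels=3):
    """Objective of Eq. (4) for an estimate u, observations fs = [f^1..f^K]
    and weights ws = [w^1..w^K]; Psi and huber as defined above."""
    p = Psi(u, nlevels=nlevels)
    reg = sum(np.abs(h).sum() for h in p.highpasses)   # ||Psi u||_1 (highpass bands)
    data = sum((w * huber(u - f, eps)).sum() for f, w in zip(fs, ws))
    return reg + lam * data
```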

In the following we describe an iterative strategy, based on an optimal first-order primal-dual algorithm, to minimize the non-smooth problem defined in (4).

3.2 Primal-Dual Formulation

Note that the minimization problem given in (4) poses a large-scale (the dimensionality directly depends on the number of image pixels, e.g. 3 × 1600² unknowns for a small color image tile) and non-smooth optimization problem. Following recent trends in convex optimization [20, 21], we use an optimal first-order primal-dual scheme [22, 23] to minimize the energy. Thus we first need to convert the formulation defined in (4) into a classical convex-concave saddle-point problem. The general minimization problem is written as

\min_{x \in X} \max_{y \in Y} \ \langle Kx, y \rangle + G(x) - F^*(y), \qquad (5)

where K is a linear operator, G and F^* are convex functions, and F^* denotes the convex conjugate of the function F. The finite-dimensional vector spaces X and Y provide a scalar product ⟨·, ·⟩ and a norm ‖·‖ = ⟨·, ·⟩^{1/2}. By applying the Legendre-Fenchel transform to (4), we obtain the saddle-point problem

\min_{u} \max_{c, q} \Big\{ \langle \Psi u, c \rangle - \delta_C(c) + \sum_{k=1}^{K} \big( \langle u - f^k, q^k \rangle - \delta_{Q^k}(q^k) - \tfrac{\epsilon}{2} \|q^k\|^2 \big) \Big\}. \qquad (6)

In our case, the convex sets Q^k and C are defined as follows

Q^k = \big\{ q^k \in \mathbb{R}^{WH} : |q^k_{i,j}| \le \lambda w^k_{i,j}, \ (i,j) \in \Omega \big\}, \quad k = 1 \ldots K, \qquad (7)
C = \big\{ c \in \mathbb{C}^D : \|c\|_\infty \le 1 \big\}, \qquad (8)

where the norm on the coefficient vector space C is defined as

\|c\|_\infty = \max_{i,j} |c_{i,j}|, \qquad |c_{i,j}| = \sqrt{(c^1_{i,j})^2 + (c^2_{i,j})^2}. \qquad (9)

Considering (6), we can first identify F^*(c, q) = \delta_C(c) + \sum_{k=1}^{K} \big( \delta_{Q^k}(q^k) + \tfrac{\epsilon}{2} \|q^k\|^2 \big). The functions \delta_C and \delta_{Q^k} are simple indicator functions of the convex sets and are given by

\delta_C(c) = \begin{cases} 0 & \text{if } c \in C \\ +\infty & \text{if } c \notin C \end{cases}, \qquad \delta_{Q^k}(q^k) = \begin{cases} 0 & \text{if } q^k \in Q^k \\ +\infty & \text{if } q^k \notin Q^k \end{cases}. \qquad (10)



Since a closed-form solution for the sum over multiple L1 norms cannot be implemented efficiently, we additionally introduce a dualization of the data term with respect to G, which yields an extended linear term with ⟨Kx, y⟩ = ⟨Ψu, c⟩ + \sum_{k=1}^{K} ⟨u − f^k, q^k⟩. According to [22, 23], the primal-dual algorithm can be summarized as follows: first, we set the primal and dual time steps τ > 0, σ > 0. Additionally, we initialize u^0 ∈ R^{WH}, ū^0 = u^0, c^0 ∈ C and q^k_0 ∈ Q^k. Based on the iterations proposed in [22], the iterative scheme is then given by

\begin{cases}
c^{n+1} = \mathrm{proj}_C\big( c^n + \sigma \Psi \bar{u}^n \big) \\
q^k_{n+1} = \mathrm{proj}_{Q^k}\Big( \dfrac{q^k_n + \sigma(\bar{u}^n - f^k)}{1 + \sigma\epsilon} \Big), \quad k = 1 \ldots K \\
u^{n+1} = u^n - \tau \Big( \Psi^* c^{n+1} + \sum_{k=1}^{K} q^k_{n+1} \Big) \\
\bar{u}^{n+1} = 2u^{n+1} - u^n.
\end{cases} \qquad (11)

In order to iteratively compute the solution of (6) using the primal-dual scheme, point-wise Euclidean projections of the dual variables c and q^k onto the convex sets C and Q^k are required. The projection of the wavelet coefficients c is defined as

\mathrm{proj}_C(\tilde{c}_{i,j}) = \frac{\tilde{c}_{i,j}}{\max(1, |\tilde{c}_{i,j}|)}. \qquad (12)

The corresponding projections for the dual variables q^k with k = 1 \ldots K are given by

\mathrm{proj}_{Q^k}(\tilde{q}^k_{i,j}) = \min\big( \lambda w^k_{i,j}, \ \max(-\lambda w^k_{i,j}, \ \tilde{q}^k_{i,j}) \big). \qquad (13)

Note that the iterative minimization scheme consists mainly of simple point-wise operations; therefore it can be considerably accelerated on massively parallel hardware such as graphics processing units. In the next section we use our model to perform image fusion on synthetic and real image data.
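To make the scheme concrete, the sketch below implements the iterations of Eq. (11) with numpy and the dtcwt operators introduced in Section 3.1; the step sizes follow the setting reported in Section 4.1. This is an illustrative CPU reconstruction under the stated assumptions (inverse DTCWT in place of Ψ^*, lowpass duals held at zero, weighted-average initialization), not the authors' GPU implementation:

```python
import numpy as np
import dtcwt  # pip install dtcwt

def fuse(fs, ws, lam=1.2, eps=0.1, tau=0.05, nlevels=3, n_iters=300):
    """Primal-dual iterations of Eq. (11): fuse K warped observations
    fs (H x W arrays) with validity weights ws by joint inpainting and
    denoising."""
    t = dtcwt.Transform2d(biort='near_sym_b', qshift='qshift_b')
    sigma = 1.0 / (8.0 * tau)                      # sigma = 1/(8 tau), Sec. 4.1
    # initialize the primal variable with a weighted average (our choice)
    u = sum(w * f for f, w in zip(fs, ws)) / np.maximum(sum(ws), 1.0)
    u_bar = u.copy()
    c = t.forward(u, nlevels=nlevels)              # dual variable c (pyramid)
    for h in c.highpasses:
        h[...] = 0.0
    c.lowpass[...] = 0.0                           # lowpass dual held at zero
    qs = [np.zeros_like(u) for _ in fs]            # dual variables q^k
    for _ in range(n_iters):
        # dual ascent in c, then projection onto C, Eq. (12)
        p = t.forward(u_bar, nlevels=nlevels)
        for h, hb in zip(c.highpasses, p.highpasses):
            h += sigma * hb
            h /= np.maximum(1.0, np.abs(h))
        # dual ascent in q^k (Huber prox), then projection onto Q^k, Eq. (13)
        for k, (f, w) in enumerate(zip(fs, ws)):
            q = (qs[k] + sigma * (u_bar - f)) / (1.0 + sigma * eps)
            qs[k] = np.clip(q, -lam * w, lam * w)
        # primal descent and extrapolation
        u_new = u - tau * (t.inverse(c) + sum(qs))
        u_bar = 2.0 * u_new - u
        u = u_new
    return u
```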

4 Experimental Evaluation

In this section we first demonstrate our convex fusion model on synthetic data; then we apply it to real-world aerial images.

4.1 Synthetic Experiments

To show the performance with respect to recovered fine details, our first experiment investigates the fusion and inpainting capability of our proposed model using images with synthetically added noise. We therefore take the gray-valued Barbara image (512 × 512 pixels), which contains fine structures and highly textured areas. In order to imitate the expected noise model, we add a small amount of Gaussian noise (µ = 0, σ = 0.01) and replace a specified percentage of pixels (we use 10% and 50%) with undefined areas, which can be seen as a simulation of occluded regions caused by perspective views. An evaluation in terms of peak signal-to-noise ratio (PSNR) for different amounts of undefined pixels and quantities of observations is shown in Figure 2. We compare our model to the TV-L1 formulation, the mean and the median computation. For the TV-L1 model and ours we present the plots computed for the optimal parameters determined by cross validation. One can see that our joint inpainting and denoising model, using the parameter setting τ = 0.05, σ = 1/(8τ), ε = 0.1, λ = 1.2 and 3 levels of wavelet decomposition (with 13,19-tap and Q-shift 14-tap filter kernels, respectively), performs best in both noise settings. Moreover, an increasing number of input observations significantly improves the result. Compared to the TV-L1 model, the wavelet-based regularization improves the PSNR by an average of 2 dB.

Fig. 2. Quantitative results for the Barbara image: PSNR as a function of the number of input observations for the mean, the median, TV regularization and DTCWT regularization, with 10% and 50% outliers (averaged input noise levels: 14.63 dB and 8.73 dB, respectively). Our proposed model using the wavelet-based regularization yields the best noise suppression.
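The degradation used in this experiment can be reproduced roughly as follows (our reading: undefined pixels are excluded through w^k and appear black in the observations):

```python
import numpy as np

def make_observations(img, K=10, outlier_frac=0.5, noise_sigma=0.01, seed=0):
    """Degrade a reference image (values in [0, 1]) into K observations:
    additive Gaussian noise plus a fraction of pixels marked undefined."""
    rng = np.random.default_rng(seed)
    fs, ws = [], []
    for _ in range(K):
        f = np.clip(img + rng.normal(0.0, noise_sigma, img.shape), 0.0, 1.0)
        w = (rng.random(img.shape) >= outlier_frac).astype(np.float64)
        fs.append(f * w)   # undefined pixels appear black, as in Fig. 1
        ws.append(w)
    return fs, ws

def psnr(u, ref):
    """Peak signal-to-noise ratio in dB for images with peak value 1."""
    return 10.0 * np.log10(1.0 / np.mean((u - ref) ** 2))
```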

4.2 Fusion of Real Images

Our second experiment focuses on orthographic color image fusion in aerial imagery. The images are taken with the Microsoft UltraCam from an aircraft in overlapping strips; each image has a resolution of 11500 × 7500 pixels with a ground sampling distance of approximately 10 cm. Depending on the overlap in the imagery, the mapped area provides up to ten redundant observations of the same scene. To obtain the required range data for each input image we use a dense matching algorithm similar to the method proposed in [1]. Taking into account the ranges and the available camera data, each pixel in the images can be transformed to common 3D world coordinates, forming a large point cloud in which each point is defined by location and color information. Introducing virtual orthographic cameras with a defined pixel resolution (we use the sampling distance of the original images) enables a projection of the point cloud of each scene observation to the ground plane (we simply set the height coordinate to a fixed value). Computed fusion results for different dimensions are shown in Figure 3. The obtained results show an improved natural appearance, resulting from the wavelet-based regularization in our fusion model.

Fig. 3. Some fusion results. The first column shows results obtained with the TV-L1 model (three input observations). The second column depicts corresponding images computed with our fusion model using a DTCWT regularization, which yields images with an improved natural appearance (λ = 1.0). Larger fusion results are given in the third column, where we exploit a redundancy of ten input images. The color fusion (1600 × 1600 pixels) can be obtained within two minutes. Best viewed in color.
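Assuming each pixel of an observation has already been lifted to 3D world coordinates via its range value and the camera data (that photogrammetric pipeline is not published), the projection into the orthographic grid amounts to a splatting operation, sketched here without the z-buffering a full implementation would need:

```python
import numpy as np

def splat_to_ortho(colors, world_xy, x0, y0, gsd, W, H):
    """Splat 3D points (ground-plane coordinates world_xy, in meters,
    with per-point colors) into a W x H orthographic grid anchored at
    (x0, y0) with ground sampling distance gsd (~0.10 m). Returns the
    warped observation f^k and its validity mask w^k; cells hit by no
    point stay undefined (w^k = 0)."""
    f = np.zeros((H, W, colors.shape[1]))
    w = np.zeros((H, W))
    j = ((world_xy[:, 0] - x0) / gsd).astype(int)
    i = ((world_xy[:, 1] - y0) / gsd).astype(int)
    ok = (i >= 0) & (i < H) & (j >= 0) & (j < W)
    f[i[ok], j[ok]] = colors[ok]   # last point wins per cell in this sketch
    w[i[ok], j[ok]] = 1.0
    return f, w
```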

4.3 Removal of Non-Stationary Objects

Non-stationary objects such as cars are disturbing in orthographic image generation. We therefore use our model to remove cars by simultaneous inpainting. Car detection masks can be obtained efficiently with an approach as described in [24]. In order to fill the detected car areas, our strategy is inspired by the work of Hays and Efros [25]: we perform scene completion with respect to the detection mask using a pool of potential exemplars. To do so, we randomly collect image patches (the dimension is adapted to a common car length) and apply image transformations like rotation and translation in order to synthetically increase the pool. To find the best matching candidate for each detected car we compute a sum of weighted color distances between a masked detection and each exemplar. The weighting additionally prefers pixel locations near the mask boundary and is derived using a distance transform. The detection mask with overlaid exemplars is then used as an additional input observation within the fusion model. Obtained removal results are shown in Figure 4.

Fig. 4. Inpainting results using a car detection mask. From left to right: the car detection mask, the fusion result computed without using a car detection mask, the result obtained by pure inpainting, and the inpainting with supporting exemplars. The car areas are successfully removed in both cases; however, the exemplar-based fill-in appears more natural. Best viewed in color.
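A sketch of the boundary-weighted exemplar score described in this section; the exact decay of the weights is not specified in the text, so the 1/(1 + d) falloff below is our assumption:

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def exemplar_cost(patch, exemplar, mask):
    """Sum of weighted color distances between a masked detection
    (patch, with mask == 1 inside the car region) and a candidate
    exemplar. The distance transform gives each inside pixel its
    distance to the mask boundary, so nearby pixels dominate."""
    d = distance_transform_edt(mask)
    weight = mask / (1.0 + d)          # assumed decay toward the interior
    diff = np.sum((patch - exemplar) ** 2, axis=-1)
    return np.sum(weight * diff)
```

The lowest-cost exemplar is overlaid on the detection mask and enters the fusion model of Section 3 as an additional input observation.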



5 Conclusion

We have presented a novel variational method to fuse redundant gray and color images using wavelet-based priors for regularization. To compute the solution of our large-scale optimization problem we exploit an optimal first-order primal-dual algorithm, which can be accelerated using parallel computation techniques. We have shown that our fusion method is well suited for orthographic image generation in high-resolution aerial imagery, but also for an integrated exemplar-based fill-in to remove e.g. non-stationary objects like cars. Future work will concentrate on synthetic view generation in ground-level imagery, similar to the idea of [3], and on computing super-resolution from many redundant observations.

References

1. Hirschmüller, H.: Stereo vision in structured environments by consistent semi-global matching. In: Proc. Conf. on Comp. Vision and Pattern Recognition. (2006)
2. Fitzgibbon, A., Wexler, Y., Zisserman, A.: Image-based rendering using image-based priors. In: Proc. Int. Conf. on Comp. Vision. (2003)
3. Agarwala, A., Agrawala, M., Cohen, M., Salesin, D., Szeliski, R.: Photographing long scenes with multi-viewpoint panoramas. ACM Trans. on Graphics (SIGGRAPH) 25(3) (2006)
4. Strecha, C., Van Gool, L., Fua, P.: A generative model for true orthorectification. Int. Archives of Photogrammetry and Remote Sensing 37 (2008) 303-308
5. Goldluecke, B., Cremers, D.: A superresolution framework for high-accuracy multiview reconstruction. In: Proc. Pattern Recognition DAGM. (2009)
6. Unger, M., Pock, T., Werlberger, M., Bischof, H.: A convex approach for variational super-resolution. In: Proc. Pattern Recognition DAGM. (2010)
7. Zach, C., Pock, T., Bischof, H.: A globally optimal algorithm for robust TV-L1 range image integration. In: Proc. Int. Conf. on Comp. Vision. (2007)
8. Woodford, O.J., Reid, I.D., Torr, P.H.S., Fitzgibbon, A.W.: On new view synthesis using multiview stereo. In: Proc. British Machine Vision Conf. (2007)
9. Selesnick, I.W., Baraniuk, R.G., Kingsbury, N.G.: The dual-tree complex wavelet transform. Signal Processing Magazine 22(6) (2005) 123-151
10. Pajares, G., de la Cruz, J.M.: A wavelet-based image fusion tutorial. Pattern Recognition 37(9) (2004) 1855-1872
11. Tikhonov, A.N.: On the stability of inverse problems. Doklady Akademii Nauk SSSR 39(5) (1943) 195-198
12. Rudin, L., Osher, S.J., Fatemi, E.: Nonlinear total variation based noise removal algorithms. Physica D 60 (1992) 259-268
13. Nikolova, M.: A variational approach to remove outliers and impulse noise. Journal of Mathematical Imaging and Vision 20(1-2) (2004) 99-120
14. Carlavan, M., Weiss, P., Blanc-Féraud, L., Zerubia, J.: Complex wavelet regularization for solving inverse problems in remote sensing. In: Proc. Geoscience and Remote Sensing Society. (2009)
15. Portilla, J., Simoncelli, E.P.: A parametric texture model based on joint statistics of complex wavelet coefficients. Int. Journal of Comp. Vision 40 (2000) 49-71
16. Fadili, M., Starck, J.L., Murtagh, F.: Inpainting and zooming using sparse representations. Computer Journal 52(1) (2009)




17. Candès, E., Demanet, L., Donoho, D., Ying, L.: Fast discrete curvelet transforms. Multiscale Modeling and Simulation 5(3) (2006) 861-899
18. Starck, J.L., Elad, M., Donoho, D.: Image decomposition via the combination of sparse representations and a variational approach. Trans. on Image Processing 14 (2004) 1570-1582
19. Huber, P.: Robust Statistics. Wiley, New York (1981)
20. Nesterov, Y.: Smooth minimization of non-smooth functions. Mathematical Programming Series A 103 (2005) 127-152
21. Nemirovski, A.: Prox-method with rate of convergence O(1/t) for variational inequalities with Lipschitz continuous monotone operators and smooth convex-concave saddle point problems. Journal on Optimization 15(1) (2004) 229-251
22. Chambolle, A., Pock, T.: A first-order primal-dual algorithm for convex problems with applications to imaging. Technical report, TU Graz (2010)
23. Esser, E., Zhang, X., Chan, T.: A general framework for a class of first order primal-dual algorithms for TV minimization. Technical Report 67, UCLA (2009)
24. Grabner, H., Nguyen, T.T., Grabner, B., Bischof, H.: On-line boosting-based car detection from aerial images. J. of Photogr. and R. Sensing 63(3) (2008) 382-396
25. Hays, J., Efros, A.A.: Scene completion using millions of photographs. ACM Trans. on Graphics (SIGGRAPH) 26(3) (2007)
