
An Integrated and Scalable Approach to Video Enhancement in Challenging Lighting Conditions

Xuan Dong, Jiangtao (Gene) Wen, Weixin Li, Yi (Amy) Pang, Guan Wang, Yao Lu, Wei Meng

Abstract—We describe a novel, integrated and scalable approach to video enhancement for video acquired under a broad range of challenging lighting conditions. We show that by using the same core enhancement algorithm with the proper pre-processing module, video clips captured in low lighting, bad weather (e.g. hazy, rainy and snowy conditions), and high dynamic range situations can all benefit from the proposed system. We also propose to utilize the temporal and spatial redundancies inherent in video signals not only to facilitate real-time processing but also to improve the temporal and spatial consistency of the output and the overall visual quality. Various techniques to further improve the visual quality of the output are described, making the proposed approach a scalable system that can be deployed as an integrated module in either a video encoder or a video decoder, or as a combination of a codec module and a post-processing system for better visual quality.

Index Terms—Video Enhancement, Computational Photography.

I. INTRODUCTION

Mobile cameras such as those embedded in smart phones are increasingly widely deployed, and are expected to acquire, record, and sometimes compress and transmit video in all lighting and weather conditions. On the iPhone, software such as Skype and FaceTime supports real-time two-way video conferencing over 3G or WiFi networks using the video camera on the phone. The popular Flip camera can not only record video in HD resolution, but also upload the clips to video sharing or social network sites such as YouTube or Facebook. On YouTube, among the over 13 million hours of video that users uploaded in 2010, at least 3 of the top 10 most watched clips (No. 9, "Jimmy Surprises Bieber Fan", No. 6, "Yosemite Bear Mountain Giant Double Rainbow", and No. 3, "Greyson Chance singing Paparazzi") were apparently shot with non-professional equipment [1].

The majority of portable cameras, however, are not specifically designed to be all-purpose and weather-proof, rendering the video footage unusable under many circumstances.

Image and video processing and enhancement techniques, including gamma correction, de-hazing, and de-blurring, are well-studied areas. Although many algorithms perform well for specific lighting impairments, they often require tedious and sometimes input-dependent manual fine-tuning of algorithm parameters. In addition, different types of impairments often require different specific algorithms. Take low lighting video enhancement as an example. Although far and near infrared based systems ([2], [3], [4], [5]) are widely used, especially for "professional" video surveillance applications, they are usually more expensive, harder to maintain, and have a relatively shorter life-span than conventional systems. They also introduce extra, and often considerable, power consumption. In many consumer applications such as video capture and communications on smart phones, it is usually not feasible to deploy infrared systems due to such cost and power consumption issues. On the other hand, image and video processing based low lighting enhancement algorithms combining noise reduction, contrast enhancement, tone-mapping, histogram stretching, equalization, and gamma correction techniques have made tremendous progress over the years, and algorithms such as [6] and [7] have produced very good enhancement results.

Xuan Dong, Jiangtao (Gene) Wen, Yi (Amy) Pang and Wei Meng are with the Computer Science and Technology Department, Tsinghua University, Beijing, China, 100084. Weixin Li and Guan Wang are with the Computer Science and Technology Department, Beihang University, Beijing, China, 100191. Yao Lu is with the Electronic Engineering Department, Tsinghua University, China, 100084. E-mail: jtwen@tsinghua.edu.cn

The algorithm in [6] utilized the temporal correlations of the color and lighting information of pixels, and used spatial-temporal smoothing to reduce the noise level of each frame, followed by further improvement through tone mapping. As the approach was pixel-based and made no distinction between foreground objects and background, the algorithm sometimes resulted in spatial inconsistencies, and/or under-enhancement of the foreground or over-enhancement of the background. The complexity of the overall algorithm was also fairly high: the reported processing speed was only 6 fps even with GPU acceleration. [7] took moving objects into consideration and used bilateral filtering to improve visual quality. However, its computational complexity was also very high, and enhancing each frame took more than ten seconds.

Recently, we proposed a novel low complexity video enhancement algorithm [8]. The algorithm was based on the observation that "after inverting the input, pixels in the background regions of the inverted low-lighting video usually have high intensities in all color (RGB) channels while those of foreground regions usually have at least one color channel whose intensity is low", and that "this is very similar to video captured in hazy weather conditions". As a result, in [8], we proposed to apply image de-hazing algorithms to inverted low-lighting video for enhancement.

On the other hand, even though many video editing software packages of different levels of sophistication are available, due to the time and expertise required, more and more users rely on web-based ("cloud-based") video editing software to



automate the process of editing and uploading video as much as possible. For video clips to be processed by popular web-based systems such as JayCut, they must first be compressed before they can be uploaded over the Internet, and oftentimes the compression is done with sub-optimal settings for the video encoder. Even for local editing by experts with professional software, where compression and uploading are not necessary, video clips captured by portable cameras such as mobile phones or the Flip camera are already compressed, usually by a low power video encoder that introduces significant quality loss. It is therefore required that video processing algorithms be able to handle video containing artifacts created both by capture conditions (e.g. low lighting, high dynamic range, etc.) and by compression.

In this paper, we describe a novel integrated and scalable video enhancement approach applicable to a wide range of input impairments commonly encountered in mobile video applications. The core enhancement algorithm has much lower computational and memory complexity than other existing solutions of similar enhancement performance. In our system, a low complexity automatic module first determines the predominant source of impairment in the input video. The input is then pre-processed based on the particular source of impairment, followed by processing by the core enhancement module. Finally, post-processing is applied to produce the enhanced output. In addition, spatial and temporal correlations are utilized to improve the speed of the algorithm and the visual quality, enabling it to be embedded into video encoders or decoders and to share the temporal and spatial prediction modules of the video codec to further lower complexity.

Although the system in this paper is described in detail in the context of using de-hazing as the core enhancement algorithm, it should be noted that the main contribution of the work is to establish the connections between the enhancement problems for video captured with a wide range of lighting impairments. We show that it is possible to achieve reasonably good enhancement results in real time or close to real time, even with the limited resources available on a netbook or even a mobile phone. We also show that by introducing more optimization techniques, the processing quality can be further improved, thereby achieving a scalable architecture. Using the approach in this paper, one can focus on designing a suitable core enhancement algorithm (based on either de-hazing or low-lighting enhancement techniques), post-optimization algorithms, and/or temporal/spatial acceleration techniques. By integrating specific algorithms and techniques tailored for individual applications into the scalable system, optimized results can be achieved for generic applications as well as for highly specific requirements.

A major advantage of the approach described in this paper is its flexibility, which is reflected in several key aspects. First of all, the system is of a low enough complexity that it can be embedded into a portable camera system. It can also be incorporated into post-processing software with various degrees of complexity for different quality-complexity tradeoffs targeting different applications. Secondly, it can be adopted as a standalone module, or as an integrated part of a video encoder or decoder. By integrating the system into an encoder or a decoder, one can not only share information between the codec and the enhancement system, thereby lowering the combined complexity, but also usually improve the quality after processing. Finally, the multiple steps of the algorithm can be implemented as a complete system, or the baseline features of the system can be implemented on a portable device for basic enhancement in real time applications, while the more sophisticated steps (e.g. further noise reduction and better tone mapping of the preliminarily enhanced video) can be performed on a cloud server. It is also conceivable that, with the advance of computational photography and the development of good high dynamic range or low lighting capable image sensors, the core enhancement module could become the sensor itself, with only the pre-processing and post-processing modules required to handle the challenges in a large variety of applications.

The paper is organized as follows. In Section II, we present the evidence for and establish the connections between video de-hazing, low-lighting video enhancement and high dynamic range video enhancement. We show that for a wide range of applications, especially applications targeting low complexity and mobile platforms, satisfactory results can be achieved with a single core algorithm for a wide range of video enhancement problems. Then, using the low-complexity de-hazing algorithm explained in Section III as an example of a possible choice for the core algorithm, we describe in Section IV various techniques for reducing the computational and memory complexity of the algorithm, and various techniques that can be used in conjunction with the core algorithm to further improve its visual quality. Given that in real-world applications the video enhancement module could be deployed at multiple stages of the end-to-end system, e.g. before compression and transmission/storage, after compression and transmission/storage but before decompression, or after decompression and before the video content is displayed on the monitor, we examine the complexity and rate-distortion (RD) tradeoffs associated with applying the proposed algorithm at these different stages with experimental results in Section V. Finally, we conclude the paper in Section VI.

II. AN INTEGRATED APPROACH TO VIDEO ENHANCEMENT

The motivation for our algorithm is the observation made in [8] that if one performs a pixel-wise inversion of low lighting video, the results look quite similar to hazy video. Through experiments, we found that the same also holds true for a significant percentage of high dynamic range video. Here, the "inversion" operation is simply

R_c(x) = 255 - I_c(x), \qquad (1)

where I_c(x) and R_c(x) are the intensities of color (RGB) channel c at pixel x in the input and inverted frames, respectively.
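As a concrete illustration, the inversion in (1) amounts to a single vectorized operation per frame. The short sketch below is our own minimal example, assuming 8-bit frames stored as NumPy arrays; the function name is ours.

```python
import numpy as np

def invert_frame(frame: np.ndarray) -> np.ndarray:
    """Pixel-wise inversion of an 8-bit frame: R_c(x) = 255 - I_c(x)."""
    assert frame.dtype == np.uint8
    return np.uint8(255) - frame

# Inverting twice recovers the original frame exactly:
# frame = np.random.randint(0, 256, (480, 640, 3), dtype=np.uint8)
# assert np.array_equal(invert_frame(invert_frame(frame)), frame)
```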

To verify this claim, we randomly selected (via Google) and captured a total of 100 images and video clips each in hazy, low lighting and high dynamic range conditions.




Fig. 1: Examples of original (Top), inverted low lighting videos/images (Middle) and hazy videos/images (Bottom).

Some examples are shown in Fig. 1. As can be clearly seen from Fig. 1, the videos captured in hazy weather are indeed visually similar to videos captured in low lighting and high dynamic range conditions after inversion. This can be understood using the widely used pixel degradation model for hazy images introduced by Koschmieder in 1924 [9],

R(x) = J(x)\,t(x) + A\,(1 - t(x)), \qquad (2)

where A is the global "airlight" (ambient light reflected into the line of sight by atmospheric particles), R(x) is the intensity of pixel x that the camera captures, J(x) is the original intensity of the pixel, and t(x) is the medium transmission function describing the percentage of the light emitted from the objects that reaches the camera. In this model, each degraded pixel is a mixture of the airlight and an unknown surface radiance, and the contribution of each is governed by the medium transmission, which is determined by the scene depth and the scattering coefficient of the atmosphere.

For hazy, low lighting and high dynamic range videos, the light captured by the camera is blended with the airlight. The main difference is the actual brightness of the airlight: brighter in the case of hazy videos, darker in high dynamic range videos, and black in the case of low lighting.
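For intuition, the degradation model (2) can be simulated directly. The sketch below is our own illustration with synthetic J, t and A; it shows that only the brightness of the airlight distinguishes the three cases.

```python
import numpy as np

def koschmieder(J: np.ndarray, t: np.ndarray, A: np.ndarray) -> np.ndarray:
    """Degrade a clean frame J (HxWx3, float in [0,1]) with transmission t (HxW)
    and airlight A (length 3) following R(x) = J(x) t(x) + A (1 - t(x))."""
    t3 = t[..., None]                         # broadcast t over the color channels
    return J * t3 + A * (1.0 - t3)

# A bright airlight produces a hazy look, a near-black airlight a low lighting look:
# J = np.random.rand(240, 320, 3); t = np.full((240, 320), 0.6)
# hazy = koschmieder(J, t, np.array([0.9, 0.9, 0.9]))
# dark = koschmieder(J, t, np.array([0.05, 0.05, 0.05]))
```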

We also performed chi-square tests to examine the statistical similarity between hazy videos and inverted low lighting and high dynamic range videos. The chi-square test is a standard statistical tool widely used to determine whether observed data are consistent with a hypothesis. As explained in [10], in chi-square tests a p value is calculated, and usually, if p > 0.05, it is reasonable to assume that the deviation of the observed data from the expectation is due to chance alone. In our experiments, the expected distribution was calculated from hazy videos, and the observed statistics from inverted low lighting and high dynamic range videos were tested against it. We divided the range [0, 255] of color channel intensities into eight equal intervals, corresponding to 7 degrees of freedom. According to the chi-square distribution table, if we adopt the common standard of p > 0.05, the corresponding upper threshold for the chi-square value should be 14.07. The histograms of the minimum intensity over all color channels of all pixels of the hazy videos, inverted low lighting videos and inverted high dynamic range videos used in the tests were compared; some examples are shown in Fig. 2. The results of the chi-square tests are given in Table I. As can be seen from the table, the chi-square values are smaller than 14.07, indicating that our hypothesis of the similarity between hazy videos and inverted low lighting videos, and between hazy videos and inverted high dynamic range videos, is reasonable.
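The statistic itself is straightforward to compute. The sketch below is our reading of the procedure: the per-pixel minimum over the RGB channels is binned into eight equal intervals of [0, 255], the hazy set provides the expected distribution, and the resulting value is compared against the 14.07 threshold (p > 0.05 at 7 degrees of freedom). The normalization of the expected counts to the observed pixel total is an assumption on our part.

```python
import numpy as np

def min_channel_histogram(frames, bins=8):
    """Proportions of per-pixel minimum RGB intensities over `bins` equal bins of [0, 255]."""
    counts = np.zeros(bins)
    for f in frames:                                   # each f is an HxWx3 uint8 frame
        h, _ = np.histogram(f.min(axis=2), bins=bins, range=(0, 255))
        counts += h
    return counts / counts.sum()

def chi_square_value(observed_frames, expected_frames, bins=8):
    """Chi-square value of the observed histogram against the expected one,
    with both distributions scaled to the observed pixel count."""
    n = sum(f.shape[0] * f.shape[1] for f in observed_frames)
    expected = min_channel_histogram(expected_frames, bins) * n
    observed = min_channel_histogram(observed_frames, bins) * n
    return float(((observed - expected) ** 2 / np.maximum(expected, 1e-9)).sum())

# consistent_with_haze = chi_square_value(inverted_low_light_frames, hazy_frames) < 14.07
```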

Finally, the observation was confirmed by various haze detection algorithms: we implemented haze detection using the HVS threshold range based method [11], the Dark Object Subtraction (DOS) approach [12], and the spatial frequency based technique [13], and found that hazy videos, inverted low lighting videos and inverted high dynamic range videos were all classified as hazy video clips, whereas "normal" clips were not.

In our experiments, we also tested image and video clips captured in bad weather conditions such as rainy and snowy weather. Some of the examples are given in later sections of the paper.

Based on these visual observations and statistical tests, we believe that for the purpose of video enhancement, especially for applications on mobile devices, it is reasonable to categorize lighting impairments into two large classes, namely low lighting video and hazy video. As the experiments and analysis also show similarities between inverted low lighting video and hazy video, it is conceivable that for applications such as mobile systems, a video enhancement system could employ the same core algorithm, integrated with an automatic classifier that decides whether the input is low lighting or hazy video, followed by the inversion operation if necessary, and then processing by the core algorithm, as sketched below. Although the rest of the paper uses a de-hazing algorithm as the core enhancement algorithm, it is also possible for some systems to use a low-lighting enhancement module as the core processing module while performing the inversion operation on hazy, as opposed to low lighting, inputs.
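The resulting pipeline is compact. The sketch below summarizes the flow under the stated design; `detect_impairment` and `dehaze` are placeholders for the modules described in Section III.

```python
def enhance_frame(frame, detect_impairment, dehaze):
    """Integrated enhancement with a single de-hazing core.
    `detect_impairment` returns 'normal', 'haze' or 'low_light'
    (the latter also covering high dynamic range input)."""
    kind = detect_impairment(frame)
    if kind == 'normal':
        return frame                       # no processing needed
    if kind == 'low_light':
        return 255 - dehaze(255 - frame)   # invert, de-haze, invert back
    return dehaze(frame)                   # hazy / rainy / snowy input
```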

III. BASELINE INTEGRATED VIDEO ENHANCEMENT SYSTEM FOR CHALLENGING LIGHTING CONDITIONS - AN EXAMPLE

Given the connections between the enhancement problems for video captured in different challenging lighting conditions, our baseline experimental system consists of an automatic impairment detection module and a video de-hazing based core enhancement module. As already pointed out in the introduction, this particular implementation is simply one among many possible designs of the concept. It is intended to show that even a relatively simple design with off-the-shelf techniques can achieve reasonably good results for many applications and on many platforms.



Fig. 2: Histograms of the minimum intensity over each pixel's three color channels for hazy videos (Left), low lighting videos (Middle) and high dynamic range videos (Right).

TABLE I: Results of chi-square tests

Data of chi-square test                              Degrees of freedom    Chi-square value
Hazy videos and inverted low lighting videos                  7                 13.21
Hazy videos and inverted high dynamic range videos            7                 11.53

TABLE II: Parameter settings for the haze detection algorithm.

Color attribute    Value range    Threshold range
S                  0 ∼ 255        0 ∼ 130
V                  0 ∼ 255        90 ∼ 240

A. Automatic Impairment Source Detection

The function of the automatic impairment source detection module is to classify the input video into normal video that does not need to be processed, low lighting video (which also includes high dynamic range video), for which pixel-wise inversion is performed first, or hazy video (which also includes video captured in rainy and snowy weather), which is processed by the core enhancement module directly.

A flow diagram of this automatic detection system is shown in Fig. 3. Our detection algorithm is based on the technique introduced by R. Lim et al. [11]. To reduce complexity, we only perform the detection for the first frame in a Group of Pictures (GOP), coupled with scene change detection. The corresponding algorithm parameters are given in Table II. The same test is conducted for each pixel in the frame. If the percentage of hazy pixels in a picture is higher than 60%, we designate the picture as a hazy picture. Similarly, if an image is determined to be a hazy picture after inversion, it is a low lighting image. A sketch of this classification logic is given below.
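The sketch below is a minimal version of this classifier, assuming 8-bit HSV attributes on the 0-255 scale (e.g. as returned by OpenCV) so that the thresholds of Table II apply directly; details of [11] beyond the table are not reproduced here.

```python
import cv2
import numpy as np

S_RANGE = (0, 130)     # saturation threshold range from Table II
V_RANGE = (90, 240)    # value threshold range from Table II

def is_hazy(frame_bgr: np.ndarray, ratio: float = 0.6) -> bool:
    """A frame is designated hazy if more than `ratio` of its pixels fall
    inside both threshold ranges."""
    hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)
    s, v = hsv[..., 1], hsv[..., 2]
    hazy = (s >= S_RANGE[0]) & (s <= S_RANGE[1]) & (v >= V_RANGE[0]) & (v <= V_RANGE[1])
    return float(hazy.mean()) > ratio

def classify_impairment(frame_bgr: np.ndarray) -> str:
    """Return 'haze', 'low_light' (hazy after inversion) or 'normal'."""
    if is_hazy(frame_bgr):
        return 'haze'
    if is_hazy(255 - frame_bgr):
        return 'low_light'
    return 'normal'
```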

B. Video De-Hazing Based Core Enhancement

Similar to [8], in our experiments we used a system in which the core enhancement algorithm is an improved video de-hazing algorithm based on the image de-hazing algorithm of [14].

Fig. 3: Flow diagram of the impairment source detection module.

Like many other advanced haze-removal algorithms such as [15], [16], and [17], the algorithm of [14] is based on the aforementioned Koschmieder model in (2). The critical part of all image de-hazing algorithms based on the Koschmieder model is to estimate A and t(x) from the recorded image intensity R(x) so as to recover J(x).

Following [14], we estimate the medium transmission and the airlight using the Dark Channel method:

t(x) = 1 - \omega \min_{c \in \{r,g,b\}} \min_{y \in \Omega(x)} \frac{R_c(y)}{A_c}, \qquad (3)

where ω = 0.8 and Ω(x) is a local 3 × 3 block centered at x.
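A direct NumPy sketch of (3) is given below; the per-channel airlight A is assumed to be known, and edge pixels are handled by replication, which is our choice rather than something specified in [14].

```python
import numpy as np

def transmission(R: np.ndarray, A: np.ndarray, omega: float = 0.8, block: int = 3) -> np.ndarray:
    """t(x) = 1 - omega * min_c min_{y in Omega(x)} R_c(y) / A_c for an HxWx3 frame R,
    per-channel airlight A (length 3) and a block x block window Omega(x)."""
    dark = (R.astype(np.float64) / A.reshape(1, 1, 3)).min(axis=2)  # min over channels
    pad = block // 2
    padded = np.pad(dark, pad, mode='edge')
    H, W = dark.shape
    local_min = np.full_like(dark, np.inf)
    for dy in range(block):                 # min over the local window
        for dx in range(block):
            local_min = np.minimum(local_min, padded[dy:dy + H, dx:dx + W])
    return 1.0 - omega * local_min
```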

In our experiments, the CPU- and memory-intensive soft matting method proposed in [14] was not implemented in the baseline system, but it could be used as a post-processing step, e.g. if the output of the baseline system is subsequently uploaded to a high-powered server in the cloud.

To estimate the airlight, we first note that the schemes in existing image haze removal algorithms are usually not very robust: even very small changes to the airlight value may lead to very large changes in the recovered images or video frames. As a result, calculating the airlight frame by frame



not only increases the overall complexity of the system, but also introduces visual inconsistencies between frames, thereby creating annoying visual artifacts. Fig. 5 shows an example using the results of the algorithm in [14]; notice the difference between the first and second frames in the middle row.

Fig. 4: Examples of the processing steps of the low lighting enhancement algorithm (Left to Right, Top to Bottom): input image I, inverted input image R, haze removal result J of the image R, and output image.

Fig. 5: Comparison of original, haze removal, and optimized haze removal video clips. Top: input video sequences; Middle: outputs of the image haze removal algorithm of [14]; Bottom: outputs of haze removal using our optimized algorithm for calculating the airlight.

Based on this observation, we calculate the airlight value only for the first frame in a GOP. The same value is then used for all subsequent frames in the same GOP. In the implementation, we also incorporated a scene change detection module so as to detect sudden changes in airlight that are not aligned with GOP boundaries but merit recalculation. Among successive GOPs, to avoid abrupt changes of the global airlight value A, we refresh the airlight by

A = 0.4\,A + 0.6\,A_t, \qquad (4)

where A_t is the airlight value calculated in GOP t and A is the global airlight value. Examples of the recovered results are shown in Fig. 5: the frames in the bottom row change gradually using our algorithm, as opposed to the results for the same frames produced by the frame-by-frame approach in the middle row.
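A sketch of the per-GOP airlight refresh of (4) is shown below. How A_t is obtained for the first frame of a GOP only loosely follows the usual dark-channel practice of [14] (the mean color of the brightest dark-channel pixels), so that part is an assumption of ours.

```python
import numpy as np

def estimate_airlight(frame: np.ndarray, top_fraction: float = 0.001) -> np.ndarray:
    """Per-channel airlight estimate: mean color of the brightest dark-channel pixels."""
    dark = frame.min(axis=2).ravel()
    k = max(1, int(top_fraction * dark.size))
    idx = np.argpartition(dark, -k)[-k:]
    return frame.reshape(-1, 3)[idx].astype(np.float64).mean(axis=0)

class AirlightTracker:
    """Keeps the global airlight A fixed within a GOP and refreshes it at GOP
    boundaries (or detected scene changes) with A = 0.4 A + 0.6 A_t, as in (4)."""
    def __init__(self):
        self.A = None

    def refresh(self, first_frame_of_gop: np.ndarray) -> np.ndarray:
        A_t = estimate_airlight(first_frame_of_gop)
        self.A = A_t if self.A is None else 0.4 * self.A + 0.6 * A_t
        return self.A
```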

Once A is found, from (2),

J(x) = \frac{R(x) - A}{t(x)} + A. \qquad (5)

Although (5) works reasonably well for haze removal, for low-lighting enhancement we found that it might lead to under-enhancement of low luminance areas and over-enhancement of high luminance areas. To solve this problem, we modified (5) to

J(x) = \frac{R(x) - A}{P(x)\,t(x)} + A, \qquad (6)

where

P(x) = \begin{cases} K\,t(x), & 0 < t(x) \le 0.5, \\ -K\,t^2(x) + M, & 0.5 < t(x) \le 1. \end{cases} \qquad (7)

In (7), K = 0.6 and M = 0.5, determined through experiments.

The idea behind (6) is as follows. When t(x) is smaller than 0.5, which means that the corresponding pixel needs boosting, we assign P(x) a small value to make P(x)t(x) even smaller, thereby increasing the corresponding J(x) and hence the RGB intensities of the pixel. On the other hand, when t(x) is greater than 0.5, we refrain from overly boosting the corresponding pixel intensity. When t(x) is close to 1, the quantity P(x)t(x) leads to a slight "dulling" of the pixel. This makes the overall visual quality more balanced and visually pleasant.
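Equations (6) and (7) translate directly into a few vectorized lines. The sketch below uses the experimentally chosen K = 0.6 and M = 0.5; the small epsilon guarding the division and the final clipping to [0, 255] are our additions.

```python
import numpy as np

def recover(R: np.ndarray, t: np.ndarray, A: np.ndarray,
            K: float = 0.6, M: float = 0.5) -> np.ndarray:
    """J(x) = (R(x) - A) / (P(x) t(x)) + A with the piecewise P(x) of (7).
    R is HxWx3 float, t is HxW, A is a per-channel airlight of length 3."""
    P = np.where(t <= 0.5, K * t, -K * t ** 2 + M)
    denom = np.maximum(P * t, 1e-3)[..., None]   # guard against division by ~zero
    J = (R - A) / denom + A
    return np.clip(J, 0.0, 255.0)
```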



Fig. 6: Examples of optimizing the low lighting and high dynamic range enhancement algorithm by introducing P(x): input (Left), output of the enhancement algorithm without P(x) (Middle), and output of the enhancement algorithm with P(x) (Right).

For low lighting and high dynamic range videos, once J(x) is recovered, the inversion operation (1) is applied again to produce the enhanced version of the original input. This process is illustrated in Fig. 4. The improvement obtained by introducing P(x) can be seen in Fig. 6.

Fig. 7: Histogram of the differences in t(x) values between each predicted block's pixels and its reference block's pixels (horizontal axis: relative difference of t(x)).

IV. OPTIMIZATIONS OF THE BASELINE SYSTEM

A. Algorithmic Optimizations

1) Motion Estimation Based Acceleration and Quality Improvement:

The algorithm described in Section III is a frame-based approach, and the calculation of t(x) consumes about 60% of the total computation time. For real-time, low complexity processing of video inputs, calculating t(x) frame by frame not only has high computational complexity, but also makes the output much more sensitive to temporal and spatial noise, and harms the temporal and spatial consistency of the processed outputs.

To remedy these problems, we note that t(x) and the other model parameters are correlated temporally and spatially. As a result, their calculation can be expedited using motion estimation/compensation (ME/MC) techniques.

ME/MC is a key procedure in all state-of-the-art video compression algorithms. By matching blocks in subsequently encoded frames to find the "best" match between a current block and a block of the same size that has already been encoded and then decoded (the "reference"), video compression algorithms use the reference as a prediction of the current block and encode only the difference (termed the "residual") between the reference and the current block, thereby improving coding efficiency. The process of finding the best match between a current block and a block in a reference frame is called "motion estimation", and the "best" match is usually determined by jointly considering the rate and distortion costs of the match. If a "best" match block is found, the current block is encoded in the inter mode and only the residual is encoded. Otherwise, the current block is encoded in the intra mode. The most commonly used distortion metric in motion estimation is the Sum of Absolute Differences (SAD).

To verify the feasibility of using temporal block matching and ME to expedite the t(x) calculation, we calculated the differences of t(x) values for pixels in the predicted and reference blocks. The statistics in Fig. 7 show that the differences are less than 10% in almost all cases. As a result, we can utilize ME/MC to bypass the calculation of t(x) for the majority of the pixels/frames, and only calculate t(x) for a small number of selected frames. For the remainder of the frames, we use the corresponding t(x) values of the reference pixels. For motion estimation, we used mature fast motion estimation algorithms, e.g. Enhanced Predictive Zonal Search (EPZS) [18]. When calculating the SAD, similar to [19] and [20], we only utilized a subset of the pixels in the current and reference blocks, using the pattern shown in Fig. 8. With this pattern, our calculation "touches" a total of 60 pixels in a 16 × 16 block, or roughly 25%. These pixels are located on either the diagonals or the edges, resulting in about a 75% reduction in SAD calculation when implemented in software on a general purpose processor.

Fig. 8: Subsampling pattern of the proposed fast SAD algorithm.

Specifically, when the proposed algorithm is deployed prior to video compression or after video decompression, we first divide the input frames into GOPs. The GOPs can either contain a fixed number of frames, or be determined based on a maximum GOP size (in frames) and scene change detection. Each GOP starts with an intra-coded frame (I frame), for which all t(x) values are calculated. ME is performed for the remaining frames (P




frames) of the GOP, similar to conventional video encoding. To this end, each P frame is divided into non-overlapping 16 × 16 blocks, for each of which a motion search using the SAD is conducted. A threshold T is defined for the SAD of a block: if the SAD is below the threshold, which means a "best" match block has been found, the calculation of t(x) for the entire macroblock is skipped. Otherwise, t(x) still needs to be calculated. In both cases, the t(x) values for the current frame are stored for reference by the next frame. The flow diagram is shown in Fig. 9; a sketch of the per-block decision is given below.

Fig. 9: Flow diagram of the core enhancement algorithm with ME acceleration.
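In the sketch below, the subsampling mask simply takes the 60 border pixels of a 16 × 16 block as a stand-in for the exact diagonal/edge pattern of Fig. 8, and the threshold T, the motion search itself (e.g. EPZS) and `compute_t` are assumed to be supplied by the surrounding system.

```python
import numpy as np

def border_mask(n: int = 16) -> np.ndarray:
    """Boolean mask selecting the 60 border pixels of an n x n block (about 25%)."""
    m = np.zeros((n, n), dtype=bool)
    m[0, :] = m[-1, :] = m[:, 0] = m[:, -1] = True
    return m

def subsampled_sad(cur: np.ndarray, ref: np.ndarray, mask: np.ndarray) -> int:
    """SAD evaluated only on the masked subset of the block."""
    return int(np.abs(cur[mask].astype(np.int32) - ref[mask].astype(np.int32)).sum())

def block_transmission(cur_block, ref_block, ref_t_block, mask, T, compute_t):
    """Reuse the reference block's t(x) when the subsampled SAD is below T,
    otherwise recompute t(x) for the current block."""
    if subsampled_sad(cur_block, ref_block, mask) < T:
        return ref_t_block.copy()     # temporal reuse of t(x)
    return compute_t(cur_block)       # fall back to the frame-based estimate
```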

In addition to operating as a stand-alone module with uncompressed pixel information as both the input and output, the ME-accelerated enhancement algorithm can also be integrated into a video encoder or a video decoder. When the algorithm is integrated with a video encoder, the encoder and the enhancement can share the ME module. When integrated with the decoder, the system has the potential of using the motion information contained in the input video bitstream directly, thereby bypassing the entire ME process. Such integration will usually lead to a rate-distortion (RD) loss. The reason for this loss is first and foremost that the ME module in the encoder with which the enhancement module is integrated, or the encoder that produced the bitstream that a decoder with enhancement decodes, may not be optimized for finding the best matches in t(x) values. For example, when the enhancement module is integrated with a decoder, it may have to decode an input bitstream encoded by a low complexity encoder using a very small ME range. The traditional SAD or SAD-plus-rate metrics for ME are also not optimal for the t(x) match search. However, through extensive experiments with widely used encoders and decoders, we found that such quality loss was usually small, and well justified by the savings in computational cost. The flow diagrams of integrating the ME acceleration enhancement algorithm into the encoder and decoder are shown in Fig. 13 and Fig. 14. Some of the comparisons can be found in Section V.
2) Visual Quality Improvement with Motion Detection: As mentioned in previous sections, depending on the target application and the camera and processing platforms used, different systems can introduce different add-on modules on top of the baseline system for further improvements in visual quality. In this section we describe one, among many, such possible modules. The idea here is to focus the processing on the moving objects, which are more likely to be in the Regions of Interest (ROIs) and/or more visible to the human visual system. In our experiments, we implemented the algorithm in [21] for the segmentation of moving objects and static background. Then, depending on whether a pixel belongs to the background or a moving object, we modify the parameters K and M in the calculation of P(x) in (7) to K_moving and M_moving for moving objects, or K_background and M_background for the background, respectively. In addition, to avoid abrupt changes of luminance around the edges of moving objects, we define a band W_trans pixels wide around the moving objects as the transition area. For the transition area, P(x) is calculated using

K_{trans} = \frac{d}{W_{trans}}\, K_{moving} + \frac{W_{trans} - d}{W_{trans}}\, K_{background}, \qquad (8)

and

M_{trans} = \frac{d}{W_{trans}}\, M_{moving} + \frac{W_{trans} - d}{W_{trans}}\, M_{background}, \qquad (9)

where d is the distance between the pixel x and the edge of the moving object with which the transition area borders. In our experiments, K_moving is set to 0.6, M_moving to 0.5, K_background to 0.8, and M_background to 1.2.
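The blending of (8) and (9) is a simple linear interpolation across the transition band. The sketch below uses the parameter values quoted above; the clamping of d to [0, W_trans] is our addition.

```python
def transition_params(d: float, W_trans: float,
                      K_moving: float = 0.6, M_moving: float = 0.5,
                      K_background: float = 0.8, M_background: float = 1.2):
    """K_trans and M_trans for a transition-band pixel at distance d from the
    moving object's edge, following (8) and (9)."""
    w = min(max(d, 0.0), W_trans) / W_trans
    K_trans = w * K_moving + (1.0 - w) * K_background
    M_trans = w * M_moving + (1.0 - w) * M_background
    return K_trans, M_trans
```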

B. Implementation Optimizations

In addition to the algorithmic optimizations described in the previous sections, the implementation of the core algorithm can be further optimized by taking advantage of the redundancies inherent in the pixel-wise calculations of t(x) and I_c(x).

First of all, we integrate the calculation of t(x) in (3) into the calculation of J(x) in (6), so that

J(x) = \frac{I(x) - \omega A \min_{c}\min_{y \in \Omega(x)}\left(\frac{I_c(y)}{A_c}\right)}{1 - \omega \min_{c}\min_{y \in \Omega(x)}\left(\frac{I_c(y)}{A_c}\right)}. \qquad (10)

This allows the input I(x) to be enhanced directly without explicitly calculating t(x). It should be noted that the aforementioned ME-based acceleration is still applicable to (10) after replacing the caching of t(x) values with caching of the I_c(y)/A_c terms.

Although the algorithm in this paper, like the de-hazing algorithms in many of the referenced papers, has been described in the RGB space, it can easily be adapted to work




in the YUV space to match the input format of most practical video applications:

Y_{out}(x) = \frac{Y_{in}(x) - \omega A \min_{c}\min_{y \in \Omega(x)}\left(\frac{I_c(y)}{A_c}\right)}{1 - \omega \min_{c}\min_{y \in \Omega(x)}\left(\frac{I_c(y)}{A_c}\right)}, \qquad (11)

U_{out}(x) = \frac{U_{in}(x) - 128}{1 - \omega \min_{c}\min_{y \in \Omega(x)}\left(\frac{I_c(y)}{A_c}\right)} + 128, \qquad (12)

V_{out}(x) = \frac{V_{in}(x) - 128}{1 - \omega \min_{c}\min_{y \in \Omega(x)}\left(\frac{I_c(y)}{A_c}\right)} + 128. \qquad (13)
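A sketch of the YUV-domain form (11)-(13) is given below. It reuses the dark-channel term of (3) computed from the color frame, approximates the luma of the airlight by the mean of its channels, assumes Y, U and V at the same resolution as the color frame, and clips the outputs to the 8-bit range; those last three choices are ours.

```python
import numpy as np

def dark_channel_term(I_rgb: np.ndarray, A: np.ndarray, block: int = 3) -> np.ndarray:
    """min_c min_{y in Omega(x)} I_c(y) / A_c over a block x block window."""
    dark = (I_rgb.astype(np.float64) / A.reshape(1, 1, 3)).min(axis=2)
    pad = block // 2
    padded = np.pad(dark, pad, mode='edge')
    H, W = dark.shape
    out = np.full_like(dark, np.inf)
    for dy in range(block):
        for dx in range(block):
            out = np.minimum(out, padded[dy:dy + H, dx:dx + W])
    return out

def enhance_yuv(Y, U, V, I_rgb, A, omega=0.8):
    """Apply (11)-(13): enhance Y directly and rescale the U-128, V-128 chroma offsets."""
    m = dark_channel_term(I_rgb, A)
    denom = np.maximum(1.0 - omega * m, 1e-3)
    Y_out = (Y - omega * float(A.mean()) * m) / denom
    U_out = (U - 128.0) / denom + 128.0
    V_out = (V - 128.0) / denom + 128.0
    return (np.clip(Y_out, 0.0, 255.0),
            np.clip(U_out, 0.0, 255.0),
            np.clip(V_out, 0.0, 255.0))
```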

Finally, to further speed up the implementation, we exploited the inherent redundancies in the pixel-wise calculation of the minimization in equations (10)-(13), which corresponds to a complexity of k² × W × H comparisons for an input frame of resolution W × H and a search window (for the minimization) of size k × k pixels. To expedite the process, we first find and store the smaller of every two horizontally neighboring pixels in the frame using a sliding horizontal window of size 2, requiring W × H comparisons. Then, by again using a horizontal sliding window of size 2 over the values stored in the previous step, we can find the minimum of every 4 horizontally neighboring pixels in the original input frame. This process is repeated in both the horizontal and vertical directions, until we have found the minimum of all k × k neighborhoods of the input. It is easy to see that such a strategy has a complexity of roughly 2 log₂ k × W × H comparisons, as opposed to k² × W × H for the naive implementation. The process is illustrated for one row of W pixels in Fig. 10, where the red and black lines refer to the comparisons made, each with a sliding window of 2 values; a sketch is given below.

Fig. 10: Fast calculation of the 1-D local minimum value over a row of W pixels.
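The doubling-window minimum is sketched below for a window size k that is a power of two. It computes, for each pixel, the minimum over the k × k block that starts at that pixel (a centered window only needs an extra shift), and replicates edge values near the borders, which is our choice.

```python
import numpy as np

def fast_local_min(img: np.ndarray, k: int) -> np.ndarray:
    """k x k local minimum using repeated width-2 sliding minima: about
    2 * log2(k) * W * H comparisons instead of k^2 * W * H."""
    def running_min(a: np.ndarray, axis: int) -> np.ndarray:
        out = a.copy()
        span = 1
        while span < k:                       # effective window doubles: 2, 4, ..., k
            pad = [(0, 0)] * a.ndim
            pad[axis] = (0, span)
            shifted = np.pad(out, pad, mode='edge')
            shifted = np.take(shifted, np.arange(span, span + a.shape[axis]), axis=axis)
            out = np.minimum(out, shifted)
            span *= 2
        return out
    # horizontal pass followed by vertical pass
    return running_min(running_min(img, axis=1), axis=0)

# Away from the right/bottom borders this matches the naive k x k minimum.
```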

Fig. 11: Example of the low lighting video enhancement algorithm: original input (Left) and the enhancement result (Right).

Fig. 12: Example of the high dynamic range video enhancement algorithm: original input (Left) and the enhancement result (Right).


V. EXPERIMENTAL RESULTS

To evaluate the proposed approach, a series of experiments were conducted on a Windows PC (Intel Core 2 Duo processor running at 2.0 GHz with 3 GB of RAM) and an iPhone 4. On the iPhone, our software can process images and videos either directly from the camera or from the photo album. After the processing is complete, the output is shown on the screen and saved to the photo album automatically. The resolution of the test videos in our experiments was 640 × 480 on the PC and 192 × 144 on the iPhone 4. The enhancement effects and processing speed are reported below. Due to time constraints, we only implemented the frame-by-frame baseline system on the iPhone, and did not employ many further possible optimizations on the PC platform (e.g. assembly coding).

Examples of the enhancement outputs for low lighting, high dynamic range and hazy videos are shown in Fig. 11, Fig. 12 and Fig. 15 respectively. As can be seen from these figures, the improvements in visibility are obvious. In Fig. 11, the yellow light from the windows and signs such as "Hobby Town" and other Chinese characters were recovered in the correct colors. In Fig. 12, the headlight of the car in the original input made the letters on the license plate very difficult to read; after enhancement with our algorithm, the license plate became much more legible. The algorithm also worked well for video captured in hazy, rainy and snowy weather, as shown in Fig. 15, Fig. 16 and Fig. 17. An example of the visual quality improvement using motion detection is shown in Fig. 18.

Fig. 15: Example of the haze removal algorithm: original input (Left) and the enhancement result (Right).

Fig. 16: Example of rainy video enhancement using the haze removal algorithm: original input (Left) and the enhancement result (Right).



Fig. 13: Flow diagram of the integration of the encoder and the ME acceleration enhancement algorithm.

Fig. 14: Flow diagram of the integration of the decoder and the ME acceleration enhancement algorithm.

Fig. 17: Example of snowy video enhancement using the haze removal algorithm: original input (Left) and the enhancement result (Right).

Fig. 18: Example of visual quality improvement with motion detection: original input (Left), the enhancement result (Middle) and the improved result with motion detection (Right).


As mentioned above, there are three possible ways of incorporating ME into the enhancement algorithm: through a separate ME module in the enhancement system, or by reusing the ME module and information already available in a video encoder or in a video decoder. Some example outputs of the frame-wise enhancement algorithm and of these three ways of incorporating ME are shown in Fig. 22, with virtually no visual difference. We also calculated the average RD curves of ten randomly selected experimental videos using the three acceleration methods. The reference was enhancement using the proposed frame-wise enhancement algorithm in the YUV domain. The RD curves of performing the frame-wise enhancement algorithm before encoding or after decoding are shown in Fig. 19, while the results for acceleration using a separate ME module are given in Fig. 20, and those for integrating the ME acceleration into the codec are shown in Fig. 21. As the RD curves in our experiments reflect the aggregated outcome of both coding and enhancement, and because the enhancement was not optimized for PSNR-based distortion, the shape of our RD curves looks slightly different from RD curves for video compression systems.
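To make the evaluation procedure concrete, the following minimal Python sketch shows one way a per-bitrate distortion point could be computed, with the frame-wise enhanced output as the reference, as described above. The function names, the synthetic frames and the luma-only PSNR are assumptions for illustration, not the paper's exact tooling.

import numpy as np

def psnr(ref, test, peak=255.0):
    # Standard PSNR between two frames (assumed 8-bit luma planes here).
    mse = np.mean((ref.astype(np.float64) - test.astype(np.float64)) ** 2)
    return float('inf') if mse == 0 else 10.0 * np.log10(peak * peak / mse)

def rd_point(reference_frames, test_frames, bitrate_kbps):
    # One rate-distortion point: the bitrate of the coded stream together with
    # the average PSNR of the coded/accelerated output measured against the
    # frame-wise enhanced reference sequence.
    avg_psnr = float(np.mean([psnr(r, t) for r, t in zip(reference_frames, test_frames)]))
    return bitrate_kbps, avg_psnr

# Hypothetical usage with synthetic frames; in practice the two sequences would
# be the frame-wise enhanced YUV output and one of the ME-accelerated outputs.
ref = [np.random.randint(0, 256, (144, 192), dtype=np.uint8) for _ in range(3)]
test = [np.clip(f.astype(np.int16) + np.random.randint(-2, 3, f.shape), 0, 255).astype(np.uint8) for f in ref]
print(rd_point(ref, test, bitrate_kbps=2000))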

From the results, we found that, in general, performing enhancement before encoding gives better overall RD performance. Although enhancing after decoding means we can transmit un-enhanced video clips, which usually have lower contrast and less detail and are therefore easier to compress, the reconstructed quality after decoding and enhancement is heavily affected by the quality loss introduced during encoding, leading to an overall RD performance loss of about 2 dB for the cases in our experiments. In addition, in Fig. 19 the RD loss of frame-wise enhancement was due to encoding and decoding alone. In Fig. 20, the RD loss resulted from the ME acceleration plus encoding/decoding, while in Fig. 21 it resulted from integrating the ME acceleration algorithm into the encoder and decoder. Overall, however, the RD loss introduced by ME acceleration and integration was small in PSNR terms and not visible subjectively.

We also measured the computational complexity of frame-wise enhancement, of acceleration with a separate ME module, and of integration into an encoder or a decoder. The computational cost was measured as the average time for enhancing each frame. For the cases where the enhancement was integrated into the codec, we did not count the actual encoding or decoding time, so as to measure only the enhancement itself. As shown in Table III, using a separate ME module saved about 28% of the processing time on average compared with the frame-wise algorithm. Integrating with the decoder saved about 40% of the processing time compared with the frame-wise algorithm, while integrating with the encoder saved about 77%.
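As an illustration of how such per-frame timings could be collected, the sketch below times only the enhancement call and averages over the frames of a clip, mirroring the exclusion of encoding/decoding time described above. The enhance_frame function is a placeholder assumption, not the paper's implementation.

import time
import numpy as np

def enhance_frame(frame):
    # Placeholder for the actual per-frame enhancement step; any enhancement
    # function could be timed the same way.
    return 255 - frame

def average_ms_per_frame(frames):
    # Time only the enhancement call, excluding any encoding/decoding work,
    # and report the mean cost per frame in milliseconds.
    total = 0.0
    for frame in frames:
        start = time.perf_counter()
        enhance_frame(frame)
        total += time.perf_counter() - start
    return 1000.0 * total / len(frames)

frames = [np.random.randint(0, 256, (480, 640), dtype=np.uint8) for _ in range(30)]
print("avg ms/frame:", average_ms_per_frame(frames))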

VI. CONCLUSIONS

In this paper, we propose a novel integrated approach to the enhancement of videos acquired under challenging lighting conditions, including low lighting, bad weather (hazy, rainy, snowy) and high dynamic range conditions. We show that for many applications it is usually acceptable to first classify the input video into "normal" video, low lighting video (including high dynamic range video) and hazy video (including video acquired in other bad weather conditions such as rain and snow). Then, because low lighting video and hazy video exhibit very similar visual and statistical characteristics after either one undergoes a pixel-wise inversion operation, a single enhancement module can be used for processing video captured under a broad range of bad lighting conditions.
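The observation above suggests a simple unified pipeline: invert low lighting (or high dynamic range) frames so that they resemble hazy frames, run a single dehazing core, and invert the result back. The Python sketch below illustrates that flow under the stated assumption; dehaze_core is a placeholder for whatever core dehazing algorithm is used, and classify_frame is a hypothetical classifier, neither taken from the paper's code.

import numpy as np

def dehaze_core(frame):
    # Placeholder for the single core dehazing/enhancement module
    # (e.g., a dark-channel-prior style method); not the paper's exact code.
    return frame

def classify_frame(frame):
    # Hypothetical classifier returning 'normal', 'low_light' or 'hazy';
    # the mean-intensity threshold is an arbitrary illustrative choice.
    return 'low_light' if frame.mean() < 80 else 'hazy'

def enhance(frame):
    # Unified pipeline: low lighting frames are inverted so that they look
    # like hazy frames, the shared dehazing core is applied, and the output
    # is inverted back; hazy frames go through the core directly.
    kind = classify_frame(frame)
    if kind == 'normal':
        return frame
    if kind == 'low_light':
        return 255 - dehaze_core(255 - frame)
    return dehaze_core(frame)

frame = np.random.randint(0, 256, (144, 192, 3), dtype=np.uint8)
out = enhance(frame)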

We also present, as an example, a baseline video enhancement system built on a very simple video dehazing algorithm using "off-the-shelf" technologies. The resulting system runs in real time on a PC and achieved good speed and enhancement quality even on an iPhone 4. The results given in the paper demonstrate the usefulness and promise of the approach. Through experiments, we also examine the tradeoffs associated with integrating the proposed system into different "links" of the video acquisition, coding, transmission and consumption chain. Potentially, the proposed approach could be integrated into sensors, codecs and video processing software and systems, or its different processing steps could be distributed across these links to offer a tiered scheme for quality improvement.

Areas of further improvement include better pre-processing filters targeting specific sources of impairment, especially high dynamic range inputs, further optimization using denoising, tone mapping and other techniques, improved core enhancement algorithms, and better acceleration techniques. Also of great importance is a system that can process inputs with compounded impairments (e.g., video of foggy nights, with both haze and low lighting).

[Plot: PSNR (dB) versus bitrate (kb/s), with curves for frame-wise enhancement before encoding and frame-wise enhancement after decoding.]

Fig. 19: RD performance of frame-wise enhancement in encoder and decoder.

[Plot: PSNR (dB) versus bitrate (kb/s), with curves for stand-alone ME enhancement before encoding and after decoding.]

Fig. 20: RD performance of separate ME acceleration enhancement in encoder and decoder.



[Plot: PSNR (dB) versus bitrate (kb/s), with curves for integration of ME enhancement before encoding and after decoding.]

Fig. 21: RD performance of integration of ME acceleration enhancement into encoder and decoder.



TABLE III: Processing speeds of the proposed algorithms on PC (640 × 480) and iPhone 4 (192 × 144)

Algorithm                                                       PC (ms/frame)   iPhone 4 (ms/frame)   Time saved
Frame-wise enhancement algorithm                                27.1            66.3                  N/A
Separate ME acceleration enhancement algorithm                  19.8            N/A                   27.5%
Integration of ME acceleration enhancement into encoder         6.2             N/A                   77.3%
Integration of ME acceleration enhancement into decoder         16.7            N/A                   40.0%


Fig. 22: Examples of comparisons among the frame-wise algorithm and the three proposed ME acceleration methods (Left to Right): Original input, output of frame-wise algorithm, output of separate ME acceleration algorithm, and output of integration of ME acceleration algorithm into encoder and decoder.


