FRAGMENT-BASED REAL-TIME OBJECT TRACKING: A SPARSE REPRESENTATION APPROACH

Naresh Kumar M. S., Priti Parate, R. Venkatesh Babu
Supercomputer Education and Research Centre
Indian Institute of Science, Bangalore, India - 560012
ABSTRACT

Real-time object tracking is a critical task in many computer vision applications. Achieving rapid and robust tracking while handling changes in object pose and size, varying illumination and partial occlusion is a challenging task given the limited amount of computational resources. In this paper we propose a real-time object tracker in the l1 framework that addresses these issues. In the proposed approach, dictionaries containing templates of overlapping object fragments are created. The candidate fragments are sparsely represented in the dictionary fragment space by solving the l1-regularized least squares problem. The non-zero coefficients indicate the relative motion between the target and candidate fragments, along with a fidelity measure. The final object motion is obtained by fusing the reliable motion information. The dictionary is updated based on the object likelihood map. The proposed tracking algorithm is tested on various challenging videos and is found to outperform an earlier approach.

Index Terms— Object tracking, Fragment tracking, Motion estimation, l1 minimization, Sparse representation
1. INTRODUCTION

Visual tracking is an important task in computer vision with a variety of applications such as surveillance, robotics, human-computer interaction and medical imaging. One of the main challenges that limits the performance of a tracker is appearance change caused by variation in pose, illumination or viewpoint. A significant amount of work has been done to address these problems and develop a robust tracker. However, robust object tracking still remains a big challenge in computer vision research.
There have been many proposals towards building a robust tracker; a thorough survey can be found in [1]. In early works, minimizing the SSD (sum of squared differences) between regions was a popular choice for the tracking problem [2], and a gradient descent algorithm was most commonly used to find the minimum SSD. Often, in such methods, only a local minimum could be reached. The mean-shift tracker [3] uses mean-shift iterations and a similarity measure based on the Bhattacharyya coefficient between the target model and candidate regions to track the object. The incremental tracker [4] and the covariance tracker [5] are other examples of tracking methods which use an appearance model to represent the target observations.
One of the recently developed and popular trackers is the l1 tracker [6]. In that work, the authors utilize a particle filter to select candidate particles and then represent them sparsely in the space spanned by the object templates using l1 minimization. This requires a large number of particles for reliable tracking, which results in a high computational cost and thus brings down the speed of the tracker. Attempting to speed up the tracking by reducing the number of particles only deteriorates the accuracy of the tracker. There have been attempts to improve the performance of [6]. In [7] the authors reduce the computation time by decomposing a single object template into the particle space. In [8] hash kernels are used to reduce the dimensionality of the observations.
In this paper, we propose a computationally efficient, l1-minimization-based, real-time and robust tracker. The tracker uses fragments of the object and the candidate to estimate the motion of the object. The number of candidate fragments required to track the object in this method is small, thus reducing the computational burden of the l1 tracker. Further, the fragment-based approach combined with the trivial templates makes the tracker robust against partial occlusion. The results show that the proposed tracker gives more accurate tracking at much higher execution speeds in comparison to the earlier approach.
The rest of the paper is organized as follows. Section 2 provides an overview of the proposed tracker. Section 3 describes the proposed approach in detail. Section 4 discusses the results and Section 5 concludes the paper.
2. OVERVIEW

The proposed tracking algorithm is essentially a template tracker in the l1 framework. The object is partitioned into overlapping fragments that form the atoms of the dictionary. The candidate fragments are sparsely represented in the space spanned by the dictionary fragments by solving the l1 minimization problem. The resulting sparse representation indicates the flow of fragments between consecutive frames. This flow information, i.e. the motion vectors, is utilized for estimating the object motion between consecutive frames. The proposed algorithm uses only grey-scale information for tracking. Similar to the mean-shift tracker [3], the proposed algorithm assumes sufficient overlap between the object and candidate regions, such that there is at least one fragment in the candidate area that corresponds to an object fragment. In this approach two dictionaries are used. One is kept static while the other is updated based on the tracking result and a confidence measure computed using histogram models. The dictionaries are initialized with the object selected in the first frame. The proposed algorithm is able to track objects with rapid changes in appearance, illumination and occlusion in real time. Changes in size are also tracked to some extent.
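The fragment partitioning described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the stride (amount of overlap) and the helper name `fragment_dictionary` are assumptions, while the 8x8 fragment size matches the configuration reported later in Section 4.

```python
import numpy as np

def fragment_dictionary(patch, frag=8, stride=4):
    """Partition a grey-scale patch into overlapping frag x frag fragments.

    Each fragment is vectorized and l2-normalized to form one atom
    (column) of the dictionary; `stride` < `frag` gives the overlap.
    """
    h, w = patch.shape
    atoms = []
    for i in range(0, h - frag + 1, stride):
        for j in range(0, w - frag + 1, stride):
            f = patch[i:i + frag, j:j + frag].astype(float).ravel()
            f /= (np.linalg.norm(f) + 1e-12)  # unit-norm atom
            atoms.append(f)
    return np.stack(atoms, axis=1)  # shape: (frag*frag, num_fragments)

# Example: a 32x32 object patch yields a 7x7 grid of 64-dimensional atoms
patch = np.arange(32 * 32, dtype=float).reshape(32, 32)
D = fragment_dictionary(patch)
print(D.shape)  # (64, 49)
```

With a stride of half the fragment size, neighbouring atoms share half their pixels, which is what lets a partially occluded object still contribute unoccluded atoms.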
3. PROPOSED APPROACH

3.1. Sparse representation and l1 minimization

The discriminative property of sparse representation has recently been utilized for various computer vision applications such as tracking [6], detection [9] and classification [10]. A candidate vector y can be sparsely represented in the space spanned by the vector elements of the matrix (called the dictionary) D = [d_1, d_2, ..., d_n] ∈ R^(l×n). Mathematically,

    y = Da                                                        (1)

where a = [a_1, a_2, ..., a_n]^T ∈ R^n is the coefficient vector over the basis D. In application, the system represented by (1) could be underdetermined since l < n.
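The paper solves this l1-regularized least squares problem with the SPAMS package; as a rough, self-contained sketch of the same problem, min_a 0.5*||y − Da||^2 + λ||a||_1, here is a plain ISTA (iterative shrinkage-thresholding) solver on a toy dictionary. The solver choice, λ value and dimensions are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def ista_l1(D, y, lam=0.05, n_iter=500):
    """ISTA for min_a 0.5*||y - Da||^2 + lam*||a||_1."""
    step = 1.0 / np.linalg.norm(D, 2) ** 2  # 1/L, L = Lipschitz const. of gradient
    a = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = D.T @ (D @ a - y)
        z = a - step * grad
        a = np.sign(z) * np.maximum(np.abs(z) - lam * step, 0.0)  # soft threshold
    return a

# Toy example: a candidate built from three dictionary atoms is recovered sparsely
rng = np.random.default_rng(0)
l, n = 64, 200
D = rng.standard_normal((l, n))
D /= np.linalg.norm(D, axis=0)          # unit-norm atoms
a_true = np.zeros(n)
a_true[[5, 40, 120]] = [0.9, 0.5, 0.3]  # sparse ground-truth coefficients
y = D @ a_true + 0.01 * rng.standard_normal(l)
a = ista_l1(D, y)
```

The recovered vector `a` is sparse, with its large entries at the atoms that actually compose `y`; in the tracker it is exactly these non-zero positions that reveal which dictionary fragment a candidate fragment came from.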
One of the two motion vectors MV_obj,1 and MV_obj,2 is chosen based on a confidence measure computed using the histogram models of the object and background. The object histogram P_obj, with 20 bins, is constructed from the pixels occupying the central 25% area of the object. The background histogram P_bg, with 20 bins, is constructed from the pixels in the area surrounding the object, up to 15 pixels away. These histograms are normalized. Figure 2 shows the areas used to construct these histograms. The area between the innermost rectangle and the middle rectangle is not used, as this region contains both object and background pixels, which adds confusion to the models. The likelihood map is calculated using equation (8) for the pixels occupying the central 25% area of the candidate area T. The confidence measure for each motion vector is taken as the sum of the corresponding likelihood values of the pixels, using equation (9).

Fig. 2. Pixels used to build the object and background histograms. [Nested regions, outermost to innermost: background, not used, object.]

    L(x, y) = P_obj(b(T(x, y))) / max(P_bg(b(T(x, y))), ε)        (8)

    L_conf = Σ_x Σ_y L(x, y)                                      (9)

where the function b maps the pixel at location (x, y) to its bin and ε is a small quantity that prevents division by zero. Of MV_obj,1 and MV_obj,2, the motion vector with the larger value of this confidence measure is chosen. A higher confidence measure implies that more pixels in the target area indicated by that motion vector belong to the object than in the area indicated by the other motion vector.
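Equations (8) and (9) can be sketched as below. The bin-mapping function b is an assumed uniform quantizer of intensities in [0, 1), and the histograms are toy inputs rather than values from the tracker.

```python
import numpy as np

N_BINS, EPS = 20, 1e-6

def bin_index(pixels, n_bins=N_BINS):
    """Assumed bin map b(.): uniform quantization of intensities in [0, 1)."""
    return np.minimum((pixels * n_bins).astype(int), n_bins - 1)

def confidence(patch, p_obj, p_bg):
    """Eq. (8)-(9): sum over pixels of P_obj(b(pixel)) / max(P_bg(b(pixel)), eps)."""
    b = bin_index(patch)
    likelihood = p_obj[b] / np.maximum(p_bg[b], EPS)  # likelihood map, eq. (8)
    return likelihood.sum()                           # confidence, eq. (9)

# Toy normalized histograms: the object is bright, the background is dark
p_obj = np.zeros(N_BINS); p_obj[15:] = 1.0 / 5
p_bg = np.zeros(N_BINS); p_bg[:5] = 1.0 / 5

bright = np.full((8, 8), 0.9)  # object-like candidate area
dark = np.full((8, 8), 0.1)    # background-like candidate area
print(confidence(bright, p_obj, p_bg) > confidence(dark, p_obj, p_bg))  # True
```

A candidate area dominated by object-colored pixels accumulates a much larger sum than one dominated by background pixels, which is what makes the sum usable as a tie-breaker between the two motion vectors.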
3.5. Dictionary update

Fragments in the second dictionary are chosen for update after analysing how well each fragment matched the candidate fragments. This can be inferred from the target coefficient vectors A_2k. The maximum value along each row (each row corresponds to a fragment in the dictionary) of the matrix A = [A_21, A_22, ..., A_2p] helps in sorting out fragments that matched very well, mildly, or not at all with the candidate fragments. Since there are only p candidate fragments, a large portion of the dictionary fragments will not have matched at all, indicated by their zero coefficient values. A small number of such fragments (depending on the update factor, which is expressed as a percentage of the total number of elements in each dictionary) are updated, since they made no contribution in the current iteration. They are updated with the corresponding fragments of the tracking result, after a check on each new fragment based on the histogram models explained in Section 3.4. The likelihood map, inverse likelihood map and confidence measure of each new fragment F are computed as

    L_f(x, y) = P_obj(b(F(x, y))) / max(P_bg(b(F(x, y))), ε)      (10)

    IL_f(x, y) = P_bg(b(F(x, y))) / max(P_obj(b(F(x, y))), ε)     (11)

    L_conf,f = [Σ_x Σ_y L_f(x, y)] / [Σ_x Σ_y IL_f(x, y)]         (12)

The fragment is updated only if the confidence measure L_conf,f > 1 (indicating that the fragment has more pixels belonging to the object), to prevent erroneous updates of the dictionary fragments.
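The update check in (10)-(12) amounts to comparing two likelihood sums; a minimal sketch follows, reusing the same assumed uniform bin map b and toy histograms as before (not the paper's actual data).

```python
import numpy as np

N_BINS, EPS = 20, 1e-6

def should_update(frag, p_obj, p_bg, n_bins=N_BINS, eps=EPS):
    """Accept a new dictionary fragment only if L_conf,f > 1 (eq. 10-12)."""
    b = np.minimum((frag * n_bins).astype(int), n_bins - 1)  # assumed bin map b(.)
    l_f = p_obj[b] / np.maximum(p_bg[b], eps)    # likelihood map, eq. (10)
    il_f = p_bg[b] / np.maximum(p_obj[b], eps)   # inverse likelihood map, eq. (11)
    l_conf = l_f.sum() / max(il_f.sum(), eps)    # confidence, eq. (12)
    return bool(l_conf > 1.0)

# Toy normalized histograms: object mass in high bins, background in low bins
p_obj = np.zeros(N_BINS); p_obj[15:] = 0.2
p_bg = np.zeros(N_BINS); p_bg[:5] = 0.2

object_like = np.full((8, 8), 0.85)      # mostly bright pixels -> accepted
background_like = np.full((8, 8), 0.05)  # mostly dark pixels -> rejected
```

The ratio test rejects fragments whose pixels look more like background than object, so a fragment cut from an occluder or from the surrounding scenery does not contaminate the dictionary.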
Algorithm 1 Proposed Tracking
1: Input: Initial position of the object in the first frame.
2: Initialize: D_1 and D_2 with overlapping fragments of the object.
3: repeat
4:   In the next frame, select the candidate from the same location as the object in the previous frame and prepare a set of p candidate fragments.
5:   Solve the l1 minimization problem using SPAMS [11] to sparsely reconstruct the candidate fragments in the space spanned by the dictionary fragments.
6:   Compute the confidence measure C_conf,k using equation (6).
7:   Choose the top p′ candidate fragments based on C_conf,k and compute their motion vectors MV.
8:   Remove outliers in MV based on direction and magnitude to obtain s motion vectors MV′.
9:   Compute the motion vector MV_obj,1 as the median of the x and y components of MV′.
10:  Compute the motion vector MV_obj,2 using equation (7).
11:  Choose MV_obj,1 or MV_obj,2 as the motion vector for the object, whichever gives the higher confidence measure based on the likelihood in (9).
12:  Update fragments of dictionary D_2 that did not match any of the candidate fragments, if L_conf,f > 1.
13: until End of video feed
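Steps 8 and 9 of Algorithm 1 (outlier removal followed by a per-component median) can be sketched as below. The deviation-from-the-median outlier rule is an assumed implementation, since the paper does not spell out its exact test.

```python
import numpy as np

def fuse_motion_vectors(mv, keep=5):
    """Keep the `keep` vectors closest to the median MV, then take the
    per-component median of the survivors (Algorithm 1, steps 8-9)."""
    mv = np.asarray(mv, dtype=float)         # shape (k, 2): one (dx, dy) per fragment
    med = np.median(mv, axis=0)
    dist = np.linalg.norm(mv - med, axis=1)  # deviation in magnitude and direction
    survivors = mv[np.argsort(dist)[:keep]]  # drop the worst outliers
    return np.median(survivors, axis=0)

# Six fragments agree on a shift of roughly (3, -1); one is a gross outlier
mv = [(3, -1), (3, -1), (2, -1), (3, 0), (3, -1), (2, -2), (40, 25)]
print(fuse_motion_vectors(mv))  # [ 3. -1.]
```

The median fusion makes the object displacement insensitive to a few fragments that latched onto the wrong dictionary atom, which is what keeps a small p workable.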
4. RESULTS AND DISCUSSION

The proposed tracker is implemented in MATLAB and evaluated on four video sequences: pktest02 (450 frames), face (206 frames), panda (451 frames) and trellis (226 frames). We use the software (SPAMS) provided by [11] to solve the l1 minimization problem. To evaluate the performance of the proposed tracker, its results are compared with those of the l1 tracker proposed by Mei et al. [6]. The l1 tracker is configured with 300 particles and 10 object templates of size 12×15. The proposed tracker is configured for p = 25 candidate fragments of size 8×8, with p′ = 21, s = 5, and an update factor of 5%.
Figure 3 shows the trajectory error (position error) with respect to ground truth for the four videos using the proposed method and the l1 tracker [6]. Table 1 summarizes the performance of the trackers under consideration. It can be seen that the proposed tracker achieves real-time performance with better accuracy compared to the particle-filter-based l1 tracker [6] when executed on a PC. The proposed tracker runs roughly 65-75 times faster than [6] (Table 1). Figures 4, 5, 6 and 7 show the tracking results. The results of the proposed approach and the l1 tracker are shown by blue and yellow (dashed) windows respectively.
In Figure 4, frame 153 shows that the l1 tracker fails when the car is occluded by the tree, and it continues to drift away. The proposed tracker survives the occlusion and gradual pose change, as seen in frames 153, 156, 219 and 430. Figure 5 shows that the proposed tracker is also robust to changes in appearance and illumination, at frames 69, 114 and 192. Figure 6 shows that the proposed tracker was able to track drastic changes in pose when the panda changes its direction of motion, while the tracker in [6] fails at frames 94 and 327. Figure 7 shows the ability of the proposed tracker to track the object even under partial illumination changes, owing to the fragment-based approach. In frame 71, it can be seen that the lower left region is illuminated more. Fragments in the lower left region give low confidence measures and are discarded before computing the object displacement, whereas the tracker in [6] uses the entire object to build the dictionary of templates and hence fails to track the object under such partial illumination changes. The videos corresponding to the results presented in Figs. 4 to 7 are available at http://www.serc.iisc.ernet.in/~venky/tracking results/.

Fig. 3. Trajectory position error with respect to ground truth for: (a) pktest02, (b) face, (c) panda and (d) trellis sequences. [Plots of absolute error versus frame number for the proposed tracker and Mei et al. [6].]

Fig. 4. Result for pktest02 video at frames 5, 153, 156, 219 and 430. [Color convention for all results: solid blue - proposed tracker; dashed yellow - l1 tracker.]

Fig. 5. Result for face video at frames 3, 10, 69, 114 and 192.

Fig. 6. Result for panda video at frames 4, 45, 94, 327 and 450.
Table 1. Execution time and trajectory error (RMSE) comparison of the proposed tracker and l1 tracker [6].

Video (frames)  | Execution time per frame (s) | Trajectory error (RMSE)
                | Proposed |   [6]             | Proposed |     [6]
pktest02 (450)  |  0.0316  |  2.0770           |  2.9878  |  119.5893
face (206)      |  0.0308  |  2.2194           |  7.0961  |    9.5666
panda (451)     |  0.0303  |  2.2742           |  4.7350  |   25.5386
trellis (226)   |  0.0301  |  2.1269           | 12.8113  |   42.3399
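The speed advantage follows directly from Table 1; a quick check of the per-frame timing ratios:

```python
# Per-frame execution times from Table 1 (seconds): (proposed, l1 tracker [6])
times = {
    "pktest02": (0.0316, 2.0770),
    "face": (0.0308, 2.2194),
    "panda": (0.0303, 2.2742),
    "trellis": (0.0301, 2.1269),
}
speedups = {video: t_l1 / t_prop for video, (t_prop, t_l1) in times.items()}
print(speedups)  # roughly 65x to 75x across the four sequences
```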
5. CONCLUSION AND FUTURE WORK

In this paper we have proposed a computationally efficient tracking algorithm which makes use of fragments of the object and the candidate to track the object. The performance of the proposed tracker has been demonstrated on various complex video sequences, and it is shown to perform better than the earlier tracker in terms of both accuracy and speed. Future work includes improving the dictionary and its update mechanism to model changes in the pose, size and illumination of the object more precisely.
6. REFERENCES

[1] A. Yilmaz, O. Javed, and M. Shah, “Object tracking: A survey,” ACM Computing Surveys, vol. 38, no. 4, 2006.
Fig. 7. Result for trellis video at frames 12, 24, 71, 141 and 226.

[2] G. D. Hager and P. N. Belhumeur, “Efficient region tracking with parametric models of geometry and illumination,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 10, pp. 1025–1039, 1998.
[3] D. Comaniciu, V. Ramesh, and P. Meer, “Real-time tracking of non-rigid objects using mean shift,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2000, vol. 2, pp. 142–149.
[4] D. A. Ross, J. Lim, R.-S. Lin, and M.-H. Yang, “Incremental learning for robust visual tracking,” International Journal of Computer Vision, vol. 77, no. 1-3, pp. 125–141, 2008.
[5] F. Porikli, O. Tuzel, and P. Meer, “Covariance tracking using model update based on Lie algebra,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2006, vol. 1, pp. 728–735.
[6] X. Mei and H. Ling, “Robust visual tracking using l1 minimization,” in Proceedings of the IEEE International Conference on Computer Vision, 2009, pp. 1436–1443.
[7] H. Liu and F. Sun, “Visual tracking using sparsity induced similarity,” in Proceedings of the IEEE International Conference on Pattern Recognition, 2010, pp. 1702–1705.
[8] H. Li and C. Shen, “Robust real-time visual tracking with compressed sensing,” in Proceedings of the IEEE International Conference on Image Processing, 2010.
[9] R. Xu, B. Zhang, Q. Ye, and J. Jiao, “Human detection in images via l1-norm minimization learning,” in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 2010, pp. 3566–3569.
[10] J. Wright, A. Y. Yang, A. Ganesh, S. S. Sastry, and Y. Ma, “Robust face recognition via sparse representation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 1, pp. 210–227, 2009.
[11] SPAMS (SPArse Modeling Software), http://www.di.ens.fr/willow/spams/.