<strong>FRAGMENT</strong>-<strong>BASED</strong> <strong>REAL</strong>-<strong>TIME</strong> <strong>OBJECT</strong> <strong>TRACKING</strong>:<br />


Naresh Kumar M. S., Priti Parate, R. Venkatesh Babu<br />

Supercomputer Education and Research Centre<br />

Indian Institute of Science, Bangalore, India - 560012<br />


Real-time object tracking is a critical task in many computer vision<br />

applications. Achieving rapid and robust tracking while handling<br />

changes in object pose and size, varying illumination and partial<br />

occlusion is a challenging task given the limited amount of computational<br />

resources. In this paper we propose a real-time object<br />

tracker in the l1 framework addressing these issues. In the proposed approach,<br />

dictionaries containing templates of overlapping object fragments<br />

are created. The candidate fragments are sparsely represented<br />

in the dictionary fragment space by solving the l1-regularized least<br />

squares problem. The nonzero coefficients indicate the relative motion<br />

between the target and candidate fragments along with a fidelity<br />

measure. The final object motion is obtained by fusing the reliable<br />

motion information. The dictionary is updated based on the object<br />

likelihood map. The proposed tracking algorithm is tested on various<br />

challenging videos and is found to outperform an earlier approach.<br />

Index Terms— Object tracking, Fragment tracking, Motion estimation,<br />

l1 minimization, Sparse representation<br />

1. INTRODUCTION<br />


Visual tracking is an important task in computer vision with a variety<br />

of applications such as surveillance, robotics, human-computer<br />

interaction and medical imaging. One of the main challenges that<br />

limits the performance of a tracker is appearance change caused<br />

by pose variation, illumination or viewpoint. A significant amount of<br />

work has been done to address these problems and develop a robust<br />

tracker. However, robust object tracking still remains a big challenge<br />

in computer vision research.<br />

There have been many proposals towards building a robust<br />

tracker; a thorough survey can be found in [1]. In early works,<br />

minimizing the SSD (sum of squared differences) between regions was<br />

a popular choice for the tracking problem [2], and a gradient descent<br />

algorithm was most commonly used to find the minimum SSD. Often<br />

in such methods, only a local minimum could be reached. The mean-shift<br />

tracker [3] uses mean-shift iterations and a similarity measure<br />

based on the Bhattacharyya coefficient between the target model<br />

and candidate regions to track the object. The incremental tracker [4]<br />

and the covariance tracker [5] are other examples of tracking methods<br />

which use an appearance model to represent the target observations.<br />

One of the recently developed and popular trackers is the l1<br />

tracker [6]. In this work, the authors utilize a particle filter to<br />

select the candidate particles and then represent them sparsely in the<br />

space spanned by the object templates using l1 minimization. This<br />

requires a large number of particles for reliable tracking, which results<br />

in a high computational cost and thus brings down the speed of the<br />

tracker. An attempt to speed up the tracking by reducing the number<br />

of particles only deteriorates the accuracy of the tracker. There<br />

have been attempts to improve the performance of [6]. In [7] the<br />

authors try to reduce the computation time by decomposing a single<br />

object template into the particle space. In [8] hash kernels are used<br />

to reduce the dimensionality of observation.<br />

In this paper, we propose a computationally efficient, l1-minimization-based,<br />

real-time and robust tracker. The tracker uses fragments<br />

of the object and the candidate to estimate the motion of the<br />

object. The number of candidate fragments required to track the object<br />

in this method is small, thus reducing the computational burden<br />

of the l1 tracker. Further, the fragment-based approach combined<br />

with the trivial templates makes the tracker robust against partial occlusion.<br />

The results show that the proposed tracker gives more accurate<br />

tracking at much higher execution speeds in comparison to the<br />

earlier approach.<br />

The rest of the paper is organized as follows. Section 2 provides<br />

the overview of the proposed tracker. Section 3 describes the proposed<br />

approach in detail. Section 4 discusses the results and Section<br />

5 concludes the paper.<br />

2. OVERVIEW<br />

The proposed tracking algorithm is essentially a template tracker in<br />

the l1 framework. The object is partitioned into overlapping fragments<br />

that form the atoms of the dictionary. The candidate fragments are<br />

sparsely represented in the space spanned by the dictionary fragments<br />

by solving the l1 minimization problem. The resulting sparse<br />

representation indicates the flow of fragments between consecutive<br />

frames. This flow information or the motion vectors are utilized<br />

for estimating the object motion between consecutive frames. The<br />

proposed algorithm uses only grey-scale information for tracking.<br />

Similar to the mean-shift tracker [3], the proposed algorithm also assumes<br />

sufficient overlap between object and candidate regions such<br />

that there is at least one fragment in the candidate area that corresponds<br />

to an object fragment. In this approach two dictionaries are<br />

used. One is kept static while the other is updated based on the<br />

tracking result and a confidence measure computed using histogram<br />

models. The dictionaries are initialized with the object selected in<br />

the first frame. The proposed algorithm is able to track objects with<br />

rapid changes in appearance and illumination, and under occlusion, in real time.<br />

Changes in size are also tracked to some extent.<br />
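
As an illustration of how an object could be partitioned into overlapping fragments that serve as dictionary atoms, the following Python sketch slides a square window over a grey-scale patch. The function names, the stride of 4 pixels, the synthetic 16×16 object and the unit-norm scaling of each atom are assumptions made for this example; the text only states that the fragments overlap.

```python
def extract_fragments(image, size=8, stride=4):
    """Return (top, left, vector) for each overlapping size x size fragment.

    image is a list of rows of grey-scale values. A stride smaller than
    size makes neighbouring fragments overlap (the stride is an assumption).
    """
    rows, cols = len(image), len(image[0])
    fragments = []
    for top in range(0, rows - size + 1, stride):
        for left in range(0, cols - size + 1, stride):
            # Flatten the patch row by row into one candidate dictionary atom.
            vector = [image[r][c]
                      for r in range(top, top + size)
                      for c in range(left, left + size)]
            fragments.append((top, left, vector))
    return fragments


def normalise(vector):
    """Scale a vector to unit l2 norm (left unchanged if it is all zeros)."""
    norm = sum(v * v for v in vector) ** 0.5
    return [v / norm for v in vector] if norm > 0 else list(vector)


# Synthetic 16 x 16 grey-scale object: intensity grows along both axes.
obj = [[r + c for c in range(16)] for r in range(16)]
fragments = extract_fragments(obj, size=8, stride=4)
dictionary = [normalise(vec) for _, _, vec in fragments]
positions = [(top, left) for top, left, _ in fragments]
```

Keeping the position of every atom alongside the dictionary is what would later allow a matched atom to be turned into a displacement between the object and the candidate.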


3. PROPOSED APPROACH<br />

3.1. Sparse representation and l1 minimization<br />

The discriminative property of sparse representation has recently been utilized<br />

for various computer vision applications such as tracking [6],<br />

detection [9] and classification [10]. A candidate vector $y$ can be<br />

sparsely represented in the space spanned by the column vectors of<br />

the matrix (called the dictionary) $D = [d_1, d_2, \ldots, d_n] \in \mathbb{R}^{l \times n}$.<br />

Mathematically,<br />

$$y = Da \quad (1)$$<br />

where $a = [a_1, a_2, \ldots, a_n]^T \in \mathbb{R}^n$ is the coefficient vector of basis<br />

D. In application, the system represented by (1) could be under<br />

determined since l

One of the two motion vectors $MV_{obj,1}$ and $MV_{obj,2}$ is chosen<br />

based on a confidence measure computed using the histogram models<br />

of the object and background. The object histogram $P_{obj}$ with 20<br />

bins is constructed from the pixels occupying the central 25% area<br />

of the object. The background histogram $P_{bg}$ with 20 bins is constructed<br />

from the pixels occupying the area surrounding the object<br />

up to 15 pixels. These histograms are normalized. Figure 2 shows<br />

the areas used to construct these histograms. The area between the<br />

innermost rectangle and the middle rectangle is not used, as this region<br />

contains both object and background pixels, which adds confusion<br />

to the models. The likelihood map is calculated using equation<br />

(8) for the pixels occupying the central 25% area of the candidate<br />

area $T$. The confidence measure for each of the motion vectors is<br />

taken as the sum of the corresponding likelihood values of the pixels<br />

using equation (9).<br />

$$L(x, y) = \frac{P_{obj}(b(T(x, y)))}{\max(P_{bg}(b(T(x, y))), \epsilon)} \quad (8)$$<br />

$$L_{conf} = \sum_x \sum_y L(x, y) \quad (9)$$<br />

where function $b$ maps the pixel at location $(x, y)$ to its bin and $\epsilon$<br />

is a small quantity to prevent division by zero. Out of $MV_{obj,1}$ and<br />

$MV_{obj,2}$, the motion vector with the larger value of this confidence<br />

measure is chosen. A higher confidence measure implies that a larger<br />

number of pixels from that target area belong to the object than from the area<br />

pointed to by the other motion vector.<br />

Fig. 2. Pixels used to build the object and background histogram (regions labelled Background, Object and Not used).<br />
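
The confidence computation of equations (8) and (9) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: grey levels are taken to lie in 0..255, the bin function $b$ is assumed to be uniform quantization into 20 bins, ties are resolved in favour of $MV_{obj,1}$, and the pixel lists are synthetic. None of these details are specified in the text.

```python
BINS = 20  # number of histogram bins, as stated in the text


def bin_index(value, bins=BINS, levels=256):
    """Function b: map a grey level in [0, levels) to its bin (assumed uniform)."""
    return min(value * bins // levels, bins - 1)


def histogram(pixels, bins=BINS):
    """Normalised histogram of a non-empty flat list of grey levels."""
    counts = [0] * bins
    for p in pixels:
        counts[bin_index(p, bins)] += 1
    return [c / len(pixels) for c in counts]


def confidence(pixels, p_obj, p_bg, eps=1e-6):
    """Sum of per-pixel likelihoods: equation (8) summed as in equation (9)."""
    return sum(p_obj[bin_index(p)] / max(p_bg[bin_index(p)], eps)
               for p in pixels)


# Synthetic models: a bright object on a dark background.
p_obj = histogram([200] * 90 + [40] * 10)
p_bg = histogram([40] * 95 + [200] * 5)

# Central pixels of the candidate area reached by each motion vector.
area_mv1 = [200] * 20 + [40] * 5   # mostly object-like pixels
area_mv2 = [200] * 5 + [40] * 20   # mostly background-like pixels

conf_mv1 = confidence(area_mv1, p_obj, p_bg)
conf_mv2 = confidence(area_mv2, p_obj, p_bg)
chosen = "MV_obj,1" if conf_mv1 >= conf_mv2 else "MV_obj,2"
```

In this toy setting the first area contains more object-like pixels, so its summed likelihood is larger and the first motion vector is selected.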

3.5. Dictionary update<br />

Fragments in the second dictionary are chosen for update after<br />

analysing how well each of the fragments was matched to the<br />

candidate fragments. This can be inferred from the target coefficient<br />

vectors $A_{2k}$. The maximum value along the rows (each<br />

row corresponds to a fragment in the dictionary) in the matrix<br />

$A = [A_{21}, A_{22}, \ldots, A_{2p}]$ helps in sorting out fragments that<br />

matched very well, mildly or not at all with the candidate fragments.<br />

Since there are only $p$ candidate fragments, a large portion<br />

of the dictionary fragments would not have matched at all, indicated<br />

by their zero coefficient values. A small number (depending on the<br />

update factor, which is expressed as a percentage of the total number<br />

of elements in each dictionary) of such fragments are updated since<br />

there was no contribution from them in the current iteration. They<br />

are updated with the corresponding fragments of the tracking result<br />

after performing a check on the new fragment based on the histogram<br />

models explained in Section 3.4. The likelihood map, inverse likelihood<br />

map and confidence measure of each new fragment $F$ are<br />

computed as<br />

$$L_f(x, y) = \frac{P_{obj}(b(F(x, y)))}{\max(P_{bg}(b(F(x, y))), \epsilon)} \quad (10)$$<br />

$$IL_f(x, y) = \frac{P_{bg}(b(F(x, y)))}{\max(P_{obj}(b(F(x, y))), \epsilon)} \quad (11)$$<br />

$$L_{conf,f} = \frac{\sum_x \sum_y L_f(x, y)}{\sum_x \sum_y IL_f(x, y)} \quad (12)$$<br />

The fragment is updated only if the confidence measure $L_{conf,f} > 1$<br />

(indicating that the fragment has more pixels belonging to the object), to<br />

prevent erroneous updates of the dictionary fragments.<br />
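
The update rule can be illustrated with a small Python sketch. Several details here are assumptions: coefficients are taken to be non-negative, the unmatched rows are simply taken in index order up to the update-factor limit (the text does not say which unmatched fragments are preferred), at least one fragment is always allowed, and pixels are given directly as bin indices into toy two-bin histograms.

```python
def unmatched_fragments(A, update_factor=0.05, tol=1e-12):
    """Indices of dictionary fragments that matched no candidate fragment.

    A[i][k] is the coefficient of dictionary fragment i in the sparse code
    of candidate fragment k (assumed non-negative). The result is capped at
    update_factor times the dictionary size, with a minimum of one.
    """
    row_max = [max(row) for row in A]
    unmatched = [i for i, m in enumerate(row_max) if m <= tol]
    limit = max(1, int(update_factor * len(A)))
    return unmatched[:limit]


def fragment_confidence(bins, p_obj, p_bg, eps=1e-6):
    """Equation (12): summed likelihood (10) over summed inverse likelihood (11)."""
    likelihood = sum(p_obj[b] / max(p_bg[b], eps) for b in bins)
    inverse = sum(p_bg[b] / max(p_obj[b], eps) for b in bins)
    return likelihood / inverse


# Toy 2-bin histograms: bin 0 is typical of the object, bin 1 of the background.
p_obj, p_bg = [0.8, 0.2], [0.3, 0.7]

# Four dictionary fragments (rows) against two candidate fragments (columns).
A = [[0.9, 0.0], [0.0, 0.0], [0.0, 0.4], [0.0, 0.0]]
candidates = unmatched_fragments(A, update_factor=0.5)  # rows 1 and 3

# Replacement fragments taken from the tracking result, as bin indices.
new_fragments = {1: [0, 0, 0, 1], 3: [1, 1, 1, 0]}
updated = [i for i in candidates
           if fragment_confidence(new_fragments[i], p_obj, p_bg) > 1]
```

Here fragment 1 is replaced because its new content is mostly object-like, while fragment 3 is left untouched because its replacement looks like background.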

Algorithm 1 Proposed Tracking<br />

1: Input: Initial position of the object in the first frame.<br />

2: Initialize: $D_1$ and $D_2$ with overlapping fragments of the object.<br />

3: repeat<br />

4: In the next frame, select the candidate from the same location as the object in the previous frame and prepare a set of $p$ candidate fragments.<br />

5: Solve the l1 minimization problem using SPAMS [11] to sparsely reconstruct the candidate fragments in the space spanned by the dictionary fragments.<br />

6: Compute the confidence measure $C_{conf,k}$ using equation (6).<br />

7: Choose the top $p'$ candidate fragments based on $C_{conf,k}$ and compute their motion vectors $MV$.<br />

8: Remove outliers in $MV$ based on direction and magnitude to get $s$ motion vectors $MV'$.<br />

9: Compute motion vector $MV_{obj,1}$ as the median values of the x and y components of $MV'$.<br />

10: Compute motion vector $MV_{obj,2}$ using equation (7).<br />

11: Choose $MV_{obj,1}$ or $MV_{obj,2}$ as the motion vector for the object, whichever gives a higher confidence measure based on the likelihood in (9).<br />

12: Update fragments of dictionary $D_2$ that did not match any of the candidate fragments if $L_{conf,f} > 1$.<br />

13: until End of video feed<br />
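
Step 5 of Algorithm 1 relies on SPAMS [11]. Purely to illustrate the kind of problem being solved there, the sketch below uses a different and much simpler technique, the iterative shrinkage-thresholding algorithm (ISTA), to approximately minimise $\frac{1}{2}\|y - Da\|_2^2 + \lambda \|a\|_1$. It is not the solver used in the paper; the function names, the value of `lam`, the iteration count and the tiny three-atom dictionary are all assumptions chosen for the example.

```python
def soft_threshold(x, t):
    """Shrink x towards zero by t (proximal operator of the l1 norm)."""
    if x > t:
        return x - t
    if x < -t:
        return x + t
    return 0.0


def ista(atoms, y, lam=0.01, iterations=500):
    """Approximately minimise 0.5*||y - D a||^2 + lam*||a||_1 with ISTA.

    atoms is the list of dictionary columns d_1..d_n, each of length len(y).
    """
    n, m = len(atoms), len(y)
    # Step size 1/L, where L is bounded by the squared Frobenius norm of D.
    step = 1.0 / sum(v * v for atom in atoms for v in atom)
    a = [0.0] * n
    for _ in range(iterations):
        # Residual D a - y for the current coefficients.
        residual = [sum(a[j] * atoms[j][i] for j in range(n)) - y[i]
                    for i in range(m)]
        # Gradient of the least-squares term, D^T (D a - y).
        gradient = [sum(atoms[j][i] * residual[i] for i in range(m))
                    for j in range(n)]
        # Gradient step followed by shrinkage.
        a = [soft_threshold(a[j] - step * gradient[j], step * lam)
             for j in range(n)]
    return a


# Toy dictionary with three unit-norm atoms in R^4; the candidate equals atom 1.
atoms = [[1.0, 0.0, 0.0, 0.0],
         [0.6, 0.8, 0.0, 0.0],
         [0.0, 0.0, 1.0, 0.0]]
y = [0.6, 0.8, 0.0, 0.0]

a = ista(atoms, y)
best = max(range(len(a)), key=lambda j: abs(a[j]))  # index of the matched atom
```

Even though atom 0 is correlated with the candidate, the l1 penalty drives its coefficient to zero and leaves a single dominant coefficient on atom 1. In a tracker, the index of that dominant coefficient identifies which dictionary fragment the candidate fragment corresponds to.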


4. RESULTS AND DISCUSSION<br />

The proposed tracker is implemented in MATLAB and evaluated<br />

on four different video sequences: pktest02 (450 frames), face (206<br />

frames), panda (451 frames) and trellis (226 frames). We use the<br />

software (SPAMS) provided by [11] to solve the l1 minimization<br />

problem. For evaluating the performance of the proposed tracker, its<br />

results are compared with the l1 tracker proposed by Mei et al. [6].<br />

The l1 tracker is configured with 300 particles and 10 object templates of size<br />

12×15. The proposed tracker is configured for p = 25 candidate<br />

fragments of size 8×8, p′ = 21, s = 5, and an update factor of 5%.<br />

Figure 3 shows the trajectory error (position error) plot with respect<br />

to ground truth for the four videos using the proposed method<br />

and the l1 tracker [6]. Table 1 summarizes the performance of the trackers<br />

under consideration. It can be seen that the proposed tracker<br />

achieves real-time performance with better accuracy compared to the<br />

particle-filter-based l1 tracker [6] when executed on a PC. The proposed<br />

tracker runs roughly 65 to 75 times faster than [6] (see Table 1). Figures 4, 5, 6 and 7<br />

show the tracking results. The results of the proposed approach and<br />

the l1 tracker are shown by blue and yellow (dashed) windows respectively.<br />

In Figure 4, frame number 153 shows that the l1 tracker failed<br />

when the car was occluded by the tree, and it continues to drift away.<br />

The proposed tracker survives the occlusion and gradual pose change<br />

as seen in frames 153, 156, 219 and 430. Figure 5 also shows that the<br />

proposed tracker is robust to changes in appearance and illumination<br />

at frames 69, 114 and 192. Figure 6 shows that the proposed tracker<br />

was able to track drastic changes in pose when the panda changes<br />

its direction of motion, while the tracker in [6] fails at frames 94 and<br />

327. Figure 7 shows the ability of the proposed tracker to track the object<br />

even under partial illumination changes because of the fragment<br />

based approach. In frame 71, it can be seen that the lower left region<br />

is illuminated more. Fragments in the lower left region would<br />

give low confidence measures and are discarded before computing<br />

the object displacement, whereas the tracker in [6] uses the entire object<br />

to build the dictionary of templates and hence fails to track the<br />

object under such partial illumination changes. The videos corresponding<br />

to the results presented in Figs. 4 to 7 are available at<br />

http://www.serc.iisc.ernet.in/∼venky/tracking results/.<br />

Fig. 3. Trajectory position error with respect to ground truth for: (a)<br />

pktest02 (b) face (c) panda and (d) trellis sequences. [Each panel plots absolute error against frame number for the proposed tracker and Mei et al. [6].]<br />

Fig. 4. Result for pktest02 video at frames 5, 153, 156, 219 and<br />

430. [Color convention for all results: solid blue - proposed tracker;<br />

dashed yellow - l1 tracker.]<br />

Fig. 5. Result for face video at frames 3, 10, 69, 114 and 192.<br />

Fig. 6. Result for panda video at frames 4, 45, 94, 327 and 450.<br />
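
The trajectory error reported in Figure 3 and summarised as RMSE in Table 1 can be computed along the following lines. The text does not define the position error precisely, so this sketch assumes it is the Euclidean distance between the tracked and ground-truth object centres in each frame; the function names and the three-frame trajectory are made up for illustration.

```python
import math


def position_errors(tracked, ground_truth):
    """Per-frame Euclidean distance between tracked and true centres (assumed)."""
    return [math.hypot(tx - gx, ty - gy)
            for (tx, ty), (gx, gy) in zip(tracked, ground_truth)]


def rmse(errors):
    """Root mean square of a non-empty list of per-frame errors."""
    return math.sqrt(sum(e * e for e in errors) / len(errors))


# Hypothetical three-frame trajectory: the tracker is off by 5 pixels in frame 2.
tracked = [(10, 10), (13, 14), (20, 20)]
truth = [(10, 10), (10, 10), (20, 20)]

errors = position_errors(tracked, truth)
score = rmse(errors)
```

Because the errors are squared before averaging, a few frames with a large drift dominate the RMSE, which is consistent with the large gap between the two trackers on sequences where the baseline loses the object.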

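Table 1. Execution time and trajectory error (RMSE) comparison of<br />

the proposed tracker and l1 tracker [6]<br />

<table>
<tr><th>Video (number of frames)</th><th>Execution time per frame in seconds, proposed</th><th>Execution time per frame in seconds, [6]</th><th>Trajectory error (RMSE), proposed</th><th>Trajectory error (RMSE), [6]</th></tr>
<tr><td>pktest02 (450)</td><td>0.0316</td><td>2.0770</td><td>2.9878</td><td>119.5893</td></tr>
<tr><td>face (206)</td><td>0.0308</td><td>2.2194</td><td>7.0961</td><td>9.5666</td></tr>
<tr><td>panda (451)</td><td>0.0303</td><td>2.2742</td><td>4.7350</td><td>25.5386</td></tr>
<tr><td>trellis (226)</td><td>0.0301</td><td>2.1269</td><td>12.8113</td><td>42.3399</td></tr>
</table>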
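
As a quick check on the execution times in Table 1, the per-frame times can be converted into speed-up factors and frame rates. The short Python sketch below only restates the table values; the variable names are arbitrary.

```python
# (proposed, l1 tracker [6]) execution time per frame in seconds, from Table 1.
times = {
    "pktest02": (0.0316, 2.0770),
    "face": (0.0308, 2.2194),
    "panda": (0.0303, 2.2742),
    "trellis": (0.0301, 2.1269),
}

# Ratio of baseline time to proposed time, and frames per second of the proposed tracker.
speedups = {video: baseline / proposed
            for video, (proposed, baseline) in times.items()}
frame_rates = {video: 1.0 / proposed
               for video, (proposed, _) in times.items()}

for video in times:
    print(f"{video}: {speedups[video]:.1f}x faster, "
          f"{frame_rates[video]:.1f} frames per second")
```

The ratios come out at roughly 66 to 75, and the proposed tracker processes a little over 30 frames per second on all four sequences.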


5. CONCLUSION<br />

In this paper we have proposed a computationally efficient tracking<br />

algorithm which makes use of fragments of the object and candidate<br />

to track the object. The performance of the proposed tracker<br />

has been demonstrated on various complex video sequences and is<br />

shown to perform better than the earlier tracker in terms of both accuracy<br />

and speed. Future work includes improvement of the dictionary<br />

and its update mechanism to model the changes in pose, size<br />

and illumination of the object more precisely.<br />


Fig. 7. Result for trellis video at frames 12, 24, 71, 141 and 226.<br />

[1] A. Yilmaz, O. Javed, and M. Shah, “Object tracking: A survey,”<br />

ACM Computing Surveys, vol. 38, no. 4, 2006.<br />

[2] G. D. Hager and P. N. Belhumeur, “Efficient region tracking<br />

with parametric models of geometry and illumination,” IEEE<br />

Transactions on Pattern Analysis and Machine Intelligence,<br />

vol. 20, no. 10, pp. 1025–1039, 1998.<br />

[3] D. Comaniciu, V. Ramesh, and P. Meer, “Real-time tracking<br />

of non-rigid objects using mean shift,” in Proceedings of<br />

IEEE Conference on Computer Vision and Pattern Recognition,<br />

2000, vol. 2, pp. 142–149.<br />

[4] David A. Ross, Jongwoo Lim, Ruei-Sung Lin, and Ming-<br />

Hsuan Yang, “Incremental learning for robust visual tracking,”<br />

International Journal of Computer Vision, vol. 77, no. 1-3, pp.<br />

125–141, 2008.<br />

[5] F. Porikli, O. Tuzel, and P. Meer, “Covariance tracking<br />

using model update based on Lie algebra,” in Proceedings of<br />

IEEE Conference on Computer Vision and Pattern Recognition,<br />

2006, vol. 1, pp. 728–735.<br />

[6] Xue Mei and Haibin Ling, “Robust visual tracking using l1<br />

minimization,” in Proceedings of IEEE International Conference<br />

on Computer Vision, 2009, pp. 1436–1443.<br />

[7] Huaping Liu and Fuchun Sun, “Visual tracking using sparsity<br />

induced similarity,” in Proceedings of IEEE International<br />

Conference on Pattern Recognition, 2010, pp. 1702–1705.<br />

[8] Hanxi Li and Chunhua Shen, “Robust real-time visual tracking<br />

with compressed sensing,” in Proceedings of IEEE International<br />

Conference on Image Processing, 2010.<br />

[9] Ran Xu, Baochang Zhang, Qixiang Ye, and Jianbin Jiao, “Human<br />

detection in images via l1-norm minimization learning,”<br />

in Proceedings of IEEE International Conference on Acoustics<br />

Speech and Signal Processing, 2010, pp. 3566–3569.<br />

[10] J. Wright, A. Y. Yang, A. Ganesh, S. S. Sastry, and Y. Ma,<br />

“Robust face recognition via sparse representation,” IEEE<br />

Transactions on Pattern Analysis and Machine Intelligence,<br />

vol. 31, no. 1, pp. 210–227, 2009.<br />

[11] SPAMS, http://www.di.ens.fr/willow/spams/.

