FRAGMENT-BASED REAL-TIME OBJECT TRACKING: A SPARSE REPRESENTATION APPROACH

Naresh Kumar M. S., Priti Parate, R. Venkatesh Babu
Supercomputer Education and Research Centre
Indian Institute of Science, Bangalore, India - 560012
ABSTRACT

Real-time object tracking is a critical task in many computer vision applications. Achieving rapid and robust tracking while handling changes in object pose and size, varying illumination and partial occlusion is a challenging task given the limited amount of computational resources. In this paper we propose a real-time object tracker in the l1 framework that addresses these issues. In the proposed approach, dictionaries containing templates of overlapping object fragments are created. The candidate fragments are sparsely represented in the dictionary fragment space by solving the l1-regularized least squares problem. The non-zero coefficients indicate the relative motion between the target and candidate fragments, along with a fidelity measure. The final object motion is obtained by fusing the reliable motion information. The dictionary is updated based on the object likelihood map. The proposed tracking algorithm is tested on various challenging videos and is found to outperform an earlier approach.

Index Terms— Object tracking, Fragment tracking, Motion estimation, l1 minimization, Sparse representation
1. INTRODUCTION

Visual tracking is an important task in computer vision with a variety of applications such as surveillance, robotics, human-computer interaction and medical imaging. One of the main challenges that limits the performance of a tracker is appearance change caused by variation in pose, illumination or viewpoint. A significant amount of work has been done to address these problems and develop a robust tracker. However, robust object tracking still remains a big challenge in computer vision research.
There have been many proposals towards building a robust tracker; a thorough survey can be found in [1]. In early works, minimizing the SSD (sum of squared differences) between regions was a popular choice for the tracking problem [2], and a gradient descent algorithm was most commonly used to find the minimum SSD. Often, in such methods, only a local minimum could be reached. The mean-shift tracker [3] uses mean-shift iterations and a similarity measure based on the Bhattacharyya coefficient between the target model and candidate regions to track the object. The incremental tracker [4] and the covariance tracker [5] are other examples of tracking methods which use an appearance model to represent the target observations.
One of the recently developed and popular trackers is the l1 tracker [6]. In that work, the authors utilize a particle filter to select candidate particles and then represent them sparsely in the space spanned by the object templates using l1 minimization. This requires a large number of particles for reliable tracking, which results in a high computational cost and thus brings down the speed of the tracker. Attempting to speed up the tracking by reducing the number of particles only deteriorates the accuracy of the tracker. There have been attempts to improve the performance of [6]. In [7] the authors reduce the computation time by decomposing a single object template into the particle space. In [8] hash kernels are used to reduce the dimensionality of the observations.
In this paper, we propose a computationally efficient, l1-minimization-based, real-time and robust tracker. The tracker uses fragments of the object and the candidate to estimate the motion of the object. The number of candidate fragments required to track the object in this method is small, thus reducing the computational burden of the l1 tracker. Further, the fragment-based approach combined with the trivial templates makes the tracker robust against partial occlusion. The results show that the proposed tracker gives more accurate tracking at much higher execution speeds in comparison to the earlier approach.
The rest of the paper is organized as follows. Section 2 provides an overview of the proposed tracker. Section 3 describes the proposed approach in detail. Section 4 discusses the results and Section 5 concludes the paper.
2. OVERVIEW

The proposed tracking algorithm is essentially a template tracker in the l1 framework. The object is partitioned into overlapping fragments that form the atoms of the dictionary. The candidate fragments are sparsely represented in the space spanned by the dictionary fragments by solving the l1 minimization problem. The resulting sparse representation indicates the flow of fragments between consecutive frames. This flow information, i.e. the motion vectors, is utilized for estimating the object motion between consecutive frames. The proposed algorithm uses only grey-scale information for tracking. Similar to the mean-shift tracker [3], the proposed algorithm assumes sufficient overlap between the object and candidate regions, such that there is at least one fragment in the candidate area that corresponds to an object fragment. In this approach two dictionaries are used. One is kept static while the other is updated based on the tracking result and a confidence measure computed using histogram models. The dictionaries are initialized with the object selected in the first frame. The proposed algorithm is able to track objects with rapid changes in appearance, illumination and occlusion in real time. Changes in size are also tracked to some extent.
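The fragment partitioning described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the stride (amount of overlap) and the helper name `fragment_dictionary` are assumptions, while the 8x8 fragment size matches the configuration reported later in Section 4.

```python
import numpy as np

def fragment_dictionary(patch, frag=8, stride=4):
    """Partition a grey-scale patch into overlapping frag x frag fragments.

    Each fragment is vectorized and l2-normalized to form one atom
    (column) of the dictionary; `stride` < `frag` gives the overlap.
    """
    h, w = patch.shape
    atoms = []
    for i in range(0, h - frag + 1, stride):
        for j in range(0, w - frag + 1, stride):
            f = patch[i:i + frag, j:j + frag].astype(float).ravel()
            f /= (np.linalg.norm(f) + 1e-12)  # unit-norm atom
            atoms.append(f)
    return np.stack(atoms, axis=1)  # shape: (frag*frag, num_fragments)

# Example: a 32x32 object patch yields a 7x7 grid of 64-dimensional atoms
patch = np.arange(32 * 32, dtype=float).reshape(32, 32)
D = fragment_dictionary(patch)
print(D.shape)  # (64, 49)
```

With a stride of half the fragment size, neighbouring atoms share half their pixels, which is what lets a partially occluded object still contribute unoccluded atoms.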
3. PROPOSED APPROACH

3.1. Sparse representation and l1 minimization

The discriminative property of sparse representation has recently been utilized for various computer vision applications such as tracking [6], detection [9] and classification [10]. A candidate vector y can be sparsely represented in the space spanned by the vector elements of the matrix (called the dictionary) D = [d_1, d_2, ..., d_n] ∈ R^(l×n). Mathematically,

    y = Da                                                        (1)

where a = [a_1, a_2, ..., a_n]^T ∈ R^n is the coefficient vector over the basis D. In application, the system represented by (1) could be underdetermined since l < n.
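The paper solves this l1-regularized least squares problem with the SPAMS package; as a rough, self-contained sketch of the same problem, min_a 0.5*||y − Da||^2 + λ||a||_1, here is a plain ISTA (iterative shrinkage-thresholding) solver on a toy dictionary. The solver choice, λ value and dimensions are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def ista_l1(D, y, lam=0.05, n_iter=500):
    """ISTA for min_a 0.5*||y - Da||^2 + lam*||a||_1."""
    step = 1.0 / np.linalg.norm(D, 2) ** 2  # 1/L, L = Lipschitz const. of gradient
    a = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = D.T @ (D @ a - y)
        z = a - step * grad
        a = np.sign(z) * np.maximum(np.abs(z) - lam * step, 0.0)  # soft threshold
    return a

# Toy example: a candidate built from three dictionary atoms is recovered sparsely
rng = np.random.default_rng(0)
l, n = 64, 200
D = rng.standard_normal((l, n))
D /= np.linalg.norm(D, axis=0)          # unit-norm atoms
a_true = np.zeros(n)
a_true[[5, 40, 120]] = [0.9, 0.5, 0.3]  # sparse ground-truth coefficients
y = D @ a_true + 0.01 * rng.standard_normal(l)
a = ista_l1(D, y)
```

The recovered vector `a` is sparse, with its large entries at the atoms that actually compose `y`; in the tracker it is exactly these non-zero positions that reveal which dictionary fragment a candidate fragment came from.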
One of the two motion vectors MV_obj,1 and MV_obj,2 is chosen based on a confidence measure computed using the histogram models of the object and background. The object histogram P_obj, with 20 bins, is constructed from the pixels occupying the central 25% area of the object. The background histogram P_bg, with 20 bins, is constructed from the pixels in the area surrounding the object, up to 15 pixels away. These histograms are normalized. Figure 2 shows the areas used to construct these histograms. The area between the innermost rectangle and the middle rectangle is not used, as this region contains both object and background pixels, which adds confusion to the models. The likelihood map is calculated using equation (8) for the pixels occupying the central 25% area of the candidate area T. The confidence measure for each motion vector is taken as the sum of the corresponding likelihood values of the pixels, using equation (9).

Fig. 2. Pixels used to build the object and background histograms. [Nested regions, outermost to innermost: background, not used, object.]

    L(x, y) = P_obj(b(T(x, y))) / max(P_bg(b(T(x, y))), ε)        (8)

    L_conf = Σ_x Σ_y L(x, y)                                      (9)

where the function b maps the pixel at location (x, y) to its bin and ε is a small quantity that prevents division by zero. Of MV_obj,1 and MV_obj,2, the motion vector with the larger value of this confidence measure is chosen. A higher confidence measure implies that more pixels in the target area indicated by that motion vector belong to the object than in the area indicated by the other motion vector.
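Equations (8) and (9) can be sketched as below. The bin-mapping function b is an assumed uniform quantizer of intensities in [0, 1), and the histograms are toy inputs rather than values from the tracker.

```python
import numpy as np

N_BINS, EPS = 20, 1e-6

def bin_index(pixels, n_bins=N_BINS):
    """Assumed bin map b(.): uniform quantization of intensities in [0, 1)."""
    return np.minimum((pixels * n_bins).astype(int), n_bins - 1)

def confidence(patch, p_obj, p_bg):
    """Eq. (8)-(9): sum over pixels of P_obj(b(pixel)) / max(P_bg(b(pixel)), eps)."""
    b = bin_index(patch)
    likelihood = p_obj[b] / np.maximum(p_bg[b], EPS)  # likelihood map, eq. (8)
    return likelihood.sum()                           # confidence, eq. (9)

# Toy normalized histograms: the object is bright, the background is dark
p_obj = np.zeros(N_BINS); p_obj[15:] = 1.0 / 5
p_bg = np.zeros(N_BINS); p_bg[:5] = 1.0 / 5

bright = np.full((8, 8), 0.9)  # object-like candidate area
dark = np.full((8, 8), 0.1)    # background-like candidate area
print(confidence(bright, p_obj, p_bg) > confidence(dark, p_obj, p_bg))  # True
```

A candidate area dominated by object-colored pixels accumulates a much larger sum than one dominated by background pixels, which is what makes the sum usable as a tie-breaker between the two motion vectors.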
3.5. Dictionary update

Fragments in the second dictionary are chosen for update after analysing how well each fragment matched the candidate fragments. This can be inferred from the target coefficient vectors A_2k. The maximum value along each row (each row corresponds to a fragment in the dictionary) of the matrix A = [A_21, A_22, ..., A_2p] helps in sorting out fragments that matched very well, mildly, or not at all with the candidate fragments. Since there are only p candidate fragments, a large portion of the dictionary fragments will not have matched at all, indicated by their zero coefficient values. A small number of such fragments (depending on the update factor, which is expressed as a percentage of the total number of elements in each dictionary) are updated, since they made no contribution in the current iteration. They are updated with the corresponding fragments of the tracking result, after a check on each new fragment based on the histogram models explained in Section 3.4. The likelihood map, inverse likelihood map and confidence measure of each new fragment F are computed as

    L_f(x, y) = P_obj(b(F(x, y))) / max(P_bg(b(F(x, y))), ε)      (10)

    IL_f(x, y) = P_bg(b(F(x, y))) / max(P_obj(b(F(x, y))), ε)     (11)

    L_conf,f = [Σ_x Σ_y L_f(x, y)] / [Σ_x Σ_y IL_f(x, y)]         (12)

The fragment is updated only if the confidence measure L_conf,f > 1 (indicating that the fragment has more pixels belonging to the object), to prevent erroneous updates of the dictionary fragments.
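The update check in (10)-(12) amounts to comparing two likelihood sums; a minimal sketch follows, reusing the same assumed uniform bin map b and toy histograms as before (not the paper's actual data).

```python
import numpy as np

N_BINS, EPS = 20, 1e-6

def should_update(frag, p_obj, p_bg, n_bins=N_BINS, eps=EPS):
    """Accept a new dictionary fragment only if L_conf,f > 1 (eq. 10-12)."""
    b = np.minimum((frag * n_bins).astype(int), n_bins - 1)  # assumed bin map b(.)
    l_f = p_obj[b] / np.maximum(p_bg[b], eps)    # likelihood map, eq. (10)
    il_f = p_bg[b] / np.maximum(p_obj[b], eps)   # inverse likelihood map, eq. (11)
    l_conf = l_f.sum() / max(il_f.sum(), eps)    # confidence, eq. (12)
    return bool(l_conf > 1.0)

# Toy normalized histograms: object mass in high bins, background in low bins
p_obj = np.zeros(N_BINS); p_obj[15:] = 0.2
p_bg = np.zeros(N_BINS); p_bg[:5] = 0.2

object_like = np.full((8, 8), 0.85)      # mostly bright pixels -> accepted
background_like = np.full((8, 8), 0.05)  # mostly dark pixels -> rejected
```

The ratio test rejects fragments whose pixels look more like background than object, so a fragment cut from an occluder or from the surrounding scenery does not contaminate the dictionary.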
Algorithm 1 Proposed Tracking
1: Input: Initial position of the object in the first frame.
2: Initialize: D_1 and D_2 with overlapping fragments of the object.
3: repeat
4:   In the next frame, select the candidate from the same location as the object in the previous frame and prepare a set of p candidate fragments.
5:   Solve the l1 minimization problem using SPAMS [11] to sparsely reconstruct the candidate fragments in the space spanned by the dictionary fragments.
6:   Compute the confidence measure C_conf,k using equation (6).
7:   Choose the top p′ candidate fragments based on C_conf,k and compute their motion vectors MV.
8:   Remove outliers in MV based on direction and magnitude to obtain s motion vectors MV′.
9:   Compute the motion vector MV_obj,1 as the median of the x and y components of MV′.
10:  Compute the motion vector MV_obj,2 using equation (7).
11:  Choose MV_obj,1 or MV_obj,2 as the motion vector for the object, whichever gives the higher confidence measure based on the likelihood in (9).
12:  Update fragments of dictionary D_2 that did not match any of the candidate fragments, if L_conf,f > 1.
13: until End of video feed
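Steps 8 and 9 of Algorithm 1 (outlier removal followed by a per-component median) can be sketched as below. The deviation-from-the-median outlier rule is an assumed implementation, since the paper does not spell out its exact test.

```python
import numpy as np

def fuse_motion_vectors(mv, keep=5):
    """Keep the `keep` vectors closest to the median MV, then take the
    per-component median of the survivors (Algorithm 1, steps 8-9)."""
    mv = np.asarray(mv, dtype=float)         # shape (k, 2): one (dx, dy) per fragment
    med = np.median(mv, axis=0)
    dist = np.linalg.norm(mv - med, axis=1)  # deviation in magnitude and direction
    survivors = mv[np.argsort(dist)[:keep]]  # drop the worst outliers
    return np.median(survivors, axis=0)

# Six fragments agree on a shift of roughly (3, -1); one is a gross outlier
mv = [(3, -1), (3, -1), (2, -1), (3, 0), (3, -1), (2, -2), (40, 25)]
print(fuse_motion_vectors(mv))  # [ 3. -1.]
```

The median fusion makes the object displacement insensitive to a few fragments that latched onto the wrong dictionary atom, which is what keeps a small p workable.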
4. RESULTS AND DISCUSSION

The proposed tracker is implemented in MATLAB and evaluated on four video sequences: pktest02 (450 frames), face (206 frames), panda (451 frames) and trellis (226 frames). We use the software (SPAMS) provided by [11] to solve the l1 minimization problem. To evaluate the performance of the proposed tracker, its results are compared with those of the l1 tracker proposed by Mei et al. [6]. The l1 tracker is configured with 300 particles and 10 object templates of size 12×15. The proposed tracker is configured for p = 25 candidate fragments of size 8×8, with p′ = 21, s = 5, and an update factor of 5%.
Figure 3 shows the trajectory error (position error) with respect to ground truth for the four videos using the proposed method and the l1 tracker [6]. Table 1 summarizes the performance of the trackers under consideration. It can be seen that the proposed tracker achieves real-time performance with better accuracy compared to the particle-filter-based l1 tracker [6] when executed on a PC. The proposed tracker runs roughly 65-75 times faster than [6] (Table 1). Figures 4, 5, 6 and 7 show the tracking results. The results of the proposed approach and the l1 tracker are shown by blue and yellow (dashed) windows respectively.
In Figure 4, frame 153 shows that the l1 tracker fails when the car is occluded by the tree, and it continues to drift away. The proposed tracker survives the occlusion and gradual pose change, as seen in frames 153, 156, 219 and 430. Figure 5 shows that the proposed tracker is also robust to changes in appearance and illumination, at frames 69, 114 and 192. Figure 6 shows that the proposed tracker was able to track drastic changes in pose when the panda changes its direction of motion, while the tracker in [6] fails at frames 94 and 327. Figure 7 shows the ability of the proposed tracker to track the object even under partial illumination changes, owing to the fragment-based approach. In frame 71, it can be seen that the lower left region is illuminated more. Fragments in the lower left region give low confidence measures and are discarded before computing the object displacement, whereas the tracker in [6] uses the entire object to build the dictionary of templates and hence fails to track the object under such partial illumination changes. The videos corresponding to the results presented in Figs. 4 to 7 are available at http://www.serc.iisc.ernet.in/~venky/tracking results/.

Fig. 3. Trajectory position error with respect to ground truth for: (a) pktest02, (b) face, (c) panda and (d) trellis sequences. [Plots of absolute error versus frame number for the proposed tracker and Mei et al. [6].]

Fig. 4. Result for pktest02 video at frames 5, 153, 156, 219 and 430. [Color convention for all results: solid blue - proposed tracker; dashed yellow - l1 tracker.]

Fig. 5. Result for face video at frames 3, 10, 69, 114 and 192.

Fig. 6. Result for panda video at frames 4, 45, 94, 327 and 450.
Table 1. Execution time and trajectory error (RMSE) comparison of the proposed tracker and l1 tracker [6].

Video (frames)  | Execution time per frame (s) | Trajectory error (RMSE)
                | Proposed |   [6]             | Proposed |     [6]
pktest02 (450)  |  0.0316  |  2.0770           |  2.9878  |  119.5893
face (206)      |  0.0308  |  2.2194           |  7.0961  |    9.5666
panda (451)     |  0.0303  |  2.2742           |  4.7350  |   25.5386
trellis (226)   |  0.0301  |  2.1269           | 12.8113  |   42.3399
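The speed advantage follows directly from Table 1; a quick check of the per-frame timing ratios:

```python
# Per-frame execution times from Table 1 (seconds): (proposed, l1 tracker [6])
times = {
    "pktest02": (0.0316, 2.0770),
    "face": (0.0308, 2.2194),
    "panda": (0.0303, 2.2742),
    "trellis": (0.0301, 2.1269),
}
speedups = {video: t_l1 / t_prop for video, (t_prop, t_l1) in times.items()}
print(speedups)  # roughly 65x to 75x across the four sequences
```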
5. CONCLUSION AND FUTURE WORK

In this paper we have proposed a computationally efficient tracking algorithm which makes use of fragments of the object and the candidate to track the object. The performance of the proposed tracker has been demonstrated on various complex video sequences, and it is shown to perform better than the earlier tracker in terms of both accuracy and speed. Future work includes improving the dictionary and its update mechanism to model changes in the pose, size and illumination of the object more precisely.
6. REFERENCES

[1] A. Yilmaz, O. Javed, and M. Shah, “Object tracking: A survey,” ACM Computing Surveys, vol. 38, no. 4, 2006.
Fig. 7. Result for trellis video at frames 12, 24, 71, 141 and 226.

[2] G. D. Hager and P. N. Belhumeur, “Efficient region tracking with parametric models of geometry and illumination,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 10, pp. 1025–1039, 1998.
[3] D. Comaniciu, V. Ramesh, and P. Meer, “Real-time tracking of non-rigid objects using mean shift,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2000, vol. 2, pp. 142–149.
[4] D. A. Ross, J. Lim, R.-S. Lin, and M.-H. Yang, “Incremental learning for robust visual tracking,” International Journal of Computer Vision, vol. 77, no. 1-3, pp. 125–141, 2008.
[5] F. Porikli, O. Tuzel, and P. Meer, “Covariance tracking using model update based on Lie algebra,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2006, vol. 1, pp. 728–735.
[6] X. Mei and H. Ling, “Robust visual tracking using l1 minimization,” in Proceedings of the IEEE International Conference on Computer Vision, 2009, pp. 1436–1443.
[7] H. Liu and F. Sun, “Visual tracking using sparsity induced similarity,” in Proceedings of the IEEE International Conference on Pattern Recognition, 2010, pp. 1702–1705.
[8] H. Li and C. Shen, “Robust real-time visual tracking with compressed sensing,” in Proceedings of the IEEE International Conference on Image Processing, 2010.
[9] R. Xu, B. Zhang, Q. Ye, and J. Jiao, “Human detection in images via l1-norm minimization learning,” in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 2010, pp. 3566–3569.
[10] J. Wright, A. Y. Yang, A. Ganesh, S. S. Sastry, and Y. Ma, “Robust face recognition via sparse representation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 1, pp. 210–227, 2009.
[11] SPAMS (SPArse Modeling Software), http://www.di.ens.fr/willow/spams/.