
Adaptive Background Learning for Vehicle Detection and Spatio-Temporal Tracking

Chengcui Zhang 1, Shu-Ching Chen 1*, Mei-Ling Shyu 2, Srinivas Peeta 3

1 Distributed Multimedia Information System Laboratory, School of Computer Science, Florida International University, Miami, FL 33199, USA
2 Department of Electrical and Computer Engineering, University of Miami, Coral Gables, FL 33124, USA
3 School of Civil Engineering, Purdue University, West Lafayette, IN 47907, USA

Abstract

Traffic video analysis can provide a wide range of useful information, such as vehicle identification and traffic flow, to traffic planners. In this paper, a framework is proposed to analyze traffic video sequences using unsupervised vehicle detection and spatio-temporal tracking; it includes an image/video segmentation method, a background learning/subtraction method, and an object tracking algorithm. A real-life traffic video sequence from a road intersection is used in our study, and the experimental results show that the proposed unsupervised framework is effective for vehicle tracking in complex traffic situations.

1. Introduction

The advent of Intelligent Transportation Systems (ITS) technologies has significantly enhanced the ability to provide timely and relevant information to road users through advanced information systems. One ITS technology, Advanced Traffic Management Systems (ATMS) [8], aims at using advanced sensor systems for online surveillance and detailed information gathering on traffic conditions. Techniques from computer vision, such as 2-D image processing, can be applied to traffic video analysis to address queue detection, vehicle classification, and vehicle counting. In particular, vehicle classification and vehicle tracking have been extensively investigated [7, 4].

For traffic intersection monitoring, digital cameras are fixed and installed above the area of the intersection. A classic technique to extract the moving objects (vehicles) is background subtraction. Various approaches to background subtraction and modeling have been discussed in the literature [5, 6]. Background subtraction is a technique for removing non-moving components from a video sequence: a reference frame of the stationary components in the image is created and, once created, is subtracted from any subsequent image. The pixels belonging to new (moving) objects generate a non-zero difference. The main assumption for its application is that the camera remains stationary. However, in most cases, the non-semantic content (or background) in the images or video frames is very complex. Therefore, an effective way to obtain the background information can enable better segmentation results.
In the proposed framework, an unsupervised video segmentation method called the Simultaneous Partition and Class Parameter Estimation (SPCPE) algorithm [1] is applied to identify the vehicle objects in the video sequence. In addition, we propose a new method for background learning and subtraction that enhances the basic SPCPE algorithm so as to obtain more accurate segmentation results and, in turn, more accurate spatio-temporal relationships among objects. Experiments are conducted using a real-life traffic video sequence from a road intersection. The experimental results indicate that almost all moving vehicle objects can be successfully identified at a very early stage of the processing, thereby ensuring that accurate spatio-temporal information of objects can be obtained through object tracking.

This article is organized as follows. The next section introduces the proposed learning-based object tracking framework. The experiments, results, and analysis of the proposed multimedia traffic video data indexing framework are discussed in Section 3; a real-life traffic video sequence is used for the experiments. Conclusions are presented in Section 4.

2. Learning-Based Vehicle Object Segmentation and Tracking for Traffic Video Sequences

The proposed unsupervised spatio-temporal vehicle tracking framework includes background learning and subtraction, vehicle object identification, and tracking. In the proposed framework, an adaptive background learning method is proposed. Our method consists of the following steps:

1. Subtract successive frames to obtain the motion difference images.
2. Apply segmentation to the difference images to estimate the foreground and background regions.
3. Generate the current background image based on the information learned so far.
4. Perform background subtraction and object segmentation for those frames that contributed to the generation of the current background; meanwhile, the extracted vehicle objects are tracked from frame to frame. Upon finishing the current processing, go back to Steps 1-3 to generate the next background image, until all frames have been segmented.
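To make the control flow of these four steps concrete, the sketch below (in Python) strings them together. It is only an illustration of the loop structure, not the authors' implementation: the helper names segment_two_class, build_background, and track_objects, the list-of-arrays frame interface, and the convention that build_background returns None until enough segmentation maps have been collected are all assumptions introduced here.

```python
import numpy as np

def adaptive_background_tracking(frames, segment_two_class, build_background, track_objects):
    """Steps 1-4: difference images -> two-class maps -> learned background ->
    background subtraction, segmentation and tracking over the same interval."""
    tracks = []
    start = 0
    while start < len(frames) - 1:
        maps, end, background = [], start, None
        # Steps 1-3: accumulate difference-image segmentations until a
        # background image can be generated for this interval.
        while background is None and end < len(frames) - 1:
            diff = np.abs(frames[end + 1].astype(float) - frames[end].astype(float))
            maps.append(segment_two_class(diff))   # class 1 = background, class 2 = foreground
            end += 1
            background = build_background(frames[start:end + 1], maps)
        if background is None:
            break                                  # not enough frames left to learn a new background
        # Step 4: subtract the learned background from the frames that
        # contributed to it, then segment and track the moving objects.
        for frame in frames[start:end + 1]:
            objects = segment_two_class(np.abs(frame.astype(float) - background))
            tracks = track_objects(tracks, objects)
        start = end                                # move on and learn the next background image
    return tracks
```

Illustrative sketches of the assumed helpers appear in Sections 2.1-2.3.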


The details are described in the following subsections. The proposed segmentation method can identify vehicle objects, but does not differentiate between them (into cars, buses, etc.). Therefore, a priori knowledge (size, length, etc.) of the different vehicle classes should be provided to enable such classification. In addition, since the vehicle objects of interest are the moving ones, stopped vehicles are treated as static objects and are not identified as mobile objects until they start moving again. However, the object tracking technique ensures that such vehicles are seamlessly tracked even though they "disappear" for some duration due to the background subtraction. This aspect is especially critical under congested or queued traffic conditions.

2.1 Vehicle Object Segmentation

The SPCPE (Simultaneous Partition and Class Parameter Estimation) algorithm is an unsupervised image segmentation method used to segment video frames [1]. A given class description determines a partition, and vice versa; hence, the partition and the class parameters have to be estimated simultaneously. The method for partitioning a video frame starts with an arbitrary partition and employs an iterative algorithm to estimate the partition and the class parameters jointly. Since successive frames in a video do not differ much, the partitions of adjacent frames do not differ significantly, so each frame is partitioned using the partition of the previous frame as the initial condition, which speeds up the convergence of the algorithm. We use only two-class segmentation for this specific application domain, where class 1 corresponds to 'background' and class 2 represents 'foreground'.

One difficulty with automatic segmentation is the absence of user intervention: we do not know a priori which pixels belong to which class. In [9], we further advance the SPCPE method by using an initial partition derived from wavelet analysis, which has been shown to produce more reliable segmentation results. More technical details can be found in [1, 9].
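To illustrate the alternating estimation of the partition and the class parameters, the following deliberately simplified two-class sketch reduces each class parameter to a single mean intensity. This simplification is an assumption made here for illustration only; the actual SPCPE algorithm of [1, 9] uses richer class parameter models and a wavelet-derived initial partition.

```python
import numpy as np

def segment_two_class(frame, init_partition=None, max_iter=20):
    """Alternate between (a) estimating the class parameters (here, just the
    mean intensity of each class) and (b) reassigning every pixel to the class
    whose parameter explains it better, until the partition stops changing."""
    f = np.asarray(frame, dtype=float)
    if init_partition is None:
        # Arbitrary initial partition: split at the median intensity.
        partition = np.where(f > np.median(f), 2, 1).astype(np.uint8)
    else:
        partition = init_partition.copy()          # previous frame's partition
    for _ in range(max_iter):
        m1 = f[partition == 1].mean() if np.any(partition == 1) else f.mean()
        m2 = f[partition == 2].mean() if np.any(partition == 2) else f.mean()
        new_partition = np.where(np.abs(f - m1) <= np.abs(f - m2), 1, 2).astype(np.uint8)
        if np.array_equal(new_partition, partition):
            break                                  # converged
        partition = new_partition
    return partition  # class 1 ~ 'background', class 2 ~ 'foreground' (the higher-intensity class)
```

Passing the previous frame's partition as init_partition mirrors the convergence speed-up described above.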
2.2 Background Learning and Extraction

The basic idea of our background learning method is to generate the current background image from the segmentation results extracted from a set of frame-to-frame difference images. Many existing methods use binary thresholding of the difference image as the segmentation method. In some work, a further step is taken to update the background image periodically based on the weighted sum of the current binary object mask and the previous background. In our method, instead of using binary thresholding, we use the SPCPE algorithm to segment a difference image into a two-class segmentation map that serves as a foreground/background mask, where class 1 contains the background points and class 2 records the foreground points. By collecting just a small portion of the continuous segmentation maps, we can reach a point where every pixel position has at least one segmentation map whose corresponding pixel value equals 1 (background pixel). In other words, every background point within this time interval has appeared and been identified at least once in the segmentation maps. Let S_i ~ S_j denote the segmentation maps for difference images i to j; the corresponding background image can be generated if

∀ (r ∈ row, c ∈ col) ∃ S_k [(S_k[r, c] = 1) ∧ (i ≤ k ≤ j)]

where row is the number of rows and col is the number of columns of a video frame.

Figure 1 illustrates the background learning process using an image sequence from frame 1121 to frame 1228. As shown in Figure 1(b), the difference images are computed by subtracting successive frames and applying linear normalization. These difference images are then segmented into two-class segmentation maps (Figure 1(c)) using the advanced SPCPE, followed by a rectification procedure that eliminates most of the noise and robustly keeps the background information. Figure 1(d) shows the rectified segmentation maps for Figure 1(c). The rectified segmentation maps are then used to generate a background image (shown in Figure 1(e)) once the condition above is satisfied. The background information is extracted by taking the corresponding background pixels from the individual frames within this time interval. Instead of simply averaging these background pixel values, we further analyze their histogram distribution and pick the values in the dominant bin(s) as the trusted values. With this extra sophistication, false positives in the background image caused by noise or mis-detected motion can be reduced significantly.
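The generation condition and the dominant-bin extraction described above can be sketched as follows. The bin count, the assumed 8-bit value range, and the alignment of each segmentation map with the earlier frame of its difference pair are illustrative assumptions, not specifics given in the paper.

```python
import numpy as np

def build_background(frames, seg_maps, n_bins=16):
    """Return None until every pixel has been labelled background (class 1) in
    at least one rectified segmentation map; otherwise build the background
    image from the dominant histogram bin of each pixel's background samples."""
    maps = np.stack(seg_maps)                      # shape (k, rows, cols)
    is_bg = (maps == 1)
    if not np.all(is_bg.any(axis=0)):              # some pixel was never seen as background
        return None
    # Assumption: segmentation map k is aligned with the earlier frame of its pair.
    stack = np.stack([np.asarray(f, dtype=float) for f in frames[:len(seg_maps)]])
    rows, cols = stack.shape[1:]
    background = np.empty((rows, cols))
    edges = np.linspace(0.0, 255.0, n_bins + 1)    # assumes 8-bit grayscale values
    for r in range(rows):
        for c in range(cols):
            samples = stack[is_bg[:, r, c], r, c]  # values observed while labelled background
            hist, _ = np.histogram(samples, bins=edges)
            b = int(np.argmax(hist))               # dominant bin holds the trusted values
            in_bin = samples[(samples >= edges[b]) & (samples <= edges[b + 1])]
            background[r, c] = in_bin.mean()
    return background
```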


In a traffic video monitoring sequence, when a vehicle object stops in the intersection area (including the approaches to the intersection), our framework may deem it part of the background. In this case, since vehicle objects move into the intersection area before stopping, they are identified as moving vehicles before they stop, and their centroids identified before they stop lie in the intersection area. For these vehicles, the tracking process is frozen until they start moving again, and they are treated as "waiting" rather than "disappearing" objects. That is, the tracking process follows the same procedure as before unless one or more new objects abruptly appear in the intersection area, in which case the matching and tracking of the previous "waiting" objects is triggered to continue tracking the trails of these vehicles.

Figure 1: Unsupervised background learning and subtraction in the traffic video sequence. (a) Image sequence from frame 1121 to frame 1228; (b) successive (normalized) difference images for the frames in (a); (c) segmentation maps for the difference images in (b); (d) rectified segmentation maps for the difference images in (c); (e) the generated background for this sequence.

The key point here is that it is not necessary to obtain a perfectly 'clean' background image at each time interval. In fact, including the static objects as part of the background does not affect the extraction of moving objects; further, once the static objects begin to move, the underlying background can be discovered automatically. Instead of finding one 'general background image', the proposed background learning method aims to provide periodic background images based on motion differences and robust image segmentation. In this manner, it is insensitive to illumination changes and does not require any human effort in the loop.

2.3 Vehicle Object Tracking

In order to index the vehicle objects, the proposed framework must be able to track the moving vehicle objects (segments) across successive video frames [3], which enables the proposed framework to provide useful and accurate traffic information for ATMS.

Definitions: After video segmentation, the segments (objects) with their bounding boxes and centroids are extracted from each frame. Intuitively, two segments that are spatially the closest in adjacent frames are connected; the Euclidean distance is used to measure the distance between their centroids.

Definition 1: A bounding box B (of dimension 2) is defined by the two endpoints S and T of its major diagonal [Gonzalez93]: B = (S, T), where S = (s_1, s_2), T = (t_1, t_2), and s_i ≤ t_i for i = 1, 2. By this definition, the area of B is Area_B = (t_1 − s_1) × (t_2 − s_2).

Definition 2: The centroid ctd_O of a bounding box B corresponding to an object O is defined as ctd_O = [ctd_O1, ctd_O2], where

ctd_O1 = (Σ_{i=1}^{No} O_xi) / No,  ctd_O2 = (Σ_{i=1}^{No} O_yi) / No,

and No is the number of pixels belonging to object O within bounding box B, O_xi is the x-coordinate of the ith pixel in object O, and O_yi is the y-coordinate of the ith pixel in object O.

Let ctd_M and ctd_N, respectively, be the centroids of segments M and N that exist in consecutive frames, and let δ be a threshold. The Euclidean distance between them should not exceed δ if M and N represent the same object in consecutive frames:

DIST(ctd_M, ctd_N) = √((ctd_M1 − ctd_N1)² + (ctd_M2 − ctd_N2)²) ≤ δ

In addition to the Euclidean distance, a size restriction is applied during object tracking: if two segments in successive frames represent the same object, the difference between their sizes should not be large.
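A minimal sketch of this matching rule is given below, assuming segments are carried as dictionaries with a centroid 'ctd' and a bounding-box area 'area' (Definition 1). The threshold values and the greedy nearest-centroid strategy are illustrative assumptions, not the algorithm of [2].

```python
import math

def match_segments(prev_segments, curr_segments, delta=20.0, size_ratio=1.5):
    """Link each segment of the previous frame to the nearest segment of the
    current frame, subject to the centroid-distance threshold delta and a
    size restriction on the bounding-box areas."""
    links = {}
    for i, m in enumerate(prev_segments):
        best, best_dist = None, float('inf')
        for j, n in enumerate(curr_segments):
            dist = math.hypot(m['ctd'][0] - n['ctd'][0], m['ctd'][1] - n['ctd'][1])
            similar = max(m['area'], n['area']) <= size_ratio * min(m['area'], n['area'])
            if dist <= delta and similar and dist < best_dist:
                best, best_dist = j, dist
        if best is not None:
            links[i] = best        # previous segment i continues as current segment best
    return links
```

A current-frame segment that receives no link corresponds to a newly appearing object, while an unmatched previous segment is either leaving the scene or "waiting", as described in Section 2.2.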
The details of the object tracking can be found in [2].

Handling occlusion situations in object tracking: Unsupervised background learning and subtraction enables fast and satisfactory segmentation results, which greatly benefits the handling of object occlusions. A more sophisticated object tracking algorithm integrated into the proposed framework is given in [2]; it can handle the situation of two overlapping objects under certain assumptions (e.g., the two overlapped objects should have similar sizes). In this case, if two overlapped objects with similar sizes have been separated from each other at some point in the video sequence, they can be split and identified as two objects, with their bounding boxes fully recovered by the object tracking algorithm. The results are demonstrated in Figure 2(a)-(c), where two vehicles overlap in frame 142 but are identified as two separate objects in frame 132. Figure 2(c) shows the final results obtained by applying the occlusion handling method proposed in [2]. The segmentation results accurately identify all the vehicle objects' bounding boxes and centroids.


Figure 2: Handling two-object occlusion in object tracking. (a) Video frames 132 and 142; (b) segmentation maps for the frames in (a) without occlusion handling; (c) the final detection results obtained by applying occlusion handling.

However, there are cases where a large object overlaps with a small one. For example, a large bus merges with a small car to form a new big segment (an "overlapping" segment) even though they are two separate vehicles. In this scenario, the car object and the bus object that were separate in previous frames cannot find their corresponding segments in the currently processed frame through centroid matching and the size restriction. For these "overlapping" segments, we proposed a difference binary map reasoning method in [3] to identify which objects an "overlapping" segment may include. The idea is to obtain the difference binary map by subtracting the segmentation map of the previous frame from that of the current frame and to check the amount of difference between the segmentation results of the two consecutive frames. If this amount of difference is less than a threshold, the car and bus objects in the previous frame can be roughly mapped into the area of the "overlapping" segment in the current frame. Therefore, the corresponding links between the "overlapping" segment and the car and bus objects in the previous frame can be created, which means that the relative motion vectors of that big segment in the following frames will be automatically appended to the trace tubes of the bus and car objects in the previous frame.
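A rough sketch of this reasoning step is shown below, assuming the segmentation maps are two-class arrays and that objects carry bounding boxes in the (S, T) form of Definition 1. The 5% change threshold and the containment test are illustrative assumptions, not the exact criteria of [3].

```python
import numpy as np

def resolve_overlap(prev_map, curr_map, prev_objects, overlap_box, max_change=0.05):
    """If the two-class segmentation maps of consecutive frames differ only
    slightly, map the previously separate objects (e.g., a bus and a car) onto
    the new 'overlapping' segment so its motion is appended to both traces."""
    diff_map = (np.asarray(prev_map) != np.asarray(curr_map))   # difference binary map
    if diff_map.mean() >= max_change:
        return []                        # too much change between the consecutive frames
    (s1, s2), (t1, t2) = overlap_box     # bounding box (S, T) of the overlapping segment
    linked = []
    for obj in prev_objects:
        (a1, a2), (b1, b2) = obj['box']
        # The previous object's box should fall roughly inside the overlapping segment's box.
        if a1 >= s1 and a2 >= s2 and b1 <= t1 and b2 <= t2:
            linked.append(obj['id'])     # this object's trace absorbs the segment's future motion
    return linked
```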
3. Experimental Analysis

A real-life traffic video sequence is used to analyze spatio-temporal vehicle tracking with the proposed learning-based vehicle tracking framework. It is a grayscale video sequence showing the traffic flow at a road intersection over some time duration. The background is very complex, containing road pavement, trees, zebra crossings, pavement markings/signage, and ground. The proposed framework is fully unsupervised in that it enables an automatic background learning process that greatly facilitates the unsupervised vehicle segmentation process without any human intervention. During segmentation, the first frame is partitioned into two classes using a random initial partition. After obtaining the partition of the first frame, the partitions of the subsequent frames are computed using the previous partitions as the initial partitions, since there is no significant difference between consecutive frames. By doing so, the segmentation process converges quickly, thereby providing support for real-time processing. The effectiveness of the proposed background learning process ensures that a long run is not necessary to fully determine the accurate background information. In our experiments, the current background information can usually be obtained within 20~100 consecutive frames and is good enough for the subsequent segmentation process. In fact, by combining the background learning process with the unsupervised segmentation method, our framework enables the adaptive learning of background information.

Figure 3: Segmentation results for the video sequence (each row shows the current background image, the original frame, the difference image, and the final result). (a) Segmentation result for frame 811 using background image #1; (b) segmentation result for frame 863 using background image #2.

Figure 3 shows the segmentation results for two frames (811 and 863) along with the original frames, background images, and difference images. As shown in Figure 3(a), the background image for frame 811 contains some static vehicle objects, such as the gray car waiting in the middle of the intersection and the cars stopped behind the zebra crossing. Since they are identified as part of the background in this time interval because they lack motion, they are not extracted by the segmentation process, as shown in Figure 3(a). However, the gray car (marked by a white arrow in the original frames) that was previously waiting in the middle begins to move around frame 860, triggering the generation of a new current background, shown in Figure 3(b) and labeled #2. From Figure 3(b), it is obvious that the gray car is fading, although not completely, from the background image. This fading is nevertheless sufficient to result in the identification of the gray car in frame 863, as can be seen from the segmentation map and final result in Figure 3(b).


Moreover, as shown for frame 863 in Figure 3(b), our method successfully captures this slow-motion effect, unlike many methods that have difficulty with it because slow motion is easily confused with noise-level information.

Based on our experimental experience, we make the following observations: 1) The segmentation method adopted is very robust and insensitive to illumination changes, and the underlying class model in the SPCPE method is well suited to vehicle object modeling. Unlike some other existing frameworks, in which a vehicle object is first segmented into several small regions that are later grouped and linked to the corresponding vehicle object, no extra effort is needed for such a merge in our method. 2) As described earlier, a long-run look-ahead is not necessary to generate a background image in our framework. This implies that moving objects can be extracted as soon as their motion begins to show up; moreover, the background image can be quickly updated and adapted to new changes in the environment. 3) No manual initialization or prior knowledge of the background is needed.

Further, since the position of the centroid of a vehicle is recorded during the segmentation and tracking process, this information can later be used to index its relative spatial relations. The proposed framework has the potential to address a large range of spatio-temporal database queries for ITS. For example, it can be used to reconstruct accidents at intersections in an automated manner to identify causal factors and enhance safety.

4. Conclusions

In this paper, we present a framework for spatio-temporal vehicle tracking using unsupervised learning-based segmentation and object tracking. An adaptive background learning and subtraction method is proposed and applied to a real-life traffic video sequence to obtain more accurate spatio-temporal information about the vehicle objects. The proposed background learning method, paired with the image segmentation, is robust in many situations. As demonstrated in our experiments, almost all vehicle objects are successfully identified by this framework. A key advantage of the proposed background learning algorithm is that it is fully automatic and unsupervised, and it generates background images using a self-triggered mechanism. This is very useful for video sequences in which it is difficult to acquire a clean image of the background. Hence, the proposed framework can deal with very complex situations vis-à-vis intersection monitoring.

Acknowledgement

For Shu-Ching Chen, this research was supported in part by NSF EIA-0220562 and NSF HRD-0317692. For Mei-Ling Shyu, this research was supported in part by NSF ITR (Medium) IIS-0325260.

References

[1] S.-C. Chen, S. Sista, M.-L. Shyu, and R. L. Kashyap, "An Indexing and Searching Structure for Multimedia Database Systems," IS&T/SPIE Conference on Storage and Retrieval for Media Databases 2000, pp. 262-270, San Jose, CA, USA, January 23-28, 2000.
[2] S.-C. Chen, M.-L. Shyu, C. Zhang, and R. L. Kashyap, "Object Tracking and Augmented Transition Network for Video Indexing and Modeling," 12th IEEE International Conference on Tools with Artificial Intelligence (ICTAI 2000), pp. 428-435, Vancouver, British Columbia, Canada, November 13-15, 2000.

[3] S.-C. Chen, M.-L. Shyu, C. Zhang, and J. Strickrott, "Multimedia Data Mining Framework: Mining Information from Traffic Video Sequences," Journal of Intelligent Information Systems, Special Issue on Multimedia Data Mining, vol. 19, no. 1, pp. 61-77, July 2002.

[4] R. Cucchiara, M. Piccardi, and P. Mello, "Image Analysis and Rule-based Reasoning for a Traffic Monitoring System," IEEE International Conference on Intelligent Transportation Systems, vol. 1, no. 2, pp. 119-130, Tokyo, Japan, June 2000.

[5] D. J. Dailey, F. Cathey, and S. Pumrin, "An Algorithm to Estimate Mean Traffic Speed Using Uncalibrated Cameras," IEEE Transactions on Intelligent Transportation Systems, vol. 1, no. 2, pp. 98-107, June 2000.

[6] W. E. L. Grimson, C. Stauffer, R. Romano, and L. Lee, "Using Adaptive Tracking to Classify and Monitor Activities in a Site," IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 22-31, 1998.

[7] S. Kamijo, Y. Matsushita, and K. Ikeuchi, "Traffic Monitoring and Accident Detection at Intersections," IEEE Transactions on Intelligent Transportation Systems, vol. 1, no. 2, pp. 108-118, 2000.

[8] S. Peeta and H. S. Mahmassani, "System Optimal and User Equilibrium Time-dependent Traffic Assignment in Congested Networks," Annals of Operations Research, pp. 81-113, 1995.

[9] X. Li, S.-C. Chen, M.-L. Shyu, and S.-T. Li, "A Novel Hierarchical Approach to Image Retrieval Using Color and Spatial Information," Third IEEE Pacific-Rim Conference on Multimedia (PCM 2002), Lecture Notes in Computer Science, Springer-Verlag, pp. 175-182, Taiwan, 2002.
