
Adaptive Background Learning for Vehicle Detection and Spatio-Temporal Tracking

Chengcui Zhang 1, Shu-Ching Chen 1*, Mei-Ling Shyu 2, Srinivas Peeta 3

1 Distributed Multimedia Information System Laboratory, School of Computer Science, Florida International University, Miami, FL 33199, USA
2 Department of Electrical and Computer Engineering, University of Miami, Coral Gables, FL 33124, USA
3 School of Civil Engineering, Purdue University, West Lafayette, IN 47907, USA

Abstract

Traffic video analysis can provide a wide range of useful information, such as vehicle identification and traffic flow, to traffic planners. In this paper, a framework is proposed to analyze traffic video sequences using unsupervised vehicle detection and spatio-temporal tracking; it includes an image/video segmentation method, a background learning/subtraction method, and an object tracking algorithm. A real-life traffic video sequence from a road intersection is used in our study, and the experimental results show that the proposed unsupervised framework is effective for vehicle tracking in complex traffic situations.

1. Introduction

The advent of Intelligent Transportation Systems (ITS) technologies has significantly enhanced the ability to provide timely and relevant information to road users through advanced information systems. One ITS technology, Advanced Traffic Management Systems (ATMS) [8], aims at using advanced sensor systems for online surveillance and detailed information gathering on traffic conditions. Techniques from computer vision, such as 2-D image processing, can be applied to traffic video analysis to address queue detection, vehicle classification, and vehicle counting. In particular, vehicle classification and vehicle tracking have been extensively investigated [7, 4].

For traffic intersection monitoring, digital cameras are fixed and installed above the area of the intersection. A classic technique to extract the moving objects (vehicles) is background subtraction. Various approaches to background subtraction and modeling have been discussed in the literature [5, 6]. Background subtraction is a technique for removing non-moving components from a video sequence: a reference frame of the stationary components in the image is created and, once created, is subtracted from any subsequent image. The pixels belonging to new (moving) objects generate a non-zero difference. The main assumption for its application is that the camera remains stationary. However, in most cases, the non-semantic content (or background) in the images or video frames is very complex. Therefore, an effective way to obtain the background information can enable better segmentation results.
In the proposed framework, an unsupervised video segmentation method called the Simultaneous Partition and Class Parameter Estimation (SPCPE) algorithm [1] is applied to identify the vehicle objects in the video sequence. In addition, we propose a new method for background learning and subtraction that enhances the basic SPCPE algorithm so as to obtain more accurate segmentation results and, in turn, more accurate spatio-temporal relationships among objects. Experiments are conducted using a real-life traffic video sequence from a road intersection. The experimental results indicate that almost all moving vehicle objects can be successfully identified at a very early stage of the processing, thereby ensuring that accurate spatio-temporal information of objects can be obtained through object tracking.

This article is organized as follows. The next section introduces the proposed learning-based object tracking framework. The experiments, results, and analysis of the proposed multimedia traffic video data indexing framework are discussed in Section 3; a real-life traffic video sequence is used for the experiments. Conclusions are presented in Section 4.

2. Learning-Based Vehicle Object Segmentation and Tracking for Traffic Video Sequences

The proposed unsupervised spatio-temporal vehicle tracking framework includes background learning and subtraction, vehicle object identification, and tracking. In the proposed framework, an adaptive background learning method is proposed. Our method consists of the following steps:

1. Subtract successive frames to obtain the motion difference images.
2. Apply segmentation to the difference images to estimate the foreground and background regions.
3. Generate the current background image based on the information learned so far.
4. Perform background subtraction and object segmentation for those frames that contributed to the generation of the current background; meanwhile, the extracted vehicle objects are tracked from frame to frame. Upon finishing the current processing, go back to Steps 1-3 to generate the next background image, until all frames have been segmented.
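To make the control flow of these four steps concrete, the sketch below (in Python) strings them together. It is only an illustration of the loop structure, not the authors' implementation: the helper names segment_two_class, build_background, and track_objects, the list-of-arrays frame interface, and the convention that build_background returns None until enough segmentation maps have been collected are all assumptions introduced here.

```python
import numpy as np

def adaptive_background_tracking(frames, segment_two_class, build_background, track_objects):
    """Steps 1-4: difference images -> two-class maps -> learned background ->
    background subtraction, segmentation and tracking over the same interval."""
    tracks = []
    start = 0
    while start < len(frames) - 1:
        maps, end, background = [], start, None
        # Steps 1-3: accumulate difference-image segmentations until a
        # background image can be generated for this interval.
        while background is None and end < len(frames) - 1:
            diff = np.abs(frames[end + 1].astype(float) - frames[end].astype(float))
            maps.append(segment_two_class(diff))   # class 1 = background, class 2 = foreground
            end += 1
            background = build_background(frames[start:end + 1], maps)
        if background is None:
            break                                  # not enough frames left to learn a new background
        # Step 4: subtract the learned background from the frames that
        # contributed to it, then segment and track the moving objects.
        for frame in frames[start:end + 1]:
            objects = segment_two_class(np.abs(frame.astype(float) - background))
            tracks = track_objects(tracks, objects)
        start = end                                # move on and learn the next background image
    return tracks
```

Illustrative sketches of the assumed helpers appear in Sections 2.1-2.3.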


The details are described in the following subsections. The proposed segmentation method can identify vehicle objects, but does not differentiate between them (into cars, buses, etc.). Therefore, a priori knowledge (size, length, etc.) of the different vehicle classes should be provided to enable such classification. In addition, since the vehicle objects of interest are the moving ones, stopped vehicles are treated as static objects and are not identified as mobile objects until they start moving again. However, the object tracking technique ensures that such vehicles are seamlessly tracked even though they "disappear" for some duration due to the background subtraction. This aspect is especially critical under congested or queued traffic conditions.

2.1 Vehicle Object Segmentation

The SPCPE (Simultaneous Partition and Class Parameter Estimation) algorithm is an unsupervised image segmentation method used to segment video frames [1]. A given class description determines a partition, and vice versa; hence, the partition and the class parameters have to be estimated simultaneously. The method for partitioning a video frame starts with an arbitrary partition and employs an iterative algorithm to estimate the partition and the class parameters jointly. Since successive frames in a video do not differ much, the partitions of adjacent frames do not differ significantly, so each frame is partitioned using the partition of the previous frame as the initial condition, which speeds up the convergence of the algorithm. We use only two-class segmentation for this specific application domain, where class 1 corresponds to 'background' and class 2 represents 'foreground'.

One difficulty with automatic segmentation is the absence of user intervention: we do not know a priori which pixels belong to which class. In [9], we further advance the SPCPE method by using an initial partition derived from wavelet analysis, which has been shown to produce more reliable segmentation results. More technical details can be found in [1, 9].
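To illustrate the alternating estimation of the partition and the class parameters, the following deliberately simplified two-class sketch reduces each class parameter to a single mean intensity. This simplification is an assumption made here for illustration only; the actual SPCPE algorithm of [1, 9] uses richer class parameter models and a wavelet-derived initial partition.

```python
import numpy as np

def segment_two_class(frame, init_partition=None, max_iter=20):
    """Alternate between (a) estimating the class parameters (here, just the
    mean intensity of each class) and (b) reassigning every pixel to the class
    whose parameter explains it better, until the partition stops changing."""
    f = np.asarray(frame, dtype=float)
    if init_partition is None:
        # Arbitrary initial partition: split at the median intensity.
        partition = np.where(f > np.median(f), 2, 1).astype(np.uint8)
    else:
        partition = init_partition.copy()          # previous frame's partition
    for _ in range(max_iter):
        m1 = f[partition == 1].mean() if np.any(partition == 1) else f.mean()
        m2 = f[partition == 2].mean() if np.any(partition == 2) else f.mean()
        new_partition = np.where(np.abs(f - m1) <= np.abs(f - m2), 1, 2).astype(np.uint8)
        if np.array_equal(new_partition, partition):
            break                                  # converged
        partition = new_partition
    return partition  # class 1 ~ 'background', class 2 ~ 'foreground' (the higher-intensity class)
```

Passing the previous frame's partition as init_partition mirrors the convergence speed-up described above.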
2.2 Background Learning and Extraction

The basic idea of our background learning method is to generate the current background image from the segmentation results extracted from a set of frame-to-frame difference images. Many existing methods use binary thresholding of the difference image as the segmentation method. In some work, a further step is taken to update the background image periodically based on the weighted sum of the current binary object mask and the previous background. In our method, instead of using binary thresholding, we use the SPCPE algorithm to segment a difference image into a two-class segmentation map that serves as a foreground/background mask, where class 1 contains the background points and class 2 records the foreground points. By collecting just a small portion of the continuous segmentation maps, we can reach a point where every pixel position has at least one segmentation map whose corresponding pixel value equals 1 (background pixel). In other words, every background point within this time interval has appeared and been identified at least once in the segmentation maps. Let S_i ~ S_j denote the segmentation maps for difference images i to j; the corresponding background image can be generated if

∀ (r ∈ row, c ∈ col) ∃ S_k [(S_k[r, c] = 1) ∧ (i ≤ k ≤ j)]

where row is the number of rows and col is the number of columns of a video frame.

Figure 1 illustrates the background learning process using an image sequence from frame 1121 to frame 1228. As shown in Figure 1(b), the difference images are computed by subtracting successive frames and applying linear normalization. These difference images are then segmented into two-class segmentation maps (Figure 1(c)) using the advanced SPCPE, followed by a rectification procedure that eliminates most of the noise and robustly keeps the background information. Figure 1(d) shows the rectified segmentation maps for Figure 1(c). The rectified segmentation maps are then used to generate a background image (shown in Figure 1(e)) once the condition above is satisfied. The background information is extracted by taking the corresponding background pixels from the individual frames within this time interval. Instead of simply averaging these background pixel values, we further analyze their histogram distribution and pick the values in the dominant bin(s) as the trusted values. With this extra sophistication, false positives in the background image caused by noise or mis-detected motion can be reduced significantly.
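The generation condition and the dominant-bin extraction described above can be sketched as follows. The bin count, the assumed 8-bit value range, and the alignment of each segmentation map with the earlier frame of its difference pair are illustrative assumptions, not specifics given in the paper.

```python
import numpy as np

def build_background(frames, seg_maps, n_bins=16):
    """Return None until every pixel has been labelled background (class 1) in
    at least one rectified segmentation map; otherwise build the background
    image from the dominant histogram bin of each pixel's background samples."""
    maps = np.stack(seg_maps)                      # shape (k, rows, cols)
    is_bg = (maps == 1)
    if not np.all(is_bg.any(axis=0)):              # some pixel was never seen as background
        return None
    # Assumption: segmentation map k is aligned with the earlier frame of its pair.
    stack = np.stack([np.asarray(f, dtype=float) for f in frames[:len(seg_maps)]])
    rows, cols = stack.shape[1:]
    background = np.empty((rows, cols))
    edges = np.linspace(0.0, 255.0, n_bins + 1)    # assumes 8-bit grayscale values
    for r in range(rows):
        for c in range(cols):
            samples = stack[is_bg[:, r, c], r, c]  # values observed while labelled background
            hist, _ = np.histogram(samples, bins=edges)
            b = int(np.argmax(hist))               # dominant bin holds the trusted values
            in_bin = samples[(samples >= edges[b]) & (samples <= edges[b + 1])]
            background[r, c] = in_bin.mean()
    return background
```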


In a traffic video monitoring sequence, when a vehicle object stops in the intersection area (including the approaches to the intersection), our framework may deem it part of the background. In this case, since vehicle objects move into the intersection area before stopping, they are identified as moving vehicles before they stop, and their centroids identified before they stop lie in the intersection area. For these vehicles, the tracking process is frozen until they start moving again, and they are treated as "waiting" rather than "disappearing" objects. That is, the tracking process follows the same procedure as before unless one or more new objects abruptly appear in the intersection area, in which case the matching and tracking of the previous "waiting" objects is triggered to continue tracking the trails of these vehicles.

Figure 1: Unsupervised background learning and subtraction in the traffic video sequence. (a) Image sequence from frame 1121 to frame 1228; (b) successive (normalized) difference images for the frames in (a); (c) segmentation maps for the difference images in (b); (d) rectified segmentation maps for the difference images in (c); (e) the generated background for this sequence.

The key point here is that it is not necessary to obtain a perfectly 'clean' background image at each time interval. In fact, including the static objects as part of the background does not affect the extraction of moving objects; further, once the static objects begin to move, the underlying background can be discovered automatically. Instead of finding one 'general background image', the proposed background learning method aims to provide periodic background images based on motion differences and robust image segmentation. In this manner, it is insensitive to illumination changes and does not require any human effort in the loop.

2.3 Vehicle Object Tracking

In order to index the vehicle objects, the proposed framework must be able to track the moving vehicle objects (segments) across successive video frames [3], which enables the proposed framework to provide useful and accurate traffic information for ATMS.

Definitions: After video segmentation, the segments (objects) with their bounding boxes and centroids are extracted from each frame. Intuitively, two segments that are spatially the closest in adjacent frames are connected; the Euclidean distance is used to measure the distance between their centroids.

Definition 1: A bounding box B (of dimension 2) is defined by the two endpoints S and T of its major diagonal [Gonzalez93]: B = (S, T), where S = (s_1, s_2), T = (t_1, t_2), and s_i ≤ t_i for i = 1, 2. By this definition, the area of B is Area_B = (t_1 − s_1) × (t_2 − s_2).

Definition 2: The centroid ctd_O of a bounding box B corresponding to an object O is defined as ctd_O = [ctd_O1, ctd_O2], where

ctd_O1 = (Σ_{i=1}^{No} O_xi) / No,  ctd_O2 = (Σ_{i=1}^{No} O_yi) / No,

and No is the number of pixels belonging to object O within bounding box B, O_xi is the x-coordinate of the ith pixel in object O, and O_yi is the y-coordinate of the ith pixel in object O.

Let ctd_M and ctd_N, respectively, be the centroids of segments M and N that exist in consecutive frames, and let δ be a threshold. The Euclidean distance between them should not exceed δ if M and N represent the same object in consecutive frames:

DIST(ctd_M, ctd_N) = √((ctd_M1 − ctd_N1)² + (ctd_M2 − ctd_N2)²) ≤ δ

In addition to the Euclidean distance, a size restriction is applied during object tracking: if two segments in successive frames represent the same object, the difference between their sizes should not be large.
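A minimal sketch of this matching rule is given below, assuming segments are carried as dictionaries with a centroid 'ctd' and a bounding-box area 'area' (Definition 1). The threshold values and the greedy nearest-centroid strategy are illustrative assumptions, not the algorithm of [2].

```python
import math

def match_segments(prev_segments, curr_segments, delta=20.0, size_ratio=1.5):
    """Link each segment of the previous frame to the nearest segment of the
    current frame, subject to the centroid-distance threshold delta and a
    size restriction on the bounding-box areas."""
    links = {}
    for i, m in enumerate(prev_segments):
        best, best_dist = None, float('inf')
        for j, n in enumerate(curr_segments):
            dist = math.hypot(m['ctd'][0] - n['ctd'][0], m['ctd'][1] - n['ctd'][1])
            similar = max(m['area'], n['area']) <= size_ratio * min(m['area'], n['area'])
            if dist <= delta and similar and dist < best_dist:
                best, best_dist = j, dist
        if best is not None:
            links[i] = best        # previous segment i continues as current segment best
    return links
```

A current-frame segment that receives no link corresponds to a newly appearing object, while an unmatched previous segment is either leaving the scene or "waiting", as described in Section 2.2.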
The details of the object tracking can be found in [2].

Handling occlusion situations in object tracking: Unsupervised background learning and subtraction enables fast and satisfactory segmentation results, which greatly benefits the handling of object occlusions. A more sophisticated object tracking algorithm integrated into the proposed framework is given in [2]; it can handle the situation of two overlapping objects under certain assumptions (e.g., the two overlapped objects should have similar sizes). In this case, if two overlapped objects with similar sizes have been separated from each other at some point in the video sequence, they can be split and identified as two objects, with their bounding boxes fully recovered by the object tracking algorithm. The results are demonstrated in Figure 2(a)-(c), where two vehicles overlap in frame 142 but are identified as two separate objects in frame 132. Figure 2(c) shows the final results obtained by applying the occlusion handling method proposed in [2]. The segmentation results accurately identify all the vehicle objects' bounding boxes and centroids.


Figure 2: Handling two-object occlusion in object tracking. (a) Video frames 132 and 142; (b) segmentation maps for the frames in (a) without occlusion handling; (c) the final detection results obtained by applying occlusion handling.

However, there are cases where a large object overlaps with a small one. For example, a large bus merges with a small car to form a new big segment (an "overlapping" segment) even though they are two separate vehicles. In this scenario, the car object and the bus object that were separate in previous frames cannot find their corresponding segments in the currently processed frame through centroid matching and the size restriction. For these "overlapping" segments, we proposed a difference binary map reasoning method in [3] to identify which objects an "overlapping" segment may include. The idea is to obtain the difference binary map by subtracting the segmentation map of the previous frame from that of the current frame and to check the amount of difference between the segmentation results of the two consecutive frames. If this amount of difference is less than a threshold, the car and bus objects in the previous frame can be roughly mapped into the area of the "overlapping" segment in the current frame. Therefore, the corresponding links between the "overlapping" segment and the car and bus objects in the previous frame can be created, which means that the relative motion vectors of that big segment in the following frames will be automatically appended to the trace tubes of the bus and car objects in the previous frame.
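A rough sketch of this reasoning step is shown below, assuming the segmentation maps are two-class arrays and that objects carry bounding boxes in the (S, T) form of Definition 1. The 5% change threshold and the containment test are illustrative assumptions, not the exact criteria of [3].

```python
import numpy as np

def resolve_overlap(prev_map, curr_map, prev_objects, overlap_box, max_change=0.05):
    """If the two-class segmentation maps of consecutive frames differ only
    slightly, map the previously separate objects (e.g., a bus and a car) onto
    the new 'overlapping' segment so its motion is appended to both traces."""
    diff_map = (np.asarray(prev_map) != np.asarray(curr_map))   # difference binary map
    if diff_map.mean() >= max_change:
        return []                        # too much change between the consecutive frames
    (s1, s2), (t1, t2) = overlap_box     # bounding box (S, T) of the overlapping segment
    linked = []
    for obj in prev_objects:
        (a1, a2), (b1, b2) = obj['box']
        # The previous object's box should fall roughly inside the overlapping segment's box.
        if a1 >= s1 and a2 >= s2 and b1 <= t1 and b2 <= t2:
            linked.append(obj['id'])     # this object's trace absorbs the segment's future motion
    return linked
```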
3. Experimental Analysis

A real-life traffic video sequence is used to analyze spatio-temporal vehicle tracking with the proposed learning-based vehicle tracking framework. It is a grayscale video sequence showing the traffic flow at a road intersection over some time duration. The background is very complex, containing road pavement, trees, zebra crossings, pavement markings/signage, and ground. The proposed framework is fully unsupervised in that it enables an automatic background learning process that greatly facilitates the unsupervised vehicle segmentation process without any human intervention. During segmentation, the first frame is partitioned into two classes using a random initial partition. After obtaining the partition of the first frame, the partitions of the subsequent frames are computed using the previous partitions as the initial partitions, since there is no significant difference between consecutive frames. By doing so, the segmentation process converges quickly, thereby providing support for real-time processing. The effectiveness of the proposed background learning process ensures that a long run is not necessary to fully determine the accurate background information. In our experiments, the current background information can usually be obtained within 20~100 consecutive frames and is good enough for the subsequent segmentation process. In fact, by combining the background learning process with the unsupervised segmentation method, our framework enables the adaptive learning of background information.

Figure 3: Segmentation results for the video sequence (each row shows the current background image, the original frame, the difference image, and the final result). (a) Segmentation result for frame 811 using background image #1; (b) segmentation result for frame 863 using background image #2.

Figure 3 shows the segmentation results for two frames (811 and 863) along with the original frames, background images, and difference images. As shown in Figure 3(a), the background image for frame 811 contains some static vehicle objects, such as the gray car waiting in the middle of the intersection and the cars stopped behind the zebra crossing. Since they are identified as part of the background in this time interval because they lack motion, they are not extracted by the segmentation process, as shown in Figure 3(a). However, the gray car (marked by a white arrow in the original frames) that was previously waiting in the middle begins to move around frame 860, triggering the generation of a new current background, shown in Figure 3(b) and labeled #2. From Figure 3(b), it is obvious that the gray car is fading, although not completely, from the background image. This fading is nevertheless sufficient to result in the identification of the gray car in frame 863, as can be seen from the segmentation map and final result in Figure 3(b).


Moreover, as shown for frame 863 in Figure 3(b), our method successfully captures this slow-motion effect, unlike many methods that have difficulty with it because slow motion is easily confused with noise-level information.

Based on our experimental experience, we make the following observations: 1) The segmentation method adopted is very robust and insensitive to illumination changes, and the underlying class model in the SPCPE method is well suited to vehicle object modeling. Unlike some other existing frameworks, in which a vehicle object is first segmented into several small regions that are later grouped and linked to the corresponding vehicle object, no extra effort is needed for such a merge in our method. 2) As described earlier, a long-run look-ahead is not necessary to generate a background image in our framework. This implies that moving objects can be extracted as soon as their motion begins to show up; moreover, the background image can be quickly updated and adapted to new changes in the environment. 3) No manual initialization or prior knowledge of the background is needed.

Further, since the position of the centroid of a vehicle is recorded during the segmentation and tracking process, this information can later be used to index its relative spatial relations. The proposed framework has the potential to address a large range of spatio-temporal database queries for ITS. For example, it can be used to reconstruct accidents at intersections in an automated manner to identify causal factors and enhance safety.

4. Conclusions

In this paper, we present a framework for spatio-temporal vehicle tracking using unsupervised learning-based segmentation and object tracking. An adaptive background learning and subtraction method is proposed and applied to a real-life traffic video sequence to obtain more accurate spatio-temporal information about the vehicle objects. The proposed background learning method, paired with the image segmentation, is robust in many situations. As demonstrated in our experiments, almost all vehicle objects are successfully identified by this framework. A key advantage of the proposed background learning algorithm is that it is fully automatic and unsupervised, and it generates background images using a self-triggered mechanism. This is very useful for video sequences in which it is difficult to acquire a clean image of the background. Hence, the proposed framework can deal with very complex situations vis-à-vis intersection monitoring.

Acknowledgement

For Shu-Ching Chen, this research was supported in part by NSF EIA-0220562 and NSF HRD-0317692. For Mei-Ling Shyu, this research was supported in part by NSF ITR (Medium) IIS-0325260.

References

[1] S.-C. Chen, S. Sista, M.-L. Shyu, and R. L. Kashyap, "An Indexing and Searching Structure for Multimedia Database Systems," IS&T/SPIE Conference on Storage and Retrieval for Media Databases 2000, pp. 262-270, San Jose, CA, USA, January 23-28, 2000.
[2] S.-C. Chen, M.-L. Shyu, C. Zhang, and R. L. Kashyap, "Object Tracking and Augmented Transition Network for Video Indexing and Modeling," 12th IEEE International Conference on Tools with Artificial Intelligence (ICTAI 2000), pp. 428-435, Vancouver, British Columbia, Canada, November 13-15, 2000.

[3] S.-C. Chen, M.-L. Shyu, C. Zhang, and J. Strickrott, "Multimedia Data Mining Framework: Mining Information from Traffic Video Sequences," Journal of Intelligent Information Systems, Special Issue on Multimedia Data Mining, vol. 19, no. 1, pp. 61-77, July 2002.

[4] R. Cucchiara, M. Piccardi, and P. Mello, "Image Analysis and Rule-based Reasoning for a Traffic Monitoring System," IEEE International Conference on Intelligent Transportation Systems, vol. 1, no. 2, pp. 119-130, Tokyo, Japan, June 2000.

[5] D. J. Dailey, F. Cathey, and S. Pumrin, "An Algorithm to Estimate Mean Traffic Speed Using Uncalibrated Cameras," IEEE Transactions on Intelligent Transportation Systems, vol. 1, no. 2, pp. 98-107, June 2000.

[6] W. E. L. Grimson, C. Stauffer, R. Romano, and L. Lee, "Using Adaptive Tracking to Classify and Monitor Activities in a Site," IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 22-31, 1998.

[7] S. Kamijo, Y. Matsushita, and K. Ikeuchi, "Traffic Monitoring and Accident Detection at Intersections," IEEE Transactions on Intelligent Transportation Systems, vol. 1, no. 2, pp. 108-118, 2000.

[8] S. Peeta and H. S. Mahmassani, "System Optimal and User Equilibrium Time-dependent Traffic Assignment in Congested Networks," Annals of Operations Research, pp. 81-113, 1995.

[9] X. Li, S.-C. Chen, M.-L. Shyu, and S.-T. Li, "A Novel Hierarchical Approach to Image Retrieval Using Color and Spatial Information," Third IEEE Pacific-Rim Conference on Multimedia (PCM 2002), Lecture Notes in Computer Science, Springer-Verlag, pp. 175-182, Taiwan, 2002.
