Object Tracking and Face Recognition in Video Streams

Linus Nilsson

June 6, 2012

Bachelor's Thesis in Computing Science, 15 credits
Supervisor at CS-UmU: Niclas Börlin
Supervisor at CodeMill: Martin Wuotila Isaksson
Examiner: Jonny Pettersson

Umeå University
Department of Computing Science
SE-901 87 UMEÅ
SWEDEN


Abstract

The goal of this project was to improve an existing face recognition system for video streams by using adaptive object tracking to track faces between frames. The knowledge of which faces do and do not occur in subsequent frames was used to filter out false faces and to better identify real ones.

The recognition ability was tested by measuring how many faces were found and how many of them were correctly identified in two short video files. The tests also counted the number of false face detections. The results were compared to a reference implementation that did not use object tracking.

Two identification modes were tested: the default and strict modes. In the default mode, whichever person is most similar to a given image patch is accepted as the answer. In strict mode, the similarity must also be above a certain threshold.

The first video file had fairly high image quality and contained only frontal faces, one at a time. The second video file had slightly lower image quality and contained up to two faces at a time, in a larger variety of angles. The second video was therefore the more difficult case.

The results show that the number of detected faces increased by 6-21% in the two video files, for both identification modes, compared to the reference implementation. At the same time, the number of false detections remained low: fewer than 0.009 false detections per frame in the first video file, and fewer than 0.08 in the second.

The number of correctly identified faces increased by 8-22% in the two video files in default mode. In the first video file, there was also a large improvement in strict mode, which went from recognising 13% to 85% of all faces. In the second video file, however, neither implementation managed to identify anyone in strict mode.

The conclusion is that object tracking is a good tool for improving the accuracy of face recognition in video streams. Anyone implementing face recognition for video streams should consider using object tracking as a central component.


Contents

1 Introduction
  1.1 Aim
  1.2 Detailed goals
  1.3 Related work

2 The TLD algorithm
  2.1 Object model
  2.2 Tracking
  2.3 Detection
  2.4 Learning
  2.5 Summary

3 Implementation
  3.1 Terminology
  3.2 Wawo
  3.3 The reference implementation
  3.4 Integrating Faceclip
  3.5 Integrating object tracking

4 Experiments
  4.1 Simple
  4.2 Complex

5 Results
  5.1 Simple
    5.1.1 Reference
    5.1.2 Object tracking
    5.1.3 Faceclip
    5.1.4 Faceclip with object tracking
    5.1.5 Summary
  5.2 Complex
    5.2.1 Reference
    5.2.2 Object tracking
    5.2.3 Faceclip
    5.2.4 Faceclip with object tracking
    5.2.5 Summary

6 Conclusions
  6.1 Limitations and future work

References

A Results


List of Figures

2.1 Forward-backward error
2.2 Short-term tracking
2.3 Detection fern
2.4 Detection forest
2.5 Affine transformations
2.6 N-P constraints
2.7 Flow of TLD
3.1 Frame processing
3.2 Trajectory amendment
4.1 Frames from the Simple video
4.2 Frames from the Complex video
5.1 Simple results
5.2 Processed Simple frames
5.3 Complex results
5.4 Processed Complex frames




List of Tables

A.1 Results for the Simple experiment, using the default Wawo mode.
A.2 Results for the Simple experiment, using strict Wawo mode.
A.3 Results for the Complex experiment, using the default Wawo mode.
A.4 Results for the Complex experiment, using strict Wawo mode.




Chapter 1

Introduction

Vidispine (http://www.vidispine.com/) is a software system for storing and processing video content. One component in Vidispine is the transcoder, which consists of a set of subcomponents and plugins that together can transform a video stream in various ways. It can, for example, convert a stream between different formats, or analyse the stream to extract information. CodeMill has created a face recognition plugin that uses the library OpenCV (http://opencv.willowgarage.com/) for detecting faces in a video stream and Wawo (http://www.wawo.com/) for identifying them. The purpose is to make it possible to extract and save information about which people occur in which frames, so that the information can later be looked up when needed. One usage scenario is with surveillance cameras, where it is often necessary to identify people.

The problem is that the face recognition sometimes fails, either by not being able to make an identification, or by making an invalid identification. This project is about improving the success rate by taking more context into account, which is done by tracking how the faces move between frames.

1.1 Aim

The plugin currently used by CodeMill looks at one video frame at a time, trying to detect and identify the faces that occur in that frame. The goal of this project is to integrate a library for tracking a moving object between frames, which can be used to track a face and recognise that it belongs to the same person, even if Wawo fails to see that. If the object tracker sees that a particular face occurs in ten adjacent frames, but Wawo only recognises the person's face in eight of those frames, then chances are that it is still the same person in the two frames where Wawo failed. This knowledge should be usable to improve the results.

1.2 Detailed goals

Face recognition generally consists of two primary tasks: detecting where there is a face in a frame or picture, and identifying whose face it is (Torres, 2004). Either task can fail for a number of reasons. For example, the system may only know how to recognise someone from a frontal picture, or it may only recognise a face in a particular "ideal", or neutral, state. When the system fails, it can be by identifying the face as belonging to the wrong person, or by failing to make an identification at all.

This project aims to improve the ability to recognise a person in a non-ideal state or angle. By tracking how a face moves, the system has the potential to recognise that it is the same face when it moves between an ideal and a non-ideal state or angle, even if Wawo no longer recognises it. The system can then make a qualified guess that the identity is the same in the two (or more) states. If Wawo makes contradictory identifications for a single face in different frames, the identity that occurs most often can reasonably be assumed to be the right one.

The goal is to use a system called OpenTLD (Kalal, 2011b). It is an implementation of TLD, a set of algorithms that handle object tracking in video streams, which basically means that the system tracks where objects move between frames. TLD has the bonus of being adaptive, making it more flexible than the OpenCV face detector. The original OpenTLD is written in Matlab, but some C++ ports should be explored and used instead if possible.

The integration of OpenTLD with Vidispine will be general, allowing it in theory to track any object. This will be accomplished by writing a general plugin for object tracking. The general object-tracking plugin will then be used by the face recognition plugin to track faces.

An object can be tracked both forwards and backwards in a video stream. Tracking an object forwards is the obvious case, and is the first thing that should be implemented. If a face is only found in the middle of a scene, it may also be worthwhile to track the face backwards to see how long it has been there. Therefore, support for tracking backwards should be explored.

A library called Faceclip (Rondahl, 2011) will be evaluated to see whether it is better than the current system at detecting faces in a frame, and may take over that task. Like the current code, Faceclip is based on OpenCV. It has, however, been adjusted to do a better job, mainly by performing a larger number of tests. Before making the decision to use Faceclip, it should be evaluated; both the results and the performance should be taken into account to some degree.

The current Wawo-based code will be used to do the actual face identification.
A bonus feature is to add support for detecting and reporting the direction of a face in a given frame.

The two transcoder plugins, object tracking and face detection/identification, will be combined in a higher-level face-recognition transcoder plugin.

The current plugin will be used as the reference system when evaluating the recognition results and performance of the project.

In short, the goals for this project are:

G1 Create a transcoder plugin for object tracking, using OpenTLD, and integrate it with the face recognition plugin. The first goal is to track objects forward.
G2 Support tracking objects backwards.
G3 Implement and evaluate Faceclip as an alternative to the current face detection code.
G4 Support finding the direction of detected faces.


1.3 Related work

Kalal et al. (2010a) had positive results when using a modified version of their TLD system for face tracking. In this case, they used an existing face detector. Instead of building an object detector, they modified the learning process of TLD to build a face validator, which validates whether an image patch, given by the tracker or the face detector, is the sought face.

Nielsen (2010) did research into face recognition, using continuity filtering to find false face detections, and Active Appearance Models to identify faces. He found that the continuity filtering reduced the number of false faces to be identified.




Chapter 2

The TLD algorithm

TLD (Tracking, Learning, Detection) is a set of algorithms that together try to achieve good long-term object tracking in video streams (Kalal et al., 2011).

TLD has three main processes that it uses to achieve its goal: short-term tracking for tracking small movements from one frame to the next, object detection for re-detecting lost objects, and training to teach the object detector what the object looks like.

The goal of the short-term tracker is to follow an object as it moves in a trajectory in the frame sequence. The tracker may lose track of the object if the object disappears for a few frames, becomes partly occluded, or moves a large distance between two successive frames, such as when the scene changes.

The object detector can re-detect the location of the object, so that tracking can continue from the new, detected location. By doing this, an object can potentially be tracked through any video sequence, regardless of the smoothness of its movements, and regardless of whether the object is occluded at any stage. This is called long-term tracking.

TLD can track arbitrary objects. Given an image patch (a part of the image defined by a bounding box) from the first frame, TLD learns the appearance of the object defined by the image patch, and starts tracking it.

The following sections describe TLD in more detail, with regard to how it is implemented in OpenTLD. Parts of OpenTLD are described by Kalal et al. (2010b, 2011). Other parts are only described in the source code, which may be found at Kalal (2011a).

2.1 Object model

A central component of the TLD algorithms is the object model: a model of what the tracked object looks like. The object model is defined by a current set of positive and negative examples. An image patch that is similar to the positive examples and dissimilar to the negative examples is considered to be similar to the model.

Similarity is defined using normalised cross-correlation (NCC) (Gonzales and Woods, 2008, Ch. 12.2). The NCC is calculated between the image patch and each positive and negative example, which gives a value between -1 and 1 for each example. An NCC of 1 means a perfect match, -1 means an inverse match, and 0 means that no correlation is found between the two images. To better match the usage scenario, the NCC values are moved into the interval [0, 1]: NCC' = (NCC + 1)/2. The highest NCC' value max_P among the positive examples and the highest NCC' value max_N among the negative examples are used to calculate the similarity value as in equation 2.1:

    distance_P = 1 − max_P,
    distance_N = 1 − max_N, and
    similarity = distance_N / (distance_N + distance_P).    (2.1)

The equations say that the similarity value depends on the similarity distance to the positive and the negative examples. If the distance to the most similar negative example is larger than the distance to the most similar positive example, the similarity value will be high.
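As a concrete illustration, equation 2.1 can be computed with a few lines of code. The sketch below is a minimal example, not the OpenTLD implementation: it assumes that all patches are grey-scale pixel vectors of the same size, and the Patch type and function names are chosen for the example only.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

using Patch = std::vector<double>;  // grey-scale pixel values of one image patch

// Normalised cross-correlation between two equally sized patches, in [-1, 1].
double ncc(const Patch& a, const Patch& b) {
    double meanA = 0.0, meanB = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) { meanA += a[i]; meanB += b[i]; }
    meanA /= a.size(); meanB /= b.size();
    double num = 0.0, varA = 0.0, varB = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        num  += (a[i] - meanA) * (b[i] - meanB);
        varA += (a[i] - meanA) * (a[i] - meanA);
        varB += (b[i] - meanB) * (b[i] - meanB);
    }
    return num / std::sqrt(varA * varB);
}

// Similarity of a patch to the object model, as in equation 2.1.
double similarity(const Patch& patch,
                  const std::vector<Patch>& positives,
                  const std::vector<Patch>& negatives) {
    auto bestNcc = [&](const std::vector<Patch>& examples) {
        double best = 0.0;  // NCC' already lies in [0, 1]
        for (const Patch& e : examples)
            best = std::max(best, (ncc(patch, e) + 1.0) / 2.0);  // NCC' = (NCC+1)/2
        return best;
    };
    double distP = 1.0 - bestNcc(positives);
    double distN = 1.0 - bestNcc(negatives);
    return distN / (distN + distP);
}
```

Note that the final division assumes at least one of the two distances is non-zero; a real implementation would guard against a patch that matches both sets exactly.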


2.2 Tracking

Tracking is implemented using a median-flow tracker based on the Lucas-Kanade method (LK) (Kalal et al., 2011; Lucas and Kanade, 1981). A set of feature points p_{f,k} is selected from a rectangular grid in the currently tracked image patch in the current frame f. In the next frame, the new position of each selected point is estimated using LK, which gives a new set of points p_{f+1,k}.

To determine which of the points are reliable, the NCC value is calculated from a 2x2 pixel area around each pair of points p_{f,k} and p_{f+1,k}. Furthermore, the forward-backward error (FBE) is measured (Kalal et al., 2010b). Each point p_{f+1,k} is tracked in reverse from frame f + 1 to frame f, using LK. This produces a point p'_{f,k}, whose distance to p_{f,k} is computed to get an error estimate; the further p'_{f,k} is from p_{f,k}, the larger the error. If the median of the FBE for all points is too large, the result from the tracker is considered invalid (see Figure 2.1). Otherwise, all points with an NCC above the median NCC and an FBE below the median FBE are selected.

The motion m of the object is calculated from the median of the estimated point motions (see Figure 2.2):

    m_k = p_{f+1,k} − p_{f,k}, for all k, and
    m = median_k(m_k).    (2.2)

The new size of the object is calculated from a scale-change factor s:

    d_{i,j} = distance(p_{f,i}, p_{f,j}),
    d'_{i,j} = distance(p_{f+1,i}, p_{f+1,j}), and
    s = median_{i≠j}(d'_{i,j} / d_{i,j}).    (2.3)

The confidence of the tracker is calculated as the similarity between the object model and the image patch defined by the new bounding box.
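The motion and scale estimates in equations 2.2 and 2.3 reduce to median computations once the reliable point pairs are known. The sketch below assumes the filtered point pairs are already available (the LK step itself can be done with OpenCV's calcOpticalFlowPyrLK); the Point type and function names are illustrative, and at least one point pair is assumed.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

struct Point { double x, y; };

// Median of a non-empty list of values (upper median for even-sized input).
double median(std::vector<double> v) {
    std::nth_element(v.begin(), v.begin() + v.size() / 2, v.end());
    return v[v.size() / 2];
}

// Object motion between frame f and f+1: the median of the point motions (eq. 2.2).
Point objectMotion(const std::vector<Point>& pf, const std::vector<Point>& pf1) {
    std::vector<double> dx, dy;
    for (std::size_t k = 0; k < pf.size(); ++k) {
        dx.push_back(pf1[k].x - pf[k].x);
        dy.push_back(pf1[k].y - pf[k].y);
    }
    return {median(dx), median(dy)};
}

// Scale change: median ratio of pairwise point distances in the two frames (eq. 2.3).
double scaleChange(const std::vector<Point>& pf, const std::vector<Point>& pf1) {
    std::vector<double> ratios;
    for (std::size_t i = 0; i < pf.size(); ++i)
        for (std::size_t j = i + 1; j < pf.size(); ++j) {
            double d  = std::hypot(pf[j].x - pf[i].x,   pf[j].y - pf[i].y);
            double d1 = std::hypot(pf1[j].x - pf1[i].x, pf1[j].y - pf1[i].y);
            if (d > 0) ratios.push_back(d1 / d);
        }
    return median(ratios);
}
```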


Figure 2.1: The images depict two subsequent frames, showing two feature points in an object. The points are tracked both forward and backward. One point is properly tracked back to its origin, while the other ends up somewhere else.

Figure 2.2: The images depict two subsequent frames, showing three feature points in an object. The tracker finds the three feature points in both frames. The estimated movement of each one is indicated by the arrows. The movement of the whole object is estimated as the median of all feature point movements. The scale change is calculated from the change in internal distance between points in the two frames.


Figure 2.3: Example of applying a fern with three object features (upper left rectangle) on an image patch of size 10x7 pixels (lower left rectangle). The image patch depicts a grey object on a white background. The bit value generated by each object feature is 1 if the pixel marked "a" is brighter than the pixel marked "b", otherwise 0. The fern bit-sequence for this image patch is 101. The corresponding leaf node contains the estimated probability 0.92, which means that of all the training data where this fern has generated the bit-sequence 101, 92% has been positive data.

2.3 Detection

Object (re-)detection in TLD is a three-stage process (Kalal et al., 2011). The first stage removes image patches with too small a variance in pixel intensity, based on the variance of the initial tracked patch. This is a relatively quick operation, and works as an efficient initial filter. The second stage is based on a randomised forest of ferns (Breiman, 2001) that attempts to find image patches in the current frame that are similar to the tracked object. Like the first stage, this is a relatively quick operation. In the third stage, the patches that passed the previous stages are compared to the object model, which is a slower but more accurate operation. The first stage is based on well-known statistical variance, while the third stage was described in Section 2.1. Therefore, this section focuses on describing the second stage, which is also the most involved and novel operation of the three.

A fern is a simple tree with only leaf nodes and a set of associated object features. A fern with n object features will have 2^n leaf nodes. An object feature is a boolean operation on the intensity difference between two pixel positions within an image patch. The pixel pair of each feature is initialised at startup to random relative positions within the image patch.

Each leaf node contains an estimated probability that an image patch depicts the tracked object. The node has the value p/(p + n), where p and n are the number of positive and negative image patches that have corresponded with that leaf node during training (see Section 2.4).

Applying an object feature to an image patch produces one bit, either 1 or 0, depending on the outcome of the intensity comparison. The output bits from all object features in the fern together make up a feature vector, which is used to uniquely select a leaf node (see Figure 2.3).
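To make the fern structure concrete, the following sketch shows how the bit-sequence selects a leaf and how the leaf's probability is read out. It is a simplified illustration under the assumption that the patch is stored as a flat array of pixel intensities; the names and layout do not follow the OpenTLD source.

```cpp
#include <cstddef>
#include <vector>

// One object feature: compare the intensity at two pixel positions in the patch.
struct Feature { std::size_t a, b; };

struct Fern {
    std::vector<Feature> features;         // n features -> 2^n leaf nodes
    std::vector<int> positives, negatives; // per-leaf counts p and n from training
};

// Build the bit-sequence (feature vector) selecting one leaf node of the fern.
std::size_t leafIndex(const Fern& fern, const std::vector<unsigned char>& patch) {
    std::size_t index = 0;
    for (const Feature& f : fern.features)
        index = (index << 1) | (patch[f.a] > patch[f.b] ? 1 : 0);
    return index;
}

// Estimated probability p/(p+n) that the patch depicts the tracked object.
double fernProbability(const Fern& fern, const std::vector<unsigned char>& patch) {
    std::size_t leaf = leafIndex(fern, patch);
    int p = fern.positives[leaf], n = fern.negatives[leaf];
    return (p + n) > 0 ? static_cast<double>(p) / (p + n) : 0.0;
}
```

A forest then simply averages this probability over all of its ferns and accepts the patch when the mean is high enough, as described next.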


Figure 2.4: A forest has a number of independent trees. The image is compared with each tree, which produces one probability measure per tree. If the mean of all probabilities is above a certain threshold, the forest accepts the image patch as the object.

Detection is done using a sliding window of varying size, where each window defines an image patch. At each position, the detector asks each fern for the probability that the given image patch is a match. As illustrated by Figure 2.4, after all ferns are processed, the image patch is accepted by the forest if the mean probability is high enough (≥ 50%).

For each image patch that passes the first two stages, the similarity (Eq. 2.1) to the object model is calculated. The patch with the highest similarity value is accepted as the detection result.

2.4 Learning

The short-term tracker calculates a position based primarily on geometric data, and will usually produce a position close to the previous position. In contrast, the detector calculates probabilities based on image data only, as determined by the detection forest, for a number of positions. TLD uses the two position sets to train the object detector and to find new examples to add to the object model.

The learning algorithm used in TLD is based on P-N Learning, which uses Positive and Negative constraints for finding training data (Kalal et al., 2011). Based on the position sets reported, two constraints are applied on each frame: a P-constraint that finds false negatives, and an N-constraint that finds false positives. False negatives are used as positive training data, and false positives are used as negative training data. Both constraints can make errors, but the idea is that their errors should cancel each other out to a sufficient degree, leading to positive learning.

The detector reports probabilities for a number of possible object positions. The positions with a probability above zero, but with too little spatial overlap with the position reported by the tracker, are marked as false positives, and are used as negative training data.

The image patch defined by the position reported by the short-term tracker is used to generate positive training data. The patch is expanded if necessary, so that it corresponds exactly to a position of the detector's sliding window. The resulting patch is used as positive training data, together with a number of affine transformations of the patch (Angel, 2009, Ch. 4.6); Figure 2.5 shows the transformations used.

The generated training examples are candidates for being used in training, but each example will only be used if the component to be trained (object detector or object model) is incorrect about the example. This is determined by testing each example against the detector forest and against the object model, so that each component calculates its likelihood or similarity value, as described in sections 2.3 and 2.1 respectively.


Figure 2.5: The affine transformations applied to an image patch during training: (a) image patch reported by the short-term tracker; (b) the patch expanded to grid points, in this case with the object centred in the expanded patch; (c) translation; (d) scaling; (e) rotation. The patch reported by the tracker is first expanded to match one of the detector's sliding-window positions. A number of positive examples are then generated from the patch, where each example randomly combines the affine transformations.

For each component, positive examples are only used if the component does not think the patch depicts the object, i.e. if the calculated value is below a certain threshold. Similarly, negative examples are only used if the component thinks the patch does depict the object, i.e. if the calculated value is above a certain threshold. In other words, the area reported by the tracker is assumed to be the correct patch, so components that disagree try to learn from it. Figure 2.6 illustrates the concept.

Since the training sets are based on the location reported by the tracker, the result of the tracker must be considered sufficiently good if the training stage is to be applied: the tracker result must be valid, and the tracker must be more confident than the detector. Additionally, if the tracker was not deemed good enough in the previous frame, the confidence of the tracker must be above a certain threshold, i.e. the confidence of the tracker must be large enough for training to start up again.

Initial training examples are generated from the first frame, where the bounding box of the object is known. As such, there is no need to take into account the confidence or the forward-backward error. Positive examples are generated from the image patch defined by the bounding box, and negative examples are generated from other parts of the frame.

2.5 Summary

The overall flow of TLD is shown in Figure 2.7. The short-term tracker and the object detector are both run on the current frame. If neither component reports a valid result, the object is considered lost until it is re-detected. Otherwise, the result of the more confident component is reported. Furthermore, if the tracker is more confident than the detector, the learning stage is applied.
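The per-frame decision logic summarised above (and in Figure 2.7) can be written compactly. The following is a rough sketch under the assumption that each component reports a bounding box together with a confidence value; it omits the extra start-up condition on the tracker confidence and all other details, and the names are invented for the example.

```cpp
#include <optional>

struct Box { int x, y, w, h; };
struct Result { Box box; double confidence; bool valid; };

struct FrameOutput { std::optional<Box> object; bool learn; };

// One TLD iteration: combine the short-term tracker and the detector (Figure 2.7).
FrameOutput processFrame(const std::optional<Result>& tracked,
                         const std::optional<Result>& detected) {
    FrameOutput out{std::nullopt, false};
    bool trackerOk  = tracked && tracked->valid;
    bool detectorOk = detected.has_value();

    if (!trackerOk && !detectorOk)
        return out;  // object lost until it is re-detected

    // Report the result of the more confident component.
    if (trackerOk && (!detectorOk || tracked->confidence >= detected->confidence)) {
        out.object = tracked->box;
        out.learn = true;  // the tracker's result drives the learning stage
    } else {
        out.object = detected->box;
    }
    return out;
}
```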


Figure 2.6: Illustration of a one-dimensional trajectory over time. The blue line represents the trajectory reported by the tracker. The red marks are detections from the detector. The green areas around each tracker position are considered positive, and may be used as positive training data. A detection outside of a green area may be used as negative training data.

Figure 2.7: The overall flow of TLD.




Chapter 3

Implementation

CodeMill has previously written a face recognition plugin for the transcoder in Vidispine. It uses OpenCV for face detection and Wawo for face identification. This plugin is used as a reference system with which to compare the results of this project.

The main goal of this project was to integrate object tracking in the face recognition plugin. The secondary goal was to make the plugin use Faceclip for face detection. The integrations are described below.

3.1 Terminology

The face recognition system has two primary subtasks. The first task is to detect the faces in each frame. A detected face is defined by its position, given as a bounding box. The second task is to identify the faces, thus giving each one a probable person ID, or PID. If a likely PID cannot be decided, the face is given the unknown PID. A face state, or state for short, is defined by a timestamp, a position, and a PID.

A face trajectory, or trajectory for short, represents a face that is visible in a number of frames. A trajectory consists of states. If p is the PID that has occurred most often among the states in a trajectory, the trajectory can be amended by changing all states to have p as PID.

A state is reported when the system is ready to output the information of that state. In this case, reporting a state means saving the timestamp, position and PID to a file.

3.2 Wawo

Wawo is a central part of the implementation, as it handles all face identifications. Given an image patch for a detected face, Wawo outputs the most likely PID for that face.

In order for Wawo to be able to identify faces, it must be trained. Training involves giving Wawo one or more face images for each known person, which the face recognition plugin does at startup. A set of images is loaded for each person, where each training image is assumed to contain a single face. For each image, the system detects the approximate position of the face before giving the image to Wawo, in order to exclude any surrounding background.

Wawo can operate in one of two modes. In the default mode, Wawo always outputs the PID of the person most similar to the given image patch, regardless of how similar it is.


In strict mode, Wawo only outputs the PID if the similarity is above a given threshold; it will otherwise output unknown. The threshold is a floating-point number from 0 to 1.

3.3 The reference implementation

The reference system performs two tasks: face detection and face identification. For each frame, a set of faces is detected using OpenCV. Each face is identified using Wawo. Performing the two tasks produces a set of face states that are reported immediately.

3.4 Integrating Faceclip

The Faceclip integration was performed by replacing the relevant calls to OpenCV with the corresponding calls to Faceclip.

Using Faceclip turned out not to be feasible when training Wawo, as it found some false faces in the training pictures. For this reason, the old detection code is still used for the initial training stage.

3.5 Integrating object tracking

In this implementation, there are two components that may report face positions: the face detector and the object tracker. The system also tries to identify faces detected by the face detector, as in the reference implementation. This means that the output from the face detector is both a position and a PID, while the output from the object tracker is just a position.

The implementation is based around face trajectories. The basic idea is that once the face detector has found a new face, the face detector and the object tracker together try to keep track of that face in subsequent frames. Saving the face states to a single trajectory produces a history of states for that face.

When the face has been out of picture for some frames, and/or at the end of the video stream, the trajectory is amended and reported. States are dropped from the trajectory once they have been reported, and will therefore have no effect on future amendment processes.

Each trajectory has its own object tracker, and the trajectory and the corresponding tracker are used interchangeably in this report. When the detector finds a face that does not have a trajectory, both a trajectory and a tracker are created.

A counter C_pt counts how many times there has been a detection with PID p that has overlapped with the tracker t. This includes when the tracker is first created. Only defined values of p are counted, i.e. not unknown.

Since there are two components that may each find a set of face positions in a given frame, the two sets must be combined to give a meaningful result. The procedure first finds the set of overlapping positions, where a detection d overlaps with a tracker t. Each such position is added to the trajectory t, using the PID and exact coordinates from d. Each t and d can occur in at most one such overlap. The tracker positions that did not overlap with a detection are added to their respective trajectories, using the unknown PID.


The detections that did not overlap with a tracker are saved, given certain conditions. If the PID p is unknown, or if there is no C_pt defined yet, a new trajectory is created, and the position is added there. Otherwise, given the trajectory t with the highest C_pt, the detector's position is added to t if and only if t has not yet been updated during this frame.

To prevent two faces being reported at the same position in a given frame, an initial filter is applied on the tracker positions before combining them with the detector positions. The trackers are compared pairwise. If they have overlapping positions, the less confident tracker position is discarded. No filtering is necessary for the face detector, since it does not report overlapping faces.

During the amendment process, at least pid_min = 50 percent of the states in the trajectory have to have the majority PID. If fewer than pid_min percent of the states have it, all states are set to unknown instead of the majority PID. This process is meant to get rid of false identifications, using the fact that Wawo is unsure about the identity.

In addition, to reduce the number of false face positions reported, a minimum number of det_min = 2 positions from the face detector must overlap with the trajectory for it to be considered valid. If a trajectory has too few detections when it is time to amend it, the trajectory is instead removed, together with its current states. Figure 3.1 shows how each frame is processed. Figure 3.2 shows the amendment process.
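The amendment rules can be summarised as in the sketch below. It is a simplified illustration: the real plugin also stores positions and timestamps, the exact counting and tie-breaking rules are assumptions made for the example, and only the pid_min and det_min values are taken from the description above.

```cpp
#include <map>
#include <string>
#include <vector>

// One state of a trajectory: the PID assigned in that frame and whether the
// state came from the face detector (rather than only from the object tracker).
struct State { long frame; std::string pid; bool fromDetector; };

// Amend a trajectory before reporting it. A trajectory with fewer than detMin
// detector-backed states is discarded; otherwise all states receive the majority
// PID, or unknown if that PID covers less than pidMin of the states.
std::vector<State> amend(std::vector<State> states,
                         int detMin = 2, double pidMin = 0.5) {
    int detections = 0;
    std::map<std::string, int> votes;
    for (const State& s : states) {
        if (s.fromDetector) ++detections;
        if (s.pid != "unknown") ++votes[s.pid];
    }
    if (detections < detMin)
        return {};  // too few detections: drop the trajectory and its states

    std::string majority = "unknown";
    int best = 0;
    for (const auto& [pid, count] : votes)
        if (count > best) { best = count; majority = pid; }

    if (states.empty() || best < pidMin * states.size())
        majority = "unknown";  // Wawo was too unsure: report every state as unknown

    for (State& s : states) s.pid = majority;
    return states;
}
```

The returned states correspond to what would then be reported and dropped from the trajectory.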


Figure 3.1: How a frame is processed. The numbers by the arrows leading out from the Find overlapping node represent the order in which they are executed; what is important is that the third step comes after the other two.


Figure 3.2: The amendment process.




Chapter 4

Experiments

The implementation has been tested using two different video streams. The experiments are meant to evaluate the improvement from using Faceclip, and the improvement from using object tracking. Each experiment involves four main test configurations: one with the reference system, one with Faceclip, one with object tracking, and one with both Faceclip and object tracking. Each test configuration has been run with Wawo set first to default and then to strict mode. The threshold used in strict mode was 0.3.

The video in each experiment has first been analysed manually, to find all face positions, and to mark the positions that are of known faces. The face positions include both frontal views and side views of faces. A known face is defined as one that has a training picture given to the face identifier (Wawo) to learn from; only known faces can be correctly identified. A face position of an unknown face is marked as unknown.

Two things are analysed for each test run: the reported face positions and the corresponding PID values. Each face position and corresponding PID are compared to the known data for that frame. A face position is correct if it overlaps with a known face position. The numbers of correct and incorrect reports are counted.

4.1 Simple

The Simple experiment involves a video of six people doing presentations on a stage, one person at a time. All faces are known, all face views are frontal or close to frontal, and there are no sudden movements. There are a total of 3768 face positions, which means that a face is visible in every frame.

The video has a run-length of 2 minutes and 30 seconds, a resolution of 720x576, and a frame rate of 25 frames per second. The source is a high-definition video file. Despite the down-conversion from the source, the image is quite clear. Figure 4.1 shows a few example frames.

This video is used partly because a potential client showed interest in seeing the results, but also because the results should be relatively easy to analyse given its simplicity. The latter point makes it a good starting point for finding parts of the implementation that can be improved.


Figure 4.1: Some example frames from the Simple video.

Figure 4.2: Some example frames from the Complex video.

4.2 Complex

The Complex experiment involves a video containing six people, of which three are unknown, in a dynamic setting. People move in and out of the picture, and more than one person can often be seen in the same frame. There are a total of 353 face positions, of which 205 are known faces. The video has 284 frames, of which 254 have at least one face visible.

The video has a run-length of 30 seconds, a resolution of 1280x960, and a frame rate of 7.5 frames per second. It was recorded at CodeMill using a webcam. The image is slightly blocky and not of very high quality. Figure 4.2 shows a few example frames.

Some use cases for the system may involve more than one person at a time, so it is good to know how adding object tracking affects the results in those cases. The results of this video should give an indication of that.


Chapter 5

Results

This chapter summarises the results of the experiments. The details are listed in Appendix A.

There are two numbers presented for each test run regarding detected faces: the number of correct detections and the number of false detections. There are also two numbers shown regarding the identifications made: the number of faces correctly identified and the number of incorrect identifications. The incorrect identifications include identifications made for false detections, unless they were identified as unknown.

5.1 Simple

Figure 5.1 shows the results of the Simple experiment, and Figure 5.2 shows a few processed frames.

5.1.1 Reference

The number of detected faces was 3541 in both default and strict mode. The number of false detections was 6, or 0.00159 per frame, in both modes. There were 2525 correct identifications and 1021 incorrect identifications in default mode. In strict mode, there were 493 correct and zero incorrect identifications.

5.1.2 Object tracking

The number of faces detected increased by 6% in both Wawo modes compared to the reference system. The false detections disappeared in default mode, but increased by about 450% in strict mode, leaving the latter at 0.0087 false detections per frame. The faces correctly identified increased by 8% in default mode, and by 550% in strict mode. The false identifications decreased by 11% in default mode, and remained at zero in strict mode.

5.1.3 Faceclip

The number of faces detected increased by 3% in both Wawo modes compared to the reference system. The false detections increased by a factor of almost 600 in both modes, leaving them at 0.91 false detections per frame.


Figure 5.1: The results for the Simple experiment. The number of faces that can be correctly detected and the number of faces that can be correctly identified are both 3768. The test combining Faceclip and object tracking in strict mode did not run to completion.

Figure 5.2: Some processed frames from the Simple video. Each detected face is marked by a bounding box. The real names of the six people in this video are not known, so the names "A" through "F" have been used instead, assigned to the faces in the order they appear in the video. In these particular frames, the first face is correctly identified as A, the second face is incorrectly identified as F (correct would be B), and the third face is correctly identified as C.


The faces correctly identified decreased by about 25% in both modes. The false identifications increased by over 300% in default mode, and remained at zero in strict mode.

5.1.4 Faceclip with object tracking

When combining Faceclip and object tracking in strict Wawo mode, the computer ran out of memory before the test could finish. The higher memory usage is a result of the large number of false detections, together with the fact that false detections are identified as unknown in strict mode. This means that new trackers have to be created more often, since the counters C_pt will be undefined in those cases (see Section 3.5). There are therefore no result numbers for this configuration in strict mode.

The number of faces detected increased by 5% compared to the reference in the default Wawo mode. The number of false detections is almost 2000 times higher than the reference number, leaving it at 2.7 false detections per frame. The faces correctly identified increased by 18%. The false identifications increased by almost 900%.

5.1.5 Summary

Object tracking without Faceclip is an improvement over the reference system, in both default and strict Wawo mode. Using object tracking in strict mode gave a particularly large improvement.

Using Faceclip was not a clear improvement. Faceclip found a larger number of face positions than the original detector, but it also found a very large number of false face positions. Combining Faceclip with object tracking gave a particularly bad result.

5.2 Complex

Figure 5.3 shows the results of the Complex experiment, and Figure 5.4 shows a few processed frames.

5.2.1 Reference

The number of detected faces was 163 in both default and strict mode. The number of false detections was 9, or 0.0317 per frame, in both modes. There were 79 correct identifications and 87 incorrect identifications in default mode. In strict mode, there were no identifications at all.

5.2.2 Object tracking

The number of faces detected increased by 12% in default mode and by 21% in strict mode compared to the reference. The false detections increased by 22% in default mode and by 144% in strict mode, leaving them at 0.039 and 0.077 false detections per frame respectively. The faces correctly identified increased by 32% in default mode, and remained at zero in strict mode. The false identifications increased by 1% in default mode, and remained at zero in strict mode.


Figure 5.3: The results for the Complex experiment. The number of faces that can be correctly detected is 352, while the number of faces that can be correctly identified is 205.

Figure 5.4: Some processed frames from the Complex video. Each detected face is marked by a bounding box. The faces in the first frame are properly identified as Johan and Sandra, and the detected face in the second frame is correctly identified as Rickard. Three of the faces are not detected in these frames.


5.2.3 Faceclip

The number of faces detected increased by 21% compared to the reference in both Wawo modes. The false detections increased by 122% in both modes, leaving them at 0.039 false detections per frame. The faces correctly identified decreased by 8% in default mode, and remained at zero in strict mode. The false identifications increased by 51% in default mode, and remained at zero in strict mode.

5.2.4 Faceclip with object tracking

The number of faces detected increased by 37% in default mode and by 21% in strict mode compared to the reference. The false detections increased by 278% in default mode and by 122% in strict mode, leaving them at 1.12 and 0.703 false detections per frame respectively. The faces correctly identified increased by 14% in default mode, and remained at zero in strict mode. The false identifications increased by 65% in default mode, and remained at zero in strict mode.

5.2.5 Summary

Unlike in the Simple experiment, Wawo did not manage to identify anyone at all in the Complex video in strict mode. Object tracking increased the number of faces detected in both Wawo modes, although it also increased the number of false detections. In default mode, object tracking increased the number of correct identifications, with only a slight increase in the number of false identifications. The object tracker managed to track some faces that were in profile, even when the face detector could not see them.

Faceclip managed to detect more faces than the reference implementation, but it also reported a larger number of invalid faces and identities. Combining Faceclip and object tracking had a slightly positive effect on the correct detections, but also affected the incorrect results in a negative way.




Chapter 6

Conclusions

All in all, using object tracking seems to improve the face recognition abilities of the system. More faces were found in the tests, and a larger percentage of the faces were correctly identified. Unless Faceclip was used, the number of false detections did not increase by much, and even went down in one test.

The object tracker managed to track some faces that were in profile, even when the face detector could not see them. It also failed to track some such faces, though, which might be ascribed to the low frame rate; if a face moves too far between two successive frames, the short-term tracker of TLD may fail to track it. If neither the face detector nor the TLD detector can detect the face, it will be lost in that frame.

Using Faceclip yielded less positive results. While Faceclip found more faces than the original detector code, it also found a large number of false faces. Rondahl (2011) mentions that Faceclip does worse with high-resolution images, so it may be that these videos fall in that category, in particular the Simple video, which has a lower resolution but higher image quality than the Complex video.

Combining Faceclip with object tracking gave particularly bad results. Some of the numbers were positively affected by the combination, but the negative effects were greater. This suggests that the object tracker continued tracking the false positions from Faceclip for an extended period of time, and therefore amplified the wrong results. Faceclip may still be a good tool if the system is improved to better get rid of false detections while keeping true detections.

Using the strict Wawo mode had varying results. In the Simple video, it led to a great improvement when used together with object tracking. It did not give any identifications at all in the Complex video, though, which might be because of the lower image quality.

This project had four goals:

G1 Create a transcoder plugin for object tracking, using OpenTLD, and integrate it with the face recognition plugin. The first goal is to track objects forward.

G2 Support tracking objects backwards.

G3 Implement and evaluate Faceclip as an alternative to the current face detection code.

G4 Support finding the direction of detected faces.

Goals G1 and G3 have been implemented, while G2 and G4 have not. Goal G1 is fulfilled by the intended object tracking plugin, integrated with the face recognition code. Goal G3 is fulfilled by having integrated Faceclip, and having compared it to the original code. Because of negative results, Faceclip is not used in the current implementation. While it would have been nice to have goals G2 and G4 implemented too, the results of the current system are enough to give an indication of the effect of using object tracking in a face recognition system.

The conclusion drawn from this project is that object tracking is a good tool for improving the accuracy of face recognition in video streams. Anyone implementing face recognition for video streams should consider using object tracking as a central component.

6.1 Limitations and future work

Testing was done by comparing whether a reported face position overlaps with a known face position. It would, for example, mark a position defining a person's nose as correct, since it overlaps with the whole face. Inspecting the results shows that there are few such occasions, but it still means that the result numbers are not fully accurate.
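The thesis does not spell out the exact overlap rule, so the following is only a minimal sketch, in Python, of how such an overlap-based correctness check could look, assuming axis-aligned boxes given as (x, y, width, height); the function names and the example boxes are made up for illustration. It also shows why a small patch covering only a nose would be counted as correct.

    # Hypothetical sketch (not the thesis code): overlap-based correctness check.
    # A reported detection counts as correct if its box overlaps any known face box.
    def overlaps(a, b):
        """True if two (x, y, w, h) boxes share any area."""
        ax, ay, aw, ah = a
        bx, by, bw, bh = b
        return ax < bx + bw and bx < ax + aw and ay < by + bh and by < ay + ah

    def is_correct(detection, known_faces):
        return any(overlaps(detection, face) for face in known_faces)

    # A small box around the nose still overlaps the whole face box,
    # so it is (perhaps too generously) counted as a correct detection.
    face = (100, 100, 80, 80)
    nose = (130, 140, 10, 10)
    print(is_correct(nose, [face]))  # True

A stricter variant could require a minimum intersection-over-union instead of any overlap, which would avoid counting such small patches as correct.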
The current object tracking implementation cannot handle a large number of false detections with the unknown PID, as it uses up too much memory.

The current way of filtering false detections easily fails; a trajectory will be accepted as long as the face detector reports a particular image patch twice within a short timespan. This is particularly apparent when combining object tracking with Faceclip, as Faceclip often reports the same invalid image patch twice or more.

One type of false detection that occurs is when the system tracks a face properly for a while, and then finds the face at the wrong position for a few frames, and then goes back to tracking it at the correct position. A solution based on continuity filtering (Nielsen, 2010) may be a way to get rid of the false intermediate positions.

It may also be useful to try identifying faces reported by the object tracker, instead of the current way of only identifying faces reported by the face detector. Doing this would increase the number of faces for potential identification.

It would be useful to test different settings, for example other values of pid_min and det_min. That may be enough to get rid of some of the false detections and identifications.

The system currently only does forward tracking. As mentioned in Section ??, one possible improvement would be to also implement backward tracking.

The optional goal of detecting the direction of each face has not been implemented. A future research idea is to look into techniques for doing that.

In the Complex experiment, the object tracker failed to track some faces. As mentioned in Section 5.2, it may be due to the low frame rate. One way to test whether it is caused by the low frame rate may be to down-sample the frame rate of a video clip with a higher frame rate, and see how the object tracking is affected (a sketch of such a test is given at the end of this section). If the low frame rate is the cause of the failure, getting good results when using the implementation may require a video source with a high enough frame rate.

In general, Wawo seems to be quite sensitive to what training pictures are included; they all have to be of similar size and quality. Because of the high sensitivity, using the implementation in practice may require the user to put some effort into producing training pictures of high enough, and consistent, quality.
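As a concrete way of running the frame-rate test suggested above, the following sketch (not part of the thesis implementation) uses OpenCV to keep only every n-th frame of a clip; the file names and the keep_every value are made-up examples. The down-sampled clip could then be run through the system and compared against the original.

    # Hypothetical sketch: lower the frame rate of a clip by keeping every n-th
    # frame, to see how a lower frame rate affects the object tracking.
    import cv2

    def downsample(src_path, dst_path, keep_every=5):
        cap = cv2.VideoCapture(src_path)
        fps = cap.get(cv2.CAP_PROP_FPS)
        size = (int(cap.get(cv2.CAP_PROP_FRAME_WIDTH)),
                int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT)))
        out = cv2.VideoWriter(dst_path, cv2.VideoWriter_fourcc(*"XVID"),
                              fps / keep_every, size)
        index = 0
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            if index % keep_every == 0:
                out.write(frame)
            index += 1
        cap.release()
        out.release()

    downsample("high_fps_clip.avi", "low_fps_clip.avi", keep_every=5)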


Acknowledgements

I would like to thank my supervisors Niclas Börlin at the Computing Science department at Umeå University, and Martin Isaksson Wuotila at Codemill, for their assistance in producing this thesis. I would also like to thank the people at Codemill who took part in creating some of my test data.




References

Angel, E. (2009). Interactive Computer Graphics. Pearson Education, 5th edition.

Breiman, L. (2001). Random forests. Machine Learning, 45(1):5–32. DOI: 10.1023/A:1010933404324.

Gonzales, R. C. and Woods, R. E. (2008). Digital image processing. Pearson Prentice Hall, 3rd edition.

Kalal, Z. (2011a). OpenTLD Git repository. https://github.com/zk00006/OpenTLD/, commit 8a6934de6024d9297f6da61afb4fcee01e7282a2.

Kalal, Z. (2011b). TLD web page. http://info.ee.surrey.ac.uk/Personal/Z.Kalal/tld.html, visited 2011-11-29.

Kalal, Z., Matas, J., and Mikolajczyk, K. (2010a). Face-TLD: Tracking-Learning-Detection applied to faces. In Proceedings of the 17th International Conference on Image Processing, pages 3789–3792. DOI: 10.1109/ICIP.2010.5653525, http://www.ee.surrey.ac.uk/CVSSP/Publications/papers/Kalal-ICIP-2010.pdf.

Kalal, Z., Matas, J., and Mikolajczyk, K. (2010b). Forward-backward error: Automatic detection of tracking failures. In Proceedings of the 20th International Conference on Pattern Recognition, pages 2756–2759. DOI: 10.1109/ICPR.2010.675, http://www.ee.surrey.ac.uk/CVSSP/Publications/papers/Kalal-ICPR-2010.pdf.

Kalal, Z., Matas, J., and Mikolajczyk, K. (2011). Tracking-Learning-Detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 99(Preprint). DOI: 10.1109/TPAMI.2011.239, http://kahlan.eps.surrey.ac.uk/featurespace/tld/Publications/2011_tpami.pdf.

Lucas, B. D. and Kanade, T. (1981). An iterative image registration technique with an application to stereo vision. In Proceedings of the International Joint Conference on Artificial Intelligence, volume 2, pages 674–679.

Nielsen, J. B. (2010). Face detection and recognition in video-streams. Bachelor Thesis IMM-B.Sc.-2010-14, Department of Informatics and Mathematical Modeling, Image Analysis and Computer Graphics, Technical University of Denmark, Lyngby. http://orbit.dtu.dk/getResource?recordId=263847&objectId=1&versionId=1.

Rondahl, T. (2011). Face detection in digital imagery using computer vision and image processing. Bachelor Thesis UMNAD-891, Department of Computing Science, Umeå University, Sweden. URN:NBN: urn:nbn:se:umu:diva-51406.


Appendix A

Results

Tables A.1 and A.2 show the result numbers for the Simple experiment in default mode and strict mode respectively. Tables A.3 and A.4 show the result numbers for the Complex experiment in default mode and strict mode respectively.

A few numbers are calculated for each test run:

– Total detections is the number of face positions that were reported.
– Correct detections is the number of correct positions.
– False detections is the number of incorrect positions.
– Missed faces is the number of known face positions that were not reported.
– Total IDs is the total number of times an ID other than unknown was reported.
– Correct IDs is the number of correct IDs.
– Incorrect IDs is the number of incorrect IDs. This number includes identifications done for false faces.

The numbers that specify correctness also have one or two associated percentage numbers:

– % of reported is the percentage of the total reported number.
– % of known is the percentage of the total known number.

Additionally, each number that is about correctness has, within parentheses, the difference in percentage points compared to the reference results. The default tests are compared to the default reference, while the strict tests are compared to the strict reference.
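To make the definitions above concrete, the following small sketch (not part of the thesis code) shows how the percentage columns and the percentage-point differences can be computed; the example numbers are the Reference values from Table A.1.

    # Hypothetical sketch of how the percentage columns are derived.
    def percentages(correct, total_reported, total_known):
        pct_of_reported = 100.0 * correct / total_reported
        pct_of_known = 100.0 * correct / total_known
        return pct_of_reported, pct_of_known

    def pp_difference(value, reference_value):
        # Difference in percentage points compared to the reference run.
        return value - reference_value

    # Example: the Reference run reports 3547 detections, of which 3541 are
    # correct; 3541 correct + 227 missed faces = 3768 known face positions.
    of_reported, of_known = percentages(3541, 3547, 3541 + 227)
    print(round(of_reported, 2), round(of_known, 2))  # 99.83 93.98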


(a) Reference

                        n                % of reported    % of known
  Total detections      3547             -                -
  Correct detections    3541             99.83            93.98
  False detections      6                0.17             -
  Missed faces          227              -                6.02
  Total IDs             3546             -                -
  Correct IDs           2525             71.21            67.01
  Incorrect IDs         1021             28.79            -

(b) Faceclip

                        n                % of reported    % of known
  Total detections      7068             -                -
  Correct detections    3638 (+97)       51.47 (-48.36)   96.55 (+2.57)
  False detections      3430 (+3424)     48.53 (+48.46)   -
  Missed faces          130 (-93)        -                3.45 (-2.57)
  Total IDs             6399             -                -
  Correct IDs           2168 (-357)      33.88 (-37.33)   57.54 (-9.47)
  Incorrect IDs         4231 (+3210)     66.12 (+37.33)   -

(c) Object tracking

                        n                % of reported    % of known
  Total detections      3756             -                -
  Correct detections    3756 (+215)      100.00 (+0.17)   99.68 (+5.7)
  False detections      0 (-6)           0.00 (-0.17)     -
  Missed faces          12 (-215)        -                0.32 (-5.7)
  Total IDs             3624             -                -
  Correct IDs           2720 (+195)      75.06 (+3.85)    72.19 (+5.18)
  Incorrect IDs         904 (-117)       24.94 (-3.85)    -

(d) Object tracking + Faceclip

                        n                % of reported    % of known
  Total detections      15209            -                -
  Correct detections    3729 (+188)      24.52 (-75.31)   98.96 (+4.98)
  False detections      11480 (+11474)   75.48 (+75.31)   -
  Missed faces          39 (-188)        -                1.04 (-4.98)
  Total IDs             12220            -                -
  Correct IDs           2081 (-444)      17.03 (-54.18)   55.23 (-11.78)
  Incorrect IDs         10139 (+9118)    82.97 (+54.18)   -

Table A.1: Results for the Simple experiment, using the default Wawo mode.


(a) Reference (strict)

                        n                % of reported    % of known
  Total detections      3547             -                -
  Correct detections    3541             99.83            93.98
  False detections      6                0.17             -
  Missed faces          227              -                6.02
  Total IDs             493              -                -
  Correct IDs           493              100.00           13.08
  Incorrect IDs         0                0.00             -

(b) Faceclip (strict)

                        n                % of reported    % of known
  Total detections      7068             -                -
  Correct detections    3638 (+97)       51.47 (-48.36)   96.55 (+2.57)
  False detections      3430 (+3424)     48.53 (+48.36)   -
  Missed faces          130 (-97)        -                3.45 (-2.57)
  Total IDs             365              -                -
  Correct IDs           365 (-128)       100.00           9.69 (-3.39)
  Incorrect IDs         0                0.00             -

(c) Object tracking (strict)

                        n                % of reported    % of known
  Total detections      3794             -                -
  Correct detections    3761 (+220)      99.13 (-0.7)     99.81 (+5.83)
  False detections      33 (+27)         0.87 (+0.7)      -
  Missed faces          7 (-220)         -                0.19 (-5.83)
  Total IDs             3204             -                -
  Correct IDs           3204 (+2711)     100.00           85.03 (+71.95)
  Incorrect IDs         0                0.00             -

Table A.2: Results for the Simple experiment, using strict Wawo mode.


(a) Reference

                        n                % of reported    % of known
  Total detections      172              -                -
  Correct detections    163              94.77            46.31
  False detections      9                5.23             -
  Missed faces          189              -                53.69
  Total IDs             166              -                -
  Correct IDs           79               47.59            38.73
  Incorrect IDs         87               52.41            -

(b) Faceclip

                        n                % of reported    % of known
  Total detections      217              -                -
  Correct detections    197 (+34)        90.78 (-3.99)    55.97 (+9.66)
  False detections      20 (+11)         9.22 (+3.99)     -
  Missed faces          155 (-34)        -                44.03 (-9.66)
  Total IDs             194              -                -
  Correct IDs           73 (-6)          37.63 (-9.96)    35.78 (-2.95)
  Incorrect IDs         121 (+44)        62.37 (+9.96)    -

(c) Object tracking

                        n                % of reported    % of known
  Total detections      194              -                -
  Correct detections    183 (+20)        94.33 (-0.44)    51.99 (+5.68)
  False detections      11 (+2)          5.67 (+0.44)     -
  Missed faces          169 (-20)        -                48.01 (-5.68)
  Total IDs             192              -                -
  Correct IDs           104 (+25)        54.17 (+6.58)    50.98 (+12.25)
  Incorrect IDs         88 (+1)          45.83 (-6.58)    -

(d) Object tracking + Faceclip

                        n                % of reported    % of known
  Total detections      256              -                -
  Correct detections    222 (+59)        86.72 (-8.05)    63.07 (+16.76)
  False detections      34 (+25)         13.28 (+8.05)    -
  Missed faces          130 (-59)        -                36.93 (-16.76)
  Total IDs             256              -                -
  Correct IDs           90 (+11)         35.16 (-12.34)   44.12 (+5.39)
  Incorrect IDs         166 (+79)        64.84 (+12.34)   -

Table A.3: Results for the Complex experiment, using the default Wawo mode.


(a) Reference (strict)

                        n                % of reported    % of known
  Total detections      172              -                -
  Correct detections    163              94.77            46.31
  False detections      9                5.23             -
  Missed faces          189              -                53.69
  Total IDs             0                -                -
  Correct IDs           0                0                0
  Incorrect IDs         0                0                -

(b) Faceclip (strict)

                        n                % of reported    % of known
  Total detections      217              -                -
  Correct detections    197 (+34)        90.78 (-3.99)    55.97 (+9.66)
  False detections      20 (+11)         9.22 (+3.99)     -
  Missed faces          155 (-34)        -                44.03 (-9.66)
  Total IDs             0                -                -
  Correct IDs           0                0.00             0.00
  Incorrect IDs         0                0.00             -

(c) Object tracking (strict)

                        n                % of reported    % of known
  Total detections      220              -                -
  Correct detections    198 (+35)        90.00 (-4.77)    56.25 (+9.94)
  False detections      22 (+13)         10.00 (+4.77)    -
  Missed faces          154 (-35)        -                43.75 (-9.94)
  Total IDs             0                -                -
  Correct IDs           0                0.00             0.00
  Incorrect IDs         0                0.00             -

(d) Object tracking + Faceclip (strict)

                        n                % of reported    % of known
  Total detections      217              -                -
  Correct detections    197 (+34)        90.78 (-3.99)    55.97 (+9.66)
  False detections      20 (+11)         9.22 (+3.99)     -
  Missed faces          155 (-34)        -                44.03 (-9.66)
  Total IDs             0                -                -
  Correct IDs           0                0.00             0.00
  Incorrect IDs         0                0.00             -

Table A.4: Results for the Complex experiment, using strict Wawo mode.
