Chapter 3: Visual Feature Extraction and Classification

Higher level features can be divided into two distinct categories: syntactic features and semantic features (Konstantinou et al., 2009; Mittal, 2006). Syntactic features are derived from direct operations on the pixel-level data of an image or a video, for example object boundaries or the histogram of a frame. Semantic features, on the other hand, are derived through a number of processing operations and mappings, for example identifying sea areas, sky areas and mountains. The semantic level of a video sequence contains levels of abstraction that depend on the sequence of frames, motion, objects, brightness, surroundings, and the relationships with other video attributes such as audio, prior scenes and video text. Using high level semantic features for video retrieval allows people to search at the semantic level, which is more intuitive. High level semantic features can be obtained by integrating additional knowledge of a specific domain with low level features (Zheng et al., 2006).

In the recent past, researchers have focused on the use of internal features of images and videos computed in an automated or semi-automated way (Lew et al., 2006). Most analyses use statistical and probabilistic approaches that can be approximately correlated to the content features (Aghbari & Makinouchi, 2003; Datta et al., 2005; Fasel, 2006; Heller & Ghahramani, 2006). These approaches provide information without costly human interaction. Researchers commonly focus on the syntactic features rather than the semantic features of visual content (Israël, Broek, Putten, & Uyl, 2004). Some content-based retrieval systems that operate at the semantic level identify motion features as the key element besides other features such as colour, objects and texture, because motion (either camera motion or shot editing) adds additional meaning to the content.

The performance of high level feature extraction is currently not adequate and greatly restricts search performance (Zheng et al., 2006). There is a need to further improve the performance of high level feature extraction.


Meanwhile, researchers should consider the reliability of semantic feature detectors more carefully, as it is the key to high level feature extraction. One heuristic in the selection process is to use application-specific feature sets, and this has given some positive results in the retrieval of semantic-sensitive data (Datta et al., 2005).

3.3. Use of Colour Features in Content Retrieval

Colour is the most common syntactic feature in research on image and video content analysis. Different colour representations have been used in different experiments at different scales. The popular colour modes are RGB (Red, Green, Blue), HSV/I (Hue, Saturation, Value/Intensity) and grey levels. Colour features describe the amount and distribution of colours in an image. The rationale for these features is that images of similar objects or concepts usually have similar colours (Pakkanen, 2006). When querying with a colour image in an image retrieval system, the image database is scanned for a colour distribution similar to that of the query. In the case of query by example, the algorithm's task is to find a colour distribution similar to that of the example image.

A pixel can be represented by different colour attributes that are described by different colour modes and stored in different data structures. All visual features are composed of many colour attributes. To describe a pixel, the RGB, HSV, grey and CIE-LUV (CIE Lab) colour models are common. Some image processing techniques give different results for different models. Digital video pixels are the product of raster scanning or the direct output of Charge Coupled Device (CCD)/Exmor arrays in digital cameras (Y. Wang, Ostermann, & Zhang, 2002). According to the International Telecommunication Union - Radiocommunication Sector (ITU-R) BT.601 standard, digital video has to be maintained with a 4:3 or 16:9 aspect ratio (image width : image height) (Y. Wang et al., 2002).


Further, the recommended digital colour coordinate is YCbCr (one luminance and two chrominance components). The Y, Cb, and Cr components are scaled and shifted versions of the analog Y, U, and V components; the scaling and shifting operations are introduced so that the resulting components take values in the range (0, 255). The choice of colour model is largely a matter of converting from one model to another, which is an easy process.

3.4. Syntactic Visual Features

Syntactic visual features derived from the colour data of video frames are the key to deriving semantic concepts in videos. Several syntactic visual features are commonly used in many applications.

3.4.1. Colour Histograms

Colour histograms are often used for matching (Lew, 2000). Although colour histograms are useful because they are relatively insensitive to position and orientation changes, they do not capture the spatial relationships of colour regions and thus have limited discriminating power (Lavee, Khan, & Thuraisingham, 2007; Lew, 2000; Maron & Ratan, 1998; Nicu Sebe & Lew, 2001). More recently, researchers have introduced spatial histograms to integrate spatial description into histogram-based techniques. The MPEG-7 visual description guidelines also mention spatial histograms, but they do not describe micro level pattern distributions (MPEG-7, 2004).

In calculating the most relevant match of a colour distribution, probabilistic and statistical techniques are commonly used together with techniques based on average colour values, standard deviations and 3D histogram comparisons, which are computationally expensive. In some studies, 3D histograms are used in colour feature comparison with HSV components (Heller & Ghahramani, 2006).

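The computation behind such histogram features is straightforward. The Python/NumPy sketch below is a minimal illustration rather than the method of any cited system: it assumes the image is an H x W x 3 array of 8-bit RGB values, quantises each channel into a small number of bins, and compares two normalised histograms with a simple L1 distance. With five bins per channel this yields the 125-cell histograms discussed later in this section.

import numpy as np

def colour_histogram(image, bins=5):
    """Joint RGB histogram with bins**3 cells, normalised to sum to 1.

    image: H x W x 3 uint8 array (RGB ordering assumed for this sketch).
    """
    pixels = image.reshape(-1, 3)
    # Map each 0-255 channel value to one of `bins` intervals.
    idx = (pixels.astype(np.int64) * bins) // 256
    flat = idx[:, 0] * bins * bins + idx[:, 1] * bins + idx[:, 2]
    hist = np.bincount(flat, minlength=bins ** 3).astype(np.float64)
    return hist / hist.sum()

def histogram_distance(h1, h2):
    # Simple L1 distance between two normalised histograms.
    return np.abs(h1 - h2).sum()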

If the images originate from diverse sources, parameters such as quality, resolution and colour depth are likely to vary. This in turn causes variations in the extracted colour and texture features (Datta et al., 2005). This is true for video frames as well.

Histograms commonly fail to describe micro level or texel level information and are not accompanied by spatial information; therefore they give poor accuracy in concept detection. Most colour histograms are used with 125 bins (S. Gao, Zhu, & Sun, 2007), but this number of dimensions also negatively affects the computations that take place in the classification process. HSV-type histograms require colour conversion in addition to RGB histograms. The L*a*b histogram contains 96 dimensions (Giang & Worring, 2008). Histograms are usually combined with multiple features, such as texture and shape features, for accurate concept detection, which increases computational costs. Figure 3.1 shows colour histograms for different channels, including RGB.

Figure 3.1: Colour Histogram Representation of an Image

3.4.2. Texture Features in Content Retrieval

In most applications, colour features have been used to identify different object categories, but colour feature methods tend to produce false positives (Mäenpää, Viertola, & Pietikäinen, 2004).


Visuals that contain a similar colour composition can result in false positives. Texture plays a very important role in visual perception and object identity (Ardizzone, Capra, & La Cascia, 2000; Mäenpää et al., 2004). Texture patterns distinguish the structural arrangement of surfaces from the surrounding environment; texture is therefore a natural property of all object surfaces. When textures are used, both colour and geometrical distribution features improve results (Heller & Ghahramani, 2006; Nicu Sebe & Lew, 2001). The analysis and categorization of textures is therefore crucial. Textures can be classified using different properties such as coarseness, dominant colour, roughness, colour gradients and periodic patterns.

In video visuals, texture features on objects can be used to classify or identify objects in scenes. It is difficult to identify and fully understand how the human vision system segments textures on objects. Some studies have used a mixture of texture features, such as Tamura and Gabor texture features, in visual content retrieval (Heller & Ghahramani, 2006). According to Tamura et al. (1978), texture is approximated in computational form by six basic features, namely coarseness, contrast, directionality, line-likeness, regularity, and roughness, defined with the help of human perception. In addition, Simultaneous Auto-Regressive (SAR) orientation, Gabor and wavelet transforms are also used to describe texture (Ardizzone et al., 2000; Heller & Ghahramani, 2006).

'Textons', which refer to a set of representative classes of image patches that suffice to characterize an image object or texture, are typically extracted densely and represent small, relatively generic image micro-structures such as blobs, bars and corners (Jurie & Triggs, 2005; Leung & Malik, 2001). Texton representation can be considered a form of data compression, but it is not the only way of compressing data (Leung & Malik, 2001).


A texture can be characterized by the spatial repetition of finite texture elements, or 2D (two dimensional) patterns, called texels in the image area occupied by the texture (Todorovic & Ahuja, 2009). A texel, also known as a texture element, can contain several pixels in a 2D array. Normally, texels in natural scenes are not identical and their spatial repetition is not strictly periodic. Instead, texels are only statistically similar to one another, and their placement along a surface is only statistically uniform. Texels are also not homogeneous-intensity regions, but may contain hierarchically embedded structures. Therefore, the study by Todorovic and Ahuja (2009) characterized an image texture by a probability density function governing the (natural) statistical variations of the following texel properties: (1) geometric and photometric, referred to as intrinsic properties (for example colour, area, shape); (2) structural; and (3) relative orientations and displacements of texels, referred to as placement.

Wavelet textures, Gabor textures, and Wiccest textures are dominantly used in video visual description (Ardizzone et al., 2000; Giang & Worring, 2008; Heller & Ghahramani, 2006; Zheng et al., 2006). The Gabor filter (or Gabor wavelet) is widely adopted to extract texture features from images in image retrieval, and has been shown to be very efficient (Ardizzone et al., 2000; Giang & Worring, 2008; Heller & Ghahramani, 2006). Image retrieval using Gabor features outperforms the use of pyramid-structured wavelet transform (PWT) features, tree-structured wavelet transform (TWT) features and multi-resolution simultaneous autoregressive model (MR-SAR) features (Ro, Kim, Kang, Manjunath, & Kim, 2001).

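As a rough illustration of Gabor-based texture description, the following Python/NumPy sketch builds the real part of a Gabor kernel and collects the mean and variance of the filter responses over a few orientations. The kernel size, wavelength, sigma, gamma and orientations are illustrative assumptions, not parameters taken from the cited studies.

import numpy as np
from scipy.signal import convolve2d

def gabor_kernel(size=21, wavelength=8.0, theta=0.0, sigma=4.0, gamma=0.5):
    """Real part of a Gabor kernel: a cosine carrier under a Gaussian envelope."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    # Rotate coordinates to the filter orientation theta (radians).
    x_t = x * np.cos(theta) + y * np.sin(theta)
    y_t = -x * np.sin(theta) + y * np.cos(theta)
    envelope = np.exp(-(x_t ** 2 + (gamma * y_t) ** 2) / (2 * sigma ** 2))
    carrier = np.cos(2 * np.pi * x_t / wavelength)
    return envelope * carrier

def gabor_texture_features(gray, orientations=(0.0, np.pi / 4, np.pi / 2, 3 * np.pi / 4)):
    """Mean and variance of the filter response at several orientations."""
    features = []
    for theta in orientations:
        response = convolve2d(gray, gabor_kernel(theta=theta), mode='same', boundary='symm')
        features.extend([response.mean(), response.var()])
    return np.array(features)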

Natural scenery can be effectively described using colour and spatial properties (Maron & Ratan, 1998). The QBIC and VIRAGE systems combine colour and structural properties with textures in their image query systems (IBM-Corporation, 2003; Maron & Ratan, 1998; Virage, 2009). Texture can be static or dynamic; Ardizzone et al. (2000) used textures evolving over time, called temporal textures. Flowing smoke or the wavy water of a river are good examples of temporal textures.

Some studies have used a mixture of texture features, such as Tamura and Gabor features and Wiccest features, which extend texture properties with colour invariance and natural image statistics (Heller & Ghahramani, 2006; Snoek et al., 2005). In many situations, texture features accompanied by colour features enlarge the feature space and the computational cost; for example, the Wiccest feature used in Snoek et al. (2005) contains 108 dimensions (Giang & Worring, 2008; Snoek et al., 2005). High dimensional texture features increase the computational cost of the distance measures used in classification.

3.4.3. Wavelets in Content Description

Images and video content are presented in a multi-resolution environment, and different areas and objects in visuals are represented at different scales and resolutions. In this context, describing visual regions using colour intensity requires multi-level resolution analysis, since consistent resolution within visual regions cannot be expected. Therefore, the wavelet transform is widely used in processing visuals with multiple resolution levels. The wavelet representation of an image provides information about variations at different scales. According to Sebe and Lew (2001), the Haar wavelet comparison method lags behind as a salient point extractor due to its non-overlapping wavelets at a given scale and the increase in computations related to spatial support.

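A minimal sketch of the kind of multi-resolution analysis described above is a single-level 2D Haar decomposition, repeated to form sub-band energy features. This is an unnormalised textbook variant written purely for illustration, assuming a greyscale frame held as a NumPy array; it is not the comparison method of Sebe and Lew (2001).

import numpy as np

def haar_level(gray):
    """One level of a 2D Haar-style decomposition: approximation plus three detail bands."""
    g = gray[:gray.shape[0] // 2 * 2, :gray.shape[1] // 2 * 2]  # crop to even size
    a = g[0::2, 0::2]
    b = g[0::2, 1::2]
    c = g[1::2, 0::2]
    d = g[1::2, 1::2]
    ll = (a + b + c + d) / 4.0   # approximation (coarser scale)
    lh = (a - b + c - d) / 4.0   # horizontal detail
    hl = (a + b - c - d) / 4.0   # vertical detail
    hh = (a - b - c + d) / 4.0   # diagonal detail
    return ll, lh, hl, hh

def wavelet_energy(gray, levels=3):
    """Energy of each detail band across a few levels as a simple texture feature."""
    feats = []
    band = np.asarray(gray, dtype=np.float64)
    for _ in range(levels):
        band, lh, hl, hh = haar_level(band)
        feats.extend([np.mean(lh ** 2), np.mean(hl ** 2), np.mean(hh ** 2)])
    return np.array(feats)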

3.4.4. Shape Descriptors

Shape is also used as a key attribute of segmented image regions, and its efficient and robust representation plays an important role in some retrieval systems. Shape similarity matching is one of the methods that can be used to reduce irrelevant colour features. Datta et al. (2005) describe performance problems in computing shape similarity matching with geometric invariance to transformation. In shape matching, comparison under dynamic geometric transformations is a required key feature. For clustering shapes in frames, different methods such as colour or spectral clustering, texture clustering and edge clustering are used. For characterizing shapes within images, reliable segmentation is critical; without it the shape estimates are largely meaningless, and the general problem of semantic visual concept segmentation in the context of human perception is far from solved (Datta et al., 2005). Region clustering is an important step for successful visual understanding with shape similarity matching. Issues plaguing current shape matching techniques include efficiency considerations, the reliability of segmentation, and the lack of a robust and accepted benchmark for assessing shape matching (Datta et al., 2005).

Most visual features are task dependent, particularly shape features. Using shape features to detect circles with constraints on their spatial relations in order to find cars may give satisfying results, but may completely fail when applied to searching for flowerbeds (Giang & Worring, 2008). Shape clustering that includes spectral, texture, and edge clustering can increase computational cost and reduce speed (Datta et al., 2005).

3.4.5. Salient and Local Features based Methods

In video and image retrieval, the use of interest parts of images allows an image and video index to represent local properties. Several methods have experimented with salient point extraction algorithms, extracting colour and texture information at the locations given by these salient points. These methods have shown improved results in terms of retrieval accuracy, computational complexity, and storage space of feature vectors compared to global feature approaches (Mikolajczyk, Leibe, & Schiele, 2005; Nicu Sebe & Lew, 2001).


These methods improve over global feature based retrieval by using feature vector based content retrieval that targets only the relevant points in the visual area. Different parts of a video frame carry different features; when global features are used, the video description is extracted by analysing the whole frame, which makes it difficult to distinguish relevant information from irrelevant information.

In the early period of salient feature methods, they were applied with corner detectors to obtain the corner points of images (Bay, Ess, Tuytelaars, & Gool, 2008; Mikolajczyk et al., 2005; Nicu Sebe & Lew, 2001). However, this approach failed because corner points, which may not be the most informative parts of an image, cannot provide a powerful description of the image. Only a small part of an image can be described by corners, as corners tend to be gathered in small areas. The Harris corner detector is one of the popular detectors used for identifying corners, as it has strong invariance to rotation, scale, illumination variation and image noise (Derpanis, 2004). Salient points are not always corners and may occur in other parts of an object. Sebe and Lew (2001) show that this method performs better for objects with consistent shapes than for irregular shapes, because corner points can describe regular shapes. In general, local feature methods are efficient for object-based retrieval (Datta et al., 2005).

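For illustration, the Harris corner response can be sketched in a few lines of Python using NumPy and SciPy: image gradients form a smoothed structure tensor, and the familiar response R = det(M) - k * trace(M)^2 is thresholded to obtain candidate corner points. The smoothing scale, the constant k and the threshold below are illustrative assumptions.

import numpy as np
from scipy.ndimage import gaussian_filter

def harris_response(gray, sigma=1.5, k=0.04):
    """Harris corner response per pixel for a 2D float image."""
    iy, ix = np.gradient(gray)
    # Smooth the structure tensor entries with a Gaussian window.
    ixx = gaussian_filter(ix * ix, sigma)
    iyy = gaussian_filter(iy * iy, sigma)
    ixy = gaussian_filter(ix * iy, sigma)
    det = ixx * iyy - ixy ** 2
    trace = ixx + iyy
    return det - k * trace ** 2

def corner_points(gray, threshold=0.01):
    """Return (row, col) locations whose response exceeds a fraction of the maximum."""
    r = harris_response(gray)
    return np.argwhere(r > threshold * r.max())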

However, in most cases salient points alone cannot be used for content extraction. Since the salient point method is used together with colour and texture features, it increases the comparison overheads and calculations. Edge and boundary based descriptors increase the information density, but for natural images the distribution of responses peaks around zero due to the lack of edges and corner points (Snoek et al., 2005). Studies in the area of Local Interest Points (LIP) were carried out to further explore local feature based methods (Y.-G. Jiang, Zhao, & Ngo, 2006). Local feature based descriptors such as the Scale Invariant Feature Transform (SIFT), with 128 dimensions, and the Gradient Location Orientation Histogram (GLOH) have high dimensionality in feature space, which has a major negative impact in terms of computational cost (Bay et al., 2008; Y.-G. Jiang et al., 2006; Mikolajczyk et al., 2005). When only high accuracy is considered, at the expense of other costs, local and regional features are used in combination with other features for better retrieval results in some studies (Snoek et al., 2005); this further increases the comparison overheads and calculations. Jiang et al. (2006) propose LIP-D (LIP Distribution) to reduce the high dimensionality of SIFT using PCA-SIFT (Principal Component Analysis-SIFT), with a reduction to 36 dimensions. LIP-D also works with the Difference of Gaussian (DoG) and the other SIFT keypoint extraction steps, so it inherits the feature extraction costs associated with SIFT in addition to the PCA-based dimensionality reduction cost (Ke & Sukthankar, 2004). The PCA-SIFT based LIP-D method therefore has the advantage of a reduced feature space in feature classification but increases the computational cost of feature extraction, which is a major disadvantage.

3.5. Issues in Video Colour Information

A video visual signal can be represented in three separate channels carrying spatial, temporal and chromatic information. Analysing the behaviour of all three signals is a complex task due to their native properties such as sampling rates, noise, continuity and quantization levels.

Extraction of visual concepts within complex backgrounds is still a challenging problem (Lew et al., 2006). The poor efficiency that results from computing high dimensional low level features and dealing with higher colour depth and resolution needs to be resolved (Zheng et al., 2006). This complicates the task of extracting the required features from the non-required information dispersed within complex video visuals.


When selecting syntactic visual features, it is important to consider the complexity, accuracy and overheads of describing the visuals (Hu, Dempere-Marco, & Davies, 2008). The colour histogram hides spatial information, which leads to false positives, and its comparison also uses higher dimensional data structures, resulting in higher computational cost and complexity (Suhasini, Sriramakrishna, & Muralikrishna, 2009). Due to the wide range of colour depth and the comparison of colour patches, the approach has to deal with wide fuzzy ranges. However, multi-level fuzzy selection criteria can reduce errors in semantic concept identification (He & Downton, 2004). If syntactic visual features are compact enough, concept identification can be performed quickly (Zheng et al., 2006). Although the extraction of high level semantic features is time consuming, it can be performed offline. The selection of appropriate syntactic features for content-based image retrieval and annotation systems remains largely ad hoc, apart from some exceptions, and the same holds for video frame understanding (Datta et al., 2005).

Video signals come as long sequences of frames, and each frame consists of a number of pixels (the resolution). Research studies commonly deal with the RGB (Red, Green, Blue) colour scheme with 24-bit colour depth, which can give very high uncertainty as each pixel can take one of millions of colours.

It is important to examine the uncertainty introduced by large colour depth together with high resolution. As defined in information communication theory, Shannon entropy is a measure of the uncertainty associated with a random variable (Li & Drew, 2004). It quantifies the information contained in a message, usually in bits or bits per symbol. If X represents a discrete random variable, such as the colour component value of a pixel, which can take possible values in {x_1, ..., x_n}, then the information entropy can be expressed as follows:

    \eta = H(X) = -\sum_{i=1}^{n} p_i \log_b p_i                (3.1)

where p_i is the probability that x_i in X occurs and b is the base of the logarithm in computerised information. -log_b p_i indicates the amount of self-information contained in x_i, or the level of uncertainty. The entropy measures the disorder of a system, whereby a high value of η shows high disorder and uncertainty.

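A minimal sketch of applying equation (3.1) to an 8-bit colour channel, assuming the frame is held as a NumPy array of unsigned integers, is shown below; using the base-2 logarithm gives the entropy in bits.

import numpy as np

def channel_entropy(channel, bits=8):
    """Shannon entropy (equation 3.1) of one 8-bit colour channel, in bits."""
    counts = np.bincount(channel.reshape(-1), minlength=2 ** bits)
    p = counts / counts.sum()
    p = p[p > 0]                      # log is undefined for zero-probability values
    return float(-(p * np.log2(p)).sum())

# Applying the measure to each chromatic channel of an RGB frame:
# entropies = [channel_entropy(frame[:, :, c]) for c in range(3)]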

Equation (3.1) can be applied to each of the three chromatic channels to compute uncertainty. Classifying the required visual concepts or semantic features of video frames is a complex process, as the frames have a high level of uncertainty due to their colour depth and spatial resolution. Therefore, it is important to bring the classifier process attributes and the video visual syntactic feature attributes to a uniform level, as this can reduce the fuzziness. Datta et al. (2005) report that compact colour representation is more efficient than high dimensional histograms in visual search and retrieval, and also suggest reducing the problems associated with earlier methods of dimension reduction and colour moment descriptors. Generally, a large number of training samples is needed because of the high dimensional feature space (Badjio & Poulet, 2005; Y. Gao & Fan, 2005; Mrowka et al., 2004). Although the effectiveness of visual data mining can be increased by combining different descriptors, such a combination produces high dimensionality in the resulting feature space, which drastically reduces the efficiency of storage and retrieval. Processing local and regional features with points and edges, described in section 3.4.5, also suffers from high dimensionality and increased calculation effort.

3.6. Dimension Reduction of Visual Features

Content-based concept detection applications use low-level features such as visual features, text-based features, audio features, motion features and other metadata to determine the semantic meaning of multimedia data (Lin, Ravitz, Shyu, & Chen, 2008). The development of content based multimedia retrieval applications must consider two important factors: effectiveness and efficiency (Lin et al., 2008; Mrowka et al., 2004).


To improve efficiency, visuals are associated with descriptors (feature vectors) that represent their content. Feature vector representation facilitates indexing, browsing, searching and retrieval of images from large-scale databases (Hopfgartner et al., 2009; Mrowka et al., 2004). Effectiveness is increased by combining different descriptors, for example colour and texture. However, such a combination (high dimensionality and multiple features) reduces the efficiency of storage and retrieval. This shortcoming is known as the curse of dimensionality (Badjio & Poulet, 2005; Y. Gao & Fan, 2005; Hopfgartner et al., 2009; Mrowka et al., 2004; Pedrycz, 2009).

As shown in Figure 1.3, there are two main stages in video visual description: feature extraction and classification. The efficiency of feature extraction depends on the algorithms used for it. The efficiency of feature classification depends on both the classification method (the classifier) and the nature of the feature (the feature vector). In fact, the performance of image classifiers largely depends on two inter-related issues: the quality of the features used for image content representation and the effectiveness of classifier training (Y. Gao & Fan, 2005). High dimensional features have negative effects in the classification stage due to poor efficiency. Using several descriptors improves the accuracy of the content representation but raises challenges such as non-linear combination, expensive computation and the curse of dimensionality (Mrowka et al., 2004). Despite the different natures of the underlying image content representation frameworks, most existing semantic image classification techniques rely on high-dimensional visual features (Y. Gao & Fan, 2005). Accurate training of an image classifier in a high dimensional feature space requires large numbers of labelled visuals, and a large training space also makes the classification process more complex. All these issues complicate classification. By reducing the visual feature space, the efficiency, accuracy, and comprehensibility of the classification algorithm can be improved (Lin et al., 2008).


Therefore, devising a low dimensional visual feature space with high discrimination power, and training the visual classifier on a relatively low dimensional feature subset, provide a promising solution to the problem of the curse of dimensionality (Y. Gao & Fan, 2005; Hopfgartner et al., 2009; Pedrycz, 2009).

Dimensionality reduction aims to reduce the number of attributes (features) of the data, transforming a high dimensional feature space into a low dimensional feature space (Pedrycz, 2009). This transformation makes the classification mechanism effective in terms of data reduction. Dimension reduction of low level visual features can be formulated as follows, following Mrowka et al. (2004).

Let I_M = {ω_1, ..., ω_N} be a set of images and F_N = {F_D^{(1)}, ..., F_D^{(n)}} be a set of functions used to extract feature descriptors from the given images. The feature vector or image descriptor

    \omega_i^{(j)} = F_D^{(j)}(\omega_i)                (3.2)

is obtained by applying the feature map F_D^{(j)} to image ω_i. Then, a feature space F_S is built using

    F_N : I_M \rightarrow F_S                (3.3)

where F_S consists of all the feature vectors extracted from I_M.

Let I_m = {ω_1, ..., ω_K} be a sample of input images and T_o = {t_1, ..., t_K} be a sample of output targets, where I_m ⊂ I_M, T_o ⊂ V_W, and t_k ∈ {0, 1}, k = 1, ..., K. V_W represents the set of output targets.

Given a function

    f(\omega) : I_m \rightarrow T_o                (3.4)

the problem is stated as: how to reduce the dimension of the feature vectors ω_i^{(j)} ∈ F_S while holding or improving the classification performance of f(ω).


Methods based on Principal Component Analysis (PCA) and statistical moments are used to deal with high dimensional spaces (Abadpour & Kasaei, 2008; Y.-G. Jiang et al., 2006; Ke & Sukthankar, 2004; Leclercq, Khoudour, Macaire, Postaire, & Flancquart, 2006; Lenz, 2002; Menser & Muller, 1999; Tran & Lenz, 2001). The study by Menser et al. (2001) applied Independent Component Analysis (ICA), which uses higher order statistics to reveal features (Menser, Lennon, Mercier, Mouchot, & Hubert-moy, 2001). Ke and Sukthankar (2004) reduced the dimensionality of the Scale Invariant Feature Transform (SIFT) descriptor using the PCA method. PCA-SIFT was further enhanced to describe visuals with 36 dimensions in LIP-D (Local Interest Point Distribution) (Y.-G. Jiang et al., 2006). Some researchers have proposed dimensionality reduction by selecting the most appropriate dimensions (Hopfgartner et al., 2009). MPEG-7 has considered feature space reduction for efficient video visual description (MPEG-7, 2004). The MPEG-7 Dominant Colour Descriptor (DCD) is one of the main descriptors with low dimensionality and computational cost (J. Jiang, Weng, & Li, 2006; Manjunath, Ohm, Vasudevan, & Yamada, 2001; Manjunath, Salembier, & Sikora, 2002; Shao, Wu, Cui, & Zhang, 2008; S. Wang, Chia, & Deepu, 2003). The DCD has been used experimentally in various visual depiction research projects. The study of Lin et al. (2008) gives a comparison of several feature space reduction techniques, including PCA, Information Gain (IG) and Correlation-based Feature Selection (CFS).

The study by Mrowka et al. (2004) suggests selecting suitable descriptors for generating the feature space according to the visual domain of the problem. Reducing the visual feature space without much negative effect on retrieval accuracy is indeed a challenge.


3.7. Low Dimensional Visual Features

Several low dimensional visual feature based approaches have been studied by researchers. The exploration of robust low dimensional visual features is attracting attention due to the negative effects of high dimensional visual features. The suite of activities that leads to a reduction of the feature space is called the dimensionality reduction process (Pedrycz, 2009).

3.7.1. MPEG-7 Dominant Colour Descriptor

The MPEG-7 Dominant Colour Descriptor (DCD) is one of the main descriptors with low dimensionality and computational cost (J. Jiang et al., 2006; Manjunath et al., 2001; MPEG-7, 2004; Shao et al., 2008). The DCD has been used experimentally in many visual depiction research projects (Mylonas, Spyrou, Avrithis, & Kollias, 2009; Spyrou, Tolias, Mylonas, & Avrithis, 2009).

The key target of the DCD development is to achieve fast and efficient visual depiction and retrieval (J. Jiang et al., 2006). The DCD uses the local spatial colour distribution at a global level. The purpose of this descriptor is to provide an effective, compact and intuitive description of the representative colours of an image or image region (J. Jiang et al., 2006; Manjunath et al., 2001). The descriptor captures the salient colours in visual regions in the pixel domain using a colour clustering process.

Generally, the CIE-LUV colour space is used for colour clustering. The DCD clusters a given visual region into a small number of representative colours. The feature descriptor consists of the representative colours, their percentages in the region, the spatial coherency of the dominant colours, and the colour variance of each dominant colour (J. Jiang et al., 2006; Manjunath et al., 2001).


Usually the number of representative colours is less than or equal to eight; this size of feature vector reduces the dimensionality of the feature space. The visual classification defined for this descriptor is similar to the quadratic colour histogram distance measure, while avoiding the high-dimensional indexing problems associated with the traditional colour histogram (Manjunath et al., 2001). The feature vector consists of the colour index (c_i), percentage (p_i), colour variance (v_i) and spatial coherency (s), where the last two parameters are optional (S. Wang et al., 2003). It is defined by

    DCD = \{(c_i, p_i, v_i), s\}, \quad i = 1, \ldots, N                (3.5)

where N is the number of colours and the percentages sum to one (\sum_{i=1}^{N} p_i = 1).

The spatial coherency is a single number that represents the overall spatial homogeneity of the dominant colours.

3.7.1.1. Dominant Colour Extraction

Dominant colour extraction uses the generalised Lloyd algorithm, popularly known as k-means clustering (Manjunath et al., 2001; Manjunath et al., 2002). The algorithm performs clustering by minimising the distortion D_i in each cluster i, and its steps iterate until no points move between clusters.

    D_i = \sum_{x(n) \in C_i} v(n) \left\| x(n) - c_i \right\|^2                (3.6)

where c_i is the centroid of cluster C_i, x(n) is the colour vector at a pixel, and v(n) is the perceptual weight for pixel n. The perceptual weights are calculated from local pixel statistics to account for the fact that human visual perception is more sensitive to changes in smooth regions than in textured regions. The effective use of perceptual weights was demonstrated in the study by Gabbouj, Birinci, and Kiranyaz (2009).

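The clustering step can be sketched as follows with plain (unweighted) k-means in Python/NumPy. The perceptual weights v(n) of equation (3.6) are omitted for simplicity, the colour space is left to the caller (the standard recommends CIE-LUV), and the number of colours, iteration count and initialisation are illustrative assumptions rather than the reference implementation.

import numpy as np

def dominant_colours(pixels, n_colours=8, iterations=20, seed=0):
    """Approximate DCD extraction with unweighted k-means.

    pixels: N x 3 array of colour vectors. Returns representative colours c_i
    and their percentages p_i.
    """
    pixels = np.asarray(pixels, dtype=np.float64)
    rng = np.random.default_rng(seed)
    centroids = pixels[rng.choice(len(pixels), n_colours, replace=False)]
    for _ in range(iterations):
        # Assign every pixel to its nearest centroid (the squared distortion of eq. 3.6).
        d = ((pixels[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        for i in range(n_colours):
            members = pixels[labels == i]
            if len(members):
                centroids[i] = members.mean(axis=0)
    percentages = np.bincount(labels, minlength=n_colours) / len(pixels)
    return centroids, percentages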

Some studies have used the following modification for the extraction of dominant colours:

    c_i = \frac{\sum_n v(n) \, x(n)}{\sum_n v(n)}, \quad x(n) \in C_i                (3.7)

A simple connected component analysis is performed to identify groups of pixels of the same dominant colour that are spatially connected (Manjunath et al., 2001). The normalized average number of connected pixels of the corresponding dominant colour is computed, and a 3 x 3 masking window is used to measure the spatial coherency of a given dominant colour. The overall spatial coherency is a linear combination of the individual spatial coherencies, with the corresponding percentages p_i as weights (Manjunath et al., 2001). The spatial coherency is quantized to 5 bits, where 31 means the highest confidence, 1 means no confidence, and 0 is used for cases where it is not computed. The importance of spatial homogeneity in visual matching is emphasized in a number of research studies (Morse, Thornton, Xia, & Uibel, 2007; Pass, Zabih, & Miller, 1996).

DCD based classification is described in the next section. The classification includes optional fields that take different forms of measure. The dimensionality of the DCD without any optional fields can go up to 24, and the inclusion of optional fields increases the dimensionality of the DCD feature vector.

3.7.1.2. Similarity Measure of the MPEG-7 Dominant Colour Descriptor

MPEG-7 gives directions for computing the similarity of two dominant colour descriptors. There is one overall spatial coherency value for the given image region and several groups of (c_i, p_i, v_i) for the corresponding dominant colours. These parameters can be used to compute the visual difference between images based on colour.


The calculation is based on a quadratic-type distance measure similar to the histogram distance (Manjunath et al., 2001; Yang, Chang, Kuo, & Li, 2008).

Consider two DCD feature vectors,

    F_1 = \{(c_{1i}, p_{1i}, v_{1i}), s_1\}, \; i = 1, 2, \ldots, N_1 \quad \text{and} \quad F_2 = \{(c_{2j}, p_{2j}, v_{2j}), s_2\}, \; j = 1, 2, \ldots, N_2

Setting aside the optional variance and spatial coherency, the dissimilarity D(F_1, F_2) of the two feature vectors is described as

    D^2(F_1, F_2) = \sum_{i=1}^{N_1} p_{1i}^2 + \sum_{j=1}^{N_2} p_{2j}^2 - \sum_{i=1}^{N_1} \sum_{j=1}^{N_2} 2 \, a_{1i,2j} \, p_{1i} \, p_{2j}                (3.8)

where a_{k,l} represents the similarity coefficient between two colours c_k and c_l,

    a_{k,l} = \begin{cases} 1 - d_{k,l}/d_{max} & d_{k,l} \le T_d \\ 0 & d_{k,l} > T_d \end{cases}                (3.9)

and where

    d_{k,l} = \left\| c_k - c_l \right\|                (3.10)

is the Euclidean distance between the two colours and T_d is the maximum distance for two colours to be considered similar.

The value d_max maintains the following equality to produce consistency in the colour separation of two dominant colours:

    d_{max} = \alpha T_d                (3.11)

A typical value specified for T_d is between 10 and 20 in the CIE-LUV colour space, and α is taken to be between 1.0 and 1.5.

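Equations (3.8) to (3.11) translate directly into code. The following Python/NumPy sketch computes the squared dissimilarity between two dominant colour sets without the optional fields; the values of T_d and alpha are illustrative choices from the ranges quoted above.

import numpy as np

def dcd_distance(c1, p1, c2, p2, t_d=15.0, alpha=1.2):
    """Squared dissimilarity of equation (3.8) between two dominant colour sets.

    c1: N1 x 3 array of colours, p1: N1 percentages (likewise c2, p2).
    """
    d_max = alpha * t_d
    # Pairwise Euclidean distances between colours of the two descriptors (eq. 3.10).
    d = np.sqrt(((c1[:, None, :] - c2[None, :, :]) ** 2).sum(axis=2))
    a = np.where(d <= t_d, 1.0 - d / d_max, 0.0)      # similarity coefficients, eq. (3.9)
    return float((p1 ** 2).sum() + (p2 ** 2).sum()
                 - 2.0 * (a * np.outer(p1, p2)).sum())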

The above dissimilarity can be further modified with a linear combination of the spatial coherency. The dissimilarity including spatial coherency, D_s, is then defined by

    D_s = w_1 (s_1 - s_2) \cdot D(F_1, F_2) + w_2 \cdot D(F_1, F_2)                (3.12)

where s_1 and s_2 are the spatial coherency values of the two descriptors. According to MPEG-7, w_1 and w_2 are fixed weights, with recommended settings of 0.3 and 0.7 respectively (Manjunath et al., 2002).

For both equation (3.8) and equation (3.12), a smaller distance means the visuals are considered more similar. However, this dissimilarity measure has to be evaluated for every match of a queried feature vector against the trained feature vectors; the calculation cost is therefore the same whether similar or dissimilar visual regions are being found. As most of the technical specification is given for the CIE-LUV colour space, a colour space conversion is needed at the beginning, which is another task included in dominant colour feature extraction. The optional spatial coherency value indicates the level of homogeneity of the colour distribution of all extracted dominant colours within a given region; however, it does not capture micro-level colour pattern arrangements or texture properties. When the optional colour variance field is used, the dissimilarity measure treats the colour distribution as a mixture of Gaussians, which makes the dissimilarity calculation more complex and adds overhead (Manjunath et al., 2001; Yang et al., 2008).

3.7.2. Colour Descriptor Based on Principal Component Analysis

Principal Component Analysis (PCA) is a standard technique for dimensionality reduction (Ke & Sukthankar, 2004; L. I. Smith, 2002) and is an example of a linear feature transformation (Pedrycz, 2009). It has been applied to a broad class of computer vision problems, including feature selection, object recognition, watermarking, image compression, face recognition, and colour image retrieval (Abadpour & Kasaei, 2008; Ke & Sukthankar, 2004; Leclercq et al., 2006; Lenz, 2002; Menser & Muller, 1999; Tran & Lenz, 2001).


PCA is able to linearly project high-dimensional samples onto a low-dimensional feature space. It has been shown that PCA leads to appropriate descriptors for natural colour images (Abadpour & Kasaei, 2008). These descriptors work in vector domains and take into consideration the statistical dependence of the components of colour images depicting nature, which consist of visuals with irregular shapes.

According to the studies of Leclercq et al. (2006) and Tran and Lenz (2001), the three dimensional histogram representation of a video frame area can be transformed to a low dimensional vector space using PCA. The three dimensional histogram of a selected image region produces an N_P x 3 matrix, where N_P is the number of pixels in the selected region.

With D dimensions, the covariance matrix takes values of the form

    \mathrm{Cov}(D_w, D_v)                (3.13)

where w and v represent dimensional indexes.

The final covariance matrix ξ holds D x D elements of the form given in equation (3.13), and ξ = ξ^T. The Karhunen-Loeve Transform (KLT) basis can be computed by solving the eigenvalue problem

    M_D = P^T \xi P                (3.14)

where P holds the eigenvectors of ξ and M_D is the diagonal matrix of the eigenvalues. In fact, the eigenvector with the highest eigenvalue is the principal component of the dataset. Thus, the principal eigenvectors can be used to describe visual concepts in a reduced feature space. The highest eigenvalues can be identified by examining the diagonal of M_D, and the corresponding eigenvectors can then be found from P. The eigenvectors corresponding to the highest eigenvalues are selected as the principal components.

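A compact sketch of this procedure in Python/NumPy is given below: the pixel colour matrix is centred, its covariance matrix (equation 3.13) is formed, and the eigenvectors with the largest eigenvalues (equation 3.14) are kept as principal components. This is a generic PCA illustration rather than the exact procedure of the cited studies.

import numpy as np

def principal_colour_components(pixels, n_components=1):
    """PCA of an N_P x 3 matrix of pixel colour values.

    Returns the top eigenvectors ordered by decreasing eigenvalue, plus the
    data projected onto them.
    """
    x = np.asarray(pixels, dtype=np.float64)
    x = x - x.mean(axis=0)                 # centre each colour dimension
    cov = np.cov(x, rowvar=False)          # 3 x 3 covariance matrix, eq. (3.13)
    eigvals, eigvecs = np.linalg.eigh(cov) # eigh: the covariance matrix is symmetric
    order = np.argsort(eigvals)[::-1]      # largest eigenvalue first
    components = eigvecs[:, order[:n_components]]
    return components, x @ components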

When the PCA approach is used for three dimensional histogram descriptions, the dimensionality can be reduced to three by considering only the main principal component. A similarity measure can then be performed using a feature classifier. The feature space can thus be reduced to very low dimensions using PCA; the main disadvantage is the computational cost of calculating the covariance matrices, eigenvalues and eigenvectors needed to obtain the principal components.

3.7.3. Colour Moment Descriptor

The Colour Moment (CM) descriptor is capable of reducing the dimensionality of the visual feature space and is used for image and video retrieval with similarity measures. Generally, this descriptor is defined by the mean, standard deviation and skewness of each colour channel (Y.-G. Jiang et al., 2006; Maheshwary & Srivastava, 2009). In Y.-G. Jiang et al. (2006), colour moments are computed for each grid cell in the Lab colour space. The CM descriptor in Maheshwary and Srivastava (2009) defines the feature extraction as follows.

Let p_ij be the value of the i-th colour channel at the j-th image pixel. The three colour moments for each channel are defined as:

    E_i = \frac{1}{N} \sum_{j=1}^{N} p_{ij}                (3.15)

where the mean E_i is the average value of the i-th colour channel in the image;

    \sigma_i = \left( \frac{1}{N} \sum_{j=1}^{N} (p_{ij} - E_i)^2 \right)^{1/2}                (3.16)

the standard deviation; and

    s_i = \left( \frac{1}{N} \sum_{j=1}^{N} (p_{ij} - E_i)^3 \right)^{1/3}                (3.17)

the skewness, which measures the asymmetry of the distribution. Equations (3.15) to (3.17) create a nine dimensional feature vector over the three colour channels. By including spatial information with the nine parameters, as in Y.-G. Jiang et al. (2006), invariance properties can be added; the dimension then goes up to 27 (9 x 3). The similarity measure can be computed using the Euclidean distance as in section 3.8.1.

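The nine-dimensional colour moment vector of equations (3.15) to (3.17) can be computed as in the following Python/NumPy sketch, which assumes the image is an H x W x 3 float array; a signed cube root is used for the skewness term.

import numpy as np

def colour_moments(image):
    """Nine-dimensional colour moment vector: E_i, sigma_i, s_i for each channel."""
    pixels = np.asarray(image, dtype=np.float64).reshape(-1, 3)
    mean = pixels.mean(axis=0)                              # eq. (3.15)
    std = np.sqrt(((pixels - mean) ** 2).mean(axis=0))      # eq. (3.16)
    third = ((pixels - mean) ** 3).mean(axis=0)
    skew = np.cbrt(third)                                   # eq. (3.17); cbrt keeps the sign
    return np.concatenate([mean, std, skew])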

3.7.4. Dimension Reduction Strategies of Local Features

Lowe (2004) presented SIFT for extracting distinctive features that are invariant to scale and rotation. This descriptor is widely used in image mosaicing, recognition, and retrieval (Bay et al., 2008; Y.-G. Jiang et al., 2006; Ke & Sukthankar, 2004; Mikolajczyk et al., 2005). However, the SIFT descriptor holds 128 dimensions, and its variants that include colour (for example C-SIFT, HSV-SIFT and RGB-SIFT) expand the dimensionality to 384 dimensions.

The cost of feature extraction with SIFT is minimized by taking a cascade filtering approach, in which expensive operations are applied only at locations that pass an initial keypoint test. The four major stages of computation used to generate a set of image features are as follows (Lowe, 2004):

Scale-space extrema detection: The first stage of computation searches over all scales and image locations. It is implemented efficiently by using a Difference-of-Gaussian (DoG) function to identify potential interest points that are invariant to scale and orientation. The scale space is defined as a function L(x, y, σ), produced from the convolution of a variable-scale Gaussian G(x, y, σ) with an input image I(x, y):

    L(x, y, \sigma) = G(x, y, \sigma) * I(x, y)                (3.18)

where * is the convolution operation in x and y, and

    G(x, y, \sigma) = \frac{1}{2\pi\sigma^2} \, e^{-(x^2 + y^2)/2\sigma^2}                (3.19)


The DoG function convolved with the image, D(x, y, σ), is used to efficiently detect stable keypoint locations in scale space. It can be computed from the difference of two nearby scales separated by a constant multiplicative factor k:

    D(x, y, \sigma) = (G(x, y, k\sigma) - G(x, y, \sigma)) * I(x, y) = L(x, y, k\sigma) - L(x, y, \sigma)                (3.20)

Keypoint localization: At each candidate location, a detailed model is fitted to determine location and scale. Keypoints are selected based on measures of their stability, using the first and second derivatives of the DoG.

Orientation assignment: One or more orientations are assigned to each keypoint location based on local image gradient directions. All subsequent operations are performed on image data that has been transformed relative to the assigned orientation, scale, and location of each feature, providing invariance to these transformations.

Keypoint descriptor: The local image gradients are measured at the selected scale in the region around each keypoint. These are transformed into a representation that allows for significant levels of local shape distortion and change in illumination.

The standard keypoint descriptor used by SIFT is created by sampling the magnitudes and orientations of the image gradient in the patch around the keypoint, and building smoothed orientation histograms to capture the important aspects of the patch. The assigned orientation(s), scale and location of each keypoint enable SIFT to construct a canonical view of the keypoint that is invariant to similarity transforms (Ke & Sukthankar, 2004). A 4 x 4 array of histograms, each with 8 orientation bins, captures the rough spatial structure of the patch. SIFT has major weaknesses in its high dimensionality and computationally expensive stages.

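The scale-space construction of equations (3.18) to (3.20) can be sketched as below using Gaussian filtering from SciPy. The base scale sigma, the factor k and the number of scales are illustrative values; keypoints would then be found as local extrema of the resulting DoG stack, a step omitted here.

import numpy as np
from scipy.ndimage import gaussian_filter

def difference_of_gaussian(gray, sigma=1.6, k=np.sqrt(2.0), n_scales=4):
    """Stack of DoG images D(x, y, sigma) as in equation (3.20).

    Each layer is the difference of two Gaussian-blurred copies whose scales
    differ by the constant multiplicative factor k.
    """
    blurred = [gaussian_filter(gray.astype(np.float64), sigma * (k ** i))
               for i in range(n_scales + 1)]
    return np.stack([blurred[i + 1] - blurred[i] for i in range(n_scales)])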

Another drawback occurs when colour information is included with SIFT, as it further expands the dimensionality. Most colour scenery with irregular shapes, such as sky, water and fire, gives a DoG response peaked around zero, and keypoint detection is problematic because the distribution of edge responses for natural images always peaks around zero, that is, many pixels have no edge response (Snoek et al., 2005).

3.7.5. PCA-SIFT

The PCA-SIFT representation is theoretically simpler, more compact, and faster than the standard SIFT descriptor in terms of visual description. The PCA-SIFT technique was proposed on a local descriptor called the normalized gradient patch. The experimental results in Ke and Sukthankar (2004) showed that PCA reduces the dimensionality, which decreases complexity and increases matching accuracy (Watcharapinchai, Aramvith, Siddhichai, & Marukatat, 2008).

PCA-SIFT uses the original SIFT source code and restricts changes to the fourth stage (Ke & Sukthankar, 2004). PCA-SIFT extracts a 41 x 41 pixel patch at the given scale, centred over the keypoint and rotated to align its dominant orientation to a canonical direction; for keypoints with multiple dominant orientations, a representation is built for each orientation, in the same manner as SIFT. PCA-SIFT can be summarized as follows:

(1) Pre-compute an eigenspace to express the gradient images of local patches;

(2) Given a patch, compute its local image gradient;

(3) Project the gradient image vector using the eigenspace to derive a compact feature vector.


This feature vector is significantly smaller than the standard SIFT feature vector and can be used with the same matching algorithms. The Euclidean distance between two feature vectors is used to determine whether the two vectors correspond to the same keypoint in different images. The study in Ke and Sukthankar (2004) reports that a 20-dimensional PCA-SIFT gives better performance. Since the standard SIFT representation employs 128-element vectors, using PCA-SIFT results in significant space benefits. However, PCA-SIFT feature generation uses the first three stages of SIFT feature extraction, so it inherits the major computational costs of SIFT plus the PCA-based projection. The computational expense of PCA-SIFT feature extraction is therefore higher than that of SIFT.

3.7.6. Local Interest Point Distribution (LIP-D)

An enhancement of PCA-SIFT for better performance was made in the study of Y.-G. Jiang et al. (2006). The detector is invariant to scale and can tolerate a certain amount of affine transformation. Each Local Interest Point (LIP) is characterized by a 36-dimensional PCA-SIFT feature descriptor, which has been demonstrated to be distinctive and robust to colour, geometric and photometric changes, similar to the PCA-SIFT results recorded in Ke and Sukthankar (2004) and quoted in Y.-G. Jiang et al. (2006). Generally, the number of LIPs in a keyframe can range from several hundred to a few thousand, which prohibits efficient matching of LIPs with PCA-SIFT across large numbers of keyframes. Y.-G. Jiang et al. (2006) therefore generated a visual dictionary by offline quantization of the LIPs. Each keyframe is described as a vector of visual keywords, which facilitates direct keyframe comparison without point-to-point LIP matching. The local distribution of LIPs, on the other hand, is represented as shape-like features in multi-resolution grids. The features are then embedded in a space where distance is evaluated with the embedded Earth Mover's Distance (e-EMD) measure.


Figure 3.2: Detection and locating of LIPs in the study of Jiang et al. (2006). (Source: Y.-G. Jiang et al., 2006)

Figure 3.2 illustrates the keypoint detection and extraction stages as reported in the study of Y.-G. Jiang et al. (2006). Further visualisation of these feature extraction stages can be found in Figure 3.3.

Figure 3.3: LIP distribution in a keyframe at different resolutions in the study of Jiang et al. (2006). (Source: Y.-G. Jiang et al., 2006)

Figure 3.3 shows the distribution of LIPs in a multi-resolution grid representation. The size of the grid cells varies at different resolutions, and thus the granularity of the shape information formed by the LIP distribution changes according to the scale being considered. The first three colour moments of the grid cells describe the shape-like information of the LIPs across resolutions. Each grid cell is viewed as a point characterized by its moments and weighted according to its level of resolution. With this representation, a keyframe is essentially treated as a bag of grid points. The similarity between keyframes is based on the matching of grid points within and across resolutions, depending on their feature distances and transmitted weights, which can be evaluated with the Earth Mover's Distance (EMD).


However, the EMD is expensive and has an exponential worst case in the number of points (Y.-G. Jiang et al., 2006). Further, LIP generation inherits most of the computational cost of the PCA-SIFT calculation.

LIP-D (LIP Distribution), with a 36 dimensional feature vector, generates a bag of grid points and uses a Support Vector Machine (SVM) classifier for visual classification.

3.8. Video Visual Classifiers and Classification

Chapter 1 emphasises the importance of the visual classifier in video visual content description. Different classifiers are used by researchers for visual classification, depending on the type of visual feature, and the advantages of feature space reduction for the classification process have been discussed above. A visual classifier performs a mapping from a visual feature space to a discrete set of labels, that is, a mapping from low level syntactic visual features to semantic visual concepts. Classifiers may be either fixed or learning classifiers, and learning classifiers may in turn be divided into supervised and unsupervised learning classifiers. The accuracy and effectiveness of visual description depend on both the selected visual feature and the classification method. Selection of a suitable classification method for a given visual syntactic feature can be done using a performance evaluation, as discussed in our publication of this research (Ranathunga et al., 2009); such a selection may be effective for a certain domain of visual concepts based on the type of visuals, the syntactic feature and the classifier.

Among video visual classifiers, Support Vector Machine (SVM), Euclidean distance and k-NN based classification techniques are the most dominant (S. Gao et al., 2007; Hsu, Chang, & Lin, 2003; Lew et al., 2006; Lin et al., 2008; Lu, Zhao, & Zhang, 2008; Mittal, 2006; Smeulders & Kielmann, 2006; Snoek et al., 2005).
3.8. Video Visual Classifiers and Classification

Chapter 1 emphasises the importance of the visual classifier in video visual content description. Different classifiers have been used for visual classification depending on the type of visual feature, and the advantages of reducing the feature space before classification have also been discussed. A visual classifier performs a mapping from a visual feature space to a discrete set of labels, that is, from low level syntactic visual features to semantic visual concepts. Classifiers may be either fixed or learning classifiers, and learning classifiers may in turn be divided into supervised and unsupervised approaches. The accuracy and effectiveness of the visual description depends on both the selected visual feature and the classification method. A suitable classification method for a given visual syntactic feature can be selected through a performance evaluation, as discussed in our publication from this research (Ranathunga et al., 2009); such a choice may be effective only within a certain domain of visual concepts, depending on the type of visuals, the syntactic feature and the classifier.

Among video visual classifiers, Support Vector Machines (SVMs), Euclidean distance and k-NN based classification techniques are the most dominant (S. Gao et al., 2007; Hsu, Chang, & Lin, 2003; Lew et al., 2006; Lin et al., 2008; Lu, Zhao, & Zhang, 2008; Mittal, 2006; Smeulders & Kielmann, 2006; Snoek et al., 2005).
Some descriptors are accompanied by their own classification method, such as the MPEG-7 DCD (Manjunath et al., 2001; Manjunath et al., 2002). A supervised classification task usually involves training and testing data consisting of a number of data instances. Each instance in the training set contains one target value (the class label) and several attributes (features) (Hsu et al., 2003). The goal of a classifier is to produce a model that predicts the target values of the instances in the testing set from the given attributes. Different classifiers use different properties and computational techniques in the classification process, and the efficiency and effectiveness of classification depend on both the classification technique and the nature of the low level syntactic features produced by the feature extractor.

3.8.1. Euclidean Distance Measure

Euclidean distance is a popular and traditional similarity measure used for visual classification (Datta et al., 2005; Lew et al., 2006; J. Liu et al., 2007; Lu et al., 2008; Mittal, 2006; Nguyen & Gillespie, 2003). It is used with a variety of low level visual features such as histograms, principal components and colour moments. The basic form of this distance measure gives the variation between two feature vectors. Let X = (X_1, X_2, ..., X_n) and Y = (Y_1, Y_2, ..., Y_n) be two feature vectors sharing the same feature space with n dimensions. The Euclidean distance d_{xy} is then defined as

d_{xy} = \sqrt{ \sum_{i=1}^{n} (X_i - Y_i)^2 }    (3.21)

Classification is done by identifying the minimum distance, with or without a threshold value. The major disadvantage of this similarity measure is its high computational cost, which arises from the pairwise calculation in equation (3.21).
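A minimal sketch of minimum-distance classification using equation (3.21) is given below; the labelled training vectors, the optional rejection threshold and all names are illustrative assumptions, not a specific system from the cited studies.

```python
import numpy as np

def euclidean_distance(x, y):
    """Equation (3.21): square root of the summed squared differences."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return np.sqrt(np.sum((x - y) ** 2))

def classify_min_distance(query, train_vectors, train_labels, threshold=None):
    """Assign the label of the closest training vector.

    If a threshold is given and even the closest vector is farther than it,
    the query is rejected (None is returned).
    """
    distances = [euclidean_distance(query, v) for v in train_vectors]
    best = int(np.argmin(distances))
    if threshold is not None and distances[best] > threshold:
        return None
    return train_labels[best]
```

The pairwise distance computation against every training vector is exactly the cost referred to above: classifying one query requires one distance evaluation per training sample.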


3.8.2. k-Nearest Neighbour Classification

k-Nearest Neighbour (k-NN) classification assigns the target label by a simple majority vote among the most similar labelled samples, and its application is always accompanied by a similarity measure (Dementhon & Doermann, 2006; Lu et al., 2008). The identification of the nearest neighbours involves the following steps:

a. Determine the parameter k, the number of nearest neighbours to be considered
b. Calculate the similarity between the feature vector of the query instance and all the labelled feature vectors in the training samples
c. Sort the similarities and determine the nearest neighbours based on the k-th closest similarity
d. Gather the category labels of the k nearest neighbours
e. Use a simple majority of the label categories among the k nearest neighbours as the target label of the query instance

In this method, a queried visual is assigned the most common category among its k nearest neighbours, as sketched below. The performance of the k-NN algorithm is influenced by three main factors: the similarity metric used to locate the nearest neighbours (for example, Euclidean distance), the decision rule used to derive a classification from the k nearest neighbours (the threshold), and the number of neighbours used to classify the new sample.
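The sketch below follows steps (a) to (e) directly, reusing the Euclidean distance of Section 3.8.1 as the similarity metric; the choice of k and the simple majority vote without explicit tie-breaking are assumptions for illustration.

```python
import numpy as np
from collections import Counter

def knn_classify(query, train_vectors, train_labels, k=5):
    """k-NN: majority label among the k closest training vectors."""
    # (a) k is given as a parameter
    # (b) similarity between the query and every labelled training vector
    distances = np.array([np.linalg.norm(np.asarray(query, dtype=float) -
                                         np.asarray(v, dtype=float))
                          for v in train_vectors])
    # (c) sort and keep the k nearest neighbours
    nearest = np.argsort(distances)[:k]
    # (d) gather their category labels
    labels = [train_labels[i] for i in nearest]
    # (e) simple majority vote decides the target label of the query
    return Counter(labels).most_common(1)[0][0]
```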
3.8.3. Support Vector Machine (SVM)

The marked growth in their usage shows the dominance of Support Vector Machine (SVM) based classification techniques in video content analysis (S. Gao et al., 2007; Hsu et al., 2003; Lew et al., 2006; Lin et al., 2008; Mittal, 2006; Smeulders & Kielmann, 2006; Snoek et al., 2005).
The goal of an SVM is to produce a model that predicts the target values of the data instances in the testing set from the feature attributes alone. SVMs are feed-forward networks that can be used for pattern classification and nonlinear regression (Hsu et al., 2003). The main idea behind the SVM is to construct a hyperplane that acts as a decision surface in such a way that the margin of separation between positive and negative examples is maximised. This hyperplane is constructed not in the input space, where the problem may not be linearly separable, but in a higher dimensional feature space into which the problem is mapped; the result is generally referred to as the optimal hyperplane, and this is the property the support vector machine exploits in implementing the method of structural risk minimisation (Hsu et al., 2003). A combination of SVMs for visual classification and 3D histograms for visual description has been used for the detection of semantic concepts (S. Gao et al., 2007). The SVM does not incorporate domain-specific knowledge, yet it provides good generalisation performance, which is a distinctive property among the various types of neural networks.

According to the description in Hsu et al. (2003), SVM classification is defined as follows. Given a training set of instance-label pairs (x_i, y_i), i = 1, ..., l, where x_i ∈ R^n and y_i ∈ {1, -1}, the SVM requires the solution of the following optimisation problem:

\min_{W, b, \xi} \; \frac{1}{2} W^T W + C \sum_{i=1}^{l} \xi_i
\text{subject to} \; y_i ( W^T \phi(x_i) + b ) \ge 1 - \xi_i, \quad \xi_i \ge 0    (3.22)

The training vectors x_i are mapped into a higher dimensional space by the function φ, and the SVM finds a linear separating hyperplane with the maximal margin in that space. W is the weight vector and b is the bias term in the constraints. C > 0 is the penalty parameter of the error term, and the slack variables ξ_i handle non-separable data.
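For completeness, the decision rule that follows from this formulation (a standard result, not restated in the excerpt above) classifies a new sample x by the sign of the decision function, which in the dual form depends only on the kernel function introduced next:

f(x) = \operatorname{sign}\big( W^T \phi(x) + b \big) = \operatorname{sign}\Big( \sum_{i=1}^{l} \alpha_i \, y_i \, K(x_i, x) + b \Big)

where the α_i are the Lagrange multipliers obtained from the dual of problem (3.22).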


Furthermore, K(x_i, x_j) ≡ φ(x_i)^T φ(x_j) is called the kernel function, and the linear kernel is defined as K(x_i, x_j) ≡ x_i^T x_j.
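A minimal sketch of SVM training and prediction with the linear kernel is shown below, using scikit-learn's SVC (a wrapper around LIBSVM, the library associated with Hsu et al., 2003); the toy data, the value of C and the use of scikit-learn are assumptions for illustration, not part of the cited studies.

```python
import numpy as np
from sklearn.svm import SVC

# Toy two-class problem: rows are low level feature vectors (e.g. colour
# moments); labels follow the {1, -1} convention of equation (3.22).
X_train = np.array([[0.1, 0.2], [0.2, 0.1], [0.9, 0.8], [0.8, 0.9]])
y_train = np.array([1, 1, -1, -1])

# Linear kernel K(x_i, x_j) = x_i^T x_j; C is the penalty parameter of the
# error term in equation (3.22).
clf = SVC(kernel='linear', C=1.0)
clf.fit(X_train, y_train)

# A new sample is classified by the sign of the decision function
# W^T phi(x) + b; predict() returns the corresponding class label.
X_test = np.array([[0.15, 0.15], [0.85, 0.85]])
print(clf.predict(X_test))            # expected: [ 1 -1]
print(clf.decision_function(X_test))  # signed distances to the hyperplane
```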
3.9. Summary

This chapter covers popular visual feature extraction and representation approaches, with reference to high dimensional visual feature approaches, 3D colour histogram features, texture features, shape features, and local and keypoint features, together with their strengths and weaknesses. The dimensional reduction of the visual feature space is emphasised through several methods, including MPEG-7 DCD, PCA-based methods and Colour Moments based methods. Finally, visual classification methods and the relation between visual features and classifiers are described.