
AN INNOVATIVE ALGORITHM FOR KEY FRAME EXTRACTION IN VIDEO SUMMARIZATION

G. Ciocca¹, R. Schettini
Dipartimento di Informatica Sistemistica e Comunicazione (DISCo)
Università degli studi di Milano-Bicocca
Via Bicocca degli Arcimboldi 8, 20126 Milano, Italy

Abstract

Video summarization, aimed at reducing the amount of data that must be examined in order to retrieve the desired information from a video, is an essential task in video analysis and indexing applications. We propose an innovative approach to the selection of representative (key) frames of a video sequence for video summarization. By analyzing the differences between two consecutive frames of a video sequence, the algorithm determines the complexity of the sequence in terms of changes in the visual content expressed by different frame descriptors. The algorithm, which escapes the complexity of existing methods based, for example, on clustering or optimization strategies, dynamically and rapidly selects a variable number of key frames within each sequence. The key frames are extracted by detecting curvature points within the curve of the cumulative frame differences. Another advantage is that it can extract the key frames on the fly: curvature points can be determined while computing the frame differences, and the key frames can be extracted as soon as a second high curvature point has been detected. We compare the performance of this algorithm with that of other key frame extraction algorithms based on different approaches. The summaries obtained have been objectively evaluated by three quality measures: the Fidelity measure, the Shot Reconstruction Degree measure and the Compression Ratio measure.

Keywords. Video summarization, summary evaluation, dynamic key frame extraction, frame content description.

1. Introduction

The growing interest of consumers in the acquisition of and access to visual information has created a demand for new technologies to represent, model, index and retrieve multimedia data.
Very large databases of images and videos require efficient algorithms that enable fast browsing and access to the information pursued [1]. In the case of videos, in particular, much of the visual data offered is simply redundant, and we must find a way to retain only the information strictly needed for functional browsing and querying.

Video summarization, aimed at reducing the amount of data that must be examined in order to retrieve a particular piece of information in a video, is, consequently, an essential task in applications of video analysis and indexing [2], as can be seen in Fig. 1. Generally, a video summary is a sequence of still or moving images, with or without audio. These images must preserve the overall contents of the video with a minimum of data. Still images chronologically arranged form a pictorial summary that can be assumed to be the equivalent of a video storyboard. Summarization utilizing moving images (and at times a corresponding audio abstract) is called video skimming; the product is similar to a video trailer or clip. Both approaches must present a summary of the important events recorded in the video. We focus our attention here on the creation of a visual summary using still images, called key frames, extracted from the video. Although video skimming conveys pictorial, motion and, where used, audio information, still images can summarize the video content in a more rapid and compact way: users can grasp the overall content more quickly from key frames than by watching a set of video sequences (even when brief). Key-frame based video representation views video abstraction as a problem of mapping an entire segment (both static and moving contents) to some small number of representative images. The extraction of key frames must be automatic and content based, so that they maintain the salient content of the video while avoiding all redundancy. In theory, semantic primitives of the video, such as relevant objects (people or object identities), actions and events (a meeting, a convention, ...) should be used. However, since this kind of specific semantic analysis is not currently feasible, key frame extraction based on low-level video features (mainly visual features) is used instead [1]. Besides providing video browsing capability and content description, key frames act as video "bookmarks" that designate interesting captured events, supplying direct access to video sub-sequences. Key frames, which visually represent the video content, can also be used in the indexing process, where we can apply the same indexing and retrieval strategies developed for image retrieval to retrieve video sequences.

¹ Corresponding author. E-mail addresses: ciocca@disco.unimib.it; schettini@disco.unimib.it. Fax: +39 02 6448 7839.
Low level visual features can be used in indexing the key frames, and thus the video sequences to which they belong. The use of low level features in the indexing process should not be considered a less powerful strategy; high level features with different levels of semantics can be derived from them. For example, high level information can be derived from low level visual features by using knowledge-based techniques inherited from the domains of artificial intelligence and pattern recognition. Examples of this can be found in Antani et al. [3], where different pattern recognition strategies are discussed, and in Schettini et al. [4], where classification strategies are used to annotate the global content of the images in terms of high level concepts (close-up, indoor, outdoor). A similar strategy can be adopted to automatically identify semantic regions within the images themselves [5]. If the audio track is available, transcripts of the audio can also be used [6].

[Figure: video raw data undergoes video analysis (feature extraction, structure analysis, abstraction) and encoding, producing a video summary, video streams and metadata for indexing and storage, which the user accesses through browsing and retrieval.]
Fig. 1. General application of the video analysis and indexing tasks.


Section 2 of this paper presents several approaches to key frame extraction described in the literature. Section 3 introduces the three quality measures used here to evaluate the summaries. Section 4 describes the proposed key frame extraction algorithm. The results of the comparison of our approach with other key frame selection algorithms are shown in Sections 5 and 6. Section 7 briefly illustrates an application of our algorithm.

2. Related Work

Different methods can be used to select key frames. In general these methods assume that the video has already been segmented into shots (a continuous sequence of frames taken over a short period of time) by a shot detection algorithm, and extract the key frames from within each shot. One possible approach to key frame selection is to take the first frame in the shot as the key frame [7]. Ueda et al. [8] and Rui et al. [9] use the first and last frames of each shot. Other approaches time-sample the shots at predefined intervals, as in Pentland et al. [10], where the key frames are taken from a set location within the shot, or, in an alternative approach, where the video is time-sampled regardless of shot boundaries [9]. These approaches do not consider the dynamics of the visual content of the shot, but rely on the information regarding the sequence's boundaries. They often extract a fixed number of key frames per shot. In Zhonghua et al. [11] only one key frame is extracted from each shot: the frames are segmented into objects and background, and the frame with the maximum ratio of objects to background is chosen as the key frame of the segment, since it is assumed to convey the most information about the shot. Other approaches try to group the frames (in each shot, or in the whole video) into visually similar clusters. In Arman et al. [12] the shot is compacted into a small number of frames, grouping consecutive frames together. Zhuang et al. [13] group the frames in clusters, and the key frames are selected from the largest clusters. In Girgensohn et al.
[14] constraints on the position of the key frames in time are also used in the clustering process; a hierarchical clustering reduction is performed, obtaining summaries at different levels of abstraction. In Gong et al. [15] the videos are summarized with a clustering algorithm based on Singular Value Decomposition (SVD). The video frames are time-sampled and visual features are computed from them. The refined feature space obtained by the SVD is clustered, and a key frame is extracted from each cluster.

In order to take into account the visual dynamics of the frames within a sequence, some approaches compute the differences between pairs of frames (not necessarily consecutive) in terms of color histograms, motion, or other visual descriptions. The key frames are selected by analyzing the values obtained. Zhao et al. [16] have developed a simple method for key frame extraction called Simplified Breakpoints. A frame is selected as a key frame if its color histogram differs from that of the previous frame by a given threshold. When the set of selected frames reaches the required number of key frames for the shot, the process stops. In Hanjalic et al. [17] frame differences are taken to build a "content development" curve that is approximated, using an error minimization algorithm, by a curve composed of a predefined number of rectangles. Han et al. [18] propose a very simple approach: the key frames are selected by an adaptive temporal sampling algorithm that uniformly samples the y-axis of the curve of cumulative frame differences. The resulting non-uniform sampling of the curve's x-axis represents the set of key frames.


The compressed domain is often considered when developing key frame extraction algorithms, since it makes it easy to express the dynamics of a video sequence through motion analysis. Narasimha et al. [19] propose a neural network approach using motion intensities computed from MPEG compressed video. A fuzzy system classifies the motion intensities into five categories, and those frames that exhibit high intensities are chosen as key frames. In Calic et al. [20] video features extracted from the statistics of the macro-blocks of an MPEG compressed video are used to compute frame differences. A discrete contour evolution algorithm is applied to extract key frames from the curve of the frame differences. In Liu et al. [21] a perceived motion energy (PME) computed on the motion vectors is used to describe the video content. A triangle model is then employed to model motion patterns and extract key frames at the turning points of acceleration and deceleration.

The drawback of most of these approaches is that the number of representative frames must be set in some manner a priori, depending, for example, on the length of the video shots. This cannot guarantee that the frames selected will not be highly correlated. It is also difficult to set a suitable interval of time, or of frames: small intervals mean a large number of frames will be chosen, while large intervals may not capture enough representative frames, and those chosen may not be in the right places to capture significant content. Still other approaches work only on compressed video, are threshold-dependent, or are computationally intensive (e.g. [22] and [21]).

In this paper, we propose instead an approach to the selection of key frames that determines the complexity of the sequence in terms of changes in the pictorial content, using three visual features: a color histogram, wavelet statistics, and an edge direction histogram. Similarity measures are computed for each descriptor and combined to form a frame difference measure. The frame differences are then used to dynamically and rapidly select a variable number of key frames within each shot.
The method works fast on all kinds of videos (compressed or not), and does not exhibit the complexity of existing methods based, for example, on clustering strategies. It can also extract key frames on the fly, that is, it can output key frames while computing the frame differences, without having to process the whole shot.

3. Summary evaluation

One of the most challenging topics in the field of video analysis and summarization is the evaluation of the summaries produced by the different key frame extraction algorithms. In their work to design a framework for video summarization, Fayzullin et al. [23] define three properties that must be taken into account when creating a video summary: continuity, priority and repetition. Continuity means that the summarized video must be as uninterrupted as possible. Priority means that, in a given application, certain objects or events may be more important than others, and thus the summary must contain high priority items. Repetition means that it is important not to represent the same events over and over again. It is often very difficult to successfully incorporate these semantic properties in a summarization algorithm. Priority, in particular, is a highly task-dependent property. It requires that video experts carefully define the summarization rules most suitable for each genre of video sequence processed.

The most common evaluation of a summary relies on the subjective opinion of a panel of users. This shifts the problem of incorporating semantic information into the summarization algorithm to the evaluation phase, where users are requested to compare the summary with the original sequence. For example, in Narasimha et al. [19] and in Lagendijk et al. [24] a global subjective evaluation is given for the goodness of the summary; in Liu et al. [21] users are asked to give scores based on their satisfaction, marking the summaries as good, acceptable, or bad. Ngo et al. [25] apply the criteria of "informativeness" and "enjoyability" for their evaluation of video highlights: "informativeness" assesses the capability of covering the content while avoiding redundancy; "enjoyability" assesses the performance of the algorithm in selecting perceptually agreeable video segments for summaries.

The problem with these approaches is that their evaluation is highly subjective and cannot be used to analyze video sequences automatically. We have chosen, instead, a more objective, general-purpose approach to summary evaluation, one that does not take into account the kind of video being processed, and can be automatically applied to all video sequences without requiring the services of video experts. A summary is considered good if the set of key frames effectively represents the pictorial content of the video sequence. This objective evaluation is valid regardless of genre, and can be performed automatically if a suitable quality measure is provided. From the very few works that have addressed the problem of objective evaluation of summaries, we have chosen two quality measures. The first, well known in the literature, is the Fidelity measure, as proposed by Chang et al. [26]; the second is the Shot Reconstruction Degree (SRD), recently proposed by Liu et al. [27]. These measures were chosen because they apply two different approaches: Fidelity employs a global strategy, while SRD uses a local evaluation of the key frames.
Along with the evaluation of the pictorial content using these two measures, we have also judged the compactness of the summary on the basis of the Compression Ratio measure.

3.1 Fidelity

The Fidelity measure compares each key frame in the summary with the other frames in the video sequence, and is defined as a semi-Hausdorff distance. We let a video sequence starting at time $t$ and containing $\gamma_{NF}$ frames be

$$S_t = \{ F(t+n) \mid n = 0, 1, \ldots, \gamma_{NF} - 1 \} \qquad (1)$$

and the set of $\gamma_{NKF}$ key frames extracted from the video sequence be

$$KF_t = \{ F_{KF}(t+n_1), F_{KF}(t+n_2), \ldots, F_{KF}(t+n_{\gamma_{NKF}}) \mid 0 \le n_j < \gamma_{NF} \} \qquad (2)$$

The distance between the set of key frames $KF_t$ and a frame $F(t+n)$ belonging to the video sequence $S_t$ can be computed as:

$$d(F(t+n), KF_t) = \min_j \left\{ \mathrm{Diff}\left(F(t+n), F_{KF}(t+n_j)\right) \right\}, \quad j = 1, 2, \ldots, \gamma_{NKF} \qquad (3)$$

where Diff( ) is a suitable frame difference measure. The distance between the video sequence $S_t$ and the set of key frames $KF_t$ is finally defined as:

$$d(S_t, KF_t) = \max_n \left\{ d(F(t+n), KF_t) \mid n = 0, 1, \ldots, \gamma_{NF} - 1 \right\} \qquad (4)$$

We can then compute the Fidelity measure as:

$$\mathrm{Fidelity}(S_t, KF_t) = \mathrm{MaxDiff} - d(S_t, KF_t) \qquad (5)$$

where MaxDiff is the largest possible value that the Diff( ) frame difference distance can assume. High Fidelity values indicate that the key frames extracted from the video sequence provide a good global description of the visual content of the sequence.
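As a concrete illustration, the following minimal Python sketch implements equations (1)-(5). It is our rendering, not the authors' code: `diff` stands for any frame difference measure Diff( ), and `max_diff` for its largest attainable value MaxDiff.

```python
def fidelity(frames, key_idx, diff, max_diff=1.0):
    """Fidelity (eq. 5): MaxDiff minus the semi-Hausdorff distance
    between the sequence and its key frame set (eqs. 3-4)."""
    # eq. 3: distance of each frame from its nearest key frame
    per_frame = [min(diff(f, frames[k]) for k in key_idx) for f in frames]
    # eq. 4: the worst-covered frame determines the sequence distance
    return max_diff - max(per_frame)
```

The max-of-mins structure rewards summaries in which even the worst-covered frame of the sequence has a nearby key frame.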


[Figure: the frames F(t), ..., F(t+γ_NF−1) of the sequence S_t are each compared, via Diff( ), with the key frames of KF_t.]
Fig. 2. Computation of the Fidelity measure. Each frame in the video sequence is compared with each key frame in the summary, and the minimal distance of each is retained. The greatest of these distances provides the Fidelity measure.

3.2 Shot Reconstruction Degree

The idea underlying the Shot Reconstruction Degree (SRD) measure is that, using a suitable frame interpolation algorithm, we should be able to reconstruct the whole sequence from the set of key frames. The better the reconstruction approximates the original video sequence, the better the key frames summarize its content. Let the video sequence $S_t$ and the set of key frames $KF_t$ be defined as in the previous paragraph. Let FIA( ) be a frame interpolation algorithm that computes a frame from a pair of key frames in $KF_t$:

$$\tilde{F}(t+n) = \mathrm{FIA}\left(F_{KF}(t+n_j), F_{KF}(t+n_{j+1}), n, n_j, n_{j+1}\right), \quad n_j \le n < n_{j+1} \qquad (6)$$

The whole sequence is reconstructed in this way, frame by frame; as illustrated in Fig. 3, each interpolated frame is compared with the corresponding frame in the original sequence, and the frame differences thus computed are combined into the final SRD value.
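The precise interpolation algorithm and aggregation formula are given in [27]; as an illustration only, the sketch below assumes a linear-blend FIA( ) and a plain sum of per-frame similarities (1 − Diff), both of which are our assumptions.

```python
def srd(frames, key_idx, diff):
    """Shot Reconstruction Degree sketch (eq. 6 plus an assumed
    aggregation). Assumes key_idx is sorted and contains the first
    and last frame indices, and frames are numpy arrays."""
    score = 0.0
    for n in range(len(frames)):
        # enclosing pair of consecutive key frames (eq. 6)
        j = max(k for k in key_idx if k <= n)
        j1 = min((k for k in key_idx if k > n), default=j)
        w = 0.0 if j1 == j else (n - j) / (j1 - j)    # position in [j, j1)
        recon = (1 - w) * frames[j] + w * frames[j1]  # linear-blend FIA
        score += 1.0 - diff(frames[n], recon)         # assumed similarity
    return score
```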


[Figure: each interpolated frame F̃(t+n) is produced by FIA( ) from the enclosing key frames of KF_t and compared, via Diff( ), with the original frame F(t+n) of S_t.]
Fig. 3. The computation of the SRD measure. The video sequence is reconstructed using the key frames extracted and a frame interpolation algorithm (FIA). The interpolated frames are compared with the corresponding frames in the original sequence, and the computed frame differences are used to compute the final SRD measure.

3.3 Compression ratio

A video summary should not contain too many key frames, since the aim of the summarization process is to allow users to quickly grasp the content of a video sequence. For this reason, we have also evaluated the compactness of the summary (compression ratio). The compression ratio is computed by dividing the number of key frames in the summary by the length of the video sequence. For a given video sequence $S_t$, the compression rate is thus defined as:

$$\mathrm{CRatio}(S_t) = 1 - \frac{\gamma_{NKF}}{\gamma_{NF}} \qquad (9)$$

where $\gamma_{NKF}$ is the number of key frames in the summary, and $\gamma_{NF}$ is the total number of frames in the video sequence. Ideally, a good summary produced by a key frame extraction algorithm will present both a high quality measure (in terms of Fidelity or SRD) and a high compression ratio (i.e. a small number of key frames).
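For example, summarizing a 1000-frame sequence with 5 key frames gives CRatio = 1 − 5/1000 = 0.995, while a 20-key-frame summary of the same sequence scores 1 − 20/1000 = 0.980; higher values thus reward more compact summaries.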
4. New key frame extraction approach

The proposed key frame extraction algorithm functions as shown in Fig. 4. The algorithm employs the information obtained by a video segmentation algorithm to process each shot. We have developed a prototypical segmentation algorithm that for the moment detects only abrupt changes and fades, since these are the most common editing effects. For abrupt changes we have implemented a threshold-based algorithm coupled with a frame difference measure computed from histograms and texture descriptors. The same frame difference measure is used for the key frame extraction algorithm. To detect fades we have implemented a modified version of the algorithm proposed by Fernando et al. [28]. The results obtained by these algorithms are submitted for evaluation to a decision module which gives the final response. This allows us to cope with conflicting results, or with groups of frames that are not meaningful, such as those between the end of a fade-out and the start of a fade-in, increasing the robustness of the detection phase. A gradual transition detection algorithm is currently being developed, and will be integrated in a similar manner.


We distinguish two types of shots: "informative" shots (type A) and "uninformative" shots (type B). Type B shots are those delimited by a fade-out followed by a fade-in effect, a fade-out followed by a cut, or a cut followed by a fade-in. If we were to extract key frames from these shots, the resulting set of frames would contain uniformly colored images that are meaningless in terms of the information supplied. Key frames are extracted from type A shots only.

[Figure: flowchart; video data passes through the shot detection algorithm and feature extraction to the frame difference measure; the frame differences of non-type-A shots are discarded, while those of type-A shots feed the key frames detector, whose key frames form the video summary.]
Fig. 4. The key frame extraction algorithm.

4.1 Frame difference measure

In general, a single visual descriptor cannot capture all the pictorial details needed to estimate the changes in the visual content of frames, and the visual complexity of a video shot. In defining what a good pictorial representation of a frame is, we must take into account both color properties and structural properties, such as texture. Instead, as stated in Section 2, many existing algorithms use only one feature. To overcome the frame representation problem, we compute three different descriptors: a color histogram, an edge direction histogram, and wavelet statistics. The features used have been selected for three basic properties: perceptual similarity (the feature distance between two images is large only if the images are not "similar"), efficiency (the features can be rapidly computed), and economy (small dimensions that do not affect efficacy). The use of these assorted visual descriptors provides a more precise representation of the frame and captures slight variations between the frames in a shot.

Color histograms are frequently used to compare images because they are simple to compute, and tend to be robust with regard to small changes in camera viewpoint. Retrieval using color histograms to identify objects in image databases has been investigated in [29] and [30]. An image histogram h( ) refers to the probability mass function of image intensities.
Computationally, the color histogram is formed by counting the number of pixels belonging to each color. Usually a color quantization phase is performed on the original image in order to reduce the number of colors to be considered in computing the histogram, and thus the size of the histogram itself. The color histogram we use is composed of 64 bins determined by sampling groups of meaningful colors in the HSV color space [31]. The use of the HSV color space allows us to carefully define groups of colors in terms of Hue, Saturation and Lightness.

The edge direction histogram is composed of 72 bins corresponding to intervals of 2.5 degrees. Two Sobel filters are applied to obtain the gradients of the horizontal and the vertical edges of the luminance frame image [32]. These values are used to compute the gradient of each pixel; those pixels that exhibit a gradient above a predefined threshold are taken to compute the gradient angle and then the histogram. The threshold has been heuristically set at 4% of the gradient maximum value, in order to remove from the histogram computation edges derived from background noise.

Multiresolution wavelet analysis provides representations of image data in which both spatial and frequency information are present [33]. In multiresolution wavelet analysis we have four bands for each level of resolution, resulting from the application of two filters, a low-pass filter (L) and a high-pass filter (H). The filters are applied in pairs in the four combinations LL, LH, HL and HH, and followed by a decimation phase that halves the resulting image size. The final image, of the same size as the original, contains a smoothed version of the original image (LL band) and three bands of details (see Fig. 5a).

Fig. 5. a) The filtering and decimation of the image along the horizontal and vertical directions. Four bands are created, each a quarter the size of the whole image. b) The three-step application of the multiresolution wavelet. The wavelet filters are applied to the top left band containing the resized image.

Each band corresponds to a coefficient matrix that can be used to reconstruct the original image. These bands contain information about the content of the image in terms of general image layout (the LL band) and in terms of details (edges, textures, etc.). In our procedure the features are extracted from the luminance image using a three-step Daubechies multiresolution wavelet decomposition that uses 16 coefficients, producing ten sub-bands [34] (Fig. 5b). Two energy features, the mean and standard deviation of the coefficients, are then computed for each of the 10 sub-bands obtained, resulting in a 20-valued descriptor.
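A minimal sketch of this 20-value descriptor using the PyWavelets library (the library choice is ours; `'db8'` is the 16-coefficient Daubechies filter the text describes):

```python
import numpy as np
import pywt

def wavelet_stats(luminance):
    """20-value descriptor: mean and std of the coefficients of the
    10 sub-bands of a 3-level Daubechies decomposition (Fig. 5b)."""
    coeffs = pywt.wavedec2(luminance, 'db8', level=3)
    bands = [coeffs[0]]                 # LL band of the deepest level
    for (ch, cv, cd) in coeffs[1:]:     # 3 detail bands per level
        bands.extend([ch, cv, cd])
    feats = []
    for b in bands:                     # 1 + 3*3 = 10 sub-bands
        feats.extend([np.mean(b), np.std(b)])
    return np.asarray(feats)            # shape (20,)
```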
To compare two frame descriptors, a difference measure is used to evaluate the color histograms, wavelet statistics and edge histograms. There are several distance formulas for measuring the similarity of color histograms. Techniques for comparing probability distributions are not appropriate, because it is visual perception that determines the similarity, rather than the closeness of the probability distributions. One of the most commonly used measures is the histogram intersection [29]. The distance between two color histograms ($d_H$) using the intersection measure is given by:

$$d_H(H_t, H_{t+1}) = 1 - \sum_{j=0}^{63} \min\left(H_t(j), H_{t+1}(j)\right) \qquad (10)$$

where $H_t$ and $H_{t+1}$ are the color histograms for frame F(t) and frame F(t+1), respectively.
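To illustrate with four bins instead of 64: for normalized histograms (0.5, 0.3, 0.1, 0.1) and (0.2, 0.4, 0.2, 0.2), the bin-wise minima sum to 0.2 + 0.3 + 0.1 + 0.1 = 0.7, so $d_H$ = 1 − 0.7 = 0.3; identical histograms give $d_H$ = 0, and histograms with disjoint support give $d_H$ = 1.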


The difference between two edge direction histograms ($d_D$) is computed using the Euclidean distance, as in the case of two wavelet statistics ($d_W$):

$$d_D(D_t, D_{t+1}) = \sqrt{\sum_{j=0}^{71} \left(D_t(j) - D_{t+1}(j)\right)^2}, \qquad d_W(W_t, W_{t+1}) = \sqrt{\sum_{j=0}^{19} \left(W_t(j) - W_{t+1}(j)\right)^2} \qquad (11)$$

where $D_t$ and $D_{t+1}$ are the edge direction histograms and $W_t$ and $W_{t+1}$ are the wavelet statistics for frame F(t) and frame F(t+1).

The three resulting values (to simplify the notation we indicate them simply as $d_H$, $d_W$ and $d_D$) are mapped into the range [0,1] and then combined to form the final frame difference measure ($d_{HWD}$) as follows:

$$d_{HWD} = (d_H \cdot d_W) + (d_H \cdot d_D) + (d_W \cdot d_D) \qquad (12)$$

The aim of the frame difference measure is to accentuate dissimilarities in order to detect changes within the frame sequence. At the same time, it is important that the measure report high difference values only when the frames are very different. As stated before, the majority of key frame selection methods exploit just one visual feature, which is not sufficient to effectively describe an image's contents. If we were to use, for example, only the color histogram, a highly dynamic sequence (e.g. one containing fast moving or panning effects) whose frames have the same color contents would result in a series of similar frame difference values, and the motion effects would be lost. Similarly, frames with the same color content but different from the point of view of other visual attributes would be considered similar. The use of multiple features can overcome these issues, but poses the problem of their combination. In content-based retrieval systems, the features are combined by weighing them with suitable factors, which are usually task-dependent [31]. We choose instead a different approach: the explicit selection of weight factors is removed by weighing each difference against the others. Moreover, this allows us to register significant differences in the $d_{HWD}$ values only if at least two of the single differences exhibit high values (and thus two of the visual attributes emphasize the frame dissimilarity).
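A sketch of the combined measure (eqs. (10)-(12)): how the two Euclidean distances are mapped into [0,1] is not spelled out in the paper, so the sketch divides them by caller-supplied scale factors and clips (an assumption), while the color histograms are assumed normalized to unit sum.

```python
import numpy as np

def d_hwd(h_t, h_t1, w_t, w_t1, e_t, e_t1, w_scale=1.0, e_scale=1.0):
    """Combined frame difference (eq. 12) from the three descriptor
    distances (eqs. 10-11)."""
    dh = 1.0 - np.minimum(h_t, h_t1).sum()               # eq. 10
    dw = min(np.linalg.norm(w_t - w_t1) / w_scale, 1.0)  # eq. 11, wavelets
    de = min(np.linalg.norm(e_t - e_t1) / e_scale, 1.0)  # eq. 11, edges
    # eq. 12: pairwise products, large only if >= 2 distances are large
    return dh * dw + dh * de + dw * de
```

The pairwise products act as mutual weights: one large distance multiplied by two small ones stays small, so $d_{HWD}$ rises only when at least two descriptors agree on the dissimilarity.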
4.2 Key frame selection

The key frame selection algorithm that we propose dynamically selects the representative frames by analyzing the complexity of the events depicted in the shot in terms of pictorial changes. The frame difference values initially obtained are used to construct a curve of the cumulative frame differences, which describes how the visual content of the frames changes over the entire shot, an indication of the shot's complexity: sharp slopes indicate significant changes in the visual content due to a moving object, camera motion, or the registration of a highly dynamic event. These cases must be taken into account in selecting the key frames to include in the shot summary. They are identified in the curve of the cumulative frame differences as those points at the sharpest angles of the curve (curvature or corner points). The key frames are those corresponding to the midpoints between each pair of consecutive curvature points. To detect the high curvature points we use the algorithm proposed by Chetverikov et al. [35]. The algorithm was originally developed for shape analysis, in order to identify salient points in a 2D shape outline. The high curvature points are detected in a two-pass process. In the first pass the algorithm detects candidate curvature points. The algorithm defines as a "corner" a location where a triangle of specified size and opening angle can be inscribed in the curve. Using each curve point P as a fixed vertex point, the algorithm tries to inscribe a triangle in the curve, and then determines the opening angle α(P) in correspondence of P. Different triangles are considered using points that fall within a window of a given size w centered on P; the sharpest angle is retained as a possible high curvature point. This procedure is illustrated in Fig. 6. Defining the distance between points P and O as $d_{OP}$, the distance between points P and R as $d_{PR}$, and the distance between points O and R as $d_{OR}$, the opening angle α corresponding to the triangle OPR is computed as:

$$\alpha = \arccos\left(\frac{d_{OP}^2 + d_{PR}^2 - d_{OR}^2}{2 \cdot d_{OP} \cdot d_{PR}}\right) \qquad (13)$$

A triangle satisfying the constraints on the distances between points (we consider only the x-coordinates):

$$d_{min} \le |P_x - O_x| \le d_{max}, \qquad d_{min} \le |P_x - R_x| \le d_{max} \qquad (14)$$

and the constraint on the angle values

$$\alpha \le \alpha_{max} \qquad (15)$$

is called an admissible triangle. The first two constraints define the operating window; the set of points contained in it is used to define the triangles. The third constraint is used to discard angles that are too flat. The sharpest opening angle among the admissible triangles is then assigned to P:

$$\alpha(P) = \min \left\{ \alpha_{O\hat{P}R} \right\} \qquad (16)$$

If a point has no admissible triangles, it is rejected by assigning it a default angle value of π. In the second pass, those points in the set of candidate high curvature points that are sharper than their neighbors (within a certain distance) are classified as high curvature points.
A candidate point P is discarded if it has a sharper valid neighbor N, that is, if:

$$\alpha(P) > \alpha(N) \qquad (17)$$

A point N is defined to be a neighbor of P if the following constraint holds:

$$|P_x - N_x| \le d_{max} \qquad (18)$$

In our implementation we have defined the minimum point distance $d_{min}$ as always equal to 1; consequently, the only two parameters that influence the results of the algorithm are $d_{max}$ and $\alpha_{max}$. The most important parameter is $\alpha_{max}$, which controls the set of admissible angles: a high value of $\alpha_{max}$ will result in more points being included in the set of candidate high curvature points, while a lower value indicates that only very sharp angles must be considered. This is the same as considering worthy of attention only slopes corresponding to sharp changes in the curve of the cumulative $d_{HWD}$ frame differences.
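The two-pass detection translates into a short routine. This is our sketch of the Chetverikov-style detector as described above, not the authors' code, with $d_{min}$ = 1 and the $d_{max}$ = 3, $\alpha_{max}$ = 140° values reported in the experiments below as defaults:

```python
import math

def high_curvature_points(curve, d_max=3, alpha_max=math.radians(140)):
    """Two-pass corner detection on a list of (x, y) curve points.
    Pass 1: sharpest admissible triangle angle per point (eqs. 13-16).
    Pass 2: non-maximum suppression among neighbors (eqs. 17-18)."""
    n = len(curve)
    alpha = [math.pi] * n                      # default: rejected (flat)
    for p in range(n):
        for i in range(max(0, p - d_max), p):              # candidate O
            for k in range(p + 1, min(n, p + d_max + 1)):  # candidate R
                d_op = math.dist(curve[p], curve[i])
                d_pr = math.dist(curve[p], curve[k])
                d_or = math.dist(curve[i], curve[k])
                cos_a = (d_op**2 + d_pr**2 - d_or**2) / (2 * d_op * d_pr)
                a = math.acos(max(-1.0, min(1.0, cos_a)))  # eq. 13
                if a <= alpha_max:                         # eq. 15
                    alpha[p] = min(alpha[p], a)            # eq. 16
    corners = []
    for p in range(n):      # pass 2: keep only the locally sharpest points
        if alpha[p] < math.pi and all(
            alpha[p] <= alpha[q]
            for q in range(max(0, p - d_max), min(n, p + d_max + 1))
            if q != p and alpha[q] < math.pi):             # eqs. 17-18
            corners.append(p)
    return corners
```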


[Figure: the cumulative d_HWD frame difference curve plotted against the frame index t, t+1, ..., t+10, with an inscribed triangle OPR, its opening angle α(P), and the d_min and d_max window bounds around P.]
Fig. 6. Detection of the high curvature points algorithm.

Once the high curvature points have been determined, key frames can be extracted by taking the midpoint between two consecutive high curvature points. Fig. 7 is an example of how the algorithm works. The top image shows an example shot: the algorithm detects a high curvature point within the curve of cumulative frame differences. The first and last frames of the shot are implicitly assumed to correspond to high curvature points. The frames corresponding to the midpoints between each pair of consecutive high curvature points are selected as key frames (Fig. 7, center; triangles represent the high curvature points detected, while circles represent the key frames selected). If a shot does not present a dynamic behavior, i.e. the frames within the shot are highly correlated, the curve does not show evident curvature points, signifying that the shot can be summarized by a single representative frame. Fig. 7, bottom, shows the key frames extracted from the example shot. The summary contains the relevant elements of the frame sequence in terms of low level features. Unlike some methods, such as those that extract key frames based on the length of the shots, our algorithm does not have to process the whole video. Another advantage is that it can extract the key frames on the fly: to detect a high curvature point we can limit our analysis to a fixed number of frame differences within a predefined window. Consequently, the curvature points can be determined while computing the frame differences, and the key frames extracted as soon as a second high curvature point has been detected.

The high curvature points analysis for key frame extraction is similar to the approach proposed in [36], and also used in [27], where a polygonal curve representing the frames' evolution is iteratively simplified by removing curve points.
Unlike this approach, which requires that all the curve points be globally analyzed at each step in order to select the candidate point to be removed, our detection of high curvature points can be performed sequentially and locally.
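Putting the pieces together, the selection step is a few lines (reusing the `high_curvature_points` sketch above; the first and last frames are implicitly treated as curvature points, as stated above):

```python
def select_key_frames(cum_diffs):
    """Key frames lie at the midpoints between consecutive high curvature
    points of the cumulative d_HWD curve (Fig. 7). A static shot, whose
    curve has no interior corners, yields a single key frame."""
    curve = list(enumerate(cum_diffs))   # (frame index, cumulative d_HWD)
    corners = sorted({0, len(curve) - 1, *high_curvature_points(curve)})
    return [(a + b) // 2 for a, b in zip(corners, corners[1:])]
```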


Fig. 7. An example of key frame selection with the proposed method. Top: the shot to analyze. Center: the corner points detected (triangles) and the key frames selected (circles). Bottom: the two key frames, numbers 71 and 112, extracted from the example shot.

As stated above, the corner point algorithm has two parameters: the size of the window within which the curvature angles are computed ($d_{max}$), and the maximum value of the angle considered in determining the point's curvature angle ($\alpha_{max}$). We have found experimentally that a size 3 window and angles of less than 140 degrees provide a fair tradeoff between the number of computations required and the number of key frames extracted from each shot.

It is interesting to note that, in theory, the shot segmentation phase is not strictly required. Suppose that a segmentation algorithm is not available, and thus the video is a single sequence of frames. Since cuts are abrupt changes in the visual content of the video sequence, our key frame selection algorithm can still detect them as corner points. The key frames extracted are the same as those extracted when the video is segmented. The key frame extraction algorithm does not detect fading or dissolving effects and, if their visual evolution does not present sharp changes, they will not have corner points assigned. However, the selection of key frames within these gradual transition sequences cannot be entirely avoided. Take for example the case of a video that starts with a fade-in followed immediately by a cut: a key frame will be selected in the set of frames corresponding to the fade sequence. Consequently, the segmentation algorithm serves to remove these kinds of shots, improving the summarization results by ensuring the selection of informative frames only.


5. Experimental Setup

5.1 Algorithms Tested

We have compared the results of our algorithm with those of five other key frame extraction algorithms. Together with our Curvature Points (CP) algorithm, we tested the Adaptive Temporal Sampling (ATS) algorithm of Hoon et al. [18], the "Flexible Rectangles" (FR) algorithm of Hanjalic et al. [17], the Shot Reconstruction Degree Interpolation (SRDI) algorithm of Tieyan Liu et al. [27], the Perceived Motion Energy (PME) algorithm of Tianming Liu et al. [21], and a simple Mid Point (MP) algorithm.

Both the ATS and FR algorithms select key frames on the basis of cumulative frame differences. The ATS algorithm computes color histogram differences in the RGB color space and plots them on a curve of cumulative differences. The key frames are selected by sampling the y-axis of the curve of cumulative differences at constant intervals; the corresponding values on the x-axis identify the key frames (a minimal sketch of this sampling is given below). More key frames are likely to be found in intervals where the frame differences are pronounced than in intervals with lower frame difference values. The FR algorithm also uses color histogram differences to build its curve, but the histograms are computed in the YUV color space. The curve of cumulative differences is approximated by a set of rectangles, each of which is used to select a key frame. As the widths of the rectangles are calculated to minimize the approximation error, an optimization algorithm is required (an iterative search algorithm is used). The input parameter of the algorithm is the number of key frames (and thus of approximation rectangles) to be extracted. The authors also propose a strategy for deciding how many key frames to select from the shots of the video sequence: the number of key frames assigned to a shot is proportional to its length.
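For illustration, the y-axis sampling used by the ATS approach can be sketched as follows (our code, with an assumed input layout: one nondecreasing cumulative difference value per frame; whether interval midpoints or endpoints are sampled is a detail we fix arbitrarily here):

    import bisect

    def sample_cumulative_curve(cum_diff, n_key_frames):
        """Pick the frames where the cumulative-difference curve crosses
        n equally spaced levels on the y-axis."""
        total = cum_diff[-1]
        levels = [total * (k + 0.5) / n_key_frames for k in range(n_key_frames)]
        return [bisect.bisect_left(cum_diff, level) for level in levels]

Since the curve rises steeply where consecutive frames differ strongly, equally spaced y-levels are crossed more often in those regions, which is exactly why such sampling concentrates key frames on the dynamic parts of the shot.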
Given a specified number of key frames to be extracted, the SRDI algorithm uses the motion vectors to compute each frame's motion energy. All the motion energy values are then used to build a motion curve that is passed to a polygon simplification algorithm, which retains only the most salient points that can approximate the whole curve. The frames corresponding to these points form the key frame set. If the number of frames in the final set differs from the number of key frames requested, the set is reduced or enlarged by interpolating frames according to the Shot Reconstruction Degree criterion. When the number of frames is lower than the number desired, the shot is reconstructed by interpolating the frames in the frame set, and the interpolated frames with the largest reconstruction errors are retained, up to the number of key frames needed. When the number of frames in the frame set is greater than the number of key frames needed, the frames in the frame set are interpolated, and those with the minimal reconstruction error are removed from the set. The result of the SRDI algorithm depends on both the polygon simplification algorithm and the SRD criterion. The only parameter that must be set for the ATS, FR, and SRDI algorithms is the number of key frames to be extracted.

The PME algorithm works on compressed video, using the motion vectors as indicators of the visual content complexity of a shot. A triangle model based on a perceived motion energy feature is used to select key frames. With this model, a shot is segmented into sub-segments representing an acceleration and deceleration motion pattern (each modeled by a triangle). Key frames are then extracted from the shots by taking the vertices of the triangles.


To compute the perceived motion energy feature, the magnitudes of the motion vectors of the B-frames are first filtered with two nonlinear filters: for each motion vector in the frame, a spatial filter is applied within a given spatial window, and a temporal filter is applied to values belonging to frames within a given temporal window. For each B-frame, the PME is then computed from the magnitudes of the motion vectors and the dominant motion direction. This preprocessing requires the setting of several parameters. A simple procedure then automatically computes the triangles on the PME values and the corresponding key frames. The algorithm requires the setting of two parameters, the most important of which is the minimum size of a triangle, since it influences the length of the interval between two consecutive key frames.

The MP algorithm was chosen because it represents the extreme case of our algorithm, when no evident high curvature points can be found in a shot and the center frame of the sequence is chosen as the key frame instead.

Where available, the parameters set for the algorithms were always those reported in the original papers. The ATS, FR and SRDI algorithms require the number of key frames as an input parameter. Defining a general rule for setting this number is a crucial matter; the results may vary widely, depending on the rule selected. We have set the input parameter for these algorithms to the same number of key frames found by our algorithm. We can then compare the algorithms regardless of the number of key frames: any difference in results depends only on the selection strategy adopted. Since the PME algorithm, instead, extracts the key frames in a totally automatic way, as does our algorithm, its results depend on both the number of key frames extracted and the selection strategy applied.

5.2 Video Data Set

Six videos of various genres were used to test the performance of the key frame extraction algorithms. Table 1 summarizes the characteristics of the six video test sequences.
The "eeopen" video is an MPEG-1 intro sequence of a TV series with short shots and several transition effects. The "news" and "nwanw1" videos are two MPEG-1 news sequences; the shots are moderately long, not too dynamic, and mixed with commercial sequences of very fast-paced shots. The "nwanw1" video is similar to the "news" video, but has longer shots. The "football" and "basketball2" videos are two MPEG-1 sport sequences: "football" exhibits rather long shots, while "basketball2" is a single long shot, and both have panning and camera motion effects. Finally, "bugsbunny" is an MPEG-1 short cartoon sequence with many shots and a number of transition effects.

Table 1. The six videos used to test the key frame extraction algorithms. TNF denotes the total number of frames, and NS denotes the number of shots found; both refer to type A and type B shots.

Video Name   Genre                       Length (mm:ss)  Resolution (W×H)  TNF     NS
eeopen       TV series intro             00:42           352×240           1 289   24
nwanw1       News with commercials       03:39           176×112           6 556   39
news         News with commercials       02:39           176×112           4 757   12
football     Sport                       03:43           176×112           6 697   28
bugsbunny    Short cartoon               07:30           352×240           13 492  89
basketball2  Single shot sport sequence  00:30           320×220           893     1


The videos were chosen in order to evaluate the capability of the key frame extraction algorithms to cope with shots of different lengths, and with the various levels of dynamic events captured. In particular, the single shot of the "basketball2" video was chosen to test the ability of the algorithms to effectively capture the content of a long sequence in the presence of considerable camera motion.

Each video was processed by each of the six key frame extraction algorithms, and the Fidelity and the Shot Reconstruction Degree measures were computed on the resulting summaries. The parameters applied were not changed when passing from one video to the other, except for the number of key frames extracted for the ATS, FR and SRDI algorithms.

5.3 Measures setup

Since key frames are extracted from each shot, we must take this into account when computing the quality measures defined in Section 3. For each shot of type A, the Fidelity and the SRD measures are computed; if an algorithm does not extract key frames from a shot, the corresponding quality measures are set equal to the worst-case value (zero for both Fidelity and SRD). The final quality measure for the entire video is then computed as the average of the Fidelity and SRD measures over all shots:

    Fidelity = (1 / γ_NS) Σ_{t=1..γ_NS} Fidelity(S_t, KF_t)
    SRD      = (1 / γ_NS) Σ_{t=1..γ_NS} SRD(S_t, KF_t)          (19)

where γ_NS is the number of shots. For both measures, the higher the value, the better the summary represents the video content. The compression ratio measure was computed considering the whole video as follows:

    CRatio = 1 − γ_TNKF / γ_TNF                                  (20)

where γ_TNKF is the total number of key frames extracted, and γ_TNF is the total number of frames in shots of type A.

Both the Fidelity and the SRD measures require that a frame difference be defined. In order to evaluate the influence of the frame difference on the quality measures, the Fidelity and the SRD were computed using two different formulations: the first is based on the previously defined color histogram differences computed with the histogram intersection formula, the second on the HWD measure.
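As a small illustration (our code; the data structures are assumed, not the authors'), equations (19) and (20) amount to the following:

    def average_measure(per_shot_values):
        """Eq. (19): mean of per-shot Fidelity or SRD values; shots with
        no extracted key frames contribute the worst-case value 0."""
        return sum(per_shot_values) / len(per_shot_values)

    def compression_ratio(total_key_frames, total_frames):
        """Eq. (20): CRatio = 1 - TNKF / TNF."""
        return 1.0 - total_key_frames / total_frames

    # With the "basketball2" figures from Table 5 (15 key frames,
    # 894 frames) this gives 0.983, the value reported there.
    print(round(compression_ratio(15, 894), 3))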
Since the number of key frames extracted from a shot is usually small (even just one), to compute the SRD we opted for a simple linear interpolation algorithm on the frames' descriptive features. One reason for this choice is that a more complex interpolation algorithm cannot be used effectively when very few interpolation points are available. Another reason is that pixel values depend on the objects moving in the scene, and interpolating each pixel of the frame using only the color information may result in a very noisy image that does not reflect the real content of the original frames. Interpolating global features instead can capture the overall pictorial content of the frames without having to take motion into account. To allow frame interpolation when only one key frame was available, we also selected the first and last frames of the shot as interpolation points.
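A minimal sketch of the linear interpolation we use (ours; the descriptor is assumed to be a flat list of floats, e.g. a concatenation of the histogram, edge and wavelet features):

    def lerp(feat_a, feat_b, t):
        """Linear blend of two feature vectors, t in [0, 1]."""
        return [a + t * (b - a) for a, b in zip(feat_a, feat_b)]

    def reconstruct_feature(idx, idx_a, feat_a, idx_b, feat_b):
        """Estimate the descriptor of frame idx lying between the
        interpolation points idx_a < idx_b (key frames or shot ends)."""
        t = (idx - idx_a) / (idx_b - idx_a)
        return lerp(feat_a, feat_b, t)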


6. Experimental Results

Table 2. Comparison of the six key frame extraction algorithms used in our experiment. For each algorithm some important characteristics are reported.

                                 ATS   FR    SRDI  MP  PME   CP
Automatic key frame selection    N     N     N     Y   Y     Y
Variable number of key frames    Y     Y     Y     N   Y     Y
On-the-fly processing            N     N     N     Y   N     Y
Real-time processing             Y     Y     N     Y   N     Y
Requires motion vectors          N     N     Y     N   Y     N
Uses an optimization algorithm   N     Y     N     N   N     N
Shot length sensitive            ?     Y     N     N   Y     N
Reference                        [18]  [17]  [27]  -   [21]  -

Table 2 summarizes some of the characteristics of the algorithms tested. The top row of Table 2 regards the most important property of a key frame selection algorithm: it should extract key frames in a totally automatic way, without requiring the user to specify the number of key frames to be extracted as a parameter; otherwise the user would have to know the video content in advance and adjust the parameter accordingly. The next row indicates whether the algorithm is flexible enough to extract a variable number of key frames according to some criterion (e.g. shot complexity, shot dynamics, or shot length). By "on-the-fly processing" we mean the ability of the given algorithm to output key frames without having to process all the frames in the shot or, worse, all the frames in the video sequence. By "real-time processing" we mean mainly that the time the algorithm takes to compute the key frames (including any pre-processing phases) must be less than the time required to view the whole video sequence. Some algorithms require that motion vectors (usually from MPEG compressed video) be provided; these algorithms do not decode the frames, but their applicability is limited to a specific compression standard: videos that are not compressed would have to be compressed before being processed. On the contrary, algorithms that work on uncompressed data, when presented with a compressed video, need to decode the frames. Due to the asymmetric nature of the compression algorithms, the coding time is usually much higher than the decoding time.
Algorithms that implement an optimization strategy usually offer better results, but have the drawback of being time consuming, and need an efficient implementation in order to be practical [24]. "Shot length sensitive" refers to the fact that the key frames extracted depend in some way on the length of a shot. This is, for example, the case of the FR algorithm, where the number of key frames is proportional to the length of a given shot, or that of the PME algorithm, where the algorithm's parameter defines the minimum gap between two consecutive key frames. In the case of the ATS algorithm, no evaluation can be made, since the original publication gives no indication of how the key frames are distributed among the shots.

The motion vectors for the SRDI and PME algorithms have been extracted directly from the compressed video. The ATS, FR and CP algorithms, and the interpolation step of the SRDI algorithm, have been performed on sub-sampled frames. Through experiments we have observed that we can sub-sample the frames down to 48 pixels (with respect to the larger dimension, maintaining the aspect ratio) and still obtain results similar to those of applying the algorithms to the original frames. To limit the risk of losing information, especially for the edge histogram and the wavelet statistics, we have chosen to sub-sample the frames to a dimension of 64 pixels.


6.1 Theoretical Complexity

Table 3 shows the theoretical complexity of the key frame extraction algorithms. We have used the same approach and notation as Lefevre et al. [37]. The complexity was computed considering mathematical, logical and comparison operations (all assumed to have cost one). We have not taken into account memory usage, allocation, de-allocation, or the cost required to decode a frame. In computing the complexity we have considered the cost required to compute the features needed by the algorithm (such as color histograms, statistics, ...), the cost required to compute the values to be analyzed (such as the cumulative frame differences or the PME values), and the cost required by the actual key frame detection algorithm. All the costs were computed considering the algorithms' parameters. Following the authors' notation, N is the number of bins in a histogram (we have used the same symbol to indicate the size of a feature vector), P is the number of pixels in a frame, and B is the number of blocks of pixels in a frame (e.g. the macro-blocks in a compressed video). Finally, we indicate with K the number of key frames. The complexity is relative to the operations per frame required to process a shot.

Table 3. Complexities of the key frame algorithms tested.

Algorithm  Complexity
ATS        O(3P + 9N)
FR         O(18P + 9N + 2K)
CP         O(72P + 13N)
SRDI       O(53P) # +
PME        O(958B) +
MP         O(2)

Although the key frame selection step of the PME algorithm is simple, the pre-processing phase performed with the two nonlinear filters penalizes the algorithm: much data must be analyzed, and each step of the filtering process also requires data reordering before the final result for that step can be obtained. More than half of the operations required by the CP algorithm are due to the wavelet computation (we have used the standard wavelet decomposition); using the lifting scheme would reduce this complexity by approximately half [38].
The # annotation for the SRDI algorithm indicates that the one-time operations performed on the key frames (divided by the number of frames) for the background classification have not been added. The + annotation for the SRDI and PME algorithms indicates, instead, that the complexities reported do not include the motion vector computation. The SRDI and PME algorithms use motion vectors and can thus take advantage of compressed videos that provide them. If the motion vectors are not available, an additional cost must be added for each frame requiring forward or backward motion vector computation; for example, using the fast 2D log search algorithm, which is sub-optimal, requires an additional O(75P) operations [37].


6.2 Computational Time

Table 4 shows the computation time (in minutes, seconds and thousandths of a second) of the algorithms tested, including the frame decoding time for the algorithms that required it. We implemented the key frame extraction algorithms in C++ under the Borland C++ Builder 5.0 development environment with the default optimization (for faster execution) turned on. The computer used for the comparison was an AMD Athlon 1GHz with 512MB of RAM, running the Windows 2000 Professional operating system. For the image processing algorithms, separable filters were used whenever possible, and results that could be reused in subsequent operations were computed only once. A test session consisted of processing all six video sequences with one key frame extraction algorithm; before the next test session the system was restarted. For each video, the computational time reported is the average of three test sessions. The MP algorithm is not included since its computation is virtually instantaneous. The other algorithms' computational times are reported relative to the ATS computational time.

Table 4. Computational time of the five algorithms on the six videos. The times reported for the ATS algorithm are in minutes, seconds and thousandths of a second (mm:ss:ms). For the other algorithms, the times are relative to the ATS time.

Video       ATS        FR    CP    SRDI  PME
eeopen      00:10:596  1.01  2.86  3.97  9.03
nwanw1      00:14:152  1.35  5.75  9.16  7.22
news        00:10:425  1.34  5.74  9.09  6.66
football    00:14:531  1.30  5.81  8.78  7.68
bugsbunny   01:38:542  1.13  2.98  4.65  9.41
basketball  00:06:570  1.08  3.08  4.55  6.97

Regarding the apparent discrepancies between the computational times of the ATS and FR algorithms and that of the CP algorithm, it should be noted that the decoding time takes up a significant part of the processing time of the first two algorithms. For the "eeopen", "bugsbunny" and "basketball" videos, the decoding time represents about 90% (ATS) and 82% (FR) of the total computational time. For the remaining three videos, the decoding time represents about 70% and 50% of the ATS and FR total times, respectively.
The high computational times of the SRDI and PME algorithms are mainly due to the heavy preprocessing required before the key frames can be selected. The SRDI algorithm requires pixel interpolation based on motion vectors and pixel background classification. Although the key frame selection step of the PME algorithm is simple, the pre-processing phase performed with the two nonlinear filters penalizes the algorithm: much data must be analyzed, and each step of the filtering process also requires data reordering before the final result for that step can be obtained.

6.3 Results

In our preliminary experiments, the Fidelity and SRD measures for the FR algorithm were more than 20% lower than those for the other algorithms. As stated before, in its original formulation the FR algorithm assigns to each shot a number of key frames proportional to the length of the shot. When the total number of key frames to be extracted is comparable to the number of shots, there is a high probability that no key frames will be assigned to some of the shots. Thus, for a fair comparison, in the subsequent experiments we modified the key frame assignment criterion of the FR algorithm.


In this modified formulation, the FR algorithm extracts from each shot the same number of key frames computed by the CP algorithm. The same criterion was applied to the ATS algorithm.

Table 5, Table 6, and Table 7 show the results of the different algorithms on the six video sequences. Table 5 shows the Compression Ratio (CRatio) values and some statistics: KFPerShot indicates the average number of key frames for each shot, while nUsedShots and FrPerShot indicate the number of type A shots and the average number of frames per shot, respectively. From Table 5 we can see that for the "eeopen", "nwanw1" and "bugsbunny" videos some type B shots have been removed by our shot detection algorithm. Our key frame selection algorithm is able to extract more than one frame per type A shot, as is the PME algorithm (recall that the number of key frames of the FR, ATS and SRDI algorithms is determined by the CP algorithm). Videos with high dynamics, in fact, are assigned more key frames per shot than those with little motion; this can be seen in the case of the "basketball2", "football" and "news" videos. For the "bugsbunny" video, although very few shots exhibit high dynamics, the PME algorithm extracts nearly double the number of key frames extracted by the CP algorithm.

Table 6 gives the Fidelity measure results, with the minimum and standard deviation of the measure, computed on both the histogram and the HWD descriptors. Table 7 presents the results of the Shot Reconstruction Degree measure. They show that the average Fidelity and SRD measures computed on the histogram (HST) are less variable than those computed on the HWD. This seems to indicate that frame differences tend to be less distinguishable when using the color histogram alone.
It is also interesting to note that the summaries of the "basketball2" video are clearly unacceptable if we evaluate them with the Fidelity measure computed on the histogram, while they appear fairly acceptable if evaluated with the HWD.

To judge the performance of our algorithm with respect to the other five algorithms, it is useful to express the results as a measure of relative improvement (∆Q) using the following formula:

    ∆Q(X) = (Measure_Alg(CP) − Measure_Alg(X)) / Measure_Alg(X)          (21)

where Measure_Alg corresponds to the Fidelity or the SRD measure, and X is substituted with FR, ATS, SRDI, PME and MP in turn.
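For instance, taking the SRD (HST) values for the "football" video in Table 7, ∆Q(PME) = (3.335 − 2.846) / 2.846 ≈ 17.2%, which is the corresponding entry in Table 8.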


Table 8 details, for each video, the relative edge in performance of our algorithm over the other five algorithms. It can be seen that when the number of key frames per shot is small, as in the case of the "eeopen" and "bugsbunny" videos (1.136 and 1.145 key frames per shot, respectively), the differences between the CP, FR, ATS and MP algorithms are slight. Exceptions are the SRDI algorithm, about 5% worse than the CP algorithm, and the PME algorithm on the "bugsbunny" video, about 4% better than the CP algorithm (note that the PME algorithm extracts 182 key frames and the CP algorithm 95). For the other videos the PME algorithm shows, instead, the worst results (excluding the MP algorithm). As the number of key frames per shot increases, the gap becomes more marked. With the exception of the "basketball2" video, the performances of the FR algorithm and the CP algorithm do not exhibit large differences; only on the "news" video does the FR algorithm (slightly) outperform the CP algorithm. The most interesting experiment is the one regarding the "basketball2" video: the high dynamics and the length of the shot cause greatly diverging results. The performance of the SRDI algorithm exhibits a variable behavior depending on the measure employed, while the MP algorithm shows the worst results.

Table 9 lists the ∆Q measure of relative improvement computed as the percentage average over all six videos. Overall, our algorithm outperforms the other five algorithms. Its advantage over the FR algorithm is negligible, but we must remember that the FR algorithm uses an optimization strategy to allocate the key frames within a shot, and requires that the number of key frames be given a priori. For the ATS and SRDI algorithms the gap is more noticeable, and since the numbers of key frames extracted are the same, the results depend only on the selection strategies adopted by the algorithms. The poor performance of the MP algorithm is, of course, to be expected, since it extracts only one key frame from each shot.

More interesting is the performance of the PME algorithm, which extracts key frames in a totally automatic way based on motion vectors. The results of the SRD measure show that the CP algorithm outperforms the PME algorithm by about 10%. Key frames selected only on the basis of motion do not seem to be successful in visually representing the content of the video.


Table 5. Compression ratios (CRatio) and key frame statistics of the algorithms on the six video sequences. KFPerShot indicates the average number of key frames for each shot, while nUsedShots and FrPerShot indicate the number of type A shots and the average number of frames per shot, respectively.

eeopen       CRatio  KeyFrames  KFPerShot  nTotShots  nUsedShots  FrPerShot
CP           0.980   25         1.136      24         22          53.500
FR           0.980   25         1.136      24         22          53.500
ATS          0.980   25         1.136      24         22          53.500
SRDI         0.980   25         1.136      24         22          53.500
PME          0.971   36         1.636      24         22          53.500
MP           0.982   22         1.000      24         22          53.500

nwanw1       CRatio  KeyFrames  KFPerShot  nTotShots  nUsedShots  FrPerShot
CP           0.991   61         1.743      39         35          168.256
FR           0.991   61         1.743      39         35          168.256
ATS          0.991   61         1.743      39         35          168.256
SRDI         0.991   61         1.743      39         35          168.256
PME          0.990   66         1.886      39         35          168.256
MP           0.995   35         1.000      39         35          168.256

news         CRatio  KeyFrames  KFPerShot  nTotShots  nUsedShots  FrPerShot
CP           0.993   35         2.917      12         12          397.083
FR           0.993   35         2.917      12         12          397.083
ATS          0.993   35         2.917      12         12          397.083
SRDI         0.993   35         2.917      12         12          397.083
PME          0.994   28         2.333      12         12          397.083
MP           0.997   12         1.000      12         12          397.083

football     CRatio  KeyFrames  KFPerShot  nTotShots  nUsedShots  FrPerShot
CP           0.988   82         2.929      28         28          239.286
FR           0.988   82         2.929      28         28          239.286
ATS          0.988   82         2.929      28         28          239.286
SRDI         0.988   82         2.929      28         28          239.286
PME          0.992   55         1.964      28         28          239.286
MP           0.996   28         1.000      28         28          239.286

bugsbunny    CRatio  KeyFrames  KFPerShot  nTotShots  nUsedShots  FrPerShot
CP           0.993   95         1.145      89         83          151.551
FR           0.993   95         1.145      89         83          151.551
ATS          0.993   95         1.145      89         83          151.551
SRDI         0.993   95         1.145      89         83          151.551
PME          0.986   182        2.193      89         83          151.551
MP           0.994   83         1.000      89         83          151.551

basketball2  CRatio  KeyFrames  KFPerShot  nTotShots  nUsedShots  FrPerShot
CP           0.983   15         15.000     1          1           894.000
FR           0.983   15         15.000     1          1           894.000
ATS          0.983   15         15.000     1          1           894.000
SRDI         0.983   15         15.000     1          1           894.000
PME          0.993   6          6.000      1          1           894.000
MP           0.999   1          1.000      1          1           894.000


Table 6. Fidelity measure results computed using the histogram (HST) and the HWD descriptors. The minimum (mi) and standard deviation (sd) of the Fidelity measures computed on each shot are also reported.

eeopen       Fid. HST  Fid. HST (mi)  Fid. HST (sd)  Fid. HWD  Fid. HWD (mi)  Fid. HWD (sd)
CP           0.762     0.069          0.228          0.907     0.217          0.185
FR           0.760     0.072          0.232          0.906     0.184          0.191
ATS          0.756     0.016          0.236          0.885     0.123          0.235
SRDI         0.761     0.077          0.236          0.899     0.298          0.184
PME          0.763     0.072          0.228          0.892     0.133          0.213
MP           0.770     0.069          0.207          0.901     0.184          0.203

nwanw1       Fid. HST  Fid. HST (mi)  Fid. HST (sd)  Fid. HWD  Fid. HWD (mi)  Fid. HWD (sd)
CP           0.642     0.013          0.325          0.778     0.192          0.310
FR           0.631     0.014          0.339          0.758     0.110          0.345
ATS          0.645     0.014          0.323          0.783     0.113          0.308
SRDI         0.619     0.046          0.343          0.762     0.035          0.313
PME          0.620     0.017          0.334          0.745     0.199          0.352
MP           0.613     0.007          0.354          0.733     0.015          0.371

news         Fid. HST  Fid. HST (mi)  Fid. HST (sd)  Fid. HWD  Fid. HWD (mi)  Fid. HWD (sd)
CP           0.607     0.166          0.258          0.860     0.662          0.125
FR           0.614     0.167          0.250          0.870     0.656          0.117
ATS          0.612     0.194          0.248          0.872     0.671          0.115
SRDI         0.578     0.142          0.278          0.851     0.656          0.129
PME          0.600     0.139          0.275          0.851     0.648          0.137
MP           0.584     0.074          0.283          0.836     0.588          0.158

football     Fid. HST  Fid. HST (mi)  Fid. HST (sd)  Fid. HWD  Fid. HWD (mi)  Fid. HWD (sd)
CP           0.627     0.293          0.208          0.858     0.512          0.133
FR           0.627     0.354          0.202          0.842     0.377          0.154
ATS          0.642     0.331          0.196          0.858     0.537          0.133
SRDI         0.611     0.268          0.214          0.840     0.447          0.147
PME          0.612     0.254          0.213          0.825     0.431          0.167
MP           0.601     0.243          0.233          0.834     0.183          0.185

bugsbunny    Fid. HST  Fid. HST (mi)  Fid. HST (sd)  Fid. HWD  Fid. HWD (mi)  Fid. HWD (sd)
CP           0.656     0.052          0.239          0.907     0.118          0.148
FR           0.656     0.052          0.239          0.904     0.118          0.150
ATS          0.665     0.065          0.226          0.906     0.141          0.141
SRDI         0.647     0.024          0.245          0.904     0.283          0.134
PME          0.678     0.004          0.232          0.910     0.351          0.126
MP           0.650     0.052          0.247          0.901     0.118          0.152

basketball2  Fid. HST  Fid. HST (mi)  Fid. HST (sd)  Fid. HWD  Fid. HWD (mi)  Fid. HWD (sd)
CP           0.093     0.093          0.000          0.556     0.556          0.000
FR           0.094     0.094          0.000          0.528     0.528          0.000
ATS          0.064     0.064          0.000          0.528     0.528          0.000
SRDI         0.081     0.081          0.000          0.552     0.552          0.000
PME          0.075     0.075          0.000          0.533     0.533          0.000
MP           0.031     0.031          0.000          0.407     0.407          0.000


Table 7. Shot Reconstruction Degree (SRD) measure results computed using the histogram (HST) and the HWD descriptors. The minimum (mi) and standard deviation (sd) of the SRD measures computed on each shot are also reported.

eeopen       SRD HST  SRD HST (mi)  SRD HST (sd)  SRD HWD  SRD HWD (mi)  SRD HWD (sd)
CP           4.882    2.214         1.211         7.591    3.814         1.700
FR           4.868    2.077         1.239         7.577    3.741         1.723
ATS          4.660    2.072         1.246         7.259    3.729         1.757
SRDI         4.693    1.984         1.189         7.314    3.416         1.684
PME          4.949    1.752         1.501         7.550    3.308         2.065
MP           4.846    1.842         1.283         7.563    3.682         1.749

nwanw1       SRD HST  SRD HST (mi)  SRD HST (sd)  SRD HWD  SRD HWD (mi)  SRD HWD (sd)
CP           3.887    1.474         1.162         6.648    2.928         1.634
FR           3.887    1.369         1.212         6.632    2.696         1.690
ATS          3.849    1.527         1.199         6.588    2.984         1.664
SRDI         3.587    0.636         1.295         6.104    1.402         1.959
PME          3.581    1.280         1.255         6.067    2.668         1.839
MP           3.642    0.897         1.306         6.311    1.858         1.919

news         SRD HST  SRD HST (mi)  SRD HST (sd)  SRD HWD  SRD HWD (mi)  SRD HWD (sd)
CP           3.520    2.167         1.372         6.186    4.355         2.064
FR           3.534    2.122         1.374         6.238    4.296         2.048
ATS          3.515    2.006         1.350         6.188    4.317         2.033
SRDI         3.209    1.415         1.504         5.663    2.974         2.273
PME          3.084    1.714         1.189         5.527    3.279         1.993
MP           3.147    1.513         1.537         5.651    3.218         2.357

football     SRD HST  SRD HST (mi)  SRD HST (sd)  SRD HWD  SRD HWD (mi)  SRD HWD (sd)
CP           3.335    2.294         0.766         5.856    4.270         1.076
FR           3.361    2.294         0.727         5.902    4.270         1.008
ATS          3.276    1.847         0.844         5.684    3.522         1.196
SRDI         3.067    1.817         0.834         5.327    3.319         1.230
PME          2.846    1.634         0.816         5.049    3.026         1.248
MP           2.814    1.467         0.802         5.121    2.809         1.334

bugsbunny    SRD HST  SRD HST (mi)  SRD HST (sd)  SRD HWD  SRD HWD (mi)  SRD HWD (sd)
CP           3.442    1.176         1.091         6.476    2.898         1.479
FR           3.444    1.176         1.095         6.483    2.898         1.488
ATS          3.339    1.201         1.070         6.275    2.769         1.464
SRDI         3.268    0.865         1.137         6.134    1.243         1.656
PME          3.597    1.170         1.151         6.521    2.324         1.586
MP           3.388    1.176         1.128         6.410    2.898         1.533

basketball2  SRD HST  SRD HST (mi)  SRD HST (sd)  SRD HWD  SRD HWD (mi)  SRD HWD (sd)
CP           2.269    2.269         0.000         4.137    4.137         0.000
FR           2.079    2.079         0.000         4.044    4.044         0.000
ATS          1.980    1.980         0.000         3.786    3.786         0.000
SRDI         2.356    2.356         0.000         4.120    4.120         0.000
PME          1.658    1.658         0.000         3.336    3.336         0.000
MP           0.693    0.693         0.000         2.027    2.027         0.000

Table 8. The relative improvement (∆Q) measured on the Fidelity and SRD results of the proposed algorithm with respect to each of the other algorithms tested. The results are in percentages.


eeopen       Fidelity HST  Fidelity HWD  SRD HST  SRD HWD
CP           0.0           0.0           0.0      0.0
FR           0.3           0.1           0.3      0.2
ATS          0.8           2.5           4.8      4.6
SRDI         0.1           0.9           4.0      3.8
PME          -0.1          1.7           -1.4     0.5
MP           -1.0          0.7           0.7      0.4

nwanw1       Fidelity HST  Fidelity HWD  SRD HST  SRD HWD
CP           0.0           0.0           0.0      0.0
FR           1.7           2.6           0.0      0.2
ATS          -0.5          -0.6          1.0      0.9
SRDI         3.7           2.1           8.4      8.9
PME          3.5           4.4           8.5      9.6
MP           4.7           6.1           6.7      5.3

news         Fidelity HST  Fidelity HWD  SRD HST  SRD HWD
CP           0.0           0.0           0.0      0.0
FR           -1.1          -1.1          -0.4     -0.8
ATS          -0.8          -1.4          0.1      0.0
SRDI         5.0           1.1           9.7      9.2
PME          1.2           1.1           14.1     11.9
MP           3.9           2.9           11.9     9.5

football     Fidelity HST  Fidelity HWD  SRD HST  SRD HWD
CP           0.0           0.0           0.0      0.0
FR           0.0           1.9           -0.8     -0.8
ATS          -2.3          0.0           1.8      3.0
SRDI         2.6           2.1           8.7      9.9
PME          2.5           4.0           17.2     16.0
MP           4.3           2.9           18.5     14.4

bugsbunny    Fidelity HST  Fidelity HWD  SRD HST  SRD HWD
CP           0.0           0.0           0.0      0.0
FR           0.0           0.3           -0.1     -0.1
ATS          -1.4          0.1           3.1      3.2
SRDI         1.4           0.3           5.3      5.6
PME          -3.2          -0.3          -4.3     -0.7
MP           0.9           0.7           1.6      1.0

basketball2  Fidelity HST  Fidelity HWD  SRD HST  SRD HWD
CP           0.0           0.0           0.0      0.0
FR           -1.1          5.3           9.1      2.3
ATS          45.3          5.3           14.6     9.3
SRDI         14.8          0.7           -3.7     0.4
PME          24.0          4.3           36.9     24.0
MP           200.0         36.6          227.4    104.1

Table 9. The average of the relative improvement (∆Q) measured on the Fidelity and SRD results of the proposed algorithm with respect to each of the other algorithms tested. The results are in percentages.

Overall  Fid. HST  Fid. HWD  SRD HST  SRD HWD
CP       0.0       0.0       0.0      0.0
FR       0.0       1.5       1.4      0.2
ATS      6.9       1.0       4.2      3.5
SRDI     4.6       1.2       5.4      6.3
PME      4.6       2.5       11.8     10.2
MP       35.5      8.3       44.5     22.4

7. An application for key frames

Our key frame extraction algorithm is currently being applied in the "Archivio di Etnografia e Storia Sociale - AESS" (Archive of Social History and Ethnography) [39]. The purpose of the AESS archive is the conservation, consultation and use of documents and images regarding the life and social transformation, the literature, the oral history, and the cultural artifacts of the Lombard region. The AESS web site has been designed and implemented at the CNR - ITC Unit of Milan for the project promoted by the Lombard Region (the Direzione Generale Culture, Identità e Autonomie della Lombardia) and the Interreg II Internum project of the European Union. The archive of the AESS web site stores information concerning the oral history of the Lombard region: it is mainly composed of popular songs and other audio and video records describing the popular traditions handed down from generation to generation, such as traditional fairs and customs.


The images and videos represent the occasions on which the audio was performed as songs, recorded as interviews, etc. Linked with the audio and image data are the books, journals, discs, DVDs, etc., on which this information is stored (printed, recorded, etc.). Video information is handled with the collaboration of the DISCo department of the University of Milano-Bicocca. Fig. 8 shows a page of the AESS web site where the videos are stored, and an example of a visual summary associated with a chosen video. Within the AESS web site, the video key frames are used both as visual content abstracts for the users and as entry points to video subsequences.

Fig. 8. The web site of the AESS archive where the CP key frame algorithm is currently employed.

8. Conclusion

In this paper we have presented an innovative algorithm for key frame extraction. By analyzing the differences between pairs of consecutive frames of a video sequence, the algorithm determines the complexity of the sequence in terms of changes in visual content as expressed by different frame descriptors. Similarity measures are computed for each descriptor and combined to form a frame difference measure. The algorithm, which does not exhibit the complexity of existing methods based, for example, on clustering or optimization strategies, can dynamically and rapidly select a variable number of key frames within each sequence. Another advantage is that it can extract the key frames on the fly: key frames can be determined while computing the frame differences, as soon as a second high curvature point has been detected. The performance of this algorithm has been compared with that of other key frame extraction algorithms based on different approaches. The summaries have been evaluated objectively with three quality measures: the Fidelity measure, the Shot Reconstruction Degree measure, and the Compression Ratio measure. Experimental results show that the proposed algorithm outperforms the other key frame extraction algorithms investigated.


Acknowledgements

The video indexing and analysis work presented here was supported by the Italian MURST FIRB project MAIS (Multi-channel Adaptive Information Systems) [40], and by the Regione Lombardia (Italy) within the INTERNUM project, aimed at facilitating access to the cultural video documentaries of the AESS (Archivio di Etnografia e Storia Sociale).

References

[1]. Aigrain P., Zhang H., and Petkovic D. Content-Based Representation and Retrieval of Visual Media: A State of the Art Review. Multimedia Tools and Applications, 1996;3:179-202.
[2]. Dimitrova N., Zhang H., Shahraray B., Sezan M., Huang T. and Zakhor A. Applications of video-content analysis and retrieval. IEEE MultiMedia, 2002;9(3):44-55.
[3]. Antani S., Kasturi R. and Jain R. A survey on the use of pattern recognition methods for abstraction, indexing and retrieval of images and video. Pattern Recognition, 2002;35:945-965.
[4]. Schettini R., Brambilla C., Cusano C., Ciocca G. Automatic classification of digital photographs based on decision forests. Int. Journal of Pattern Recognition and Artificial Intelligence, 2004;18(5):819-846.
[5]. Fredembach C., Schröder M., and Süsstrunk S. Eigenregions for Image Classification. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 2004;26(12):1645-1649.
[6]. Hauptmann A. G., Jin R., and Ng T. D. Video Retrieval using Speech and Image Information. Proc. Electronic Imaging Conf. (EI'03), Storage and Retrieval for Multimedia Databases, Santa Clara, CA, 2003;5021:148-159.
[7]. Tonomura Y., Akutsu A., Otsuji K., and Sadakata T. VideoMAP and VideoSpaceIcon: Tools for anatomizing video content. Proc. ACM INTERCHI '93 Conference, 1993:131-141.
[8]. Ueda H., Miyatake T., and Yoshizawa S. IMPACT: An interactive natural-motion-picture dedicated multimedia authoring system. Proc. ACM CHI '91 Conference, 1991:343-350.
[9]. Rui Y., Huang T. S. and Mehrotra S. Exploring Video Structure Beyond the Shots. Proc. IEEE Int. Conf. on Multimedia Computing and Systems (ICMCS), Texas, USA, 1998:237-240.
[10]. Pentland A., Picard R., Davenport G. and Haase K. Video and Image Semantics: Advanced Tools for Telecommunications. IEEE MultiMedia, 1994;1(2):73-75.
[11]. Zhonghua Sun, Fu Ping. Combination of Color and Object Outline Based Method in Video Segmentation. Proc. SPIE Storage and Retrieval Methods and Applications for Multimedia, 2004;5307:61-69.
[12]. Arman F., Hsu A. and Chiu M.Y. Image Processing on Compressed Data for Large Video Databases. Proc. ACM Multimedia '93, Anaheim, CA, 1993:267-272.
[13]. Zhuang Y., Rui Y., Huang T.S., Mehrotra S. Key Frame Extraction Using Unsupervised Clustering. Proc. of ICIP'98, Chicago, USA, 1998;1:866-870.
[14]. Girgensohn A., Boreczky J. Time-Constrained Keyframe Selection Technique. Multimedia Tools and Applications, 2000;11:347-358.
[15]. Gong Y. and Liu X. Generating optimal video summaries. Proc. IEEE Int. Conference on Multimedia and Expo, 2000;3:1559-1562.
[16]. Li Zhao, Wei Qi, Stan Z. Li, S.Q. Yang, H.J. Zhang. Key-frame Extraction and Shot Retrieval Using Nearest Feature Line (NFL). Proc. ACM Int. Workshops on Multimedia Information Retrieval, 2000:217-220.
[17]. Hanjalic A., Lagendijk R. L., Biemond J. A New Method for Key Frame Based Video Content Representation. In: Image Databases and Multimedia Search, World Scientific, Singapore, 1998.
[18]. Hoon S. H., Yoon K., and Kweon I. A New Technique for Shot Detection and Key Frames Selection in Histogram Space. Proc. 12th Workshop on Image Processing and Image Understanding, 2000:475-479.
[19]. Narasimha R., Savakis A., Rao R. M. and De Queiroz R. A Neural Network Approach to Key Frame Extraction. Proc. of SPIE-IS&T Electronic Imaging Storage and Retrieval Methods and Applications for Multimedia, 2004;5307:439-447.
[20]. Calic J. and Izquierdo E. Efficient Key-Frame Extraction and Video Analysis. Proc. IEEE ITCC2002, Multimedia Web Retrieval Section, 2002:28-33.


[21]. Liu T., Zhang H.J., Qi F.H. A novel video key-frame-extraction algorithm based on perceived motion energy model. IEEE Trans. Circuits and Systems for Video Technology, 2003;13(10):1006-1013.
[22]. Zhang H.J., Wu J., Zhong D. and Smoliar S.W. An integrated system for content-based video retrieval and browsing. Pattern Recognition, 1997;30(4):643-658.
[23]. Fayzullin M., Subrahmanian V.S., Picariello A., Sapino M.L. The CPR Model for Summarizing Video. Proc. of the 1st ACM Int. Workshop on Multimedia Databases, New Orleans, LA, USA, 2002:2-9.
[24]. Lagendijk R.L., Hanjalic A., Ceccarelli M.P., Soletic M., Persoon E.H. Visual Search in a SMASH System. Proc. of ICIP'96, 1996:671-674.
[25]. Ngo C.W., Ma Y.F., and Zhang H.J. Video Summarization and Scene Detection by Graph Modeling. IEEE Transactions on Circuits and Systems for Video Technology, 2005;15(2):296-305.
[26]. Chang H.S., Sull S., Lee S.U. Efficient Video Indexing Scheme for Content-Based Retrieval. IEEE Trans. on Circuits and Systems for Video Technology, 1999;9(8):1269-1279.
[27]. Liu T., Zhang X., Feng J., Lo K.T. Shot reconstruction degree: a novel criterion for key frame selection. Pattern Recognition Letters, 2004;25:1451-1457.
[28]. Fernando A.C., Canagarajah C.N., Bull D.R. Fade-In and Fade-Out Detection in Video Sequences Using Histograms. Proc. ISCAS 2000 - IEEE International Symposium on Circuits and Systems, 2000;IV:709-712.
[29]. Swain M., Ballard D. Color Indexing. International Journal of Computer Vision, 1991;7(1):11-32.
[30]. Smith J.R., Chang S.F. Tools and Techniques for Color Image Retrieval. IS&T/SPIE Storage and Retrieval for Image and Video Databases IV, 1996;2670:426-437.
[31]. Ciocca G., Gagliardi I., Schettini R. Quicklook²: An Integrated Multimedia System. International Journal of Visual Languages and Computing, Special issue on Querying Multiple Data Sources, 2001;12:81-103.
[32]. Gonzalez R. and Woods R. Digital Image Processing. Addison Wesley, 1992:414-428.
[33]. Idris F., Panchanathan S. Storage and retrieval of compressed images using wavelet vector quantization. Journal of Visual Languages and Computing, 1997;8:289-301.
[34]. Scheunders P., Livens S., Van de Wouwer G., Vautrot P., Van Dyck D. Wavelet-based texture analysis. Int. Journal of Computer Science and Information Management, 1998;1(2):22-34.
[35]. Chetverikov D. and Szabo Zs. A Simple and Efficient Algorithm for Detection of High Curvature Points in Planar Curves. Proc. 23rd Workshop of the Austrian Pattern Recognition Group, 1999:175-184.
[36]. Latecki L., DeMenthon D., Rosenfeld A. Extraction of key frames from videos by polygon simplification. Int. Symposium on Signal Processing and its Applications, 2001:643-646.
[37]. Lefevre S., Holler J., Vincent N. A review of real-time segmentation of uncompressed video sequences for content-based search and retrieval. Real-Time Imaging, 2003;9:73-98.
[38]. Daubechies I., Sweldens W. Factoring wavelet transforms into lifting steps. J. Fourier Anal. Appl., 1998;4(3):247-269.
[39]. AESS Archive: http://aess.itc.cnr.it/index.htm.
[40]. MAIS Consortium. MAIS: Multichannel Adaptive Information Systems. http://black.elet.polimi.it/mais/.
