Real-time feature extraction from video stream data for stream ...



2.3. Segment Detection

this problem by introducing a semi-automatic approach that finds its own template for

anchorshots. Based on key frames for each shot, they compute the dissimilarity of each

shot with all other shots. Then they look for the shot that best matches the other shots

throughout the show and assume that shot to be a good template for an anchorshot.

The disadvantages of this approach are quite obvious: again, this approach only holds a

single model for anchorshots and will not work well in settings where anchorshots vary

a lot. Especially when different anchorshots do not share the same background, which

might be due to different camera angles, their matching does not work well. Furthermore,

finding the anchorshot template requires comparing each shot to all other shots

within the show. Hence, the approach can hardly be adapted to a real-time environment,

as each shot can only be compared to the shots that precede it. Thus, the approach would have

to be approximated to be usable in a stream environment.
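
The template search described above can be sketched as follows. This is only an illustration under stated assumptions, not the cited authors' implementation: key frames are assumed to be normalized color histograms, dissimilarity is assumed to be the L1 distance, and a shot is assumed to "match" another shot when their distance does not exceed a threshold. The names `l1_distance`, `find_template` and `threshold` are hypothetical.

```python
def l1_distance(a, b):
    """Assumed dissimilarity measure: L1 distance between two histograms."""
    return sum(abs(x - y) for x, y in zip(a, b))


def find_template(key_frames, threshold):
    """Return the index of the shot whose key frame matches the most other shots.

    key_frames: one feature vector (e.g. a normalized histogram) per shot.
    threshold:  hypothetical cut-off below which two shots count as a match.
    Returns None if there are no shots. Ties go to the earliest shot.
    """
    best_index, best_matches = None, -1
    for i, frame_i in enumerate(key_frames):
        # Every shot is compared to every other shot: O(n^2) comparisons.
        matches = sum(
            1
            for j, frame_j in enumerate(key_frames)
            if i != j and l1_distance(frame_i, frame_j) <= threshold
        )
        if matches > best_matches:
            best_index, best_matches = i, matches
    return best_index


# Toy example: shots 0, 2 and 4 look alike, shots 1 and 3 do not.
frames = [[0.9, 0.1], [0.1, 0.9], [0.88, 0.12], [0.5, 0.5], [0.91, 0.09]]
template = find_template(frames, threshold=0.1)
```

The nested loop makes the quadratic cost visible: the template can only be chosen once all shots of the show are known, which is the obstacle for a stream setting noted above.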

Face detection

Since the lighting, studio design or dominant colors of anchorshots can vary between

different news shows or over time, Avrithis et al. [Avrithis et al., 2000] and others have

based their anchorshot detection algorithms on the only characteristic that cannot

vary: the presence of an anchorperson. They identify anchorshots by detecting the

anchorperson’s face with a face detection module. Their module recognizes faces by color

matching and shape recognition. In contrast to other approaches, they do not try to

identify people. They are only interested in the presence or absence of a person’s face

somewhere in the foreground. Because of this, interviewers, reporters or politicians at

press conferences might incorrectly be labeled as anchorpersons and various news shots

will consequently be classified as anchorshots. Günsel’s approach [Gunsel et al., 1996]

overcomes these problems by also taking into account the position and size of the detected

faces.
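
How position and size might be used to filter detected faces can be sketched as follows. This is a hypothetical heuristic for illustration only, not the method of the cited work: it assumes a face detector has already produced a bounding box `(x, y, width, height)` in pixels, and all thresholds and the name `looks_like_anchor_face` are assumptions.

```python
def looks_like_anchor_face(face, frame_width, frame_height,
                           min_area_ratio=0.02, max_area_ratio=0.30,
                           region=(0.2, 0.1, 0.8, 0.6)):
    """Accept a detected face only if its size and position are plausible.

    face:   (x, y, width, height) bounding box in pixels.
    region: hypothetical (left, top, right, bottom) bounds, as fractions of
            the frame, in which the face center is expected to lie.
    """
    x, y, w, h = face
    # Relative size of the face compared to the whole frame.
    area_ratio = (w * h) / (frame_width * frame_height)
    # Face center in relative frame coordinates (0..1).
    center_x = (x + w / 2) / frame_width
    center_y = (y + h / 2) / frame_height

    left, top, right, bottom = region
    size_ok = min_area_ratio <= area_ratio <= max_area_ratio
    position_ok = left <= center_x <= right and top <= center_y <= bottom
    return size_ok and position_ok
```

A large, centrally placed face passes the check, while a small face in a corner of the frame (for example a person in the background) is rejected.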

Frame similarity

Similar to Hanjalic et al. [Hanjalic et al., 1999], other authors try to find anchorshots

by looking for shots throughout a video that share a set of visual features. These

approaches exploit the characteristic that anchorshots are “extremely similar among themselves, and

also frequent compared to other speech/report shots” [Ide et al., 1998]. Therefore, they

can be found by clustering all shots of a news show and assuming that the “largest

and most dense cluster would be the anchorshots”. As this technique is unsupervised

and does not need any anchorshot model, it promises to perform better in unknown

environments, such as news shows we have never seen before. On the other hand, it is

unlikely that such techniques outperform approaches that use more a priori assumptions or have

anchorshot models for the specific news show available.

One approach using cluster analysis was presented by Gao and Tang [Gao and Tang, 2002]

in 2002. They apply a graph-theoretical cluster analysis (GTC Analysis) on key-frames

from news shows. By constructing a minimum spanning tree over vertices that represent

those key-frames and cutting edges whose weights exceed a given threshold, clusters are obtained. All

clusters with more than two nodes in them are then taken as potential anchorshots. In
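
The minimum-spanning-tree clustering step can be sketched as follows. This is a minimal illustration under stated assumptions, not the cited implementation: key-frames are assumed to be given as feature values with a user-supplied distance function, the tree is built with Prim's algorithm on the complete graph, and the names `mst_edges`, `gtc_clusters`, `threshold` and `min_size` are hypothetical.

```python
def mst_edges(points, distance):
    """Prim's algorithm on the complete graph; returns (weight, u, v) edges."""
    n = len(points)
    if n == 0:
        return []
    in_tree = [False] * n
    best = [(float("inf"), -1)] * n  # (cheapest edge weight, tree neighbor)
    best[0] = (0.0, -1)
    edges = []
    for _ in range(n):
        # Pick the vertex outside the tree with the cheapest connecting edge.
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i][0])
        in_tree[u] = True
        if best[u][1] != -1:
            edges.append((best[u][0], best[u][1], u))
        for v in range(n):
            if not in_tree[v]:
                d = distance(points[u], points[v])
                if d < best[v][0]:
                    best[v] = (d, u)
    return edges


def gtc_clusters(points, distance, threshold, min_size=3):
    """Cut MST edges heavier than threshold; keep clusters with > 2 nodes."""
    parent = list(range(len(points)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Keeping only the light edges is equivalent to cutting the heavy ones.
    for weight, u, v in mst_edges(points, distance):
        if weight <= threshold:
            parent[find(u)] = find(v)

    groups = {}
    for i in range(len(points)):
        groups.setdefault(find(i), []).append(i)
    return [members for members in groups.values() if len(members) >= min_size]


# Toy example with 1-D "features": three similar key-frames form one cluster.
features = [0.0, 0.1, 0.2, 5.0, 5.1, 9.0]
candidates = gtc_clusters(features, lambda a, b: abs(a - b), threshold=0.5)
# candidates == [[0, 1, 2]]
```

In the toy example the pair `[3, 4]` and the singleton `[5]` are discarded, because only clusters with more than two nodes are kept as potential anchorshots.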
