Real-time feature extraction from video stream data for stream ...

2. Video Segmentation and Tagging

Motion vectors

In order to distinguish between gradual transitions and special camera effects, Zhang

et al. [Zhang et al., 1993] proposed a method to infer the reason for high differences

between consecutive frames. Their approach is based on motion vectors. By tracking the

movement of single pixels between consecutive frames, we obtain motion vectors. Tracking

pixels can be done by searching for pixels with the same color in two consecutive frames.

In case there is more than one pixel having the same color in the second image, the

spatially closest one is assumed to be the one we are looking for. For special effects like

pans, zoom-ins and zoom-outs these motion vectors generate typical fields, shown in

figure 2.8.

Figure 2.8.: Illustration of motion vectors due to camera panning, zoom-out and zoom-in.

Figure taken from [Zhang et al., 1993].
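
The pixel tracking described above can be sketched as follows. This is a minimal illustration, not the implementation of [Zhang et al., 1993]: the function name `motion_vectors`, the representation of a frame as a 2D list of color values, and the use of Euclidean distance for "spatially closest" are assumptions made for the example.

```python
from math import hypot


def motion_vectors(frame_a, frame_b):
    """Match each pixel of frame_a to the closest same-colored pixel of frame_b.

    frame_a, frame_b: 2D lists of hashable color values (hypothetical format).
    Returns a dict mapping (row, col) in frame_a to the displacement
    (d_row, d_col) towards the matched pixel in frame_b.
    """
    # Index all positions of frame_b by their color.
    positions = {}
    for r, row in enumerate(frame_b):
        for c, color in enumerate(row):
            positions.setdefault(color, []).append((r, c))

    vectors = {}
    for r, row in enumerate(frame_a):
        for c, color in enumerate(row):
            candidates = positions.get(color)
            if not candidates:
                continue  # color does not occur in frame_b: no vector
            # Several candidates: take the spatially closest one.
            br, bc = min(candidates, key=lambda p: hypot(p[0] - r, p[1] - c))
            vectors[(r, c)] = (br - r, bc - c)
    return vectors
```

For a single colored pixel that moves one column to the right between two frames, the sketch yields the displacement (0, 1) for that pixel. Exact color matching is fragile on real video, so a practical variant would presumably compare colors within a tolerance.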

If all computed motion vectors point in the same direction, the difference of the images

most likely occurred due to a pan of the camera. In case all motion vectors are directed

from the image border towards a single center of focus, we detect a zoom-out. Analogously,

if all motion vectors are directed from a single center of focus towards the image border,

we detect a zoom-in. Of course we should ignore small disparities that might occur due

to the movement of objects or noise. If the extracted motion vectors do not match any

of the three patterns, we assume that a gradual transition is the reason for the difference

of the frames.
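
The decision rule above can be sketched in code. The following is an illustrative sketch under stated assumptions, not the procedure of [Zhang et al., 1993]: the function name `classify_motion_field`, the thresholds `agree` and `cos_tol` (used to ignore small disparities), and passing the center of focus as a known parameter are all choices made for this example.

```python
from math import hypot


def classify_motion_field(vectors, center, agree=0.9, cos_tol=0.9):
    """Label a motion vector field as 'pan', 'zoom-in', 'zoom-out' or 'no-pattern'.

    vectors: dict mapping (x, y) positions to displacements (dx, dy).
    center:  assumed center of focus (x, y).
    agree:   fraction of vectors that must fit a pattern (hypothetical value).
    cos_tol: minimum cosine between a vector and the expected direction.
    """
    eps = 1e-9
    # Ignore pixels that did not move.
    moving = [(p, v) for p, v in vectors.items() if hypot(*v) > eps]
    if not moving:
        return "no-pattern"
    n = len(moving)

    # Pan: (almost) all vectors point in the direction of the mean vector.
    mx = sum(v[0] for _, v in moving) / n
    my = sum(v[1] for _, v in moving) / n
    mean_len = hypot(mx, my)
    if mean_len > eps:
        aligned = sum(
            1 for _, (dx, dy) in moving
            if (dx * mx + dy * my) / (hypot(dx, dy) * mean_len) > cos_tol
        )
        if aligned / n >= agree:
            return "pan"

    # Zoom: compare each vector with the radial direction from the center.
    outward = inward = 0
    for (x, y), (dx, dy) in moving:
        rx, ry = x - center[0], y - center[1]
        r_len = hypot(rx, ry)
        if r_len <= eps:
            continue  # the center itself has no radial direction
        cos = (dx * rx + dy * ry) / (hypot(dx, dy) * r_len)
        if cos > cos_tol:
            outward += 1
        elif cos < -cos_tol:
            inward += 1
    if outward / n >= agree:
        return "zoom-in"   # vectors run from the center towards the border
    if inward / n >= agree:
        return "zoom-out"  # vectors run from the border towards the center
    return "no-pattern"    # the text attributes this case to a gradual transition
```

Requiring only a fraction of the vectors to agree, rather than all of them, is one simple way to tolerate the small disparities caused by object movement or noise.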

Instead of computing the motion vector information on the decoded image stream, this

information can also be obtained directly from MPEG-encoded video data. “However, the

block matching performed as part of MPEG encoding selects vectors based on compression

efficiency and thus often selects inappropriate vectors for image processing purposes.”

[Boreczky and Rowe, 1996]

2.3. Segment Detection

As one hour of video data usually consists of hundreds of shots, the shot level is

not the ideal level for indexing video data, since indexes would still be far too large to give

a good overview of the video content. Therefore “it is necessary to find more

macroscopic time objects, for instance larger sequences constituting a narrative unit

or sharing the same setting.” [Aigrain et al., 1997]. To solve this learning task, almost

all approaches rely on the detection of shot boundaries. Hence the detection of shot

