Real-time feature extraction from video stream data for stream ...

2. Video Segmentation and Tagging

are approaches that are able to discover such segments without any further domain knowledge. Two of them are detailed in this section.

Rule-based segment detection

This approach goes back to Aigrain, Joly, and Longueville [Aigrain et al., 1997]. They claim that microscopical changes in movies and shows are emphasized by stylistic devices. Examples of such devices are gradual transition effects, modifications of the audio track, and variations in the editing rhythm. Hence we can take the presence of these stylistic devices as clues for solving the video segmentation task. In the above-mentioned paper, the authors define a total of ten rules that enable us to group shots together and thereby obtain a segmentation of a video on the scene level. I pick three types of rules to demonstrate what they look like. For more details, please see the original work.

Transition effect rules: Possible transitions in videos are cuts (C) and gradual transitions (GT). Hence the transitions occurring in a video file can be seen as a word over the alphabet {C, GT}. Transition effect rules are based on the recognition of sub-words that follow a certain pattern. For example, if the sub-word C^i GT^j C^k with i, k > 2 and j = 1 is found, that is, a single gradual transition enclosed by at least three cuts on either side, we can assume that there is a segment boundary right before the beginning of the gradual transition.
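The pattern search described above can be sketched in a few lines of Python. This is only an illustration and not the implementation from [Aigrain et al., 1997]: the function name `transition_boundaries` and the encoding of the transition sequence as a list of the strings "C" and "GT" are assumptions made for this example.

```python
import re

# Hypothetical encoding: one character per transition ("G" stands for GT).
_SYMBOLS = {"C": "C", "GT": "G"}

# A single G with at least three C on either side, i.e. C^i GT^1 C^k with i, k > 2.
# Lookbehind and lookahead do not consume the cuts, so a run of cuts can
# serve as the right context of one match and the left context of the next.
_PATTERN = re.compile(r"(?<=CCC)G(?=CCC)")


def transition_boundaries(transitions):
    """Return the indices of gradual transitions that match the pattern.

    Following the rule, the segment boundary is assumed to lie right
    before each returned transition.
    """
    # Unknown symbols raise a KeyError instead of being silently accepted.
    word = "".join(_SYMBOLS[t] for t in transitions)
    return [match.start() for match in _PATTERN.finditer(word)]
```

Note that two directly adjacent gradual transitions never match, because each of them lacks the three cuts on one side. This corresponds to the condition j = 1.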

Shot repetition rules: Especially in a shot/reverse-shot setting, the same shot is repeated with just a couple of shots in between. The shot repetition rule aims at detecting such shots by comparing the first frame after a transition with representative frames of the last three shots. Besides interviews or movies using the shot/reverse-shot technique, this rule will turn out to be useful for handling frames that are overexposed due to photographic flashlights. This will be further described in chapter 7. On the other hand, applying the rule can result in the false grouping of shots. For example, anchor shots in news shows occur repeatedly. Nevertheless, they belong to different news stories and should hence not be grouped.
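The comparison step of the shot repetition rule might look roughly like the following sketch. The rule itself does not prescribe how two frames are compared. The histogram intersection used here, the threshold of 0.9, the representation of a frame as a flat, non-empty list of gray values between 0 and 255, and all function names are assumptions chosen for illustration only.

```python
def histogram(frame, bins=8):
    """Normalized intensity histogram of a frame (flat list of values 0..255)."""
    counts = [0] * bins
    for value in frame:
        counts[value * bins // 256] += 1
    total = len(frame)
    return [count / total for count in counts]


def similarity(frame_a, frame_b):
    """Histogram intersection in [0, 1]; 1 means identical histograms."""
    hist_a, hist_b = histogram(frame_a), histogram(frame_b)
    return sum(min(a, b) for a, b in zip(hist_a, hist_b))


def find_repeated_shot(first_frame, recent_keyframes, threshold=0.9):
    """Return the index of the most similar recent shot, or None.

    recent_keyframes holds one representative frame for each of the
    last three shots, oldest first. A shot only counts as repeated if
    its similarity reaches the (assumed) threshold.
    """
    best_index, best_score = None, threshold
    for index, keyframe in enumerate(recent_keyframes):
        score = similarity(first_frame, keyframe)
        if score >= best_score:
            best_index, best_score = index, score
    return best_index
```

A `collections.deque` with `maxlen=3` would be a natural container for `recent_keyframes` in a streaming setting, since it discards the oldest representative frame automatically when a new shot is appended.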


Editing rhythm similarity rule: Editing rhythm corresponds to the number of transitions occurring in a video. The more transitions there are, the shorter each single shot is and the faster the movie appears to the viewer. Hence the shot duration in action scenes will be rather short in comparison to dialogs. This is utilized by the editing rhythm rule. As soon as a shot is three or four times longer than the average of the last ten shots, the shot is likely to be important and can hence be seen as a segment boundary.
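As a rough illustration, the editing rhythm rule can be expressed as a sliding-window check over shot durations. The function name `rhythm_boundaries`, the default factor of 3.0, and the decision to skip the check until ten earlier shots are available are assumptions of this sketch and are not taken from the original work.

```python
from collections import deque


def rhythm_boundaries(durations, factor=3.0, window=10):
    """Return indices of shots that are unusually long.

    A shot is reported if its duration is at least `factor` times the
    mean duration of the `window` shots directly before it.
    """
    recent = deque(maxlen=window)  # durations of the last `window` shots
    boundaries = []
    for index, duration in enumerate(durations):
        # Only decide once a full window of earlier shots is available.
        if len(recent) == window:
            average = sum(recent) / window
            if duration >= factor * average:
                boundaries.append(index)
        recent.append(duration)
    return boundaries
```

Because the long shot itself enters the window afterwards, it raises the average for the following shots. This makes it less likely that several consecutive long shots are all reported as boundaries.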

Audio-based segment detection

Segment boundaries usually coincide with boundaries in both the video and the audio stream. Hence taking the audio data into account can be very helpful for the detection of segments. Commercial spots, for example, most often consist of many shots and transitions, but the audio stream is probably uninterrupted. Furthermore, constant background


