
International Journal of Advanced Intelligence
Volume 2, Number 1, pp. 37-55, July, 2010.
© AIA International Advanced Information Institute

A Method for Detecting Subtitle Regions in Videos Using Video Text Candidate Images and Color Segmentation Images

Yoshihide Matsumoto, Tadashi Uemiya, Masami Shishibori and Kenji Kita
Faculty of Engineering, The University of Tokushima
2-1 Minami-josanjima, Tokushima 770-8506, Japan
matsumoto@laboatec.com; uchikosi@helen.ocn.ne.jp; {bori;kita}@is.tokushima-u.ac.jp

Received (January 2010)
Revised (May 2010)

In this paper, a method for detecting text regions in digital videos that contain telop (superimposed on-screen text), such as dramas, movies, and news programs, is proposed. The typical characteristics of telop are that it does not move and that its edges are strong. This method takes advantage of these characteristics to produce video text candidate images. It then produces video text region images from both the video text candidate images and color segmentation images. The video text region images and the original image are used to identify the color of the telop. Finally, text regions are detected by growing neighboring pixels of the identified color. The experimental results show that the precision of this method was 80.36% and the recall was 77.55%, whereas the precision of the traditional method was 40.22% with a recall of 75.48%. Higher accuracy was achieved by using this new method.

Keywords: Video text candidate image; Color segmentation image; Video text region image; Multimedia information retrieval.

1. Introduction

In recent years, with the spread of the Internet, increased hardware specifications, and the development of imaging devices such as digital cameras and digital video cameras, there are more and more opportunities to accumulate large amounts of video content on personal computers. It is difficult to search efficiently for a required image or scene within this content, so information that clearly describes the content is needed. Such information usually includes cut points, camera work, sound, and subtitles. Subtitles often describe the subject being photographed or the topic. Subtitles also appear in sync with the video, making them noteworthy as useful strings that reflect the semantic content.

One of the well-known image-handling technologies focusing on subtitles is the Informedia project 1,2, where large-scale image data are processed using images from cut scenes, subtitle-recognition characters, and speech-recognition data. A method has been proposed for matching cooking instructions and cooking images using subtitles and closed captions. 3 A method to index the semantic attributes corresponding to scenes in news programs using closed captions has been proposed. 4 A method has also been proposed for recognizing text residing within a subtitle region. 5 To implement applied methods like these, it is first necessary to detect the temporal and spatial ranges of the subtitles in the image. The establishment of a highly accurate method for detecting subtitle regions is therefore desired.

Sato et al. 6 have proposed a traditional subtitle detection method, where macroblock coding information is used to detect subtitle regions in images compressed as MPEG. While this method allows for fast processing, its accuracy has not yet reached a practical level. Arai et al. 7 have focused on a feature of subtitles, called edge pairs, to propose another method, where subtitle regions are detected from the spatial distribution and temporal continuity of edge pairs. Although the detection accuracy of this method has reached a practical level, the absence of a learning function may decrease the accuracy as the text fonts change. Hori et al. 8 have proposed yet another method, where text candidate images are obtained from the logical products of low-dispersion images and immovable-edge images, followed by learning-based detection of subtitle regions. While this method leads to high recall, precision is low. Thus it tends to detect excessive regions as subtitles, resulting in subtitle text getting crushed. Additionally, there has been a proposal to increase the detection accuracy of subtitle regions by first creating text candidate images, and then using a learning-based classifier called the Support Vector Machine (SVM) 9 and a feature point extraction operator called the Harris Interest Operator (Harris operator) 10,11. Although this method 12 increases precision, it has its own issues, such as the need for training data and a decrease in recall.

This paper proposes a method for detecting subtitle regions with high accuracy by first generating video text candidate images in the same way as in traditional methods 7,8, followed by checking color segmentation images against the original image. In this method, text candidate images are obtained first in the same way as in the traditional method 8, based on the regions where little brightness change occurs between continuous frame images, and on the regions with no changes in edges. The subtitles within the text candidate images obtained this way are detected almost perfectly, but the background tends to be excessively detected at the same time. In other words, the recall is high while the precision is low. As a workaround, the text candidate images and the color segmentation images are combined, after which only the color segments that appear to be text are selected, thereby generating text region images with low background noise. The text region images thus obtained have few instances of the background being falsely detected as text. However, because subtitle regions are detected based on color segments, some characters in minute color segments of the subtitle text tend to escape detection. In other words, the precision is high while the recall decreases. In an effort to improve the recall, text color is used, assuming that the color information of the subtitles does not change. Specifically, the color information of the subtitles is determined using multiple text region images generated within continuous frames and the original image. Recall is then improved by growing neighboring pixels that have similar color information, thereby accurately detecting subtitle regions.

Section 2 introduces a traditional method for detecting subtitle regions using video text candidate images. Section 3 proposes a method for generating text region images using video text candidate images and color segmentation images, as well as a method for detecting subtitles by automatically setting the color of the subtitle text using text region images and the original image. Section 4 provides experiments for assessing the validity of the proposed method, along with the results and discussion. Finally, Section 5 presents the conclusion and describes future issues.

2. Overview of Traditional Methods

This section introduces a traditional method for detecting subtitle regions using text candidate images. Text candidate images are also used in the first phase of our proposed method as subtitle region images.

2.1. A method for generating video text candidate images using low-dispersion images and immovable edge images

Hori et al. 8 have proposed a method for generating video text candidate images from the logical products of low-dispersion images and immovable edge images. First, one low-dispersion image is created from an arbitrary number of consecutive brightness (luminance) frame images. If the arbitrary number is N, the brightness images of N frames are used to obtain the variance of the brightness of each pixel. In this method, we chose brightness images for 4 frames. Pixels whose variance is lower than a specified threshold are assigned a value of 1, and all other pixels 0, in order to obtain a binary low-dispersion image. The threshold value is set using discriminant analysis. Static regions such as subtitles have little change in brightness, so their variance is low; more dynamic regions have higher variance. Therefore, the resulting low-dispersion images tend to keep most of the subtitles intact.

Similarly, one immovable edge image is created from an arbitrary number of consecutive brightness frame images. First, binary edge images are obtained from the brightness images; wavelet transforms are used to detect the edges. Then the logical product of the edge images for N frames is obtained. In this method, we chose brightness images for 4 frames. The images obtained by this logical product are called immovable edge images, which have sharp edges on the boundaries with the background. Static pixels are prone to remain here, so the subtitles tend to remain in a similar way to low-dispersion images. Low-dispersion images and immovable edge images are obtained in the flow shown in Fig. 1. The logical product of the low-dispersion images and immovable edge images, obtained in the above manner, generates the video text candidate images.

Fig. 1. An illustration of making a video text candidate image from each video frame.
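To make the flow of Fig. 1 concrete, the sketch below reconstructs the low-dispersion/immovable-edge computation under stated assumptions: the input is a list of N grayscale frames as NumPy arrays, Otsu's method stands in for the discriminant-analysis threshold, and a Sobel gradient magnitude stands in for the wavelet-based edge detection used in the original method. It is an illustrative sketch, not the authors' implementation.

```python
# Illustrative sketch of generating a video text candidate image.
# Assumptions: `frames` is a list of N grayscale frames (NumPy uint8 arrays of
# equal size); Otsu's method replaces the discriminant-analysis threshold and
# a Sobel magnitude replaces the wavelet-based edge detection.
import numpy as np
from scipy import ndimage

def otsu_threshold(values, bins=256):
    # Discriminant-analysis (Otsu) threshold over a 1-D histogram of `values`.
    hist, edges = np.histogram(values.ravel(), bins=bins)
    centers = (edges[:-1] + edges[1:]) / 2
    w = hist.cumsum()
    m = (hist * centers).cumsum()
    w0, w1 = w[:-1], w[-1] - w[:-1]
    m0 = m[:-1] / np.maximum(w0, 1)
    m1 = (m[-1] - m[:-1]) / np.maximum(w1, 1)
    return centers[np.argmax(w0 * w1 * (m0 - m1) ** 2)]

def text_candidate_image(frames, edge_thresh=64.0):
    stack = np.stack([f.astype(np.float32) for f in frames])      # shape (N, H, W)

    # Low-dispersion image: per-pixel brightness variance over the N frames,
    # binarized so that nearly static pixels (e.g. subtitles) become 1.
    variance = stack.var(axis=0)
    low_dispersion = variance < otsu_threshold(variance)

    # Immovable edge image: logical product of the binary edge images of all frames.
    edge_images = []
    for f in stack:
        gx, gy = ndimage.sobel(f, axis=1), ndimage.sobel(f, axis=0)
        edge_images.append(np.hypot(gx, gy) > edge_thresh)
    immovable_edges = np.logical_and.reduce(edge_images)

    # Text candidate image: logical product of the two binary images.
    return np.logical_and(low_dispersion, immovable_edges)
```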

2.2. A method for detecting subtitles using SVM and the Harris operator

Hiramatsu et al. 12 have proposed a method which suppresses erroneous detection with the use of SVM and the Harris operator. In this method, video text candidate images are first generated, excluding as much of the background as possible except for the subtitles. The video text candidate images are then divided into blocks similar in size to the pre-determined text size, as shown in Fig. 2. A brightness histogram for each block is created from the white pixels remaining in that block. Each block is assessed using SVM, labeling subtitle-bearing blocks as positive and those without subtitles as negative.

The Harris operator, which yields high recall under image enlargement, is then applied to images determined to be subtitle regions by the SVM, in order to increase precision. The interest points detected by the Harris operator are abundant in parts with large color variation as well as along edges. Since in many cases subtitle regions are rendered in colors complementary to the surrounding image, it is expected that many interest points will be detected in the vicinity of subtitle regions. Therefore, blocks identified as positive by the SVM are detected as subtitle regions if they contain many interest points. Subtitle regions may not be recognized if the edges of the text reside within blocks. Subtitles are long text strings aligned horizontally. Therefore, to avoid this non-recognition issue, the number of interest points on the right and left sides of the region in question is used to determine whether that region is a subtitle region.
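As a rough, non-authoritative illustration of this block-wise classification, the sketch below uses scikit-learn's SVC and OpenCV's cornerHarris as stand-ins for the SVM and the Harris operator; the block size, the column-histogram feature, and the interest-point threshold are assumptions made for illustration, not the settings of Hiramatsu et al.

```python
# Sketch of block-wise subtitle detection with an SVM and Harris interest points.
# Assumptions: `candidate` is a binary text candidate image, `gray` the original
# grayscale frame (NumPy arrays), and `svm` an already trained classifier.
import numpy as np
import cv2
from sklearn.svm import SVC

BLOCK = 16  # assumed block size, comparable to the expected text height

def block_features(candidate):
    """Per-column white-pixel histogram for every BLOCK x BLOCK block."""
    h, w = candidate.shape
    feats, coords = [], []
    for y in range(0, h - BLOCK + 1, BLOCK):
        for x in range(0, w - BLOCK + 1, BLOCK):
            block = candidate[y:y + BLOCK, x:x + BLOCK]
            feats.append(block.sum(axis=0))
            coords.append((y, x))
    return np.array(feats, dtype=np.float32), coords

def detect_subtitle_blocks(candidate, gray, svm: SVC, min_points=10):
    feats, coords = block_features(candidate)
    labels = svm.predict(feats)                     # +1: subtitle block, -1: other
    harris = cv2.cornerHarris(np.float32(gray), 2, 3, 0.04)
    interest = harris > 0.01 * harris.max()         # illustrative interest-point mask
    detected = []
    for (y, x), label in zip(coords, labels):
        # keep SVM-positive blocks that also contain enough interest points
        if label > 0 and interest[y:y + BLOCK, x:x + BLOCK].sum() >= min_points:
            detected.append((y, x))
    return detected
```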



Fig. 2. An example of histogram data generation.

Fig. 3. An example of detecting the interest points by the Harris operator.

2.3. Issues with traditional methods

Traditional methods eliminate the background using the characteristics of subtitles found in images, and then recognize subtitle regions using an SVM trained with manually prepared positive and negative data, together with the interest points. However, such traditional methods have the following issues:

(i) Positive and negative data must be prepared manually so that the SVM can learn from them.

(ii) Subtitle text that does not fit within the divided blocks may make subsequent text recognition difficult.

To solve these issues, we focused on techniques for segmenting image regions. In the subsequent sections, we discuss a subtitle-region detection method based on a technique for segmenting image regions.

3. The Proposed Method

This paper proposes a method for detecting subtitle regions based on images that have been processed with color segmentation and on video text candidate images. We call the subtitle images generated using video text candidate images and color segmentation images "text region images." We first discuss a method for generating text region images using video text candidate images and color segmentation images. After that, we use the text region images and the original image to automatically set the text color, and discuss the process flow for detecting the final subtitle regions.

3.1. Generating text region images using color segmentation images

3.1.1. Introduction to the method

The process flow for detecting subtitles based on a color segmentation image is shown in Fig. 4. First, a video text candidate image is obtained in the same way as in the traditional method 8. At the same time, an image processed with color segmentation ("a color segmentation image") and color segmentation image data are obtained (Step 1 of Fig. 4). The color segmentation image data include the region numbers, the size of each region (the total number of pixels), the central coordinate (x, y), the color information of the regions (L*u*v*), and the coordinates that belong to the regions. A video text candidate image is created from four continuous frames, while a color segmentation image is created from the first frame that was used when creating the video text candidate image.

Then, these two images are used to eliminate noise (Step 2 of Fig. 4). The elimination proceeds in two ways: (1) by horizontally scanning the video text candidate image so that only the pixels within the subtitles remain, and (2) by checking the video text candidate image against the color segmentation image data in order to select only the color segments that appear to be subtitles. After this elimination process, we eliminate the edges of the subtitles, because it is common for subtitles to have outlines added to them (Step 3 of Fig. 4). Specifically, we take advantage of the fact that the bodies and edges of subtitle characters use different colors. We use the k-means method to classify the colors of the regions that contain the white pixels left after the noise elimination process. Finally, we supplement the text characters (Step 4 of Fig. 4) to improve recall. Specifically, we search each segmentation region around the pixels that remain as part of the subtitles at the end of Step 3, and add the regions that resemble the subtitle region in size and typical color. Each module is discussed in detail below.

Fig. 4. Outline of detecting text regions by using color segmentation images.

3.1.2. Generating color segmentation images

In this process, the region integration method is used to generate color segmentation images. Region integration is a method for dividing an image into multiple sets (regions) of pixels that have similar feature values and are spatially close, based on characteristics such as the pixel values and the texture. The reason we chose this method is the nature of subtitles: as discussed in the section on low-dispersion images, the brightness of subtitles varies little, and their color does not change much. In other words, all subtitles have more or less the same characteristics, which led us to speculate that the color segmentation process might successfully extract subtitle regions. The steps for integrating regions are given below; a code sketch follows the list. Figs. 5 and 6 show examples of color segmentation images generated using the region integration method.

Step 1 Search for pixels by raster scanning; flag any unlabeled, unclassified pixel and label it.

Step 2 Check the eight (8) pixels neighboring the flagged pixel, and assign them the same label as the flagged pixel if their pixel value is the same.

Step 3 Repeat Step 2 with the newly labeled pixels as the flagged pixels.

Step 4 If no pixels are labeled in Step 2, repeat Step 1.

Step 5 The process is complete when all pixels have been labeled. Sets (regions) of neighboring pixels with the same pixel value are obtained at this point. Proceed to the next step using the labeled pixels.

Step 6 Obtain the average pixel value among the pixels bearing the same label.

Step 7 Of the neighboring sets of pixels, integrate the two that have the smallest difference in the average pixel values obtained in Step 6.

Step 8 Repeat Steps 6 and 7. To avoid ending up with only a single set of pixels, a maximum allowable difference between average values is established as the condition for integration. [End of the steps of the region integration method]
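A compact sketch of these steps, under stated assumptions, is given below: `img` is an (H, W, 3) color image (e.g. in the L*u*v* space mentioned above), the initial labeling of identical-valued 8-neighbors uses SciPy's connected-component labeling, and merging stops once the smallest difference between adjacent region means exceeds `max_diff`. The greedy pairwise merge is quadratic and meant only to make the steps concrete.

```python
# Sketch of the region integration method (Steps 1-8). Not optimized; intended
# only to make the merging procedure concrete.
import numpy as np
from scipy import ndimage

def initial_labels(img):
    """Steps 1-5: 8-connected components of pixels sharing the same value."""
    h, w = img.shape[:2]
    labels = np.zeros((h, w), dtype=np.int64)
    offset = 0
    for color in np.unique(img.reshape(-1, img.shape[2]), axis=0):
        mask = np.all(img == color, axis=2)
        comp, n = ndimage.label(mask, structure=np.ones((3, 3)))
        labels[mask] = comp[mask] + offset
        offset += n
    return labels

def adjacent_pairs(labels):
    """Unordered pairs of region labels that touch horizontally or vertically."""
    pairs = set()
    for a, b in zip(labels[:, :-1].ravel(), labels[:, 1:].ravel()):
        if a != b:
            pairs.add((min(a, b), max(a, b)))
    for a, b in zip(labels[:-1, :].ravel(), labels[1:, :].ravel()):
        if a != b:
            pairs.add((min(a, b), max(a, b)))
    return pairs

def region_integration(img, max_diff=10.0):
    """Steps 6-8: repeatedly merge the adjacent pair with the closest mean color."""
    labels = initial_labels(img)
    means = {i: img[labels == i].mean(axis=0) for i in np.unique(labels)}
    while True:
        best, pair = max_diff, None
        for a, b in adjacent_pairs(labels):
            d = np.linalg.norm(means[a] - means[b])
            if d < best:
                best, pair = d, (a, b)
        if pair is None:                  # stopping criterion of Step 8
            return labels
        a, b = pair
        labels[labels == b] = a           # integrate region b into region a
        means[a] = img[labels == a].mean(axis=0)
        del means[b]
```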

Fig. 5. Original image.

Fig. 6. Color segmentation image.

Fig. 7. Noise elimination by scanning of white pixels.

3.1.3. Noise elimination

The noise elimination process is two-fold. The first phase starts with horizontal scanning of a video text candidate image as shown in Fig. 7, creating a histogram that tallies the white pixels. The scanning direction depends on the direction of the subtitles. Because white pixels are packed densely inside subtitles, the histogram shows locally high values where subtitles are found. Based on this observation, locations where the histogram values climb and fall sharply are identified, and only these locations are kept, thus narrowing down the subtitle-containing regions.

In the second phase of noise elimination, we take advantage of the characteristics of subtitles by using the image processed with color segmentation based on color information. Because each character of the subtitle has the same color information, we can predict that the background and the subtitles reside in different regions of the color segmentation image. We can also predict that the subtitle regions are narrower than the background. The noise elimination process takes advantage of these characteristics. First, the video text candidate image after the Phase 1 elimination and the color segmentation image data are checked against each other, and the ratio of white pixels in each region is measured. Next, regions whose white-pixel ratio is higher than a threshold value are made entirely white, and all other pixels are made black. Since subtitle regions are smaller than background regions, whether the white pixels of a region survive this process depends largely on whether the region is a subtitle region or background: small subtitle regions have a high white-pixel ratio and remain, while large background regions are removed. Fig. 9 shows an image after noise elimination. A code sketch of both phases follows.
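The hedged sketch below illustrates the two-phase elimination described above. It assumes `candidate` is the binary text candidate image and `labels` the color segmentation label image; the Phase 1 cut-off on the row histogram and the interpretation of the ratio threshold as a fraction are illustrative assumptions, not the paper's exact rules.

```python
# Sketch of two-phase noise elimination; thresholds here are illustrative.
import numpy as np

def eliminate_noise(candidate, labels, ratio_thresh=0.5):
    # Phase 1: horizontal scan -- tally white pixels per row and keep only rows
    # where the count is high, i.e. rows likely to contain subtitles.
    row_counts = candidate.sum(axis=1)
    keep_rows = row_counts > 0.5 * row_counts.max()     # illustrative cut-off
    phase1 = candidate & keep_rows[:, None]

    # Phase 2: for every color segment, measure the ratio of surviving white
    # pixels; segments above the threshold become all white, the rest black.
    out = np.zeros_like(phase1)
    for region in np.unique(labels):
        mask = labels == region
        if phase1[mask].mean() >= ratio_thresh:
            out[mask] = True
    return out
```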

Fig. 8. Noise elimination by ratio of white pixels.

Fig. 9. Image with noise eliminated.



3.1.4. Classification by k-means

Generally, each subtitle character consists of an edge (outline) part and the character body, each in its own color. After noise elimination, an image may still retain both of these parts. If the edge part remains, the entire character is crushed, making it difficult to identify the character, especially for a complicated character such as a kanji. The k-means method enables the classification of each pixel of the subtitle characters based on color information, and detects only the pixels that belong to the character bodies. In a video text candidate image with the noise eliminated, the colors of the regions to which the remaining white pixels belong are classified using k-means, as shown in Fig. 10. After the classification, only the regions whose colors belong to the class with the most members are kept. Fig. 11 shows an example of an image after classification by k-means.
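The sketch below illustrates this selection step, assuming `region_colors` maps each color-segment label to its mean color and `white_regions` is the set of labels still containing white pixels after noise elimination; scikit-learn's KMeans stands in for the clustering, with k = 3 reflecting the body/edge/background split discussed in Section 4.

```python
# Sketch of selecting the character-body class with k-means (illustrative).
import numpy as np
from sklearn.cluster import KMeans

def keep_character_class(region_colors, white_regions, k=3):
    ids = sorted(white_regions)
    colors = np.array([region_colors[i] for i in ids], dtype=np.float64)
    assignment = KMeans(n_clusters=k, n_init=10).fit_predict(colors)
    # Keep only the class with the most members, assumed to be the character bodies.
    largest = np.bincount(assignment).argmax()
    return {i for i, a in zip(ids, assignment) if a == largest}
```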

3.1.5. Character complementation

As can be seen in Fig. 11, images that have been classified by k-means tend to have high precision and low recall, leading to frequent missed detections. We therefore focus on the characteristics of each region and supplement the subtitle region. The regions that fall within a 16 x 16 pixel window around the remaining white pixels are searched, as shown in Fig. 12. The Euclidean distance is then calculated using the size of the region, the central coordinate of the region, and the color of the region as features. If the resulting Euclidean distance is less than a threshold value, that region is added as part of the subtitle. Fig. 13 shows the image after character complementation, i.e., the video text region image after the application of the method based on color segmentation images.
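The following sketch illustrates the complementation step under stated assumptions: `kept` is the set of region labels already accepted as subtitle, `features` maps every region label to a vector built from its size, central coordinate, and color, `labels` is the segmentation label image, and `white_pixels` are the coordinates of the remaining white pixels. The 16 x 16 window and the Euclidean-distance threshold follow the description above; the feature construction itself is an assumption.

```python
# Sketch of character complementation by Euclidean distance over region features.
import numpy as np

def complement_characters(kept, features, labels, white_pixels,
                          dist_thresh=30.0, window=16):
    h, w = labels.shape
    added = set(kept)
    for y, x in white_pixels:
        # regions overlapping the 16 x 16 window around this white pixel
        y0, y1 = max(0, y - window // 2), min(h, y + window // 2)
        x0, x1 = max(0, x - window // 2), min(w, x + window // 2)
        for region in np.unique(labels[y0:y1, x0:x1]):
            if region in added:
                continue
            # distance between the candidate region and the region the white
            # pixel belongs to, over (size, centroid, color) features
            d = np.linalg.norm(features[region] - features[labels[y, x]])
            if d < dist_thresh:
                added.add(region)
    return added
```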

Fig. 10. Classification of each pixel by k-means.

Fig. 11. Image after classification by k-means.

Fig. 12. An illustration of complementation of the text characters.

Fig. 13. An example of the video text region image after application of the proposed method.

3.2. Method for detecting subtitles by automatically setting the text color

3.2.1. Overview of the method

Text region images generated by the method described in the preceding section tend to miss the minute segmentation regions residing within the subtitles, which lowers the recall. We therefore apply a recall-improving technique based on the text color (Fig. 14). First, the color information of the subtitles is specified using the multiple text region images generated from continuous frames and the original picture image (Step 1 of Fig. 14). Then the text characters are supplemented (Step 2 of Fig. 14). The pixels remaining after supplementation are labeled, and regions that are too large are removed (Step 3 of Fig. 14). Each module is discussed in detail below.


Fig. 14. Outline of detecting text regions by specifying the text color.

Fig. 15. An illustration of color histogram generation.

3.2.2. Automatically setting the text color

Multiple text region images generated from continuous frames, together with the original picture image of the top frame used to generate each text region image, are used to specify the range of the subtitle text color. In this experiment, we focused on the pixels remaining in thirty (30) text region images. These pixels are checked against the original picture image to extract their RGB values. The 256 RGB gradation levels are then compressed into 16 levels to generate a histogram (Fig. 15). The gradation level with the most pixels is determined to be the color range of the text.
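A small sketch of this step follows, assuming `text_region_images` are the binary masks of the thirty text region images and `originals` the corresponding original RGB frames as NumPy arrays.

```python
# Sketch of estimating the subtitle color range from 16-level RGB histograms.
import numpy as np

def estimate_text_color(text_region_images, originals):
    bins = np.zeros((16, 16, 16), dtype=np.int64)
    for mask, frame in zip(text_region_images, originals):
        rgb = frame[mask] // 16                     # compress 256 levels to 16
        np.add.at(bins, (rgb[:, 0], rgb[:, 1], rgb[:, 2]), 1)
    r, g, b = np.unravel_index(bins.argmax(), bins.shape)
    # Return the RGB range covered by the most populated bin.
    low = (r * 16, g * 16, b * 16)
    high = (r * 16 + 15, g * 16 + 15, b * 16 + 15)
    return low, high
```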



3.2.3. Character supplementation and labeling

The eight (8) pixels surrounding each white pixel remaining after Step 7 of Fig. 14 (pixels that have been detected as being within the subtitle region) are searched. Pixels whose color resides within the range determined in Step 7 are made white. The eight pixels around these new white pixels are searched in the same way, until no more pixels are added. After the characters are supplemented, the pixels are labeled, and labeled components whose number of connected pixels exceeds a threshold value are removed. Fig. 16 shows an example of the final result after automatically setting the text color.
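The sketch below illustrates the supplementation and labeling step, assuming `detected` is the binary mask of pixels already identified as subtitle, `frame` the original RGB frame, and `(color_lo, color_hi)` the color range estimated above; the 128-pixel size limit follows the experimental setting in Section 4.

```python
# Sketch of growing subtitle pixels within the estimated color range and then
# removing overly large connected components.
import numpy as np
from scipy import ndimage

def supplement_and_label(detected, frame, color_lo, color_hi, max_size=128):
    in_range = np.all((frame >= color_lo) & (frame <= color_hi), axis=2)
    grown = detected.copy()
    while True:
        # grow by one 8-neighborhood ring, restricted to in-range pixels
        new = (ndimage.binary_dilation(grown, structure=np.ones((3, 3))) & in_range) | detected
        if (new == grown).all():
            break
        grown = new
    # label connected components and remove those that are too large
    labels, n = ndimage.label(grown, structure=np.ones((3, 3)))
    sizes = ndimage.sum(grown, labels, index=np.arange(1, n + 1))
    for i, size in enumerate(sizes, start=1):
        if size >= max_size:
            grown[labels == i] = False
    return grown
```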

Fig. 16. An example of a video text region image generated by the proposed method.

4. Assessment

4.1. Experimental method

We conducted an experiment to confirm the validity of our proposed method. As the experimental data, we used drama video data that includes subtitles, with full RGB color, a resolution of 352 x 240, and a frame rate of 29.97 fps. As the correct data, we used only the images of the subtitles in the overlay region of this drama. The correct data, the text candidate images, and the text region images were checked against one another for each pixel to calculate the precision and recall. Thirty images were selected at random from the drama for assessment. The assessment criteria, recall (r) and precision (p), are defined by formulae (1) and (2):

precision: p = N_d / (N_d + N_f)    (1)

recall:    r = N_d / (N_d + N_m)    (2)

where
N_d: number of correctly detected pixels
N_m: number of pixels that escaped detection
N_f: number of falsely detected pixels
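For concreteness, a small sketch of the pixel-level evaluation in Eqs. (1) and (2) follows, assuming `detected` and `truth` are binary masks of the detected and ground-truth subtitle pixels.

```python
# Sketch of computing pixel-level precision and recall per Eqs. (1) and (2).
import numpy as np

def precision_recall(detected, truth):
    n_d = np.logical_and(detected, truth).sum()      # correctly detected pixels
    n_f = np.logical_and(detected, ~truth).sum()     # falsely detected pixels
    n_m = np.logical_and(~detected, truth).sum()     # pixels that escaped detection
    precision = n_d / (n_d + n_f)
    recall = n_d / (n_d + n_m)
    return precision, recall
```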

The precision and recall of the traditional method are shown in Table 1.

The unit used in detection is the pixel. Detection is deemed correct if white pixels exist where the subtitles within the frame are displayed. It is deemed an escaped detection if no white pixels exist there. Detection is deemed false if white pixels exist in frames or locations where there are no subtitles.

Table 1. Evaluation of the traditional method.
Precision: 40.22%    Recall: 75.48%

The experimental criteria for each method are listed below:

• Method for detecting subtitles using segmentation region images
– The threshold value used for noise elimination in the second phase is 50.
– The number of classes for k-means is varied among 2, 3, 4, and 5.
– The threshold value of the Euclidean distance for character supplementation is varied among 10, 20, and 30.
• Method for detecting subtitles by automatically setting the text color
– Regions are removed by labeling if they include 128 or more connected pixels.

4.2. Experimental results

Fig. 17 shows how the detection accuracy shifts when the number of classes used by k-means in the text detection method with region segmentation images changes, and when the threshold value of the Euclidean distance used in character supplementation changes. In Fig. 17, K=N Precision represents the precision value when the number of classes under the k-means method is set to N (2, 3, 4, or 5), and K=N Recall represents the recall value when the number of classes is set to N. In addition, eucM represents the detection accuracy when the threshold for the Euclidean distance is set to M (10, 20, or 30) in character supplementation.

The results shown in Fig. 17 indicate that precision is higher than recall in every case and that recall remains steady. Changing the threshold for the Euclidean distance did not affect accuracy significantly. On the other hand, when the number of classes under k-means changed, both precision and recall were affected. When the number of classes was 3, the precision was highest, the decrease in recall was at a minimum, and the balance between these two measures was optimal, producing the best accuracy.

Each subtitle character generally consists of three parts: the background, the edges, and the character body. It appears that setting the number of classes to three (3) enabled appropriate classification and detection of the character bodies. When the number of classes was set to two (2), the background mixed into the selected class, resulting in lower precision. Higher numbers of classes such as 4 and 5 resulted in large drops in recall, with increased numbers of pixels that escaped detection. This is because subtitle characters do not consist of exactly the same color; rather, the color varies slightly from character to character. For example, subtitle characters that appear white were found to consist of four (4) smaller parts: mostly white, light gray, gray, and dark gray. Although the mostly white part has more pixels than the dark gray part, larger numbers of classes ultimately lower the probability of the mostly white part being selected. We can reason that, as a result, there were more pixels escaping detection and edges were falsely detected, significantly lowering recall.

Fig. 17. Experiment results.

Fig. 18 shows the results of an experiment comparing one of the traditional methods (video text candidate images) with our proposed method (text region images and final result images), using the parameters with which the accuracy was best in the experiment shown in Fig. 17 (the number of classes = 3, and the threshold for the Euclidean distance = 30).

Fig. 18. Experiment results.

Fig. 19. An example of a video text candidate image.

Fig. 20. An example of a video text region image.

Fig. 21. An example of a video text region image that uses a color segmentation image.

Fig. 22. An example of a video text region image with automatic setting of the telop color.

Fig. 18 shows that our proposed method brings about better results in both precision and recall compared to the traditional methods. Noise elimination was a factor in the improvement of precision. In rather static video scenes containing objects such as a building, many non-subtitle pixels remained in the video text candidate image, lowering precision. Our proposed method eliminated these non-subtitle pixels, improving precision. Additionally, the edges of video text candidate images had a strong tendency to remain in subtitles, as Fig. 19 indicates, and in many cases only the edge of a character remained. The fact that our proposed method enabled the supplementation of missing parts of a character, as in Fig. 20, may also have contributed to the improved precision.

The recall of the method based on region segmentation images did not significantly improve over the traditional methods. One reason is that color classification under k-means in our proposed method is based simply on the number of elements in each class. In cases where many edge pixels remained, this may have prevented supplementation of the subtitles almost completely, as Fig. 21 shows, resulting in lower recall. In the future, the location of each element in a class should be taken into consideration to develop an algorithm that enables more accurate selection of the class for the character body. Furthermore, we found that accuracy was higher in the version of our method where region segmentation images are used and the text color is then automatically set, compared to a version that uses region segmentation images alone. Figs. 21 and 22 show the same scene. We can see that Fig. 22 has fewer missing character pixels and more correctly detected pixels than Fig. 21. Subtitles that do not benefit from a method using region segmentation images alone can be improved if the text color is considered. The use of text color is likely the factor behind the improvement in recall.

5. Conclusion

This paper has proposed a method for detecting subtitle regions using region segmentation images. Assessment experiments confirmed that our proposed method has higher detection accuracy than a traditional method that uses video text candidate images. We believe that our proposed method has reached a practicable level because it can clearly detect the characters within the text regions, as shown in Figs. 20 and 22. However, a remaining problem is the large number of parameters and thresholds that must be set to produce video text candidate images. Future issues include a way to automatically set the parameters of our proposed method, and further improvement of accuracy by eliminating minute non-subtitle regions.



Acknowledgments

This study was financed in part by the Basic Scientific Research Grant (B) (17300036) and the Basic Scientific Research Grant (C) (17500644).

References

1. H.D. Wactler, A.G. Hauptmann, and M.J. Witbrock, Informedia News-on-Demand: Using Speech Recognition to Create a Digital Video Library, CMU Tech. Rep. CMU-CS-98-109, Carnegie Mellon University, 1998.
2. H.D. Wactler, M.G. Christel, Y. Gong, and A.G. Hauptmann, Lessons Learned from Building a Terabyte Digital Video Library, IEEE Comput., 32(2), pp. 66-73, 1999.
3. H. Miura, K. Takano, S. Hamada, I. Iide, O. Sakai and H. Tanaka, Video Analysis of the Structure of Food and Cooking Steps with the Corresponding, IEICE Journal, J86-D-II(11), pp. 1647-1656, 2003.
4. I. Iide, S. Hamada, S. Sakai and E. Tanaka, TV News Subtitles for the Analysis of the Semantic Dictionary Attributes, IEICE Journal, J85-D-II(7), pp. 1201-1210, 2002.
5. S. Mori, M. Kurakake, T. Sugimura, T. Shio and A. Suzuki, The Shape of Characters and the Background Characteristics Distinguish Correction Function by Using Dynamic Visual Character Recognition in the Subtitles, IEICE Journal, J83-D-II(7), pp. 1658-1666, 2000.
6. S. Sato, Y. Shinkura, Y. Taniguchi, A. Akutsu, Y. Sotomura and H. Hamada, Subtitles from the MPEG High-speed Video Coding Region of the Detection Method, IEICE Journal, J81-D-II(8), pp. 1847-1855, 1998.
7. K. Arai, H. Kuwano, M. Kurakage and T. Sugimura, The Video Frame Subtitle Display Detection Method, IEICE Journal, J83-D-II(6), pp. 1477-1486, 2000.
8. O. Hori and U. Mita, Subtitles for Recognition from the Video Division Robust Character Extraction Method, IEICE Journal, J84-D-II(8), pp. 1800-1808, 2001.
9. SVM-Light Support Vector Machine, http://svmlight.joachims.org.
10. C. Harris and M. Stephens, A Combined Corner and Edge Detector, Proceedings of the 4th Alvey Vision Conference, pp. 147-151, 1988.
11. V. Gouet and N. Boujemaa, Object-based Queries Using Color Points of Interest, Proceedings of the IEEE Workshop on Content-Based Access of Image and Video Libraries (CBAIVL), pp. 30-38, 2001.
12. D. Hiramatsu, M. Shishibori and K. Kita, Subtitled Subtitles from the Area of Video Data Detection Method, IEICE Journal Information Systems and Information Industry Association and the Joint Research, IP-07-24, IIS-07-48, 2007.

Yoshihide Matsumoto

He graduated from the Information System Technology & Engineering course of Kochi University of Technology in March 2002 and joined Laboatec Japan Co., Ltd. in the same year, where he holds the position of CTO of the Applied IT Lab; he received a master's degree in 2008. He entered the doctoral program of the Graduate School of Advanced Technology & Science at the University of Tokushima in October 2006. His publication on a multimedia IT system received a Quasi-Selected Award at the 2003 Japan IBM user symposium.



Tadashi Uemiya

He graduated from Waseda University in March 1968 and joined Kawasaki Heavy Industries, Ltd. in the same year, working there until 2000, when he transferred to the IT Department of Benesse Co., Ltd., where he remained until his retirement in 2006. He became a doctoral program student of the Graduate School of Advanced Technology & Science at the University of Tokushima in October 2006. His research interests include IE & IT, innovative IT solutions, and information technology. His experience includes an international five-country aero jet engine development project; the development and implementation of CAD/CAM/CAE/CG systems, PICS, and web information infrastructure; and the first implementation of an IP public network with MPLS technology in Japan, a joint project with NTT and Cisco Japan. He also has extensive experience with security systems such as ISMS, SRMS, and personal information protection, and was a senior member of the IEEE. He currently works as an executive IT consultant.

Masami Shishibori

He graduated from the University of Tokushima in 1991, completed the doctoral program in 1995, and joined the faculty as a research associate, becoming a lecturer in 1997 and an associate professor in 2001. His research interests are multimedia data search and natural language processing. He is a coauthor of Information Retrieval Algorithms (Kyoritsu Shuppan). He received the ISP 45th National Convention Incentive Award. He holds a D.Eng. degree, and is a member of ICIER and NLP.

Kenji Kita

He graduated from Waseda University in 1981, joined Oki Electric Industry Co., Ltd. in 1983, and transferred to ATR Interpreting Telephony Research Laboratories in 1987. He became a lecturer at the University of Tokushima in 1992, an associate professor in 1993, and a professor in 2000. His research interests include natural language processing and information retrieval. He received a 1994 ASJ Technology Award. His publications include Probabilistic Language Models (University of Tokyo Press) and Information Retrieval Algorithms (Kyoritsu Shuppan). He holds a D.Eng. degree.
