
International Journal of Advanced Intelligence
Volume 2, Number 1, pp. 37-55, July, 2010.
© AIA International Advanced Information Institute

A Method for Detecting Subtitle Regions in Videos Using Video Text Candidate Images and Color Segmentation Images

Yoshihide Matsumoto, Tadashi Uemiya, Masami Shishibori and Kenji Kita
Faculty of Engineering, The University of Tokushima
2-1 Minami-josanjima, Tokushima 770-8506, Japan
matsumoto@laboatec.com; uchikosi@helen.ocn.ne.jp; {bori;kita}@is.tokushima-u.ac.jp

Received (January 2010)
Revised (May 2010)

In this paper, a method for detecting text regions in digital videos that contain telop (superimposed on-screen text), such as dramas, movies, and news programs, is proposed. The typical characteristics of telop are that it does not move and that its edges are strong. This method takes advantage of these characteristics to produce video text candidate images. It then produces video text region images from both the video text candidate images and color segmentation images. The video text region images and the original image are used to identify the color of the telop. Finally, text regions are detected by growing neighboring pixels of the identified color. The experimental results show that the precision of this method was 80.36% and the recall was 77.55%, whereas the precision of the traditional method was 40.22% with a recall of 75.48%. Higher accuracy was achieved by using this new method.

Keywords: Video text candidate image; Color segmentation image; Video text region image; Multimedia information retrieval.

1. Introduction

In recent years, with the spread of the Internet, increased hardware specifications, and the development of imaging devices such as digital cameras and digital video cameras, there are more and more opportunities to accumulate large amounts of video content on personal computers. It is difficult to search efficiently for a required image or scene within this content, so information that clearly describes the content is needed. Such information usually includes cut points, camera work, sound, and subtitles. Subtitles often describe the subject being photographed or the topic. Subtitles also appear in sync with the video, making them noteworthy as useful strings that reflect the semantic content.

One of the well-known image-handling technologies focusing on subtitles is the Informedia project 1,2, where large-scale image data are processed using images from cut scenes, subtitle-recognition characters, and speech-recognition data. A method has been proposed for matching cooking instructions and cooking images using subtitles and closed captions. 3 A method to index the semantic attributes corresponding to scenes in news programs using closed captions has been proposed. 4 A method has also been proposed for recognizing text residing within a subtitle region. 5 To implement applied methods like these, it is first necessary to detect the temporal and spatial ranges of the subtitles in the image. The establishment of a highly accurate method for detecting subtitle regions is therefore desired.

Sato et al. 6 have proposed a traditional subtitle detection method, where macroblock coding information is used to detect subtitle regions in images compressed as MPEG. While this method allows for fast processing, its accuracy has not yet reached a practical level. Arai et al. 7 have focused on a feature of subtitles, called edge pairs, to propose another method, where subtitle regions are detected from the spatial distribution and temporal continuity of edge pairs. Although the detection accuracy of this method has reached a practical level, the absence of a learning function may decrease the accuracy as the text fonts change. Hori et al. 8 have proposed yet another method, where text candidate images are obtained from the logical products of low-dispersion images and immovable-edge images, followed by learning-based detection of subtitle regions. While this method leads to high recall, precision is low. Thus it tends to detect excessive regions as subtitles, resulting in subtitle text getting crushed. Additionally, there has been a proposal to increase the detection accuracy of subtitle regions by first creating text candidate images, and then using a learning-based classifier called the Support Vector Machine (SVM) 9 and a feature point extraction operator called the Harris Interest Operator (Harris operator) 10,11. Although this method 12 increases precision, it has its own issues, such as the need for training data and a decrease in recall.

This paper proposes a method for detecting subtitle regions with high accuracy by first generating video text candidate images in the same way as in traditional methods 7,8, followed by checking color segmentation images against the original image. In this method, text candidate images are obtained first in the same way as in the traditional method 8, based on the regions where little brightness change occurs between continuous frame images, and on the regions with no changes in edges. The subtitles within the text candidate images obtained this way are detected almost perfectly, but the background tends to be excessively detected at the same time. In other words, the recall is high while the precision is low. As a workaround, the text candidate images and the color segmentation images are combined, after which only the color segments that appear to be text are selected, thereby generating text region images with low background noise. The text region images thus obtained have few instances of the background being falsely detected as text. However, because subtitle regions are detected based on color segments, some characters in minute color segments of the subtitle text tend to escape detection. In other words, the precision is high while the recall decreases. In an effort to improve the recall, text color is used, assuming that the color information of the subtitles does not change. Specifically, the color information of the subtitles is determined using multiple text region images generated within continuous frames and the original image. Recall is then improved by growing neighboring pixels that have similar color information, thereby accurately detecting subtitle regions.

Section 2 introduces a traditional method for detecting subtitle regions using video text candidate images. Section 3 proposes a method for generating text region images using video text candidate images and color segmentation images, as well as a method for detecting subtitles by automatically setting the color of the subtitle text using text region images and the original image. Section 4 provides experiments for assessing the validity of the proposed method, along with the results and discussion. Finally, Section 5 presents the conclusion and describes future issues.

2. Overview of Traditional Methods

This section introduces a traditional method for detecting subtitle regions using text candidate images. Text candidate images are also used in the first phase of our proposed method as subtitle region images.

2.1. A method for generating video text candidate images using low-dispersion images and immovable edge images

Hori et al. 8 have proposed a method for generating video text candidate images from the logical products of low-dispersion images and immovable edge images. First, one low-dispersion image is created from an arbitrary number of consecutive brightness (luminance) frame images. If the arbitrary number is N, the brightness images of N frames are used to obtain the variance of the brightness of each pixel. In this method, we chose brightness images for 4 frames. Pixels whose variance is lower than a specified threshold are assigned a value of 1, and all other pixels 0, in order to obtain a binary low-dispersion image. The threshold value is set using discriminant analysis. Static regions such as subtitles have little change in brightness, so their variance is low; more dynamic regions have higher variance. Therefore, the resulting low-dispersion images tend to keep most of the subtitles intact.

Similarly, one immovable edge image is created from an arbitrary number of consecutive brightness frame images. First, binary edge images are obtained from the brightness images; wavelet transforms are used to detect the edges. Then the logical product of the edge images for N frames is obtained. In this method, we chose brightness images for 4 frames. The images obtained by this logical product are called immovable edge images, which have sharp edges on the boundaries with the background. Static pixels are prone to remain here, so the subtitles tend to remain in a similar way to low-dispersion images. Low-dispersion images and immovable edge images are obtained in the flow shown in Fig. 1. The logical product of the low-dispersion images and immovable edge images, obtained in the above manner, generates the video text candidate images.

Fig. 1. An illustration of making a video text candidate image from each video frame.
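To make the flow of Fig. 1 concrete, the sketch below reconstructs the low-dispersion/immovable-edge computation under stated assumptions: the input is a list of N grayscale frames as NumPy arrays, Otsu's method stands in for the discriminant-analysis threshold, and a Sobel gradient magnitude stands in for the wavelet-based edge detection used in the original method. It is an illustrative sketch, not the authors' implementation.

```python
# Illustrative sketch of generating a video text candidate image.
# Assumptions: `frames` is a list of N grayscale frames (NumPy uint8 arrays of
# equal size); Otsu's method replaces the discriminant-analysis threshold and
# a Sobel magnitude replaces the wavelet-based edge detection.
import numpy as np
from scipy import ndimage

def otsu_threshold(values, bins=256):
    # Discriminant-analysis (Otsu) threshold over a 1-D histogram of `values`.
    hist, edges = np.histogram(values.ravel(), bins=bins)
    centers = (edges[:-1] + edges[1:]) / 2
    w = hist.cumsum()
    m = (hist * centers).cumsum()
    w0, w1 = w[:-1], w[-1] - w[:-1]
    m0 = m[:-1] / np.maximum(w0, 1)
    m1 = (m[-1] - m[:-1]) / np.maximum(w1, 1)
    return centers[np.argmax(w0 * w1 * (m0 - m1) ** 2)]

def text_candidate_image(frames, edge_thresh=64.0):
    stack = np.stack([f.astype(np.float32) for f in frames])      # shape (N, H, W)

    # Low-dispersion image: per-pixel brightness variance over the N frames,
    # binarized so that nearly static pixels (e.g. subtitles) become 1.
    variance = stack.var(axis=0)
    low_dispersion = variance < otsu_threshold(variance)

    # Immovable edge image: logical product of the binary edge images of all frames.
    edge_images = []
    for f in stack:
        gx, gy = ndimage.sobel(f, axis=1), ndimage.sobel(f, axis=0)
        edge_images.append(np.hypot(gx, gy) > edge_thresh)
    immovable_edges = np.logical_and.reduce(edge_images)

    # Text candidate image: logical product of the two binary images.
    return np.logical_and(low_dispersion, immovable_edges)
```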

2.2. A method for detecting subtitles using SVM and the Harris operator

Hiramatsu et al. 12 have proposed a method which suppresses erroneous detection with the use of SVM and the Harris operator. In this method, video text candidate images are first generated, excluding as much of the background as possible except for the subtitles. The video text candidate images are then divided into blocks similar in size to the pre-determined text size, as shown in Fig. 2. A brightness histogram for each block is created from the white pixels remaining in that block. Each block is assessed using SVM, labeling subtitle-bearing blocks as positive and those without subtitles as negative.

The Harris operator, which yields high recall under image enlargement, is then applied to images determined to be subtitle regions by the SVM, in order to increase precision. The interest points detected by the Harris operator are abundant in parts with large color variation as well as along edges. Since in many cases subtitle regions are rendered in colors complementary to the surrounding image, it is expected that many interest points will be detected in the vicinity of subtitle regions. Therefore, blocks identified as positive by the SVM are detected as subtitle regions if they contain many interest points. Subtitle regions may not be recognized if the edges of the text reside within blocks. Subtitles are long text strings aligned horizontally. Therefore, to avoid this non-recognition issue, the number of interest points on the right and left sides of the region in question is used to determine whether that region is a subtitle region.
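As a rough, non-authoritative illustration of this block-wise classification, the sketch below uses scikit-learn's SVC and OpenCV's cornerHarris as stand-ins for the SVM and the Harris operator; the block size, the column-histogram feature, and the interest-point threshold are assumptions made for illustration, not the settings of Hiramatsu et al.

```python
# Sketch of block-wise subtitle detection with an SVM and Harris interest points.
# Assumptions: `candidate` is a binary text candidate image, `gray` the original
# grayscale frame (NumPy arrays), and `svm` an already trained classifier.
import numpy as np
import cv2
from sklearn.svm import SVC

BLOCK = 16  # assumed block size, comparable to the expected text height

def block_features(candidate):
    """Per-column white-pixel histogram for every BLOCK x BLOCK block."""
    h, w = candidate.shape
    feats, coords = [], []
    for y in range(0, h - BLOCK + 1, BLOCK):
        for x in range(0, w - BLOCK + 1, BLOCK):
            block = candidate[y:y + BLOCK, x:x + BLOCK]
            feats.append(block.sum(axis=0))
            coords.append((y, x))
    return np.array(feats, dtype=np.float32), coords

def detect_subtitle_blocks(candidate, gray, svm: SVC, min_points=10):
    feats, coords = block_features(candidate)
    labels = svm.predict(feats)                     # +1: subtitle block, -1: other
    harris = cv2.cornerHarris(np.float32(gray), 2, 3, 0.04)
    interest = harris > 0.01 * harris.max()         # illustrative interest-point mask
    detected = []
    for (y, x), label in zip(coords, labels):
        # keep SVM-positive blocks that also contain enough interest points
        if label > 0 and interest[y:y + BLOCK, x:x + BLOCK].sum() >= min_points:
            detected.append((y, x))
    return detected
```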



Fig. 2. An example of histogram data generation.

Fig. 3. An example of detecting the interest points by the Harris operator.

2.3. Issues with traditional methods

Traditional methods eliminate the background using the characteristics of subtitles found in images, and then recognize subtitle regions using an SVM trained with manually prepared positive and negative data, together with the interest points. However, such traditional methods have the following issues:

(i) Positive and negative data must be prepared manually so that the SVM can learn from them.

(ii) Subtitle text that does not fit within the divided blocks may make subsequent text recognition difficult.

To solve these issues, we focused on techniques for segmenting image regions. In the subsequent sections, we discuss a subtitle-region detection method based on a technique for segmenting image regions.

3. The Proposed Method

This paper proposes a method for detecting subtitle regions based on images that have been processed with color segmentation and on video text candidate images. We call the subtitle images generated using video text candidate images and color segmentation images "text region images." We first discuss a method for generating text region images using video text candidate images and color segmentation images. After that, we use the text region images and the original image to automatically set the text color, and discuss the process flow for detecting the final subtitle regions.

3.1. Generating text region images using color segmentation images

3.1.1. Introduction to the method

The process flow for detecting subtitles based on a color segmentation image is shown in Fig. 4. First, a video text candidate image is obtained in the same way as in the traditional method 8. At the same time, an image processed with color segmentation ("a color segmentation image") and color segmentation image data are obtained (Step 1 of Fig. 4). The color segmentation image data include the region numbers, the size of each region (the total number of pixels), the central coordinate (x, y), the color information of the regions (L*u*v*), and the coordinates that belong to the regions. A video text candidate image is created from four continuous frames, while a color segmentation image is created from the first frame that was used when creating the video text candidate image.

Then, these two images are used to eliminate noise (Step 2 of Fig. 4). The elimination proceeds in two ways: (1) by horizontally scanning the video text candidate image so that only the pixels within the subtitles remain, and (2) by checking the video text candidate image against the color segmentation image data in order to select only the color segments that appear to be subtitles. After this elimination process, we eliminate the edges of the subtitles, because it is common for subtitles to have outlines added to them (Step 3 of Fig. 4). Specifically, we take advantage of the fact that the bodies and edges of subtitle characters use different colors. We use the k-means method to classify the colors of the regions that contain the white pixels left after the noise elimination process. Finally, we supplement the text characters (Step 4 of Fig. 4) to improve recall. Specifically, we search each segmentation region around the pixels that remain as part of the subtitles at the end of Step 3, and add the regions that resemble the subtitle region in size and typical color. Each module is discussed in detail below.

Fig. 4. Outline of detecting text regions by using color segmentation images.

3.1.2. Generating color segmentation images

In this process, the region integration method is used to generate color segmentation images. Region integration is a method for dividing an image into multiple sets (regions) of pixels that have similar feature values and are spatially close, based on characteristics such as the pixel values and the texture. The reason we chose this method is the nature of subtitles: as discussed in the section on low-dispersion images, the brightness of subtitles varies little, and their color does not change much. In other words, all subtitles have more or less the same characteristics, which led us to speculate that the color segmentation process might successfully extract subtitle regions. The steps for integrating regions are given below; a code sketch follows the list. Figs. 5 and 6 show examples of color segmentation images generated using the region integration method.

Step 1 Search for pixels by raster scanning; flag any unlabeled, unclassified pixel and label it.

Step 2 Check the eight (8) pixels neighboring the flagged pixel, and assign them the same label as the flagged pixel if their pixel value is the same.

Step 3 Repeat Step 2 with the newly labeled pixels as the flagged pixels.

Step 4 If no pixels are labeled in Step 2, repeat Step 1.

Step 5 The process is complete when all pixels have been labeled. Sets (regions) of neighboring pixels with the same pixel value are obtained at this point. Proceed to the next step using the labeled pixels.

Step 6 Obtain the average pixel value among the pixels bearing the same label.

Step 7 Of the neighboring sets of pixels, integrate the two that have the smallest difference in the average pixel values obtained in Step 6.

Step 8 Repeat Steps 6 and 7. To avoid ending up with only a single set of pixels, a maximum allowable difference between average values is established as the condition for integration. [End of the steps of the region integration method]
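A compact sketch of these steps, under stated assumptions, is given below: `img` is an (H, W, 3) color image (e.g. in the L*u*v* space mentioned above), the initial labeling of identical-valued 8-neighbors uses SciPy's connected-component labeling, and merging stops once the smallest difference between adjacent region means exceeds `max_diff`. The greedy pairwise merge is quadratic and meant only to make the steps concrete.

```python
# Sketch of the region integration method (Steps 1-8). Not optimized; intended
# only to make the merging procedure concrete.
import numpy as np
from scipy import ndimage

def initial_labels(img):
    """Steps 1-5: 8-connected components of pixels sharing the same value."""
    h, w = img.shape[:2]
    labels = np.zeros((h, w), dtype=np.int64)
    offset = 0
    for color in np.unique(img.reshape(-1, img.shape[2]), axis=0):
        mask = np.all(img == color, axis=2)
        comp, n = ndimage.label(mask, structure=np.ones((3, 3)))
        labels[mask] = comp[mask] + offset
        offset += n
    return labels

def adjacent_pairs(labels):
    """Unordered pairs of region labels that touch horizontally or vertically."""
    pairs = set()
    for a, b in zip(labels[:, :-1].ravel(), labels[:, 1:].ravel()):
        if a != b:
            pairs.add((min(a, b), max(a, b)))
    for a, b in zip(labels[:-1, :].ravel(), labels[1:, :].ravel()):
        if a != b:
            pairs.add((min(a, b), max(a, b)))
    return pairs

def region_integration(img, max_diff=10.0):
    """Steps 6-8: repeatedly merge the adjacent pair with the closest mean color."""
    labels = initial_labels(img)
    means = {i: img[labels == i].mean(axis=0) for i in np.unique(labels)}
    while True:
        best, pair = max_diff, None
        for a, b in adjacent_pairs(labels):
            d = np.linalg.norm(means[a] - means[b])
            if d < best:
                best, pair = d, (a, b)
        if pair is None:                  # stopping criterion of Step 8
            return labels
        a, b = pair
        labels[labels == b] = a           # integrate region b into region a
        means[a] = img[labels == a].mean(axis=0)
        del means[b]
```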

Fig. 5. Original image.

Fig. 6. Color segmentation image.

Fig. 7. Noise elimination by scanning of white pixels.

3.1.3. Noise elimination

The noise elimination process is two-fold. The first phase starts with horizontal scanning of a video text candidate image as shown in Fig. 7, creating a histogram that tallies the white pixels. The scanning direction depends on the direction of the subtitles. Because white pixels are packed densely inside subtitles, the histogram shows locally high values where subtitles are found. Based on this observation, locations where the histogram values climb and fall sharply are identified, and only these locations are kept, thus narrowing down the subtitle-containing regions.

In the second phase of noise elimination, we take advantage of the characteristics of subtitles by using the image processed with color segmentation based on color information. Because each character of the subtitle has the same color information, we can predict that the background and the subtitles reside in different regions of the color segmentation image. We can also predict that the subtitle regions are narrower than the background. The noise elimination process takes advantage of these characteristics. First, the video text candidate image after the Phase 1 elimination and the color segmentation image data are checked against each other, and the ratio of white pixels in each region is measured. Next, regions whose white-pixel ratio is higher than a threshold value are made entirely white, and all other pixels are made black. Since subtitle regions are smaller than background regions, whether the white pixels of a region survive this process depends largely on whether the region is a subtitle region or background: small subtitle regions have a high white-pixel ratio and remain, while large background regions are removed. Fig. 9 shows an image after noise elimination. A code sketch of both phases follows.
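The hedged sketch below illustrates the two-phase elimination described above. It assumes `candidate` is the binary text candidate image and `labels` the color segmentation label image; the Phase 1 cut-off on the row histogram and the interpretation of the ratio threshold as a fraction are illustrative assumptions, not the paper's exact rules.

```python
# Sketch of two-phase noise elimination; thresholds here are illustrative.
import numpy as np

def eliminate_noise(candidate, labels, ratio_thresh=0.5):
    # Phase 1: horizontal scan -- tally white pixels per row and keep only rows
    # where the count is high, i.e. rows likely to contain subtitles.
    row_counts = candidate.sum(axis=1)
    keep_rows = row_counts > 0.5 * row_counts.max()     # illustrative cut-off
    phase1 = candidate & keep_rows[:, None]

    # Phase 2: for every color segment, measure the ratio of surviving white
    # pixels; segments above the threshold become all white, the rest black.
    out = np.zeros_like(phase1)
    for region in np.unique(labels):
        mask = labels == region
        if phase1[mask].mean() >= ratio_thresh:
            out[mask] = True
    return out
```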

Fig. 8. Noise elimination by ratio of white pixels.

Fig. 9. Image with noise eliminated.



3.1.4. Classification by k-means

Generally, each subtitle character consists of an edge (outline) part and the character body, each in its own color. After noise elimination, an image may still retain both of these parts. If the edge part remains, the entire character is crushed, making it difficult to identify the character, especially for a complicated character such as a kanji. The k-means method enables the classification of each pixel of the subtitle characters based on color information, and detects only the pixels that belong to the character bodies. In a video text candidate image with the noise eliminated, the colors of the regions to which the remaining white pixels belong are classified using k-means, as shown in Fig. 10. After the classification, only the regions whose colors belong to the class with the most members are kept. Fig. 11 shows an example of an image after classification by k-means.
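The sketch below illustrates this selection step, assuming `region_colors` maps each color-segment label to its mean color and `white_regions` is the set of labels still containing white pixels after noise elimination; scikit-learn's KMeans stands in for the clustering, with k = 3 reflecting the body/edge/background split discussed in Section 4.

```python
# Sketch of selecting the character-body class with k-means (illustrative).
import numpy as np
from sklearn.cluster import KMeans

def keep_character_class(region_colors, white_regions, k=3):
    ids = sorted(white_regions)
    colors = np.array([region_colors[i] for i in ids], dtype=np.float64)
    assignment = KMeans(n_clusters=k, n_init=10).fit_predict(colors)
    # Keep only the class with the most members, assumed to be the character bodies.
    largest = np.bincount(assignment).argmax()
    return {i for i, a in zip(ids, assignment) if a == largest}
```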

3.1.5. Character complementation

As can be seen in Fig. 11, images that have been classified by k-means tend to have high precision and low recall, leading to frequent missed detections. We therefore focus on the characteristics of each region and supplement the subtitle region. The regions that fall within a 16 x 16 pixel window around the remaining white pixels are searched, as shown in Fig. 12. The Euclidean distance is then calculated using the size of the region, the central coordinate of the region, and the color of the region as features. If the resulting Euclidean distance is less than a threshold value, that region is added as part of the subtitle. Fig. 13 shows the image after character complementation, i.e., the video text region image after the application of the method based on color segmentation images.
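The following sketch illustrates the complementation step under stated assumptions: `kept` is the set of region labels already accepted as subtitle, `features` maps every region label to a vector built from its size, central coordinate, and color, `labels` is the segmentation label image, and `white_pixels` are the coordinates of the remaining white pixels. The 16 x 16 window and the Euclidean-distance threshold follow the description above; the feature construction itself is an assumption.

```python
# Sketch of character complementation by Euclidean distance over region features.
import numpy as np

def complement_characters(kept, features, labels, white_pixels,
                          dist_thresh=30.0, window=16):
    h, w = labels.shape
    added = set(kept)
    for y, x in white_pixels:
        # regions overlapping the 16 x 16 window around this white pixel
        y0, y1 = max(0, y - window // 2), min(h, y + window // 2)
        x0, x1 = max(0, x - window // 2), min(w, x + window // 2)
        for region in np.unique(labels[y0:y1, x0:x1]):
            if region in added:
                continue
            # distance between the candidate region and the region the white
            # pixel belongs to, over (size, centroid, color) features
            d = np.linalg.norm(features[region] - features[labels[y, x]])
            if d < dist_thresh:
                added.add(region)
    return added
```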

Fig. 10. Classification of each pixel by k-means.

Fig. 11. Image after classification by k-means.

Fig. 12. An illustration of complementation of the text characters.

Fig. 13. An example of the video text region image after application of the proposed method.

3.2. Method for detecting subtitles by automatically setting the text color

3.2.1. Overview of the method

Text region images generated by the method described in the preceding section tend to miss the minute segmentation regions residing within the subtitles, which lowers the recall. We therefore apply a recall-improving technique based on the text color (Fig. 14). First, the color information of the subtitles is specified using the multiple text region images generated from continuous frames and the original picture image (Step 1 of Fig. 14). Then the text characters are supplemented (Step 2 of Fig. 14). The pixels remaining after supplementation are labeled, and regions that are too large are removed (Step 3 of Fig. 14). Each module is discussed in detail below.


Fig. 14. Outline of detecting text regions by specifying the text color.

Fig. 15. An illustration of color histogram generation.

3.2.2. Automatically setting the text color

Multiple text region images generated from continuous frames, together with the original picture image of the top frame used to generate each text region image, are used to specify the range of the subtitle text color. In this experiment, we focused on the pixels remaining in thirty (30) text region images. These pixels are checked against the original picture image to extract their RGB values. The 256 RGB gradation levels are then compressed into 16 levels to generate a histogram (Fig. 15). The gradation level with the most pixels is determined to be the color range of the text.
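A small sketch of this step follows, assuming `text_region_images` are the binary masks of the thirty text region images and `originals` the corresponding original RGB frames as NumPy arrays.

```python
# Sketch of estimating the subtitle color range from 16-level RGB histograms.
import numpy as np

def estimate_text_color(text_region_images, originals):
    bins = np.zeros((16, 16, 16), dtype=np.int64)
    for mask, frame in zip(text_region_images, originals):
        rgb = frame[mask] // 16                     # compress 256 levels to 16
        np.add.at(bins, (rgb[:, 0], rgb[:, 1], rgb[:, 2]), 1)
    r, g, b = np.unravel_index(bins.argmax(), bins.shape)
    # Return the RGB range covered by the most populated bin.
    low = (r * 16, g * 16, b * 16)
    high = (r * 16 + 15, g * 16 + 15, b * 16 + 15)
    return low, high
```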



3.2.3. Character supplementation and labeling

The eight (8) pixels surrounding each white pixel remaining after Step 7 of Fig. 14 (pixels that have been detected as being within the subtitle region) are searched. Pixels whose color resides within the range determined in Step 7 are made white. The eight pixels around these new white pixels are searched in the same way, until no more pixels are added. After the characters are supplemented, the pixels are labeled, and labeled components whose number of connected pixels exceeds a threshold value are removed. Fig. 16 shows an example of the final result after automatically setting the text color.
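The sketch below illustrates the supplementation and labeling step, assuming `detected` is the binary mask of pixels already identified as subtitle, `frame` the original RGB frame, and `(color_lo, color_hi)` the color range estimated above; the 128-pixel size limit follows the experimental setting in Section 4.

```python
# Sketch of growing subtitle pixels within the estimated color range and then
# removing overly large connected components.
import numpy as np
from scipy import ndimage

def supplement_and_label(detected, frame, color_lo, color_hi, max_size=128):
    in_range = np.all((frame >= color_lo) & (frame <= color_hi), axis=2)
    grown = detected.copy()
    while True:
        # grow by one 8-neighborhood ring, restricted to in-range pixels
        new = (ndimage.binary_dilation(grown, structure=np.ones((3, 3))) & in_range) | detected
        if (new == grown).all():
            break
        grown = new
    # label connected components and remove those that are too large
    labels, n = ndimage.label(grown, structure=np.ones((3, 3)))
    sizes = ndimage.sum(grown, labels, index=np.arange(1, n + 1))
    for i, size in enumerate(sizes, start=1):
        if size >= max_size:
            grown[labels == i] = False
    return grown
```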

Fig. 16. An example of a video text region image generated by the proposed method.

4. Assessment

4.1. Experimental method

We conducted an experiment to confirm the validity of our proposed method. As the experimental data, we used drama video data that includes subtitles, with full RGB color, a resolution of 352 x 240, and a frame rate of 29.97 fps. As the correct data, we used only the images of the subtitles in the overlay region of this drama. The correct data, the text candidate images, and the text region images were checked against one another for each pixel to calculate the precision and recall. Thirty images were selected at random from the drama for assessment. The assessment criteria, recall (r) and precision (p), are defined by formulae (1) and (2):

precision: p = N_d / (N_d + N_f)    (1)

recall:    r = N_d / (N_d + N_m)    (2)

where
N_d: number of correctly detected pixels
N_m: number of pixels that escaped detection
N_f: number of falsely detected pixels
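For concreteness, a small sketch of the pixel-level evaluation in Eqs. (1) and (2) follows, assuming `detected` and `truth` are binary masks of the detected and ground-truth subtitle pixels.

```python
# Sketch of computing pixel-level precision and recall per Eqs. (1) and (2).
import numpy as np

def precision_recall(detected, truth):
    n_d = np.logical_and(detected, truth).sum()      # correctly detected pixels
    n_f = np.logical_and(detected, ~truth).sum()     # falsely detected pixels
    n_m = np.logical_and(~detected, truth).sum()     # pixels that escaped detection
    precision = n_d / (n_d + n_f)
    recall = n_d / (n_d + n_m)
    return precision, recall
```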

The precision and recall of the traditional method are shown in Table 1.

The unit used in detection is the pixel. Detection is deemed correct if white pixels exist where the subtitles within the frame are displayed. It is deemed an escaped detection if no white pixels exist there. Detection is deemed false if white pixels exist in frames or locations where there are no subtitles.

Table 1. Evaluation of the traditional method.
Precision: 40.22%    Recall: 75.48%

The experimental criteria for each method are listed below:

• Method for detecting subtitles using segmentation region images
– The threshold value used for noise elimination in the second phase is 50.
– The number of classes for k-means is varied among 2, 3, 4, and 5.
– The threshold value of the Euclidean distance for character supplementation is varied among 10, 20, and 30.
• Method for detecting subtitles by automatically setting the text color
– Regions are removed by labeling if they include 128 or more connected pixels.

4.2. Experimental results

Fig. 17 shows how the detection accuracy shifts when the number of classes used by k-means in the text detection method with region segmentation images changes, and when the threshold value of the Euclidean distance used in character supplementation changes. In Fig. 17, K=N Precision represents the precision value when the number of classes under the k-means method is set to N (2, 3, 4, or 5), and K=N Recall represents the recall value when the number of classes is set to N. In addition, eucM represents the detection accuracy when the threshold for the Euclidean distance is set to M (10, 20, or 30) in character supplementation.

The results shown in Fig. 17 indicate that precision is higher than recall in every case and that recall remains steady. Changing the threshold for the Euclidean distance did not affect accuracy significantly. On the other hand, when the number of classes under k-means changed, both precision and recall were affected. When the number of classes was 3, the precision was highest, the decrease in recall was at a minimum, and the balance between these two measures was optimal, producing the best accuracy.

Each subtitle character generally consists of three parts: the background, the edges, and the character body. It appears that setting the number of classes to three (3) enabled appropriate classification and detection of the character bodies. When the number of classes was set to two (2), the background mixed into the selected class, resulting in lower precision. Higher numbers of classes such as 4 and 5 resulted in large drops in recall, with increased numbers of pixels that escaped detection. This is because subtitle characters do not consist of exactly the same color; rather, the color varies slightly from character to character. For example, subtitle characters that appear white were found to consist of four (4) smaller parts: mostly white, light gray, gray, and dark gray. Although the mostly white part has more pixels than the dark gray part, larger numbers of classes ultimately lower the probability of the mostly white part being selected. We can reason that, as a result, there were more pixels escaping detection and edges were falsely detected, significantly lowering recall.

Fig. 17. Experiment results.

Fig. 18 shows the results of an experiment comparing one of the traditional methods (video text candidate images) with our proposed method (text region images and final result images), using the parameters with which the accuracy was best in the experiment shown in Fig. 17 (the number of classes = 3, and the threshold for the Euclidean distance = 30).

Fig. 18. Experiment results.

Fig. 19. An example of a video text candidate image.

Fig. 20. An example of a video text region image.

Fig. 21. An example of a video text region image that uses a color segmentation image.

Fig. 22. An example of a video text region image with automatic setting of the telop color.

Fig. 18 shows that our proposed method brings about better results in both precision and recall compared to the traditional methods. Noise elimination was a factor in the improvement of precision. In rather static video scenes containing objects such as a building, many non-subtitle pixels remained in the video text candidate image, lowering precision. Our proposed method eliminated these non-subtitle pixels, improving precision. Additionally, the edges of video text candidate images had a strong tendency to remain in subtitles, as Fig. 19 indicates, and in many cases only the edge of a character remained. The fact that our proposed method enabled the supplementation of missing parts of a character, as in Fig. 20, may also have contributed to the improved precision.

The recall of the method based on region segmentation images did not significantly improve over the traditional methods. One reason is that color classification under k-means in our proposed method is based simply on the number of elements in each class. In cases where many edge pixels remained, this may have prevented supplementation of the subtitles almost completely, as Fig. 21 shows, resulting in lower recall. In the future, the location of each element in a class should be taken into consideration to develop an algorithm that enables more accurate selection of the class for the character body. Furthermore, we found that accuracy was higher in the version of our method where region segmentation images are used and the text color is then automatically set, compared to a version that uses region segmentation images alone. Figs. 21 and 22 show the same scene. We can see that Fig. 22 has fewer missing character pixels and more correctly detected pixels than Fig. 21. Subtitles that do not benefit from a method using region segmentation images alone can be improved if the text color is considered. The use of text color is likely the factor behind the improvement in recall.

5. Conclusion

This paper has proposed a method for detecting subtitle regions using region segmentation images. Assessment experiments confirmed that our proposed method has higher detection accuracy than a traditional method that uses video text candidate images. We believe that our proposed method has reached a practicable level because it can clearly detect the characters within the text regions, as shown in Figs. 20 and 22. However, a remaining problem is the large number of parameters and thresholds that must be set to produce video text candidate images. Future issues include a way to automatically set the parameters of our proposed method, and further improvement of accuracy by eliminating minute non-subtitle regions.



Acknowledgments

This study was financed in part by the Basic Scientific Research Grant (B) (17300036) and the Basic Scientific Research Grant (C) (17500644).

References

1. H.D. Wactler, A.G. Hauptmann, and M.J. Witbrock, Informedia News-on-Demand: Using Speech Recognition to Create a Digital Video Library, CMU Tech. Rep. CMU-CS-98-109, Carnegie Mellon University, 1998.
2. H.D. Wactler, M.G. Christel, Y. Gong, and A.G. Hauptmann, Lessons Learned from Building a Terabyte Digital Video Library, IEEE Comput., 32(2), pp. 66-73, 1999.
3. H. Miura, K. Takano, S. Hamada, I. Iide, O. Sakai and H. Tanaka, Video Analysis of the Structure of Food and Cooking Steps with the Corresponding, IEICE Journal, J86-D-II(11), pp. 1647-1656, 2003.
4. I. Iide, S. Hamada, S. Sakai and E. Tanaka, TV News Subtitles for the Analysis of the Semantic Dictionary Attributes, IEICE Journal, J85-D-II(7), pp. 1201-1210, 2002.
5. S. Mori, M. Kurakake, T. Sugimura, T. Shio and A. Suzuki, The Shape of Characters and the Background Characteristics Distinguish Correction Function by Using Dynamic Visual Character Recognition in the Subtitles, IEICE Journal, J83-D-II(7), pp. 1658-1666, 2000.
6. S. Sato, Y. Shinkura, Y. Taniguchi, A. Akutsu, Y. Sotomura and H. Hamada, Subtitles from the MPEG High-speed Video Coding Region of the Detection Method, IEICE Journal, J81-D-II(8), pp. 1847-1855, 1998.
7. K. Arai, H. Kuwano, M. Kurakage and T. Sugimura, The Video Frame Subtitle Display Detection Method, IEICE Journal, J83-D-II(6), pp. 1477-1486, 2000.
8. O. Hori and U. Mita, Subtitles for Recognition from the Video Division Robust Character Extraction Method, IEICE Journal, J84-D-II(8), pp. 1800-1808, 2001.
9. SVM-Light Support Vector Machine, http://svmlight.joachims.org.
10. C. Harris and M. Stephens, A Combined Corner and Edge Detector, Proceedings of the 4th Alvey Vision Conference, pp. 147-151, 1988.
11. V. Gouet and N. Boujemaa, Object-based Queries Using Color Points of Interest, Proceedings of the IEEE Workshop on Content-Based Access of Image and Video Libraries (CBAIVL), pp. 30-38, 2001.
12. D. Hiramatsu, M. Shishibori and K. Kita, Subtitled Subtitles from the Area of Video Data Detection Method, IEICE Journal Information Systems and Information Industry Association and the Joint Research, IP-07-24, IIS-07-48, 2007.

Yoshihide Matsumoto

He graduated from the Information System Technology & Engineering course of Kochi University of Technology in March 2002 and joined Laboatec Japan Co., Ltd. in the same year, where he holds the position of CTO of the Applied IT Lab; he received a master's degree in 2008. He entered the doctoral program of the Graduate School of Advanced Technology & Science at the University of Tokushima in October 2006. His publication on a multimedia IT system received a Quasi-Selected Award at the 2003 Japan IBM user symposium.



Tadashi Uemiya

He graduated from Waseda University in March 1968 and joined Kawasaki Heavy Industries, Ltd. in the same year, working there until 2000, when he transferred to the IT Department of Benesse Co., Ltd., where he remained until his retirement in 2006. He became a doctoral program student of the Graduate School of Advanced Technology & Science at the University of Tokushima in October 2006. His research interests include IE & IT, innovative IT solutions, and information technology. His experience includes an international five-country aero jet engine development project; the development and implementation of CAD/CAM/CAE/CG systems, PICS, and web information infrastructure; and the first implementation of an IP public network with MPLS technology in Japan, a joint project with NTT and Cisco Japan. He also has extensive experience with security systems such as ISMS, SRMS, and personal information protection, and was a senior member of the IEEE. He currently works as an executive IT consultant.

Masami Shishibori

He graduated from the University of Tokushima in 1991, completed the doctoral program in 1995, and joined the faculty as a research associate, becoming a lecturer in 1997 and an associate professor in 2001. His research interests are multimedia data search and natural language processing. He is a coauthor of Information Retrieval Algorithms (Kyoritsu Shuppan). He received the ISP 45th National Convention Incentive Award. He holds a D.Eng. degree, and is a member of ICIER and NLP.

Kenji Kita

He graduated from Waseda University in 1981, joined Oki Electric Industry Co., Ltd. in 1983, and transferred to ATR Interpreting Telephony Research Laboratories in 1987. He became a lecturer at the University of Tokushima in 1992, an associate professor in 1993, and a professor in 2000. His research interests include natural language processing and information retrieval. He received a 1994 ASJ Technology Award. His publications include Probabilistic Language Models (University of Tokyo Press) and Information Retrieval Algorithms (Kyoritsu Shuppan). He holds a D.Eng. degree.
