01.02.2015 Views

Object Recognition with Multiple Feature Types - ResearchGate

In: Proceedings of the ICANN 1998, Skövde, Sweden

Perspectives in Neural Computing

L. Niklasson, M. Bodén and T. Ziemke (eds.)

Springer Verlag, Berlin, Heidelberg, New York, 1998.

Object Recognition with Multiple Feature Types

Jochen Triesch and Christian Eckes

Institut für Neuroinformatik, Ruhr-Universität Bochum

D-44780 Bochum, Germany

{Jochen.Triesch,Christian.Eckes}@neuroinformatik.ruhr-uni-bochum.de

Abstract

One of the brain's recipes for robustly perceiving the world is to integrate multiple feature types such as shape, color, texture and motion. We have investigated to what extent neural-network-based object recognition can also profit from the combination of several feature types. For this purpose we have extended Elastic Graph Matching such that several feature types may be combined in the object models. We applied the system in two difficult application domains, the interpretation of cluttered scenes and the recognition of hand postures against complex backgrounds. Our results demonstrate that the use of additional feature types significantly improves performance.

1 Introduction

Vision is a hard problem which our brains solve very well. The neurons in visual cortex extract different features of the input image. Some represent shape, others motion or color or combinations of these. These features have to be integrated to form object descriptions which can be stored and recognized.

In computer vision, it has been realized that the integration of different feature types can be useful for object recognition tasks. Perhaps the most extreme example is Mel's SeeMore system [2]. There, recognition is based on 102 viewpoint-invariant nonlinear filters, which code contour, texture, and color features.

We were interested in how far the use of multiple feature types could improve Elastic Graph Matching (EGM), a neurally inspired object recognition system [1]. Our particular interest was in whether appropriate color features would enhance object recognition in difficult situations: cluttered scenes with many mutually occluding objects, and complex backgrounds.

Supported by a grant from the German Federal Ministry for Science and Technology (01 IN 504 E9).


Figure 1: Similarities between compound jets combining Gabor, color and colorGabor features. Top row: Far left: Source image with the circle indicating the position where the compound jet was extracted. Middle left: Skin color segmentation of the source image. Middle right: Target image. Far right: Skin color segmentation of the target image. Bottom row: Similarity landscapes obtained when comparing compound jets extracted at each position in the target image with the jet taken at the marked position in the source image using different weightings. Left to right: only Gabor features used, only color features, only colorGabor features, a proper combination of all three. The circle in the target image (top row, middle right) corresponds to the position of the global maximum in the rightmost image in the bottom row. The corresponding point was found correctly despite the complex background.

2 Graph Matching with Compound Jets

In Elastic Graph Matching (EGM), objects are stored as graphs. The nodes of the graphs are labeled with a local image description in the form of a vector of responses of feature detectors. The edges are labeled with geometrical information, thus representing spatial relations between the features. During recognition a model graph of an object is matched onto the input image. During this process the graph's nodes try to find matching image regions while at the same time attempting to keep their spatial relations intact [1].
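The trade-off between node similarity and preserved spatial relations can be illustrated with a small cost function. This is a minimal sketch loosely in the spirit of [1], not the cost used in this paper: the quadratic edge penalty, the weight `lam`, and the function names are assumptions made for illustration, and the node similarities are taken as already computed.

```python
def edge_distortion(model_pos, image_pos, edges):
    """Sum of squared differences between model and image edge vectors."""
    total = 0.0
    for a, b in edges:
        mdx = model_pos[b][0] - model_pos[a][0]
        mdy = model_pos[b][1] - model_pos[a][1]
        idx = image_pos[b][0] - image_pos[a][0]
        idy = image_pos[b][1] - image_pos[a][1]
        total += (idx - mdx) ** 2 + (idy - mdy) ** 2
    return total


def graph_cost(node_similarities, model_pos, image_pos, edges, lam=0.01):
    """Lower is better: distortion penalty minus summed node similarity.

    lam (hypothetical value) balances geometry against feature similarity.
    """
    return lam * edge_distortion(model_pos, image_pos, edges) - sum(node_similarities)


# Three nodes connected by two edges; the image graph shifts one node by 2 pixels.
model = [(0, 0), (7, 0), (0, 7)]
image = [(0, 0), (9, 0), (0, 7)]
edges = [(0, 1), (0, 2)]
print(graph_cost([1.0, 1.0, 1.0], model, model, edges))
print(graph_cost([1.0, 1.0, 1.0], model, image, edges))
```

An undistorted placement with perfect node similarities gives the lowest cost; moving a node away from its model-relative position raises the cost unless the gain in similarity outweighs the penalty.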

While earlier versions of EGM have only worked with a single shape or texture feature type extracted from a grey level image (e.g. Gabor or Mallat filters), we have extended it to allow for multiple feature types. While a vector of filter responses of the same feature type is traditionally called a jet, we call a vector of filter responses stemming from different feature types a compound jet. Usually, a compound jet is the result of the concatenation of traditional jets. For example, a compound jet $J$ may be composed of a shape jet $j_s$ containing responses of shape feature detectors and a color jet $j_c$ containing color information.

When similarities between compound jets are computed, first the similarities of corresponding simple jets are computed. Their similarities are then added with certain normalized weighting factors, e.g.

$$\mathrm{Sim}(J, J') = w_s \, \mathrm{Sim}(j_s, j_s') + w_c \, \mathrm{Sim}(j_c, j_c'), \qquad w_s + w_c = 1 \tag{1}$$

where $w_s$ and $w_c$ are the weights of the shape and the color features respectively. Figure 1 gives an impression of the advantages of combining multiple feature types in a compound jet in trying to find point correspondences between images. The feature types used are the ones we employed for the hand posture recognition task.

Figure 2: Cluttered Scenes overview: An example of a complex scene is shown on the left, the result of the corresponding interpretation in the middle, and the similarity landscape of the Coke Tin depending on the graph's position in the image is shown on the right (bright values correspond to high similarity).
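The weighted sum of equation (1) can be sketched in a few lines. The text does not state which similarity function is used for the simple jets, so the normalized dot product below is an assumption, as are the function names and the dictionary layout (one entry per feature type).

```python
import math


def jet_similarity(a, b):
    """Normalized dot product of two simple jets (assumed similarity measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def compound_similarity(jets_a, jets_b, weights):
    """Equation (1): weighted sum of per-feature-type similarities.

    jets_a, jets_b: dict mapping feature type -> list of filter responses.
    weights: dict mapping feature type -> weight; weights must sum to 1.
    """
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(w * jet_similarity(jets_a[k], jets_b[k]) for k, w in weights.items())


# Two compound jets whose shape parts agree and whose color parts are orthogonal.
J1 = {"shape": [1.0, 0.0], "color": [0.0, 1.0]}
J2 = {"shape": [2.0, 0.0], "color": [1.0, 0.0]}
print(compound_similarity(J1, J2, {"shape": 0.5, "color": 0.5}))
```

Because the weights are normalized, the compound similarity stays in the same range as the individual similarities, and a feature type can be switched off by setting its weight to zero.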

3 Two Example Applications

3.1 Interpretation of Cluttered Scenes

Our first application analyses cluttered scenes made up of toy objects heavily occluding each other in front of a complex or white background. The recognition system is based on EGM with a competitive front-to-back algorithm to determine the mutual occlusion of the objects (for details see [5]). The compound jets are generated from two basic features: the responses of a Gabor wavelet filter bank with 3 levels and 8 orientations, applied to the intensity part of the color image, and the raw hue and saturation values averaged over the pixel neighborhood centered at the node's position. The system has to analyze 70 scenes (50% with white background and 50% with colored complex background), each made up of up to 5 toy objects (out of 11 objects). We work with 256 × 256 pixel 8-bit HSI images, and the recognition parameters are set as follows: number of visible pixels must be 2500, threshold for acceptance is 0.7, and algorithm II was used (see [5] for further details). The model graphs are generated manually using a grid topology with 7 pixel spacing between neighboring nodes; the corresponding training images are taken in front of a white background. See figure 2 for an illustration of the system.
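The grid topology and the averaged color features described above might look as follows in code. This is only a sketch: the size of the averaging neighborhood is not given in the text, so the `radius` parameter is an assumption, as are the function names and the representation of the hue and saturation channels as nested lists.

```python
def grid_nodes(x0, y0, width, height, spacing=7):
    """Node positions on a regular grid with the given pixel spacing."""
    return [(x, y)
            for y in range(y0, y0 + height, spacing)
            for x in range(x0, x0 + width, spacing)]


def mean_hue_sat(hue, sat, x, y, radius=1):
    """Average raw hue and saturation in a window centered at (x, y).

    The window is clipped at the image border; radius is a hypothetical choice.
    """
    rows, cols = len(hue), len(hue[0])
    h_sum, s_sum, n = 0.0, 0.0, 0
    for yy in range(max(0, y - radius), min(rows, y + radius + 1)):
        for xx in range(max(0, x - radius), min(cols, x + radius + 1)):
            h_sum += hue[yy][xx]
            s_sum += sat[yy][xx]
            n += 1
    return h_sum / n, s_sum / n


# A 15 x 8 pixel object region yields a 3 x 2 grid of nodes.
nodes = grid_nodes(0, 0, 15, 8)
hue = [[0.0, 0.0, 0.0], [0.0, 9.0, 0.0], [0.0, 0.0, 0.0]]
sat = [[0.5] * 3 for _ in range(3)]
print(nodes)
print(mean_hue_sat(hue, sat, 1, 1))
```

Note that averaging raw hue values ignores the circular nature of hue; the sketch follows the wording of the text rather than correcting for this.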

In order to judge whether and how the fusion of the different feature types in the compound jet influences recognition performance, we performed 9 cross-runs with different relative weightings of color and Gabor features (see table 1 for the results). The outcome shows a performance peak near a 50% weight between color and Gabor features and a fast decay as the color weight increases further; this is due to the strong increase in false positive recognitions. It indicates that the pure color values alone do not support robust recognition, but they can increase overall performance significantly as long as the Gabor features still have major influence.


weight color (%)       0.0  12.5  25.0  37.5  50.0  62.5  75.0  87.5 100.0
correct scenes (%)    10.0  18.8  27.1  35.7  62.9   8.6   1.4   0.0   0.0
false positives          0     0     1     2     3   104   163   170   201

Table 1: Scene recognition results for the test sets with simple and complex background using different weights of the two feature types. "Correct scenes" is the percentage of correct scene interpretations (position and mutual occlusion correct, no false positives) and "false positives" is the number of objects found in scenes in which they are not present. We have used 70 scenes with a total of 190 placed objects (average of 2.7 objects per image).
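The trends in table 1 can be checked with a few lines of arithmetic. The snippet below only re-reads the numbers from the table; the variable names are chosen for illustration.

```python
# Values copied from table 1.
color_weight = [0.0, 12.5, 25.0, 37.5, 50.0, 62.5, 75.0, 87.5, 100.0]
correct_scenes = [10.0, 18.8, 27.1, 35.7, 62.9, 8.6, 1.4, 0.0, 0.0]
false_positives = [0, 0, 1, 2, 3, 104, 163, 170, 201]

# Weighting with the highest percentage of correctly interpreted scenes.
best = max(range(len(color_weight)), key=lambda i: correct_scenes[i])
best_weight = color_weight[best]

# Step between neighboring weightings where false positives grow the most.
jumps = [b - a for a, b in zip(false_positives, false_positives[1:])]
worst = max(range(len(jumps)), key=lambda i: jumps[i])
jump_interval = (color_weight[worst], color_weight[worst + 1])

# Average number of placed objects per scene (190 objects in 70 scenes).
objects_per_scene = 190 / 70

print(best_weight, correct_scenes[best], false_positives[best])
print(jump_interval, jumps[worst])
print(round(objects_per_scene, 1))
```

The peak lies at a color weight of 50%, and the largest rise in false positives (101 additional detections) occurs in the step from 50% to 62.5%, which matches the sharp drop in correct scenes at that point.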

Figure 3: The twelve postures used in the study. The model graphs are created from six example images (3 persons, light and dark background).

3.2 Hand Posture Recognition

The second example is the recognition of hand postures (fig. 3) against very complex backgrounds (fig. 4) [3]. All graphs contain 15 nodes and 20 edges. The nodes were placed manually at anatomically significant points on 6 training images for every posture. We extracted jets of three different feature types from every node of a posture's six models (compare fig. 1):

Gabor Features: Responses of complex Gabor filters with four orientations and 3 levels (half octave spacing) taken from the intensity distribution of the image.

Color Features: An eight-nearest-neighborhood average of the color in HSI (hue, saturation, intensity) color space.

ColorGabor Features: Responses of complex Gabor filters with four orientations and two levels (half octave spacing) taken on an image expressing the similarity to skin color (compare fig. 1). Skin color segmentation is based on a pixel's distance to a prototype in the HS plane.
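An image "expressing the similarity to skin color" can be sketched as a per-pixel function of the distance to a prototype in the HS plane. The text gives neither the prototype nor the mapping from distance to similarity, so everything numeric below is hypothetical: the prototype hue and saturation, the Gaussian falloff, its `scale`, and the treatment of (hue, saturation) as polar coordinates.

```python
import math


def skin_similarity(hue_deg, sat, proto_hue_deg=20.0, proto_sat=0.4, scale=0.25):
    """Similarity in (0, 1] of one pixel to an assumed skin color prototype.

    Hue (degrees) and saturation (0..1) are treated as polar coordinates in
    the HS plane, so hue wrap-around is handled by the geometry.
    """
    x = sat * math.cos(math.radians(hue_deg))
    y = sat * math.sin(math.radians(hue_deg))
    px = proto_sat * math.cos(math.radians(proto_hue_deg))
    py = proto_sat * math.sin(math.radians(proto_hue_deg))
    distance = math.hypot(x - px, y - py)
    return math.exp(-(distance / scale) ** 2)


def skin_map(hue_img, sat_img):
    """Apply skin_similarity to every pixel of nested-list hue/saturation images."""
    return [[skin_similarity(h, s) for h, s in zip(h_row, s_row)]
            for h_row, s_row in zip(hue_img, sat_img)]


# One skin-like pixel, one bluish pixel, one unsaturated pixel.
print(skin_map([[20.0, 200.0, 20.0]], [[0.4, 0.4, 0.0]]))
```

The Gabor filters of the ColorGabor feature type would then be applied to such a map instead of the intensity image.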

Figure 4: Examples of posture twelve performed by 10 out of 19 different subjects against different complex backgrounds. Of the 29 backgrounds used, 5 show much skin color (compare fig. 5), 11 show a medium amount of skin color, 8 show a little amount of skin color (compare target image in fig. 1) and 5 show no skin color at all.

weighting         simple background   complex background
only Gabor        82.6%               70.4%
only Color        39.7%               34.6%
only ColorGabor   88.2%               76.3%
best Mixture      92.9%               85.8%
chance level       8.3%                8.3%

Table 2: Results of the hand posture recognition: Correct recognition rate for test sets with simple and complex backgrounds for different weightings of feature types.

We created a graph for every feature type for every training image of every posture, giving a total of 216 graphs. The six graphs of one posture containing features of the same type were fused into a bunch graph, expressing the variability of the features among different models and variability of backgrounds [4, 3]. The resulting three bunch graphs for each posture were then fused into a single compound graph for each posture. During a matching process we allowed for 15 degree rotation of the model graph's node positions in the image plane as well as 20% rescaling. After that every node is allowed to move by one pixel in order to find a better matching position. A result of the matching process is depicted in fig. 5. The nodes usually find their proper positions during the matching, even if the background is very complex or contains large regions of skin color.
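The matching schedule of this section (global rotation and rescaling of the graph, followed by one-pixel moves of individual nodes) might be sketched as follows. Treating the 15 degree rotation and the 20% rescaling as symmetric ranges, sampling them on a coarse grid, and rotating about the centroid are all assumptions of this sketch; the function names and the `score` callback are hypothetical.

```python
import math


def transform_nodes(nodes, angle_deg, scale):
    """Rotate and scale node positions about their centroid."""
    cx = sum(x for x, _ in nodes) / len(nodes)
    cy = sum(y for _, y in nodes) / len(nodes)
    c, s = math.cos(math.radians(angle_deg)), math.sin(math.radians(angle_deg))
    return [(cx + scale * (c * (x - cx) - s * (y - cy)),
             cy + scale * (s * (x - cx) + c * (y - cy)))
            for x, y in nodes]


def global_candidates(nodes, max_angle=15.0, max_rescale=0.2, steps=3):
    """Coarse grid of (angle, scale, positions) within the allowed ranges."""
    angles = [-max_angle + i * 2 * max_angle / (steps - 1) for i in range(steps)]
    scales = [1 - max_rescale + i * 2 * max_rescale / (steps - 1) for i in range(steps)]
    return [(a, k, transform_nodes(nodes, a, k)) for a in angles for k in scales]


def refine_node(pos, score):
    """Keep the best of the node's position and its eight one-pixel neighbors."""
    x, y = pos
    return max(((x + dx, y + dy) for dx in (-1, 0, 1) for dy in (-1, 0, 1)),
               key=score)


nodes = [(0, 0), (2, 0)]
print(len(global_candidates(nodes)))
# Hypothetical score that prefers positions close to (6, 5).
print(refine_node((5, 5), lambda p: -((p[0] - 6) ** 2 + (p[1] - 5) ** 2)))
```

In a full system the `score` callback would be the compound jet similarity at the candidate position, and the best global candidate would be chosen by the total graph similarity before the per-node refinement.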

We performed cross-runs on a test set of 604 images taken against uniform light or dark background and 338 images against complex backgrounds. The results are summarized in table 2. It turns out that a proper combination of the three feature types performs better than any of them alone. For example, the error for the best weighting is less than half as big as the one for using Gabor features alone.

One might fear that performance depends critically on the precise weighting between the features. However, this is not the case. The recognition rates vary very smoothly with the weighting. There is a large plateau of weightings yielding recognition rates higher than any of the feature types alone could account for.
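The statement that the best weighting less than halves the error of Gabor features alone follows directly from the recognition rates in table 2. The short check below uses only the numbers from that table; the variable names are illustrative.

```python
# Correct recognition rates in percent, copied from table 2:
# (simple background, complex background).
rates = {
    "only Gabor": (82.6, 70.4),
    "only Color": (39.7, 34.6),
    "only ColorGabor": (88.2, 76.3),
    "best Mixture": (92.9, 85.8),
}


def error(rate):
    """Error rate in percent, rounded to the table's precision."""
    return round(100.0 - rate, 1)


errors = {name: (error(s), error(c)) for name, (s, c) in rates.items()}

# Ratio of the best mixture's error to the error of Gabor features alone.
ratio_simple = errors["best Mixture"][0] / errors["only Gabor"][0]
ratio_complex = errors["best Mixture"][1] / errors["only Gabor"][1]

# Chance level for twelve equally likely postures.
chance = 100.0 / 12

print(errors["only Gabor"], errors["best Mixture"])
print(round(ratio_simple, 2), round(ratio_complex, 2))
print(round(chance, 1))
```

The error drops from 17.4% to 7.1% on simple backgrounds and from 29.6% to 14.2% on complex backgrounds, so both ratios are below one half.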

4 Discussion

We have demonstrated that Elastic Graph Matching with compound jets is a very powerful architecture for view-based object recognition. The introduction of additional color features could significantly enhance the performance of two example systems previously working only with Gabor jets.

In the case of the cluttered scenes recognition system, performance improved greatly (from 10.0% to 62.9% correct scenes, see table 1), but one has to avoid domination by the color cues because they are less robust and generate many false positive recognitions for the complex background. For the hand posture recognition, error rates could be halved. Fortunately, finding suitable weightings for the different feature types turned out to be no problem, since performance hardly depends on the precise weighting as long as all feature types are used.

Figure 5: Example of a model graph being matched onto an input image. Although the background has large regions of skin color, the graph is positioned properly due to the influence of the Gabor features.

Our architecture combines the strength of using multiple feature types, as does Mel's SeeMore system [2], while at the same time still being able to code the spatial arrangement of features, as does conventional EGM. This allows it to analyze visual scenes which are cluttered or in which the object of interest appears in front of a complex background, situations in which SeeMore must fail and conventional EGM performs significantly worse. The use of compound jets may not only enhance object recognition but may also be applied in other domains. The ability to reliably estimate image point correspondences (fig. 1) can be used for tracking or stereo algorithms.

References

[1] M. Lades, J. C. Vorbrüggen, J. Buhmann, J. Lange, C. von der Malsburg, R. P. Würtz, and W. Konen. Distortion invariant object recognition in the dynamic link architecture. IEEE Trans. on Computers, 42:300-311, 1993.

[2] B. W. Mel. Seemore: Combining color, shape, and texture. Neural Computation, 9(4):777-804, 1997.

[3] J. Triesch and C. von der Malsburg. Robust classification of hand postures against complex backgrounds. In Proceedings of the Second International Conference on Automatic Face and Gesture Recognition 1996, Killington, Vermont, USA, October 14-16, 1996.

[4] L. Wiskott, J.-M. Fellous, N. Krüger, and C. von der Malsburg. Face recognition by elastic graph matching. IEEE Trans. PAMI, 19(7), 1997.

[5] L. Wiskott and C. von der Malsburg. A neural system for the recognition of partially occluded objects in cluttered scenes. Int. J. of Pattern Recognition and Artificial Intelligence, 7(4):935-948, 1993.
