Object Recognition with Multiple Feature Types - ResearchGate

in: 

Proceedings of the ICANN 1998, Skövde, Sweden

Perspectives in Neural Computing 

L. Niklasson, M. Bodén and T. Ziemke (eds.)

Springer Verlag, Berlin, Heidelberg, New York, 1998. 

Object Recognition with Multiple Feature Types

Jochen Triesch and Christian Eckes 

Institut für Neuroinformatik, Ruhr-Universität Bochum

D-44780 Bochum, Germany 

{Jochen.Triesch,Christian.Eckes}@neuroinformatik.ruhr-uni-bochum.de

Abstract 

One of the brain's recipes for robustly perceiving the world is to integrate
multiple feature types such as shape, color, texture and motion. We have
investigated to what extent neural-network-based object recognition can also
profit from the combination of several feature types. For this purpose we
have extended Elastic Graph Matching such that several feature types
may be combined in the object models. We applied the system in two
difficult application domains, the interpretation of cluttered scenes and
the recognition of hand postures against complex backgrounds. Our results
demonstrate that the use of additional feature types significantly
improves performance.

1 Introduction 

Vision is a hard problem which our brains solve very well. The neurons in
visual cortex extract different features of the input image. Some represent
shape, others motion or color or combinations of these. These features have to
be integrated to form object descriptions which can be stored and recognized.

In computer vision, it has been realized that the integration of different
feature types can be useful for object recognition tasks. Perhaps the most
extreme example is Mel's SeeMore system [2]. There, recognition is based
on 102 viewpoint invariant nonlinear filters, which code contour, texture, and
color features.

We were interested in the question of how far the use of multiple feature
types could improve Elastic Graph Matching (EGM), a neurally inspired object
recognition system [1]. Our particular interest was in whether appropriate color
features would enhance object recognition in difficult situations: cluttered
scenes with many mutually occluding objects and complex backgrounds.

Supported by a grant from the German Federal Ministry for Science and Technology 

(01 IN 504 E9).

Figure 1: Similarities between compound jets combining Gabor, color and ColorGabor
features. Top row: Far left: Source image with the circle indicating the position
where the compound jet was extracted. Middle left: Skin color segmentation of the
source image. Middle right: Target image. Far right: Skin color segmentation of the
target image. Bottom row: Similarity landscapes obtained when comparing compound
jets extracted at each position in the target image with the jet taken at the
marked position in the source image using different weightings. Left to right: only
Gabor features used, only color features, only ColorGabor features, a proper combination
of all three. The circle in the target image (top row, middle right) corresponds
to the position of the global maximum in the rightmost image in the bottom row.
The corresponding point was found correctly despite the complex background.

2 Graph Matching with Compound Jets 

In Elastic Graph Matching (EGM), objects are stored as graphs. The nodes
of the graphs are labeled with a local image description in the form of a vector
of responses of feature detectors. The edges are labeled with geometrical
information, thus representing spatial relations between the features. During
recognition, a model graph of an object is matched onto the input image. During
this process the graph's nodes try to find matching image regions while at
the same time attempting to keep their spatial relations intact [1].
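As a rough illustration of the data structure just described, the following Python sketch stores a model graph and scores one candidate placement of its nodes. The class `ModelGraph`, the function names, the squared edge-distortion penalty and the `rigidity` parameter are all assumptions made for illustration; the cost function actually used by EGM is specified in [1], not in this paper.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


@dataclass
class ModelGraph:
    jets: List[List[float]]        # node labels: one feature vector per node
    positions: List[Point]         # node positions in the model image
    edges: List[Tuple[int, int]]   # pairs of node indices


def edge_distortion(graph: ModelGraph, matched: Sequence[Point]) -> float:
    """Mean squared change of the edge vectors between model and match."""
    if not graph.edges:
        return 0.0
    total = 0.0
    for a, b in graph.edges:
        model_dx = graph.positions[b][0] - graph.positions[a][0]
        model_dy = graph.positions[b][1] - graph.positions[a][1]
        image_dx = matched[b][0] - matched[a][0]
        image_dy = matched[b][1] - matched[a][1]
        total += (image_dx - model_dx) ** 2 + (image_dy - model_dy) ** 2
    return total / len(graph.edges)


def match_quality(graph: ModelGraph, matched: Sequence[Point],
                  node_similarities: Sequence[float],
                  rigidity: float = 0.01) -> float:
    """Mean node similarity minus a penalty for distorted spatial relations.

    `rigidity` is a hypothetical trade-off parameter, not a value from the paper.
    """
    mean_similarity = sum(node_similarities) / len(node_similarities)
    return mean_similarity - rigidity * edge_distortion(graph, matched)
```

A pure translation of all nodes leaves every edge vector unchanged, so it incurs no penalty in this sketch; only relative displacements between connected nodes are punished.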

While earlier versions of EGM worked with only a single shape or
texture feature type extracted from a grey-level image (e.g. Gabor or Mallat
filters), we have extended it to allow for multiple feature types. A vector
of filter responses of the same feature type is traditionally called a jet; we call
a vector of filter responses stemming from different feature types a compound
jet. Usually, a compound jet is the result of concatenating traditional
jets. For example, a compound jet J may be composed of a shape jet j_s containing
responses of shape feature detectors and a color jet j_c containing color
information.

When similarities between compound jets are computed, first the similarities
of corresponding simple jets are computed. Their similarities are then
added with certain normalized weighting factors, e.g.

$$\mathrm{Sim}(J, J') = w_s\,\mathrm{Sim}(j_s, j_s') + w_c\,\mathrm{Sim}(j_c, j_c'), \qquad w_s + w_c = 1 \tag{1}$$

where w_s and w_c are the weights of the shape and the color features, respectively.

Figure 1 gives an impression of the advantages of combining multiple
feature types in a compound jet when trying to find point correspondences
between images. The feature types used are the ones we employed for the hand
posture recognition task.

Figure 2: Cluttered Scenes overview: An example of a complex scene is shown on the
left, the result of the corresponding interpretation in the middle, and the similarity
landscape of the Coke Tin depending on the graph's position in the image is shown
on the right (bright values correspond to high similarity).
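The weighted sum in equation (1) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the per-feature similarity is assumed here to be a normalized dot product, and the dictionary layout and the names `jet_similarity` and `compound_similarity` are hypothetical. The weights are renormalized inside the function so that they always sum to one, as equation (1) requires.

```python
import math
from typing import Dict, Sequence


def jet_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Normalized dot product of two simple jets (an assumed similarity)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0.0 else 0.0


def compound_similarity(jet: Dict[str, Sequence[float]],
                        other: Dict[str, Sequence[float]],
                        weights: Dict[str, float]) -> float:
    """Weighted sum of per-feature-type similarities, as in equation (1)."""
    total = sum(weights.values())
    if total <= 0.0:
        raise ValueError("weights must have a positive sum")
    return sum((w / total) * jet_similarity(jet[name], other[name])
               for name, w in weights.items())


# Two compound jets that agree in shape but differ completely in color.
J = {"shape": [1.0, 0.0], "color": [0.0, 1.0]}
J_prime = {"shape": [1.0, 0.0], "color": [1.0, 0.0]}
sim = compound_similarity(J, J_prime, {"shape": 0.5, "color": 0.5})
```

With equal weights the example combines a shape similarity of 1 and a color similarity of 0. Adding a third feature type only requires another dictionary entry and weight.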

3 Two Example Applications 

3.1 Interpretation of Cluttered Scenes 

Our first application analyses cluttered scenes made up of toy objects
heavily occluding each other in front of a complex or white background. The
recognition system is based on EGM with a competitive front-to-back algorithm
to determine the mutual occlusion of the objects (for details see [5]). The basic
features used to generate compound jets are the responses of a Gabor wavelet
filter bank with 3 levels and 8 orientations, applied to the intensity part of
the color image, and the raw hue and saturation values averaged over the pixel
neighborhood centered at the node's position. The system has to analyze 70 scenes
(50% with white background and 50% with colored complex background), each made
up of up to 5 toy objects (out of 11 objects). We work with 256 × 256 pixel,
8-bit HSI images, and the recognition parameters are set as follows: number of
visible pixels must be 2500, threshold for acceptance is 0.7, and algorithm II
was used (see [5] for further details). The model graphs are generated manually
using a grid topology with 7 pixel spacing between neighboring nodes; the
corresponding training images are taken in front of a white background. See
figure 2 for an illustration of the system.
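To make the feature layout of the scene system concrete, the sketch below enumerates the wave vectors of a Gabor bank with 3 levels and 8 orientations and averages raw hue and saturation around a node. Only the counts (3 levels, 8 orientations) come from the text; the maximum frequency `k_max`, the half-octave `spacing`, the neighborhood `radius` and all function names are assumptions chosen for illustration.

```python
import math
from typing import List, Sequence, Tuple


def gabor_wave_vectors(levels: int = 3, orientations: int = 8,
                       k_max: float = math.pi / 2,
                       spacing: float = math.sqrt(2.0)) -> List[Tuple[float, float]]:
    """One wave vector per (level, orientation) pair of the filter bank."""
    vectors = []
    for level in range(levels):
        k = k_max / spacing ** level                  # frequency drops per level
        for o in range(orientations):
            phi = math.pi * o / orientations          # orientations cover half a turn
            vectors.append((k * math.cos(phi), k * math.sin(phi)))
    return vectors


def mean_hue_saturation(hue: Sequence[Sequence[float]],
                        sat: Sequence[Sequence[float]],
                        x: int, y: int, radius: int = 1) -> Tuple[float, float]:
    """Average raw hue and saturation over a square neighborhood of (x, y)."""
    height, width = len(hue), len(hue[0])
    hs, ss = [], []
    for yy in range(max(0, y - radius), min(height, y + radius + 1)):
        for xx in range(max(0, x - radius), min(width, x + radius + 1)):
            hs.append(hue[yy][xx])
            ss.append(sat[yy][xx])
    return sum(hs) / len(hs), sum(ss) / len(ss)
```

Under these assumptions a node's compound jet would hold 24 Gabor responses plus two color values. Averaging raw hue ignores its circular nature, which matches the text's wording ("raw hue") but is worth keeping in mind near the red wrap-around.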

In order to judge whether and how the fusion of the different feature types in
the compound jet influences recognition performance, we performed
9 cross-runs with different relative weightings of color and Gabor features (see
table 1 for the results). The outcome shows a performance peak near 50%
relative weight of the color features (62.9% correct scenes, compared with 10.0%
for Gabor features alone) and a fast decay as the color weight increases further;
this is due to the strong increase of false positive recognitions. It
indicates that the pure color values alone do not support robust recognition,
but they can increase the overall performance significantly as long as the Gabor
features still have major influence.

weight color %      0.0   12.5   25.0   37.5   50.0   62.5   75.0   87.5   100
correct scenes %   10.0   18.8   27.1   35.7   62.9    8.6    1.4    0.0    0.0
false positives       0      0      1      2      3    104    163    170    201

Table 1: Scene recognition results for the test sets with simple and complex background
using different weights of the two feature types. Correct scenes is the percentage
of correct scene interpretations (position and mutual occlusion correct, no false
positives) and false positives is the number of objects found in scenes in which they
are not present. We have used 70 scenes with a total of 190 placed objects (average
of 2.7 objects per image).
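The sweep behind Table 1 can be reproduced as a small bookkeeping exercise. The Python sketch below only re-reads the numbers printed in the table; the variable names are hypothetical, and the code does not rerun any recognition experiment.

```python
# Values copied from Table 1.
color_weight_pct = [0.0, 12.5, 25.0, 37.5, 50.0, 62.5, 75.0, 87.5, 100.0]
correct_scenes_pct = [10.0, 18.8, 27.1, 35.7, 62.9, 8.6, 1.4, 0.0, 0.0]
false_positives = [0, 0, 1, 2, 3, 104, 163, 170, 201]

# The nine cross-runs correspond to nine equally spaced color weights.
sweep = [100.0 * i / 8 for i in range(9)]

# Index of the best-performing weighting.
best = max(range(len(correct_scenes_pct)), key=lambda i: correct_scenes_pct[i])
best_weight = color_weight_pct[best]

# Jump in false positives when moving one step past the best weighting.
false_positive_jump = false_positives[best + 1] - false_positives[best]
```

The jump in false positives directly after the best weighting is what makes the decay on the color-dominated side so abrupt.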

Figure 3: The twelve postures used in the study. The model graphs are created from 

six example images (3 persons, light and dark background). 

3.2 Hand Posture Recognition 

The second example is the recognition of hand postures (fig. 3) against very
complex backgrounds (fig. 4) [3]. All graphs contain 15 nodes and 20 edges.
The nodes were placed manually at anatomically significant points on 6 training
images for every posture. We extracted jets of three different feature types from
every node of a posture's six models (compare fig. 1):

Gabor Features: Responses of complex Gabor filters with four orientations
and three levels (half octave spacing) taken from the intensity distribution of the
image.

Color Features: An eight nearest neighborhood average of the color in HSI
(hue, saturation, intensity) color space.

ColorGabor Features: Responses of complex Gabor filters with four orientations
and two levels (half octave spacing) taken on an image expressing the
similarity to skin color (compare fig. 1). Skin color segmentation is based on a
pixel's distance to a prototype in the HS plane.
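The skin-color image underlying the ColorGabor features can be sketched as follows. The text only states that segmentation uses a pixel's distance to a prototype in the HS plane; treating hue as an angle and saturation as a radius, the Gaussian fall-off, the prototype value and the width `sigma` are all assumptions made for this illustration.

```python
import math
from typing import Tuple


def hs_to_xy(hue_deg: float, saturation: float) -> Tuple[float, float]:
    """Map hue (angle in degrees) and saturation (radius) to the HS plane."""
    angle = math.radians(hue_deg)
    return saturation * math.cos(angle), saturation * math.sin(angle)


def skin_similarity(hue_deg: float, saturation: float,
                    prototype: Tuple[float, float] = (20.0, 0.4),
                    sigma: float = 0.15) -> float:
    """Similarity to a skin prototype, 1.0 at the prototype, falling to 0.

    `prototype` (hue in degrees, saturation) and `sigma` are hypothetical
    values, not taken from the paper.
    """
    px, py = hs_to_xy(*prototype)
    x, y = hs_to_xy(hue_deg, saturation)
    squared_distance = (x - px) ** 2 + (y - py) ** 2
    return math.exp(-squared_distance / (2.0 * sigma ** 2))
```

Applying `skin_similarity` to every pixel yields a grey-level image on which Gabor filters can be evaluated in the same way as on the intensity image.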

Figure 4: Examples of posture twelve performed by 10 out of 19 different subjects
against different complex backgrounds. Of the 29 backgrounds used, 5 show much
skin color (compare fig. 5), 11 show a medium amount of skin color, 8 show a small
amount of skin color (compare target image in fig. 1) and 5 show no skin color at all.

weighting          simple background   complex background
only Gabor               82.6%               70.4%
only Color               39.7%               34.6%
only ColorGabor          88.2%               76.3%
best Mixture             92.9%               85.8%
chance level              8.3%                8.3%

Table 2: Results of the hand posture recognition: Correct recognition rate for test
sets with simple and complex backgrounds for different weightings of feature types.

We created a graph for every feature type for every training image of every
posture, giving a total of 216 graphs (12 postures × 6 training images × 3 feature
types). The six graphs of one posture containing features of the same type were
fused into a bunch graph, expressing the variability of the features among
different models and the variability of backgrounds [4, 3].
The resulting three bunch graphs for each posture were then fused into a single
compound graph for each posture. During the matching process we allowed for 15
degree rotation of the model graph's node positions in the image plane as well
as 20% rescaling. After that, every node is allowed to move by one pixel in order
to find a better matching position. A result of the matching process is depicted
in fig. 5. The nodes usually find their proper positions during the matching,
even if the background is very complex or contains large regions of skin color.
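The search space of the matching step described above can be sketched as a set of globally transformed copies of the node positions followed by small local moves. This is an illustrative sketch only: the text gives the limits (15 degree rotation, 20% rescaling, one-pixel node moves), while the symmetric interpretation of those limits, the rotation about the centroid, the number of `steps` and the function names are assumptions.

```python
import math
from itertools import product
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def transform(nodes: Sequence[Point], angle_deg: float, scale: float) -> List[Point]:
    """Rotate and rescale node positions about their centroid."""
    cx = sum(x for x, _ in nodes) / len(nodes)
    cy = sum(y for _, y in nodes) / len(nodes)
    c, s = math.cos(math.radians(angle_deg)), math.sin(math.radians(angle_deg))
    return [(cx + scale * (c * (x - cx) - s * (y - cy)),
             cy + scale * (s * (x - cx) + c * (y - cy))) for x, y in nodes]


def global_candidates(nodes: Sequence[Point], max_angle: float = 15.0,
                      max_rescale: float = 0.2, steps: int = 3) -> List[List[Point]]:
    """All combinations of `steps` rotations and `steps` scales (steps >= 2)."""
    angles = [-max_angle + 2.0 * max_angle * i / (steps - 1) for i in range(steps)]
    scales = [1.0 - max_rescale + 2.0 * max_rescale * i / (steps - 1)
              for i in range(steps)]
    return [transform(nodes, a, sc) for a, sc in product(angles, scales)]


def local_moves(node: Point) -> List[Point]:
    """The node itself and its eight one-pixel neighbors."""
    x, y = node
    return [(x + dx, y + dy) for dx in (-1, 0, 1) for dy in (-1, 0, 1)]
```

In a full matcher each candidate would be scored with the compound jet similarity, and the best-scoring placement kept before the local one-pixel refinement.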

We performed cross-runs on a test set of 604 images taken against uniform light
or dark background and 338 images against complex backgrounds. The results
are summarized in table 2. It turns out that a proper combination of the three
feature types performs better than any of them alone. For example, the error for the
best weighting is less than half as large as that for using Gabor features alone.
One might fear that performance depends critically on the precise weighting
between the features. However, this is not the case. The recognition rates vary
very smoothly with the weighting. There is a large plateau of weightings yielding
recognition rates higher than any of the feature types alone could achieve.
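The claim about halved error rates follows from Table 2 by simple arithmetic, which the short Python sketch below spells out. It uses only the recognition rates printed in the table; the variable names are hypothetical.

```python
# Correct recognition rates from Table 2: (simple, complex) background, in percent.
rates = {
    "only Gabor": (82.6, 70.4),
    "only Color": (39.7, 34.6),
    "only ColorGabor": (88.2, 76.3),
    "best Mixture": (92.9, 85.8),
}

# Error rate is the complement of the recognition rate.
errors = {name: tuple(round(100.0 - r, 1) for r in pair)
          for name, pair in rates.items()}

# Ratio of the best mixture's error to the Gabor-only error.
ratio_simple = errors["best Mixture"][0] / errors["only Gabor"][0]
ratio_complex = errors["best Mixture"][1] / errors["only Gabor"][1]

# Chance level for twelve equally likely postures.
chance_pct = round(100.0 / 12, 1)
```

Both ratios fall below one half, for simple as well as complex backgrounds, and the chance level agrees with the last row of the table.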

4 Discussion 

We have demonstrated that Elastic Graph Matching with compound jets is a
very powerful architecture for view-based object recognition. The introduction
of additional color features significantly enhanced the performance of two
example systems that previously worked only with Gabor jets.

In the case of the cluttered scenes recognition system, performance
improved greatly, but one has to avoid domination by the color cues because
they are less robust and generate many false positive recognitions for the
complex background. For the hand posture recognition, error rates could be
halved. Fortunately, finding suitable weightings for the different feature types
turned out to be no problem, since performance hardly depends on the precise
weighting as long as all feature types are used.

Figure 5: Example of a model graph being matched onto an input image. Although
the background has large regions of skin color, the graph is positioned properly due
to the influence of the Gabor features.

Our architecture combines the strength of using multiple feature types, as
does Mel's SeeMore system [2], while at the same time still being able to code
the spatial arrangement of features, as does conventional EGM. This allows it
to analyze visual scenes which are cluttered or in which the object of interest
appears against a complex background, situations in which SeeMore must fail and
conventional EGM performs significantly worse. The use of compound jets may not only
enhance object recognition but may also be applied in other domains. The
ability to reliably estimate image point correspondences (fig. 1) can be used for
tracking or stereo algorithms.

References 

[1] M. Lades, J. C. Vorbrüggen, J. Buhmann, J. Lange, C. von der Malsburg, R. P.
Würtz, and W. Konen. Distortion invariant object recognition in the dynamic
link architecture. IEEE Trans. on Computers, 42:300-311, 1993.

[2] B. W. Mel. SEEMORE: Combining color, shape, and texture. Neural Computation,
9(4):777-804, 1997.

[3] J. Triesch and C. von der Malsburg. Robust classification of hand postures against
complex backgrounds. In Proceedings of the Second International Conference on
Automatic Face and Gesture Recognition 1996, Killington, Vermont, USA, October
14-16, 1996.

[4] L. Wiskott, J.-M. Fellous, N. Krüger, and C. von der Malsburg. Face recognition
by elastic graph matching. IEEE Trans. PAMI, 19(7), 1997.

[5] L. Wiskott and C. von der Malsburg. A neural system for the recognition
of partially occluded objects in cluttered scenes. Int. J. of Pattern Recognition
and Artificial Intelligence, 7(4):935-948, 1993.
