Object Recognition with Multiple Feature Types - ResearchGate
in:<br />
Proceedings of the ICANN 1998, Skövde, Sweden<br />
Perspectives in Neural Computing<br />
L. Niklasson, M. Bodén and T. Ziemke (eds.)<br />
Springer Verlag, Berlin, Heidelberg, New York, 1998.<br />
Object Recognition with Multiple<br />
Feature Types<br />
Jochen Triesch and Christian Eckes<br />
Institut für Neuroinformatik, Ruhr-Universität Bochum<br />
D-44780 Bochum, Germany<br />
{Jochen.Triesch,Christian.Eckes}@neuroinformatik.ruhr-uni-bochum.de<br />
Abstract<br />
One of the brain's recipes for robustly perceiving the world is to integrate<br />
multiple feature types such as shape, color, texture and motion. We have<br />
investigated to what extent neural-network-based object recognition can also<br />
profit from the combination of several feature types. For this purpose we<br />
have extended Elastic Graph Matching such that several feature types<br />
may be combined in the object models. We applied the system in two<br />
difficult application domains, the interpretation of cluttered scenes and<br />
the recognition of hand postures against complex backgrounds. Our results<br />
demonstrate that the use of additional feature types significantly<br />
improves performance.<br />
1 Introduction<br />
Vision is a hard problem which our brains solve very well. The neurons in<br />
visual cortex extract different features of the input image. Some represent<br />
shape, others motion or color or combinations of these. These features have to<br />
be integrated to form object descriptions which can be stored and recognized.<br />
In computer vision, it has been realized that the integration of different<br />
feature types can be useful for object recognition tasks. Perhaps the most<br />
extreme example is Mel's SeeMore system [2]. There, recognition is based<br />
on 102 viewpoint invariant nonlinear filters, which code contour, texture, and<br />
color features.<br />
We were interested in the extent to which the use of multiple feature<br />
types could improve Elastic Graph Matching (EGM), a neurally inspired object<br />
recognition system [1]. Our particular interest was in whether appropriate color<br />
features would enhance object recognition in difficult situations: cluttered<br />
scenes with many mutually occluding objects and complex backgrounds.<br />
Supported by a grant from the German Federal Ministry for Science and Technology<br />
(01 IN 504 E9).
Figure 1: Similarities between compound jets combining Gabor, color and colorGabor<br />
features. Top row: Far left: Source image with the circle indicating the position<br />
where the compound jet was extracted. Middle left: Skin color segmentation of the<br />
source image. Middle right: Target image. Far right: Skin color segmentation of the<br />
target image. Bottom row: Similarity landscapes obtained when comparing compound<br />
jets extracted at each position in the target image with the jet taken at the<br />
marked position in the source image using different weightings. Left to right: only<br />
Gabor features used, only color features, only colorGabor features, a proper combination<br />
of all three. The circle in the target image (top row, middle right) corresponds<br />
to the position of the global maximum in the rightmost image in the bottom row.<br />
The corresponding point was found correctly despite the complex background.<br />
2 Graph Matching with Compound Jets<br />
In Elastic Graph Matching (EGM), objects are stored as graphs. The nodes<br />
of the graphs are labeled with a local image description in the form of a vector<br />
of responses of feature detectors. The edges are labeled with geometrical<br />
information, thus representing spatial relations between the features. During<br />
recognition a model graph of an object is matched onto the input image. During<br />
this process the graph's nodes try to find matching image regions while at<br />
the same time attempting to keep their spatial relations intact [1].<br />
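The trade-off described above, between good node matches and intact spatial relations, can be sketched as a single score. This is a minimal sketch and not the authors' implementation: the function name `graph_match_score`, the `rigidity` parameter and its default value, and the squared-difference distortion term are all assumptions, chosen in the spirit of the cost function used in [1].

```python
from typing import Sequence, Tuple

Point = Tuple[float, float]
Edge = Tuple[int, int]


def graph_match_score(
    node_similarities: Sequence[float],
    model_positions: Sequence[Point],
    image_positions: Sequence[Point],
    edges: Sequence[Edge],
    rigidity: float = 0.01,
) -> float:
    """Mean node similarity minus a penalty for distorted edges.

    `rigidity` is a hypothetical parameter that sets how strongly the
    graph resists deformation; the paper does not give a value.
    """
    similarity_term = sum(node_similarities) / len(node_similarities)

    distortion = 0.0
    for a, b in edges:
        # Edge vector in the stored model graph.
        mx = model_positions[b][0] - model_positions[a][0]
        my = model_positions[b][1] - model_positions[a][1]
        # Edge vector between the current node positions in the image.
        ix = image_positions[b][0] - image_positions[a][0]
        iy = image_positions[b][1] - image_positions[a][1]
        distortion += (ix - mx) ** 2 + (iy - my) ** 2
    if edges:
        distortion /= len(edges)

    return similarity_term - rigidity * distortion


# A pure translation keeps every edge vector intact, so nothing is penalized;
# stretching the single edge lowers the score.
model = [(0.0, 0.0), (7.0, 0.0)]
translated = [(10.0, 10.0), (17.0, 10.0)]
stretched = [(10.0, 10.0), (20.0, 10.0)]
sims = [0.8, 0.6]
score_translated = graph_match_score(sims, model, translated, [(0, 1)])
score_stretched = graph_match_score(sims, model, stretched, [(0, 1)])
```

A matching procedure would search over node positions for the placement that maximizes such a score; how that search is organized is not covered by this sketch.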
Earlier versions of EGM worked with only a single shape or<br />
texture feature type extracted from a grey-level image (e.g. Gabor or Mallat<br />
filters); we have extended it to allow for multiple feature types. A vector<br />
of filter responses of the same feature type is traditionally called a jet; we call<br />
a vector of filter responses stemming from different feature types a compound<br />
jet. Usually, a compound jet is the result of the concatenation of traditional<br />
jets. For example, a compound jet $J$ may be composed of a shape jet $j_s$ containing<br />
responses of shape feature detectors and a color jet $j_c$ containing color<br />
information.<br />
When similarities between compound jets are computed, first the similarities<br />
of corresponding simple jets are computed. Their similarities are then<br />
added with certain normalized weighting factors, e.g.<br />
$$\mathrm{Sim}(J, J') = w_s\,\mathrm{Sim}(j_s, j_s') + w_c\,\mathrm{Sim}(j_c, j_c'), \qquad w_s + w_c = 1 \tag{1}$$<br />
where $w_s$ and $w_c$ are the weights of the shape and the color features respectively.<br />
Figure 1 gives an impression of the advantages of combining multiple<br />
feature types in a compound jet when trying to find point correspondences between<br />
images. The feature types used are the ones we employed for the hand<br />
posture recognition task.<br />
Figure 2: Cluttered scenes overview: An example of a complex scene is shown on the<br />
left, the result of the corresponding interpretation in the middle, and the similarity<br />
landscape of the Coke tin as a function of the graph's position in the image<br />
on the right (bright values correspond to high similarity).<br />
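The weighted sum of equation (1) can be sketched in a few lines. This is a minimal sketch under stated assumptions: the paper does not say which similarity function is used for the simple jets, so the normalized dot product in `jet_similarity` is an assumption, as are the function names and the dictionary layout of a compound jet.

```python
import math
from typing import Dict, Sequence


def jet_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Normalized dot product of two simple jets (assumed similarity measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def compound_similarity(
    jet: Dict[str, Sequence[float]],
    other: Dict[str, Sequence[float]],
    weights: Dict[str, float],
) -> float:
    """Weighted sum of per-feature-type similarities, as in equation (1).

    The weights are rescaled so that they sum to one.
    """
    total = sum(weights.values())
    if total <= 0.0:
        raise ValueError("weights must sum to a positive value")
    return sum(
        (w / total) * jet_similarity(jet[name], other[name])
        for name, w in weights.items()
    )


# Two compound jets with identical shape parts and orthogonal color parts.
source = {"shape": [1.0, 2.0, 2.0], "color": [1.0, 0.0]}
target = {"shape": [1.0, 2.0, 2.0], "color": [0.0, 1.0]}
equal_weights = compound_similarity(source, target, {"shape": 0.5, "color": 0.5})
shape_only = compound_similarity(source, target, {"shape": 1.0, "color": 0.0})
```

Keeping the simple jets separate, rather than comparing one long concatenated vector, is what lets the relative influence of each feature type be set by a single weight.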
3 Two Example Applications<br />
3.1 Interpretation of Cluttered Scenes<br />
Our first application analyses cluttered scenes made up of toy objects<br />
heavily occluding each other in front of a complex or white background. The<br />
recognition system is based on EGM with a competitive front-to-back algorithm<br />
to determine the mutual occlusion of the objects (for details see [5]). The<br />
compound jets are built from two basic features: the responses of a Gabor wavelet<br />
filter bank with 3 levels and 8 orientations, applied to the intensity component<br />
of the color image, and the raw hue and saturation values averaged over the<br />
pixel neighborhood centered at the node's position. The system has to analyze<br />
70 scenes (50% with white background and 50% with colored complex background),<br />
each made up of up to 5 toy objects (out of 11 objects). We<br />
work with 256 × 256 pixel 8-bit HSI images, and the recognition parameters are set as<br />
follows: number of visible pixels must be 2500, threshold for acceptance is<br />
0.7 and algorithm II was used (see [5] for further details). The model graphs<br />
are generated manually using a grid topology with 7-pixel spacing between<br />
neighboring nodes; the corresponding training images are taken in front of a<br />
white background. See figure 2 for an illustration of the system.<br />
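The color part of such a compound jet, raw hue and saturation averaged around a node, can be sketched as follows. This is only an illustration: the paper does not give the size of the neighborhood, so the `radius` parameter, the function name and the nested-list image layout are assumptions.

```python
from typing import Sequence, Tuple


def mean_hue_saturation(
    hue: Sequence[Sequence[float]],
    saturation: Sequence[Sequence[float]],
    row: int,
    col: int,
    radius: int = 1,
) -> Tuple[float, float]:
    """Plain arithmetic mean of hue and saturation in a square window.

    The window is centred on (row, col) and clipped at the image border.
    Hue is averaged as a raw value, as the text describes; the wrap-around
    of the hue circle is deliberately ignored here.
    """
    rows = range(max(0, row - radius), min(len(hue), row + radius + 1))
    cols = range(max(0, col - radius), min(len(hue[0]), col + radius + 1))
    count = len(rows) * len(cols)
    mean_h = sum(hue[r][c] for r in rows for c in cols) / count
    mean_s = sum(saturation[r][c] for r in rows for c in cols) / count
    return mean_h, mean_s


# Tiny 3 x 3 example image with hue varying by column and constant saturation.
hue_img = [[0.1, 0.2, 0.3], [0.1, 0.2, 0.3], [0.1, 0.2, 0.3]]
sat_img = [[0.5, 0.5, 0.5], [0.5, 0.5, 0.5], [0.5, 0.5, 0.5]]
center_h, center_s = mean_hue_saturation(hue_img, sat_img, 1, 1)
corner_h, corner_s = mean_hue_saturation(hue_img, sat_img, 0, 0)
```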
To judge whether and how the fusion of the different feature types in<br />
the compound jet influences recognition performance, we performed<br />
9 cross-runs with different relative weightings of color and Gabor features (see<br />
table 1 for the results). Performance peaks near a color weight of 50%<br />
and decays quickly as the color weight increases further, which is due to<br />
the strong increase in false positive recognitions. This<br />
indicates that the pure color values alone do not support robust recognition,<br />
but that they can increase overall performance significantly as long as the Gabor<br />
features retain a major influence.<br />
color weight (%) | 0.0 | 12.5 | 25.0 | 37.5 | 50.0 | 62.5 | 75.0 | 87.5 | 100<br />
correct scenes (%) | 10.0 | 18.8 | 27.1 | 35.7 | 62.9 | 8.6 | 1.4 | 0.0 | 0.0<br />
false positives (count) | 0 | 0 | 1 | 2 | 3 | 104 | 163 | 170 | 201<br />
Table 1: Scene recognition results for the test sets with simple and complex background<br />
using different weights of the two feature types. "Correct scenes" is the percentage<br />
of correct scene interpretations (position and mutual occlusion correct, no false<br />
positives) and "false positives" is the number of objects found in scenes in which they<br />
are not present. We have used 70 scenes with a total of 190 placed objects (average<br />
of 2.7 objects per image).<br />
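The trend in table 1 can be read off programmatically. The sketch below only restates the table's numbers; the variable names are ours, and the "largest jump" criterion for locating the onset of false positives is an illustrative choice, not something the paper defines.

```python
# Values copied from table 1.
color_weight_percent = [12.5 * i for i in range(9)]
correct_scenes_percent = [10.0, 18.8, 27.1, 35.7, 62.9, 8.6, 1.4, 0.0, 0.0]
false_positives = [0, 0, 1, 2, 3, 104, 163, 170, 201]

# Color weight with the highest share of correctly interpreted scenes.
best_index = max(range(len(color_weight_percent)),
                 key=lambda i: correct_scenes_percent[i])
best_color_weight = color_weight_percent[best_index]

# Pair of neighboring weights between which false positives grow the most.
increases = [false_positives[i + 1] - false_positives[i]
             for i in range(len(false_positives) - 1)]
jump_index = max(range(len(increases)), key=lambda i: increases[i])
jump_interval = (color_weight_percent[jump_index],
                 color_weight_percent[jump_index + 1])
```

With these numbers the best color weight and the lower end of the interval with the largest rise in false positives coincide, which matches the observation that performance collapses as soon as color dominates.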
Figure 3: The twelve postures used in the study. The model graphs are created from<br />
six example images (3 persons, light and dark background).<br />
3.2 Hand Posture Recognition<br />
The second example is the recognition of hand postures (fig. 3) against very<br />
complex backgrounds (fig. 4) [3]. All graphs contain 15 nodes and 20 edges.<br />
The nodes were placed manually at anatomically significant points on 6 training<br />
images for every posture. We extracted jets of three different feature types from<br />
every node of a posture's six models (compare fig. 1):<br />
Gabor Features: Responses of complex Gabor filters with four orientations<br />
and three levels (half octave spacing) taken from the intensity distribution of the<br />
image.<br />
Color Features: An eight-nearest-neighbor average of the color in HSI<br />
(hue, saturation, intensity) color space.<br />
ColorGabor Features: Responses of complex Gabor filters with four orientations<br />
and two levels (half octave spacing) taken on an image expressing the<br />
similarity to skin color (compare fig. 1). Skin color segmentation is based on a<br />
pixel's distance to a prototype in the HS plane.<br />
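A skin-similarity value based on the distance to a prototype in the HS plane might be computed as in the following sketch. Everything beyond "distance to a prototype" is an assumption made for illustration: the prototype values, the Gaussian fall-off with width `sigma`, and the treatment of hue as an angle and saturation as a radius are not taken from the paper.

```python
import math
from typing import Tuple


def skin_similarity(
    hue_deg: float,
    saturation: float,
    prototype: Tuple[float, float] = (20.0, 0.4),  # hypothetical (hue, saturation)
    sigma: float = 0.15,  # hypothetical width of the fall-off
) -> float:
    """Similarity in (0, 1] that decays with distance to a skin prototype.

    Hue is treated as an angle and saturation as a radius, so the distance
    is Euclidean in the HS plane and handles the hue wrap-around at 360 degrees.
    """

    def to_xy(h_deg: float, s: float) -> Tuple[float, float]:
        angle = math.radians(h_deg)
        return s * math.cos(angle), s * math.sin(angle)

    x, y = to_xy(hue_deg, saturation)
    px, py = to_xy(*prototype)
    distance = math.hypot(x - px, y - py)
    return math.exp(-(distance ** 2) / (2.0 * sigma ** 2))


at_prototype = skin_similarity(20.0, 0.4)
near_wraparound = skin_similarity(359.0, 0.4, prototype=(1.0, 0.4))
far_blue = skin_similarity(240.0, 0.8)
```

Applying such a function to every pixel yields a grey-level map on which Gabor filters can be run, which is the role the skin-similarity image plays for the ColorGabor features.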
We created a graph for every feature type for every training image of every<br />
posture, giving a total of 216 graphs (12 postures × 6 training images × 3 feature<br />
types). The six graphs of one posture containing<br />
features of the same type were fused into a bunch graph, expressing the variability<br />
of the features among different models and the variability of backgrounds [4, 3].<br />
The resulting three bunch graphs for each posture were then fused into a single<br />
compound graph for each posture. During matching we allowed for 15<br />
degree rotation of the model graph's node positions in the image plane as well<br />
as 20% rescaling. After that, every node was allowed to move by one pixel in order<br />
to find a better matching position. A result of the matching process is depicted<br />
in fig. 5. The nodes usually find their proper positions during the matching,<br />
even if the background is very complex or contains large regions of skin color.<br />
Figure 4: Examples of posture twelve performed by 10 out of 19 different subjects<br />
against different complex backgrounds. Of the 29 backgrounds used, 5 show much<br />
skin color (compare fig. 5), 11 show a medium amount of skin color, 8 show a small<br />
amount of skin color (compare target image in fig. 1) and 5 show no skin color at all.<br />
weighting | simple background | complex background<br />
only Gabor | 82.6% | 70.4%<br />
only Color | 39.7% | 34.6%<br />
only ColorGabor | 88.2% | 76.3%<br />
best Mixture | 92.9% | 85.8%<br />
chance level | 8.3% | 8.3%<br />
Table 2: Results of the hand posture recognition: correct recognition rate for test<br />
sets with simple and complex backgrounds for different weightings of feature types.<br />
We performed cross-runs on a test set of 604 images taken against uniform light<br />
or dark backgrounds and 338 images against complex backgrounds. The results<br />
are summarized in table 2. A proper combination of the three<br />
feature types performs better than any of them alone. For example, the error rate for the<br />
best weighting is less than half of that obtained with Gabor features alone.<br />
One might fear that performance depends critically on the precise weighting<br />
between the features. However, this is not the case. The recognition rates vary<br />
very smoothly with the weighting. There is a large plateau of weightings yielding<br />
recognition rates higher than any of the feature types alone could account<br />
for.<br />
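The claim about the error rates can be checked directly against table 2. The short calculation below uses only the recognition rates reported there; the dictionary layout and the helper name `error_ratio` are ours.

```python
# Correct recognition rates in percent, copied from table 2.
recognition_rate_percent = {
    "only Gabor": {"simple": 82.6, "complex": 70.4},
    "only Color": {"simple": 39.7, "complex": 34.6},
    "only ColorGabor": {"simple": 88.2, "complex": 76.3},
    "best Mixture": {"simple": 92.9, "complex": 85.8},
}


def error_ratio(background: str) -> float:
    """Error rate of the best mixture divided by that of Gabor features alone."""
    best_error = 100.0 - recognition_rate_percent["best Mixture"][background]
    gabor_error = 100.0 - recognition_rate_percent["only Gabor"][background]
    return best_error / gabor_error


ratio_simple = error_ratio("simple")    # 7.1 / 17.4, about 0.41
ratio_complex = error_ratio("complex")  # 14.2 / 29.6, about 0.48
```

Both ratios lie below one half, for simple as well as for complex backgrounds.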
4 Discussion<br />
We have demonstrated that Elastic Graph Matching with compound jets is a<br />
very powerful architecture for view-based object recognition. The introduction<br />
of additional color features significantly enhanced the performance of two<br />
example systems that previously worked only with Gabor jets.<br />
For the cluttered scenes recognition system, performance improved<br />
substantially, but one has to avoid letting the color cues dominate because<br />
they are less robust and generate many false positive recognitions for the<br />
complex background. For the hand posture recognition, error rates could be<br />
halved. Fortunately, finding suitable weightings for the different feature types<br />
turned out to be no problem, since performance hardly depends on the precise<br />
weighting as long as all feature types are used.<br />
Figure 5: Example of a model graph being matched onto an input image. Although<br />
the background has large regions of skin color, the graph is positioned properly due<br />
to the influence of the Gabor features.<br />
Our architecture combines the strength of using multiple feature types, as<br />
does Mel's SeeMore system [2], while at the same time still being able to code<br />
the spatial arrangement of features, as does conventional EGM. This allows it<br />
to analyze visual scenes which are cluttered or in which the object of interest<br />
appears in front of a complex background, situations in which SeeMore must fail and conventional<br />
EGM performs significantly worse. The use of compound jets may not only<br />
enhance object recognition but may also be applied in other domains. The<br />
ability to reliably estimate image point correspondences (fig. 1) can be used for<br />
tracking or stereo algorithms.<br />
References<br />
[1] M. Lades, J. C. Vorbrüggen, J. Buhmann, J. Lange, C. von der Malsburg, R. P.<br />
Würtz, and W. Konen. Distortion invariant object recognition in the dynamic<br />
link architecture. IEEE Trans. on Computers, 42:300-311, 1993.<br />
[2] Bartlett W. Mel. SeeMore: Combining color, shape, and texture. Neural Computation,<br />
9(4):777-804, 1997.<br />
[3] J. Triesch and C. von der Malsburg. Robust classification of hand postures against<br />
complex backgrounds. In Proceedings of the Second International Conference on<br />
Automatic Face and Gesture Recognition 1996, Killington, Vermont, USA, October<br />
14-16, 1996.<br />
[4] L. Wiskott, J.-M. Fellous, N. Krüger, and C. von der Malsburg. Face recognition<br />
by elastic graph matching. IEEE Trans. PAMI, 19(7), 1997.<br />
[5] Laurenz Wiskott and Christoph von der Malsburg. A neural system for the recognition<br />
of partially occluded objects in cluttered scenes. Int. J. of Pattern Recognition<br />
and Artificial Intelligence, 7(4):935-948, 1993.