Maximally Stable Corner Clusters - Graz University of Technology

Maximally Stable Corner Clusters: A novel

distinguished region detector and descriptor 1)

M. Winter, H. Bischof and F. Fraundorfer

Graz University of Technology


We propose a novel distinguished region detector called Maximally Stable Corner Cluster

detector (MSCC). It is complementary to existing approaches like Harris-corner detectors,

Difference of Gaussian detectors (DoG) or Maximally Stable Extremal Regions (MSER). The

basic idea is to find distinguished regions by looking at clusters of interest points and using the

concept of maximal stableness across scale. Additionally, we propose a novel descriptor ideally

suited for regions detected by MSCC. It is based on the 2D joint occurrence histograms of

corner orientations. We demonstrate its performance and compare it against other competitive

detectors and descriptors recently evaluated by Mikolajczyk and Schmid [11].

1 Introduction

Recently, there has been a considerable interest in using local image region detectors and

descriptors for wide base-line stereo [8, 13], video retrieval and indexing [12, 9], object recognition

[7] and categorization tasks [2, 5]. There exist two main categories of distinguished

region detectors: Corner based detectors like Harris [4], Harris-Laplace [9], Harris-Affine [10]

etc. and region based detectors such as Maximally Stable Extremal Regions (MSER) [8], Difference

of Gaussian points (DoG) [7] or scale space blobs [6]. Corner based detectors locate points of

interest at regions which contain a lot of image structure, but they fail at uniform regions and

regions with smooth transitions. They also have problems in highly textured regions

where many corners on similar structures are detected, which in turn leads to mismatches.

Region based detectors deliver blob like structures of uniform regions but highly structured

regions are left out. As the two categories act quite complementary it is no surprise that people

have started to use detector combinations (e.g. Video Google [12]). The main advantage

of detector combinations is to use the complementarity of the approaches to be independent

of the scene content.

1) This work is funded by the Austrian Joint Research Project Cognitive Vision under sub-projects S9103-

N04 and S9104-N04 and by a grant from Federal Ministry for Education, Science and Culture of Austria under

the CONEX program.

Although corner and region based detectors cover already a broad variety of different scenes,

there is a broad class of images not ideally covered by either class of detectors. Fig. 1 shows an

image from a database we are using for studying visual robot localization tasks. The image

shows a door with an attached poster. The poster contains a lot of text. Fig. 1(a) shows

as an example detected Harris-corners. The text in the image results in a high number of

Harris-corners with a very similar appearance. Fig. 1(b) shows the results of a Hessian-Affine

detector. The detector selects strong corners and constructs local affine frames around

the corners on multiple scales. This leads to unfortunate detections as one can see in the

lower right part of the image. The MSER detector (see Fig. 1(c)) perfectly detects the homogeneous

regions but ignores the parts of the image containing the text. This simple example

demonstrates that none of the available detectors delivers satisfactory results.

(a) (b) (c) (d)

Figure 1: (a) Harris-corner detector. (b) Hessian-Affine detector. (c) MSER detector.

(d) MSCC detector (one scale only).

However, if we take a closer look at the results of a simple Harris-corner detector (Fig. 1(a))

we see that we can find characteristic dense patterns of detected corners in image parts which

contain some texture. This suggests considering characteristic groups of individual detections

instead of looking at a single local property of the image.

Motivated by this we suggest a novel distinguished region detector which is based on clustering

interest points. Once we have detected a distinguished region we can use any of the

available region descriptors for matching (e.g. SIFT-keys [7]). However, we can use the special

properties of our MSCC detector to develop a novel descriptor ideally suited for the detected

regions. We show that this combination outperforms other approaches on the mentioned

special class of images.

The remainder of the paper is organized as follows: In Section 2 we outline the algorithm for

the MSCC detector. In Section 3 we present the details of the novel descriptor. Finally the

detector and descriptor performance is evaluated in Section 4.

2 The MSCC detector

The MSCC detector incorporates two main novel ideas: The first one is to use point constellations

instead of single interest points. Point constellations should be more robust against

viewpoint change than single interest points because a few missing single points will not affect

the detection of the constellation itself. The second idea is that only constellations will be

selected which are stable across several ”scales”. Note that the extent of an MSCC (= the

MSCC ”scale”) is defined by the distribution of the points contributing to the constellation.

The MSCC algorithm proceeds along the following three steps:

1. Detect simple, single interest points all over the image, e.g. Harris-corners.

2. Cluster the interest points by graph-based point clustering using a minimal spanning

tree (MST).

3. Select clusters which stay stable over a certain number of scales.

Note that steps 2 and 3 of the algorithm can be implemented very efficiently,

as it is possible to do the multi scale clustering as well as the selection of stable clusters already

during the MST construction.

2.1 Interest point detection and MST calculation

To detect the interest points acting as cluster primitives we make use of the structure tensor.

For every image pixel we evaluate the structure tensor and calculate a Harris-corner strength

measure (cornerness) as given by Harris and Stephens [4]. We select a large number of corners

(all local maxima above the noise level) as our corner primitives. This ensures that we are

not dependent on a cornerness threshold. The interest points themselves represent the nodes of an

undirected weighted graph in 2D. The weight of the edge between two graph nodes is their

geometric distance, which we also refer to as edge length. The MST computation creates

edges between nearby nodes and is performed by applying Kruskal’s [1] iterative growing

algorithm on the Delaunay triangulation of the nodes.
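
As an illustration of the MST step, the following is a minimal sketch of Kruskal's algorithm on a small set of 2D points. It is not the authors' implementation: for brevity it takes all point pairs as candidate edges instead of the Delaunay edges used in the paper. Since the Euclidean MST is a subgraph of the Delaunay triangulation, the resulting tree is the same, only slower to compute. The function name and the example points are assumptions made for this sketch.

```python
import math
from itertools import combinations


def minimum_spanning_tree(points):
    """Return MST edges as (length, i, j) tuples using Kruskal's algorithm."""
    parent = list(range(len(points)))

    def find(i):
        # Find the component root, halving paths along the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Assumption: all point pairs are candidates (the paper restricts
    # the candidates to the edges of the Delaunay triangulation).
    candidates = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in combinations(range(len(points)), 2)
    )
    mst = []
    for length, i, j in candidates:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:  # the edge joins two components, so keep it
            parent[root_i] = root_j
            mst.append((length, i, j))
    return mst


# Hypothetical corner positions: a tight group of three and a pair far away.
corners = [(0, 0), (1, 0), (0, 1), (10, 10), (11, 10)]
mst_edges = minimum_spanning_tree(corners)
```

In this example the tree has four edges, three of length 1 inside the two groups and one long edge connecting them. The long edge is the one that a length threshold would remove first.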

2.2 Multi scale clustering

Since we do not know the number of clusters we have to use a non-parametric clustering

method. The method we use is inspired by the MSER detector. Given a threshold T on the

edge length we can get a subdivision of the MST into subtrees by removing all edges with

an edge length higher than this threshold. Every subtree corresponds to a cluster and gives

rise to an image region. Different values for T produce different subdivisions of the MST i.e.

different point clusters. To create a multi scale clustering we compute subdivisions of the

MST for p logarithmically spaced thresholds T_1, ..., T_p between the minimal and maximal edge

length occurring in the MST.
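
The subdivision described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the MST is given as a list of (edge length, node, node) tuples, the helper names are hypothetical, and the last threshold is pinned to the maximal edge length to avoid rounding effects.

```python
import math


def log_spaced_thresholds(t_min, t_max, p):
    """Return p thresholds spaced logarithmically from t_min to t_max."""
    if p == 1:
        return [t_max]
    step = math.log(t_max / t_min) / (p - 1)
    values = [t_min * math.exp(k * step) for k in range(p)]
    values[-1] = t_max  # pin the endpoint against rounding error
    return values


def clusters_at_threshold(n_points, mst_edges, threshold):
    """Remove MST edges longer than threshold; return the point set of each subtree."""
    neighbours = {i: [] for i in range(n_points)}
    for length, i, j in mst_edges:
        if length <= threshold:
            neighbours[i].append(j)
            neighbours[j].append(i)
    seen, clusters = set(), []
    for start in range(n_points):
        if start in seen:
            continue
        # A depth-first traversal collects the nodes of one subtree.
        stack, members = [start], set()
        while stack:
            node = stack.pop()
            if node in members:
                continue
            members.add(node)
            stack.extend(neighbours[node])
        seen |= members
        clusters.append(frozenset(members))
    return clusters


# Hypothetical MST over five points: (edge length, node, node).
mst_edges = [(1.0, 0, 1), (1.0, 0, 2), (1.0, 3, 4), (13.5, 1, 3)]
lengths = [edge[0] for edge in mst_edges]
thresholds = log_spaced_thresholds(min(lengths), max(lengths), p=4)
cluster_sets = [clusters_at_threshold(5, mst_edges, t) for t in thresholds]
```

For the three smaller thresholds the points split into the clusters {0, 1, 2} and {3, 4}. Only the largest threshold keeps the long edge and merges all points into one cluster.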

2.3 Selection of stable clusters

The previous step produced p different cluster sets. We are now interested in clusters which

do not change their shape over several scales - namely those that are stable. As a stability

criterion for a cluster we compare its set of points across scales. Clusters consisting of the same

set of points across r different scales are defined as stable and constitute the output of the

MSCC detector. A similar stability criterion is used by Matas et al. in the MSER detector

for grayvalues. Fig. 2(a) illustrates the method on a synthetic test image. The image shows

4 differently sized squares. The Harris-corner detection step produces several responses on

the corners of the squares. Connecting the single points with the MST reveals a structure

where one can easily see that clustering can be done by removing the larger edges. Clusters

of interest points are indicated by ellipses around them. The test image shows the capability

of detecting stable clusters at multiple scales, starting from very small clusters at the corners

of the squares themselves up to the cluster containing all detected interest points.
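
A minimal sketch of the stability criterion is given below. It assumes that the r scales must be consecutive, which the text does not state explicitly, and it represents every cluster as a frozenset of point indices. The function name and the example clusterings are hypothetical.

```python
def stable_clusters(cluster_sets, r):
    """Return clusters whose point set is identical at r or more consecutive scales."""
    stable = set()
    run_length = {}  # cluster -> number of consecutive scales it has survived
    for clusters in cluster_sets:
        # Clusters missing at this scale drop out, which resets their run.
        run_length = {c: run_length.get(c, 0) + 1 for c in set(clusters)}
        stable.update(c for c, n in run_length.items() if n >= r)
    return stable


# Hypothetical clusterings at p = 4 scales, each a list of point-index sets.
cluster_sets = [
    [frozenset({0, 1, 2}), frozenset({3, 4})],
    [frozenset({0, 1, 2}), frozenset({3, 4})],
    [frozenset({0, 1, 2}), frozenset({3, 4})],
    [frozenset({0, 1, 2, 3, 4})],
]
result = stable_clusters(cluster_sets, r=3)
```

With r = 3 the two small clusters are reported as stable, while the merged cluster, which appears at one scale only, is not. A practical implementation would probably also discard clusters with very few points, which this sketch does not do.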

The clusters may show arbitrary shapes; an approximate delineation may be obtained by

convex hull construction or by fitting ellipses to the clustered points. In the next section we will

propose a descriptor that does not need region delineation and uses the clustered points directly.




Figure 2: (a) Example of the MSCC detector on a synthetic test image - clustered interest

points are indicated by ellipses around them and (b) 2D histogram of relative interest point


3 The joint orientation histogram descriptor

Orientation information has already been successfully used by Lowe in his SIFT-descriptor [7]

and inspired by this, we propose a joint orientation information descriptor placing emphasis

on the benefits of the MSCC detector.

The basic calculation steps for the descriptor are the following:

1. Estimation of orientation information (gradients) of corners inside the MSCC region.

2. Normalization of this orientation information with respect to the mean orientation of

the region gradients.

3. Computation of a two dimensional histogram of joint gradient occurrences.

As an estimate of the local gradient orientation at every corner we use Gaussian derivatives.

The mean of the corner orientation within a region is computed, assigning one main orientation

to the whole region. All orientations are normalized with respect to this mean orientation in

order to obtain rotational invariance.

The joint occurrences are computed only in a limited local neighborhood. For each corner

point within a region the joint occurrence of orientation to its n local neighbors is entered

in a 2D histogram. The histograms are smoothed using a Gaussian kernel in order to avoid

boundary effects. Best results are obtained for a typical histogram size of 16 × 16 and n = 12

nearest neighbors resulting in a 256-dimensional descriptor.
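
The three calculation steps can be sketched as follows. This is an illustration under several assumptions rather than the authors' implementation: the corner orientations are given as angles in radians, the mean orientation is computed as a circular mean, the histogram is normalized to sum to one, and the Gaussian smoothing of the histogram is omitted. The function name and the example values are hypothetical.

```python
import math


def joint_orientation_histogram(points, orientations, n_neighbors=12, bins=16):
    """Sketch of a 2D histogram of orientation pairs between neighbouring corners."""
    two_pi = 2.0 * math.pi
    # Circular mean of the corner orientations: the main orientation of the region.
    mean = math.atan2(sum(math.sin(a) for a in orientations),
                      sum(math.cos(a) for a in orientations))
    # Orientations relative to the main orientation, mapped to [0, 2*pi).
    normalized = [(a - mean) % two_pi for a in orientations]

    def bin_index(angle):
        return min(int(angle / two_pi * bins), bins - 1)

    hist = [[0.0] * bins for _ in range(bins)]
    for i, p in enumerate(points):
        # The n nearest other corners of the same cluster.
        others = sorted((j for j in range(len(points)) if j != i),
                        key=lambda j: math.dist(p, points[j]))
        for j in others[:n_neighbors]:
            hist[bin_index(normalized[i])][bin_index(normalized[j])] += 1.0
    total = sum(map(sum, hist))
    if total > 0:
        # Assumption: normalize so that the histogram sums to one.
        hist = [[v / total for v in row] for row in hist]
    return hist


# Hypothetical cluster of four corners with their gradient orientations.
points = [(0, 0), (1, 0), (0, 1), (1, 1)]
orientations = [0.1, 1.7, 3.2, 4.8]
hist = joint_orientation_histogram(points, orientations, n_neighbors=2)
descriptor = [v for row in hist for v in row]
```

Flattening the 16 × 16 histogram gives the 256-dimensional vector mentioned above. Because every orientation is taken relative to the mean orientation of the cluster, rotating all input orientations by the same angle should leave the histogram unchanged up to binning effects.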

The histogram takes into account the local structure within the region (e.g. do we have many

similar orientations or are most orientations uniform?). Figure 2(b) shows an example of such

a two dimensional descriptor. In contrast to most other descriptors we are not dependent on

the definition of a center point or a region (ellipse) from which the descriptor is calculated.

Therefore, even if the detected clusters differ in shape, this does not affect the descriptor as

long as the same primary, highly repetitive points are found again.

4 Evaluation and Experiments

To compare the performance of our approach to others we use the publicly available evaluation

framework of Mikolajczyk [11] and an adapted framework, as we found that using the

Bhattacharyya distance for histogram matching provides better results. However, as the

use of a homography restricts the evaluation method to planar scenes only, we evaluated our

approach additionally using a ground truth method based on the properties of the trifocal

geometry [3] for the ”group” scene evaluation.
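
For reference, a minimal sketch of a Bhattacharyya distance between two histograms follows. The text does not say which variant is used in the adapted framework, so the common form, the negative logarithm of the Bhattacharyya coefficient, is an assumption here, as is the function name.

```python
import math


def bhattacharyya_distance(h1, h2):
    """Bhattacharyya distance between two histograms (normalized internally)."""
    s1, s2 = sum(h1), sum(h2)
    # Bhattacharyya coefficient: overlap of the two normalized histograms.
    coefficient = sum(math.sqrt((a / s1) * (b / s2)) for a, b in zip(h1, h2))
    if coefficient <= 0.0:
        return math.inf  # the histograms share no occupied bin
    # Clamp to 1 so that rounding error cannot produce a negative distance.
    return -math.log(min(coefficient, 1.0))


same = bhattacharyya_distance([1, 2, 1], [2, 4, 2])
disjoint = bhattacharyya_distance([1, 0], [0, 1])
partial = bhattacharyya_distance([1, 1, 0], [0, 1, 1])
```

Histograms with the same shape have distance zero regardless of their total count, histograms without a common occupied bin have infinite distance, and the partially overlapping pair lies in between.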

The evaluation framework of Mikolajczyk gives two performance measures - a repeatability

score and a matching score. The repeatability score is the relative number of repetitive

detected interest regions. The matching score is the relative number of correctly matched

regions compared to the number of detected regions. The correctness of the matches is verified

automatically using a ground truth transformation. The same performance measures were

used for the ”group” scene evaluation.
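
Both measures are relative counts and can be sketched with one helper. The choice of the smaller region count of the image pair as the reference follows the usual convention of Mikolajczyk's framework and is an assumption here, since the text only speaks of relative numbers. The function name and all counts are hypothetical.

```python
def relative_score(count, regions_in_first, regions_in_second):
    """Percentage of count relative to the smaller region count of an image pair."""
    smaller = min(regions_in_first, regions_in_second)
    return 100.0 * count / smaller if smaller else 0.0


# Hypothetical image pair with 120 and 90 detected regions.
repeatability = relative_score(45, 120, 90)  # 45 regions detected in both images
matching = relative_score(27, 120, 90)       # 27 of them matched correctly
```

The matching score can never exceed the repeatability score, because only regions that are detected in both images can be matched correctly.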

Evaluation has been done on two different datasets named ”doors” and ”group” (see Fig. 3).

The ”group” dataset (b) consists of 19 images and shows two boxes on a turntable. The

viewpoint varies from 0° to 90°. The ”doors” dataset (a) consists of 10 images from a robot

localization experiment with estimated viewpoint changes of up to 130°.



Figure 3: Example images from dataset ”doors” (a) and dataset ”group” (b).

4.1 Detector evaluation

We evaluate 11 different detectors on increasing viewpoint angle. For the MSER detector [8]

we use an implementation from Matas et al. and for the other detectors we use the publicly

available implementations from Mikolajczyk.

Fig. 4(a) shows the repeatability score for the ”group” dataset. The best performance is

obtained by the MSER detector. The proposed MSCC detector comes second and shows a

repeatability score higher than the other detectors. Fig. 4(b) shows the repeatability score

for the ”doors” dataset for which our proposed MSCC detector shows comparable results

to others. The evaluations demonstrate that our MSCC detector is competitive with other

established detectors.

4.2 Descriptor Evaluation

To compare our descriptor against others we used Harris-Affine, Hessian-Affine, MSERs and intensity

based regions (IBR) [13] and calculated SIFT-descriptors for the identified regions using

Mikolajczyk’s evaluation framework. For the MSCC detector we calculated SIFT-descriptors


[Plot: repeatability [%] versus viewpoint angle [°]]

Figure 4: Repeatability scores for images from ”group” dataset (a) and ”doors” dataset (b) for

different viewpoints.

and our proposed joint orientation histogram descriptor (JOH). Fig. 5(a) shows the matching

score for the ”group” scene. The performance is promising and is decreasing with larger

viewpoint changes as expected. Fig. 5(b) depicts the results for the ”doors” scene. Although

the MSCC detector with SIFT-keys shows performance worse than others, the combination

of MSCC regions with our joint orientation histogram descriptor outperforms the others.

The experiment reveals a competitive performance of our novel detector when compared to

other approaches. It is important to note that the regions detected by our approach are consistently

different from those of other detectors; the MSCC detector can therefore be considered complementary.



Figure 5: Matching scores for images from ”group” dataset (a) and ”doors” dataset (b) for

different viewpoints.

5 Summary and Conclusions

We have presented a novel method for the detection of distinguished regions, the so-called

Maximally Stable Corner Clusters (MSCC). The new local detector finds distinguished regions

by clustering feature primitives. We have also developed a novel local descriptor based on 2D

histograms ideally suited for the properties of the detector. We evaluated the repeatability

of the MSCC detector and the matching score of the local descriptor under changing viewing

points and compared the performance to other established detectors. The results show a

competitive performance of the MSCC detector and give a convincing argument that the new

detector-descriptor combination will successfully enrich the current variety of local detectors

and descriptors. One should note that the proposed framework is quite general and can be

used to develop other distinguished region detectors.

References


[1] T. Cormen, C. Leiserson, and R. Rivest. Introduction to Algorithms. MIT Press, Cambridge MA, 1990.

[2] R. Fergus, P. Perona, and A. Zisserman. Object class recognition by unsupervised scale-invariant learning.

In Proceedings 9th International Conference on Computer Vision, June 2003.

[3] F. Fraundorfer and H. Bischof. Evaluation of local detectors on non-planar scenes. In Proceedings 28th

Workshop of the Austrian Association for Pattern Recognition, 2004.

[4] C. Harris and M. Stephens. A combined corner and edge detector. In Proceedings 4th Alvey Vision

Conference, pages 147–151, 1988.

[5] T. Kadir, A. Zisserman, and M. Brady. An affine invariant salient region detector. In Proceedings 8th

European Conference on Computer Vision, 2004.

[6] T. Lindeberg. Feature detection with automatic scale selection. International Journal of Computer Vision,

30(2):77–116, 1998. Technical report ISRN KTH NA/P–96/18–SE.

[7] D.G. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer

Vision, 2004.

[8] J. Matas, O. Chum, M. Urban, and T. Pajdla. Robust wide baseline stereo from maximally stable

extremal regions. In Proceedings 13th British Machine Vision Conference, volume 1, pages 384–393,


[9] K. Mikolajczyk and C. Schmid. Indexing based on scale invariant interest points. In Proceedings 8th

International Conference on Computer Vision, pages 525–531, July 2001.

[10] K. Mikolajczyk and C. Schmid. An affine invariant interest point detector. In Proceedings 7th European

Conference on Computer Vision, volume 1, pages 128–142, 2002.

[11] K. Mikolajczyk and C. Schmid. Comparison of affine-invariant local detectors and descriptors. In Proceedings

of the 12th European Signal Processing Conference, 2004.

[12] J. Sivic and A. Zisserman. Video Google: A text retrieval approach to object matching in videos. In

Proceedings 9th International Conference on Computer Vision, October 2003.

[13] T. Tuytelaars and L. Van Gool. Wide baseline stereo matching based on local, affinely invariant regions.

In Proceedings 11th British Machine Vision Conference, pages 412–422, 2000.
