Maximally Stable Corner Clusters: A novel
distinguished region detector and descriptor 1)
M. Winter, H. Bischof and F. Fraundorfer
Graz University of Technology
We propose a novel distinguished region detector called Maximally Stable Corner Cluster
detector (MSCC). It is complementary to existing approaches like Harris-corner detectors,
Difference of Gaussian detectors (DoG) or Maximally Stable Extremal Regions (MSER). The
basic idea is to find distinguished regions by looking at clusters of interest points and using the
concept of maximal stability across scale. Additionally, we propose a novel descriptor ideally
suited for regions detected by MSCC. It is based on the 2D joint occurrence histograms of
corner orientations. We demonstrate its performance and compare it against other competitive
detectors and descriptors recently evaluated by Mikolajczyk and Schmid [11].
1 Introduction
Recently, there has been considerable interest in using local image region detectors and
descriptors for wide base-line stereo [8, 13], video retrieval and indexing [12, 9], object recognition
and categorization tasks [2, 5]. There exist two main categories of distinguished region detectors: corner based detectors like Harris [4], Harris-Laplace [9], Harris-Affine [10] etc., and region based detectors such as Maximally Stable Extremal Regions (MSER) [8], Difference of Gaussian points (DoG) [7] or scale space blobs [6]. Corner based detectors locate points of
interest in regions which contain a lot of image structure, but they fail at uniform regions and regions with smooth transitions. Moreover, they have problems in highly textured regions where many corners are detected on similar structures, which in turn leads to mismatches.
Region based detectors deliver blob-like structures of uniform regions but leave out highly structured regions. As the two categories are quite complementary, it is no surprise that detector combinations have come into use (e.g. Video Google [12]). The main advantage
of detector combinations is to use the complementarity of the approaches to be independent
of the scene content.
1) This work is funded by the Austrian Joint Research Project Cognitive Vision under sub-projects S9103-N04 and S9104-N04 and by a grant from the Federal Ministry for Education, Science and Culture of Austria under the CONEX program.
Although corner and region based detectors already cover a broad variety of different scenes, there remains a class of images not ideally covered by either. Fig. 1 shows an
image from a database we are using for studying visual robot localization tasks. The image
shows a door with an attached poster. The poster contains a lot of text. Fig. 1(a) shows
as an example detected Harris-corners. The text in the image results in a high number of
Harris-corners with a very similar appearance. Fig. 1(b) shows the results of a Hessian-Affine detector. The detector selects strong corners and constructs local affine frames around
the corners on multiple scales. This leads to unfortunate detections as one can see in the
lower right part of the image. The MSER detector (see Fig. 1(c)) perfectly detects the homogeneous
regions but ignores the parts of the image containing the text. This simple example
demonstrates that none of the available detectors delivers satisfactory results.
Figure 1: (a) Harris-corner detector. (b) Hessian-Affine detector. (c) MSER detector.
(d) MSCC detector (one scale only).
However, if we take a closer look at the results of a simple Harris-corner detector (Fig. 1(a))
we see that we can find characteristic dense patterns of detected corners in image parts which
contain some texture. This suggests considering characteristic groups of individual detections
instead of looking at a single local property of the image.
Motivated by this we suggest a novel distinguished region detector which is based on clustering
interest points. Once we have detected a distinguished region we can use any of the
available region descriptors for matching (e.g. SIFT-keys [7]). However, we can use the special
properties of our MSCC detector to develop a novel descriptor ideally suited for the detected
regions. We show that this combination outperforms other approaches on the mentioned
special class of images.
The remainder of the paper is organized as follows: In Section 2 we outline the algorithm for
the MSCC detector. In Section 3 we present the details of the novel descriptor. Finally the
detector and descriptor performance is evaluated in Section 4.
2 The MSCC detector
The MSCC detector incorporates two main novel ideas: The first one is to use point constellations
instead of single interest points. Point constellations should be more robust against
viewpoint change than single interest points because a few missing single points will not affect
the detection of the constellation itself. The second idea is to select only those constellations which are stable across several ”scales”. Note that the extent of an MSCC (= the MSCC ”scale”) is defined by the distribution of the points contributing to the constellation.
The MSCC algorithm proceeds along the following steps:
1. Detect simple, single interest points all over the image, e.g. Harris-corners.
2. Cluster the interest points by graph-based point clustering using a minimal spanning tree (MST).
3. Select clusters which stay stable over a certain number of scales.
It should be noted that steps 2 and 3 of the algorithm can be implemented very efficiently,
as it is possible to do the multi scale clustering as well as the selection of stable clusters already
during the MST construction.
2.1 Interest point detection and MST calculation
To detect the interest points acting as cluster primitives we make use of the structure tensor.
For every image pixel we evaluate the structure tensor and calculate a Harris-corner strength
measure (cornerness) as given by Harris and Stephens [4]. We select a large number of corners
(all local maxima above the noise level) as our corner primitives. This ensures that we are
not dependent on a cornerness threshold. The interest points themselves represent the nodes of an undirected weighted graph in 2D. The weight of the edge between two graph nodes is their geometric distance, which we will also refer to as edge length. The MST computation creates edges between nearby nodes and is performed by applying Kruskal’s iterative growing algorithm [1] on the Delaunay triangulation of the nodes.
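As a concrete illustration of the cornerness computation, the following is a minimal sketch, not the implementation used in the paper: it substitutes plain central differences and a 3×3 box window for the Gaussian derivatives, and `harris_cornerness` is an illustrative helper name.

```python
def harris_cornerness(img, k=0.04):
    # Structure tensor entries summed over a 3x3 window; the cornerness is
    # R = det(M) - k * trace(M)^2 as in Harris and Stephens [4].
    # Central differences stand in for the Gaussian derivatives of the paper.
    h, w = len(img), len(img[0])
    Ix = [[(img[y][min(x + 1, w - 1)] - img[y][max(x - 1, 0)]) / 2
           for x in range(w)] for y in range(h)]
    Iy = [[(img[min(y + 1, h - 1)][x] - img[max(y - 1, 0)][x]) / 2
           for x in range(w)] for y in range(h)]
    R = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            sxx = syy = sxy = 0.0
            for v in range(max(0, y - 1), min(h, y + 2)):
                for u in range(max(0, x - 1), min(w, x + 2)):
                    sxx += Ix[v][u] ** 2
                    syy += Iy[v][u] ** 2
                    sxy += Ix[v][u] * Iy[v][u]
            R[y][x] = sxx * syy - sxy * sxy - k * (sxx + syy) ** 2
    return R
```

Local maxima of R above the noise level would then serve as the corner primitives.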
2.2 Multi scale clustering
Since we do not know the number of clusters we have to use a non-parametric clustering
method. The method we use is inspired by the MSER detector. Given a threshold T on the
edge length we can get a subdivision of the MST into subtrees by removing all edges with
an edge length higher than this threshold. Every subtree corresponds to a cluster and gives
rise to an image region. Different values of T produce different subdivisions of the MST, i.e.
different point clusters. To create a multi scale clustering we compute subdivisions of the
MST for p logarithmically spaced thresholds T1 ... Tp between the minimal and maximal edge
length occurring in the MST.
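The MST construction and its multi scale subdivision can be sketched as follows. This is a simplified sketch: Kruskal’s algorithm is run on all pairwise edges instead of the Delaunay triangulation, and the function names are illustrative only.

```python
import math

def kruskal_mst(points):
    # Kruskal's iterative growing algorithm with union-find; the paper
    # restricts candidate edges to the Delaunay triangulation for efficiency.
    n = len(points)
    edges = sorted((math.dist(points[i], points[j]), i, j)
                   for i in range(n) for j in range(i + 1, n))
    parent = list(range(n))
    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a
    mst = []
    for w, i, j in edges:
        if find(i) != find(j):
            parent[find(i)] = find(j)
            mst.append((w, i, j))
    return mst  # edges in ascending order of length

def clusters_at_threshold(mst, n, T):
    # Removing all MST edges longer than T leaves subtrees = point clusters.
    parent = list(range(n))
    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a
    for w, i, j in mst:
        if w <= T + 1e-9:  # epsilon guards rounding at the extreme thresholds
            parent[find(i)] = find(j)
    groups = {}
    for v in range(n):
        groups.setdefault(find(v), set()).add(v)
    return [frozenset(g) for g in groups.values()]

def multiscale_clusters(points, p=8):
    mst = kruskal_mst(points)
    lo, hi = mst[0][0], mst[-1][0]
    # p logarithmically spaced thresholds between min and max MST edge length
    thresholds = [lo * (hi / lo) ** (k / (p - 1)) for k in range(p)]
    return [clusters_at_threshold(mst, len(points), T) for T in thresholds]
```

Two tight point groups far apart, for example, yield the same two clusters over many consecutive thresholds, which is exactly the stability exploited in the next step.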
2.3 Selection of stable clusters
The previous step produced p different cluster sets. We are now interested in clusters which
do not change their shape over several scales, i.e. those that are stable. As a stability criterion for a cluster we compare its set of points. Clusters consisting of the same set of points across r different scales are defined as stable and constitute the output of the MSCC detector. A similar stability criterion is used by Matas et al. [8] in the MSER detector for gray values. Fig. 2(a) illustrates the method on a synthetic test image. The image shows
4 differently sized squares. The Harris-corner detection step produces several responses on
the corners of the squares. Connecting the single points with the MST reveals a structure
where one can easily see that clustering can be done by removing the larger edges. Clusters
of interest points are indicated by ellipses around them. The test image shows the capability
of detecting stable clusters at multiple scales, starting from very small clusters at the corners
of the squares themselves up to the cluster containing all detected interest points.
The clusters may show arbitrary shapes; an approximate delineation may be obtained by convex hull construction or by fitting ellipses to the clustered points. In the next section we propose a descriptor that does not require region delineation and uses the clustered points directly.
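The stability selection itself then reduces to comparing cluster membership across scales. A minimal sketch, assuming clusters are represented as frozensets of point indices and that the r scales must be consecutive (the paper only states ”r different scales”):

```python
def stable_clusters(cluster_sets, r=3):
    # cluster_sets[s] is the list of clusters (frozensets of point indices)
    # found at scale s; a cluster counts as stable if its exact point set
    # reappears unchanged at r consecutive scales.
    stable = set()
    candidates = set().union(*(set(s) for s in cluster_sets))
    for cluster in candidates:
        run = 0
        for scale in cluster_sets:
            run = run + 1 if cluster in scale else 0
            if run >= r:
                stable.add(cluster)
                break
    return stable
```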
Figure 2: (a) Example of the MSCC detector on a synthetic test image - clustered interest points are indicated by ellipses around them and (b) 2D histogram of relative interest point orientations.
3 The joint orientation histogram descriptor
Orientation information has already been successfully used by Lowe in his SIFT-descriptor [7], and inspired by this, we propose a joint orientation histogram descriptor placing emphasis
on the benefits of the MSCC detector.
The basic calculation steps for the descriptor are the following:
1. Estimation of orientation information (gradients) of corners inside the MSCC region.
2. Normalization of this orientation information with respect to the mean orientation of
the region gradients.
3. Computation of a two dimensional histogram of joint gradient occurrences.
As an estimate of the local gradient orientation at every corner we use Gaussian derivatives.
The mean of the corner orientations within a region is computed, assigning one main orientation
to the whole region. All orientations are normalized with respect to this mean orientation in
order to obtain rotational invariance.
The joint occurrences are computed only in a limited local neighborhood. For each corner point within a region, the joint occurrences of its orientation with those of its n nearest neighbors are entered into a 2D histogram. The histograms are smoothed with a Gaussian kernel in order to avoid
boundary effects. Best results are obtained for a typical histogram size of 16 × 16 and n = 12
nearest neighbors resulting in a 256-dimensional descriptor.
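The computation steps can be sketched as follows, with the qualification that the Gaussian smoothing of the histogram is omitted, the circular mean stands in for the region’s mean orientation, and neighbors are found by brute force; `joh_descriptor` is an illustrative name.

```python
import math

def joh_descriptor(points, orientations, n_bins=16, n_neighbors=12):
    # Normalize all corner orientations (radians) by the region's mean
    # orientation to obtain rotational invariance.
    mean = math.atan2(sum(math.sin(o) for o in orientations),
                      sum(math.cos(o) for o in orientations))
    norm = [(o - mean) % (2 * math.pi) for o in orientations]
    bin_of = lambda o: min(int(o / (2 * math.pi) * n_bins), n_bins - 1)
    hist = [[0] * n_bins for _ in range(n_bins)]
    for i, p in enumerate(points):
        # Joint occurrences only with the n nearest neighbors of each corner.
        nbrs = sorted((j for j in range(len(points)) if j != i),
                      key=lambda j: math.dist(p, points[j]))[:n_neighbors]
        for j in nbrs:
            hist[bin_of(norm[i])][bin_of(norm[j])] += 1
    return hist  # 16 x 16 = 256-dimensional when flattened
```

With a 16 × 16 histogram and n = 12 this yields the 256-dimensional descriptor described above.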
The histogram takes into account the local structure within the region (e.g. are many orientations similar, or are they distributed uniformly?). Figure 2(b) shows an example of such
a two dimensional descriptor. In contrast to most other descriptors we do not depend on the definition of a center point or a region (ellipse) from which the descriptor is calculated.
Therefore, even if the detected clusters differ in shape, this does not affect the descriptor as
long as the same primary, highly repetitive points are found again.
4 Evaluation and Experiments
To compare the performance of our approach to others we use the publicly available evaluation framework of Mikolajczyk [11] and an adapted version of it, as we found that using the Bhattacharyya distance for histogram matching provides better results. Nevertheless, as the
use of a homography restricts the evaluation method to planar scenes only, we evaluated our
approach additionally using a ground truth method based on the properties of the trifocal geometry [3] for the ”group” scene evaluation.
The evaluation framework of Mikolajczyk gives two performance measures - a repeatability score and a matching score. The repeatability score is the relative number of repeatedly detected interest regions. The matching score is the number of correctly matched regions relative to the number of detected regions. The correctness of the matches is verified
automatically using a ground truth transformation. The same performance measures were
used for the ”group” scene evaluation.
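As a sketch of the two measures, assuming the usual normalization by the smaller of the two region counts (the exact normalization in the framework may differ):

```python
def repeatability_score(n_correspondences, n_ref, n_test):
    # Fraction of regions re-detected under the ground truth transformation.
    return n_correspondences / min(n_ref, n_test)

def matching_score(n_correct_matches, n_ref, n_test):
    # Fraction of detected regions that are also matched correctly.
    return n_correct_matches / min(n_ref, n_test)
```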
Evaluation has been done on two different datasets named ”doors” and ”group” (see Fig. 3).
The ”group” dataset (b) consists of 19 images and shows two boxes on a turntable. The
viewpoint varies from 0° to 90°. The ”doors” dataset (a) consists of 10 images from a robot localization experiment with estimated viewpoint changes up to 130°.
Figure 3: Example images from dataset ”doors” (a) and dataset ”group” (b).
4.1 Detector evaluation
We evaluate 11 different detectors under increasing viewpoint angle. For the MSER detector [8] we use an implementation from Matas et al. and for the other detectors we use the publicly available implementations from Mikolajczyk.
Fig. 4(a) shows the repeatability score for the ”group” dataset. The best performance is
obtained by the MSER detector. The proposed MSCC detector comes second and shows a
repeatability score higher than the other detectors. Fig. 4(b) shows the repeatability score for the ”doors” dataset, for which our proposed MSCC detector shows results comparable to the others. The evaluations demonstrate that our MSCC detector is competitive with other detectors.
4.2 Descriptor Evaluation
To compare our descriptor against others we used Harris-Affine, Hessian-Affine, MSER and intensity based regions (IBR) [13] and calculated SIFT-descriptors for the identified regions using Mikolajczyk’s evaluation framework. For the MSCC detector we calculated SIFT-descriptors and our proposed joint orientation histogram descriptor (JOH).
Figure 4: Repeatability scores for images from ”group” dataset (a) and ”doors” dataset (b) over viewpoint angle [°].
Fig. 5(a) shows the matching
score for the ”group” scene. The performance is promising and decreases with larger viewpoint changes, as expected. Fig. 5(b) depicts the results for the ”doors” scene. Although
the MSCC detector with SIFT-keys shows worse performance than the others, the combination of MSCC regions with our joint orientation histogram descriptor outperforms them.
The experiment reveals a competitive performance of our novel detector when compared to
other approaches. It is important to note that the regions detected by our approach are consistently different from those of other detectors; it can therefore be considered complementary.
Figure 5: Matching scores for images from ”group” dataset (a) and ”doors” dataset (b) for the evaluated detector-descriptor combinations.
5 Summary and Conclusions
We have presented a novel method for the detection of distinguished regions, the so-called
Maximally Stable Corner Clusters (MSCC). The new local detector finds distinguished regions
by clustering feature primitives. We have also developed a novel local descriptor based on 2D
histograms ideally suited for the properties of the detector. We evaluated the repeatability
of the MSCC detector and the matching score of the local descriptor under changing viewing
points and compared the performance to other established detectors. The results show a
competitive performance of the MSCC detector and give a convincing argument that the new
detector-descriptor combination will successfully enrich the current variety of local detectors
and descriptors. One should note that the proposed framework is quite general and can be
used to develop other distinguished region detectors.
References
[1] T. Cormen, C. Leiserson, and R. Rivest. Introduction to Algorithms. MIT Press, Cambridge MA, 1990.
[2] R. Fergus, P. Perona, and A. Zisserman. Object class recognition by unsupervised scale-invariant learning. In Proceedings 9th International Conference on Computer Vision, June 2003.
[3] F. Fraundorfer and H. Bischof. Evaluation of local detectors on non-planar scenes. In Proceedings 28th Workshop of the Austrian Association for Pattern Recognition, 2004.
[4] C. Harris and M. Stephens. A combined corner and edge detector. In Proceedings 4th Alvey Vision Conference, pages 147–151, 1988.
[5] T. Kadir, A. Zisserman, and M. Brady. An affine invariant salient region detector. In Proceedings 8th European Conference on Computer Vision, 2004.
[6] T. Lindeberg. Feature detection with automatic scale selection. International Journal of Computer Vision, 30(2):77–116, 1998. Technical report ISRN KTH NA/P–96/18–SE.
[7] D.G. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2):91–110, 2004.
[8] J. Matas, O. Chum, U. Martin, and T. Pajdla. Robust wide baseline stereo from maximally stable extremal regions. In Proceedings 13th British Machine Vision Conference, volume 1, pages 384–393, 2002.
[9] K. Mikolajczyk and C. Schmid. Indexing based on scale invariant interest points. In Proceedings 8th International Conference on Computer Vision, pages 525–531, July 2001.
[10] K. Mikolajczyk and C. Schmid. An affine invariant interest point detector. In Proceedings 7th European Conference on Computer Vision, volume 1, pages 128–142, 2002.
[11] K. Mikolajczyk and C. Schmid. Comparison of affine-invariant local detectors and descriptors. In Proceedings of the 12th European Signal Processing Conference, 2004.
[12] J. Sivic and A. Zisserman. Video Google: A text retrieval approach to object matching in videos. In Proceedings 9th International Conference on Computer Vision, October 2003.
[13] T. Tuytelaars and L. Van Gool. Wide baseline stereo matching based on local, affinely invariant regions. In Proceedings 11th British Machine Vision Conference, pages 412–422, 2000.