Hierarchical Multi-Affine (HMA) algorithm for fast and accurate ...

Hierarchical Multi-Affine (HMA) algorithm for fast and accurate 

feature matching in minimally-invasive surgical images 

Gustavo A. Puerto-Souza, and Gian Luca Mariottini 

Abstract— The ability to find similar features between two 

distinct views of the same scene (feature matching) is essential 

in robotics and computer vision. We are here interested in 

the robotic-assisted minimally-invasive surgical scenario, for 

which feature matching can be used to recover tracked features 

after prolonged occlusions, strong illumination changes, image 

clutter, or fast camera motion. In this paper we introduce the 

Hierarchical Multi-Affine (HMA) feature-matching algorithm, 

which improves over the existing methods by recovering a 

larger number of image correspondences, at an increased 

speed and with a higher accuracy and robustness. Extensive 

experimental results are presented that compare HMA against 

existing methods, over a large surgical-image dataset and over 

several types of detected features. 

(a) Training - Before Occlusion 

Tracked features 

(c) Query - After Occlusion 

Recovered features 

(b) Occlusion 

(d) Augmented Reality 

I. INTRODUCTION 

Feature matching consists of finding image similarities 

between two distinct views of the same scene. 

Differently from feature tracking [1]–[3], feature matching 

does not make any assumption about the sequential nature 

of the two images (e.g., consecutive frames of a video), and 

can then be used to recover corresponding features after 

a prolonged occlusion, a fast camera motion, or a strong 

illumination change. 

Due to their generality, feature-matching algorithms are 

essential in many robotics and vision applications [4]–[9]. In 

particular, feature matching is important in robotic-assisted 

minimally-invasive surgical (MIS) scenarios, e.g., for tissueshape 

recovery [10]–[12], camera tracking [13], [14], or for 

recovering lost features used as anchors for augmented reality 

[15], [16]. Figure 1(a)-(c) shows a MIS image sequence 

in which feature matching is used to recover the position of 

some lost tracked features that were lost after a prolonged 

occlusion and organ movement. These points are then used 

to re-initialize an augmented-reality display of a kidney from 

CT data (Fig. 1(d)). 

Feature-matching algorithms have been proposed in the 

recent years [17]–[19] that make limited (or no) assumptions 

about the two given images, or about the observed scene. 

In this paper, we present a novel feature-matching algorithm 

that improves over the existing methods, by finding 

a larger number of image correspondences at an increased 

speed, and with a higher accuracy, and robustness to image 

clutter. Our method, called Hierarchical Multi-Affine (HMA), 

robustly matches clusters of image features on the observed 

objects’ surface. HMA adopts hierarchical k-means [20] to 

G.A. Puerto-Souza and G.L. Mariottini are with the Department of 

Computer Science and Engineering, University of Texas at Arlington, 

416 Yates Street, 76019 Texas, USA. 

Email:gustavo.puerto@mavs.uta.edu 

gianluca@uta.edu 

Fig. 1. Robotic-surgery application: HMA is used to recover tracked 

features lost due to a prolonged occlusion and organ/camera 

motion (a): Features (dots) tracked during surgery; (b): Occlusion 

of ultrasound probe; (c): HMA can recover the features in (c) from 

those in image (a); (d): Recovered features are used to initialize an 

augmented-reality display. 

iteratively cluster a set of initial matching features according 

to their similarity-space representation. For each of these 

spatially-distributed clusters, an affine transformation is also 

estimated, to further prune the outliers and to accurately 

capture matches along the entire object surface. 

Experimental results are presented over a large and varied 

MIS image dataset (extracted from six 3-hours-long surgical 

interventions) and in the case of several detected features 

(SIFT, ASIFT, and SURF). We show that HMA outperforms 

the existing state-of-the-art methods in terms of speed, inlierdetection 

rate, and accuracy. 

A. Related work 

The first step in feature matching consists of detecting 

salient features from the two images of the same scene. 

Several algorithms have been proposed to extract features 

(SIFT [17], SURF [21], ASIFT [22], etc.) along with their 

local descriptors, each one with different invariant properties. 

After this feature-extraction part, a feature-matching phase 

follows, which in general consists of two phases; first, an 

appearance-based matching, in which the local appearance 

of each feature is used to determine a set of candidate 

(or initial) matches. The work in [23] compares potential 

matches based only on their local appearance. Second, a 

geometrical-validation step is used in [17]–[19] to prune 

from ambiguities the candidate matches by leveraging the 

spatial arrangement of the initial matches.

The algorithm in [17] initially matches first SIFT features 

according to the Euclidean distance between descriptors. 

Then, a geometric phase is enforced to detect a predominant 

set of matches (inliers) that satisfy a unique affinetransformation 

constraint [24]. While this method can reliably 

discard many wrong matches, however only a limited 

number of those belonging to a predominant planar surface 

will be detected. 

The above issue was solved in [18], where multiple localaffine 

mappings are used to detect a larger number of 

matches over the entire non-planar object surface. However, 

due to its high computational time, this method can only be 

used for off-line data processing. 

The work in [19] uses a dissimilarity measure that measures 

the matches similarity based on both a geometric 

and an appearance constraint. Finally, an agglomerative step 

generates clusters of matches by iteratively merging similar 

ones (according to the dissimilarity measure). A drawback 

of this algorithm is due to its high number of parameters. 

Moreover, tuning these parameters is difficult because of 

their unclear physical interpretation. 

The HMA algorithm presented in this paper improves over 

the existing algorithm as described in what follows: 

• HMA is fast because of the adoption of hierarchical k- 

means in the clustering phase. In particular, when a large 

number of features are detected (e.g., when using HD 

images or extracting ASIFT features), HMA is almost 

two-orders-of-magnitude faster than [19]. 

• HMA is accurate: a large percentage of wrong matches 

is automatically removed. Furthermore, the cluster of 

affine transformations guarantees accurate feature recovery 

over the entire object’s surface. 

• HMA is easy to use: only a few thresholds need to be 

specified by the user, which have an intuitive physical 

interpretation. 

The paper is organized as follows: Sect. II describes the 

feature-matching problem and introduces the basic notation. 

Sect. III describes the HMA algorithm, while Sect. IV reports 

the results of our extensive experimental evaluation and the 

comparison of HMA with the state-of-the-art feature matching 

methods. Finally, in Sect. V we draw the conclusions 

and illustrate possible future extensions. 

II. OVERVIEW OF THE FEATURE MATCHING PROBLEM 

A. Keypoints and Descriptors: Notation 

Consider a pair of images, I t (training) and I q (query), 

and their two corresponding sets of detected image features 

F t and F q , respectively. For a generic image I, an image 

feature (e.g., SIFT) is defined as, f (k,d) ∈ F, consisting 

of a keypoint,k, and a local descriptor,d. Each keypointk 

[x,σ,θ] T contains the feature pixel-position, x = [x,y] T , 

scale, σ, and orientation, θ. The descriptor d is a vector 

containing information about the local appearance around 

the feature position. 

Based on the above notation, the feature-matching problem 

can be defined as retrieving the set of similar feature pairs 

(correspondences) from all the features extracted in both 

images. Such a (generic) matchmis defined asm (f t ,f q ), 

between features inF t andF q . In addition, we are interested 

to simultaneously estimate a mapping function, A i , that 

maps any feature from the training to the query images. 

B. Initial Appearance-based Feature Matching 

In feature matching, a set M of candidate matches is 

initially determined by means of appeareance-based criteria 

(e.g., the Nearest-Neighbor distance Ratio (NNR)). 

In NNR, the best training feature candidate to match the 

i-th f q i = (kq i ,dq i ), is the feature ft j(i) = (kt j(i) ,dt j(i) ) 

with both the closest descriptor-distance, i.e., 

j(i) = argmin j(i) ||d q i −dt j(i) || 2, and the smallest ratio 

between the closest and the second-closest descriptor 

distances, defined as, 

( ) 

NNR(i)‖d q i −dt j(i) ‖ 2/ min 

j∈F t −{j(i)} ‖dq i −dt j ‖ 2 . 

This ratio has to be less than some threshold τ NNR which 

allows to discard many wrong matches that could arise, e.g., 

due to ambiguity with the background. Tipically, a value for 

the ratio-threshold is τ NNR ≥ 0.8 [17]. 

However, due to image similarities, local appearance 

can generate a high percentage of outliers (i.e., wrong 

matches). As a consequence, the matches obtained from 

this appearance-based phase are unreliable and might not be 

sufficient in several applications, such as tracking recovery 

(see [25] for a performance comparison of appearance-based 

criteria). 

C. Similarity Transformation from Keypoints 

Corresponding features in both images can be related by a 

rigid image motion, which consists of a rotation, translation 

and scale. In particular, the two matching keypoints (k t i ,kq i ) 

of match m i can be used to compute such a (similarity) 

transformation, which models the rigid mapping for a pair of 

image points [u t ,v t ] T ∈ I t and [u q ,v q ] T ∈ I q , as follows, 

[ u 

q 

v q ] 

[ ] cos(δθi ) −sin(δθ 

= δσ i ) 

i 

sin(δθ i ) cos(δθ i ) 

} {{ } 

R(δθ i ) 

[ u 

t 

u t ] 

+ 

[ ] δxi 

, (1) 

δy i 

where the scale ratio δσ i σ q i /σt i , the 2-D rotation angle 

δθ i θ q i −θt i , and the translation [δx i,δy i ] T [x q i ,yq i ]T − 

δσ i R(δθ i )[x t i ,yt i ]T . 

The similarity parameters s i = [δx i ,δy i ,δσ i ,δθ i ] T are 

fast to compute, and only require the matched-keypoint 

parameters. 

In general, a match m i is considered to agree with 

a generic transformation 1 , A, if the symmetric pixel reprojection 

error of mapping its keypoints is below some 

threshold τ, i.e, 

e A (˜x t i,˜x q i ) = 0.5‖˜xq i −A˜xt i‖+0.5‖˜x t i−A −1˜x q i ‖ < τ (2)

A. Overview 

III. THE HMA ALGORITHM 

Given a set of (appearance-based) candidate matches, M, 

(see Sect. II-B), the goal of HMA is to compute (i) a set 

of outlier-free matches, M ∗ , and (ii) a set of spatiallydistributed 

image mappings, A ∗ = {A 1 ,A 2 ,...,A n }. Each 

of these transformations has the capability to map features 

lying on different portions of the training image, to corresponding 

features in the query image. As illustrated in 

Fig. 1, these mappings can then be used to recover in I q , 

the position any image feature (not necessarily SIFT but also 

corners) that was lost due to occlusions or rapid camera 

motion. 

HMA achieves these tasks by clustering the initial matches 

according to a spatial and an affine constraints. We adopt 

a hierarchical k-means strategy [20] to iteratively cluster 

all the matching features, as illustrated in Fig. 2. The 

hierarchical structure of our algorithm, and its corresponding 

tree structure, present several advantages: 

• Speed: our divide-and-conquer expansion of the tree can 

be stopped until a desired pixel accuracy (e.g., in terms 

of the reprojection error e A ). 

• Robustness: the quantization of the k-means clustering 

will iteratively adjust at each level of the tree, by 

reassigning matches to more appropriate clusters. 

• Accuracy: The multiple Affine Transformations (AT) 

locally adapt more precisely to the non-planar regions of 

the observed scene than a single global-affine mapping. 

Fig. 3 shows all the phases of the HMA algorithm: 

the clustering phase partitions each node’s matches into k 

clusters, C 1 ,C 2 ,...,C k . The affine estimation phase, for each 

cluster simultaneously, estimates an affine transformation, 

1 E.g., similarity, affine, homography transformations, etc. [24] 

Node 1 

I t 

I t I q I t I q 

. . . 

A 1 

Node 1(1) 

Node 1(k) 

I t I q I t I q 

A 1(1) 

. . . 

Root node 

I q 

Initial matches 

A 1(k) 

Node k 

Fig. 2. HMA hierarchically quantizes matches according to their spatial 

position on the object’s surface. 

Node 1 

Affine Est. 

Phase 

Further 

Expansion 

Clustering 

Phase 

Leaf 

node 

A 1 , M 1 ,O 1 

A ′ 1 , M′ 1 ,O′ 1 

M ′ 1 ∪ O′ 1 

Root node 

Initial Matches 

Clustering 

Phase 

Correction 

Phase 

Leaf 

node 

Leaf 

node 

Affine Est. 

Phase 

Clustering 

Phase 

C 1 

C1 C k 

C 1 C k 

Node 1(1) Node1(k) Node Node k(1) . . . k(k) 

M ′ 1(1) ∪ O′ i(1) 

Fig. 3. 

yes 

. . . 

no 

. . . 

M ′ 1(k) ∪ O′ i(k) 

C k 

M ′ k(1) ∪ O′ j(1) 

Leaf 

node 

Node k 

A k ,M k ,O k 

A ′ k ,M′ k ,O′ k 

no 

Further 

Leaf 

Expansion 

node 

M ′ k ∪ O′ k 

M ′ k(k) ∪ O′ j(k) 

. 

. 

Block diagram of the HMA feature-matching algorithm. 

yes 

Leaf 

node 

A i , the sets of inliers, M i , and the outliers O i . A correction 

phase is used to reassign matches to more adequate clusters 

and, if necessary, to obtain an updated affine mapping, A ′ i 

(and new sets M ′ i and O′ i ). Each cluster defines a new child 

node as N j(i) , where j(·) is the function that determines the 

index of the cluster i in the tree structure. Finally, a criteria 

based on the matches’ reprojection error, e A , is employed to 

determine if these new nodes require further clustering. 

These three phases are repeated until each branch reaches a 

leaf node. Then, the set of outlier-free correspondences,M ∗ , 

and the set of transformations, A ∗ , are extracted from the 

leaf nodes that satisfy a threshold over the minimal number 

of inliers, i.e., |M i | ≥ τ c . Once a leaf is reached, a final 

validation step is used to recover those outliers that might 

be potential inliers for other nodes. 

B. Clustering phase 

In this phase (see Fig. 3) the input matches are partitioned 

into k subsets by using k-means over an extended 

feature space, consisting of both the matches’ similaritytransformation 

parameters, as well as their (keypoint position 

in the query image). The resulting clustersC 1 ,C 2 ,...,C k , are 

illustrated in the example of Fig. 4. Note that each cluster 

represents a new branch of the tree at the current level. 

C. Similarity-space clustering 

For all the matches in each node, the four similarity transformation 

parameters [δx,δy,δσ,δθ] T are first computed 

(c.f., Sect. II-C). Clustering in this similarity space is the key, 

since it will result in clusters with close similarity parameters 

(representing indeed an initial approximation for the AT).

Training 

Query 

Fig. 4. Clustering phase: Example of the partition of matches. Note each 

symbol (and color) represents a different cluster (better seen in color). 

From our experiments, we noticed that this step does not 

always ensure that the clusters are disjoint, which would 

be ideal in order to define a unique affine transformation 

features for that portion of the scene. For this reason, we 

apply k-means to an extended (six-dimensional) similarity 

space 2 given by [x q ,y q ,δx,δy,δσ,δθ] T . 

We also noticed that, in earlier levels of the tree, the translation 

parameters [δx,δy] are more informative in isolating 

outliers. On the other hand, in latter levels of the tree, the 

position parameters [x q ,y q ] become more important 3 , and 

[δx,δy] tends to be less informative. Finally, we noticed that 

δσ and δθ are not very informative and can be very sensitive 

to changes of view, in particular when the observed object 

is non-planar. 

Due to the above empirical observations, a scaling vector 

W was chosen to adaptively weight the relative importance 

of those similarity parameters before applying k-means. 

Our strategy varies W depending on the values of the AT 

residuals. As from what discussed above, W will put more 

importance on the translation (than on the other parameters) 

when the cluster’s AT is not adequate to model the matches 

(e.g., due to the large number of outliers such as in the initial 

levels of the tree). When the number of outliers is reduced, 

W will reduce its importance on the translation parameters, 

and increase its weight values on the weight for the position 

(thus enforcing spatial contiguity of the clusters). 

The weights W are computed by interpolating between 

two given weigh values W i (initial), and W f (final), as 

follows: W = W i + (W f − W i )α(¯r), where ¯r represent 

the average of all the symmetric reprojection errors, and α 

is a decreasing function with image in the interval [0,1], 

defined as, 

⎧ 

⎨ ( 1 ) if ¯r ≤ τ R 

α(¯r) = ¯r−τ 

⎩ 1−erf √ r 

, 

otherwise 

2σ 2 w 

where τ R is the mapping maximum-error threshold, and σ w 

is a parameter which smooths the shape of α(¯r). 

Finally, the dataset is scaled according to these weights, 

i.e., Q [x q ,y q ,δx,δy,δσ,δθ]W, which is then clustered 

by k-means. 

D. Affine-estimation phase 

The keypoints (k t ,k q ) contained in each cluster C i are 

used in the affine-transformation estimation phase to com- 

2 This six-dimensional dataset is normalized (each column vector will have 

norm one) and scaled (with respect to each column). 

3 In fact, spatial contiguity of the clusters has higher priority. 

pute A i . RANSAC is used, for a given pixel error threshold 

τ R , to provide a robust estimate of A i , as well as to obtain 

the inliers M i . The affine model A i maps a generic image 

point [u t ,v t ] T ∈ I t (in the corresponding cluster C i ) to 

[û q ,ˆv q ] T ∈ I q , according to: 

[ûq 

ˆv q ] 

= 

[ 

m1 m 2 

m 3 

m 4 

][ u 

t 

v t ] 

+ 

[ ] 

m5 

. (3) 

m 6 

The parameters in A i represent the translation, rotation, 

and shear associated with the AT. The vector m 

[m 1 ,m 2 ,...,m 6 ] can be estimated at each RANSAC iteration 

by randomly picking triplets of matches and by using 

them to solve for m in a least-squares sense from (3). Fig. 5 

shows an example of the AT and the set of inliers. The 

colored boxes indicate how each region is mapped between 

images. As observed, these regions nicely capture keypoints 

according to the slope of their local surface. 

Fig. 5. 

Training 

Query 

Example of AT (the planar regions) and inliers for each cluster. 

In order to eliminate the isolated matches with large 

residuals 4 , we adopt a non-parametric outlier-detection technique 

[26]. In particular, a threshold τ o is first computed 

from the statistics of the symmetric reprojection errors, and 

used to discard matches with an error larger than τ o . This 

threshold is computed as τ o q 3 +γ(q 3 −q 1 ), where q i is 

the i-th quartile of the error over all the matches, and γ is a 

factor 5 . 

E. Correction phase 

In the correction phase the outliers O i are reassigned to 

other, more appropriate) clusters and, if necessary, the ATs 

are recomputed with the new set of features. In particular, 

Linear Discriminant Analysis (LDA) [27] is used to separate 

features belonging to all the known classes (clusters) in the 

query image, thus guaranteeing non overlap among the final 

clusters. 

F. Stop Criteria 

In order to stop the expansion of a i-th node, and 

to deem it as a leaf node, HMA examines the number 

of inliers (consensus), |M i |, and the ratio of inliers, 

ρ i |M i |/(|M i |+|O i |). In particular, a node is deemed as 

a leaf when |M i | < τ c or ρ i > τ ρ . Note that these thresholds 

have indeed a very intuitive interpretation and were fixed for 

our experiments (c.f., Sect. IV). 

4 These matches could negatively affect subsequent clustering phases. 

5 γ has been fixed to 3/2, as in [26].

Method Avg. Error Time % Inl. Avg. Error Time % Inl. Avg. Error Time % Inl. 

(SIFT)(pix) (SIFT)(s) (SIFT) (SURF)(pix) (SURF)(s) (SURF) (ASIFT)(pix) (ASIFT)(s) (ASIFT) 

Lowe 22.90±24.67 0.38 27.7% 20.77±22.15 0.40 34.6% 19.68±20.38 4.54 34.4% 

AMA 4.37±6.18 7.98 58.5% 4.45±7.38 13.00 69.4% 2.48±2.76 133.47 76.2% 

Cho 6.33±8.89 0.58 59.9% 6.92±6.83 0.56 66.6% 9.37±23.29 595.18 40.6% 

HMA 4.30±4.75 0.73 59.3% 4.44±6.66 0.67 66.7% 3.00±2.84 5.99 73.0% 

IV. EXPERIMENTAL RESULTS AND DISCUSSION 

We validated the performance of the HMA algorithm in 

two scenarios: (i) a fully controlled in-lab experiment, and 

(ii) a real MIS scenario, composed of almost 100 images 

extracted from 6 surgical interventions. For each image, 

we measured the inlier ratio 6 that is descriptive of the 

detection power of our algorithm, the mapping accuracy over 

a set of manually-labelled ground-truth correspondences (per 

image), and the computational time. The computational time 

is measured in seconds of timer CPU 7 required by each run 

of each algorithm. 

In addition, we compared HMA against other recent 

algorithms: Lowe’s [17], AMA [18], and Cho’s [19], when 

using either SIFT [17], SURF [21], or ASIFT [22] features. 

Our implementation of HMA considered the following 

fixed parameters: k = 4, τ R = 5 [pixels], 

W i = [0.2,0.2,0.25,0.25,0.05,0.05] T , W f = 

[0.5,0.5,0,0,0,0] T , which were chosen after an extensive 

analysis. 

In order to provide a fair comparison between the aforementioned 

algorithms, we used the same sets of initial 

matches 8 . We consider a threshold of 5 pixels for AMA and 

Lowe, while a cutoff value of 25 in the linkage function was 

chosen for Cho. 

A. In-lab experiment 

In this experiment a pair of images with resolution 640× 

480 were used. The initial matches are 354, 330 and 4903 

for SIFT, SURF, and ASIFT, respectively. The ground-truth 

corresponding corners are 41 manually-selected matches. 

Table I provides detailed statistical results of the algorithms’ 

performance for all the features. The second, fifth 

and eighth columns show the mean and standard deviation 

of the symmetric reprojection error, the third, sixth and ninth 

indicate the computational time (in seconds), and the forth, 

seventh and tenth contain the percentage of inliers. 

Discussion: From the results in Table I, we observe that 

HMA achieves better results than AMA in terms of speed 

(approximately 11, 19, and 22 times faster for SIFT, SURF, 

and ASIFT, respectively). While HMA requires slightly 

higher computational time when compared to Cho (in SIFT 

and SURF), however HMA obtains a reduced reprojection 

error. Furthermore, when using ASIFT features (in general 

more than 4000 are detected per image), HMA requires 

significant less computational time when compared to Cho. 

6 Inliers ratio = no. of inliers/no. of initial matches. 

7 i7, 2.20GHz. 

8 The initial matches are computed using NN distance ratio with thresholds 

0.9 (SIFT), 0.8 (SURF), and 0.8 (ASIFT). 

TABLE I 

In-lab experiment: Algorithms’ performance 

B. Surgical-images dataset 

Each laparoscopic image has a resolution of 704 × 480 

pixels, and the set of ground-truth correspondences contains 

an average of 20 points. Table II summarizes the results 

obtained by the four feature-matching algorithms. 

Discussion: HMA achieves comparable accuracy than 

AMA, but with a significantly lower computational effort. 

In fact, HMA is approximately 21 times faster than AMA 

for both SIFT and SURF, and 66 times for ASIFT. Furthermore, 

HMA demonstrates a smaller reprojection error 

than Cho’s, with a comparable time (note Cho’s larger 

standard deviation) for SIFT and SURF. As from the previous 

results, HMA shows a dramatic lower computation time 

(and lower accuracy cost) with a high number of ASIFT 

matches. Note that HMA has an inlier ratio comparable to 

other methods. Finally, HMA achieves high stability over 

all the experiment’s image pairs. These results suggest that 

HMA formulation achieves a good balance between speed 

and accuracy. 

Figs. 6 and show qualitative examples of the matching 

performance of the four algorithms for SIFT. Note that Lowe 

is not able to fully adapt to the non-planar surface in both 

cases, while AMA, Cho, and HMA obtain similar results 

(covering most of the organ’s surface). 

V. CONCLUSIONS 

In this work we presented HMA, a novel feature-matching 

algorithm, and we applied to robotic-assisted minimallyinvasive 

surgery (MIS). The MIS scenario is challenging 

because laparoscopic images have a lot of clutter, non-planar 

structures, changes in viewpoint, and organ deformations. 

HMA improves over existing methods by achieving a lower 

computational time, and a higher accuracy and robustness to 

identify many matches. This comparison was made over a 

very large and carefully annotated MIS-image dataset, and 

over three most-popular feature descriptors (SIFT, SURF, 

and ASIFT). As future work we plan to include fusion of 

descriptors and to further improve HMA computational time. 

REFERENCES 

[1] Bruce D. Lucas and Takeo Kanade. An iterative image registration 

technique with an application to stereo vision (ijcai). In Proc. of the 

7th Int. Joint Conference on Artificial Intelligence, pages 674–679, 

April 1981. 

[2] D. Stoyanov, G. Mylonas, F. Deligianni, A. Darzi, and G. Yang. Softtissue 

motion tracking and structure estimation for robotic assisted mis 

procedures. MICCAI, pages 139–146, 2005. 

[3] R. Richa, A.P. L Bo, and P. Poignet. Towards robust 3d visual tracking 

for motion compensation in beating heart surgery. Medical Image 

Analysis, 15(3):302–1315, 2010. 

[4] D. Scaramuzza and F. Fraundorfer. Visual odometry : Tutorial. IEEE 

Robotics Automation Magazine, 18(4):80–92, Dec. 2011.

Method Avg. Error Avg. Time % Inl. Avg. Error Avg. Time % Inl. Avg. Error Avg. Time % Inl. 

(SIFT) (SIFT) (SIFT) (SURF) (SURF) (SURF) (ASIFT) (ASIFT) (ASIFT) 

Lowe 4.66±4.24 0.37±0.15 34% 4.06±3.64 0.36±0.06 28% 4.77±4.25 1.83±1.57 35% 

AMA 2.49±2.42 5.74±4.45 40% 3.11±3.57 11.16±3.55 34% 2.57±2.85 162.43±136.88 46% 

Cho 3.56±3.35 0.31±0.12 39% 4.44±5.06 0.41±0.17 39% 7.56±11.16 116.10±162.39 31% 

HMA 2.84±2.64 0.27±0.1 39% 3.70±3.45 0.51±0.13 32% 3.62±3.55 2.44±1.11 50% 

I t 

(a) Lowe 

TABLE II 

MIS dataset: Algorithms’ performance 

I q 

I t 

(b) AMA 

I q 

I t 

(c) Cho 

I q 

I t 

(d) HMA 

I q 

Fig. 6. Qualitative example of the matching performance for SIFT features. 

I t 

I q 

I t 

(a) Lowe 

(b) AMA 

I q 

I t 

(c) Cho 

I q 

I t 

(d) HMA 

I q 

Fig. 7. 

[5] F. Rothganger, S. Lazebnik, C. Schmid, and J. Ponce. 3d object 

modeling and recognition using local affine-invariant image descriptors 

and multi-view spatial constraints. Proc. IEEE/RSJ Int. Conf. Intel. 

Robots Syst, 66(3):231–259, 2011. 

[6] M. Brown and D.G. Lowe. Automatic panoramic image stitching using 

invariant features. Int. J. Comp. Vis., 74(1):59–73, 2007. 

[7] Hailin Jin, P. Favaro, and S. Soatto. Real-time feature tracking and 

outlier rejection with changes in illumination. In Proc. IEEE Intern. 

Conf. on Computer Vision, volume 1, pages 684–689, 2001. 

[8] P. Mountney, B. Lo, S. Thiemjarus, D. Stoyanov, and G. Zhong- 

Yang. A probabilistic framework for tracking deformable soft tissue 

in minimally invasive surgery. In MICCAI, pages 34–41. Springer, 

2007. 

[9] P. Mountney and G.Z. Yang. Motion compensated slam for image 

guided surgery. MICCAI, pages 496–504, 2010. 

[10] Benny P. L. Lo, Marco Visentini-Scarzanella, Danail Stoyanov, and 

Guang-Zhong Yang. Belief propagation for depth cue fusion in 

minimally invasive surgery. In MICCAI, pages 104–112, 2008. 

[11] Marco Visentini-Scarzanella, George P. Mylonas, Danail Stoyanov, 

and Guang-Zhong Yang. i-brush: A gaze-contingent virtual paintbrush 

for dense 3d reconstruction in robotic assisted surgery. In Med. Im. 

Comp. and Comp. Ass. Int., pages 353–360, 2009. 

[12] J. Totz, P. Mountney, D. Stoyanov, and G.-Z. Yang. Dense surface 

reconstruction for enhanced navigation in MIS. In MICCAI, pages 

89–96, 2011. 

[13] M. Hu, G.P. Penney, D. Rueckert, P.J. Edwards, F. Bello, R. Casula, 

M. Figl, and D.J. Hawkes. Non-rigid reconstruction of the beating 

heart surface for minimally invasive cardiac surgery. In MICCAI, 2009. 

[14] P. Mountney, D. Stoyanov, and Guang-Zhong Yang;. Threedimensional 

tissue deformation recovery and tracking. IEEE Signal 

Processing Magazine, 27:14–24, 2010. 

[15] L.M. Su, B.P. Vagvolgyi, R. Agarwal, C.E. Reiley, R.H. Taylor, and 

G.D. Hager. Augmented reality during robot-assisted laparoscopic 

Qualitative example of the matching performance for ASIFT features. 

partial nephrectomy: Toward real-time 3d-ct to stereoscopic video 

registration. Urology, 73(4):896–900, 2009. 

[16] O. Ukimura and I.S. Gill. Image-fusion, augmented reality, and 

predictive surgical navigation. Urologic Clinics of North America, 

36(2):115 – 123, 2009. 

[17] D.G. Lowe. Distinctive image features from scale-invariant keypoints. 

Int. J. Comp. Vis., 60(2):91–110, 2004. 

[18] G. A. Puerto-Souza, M. Adibi, J. A. Cadeddu, and G.L. Mariottini. 

Adaptive multi-affine (AMA) feature-matching algorithm and 

its application to minimally-invasive surgery images. In IEEE/RSJ 

International Conference on Intelligent Robots and Systems, pages 

2371 –2376, sept. 2011. 

[19] M. Cho, J. Lee, and K.M. Lee. Feature correspondence and deformable 

object matching via agglomerative correspondence clustering. In Proc. 

9th Int. Conf. Comp. Vis., pages 1280–1287, 2009. 

[20] H. Trevor, T. Robert, and F. Jerome. The elements of statistical 

learning: data mining, inference and prediction. New York: Springer- 

Verlag, 1(8):371–406, 2001. 

[21] H. Bay, T. Tuytelaars, and L. Van Gool. SURF: Speeded up robust 

features. In Proc. European Comp Vis. Conf, pages 404–417. Springer, 

2006. 

[22] J.M. Morel and G. Yu. ASIFT: A new framework for fully affine 

invariant image comparison. Journal on Imaging Sciences, 2(2):438– 

469, 2009. 

[23] K. Mikolajczyk and C. Schmid. Scale & affine invariant interest point 

detectors. Int. J. Comp. Vis., 60(1):63–86, 2004. 

[24] R. Hartley and A. Zisserman. Multiple view geometry in computer 

vision. Cambridge Univ Press, 2000. 

[25] K. Mikolajczyk and C. Schmid. A performance evaluation of local 

descriptors. IEEE Trans. Pattern Anal., 60:1615–1630, 2005. 

[26] J.W. Tukey. Exploratory data analysis. Reading, MA, 1977. 

[27] Christopher M. Bishop. Pattern Recognition and Machine Learning. 

Springer-Verlag New York, Inc., Secaucus, NJ, USA, 2006.

Hierarchical Multi-Affine (HMA) algorithm for fast and accurate ...

You also want an ePaper? Increase the reach of your titles

Delete template?

Save as template?