Fast nearest neighbor search in pseudosemimetric spaces
Markus Lessmann and Rolf P. Würtz
Institut für Neuroinformatik, Ruhr-University Bochum, Germany
markus.lessmann@ini.rub.de, rolf.wuertz@ini.rub.de

Keywords: locality-sensitive hashing, differing vector spaces

Abstract: Nearest neighbor search in metric spaces is an important task in pattern recognition because it allows a query pattern to be associated with a known pattern from a learned dataset. In low-dimensional spaces many good solutions exist that minimize the number of comparisons between patterns by partitioning the search space with tree structures. In high-dimensional spaces tree methods become useless because they fail to prevent scanning almost the complete dataset. Locality-sensitive hashing methods solve the task approximately by grouping patterns that are nearby in search space into buckets. For this, an appropriate hash function has to be known that is highly likely to assign a query pattern to the same bucket as its nearest neighbor. This works fine as long as all patterns have the same dimensionality and lie in the same vector space with a complete metric.
Here, we propose a locality-sensitive hashing scheme that is able to process patterns built up of several, possibly missing, subpatterns, which causes the patterns to lie in vector spaces of different dimensionality. Such patterns can only be compared using a pseudosemimetric.

1 INTRODUCTION

Nearest neighbor search (NN search) in high-dimensional spaces is a common problem in pattern recognition and computer vision. Examples of possible applications include content-based image retrieval (CBIR) (Giacinto, 2007), which comprises techniques to find the image most similar to a query image in a database, as well as optical character recognition (OCR) (Sankar K. et al., 2010), where letters and words in an image of a text are recognized automatically in order to digitize it. Another field in which the problem arises is object recognition by feature extraction and matching, as in the approach of (Westphal and Würtz, 2009). This system extracts features called parquet graphs from images during learning and stores them, together with additional information about object identity and category, in a codebook. During recall, features of the same kind are drawn from a test image, and for each of them the nearest neighbor in the codebook is found.
The additional information is then used for voting about the possible object identity or category. All these examples spend a large part of their total computational resources on nearest neighbor search. To handle large databases (also called codebooks from now on) they need methods for fast NN search. In the case of spatially extended features used together with segmentation masks, as in (Würtz, 1997) for face recognition and (Westphal and Würtz, 2009) for general object recognition, a further problem arises.

A multidimensional template or a parquet graph describes a local image patch using Gabor descriptors called jets at a small range of positions and can incorporate segmentation information by deactivating jets at positions outside the object. Thus, the parquet graph is enhanced with a segmentation mask whose values are active or inactive. The comparison function between two parquet graphs uses only the positions that are active in both graphs. By means of different allocation patterns of active and inactive positions, parquet graphs fall into several groups of vectors of different dimensions. This renders known techniques for nearest neighbor search useless, because they rely on the triangle inequality, which is


only valid if all vectors belong to the same metric space. We will present a search technique that is able to handle this problem.

The outline of this article is as follows: in section 2 we introduce nearest neighbor search techniques for low-dimensional metric spaces. Then we explain the problem of high-dimensional nearest neighbor search (the curse of dimensionality) and the locality-sensitive hashing scheme (LSH) as an approximate solution. In section 3 parquet graphs are presented, and it is made clear why the original LSH scheme is not suited for them. Section 4 presents our LSH scheme fitted to the special demands of parquet graphs. Section 5 shows results of experiments that were conducted using our technique, with original LSH and exhaustive search for comparison. The conclusion is given in section 6.

2 OVERVIEW OF NEAREST NEIGHBOR SEARCH

Of course, NN search can always be performed using exhaustive search. That means the query vector is compared to all vectors in the database, and the minimal distance as well as the index of the nearest neighbor are updated after each comparison. For big codebooks this method becomes prohibitive because of its computational cost.

In low-dimensional spaces several efficient methods for NN search exist. The most prominent one is the kd-tree (Bentley, 1975). It builds a binary tree structure from the given codebook of vectors and thereby bisects the search space iteratively in each dimension.
At first, all vectors are sorted according to the value of their first component. Then the vector whose value is the median is assigned to the root node. This means the root stores the median value and a pointer to the complete vector. All other vectors are then split into two groups, depending on whether their first component value is higher or lower than the median. Both groups are now sorted according to their second components, and again medians are determined. Their vectors become the root nodes of two new subtrees, which become the left and right children of the first root node. This process is repeated until all vectors are assigned to a node. If the depth of the tree exceeds the number of dimensions, the first dimension is used again for determining median values.

NN search now works in the following way: the query vector is compared to the vector of the root node, and both the complete and the single-dimension distance are computed. Since the complete squared Euclidean distance is the sum of the squared Euclidean distances over all single dimensions, the complete distance between two vectors can never be smaller than their distance in only one dimension. Therefore, an estimate of the complete distance to the nearest neighbor is updated while traversing the tree.
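The construction by cyclic median splits and the pruned traversal can be sketched in a few lines of Python. This is a minimal textbook kd-tree for illustration, not the authors' implementation; all names are our own:

```python
import math

def build_kdtree(points, depth=0):
    """Recursively split on the median of dimension depth % k."""
    if not points:
        return None
    d = depth % len(points[0])
    points = sorted(points, key=lambda p: p[d])
    m = len(points) // 2
    return {"point": points[m],
            "left": build_kdtree(points[:m], depth + 1),
            "right": build_kdtree(points[m + 1:], depth + 1)}

def nn_search(node, query, depth=0, best=None):
    """Descend toward the query; prune a subtree when the single-dimension
    distance to the splitting value already exceeds the best estimate."""
    if node is None:
        return best
    d = depth % len(query)
    dist = math.dist(node["point"], query)
    if best is None or dist < best[1]:
        best = (node["point"], dist)
    near, far = (("left", "right") if query[d] < node["point"][d]
                 else ("right", "left"))
    best = nn_search(node[near], query, depth + 1, best)
    if abs(query[d] - node["point"][d]) < best[1]:  # far side may still hide a closer point
        best = nn_search(node[far], query, depth + 1, best)
    return best
```

In the favorable case the single-dimension test discards one subtree at every level, which is where the O(log(n)) behavior comes from.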
If the distance to a vector in one dimension is already bigger than the current estimate, this vector cannot be the nearest neighbor, and neither can all of its (left or right) children, which are even more distant in the current dimension. This means that only one of the two subtrees has to be scanned further, and thus the number of necessary complete distance calculations, and also the search time, is O(log(n)) in the optimal case, with n being the number of vectors in the codebook.

This method does not work in high-dimensional spaces because of the curse of dimensionality: distances between arbitrary points in metric spaces become more and more equal with an increasing number of dimensions. As a consequence, fewer and fewer subtrees can be discarded in the kd-tree search, and almost the complete database needs to be scanned. The search time thus grows from O(log(n)) to O(n). The first method to provide at least an approximate solution to NN search in high-dimensional spaces is called locality-sensitive hashing (LSH). In its version for metric spaces (Datar et al., 2004) it uses a special hash function that assigns vectors to buckets while preserving their neighborhood relations. The hash function is based on a (2-stable) Gaussian distribution G. 2-stable means that for the dot product of a vector ⃗a, whose components are drawn from the Gaussian distribution G, and a vector ⃗v:

⃗a · ⃗v ∝ ‖⃗v‖₂ · A ,  (1)

where A is also a Gaussian-distributed random number. This dot product projects each vector onto the real line.
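The 2-stability property of Eq. (1) can be checked numerically. The sketch below (our own, with synthetic 40-dimensional data) verifies that the average spread of the projected differences on the real line scales with the Euclidean distance of the vectors:

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 40
u = rng.normal(size=dim)
v = u + 0.1 * rng.normal(size=dim)   # a nearby vector
w = u + 2.0 * rng.normal(size=dim)   # a distant vector

# Many independent Gaussian projection vectors a; the spread of
# a.u - a.v on the real line should grow with ||u - v||_2.
A = rng.normal(size=(100_000, dim))
spread_near = np.mean(np.abs(A @ (u - v)))
spread_far = np.mean(np.abs(A @ (u - w)))

# Since E|N(0, s)| = s * sqrt(2/pi), the ratio of the spreads should
# match the ratio of the Euclidean distances.
ratio = spread_near / spread_far
expected = np.linalg.norm(u - v) / np.linalg.norm(u - w)
```

With 100000 projection vectors the measured ratio agrees with the distance ratio to within a percent, illustrating why nearby vectors tend to land in the same bin on the real line.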
For two different vectors ⃗u and ⃗v, their distance on the real line is distributed according to:

⃗u · ⃗a − ⃗v · ⃗a ∝ ‖⃗u − ⃗v‖₂ · A .  (2)

This means that the variance of the distribution increases proportionally to the distance of the vectors in their vector space. Therefore, vectors with a smaller distance in the metric space have a higher probability of being close to each other on the real line. Chopping the real line into bins of equal length leads to buckets containing adjacent


vectors. The complete hash function is

H(⃗v) = ⌊(⃗a · ⃗v + b) / W⌋ ,  (3)

W being the bin width and b a random number drawn from a uniform distribution in the range 0 to W. A query vector is then mapped to its according bucket and compared with all vectors within it. Since only a fraction of all codebook vectors is contained in this bucket, search is sped up. It is always possible that a vector is not projected into the bucket of its nearest neighbor but into an adjacent one. Then not the real nearest neighbor is found but only the l-nearest neighbor (where l is not known). Thus it is not an exact but an approximate method. To increase the probability of finding the true nearest neighbors (also called the hit rate within this paper), the vectors from the dataset can be mapped to several buckets using several different hash functions. Scanning multiple compilations of adjoining vectors increases the probability of finding the true nearest neighbor.

3 PARQUET GRAPHS

Parquet graphs (Westphal and Würtz, 2009) are features that describe an image patch of a certain size by combining vectors of Gabor descriptors, or jets (Lades et al., 1993), in a graph structure. The jets are extracted from a 3 × 3 rectangular grid. Jets contain absolute values of Gabor wavelet responses in 8 directions and 5 scales, which means that they are 40-dimensional vectors.
Thus a full parquet graph consists of 9 jets and can be seen as a 360-dimensional vector in a metric space. Two jets can be compared by calculating their normalized dot product, resulting in a similarity measure between 0 and 1. The jet is stored in normalized form throughout the process, so a single dot product is enough for calculating similarities. Two parquet graphs can be compared by comparing their jets at corresponding positions and then averaging the similarity values. Segmentation information of a training image can be incorporated into the parquet graph by labeling grid positions inactive that have not been placed on the object. The respective jet then does not contain information about the object and is not used for comparison. For matching of two parquet graphs, only grid positions labeled active in both parquet graphs are used for calculating the average similarity. For each parquet graph at least the center position is active by definition, therefore all parquet graphs can be compared to each other.

For NN search in parquet graphs with inactive positions, none of the existing methods can be used.
Because of the differing dimensionality of the features, the triangle inequality is invalid; it would be needed to draw conclusions about distances of vectors that have not been compared directly.

4 PROPOSED SYSTEM

4.1 General application flow

As LSH is the only applicable technique in high-dimensional spaces, we base our method on it. Since the hash function is the essential element that captures the neighborhood relations of vectors, we have to replace it with a new function suited to our requirements. This new function has to somehow capture spatial relationships between parquet graphs. The only way to do this is to use the parquet graphs themselves. This works as follows: a small subset of K parquet graphs, which we will call hash vectors, is selected from the codebook. Now each vector in the codebook is compared with them, yielding K different similarity values between 0 and 1. These are then transformed into binary values using the Heaviside function with a threshold t_i:

b_i = H(s_i − t_i) ,  (4)

yielding a bit string of length K. This string, interpreted as a positive integer, is the index of the bucket in which the vector is stored. Storing means that the bucket either contains a list of indices of vectors that belong to it or has a local data array which holds the corresponding vectors. Since we use specialized libraries for matrix multiplication to compute dot products of query and codebook vectors, local storage is the faster method and is used in our approach.
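The hashing step of Eq. (4) can be sketched as follows (a minimal version of our own; the function names are ours): the K similarity values are thresholded, and the resulting bits, read as a binary number, give the bucket index.

```python
def bucket_index(similarities, thresholds):
    """Heaviside thresholding (Eq. 4) followed by bit packing: the i-th
    bit is 1 iff s_i >= t_i; the bit string, read as an integer, indexes
    one of the 2**K buckets."""
    index = 0
    for s, t in zip(similarities, thresholds):
        index = (index << 1) | (1 if s >= t else 0)
    return index
```

With K = 3, similarities 0.49, 0.505, 0.78 and all thresholds at 0.5, this yields the bit string 011, i.e. bucket 3 of the 8 possible buckets.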
By storing the buckets in an array of length 2^K, each bucket can be accessed directly by its hash value.

The assumption behind this definition is that parquet graphs that behave similarly in relation to a set of other parquet graphs are more similar than those that do not. Therefore they are assigned to the same bucket. In a query, a hash value is computed for the query vector in the same way. Then the according bucket has to be scanned by exhaustive search. It is possible that this query bucket is empty. Then the non-empty buckets with the most similar bit strings have to be determined. This means a NN search in Hamming space has to be performed by probing the bit strings with Hamming distance 1, 2, 3 and so forth, one after another, until an occupied bucket is found for a given distance. If there are several filled buckets at the same Hamming distance, they are all scanned. Bit strings with Hamming distance D are easy to compute by applying a bitwise XOR to the query bit string and a string of the same length that contains 1 at D positions and 0 at the remaining ones. These perturbation vectors are precomputed before the search starts. The maximal possible Hamming distance D_max of a query can also be computed in advance by comparing all bit strings found in the codebook with all other possible bit strings of length K. The perturbation vectors are then precalculated up to the distance D_max.

4.2 Selection of hash vectors and thresholds

The crucial point in this method is the selection of the K parquet graphs used for deriving the hash values and of their corresponding thresholds t_i. If vectors and respective thresholds are chosen poorly, they assign the same hash value to each vector of the database, and the resulting search becomes exhaustive. The ideal case is when the vectors are distributed equally across the possible 2^K bit strings. Then each bucket contains n/2^K vectors, and only a small subset of the complete database has to be scanned.
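The Hamming-space probing with precomputed perturbation vectors can be sketched as follows (our own minimal version; `buckets` is assumed to be a mapping from integer bucket indices to their stored contents):

```python
from itertools import combinations

def perturbation_masks(K, max_dist):
    """Precompute, per Hamming distance D, all K-bit masks with D ones."""
    masks = {}
    for D in range(1, max_dist + 1):
        masks[D] = [sum(1 << p for p in pos)
                    for pos in combinations(range(K), D)]
    return masks

def probe_order(query_idx, buckets, masks):
    """Yield non-empty buckets: the query's own bucket first, then all
    occupied buckets at Hamming distance 1, 2, ... obtained by XOR with
    the precomputed perturbation masks."""
    if buckets.get(query_idx):
        yield query_idx
    for D in sorted(masks):
        yield from (query_idx ^ m for m in masks[D]
                    if buckets.get(query_idx ^ m))
```

All occupied buckets at a given distance D are produced before any bucket at distance D + 1, matching the described search order.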
Because of the differing metric spaces in which parquet graphs can be compared, it is hardly imaginable how such vectors could be calculated from scratch. Instead, we select suitable vectors first and then choose a threshold for each of them, one after another, such that the codebook vectors are distributed as equally as possible onto the buckets.

Hash vectors need to be as dissimilar as possible, such that comparison with them yields a maximum of differing information. Suppose that all hash vectors are chosen. Now all codebook vectors are compared with the first one, and a threshold is chosen that gives an optimal distribution (for one half of the codebook vectors b_1 is 0 and for the other 1). Comparison with the second hash vector should now be able to further subdivide these 2 groups into 4 different groups of almost the same size. This will only be possible if the second hash vector is very dissimilar to the first; otherwise it would assign the same bit value to all vectors which are already in the same group, giving 2 buckets of the same size and 2 empty ones. Therefore, hash vectors are chosen first in an evolutionary optimization step. In a fixed number of tries, a codebook vector is selected randomly. Then it is compared to all other vectors, and the one with the lowest similarity is used as the second hash vector. Now both vectors are compared with the rest, and the codebook vector whose maximum similarity to both of them is minimal is chosen. This is repeated until an expected maximal number of hash vectors that will be needed is reached.
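This greedy max-min selection can be sketched as follows. The code is our own reading of the procedure, using plain cosine similarities on a synthetic codebook of normalized rows; the trial count stands in for the paper's 1000 trials:

```python
import numpy as np

def select_hash_vectors(codebook, K, trials=10, rng=None):
    """Start from a random seed vector, then repeatedly add the codebook
    vector whose *maximum* similarity to the already chosen ones is
    minimal; over several trials keep the set whose largest pairwise
    similarity (the quality measure) is lowest."""
    rng = rng or np.random.default_rng()
    sims = codebook @ codebook.T              # rows assumed normalized
    best_set, best_quality = None, np.inf
    for _ in range(trials):
        chosen = [int(rng.integers(len(codebook)))]
        while len(chosen) < K:
            worst = sims[chosen].max(axis=0)  # max similarity to chosen set
            worst[chosen] = np.inf            # never pick a vector twice
            chosen.append(int(worst.argmin()))
        quality = sims[np.ix_(chosen, chosen)][~np.eye(K, dtype=bool)].max()
        if quality < best_quality:
            best_set, best_quality = chosen, float(quality)
    return best_set, best_quality
```

The quality measure returned alongside the set is exactly the maximal similarity between any two selected hash vectors.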
The maximal similarity between any two hash vectors is used as a measure of the quality of the selected vectors. After, e.g., 1000 trials, the choice with the lowest quality measure is kept (the remaining hash vectors are exactly determined by the choice of the first one). The thresholds are then fixed sequentially: t_i is increased in steps of 0.01 from 0.0 to 1.0, and the vectors are divided into groups accordingly. To check the created distribution, the following fitness function is computed. Let

p_i = n[i] / n ,  (5)

with n[i] being the number of vectors in bucket i. This is the percentage of codebook vectors contained in that bucket. The optimal percentage for each bucket is

p = 1 / 2^K .  (6)

The absolute differences of these values are added over all buckets to give the quality measure

f(K) = Σ_{i=1}^{2^K} |p_i − p| .  (7)

The threshold yielding a minimal f(K) is selected. Another possible measure we tested is the entropy of the distribution, which becomes maximal for a uniform distribution. The problem is that the entropy also becomes high if one bucket contains the majority of vectors and the remaining ones are distributed equally over the remaining buckets, which is a very unfavorable situation.

4.3 Scanning additional buckets

As already mentioned, the method will not return the real nearest neighbor in some cases. To be useful, we expect it to be correct in ca. 90% of all cases. In the original LSH scheme, the hit rate was increased by adding further partitionings of


the codebook, and thus checking several buckets that were constructed using different hash functions.

Method | Time (sec) | av. time (sec) | % of vectors | Hit rate
Linear search | 4798.98 | 0.0222 | 100.0 | 100.0
Fast search, online parameter tuning | 773.97 | 0.0036 | 14.13 | 90.61
Fast search, manual parameter tuning | 723.99 | 0.0034 | 13.42 | 90.69

Table 1: 216054 NN searches in a codebook containing 45935 parquet graphs with active and inactive jets.

Method | Time (sec) | av. time (sec) | % of vectors | Hit rate
Linear search | 6144.88 | 0.0284 | 100.0 | 100.0
Fast search, online parameter tuning | 716.88 | 0.0033 | 10.22 | 90.34
Fast search, manual parameter tuning | 654.93 | 0.0030 | 9.44 | 90.66
LSH, online parameter tuning | 1321.91 | 0.0061 | 7.52 | 97.25
LSH, offline parameter tuning | 997.06 | 0.0046 | 13.24 | 98.60
LSH, manual parameter tuning | 579.36 | 0.0027 | 5.47 | 90.03

Table 2: 216054 NN searches in a codebook containing 45779 parquet graphs with only active jets.

To reduce the memory demand, (Lv et al., 2007) have devised a way of scanning additional buckets of the same partitioning in search of the true nearest neighbor. This method is called multi-probe LSH. The idea behind this approach is that buckets should be probed first if their hash value would have resulted from smaller displacements of the query vector than the hash values of others would have.
If, e.g., in the original LSH only one hash function is used, then each bucket corresponds to one of the bins into which the real line is fragmented. If the current bin does not contain the correct nearest neighbor, most likely one of the adjacent bins does. Further, if the query vector is projected onto a value closer to the left neighbor bin, it is more probable that this bin contains the correct nearest neighbor. Therefore, it should be scanned before the right neighbor bin. If several hash functions are used, the number of neighboring bins increases exponentially. In (Lv et al., 2007) a scheme is presented that assigns each possible bucket (relative to the current one) a score describing its probability of containing the correct nearest neighbor, and creates the corresponding changes of the query hash value in the order of decreasing probability (see the reference for details). Although we do not have bins of the real line in our approach, we can use this scheme as well. In our case everything depends on the similarities to the hash vectors. For an example with K = 3, assume the following similarities for a query: 0.49, 0.505, 0.78, and that all thresholds are set to 0.5. The first and second values are pretty close to the threshold that decides whether their respective bit is set to 1 or not.
Although the current query bit string is 011, a string of 111 (for the true nearest neighbor) is also possible. A bit string 001 is even more likely, because 0.505 is closer to the threshold than 0.49. In this case the proposed scheme decides which bit string should be checked first. The number of additionally scanned buckets is determined during search by devising the percentage P of codebook vectors that will be scanned. The higher this percentage, the more additional buckets will be scanned. The percentage depends on the desired hit rate.

It is also possible to use several partitionings given by different sets of hash vectors. If only indices of codebook vectors are stored in the buckets, this can help to further reduce search time, because in a well-chosen additional partitioning the first few scanned buckets have higher probabilities of containing the true nearest neighbors than the last buckets that have to be checked in a single partitioning. Since different partitionings contain the same vectors, it can and will happen that a currently probed bucket of the second partitioning contains vectors which were already found in the first partitioning. If only indices of vectors are stored, each index can be checked to prevent multiple comparisons of a codebook vector with the query vector. This is not possible for local storage of vectors.
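The probing order from the K = 3 example above can be reproduced with a small sketch (our own simplification, restricted to single-bit flips ordered by the similarity's proximity to its threshold):

```python
def single_flip_probes(similarities, thresholds):
    """Order the one-bit perturbations of the query bit string by how
    close the corresponding similarity lies to its threshold: the closer
    the value, the likelier the flipped string holds the true neighbor."""
    bits = [1 if s >= t else 0 for s, t in zip(similarities, thresholds)]
    order = sorted(range(len(bits)),
                   key=lambda i: abs(similarities[i] - thresholds[i]))
    probes = []
    for i in order:
        flipped = bits.copy()
        flipped[i] ^= 1
        probes.append("".join(map(str, flipped)))
    return "".join(map(str, bits)), probes

# The section's example: similarities 0.49, 0.505, 0.78, thresholds 0.5.
query, probes = single_flip_probes([0.49, 0.505, 0.78], [0.5, 0.5, 0.5])
```

As argued in the text, 001 is probed before 111, because 0.505 lies closer to its threshold than 0.49 does.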
In this case unnecessary redundant calculations arise, which keep the search from benefiting from the use of additional partitionings.

4.4 Parameter determination

To use the proposed scheme, two parameters have to be determined: K and the percentage P of codebook vectors that needs to be scanned. A lot of effort was put into the task of creating test vectors that can be used for the determination of these


Method | Time (sec) | av. time (sec) | % of vectors | Hit rate
Parallel linear search | 2870.75 | 0.0133 | 100.0 | 100.0
Fast search, online parameter tuning | 1103.38 | 0.0051 | 8.24 | 90.50
LSH, offline parameter tuning | 1453.45 | 0.0067 | 11.82 | 98.79

Table 3: 216054 NN searches in a codebook containing 88951 parquet graphs with only active jets.

two parameters. In the end, none of the devised schemes was satisfying, and they were abandoned. The percentage P depends too strongly on the actual query vectors rather than on the codebook itself. Methods like the original LSH in the LSHKIT implementation assume a certain (gamma) distribution of the codebook vectors and the query vectors and use this assumption for parameter determination. Since parquet graphs are compared in many possibly differing vector spaces, we cannot make the same presumption. Additionally, we found during testing of our method in a different setting, with clearly not gamma-like distributed data, that it is better to exclude any assumptions about the data distribution. Therefore, we do the following: K is determined for each codebook in advance, while the percentage of scanned codebook vectors is adjusted online during search.

The parameter K is set according to the following consideration. In the best case each bucket contains n[i] = n/2^K vectors. For scanning a single bucket, one then has to compute the product of the query vector with the K hash vectors and with the n[i] vectors in the bucket. If K is bigger than n[i], more time is spent on computing the hash value than on probing the bucket. In that case it is better to decrease K, such that the hash value is computed faster and instead more vectors in the codebook are scanned, which increases the probability of finding the true nearest neighbor there. To guarantee that less time is spent on hash value computation than on scanning of buckets, we impose the condition

n[i] = n / 2^K > 2 · K  (8)

to get an upper bound for K. On the other hand, K should not be too small, otherwise the codebook is not split up enough and a lot of possible gain in search speed is wasted. Hence we choose just the maximum value of K that fulfills our condition. This scheme provides us with good values for K.

The percentage P of scanned codebook vectors is adjusted during search. To do this, in the first 10 searches all buckets are scanned, and the percentage of probed vectors after which the nearest neighbor was found is stored in a histogram. After these linear searches, the percentage needed to reach the desired hit rate can be read from the histogram.

It may seem more natural to determine instead the number T of visited buckets. If the codebook were distributed uniformly over the buckets, this would give exactly the same results.
But in practice buckets will contain different numbers of vectors, and a fixed number of buckets causes the method to scan fewer vectors in some searches than in others, which decreases the probability of finding the true nearest neighbor. It therefore turned out that fixing the amount of scanned vectors (via the percentage) works better than fixing T. The percentage P is further adapted during search because the statistics of the query data may change: every 1000 searches, 10 complete scans are done to redetermine P.

4.5 Extending the scheme to k-nearest neighbor search

The scheme can also be used for kNN search. The only change needed is that not only the index of the current nearest neighbor estimate is stored, but also the indices of the k vectors that have been closest to the query vector.

4.6 Extending the scheme to a Euclidean pseudosemimetric

Instead of the normalized scalar product, the scheme also works with (averaged) squared Euclidean distances. These can also be computed using the dot product because of the relation

    (x − y)² = ‖x‖² − 2 x · y + ‖y‖².          (9)

The squared norms of the codebook vectors and hash vectors are computed once before the search and stored in a container. During search, the squared norm of the query vector and its dot products with the hash and codebook vectors are computed, and from these the Euclidean distance (averaged over subspaces) is obtained.
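The distance computation just described can be illustrated with a short sketch: with the squared norms of the codebook cached, one dot product per codebook vector suffices, by relation (9). The averaging over subspaces is omitted here, and the function name is our own.

```python
import numpy as np

def squared_euclidean_scan(query, codebook, codebook_sq_norms):
    """Squared Euclidean distance from the query to every codebook row,
    computed via relation (9).  codebook_sq_norms is precomputed once
    as (codebook ** 2).sum(axis=1) before the search."""
    q_sq = np.dot(query, query)      # ||x||^2, once per query
    dots = codebook @ query          # x . y for every codebook vector
    return q_sq - 2.0 * dots + codebook_sq_norms

codebook = np.array([[1.0, 0.0], [0.0, 1.0], [3.0, 4.0]])
sq_norms = (codebook ** 2).sum(axis=1)
d = squared_euclidean_scan(np.array([1.0, 1.0]), codebook, sq_norms)
# d agrees with the directly computed ||x - y||^2 for each row
```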
The rest of the scheme stays the same. The hash vectors now have to be as distant from each other as possible (which corresponds to the dissimilarity condition), and the possible distances between a query vector and a hash vector are no longer restricted to [0, 1] but lie in [0, ∞). Therefore, the minimal and maximal distances of any codebook vector to any hash vector are determined, and the thresholds are then checked in steps of

    δ = (maxDis − minDis) / 1000.

5 EXPERIMENTS

As our scheme is the only one besides exhaustive search that can handle parquet graphs with inactive jets, we compare it with linear search as a reference. For completely activated parquet graphs we additionally take the LSH implementation of the LSHKIT C++ library by Wei Dong (Dong, 2011) as a comparison. This library implements the multi-probe locality sensitive hashing scheme from (Lv et al., 2007). It uses the squared Euclidean distance for comparison of vectors, which for unit-length vectors yields the same nearest neighbor as the normalized dot product, as can be seen from equation (9) with ‖x‖² = ‖y‖² = 1: the most similar vector (x · y = 1) is also the closest in Euclidean space.

Linear search was done by storing the complete database in a single matrix and then computing the matrix-matrix product of query vector and codebook matrix. All components belonging to inactive jets are set to 0. Normalization by the number of coincidentally active jets is done by storing the allocation pattern as an unsigned int between 0 and 511, where each bit indicates the status (active or inactive) of a jet.
By computing the bitwise AND of two such patterns and counting the number of bits set to 1, the appropriate normalization factor is found, and the dot product is divided by it.

NN search was tested within the framework of a simple object recognition system. This system learns a codebook of parquet graphs from a set of training images by vector quantization. Each extracted parquet graph is added to the codebook if its highest similarity to any codebook parquet graph is below a threshold, set to 0.92 in these tests. Together with these features, information about the current object category is stored. During recognition, parquet graphs are extracted from test images, and for each the nearest neighbor in the codebook is found. Its corresponding category information is then used in a voting scheme to determine the category of the current object. Time for NN search was measured separately during recognition. Two different codebooks were used: the first contains inactive jets and can therefore only be handled by linear search and our method; the second comprises only parquet graphs in which every jet is active. The latter codebook is used to compare our method with the original LSH.

To be able to evaluate the potential of both systems independently of parameter determination, we ran the tests once with hand-tuned parameters and once with parameters detected by the methods. The LSHKIT also offers the possibility of online adaptation of the number of scanned buckets; this was tested as well.

All tests were run 5 times and the average search time was recorded. The biggest standard deviations were 14.5 sec for linear search and 14.34 sec for LSHKIT search with online parameter determination on the second codebook. All others were below 7.0 sec, and several were also below 3.0 sec. The following tables summarize all test results. The first column gives the search time over all searches, the second the average time per search, the third the percentage of scanned codebook vectors, and the last the hit rate.

Table 1 clearly shows the superiority of our method compared to linear search. Our nearest neighbor search is considerably faster than exhaustive search and needs only 16.13% and 15.09%, respectively, of the linear search time, which fits relatively well with the corresponding percentages of scanned codebook vectors.

For the second codebook (table 2), the percentages of search time are (in the order of the table) 11.67%, 10.66%, 21.51%, 16.23%, and 9.43%. This is also in relatively good accordance with the amount of scanned vectors for our method; for LSHKIT the difference is bigger. It seems that on this codebook LSHKIT spends relatively more time on hash value computation than our approach does. The second codebook also shows that our method is potentially not as good as the original LSH in the LSHKIT implementation, which only needs to look at 5.47% of all vectors with manually tuned parameters.
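The allocation-pattern normalization described above (bitwise AND of two 9-bit patterns, then counting the set bits) can be sketched as follows; the function names are our own illustration, not the original C++ implementation.

```python
def coactive_jets(pattern_a, pattern_b):
    """Number of jets active in both parquet graphs: popcount of the
    bitwise AND of the two 9-bit allocation patterns (values 0..511)."""
    return bin(pattern_a & pattern_b).count("1")

def normalized_similarity(dot_product, pattern_a, pattern_b):
    """Divide the raw dot product by the number of coincidentally
    active jets; 0.0 if the two graphs share no active jet."""
    n = coactive_jets(pattern_a, pattern_b)
    return dot_product / n if n else 0.0
```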
With automatic parameter detection, LSHKIT is slower than our method but gives a higher hit rate (at the cost of scanning more vectors than necessary for the desired hit rate). With online determination the scan amount is lowered to 7.52%, but the search time is worse than for offline determination, with an only slightly smaller recognition rate. Thus online determination seems to scan the codebook more efficiently, but it does not lead to a better approximation of the desired recognition rate and needs additional time for parameter adaptation. This may be due to the non-gamma-distributed data in our tests. The superiority of LSHKIT can most likely be ascribed to a better partitioning of the search space than in our method. In our search scheme this strongly depends on the selected hash vectors. By testing more than 1000 vectors as the first hash vector (maybe all codebook vectors) this can be improved, but the time for parameter determination increases considerably.

Additionally, we tried to increase search speed by using Intel's MKL (INTEL, 2011) (version 10.2.6.038) to speed up matrix multiplication. This library is optimized for Intel CPUs to do algebraic calculations as fast as possible. To our astonishment it turned out that using the library actually diminished the performance of our system. Even when using the special functions for matrix-vector and vector-vector products, the computation took more time on our test system than when using the boost uBLAS product functions (BOOST, 2011). MKL needs certain sizes of both matrices to be able to accelerate multiplication; if this condition is fulfilled, it makes a substantial difference. We ran an additional test with linear search on all parquet graphs of each single image in parallel on the second codebook. This gave a search time of 1009.02 sec (standard deviation 2.97 sec) for parallel linear search (4.7 msec per search). This result is clearly slower than our method when aiming at a hit rate of 90%.
But if one compares it to LSHKIT with offline parameter determination and a hit rate close to 100%, it is only slightly slower while being completely exact. Is it therefore recommendable to do parallel linear search if very high hit rates are needed? To test this we created a third codebook of 88951 parquet graphs, almost twice as big as the others. For this we got the results shown in table 3: parallel linear search (standard deviation 2.11 sec) was clearly slower than our method (standard deviation 5.49 sec) and LSHKIT (standard deviation 2.51 sec). This is a hint that even with optimized matrix-matrix multiplication it makes sense to use special search methods. It is possible that parallel linear search would have been faster if the search had been done on more search vectors at once, but in most applications one cannot expect to have all search vectors available from the beginning. A real-time object recognition system cannot know the search vectors of future images, and the matrix size up to which parallel matrix-matrix multiplication gives a speedup will most likely also depend on the cache size of the CPU.

6 CONCLUSION

We have presented the first search scheme that is able to do fast nearest neighbor search on a set of vectors of different dimensionality that are comparable with each other but do not share a common vector space.
The method is (at least in a serial search scheme) considerably faster than exhaustive search and is applicable as long as a similarity function can be derived or in (spaces consisting of) Euclidean (sub)spaces. The automatic parameter determination (for the hash vectors) takes some time but has to be run only once per database.

REFERENCES

Bentley, J. L. (1975). Multidimensional binary search trees used for associative searching. Commun. ACM, 18:509-517.

BOOST (2011). uBLAS basic linear algebra library.

Datar, M., Immorlica, N., Indyk, P., and Mirrokni, V. S. (2004). Locality-sensitive hashing scheme based on p-stable distributions. In Proc. SCG '04, pages 253-262. ACM.

Dong, W. (2011). LSHKIT: A C++ locality sensitive hashing library. http://lshkit.sourceforge.net/index.html.

Giacinto, G. (2007). A nearest-neighbor approach to relevance feedback in content based image retrieval. In Proc. CIVR '07, pages 456-463. ACM.

INTEL (2011). Intel math kernel library. http://software.intel.com/en-us/intel-mkl/.

Lades, M., Vorbrüggen, J. C., Buhmann, J., Lange, J., von der Malsburg, C., Würtz, R. P., and Konen, W. (1993). Distortion invariant object recognition in the dynamic link architecture. IEEE Trans. Comp., 42(3):300-311.

Lv, Q., Josephson, W., Wang, Z., Charikar, M., and Li, K. (2007). Multi-probe LSH: efficient indexing for high-dimensional similarity search. In Proc. VLDB '07, pages 950-961. VLDB Endowment.

Sankar K., P., Jawahar, C. V., and Manmatha, R. (2010). Nearest neighbor based collection OCR. In Proc. DAS '10, pages 207-214, New York, NY, USA. ACM.

Westphal, G. and Würtz, R. P. (2009). Combining feature- and correspondence-based methods for visual object recognition. Neural Computation, 21(7):1952-1989.

Würtz, R. P. (1997). Object recognition robust under translations, deformations and changes in background. IEEE Trans. PAMI, 19(7):769-775.
