Fast nearest neighbor search in pseudosemimetric spaces
Markus Lessmann and Rolf P. Würtz
Institut für Neuroinformatik, Ruhr-University Bochum, Germany
markus.lessmann@ini.rub.de, rolf.wuertz@ini.rub.de

Keywords: locality-sensitive hashing, differing vector spaces

Abstract: Nearest neighbor search in metric spaces is an important task in pattern recognition because it allows a query pattern to be associated with a known pattern from a learned dataset. In low-dimensional spaces many good solutions exist that minimize the number of comparisons between patterns by partitioning the search space with tree structures. In high-dimensional spaces tree methods become useless because they fail to prevent scanning almost the complete dataset. Locality-sensitive hashing methods solve the task approximately by grouping patterns that are nearby in search space into buckets. For this, an appropriate hash function has to be known that is highly likely to assign a query pattern to the same bucket as its nearest neighbor. This works fine as long as all patterns have the same dimensionality and lie in the same vector space with a complete metric.
Here, we propose a locality-sensitive hashing scheme that is able to process patterns built up of several, possibly missing, subpatterns, which causes the patterns to lie in vector spaces of different dimensionality. Such patterns can only be compared using a pseudosemimetric.

1 INTRODUCTION

Nearest neighbor search (NN search) in high-dimensional spaces is a common problem in pattern recognition and computer vision. Examples of possible applications include content-based image retrieval (CBIR) (Giacinto, 2007), which comprises techniques to find the image most similar to a query image in a database, as well as optical character recognition (OCR) (Sankar K. et al., 2010), where letters and words in an image of a text are recognized automatically in order to digitize it. Another field in which the problem arises is object recognition by feature extraction and matching, as in the approach of (Westphal and Würtz, 2009). This system extracts features called parquet graphs from images during learning and stores them, together with additional information about object identity and category, in a codebook. During recall, features of the same kind are drawn from a test image, and for each of them the nearest neighbor in the codebook is found.
The additional information is then used for voting about the possible object identity or category. All these examples spend a large part of their total computational resources on nearest neighbor search. To handle large databases (also called codebooks from now on) they need methods for fast NN search. In the case of spatially extended features used together with segmentation masks, as in (Würtz, 1997) for face recognition and (Westphal and Würtz, 2009) for general object recognition, a further problem arises.

A multidimensional template or a parquet graph describes a local image patch using Gabor descriptors called jets at a small range of positions and can incorporate segmentation information by deactivating jets at positions outside the object. Thus, the parquet graph is enhanced with a segmentation mask whose values are active or inactive. The comparison function between two parquet graphs uses only the positions that are active in both graphs. By means of different allocation patterns of active and inactive positions, parquet graphs fall into several groups of vectors of different dimensions. This renders known techniques for nearest neighbor search useless, because they rely on the triangle inequality, which is


only valid if all vectors belong to the same metric space. We will present a search technique that is able to handle this problem.

The outline of this article is as follows: in section 2 we introduce nearest neighbor search techniques for low-dimensional metric spaces. Then we explain the problem of high-dimensional nearest neighbor search (the curse of dimensionality) and the locality-sensitive hashing scheme (LSH) as an approximate solution. In section 3 parquet graphs are presented, and it is made clear why the original LSH scheme is not suited for them. Section 4 presents our LSH scheme fitted to the special demands of parquet graphs. Section 5 shows results of experiments that were conducted using our technique, with original LSH and exhaustive search for comparison. The conclusion is given in section 6.

2 OVERVIEW OF NEAREST NEIGHBOR SEARCH

Of course, NN search can always be performed using exhaustive search. That means the query vector is compared to all vectors in the database, and the minimal distance as well as the index of the nearest neighbor are updated after each comparison. For big codebooks this method becomes prohibitive because of its computational cost.

In low-dimensional spaces several efficient methods for NN search exist. The most prominent one is the kd-tree (Bentley, 1975). It builds a binary tree structure from the given codebook of vectors and thereby bisects the search space iteratively in each dimension.
At first, all vectors are sorted according to the value of their first component. Then the vector whose value is the median is assigned to the root node. This means the root stores the median value and a pointer to the complete vector. All other vectors are then split into two groups, depending on whether their first component value is higher or lower than the median. Both groups are now sorted according to their second components, and again medians are determined. Their vectors become the root nodes of two new subtrees, which become the left and right children of the first root node. This process is repeated until all vectors are assigned to a node. If the depth of the tree exceeds the number of dimensions, the first dimension is used again for determining median values.

NN search now works in the following way: the query vector is compared to the vector of the root node, and both the complete and the single-dimension distance are computed. Since the complete squared Euclidean distance is the sum of the squared Euclidean distances over all single dimensions, the complete distance between two vectors can never be smaller than their distance in only one dimension. Therefore, an estimate of the complete distance to the nearest neighbor is updated while traversing the tree.
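The construction by cyclic median splits and the pruned traversal can be sketched in a few lines of Python. This is a minimal textbook kd-tree for illustration, not the authors' implementation; all names are our own:

```python
import math

def build_kdtree(points, depth=0):
    """Recursively split on the median of dimension depth % k."""
    if not points:
        return None
    d = depth % len(points[0])
    points = sorted(points, key=lambda p: p[d])
    m = len(points) // 2
    return {"point": points[m],
            "left": build_kdtree(points[:m], depth + 1),
            "right": build_kdtree(points[m + 1:], depth + 1)}

def nn_search(node, query, depth=0, best=None):
    """Descend toward the query; prune a subtree when the single-dimension
    distance to the splitting value already exceeds the best estimate."""
    if node is None:
        return best
    d = depth % len(query)
    dist = math.dist(node["point"], query)
    if best is None or dist < best[1]:
        best = (node["point"], dist)
    near, far = (("left", "right") if query[d] < node["point"][d]
                 else ("right", "left"))
    best = nn_search(node[near], query, depth + 1, best)
    if abs(query[d] - node["point"][d]) < best[1]:  # far side may still hide a closer point
        best = nn_search(node[far], query, depth + 1, best)
    return best
```

In the favorable case the single-dimension test discards one subtree at every level, which is where the O(log(n)) behavior comes from.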
If the distance to a vector in one dimension is already bigger than the current estimate, this vector cannot be the nearest neighbor, and neither can all of its (left or right) children, which are even more distant in the current dimension. This means that only one of the two subtrees has to be scanned further, and thus the number of necessary complete distance calculations, and also the search time, is O(log(n)) in the optimal case, with n being the number of vectors in the codebook.

This method does not work in high-dimensional spaces because of the curse of dimensionality: distances between arbitrary points in metric spaces become more and more equal with an increasing number of dimensions. As a consequence, fewer and fewer subtrees can be discarded in the kd-tree search, and almost the complete database needs to be scanned. The search time thus grows from O(log(n)) to O(n). The first method to provide at least an approximate solution to NN search in high-dimensional spaces is called locality-sensitive hashing (LSH). In its version for metric spaces (Datar et al., 2004) it uses a special hash function that assigns vectors to buckets while preserving their neighborhood relations. The hash function is based on a (2-stable) Gaussian distribution G. 2-stable means that for the dot product of a vector ⃗a, whose components are drawn from the Gaussian distribution G, and a vector ⃗v:

⃗a · ⃗v ∝ ‖⃗v‖₂ · A ,  (1)

where A is also a Gaussian-distributed random number. This dot product projects each vector onto the real line.
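The 2-stability property of Eq. (1) can be checked numerically. The sketch below (our own, with synthetic 40-dimensional data) verifies that the average spread of the projected differences on the real line scales with the Euclidean distance of the vectors:

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 40
u = rng.normal(size=dim)
v = u + 0.1 * rng.normal(size=dim)   # a nearby vector
w = u + 2.0 * rng.normal(size=dim)   # a distant vector

# Many independent Gaussian projection vectors a; the spread of
# a.u - a.v on the real line should grow with ||u - v||_2.
A = rng.normal(size=(100_000, dim))
spread_near = np.mean(np.abs(A @ (u - v)))
spread_far = np.mean(np.abs(A @ (u - w)))

# Since E|N(0, s)| = s * sqrt(2/pi), the ratio of the spreads should
# match the ratio of the Euclidean distances.
ratio = spread_near / spread_far
expected = np.linalg.norm(u - v) / np.linalg.norm(u - w)
```

With 100000 projection vectors the measured ratio agrees with the distance ratio to within a percent, illustrating why nearby vectors tend to land in the same bin on the real line.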
For two different vectors ⃗u and ⃗v, their distance on the real line is distributed according to:

⃗u · ⃗a − ⃗v · ⃗a ∝ ‖⃗u − ⃗v‖₂ · A .  (2)

This means that the variance of the distribution increases proportionally to the distance of the vectors in their vector space. Therefore, vectors with a smaller distance in the metric space have a higher probability of being close to each other on the real line. Chopping the real line into bins of equal length leads to buckets containing adjacent


vectors. The complete hash function is

H(⃗v) = ⌊(⃗a · ⃗v + b) / W⌋ ,  (3)

W being the bin width and b a random number drawn from a uniform distribution in the range 0 to W. A query vector is then mapped to its according bucket and compared with all vectors within it. Since only a fraction of all codebook vectors is contained in this bucket, search is sped up. It is always possible that a vector is not projected into the bucket of its nearest neighbor but into an adjacent one. Then not the real nearest neighbor is found but only the l-nearest neighbor (where l is not known). Thus it is not an exact but an approximate method. To increase the probability of finding the true nearest neighbors (also called the hit rate within this paper), the vectors from the dataset can be mapped to several buckets using several different hash functions. Scanning multiple compilations of adjoining vectors increases the probability of finding the true nearest neighbor.

3 PARQUET GRAPHS

Parquet graphs (Westphal and Würtz, 2009) are features that describe an image patch of a certain size by combining vectors of Gabor descriptors, or jets (Lades et al., 1993), in a graph structure. The jets are extracted from a 3 × 3 rectangular grid. Jets contain absolute values of Gabor wavelet responses in 8 directions and 5 scales, which means that they are 40-dimensional vectors.
Thus a full parquet graph consists of 9 jets and can be seen as a 360-dimensional vector in a metric space. Two jets can be compared by calculating their normalized dot product, resulting in a similarity measure between 0 and 1. The jet is stored in normalized form throughout the process, so a single dot product is enough for calculating similarities. Two parquet graphs can be compared by comparing their jets at corresponding positions and then averaging the similarity values. Segmentation information of a training image can be incorporated into the parquet graph by labeling grid positions inactive that have not been placed on the object. The respective jet then does not contain information about the object and is not used for comparison. For matching of two parquet graphs, only grid positions labeled active in both parquet graphs are used for calculating the average similarity. For each parquet graph at least the center position is active by definition, therefore all parquet graphs can be compared to each other.

For NN search in parquet graphs with inactive positions, none of the existing methods can be used.
Because of the differing dimensionality of the features, the triangle inequality is invalid; it would be needed to draw conclusions about distances of vectors that have not been compared directly.

4 PROPOSED SYSTEM

4.1 General application flow

As LSH is the only applicable technique in high-dimensional spaces, we base our method on it. Since the hash function is the essential element that captures the neighborhood relations of vectors, we have to replace it with a new function suited to our requirements. This new function has to somehow capture spatial relationships between parquet graphs. The only way to do this is to use the parquet graphs themselves. This works as follows: a small subset of K parquet graphs, which we will call hash vectors, is selected from the codebook. Now each vector in the codebook is compared with them, yielding K different similarity values between 0 and 1. These are then transformed into binary values using the Heaviside function with a threshold t_i:

b_i = H(s_i − t_i) ,  (4)

yielding a bit string of length K. This string, interpreted as a positive integer, is the index of the bucket in which the vector is stored. Storing means that the bucket either contains a list of indices of vectors that belong to it or has a local data array which holds the corresponding vectors. Since we use specialized libraries for matrix multiplication to compute dot products of query and codebook vectors, local storage is the faster method and is used in our approach.
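The hashing step of Eq. (4) can be sketched as follows (a minimal version of our own; the function names are ours): the K similarity values are thresholded, and the resulting bits, read as a binary number, give the bucket index.

```python
def bucket_index(similarities, thresholds):
    """Heaviside thresholding (Eq. 4) followed by bit packing: the i-th
    bit is 1 iff s_i >= t_i; the bit string, read as an integer, indexes
    one of the 2**K buckets."""
    index = 0
    for s, t in zip(similarities, thresholds):
        index = (index << 1) | (1 if s >= t else 0)
    return index
```

With K = 3, similarities 0.49, 0.505, 0.78 and all thresholds at 0.5, this yields the bit string 011, i.e. bucket 3 of the 8 possible buckets.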
By storing the buckets in an array of length 2^K, each bucket can be accessed directly by its hash value.

The assumption behind this definition is that parquet graphs that behave similarly in relation to a set of other parquet graphs are more similar than those that do not. Therefore they are assigned to the same bucket. In a query, a hash value is computed for the query vector in the same way. Then the according bucket has to be scanned by exhaustive search. It is possible that this query bucket is empty. Then the non-empty buckets with the most similar bit strings have to be determined. This means a NN search in Hamming space has to be performed by probing the bit strings with Hamming distance 1, 2, 3 and so forth, one after another, until an occupied bucket is found for a given distance. If there are several filled buckets at the same Hamming distance, they are all scanned. Bit strings with Hamming distance D are easy to compute by applying a bitwise XOR to the query bit string and a string of the same length that contains 1 at D positions and 0 at the remaining ones. These perturbation vectors are precomputed before the search starts. The maximal possible Hamming distance D_max of a query can also be computed in advance by comparing all bit strings found in the codebook with all other possible bit strings of length K. The perturbation vectors are then precalculated up to the distance D_max.

4.2 Selection of hash vectors and thresholds

The crucial point in this method is the selection of the K parquet graphs used for deriving the hash values and of their corresponding thresholds t_i. If vectors and respective thresholds are chosen poorly, they assign the same hash value to each vector of the database, and the resulting search becomes exhaustive. The ideal case is when the vectors are distributed equally across the possible 2^K bit strings. Then each bucket contains n/2^K vectors, and only a small subset of the complete database has to be scanned.
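The Hamming-space probing with precomputed perturbation vectors can be sketched as follows (our own minimal version; `buckets` is assumed to be a mapping from integer bucket indices to their stored contents):

```python
from itertools import combinations

def perturbation_masks(K, max_dist):
    """Precompute, per Hamming distance D, all K-bit masks with D ones."""
    masks = {}
    for D in range(1, max_dist + 1):
        masks[D] = [sum(1 << p for p in pos)
                    for pos in combinations(range(K), D)]
    return masks

def probe_order(query_idx, buckets, masks):
    """Yield non-empty buckets: the query's own bucket first, then all
    occupied buckets at Hamming distance 1, 2, ... obtained by XOR with
    the precomputed perturbation masks."""
    if buckets.get(query_idx):
        yield query_idx
    for D in sorted(masks):
        yield from (query_idx ^ m for m in masks[D]
                    if buckets.get(query_idx ^ m))
```

All occupied buckets at a given distance D are produced before any bucket at distance D + 1, matching the described search order.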
Because of the differing metric spaces in which parquet graphs can be compared, it is hardly imaginable how such vectors could be calculated from scratch. Instead, we select suitable vectors first and then choose a threshold for each of them, one after another, such that the codebook vectors are distributed as equally as possible onto the buckets.

Hash vectors need to be as dissimilar as possible, such that comparison with them yields a maximum of differing information. Suppose that all hash vectors are chosen. Now all codebook vectors are compared with the first one, and a threshold is chosen that gives an optimal distribution (for one half of the codebook vectors b_1 is 0 and for the other 1). Comparison with the second hash vector should now be able to further subdivide these 2 groups into 4 different groups of almost the same size. This will only be possible if the second hash vector is very dissimilar to the first; otherwise it would assign the same bit value to all vectors which are already in the same group, giving 2 buckets of the same size and 2 empty ones. Therefore, hash vectors are chosen first in an evolutionary optimization step. In a fixed number of tries, a codebook vector is selected randomly. Then it is compared to all other vectors, and the one with the lowest similarity is used as the second hash vector. Now both vectors are compared with the rest, and the codebook vector whose maximum similarity to both of them is minimal is chosen. This is repeated until an expected maximal number of hash vectors that will be needed is reached.
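This greedy max-min selection can be sketched as follows. The code is our own reading of the procedure, using plain cosine similarities on a synthetic codebook of normalized rows; the trial count stands in for the paper's 1000 trials:

```python
import numpy as np

def select_hash_vectors(codebook, K, trials=10, rng=None):
    """Start from a random seed vector, then repeatedly add the codebook
    vector whose *maximum* similarity to the already chosen ones is
    minimal; over several trials keep the set whose largest pairwise
    similarity (the quality measure) is lowest."""
    rng = rng or np.random.default_rng()
    sims = codebook @ codebook.T              # rows assumed normalized
    best_set, best_quality = None, np.inf
    for _ in range(trials):
        chosen = [int(rng.integers(len(codebook)))]
        while len(chosen) < K:
            worst = sims[chosen].max(axis=0)  # max similarity to chosen set
            worst[chosen] = np.inf            # never pick a vector twice
            chosen.append(int(worst.argmin()))
        quality = sims[np.ix_(chosen, chosen)][~np.eye(K, dtype=bool)].max()
        if quality < best_quality:
            best_set, best_quality = chosen, float(quality)
    return best_set, best_quality
```

The quality measure returned alongside the set is exactly the maximal similarity between any two selected hash vectors.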
The maximal similarity between any two hash vectors is used as a measure of the quality of the selected vectors. After, e.g., 1000 trials, the choice with the lowest quality measure is kept (the remaining hash vectors are exactly determined by the choice of the first one). The thresholds are then fixed sequentially: t_i is increased in steps of 0.01 from 0.0 to 1.0, and the vectors are divided into groups accordingly. To check the created distribution, the following fitness function is computed. Let

p_i = n[i] / n ,  (5)

with n[i] being the number of vectors in bucket i. This is the percentage of codebook vectors contained in that bucket. The optimal percentage for each bucket is

p = 1 / 2^K .  (6)

The absolute differences of these values are added over all buckets to give the quality measure

f(K) = Σ_{i=1}^{2^K} |p_i − p| .  (7)

The threshold yielding a minimal f(K) is selected. Another possible measure we tested is the entropy of the distribution, which becomes maximal for a uniform distribution. The problem is that the entropy also becomes high if one bucket contains the majority of vectors and the remaining ones are distributed equally over the remaining buckets, which is a very unfavorable situation.

4.3 Scanning additional buckets

As already mentioned, the method will not return the real nearest neighbor in some cases. To be useful, we expect it to be correct in ca. 90% of all cases. In the original LSH scheme, the hit rate was increased by adding further partitionings of


the codebook, and thus checking several buckets that were constructed using different hash functions.

Method | Time (sec) | av. time (sec) | % of vectors | Hit rate
Linear search | 4798.98 | 0.0222 | 100.0 | 100.0
Fast search, online parameter tuning | 773.97 | 0.0036 | 14.13 | 90.61
Fast search, manual parameter tuning | 723.99 | 0.0034 | 13.42 | 90.69

Table 1: 216054 NN searches in a codebook containing 45935 parquet graphs with active and inactive jets.

Method | Time (sec) | av. time (sec) | % of vectors | Hit rate
Linear search | 6144.88 | 0.0284 | 100.0 | 100.0
Fast search, online parameter tuning | 716.88 | 0.0033 | 10.22 | 90.34
Fast search, manual parameter tuning | 654.93 | 0.0030 | 9.44 | 90.66
LSH, online parameter tuning | 1321.91 | 0.0061 | 7.52 | 97.25
LSH, offline parameter tuning | 997.06 | 0.0046 | 13.24 | 98.60
LSH, manual parameter tuning | 579.36 | 0.0027 | 5.47 | 90.03

Table 2: 216054 NN searches in a codebook containing 45779 parquet graphs with only active jets.

To reduce the memory demand, (Lv et al., 2007) have devised a way of scanning additional buckets of the same partitioning in search of the true nearest neighbor. This method is called multi-probe LSH. The idea behind this approach is that buckets should be probed first if their hash value would have resulted from smaller displacements of the query vector than the hash values of others would have.
If, e.g., in the original LSH only one hash function is used, then each bucket corresponds to one of the bins into which the real line is fragmented. If the current bin does not contain the correct nearest neighbor, most likely one of the adjacent bins does. Further, if the query vector is projected onto a value closer to the left neighbor bin, it is more probable that this bin contains the correct nearest neighbor. Therefore, it should be scanned before the right neighbor bin. If several hash functions are used, the number of neighboring bins increases exponentially. In (Lv et al., 2007) a scheme is presented that assigns each possible bucket (relative to the current one) a score describing its probability of containing the correct nearest neighbor, and creates the corresponding changes of the query hash value in the order of decreasing probability (see the reference for details). Although we do not have bins of the real line in our approach, we can use this scheme as well. In our case everything depends on the similarities to the hash vectors. For an example with K = 3, assume the following similarities for a query: 0.49, 0.505, 0.78, and that all thresholds are set to 0.5. The first and second values are pretty close to the threshold that decides whether their respective bit is set to 1 or not.
Although the current query bit string is 011, a string of 111 (for the true nearest neighbor) is also possible. A bit string 001 is even more likely, because 0.505 is closer to the threshold than 0.49. In this case the proposed scheme decides which bit string should be checked first. The number of additionally scanned buckets is determined during search by devising the percentage P of codebook vectors that will be scanned. The higher this percentage, the more additional buckets will be scanned. The percentage depends on the desired hit rate.

It is also possible to use several partitionings given by different sets of hash vectors. If only indices of codebook vectors are stored in the buckets, this can help to further reduce search time, because in a well-chosen additional partitioning the first few scanned buckets have higher probabilities of containing the true nearest neighbors than the last buckets that have to be checked in a single partitioning. Since different partitionings contain the same vectors, it can and will happen that a currently probed bucket of the second partitioning contains vectors which were already found in the first partitioning. If only indices of vectors are stored, each index can be checked to prevent multiple comparisons of a codebook vector with the query vector. This is not possible for local storage of vectors.
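The probing order from the K = 3 example above can be reproduced with a small sketch (our own simplification, restricted to single-bit flips ordered by the similarity's proximity to its threshold):

```python
def single_flip_probes(similarities, thresholds):
    """Order the one-bit perturbations of the query bit string by how
    close the corresponding similarity lies to its threshold: the closer
    the value, the likelier the flipped string holds the true neighbor."""
    bits = [1 if s >= t else 0 for s, t in zip(similarities, thresholds)]
    order = sorted(range(len(bits)),
                   key=lambda i: abs(similarities[i] - thresholds[i]))
    probes = []
    for i in order:
        flipped = bits.copy()
        flipped[i] ^= 1
        probes.append("".join(map(str, flipped)))
    return "".join(map(str, bits)), probes

# The section's example: similarities 0.49, 0.505, 0.78, thresholds 0.5.
query, probes = single_flip_probes([0.49, 0.505, 0.78], [0.5, 0.5, 0.5])
```

As argued in the text, 001 is probed before 111, because 0.505 lies closer to its threshold than 0.49 does.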
In this case unnecessary redundant calculations arise, which keep the search from benefiting from the use of additional partitionings.

4.4 Parameter determination

To use the proposed scheme, two parameters have to be determined: K and the percentage P of codebook vectors that needs to be scanned. A lot of effort was put into the task of creating test vectors that can be used for the determination of these


Method | Time (sec) | av. time (sec) | % of vectors | Hit rate
Parallel linear search | 2870.75 | 0.0133 | 100.0 | 100.0
Fast search, online parameter tuning | 1103.38 | 0.0051 | 8.24 | 90.50
LSH, offline parameter tuning | 1453.45 | 0.0067 | 11.82 | 98.79

Table 3: 216054 NN searches in a codebook containing 88951 parquet graphs with only active jets.

two parameters. In the end, none of the devised schemes was satisfying, and they were abandoned. The percentage P depends too strongly on the actual query vectors rather than on the codebook itself. Methods like the original LSH in the LSHKIT implementation assume a certain (gamma) distribution of the codebook vectors and the query vectors and use this assumption for parameter determination. Since parquet graphs are compared in many possibly differing vector spaces, we cannot make the same presumption. Additionally, we found during testing of our method in a different setting, with clearly not gamma-like distributed data, that it is better to exclude any assumptions about the data distribution. Therefore, we do the following: K is determined for each codebook in advance, while the percentage of scanned codebook vectors is adjusted online during search.

The parameter K is set according to the following consideration. In the best case each bucket contains n[i] = n/2^K vectors. For scanning a single bucket, one then has to compute the product of the query vector with the K hash vectors and with the n[i] vectors in the bucket. If K is bigger than n[i], more time is spent on computing the hash value than on probing the bucket. In that case it is better to decrease K, such that the hash value is computed faster and instead more vectors in the codebook are scanned, which increases the probability of finding the true nearest neighbor there. To guarantee that less time is spent on hash value computation than on scanning of buckets, we impose the condition

n[i] = n / 2^K > 2 · K  (8)

to get an upper bound for K. On the other hand, K should not be too small, otherwise the codebook is not split up enough and a lot of possible gain in search speed is wasted. Hence we choose just the maximum value of K that fulfills our condition. This scheme provides us with good values for K.

The percentage P of scanned codebook vectors is adjusted during search. To do this, in the first 10 searches all buckets are scanned, and the percentage of probed vectors after which the nearest neighbor was found is stored in a histogram. After these linear searches, the percentage needed to reach the desired hit rate can be read from the histogram.

It may seem more natural to determine instead the number T of visited buckets. If the codebook were distributed uniformly over the buckets, this would give exactly the same results.
But in practice buckets will contain different numbers of vectors, and a fixed number of buckets causes the method to scan fewer vectors in some searches than in others, which decreases the probability of finding the true nearest neighbor. It therefore turned out that fixing the amount of scanned vectors (via the percentage) works better than fixing T. The percentage P is further adapted during search because the statistics of the query data may change: every 1000 searches, 10 complete scans are done to redetermine P.

4.5 Extending the scheme to k-nearest neighbor search

The scheme can also be used for kNN search. The only change needed is that not only the index of the current nearest neighbor estimate is stored, but also the indices of the k vectors that have been closest to the query vector.

4.6 Extending the scheme to a Euclidean pseudosemimetric

Instead of the normalized scalar product, the scheme also works with (averaged) squared Euclidean distances. These can also be computed using the dot product because of the relation

    (x − y)² = ‖x‖² − 2 x · y + ‖y‖².          (9)

The squared norms of the codebook vectors and hash vectors are computed once before the search and stored in a container. During search, the squared norm of the query vector and its dot products with the hash and codebook vectors are computed, and from these the Euclidean distance (averaged over subspaces) is obtained.
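The distance computation just described can be illustrated with a short sketch: with the squared norms of the codebook cached, one dot product per codebook vector suffices, by relation (9). The averaging over subspaces is omitted here, and the function name is our own.

```python
import numpy as np

def squared_euclidean_scan(query, codebook, codebook_sq_norms):
    """Squared Euclidean distance from the query to every codebook row,
    computed via relation (9).  codebook_sq_norms is precomputed once
    as (codebook ** 2).sum(axis=1) before the search."""
    q_sq = np.dot(query, query)      # ||x||^2, once per query
    dots = codebook @ query          # x . y for every codebook vector
    return q_sq - 2.0 * dots + codebook_sq_norms

codebook = np.array([[1.0, 0.0], [0.0, 1.0], [3.0, 4.0]])
sq_norms = (codebook ** 2).sum(axis=1)
d = squared_euclidean_scan(np.array([1.0, 1.0]), codebook, sq_norms)
# d agrees with the directly computed ||x - y||^2 for each row
```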
The rest of the scheme stays the same. The hash vectors now have to be as distant from each other as possible (which corresponds to the dissimilarity condition), and the possible distances between a query vector and a hash vector are no longer restricted to [0, 1] but lie in [0, ∞). Therefore, the minimal and maximal distances of any codebook vector to any hash vector are determined, and the thresholds are then checked in steps of

    δ = (maxDis − minDis) / 1000.

5 EXPERIMENTS

As our scheme is the only one besides exhaustive search that can handle parquet graphs with inactive jets, we compare it with linear search as a reference. For completely activated parquet graphs we additionally take the LSH implementation of the LSHKIT C++ library by Wei Dong (Dong, 2011) as a comparison. This library implements the multi-probe locality sensitive hashing scheme from (Lv et al., 2007). It uses the squared Euclidean distance for comparison of vectors, which for unit-length vectors yields the same nearest neighbor as the normalized dot product, as can be seen from equation (9) with ‖x‖² = ‖y‖² = 1: the most similar vector (x · y = 1) is also the closest in Euclidean space.

Linear search was done by storing the complete database in a single matrix and then computing the matrix-matrix product of query vector and codebook matrix. All components belonging to inactive jets are set to 0. Normalization by the number of coincidentally active jets is done by storing the allocation pattern as an unsigned int between 0 and 511, where each bit indicates the status (active or inactive) of a jet.
By computing the bitwise AND of two such patterns and counting the number of bits set to 1, the appropriate normalization factor is found, and the dot product is divided by it.

NN search was tested within the framework of a simple object recognition system. This system learns a codebook of parquet graphs from a set of training images by vector quantization. Each extracted parquet graph is added to the codebook if its highest similarity to any codebook parquet graph is below a threshold, set to 0.92 in these tests. Together with these features, information about the current object category is stored. During recognition, parquet graphs are extracted from test images, and for each the nearest neighbor in the codebook is found. Its corresponding category information is then used in a voting scheme to determine the category of the current object. Time for NN search was measured separately during recognition. Two different codebooks were used: the first contains inactive jets and can therefore only be handled by linear search and our method; the second comprises only parquet graphs in which every jet is active. The latter codebook is used to compare our method with the original LSH.

To be able to evaluate the potential of both systems independently of parameter determination, we ran the tests once with hand-tuned parameters and once with parameters detected by the methods. The LSHKIT also offers the possibility of online adaptation of the number of scanned buckets; this was tested as well.

All tests were run 5 times and the average search time was recorded. The biggest standard deviations were 14.5 sec for linear search and 14.34 sec for LSHKIT search with online parameter determination on the second codebook. All others were below 7.0 sec, and several were also below 3.0 sec. The following tables summarize all test results. The first column gives the search time over all searches, the second the average time per search, the third the percentage of scanned codebook vectors, and the last the hit rate.

Table 1 clearly shows the superiority of our method compared to linear search. Our nearest neighbor search is considerably faster than exhaustive search and needs only 16.13% and 15.09%, respectively, of the linear search time, which fits relatively well with the corresponding percentages of scanned codebook vectors.

For the second codebook (table 2), the percentages of search time are (in the order of the table) 11.67%, 10.66%, 21.51%, 16.23%, and 9.43%. This is also in relatively good accordance with the amount of scanned vectors for our method; for LSHKIT the difference is bigger. It seems that on this codebook LSHKIT spends relatively more time on hash value computation than our approach does. The second codebook also shows that our method is potentially not as good as the original LSH in the LSHKIT implementation, which only needs to look at 5.47% of all vectors with manually tuned parameters.
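The allocation-pattern normalization described above (bitwise AND of two 9-bit patterns, then counting the set bits) can be sketched as follows; the function names are our own illustration, not the original C++ implementation.

```python
def coactive_jets(pattern_a, pattern_b):
    """Number of jets active in both parquet graphs: popcount of the
    bitwise AND of the two 9-bit allocation patterns (values 0..511)."""
    return bin(pattern_a & pattern_b).count("1")

def normalized_similarity(dot_product, pattern_a, pattern_b):
    """Divide the raw dot product by the number of coincidentally
    active jets; 0.0 if the two graphs share no active jet."""
    n = coactive_jets(pattern_a, pattern_b)
    return dot_product / n if n else 0.0
```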
With automatic parameter detection, LSHKIT is slower than our method but gives a higher hit rate (at the cost of scanning more vectors than necessary for the desired hit rate). With online determination the scan amount is lowered to 7.52%, but the search time is worse than for offline determination, with an only slightly smaller recognition rate. Thus online determination seems to scan the codebook more efficiently, but it does not lead to a better approximation of the desired recognition rate and needs additional time for parameter adaptation. This may be due to the non-gamma-distributed data in our tests. The superiority of LSHKIT can most likely be ascribed to a better partitioning of the search space than in our method. In our search scheme this strongly depends on the selected hash vectors. By testing more than 1000 vectors as the first hash vector (maybe all codebook vectors) this can be improved, but the time for parameter determination increases considerably.

Additionally, we tried to increase search speed by using Intel's MKL (INTEL, 2011) (version 10.2.6.038) to speed up matrix multiplication. This library is optimized for Intel CPUs to do algebraic calculations as fast as possible. To our astonishment it turned out that using the library actually diminished the performance of our system. Even when using the special functions for matrix-vector and vector-vector products, the computation took more time on our test system than when using the boost uBLAS product functions (BOOST, 2011). MKL needs certain sizes of both matrices to be able to accelerate multiplication; if this condition is fulfilled, it makes a substantial difference. We ran an additional test with linear search on all parquet graphs of each single image in parallel on the second codebook. This gave a search time of 1009.02 sec (standard deviation 2.97 sec) for parallel linear search (4.7 msec per search). This result is clearly slower than our method when aiming at a hit rate of 90%.
But if one compares it to LSHKIT with offline parameter determination and a hit rate close to 100%, it is only slightly slower while being completely exact. Is it therefore recommendable to do parallel linear search if very high hit rates are needed? To test this we created a third codebook of 88951 parquet graphs, almost twice as big as the others. For this we got the results shown in table 3: parallel linear search (standard deviation 2.11 sec) was clearly slower than our method (standard deviation 5.49 sec) and LSHKIT (standard deviation 2.51 sec). This is a hint that even with optimized matrix-matrix multiplication it makes sense to use special search methods. It is possible that parallel linear search would have been faster if the search had been done on more search vectors at once, but in most applications one cannot expect to have all search vectors available from the beginning. A real-time object recognition system cannot know the search vectors of future images, and the matrix size up to which parallel matrix-matrix multiplication gives a speedup will most likely also depend on the cache size of the CPU.

6 CONCLUSION

We have presented the first search scheme that is able to do fast nearest neighbor search on a set of vectors of different dimensionality that are comparable with each other but do not share a common vector space.
The method is (at least in a serial search scheme) considerably faster than exhaustive search and is applicable as long as a similarity function can be derived or in (spaces consisting of) Euclidean (sub)spaces. The automatic parameter determination (for the hash vectors) takes some time but has to be run only once per database.

REFERENCES

Bentley, J. L. (1975). Multidimensional binary search trees used for associative searching. Commun. ACM, 18:509-517.

BOOST (2011). uBLAS basic linear algebra library.

Datar, M., Immorlica, N., Indyk, P., and Mirrokni, V. S. (2004). Locality-sensitive hashing scheme based on p-stable distributions. In Proc. SCG '04, pages 253-262. ACM.

Dong, W. (2011). LSHKIT: A C++ locality sensitive hashing library. http://lshkit.sourceforge.net/index.html.

Giacinto, G. (2007). A nearest-neighbor approach to relevance feedback in content based image retrieval. In Proc. CIVR '07, pages 456-463. ACM.

INTEL (2011). Intel math kernel library. http://software.intel.com/en-us/intel-mkl/.

Lades, M., Vorbrüggen, J. C., Buhmann, J., Lange, J., von der Malsburg, C., Würtz, R. P., and Konen, W. (1993). Distortion invariant object recognition in the dynamic link architecture. IEEE Trans. Comp., 42(3):300-311.

Lv, Q., Josephson, W., Wang, Z., Charikar, M., and Li, K. (2007). Multi-probe LSH: efficient indexing for high-dimensional similarity search. In Proc. VLDB '07, pages 950-961. VLDB Endowment.

Sankar K., P., Jawahar, C. V., and Manmatha, R. (2010). Nearest neighbor based collection OCR. In Proc. DAS '10, pages 207-214, New York, NY, USA. ACM.

Westphal, G. and Würtz, R. P. (2009). Combining feature- and correspondence-based methods for visual object recognition. Neural Computation, 21(7):1952-1989.

Würtz, R. P. (1997). Object recognition robust under translations, deformations and changes in background. IEEE Trans. PAMI, 19(7):769-775.
