13.07.2015 Views

Top-K Correlation Sub-graph Search in Graph Databases

Top-K Correlation Sub-graph Search in Graph Databases

Top-K Correlation Sub-graph Search in Graph Databases

SHOW MORE
SHOW LESS

You also want an ePaper? Increase the reach of your titles

YUMPU automatically turns print PDFs into web optimized ePapers that Google loves.

<strong>Top</strong>-K <strong>Correlation</strong> <strong>Sub</strong>-<strong>graph</strong> <strong>Search</strong> <strong>in</strong> <strong>Graph</strong><strong>Databases</strong>Lei Zou 1 , Lei Chen 2 , and Yansheng Lu 11 Huazhong University of Science and Technology, Wuhan, Ch<strong>in</strong>a,{zoulei,lys}@mail.hust.edu.cn2 Hong Kong of Science and Technology, Hong Kong, Ch<strong>in</strong>a,leichen@cse.ust.hkAbstract. Recently, due to its wide applications, (similar) sub<strong>graph</strong> search hasattracted a lot of attentions from database and data m<strong>in</strong><strong>in</strong>g community, such as[13, 18, 19, 5]. In [8], Ke et al. first proposed correlation sub-<strong>graph</strong> search problem(CG<strong>Search</strong> for short) to capture the underly<strong>in</strong>g dependency between sub<strong>graph</strong>s<strong>in</strong> a <strong>graph</strong> database, that is CGS algorithm. However, CGS algorithm requiresthe specification of a m<strong>in</strong>imum correlation threshold θ to perform computation.In practice, it may not be trivial for users to provide an appropriatethreshold θ, s<strong>in</strong>ce different <strong>graph</strong> databases typically have different characteristics.Therefore, we propose an alternative m<strong>in</strong><strong>in</strong>g task: top-K correlation sub<strong>graph</strong>search(TOP-CGSearh for short). The new problem itself does not requiresett<strong>in</strong>g a correlation threshold, which leads the previous proposed CGS algorithm<strong>in</strong>efficient if we apply it directly to TOP-CG<strong>Search</strong> problem. To conduct TOP-CG<strong>Search</strong> efficiently, we develop a pattern-growth algorithm (that is PG-searchalgorithm) and utilize <strong>graph</strong> <strong>in</strong>dex<strong>in</strong>g methods to speed up the m<strong>in</strong><strong>in</strong>g task. Extensiveexperiment results evaluate the efficiency of our methods.1 IntroductionAs a popular data structure, <strong>graph</strong>s have been used to model many complex data objectsand their relationships, such as, chemical compounds [14], entities <strong>in</strong> images [12] andsocial networks [3]. Recently, the problems related to <strong>graph</strong> database have attractedmuch attentions from database and data m<strong>in</strong><strong>in</strong>g community, such as frequent sub-<strong>graph</strong>m<strong>in</strong><strong>in</strong>g [7, 17], (similar) sub-<strong>graph</strong> search [13, 18, 19, 5]. On the other hand, correlationm<strong>in</strong><strong>in</strong>g [15] is always used to discovery the underly<strong>in</strong>g dependency between objects,such as market-basket databases [2, 10], multimedia databases [11] and so on. Ke et al.first propose correlation sub-<strong>graph</strong> search <strong>in</strong> <strong>graph</strong> databases [8]. Formally, correlationsub-<strong>graph</strong> search (CG<strong>Search</strong> for short) is def<strong>in</strong>ed as follows [8]:Given a query <strong>graph</strong> Q and a m<strong>in</strong>imum correlation threshold θ, we need to report allsub-<strong>graph</strong>s S i <strong>in</strong> <strong>graph</strong> database, where the Pearson’s correlation coefficient betweenQ and S i (that is φ(Q, S i ), see Def<strong>in</strong>ition 4) is no less than θ.Actually, the larger φ(Q, S i ) is, the more often sub-<strong>graph</strong>s Q and S i appear together<strong>in</strong> the same data <strong>graph</strong>s. Take molecule databases for example. If two sub-structures S 1and S 2 often appear together <strong>in</strong> the same molecules, S 1 and S 2 may share some similarchemical properties. In Fig. 1, we have 5 data <strong>graph</strong>s <strong>in</strong> database D and a query sub<strong>graph</strong>Q and the threshold θ = 1.0. In Fig. 1, for presentation convenience, we assumethat all edges have the same label ‘x’. The number beside the vertex is vertex ID. Theletter <strong>in</strong>side the vertex is vertex label. S<strong>in</strong>ce only φ(Q, S 2 ) = 1.0 ≥ θ, we report S 2 <strong>in</strong>Fig. 1. φ(Q, S i ) = 1.0 means S i (or Q) always appears <strong>in</strong> the same data <strong>graph</strong> g, if Q


2 Lei Zou et al.(or S i ) appears <strong>in</strong> data <strong>graph</strong> g. In CG<strong>Search</strong> problem, each sub-<strong>graph</strong> S i of data <strong>graph</strong>s<strong>in</strong> database D is a candidate, which is quite different from (similar) sub-<strong>graph</strong> search <strong>in</strong>previous work. In (similar) sub-<strong>graph</strong> search problem [13, 18], it reports all data <strong>graph</strong>sthat (closely) conta<strong>in</strong> query <strong>graph</strong> Q. Therefore, only data <strong>graph</strong>s are candidates. Thus,the search space of CG<strong>Search</strong> problem is much larger than that of (similar) sub-<strong>graph</strong>search, which leads CG<strong>Search</strong> to be a more challeng<strong>in</strong>g task.1ax 2 1b ax 2 1b ax 2 1cx 2 1a bx 2 xa c a bx x4x x x34 xxa34 x3c a b d3x xb a3caG 1 G 2 G 3 G 4 G 5Query QaaxbS 1xaS 2∅ ( QS , ) = 0.611∅ ( QS , ) = 1.02Fig. 1. A <strong>Graph</strong> DatabaseFurthermore, CG<strong>Search</strong> is also different from frequent sub-<strong>graph</strong> m<strong>in</strong><strong>in</strong>g. As weall know, all frequent sub-<strong>graph</strong> m<strong>in</strong><strong>in</strong>g algorithms are based on “down-closure property”(thatis Apriori property)[1] no matter they adopt bread-first search [9] or depthfirstsearch [17]. However, Apriori property does not hold for φ(Q, S i ). It means thatwe cannot guarantee that φ(Q, S i ) ≥ φ(Q, S j ), if S i is a sub-<strong>graph</strong> of S j . Therefore,for CG<strong>Search</strong> problem, we cannot employ the same prun<strong>in</strong>g strategies used <strong>in</strong> frequentsub-<strong>graph</strong> m<strong>in</strong><strong>in</strong>g algorithms.In order to address CG<strong>Search</strong>, Ke et al. propose CGS Algorithm [8], which adopts“filter-and-verification” framework to perform the search. First, accord<strong>in</strong>g to a givenm<strong>in</strong>imum correlation coefficient threshold θ and a query <strong>graph</strong> Q, they derive them<strong>in</strong>imum support lowerbound(sup(g)). Any sub-<strong>graph</strong> g i whose support is less thanlowerbound(sup(g)) cannot be a correlation sub-<strong>graph</strong> w.r.t 3 query Q. Then, <strong>in</strong> theprojected database (all data <strong>graph</strong>s conta<strong>in</strong><strong>in</strong>g query <strong>graph</strong> Q are collected to form theprojected database, denoted as D q ), frequent sub-<strong>graph</strong> m<strong>in</strong><strong>in</strong>g algorithms are used tof<strong>in</strong>d candidates g i by sett<strong>in</strong>g m<strong>in</strong>imum support lowerbound(sup(g))sup(Q)(that is filter process).After that, for each candidate g i , the correlation coefficient is computed to f<strong>in</strong>d all trueanswers (that is verification process).However, CG<strong>Search</strong> <strong>in</strong> [8] requires users or applications to specify a m<strong>in</strong>imum correlationthreshold θ. In practice, it may not be trivial for users to provide an appropriatethreshold θ, s<strong>in</strong>ce different <strong>graph</strong> databases typically have different characteristics.Therefore, <strong>in</strong> this paper, we propose an alternative m<strong>in</strong><strong>in</strong>g task:top-K correlation sub<strong>graph</strong>search, TOP-CG<strong>Search</strong>, which is def<strong>in</strong>ed:Given a query <strong>graph</strong> Q, we report top-K sub-<strong>graph</strong>s S i accord<strong>in</strong>g to the Pearson’scorrelation coefficient between Q and S i (that is φ(Q, S i )).Obviously, the above methods <strong>in</strong> CGSearh [8] cannot be directly applied to theTOP-CG<strong>Search</strong> problem, s<strong>in</strong>ce we do not have the threshold θ. Simply sett<strong>in</strong>g a largethreshold and then decreas<strong>in</strong>g it step by step based on the number of answers obta<strong>in</strong>edfrom the previous step is out of practice, s<strong>in</strong>ce we need to perform frequent sub-<strong>graph</strong>m<strong>in</strong><strong>in</strong>g algorithms to obta<strong>in</strong> candidates <strong>in</strong> each iterative step. Furthermore, we also needto perform verification of CG<strong>Search</strong> [8] <strong>in</strong> each iterative step.3 with regard to


<strong>Top</strong>-K <strong>Correlation</strong> <strong>Sub</strong>-<strong>graph</strong> <strong>Search</strong> <strong>in</strong> <strong>Graph</strong> <strong>Databases</strong> 3Can we derive top-K answers directly without perform<strong>in</strong>g frequent sub-<strong>graph</strong> m<strong>in</strong><strong>in</strong>galgorithms for candidates and expensive verification several times? Our answer<strong>in</strong> this paper is YES. For TOP-CG<strong>Search</strong> problem, we first derive a upper bound ofφ(Q, S i ) (that is upperbound(φ(Q, S i ))). S<strong>in</strong>ce upperbound(φ(Q, S i )) is a monotone<strong>in</strong>creas<strong>in</strong>g function w.r.t.sup(S i ), we develop a novel pattern-growth algorithm, calledPG-search algorithm to conduct effective filter<strong>in</strong>g through the upper bound. Dur<strong>in</strong>gm<strong>in</strong><strong>in</strong>g process, we always ma<strong>in</strong>ta<strong>in</strong> β to be the K-largest φ(Q, S i ) by now. The algorithmreports top-K answers until all un-checked sub-<strong>graph</strong>s cannot be <strong>in</strong> top-K answers.Furthermore, <strong>in</strong> pattern-growth process, we only grow a pattern P if and only ifwe have checked all its parents P i (see Lemma 2). Therefore, we significantly reducethe search space (the number of sub-<strong>graph</strong>s that need to be checked <strong>in</strong> PG-search algorithm). F<strong>in</strong>ally, we utilize exist<strong>in</strong>g <strong>graph</strong> <strong>in</strong>dex<strong>in</strong>g techniques [13, 18] to speed up them<strong>in</strong><strong>in</strong>g task. In summary, we made the follow<strong>in</strong>g contributions:1. We propose a new m<strong>in</strong><strong>in</strong>g task TOP-CG<strong>Search</strong> problem. For TOP-CG<strong>Search</strong>, wedevelop an efficient “pattern-growth” algorithm, called PG-search algorithm. InPG-search, we <strong>in</strong>troduce two novel concepts, Growth Element(GE for short) andCorrect Growth to determ<strong>in</strong>e the correct growth and avoid duplicate generation.2. To speed up the m<strong>in</strong><strong>in</strong>g task, we utilize <strong>graph</strong> <strong>in</strong>dex<strong>in</strong>g techniques to improve theperformance <strong>in</strong> PG-search algorithm.3. We conduct extensive experiments to verify the efficiency of the proposed methods.The rest of the paper is organized as follows: Section 2 discuss prelim<strong>in</strong>ary background.PG-search algorithm is proposed <strong>in</strong> Section 3. We evaluate our methods <strong>in</strong> experiments<strong>in</strong> Section 4. Section 5 discusses related work. Section 6 concludes this work.2 Prelim<strong>in</strong>ary2.1 Problem Def<strong>in</strong>itionDef<strong>in</strong>ition 1. <strong>Graph</strong> Isomorphism. Assume that we have two <strong>graph</strong>s G 1 < V 1 , E 1 ,L 1v , L 1e , F 1v , F 1e > and G 2 < V 2 , E 2 , L 2v , L 2e , F 2v , F 2e >. G 1 is <strong>graph</strong> isomorphismto G 2 , if and only if there exists at least one bijective function f : V 1 → V 2 such that:1) for any edge uv ∈ E 1 , there is an edge f(u)f(v) ∈ E 2 ; 2) F 1v (u)= F 2v (f(u)) andF 1v (v)= F 2v (f(v)); 3) F 1e (uv)= F 2e (f(u)f(v)).Def<strong>in</strong>ition 2. <strong>Sub</strong>-<strong>graph</strong> Isomorphism. Assume that we have two <strong>graph</strong>s G ′ and G, ifG ′ is <strong>graph</strong> isomorphism to at least one sub-<strong>graph</strong> of G under bijective function f, G ′is sub-<strong>graph</strong> isomorphism to G under <strong>in</strong>jective function f.If a <strong>graph</strong> S is sub-<strong>graph</strong> isomorphism to another <strong>graph</strong> G, we always say that <strong>graph</strong> Gconta<strong>in</strong>s S.Def<strong>in</strong>ition 3. <strong>Graph</strong> Support. Given a <strong>graph</strong> database D with N data <strong>graph</strong>s, and asub-<strong>graph</strong> S, there are M data <strong>graph</strong>s conta<strong>in</strong><strong>in</strong>g S. The support of S is denoted assup(S) = M N. Obviously, 0 ≤ sup(S) ≤ 1.Def<strong>in</strong>ition 4. Pearsons <strong>Correlation</strong> Coefficient.Given two sub-<strong>graph</strong>s S 1 and S 2 , Pearson’s <strong>Correlation</strong> Coefficient of S 1 and S 2 , denotedas φ(S 1 , S 2 ), is def<strong>in</strong>ed as:φ(S 1, S 2) =Here, −1 ≤ φ(S 1 , S 2 ) ≤ 1sup(S 1, S 2) − sup(S 1)sup(S 2)√sup(S1)sup(S 2)(1 − sup(S 1))(1 − sup(S 2))(1)


4 Lei Zou et al.Given two <strong>graph</strong>s S 1 and S 2 , if φ(S 1 , S 2 ) > 0, then S 1 and S 2 are positively correlated;otherwise, they are negatively correlated. In this paper, we focus on positively correlatedsub-<strong>graph</strong> search.Def<strong>in</strong>ition 5. (Problem Def<strong>in</strong>ition) <strong>Top</strong>-K <strong>Correlation</strong> <strong>Sub</strong>-<strong>graph</strong> <strong>Search</strong> (TOP-CG<strong>Search</strong>for short) Given a query <strong>graph</strong> Q and a <strong>graph</strong> database D, accord<strong>in</strong>g to Def<strong>in</strong>ition 4,we need to f<strong>in</strong>d K sub-<strong>graph</strong>s S i (i=1...K) <strong>in</strong> the <strong>graph</strong> database D, where φ(Q, S i )are the first K largest.Note that, <strong>in</strong> this work, we only focus on positively correlated sub-<strong>graph</strong>s, if thereexist less than K positively correlated sub-<strong>graph</strong>s, we report all positively correlatedsub-<strong>graph</strong>s. Now we demonstrate top-K correlation sub-<strong>graph</strong> search with a runn<strong>in</strong>gexample.EXAMPLE 1(Runn<strong>in</strong>g Example). Given the same <strong>graph</strong> database D with 5 data<strong>graph</strong>s and a query <strong>graph</strong> Q <strong>in</strong> Fig. 1, we want to report top-2 correlated sub-<strong>graph</strong>s<strong>in</strong> D accord<strong>in</strong>g to Def<strong>in</strong>ition 5. In this example, the top-2 answers are S 1 and S 2 ,where φ(Q, S 1 ) = 1.0 and φ(Q, S 2 ) = 0.61 respectively. For the other sub-<strong>graph</strong>sS i , φ(Q, S i ) < 0.61.2.2 Canonical label<strong>in</strong>gIn order to determ<strong>in</strong>e whether two <strong>graph</strong>s are isomorphism to each other, we can assigneach <strong>graph</strong> a unique code. Such a code is referred to as canonical label of a <strong>graph</strong>[4]. Furthermore, benefit<strong>in</strong>g from canonical labels, we can def<strong>in</strong>e a total order for aset of <strong>graph</strong>s <strong>in</strong> a unique and determ<strong>in</strong>istic way, regardless of the orig<strong>in</strong>al vertex andedge order<strong>in</strong>g. The property will be used <strong>in</strong> Def<strong>in</strong>ition 8 and 12. It is known that thehardness of determ<strong>in</strong><strong>in</strong>g the canonical label of a <strong>graph</strong> is equivalent to determ<strong>in</strong><strong>in</strong>gisomorphism between <strong>graph</strong>s. So far, these two problems are not known to be either <strong>in</strong>P or <strong>in</strong> NP-complete [4]. There are many canonical label<strong>in</strong>g proposals <strong>in</strong> the literature[17]. In fact, any canonical label<strong>in</strong>g method can be used <strong>in</strong> our algorithm, which isorthogonal to our method. A simple way of def<strong>in</strong><strong>in</strong>g the canonical label of a <strong>graph</strong> is toconcatenate the upper-triangular entries of <strong>graph</strong>’s adjacency matrix when this matrixhas been symmetrically permuted so that the obta<strong>in</strong>ed str<strong>in</strong>g is the lexico<strong>graph</strong>icallysmallest over the str<strong>in</strong>gs that can be obta<strong>in</strong>ed from all such permutations.Fig. 2(a) shows an example about obta<strong>in</strong><strong>in</strong>g the canonical label of a <strong>graph</strong>. In Fig.2(a), we permutate all vertex order<strong>in</strong>gs. For each vertex order<strong>in</strong>g, we can obta<strong>in</strong> a codeby concatenat<strong>in</strong>g the upper-triangular entries of the <strong>graph</strong>s adjacency matrix. Amongall codes, the code <strong>in</strong> Fig. 2(a)a is the lexico<strong>graph</strong>ically smallest (‘0’ is smaller thanall letters). Thus, <strong>in</strong> Fig. 2(a), “aab x0x” is the canonical code of the <strong>graph</strong>. Fig. 2(a)ashows the canonical form of the <strong>graph</strong>.Def<strong>in</strong>ition 6. Canonical Vertex Order<strong>in</strong>g and ID. Given a <strong>graph</strong> S, some vertex order<strong>in</strong>gis called Canonical Vertex Order<strong>in</strong>g, if and only if this order<strong>in</strong>g leads to thecanonical code of <strong>graph</strong> S. Accord<strong>in</strong>g to canonical vertex order<strong>in</strong>g, we def<strong>in</strong>e CanonicalVertex ID for each vertex <strong>in</strong> S.The vertex order<strong>in</strong>g <strong>in</strong> Fig. 2(a)a leads to canonical code, thus, the vertex order<strong>in</strong>g <strong>in</strong>Fig. 2(a)a is called canonical vertex order<strong>in</strong>g. Vertex ID <strong>in</strong> Fig. 2(a)a is called canonicalvertex ID.


<strong>Top</strong>-K <strong>Correlation</strong> <strong>Sub</strong>-<strong>graph</strong> <strong>Search</strong> <strong>in</strong> <strong>Graph</strong> <strong>Databases</strong> 51 x 2a bx0a2 x 1a bx0a0 2a x b1xa2 x 0a bx1a1ax 0bx2a0a x1bx2aaaaa 0a 1 b 2x 0xa 0a 0 b 1 a 2a 0 0 xa 2 b 1 x a 1b 2a 0 a 1 b 2x x0b 0 a 1 a 2 b 0 a 1 a 2b 0 0 x b 0 x 0a 2 a 2a 1 x a 1 xa 0b 1a 2a 0 b 1 a 2x x0axbbxcaab x0x aba 0xx aab xx0 baa 0xx baa x0x aba xx0Canonical Label(a) (b) (c) (d) (e) (f)axbC1xaP1axbxcC2P22.3 Pattern Growth(a) Canonical Label<strong>in</strong>g(b) Parents and ChildrenFig. 2. Canonical Labels and Pattern GrowthDef<strong>in</strong>ition 7. Parent and Child. Given two connected <strong>graph</strong>es P and C, P and C haven and n + 1 edges respectively. If P is a sub-<strong>graph</strong> of C, P is called a parent of C, andC is called a child of P .For a <strong>graph</strong>, it can have more than one child or parent. For example, <strong>in</strong> Fig. 2(b), C 2has two parents that are P 1 and P 2 . P 1 have two children that are C 1 and C 2 .Dur<strong>in</strong>g m<strong>in</strong><strong>in</strong>g process, we always “grow” a sub-<strong>graph</strong> from its parent, which iscalled pattern growth. Dur<strong>in</strong>g pattern growth, we <strong>in</strong>troduce an extra edge <strong>in</strong>to a parentto obta<strong>in</strong> a child. The <strong>in</strong>troduced extra edge is called Growth Element (GE for short).The formal def<strong>in</strong>ition about GE is given <strong>in</strong> Section 3.1.3 PG-search AlgorithmAs mentioned <strong>in</strong> the Section 1, we can transfer a TOP-CG<strong>Search</strong> problem <strong>in</strong>to a thresholdbasedCG<strong>Search</strong>. First, we set threshold θ to be a large value. Accord<strong>in</strong>g to CGS algorithm[8], if we can f<strong>in</strong>d the correct top-K correlation sub-<strong>graph</strong>s, then we report themand term<strong>in</strong>ate the algorithm. Otherwise, we decrease θ and repeat the above process untilwe obta<strong>in</strong> the top-K answers. We refer this method as “Decreas<strong>in</strong>g Threshold-BasedCGS” <strong>in</strong> the rest of this paper. Obviously, when K is large, we have to set θ to be asmall value to f<strong>in</strong>d the correct top-K answers, which is quite <strong>in</strong>efficient s<strong>in</strong>ce we needto perform CGS algorithm [8] several times and there are more candidates needed to be<strong>in</strong>vestigated. In order to avoid this shortcom<strong>in</strong>g, we propose a pattern-growth searchalgorithm (PG-search for short) <strong>in</strong> this section.3.1 <strong>Top</strong>-K <strong>Correlation</strong> <strong>Sub</strong>-<strong>graph</strong> QueryThe search space (the number of candidates) of TOP-CG<strong>Search</strong> problem is large, s<strong>in</strong>ceeach sub-<strong>graph</strong> S i <strong>in</strong> <strong>graph</strong> database is a candidate. Obviously, it is impossible to enumerateall sub-<strong>graph</strong>s <strong>in</strong> the database, s<strong>in</strong>ce it will result <strong>in</strong> a comb<strong>in</strong>ational explosionproblem. Therefore, an effective prun<strong>in</strong>g strategy is needed to reduce the search space.In our algorithm, we conduct effective filter<strong>in</strong>g through the upper bound. Specifically,if the largest upper bound for all un-checked sub-<strong>graph</strong>s cannot get <strong>in</strong> the top-K answers,the algorithm can stop. Generally speak<strong>in</strong>g, our algorithms work as follows:We first generate all size-1 sub-<strong>graph</strong> (the size of a <strong>graph</strong> is def<strong>in</strong>ed <strong>in</strong> terms of numberof edges, i.e. |E|) S i , and <strong>in</strong>sert them <strong>in</strong>to heap H <strong>in</strong> non-<strong>in</strong>creas<strong>in</strong>g order ofupperbound(φ(Q, S i )), which is the upper bound of φ(Q, S i ). The upper bound isa monotone <strong>in</strong>creas<strong>in</strong>g function w.r.t sup(S i ). The head of the heap H is h. We setα = upperbound(φ(Q, h)). We compute φ(Q, h) and <strong>in</strong>sert h <strong>in</strong>to result set RS. β is


6 Lei Zou et al.set to be the Kth largest value <strong>in</strong> RS by now. Then, we delete the heap head h, andf<strong>in</strong>d all children C i (see Def<strong>in</strong>ition 7) of h through pattern growth. We <strong>in</strong>sert these C i<strong>in</strong>to heap H. For the heap head h, we set α = upperbound(φ(Q, h)). We also computeφ(Q, h), and then <strong>in</strong>sert h <strong>in</strong>to result set RS. The above processes are iterated untilβ ≥ α. At last, we report top-K answers <strong>in</strong> RS.In order to implement the above algorithm, we have to solve the follow<strong>in</strong>g technicalchallenges: 1)how to def<strong>in</strong>e upperbound(φ(Q, S i )); 2)how to f<strong>in</strong>d all h’s children;3)S<strong>in</strong>ce a child may be generated from different parents (see Fig. 2(b)), how to avoidgenerat<strong>in</strong>g duplicate patterns; and 4)how to reduce the search space as much as possible.Lemma 1. Given a query <strong>graph</strong> Q and a sub-<strong>graph</strong> S i <strong>in</strong> the <strong>graph</strong> database, the upperbound of φ(Q, S i ) is denoted as:√√1 − sup(Q) sup(S i)φ(Q, S i) ≤ upperbound(φ(Q, S i)) =∗(2)sup(Q) 1 − sup(S i)Proof. Obviously, sup(Q, S i) ≤ sup(S i), thus:φ(Q, S i) =≤==sup(Q,S i )−sup(Q)sup(S i )√sup(Q)sup(Si )(1−sup(Q))(1−sup(S i ))sup(S i )−sup(Q)sup(S i )√sup(Q)sup(Si )(1−sup(Q))(1−sup(S i ))sup(S i )(1−sup(Q))√sup(Q)sup(Si )(1−sup(Q))(1−sup(S i ))√ √1−sup(Q)∗ sup(Si )sup(Q) 1−sup(S i )Theorem 1. Given a query <strong>graph</strong> Q and a sub-<strong>graph</strong> S i <strong>in</strong> a <strong>graph</strong> database D, the upper boundof φ(Q, S i) (that is upperbound(φ(Q, S i))) is a monotone <strong>in</strong>creas<strong>in</strong>g function w.r.t sup(S i).Proof. Given two sub-<strong>graph</strong>s S 1 and S 2, where sup(S 1) >sup(S 2), we can know that the follow<strong>in</strong>gformulation holds.√ √upperbound(φ(Q, S 1))upperbound(φ(Q, S = sup(S 1) 1 − sup(S 2)2)) sup(S 2) 1 − supp(S > 1 1)Therefore, Theorem 1 holds.Property 1. (Apriori Property) Given two sub-<strong>graph</strong> S 1 and S 2, if S 1 is a parent of S 2 (seeDef<strong>in</strong>ition 7), then Sup(S 2) < Sup(S 1).Lemma 2. Assume that β ∗ is the K-largest φ(Q, S i) <strong>in</strong> the f<strong>in</strong>al results. Given a sub-<strong>graph</strong> Swith |S| > 1, the necessary condition for upperbound(φ(Q, S)) > β ∗ is that, for all of itsparents P i, upperbound(φ(Q, P i)) > β ∗ .Proof. Accord<strong>in</strong>g to Property 1, we know that Sup(S) ≤ Sup(P i). S<strong>in</strong>ce upperbound(φ(Q, S))is a monotone <strong>in</strong>creas<strong>in</strong>g function w.r.t sup(S), it is clear that upperbound(φ(Q, P i)) ≥ upperbound(φ(Q, S)) > β ∗ . Therefore, Lemma 2 holds.Accord<strong>in</strong>g to Lemma 2, we only need to check a sub-<strong>graph</strong>, if and only if we have checked allits parents P i. It means that we can adopt “pattern-growth” framework <strong>in</strong> the m<strong>in</strong><strong>in</strong>g process, thatis to generate and check a size-(n+1) sub-<strong>graph</strong> S from its parent P i after we have verified all S’sparents. Furthermore, accord<strong>in</strong>g to monotone <strong>in</strong>creas<strong>in</strong>g property of upperbound(φ(Q, S)), wealways “first” grow a sub-<strong>graph</strong> P with the highest support. We refer the method as “best-first”. Inorder to comb<strong>in</strong>e “pattern-growth” and “best-first” <strong>in</strong> the m<strong>in</strong><strong>in</strong>g process, we def<strong>in</strong>e Total Order<strong>in</strong> the follow<strong>in</strong>g def<strong>in</strong>ition.


<strong>Top</strong>-K <strong>Correlation</strong> <strong>Sub</strong>-<strong>graph</strong> <strong>Search</strong> <strong>in</strong> <strong>Graph</strong> <strong>Databases</strong> 7Def<strong>in</strong>ition 8. (Total Order) Given two sub-<strong>graph</strong>s S 1 and S 2 <strong>in</strong> <strong>graph</strong> database D, where S 1 isnot isomorphism to S 2, we can say that S 1 < S 2 if either of the follow<strong>in</strong>g conditions holds:1) sup(S 1) > sup(S 2);2) sup(S 1) == sup(S 2) AND size(S 1) < size(S 2), where size(S 1) denotes the number ofedges <strong>in</strong> S 1.3) sup(S 1) == sup(S 2) AND size(S 1) == size(S 2) AND label(S 1) < label(S 2), wherelabel(S 1) and label(S 2) are the canonical label<strong>in</strong>g of sub-<strong>graph</strong>s S 1 and S 2 respectively.S1S2 S3 S4 S5aaQuery Qa a(a) Projected Database D Qxx x x xMax Heap: H a ba c b b b cSupp(Si) 0.8 0.6 0.6 0.2 0.2Occurrences G1 G1 G1 G2 G1 G2 G1 G3 G2 bG2 G2 1G5 2 1 2 1 2a b a bG3 G2 a b3aG4 G4 43 4 3G4 c a b aG1 G2 G4<strong>Sub</strong>-<strong>graph</strong>s Si Φ( QS , i )Result SetProjected Database DQ(RS) a b 0.61xS1(b) The First Step <strong>in</strong> M<strong>in</strong><strong>in</strong>g ProcessFig. 3. Projected Database and The First StepFurthermore, as mentioned <strong>in</strong> Section 2, we only focus on search<strong>in</strong>g positively correlated<strong>graph</strong>s <strong>in</strong> this paper. Thus, accord<strong>in</strong>g to Def<strong>in</strong>ition 4, if φ(Q, S i) > 0, then sup(Q, S i) > 0,which <strong>in</strong>dicates that Q and S i must appear together <strong>in</strong> at least one <strong>graph</strong> <strong>in</strong> D. Therefore, we onlyneed to conduct the correlation search over all data <strong>graph</strong>s that conta<strong>in</strong> query <strong>graph</strong> Q, which iscalled projected <strong>graph</strong> database and denoted as D Q. A projected database of the runn<strong>in</strong>g exampleis given <strong>in</strong> Fig. 3(a). Similar with [8], we focus the m<strong>in</strong><strong>in</strong>g task on the projected database D Q. Itmeans that we always generate candidate sub-<strong>graph</strong>s from the projected <strong>graph</strong> database throughpattern growth. S<strong>in</strong>ce |D Q| < |D|, it is efficient to perform m<strong>in</strong><strong>in</strong>g task on D Q.At the beg<strong>in</strong>n<strong>in</strong>g of PG-<strong>Search</strong> algorithm, we enumerate all size-1 sub-<strong>graph</strong> S i <strong>in</strong> the projected<strong>graph</strong> database and <strong>in</strong>sert them <strong>in</strong>to the heap H <strong>in</strong> <strong>in</strong>creas<strong>in</strong>g order of total order. Wealso record their occurrences <strong>in</strong> each data <strong>graph</strong>. Fig. 3(b) shows all size-1 sub-<strong>graph</strong> S i <strong>in</strong> theprojected <strong>graph</strong> database <strong>in</strong> the runn<strong>in</strong>g example. “G 1 < 1, 2 >” <strong>in</strong> Fig. 3(b) shows that there isone occurrence of S 1 <strong>in</strong> data <strong>graph</strong> G 1 and the correspond<strong>in</strong>g vertex IDs are 1 and 2 <strong>in</strong> <strong>graph</strong> G 1(see Fig. 1). In Fig. 3(b), the head of the heap H is sub-<strong>graph</strong> S 1. Therefore, we first grow S 1,s<strong>in</strong>ce it has the highest support. For each occurrence of S 1 <strong>in</strong> the projected database D Q, we f<strong>in</strong>dall Growth Elements (GE for short) around the occurrence.Def<strong>in</strong>ition 9. Growth Element. Given a connected <strong>graph</strong> C hav<strong>in</strong>g n edges and another connected<strong>graph</strong> P hav<strong>in</strong>g n − 1 edges, P is a sub-<strong>graph</strong> of C (namely, C is a child of P ). For someoccurrence O i of P <strong>in</strong> C, we use (C \ O i) to represent the edge <strong>in</strong> C that is not <strong>in</strong> O i. (C \ O i)is called Growth Element (GE for short) w.r.t O i. Usually, GE is represented as a five-tuple(< (vid1, vlable1), (elabel), (vid2, vlabel2) >), where vid1 and vid2 are canonical vertexIDs (def<strong>in</strong>ed <strong>in</strong> Def<strong>in</strong>ition 6) of P (vid1 < vid2). vlable1 and vlable2 are correspond<strong>in</strong>g vertexlabels, and elabel is the edge label of GE. Notice that vid2 can also be “*”, and “*” means theadditional vertex for P .To facilitate understand<strong>in</strong>g GE, we illustrate it with an example. Given a sub-<strong>graph</strong> S 1 <strong>in</strong>Fig. 4(a), there is one occurrence of S 1 <strong>in</strong> data <strong>graph</strong> G 1, which is denoted as the shaded area<strong>in</strong> Fig. 4(a). For the occurrence, there are three GEs that are also shown <strong>in</strong> Fig. 4(a). Grow<strong>in</strong>gS 1 (parent) through each GE, we will obta<strong>in</strong> another size-2 sub-<strong>graph</strong> (child) <strong>in</strong> Fig. 4(b). The


8 Lei Zou et al.GE3 1 2axb1 x 2a bx x3 4a cS1G1GE2 GE1 1 2xa bS11 x 2a bS11 2xa bS1+++GE1 GE2 GE3 1 x 3a bx2cS61x 2a b3xcS72 x 3a b1xaS8(a) How to F<strong>in</strong>d GE(b) Grow <strong>Sub</strong>-<strong>graph</strong> by GEFig. 4. F<strong>in</strong>d<strong>in</strong>g GE and Grow <strong>Sub</strong>-<strong>graph</strong> by GEnumbers beside vertexes <strong>in</strong> each sub-<strong>graph</strong> S i are canonical vertex IDs <strong>in</strong> S i. The numbers besidevertexes <strong>in</strong> data <strong>graph</strong> G are just vertex IDs, which are assigned arbitrarily. Notice that, canonicalvertex IDs <strong>in</strong> a parent may not be preserved <strong>in</strong> its child. For example, <strong>in</strong> Fig. 4(b), the canonicalvertex IDs <strong>in</strong> S 1 are different from that <strong>in</strong> S 6. Given a size-(n+1) sub-<strong>graph</strong> S i (child), it can beobta<strong>in</strong>ed by grow<strong>in</strong>g from different size-n sub-<strong>graph</strong>s (parents). For example, <strong>in</strong> Fig. 5(a)a, wecan obta<strong>in</strong> sub-<strong>graph</strong> S 6 by grow<strong>in</strong>g S 1 or S 5. Obviously, duplicate patterns will decrease theperformance of m<strong>in</strong><strong>in</strong>g algorithm. Therefore, we propose the follow<strong>in</strong>g def<strong>in</strong>ition about correctgrowth to avoid the possible duplicate generation.Def<strong>in</strong>ition 10. Correct Growth. Given a size-(n+1) sub-<strong>graph</strong> C(child), it can be obta<strong>in</strong>ed bygrow<strong>in</strong>g some size-n sub-<strong>graph</strong>s P i (parent), i=1...m. Among all growthes, the growth from P jto C is called correct growth, where P j is the largest one among all parents accord<strong>in</strong>g to TotalOrder <strong>in</strong> Def<strong>in</strong>ition 8. Otherwise, the growth is <strong>in</strong>correct one.For example, <strong>in</strong> Fig. 5(a), S 9 has two parents, which are S 1 and S 5. S<strong>in</strong>ce S 5 is larger than S 1(see Def<strong>in</strong>ition 8), the growth from S 5 to S 6 is a correct growth, and the growth from S 1 to S 6 isan <strong>in</strong>correct growth. Obviously, accord<strong>in</strong>g to Def<strong>in</strong>ition 8, we can determ<strong>in</strong>e whether the growthfrom P to C is correct or <strong>in</strong>correct. We will propose another efficient method to determ<strong>in</strong>e thecorrect growth <strong>in</strong> Section 3.2.Now, we illustrate the m<strong>in</strong><strong>in</strong>g process with the runn<strong>in</strong>g example. In the runn<strong>in</strong>g example,we always ma<strong>in</strong>ta<strong>in</strong> β to be the K largest φ(Q, S i) by now (K = 2 <strong>in</strong> the runn<strong>in</strong>g example).Initially, we set β = −∞. First, we conduct sub-<strong>graph</strong> query to obta<strong>in</strong> projected <strong>graph</strong> databaseD Q, which is shown <strong>in</strong> Fig. 3(a). Then, we f<strong>in</strong>d all size-1 sub-<strong>graph</strong>s (hav<strong>in</strong>g one edge) <strong>in</strong> theprojected database D Q. In Fig. 3(b), we <strong>in</strong>sert them <strong>in</strong>to heap H <strong>in</strong> <strong>in</strong>creas<strong>in</strong>g order accord<strong>in</strong>g toTotal Order <strong>in</strong> Def<strong>in</strong>ition 8. We also record all their occurrences <strong>in</strong> the <strong>graph</strong> database. S<strong>in</strong>ce S 1is the head of heap H, we compute α = upperbound(φ(Q, S 1)) = 1.63 and φ(Q, S 1) = 0.61.We <strong>in</strong>set S 1 <strong>in</strong>to result set RS. Now, β = −∞ < α, the algorithm cont<strong>in</strong>ues.We f<strong>in</strong>d all growth elements (GEs for short) around S 1 <strong>in</strong> D Q, which are shown <strong>in</strong> Fig. 5(b).S<strong>in</strong>ce G 3 is not <strong>in</strong> the projected database, we do not consider the occurrences <strong>in</strong> G 3 when wef<strong>in</strong>d GEs. For each GE, we check whether it leads to a correct growth. In Fig. 5(b), only the lastone is a correct growth. We <strong>in</strong>sert the last one (that is S 10) <strong>in</strong>to the max-heap H, as shown <strong>in</strong> Fig.6. Now, the sub-<strong>graph</strong> S 2 is the heap head. We re-compute α = upperbound(φ(Q, S 2)) = 1.0,φ(Q, S 2) = 1.0, and <strong>in</strong>sert S 2 <strong>in</strong>to RS. We update β to be <strong>Top</strong>-2 answer, that is β = 0.61. S<strong>in</strong>ceβ < α, the algorithm still cont<strong>in</strong>ues. The above processes are iterated until β ≥ α. At last, wereport top-2 answers <strong>in</strong> RS.3.2 PG-search AlgorithmAs discussed <strong>in</strong> Section 3.1, the heap H is ranked accord<strong>in</strong>g to Total Order <strong>in</strong> Def<strong>in</strong>ition 8 (seeFig. 3(b) and 6). In Def<strong>in</strong>ition 8, we need sup(S i). Therefore, we ma<strong>in</strong>ta<strong>in</strong> all occurrences of S i


<strong>Top</strong>-K <strong>Correlation</strong> <strong>Sub</strong>-<strong>graph</strong> <strong>Search</strong> <strong>in</strong> <strong>Graph</strong> <strong>Databases</strong> 9S11 2xa bOccurrences GEG1 1 2axb1 2a x bCandidateGEPatterns+ axbxc+ x xc a bS6S7axbbxcS1S5Correct GrowthaxbxcaS1xbCorrect GrowthbxaxbG2 G2 G3 1 2a x b1 2xa b1 2xa bx x+ a a bx x+ a b bx x+ b a bS8S9S10G4 S6S11(a)(b)(A)(B)(a) Correct Growth(b) GEs for S 1Fig. 5. Correct Growth and S 1’s Growth ElementsS2 S3 S4 S5 S10x x x x x xa a a c b b b cMax Heap: H b a bSupp(Si) 0.6 0.6 0.2 0.2S2Occurrences G1 G1 G2 G1 G1 G2 G3 G5 G2 G2 G4 G4 0.2G2 G2 Result Set(RS)aa<strong>Sub</strong>-<strong>graph</strong>s Si Φ( QS , i )xS2xS1ab1.000.61Fig. 6. The Second Step <strong>in</strong> M<strong>in</strong><strong>in</strong>g Process<strong>in</strong> heap H. Accord<strong>in</strong>g to the occurrence list, it is easy to obta<strong>in</strong> sup(S i). However, it is expensiveto ma<strong>in</strong>ta<strong>in</strong> all occurrences <strong>in</strong> the whole database D, especially when |D| is large. On the otherhand, we need to f<strong>in</strong>d GE around each occurrence <strong>in</strong> the projected database D Q (see Fig. 5(b)).Therefore, we need to ma<strong>in</strong>ta<strong>in</strong> occurrences <strong>in</strong> the projected database. Furthermore, the projecteddatabase D Q is always smaller than the whole database D. In order to improve the performance,<strong>in</strong> our PG-search algorithm, we only ma<strong>in</strong>ta<strong>in</strong> the occurrence list <strong>in</strong> the projected database <strong>in</strong>steadof the whole database.However, Def<strong>in</strong>ition 8 needs sup(S i). Therefore, we propose to use <strong>in</strong>dexsup(S i) to def<strong>in</strong>eTotal Order <strong>in</strong>stead of sup(S i), where <strong>in</strong>dexsup(S i) is derived from sub-<strong>graph</strong> query technique.Recently, a lot of <strong>graph</strong> <strong>in</strong>dex<strong>in</strong>g structures have been proposed <strong>in</strong> database literature [13,18] for sub-<strong>graph</strong> query. <strong>Sub</strong>-<strong>graph</strong> query is def<strong>in</strong>ed as: given a sub-<strong>graph</strong> Q, we report all data<strong>graph</strong>s conta<strong>in</strong><strong>in</strong>g Q as a sub-<strong>graph</strong> from the <strong>graph</strong> database. Because sub-<strong>graph</strong> isomorphismis a NP-complete problem [4], we always employ a filter-and-verification framework to speedup the search process. In filter<strong>in</strong>g process, we identify the candidate answers by <strong>graph</strong> <strong>in</strong>dex<strong>in</strong>gstructures. Then, <strong>in</strong> verification process, we check each candidate by sub-<strong>graph</strong> isomorphism.Generally speak<strong>in</strong>g, the filter<strong>in</strong>g process is much faster than verification process.Def<strong>in</strong>ition 11. Index Support. For a sub-<strong>graph</strong> S i, <strong>in</strong>dex support is size of candidates <strong>in</strong> filter<strong>in</strong>gprocess of sub-<strong>graph</strong> S i query, which is denoted as <strong>in</strong>dexsup(S i).Actually, sup(S i) is the size of answers for sub-<strong>graph</strong> S i query, which is obta<strong>in</strong>ed <strong>in</strong> verificationprocess. Accord<strong>in</strong>g to <strong>graph</strong> <strong>in</strong>dex<strong>in</strong>g techniques [13, 18], we can identify the candidates withoutfalse negatives. Therefore, Lemma 3 holds.Lemma 3. For any sub-<strong>graph</strong> S i <strong>in</strong> the <strong>graph</strong> database, sup(S i) ≤ <strong>in</strong>dexsup(S i), where<strong>in</strong>dexsup(S i) is the size of candidates for sub-<strong>graph</strong> S i query.Furthermore, <strong>in</strong>dexsup(S i) also satisfy Apriori Property. For example, <strong>in</strong> gIndex [18], if Q 1is a sub-<strong>graph</strong> of Q 2, the candidates for Q 2 query is a subset of candidates for Q 1. It means thefollow<strong>in</strong>g property holds.


10 Lei Zou et al.Property 2. (Apriori Property) Given two sub-<strong>graph</strong> S 1 and S 2, if S 1 is a parent of S 2 (seeDef<strong>in</strong>ition 7), then <strong>in</strong>dexsup(S 2) < <strong>in</strong>dexsup(S 1).Lemma 4. Given a query <strong>graph</strong> Q and a sub-<strong>graph</strong> S i <strong>in</strong> the <strong>graph</strong> database, the upper boundof φ(Q, S i) is denoted as:φ(Q, S i) ≤ upperbound2(φ(Q, S i))=√1−sup(Q)sup(Q)∗√<strong>in</strong>dexsup(Si )1−<strong>in</strong>dexsup(S i )(3)Proof. Accord<strong>in</strong>g to Lemma 1 and 3:φ(Q, S i) ≤≤√1−sup p(Q)sup(Q)√1−sup(Q)sup(Q)∗√∗ sup p(Si )√<strong>in</strong>dex sup p(Si )1−<strong>in</strong>dex sup p(S i )1−sup p(S i )Theorem 2. Given a query <strong>graph</strong> Q and a sub-<strong>graph</strong> S i <strong>in</strong> a <strong>graph</strong> database D, the upperbound of φ(Q, S i) (that is upperbound2(φ(Q, S i))) is a monotone <strong>in</strong>creas<strong>in</strong>g function w.r.t<strong>in</strong>dexsup(S i).Proof. It is omitted due to space limited.In order to use <strong>graph</strong> <strong>in</strong>dex <strong>in</strong> the PG-<strong>Search</strong> algorithm, we need to re-def<strong>in</strong>e Total Order andCorrect Growth accord<strong>in</strong>g to Index Support as follows.Def<strong>in</strong>ition 12. (Total Order) Given two sub-<strong>graph</strong>s S 1 and S 2 <strong>in</strong> <strong>graph</strong> database D, where S 1is not isomorphism to S 2, we can say that S 1 < S 2 if either of the follow<strong>in</strong>g conditions holds:1) <strong>in</strong>dexsup(S 1) > <strong>in</strong>dexsup(S 2);2) <strong>in</strong>dexsup(S 1) == <strong>in</strong>dexsup(S 2) AND size(S 1) < size(S 2), where size(S 1) denotes thenumber of edges <strong>in</strong> S 1.3) <strong>in</strong>dexsup(S 1) == <strong>in</strong>dexsup(S 2) AND size(S 1) == size(S 2) AND label(s 1) < label(s 2),where label(S 1) and label(S 2) are the canonical label<strong>in</strong>g of sub-<strong>graph</strong>s S 1 and S 2 respectively.Lemma 5. Accord<strong>in</strong>g to Def<strong>in</strong>ition 12, if sub-<strong>graph</strong>s S 1 < S 2, then upperbound2(φ(Q, S 1)) ≥upperbound2(φ(Q, S 2)).Def<strong>in</strong>ition 13. Correct Growth. Given a size-(n+1) sub-<strong>graph</strong> C(child), it can be obta<strong>in</strong>ed bygrow<strong>in</strong>g some size-n sub-<strong>graph</strong>s P i (parent), i=1...m. Among all growthes, the growth from P jto C is called the correct growth, where P j is the largest parent among all parents accord<strong>in</strong>g toTotal Order <strong>in</strong> Def<strong>in</strong>ition 12. Otherwise, the growth is <strong>in</strong>correct one.We show the pseudo codes of PG-search algorithm <strong>in</strong> Algorithm 1. In PG-search algorithm,first, given a query Q, we obta<strong>in</strong> projected <strong>graph</strong> database D Q by perform<strong>in</strong>g sub-<strong>graph</strong> Q query(L<strong>in</strong>e 1). Then, we f<strong>in</strong>d all size-1 sub-<strong>graph</strong>s S i(hav<strong>in</strong>g one edge) <strong>in</strong> D Q, and <strong>in</strong>sert them <strong>in</strong>tomax-heap H <strong>in</strong> <strong>in</strong>creas<strong>in</strong>g order accord<strong>in</strong>g to Def<strong>in</strong>ition 12. We also need to ma<strong>in</strong>ta<strong>in</strong> the occurrencesof sub-<strong>graph</strong>s S i <strong>in</strong> the projected database D Q(L<strong>in</strong>e 2). The threshold β is recordedas the top-K φ(Q, S i) <strong>in</strong> result set RS by now. Initially, we set β = −∞ (L<strong>in</strong>e 3). We setα = upperbound2(Q, h), where h is the heap head <strong>in</strong> H (L<strong>in</strong>e 4). For heap head h <strong>in</strong> max-heapH, we perform sub-<strong>graph</strong> search to obta<strong>in</strong> sup(h), and compute φ(h, Q). We also <strong>in</strong>sert h <strong>in</strong>toresult set RS (L<strong>in</strong>e 5). Then, we update β to be the <strong>Top</strong>-K answer <strong>in</strong> RS (L<strong>in</strong>e 6). If β ≥ α, thealgorithm reports top-K results (L<strong>in</strong>e 7). Otherwise, we pop the heap head h (L<strong>in</strong>e 8). We f<strong>in</strong>d allGEs with regard with h <strong>in</strong> the projected <strong>graph</strong> database D Q (L<strong>in</strong>e 9). For each GE g, if grow<strong>in</strong>gh through g is not a correct growth, it is ignored (L<strong>in</strong>e 15). If the growth is correct, we obta<strong>in</strong> anew sub-<strong>graph</strong> C. We <strong>in</strong>sert it <strong>in</strong>to heap H accord<strong>in</strong>g to Total Order <strong>in</strong> Def<strong>in</strong>ition 12 (L<strong>in</strong>e 13).Then, we set α = upperbound2(φ(Q, h)), where h is the new head <strong>in</strong> H (L<strong>in</strong>e 16). For the newheap head h, we perform sub-<strong>graph</strong> search to obta<strong>in</strong> sup(h) and compute φ(h, Q). We <strong>in</strong>sert h


<strong>Top</strong>-K <strong>Correlation</strong> <strong>Sub</strong>-<strong>graph</strong> <strong>Search</strong> <strong>in</strong> <strong>Graph</strong> <strong>Databases</strong> 11Algorithm 1 PG-search algorithmRequire: Input: a <strong>graph</strong> database D and a query <strong>graph</strong> Q and KOutput: the top-K correlated sub-<strong>graph</strong>.1: We conduct sub-<strong>graph</strong> Q search to obta<strong>in</strong> projected <strong>graph</strong> database D Q.2: F<strong>in</strong>d all size-1 sub-<strong>graph</strong> S i (hav<strong>in</strong>g one edge), and <strong>in</strong>sert them <strong>in</strong> max-heap H <strong>in</strong> order ofTotal Order <strong>in</strong> Def<strong>in</strong>ition 8. We also ma<strong>in</strong> their occurrences <strong>in</strong> projected <strong>graph</strong> database D Q.3: Set threshold β = −∞.4: For heap head h <strong>in</strong> max-heap H, we perform sub-<strong>graph</strong> query to obta<strong>in</strong> sup(h). Then, weset α = upperbound2(φ(h, Q)) where upperbound2(φ(h, Q)) is computed accord<strong>in</strong>g toEquation 3.5: For heap head h <strong>in</strong> max-heap H, we compute φ(h, Q) and <strong>in</strong>sert it <strong>in</strong>to answer set RS <strong>in</strong>non-<strong>in</strong>creas<strong>in</strong>g order of φ(h, Q).6: Update β to be the <strong>Top</strong>-K answer <strong>in</strong> RS.7: while (β < α) do8: Pop the current head h <strong>in</strong>to the table T .9: F<strong>in</strong>d all GEs around the occurrences of h <strong>in</strong> the projected <strong>graph</strong> D Q.10: for each GE g do11: Assume that we can get a sub-<strong>graph</strong> C through grow<strong>in</strong>g h by growth element g.12: if The growth from h to C is correct growth then13: Insert P <strong>in</strong>to max-heap H accord<strong>in</strong>g to Total Order <strong>in</strong> Def<strong>in</strong>ition 12. We also ma<strong>in</strong>ta<strong>in</strong>the occurrence list for C <strong>in</strong> the projected <strong>graph</strong> database D Q.14: else15: cont<strong>in</strong>ue16: Set α = upperbound2(φ(Q, h)), where h is the new head <strong>in</strong> H.17: For the new heap head h <strong>in</strong> max-heap H, we perform sub-<strong>graph</strong> query to obta<strong>in</strong> sup(h),and compute φ(h, Q). Then <strong>in</strong>sert it <strong>in</strong>to answer set RS <strong>in</strong> non-<strong>in</strong>creas<strong>in</strong>g order ofφ(h, Q).18: Update β to be the <strong>Top</strong>-K answer <strong>in</strong> RS.19: Report top-K results <strong>in</strong> RS<strong>in</strong>to result set RS (L<strong>in</strong>e 17). Update β to be the <strong>Top</strong>-K answer <strong>in</strong> RS (L<strong>in</strong>e 18). We iterate L<strong>in</strong>es7-21 until β ≥ α (L<strong>in</strong>e 7). As last, we report top-K answers (L<strong>in</strong>e 19).Notice that, <strong>in</strong> L<strong>in</strong>e 4 and 16 of PG-search algorithm, we utilize <strong>graph</strong> <strong>in</strong>dex<strong>in</strong>g techniqueto support sub-<strong>graph</strong> query to obta<strong>in</strong> sup(h). In L<strong>in</strong>e 13, we also utilize <strong>in</strong>dexes to obta<strong>in</strong><strong>in</strong>dexsup(h), which is obta<strong>in</strong>ed <strong>in</strong> filter<strong>in</strong>g process.In order to determ<strong>in</strong>e the growth from h to C is correct or not (L<strong>in</strong>e 12), we need to generateall parents P i of h, and then perform sub-<strong>graph</strong> P i query (only filter<strong>in</strong>g process) to obta<strong>in</strong><strong>in</strong>dexsup(P i). At last, we determ<strong>in</strong>e whether h is the largest one among all P i accord<strong>in</strong>g toDef<strong>in</strong>ition 12. Obviously, the above method leads to numerous sub-<strong>graph</strong> query operations. Wepropose another efficient method for the determ<strong>in</strong>ation as follows:As we know, at each m<strong>in</strong><strong>in</strong>g step, we always pop the heap head h i from the heap H (L<strong>in</strong>e 8).We can ma<strong>in</strong>ta<strong>in</strong> all h i <strong>in</strong>to a table T , which is ma<strong>in</strong>ta<strong>in</strong>ed <strong>in</strong> memory. For a growth from size-nsub-<strong>graph</strong> P to size-(n+1) sub-<strong>graph</strong>s C, we generate all parents P i of C. If all parents P i exceptfor P have existed <strong>in</strong> the table T , the parent P must be the largest one among all parents P i (seeLemma 8 and proof). Therefore, the growth from P to C is correct one. Otherwise, the growth is<strong>in</strong>correct. In the method, we only perform <strong>graph</strong> isomorphism test <strong>in</strong> table T . Furthermore, <strong>graph</strong>canonical label<strong>in</strong>g will facilitate the <strong>graph</strong> isomorphism. For example, we can record canonical


12 Lei Zou et al.labels of all sub-<strong>graph</strong>s <strong>in</strong> table T . The hash table built on these canonical labels will speed upthe determ<strong>in</strong>ation.We first discuss Lemma 6 and Lemma 7, which are used to prove the correctness of Lemma8 and Theorem 3.Lemma 6. Assume that h 1 and h 2 are heads of heap H (L<strong>in</strong>e 8) <strong>in</strong> PG-<strong>Search</strong> algorithm <strong>in</strong> i−thand (i + 1)−th iteration step (L<strong>in</strong>es 8-18). Accord<strong>in</strong>g to Total Order <strong>in</strong> Def<strong>in</strong>ition 12, we knowthat h 1 < h 2.Proof. 1)If h 2 exists <strong>in</strong> heap H <strong>in</strong> i−th iteration step: S<strong>in</strong>ce the heap H is <strong>in</strong> <strong>in</strong>creas<strong>in</strong>g orderaccord<strong>in</strong>g to Def<strong>in</strong>ition 12 and h 1 is the heap head <strong>in</strong> i−th iteration step, it is straightforward toknow h 1 < h 2.2)If h 2 does not exist <strong>in</strong> heap H <strong>in</strong> i−th iteration step: Accord<strong>in</strong>g to PG-search algorithm, h 2must be generated by grow<strong>in</strong>g the heap head h 1 <strong>in</strong> i−step. It means that h 1 is a parent of h 2.Accord<strong>in</strong>g to Property 2, <strong>in</strong>dexsup(h 1) ≥ <strong>in</strong>dexsup(h 2). Furthermore, size(h 1) < size(h 2).Therefore, accord<strong>in</strong>g to Def<strong>in</strong>ition 12, we know that h 1 < h 2.Total Order <strong>in</strong> Def<strong>in</strong>ition 12 is “transitive ”, thus, Lemma 7 holds.Lemma 7. Assume that h 1 and h 2 are heads of heap H <strong>in</strong> PG-<strong>Search</strong> Algorithm <strong>in</strong> i−th andj−th iteration step (L<strong>in</strong>es 8-18), where i < j. Accord<strong>in</strong>g to Def<strong>in</strong>ition 12, we know that h 1 < h 2.As discussed before, <strong>in</strong> PG-search algorithm, we ma<strong>in</strong>ta<strong>in</strong> the heap head h of each step <strong>in</strong>tothe table T . The follow<strong>in</strong>g lemma holds.Lemma 8. For a growth from size-n sub-<strong>graph</strong> P to size-(n + 1) sub-<strong>graph</strong>s C, we generate allparents P i of C. If all parents P i except for P have existed <strong>in</strong> the table T , the parent P must bethe largest one among all parents P i.Proof. (Sketch) S<strong>in</strong>ce all parents P i except for P have existed <strong>in</strong> the table T , it means all P i existbefore P <strong>in</strong> heap H. Accord<strong>in</strong>g to Lemma 7, we know P i < P . Therefore, P is the largest oneamong all parents P i.Theorem 3. Algorithm Correctness. PG-search algorithm can f<strong>in</strong>d the correct top-K correlatedsub-<strong>graph</strong>s.Proof. (Sketch) Assume that β ≥ α at nth iteration step of PG-search algorithm, where α =upperbound2(φ(Q, h)) and h is the heap head at nth step. Accord<strong>in</strong>g to Lemma 7, h ′ < h,where h ′ is the heap head at jth step, where j > n. Accord<strong>in</strong>g to Lemma 5, we know thatupperbound2(φ(Q, h ′ )) < upperbound2(φ(Q, h)) = α. S<strong>in</strong>ce β ≥ α, α > upperbound2(φ(Q, h ′ )). It means that, <strong>in</strong> latter iteration steps, we cannot f<strong>in</strong>d top-K answers. Therefore, PGsearchalgorithm can f<strong>in</strong>d all correct top-K answers, if the algorithm term<strong>in</strong>ates when β ≥ α(L<strong>in</strong>e 7). The correctness of PG-search algorithm is proved.4 ExperimentsIn this section, we evaluate our methods <strong>in</strong> both real dataset and synthetic dataset. As far as weknow, there is no exist<strong>in</strong>g work on top-K correlation sub-<strong>graph</strong> search problem. In [8], Ke etal. proposed threshold-based correlation sub-<strong>graph</strong> search. As discussed <strong>in</strong> Section 1, we cantransfer top-K correlation sub-<strong>graph</strong> search <strong>in</strong>to threshold-based one with decreas<strong>in</strong>g threshold.In the follow<strong>in</strong>g section, we refer the method as “Decreas<strong>in</strong>g Threshold-Based CGS”(DT-CGSfor short). We implement “DT-CGS” by ourselves and optimize it accord<strong>in</strong>g to [8]. All experimentscoded by ourself are implemented by standard C++ and conducted on a P4 1.7G mach<strong>in</strong>eof 1024M RAM runn<strong>in</strong>g W<strong>in</strong>dows XP.1) Datasets. We use both real dataset (i.e. AIDS dataset) and synthetic daset for performanceevaluation.AIDS Dataset. This dataset is available publicly on the website of the Developmental TherapeuticsProgram. We generate 10,000 connected and labeled <strong>graph</strong>s from the molecule structures andomit Hydrogen atoms. The <strong>graph</strong>s have an average number of 24.80 vertices and 26.80 edges,


<strong>Top</strong>-K <strong>Correlation</strong> <strong>Sub</strong>-<strong>graph</strong> <strong>Search</strong> <strong>in</strong> <strong>Graph</strong> <strong>Databases</strong> 13and a maximum number of 214 vertices and 217 edges. A major portion of the vertices are C,O and N. The total number of dist<strong>in</strong>ct vertex labels is 62, and the total number of dist<strong>in</strong>ct edgelabels is 3. We refer to this dataset as AIDS dataset. We randomly generate four query sets, F 1,F 2, F 3 and F 4, each of which conta<strong>in</strong>s 100 queries. The support ranges for the queries <strong>in</strong> F 1 toF 4 are [0.02, 0.05], (0.05, 0.07], (0.07, 0.1] and (0.1, 1) respectively.Synthetic Dataset. The synthetic dataset is generated by a synthetic <strong>graph</strong> generator provided byauthors of [9]. The synthetic <strong>graph</strong> dataset is generated as follows: First, a set of S seed fragmentsare generated randomly, whose size is determ<strong>in</strong>ed by a Poisson distribution with mean I. Thesize of each <strong>graph</strong> is a Poisson random variable with mean T. Seed fragments are then randomlyselected and <strong>in</strong>serted <strong>in</strong>to a <strong>graph</strong> one by one until the <strong>graph</strong> reaches its size. Parameter V andE denote the number of dist<strong>in</strong>ct vertex labels and edge labels respectively. The card<strong>in</strong>ality of the<strong>graph</strong> dataset is denoted by D. We generate the <strong>graph</strong> database us<strong>in</strong>g the follow<strong>in</strong>g parameterswith gIndex <strong>in</strong> [18]: D=10,000, S=200, I=10, T=50, V=4, E=1.2) Experiment 1. Benefit<strong>in</strong>g from <strong>graph</strong> <strong>in</strong>dex<strong>in</strong>g techniques, we can only ma<strong>in</strong>ta<strong>in</strong> the occurrences<strong>in</strong> the projected database, as discussed <strong>in</strong> Section 3.2. In the experiment, we evaluate the<strong>in</strong>dex<strong>in</strong>g technique <strong>in</strong> PG-search. We use GCod<strong>in</strong>g <strong>in</strong>dex<strong>in</strong>g technique [20] <strong>in</strong> the experiment.First, <strong>in</strong> Fig. 7(a) and 7(c), we fix the query sets F 1 and vary K from 10 to 100. Fig. 7(a) andFig. 7(c) show the runn<strong>in</strong>g time and memory consumption respectively. Second, <strong>in</strong> Fig. 7(b) and7(d), we fix K to be 50 and use different query sets F 1 to F 4. In “PG-search without Index”,we need to ma<strong>in</strong>ta<strong>in</strong> the occurrence list <strong>in</strong> the whole database. Therefore, it needs more memoryconsumption, which is shown <strong>in</strong> Fig. 7(c) and 7(d). Observed from Fig. 7(a) and 7(b), it is clearthat “PG-search” is faster than “PG-search without Index”.Run<strong>in</strong>g Time (seconds)605040302010PG−search Without IndexPG−search010 20 30 40 50 60 70 80 90 100<strong>Top</strong>−K(a) Runn<strong>in</strong>g Time <strong>in</strong> F 1 Query(b) Runn<strong>in</strong>g Time of <strong>Top</strong>-50(c) Memory Consumption <strong>in</strong> F 1 (d) Memory Consumption of <strong>Top</strong>-SetsRun<strong>in</strong>g Time (seconds)25020015010050PG−search Without IndexPG−search0F1 F2 F3 F4Query SetsQuery <strong>in</strong> different Query SetsMemory (<strong>in</strong> MB)600500400300200100PG−search Without IndexPG−search010 20 30 40 50 60 70 80 90 100<strong>Top</strong>−KQuery SetsMemory (<strong>in</strong> MB)500400300200100PG−search Without IndexPG−search0F1 F2 F3 F4Query Sets50 Query <strong>in</strong> different Query SetsFig. 7. Evaluat<strong>in</strong>g Index<strong>in</strong>g Technique <strong>in</strong> PG-search Algorithm <strong>in</strong> AIDS datasetsRun<strong>in</strong>g Time (seconds)8070605040302010PG−searchDT−CGSF<strong>in</strong>al Threshold β10.90.80.70.6F<strong>in</strong>al Threshold βCandidate Size600500400300200100PG−searchDG−CGSAnswer Set02 5 10 20 30 40 50 60 70 80 90 100<strong>Top</strong>−K(a) Runn<strong>in</strong>g Time <strong>in</strong> F 1 QuerySets0.52 5 10 20 30 40 50 60 70 80 90 100<strong>Top</strong>−K(b) F<strong>in</strong>al Threshold β02 5 10 20 30 40 50 60 70 80 90 100<strong>Top</strong>−K(c) Candidate SizeFig. 8. PG-search VS. Decreas<strong>in</strong>g Threshold-based CGS on AIDS Dataset


14 Lei Zou et al.Run<strong>in</strong>g Time (seconds)15010050PG−searchDT−CGSF<strong>in</strong>al Threshold β10.90.80.70.6F<strong>in</strong>al Threshold βCandidate Size700600500400300200100PG−searchDT−CGSAnswer Set02 5 10 20 30 40 50 60 70 80 90 100<strong>Top</strong>−K0.52 5 10 20 30 40 50 60 70 80 90 100<strong>Top</strong>−K02 5 10 20 30 40 50 60 70 80 90 100<strong>Top</strong>−K(a) Runn<strong>in</strong>g Time <strong>in</strong> F 1SetsQuery(b) F<strong>in</strong>al Threshold β(c) Candidate SizeFig. 9. PG-search VS. Decreas<strong>in</strong>g Threshold-based CGS on Synthetic Dataset3) Experiment 2. In the experiment, we compare PG-search with decreas<strong>in</strong>g threshold-basedCGS (DT-CGS for short). We fix query sets F 1 and vary K from 2 to 100. Fig. 9(a) reportsrunn<strong>in</strong>g time. We also report the f<strong>in</strong>al threshold β for top-K query <strong>in</strong> Fig. 9(b). Observed fromFig. 9(a), PG-search is faster than DT-CGS <strong>in</strong> the most cases. In DT-CGS, it is difficult to setan appropriate threshold θ. Therefore, <strong>in</strong> the experiments, we always first set θ = 0.9, and thendecrease it by “0.02” <strong>in</strong> each follow<strong>in</strong>g step based on the number of answers obta<strong>in</strong>ed from theprevious step. For example, <strong>in</strong> Fig. 9(a), when K = 2, 5, the runn<strong>in</strong>g time <strong>in</strong> DT-CGS keeps thesame, because the f<strong>in</strong>al threshold are β = 1.0 and β = 0.95 respectively (see Fig. 9(b)). It meansthat we can f<strong>in</strong>d the top-2 and top-5 answers, if we set θ = 0.9 <strong>in</strong> the first step <strong>in</strong> DT-CGS. Asthe <strong>in</strong>creas<strong>in</strong>g of K, the performance of DT-CGS decreases greatly. Fig. 9(b) expla<strong>in</strong>s the cause.When K is large, the f<strong>in</strong>al threshold β is small. Thus, we have to perform CGS algorithm [8]many times accord<strong>in</strong>g to decreas<strong>in</strong>g threshold θ. Thus, it leads to more candidates <strong>in</strong> DT-CGS.Furthermore, we observed that the <strong>in</strong>creas<strong>in</strong>g trend of runn<strong>in</strong>g time <strong>in</strong> PG-search is muchslower than that <strong>in</strong> DT-CGS algorithm. In PG-search algorithm, we need to compute φ(Q, h) <strong>in</strong>each iteration step and h is the head of heap H. Therefore, we can regard h as a candidate answer.Fig. 9(c) reports candidate size <strong>in</strong> PG-search and DT-CGS algorithm. Observed from Fig. 9(c),we also f<strong>in</strong>d the <strong>in</strong>creas<strong>in</strong>g trend of candidate size <strong>in</strong> PG-<strong>Search</strong> is much slower than that <strong>in</strong> DT-CGS. We have the similar results on other query sets <strong>in</strong> the experiments. Due to space limit, wedo not report them here. We also compare PG-search with DT-CGS on synthetic datasets <strong>in</strong> Fig.9. The experiment results also confirm the superiority of our method.5 Related WorkGiven a query <strong>graph</strong> Q, retriev<strong>in</strong>g related <strong>graph</strong>s from a large <strong>graph</strong> database is a key problem<strong>in</strong> many <strong>graph</strong>-based applications. Most exist<strong>in</strong>g work is about sub-<strong>graph</strong> search [13, 18]. Due toNP-complete hardness of sub-<strong>graph</strong> isomorphism, we have to employ filter<strong>in</strong>g-and-verificationframework. First, we use some prun<strong>in</strong>g strategies to filter out false positives as many as possible,second, we perform sub-<strong>graph</strong> isomorphism for each candidate to fix the correct answers.Furthermore, similar sub-<strong>graph</strong> search is also important, which is def<strong>in</strong>ed as: f<strong>in</strong>d all <strong>graph</strong>s that“closely conta<strong>in</strong>” the query <strong>graph</strong> Q.However, the above similar sub<strong>graph</strong> search is structure-based. Ke et al. first propose correlationm<strong>in</strong><strong>in</strong>g task <strong>in</strong> <strong>graph</strong> database, that is correlation <strong>graph</strong> search (CG<strong>Search</strong> for short)[8].CG<strong>Search</strong> can be regarded as statistical similarity search <strong>in</strong> <strong>graph</strong> database, which is differentfrom structural similarity search. Therefore, CG<strong>Search</strong> provides an orthogonal search problemof structural similarity search <strong>in</strong> <strong>graph</strong> database. CG<strong>Search</strong> <strong>in</strong> [8] requires the specification of am<strong>in</strong>imum correlation threshold θ to perform the computation. In practice, it may be not trivial forusers to provide an appropriate threshold θ, s<strong>in</strong>ce different <strong>graph</strong> databases typically have differentcharacteristics. Therefore, we propose an alternative m<strong>in</strong><strong>in</strong>g task: top-K correlation sub-<strong>graph</strong>search, TOP-CGS.


<strong>Top</strong>-K <strong>Correlation</strong> <strong>Sub</strong>-<strong>graph</strong> <strong>Search</strong> <strong>in</strong> <strong>Graph</strong> <strong>Databases</strong> 15<strong>Correlation</strong> m<strong>in</strong><strong>in</strong>g is always used to discovery the underly<strong>in</strong>g dependency between objects.The correlation m<strong>in</strong><strong>in</strong>g task f<strong>in</strong>ds many applications, such as market-basket databases [2, 10],multimedia databases [11]. In [16], Xiong et al. propose an all-strong-pairs correlation query<strong>in</strong> a market-basket database. In [6], Ilyas et al. use correlation discovery to f<strong>in</strong>d soft functionaldependencies between columns <strong>in</strong> relational database.6 ConclusionsIn this paper, we propose a novel <strong>graph</strong> m<strong>in</strong><strong>in</strong>g task, called as TOP-CGS. To address the problem,we propose a pattern-growth algorithm, called PG-search algorithm. In PG-search, we firstgrow the sub-<strong>graph</strong> S i with the highest support and ma<strong>in</strong>ta<strong>in</strong> the threshold β to be the K-largestcorrelation value φ(Q, S i) by now. We <strong>in</strong>crease β <strong>in</strong> need until we f<strong>in</strong>d the correct top-K answers.Furthermore, <strong>in</strong> order to improve the performance, we utilize <strong>graph</strong> <strong>in</strong>dex<strong>in</strong>g technique to speedup the m<strong>in</strong><strong>in</strong>g task. Extensive experiments on real and synthetic datasets confirm the efficiencyof our methods.References1. R. Agrawal and R. Srikant. Fast algorithms for m<strong>in</strong><strong>in</strong>g association rules <strong>in</strong> large databases.In VLDB, 1994.2. S. Br<strong>in</strong>, R. Motwani, and C. Silverste<strong>in</strong>. Beyond market baskets: Generaliz<strong>in</strong>g associationrules to correlations. In SIGMOD, 1997.3. D. Cai, Z. Shao, X. He, X. Yan, and J. Han. Community m<strong>in</strong><strong>in</strong>g from multi-relational networks.In PKDD, 2005.4. S. Fort<strong>in</strong>. The <strong>graph</strong> isomorphism problem. Department of Comput<strong>in</strong>g Science, Universityof Alberta, 1996.5. H. He and A. K. S<strong>in</strong>gh. Closure-tree: An <strong>in</strong>dex structure for <strong>graph</strong> queries. In ICDE, 2006.6. I. F. Ilyas, V. Markl, P. J. Haas, P. Brown, and A. Aboulnaga. Cords: Automatic discovery ofcorrelations and soft functional dependencies. In SIGMOD, 2004.7. A. Inokuchi, T. Washio, and H. Motoda. An apriori-based algorithm for m<strong>in</strong><strong>in</strong>g frequentsubstructures from <strong>graph</strong> data. In PKDD, 2000.8. Y. Ke, J. Cheng, and W. Ng. <strong>Correlation</strong> search <strong>in</strong> <strong>graph</strong> databases. In SIGKDD, 2007.9. M. Kuramochi and G. Karypis. Frequent sub<strong>graph</strong> discovery. In ICDM, 2001.10. E. Omiec<strong>in</strong>ski. Alternative <strong>in</strong>terest measures for m<strong>in</strong><strong>in</strong>g associations <strong>in</strong> databases. IEEETKDE, 15(1), 2003.11. J.-Y. Pan, H.-J. Yang, C. Faloutsos, and P. Duygulu. Automatic multimedia cross-modalcorrelation discovery. In KDD, 2004.12. E. G. M. Petrakis and C. Faloutsos. Similarity search<strong>in</strong>g <strong>in</strong> medical image databases. IEEETransactions on Knowledge and Data Eng<strong>in</strong>ner<strong>in</strong>g, 9(3), 1997.13. D. Shasha, J. T.-L. Wang, and R. Giugno. Algorithmics and applications of tree and <strong>graph</strong>search<strong>in</strong>g. In PODS, 2002.14. P. Willett. Chemical similarity search<strong>in</strong>g. J. Chem. Inf. Comput. Sci, 38(6), 1998.15. H. Xiong, M. Brodie, and S. Ma. <strong>Top</strong>-cop: M<strong>in</strong><strong>in</strong>g top-k strongly correlated pairs <strong>in</strong> largedatabases. In ICDM, 2006.16. H. Xiong, S. Shekhar, P.-N. Tan, and V. Kumar. Exploit<strong>in</strong>g a support-based upper bound ofpearson’s correlation coefficient for efficiently identify<strong>in</strong>g strongly correlated pairs. In KDD,2004.17. X. Yan and J. Han. gspan: <strong>Graph</strong>-based substructure pattern m<strong>in</strong><strong>in</strong>g. In ICDM, 2002.18. X. Yan, P. S. Yu, and J. Han. <strong>Graph</strong> <strong>in</strong>dex<strong>in</strong>g: A frequent structure-based approach. InSIGMOD, 2004.19. X. Yan, P. S. Yu, and J. Han. <strong>Sub</strong>structure similarity search <strong>in</strong> <strong>graph</strong> databases. pages 766–777, 2005.20. L. Zou, L. Chen, J. X. Yu, and Y. Lu. A novel spectral cod<strong>in</strong>g <strong>in</strong> a large <strong>graph</strong> database. InEDBT, 2008.

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!