Top-K Correlation Sub-graph Search in Graph Databases
Lei Zou et al.

[Figure 4, panels: (a) How to Find GE; (b) Grow Sub-graph by GE]
Fig. 4. Finding GE and Grow Sub-graph by GE

numbers beside vertices in each sub-graph $S_i$ are canonical vertex IDs in $S_i$. The numbers beside vertices in data graph $G$ are just vertex IDs, which are assigned arbitrarily. Notice that canonical vertex IDs in a parent may not be preserved in its child. For example, in Fig. 4(b), the canonical vertex IDs in $S_1$ differ from those in $S_6$. A size-$(n+1)$ sub-graph $S_i$ (child) can be obtained by growing different size-$n$ sub-graphs (parents). For example, in Fig. 5(a), we can obtain sub-graph $S_6$ by growing $S_1$ or $S_5$. Generating the same pattern more than once degrades the performance of the mining algorithm. Therefore, we propose the following definition of correct growth to avoid duplicate generation.

Definition 10. Correct Growth. A size-$(n+1)$ sub-graph $C$ (child) can be obtained by growing some size-$n$ sub-graphs $P_i$ (parents), $i = 1, \ldots, m$. Among all these growths, the growth from $P_j$ to $C$ is called the correct growth, where $P_j$ is the largest among all parents according to the Total Order in Definition 8. Every other growth is an incorrect one.

For example, in Fig. 5(a), $S_6$ has two parents, $S_1$ and $S_5$. Since $S_5$ is larger than $S_1$ (see Definition 8), the growth from $S_5$ to $S_6$ is a correct growth, and the growth from $S_1$ to $S_6$ is an incorrect growth. Thus, Definition 8 lets us determine whether the growth from $P$ to $C$ is correct or incorrect. We will propose another, more efficient method to determine the correct growth in Section 3.2.

Now, we illustrate the mining process with the running example. Throughout, we maintain $\beta$ as the $K$-th largest $\phi(Q, S_i)$ found so far ($K = 2$ in the running example). Initially, we set $\beta = -\infty$. First, we conduct a sub-graph query to obtain the projected graph database $D_Q$, which is shown in Fig. 3(a). Then, we find all size-1 sub-graphs (having one edge) in the projected database $D_Q$. In Fig. 3(b), we insert them into heap $H$ in increasing order according to the Total Order in Definition 8. We also record all their occurrences in the graph database. Since $S_1$ is the head of heap $H$, we compute $\alpha = upperbound(\phi(Q, S_1)) = 1.63$ and $\phi(Q, S_1) = 0.61$. We insert $S_1$ into the result set $RS$. Now $\beta = -\infty < \alpha$, so the algorithm continues.

We find all growth elements (GEs for short) around $S_1$ in $D_Q$, which are shown in Fig. 5(b). Since $G_3$ is not in the projected database, we do not consider the occurrences in $G_3$ when we find GEs. For each GE, we check whether it leads to a correct growth. In Fig. 5(b), only the last one is a correct growth. We insert the last one (that is, $S_{10}$) into the max-heap $H$, as shown in Fig. 6. Now, the sub-graph $S_2$ is the heap head. We re-compute $\alpha = upperbound(\phi(Q, S_2)) = 1.0$ and $\phi(Q, S_2) = 1.0$, and insert $S_2$ into $RS$. We update $\beta$ to the Top-2 answer, that is, $\beta = 0.61$. Since $\beta < \alpha$, the algorithm still continues. The above process is iterated until $\beta \geq \alpha$. Finally, we report the top-2 answers in $RS$.

3.2 PG-search Algorithm

As discussed in Section 3.1, the heap $H$ is ranked according to the Total Order in Definition 8 (see Fig. 3(b) and 6). Definition 8 requires $sup(S_i)$. Therefore, we maintain all occurrences of $S_i$