Top-K Correlation Sub-graph Search in Graph Databases

Lei Zou et al.

Fig. 4. Finding GE and Grow Sub-graph by GE: (a) How to Find GE; (b) Grow Sub-graph by GE.

The numbers beside the vertices in each sub-graph S_i are canonical vertex IDs in S_i. The numbers beside the vertices in data graph G are just vertex IDs, which are assigned arbitrarily. Notice that canonical vertex IDs in a parent may not be preserved in its child. For example, in Fig. 4(b), the canonical vertex IDs in S_1 are different from those in S_6. A size-(n+1) sub-graph S_i (child) can be obtained by growing different size-n sub-graphs (parents). For example, in Fig. 5(a), we can obtain sub-graph S_6 by growing S_1 or S_5. Duplicate patterns degrade the performance of the mining algorithm. Therefore, we propose the following definition of correct growth to avoid generating duplicates.

Definition 10. Correct Growth. Given a size-(n+1) sub-graph C (child), it can be obtained by growing some size-n sub-graphs P_i (parents), i = 1...m. Among all growths, the growth from P_j to C is called the correct growth, where P_j is the largest among all parents according to the Total Order in Definition 8. Any other growth is an incorrect growth.

For example, in Fig. 5(a), S_6 has two parents, S_1 and S_5. Since S_5 is larger than S_1 (see Definition 8), the growth from S_5 to S_6 is a correct growth, and the growth from S_1 to S_6 is an incorrect growth. According to Definition 8, we can determine whether the growth from P to C is correct or incorrect. We will propose another, more efficient method to determine the correct growth in Section 3.2.

Now, we illustrate the mining process with the running example. In the running example, we always maintain β as the K-th largest φ(Q, S_i) found so far (K = 2 in the running example). Initially, we set β = −∞.
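The bookkeeping for the threshold β can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the class name `TopKThreshold` and its methods are hypothetical, and a size-K min-heap is just one assumed way to track the K-th largest correlation value. The values 0.61 and 1.0 are those reported for φ(Q, S_1) and φ(Q, S_2) in the running example, and 1.63 and 1.0 are the corresponding upper bounds α.

```python
import heapq


class TopKThreshold:
    """Tracks the K largest correlation values seen so far (hypothetical helper)."""

    def __init__(self, k):
        self.k = k
        self._heap = []  # min-heap holding at most k values

    def add(self, phi):
        """Record the correlation value of a newly evaluated sub-graph."""
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, phi)
        elif phi > self._heap[0]:
            # Replace the current K-th largest value with the better one.
            heapq.heapreplace(self._heap, phi)

    @property
    def beta(self):
        """K-th largest value so far, or -inf while fewer than K are known."""
        if len(self._heap) < self.k:
            return float("-inf")
        return self._heap[0]

    def should_stop(self, alpha):
        """Termination test of the running example: stop once beta >= alpha."""
        return self.beta >= alpha


threshold = TopKThreshold(k=2)
threshold.add(0.61)                       # phi(Q, S_1)
stop_after_s1 = threshold.should_stop(1.63)
threshold.add(1.0)                        # phi(Q, S_2)
stop_after_s2 = threshold.should_stop(1.0)
```

With K = 2, β stays at −∞ until two values have been recorded, so the search cannot terminate before at least K answers exist.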
First, we conduct a sub-graph query to obtain the projected graph database D_Q, which is shown in Fig. 3(a). Then, we find all size-1 sub-graphs (having one edge) in the projected database D_Q. In Fig. 3(b), we insert them into heap H in increasing order according to the Total Order in Definition 8. We also record all their occurrences in the graph database. Since S_1 is the head of heap H, we compute α = upperbound(φ(Q, S_1)) = 1.63 and φ(Q, S_1) = 0.61. We insert S_1 into the result set RS. Now β = −∞ < α, so the algorithm continues.

We find all growth elements (GEs for short) around S_1 in D_Q, which are shown in Fig. 5(b). Since G_3 is not in the projected database, we do not consider the occurrences in G_3 when we find GEs. For each GE, we check whether it leads to a correct growth. In Fig. 5(b), only the last one is a correct growth. We insert the resulting sub-graph (that is, S_10) into the max-heap H, as shown in Fig. 6. Now, sub-graph S_2 is the heap head. We re-compute α = upperbound(φ(Q, S_2)) = 1.0 and φ(Q, S_2) = 1.0, and insert S_2 into RS. We update β to the correlation of the current Top-2 answer, that is, β = 0.61. Since β < α, the algorithm still continues. The above process is iterated until β ≥ α. At last, we report the top-2 answers in RS.

3.2 PG-search Algorithm

As discussed in Section 3.1, the heap H is ranked according to the Total Order in Definition 8 (see Fig. 3(b) and Fig. 6). Definition 8 requires sup(S_i). Therefore, we maintain all occurrences of S_i
in heap H. According to the occurrence list, it is easy to obtain sup(S_i). However, it is expensive to maintain all occurrences in the whole database D, especially when |D| is large. On the other hand, we need to find GEs around each occurrence in the projected database D_Q (see Fig. 5(b)). Therefore, we need to maintain occurrences in the projected database. Furthermore, the projected database D_Q is always smaller than the whole database D. In order to improve the performance, in our PG-search algorithm, we only maintain the occurrence list in the projected database instead of the whole database.

Fig. 5. Correct Growth and S_1's Growth Elements: (a) Correct Growth; (b) GEs for S_1.

Fig. 6. The Second Step in Mining Process (max-heap H with the support and occurrences of each sub-graph; result set RS with φ(Q, S_2) = 1.00 and φ(Q, S_1) = 0.61).

However, Definition 8 needs sup(S_i). Therefore, we propose to use indexsup(S_i) to define the Total Order instead of sup(S_i), where indexsup(S_i) is derived from sub-graph query techniques.

Recently, many graph indexing structures have been proposed in the database literature [13, 18] for sub-graph query. Sub-graph query is defined as follows: given a sub-graph Q, we report all data graphs in the graph database that contain Q as a sub-graph. Because sub-graph isomorphism is an NP-complete problem [4], we always employ a filter-and-verification framework to speed up the search process. In the filtering process, we identify the candidate answers by graph indexing structures.
Then, in the verification process, we check each candidate by sub-graph isomorphism. Generally speaking, the filtering process is much faster than the verification process.

Definition 11. Index Support. For a sub-graph S_i, the index support is the size of the candidate set in the filtering process of a sub-graph S_i query, denoted indexsup(S_i).

Actually, sup(S_i) is the size of the answer set for a sub-graph S_i query, which is obtained in the verification process. According to graph indexing techniques [13, 18], we can identify the candidates without false negatives. Therefore, Lemma 3 holds.

Lemma 3. For any sub-graph S_i in the graph database, sup(S_i) ≤ indexsup(S_i), where indexsup(S_i) is the size of the candidate set for a sub-graph S_i query.

Furthermore, indexsup(S_i) also satisfies the Apriori Property. For example, in gIndex [18], if Q_1 is a sub-graph of Q_2, the candidates for a Q_2 query are a subset of the candidates for a Q_1 query. It means the following property holds.
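The filter-and-verification framework and Lemma 3 can be illustrated with a small Python sketch. Everything in it is an assumption made for illustration: the filter uses a simple labeled-edge feature set as a stand-in for a real index such as gIndex, the verification is a brute-force (non-induced) sub-graph isomorphism test suitable only for toy graphs, and the graphs `G1`, `G2`, `G3` and the queries are invented. Index support and support are reported here as plain counts.

```python
from itertools import permutations

# A graph is (vertex_labels, edges): {vertex_id: label}, {(u, v): edge_label}.


def edge_features(graph):
    """Feature set of a graph: (sorted endpoint labels, edge label) per edge."""
    labels, edges = graph
    return {(tuple(sorted((labels[u], labels[v]))), e) for (u, v), e in edges.items()}


def contains(graph, query):
    """Brute-force sub-graph isomorphism test (exponential; toy sizes only)."""
    g_labels, g_edges = graph
    q_labels, q_edges = query
    lookup = {}
    for (u, v), e in g_edges.items():  # make the edge lookup symmetric
        lookup[(u, v)] = e
        lookup[(v, u)] = e
    q_nodes = list(q_labels)
    for image in permutations(g_labels, len(q_nodes)):
        m = dict(zip(q_nodes, image))
        if any(q_labels[n] != g_labels[m[n]] for n in q_nodes):
            continue
        if all(lookup.get((m[u], m[v])) == e for (u, v), e in q_edges.items()):
            return True
    return False


def filter_and_verify(database, query):
    """Return (candidates, answers); the filter has no false negatives."""
    qf = edge_features(query)
    candidates = [name for name, g in database.items() if qf <= edge_features(g)]
    answers = [name for name in candidates if contains(database[name], query)]
    return candidates, answers


database = {
    "G1": ({1: "a", 2: "b", 3: "c"}, {(1, 2): "x", (2, 3): "x"}),
    "G2": ({1: "a", 2: "b", 3: "b", 4: "c"}, {(1, 2): "x", (3, 4): "x"}),
    "G3": ({1: "a", 2: "b"}, {(1, 2): "x"}),
}
query_path = ({1: "a", 2: "b", 3: "c"}, {(1, 2): "x", (2, 3): "x"})  # a-x-b-x-c
query_edge = ({1: "a", 2: "b"}, {(1, 2): "x"})                       # a-x-b

cand_path, ans_path = filter_and_verify(database, query_path)
cand_edge, ans_edge = filter_and_verify(database, query_edge)
indexsup_path, sup_path = len(cand_path), len(ans_path)
```

In this toy data, `G2` passes the filter for the path query because it has both edge features, but fails verification because the two edges do not share a `b` vertex, so the support is strictly smaller than the index support. The candidates of the larger query are also a subset of the candidates of its sub-graph `query_edge`, which mirrors the subset relation described for gIndex.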