Top-K Correlation Sub-graph Search in Graph Databases
[Fig. 2. Canonical Labels and Pattern Growth. (a) Canonical Labeling; (b) Parents and Children]

2.3 Pattern Growth

Definition 7. Parent and Child. Given two connected graphs P and C with n and n + 1 edges, respectively, if P is a sub-graph of C, then P is called a parent of C, and C is called a child of P.

A graph can have more than one child or parent. For example, in Fig. 2(b), C_2 has two parents, P_1 and P_2, and P_1 has two children, C_1 and C_2.

During the mining process, we always “grow” a sub-graph from its parent, which is called pattern growth. During pattern growth, we introduce an extra edge into a parent to obtain a child. The introduced extra edge is called a Growth Element (GE for short). The formal definition of GE is given in Section 3.1.

3 PG-search Algorithm

As mentioned in Section 1, we can transform a TOP-CGSearch problem into a threshold-based CGSearch. First, we set threshold θ to be a large value.
According to the CGS algorithm [8], if we can find the correct top-K correlation sub-graphs, then we report them and terminate the algorithm. Otherwise, we decrease θ and repeat the above process until we obtain the top-K answers. We refer to this method as “Decreasing Threshold-Based CGS” in the rest of this paper. Obviously, when K is large, we have to set θ to a small value to find the correct top-K answers, which is quite inefficient since we need to run the CGS algorithm [8] several times and more candidates need to be investigated. To avoid this shortcoming, we propose a pattern-growth search algorithm (PG-search for short) in this section.

3.1 Top-K Correlation Sub-graph Query

The search space (the number of candidates) of the TOP-CGSearch problem is large, since each sub-graph S_i in the graph database is a candidate. Obviously, it is impossible to enumerate all sub-graphs in the database, since doing so results in a combinatorial explosion. Therefore, an effective pruning strategy is needed to reduce the search space. In our algorithm, we conduct effective filtering through the upper bound. Specifically, if the largest upper bound among all unchecked sub-graphs cannot enter the top-K answers, the algorithm can stop.
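The stopping rule above can be illustrated with a minimal Python sketch. It is not the PG-search algorithm itself: it assumes a fixed candidate list known up front (no pattern growth), and the names `top_k_with_upper_bounds`, `upper_bound`, and `score` are hypothetical stand-ins for upperbound(φ(Q, S_i)) and φ(Q, S_i). The only property it relies on is that the upper bound is never smaller than the true score.

```python
import heapq


def top_k_with_upper_bounds(candidates, k, upper_bound, score):
    """Return (top-k list of (score, candidate), number of scores computed).

    Assumption: upper_bound(c) >= score(c) for every candidate c.
    """
    # heapq is a min-heap, so negate the bound to pop the largest bound first.
    # The index i breaks ties so the candidates themselves are never compared.
    heap = [(-upper_bound(c), i, c) for i, c in enumerate(candidates)]
    heapq.heapify(heap)

    best = []  # min-heap of (score, index, candidate), holds at most k entries
    evaluated = 0
    while heap:
        neg_ub, i, c = heapq.heappop(heap)
        # Stop: the largest remaining upper bound cannot exceed the
        # current k-th best score, so no unchecked candidate can enter.
        if len(best) == k and -neg_ub <= best[0][0]:
            break
        s = score(c)  # the expensive exact computation
        evaluated += 1
        if len(best) < k:
            heapq.heappush(best, (s, i, c))
        elif s > best[0][0]:
            heapq.heapreplace(best, (s, i, c))

    ranked = sorted(best, key=lambda t: t[0], reverse=True)
    return [(s, c) for s, _, c in ranked], evaluated


# Toy data (made up for illustration only).
toy_ub = {"a": 0.9, "b": 0.8, "c": 0.5, "d": 0.4, "e": 0.3}
toy_score = {"a": 0.7, "b": 0.75, "c": 0.45, "d": 0.1, "e": 0.3}
top2, n_scored = top_k_with_upper_bounds(
    list(toy_ub), 2, toy_ub.get, toy_score.get
)
```

In this toy run only two of the five exact scores are computed: once "a" and "b" are scored, the next upper bound (0.5) is already below the second-best score (0.7), so the loop stops.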
Generally speaking, our algorithm works as follows: We first generate all size-1 sub-graphs S_i (the size of a graph is defined in terms of its number of edges, i.e., |E|) and insert them into heap H in non-increasing order of upperbound(φ(Q, S_i)), which is the upper bound of φ(Q, S_i). The upper bound is a monotone increasing function w.r.t. sup(S_i). The head of the heap H is h. We set α = upperbound(φ(Q, h)). We compute φ(Q, h) and insert h into result set RS. β is