13.07.2015 Views

Top-K Correlation Sub-graph Search in Graph Databases


[Figure 2. Canonical Labels and Pattern Growth. (a) Canonical Labeling; (b) Parents and Children.]

2.3 Pattern Growth

Definition 7. Parent and Child. Given two connected graphs P and C, where P and C have n and n + 1 edges respectively, if P is a sub-graph of C, then P is called a parent of C, and C is called a child of P.

A graph can have more than one child or parent. For example, in Fig. 2(b), C_2 has two parents, P_1 and P_2, and P_1 has two children, C_1 and C_2.

During the mining process, we always "grow" a sub-graph from its parent, which is called pattern growth. During pattern growth, we introduce an extra edge into a parent to obtain a child. The introduced extra edge is called a Growth Element (GE for short). The formal definition of GE is given in Section 3.1.

3 PG-search Algorithm

As mentioned in Section 1, we can transform a TOP-CGSearch problem into a threshold-based CGSearch. First, we set the threshold θ to a large value.
According to the CGS algorithm [8], if we can find the correct top-K correlation sub-graphs, then we report them and terminate the algorithm. Otherwise, we decrease θ and repeat the above process until we obtain the top-K answers. We refer to this method as "Decreasing Threshold-Based CGS" in the rest of this paper. Obviously, when K is large, we have to set θ to a small value to find the correct top-K answers, which is quite inefficient, since we need to run the CGS algorithm [8] several times and more candidates need to be investigated. To avoid this shortcoming, we propose a pattern-growth search algorithm (PG-search for short) in this section.

3.1 Top-K Correlation Sub-graph Query

The search space (the number of candidates) of the TOP-CGSearch problem is large, since each sub-graph S_i in the graph database is a candidate. Obviously, it is impossible to enumerate all sub-graphs in the database, since doing so results in a combinatorial explosion. Therefore, an effective pruning strategy is needed to reduce the search space. In our algorithm, we conduct effective filtering through the upper bound. Specifically, if the largest upper bound over all unchecked sub-graphs cannot enter the top-K answers, the algorithm can stop.
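The stopping rule just described can be illustrated with a minimal Python sketch. It is not the PG-search algorithm itself: it assumes a fixed, already enumerated pool of candidates rather than growing patterns, and the names `top_k_with_upper_bound`, `upper_bound`, and `score` are hypothetical stand-ins for upperbound(φ(Q, S_i)) and φ(Q, S_i). It also assumes that `score(c)` never exceeds `upper_bound(c)`.

```python
import heapq
from typing import Callable, Hashable, Iterable, List, Tuple


def top_k_with_upper_bound(
    candidates: Iterable[Hashable],
    upper_bound: Callable[[Hashable], float],  # hypothetical: cheap bound on the score
    score: Callable[[Hashable], float],        # hypothetical: exact (expensive) score
    k: int,
) -> Tuple[List[Tuple[float, Hashable]], int]:
    """Return the k best (score, candidate) pairs and the number of exact evaluations.

    Assumption: score(c) <= upper_bound(c) for every candidate c.
    """
    if k <= 0:
        return [], 0

    # Max-heap on the upper bound (heapq is a min-heap, so the key is negated).
    # The index i breaks ties so that candidates are never compared directly.
    unchecked = [(-upper_bound(c), i, c) for i, c in enumerate(candidates)]
    heapq.heapify(unchecked)

    best: List[Tuple[float, int, Hashable]] = []  # min-heap holding the current top-k
    evaluated = 0
    while unchecked:
        neg_ub, i, c = heapq.heappop(unchecked)
        # Stop once the largest remaining upper bound cannot beat the k-th best score.
        if len(best) == k and -neg_ub <= best[0][0]:
            break
        s = score(c)
        evaluated += 1
        if len(best) < k:
            heapq.heappush(best, (s, i, c))
        elif s > best[0][0]:
            heapq.heapreplace(best, (s, i, c))

    ranked = sorted(best, key=lambda t: t[0], reverse=True)
    return [(s, c) for s, _, c in ranked], evaluated
```

With four candidates whose upper bounds are 0.9, 0.8, 0.5, and 0.3 and K = 2, the loop stops after two exact evaluations if both evaluated scores exceed 0.5, since no unchecked candidate can then enter the answer set.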
Generally speaking, our algorithm works as follows. We first generate all size-1 sub-graphs S_i (the size of a graph is defined in terms of its number of edges, i.e., |E|) and insert them into a heap H in non-increasing order of upperbound(φ(Q, S_i)), which is the upper bound of φ(Q, S_i). The upper bound is a monotonically increasing function of sup(S_i). The head of the heap H is h. We set α = upperbound(φ(Q, h)). We compute φ(Q, h) and insert h into the result set RS. β is
