3.8 The trigram index

[Figure 3.9, plot omitted. Title: average steps until ∆R = 0; x-axis: # trigrams (0 to 60); y-axis: intersection steps (0 to 20); series: possible steps, necessary steps.]

Figure 3.9: It takes only a fraction (≈ 1/3) of intersection steps to get to the final result R. 1500 random function and variable names have been tested.
Instead, DCS can only use a heuristic to decide when decoding can stop because reading more
posting lists would no longer reduce the number of false positives significantly.
Figure 3.9 confirms the hunch: on average, it is not necessary to perform all intersections
to get to the final result R, or very close to it.
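The kind of measurement plotted in Figure 3.9 can be illustrated with a minimal Python sketch. It is not taken from DCS: the function name `necessary_steps` and the use of Python sets as posting lists are assumptions made purely for illustration. It counts how many intersection steps are possible for a query and after how many of them the intermediate result already equals the final result R.

```python
def necessary_steps(posting_lists):
    """Return (possible, necessary) intersection step counts.

    possible:  number of intersections when every posting list is used.
    necessary: number of intersections after which the intermediate
               result already equals the final result R.
    Hypothetical helper; posting lists are modelled as iterables of file IDs.
    """
    sets = [set(p) for p in posting_lists]
    if not sets:
        return 0, 0

    # The final result R: the intersection of all posting lists.
    final = set.intersection(*sets)

    result = sets[0]
    necessary = 0
    for step, plist in enumerate(sets[1:], start=1):
        if result == final:
            # Further intersections cannot change the result any more.
            break
        result = result & plist
        necessary = step
    return len(sets) - 1, necessary
```

Averaging `necessary` over many queries and grouping by the number of trigrams would give a curve comparable in spirit to the "necessary steps" series, under the assumptions stated above.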
A heuristic which yields a low number of false positives but still saves a considerable
number of steps (and thus time) is:
Stop processing if ∆P_{i−1} < 10 and i > 0.70 × n (70 % of the posting lists have been decoded).
As Figure 3.10 shows, the number of false positives does not exceed one or two files on average,
while the total speed-up for executing the AND query is ≈ 2×.
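The stopping rule can be sketched in Python as follows. This is an illustration under stated assumptions, not the DCS implementation: the function name `intersect_with_early_stop` and its parameters are hypothetical, posting lists are modelled as Python sets of file IDs, ∆P of a step is taken to mean the number of candidate files removed by that intersection, and i is taken to be the 1-based index of the posting list that would be decoded next.

```python
def intersect_with_early_stop(posting_lists, min_delta=10, min_fraction=0.70):
    """Intersect posting lists, stopping early once the heuristic fires.

    Returns (candidates, lists_decoded). The candidates may contain
    false positives if processing stopped before the last list.
    Hypothetical sketch; the names and the meaning of delta are assumptions.
    """
    n = len(posting_lists)
    if n == 0:
        return set(), 0

    result = set(posting_lists[0])
    decoded = 1
    prev_delta = None  # undefined until the first intersection has run

    for i in range(2, n + 1):  # i: 1-based index of the next posting list
        # Stop if the previous step removed fewer than min_delta files
        # and more than min_fraction of the lists have been reached.
        if prev_delta is not None and prev_delta < min_delta and i > min_fraction * n:
            break
        before = len(result)
        result &= set(posting_lists[i - 1])
        prev_delta = before - len(result)
        decoded = i

    return result, decoded
```

Comparing the returned candidates with the full intersection of all lists gives the false positives caused by skipping the remaining posting lists; those files would still have to be filtered out by a later step.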
[Figure 3.10, plots omitted. Panel "false positives by skipping decoding": x-axis: query # (200 to 1200); y-axis: # of files (0 to 10); series: false positives. Panel "total AND-query time": x-axis: query # (200 to 1200); y-axis: time (0 ms to 10 ms); series: saved time, total time.]
Figure 3.10: With the heuristic explained above, the number of false positives does
not exceed two files on average; the total speed-up is ≈ 2×.