url - Universität zu Lübeck
5.4. QUERY PROCESSING<br />
particular there are only 41 different values for the year element of the books.<br />
The selectivity of 1.0 for the /dblp/article/year path expression indicates that all<br />
values are the same. The selectivity for the other elements that have values in<br />
this DBLP fragment can be seen in the figure.<br />
The values r and sl are used to estimate the number of elements that match<br />
a path expression to be evaluated. This number is the product of all r values<br />
on the path to the selected element. If the path expression contains a key<br />
comparison, the product is additionally multiplied by the selectivity sl of the<br />
compared element.<br />
Example 14 The path expression /dblp/article/author leads to 1 · 535 · 1.69 ≈ 904<br />
selected elements.<br />
The same path expression with a value comparison (/dblp/article/author[. = 'x'])<br />
leads to 904 · 0.621 ≈ 561 selected elements. This number is relatively high because<br />
the 535 articles in the selected DBLP fragment are written by only 342 different authors.<br />
The similar path expression /dblp/article/title[. = 'x'] leads to only 1 · 535 · 1 ·<br />
0.002 = 1.07 hits. Under this estimate, querying an article by its title is therefore<br />
expected to be more than 300 times faster than querying it by its authors.<br />
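The arithmetic of Example 14 can be sketched in a few lines of Python. This is only an illustration: the dictionaries `R_VALUES` and `SELECTIVITY` and the function `estimate` are hypothetical names, not part of the described system, and the numbers are the ones quoted in the example.

```python
from math import prod

# r values of the extents on the paths used in Example 14.
# The dictionary layout is an assumption made for this sketch.
R_VALUES = {
    "/dblp": 1.0,
    "/dblp/article": 535.0,
    "/dblp/article/author": 1.69,
    "/dblp/article/title": 1.0,
}

# Selectivity values (sl) quoted in Example 14.
SELECTIVITY = {
    "/dblp/article/author": 0.621,
    "/dblp/article/title": 0.002,
}


def estimate(path, value_comparison=False):
    """Multiply the r values of every prefix of `path`; if the last step
    carries a value comparison, also multiply by its selectivity."""
    steps = path.strip("/").split("/")
    prefixes = ["/" + "/".join(steps[: i + 1]) for i in range(len(steps))]
    cardinality = prod(R_VALUES[p] for p in prefixes)
    if value_comparison:
        cardinality *= SELECTIVITY[path]
    return cardinality


if __name__ == "__main__":
    print(round(estimate("/dblp/article/author")))        # about 904
    print(round(estimate("/dblp/article/author", True)))  # about 561
    print(estimate("/dblp/article/title", True))          # about 1.07
```

The sketch multiplies the unrounded product by the selectivity, whereas the example rounds to 904 first. Both approaches give 561 after rounding.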
If the query contains a wildcard (*) or the descendant axis (//), multiple extents<br />
must be considered. The estimates for the affected extents are summed to get the<br />
final result. With a key in the path expression, the estimate for each affected<br />
extent has to be weighted by its selectivity sl before summing.<br />
Example 15 The path expression /dblp/*/url affects the three url extents of books,<br />
articles and inproceedings. We calculate their cardinalities independently and sum<br />
them afterwards. Therefore, the result of this path expression has the<br />
expected cardinality 1 · 540 · 0.18 + 1 · 535 · 0.04 + 1 · 26565 · 1.0 ≈ 26684.<br />
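The summation over several extents in Example 15 can be sketched as follows. The names `URL_EXTENTS` and `estimate_wildcard` are hypothetical, and the optional `selectivity` argument is an assumption about how the weighting by sl could be applied per extent. The text gives no sl values for the url extents, so none are hard-coded here.

```python
# For each extent matched by /dblp/*/url: (r of the parent extent, r of the
# url extent), using the numbers from Example 15. The layout is an assumption.
URL_EXTENTS = {
    "book": (540, 0.18),
    "article": (535, 0.04),
    "inproceedings": (26565, 1.0),
}


def estimate_wildcard(extents, root_r=1.0, selectivity=None):
    """Estimate each matched extent independently and sum the results.

    `selectivity`, if given, maps an extent name to its sl value and is
    applied to that extent's estimate before summing.
    """
    total = 0.0
    for name, (parent_r, child_r) in extents.items():
        cardinality = root_r * parent_r * child_r
        if selectivity is not None:
            cardinality *= selectivity[name]
        total += cardinality
    return total


if __name__ == "__main__":
    print(round(estimate_wildcard(URL_EXTENTS)))  # about 26684
```

Keeping one entry per extent mirrors the per-extent calculation in the example and makes it easy to apply a different selectivity to each extent.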
The statistic DataGuide is a relatively simple approach to estimating the cardinality<br />
of the elements selected by a path expression, yet in most cases it is sufficient and<br />
efficient. The estimated value can be used in cost models for indexes and conventional<br />
XPath evaluation.<br />
<br />
The approach assumes statistical independence between elements in the XML
data. If elements are statistically dependent, for instance mutually exclusive,
the statistic DataGuide loses precision. Suppose
an element X has two possible children a and b. Half of the X elements have
exactly one a child and the other half have exactly one b child. Therefore, no X element
has both an a and a b child. The statistic DataGuide would assign r = 0.5 to both
the a and the b extent. The path expression //X[a and b] would lead to an estimated
cardinality of |X| · 0.5 · 0.5 = |X|/4, indicating that a quarter of the X elements have
both an a and a b child.
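The effect of the independence assumption can be reproduced on a small synthetic data set. The following sketch is purely illustrative: the function name `compare_estimate`, the data layout, and the choice of 100 X elements are assumptions made here. Each X element is modelled as the set of its child names.

```python
def compare_estimate(x_elements):
    """Return (estimated, actual) cardinality of //X[a and b].

    The estimate multiplies |X| by the r values of the a and b extents,
    as if the two children occurred independently of each other.
    """
    n = len(x_elements)
    r_a = sum("a" in children for children in x_elements) / n
    r_b = sum("b" in children for children in x_elements) / n
    estimated = n * r_a * r_b
    actual = sum("a" in children and "b" in children for children in x_elements)
    return estimated, actual


if __name__ == "__main__":
    # Half of the X elements have only an a child, the other half only a b child.
    x_elements = [{"a"}] * 50 + [{"b"}] * 50
    estimated, actual = compare_estimate(x_elements)
    print(estimated)  # 25.0, i.e. |X| / 4
    print(actual)     # 0, because a and b never occur together
```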