30.08.2014 Views

url - Universität zu Lübeck


5.4. QUERY PROCESSING

particular there are only 41 different values for the year element of the books. The selectivity of 1.0 for the /dblp/article/year path expression indicates that all values are the same. The selectivity for the other elements that have values in this DBLP fragment can be seen in the figure.

The values r and sl are used to estimate the number of elements that a path expression selects. This number is the product of all r values on the path to the selected element. If the path expression contains a key comparison, the product is additionally multiplied by the selectivity sl of that key.

Example 14 The path expression /dblp/article/author will lead to 1 · 535 · 1.69 = 904 selected elements.

The same path expression with a value comparison (/dblp/article/author[. = 'x']) leads to 904 · 0.621 = 561 selected elements. This number is relatively high because the 535 articles in the selected DBLP fragment are written by only 342 different authors. The similar path expression /dblp/article/title[. = 'x'] leads to only 1 · 535 · 1 · 0.002 = 1.07 hits. Therefore, querying an article by its title is more than 300 times faster than querying it by the authors.
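The arithmetic of Example 14 can be retraced with a few lines of Python. This is only a minimal sketch: the function name `estimate_cardinality` and the list representation of a path are assumptions made for illustration, not part of the described system. The r values and selectivities are the ones quoted in the example.

```python
from math import prod

def estimate_cardinality(r_values, selectivity=1.0):
    """Multiply the r values along the path, then apply an optional key selectivity sl."""
    return prod(r_values) * selectivity

# r values along /dblp/article/author and /dblp/article/title (taken from Example 14)
author_path = [1, 535, 1.69]
title_path = [1, 535, 1]

authors = estimate_cardinality(author_path)
authors_by_value = estimate_cardinality(author_path, selectivity=0.621)
titles_by_value = estimate_cardinality(title_path, selectivity=0.002)

print(round(authors), round(authors_by_value), round(titles_by_value, 2))
```

The sketch multiplies the unrounded product by the selectivity, whereas the example rounds to 904 first; both variants round to 561 selected elements.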

If the query contains a wildcard (*) or the descendant axis (//), multiple extents must be considered. The estimates for the individual extents are summed to obtain the final result. With a key in the path expression, the r values of the affected extents have to be weighted by the selectivity sl before they are summed.

Example 15 The path expression /dblp/*/url affects the three url extents of books, articles and inproceedings. We calculate their cardinalities independently and sum them afterwards. Therefore, the result of this path expression has the expected cardinality 1 · 540 · 0.18 + 1 · 535 · 0.04 + 1 · 26565 · 1.0 = 26684.
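The summation over several extents in Example 15 can be sketched in the same way. The dictionary layout and the name `estimate_over_extents` are hypothetical. Assigning the three products to book, article and inproceedings follows the order in which the example lists them, which is an assumption.

```python
from math import prod

def estimate_over_extents(extents):
    """Estimate each matching extent independently, then sum the estimates."""
    return sum(prod(r_values) for r_values in extents.values())

# r values along the path to each url extent matched by /dblp/*/url (from Example 15)
url_extents = {
    "book": [1, 540, 0.18],
    "article": [1, 535, 0.04],
    "inproceedings": [1, 26565, 1.0],
}

total = estimate_over_extents(url_extents)
print(round(total))
```

With a key comparison, each product would additionally be multiplied by the selectivity sl of its own extent before the sum is taken.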

The statistic DataGuide is a relatively simple, but in most cases sufficient and efficient, approach to estimate the cardinality of the elements selected by a path expression. The resulting value can be used in cost models for indexes and conventional XPath evaluation.


The approach assumes statistical independence between elements in the XML data. If elements are statistically dependent, for instance because they are mutually exclusive, the statistic DataGuide loses precision. For instance, an element X has two possible children a and b. Half of the X elements have exactly one a child and the other half have exactly one b child. Therefore, no X element has both an a and a b child. The statistic DataGuide would assign r = 0.5 to both the a and the b extent. The path expression //X[a and b] would lead to an estimated cardinality of |X| · 0.5 · 0.5 = |X|/4, indicating that a quarter of the X elements have both an a and a b child.
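The effect of the independence assumption can be reproduced on synthetic data. The following sketch builds 100 hypothetical X elements, each represented as the set of its child names. The data set and all variable names are invented for illustration.

```python
# 100 synthetic X elements: half have only an a child, half only a b child
x_elements = [{"a"} if i % 2 == 0 else {"b"} for i in range(100)]
n = len(x_elements)

# r values as the statistic DataGuide would record them for the two extents
r_a = sum("a" in children for children in x_elements) / n
r_b = sum("b" in children for children in x_elements) / n

# Estimate for //X[a and b] under the independence assumption
estimated = n * r_a * r_b

# True number of X elements that have both children
actual = sum({"a", "b"} <= children for children in x_elements)

print(estimated, actual)
```

Here the estimate is a quarter of the X elements, while the actual count is zero because a and b never occur together.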
