25.01.2015 Views

Cost-Based Optimization of Integration Flows - Datenbanken ...


3 Fundamentals of Optimizing Integration Flows

and $\sigma_B$ and the following gathered statistics:

$\sigma_A$: $|ds_{in}| = 100$, $|ds_{out}| = 40$; $\sigma_B$: $|ds_{in}| = 40$, $|ds_{out}| = 8$

$$
\begin{aligned}
\Rightarrow P(A) &= 0.4 && \text{// monitored} \\
P(B) = P(B \mid A) &= 0.2 && \text{// computed assuming independence} \\
P(A \wedge B) &= 0.08. && \text{// monitored}
\end{aligned}
$$

Assuming statistical independence, we set $P(B) = P(B \mid A)$ and, based on the comparison $P(A) > P(B)$, we reorder the sequence $(\sigma_A, \sigma_B)$ to $(\sigma_B, \sigma_A)$. Using the new plan, we gather the following statistics:

$\sigma_B$: $|ds_{in}| = 100$, $|ds_{out}| = 68$; $\sigma_A$: $|ds_{in}| = 68$, $|ds_{out}| = 8$

$$
\begin{aligned}
\Rightarrow P(B) &= 0.68 && \text{// monitored} \\
P(A) = P(A \mid B) &\approx 0.12 && \text{// computed assuming independence} \\
P(A \wedge B) &= 0.08. && \text{// monitored}
\end{aligned}
$$

Clearly, the predicates $A$ and $B$ are strongly correlated ($P(\neg A \wedge \neg B) = 0$). Due to the simplifying assumption of independence, we would now conclude that $P(B) > P(A)$ and hence reorder the operators back to the initial plan. As a result, even in the presence of constant statistics, we would reorder the plan back and forth and thus produce inefficient plans.
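The back-and-forth reordering can be reproduced with a minimal sketch. It assumes a synthetic set of 100 records chosen only to match the cardinalities of the example (the source gives counts, not records), and the names `RECORDS`, `SATISFIES`, `monitor`, and `reorder_assuming_independence` are hypothetical helpers, not part of the described system.

```python
# Synthetic data reproducing the cardinalities of the example.
# Assumption: the source reports only counts, so the concrete record ids are invented.
RECORDS = set(range(100))
SATISFIES = {
    "A": set(range(0, 40)),    # 40 records pass sigma_A
    "B": set(range(32, 100)),  # 68 records pass sigma_B; 8 of them also pass sigma_A
}


def monitor(plan):
    """Run the selections in plan order and return the monitored
    selectivity |ds_out| / |ds_in| of each operator."""
    stats = {}
    ds = RECORDS
    for op in plan:
        out = ds & SATISFIES[op]
        stats[op] = len(out) / len(ds)
        ds = out
    return stats


def reorder_assuming_independence(plan):
    """Treat every monitored selectivity as unconditional and place the
    operator with the smaller selectivity first."""
    stats = monitor(plan)
    return tuple(sorted(plan, key=lambda op: stats[op]))


plan = ("A", "B")
history = [plan]
for _ in range(4):
    plan = reorder_assuming_independence(plan)
    history.append(plan)

print(history)
```

Although the data never changes, the plan alternates between `("A", "B")` and `("B", "A")` in every optimization step, because the selectivity monitored for the second operator is always conditional on the first.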

Our approach of conditional selectivities explicitly takes those conditional probabilities into account in order to overcome that problem when reordering selective dataflow-oriented operators (e.g., Selection, Projection with duplicate elimination, Join, Groupby, Setoperation) or the paths of a Switch operator. Essentially, we maintain selectivity statistics over multiple versions of a plan, independently of the sliding time window statistics. Therefore, for each pair of data-flow-oriented operators with direct data dependency within the current plan, we maintain a row of selectivities:

$$
(o_1, o_2, P(o_1), P(o_2), P(o_1 \wedge o_2)), \tag{3.14}
$$

where both operators $o_1$ and $o_2$ are identified, and we store the selectivity of each operator as well as the conjunct selectivity of both operators. The approach works similarly for binary operators and reordered Switch paths. Because only two operators are compared at a time, the overhead is fairly low. In the worst case, there are $m^2$ selectivity tuples, where $m$ denotes the number of operators. We revisit the example in order to explain that concept in detail.
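One way to picture such a row of selectivities is the following minimal sketch. It is an illustration under stated assumptions rather than the implementation of the described system: the class `SelectivityRow`, its field names, and the `observe` and `conditional` methods are hypothetical, and `None` is used here to mark a selectivity that has not yet been observed unconditionally.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SelectivityRow:
    """One row (o1, o2, P(o1), P(o2), P(o1 and o2)); None means 'not yet known'."""
    o1: str
    o2: str
    p_o1: Optional[float] = None
    p_o2: Optional[float] = None
    p_joint: Optional[float] = None

    def observe(self, first: str, n_in: int, n_mid: int, n_out: int) -> None:
        """Record one monitoring period in which `first` ran before the other operator.

        n_in  : input cardinality of the first operator
        n_mid : output cardinality of the first operator
        n_out : output cardinality of the second operator
        Only the first operator's selectivity and the joint selectivity are
        stored, since the second operator's monitored value is conditional.
        """
        if first == self.o1:
            self.p_o1 = n_mid / n_in
        elif first == self.o2:
            self.p_o2 = n_mid / n_in
        else:
            raise ValueError(f"{first!r} is not part of this row")
        self.p_joint = n_out / n_in

    def conditional(self, given: str) -> Optional[float]:
        """Return P(other | given) = P(o1 and o2) / P(given), if both are known."""
        p_given = self.p_o1 if given == self.o1 else self.p_o2
        if p_given is None or self.p_joint is None or p_given == 0:
            return None
        return self.p_joint / p_given


row = SelectivityRow("sigma_A", "sigma_B")
row.observe("sigma_A", n_in=100, n_mid=40, n_out=8)   # plan (sigma_A, sigma_B)
row.observe("sigma_B", n_in=100, n_mid=68, n_out=8)   # plan (sigma_B, sigma_A)
print(row)
```

With the cardinalities of the running example, the row ends up holding the unconditional values $P(\sigma_A) = 0.4$, $P(\sigma_B) = 0.68$ and the joint selectivity $0.08$, so the two operators can be compared without relying on the independence assumption.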

Example 3.8 (Monitored Conditional Selectivities). Assume the same setting as in Example 3.7. When reordering $\sigma_A$ and $\sigma_B$ at timestamp $T_1$, we create a new statistic tuple as shown in Table 3.5.

Table 3.5: Conditional Selectivity Table

        $o_1$        $o_2$        $P(o_1)$   $P(o_2)$   $P(o_1 \wedge o_2)$
$T_1$   $\sigma_A$   $\sigma_B$   0.4        -          0.08
$T_2$   $\sigma_A$   $\sigma_B$   0.4        0.68       0.08

We only include probabilities that are known not to be conditional (the first operator and the combination of both operators). If we evaluate the temporal order of those operators
