Cost-Based Optimization of Integration Flows - Datenbanken ...
3 Fundamentals of Optimizing Integration Flows<br />
and σ_B and the following gathered statistics:<br />
σ_A: |ds_in| = 100, |ds_out| = 40; σ_B: |ds_in| = 40, |ds_out| = 8<br />
⇒ P(A) = 0.4 // monitored<br />
P(B) = P(B|A) = 0.2 // computed assuming independence<br />
P(A ∧ B) = 0.08 // monitored<br />
Assuming statistical independence, we set P(B) = P(B|A) and, because P(A) > P(B),<br />
we reorder the sequence (σ_A, σ_B) to (σ_B, σ_A). Using the new plan, we<br />
gather the following statistics:<br />
σ_B: |ds_in| = 100, |ds_out| = 68; σ_A: |ds_in| = 68, |ds_out| = 8<br />
⇒ P(B) = 0.68 // monitored<br />
P(A) = P(A|B) ≈ 0.12 // computed assuming independence<br />
P(A ∧ B) = 0.08 // monitored<br />
Clearly, the predicates A and B are strongly correlated (P(¬A ∧ ¬B) = 0). Due to the simplifying<br />
assumption of independence, we would now assume that P(B) > P(A). Hence, we would<br />
reorder the operators back to the initial plan. As a result, even in the presence of constant<br />
statistics, we would reorder the plan back and forth and thus produce inefficient plans.<br />
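The back-and-forth reordering can be replayed numerically. The following is a minimal sketch that uses only the cardinalities given above. The function names `monitor` and `reorder_under_independence` are hypothetical and not part of the original text, and the reordering rule is the simple selectivity comparison described above.

```python
def monitor(ds_in, ds_mid, ds_out):
    """Derive selectivities from the cardinalities of a two-operator chain.

    Returns (P(first), P(second | first), P(first AND second)).
    """
    p_first = ds_mid / ds_in
    p_second_given_first = ds_out / ds_mid
    p_joint = ds_out / ds_in
    return p_first, p_second_given_first, p_joint


def reorder_under_independence(p_first, p_second_given_first):
    # Independence assumption: P(second | first) is treated as P(second).
    # Swap when the first operator passes a larger fraction than the second appears to.
    return p_first > p_second_given_first


# Initial plan (sigma_A, sigma_B): 100 -> 40 -> 8
p_a, p_b_given_a, p_ab = monitor(100, 40, 8)
swap_1 = reorder_under_independence(p_a, p_b_given_a)

# Reordered plan (sigma_B, sigma_A): 100 -> 68 -> 8
p_b, p_a_given_b, p_ba = monitor(100, 68, 8)
swap_2 = reorder_under_independence(p_b, p_a_given_b)

print(swap_1, swap_2)
```

Both decisions come out as "swap", although the underlying statistics never change. This is the oscillation described above.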
Our approach of conditional selectivities explicitly takes those conditional probabilities<br />
into account in order to overcome that problem when reordering selective data-flow-oriented<br />
operators (e.g., Selection, Projection with duplicate elimination, Join,<br />
Groupby, Setoperation) or the paths of a Switch operator. Essentially, we maintain<br />
selectivity statistics over multiple versions of a plan, independently of the sliding time<br />
window statistics. Therefore, for each pair of data-flow-oriented operators with direct<br />
data dependency within the current plan, we maintain a row of selectivities:<br />
(o_1, o_2, P(o_1), P(o_2), P(o_1 ∧ o_2)), (3.14)<br />
where both operators o_1 and o_2 are identified, and we store the selectivity of each operator as well as the<br />
conjunctive selectivity of both operators. The approach works similarly for binary operators<br />
and reordered Switch paths. Because only two operators are compared<br />
at a time, the overhead is fairly low. In the worst case, there are m^2 selectivity tuples,<br />
where m denotes the number of operators. We revisit the example in order to explain that<br />
concept in detail.<br />
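One possible in-memory representation of the selectivity tuple (3.14) is sketched below. It is an illustration under stated assumptions, not the author's implementation. The class and method names are hypothetical. The decision rule in `preferred_first` is also an assumption: it compares the two unconditional selectivities, which minimizes the intermediate result size for two operators with equal per-tuple costs.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Optional


@dataclass
class SelectivityTuple:
    """One row (o1, o2, P(o1), P(o2), P(o1 AND o2)); unknown values are None."""
    o1: str
    o2: str
    p_o1: Optional[float] = None
    p_o2: Optional[float] = None
    p_joint: Optional[float] = None


class ConditionalSelectivityTable:
    """Keeps one row per operator pair across plan versions (hypothetical sketch)."""

    def __init__(self) -> None:
        self._rows: Dict[FrozenSet[str], SelectivityTuple] = {}

    def record(self, first: str, second: str,
               ds_in: int, ds_mid: int, ds_out: int) -> SelectivityTuple:
        """Store only the non-conditional values monitored for the plan (first, second)."""
        key = frozenset((first, second))
        row = self._rows.setdefault(key, SelectivityTuple(first, second))
        p_first = ds_mid / ds_in  # the first operator sees the unfiltered input
        if first == row.o1:
            row.p_o1 = p_first
        else:
            row.p_o2 = p_first
        row.p_joint = ds_out / ds_in  # independent of the operator order
        return row

    def preferred_first(self, a: str, b: str) -> Optional[str]:
        """Assumed rule: run the operator with the lower unconditional selectivity first."""
        row = self._rows.get(frozenset((a, b)))
        if row is None or row.p_o1 is None or row.p_o2 is None:
            return None  # not enough unconditional statistics yet
        return row.o1 if row.p_o1 <= row.p_o2 else row.o2


table = ConditionalSelectivityTable()
row = table.record("sigma_A", "sigma_B", 100, 40, 8)      # plan (sigma_A, sigma_B)
after_first_plan = (row.p_o1, row.p_o2, row.p_joint)
undecided = table.preferred_first("sigma_A", "sigma_B")   # P(sigma_B) still unknown
row = table.record("sigma_B", "sigma_A", 100, 68, 8)      # plan (sigma_B, sigma_A)
after_second_plan = (row.p_o1, row.p_o2, row.p_joint)
choice = table.preferred_first("sigma_A", "sigma_B")
```

Because the row is keyed by the unordered operator pair, the statistics survive a reordering, and the comparison no longer has to fall back on a conditional selectivity.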
Example 3.8 (Monitored Conditional Selectivities). Assume the same setting as in Example<br />
3.7. When reordering σ_A and σ_B at timestamp T_1, we create a new statistic tuple<br />
as shown in Table 3.5.<br />
Table 3.5: Conditional Selectivity Table<br />
Timestamp | o_1 | o_2 | P(o_1) | P(o_2) | P(o_1 ∧ o_2)<br />
T_1 | σ_A | σ_B | 0.4 | - | 0.08<br />
T_2 | σ_A | σ_B | 0.4 | 0.68 | 0.08<br />
We only include probabilities that are known not to be conditional (the first operator and<br />
the combination of both operators). If we evaluate the temporal order of those operators<br />