Cost-Based Optimization of Integration Flows - Datenbanken ...
3.3 Periodical Re-Optimization
at T_2, we enrich the selectivity tuple with the missing information (due to reordering, we now have independent statistics for both operators). Over time, we use the exponential moving average (EMA) to adapt those statistics independently of the sliding time window. Finally, we see that the sequence (σ_A, σ_B) is the best choice because P(A) < P(B). We might make wrong decisions at the beginning, but the probability estimates converge to the real statistics over time. Hence, we do not make a wrong decision twice, as long as the probability comparison and the data dependency between both operators do not change. As long as a selectivity tuple contains a missing value for an operator o_2, we make the assumption of statistical independence and hence compute P(o_2) = P(o_1 ∧ o_2)/P(o_1).
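A minimal Python sketch of this bookkeeping, under stated assumptions: the class and function names are hypothetical (not from the text), and ALPHA is an assumed smoothing factor.

```python
# Hypothetical sketch: a selectivity tuple maintains P(o1) and P(o1 AND o2)
# via exponential moving average (EMA) updates; P(o2) stays missing (None)
# until the operators have been reordered once, and is derived under the
# independence assumption in the meantime.

ALPHA = 0.2  # assumed EMA smoothing factor


def ema(old, observed, alpha=ALPHA):
    """Blend a new observation into the running estimate."""
    return observed if old is None else (1 - alpha) * old + alpha * observed


class SelectivityTuple:
    def __init__(self):
        self.p_o1 = None     # P(o1), monitored for the first operator
        self.p_joint = None  # P(o1 AND o2), monitored joint selectivity
        self.p_o2 = None     # P(o2), only observable after a reordering

    def update(self, sel_o1, sel_joint, sel_o2=None):
        self.p_o1 = ema(self.p_o1, sel_o1)
        self.p_joint = ema(self.p_joint, sel_joint)
        if sel_o2 is not None:
            self.p_o2 = ema(self.p_o2, sel_o2)

    def prob_o2(self):
        # Independence assumption while P(o2) is missing:
        # P(o2) = P(o1 AND o2) / P(o1)
        if self.p_o2 is None:
            return self.p_joint / self.p_o1
        return self.p_o2
```

Because each statistic is updated with an EMA rather than recomputed over a window, old observations decay smoothly, which matches the convergence behavior described above.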
Furthermore, we compute the conditional probabilities for a given operator, based on the described selectivity tuple, as follows:

$$P(o_2 \mid o_1) = \frac{P(o_1 \wedge o_2)}{P(o_1)}$$

$$P(o_1 \mid o_2) =
\begin{cases}
\frac{P(o_1 \wedge o_2)}{P(o_2 \mid o_1)} = P(o_1) & \text{if } P(o_2) = \text{NULL} \\
\frac{P(o_1 \wedge o_2)}{P(o_2)} & \text{otherwise.}
\end{cases}
\quad (3.15)$$
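The case distinction of Equation (3.15) can be expressed directly; the following helper is a hypothetical sketch (the function name and argument order are not from the text):

```python
def cond_prob_o1_given_o2(p_o1, p_joint, p_o2=None):
    """Compute P(o1|o2) from a selectivity tuple per Equation (3.15).

    If P(o2) is missing (None), the independence assumption gives
    P(o2|o1) = P(o1 AND o2)/P(o1), so
    P(o1|o2) = P(o1 AND o2)/P(o2|o1) = P(o1).
    Otherwise, the monitored P(o2) is used directly.
    """
    if p_o2 is None:
        # NULL case: the joint probability cancels, leaving P(o1)
        return p_o1
    return p_joint / p_o2
```

For example, with P(o_1) = 0.5, P(o_1 ∧ o_2) = 0.1, and a monitored P(o_2) = 0.25, the second case yields P(o_1|o_2) = 0.1/0.25 = 0.4, whereas the NULL case falls back to P(o_1) = 0.5.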
Thus, we explicitly use the monitored conditional probabilities if available. It is important to note that we can only maintain the probability of the first operator as well as the joint probability of both operators. Hence, the additional problem of starvation in reordering decisions might occur: if the real probability of the second operator decreases, we cannot monitor this effect. In order to tackle this starvation problem in certain rewriting decisions, we use an aging strategy, where the probability of the second operator is slowly decreased over time. This prevents starvation because, over time, we reorder both operators due to the decreased probability and can then monitor the actual probability of the second operator. Although this can cause suboptimal plans if the workload does not change, it prevents starvation and converges to the real selectivities and probabilities.
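The aging strategy itself is simple to sketch; the decay factor below is an assumed value, and the function name is hypothetical:

```python
# Hypothetical aging sketch: slowly decay the stored (unobservable) P(o2)
# so that it eventually drops below P(o1), the optimizer reorders the two
# operators, and fresh statistics for o2 become observable again.

AGING_FACTOR = 0.95  # assumed multiplicative decay per optimization interval


def age_second_operator(p_o2, factor=AGING_FACTOR):
    """Decrease the stored probability of the second operator over time."""
    return None if p_o2 is None else p_o2 * factor
```

Once the aged estimate falls below the first operator's probability, the reordering is triggered and the true selectivity of the second operator can be monitored again, which is exactly the anti-starvation behavior described above.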
For arbitrary chains of operators, the correlation table is used recursively along operators that are directly connected by data dependencies, such that an operator might be included in multiple entries of the correlation table. For such chains, we only allow the reordering of directly connected operators. After a reordering, all entries of the correlation table that refer to a removed data dependency are removed as well, and new entries are created for the new data dependencies.
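The table maintenance for chains can be sketched as follows; this is an illustrative interpretation with hypothetical names, assuming the correlation table is keyed by directly connected operator pairs:

```python
# Hypothetical sketch: swap two adjacent operators in a chain and update a
# correlation table keyed by directly connected operator pairs. Entries that
# refer to a data dependency removed by the swap are dropped, and (empty)
# entries are created for the newly introduced dependencies.

def reorder_adjacent(chain, i, corr_table):
    """Swap chain[i] and chain[i+1]; keep corr_table consistent."""
    def edges_around(seq):
        # data dependencies touched by the swap: (i-1,i), (i,i+1), (i+1,i+2)
        return [(seq[j], seq[j + 1])
                for j in (i - 1, i, i + 1)
                if 0 <= j and j + 1 < len(seq)]

    for edge in edges_around(chain):
        corr_table.pop(edge, None)           # remove stale entries
    chain[i], chain[i + 1] = chain[i + 1], chain[i]
    for edge in edges_around(chain):
        corr_table.setdefault(edge, {})      # create fresh, empty entries
    return chain, corr_table
```

The fresh entries start empty, so their selectivity tuples are rebuilt from monitored statistics, matching the behavior described for removed and newly created data dependencies.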
In conclusion, the adaptive behavior of our conditional selectivity approach ensures more robust and stable optimizer decisions for conditional probabilities and correlated data, even in the context of changing workload characteristics. Note that there are more sophisticated, heavyweight approaches for correlation-aware selectivity estimation in the context of DBMS (e.g., using a maximum entropy approach [MHK+07, MMK+05] or a measure of clusteredness [HNM03]). We could also maintain adaptable multidimensional histograms [BCG01, AC99, LWV03, MVW00, KW99]. However, in contrast to DBMS, where data is static and statistics are required for arbitrary predicates, we optimize the average plan execution time with known predicates but dynamic data. Hence, the introduced lightweight correlation table can be used instead of heavyweight multidimensional histograms, where we would need to maintain exact or approximate frequencies with regard to arbitrary conjunctive predicates [Pol05, TDJ10]. However, other approaches can be integrated as well. Due to the problem of missing statistics about