Cost-Based Optimization of Integration Flows - Datenbanken ...
3.3 Periodical Re-Optimization
The MA causes the slowest adaptation because of the simple average, where all items are equally weighted, while the WMA (linear weights) and the EMA (exponential weights) support a faster adaptation because the latest items have the highest influence. However, LR achieves the fastest adaptation because the estimate is extrapolated from the last items. Unfortunately, LR tends to over- and underestimate on abrupt changes. Further, full re-computation with all methods can be realized with linear complexity of O(n). A detailed explanation of the different adaptation parameters with regard to monitored execution statistics is given within the experimental evaluation. We also experimented with Polynomial Regression (PR) up to a degree of four. Due to its high over- and underestimation on abrupt changes, we do not use this aggregation method. Apart from these aggregation methods, one can use forecast model types [DB07, BJR94] in order to detect recurring patterns and thus increase the estimation accuracy. However, doing so is a trade-off between the estimation overhead and the benefit achieved by more accurate estimation. Experiments have shown that the simple exponential moving average (EMA) robustly achieves the highest accuracy with low estimation overhead, such that we use this method as our default workload aggregation strategy. There, we do not use any automatic parameter estimation. However, the automatic evaluation and adaptation of this parameter could be seamlessly integrated into the periodical re-optimization framework.
Recall the parameters optimization interval ∆t and sliding time window size ∆w. If ∆t ≥ ∆w, statistic tuples are only used at one specific optimization timestamp. In this case, we simply compute those statistics from scratch using the workload aggregation methods. However, if ∆t < ∆w, this is an inefficient approach because we aggregate portions of the statistic tuples multiple times. In such a case, incremental maintenance of workload statistics, in the sense of updating the aggregate with new statistics, is required. In general, incremental statistics maintenance is possible for all of these aggregation methods. However, note that MA and LR require both negative maintenance (tuples implicitly removed based on time) and positive maintenance (new tuples) according to the sliding time window size ∆w (and thus, atomic statistics must be stored), while EMA and WMA do not require negative maintenance due to the increasing weights, where the influence of older tuples can be neglected. In conclusion, we use the EMA as the default workload aggregation method.
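The maintenance contrast can be sketched as follows; class and parameter names are hypothetical, chosen only to illustrate that EMA needs positive maintenance alone, while a windowed MA must store atomic statistics to expire old tuples:

```python
from collections import deque

class IncrementalEMA:
    """Positive maintenance only: the running estimate absorbs each new
    statistic tuple; old tuples fade implicitly via the exponential weights."""
    def __init__(self, alpha=0.5):
        self.alpha = alpha
        self.estimate = None

    def add(self, value):
        if self.estimate is None:
            self.estimate = value
        else:
            self.estimate = self.alpha * value + (1 - self.alpha) * self.estimate
        return self.estimate

class SlidingWindowMA:
    """MA over a sliding time window of size `window` (standing in for ∆w):
    atomic statistics must be stored so that expired tuples can be removed
    (negative maintenance) in addition to adding new ones."""
    def __init__(self, window):
        self.window = window
        self.items = deque()
        self.total = 0.0

    def add(self, timestamp, value):
        self.items.append((timestamp, value))   # positive maintenance
        self.total += value
        while self.items and self.items[0][0] <= timestamp - self.window:
            _, old = self.items.popleft()       # negative maintenance
            self.total -= old
        return self.total / len(self.items)
```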
Finally, the workload aggregation itself can be optimized. Similar to the basic counting algorithm [DGIM02], approximate incremental statistics maintenance over the sliding window, using summarizing data structures, is possible. Further, following workload shift detection approaches [HR07, HR08], it might be possible to minimize the optimization costs by deferring the cost re-estimation until predicted workload shifts (anticipatory re-optimization).
3.3.4 Handling Correlation and Conditional Probabilities
In the context of missing knowledge about the data properties of external systems, the main problem of cost estimation is correlation and conditional probabilities (more precisely, relative frequencies). Assume two successive Selection operators σ_A and σ_B. In fact, we are only able to monitor P(A) and P(B|A). A naïve approach would be to assume that all predicates, and hence the resulting selectivities and cardinalities, are independent. However, this can lead to wrong estimates [BMM+04] that would result in non-optimal plans or in changing a plan back and forth.
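To see how large the error can get, consider a small numerical sketch with hypothetical selectivities (all values invented for illustration): we monitor P(A) and P(B|A), whereas the independence assumption would multiply P(A) by a separately monitored marginal P(B).

```python
# Hypothetical monitored selectivities for two successive selections σ_A, σ_B
# whose predicates are positively correlated.
p_a = 0.5          # monitored P(A)
p_b_given_a = 0.9  # monitored P(B|A)
p_b = 0.5          # hypothetical marginal P(B), used by the naive approach

cardinality_in = 1000.0

# Correct intermediate cardinality after σ_B uses the conditional probability.
correct = cardinality_in * p_a * p_b_given_a   # 450 tuples

# The independence assumption multiplies marginals and underestimates here.
naive = cardinality_in * p_a * p_b             # 250 tuples
```

With correlated predicates the naïve estimate is off by almost a factor of two, which is exactly the kind of error that can flip the optimizer between plans.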
Example 3.7 (Monitored Selectivities). Assume two successive Selection operators σ_A