Cost-Based Optimization of Integration Flows - Datenbanken ...

3.3 Periodical Re-Optimization

The MA causes the slowest adaptation because of the simple average, where all items are equally weighted, while the WMA (linear weights) and the EMA (exponential weights) support faster adaptation due to the higher influence of the latest items. However, the LR achieves the fastest adaptation because the estimate is extrapolated from the last items. Unfortunately, the LR tends to over- and underestimate on abrupt changes. Further, full re-computation with all methods can be realized with linear complexity of O(n). A detailed explanation of the different adaptation parameters with regard to monitored execution statistics is given within the experimental evaluation. We also experimented with Polynomial Regression (PR) with up to a degree of four. Due to its high over- and underestimation on abrupt changes, we do not use this aggregation method. Apart from these aggregation methods, one can use forecast model types [DB07, BJR94] in order to detect recurring patterns and thus increase the estimation accuracy. However, doing so is a trade-off between the estimation overhead and the benefit achieved by more accurate estimation. Experiments have shown that the simple exponential moving average (EMA) robustly achieves the highest accuracy with low estimation overhead, such that we use this method as our default workload aggregation strategy. There, we do not use any automatic parameter estimation. However, the automatic evaluation and adaptation of this parameter could be seamlessly integrated into the periodical re-optimization framework.

Recall the parameters optimization interval ∆t and sliding time window size ∆w. If ∆t ≥ ∆w, statistic tuples are only used at one specific optimization timestamp. In this case, we simply compute those statistics from scratch using the workload aggregation methods. However, if ∆t < ∆w, this is an inefficient approach because we aggregate portions of the statistic tuples multiple times. In such a case, incremental maintenance of workload statistics (in the sense of updating the aggregate with new statistics) is required. In general, incremental statistics maintenance is possible for all of these aggregation methods. However, note that the MA and the LR require both negative maintenance (tuples implicitly removed based on time) and positive maintenance (new tuples) according to the sliding time window size ∆w (and thus, the atomic statistics must be stored), while the EMA and the WMA do not require negative maintenance due to the increasing weights, where the influence of older tuples can be neglected. In conclusion, we use the EMA as the default workload aggregation method.
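The maintenance difference can be sketched as follows (our own illustration, not the thesis implementation; for simplicity the sliding window is count-based rather than time-based):

```python
from collections import deque

class IncrementalEMA:
    """EMA needs only positive maintenance: a single aggregate value
    suffices, no atomic statistic tuples are stored."""
    def __init__(self, alpha=0.5):
        self.alpha = alpha
        self.est = None

    def add(self, x):  # positive maintenance only
        if self.est is None:
            self.est = x
        else:
            self.est = self.alpha * x + (1 - self.alpha) * self.est
        return self.est

class IncrementalMA:
    """MA over a sliding window needs the atomic tuples so that expired
    items can be subtracted again (negative maintenance). Here the
    window is simplified to a maximum item count instead of ∆w."""
    def __init__(self, window):
        self.tuples = deque()
        self.total = 0.0
        self.window = window

    def add(self, x):
        self.tuples.append(x)          # positive maintenance
        self.total += x
        if len(self.tuples) > self.window:
            self.total -= self.tuples.popleft()  # negative maintenance
        return self.total / len(self.tuples)
```

The EMA state is a single number regardless of window size, whereas the MA must retain every tuple still inside the window.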

Finally, the workload aggregation can be optimized. Similar to the basic counting algorithm [DGIM02], approximate incremental statistics maintenance over the sliding window, with summarizing data structures, is possible. Further, according to workload shift detection approaches [HR07, HR08], it might be possible to minimize the optimization costs by deferring the cost re-estimation until predicted workload shifts (anticipatory re-optimization).

3.3.4 Handling Correlation and Conditional Probabilities

In the context of missing knowledge about the data properties of external systems, the main problem of cost estimation is correlation and conditional probabilities (more precisely, relative frequencies). Assume two successive Selection operators σ_A and σ_B. In fact, we are only able to monitor P(A) and P(B|A). A naïve approach would be to assume that all predicates and the resulting selectivities and cardinalities are independent. However, this can lead to wrong estimates [BMM+04] that would result in non-optimal plans or in changing a plan back and forth.
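A small numeric sketch (hypothetical selectivities, not taken from the thesis) shows how the independence assumption mis-estimates the joint selectivity of two correlated predicates:

```python
# Hypothetical monitored selectivities for two correlated predicates:
p_a = 0.5           # monitored P(A)
p_b_given_a = 0.9   # monitored P(B|A): B is strongly correlated with A
p_b = 0.5           # marginal P(B), as the naive model would use it

correct = p_a * p_b_given_a   # exact joint selectivity P(A and B) = 0.45
naive = p_a * p_b             # independence assumption        -> 0.25

# With 10,000 input tuples, the naive estimate is off by roughly
# 2,000 tuples after both selections, easily enough to make the
# optimizer pick a non-optimal operator order.
n = 10_000
error = abs(correct - naive) * n
```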

Example 3.7 (Monitored Selectivities). Assume two successive Selection operators σ_A

