Cost-Based Optimization of Integration Flows - Datenbanken ...
3 Fundamentals of Optimizing Integration Flows

Problem 3.2 (Changing Workload Characteristics). In the context of integration flows
that integrate distributed systems and applications, the workload characteristics (in the
form of statistics such as selectivities, cardinalities, and execution times) can change significantly
over time [IHW04, NRB09, DIR07, CC08, LSM+07, BMM+04, MSHR02]. This
can be caused by unpredictable workloads of the external systems (e.g., number of update
transactions, amount of data to be integrated, waiting times for external systems) or by temporal
variations of infrastructural properties (e.g., network traffic or bandwidth). These
influences change the workload characteristics of the integration platform: the number
of plan instances, the cardinalities and selectivities, and the execution times
when accessing external systems all vary.

As a result of Problem 3.2, rule-based optimized plans, where no execution statistics
are used for optimization, fall short and may perform inefficiently over time. For the
same reason, manually optimized plans, where an administrator specifies a fixed
(hand-crafted) plan, cannot be applied either [Win03]. Cost-based optimization, using data
properties and execution statistics, has the potential to overcome this deficit because
optimal execution plans are generated with regard to the current statistics. By keeping
these statistics up-to-date, adaptation to changing workload characteristics is possible.
However, for integration flows, there is the additional problem of missing statistics:

Problem 3.3 (Missing Knowledge about Statistics of External Systems). One of the
main problems of integration flow optimization is the lack of knowledge about data characteristics
(e.g., cardinalities, selectivities, ordering) and execution statistics (e.g., execution
times, bandwidth) of the different data sources [IHW04]. Due to the integration of autonomous
(loosely-coupled) source systems, the integration platform usually has no access
to statistical information of the external systems, if such statistics exist at all.

However, execution statistics are required for cost-based optimization. Due to the missing
statistics, the central integration platform has to incrementally maintain the workload
characteristics and execution statistics by itself. The combination of Problem 3.2 and
Problem 3.3 leads to the need for a cost-based optimization approach that incrementally
maintains execution statistics and re-optimizes given plans if necessary. Unfortunately, existing
cost-based approaches [SMWM06, SVS05] follow an optimize-always model, where
optimization is synchronously triggered for each plan instance before it is executed. This
optimize-always model falls short in the presence of many short-running plan instances,
where the optimization time might even exceed the execution time of an instance.
In addition, these existing approaches do not consider an overall cost-based optimization
framework but only investigate selected cost-based optimization techniques in isolation.
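
To make the overhead argument concrete, the following minimal sketch compares the two models under simplifying assumptions that are not taken from the source: every plan instance has the same execution time, every optimizer run has the same cost, and optimization time is simply added to the total. The function names and all numbers are hypothetical.

```python
def total_time_optimize_always(n_instances: int, exec_ms: int, opt_ms: int) -> int:
    """Every instance pays for one synchronous optimizer run before executing."""
    return n_instances * (opt_ms + exec_ms)


def total_time_periodic(n_instances: int, exec_ms: int, opt_ms: int,
                        n_optimizations: int) -> int:
    """Instances only execute; the optimizer runs a fixed number of times overall."""
    return n_instances * exec_ms + n_optimizations * opt_ms


# Hypothetical workload: 1000 short-running instances of 5 ms each,
# with an optimizer run that takes 8 ms (longer than one instance).
always = total_time_optimize_always(1000, exec_ms=5, opt_ms=8)
periodic = total_time_periodic(1000, exec_ms=5, opt_ms=8, n_optimizations=10)
print(always, periodic)  # 13000 5080
```

In this toy setting the optimize-always model spends more time optimizing than executing, which is the situation the paragraph above describes. The sketch deliberately ignores that a better plan may also shorten the execution time.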

Consequently, we introduce the general concept of cost-based optimization of imperative
integration flows. As a starting point, we follow the optimization objective of
minimizing the average execution time of a plan, which implicitly increases the message
throughput as well. The core concept is (1) to incrementally monitor workload characteristics
and execution statistics, and (2) to periodically re-optimize given plans using a set of
cost-based optimization techniques. As a result, this approach enables cost-based rewriting
decisions and adapts suitably to changing workload characteristics,
while it requires less optimization overhead than the optimize-always model.
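
The two-step core concept can be sketched as follows. This is an illustration under stated assumptions, not the framework of the source: the class names `RunningMean` and `PeriodicOptimizer` are hypothetical, the re-optimization period is counted in plan instances rather than in time, and the only rewriting technique shown is reordering filter operators by observed selectivity (which assumes equal per-tuple costs).

```python
from dataclasses import dataclass, field


@dataclass
class RunningMean:
    """Incrementally maintained mean; no history of observations is stored."""
    count: int = 0
    mean: float = 0.0

    def update(self, value: float) -> None:
        self.count += 1
        self.mean += (value - self.mean) / self.count


@dataclass
class PeriodicOptimizer:
    """Hypothetical sketch: monitor every instance, re-optimize every `period` instances."""
    plan: list                      # current order of filter operators (by name)
    period: int                     # re-optimization interval in plan instances
    selectivity: dict = field(default_factory=dict)
    executed: int = 0
    optimizations: int = 0

    def record_instance(self, observed: dict) -> None:
        """Step (1): fold the selectivities observed by one instance into the statistics."""
        for op, sel in observed.items():
            self.selectivity.setdefault(op, RunningMean()).update(sel)
        self.executed += 1
        if self.executed % self.period == 0:
            self.reoptimize()

    def reoptimize(self) -> None:
        """Step (2): cost-based rewrite; here, the most selective filter runs first."""
        self.plan.sort(
            key=lambda op: self.selectivity[op].mean if op in self.selectivity else 1.0
        )
        self.optimizations += 1


# Usage: operator "B" filters more aggressively than "A", so it moves to the front.
opt = PeriodicOptimizer(plan=["A", "B"], period=3)
for _ in range(3):
    opt.record_instance({"A": 0.5, "B": 0.25})
print(opt.plan, opt.optimizations)  # ['B', 'A'] 1
```

The optimizer runs once per three instances instead of once per instance, while the statistics are still updated on every execution, so the plan can follow changing workload characteristics.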

The cost-based optimization of integration flows differs from cost-based optimization for
(1) programming languages or (2) data management systems for several reasons. First,
although optimizers of programming language compilers optimize imperative programs,