25.01.2015 Views

Cost-Based Optimization of Integration Flows - Datenbanken ...

3 Fundamentals of Optimizing Integration Flows

Problem 3.2 (Changing Workload Characteristics). In the context of integration flows that integrate distributed systems and applications, the workload characteristics, in the form of statistics such as selectivities, cardinalities, and execution times, can change significantly over time [IHW04, NRB09, DIR07, CC08, LSM+07, BMM+04, MSHR02]. This can be caused by unpredictable workloads of the external systems (e.g., number of update transactions, amount of data to be integrated, waiting times for external systems) or temporal variations of infrastructural properties (e.g., network traffic or bandwidth). These influences lead to changing workload characteristics of the integration platform in the sense of different numbers of plan instances, different cardinalities and selectivities, as well as different execution times when accessing external systems.

As a result of Problem 3.2, rule-based optimized plans, for which no execution statistics are used during optimization, fall short and may perform inefficiently over time. For the same reason, manually optimized plans, where a fixed (hand-crafted) plan is specified by an administrator, cannot be applied either [Win03]. Cost-based optimization, using data properties and execution statistics, has the potential to overcome this deficit because optimal execution plans are generated with regard to the current statistics. By keeping these statistics up to date, adaptation to changing workload characteristics is possible.
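The effect of changing statistics on a fixed plan can be illustrated with a minimal sketch. It assumes a plan that consists of two reorderable filter operators and a simple cost model in which each operator is charged for the fraction of messages that reaches it; the operator names, costs, and selectivities are hypothetical and not taken from the approach described here.

```python
from itertools import permutations


def plan_cost(order, stats):
    """Expected cost per input message when filters run in the given order.

    stats maps an operator name to (cost per message, selectivity).
    """
    cost, fraction = 0.0, 1.0
    for name in order:
        per_message_cost, selectivity = stats[name]
        cost += fraction * per_message_cost  # only surviving messages are processed
        fraction *= selectivity              # share of messages passed on
    return cost


def best_order(stats):
    """Exhaustively pick the cheapest operator order for the given statistics."""
    return min(permutations(sorted(stats)), key=lambda order: plan_cost(order, stats))


# Hypothetical statistics at two points in time (assumed values).
stats_before = {"filter_a": (1.0, 0.1), "filter_b": (1.0, 0.9)}
stats_after = {"filter_a": (1.0, 0.9), "filter_b": (1.0, 0.1)}

fixed_plan = best_order(stats_before)    # plan chosen once and never revised
adapted_plan = best_order(stats_after)   # plan chosen with current statistics

cost_fixed = plan_cost(fixed_plan, stats_after)
cost_adapted = plan_cost(adapted_plan, stats_after)
```

Under these assumed numbers, the order that was optimal for the earlier statistics becomes the more expensive one once the selectivities swap, which is the situation a fixed rule-based or hand-crafted plan cannot react to.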

However, for integration flows, there is the additional problem of missing statistics:

Problem 3.3 (Missing Knowledge about Statistics of External Systems). One of the main problems of integration flow optimization is the lack of knowledge about data characteristics (e.g., cardinalities, selectivities, ordering) and execution statistics (execution times, bandwidth) of the different data sources [IHW04]. Due to the integration of autonomous (loosely-coupled) source systems, the integration platform usually has no access to statistical information of the external systems, if they exist at all.

However, execution statistics are required for cost-based optimization. Due to the missing statistics, the central integration platform has to incrementally maintain the workload characteristics and execution statistics by itself. The combination of Problem 3.2 and Problem 3.3 leads to the need for a cost-based optimization approach that incrementally maintains execution statistics and re-optimizes given plans if necessary. Unfortunately, existing cost-based approaches [SMWM06, SVS05] follow an optimize-always model, where optimization is synchronously triggered for each plan instance before it is executed. This optimize-always model falls short in the presence of many short-running plan instances, where the optimization time might be even higher than the execution time of an instance. In addition, these existing approaches do not consider an overall cost-based optimization framework but only investigate selected cost-based optimization techniques in isolation.
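To make the overhead argument concrete, the following minimal sketch computes the total elapsed time under an optimize-always model and the share of that time spent on optimization. The function names and the numbers (10,000 instances, 5 ms optimization, 2 ms execution) are illustrative assumptions, not measurements of the cited approaches.

```python
def total_time_optimize_always(n_instances, t_opt, t_exec):
    """Total time if every plan instance is optimized synchronously before it runs."""
    return n_instances * (t_opt + t_exec)


def optimization_share(t_opt, t_exec):
    """Fraction of the per-instance time that is spent on optimization."""
    return t_opt / (t_opt + t_exec)


# Assumed workload: many short-running instances, times in milliseconds.
n_instances, t_opt_ms, t_exec_ms = 10_000, 5, 2

total_ms = total_time_optimize_always(n_instances, t_opt_ms, t_exec_ms)
share = optimization_share(t_opt_ms, t_exec_ms)
execution_only_ms = n_instances * t_exec_ms
```

With these assumed values, optimization takes longer than execution for every instance, so more than half of the total time is optimization overhead rather than useful work.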

In consequence, we introduce the general concept of cost-based optimization of imperative integration flows. As a starting point, we follow the optimization objective of minimizing the average execution time of a plan, which implicitly increases the message throughput as well. The core concept is (1) to incrementally monitor workload characteristics and execution statistics, and (2) to periodically re-optimize given plans using a set of cost-based optimization techniques. As a result, this approach enables cost-based rewriting decisions and it achieves a suitable adaptation to changing workload characteristics, while it requires less optimization overhead than the optimize-always model.
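The two-part core concept can be sketched as follows. This is a simplified illustration under several assumptions that are not stated in the text: statistics are maintained with an exponential moving average, the re-optimization period is counted in executed plan instances rather than in time, and the optimizer is a placeholder callable. All class and parameter names are hypothetical.

```python
class IncrementalStatistic:
    """One statistic (e.g., a selectivity) maintained incrementally.

    Uses an exponential moving average; this aggregation is an assumption.
    """

    def __init__(self, alpha=0.2):
        self.alpha = alpha
        self.value = None

    def update(self, observation):
        if self.value is None:
            self.value = observation
        else:
            self.value = self.alpha * observation + (1 - self.alpha) * self.value
        return self.value


class PeriodicOptimizer:
    """Re-optimizes a plan only every `period` executed instances."""

    def __init__(self, optimize, period):
        self.optimize = optimize  # callable: {name: value} -> plan (placeholder)
        self.period = period
        self.stats = {}
        self.plan = None
        self.executed = 0
        self.optimizations = 0

    def record(self, name, observation):
        """Part (1): monitor statistics while plan instances execute."""
        self.stats.setdefault(name, IncrementalStatistic()).update(observation)

    def next_plan(self):
        """Part (2): reuse the current plan, re-optimizing once per period."""
        if self.plan is None or self.executed % self.period == 0:
            snapshot = {name: s.value for name, s in self.stats.items()}
            self.plan = self.optimize(snapshot)
            self.optimizations += 1
        self.executed += 1
        return self.plan


# Placeholder optimizer: order operators by ascending monitored selectivity.
optimizer = PeriodicOptimizer(lambda s: tuple(sorted(s, key=s.get)), period=25)
first_plan = None
for i in range(100):
    # Assumed workload shift: filter_a becomes much less selective halfway through.
    optimizer.record("filter_a", 0.1 if i < 50 else 0.9)
    optimizer.record("filter_b", 0.5)
    plan = optimizer.next_plan()
    if first_plan is None:
        first_plan = plan
```

In this run the plan is optimized 4 times for 100 instances instead of 100 times, and the operator order still follows the shift in the monitored selectivities. The trade-off is that a plan may stay suboptimal until the next re-optimization point.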

The cost-based optimization of integration flows differs from cost-based optimization for (1) programming languages or (2) data management systems for several reasons. First, although optimizers of programming language compilers optimize imperative programs,
