25.01.2015 Views

Cost-Based Optimization of Integration Flows - Datenbanken ...


Abstract

Integration flows are increasingly used to specify and execute data-intensive integration tasks between several heterogeneous systems and applications. There are many different application areas, such as (near) real-time ETL (Extraction Transformation Loading) and data synchronization between operational systems. Because of (1) an increasing amount of data, (2) typically highly distributed IT infrastructures, and (3) high requirements for data consistency and up-to-dateness, many instances of integration flows, each with a rather small amount of data, are executed over time by the central integration platform. Due to this high load as well as blocking synchronous source systems or client applications, the performance of the central integration platform is crucial for an IT infrastructure. As a result, there is a need for optimizing integration flows. Existing approaches for the optimization of integration flows tackle this problem with rule-based optimization in the form of algebraic simplifications or static rewriting decisions during deployment. Unfortunately, rule-based optimization exhibits two major drawbacks. First, we cannot exploit the full optimization potential because the decision on rewriting alternatives often depends on dynamically changing costs with regard to execution statistics such as cardinalities, selectivities and execution times. Second, there is no re-optimization over time and hence, the adaptation to changing workload characteristics is impossible. In conclusion, there is a need for adaptive cost-based optimization of integration flows.
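
The first drawback can be made concrete with a minimal sketch. It assumes two independent filter-like operators whose order may be swapped, and it uses the well-known rank heuristic (selectivity minus one, divided by cost per item) as a stand-in for a cost-based decision. This is not the optimization algorithm of the thesis; the names `OperatorStats`, `plan_cost`, `reorder`, and the operator names and numbers are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class OperatorStats:
    """Hypothetical per-operator execution statistics."""
    name: str
    selectivity: float    # observed fraction of input items passed on (0..1)
    cost_per_item: float  # observed average execution time per input item


def plan_cost(plan, input_cardinality):
    """Expected total cost of running the operators in the given order."""
    total = 0.0
    cardinality = float(input_cardinality)
    for op in plan:
        total += cardinality * op.cost_per_item
        cardinality *= op.selectivity  # fewer items reach the next operator
    return total


def reorder(plan):
    """Order independent filters by ascending rank = (selectivity - 1) / cost."""
    return sorted(plan, key=lambda op: (op.selectivity - 1.0) / op.cost_per_item)


# Order fixed at deployment time, as a static rewriting decision would leave it.
static_plan = [
    OperatorStats("validate", selectivity=0.9, cost_per_item=2.0),
    OperatorStats("filter_region", selectivity=0.1, cost_per_item=1.0),
]
optimized = reorder(static_plan)

# If the workload changes, the same decision procedure yields a different order.
shifted_stats = [
    OperatorStats("validate", selectivity=0.2, cost_per_item=2.0),
    OperatorStats("filter_region", selectivity=0.95, cost_per_item=1.0),
]
reoptimized = reorder(shifted_stats)

print([op.name for op in optimized], plan_cost(optimized, 1000))
print([op.name for op in reoptimized], plan_cost(reoptimized, 1000))
```

With the first set of statistics the cheaper, more selective operator runs first; with the shifted statistics the preferred order flips. A decision fixed at deployment time cannot follow such a change, which is the second drawback named above.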

This problem of cost-based optimization of integration flows is not as straightforward as it may appear at first glance. The differences from optimization in traditional data management systems are manifold. First, integration flows are reactive in the sense that they process remote, partially inaccessible data that is received in the form of message streams. Thus, proactive optimization such as dedicated physical design is impossible. Second, there is also the problem of missing knowledge about data properties of external systems because, in the context of loosely coupled applications, statistics are inaccessible or do not exist at all. Third, in contrast to traditional declarative queries, integration flows are described as imperative flow specifications including both data-flow-oriented and control-flow-oriented operators. This requires care with regard to semantic correctness when rewriting such flows. Additionally, further integration-flow-specific transactional properties such as the serial order of messages, the cache coherency problem when interacting with external systems, and the compensation-based rollback must be taken into account when optimizing such integration flows. In conclusion, the cost-based optimization of integration flows is a hard but highly relevant problem in today’s IT infrastructures.

In this thesis, we introduce the concept of cost-based optimization of integration flows that relies on incremental statistics maintenance and inter-instance plan re-optimization. As a foundation, we propose the concept of periodical re-optimization and present how to integrate such a cost-based optimizer into the system architecture of an integration platform. This includes integration-flow-specific (1) prerequisites such as the dependency analysis and a cost model for interaction-, control-flow- and data-flow-oriented operators as well as (2) specific statistics maintenance strategies, optimization algorithms and optimization techniques. While this architecture was inspired by cost-based optimizers
