25.01.2015 Views

Cost-Based Optimization of Integration Flows - Datenbanken ...

1 Introduction

for integration flows exist [Sch01]. First, horizontal integration describes the integration of systems within one level. A typical example is the integration of operational systems by EAI servers (adapter-based integration of arbitrary systems and applications) or MOM systems (efficient message transport via messaging standards), where every update within an operational system can initiate data synchronization with other operational systems and hence data is exchanged by transferring many small messages [HW04]. Second, vertical integration describes the integration across the levels of the information pyramid. The most typical example is the integration of data from the operational systems into the dispositive and strategic systems (data warehouses) by ETL tools. In this context, there is a trend towards operational BI (Business Intelligence), which requires immediate synchronization between the operational source systems and the data warehouse in order to keep analytical query results up to date [DCSW09, O’C08, WK10]. This requirement is typically addressed either with a near-real-time approach [DCSW09], where the frequency of periodic delta loads is simply increased, or with data-driven ETL flows, where data changes in the operational source systems are directly propagated to the data warehouse infrastructure as so-called trickle feeds [DCSW09, SWCD09].

As a result of both horizontal and vertical integration, many independent instances of integration flows are executed over time. In addition to this high load of flow instances, the performance of source systems depends on the execution time of synchronous data-driven integration flows, where the source systems are blocked during execution. For these reasons, there are high performance demands on integration platforms in terms of minimizing execution and latency times. Furthermore, from an availability perspective in terms of average response times, the performance of synchronous integration flows also has a direct monetary impact. For example, Amazon states that an increase of just 0.1 s in average response time costs them 1% in sales [Bro09]. Similarly, Google observed that an increase of just 0.5 s in latency caused traffic to drop by a fifth [Lin06]. Consequently, optimization approaches are required.

Existing optimization approaches for integration flows are mainly rule-based in the sense that a flow is optimized only once, during the initial deployment. Thus, only static rewriting decisions can be made. Further, the optimization of integration flows is a hard problem because of the characteristics of imperative flow specifications with interaction-, control-flow- and data-flow-oriented operators as well as specific transactional properties such as the need to preserve the serial order of incoming messages. The advantage of rule-based optimization is its low optimization overhead because optimization is executed only once. However, rule-based optimization has two major drawbacks. First, many optimization opportunities cannot be exploited because rewriting decisions can often only be made dynamically, based on costs derived from execution statistics such as operator execution times, selectivities and cardinalities. Second, it is impossible to adapt to changing workload characteristics, which commonly vary significantly over time [IHW04, NRB09, DIR07, CC08, LSM+07, BMM+04, MSHR02]. Adapting to them would require rewriting an integration flow in a cost-based manner according to the load of flow instances and specific execution statistics.
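
The first drawback can be illustrated with a small sketch. It is not the cost model of this work, only a generic example of a rewriting decision that depends on monitored statistics. It assumes two hypothetical, independent selective operators (`validate`, `filter_region`) that may be reordered without changing the result, and a simple cost estimate in which each operator costs its per-item execution time multiplied by the number of items it receives. All names and numbers are invented for illustration.

```python
from dataclasses import dataclass
from itertools import permutations


@dataclass(frozen=True)
class Operator:
    """Hypothetical selective operator described by monitored statistics."""
    name: str
    cost_per_item: float  # average execution time per input item
    selectivity: float    # fraction of input items passed on (0..1)


def plan_cost(plan, cardinality):
    """Estimate the total cost of running the operators in the given order."""
    total = 0.0
    items = float(cardinality)
    for op in plan:
        total += items * op.cost_per_item  # each operator pays per received item
        items *= op.selectivity            # fewer items reach the next operator
    return total


def cheapest_plan(operators, cardinality):
    """Enumerate all orders (only feasible for a few operators), pick the cheapest."""
    return min(permutations(operators), key=lambda p: plan_cost(p, cardinality))


# Assumed statistics at deployment time: the filter is cheap and very selective.
validate = Operator("validate", cost_per_item=2.0, selectivity=0.9)
filter_region = Operator("filter_region", cost_per_item=1.0, selectivity=0.1)
early = cheapest_plan([validate, filter_region], cardinality=1000)
print([op.name for op in early])

# Assumed statistics after a workload shift: validation now rejects most items.
validate_later = Operator("validate", cost_per_item=2.0, selectivity=0.2)
filter_region_later = Operator("filter_region", cost_per_item=1.0, selectivity=0.95)
late = cheapest_plan([validate_later, filter_region_later], cardinality=1000)
print([op.name for op in late])
```

With the first set of statistics the filter is scheduled first, with the second set the validation is. An order fixed once at deployment time is therefore suboptimal for one of the two workloads, which is the situation a cost-based rewrite is meant to address.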

Contributions

In order to address the high performance demands on integration platforms and to overcome the drawbacks of rule-based optimization, we introduce the concept of cost-based optimization of integration flows with a primary focus on typically used imperative in-
