Cost-Based Optimization of Integration Flows - Datenbanken ...
1 Introduction
for integration flows exist [Sch01]. First, horizontal integration describes the integration
of systems within one level. A typical example is the integration of operational systems
by EAI servers (adapter-based integration of arbitrary systems and applications) or MOM
systems (efficient message transport via messaging standards), where every update within
an operational system can initiate data synchronization with other operational systems;
hence, data is exchanged by transferring many small messages [HW04]. Second, vertical
integration describes the integration across the levels of the information pyramid.
The most typical example is the integration of data from the operational systems into
the dispositive and strategic systems (data warehouses) by ETL tools. In this context,
there is a trend towards operational BI (Business Intelligence), which requires immediate
synchronization between the operational source systems and the data warehouse in order
to keep analytical query results up to date [DCSW09, O’C08, WK10].
This requirement is typically addressed either with a near-real-time approach [DCSW09], where
the frequency of periodic delta loads is simply increased, or with data-driven ETL flows,
where data changes in the operational source systems are directly propagated to the data
warehouse infrastructure as so-called trickle-feeds [DCSW09, SWCD09].
As a result of both horizontal and vertical integration, many independent instances of
integration flows are executed over time. In addition to this high load of flow instances,
the performance of source systems depends on the execution time of synchronous data-driven
integration flows, because the source systems are blocked during execution. For
these reasons, there are high performance demands on integration platforms in terms of
minimizing execution and latency times. Furthermore, from an availability perspective in
terms of average response times, the performance of synchronous integration flows also has
a direct monetary impact. For example, Amazon states that an increase of just 0.1s in
average response times will cost them 1% in sales [Bro09]. Similarly, Google recognized
that an increase of just 0.5s in latency time caused traffic to drop by a fifth [Lin06].
Consequently, optimization approaches are required.
Existing optimization approaches for integration flows are mainly rule-based in the sense
that a flow is optimized only once, during the initial deployment. Thus, only static rewriting
decisions can be made. Further, the optimization of integration flows is a hard problem
because of the characteristics of imperative flow specifications with interaction-,
control-flow- and data-flow-oriented operators as well as specific transactional properties
such as the need to preserve the serial order of incoming messages. The advantage of rule-based
optimization is its low overhead because optimization is executed only once. However, it
has two major drawbacks. First, many optimization opportunities cannot
be exploited because rewriting decisions can often only be made dynamically, based on
costs derived from execution statistics such as operator execution times, selectivities
and cardinalities. Second, it is impossible to adapt to changing workload characteristics,
which commonly vary significantly over time [IHW04, NRB09, DIR07, CC08, LSM+07,
BMM+04, MSHR02]. Adapting would require rewriting an integration flow in a cost-based
manner according to the load of flow instances and specific execution statistics.
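To make the first drawback concrete, the following is a minimal sketch, not the approach developed in this work, of why a rewriting decision can depend on monitored statistics. It assumes two independent, freely reorderable filter-like operators and a deliberately simplified cost model (expected time per incoming message). All names (`OperatorStats`, `plan_cost`, `cheapest_plan`) and all numbers are hypothetical.

```python
from dataclasses import dataclass
from itertools import permutations


@dataclass(frozen=True)
class OperatorStats:
    """Hypothetical monitored statistics of one filter-like operator."""
    name: str
    cost_per_msg: float  # assumed average execution time per input message (ms)
    selectivity: float   # assumed fraction of input messages passed on (0..1)


def plan_cost(plan):
    """Expected cost per incoming message for operators run in the given order.

    Simplifying assumption: selectivities are independent, so each operator
    only processes the fraction of messages that survived its predecessors.
    """
    cost = 0.0
    fraction = 1.0
    for op in plan:
        cost += fraction * op.cost_per_msg
        fraction *= op.selectivity
    return cost


def cheapest_plan(ops):
    """Enumerate all orderings and return the one with minimal expected cost."""
    return min(permutations(ops), key=plan_cost)


# Statistics as they might look at deployment time (made-up values).
at_deployment = [
    OperatorStats("validate", cost_per_msg=2.0, selectivity=0.9),
    OperatorStats("filter", cost_per_msg=1.0, selectivity=0.1),
]
# Statistics after an assumed workload shift (made-up values).
after_shift = [
    OperatorStats("validate", cost_per_msg=2.0, selectivity=0.05),
    OperatorStats("filter", cost_per_msg=1.0, selectivity=0.95),
]

for label, stats in (("deployment", at_deployment), ("after shift", after_shift)):
    best = cheapest_plan(stats)
    print(label, [op.name for op in best], round(plan_cost(best), 2))
```

With the first set of statistics the cheaper order runs `filter` first, and with the second set it runs `validate` first. A decision fixed once at deployment time cannot follow such a shift, which is the second drawback named above. Exhaustive enumeration is used only to keep the sketch short, and real flows additionally restrict reordering through control-flow and transactional constraints such as serial message order.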
Contributions<br />
In order to address the high performance demands on integration platforms and to overcome
the drawbacks of rule-based optimization, we introduce the concept of cost-based
optimization of integration flows with a primary focus on typically used imperative in-