25.01.2015 Views

Cost-Based Optimization of Integration Flows - Datenbanken ...


5 Multi-Flow Optimization

(2) plan partitioning in the sense of compiling different plans for different partitioning attribute values (see footnote 17), and (3) MFO for multiple plans by an extended waiting time computation for scheduling overlapping plan executions with regard to the hardware environment.

The main differences between the MFO approach and prior work are twofold. First, the concept of horizontal message queue partitioning simplifies the naïve (time-based) approach because it can also be applied to local operators and requires neither rewriting of queries nor local post-processing of query results. In addition, horizontally partitioned execution leads to predictable and higher throughput. Second, computing the optimal waiting time ensures adaptation to current workload characteristics by adjusting the waiting time according to throughput and latency time. In consequence, the MFO approach can be applied in other domains as well. For example, it might be used for (1) scan sharing, where queries are indexed according to predicate values [UGA+09], or (2) transient views over equivalent query predicates [ZLFL07].
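The idea of horizontal message queue partitioning can be illustrated with a minimal Python sketch. It is not the implementation used in this work: the class names `Message` and `PartitionedQueue`, the field `key` (the value of the partitioning attribute), and the policy of serving the oldest partition first are assumptions made purely for illustration.

```python
from collections import OrderedDict, deque
from dataclasses import dataclass
from typing import Any, Deque, Hashable, List, Optional, Tuple


@dataclass
class Message:
    key: Hashable  # value of the partitioning attribute (hypothetical field)
    payload: Any


class PartitionedQueue:
    """Toy message queue that is partitioned horizontally by attribute value."""

    def __init__(self) -> None:
        # Partitions keep their creation order, so the oldest one is served first.
        self._partitions: "OrderedDict[Hashable, Deque[Message]]" = OrderedDict()

    def enqueue(self, msg: Message) -> None:
        # Partitioned enqueue: append to the partition with the same attribute value.
        self._partitions.setdefault(msg.key, deque()).append(msg)

    def dequeue_partition(self) -> Optional[Tuple[Hashable, List[Message]]]:
        # Remove and return the oldest partition as one batch, or None if empty.
        if not self._partitions:
            return None
        key, msgs = self._partitions.popitem(last=False)
        return key, list(msgs)


queue = PartitionedQueue()
for key, payload in [("c1", 1), ("c2", 2), ("c1", 3), ("c3", 4), ("c1", 5)]:
    queue.enqueue(Message(key, payload))

# All messages with key "c1" leave the queue together and can share one plan execution.
batch = queue.dequeue_partition()
```

Because a whole partition is handed to a single plan instance, operators that depend on the partitioning attribute only need to run once per batch. In this sketch the waiting time between two calls to `dequeue_partition` would control how large the batches become.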

Besides the achievable throughput improvements, MFO also has some limitations that one should be aware of. First, while caching might lead to using outdated data, the execution of message partitions might cause us to use data that is more current than it was when the message arrived. Despite our guarantee of eventual consistency, as sketched in Subsection 5.1, neither caching nor MFO can ensure monotonic reads over multiple data objects, which might be a problem if there are hidden dependencies between data objects within the external system. Second, if the number of distinct values is too high, we will not benefit from partitioning because of the additional runtime overhead (partitioned enqueue of messages, serialization at the outbound side) and a fairly low maximum waiting time, which results from the worst-case latency time consideration according to the number of partitions. However, with regard to the experimental evaluation, there are three reasons why we typically benefit from MFO. First, even for one-message partitions, there is only a moderate runtime overhead. Second, throughput optimization is required if and only if a high message load (peaks) exists. In such cases, it is very likely that messages with equal attribute values are in the queue. Third, only a small number of messages is required within one partition to yield a significant speedup for different types of operators.
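The effect of the number of partitions on the maximum waiting time can be made concrete with a deliberately simplified model. The following sketch is an assumption for illustration and not the waiting time computation of this chapter: it supposes that partitions are served one after another, that each partition is executed after one waiting time, and that a message therefore waits at most `num_partitions` waiting times plus an execution time `exec_time` in the worst case. The function name and all parameters are hypothetical.

```python
def max_waiting_time(latency_constraint: float,
                     num_partitions: int,
                     exec_time: float = 0.0) -> float:
    """Upper bound on the waiting time under the simplified worst-case model.

    Assumed model (not taken from the source):
        worst-case latency = num_partitions * waiting_time + exec_time
    and this latency must not exceed latency_constraint.
    """
    if num_partitions < 1:
        raise ValueError("num_partitions must be at least 1")
    budget = latency_constraint - exec_time
    # A negative budget means the constraint cannot be met at all; report 0.
    return max(0.0, budget / num_partitions)


# With a fixed latency constraint, more distinct values leave less time to collect messages.
for k in (1, 10, 100, 1000):
    print(k, max_waiting_time(10.0, k))
```

Under this assumed model the admissible waiting time shrinks in inverse proportion to the number of partitions, so with many distinct attribute values each partition collects only a few messages before it has to be executed.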

Finally, we consolidate the results from Chapters 3-5. The general cost-based optimization framework for integration flows, defined in Chapter 3, minimizes the average plan execution time by employing control-flow- and data-flow-oriented optimization techniques, but it neglects the alternative optimization objective of throughput improvement. This drawback has been addressed with the integration-flow-specific optimization techniques of vectorization (Chapter 4) and multi-flow optimization (Chapter 5). However, the periodical re-optimization algorithm still has several drawbacks. Most importantly, there are the problems of (1) many unnecessary re-optimization steps, where we do not find a new plan if workload characteristics have not changed, and (2) adaptation delays after a workload change, where we use a suboptimal plan until re-optimization and miss optimization opportunities. To tackle these additional problems, we introduce the concept of on-demand re-optimization in the following Chapter 6.

17 For example, correlated data inherently leads to data partitions, where each partition has specific statistical characteristics and thus a different optimal plan [Pol05, BBDW05, TDJ10]. MFO in combination with different plans for different data can address this within our cost-based optimization framework. In addition, we can apply plan simplifications (e.g., removing Switch operators).
