Cost-Based Optimization of Integration Flows - Datenbanken ...
5 Multi-Flow Optimization
(2) plan partitioning in the sense of compiling different plans for different partitioning attribute
values¹⁷, and (3) MFO for multiple plans by an extended waiting time computation
for scheduling overlapping plan executions with regard to the hardware environment.<br />
The main differences between the MFO approach and prior work are twofold. First, the concept
of horizontal message queue partitioning simplifies the naïve (time-based) approach
because it can also be applied to local operators and requires neither rewriting of queries nor
local post-processing of query results. In addition, horizontally partitioned execution
leads to predictable and higher throughput. Second, computing the optimal
waiting time ensures adaptation to current workload characteristics by adjusting
the waiting time according to throughput and latency time. As a consequence, the MFO
approach can be applied in other domains as well. For example, it might be used for (1)
scan sharing, where queries are indexed according to predicate values [UGA+09], or (2)
transient views over equivalent query predicates [ZLFL07].
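To make the idea of horizontal message queue partitioning more concrete, the following is a minimal Python sketch and not the implementation used in this thesis. The names PartitionedQueue, execute_partition, and lookup, as well as the dictionary-based message layout, are hypothetical and chosen for illustration only. Messages are grouped by the value of one partitioning attribute, and each partition is executed as a whole, so that an external lookup is issued once per partition instead of once per message.

```python
from collections import OrderedDict, deque


class PartitionedQueue:
    """Toy message queue, partitioned horizontally by one attribute value."""

    def __init__(self, key):
        self._key = key                   # function: message -> partitioning attribute value
        self._partitions = OrderedDict()  # attribute value -> messages, oldest partition first

    def enqueue(self, message):
        # Partitioned enqueue: append to the partition of the message's attribute value.
        value = self._key(message)
        self._partitions.setdefault(value, deque()).append(message)

    def dequeue_partition(self):
        # Remove and return the oldest partition as (value, list of messages).
        value, messages = self._partitions.popitem(last=False)
        return value, list(messages)

    def __len__(self):
        return len(self._partitions)


def execute_partition(value, messages, lookup):
    # One plan instance per partition: a single lookup is shared by all messages.
    shared = lookup(value)
    return [(m["id"], shared) for m in messages]


# Usage: five messages with two distinct attribute values.
queue = PartitionedQueue(key=lambda m: m["customer"])
for i, customer in enumerate(["A", "B", "A", "A", "B"]):
    queue.enqueue({"id": i, "customer": customer})

calls = []  # records every (hypothetical) external lookup


def lookup(value):
    calls.append(value)
    return "data-for-" + value


results = []
while len(queue):
    value, messages = queue.dequeue_partition()
    results.extend(execute_partition(value, messages, lookup))

print(calls)    # two lookups for five messages
print(results)  # results are grouped per partition, not in arrival order
```

In this sketch, the results leave the queue grouped by partition rather than in arrival order. Restoring the original order at the outbound side is deliberately omitted here.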
Besides the achievable throughput improvements, MFO also has some limitations that
one should be aware of. First, while caching might lead to using outdated data, the
execution of message partitions might cause us to use data that is more current than it
was when the message arrived. Despite our guarantee of eventual consistency
as sketched in Subsection 5.1, neither caching nor MFO can ensure monotonic reads
over multiple data objects, which might be a problem if there are hidden dependencies
between data objects within the external system. Second, if the number of distinct values
is too high, we will not benefit from partitioning due to the additional runtime overhead
(partitioned enqueue of messages, serialization at the outbound side) and a fairly low
maximum waiting time, which results from the worst-case latency time consideration with
respect to the number of partitions. However, according to the experimental evaluation, there
are three reasons why we typically benefit from MFO. First, even for one-message partitions, there is
only a moderate runtime overhead. Second, throughput optimization is required if and
only if a high message load (peaks) exists. In such cases, it is very likely that messages
with equal attribute values are in the queue. Third, only a small number of messages is
required within one partition to yield a significant speedup for different types of operators.
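The effect of the number of partitions on the maximum waiting time can be illustrated with a small Python sketch. It rests on a simplified worst-case latency model that is an assumption of this example and not the formula derived in this chapter: a message in the last of k partitions waits at most k waiting periods before its partition is executed. The function name max_waiting_time and its parameters are hypothetical.

```python
def max_waiting_time(latency_constraint, num_partitions, exec_time=0.0):
    """Upper bound on the waiting time under a simplified model (assumption):

        worst-case latency = num_partitions * waiting_time + exec_time

    Solving for waiting_time under latency <= latency_constraint gives the bound.
    """
    if num_partitions < 1:
        raise ValueError("need at least one partition")
    budget = max(latency_constraint - exec_time, 0.0)
    return budget / num_partitions


# Usage: a latency constraint of 10 time units and 1 unit of execution time.
for k in (3, 9, 18):
    print(k, max_waiting_time(10.0, k, exec_time=1.0))
```

Under this assumed model, the admissible waiting time shrinks in inverse proportion to the number of partitions, which matches the qualitative argument above: with many distinct values, few messages can accumulate per partition before the partition must be executed.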
Finally, we consolidate the results of Chapters 3-5. The general cost-based optimization
framework for integration flows, defined in Chapter 3, minimizes the average plan
execution time by employing control-flow- and data-flow-oriented optimization techniques,
but it neglects the alternative optimization objective of throughput improvement. This
drawback has been addressed with the integration-flow-specific optimization techniques
vectorization (Chapter 4) and multi-flow optimization (Chapter 5). However, the periodical
re-optimization algorithm still has several drawbacks. Most importantly, there are the
problems of (1) many unnecessary re-optimization steps, where we do not find a new plan
if workload characteristics have not changed, and (2) adaptation delays after a workload
change, where we use a suboptimal plan until re-optimization and miss optimization opportunities.
To tackle these problems, we introduce the concept of on-demand re-optimization
in the following Chapter 6.
17 For example, correlated data inherently leads to data partitions, where each partition has specific statistical<br />
characteristics and thus a different optimal plan [Pol05, BBDW05, TDJ10]. MFO in combination<br />
with different plans for different data can address this within our cost-based optimization framework.<br />
In addition, we can apply plan simplifications (e.g., remove Switch operators).<br />