Cost-Based Optimization of Integration Flows
4 Vectorizing Integration Flows

Based on the general cost-based optimization framework, in this chapter, we present the vectorization of integration flows [BHP+09a, BHP+09b, BHP+11] as a control-flow-oriented optimization technique that is tailor-made for integration flows. This technique tackles the problem of low CPU utilization imposed by the instance-based plan execution of integration flows. The core idea is to transparently rewrite instance-based plans into vectorized plans with pipelined execution characteristics in order to exploit pipeline parallelism over multiple plan instances. Thus, this concept increases the message throughput, while it still ensures the required transactional properties. We call this concept vectorization because a vector of messages is processed at a time.
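The pipelined execution of multiple plan instances can be illustrated with a minimal sketch, assuming one execution bucket (thread) per operator, connected by FIFO message queues; the operators and messages are purely illustrative and not taken from the flow meta model:

```python
import threading
import queue

SENTINEL = None  # marks the end of the inbound message stream

def make_stage(op, inbox, outbox):
    """Run operator `op` on every message from `inbox`, forward results to `outbox`."""
    def run():
        while True:
            msg = inbox.get()
            if msg is SENTINEL:
                outbox.put(SENTINEL)
                return
            outbox.put(op(msg))
    return threading.Thread(target=run)

# Three toy operators forming a plan: strip -> uppercase -> append marker.
ops = [lambda m: m.strip(), lambda m: m.upper(), lambda m: m + "!"]

# One queue between each pair of adjacent stages, plus inbound and outbound.
queues = [queue.Queue() for _ in range(len(ops) + 1)]
stages = [make_stage(op, queues[i], queues[i + 1]) for i, op in enumerate(ops)]
for s in stages:
    s.start()

# A vector of inbound messages is processed in a pipelined fashion: while
# message 3 is stripped, message 2 is uppercased and message 1 finalized.
for msg in [" a ", " b ", " c "]:
    queues[0].put(msg)
queues[0].put(SENTINEL)

results = []
while True:
    out = queues[-1].get()
    if out is SENTINEL:
        break
    results.append(out)
print(results)  # ['A!', 'B!', 'C!']
```

Note that the FIFO queues and single-threaded stages preserve message order, which is one ingredient of keeping the serialized external behavior intact.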
In order to enable vectorization, we first describe necessary flow meta model extensions as well as the rule-based plan vectorization that ensures semantic correctness; i.e., the rewriting algorithm preserves the serialized external behavior. Furthermore, we present the cost-based vectorization that computes the optimal grouping of operators to multi-threaded execution buckets in order to achieve the optimal degree of pipeline parallelism and hence, maximize message throughput. We present exhaustive, heuristic, and constrained computation approaches. In addition, we also discuss the cost-based vectorization for multiple deployed plans and we sketch how this rather complex optimization technique is embedded within our periodical re-optimization framework. Finally, the experimental evaluation shows that significant throughput improvements are achieved by vectorization, with a moderate increase of latency time for individual messages. The cost-based vectorization further increases this improvement and ensures robustness of vectorization.
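As a toy illustration of an exhaustive computation approach, assuming per-operator execution costs are known, one can enumerate all contiguous partitions of the operator sequence into at most k execution buckets and pick the grouping that minimizes the cost of the most expensive bucket, since the slowest bucket bounds the pipeline throughput. The costs and k below are made up for the example; this is a sketch, not the thesis's algorithm:

```python
from itertools import combinations

def best_grouping(costs, k):
    """Exhaustively find a contiguous grouping of operators into at most k
    buckets that minimizes the maximum (bottleneck) bucket cost."""
    n = len(costs)
    best = (float("inf"), None)
    for cuts in range(min(k, n)):            # number of bucket boundaries
        for bounds in combinations(range(1, n), cuts):
            edges = [0, *bounds, n]
            buckets = [costs[a:b] for a, b in zip(edges, edges[1:])]
            bottleneck = max(sum(b) for b in buckets)
            if bottleneck < best[0]:
                best = (bottleneck, buckets)
    return best

# Five operators with illustrative per-message costs, three available threads:
bottleneck, buckets = best_grouping([4, 2, 7, 1, 3], k=3)
print(bottleneck, buckets)  # 7 [[4, 2], [7], [1, 3]]
```

The enumeration is exponential in the number of operators, which motivates the heuristic and constrained computation approaches mentioned above for larger plans.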
4.1 Motivation and Problem Description
In scenarios with a high load of plan instances, the major optimization objective is often throughput maximization, where moderate latency times are acceptable [UGA+09]. Unfortunately, despite the optimization techniques for parallelizing subflows, instance-based plans of integration flows typically do not achieve a high CPU utilization.
Problem 4.1 (Low CPU Utilization). The low CPU utilization is mainly caused by (1) significant waiting times for external systems (for example, the plan instance is blocked while executing external queries), (2) the trend towards multi- and many-core architectures, which stands in contrast to the single-threaded execution of instance-based integration flows, and (3) the I/O bottleneck due to the need for message persistence to enable recoverability of plan instances.
As a conclusion from Problem 4.1, in combination with the existence of many independent plan instances, there are optimization opportunities with regard to the message throughput, which we could exploit by increasing the degree of parallelism. Essentially, we could leverage four different types of parallelism to overcome this problem, where we additionally use the classification [Gra90] of horizontal (parallel processing of data partitions) and vertical parallelism (pipelining):