Cost-Based Optimization of Integration Flows - Datenbanken ...


4 Vectorizing Integration Flows

Based on the general cost-based optimization framework, in this chapter, we present the vectorization of integration flows [BHP+09a, BHP+09b, BHP+11] as a control-flow-oriented optimization technique that is tailor-made for integration flows. This technique tackles the problem of low CPU utilization imposed by the instance-based plan execution of integration flows. The core idea is to transparently rewrite instance-based plans into vectorized plans with pipelined execution characteristics in order to exploit pipeline parallelism over multiple plan instances. This concept increases the message throughput while still ensuring the required transactional properties. We call this concept vectorization because a vector of messages is processed at a time.
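
To make the contrast between the two execution models concrete, the following minimal Python sketch runs the same operator sequence once per message (instance-based) and once as a pipeline with one thread per operator connected by FIFO queues (vectorized). The function names, the sentinel convention, and the one-thread-per-operator layout are assumptions made for illustration; they are not the flow meta model or the rewriting algorithm of this chapter.

```python
import queue
import threading

SENTINEL = object()  # hypothetical end-of-stream marker, not part of the thesis model


def run_instance_based(operators, messages):
    """One plan instance per message: all operators finish before the next message starts."""
    results = []
    for msg in messages:
        for op in operators:
            msg = op(msg)
        results.append(msg)
    return results


def run_vectorized(operators, messages):
    """One thread per operator; FIFO queues between threads preserve the message order."""
    queues = [queue.Queue() for _ in range(len(operators) + 1)]

    def worker(op, inbox, outbox):
        while True:
            msg = inbox.get()
            if msg is SENTINEL:
                outbox.put(SENTINEL)  # forward the end-of-stream marker downstream
                return
            outbox.put(op(msg))

    threads = [
        threading.Thread(target=worker, args=(op, queues[i], queues[i + 1]))
        for i, op in enumerate(operators)
    ]
    for t in threads:
        t.start()

    for msg in messages:
        queues[0].put(msg)
    queues[0].put(SENTINEL)

    results = []
    while True:
        out = queues[-1].get()
        if out is SENTINEL:
            break
        results.append(out)
    for t in threads:
        t.join()
    return results


# Usage: both execution models produce the same, equally ordered output.
ops = [lambda m: m + 1, lambda m: m * 2, str]
assert run_vectorized(ops, range(5)) == run_instance_based(ops, range(5))
```

Because every queue is first-in, first-out and each operator is served by exactly one thread, messages leave the pipeline in the order in which they entered it. The sketch relies on this to mimic the serialized external behavior; it does not cover transactional properties.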

In order to enable vectorization, we first describe the necessary flow meta model extensions as well as the rule-based plan vectorization that ensures semantic correctness; i.e., the rewriting algorithm preserves the serialized external behavior. Furthermore, we present the cost-based vectorization that computes the optimal grouping of operators into multithreaded execution buckets in order to achieve the optimal degree of pipeline parallelism and hence maximize message throughput. We present exhaustive, heuristic, and constrained computation approaches. In addition, we discuss the cost-based vectorization for multiple deployed plans and sketch how this rather complex optimization technique is embedded within our periodical re-optimization framework. Finally, the experimental evaluation shows that vectorization achieves significant throughput improvements, with a moderate increase of latency time for individual messages. The cost-based vectorization further increases this improvement and ensures robustness of vectorization.
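
The idea of grouping operators into execution buckets can be illustrated with a toy exhaustive search. The sketch below is not the cost model or the algorithm of this chapter. It assumes a linear operator sequence with known per-operator costs, allows only contiguous buckets, and uses a hypothetical objective: minimize the cost of the most expensive bucket (the pipeline bottleneck) and, among equal bottlenecks, prefer fewer buckets. The names `groupings` and `best_grouping` and the parameter `max_buckets` are introduced here for illustration only.

```python
from itertools import combinations


def groupings(n, k):
    """Yield every split of operator positions 0..n-1 into k contiguous, non-empty buckets."""
    for cuts in combinations(range(1, n), k - 1):
        bounds = (0, *cuts, n)
        yield [list(range(bounds[i], bounds[i + 1])) for i in range(k)]


def best_grouping(costs, max_buckets):
    """Exhaustively search for the grouping with the smallest bottleneck bucket cost.

    Assumed objective (not taken from the source): minimize the maximum
    bucket cost, and break ties in favor of fewer buckets.
    """
    if not costs or max_buckets < 1:
        raise ValueError("need at least one operator and one bucket")
    best_key, best_groups = None, None
    for k in range(1, min(max_buckets, len(costs)) + 1):
        for groups in groupings(len(costs), k):
            bottleneck = max(sum(costs[i] for i in bucket) for bucket in groups)
            key = (bottleneck, k)
            if best_key is None or key < best_key:
                best_key, best_groups = key, groups
    return best_groups, best_key[0]


# Usage: six operators with assumed costs, at most three buckets (threads).
groups, bottleneck = best_grouping([2, 1, 1, 4, 1, 1], max_buckets=3)
assert groups == [[0, 1, 2], [3], [4, 5]]
assert bottleneck == 4
```

In this example the operator with cost 4 bounds the achievable bottleneck, so adding more than three buckets would only add threads without improving the assumed objective. The search enumerates all contiguous splits and therefore grows exponentially with the number of operators, which hints at why heuristic and constrained approaches are of interest.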

4.1 Motivation and Problem Description

In scenarios with a high load of plan instances, the major optimization objective is often throughput maximization, where moderate latency times are acceptable [UGA+09]. Unfortunately, despite the optimization techniques for parallelizing subflows, instance-based plans of integration flows typically do not achieve a high CPU utilization.

Problem 4.1 (Low CPU Utilization). The low CPU utilization is mainly caused by (1) significant waiting times for external systems (for example, the plan instance is blocked while executing external queries), (2) the trend towards multi- and many-core architectures, which stands in contrast to the single-threaded execution of instance-based integration flows, and (3) the IO bottleneck due to the need for message persistence to enable recoverability of plan instances.

As a consequence of Problem 4.1, in combination with the existence of many independent plan instances, there are optimization opportunities with regard to the message throughput, which we could exploit by increasing the degree of parallelism. Essentially, we could leverage four different types of parallelism to overcome that problem, where we additionally use the classification [Gra90] of horizontal (parallel processing of data partitions) and vertical parallelism (pipelining):
