25.01.2015 Views

Cost-Based Optimization of Integration Flows - Datenbanken ...


4 Vectorizing Integration Flows

4.7 Summary and Discussion

In this chapter, we introduced plan vectorization, a control-flow-oriented optimization technique that aims at throughput optimization for integration flows. We use the term vectorization as an analogy to executing a vector of messages at a time by a standing plan. Because vectorization depends on the dynamic workload characteristics, we introduced cost-based plan vectorization as a generalization, where the costs of single operators are taken into account and operators are merged into execution buckets. In detail, we presented exhaustive and heuristic algorithms for computing the cost-optimal plan. Furthermore, we showed how to use those algorithms in the presence of multiple deployed plans and how this concept is embedded into our general cost-based optimization framework.
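The idea of merging operators into execution buckets can be illustrated with a minimal sketch. It is not the exhaustive or heuristic algorithm presented in this chapter. It assumes a purely linear operator sequence and one illustrative constraint: a bucket may hold consecutive operators as long as its summed cost does not exceed the cost of the most expensive single operator plus a slack value. The function name `merge_into_buckets`, the parameter `slack`, and the cost values are hypothetical.

```python
from typing import List, Sequence


def merge_into_buckets(costs: Sequence[float], slack: float = 0.0) -> List[List[int]]:
    """Greedily pack a linear operator sequence into contiguous buckets.

    Assumption (illustrative, not taken from the chapter): a bucket may hold
    consecutive operators as long as its summed cost stays within the cost
    of the most expensive single operator plus `slack`.
    Returns a list of buckets, each a list of operator indices.
    """
    if not costs:
        return []
    capacity = max(costs) + slack
    buckets: List[List[int]] = [[0]]
    load = costs[0]
    for i in range(1, len(costs)):
        if load + costs[i] <= capacity:
            # The operator still fits: merge it into the current bucket.
            buckets[-1].append(i)
            load += costs[i]
        else:
            # The bucket would exceed the capacity: open a new one.
            buckets.append([i])
            load = costs[i]
    return buckets


# Hypothetical per-operator execution times in milliseconds.
operator_costs = [1, 2, 8, 3, 1, 4]
plan = merge_into_buckets(operator_costs)
print(plan)
```

Under these assumed costs, the six operators end up in three buckets, so three threads would be needed instead of the six that full vectorization would use. For a linear sequence and a fixed capacity, this greedy packing yields the smallest number of contiguous buckets; plans with branches or loops would need a more general treatment.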

Based on our evaluation, we can state that significant throughput improvements are possible. In comparison to full vectorization, cost-based vectorization achieves even better performance, reduces the latency of single messages, and ensures robustness in the sense of minimizing the number of required threads. In conclusion, the concept of plan vectorization is applicable in many different application areas. It is important to note that the benefit of vectorization, and hence also of cost-based vectorization, will increase with the ongoing development of modern many-core processors because the gap between CPU performance and main memory, IO, and network speed is increasing (Problem 4.1).

Our approach differs from prior work in two main ways: (1) we vectorize procedural integration flows (imperative flow specifications), and (2) we dynamically compute the cost-optimal vectorized plan within our periodic re-optimization framework. The latter enables dynamic adaptation to changing workload characteristics in terms of the operator execution times. Despite the focus on procedural plans, the cost-based vectorization approach can, in general, also be applied in the context of DSMS and ETL tools.

However, the vectorization approach also has some limitations that must be taken into account when applying this optimization technique. First, vectorization is a trade-off between throughput improvement and additional latency time. Thus, it should only be applied if the optimization objective is throughput improvement or minimizing the latency in the presence of high message rates rather than minimizing the latency time of single messages. Second, for plans with complex procedural aspects, vectorization requires additional operators for synchronization and handling of the explicit data flow. This aspect is explicitly taken into account by cost-based vectorization but might reduce the achievable throughput improvement. Third, low-cost plans with many inexpensive operators might not benefit from vectorization either, due to the higher relative overhead of queue management as well as thread synchronization and monitoring. Despite these general limitations of vectorization, cost-based vectorization can be applied by default because its hybrid model (the full spectrum between instance-based and vectorized execution) takes the execution statistics into account. As already mentioned, the concept of cost-based vectorization can also be extended to a distributed setting, where operators are executed by different server nodes rather than only by different threads.
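The first and third limitations can be made concrete with a deliberately simple back-of-the-envelope model. This is a sketch under stated assumptions, not the cost model of this chapter: each bucket runs in its own thread, every bucket of a multi-bucket plan pays a fixed per-message queue overhead, the steady-state time per message is bounded by the slowest bucket, and the latency of a single message is the sum of all bucket times (queue waiting times are ignored). The function `estimate` and all numbers are hypothetical.

```python
from typing import List, Sequence, Tuple


def estimate(costs: Sequence[float], buckets: List[List[int]],
             queue_overhead: float) -> Tuple[float, float]:
    """Return (steady-state time per message, latency of a single message).

    Assumptions (illustrative): one thread per bucket, a fixed
    `queue_overhead` per bucket and message, and no queue at all when the
    plan consists of a single bucket (instance-based execution).
    """
    overhead = queue_overhead if len(buckets) > 1 else 0.0
    bucket_times = [sum(costs[i] for i in b) + overhead for b in buckets]
    # The pipeline is paced by its slowest bucket; a message visits all buckets.
    return max(bucket_times), sum(bucket_times)


# Hypothetical operator costs (ms) and three alternative plans.
costs = [1, 2, 8, 3, 1, 4]
instance_based = [[0, 1, 2, 3, 4, 5]]           # one thread, no queues
fully_vectorized = [[0], [1], [2], [3], [4], [5]]  # one thread per operator
cost_based = [[0, 1], [2], [3, 4, 5]]           # operators merged into buckets

for name, plan in [("instance-based", instance_based),
                   ("fully vectorized", fully_vectorized),
                   ("cost-based", cost_based)]:
    print(name, estimate(costs, plan, queue_overhead=0.5))

# A low-cost plan: the queue overhead dominates the operator costs.
cheap = [0.25, 0.25, 0.25, 0.25]
print(estimate(cheap, [[0, 1, 2, 3]], queue_overhead=2.0))
print(estimate(cheap, [[0], [1], [2], [3]], queue_overhead=2.0))
```

With these assumed numbers, both vectorized plans need 8.5 ms per message instead of 19 ms, but the latency of a single message rises from 19 ms to 22 ms (fully vectorized) or 20.5 ms (cost-based, with three instead of six threads). For the low-cost plan, the fully vectorized variant is slower than instance-based execution in both measures, because the assumed queue overhead exceeds the operator costs.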

While the vectorization of integration flows is a control-flow-oriented optimization technique, the next chapter will address a data-flow-oriented optimization technique. Both techniques reduce the execution time of message sequences and thus increase the message throughput, and their benefits can be combined.
