Cost-Based Optimization of Integration Flows - Datenbanken ...
thread monitoring and synchronization efforts as well as increased cache displacement. This problem is aggravated in the presence of multiple deployed plans because there, the total number of operators, and thus the number of threads, is even higher. In addition, the higher the number of threads, the higher the latency of single messages.

To tackle this problem of a possibly high number of required threads, in Section 4.3 we introduce the cost-based vectorization of integration flows, which assigns groups of operators to execution buckets and thus to threads. This reduces the number of required threads and achieves higher throughput. In addition, Section 4.4 discusses how to compute the optimal assignment of threads to multiple deployed plans. Finally, Section 4.5 illustrates how this cost-based vectorization of integration flows, as an optimization technique for throughput maximization, is embedded into our overall optimization framework described in Chapter 3.
In contrast to related work on throughput optimization by parallelization of tasks in DBMS [HA03, HSA05, GHP+06], DSMS [SBL04, BMK08, AAB+05], or ETL tools [BABO+09, SWCD09], our approach allows the rewriting of procedural integration flows to pipelined (vectorized) execution plans. Further, existing approaches [CHK+07, CcR+03, BBDM03, JC04] that also distribute operators across a number of threads or server nodes compute this distribution in a static manner at query deployment time. In contrast, we compute the cost-optimal distribution during periodical re-optimization in order to achieve the highest throughput and to allow adaptation to changing workload characteristics.
4.2 Plan Vectorization
As a prerequisite, we give an overview of the core concepts of our plan vectorization approach [BHP+09b]. We define the vectorization problem, explain the required modifications of the integration flow meta model, sketch the basic rewriting algorithm, describe context-specific rewriting rules, and analyze the costs of vectorized plans.
4.2.1 Overview and Meta Model Extension
The general idea of plan vectorization is to transparently rewrite the instance-based plan, where each instance is executed as a thread, into a vectorized plan, where each operator is executed as a single execution bucket and hence as a single thread. Thus, we model a standing plan of an integration flow. Due to the different execution times of the single operators, transient inter-bucket message queues (with constraints⁸ on the maximum number of messages) are required for each data flow edge. With regard to our classification of execution approaches, we change the execution model from control-flow semantics with instance-local, materialized intermediates to data-flow semantics with a hybrid instance-global data granularity (pipelining of materialized intermediates from multiple instances), while still enabling complex procedural modeling. We illustrate this idea of plan vectorization with an example.
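To make this idea concrete, the following sketch models a vectorized plan in plain Python: each operator runs as a standing thread (execution bucket), and each data flow edge is realized as a bounded inter-bucket queue, so messages from multiple plan instances are pipelined through the standing plan. The operator functions and queue capacity are invented for illustration and are not taken from the thesis.

```python
import threading
import queue

SENTINEL = object()  # marks the end of the message stream

def execution_bucket(op, in_q, out_q):
    """One execution bucket: a standing thread that consumes messages from
    its inbound inter-bucket queue, applies its operator, and emits the
    result to the outbound queue."""
    while True:
        msg = in_q.get()
        if msg is SENTINEL:
            out_q.put(SENTINEL)  # propagate shutdown downstream
            break
        out_q.put(op(msg))

def vectorize(operators, max_queue_len=10):
    """Wire a sequence of operators into a pipeline of execution buckets
    connected by bounded queues (constraint on the max number of messages)."""
    queues = [queue.Queue(maxsize=max_queue_len)
              for _ in range(len(operators) + 1)]
    threads = [threading.Thread(target=execution_bucket,
                                args=(op, queues[i], queues[i + 1]))
               for i, op in enumerate(operators)]
    for t in threads:
        t.start()
    return queues[0], queues[-1], threads

# Hypothetical three-operator plan: normalize -> enrich -> wrap.
ops = [lambda m: m.upper(), lambda m: m + "!", lambda m: {"payload": m}]
in_q, out_q, threads = vectorize(ops)
for msg in ["a", "b", "c"]:  # messages of multiple plan instances
    in_q.put(msg)
in_q.put(SENTINEL)
results = []
while True:
    r = out_q.get()
    if r is SENTINEL:
        break
    results.append(r)
for t in threads:
    t.join()
print(results)  # [{'payload': 'A!'}, {'payload': 'B!'}, {'payload': 'C!'}]
```

Note that, unlike the instance-based model, the number of threads here depends only on the number of operators, not on the number of concurrently executed plan instances.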
Example 4.1 (Full Plan Vectorization). Assume the instance-based example plan P2 as shown in Figure 4.1(a). Further, Figure 4.1(b) illustrates the typical instance-based plan
8 Queues in front of cost-intensive operators accumulate larger numbers of messages. To overcome the resulting high memory requirements, typically either (1) the maximum number of messages or (2) the maximum total size of messages per queue is constrained. Note that this constraint ultimately leads to a work cycle of the whole pipeline that is dominated by the most time-consuming operator.
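As a back-of-the-envelope illustration of this footnote (the per-operator costs below are invented, not from the thesis): once the bounded queues in front of the cheaper operators are full, a linear pipeline settles into a work cycle equal to the cost of the most expensive operator, so n messages leave the pipeline after roughly the full latency of the first message plus (n-1) work cycles.

```python
def pipeline_time(costs, n):
    """Approximate time to push n messages through a linear pipeline:
    the first message pays the full latency sum(costs); every further
    message leaves one work cycle max(costs) later."""
    return sum(costs) + (n - 1) * max(costs)

# Hypothetical per-operator execution times (abstract time units).
costs = [2, 5, 1]
print(pipeline_time(costs, 1))    # 8 (latency of a single message)
print(pipeline_time(costs, 100))  # 8 + 99 * 5 = 503
# Throughput is dominated by the most time-consuming operator:
# asymptotically ~1 message per 5 time units.
```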