

thread monitoring and synchronization efforts as well as increased cache displacement. This problem is aggravated in the presence of multiple deployed plans because the total number of operators, and thus the number of threads, is even higher. In addition, the higher the number of threads, the higher the latency of individual messages.

To tackle the problem of a possibly high number of required threads, we introduce in Section 4.3 the cost-based vectorization of integration flows, which assigns groups of operators to execution buckets and thus to threads. This reduces the number of required threads and achieves higher throughput. In addition, Section 4.4 discusses how to compute the optimal assignment of threads to multiple deployed plans. Finally, Section 4.5 illustrates how this cost-based vectorization of integration flows, as an optimization technique for throughput maximization, is embedded into our overall optimization framework described in Chapter 3.

In contrast to related work on throughput optimization by parallelization of tasks in DBMS [HA03, HSA05, GHP+06], DSMS [SBL04, BMK08, AAB+05], or ETL tools [BABO+09, SWCD09], our approach allows the rewriting of procedural integration flows into pipelined (vectorized) execution plans. Furthermore, existing approaches [CHK+07, CcR+03, BBDM03, JC04] that also distribute operators across a number of threads or server nodes compute this distribution statically at query deployment time. In contrast, we compute the cost-optimal distribution during periodic re-optimization in order to achieve the highest throughput and to allow adaptation to changing workload characteristics.

4.2 Plan Vectorization

As a prerequisite, we give an overview of the core concepts of our plan vectorization approach [BHP+09b]. We define the vectorization problem, explain the required modifications of the integration flow meta model, sketch the basic rewriting algorithm, describe context-specific rewriting rules, and analyze the costs of vectorized plans.

4.2.1 Overview and Meta Model Extension<br />

The general idea of plan vectorization is to transparently rewrite the instance-based plan (where each instance is executed as a thread) into a vectorized plan, where each operator is executed as a single execution bucket and hence as a single thread. Thus, we model a standing plan of an integration flow. Due to the different execution times of the single operators, transient inter-bucket message queues (with constraints⁸ on the maximum number of messages) are required for each data flow edge. With regard to our classification of execution approaches, we change the execution model from control-flow semantics with instance-local, materialized intermediates to data-flow semantics with a hybrid instance-global data granularity (pipelining of materialized intermediates from multiple instances), while still enabling complex procedural modeling. We illustrate this idea of plan vectorization with an example.

Example 4.1 (Full Plan Vectorization). Assume the instance-based example plan P2 as shown in Figure 4.1(a). Further, Figure 4.1(b) illustrates the typical instance-based plan

⁸ Queues in front of cost-intensive operators accumulate larger numbers of messages. In order to bound the resulting memory requirements, typically either (1) the maximal number of messages or (2) the maximal total size of messages per queue is constrained. Note that this constraint ultimately leads to a work cycle of the whole pipeline that is dominated by the most time-consuming operator. For example, if the most expensive operator requires 20 ms per message while all other operators require less, the pipeline processes at most one message every 20 ms in steady state.
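To make the bucket-and-queue execution model tangible, the following minimal sketch wires a few hypothetical operators into execution buckets, each running as its own thread and connected by bounded inter-bucket queues. It is not taken from the thesis prototype: the operator names, the Message type, and the queue capacity are illustrative assumptions, and Java's ArrayBlockingQueue merely stands in for the constrained transient message queues described above.

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.function.UnaryOperator;

public class VectorizedPlanSketch {

    // Illustrative stand-in for a materialized intermediate message.
    record Message(String payload) {}

    // One execution bucket: reads from its input queue, applies its operator,
    // and writes to its output queue; put() blocks when the successor is slower.
    static Thread bucket(String name, UnaryOperator<Message> op,
                         BlockingQueue<Message> in, BlockingQueue<Message> out) {
        Thread t = new Thread(() -> {
            try {
                while (true) {
                    Message m = in.take();   // wait for the next intermediate
                    out.put(op.apply(m));    // blocks if the next queue is full
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, name);
        t.setDaemon(true);
        return t;
    }

    public static void main(String[] args) throws InterruptedException {
        int capacity = 10;  // assumed constraint on queued messages per edge
        BlockingQueue<Message> q0 = new ArrayBlockingQueue<>(capacity);
        BlockingQueue<Message> q1 = new ArrayBlockingQueue<>(capacity);
        BlockingQueue<Message> q2 = new ArrayBlockingQueue<>(capacity);

        // Two hypothetical operators of a standing plan, pipelined across buckets.
        bucket("assign",    m -> new Message(m.payload() + "|assigned"), q0, q1).start();
        bucket("translate", m -> new Message(m.payload().toUpperCase()), q1, q2).start();

        // The inbound adapter feeds messages of multiple plan instances.
        for (int i = 0; i < 5; i++) {
            q0.put(new Message("msg" + i));
        }
        // Consume the pipelined results from the last queue.
        for (int i = 0; i < 5; i++) {
            System.out.println(q2.take().payload());
        }
    }
}

Because the queues are bounded, a fast bucket eventually blocks on put() until its slower successor consumes, so the steady-state work cycle of the whole pipeline is dictated by the most time-consuming bucket, in line with the observation in footnote 8.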

