

ALL). In fact, in the instance-based case, the costs of processing n messages are determined by W(P) = n · r · m, where r denotes the number of iteration loops for each message and m denotes the number of operators in the iteration body. When rewriting the Iteration operator to parallel flows, in the best case, the costs are reduced to W(P) = n · m because all iteration loops are executed in parallel. In contrast, due to the sub-pipelining of a vectorized plan, we can reduce the costs to W(P′) = n · r + m − 1 + 2, where r denotes the number of sub-messages. Note that costs of 2 must be added to represent the costs for message splitting (Split) and merging (Setoperation). Furthermore, vectorized execution is optimal if r ≤ m.
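To make the best-case comparison concrete, the three cost expressions can be evaluated side by side. The following is an illustrative sketch only; the parameter values are hypothetical:

```python
# Best-case cost model comparison (illustrative sketch).
# n: number of messages, r: iteration loops (= sub-messages),
# m: number of operators in the iteration body.

def cost_instance_based(n, r, m):
    # Each message runs r iteration loops over m operators: W(P) = n*r*m
    return n * r * m

def cost_parallel_flows(n, r, m):
    # All r iteration loops execute in parallel: W(P) = n*m
    return n * m

def cost_vectorized(n, r, m):
    # Sub-pipelining of n*r sub-messages through m operators, plus costs
    # of 2 for message splitting (Split) and merging (Setoperation):
    # W(P') = n*r + m - 1 + 2
    return n * r + m - 1 + 2

n, r, m = 100, 4, 10  # hypothetical workload with r <= m
print(cost_instance_based(n, r, m))  # 4000
print(cost_parallel_flows(n, r, m))  # 1000
print(cost_vectorized(n, r, m))      # 411
```

As the example shows, for r ≤ m the vectorized costs stay well below both the instance-based and the parallel-flow costs.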

Finally, note that this is a best-case consideration using an idealized static cost model for illustration purposes. It does not take into account a changed number of operators during vectorization or additional costs for synchronization. However, we will revisit these issues in the following sections.

In conclusion, plan vectorization strongly increases the degree of parallelism and thus may lead to higher CPU utilization. In this section, we introduced the basic vectorization approach and the required meta model extensions. In addition, we described the core rewriting algorithm as well as the specific rewriting rules that are necessary in order to guarantee semantic correctness.

4.3 Cost-Based Vectorization

Plan vectorization rewrites an instance-based plan (one execution bucket per plan) into a fully vectorized plan (one execution bucket per operator), which solves the P-PV. However, the approach of full vectorization has two major drawbacks. First, the theoretical performance and latency of a vectorized plan mainly depend on the performance of the most time-consuming operator. The reason is that the work cycle of the whole data-flow graph is given by the longest-running operator because all queues after this operator are empty, while queues in front of it reach their maximum constraint. Similar theoretical observations have also been made for task scheduling in parallel computing environments [Gra69], where Graham described bounds on the overall time influence of task timing anomalies, which quantify the optimization potential that vectorized plans still exhibit. Second, the practical performance also strongly depends on the number of operators because each operator requires a single thread. Depending on the concrete workload (runtime of operators), too many threads can also hurt performance due to (1) additional thread monitoring and synchronization efforts as well as (2) cache displacement, because the different threads work on different intermediate result messages of a plan.
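The bottleneck argument can be illustrated with a small simulation. This is a sketch under simplified assumptions (no synchronization costs, unbounded queues until the steady state; the per-operator costs are hypothetical): in steady state, the pipeline emits one message per work cycle, which equals the cost of the slowest operator.

```python
# Steady-state throughput of a fully vectorized plan (simplified sketch).
# Each operator runs in its own thread; queues decouple the stages, so the
# time between consecutive output messages converges to max(operator times).

def work_cycle(op_times):
    # The longest-running operator dictates the work cycle of the graph.
    return max(op_times)

def pipeline_cost(n, op_times):
    # n messages through m operators: fill the pipeline once (sum of all
    # operator times), then one work cycle per remaining message.
    return sum(op_times) + (n - 1) * work_cycle(op_times)

op_times = [1, 1, 5, 1]  # hypothetical costs; the third operator is the bottleneck
print(work_cycle(op_times))          # 5
print(pipeline_cost(100, op_times))  # 8 + 99*5 = 503
```

Reducing the bottleneck operator's cost would lower the cost of the whole plan almost proportionally, whereas speeding up any other operator barely matters, which is why execution statistics of single operators become important.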

Figure 4.8 shows the results of a speedup experiment from Chapter 3, which was re-executed for two plans with m = 100 and m = 200 Delay operators, respectively. We varied the number of threads (k ∈ [1, m]) as well as the delay time in order to simulate the waiting time of Invoke operators for external systems. Furthermore, we computed the speedup as S_p = W(P, 1)/W(P′, k). The theoretical maximum speedup is m/⌈m/k⌉. As a result, we see that the empirical speedup increases up to a certain maximum and then decreases. Both the maximum speedup and the number of threads at which it occurs depend on the waiting time: the higher the waiting time, the higher the reachable speedup and the higher the number of threads at which this maximum occurs.
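The theoretical bound m/⌈m/k⌉ can be sketched as follows. This is for illustration only, under the assumption that the m operators are equally expensive and distributed as evenly as possible over k threads:

```python
import math

# Theoretical maximum speedup when m equally expensive operators are
# distributed over k threads: the busiest thread executes ceil(m/k)
# operators per work cycle, so S_max = m / ceil(m/k).

def max_speedup(m, k):
    return m / math.ceil(m / k)

m = 100
for k in (1, 10, 50, 64, 100):
    print(k, max_speedup(m, k))
```

For m = 100, k = 100 yields the ideal speedup of 100, while k = 64 only reaches 100/⌈100/64⌉ = 50 because one thread still has to execute two operators per cycle; this stepwise behavior explains why adding threads beyond a certain point yields no theoretical gain while still adding practical overhead.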

In conclusion, an enhanced vectorization approach is required that takes into account the execution statistics of single operators. In this section, we introduce a generalization

