25.01.2015 Views

Cost-Based Optimization of Integration Flows - Datenbanken ...

3 Fundamentals of Optimizing Integration Flows

number of parallel subflows. The rewriting of iterations to parallel flows is beneficial if

\[
\max_{i=1}^{|r|} \left( \sum_{j=1}^{m_i} \hat{W}(o_{i,j}) + i \cdot W(\text{Start Thread}) \right) < r \cdot \sum_{j=1}^{m_i} W(o_{i,j}) \tag{3.18}
\]

with

\[
\hat{W}(o_{i,j}) = \frac{|r|}{\min(|r|, k)} \cdot \left( \hat{W}(o_{i,j}) - \text{wait}(o_{i,j}) \right) + \text{wait}(o_{i,j}).
\]

Due to foreach semantics, r must be estimated in terms of the frequency of iterations.
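
The condition in (3.18) can be checked numerically. The following is a minimal sketch, not the author's implementation. It assumes that every iteration executes the same operator sequence, that the adjusted cost is derived from the sequential cost W(o_{i,j}), and that k is the number of subflows that can run truly in parallel. The function names and the `(w, wait)` tuple layout are hypothetical.

```python
def adjusted_cost(w, wait, r, k):
    """Cost of one operator when r subflows share k parallel resources.

    Assumption: only the non-waiting part (w - wait) is stretched by the
    factor r / min(r, k); the waiting time is unaffected.
    """
    return r / min(r, k) * (w - wait) + wait


def parallel_is_beneficial(body, r, k, w_start_thread):
    """Evaluate the inequality of (3.18) for one iteration body.

    body: list of (w, wait) pairs, one per operator of the iteration body
    r: estimated number of iterations (and thus parallel subflows)
    w_start_thread: cost of creating one thread
    """
    sequential_body = sum(w for w, _ in body)
    parallel_body = sum(adjusted_cost(w, wait, r, k) for w, wait in body)
    # Subflow i starts only after i threads have been created, so its
    # completion time includes i * W(Start Thread).
    parallel = max(parallel_body + i * w_start_thread for i in range(1, r + 1))
    return parallel < r * sequential_body
```

With the illustrative (made-up) body `[(10.0, 4.0), (6.0, 0.0)]`, `k = 2`, and a thread start cost of 1.0, four iterations cost 32 in parallel versus 64 sequentially, so the rewriting pays off. For a single iteration the thread start overhead makes the parallel plan slower.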

Rewriting is realized by the following algorithm. We check if there are no dependencies between iterations and if we are allowed to change the temporal order. In case of independence, we compile all operators of the iteration body to r subflows plus one additional subflow. Each subflow references a specific data partition of the inbound data set, while the last subflow is used for all partitions that exceed the estimated number of iterations.
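
The mapping of data partitions to the r + 1 compiled subflows can be sketched as follows. This is an illustration under assumptions: partitions are identified by their position in the inbound data set, and `assign_partitions` is a hypothetical helper, not a function from the described system.

```python
def assign_partitions(num_partitions, r):
    """Map partition indices to r regular subflows plus one overflow subflow.

    Subflow i (0 <= i < r) handles exactly partition i. The additional
    subflow at index r handles every partition beyond the estimate r.
    """
    subflows = [[] for _ in range(r + 1)]
    for p in range(num_partitions):
        target = p if p < r else r  # overflow goes to the last subflow
        subflows[target].append(p)
    return subflows


# Overestimation: r = 5 but only 3 partitions arrive, so subflows stay unused.
print(assign_partitions(3, 5))
# Underestimation: r = 3 but 5 partitions arrive, so the last subflow
# processes partitions 3 and 4 one after another.
print(assign_partitions(5, 3))
```

The two calls show the effect of a wrong estimate: empty lists correspond to unused subflows, and a last list with several entries corresponds to partitions that are executed sequentially by the overflow subflow.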

In conclusion, this optimization technique offers, similar to the rewriting of sequences, high optimization opportunities. In contrast to sequences, the rewriting relies heavily on the estimated number of iterations. Because input data sets can be arbitrary, the estimate might result in unused subflows (overestimation) or in executing multiple partitions with the last subflow (underestimation). One can further enhance this by using dynamic runtime scheduling of parallel subflows (e.g., guided self-scheduling [PK87] or factoring [HSF91]). In addition, this technique can be combined with rewriting sequences (the operators of one iteration) to parallel flows. It should be applied before the techniques WC1 (rescheduling parallel flows) and WC4 (merging parallel flows).

Merging Parallel Flows

Recall that the costs of a Fork operator are determined by the most time-consuming subflow. The idea of the technique WC4: Merging Parallel Flows is that if the costs of the subflow with maximum costs subsume the costs of two or more other subflows, the subsumed subflows can be rewritten to one subflow in order to reduce the costs by W(Thread), which denotes the costs for thread creation, starting, and monitoring (in contrast to the previously used W(Start Thread), which denotes the time required for thread creation only). Therefore, all Fork operators with more than two subflows have to be considered. We achieve an execution time reduction because fewer threads are required, while the most time-consuming subflow is unchanged.

In general, this problem of a given maximum constraint per partition and the optimization objective of minimizing the number of partitions is reducible to the offline bin packing problem, which is known to be NP-hard. Therefore, we use an extension of the heuristic first fit algorithm [Joh74] that works as follows. First, we determine the subflow with maximum costs. Second, for each old subflow, we check if it can be merged with an existing new subflow. If this is possible, we merge the subflow with the first fit subflow by temporally concatenating the sequences of operators; otherwise, we create a new subflow. In the worst case, for each old subflow, we check each new subflow. Thus, the time complexity is given by O(m^2). In the following, we use an example to illustrate this heuristic algorithm.
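
The two steps of the heuristic can be sketched as follows. This is a minimal sketch under assumptions: each subflow is represented only by its total cost, merging two subflows adds their costs (temporal concatenation), and a merge is allowed as long as the sum does not exceed the maximum subflow cost. The function name and the cost values in the usage note are hypothetical and are not taken from Figure 3.14.

```python
def merge_parallel_flows(costs):
    """First fit merging of subflows, bounded by the most expensive subflow.

    costs: list of subflow costs, index = position of the old subflow
    Returns a list of new subflows, each given as a list of old indices.
    """
    if not costs:
        return []
    capacity = max(costs)   # step 1: subflow with maximum costs is the bound
    groups = []             # old subflow indices per new subflow
    totals = []             # accumulated cost per new subflow
    for idx, cost in enumerate(costs):      # step 2: place each old subflow
        for g, total in enumerate(totals):  # scan new subflows in order
            if total + cost <= capacity:    # first new subflow that fits
                totals[g] += cost
                groups[g].append(idx)
                break
        else:                               # nothing fits: open a new subflow
            totals.append(cost)
            groups.append([idx])
    return groups
```

For the made-up costs `[100, 120, 200, 349, 80]`, the bound is 349 and the call returns `[[0, 1, 4], [2], [3]]`: five subflows are reduced to three, and no new subflow exceeds the cost of the most time-consuming one. The nested scan over the new subflows is what yields the quadratic worst case.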

Example 3.11 (Merging Parallel Flows). Recall our example plan P_7 from Example 3.9. Further, assume changed execution times as shown in Figure 3.14(a). First, we determine the fourth subflow (349 ms) as upper bound for our extended first fit algorithm. Second,
