Cost-Based Optimization of Integration Flows - Datenbanken ...
3 Fundamentals of Optimizing Integration Flows
number of parallel subflows. The rewriting of iterations to parallel flows is beneficial if

$$\max_{i=1}^{|r|} \left( \sum_{j=1}^{m_i} \hat{W}(o_{i,j}) + i \cdot W(\text{Start Thread}) \right) < r \cdot \sum_{j=1}^{m_i} W(o_{i,j}) \tag{3.18}$$

with

$$\hat{W}(o_{i,j}) = \frac{|r|}{\min(|r|, k)} \cdot \left( \hat{W}(o_{i,j}) - \mathit{wait}(o_{i,j}) \right) + \mathit{wait}(o_{i,j}).$$
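The condition in Equation (3.18) can be evaluated numerically. The following is a minimal sketch under stated assumptions, not the implementation used in this work: every iteration is assumed to execute the same operator body (so $m_i$ is the same for all $i$), $|r|$ is treated as the integer `r`, `k` is taken to be the available degree of parallelism (this excerpt does not define it), and the cost term on the right-hand side of the definition of $\hat{W}$ is read as the operator's cost before the adjustment. The function names and the `(cost, wait)` tuple layout are hypothetical.

```python
def adjusted_cost(cost, wait, r, k):
    """W-hat for one operator: scale the non-waiting share of the cost
    by r / min(r, k) and add the waiting time back unchanged."""
    return r / min(r, k) * (cost - wait) + wait


def parallel_cost(body, r, k, start_thread):
    """Left-hand side of (3.18).

    body: list of (cost, wait) pairs, one per operator of the iteration
          body (assumed identical for all iterations).
    """
    per_subflow = sum(adjusted_cost(c, w, r, k) for c, w in body)
    # Subflow i additionally pays for the i thread starts that precede it.
    return max(per_subflow + i * start_thread for i in range(1, r + 1))


def serial_cost(body, r):
    """Right-hand side of (3.18): r sequential executions of the body."""
    return r * sum(c for c, _ in body)


def rewriting_is_beneficial(body, r, k, start_thread):
    """True if the parallel plan is estimated to be cheaper."""
    return parallel_cost(body, r, k, start_thread) < serial_cost(body, r)


# Hypothetical numbers, chosen only to exercise the formula.
body = [(10.0, 2.0), (20.0, 5.0)]
print(rewriting_is_beneficial(body, r=4, k=2, start_thread=1.0))
print(rewriting_is_beneficial(body, r=1, k=1, start_thread=1.0))
```

With a single iteration the thread start cost is pure overhead, so the sketch reports no benefit in that case.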
Due to the foreach semantics, r depends on the input data and must be estimated from the frequency of iterations. Rewriting is realized by the following algorithm. We check that there are no dependencies between iterations and that we are allowed to change the temporal order. In case of independence, we compile all operators of the iteration body into r subflows plus one additional subflow. Each of the r subflows references a specific data partition of the inbound data set, while the additional (last) subflow is used for all partitions that exceed the estimated number of iterations.
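The assignment of data partitions to the compiled subflows can be sketched as follows. This is an illustration under assumptions rather than the actual compilation step: the independence check is assumed to have succeeded already, partitions are identified by their index, and the function name `assign_partitions` is hypothetical.

```python
def assign_partitions(num_partitions, r_est):
    """Map partition indexes to r_est regular subflows plus one overflow subflow.

    Returns a list of r_est + 1 lists. Subflow i (i < r_est) handles
    partition i; the last subflow handles every partition beyond the
    estimated number of iterations.
    """
    subflows = [[] for _ in range(r_est + 1)]
    for p in range(num_partitions):
        # Partitions past the estimate all fall into the last subflow.
        subflows[min(p, r_est)].append(p)
    return subflows


# Underestimation: 5 partitions arrive, but only 3 iterations were estimated.
print(assign_partitions(5, 3))  # [[0], [1], [2], [3, 4]]
# Overestimation: 2 partitions arrive, so two subflows stay unused.
print(assign_partitions(2, 3))  # [[0], [1], [], []]
```

The two calls show the two estimation errors: with underestimation the last subflow executes several partitions sequentially, and with overestimation some subflows remain empty.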
In conclusion, this optimization technique offers high optimization opportunities, similar to the rewriting of sequences. In contrast to sequences, however, the rewriting relies heavily on the estimated number of iterations. Because input data sets can be arbitrary, the estimate might result in unused subflows (overestimation) or in multiple partitions being executed by the last subflow (underestimation). This can be mitigated by dynamic runtime scheduling of parallel subflows (e.g., guided self-scheduling [PK87] or factoring [HSF91]). In addition, this technique can be combined with rewriting sequences (the operators of one iteration) to parallel flows. It should be applied before the techniques WC1 (rescheduling parallel flows) and WC4 (merging parallel flows).
Merging Parallel Flows
Recall that the costs of a Fork operator are determined by its most time-consuming subflow. The idea of the technique WC4: Merging Parallel Flows is that if the costs of the subflow with maximum costs subsume the costs of two or more other subflows, the subsumed subflows can be rewritten to one subflow in order to reduce the costs by W(Thread). Here, W(Thread) denotes the costs for thread creation, starting, and monitoring, in contrast to the previously used W(Start Thread), which denotes the time required for thread creation only. Therefore, all Fork operators with more than two subflows have to be considered. We achieve an execution time reduction because fewer threads are required, while the most time-consuming subflow is unchanged.
In general, this problem, with a given maximum constraint per partition and the optimization objective of minimizing the number of partitions, is reducible to the offline bin packing problem, which is known to be NP-hard. Therefore, we use an extension of the heuristic first fit algorithm [Joh74] that works as follows. First, we determine the subflow with maximum costs. Second, for each old subflow, we check if it can be merged with an existing new subflow. If this is possible, we merge the subflow with the first fit subflow by temporally concatenating the sequences of operators; otherwise, we create a new subflow. In the worst case, we check each new subflow for each old subflow. Thus, the time complexity is given by $O(m^2)$. In the following, we use an example to illustrate this heuristic algorithm.
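The two steps described above can be sketched as a plain first fit pass whose bin capacity is the cost of the most expensive subflow. This is a sketch under assumptions, not the extended algorithm itself: subflows are reduced to a single cost value each, the thread costs W(Thread) are not part of the fit check, and the function name and the cost values are hypothetical (they are not taken from Figure 3.14).

```python
def merge_parallel_flows(costs):
    """First fit merging of subflows, bounded by the maximum subflow cost.

    costs: list of estimated costs, one per old subflow.
    Returns a list of new subflows, each a list of old subflow indexes
    in the order in which their operator sequences are concatenated.
    """
    if not costs:
        return []
    bound = max(costs)  # step 1: the most expensive subflow is the upper bound
    merged = []         # each entry: [accumulated cost, [old subflow indexes]]
    for idx, cost in enumerate(costs):
        # Step 2: place the old subflow into the first new subflow it fits in.
        for entry in merged:
            if entry[0] + cost <= bound:
                entry[0] += cost
                entry[1].append(idx)
                break
        else:
            # No existing new subflow can absorb it: open a new one.
            merged.append([cost, [idx]])
    return [members for _, members in merged]


# Hypothetical costs in ms for five old subflows.
print(merge_parallel_flows([40, 30, 100, 20, 50]))  # [[0, 1, 3], [2], [4]]
```

In this run five threads are reduced to three while the subflow with cost 100 stays untouched, so the Fork costs are still bounded by that subflow. The nested loop over old and new subflows reflects the $O(m^2)$ worst case.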
Example 3.11 (Merging Parallel Flows). Recall our example plan $P_7$ from Example 3.9. Further, assume changed execution times as shown in Figure 3.14(a). First, we determine the fourth subflow (349 ms) as upper bound for our extended first fit algorithm. Second,