Cost-Based Optimization of Integration Flows - Datenbanken ...
4 Vectorizing Integration Flows
We ran our experiments on the same platform as described in Section 3.6. Further,
we executed all experiments on synthetically generated XML data (using our DIPBench
toolsuite [BHLW08c]). Because vectorization is a control-flow-oriented optimization
technique, the data distribution of real data sets has only a minor influence on the
benefit it achieves. However, several other aspects do influence vectorization. In general,
we used five scale factors for all three execution approaches: the data size d of input
messages, the number of operators m, the time interval t between two arriving messages,
the number of plan instances n, and the maximum number of messages q allowed in a queue.
End-to-End Comparison and Scalability<br />
Similar to the general comparison experiment of optimized and unoptimized plan execution,
whose results are shown in Figure 3.22, we first evaluated the impact of vectorization
and cost-based vectorization compared to the unoptimized execution for our example use
case plans. In detail, we executed 20,000 plan instances for all asynchronous, data-driven
example plans (P1, P2, P5, and P7) and for each execution model. We fixed the cardinality
of input data sets to d = 1 (100 kB messages) and used the same workload configuration
(without workload changes and without correlations) as in the mentioned experiment of
Chapter 3. Note that the normal cost-based plan rewriting is orthogonal to vectorization:
vectorization achieves additional improvements, except for the effects of rewriting
patterns to parallel flows. In order to focus on vectorization, we disabled all other
optimization techniques. Furthermore, we fixed an optimization interval of ∆t = 5 min,
a sliding window size of ∆w = 5 min, and EMA as the workload aggregation method. To
summarize, we consistently observed significant reductions of the total execution time
(see Figure 4.19(a)) of 71% (P1), 72% (P2), 69% (P5), and 55% (P7). In contrast to Chapter 3,
we measured the scenario elapsed time (the latency time of the message sequence) because,
for vectorized execution, the execution times of single plan instances cannot be aggregated
due to overlapping message execution (pipeline semantics).
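To illustrate why per-instance execution times cannot simply be added up under pipeline semantics, the following minimal sketch simulates messages flowing through a linear chain of operators with one worker per operator. It is an illustration under simplifying assumptions and not the execution model of the evaluated system: the function name `pipeline_times`, the fixed per-operator costs, and the unbounded FIFO queues (that is, no maximum queue length q) are all hypothetical.

```python
def pipeline_times(op_costs, n_messages, interarrival=0.0):
    """Simulate n_messages passing through a linear pipeline.

    Assumptions (hypothetical, for illustration only): one worker per
    operator, fixed cost per operator, unbounded FIFO queues.
    Returns (scenario_elapsed_time, sum_of_instance_times).
    """
    free_at = [0.0] * len(op_costs)  # time at which each operator is idle again
    sum_instance_times = 0.0
    last_finish = 0.0
    for i in range(n_messages):
        arrival = i * interarrival
        t = arrival
        for k, cost in enumerate(op_costs):
            # An operator starts once the message has arrived from the
            # previous operator and the operator itself is idle.
            start = max(t, free_at[k])
            t = start + cost
            free_at[k] = t
        sum_instance_times += t - arrival  # latency of this single message
        last_finish = t
    return last_finish, sum_instance_times


# Three operators with unit cost, four messages arriving at once.
elapsed, summed = pipeline_times([1.0, 1.0, 1.0], 4)
print(elapsed)  # 6.0: the sequence completes after m + n - 1 time units
print(summed)   # 18.0: individual latencies overlap and include queueing time
```

In this toy setting, the summed instance times (18.0) exceed the time the whole sequence actually takes (6.0) because the executions of different messages overlap, which is why the elapsed time of the complete message sequence is the meaningful measure for pipelined execution.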
Figure 4.19: Use Case Comparison of Vectorization. (a) Scenario Elapsed Time; (b) First Optimization Time.
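As a reading aid, a relative reduction of the elapsed time can be converted into a speedup factor. The symbols below are introduced here for illustration only: T_base is the elapsed time of the reference configuration, T_opt the elapsed time of the configuration it is compared against, r the relative reduction, and s the speedup. The numbers are simple arithmetic on the percentages quoted above, rounded to one decimal place.

```latex
r = 1 - \frac{T_{\mathrm{opt}}}{T_{\mathrm{base}}},
\qquad
s = \frac{T_{\mathrm{base}}}{T_{\mathrm{opt}}} = \frac{1}{1 - r}

% Applied to the reported reductions:
% P1: r = 0.71  =>  s = 1 / 0.29 \approx 3.4
% P2: r = 0.72  =>  s = 1 / 0.28 \approx 3.6
% P5: r = 0.69  =>  s = 1 / 0.31 \approx 3.2
% P7: r = 0.55  =>  s = 1 / 0.45 \approx 2.2
```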
First, the full vectorization approach led to a significant reduction of the total elapsed
time for the execution of the sequence of 20,000 plan instances. We achieved a speedup by a
factor of three for the plans P1, P2, and P5, while for the plan P7 we achieved a speedup by
a factor of two. Furthermore, cost-based vectorization improved on full vectorization by
about 10%. However, there are cases where cost-based vectorization caused only a
minor improvement because plans such as P1 are too restrictive with regard to merging
execution buckets (e.g., the combination of a Switch operator with specific paths is not