Cost-Based Optimization of Integration Flows - Datenbanken ...
3 Fundamentals of Optimizing Integration Flows<br />
Experimental Setting<br />
The experimental setup comprises an IBM blade with two processors (each a Dual Core<br />
AMD Opteron Processor 270 at 2 GHz) and 9 GB RAM, where we used Linux openSUSE<br />
9.1 (32 bit) as the operating system. Our WFPE (workflow process engine) realizes the<br />
extended reference system architecture (described in this chapter) including our optimization<br />
component. The WFPE is implemented using Java 1.6 as the programming language<br />
and comprises approximately 36,000 lines of code. It currently includes several adapters (inbound<br />
and outbound) for the interaction with files, databases, and Web services. However,<br />
the external systems used by our example integration flows were simulated with file<br />
adapters in order to minimize the influence of these systems on the measured experimental<br />
results (reproducibility). In order to use arbitrary workload scenarios with different cardinalities<br />
and selectivities (workability), we executed all experiments on synthetic XML<br />
data generated using our DIPBench toolsuite [BHLW08c].<br />
There are several parameters that influence plan execution. Essentially, we analyzed<br />
two groups of parameters. First, there is the group of optimization parameters, where we<br />
used different sliding window sizes ∆w (default: 5 min), different optimization intervals ∆t<br />
(default: 5 min), and different workload aggregation methods Agg (default: EMA). Second,<br />
there is the group of workload characteristics. Here, we used different numbers of plan<br />
instances n (executed instances), different plans with certain numbers of operators m,<br />
different input data sizes d (default: d = 1, which stands for 100 kB input messages), and<br />
different selectivities. With regard to the applied optimization techniques, we used all cost-based<br />
optimization techniques, except message indexing and heterogeneous load balancing<br />
(both not discussed in this thesis) as well as vectorization (see Chapter 4) and multi-flow<br />
optimization (see Chapter 5). Furthermore, we disabled all rule-based optimization<br />
techniques in order to focus on the benefit achieved by cost-based optimization because<br />
these techniques either did not apply (e.g., algebraic simplifications) for the used plans or<br />
they achieved a constant absolute improvement (e.g., static node compilation) for both<br />
the unoptimized and the cost-based optimized execution.<br />
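The interplay of the sliding window size ∆w and EMA (exponential moving average) as the workload aggregation method can be pictured with a minimal sketch. It is an illustration under stated assumptions, not the WFPE implementation: the class name `SlidingWindowEma`, the smoothing factor `alpha`, and the convention that samples arrive in time order are all hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical helper: EMA over monitored statistics inside a sliding time window. */
class SlidingWindowEma {
    private static final class Sample {
        final long timestampMillis;
        final double value;
        Sample(long timestampMillis, double value) {
            this.timestampMillis = timestampMillis;
            this.value = value;
        }
    }

    private final long windowMillis;  // sliding window size (delta w)
    private final double alpha;       // smoothing factor in (0, 1]; an assumption
    private final Deque<Sample> window = new ArrayDeque<Sample>();

    SlidingWindowEma(long windowMillis, double alpha) {
        if (windowMillis <= 0 || alpha <= 0.0 || alpha > 1.0) {
            throw new IllegalArgumentException("invalid window size or alpha");
        }
        this.windowMillis = windowMillis;
        this.alpha = alpha;
    }

    /** Records one monitored value; samples are assumed to arrive in time order. */
    void add(long timestampMillis, double value) {
        window.addLast(new Sample(timestampMillis, value));
    }

    /** Aggregates all samples not older than the window; returns NaN if none remain. */
    double aggregate(long nowMillis) {
        // Evict samples that have fallen out of the sliding window.
        while (!window.isEmpty()
                && window.peekFirst().timestampMillis < nowMillis - windowMillis) {
            window.removeFirst();
        }
        // Fold the remaining samples, oldest first, so newer values weigh more.
        double ema = Double.NaN;
        for (Sample s : window) {
            ema = Double.isNaN(ema) ? s.value : alpha * s.value + (1.0 - alpha) * ema;
        }
        return ema;
    }
}
```

With ∆w = 5 min, such a helper would be created as `new SlidingWindowEma(300000L, alpha)` and queried once per optimization interval ∆t.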
End-to-End Comparison and Optimization Benefits<br />
In a first series of experiments, we compared the end-to-end performance of no-optimization<br />
versus the periodical re-optimization. These experiments already include all optimization<br />
overheads (such as statistics maintenance and periodical re-optimization). As a<br />
result, these experiments show the overall benefit achieved by periodical re-optimization.<br />
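To make the notion of periodical re-optimization concrete, the following sketch shows one way a background task could replace the current plan at a fixed interval while new plan instances simply pick up whatever plan is current. It is only an illustration: `PlanOptimizer` and `PeriodicReoptimizer` are hypothetical names, and the sketch assumes that the optimizer runs asynchronously to plan execution and that running instances keep the plan they started with.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical optimizer callback: maps the current plan to a (possibly) better one. */
interface PlanOptimizer<P> {
    P optimize(P currentPlan);
}

/** Hypothetical driver that re-optimizes a plan every fixed interval (delta t). */
class PeriodicReoptimizer<P> {
    private final AtomicReference<P> currentPlan;
    private final PlanOptimizer<P> optimizer;
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();

    PeriodicReoptimizer(P initialPlan, PlanOptimizer<P> optimizer) {
        this.currentPlan = new AtomicReference<P>(initialPlan);
        this.optimizer = optimizer;
    }

    /** Called when a new plan instance starts; running instances are not affected. */
    P planForNextInstance() {
        return currentPlan.get();
    }

    /** One optimization step: compute a new plan and publish it atomically. */
    void reoptimizeOnce() {
        currentPlan.set(optimizer.optimize(currentPlan.get()));
    }

    /** Schedules reoptimizeOnce() every intervalMillis on a background thread. */
    void start(long intervalMillis) {
        scheduler.scheduleAtFixedRate(new Runnable() {
            public void run() {
                reoptimizeOnce();
            }
        }, intervalMillis, intervalMillis, TimeUnit.MILLISECONDS);
    }

    /** Stops the background task. */
    void stop() {
        scheduler.shutdownNow();
    }
}
```

Publishing the plan through an `AtomicReference` keeps plan execution free of locks, so in this sketch the only synchronous overhead left on the execution path would be statistics maintenance.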
First, we compared the periodical re-optimization with no-optimization. The periodical<br />
re-optimization was realized as an asynchronous inter-instance optimization approach. We<br />
executed 100,000 instances of our example plan P5 for the non-optimized plan as well as<br />
for the optimized plan and measured re-optimization and plan execution time. For periodical<br />
re-optimization, the plan execution time already includes the synchronous statistics<br />
maintenance. During execution, we varied the input cardinality (see Figure 3.20(b))<br />
and the selectivities of the three selection operators (see Figure 3.20(a)). Here, the input<br />
data was generated without correlations between different attributes. With regard to re-optimization,<br />
there are four points (∗1, ∗2, ∗3, and ∗4) where a workload change (shown<br />
as intersection points between selectivities) causes a change of the optimal plan.<br />
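Why an intersection of two selectivity curves changes the optimal plan can be illustrated with a textbook cost model. The sketch assumes uncorrelated attributes (as in the generated data), equal per-tuple cost for every selection operator, and a plan that is simply a chain of selections. These are simplifying assumptions, and `SelectionOrdering` is a hypothetical helper rather than the cost model of this thesis.

```java
import java.util.Arrays;
import java.util.Comparator;

/** Hypothetical helper: order a chain of selections by ascending selectivity. */
class SelectionOrdering {

    /** Returns operator indices sorted by ascending selectivity (most selective first). */
    static Integer[] optimalOrder(final double[] selectivities) {
        Integer[] order = new Integer[selectivities.length];
        for (int i = 0; i < order.length; i++) {
            order[i] = i;
        }
        Arrays.sort(order, new Comparator<Integer>() {
            public int compare(Integer a, Integer b) {
                return Double.compare(selectivities[a], selectivities[b]);
            }
        });
        return order;
    }

    /** Number of tuples processed by the chain, assuming unit cost per input tuple. */
    static double chainCost(double inputCardinality, double[] selectivities, Integer[] order) {
        double cost = 0.0;
        double cardinality = inputCardinality;
        for (int idx : order) {
            cost += cardinality;               // this operator sees all current tuples
            cardinality *= selectivities[idx]; // and passes on only the qualifying ones
        }
        return cost;
    }
}
```

For selectivities 0.8, 0.2, and 0.5 and 1,000 input tuples, the original order processes 1,000 + 800 + 160 = 1,960 tuples, whereas the order by ascending selectivity processes 1,000 + 200 + 100 = 1,300. Under this model, the ranking of the operators, and hence the optimal order, changes exactly when two selectivity curves cross.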
For periodical re-optimization, we used an optimization interval of ∆t = 5 min, a sliding<br />
window size of ∆w = 5 min and EMA as the workload aggregation method. Figure 3.20(c)<br />