25.01.2015 Views

Cost-Based Optimization of Integration Flows - Datenbanken ...


3 Fundamentals of Optimizing Integration Flows

Experimental Setting

The experimental setup comprises an IBM blade with two processors (each a Dual Core AMD Opteron Processor 270 at 2 GHz) and 9 GB RAM, where we used Linux openSUSE 9.1 (32 bit) as the operating system. Our WFPE (workflow process engine) realizes the extended reference system architecture (described in this chapter) including our optimization component. The WFPE is implemented in Java 1.6 and comprises approximately 36,000 lines of code. It currently includes several adapters (inbound and outbound) for the interaction with files, databases, and Web services. However, the external systems used by our example integration flows have been simulated with file adapters in order to minimize the influence of these systems on the measured experimental results (reproducibility). In order to use arbitrary workload scenarios with different cardinalities and selectivities (workability), we executed all experiments on synthetic XML data generated using our DIPBench toolsuite [BHLW08c].

There are several parameters that influence plan execution. Essentially, we analyzed two groups of parameters. First, there is the group of optimization parameters, where we used different sliding window sizes ∆w (default: 5 min), different optimization intervals ∆t (default: 5 min), and different workload aggregation methods Agg (default: EMA). Second, there is the group of workload characteristics. Here, we used different numbers of executed plan instances n, different plans with certain numbers of operators m, different input data sizes d (default: d = 1, which stands for 100 kB input messages), and different selectivities. With regard to the applied optimization techniques, we used all cost-based optimization techniques except message indexing and heterogeneous load balancing (both not discussed in this thesis) as well as vectorization (see Chapter 4) and multi-flow optimization (see Chapter 5). Furthermore, we disabled all rule-based optimization techniques in order to focus on the benefit achieved by cost-based optimization, because these techniques either did not apply to the used plans (e.g., algebraic simplifications) or achieved a constant absolute improvement (e.g., static node compilation) for both the unoptimized and the cost-based optimized execution.
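
The interplay of the sliding window ∆w and the workload aggregation method can be illustrated with a minimal sketch. It assumes that EMA denotes a standard exponential moving average and that statistics arrive as timestamped observations; the class name `SlidingWindowEMA`, the smoothing factor `alpha`, and the eviction rule are illustrative assumptions and are not taken from the WFPE implementation.

```python
from collections import deque


def ema(values, alpha=0.5):
    """Exponential moving average over values ordered oldest to newest.

    alpha is an assumed smoothing factor: larger values weight recent
    observations more strongly. Returns None for an empty input.
    """
    avg = None
    for v in values:
        avg = v if avg is None else alpha * v + (1.0 - alpha) * avg
    return avg


class SlidingWindowEMA:
    """Hypothetical statistics container: keeps only observations that fall
    into the sliding window and aggregates them with an EMA."""

    def __init__(self, window_seconds=300.0, alpha=0.5):
        self.window_seconds = window_seconds  # 300 s corresponds to 5 min
        self.alpha = alpha
        self._observations = deque()  # (timestamp, value), oldest first

    def record(self, timestamp, value):
        # Called once per executed plan instance (statistics maintenance).
        self._observations.append((timestamp, value))

    def aggregate(self, now):
        # Drop observations older than the window, then aggregate the rest.
        cutoff = now - self.window_seconds
        while self._observations and self._observations[0][0] < cutoff:
            self._observations.popleft()
        return ema((v for _, v in self._observations), self.alpha)
```

For example, after recording selectivities 0.2, 0.4, and 0.8 at seconds 0, 100, and 400, a call to `aggregate(400.0)` with the default window evicts the first observation and aggregates the remaining two to about 0.6. In this sketch, an optimizer would call `aggregate` once per optimization interval ∆t.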

End-to-End Comparison and Optimization Benefits

In a first series of experiments, we compared the end-to-end performance of no-optimization versus periodical re-optimization. These experiments already include all optimization overheads (such as statistics maintenance and periodical re-optimization). As a result, these experiments show the overall benefit achieved by periodical re-optimization.

First, we compared the periodical re-optimization with no-optimization. The periodical re-optimization was realized as an asynchronous inter-instance optimization approach. We executed 100,000 instances of our example plan P5 for the non-optimized plan as well as for the optimized plan and measured re-optimization and plan execution time. For periodical re-optimization, the plan execution time already includes the synchronous statistics maintenance. During execution, we varied the input cardinality (see Figure 3.20(b)) and the selectivities of the three selection operators (see Figure 3.20(a)). Here, the input data was generated without correlations between different attributes. With regard to re-optimization, there are four points (∗1, ∗2, ∗3, and ∗4) where a workload change (shown as intersection points between selectivities) causes a change of the optimal plan.
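
Why an intersection of two selectivity curves changes the optimal plan can be seen in a small sketch. It assumes a chain of three independent selection operators with identical per-tuple cost, so that the cheapest order is simply ascending selectivity; the operator names `sigma_A`, `sigma_B`, `sigma_C`, the selectivity values, and the cost function are hypothetical and are not the cost model of the thesis.

```python
def chain_cost(selectivities, cardinality):
    """Tuples processed by a chain of selections, assuming unit cost per
    tuple and uncorrelated predicates (selectivities multiply)."""
    cost = 0.0
    rows = float(cardinality)
    for s in selectivities:
        cost += rows  # each operator inspects every tuple it receives
        rows *= s     # only the qualifying fraction is passed on
    return cost


def best_order(selectivity_by_operator):
    """Order operators by ascending selectivity (most selective first)."""
    return sorted(selectivity_by_operator, key=selectivity_by_operator.get)


def plan_cost(order, selectivity_by_operator, cardinality):
    return chain_cost([selectivity_by_operator[op] for op in order], cardinality)


# Hypothetical workload before and after the curves of sigma_A and sigma_B cross.
before = {"sigma_A": 0.2, "sigma_B": 0.5, "sigma_C": 0.8}
after = {"sigma_A": 0.6, "sigma_B": 0.5, "sigma_C": 0.8}

old_plan = best_order(before)  # sigma_A first
new_plan = best_order(after)   # sigma_B first

# Keeping the old plan under the new workload is more expensive than the
# re-optimized plan, which is the benefit that re-optimization targets.
stale_cost = plan_cost(old_plan, after, 1000)
fresh_cost = plan_cost(new_plan, after, 1000)
```

With these assumed values, the stale plan processes about 1,900 tuples per 1,000 input tuples, whereas the re-ordered plan processes about 1,800. Between two intersection points the ordering stays stable, so re-optimization finds no cheaper plan there.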

For periodical re-optimization, we used an optimization interval of ∆t = 5 min, a sliding window size of ∆w = 5 min, and EMA as the workload aggregation method. Figure 3.20(c)
