25.01.2015 Views

Cost-Based Optimization of Integration Flows - Datenbanken ...

5 Multi-Flow Optimization

mention its complexity. Similar to the plan vectorization algorithm (A-PV, Subsection 4.2.2), the A-MPR exhibits a cubic worst-case time complexity of $O(m^3)$ in the number of operators $m$. The rationale for this is the already analyzed dependency checking. Note that the additional inner loop over following operators does not change this asymptotic behavior because each operator is assigned only once to an inserted Iteration operator.

The split and merge approach realizes the transparent plan rewriting and thus enables the execution of message partitions even in the case of multiple partitioning attributes. The rewritten plan mainly depends on the cost-based derived partitioning scheme, which neglects any additional costs of PSplit and PMerge operators. The reason for this optimization objective is that the ordering of partitioning attributes has a higher influence on the overall performance (partitioned queue maintenance and benefit by partitioning) than the additional operators, because PSplit and PMerge are low-cost operators that scale linearly with the number of messages due to the efficient hash partition tree data structure. In addition to the throughput improvement achieved by executing operations on partitions of messages, the inserted Iteration operators offer further optimization potential. In detail, the technique WC3: Rewriting Iterations to Parallel Flows, described in Subsection 3.4.1, can be applied after the A-MPR in order to additionally achieve a higher degree of parallelism that further increases the throughput.

To summarize, we discussed the necessary preconditions in order to enable the horizontal message queue partitioning and the execution of operations on these message partitions. In detail, we introduced the (hash) partition tree as a message queue data structure that allows the hierarchical partitioning of messages according to certain partitioning attributes. Furthermore, we introduced basic algorithms (1) for deriving candidate partitioning attributes from a plan, (2) for deriving the optimal partitioning scheme of attributes, and (3) for rewriting the plan according to this scheme. Only minor changes of operators and the execution environment are necessary, while all other aspects are issues of logical optimization and therefore fit seamlessly into our cost-based optimization framework. Multi-flow optimization now reduces to the challenge of computing the optimal waiting time.
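The idea of hierarchical partitioning can be illustrated with a minimal Python sketch. It is not the partition tree implementation of this work: the function name `build_partition_tree`, the representation of messages as plain dictionaries, and the attribute names in the usage note are assumptions made only for illustration, and aspects such as queue maintenance and the cost-based attribute ordering are left out.

```python
def build_partition_tree(messages, attributes):
    """Group messages hierarchically by a list of partitioning attributes.

    Hypothetical helper: each level of the returned nested dict corresponds
    to one attribute, and the leaves are lists of messages that share all
    attribute values on the path (arrival order is preserved in each leaf).
    """
    if not attributes:
        return list(messages)
    first, rest = attributes[0], attributes[1:]
    groups = {}
    for msg in messages:
        # Hash-based lookup of the partition for this attribute value.
        groups.setdefault(msg[first], []).append(msg)
    return {value: build_partition_tree(group, rest)
            for value, group in groups.items()}
```

For example, calling `build_partition_tree(msgs, ["customer", "type"])` on a list of dictionaries that carry the illustrative keys `customer` and `type` yields one subtree per customer and, below it, one leaf list per message type. Each message is touched once per attribute level, which matches the linear behavior in the number of messages described above for a fixed partitioning scheme.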

5.3 Periodical Re-Optimization

The cost-based decision of the multi-flow optimization technique is to compute the optimal waiting time ∆tw in order to adjust the trade-off between message throughput and latency times of single messages according to the current workload characteristics. In this section, we define the formal optimization objective, explain the extended cost model and cost estimation, discuss the waiting time computation, and finally show how to integrate this optimization technique into our cost-based optimization framework.

5.3.1 Formal Problem Definition

As described in Section 2.3.1, we assume a message sequence $M = \{m_1, m_2, \ldots, m_n\}$ of incoming messages, where each message $m_i$ is modeled as a $(t_i, d_i, a_i)$-tuple, where $t_i \in \mathbb{Z}^+$ denotes the incoming timestamp of the message, $d_i$ denotes a semi-structured tree of name-value data elements, and $a_i$ denotes a list of additional atomic name-value attributes. Each message $m_i$ is processed by an instance $p_i$ of a plan $P$, and $t_{out}(m_i) \in \mathbb{Z}^+$ denotes the timestamp when the message has been successfully executed. Here, the latency of a single message $T_L(m_i)$ is given by $T_L(m_i) = t_{out}(m_i) - t_i(m_i)$.
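The message model and the latency definition above can be written down as a small Python sketch. The class name `Message`, its field names, and the function `latency` are hypothetical and chosen only to mirror the $(t_i, d_i, a_i)$-tuple; timestamps are assumed to be positive integers in a common time unit.

```python
from dataclasses import dataclass, field


@dataclass
class Message:
    """Illustrative stand-in for a (t_i, d_i, a_i)-tuple."""
    t_in: int                                   # incoming timestamp t_i
    data: dict = field(default_factory=dict)    # tree of name-value data elements d_i
    attrs: dict = field(default_factory=dict)   # atomic name-value attributes a_i


def latency(msg: Message, t_out: int) -> int:
    """Return T_L(m_i) = t_out(m_i) - t_i(m_i)."""
    if t_out < msg.t_in:
        # A message cannot finish before it arrived.
        raise ValueError("t_out must not precede the incoming timestamp")
    return t_out - msg.t_in
```

A message that arrives at timestamp 100 and finishes execution at timestamp 130 therefore has a latency of 30 time units, regardless of how much of that interval was spent waiting in the queue.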
