25.01.2015 Views

Cost-Based Optimization of Integration Flows - Datenbanken ...


4 Vectorizing Integration Flows

2. Scheme Evaluation: Evaluate all distribution schemes according to the optimization objective $\phi$ or $\phi_c$.

3. Plan Rewriting: Rewrite the plan according to the distribution scheme.

In the following, we briefly describe each of these steps in more technical detail.

1: Scheme Enumeration: In order to enumerate all possible distribution schemes of a plan P with m operators, we recursively use Algorithm 4.2. As a first step, we create a MEMO table with m columns. In a second step, for each $k \in [1, m]$, we create a record of length k and invoke the recursive A-EDS. Conceptually, this algorithm varies the number of operators of bucket 1 (line 5) and recursively invokes itself in order to distribute the remaining operators across buckets 2 to k. It then varies the number of operators of bucket 2 and so on. Finally, once only the last bucket remains, it receives all remaining operators and we insert the tuple into the MEMO structure; alternatively, we could directly evaluate the enumerated scheme at this point. As a result, the MEMO structure holds all $2^{m-1}$ candidate distribution schemes. Note that this approach is applied recursively to complex operators and that it uses different loop conditions for sets of operators.
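The count of $2^{m-1}$ can be checked with a short derivation. It assumes, as the enumeration above suggests, that the operators form a sequence and that a scheme with k buckets corresponds to choosing k − 1 cut points among the m − 1 gaps between consecutive operators:

```latex
\sum_{k=1}^{m} \binom{m-1}{k-1} \;=\; \sum_{j=0}^{m-1} \binom{m-1}{j} \;=\; 2^{m-1}
```

For example, m = 4 yields 1 + 3 + 3 + 1 = 8 schemes.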

Algorithm 4.2 Enumerate Distribution Schemes (A-EDS)

Require: number of operators m, number of buckets k, record r, position pos

1: if k = 1 then
2:     r.pos[pos] ← m // last bucket takes all remaining operators
3:     insert r into MEMO
4: else
5:     for i ← 1 to m − k + 1 do // for each operator o_i
6:         r.pos[pos] ← i
7:         A-EDS(m − i, k − 1, r, pos + 1) // recursively enumerate distribution schemes
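The recursion can be tried out with a minimal Python sketch. It is an illustration under assumptions, not the thesis implementation: the names `enumerate_schemes` and `a_eds` are hypothetical, the MEMO structure is modeled as a plain list, positions are 0-based, and only a linear sequence of operators is handled (no complex operators or operator sets).

```python
def enumerate_schemes(m):
    """Return all distribution schemes of m operators as tuples of bucket sizes."""
    memo = []  # stand-in for the MEMO structure (assumption: a plain list)

    def a_eds(remaining, k, record, pos):
        if k == 1:
            # Last bucket: it takes all remaining operators.
            record[pos] = remaining
            memo.append(tuple(record))  # store a copy, record is reused
        else:
            # Vary the size i of the current bucket so that every
            # following bucket still gets at least one operator.
            for i in range(1, remaining - k + 2):
                record[pos] = i
                a_eds(remaining - i, k - 1, record, pos + 1)

    # One record of length k for each possible number of buckets k.
    for k in range(1, m + 1):
        a_eds(m, k, [0] * k, 0)
    return memo


schemes = enumerate_schemes(4)
print(len(schemes))  # 8, i.e. 2**(4-1)
```

Each tuple lists the number of operators per bucket, so `(1, 3)` means that the first operator forms bucket 1 and the remaining three operators form bucket 2.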

2: Scheme Evaluation: Having enumerated all candidates, we can now iterate over the MEMO structure and evaluate those schemes in order to determine the optimal scheme according to the optimization objectives $\phi$ or $\phi_c$. Recall the problem definition of cost-based vectorization, i.e., the overall performance of a vectorized plan depends on the most time-consuming operator. Here, the costs of a bucket are defined as the sum of the costs of all operators in that bucket. We then determine the bucket with maximum costs. The overall optimization objective $\phi$ is to minimize the number of buckets under the condition of lowest possible maximum bucket costs. In general, all $2^{m-1}$ candidate schemes need to be evaluated. However, we could prune schemes where (1) we have already determined that a bucket exceeds the maximum execution time and (2) the number of buckets exceeds the minimum number of buckets seen so far. These pruning techniques can be realized on-the-fly during scheme enumeration or with skip-list structures as known from other research areas such as join enumeration [HKL+08] or time series analysis [GZ08].
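The evaluation step can be sketched in Python as an exhaustive search without pruning. Several parts are assumptions made for illustration: the operator costs are invented, the helper names are hypothetical, and the candidates are generated with `itertools.combinations` over cut points as a stand-in for reading the MEMO structure. The objective $\phi$ is modeled as a lexicographic minimum over (maximum bucket cost, number of buckets).

```python
from itertools import combinations


def schemes(m):
    """Yield every scheme as a tuple of bucket sizes (stand-in for MEMO)."""
    for k in range(1, m + 1):
        # k buckets correspond to k - 1 cut points between operators.
        for cuts in combinations(range(1, m), k - 1):
            bounds = (0, *cuts, m)
            yield tuple(b - a for a, b in zip(bounds, bounds[1:]))


def max_bucket_cost(scheme, costs):
    """Cost of a bucket = sum of its operator costs; return the maximum."""
    start, worst = 0, 0.0
    for size in scheme:
        worst = max(worst, sum(costs[start:start + size]))
        start += size
    return worst


def best_scheme(costs):
    """Lowest maximum bucket cost first, then the fewest buckets."""
    return min(schemes(len(costs)),
               key=lambda s: (max_bucket_cost(s, costs), len(s)))


costs = [2.0, 1.0, 4.0, 1.0, 3.0]  # hypothetical operator costs
print(best_scheme(costs))  # (2, 1, 2)
```

In this example the operator with cost 4.0 bounds the maximum bucket cost from below, so it stays alone in its bucket, while its cheaper neighbors are merged into buckets of cost 3.0 and 4.0.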

3: Plan Rewriting: Finally, we use the optimal scheme in order to rewrite the given plan P. For that, the A-PV can be reused with minor changes. Here, we do not create an execution bucket for each operator but use the computed number of buckets k. All operators of one bucket can be copied as a subplan, while data dependencies across execution buckets are replaced by queues. The general model of execution buckets (see Subsection 4.2) is reused as is, except that a bucket now executes an arbitrary subplan instead of a single operator.
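The idea of subplans connected by queues can be illustrated with a small Python sketch. It is not the A-PV algorithm: it assumes a purely linear plan, models operators as plain functions, runs each execution bucket as a thread, and uses hypothetical names (`split_plan`, `run_bucket`, `execute`, `STOP`).

```python
import queue
import threading

STOP = object()  # sentinel that shuts the pipeline down


def split_plan(operators, scheme):
    """Cut the operator sequence into consecutive subplans, one per bucket."""
    subplans, start = [], 0
    for size in scheme:
        subplans.append(operators[start:start + size])
        start += size
    return subplans


def run_bucket(subplan, inbox, outbox):
    """Execution bucket: apply the whole subplan to a message, then forward it."""
    while True:
        msg = inbox.get()
        if msg is STOP:
            outbox.put(STOP)  # propagate shutdown to the next bucket
            return
        for op in subplan:
            msg = op(msg)
        outbox.put(msg)


def execute(operators, scheme, messages):
    subplans = split_plan(operators, scheme)
    # One queue in front of each bucket plus one for the final results.
    queues = [queue.Queue() for _ in range(len(subplans) + 1)]
    threads = [
        threading.Thread(target=run_bucket, args=(sp, queues[i], queues[i + 1]))
        for i, sp in enumerate(subplans)
    ]
    for t in threads:
        t.start()
    for msg in messages:
        queues[0].put(msg)
    queues[0].put(STOP)
    results = []
    while (out := queues[-1].get()) is not STOP:
        results.append(out)
    for t in threads:
        t.join()
    return results


ops = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3]
print(execute(ops, (2, 1), [1, 2, 3]))  # [1, 3, 5]
```

With the scheme `(2, 1)`, the first two operators run together in one bucket and only one queue separates them from the third operator, instead of one queue per operator as in full vectorization.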

Similar to the analysis of Section 4.2, the rewriting algorithm still has a worst-case complexity of $O(m^3)$ for the case of k = m. Furthermore, the evaluation of a single distribution
