Cost-Based Optimization of Integration Flows - Datenbanken ...
4 Vectorizing Integration Flows
2. Scheme Evaluation: Evaluate all distribution schemes according to the optimization
objective $\phi$ or $\phi_c$.
3. Plan Rewriting: Rewrite the plan according to the distribution scheme.
In the following, we describe each of these steps in more technical depth.
1: Scheme Enumeration: In order to enumerate all possible distribution schemes
of a plan P with m operators, we use the recursive Algorithm 4.2. As a first step, we
create a MEMO table with m columns. In a second step, for each $k \in [1, m]$, we create a
record of length k and invoke the recursive A-EDS. Conceptually, this algorithm varies the
number of operators of bucket 1 (line 5) and recursively invokes itself in order to distribute
the remaining operators across buckets 2 to k. It then varies the number of operators of
bucket 2, and so on. Finally, when all remaining operators must be placed in the last
bucket, we insert the tuple into the MEMO structure; alternatively, we could directly evaluate
the enumerated scheme at this point. As a result, the MEMO structure holds all $2^{m-1}$
candidate distribution schemes. Note that this approach is applied recursively for complex
operators and that it uses different loop conditions for the case of sets of operators.
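The number of candidate schemes can be checked with a short counting argument, assuming the m operators form a sequence and every bucket holds a non-empty, contiguous run of operators. Under that assumption, a scheme with k buckets corresponds to choosing k − 1 of the m − 1 boundaries between neighboring operators, and summing over all k gives:

\[
\sum_{k=1}^{m} \binom{m-1}{k-1} = 2^{m-1}
\]

For m = 3, this yields 1 + 2 + 1 = 4 schemes, written as bucket sizes: (3), (1, 2), (2, 1), and (1, 1, 1).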
Algorithm 4.2 Enumerate Distribution Schemes (A-EDS)
Require: number of operators m, number of buckets k, record r, position pos
1: if k = 1 then
2:   r.pos[pos] ← m // the last bucket takes all remaining operators
3:   insert r into MEMO
4: else
5:   for i ← 1 to m − k + 1 do // for each operator o_i
6:     r.pos[pos] ← i
7:     A-EDS(m − i, k − 1, r, pos + 1) // recursively enumerate distribution schemes
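The recursion can be sketched in Python as follows. This is a minimal illustration and not the thesis implementation: the function names are hypothetical, a scheme is represented as a tuple of bucket sizes, a plain list stands in for the MEMO table, and only a linear sequence of operators is assumed (no complex operators or sets of operators).

```python
from typing import Iterator, List, Tuple


def enumerate_schemes(m: int, k: int) -> Iterator[Tuple[int, ...]]:
    """Yield all splits of m ordered operators into k non-empty buckets.

    Each scheme is a tuple of bucket sizes (an assumed representation).
    """
    def rec(remaining: int, buckets_left: int,
            prefix: List[int]) -> Iterator[Tuple[int, ...]]:
        if buckets_left == 1:
            # The last bucket takes all remaining operators.
            yield tuple(prefix + [remaining])
            return
        # Vary the size of the current bucket, leaving at least one
        # operator for each of the remaining buckets.
        for i in range(1, remaining - buckets_left + 2):
            yield from rec(remaining - i, buckets_left - 1, prefix + [i])

    yield from rec(m, k, [])


def enumerate_all_schemes(m: int) -> List[Tuple[int, ...]]:
    """Collect the schemes for every k in [1, m] in a list (the 'MEMO')."""
    memo: List[Tuple[int, ...]] = []
    for k in range(1, m + 1):
        memo.extend(enumerate_schemes(m, k))
    return memo
```

Because each recursive call builds a new prefix list, every stored scheme is an independent tuple and later iterations cannot overwrite earlier entries.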
2: Scheme Evaluation: Having enumerated all candidates, we can now iterate over
the MEMO structure and evaluate those schemes in order to determine the optimal scheme
according to the optimization objectives $\phi$ or $\phi_c$. Recall the problem definition of
cost-based vectorization, i.e., the overall performance of vectorized plans depends on the most
time-consuming operator. Here, the costs of a bucket are defined as the sum of the costs of all
operators in that bucket. We then determine the bucket with maximum costs. The overall
optimization objective $\phi$ is to minimize the number of buckets under the condition of
lowest possible maximum bucket costs. In general, all $2^{m-1}$ candidate schemes need to
be evaluated. However, we could prune schemes where (1) we have already determined that
a bucket exceeds the maximum execution time, and schemes where (2) the number of buckets
exceeds the minimum number of buckets seen so far. These pruning techniques can be realized
on-the-fly during scheme enumeration or with skip-list structures as known from other
research areas such as join enumeration [HKL+08] or time series analysis [GZ08].
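The evaluation step for objective φ can be illustrated with an exhaustive search without pruning. This sketch makes several assumptions that are not from the text: operator costs are given as a plain list, schemes are tuples of bucket sizes, candidates are generated with `itertools.combinations` over the bucket boundaries rather than read from a MEMO structure, and the function names are hypothetical. The objective is encoded as a lexicographic key: first the maximum bucket cost, then the number of buckets.

```python
from itertools import combinations
from typing import List, Optional, Sequence, Tuple


def bucket_costs(costs: Sequence[float], scheme: Sequence[int]) -> List[float]:
    """Sum the operator costs of each bucket for a scheme of bucket sizes."""
    result, start = [], 0
    for size in scheme:
        result.append(sum(costs[start:start + size]))
        start += size
    return result


def best_scheme(costs: Sequence[float]) -> Tuple[int, ...]:
    """Return a scheme with the lowest maximum bucket cost and, among
    those, the fewest buckets (exhaustive, no pruning)."""
    m = len(costs)
    best_key: Optional[Tuple[float, int]] = None
    best: Tuple[int, ...] = ()
    for k in range(1, m + 1):
        # Choose k - 1 of the m - 1 boundaries between operators.
        for cuts in combinations(range(1, m), k - 1):
            bounds = (0,) + cuts + (m,)
            scheme = tuple(b - a for a, b in zip(bounds, bounds[1:]))
            key = (max(bucket_costs(costs, scheme)), k)
            if best_key is None or key < best_key:
                best_key, best = key, scheme
    return best
```

For the hypothetical costs `[1, 2, 5, 1, 1]`, the most expensive operator costs 5, so no scheme can have a maximum bucket cost below 5; the search returns the three-bucket scheme `(2, 1, 2)` with bucket costs 3, 5, and 2.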
3: Plan Rewriting: Finally, we use the optimal scheme in order to rewrite the given
plan P. For that, the A-PV can be reused with minor changes. Here, we do not create an
execution bucket for each operator; instead, we create only the computed k buckets. All
operators of one bucket can be copied as a subplan, while data dependencies across execution
buckets are replaced by queues. The general model of execution buckets (see Subsection 4.2)
is reused as is, except that buckets contain arbitrary subplans instead of single operators.
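The idea of buckets as subplans connected by queues can be sketched for the simplest case. This is an illustration under strong assumptions and not the A-PV algorithm: the plan is a purely linear sequence of operators, each operator is modeled as a Python callable on a single message, each bucket runs in its own thread, and all names (`split_into_subplans`, `run_bucket`, `run_vectorized`, `_STOP`) are hypothetical.

```python
import queue
import threading
from typing import Callable, List, Sequence

Operator = Callable[[object], object]
_STOP = object()  # sentinel that signals the end of the message stream


def split_into_subplans(operators: Sequence[Operator],
                        scheme: Sequence[int]) -> List[List[Operator]]:
    """Group consecutive operators into subplans according to bucket sizes."""
    subplans, start = [], 0
    for size in scheme:
        subplans.append(list(operators[start:start + size]))
        start += size
    return subplans


def run_bucket(subplan: List[Operator], inbox: queue.Queue,
               outbox: queue.Queue) -> None:
    """Apply all operators of one bucket to each incoming message."""
    while True:
        msg = inbox.get()
        if msg is _STOP:
            outbox.put(_STOP)  # forward the end-of-stream signal
            return
        for op in subplan:
            msg = op(msg)
        outbox.put(msg)


def run_vectorized(operators: Sequence[Operator], scheme: Sequence[int],
                   messages: Sequence[object]) -> List[object]:
    """Run the buckets as a pipeline; queues replace data dependencies."""
    subplans = split_into_subplans(operators, scheme)
    queues = [queue.Queue() for _ in range(len(subplans) + 1)]
    threads = [
        threading.Thread(target=run_bucket,
                         args=(sp, queues[i], queues[i + 1]))
        for i, sp in enumerate(subplans)
    ]
    for t in threads:
        t.start()
    for msg in messages:
        queues[0].put(msg)
    queues[0].put(_STOP)
    results = []
    while True:
        out = queues[-1].get()
        if out is _STOP:
            break
        results.append(out)
    for t in threads:
        t.join()
    return results
```

Since each bucket is a single thread reading from a FIFO queue, messages leave the pipeline in the order in which they entered it.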
Similar to the analysis of Section 4.2, the rewriting algorithm still has a worst-case complexity
of $O(m^3)$ for the case of k = m. Furthermore, the evaluation of a single distribution