3.3 Periodical Re-Optimization

at T_2, we enrich the selectivity tuple with the missing information (due to reordering, we now have independent statistics for both operators). Over time, we use the exponential moving average (EMA) to adapt these statistics independently of the sliding time window. Finally, we see that the sequence (σ_A, σ_B) is the best choice because P(A) < P(B). We might make wrong decisions at the beginning, but the probability estimates converge to the real statistics over time. Hence, we do not make the same wrong decision twice, as long as the probability comparison and the data dependency between both operators do not change.
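To make the statistics maintenance concrete, the following sketch shows how per-operator selectivities could be adapted with an exponential moving average; the class name SelectivityStats and the smoothing factor alpha are illustrative assumptions, not taken from the described implementation.

```python
class SelectivityStats:
    """Illustrative EMA-based selectivity estimate for a single operator."""

    def __init__(self, alpha=0.2):
        self.alpha = alpha    # smoothing factor (assumed value)
        self.estimate = None  # NULL until the operator has been monitored

    def update(self, observed):
        """Fold a newly monitored selectivity into the running estimate."""
        if self.estimate is None:
            self.estimate = observed
        else:
            self.estimate = self.alpha * observed + (1.0 - self.alpha) * self.estimate


# After reordering, both operators are monitored independently, so the
# plan can order the more selective operator (smaller probability) first.
stats_a, stats_b = SelectivityStats(), SelectivityStats()
stats_a.update(0.1)  # monitored P(A)
stats_b.update(0.4)  # monitored P(B)
plan = ["sigma_A", "sigma_B"] if stats_a.estimate < stats_b.estimate else ["sigma_B", "sigma_A"]
print(plan)  # ['sigma_A', 'sigma_B'] because P(A) < P(B)
```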

As long as a selectivity tuple contains a missing value for an operator o_2, we make the assumption of statistical independence and hence compute P(o_2) = P(o_1 ∧ o_2)/P(o_1). Furthermore, we compute the conditional probabilities for a given operator, based on the described selectivity tuple, as follows:

\[
P(o_2 \mid o_1) = \frac{P(o_1 \wedge o_2)}{P(o_1)}, \qquad
P(o_1 \mid o_2) =
\begin{cases}
\dfrac{P(o_1 \wedge o_2)}{P(o_1)} = P(o_2 \mid o_1) & \text{if } P(o_2) = \text{NULL} \\[1ex]
\dfrac{P(o_1 \wedge o_2)}{P(o_2)} & \text{otherwise}
\end{cases}
\tag{3.15}
\]
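A minimal sketch of Equation 3.15 is given below, assuming the selectivity tuple is represented by the monitored values P(o_1) and P(o_1 ∧ o_2) plus an optionally missing P(o_2) (modeled as None); the function names are hypothetical.

```python
def p_o2_given_o1(p_o1, p_joint):
    """P(o_2 | o_1) = P(o_1 ∧ o_2) / P(o_1)."""
    return p_joint / p_o1


def p_o1_given_o2(p_o1, p_joint, p_o2=None):
    """P(o_1 | o_2) according to Equation 3.15.

    If P(o_2) is missing (NULL), fall back to P(o_1 ∧ o_2) / P(o_1),
    which equals P(o_2 | o_1); otherwise use the monitored P(o_2).
    """
    if p_o2 is None:
        return p_joint / p_o1
    return p_joint / p_o2


# Example selectivity tuple: P(o_1) = 0.5, P(o_1 ∧ o_2) = 0.2.
print(p_o2_given_o1(0.5, 0.2))        # 0.4
print(p_o1_given_o2(0.5, 0.2))        # 0.4  (P(o_2) = NULL case)
print(p_o1_given_o2(0.5, 0.2, 0.25))  # 0.8  (monitored P(o_2))
```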

Thus, we explicitly use the monitored conditional probabilities if available. It is important to note that we can only maintain the probability of the first operator and the joint probability of both operators. Hence, an additional problem of starvation in reordering decisions might occur: if the real probability of the second operator decreases, we cannot monitor this effect. In order to tackle this starvation problem in certain rewriting decisions, we use an aging strategy, where the probability of the second operator is slowly decreased over time. This prevents starvation because, over time, the decreased probability causes us to reorder both operators, after which we can monitor the actual probability of the second operator. Although this can cause suboptimal plans if the workload does not change, it prevents starvation and converges to the real selectivities and probabilities.
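One possible reading of this aging strategy is sketched below, assuming a multiplicative per-period decay of the unmonitored estimate; the concrete decay factor and the reordering trigger are assumptions for illustration.

```python
AGING_FACTOR = 0.95  # assumed per-period decay factor, not from the original work


def age(p_o2_estimate):
    """Slowly decrease the unmonitored probability estimate of the second operator."""
    return p_o2_estimate * AGING_FACTOR


# Example: P(o_1) = 0.3, stale estimate P(o_2) = 0.5. Aging eventually makes the
# estimate undercut P(o_1), which triggers a reordering; afterwards o_2 is
# executed first and its real selectivity can be monitored again.
p_o1, p_o2_est = 0.3, 0.5
periods = 0
while p_o2_est >= p_o1:
    p_o2_est = age(p_o2_est)
    periods += 1
print(f"reordering triggered after {periods} periods, aged P(o_2) = {p_o2_est:.3f}")
```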

For arbitrary chains of operators, the correlation table is used recursively along operators that are directly connected by data dependencies, such that an operator might be included in multiple entries of the correlation table. For such chains, we allow reordering only of directly connected operators. After a reordering, all entries of the correlation table that refer to a removed data dependency are removed as well, and new entries are created for the new data dependencies.
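The maintenance rule for chains could look as follows, assuming the correlation table is keyed by pairs of directly connected operators; the data structure and the decision to keep the entry of the swapped pair itself are illustrative assumptions.

```python
def reorder_directly_connected(chain, i, corr_table):
    """Swap the directly connected operators at positions i and i+1 of a chain.

    Correlation-table entries that refer to data dependencies removed by the
    swap are dropped; entries for the new data dependencies are created with
    missing (NULL) statistics. The entry for the swapped pair itself is kept,
    since its selectivity tuple is enriched after the reordering.
    """
    o1, o2 = chain[i], chain[i + 1]
    prev_op = chain[i - 1] if i > 0 else None
    next_op = chain[i + 2] if i + 2 < len(chain) else None

    chain[i], chain[i + 1] = o2, o1  # perform the reordering

    for pair in [(prev_op, o1), (o2, next_op)]:  # dependencies that vanish
        if None not in pair:
            corr_table.pop(pair, None)
    for pair in [(prev_op, o2), (o1, next_op)]:  # dependencies that appear
        if None not in pair:
            corr_table.setdefault(pair, {"p_first": None, "p_joint": None})
    return chain, corr_table


# Example chain s1 -> s2 -> s3 with entries for the direct dependencies.
chain = ["s1", "s2", "s3"]
table = {("s1", "s2"): {"p_first": 0.4, "p_joint": 0.1},
         ("s2", "s3"): {"p_first": 0.6, "p_joint": 0.3}}
reorder_directly_connected(chain, 1, table)  # swap s2 and s3
print(chain)          # ['s1', 's3', 's2']
print(sorted(table))  # [('s1', 's3'), ('s2', 's3')]
```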

In conclusion, the adaptive behavior of our condition selectivity approach ensures more robust and stable optimizer decisions for conditional probabilities and correlated data, even in the context of changing workload characteristics. Note that there are more sophisticated, heavyweight approaches for correlation-aware selectivity estimation in the context of DBMS, for example, using a maximum entropy approach [MHK+07, MMK+05] or a measure of clusteredness [HNM03]. We could also maintain adaptable multidimensional histograms [BCG01, AC99, LWV03, MVW00, KW99]. However, in contrast to DBMS, where data is static and statistics are required for arbitrary predicates, we optimize the average plan execution time with known predicates but dynamic data. Hence, the introduced lightweight correlation table can be used instead of heavyweight multidimensional histograms, where we would need to maintain exact or approximate frequencies with regard to arbitrary conjunctive predicates [Pol05, TDJ10]. Nevertheless, other approaches can be integrated as well. Due to the problem of missing statistics about
