Cost-Based Optimization of Integration Flows - Datenbanken ...
2 Preliminaries and Existing Techniques

2.1.4 Executing Integration Flows
When deploying an integration flow, we transform the logical flow into an executable plan. Here, we distinguish two major plan representations. First, there are interpreted plans, where we use an object graph of operators and interpret this object graph during execution of a plan instance. Second, there are compiled plans, where code templates are used to generate and compile physical executable plans. As a first side effect from the modeling perspective, (1) directed graphs are typically interpreted, while (2) hierarchies of sequences, source code, and fixed flows are commonly executed as compiled plans.
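The contrast between the two plan representations can be sketched as follows. This is a minimal, simplified illustration (all operator and function names are hypothetical, not taken from the original system): the interpreted plan walks an object graph of operators at runtime, while the compiled plan fills a code template and compiles it once before execution.

```python
# (1) Interpreted plan: operators as objects, evaluated by traversing
# the object graph for every plan instance (hypothetical sketch).
class Op:
    def __init__(self, fn, child=None):
        self.fn, self.child = fn, child

    def execute(self, msg):
        if self.child is not None:
            msg = self.child.execute(msg)   # interpret the child first
        return self.fn(msg)

interpreted_plan = Op(lambda m: m + "!", Op(str.upper))
print(interpreted_plan.execute("hello"))    # HELLO!

# (2) Compiled plan: fill a code template, then compile it into a
# physical executable plan that runs without interpretation overhead.
template = "def plan(msg):\n    return msg.upper() + '!'\n"
namespace = {}
exec(compile(template, "<plan>", "exec"), namespace)
compiled_plan = namespace["plan"]
print(compiled_plan("hello"))               # HELLO!
```

Both plans compute the same result; the difference lies in when the plan logic is bound: at each execution (interpretation) versus once at deployment time (compilation).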
Moreover, as a second side effect, flows with data-flow modeling semantics are typically also executed with data-flow semantics; the same is true for control-flow semantics. Hence, we use the flow semantics as our major classification criterion of execution approaches. With regard to data granularity as the second classification criterion of plan execution, we traditionally distinguish between two fundamental execution models:
• Iterator Model: The Volcano iterator model [Gra90, Gra94] is the typical execution model of traditional DBMS (row stores). Each operator implements an interface with the operations open(), next(), and close(). The operators of a plan call their predecessors, i.e., the top operator drives the execution of the whole plan (pull principle). In addition, each operator can be executed by an individual thread (and thus adheres to the pipes-and-filters execution model), where each operator exhibits a so-called iterator state (tuple buffer) [Gra90]. The advantages of this model are extensibility with additional operators as well as the exploitation of vertical parallelism (pipeline parallelism or data parallelism) and horizontal parallelism (parallel pipelines). The disadvantages are the high communication overhead between operators and the predominant applicability to row-based (tuple-based) execution.
• Materialized Intermediates: The concept of materialized intermediates is the typical execution model of column stores [KM05, MBK00]. Operators of a plan are executed in sequence (one operator at a time), where the result of one operator is completely materialized (as a variable) and then used as input of the next operator (push principle). This reduces the overhead of operator communication and is particularly advantageous for column stores, where operators work on (compressed) columns in the form of contiguous memory (arrays). This concept offers additional optimization opportunities such as vectorized execution within a single operator, or the recycling of intermediate results [IKNG09] across multiple plans.
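The iterator model can be sketched as follows. This is a hypothetical, simplified illustration of the open()/next()/close() interface and the pull principle (the operators and data are invented for this example): the top operator pulls tuples on demand from its predecessor.

```python
# Minimal sketch of the Volcano iterator model: each operator
# implements open(), next(), and close(), and pulls single tuples
# from its child operator on demand (pull principle).

class Scan:
    """Leaf operator producing tuples from an in-memory list."""
    def __init__(self, tuples):
        self.tuples = tuples

    def open(self):
        self.pos = 0                 # iterator state

    def next(self):
        if self.pos < len(self.tuples):
            t = self.tuples[self.pos]
            self.pos += 1
            return t
        return None                  # end of stream

    def close(self):
        pass

class Select:
    """Filter operator; pulls tuples from its predecessor."""
    def __init__(self, child, predicate):
        self.child = child
        self.predicate = predicate

    def open(self):
        self.child.open()

    def next(self):
        while (t := self.child.next()) is not None:
            if self.predicate(t):
                return t
        return None

    def close(self):
        self.child.close()

# The top operator determines the execution of the whole plan.
plan = Select(Scan([1, 2, 3, 4, 5]), lambda t: t % 2 == 1)
plan.open()
result = []
while (t := plan.next()) is not None:
    result.append(t)
plan.close()
print(result)  # [1, 3, 5]
```

Note that each operator processes one tuple per next() call, which illustrates the row-based execution and the per-tuple communication overhead mentioned above.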
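In contrast, operator-at-a-time execution with materialized intermediates can be sketched as follows (again a hypothetical, simplified example; the operator names are invented): each operator consumes a fully materialized input and produces a fully materialized output that is passed on as a variable.

```python
# Minimal sketch of materialized-intermediates execution: operators
# run in sequence (one at a time), each consuming and producing a
# complete, materialized column (push principle).

def op_scan(data):
    return list(data)                            # materialize input column

def op_select(column, predicate):
    return [v for v in column if predicate(v)]   # full intermediate result

def op_project(column, fn):
    return [fn(v) for v in column]               # full intermediate result

# Each intermediate is materialized as a variable and then used as
# the input of the next operator; operators work on whole arrays,
# enabling vectorized execution within a single operator.
intermediate = op_scan(range(1, 6))
intermediate = op_select(intermediate, lambda v: v % 2 == 1)
result = op_project(intermediate, lambda v: v * 10)
print(result)  # [10, 30, 50]
```

Because each operator touches a contiguous array exactly once, there is no per-tuple call overhead between operators, at the cost of memory for the materialized intermediates.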
Integration flows are typically executed as independent plan instances. Here, we distinguish between data-driven integration flows, where incoming data conceptually initiates a new plan instance, and scheduled integration flows, where such an instance is initiated by a time-based scheduler. If strong consistency is required, data-driven integration flows are executed synchronously, which means that the client systems are blocked during execution. In contrast, if only weak (eventual) consistency is required, data-driven integration flows can also be executed asynchronously using inbound queues. Note that scheduled integration flows are per se asynchronous and thus ensure only weak consistency. We use this integration-flow-specific characteristic of independent instances to refine the classification criterion of data granularity. Therefore, we introduce the notions of instance-local (data/messages of one flow instance) and instance-global (data/messages of multiple flow instances) data granularity.
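Asynchronous, data-driven execution via an inbound queue can be sketched as follows (a hypothetical, minimal example; the function and variable names are invented): the client only enqueues a message and returns immediately, while a worker dequeues messages and starts one plan instance per message.

```python
# Minimal sketch of asynchronous data-driven execution with an
# inbound queue: clients are decoupled from plan execution
# (weak/eventual consistency).
import queue
import threading

inbound = queue.Queue()

def plan_instance(message):
    # Stand-in for executing one independent plan instance.
    return f"processed({message})"

results = []

def worker():
    # Dequeue messages and initiate one plan instance per message;
    # a None sentinel signals shutdown.
    while True:
        msg = inbound.get()
        if msg is None:
            break
        results.append(plan_instance(msg))

t = threading.Thread(target=worker)
t.start()
for m in ["m1", "m2", "m3"]:
    inbound.put(m)     # client returns immediately (not blocked)
inbound.put(None)      # shutdown sentinel
t.join()
print(results)  # ['processed(m1)', 'processed(m2)', 'processed(m3)']
```

A synchronous variant would instead execute plan_instance() in the client's call path, blocking the client system until the instance completes; a scheduled flow would replace the enqueue loop with a time-based scheduler that initiates instances periodically.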