
MiniTasking: Improving Cache Performance for Multiple Query Workloads

Yan Zhang (1), Zhifeng Chen (2), and Yuanyuan Zhou (3)

(1) Center for Information Science, Peking Univ., Beijing, 100871, China
    zhy@cis.pku.edu.cn
(2) Google, USA
    zhifeng.chen@gmail.com
(3) Department of Computer Science, University of Illinois at Urbana-Champaign, USA
    yyzhou@cs.uiuc.edu

Abstract. This paper proposes a novel idea, called MiniTasking, to reduce the number of cache misses by improving the data temporal locality for multiple concurrent queries. Our idea is based on the observation that, in many workloads such as decision support systems (DSS), there is usually a significant amount of data sharing among different concurrent queries. MiniTasking exploits such data sharing characteristics to improve data temporal locality by scheduling query execution at three levels: (1) It batches queries based on their data sharing characteristics and the cache configuration. (2) It groups operators that share certain data. (3) It schedules mini-tasks, which are small pieces of computation in operator groups, according to their data locality without violating their execution dependencies.

Our experimental results show that MiniTasking can reduce the execution time by up to 12% for joins. For the TPC-H throughput test workload, MiniTasking improves the end performance by up to 20%. Even with the Partition Attributes Across (PAX) layout, MiniTasking further reduces the cache misses by 65% and the execution time by 9%.

1 Introduction

1.1 Motivation

With the increasing size of main memory, most of the query processing working set can fit into main memory for many database workloads. As a result, main memory latency is becoming a major performance bottleneck for many database applications, such as DSS (Decision Support System) applications [2, 20, 31]. This problem will get worse as the processor-memory speed gap increases. Previous work demonstrates that L2 data stall time is one of the most significant components of query execution time [2]. We conducted similar measurements using IBM DB2 with DSS workloads. Our results demonstrate that on a Pentium 4, L2 cache misses contribute 18%-56% of the CPI (cycles per instruction) for most TPC-H queries. Therefore, improving the L2 cache hit ratio is critical to reduce the number of expensive memory accesses and improve the end performance for database applications.
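As a rough back-of-the-envelope check (not an accounting taken from this paper, and one that ignores any overlap of misses with computation), the share of the CPI attributable to L2 data misses can be approximated as

\[ \text{L2 stall fraction} \approx \frac{\mathrm{MPI}_{L2} \times \mathrm{penalty}_{L2}}{\mathrm{CPI}} \]

where MPI_L2 is the number of L2 misses per instruction and penalty_L2 is the miss latency in cycles. For example, with the 350-cycle miss latency of our evaluation platform (Table 2), a query running at a CPI of 1.5 needs only about 0.001 L2 misses per instruction for those misses to account for roughly 23% of all cycles.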


Fig. 1. CPI breakdown of some TPC-H queries on Shore using PAX.

An effective method for improving the L2 data cache hit ratio is to increase data locality, which includes spatial locality and temporal locality. Many previous studies have proposed clever ideas to improve the data spatial locality of a single query by using cache-conscious data layouts. Examples include PAX (Partition Attributes Across) by Ailamaki et al. [1], data morphing by Hankins and Patel [14], and wider B+-tree nodes by Chen et al. [8]. These layout schemes place data that are likely to be accessed together consecutively, so that servicing one cache miss can "prefetch" other data into the cache and avoid subsequent cache misses.

While the above techniques are very effective in reducing the number of cache misses, memory latency still remains a significant contributor to query execution time, even though its contribution is not as high as before. For example, as shown in Figure 1, with the PAX layout, L1 and L2 cache misses still contribute around 20% of the CPI for TPC-H queries. Therefore, it is still necessary to seek other complementary techniques to further reduce the number of cache misses.

Improving temporal locality is a potential complementary technique to reduce the cache miss ratio by improving temporal data reuse. This approach has been widely studied for scientific applications. Most previous work in this category maximizes data temporal locality by reordering computation, e.g., compiler-directed tiling or loop transformations [32, 18, 11, 3] and fine-grained thread scheduling [23, 34]. While these techniques are very useful for regular, array-based applications, it is difficult to apply them to database applications, which usually have complex pointer-based data structures whose structure information is known only at run-time, after the database schema is loaded into main memory. So far, few studies have been conducted to improve temporal cache reuse for database applications.

1.2 Our Contributions

In this paper, we propose a technique called MiniTasking to improve data temporal locality for concurrent query execution. Our idea is based on the observation that, in a large-scale decision support system, it is very common for multiple users with complex queries to hit the same data set concurrently [16], even though these queries may not be identical. MiniTasking exploits such data sharing characteristics to improve temporal locality by scheduling query execution at three levels: (1) It batches queries based on their data sharing characteristics and the cache configuration. (2) It groups operators that share certain data. (3) It schedules mini-tasks, which are small fractions of operator groups, according to their data locality without violating their execution dependencies.



MiniTasking is complementary to previously proposed solutions such as PAX [1] and data morphing [14], because MiniTasking improves temporal locality while cache-conscious layouts improve spatial locality. MiniTasking is also complementary to multiple query optimization (MQO) techniques, which produce a global query plan for multiple queries [13, 28, 27].

We implemented MiniTasking in the Shore storage manager [6]. Our experimental results with various DSS workloads using the TPC-H benchmark suite show that MiniTasking improves the end performance by up to 20% on a real compound workload running TPC-H throughput testing streams. Even with the Partition Attributes Across (PAX) layout, MiniTasking reduces the L2 cache misses by 65% and the execution time of concurrent queries by 9%.

The remainder of this paper is organized as follows. Section 2 presents related work. Section 3 introduces data temporal locality. Section 4 describes MiniTasking in detail. Section 5 presents the experimental evaluation. Finally, we present our conclusions in Section 6.

2 Related Work

Multiple Query Optimization endeavors to reduce the execution time of multiple queries by reducing duplicated computation and reusing computation results. Previous work proposes to extract common sub-expressions from the plans of multiple queries and reuse their intermediate results in all queries [10, 13, 27, 28]. Early work shows that multiple query optimization is an NP-hard problem and proposes heuristics for query ordering and common sub-expression detection and selection [13, 27]. Roy et al. propose to materialize certain common sub-expressions into transient tables so that later queries can reuse the results [26]. Instead of materializing the results of common sub-expressions, Dalvi et al. focus on pipelining the intermediate tuples simultaneously to several queries so as to avoid the prohibitive cost of materializing and reading the intermediate results [10]. Harizopoulos et al. propose an operator-centric engine, QPipe, to support on-demand simultaneous pipelining [15]. O'Gorman et al. propose to reduce disk I/O by scheduling queries with the same table scans at the same time, and thereby achieve significant speedups [22]. However, reusing intermediate results requires exactly the same common sub-expressions. For example, a small change in the selection predicate of one query will render previous results unusable.

Improving Data Locality is another important technique to improve the performance of multiple queries, especially when memory latency becomes a new bottleneck for DSS workloads on modern processors. Ailamaki et al. show that the primary memory-related bottleneck is mainly contributed by L1 instruction and L2 data cache misses [2]. Many recent studies have focused on improving data spatial locality to reduce cache misses in database systems [1, 9, 19, 33, 25]. Cache-conscious algorithms change the data access pattern of table scans [4] and index scans [33] so that consecutive data accesses hit in the same cache lines. Shatdal et al. demonstrate that several basic database operator algorithms can be redesigned to make better use of the cache [29]. Cache-conscious index structures pack more keys into one cache line to reduce cache misses during lookup in an index tree [9, 19, 25]. Cache-conscious data storage models partition tables vertically so that one cache line can store the same fields from several records [1, 24]. Although these techniques effectively reduce cache misses within a single query, data fetched into processor caches is not reused across multiple queries.

Much previous work studies improving data temporal locality for general programs [7, 5, 12]. For example, based on a temporal relationship graph between objects generated via profiling, Calder et al. present a compiler-directed approach for cache-conscious data placement [5]. Carr and Tseng propose a model that computes temporal reuse of cache lines to find desirable loop organizations for better data locality [7, 21]. Although these methods are effective in increasing cache reuse, it is difficult to apply them directly to DSS workloads because it is hard to profile ad hoc DSS queries.

3 Feasibility Analysis: Improving Temporal Locality

Processor caches are used in modern architectures to reduce the average latency of memory accesses. Every memory load or store instruction is first checked against the processor caches (L1 and L2). If the data is in the cache, a.k.a. a cache hit, the access is satisfied by the cache directly. Otherwise, it is a cache miss. Upon a cache miss, the accessed data is fetched into the cache from main memory. Because accessing main memory is 10-30 times slower than accessing the processor cache, it is performance-critical to have high cache hit ratios to avoid paying the large penalty of accessing main memory.
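A standard way to quantify this penalty (a textbook identity, not a formula from this paper) is the average memory access time

\[ \mathrm{AMAT} = t_{hit} + m \times t_{miss} \]

where m is the cache miss ratio. Taking the L2 parameters of our evaluation platform in Table 2 as a rough guide (a 7-cycle latency when an access is served by L2 and a 350-cycle penalty on an L2 miss), reducing the L2 miss ratio from 10% to 5% already cuts the average latency from about 42 to about 24.5 cycles.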

There are two kinds of locality: spatial locality and temporal locality. Our work focuses on improving temporal locality via locality-based scheduling. Temporal locality is the tendency that individual locations, once referenced, are likely to be referenced again in the near future. Good temporal locality allows data in processor caches to be reused (called temporal reuse) multiple times before being replaced, thereby improving cache effectiveness.

In most real-world workloads, database servers serve multiple concurrent queries simultaneously, and there is usually a significant amount of data sharing among many of these concurrent queries. For example, Query 1 (Q1) and Query 6 (Q6) from the TPC-H benchmark [30] share the same table, Lineitem, the largest one in the TPC-H database.

However, due to the locality-oblivious multi-query scheduling commonly used in modern database servers, such significant data sharing is not fully exploited to improve the level of temporal reuse in processor caches and reduce the number of processor cache misses. As a result, before a piece of data can be reused by another query, it has already been replaced and must be fetched again from main memory when that query needs it.

Let us look at an example using Q1 and Q6 from the TPC-H benchmark. Suppose Lineitem has 1M tuples, with each tuple occupying one cache line of 64 bytes (for simplicity of description), and the L2 cache holds only 64K cache lines (4 MBytes in total). Suppose that the scheduler decides to execute Q1 first, concurrently with some other queries that do not share any data with Q1 and Q6. After Q1 accesses the 128K-th tuple, Q6 is scheduled to start from the 1st tuple. Since the L2 cache can only hold 64K tuples, the first tuple of Lineitem has already been evicted from L2. Therefore, the database needs to fetch this tuple again from main memory to execute Q6.

Fig. 2. Comparison between locality-oblivious and locality-aware multi-query scheduling.

In contrast, with locality-aware multi-query scheduling and execution, we can schedule Q1 and Q6 together in an interleaved fashion so that, after a query fetches a tuple from main memory into L2, this tuple can be accessed by both queries before being replaced in L2.
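The following minimal sketch (our own illustration, not the paper's implementation) replays this scenario against a fully associative LRU cache. The table and cache sizes are scaled down by the same factor, so the effect is the same as in the 1M-tuple example.

from collections import OrderedDict

def count_misses(trace, cache_lines):
    """Misses of a fully associative LRU cache where one tuple = one line."""
    cache, misses = OrderedDict(), 0
    for line in trace:
        if line in cache:
            cache.move_to_end(line)        # hit: refresh LRU position
        else:
            misses += 1                    # miss: fetch from main memory
            cache[line] = True
            if len(cache) > cache_lines:
                cache.popitem(last=False)  # evict the least recently used line
    return misses

TUPLES, CACHE_LINES = 1_000_000 // 8, 65_536 // 8   # scaled-down Lineitem and L2

# Locality-oblivious: Q1 scans the whole table, then Q6 scans it again.
oblivious = list(range(TUPLES)) + list(range(TUPLES))
# Locality-aware: each tuple is consumed by both queries before moving on.
aware = [t for t in range(TUPLES) for _ in (0, 1)]

print(count_misses(oblivious, CACHE_LINES))   # ~2x TUPLES: no reuse across queries
print(count_misses(aware, CACHE_LINES))       # ~TUPLES: every tuple fetched only once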

Figure 2 shows that, for multiple queries of different types (Q1+Q6), locality-aware scheduling reduces the number of cache misses by 41.7% and results in a 9.7% reduction in execution time. For multiple queries of the same type but with different arguments (Q6+Q6'), locality-aware scheduling reduces the number of cache misses by 42.4% and the execution time by 9.9%. These results indicate that locality-awareness in multi-query scheduling is very helpful for reducing the number of cache misses and improving database performance, which is the major focus of our work.

4 MiniTasking

4.1 Overview

To exploit data sharing among concurrent queries for improving temporal locality, MiniTasking schedules and executes concurrent queries based on data sharing characteristics at three levels: query level batching, operator level grouping, and mini-task level scheduling. While each level is different, all levels share the same goal: improving temporal data locality. Therefore, at each level, all decisions are made based on data sharing characteristics, with consideration of other factors that are specific to each level.

At the query level, due to the processor cache capacity limit, it is not beneficial to execute together all concurrent queries (queries that have already arrived at the database management server and are waiting to be processed). Therefore, MiniTasking carefully selects a batch of queries based on their data sharing characteristics and the processor cache configuration to maximize the level of temporal locality in the processor cache. Queries in the same batch are then processed together in the next two levels.

At the second level, MiniTasking produces a locality-aware query plan tree for each batch of queries. It does this by starting from the query plan trees produced by the optimizer and grouping together those operators that share a significant amount of data. Operators that do not share data with others remain untouched.

At the third level, MiniTasking further breaks each operator into mini-tasks, with each mini-task operating on a fine-grained data block. Then all mini-tasks from the same query plan tree are executed one after another, in an order that maximizes temporal data reuse in the processor cache.

Algorithm Greedy-Selecting:
;; Given n queries Q_1, ..., Q_n, return a batch of
;; queries that will be processed as a whole.
S = {Q_a, Q_b | max_{i,j} AmountDataSharing(Q_i, Q_j)}
while |S| < MaxBatchSize do
    Find Q ∉ S s.t. ∃ Q′ ∈ S for which
        AmountDataSharing(Q, Q′) is maximized
    if AmountDataSharing(Q, Q′) ≠ 0
        S = S ∪ {Q}
    else
        exit the loop  ;; No more queries sharing with S
return S

Fig. 3. Greedy batch selecting algorithm.

4.2 Query Level Batching

Obviously, the first criterion for query batching should be data sharing. If two queries access totally different data, there is no chance of reusing each other's data from the processor cache. Such a case can happen even when two queries access the same table but touch different fields that do not share the same cache lines. In this case, we say that these two queries do not have overlapping working sets, where a working set is defined as the set of data (cache lines) accessed by a query.

Therefore, to batch queries based on data sharing characteristics, MiniTasking needs to estimate the amount of sharing between any two concurrent queries. A metric, called AmountDataSharing, is introduced to measure the estimated amount of data sharing, i.e., the amount of overlap in working sets, between two given concurrent queries. Since only coarse-grained data access characteristics are known at the query level, we estimate a query's working set based on the tables and the fields accessed by the query.
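The paper does not spell out the estimator, so the following sketch is only one plausible reading: a working set is approximated per (table, field) pair from the table cardinalities, and the sharing between two queries is the size of the overlap. The field lists, field width, and line size below are illustrative assumptions.

def working_set(fields, table_rows, bytes_per_field=8, line_bytes=64):
    """Approximate cache lines touched, keyed by (table, field)."""
    return {(t, f): table_rows[t] * bytes_per_field // line_bytes
            for (t, f) in fields}

def amount_data_sharing(fields_a, fields_b, table_rows):
    """Estimated working-set overlap (in cache lines) of two queries."""
    wa = working_set(fields_a, table_rows)
    wb = working_set(fields_b, table_rows)
    shared = set(wa) & set(wb)               # same table AND same field
    return sum(min(wa[k], wb[k]) for k in shared)

# Q1 and Q6 both read the l_quantity and l_discount columns of Lineitem.
rows = {"lineitem": 600_572}
q1 = {("lineitem", "l_quantity"), ("lineitem", "l_discount"), ("lineitem", "l_tax")}
q6 = {("lineitem", "l_quantity"), ("lineitem", "l_discount")}
print(amount_data_sharing(q1, q6, rows))     # overlap on the two shared columns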

MiniTasking schedules queries in batches and processes these batches one by one. Given a large number of concurrent queries that share data with each other, it intuitively sounds beneficial to execute concurrently as many queries as possible so that the amount of data reuse can be maximized.

However, in reality, due to the limited L2 cache capacity, scheduling too many concurrent queries can result in even poorer temporal locality, because data from different queries can replace each other in the cache before being reused. Therefore, we should carefully decide how many and which concurrent queries should be batched together. To address this problem, we use a threshold parameter, MaxBatchSize, to limit the number of concurrent queries in a batch.

Based on the above analysis, we use a heuristic greedy algorithm to select batches of queries, as shown in Figure 3. It works similarly to a clustering algorithm: it divides all concurrent queries into clusters smaller than MaxBatchSize so as to maximize the total amount of data sharing.
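A minimal Python rendering of the Figure 3 heuristic is sketched below. The function and parameter names are ours; sharing(q1, q2) stands in for the AmountDataSharing estimate.

from itertools import combinations

def greedy_select(queries, sharing, max_batch_size):
    """Seed a batch with the best-sharing pair, then grow it greedily."""
    if len(queries) < 2:
        return list(queries)
    seed = max(combinations(queries, 2), key=lambda pair: sharing(*pair))
    batch = set(seed)
    while len(batch) < max_batch_size:
        outside = [q for q in queries if q not in batch]
        if not outside:
            break
        # Pick the outside query that shares the most data with any batched query.
        best = max(outside, key=lambda q: max(sharing(q, b) for b in batch))
        if max(sharing(best, b) for b in batch) == 0:
            break            # no remaining query shares data with the batch
        batch.add(best)
    return list(batch)

Queries left out of a batch stay in the pool and are considered again when the next batch is formed, matching the batch-by-batch processing described above.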



Fig. 4. An example of the operator grouping process, showing the original plans of two queries (operators Op1-Op6 and Op'1-Op'4 over table scans of T1, T2, and T3) and the enhanced plans produced by MiniTasking. The output of Op'3 to Op'4 and the output of Op4 to Op5 are materialized.

4.3 Operator Level Grouping

Since queries consist of operators, MiniTasking goes one step further and groups together operators from the same batch of queries according to their data sharing characteristics. MiniTasking scans every physical operator tree produced by the query optimizer for each query in a batch and groups operators that share certain data. The evaluation process is similar to the one used at the query level. If the results of an operator are pipelined to other operators, MiniTasking also puts these related operators into the same group. Each group of operators is then passed to the mini-task level. Operators that do not share data with others are all put into the last group, which is executed last using the original scheduling algorithm.

MiniTasking supports operator dependencies by maintaining a pool of ready operators. An operator is ready, and joins the ready pool, when it does not depend on other unexecuted operators. MiniTasking selects a group of operators from the ready pool using an algorithm similar to the one used for query batching in Figure 3. After this group of operators finishes execution via mini-tasking (described in the next subsection), operators that depend on the ones just executed are "released" and join the ready pool if they have no other outstanding dependencies. MiniTasking then selects the next group of operators, and so forth, until all operators are executed.
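The ready-pool loop can be sketched as follows; this is our own paraphrase of the mechanism just described, with the group selection and mini-task execution left as pluggable callables (pick_group is assumed to return a non-empty subset of the ready set).

from collections import defaultdict

def schedule_with_ready_pool(operators, depends_on, pick_group, run_group):
    """operators:  iterable of operator ids.
    depends_on: dict op -> set of ops whose output it needs.
    pick_group: callable(ready_set) -> subset of ready operators sharing data.
    run_group:  callable(group) -> executes the group via mini-tasks."""
    remaining = {op: set(depends_on.get(op, ())) for op in operators}
    dependents = defaultdict(set)
    for op, deps in remaining.items():
        for dep in deps:
            dependents[dep].add(op)
    ready = {op for op, deps in remaining.items() if not deps}
    while ready:
        group = pick_group(ready)          # e.g. greedy selection as in Fig. 3
        run_group(group)                   # mini-task level scheduling (Section 4.4)
        for op in group:
            ready.discard(op)
            for child in dependents[op]:   # release operators whose inputs are now done
                remaining[child].discard(op)
                if not remaining[child]:
                    ready.add(child)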

Figure 4 uses an example to demonstrate how MiniTasking works at the operator level. Suppose there are two queries, namely Q and Q'. Op1 to Op5 are operators of query Q, and Op'1 to Op'4 are operators of query Q'. As both Op1 and Op'1 access table T1, they are grouped together. Suppose Op'3 and Op4 are implemented using pipelining. MiniTasking then also puts them into the same group as Op1, Op'1, Op2, and Op'2. This group does not contain Op'4 or Op5, because the results of Op'3 and Op4 are materialized.

4.4 Mini-task Level Scheduling

At the mini-task level, the challenge is how MiniTasking breaks various query operators into mini-tasks and achieves a benefit from rescheduling them. We show our method by illustrating a data-centric approach applied to a table scan. The idea can be extended to handle other query operators.

Fig. 5. MiniTasking breaks the operators into mini-tasks and schedules them.

Fig. 6. The layouts for the two join relations and the join query.

The goal of MiniTasking is to have the data loaded into the cache reused by queries as much as possible before it is evicted. Therefore, MiniTasking carefully chooses an appropriate value for the working set size, i.e., the total size of all the data blocks that can reside in the cache at once. This parameter has a big impact on query performance. If it is too large, some data may be evicted from the cache before being reused; decreasing it, however, results in more mini-tasks and thereby heavier switching overhead.

Generally, this parameter is related to the target architecture, the data layouts, and the queries, especially the L2 cache size, the L2 cache line size, and the associativity. According to our experiments, it is not very sensitive to the type of queries. Once the target architecture and the data layouts are specified, it is feasible to run some calibration experiments in advance to determine the best value for this parameter.

Therefore, for a table scan, MiniTasking divides the table into n fine-grained data blocks, with each block sized to fit the working set. Correspondingly, MiniTasking breaks the execution of each scan operator into n mini-tasks, according to the data blocks they use. Thereafter, when a data block is loaded by the first mini-task, MiniTasking schedules the other mini-tasks that share this data block to execute one by one. When no more mini-tasks use this data block, it is replaced by the next data block. Thus the data residing in the cache can be maximally reused before being evicted.

The following example illustrates this data-centric scheduling method. Suppose there are three table scan operators, Op1, Op2, and Op3, and they share the table T, as shown in Figure 5. Table T is divided into three data blocks. According to the data blocks they access, the three operators are broken into (Op1,1, Op1,2, Op1,3), (Op2,1, Op2,2, Op2,3), and (Op3,1, Op3,2, Op3,3), respectively. MiniTasking schedules them in the following order: Op1,1, Op2,1, Op3,1, Op1,2, Op2,2, Op3,2, Op1,3, Op2,3, Op3,3. In this way, the data block (DTj) loaded into the cache by Op1,j (j = 1, 2, 3) can be reused by the subsequent mini-tasks Op2,j and Op3,j.
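For table scans, this interleaving boils down to a block-major loop over (data block, operator) pairs. The sketch below is an illustration of the idea rather than the Shore implementation, with the per-block work reduced to a callable.

def run_scan_group(table_blocks, num_operators, scan_mini_task):
    # Each data block is sized to fit the chosen working set;
    # scan_mini_task(i, block) performs operator Op_i's work on one block.
    for block in table_blocks:             # a block is loaded into the cache once ...
        for i in range(num_operators):
            scan_mini_task(i, block)       # ... and reused by every operator before moving on

# Toy usage: three scans counting tuples over three blocks, as in Figure 5.
blocks = [list(range(k, k + 4)) for k in (0, 4, 8)]
seen = [0, 0, 0]
def mini_task(i, block):
    seen[i] += len(block)                  # stand-in for the real per-tuple work of Op_i
run_scan_group(blocks, 3, mini_task)
print(seen)                                # [12, 12, 12]: every operator saw all tuples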



Table 1. The sizes of the outer and inner relations used by Micro-join.

                 Hash Joins            Index Joins
  Tuples     Hash-1     Hash-2     Index-1     Index-2
  Outer      10^6       10^6       10^6        5,000
  Inner      5,000      100        500,000     10^6

Table 2. Processor cache parameters of the evaluation platform.

  Parameter             L1 D-cache    L2 cache
  Size                  8 KB          512 KB
  Associativity         4-way         8-way
  Cache line            64 B          64 B
  Cache miss latency    7 cycles      350 cycles

5 Experimental Evaluation

5.1 Evaluation Methodology

We implement MiniTasking in the Shore database storage manager [6], which provides most of the popular storage features used in a modern commercial DBMS. Previous work shows that Shore exhibits memory access behaviors similar to several commercial DBMSes [1]. Since Shore's original query scheduler is fairly serialized (executing one query after another), we have extended Shore to use a slightly more sophisticated scheduler which switches from one query to another after a certain time quantum, or when a query yields voluntarily for other reasons (e.g., I/O). This scheduler emulates what would really happen in a multi-threaded or multi-processed commercial database server. Our results also show that this scheduler performs slightly better than the original scheduler in Shore. Therefore, we use this time quantum-based scheduler as our baseline to compare with MiniTasking.

Experimental Workloads. For DSS workloads, we use a TPC-H-like benchmark, which represents the activities of a complex business that manages, sells, and distributes a large number of products [30]. The following are the table sizes in our TPC-H-like database: 600,572 tuples in Lineitem, 150,000 tuples in Orders, and 20,000 tuples in Part.

Experimental Platform. Our evaluation is conducted on a machine with a 2.4 GHz Intel Pentium 4 processor and 2.5 GB of main memory. The processor includes two levels of caches, L1 and L2, whose characteristics are shown in Table 2. The operating system is Linux kernel 2.4.20. For measurements, we use a commercial tool, the Intel VTune performance analyzer [17], which collects performance statistics with negligible overhead.

5.2 Results For Micro-Join

We use a two-relation join query to examine MiniTasking, as shown in Figure 6. We vary the number of tuples in the two relations and examine four representative combinations, as shown in Table 1.

Our experiments show that MiniTasking improves the performance of join operations by 4%-12%. When a hash join is used, if the hash table on the inner relation is small enough to fit in the cache, MiniTasking can be effectively applied to the outer relation. For example, when two instances of the join query are running, MiniTasking improves the query performance by 9% in the case of Hash-1 and 12% in the case of Hash-2, as shown in Figure 7. MiniTasking achieves a similar speedup for the index-based join, since it can break the index probing into mini-tasks. As a result, MiniTasking reduces the query execution time by up to 8.2% for two concurrently running instances of the join query.



Fig. 7. Performance of join operations. MiniTasking reduces execution time by up to 12.1% for hash joins and 8.2% for index nested-loops joins. MT stands for MiniTasking.

Fig. 8. Performance of throughput-real tests: (a) normalized execution time; (b) CPI breakdown. Each test runs several concurrent streams. The execution time of each test is normalized to the baseline without MiniTasking.

5.3 Results For Throughput-Real

We validate our MiniTasking strategy using a real workload, modeled after the throughput test of the TPC-H benchmark. The standard TPC-H throughput test is composed of multiple concurrent streams, each containing a sequence of TPC-H queries in an order specified by the TPC-H benchmark. Accordingly, our experiment follows these sequences and lets each stream execute the six TPC-H queries we implemented.

Our experimental results show that MiniTasking is very effective for this workload. Figure 8(a) shows that the execution time of each test is reduced by 11%-20% for various numbers of concurrent query streams. As shown in Figure 8(b), the performance gain comes from the reduction in L2 cache misses: MiniTasking reduces the number of L2 cache misses by 41%-79%. This is because MiniTasking's locality-aware query scheduling and execution effectively improves the temporal locality of accesses. Meanwhile, MiniTasking does not affect other processor events very much, since it adds little overhead. Therefore, the improved L2 cache hit ratio is proportionally reflected in the end performance.

5.4 Improvement Upon PAX Layout

Figure 9 shows the effect of MiniTasking on a cache-conscious data layout such as PAX [1]. MiniTasking can still effectively reduce the number of L2 cache misses by 65% and the execution time by 9%. The performance speedup is less pronounced with PAX than with the default NSM layout because PAX already significantly improves the spatial locality of accesses, so the L2 cache miss time contributes less to the execution time than with NSM.



Fig. 9. The execution time and the CPI breakdown of four concurrent queries (TPC-H Q1): (a) normalized execution time; (b) CPI breakdown. The PAX data layout is used in Shore and MT.

6 Conclusion

In this paper, we propose a technique called MiniTasking to improve database performance for concurrent query execution by reducing the number of processor cache misses via three levels of locality-based scheduling. Through query level batching, operator level grouping, and mini-task level scheduling, MiniTasking can significantly reduce L2 cache misses and execution time. Our experimental results show that MiniTasking can reduce the execution time by up to 12% for joins. For the TPC-H throughput test workload, MiniTasking reduces the number of L2 cache misses by up to 79% and improves the end performance by up to 20%. With the Partition Attributes Across (PAX) layout, MiniTasking further reduces the cache misses by 65% and the execution time by 9%, which indicates that our technique complements previous cache-conscious layouts well.

References

1. A. Ailamaki, D. J. DeWitt, and M. D. Hill. Data page layouts for relational databases on deep memory hierarchies. The VLDB Journal, 11(3):198-215, 2002.
2. A. Ailamaki, D. J. DeWitt, M. D. Hill, and D. A. Wood. DBMSs on a modern processor: Where does time go? In VLDB '99, pages 266-277, 1999.
3. A.-H. A. Badawy, A. Aggarwal, D. Yeung, and C.-W. Tseng. Evaluating the impact of memory system performance on software prefetching and locality optimizations. In International Conference on Supercomputing, pages 486-500, 2001.
4. P. A. Boncz, S. Manegold, and M. L. Kersten. Database architecture optimized for the new bottleneck: Memory access. In VLDB '99, pages 54-65, 1999.
5. B. Calder, C. Krintz, S. John, and T. Austin. Cache-conscious data placement. In ASPLOS '98, pages 139-149, 1998.
6. M. J. Carey, D. J. DeWitt, M. J. Franklin, N. E. Hall, M. L. McAuliffe, J. F. Naughton, D. T. Schuh, M. H. Solomon, C. K. Tan, O. G. Tsatalos, S. J. White, and M. J. Zwilling. Shoring up persistent applications. In SIGMOD '94, pages 383-394, 1994.
7. S. Carr, K. S. McKinley, and C.-W. Tseng. Compiler optimizations for improving data locality. In ASPLOS '94, pages 252-262, 1994.
8. S. Chen, P. B. Gibbons, and T. C. Mowry. Improving index performance through prefetching. In SIGMOD '01, pages 235-246, 2001.
9. S. Chen, P. B. Gibbons, T. C. Mowry, and G. Valentin. Fractal prefetching B+-trees: Optimizing both cache and disk performance. In SIGMOD '02, pages 157-168, 2002.
10. N. N. Dalvi, S. K. Sanghai, P. Roy, and S. Sudarshan. Pipelining in multi-query optimization. In PODS '01, pages 59-70, 2001.
11. C. Ding and K. Kennedy. Inter-array data regrouping. In Languages and Compilers for Parallel Computing, pages 149-163, 1999.
12. C. Ding and M. Orlovich. The potential of computation regrouping for improving locality. In ACM/IEEE SC2004, Nov. 6-12, 2004.
13. S. Finkelstein. Common expression analysis in database applications. In SIGMOD '82, pages 235-245, 1982.
14. R. A. Hankins and J. M. Patel. Data morphing: An adaptive, cache-conscious storage technique. In VLDB '03. Morgan Kaufmann, 2003.
15. S. Harizopoulos, V. Shkapenyuk, and A. Ailamaki. QPipe: A simultaneously pipelined relational query engine. In SIGMOD '05, pages 383-394, 2005.
16. IBM. Personal communication with IBM, Jan. 2005.
17. Intel Corporation. Intel VTune performance analyzer. http://www.intel.com/software/products/vtune/, 2004.
18. K. Kennedy and K. S. McKinley. Maximizing loop parallelism and improving data locality via loop fusion and distribution. In Proceedings of the 6th International Workshop on Languages and Compilers for Parallel Computing, pages 301-320. Springer-Verlag, 1994.
19. K. Kim, S. K. Cha, and K. Kwon. Optimizing multidimensional index trees for main memory access. In SIGMOD '01, pages 139-150. ACM Press, 2001.
20. J. L. Lo, L. A. Barroso, S. J. Eggers, K. Gharachorloo, H. M. Levy, and S. S. Parekh. An analysis of database workload performance on simultaneous multithreaded processors. In ISCA '98, pages 39-50. IEEE Computer Society, 1998.
21. K. S. McKinley, S. Carr, and C.-W. Tseng. Improving data locality with loop transformations. ACM Transactions on Programming Languages and Systems, 18(4):424-453, July 1996.
22. K. O'Gorman, D. Agrawal, and A. E. Abbadi. Multiple query optimization by cache-aware middleware using query teamwork. In ICDE '02, page 274. IEEE Computer Society, 2002.
23. J. Philbin, J. Edler, O. J. Anshus, C. C. Douglas, and K. Li. Thread scheduling for cache locality. In ASPLOS '96, pages 60-71. ACM Press, 1996.
24. R. Ramamurthy, D. J. DeWitt, and Q. Su. A case for fractured mirrors. In VLDB '02, pages 430-441, 2002.
25. J. Rao and K. A. Ross. Making B+-trees cache conscious in main memory. In SIGMOD '00, pages 475-486, New York, NY, USA, 2000. ACM Press.
26. P. Roy, S. Seshadri, S. Sudarshan, and S. Bhobe. Efficient and extensible algorithms for multi query optimization. In SIGMOD '00, pages 249-260. ACM Press, 2000.
27. T. Sellis and S. Ghosh. On the multiple-query optimization problem. IEEE Transactions on Knowledge and Data Engineering, 2(2):262-266, 1990.
28. T. K. Sellis. Multiple-query optimization. ACM Trans. Database Syst., 13(1):23-52, 1988.
29. A. Shatdal, C. Kant, and J. F. Naughton. Cache conscious algorithms for relational query processing. In VLDB '94, pages 510-521, 1994.
30. Transaction Processing Performance Council. http://www.tpc.org.
31. P. Trancoso, J.-L. Larriba-Pey, Z. Zhang, and J. Torrellas. The memory performance of DSS commercial workloads in shared-memory multiprocessors. In HPCA '97, 1997.
32. M. E. Wolf and M. S. Lam. A data locality optimizing algorithm. In PLDI '91, 1991.
33. J. Zhou and K. A. Ross. Buffering accesses to memory-resident index structures. In VLDB '03, pages 405-416, 2003.
34. Y. Zhou, L. Wang, D. W. Clark, and K. Li. Thread scheduling for out-of-core applications with memory server on multicomputers. In IOPADS '99, pages 57-67. ACM Press, 1999.

