MiniTasking: Improving Cache Performance for Multiple ... - CiteSeerX

MiniTasking: Improving Cache Performance for Multiple Query Workloads

Yan Zhang (1), Zhifeng Chen (2), and Yuanyuan Zhou (3)

(1) Center for Information Science, Peking Univ., Beijing, 100871, China, zhy@cis.pku.edu.cn

(2) Google, USA, zhifeng.chen@gmail.com

(3) Department of Computer Science, University of Illinois at Urbana-Champaign, USA, yyzhou@cs.uiuc.edu

Abstract. This paper proposes a novel idea, called MiniTasking, to reduce the number of cache misses by improving the data temporal locality for multiple concurrent queries. Our idea is based on the observation that, in many workloads such as decision support systems (DSS), there is usually a significant amount of data sharing among different concurrent queries. MiniTasking exploits such data sharing characteristics to improve data temporal locality by scheduling query execution at three levels: (1) It batches queries based on their data sharing characteristics and the cache configuration. (2) It groups operators that share certain data. (3) It schedules mini-tasks, which are small pieces of computation in operator groups, according to their data locality without violating their execution dependencies.

Our experimental results show that MiniTasking can reduce the execution time by up to 12% for joins. For the TPC-H throughput test workload, MiniTasking improves the end performance by up to 20%. Even with the Partition Attributes Across (PAX) layout, MiniTasking further reduces the cache misses by 65% and the execution time by 9%.

1 Introduction 

1.1 Motivation 

With the increasing size of main memory, most of the query processing working set

can fit into main memory for many database workloads. As a result, the main memory 

latency is becoming a major performance bottleneck for many database applications, 

such as DSS (Decision Support System) applications [2, 20, 31]. This problem will get 

worse as the processor-memory speed gap increases. Previous work demonstrates that 

the L2 data stall time is one of the most significant components of the query execution 

time [2]. We conducted similar measurements using IBM DB2 with DSS workloads. 

Our results demonstrate that, on a Pentium 4, L2 cache misses contribute 18%-56%

of the CPI (cycles per instruction) for most TPC-H queries. Therefore, improving the L2

cache hit ratio is critical to reduce the number of expensive memory accesses and improve 

the end performance for database applications.

Fig. 1. CPI breakdown of some TPC-H queries on Shore using PAX. 

An effective method for improving the L2 data cache hit ratio is to increase data 

locality, which includes spatial locality and temporal locality. Many previous studies 

have proposed clever ideas to improve the data spatial locality of a single query by 

using cache-conscious data layout. Examples include PAX (Partition Attributes Across) 

by Ailamaki et al. [1], data morphing by Hankins and Patel [14], and wider B+-tree

nodes by Chen et al. [8]. These layout schemes place data that are likely to be accessed 

together consecutively so that servicing one cache miss can “prefetch” other data into 

the cache to avoid subsequent cache misses. 

While the above techniques are very effective in reducing the number of cache 

misses, memory latency remains a significant contributor to the query execution

time even though the amount of contribution is not as high as before. For example, 

as shown in Figure 1, with the PAX layout, the L1 and L2 cache misses still contribute 

around 20% of the CPI for TPC-H queries. Therefore, it is still necessary to seek other

complementary techniques to further reduce the number of cache misses. 

Improving temporal locality is a potential complementary technique to reduce cache 

miss ratio by improving data temporal reuse. This approach has been widely studied for 

scientific applications. Most previous work in this category maximizes data temporal 

locality by reordering computation, e.g., compiler-directed tiling or loop transformations 

[32, 18, 11, 3] and fine-grained thread scheduling [23, 34]. While these techniques are

very useful for regular, array-based applications, it is difficult to apply them to database 

applications that usually have complex pointer-based data structures, and whose structure 

information is known only at run-time after the database schema is loaded into the 

main memory. So far few studies have been conducted to improve the temporal cache 

reuse for database applications. 

1.2 Our Contributions 

In this paper, we propose a technique called MiniTasking to improve data temporal 

locality for concurrent query execution. Our idea is based on the observation that, in a 

large scale decision support system, it is very common for multiple users with complex 

queries to hit the same data set concurrently [16], even though these queries may not be 

identical. MiniTasking exploits such data sharing characteristics to improve temporal 

locality by scheduling query execution at three levels: (1) It batches queries based on

their data sharing characteristics and the cache configuration. (2) It groups operators 

that share certain data. (3) It schedules mini-tasks which are small fractions of operator 

groups according to their data locality without violating their execution dependencies. 


MiniTasking is complementary to previously proposed solutions such as PAX [1] 

and data morphing [14], because MiniTasking improves temporal locality while cache-conscious

layouts improve spatial locality. MiniTasking is also complementary to multiple

query optimization (MQO) techniques, which produce a global query plan for a set of queries [13,

28, 27].

We implemented MiniTasking in the Shore storage manager [6]. Our experimental 

results with various DSS workloads using the TPC-H benchmark suite show that

MiniTasking improves the end performance by up to 20% on a real compound workload

running TPC-H throughput testing streams. Even with the Partition Attributes Across 

(PAX) layout, MiniTasking reduces the L2 cache misses by 65% and the execution time 

of concurrent queries by 9%. 

The remainder of this paper is organized as follows. Section 2 presents the related 

work. Section 3 introduces data temporal locality. Section 4 describes MiniTasking in 

detail. Section 5 presents the experimental evaluation. Finally, Section 6 presents our

conclusions.

2 Related Work 

Multiple Query Optimization endeavors to reduce the execution time of multiple queries 

by reducing duplicated computation and reusing the computation results. Previous work 

proposes to extract common sub-expressions from plans of multiple queries and reuse 

their intermediate results in all queries [10, 13, 27, 28]. Early work shows that multiple

query optimization is an NP-hard problem and proposes heuristics for query ordering

and for detecting and selecting common sub-expressions [13, 27]. Roy et al. propose

to materialize certain common sub-expressions into transient tables so that later 

queries can reuse the results [26]. Instead of materializing the results of common subexpressions, 

Dalvi et al. focus on pipelining the intermediate tuples simultaneously to

several queries so as to avoid the prohibitive cost of materializing and reading the intermediate 

results [10]. Harizopoulos et al. propose an operator-centric engine, Qpipe, to

support on-demand simultaneous pipelining [15]. O’Gorman et al. propose to reduce 

disk I/O by scheduling queries with the same table scans at the same time and therefore 

achieve significant speedups [22]. However, reusing intermediate results requires 

exactly the same common sub-expressions. For example, a small change in the selection

predicate of one query will render previous results unusable.

Improving Data Locality is another important technique to improve performance of 

multiple queries, especially when the memory latency becomes a new bottleneck for 

DSS workloads on modern processors. Ailamaki et al. show that the primary memory-related

bottleneck is mainly contributed by L1 instruction and L2 data cache misses [2].

Many recent studies have focused on improving data spatial locality to reduce cache 

misses in database systems [1, 9, 19, 33, 25]. Cache-conscious algorithms change the data

access patterns of table scans [4] and index scans [33] so that consecutive data accesses

will hit in the same cache lines. Shatdal et al. demonstrate that several basic database 

operator algorithms can be redesigned to make better use of the cache [29]. Cache-conscious

index structures pack more keys into one cache line to reduce cache misses

during lookup in an index tree [9, 19, 25]. Cache-conscious data storage models partition

tables vertically so that one cache line can store the same fields from several 

records [1, 24]. Although these techniques effectively reduce cache misses within a single 

query, data fetched into processor caches are not reused across multiple queries. 

Much previous work studies improving data temporal locality for general programs 

[7, 5, 12]. For example, based on the temporal relationship graph between objects generated 

via profiling, Calder et al. present a compiler directed approach for cache-conscious 

data placement [5]. Carr and Tseng propose a model that computes temporal reuse of 

cache lines to find desirable loop organizations for better data locality [7, 21]. Although 

these methods are effective in increasing cache reuse, it is difficult to apply them directly 

to DSS workloads because it is hard to profile ad hoc DSS queries.

3 Feasibility Analysis: Improving Temporal Locality 

Processor caches are used in modern architectures to reduce the average latency of 

memory accesses. Every memory load or store instruction is first checked inside the 

processor cache (L1 and L2). If the data is in the cache, a.k.a. a cache hit, the access 

is satisfied by the cache directly. Otherwise, it is a cache miss. Upon a cache miss, 

the accessed data is fetched into the cache from the main memory. Because accessing 

the main memory is 10–30 times slower than accessing the processor cache, it is 

performance critical to have high cache hit ratios to avoid paying the large penalty of 

accessing main memory. 

There are two kinds of locality: spatial locality and temporal locality. Our work 

focuses on improving temporal locality via locality-based scheduling. Temporal locality 

is the tendency that individual locations, once referenced, are likely to be referenced 

again in the near future. Good temporal locality allows data in processor caches to be 

reused multiple times before being replaced (called temporal reuse), thereby

improving cache effectiveness.

In most real-world workloads, database servers usually serve multiple queries

concurrently. Usually, there is a significant amount of data sharing among

many of these concurrent queries. For example, Query 1 (Q1) and Query 6 (Q6) from the

TPC-H benchmark [30] share the same table Lineitem, the largest one in the TPC-H 

database. 

However, due to the locality-oblivious multi-query scheduling that is commonly 

used in modern database servers, such significant data sharing is not fully exploited in 

databases to improve the level of temporal reuse in processor caches and reduce the 

number of processor cache misses. As a result, before a piece of data can be reused by 

another query, it has already been replaced and needs to be fetched again from main 

memory when it is needed by another query. 

Let us look at an example using Q1 and Q6 from the TPC-H benchmark. Suppose Lineitem has 1M tuples, with each tuple occupying one cache line of 64 bytes (for simplicity of description), and the L2 cache holds only 64K cache lines (4 MBytes in total). Suppose that the scheduler decides to execute Q1 first, concurrently with some other queries that do not share any data with Q1 and Q6. After Q1 accesses the 128K-th tuple, Q6 is scheduled to start from the 1st tuple. Since the L2 cache can hold only 64K tuples, the first tuple of Lineitem has already been evicted from L2. Therefore, the database needs to fetch this tuple again from main memory to execute Q6.

Fig. 2. Comparison between locality-oblivious and locality-aware multi-query scheduling.

In contrast, if we use a locality-aware multi-query scheduling and execution, we 

can schedule Q1 and Q6 together in an interleaved fashion so that, after a query fetches 

a tuple from main memory into L2, this tuple can be accessed for both queries before 

being replaced from L2. 
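The contrast between the two schedules can be reproduced with a small simulation. The sketch below is not the paper's implementation or measurement setup: it assumes a fully associative LRU cache, one tuple per cache line, and scaled-down sizes (a 64-line cache, a 1,024-tuple table, and a second query that starts 128 tuples behind the first). All function names are hypothetical.

```python
from collections import OrderedDict


def count_misses(accesses, cache_lines):
    """Count misses of a fully associative LRU cache with `cache_lines` lines."""
    cache = OrderedDict()
    misses = 0
    for line in accesses:
        if line in cache:
            cache.move_to_end(line)        # refresh recency on a hit
        else:
            misses += 1
            cache[line] = True
            if len(cache) > cache_lines:
                cache.popitem(last=False)  # evict the least recently used line
    return misses


def oblivious_schedule(num_tuples, lag):
    """Q1 scans the table; Q6 scans the same table `lag` tuples behind Q1."""
    accesses = []
    for step in range(num_tuples + lag):
        if step < num_tuples:
            accesses.append(step)          # Q1 reads tuple `step`
        if step >= lag:
            accesses.append(step - lag)    # Q6 reads tuple `step - lag`
    return accesses


def locality_aware_schedule(num_tuples, block):
    """Both queries process one block of tuples before moving to the next."""
    accesses = []
    for start in range(0, num_tuples, block):
        chunk = range(start, min(start + block, num_tuples))
        accesses.extend(chunk)             # Q1 on this block
        accesses.extend(chunk)             # Q6 on the same block
    return accesses


# Assumed, scaled-down parameters (the text uses 1M tuples and 64K lines).
NUM_TUPLES, CACHE_LINES = 1024, 64
oblivious = count_misses(oblivious_schedule(NUM_TUPLES, lag=128), CACHE_LINES)
aware = count_misses(locality_aware_schedule(NUM_TUPLES, block=32), CACHE_LINES)
print(oblivious, aware)
```

Under these assumptions every access in the oblivious schedule misses (2,048 misses), while the interleaved schedule misses only on the first touch of each tuple (1,024 misses). The exact ratio depends on the assumed cache model and is not meant to match the measured numbers in Figure 2.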

Figure 2 shows that, for multiple queries of different types (Q1+Q6), locality-aware

scheduling reduces the number of cache misses by 41.7% and results

in a 9.7% reduction in execution time. For multiple queries of the same type but with

different arguments (Q6+Q6’), the locality-aware scheduling reduces the number of 

cache misses by 42.4% and the execution time by 9.9%. These results indicate that 

locality-awareness in multi-query scheduling is very helpful to reduce the number of 

cache misses and improve database performance, which is the major focus of our work. 

4 MiniTasking 

4.1 Overview 

To exploit data sharing among concurrent queries for improving temporal locality, 

MiniTasking schedules and executes concurrent queries based on data sharing characteristics 

at three levels: query level batching, operator level grouping and mini-task 

level scheduling. While each level is different, all levels share the same goal: improving 

temporal data locality. Therefore, at each level, all decisions are made based on data 

sharing characteristics with consideration of other factors that are specific to each level. 

At the query level, due to the processor cache capacity limit, it is not beneficial to 

execute together all concurrent queries (queries that have already arrived at the database 

management server and are waiting to be processed). Therefore, MiniTasking carefully 

selects a batch of queries based on their data sharing characteristics and the processor 

cache configuration to maximize the level of temporal locality in the processor cache. 

Queries in the same batch are then processed together in the next two levels. 

At the second level, MiniTasking produces a locality-aware query plan tree for each 

batch of queries. MiniTasking does this by starting from the query plan tree produced

by the optimizer and grouping together those operators that share a significant amount of

data. Operators that do not share data with others remain untouched. 

At the third level, MiniTasking further breaks each operator into mini-tasks, with each mini-task operating on a fine-grained data block. Then all mini-tasks from the same query plan tree are executed one after another, following an order that maximizes temporal data reuse in the processor cache.

Algorithm Greedy-Selecting:
  ;; Given n queries Q_1, ..., Q_n, return a batch of
  ;; queries that will be processed as a whole.
  S = {Q_a, Q_b | AmountDataSharing(Q_a, Q_b) = max_{i,j} AmountDataSharing(Q_i, Q_j)}
  while |S| < MaxBatchSize
  do
    Find Q ∉ S s.t. ∃Q′ ∈ S
      AmountDataSharing(Q, Q′) is maximized
    if AmountDataSharing(Q, Q′) ≠ 0
      S = S ∪ {Q}
    else
      exit the loop  ;; No more queries sharing with S
  return S

Fig. 3. Greedy batch selecting algorithm.

4.2 Query Level Batching 

Obviously, the first criterion for query batching should be data sharing. If two queries

access totally different data, there is no chance of reusing each other’s data from the 

processor cache. Such a case can happen even when two queries access the same table

but access different fields that do not share the same cache line. In this case, we say that

these two queries do not have overlapping working sets, where a working set is defined as the set of

data (cache lines) accessed by a query.

Therefore, to batch queries based on data sharing characteristics, MiniTasking needs 

to estimate the amount of sharing between any two concurrent queries. A metric called

AmountDataSharing is introduced to measure the estimated amount of data sharing,

i.e., the amount of overlap in working sets, between two given concurrent queries.

Since only coarse-grain data access characteristics are known at the query level, we estimate 

a query’s working set based on the tables and the fields accessed by this query. 
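One way to make such an estimate concrete is sketched below. The paper does not give a formula for AmountDataSharing, so this code assumes, purely for illustration, that each query is summarized as a mapping from table names to the set of fields it accesses, and that sharing is the total width of the commonly accessed fields multiplied by the table's tuple count. The field names and widths are hypothetical inputs; the tuple counts are the ones reported for the TPC-H-like database in Section 5.1.

```python
def amount_data_sharing(query_a, query_b, field_bytes, tuple_counts):
    """Estimate the number of bytes accessed by both queries.

    query_a, query_b: dict mapping table name -> set of accessed field names.
    field_bytes: dict mapping (table, field) -> field width in bytes (assumed).
    tuple_counts: dict mapping table name -> number of tuples.
    """
    shared = 0
    for table in query_a.keys() & query_b.keys():
        common_fields = query_a[table] & query_b[table]
        width = sum(field_bytes[(table, field)] for field in common_fields)
        shared += width * tuple_counts[table]
    return shared


# Hypothetical schema summary and queries, for illustration only.
tuple_counts = {"lineitem": 600572, "orders": 150000}
field_bytes = {
    ("lineitem", "l_quantity"): 8,
    ("lineitem", "l_extendedprice"): 8,
    ("lineitem", "l_discount"): 8,
    ("lineitem", "l_shipdate"): 4,
    ("orders", "o_orderdate"): 4,
}
query_x = {"lineitem": {"l_quantity", "l_extendedprice", "l_discount", "l_shipdate"}}
query_y = {"lineitem": {"l_extendedprice", "l_discount", "l_shipdate"},
           "orders": {"o_orderdate"}}
query_z = {"orders": {"o_orderdate"}}

print(amount_data_sharing(query_x, query_y, field_bytes, tuple_counts))
print(amount_data_sharing(query_x, query_z, field_bytes, tuple_counts))
```

A field-level overlap like this is coarser than the cache-line-level working sets defined above: two different fields may still share a cache line, which this sketch ignores.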

MiniTasking schedules queries in batches and processes these batches one by one. 

Given a large number of concurrent queries that share data with each other, intuitively, 

it sounds beneficial to execute concurrently as many queries as possible so that the 

amount of data reuse can be maximized. 

However, in reality, due to the limited L2 cache capacity, scheduling too many concurrent 

queries can result in even poorer temporal locality because data from different

queries can replace each other in the cache before being reused. Therefore, we should 

carefully decide how many and which concurrent queries should be batched together. 

To address this problem, we use a threshold parameter, MaxBatchSize, to limit the 

number of concurrent queries in a batch. 

Based on the above analysis, we use a heuristic greedy algorithm to select batches 

of queries, as shown in Figure 3. It works similarly to a clustering algorithm: it divides all

concurrent queries into clusters of at most MaxBatchSize queries to maximize the total

amount of data sharing.
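The greedy selection of Figure 3 can be written out as a short runnable sketch. It follows the pseudocode as printed; the `sharing` callback stands in for AmountDataSharing, and the query names and sharing values in the example are made up. The sketch assumes at least two queries are waiting and, like the pseudocode, always seeds the batch with the best-sharing pair.

```python
from itertools import combinations


def greedy_select(queries, sharing, max_batch_size):
    """Select one batch of queries following the greedy scheme of Figure 3.

    queries: list of hashable query identifiers (at least two).
    sharing: function (q1, q2) -> estimated amount of shared data.
    """
    # Seed the batch with the pair of queries that shares the most data.
    first, second = max(combinations(queries, 2), key=lambda pair: sharing(*pair))
    batch = [first, second]
    while len(batch) < max_batch_size:
        candidates = [q for q in queries if q not in batch]
        if not candidates:
            break
        # Score each outside query by its best sharing with any batch member.
        best = max(candidates, key=lambda q: max(sharing(q, m) for m in batch))
        if max(sharing(best, m) for m in batch) == 0:
            break  # no remaining query shares data with the batch
        batch.append(best)
    return batch


# Hypothetical symmetric sharing estimates between five waiting queries.
SHARING = {
    frozenset({"A", "B"}): 10,
    frozenset({"A", "C"}): 4,
    frozenset({"B", "C"}): 7,
    frozenset({"C", "D"}): 1,
}


def sharing(q1, q2):
    return SHARING.get(frozenset({q1, q2}), 0)


print(greedy_select(["A", "B", "C", "D", "E"], sharing, max_batch_size=3))
print(greedy_select(["A", "B", "C", "D", "E"], sharing, max_batch_size=5))
```

In the second call the loop stops before the batch is full because query E shares nothing with the batch. Applying the function again to the queries left over would presumably yield the next batch, although the figure itself only describes selecting a single batch.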

[Figure 4: two panels, "Original Plans" and "Enhanced Plans", showing operator trees (operators Op_1 to Op_6 and Op′_1 to Op′_4) over Table Scans T_1, T_2 and T_3, with a MiniTasking node in the enhanced plans.]

Fig. 4. An example of the operator grouping process. The output of Op′_3 to Op′_4 and the output of Op_4 to Op_5 are materialized.

4.3 Operator Level Grouping 

Since queries consist of operators, MiniTasking goes one step further to group together 

operators from the same batch of queries according to their data sharing characteristics. 

MiniTasking scans every physical operator tree produced by the query optimizer 

for each query in a batch and groups operators that share certain data.

The evaluation process is similar to the one used at the query level. If the results of an 

operator are pipelined to other operators, MiniTasking also puts these related operators

into the same group. Each group of operators is then passed to the mini-task level. Operators 

that do not share data with others are all put into the last group, which is executed

last using the original scheduling algorithm. 

MiniTasking supports operator dependency by maintaining a pool of ready operators. 

An operator is ready and joins the ready pool when it does not depend on other

unexecuted operators. MiniTasking selects a group of operators from the ready pool 

using a similar algorithm to the one used in query batching described in Figure 3. After 

this group of operators finishes execution via mini-tasking (described in the next subsection), 

some operators that depend on the ones just executed will be “released” and 

join the ready pool if they do not have other dependencies. MiniTasking will select the 

next group of operators, and so on, until all operators are executed.
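The ready-pool mechanism can be illustrated with a minimal sketch. It is a simplification under stated assumptions: dependencies are given as an explicit mapping, the group-selection policy is passed in as a callback (here a hypothetical `group_by_table` that groups ready operators scanning the same table, rather than the Figure 3 style selection), and the rule that pipelined operators join the same group is not modeled. The operator names are invented for the example.

```python
def schedule_operator_groups(depends_on, select_group):
    """Return groups of operators in an order that respects dependencies.

    depends_on: dict mapping operator -> set of operators it depends on.
    select_group: function(ready_list) -> non-empty sublist to run together.
    """
    executed = set()
    remaining = set(depends_on)
    order = []
    while remaining:
        # An operator is ready once everything it depends on has executed.
        ready = sorted(op for op in remaining if depends_on[op] <= executed)
        if not ready:
            raise ValueError("cyclic dependency among operators")
        group = select_group(ready)
        order.append(group)            # the group would run via mini-tasking here
        executed.update(group)         # finishing it may release dependents
        remaining.difference_update(group)
    return order


# Hypothetical operators of two queries Q and Qp and the tables they scan.
TABLE = {"Q.scan_T1": "T1", "Qp.scan_T1": "T1", "Q.scan_T2": "T2", "Q.join": None}
DEPENDS_ON = {
    "Q.scan_T1": set(),
    "Qp.scan_T1": set(),
    "Q.scan_T2": set(),
    "Q.join": {"Q.scan_T1", "Q.scan_T2"},
}


def group_by_table(ready):
    """Group the first ready operator with other ready scans of the same table."""
    table = TABLE[ready[0]]
    if table is None:
        return [ready[0]]
    return [op for op in ready if TABLE[op] == table]


print(schedule_operator_groups(DEPENDS_ON, group_by_table))
```

Here the two scans of T1 run as one group, and the join is released into the ready pool only after both of its input scans have executed.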

Figure 4 uses an example to demonstrate how MiniTasking works at the operator 

level. Suppose there are two queries, namely Q and Q′. Op_1 to Op_5 are operators of

query Q, and Op′_1 to Op′_4 are operators of query Q′. As both Op_1 and Op′_1 access table

T_1, they are grouped together. Suppose Op′_3 and Op_4 are implemented using pipelining.

MiniTasking also puts them into the same group as Op_1, Op′_1, Op_2 and Op′_2. This group

does not contain Op′_4 or Op_5, because the results of Op′_3 and Op_4 are materialized.

4.4 Mini-task Level Scheduling 

At the mini-task level, the challenge is how MiniTasking breaks various query operators into mini-tasks and benefits from rescheduling them. We show our method by illustrating a data-centric method applied to a table scan. The idea can be extended to handle other query operators.

Fig. 5. MiniTasking breaks the operators into mini-tasks and schedules them.

Fig. 6. The layouts for the two join relations and the join query.

The goal of MiniTasking is to make the data loaded into the cache reused by queries 

as much as possible before it is evicted from the cache. Therefore, MiniTasking carefully 

chooses an appropriate value for the whole working set size, which means the 

total size for all the data blocks that can reside in the cache. It has a big impact on the 

query performance. If it is too large, some data may be evicted from the cache before 

being reused. However, decreasing it will result in more mini-tasks and thereby heavier 

switching overhead. 

Generally, this parameter is related to the target architecture, the data layouts and 

the queries, especially the L2 cache size, the L2 cache line size and the associativity. 

According to our experiments, it is not very sensitive to the type of queries. Once the 

target architecture and the data layouts are specified, it is feasible to run some calibration 

experiments in advance to determine the best value for this parameter. 

Therefore, for a table scan, MiniTasking divides the table into n fine-grained data

blocks, with each block sized to fit the working set. Correspondingly, MiniTasking

breaks the execution of each scan operator into n mini-tasks, according to the data

blocks they use. Thereafter, when a data block is loaded by the first mini-task,

MiniTasking schedules the other mini-tasks that share this data block to execute one by one.

When no more mini-tasks use this data block, it is replaced by the next data block. Thus

the data residing in the cache can be maximally reused before being evicted.

The following example illustrates this data-centric scheduling method. Suppose

there are three table scan operators Op_1, Op_2, Op_3 and they share the table T, as

shown in Figure 5. Table T is divided into three data blocks. According to the data

blocks they access, the three operators are broken into (Op_{1,1}, Op_{1,2}, Op_{1,3}), (Op_{2,1},

Op_{2,2}, Op_{2,3}), and (Op_{3,1}, Op_{3,2}, Op_{3,3}), respectively. MiniTasking schedules them in

the following order: Op_{1,1}, Op_{2,1}, Op_{3,1}, Op_{1,2}, Op_{2,2}, Op_{3,2}, Op_{1,3}, Op_{2,3}, and Op_{3,3}. In

this way, the data block (DT_j) loaded into the cache by Op_{1,j} (j=1, 2, 3) can be reused

by the subsequent mini-tasks Op_{2,j} and Op_{3,j}.
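The block-major order in this example can be generated mechanically. The following sketch is illustrative only: the helper names are hypothetical, a table is modeled as a plain Python list, and each scan operator is reduced to a predicate that counts matching rows. It shows the three pieces described above: splitting a table into blocks, ordering mini-tasks so that all operators visit block j before any visits block j+1, and running several scans over each block while it is resident.

```python
import math


def split_into_blocks(num_tuples, tuples_per_block):
    """Return (start, end) tuple ranges, one per data block."""
    n_blocks = math.ceil(num_tuples / tuples_per_block)
    return [(b * tuples_per_block, min((b + 1) * tuples_per_block, num_tuples))
            for b in range(n_blocks)]


def mini_task_order(operators, n_blocks):
    """Data-centric order: every operator visits block j before any visits j+1."""
    return [(op, block) for block in range(1, n_blocks + 1) for op in operators]


def run_scans(table, predicates, tuples_per_block):
    """Run several scan predicates over one table, block by block."""
    counts = [0] * len(predicates)
    for start, end in split_into_blocks(len(table), tuples_per_block):
        block = table[start:end]                     # block is loaded once
        for i, predicate in enumerate(predicates):   # then reused by each scan
            counts[i] += sum(1 for row in block if predicate(row))
    return counts


# Mini-task (op, j) corresponds to Op_{op,j} in the text.
print(mini_task_order([1, 2, 3], n_blocks=3))
print(run_scans(list(range(10)), [lambda x: x % 2 == 0, lambda x: x > 6], 4))
```

The choice of `tuples_per_block` plays the role of the working set size parameter discussed earlier, which the text suggests determining through calibration experiments on the target architecture.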

          Hash Joins          Index Joins
Tuples    Hash-1    Hash-2    Index-1    Index-2
Outer     10^6      10^6      10^6       5,000
Inner     5,000     100       500,000    10^6

Table 1. The sizes of the outer and inner relations used by Micro-join.

Parameters            L1 D cache    L2 cache
Size                  8KB           512KB
Associativity         4-way         8-way
Cache line            64B           64B
Cache miss latency    7 cycles      350 cycles

Table 2. Processor cache parameters of the evaluation platform.

5 Experimental Evaluation 

5.1 Evaluation Methodology 

We implement MiniTasking in the Shore database storage manager [6], which provides 

most of the popular storage features used in a modern commercial DBMS. Previous 

work shows that Shore exhibits memory access behaviors similar to several commercial

DBMSes [1]. Since Shore’s original query scheduler is fairly serialized (executing 

one query after another), we have extended Shore to use a slightly more sophisticated 

scheduler which switches from one query to another after a certain time quantum or 

when this query yields voluntarily due to other reasons (e.g. I/Os). This scheduler emulates 

what would really happen with a multi-threaded or multi-processed commercial 

database server. Our results also show that this scheduler performs slightly better than 

the original scheduler in Shore. Therefore, we use this time quantum-based scheduler 

as our baseline to compare with MiniTasking. 

Experimental Workloads. For DSS workloads, we use a TPC-H-like benchmark, which

represents the activities of a complex business that manages, sells and distributes a large 

number of products [30]. The following are the table sizes in our TPC-H-like database: 

600572 tuples in Lineitem, 150000 tuples in Orders, and 20000 tuples in Part. 

Experimental Platform. Our evaluation is conducted on a machine with a 2.4GHz Intel

Pentium 4 processor and 2.5GB of main memory. The processor includes two levels of

caches, L1 and L2, whose characteristics are shown in Table 2. The operating system

is Linux kernel 2.4.20. For measurements, we use a commercial tool, the Intel VTune

performance tool [17], which collects performance statistics with negligible overhead.

5.2 Results For Micro-Join 

We use a two-relation join query to examine MiniTasking, as shown in Figure 6. 

We vary the number of tuples in the two relations and examine four representative 

combinations for them, as shown in Table 1. 

Our experiments show that MiniTasking improves the performance of join operations 

by 4%–12%. When a hash join is used, if the hash table on the inner relation is 

small enough to be put into the cache, MiniTasking can be effectively applied to the 

outer relation. For example, when two instances of the join query are running,

MiniTasking improves the query performance by 9% in the case of Hash-1 and 12% in the

case of Hash-2, as shown in Figure 7. MiniTasking achieves a similar speedup for the index-based

join since it can break the index probing into mini-tasks. As a result, MiniTasking

reduces the query execution time by up to 8.2% for two concurrently running instances 

of the join query. 

Fig. 7. Performance of join operations. MiniTasking reduces execution time by up to 12.1% for hash joins and 8.2% for index nested-loops joins. MT stands for MiniTasking.

Fig. 8. Performance of throughput-real tests: (a) normalized execution time; (b) CPI breakdown. Each test runs several concurrent streams. The execution time of each test is normalized to the baseline without MiniTasking.

5.3 Results For Throughput-Real 

We validate our MiniTasking strategy using a real workload, modeled after the

throughput test of the TPC-H benchmark. The standard TPC-H throughput test is composed

of multiple concurrent streams. Each stream contains a sequence of TPC-H queries in

an order that the TPC-H benchmark specifies. Accordingly, our experiment follows these

sequences and lets each stream execute the six TPC-H queries we implemented.

Our experimental results show that MiniTasking is very effective for this workload. 

Figure 8(a) shows that the execution time of each test is reduced by 11%-20% for various

numbers of concurrent query streams. As shown in Figure 8(b), the performance

gain comes from the reduction in L2 cache misses: MiniTasking reduces

the number of L2 cache misses by 41%-79%. This is because MiniTasking’s locality-aware

query scheduling and execution effectively improve the temporal locality of accesses.

Meanwhile, MiniTasking does not affect other processor events very much since it adds

little overhead. Therefore, the improved L2 cache hit ratios are proportionally reflected

in the end performance.

5.4 Improvement Upon PAX Layout 

Figure 9 shows the effects of MiniTasking on a cache-conscious data layout such

as PAX [1]. MiniTasking can still effectively reduce the number of L2 cache misses

by 65% and the execution time by 9%. The performance speedup is less pronounced

with PAX than with the default NSM layout because, with PAX, which has significantly

improved spatial locality in accesses, the L2 cache miss time contributes less to the

execution time than with NSM.

Fig. 9. The execution time and the CPI breakdown of four concurrent queries (TPCH-Q1): (a) normalized execution time; (b) CPI breakdown. The PAX data layout is used in Shore and MT.

6 Conclusion 

In this paper, we propose a technique called MiniTasking to improve database performance 

for concurrent query execution by reducing the number of processor cache 

misses via three levels of locality-based scheduling. Through query level batching, operator 

level grouping and mini-task level scheduling, MiniTasking can significantly reduce 

L2 cache misses and execution time. Our experimental results show that

MiniTasking can reduce the execution time by up to 12% for joins. For the TPC-H

throughput test workload, MiniTasking reduces the number of L2 cache misses by up

to 79% and improves the end performance by up to 20%. With the Partition Attributes

Across (PAX) layout, MiniTasking further reduces the cache misses by 65% and the

execution time by 9%, which indicates that our technique complements previous

cache-conscious layouts well.

References 

1. A. Ailamaki, D. J. DeWitt, and M. D. Hill. Data page layouts for relational databases on 

deep memory hierarchies. The VLDB Journal, 11(3):198–215, 2002. 

2. A. Ailamaki, D. J. DeWitt, M. D. Hill, and D. A. Wood. DBMSs on a modern processor: 

Where does time go? In VLDB ’99, pages 266–277, 1999. 

3. A.-H. A. Badawy, A. Aggarwal, D. Yeung, and C.-W. Tseng. Evaluating the impact of memory 

system performance on software prefetching and locality optimizations. In International 

Conference on Supercomputing, pages 486–500, 2001. 

4. P. A. Boncz, S. Manegold, and M. L. Kersten. Database architecture optimized for the new 

bottleneck: Memory access. In VLDB ’99, pages 54–65, 1999. 

5. B. Calder, C. Krintz, S. John, and T. Austin. Cache-conscious data placement. In ASPLOS 

’98, pages 139–149, 1998. 

6. M. J. Carey, D. J. DeWitt, M. J. Franklin, N. E. Hall, M. L. McAuliffe, J. F. Naughton, D. T. 

Schuh, M. H. Solomon, C. K. Tan, O. G. Tsatalos, S. J. White, and M. J. Zwilling. Shoring 

up persistent applications. In SIGMOD ’94, pages 383–394, 1994. 

7. S. Carr, K. S. McKinley, and C.-W. Tseng. Compiler optimizations for improving data locality. 

In ASPLOS ’94, pages 252–262, 1994. 

8. S. Chen, P. B. Gibbons, and T. C. Mowry. Improving index performance through prefetching. 

In SIGMOD ’01, pages 235–246, 2001. 

9. S. Chen, P. B. Gibbons, T. C. Mowry, and G. Valentin. Fractal prefetching b+-trees: optimizing 

both cache and disk performance. In SIGMOD ’02, pages 157–168, 2002. 


10. N. N. Dalvi, S. K. Sanghai, P. Roy, and S. Sudarshan. Pipelining in multi-query optimization. 

In PODS ’01, pages 59–70, 2001. 

11. C. Ding and K. Kennedy. Inter-array data regrouping. In Languages and Compilers for 

Parallel Computing, pages 149–163, 1999. 

12. C. Ding and M. Orlovich. The potential of computation regrouping for improving locality. 

In ACM/IEEE SC2004, Nov. 6-12, 2004. 

13. S. Finkelstein. Common expression analysis in database applications. In SIGMOD ’82, 

pages 235–245, 1982. 

14. R. A. Hankins and J. M. Patel. Data morphing: An adaptive, cache-conscious storage technique.

In VLDB ’03. Morgan Kaufmann, 2003. 

15. S. Harizopoulos, V. Shkapenyuk, and A. Ailamaki. Qpipe: A simultaneously pipelined relational 

query engine. In SIGMOD ’05, pages 383–394, 2005. 

16. IBM. Personal communication with IBM, Jan. 2005. 

17. Intel Corporation. Intel vtune performance analyzer. 

http://www.intel.com/software/products/vtune/, 2004. 

18. K. Kennedy and K. S. McKinley. Maximizing loop parallelism and improving data locality 

via loop fusion and distribution. In Proceedings of the 6th International Workshop on 

Languages and Compilers for Parallel Computing, pages 301–320. Springer-Verlag, 1994. 

19. K. Kim, S. K. Cha, and K. Kwon. Optimizing multidimensional index trees for main memory 

access. In SIGMOD ’01, pages 139–150. ACM Press, 2001. 

20. J. L. Lo, L. A. Barroso, S. J. Eggers, K. Gharachorloo, H. M. Levy, and S. S. Parekh. An 

analysis of database workload performance on simultaneous multithreaded processors. In 

ISCA ’98, pages 39–50. IEEE Computer Society, 1998. 

21. K. S. McKinley, S. Carr, and C.-W. Tseng. Improving data locality with loop transformations. 

ACM Transactions on Programming Languages and Systems, 18(4):424–453, July 1996. 

22. K. O’Gorman, D. Agrawal, and A. E. Abbadi. Multiple query optimization by cache-aware 

middleware using query teamwork. In ICDE ’02, page 274. IEEE Computer Society, 2002. 

23. J. Philbin, J. Edler, O. J. Anshus, C. C. Douglas, and K. Li. Thread scheduling for cache 

locality. In ASPLOS ’96, pages 60–71. ACM Press, 1996. 

24. R. Ramamurthy, D. J. DeWitt, and Q. Su. A case for fractured mirrors. In VLDB ’02, pages 

430–441, 2002. 

25. J. Rao and K. A. Ross. Making b+- trees cache conscious in main memory. In SIGMOD ’00, 

pages 475–486, New York, NY, USA, 2000. ACM Press. 

26. P. Roy, S. Seshadri, S. Sudarshan, and S. Bhobe. Efficient and extensible algorithms for 

multi query optimization. In SIGMOD ’00, pages 249–260. ACM Press, 2000. 

27. T. Sellis and S. Ghosh. On the multiple-query optimization problem. IEEE Transactions on 

Knowledge and Data Engineering, 2(2):262–266, 1990. 

28. T. K. Sellis. Multiple-query optimization. ACM Trans. Database Syst., 13(1):23–52, 1988. 

29. A. Shatdal, C. Kant, and J. F. Naughton. Cache conscious algorithms for relational query 

processing. In VLDB ’94, pages 510–521, 1994. 

30. Transaction processing performance council. http://www.tpc.org. 

31. P. Trancoso, J.-L. Larriba-Pey, Z. Zhang, and J. Torrellas. The memory performance of DSS 

commercial workloads in shared-memory multiprocessors. In HPCA ’97, 1997. 

32. M. E. Wolf and M. S. Lam. A data locality optimizing algorithm. In PLDI ’91, 1991. 

33. J. Zhou and K. A. Ross. Buffering accesses to memory-resident index structures. In VLDB 

’03, pages 405–416, 2003. 

34. Y. Zhou, L. Wang, D. W. Clark, and K. Li. Thread scheduling for out-of-core applications 

with memory server on multicomputers. In IOPADS ’99, pages 57–67. ACM Press, 1999. 
