
Applying OLAP Pre-Aggregation Techniques to Speed Up
Aggregate Query Processing in Array Databases

by

Angélica García Gutiérrez

A thesis submitted in partial fulfillment
of the requirements for the degree of

Doctor of Philosophy
in Computer Science

Approved, Thesis Committee:

Prof. Dr. Peter Baumann
Prof. Dr. Vikram Unnithan
Prof. Dr. Inés Fernando Vega López

Date of Defense: November 12, 2010

School of Engineering and Science


In memory of my grandmother, Naty.


Acknowledgments

I would like to express my sincere gratitude to my thesis advisor, Prof. Dr. Peter Baumann, for his excellent guidance throughout the course of this dissertation. With his tremendous passion for science and his great efforts to explain things clearly and simply, he made this research one of the richest experiences of my life. He always suggested new ideas and guided my research through many pitfalls. Furthermore, I learned from him to be kind and cooperative. Thank you for every single meeting, for every single discussion that you always managed to make thought-provoking, for your continued encouragement, and for believing that I could bring this project to success.

I am also grateful to Prof. Dr. Inés Fernando Vega López for his valuable suggestions. He not only provided me with technical advice but also gave me important hints on scientific writing that I applied in this dissertation. My sincere gratitude also goes to Prof. Dr. Vikram Unnithan. Despite being one of Jacobs University's most popular and busiest professors due to his genuine engagement with student life beyond academics, Prof. Unnithan took interest in this work and provided me with unconditional support.

I would like to thank two promising graduate students, Irina Calciu and Eugen Sorbalo, for their outstanding contributions to some of the experiments presented in Chapter 5 of this thesis.

I am especially grateful to my colleagues Michael Owonibi, Salah Al Jubeh, and Yu Jinsongdi for many valuable discussions, and for providing a stimulating and fun environment in which to learn and grow.

I am grateful to the team assistants at the School of Engineering and Science for helping the School to run smoothly and for assisting me in many different ways. Sigrid Manss deserves special mention. Thank you for all your kindness and caring.

Also, I would like to thank Connie Garcia, Jim Toersten, Greg White, Irina Prjadeha, and all of my friends who helped me proofread this thesis. Victoria Inness-Brown deserves special mention for applying her expertise as an editor in reviewing each chapter of this thesis.

Thank you to all my great friends who provided support and encouragement in so many ways, for helping me to see the bright side of my problems in difficult times, and for all the emotional support, camaraderie, entertainment, and caring they provided. Especially to Salah Al Jubeh, Asma Alazeib, Talina Eslava, Rainer Gruenheid, Yu Jinsongdi, Maria Joy, Ghada Kadamany, Ingrid Lara, Blessing Musunda, Michael Owonibi, Jessica Price, Irina Prjadeha, Joerg Reinekirchen, Yannic Ramaye, Mila Tarabashkina, Ruiju Tong, Derya Toykan, Iyad Tumar, Vanya Uzunova, Tanja Vaitulevich, and Justo Vargas. You all have a place in my heart. Also, to my friend Samantha Hooton, whom I learned to love as a sister shortly after meeting her. Her authenticity, self-confidence, and drive to succeed are a real inspiration. Thank you for your caring, for sharing your wisdom, for taking me to the hospital when I was in pain, and for being there anytime I needed a friend.

My warmest thanks to Father Matthew I. Nwoko for his spiritual guidance, his caring, his advice, and above all, his unconditional love.

Thank you to my parents, my brother, and my sisters, who have always been very supportive of my aspirations. Their support has been instrumental in setting me on the path that brought me to this project. Especialmente, gracias a ti, mamá, por ser mi ejemplo de tenacidad y compromiso. A ti también te dedico esta tesis. (Especially, thank you, Mom, for being my example of tenacity and commitment. I dedicate this thesis to you as well.)

To DAAD and CONACYT: your financial support and trust are gratefully acknowledged.

To everybody who has been a part of my life, thank you very much.

Lastly, I thank the Lord God Almighty for giving me the health, ideas, and wisdom to complete this research project successfully.


Abstract

Large multidimensional arrays of data are common in a variety of scientific applications. In the past, arrays have typically been stored in files and then manipulated by customized programs operating on those files. Nowadays, with science moving toward computational databases, the trend is toward a new class of database, the array database. In the broadest sense, the array database supports various types of multidimensional array data, including remote-sensor data, satellite imagery, and data resulting from scientific simulations.

As with traditional databases for business applications, analytics in array databases often involves the extraction of general characteristics from large repositories. This requires efficient methods for computing queries that involve data summarization, such as aggregate queries. A typical solution is to pre-compute the whole or parts of each query, and then save the results of those queries that are frequently submitted against the database and those that can be used to compute the results of similar future queries. This process is known as pre-aggregation. Unfortunately, pre-aggregation support for array databases is currently limited to one specific operation, scaling (zooming), and to two-dimensional datasets (images).

In this respect, database technology for business applications is much more mature. Technologies such as On-Line Analytical Processing (OLAP) provide the means to analyze business data from one or multiple sources, and thus facilitate the decision-making process. In OLAP, information is viewed as data cubes. These cubes are typically stored in relational tables, in multidimensional arrays, or in a hybrid model. To enable fast interactive multidimensional data analysis, database systems frequently pre-compute and store the results of aggregate queries. While there are valuable research results in the realm of OLAP pre-aggregation techniques, with varying degrees of power and refinement, not enough work has been done and reported for array databases.

The purpose of this thesis is to investigate the application of OLAP pre-aggregation techniques with the objective of speeding up aggregate operations in array databases. In particular, we consider enhancing aggregate computation in Geographic Information Systems (GIS) and remote-sensing imaging applications. To this end, we describe a set of fundamental operations in GIS based on a sound algebraic framework. This allows us to identify those operations that require data summarization and that therefore may benefit from pre-aggregation. We introduce a conceptual framework and cost model for rewriting basic aggregate queries in terms of pre-aggregated data, and conduct experiments to assess the performance of our algorithms. Results show that query response times can be substantially reduced by strategically selecting the pre-aggregate with the least cost in terms of execution time. We also investigate the problem of selecting a set of queries for pre-aggregation, but found no analytical solution covering all possible types of aggregate queries. Nevertheless, we present a framework and algorithms for selecting scaling operations for pre-aggregation, considering 2D, 3D, and 4D datasets. In our experiments with 2D datasets, this approach outperforms image pyramids, the current technique used to speed up scaling operations on 2D datasets. Furthermore, our experiments on 3D and 4D datasets show that query response times can also be substantially reduced by intelligently selecting a set of scaling operations for pre-aggregation.

The work presented in this thesis is the first of its kind for array databases in scientific applications.


Contents

1 Introduction and Problem Statement
  1.1 Overview of Thesis and Contributions
  1.2 Publications Related to this Thesis

2 Background and Related Work
  2.1 Array Databases
    2.1.1 Basic Notion of Arrays
    2.1.2 2D Data Models
    2.1.3 Multidimensional Data Models
    2.1.4 Storage Management
    2.1.5 2D Pre-Aggregation
    2.1.6 Pre-Aggregation Beyond 2D
    2.1.7 Summary
  2.2 On-Line Analytical Processing (OLAP)
    2.2.1 OLAP Data Model
    2.2.2 OLAP Operations
    2.2.3 OLAP Architectures
    2.2.4 OLAP Pre-Aggregation
  2.3 Discussion

3 Fundamental Geo-Raster Operations
  3.1 Array Algebra
    3.1.1 Constructor
    3.1.2 Condenser
    3.1.3 Sorter
  3.2 Geo-Raster Operations
    3.2.1 Mathematical Operations
    3.2.2 Aggregation Operations
    3.2.3 Statistical Aggregate Operations
    3.2.4 Affine Transformations
    3.2.5 Terrain Analysis
    3.2.6 Other Operations
  3.3 Summary

4 Answering Basic Aggregate Queries Using Pre-Aggregated Data
  4.1 Framework
    4.1.1 Aggregation
    4.1.2 Pre-Aggregation
    4.1.3 Aggregate Query and Pre-Aggregate Equivalence
  4.2 Cost Model
    4.2.1 Computing Queries from Raw Data
    4.2.2 Computing Queries from Independent and Overlapped Pre-Aggregates
    4.2.3 Computing Queries from Dominant Pre-Aggregates
  4.3 Implementation
  4.4 Experimental Results
  4.5 Summary

5 Pre-Aggregation Support Beyond Basic Aggregate Operations
  5.1 Non-Standard Aggregate Operations
  5.2 Conceptual Framework
    5.2.1 Lattice Representation
    5.2.2 Pre-Aggregation Selection Problem
  5.3 Pre-Aggregates Selection
    5.3.1 Complexity Analysis
  5.4 Answering Scaling Operations Using Pre-Aggregated Data
  5.5 Experimental Results
    5.5.1 2D Datasets
    5.5.2 3D Datasets
    5.5.3 4D Datasets
  5.6 Summary

6 Conclusion
  6.1 Future Work


List of Figures

2.1 3D Array
2.2 Map Algebra Functions
2.3 Image Tiling
2.4 Image Pyramids
2.5 Nearest Neighbor, Bilinear and Cubic Interpolation Methods
2.6 3D Scaling Operations on Time-Series Imagery Datasets
2.7 OLAP Data Cube
2.8 Typical OLAP Cube Operations
2.9 OLAP Approaches: MOLAP, ROLAP, and HOLAP
2.10 MOLAP Storage Scheme
2.11 ROLAP Storage Scheme
2.12 Typical Query as Expressed in ROLAP and MOLAP Systems
2.13 Star Model of a Spatial Warehouse
2.14 Comparison of Roll-Up and Scaling Operations
3.1 Reduction of Contrast in the Green Channel of an RGB Image
3.2 Highlighted Infrared Areas of an NRG Image
3.3 Cells of Rasters A and B with Equal Values
3.4 Re-Classification of the Cell Values of a Raster Image
3.5 Computation of a Proximity Operation
3.6 Computation of an Overlay Operation
3.7 Computation of an Overlay Operation Considering Values Greater than Zero
3.8 Calculation of the Total Sum of Cell Values in a Raster
3.9 Result of an Average Aggregate Operation
3.10 Result of a Maximum Aggregate Operation
3.11 Result of a Minimum Aggregate Operation
3.12 Computation of the Histogram for a Raster Image
3.13 Computation of the Diversity for a Raster Image
3.14 Computation of a Majority Operation for a Raster Image
3.15 Computation of the Variance for a Raster Image
3.16 Computation of the Standard Deviation for a Raster Image
3.17 Computation of the Median for a Raster Image
3.18 Computation of a Top-k Operation for a Raster Image
3.19 Computation of a Translation Operation for a Raster Image
3.20 Computation of a Scaling Operation for a Raster Image
3.21 Slopes Along the X and Y Directions
3.22 Flow Directions
3.23 Sobel Masks
3.24 Computation of an Edge-Detection for a Raster Image
4.1 Types of Pre-Aggregates
4.2 Selected Queries for Pre-Aggregation (left) and Decomposed Queries (right)
5.1 Sample Lattice Diagram for a Workload with Five Scaling Operations
5.2 Query Workload with Uniform Distribution
5.3 Query Workload with Poisson Distribution
5.4 Selected Queries for Pre-Aggregation
5.5 Query Workload with Peak Distribution
5.6 Selected Queries for Pre-Aggregation
5.7 Query Workload with Step Distribution
5.8 Selected Queries for Pre-Aggregation
5.9 Workload with Uniform Distribution Along x, y, and t
5.10 Average Query Cost over Storage Space
5.11 Selected Pre-Aggregates, c = 36%
5.12 Workload with Uniform Distribution Along x, y, and Poisson Distribution in t
5.13 Average Query Cost as Space is Varied
5.14 Selected Pre-Aggregates, c = 26%
5.15 Workload with Poisson Distribution Along x, y, and t
5.16 Average Query Cost as Space is Varied
5.17 Selected Pre-Aggregates, c = 30%
5.18 Workload with Poisson Distribution Along x, y, and Uniform Distribution in t
5.19 Average Query Cost as Space is Varied
5.20 Selected Pre-Aggregates, c = 21%


List of Tables

3.1 UNO and FAO Suitability Classifications
3.2 Capability Indexes for Different Capability Classes
3.3 Array Algebra Classification of Geo-Raster Operations
4.1 Cost Parameters
4.2 Database and Queries of the Experiment
4.3 Comparison of Query Evaluation Costs Using Pre-Aggregated Data and Original Data
5.1 Sample Pre-Aggregates
5.2 ECHAM T-42 Climate Simulation Dimensions
5.3 4D Scaling: Scale Vector Distribution
5.4 4D Scaling: Selected Pre-Aggregates




Chapter 1

Introduction and Problem Statement

Scientific computing platforms and infrastructures are making new kinds of experiments possible, resulting in the generation of vast volumes of array data. This is happening in many specialized application areas such as meteorology, oceanography, hydrology, astronomy, medical imaging, and exploration systems for oil, natural gas, coal, and diamonds. These datasets range from uniformly spaced points (cells) along a single dimension to multidimensional arrays containing several different types of data. For example, astronomy and the earth sciences operate on two- or three-dimensional spatial grids, often using a plethora of spherical coordinate systems. Furthermore, nearly all sciences must deal with data series over time. It is frequently necessary to understand relationships between consecutive elements in time, or to analyze entire sequences of observations, and such datasets may represent spatial, temporal, or spatio-temporal information. For example, if ocean measurements such as temperature, salinity, and oxygen are recorded every hour at spacings of one meter in depth and ten meters in the two horizontal dimensions, the result is a four-dimensional array with three spatial dimensions and one temporal dimension, and three values attached to each cell of the array.

In the past, arrays were typically stored in files and then manipulated by programs that operated on these files. Nowadays, with science moving toward being computational and data based, the trend is toward a new class of database system that supports not only traditional, or coded, data types such as text, integers, etc., but also richer data types like multidimensional arrays. This new class of databases is referred to as Array Databases.

Implementing an efficient array database management system (DBMS) can be very challenging. Typically, two approaches can be taken to store array datasets in a DBMS. In the first, the values of each cell are stored in a separate row, along with fields describing the position of the cell in the array. The most obvious drawback of this approach is the need for a large multidimensional index to efficiently find rows in the table. Moreover, the space taken by a multidimensional index is larger than the size of the table itself if all dimensions forming an array are used as the key. In the second approach, a multidimensional array is written to a Binary Large Object (BLOB), which is stored in a field of a table in the database. Applications then fetch the contents of the BLOB when they wish to operate on the data. The main drawback of this approach is that it either requires the entire array to be passed to the client, or it requires that the client perform a large number of BLOB input/output (I/O) operations to read only the required portions of the array. With databases growing beyond a few tens of terabytes, the analysis of large volumes of array datasets is severely limited by the relatively low I/O performance of most of today's computing platforms. High-performance numerical simulations are also increasingly feeling the I/O bottleneck.

To improve data management and analytics on large repositories of data, aggregation has been put forward as a key process for describing high-level data. An example of data aggregation is the computation and storage of statistical parameters such as count, average, median, and standard deviation. Aggregate computation has been studied in a variety of settings [4, 21, 66]. In particular, On-Line Analytical Processing (OLAP) technology has emerged to address the problem of efficiently computing complex multidimensional aggregate queries on large data warehouses. Most OLAP systems rely on the process of selecting aggregate combinations, and then pre-computing and storing their results so the database system can make use of them in subsequent requests. This process is known as pre-aggregation, and it has been shown to speed up aggregate queries by several orders of magnitude in business and statistical applications [31, 41].

While considerable work has been done on the problem of efficiently computing aggregate queries in OLAP-based applications, such computations continue to be a data management challenge in scientific applications. A relevant example in which advanced data management and efficient query processing are highly desirable is hyper-spectral remote-sensing imaging, in which an imaging spectrometer collects hundreds or even thousands of measurements for the same area of the surface of the Earth. The scenes provided by such sensors are often called data cubes to denote the dimensionality of the data. Notably, efficient query processing and data mining techniques facilitate exploration of spatio-temporal data patterns, both interactively and in batch on archived data.

A significant fraction of scientific data is image-based and can be naturally represented in multidimensional arrays. These datasets fit poorly into relational databases, which lack efficient support for the concepts of physical proximity and order. They are typically stored in array-friendly formats such as HDF5, netCDF, or FITS. The extremely high computational requirements introduced by image-based scientific applications make them an excellent case study for our research.

Since array databases and OLAP/data warehousing both deal with large multidimensional datasets and aggregate queries, adapting OLAP pre-aggregation techniques to the management and computation of aggregate queries in array databases offers a strong potential benefit. This thesis investigates the application of OLAP pre-aggregation techniques to speeding up query processing in array databases. In particular, we focus on enhancing aggregate computation in GIS and remote-sensing imaging applications. However, the results can be generalized to other domains as well.


Relevant and complementary questions for this thesis are:

1. What factors influence the decision to select an aggregate query for pre-aggregation?

2. What formalisms are necessary to establish an efficient and scalable pre-aggregation framework for array databases?

3. What types of constraints are typically considered by existing OLAP pre-aggregation algorithms, and how do they affect performance?

The thesis objectives are outlined as follows:

1. To illustrate the necessity of improving aggregate computation in array databases for GIS and remote-sensing imaging applications.

2. To achieve a solid understanding of OLAP pre-aggregation algorithms and of the architectural issues in manipulating large amounts of data.

3. To formally describe fundamental operations in GIS and remote-sensing imaging applications and identify those that involve data summarization.

4. To design a theoretical pre-aggregation framework for array databases supporting GIS and remote-sensing imaging applications.

5. To design query selection and query rewriting algorithms using existing OLAP/data warehousing pre-aggregation techniques.

6. To implement the algorithms in an array database management system.

7. To conduct a performance study of the developed algorithms.

The methodological approach employed in this thesis is centered on a three-stage design methodology:

• Identification of fundamental operations in GIS and remote-sensing imaging applications. A literature review helped us identify fundamental operations in GIS that require data summarization. The literature included different classification schemes, international standards, and best practices.

• Design and implementation. Existing OLAP pre-aggregation techniques are used as a basis for the construction of a pre-aggregation framework for array databases. Storage space constraints are considered in the design of the query selection algorithms. The algorithms were developed in the C++ programming language and tested in the RasDaMan multidimensional array database management system.

• Evaluation. The performance of the developed algorithms is measured on 2D, 3D, and 4D datasets. For scaling operations on 2D datasets, we compare our results against those of the traditional image pyramids approach.

11


12 1. Introduction and Problem Statement<br />

1.1 Overview of Thesis and Contributions

This section provides an overview of the following chapters.

Chapter 2 presents a comparative study of array databases and OLAP, devoting special attention to data structures and operations. It starts with a discussion of existing approaches for data modeling, storage management, and query processing in both array databases and the data warehousing/OLAP environment. Existing pre-aggregation and related techniques are also discussed in both application domains. From this study, one can observe similarities between the two application domains with regard to data structures and operations. This suggests that array databases can benefit from pre-aggregation schemes to accelerate the computation of aggregate queries.

Chapter 3 describes fundamental operations in GIS and remote-sensing imaging applications. The selection of operations is based on a thorough review of existing surveys of GIS operations and international standards, and on feedback from GIS practitioners. To better understand the structural characteristics of common queries in array databases, these operations are expressed in a proven array model. This allowed us to identify a set of operations requiring data summarization (aggregation), and thus the candidate operations to be supported by pre-aggregation techniques.

Chapter 4 deals with the computation of aggregate queries in array databases using pre-aggregated data. The proposed pre-aggregation framework distinguishes different types of pre-aggregates and shows that such a distinction is useful for finding an optimal solution that reduces the CPU cost required for the computation of aggregate queries. A cost model is used to assess the benefit of using pre-aggregated data for computing aggregate queries. Measurements on real-life raster image datasets show that the computation of aggregate queries is always faster with our algorithms than with traditional methods.

Chapter 5 considers the problem of offering pre-aggregation support for non-standard aggregate operations in GIS and remote-sensing imaging applications. A discussion is presented of the issues found while attempting to provide pre-aggregation support for all non-standard aggregate operations, as well as the motivation for focusing on scaling operations. The framework and cost model presented in Chapter 4 are adapted to support scaling operations. Experiments covering 2D, 3D, and 4D datasets show how our pre-aggregation approach not only generalizes the most common approach for 2D, but also helps reduce computation times for 2D, 3D, and 4D datasets.

Chapter 6 presents a summary of our findings and outlines future lines of research.

1.2 Publications Related to this Thesis

A number of papers relating to the work described in this thesis have been published. Doctoral workshops provided a platform to discuss the feasibility of the proposed research and an opportunity to receive feedback from experts in computer science [6] and the GIS scientific community [5]. Participation in those workshops led to a refinement of the research objectives outlined in Chapter 1. The study and algebraic modeling of geo-raster operations reported in Chapter 3 is presented in [7, 8]. The pre-aggregation framework described in Chapter 4 is presented in [9]. Finally, findings on the query selection problem addressed in Chapter 5 have been accepted for publication in [10].




Chapter 2

Background and Related Work

This chapter describes existing database technology for two environments: GIS/remote-sensing imaging and data warehousing/OLAP. Our investigation shows that conceptual data models and operations are similar in both application domains. This suggests that array database technology can be substantially enhanced by adopting a pre-aggregation scheme built on existing OLAP technology.

2.1 Array Databases

Multidimensional data analysis has recently taken the spotlight in the context of scientific applications. A fundamental demand from science users is extremely fast response times for multidimensional queries. While most scientific users can use relational tables, and have been forced to do so by many commercial DBMS products, only a few users find tables to be a natural data model that closely matches their data. Furthermore, few users are satisfied with SQL as the interface language [30]. In contrast, arrays appear to be a natural data model for a significant subset of science users, specifically in astronomy, oceanography, and remote-sensing applications. Moreover, a table with a primary key is merely a 1D array. Hence, an array data model can subsume the needs of users who are satisfied with tables.

Next we review the existing database technology supporting multidimensional arrays in scientific applications: 1D sensor time series, 2D satellite imagery, 3D image time series, and 4D atmospheric data.

2.1.1 Basic Notion of Arrays

Several approaches have been proposed toward the formalization of arrays and array query languages. The underlying methods of formalization differ, and the discussion is still open. However, the following notion of arrays is quite common [79]:

An array is a set of cells of a fixed data type $T$, with a fixed cell size. Each cell corresponds to one element in the multidimensional domain of the array. The domain $D$ of an array is a $d$-dimensional subinterval of a discrete coordinate set $S = S_1 \times \dots \times S_d$, where each $S_i$, $i = 1, \dots, d$, is a finite, totally ordered discrete set and $d$ is the dimensionality of the array.

The definition domain of an array is expressed as a multidimensional interval by its lower and upper bounds, $l_i$ and $u_i$ respectively, along each dimension $i$, denoted as $D = [l_1 : u_1; \dots; l_d : u_d]$, where $l_i < u_i$, $i = 1, \dots, d$, and $l_i, u_i \in S_i$. Figure 2.1(a) shows the constituents of a sample 3D array.

Figure 2.1. 3D Array
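To make the notation concrete, the following minimal C++ sketch (hypothetical code, not taken from the thesis) models a d-dimensional domain by its per-dimension bounds, computes the cell count, and linearizes a cell coordinate in row-major order:

    #include <cstddef>
    #include <vector>

    // A d-dimensional domain D = [l1:u1; ...; ld:ud].
    struct Domain {
        std::vector<long> lo, hi;  // per-dimension bounds, lo[i] <= hi[i]

        // Number of cells: the product of the extents (u_i - l_i + 1).
        std::size_t cellCount() const {
            std::size_t n = 1;
            for (std::size_t i = 0; i < lo.size(); ++i)
                n *= static_cast<std::size_t>(hi[i] - lo[i] + 1);
            return n;
        }

        // Row-major offset of coordinate p within D.
        std::size_t offset(const std::vector<long>& p) const {
            std::size_t off = 0;
            for (std::size_t i = 0; i < lo.size(); ++i)
                off = off * static_cast<std::size_t>(hi[i] - lo[i] + 1)
                    + static_cast<std::size_t>(p[i] - lo[i]);
            return off;
        }
    };

For example, the 3D domain [0:2; 0:3; 0:4] has 3 * 4 * 5 = 60 cells.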

The following subsections provide a brief summary of the main contributions in data modeling and query languages that support array data in GIS and remote-sensing imaging applications.

2.1.2 2D Data Models

A uniform representation and algebraic notation for manipulating image-based data structures, known as map algebra, was first advanced by Tomlin and Berry [56]. While not the first to describe this type of spatial data processing, Tomlin and Berry put forward the methodological basis for the organization of this form of geographical data analysis. Map algebra treats individual rasters, or array layers, as members of algebraic equations. Map algebra functions are grouped into the following categories (a short code sketch after this list contrasts the first two):

• Local functions create outputs in which output cell values are determined on a cell-by-cell basis, without regard for the values of neighboring cells.

• Focal functions create outputs in which the value of the output grid is affected by the values of neighboring cells. Low-pass filters are commonly used to smooth out data.

• Zonal functions create outputs in which the values of output cells are determined in part by the spatial association between cells in the input grids.

• Global functions compute an output raster where the value of each output cell is potentially a function of all of the input cell values.
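To make the local/focal distinction concrete, here is a minimal, hypothetical C++ sketch (the grid representation and border handling are assumptions) of a local function (cell-wise scaling) and a focal function (a 3x3 mean, a simple low-pass filter):

    #include <vector>

    using Grid = std::vector<std::vector<double>>;

    // Local function: each output cell depends only on the corresponding input cell.
    Grid localScale(const Grid& in, double factor) {
        Grid out = in;
        for (auto& row : out)
            for (double& v : row) v *= factor;
        return out;
    }

    // Focal function: each output cell is the mean of its 3x3 neighborhood
    // (cells on the border average over the neighbors that exist).
    Grid focalMean3x3(const Grid& in) {
        const int rows = static_cast<int>(in.size());
        const int cols = static_cast<int>(in[0].size());
        Grid out(rows, std::vector<double>(cols, 0.0));
        for (int r = 0; r < rows; ++r)
            for (int c = 0; c < cols; ++c) {
                double sum = 0.0;
                int n = 0;
                for (int dr = -1; dr <= 1; ++dr)
                    for (int dc = -1; dc <= 1; ++dc) {
                        const int rr = r + dr, cc = c + dc;
                        if (rr >= 0 && rr < rows && cc >= 0 && cc < cols) {
                            sum += in[rr][cc];
                            ++n;
                        }
                    }
                out[r][c] = sum / n;
            }
        return out;
    }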


Figure 2.2 shows a graphical classification of grid functions according to map algebra.

Figure 2.2. Map Algebra Functions

Map algebra is primarily oriented toward 2D static data. Each layer is associated with a particular moment or period of time, and analytical operations are intended to deal with spatial relationships. In its original form, map algebra was never intended to handle spatial data with a temporal component.

2.1.3 Multidimensional Data Models

AQL

Libkin et al. [63] presented an array data model called AQL that embeds array support into a specific nested relational calculus and treats arrays as functions rather than as collection types. The AQL data model combines complex objects such as sets, bags, and lists with multidimensional arrays. To express complex object values, the core calculus on which AQL is based has been extended with concepts such as comprehensions, pattern matching, and block structures that strengthen the expressive power of the language. Still, AQL does not provide a declarative mechanism to define the order in which queries manipulate data.

Array Manipulation Language (AML)

AML is a query language for multidimensional array data [80]. The model is aimed at applications in image databases, particularly remote sensing, but it is customizable to support a wide variety of application domains. An interesting characteristic of this language is its use of bit patterns, an array indexing mechanism that allows for a more powerful access structure to arrays. AML's algebra consists of three operators that enable the manipulation of arrays: subsample, merge, and apply. Each operator takes one or more arrays as arguments and produces an array as its result. Subsample is a unary operator that eliminates cells from an array by cutting out slices. Merge is a binary operator that combines two arrays defined over the same domain. The apply operator applies a user-defined function to an array, thereby producing a new array. All AML operators take bit patterns as parameters.
new array. All AML opera<strong>to</strong>rs take bit patterns as parameters.


Data and Query Model for Stream Geo-Raster Imagery

Gertz et al. [67] introduced a data and query model for managing and querying streams of remote-sensing imagery. The data model considers the spatio-temporal and geo-referenced nature of satellite imagery. Three classes of operators allow the formulation of queries. A stream restriction operator acts as a filter that selects points from a stream that satisfy a given condition on the spatial, temporal, or spatio-temporal component of the image. The stream transform operator maps the points or values associated with a stream to a new point or value set; this class of operators is useful for processing on a point-by-point basis. The third class of operators, called stream compositions, allows the combination of image data from different spectral bands. To this end, each stream is considered to represent a single spectral band. However, since the primary objective of the authors was to stream geo-raster image data, they put less emphasis on post-processing satellite images. Core operations such as Fourier transforms and edge detection are therefore not supported by their framework.

Array Algebra

Baumann [75] introduced a formal array model called Array Algebra that supports the description and manipulation of multidimensional array data types [76]. The simple algebra consists of three core operators: an array constructor, a general condenser for computing aggregations, and an index sorter. The expressive power of Array Algebra through these operators enables a wide range of signal processing, imaging, and statistical operations. Moreover, the termination of any well-formed query is guaranteed by limiting the expressive power to non-recursive operations. Array Algebra is described in more detail in Chapter 3.
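For orientation ahead of that chapter: the condenser can be read as folding a commutative, associative operation over an array's domain. As a sketch (our notation here, not the formal definition from [75]), the total sum of an array $a$ over its domain $D$ is

$$\operatorname{COND}_{+,\,x \in D}\; a[x] \;=\; \sum_{x \in D} a[x],$$

and replacing $+$ with $\max$ or $\min$ yields the corresponding aggregates.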

To date, Array Algebra is the most comprehensive and complete approach, supporting a variety of applications including sensor, image, and statistical data. Recently, a geo-raster service standard based on Array Algebra concepts has been issued by the Open Geospatial Consortium (OGC) [78]. A commercial and open-source implementation of Array Algebra is currently available to the scientific community.

2.1.4 Storage Management

At present, handling large image data stored in a database is usually carried out by adopting a tiling strategy [23]. An image is split into sub-images (tiles), as shown in Fig. 2.3. When a region of interest is requested by a given query operation, only the relevant tiles are accessed. This strategy results in significant I/O bandwidth savings. Tiles form the basic processing units for indexing and compression. Spatial indexing allows for the quick retrieval of the identifier and location of a required tile, while compression improves disk I/O bandwidth efficiency. The choice of tile size is crucial for efficiency: large tiles return much redundant data in response to a range query, while small tiles result in a bad compression ratio; in practice, tile sizes vary from 8 KB (very small) to 512 KB (very large) [23, 96]. A comprehensive approach to the storage of large amounts of data on tertiary storage media, considering tiling techniques in multidimensional database management systems, is presented in [23, 24, 25].

Figure 2.3. Image Tiling
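A minimal, hypothetical C++ sketch of the access pattern described above: given a fixed tile edge length and a requested region of interest, determine which tiles must be read (a 2D, axis-aligned tiling with non-negative coordinates is assumed):

    #include <cstddef>
    #include <utility>
    #include <vector>

    struct Region { long x0, y0, x1, y1; };  // inclusive bounds of the query box

    // Return the (row, col) indices of all tiles of size ts x ts that
    // intersect the region; only these tiles need to be fetched from disk.
    std::vector<std::pair<long, long>> tilesToRead(const Region& q, long ts) {
        std::vector<std::pair<long, long>> hits;
        for (long ty = q.y0 / ts; ty <= q.y1 / ts; ++ty)
            for (long tx = q.x0 / ts; tx <= q.x1 / ts; ++tx)
                hits.emplace_back(ty, tx);
        return hits;
    }

    // Example: with 512 x 512 tiles, the region x in [600, 1200], y in [600, 900]
    // touches tile row 1 and tile columns 1..2, i.e. two tiles instead of the image.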

A key factor influencing the effectiveness of a tiling scheme is compression. Raster data compression algorithms are the same as algorithms for the compression of other image data. However, remote-sensing images usually have much higher resolution, are multi-spectral, and have significantly larger volumes than natural images. To effectively compress raster data in GIS environments, emphasis must be placed on the management of schemas for dealing with large volumes of remote-sensing imagery, and on the integration of various types of datasets, such as vector and multidimensional datasets [3, 87].

Dehmel [3] proposed a comprehensive framework for the compression of multidimensional arrays based on different model layers, including various kinds of predictors and a generic wavelet engine for lossy compression with arbitrary quality levels. In particular, the author introduces concepts such as channel separation, which compresses the values of each channel separately, and predictors, which calculate approximate values for some cells and express those cell values relative to the approximations. Further, the proposed method applies wavelets to transform the channels individually into multi-resolution representations with coarse approximations and various levels of detail information. This led to a wavelet engine architecture consisting of three major components (transformation, quantization, and compression) that improves compression rates in array databases considerably.

2.1.5 2D Pre-Aggregation

Aggregate operations in GIS and remote-sensing applications have been shown to be computationally expensive due to the size of the data and the complexity of the operations [8]. One such operation is zooming (scaling), which is carried out by interpolating the values of the original dataset to downsample it to a lower resolution. This is particularly necessary in web-based raster applications, where limitations such as bandwidth and other resources prevent efficient processing of the original raster datasets. For smooth interactive panning, browsers load the image in tiles and in quantities larger than actually displayed. Zooming far out results in large scale factors, meaning that large amounts of data must be moved to deliver minimal results.


Current database technology for GIS and remote-sensing imaging applications employs multi-scale image pyramids to improve the performance of scaling operations on 2D raster images [51, 70, 82]. The image pyramid technique consists of resampling the original dataset to create a number of copies of it, each resampled at a coarser resolution (Fig. 2.4). The pyramid consists of a finite number of levels that differ in scale by a fixed step factor; these levels are much smaller in size than the original dataset but adequate for visualization at a lower scale (zoom ratio). Common practice is to construct pyramids in scale levels of a power of 2, yielding scale factors 2, 4, 8, 16, 32, 64, 128, 256, and 512. When more detailed data are needed, or when it becomes necessary to access the original image, better access speed can be achieved if the original data are cut into smaller pieces, so that only a restricted area of the image, instead of the entire image, is accessed.

Figure 2.4. Image Pyramids

Pyramid Construction

The construction of pyramid layers requires resampling of the original image cell values. Resampling interpolates cell values, or otherwise assigns values, to the cells of a new raster object. It results in a raster with larger or smaller cells and different dimensions. Resampling changes the scale of an input raster and is used in conjunction with geometric transformation models that change the internal geometry of a raster. The following are the most popular interpolation methods [34] (a code sketch of the bilinear case follows the list):

• Nearest neighbor is the resampling technique of choice for discrete (categorical) data, since it does not alter the values of the input cells [64]. After the cell's center on the output raster dataset is located on the input raster, nearest neighbor assignment determines the location of the closest cell center on the input raster and assigns the value of that cell to the cell on the output raster.

• Linear interpolation is used to interpolate along value curves. It assumes that cell values vary in proportion to distance along a value segment: $v = a + bx$. Linear interpolation may be used to interpolate feature attribute values along a line segment connecting any two point-value pairs.

• Bilinear interpolation is used to interpolate cell values at direct positions within a quadrilateral grid. It assumes that feature attribute values vary as a bilinear function of position within the grid cell: $v = a + bx + cy + dxy$. Given a direct position $p$ in a grid cell whose vertices are $V$, $V + V_1$, $V + V_2$, and $V + V_1 + V_2$, where $V_1$ and $V_2$ are offset vectors of the grid, and with cell values $v_1$, $v_2$, $v_3$, and $v_4$ at the respective vertices, there are unique numbers $i$ and $j$, with $0 \le i \le 1$ and $0 \le j \le 1$, such that $p = V + iV_1 + jV_2$. The cell value at $p$ is $v = (1-i)(1-j)\,v_1 + i(1-j)\,v_2 + j(1-i)\,v_3 + ij\,v_4$. Since the values of output cells are calculated according to the relative positions and values of input cells, bilinear interpolation is preferred for data where the location relative to a known point or phenomenon determines the value assigned to a cell (that is, continuous surfaces). Elevation, slope, intensity of noise from an airport, and salinity of groundwater near an estuary are phenomena represented as continuous surfaces and are most appropriately resampled using bilinear interpolation.

• Quadratic interpolation is used to interpolate cell values along curves. It assumes that cell values vary as a quadratic function of distance along a value segment: $v = a + bx + cx^2$, where $a$ is the value of a cell at the start of a value segment and $v$ is the value of a cell at distance $x$ along the curve from the start. Three point-value pairs are needed to provide control values for calculating the coefficients of the function.

• Cubic interpolation is used to interpolate cell values along curves. It assumes that cell values vary as a cubic function of distance along a value segment: $v = a + bx + cx^2 + dx^3$, where $a$ is the value of a cell at the start of a value segment and $v$ is the value of a cell at distance $x$ along the curve from the start. Four point-value pairs are needed to provide control values for calculating the coefficients of the function. Cubic convolution has a tendency to sharpen the edges of the data more than bilinear interpolation, since more cells are involved in the calculation of the output values.
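Following the bilinear formula above, here is a minimal hypothetical C++ sketch that interpolates one output value from the four surrounding input cells (unit grid offsets are assumed, so $i$ and $j$ are simply the fractional parts of the position; bounds checks are omitted):

    #include <cmath>
    #include <cstddef>
    #include <vector>

    using Grid = std::vector<std::vector<double>>;

    // Bilinear interpolation at fractional position (x, y) of the input grid:
    // v = (1-i)(1-j)v1 + i(1-j)v2 + j(1-i)v3 + ij*v4, where (i, j) are the
    // fractional offsets within the cell anchored at (floor(x), floor(y)).
    double bilinear(const Grid& in, double x, double y) {
        const std::size_t x0 = static_cast<std::size_t>(std::floor(x));
        const std::size_t y0 = static_cast<std::size_t>(std::floor(y));
        const double i = x - x0, j = y - y0;
        const double v1 = in[y0][x0],     v2 = in[y0][x0 + 1];
        const double v3 = in[y0 + 1][x0], v4 = in[y0 + 1][x0 + 1];
        return (1 - i) * (1 - j) * v1 + i * (1 - j) * v2
             + j * (1 - i) * v3 + i * j * v4;
    }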

Pyramid Evaluation

During the evaluation of a scaling operation with a target scale factor $s$, the pyramid level with the largest scale factor $s'$ such that $s' < s$ is determined. This level is loaded, and an adjustment is then made by scaling the resulting image by a factor of $s/s'$. If, for example, scaling by $s = 11$ is required, then pyramid level 3 with scale factor $s' = 8$ is chosen, requiring a residual scaling of $11/8 = 1.375$ and thereby touching only $1/64$ of what would be read without a pyramid.
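A minimal hypothetical C++ sketch of this selection rule, assuming the power-of-two pyramid levels described above:

    #include <cmath>

    struct PyramidChoice { int level; double residual; };

    // Pick the pyramid level whose scale factor s' = 2^level is the largest
    // one below the target factor s, and report the residual scaling s / s'.
    PyramidChoice chooseLevel(double s, int maxLevel) {
        int level = 0;
        while (level + 1 <= maxLevel && std::pow(2.0, level + 1) < s) ++level;
        const double sPrime = std::pow(2.0, level);
        return { level, s / sPrime };
    }

    // chooseLevel(11.0, 9) yields level 3 (s' = 8) and residual 11/8 = 1.375.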

The computational complexity of a scaling operation depends on the chosen resampling method. For example, nearest neighbor resampling considers the closest cell center of the input raster and assigns the value of that cell to the corresponding cell of the output raster. Other resampling methods, such as bilinear and cubic interpolation, consider a subset of cells to calculate each of the cell values in the output raster. Fig. 2.5 shows three common options for interpolating output cell values; the bold outline (center image) indicates the current target cell for which a value is being interpolated.

Figure 2.5. Nearest Neighbor, Bilinear and Cubic Interpolation Methods: (a) portion of original raster; (b) portion of output raster; (c) input cells used by common resampling methods

A characteristic of the pyramid approach is that it increases the size of a raster dataset by approximately 33 percent, because the additional reduced-resolution representations are stored in the system together with the original dataset (for halving steps, the overhead is the geometric series 1/4 + 1/16 + 1/64 + ... = 1/3). This cost is offset, however, by the improved response time obtained in return. The choice of resampling method for constructing the pyramid is influenced by the data characteristics and the type of analysis performed on the data. For example, the visual appearance of remote sensing imagery is best preserved using nearest-neighbor resampling, whereas scientific interpretation may require cubic interpolation. Rasters representing categorical data, e.g., land use data, do not allow interpolation, since it is important that original data values remain unchanged; hence only nearest-neighbor resampling can be applied [64]. The reason categorical data should not be interpolated is that intermediate values cannot be derived with meaningful results. For example, soil type data cannot be interpolated, since a soil type 14 and a soil type 15 cannot sensibly be averaged to derive a soil type 14.5. Creating pyramids for several different resampling methods is not efficient due to the additional resources required for storage and maintenance. Thus, the hard-wired resampling approach poses significant flexibility limitations to users when analytic objectives diverge.

Fast retrieval of raster image datasets has also been investigated in distributed database systems. Kitamoto [14] proposed a caching mechanism that allows two-dimensional satellite imagery to be cached at minimum resolution to provide a coarse view of the images in distributed satellite image databases. The cache management problem is treated as a knapsack problem [14], where the relevance and size of the data determine whether the data will be cached or not. Additionally, access patterns influence the relevance of the data. The frequency of requests for a given image and its resulting popularity rank are included in the strategy for caching selection. Prediction of user access patterns is not considered, however.

More recently, methods exploiting the capabilities of modern graphics hardware have been applied to the organization and processing of large amounts of satellite imagery. For example, Boettger et al. presented a method based on the concepts of perspective and complex logarithm [90] for visualization and navigation of satellite and aerial imagery [50]. Datasets are decomposed into tiles of different sizes and levels of resolution according to a pre-defined area of interest. The tiles closer to the center of interest have higher resolution, whereas low-resolution tiles are created for parts further away. The resulting tiles are indexed and cached into the memory of the graphics hardware, enabling quick access to the area of interest with the best available resolution. When the center of interest changes, tiles not yet available in graphics memory are loaded. Based on the assumption that the graphics memory offers more space than needed, the cache contains not only the tiles that conform to the area of interest, but also those that presumably will be needed in the future.

2.1.6 Pre-Aggregation Beyond 2D

Geographic phenomena can be examined at different granularities, including different spatial perspectives and temporal views. Earth remote sensing imagery can be treated as time-series data to study and track changes over time. For example, a user looking at changes in vegetation patterns over a certain region during the past 10 years can see their effect on the regional maps over that time period. Fig. 2.6 shows various instances of scaling operations on 3D image time-series. Figure 2.6(a) shows the original dataset, which consists of two spatial dimensions (dim 1, dim 2) and one temporal dimension (dim 3). Figure 2.6(b) shows the original dataset scaled down along the two spatial dimensions. Figure 2.6(c) shows a scaling operation along the time dimension of the original dataset. Figure 2.6(d) shows the original dataset scaled down in both the spatial and temporal dimensions.

Shifts in temporal detail have been studied in various application domains [18, 22, 43]. At the time of this writing, there is little support for zooming with respect to time in GIS technology: the focus has been set on studying such alterations with respect to the geometric (vector) properties of objects [54, 58, 59].

Datasets in environmental observation and climate modeling are often defined over a 4D spatio-temporal space of the form (x, y, z, t), possibly extended with topology relationships. Scaling operations are also critical for these kinds of applications due to the size and dimensionality of the data. Extremely large volumes of data are generated during climate simulations; while only one part might be needed for a specific data analysis, huge data volumes are moved. This is particularly true for time-series data analysis. At the time of this writing, however, 4D scaling operations are not supported in GIS and remote-sensing imaging applications.


Figure 2.6. 3D Scaling Operations on Time-Series Imagery Datasets: (a) 3D dataset; (b) scaled down along dim1 and dim2 by a factor of 2; (c) scaled down along dim3 by a factor of 4; (d) scaled down along all dimensions by a factor of 2



2.1.7 Summary

Array database theory is gradually entering its consolidation phase. The notion of arrays as functions mapping points of some hypercube-shaped domain to values of some range set is commonly accepted. Two main modeling paradigms are used: calculus and algebra. Multidimensional data models embed arrays into the relational world, either by providing conceptual stubs like Array Algebra, or by adding relational capabilities explicitly, such as AQL and RAM. Notably, aggregate query processing plays a critical role given the large volumes of the arrays. Our study shows that pre-aggregation techniques focus only on 2D datasets, and that support is limited to one particular operation: scaling. We distinguish the pyramid approach as the most popular method for speeding up scaling operations on 2D datasets, despite its known limitations such as hard-wired interpolation and lack of support for datasets of higher dimensions. Advances in graphics hardware are enabling quicker and more accurate visualization and navigation capabilities for raster imagery. However, little work has been reported on how array database technology is progressively exploiting these hardware advances. A critical gap with respect to pre-aggregation is the lack of support for aggregate operations other than 2D scaling.

2.2 On-Line Analytical Processing (OLAP)

Data warehousing/OLAP is an application domain where complex multidimensional aggregates on large databases have been studied intensively. Typically, a data warehouse collects business data from one or multiple sources so that the desired financial, marketing, and business analyses can be performed. These kinds of analyses can detect trends and anomalies, make projections, and support business decisions [41]. When such analysis predominantly involves aggregate queries, it is called on-line analytical processing, or OLAP [38, 39]. To understand the mechanism of pre-computation, the following subsections review different approaches to structuring multidimensional data, storage mechanisms, and operations in OLAP.

2.2.1 OLAP Data Model

The multidimensional OLAP model begins with the observation that the factors that influence decision-making processes are related to enterprise-specific facts, such as sales, shipments, hospital admissions, surgeries, and so on [68]. Instances of a fact correspond to events that occur: every sale or shipment carried out is an event. Each fact is described by the values of a set of relevant measures providing quantitative descriptions of events; e.g., sales receipts, amounts shipped, hospital admission costs, and surgery times are all measures.

In <strong>OLAP</strong>, information is viewed conceptually as cubes that consist of descriptive<br />

categories (dimensions) and quantitative values (measures) [26, 81, 69, 83]. In the scientific<br />

literature, measures are at times called variables, metrics, properties, attributes,<br />

or indica<strong>to</strong>rs. Figure 2.7 illustrates a 3D <strong>OLAP</strong> data cube where business events


26 2. Background and Related Work<br />

(facts) are mapped at the intersection of a specific combination of dimensions.<br />

Different attributes along each dimension are often organized in hierarchical structures that determine the different levels at which data can be further analyzed [26]. For example, within the time dimension, one may have levels composed of years, months, and days. Similarly, within the geography dimension, one may have levels such as country, region, state/province, or city. Hierarchical structures are used to infer summarization (aggregation), that is, whether an aggregate view (query) defined for some category can be correctly derived from a set of precomputed views defined for other categories.

Figure 2.7. OLAP Data Cube

2.2.2 <strong>OLAP</strong> Operations<br />

<strong>OLAP</strong> includes a set of operations for manipulation of dimensional data organized<br />

in multiple levels of abstraction. Basic <strong>OLAP</strong> operations are roll-up, drill-down, slice,<br />

dice and pivot [44]. A roll-up (aggregation) operation computes higher aggregations<br />

from lower aggregations or base facts according <strong>to</strong> their hierarchies, whereas drilldown<br />

(disaggregation) is an analytic technique whereby the user navigates among<br />

levels of data ranging from most summarized/aggregated, <strong>to</strong> most detailed. Typical<br />

<strong>OLAP</strong> aggregate functions include average, maximum, minimum, count, and sum.<br />

Drilling paths may be defined by the hierarchies within dimensions or other relationships<br />

dynamic within or between dimensions. A slice consists of the selection of a<br />

smaller data cube or even the reduction of a multidimensional data cube <strong>to</strong> fewer dimensions<br />

by a point restriction in some dimension. The dice operation works similarly<br />

<strong>to</strong> the slice except that it performs a selection on two or more dimensions. Figure 2.8<br />

provides a graphical description of these operations.<br />
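As a toy illustration of roll-up and slice outside any particular OLAP product, consider the following Python/pandas sketch (the fact table and its columns are invented for this example):

    import pandas as pd

    # Fact table: one row per sales event, with dimension attributes.
    facts = pd.DataFrame({
        "year":    [2009, 2009, 2010, 2010],
        "month":   [1, 2, 1, 2],
        "product": ["A", "A", "B", "B"],
        "sales":   [100, 150, 200, 250],
    })

    # Roll-up: aggregate monthly facts up to the year level.
    rollup = facts.groupby(["year", "product"])["sales"].sum()

    # Slice: restrict one dimension to a single value.
    slice_2010 = facts[facts["year"] == 2010]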

2.2.3 <strong>OLAP</strong> Architectures<br />

Figure 2.9 shows different approaches for the implementation of <strong>OLAP</strong> functionalities:<br />

Multidimensional <strong>OLAP</strong> (M<strong>OLAP</strong>), Relational <strong>OLAP</strong> (R<strong>OLAP</strong>), Hybrid <strong>OLAP</strong>


2.2 On-Line Analytical Processing (<strong>OLAP</strong>) 27<br />

Figure 2.8. Typical <strong>OLAP</strong> Cube Operations<br />

(H<strong>OLAP</strong>). These approaches offer a common view in the form of data cubes, which<br />

are independent of how the data is s<strong>to</strong>red.<br />

Figure 2.9. <strong>OLAP</strong> Approaches: M<strong>OLAP</strong>, R<strong>OLAP</strong>, and H<strong>OLAP</strong><br />

M<strong>OLAP</strong><br />

M<strong>OLAP</strong> maintains data in a multi-dimensional matrix based on a non-relational specialized<br />

s<strong>to</strong>rage structure [37], see Fig. 2.10(a). While building the s<strong>to</strong>rage structure,<br />

selected aggregations associated with all possible roll-ups are precomputed and s<strong>to</strong>red<br />

[92]. Thus, roll-up and drill-down operations are executed in interactive time. Products<br />

such as Oracle Essbase, IBM Cognos Powerplay, and open-source Palo have<br />

adopted this approach.<br />

A M<strong>OLAP</strong> system is based on an ad-hoc logical model that directly represents<br />

multidimensional data and its applicable operations. The underlying multidimensional<br />

database physically s<strong>to</strong>res data as arrays and access <strong>to</strong> it is positional [68]. Grid-files<br />

[53, 55], R*-trees [71] and UB-trees [84] are among the techniques used for that<br />

purpose.<br />

The main advantage of this approach is that it contains the pre-computed aggregate<br />

values that offer a very compact and efficient way <strong>to</strong> retrieve answers for specific


28 2. Background and Related Work<br />

aggregate queries [68]. One difficulty that M<strong>OLAP</strong> poses, however, pertains <strong>to</strong> the<br />

sparseness of the data. Sparseness means that many events did not take place and<br />

valuable processing time is taken by adding up zeros [91]. For example, a company<br />

may not sell every item every day in every s<strong>to</strong>re, so no values appear at the intersection<br />

where products are not sold in a particular region at a particular time. On the other<br />

hand, M<strong>OLAP</strong> can be much faster for applications where subsets of the data cube<br />

are dense [100]. Another limitation of this approach is that the computation of a<br />

cube requires a complex aggregate query across all data in a warehouse. Though<br />

it is possible <strong>to</strong> incrementally update cubes as new data arrives, it is impractical <strong>to</strong><br />

dynamically create new cubes <strong>to</strong> answer ad-hoc queries [68].<br />

Figure 2.10. M<strong>OLAP</strong> S<strong>to</strong>rage Scheme<br />

R<strong>OLAP</strong><br />

In R<strong>OLAP</strong>, underlying data is s<strong>to</strong>red in a relational database, see Fig. 2.11(a). The<br />

relational model, however, does not include concepts of dimension, measure, and hierarchy.<br />

Thus specific types of schemata must be created so the multidimensional<br />

model can be represented in terms of basic relational elements such as attributes, relations,<br />

and integrity constraints [68]. Such representations are done using a star schema<br />

data model, although the snowflake schema is also often adopted.<br />

R<strong>OLAP</strong> implementations can handle large amounts of data and leverage all functionalities<br />

of the relational database [72]. Disadvantages are that overall performance<br />

is slow and each R<strong>OLAP</strong> report represents an SQL query with the limitations of the<br />

genre. R<strong>OLAP</strong> vendors tried <strong>to</strong> mitigate this problem by including out-of-the-box<br />

complex functions in their product offering and providing users the capability of defining<br />

their own functions. Another problem with R<strong>OLAP</strong> implementations results from<br />

the performance hit caused by costly join operations between large tables [68]. To<br />

overcome this issue, fact tables in data-warehouses are usually de-normalized. Sub-


2.2 On-Line Analytical Processing (<strong>OLAP</strong>) 29<br />

stantial performance gains can be achieved through the materialization of derived tables<br />

(views) that s<strong>to</strong>re aggregate data used for typical <strong>OLAP</strong> queries.<br />

Figure 2.11. R<strong>OLAP</strong> S<strong>to</strong>rage Scheme<br />

Figure 2.12 shows the formulation of a typical query in both ROLAP and MOLAP. The query yields sales information for a specific product sold in a particular city by a given vendor. The queries are formulated according to the syntax of Oracle 10g. Note the considerable difference in length between the two formulations.

Figure 2.12. Typical Query as Expressed in ROLAP (a) and MOLAP (b) Systems


HOLAP

The intermediate architecture type, HOLAP, mixes the advantages offered by ROLAP and MOLAP. It takes from ROLAP implementations their level of standardization and their ability to manage large amounts of data, and from MOLAP systems their typical query speed. For summary-type information, HOLAP leverages cube technology, and for drilling down into details it uses the ROLAP model. In a HOLAP architecture, the largest amount of data should be stored in an RDBMS to avoid the problems caused by sparsity, and a multidimensional system should store only the information users most frequently need to access [68]. If that information is not enough to answer queries, then the system transparently accesses the data managed by the relational system.

2.2.4 <strong>OLAP</strong> <strong>Pre</strong>-<strong>Aggregation</strong><br />

<strong>OLAP</strong> systems require fast interactive multidimensional data analysis of aggregates.<br />

To fulfill this requirement, database systems frequently pre-compute aggregate<br />

views on some subset of dimensions and their corresponding hierarchies. Virtually<br />

all <strong>OLAP</strong> products resort <strong>to</strong> some degree of pre-computation of these aggregates,<br />

a process known as pre-aggregation. <strong>OLAP</strong> pre-aggregation techniques have<br />

proved <strong>to</strong> speed up aggregate queries by several orders of magnitude in business applications<br />

[31, 41]. A full pre-aggregation of all possible combinations of aggregate<br />

queries, however, is not considered feasible because it often exceeds the available s<strong>to</strong>rage<br />

limit and incurs a high maintenance cost. Therefore, modern <strong>OLAP</strong> systems adopt<br />

a partial pre-aggregation approach where only a set of aggregates are materialized so<br />

it can be re-used for efficiently computing other aggregates.<br />

<strong>Pre</strong>-aggregation techniques consist of three inter-related processes: view selection,<br />

query rewriting, and view maintenance. A view is a derived relation defined in terms<br />

of base relations. Views can be materialized by s<strong>to</strong>ring the tuples of a view in a<br />

database, as was first investigated in the 1980s [36]. Like a cache, a materialized<br />

view provides fast access <strong>to</strong> its data. However, a cache may get dirty whenever its<br />

underlying base relations are updated. The process of updating a materialized view in<br />

response <strong>to</strong> changes <strong>to</strong> its base data is called view maintenance [12].<br />

View Selection

Gupta et al. [13] proposed a framework that shows how to use materialized views to help answer aggregate queries. The framework provides a set of query rewriting rules to determine which materialized aggregate views can be employed to answer aggregate queries. An algorithm uses these rules to transform a query tree into an equivalent tree with some or all base relations replaced by materialized views. Thus, a query optimizer can choose the most efficient tree and provide the best query response time. Harinarayan et al. [92] investigated the issue of how to select views for materialization under storage space constraints so that the average query cost is minimal.

To meet changing user needs, several dynamic pre-aggregation approaches have been proposed. In principle, views may be either selected on demand or pre-selected using some prediction strategy. For applications where storage space is a constraint, replacement algorithms identify those views that can be replaced with new selections [60]. Kotidis et al. [97] introduced a dynamic view selection approach called Multidimensional Range Queries (MRQ), known as slice queries in OLAP, which uses an on-demand fetching strategy. Within this approach, the level of detail or granularity is a compromise between the materialization of many small, highly specific queries and the materialization of a few large queries, with incoming queries at each stage answered from the materialized ones. This approach, however, does not take user access patterns into account before making selections.

The first work <strong>to</strong> consider user access information <strong>to</strong> evaluate potential queries<br />

<strong>to</strong> be materialized is presented in [26], where the author introduced PROMISE, an<br />

approach that predicts the structure and value of the next query based on the current<br />

query. Yao et al. [99] proposed a different approach for the materialization of dynamic<br />

views. A set of batch queries were rewritten using certain canonical queries so the<br />

<strong>to</strong>tal cost of execution could be reduced using the intermediate results for answering<br />

queries appearing later in the batch. This approach requires all queries <strong>to</strong> be precisely<br />

known before hand, and though the approach might work well in a particular database<br />

scenario, it might not be useful in dynamic <strong>OLAP</strong>, where it is extremely difficult <strong>to</strong><br />

accurately predict the exact nature of future queries.<br />

View Maintenance

In most cases it is wasteful to maintain a view by recomputing it from scratch. Materialized views are therefore maintained using an incremental approach [11]: only the changes to be propagated to the materialized view are computed, using the changes of the source relations [1, 33, 89]. At present, view maintenance has been investigated along these four dimensions [11]:

• Information Dimension: Focuses on accessing the information required for view maintenance, such as base relations and the materialized view.

• Modification Dimension: Focuses on the kinds of modifications, e.g., insertions and deletions, that a view maintenance algorithm can handle.

• Language Dimension: Addresses the problems related to the language of the views supported by the view maintenance algorithm. That is, what is the language of the views that can be maintained by the view maintenance algorithm? How are views expressed? Does the algorithm allow duplicates?

• Instance Dimension: Considers the applicability of the algorithm to all or a specific set of instances of the database.

View maintenance cost is the sum of the costs of propagating each base relation change to the affected materialized views. The sum can be weighted, where each weight indicates the frequency of propagation of the changes of the associated source relation. When a base relation affects more than one materialized view, multiple maintenance expressions must be evaluated. Multi-query optimization techniques can be used to detect common sub-expressions among the maintenance expressions so that an efficient global evaluation plan for them can be achieved [61, 62].
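In symbols, this weighted sum can plausibly be written as (the notation here is ours, not taken from the cited works):

C_maint = Σ_{r ∈ R} f_r · Σ_{v ∈ V(r)} c(r, v)

where R is the set of base relations, f_r is the propagation frequency for relation r, V(r) is the set of materialized views affected by changes to r, and c(r, v) is the cost of propagating a change of r to view v.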

Numerous methods have been developed for materialized view maintenance in conventional database systems. Zhuge et al. [101] introduced the Eager Compensating Algorithm (ECA), based on previous incremental view maintenance algorithms and on compensating queries used to eliminate anomalies. In [102], the authors formalize the task of keeping multiple views mutually consistent as the multiple view consistency problem. Further research by the same authors [102, 103] considers data warehouse views defined on base tables located in different data sources, i.e., if a view involves n base tables, then n data sources are also involved.
then n data sources are also involved.<br />

A common characteristic of the early approaches to view maintenance is the considerable need for accessing base relations, which in most cases results in performance degradation. Improving the efficiency of view maintenance techniques has been a topic of active research in the database community [15, 65, 85, 98].

Spatial <strong>OLAP</strong> (S<strong>OLAP</strong>)<br />

The multidimensional approach used by data warehouses and <strong>OLAP</strong> does not support<br />

array data types or spatial data types such as point, lines, or polygons. Following<br />

the development trends of data warehouse and data mining techniques, Stefanovic et<br />

al. [52] proposed the construction of a spatial data warehouse <strong>to</strong> enable on-line data<br />

analysis in spatial-information reposi<strong>to</strong>ries. The authors used a star/snowflake model<br />

<strong>to</strong> build a spatial data cube consisting of both spatial and non-spatial dimensions and<br />

measures: the data cube shown in Fig. 2.13 consists of one spatial dimension (region)<br />

and three non-spatial dimensions (precipitation, temperature, and time).<br />

Figure 2.13. Star Model of a Spatial Warehouse<br />

Current research in spatial data management focuses on querying spatial data, particularly on improving aggregate query performance [57] for spatial-vector data structures. Alas, little attention has been given to spatial-raster data [42, 73, 86]. Support for spatial-raster data typically consists of creating a spatial-raster cube from information in the metadata file (such as size, level, width, height, date of creation, format, and location) [28, 94].

Vega et al. [40] presented a model to analyze and compare existing techniques for the evaluation of aggregate queries on spatial, temporal, and spatio-temporal data. The study shows that existing aggregate computation techniques rely on some form of pre-aggregation and that support is restricted to distributive aggregate functions such as COUNT, SUM, and MAX. Additionally, the authors identify several important needs concerning aggregate computation. First, they discuss the need to develop further and more substantial techniques to support holistic aggregate functions, e.g., MEDIAN and RANK, and to better support selective predicates. The second observation pertains to the lack of support for queries that need to be efficiently evaluated at every granule in time. Existing aggregate computation techniques focus only on spatial objects such as lines, points, and polygons, but do not consider aggregate computation on data grid (array) structures.

2.3 Discussion

Query performance is a major concern underlying the design of databases in both business and remote-sensing imaging applications. While there are some valuable research results in the realm of pre-aggregation techniques to support query processing in business and statistical applications, little has been done in the field of array databases.

The question therefore arises: what distinguishes array data from traditional data types, such that it cannot be fully supported by relational databases and thus take advantage of advanced technologies such as OLAP? OLAP, from its very conception, was designed to assist the decision-making process of business applications, where business perspectives such as products and/or stores represent the dimensions of the data cube. And while the different columns in a data cube are usually called dimensions, they generally cannot be considered a special extent of the entities modeled by the database. Instead, they are regarded as explicit attributes that characterize a particular entity. Some dimensions in a data cube (e.g., CustomerId) are defined over discrete domains which have no natural ordering among their values (customer 1000 cannot be considered close to customer 1001). In such cases, any ordering defined for the values in one of these columns is arbitrary [40]. For this reason, existing OLAP solutions and related pre-aggregation techniques cannot be applied to multidimensional arrays, at least not in a straightforward manner.

Recently, however, a new trend in OLAP has gained considerable popularity due to its capability to support geo-spatial data. Spatial OLAP considers the case in which a data cube may have both spatial and non-spatial dimensions. However, spatial OLAP focuses mainly on spatial-vector data, and so far little support has been provided for spatial-raster data in terms of selective materialization for the optimization of aggregates. Support is limited to those operations that can be constructed from the metadata available for the raster, but does not extend to improving the computation of aggregate operations over the values of raster datasets.

At present, pre-aggregation support in array databases is limited. Only one comparatively simple pre-aggregation technique has been used, namely image pyramids. The limitation of this technique to two-dimensional datasets and hard-wired interpolation calls for the development of more flexible and efficient techniques.

From our study of data modeling, storage techniques, and operations in OLAP and remote sensing imaging applications, we have observed the following similarities:

• Array databases and OLAP systems typically employ multidimensional data models to organize their data.

• Both application domains handle large volumes of multidimensional data.

• Operations convey a high degree of similarity; for instance, a roll-up (aggregate) operation in OLAP, such as computing the weekly sales per product, is very similar to scaling a satellite image by a factor of seven along the X axis. Figure 2.14 illustrates this similarity.

Figure 2.14. Comparison of Roll-Up and Scaling Operations: (a) scaling operation; (b) roll-up operation


• Both application domains use pre-aggregation approaches to speed up query processing: OLAP pre-aggregation techniques support a wide range of aggregate operations and speed up query processing by several orders of magnitude (the last benchmark reported factors of up to 100 [29, 88]). Scaling of 2D datasets always uses the same scale factor on each dimension to maintain a coherent view, whereas for datasets of higher dimensionality the scale factors are independent. Scaling resembles a primitive form of pre-aggregation in comparison to existing OLAP pre-aggregation techniques.

• While data in OLAP applications is sparsely populated, remote sensing imagery is usually densely populated (100%). There are no guidelines stating when an OLAP data cube is considered sparse or dense; however, when a data cube contains 30 percent empty cells, it is usually treated with sparsity-handling techniques in most OLAP systems.

Furthermore, when compared to well-known OLAP pre-aggregation techniques, GIS image pyramids differ in several respects:

• Image pyramids are constrained to 2D imagery. To the best of our knowledge there is no generalization of pyramids to n-D.

• The x and y axes are always zoomed by the same scalar factor s in the 2D zoom vector (s, s). Image pyramids exploit this by offering pre-aggregates only along a scalar range; in this respect, image pyramids actually are 1D pre-aggregates.

• Several interpolation methods are used for resampling during scaling. Some techniques are standardized [48]; they include nearest-neighbor, bilinear, biquadratic, bicubic, and barycentric. The two scaling steps incurred with image pyramids (construction of the pyramid level and rest scaling) must be done using the same interpolation technique to achieve valid results. In OLAP, summation during roll-up corresponds to linear interpolation in imaging.

• Scale factors are continuous, as opposed to the discrete hierarchy levels in OLAP. It is, therefore, impossible to materialize all possible pre-aggregates.

Based on these observations, this thesis aims to systematically carry over results from OLAP to array databases and to provide pre-aggregation support not only for queries using basic aggregate functions, but also for more complex operations such as scaling. As a preliminary and fundamental step, it is necessary to have a clear understanding of the various operations performed on remote sensing imagery and to identify those that involve aggregation computation. The next chapter addresses this issue in more detail.




Chapter 3

Fundamental Geo-Raster Operations in GIS and Remote-Sensing Applications

This chapter describes a set of fundamental operations in GIS and remote-sensing imaging applications. For rigorous comparison and classification, these operations are discussed by means of a sound mathematical framework. The aim is to identify those operations requiring data summarization that may benefit from a pre-aggregation approach. To that end, we use Array Algebra as our modeling framework.

3.1 Array Algebra

The rationale behind the selection of Array Algebra as the modeling framework is grounded in the following observations:

• It is oriented towards multidimensional data in a variety of applications, including imaging.

• It provides the means to formulate a wide variety of operations on multidimensional arrays.

• There are commercial and open-source implementations of Array Algebra that show the soundness and maturity of the framework.

The expressive power of Array Algebra, the simplicity of its operators, and its successful implementation in both commercial and scientific applications make it suitable for our investigation.

Essentially, the algebra consists of three operators: an array constructor, a generalized aggregation, and a multi-dimensional sorter [75, 76]. Array Algebra is minimal in the sense that no subset of its operations exhibits the same expressive power. It is safe in evaluation: every formula can be evaluated in a finite number of steps. It is closed in its application: any resulting expression is either a scalar or an array.

Arrays are represented as functions mapping n-dimensional points from discrete Euclidean space to values. The spatial domain of an array is defined as a finite set of n-dimensional points in Euclidean space forming a hypercube with boundaries parallel to the coordinate system axes.

Let X ⊆ Z^d be a spatial domain and F a value set, i.e., a homogeneous algebra. Then, an F-valued d-dimensional array over the spatial domain X (a multi-dimensional array) is defined as:

a : X → F (i.e., a ∈ F^X),
a = {(x, a(x)) : x ∈ X, a(x) ∈ F}

The array elements a(x) are referred to as cells. The auxiliary function sdom(a) denotes the spatial domain of some array a.

3.1.1 Construc<strong>to</strong>r<br />

The MARRAY array construc<strong>to</strong>r allows arrays <strong>to</strong> be defined by indicating a spatial<br />

domain and an expression evaluated for each cell position of the array. An iteration<br />

variable bound <strong>to</strong> a spatial domain is available in the cell expression so that the cell<br />

value depends on its position. Let X be a spatial domain, F a value set, and v a free<br />

identifier. Let e v be an expression with result type F containing zero or more free occurrences<br />

of v as placeholder(s) for an expression with result type X. Then, an array<br />

over spatial domain X with base type F is constructed through:<br />

MARRAY X,v (e v ) = {(x, a(x)) : a(x) = e x , x ∈ X}<br />

A straightforward application of MARRAY is spatio-temporal sub-setting by simply<br />

changing its domain.<br />

Example: For some 2-D grey-scale image a, its cu<strong>to</strong>ut <strong>to</strong> domain [x0:x1,y0:y1] (assumed<br />

<strong>to</strong> lie inside the array) is given by:<br />

MARRAY [x0:x1,y0:y1],p (a[p])<br />

Similarly, trimming produces a cu<strong>to</strong>ut of an array of lower volume, but unchanged<br />

dimensionality, and section cuts out a hyperplane with reduced dimensionality.<br />

We can also change an array’s values by changing the e v expression. In the simplest<br />

case this expression takes the cell value and modifies it. The following expression adds<br />

the values in the cells of two raster images, regardless of their extent and dimension:<br />

a + b = MARRAY X,p (a[p] + b[p])<br />

If we allow the use of all operations known on the base algebra, i.e., on the pixel<br />

type, we immediately obtain a cohort of the following useful operations.



3.1.2 Condenser

The COND array condenser (aggregator) takes the values of an array's cells and combines them through some commutative and associative operation, thereby obtaining a scalar value. For some free identifier v, a spatial domain X = {x1, ..., xn} with xi ∈ Z^d consisting of n points, and e_{a,v} an expression of result type F containing occurrences of an array a and the identifier v, the condense of a by o is defined as:

COND_{o,X,v}(e_{a,v}) := ⨀_{x ∈ X} e_{a,x} = e_{a,x1} o ... o e_{a,xn}

Example: Let a be the image defined above, with sdom(a) = [1:m, 1:n]. The average over all pixel intensities in a is then given by:

COND_{+,sdom(a),p}(a[p]) / (m ∗ n) = ( Σ_{x ∈ [1:m,1:n]} a[x] ) / (m ∗ n)
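Continuing the sketch from Section 3.1.1, the condenser amounts to a fold over the cells (again purely illustrative; X and a are the domain and array defined in that sketch):

    from functools import reduce

    def cond(op, domain, cell_expr):
        # COND: combine cell_expr(x) for all x in the domain with a
        # commutative, associative operation op.
        return reduce(op, (cell_expr(x) for x in domain))

    total = cond(lambda u, v: u + v, X, lambda p: a[p])  # sum of all cells
    average = total / len(X)                             # divide by cardinality
    print(average)                                       # 2.0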

3.1.3 Sorter

The SORT array sorter proceeds along a selected dimension to reorder the corresponding hyperslices. The functional sort_s rearranges a given array along a specified dimension s without changing its value set or spatial domain. To that end, an order-generating function is provided that associates a sequence position with each (d−1)-dimensional hyperslice. Note that the function f_{s,a} has all degrees of freedom to assess any of a's cell values for determining the measure value of the hyperslice at hand: it can be a particular cell value in the current hyperslice, the average of all hyperslice values, or the value of one or more neighboring slices. Note that the sort operator includes the relational group-by.

The language is recursive in the array expression e_v and hence allows arbitrary nesting of expressions. In the sequel we use the abbreviations introduced above for nested expressions.

3.2 Geo-Raster Operations

This section presents a set of fundamental operations on geo-raster data. These operations have been selected based on an exhaustive literature review of classification schemes, international standards, and best practices [2, 19, 27, 32, 35, 45, 46, 47, 49]. By examining the Array Algebra operators involved in the computation of these operations, we identify those that require data summarization (aggregation) and therefore may benefit from pre-aggregation.

Queries were executed in a raster database management system (RasDaMan) and formulated according to the syntax of rasql, an SQL-based query language for multidimensional raster databases based on Array Algebra.

3.2.1 Mathematical Operations

The following groups of mathematical operators are distinguished: arithmetic, trigonometric, boolean, and relational. They operate at the cell level and can be applied to a single raster or to multiple rasters of numerical type and identical spatial domain. The basic arithmetic operators include addition (+), subtraction (-), multiplication (*), and division (/). Trigonometric functions perform trigonometric calculations on the values of an input raster: sine (sin), cosine (cos), tangent (tan), or their inverses (arcsin, arccos, arctan). Consider, for example, the following query:

Query 3.2.1. Consider an RGB (red, green, blue) raster image A. Extract the green component from the image, and reduce the contrast by a factor of 2.

With Array Algebra, the query can be computed as follows:

MARRAY_{sdom(A),i}(A.green[i]/2)

Results are shown in Fig. 3.1.

Figure 3.1. Reduction of Contrast in the Green Channel of an RGB Image: (a) original RGB image; (b) green component; (c) output raster
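In NumPy terms, the same query could be sketched as follows (assuming the RGB image is held as an H×W×3 array with the green band at index 1; illustrative only):

    import numpy as np

    A = np.random.randint(0, 256, size=(4, 4, 3), dtype=np.uint8)  # toy RGB raster
    green_half = A[:, :, 1] // 2   # extract the green band, halve the contrast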

All or part of a raster image can be manipulated using the rules of Boolean algebra integrated into database query languages such as SQL [2]. Boolean algebra uses logical operators such as and, or, not, and xor to determine whether a particular condition is true or false. These operators are often combined with relational operators: equal (=), not equal (≠), less than (<), greater than (>), less than or equal to (≤), and greater than or equal to (≥). Consider, for example, the following queries:

Query 3.2.2. Given a near-infrared green (NRG) raster image A, highlight the cells with sufficiently high near-infrared values.

This query can be answered by imposing a lower bound on the infrared intensity and upper bounds on the green and blue intensities. The resulting boolean array is multiplied by the original image A to show the original cell where an infrared value prevails, and black otherwise.

MARRAY_{sdom(A),i}(A[i] ∗ ((A[i].nir ≥ 130) and (A[i].green ≤ 110) and (A[i].blue ≤ 140)))

Results are shown in Fig. 3.2.

Figure 3.2. Highlighted Infrared Areas of an NRG Image: (a) original NRG raster; (b) output raster
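A NumPy sketch of the same thresholding idiom (the band order nir/green/blue and the thresholds are taken from the query; the array itself is invented):

    import numpy as np

    A = np.random.randint(0, 256, size=(4, 4, 3), dtype=np.uint8)  # toy NRG raster
    nir, green, blue = A[:, :, 0], A[:, :, 1], A[:, :, 2]

    # Boolean mask applied cell-wise to every band: keeps the original cell
    # where infrared prevails, black (0) otherwise.
    mask = (nir >= 130) & (green <= 110) & (blue <= 140)
    out = A * mask[:, :, None]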

Query 3.2.3. Compare the cell values of two 8-bit grey raster images A and B. Create a new raster where each cell takes the value 255 (white pixel) when the cell values of A and B are identical.

The algebraic formulation is as follows:

MARRAY_{sdom(A),i}((A[i] = B[i]) ∗ 255)

Results are shown in Fig. 3.3.


Figure 3.3. Cells of Rasters A and B with Equal Values: (a) grey 8-bit raster A; (b) grey 8-bit raster B; (c) output raster image

Reclassification

Reclassification is a generalization technique used to re-assign cell values in classified rasters. For example, consider the query below, where reclassification is based on a land suitability study.

Query 3.2.4. Given an 8-bit grey image A, map each cell value to its corresponding suitability class shown in Table 3.2¹, and decrease the contrast of the image according to the decrease factor.

The query can be answered as follows:

MARRAY_{sdom(A),g}(((A[g] > 180) ∗ A[g]/2) +
    (((A[g] ≥ 130) and (A[g] < 180)) ∗ A[g]/3) +
    (((A[g] ≥ 80) and (A[g] < 130)) ∗ A[g]/4) +
    ((A[g] < 80) ∗ A[g]/5))

Results are shown in Fig. 3.4.

¹ Classification taken from http://www.fao.org/docrep/X5310E/X5310E00.htm


Table 3.1. UNO and FAO Suitability Classifications

Classification  Description
S1              Highly suitable
S2              Moderately suitable
S3              Marginally suitable
NS              Not suitable

Table 3.2. Capability Indexes for Different Capability Classes

Capability index  Class  Suitability class  Decrease factor
>180              I      S1                 2
130-180           II     S2                 3
80-130            III    S3                 4
<80               IV     NS                 5

Figure 3.4. Re-Classification of the Cell Values of a Raster Image: (a) original raster; (b) output raster
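A NumPy sketch of this reclassification, using the class boundaries and decrease factors of Table 3.2 (the raster itself is invented):

    import numpy as np

    A = np.random.randint(0, 256, size=(8, 8))   # toy 8-bit raster

    conditions = [A > 180,
                  (A >= 130) & (A < 180),
                  (A >= 80) & (A < 130),
                  A < 80]
    out = np.select(conditions, [A // d for d in (2, 3, 4, 5)])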


Proximity

The proximity operation creates a new raster where each cell value contains the distance to a specified reference point. As an example, consider the following query:

Query 3.2.5. Estimate the proximity of each cell of the raster image shown in Fig. 3.4(a) to the reference cell located at [30, 5].

The computation of this query can be formulated as follows (using the Manhattan distance):

MARRAY_{sdom(A),(g,h)}(|g − 30| + |h − 5|)

Results are shown in Fig. 3.5.

Figure 3.5. Computation of a Proximity Operation
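A NumPy sketch of the same Manhattan-distance raster (the 64×64 shape is arbitrary):

    import numpy as np

    rows, cols = np.indices((64, 64))               # g and h coordinates of every cell
    proximity = np.abs(rows - 30) + np.abs(cols - 5)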

Overlay

The overlay operation refers to the process of stacking two or more identically georeferenced rasters on top of each other so that each position in the covered area can be analyzed in terms of all of these data. The overlay operation can be solved using arithmetic and relational operators. For example, consider the following query:

Query 3.2.6. Given two 8-bit grey raster images A and B with identical spatial domain, perform an overlay operation. That is, make a cell-wise comparison between the two rasters; each cell of the new array must take the maximum of the corresponding cell values of A and B.

The computation of this query can be formulated as:

MARRAY_{sdom(A),g}(((A[g] > B[g]) ∗ A[g]) + ((A[g] ≤ B[g]) ∗ B[g]))

The above formulation works as follows. The left part of the addition tests whether the cell value of array A is greater than the cell value of B. The result of this test is either 0 (condition not satisfied) or 1 (condition satisfied), which in turn is multiplied by the cell value of array A. Thus, the left part of the expression is either 0 or the cell value of array A. Similarly, the right-hand side of the addition verifies whether the cell value of array A is less than or equal to the cell value of B; the result, again 0 or 1, is multiplied by the cell value of array B. Note that at most one of the two parts of the addition is non-zero, and that the sum corresponds to the larger of the cell values of arrays A and B. Results are shown in Fig. 3.6.

Figure 3.6. Computation of an Overlay Operation: (a) 8-bit grey raster A; (b) 8-bit grey raster B; (c) output raster
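The same cell-wise maximum can be sketched in NumPy, either by transcribing the 0/1 multiplication trick literally or by using the built-in maximum (toy arrays, illustrative only):

    import numpy as np

    A = np.random.randint(0, 256, size=(8, 8))
    B = np.random.randint(0, 256, size=(8, 8))

    out = (A > B) * A + (A <= B) * B              # literal transcription of the formula
    assert np.array_equal(out, np.maximum(A, B))  # equivalent built-in form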

An overlay operation can also be performed with a different condition tested while determining the cell values of the output array. For example:

Query 3.2.7. Compute an overlay operation between rasters A and B. That is, compare the two rasters cell-wise: if the cell value of B is non-zero, then set this value as the cell value of the corresponding cell in array A; otherwise, the cell value of A remains unchanged.

The query can be answered as follows:

MARRAY_{sdom(A),g}(((B[g] > 0) ∗ B[g]) + ((B[g] ≤ 0) ∗ A[g]))

Results are shown in Fig. 3.7.

3.2.2 <strong>Aggregation</strong> Operations<br />

We now present the modeling of operations consisting of one or more aggregate<br />

functions. An aggregate function takes a collection of cells and returns a single value<br />

that summarizes the information contained in the set of cells. The SQL standard provides<br />

a variety of aggregate functions. SQL-92 includes count, sum, average, min,


46 3. Fundamental Geo-Raster Operations<br />

(a) Grey 8-bit raster A (b) Grey 8-bit raster B (c) Output raster<br />

Figure 3.7. Computation of an Overlay Operation Considering Values Greater than<br />

Zero<br />

and max. SQL:1999 adds every, some and any. <strong>OLAP</strong> functions were first published<br />

as an addendum <strong>to</strong> the ISO SQL:1999 standard. They have since been completely<br />

incorporated in<strong>to</strong> both SQL:2003 and recently published SQL:2008 ISO SQL Standards.<br />

<strong>OLAP</strong> functions include rank, ntile, cume dist, percent rank, row number,<br />

percentile cont, and percentile disc.<br />

Add

The add operation sums up the contents of the cells and returns the total as a scalar value. It can also be applied to two or more rasters with an identical spatial domain, returning a new raster with the same spatial domain; in this case, the cells of the new raster contain the sum of the inputs computed on a cell-by-cell basis. As an example of the add operation on a single raster, consider the following query:

Query 3.2.8. Return the sum of all cell values of the raster shown in Fig. 3.8(a).

add_cells(A) = COND_{+,sdom(A),i}(A[i])

Results are shown in Fig. 3.8.


Figure 3.8. Calculation of the Total Sum of Cell Values in a Raster: (a) original NRG raster; (b) output result

Count

The count operation returns the number of cells that fulfill a boolean condition applied to a raster. For example, consider the following query:

Query 3.2.9. Return the number of cells of raster A of boolean type that contain the value true in the green channel.

count_cells(A) = COND_{+,sdom(A),i}(A[i].green = 1)

Average

The average operation returns a scalar value representing the mean of all values contained in a raster. As an example, consider the following query:

Query 3.2.10. Return the average of the cell values in each channel of the NRG image shown in Fig. 3.9(a).

Let sum_cells(A) be a function calculated as shown in Section 3.2.2, and card(sdom(A)) a function returning the cardinality of A. Then, the average of A is calculated as follows:

avg_cells(A) = sum_cells(A) / card(sdom(A))

Results are shown in Fig. 3.9.

Figure 3.9. Result of an Average Aggregate Operation: (a) original NRG raster; (b) output result

Maximum

A maximum operation returns the largest cell value contained in a raster of numerical type. As an example, consider the following query:

Query 3.2.11. Return the maximum cell value of all cells contained in the NRG raster image shown in Fig. 3.10(a).

max_cells(A) = COND_{max,sdom(A),i}(A[i])

Results are shown in Fig. 3.10.

Figure 3.10. Result of a Maximum Aggregate Operation: (a) original NRG raster; (b) output result

Minimum

A minimum operation returns the smallest cell value contained in a raster of numerical type. As an example, consider the following query:

Query 3.2.12. Return the smallest element of all cell values in the NRG raster image shown in Fig. 3.11(a).

min_cells(A) = COND_{min,sdom(A),i}(A[i])

Results are shown in Fig. 3.11.

Figure 3.11. Result of a Minimum Aggregate Operation: (a) original NRG raster; (b) output result

His<strong>to</strong>gram<br />

A his<strong>to</strong>gram provides information about the number of times a value occurs across a<br />

range of possible values. For an 8-bit raster up <strong>to</strong> 256 different values are possible.<br />

As an example consider the following query:<br />

Query 3.2.13. Calculate the his<strong>to</strong>gram for a 2D raster A with 8-bit integer pixel<br />

resolution.<br />

The query can be computed as follows:

MARRAY_{sdom(A),g}(count_cells(A = g[0]))    (3.1)

Results are shown in Fig. 3.12.

Figure 3.12. Computation of the Histogram for a Raster Image
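A NumPy sketch of the histogram query (256 bins for an 8-bit raster; the array is invented):

    import numpy as np

    A = np.random.randint(0, 256, size=(8, 8))
    hist = np.bincount(A.ravel(), minlength=256)  # hist[g] = number of cells equal to g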

Diversity


The diversity operation returns the different classifications present in a raster. For example, consider the following query:

Query 3.2.14. Given the classifications in an 8-bit grey raster image, return true (1) for those classes whose total number of cells is greater than 0.



For the computation of this operation we make use of the histogram calculated in Query 3.2.13. Let B be a 1-D array containing the histogram values:

B = MARRAY sdom(A),g (COND +,sdom(A),i (A[i] = g))

Then C is the array containing true values for the elements of the histogram that are greater than 0:

C = MARRAY sdom(B),i (B[i] > 0)

Results are shown in Fig. 3.13.

Figure 3.13. Computation of the Diversity for a Raster Image
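The two-step computation translates directly into array code. A short sketch, where B and C mirror the arrays of the same names and A is a hypothetical raster:

    import numpy as np

    A = np.random.randint(0, 256, size=(512, 512), dtype=np.uint8)

    B = np.bincount(A.ravel(), minlength=256)  # histogram, as in Query 3.2.13
    C = B > 0                                  # true for every class that occurs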

Majority/Minority

In a classified raster, the majority operation finds the class value with the largest number of elements in the raster. Similarly, the minority operation finds the cell value with the fewest elements. As an example, consider the following query:

Query 3.2.15. Return the cell representing the majority of all cell values contained in the 2D 8-bit gray raster image A shown in Fig. 3.14(a).

To solve this query we use the histogram computed in Query 3.2.13, and then select the cell value representing the majority of the different classes. Let h be a 1-D array containing the histogram values, and h1 a 1-D array of spatial domain [0:255] containing a list of values from 0 to 255. Let h2 be an array containing the sum of h and h1:

h2 = MARRAY [0:255],g (h + h1)

Then, the majority can be computed as follows:

COND +,sdom(A),i ((max cells(h) = (h2[i] − h1[i])) ∗ h1[i])

Results are shown in Fig. 3.14.

Figure 3.14. Computation of a Majority Operation for a Raster Image: (a) classified raster; (b) majority class
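An equivalent way to obtain the majority class (and, analogously, the minority class) is to take the position of the histogram maximum. A sketch under the same assumptions as the earlier examples:

    import numpy as np

    A = np.random.randint(0, 256, size=(512, 512), dtype=np.uint8)

    h = np.bincount(A.ravel(), minlength=256)   # histogram of class values
    majority_class = int(h.argmax())            # class with the most cells

    present = np.nonzero(h)[0]                  # classes that actually occur
    minority_class = int(present[h[present].argmin()])  # occurring class with fewest cells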

3.2.3 Statistical Aggregate Operations

We now consider operations that consist of, or include, one or more statistical aggregate functions. The basic statistical aggregate functions include standard deviation, square root, power, mode, median, variance, and top-k. These functions can be applied to a raster, or to a set of rasters retrieved by a logical search. Consider the following examples:

Variance

Let n be the cardinality of the spatial domain of A, n = card(sdom(A)), and avg a variable containing the average of all cell values of A, avg = avg cells(A). Then the variance v of A can be solved as follows:

v(A) = (1/n) ∗ COND +,sdom(A),i ((A[i] − avg) ∗ (A[i] − avg))

Results are shown in Fig. 3.15.

Figure 3.15. Computation of the Variance for a Raster Image

Standard Deviation

Query 3.2.16. Estimate the standard deviation of the cell values of the NRG raster image shown in Fig. 3.8(a).

Let n be the cardinality of the spatial domain of A, n = card(sdom(A)), and avg the average of the cell values of A, avg = avg cells(A). Then the standard deviation s of A can be solved as follows:

s(A) = sqrt((1/n) ∗ COND +,sdom(A),i ((A[i] − avg) ∗ (A[i] − avg)))

Results are shown in Fig. 3.16.

Figure 3.16. Computation of the Standard Deviation for a Raster Image
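Both formulas can be verified against a straightforward array computation. A sketch assuming a hypothetical numeric raster A (population variance, matching the 1/n factor above):

    import numpy as np

    A = np.random.randint(0, 256, size=(512, 512)).astype(float)

    n = A.size                              # card(sdom(A))
    avg = A.sum() / n                       # avg_cells(A)
    v = ((A - avg) * (A - avg)).sum() / n   # variance, as defined above
    s = np.sqrt(v)                          # standard deviation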

Median

The median can be calculated by sorting the cell values of raster A in ascending order and choosing the middle value. In case the number of cells is even, the median is the average of the two middle values. In solving this operation, we use the sort operator to perform the ascending sorting of array A. However, for an array of dimensionality higher than 1 it is necessary to flatten the array into a one-dimensional array. For example, the conversion of a two-dimensional raster A[0:m,0:n] into a one-dimensional raster B[0:m*n] can be calculated as follows. Let d be the cardinality of A, d = card(sdom(A)); let r be the number of rows; and let c be the number of columns. Then, the flattening of A can be calculated as:

B = MARRAY [0:m∗n],g (COND +,[0:m,0:n],i (((g > (m ∗ (i − 1))) and (g ≤ i)) ∗ A[1 : (g − (m ∗ (i − 1))), 1 : i]))

Let S be the raster containing the sorted values of B (the flattening of A), S = SORT 0,asc,f (B), and let n be the cardinality of S, n = card(sdom(S)). Assuming integer division and array indexing starting at zero, the median of array A can be solved as follows: if n is odd, then the median is equal to S[n/2]; otherwise, the median is (S[(n − 1)/2] + S[(n + 1)/2]) / 2. Consider the following query:

Query 3.2.17. Obtain the median of the 1-D array A whose cell values are shown in Fig. 3.17(a).

Since the array has an odd number of elements, the computation of the query is as follows:

A[card(A)/2]

Results are shown in Fig. 3.17(b).
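The flatten-sort-pick procedure can be sketched as follows, using the integer division and zero-based indexing assumed above (A is a hypothetical raster):

    import numpy as np

    A = np.random.randint(0, 256, size=(512, 512))

    S = np.sort(A.ravel())        # flatten, then sort ascending
    n = S.size
    if n % 2 == 1:
        median = S[n // 2]        # odd: single middle value
    else:
        # even: mean of the two middle values, as in the formula above
        median = (S[(n - 1) // 2] + S[(n + 1) // 2]) / 2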

Top-k

The top-k function returns the k cells with the highest values within a raster. For example, consider the following query:

Query 3.2.18. Find the five highest values contained in raster A.

To solve this query we first sort A in ascending order and then select the top five values. Let d = 0 indicate a sorting along dimension 0, and let f be the sorting function f_{d,A}(p) = A[p]. Then S is the sorted array of raster A (see Fig. 3.18):

S = SORT 0,asc,f (A)

Since S is sorted in ascending order, the five highest cell values are its last five elements; with n = card(sdom(S)), they are obtained by:

S[n − 5 : n − 1]
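A quick sanity check of the top-k selection in array code (A is a hypothetical raster; with an ascending sort, the k highest values sit at the end of the sorted array):

    import numpy as np

    A = np.random.randint(0, 1000, size=(256, 256))

    S = np.sort(A.ravel())   # ascending sort of the flattened raster
    top5 = S[-5:]            # the five highest values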


Figure 3.17. Computation of the Median for a Raster Image: (a) 1-D array; (b) median

Figure 3.18. Computation of a Top-k Operation for a Raster Image: top five values



3.2.4 Affine Transformations

Geometric transformations permit the elimination of geometric distortions that occur when images are captured. An example is the attempt to match remotely sensed images of the same area taken one year apart, when the more recent image was probably not taken from precisely the same position. Another example is Landsat Level 1B data, which are already transformed onto a plane but may not be rectified to the user's desired map projection [46]. Applying an affine transformation to a uniformly distorted raster image can correct for a range of perspective distortions by transforming the measurements from the ideal coordinates to those actually used.

An affine transformation is an important class of linear 2-D geometric transformations that maps variables, e.g. cell intensity values located at position (x1, y1) in an input raster image, into new variables (x2, y2) in an output raster image by applying a linear combination of translation, rotation, scaling, and shearing operations. The computation of these operations often requires interpolation techniques.

In the remainder of this section we discuss special cases of affine transformations.

Translation

Translation performs a geometric transformation that maps the position of each cell in an input raster image into a new position in an output raster image. Under translation, a cell located at (x1, y1) in the original is shifted to a new position (x2, y2) in the corresponding output raster image by displacing it through a user-specified translation vector (h, k). The cell values remain unchanged, and the spatial domain of the output raster image is the same as that of the original input raster. Consider, for example, the following query:

Query 3.2.19. Shift the spatial domain of a raster defined as A[x1 : x2, y1 : y2] by the point [h : k].

The query can be solved by invoking the shift function of Array Algebra:

shift(A[x1 : x2, y1 : y2], [h : k])

Results are shown in Fig. 3.19.

Figure 3.19. Computation of a Translation Operation for a Raster Image: (a) original domain; (b) translated domain

Rotation

Rotation performs a geometric transformation that maps position (x1, y1) of a cell in an input raster image onto a position (x2, y2) in an output raster image by rotating it clockwise or counterclockwise through a user-specified angle θ about origin O. The rotation operation performs a transformation of the form:

x2 = cos(θ) ∗ (x1 − x0) − sin(θ) ∗ (y1 − y0) + x0
y2 = sin(θ) ∗ (x1 − x0) + cos(θ) ∗ (y1 − y0) + y0



where (x0, y0) are the coordinates of the center of rotation in the input raster image, and θ is the angle of rotation. Existing algorithms for the computation of rotation, unlike those employed for translation, can produce coordinates (x2, y2) that are not integers. A common solution to this problem is the application of interpolation techniques such as nearest neighbor, bilinear, or cubic interpolation. For large raster datasets this is a computationally intensive problem, because every output cell must be computed separately using data from its neighbors. Consequently, the rotation operation is not yet properly supported by Array Algebra.
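To illustrate why rotation is expensive, the following Python/NumPy sketch applies the inverse of the transform above to every output cell and resamples with nearest neighbor. The function name and parameters are invented for the example; this is not an Array Algebra operator:

    import numpy as np

    def rotate_nn(A, theta, x0=None, y0=None):
        # Rotate 2D raster A by theta radians about (x0, y0), nearest neighbor.
        rows, cols = A.shape
        x0 = (cols - 1) / 2 if x0 is None else x0
        y0 = (rows - 1) / 2 if y0 is None else y0
        out = np.zeros_like(A)
        yy, xx = np.indices(A.shape)
        # inverse mapping: for each output cell, locate its source coordinate
        xs = np.cos(theta) * (xx - x0) + np.sin(theta) * (yy - y0) + x0
        ys = -np.sin(theta) * (xx - x0) + np.cos(theta) * (yy - y0) + y0
        xi = np.rint(xs).astype(int)
        yi = np.rint(ys).astype(int)
        ok = (xi >= 0) & (xi < cols) & (yi >= 0) & (yi < rows)
        out[yy[ok], xx[ok]] = A[yi[ok], xi[ok]]
        return out

    rotated = rotate_nn(np.random.rand(64, 64), np.pi / 6)

Note that every output cell requires its own trigonometric evaluation and a neighborhood lookup, which is what makes the operation costly on large rasters.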

Scaling

Scaling stretches or compresses the coordinates of a raster (or part of it) according to a scaling factor. This operation can be used to change the visual appearance of an image, to alter the quantity of information stored in a scene representation, or as a low-level preprocessor in a multi-stage image processing chain that operates on features of a particular scale. For the estimation of the cell values in a scaled output raster image, two common approaches exist:

• One pixel value within a local neighborhood is chosen (perhaps randomly) to be representative of its surroundings. This method is computationally simple but may lead to poor results when the sampling neighborhood is too large and diverse.

• The second method interpolates cell values within a neighborhood by taking the average of the local intensity values.


As in the rotation operation, the application of scaling using interpolation techniques to large raster datasets is a computationally intensive problem, because every output cell must be computed separately using data from its neighbors. Consider the following query performing a scaling operation using bilinear interpolation. That is, the cell value for (x0, y0) in the output raster is calculated by averaging the values of its nearest cells: two in the horizontal plane (x0, x1) and two in the vertical plane (y0, y1). Note that the query is applied to a raster of spatial domain [0:255, 0:255], but as mentioned earlier, raster datasets tend to be extremely large (TB, PB).

Query 3.2.20. Scale the 2D raster shown in Fig. 3.20(a) along the x and y dimensions by a factor of 2.

The query can be solved as follows:

B = MARRAY [0:m/2, 0:n/2],(x,y) (COND +,[0:1,0:1],(i,j) (A[x ∗ 2 + i, y ∗ 2 + j] / 4))

Results are shown in Fig. 3.20.

Figure 3.20. Computation of a Scaling Operation for a Raster Image: (a) original raster; (b) scaled raster
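The 2x downscaling of Query 3.2.20 is, in effect, a non-overlapping 2x2 block average. A compact NumPy sketch, assuming a hypothetical raster A with even side lengths:

    import numpy as np

    A = np.random.randint(0, 256, size=(256, 256)).astype(float)
    m, n = A.shape

    # average each non-overlapping 2x2 block, halving both dimensions
    B = A.reshape(m // 2, 2, n // 2, 2).mean(axis=(1, 3))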

3.2.5 Terrain Analysis

Raster image data is particularly useful for tasks related to terrain analysis. Some of the most popular operations include slope/aspect, drainage networks, and catchments (or watersheds). The processing of these operations may involve interpolation techniques that lead to expensive computational costs. For simplicity, we model these operations with approaches that do not use interpolation methods.

Slope/Aspect

Slope is defined by a plane tangent to a topographic surface, as modeled by the Digital Elevation Model (DEM) at a point [2]. Slope is classified as a vector, thus having two components: a quantity (gradient) and a direction (aspect). The slope (gradient) is defined as the maximum rate of change in altitude, and the aspect as the compass direction of the maximum rate of change. Several approaches exist for the computation of slope/aspect; we follow the method proposed by [32], where z(r, c) denotes the elevation at row r and column c, and g the cell spacing:

• Slope in the X direction (difference in height values on either side of P):

tan Θ_x = (z(r, c + 1) − z(r, c − 1)) / 2g

• Slope in the Y direction:

tan Θ_y = (z(r + 1, c) − z(r − 1, c)) / 2g

• Gradient at P:

sqrt(tan² Θ_x + tan² Θ_y)

• Direction (aspect) of the gradient:

tan α = tan Θ_x / tan Θ_y

Results are shown in Fig. 3.21.

Figure 3.21. Slopes Along the X and Y Directions

Note that after the calculation of the slopes for each cell in a raster image, the results may need to be classified to display them clearly on a map [2].

Query 3.2.21. Calculate the slope along the X direction of an 8-bit grey raster A:

MARRAY sdom(A),(r,c) (arctan((A(r, c + 1) − A(r, c − 1)) / 2g))



Local Drain Directions (ldd)

The ldd network is useful for computing several properties of a DEM because it explicitly contains information about the connectivity of different cells. Two steps are required to derive a drainage network: the estimation of the flow of material over the surface and the removal of pits. For instance (see Fig. 3.22), cell A1 has three neighboring cells (A2, B1 and B2) and the lowest of them is B1, thus the flow direction is south (downward). For cell C3, the lowest of its eight neighboring cells is D2, so the flow direction is southwest (to the lower left). This method is one of the most popular algorithms for estimating flow directions and is commonly known as the D8 algorithm [2].

Figure 3.22. Flow Directions

Query 3.2.22. Estimate the flow of material over raster A, where each cell contains the slope along the X direction.

Let A be a raster containing the slopes along the X direction. The ldd is then calculated as:

MARRAY sdom(A),(i,j) (COND min,[−1:1,−1:1],(v,w) (A[i + v, j + w]))

Irrespective of the algorithm used to compute flow directions, the resulting ldd network is extremely useful for computing other properties of a DEM, such as stream channels, ridges, and catchments.
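A minimal sketch of the D8 idea: for each interior cell, find which of its eight neighbors is lowest. The DEM and the direction encoding (the index 0..7 into the offset list) are assumptions of this illustration:

    import numpy as np

    def d8_directions(dem):
        # Index (0..7) of the lowest of the 8 neighbors, per interior cell.
        rows, cols = dem.shape
        offsets = [(-1, -1), (-1, 0), (-1, 1), (0, -1),
                   (0, 1), (1, -1), (1, 0), (1, 1)]
        neighbors = np.stack([dem[1 + dy:rows - 1 + dy, 1 + dx:cols - 1 + dx]
                              for dy, dx in offsets])
        return neighbors.argmin(axis=0)

    dem = np.random.rand(50, 50) * 100
    flow = d8_directions(dem)   # one direction code per interior cell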

3.2.6 Other Operations

Edge Detection

Edge detection produces a new raster containing only the boundary cells of a given raster. The detection of intensity discontinuities in a raster is very useful; e.g., the boundary representation is easy to integrate into a large variety of detection algorithms. The following parameterized function can be used to express filtering operations in Array Algebra:

f(A, M) = MARRAY sdom(A),x (COND +,sdom(M),i (A[x + i] ∗ M(i)))

where sdom(M) is the size of the corresponding filter window, e.g., 3x3. As an example, consider the following query:


Figure 3.23. Sobel Masks: (a) M1; (b) M2

Query 3.2.23. Apply edge detection to raster A shown in Fig. 3.24(a) using a 3x3 Sobel filter.

To compute this query, a Sobel filter and its inverse are applied to the original raster A (see Fig. 3.23):

(|f(A, M1)| + |f(A, M2)|) / 9

which in Array Algebra can be computed as follows:

MARRAY sdom(A),x (COND +,sdom(M1),i ((abs(A[x + i] ∗ M1(i)) + abs(A[x + i] ∗ M2(i))) / 9))

Results are shown in Fig. 3.24.

Figure 3.24. Computation of an Edge-Detection Operation for a Raster Image: (a) original raster image; (b) output raster image
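For comparison, the same Sobel-based edge detection written as an explicit 3x3 filtering loop in Python/NumPy. The mask values are the usual Sobel coefficients, which we assume match Fig. 3.23; this is an illustrative sketch, not the Array Algebra evaluation itself:

    import numpy as np

    A = np.random.rand(128, 128)

    M1 = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)  # Sobel mask
    M2 = M1.T                                                         # its transpose

    def filter3x3(A, M):
        # plain 3x3 neighborhood sum over the interior cells of A
        out = np.zeros_like(A)
        rows, cols = A.shape
        for dy in (-1, 0, 1):
            for dx in (-1, 0, 1):
                out[1:-1, 1:-1] += M[dy + 1, dx + 1] * \
                    A[1 + dy:rows - 1 + dy, 1 + dx:cols - 1 + dx]
        return out

    edges = (np.abs(filter3x3(A, M1)) + np.abs(filter3x3(A, M2))) / 9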



Slicing

The slicing operation extracts lower-dimensional sections from a raster. Array Algebra accomplishes the slicing operation by indicating the slicing position in the desired dimension. Thus, the operation reduces the dimensionality of the raster by one. For example, consider the following query:

Query 3.2.24. Slice raster A along the second dimension at position 50.

The query is solved by specifying the slicing position as follows:

MARRAY sdom(A),(x,y,z) (A[x, 50, z])

3.3 Summary

By examining the fundamental structure of geo-raster operations and breaking down their computational steps into a few basic Array Algebra operators, we determine that geo-raster operations can be grouped into the following classes:

• COND and MARRAY combined operations. Operations whose computation requires both the MARRAY and COND operators: add, count, average, maximum, minimum, majority, minority, histogram, diversity, variance, standard deviation, scaling, edge detection, and local drain directions.

• MARRAY-exclusive operations. Operations whose computation requires only the MARRAY operator: arithmetic, trigonometric, boolean, logical, overlay, reclassification, proximity, translation, slicing, and slope/aspect.

• SORT operations. Operations whose computation requires the SORT operator: top-k, median.

• AFFINE transformations. Special cases of affine transformations partially or not yet supported by Array Algebra: rotation and scaling.

This classification allows us to identify a set of operations that require data summarization and thus are potential candidates for treatment with pre-aggregation techniques: add, count, average, maximum, minimum, majority, minority, histogram, diversity, variance, standard deviation, scaling, edge detection, and local drain directions.

Table 3.3 summarizes the usage of Array Algebra operators for each operation discussed in Section 3.2.



Table 3.3. Array Algebra Classification of Geo-Raster Operations

Operation                           MARRAY   COND   SORT   AFFINE
1.  Count                                     x
2.  Add                                       x
3.  Average                                   x
4.  Maximum                                   x
5.  Minimum                                   x
6.  Majority                          x       x
7.  Minority                          x       x
8.  Std. Deviation                            x
9.  Median                            x              x
10. Variance                                  x
11. Top-k                                            x
12. Histogram                         x       x
13. Diversity                         x       x
14. Proximity                         x
15. Arithmetic                        x
16. Trigonometric                     x
17. Boolean                           x
18. Logical                           x
19. Overlay                           x
20. Re-classification                 x
21. Translation                       x
22. Rotation                                                x
23. Scaling                           x       x             x
24. Slicing                           x
25. Edge Detection                    x       x
26. Slope/Aspect                      x
27. Local drain directions (ldd)      x       x


Chapter 4

Answering Basic Aggregate Queries Using Pre-Aggregated Data

As discussed in previous chapters, aggregation is an important mechanism that allows users to extract general characterizations from very large repositories of data. In this chapter, we study the effect of selecting a set of aggregate queries, computing their results, and using them for subsequent query requests. In particular, we study the effect of pre-aggregation in computing aggregate queries in the field of GIS and remote-sensing imaging applications.

We introduce a pre-aggregation framework that distinguishes among different types of pre-aggregates for computing a query. We show that in most cases several pre-aggregates may qualify for answering an aggregate query, and we address the problem of selecting the best pre-aggregate in terms of execution time. To this end, we introduce a model that measures the cost of using qualified pre-aggregates for the computation of a query. We then present an algorithm that selects the best pre-aggregate for computing a query. We measure the performance of our algorithms in an array database management system (RasDaMan) and show that they give much better performance than straightforward methods.

4.1 Framework

Most major database management systems allow the user to store query results through a process known as view materialization. The query optimizer may then automatically use the materialized data to speed up the evaluation of a new query. Queries that benefit from using materialized data are those that involve the summarization of large amounts of data. They are known as aggregate queries because their query statements include one or more aggregate functions. The ANSI SQL:2008 standard defines a wide variety of aggregate functions, including COUNT, SUM, AVG, MAX, MIN, EVERY, ANY, SOME, VAR_POP, VAR_SAMP, STDDEV_POP, STDDEV_SAMP, ARRAY_AGG, REGR_COUNT, COVAR_POP, COVAR_SAMP, CORR, REGR_R2, REGR_SLOPE, and REGR_INTERCEPT [20].



4.1.1 Aggregation

An aggregate operation contains one or more aggregate functions that map a multiset of cell values in a dataset to a single scalar value. In our framework, queries may contain an arbitrary number of aggregate functions, e.g., COUNT, SUM, AVG, MAX, MIN, and a spatial domain. We formulate our queries using rasql¹, the declarative interface to the RasDaMan server. We use the Array Algebra notation for spatial domains:

sdom = [l_1 : h_1, . . . , l_d : h_d] (4.1)

where the vector variables l (low) and h (high) deliver the lower and upper bound vectors, respectively.

¹ rasql is an SQL-based query language for multidimensional raster databases, based on Array Algebra.

4.1.2 Pre-Aggregation

The term pre-aggregation refers to the process of pre-computing and storing the results of aggregate queries for subsequent use in the same or similar query requests. The decision to use pre-aggregated data during the computation of an aggregate query is influenced by the structural characteristics of the query and the pre-aggregate. By comparing the data structures of the two, one can determine whether the pre-aggregated result contributes fully or partially to the final answer of the query, and whether it is worth using pre-aggregated data.

4.1.3 Aggregate Query and Pre-Aggregate Equivalence

An aggregate query Q and a pre-aggregate p_i are equivalent if and only if all of the following conditions are met:

1. The aggregate operation of the query Q is the same as the aggregate operation defined for the pre-aggregate p_i.

2. The aggregate operation of the query Q and the pre-aggregate p_i must be applied over the same objects.

3. The same logical and boolean conditions, if any, apply to both the query Q and the pre-aggregate p_i.

4. For aggregate operations to be applied over a specific spatial domain, the extent of the spatial domain in query Q must be the same as the one in pre-aggregate p_i.

When all of the above conditions are satisfied, we say there is a full matching between the query and the pre-aggregate. In this case, the time it takes to retrieve the pre-aggregated result will be much shorter than the time required to compute the query from the raw (original) data. Moreover, the storage overhead required to save the pre-aggregated result is compensated by the faster computation of the query obtained in return. However, cases do occur when only conditions 1, 2, and 3 are satisfied. We refer to this case as a partial matching between the query and the pre-aggregate. We can use the partial results provided by these pre-aggregates and thus speed up the computation of the query. However, further analysis must be carried out to find those pre-aggregates that provide the maximum speedup for computing a query. To that end, we define the following types of pre-aggregates: independent, overlapped, and dominant.

Independent Pre-Aggregates

Definition 4.1 (Independent Pre-Aggregates) – A set of pre-aggregates is called Independent Pre-Aggregates (IPAS) with respect to Q if the spatial domain of each pre-aggregate is contained within the spatial domain of query Q and there is no intersection among the spatial domains of the pre-aggregates. Fig. 4.1(a) shows an example of an independent pre-aggregate.

IPAS := {p_1, p_2, . . . , p_n | p_i.sdom ⊆ Q.sdom, p_i.sdom ∩ p_j.sdom = ∅ for i ≠ j} (4.2)

✷

Overlapped Pre-Aggregates

Definition 4.2 (Overlapped Pre-Aggregates) – A set of pre-aggregates is called Overlapped Pre-Aggregates (OPAS) if the spatial domain of each pre-aggregate intersects with the spatial domain of the query Q. Fig. 4.1(b) shows an example of an overlapped pre-aggregate.

OPAS := {p_1, p_2, . . . , p_n | p_i.sdom ∩ Q.sdom ≠ ∅} (4.3)

✷

Dominant Pre-Aggregates

Definition 4.3 (Dominant Pre-Aggregates) – A set of pre-aggregates is called Dominant Pre-Aggregates (DPAS) if the spatial domain of the query Q is contained within the spatial domain of each pre-aggregate. Fig. 4.1(c) shows an example of a dominant pre-aggregate. Note that dominant pre-aggregates can only be used to answer the following types of aggregate queries: ADD, COUNT, and AVG.

DPAS := {p_1, p_2, . . . , p_n | Q.sdom ⊆ p_i.sdom} (4.4)

✷

Moreover, given an ordered DPAS

DPAS = {p_1, p_2, . . . , p_n | Q.sdom ⊆ p_1.sdom ⊆ . . . ⊆ p_n.sdom}, (4.5)

the closest dominant pre-aggregate (p_cd) to Q is given by p_1, i.e., p_cd = p_1.

Figure 4.1. Types of Pre-Aggregates: (a) independent pre-aggregate; (b) overlapped pre-aggregate; (c) dominant pre-aggregate
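The three definitions reduce to containment and intersection tests on axis-aligned bounding boxes. The following Python sketch (all names hypothetical) shows how a candidate pre-aggregate could be classified against a query, assuming each sdom is a list of (lo, hi) bounds per dimension:

    def contains(outer, inner):
        # every interval of inner lies within the matching interval of outer
        return all(ol <= il and ih <= oh
                   for (ol, oh), (il, ih) in zip(outer, inner))

    def intersects(a, b):
        return all(al <= bh and bl <= ah
                   for (al, ah), (bl, bh) in zip(a, b))

    def classify(q_sdom, p_sdom):
        if contains(q_sdom, p_sdom):
            return "IPAS"   # pre-aggregate contained in the query (Def. 4.1)
        if contains(p_sdom, q_sdom):
            return "DPAS"   # pre-aggregate dominates the query (Def. 4.3)
        if intersects(q_sdom, p_sdom):
            return "OPAS"   # partial overlap (Def. 4.2)
        return None         # unrelated: not usable for Q

    q = [(6000, 10000), (29000, 32000)]
    p = [(7680, 8191), (29696, 30207)]
    print(classify(q, p))   # -> IPAS

Note that Definition 4.1 additionally requires the independent pre-aggregates to be pairwise disjoint, which a caller would have to check across the whole candidate set.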

Cases may occur where a pre-aggregate intersects with one or more pre-aggregates of the same or a different type. Intersections are problematic because the greater the number of intersections, the greater the number of cells that may need to be computed from raw data to determine the real contribution of a given pre-aggregate towards the result of the query. The computation process involves several intermediary operations, such as decomposing the pre-aggregate into sub-partitions that in turn must be aggregated. Moreover, the same procedure must be performed on the other intersected pre-aggregates should we want to use their results. For example, assume that pre-aggregates p_1, p_2 and p_3 can be used to answer query Q, and that they all intersect with each other. Since the result of each pre-aggregate includes a partial result of the other two pre-aggregates, we must use raw data to compute the intersected area and adjust the result of the pre-aggregate according to the aggregate function specified in the query predicate.

To overcome this problem, a query selected for pre-aggregation for which other pre-aggregates exist with different spatial domains but identical structural properties can be decomposed into a set of sub-partitions prior to the pre-aggregation process.


By partitioning the query to be pre-aggregated, we can avoid intersection among pre-aggregates; see the example shown in Fig. 4.2.

Figure 4.2. Selected Queries for Pre-Aggregation (left) and Decomposed Queries (right)

4.2 Cost Model

This section introduces a cost model that allows us to estimate the cost (in terms of execution time) of computing a query using pre-aggregates compared to raw data. In our model, the access cost is driven by the number of required disk I/Os and memory accesses. These parameters are influenced by the number of tiles needed to answer a given query and by the number and size of the cells in the datasets. The following assumptions underlie our estimates.

1. We assume that the tiles needed to answer a given query are stored using implicit storage of coordinates, which is the prevalent storage format for raster image data [79]. Implicit storage of coordinate values is a storage technique that leads to a higher degree of clustering of cell values that are close in data space; that is, it preserves the spatial proximity of cell values. Given that state-of-the-art disk drives improve access to multidimensional datasets by allowing the spatial locality of the data to be preserved on the disk itself [93], we assume that it takes the same time to retrieve a tile from disk as to retrieve any other tile needed to answer a given query. Clearly, there are other factors, not considered here, that influence access cost, among them the cost of storing intermediate results and the communication cost of sending results from the server to the client. More complicated cost models are certainly possible, but we believe the cost model we pick, being both simple and realistic, enables us to design and analyze powerful algorithms.

2. We consider the time taken to access a given cell (pixel) in main memory to be the same as that required to access any other cell. That is, we assume that a tile sits in main memory and is not swapped out.

3. We ignore the time it takes to combine partial aggregate results. Investigations have shown this time to be negligible compared to tile iteration [74].

Table 4.1 lists the parameters involved in the different cost functions presented in the remainder of this section.

Table 4.1. Cost Parameters

Parameter   Description
Ntiles      Number of tiles
Ncells      Number of cells
sdom        Spatial domain
IPAS        Independent pre-aggregates set
OPAS        Overlapped pre-aggregates set
DPAS        Dominant pre-aggregates set
p_cd        Closest dominant pre-aggregate
SP          Sub-partitions

4.2.1 Computing Queries from Raw Data

The cost of computing an aggregate query Q (or sub-partitions of pre-aggregates) from raw data, C_r, is given by

C_r(Q) = C_acc(Ntiles(Q)) + C_agg(Ncells(Q)) (4.6)

where C_acc is the cost of retrieving the tiles required to answer Q, and C_agg is the time taken to access and aggregate the total cells given by the spatial component of the query.

4.2.2 Computing Queries from Independent and Overlapped Pre-Aggregates

The cost of answering an aggregate query using independent and overlapped pre-aggregates is given by:

C_IOPAS(Q) = C_IPAS(Q) + C_OPAS(Q) + C_SP(Q) (4.7)

where C_IPAS and C_OPAS are the costs of using the results of independent and overlapped pre-aggregates, respectively, and C_SP is the cost of decomposing the query Q into a set of sub-partitions and aggregating each from raw data.
in<strong>to</strong> a set of sub-partitions and aggregating each from raw data.


Cost of independent pre-aggregates

The cost of retrieving the results of independent pre-aggregates, C_IPAS, is given by:

C_IPAS(Q, T) = C_fin(Q, T) + Σ_{i=0}^{|IPAS|} C_acc(p_i) (4.8)

where C_fin is the cost of finding the pre-aggregates ∈ IPAS in the pre-aggregation pool T, and C_acc is the accumulated cost of retrieving the results of the pre-aggregates.

Cost of overlapped pre-aggregates

The cost of retrieving the results of overlapped pre-aggregates, C_OPAS, is given by:

C_OPAS(Q) = C_fin(Q, T) + Σ_{i=0}^{|OPAS|} (C_dec(p_i) + Σ_{j=0}^{|S|} C_r(s_j)) (4.9)

where C_fin is the cost of finding the pre-aggregates ∈ OPAS in the pre-aggregation pool T, C_dec is the cost of decomposing the spatial domain of each pre-aggregate into a set of sub-partitions S such that the spatial domain of the partitioned pre-aggregate corresponds to p_i.sdom − (p_i.sdom ∩ Q.sdom), and C_r is the cost of aggregating each resulting sub-partition s_j ∈ S from raw data.

Cost of aggregating sub-partitions of a query

The cost of aggregating all sub-partitions forming a query is given by:

C_SP(Q) = C_dec(Q) + Σ_{i=0}^{|SP|} C_r(s_i) (4.10)

where C_dec is the cost of decomposing Q into a set SP of sub-partitions, and C_r is the cost of aggregating each resulting sub-partition s_i ∈ SP from raw data. Note that C_dec is influenced by the cost of accessing the tiles required to aggregate each sub-partition and the cost of accessing the spatial properties of the pre-aggregates in IPAS and OPAS.

4.2.3 Computing Queries from Dominant Pre-Aggregates

The cost of computing an aggregate query Q using a dominant pre-aggregate is given by:

C_DPAS(Q) = C_DP(Q, T) + C_agg(p_cd) (4.11)

where C_DP is the sum of the cost of finding the pre-aggregates ∈ DPAS in the pre-aggregation pool T and the cost of finding the closest dominant pre-aggregate p_cd, and C_agg is the cost of computing the aggregate difference of p_cd corresponding to p_cd.sdom − Q.sdom.
i=0


Cost of aggregating sub-partitions of the closest dominant pre-aggregate

The cost C_agg can be calculated as follows:

C_agg(p_cd) = C_dec(p_cd) + Σ_{i=0}^{|SP|} C_r(s_i) (4.12)

where C_dec is the cost of decomposing p_cd into a set SP of sub-partitions, and C_r is the cost of aggregating each resulting sub-partition s_i ∈ SP from raw data.
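To show how equations (4.6)-(4.12) could drive a plan choice, here is a deliberately simplified Python sketch. The unit costs and function names are assumptions of this illustration, not measured RasDaMan values:

    # assumed unit costs (illustrative only)
    T_TILE = 1.0     # cost of fetching one tile from disk
    T_CELL = 1e-6    # cost of aggregating one cell in memory
    T_FIND = 0.01    # cost of one lookup in the pre-aggregation pool

    def c_raw(n_tiles, n_cells):
        # eq. (4.6): tile access plus cell aggregation
        return n_tiles * T_TILE + n_cells * T_CELL

    def c_ipas(n_preaggregates, leftover_tiles, leftover_cells):
        # eqs. (4.8)+(4.10): fetch each stored result, then aggregate the
        # remaining sub-partitions of the query from raw data
        fetch = T_FIND + n_preaggregates * T_TILE
        return fetch + c_raw(leftover_tiles, leftover_cells)

    # toy comparison: 63 tiles / 12M cells raw vs. 24 pre-aggregates + remainder
    print(c_raw(63, 12_000_000))
    print(c_ipas(24, 15, 4_000_000))

A planner would evaluate such cost functions for every qualifying pre-aggregate set and pick the minimum, which is exactly the role of the selectPlan() procedure described below.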

4.3 Implementation

This section describes the application of a query optimization technique that transforms an input query written in terms of arrays so that it can be executed faster using pre-aggregated data. The query processing module of an array database management system (RasDaMan) has been extended with our pre-aggregation framework for query rewriting, implemented as part of the optimization and evaluation phases. As discussed earlier in this chapter, there are two problems related to the computation of an aggregate query using pre-aggregated data. First, we must find all pre-aggregates that can be used to compute the query, including those that provide partial answers. Second, from all candidate pre-aggregates, we must find the one that minimizes the execution time (or cost) of computing the query. Our solution is based on an existing approach for answering queries using views in OLAP applications. Halevy et al. [95] showed that all possible rewritings of a query can be obtained by considering containment mappings from the bodies of the views to the body of the query. They also showed that this characterization is an NP-complete problem.

The QUERYCOMPUTATION procedure returns the result of a query or an execution plan for a given query Q. An execution plan is an indicator of the kind of data that must be used to compute the query. It returns a raw indicator if the query must be computed from the original data. Other valid indicators include IPAS, OPAS, and DPAS, which indicate that the query will be answered using one or more partial pre-aggregates.

The input of the algorithm is a query tree Q_t of an aggregate query. The algorithm first verifies whether the conditions for a perfect matching between the query and the pre-aggregated queries are satisfied. If a perfect matching is found, it returns the result of the pre-aggregated query. Otherwise, the algorithm verifies whether the conditions for a partial matching between the query and the set of pre-aggregated queries are satisfied. The algorithm then uses our cost model to determine the cost of using pre-aggregates that satisfy the partial-matching conditions for the computation of the query, and the cost of computing the query from the original data. Finally, the algorithm picks the plan with the least cost in terms of execution time. The algorithm makes use of the following auxiliary procedures:

• DECOMPOSEQUERY(Q_t) examines the nodes of the query tree Q_t and generates a standardized representation S_qt that can be manipulated via SQL statements.



Algorithm 1 QUERYCOMPUTATION
Require: A query tree Q_t, a set of k pre-aggregated queries P
1: initialize R = 0, key = false
2: S_qt = decomposeQuery(Q_t)
3: key = perfectMatching(S_qt, P)
4: if key then
5:   R = fetchResult(key)
6:   return R
7: end if
8: if !key then
9:   plan = partialMatching(S_qt, P)
10:  return plan
11: end if

• PERFECTMATCHING(S_qt, P) compares the standardized representation S_qt of the query tree against the k existing pre-aggregates. The output is the key of the matched pre-aggregated query; a null value is returned if no perfect matching is found.

• FETCHRESULT(key) retrieves the result R of the pre-aggregated query identified by key.

The PARTIALMATCHING algorithm identifies an aggregate sub-expression in a query tree Q_t and finds pre-aggregated queries satisfying conditions 1, 2 and 3, but not condition 4, as defined in Section 4.1.3. It considers the use of pre-aggregates that partially contribute to the answer of a query sub-expression and that are either independent, overlapped, or dominant. The algorithm calculates the cost of using each pre-aggregate for computing the query and returns an indicator of the type of plan providing the least cost.

The aggregateOp() procedure compares a node n of a given query tree Q_t against a list of pre-defined aggregate operations, e.g., add cells, count cells, avg cells, max cells, and min cells. If the node matches any such operation, it returns a true value.

The getSubtree() procedure receives as parameters a query tree Q_t and a pointer to an aggregate node. If the aggregate node has children, it creates a subtree Q′ whose root node corresponds to the aggregate node.

The findPreaggregate() procedure receives as parameters an aggregate operation op, an object identifier ro, and a spatial domain sd. It then determines whether the values of these parameters match those of any existing pre-aggregate. If a match is found, the key of the matched pre-aggregate is returned.

Algorithm 2 PARTIALMATCHING
Require: A standardized query tree Q_t with m nodes.
1: initialize IPAS, OPAS, DPAS = {}
2: initialize plan = "raw", key = false
3: for each node n of Q_t do
4:   if aggregateOp(node[n]) then
5:     Q′ = getSubtree(Q_t, node[n])
6:     op = getOperation(Q′)
7:     ro = getRasterObject(Q′)
8:     sd = getSpatialDomain(Q′)
9:     key = findPreaggregate(op, ro, sd)
10:    if key then
11:      R = fetchResult(key)
12:      return R
13:    end if
14:    if !key then
15:      IPAS = findIpasPreaggregates(op, ro, sd)
16:      OPAS = findOpasPreaggregates(op, ro, sd)
17:      DPAS = findDpasPreaggregates(op, ro, sd)
18:    end if
19:    plan = selectPlan(Q′, IPAS, OPAS, DPAS)
20:  end if
21: end for
22: return plan

The findIpasPreaggregates() procedure receives as a parameter a subtree Q′ and verifies whether any pre-aggregates satisfy conditions 1, 2 and 3, as defined in Section 4.1.3, for equivalence between a query and a pre-aggregate. Among the pre-aggregates that qualify, it identifies those whose spatial domains are contained in the spatial domain of the query. The output is a set of independent pre-aggregates.

The findOpasPreaggregates() procedure receives as a parameter a subtree Q′ and verifies whether any pre-aggregates satisfy conditions 1, 2 and 3 as defined in Section 4.1.3. Among the pre-aggregates that qualify, it identifies those whose spatial domains intersect with the spatial domain of the query. The output is a set of overlapped pre-aggregates.

The findDpasPreaggregates() procedure receives as a parameter a subtree Q′ and verifies whether any pre-aggregates satisfy conditions 1, 2 and 3 as defined in Section 4.1.3. Among the pre-aggregates that qualify, it identifies those whose spatial domains dominate the spatial domain of the query. The output is a set of dominant pre-aggregates.

The selectPlan() procedure receives as parameters a sub-query tree Q′, a set of independent pre-aggregates IPAS, a set of overlapped pre-aggregates OPAS, and a set of dominant pre-aggregates DPAS. It then calculates the cost of answering the query using the different types of pre-aggregates and raw data. The output of this procedure is an indicator of the best plan for executing the query.


Query Evaluation

The query optimizer module provides an optimized query tree, along with the plan suggested for the computation of the query, to the final phase, evaluation. Typically, the evaluation phase identifies the tiles affected by an aggregate query, executes the aggregate operation on each tile, and finally combines the results to generate the answer to the query. With the pre-aggregation extension in the optimizer, the traditional process differs in that the selected plan is considered before proceeding to execution. If the plan corresponds to raw, then the computation of the query is done entirely from raw data. Otherwise, the aggregate operation is executed only on those sub-expressions for which there are no pre-aggregated results.

4.4 Experimental Results

This section presents the performance results of our algorithms on real-life raster image datasets. We ran our experiments on an Intel Pentium 4 CPU 3.00 GHz PC running SuSE Linux 9.1. The workstation had a total physical memory of 512 MB. The datasets were stored in RasDaMan, an array database management system (our research vehicle).

Table 4.2 lists the test queries used in our experiments. We ran each query 200 times against the database to obtain average query response times. The queries are formulated in rasql syntax, the declarative query interface to the RasDaMan server. We performed a cold test where the queries were run sequentially; the cache buffer was cleaned after the completion of each query. The dataset consists of a collection of 2D raster images, each associated with an object identifier (oid). Each image shows a portion of the Black Sea, is 260 MB in size, and consists of 100 indexed tiles. We artificially created a set of pre-aggregates for the experiment. They are stored in a pre-aggregation pool containing a total of 5000 pre-aggregates, requiring a total storage space of 50 MB.

Computing the test queries involves the execution of two fundamental operations in GIS and remote-sensing imaging: sub-setting and aggregation. The values of the spatial domains of the queries were chosen such that we could measure the impact of using pre-aggregation in the following cases:

• Queries Q1, Q2 and Q3 can be computed by combining the results of partial pre-aggregates with the remaining parts computed from original data.

• Queries Q4, Q5 and Q6 can be computed using the results of full pre-aggregates. That is, the full answer to these queries has been pre-computed and stored in the database.

• Queries Q7, Q8 and Q9 can be computed by combining the results of two or more pre-aggregates. There is no need to use original data to compute these queries.
compute these queries.


Table 4.2. Database and Queries of the Experiment

Qid  Description
Q1   select add_cells(y[6000:10000, 29000:32000])
     from blacksea as y where oid(y) = 49153
Q2   select add_cells(y[7000:10000, 29000:31000])
     from blacksea as y where oid(y) = 49154
Q3   select add_cells(y[6700:10000, 28000:30000])
     from blacksea as y where oid(y) = 49155
Q4   select add_cells(y[7680:8191, 29000:31000])
     from blacksea as y where oid(y) = 49153
Q5   select add_cells(y[8704:9215, 29000:31000])
     from blacksea as y where oid(y) = 49154
Q6   select add_cells(y[9728:10000, 29000:31000])
     from blacksea as y where oid(y) = 49155
Q7   select add_cells(y[7680:8191, 29696:30207])
     from blacksea as y where oid(y) = 49153
Q8   select add_cells(y[8704:9215, 30720:31000])
     from blacksea as y where oid(y) = 49154
Q9   select add_cells(y[9216:9727, 30208:30719])
     from blacksea as y where oid(y) = 49155

Table 4.3 compares the CPU cost required for the computation of the queries using pre-aggregated data and raw data. The CPU cost was obtained using the time library of C++. The column "#aff. tiles" shows the number of tiles that need to be read to compute the given query. Column "#preagg. tiles" gives the number of pre-aggregates that can be used to compute the query. Column "t_pre" shows the total CPU cost of computing the query using pre-aggregated data. Column "t_ex" shows the time taken to execute the query entirely from raw data. Column "ratio" shows that the CPU time is always better when the computation uses pre-aggregated data.

Table 4.3. Comparison of Query Evaluation Costs Using Pre-Aggregated Data and Original Data

Qid  #aff. tiles  #preagg. tiles  t_pre  t_ex  ratio
Q1   63           24              15.6   17.8  87%
Q2   35           24              6.9    9.3   74%
Q3   35           8               9.4    10    94%
Q4   5            5               1.02   1.55  65%
Q5   5            5               1.1    1.63  67%
Q6   5            5               0.74   1.01  73%
Q7   2            1               0.04   0.41  9%
Q8   2            1               0.04   0.45  8%
Q9   2            1               0.04   0.41  9%

4.5 Summary

In this chapter we presented a framework for computing aggregate queries in array databases using pre-aggregated data. We distinguished among different types of pre-aggregates: independent, overlapped, and dominant. We showed that such a distinction is useful for finding a set of pre-aggregated queries that can reduce the CPU cost of query computation. We proposed a cost model to calculate the cost of using different pre-aggregates and to select the best option for evaluating a query using pre-aggregated data. Measurements on real-life raster images showed that the computation of the queries is always faster with our algorithms than with straightforward methods. We focused on queries using basic aggregate functions, covering a large number of operations in GIS and remote-sensing imaging applications. The challenge remains, however, of supporting more complex aggregate operations, e.g., scaling, which is discussed in the following chapter.




Chapter 5

Pre-Aggregation Support Beyond Basic Aggregate Operations

In this chapter we investigate the problem of offering pre-aggregation support to non-standard aggregate operations such as scaling and edge detection. We discuss issues found while attempting to provide a pre-aggregation framework for all non-standard aggregate operations. We then justify our reasons for focusing on scaling operations. We adapt the framework and cost model presented in Chapter 4 to support scaling operations. Finally, we discuss the efficiency of our algorithms based on a performance analysis covering 2D, 3D and 4D datasets. We indicate how our approach generalizes and outperforms the well-known 2D image pyramids widely used in Web mapping.

5.1 Non-Standard Aggregate Operations

As shown in Chapter 2, aggregate operations are not limited to queries using basic aggregate functions. In the GIS domain, operations such as scaling, edge detection, and those related to terrain analysis also require data summarization and may therefore benefit from pre-aggregation. See Table 3.3 for a complete list of operations requiring summarization. Finding a general pre-aggregation approach for computing those kinds of operations, however, introduces additional complications compared to finding pre-aggregates for queries using basic aggregate functions.

Basic aggregate functions each consolidate the values of a group of cells and return a scalar value. The value may represent the total sum, the number of cells, the maximum or minimum cell value, or the average value of the affected cells. The affected cells are determined by the spatial domain defined in the predicate of the query. In contrast, the computation of a scaling operation may require consolidating the values of a group of cells to calculate each cell value in the output raster. Here, the affected cells are determined by both the resampling method and the scale vector, as described in Chapter 3. A similar situation occurs with edge detection, where the affected cells are determined by the size and values of the applied Sobel filter. For simplicity, we refer to these kinds of operations as non-standard aggregate operations.

There is an important concern that must now be taken into account. From Chapter 3, we see that the result returned by a group of affected cells for a given non-standard aggregate operation such as scaling is not likely to be useful in computing another non-standard aggregate operation such as edge detection. This is because non-standard operations differ significantly with respect to the way their affected cells are determined. Nevertheless, this result may be useful in computing the same type of non-standard operation under certain conditions. For example, the result of scaling by a factor of 8 could be used to compute scaling by a factor of 10 (assuming that both operations use the same resampling method). This result, however, is not likely to be useful in edge detection for the same object.

We therefore simplify the problem of offering pre-aggregation support for non-standard aggregations by treating each type of non-standard operation separately. This simplification is similar to those found in data warehousing techniques, where pre-aggregation algorithms cover a specific type of query. For instance, pre-aggregation algorithms exist for queries that include a group-by clause in their predicates, while other algorithms are used for queries without join conditions.

We now focus on pre-aggregation support for one non-standard aggregate operation, scaling, for the following reasons:

• One of the most frequent operations in GIS and remote-sensing imaging applications is downscaling of some dataset or part thereof, such as obtaining a 1 GB overview of a 10 TB dataset.

• Scaling is a very expensive operation, as it normally requires a full scan of the dataset plus costly main-memory operations. Therefore, query optimization is critical for this class of retrieval operations.

• Scaling is the only operation that has already been supported by pre-aggregation, at least for 2D datasets. This provides a point of reference for comparing the effectiveness of our algorithms against existing techniques.

Although the framework discussed in the following sections is centered around scaling operations, it can be adapted to support other non-standard aggregate operations by modifying the matching conditions, as discussed later in this chapter.

5.2 Conceptual Framework<br />

A common optimization technique that speeds up scaling operations is to materialize selected downscaled versions of an object, e.g., using image pyramids. When evaluating a scaling operation with target scale factor s, the pyramid level with the largest scale factor s′ is determined, where s′ < s. This relationship between scaling operations places them within a lattice framework similar to that used for data cubes in data warehouse/OLAP applications [92]. Our conceptual framework and greedy algorithm for the selection of pre-aggregates are based on the work of Harinarayan et al. presented in [92]. The use of this approach was motivated by the similarities between our datasets (multidimensional arrays) and OLAP data cubes. Furthermore, the lattice framework and the greedy algorithm have proven successful in a variety of business applications.

[Figure 5.1. Sample Lattice Diagram for a Workload with Five Scaling Operations]

5.2.1 Lattice Representation

A scaling lattice consists of a set of queries L and a dependence relation ≼, denoted by ⟨L, ≼⟩. The ≼ operator imposes a partial ordering on the queries of the lattice. Consider two queries q_1 and q_2: we say q_1 ≼ q_2 if q_1 can be answered using only the results of q_2. The base node of the lattice is the scaling operation with the smallest scale vector, upon which every query is dependent. Lattices are commonly represented by a diagram in which the elements are nodes, and there is a path downward from q_1 to q_2 if and only if q_1 ≼ q_2. The selection of pre-aggregates, that is, of queries for materialization, is equivalent to selecting vertices from the underlying nodes of the lattice. Fig. 5.1 shows a lattice diagram for a workload containing five queries. Each node has an associated label that represents a scaling operation for a given dataset, scale vector, and resampling method.

In our framework, we use the following function to define scaling operations:

$$\mathrm{scale}(objName[lo_1{:}hi_1, \ldots, lo_n{:}hi_n],\ \vec{s},\ resMeth) \qquad (5.1)$$

where

• objName[lo_1:hi_1, ..., lo_n:hi_n] is the name of the multidimensional raster image to be scaled. The operation can be restricted to a specific area of the raster object; in that case, the area is specified by defining lower (lo_i) and upper (hi_i) bounds for each dimension. If the spatial domain is omitted, the operation is performed on the full spatial extent defining the raster image.

• $\vec{s}$ is a vector where each element is a numeric value representing the scale factor used in a specific dimension of the raster image.

• resMeth specifies the resampling method to be applied to the original raster object.


For example, scale(CalFires, [2, 2, 2], nn) defines a scaling operation by a factor of two in each dimension, using nearest neighbor as the resampling method, on a 3D dataset identified as CalFires.
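To make the notation concrete, the following Python sketch models a scaling operation as defined in Eq. 5.1, together with the dependence test underlying the lattice of this section. All names are hypothetical illustrations rather than part of any actual array DBMS, and the componentwise reading of ≼ is our assumption:

    from dataclasses import dataclass
    from typing import Tuple

    @dataclass(frozen=True)
    class ScaleOp:
        """A scaling operation: scale(objName[...], s, resMeth) as in Eq. 5.1."""
        obj_name: str                     # raster object (with optional domain restriction)
        scale_vector: Tuple[float, ...]   # one downscale factor per dimension
        res_meth: str                     # resampling method, e.g. "nn" or "bi"

        def answers(self, other: "ScaleOp") -> bool:
            """True if `other` can be computed from self (other ≼-depends on
            self): same object, same resampling method, same dimensionality,
            and every factor of self is no larger than the matching factor of
            `other`, i.e. a less downscaled result can be downscaled further."""
            return (self.obj_name == other.obj_name
                    and self.res_meth == other.res_meth
                    and len(self.scale_vector) == len(other.scale_vector)
                    and all(a <= b for a, b in
                            zip(self.scale_vector, other.scale_vector)))

    # The pre-aggregate scale(CalFires, [2,2,2], nn) can answer scale(CalFires, [8,8,8], nn).
    assert ScaleOp("CalFires", (2.0, 2.0, 2.0), "nn").answers(
        ScaleOp("CalFires", (8.0, 8.0, 8.0), "nn"))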

5.2.2 Pre-Aggregation Selection Problem

Definition 5.4 (Pre-Aggregates Selection Problem) – Given a query workload Q and a storage space constraint C, the pre-aggregates selection problem is to select a set P ⊆ Q of queries such that P minimizes the overall cost of computing Q while the storage space required by P does not exceed the limit given by C. ✷

Considering existing view selection strategies in data warehousing/OLAP, the following selection criteria are suggested for pre-aggregates:

• Frequency. Pre-aggregates yield particularly significant increases in processing speed when scaling operations are executed with high frequency within a workload.

• Storage space. The storage space constraint of a candidate scaling operation must be at least the size of the storage required by the query in the workload with the smallest scale vector. This guarantees that for any query in the workload, at least one pre-aggregate can be used for its computation.

• Benefit. A scaling operation may be used to compute the same and other dependent queries in the workload. A metric is therefore used to calculate the cost savings gained by using a candidate scaling operation. To evaluate the cost, we use the model presented in Section 4.2. We call this the benefit of a pre-aggregate set, and normalize the benefit against the base object's storage volume.

Frequency

The frequency of query q, denoted by F(q), is the relative number of occurrences of the query in a workload:

$$F(q) = \frac{N(q)}{|Q|} \qquad (5.2)$$

where N(q) is a function that returns the number of occurrences of query q in workload Q.
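As a minimal sketch (reusing the hypothetical ScaleOp class above, which is hashable), Eq. 5.2 amounts to a normalized histogram over the workload:

    from collections import Counter
    from typing import Dict, Sequence

    def frequencies(workload: Sequence[ScaleOp]) -> Dict[ScaleOp, float]:
        """F(q) = N(q) / |Q| for every distinct query q in the workload (Eq. 5.2)."""
        counts = Counter(workload)
        return {q: n / len(workload) for q, n in counts.items()}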

Storage Space

The storage space of a given query, denoted by S(q), represents the storage space required to save the result of query q; it is determined by the number of cells composing the output object defined in query q.
composing the output object defined in query q.



Benefit

The benefit of a candidate scaling operation q for pre-aggregation is computed by adding the savings in query cost for each scaling operation in the workload dependent on q, including all queries identical to q. That is, query q may contribute to saving processing costs for the same or similar queries in the workload. In both cases, specific matching conditions must be satisfied.

Full-Match Conditions. Let q be a candidate query for pre-aggregation and p a query in workload Q. Let p and q both be scaling operations as defined in Eq. 5.1. There is a full match between q and p if and only if:

• the value of parameter objName[] in the scale function defined for q is the same as in p,
• the value of parameter $\vec{s}$ in the scale function defined for q is the same as in p, and
• the value of parameter resMeth in the scale function defined for q is the same as in p.

Partial-Match Conditions. Let q be a candidate query for pre-aggregation and p be a query in workload Q. There is a partial match between p and q if and only if:

• the value of parameter objName[] in the scale function defined for q is the same as in p,
• the value of parameter resMeth in the scale function defined for q is the same as in p,
• the parameter $\vec{s}$ for both q and p is of the same dimensionality, and
• the vector values defined in $\vec{s}$ for p are higher than those defined for q, so that the more aggressively scaled p can be computed from the result of q (compare the factor-8 versus factor-10 example above).

A sketch of both checks follows.
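The two conditions translate directly into predicates over the hypothetical ScaleOp representation; treating "higher" as a componentwise comparison is our reading of the condition:

    def full_match(q: ScaleOp, p: ScaleOp) -> bool:
        """Full match: identical object, scale vector, and resampling method."""
        return (q.obj_name == p.obj_name
                and q.scale_vector == p.scale_vector
                and q.res_meth == p.res_meth)

    def partial_match(q: ScaleOp, p: ScaleOp) -> bool:
        """Partial match: same object and resampling method, same dimensionality,
        and p is downscaled at least as much as the candidate q (but not equal)."""
        return (q.obj_name == p.obj_name
                and q.res_meth == p.res_meth
                and len(q.scale_vector) == len(p.scale_vector)
                and q.scale_vector != p.scale_vector
                and all(sq <= sp for sq, sp in
                        zip(q.scale_vector, p.scale_vector)))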

Definition 5.5 (Benefit) – Let T ⊆ Q be the subset of scaling operations that can be fully or partially computed using query q. The benefit of query q per unit space, denoted by B(q), is the sum of the computational cost savings gained by selecting query q for pre-aggregation:

$$B(q) = \left( F(q) \cdot C(q) + \sum_{t \in T} F(t) \cdot C_r(t, q) \right) / \, size(q) \qquad (5.3)$$

where F(q) represents the frequency of query q in the workload, C(q) is the cost of computing query q on the original dataset, C_r(t, q) is the relative cost of computing query t from q, and size(q) is a function that returns the number of cells composing the spatial domain component of query q. ✷
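A sketch of Eq. 5.3 follows; the callables cost, rel_cost, and size stand in for the Section 4.2 cost model (C and C_r) and the output-size function, which are not reproduced here:

    from typing import Callable, Sequence

    def benefit(q: ScaleOp,
                workload: Sequence[ScaleOp],
                cost: Callable[[ScaleOp], float],
                rel_cost: Callable[[ScaleOp, ScaleOp], float],
                size: Callable[[ScaleOp], float]) -> float:
        """B(q) = (F(q)*C(q) + sum_{t in T} F(t)*C_r(t, q)) / size(q)  (Eq. 5.3)."""
        freq = frequencies(workload)
        savings = freq.get(q, 0.0) * cost(q)
        for t in freq:                      # T: queries partially computable from q
            if t != q and partial_match(q, t):
                savings += freq[t] * rel_cost(t, q)
        return savings / size(q)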


5.3 Pre-Aggregates Selection

Pre-aggregating all distinct scaling operations in the workload is not always possible because of space limitations. This is similar to the problem of selecting views for materialization in OLAP. One approach for finding the optimal set of scaling operations to pre-compute consists of enumerating all possible combinations and finding the one that yields the minimum average query cost, or the maximum benefit. Finding the optimal set of pre-aggregates in this way has a complexity of O(2^n), where n is the number of queries in the workload. If the number of scaling operations on a given raster object is 50, there are 2^50 possible pre-aggregate sets for that object. Therefore, computing the optimal set of aggregates exhaustively is not feasible. In fact, it is an NP-hard problem [92, 17].

We therefore consider the selection of pre-aggregates as an optimization problem where the input includes multidimensional datasets, a query workload, and an upper bound on available disk space. The output is a set of queries that minimizes the total cost of evaluating the query workload subject to the storage limit. We present an algorithm that uses the benefit per unit space of a scaling operation. We model the expected queries by a query workload, which is a set of scaling operations:

$$Q = \{\, q_i \mid 0 < i \le n \,\} \qquad (5.4)$$

where each q_i has an associated non-negative frequency f_i. We normalize the frequencies so that they sum up to 1:

$$\sum_{i=1}^{n} f_i = 1 \qquad (5.5)$$

Based on this setup we study different workload patterns.

The PRE-AGGREGATESSELECTION procedure returns a set P = {p_i | 0 < i ≤ n} of queries to be pre-aggregated. Its input is a workload Q and a storage space constraint c. The workload contains a number of queries, each corresponding to a scaling operation as defined in Eq. 5.1.

Frequency, storage space, and benefit per unit space are calculated for each distinct query in the workload. When calculating the benefit, we assume that each query is evaluated using the root (top) node, which is the first selected pre-aggregate, p_1. The second chosen pre-aggregate, p_2, is the one with the highest benefit per unit space.

The algorithm then recalculates the benefit of each scaling operation, given that it is computed either from the root, if the scaling operation lies above p_2 in the lattice, or from p_2 otherwise.

Subsequent selections are performed in a similar manner. The benefit is recalculated each time a scaling operation is selected for pre-aggregation. The algorithm stops selecting pre-aggregates when the storage space constraint is reached, or when there are no more queries in the workload to be considered for pre-aggregation, i.e., when all scaling operations in the workload have already been selected. The function highestBenefit(Q, P) returns the scaling operation in Q not yet in P with the highest benefit per unit space. The complexity of the algorithm is O(k · n²), where k is the number of selected pre-aggregates and n is the number of vertices in the lattice; this arises from the cost of sorting the pre-aggregates by benefit per unit size.

Algorithm 3 PRE-AGGREGATESSELECTION
Require: A workload Q, and a storage space constraint c
 1: P = {top scaling operation}
 2: while (c > 0 and |P| ≠ |Q|) do
 3:     p = highestBenefit(Q, P)
 4:     if (c − |p| > 0) then
 5:         c = c − |p|
 6:         P = P ∪ {p}
 7:     else
 8:         c = 0
 9:     end if
10: end while
11: return P
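A runnable sketch of Algorithm 3 over the hypothetical ScaleOp type; benefit_fn(q, P) is assumed to recompute the benefit per unit space of q against the cheapest ancestor already in P, and counting the mandatory root against c is an assumption of this sketch:

    from typing import Callable, List, Sequence

    def pre_aggregates_selection(workload: Sequence[ScaleOp],
                                 c: float,
                                 benefit_fn: Callable[[ScaleOp, List[ScaleOp]], float],
                                 size: Callable[[ScaleOp], float]) -> List[ScaleOp]:
        """Greedy selection of pre-aggregates under storage constraint c."""
        candidates = list(dict.fromkeys(workload))           # distinct queries, stable order
        top = min(candidates, key=lambda q: q.scale_vector)  # root: smallest scale vector
        selected = [top]
        c -= size(top)
        while c > 0 and len(selected) < len(candidates):
            remaining = [q for q in candidates if q not in selected]
            p = max(remaining, key=lambda q: benefit_fn(q, selected))
            if c - size(p) > 0:              # enough space left: select p
                c -= size(p)
                selected.append(p)
            else:                            # constraint reached: stop
                break
        return selected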

5.3.1 Complexity Analysis

Let m be the number of queries in the lattice. Suppose we have no queries selected except for the top query, which is mandatory. The time to answer a given query in the workload is then the time taken to compute it from the top query, calculated according to our cost model; we denote this time by T_o. Suppose that, in addition to the top query, we choose a set of queries P, and denote the resulting average time to answer a query by T_p. The benefit of the set of queries P is the reduction in average time to answer a query, that is, T_o − T_p. Thus, minimizing the average time to answer a query is equivalent to maximizing the benefit of a set of queries.

Let p_1, p_2, ..., p_k be the k queries selected by the PRE-AGGREGATESSELECTION algorithm, and let b_i be the benefit achieved by the selection of p_i, for i = 1, 2, ..., k; that is, b_i is the benefit of p_i with respect to the set consisting of the top query and p_1, p_2, ..., p_{i−1}. Let P = {p_1, p_2, ..., p_k}.

Let O = {o_1, o_2, ..., o_k} be an optimal set of k queries, i.e., the k queries giving the maximum benefit, and let m_i be the benefit achieved by the selection of o_i, for i = 1, 2, ..., k; that is, m_i is the benefit of o_i with respect to the set consisting of the top query and o_1, o_2, ..., o_{i−1}.

Harinarayan et al. [92] proved that the benefit of the greedy algorithm can never be less than (e−1)/e ≈ 0.63 times the benefit of the optimal choice of pre-aggregated queries.
queries.<br />

5.4 Answering Scaling Operations Using Pre-Aggregated Data

We say that a pre-aggregate p answers query q if there exists some other query q′ which, when executed on the result of p, provides the result of q. The result can be either exact with respect to q (q′ ∘ p ≡ q), or only an approximation (q′ ∘ p ≈ q). In practice, the result is often an approximation because of the effect of resampling the original dataset. The same effect is observed in the traditional image pyramids approach, but it is considered negligible since approximations are good enough for many applications. In our approach, when two or more pre-aggregates qualify for computing a given scaling operation, we pick the pre-aggregate with the scale vector closest to the one defined in the scaling operation.

Example 5.1 – Assume the queries listed in Table 5.1 have been pre-aggregated, and suppose we want to compute the following query: q = scale(ras01, [4.0, 4.0, 4.0], bi). From the list of available pre-aggregates, the query can be answered using either p2 or p3. Of these two pre-aggregates, p3 has the scale vector closest to that of q; thus, q′ = scale(p3, [0.87, 0.87, 0.87], bi). Note that q′ represents the scaling operation rewritten in terms of the pre-aggregate. ✷

Table 5.1. Sample Pre-Aggregates

Raster Object ID   Raster Name   Scale Vector       Resampling Method
p1                 ras01         [2.0, 2.0, 2.0]    nn
p2                 ras01         [3.0, 3.0, 3.0]    bi
p3                 ras01         [3.5, 3.5, 3.5]    bi
p4                 ras01         [6.0, 6.0, 6.0]    bi

The REWRITEOPERATION procedure returns, for query q, a query q′ that has been rewritten in terms of a pre-aggregate identified by p_id. The input of the algorithm is the scaling operation q and a set of pre-aggregates P. The algorithm first looks for a FULL-MATCH between q and one of the elements in P; to this end, it verifies that the matching conditions listed in Section 5.2.2 are all satisfied. If a full match is found, it returns the identifier of the matched pre-aggregate. Otherwise, the algorithm verifies the PARTIAL-MATCH conditions for all pre-aggregates in P; all qualified pre-aggregates are added to set S. In the case of a partial match, the algorithm finds the pre-aggregate with the scale vector closest to the one defined in q. REWRITEQUERY rewrites the original query as a function of the selected pre-aggregate, and adjusts the values of the scale vector to perform the complementary scaling operation. The algorithm makes use of the following auxiliary functions:

• FULLMATCH(q, P). Verifies that all full-match conditions are satisfied. If no match is found, it returns 0; otherwise it returns the id of the matching pre-aggregate.

• PARTIALMATCH(q, P). Verifies that all partial-match conditions are satisfied. Each qualified pre-aggregate of P is added to set S.

• CLOSESTSCALEVECTOR(q, S). Compares the scale vectors of q and the elements of S, and returns the identifier (p_id) of the pre-aggregate whose scale vector is closest to that defined for q.

• REWRITEQUERY(q, p_id). Rewrites query q in terms of the selected pre-aggregate and adjusts the scale vector values accordingly.
and adjusts the scale vec<strong>to</strong>r values accordingly.


Algorithm 4 REWRITEOPERATION
Require: A query q, and a set of pre-aggregates P
1: initialize S = {}, p_id = 0
2: p_id = fullMatch(q, P)
3: if (p_id == 0) then
4:     S = partialMatch(q, P)
5:     p_id = closestScaleVector(q, S)
6: end if
7: q′ = rewriteQuery(q, p_id)
8: return q′
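A sketch of Algorithm 4 over the hypothetical ScaleOp representation. How the complementary scale vector is derived depends on the scale-factor semantics; the componentwise ratio used here is our assumption (Example 5.1 reports 0.87 for a 4.0 query over a 3.5 pre-aggregate, i.e., a different convention than the plain ratio):

    from typing import Dict

    def rewrite_operation(q: ScaleOp, preaggs: Dict[str, ScaleOp]) -> ScaleOp:
        """Rewrite query q against the best pre-aggregate.
        `preaggs` maps a pre-aggregate id to its defining scaling operation."""
        # 1) Full match: the stored result answers q directly (factor 1 per dimension).
        for pid, p in preaggs.items():
            if full_match(q, p):
                return ScaleOp(pid, tuple(1.0 for _ in q.scale_vector), q.res_meth)
        # 2) Partial match: collect all qualifying pre-aggregates into S.
        s = {pid: p for pid, p in preaggs.items() if partial_match(p, q)}
        if not s:
            raise LookupError("no pre-aggregate can answer q")
        # 3) Pick the pre-aggregate whose scale vector is closest to q's.
        pid, p = min(s.items(), key=lambda kv: sum(
            (a - b) ** 2 for a, b in zip(kv[1].scale_vector, q.scale_vector)))
        # 4) Complementary scaling: remaining factor per dimension (assumed ratio).
        comp = tuple(sq / sp for sq, sp in zip(q.scale_vector, p.scale_vector))
        return ScaleOp(pid, comp, q.res_meth)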

5.5 Experimental Results

Experiments were conducted to evaluate the effectiveness of the pre-aggregation selection and rewriting algorithms in supporting scaling operations. They were run on a machine with a 3.00 GHz Intel Pentium 4 processor running SuSE Linux 9.1, with a total of 512 MB of physical memory.

The query workload consisted of scaling operations with different scaling vectors. Different data distributions of the query workload were also considered. Despite the growing popularity of Web mapping services for GIS raster information processing, very few studies report on user behavior while using those services. One of the primary reasons for the lack of research in this area may be the limited availability of the datasets outside of specialized research groups. Moreover, while query patterns related to scaling operations on 2D datasets are difficult to find, no empirical workload distributions were found for datasets of higher dimensionality. We therefore resorted to using a set of artificial distributions that cover many practical situations in GIS and remote-sensing imaging.

Most pre-aggregation algorithms in OLAP and image pyramids assume a uniform distribution of the values given for the scale vector in the query workload, so we considered the same type of distribution for our experiments. Furthermore, we also considered a Poisson distribution of the scale vector values. The rationale is that such a distribution covers situations where the dataset is scaled down by factors that typically fall within a narrow range of scale vectors. For example, very large objects may need to be scaled down by large scale vectors so they can be efficiently transferred back and forth via Web services [77]. We also considered applications where the dataset is scaled down by the same scale vector; we refer to such an access pattern as a peak distribution. Finally, we investigated a step distribution that covers cases where scaling operations can be grouped within specific ranges of scale vectors. A sketch of how such workloads can be generated follows.
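The following sketch generates synthetic 2D workloads over the hypothetical ScaleOp type for each of the four patterns; all parameter choices are illustrative, and the Poisson case uses a normal approximation so that only the standard library is needed:

    import random
    from typing import List

    def synthetic_workload(n: int, pattern: str) -> List[ScaleOp]:
        """Generate n coupled 2D scale factors under one of the four patterns."""
        if pattern == "uniform":      # factors 2..256, all equally likely
            factors = [random.randint(2, 256) for _ in range(n)]
        elif pattern == "poisson":    # narrow band around a mean of 50 (normal approx.)
            factors = [max(2, round(random.gauss(50, 50 ** 0.5))) for _ in range(n)]
        elif pattern == "peak":       # every query uses the same factor
            factors = [100] * n
        elif pattern == "step":       # a few disjoint ranges of factors
            factors = [random.choice((random.randint(6, 19),
                                      random.randint(75, 200))) for _ in range(n)]
        else:
            raise ValueError(f"unknown pattern: {pattern}")
        return [ScaleOp("R1", (float(f), float(f)), "nn") for f in factors]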

Our experiments were performed on datasets generated from three real-life raster objects:

• Dataset R1. A 2D raster object with spatial domain [0:15359, 0:10239]. The dataset contains 600 tiles, each with a spatial domain of [0:512, 0:512]. The total number of cells composing the raster object is 157 million.

• Dataset R2. A 3D raster object with spatial domain [0:11299, 0:10459, 0:3650]. The dataset contains 3214 tiles, each with a spatial domain of [0:512, 0:512, 0:512]. The total number of cells composing the raster object is 43 trillion.

• Dataset R3. A 4D raster object with spatial domain [0:10150, 0:7259, 0:2430, 0:75640]. The dataset contains 197,070 tiles, each with a spatial domain of [0:512, 0:512, 0:512, 0:512]. The total number of cells composing the raster object is 1.35e+16.

In the rest of this section, we present the results of our experiments according to the dimensionality of the data.

5.5.1 2D Datasets

In this experiment the workload consisted of 12,800 scaling operations defined for dataset R1.

Uniform Distribution

The scaling vectors of the queries in the workload were uniformly distributed. Scale vectors were integers ranging from 2 to 256. Following observations from practice, we assumed that both dimensions were coupled. We considered a storage space constraint of 35%, which is slightly higher than the additional storage space taken by image pyramids. The PRE-AGGREGATESSELECTION algorithm yields 12 pre-aggregates for this test, executing scaling operations with scale vectors 2, 4, 6, 11, 15, 22, 32, 46, 67, 95, 137, and 182. The cost of computing the workload using these pre-aggregates is 18,565. In contrast, image pyramids select scaling operations with scale vectors 2, 4, 8, 16, 32, 64, 128, and 256, require 33% additional storage space, and compute the workload at a cost of 29,166. The results of this experiment show that the pre-aggregates selected by our algorithm provide improved performance for scaling operations over image pyramids: the cost of computing the workload using our algorithm is 36% less than that incurred by image pyramids, at a price of 2% additional storage space.
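The 33% figure for 2D pyramids follows from a geometric series: a level downscaled by 2^k per axis stores (1/4)^k of the base cells, so all levels together add

$$\sum_{k=1}^{\infty} \left(\frac{1}{4}\right)^{k} \;=\; \frac{1/4}{1 - 1/4} \;=\; \frac{1}{3} \;\approx\; 33\%$$

of the base object's size, regardless of how many levels beyond the first few are materialized.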

Fig. 5.2(a) shows the distribution of the scale vectors of all queries in the workload. The pre-aggregates selected by image pyramids and by our pre-aggregation selection algorithm are shown in Fig. 5.2(b) and 5.2(c), respectively.

[Figure 5.2. Query Workload with Uniform Distribution: (a) query workload (uniform distribution); (b) queries selected for materialization by image pyramids; (c) queries selected for materialization by our pre-aggregation selection algorithm]

Poisson Distribution

The workload for this experiment consisted of scaling operations whose scale vectors followed a Poisson distribution with a mean scale vector value of 50. The PRE-AGGREGATESSELECTION algorithm yields 33 pre-aggregates for this test, executing scaling operations with scale vectors from 34 to 66. The cost of computing the workload using these pre-aggregates is 42,455. In contrast, image pyramids select scaling operations with scale vectors 2, 4, 8, 16, 32, 64, and 128, and compute the workload at a cost of 95,468. Thus, the cost of computing the workload using pre-aggregates selected by our algorithm is 55% less than that incurred using image pyramids. There is also a major difference with respect to the additional storage space required by the two approaches: image pyramids require 33% additional storage space, while our algorithm requires only 5% additional space to store the selected pre-aggregates.

[Figure 5.3. Query Workload with Poisson Distribution: (a) query workload (Poisson distribution)]

Fig. 5.3(a) shows the distribution of the scale vectors of all queries in the workload. The pre-aggregates selected by image pyramids are shown in Fig. 5.4(a). Even though there are no queries in the workload with scale factors smaller than 33, image pyramids still allocate space for pre-aggregates 2, 4, 8, 16, and 32, which account for much of the overall space requirement (33%). In contrast, our algorithm uses the query frequencies in the workload to select the queries for pre-aggregation; see Fig. 5.4(b). For this workload configuration, it is possible to pre-aggregate all distinct queries and provide much faster query response times than image pyramids. This shows the benefit of considering query frequencies in the workload. If we pick a mean higher than 50, the additional storage space needed by the pre-aggregates is minimal. Conversely, if the mean is shifted to a lower scale vector value, e.g., 16, the storage space needed by our pre-aggregation algorithm can increase up to 35%.

[Figure 5.4. Selected Queries for Pre-Aggregation: (a) queries selected by image pyramids; (b) queries selected by our pre-aggregation selection algorithm]

Peak Distribution

In this experiment, the query workload consisted of scaling operations with a scale vector having a value of 100 in each dimension. The PRE-AGGREGATESSELECTION algorithm yields one pre-aggregate for this test, executing a scaling operation with scale vector [100, 100]. The cost of computing the workload using this pre-aggregate is 1.27e+08. In contrast, image pyramids select scaling operations with scale factor values 2, 4, 8, 16, 32, 64, and 128 in each dimension, and the cost of computing the workload is 3.01e+08. Thus, the cost of computing the workload using the pre-aggregate selected by our algorithm is 58% less than the cost incurred by image pyramids. Furthermore, there is a major difference with respect to the storage space required by the two approaches: image pyramids require 33% additional storage space, while our algorithm requires only 5% additional space.

Fig. 5.5(a) shows the distribution of the scale vectors for all queries in the workload. The pre-aggregates selected by image pyramids are shown in Fig. 5.6(a): image pyramids allocate space for pre-aggregates with scale factors 2, 4, 8, 16, 32, 128, and 256 in each dimension. In contrast, our pre-aggregation selection algorithm selected a single query, shown in Fig. 5.6(b). Although our algorithm makes more efficient use of storage space and computes the workload faster than image pyramids, this kind of scenario is unlikely to occur in practice, and the storage overhead is simply not justified for it. However, users may benefit from having a system that automatically pre-aggregates such operations with minimum overhead, a capability that our algorithm provides.

[Figure 5.5. Query Workload with Peak Distribution: (a) query workload (peak distribution)]

[Figure 5.6. Selected Queries for Pre-Aggregation: (a) queries selected by image pyramids; (b) queries selected by our pre-aggregation selection algorithm]

Step Distribution

We now consider a scenario where scale vectors are distributed over various ranges of frequencies, i.e., in a step distribution. The PRE-AGGREGATESSELECTION algorithm yields 6 pre-aggregates for this test, executing scaling operations with scale vectors 6, 8, 13, 19, 75, and 200. The cost of computing the workload using these pre-aggregates is 1.5e+09. In contrast, image pyramids select scaling operations with scale vectors 2, 4, 8, 16, 32, 64, and 128, and the cost of computing the workload is 2.21e+09. The cost of computing the workload using the pre-aggregates selected by our algorithm is therefore 32% less than that incurred by image pyramids. Moreover, there is a major difference with respect to the additional storage space required by the two approaches: image pyramids require 33% additional storage space, while our algorithm requires only 15% additional space.

[Figure 5.7. Query Workload with Step Distribution: (a) query workload (step distribution)]

Fig. 5.7(a) shows the distribution of the scale vectors for all queries in the workload. The pre-aggregates selected by image pyramids are shown in Fig. 5.8(a), and those selected by our algorithm in Fig. 5.8(b).

[Figure 5.8. Selected Queries for Pre-Aggregation: (a) queries selected by image pyramids; (b) queries selected by our pre-aggregation selection algorithm]

5.5.2 3D Datasets

To test our pre-aggregation algorithms on 3D time-series datasets, we picked four data distribution patterns for the scaling vectors. For simplicity, we have labeled the dimensions x, y, and t, respectively. The following assumption (taken from observation in practice) is common to each data distribution type: the scale vector along the first two dimensions is the same, i.e., x = y. The aim of this test is to measure average query costs while varying the storage space available for pre-aggregation.

Uniform distribution in x, y, t

In this experiment, the workload consisted of 10,000 scaling operations referring to the 3D dataset R2 described at the beginning of this section. Scale vectors were uniformly distributed along the x, y, and t dimensions, with values ranging from 2 to 256. Fig. 5.9 shows the distribution of the scaling vectors in the workload.

We executed the PRE-AGGREGATESSELECTION algorithm for different values of the storage space constraint (c). The minimum storage space required to support the root node of the lattice was 12.5% of the size of the original dataset. Fig. 5.10 shows the average query cost as storage space is increased. A small amount of storage space dramatically reduces the average query cost. The improvement in average query cost decreases, however, as allocated space goes beyond 36%. Fig. 5.11 shows the scaling operations selected for pre-aggregation when c = 36%. For this instance of the storage space constraint, the algorithm selected 49 pre-aggregates. The total cost of computing the workload is 6.44e+05. In contrast, computing the workload using the original dataset incurs a cost of 1.28e+12.

[Figure 5.9. Workload with Uniform Distribution along x, y, and t]

[Figure 5.10. Average Query Cost over Storage Space]


[Figure 5.11. Selected Pre-Aggregates, c = 36%]

Uniform distribution in x, y and Poisson distribution in t

In this experiment, the workload consisted of 23,460 scaling operations referring to 3D dataset R2. The scale vectors were uniformly distributed along x and y, with a Poisson distribution along t. Values of scale vectors ranged from 2 to 256 in the x and y dimensions, whereas in t they ranged from 8 to 16, with a mean value of 12. Fig. 5.12 shows the distribution of scaling vectors in the workload. Note that the scale vector values in the x and y dimensions are coupled; the frequency of the various scale factor values is denoted by f. We ran the PRE-AGGREGATESSELECTION algorithm for different values of the storage space constraint. The minimum storage space required to support the root node of the lattice was 3.13% of the size of the original dataset. Fig. 5.13 shows the average query cost as storage space increases. A small amount of storage space dramatically improves the average query cost. However, we can also observe that the improvement in average query cost decreases as allocated space goes beyond 26%. Fig. 5.14 shows the scaling operations selected for pre-aggregation when c = 26%. For this instance of the storage space constraint, the algorithm selected 67 pre-aggregates. The total cost of computing the workload is 1.21e+07. In contrast, computing the workload using the original dataset incurs a cost of 2.31e+11.

[Figure 5.12. Workload with Uniform Distribution along x, y, and Poisson Distribution in t]

[Figure 5.13. Average Query Cost as Space is Varied]

Poisson distribution in x, y, t

In this experiment, the workload consisted of 600 scaling operations referring to 3D dataset R2. The scale vectors followed a Poisson distribution along the three dimensions x, y, and t. Values of scale vectors ranged from 2 to 10 in the x and y dimensions, whereas in t they ranged between 8 and 16, with a mean value of 12. Fig. 5.15 shows the distribution of the scaling vectors in the workload. We ran the PRE-AGGREGATESSELECTION algorithm for different values of the storage space constraint. The minimum storage space required to support the root node of the lattice was 4.18% of the size of the original dataset. Fig. 5.16 shows the average query cost as storage space is increased. A small amount of storage space dramatically improves the average query cost. However, the improvement in average query cost decreases as allocated space goes beyond 26%. Fig. 5.17 shows the scaling operations selected for pre-aggregation when c = 30%. For this instance of the storage space constraint, the algorithm selected 23 pre-aggregates. The total cost of computing the workload is 1680. In contrast, computing the workload using the original dataset incurs a cost of 1.34e+12.
1.34e + 12.


[Figure 5.14. Selected Pre-Aggregates, c = 26%]

[Figure 5.15. Workload with Poisson Distribution along x, y, and t]

[Figure 5.16. Average Query Cost as Space is Varied]

[Figure 5.17. Selected Pre-Aggregates, c = 30%]

Poisson distribution in x, y and Uniform distribution in t

In this experiment, the workload consisted of 924 scaling operations referring to 3D dataset R2. The scale vectors followed a Poisson distribution along the dimensions x and y, and a uniform distribution along dimension t. Values of scale vectors ranged from 2 to 10 in the x and y dimensions, and were uniformly distributed along t.

[Figure 5.18. Workload with Poisson Distribution along x, y, and Uniform Distribution in t]

Fig. 5.18 shows the distribution of the scaling vectors in the workload. We ran the PRE-AGGREGATESSELECTION algorithm for different values of the storage space constraint. The minimum storage space required to support the root node of the lattice was 4% of the size of the original dataset. Fig. 5.19 shows the average query cost as storage space is increased. A small amount of storage space dramatically improves the average query cost. However, the improvement in average query cost decreases as allocated space goes beyond 21%. Fig. 5.20 shows the scaling operations selected for pre-aggregation when c = 21%. For this instance of the storage space constraint, the algorithm selected 17 pre-aggregates. The total cost of computing the workload is 1472. In contrast, computing the workload using the original dataset incurs a cost of 1.63e+12.

[Figure 5.19. Average Query Cost as Space is Varied]

[Figure 5.20. Selected Pre-Aggregates, c = 21%]

5.5.3 4D Datasets

For 4D datasets, we considered ECHAM T-42 as a typical use case from climate modeling. ECHAM T-42 is an energy and mass budget model developed by the Max Planck Institute for Meteorology [16]. We assumed that dimensions x and y are scaled down by the same scale value; the scale values along z and t, however, may vary according to the specific analysis requirements of a given application. Looking at the sample dimensions of the ECHAM T-42 model shown in Table 5.2, it is clear that the extents of the first three dimensions are much smaller than that of the fourth dimension (time).

In this experiment, the workload consisted of 1,137 scaling operations referring to 4D dataset R3. We assumed that the scale vectors followed a Poisson distribution in each of the four dimensions. The rationale behind this assumption is that scientists are often interested in a highly selective data set, and a Poisson distribution fits this access pattern nicely. Values of scale vectors ranged from 2 to 11 in the x and y dimensions with a mean of 6; from 10 to 19 along the z dimension with a mean of 14; and from 230 to 239 along t with a mean of 234. Table 5.3 shows the distribution of the scale factors of all scaling operations in the workload.

We ran the PRE-AGGREGATESSELECTION algorithm for different values of the storage space constraint. The minimum storage space required to support the root node of the lattice was 1.25% of the size of the original dataset. For a constraint of c = 1.3%, the algorithm selected the 4 pre-aggregates shown in Table 5.4. The total cost of computing the workload is 3361. In contrast, computing the workload using the original dataset incurs a cost of 1.35e+16.

Table 5.2. ECHAM T-42 Climate Simulation Dimensions

Dimension              Extent
Longitude              128
Latitude               64
Elevation              17
Time (24 min/slice)    200 years (2,190,000)

Table 5.3. 4D Scaling: Scale Vector Distribution

Scale Vector         Count
[2, 2, 10, 230]      200
[3, 3, 11, 231]      300
[4, 4, 12, 232]      500
[5, 5, 13, 233]      800
[6, 6, 14, 234]      1000
[7, 7, 15, 235]      1000
[8, 8, 16, 236]      800
[9, 9, 17, 237]      500
[10, 10, 18, 238]    300
[11, 11, 19, 239]    200

Table 5.4. 4D Scaling: Selected Pre-Aggregates

Scale Vector         Count
[2, 2, 10, 230]      200
[4, 4, 12, 232]      500
[6, 6, 14, 234]      1000
[8, 8, 16, 236]      800

5.6 Summary

This chapter describes our investigation of the problem of intelligently picking a subset of scaling operations for pre-aggregation given a storage space constraint. There is a trade-off between the amount of space allocated for pre-aggregation and the average query cost of scaling operations. We introduced a pre-aggregation selection algorithm, based on a given query workload, that determines a set of pre-aggregates in the face of storage space constraints.

We performed experiments on 2D, 3D, and 4D datasets using different data distribution patterns for the scale vectors. We relied on artificial data distributions since no empirical distributions were found. In addition to uniformly distributed scale vectors, we considered non-uniform distributions including Poisson, peak, and step. For 2D datasets, we showed that our algorithm performs better than image pyramids. In particular, for non-uniform data distributions, our pre-aggregation selection algorithm not only provides a lower average query cost, but makes much more efficient use of storage space. This is because our algorithm considers the frequency of each query, and the cost savings (benefit) it provides for computing the workload. Nevertheless, the major advantage of our algorithm over image pyramids is not the improved average query cost, but the reduced amount of storage space required for the pre-aggregates, especially for non-uniform distributions.

In our experiments with 3D and 4D datasets, we showed the effect of the storage space available for pre-aggregation on average query costs. We observed that a small amount of storage overhead is sufficient to dramatically reduce average query costs. Since there are no similar techniques against which we can compare our results, we compared them against the average query costs obtained by using the original data.




Chapter 6

Conclusion

One of the biggest challenges of database technology is to effectively and efficiently provide solutions for archiving and managing extremely large volumes of multidimensional array data. This thesis investigates the problem of applying OLAP pre-aggregation technology to speed up aggregate query processing in array databases for GIS and remote-sensing imaging applications.

We presented a study of fundamental imaging operations in GIS. By using a formal algebraic framework, Array Algebra, we were able to classify GIS operations according to three basic algebraic operators, and thus to identify a set of operations that can benefit from pre-aggregation techniques. We argued that OLAP pre-aggregation techniques cannot be applied in a straightforward manner to array databases for our target applications, because, although similar, the data structures in the two application domains differ in fundamental aspects. In OLAP, multidimensional data spaces are spanned by axes, with cell values sitting on the grid at intersection points. This is paralleled by raster image data, which are discretized during acquisition; thus, the structure of an OLAP data cube is rather similar to a raster array. Dimension hierarchies in OLAP serve to group value ranges along an axis: querying data by referring to coordinates on the measure axes yields ground data, whereas queries using axes higher up in a dimension hierarchy return aggregated values. A main differentiating criterion between OLAP data and raster image data is density: OLAP data are sparse, typically 5%, whereas raster image datasets are 100% dense. Note also that dimensions in OLAP are treated as business perspectives, such as products and/or stores; these are non-spatial dimensions, in contrast with the spatial nature of raster image datasets. There are, however, core similarities that motivated us to further research OLAP pre-aggregation techniques. For example, array databases and OLAP systems both employ multidimensional data models to organize their data. The operations also convey a high degree of similarity: a roll-up (aggregate) operation in OLAP is very similar to a scaling operation in a raster domain. Moreover, both application domains make use of pre-aggregation approaches to speed up query processing, although at different levels of maturity and scalability.

We presented a framework that focuses on computing basic aggregate operations using pre-aggregated data. We argued that the decision to compute an aggregate query using pre-aggregated data is influenced by the structural characteristics of the query and the pre-aggregate. Thus, by comparing the query tree structures of the two, one can determine whether the pre-aggregated result contributes fully or partially to the final answer of the query. The best case occurs when there is a full match between the query and the pre-aggregate, since the time taken to compute the query is reduced to the time it takes to retrieve the result. In the case of partial matching, however, several pre-aggregates may be considered for computing the answer to a query, and a decision has to be made as to which pre-aggregate provides the best performance in terms of execution time. To this end, we distinguished between different pre-aggregates and presented a cost model to calculate the cost of using each qualifying pre-aggregate. We then presented an algorithm that selects the best execution plan for evaluating a query considering pre-aggregated data. Tests performed on real-life raster image datasets showed that our distinction between different types of pre-aggregates is useful for determining the pre-aggregate that provides the highest benefit (in terms of execution time) for computing a given query.

We then described the issues that arise when attempting to generalize our pre-aggregation framework to support more complex aggregate operations, and justified our decision to focus on one particular operation: scaling. Traditionally, 2D scaling operations have been performed using image pyramids. Practice shows that pyramids are typically constructed in scale levels of powers of 2, thus yielding scale vectors 2, 4, 8, 16, 32, 64, 128, 256, and 512. The materialization of the pyramid requires an estimated 33% additional storage space. Our pre-aggregation selection algorithm is similar to the pyramid approach in that it selects a set of queries for materialization, where each level corresponds to a scaling operation with a defined scale factor. However, the selection of such queries is not restricted to a fixed number of levels interleaved by a power of two. Instead, our selection algorithm considers the frequency of each query in the workload, and how the results of each individual query can help to reduce the overall cost of computing the workload. We compared the performance of our pre-aggregation algorithm against that of image pyramids: the results showed that for workloads with uniformly distributed scale vectors, our algorithm computes the workload 36% more cheaply than image pyramids, while requiring 7% more space than image pyramids. For scale vectors following a Poisson distribution, our algorithm computes the workload at a cost 55% lower than the pyramid approach. Further, our algorithm can be applied to datasets of higher dimensions, a feature not supported by traditional image pyramids.

6.1 Future Work

There are natural extensions to this work that would help expand and strengthen the results. One area of further work is adding self-management capabilities, so that the DBMS maintains statistics about each scaling operation appearing within the incoming queries and, at some suitable time, adjusts the pre-aggregate set accordingly; OLAP dynamic pre-aggregation addresses a similar problem. Another area is applying the results studied here to the many real-world situations where data cubes contain one or more non-spatio-temporal dimensions, such as pressure, which is common in meteorological and oceanographic data sets.

Workload distribution deserves further investigation. While the distributions chosen are practical and relevant, there might be further situations worth considering. Gaining empirical figures from user-exposed services like EarthLook (www.earthlook.org) could be useful for tuning our pre-aggregation selection algorithms. Further investigation is also necessary in the realm of rewriting scaling operations. In OLAP applications, there is a trade-off between speed and accuracy; but accuracy may be critical for certain geo-raster applications, so solutions to the query rewriting problem must weigh these two aspects according to user data analysis requirements. Moreover, they must consider the fact that the same dataset may be accessed by various users with totally different analysis needs.




Bibliography<br />

[1] Blakeley J. A., Larson P-K., and Tompa F. Efficiently updating materialized<br />

views. In SIGMOD Rec., volume 15, pages 61–71, New York, NY, USA, 1986.<br />

ACM.<br />

[2] Burrough P. A. and McDonell R. A. Principles of Geographical Information<br />

Systems. Oxford, 2004.<br />

[3] Dehmel A. A Compression Engine for Multidimensional Array Database Systems.<br />

PhD thesis, Technical <strong>University</strong> Munich, Germany, 2002.<br />

[4] Dobra A., Garofalakis M., Gehrke J., and Ras<strong>to</strong>gi R. Processing complex aggregate<br />

queries over data streams. In Proceedings of the 2002 ACM SIGMOD<br />

international conference on Management of data, pages 61–72, New York, NY,<br />

USA, 2002. ACM.<br />

[5] Garcia-Gutierrez A. <strong>Applying</strong> <strong>OLAP</strong> pre-aggregation techniques <strong>to</strong> speed up<br />

query processing in raster-image databases. In GI-Days 2007 - Young Researchers<br />

Forum, pages 189–191, Muenster, Germany, 2007. IfGIprints 30.<br />

[6] Garcia-Gutierrez A. <strong>Applying</strong> <strong>OLAP</strong> pre-aggregation techniques <strong>to</strong> speed up<br />

query response times in raster image databases. In ICSOFT (ISDM/EHST/DC),<br />

pages 259–266, 2007.<br />

[7] Garcia-Gutierrez A. Modeling geo-raster operations with array algebra. In<br />

Technical Report (7), 2007.<br />

[8] Garcia-Gutierrez A. and Baumann P. Modeling fundamental geo-raster operations<br />

with array algebra. In ICDM Workshops, pages 607–612, 2007.<br />

[9] Garcia-Gutierrez A. and Baumann P. Computing aggregate queries in raster image<br />

databases using pre-aggregated data. In Proceedings of the International<br />

Conference on Computer Science and Applications, pages 84–89, San Francisco,<br />

CA, USA, 2008.<br />

[10] Garcia-Gutierrez A. and Baumann P. Using pre-aggregation <strong>to</strong> speed up scaling<br />

operations on massive spatio-temporal data. In 29th International Conference<br />

on Conceptual Modeling, November 2010.<br />

107


108 Bibliography<br />

[11] Gupta A. and Mumick I. S. Maintenance of materialized views: Problems, techniques, and applications. In IEEE Data Engineering Bulletin, volume 18, pages 3–18, 1995.

[12] Gupta A. and Mumick I. S. Materialized Views. The MIT Press, 2007.

[13] Gupta A., Harinarayan V., and Quass D. Aggregate-query processing in data warehousing environments. In Proceedings of the 21st International Conference on Very Large Data Bases, pages 358–369, San Francisco, CA, USA, 1995. Morgan Kaufmann Publishers Inc.

[14] Kitamoto A. Multiresolution cache management for distributed satellite image database using NACSIS-Thai international link. In Proceedings of the 6th International Workshop on Academic Information Networks and Systems (WAINS), pages 243–250, 2000.

[15] Koeller A. and Rundensteiner E. A. Incremental maintenance of schema-restructuring views in SchemaSQL. In IEEE Transactions on Knowledge and Data Engineering, volume 16, pages 1096–1111, Piscataway, NJ, USA, 2004. IEEE Educational Activities Department.

[16] Lauer A., Hendricks J., Ackermann I., Schell B., Hass H., and Metzger S. Simulating aerosol microphysics with the ECHAM/MADE GCM; Part I: Model description and comparison with observations. In Atmospheric Chemistry and Physics, volume 5, pages 3251–3276, 2005.

[17] Shukla A., Deshpande P., and Naughton J. F. Materialized view selection for multidimensional datasets. In Proceedings of the 24th International Conference on Very Large Data Bases, pages 488–499, San Francisco, CA, USA, 1998. Morgan Kaufmann Publishers Inc.

[18] Spokoiny A. and Shahar Y. An active database architecture for knowledge-based incremental abstraction of complex concepts from continuously arriving time-oriented raw data. In Journal on Intelligent Information Systems, volume 28, pages 199–231, Hingham, MA, USA, 2007. Kluwer Academic Publishers.

[19] Aronoff S. Geographic information systems: A management perspective. WDL Publications, 1991.

[20] American National Standards Institute Inc. (ANSI). ANSI/ISO/IEC 9075-2:2008, International Organization for Standardization (ISO), Information Technology – Database Languages – SQL – Part 2: Foundation (SQL/Foundation). Technical report, American National Standards Institute, 2008.

[21] Barbará D. and Imielinski T. Sleepers and workaholics: Caching strategies in mobile environments. In SIGMOD Conference, pages 1–12, 1994.



[22] Moon B., Vega-Lopez I. F., and Vijaykumar I. Scalable algorithms for large temporal aggregation. In Proceedings of the 16th International Conference on Data Engineering, page 145, Washington, DC, USA, 2000. IEEE Computer Society.

[23] Reiner B. HEAVEN – A Hierarchical Storage and Archive Environment for Multidimensional Array Database Management Systems. PhD thesis, Technical University Munich, Germany, 2004.

[24] Reiner B. and Hahn K. Tertiary storage support for large-scale multidimensional array database management systems, 2002.

[25] Reiner B., Hahn K., Hoefling G., and Baumann P. Hierarchical storage support and management for large-scale multidimensional array database management systems. In Proceedings of the 13th International Conference on Database and Expert Systems Applications (DEXA), Aix-en-Provence, 2002.

[26] Sapia C. PROMISE: Predicting query behavior to enable predictive caching strategies for OLAP systems. In Proceedings of the 2nd International Conference on Data Warehousing and Knowledge Discovery, pages 224–233, London, UK, 2000. Springer-Verlag.

[27] Open GIS Consortium. Web Coverage Processing Service (WCPS). Best Practices Document No. 06-035r1, pages 21–47, 2006.

[28] The OLAP Council. Efficient storage and management of environmental information. www.olapreport.com, accessed July 11, 2002.

[29] The OLAP Council. APB-1 OLAP Benchmark Release II. http://www.olapcouncil.org/research/resrchly.htm, accessed July 11, 2010.

[30] Cudre-Mauroux P., Kimura H., Lim K.-T., Rogers J., Simakov R., Soroush E., Velikhov P., Wang D. L., Balazinska M., Becla J., DeWitt D., Heath B., Maier D., Madden S., Patel J., Stonebraker M., and Zdonik S. A demonstration of SciDB: A science-oriented DBMS. In Proceedings of the VLDB Endowment, volume 2, pages 1534–1537. VLDB Endowment, 2009.

[31] Chatziantoniou D. Ad hoc OLAP: Expression and evaluation. In Proceedings of the 15th International Conference on Data Engineering, page 250, Washington, DC, USA, 1999. IEEE Computer Society.

[32] O'Sullivan D. and Unwin D. Geographic Information Analysis. John Wiley, 2003.

[33] Quass D. Maintenance expressions for views with aggregation. In VIEWS, pages 110–118, 1996.



[34] Tveito I. D., Dobesch H., Grueter E., Perdigao A., Tveito O. E., Thornes J. E., Van der Wel F., and Bottai L. The use of geographic information systems in climatology and meteorology. Final Report, COST Action 719, 2006.

[35] Nguyen D. H. Using JavaScript for some interactive operations in virtual geographic model with GeoVRML. In Proceedings of the International Symposium on Geoinformatics for Spatial Infrastructure Development in Earth and Allied Sciences, 2006.

[36] Adiba M. E. and Lindsay B. G. Database snapshots. In Proceedings of the Sixth International Conference on Very Large Data Bases, October 1–3, 1980, Montreal, Quebec, Canada, pages 86–91. IEEE Computer Society, 1980.

[37] Thomsen E. OLAP Solutions: Building Multidimensional Information Systems. John Wiley and Sons, 1997.

[38] Codd E. F., Codd S. B., and Salley C. T. Beyond decision support. In Computerworld, volume 27, 1993.

[39] Codd E. F., Codd S. B., and Salley C. T. Providing OLAP (on-line analytical processing) to user-analysts: An IT mandate. In Technical Report, 1993.

[40] Vega-Lopez I. F., Snodgrass R. T., and Moon B. Spatiotemporal aggregate computation: A survey. In IEEE Transactions on Knowledge and Data Engineering, volume 17, pages 271–286, Piscataway, NJ, USA, 2005. IEEE Educational Activities Department.

[41] Colliat G. OLAP, relational, and multidimensional database systems. In SIGMOD Rec., volume 25, pages 64–69, New York, NY, USA, 1996. ACM.

[42] Pestana G., da Silva M. M., and Bedard Y. Spatial OLAP modeling: An overview based on spatial objects changing over time. In IEEE 3rd International Conference on Computational Cybernetics, pages 149–154, April 2005.

[43] Wiederhold G., Jajodia S., and Litwin W. Dealing with granularity of time in temporal databases. In Proceedings of the 3rd International Conference on Advanced Information Systems Engineering, pages 124–140, New York, NY, USA, 1991. Springer-Verlag New York, Inc.

[44] García-Molina H., Ullman J. D., and Widom J. Database Systems: The Complete Book. Prentice Hall, 2002.

[45] Samet H. Foundations of Multidimensional and Metric Data Structures. Morgan Kaufmann Publishers, 2006.

[46] ERDAS IMAGINE. ERDAS Field Guide. 1997.



[47] ESRI Inc. ArcGIS 9 Geoprocessing Commands, Quick Reference Guide. ArcGIS, 2004.

[48] ISO. 19123:2005 Geographic Information – Coverage Geometry and Functions, 2005.

[49] Albrecht J. Universal analytical GIS operations – a task-oriented systematization of data structure-independent GIS functionality. In Geographic Information Research – Transatlantic Perspectives, pages 577–591, 1998.

[50] Boettger J., Preiser M., Balzer M., and Deussen O. Detail-in-context visualization for satellite imagery. Volume 27, pages 587–596, 2008.

[51] Burt P. J. and Adelson E. H. The Laplacian pyramid as a compact code. In IEEE Transactions on Communications, number 31, pages 532–540, 1983.

[52] Han J., Stefanovic N., and Koperski K. Selective materialization: An efficient method for spatial data cube construction. In Proceedings of the Second Pacific-Asia Conference on Research and Development in Knowledge Discovery and Data Mining, pages 144–158, London, UK, 1998. Springer-Verlag.

[53] Nievergelt J., Hinterberger H., and Sevcik K. C. The grid file: An adaptable, symmetric multikey file structure. In ACM Transactions on Database Systems, volume 9, pages 38–71, 1984.

[54] Peuquet D. J. Making space for time: Issues in space-time data representation. In Geoinformatica, volume 5, pages 11–32, Hingham, MA, USA, 2001. Kluwer Academic Publishers.

[55] Whang K. J. and Krishnamurthy R. The multilevel grid file – a dynamic hierarchical multidimensional file structure. In DASFAA, pages 449–459, 1991.

[56] Berry J. K. and Tomlin C. D. A Mathematical Structure for Cartographic Modeling in Environmental Analysis. In Proceedings of the American Congress on Surveying and Mapping, pages 269–283, 1979.

[57] Choi K. and Luk W. Processing aggregate queries on spatial OLAP data. In Proceedings of the 10th International Conference on Data Warehousing and Knowledge Discovery, pages 125–134, Berlin, Heidelberg, 2008. Springer-Verlag.

[58] Hornsby K. and Egenhofer M. J. Shifts in detail through temporal zooming. In International Workshop on Database and Expert Systems Applications, page 487, Los Alamitos, CA, USA, 1999. IEEE Computer Society.

[59] Hornsby K. and Egenhofer M. J. Identity-based change: A foundation for spatio-temporal knowledge representation. In International Journal of Geographical Information Science, volume 14, pages 207–224, 2000.



[60] Ramachandran K., Shah B., and Raghavan V. V. Dynamic pre-fetching of views based on user-access patterns in an OLAP system. In ICEIS (1), pages 60–67, 2005.

[61] Sellis T. K. Multiple-query optimization. In ACM Trans. Database Syst., volume 13, pages 23–52, New York, NY, USA, 1988. ACM.

[62] Shim K., Sellis T., and Nau D. Improvements on a heuristic algorithm for multiple-query optimization. In Data and Knowledge Engineering, volume 12, pages 197–222, 1994.

[63] Libkin L., Machlin R., and Wong L. A query language for multidimensional arrays: Design, implementation, and optimization techniques. In SIGMOD Rec., volume 25, pages 228–239, New York, NY, USA, 1996. ACM.

[64] Usery E. L., Finn M. P., Scheidt D. J., Ruhl S., Beard T., and Bearden M. Geospatial data resampling and resolution effects on watershed modeling: A case study using the agricultural non-point source pollution model. In Journal of Geographical Systems, volume 6, pages 289–306, 2004.

[65] Yong K. L. and Kim M. H. Optimizing the incremental maintenance of multiple join views. In Proceedings of the 8th ACM International Workshop on Data Warehousing and OLAP, pages 107–113, New York, NY, USA, 2005. ACM.

[66] Benedikt M. and Libkin L. Exact and approximate aggregation in constraint query languages. In Proceedings of the 18th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, pages 102–113, New York, NY, USA, 1999. ACM.

[67] Gertz M., Hart Q., Rueda C., Singhal S., and Zhang J. A data and query model for streaming geospatial image data. In EDBT Workshops, pages 687–699, 2006.

[68] Golfarelli M. and Rizzi S. Data Warehouse Design: Modern Principles and Methodologies. McGraw Hill, 2009.

[69] Gyssens M. and Lakshmanan L. V. A foundation for multi-dimensional databases. Pages 106–115, 1997.

[70] Ogden J. M., Adelson E. H., Bergen J. R., and Burt P. J. Pyramid methods in computer graphics, 1985.

[71] Beckmann N., Kriegel H. P., Schneider R., and Seeger B. The R*-tree: An efficient and robust access method for points and rectangles. In SIGMOD Rec., volume 19, pages 322–331, New York, NY, USA, 1990. ACM.

[72] Roussopoulos N. Materialized views and data warehouses. In SIGMOD Record, volume 27, pages 21–26, 1997.



[73] Stefanovic N., Han J., and Koperski K. Object-based selective materialization for efficient implementation of spatial data cubes. In IEEE Transactions on Knowledge and Data Engineering, volume 12, pages 938–958, Piscataway, NJ, USA, 2000. IEEE Educational Activities Department.

[74] Widmann N. and Baumann P. Performance evaluation of multidimensional array storage techniques in databases. In Proceedings of the IDEAS Conference, 1999.

[75] Baumann P. Management of multidimensional discrete data. In The VLDB Journal, volume 3, pages 401–444, Secaucus, NJ, USA, 1994. Springer-Verlag New York, Inc.

[76] Baumann P. A database array algebra for spatio-temporal data and beyond. In Next Generation Information Technologies and Systems, pages 76–93, 1999.

[77] Baumann P. Web-enabled raster GIS services for large image and map databases. In Proceedings of the 12th International Workshop on Database and Expert Systems Applications, page 870, Washington, DC, USA, 2001. IEEE Computer Society.

[78] Baumann P. Web Coverage Processing Service (WCPS) implementation specification. Number 08-068, OGC, 1.0.0 edition, 2008.

[79] Furtado P. and Baumann P. Storage of multidimensional arrays based on arbitrary tiling. In Proceedings of the 15th International Conference on Data Engineering, page 480, Washington, DC, USA, 1999. IEEE Computer Society.

[80] Marathe A. P. and Salem K. A language for manipulating arrays. In Proceedings of the 23rd International Conference on Very Large Data Bases (VLDB '97), pages 46–55, San Francisco, CA, USA, 1997. Morgan Kaufmann Publishers Inc.

[81] Vassiliadis P. Modeling multidimensional databases, cubes and cube operations. In Proceedings of the 10th International Conference on Scientific and Statistical Database Management, pages 53–62, Washington, DC, USA, 1998. IEEE Computer Society.

[82] Burt P. J. Fast filter transforms for image processing. In Computer Graphics and Image Processing, number 16, pages 16–51, 1981.

[83] Agrawal R., Gupta A., and Sarawagi S. Modeling multidimensional databases. In Proceedings of the 13th International Conference on Data Engineering, pages 232–243, Washington, DC, USA, 1997. IEEE Computer Society.

[84] Pieringer R., Markl V., Ramsak F., and Bayer R. HINTA: A linearization algorithm for physical clustering of complex OLAP hierarchies. In DMDW, page 11, 2001.



[85] Chen S., Liu B., and Rundensteiner E. A. Multiversion-based view maintenance over distributed data sources. In ACM Transactions on Database Systems, volume 29, pages 675–709, New York, NY, USA, 2004. ACM.

[86] Prasher S. and Zhou X. Multiresolution amalgamation: Dynamic spatial data cube generation. In Proceedings of the 15th Australasian Database Conference, pages 103–111, Darlinghurst, Australia, 2004. Australian Computer Society, Inc.

[87] Shekhar S. and Xiong H. Encyclopedia of GIS. Springer, 2008.

[88] SYBASE. Sybase solutions guide. http://www.sybase.cz/uploads/CEEMEA_SybaseIQ_FINAL.pdf, accessed July 11, 2010.

[89] Griffin T. and Libkin L. Incremental maintenance of views with duplicates. In SIGMOD Rec., volume 24, pages 328–339, New York, NY, USA, 1995. ACM.

[90] Needham T. Visual Complex Analysis. Oxford University Press, 1998.

[91] Niemi T., Nummenmaa J., and Thanisch P. Normalizing OLAP cubes for controlling sparsity. In Data Knowledge Engineering, volume 46, pages 317–343, Amsterdam, The Netherlands, 2003. Elsevier Science Publishers B. V.

[92] Harinarayan V., Rajaraman A., and Ullman J. D. Implementing data cubes efficiently. In SIGMOD Rec., volume 25, pages 205–216, New York, NY, USA, 1996. ACM.

[93] Schlosser S. W., Schindler J., Papadomanolakis S., Shao M., Ailamaki A., Faloutsos C., and Ganger G. R. On multidimensional data and modern disks. In Proceedings of the 4th USENIX Conference on File and Storage Technologies, pages 225–238. USENIX Association, 2005.

[94] Mingjie X. Experiments on remote sensing image cube and its OLAP. In Proceedings of the IEEE International Geoscience and Remote Sensing Symposium, volume 7, pages 4398–4401, September 2004.

[95] Halevy A. Y. Answering queries using views: A survey. In The VLDB Journal, volume 10, pages 270–294, Secaucus, NJ, USA, December 2001. Springer-Verlag New York, Inc.

[96] Jiebing Y. and DeWitt D. J. Processing satellite images on tertiary storage: A study of the impact of tile size on performance. In Proceedings of the 5th NASA Goddard Conference on Mass Storage Systems and Technologies, pages 460–476, 1996.



[97] Kotidis Y. and Roussopoulos N. A case for dynamic view management. In ACM Transactions on Database Systems, volume 26, pages 388–423, New York, NY, USA, 2001. ACM.

[98] Lee K. Y., Son J. H., and Kim M. H. Efficient incremental view maintenance in data warehouses. In Proceedings of the 10th International Conference on Information and Knowledge Management, pages 349–356, New York, NY, USA, 2001. ACM.

[99] Qingsong Y. and Aijun A. Using user access patterns for semantic query caching. In DEXA, pages 737–746, 2003.

[100] Zhao Y., Deshpande P. M., and Naughton J. F. An array-based algorithm for simultaneous multidimensional aggregates. In SIGMOD Rec., volume 26, pages 159–170, New York, NY, USA, 1997. ACM.

[101] Zhuge Y., García-Molina H., Hammer J., and Widom J. View maintenance in a warehousing environment. In Proceedings of the 1995 ACM SIGMOD International Conference on Management of Data, pages 316–327, New York, NY, USA, 1995. ACM.

[102] Zhuge Y., García-Molina H., and Wiener J. L. Multiple view consistency for data warehousing. In Proceedings of the 13th International Conference on Data Engineering, pages 289–300, Washington, DC, USA, 1997. IEEE Computer Society.

[103] Zhuge Y., García-Molina H., and Wiener J. L. Consistency algorithms for multi-source warehouse view maintenance. In Distributed Parallel Databases, volume 6, pages 7–40, Hingham, MA, USA, 1998. Kluwer Academic Publishers.
