Applying OLAP Pre-Aggregation Techniques to ... - Jacobs University
1. Introduction and Problem Statement
the contents of the BLOB when they wish to operate on the data. The main drawback
of this approach is that it requires either that the entire array be passed to the client or
that the client perform a large number of BLOB input/output (I/O) operations
to read only the required portions of the array. With databases growing beyond a few
tens of terabytes, the analysis of large array datasets is severely limited
by the relatively low I/O performance of most of today's computing platforms.
High-performance numerical simulations are also increasingly affected by the I/O bottleneck.
To improve data management and analytics on large repositories of data, aggregation
has been put forward as a key process for describing data at a high level. An
example of data aggregation is the computation and storage of statistical parameters,
such as count, average, median, and standard deviation. Aggregate computation has
been studied in a variety of settings [4, 21, 66]. In particular, On-Line Analytical
Processing (OLAP) technology has emerged to address the problem of efficiently computing
complex multidimensional aggregate queries on large data warehouses. Most
OLAP systems rely on selecting aggregate combinations and then precomputing
and storing their results so that the database system can use them in
subsequent requests. This process is known as pre-aggregation, and it has been shown to
speed up aggregate queries by several orders of magnitude in business and statistical
applications [31, 41].
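The pre-aggregation process described above can be illustrated with a minimal sketch. The fact table, the dimension names, and the helper functions `pre_aggregate` and `query_sum` are hypothetical and chosen only for illustration. For simplicity the sketch materializes the SUM aggregate for every combination of dimensions, whereas a real OLAP system would typically select only a subset of combinations to store.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical fact table: (region, product, amount). Illustrative data only.
FACTS = [
    ("north", "apples", 10),
    ("north", "pears", 5),
    ("south", "apples", 7),
    ("south", "pears", 3),
    ("south", "apples", 2),
]
DIMENSIONS = ("region", "product")


def pre_aggregate(facts, dimensions):
    """Precompute SUM(amount) for every subset of the dimensions."""
    views = {}
    for k in range(len(dimensions) + 1):
        for group_by in combinations(range(len(dimensions)), k):
            totals = defaultdict(int)
            for row in facts:
                # Group each fact by the values of the chosen dimensions.
                key = tuple(row[i] for i in group_by)
                totals[key] += row[-1]
            views[tuple(dimensions[i] for i in group_by)] = dict(totals)
    return views


def query_sum(views, **filters):
    """Answer a SUM query from the stored views without rescanning the facts."""
    group_by = tuple(d for d in DIMENSIONS if d in filters)
    key = tuple(filters[d] for d in group_by)
    return views[group_by].get(key, 0)


# Pay the aggregation cost once, then serve later requests by lookup.
VIEWS = pre_aggregate(FACTS, DIMENSIONS)
print(query_sum(VIEWS, region="south"))
print(query_sum(VIEWS, product="apples"))
print(query_sum(VIEWS))
```

Each subsequent request becomes a dictionary lookup rather than a scan of the fact table. The trade-off is storage and maintenance cost, which is why the choice of which aggregate combinations to materialize matters.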
While considerable work has been done on the problem of efficiently computing
aggregate queries in OLAP-based applications, such computations continue to be a
data management challenge in scientific applications. A relevant example in which
advanced data management and efficient query processing are highly desirable
is hyper-spectral remote-sensing imaging, in which an imaging spectrometer collects
hundreds or even thousands of measurements for the same area of the surface of the
Earth. The scenes provided by such sensors are often called data cubes to denote
the dimensionality of the data. Notably, efficient query processing and data mining
techniques facilitate the exploration of spatio-temporal data patterns, both interactively
and in batch mode on archived data.
A significant fraction of scientific data is image-based and can be naturally represented
as multidimensional arrays. These datasets fit poorly into relational databases,
which lack efficient support for the concepts of physical proximity and order. Such
datasets are typically stored in array-friendly formats such as HDF5, netCDF, or FITS. The
extremely high computational requirements introduced by image-based scientific applications
make them an excellent case study for our research.
Since array databases and OLAP/data warehousing both deal with large multidimensional
datasets and aggregate queries, adapting OLAP pre-aggregation techniques
to the management and computation of aggregate queries in array databases may offer
substantial benefits. This thesis investigates the application of OLAP pre-aggregation
techniques to speeding up query processing in array databases. In particular,
we focus on enhancing aggregate computation in GIS and remote-sensing imaging
applications. However, the results can be generalized to other domains as well.