SESSION: NOVEL ALGORITHMS AND APPLICATIONS
Int'l Conf. Foundations of Computer Science | FCS'11
need for out-of-core processing for analysis over a longer period of time. The proposed algorithms, Periodic Partial Result Merging and the K-way Merge based Technique, are scalable out-of-core methods capable of processing datasets several times larger than the main system memory.
The organization of this paper is as follows: Section 2 presents the properties of methods dealing with large datasets and the state-of-the-art scientific approaches in this field. Section 3 introduces two novel out-of-core approaches, analyzes their execution time in comparison with the complexity of other approaches, and identifies the factors that determine their running time. Section 4 shows the results of the algorithms measured on real datasets, while the last section summarizes our work and outlines possible future work.
2 Related work
Although the amount of available memory has increased significantly over the past decades, handling large datasets remains a challenging problem in memory-limited environments. In this paper a dataset is regarded as large if it does not fit in main memory at once; as computers evolve, this threshold keeps shifting. The computing literature offers several approaches to handling such vast datasets.
Creating a representative subset of the original dataset based on statistical sampling is a frequently applied method for coping with limited memory. With simple random sampling or a more sophisticated technique (e.g. stratified sampling), a dataset can be generated whose statistical properties closely resemble those of the original. Under this assumption, a result computed from the sample approximates the result that would be computed from the original dataset. However, a sample-based dataset can yield only a partial result, which makes this method inappropriate for problems that require results based on the whole dataset (e.g. aggregation).
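As an illustration of the sampling approach, the following sketch draws a simple random sample from a file in a single pass using reservoir sampling, so the memory footprint depends only on the sample size, not on the dataset size. The function name and file-based interface are our own illustrative choices, not part of the cited methods.

```python
import random

def simple_random_sample(path, sample_size):
    """Single-pass reservoir sampling over a line-oriented file.

    Keeps only `sample_size` records in memory regardless of how
    large the file is; every line ends up in the sample with equal
    probability. Illustrative sketch, not a method from the paper.
    """
    reservoir = []
    with open(path) as f:
        for i, line in enumerate(f):
            if i < sample_size:
                reservoir.append(line)        # fill the reservoir first
            else:
                j = random.randrange(i + 1)   # replace with decreasing probability
                if j < sample_size:
                    reservoir[j] = line
    return reservoir
```

Such a sample supports estimating statistics of the full dataset, but, as noted above, not exact whole-dataset results such as aggregates.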
Another approach to overcoming the limited-memory issue is compressing the dataset. This technique is based on the idea that redundancy is what makes the dataset unmanageably large. According to information theory, however, there is a lower bound on compression, so in general there is no guarantee that the compressed dataset will in fact fit in main memory. Compression can be performed using specific data structures that may have other favorable properties as well, as in [4][5][6]. A related open question is whether external compressed data structures can be designed with I/O performance similar to that of their uncompressed counterparts [7]. A compressed external data structure with competitive I/O performance could further accelerate the data-preparation step.
When the datasets to be processed are orders of magnitude larger than main memory, out-of-core methods offer a well-scalable solution. Out-of-core methods make processing possible even in a memory-limited environment by using secondary storage (e.g. hard disks). These methods follow a partitioning principle: the original dataset is processed in smaller blocks, producing partial result sets from which the global result can be obtained. Because of the high cardinality of the dataset, the partial results are stored on secondary storage, freeing the main memory for processing the next block.
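The partitioning principle described above can be sketched for a simple aggregation task (counting key frequencies): each block is processed in memory, its partial result is spilled to disk, and the partial results are combined at the end. The spill format and function names are our own illustrative assumptions; this is not one of the paper's proposed algorithms.

```python
import os
import pickle
from collections import Counter

def out_of_core_counts(records, block_size, spill_dir):
    """Count key frequencies over a stream too large for memory.

    Processes `records` in blocks of `block_size`, spilling each
    block's partial counts to `spill_dir`, then combines the
    partial results into the global result. Illustrative sketch
    of the generic partitioning principle.
    """
    spills = []
    block = Counter()
    for i, key in enumerate(records, 1):
        block[key] += 1
        if i % block_size == 0:
            path = os.path.join(spill_dir, f"part{len(spills)}.pkl")
            with open(path, "wb") as f:
                pickle.dump(block, f)      # partial result to secondary storage
            spills.append(path)
            block = Counter()              # free main memory for the next block
    total = block                          # remaining in-memory partial result
    for path in spills:
        with open(path, "rb") as f:
            total += pickle.load(f)        # combine partial results
    return total
```

The sketch assumes, as is typical for aggregation, that the combined result is much smaller than the input and therefore fits in memory during the final combination step.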
The out-of-core literature offers several techniques for generating the global result from the partial results: for some problems the global result can be generated as the union of the partial result sets, as presented in [8][9][10]. For other problems, merging is the applicable technique for deriving the global dataset from the partial results [11][12]. In further cases a more complex algorithm has to be performed to derive the global result [13].
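The merging technique can be illustrated with the classic external merge sort: sorted runs are written to disk and then combined with a k-way merge that keeps only one record per run in memory. This is a generic textbook sketch, not the paper's K-way Merge based Technique.

```python
import heapq
import os

def sort_out_of_core(values, run_size, tmpdir):
    """External merge sort sketch: sort fixed-size runs in memory,
    write each run to disk, then k-way merge the runs with a heap.

    During the merge, only one record per run resides in memory,
    so memory use depends on the number of runs, not the dataset size.
    """
    run_paths = []
    for start in range(0, len(values), run_size):
        run = sorted(values[start:start + run_size])  # in-memory sort of one block
        path = os.path.join(tmpdir, f"run{len(run_paths)}.txt")
        with open(path, "w") as f:
            f.writelines(f"{v}\n" for v in run)
        run_paths.append(path)
    files = [open(p) for p in run_paths]
    try:
        # heapq.merge streams the sorted runs lazily via a k-way merge
        merged = [int(line) for line in heapq.merge(*files, key=int)]
    finally:
        for f in files:
            f.close()
    return merged
```

Reading each run sequentially keeps the number of I/O operations low, which, as discussed below, is the decisive cost factor for out-of-core methods.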
The approaches above solve the limited-memory issue by means of secondary storage. In this paper we discuss the performance analysis of these methods, and an essential cost factor of out-of-core methods can be observed already at the conceptual level. A look at current computer architectures reveals a crucial runtime-determining factor: accessing data on secondary storage takes considerably longer than accessing the same data held in main memory. This factor shapes the efficiency of processing, so in an out-of-core method the number of I/O operations should be kept to a minimum.
This requirement of minimal I/O is essential from another point of view as well: the raw datasets are generated continuously, in an automated way, so the processing must not fall behind the incoming raw data. This could be guaranteed by procedural steps of linear time complexity, which in general cannot be achieved. But if the I/O operations are kept at the lowest possible level, the processing remains efficient enough to keep up with the data.
Based on these two points, the efficiency of an out-of-core method essentially depends on whether it reads the input dataset only a constant number of times.
Before applying an out-of-core method, the block size has to be chosen: successive, equal-sized partitioning is a trivial but working approach [11][12][14], although according to [9] a more sophisticated partitioning scheme can further increase performance. A carefully chosen block size is an important performance-determining factor, because it controls the amount of memory consumed. An eager, in-memory algorithm will be presented in this paper to demonstrate the undesirable behavior of processing once main memory reaches its physical bounds.
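A minimal sketch of the equal-sized partitioning decision: derive the block size (in records) from a memory budget and an estimated record size, leaving headroom for the processing structures themselves. The function and its `overhead` factor are our own illustrative assumptions, not a formula from the cited works.

```python
def choose_block_size(memory_budget_bytes, record_size_bytes, overhead=2.0):
    """Pick how many records per block fit in the memory budget.

    `overhead` reserves headroom for the in-memory processing
    structures (indexes, buffers) on top of the raw records;
    the factor 2.0 is an illustrative assumption.
    """
    usable = memory_budget_bytes / overhead
    return max(1, int(usable // record_size_bytes))
```

Choosing the block size this way keeps each block safely below the physical memory bound, avoiding the paging behavior that the eager in-memory algorithm below is meant to demonstrate.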
3 Out-of-core processing methods
In this section five different processing approaches are presented, together with their runtime complexity analysis and the resulting cost factors. First we discuss an in-memory algorithm, whose major shortcoming is the assumption that the processed data fit in main memory at once. Two ways of extending it will be presented to overcome the problem of limited memory: dataset modification (sampling, compression)