
need for out-of-core processing for analysis over a longer period of time. The proposed algorithms, the Periodic Partial Result Merging and the K-way Merge based Technique, are scalable, out-of-core methods capable of processing datasets several times larger than the main system memory.

The organization of this paper is as follows: Section 2 presents the properties of methods dealing with large datasets and the state-of-the-art approaches in this field. Section 3 introduces two novel out-of-core approaches, analyzes their execution times in comparison with the execution complexities of other approaches, and identifies the factors that determine the algorithms' execution times. Section 4 shows the results of the algorithms measured on real datasets, while the last section summarizes our work and outlines possible future work.

2 Related work

Although the amount of available memory has increased significantly during the past decades, handling large datasets is still a challenging problem in memory-limited environments. In this paper a dataset is regarded as large if it does not fit in the main memory at once. Over the evolution of computers this size threshold has been continuously changing. The literature offers several approaches for handling such vast datasets.

Creating a representative subset of the original dataset based on statistical sampling is a frequently applied method to overcome the problem of limited memory. With simple random sampling or a more sophisticated sampling technique (e.g., stratified sampling), a dataset can be generated whose statistical properties closely resemble those of the original dataset. Thus, results computed on the sample are expected to approximate the results that would be obtained on the original dataset. However, a sample-based dataset can only yield a partial result; this makes the method inappropriate for problems requiring results based on the whole dataset (e.g., aggregation).
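
As an illustration of the sampling approach (a minimal sketch in Python, not an algorithm from this paper), reservoir sampling maintains a uniform random sample of a stream whose total size exceeds main memory:

    import random

    def reservoir_sample(stream, k, seed=None):
        """Keep a uniform random sample of k items from an arbitrarily
        long stream using O(k) memory (Vitter's Algorithm R)."""
        rng = random.Random(seed)
        sample = []
        for i, item in enumerate(stream):
            if i < k:
                sample.append(item)
            else:
                # Item i survives with probability k/(i+1), keeping every
                # item seen so far equally likely to be in the sample.
                j = rng.randint(0, i)
                if j < k:
                    sample[j] = item
        return sample

Such a sample supports estimating statistical properties, but, as noted above, it cannot answer questions that need every record (e.g., exact aggregates).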

Another approach to overcoming the limited-memory issue is compression of the dataset. This technique is based on the idea that it is the redundancy of the dataset that makes it unmanageably large. According to information theory, however, there is a lower bound on compression, so in general there is no guarantee that the compressed dataset will in fact fit in the main memory. Compression of a dataset can be done using specific data structures which can have other favorable properties as well, as in [4][5][6]. Another issue related to compression is whether external data structures can be designed with I/O performance similar to that of uncompressed structures [7]. A compressed external data structure with competitive I/O performance could further accelerate the data preparation step.
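
The following sketch illustrates this limitation (an illustration with assumed names, not code from [4][5][6][7]): it compresses a file block by block with zlib and reports failure when the compressed form exceeds a given memory budget, an outcome the information-theoretic lower bound makes unavoidable for low-redundancy data:

    import zlib

    def compress_into_budget(path, budget_bytes, chunk_size=1 << 20):
        """Compress a file chunk by chunk; return the compressed bytes if
        they fit in budget_bytes, or None if the data is not redundant
        enough, which information theory cannot rule out."""
        compressor = zlib.compressobj(level=6)
        compressed = bytearray()
        with open(path, "rb") as f:
            while chunk := f.read(chunk_size):
                compressed += compressor.compress(chunk)
                if len(compressed) > budget_bytes:
                    return None  # already over budget, give up early
        compressed += compressor.flush()
        return bytes(compressed) if len(compressed) <= budget_bytes else None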

If we have to process datasets that are orders of magnitude larger than the main memory, out-of-core methods can be a well-scaling solution. Out-of-core methods make processing possible even in a memory-limited environment by using secondary storage (e.g., hard disks). These methods follow a partitioning principle: the original dataset is processed in smaller blocks, yielding partial result sets from which the global result can be obtained. Because of the high cardinality of the dataset, the principle is applied so that the partial results are stored on secondary storage, freeing the main memory for processing the next block.
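
A minimal sketch of this partitioning principle follows; the process_block function and the pickle-based spill format are assumptions for illustration, not the paper's implementation:

    import os
    import pickle

    def process_out_of_core(dataset_path, block_size, process_block, tmp_dir):
        """Read the input in fixed-size blocks of records, process each
        block in memory, and spill the partial result to secondary
        storage, freeing main memory for the next block. Returns the
        paths of the partial-result files for a later merge step."""
        partial_paths = []
        block, idx = [], 0
        with open(dataset_path, "r") as f:
            for record in f:
                block.append(record)
                if len(block) == block_size:
                    partial_paths.append(spill(process_block(block), tmp_dir, idx))
                    block, idx = [], idx + 1
            if block:  # last, possibly shorter block
                partial_paths.append(spill(process_block(block), tmp_dir, idx))
        return partial_paths

    def spill(partial_result, tmp_dir, idx):
        """Write one partial result to disk and return its path."""
        path = os.path.join(tmp_dir, "partial_%d.pkl" % idx)
        with open(path, "wb") as out:
            pickle.dump(partial_result, out)
        return path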

In the out-of-core literature several techniques can be found for generating the global result from the partial results: for some problems the global result can be generated as the union of the partial result sets, as presented in [8][9][10]. For other problems, merging is the applicable technique for deriving the global result from the partial results [11][12]. In other cases a more complex algorithm has to be performed to derive the global result [13].
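
For the merging case, a generic k-way merge over sorted partial-result files can be sketched as follows (this is the standard technique, not the paper's K-way Merge based algorithm itself):

    import heapq

    def k_way_merge(partial_files):
        """Merge k sorted partial-result files into one sorted stream.
        heapq.merge holds only one record per file in memory, so the
        merge step itself stays out-of-core friendly."""
        streams = [open(path, "r") for path in partial_files]
        try:
            for record in heapq.merge(*streams):
                yield record.rstrip("\n")
        finally:
            for s in streams:
                s.close()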

There are thus several approaches that solve the limited-memory issue using secondary storage. In this paper we discuss the performance analysis of these methods; an essential property of out-of-core methods can be observed already at the conceptual level. Looking at current computer architectures, a crucial runtime-determining factor can be pointed out: accessing data on secondary storage takes considerably longer than accessing the same data held in main memory. This factor influences the efficiency of processing, so in an out-of-core method the number of I/O operations should be kept at a minimum.
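
To make this concrete, the sketch below (an illustration of the principle, not a measurement from the paper) counts the read operations issued when scanning a file: doubling the buffer size roughly halves the number of I/O operations for the same amount of data:

    def count_reads(path, buffer_bytes):
        """Scan a file with unbuffered reads of a given size and count
        the read operations issued; fewer, larger reads mean less I/O
        overhead."""
        reads = 0
        with open(path, "rb", buffering=0) as f:
            while f.read(buffer_bytes):
                reads += 1
        return reads + 1  # include the final empty read that signals EOF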

This requirement of minimal I/O is essential from another point of view as well: the raw datasets are generated continuously, in an automated way, so unprocessed raw data must not be allowed to pile up. This could be guaranteed by ensuring that the processing steps have linear time complexity, which in general cannot be satisfied. But if we keep the number of I/O operations at the lowest possible level, the processing will still be efficient enough to keep up with the incoming data. Based on the two previous points, the efficiency of out-of-core methods depends on whether they read the input dataset only a constant number of times.

Before applying an out-of-core method, the block size has to be chosen: successive, equal-sized partitioning is a trivial but working method [11][12][14]. According to [9], however, a more sophisticated partitioning approach can improve performance. A carefully chosen block size is an important performance-determining factor, because it controls the amount of memory consumed. An eager, in-memory algorithm will be presented in this paper to demonstrate the undesirable behavior of processing when the main memory reaches its physical bounds.
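
As a simple illustration of this choice (a sketch under assumed values, not the paper's procedure), the block size can be derived from the memory budget and an estimated per-record footprint:

    def choose_block_size(record_bytes, memory_budget_bytes, overhead=2.0):
        """Pick the number of records per block so that one block, with a
        safety factor for processing overhead, fits in the memory budget.
        The overhead factor of 2.0 is an assumed value."""
        usable = memory_budget_bytes / overhead
        return max(1, int(usable // record_bytes))

    # e.g. 100-byte records with a 512 MiB budget -> ~2.7 million records per block
    block_size = choose_block_size(100, 512 * 1024 * 1024)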

3 Out-of-core processing methods

In this section five different processing approaches are presented, together with their runtime complexity analysis and the factors that follow from it. First we discuss an in-memory algorithm, which has the major shortcoming of assuming that the processed data fits in the main memory at once. Two different ways of extending it will be presented to overcome the problem of limited memory: dataset modification (sampling, compression)
