SESSION: NOVEL ALGORITHMS AND APPLICATIONS
Int'l Conf. Foundations of Computer Science | FCS'11
need for out-of-core processing for analysis over a longer period of time. The proposed algorithms, Periodic Partial Result Merging and the K-way Merge based Technique, are scalable out-of-core methods capable of processing datasets several times larger than the main system memory.
The organization of this paper is as follows: Section 2 presents the properties of methods dealing with large datasets and the state-of-the-art scientific approaches in this field. Section 3 introduces two novel out-of-core approaches, analyzes their execution time in comparison with the complexity of other approaches, and identifies the factors that determine their running time. Section 4 shows the results of the algorithms measured on real datasets, while the last section summarizes our work and outlines possible future work.
2 Related work
Although the amount of available memory has increased significantly over the past decades, handling large datasets remains a challenging problem in memory-limited environments. In this paper a dataset is regarded as large if it does not fit in main memory at once; as computers evolve, this threshold keeps shifting. The computing literature offers several approaches to handling such vast datasets.
Creating a representative subset of the original dataset based on statistical sampling is a frequently applied method for coping with limited memory. With simple random sampling or a more sophisticated technique (e.g. stratified sampling), a dataset can be generated whose statistical properties closely resemble those of the original. Under this assumption, a result computed from the sample approximates the result that would be computed from the original dataset. However, a sample-based dataset can yield only a partial result, which makes this method inappropriate for problems that require results based on the whole dataset (e.g. aggregation).
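As an illustration of the sampling approach, the following sketch draws a simple random sample from a file in a single pass using reservoir sampling, so the memory footprint depends only on the sample size, not on the dataset size. The function name and file-based interface are our own illustrative choices, not part of the cited methods.

```python
import random

def simple_random_sample(path, sample_size):
    """Single-pass reservoir sampling over a line-oriented file.

    Keeps only `sample_size` records in memory regardless of how
    large the file is; every line ends up in the sample with equal
    probability. Illustrative sketch, not a method from the paper.
    """
    reservoir = []
    with open(path) as f:
        for i, line in enumerate(f):
            if i < sample_size:
                reservoir.append(line)        # fill the reservoir first
            else:
                j = random.randrange(i + 1)   # replace with decreasing probability
                if j < sample_size:
                    reservoir[j] = line
    return reservoir
```

Such a sample supports estimating statistics of the full dataset, but, as noted above, not exact whole-dataset results such as aggregates.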
Another approach to overcoming the limited-memory issue is compressing the dataset. This technique is based on the idea that redundancy is what makes the dataset unmanageably large. According to information theory, however, there is a lower bound on compression, so in general there is no guarantee that the compressed dataset will in fact fit in main memory. Compression can be performed using specific data structures that may have other favorable properties as well, as in [4][5][6]. A related open question is whether external compressed data structures can be designed with I/O performance similar to that of their uncompressed counterparts [7]. A compressed external data structure with competitive I/O performance could further accelerate the data-preparation step.
When the datasets to be processed are orders of magnitude larger than main memory, out-of-core methods offer a well-scalable solution. Out-of-core methods make processing possible even in a memory-limited environment by using secondary storage (e.g. hard disks). These methods follow a partitioning principle: the original dataset is processed in smaller blocks, producing partial result sets from which the global result can be obtained. Because of the high cardinality of the dataset, the partial results are stored on secondary storage, freeing the main memory for processing the next block.
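The partitioning principle described above can be sketched for a simple aggregation task (counting key frequencies): each block is processed in memory, its partial result is spilled to disk, and the partial results are combined at the end. The spill format and function names are our own illustrative assumptions; this is not one of the paper's proposed algorithms.

```python
import os
import pickle
from collections import Counter

def out_of_core_counts(records, block_size, spill_dir):
    """Count key frequencies over a stream too large for memory.

    Processes `records` in blocks of `block_size`, spilling each
    block's partial counts to `spill_dir`, then combines the
    partial results into the global result. Illustrative sketch
    of the generic partitioning principle.
    """
    spills = []
    block = Counter()
    for i, key in enumerate(records, 1):
        block[key] += 1
        if i % block_size == 0:
            path = os.path.join(spill_dir, f"part{len(spills)}.pkl")
            with open(path, "wb") as f:
                pickle.dump(block, f)      # partial result to secondary storage
            spills.append(path)
            block = Counter()              # free main memory for the next block
    total = block                          # remaining in-memory partial result
    for path in spills:
        with open(path, "rb") as f:
            total += pickle.load(f)        # combine partial results
    return total
```

The sketch assumes, as is typical for aggregation, that the combined result is much smaller than the input and therefore fits in memory during the final combination step.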
The out-of-core literature offers several techniques for generating the global result from the partial results: for some problems the global result can be generated as the union of the partial result sets, as presented in [8][9][10]. For other problems, merging is the applicable technique for deriving the global dataset from the partial results [11][12]. In further cases a more complex algorithm has to be performed to derive the global result [13].
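The merging technique can be illustrated with the classic external merge sort: sorted runs are written to disk and then combined with a k-way merge that keeps only one record per run in memory. This is a generic textbook sketch, not the paper's K-way Merge based Technique.

```python
import heapq
import os

def sort_out_of_core(values, run_size, tmpdir):
    """External merge sort sketch: sort fixed-size runs in memory,
    write each run to disk, then k-way merge the runs with a heap.

    During the merge, only one record per run resides in memory,
    so memory use depends on the number of runs, not the dataset size.
    """
    run_paths = []
    for start in range(0, len(values), run_size):
        run = sorted(values[start:start + run_size])  # in-memory sort of one block
        path = os.path.join(tmpdir, f"run{len(run_paths)}.txt")
        with open(path, "w") as f:
            f.writelines(f"{v}\n" for v in run)
        run_paths.append(path)
    files = [open(p) for p in run_paths]
    try:
        # heapq.merge streams the sorted runs lazily via a k-way merge
        merged = [int(line) for line in heapq.merge(*files, key=int)]
    finally:
        for f in files:
            f.close()
    return merged
```

Reading each run sequentially keeps the number of I/O operations low, which, as discussed below, is the decisive cost factor for out-of-core methods.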
The approaches above solve the limited-memory issue by means of secondary storage. In this paper we discuss the performance analysis of these methods, and an essential cost factor of out-of-core methods can be observed already at the conceptual level. A look at current computer architectures reveals a crucial runtime-determining factor: accessing data on secondary storage takes considerably longer than accessing the same data held in main memory. This factor shapes the efficiency of processing, so in an out-of-core method the number of I/O operations should be kept to a minimum.
This requirement of minimal I/O is essential from another point of view as well: the raw datasets are generated continuously, in an automated way, so the processing must not fall behind the incoming raw data. This could be guaranteed by procedural steps of linear time complexity, which in general cannot be achieved. But if the I/O operations are kept at the lowest possible level, the processing remains efficient enough to keep up with the data.
Based on these two points, the efficiency of an out-of-core method essentially depends on whether it reads the input dataset only a constant number of times.
Before applying an out-of-core method, the block size has to be chosen: successive, equal-sized partitioning is a trivial but working approach [11][12][14], although according to [9] a more sophisticated partitioning scheme can further increase performance. A carefully chosen block size is an important performance-determining factor, because it controls the amount of memory consumed. An eager, in-memory algorithm will be presented in this paper to demonstrate the undesirable behavior of processing once main memory reaches its physical bounds.
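A minimal sketch of the equal-sized partitioning decision: derive the block size (in records) from a memory budget and an estimated record size, leaving headroom for the processing structures themselves. The function and its `overhead` factor are our own illustrative assumptions, not a formula from the cited works.

```python
def choose_block_size(memory_budget_bytes, record_size_bytes, overhead=2.0):
    """Pick how many records per block fit in the memory budget.

    `overhead` reserves headroom for the in-memory processing
    structures (indexes, buffers) on top of the raw records;
    the factor 2.0 is an illustrative assumption.
    """
    usable = memory_budget_bytes / overhead
    return max(1, int(usable // record_size_bytes))
```

Choosing the block size this way keeps each block safely below the physical memory bound, avoiding the paging behavior that the eager in-memory algorithm below is meant to demonstrate.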
3 Out-of-core processing methods
In this section five different processing approaches are presented, together with their runtime complexity analysis and the resulting cost factors. First we discuss an in-memory algorithm, whose major shortcoming is the assumption that the processed data fit in main memory at once. Two ways of extending it will be presented to overcome the problem of limited memory: dataset modification (sampling, compression)