28.01.2015 Views

Optimizing Sorting with Genetic Algorithms - Polaris

Optimizing Sorting with Genetic Algorithms - Polaris

Optimizing Sorting with Genetic Algorithms - Polaris

SHOW MORE
SHOW LESS

Create successful ePaper yourself

Turn your PDF publications into a flip-book with our unique Google optimized e-Paper software.

AMD Sun SGI Intel Intel<br />

CPU Athlon MP UltraSparcIIIi R12000 Itanium 2 Pentium 4<br />

Frequency 1.2GHz 1GHz 300MHz 1.5GHz 2GHz<br />

L1d/L1i Cache 128KB 64KB/32KB 32KB/32KB 16KB/16KB 8KB/12KB<br />

L2 Cache 256KB 1MB 4MB 256KB 512KB<br />

Memory 1GB 4GB 1GB 8GB 512KB<br />

OS RedHat9 SunOS5.8 IRIX64 v6.5 RedHat7.2 RedHat7.2<br />

Compiler gcc3.2.2 Workshop cc 5.0 MIPSPro cc 7.3.0 gcc3.3.2 gcc3.3.1<br />

Options -O3 -native -xO5 -O3 -TARG: -O3 -O3<br />

platform=IP30<br />

Table 2: Test Platforms. Intel Pentium 4 has a 8KB trace cache instead of a L1 instruction cache. Intel Itanium2 has a L3 cache of<br />

6MB.<br />

Execution Time (Cycles per key)<br />

Execution Time (Cycles per key)<br />

800<br />

700<br />

600<br />

500<br />

400<br />

300<br />

200<br />

Intel Pentium 4<br />

100<br />

100 1000 10000 100000 1e+06 1e+07<br />

1000<br />

900<br />

800<br />

700<br />

600<br />

500<br />

400<br />

300<br />

200<br />

Quicksort<br />

CC-radix<br />

Standard Deviation<br />

Multi-way Merge<br />

Gene Sort<br />

Intel MKL<br />

C++ STL<br />

100<br />

100 1000 10000 100000 1e+06 1e+07<br />

Quicksort<br />

CC-radix<br />

Intel Itanium 2<br />

Standard Deviation<br />

Multi-way Merge<br />

Gene Sort<br />

Intel MKL<br />

C++ STL<br />

Figure 9: Performance comparison <strong>with</strong> Intel Math Kernel Library<br />

and C++ STL<br />

used routines and and that the improvement on their performance<br />

obtained by our genetic algorithm search is meaningful.<br />

We now proceed to compare the sort algorithm generated by our<br />

strategy <strong>with</strong> the three baseline algorithms. All the experiments<br />

presented below sort records <strong>with</strong> two fields, a 32 bit integer key<br />

and a 32 bit pointer. The reason for this structure is that, for the<br />

long records typical of database, sorting is usually performed on an<br />

array of tuples each containing a key and a pointer to the original<br />

record [11]. We assume that this array has been created before our<br />

library routines are called.<br />

Next we show the performance of sorting algorithms that have<br />

been using exclusively inputs of 14M records and the performance<br />

of a second set of sorting algorithms that were tuned using input<br />

data set sized ranging from 8M all the way to 16M records.<br />

Figure 10 shows the performance of four different sorting algorithms:<br />

Quicksort, CC-radix, Multiway Merge, and the sorting<br />

algorithms generated using the genetic algorithm, Gene Sort, when<br />

sorting input sets sized 14M. The Figure plots the execution time in<br />

Cycles per key as the standard deviation changes from 512 to 8M.<br />

The test inputs used to collect the data in Figure 10 were different<br />

from the ones we used during the generation of the algorithm.<br />

The use of genetic algorithms to optimize sorting results in a<br />

significant improvement in performance. Our Gene sort (either the<br />

specialized or the general form) usually performs better than the<br />

best of the other three sorting algorithms. On Itanium2, which is the<br />

platform <strong>with</strong> the minimum improvement, we obtain a 12% average<br />

improvement. On the other platforms, the average improvement<br />

over the best of the three algorithms is up to 42%.<br />

There is not much difference between the performance of the<br />

specialized and the general gene sort. The reason is that a sorting<br />

algorithm can be good for a wide range of inputs. The specialized<br />

genome intends to over-fit for the training data. It doesn’t necessarily<br />

perform better than the general genome on the independent test<br />

inputs.<br />

4.3 Convergence Experiments<br />

Figure 11 shows the speedup improvement for each generation.<br />

The results are normalized <strong>with</strong> respect to the best sorting algorithm<br />

in the initial population.<br />

There are small up-and-downs in the speedup curves. That’s because<br />

the test input data is different from the input data used for<br />

training. The genetic algorithm always tunes for the data used during<br />

training. Therefore, the best genome for generation n is not<br />

necessarily better than the genome of generation n − 1 when sorting<br />

the test data. It takes several generations and needs more random<br />

trains to penalize the genomes selected because of the specific<br />

sequence of values in the training set.<br />

As the plot shows, for most platforms speedup remains relatively<br />

stable for several generations and then it improves. The reason is<br />

that it takes a while to “discover” a good genome. In that case,<br />

the performance remains constant. Once, a new good individual<br />

appears, it will reproduce fast, and, as a result, the performance of<br />

the following generations will improve.<br />

The Figure also shows that the speedup of Sun UltraSparcIIIi,<br />

Athlon MP , Pentium 4 and SGI R12000 show no sign of convergence<br />

after 16 generations. This leads us to believe that a higher<br />

performance could be achieved by running the experiment for more<br />

generations or <strong>with</strong> a larger population.

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!