ECE1724 Project Final Report: HashLife on GPU

Charles Eric LaForest
April 14, 2010

Contents

1 Introduction
2 Related Work
3 Life Algorithm
    3.1 HashLife
        3.1.1 Quad-Tree Representation
        3.1.2 Recursive Memoized Evaluation
4 Implementation
    4.1 Basic Life on CPU and GPU
    4.2 'HashLife' on the CPU
    4.3 'HashLife' on the GPU
5 Methodology
6 Evaluation
7 Conclusions

List of Figures

1 Quad-tree representation of cell space
2 Exploded view of 9-tree of overlapping nodes used for computation of the parent node (largest square in centre)
3 R-Pentomino (black cells are Live)

List of Tables

1 Milliseconds per iteration of Life and HashLife implementations.
2 Speedup comparisons between all implementations of Life and HashLife. The speedups are for the rows, relative to the columns.

6 Evaluation

Experimental Setup. All implementations of Life and HashLife were tested on a 2.83 GHz Core 2 quad-core CPU with 4 GB of RAM (unknown type and speed), although only one thread of execution is used. The GPU is an NVIDIA GeForce GTX 280 with 1 GB of RAM. All the grids were square, with 22,000 cells on a side. At one byte per cell, each grid occupies about 484 MB, so two such grids fit in the video card memory at one time, allowing alternating updates.

All cells are initially set to 'dead' in both grids, and a small test pattern is then inserted at the centre of the first grid. This pattern is the R-Pentomino, a simple pattern of 5 cells which is known to evolve for 1103 iterations, at which point it settles into a number of stable, cyclical patterns.

Figure 3: R-Pentomino (black cells are Live)

The grids are then copied to the GPU (when used), and no further data transfers occur during measurement. Only the kernel is restarted at each iteration. The number of generations is set to 100 for the CPU implementations and 4000 for the GPU implementations, to achieve similar benchmark times of 300 to 500 seconds. Table 1 shows the number of milliseconds per iteration for each algorithm, and the resulting speedups are worked out in Table 2.

Speedup and Memory Bandwidth. The raw speedup granted by moving Life to the GPU is 38.4x. Relative to the Life baseline on the same hardware, the speedup granted by the partial implementation of HashLife is 1.53x on the CPU, and only 1.18x on the GPU. The GPU implementation was already memory bound but benefited nonetheless from caching of the texture memory, since the total number of cells accessed remains otherwise the same. The benefit is relatively small as the texture caches do not reduce latency, but merely free up some global memory bandwidth. In contrast, the CPU caches do reduce latency and thus achieve a greater speedup.

Thread Dispatching. It was expected that reducing the number of threads by 75% (prior to using texture memory) would have a positive performance impact, since otherwise most of the threads would do nothing at all, reducing the utilization of the processors. However, there was no significant difference in performance, further supporting the idea that access to global memory is the limiting factor. This lack of change when going from 484 million threads to only 121 million also suggests that the dispatching of threads is extremely efficient.

Milliseconds per Iteration

              CPU     GPU
Life         4612     120
HashLife     3006     102

Table 1: Milliseconds per iteration of Life and HashLife implementations.

Speedups

                  Life (CPU)   Life (GPU)   HashLife (CPU)   HashLife (GPU)
Life (CPU)        1            0.0260       0.652            0.0221
Life (GPU)        38.4         1            25.1             0.850
HashLife (CPU)    1.53         0.0399       1                0.0339
HashLife (GPU)    45.2         1.18         29.5             1

Table 2: Speedup comparisons between all implementations of Life and HashLife. The speedups are for the rows, relative to the columns.
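The arithmetic behind the setup and the speedup table can be checked with a short script. The sketch below is illustrative only and is not part of the measured implementations: it takes the per-iteration times from Table 1 as given, and the names `MS_PER_ITERATION`, `speedup`, `CELLS`, `GRID_BYTES` and `REDUCED_THREADS` are hypothetical, chosen here for the example.

```python
# Milliseconds per iteration, copied from Table 1 of the report.
MS_PER_ITERATION = {
    "Life (CPU)": 4612,
    "Life (GPU)": 120,
    "HashLife (CPU)": 3006,
    "HashLife (GPU)": 102,
}


def speedup(row, column):
    """Speedup of implementation `row` relative to implementation `column`.

    A faster row takes fewer milliseconds, so the ratio is column over row.
    """
    return MS_PER_ITERATION[column] / MS_PER_ITERATION[row]


def speedup_table():
    """Rebuild Table 2 as a nested dict: table[row][column]."""
    names = list(MS_PER_ITERATION)
    return {r: {c: speedup(r, c) for c in names} for r in names}


# Grid geometry from the experimental setup.
SIDE = 22_000                 # cells on one side of the square grid
CELLS = SIDE * SIDE           # also the thread count with one thread per cell
GRID_BYTES = CELLS * 1        # one byte per cell
REDUCED_THREADS = CELLS // 4  # thread count after the 75% reduction

if __name__ == "__main__":
    for row, columns in speedup_table().items():
        cells = "  ".join(f"{value:8.3g}" for value in columns.values())
        print(f"{row:15s} {cells}")
    print(f"one grid: {GRID_BYTES / 1e6:.0f} MB, two grids: {2 * GRID_BYTES / 1e6:.0f} MB")
```

Two grids come to 968 MB, which is why both fit within the 1 GB of video memory, and the thread counts of 484 million and 121 million follow directly from the grid size.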

7 Conclusions

My inability to really implement HashLife on the GPU is disheartening, and the resulting speedup is not much better than a direct translation of Life to the GPU. It was however a salvageable educational experience, as I learnt a little more about the CUDA memory hierarchy and corrected my assumption that dispatching threads costs time as it does on operating systems running on CPUs.

In retrospect, I made one major mistake: I tackled both the CPU and GPU implementations of HashLife at the same time. I would have been better off implementing a plain HashLife on the CPU with recursion and dynamic memory, and then finding out how to transform it to operate without them.

In fact, I can now see that the solution is to collapse the quad-tree spatial representation into the temporary 9-tree used for computation. It may appear inefficient, but the overhead of five extra pointers per node would have been rapidly absorbed as the grid size increased, and the static allocation would also have benefited performance. Any remaining memory on the GPU would then be used for a large static hash table which would use linear probing to resolve collisions (i.e., find the next empty spot), and eject old entries in a Least-Recently-Used manner.

Also, the recursion could have been converted into iteration with some help from the host CPU:

1. Run the kernel on one thread at the root of the tree to check if it needs updating or if the next state is in the hash table, and return the status to the CPU.
2. Based on that status, run the kernel on up to four threads on the next level down of the quad-tree 'subset' and check if updating is needed.
3. If part of the tree needs updating, run the kernel on nine threads on the 'temporary' overlapping 9-tree child nodes used for computation. Return their status to the host CPU.
4. Repeat until all 'quad-tree' nodes are visited and all 9-tree nodes are updated, then start over at the root node.

Despite the overhead of repeatedly launching kernels and copying data back to the CPU, the algorithm would still behave as it should and try to reduce the computations to hash table lookups.
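The static hash table proposed above could look roughly like the following sketch. It was not implemented or measured in this project, and it rests on several assumptions: the class name `StaticNodeCache` and its methods are hypothetical, keys are taken to be hashable node identifiers (plain integers here), and the Least-Recently-Used policy is simplified to apply only within a short probe window rather than across the whole table, so that an eviction never has to search all of memory.

```python
class StaticNodeCache:
    """Fixed-size open-addressing hash table with linear probing.

    Assumption: eviction is least-recently-used only among the `window`
    consecutive slots that a key may occupy, not across the entire table.
    All storage is allocated once in __init__ and never resized.
    """

    EMPTY = None  # marker for a slot that has never been written

    def __init__(self, capacity, window=8):
        self.capacity = capacity
        self.window = min(window, capacity)
        self.keys = [self.EMPTY] * capacity
        self.values = [None] * capacity
        self.last_used = [0] * capacity
        self.clock = 0  # logical time used for the LRU stamps

    def _slots(self, key):
        """Slots probed for `key`: its home slot, then the next ones in order."""
        start = hash(key) % self.capacity
        return [(start + i) % self.capacity for i in range(self.window)]

    def _touch(self, slot):
        self.clock += 1
        self.last_used[slot] = self.clock

    def get(self, key):
        """Return the cached value for `key`, or None on a miss."""
        for slot in self._slots(key):
            if self.keys[slot] is self.EMPTY:
                # Slots are only overwritten, never cleared, so an empty
                # slot means the key cannot be stored further along.
                return None
            if self.keys[slot] == key:
                self._touch(slot)
                return self.values[slot]
        return None

    def put(self, key, value):
        """Store `value`, evicting the oldest entry in the window if it is full."""
        slots = self._slots(key)
        target = None
        for slot in slots:
            if self.keys[slot] is self.EMPTY or self.keys[slot] == key:
                target = slot
                break
        if target is None:
            target = min(slots, key=lambda s: self.last_used[s])
        self.keys[target] = key
        self.values[target] = value
        self._touch(target)
```

For example, with `capacity=4` and `window=2`, the keys 0, 4 and 8 all share home slot 0: after storing 0 and 4 and then reading 0, storing 8 evicts 4, since 4 is the least recently used entry in that window. Bounding the window keeps both lookup and eviction cost constant, at the price of occasionally ejecting an entry that is not the globally oldest.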

