13.07.2015 Views

ECE1724 Project Final Report: HashLife on GPU

ECE1724 Project Final Report: HashLife on GPU

ECE1724 Project Final Report: HashLife on GPU

SHOW MORE
SHOW LESS

Create successful ePaper yourself

Turn your PDF publications into a flip-book with our unique Google optimized e-Paper software.

<str<strong>on</strong>g>ECE1724</str<strong>on</strong>g> <str<strong>on</strong>g>Project</str<strong>on</strong>g> <str<strong>on</strong>g>Final</str<strong>on</strong>g> <str<strong>on</strong>g>Report</str<strong>on</strong>g>:<str<strong>on</strong>g>HashLife</str<strong>on</strong>g> <strong>on</strong> <strong>GPU</strong>Charles Eric LaForestApril 14, 2010C<strong>on</strong>tents1 Introducti<strong>on</strong> 22 Related Work 23 Life Algorithm 23.1 <str<strong>on</strong>g>HashLife</str<strong>on</strong>g> . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23.1.1 Quad-Tree Representati<strong>on</strong> . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33.1.2 Recursive Memoized Evaluati<strong>on</strong> . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34 Implementati<strong>on</strong> 44.1 Basic Life <strong>on</strong> CPU and <strong>GPU</strong> . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44.2 ’<str<strong>on</strong>g>HashLife</str<strong>on</strong>g>’ <strong>on</strong> the CPU . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44.3 ’<str<strong>on</strong>g>HashLife</str<strong>on</strong>g>’ <strong>on</strong> the <strong>GPU</strong>. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45 Methodology 46 Evaluati<strong>on</strong> 57 C<strong>on</strong>clusi<strong>on</strong>s 6List of Figures1 Quad-tree representati<strong>on</strong> of cell space . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 Exploded view of 9-tree of overlapping nodes used for computati<strong>on</strong> of the parent node (largest squarein centre) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 R-Pentomino (black cells are Live) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5List of Tables1 Millisec<strong>on</strong>ds per iterati<strong>on</strong> of Life and <str<strong>on</strong>g>HashLife</str<strong>on</strong>g> implementati<strong>on</strong>s. . . . . . . . . . . . . . . . . . . . . 52 Speedup comparis<strong>on</strong>s between all implementati<strong>on</strong>s of Life and <str<strong>on</strong>g>HashLife</str<strong>on</strong>g>. The speedups are for therows, relative to the columns. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51


6 Evaluati<strong>on</strong>Experimental Setup All implementati<strong>on</strong>s of Life and <str<strong>on</strong>g>HashLife</str<strong>on</strong>g> were tested <strong>on</strong> a 2.83GHz Core 2 quad-core CPUwith 4GB of RAM (unknown type and speed), although there is <strong>on</strong>ly <strong>on</strong>e thread of executi<strong>on</strong>. The <strong>GPU</strong> used is anNVIDIA GeForce GTX 280 with 1GB of RAM. All the grids were square with 22,000 cells <strong>on</strong> a side. Given <strong>on</strong>e byteper cell, this translates to about 484MB, enabling two such grids to fit in the video card memory at <strong>on</strong>e time, allowingalternating updates.All cells are initially set to ’dead’ in both grids, and a small test pattern is then inserted at the centre of the first grid.This pattern is the R-Pentomino, which is a simple pattern of 6 cells which is known to evolve for 1103 iterati<strong>on</strong>s, atwhich point it settles into a number of stable, cyclical patterns.Figure 3: R-Pentomino (black cells are Live)The grids are then copied to the <strong>GPU</strong> (when used), and no further data transfers occur during measurement. Onlythe kernel is restarted at each iterati<strong>on</strong>. The number of generati<strong>on</strong>s is set to 100 for CPU implementati<strong>on</strong>s and 4000for the <strong>GPU</strong>, to achieve similar benchmark times of 300 to 500 sec<strong>on</strong>ds. Table 1 shows the number of millisec<strong>on</strong>dsfor each iterati<strong>on</strong> of each algorithm, and the resulting speedups are worked out in Table 2.Speedup and Memory Bandwidth The raw speedup granted by moving Life to the <strong>GPU</strong> is 38.4x. Relative to thatLife baseline, the speedup granted by the partial implementati<strong>on</strong> of <str<strong>on</strong>g>HashLife</str<strong>on</strong>g> is 1.53x <strong>on</strong> the CPU, and <strong>on</strong>ly 1.18x <strong>on</strong>the <strong>GPU</strong>. The <strong>GPU</strong> implementati<strong>on</strong> was already memory bound but benefited n<strong>on</strong>etheless from caching of the texturememory, since the total number of cells accessed remains otherwise the same. The benefit is relatively small as thetexture caches do not reduce latency, but merely free up some global memory bandwidth. In c<strong>on</strong>trast, the CPU cachesdo reduce latency and thus achieve a greater speedup.Thread Dispatching It was expected that reducing the number of threads by 75% (prior to using texture memory)would have a positive performance impact since otherwise most of the threads would do nothing at all, reducing theutilizati<strong>on</strong> of the processors. However, there was no significant difference in performance, further supporting the ideathat access to global memory is the limiting factor. This lack of change when going from 484 milli<strong>on</strong> threads to <strong>on</strong>ly121 milli<strong>on</strong> also suggests that the dispatching of threads is extremely efficient.Millisec<strong>on</strong>ds per Iterati<strong>on</strong>CPU <strong>GPU</strong>Life 4612 120<str<strong>on</strong>g>HashLife</str<strong>on</strong>g> 3006 102Table 1: Millisec<strong>on</strong>ds per iterati<strong>on</strong> of Life and <str<strong>on</strong>g>HashLife</str<strong>on</strong>g> implementati<strong>on</strong>s.SpeedupsLife (CPU) Life (<strong>GPU</strong>) <str<strong>on</strong>g>HashLife</str<strong>on</strong>g> (CPU) <str<strong>on</strong>g>HashLife</str<strong>on</strong>g> (<strong>GPU</strong>)Life (CPU) 1 0.0260 0.652 0.0221Life (<strong>GPU</strong>) 38.4 1 25.1 0.850<str<strong>on</strong>g>HashLife</str<strong>on</strong>g> (CPU) 1.53 0.0399 1 0.0339<str<strong>on</strong>g>HashLife</str<strong>on</strong>g> (<strong>GPU</strong>) 45.2 1.18 29.5 1Table 2: Speedup comparis<strong>on</strong>s between all implementati<strong>on</strong>s of Life and <str<strong>on</strong>g>HashLife</str<strong>on</strong>g>. The speedups are for the rows,relative to the columns.5


7 C<strong>on</strong>clusi<strong>on</strong>sMy inability to really implement <str<strong>on</strong>g>HashLife</str<strong>on</strong>g> <strong>on</strong> the <strong>GPU</strong> is disheartening, and the resulting speedup not much betterthan a direct translati<strong>on</strong> of Life to the <strong>GPU</strong>. It was however a salvageable educati<strong>on</strong>al experience as I learnt a littlemore about the CUDA memory hierarchy and corrected my assumpti<strong>on</strong> that dispatching threads costs time as it does<strong>on</strong> operating systems running <strong>on</strong> CPUs.In retrospect, I made <strong>on</strong>e major mistake: I tackled both the CPU and <strong>GPU</strong> implementati<strong>on</strong>s of <str<strong>on</strong>g>HashLife</str<strong>on</strong>g> at thesame time. I would have been better off to implement a plain <str<strong>on</strong>g>HashLife</str<strong>on</strong>g> <strong>on</strong> the CPU with recursi<strong>on</strong> and dynamicmemory, and then find out how to transform it to operate without them.In fact, I can now see that the soluti<strong>on</strong> is to collapse the quad-tree spatial representati<strong>on</strong> into the temporary 9-treeused for computati<strong>on</strong>. It may appear inefficient, but the overhead of five extra pointers per node would have beenrapidly absorbed as the grid size increased, and the static allocati<strong>on</strong> would have benefited performance also. Anyremaining memory <strong>on</strong> the <strong>GPU</strong> would then be used for a large static hash table which would use linear probing toresolve collisi<strong>on</strong>s (e.g.: finds the next empty spot), and eject old entries in a Least-Recently-Used manner.Also, the recursi<strong>on</strong> could have been c<strong>on</strong>verted into iterati<strong>on</strong> with some help from the host CPU:1. Run the kernel <strong>on</strong> <strong>on</strong>e thread at the root of the tree to check if it needs updating or if the next state is in the hashtable, return the status back to the CPU.2. Based <strong>on</strong> that status, run the kernel <strong>on</strong> up-to four threads <strong>on</strong> the next level down of the quad-tree ’subset’ andcheck if updating is needed.3. If part of the tree needs updating, run the kernel <strong>on</strong> nine threads <strong>on</strong> the ’temporary’ overlapping 9-tree childnodes used for computati<strong>on</strong>. Return their status to the host CPU.4. Repeat until all ’quad-tree’ nodes are visited, and all 9-tree nodes are updated, then start over at the root node.Despite the overhead of repeatedly launching kernels and copying data back to the CPU, the algorithm would stillbehave as it should and try to reduce the computati<strong>on</strong>s to hash table lookups.6


References[1] FINCH, T. LIAR: Life in a register. http://fanf.livejournal.com/81169.html, January 2008.Accessed March 11th 2010.[2] GARDNER, M. Mathematical Games. Scientific American 223 (1970), 120–123.[3] GOSPER, R. W. Exploiting regularities in large cellular spaces. Physica D: N<strong>on</strong>linear Phenomena 10 (January1984), 75–80. Online at http://dx.doi.org/10.1016/0167-2789(84)90251-3.[4] PERUMALLA, K. S., AND AABY, B. G. Data parallel executi<strong>on</strong> challenges and runtime performance of agentsimulati<strong>on</strong>s <strong>on</strong> <strong>GPU</strong>s. In SpringSim ’08: Proceedings of the 2008 Spring simulati<strong>on</strong> multic<strong>on</strong>ference (San Diego,CA, USA, 2008), Society for Computer Simulati<strong>on</strong> Internati<strong>on</strong>al, pp. 116–123.[5] PETRICEK, T. Accelerator and F# (II.): The Game of Life <strong>on</strong> <strong>GPU</strong>. http://tomasp.net/blog/accelerator-life-game.aspx, Dec 2009. Accessed March 15th 2010.[6] RACARR. <strong>GPU</strong> Life. http://blogs.gnome.org/racarr/2009/01/26/gpu-life/, January 2009.Author pseud<strong>on</strong>ym <strong>on</strong>ly, Accessed March 15th 2010.[7] ROKICKI, T. G. An Algorithm for Compressing Space and Time. Dr. Dobb’s (April 2006). http://www.drdobbs.com/java/184406478, Accessed March 15th 2010.[8] SILVER, S. A., AND MARTIN, E. Life Lexic<strong>on</strong>. http://www.bitstorm.org/gameoflife/lexic<strong>on</strong>/, December 2006. Accessed March 15th 2010.[9] TARDITI, D., PURI, S., AND OGLESBY, J. Accelerator: using data parallelism to program <strong>GPU</strong>s for generalpurposeuses. In ASPLOS-XII: Proceedings of the 12th internati<strong>on</strong>al c<strong>on</strong>ference <strong>on</strong> Architectural support forprogramming languages and operating systems (New York, NY, USA, 2006), ACM, pp. 325–335.[10] TREVORROW, A., AND ROKICKI, T. Golly. http://golly.sourceforge.net/. An open source, crossplatformapplicati<strong>on</strong> for exploring C<strong>on</strong>way’s Game of Life and other cellular automata. Accessed March 15th2010.[11] WOLFRAM, S. A new kind of science. Wolfram Media Inc., Champaign, Ilinois, US, United States, 2002.[12] ZUSE, K. Calculating Space. MIT (Proj. MAC), Cambridge, Mass., 1970. MIT Technical Translati<strong>on</strong> AZT-70-164-GEMIT.[13] ZVOLD. C<strong>on</strong>way’s "Life" and "Brian’s Brain" cellular automata using <strong>GPU</strong>. http://zvold.blogspot.com/2010/01/c<strong>on</strong>ways-life-and-brians-brain-cellular.html, January 2010. AccessedMarch 15th 2010, Author’s name is unknown, pseud<strong>on</strong>ym given.7

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!