
Is Parallel Programming Hard, And, If So, What Can You Do About It?

in performance-critical code. In particular, existence guarantees require that the transaction cover the full path from a global reference to the data elements being updated.

7. Use RCU, which can be thought of as an extremely lightweight approximation to a garbage collector. Updaters are not permitted to free RCU-protected data structures that RCU readers might still be referencing. RCU is most heavily used for read-mostly data structures, and is discussed at length in Chapter 8.

For more on providing existence guarantees, see Chapters 6 and 8.

Quick Quiz 5.12:
How can a single-threaded 64-by-64 matrix multiply possibly have an efficiency of less than 1.0? Shouldn’t all of the traces in Figure 5.22 have efficiency of exactly 1.0 when running on only one thread?

Answer:
The matmul.c program creates the specified number of worker threads, so even the single-worker-thread case incurs thread-creation overhead. Making the changes required to optimize away thread-creation overhead in the single-worker-thread case is left as an exercise to the reader.

Quick Quiz 5.13:
How are data-parallel techniques going to help with matrix multiply? It is already data parallel!!!

Answer:
I am glad that you are paying attention! This example serves to show that although data parallelism can be a very good thing, it is not some magic wand that automatically wards off any and all sources of inefficiency. Linear scaling at full performance, even to “only” 64 threads, requires care at all phases of design and implementation.

In particular, you need to pay careful attention to the size of the partitions. For example, if you split a 64-by-64 matrix multiply across 64 threads, each thread gets only 64 floating-point multiplies. The cost of a floating-point multiply is minuscule compared to the overhead of thread creation.

Moral: If you have a parallel program with variable input, always include a check for the input size being too small to be worth parallelizing.
And when it is not helpful to parallelize, it is not helpful to spawn a single thread, now is it?

Quick Quiz 5.14:
In what situation would hierarchical locking work well?

Answer:
If the comparison on line 31 of Figure 5.26 were replaced by a much heavier-weight operation, then releasing bp->bucket_lock might reduce lock contention enough to outweigh the overhead of the extra acquisition and release of cur->node_lock.

Quick Quiz 5.15:
In Figure 5.32, there is a pattern of performance rising with increasing run length in groups of three samples, for example, for run lengths 10, 11, and 12. Why?

Answer:
This is due to the per-CPU target value being three. A run length of 12 must acquire the global-pool lock twice, while a run length of 13 must acquire the global-pool lock three times.

Quick Quiz 5.16:
Allocation failures were observed in the two-thread tests at run lengths of 19 and greater. Given the global-pool size of 40 and the per-CPU target pool size of three, what is the smallest allocation run length at which failures can occur?

Answer:
The exact solution to this problem is left as an exercise to the reader. The first solution received will be credited to its submitter. As a rough rule of thumb, the global pool size should be at least m + 2sn, where “m” is the maximum number of elements allocated at a given time, “s” is the per-CPU pool size, and “n” is the number of CPUs.

F.6 Chapter 6: Locking

Quick Quiz 6.1:
What if the element we need to delete is not the first element of the list on line 8 of Figure 6.1?
