10.07.2015

Is Parallel Programming Hard, And, If So, What Can You Do About It?

APPENDIX F. ANSWERS TO QUICK QUIZZES

Some other ways of handling very small critical sections are described in Section 8.3.

Quick Quiz 3.17:
In Figure 3.10, all of the traces other than the 100M trace deviate gently from the ideal line. In contrast, the 100M trace breaks sharply from the ideal line at 64 CPUs. In addition, the spacing between the 100M trace and the 10M trace is much smaller than that between the 10M trace and the 1M trace. Why does the 100M trace behave so much differently than the other traces?

Answer:
Your first clue is that 64 CPUs is exactly half of the 128 CPUs on the machine. The difference is an artifact of hardware threading. This system has 64 cores with two hardware threads per core. As long as no more than 64 threads are running, each can run in its own core. But as soon as there are more than 64 threads, some of the threads must share cores. Because the pair of threads in any given core share some hardware resources, the throughput of two threads sharing a core is not quite as high as that of two threads each in their own core. So the performance of the 100M trace is limited not by the reader-writer lock, but rather by the sharing of hardware resources between hardware threads in a single core.

This can also be seen in the 10M trace, which deviates gently from the ideal line up to 64 threads, then breaks sharply down, parallel to the 100M trace. Up to 64 threads, the 10M trace is limited primarily by reader-writer lock scalability, and beyond that, also by sharing of hardware resources between hardware threads in a single core.

Quick Quiz 3.18:
Power 5 is several years old, and new hardware should be faster. So why should anyone worry about reader-writer locks being slow?

Answer:
In general, newer hardware is improving. However, it will need to improve by more than two orders of magnitude to permit reader-writer locking to achieve ideal performance on 128 CPUs.
Worse yet, the greater the number of CPUs, the larger the required performance improvement. The performance problems of reader-writer locking are therefore very likely to be with us for quite some time to come.

Quick Quiz 3.19:
Is it really necessary to have both sets of primitives?

Answer:
Strictly speaking, no. One could implement any member of the second set using the corresponding member of the first set. For example, one could implement __sync_nand_and_fetch() in terms of __sync_fetch_and_nand() as follows:

    tmp = v;
    ret = __sync_fetch_and_nand(p, tmp);  /* returns the old value of *p */
    ret = ~(ret & tmp);  /* GCC 4.4 and later define nand as ~(*p & v) */

It is similarly possible to implement __sync_fetch_and_add(), __sync_fetch_and_sub(), and __sync_fetch_and_xor() in terms of their post-value counterparts.

However, the alternative forms can be quite convenient, both for the programmer and for the compiler/library implementor.

Quick Quiz 3.20:
Given that these atomic operations will often be able to generate single atomic instructions that are directly supported by the underlying instruction set, shouldn’t they be the fastest possible way to get things done?

Answer:
Unfortunately, no. See Chapter 4 for some stark counterexamples.

Quick Quiz 3.21:
What happened to the Linux-kernel equivalents to fork() and join()?

Answer:
They don’t really exist. All tasks executing within the Linux kernel share memory, at least unless you want to do a huge amount of memory-mapping work by hand.

F.4 Chapter 4: Counting

Quick Quiz 4.1:
Why on earth should efficient and scalable counting be hard??? After all, computers have special
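
Returning to the answer to Quick Quiz 3.19: the idea of deriving a post-value primitive from its pre-value counterpart can be sketched as a small C helper. This is only an illustration under stated assumptions, not the book's code. The names my_nand_and_fetch() and check_nand_and_fetch() are hypothetical, and the sketch assumes GCC 4.4 or later (or a compatible compiler), where nand computes ~(*p & v); earlier GCC releases computed ~*p & v instead, so the final expression would differ there.

```c
#include <assert.h>

/*
 * Hypothetical helper: emulate __sync_nand_and_fetch() using only
 * __sync_fetch_and_nand(). Assumes GCC >= 4.4 semantics, in which the
 * atomic operation stores ~(*p & v) into *p.
 */
static unsigned int my_nand_and_fetch(unsigned int *p, unsigned int v)
{
    /* Atomically update *p and obtain the value it held beforehand. */
    unsigned int old = __sync_fetch_and_nand(p, v);

    /* Recompute, from the old value, what was just stored into *p. */
    return ~(old & v);
}

/*
 * Hypothetical sanity check: the emulation and the builtin post-value
 * form should return the same value and leave the same value in memory.
 */
static void check_nand_and_fetch(unsigned int init, unsigned int v)
{
    unsigned int a = init;
    unsigned int b = init;

    assert(my_nand_and_fetch(&a, v) == __sync_nand_and_fetch(&b, v));
    assert(a == b);
}
```

The recomputation happens on a private copy of the old value, so the helper needs no additional atomic operation: the only access to shared memory is the single __sync_fetch_and_nand() call. Calling check_nand_and_fetch() with a few initial values and operands is one way to confirm which nand semantics a given compiler implements.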
