10.07.2015 Views

Is Parallel Programming Hard, And, If So, What Can You Do About It?




F.4. CHAPTER 4: COUNTING

Figure F.1: Data Flow For Global Combining-Tree Atomic Increment (diagram: CPUs 0 through 7, each with its own cache, joined by interconnects to memory and to a system interconnect)

But what if neither of the first two conditions holds? Then you should think carefully about the algorithms discussed in Section 4.2, which achieve near-ideal performance on commodity hardware.

If either or both of the first two conditions hold, there is some hope for improvement. One could imagine the hardware implementing a combining tree, so that the increment requests from multiple CPUs are combined by the hardware into a single addition when the combined request reaches the hardware. The hardware could also apply an order to the requests, thus returning to each CPU the return value corresponding to its particular atomic increment. This results in instruction latency that varies as O(log N), where N is the number of CPUs, as shown in Figure F.1.

This is a great improvement over the O(N) performance of current hardware shown in Figure 4.4, and it is possible that hardware latencies might decrease somewhat if innovations such as three-D fabrication prove practical. Nevertheless, we will see that in some important special cases, software can do much better.

Quick Quiz 4.11:
But doesn’t the fact that C’s “integers” are limited in size complicate things?

Answer:
No, because modulo addition is still commutative and associative, at least as long as you use unsigned integers. Recall that in the C standard, overflow of signed integers results in undefined behavior (never mind the fact that machines that do anything other than wrap on overflow are quite rare these days).

That said, one potential source of additional complexity arises when attempting to gather (say) a 64-bit sum from 32-bit per-thread counters. For the moment, dealing with this added complexity is left as an exercise for the reader.

Quick Quiz 4.12:
An array??? But doesn’t that limit the number of threads???

Answer:
It can, and in this toy implementation, it does. But it is not that hard to come up with an alternative implementation that permits an arbitrary number of threads. However, this is left as an exercise for the reader.

Quick Quiz 4.13:
What other choice does gcc have, anyway???

Answer:
According to the C standard, the effects of fetching a variable that might be concurrently modified by some other thread are undefined. It turns out that the C standard really has no other choice, given that C must support (for example) eight-bit architectures which are incapable of atomically loading a long. An upcoming version of the C standard aims to fill this gap, but until then, we depend on the kindness of the gcc developers.

Quick Quiz 4.14:
How does the per-thread counter variable in Figure 4.5 get initialized?

Answer:
The C standard specifies that the initial value of global variables is zero, unless they are explicitly initialized. So the initial value of all the instances of counter will be zero.

That said, one often takes differences of consecutive reads from statistical counters, in which case the initial value is irrelevant.

Quick Quiz 4.15:
How is the code in Figure 4.5 supposed to permit more than one counter???

Answer:
Indeed, this toy example does not support more than one counter. Modifying it so that it can
