Is Parallel Programming Hard, And, If So, What Can You Do About It?

APPENDIX F. ANSWERS TO QUICK QUIZZES

Quick Quiz 2.3:
This is a simplified sequence of events? How could it possibly be any more complex???

Answer:
This sequence ignored a number of possible complications, including:

1. Other CPUs might be concurrently attempting to perform CAS operations involving this same cacheline.
2. The cacheline might have been replicated read-only in several CPUs’ caches, in which case it would need to be flushed from their caches.
3. CPU 7 might have been operating on the cacheline when the request for it arrived, in which case CPU 7 would need to hold off the request until its own operation completed.
4. CPU 7 might have ejected the cacheline from its cache (for example, in order to make room for other data), so that by the time the request arrived, the cacheline was on its way to memory.
5. A correctable error might have occurred in the cacheline, which would then need to be corrected at some point before the data was used.

Production-quality cache-coherence mechanisms are extremely complicated due to these sorts of considerations.

Quick Quiz 2.4:
Why is it necessary to flush the cacheline from CPU 7’s cache?

Answer:
If the cacheline were not flushed from CPU 7’s cache, then CPUs 0 and 7 might have different values for the same set of variables in the cacheline. This sort of incoherence would greatly complicate parallel software, and so hardware architects have been convinced to avoid it.

Quick Quiz 2.5:
Surely the hardware designers could be persuaded to improve this situation!
Why have they been content with such abysmal performance for these single-instruction operations?

Table F.1: Performance of Synchronization Mechanisms on 16-CPU 2.8GHz Intel X5550 (Nehalem) System

Operation                Cost (ns)        Ratio
Clock period                   0.4          1.0
“Best-case” CAS               12.2         33.8
Best-case lock                25.6         71.2
Single cache miss             12.9         35.8
CAS cache miss                 7.0         19.4
Off-Core
  Single cache miss           31.2         86.6
  CAS cache miss              31.2         86.5
Off-Socket
  Single cache miss           92.4        256.7
  CAS cache miss              95.9        266.4
Comms Fabric                 4,500        7,500
Global Comms           195,000,000  324,000,000

Answer:
The hardware designers have been working on this problem, and have consulted with no less a luminary than the physicist Stephen Hawking. Hawking’s observation was that the hardware designers have two basic problems [Gar07]:

1. the finite speed of light, and
2. the atomic nature of matter.

The first problem limits raw speed, and the second limits miniaturization, which in turn limits frequency. And even this sidesteps the power-consumption issue that is currently holding production frequencies to well below 10 GHz.

Nevertheless, some progress is being made, as may be seen by comparing Table F.1 with Table 2.1. Integration of hardware threads in a single core and of multiple cores on a die has improved latencies greatly, at least within the confines of a single core or single die. There has been some improvement in overall system latency, but only by about a factor of two. Unfortunately, neither the speed of light nor the atomic nature of matter has changed much in the past few years.

Section 2.3 looks at what else hardware designers might be able to do to ease the plight of parallel programmers.

Quick Quiz 2.6:
These numbers are insanely large! How can I possibly get my head around them?
