Is Parallel Programming Hard, And, If So, What Can You Do About It?

CHAPTER 5. PARTITIONING AND SYNCHRONIZATION DESIGN

[Figure 5.21: Synchronization Efficiency (efficiency versus number of CPUs/threads; one trace per value of f: 10, 25, 50, 75, 100)]
[Figure 5.22: Matrix Multiply Efficiency (efficiency versus number of CPUs/threads; one trace per matrix size: 64, 128, 256, 512, 1024)]

an atomic increment every 250 nanoseconds, and the f = 100 line corresponds to each CPU attempting an atomic increment every 2.5 microseconds, which in turn corresponds to several thousand instructions. Given that each trace drops off sharply with increasing numbers of CPUs or threads, we can conclude that synchronization mechanisms based on atomic manipulation of a single global shared variable will not scale well if used heavily on current commodity hardware. This is a mathematical depiction of the forces leading to the parallel counting algorithms that were discussed in Chapter 4.

The concept of efficiency is useful even in cases having little or no formal synchronization. Consider for example a matrix multiply, in which the columns of one matrix are multiplied (via "dot product") by the rows of another, resulting in an entry in a third matrix. Because none of these operations conflict, it is possible to partition the columns of the first matrix among a group of threads, with each thread computing the corresponding columns of the result matrix. The threads can therefore operate entirely independently, with no synchronization overhead whatsoever, as is done in matmul.c. One might therefore expect a parallel matrix multiply to have a perfect efficiency of 1.0.

However, Figure 5.22 tells a different story, especially for a 64-by-64 matrix multiply, which never gets above an efficiency of about 0.7, even when running single-threaded.
The 512-by-512 matrix multiply's efficiency is measurably less than 1.0 on as few as 10 threads, and even the 1024-by-1024 matrix multiply deviates noticeably from perfection at a few tens of threads.

Quick Quiz 5.12: How can a single-threaded 64-by-64 matrix multiply possibly have an efficiency of less than 1.0? Shouldn't all of the traces in Figure 5.22 have efficiency of exactly 1.0 when running on only one thread?

Given these inefficiencies, it is worthwhile to look into more-scalable approaches such as the data locking described in Section 5.3.3 or the parallel-fastpath approach discussed in the next section.

Quick Quiz 5.13: How are data-parallel techniques going to help with matrix multiply? It is already data parallel!!!

5.4 Parallel Fastpath

Fine-grained (and therefore usually higher-performance) designs are typically more complex than are coarser-grained designs. In many cases, most of the overhead is incurred by a small fraction of the code [Knu73]. So why not focus effort on that small fraction?

This is the idea behind the parallel-fastpath design pattern: to aggressively parallelize the common-case code path without incurring the complexity that would be required to aggressively parallelize the entire algorithm. You must understand not only the specific algorithm you wish to parallelize, but also
