10.07.2015 Views

Is Parallel Programming Hard, And, If So, What Can You Do About It?

CHAPTER 4. COUNTING

     1 atomic_t counter = ATOMIC_INIT(0);
     2
     3 void inc_count(void)
     4 {
     5   atomic_inc(&counter);
     6 }
     7
     8 long read_count(void)
     9 {
    10   return atomic_read(&counter);
    11 }

Figure 4.2: Just Count Atomically!

[Figure 4.3: Atomic Increment Scalability on Nehalem. x axis: Number of CPUs/Threads (1 to 8); y axis: Time Per Increment (nanoseconds), 0 to 900.]

This approach has the additional advantage of being blazingly fast if you are doing lots of reading and almost no incrementing, and on small systems, the performance is excellent.

There is just one large fly in the ointment: this approach can lose counts. On my dual-core laptop, a short run invoked inc_count() 100,014,000 times, but the final value of the counter was only 52,909,118. Although it is true that approximate values have their place in computing, it is almost always necessary to do better than this.

Quick Quiz 4.6: But doesn’t the ++ operator produce an x86 add-to-memory instruction? And won’t the CPU cache cause this to be atomic?

Quick Quiz 4.7: The 8-figure accuracy on the number of failures indicates that you really did test this. Why would it be necessary to test such a trivial program, especially when the bug is easily seen by inspection?

[Figure 4.4: Data Flow For Global Atomic Increment. Diagram of CPUs 0 through 7, each with its own cache, joined by interconnects to a system interconnect and memory.]

The straightforward way to count accurately is to use atomic operations, as shown in Figure 4.2 (count_atomic.c). Line 1 defines an atomic variable, line 5 atomically increments it, and line 10 reads it out. Because this is atomic, it keeps perfect count.
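The contrast between a counter that can lose counts and one that keeps perfect count can be sketched in portable C11. What follows is a minimal illustration under stated assumptions, not the book's count_atomic.c: it assumes C11 <stdatomic.h> and POSIX threads in place of the atomic_t API of Figure 4.2, and the names run_counters(), racy_counter, and exact_counter are hypothetical. The lossy counter is modeled as a separate atomic load followed by an atomic store, which keeps the program free of undefined behavior while still leaving a window in which another thread's update can be overwritten.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

#define MAX_THREADS 64

/* Hypothetical counters for this sketch (not from count_atomic.c). */
static atomic_long racy_counter;   /* updated by a separate load and store */
static atomic_long exact_counter;  /* updated by one atomic read-modify-write */

static void *worker(void *p)
{
    long increments = *(const long *)p;

    for (long i = 0; i < increments; i++) {
        /* Load, add, and store as separate steps: another thread can
         * update the counter in between, and that update is then lost. */
        long v = atomic_load_explicit(&racy_counter, memory_order_relaxed);
        atomic_store_explicit(&racy_counter, v + 1, memory_order_relaxed);

        /* One indivisible increment: no update can be lost. */
        atomic_fetch_add(&exact_counter, 1);
    }
    return NULL;
}

/* Runs nthreads workers, each performing the given number of increments
 * on both counters. Returns 0 on success, -1 on bad arguments or if a
 * thread could not be created. */
int run_counters(int nthreads, long increments, long *racy, long *exact)
{
    pthread_t tid[MAX_THREADS];
    int started = 0;

    if (nthreads < 1 || nthreads > MAX_THREADS || increments < 0)
        return -1;
    atomic_store(&racy_counter, 0);
    atomic_store(&exact_counter, 0);

    while (started < nthreads &&
           pthread_create(&tid[started], NULL, worker, &increments) == 0)
        started++;
    for (int i = 0; i < started; i++)
        pthread_join(tid[i], NULL);

    *racy = atomic_load(&racy_counter);
    *exact = atomic_load(&exact_counter);
    return started == nthreads ? 0 : -1;
}
```

Calling run_counters(2, 1000000, &racy, &exact) on a multicore machine will typically leave racy below 2,000,000, although any single run may happen to lose nothing, while exact always ends at 2,000,000. Building usually requires something like cc -std=c11 -pthread.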
However, atomic increment is slower: on an Intel Core Duo laptop, it is about six times slower than non-atomic increment when a single thread is incrementing, and more than ten times slower if two threads are incrementing.

This poor performance should not be a surprise, given the discussion in Chapter 2, nor should it be a surprise that atomic increment gets slower as the number of CPUs and threads increases, as shown in Figure 4.3. In this figure, the horizontal dashed line resting on the x axis is the ideal performance that would be achieved by a perfectly scalable algorithm: with such an algorithm, a given increment would incur the same overhead that it would in a single-threaded program. Atomic increment of a single global variable is decidedly non-ideal, and gets worse as you add CPUs.

Quick Quiz 4.8: Why doesn’t the dashed line on the x axis meet the diagonal line at y = 1?

Quick Quiz 4.9: But atomic increment is still pretty fast. And incrementing a single variable in a tight loop sounds pretty unrealistic to me. After all, most of the program’s execution should be devoted to actually doing work, not accounting for the work it has done! Why should I care about making this go faster?

For another perspective on global atomic increment, consider Figure 4.4. In order for each CPU to get a chance to increment a given global variable, the cache line containing that variable must circulate among all the CPUs, as shown by the red arrows. Such circulation will take significant time, resulting in the poor performance seen in Figure 4.3.
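A rough way to observe this scaling behavior on your own machine is to time a fixed number of atomic increments per thread while varying the number of threads. The sketch below is an assumption-laden illustration rather than the harness that produced Figure 4.3: it uses C11 atomics and POSIX threads, the names ns_per_increment() and shared_counter are hypothetical, and the measured interval includes thread creation and teardown, so the results are only indicative.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <time.h>

#define MAX_BENCH_THREADS 64

/* Hypothetical single global counter that every thread increments. */
static atomic_long shared_counter;

static void *bench_worker(void *p)
{
    long increments = *(const long *)p;

    /* Every increment needs exclusive ownership of the counter's cache
     * line, so the line must move between the CPUs running the threads. */
    for (long i = 0; i < increments; i++)
        atomic_fetch_add(&shared_counter, 1);
    return NULL;
}

/* Each of nthreads threads performs the given number of increments.
 * Returns elapsed wall-clock nanoseconds divided by the per-thread
 * increment count, or a negative value on error. A perfectly scalable
 * counter would return roughly the same value for any nthreads. */
double ns_per_increment(int nthreads, long increments)
{
    pthread_t tid[MAX_BENCH_THREADS];
    struct timespec start, end;
    int started = 0;

    if (nthreads < 1 || nthreads > MAX_BENCH_THREADS || increments < 1)
        return -1.0;
    atomic_store(&shared_counter, 0);

    /* TIME_UTC is wall-clock time from standard C11; a monotonic clock
     * such as POSIX clock_gettime(CLOCK_MONOTONIC) would be more robust. */
    timespec_get(&start, TIME_UTC);
    while (started < nthreads &&
           pthread_create(&tid[started], NULL, bench_worker, &increments) == 0)
        started++;
    for (int i = 0; i < started; i++)
        pthread_join(tid[i], NULL);
    timespec_get(&end, TIME_UTC);

    if (started < nthreads)
        return -1.0;
    double elapsed_ns = (double)(end.tv_sec - start.tv_sec) * 1e9 +
                        (double)(end.tv_nsec - start.tv_nsec);
    return elapsed_ns / (double)increments;
}
```

Calling ns_per_increment(n, 10000000) for n from 1 up to the number of available CPUs and printing the results gives a table comparable in spirit to Figure 4.3. The absolute numbers depend heavily on the hardware, the compiler, and how the scheduler places the threads.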
