Is Parallel Programming Hard, And, If So, What Can You Do About It?

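The code and prose below refer to several shared and per-thread variables whose declaration figure does not appear on this page. The following is a rough reconstruction based solely on the identifiers used below, assuming the book's DEFINE_SPINLOCK() and DEFINE_PER_THREAD() conventions for locks and per-thread variables; it is not the book's actual declaration figure.

/* Hedged reconstruction of the state assumed on this page. */
atomic_t rcu_refcnt[2];               /* Global pair of reference counters. */
atomic_t rcu_idx;                     /* Selects which counter new readers use. */
DEFINE_SPINLOCK(rcu_gp_lock);         /* Serializes grace periods (assumed macro). */
DEFINE_PER_THREAD(int, rcu_nesting);  /* Per-thread read-side nesting depth. */
DEFINE_PER_THREAD(int, rcu_read_idx); /* Index chosen by outermost rcu_read_lock(). */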
The RCU read-side critical section does not bleed out after the rcu_read_unlock() code. Line 22 picks up this thread's instance of rcu_nesting, and if line 23 finds that this is the outermost rcu_read_unlock(), then lines 24 and 25 pick up this thread's instance of rcu_read_idx (saved by the outermost rcu_read_lock()) and atomically decrement the selected element of rcu_refcnt. Regardless of the nesting level, line 27 decrements this thread's instance of rcu_nesting.
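The figure containing those line numbers (Figure 8.31) falls on the preceding page and is not shown here. As a rough sketch only, read-side primitives consistent with the description above might look as follows; the __get_thread_var() per-thread accessor is assumed from the book's per-thread-variable API, and the line numbering of this sketch will not match the numbers cited in the text.

static void rcu_read_lock(void)
{
  int i;
  int n;

  n = __get_thread_var(rcu_nesting);
  if (n == 0) {
    /* Outermost rcu_read_lock(): snapshot rcu_idx and increment the
     * corresponding global counter so that a concurrent grace period
     * must wait for this reader. */
    i = atomic_read(&rcu_idx);
    __get_thread_var(rcu_read_idx) = i;
    atomic_inc(&rcu_refcnt[i]);
  }
  __get_thread_var(rcu_nesting) = n + 1;
  smp_mb(); /* Keep the critical section from leaking out before the lock. */
}

static void rcu_read_unlock(void)
{
  int i;
  int n;

  smp_mb(); /* Keep the critical section from leaking out after the unlock. */
  n = __get_thread_var(rcu_nesting);
  if (n == 1) {
    /* Outermost rcu_read_unlock(): release the counter chosen at lock time. */
    i = __get_thread_var(rcu_read_idx);
    atomic_dec(&rcu_refcnt[i]);
  }
  __get_thread_var(rcu_nesting) = n - 1;
}

The memory barriers at the end of rcu_read_lock() and the beginning of rcu_read_unlock() are what keep the critical section from bleeding out in either direction, matching the barrier discussion above.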
Figure 8.32 (rcu_rcpg.c) shows the corresponding synchronize_rcu() implementation.

 1 void synchronize_rcu(void)
 2 {
 3   int i;
 4
 5   smp_mb();
 6   spin_lock(&rcu_gp_lock);
 7   i = atomic_read(&rcu_idx);
 8   atomic_set(&rcu_idx, !i);
 9   smp_mb();
10   while (atomic_read(&rcu_refcnt[i]) != 0) {
11     poll(NULL, 0, 10);
12   }
13   smp_mb();
14   atomic_set(&rcu_idx, i);
15   smp_mb();
16   while (atomic_read(&rcu_refcnt[!i]) != 0) {
17     poll(NULL, 0, 10);
18   }
19   spin_unlock(&rcu_gp_lock);
20   smp_mb();
21 }

Figure 8.32: RCU Update Using Global Reference-Count Pair

Lines 6 and 19 acquire and release rcu_gp_lock in order to prevent more than one concurrent instance of synchronize_rcu(). Lines 7-8 pick up the value of rcu_idx and complement it, respectively, so that subsequent instances of rcu_read_lock() will use a different element of rcu_refcnt than did preceding instances. Lines 10-12 then wait for the prior element of rcu_refcnt to reach zero, with the memory barrier on line 9 ensuring that the check of rcu_refcnt is not reordered to precede the complementing of rcu_idx. Lines 13-18 repeat this process, and line 20 ensures that any subsequent reclamation operations are not reordered to precede the checking of rcu_refcnt.

Quick Quiz 8.42: Why the memory barrier on line 5 of synchronize_rcu() in Figure 8.32, given that there is a spin-lock acquisition immediately after?

Quick Quiz 8.43: Why is the counter flipped twice in Figure 8.32? Shouldn't a single flip-and-wait cycle be sufficient?

This implementation avoids the update-starvation issues that could occur in the single-counter implementation shown in Figure 8.29.

There are still some serious shortcomings. First, the atomic operations in rcu_read_lock() and rcu_read_unlock() are still quite heavyweight. In fact, they are more complex than those of the single-counter variant shown in Figure 8.29, with the read-side primitives consuming about 150 nanoseconds on a single Power5 CPU and almost 40 microseconds on a 64-CPU system. The update-side synchronize_rcu() primitive is more costly as well, ranging from about 200 nanoseconds on a single Power5 CPU to more than 40 microseconds on a 64-CPU system. This means that the RCU read-side critical sections have to be extremely long in order to get any real read-side parallelism.

Second, if there are many concurrent rcu_read_lock() and rcu_read_unlock() operations, there will be extreme memory contention on the rcu_refcnt elements, resulting in expensive cache misses. This further extends the RCU read-side critical-section duration required to provide parallel read-side access. These first two shortcomings defeat the purpose of RCU in most situations.

Third, the need to flip rcu_idx twice imposes substantial overhead on updates, especially if there are large numbers of threads.

Finally, despite the fact that concurrent RCU updates could in principle be satisfied by a common grace period, this implementation serializes grace periods, preventing grace-period sharing.

Quick Quiz 8.44: Given that atomic increment and decrement are so expensive, why not just use a non-atomic increment on line 10 and a non-atomic decrement on line 25 of Figure 8.31?

Despite these shortcomings, one could imagine this variant of RCU being used on small, tightly coupled multiprocessors, perhaps as a memory-conserving implementation that maintains API compatibility with more complex implementations. However, it is not likely to scale well beyond a few CPUs.

The next section describes yet another variation on the reference-counting scheme that provides greatly improved read-side performance and scalability.

8.3.4.5 Scalable Counter-Based RCU

Figure 8.34 (rcu_rcpl.h) shows the read-side primitives of an RCU implementation that uses per-thread pairs of reference counters. This implementation is quite similar to that shown in Figure 8.31, the only difference being that rcu_refcnt is now a per-thread variable (as shown in Figure 8.33), so the rcu_read_lock() and rcu_read_unlock() primi-
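Figures 8.33 and 8.34 fall beyond this page, and the paragraph above is cut off, but as a hedged sketch of the change it begins to describe, rcu_refcnt becoming a per-thread pair of counters, the outermost rcu_read_lock() path might look roughly as follows (again assuming the book's __get_thread_var() accessor and DEFINE_PER_THREAD() macro; this is not the book's actual code):

/* Hypothetical per-thread pair of counters replacing the global rcu_refcnt[]. */
DEFINE_PER_THREAD(atomic_t [2], rcu_refcnt);

static void rcu_read_lock(void)
{
  int i;
  int n;

  n = __get_thread_var(rcu_nesting);
  if (n == 0) {
    /* Outermost rcu_read_lock(): increment this thread's own counter,
     * avoiding cache-line contention with other readers. */
    i = atomic_read(&rcu_idx);
    __get_thread_var(rcu_read_idx) = i;
    atomic_inc(&__get_thread_var(rcu_refcnt)[i]);
  }
  __get_thread_var(rcu_nesting) = n + 1;
  smp_mb();
}

The presumable trade-off is that the grace-period code must now examine each thread's counters rather than a single global pair.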
