10.07.2015 Views

Is Parallel Programming Hard, And, If So, What Can You Do About It?


104 CHAPTER 8. DEFERRED PROCESSING

scheme in Section 8.3.4.7, which can be thought of as having a single low-order bit reserved for counting nesting depth. Two C-preprocessor macros are used to arrange this, RCU_GP_CTR_NEST_MASK and RCU_GP_CTR_BOTTOM_BIT. These are related: RCU_GP_CTR_NEST_MASK = RCU_GP_CTR_BOTTOM_BIT - 1. The RCU_GP_CTR_BOTTOM_BIT macro contains a single bit that is positioned just above the bits reserved for counting nesting, and the RCU_GP_CTR_NEST_MASK has all one bits covering the region of rcu_gp_ctr used to count nesting. Obviously, these two C-preprocessor macros must reserve enough of the low-order bits of the counter to permit the maximum required nesting of RCU read-side critical sections, and this implementation reserves seven bits, for a maximum RCU read-side critical-section nesting depth of 127, which should be well in excess of that needed by most applications.

The resulting rcu_read_lock() implementation is still reasonably straightforward. Line 6 places a pointer to this thread’s instance of rcu_reader_gp into the local variable rrgp, minimizing the number of expensive calls to the pthreads thread-local-state API. Line 7 records the current value of rcu_reader_gp into another local variable tmp, and line 8 checks to see if the low-order bits are zero, which would indicate that this is the outermost rcu_read_lock(). If so, line 9 places the global rcu_gp_ctr into tmp because the current value previously fetched by line 7 is likely to be obsolete. In either case, line 10 increments the nesting depth, which you will recall is stored in the seven low-order bits of the counter.
Line 11 stores the updated counter back into this thread’s instance of rcu_reader_gp, and, finally, line 12 executes a memory barrier to prevent the RCU read-side critical section from bleeding out into the code preceding the call to rcu_read_lock().

In other words, this implementation of rcu_read_lock() picks up a copy of the global rcu_gp_ctr unless the current invocation of rcu_read_lock() is nested within an RCU read-side critical section, in which case it instead fetches the contents of the current thread’s instance of rcu_reader_gp. Either way, it increments whatever value it fetched in order to record an additional nesting level, and stores the result in the current thread’s instance of rcu_reader_gp.

Interestingly enough, the implementation of rcu_read_unlock() is identical to that shown in Section 8.3.4.7. Line 19 executes a memory barrier in order to prevent the RCU read-side critical section from bleeding out into code following the call to rcu_read_unlock(), and line 20 decrements this thread’s instance of rcu_reader_gp, which has the effect of decrementing the nesting count contained in rcu_reader_gp’s low-order bits. Debugging versions of this primitive would check (before decrementing!) that these low-order bits were non-zero.

    DEFINE_SPINLOCK(rcu_gp_lock);
    long rcu_gp_ctr = 0;
    DEFINE_PER_THREAD(long, rcu_reader_qs_gp);

Figure 8.43: Data for Quiescent-State-Based RCU

The implementation of synchronize_rcu() is quite similar to that shown in Section 8.3.4.7. There are two differences. The first is that line 29 adds RCU_GP_CTR_BOTTOM_BIT to the global rcu_gp_ctr instead of adding the constant “2”, and the second is that the comparison on line 32 has been abstracted out to a separate function, where it checks the bit indicated by RCU_GP_CTR_BOTTOM_BIT instead of unconditionally checking the low-order bit.

This approach achieves read-side performance almost equal to that shown in Section 8.3.4.7, incurring roughly 65 nanoseconds of overhead regardless of the number of Power5 CPUs.
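
The read-side bit manipulation described above can be sketched in C as follows. This is a minimal illustration and not the listing in Figure 8.42: it assumes C11 `_Thread_local` storage in place of the pthreads thread-local-state API, uses `atomic_thread_fence()` as a stand-in for the memory barrier, and introduces a hypothetical `RCU_GP_CTR_SHIFT` macro to express the seven reserved bits. The updater side (synchronize_rcu()) is omitted, and the plain loads and stores are adequate only because the sketch is exercised from a single thread.

```c
#include <assert.h>
#include <stdatomic.h>

/* Assumption: seven low-order bits count nesting, as stated in the text.
 * RCU_GP_CTR_SHIFT is a hypothetical helper name for this sketch. */
#define RCU_GP_CTR_SHIFT      7
#define RCU_GP_CTR_BOTTOM_BIT (1L << RCU_GP_CTR_SHIFT)    /* bit just above the nesting bits */
#define RCU_GP_CTR_NEST_MASK  (RCU_GP_CTR_BOTTOM_BIT - 1) /* all ones over the nesting bits */

long rcu_gp_ctr = 0;               /* global grace-period counter */
_Thread_local long rcu_reader_gp;  /* per-thread snapshot plus nesting count */

void rcu_read_lock(void)
{
    long *rrgp = &rcu_reader_gp;   /* look up the per-thread variable once */
    long tmp = *rrgp;

    /* Outermost rcu_read_lock(): the old snapshot is likely obsolete,
     * so pick up the current global counter instead. */
    if ((tmp & RCU_GP_CTR_NEST_MASK) == 0)
        tmp = rcu_gp_ctr;
    tmp++;                         /* one more nesting level in the low-order bits */
    *rrgp = tmp;
    /* Stand-in for the memory barrier that keeps the critical section
     * from bleeding out into preceding code. */
    atomic_thread_fence(memory_order_seq_cst);
}

void rcu_read_unlock(void)
{
    /* Stand-in for the memory barrier that keeps the critical section
     * from bleeding out into following code. */
    atomic_thread_fence(memory_order_seq_cst);
    /* Debugging check, made before decrementing: nesting must be non-zero. */
    assert((rcu_reader_gp & RCU_GP_CTR_NEST_MASK) != 0);
    rcu_reader_gp--;               /* drop one nesting level */
}
```

In this sketch, a nested rcu_read_lock() keeps the upper bits that the outermost call captured even if rcu_gp_ctr has since advanced by RCU_GP_CTR_BOTTOM_BIT. Only the next outermost rcu_read_lock() picks up the new global value.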
Updates again incur more overhead, ranging from about 600 nanoseconds on a single Power5 CPU to more than 100 microseconds on 64 such CPUs.

Quick Quiz 8.52: Why not simply maintain a separate per-thread nesting-level variable, as was done in the previous section, rather than having all this complicated bit manipulation?

This implementation suffers from the same shortcomings as does that of Section 8.3.4.7, except that nesting of RCU read-side critical sections is now permitted. In addition, on 32-bit systems, this approach shortens the time required to overflow the global rcu_gp_ctr variable. The following section shows one way to greatly increase the time required for overflow to occur, while greatly reducing read-side overhead.

Quick Quiz 8.53: Given the algorithm shown in Figure 8.42, how could you double the time required to overflow the global rcu_gp_ctr?

Quick Quiz 8.54: Again, given the algorithm shown in Figure 8.42, is counter overflow fatal? Why or why not? If it is fatal, what can be done to fix it?

8.3.4.9 RCU Based on Quiescent States

Figure 8.44 (rcu_qs.h) shows the read-side primitives used to construct a user-level implementation of RCU based on quiescent states, with the data shown in Figure 8.43. As can be seen from
