10.07.2015 Views

Is Parallel Programming Hard, And, If So, What Can You Do About It?

CHAPTER 8. DEFERRED PROCESSING

    DEFINE_SPINLOCK(rcu_gp_lock);
    DEFINE_PER_THREAD(int [2], rcu_refcnt);
    long rcu_idx;
    DEFINE_PER_THREAD(int, rcu_nesting);
    DEFINE_PER_THREAD(int, rcu_read_idx);

Figure 8.36: RCU Read-Side Using Per-Thread Reference-Count Pair and Shared Update Data

their first call to rcu_read_lock().

Quick Quiz 8.46: Great, if we have N threads, we can have 2N ten-millisecond waits (one set per flip_counter_and_wait() invocation), and even that assumes that we wait only once for each thread. Don't we need the grace period to complete much more quickly?

This implementation still has several shortcomings. First, the need to flip rcu_idx twice imposes substantial overhead on updates, especially if there are large numbers of threads. Second, synchronize_rcu() must now examine a number of variables that increases linearly with the number of threads, imposing substantial overhead on applications with large numbers of threads. Third, as before, although concurrent RCU updates could in principle be satisfied by a common grace period, this implementation serializes grace periods, preventing grace-period sharing. Finally, as noted in the text, the need for per-thread variables and for enumerating threads may be problematic in some software environments.

That said, the read-side primitives scale very nicely, requiring about 115 nanoseconds regardless of whether running on a single-CPU or a 64-CPU Power5 system.
As noted above, the synchronize_rcu() primitive does not scale, ranging in overhead from almost a microsecond on a single Power5 CPU up to almost 200 microseconds on a 64-CPU system. This implementation could conceivably form the basis for a production-quality user-level RCU implementation.

The next section describes an algorithm permitting more efficient concurrent RCU updates.

8.3.4.6 Scalable Counter-Based RCU With Shared Grace Periods

Figure 8.37 (rcu_rcpls.h) shows the read-side primitives for an RCU implementation using per-thread reference count pairs, as before, but permitting updates to share grace periods. The main difference from the earlier implementation shown in Figure 8.34 is that rcu_idx is now a long that counts freely, so that line 8 of Figure 8.37 must mask off the low-order bit. We also switched from using atomic_read() and atomic_set() to using ACCESS_ONCE(). The data is also quite similar, as shown in Figure 8.36, with rcu_idx now being a long instead of an atomic_t.

     1 static void rcu_read_lock(void)
     2 {
     3   int i;
     4   int n;
     5
     6   n = __get_thread_var(rcu_nesting);
     7   if (n == 0) {                          /* outermost read-side entry */
     8     i = ACCESS_ONCE(rcu_idx) & 0x1;      /* low-order bit selects the counter */
     9     __get_thread_var(rcu_read_idx) = i;
    10     __get_thread_var(rcu_refcnt)[i]++;
    11   }
    12   __get_thread_var(rcu_nesting) = n + 1;
    13   smp_mb();
    14 }
    15
    16 static void rcu_read_unlock(void)
    17 {
    18   int i;
    19   int n;
    20
    21   smp_mb();
    22   n = __get_thread_var(rcu_nesting);
    23   if (n == 1) {                          /* outermost read-side exit */
    24     i = __get_thread_var(rcu_read_idx);
    25     __get_thread_var(rcu_refcnt)[i]--;
    26   }
    27   __get_thread_var(rcu_nesting) = n - 1;
    28 }

Figure 8.37: RCU Read-Side Using Per-Thread Reference-Count Pair and Shared Update

Figure 8.38 (rcu_rcpls.c) shows the implementation of synchronize_rcu() and its helper function flip_counter_and_wait(). These are similar to those in Figure 8.35. The differences in flip_counter_and_wait() include:

1. Line 6 uses ACCESS_ONCE() instead of atomic_set(), and increments rather than complementing.
2. A new line 7 masks the counter down to its bottom bit.

The changes to synchronize_rcu() are more pervasive:

1. There is a new oldctr local variable that captures the pre-lock-acquisition value of rcu_idx on line 23.
2. Line 26 uses ACCESS_ONCE() instead of atomic_read().
3. Lines 27-30 check to see if at least three counter flips were performed by other threads while the lock was being acquired, and, if so, release the lock, execute a memory barrier, and return. In
