Is Parallel Programming Hard, And, If So, What Can You Do About It?
this case, there were two full waits for the counters to go to zero, so those other threads already did all the required work.

4. At lines 33-34, flip_counter_and_wait() is only invoked a second time if there were fewer than two counter flips while the lock was being acquired. On the other hand, if there were two counter flips, some other thread did one full wait for all the counters to go to zero, so only one more is required.

 1 static void flip_counter_and_wait(int ctr)
 2 {
 3   int i;
 4   int t;
 5
 6   ACCESS_ONCE(rcu_idx) = ctr + 1;
 7   i = ctr & 0x1;
 8   smp_mb();
 9   for_each_thread(t) {
10     while (per_thread(rcu_refcnt, t)[i] != 0) {
11       poll(NULL, 0, 10);
12     }
13   }
14   smp_mb();
15 }
16
17 void synchronize_rcu(void)
18 {
19   int ctr;
20   int oldctr;
21
22   smp_mb();
23   oldctr = ACCESS_ONCE(rcu_idx);
24   smp_mb();
25   spin_lock(&rcu_gp_lock);
26   ctr = ACCESS_ONCE(rcu_idx);
27   if (ctr - oldctr >= 3) {
28     spin_unlock(&rcu_gp_lock);
29     smp_mb();
30     return;
31   }
32   flip_counter_and_wait(ctr);
33   if (ctr - oldctr < 2)
34     flip_counter_and_wait(ctr + 1);
35   spin_unlock(&rcu_gp_lock);
36   smp_mb();
37 }

Figure 8.38: RCU Shared Update Using Per-Thread Reference-Count Pair

With this approach, if an arbitrarily large number of threads invoke synchronize_rcu() concurrently, with one CPU for each thread, there will be a total of only three waits for counters to go to zero.

Despite the improvements, this implementation of RCU still has a few shortcomings. First, as before, the need to flip rcu_idx twice imposes substantial overhead on updates, especially if there are large numbers of threads.

Second, each updater still acquires rcu_gp_lock, even if there is no work to be done. This can result in a severe scalability limitation if there are large numbers of concurrent updates. Section D.4 shows one way to avoid this in a production-quality real-time implementation of RCU for the Linux kernel.

Third, this implementation requires per-thread variables and the ability to enumerate threads, which again can be problematic in some software environments.

Finally, on 32-bit machines, a given update thread might be preempted long enough for the rcu_idx counter to overflow. This could cause such a thread to force an unnecessary pair of counter flips. However, even if each grace period took only one microsecond, the offending thread would need to be preempted for more than an hour, in which case an extra pair of counter flips is likely the least of your worries.

As with the implementation described in Section 8.3.4.3, the read-side primitives scale extremely well, incurring roughly 115 nanoseconds of overhead regardless of the number of CPUs. The synchronize_rcu() primitive is still expensive, ranging from about one microsecond up to about 16 microseconds. This is nevertheless much cheaper than the roughly 200 microseconds incurred by the implementation in Section 8.3.4.5. So, despite its shortcomings, one could imagine this RCU implementation being used in production in real-life applications.

Quick Quiz 8.47: All of these toy RCU implementations have either atomic operations in rcu_read_lock() and rcu_read_unlock(), or
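Figure 8.38 shows only the update side; the matching read-side primitives do not appear on this page. As a minimal sketch (not a reproduction of the book's listing), rcu_read_lock() and rcu_read_unlock() could maintain the per-thread rcu_refcnt pairs that flip_counter_and_wait() polls roughly as follows, assuming per-thread rcu_nesting and rcu_read_idx variables and a __get_thread_var() accessor for the running thread's instance of a per-thread variable; these names are illustrative assumptions.

static void rcu_read_lock(void)
{
  int i;
  int n;

  /* Nested readers only bump the per-thread nesting count. */
  n = __get_thread_var(rcu_nesting);
  if (n == 0) {
    /* Outermost reader: snapshot the current index and
     * increment this thread's counter for that index. */
    i = ACCESS_ONCE(rcu_idx) & 0x1;
    __get_thread_var(rcu_read_idx) = i;
    __get_thread_var(rcu_refcnt)[i]++;
  }
  __get_thread_var(rcu_nesting) = n + 1;
  smp_mb();  /* Order counter update before the critical section. */
}

static void rcu_read_unlock(void)
{
  int i;
  int n;

  smp_mb();  /* Order the critical section before the counter update. */
  n = __get_thread_var(rcu_nesting);
  if (n == 1) {
    /* Outermost unlock: decrement the same counter element that
     * the matching rcu_read_lock() incremented, even if rcu_idx
     * has since been flipped. */
    i = __get_thread_var(rcu_read_idx);
    __get_thread_var(rcu_refcnt)[i]--;
  }
  __get_thread_var(rcu_nesting) = n - 1;
}

Because the outermost rcu_read_unlock() decrements the same element of the pair that its rcu_read_lock() incremented, a counter flip only has to wait for readers that started before the flip; readers starting afterward increment the other element of the pair.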
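To make the closing remark about production use concrete, here is a hedged usage sketch of the classic RCU update pattern built on these primitives. The structure foo, the pointer gptr, the lock my_lock, and the rcu_dereference()/rcu_assign_pointer() wrappers are assumptions introduced for this example (the wrappers behave as in the Linux-kernel RCU API); they are not part of the toy implementation above, which is assumed to be in scope.

struct foo {
  int a;
};
struct foo *gptr;          /* RCU-protected pointer (example only). */

/* Reader: traverse under rcu_read_lock(); cost is roughly the
 * 115 ns read-side overhead quoted above. */
int reader(void)
{
  struct foo *p;
  int ret = 0;

  rcu_read_lock();
  p = rcu_dereference(gptr);
  if (p != NULL)
    ret = p->a;
  rcu_read_unlock();
  return ret;
}

/* Updater: publish a new version, wait for a grace period, then
 * free the old one.  synchronize_rcu() is the expensive step
 * (microseconds), so updates should be comparatively rare. */
void updater(int new_a)
{
  struct foo *newp = malloc(sizeof(*newp));
  struct foo *oldp;

  newp->a = new_a;
  spin_lock(&my_lock);
  oldp = gptr;
  rcu_assign_pointer(gptr, newp);
  spin_unlock(&my_lock);
  synchronize_rcu();       /* Wait for pre-existing readers. */
  free(oldp);
}

The reader pays only for the rcu_read_lock()/rcu_read_unlock() pair, while the updater pays for synchronize_rcu() before freeing, which is why this pattern favors read-mostly workloads.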