Is Parallel Programming Hard, And, If So, What Can You Do About It?


CHAPTER 8. DEFERRED PROCESSING

scheme in Section 8.3.4.7, which can be thought of as having a single low-order bit reserved for counting nesting depth. Two C-preprocessor macros are used to arrange this, RCU_GP_CTR_NEST_MASK and RCU_GP_CTR_BOTTOM_BIT. These are related: RCU_GP_CTR_NEST_MASK = RCU_GP_CTR_BOTTOM_BIT - 1. The RCU_GP_CTR_BOTTOM_BIT macro contains a single bit that is positioned just above the bits reserved for counting nesting, and the RCU_GP_CTR_NEST_MASK has all one bits covering the region of rcu_gp_ctr used to count nesting. Obviously, these two C-preprocessor macros must reserve enough of the low-order bits of the counter to permit the maximum required nesting of RCU read-side critical sections, and this implementation reserves seven bits, for a maximum RCU read-side critical-section nesting depth of 127, which should be well in excess of that needed by most applications.

The resulting rcu_read_lock() implementation is still reasonably straightforward. Line 6 places a pointer to this thread's instance of rcu_reader_gp into the local variable rrgp, minimizing the number of expensive calls to the pthreads thread-local-state API. Line 7 records the current value of rcu_reader_gp into another local variable tmp, and line 8 checks to see if the low-order bits are zero, which would indicate that this is the outermost rcu_read_lock(). If so, line 9 places the global rcu_gp_ctr into tmp because the current value previously fetched by line 7 is likely to be obsolete. In either case, line 10 increments the nesting depth, which you will recall is stored in the seven low-order bits of the counter.
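
The relationship between the two macros, and the way the nesting depth shares a word with the grace-period count, can be checked with a few lines of C. This is only a sketch under the stated seven-bit assumption: RCU_GP_CTR_SHIFT, nest_depth(), and gp_part() are hypothetical names introduced here for illustration and do not appear in the text.

```c
/* Assumption: seven low-order bits hold the nesting depth, as stated above.
 * RCU_GP_CTR_SHIFT is a hypothetical helper name used only in this sketch. */
#define RCU_GP_CTR_SHIFT      7
#define RCU_GP_CTR_BOTTOM_BIT (1L << RCU_GP_CTR_SHIFT)     /* 0x80 */
#define RCU_GP_CTR_NEST_MASK  (RCU_GP_CTR_BOTTOM_BIT - 1)  /* 0x7f */

/* Hypothetical helper: nesting depth held in the low-order bits. */
static long nest_depth(long ctr)
{
	return ctr & RCU_GP_CTR_NEST_MASK;
}

/* Hypothetical helper: grace-period portion, at or above the bottom bit. */
static long gp_part(long ctr)
{
	return ctr & ~RCU_GP_CTR_NEST_MASK;
}
```

Incrementing a counter value by one changes only nest_depth(), while adding RCU_GP_CTR_BOTTOM_BIT changes only gp_part(), which is why readers and updaters can share the same word as long as nesting stays at or below 127.
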
Line 11 stores the updated counter back into this thread's instance of rcu_reader_gp, and, finally, line 12 executes a memory barrier to prevent the RCU read-side critical section from bleeding out into the code preceding the call to rcu_read_lock().

In other words, this implementation of rcu_read_lock() picks up a copy of the global rcu_gp_ctr unless the current invocation of rcu_read_lock() is nested within an RCU read-side critical section, in which case it instead fetches the contents of the current thread's instance of rcu_reader_gp. Either way, it increments whatever value it fetched in order to record an additional nesting level, and stores the result in the current thread's instance of rcu_reader_gp.

Interestingly enough, the implementation of rcu_read_unlock() is identical to that shown in Section 8.3.4.7. Line 19 executes a memory barrier in order to prevent the RCU read-side critical section from bleeding out into code following the call to rcu_read_unlock(), and line 20 decrements this thread's instance of rcu_reader_gp, which has the effect of decrementing the nesting count contained in rcu_reader_gp's low-order bits. Debugging versions of this primitive would check (before decrementing!) that these low-order bits were non-zero.

    DEFINE_SPINLOCK(rcu_gp_lock);
    long rcu_gp_ctr = 0;
    DEFINE_PER_THREAD(long, rcu_reader_qs_gp);

Figure 8.43: Data for Quiescent-State-Based RCU

The implementation of synchronize_rcu() is quite similar to that shown in Section 8.3.4.7. There are two differences. The first is that line 29 adds RCU_GP_CTR_BOTTOM_BIT to the global rcu_gp_ctr instead of adding the constant "2", and the second is that the comparison on line 32 has been abstracted out to a separate function, where it checks the bit indicated by RCU_GP_CTR_BOTTOM_BIT instead of unconditionally checking the low-order bit.

This approach achieves read-side performance almost equal to that shown in Section 8.3.4.7, incurring roughly 65 nanoseconds of overhead regardless of the number of Power5 CPUs.
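
The read-side steps walked through above can be sketched in portable C11. This is an illustration, not the book's listing: the sketch_ function names are hypothetical, atomic_thread_fence() stands in for the memory barriers the text describes, C11 _Thread_local storage replaces the pthreads thread-local-state API (so no rrgp pointer is needed), and the seven-bit mask values are assumed from the text.

```c
#include <assert.h>
#include <stdatomic.h>

#define RCU_GP_CTR_BOTTOM_BIT 0x80L
#define RCU_GP_CTR_NEST_MASK  (RCU_GP_CTR_BOTTOM_BIT - 1)

static _Atomic long rcu_gp_ctr;                   /* global grace-period counter */
static _Thread_local _Atomic long rcu_reader_gp;  /* per-thread snapshot + nesting */

static void sketch_rcu_read_lock(void)
{
	long tmp = atomic_load_explicit(&rcu_reader_gp, memory_order_relaxed);

	/* Outermost call: the old snapshot is stale, so take a fresh one. */
	if ((tmp & RCU_GP_CTR_NEST_MASK) == 0)
		tmp = atomic_load_explicit(&rcu_gp_ctr, memory_order_relaxed);
	tmp++;  /* record one more nesting level in the low-order bits */
	atomic_store_explicit(&rcu_reader_gp, tmp, memory_order_relaxed);
	/* Stand-in for the memory barrier: keep the critical section
	 * from leaking into code that precedes this call. */
	atomic_thread_fence(memory_order_seq_cst);
}

static void sketch_rcu_read_unlock(void)
{
	/* Keep the critical section from leaking into code that follows. */
	atomic_thread_fence(memory_order_seq_cst);
	/* Debug check, performed before decrementing. */
	assert((atomic_load_explicit(&rcu_reader_gp, memory_order_relaxed) &
		RCU_GP_CTR_NEST_MASK) != 0);
	atomic_fetch_sub_explicit(&rcu_reader_gp, 1, memory_order_relaxed);
}
```

A nested sketch_rcu_read_lock() keeps the snapshot taken by the outermost call even if rcu_gp_ctr has advanced in the meantime; only the next outermost call picks up the new value.
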
Updates again incur more overhead, ranging from about 600 nanoseconds on a single Power5 CPU to more than 100 microseconds on 64 such CPUs.

Quick Quiz 8.52: Why not simply maintain a separate per-thread nesting-level variable, as was done in the previous section, rather than having all this complicated bit manipulation?

This implementation suffers from the same shortcomings as does that of Section 8.3.4.7, except that nesting of RCU read-side critical sections is now permitted. In addition, on 32-bit systems, this approach shortens the time required to overflow the global rcu_gp_ctr variable. The following section shows one way to greatly increase the time required for overflow to occur, while greatly reducing read-side overhead.

Quick Quiz 8.53: Given the algorithm shown in Figure 8.42, how could you double the time required to overflow the global rcu_gp_ctr?

Quick Quiz 8.54: Again, given the algorithm shown in Figure 8.42, is counter overflow fatal? Why or why not? If it is fatal, what can be done to fix it?

8.3.4.9 RCU Based on Quiescent States

Figure 8.44 (rcu_qs.h) shows the read-side primitives used to construct a user-level implementation of RCU based on quiescent states, with the data shown in Figure 8.43. As can be seen from lines 1-7 in the figure, the rcu_read_lock() and rcu_read_unlock() primitives do nothing, and can in fact be expected to be inlined and optimized away, as they are in server builds of the Linux kernel. This is due to the fact that quiescent-state-based RCU implementations approximate the extents of RCU read-side critical sections using the aforementioned quiescent states, which contain calls to rcu_quiescent_state(), shown from lines 9-15 in the figure. Threads entering extended quiescent states (for example, when blocking) may instead use the rcu_thread_offline() and rcu_thread_online() APIs to mark the beginning and the end, respectively, of such an extended quiescent state. As such, rcu_thread_online() is analogous to rcu_read_lock() and rcu_thread_offline() is analogous to rcu_read_unlock(). These two functions are shown on lines 17-28 in the figure. In either case, it is illegal for a quiescent state to appear within an RCU read-side critical section.

     1 static void rcu_read_lock(void)
     2 {
     3 }
     4
     5 static void rcu_read_unlock(void)
     6 {
     7 }
     8
     9 static void rcu_quiescent_state(void)
    10 {
    11   smp_mb();
    12   __get_thread_var(rcu_reader_qs_gp) =
    13     ACCESS_ONCE(rcu_gp_ctr) + 1;
    14   smp_mb();
    15 }
    16
    17 static void rcu_thread_offline(void)
    18 {
    19   smp_mb();
    20   __get_thread_var(rcu_reader_qs_gp) =
    21     ACCESS_ONCE(rcu_gp_ctr);
    22   smp_mb();
    23 }
    24
    25 static void rcu_thread_online(void)
    26 {
    27   rcu_quiescent_state();
    28 }

Figure 8.44: Quiescent-State-Based RCU Read Side

In rcu_quiescent_state(), line 11 executes a memory barrier to prevent any code prior to the quiescent state from being reordered into the quiescent state.
Lines 12-13 pick up a copy of the global rcu_gp_ctr, using ACCESS_ONCE() to ensure that the compiler does not employ any optimizations that would result in rcu_gp_ctr being fetched more than once, and then add one to the value fetched and store it into the per-thread rcu_reader_qs_gp variable, so that any concurrent instance of synchronize_rcu() will see an odd-numbered value, thus becoming aware that a new RCU read-side critical section has started. Instances of synchronize_rcu() that are waiting on older RCU read-side critical sections will thus know to ignore this new one. Finally, line 14 executes a memory barrier.

    void synchronize_rcu(void)
    {
      int t;

      smp_mb();
      spin_lock(&rcu_gp_lock);
      rcu_gp_ctr += 2;  /* start a new grace period; counter stays even */
      smp_mb();
      for_each_thread(t) {
        /* wait while thread t is online and has not yet
           reported a quiescent state for this grace period */
        while (rcu_gp_ongoing(t) &&
               ((per_thread(rcu_reader_qs_gp, t) -
                 rcu_gp_ctr) < 0)) {
          poll(NULL, 0, 10);
        }
      }
      spin_unlock(&rcu_gp_lock);
      smp_mb();
    }

Figure 8.45: RCU Update Side Using Quiescent States

Quick Quiz 8.55: Doesn't the additional memory barrier shown on line 14 of Figure 8.44 greatly increase the overhead of rcu_quiescent_state?

Some applications might use RCU only occasionally, but use it very heavily when they do use it. Such applications might choose to use rcu_thread_online() when starting to use RCU and rcu_thread_offline() when no longer using RCU. The time between a call to rcu_thread_offline() and a subsequent call to rcu_thread_online() is an extended quiescent state, so that RCU will not expect explicit quiescent states to be registered during this time.

The rcu_thread_offline() function simply sets the per-thread rcu_reader_qs_gp variable to the current value of rcu_gp_ctr, which has an even-numbered value.
Any concurrent instances of synchronize_rcu() will thus know to ignore this thread.

Quick Quiz 8.56: Why are the two memory barriers on lines 19 and 22 of Figure 8.44 needed?

The rcu_thread_online() function simply invokes rcu_quiescent_state(), thus marking the end of the extended quiescent state.

Figure 8.45 (rcu_qs.c) shows the implementation of synchronize_rcu(), which is quite similar to that of the preceding sections.

This implementation has blazingly fast read-side primitives, with an rcu_read_lock()/rcu_read_unlock() round trip incurring an overhead of roughly 50 picoseconds. The synchronize_rcu()
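
The odd/even convention described for rcu_quiescent_state(), rcu_thread_offline(), and the wait condition in synchronize_rcu() can be traced with a small single-threaded model. This is a sketch under several assumptions: a fixed-size array with the hypothetical size NR_THREADS stands in for the per-thread variables, the sketch_ function names are invented for illustration, memory barriers and locking are omitted, and sketch_gp_ongoing() assumes that rcu_gp_ongoing() (not shown in this excerpt) tests the low-order bit.

```c
#include <stdbool.h>

#define NR_THREADS 2  /* hypothetical: array models the per-thread variables */

static long rcu_gp_ctr;                    /* stays even */
static long rcu_reader_qs_gp[NR_THREADS];  /* one snapshot per modeled thread */

/* Models rcu_quiescent_state(): an odd value marks the thread as online. */
static void sketch_quiescent_state(int t)
{
	rcu_reader_qs_gp[t] = rcu_gp_ctr + 1;
}

/* Models rcu_thread_offline(): an even value marks the thread as offline. */
static void sketch_thread_offline(int t)
{
	rcu_reader_qs_gp[t] = rcu_gp_ctr;
}

/* Assumption: rcu_gp_ongoing() is taken to test the low-order bit. */
static bool sketch_gp_ongoing(int t)
{
	return (rcu_reader_qs_gp[t] & 1) != 0;
}

/* Models the counter update at the start of synchronize_rcu(). */
static void sketch_start_grace_period(void)
{
	rcu_gp_ctr += 2;
}

/* Models the wait condition: online and snapshot older than the counter. */
static bool sketch_must_wait_for(int t)
{
	return sketch_gp_ongoing(t) &&
	       (rcu_reader_qs_gp[t] - rcu_gp_ctr) < 0;
}
```

In this model, a thread that reported a quiescent state before the grace period began must be waited for until it reports another one, while a thread that went offline is skipped immediately because its snapshot is even.
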