10.07.2015 Views

Is Parallel Programming Hard, And, If So, What Can You Do About It?



178 APPENDIX C. WHY MEMORY BARRIERS?

4. isync forces all preceding instructions to appear to have completed before any subsequent instructions start execution. This means that the preceding instructions must have progressed far enough that any traps they might generate have either happened or are guaranteed not to happen, and that any side-effects of these instructions (for example, page-table changes) are seen by the subsequent instructions.

Unfortunately, none of these instructions line up exactly with Linux’s wmb() primitive, which requires all stores to be ordered, but does not require the other high-overhead actions of the sync instruction. But there is no choice: ppc64 versions of wmb() and mb() are defined to be the heavyweight sync instruction. However, Linux’s smp_wmb() primitive is never used for MMIO (since a driver must carefully order MMIOs in UP as well as SMP kernels, after all), so it is defined to be the lighter-weight eieio instruction. This instruction may well be unique in having a five-vowel mnemonic. The smp_mb() primitive is also defined to be the sync instruction, but both smp_rmb() and rmb() are defined to be the lighter-weight lwsync instruction.

Power features “cumulativity”, which can be used to obtain transitivity. When used properly, any code seeing the results of an earlier code fragment will also see the accesses that this earlier code fragment itself saw. Much more detail is available from McKenney and Silvera [MS09].

Power respects control dependencies in much the same way that ARM does, with the exception that the Power isync instruction is substituted for the ARM ISB instruction.

Many members of the POWER architecture have incoherent instruction caches, so that a store to memory will not necessarily be reflected in the instruction cache. Thankfully, few people write self-modifying code these days, but JITs and compilers do it all the time. Furthermore, recompiling a recently run program looks just like self-modifying code from the CPU’s viewpoint.
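
As a rough illustration of the JIT case, here is a minimal C sketch of installing freshly generated code and then synchronizing the instruction cache. The names install_code(), sync_icache_range(), and ASSUMED_LINE_STRIDE are hypothetical, and the 32-byte stride is an assumption rather than a value taken from the text. The Power branch uses a dcbst, sync, icbi, isync sequence; dcbst is not discussed above and is included on the assumption that the newly stored bytes must first be written back from the data cache. Other targets fall back to the GCC/Clang builtin __builtin___clear_cache(). This is a sketch to check against the architecture manual, not the kernel's implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical conservative stride (an assumption, not from the text).
 * A stride no larger than the real cache-line size is still correct;
 * it merely touches some lines more than once.
 */
#define ASSUMED_LINE_STRIDE 32u

/* Make the bytes in [begin, end) visible to instruction fetch. */
static void sync_icache_range(void *begin, void *end)
{
#if defined(__powerpc__) || defined(__powerpc64__)
	uintptr_t first = (uintptr_t)begin & ~(uintptr_t)(ASSUMED_LINE_STRIDE - 1);
	uintptr_t last = (uintptr_t)end;
	uintptr_t p;

	/* Write the modified data-cache lines back to memory. */
	for (p = first; p < last; p += ASSUMED_LINE_STRIDE)
		__asm__ __volatile__("dcbst 0,%0" : : "r"(p) : "memory");
	__asm__ __volatile__("sync" : : : "memory");

	/* Invalidate the corresponding instruction-cache lines. */
	for (p = first; p < last; p += ASSUMED_LINE_STRIDE)
		__asm__ __volatile__("icbi 0,%0" : : "r"(p) : "memory");

	/* Wait for the invalidations, then discard prefetched instructions. */
	__asm__ __volatile__("sync\n\tisync" : : : "memory");
#else
	/* Compiler builtin; a no-op on targets with coherent instruction caches. */
	__builtin___clear_cache((char *)begin, (char *)end);
#endif
}

/* Copy generated code into place, then synchronize the instruction cache. */
static void install_code(void *dst, const void *src, size_t len)
{
	assert(dst != NULL && src != NULL);
	memcpy(dst, src, len);
	sync_icache_range(dst, (char *)dst + len);
}
```

A JIT would call install_code() on a buffer that it later executes; allocating and mapping that buffer as executable is omitted here.
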
The icbi instruction (instruction cache block invalidate) invalidates a specified cache line from the instruction cache, and may be used in these situations.

C.7.7 SPARC RMO, PSO, and TSO

Solaris on SPARC uses TSO (total-store order), as does Linux when built for the “sparc” 32-bit architecture. However, a 64-bit Linux kernel (the “sparc64” architecture) runs SPARC in RMO (relaxed-memory order) mode [SPA94]. The SPARC architecture also offers an intermediate PSO (partial store order). Any program that runs in RMO will also run in either PSO or TSO, and similarly, a program that runs in PSO will also run in TSO. Moving a shared-memory parallel program in the other direction may require careful insertion of memory barriers, although, as noted earlier, programs that make standard use of synchronization primitives need not worry about memory barriers.

SPARC has a very flexible memory-barrier instruction [SPA94] that permits fine-grained control of ordering:

StoreStore: order preceding stores before subsequent stores. (This option is used by the Linux smp_wmb() primitive.)

LoadStore: order preceding loads before subsequent stores.

StoreLoad: order preceding stores before subsequent loads.

LoadLoad: order preceding loads before subsequent loads. (This option is used by the Linux smp_rmb() primitive.)

Sync: fully complete all preceding operations before starting any subsequent operations.

MemIssue: complete preceding memory operations before subsequent memory operations, important for some instances of memory-mapped I/O.

Lookaside: same as MemIssue, but only applies to preceding stores and subsequent loads, and even then only for stores and loads that access the same memory location.

The Linux smp_mb() primitive uses the first four options together, as in membar #LoadLoad | #LoadStore | #StoreStore | #StoreLoad, thus fully ordering memory operations.

So, why is membar #MemIssue needed?
Because a membar #StoreLoad could permit a subsequent load to get its value from a write buffer, which would be disastrous if the write was to an MMIO register that induced side effects on the value to be read. In contrast, membar #MemIssue would wait until the write buffers were flushed before permitting the loads to execute, thereby ensuring that the load actually gets its value from the MMIO register. Drivers could instead use membar #Sync, but the lighter-weight membar #MemIssue is preferred in cases where the additional functions of the more-expensive membar #Sync are not required.

The membar #Lookaside is a lighter-weight version of membar #MemIssue, which is useful when
