10.07.2015 Views

Is Parallel Programming Hard, And, If So, What Can You Do About It?



C.7. MEMORY-BARRIER INSTRUCTIONS FOR SPECIFIC CPUS

writing to a given MMIO register affects the value that will next be read from that register. However, the heavier-weight membar #MemIssue must be used when a write to a given MMIO register affects the value that will next be read from some other MMIO register.

It is not clear why SPARC does not define wmb() to be membar #MemIssue and smp_wmb() to be membar #StoreStore, as the current definitions seem vulnerable to bugs in some drivers. It is quite possible that all the SPARC CPUs that Linux runs on implement a more conservative memory-ordering model than the architecture would permit.

SPARC requires a flush instruction be used between the time that an instruction is stored and executed [SPA94]. This is needed to flush any prior value for that location from the SPARC's instruction cache. Note that flush takes an address, and will flush only that address from the instruction cache. On SMP systems, all CPUs' caches are flushed, but there is no convenient way to determine when the off-CPU flushes complete, though there is a reference to an implementation note.

C.7.8 x86

Since the x86 CPUs provide “process ordering” so that all CPUs agree on the order of a given CPU's writes to memory, the smp_wmb() primitive is a no-op for the CPU [Int04b]. However, a compiler directive is required to prevent the compiler from performing optimizations that would result in reordering across the smp_wmb() primitive.

On the other hand, x86 CPUs have traditionally given no ordering guarantees for loads, so the smp_mb() and smp_rmb() primitives expand to lock;addl. This atomic instruction acts as a barrier to both loads and stores.

More recently, Intel has published a memory model for x86 [Int07]. It turns out that Intel's actual CPUs enforced tighter ordering than was claimed in the previous specifications, so this model is in effect simply mandating the earlier de-facto behavior. Even more recently, Intel published an updated memory model for x86 [Int09, Section 8.2], which mandates a total global order for stores, although individual CPUs are still permitted to see their own stores as having happened earlier than this total global order would indicate. This exception to the total ordering is needed to allow important hardware optimizations involving store buffers. In addition, memory ordering obeys causality, so that if CPU 0 sees a store by CPU 1, then CPU 0 is guaranteed to see all stores that CPU 1 saw prior to its store. Software may use atomic operations to override these hardware optimizations, which is one reason that atomic operations tend to be more expensive than their non-atomic counterparts. This total store order is not guaranteed on older processors.

However, note that some SSE instructions are weakly ordered (clflush and non-temporal move instructions [Int04a]). CPUs that have SSE can use mfence for smp_mb(), lfence for smp_rmb(), and sfence for smp_wmb().

A few versions of the x86 CPU have a mode bit that enables out-of-order stores, and for these CPUs, smp_wmb() must also be defined to be lock;addl.

Although many older x86 implementations accommodated self-modifying code without the need for any special instructions, newer revisions of the x86 architecture no longer require x86 CPUs to be so accommodating. Interestingly enough, this relaxation comes just in time to inconvenience JIT implementors.

C.7.9 zSeries

The zSeries machines make up the IBM(TM) mainframe family, previously known as the 360, 370, and 390 [Int04c]. Parallelism came late to zSeries, but given that these mainframes first shipped in the mid 1960s, this is not saying much. The bcr 15,0 instruction is used for the Linux smp_mb(), smp_rmb(), and smp_wmb() primitives. It also has comparatively strong memory-ordering semantics, as shown in Table C.5, which should allow the smp_wmb() primitive to be a nop (and by the time you read this, this change may well have happened). The table actually understates the situation, as the zSeries memory model is otherwise sequentially consistent, meaning that all CPUs will agree on the order of unrelated stores from different CPUs.

As with most CPUs, the zSeries architecture does not guarantee a cache-coherent instruction stream, hence, self-modifying code must execute a serializing instruction between updating the instructions and executing them. That said, many actual zSeries machines do in fact accommodate self-modifying code without serializing instructions. The zSeries instruction set provides a large set of serializing instructions, including compare-and-swap, some types of branches (for example, the aforementioned bcr 15,0 instruction), and test-and-set, among others.
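
The instruction choices described above for x86 CPUs with SSE (mfence, lfence, and sfence) and for zSeries (bcr 15,0) can be illustrated with a rough user-space sketch. This is not the Linux kernel's actual definition of these primitives. The sketch_* macro names, the payload and ready variables, and the run_message_passing() helper are all hypothetical, and the fallback branch for other CPUs is an assumption that simply uses the GCC/Clang __atomic_thread_fence() builtin. The sketch assumes a GCC-compatible compiler and POSIX threads. Strictly speaking, the C memory model does not know about inline-assembly fences, so the sketch relies on the "memory" clobber to restrain the compiler, much as the compiler directive mentioned for smp_wmb() does.

```c
/* Sketch only: the sketch_* names are hypothetical, not the kernel's macros. */
#include <assert.h>
#include <pthread.h>

/* Compiler-only barrier: restrains compiler reordering, emits no instruction. */
#define sketch_barrier() __asm__ __volatile__("" ::: "memory")

#if defined(__x86_64__)
/* x86 CPUs with SSE: one fence instruction per primitive, as in the text. */
#define sketch_smp_mb()  __asm__ __volatile__("mfence" ::: "memory")
#define sketch_smp_rmb() __asm__ __volatile__("lfence" ::: "memory")
#define sketch_smp_wmb() __asm__ __volatile__("sfence" ::: "memory")
#elif defined(__s390x__)
/* zSeries: a single serializing instruction serves all three primitives. */
#define sketch_smp_mb()  __asm__ __volatile__("bcr 15,0" ::: "memory")
#define sketch_smp_rmb() sketch_smp_mb()
#define sketch_smp_wmb() sketch_smp_mb()
#else
/* Assumed fallback for other CPUs: a full fence from the compiler builtins. */
#define sketch_smp_mb()  __atomic_thread_fence(__ATOMIC_SEQ_CST)
#define sketch_smp_rmb() __atomic_thread_fence(__ATOMIC_SEQ_CST)
#define sketch_smp_wmb() __atomic_thread_fence(__ATOMIC_SEQ_CST)
#endif

static int payload;  /* data published by the producer */
static int ready;    /* flag announcing that payload is valid */

static void *producer(void *arg)
{
    (void)arg;
    payload = 42;                                   /* write the data */
    sketch_smp_wmb();                               /* order data before flag */
    __atomic_store_n(&ready, 1, __ATOMIC_RELAXED);  /* publish the flag */
    return NULL;
}

/* Runs one producer/consumer handoff; returns the payload the consumer saw. */
int run_message_passing(void)
{
    pthread_t tid;
    int seen;
    int rc;

    payload = 0;
    ready = 0;
    rc = pthread_create(&tid, NULL, producer, NULL);
    assert(rc == 0);
    (void)rc;
    while (!__atomic_load_n(&ready, __ATOMIC_RELAXED))
        ;                                           /* spin until flag is set */
    sketch_smp_rmb();                               /* order flag before data */
    seen = payload;                                 /* read the data */
    pthread_join(tid, NULL);
    return seen;
}
```

Calling run_message_passing() from a main() function, in a program built with the -pthread compiler option, exercises the write-side and read-side barriers as a pair. Given the ordering described in this section, the write-side barrier could in principle be reduced to sketch_barrier() on x86 CPUs that do not use weakly ordered SSE stores, but the sketch keeps the explicit fence so that the mapping from primitive to instruction stays visible.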
