10.07.2015 Views

Is Parallel Programming Hard, And, If So, What Can You Do About It?


APPENDIX C. WHY MEMORY BARRIERS?

vendor, and as a presumably attractive alternative to dragging the reader through detailed technical specifications, let us instead design a mythical but maximally memory-ordering-hostile computer architecture.⁴

This hardware must obey the following ordering constraints [McK05a, McK05b]:

1. Each CPU will always perceive its own memory accesses as occurring in program order.

2. CPUs will reorder a given operation with a store only if the two operations are referencing different locations.

3. All of a given CPU’s loads preceding a read memory barrier (smp_rmb()) will be perceived by all CPUs to precede any loads following that read memory barrier.

4. All of a given CPU’s stores preceding a write memory barrier (smp_wmb()) will be perceived by all CPUs to precede any stores following that write memory barrier.

5. All of a given CPU’s accesses (loads and stores) preceding a full memory barrier (smp_mb()) will be perceived by all CPUs to precede any accesses following that memory barrier.

Quick Quiz C.9: Does the guarantee that each CPU sees its own memory accesses in order also guarantee that each user-level thread will see its own memory accesses in order? Why or why not?

Imagine a large non-uniform cache architecture (NUCA) system that, in order to provide fair allocation of interconnect bandwidth to CPUs in a given node, provided per-CPU queues in each node’s interconnect interface, as shown in Figure C.8. Although a given CPU’s accesses are ordered as specified by memory barriers executed by that CPU, the relative order of a given pair of CPUs’ accesses could be severely reordered, as we will see.⁵
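Constraints 3 and 4 above are what make the familiar "publish a value, then set a flag" idiom work. The following is a minimal user-space sketch of that idiom, not kernel code. The macros named smp_wmb(), smp_rmb(), and smp_mb() here are hypothetical stand-ins built from C11 fences, which is an assumption made for illustration and only approximates the kernel primitives; the variable and function names (payload, ready, run_message_passing) are likewise invented for this sketch.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical user-space stand-ins (assumption: C11 fences that are at
 * least as strong as the store-store and load-load ordering needed here). */
#define smp_wmb() atomic_thread_fence(memory_order_release)
#define smp_rmb() atomic_thread_fence(memory_order_acquire)
#define smp_mb()  atomic_thread_fence(memory_order_seq_cst) /* shown for completeness, unused below */

static atomic_int payload;  /* the value being published */
static atomic_int ready;    /* flag announcing that payload is valid */

static void *producer(void *arg)
{
    (void)arg;
    atomic_store_explicit(&payload, 42, memory_order_relaxed);
    smp_wmb();  /* constraint 4: the payload store precedes the flag store */
    atomic_store_explicit(&ready, 1, memory_order_relaxed);
    return NULL;
}

static void *consumer(void *arg)
{
    int *seen = arg;

    while (atomic_load_explicit(&ready, memory_order_relaxed) == 0)
        ;       /* spin until the flag is visible */
    smp_rmb();  /* constraint 3: the flag load precedes the payload load */
    *seen = atomic_load_explicit(&payload, memory_order_relaxed);
    return NULL;
}

/* Runs one producer/consumer pair and returns the payload value that the
 * consumer observed, or -1 if a thread could not be created. */
int run_message_passing(void)
{
    pthread_t prod, cons;
    int seen = -1;
    int rc;

    atomic_store(&payload, 0);
    atomic_store(&ready, 0);

    if (pthread_create(&cons, NULL, consumer, &seen) != 0)
        return -1;
    if (pthread_create(&prod, NULL, producer, NULL) != 0) {
        atomic_store(&ready, 1);    /* release the spinning consumer */
        pthread_join(cons, NULL);
        return -1;
    }
    rc = pthread_join(prod, NULL);
    assert(rc == 0);
    rc = pthread_join(cons, NULL);
    assert(rc == 0);
    (void)rc;
    return seen;
}
```

Calling run_message_passing() from a main() function should yield 42 each time, because the paired fences forbid the consumer from seeing the flag without also seeing the payload. Some toolchains need the -pthread option when compiling and linking.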
⁴ Readers preferring a detailed look at real hardware architectures are encouraged to consult CPU vendors’ manuals [SW95, Adv02, Int02b, IBM94, LSH02, SPA94, Int04b, Int04a, Int04c], Gharachorloo’s dissertation [Gha95], or Peter Sewell’s work [Sew].

⁵ Any real hardware architect or designer will no doubt be loudly calling for Ralph on the porcelain intercom, as they just might be a bit upset about the prospect of working out which queue should handle a message involving a cache line that both CPUs accessed, to say nothing of the many races that this example poses. All I can say is “Give me a better example”.

[Figure C.8: Example Ordering-Hostile Architecture. Node 0 (CPU 0 and CPU 1) and Node 1 (CPU 2 and CPU 3) each have a cache and one message queue per CPU; an interconnect joins the nodes to memory.]

CPU 0         CPU 1           CPU 2
a=1;
smp_wmb();    while(b==0);
b=1;          c=1;            z=c;
                              smp_rmb();
                              x=a;
                              assert(z==0||x==1);

Table C.2: Memory Barrier Example 1

C.6.2 Example 1

Table C.2 shows three code fragments, executed concurrently by CPUs 0, 1, and 2. Each of “a”, “b”, and “c” is initially zero.

Suppose CPU 0 recently experienced many cache misses, so that its message queue is full, but that CPU 1 has been running exclusively within the cache, so that its message queue is empty. Then CPU 0’s assignments to “a” and “b” will appear in Node 0’s cache immediately (and thus be visible to CPU 1), but will be blocked behind CPU 0’s prior traffic. In contrast, CPU 1’s assignment to “c” will sail through CPU 1’s previously empty queue. Therefore, CPU 2 might well see CPU 1’s assignment to “c” before it sees CPU 0’s assignment to “a”, causing the assertion to fire, despite the memory barriers.

In theory, portable code cannot rely on this example code sequence; in practice, however, it does work on all mainstream computer systems.

Quick Quiz C.10: Could this code be fixed by inserting a memory barrier between CPU 1’s “while” and assignment to “c”? Why or why not?
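The queueing argument in Example 1 can be replayed with a small deterministic model. The sketch below is a toy simulation under stated assumptions, not a description of any real hardware: each Node 0 CPU has a FIFO of outgoing store messages, the interconnect delivers one message per non-empty queue per step, and CPU 2 runs its fragment as soon as “c” becomes visible on Node 1. All names (msg_queue, run_example1, cpu0_backlog, and so on) are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* VAR_NONE models CPU 0's unrelated earlier cache-miss traffic. */
enum var { VAR_NONE, VAR_A, VAR_B, VAR_C };

#define QUEUE_MAX 64

struct msg_queue {
    enum var slots[QUEUE_MAX];
    size_t head, tail;
};

static void enqueue(struct msg_queue *q, enum var v)
{
    assert(q->tail < QUEUE_MAX);    /* toy model: fixed capacity */
    q->slots[q->tail++] = v;
}

static bool is_empty(const struct msg_queue *q)
{
    return q->head == q->tail;
}

static enum var dequeue(struct msg_queue *q)
{
    return q->slots[q->head++];
}

struct result {
    int z, x;               /* values read by CPU 2 */
    bool assertion_holds;   /* z == 0 || x == 1 */
};

/* Replays Table C.2 with cpu0_backlog messages already queued by CPU 0. */
struct result run_example1(size_t cpu0_backlog)
{
    struct msg_queue q0 = { .head = 0, .tail = 0 };
    struct msg_queue q1 = { .head = 0, .tail = 0 };
    int node1_a = 0, node1_c = 0;   /* Node 1's view of a and c */
    struct result r = { 0, 0, true };

    for (size_t i = 0; i < cpu0_backlog; i++)
        enqueue(&q0, VAR_NONE);

    /* CPU 0: a=1; smp_wmb(); b=1;  The FIFO keeps a ahead of b. */
    enqueue(&q0, VAR_A);
    enqueue(&q0, VAR_B);

    /* CPU 1: sees b==1 in Node 0's cache at once, then stores c=1. */
    enqueue(&q1, VAR_C);

    /* Interconnect: one message per non-empty queue per step. */
    while (!is_empty(&q0) || !is_empty(&q1)) {
        if (!is_empty(&q0) && dequeue(&q0) == VAR_A)
            node1_a = 1;
        if (!is_empty(&q1) && dequeue(&q1) == VAR_C)
            node1_c = 1;
        if (node1_c) {
            /* CPU 2: z=c; smp_rmb(); x=a; */
            r.z = node1_c;
            r.x = node1_a;
            break;
        }
    }
    r.assertion_holds = (r.z == 0 || r.x == 1);
    return r;
}
```

In this model, run_example1(0) leaves the assertion intact, whereas any non-zero backlog lets “c” overtake “a” on its way to Node 1, so CPU 2 reads z == 1 and x == 0 and the assertion fails. The barriers order each CPU's own messages but say nothing about the relative progress of the two queues.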
