Is Parallel Programming Hard, And, If So, What Can You Do About It?

APPENDIX C. WHY MEMORY BARRIERS?

[Figure C.2: CPU Cache Structure — a sixteen-set, two-way cache; way 0 holds lines 0x12345000 through 0x12345E00, and way 1 holds line 0x43210E00]

fetched item. Such a cache miss is termed a “capacity miss”, because it is caused by the cache’s limited capacity. However, most caches can be forced to eject an old item to make room for a new item even when they are not yet full. This is because large caches are implemented as hardware hash tables with fixed-size hash buckets (or “sets”, as CPU designers call them) and no chaining, as shown in Figure C.2.

This cache has sixteen “sets” and two “ways” for a total of 32 “lines”, each entry containing a single 256-byte “cache line”, which is a 256-byte-aligned block of memory. This cache line size is a little on the large side, but makes the hexadecimal arithmetic much simpler. In hardware parlance, this is a two-way set-associative cache, and is analogous to a software hash table with sixteen buckets, where each bucket’s hash chain is limited to at most two elements. The size (32 cache lines in this case) and the associativity (two in this case) are collectively called the cache’s “geometry”. Since this cache is implemented in hardware, the hash function is extremely simple: extract four bits from the memory address.

In Figure C.2, each box corresponds to a cache entry, which can contain a 256-byte cache line. However, a cache entry can be empty, as indicated by the empty boxes in the figure. The rest of the boxes are flagged with the memory address of the cache line that they contain.
Since the cache lines must be 256-byte aligned, the low eight bits of each address are zero, and the choice of hardware hash function means that the next-higher four bits match the hash line number.

The situation depicted in the figure might arise if the program’s code were located at address 0x43210E00 through 0x43210EFF, and this program accessed data sequentially from 0x12345000 through 0x12345EFF. Suppose that the program were now to access location 0x12345F00. This location hashes to line 0xF, and both ways of this line are empty, so the corresponding 256-byte line can be accommodated. If the program were to access location 0x1233000, which hashes to line 0x0, the corresponding 256-byte cache line can be accommodated in way 1. However, if the program were to access location 0x1233E00, which hashes to line 0xE, one of the existing lines must be ejected from the cache to make room for the new cache line. If this ejected line were accessed later, a cache miss would result. Such a cache miss is termed an “associativity miss”.

Thus far, we have been considering only cases where a CPU reads a data item. What happens when it does a write? Because it is important that all CPUs agree on the value of a given data item, before a given CPU writes to that data item, it must first cause it to be removed, or “invalidated”, from other CPUs’ caches. Once this invalidation has completed, the CPU may safely modify the data item. If the data item was present in this CPU’s cache, but was read-only, this process is termed a “write miss”. Once a given CPU has completed invalidating a given data item from other CPUs’ caches, that CPU may repeatedly write (and read) that data item. Later, if one of the other CPUs attempts to access the data item, it will incur a cache miss, this time because the first CPU invalidated the item in order to write to it.
This type of cache miss is termed a “communication miss”, since it is usually due to several CPUs using the data items to communicate (for example, a lock is a data item that is used to communicate among CPUs using a mutual-exclusion algorithm).

Clearly, much care must be taken to ensure that all CPUs maintain a coherent view of the data. With all this fetching, invalidating, and writing, it is easy to imagine data being lost or (perhaps worse) different CPUs having conflicting values for the same data item in their respective caches. These problems are prevented by “cache-coherency protocols”, described in the next section.

C.2 Cache-Coherence Protocols

Cache-coherency protocols manage cache-line states so as to prevent inconsistent or lost data. These
