Is Parallel Programming Hard, And, If So, What Can You Do About It?

Appendix C  Why Memory Barriers?

So what possessed CPU designers to cause them to inflict memory barriers on poor unsuspecting SMP software designers? In short, because reordering memory references allows much better performance, and so memory barriers are needed to force ordering in things like synchronization primitives whose correct operation depends on ordered memory references.

Getting a more detailed answer to this question requires a good understanding of how CPU caches work, and especially what is required to make caches really work well. The following sections:

1. present the structure of a cache,

2. describe how cache-coherency protocols ensure that CPUs agree on the value of each location in memory, and, finally,

3. outline how store buffers and invalidate queues help caches and cache-coherency protocols achieve high performance.

We will see that memory barriers are a necessary evil that is required to enable good performance and scalability, an evil that stems from the fact that CPUs are orders of magnitude faster than are both the interconnects between them and the memory they are attempting to access.

C.1  Cache Structure

Modern CPUs are much faster than are modern memory systems. A 2006 CPU might be capable of executing ten instructions per nanosecond, but will require many tens of nanoseconds to fetch a data item from main memory. This disparity in speed, more than two orders of magnitude, has resulted in the multi-megabyte caches found on modern CPUs. These caches are associated with the CPUs as shown in Figure C.1, and can typically be accessed in a few cycles.¹

[Figure C.1: Modern Computer System Cache Structure. CPU 0 and CPU 1 each have their own cache; both caches attach to an interconnect, which in turn attaches to memory.]

Data flows among the CPUs' caches and memory in fixed-length blocks called "cache lines", which are normally a power of two in size, ranging from 16 to 256 bytes. When a given data item is first accessed by a given CPU, it will be absent from that CPU's cache, meaning that a "cache miss" (or, more specifically, a "startup" or "warmup" cache miss) has occurred. The cache miss means that the CPU will have to wait (or be "stalled") for hundreds of cycles while the item is fetched from memory. However, the item will be loaded into that CPU's cache, so that subsequent accesses will find it in the cache and therefore run at full speed.

After some time, the CPU's cache will fill, and subsequent misses will likely need to eject an item from the cache in order to make room for the newly fetched item.

¹ It is standard practice to use multiple levels of cache, with a small level-one cache close to the CPU with single-cycle access time, and a larger level-two cache with a longer access time, perhaps roughly ten clock cycles. Higher-performance CPUs often have three or even four levels of cache.
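
The cost of these cache misses is easy to observe. The following minimal sketch (not from the book; the 64-byte cache-line size, the 64 MB buffer size, and all names are assumptions chosen for illustration) performs the same number of loads twice: once sequentially, so that each cache line fetched from memory services many loads, and once at a stride of one assumed cache line, so that nearly every load misses. On typical hardware the strided loop runs many times slower.

/* cache_miss_demo.c: illustrative sketch of cache-miss cost.
 * Build with: gcc -O2 cache_miss_demo.c -o cache_miss_demo
 */
#define _POSIX_C_SOURCE 199309L
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define NBYTES (64L * 1024 * 1024)  /* 64 MB: far larger than cache (assumed) */
#define LINE 64                     /* assumed cache-line size in bytes */

static double now_sec(void)
{
        struct timespec ts;

        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main(void)
{
        unsigned char *buf = malloc(NBYTES);
        long i, sum = 0;
        double t0, t1, t2;

        if (buf == NULL)
                return 1;
        for (i = 0; i < NBYTES; i++)  /* touch every byte once (warmup) */
                buf[i] = (unsigned char)i;

        t0 = now_sec();
        for (i = 0; i < NBYTES / LINE; i++)  /* sequential: ~1 miss per line */
                sum += buf[i];
        t1 = now_sec();
        for (i = 0; i < NBYTES; i += LINE)   /* strided: ~1 miss per load */
                sum += buf[i];
        t2 = now_sec();

        printf("sequential: %.3f ms   strided: %.3f ms   (sum=%ld)\n",
               (t1 - t0) * 1e3, (t2 - t1) * 1e3, sum);
        free(buf);
        return 0;
}

Both loops execute exactly NBYTES/LINE loads, so the difference in their running times is almost entirely the difference in the number of cache lines that must be fetched from memory.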
