Is Parallel Programming Hard, And, If So, What Can You Do About It?

Appendix C. Why Memory Barriers?

C.8 Are Memory Barriers Forever?

There have been a number of recent systems that are significantly less aggressive about out-of-order execution in general and re-ordering memory references in particular. Will this trend continue to the point where memory barriers are a thing of the past?

The argument in favor would cite proposed massively multi-threaded hardware architectures, so that each thread would wait until memory was ready, with tens, hundreds, or even thousands of other threads making progress in the meantime. In such an architecture, there would be no need for memory barriers, because a given thread would simply wait for all outstanding operations to complete before proceeding to the next instruction. Because there would be potentially thousands of other threads, the CPU would be completely utilized, so no CPU time would be wasted.

The argument against would cite the extremely limited number of applications capable of scaling up to a thousand threads, as well as increasingly severe realtime requirements, which are in the tens of microseconds for some applications. The realtime-response requirements are difficult enough to meet as is, and would be even more difficult to meet given the extremely low single-threaded throughput implied by the massive multi-threaded scenarios.

Another argument in favor would cite increasingly sophisticated latency-hiding hardware implementation techniques that might well allow the CPU to provide the illusion of fully sequentially consistent execution while still providing almost all of the performance advantages of out-of-order execution. A counter-argument would cite the increasingly severe power-efficiency requirements presented both by battery-operated devices and by environmental responsibility.

Who is right? We have no clue, so we are preparing to live with either scenario.

C.9 Advice to Hardware Designers

There are any number of things that hardware designers can do to make the lives of software people difficult.
Here is a list of a few such things that we have encountered in the past, presented here in the hope that it might help prevent future such problems:

1. I/O devices that ignore cache coherence.
This charming misfeature can result in DMAs from memory missing recent changes to the output buffer, or, just as bad, cause input buffers to be overwritten by the contents of CPU caches just after the DMA completes. To make your system work in the face of such misbehavior, you must carefully flush the CPU caches of any location in any DMA buffer before presenting that buffer to the I/O device. And even then, you need to be very careful to avoid pointer bugs, as even a misplaced read to an input buffer can result in corrupting the data input!

2. External busses that fail to transmit cache-coherence data.
This is an even more painful variant of the above problem, but causes groups of devices, and even memory itself, to fail to respect cache coherence. It is my painful duty to inform you that as embedded systems move to multicore architectures, we will no doubt see a fair number of such problems arise. Hopefully these problems will clear up by the year 2015.

3. Device interrupts that ignore cache coherence.
This might sound innocent enough; after all, interrupts aren't memory references, are they? But imagine a CPU with a split cache, one bank of which is extremely busy, therefore holding onto the last cache line of the input buffer. If the corresponding I/O-complete interrupt reaches this CPU, then that CPU's memory reference to the last cache line of the buffer could return old data, again resulting in data corruption, but in a form that will be invisible in a later crash dump. By the time the system gets around to dumping the offending input buffer, the DMA will most likely have completed.

4. Inter-processor interrupts (IPIs) that ignore cache coherence.
This can be problematic if the IPI reaches its destination before all of the cache lines in the corresponding message buffer have been committed to memory.

5. Context switches that get ahead of cache coherence.
If memory accesses can complete too wildly out of order, then context switches can be quite harrowing. If the task flits from one CPU to another before all the memory accesses visible to the source CPU make it to the destination CPU, then the task could see its own variables revert to earlier values, which can fatally confuse most algorithms.
