10.07.2015 Views

Is Parallel Programming Hard, And, If So, What Can You Do About It?


16 CHAPTER 2. HARDWARE AND ITS HABITS

is more expensive than CAS because it requires two atomic operations on the lock data structure.

An operation that misses the cache consumes almost one hundred and forty nanoseconds, or more than two hundred clock cycles. A CAS operation, which must look at the old value of the variable as well as store a new value, consumes over three hundred nanoseconds, or more than five hundred clock cycles. Think about this a bit. In the time required to do one CAS operation, the CPU could have executed more than five hundred normal instructions. This should demonstrate the limitations of fine-grained locking.

Quick Quiz 2.5: Surely the hardware designers could be persuaded to improve this situation! Why have they been content with such abysmal performance for these single-instruction operations?

I/O operations are even more expensive. A high-performance (and expensive!) communications fabric, such as InfiniBand or any number of proprietary interconnects, has a latency of roughly three microseconds, during which time five thousand instructions might have been executed. Standards-based communications networks often require some sort of protocol processing, which further increases the latency. Of course, geographic distance also increases latency, with the theoretical speed-of-light latency around the world coming to roughly 130 milliseconds, or more than 200 million clock cycles.

Quick Quiz 2.6: These numbers are insanely large! How can I possibly get my head around them?

2.3 Hardware Free Lunch?

The major reason that concurrency has been receiving so much focus over the past few years is the end of Moore’s-Law induced single-threaded performance increases (or “free lunch” [Sut08]), as shown in Figure 1.1 on page 3.
This section briefly surveys a few ways that hardware designers might be able to bring back some form of the “free lunch”.

However, the preceding section presented some substantial hardware obstacles to exploiting concurrency. One severe physical limitation that hardware designers face is the finite speed of light. As noted in Figure 2.9 on page 15, light can travel only about an 8-centimeter round trip in a vacuum within a single 1.8 GHz clock period. This distance drops to about 3 centimeters for a 5 GHz clock. Both of these distances are relatively small compared to the size of a modern computer system.

To make matters even worse, electrons in silicon move from three to thirty times more slowly than does light in a vacuum, and common clocked logic constructs run still more slowly; for example, a memory reference may need to wait for a local cache lookup to complete before the request may be passed on to the rest of the system. Furthermore, relatively low-speed and high-power drivers are required to move electrical signals from one silicon die to another, for example, to communicate between a CPU and main memory.

[Figure 2.10: Latency Benefit of 3D Integration. Figure labels: 3 cm, 70 um, 1.5 cm.]

There are nevertheless some technologies (both hardware and software) that might help improve matters:

1. 3D integration,
2. Novel materials and processes,
3. Substituting light for electrons,
4. Special-purpose accelerators, and
5. Existing parallel software.

Each of these is described in one of the following sections.

2.3.1 3D Integration

3D integration is the practice of bonding very thin silicon dies to each other in a vertical stack. This practice provides potential benefits, but also poses significant fabrication challenges [Kni08].

Perhaps the most important benefit of 3DI is decreased path length through the system, as shown in Figure 2.10. A 3-centimeter silicon die is replaced with a stack of four 1.5-centimeter dies, in theory decreasing the maximum path through the system by a factor of two, keeping in mind that each layer is quite thin.
In addition, given proper attention to design and placement, long horizontal electrical connections (which are both slow and power hungry) can be replaced by short vertical electrical connections, which are both faster and more power efficient.
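
The speed-of-light distances and the factor-of-two path reduction quoted in this section can be checked with a little arithmetic. The following is a minimal sketch, not taken from the book: the function names are hypothetical, the 70 um layer thickness is an assumption read off the labels of Figure 2.10, and the "one die side plus stack height" path model is a deliberate simplification.

```python
# Speed of light in a vacuum, in meters per second (exact by SI definition).
C_M_PER_S = 299_792_458


def round_trip_cm(clock_hz: float) -> float:
    """Distance (in cm) that light in a vacuum can travel out and back
    within one clock period at the given clock frequency."""
    period_s = 1.0 / clock_hz
    total_distance_m = C_M_PER_S * period_s
    return total_distance_m / 2 * 100  # halve for the round trip, m -> cm


def stacked_path_cm(die_side_cm: float, layers: int,
                    layer_thickness_um: float) -> float:
    """Hypothetical path model: one die side plus the height of the stack.
    The layer thickness is an assumed value, not a figure from the text."""
    return die_side_cm + layers * layer_thickness_um * 1e-4  # um -> cm


if __name__ == "__main__":
    for ghz in (1.8, 5.0):
        print(f"{ghz} GHz: {round_trip_cm(ghz * 1e9):.1f} cm round trip")

    planar_cm = 3.0                             # single 3-centimeter die
    stacked_cm = stacked_path_cm(1.5, 4, 70.0)  # four 1.5-centimeter dies
    print(f"planar {planar_cm} cm vs. stacked {stacked_cm:.3f} cm "
          f"(ratio {planar_cm / stacked_cm:.2f})")
```

Under these assumptions the round trip comes to roughly 8.3 cm at 1.8 GHz and 3.0 cm at 5 GHz, in line with the figures above, and the stack height adds only about 0.03 cm to the 1.5-centimeter die, which is why the path shrinks by nearly a factor of two.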
