Chuck Moore, AMD - Semiconductor Research Corporation
May 17, 2006<br />
The Role of Accelerated Computing<br />
in the Multi-Core Era<br />
<strong>Chuck</strong> <strong>Moore</strong><br />
Senior Fellow<br />
Advanced Micro Devices
Outline<br />
• <strong>Moore</strong>’s Law and Customer Value<br />
• A Few High-level Trends<br />
• Some Thoughts on SMP and Multi-core Computing<br />
• Throughput Computing<br />
• Relationship to Accelerated Computing<br />
• Accelerated Computing Challenges<br />
• Summary<br />
2 7/17/2008<br />
The Role of Accelerated Computing in the Multi-core Era
<strong>Moore</strong>’s Law (Gordon <strong>Moore</strong>, Electronics Magazine – April 1965)<br />
What he observed:<br />
Double transistor integration every 18 months<br />
Another trend often confused with <strong>Moore</strong>’s Law:<br />
Double performance every 18 months<br />
This is actually an example of ongoing customer value<br />
The semiconductor industry is dependent upon ongoing<br />
customer value:<br />
A virtuous cycle: build → customer value → invest $$$ → build …<br />
The Question we should all be asking ourselves:<br />
How do we continue increasing customer value?<br />
A Few High-level Trends<br />
[Figure: six trend sketches over time, each marked “we are here”]<br />
• <strong>Moore</strong>’s Law ☺: integration (log scale) keeps climbing, though with growing challenges (DFM, variability, reliability, wire delay)<br />
• The Power Wall: the power budget (TDP) has flattened (Server: power = $$; Desktop: eliminate fans; Mobile: battery)<br />
• The Frequency Wall: frequency gains have stalled<br />
• The Complexity Wall: IPC returns from issue width have flattened<br />
• Locality: performance returns from cache size are diminishing<br />
• Single-thread performance (!): flattening over time<br />
So, how can we add customer value?<br />
Customer Value beyond just Performance<br />
<strong>AMD</strong> Native Dual-Core Opteron:<br />
• Pervasive 64-bit capability<br />
• Dual-core technology<br />
• Systems architecture innovation<br />
<strong>AMD</strong> Native Quad-Core Opteron:<br />
• Seamless upgradeability<br />
• Quad-core technology<br />
• Virtualization<br />
• Performance-per-watt innovation<br />
SMP and Multi-Core to the long-term rescue<br />
Optimized CMP Platforms for Performance<br />
• In the near-term, there is definite potential here<br />
• Commodity multi-core processors break the “chicken & egg” barrier<br />
• Impressive amount of interesting research firing up:<br />
– TM, coherency filters, hierarchical scheduling, MREs, VMs, etc<br />
• Lots of good activity on the Tools front. More to come.<br />
• Some workloads will do well with this, but many will not:<br />
• Amdahl’s Law seriously inhibits unstructured parallelism ...<br />
Hold on a moment …<br />
You do remember the implications of<br />
Amdahl’s Law, don’t you?<br />
Amdahl’s Law<br />
Speed-up = 1 / ( S<sub>W</sub> + (1 – S<sub>W</sub>) / N )<br />
S<sub>W</sub>: % Serial Work<br />
N: Number of processors<br />
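The diminishing returns implied by this formula are easy to tabulate; a minimal sketch (the function name `amdahl_speedup` is ours, not from the slides):

```python
def amdahl_speedup(serial_fraction, n_processors):
    """Speed-up = 1 / (S_W + (1 - S_W) / N), with S_W the serial fraction."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_processors)

# Even a 10% serial component caps the speed-up far below N:
for n in (4, 16, 64):
    print(n, round(amdahl_speedup(0.10, n), 2))
```

With S<sub>W</sub> = 0.10, even 64 processors deliver under a 9× speed-up, and no processor count can exceed 10×.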
Amdahl’s Law – Zoom out a bit …<br />
Optimized CMP Platforms for Performance<br />
• In the near-term, there is definite potential here<br />
• Commodity multi-core processors break the “chicken & egg” barrier<br />
• Impressive amount of interesting research firing up:<br />
– TM, coherency filters, hierarchical scheduling, MREs, VMs, etc<br />
• Lots of good activity on the Tools front. More to come.<br />
• Some workloads will do well with this, but many will not:<br />
• Amdahl’s Law seriously inhibits unstructured parallelism …<br />
• As it turns out, existing software isn’t really that soft<br />
– Transitioning the concurrency model is a very big deal<br />
• Many key emerging applications will work on “media rich” content that<br />
does not map onto traditional processors in a power-efficient manner<br />
• In reality, SMP/Multi-core challenges are just an early indicator<br />
of the shifts yet to come<br />
• Power constraints will force these to be performance heterogeneous<br />
• Advances in synchronization and non-uniform resource management<br />
Outline<br />
• <strong>Moore</strong>’s Law and Customer Value<br />
• A Few High-level Trends<br />
• Some Thoughts on SMP and Multi-core Computing<br />
• Throughput Computing<br />
• Relationship to Accelerated Computing<br />
• Accelerated Computing Challenges<br />
• Memory Bandwidth and Data Movement<br />
• The Software Stack and a Framework for Innovation<br />
• Summary<br />
What about Throughput Computing?<br />
• Measure Performance as throughput (vs. turn-around-time)<br />
• How long does it take to run N “independent” tasks?<br />
• Multiple cores/threads should be faster than just one<br />
• For basic multi-program throughput, the OS plays a key role<br />
in time-slicing application use on one or more cores<br />
• In some sense, the OS “offloads” work onto available cores<br />
• At some point, OS scalability becomes the bottleneck<br />
• More advanced applications take on that role themselves with<br />
“gang scheduling”<br />
• Modern databases<br />
• Some HPC apps turn data parallelism into task-level throughput<br />
• Some types of web servers<br />
• Can we exploit large scale throughput computing?<br />
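The throughput-vs-turnaround distinction above can be sketched with a plain thread pool; a minimal illustration (function and variable names are ours):

```python
from concurrent.futures import ThreadPoolExecutor

def run_independent_tasks(tasks, workers):
    # Throughput view: what matters is how long it takes to finish
    # all N independent tasks, not the latency of any single one.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda task: task(), tasks))

# Eight independent tasks time-sliced across four worker threads:
results = run_independent_tasks([lambda i=i: i * i for i in range(8)], workers=4)
```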
Large Scale Throughput Systems<br />
• In these, there are far more threads than HW thread-slots<br />
• A Centralized Controller helps dispatch/juggle among threads<br />
Two-Tiered Web Server: massive amounts of Internet clicks flow into an SMP Server (the “Sprayer”), which distributes them across Blade Servers<br />
Modern GPU: vertices and pixels flow through a “Sequencer”<br />
Cooperative Heterogeneous Computing!<br />
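The “Sprayer” role reduces, at its simplest, to a round-robin dispatcher in front of a worker pool; a purely illustrative sketch (names are ours):

```python
from itertools import cycle

def spray(requests, n_workers):
    # Central "sprayer": deal independent incoming requests out to
    # back-end workers round-robin, as the SMP front end does for
    # blade servers (or a GPU sequencer for vertices and pixels).
    batches = [[] for _ in range(n_workers)]
    for worker_id, req in zip(cycle(range(n_workers)), requests):
        batches[worker_id].append(req)
    return batches

# Seven clicks over three blades give per-blade loads of 3, 2, and 2:
loads = [len(batch) for batch in spray(range(7), n_workers=3)]
```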
Characteristics of Throughput Computing<br />
• The real goal is to saturate the pipeline to memory<br />
• Or any other very long latency pipeline, for that matter …<br />
• Hardware Implications<br />
• Support for a large number of active and standby threads<br />
– under-threaded designs won’t saturate the pipeline to memory<br />
• Very robust memory controller - lots of outstanding references<br />
• Measure effectiveness by how well memory BW is used<br />
• Software / Workload Implications<br />
• Separate natural parallelism into large # of independent threads<br />
• What happens if the “Central Controller” gets backed up?<br />
• It is effectively the serial component in these topologies<br />
• A macroscopic version of Amdahl’s Law kicks in<br />
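How many threads it takes to saturate the pipeline to memory follows from Little’s Law (concurrency = bandwidth × latency); a back-of-the-envelope sketch, with illustrative numbers of our own choosing:

```python
def threads_to_saturate(bandwidth_gb_s, latency_ns, line_bytes, refs_per_thread):
    # Little's Law: bytes in flight = bandwidth * latency.
    # Conveniently, GB/s * ns = bytes exactly (1e9 * 1e-9 = 1).
    bytes_in_flight = bandwidth_gb_s * latency_ns
    refs_in_flight = bytes_in_flight / line_bytes
    return refs_in_flight / refs_per_thread

# E.g. 10 GB/s, 100 ns memory latency, 64-byte lines,
# 4 outstanding references per thread:
n = threads_to_saturate(10, 100, 64, 4)   # ~3.9 threads minimum
```

Designs with fewer concurrent threads (or fewer outstanding references per thread) than this bound leave memory bandwidth idle, which is the under-threading problem noted above.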
What does a General Purpose Throughput<br />
Computing solution look like?<br />
• Start with a Powerful, yet efficient, Uniprocessor<br />
• Perhaps a next generation Opteron would be a good choice ☺<br />
• Deploy one or more of these for excellent general purpose legacy<br />
support, and as the Centralized Controller for throughput<br />
• Add a larger number of small, power-efficient, domain<br />
optimized compute offload engines<br />
• Build an optimized memory system<br />
• Shared among all the processors<br />
• Optimized support for communication, synchronization & data transfers<br />
• Embrace a programming model abstraction that favors<br />
simplicity and programmer productivity<br />
Accelerated Computing with Heterogeneous<br />
Processors!<br />
Accelerated Computing: Motivations<br />
New and emerging applications are defining and operating on<br />
larger-scale and more abstract data types<br />
Targeted special purpose HW solutions can offer substantially<br />
better power efficiency and throughput than traditional<br />
microprocessors when operating on some of these new<br />
data types.<br />
Traditional “host” offload to dense compute accelerator<br />
• Use APIs to enable this without heroic programming efforts<br />
• Use Concurrent Runtime to ease scheduling & resource management<br />
• ISA compatibility yields to API and Platform Compatibility<br />
Accelerated Computing is a Platform for<br />
Heterogeneous Computing<br />
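The host-offload pattern above can be sketched as a thin API wrapper: the caller targets the API, and whether the work lands on an accelerator or falls back to the host is an implementation detail (the `accelerator` object and its `run` method here are hypothetical):

```python
def offload_map(kernel, data, accelerator=None):
    # API-level compatibility: callers depend on this function's
    # contract, not on the accelerator's ISA.
    if accelerator is not None:
        return accelerator.run(kernel, data)   # hypothetical device path
    return [kernel(x) for x in data]           # host fallback

doubled = offload_map(lambda x: 2 * x, [1, 2, 3])
```

This is the sense in which ISA compatibility yields to API and platform compatibility: the same call site works whether or not an offload engine is present.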
Lots of Challenges …<br />
• Memory BW and Data Movement<br />
• Keeping up with the computation rates will require increasingly<br />
capable memory systems<br />
• New and appropriate Software Stack and APIs<br />
• A programming model that builds on emerging multi-core models<br />
• Use abstraction to trade some performance for programmer<br />
productivity<br />
• Live within bounds set by ecosystem OS and Concurrent Runtimes<br />
• Managing context state and exceptions<br />
• This includes the program-visible state in the compute offload engine!<br />
• Virtualizing the context state<br />
• Communications/Messaging<br />
• Simplified & fabric independent producer-consumer model<br />
• Optimized communications is a key enabler<br />
• It’s the synchronization, stupid!<br />
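A simplified, fabric-independent producer-consumer model reduces to a bounded queue plus a sentinel; a minimal sketch (names are ours):

```python
import queue
import threading

def produce_consume(items, process):
    # The queue is the synchronization point; the consumer could be a
    # host thread today and an offload engine behind the same
    # abstraction tomorrow -- the producer does not care.
    q = queue.Queue(maxsize=8)   # bounded: producer blocks when full
    out = []

    def consumer():
        while True:
            item = q.get()
            if item is None:     # sentinel: no more work
                return
            out.append(process(item))

    worker = threading.Thread(target=consumer)
    worker.start()
    for item in items:
        q.put(item)
    q.put(None)
    worker.join()
    return out
```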
The Memory Wall – getting thicker<br />
There has always been a Critical Balance between<br />
Data Availability and Processing<br />
Situation | When | Implication | Industry Solution<br />
DRAM vs CPU cycle times: widening gap | Early 1990s | Memory wait time dominates computing | Non-blocking caches; O-o-O machines ☺<br />
SW Productivity Crisis, Round 1: object-oriented languages, managed runtime environments | Early 1990s | Larger working sets; more diverse data types | Larger caches; cache hierarchies; elaborate prefetch<br />
Frequency and IPC wall: CMP and multi-threading | 2005 and beyond | Multiple working sets! VMs! More memory transactions | Huge caches; elaborate memory controllers<br />
SW Productivity Crisis, Round 2: increased abstraction layers, image/video as basic data types | 2009 and beyond | Even larger working sets; larger data types | Accelerated computing; throughput computing; chip stacking; TBD<br />
The Emerging Layers of Computation<br />
Start with an Analogy to the Communications Industry<br />
[Diagram: the classic Computing Model stack of Application / Operating System / Hardware, surrounded by indicators of a bigger picture: fault-tolerant “meta apps”, browsers, web services, JVM, CLR, LLVM, virtualization, intelligent I/O, advanced APIs, dynamic binary translation, reconfiguration]<br />
The Emerging Layers of Computation<br />
• Network Runtime Layer: Data Center Applications; Data Center Runtime Environment (e.g. Google apps, SETI@Home)<br />
• Network Layer: Networked Platform; Network-aware Applications (web services); Network Services (e.g. AJAX)<br />
• Native Runtime Layer: Applications; API’s, Libs; Runtime; Traditional OS (e.g. DirectX, Concurrent Runtime)<br />
• Platform Layer: Hypervisor (virtual platform); Compatible Hardware Platform; x86 Compatible Hardware (e.g. VMware, arch extensions, GPUs)<br />
• Physical Layer: RAW Hardware; Devices (e.g. redundant hardware, microcode, error recovery, dynamic translation)<br />
The Emerging Layers of Computation<br />
[Diagram: the same layer stack, from the Physical Layer up through the Network Runtime Layer, with ease of programming and productivity increasing toward the top]<br />
The Challenge: shift this from a retrospective observation into a more formal and deliberate framework<br />
Lots of Interesting Implications<br />
[Diagram: the layer stack again, grouped into three bands]<br />
• The Data Center: Data Center Applications; Data Center Runtime Environment; Networked Platform; Network-aware Applications (web services); Network Services<br />
• Compute Platform: parallel applications using CMP/SMP; Applications; API’s, Libs; MRE’s; Traditional OS; Hypervisor (virtual platform); Compatible Hardware Platform; x86 Compatible Hardware<br />
• Accelerated Computing: RAW Hardware; Devices; special purpose HW; new types of programmable HW<br />
Summary<br />
Uniprocessor performance scaling is nearing its limits<br />
<strong>Moore</strong>’s Law is great from an integration standpoint, but…<br />
• Power scalability isn’t tracking due to approaching V<sub>min</sub> limits<br />
• This fact, combined with the challenges of Amdahl’s Law, will<br />
ultimately place practical limits on the number of cores in CMPs<br />
The demand for increasing performance won’t go away, but<br />
we need to find new ways of easily offering that performance<br />
• Programmer productivity is a first order consideration<br />
• Deployment of significantly more power-effective HW structures<br />
• Lots of interesting system architecture innovations are possible<br />
The time for Accelerated Computing with<br />
Heterogeneous HW is here!<br />
Thank You !<br />
Questions?<br />
©2008. Advanced Micro Devices, Inc. All rights reserved. <strong>AMD</strong>, the<br />
<strong>AMD</strong> Arrow logo, <strong>AMD</strong> Opteron, and combinations thereof, are<br />
trademarks of Advanced Micro Devices, Inc.<br />
Other names are for informational purposes only and may be<br />
trademarks of their respective owners.<br />