01.01.2015 Views

Chuck Moore, AMD - Semiconductor Research Corporation

Chuck Moore, AMD - Semiconductor Research Corporation

Chuck Moore, AMD - Semiconductor Research Corporation

SHOW MORE
SHOW LESS

Create successful ePaper yourself

Turn your PDF publications into a flip-book with our unique Google optimized e-Paper software.

May 17, 2006<br />

The Role of Accelerated Computing<br />

in the Multi-Core Era<br />

<strong>Chuck</strong> <strong>Moore</strong><br />

Senior Fellow<br />

Advanced Micro Devices


Outline<br />

• <strong>Moore</strong>’s Law and Customer Value<br />

• A Few High-level Trends<br />

• Some Thoughts on SMP and Multi-core Computing<br />

• Throughput Computing<br />

• Relationship to Accelerated Computing<br />

• Accelerated Computing Challenges<br />

• Summary<br />

2 7/17/2008<br />

The Role of Accelerated Computing in the Multi-core Era


<strong>Moore</strong>’s Law (Gordon <strong>Moore</strong>, Electronics Magazine – April 1965)<br />

What he observed:<br />

Double transistor integration every 18 months<br />

Another trend often confused with <strong>Moore</strong>’s Law:<br />

Double performance every 18 months<br />

This is actually an example of ongoing customer value<br />

The semiconductor industry is dependent upon ongoing<br />

customer value:<br />

build<br />

A virtuous cycle:<br />

customer<br />

value<br />

invest<br />

$$$<br />

The Question we should all be asking ourselves:<br />

How do we continue increasing customer value<br />

3 7/17/2008<br />

The Role of Accelerated Computing in the Multi-core Era


A Few High-level Trends<br />

<strong>Moore</strong>’s Law ☺<br />

The Power Wall <br />

The Frequency Wall <br />

Integration (log scale)<br />

o<br />

we are<br />

here<br />

!<br />

-DFM<br />

- Variability<br />

- Reliability<br />

- Wire delay<br />

Power Budget (TDP)<br />

Server: power=$$<br />

DT: eliminate fans<br />

Mobile: battery<br />

o<br />

we are<br />

here<br />

Frequency<br />

o<br />

we are<br />

here<br />

Time<br />

Time<br />

Time<br />

The Complexity Wall <br />

Locality <br />

Single thread Perf (!)<br />

IPC<br />

o<br />

we are<br />

here<br />

Performance<br />

o<br />

we are<br />

here<br />

Single-thread Perf<br />

o<br />

we are<br />

here<br />

<br />

Issue Width<br />

Cache Size<br />

Time<br />

So, how can we add customer value<br />

4 7/17/2008<br />

The Role of Accelerated Computing in the Multi-core Era


Customer Value beyond just Performance<br />

Seamless<br />

upgradeability<br />

+<br />

Quad-core technology<br />

+<br />

Virtualization<br />

+<br />

Performance-per<br />

per-watt<br />

innovation<br />

<strong>AMD</strong> Native Dual<br />

Core Opteron<br />

Pervasive<br />

64-bit capability<br />

+<br />

Dual-core technology<br />

+<br />

Systems architecture<br />

innovation<br />

<strong>AMD</strong> Native Quad<br />

Core Core Opteron<br />

SMP and Multi-Core to the long term rescue<br />

5 7/17/2008<br />

The Role of Accelerated Computing in the Multi-core Era


Optimized CMP Platforms for Performance<br />

• In the near-term, there is definite potential here<br />

• Commodity multi-core processors break the “chicken & egg” barrier<br />

• Impressive amount of interesting research firing up:<br />

– TM, coherency filters, hierarchical scheduling, MREs, VMs, etc<br />

• Lots of good activity on the Tools front More to come<br />

• Some workloads will do well with this, but many will not:<br />

• Amdahl’s Law seriously inhibits unstructured parallelism ...<br />

Hold on a moment …<br />

You do remember the implications of<br />

Amdahl’s Law, don’t you<br />

6 7/17/2008<br />

The Role of Accelerated Computing in the Multi-core Era


Amdahl’s Law<br />

Speed-up =<br />

1<br />

S W + (1 – S W ) / N<br />

S W : % Serial Work<br />

N: Number of processors<br />

7 7/17/2008<br />

The Role of Accelerated Computing ** in <strong>AMD</strong> the Confidential Multi-core Era **


Amdahl’s Law – Zoom out a bit …<br />

8 7/17/2008<br />

The Role of Accelerated Computing in the Multi-core Era


Optimized CMP Platforms for Performance<br />

• In the near-term, there is definite potential here<br />

• Commodity multi-core processors break the “chicken & egg” barrier<br />

• Impressive amount of interesting research firing up:<br />

– TM, coherency filters, hierarchical scheduling, MREs, VMs, etc<br />

• Lots of good activity on the Tools front More to come<br />

• Some workloads will do well with this, but many will not:<br />

• Amdahl’s Law seriously inhibits unstructured parallelism …<br />

• As it turns out, existing software isn’t really that soft<br />

– Transitioning the concurrency model is a very big deal<br />

• Many key emerging applications will work on “media rich” content that<br />

does not map onto traditional processors in a power-efficient manner<br />

• In reality, SMP/Multi-core challenges are just an early indicator<br />

of the shifts yet to come<br />

• Power constraints will force these to be performance heterogeneous<br />

• Advances in synchronization and non-uniform resource management<br />

9 7/17/2008<br />

The Role of Accelerated Computing in the Multi-core Era


Outline<br />

• <strong>Moore</strong>’s Law and Customer Value<br />

• A Few High-level Trends<br />

• Some Thoughts on SMP and Multi-core Computing<br />

• Throughput Computing<br />

• Relationship to Accelerated Computing<br />

• Accelerated Computing Challenges<br />

• Memory Bandwidth and Data Movement<br />

• The Software Stack and a Framework for Innovation<br />

• Summary<br />

10 7/17/2008<br />

The Role of Accelerated Computing in the Multi-core Era


What about Throughput Computing<br />

• Measure Performance as throughput (vs. turn-around-time)<br />

• How long does it takes to run N “independent” tasks<br />

• Multiple cores/threads should be faster than just one<br />

• For basic multi-program throughput, the OS plays a key role<br />

is time-slicing application use on one or more cores<br />

• In some sense, the OS “offloads” work onto available cores<br />

• As some point, OS scalability becomes the bottleneck<br />

• More advanced applications take on that role themselves with<br />

“gang scheduling”<br />

• Modern databases<br />

• Some HPC apps turn data parallelism into task-level throughput<br />

• Some types of web servers<br />

• Can we exploit large scale throughput computing<br />

11 7/17/2008<br />

The Role of Accelerated Computing in the Multi-core Era


Large Scale Throughput Systems<br />

• In these, there are far more threads than HW thread-slots<br />

• A Centralized Controller helps dispatch/juggle among threads<br />

Two-Tiered Web Server<br />

Massive amounts<br />

of Internet Clicks<br />

SMP Server:<br />

“Sprayer”<br />

Vertices<br />

Pixels<br />

Modern GPU<br />

“Sequencer”<br />

Blade Servers<br />

Cooperative Heterogeneous Computing!<br />

12 7/17/2008<br />

The Role of Accelerated Computing in the Multi-core Era


Characteristics of Throughput Computing<br />

• The real goal is to saturate the pipeline to memory<br />

• Or any other very long latency pipeline, for that matter …<br />

• Hardware Implications<br />

• Support of large number of active and standby threads<br />

– under-threaded designs won’t saturate the pipeline to memory<br />

• Very robust memory controller - lots of outstanding references<br />

• Measure effectiveness by how well memory BW is used<br />

• Software / Workload Implications<br />

• Separate natural parallelism into large # of independent threads<br />

• What happens if the “Central Controller” gets backed up<br />

• It is effectively the serial component in these topologies<br />

• A macroscopic version of Amdahl’s Law kicks in<br />

13 7/17/2008<br />

The Role of Accelerated Computing in the Multi-core Era


What does a General Purpose Throughput<br />

Computing solution look like<br />

• Start with a Powerful, yet efficient, Uniprocessor<br />

• Perhaps a next generation Opteron would be a good choice ☺<br />

• Deploy one or more of these for excellent general purpose legacy<br />

support, and as for the Centralized Controller for throughput<br />

• Add a larger number of small, power-efficient, domain<br />

optimized compute offload engines<br />

• Build an optimized memory system<br />

• Shared among all the processors<br />

• Optimized support for communication, synchronization & data transfers<br />

• Embrace a programming model abstraction that favors<br />

simplicity and programmer productivity<br />

Accelerated Computing with Heterogeneous<br />

Processors!<br />

14 7/17/2008<br />

The Role of Accelerated Computing in the Multi-core Era


Accelerated Computing: Motivations<br />

New and emerging applications are defining and operating on<br />

larger-scale and more abstract data types<br />

Targeted special purpose HW solutions can offer substantially<br />

better power efficiency and throughput than traditional<br />

microprocessors when operating on some of these new<br />

data types.<br />

Traditional “host” offload to dense compute accelerator<br />

• Use APIs to enable this without heroic programming efforts<br />

• Use Concurrent Runtime to ease scheduling & resource management<br />

• ISA compatibility yields to API and Platform Compatibility<br />

Accelerated Computing is a Platform for<br />

Heterogeneous Computing<br />

15 7/17/2008<br />

The Role of Accelerated Computing in the Multi-core Era


Lots of Challenges …<br />

• Memory BW and Data Movement<br />

• Keeping up with the computation rates will require increasingly<br />

capable memory systems<br />

• New and appropriate Software Stack and APIs<br />

• A programming model that builds on emerging multi-core models<br />

• Use abstraction to trade some performance for programmer<br />

productivity<br />

• Live within bounds set by ecosystem OS and Concurrent Runtimes<br />

• Managing context state and exceptions<br />

• This includes the program-visible state in the compute offload engine!<br />

• Virtualizing the context state<br />

• Communications/Messaging<br />

• Simplified & fabric independent producer-consumer model<br />

• Optimized communications is a key enabler<br />

• It’s the synchronization, stupid!<br />

16 7/17/2008<br />

The Role of Accelerated Computing in the Multi-core Era


The Memory Wall – getting thicker<br />

There has always been a Critical Balance between<br />

Data Availability and Processing<br />

Situation When Implication Industry Solution<br />

DRAM vs CPU Cycle Times<br />

Widening GAP<br />

Early<br />

1990s<br />

Memory wait time dominates<br />

computing<br />

Non-blocking<br />

caches<br />

O-o-O Machines<br />

☺<br />

SW Productivity Crisis: Round 1<br />

Object oriented languages;<br />

Managed runtime environments<br />

Early<br />

1990s<br />

Larger working sets<br />

More diverse data types<br />

Larger Caches<br />

Cache Hierarchies<br />

Elaborate prefetch<br />

<br />

Frequency and IPC Wall<br />

CMP and Multi-Threading<br />

2005 and<br />

beyond<br />

Multiple working sets! VMs!<br />

More memory transactions<br />

Huge Caches<br />

Elaborate MemCtlrs !<br />

SW Productivity Crisis: Round 2<br />

Increased abstraction layers<br />

Image/Video as basic data types<br />

2009 and<br />

beyond<br />

Even larger working sets<br />

Larger data types<br />

Accelerated<br />

Computing<br />

Throughput<br />

Computing<br />

Chip Stacking<br />

TBD<br />

17 7/17/2008<br />

The Role of Accelerated Computing in the Multi-core Era


The Emerging Layers of Computation<br />

Start with an Analogy to the Communications Industry<br />

Computing Model<br />

Application<br />

Operating<br />

System<br />

Fault tolerant<br />

“meta apps”<br />

Browsers<br />

Web services<br />

JVM, CLR<br />

LLVM<br />

Virtualization<br />

Intelligent I/O<br />

Advanced APIs<br />

Dynamic binary<br />

translation<br />

Reconfiguration<br />

Hardware<br />

Indicators<br />

of a bigger<br />

picture<br />

18 7/17/2008<br />

The Role of Accelerated Computing in the Multi-core Era


The Emerging Layers of Computation<br />

Network Runtime Layer<br />

Google<br />

apps<br />

SETI<br />

@Home<br />

Data Center Applications<br />

Data Center Runtime Environment<br />

Network Layer<br />

Networked Platform<br />

Native Runtime<br />

Layer<br />

Applications<br />

API’s, Libs<br />

Runtime<br />

Traditional OS<br />

Hypervisor (virtual platform)<br />

Compatible Hardware Platform<br />

Network-aware Applications (web services)<br />

Network Services<br />

VMware<br />

Arch<br />

extensions<br />

GPUs<br />

DirectX<br />

Concurrent<br />

Runtime<br />

AJAX<br />

Platform Layer<br />

x86 Compatible Hardware<br />

RAW Hardware<br />

Devices<br />

Redundant<br />

hardware<br />

Microcode<br />

Error<br />

recovery<br />

Dynamic<br />

translation<br />

Physical Layer<br />

19 7/17/2008<br />

The Role of Accelerated Computing in the Multi-core Era


The Emerging Layers of Computation<br />

Network Runtime Layer<br />

Data Center Applications<br />

Ease of Programming<br />

Network Layer<br />

Native Runtime<br />

Layer<br />

Compatible Hardware Platform<br />

x86 Compatible Hardware<br />

RAW Hardware<br />

Productivity<br />

Applications<br />

API’s, Libs<br />

MRE’s<br />

Traditional OS<br />

Hypervisor (virtual platform)<br />

Devices<br />

Data Center Runtime Environment<br />

The Challenge:<br />

Networked Platform<br />

Network-aware Applications<br />

Shift this from a<br />

Network Services<br />

retrospective<br />

observation into a<br />

more formal and<br />

deliberate framework<br />

Platform Layer<br />

Physical Layer<br />

20 7/17/2008<br />

The Role of Accelerated Computing in the Multi-core Era


Lots of Interesting Implications<br />

The Data Center<br />

Compute<br />

Platform<br />

Data Center Applications<br />

Data Center Runtime Environment<br />

Networked Platform<br />

Parallel<br />

Applications<br />

using<br />

CMP/SMP<br />

Applications<br />

API’s, Libs MRE’s<br />

Traditional OS<br />

Network-aware Applications (web services)<br />

Network Services<br />

Hypervisor (virtual platform)<br />

Compatible Hardware Platform<br />

x86 Compatible Hardware<br />

RAW Hardware<br />

Devices<br />

Special<br />

purpose<br />

HW<br />

New types<br />

of<br />

Programable<br />

HW<br />

Accelerated<br />

Computing<br />

21 7/17/2008<br />

The Role of Accelerated Computing in the Multi-core Era


Summary<br />

Uniprocessor performance scaling is nearing it’s limits<br />

<strong>Moore</strong>’s Law is great from an integration standpoint, but…<br />

• Power scalability isn’t tracking due to approaching V min<br />

limits<br />

• This fact, combined with the challenges of Amdahl’s Law will<br />

ultimately place practical limits on the #cores in CMPs<br />

The demand for increasing performance won’t go away, but<br />

we need to find new ways of easily offering that performance<br />

• Programmer productivity is a first order consideration<br />

• Deployment of significantly more power effective HW structures<br />

• Lots of interesting system architecture innovations are possible<br />

The time for Accelerated Computing with<br />

Heterogeneous HW is here!<br />

22 7/17/2008<br />

The Role of Accelerated Computing in the Multi-core Era


Thank You !<br />

Questions<br />

©2008. Advanced Micro Devices, Inc. All rights reserved. <strong>AMD</strong>, the<br />

<strong>AMD</strong> Arrow logo, <strong>AMD</strong> Opteron, and combinations thereof, are<br />

trademarks of Advanced Micro Devices, Inc.<br />

Other names are for informational purposes only and may be<br />

trademarks of their respective owners.<br />

23 7/17/2008<br />

The Role of Accelerated Computing in the Multi-core Era

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!