Chuck Moore, AMD - Semiconductor Research Corporation
May 17, 2006<br />
The Role of Accelerated Computing<br />
in the Multi-Core Era<br />
<strong>Chuck</strong> <strong>Moore</strong><br />
Senior Fellow<br />
Advanced Micro Devices
Outline<br />
• <strong>Moore</strong>’s Law and Customer Value<br />
• A Few High-level Trends<br />
• Some Thoughts on SMP and Multi-core Computing<br />
• Throughput Computing<br />
• Relationship to Accelerated Computing<br />
• Accelerated Computing Challenges<br />
• Summary<br />
2 7/17/2008<br />
The Role of Accelerated Computing in the Multi-core Era
<strong>Moore</strong>’s Law (Gordon <strong>Moore</strong>, Electronics Magazine – April 1965)<br />
What he observed:<br />
Double transistor integration every 18 months<br />
Another trend often confused with <strong>Moore</strong>’s Law:<br />
Double performance every 18 months<br />
This is actually an example of ongoing customer value<br />
The semiconductor industry is dependent upon ongoing<br />
customer value:<br />
A virtuous cycle: build → customer value → invest $$$ → build …<br />
The Question we should all be asking ourselves:<br />
How do we continue increasing customer value?<br />
A Few High-level Trends<br />
[Figure: six trend sketches over time, each marked “we are here”]<br />
• <strong>Moore</strong>’s Law ☺: integration (log scale) keeps climbing, though with growing challenges (DFM, variability, reliability, wire delay)<br />
• The Power Wall: the power budget (TDP) has flattened (Server: power = $$; Desktop: eliminate fans; Mobile: battery)<br />
• The Frequency Wall: frequency gains have stalled<br />
• The Complexity Wall: IPC returns from issue width have flattened<br />
• Locality: performance returns from cache size are diminishing<br />
• Single-thread performance (!): flattening over time<br />
So, how can we add customer value?<br />
Customer Value beyond just Performance<br />
<strong>AMD</strong> Native Dual-Core Opteron:<br />
• Pervasive 64-bit capability<br />
• Dual-core technology<br />
• Systems architecture innovation<br />
<strong>AMD</strong> Native Quad-Core Opteron:<br />
• Seamless upgradeability<br />
• Quad-core technology<br />
• Virtualization<br />
• Performance-per-watt innovation<br />
SMP and Multi-Core to the long-term rescue<br />
Optimized CMP Platforms for Performance<br />
• In the near-term, there is definite potential here<br />
• Commodity multi-core processors break the “chicken & egg” barrier<br />
• Impressive amount of interesting research firing up:<br />
– TM, coherency filters, hierarchical scheduling, MREs, VMs, etc<br />
• Lots of good activity on the Tools front. More to come.<br />
• Some workloads will do well with this, but many will not:<br />
• Amdahl’s Law seriously inhibits unstructured parallelism ...<br />
Hold on a moment …<br />
You do remember the implications of<br />
Amdahl’s Law, don’t you?<br />
Amdahl’s Law<br />
Speed-up = 1 / ( S<sub>W</sub> + (1 – S<sub>W</sub>) / N )<br />
S<sub>W</sub>: % Serial Work<br />
N: Number of processors<br />
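The diminishing returns implied by this formula are easy to tabulate; a minimal sketch (the function name `amdahl_speedup` is ours, not from the slides):

```python
def amdahl_speedup(serial_fraction, n_processors):
    """Speed-up = 1 / (S_W + (1 - S_W) / N), with S_W the serial fraction."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_processors)

# Even a 10% serial component caps the speed-up far below N:
for n in (4, 16, 64):
    print(n, round(amdahl_speedup(0.10, n), 2))
```

With S<sub>W</sub> = 0.10, even 64 processors deliver under a 9× speed-up, and no processor count can exceed 10×.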
Amdahl’s Law – Zoom out a bit …<br />
Optimized CMP Platforms for Performance<br />
• In the near-term, there is definite potential here<br />
• Commodity multi-core processors break the “chicken & egg” barrier<br />
• Impressive amount of interesting research firing up:<br />
– TM, coherency filters, hierarchical scheduling, MREs, VMs, etc<br />
• Lots of good activity on the Tools front. More to come.<br />
• Some workloads will do well with this, but many will not:<br />
• Amdahl’s Law seriously inhibits unstructured parallelism …<br />
• As it turns out, existing software isn’t really that soft<br />
– Transitioning the concurrency model is a very big deal<br />
• Many key emerging applications will work on “media rich” content that<br />
does not map onto traditional processors in a power-efficient manner<br />
• In reality, SMP/Multi-core challenges are just an early indicator<br />
of the shifts yet to come<br />
• Power constraints will force these to be performance heterogeneous<br />
• Advances in synchronization and non-uniform resource management<br />
Outline<br />
• <strong>Moore</strong>’s Law and Customer Value<br />
• A Few High-level Trends<br />
• Some Thoughts on SMP and Multi-core Computing<br />
• Throughput Computing<br />
• Relationship to Accelerated Computing<br />
• Accelerated Computing Challenges<br />
• Memory Bandwidth and Data Movement<br />
• The Software Stack and a Framework for Innovation<br />
• Summary<br />
What about Throughput Computing?<br />
• Measure Performance as throughput (vs. turn-around-time)<br />
• How long does it take to run N “independent” tasks?<br />
• Multiple cores/threads should be faster than just one<br />
• For basic multi-program throughput, the OS plays a key role<br />
in time-slicing application use on one or more cores<br />
• In some sense, the OS “offloads” work onto available cores<br />
• At some point, OS scalability becomes the bottleneck<br />
• More advanced applications take on that role themselves with<br />
“gang scheduling”<br />
• Modern databases<br />
• Some HPC apps turn data parallelism into task-level throughput<br />
• Some types of web servers<br />
• Can we exploit large scale throughput computing?<br />
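The throughput-vs-turnaround distinction above can be sketched with a plain thread pool; a minimal illustration (function and variable names are ours):

```python
from concurrent.futures import ThreadPoolExecutor

def run_independent_tasks(tasks, workers):
    # Throughput view: what matters is how long it takes to finish
    # all N independent tasks, not the latency of any single one.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda task: task(), tasks))

# Eight independent tasks time-sliced across four worker threads:
results = run_independent_tasks([lambda i=i: i * i for i in range(8)], workers=4)
```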
Large Scale Throughput Systems<br />
• In these, there are far more threads than HW thread-slots<br />
• A Centralized Controller helps dispatch/juggle among threads<br />
Two-Tiered Web Server: massive amounts of Internet clicks flow into an SMP Server (the “Sprayer”), which distributes them across Blade Servers<br />
Modern GPU: vertices and pixels flow through a “Sequencer”<br />
Cooperative Heterogeneous Computing!<br />
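The “Sprayer” role reduces, at its simplest, to a round-robin dispatcher in front of a worker pool; a purely illustrative sketch (names are ours):

```python
from itertools import cycle

def spray(requests, n_workers):
    # Central "sprayer": deal independent incoming requests out to
    # back-end workers round-robin, as the SMP front end does for
    # blade servers (or a GPU sequencer for vertices and pixels).
    batches = [[] for _ in range(n_workers)]
    for worker_id, req in zip(cycle(range(n_workers)), requests):
        batches[worker_id].append(req)
    return batches

# Seven clicks over three blades give per-blade loads of 3, 2, and 2:
loads = [len(batch) for batch in spray(range(7), n_workers=3)]
```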
Characteristics of Throughput Computing<br />
• The real goal is to saturate the pipeline to memory<br />
• Or any other very long latency pipeline, for that matter …<br />
• Hardware Implications<br />
• Support for a large number of active and standby threads<br />
– under-threaded designs won’t saturate the pipeline to memory<br />
• Very robust memory controller - lots of outstanding references<br />
• Measure effectiveness by how well memory BW is used<br />
• Software / Workload Implications<br />
• Separate natural parallelism into large # of independent threads<br />
• What happens if the “Central Controller” gets backed up?<br />
• It is effectively the serial component in these topologies<br />
• A macroscopic version of Amdahl’s Law kicks in<br />
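How many threads it takes to saturate the pipeline to memory follows from Little’s Law (concurrency = bandwidth × latency); a back-of-the-envelope sketch, with illustrative numbers of our own choosing:

```python
def threads_to_saturate(bandwidth_gb_s, latency_ns, line_bytes, refs_per_thread):
    # Little's Law: bytes in flight = bandwidth * latency.
    # Conveniently, GB/s * ns = bytes exactly (1e9 * 1e-9 = 1).
    bytes_in_flight = bandwidth_gb_s * latency_ns
    refs_in_flight = bytes_in_flight / line_bytes
    return refs_in_flight / refs_per_thread

# E.g. 10 GB/s, 100 ns memory latency, 64-byte lines,
# 4 outstanding references per thread:
n = threads_to_saturate(10, 100, 64, 4)   # ~3.9 threads minimum
```

Designs with fewer concurrent threads (or fewer outstanding references per thread) than this bound leave memory bandwidth idle, which is the under-threading problem noted above.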
What does a General Purpose Throughput<br />
Computing solution look like?<br />
• Start with a Powerful, yet efficient, Uniprocessor<br />
• Perhaps a next generation Opteron would be a good choice ☺<br />
• Deploy one or more of these for excellent general purpose legacy<br />
support, and as the Centralized Controller for throughput<br />
• Add a larger number of small, power-efficient, domain<br />
optimized compute offload engines<br />
• Build an optimized memory system<br />
• Shared among all the processors<br />
• Optimized support for communication, synchronization & data transfers<br />
• Embrace a programming model abstraction that favors<br />
simplicity and programmer productivity<br />
Accelerated Computing with Heterogeneous<br />
Processors!<br />
Accelerated Computing: Motivations<br />
New and emerging applications are defining and operating on<br />
larger-scale and more abstract data types<br />
Targeted special purpose HW solutions can offer substantially<br />
better power efficiency and throughput than traditional<br />
microprocessors when operating on some of these new<br />
data types.<br />
Traditional “host” offload to dense compute accelerator<br />
• Use APIs to enable this without heroic programming efforts<br />
• Use Concurrent Runtime to ease scheduling & resource management<br />
• ISA compatibility yields to API and Platform Compatibility<br />
Accelerated Computing is a Platform for<br />
Heterogeneous Computing<br />
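The host-offload pattern above can be sketched as a thin API wrapper: the caller targets the API, and whether the work lands on an accelerator or falls back to the host is an implementation detail (the `accelerator` object and its `run` method here are hypothetical):

```python
def offload_map(kernel, data, accelerator=None):
    # API-level compatibility: callers depend on this function's
    # contract, not on the accelerator's ISA.
    if accelerator is not None:
        return accelerator.run(kernel, data)   # hypothetical device path
    return [kernel(x) for x in data]           # host fallback

doubled = offload_map(lambda x: 2 * x, [1, 2, 3])
```

This is the sense in which ISA compatibility yields to API and platform compatibility: the same call site works whether or not an offload engine is present.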
Lots of Challenges …<br />
• Memory BW and Data Movement<br />
• Keeping up with the computation rates will require increasingly<br />
capable memory systems<br />
• New and appropriate Software Stack and APIs<br />
• A programming model that builds on emerging multi-core models<br />
• Use abstraction to trade some performance for programmer<br />
productivity<br />
• Live within bounds set by ecosystem OS and Concurrent Runtimes<br />
• Managing context state and exceptions<br />
• This includes the program-visible state in the compute offload engine!<br />
• Virtualizing the context state<br />
• Communications/Messaging<br />
• Simplified & fabric independent producer-consumer model<br />
• Optimized communications is a key enabler<br />
• It’s the synchronization, stupid!<br />
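A simplified, fabric-independent producer-consumer model reduces to a bounded queue plus a sentinel; a minimal sketch (names are ours):

```python
import queue
import threading

def produce_consume(items, process):
    # The queue is the synchronization point; the consumer could be a
    # host thread today and an offload engine behind the same
    # abstraction tomorrow -- the producer does not care.
    q = queue.Queue(maxsize=8)   # bounded: producer blocks when full
    out = []

    def consumer():
        while True:
            item = q.get()
            if item is None:     # sentinel: no more work
                return
            out.append(process(item))

    worker = threading.Thread(target=consumer)
    worker.start()
    for item in items:
        q.put(item)
    q.put(None)
    worker.join()
    return out
```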
The Memory Wall – getting thicker<br />
There has always been a Critical Balance between<br />
Data Availability and Processing<br />
Situation | When | Implication | Industry Solution<br />
DRAM vs CPU cycle times: widening gap | Early 1990s | Memory wait time dominates computing | Non-blocking caches; O-o-O machines ☺<br />
SW Productivity Crisis, Round 1: object-oriented languages, managed runtime environments | Early 1990s | Larger working sets; more diverse data types | Larger caches; cache hierarchies; elaborate prefetch<br />
Frequency and IPC wall: CMP and multi-threading | 2005 and beyond | Multiple working sets! VMs! More memory transactions | Huge caches; elaborate memory controllers<br />
SW Productivity Crisis, Round 2: increased abstraction layers, image/video as basic data types | 2009 and beyond | Even larger working sets; larger data types | Accelerated computing; throughput computing; chip stacking; TBD<br />
The Emerging Layers of Computation<br />
Start with an Analogy to the Communications Industry<br />
[Diagram: the classic Computing Model stack of Application / Operating System / Hardware, surrounded by indicators of a bigger picture: fault-tolerant “meta apps”, browsers, web services, JVM, CLR, LLVM, virtualization, intelligent I/O, advanced APIs, dynamic binary translation, reconfiguration]<br />
The Emerging Layers of Computation<br />
• Network Runtime Layer: Data Center Applications; Data Center Runtime Environment (e.g. Google apps, SETI@Home)<br />
• Network Layer: Networked Platform; Network-aware Applications (web services); Network Services (e.g. AJAX)<br />
• Native Runtime Layer: Applications; API’s, Libs; Runtime; Traditional OS (e.g. DirectX, Concurrent Runtime)<br />
• Platform Layer: Hypervisor (virtual platform); Compatible Hardware Platform; x86 Compatible Hardware (e.g. VMware, arch extensions, GPUs)<br />
• Physical Layer: RAW Hardware; Devices (e.g. redundant hardware, microcode, error recovery, dynamic translation)<br />
The Emerging Layers of Computation<br />
[Diagram: the same layer stack, from the Physical Layer up through the Network Runtime Layer, with ease of programming and productivity increasing toward the top]<br />
The Challenge: shift this from a retrospective observation into a more formal and deliberate framework<br />
Lots of Interesting Implications<br />
[Diagram: the layer stack again, grouped into three bands]<br />
• The Data Center: Data Center Applications; Data Center Runtime Environment; Networked Platform; Network-aware Applications (web services); Network Services<br />
• Compute Platform: parallel applications using CMP/SMP; Applications; API’s, Libs; MRE’s; Traditional OS; Hypervisor (virtual platform); Compatible Hardware Platform; x86 Compatible Hardware<br />
• Accelerated Computing: RAW Hardware; Devices; special purpose HW; new types of programmable HW<br />
Summary<br />
Uniprocessor performance scaling is nearing its limits<br />
<strong>Moore</strong>’s Law is great from an integration standpoint, but…<br />
• Power scalability isn’t tracking due to approaching V<sub>min</sub> limits<br />
• This fact, combined with the challenges of Amdahl’s Law, will<br />
ultimately place practical limits on the number of cores in CMPs<br />
The demand for increasing performance won’t go away, but<br />
we need to find new ways of easily offering that performance<br />
• Programmer productivity is a first order consideration<br />
• Deployment of significantly more power-effective HW structures<br />
• Lots of interesting system architecture innovations are possible<br />
The time for Accelerated Computing with<br />
Heterogeneous HW is here!<br />
Thank You !<br />
Questions?<br />
©2008. Advanced Micro Devices, Inc. All rights reserved. <strong>AMD</strong>, the<br />
<strong>AMD</strong> Arrow logo, <strong>AMD</strong> Opteron, and combinations thereof, are<br />
trademarks of Advanced Micro Devices, Inc.<br />
Other names are for informational purposes only and may be<br />
trademarks of their respective owners.<br />