“Single-chip Cloud Computer”, an IA Tera-scale Research Processor


Jim Held

Intel Fellow & Director

Tera-scale Computing Research

Intel Labs

August 31, 2010

www.intel.com/info/scc


Agenda

• Tera-scale Research

• SCC Architecture

• Software environment

• Co-travelers Program

• Summary



Performance Scaling Challenges

• Energy Efficiency

• Design Complexity

• Programming Models

• Emerging Applications



Tera-scale Research

Applications – Identify, characterize & optimize

Programming – Empower the mainstream

System Software – Scalable services

Memory Hierarchy – Feed the compute engine

Interconnects – High bandwidth, low latency

Cores – Power-efficient general & special-function cores

www.intel.com/go/terascale



Teraflops Research Processor

[Die plot: 21.72mm × 12.64mm die with PLL, TAP and I/O areas marked; a single tile measures 2.0mm × 1.5mm]

Technology: 65nm, 1 poly, 8 metal (Cu)

Transistors: 100 million (full chip), 1.2 million (tile)

Die area: 275 mm² (full chip), 3 mm² (tile)

C4 bumps: 8,390

Goals:

• Deliver Tera-scale performance

– Single precision TFLOP at desktop power

– Frequency target 5GHz

– Bisection bandwidth on the order of terabits/s

– Link bandwidth in hundreds of GB/s

• Prototype two key technologies

– On-die interconnect fabric

– 3D stacked memory

• Develop a scalable design methodology

– Tiled design approach

– Mesochronous clocking

– Power-aware capability



Within-Die Variation-Aware DVFS and Scheduling

• Maximum per-core frequency variation: 28% at 1.2V, 62% at 0.8V

• No correlation die to die – individual characterization required

• Improved performance or energy efficiency with:

– Multiple frequency islands

– Dynamic scheduling of processing to cores (see the sketch below)
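To make the scheduling idea concrete, the following is an illustrative sketch (not from the deck) of variation-aware core mapping: assuming each core's maximum frequency at the operating voltage has already been measured and recorded for this particular die, threads are simply assigned to the fastest cores first. The structure and helper names are hypothetical.

/*
 * Variation-aware core mapping: an illustrative sketch, not SCC/Teraflops code.
 * Assumes each core's measured Fmax at the operating voltage is available
 * per die; threads are then mapped to the fastest cores of that die first.
 */
#include <stdlib.h>

#define NUM_CORES 80            /* 80-core Teraflops research processor */

typedef struct {
    int core_id;
    int fmax_mhz;               /* measured Fmax for this core, in MHz  */
} core_info;

/* Sort descending by measured Fmax. */
static int by_fmax_desc(const void *a, const void *b)
{
    return ((const core_info *)b)->fmax_mhz
         - ((const core_info *)a)->fmax_mhz;
}

/* Fill map[t] with the core that thread t should run on. */
void map_threads_to_cores(core_info cores[NUM_CORES],
                          int map[], int nthreads)
{
    qsort(cores, NUM_CORES, sizeof cores[0], by_fmax_desc);
    for (int t = 0; t < nthreads && t < NUM_CORES; t++)
        map[t] = cores[t].core_id;   /* fastest cores are used first */
}

Because per-core Fmax does not correlate die to die, such a table would have to be built per chip, which is exactly the individual characterization the slide calls out.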


Dighe, S., et al., "Within-Die Variation-Aware Dynamic Voltage-Frequency Scaling, Core Mapping and Thread Hopping for an 80-Core Processor," in Proceedings of ISSCC 2010 (IEEE International Solid-State Circuits Conference), Feb. 2010.


Cloud Computing Today

Cloud datacenters:

– 1000s of networked computers

– Millions of threads & petabytes of data

Opportunity:

– Lower power, higher density via integration

– Greater efficiency and better programmability

[Diagram – Example: Intel's Open Cirrus testbed at Intel Labs Pittsburgh: clusters of servers interconnected by 1 Gb/s links (×4, ×8 and ×15 point-to-point aggregates) behind a 45 Mb/s T3 connection to the Internet]



Motivations for SCC

• Many-core processor research

–High-performance power-efficient fabric

–Fine-grain power management

–Message-based programming support

• Parallel Programming research

–Better support for scale-out server model

– Operating system, communication architecture

–Scale-out programming model for client

– Programming languages, runtimes

An experimental processor, not a product!



Single-chip Cloud Computer Experimental Processor

[Die plot: 26.5mm × 21.4mm die; a single tile measures 5.2mm × 3.6mm and contains Core 0 with L2$0, Core 1 with L2$1, a router and the message passing buffer (MPB); the die also includes four DDR3 memory controllers, PLL, JTAG, the voltage regulator controller (VRC) and the system interface + I/O]

Technology: 45nm Hi-K CMOS

Interconnect: 9 metal (Cu)

Transistors: 1.3B (die), 48M (tile)

Tile area: 18.7 mm²

Die area: 567.1 mm²



Architectural Overview

• 2nd generation Intel Labs experimental processor – an IA-based software research vehicle

• "Cluster-on-die" architecture

– 48 Pentium processor cores (P54C – x87 FP only)

[Diagram: 6×4 array of tiles connected by routers (R) in a 2D mesh, flanked by four memory controllers and a system interface; each tile contains Core 0 with L2$0, Core 1 with L2$1, a router and a message passing buffer (MPB)]


Howard, J., et al., "A 48-Core IA-32 Message-Passing Processor with DVFS in 45nm CMOS," in Proceedings of ISSCC 2010 (IEEE International Solid-State Circuits Conference), Feb. 2010.



On-die Interconnect

• Architecture

–6x4 2D Mesh NOC

–16B wide data links + 2B sideband

–8 Virtual Channels in 2 classes

–Fixed (X-Y) routing (see the routing sketch below)

• Performance

–Target freq: 2GHz @ 1.1V

–Link Bandwidth 64GB/s

–4 cycle latency

• Power Management

–Independent Frequency & Voltage control

–Sleep mode, clock gating, low power RF

[Plot: router and core frequency vs. supply voltage (0.5–1.4V, 50°C); measured points include 60MHz @ 0.55V, 300MHz @ 0.73V, 0.9GHz and 1.4GHz @ 0.94V, 1.3GHz @ 1.32V and 2.6GHz @ 1.34V]
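The mesh uses fixed dimension-ordered (X-Y) routing: a flit first travels along the X dimension to the destination column, then along Y to the destination row. The routine below is an illustrative C sketch (not the router RTL) that enumerates the hops on the 6×4 mesh; the 64GB/s link figure is consistent with 16B-wide links clocked at the 2GHz mesh target, counting both directions.

/*
 * Fixed dimension-ordered (X-Y) routing on a 6x4 mesh: move along X
 * to the destination column, then along Y to the destination row.
 * Illustrative sketch only.
 */
#include <stdio.h>

#define MESH_X 6
#define MESH_Y 4

/* Print the router coordinates visited between two tile indices. */
void route_xy(int src_tile, int dst_tile)
{
    int x  = src_tile % MESH_X, y  = src_tile / MESH_X;
    int dx = dst_tile % MESH_X, dy = dst_tile / MESH_X;

    printf("(%d,%d)", x, y);
    while (x != dx) {                    /* X dimension first */
        x += (dx > x) ? 1 : -1;
        printf(" -> (%d,%d)", x, y);
    }
    while (y != dy) {                    /* then Y dimension  */
        y += (dy > y) ? 1 : -1;
        printf(" -> (%d,%d)", x, y);
    }
    printf("\n");
}

int main(void)
{
    route_xy(0, 23);   /* corner to corner: 5 X hops, then 3 Y hops */
    return 0;
}

With a 4-cycle router latency, the hop count produced by such a route directly bounds the zero-load latency between any two tiles.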



Memory Architecture

• Memory

– Up to 64GB DDR3 via 4 memory controllers @ 21.3GB/s

– 16KB SRAM in each tile as Message Passing Buffer (MPB)

• Caching

– 32KB L1 per core (16KB I + 16KB D), 12MB total L2 cache (256KB/core)

– No HW cache-coherent shared memory

• Addressing

– Core physical to system physical addresses in 16MB sections

– Memory mapped configuration & control registers

[Diagram: each core's physical address space is translated to the system physical address space through a per-core physical-to-physical mapping]
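Because the mapping works in 16MB sections, a core's 4GB physical space divides into 256 segments, each redirected by a per-core lookup table. The C sketch below is a simplified, assumed model of that translation; real SCC LUT entries also carry destination/routing information that is omitted here.

/*
 * Core-physical to system-physical translation, conceptually: 256
 * segments of 16MB, each remapped by a per-core lookup table into the
 * system address space (DDR3 behind one of the four MCs, MPBs, or
 * configuration registers). Simplified sketch only.
 */
#include <stdint.h>

#define SEGMENT_BITS 24                 /* 16MB = 2^24 bytes */
#define NUM_SEGMENTS 256                /* 4GB / 16MB        */

typedef struct {
    uint64_t system_base;               /* system-physical base address */
} lut_entry;

static lut_entry lut[NUM_SEGMENTS];     /* one such table per core */

uint64_t core_to_system(uint32_t core_phys)
{
    uint32_t segment = core_phys >> SEGMENT_BITS;           /* top 8 bits */
    uint32_t offset  = core_phys & ((1u << SEGMENT_BITS) - 1);
    return lut[segment].system_base + offset;
}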



Power Management

• Configurable MC, Mesh, SIF Voltage & Frequency

• Software-controlled DVFS* of cores

– Fine-grain voltage control at 4 tile cluster level (6.25mV)

– Frequency control at tile level (16bit divider)

– Closed loop - thermal sensors per tile, current through BMC

[Diagram: 6×4 tile array with routers, four memory controllers and the system interface; voltage settings V0..Vn apply to 4-tile clusters, while each tile has its own frequency setting F0..Fn]

DVFS gives a wide operating range:

125W @ 1.14V, 1GHz

25W @ 0.7V, 125MHz


*Dynamic voltage and frequency scaling
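As a rough illustration of the closed loop described above, the sketch below (an assumption, not an SCC API) throttles a tile's frequency divider when its thermal sensor reads hot, and lowers the shared domain voltage only once every tile in the 4-tile domain has been slowed. The accessor functions are hypothetical stubs standing in for the memory-mapped configuration registers and the BMC path.

/*
 * Conceptual closed-loop power management for one SCC voltage domain
 * (a cluster of 4 tiles sharing a supply, each tile with its own clock
 * divider). The accessors are hypothetical stubs, not a real SCC API.
 */
#include <stdbool.h>

#define TILES_PER_DOMAIN 4
#define TEMP_LIMIT_C     80

static int  read_tile_temperature(int tile)        { (void)tile; return 75; }
static void set_tile_freq_divider(int tile, int d) { (void)tile; (void)d;   }
static void set_domain_voltage_mv(int dom, int mv) { (void)dom;  (void)mv;  }

/* Slow down hot tiles; lower the shared voltage only when every tile in
 * the domain is running at the reduced frequency. */
void manage_domain(int domain, int first_tile)
{
    bool all_throttled = true;

    for (int t = 0; t < TILES_PER_DOMAIN; t++) {
        int tile = first_tile + t;
        if (read_tile_temperature(tile) > TEMP_LIMIT_C)
            set_tile_freq_divider(tile, 4);   /* e.g. run at mesh_clk/4 */
        else
            all_throttled = false;            /* this tile stays fast   */
    }

    /* The voltage is common to the whole 4-tile cluster, so it can only
     * drop once every tile tolerates the lower operating point. */
    if (all_throttled)
        set_domain_voltage_mv(domain, 800);   /* illustrative setpoint  */
}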


Measured full chip power



Power Breakdown

Full power breakdown – total 125.3W (cores 1GHz, mesh 2GHz, 1.14V, 50°C):

– Cores: 87.7W (69%)

– MC & DDR3-800: 23.6W (19%)

– Routers & 2D mesh: 12.1W (10%)

– Global clocking: 1.9W (2%)

Low power breakdown – total 24.7W (cores 125MHz, mesh 250MHz, 0.7V, 50°C):

– MC & DDR3-800: 17.2W (69%)

– Cores: 5.1W (21%)

– Routers & 2D mesh: 1.2W (5%)

– Global clocking: 1.2W (5%)



Rocky Lake – SCC platform

• Replacement for evaluation board

– 100 boards with more I/O, more robust, less expensive

– BIOS/Firmware in definition



SCC “Chipset”

• System Interface FPGA

–Connects to SCC Mesh interconnect

–IO capabilities like PCIe, Ethernet & SATA

–Bitstream is part of sccKit distribution

• Board Management Controller (BMC)

–JTAG interface for clocking, power, etc.

–USB stick with FPGA bitstream

–Network interface for user interaction via Telnet

–Status monitoring

–Firmware is part of sccKit distribution



Software Environment

• SCC Software

– Bare Metal

– Customized Linux

– RCCE communication & power management API

– Tools

– Selected Intel tools (e.g., icc, ifort, ...)

– Microsoft Research release of SCC extensions to Visual Studio

• Management Console PC Software

– PCIe driver with integrated TCP/IP driver

– Programming API for communication with SCC platform

– GUI for interaction with SCC platform

– Command line tools for interaction with SCC platform



RCCE Communication API

• A compact, lightweight communication environment.

– SCC and RCCE were designed together side by side:

– … a true HW/SW co-design project.

• A research vehicle to understand how message-passing APIs map onto many-core chips.

• For experienced parallel programmers willing to work close to the hardware.

• Static SPMD Execution Model:

– Identical UEs created together when a program starts (this is a standard approach familiar to message-passing programmers)

UE: Unit of Execution … a software entity that advances a program counter (e.g., a process or thread).
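To give a flavor of the execution model, here is a small SPMD example written against the basic RCCE interface (RCCE_init, RCCE_ue, RCCE_num_ues, RCCE_send, RCCE_recv, RCCE_finalize). Treat it as a hedged sketch rather than SCC reference code: the same binary runs on every core and a token is passed around a ring of UEs; consult the RCCE specification for the exact signatures and semantics.

/*
 * Minimal SPMD RCCE sketch: a token travels around a ring of UEs through
 * the on-die message passing buffers. Assumes at least two UEs; based on
 * the basic RCCE interface (see the RCCE specification for details).
 */
#include <stdio.h>
#include "RCCE.h"

int RCCE_APP(int argc, char **argv)
{
    RCCE_init(&argc, &argv);

    int me  = RCCE_ue();        /* rank of this unit of execution */
    int num = RCCE_num_ues();   /* UEs created together at start  */
    int token = 0;

    if (me == 0) {
        token = 1;
        RCCE_send((char *)&token, sizeof token, 1);
        RCCE_recv((char *)&token, sizeof token, num - 1);
        printf("token returned to UE 0 with value %d\n", token);
    } else {
        RCCE_recv((char *)&token, sizeof token, me - 1);
        token++;                                   /* each UE increments */
        RCCE_send((char *)&token, sizeof token, (me + 1) % num);
    }

    RCCE_finalize();
    return 0;
}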



SCC Disclosure Demos

• Financial Analytics with shared virtual memory

• Microsoft Visual Studio

• Advanced Power Management

• JavaScript Physics Modeling

• HPC Parallel Workloads

• Hadoop Web Search



SCC Co-Travelers Program

• Currently building SCC software research community

– 100 systems total, with 40 in Oregon Datacenter

– Research partners for 2010 have been selected

• SCC community website available today

– communities.intel.com/community/marc

– To share ideas, HowTo’s, code, tools



Summary

• SCC provides a unique experimental platform for many-core research

–Better support for "Cloud" data center servers

–Scale-out programming model for client

• We are sharing SCC with selected researchers in academia and industry

–Documentation and presentations

http://www.intel.com/info/scc

http://communities.intel.com/community/marc



SCC Team

Jason Howard, Saurabh Dighe, Yatin Hoskote, Sriram Vangal,

David Finan, Gregory Ruhl, David Jenkins, Howard Wilson,

Nitin Borkar, Gerhard Schrom, Fabrice Pailet, Shailendra Jain,

Tiju Jacob, Satish Yada, Sraven Marella, Praveen Salihundam,

Vasantha Erraguntla, Michael Konow, Michael Riepen, Guido

Droege, Joerg Lindemann, Matthias Gries, Thomas Apel,

Kersten Henriss, Tor Lund-Larsen, Sebastian Steibl, Shekhar

Borkar, Vivek De, Rob Van Der Wijngaart, Timothy Mattson




Questions?
