Using Intel's Single-Chip Cloud Computer (SCC)

Tim Mattson
Microprocessor and Programming Lab
Intel Corp.

With content gratefully "borrowed" from: Rob van der Wijngaart, Ted Kubaska, Michael Riepen, Ernie Chan


Disclosure

• The views expressed in this talk are those of the speaker and not his employer.
• I am in a research group and know almost nothing about Intel products, hence anything I say about them is highly suspect.
• This was a team effort, but if I say anything really stupid, it's all my fault … don't blame my collaborators.


The many-core design challenge

• Scalable architecture:
  – How should we connect the cores so we can scale as far as we need (O(100's to 1000) should be enough)?
• Software:
  – Can "general purpose programmers" write software that takes advantage of the cores?
  – Will ISVs actually write scalable software?
• Manufacturability:
  – Validation costs grow steeply as the number of transistors grows. Can we use tiled architectures to address this problem?
  – For an N-transistor budget, validate a tile (M transistors/tile) and the connections between tiles. This drops validation costs from K·O(N) to K'·O(M) (warning: K and K' can be very large).

Intel's "TeraScale" processor research program is addressing these questions with a series of test chips … two so far: the 80-core research processor and the 48-core SCC processor.


Agenda

• The SCC Processor
• Using the SCC system
• SCC Address spaces


Hardware view of SCC

• 48 P54C cores, 6x4 mesh, 2 cores per tile
• 45 nm, 1.3 B transistors, 25 to 125 W
• 16 to 64 GB DRAM using 4 DDR3 memory controllers (MC)

[Die diagram: a 6x4 mesh of tiles, each attached to a router (R); four memory controllers sit on the edges of the mesh, plus a connection to PCI. Each tile contains two P54C cores (each with a 16KB L1-D$, a 16KB L1-I$, and a 256KB unified L2$), a 16 KB message passing buffer, and a mesh interface (I/F) to the router.]

R = router, MC = memory controller, P54C = second-generation Pentium® core, CC = cache controller.


On-Die Network

• 2D 6 by 4 mesh of tiles
• Fixed X-Y routing
• Carries all traffic (memory, I/O, message passing)
• Router: low-power 5-port crossbar; 2 message classes, 8 virtual channels

  Bisection bandwidth   2 TB/s at 2 GHz and 1.1 V
  Latency               4 cycles
  Link width            16 bytes
  Bandwidth             64 GB/s per link
  Architecture          8 VCs over 2 message classes
  Power consumption     500 mW @ 50°C
  Routing               Pre-computed, X-Y


Power and memory domains

• Power ~ F V^2
• Power control domains:
  – 7 voltage domains (six 4-tile blocks plus one for the on-die network)
  – 24 tile clock-frequency dividers
  – One voltage control register … so only one voltage change in flight at a time.

[Diagram of the SCC package: the 6x4 tile/router mesh with its four memory controllers (MC) and the bus to PCI, with the voltage, frequency, and memory domains overlaid on the die.]

*How cores map to MCs is under programmer control … i.e. the memory-controller domain is configurable.
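
As a rough back-of-the-envelope check of the Power ~ F V^2 rule (an illustration only, ignoring leakage and the non-core parts of the chip), compare the two operating points reported in the power-breakdown backup slide: cores at 1 GHz / 1.14 V versus 125 MHz / 0.7 V.

  \[ \frac{P_{low}}{P_{high}} \approx \frac{F_{low}}{F_{high}}\left(\frac{V_{low}}{V_{high}}\right)^{2} = \frac{0.125}{1.0}\left(\frac{0.70}{1.14}\right)^{2} \approx 0.05 \]

This is in the same ballpark as the measured drop in core power on that slide, from 87.7 W to 5.1 W (a factor of about 0.06).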


SCC system overview

[Board diagram: the SCC die (the 6x4 tile/router mesh with four MCs, each attached to DIMMs, plus a PLL) connects through the mesh to a System Interface FPGA, which connects over PCIe to a Management Console PC; a JTAG bus links the die to the Board Management Controller, with the JTAG interface used for clocking, power, etc.]

• System Interface FPGA
  – Connects to the SCC mesh interconnect
  – IO capabilities like PCIe, Ethernet & SATA
• Board Management Controller (BMC)
  – Network interface for user interaction via Telnet
  – Status monitoring

Questions?

Third party names are the property of their owners.


The latest FPGA features (release 1.4)

• 1-4 x 1Gb Ethernet ports
• Interface from BMC to FPGA
• Global interrupt controller
• Additional system registers
• Global timestamp counter
• Linux private memory size
• SCC main clock frequency
• Network configuration (IP addresses, gateway, MAC address)
• Linux frame buffer resolution and color depth for virtual display
• Atomic counters


SCC Platforms

• Three Intel-provided platforms for SCC and RCCE*:
  – Functional emulator (on top of OpenMP), running on a PC or server with Windows or Linux
  – The SCC board with two "OS flavors" … Linux or Baremetal (i.e. no OS). NO OpenMP on the SCC board.

[Stack diagram: in every case apps sit on RCCE with icc, ifort, and MKL underneath. The functional emulator layers RCCE on RCCE_EMU and OpenMP on the host; on the SCC board, RCCE runs on either Baremetal C or SCC Linux with its driver.]

*RCCE: native message passing library for SCC
Third party names are the property of their owners.


Networking: TCP/IP

• We support the standard "Internet protocol stack":

  Application   network applications (ssh, http, MPI, NFS …)
  Transport     host-host data transfer (TCP, UDP)
  Network       routing of datagrams from source to destination (IP)
  Link          data transfer between peers (rckmb)
  Physical      move packets over the physical network (on-die mesh)

• Rckmb … our implementation of the link layer
  – Local write / remote read data transfer
  – Static mapping of IP address to SCC core number
  – Uses the IP packet to determine the destination
• Seamless integration with Linux networking utilities
• Rckmb is based on NAPI
  – Interrupt shared between peer nodes
  – Provides for operation in polling mode

Source: Light-weight Communications on Intel's Single-Chip-Cloud Computer Processor, Rob F. van der Wijngaart, Timothy G. Mattson, Werner Haas, Operating Systems Review, ACM, vol. 45, no. 1, pp. 73-83, January 2011.
Third party names are the property of their owners.


Agenda

• The SCC Processor
• Using the SCC system
• SCC Address spaces


sccKit: core software for SCC

• SCC lacks a PC BIOS ROM and the typical bootstrap process
  – We use a bootable image preloaded into memory by sccKit.
• sccKit includes:
  – Platform support
    – System Interface FPGA bitstream
    – Board Management Controller (BMC) firmware
  – SCC system software
    – Customized Linux to run on each core: kernel 2.6.16 with Busybox 1.15.1 and on-die TCP/IP drivers
    – bareMetalC to run on SCC without an OS
  – Management Console PC software
    – PCIe driver with integrated TCP/IP driver
    – Programming API for communication with the SCC platform
    – GUI for interaction with the SCC platform
    – Command line tools for interaction with the SCC platform

Questions?

Third party names are the property of their owners.


Creating Management Console PC Apps

• Written in C++ making use of the Nokia Qt cross-platform application and UI framework.
• Low-level API (sccApi) with access to the SCC and the board management controller via PCIe.
• The code of sccGui as well as the command line tools is available as a code example. These tools use and extend the low-level API.


sccGui

• Read and write system memory and registers.
• Boot OS or other workloads (e.g. bareMetalC).
• Open SSH connections to booted Linux cores.
• Performance meter.
• Initialize the platform via the Board Management Controller.
• Open a console connection to nodes.
• Debug support … query memory addresses, read/write registers.


Command Line Interface to sccKit

sccBoot:
  – sccBoot pulls reset, loads the bootable Linux images into memory, sets up configuration registers, and releases reset; the cores then execute the boot sequence (see section 16 of the Pentium P54C manual).
  – Example: to boot Linux on all cores

      > sccBoot -l

  – Also used to "boot" generic workloads (e.g. bareMetalC applications)

sccReset:
  – Reset selected SCC cores. Example: to put all cores into reset

      > sccReset -g

sccBmc:
  – Board Management Controller. Example: to put SCC into a known state selected from a list of options:

      > sccBmc -i

Use any of these commands with -h to see usage notes.


Setting up your account for SCC

• Add these lines to your .bashrc file:

  # setup for sccKit and enable SCC cross compilers
  export PATH=.:/opt/sccKit/current/bin:$PATH
  export LD_LIBRARY_PATH=.:/opt/sccKit/lib:$LD_LIBRARY_PATH
  source /opt/compilerSetupFiles/crosscompile.sh

  # setup for RCCE ... point these at the directory where you installed RCCE
  export MANPATH="/home/tmattson/rcce/man:${MANPATH}"
  export PATH=.:/home/tmattson/rcce:$PATH

Note: while any Linux shell works, we test all our scripts with bash … so it's simplest if you use bash. If your system is configured with something else (such as csh), I manually switch to bash as soon as I log in.


Setting up your account for SCC

• If you have successfully set up the compilers and sccKit, you should see:

  >% which icc
  /opt/icc-8.1.038/bin/icc
  >% which gcc
  /opt/i386-unknown-linux-gnu/bin/gcc
  >% which sccBmc
  /opt/sccKit/current/bin/sccBmc


Check out the latest RCCE from SVN

• You want to check out the RCCE release in a "read only" user mode. That way you don't get all the tags and settings for checking things into RCCE.

  >% svn export http://marcbug.scc-dc.com/svn/repository/trunk/rcce/

• While you're at it, you might as well grab MPI as well.

  >% svn export http://marcbug.scc-dc.com/svn/repository/trunk/rckmpi/


Build RCCE

• Go to the root of the RCCE directory tree and set up the build environment:

  >% ./configure SCC_LINUX

• Then build RCCE by typing, from the root of the RCCE directory tree:

  >% ./makeall

• To build with other options (such as shared memory), edit the file common/symbols.in


Test by building an app (pingpong)

• Go to rcce_root/apps/PINGPONG and type

  >% make pingpong

• Copy the executable to /shared (note: /shared is NFS-mounted and visible to all cores)

  >% cp pingpong /shared/tmattson

• Create or copy an rc.hosts file … one entry per line with core numbers (00 to 47)
• Run the job on the SCC (a sketch of what such an app looks like inside follows below)

  >% rccerun -nue 2 -f rc.hosts pingpong
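
For orientation only, here is a minimal sketch of what a ping-pong style RCCE program looks like. This is not the PINGPONG app from the repository; it assumes RCCE's basic send/recv interface (RCCE_send, RCCE_recv, RCCE_wtime), and exact signatures may differ between RCCE releases.

  /* minimal RCCE ping-pong sketch (illustrative, not the shipped app) */
  #include <stdio.h>
  #include "RCCE.h"

  #define NROUNDS 10
  #define MSGSIZE 32                      /* one 32-byte cache line */

  int RCCE_APP(int argc, char **argv) {   /* RCCE programs use RCCE_APP as entry point */
      int i, me;
      char buf[MSGSIZE];
      double t0;

      RCCE_init(&argc, &argv);
      me = RCCE_ue();                     /* my unit of execution (core) id */

      t0 = RCCE_wtime();
      for (i = 0; i < NROUNDS; i++) {
          if (me == 0) {                  /* bounce one line: core 0 -> 1 -> 0 */
              RCCE_send(buf, MSGSIZE, 1);
              RCCE_recv(buf, MSGSIZE, 1);
          } else if (me == 1) {
              RCCE_recv(buf, MSGSIZE, 0);
              RCCE_send(buf, MSGSIZE, 0);
          }
      }
      if (me == 0)
          printf("average roundtrip: %e s\n", (RCCE_wtime() - t0) / NROUNDS);

      RCCE_finalize();
      return 0;
  }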


A closer look at rccerun

• rccerun is analogous to mpirun … it sets up the system and submits the RCCE executable.

  tmattson@boathouse:/shared/tmattson$ rccerun -nue 2 -f rc.hosts pingpong
  pssh -h PSSH_HOST_FILE.29325 -t -1 -p 2 /shared/tmattson/mpb.29325 < /dev/null
  [1] 17:56:39 [SUCCESS] rck00
  [2] 17:56:39 [SUCCESS] rck01
  pssh -h PSSH_HOST_FILE.29325 -t -1 -P -p 2 /shared/tmattson/pingpong 2 0.533 00 01 < /dev/null
  2965792 0.173102739
  3526944 0.205475685
  [1] 17:57:17 [SUCCESS] rck00
  [2] 17:57:17 [SUCCESS] rck01

• The first pssh runs a utility (mpb) to put the MPB in a known state; the second runs the executable (pingpong) with the default clock speed (533 MHz).
• The [SUCCESS] lines give the time and exit status for each core; the two numeric lines are the program's standard out.


Managing the SCC system

• To put the system into a known state, we "train" the chip.
• To train the SCC (set up power, memory, etc.) use the command:

  > sccBmc -i

• It will give you a collection of options to choose from:

  Please select from the following possibilities:
  INFO: (0) Tile533_Mesh800_DDR800
  INFO: (1) Tile800_Mesh1600_DDR1066
  INFO: (2) Tile800_Mesh1600_DDR800
  INFO: (3) Tile800_Mesh800_DDR1066
  INFO: (4) Tile800_Mesh800_DDR800
  INFO: (others) Abort!
  Make your selection:

• Note: some people reset first with sccReset -g, but that shouldn't be necessary.


… so to run benchmarks …

• I reset, trained, rebooted, and then used the SCC. I followed this procedure to ensure that the system is in a known state.

  213> sccReset -g
  214> sccBmc -i
  215> sccBoot -l
  216> rccerun -nue 2 -f rc.hosts pingpong
  271> rccerun -nue 48 -f rc.hosts stencil
  272> rccerun -nue 48 -f rc.hosts cshift 8


Agenda

• The SCC Processor
• Using the SCC system
• SCC Address spaces


SCC Address spaces

• 48 x86 cores which use the x86 memory model for private DRAM.

[Memory diagram: each core (CPU_0 … CPU_47) has its own L1$, L2$, and test-and-set (t&s) register on-chip, and private DRAM off-chip. Shared resources are the off-chip shared DRAM (variable size) and the on-chip message passing buffer (8 KB/core). Cache utilization of the shared spaces: the MPB is cached in L1$ as MPBT; the t&s register file is not cached; shared DRAM uses the L2$ non-coherently.]

• To better understand how memory works on the SCC, we will take a closer look at how the SCC native message passing environment (RCCE) is implemented.

t&s = shared test-and-set register


How does RCCE work? Part 1

• Treat the message passing buffer (MPB) as 48 smaller buffers … one per core.
• Symmetric name space … allocate memory as a collective operation. Each core gets a variable with the given name at a fixed offset from the beginning of its MPB.

  A = (double *) RCCE_malloc(size)

  Called on all cores, so any core can put/get A at any Core_ID without error-prone explicit offsets.

• Flags are allocated the same way and used to coordinate memory operations.

[Diagram: the MPB drawn as slots 0, 1, 2, 3, … 47, one per core, each holding the collectively allocated variable at the same offset.]


How does RCCE work? Part 2

• The foundation of RCCE is a one-sided put/get interface.
• Symmetric name space … allocate memory as a collective and put a variable with a given name into each core's MPB.

[Diagram: CPU_0 and CPU_47, each with private DRAM, L2$, L1$, and a t&s register; Put(A,0) moves data from a core's private memory into slot 0 of the MPB, and Get(A,0) reads it back out on another core.]

… and use flags to make the puts and gets "safe" (a code sketch of this pattern follows below).
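
A minimal sketch of the put/get-plus-flags pattern above, assuming RCCE's low-level one-sided interface (RCCE_malloc, RCCE_put/RCCE_get, RCCE_flag_*). The availability and exact signatures of these calls depend on how RCCE was built, so treat this as illustrative rather than as the shipped API.

  /* core 0 pushes one cache line into core 1's MPB and raises a flag;
     core 1 waits on the flag and then copies the line out of its MPB */
  #include <string.h>
  #include "RCCE.h"

  int RCCE_APP(int argc, char **argv) {
      int me;
      char line[32];                      /* one whole 32-byte cache line */
      t_vcharp A;                         /* symmetric MPB allocation */
      RCCE_FLAG ready;

      RCCE_init(&argc, &argv);
      me = RCCE_ue();

      A = RCCE_malloc(32);                /* collective: same offset on every core */
      RCCE_flag_alloc(&ready);

      if (me == 0) {
          memset(line, 42, sizeof(line));
          RCCE_put(A, (t_vcharp)line, sizeof(line), 1);   /* into core 1's MPB slot */
          RCCE_flag_write(&ready, RCCE_FLAG_SET, 1);      /* tell core 1 the data is there */
      } else if (me == 1) {
          RCCE_wait_until(ready, RCCE_FLAG_SET);          /* spin until core 0 raises the flag */
          RCCE_get((t_vcharp)line, A, sizeof(line), 1);   /* copy out of my own MPB */
      }

      RCCE_finalize();
      return 0;
  }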


How does RCCE work? Part 3

Message passing buffer memory is special … it is of type MPBT:
• Cached in L1, L2 bypassed.
• Not coherent between cores.
• Cache line allocated on read, not on write.
• A single-cycle op invalidates all MPBT lines in L1 … note this is an invalidate, not a flush.

Consequences of the MPBT properties for RCCE:
• If data is changed by another core and an image is still in L1, a read returns stale data.
  – Solution: invalidate before read.
• L1 has a write-combining buffer; write an incomplete line? Expect trouble!
  – Solution: don't. Always push whole cache lines.
• If an image of the line to be written is already in L1, the write will not go to memory.
  – Solution: invalidate before write.

We discourage user operations on data in the MPB. Use it only as a data movement area managed by RCCE … invalidate early, invalidate often (the sketch below illustrates the discipline).
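
The discipline above can be summarized in two tiny helpers. This is a hedged sketch, not RCCE internals: mpb_invalidate() is a hypothetical stand-in for the single-cycle "invalidate all MPBT lines in L1" operation that RCCE issues internally, and data is assumed to move in whole 32-byte lines.

  #include <string.h>

  #define MPB_LINE 32                      /* P54C cache line size */

  static void mpb_invalidate(void) {
      /* On the SCC this would issue the MPBT-invalidate operation;
         a no-op placeholder here so the sketch stays self-contained. */
  }

  /* Read one line that another core may have changed in my MPB slot. */
  static void mpb_read_line(void *dst, const volatile void *mpb_src) {
      mpb_invalidate();                    /* drop any stale L1 image first */
      memcpy(dst, (const void *)mpb_src, MPB_LINE);
  }

  /* Write one whole line into an MPB slot (never a partial line). */
  static void mpb_write_line(volatile void *mpb_dst, const void *src) {
      mpb_invalidate();                    /* so the write is not absorbed by a cached image */
      memcpy((void *)mpb_dst, src, MPB_LINE);
  }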


Shared Memory (DRAM) on SCC today

• By default, each core sees 64 MB of shared memory.
  – Four 16 MB chunks, each on a separate MC.
• The available shared memory area in DRAM is fragmented. It is also used by:
  – On-die & host Ethernet drivers
  – Memory-mapped IO consoles
  – Other system software
• No allocation scheme … different applications may collide!


SCC and the shared DRAM: three options

• Non-cacheable:
  – Loads and stores ignore the L2 cache … they go directly to DRAM.
  – Pro: easy for the programmer … no software-managed cache coherence.
  – Con: performance is poor.
• Cacheable:
  – Loads and stores interact with the cache "normally" … 4-way set associative with a pseudo-LRU replacement policy. Write-back only … write-allocate is not available.
  – Pro: better performance.
  – Con: the programmer manages cache coherence … requires a cache flush routine.
• MPBT:
  – Loads and stores bypass L2 but go to L1. Works exactly the same as the MPBT data for the on-chip message passing buffer.
  – Pro: no need for complicated (and expensive) cache flushes.
  – Con: misses the locality benefits of the large L2 cache.

In each case, however, only 64 MB of fragmented memory is available. That is not enough!!!


Core Memory Management: a more detailed look

• Each core has an address Look-Up Table (LUT)
  – Provides address translation and routing information
  – Organized into 16 MB segments (see the sketch below for how an address splits into a slot and an offset)
• Shared DRAM among all cores … through each memory controller (MC0 to MC3) … default size: 4*16 = 64 MB
• LUT boundaries are dynamically programmed
• Core cache coherency is restricted to the private memory space
• Maintaining cache coherency for the shared memory space is under software control

[CORE0 LUT example: 256 MB of private space maps to MC0-MC3; 1 GB of shared space maps to the MCs and the MPBs; other entries map to the VRCs, the PCI hierarchy, APIC/boot, FPGA registers, and the LUTs themselves.]

MC# = one of the 4 memory controllers, MPB = message passing buffer, VRCs = voltage regulator control
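
Since the LUT is organized into 16 MB segments, a core's 32-bit physical address splits into a LUT slot number (the top 8 bits, 256 slots) and an offset within that slot. A small illustrative sketch, which models only this address split and not the contents of a LUT entry (route, destination, address extension, etc.):

  #include <stdint.h>
  #include <stdio.h>

  int main(void) {
      uint32_t addr   = 0x83ABCDEFu;          /* an address in shared slot 0x83 */
      uint32_t slot   = addr >> 24;           /* 256 LUT slots of 16 MB each */
      uint32_t offset = addr & 0x00FFFFFFu;   /* offset within the 16 MB slot */
      printf("LUT slot 0x%02x, offset 0x%06x\n", slot, offset);
      return 0;
  }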


Moving beyond the 64 MB limit

• The default configuration (created by sccMerge) assigns the low LUT slots to the OS.
• The default Linux image has only 64 MB of shared DRAM in slots 0x83 to 0x84 … parts of which are used by the system.

[LUT diagram: slots 0x00 through 0x28 hold SCC Linux (with boundaries marked at 0x13, 0x14, and 0x1a); slots 0x80-0x84 are the shared region; 0xbf/0xc0 and above map the MPB.]


Moving beyond the 64 MB limit

• SCC Linux needs 320 MB. At 16 MB per LUT slot, this is 20 slots (0x00 to 0x13).
• Slots 0x00 to 0x13 are needed; the slots from 0x14 onward are not.
• We can hijack the memory pointed to by slots 0x1a through 0x28 (15 slots) … and use it as additional shared memory in our applications.

[LUT diagram as before, with the needed (0x00-0x13) and not-needed (0x14 onward) Linux slots marked.]


Moving beyond the 64 MB limit

• SCC Linux needs 320 MB. At 16 MB per LUT slot, this is 20 slots (0x00 to 0x13); the slots from 0x14 onward are not needed.
• Hijack the memory pointed to by slots 0x1a through 0x28 (15 slots) … and have slots 0x84 through 0xbf point to that memory.
• When we add slots to shared memory, we add them in groups of 4. With these 60 slots we can take 15 addresses from SCC Linux for each memory controller.

[LUT diagram as before, with the 60 shared slots 0x84 through 0xbf highlighted.]

Source: Ted Kubaska


How to use the shared DRAM

• Two devices for shared DRAM:
  – /dev/rckncm – Linux device exposing "non-cacheable memory"
  – /dev/rckdcm – Linux device exposing "definitely cacheable memory"
• Access to shared memory is through RCCE:
  – The appropriate device is opened inside RCCE_init().
  – SHMalloc() in SCC_API.c performs the actual mmap() on the file descriptor of the opened device.
  – Set all this up by building RCCE with SHMADD_CACHEABLE.
• If you work with cacheable shared memory, you need to manage cache coherence explicitly.
  – In other words … you need to understand when the cache needs to be flushed and explicitly insert flushes as needed.


Cacheable shared memory and flush

• The actual flush routine is part of the /dev/rckdcm driver, in the file rckmem.c in linuxkernel/linux-2.6.16-mcemu/drivers/char.
• RCCE uses this to flush the entire L2 cache for a core: RCCE_DCMflush()
• For example, invoke it as

  if (iam == receiver) RCCE_DCMflush();

• Inside RCCE_DCMflush() the driver routine DCMflush() is called as:

  write(DCMDeviceFD, 0, 65536);

  where DCMDeviceFD is the /dev/rckdcm file descriptor and 65536 (64 KB) is the size of an L2 way.
• It is possible to flush only a portion of the L2 cache, but this hasn't been implemented yet in RCCE. For example, to flush the values in a structure called XY, you would specify the write() as

  write(DCMDeviceFD, &XY, sizeof(XY));


Shared memory API: Example

  #include "RCCE.h"
  #define BSIZ 1024*64
  …
  volatile int *buffer;

  RCCE_init(&argc, &argv);
  iam  = RCCE_ue();
  size = bufsize*sizeof(int);

  buffer = (int *) RCCE_shmalloc(size);
  RCCE_barrier(&RCCE_COMM_WORLD);

  if (iam == sender) {
      fill_buffer(buffer);
      RCCE_DCMflush();
  }
  RCCE_barrier(&RCCE_COMM_WORLD);

  if (iam == receiver) {
      RCCE_DCMflush();
      use_buffer(buffer);
  }
  RCCE_barrier(&RCCE_COMM_WORLD);

  RCCE_shfree((t_vcharp)buffer);
  RCCE_finalize();


Off-Chip Shared Memory in RCCE

• SuperMatrix:
  – Map dense matrix computation to a directed acyclic graph (DAG)
  – Store the DAG and matrix in off-chip shared DRAM

[Plot: Cholesky factorization on a single node, used to isolate the cost of cache coherence.]

Source: Ernie Chan, UT Austin, MARC symposium, Nov. 2010


Shared memory ... what we want

• Shared DRAM works, and with hijacking we get a memory footprint large enough to be interesting.
• But it's not safe (system and apps can collide), and it's still of limited size (~960 MB).
• We want something better ... shared memory with:
  – No reserved areas ... everyone schedules memory through a single resource
  – No fragmentation
  – Much larger sizes for the shared memory
  – A proper allocation scheme
  – Coexistence with the legacy SHM section
  – A convenient usage model (no manual LUT changes, etc.)

Sounds better? Okay, but how do we achieve it?


The answer ...

• Each core has several hundred MB of private memory available (~640 MB on the latest systems).
• Why not allocate some private memory (using regular Linux mechanisms) and make it public for the other cores?
• Hide the implementation details in a library to make usage convenient ... available for Linux as well as bareMetalC.

=> Privately owned public shared memory (POPSHM)


POPSHM: Flexible Shared Memory

• Linux kernel patch
  – Allocates contiguous 16 MB chunks from the private memory area
  – 16 MB aligned (one LUT entry each)
  – Pins them
  – Publishes them to be used collectively (total allocated, first LUT entry)
• User-space library
  – Aggregates the available memory, up to 48 * (5 * 16 MB) = 3.75 GB, into a POPSHM address space
  – Provides memcpy and put/get
  – Low-level interface for minimal overhead

[Diagram: each core (Core 0, Core 1, Core 2, …) donates part of its Linux private memory into a POPSHM address space visible to all cores.]

POPSHM is new … still under early evaluation.


Conclusion

• SCC is alive and well:
  – Message passing and shared memory APIs are available.
  – A vigorous community (MARC) is using SCC for significant parallel computing research.
  – Usable through a GUI or a command line interface.
• The future … shared memory on SCC:
  – We understand how to use cacheable vs. non-cacheable shared memory … we even have a working L2 flush!
  – Full characterization of POPSHM is the next step.


Backup

• Performance numbers
• A survey of SCC research
• Additional details


Round-trip latencies, 32-byte message

• Ping-pong between a pair of cores … one core fixed at a corner, the second core varied from "same tile" to the opposite corner.

[Plot: round-trip latency (roughly 5.1e-6 to 5.6e-6 seconds) versus network hops (0 to 8).]

• Data fit to a straight line: T = 528 nanosecs + 30 nanosecs * hops
• RCCE_send/recv … 3 messages per transit, 6 for the roundtrip
• 4 cycles per hop, 800 MHz router … or 2*3*4 cycles / 0.8 GHz = 30 nanosecs per hop.
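
Written out, the fit and the per-hop cost quoted above are (a restatement of the slide's own arithmetic, assuming the 800 MHz router clock):

  \[ t_{hop} = \frac{2 \times 3 \times 4\ \text{cycles}}{0.8\ \text{GHz}} = 30\ \text{ns}, \qquad T(h) \approx 528\ \text{ns} + 30\ \text{ns} \cdot h \]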


Power breakdown

Full power breakdown (total 125.3 W; cores 1 GHz, mesh 2 GHz, 1.14 V, 50°C):
  Cores               69%   (87.7 W)
  MC & DDR3-800       19%   (23.6 W)
  Routers & 2D mesh   10%   (12.1 W)
  Global clocking      2%   (1.9 W)

Low power breakdown (total 24.7 W; cores 125 MHz, mesh 250 MHz, 0.7 V, 50°C):
  MC & DDR3-800       69%   (17.2 W)
  Cores               21%   (5.1 W)
  Routers & 2D mesh    5%   (1.2 W)
  Global clocking      5%   (1.2 W)


Impact of Core Position on Memory Performance

• Stream benchmark mapped to one core and one MC channel
  – The position of the core is varied
  – Report the relative reduction in usable memory bandwidth
• Tile 533 MHz, Router 800 MHz, Memory 800 MHz
  – Up to 13% variation within the quadrant of a memory controller (iMC)

         -13.0%  -15.8%  -19.8%  -23.0%  -25.5%  -28.7%
    iMC   -8.3%  -13.0%  -15.8%  -19.8%  -23.0%  -25.5%  iMC
          -5.3%   -8.3%  -13.0%  -15.8%  -19.8%  -23.0%
    iMC    0.0%   -5.3%   -8.4%  -13.0%  -15.8%  -19.8%  iMC

• Tile 800 MHz, Router 1600 MHz, Memory 800 MHz
  – Up to 8% variation within the quadrant of a memory controller

         -8.0%  -8.9%  -11.6%  -14.9%  -15.6%  -18.0%
    iMC  -4.2%  -8.0%   -8.9%  -11.6%  -14.9%  -15.6%  iMC
         -1.0%  -4.2%   -8.0%   -8.8%  -11.6%  -14.9%
    iMC   0.0%  -1.0%   -4.2%   -8.0%   -8.8%  -11.6%  iMC

Source: Intel, SCC workshop, Germany, March 16, 2010


Linpack and NAS Parallel benchmarks

1. Linpack (HPL): solve a dense system of linear equations
   – Synchronous communication with "MPI wrappers" to simplify porting
2. BT: multipartition decomposition (x-sweep, z-sweep)
   – Each core owns multiple blocks (3 in this case)
   – Update all blocks in a plane of 3x3 blocks
   – Send data to neighbor blocks in the next plane
   – Update the next plane of 3x3 blocks
3. LU: pencil decomposition
   – Define a 2D pipeline process:
   – await data (bottom + left)
   – compute new tile
   – send data (top + right)

[Diagram: the LU pipeline wavefront sweeping diagonally across a 2D grid of tiles.]

Third party names are the property of their owners.


LU/BT NAS Parallel Benchmarks, SCC

• Problem size: Class A, 64 x 64 x 64 grid*
• Using latency-optimized, whole-cache-line flags

[Plot: MFlops (0 to 2000) versus number of cores (0 to 48) for LU and BT.]

* These are not official NAS Parallel Benchmark results.
SCC processor: 500 MHz cores, 1 GHz routers, 25 MHz system interface, and DDR3 memory at 800 MHz.
Third party names are the property of their owners.
Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. Buyers should consult other sources of information to evaluate the performance of systems or components they are considering purchasing. For more information on performance tests and on the performance of Intel products, reference or call (U.S.) 1-800-628-8686 or 1-916-356-3104.


Backup

• Performance numbers
• A survey of SCC research
• Additional details


Many-core Application Research Community

• 123 contracts signed worldwide
• 77 unique institutions
• 39 research partners in the EU
• 28 research partners in the USA
• 10 research partners in other countries, including China, India, South Korea, Brazil, Canada
• 233 MARC website participants


SCC Bare Metal Development Framework: Bandwidth Studies

Source: SCC MARC symposium, Santa Clara, CA (March 2011)

• ET International has investigated performance with lock-free message queues.
• Queue head/tail pointers reside in MPBs, and the message buffers reside in DRAM.
• Message buffers in DRAM are mapped twice into a given core's virtual memory: once for read using the MPBT bit, and once for write with all caching disabled.
• The MPBT cache control bit allows both the MPBs and the DRAM read mapping to be flushed via the SCC's CL1FLUSH instruction.

Many-Core Applications Research Community – http://communities.intel.com/community/marc


X10 on the SCC
Keith Chapman, Ahmed Hussein, Antony Hosking, Purdue University
Source: SCC MARC symposium, Santa Clara, CA (March 2011)

Porting and preliminary performance:
• RCCE-X10: an extension to RCCE tailored for the X10 runtime
• Performance for X10 benchmarks … RCCE-X10 performs better than MPI RCCE
  – HF (Hartree-Fock quantum chemistry) and BC (Betweenness Centrality)

[Plots: BC speedup (normalized to RCCE-X10); BC speedup for varying workloads; HF dynamic load balancing (normalized to RCCE-X10); HF static load balancing (normalized to RCCE-X10).]
(normalized to RCCE-X10)


Black Cloud OS
Microsoft Research
Source: SCC MARC symposium, Santa Clara, CA (March 2011)

What will an operating system look like when the chip becomes a cloud?
• Is breaking cache coherency a way to scale beyond a dozen cores? Does message passing benefit from network-on-chip design? The Black Cloud operating system tries to answer these questions by making the message passing features of the SCC first-class citizens.
• Black Cloud OS:
  – Based on the Singularity/Helios operating systems.
  – Written in managed code.
  – Runs a single operating system instance across all 48 cores.
  – Each core runs a separate kernel.
  – Processes can be hosted by any of the kernels.
  – All inter-process communication is done via messages.
  – Local and remote channels are available via the same APIs.
  – Inspired by "The Invincible" by Stanislaw Lem.

Many-Core Applications Research Community – http://communities.intel.com/community/marc


RCKMPI: MPI designed for SCC

This proof-of-concept software stack shows that it's possible to use out-of-the-box programming models on the SCC while modifying the physical layer to use the low-latency hardware acceleration features of the SCC. Thus, it can serve as a valuable example for the development of new programming models. As the physical layer has access to both the message passing buffer and/or shared memory, it also enables research on the "right mix" of MPB- vs. SHM-based communication. Standard MPI debug tools (like iTac) can be used to analyze the impact of modifications…

Many-Core Applications Research Community – http://communities.intel.com/community/marc


MESSY & Faroe
Hrishikesh Amur, Alexander Merritt, Sudarsun Kannan, Priyanka Tembey, Vishal Gupta, Min Lee, Ada Gavrilovska, Karsten Schwan
CERCS, Georgia Institute of Technology

MESSY – library for software coherence
• Aims to re-evaluate conventional wisdom regarding the effects of memory consistency models on application performance, using the fast on-chip interconnect
• Implements a key-value store that allows multiple consistency models to be applied to different data

Faroe – memory balancing for clusters on chip
• Aims to evaluate more scalable memory borrow/release protocols
• The on-chip interconnect facilitates faster communication between cores, hence enabling newer coordination mechanisms at the OS layer

Many-Core Applications Research Community – http://communities.intel.com/community/marc


A Software SVM for the SCC Architecture
Junghyun Kim, Sangmin Seo, Jun Lee and Jaejin Lee
Center for Manycore Programming, Seoul National University, Seoul 151-744, Korea
http://aces.snu.ac.kr

Goal: provide an illusion of coherent shared memory to the programmer
• With comparable performance to, and better scalability than, the ccNUMA architecture
• Using the CRF memory model

Many-Core Applications Research Community – http://communities.intel.com/community/marc


Performance Modeling of SCC's On-chip Interconnect
Aparna Chandramowlishwaran, Richard Vuduc, Georgia Institute of Technology

Research overview – our findings:
• We characterize the SCC on-chip interconnection network with micro-benchmarks
  – Observed point-to-point latency and bandwidth
  – Designed a performance model from the observations
• We present new collective communication algorithms for the SCC that efficiently utilize the message-passing buffer (MPB)
  – Broadcast 22x faster than RCCE
  – Reduce 6.4x faster than RCCE

Current and future work:
• Quantifying the price of cache coherence
• Studying other collective patterns
• Projecting real application scalability on the SCC
• Power studies

Performance model:
• [Plots: remote MPB read/write latency versus hop count; local MPB read/write performance compared to an L1-cache read.]
• 64-byte message; the slope gives a router latency of ~5 ns.
• A write to the local MPB is ~1.4x slower than a read.

Collective communication:
• The performance model guides the design of optimal collective communication algorithms. Two case studies: Reduce and Broadcast.
• Reduce
  – Naïve: get-based approach, sequential. E.g., core 0 reads from remote MPBs and performs a local reduce.
  – Tree-based reduce: scales as log p.
• Broadcast
  – Naïve: put-based approach, sequential. E.g., core 0 writes to the remote MPBs of the other cores.
  – Parallel broadcast algorithm: all cores fetch data from core 0's MPB.
• [Plots: Reduce and Broadcast parallel scaling, naïve versus tree-based/parallel.]

Configuration: cores at 533 MHz, router at 800 MHz, DDR3-800 memory.

Acknowledgements: We would like to thank Intel for access to the SCC processor hosted at the Many-core Applications Research Community (MARC) data center.

Many-Core Applications Research Community – http://communities.intel.com/community/marc




C++ Front-end for Distributed Transactional Memory
Sam Vafaee, Natalie Enright Jerger
Department of Electrical & Computer Engineering, University of Toronto, Canada

SCC-TM is a compiler-agnostic framework for rapid development of distributed apps:
• SCC-TM makes the SCC more accessible by providing an easier alternative to the message passing paradigm
• Builds on SCC's on-chip inter-core communication
• Writing distributed apps is as easy as atomic { … }
• First compiler-agnostic TM library without annotations
• Status: initial version with some centralized operations

[Architecture diagram: a front end (smart pointers, memory protection, on-read/on-write/TX-begin/TX-commit hooks) sits between the SPMD program and the TM protocol back end (version management, contention management, conflict detection, memory/cache management), which is built on the message passing buffers, globally mapped shared memory, and RCCE primitives (RCCE Send, RCCE Recv, RCCE Bcast, RCCE Malloc, …).]

Interface:

  TMMain(int argc, char **argv)
  {
      // Launched on desired # of cores
      // Node Id: TMNodeId(), # of nodes: TMNodes()
  }

Given:

  struct Account
  {
      int withdrawlLimit;
      int balance;
      Account() : withdrawlLimit(100), balance(1000) {};
  };

Collective declaration:

  tm_shared account = new(TMGlobal) Account();

Atomic region:

  atomic
  {
      if (100 < account->withdrawlLimit
          && account->balance >= 100)
      {
          account->balance -= 100;
      }
  }

Many-Core Applications Research Community – http://communities.intel.com/community/marc


Dense Matrix Computations on SCC: A Programmability Study
Bryan Marker, Ernie Chan, Jack Poulson, Robert van de Geijn, Rob Van der Wijngaart, Timothy Mattson, and Theodore Kubaska

We study programmability with RCCE by porting Elemental, a well-layered, high-performance library for distributed-memory computers, to the SCC.

Enabling features for the port:
• Used low-latency, low-overhead RCCE for synchronous collective communication to replace MPI calls
• The C++ Elemental code, which is object-oriented and modular, made the required changes very limited and easy to implement
• Implemented a few well-defined MPI collectives with RCCE to complete the port
• Algorithm variants enabled incremental porting and validation
• Algorithm variants also allowed testing for the best algorithms

  PartitionDownDiagonal
  ( A, ATL, ATR,
       ABL, ABR, 0 );
  while( ABR.Height() > 0 )
  {
      RepartitionDownDiagonal
      ( ATL, /**/ ATR,  A00, /**/ A01, A02,
       /*************/ /******************/
             /**/       A10, /**/ A11, A12,
        ABL, /**/ ABR,  A20, /**/ A21, A22 );

      A12_Star_MC.AlignWith( A22 );
      A12_Star_MR.AlignWith( A22 );
      A12_Star_VR.AlignWith( A22 );
      //----------------------------------//
      A11_Star_Star = A11;
      advanced::internal::LocalChol
      ( Upper, A11_Star_Star );
      A11 = A11_Star_Star;

      A12_Star_VR = A12;
      basic::internal::LocalTrsm
      ( Left, Upper, ConjugateTranspose, NonUnit,
        (F)1, A11_Star_Star, A12_Star_VR );

      A12_Star_MC = A12_Star_VR;
      A12_Star_MR = A12_Star_VR;
      basic::internal::LocalTriangularRankK
      ( Upper, ConjugateTranspose,
        (F)-1, A12_Star_MC, A12_Star_MR,
        (F)1, A22 );
      A12 = A12_Star_MR;
      //----------------------------------//
      A12_Star_MC.FreeAlignments();
      A12_Star_MR.FreeAlignments();
      A12_Star_VR.FreeAlignments();

      SlidePartitionDownDiagonal
      ( ATL, /**/ ATR,  A00, A01, /**/ A02,
             /**/       A10, A11, /**/ A12,
       /*************/ /******************/
        ABL, /**/ ABR,  A20, A21, /**/ A22 );
  }

Many-Core Applications Research Community – http://communities.intel.com/community/marc


SCC + QED for Effective Post-Si Validation

Quick Error Detection (QED) is a post-Si validation technique which reduces error detection latency and improves the coverage of existing validation tests. Error detection latency is the time elapsed between the occurrence of an error and its manifestation. Long error detection latencies extend the post-Si validation process, delaying product shipments. By incorporating QED with the SCC, the QED technique could be exercised by dedicated cores, greatly reducing the amount of software support and further reducing error detection latency.

QED technique:
• Effective for single cores [Hong ITC 2010]
  – 4x coverage improvement
  – 10^6 decrease in error detection latency
• Current research efforts
  – Prove efficacy of QED on multi-core errors
  – Prove applicability of QED on uncore errors

Intel SCC:
• Core clusters configurable for specific operating points
  – Allows creation of unreliable system cores within normal operating cores
  – Can see effects of single-core unreliability on a multi-core system
• Can create dedicated cores for performing QED
• Can test uncore errors using the extensive on-chip fabric

Many-Core Applications Research Community – http://communities.intel.com/community/marc


Distributed Power/Thermal Management for the Single-Chip Cloud Computer
1st step – thermal modeling and characterization

• Motivation: mapping and scheduling of tasks affects the peak temperature and thermal gradients
• Goal: allocate workload evenly across space and time with awareness of heat dissipation and lateral heat transfer
• Approach: proactive distributed task migration
• SCC: a unique many-core system with flexible DVFS capability and easy-to-access temperature sensors. It allows us to
  – Understand the thermal influence and workload dependency among on-chip processors, and
  – Evaluate the impact of those dependencies on the efficiency of power/thermal management
• Key steps:
  – Step 1: thermal modeling of the SCC
  – Step 2: framework for distributed thermal management on the SCC
  – Step 3: performance evaluation and sensitivity analysis

Many-Core Applications Research Community – http://communities.intel.com/community/marc




Pregel on SCC

• PageRank: graph with 6144 nodes, 10 edges/node, 96 partitions, 100 steps

[Plot: execution time, split into computation and communication, versus core count (1, 2, 4, 8, 16, 32, 48).]

Pregel: a system for large-scale graph processing (SIGMOD '10)
Source: Chuntao HONG, Tsinghua University, Jan 2011
Many-Core Applications Research Community – http://communities.intel.com/community/marc


… and beyond

• SuperMatrix, E. Chan et al., UT Austin; in progress
• Software Managed Coherence (SMC), Xiaocheng Zhou et al., Intel, China SCC symposium, Jan 2011
• An OpenCL Framework for Homogeneous Manycores with no Hardware Cache Coherence Support, Jaejin Lee et al., Seoul National University, submitted to PLDI 11


Backup

• Performance numbers
• A survey of SCC research
• Additional details


SCC Platform Board Overview

[Slide content: a photograph/diagram of the SCC platform board.]


How to use the atomic counters

• Two sets of 48 atomic counters (32-bit)
• Reachable with LUT* entry 0xf9
• Each set starts at a 4 KB page boundary:
  – 0xf900E000 -> Block 0
  – 0xf900F000 -> Block 1
• Each counter is represented by two registers (a usage sketch follows below):
  – Atomic increment register
    – Read: returns the counter value and decrements it.
    – Write: increments the counter value.
  – Initialization register: allows preloading or reading the counter value.

Register layout (block 0):
  0xE000  Atomic increment, counter #00
  0xE004  Initialization,   counter #00
  0xE008  Atomic increment, counter #01
  0xE00C  Initialization,   counter #01
  …
  0xE178  Atomic increment, counter #47
  0xE17C  Initialization,   counter #47

*LUT: memory look-up table … address translation unit (described later)
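
A hedged usage sketch for the counter layout above. It assumes block0 already points at an uncached mapping of counter block 0 (physical 0xf900E000 through LUT entry 0xf9); how that mapping is established (e.g. via a device mmap) is system-specific and not shown. The register semantics in the comments are taken from this slide.

  #include <stdint.h>

  /* one counter = two 32-bit registers, 8 bytes apart (0xE000/0xE004, 0xE008/0xE00C, ...) */
  typedef struct {
      volatile uint32_t atomic_inc;   /* per the slide: read returns the value (and decrements);
                                         write increments the counter */
      volatile uint32_t init;         /* preload or read the counter value */
  } scc_atomic_counter_t;

  /* initialize counter n in the block and bump it once */
  static uint32_t counter_demo(scc_atomic_counter_t *block0, int n) {
      scc_atomic_counter_t *c = &block0[n];   /* n in 0..47 */
      c->init = 0;                            /* preload the counter with 0 */
      c->atomic_inc = 1;                      /* any write increments atomically */
      return c->init;                         /* read the current value via the init register */
  }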


References

• A 48-Core IA-32 Message-Passing Processor with DVFS in 45nm CMOS, J. Howard, S. Dighe, Y. Hoskote, S. Vangal, D. Finan, G. Ruhl, D. Jenkins, H. Wilson, N. Borkar, G. Schrom, F. Pailet, S. Jain, T. Jacob, S. Yada, S. Marella, P. Salihundam, V. Erraguntla, M. Konow, M. Riepen, G. Droege, J. Lindemann, M. Gries, T. Apel, K. Henriss, T. Lund-Larsen, S. Steibl, S. Borkar, V. De, R. Van Der Wijngaart, T. Mattson, Proceedings of the International Solid-State Circuits Conference, Feb. 2010.
• The 48-core SCC processor: the programmer's view, T. G. Mattson, R. F. Van der Wijngaart, M. Riepen, T. Lehnig, P. Brett, W. Haas, P. Kennedy, J. Howard, S. Vangal, N. Borkar, G. Ruhl, S. Dighe, Proceedings of SC10, New Orleans, November 2010.
• Light-weight Communications on Intel's Single-Chip-Cloud Computer Processor, R. F. van der Wijngaart, T. G. Mattson, W. Haas, Operating Systems Review, ACM, vol. 45, no. 1, pp. 73-83, January 2011.
• Programming many-core architectures – a case study: dense matrix computations on the Intel SCC processor, B. Marker, E. Chan, J. Poulson, R. van de Geijn, R. van der Wijngaart, T. Mattson, T. Kubaska, submitted to Concurrency and Computation: Practice and Experience, 2011.
• Many-Core Applications Research Community, an online community of users of the SCC processor. http://communities.intel.com/community/marc
