Using Intel's Single-Chip Cloud Computer (SCC)

Tim Mattson
Microprocessor and Programming Lab
Intel Corp.

With content gratefully "borrowed" from: Rob van der Wijngaart, Ted Kubaska, Michael Riepen, Ernie Chan


Disclosure

• The views expressed in this talk are those of the speaker and not his employer.
• I am in a research group and know almost nothing about Intel products, hence anything I say about them is highly suspect.
• This was a team effort, but if I say anything really stupid, it's all my fault … don't blame my collaborators.


The many-core design challenge

• Scalable architecture:
  – How should we connect the cores so we can scale as far as we need (O(100's to 1000) should be enough)?
• Software:
  – Can "general purpose programmers" write software that takes advantage of the cores?
  – Will ISVs actually write scalable software?
• Manufacturability:
  – Validation costs grow steeply as the number of transistors grows. Can we use tiled architectures to address this problem?
  – For an N-transistor budget, validate a tile (M transistors/tile) and the connections between tiles. This drops validation costs from K·O(N) to K'·O(M) (warning: K and K' can be very large).

Intel's "TeraScale" processor research program is addressing these questions with a series of test chips … two so far: the 80-core research processor and the 48-core SCC processor.


Agenda

• The SCC Processor
• Using the SCC system
• SCC Address spaces


Hardware view of SCC

• 48 P54C cores, 6x4 mesh, 2 cores per tile
• 45 nm, 1.3 B transistors, 25 to 125 W
• 16 to 64 GB DRAM using 4 DDR3 memory controllers (MC)

[Die diagram: a 6x4 mesh of tiles, each attached to a router (R); four memory controllers sit on the edges of the mesh, plus a connection to PCI. Each tile contains two P54C cores (each with a 16KB L1-D$, a 16KB L1-I$, and a 256KB unified L2$), a 16 KB message passing buffer, and a mesh interface (I/F) to the router.]

R = router, MC = memory controller, P54C = second-generation Pentium® core, CC = cache controller.


On-Die Network

• 2D 6 by 4 mesh of tiles
• Fixed X-Y routing
• Carries all traffic (memory, I/O, message passing)
• Router: low-power 5-port crossbar; 2 message classes, 8 virtual channels

  Bisection bandwidth   2 TB/s at 2 GHz and 1.1 V
  Latency               4 cycles
  Link width            16 bytes
  Bandwidth             64 GB/s per link
  Architecture          8 VCs over 2 message classes
  Power consumption     500 mW @ 50°C
  Routing               Pre-computed, X-Y


Power and memory domains

• Power ~ F V^2
• Power control domains:
  – 7 voltage domains (six 4-tile blocks plus one for the on-die network)
  – 24 tile clock-frequency dividers
  – One voltage control register … so only one voltage change in flight at a time.

[Diagram of the SCC package: the 6x4 tile/router mesh with its four memory controllers (MC) and the bus to PCI, with the voltage, frequency, and memory domains overlaid on the die.]

*How cores map to MCs is under programmer control … i.e. the memory-controller domain is configurable.
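
As a rough back-of-the-envelope check of the Power ~ F V^2 rule (an illustration only, ignoring leakage and the non-core parts of the chip), compare the two operating points reported in the power-breakdown backup slide: cores at 1 GHz / 1.14 V versus 125 MHz / 0.7 V.

  \[ \frac{P_{low}}{P_{high}} \approx \frac{F_{low}}{F_{high}}\left(\frac{V_{low}}{V_{high}}\right)^{2} = \frac{0.125}{1.0}\left(\frac{0.70}{1.14}\right)^{2} \approx 0.05 \]

This is in the same ballpark as the measured drop in core power on that slide, from 87.7 W to 5.1 W (a factor of about 0.06).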


SCC system overview

[Board diagram: the SCC die (the 6x4 tile/router mesh with four MCs, each attached to DIMMs, plus a PLL) connects through the mesh to a System Interface FPGA, which connects over PCIe to a Management Console PC; a JTAG bus links the die to the Board Management Controller, with the JTAG interface used for clocking, power, etc.]

• System Interface FPGA
  – Connects to the SCC mesh interconnect
  – IO capabilities like PCIe, Ethernet & SATA
• Board Management Controller (BMC)
  – Network interface for user interaction via Telnet
  – Status monitoring

Questions?

Third party names are the property of their owners.


The latest FPGA features (release 1.4)

• 1-4 x 1Gb Ethernet ports
• Interface from BMC to FPGA
• Global interrupt controller
• Additional system registers
• Global timestamp counter
• Linux private memory size
• SCC main clock frequency
• Network configuration (IP addresses, gateway, MAC address)
• Linux frame buffer resolution and color depth for virtual display
• Atomic counters


SCC Platforms

• Three Intel-provided platforms for SCC and RCCE*:
  – Functional emulator (on top of OpenMP), running on a PC or server with Windows or Linux
  – The SCC board with two "OS flavors" … Linux or Baremetal (i.e. no OS). NO OpenMP on the SCC board.

[Stack diagram: in every case apps sit on RCCE with icc, ifort, and MKL underneath. The functional emulator layers RCCE on RCCE_EMU and OpenMP on the host; on the SCC board, RCCE runs on either Baremetal C or SCC Linux with its driver.]

*RCCE: native message passing library for SCC
Third party names are the property of their owners.


Networking: TCP/IP

• We support the standard "Internet protocol stack":

  Application   network applications (ssh, http, MPI, NFS …)
  Transport     host-host data transfer (TCP, UDP)
  Network       routing of datagrams from source to destination (IP)
  Link          data transfer between peers (rckmb)
  Physical      move packets over the physical network (on-die mesh)

• Rckmb … our implementation of the link layer
  – Local write / remote read data transfer
  – Static mapping of IP address to SCC core number
  – Uses the IP packet to determine the destination
• Seamless integration with Linux networking utilities
• Rckmb is based on NAPI
  – Interrupt shared between peer nodes
  – Provides for operation in polling mode

Source: Light-weight Communications on Intel's Single-Chip-Cloud Computer Processor, Rob F. van der Wijngaart, Timothy G. Mattson, Werner Haas, Operating Systems Review, ACM, vol. 45, no. 1, pp. 73-83, January 2011.
Third party names are the property of their owners.


Agenda

• The SCC Processor
• Using the SCC system
• SCC Address spaces


sccKit: core software for SCC

• SCC lacks a PC BIOS ROM and the typical bootstrap process
  – We use a bootable image preloaded into memory by sccKit.
• sccKit includes:
  – Platform support
    – System Interface FPGA bitstream
    – Board Management Controller (BMC) firmware
  – SCC system software
    – Customized Linux to run on each core: kernel 2.6.16 with Busybox 1.15.1 and on-die TCP/IP drivers
    – bareMetalC to run on SCC without an OS
  – Management Console PC software
    – PCIe driver with integrated TCP/IP driver
    – Programming API for communication with the SCC platform
    – GUI for interaction with the SCC platform
    – Command line tools for interaction with the SCC platform

Questions?

Third party names are the property of their owners.


Creating Management Console PC Apps

• Written in C++ making use of the Nokia Qt cross-platform application and UI framework.
• Low-level API (sccApi) with access to the SCC and the board management controller via PCIe.
• The code of sccGui as well as the command line tools is available as a code example. These tools use and extend the low-level API.


sccGui

• Read and write system memory and registers.
• Boot OS or other workloads (e.g. bareMetalC).
• Open SSH connections to booted Linux cores.
• Performance meter.
• Initialize the platform via the Board Management Controller.
• Open a console connection to nodes.
• Debug support … query memory addresses, read/write registers.


Command Line Interface to sccKit

sccBoot:
  – sccBoot pulls reset, loads the bootable Linux images into memory, sets up configuration registers, and releases reset; the cores then execute the boot sequence (see section 16 of the Pentium P54C manual).
  – Example: to boot Linux on all cores

      > sccBoot -l

  – Also used to "boot" generic workloads (e.g. bareMetalC applications)

sccReset:
  – Reset selected SCC cores. Example: to put all cores into reset

      > sccReset -g

sccBmc:
  – Board Management Controller. Example: to put SCC into a known state selected from a list of options:

      > sccBmc -i

Use any of these commands with -h to see usage notes.


Setting up your account for SCC

• Add these lines to your .bashrc file:

  # setup for sccKit and enable SCC cross compilers
  export PATH=.:/opt/sccKit/current/bin:$PATH
  export LD_LIBRARY_PATH=.:/opt/sccKit/lib:$LD_LIBRARY_PATH
  source /opt/compilerSetupFiles/crosscompile.sh

  # setup for RCCE ... point these at the directory where you installed RCCE
  export MANPATH="/home/tmattson/rcce/man:${MANPATH}"
  export PATH=.:/home/tmattson/rcce:$PATH

Note: while any Linux shell works, we test all our scripts with bash … so it's simplest if you use bash. If your system is configured with something else (such as csh), I manually switch to bash as soon as I log in.


Setting up your account for SCC

• If you have successfully set up the compilers and sccKit, you should see:

  >% which icc
  /opt/icc-8.1.038/bin/icc
  >% which gcc
  /opt/i386-unknown-linux-gnu/bin/gcc
  >% which sccBmc
  /opt/sccKit/current/bin/sccBmc


Check out the latest RCCE from SVN

• You want to check out the RCCE release in a "read only" user mode. That way you don't get all the tags and settings for checking things into RCCE.

  >% svn export http://marcbug.scc-dc.com/svn/repository/trunk/rcce/

• While you're at it, you might as well grab MPI as well.

  >% svn export http://marcbug.scc-dc.com/svn/repository/trunk/rckmpi/


Build RCCE

• Go to the root of the RCCE directory tree and set up the build environment:

  >% ./configure SCC_LINUX

• Then build RCCE by typing, from the root of the RCCE directory tree:

  >% ./makeall

• To build with other options (such as shared memory), edit the file common/symbols.in


Test by building an app (pingpong)

• Go to rcce_root/apps/PINGPONG and type

  >% make pingpong

• Copy the executable to /shared (note: /shared is NFS-mounted and visible to all cores)

  >% cp pingpong /shared/tmattson

• Create or copy an rc.hosts file … one entry per line with core numbers (00 to 47)
• Run the job on the SCC (a sketch of what such an app looks like inside follows below)

  >% rccerun -nue 2 -f rc.hosts pingpong
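
For orientation only, here is a minimal sketch of what a ping-pong style RCCE program looks like. This is not the PINGPONG app from the repository; it assumes RCCE's basic send/recv interface (RCCE_send, RCCE_recv, RCCE_wtime), and exact signatures may differ between RCCE releases.

  /* minimal RCCE ping-pong sketch (illustrative, not the shipped app) */
  #include <stdio.h>
  #include "RCCE.h"

  #define NROUNDS 10
  #define MSGSIZE 32                      /* one 32-byte cache line */

  int RCCE_APP(int argc, char **argv) {   /* RCCE programs use RCCE_APP as entry point */
      int i, me;
      char buf[MSGSIZE];
      double t0;

      RCCE_init(&argc, &argv);
      me = RCCE_ue();                     /* my unit of execution (core) id */

      t0 = RCCE_wtime();
      for (i = 0; i < NROUNDS; i++) {
          if (me == 0) {                  /* bounce one line: core 0 -> 1 -> 0 */
              RCCE_send(buf, MSGSIZE, 1);
              RCCE_recv(buf, MSGSIZE, 1);
          } else if (me == 1) {
              RCCE_recv(buf, MSGSIZE, 0);
              RCCE_send(buf, MSGSIZE, 0);
          }
      }
      if (me == 0)
          printf("average roundtrip: %e s\n", (RCCE_wtime() - t0) / NROUNDS);

      RCCE_finalize();
      return 0;
  }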


A closer look at rccerun

• rccerun is analogous to mpirun … it sets up the system and submits the RCCE executable.

  tmattson@boathouse:/shared/tmattson$ rccerun -nue 2 -f rc.hosts pingpong
  pssh -h PSSH_HOST_FILE.29325 -t -1 -p 2 /shared/tmattson/mpb.29325 < /dev/null
  [1] 17:56:39 [SUCCESS] rck00
  [2] 17:56:39 [SUCCESS] rck01
  pssh -h PSSH_HOST_FILE.29325 -t -1 -P -p 2 /shared/tmattson/pingpong 2 0.533 00 01 < /dev/null
  2965792 0.173102739
  3526944 0.205475685
  [1] 17:57:17 [SUCCESS] rck00
  [2] 17:57:17 [SUCCESS] rck01

• The first pssh runs a utility (mpb) to put the MPB in a known state; the second runs the executable (pingpong) with the default clock speed (533 MHz).
• The [SUCCESS] lines give the time and exit status for each core; the two numeric lines are the program's standard out.


Managing the SCC system

• To put the system into a known state, we "train" the chip.
• To train the SCC (set up power, memory, etc.) use the command:

  > sccBmc -i

• It will give you a collection of options to choose from:

  Please select from the following possibilities:
  INFO: (0) Tile533_Mesh800_DDR800
  INFO: (1) Tile800_Mesh1600_DDR1066
  INFO: (2) Tile800_Mesh1600_DDR800
  INFO: (3) Tile800_Mesh800_DDR1066
  INFO: (4) Tile800_Mesh800_DDR800
  INFO: (others) Abort!
  Make your selection:

• Note: some people reset first with sccReset -g, but that shouldn't be necessary.


… so to run benchmarks …

• I reset, trained, rebooted, and then used the SCC. I followed this procedure to ensure that the system is in a known state.

  213> sccReset -g
  214> sccBmc -i
  215> sccBoot -l
  216> rccerun -nue 2 -f rc.hosts pingpong
  271> rccerun -nue 48 -f rc.hosts stencil
  272> rccerun -nue 48 -f rc.hosts cshift 8


Agenda

• The SCC Processor
• Using the SCC system
• SCC Address spaces


SCC Address spaces

• 48 x86 cores which use the x86 memory model for private DRAM.

[Memory diagram: each core (CPU_0 … CPU_47) has its own L1$, L2$, and test-and-set (t&s) register on-chip, and private DRAM off-chip. Shared resources are the off-chip shared DRAM (variable size) and the on-chip message passing buffer (8 KB/core). Cache utilization of the shared spaces: the MPB is cached in L1$ as MPBT; the t&s register file is not cached; shared DRAM uses the L2$ non-coherently.]

• To better understand how memory works on the SCC, we will take a closer look at how the SCC native message passing environment (RCCE) is implemented.

t&s = shared test-and-set register


How does RCCE work? Part 1

• Treat the message passing buffer (MPB) as 48 smaller buffers … one per core.
• Symmetric name space … allocate memory as a collective operation. Each core gets a variable with the given name at a fixed offset from the beginning of its MPB.

  A = (double *) RCCE_malloc(size)

  Called on all cores, so any core can put/get A at any Core_ID without error-prone explicit offsets.

• Flags are allocated the same way and used to coordinate memory operations.

[Diagram: the MPB drawn as slots 0, 1, 2, 3, … 47, one per core, each holding the collectively allocated variable at the same offset.]


How does RCCE work? Part 2

• The foundation of RCCE is a one-sided put/get interface.
• Symmetric name space … allocate memory as a collective and put a variable with a given name into each core's MPB.

[Diagram: CPU_0 and CPU_47, each with private DRAM, L2$, L1$, and a t&s register; Put(A,0) moves data from a core's private memory into slot 0 of the MPB, and Get(A,0) reads it back out on another core.]

… and use flags to make the puts and gets "safe" (a code sketch of this pattern follows below).
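
A minimal sketch of the put/get-plus-flags pattern above, assuming RCCE's low-level one-sided interface (RCCE_malloc, RCCE_put/RCCE_get, RCCE_flag_*). The availability and exact signatures of these calls depend on how RCCE was built, so treat this as illustrative rather than as the shipped API.

  /* core 0 pushes one cache line into core 1's MPB and raises a flag;
     core 1 waits on the flag and then copies the line out of its MPB */
  #include <string.h>
  #include "RCCE.h"

  int RCCE_APP(int argc, char **argv) {
      int me;
      char line[32];                      /* one whole 32-byte cache line */
      t_vcharp A;                         /* symmetric MPB allocation */
      RCCE_FLAG ready;

      RCCE_init(&argc, &argv);
      me = RCCE_ue();

      A = RCCE_malloc(32);                /* collective: same offset on every core */
      RCCE_flag_alloc(&ready);

      if (me == 0) {
          memset(line, 42, sizeof(line));
          RCCE_put(A, (t_vcharp)line, sizeof(line), 1);   /* into core 1's MPB slot */
          RCCE_flag_write(&ready, RCCE_FLAG_SET, 1);      /* tell core 1 the data is there */
      } else if (me == 1) {
          RCCE_wait_until(ready, RCCE_FLAG_SET);          /* spin until core 0 raises the flag */
          RCCE_get((t_vcharp)line, A, sizeof(line), 1);   /* copy out of my own MPB */
      }

      RCCE_finalize();
      return 0;
  }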


How does RCCE work? Part 3

Message passing buffer memory is special … it is of type MPBT:
• Cached in L1, L2 bypassed.
• Not coherent between cores.
• Cache line allocated on read, not on write.
• A single-cycle op invalidates all MPBT lines in L1 … note this is an invalidate, not a flush.

Consequences of the MPBT properties for RCCE:
• If data is changed by another core and an image is still in L1, a read returns stale data.
  – Solution: invalidate before read.
• L1 has a write-combining buffer; write an incomplete line? Expect trouble!
  – Solution: don't. Always push whole cache lines.
• If an image of the line to be written is already in L1, the write will not go to memory.
  – Solution: invalidate before write.

We discourage user operations on data in the MPB. Use it only as a data movement area managed by RCCE … invalidate early, invalidate often (the sketch below illustrates the discipline).
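
The discipline above can be summarized in two tiny helpers. This is a hedged sketch, not RCCE internals: mpb_invalidate() is a hypothetical stand-in for the single-cycle "invalidate all MPBT lines in L1" operation that RCCE issues internally, and data is assumed to move in whole 32-byte lines.

  #include <string.h>

  #define MPB_LINE 32                      /* P54C cache line size */

  static void mpb_invalidate(void) {
      /* On the SCC this would issue the MPBT-invalidate operation;
         a no-op placeholder here so the sketch stays self-contained. */
  }

  /* Read one line that another core may have changed in my MPB slot. */
  static void mpb_read_line(void *dst, const volatile void *mpb_src) {
      mpb_invalidate();                    /* drop any stale L1 image first */
      memcpy(dst, (const void *)mpb_src, MPB_LINE);
  }

  /* Write one whole line into an MPB slot (never a partial line). */
  static void mpb_write_line(volatile void *mpb_dst, const void *src) {
      mpb_invalidate();                    /* so the write is not absorbed by a cached image */
      memcpy((void *)mpb_dst, src, MPB_LINE);
  }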


Shared Memory (DRAM) on SCC today

• By default, each core sees 64 MB of shared memory.
  – Four 16 MB chunks, each on a separate MC.
• The available shared memory area in DRAM is fragmented. It is also used by:
  – On-die & host Ethernet drivers
  – Memory-mapped IO consoles
  – Other system software
• No allocation scheme … different applications may collide!


SCC and the shared DRAM: three options

• Non-cacheable:
  – Loads and stores ignore the L2 cache … they go directly to DRAM.
  – Pro: easy for the programmer … no software-managed cache coherence.
  – Con: performance is poor.
• Cacheable:
  – Loads and stores interact with the cache "normally" … 4-way set associative with a pseudo-LRU replacement policy. Write-back only … write-allocate is not available.
  – Pro: better performance.
  – Con: the programmer manages cache coherence … requires a cache flush routine.
• MPBT:
  – Loads and stores bypass L2 but go to L1. Works exactly the same as the MPBT data for the on-chip message passing buffer.
  – Pro: no need for complicated (and expensive) cache flushes.
  – Con: misses the locality benefits of the large L2 cache.

In each case, however, only 64 MB of fragmented memory is available. That is not enough!!!


Core Memory Management: a more detailed look

• Each core has an address Look-Up Table (LUT)
  – Provides address translation and routing information
  – Organized into 16 MB segments (see the sketch below for how an address splits into a slot and an offset)
• Shared DRAM among all cores … through each memory controller (MC0 to MC3) … default size: 4*16 = 64 MB
• LUT boundaries are dynamically programmed
• Core cache coherency is restricted to the private memory space
• Maintaining cache coherency for the shared memory space is under software control

[CORE0 LUT example: 256 MB of private space maps to MC0-MC3; 1 GB of shared space maps to the MCs and the MPBs; other entries map to the VRCs, the PCI hierarchy, APIC/boot, FPGA registers, and the LUTs themselves.]

MC# = one of the 4 memory controllers, MPB = message passing buffer, VRCs = voltage regulator control
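
Since the LUT is organized into 16 MB segments, a core's 32-bit physical address splits into a LUT slot number (the top 8 bits, 256 slots) and an offset within that slot. A small illustrative sketch, which models only this address split and not the contents of a LUT entry (route, destination, address extension, etc.):

  #include <stdint.h>
  #include <stdio.h>

  int main(void) {
      uint32_t addr   = 0x83ABCDEFu;          /* an address in shared slot 0x83 */
      uint32_t slot   = addr >> 24;           /* 256 LUT slots of 16 MB each */
      uint32_t offset = addr & 0x00FFFFFFu;   /* offset within the 16 MB slot */
      printf("LUT slot 0x%02x, offset 0x%06x\n", slot, offset);
      return 0;
  }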


Moving beyond the 64 MB limit

• The default configuration (created by sccMerge) assigns the low LUT slots to the OS.
• The default Linux image has only 64 MB of shared DRAM in slots 0x83 to 0x84 … parts of which are used by the system.

[LUT diagram: slots 0x00 through 0x28 hold SCC Linux (with boundaries marked at 0x13, 0x14, and 0x1a); slots 0x80-0x84 are the shared region; 0xbf/0xc0 and above map the MPB.]


Moving beyond the 64 MB limit

• SCC Linux needs 320 MB. At 16 MB per LUT slot, this is 20 slots (0x00 to 0x13).
• Slots 0x00 to 0x13 are needed; the slots from 0x14 onward are not.
• We can hijack the memory pointed to by slots 0x1a through 0x28 (15 slots) … and use it as additional shared memory in our applications.

[LUT diagram as before, with the needed (0x00-0x13) and not-needed (0x14 onward) Linux slots marked.]


Moving beyond the 64 MB limit

• SCC Linux needs 320 MB. At 16 MB per LUT slot, this is 20 slots (0x00 to 0x13); the slots from 0x14 onward are not needed.
• Hijack the memory pointed to by slots 0x1a through 0x28 (15 slots) … and have slots 0x84 through 0xbf point to that memory.
• When we add slots to shared memory, we add them in groups of 4. With these 60 slots we can take 15 addresses from SCC Linux for each memory controller.

[LUT diagram as before, with the 60 shared slots 0x84 through 0xbf highlighted.]

Source: Ted Kubaska


How to use the shared DRAM

• Two devices for shared DRAM:
  – /dev/rckncm – Linux device exposing "non-cacheable memory"
  – /dev/rckdcm – Linux device exposing "definitely cacheable memory"
• Access to shared memory is through RCCE:
  – The appropriate device is opened inside RCCE_init().
  – SHMalloc() in SCC_API.c performs the actual mmap() on the file descriptor of the opened device.
  – Set all this up by building RCCE with SHMADD_CACHEABLE.
• If you work with cacheable shared memory, you need to manage cache coherence explicitly.
  – In other words … you need to understand when the cache needs to be flushed and explicitly insert flushes as needed.


Cacheable shared memory and flush

• The actual flush routine is part of the /dev/rckdcm driver, in the file rckmem.c in linuxkernel/linux-2.6.16-mcemu/drivers/char.
• RCCE uses this to flush the entire L2 cache for a core: RCCE_DCMflush()
• For example, invoke it as

  if (iam == receiver) RCCE_DCMflush();

• Inside RCCE_DCMflush() the driver routine DCMflush() is called as:

  write(DCMDeviceFD, 0, 65536);

  where DCMDeviceFD is the /dev/rckdcm file descriptor and 65536 (64 KB) is the size of an L2 way.
• It is possible to flush only a portion of the L2 cache, but this hasn't been implemented yet in RCCE. For example, to flush the values in a structure called XY, you would specify the write() as

  write(DCMDeviceFD, &XY, sizeof(XY));


Shared memory API: Example

  #include "RCCE.h"
  #define BSIZ 1024*64
  …
  volatile int *buffer;

  RCCE_init(&argc, &argv);
  iam  = RCCE_ue();
  size = bufsize*sizeof(int);

  buffer = (int *) RCCE_shmalloc(size);
  RCCE_barrier(&RCCE_COMM_WORLD);

  if (iam == sender) {
      fill_buffer(buffer);
      RCCE_DCMflush();
  }
  RCCE_barrier(&RCCE_COMM_WORLD);

  if (iam == receiver) {
      RCCE_DCMflush();
      use_buffer(buffer);
  }
  RCCE_barrier(&RCCE_COMM_WORLD);

  RCCE_shfree((t_vcharp)buffer);
  RCCE_finalize();


Off-Chip Shared Memory in RCCE

• SuperMatrix:
  – Map dense matrix computation to a directed acyclic graph (DAG)
  – Store the DAG and matrix in off-chip shared DRAM

[Plot: Cholesky factorization on a single node, used to isolate the cost of cache coherence.]

Source: Ernie Chan, UT Austin, MARC symposium, Nov. 2010


Shared memory ... what we want

• Shared DRAM works, and with hijacking we get a memory footprint large enough to be interesting.
• But it's not safe (system and apps can collide), and it's still of limited size (~960 MB).
• We want something better ... shared memory with:
  – No reserved areas ... everyone schedules memory through a single resource
  – No fragmentation
  – Much larger sizes for the shared memory
  – A proper allocation scheme
  – Coexistence with the legacy SHM section
  – A convenient usage model (no manual LUT changes, etc.)

Sounds better? Okay, but how do we achieve it?


The answer ...

• Each core has several hundred MB of private memory available (~640 MB on the latest systems).
• Why not allocate some private memory (using regular Linux mechanisms) and make it public for the other cores?
• Hide the implementation details in a library to make usage convenient ... available for Linux as well as bareMetalC.

=> Privately owned public shared memory (POPSHM)


POPSHM: Flexible Shared Memory

• Linux kernel patch
  – Allocates contiguous 16 MB chunks from the private memory area
  – 16 MB aligned (one LUT entry each)
  – Pins them
  – Publishes them to be used collectively (total allocated, first LUT entry)
• User-space library
  – Aggregates the available memory, up to 48 * (5 * 16 MB) = 3.75 GB, into a POPSHM address space
  – Provides memcpy and put/get
  – Low-level interface for minimal overhead

[Diagram: each core (Core 0, Core 1, Core 2, …) donates part of its Linux private memory into a POPSHM address space visible to all cores.]

POPSHM is new … still under early evaluation.


Conclusion

• SCC is alive and well:
  – Message passing and shared memory APIs are available.
  – A vigorous community (MARC) is using SCC for significant parallel computing research.
  – Usable through a GUI or a command line interface.
• The future … shared memory on SCC:
  – We understand how to use cacheable vs. non-cacheable shared memory … we even have a working L2 flush!
  – Full characterization of POPSHM is the next step.


Backup

• Performance numbers
• A survey of SCC research
• Additional details


Round-trip latencies, 32-byte message

• Ping-pong between a pair of cores … one core fixed at a corner, the second core varied from "same tile" to the opposite corner.

[Plot: round-trip latency (roughly 5.1e-6 to 5.6e-6 seconds) versus network hops (0 to 8).]

• Data fit to a straight line: T = 528 nanosecs + 30 nanosecs * hops
• RCCE_send/recv … 3 messages per transit, 6 for the roundtrip
• 4 cycles per hop, 800 MHz router … or 2*3*4 cycles / 0.8 GHz = 30 nanosecs per hop.
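
Written out, the fit and the per-hop cost quoted above are (a restatement of the slide's own arithmetic, assuming the 800 MHz router clock):

  \[ t_{hop} = \frac{2 \times 3 \times 4\ \text{cycles}}{0.8\ \text{GHz}} = 30\ \text{ns}, \qquad T(h) \approx 528\ \text{ns} + 30\ \text{ns} \cdot h \]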


Power breakdown

Full power breakdown (total 125.3 W; cores 1 GHz, mesh 2 GHz, 1.14 V, 50°C):
  Cores               69%   (87.7 W)
  MC & DDR3-800       19%   (23.6 W)
  Routers & 2D mesh   10%   (12.1 W)
  Global clocking      2%   (1.9 W)

Low power breakdown (total 24.7 W; cores 125 MHz, mesh 250 MHz, 0.7 V, 50°C):
  MC & DDR3-800       69%   (17.2 W)
  Cores               21%   (5.1 W)
  Routers & 2D mesh    5%   (1.2 W)
  Global clocking      5%   (1.2 W)


Impact of Core Position on Memory Performance

• Stream benchmark mapped to one core and one MC channel
  – The position of the core is varied
  – Report the relative reduction in usable memory bandwidth
• Tile 533 MHz, Router 800 MHz, Memory 800 MHz
  – Up to 13% variation within the quadrant of a memory controller (iMC)

         -13.0%  -15.8%  -19.8%  -23.0%  -25.5%  -28.7%
    iMC   -8.3%  -13.0%  -15.8%  -19.8%  -23.0%  -25.5%  iMC
          -5.3%   -8.3%  -13.0%  -15.8%  -19.8%  -23.0%
    iMC    0.0%   -5.3%   -8.4%  -13.0%  -15.8%  -19.8%  iMC

• Tile 800 MHz, Router 1600 MHz, Memory 800 MHz
  – Up to 8% variation within the quadrant of a memory controller

         -8.0%  -8.9%  -11.6%  -14.9%  -15.6%  -18.0%
    iMC  -4.2%  -8.0%   -8.9%  -11.6%  -14.9%  -15.6%  iMC
         -1.0%  -4.2%   -8.0%   -8.8%  -11.6%  -14.9%
    iMC   0.0%  -1.0%   -4.2%   -8.0%   -8.8%  -11.6%  iMC

Source: Intel, SCC workshop, Germany, March 16, 2010


Linpack and NAS Parallel benchmarks

1. Linpack (HPL): solve a dense system of linear equations
   – Synchronous communication with "MPI wrappers" to simplify porting
2. BT: multipartition decomposition (x-sweep, z-sweep)
   – Each core owns multiple blocks (3 in this case)
   – Update all blocks in a plane of 3x3 blocks
   – Send data to neighbor blocks in the next plane
   – Update the next plane of 3x3 blocks
3. LU: pencil decomposition
   – Define a 2D pipeline process:
   – await data (bottom + left)
   – compute new tile
   – send data (top + right)

[Diagram: the LU pipeline wavefront sweeping diagonally across a 2D grid of tiles.]

Third party names are the property of their owners.


LU/BT NAS Parallel Benchmarks, SCC

• Problem size: Class A, 64 x 64 x 64 grid*
• Using latency-optimized, whole-cache-line flags

[Plot: MFlops (0 to 2000) versus number of cores (0 to 48) for LU and BT.]

* These are not official NAS Parallel Benchmark results.
SCC processor: 500 MHz cores, 1 GHz routers, 25 MHz system interface, and DDR3 memory at 800 MHz.
Third party names are the property of their owners.
Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. Buyers should consult other sources of information to evaluate the performance of systems or components they are considering purchasing. For more information on performance tests and on the performance of Intel products, reference or call (U.S.) 1-800-628-8686 or 1-916-356-3104.


Backup

• Performance numbers
• A survey of SCC research
• Additional details


Many-core Application Research Community

• 123 contracts signed worldwide
• 77 unique institutions
• 39 research partners in the EU
• 28 research partners in the USA
• 10 research partners in other countries, including China, India, South Korea, Brazil, Canada
• 233 MARC website participants


SCC Bare Metal Development Framework: Bandwidth Studies

Source: SCC MARC symposium, Santa Clara, CA (March 2011)

• ET International has investigated performance with lock-free message queues.
• Queue head/tail pointers reside in MPBs, and the message buffers reside in DRAM.
• Message buffers in DRAM are mapped twice into a given core's virtual memory: once for read using the MPBT bit, and once for write with all caching disabled.
• The MPBT cache control bit allows both the MPBs and the DRAM read mapping to be flushed via the SCC's CL1FLUSH instruction.

Many-Core Applications Research Community – http://communities.intel.com/community/marc


X10 on the SCC
Keith Chapman, Ahmed Hussein, Antony Hosking, Purdue University
Source: SCC MARC symposium, Santa Clara, CA (March 2011)

Porting and preliminary performance:
• RCCE-X10: an extension to RCCE tailored for the X10 runtime
• Performance for X10 benchmarks … RCCE-X10 performs better than MPI RCCE
  – HF (Hartree-Fock quantum chemistry) and BC (Betweenness Centrality)

[Plots: BC speedup (normalized to RCCE-X10); BC speedup for varying workloads; HF dynamic load balancing (normalized to RCCE-X10); HF static load balancing (normalized to RCCE-X10).]
(normalized to RCCE-X10)


Black Cloud OS
Microsoft Research
Source: SCC MARC symposium, Santa Clara, CA (March 2011)

What will an operating system look like when the chip becomes a cloud?
• Is breaking cache coherency a way to scale beyond a dozen cores? Does message passing benefit from network-on-chip design? The Black Cloud operating system tries to answer these questions by making the message passing features of the SCC first-class citizens.
• Black Cloud OS:
  – Based on the Singularity/Helios operating systems.
  – Written in managed code.
  – Runs a single operating system instance across all 48 cores.
  – Each core runs a separate kernel.
  – Processes can be hosted by any of the kernels.
  – All inter-process communication is done via messages.
  – Local and remote channels are available via the same APIs.
  – Inspired by "The Invincible" by Stanislaw Lem.

Many-Core Applications Research Community – http://communities.intel.com/community/marc


RCKMPI: MPI designed for SCC

This proof-of-concept software stack shows that it's possible to use out-of-the-box programming models on the SCC while modifying the physical layer to use the low-latency hardware acceleration features of the SCC. Thus, it can serve as a valuable example for the development of new programming models. As the physical layer has access to both the message passing buffer and/or shared memory, it also enables research on the "right mix" of MPB- vs. SHM-based communication. Standard MPI debug tools (like iTac) can be used to analyze the impact of modifications…

Many-Core Applications Research Community – http://communities.intel.com/community/marc


MESSY & Faroe
Hrishikesh Amur, Alexander Merritt, Sudarsun Kannan, Priyanka Tembey, Vishal Gupta, Min Lee, Ada Gavrilovska, Karsten Schwan
CERCS, Georgia Institute of Technology

MESSY – library for software coherence
• Aims to re-evaluate conventional wisdom regarding the effects of memory consistency models on application performance, using the fast on-chip interconnect
• Implements a key-value store that allows multiple consistency models to be applied to different data

Faroe – memory balancing for clusters on chip
• Aims to evaluate more scalable memory borrow/release protocols
• The on-chip interconnect facilitates faster communication between cores, hence enabling newer coordination mechanisms at the OS layer

Many-Core Applications Research Community – http://communities.intel.com/community/marc


A Software SVM for the SCC Architecture
Junghyun Kim, Sangmin Seo, Jun Lee and Jaejin Lee
Center for Manycore Programming, Seoul National University, Seoul 151-744, Korea
http://aces.snu.ac.kr

Goal: provide an illusion of coherent shared memory to the programmer
• With comparable performance to, and better scalability than, the ccNUMA architecture
• Using the CRF memory model

Many-Core Applications Research Community – http://communities.intel.com/community/marc


Performance Modeling of SCC's On-chip Interconnect
Aparna Chandramowlishwaran, Richard Vuduc, Georgia Institute of Technology

Research overview – our findings:
• We characterize the SCC on-chip interconnection network with micro-benchmarks
  – Observed point-to-point latency and bandwidth
  – Designed a performance model from the observations
• We present new collective communication algorithms for the SCC that efficiently utilize the message-passing buffer (MPB)
  – Broadcast 22x faster than RCCE
  – Reduce 6.4x faster than RCCE

Current and future work:
• Quantifying the price of cache coherence
• Studying other collective patterns
• Projecting real application scalability on the SCC
• Power studies

Performance model:
• [Plots: remote MPB read/write latency versus hop count; local MPB read/write performance compared to an L1-cache read.]
• 64-byte message; the slope gives a router latency of ~5 ns.
• A write to the local MPB is ~1.4x slower than a read.

Collective communication:
• The performance model guides the design of optimal collective communication algorithms. Two case studies: Reduce and Broadcast.
• Reduce
  – Naïve: get-based approach, sequential. E.g., core 0 reads from remote MPBs and performs a local reduce.
  – Tree-based reduce: scales as log p.
• Broadcast
  – Naïve: put-based approach, sequential. E.g., core 0 writes to the remote MPBs of the other cores.
  – Parallel broadcast algorithm: all cores fetch data from core 0's MPB.
• [Plots: Reduce and Broadcast parallel scaling, naïve versus tree-based/parallel.]

Configuration: cores at 533 MHz, router at 800 MHz, DDR3-800 memory.

Acknowledgements: We would like to thank Intel for access to the SCC processor hosted at the Many-core Applications Research Community (MARC) data center.

Many-Core Applications Research Community – http://communities.intel.com/community/marc




C++ Front-end for Distributed Transactional Memory
Sam Vafaee, Natalie Enright Jerger
Department of Electrical & Computer Engineering, University of Toronto, Canada

SCC-TM is a compiler-agnostic framework for rapid development of distributed apps:
• SCC-TM makes the SCC more accessible by providing an easier alternative to the message passing paradigm
• Builds on SCC's on-chip inter-core communication
• Writing distributed apps is as easy as atomic { … }
• First compiler-agnostic TM library without annotations
• Status: initial version with some centralized operations

[Architecture diagram: a front end (smart pointers, memory protection, on-read/on-write/TX-begin/TX-commit hooks) sits between the SPMD program and the TM protocol back end (version management, contention management, conflict detection, memory/cache management), which is built on the message passing buffers, globally mapped shared memory, and RCCE primitives (RCCE Send, RCCE Recv, RCCE Bcast, RCCE Malloc, …).]

Interface:

  TMMain(int argc, char **argv)
  {
      // Launched on desired # of cores
      // Node Id: TMNodeId(), # of nodes: TMNodes()
  }

Given:

  struct Account
  {
      int withdrawlLimit;
      int balance;
      Account() : withdrawlLimit(100), balance(1000) {};
  };

Collective declaration:

  tm_shared account = new(TMGlobal) Account();

Atomic region:

  atomic
  {
      if (100 < account->withdrawlLimit
          && account->balance >= 100)
      {
          account->balance -= 100;
      }
  }

Many-Core Applications Research Community – http://communities.intel.com/community/marc


Dense Matrix Computations on SCC: A Programmability Study
Bryan Marker, Ernie Chan, Jack Poulson, Robert van de Geijn, Rob Van der Wijngaart, Timothy Mattson, and Theodore Kubaska

We study programmability with RCCE by porting Elemental, a well-layered, high-performance library for distributed-memory computers, to the SCC.

Enabling features for the port:
• Used low-latency, low-overhead RCCE for synchronous collective communication to replace MPI calls
• The C++ Elemental code, which is object-oriented and modular, made the required changes very limited and easy to implement
• Implemented a few well-defined MPI collectives with RCCE to complete the port
• Algorithm variants enabled incremental porting and validation
• Algorithm variants also allowed testing for the best algorithms

  PartitionDownDiagonal
  ( A, ATL, ATR,
       ABL, ABR, 0 );
  while( ABR.Height() > 0 )
  {
      RepartitionDownDiagonal
      ( ATL, /**/ ATR,  A00, /**/ A01, A02,
       /*************/ /******************/
             /**/       A10, /**/ A11, A12,
        ABL, /**/ ABR,  A20, /**/ A21, A22 );

      A12_Star_MC.AlignWith( A22 );
      A12_Star_MR.AlignWith( A22 );
      A12_Star_VR.AlignWith( A22 );
      //----------------------------------//
      A11_Star_Star = A11;
      advanced::internal::LocalChol
      ( Upper, A11_Star_Star );
      A11 = A11_Star_Star;

      A12_Star_VR = A12;
      basic::internal::LocalTrsm
      ( Left, Upper, ConjugateTranspose, NonUnit,
        (F)1, A11_Star_Star, A12_Star_VR );

      A12_Star_MC = A12_Star_VR;
      A12_Star_MR = A12_Star_VR;
      basic::internal::LocalTriangularRankK
      ( Upper, ConjugateTranspose,
        (F)-1, A12_Star_MC, A12_Star_MR,
        (F)1, A22 );
      A12 = A12_Star_MR;
      //----------------------------------//
      A12_Star_MC.FreeAlignments();
      A12_Star_MR.FreeAlignments();
      A12_Star_VR.FreeAlignments();

      SlidePartitionDownDiagonal
      ( ATL, /**/ ATR,  A00, A01, /**/ A02,
             /**/       A10, A11, /**/ A12,
       /*************/ /******************/
        ABL, /**/ ABR,  A20, A21, /**/ A22 );
  }

Many-Core Applications Research Community – http://communities.intel.com/community/marc


SCC + QED for Effective Post-Si Validation

Quick Error Detection (QED) is a post-Si validation technique which reduces error detection latency and improves the coverage of existing validation tests. Error detection latency is the time elapsed between the occurrence of an error and its manifestation. Long error detection latencies extend the post-Si validation process, delaying product shipments. By incorporating QED with the SCC, the QED technique could be exercised by dedicated cores, greatly reducing the amount of software support and further reducing error detection latency.

QED technique:
• Effective for single cores [Hong ITC 2010]
  – 4x coverage improvement
  – 10^6 decrease in error detection latency
• Current research efforts
  – Prove efficacy of QED on multi-core errors
  – Prove applicability of QED on uncore errors

Intel SCC:
• Core clusters configurable for specific operating points
  – Allows creation of unreliable system cores within normal operating cores
  – Can see effects of single-core unreliability on a multi-core system
• Can create dedicated cores for performing QED
• Can test uncore errors using the extensive on-chip fabric

Many-Core Applications Research Community – http://communities.intel.com/community/marc


Distributed Power/Thermal Management for the Single-Chip Cloud Computer
1st step – thermal modeling and characterization

• Motivation: mapping and scheduling of tasks affects the peak temperature and thermal gradients
• Goal: allocate workload evenly across space and time with awareness of heat dissipation and lateral heat transfer
• Approach: proactive distributed task migration
• SCC: a unique many-core system with flexible DVFS capability and easy-to-access temperature sensors. It allows us to
  – Understand the thermal influence and workload dependency among on-chip processors, and
  – Evaluate the impact of those dependencies on the efficiency of power/thermal management
• Key steps:
  – Step 1: thermal modeling of the SCC
  – Step 2: framework for distributed thermal management on the SCC
  – Step 3: performance evaluation and sensitivity analysis

Many-Core Applications Research Community – http://communities.intel.com/community/marc




Pregel on SCC

• PageRank: graph with 6144 nodes, 10 edges/node, 96 partitions, 100 steps

[Plot: execution time, split into computation and communication, versus core count (1, 2, 4, 8, 16, 32, 48).]

Pregel: a system for large-scale graph processing (SIGMOD '10)
Source: Chuntao HONG, Tsinghua University, Jan 2011
Many-Core Applications Research Community – http://communities.intel.com/community/marc


… and beyond

• SuperMatrix, E. Chan et al., UT Austin; in progress
• Software Managed Coherence (SMC), Xiaocheng Zhou et al., Intel, China SCC symposium, Jan 2011
• An OpenCL Framework for Homogeneous Manycores with no Hardware Cache Coherence Support, Jaejin Lee et al., Seoul National University, submitted to PLDI 11


Backup

• Performance numbers
• A survey of SCC research
• Additional details


SCC Platform Board Overview

[Slide content: a photograph/diagram of the SCC platform board.]


How to use the atomic counters

• Two sets of 48 atomic counters (32-bit)
• Reachable with LUT* entry 0xf9
• Each set starts at a 4 KB page boundary:
  – 0xf900E000 -> Block 0
  – 0xf900F000 -> Block 1
• Each counter is represented by two registers (a usage sketch follows below):
  – Atomic increment register
    – Read: returns the counter value and decrements it.
    – Write: increments the counter value.
  – Initialization register: allows preloading or reading the counter value.

Register layout (block 0):
  0xE000  Atomic increment, counter #00
  0xE004  Initialization,   counter #00
  0xE008  Atomic increment, counter #01
  0xE00C  Initialization,   counter #01
  …
  0xE178  Atomic increment, counter #47
  0xE17C  Initialization,   counter #47

*LUT: memory look-up table … address translation unit (described later)
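
A hedged usage sketch for the counter layout above. It assumes block0 already points at an uncached mapping of counter block 0 (physical 0xf900E000 through LUT entry 0xf9); how that mapping is established (e.g. via a device mmap) is system-specific and not shown. The register semantics in the comments are taken from this slide.

  #include <stdint.h>

  /* one counter = two 32-bit registers, 8 bytes apart (0xE000/0xE004, 0xE008/0xE00C, ...) */
  typedef struct {
      volatile uint32_t atomic_inc;   /* per the slide: read returns the value (and decrements);
                                         write increments the counter */
      volatile uint32_t init;         /* preload or read the counter value */
  } scc_atomic_counter_t;

  /* initialize counter n in the block and bump it once */
  static uint32_t counter_demo(scc_atomic_counter_t *block0, int n) {
      scc_atomic_counter_t *c = &block0[n];   /* n in 0..47 */
      c->init = 0;                            /* preload the counter with 0 */
      c->atomic_inc = 1;                      /* any write increments atomically */
      return c->init;                         /* read the current value via the init register */
  }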


References

• A 48-Core IA-32 Message-Passing Processor with DVFS in 45nm CMOS, J. Howard, S. Dighe, Y. Hoskote, S. Vangal, D. Finan, G. Ruhl, D. Jenkins, H. Wilson, N. Borkar, G. Schrom, F. Pailet, S. Jain, T. Jacob, S. Yada, S. Marella, P. Salihundam, V. Erraguntla, M. Konow, M. Riepen, G. Droege, J. Lindemann, M. Gries, T. Apel, K. Henriss, T. Lund-Larsen, S. Steibl, S. Borkar, V. De, R. Van Der Wijngaart, T. Mattson, Proceedings of the International Solid-State Circuits Conference, Feb. 2010.
• The 48-core SCC processor: the programmer's view, T. G. Mattson, R. F. Van der Wijngaart, M. Riepen, T. Lehnig, P. Brett, W. Haas, P. Kennedy, J. Howard, S. Vangal, N. Borkar, G. Ruhl, S. Dighe, Proceedings of SC10, New Orleans, November 2010.
• Light-weight Communications on Intel's Single-Chip-Cloud Computer Processor, R. F. van der Wijngaart, T. G. Mattson, W. Haas, Operating Systems Review, ACM, vol. 45, no. 1, pp. 73-83, January 2011.
• Programming many-core architectures – a case study: dense matrix computations on the Intel SCC processor, B. Marker, E. Chan, J. Poulson, R. van de Geijn, R. van der Wijngaart, T. Mattson, T. Kubaska, submitted to Concurrency and Computation: Practice and Experience, 2011.
• Many-Core Applications Research Community, an online community of users of the SCC processor. http://communities.intel.com/community/marc
