Using Intel’s Single-Chip Cloud Computer (SCC) - Intel, Mattson, Tutorial
<strong>Using Intel’s Single-Chip Cloud Computer (SCC)</strong><br />
Tim Mattson<br />
Microprocessor and Programming Lab<br />
Intel Corp.<br />
With content gratefully “borrowed” from: Rob van der Wijngaart, Ted Kubaska, Michael Riepen, Ernie Chan
Disclosure<br />
• The views expressed in this talk are those of the<br />
speaker and not his employer.<br />
• I am in a research group and know almost nothing<br />
about <strong>Intel</strong> products, hence anything I say about<br />
them is highly suspect.<br />
• This was a team effort, but if I say anything really<br />
stupid, it’s all my fault … don’t blame my<br />
collaborators.<br />
The many core design challenge<br />
• Scalable architecture:<br />
– How should we connect the cores so we can scale as far as we<br />
need (O(100’s to 1000) should be enough)?<br />
• Software:<br />
– Can “general purpose programmers” write software that takes<br />
advantage of the cores?<br />
– Will ISV’s actually write scalable software?<br />
• Manufacturability:<br />
– Validation costs grow steeply as the number of transistors grows.<br />
Can we use tiled architectures to address this problem?<br />
– For an N-transistor budget … validate a tile (M transistors/tile) and the<br />
connections between tiles. This drops validation costs from K·O(N) to<br />
K′·O(M) (warning: K, K′ can be very large).<br />
<strong>Intel</strong>’s “TeraScale” processor research program is addressing these<br />
questions with a series of test chips … two so far: the 80-core research<br />
processor and the 48-core <strong>SCC</strong> processor.<br />
Agenda<br />
• The <strong>SCC</strong> Processor<br />
• <strong>Using</strong> the <strong>SCC</strong> system<br />
• <strong>SCC</strong> Address spaces<br />
4
Hardware view of <strong>SCC</strong><br />
•48 P54C cores, 6x4 mesh, 2 cores per tile<br />
•45 nm, 1.3 B transistors, 25 to 125 W<br />
•16 to 64 GB DRAM using 4 DDR3 MC<br />
[Die diagram: a 6x4 mesh of tiles, each tile attached to a router (R), with four memory controllers (MC) on the mesh edges and a connection to PCI. Each tile holds two P54C cores, each with a 16KB L1-D$, a 16KB L1-I$, and its own 256KB unified L2$, plus a shared 16 KB message passing buffer and a mesh I/F to the router.]<br />
R = router, MC = Memory Controller,<br />
P54C = second generation Pentium® core, CC = cache cntrl.
On-Die Network<br />
•2D 6 by 4 mesh of tiles<br />
• Fixed X-Y Routing<br />
•Carries all traffic<br />
(memory, I/O, message<br />
passing).<br />
Bisection bandwidth: 2 TB/s at 2GHz and 1.1V<br />
Latency: 4 cycles<br />
Link width: 16 Bytes<br />
Bandwidth: 64GB/s per link<br />
Architecture: 8 VCs over 2 message classes<br />
Power consumption: 500mW @ 50°C<br />
Routing: pre-computed, X-Y<br />
• Router: a low-power 5-port crossbar with 2 message classes and 8 virtual channels<br />
Power and memory domains<br />
[Package diagram: the <strong>SCC</strong> die with its 6x4 tile mesh, four MCs, and the bus to PCI, partitioned into voltage, frequency, and memory domains.]<br />
Power ~ F·V²<br />
Power Control domains<br />
• 7 Voltage domains (6<br />
4-tile blocks plus one<br />
for the on-die<br />
network)<br />
• 24 tile clock frequency<br />
dividers<br />
• One voltage control<br />
register … so only one<br />
voltage change in<br />
flight at a time.<br />
*How cores map to MC is under programmer control …. i.e. the memory-controller domain is configurable.
<strong>SCC</strong> system overview<br />
[Board diagram: the <strong>SCC</strong> die (6x4 tile mesh with routers, four MCs attached to DIMMs, and a PLL) connects through the system interface to the System Interface FPGA, which connects over PCIe to the Management Console PC; a JTAG bus from the BMC provides clocking, power control, etc.]<br />
System Interface FPGA<br />
– Connects to <strong>SCC</strong><br />
Mesh interconnect<br />
– IO capabilities like<br />
PCIe, Ethernet &<br />
SATA<br />
Third party names are the property of their owners.<br />
Board Management<br />
Controller (BMC)<br />
– Network interface for<br />
User interaction via<br />
Telnet<br />
– Status monitoring<br />
The latest FPGA Features (release 1.4)<br />
• 1-4 x 1Gb Ethernet ports<br />
• Interface from BMC to FPGA<br />
• Global interrupt controller<br />
• Additional System registers<br />
• Global timestamp counter<br />
• Linux private memory size<br />
• <strong>SCC</strong> main clock frequency<br />
• Network configuration (IP addresses, Gateway, MAC address)<br />
• Linux frame buffer resolution and color depth for virtual display<br />
• Atomic Counters<br />
<strong>SCC</strong> Platforms<br />
• Three <strong>Intel</strong> provided platforms for <strong>SCC</strong> and RCCE*<br />
– Functional emulator (on top of OpenMP)<br />
– <strong>SCC</strong> board with two “OS Flavors” … Linux or Baremetal (i.e.<br />
no OS)<br />
[Stack diagram: on a PC or server with Windows or Linux, apps run on RCCE over RCCE_EMU, a driver, and OpenMP, built with icc, ifort, and MKL (the functional emulator, based on OpenMP). On the <strong>SCC</strong> board, apps run on RCCE over either Baremetal C or <strong>SCC</strong> Linux (NO OpenMP on the board).]<br />
*RCCE: Native Message passing library for <strong>SCC</strong><br />
Networking: TCP/IP<br />
• We support the standard “Internet protocol stack”<br />
Application: network applications (ssh, http, MPI, NFS …)<br />
Transport: host-host data transfer (TCP, UDP)<br />
Network: routing of datagrams from source to destination (IP)<br />
Link: data transfer between peers (rckmb)<br />
Physical: move packets over physical network (on-die mesh)<br />
• Rckmb … our implementation of the Link layer<br />
– Local write/remote read data transfer<br />
– Static mapping of IP address to <strong>SCC</strong> core number<br />
– Uses IP packet to determine destination<br />
• Seamless integration with Linux networking utilities<br />
• Rckmb is based on NAPI<br />
– Interrupt shared between peer nodes<br />
– Provides for operation in polling mode<br />
Source: Light-weight Communications on <strong>Intel</strong>'s <strong>Single</strong>-<strong>Chip</strong>-<strong>Cloud</strong> <strong>Computer</strong> Processor, Rob F. van der Wijngaart, Timothy<br />
G. <strong>Mattson</strong>, Werner Haas, Operating Systems Review, ACM, vol 45, number 1, pp. 73-83, January 2011.<br />
Agenda<br />
• The <strong>SCC</strong> Processor<br />
• <strong>Using</strong> the <strong>SCC</strong> system<br />
• <strong>SCC</strong> Address spaces<br />
sccKit: Core software for <strong>SCC</strong><br />
• <strong>SCC</strong> lacks a PC BIOS ROM and the typical bootstrap process …<br />
– we use a bootable image preloaded into memory by sccKit.<br />
• sccKit includes:<br />
– Platform support<br />
– System Interface FPGA bitstream<br />
– Board management controller (BMC) firmware<br />
– <strong>SCC</strong> System Software<br />
– Customized Linux to run on each core<br />
– Kernel 2.6.16 with Busybox 1.15.1 and on-die TCP/IP drivers<br />
– bareMetalC to run on <strong>SCC</strong> without an OS<br />
– Management Console PC Software<br />
– PCIe driver with integrated TCP/IP driver<br />
– Programming API for communication with <strong>SCC</strong> platform<br />
– GUI for interaction with <strong>SCC</strong> platform<br />
– Command line tools for interaction with <strong>SCC</strong> platform<br />
Creating Management Console PC Apps<br />
• Written in C++ making use of<br />
Nokia Qt cross-platform<br />
application and UI framework.<br />
• Low level API (sccApi) with access<br />
to <strong>SCC</strong> and board management<br />
controller via PCIe.<br />
• The code of sccGui as well as the<br />
command line tools is available as a<br />
code example. These tools use<br />
and extend the low level API.<br />
sccGui<br />
• Read and write<br />
system memory and<br />
registers.<br />
• Boot OS or other<br />
workloads<br />
(e.g. bareMetalC).<br />
• Open SSH connections<br />
to booted Linux cores<br />
• Performance meter<br />
• Initialize Platform via Board Management Controller.<br />
• Open a console connection to nodes<br />
• Debug support... Query memory addresses, read/write registers.<br />
Command Line Interface to sccKit<br />
sccBoot:<br />
– sccBoot pulls reset, loads the bootable Linux images into memory, sets<br />
up configuration registers, and releases reset; the cores then execute the<br />
boot sequence (see section 16 of the Pentium P54C manual).<br />
– Example: to boot linux on all cores<br />
> sccBoot -l<br />
– Also used to “boot” generic workloads (e.g. bareMetalC applications)<br />
sccReset:<br />
– Reset selected <strong>SCC</strong> cores. Example: to put all cores into reset<br />
> sccReset -g<br />
sccBmc:<br />
– Board Management Controller. Example: to prepare <strong>SCC</strong> in a known<br />
state selected from a list of options:<br />
> sccBmc -i<br />
Use any of these commands with -h to see usage notes.<br />
Setting up your account for <strong>SCC</strong><br />
• Add these lines to your .bashrc file<br />
#setup for sccKit and enable <strong>SCC</strong> cross compilers<br />
export PATH=.:/opt/sccKit/current/bin:$PATH<br />
export LD_LIBRARY_PATH=.:/opt/sccKit/lib:$LD_LIBRARY_PATH<br />
source /opt/compilerSetupFiles/crosscompile.sh<br />
# setup for RCCE<br />
export MANPATH="/home/tmattson/rcce/man:${MANPATH}"<br />
export PATH=.:/home/tmattson/rcce:$PATH<br />
(/home/tmattson/rcce is the directory where you installed RCCE)<br />
Note: while any Linux shell works, we test all our scripts<br />
with bash … so it’s simplest if you use bash. If your<br />
system is configured with something else (such as csh), I<br />
manually switch to bash as soon as I log in.
Setting up your account for <strong>SCC</strong><br />
• If you have successfully setup the compilers<br />
and sccKit, you should see …<br />
>% which icc<br />
/opt/icc-8.1.038/bin/icc<br />
>% which gcc<br />
/opt/i386-unknown-linux-gnu/bin/gcc<br />
>% which sccBmc<br />
/opt/sccKit/current/bin/sccBmc
Check out latest RCCE from SVN<br />
• You want to check out the rcce release but in a “read<br />
only” user mode. That way you don’t get all the tags<br />
and settings for checking things into RCCE<br />
>% svn export http://marcbug.scc-dc.com/svn/repository/trunk/rcce/<br />
• While you’re at it, you might as well grab MPI as well.<br />
>% svn export http://marcbug.scc-dc.com/svn/repository/trunk/rckmpi/
Build RCCE<br />
• Go to the root of the RCCE directory tree and<br />
setup the build environment<br />
>% ./configure <strong>SCC</strong>_LINUX<br />
• Then build RCCE by typing from the RCCE<br />
directory tree root<br />
>% ./makeall<br />
• To build with other options (such as shared<br />
memory) edit the file common/symbols.in
Test by building an app (pingpong)<br />
• Go to rcce_root/apps/PINGPONG and type<br />
>% make pingpong<br />
• Copy executable to /shared<br />
>% cp pingpong /shared/tmattson<br />
Note: /shared is<br />
NFS mounted and<br />
visible to all cores<br />
• Create or copy an rc.hosts file … one entry per<br />
line with numbers (00 to 47)<br />
• Run the job on the <strong>SCC</strong><br />
>% rccerun -nue 2 -f rc.hosts pingpong
A closer look at rccerun<br />
• rccerun is analogous to mpirun … it sets up<br />
the system and submits the RCCE executable.<br />
tmattson@boathouse:/shared/tmattson$ rccerun -nue 2 -f rc.hosts pingpong<br />
pssh -h PSSH_HOST_FILE.29325 -t -1 -p 2 /shared/tmattson/mpb.29325 < /dev/null<br />
[1] 17:56:39 [SUCCESS] rck00<br />
[2] 17:56:39 [SUCCESS] rck01<br />
pssh -h PSSH_HOST_FILE.29325 -t -1 -P -p 2 /shared/tmattson/pingpong 2 0.533 00 01 < /dev/null<br />
2965792 0.173102739<br />
3526944 0.205475685<br />
[1] 17:57:17 [SUCCESS] rck00<br />
[2] 17:57:17 [SUCCESS] rck01<br />
Annotations: the first pssh line runs a utility to put the MPB in a known<br />
state; the second runs the executable (pingpong) at the default clock<br />
speed (533 MHz); the numeric lines are standard out; the [SUCCESS] lines<br />
give the time and exit status for each core.
Managing the <strong>SCC</strong> system<br />
• To put the system into a known state, we train the chip.<br />
• To train <strong>SCC</strong> (set up power, memory, etc.) use the command:<br />
> sccBmc -i<br />
• It will give you a collection of options to choose from:<br />
Please select from the following possibilities:<br />
INFO: (0) Tile533_Mesh800_DDR800<br />
INFO: (1) Tile800_Mesh1600_DDR1066<br />
INFO: (2) Tile800_Mesh1600_DDR800<br />
INFO: (3) Tile800_Mesh800_DDR1066<br />
INFO: (4) Tile800_Mesh800_DDR800<br />
INFO: (others) Abort!<br />
Make your selection:<br />
• Note: some people reset first with sccReset -g, but that<br />
shouldn’t be necessary.
… so to run benchmarks …<br />
• I reset, trained, rebooted and then used <strong>SCC</strong>. I<br />
followed this procedure to assure that the<br />
system is in a known state.<br />
213> sccReset -g<br />
214> sccBmc -i<br />
215> sccBoot -l<br />
216> rccerun -nue 2 -f rc.hosts pingpong<br />
271> rccerun -nue 48 -f rc.hosts stencil<br />
272> rccerun -nue 48 -f rc.hosts cshift 8
Agenda<br />
• The <strong>SCC</strong> Processor<br />
• <strong>Using</strong> the <strong>SCC</strong> system<br />
• <strong>SCC</strong> Address spaces<br />
<strong>SCC</strong> Address spaces<br />
• 48 x86 cores which use the x86 memory model for Private DRAM<br />
[Memory diagram: each core (CPU_0 … CPU_47) has its own on-chip L1$ and L2$ and off-chip private DRAM; all cores share off-chip DRAM (variable size), the shared on-chip message passing buffer (8KB/core), and a test-and-set (t&s) register per core. Labels note where the physical memory is and the shared-memory cache utilization options: L1$ as MPBT; register file, no $; L2$, non-coherent.]<br />
To better understand how memory works on <strong>SCC</strong>, we will take a closer look<br />
at how the <strong>SCC</strong> native message passing environment (RCCE) is implemented.<br />
t&s = shared test-and-set register
How does RCCE work? Part 1<br />
• Treat Msg Pass Buf (MPB) as 48 smaller buffers … one per core.<br />
• Symmetric name space … Allocate memory as a collective op.<br />
Each core gets a variable with the given name at a fixed offset<br />
from the beginning of a core’s MPB.<br />
[Figure: the MPB split into per-core buffers 0 … 47, with flags allocated and used to coordinate memory ops.]<br />
A = (double *) RCCE_malloc(size)<br />
Called on all cores, so any core can put/get(A at Core_ID) without error-prone<br />
explicit offsets.
How does RCCE work? Part 2<br />
• The foundation of RCCE is a one-sided put/get interface.<br />
• Symmetric name space … Allocate memory as a collective and<br />
put a variable with a given name into each core’s MPB.<br />
[Figure: CPU_0 … CPU_47, each with private DRAM, L2$, L1$, and a t&s register; Put(A,0) and Get(A,0) move data through MPB buffer 0.]<br />
… and use flags to make the puts and gets “safe”.
How does RCCE work? Part 3<br />
Message passing buffer memory is special … it is of type MPBT:<br />
• Cached in L1; L2 bypassed.<br />
• Not coherent between cores.<br />
• Cache line allocated on read, not on write.<br />
• A single-cycle op invalidates all MPBT lines in L1 … note this is not a flush.<br />
Consequences of MPBT properties for RCCE:<br />
• If data is changed by another core and an image is still in L1, a read returns stale data.<br />
• Solution: invalidate before read.<br />
• L1 has a write-combining buffer; write an incomplete line? Expect trouble!<br />
• Solution: don’t. Always push whole cache lines.<br />
• If an image of the line to be written is already in L1, the write will not go to memory.<br />
• Solution: invalidate before write.<br />
Discourage user operations on data in the MPB. Use it only as a data<br />
movement area managed by RCCE … invalidate early, invalidate often.
Shared Memory (DRAM) on <strong>SCC</strong> today<br />
• By default, each core sees 64MB of shared memory.<br />
– 4 16MB chunks, each on a separate MC...<br />
• The available shared memory area in DRAM is fragmented.<br />
– On-Die & Host ethernet drivers<br />
– Memory mapped IO consoles<br />
– Other system software uses ...<br />
• No allocation scheme... Different applications may collide!<br />
<strong>SCC</strong> and the shared DRAM: three options<br />
• Non-cachable:<br />
– Loads and stores ignore the L2 cache … go directly to DRAM.<br />
– Pro: Easy for the programmer … no software managed cache coherence.<br />
– Con: Performance is poor.<br />
• Cachable:<br />
– Loads and stores interact with the cache “normally” … 4-way set associative<br />
with a pseudo-LRU replacement policy. Write-back only … write-allocate<br />
is not available.<br />
– Pro: Better performance.<br />
– Con: Programmer manages cache coherence … requires a cache flush routine.<br />
• MPBT:<br />
– Loads and stores bypass L2 but go to L1. Works exactly the same as the<br />
MPBT data for the on-chip message passing buffer.<br />
– Pro: No need for complicated (and expensive) cache flushes.<br />
– Con: Misses the locality benefits of the large L2 cache.<br />
In each case, however, only 64 Mbytes of fragmented<br />
memory is available. That is not enough!!!
Core Memory Management: A more detailed look<br />
• Each core has an address Look<br />
Up Table (LUT)<br />
o Provides address translation<br />
and routing information<br />
o Organized into 16 MB segments<br />
• Shared DRAM among all cores …<br />
through each memory controller<br />
(MC0 to MC3) … default size:<br />
o 4*16 = 64 MB<br />
• LUT boundaries are dynamically<br />
programmed<br />
[CORE0 LUT example: 256MB of private memory maps to an MC; a 1 GB shared region includes slots mapping to MC0–MC3 and the MPBs; other slots map to the VRCs, the PCI hierarchy, APIC/boot, the FPGA registers, and the LUT itself.]<br />
• Core cache coherency is restricted to private memory space<br />
• Maintaining cache coherency for shared memory space is under<br />
software control<br />
MC# = one of the 4 memory controllers, MPB = message passing buffer, VRC’s = Voltage Regulator control<br />
Moving beyond the 64 MB limit<br />
[LUT diagram: slots 0x00–0x13 hold <strong>SCC</strong> Linux; slot markers 0x14, 0x1a, 0x28, 0x80–0x84, 0xbf, and 0xc0 (MPB) outline the rest of the map.]<br />
The default configuration (by sccMerge) assigns these slots to the OS.<br />
The default Linux image has only 64 MB of shared DRAM in slots 0x83 to<br />
0x84 … parts of which are used by the system.<br />
Moving beyond the 64 MB limit<br />
<strong>SCC</strong> Linux needs 320 MB. At 16MB per LUT slot, this<br />
is 20 slots (0x00 to 0x13).<br />
[LUT diagram: slots 0x00–0x13 (<strong>SCC</strong> Linux) are marked “needed”; slots 0x1a–0x28 are marked “not needed”; slots 0x80–0xc0 hold the shared region and the MPB.]<br />
We can hijack the memory pointed to by slots 0x1a through<br />
0x28 (15 slots) … and use it as additional shared memory in<br />
our applications.<br />
Moving beyond the 64 MB limit<br />
<strong>SCC</strong> Linux needs 320 MB. At 16MB per LUT slot, this<br />
is 20 slots (0x00 to 0x13).<br />
Hijack the memory pointed to by slots 0x1a through 0x28 (15<br />
slots) … and have slots 0x84 through 0xbf point to that memory.<br />
When we add slots to shared memory, we add them in<br />
groups of 4. With these 60 slots we can take 15 addresses<br />
from <strong>SCC</strong> Linux for each memory controller.<br />
Source: Ted Kubaska<br />
How to use the shared DRAM<br />
• Two devices for shared memory DRAM<br />
– /dev/rckncm – Linux device exposing “non-cacheable memory”.<br />
– /dev/rckdcm – Linux device exposing “definitely cacheable memory”.<br />
• Access to shared memory through RCCE<br />
– The appropriate device is opened inside RCCE_init()<br />
– SHMalloc() in <strong>SCC</strong>_API.c does the actual mmap() on the<br />
file descriptor of the opened device.<br />
– Set all this up by building RCCE with SHMADD_CACHEABLE<br />
• If you work with Cacheable shared memory, you need to<br />
manage cache coherence explicitly.<br />
– In other words … you need to understand when the cache needs to<br />
be flushed and explicitly insert flushes as needed.
Cacheable shared memory and Flush<br />
• The actual flush routine is part of the /dev/rckdcm driver in the file<br />
– rckmem.c in linuxkernel/linux-2.6.16-mcemu/drivers/char.<br />
• RCCE uses this to flush the entire L2 cache for a core:<br />
– RCCE_DCMflush()<br />
• For example, invoke it as<br />
– if(iam==receiver) RCCE_DCMflush();<br />
• Inside RCCE_DCMflush(), the driver routine DCMflush() is called:<br />
– write(DCMDeviceFD,0,65536);<br />
– where DCMDeviceFD is the /dev/rckdcm file descriptor and 65536 (64K) the size of an<br />
L2 way.<br />
• It is possible to flush only a portion of the L2 cache, but this hasn’t been<br />
implemented yet in RCCE. For example, if you want to flush the values<br />
in a structure called XY, you would specify the write() as<br />
– write(DCMDeviceFD,&XY,sizeof(XY));
Shared memory API: Example<br />
#include "RCCE.h"<br />
#define BSIZ 1024*64<br />
…<br />
volatile int *buffer;<br />
RCCE_init(&argc, &argv);<br />
iam = RCCE_ue();<br />
size = BSIZ*sizeof(int);<br />
buffer = (int *) RCCE_shmalloc(size);<br />
RCCE_barrier(&RCCE_COMM_WORLD);<br />
if(iam==sender) {<br />
fill_buffer(buffer);<br />
RCCE_DCMflush();<br />
}<br />
RCCE_barrier(&RCCE_COMM_WORLD);<br />
if(iam==receiver) {<br />
RCCE_DCMflush();<br />
use_buffer(buffer);<br />
}<br />
RCCE_barrier(&RCCE_COMM_WORLD);<br />
RCCE_shfree((t_vcharp)buffer);<br />
RCCE_finalize();
Off-<strong>Chip</strong> Shared-Memory in RCCE<br />
• SuperMatrix:<br />
– Map dense matrix computation to a directed acyclic graph<br />
• Store DAG and matrix in off-chip shared DRAM<br />
Cholesky Factorization on<br />
a single node so we can<br />
isolate cost of cache<br />
coherence.<br />
Source: Ernie Chan, UT Austin, MARC symposium, Nov. 2010<br />
39
Shared memory ... What we want<br />
• Shared DRAM memory works, and with hijacking we get a<br />
memory footprint large enough to be interesting.<br />
• But it’s not safe (system and apps can collide). And it’s still<br />
of limited size (~960 Mbytes).<br />
• We want something better ... shared memory with:<br />
– No reserved areas ... everyone schedules memory through a single<br />
resource.<br />
– No fragmentation<br />
– Much larger sizes for the shared memory<br />
– A proper allocation scheme<br />
– Coexists with the legacy SHM section<br />
– Convenient usage model (no manual LUT changes etc.)<br />
Sounds better? Okay, but how to achieve it?
The answer...<br />
Each core has several hundred MB of private<br />
memory available (~640 MB on the latest systems).<br />
Why not allocate some private memory (using<br />
regular Linux mechanisms) and make it public<br />
for other cores?<br />
Hide the implementation details in a library to<br />
make usage convenient... available for Linux as<br />
well as bareMetalC.<br />
=> Privately owned public shared memory<br />
POPSHM: Flexible Shared Memory<br />
• Linux Kernel patch<br />
– Allocates contiguous 16MB chunks<br />
– From private memory area<br />
– 16MB aligned (LUT entry)<br />
– Pins them<br />
– Publishes it to be used collectively<br />
– Total allocated<br />
– First LUT Entry<br />
• User space library<br />
– Aggregates available<br />
memory, up to<br />
48 * (5 * 16 MB) = 3.75 GB<br />
– POPSHM address space<br />
– Provides memcpy and<br />
put/get<br />
– Low level interface for<br />
minimal overhead<br />
[Figure: Core 0, Core 1, Core 2, … each contribute chunks of their Linux private memory to a POPSHM address space visible to all cores.]<br />
POPSHM is new … still under early evaluation.
Conclusion<br />
• <strong>SCC</strong> is alive and well:<br />
– Message passing and shared memory APIs are available.<br />
– A vigorous community (MARC) is using <strong>SCC</strong> for significant<br />
parallel computing research.<br />
– Usable through a GUI or a command line interface<br />
• The future …. Shared memory on <strong>SCC</strong><br />
– We understand how to use cacheable vs. non-cacheable shared<br />
memory … we even have a working L2 flush!<br />
– Full characterization of POPSHM is the next step.<br />
Backup<br />
• Performance numbers<br />
• A survey of <strong>SCC</strong> research<br />
• Additional Details<br />
Round-trip Latencies, 32 byte message<br />
• Ping-pong between a pair of cores … one core fixed at a corner, the second<br />
core varied from “same tile” to opposite corner.<br />
[Plot: roundtrip latency (secs), from ~5.10E-06 to ~5.60E-06, vs. network hops (0 to 8).]<br />
Data fit to a straight line:<br />
T = 528 nanosecs + 30 nanosecs * hops<br />
RCCE_Send/Recv … 3 messages per transit, 6 for roundtrip<br />
4 cycles per hop, 800 MHz router … or 2*3*4/0.8 GHz = 30 nanosecs.
Power breakdown<br />
Full power breakdown (total 125.3W; cores at 1GHz, mesh at 2GHz, 1.14V, 50°C):<br />
• Cores: 87.7W (69%)<br />
• MC & DDR3-800: 23.6W (19%)<br />
• Routers & 2D mesh: 12.1W (10%)<br />
• Global clocking: 1.9W (2%)<br />
Low power breakdown (total 24.7W; cores at 125MHz, mesh at 250MHz, 0.7V, 50°C):<br />
• MC & DDR3-800: 17.2W (69%)<br />
• Cores: 5.1W (21%)<br />
• Routers & 2D mesh: 1.2W (5%)<br />
• Global clocking: 1.2W (5%)<br />
Impact of Core Position on Memory Performance<br />
• Stream benchmark mapped to one core and MC channel<br />
– Position of core is varied<br />
– Report relative reduction in usable memory bandwidth<br />
• Tile 533 MHz, Router 800 MHz, Memory 800 MHz<br />
– Up to 13% variation within quadrant of Memory controller (iMC)<br />
-13.0% -15.8% -19.8% -23.0% -25.5% -28.7%<br />
iMC -8.3% -13.0% -15.8% -19.8% -23.0% -25.5% iMC<br />
-5.3% -8.3% -13.0% -15.8% -19.8% -23.0%<br />
iMC 0.0% -5.3% -8.4% -13.0% -15.8% -19.8% iMC<br />
• Tile 800 MHz, Router 1600 MHz, Memory 800 MHz<br />
– Up to 8% variation within quadrant of Memory controller<br />
-8.0% -8.9% -11.6% -14.9% -15.6% -18.0%<br />
iMC -4.2% -8.0% -8.9% -11.6% -14.9% -15.6% iMC<br />
-1.0% -4.2% -8.0% -8.8% -11.6% -14.9%<br />
iMC 0.0% -1.0% -4.2% -8.0% -8.8% -11.6% iMC<br />
Source: <strong>Intel</strong>, <strong>SCC</strong> workshop, Germany March 16 2010
Linpack and NAS Parallel benchmarks<br />
1. Linpack (HPL): solve dense system of linear equations<br />
– Synchronous comm. with “MPI wrappers” to simplify porting<br />
2. BT: Multipartition decomposition<br />
– Each core owns multiple blocks (3 in this case)<br />
– update all blocks in plane of 3x3 blocks<br />
– send data to neighbor blocks in next plane<br />
– update next plane of 3x3 blocks<br />
[Diagram: x-sweep and z-sweep through the planes of blocks.]<br />
3. LU: Pencil decomposition<br />
– Define 2D-pipeline process:<br />
– await data (bottom+left)<br />
– compute new tile<br />
– send data (top+right)<br />
[Diagram: anti-diagonal wavefront numbering (1, 2, 3, 4, …) of the tile grid.]<br />
LU/BT NAS Parallel Benchmarks, <strong>SCC</strong><br />
Problem size: Class A, 64 x 64 x 64 grid*<br />
[Plot: MFlops (0 to 2000) vs. # cores (0 to 48) for LU and BT.]<br />
• <strong>Using</strong> latency optimized, whole cache line flags<br />
* These are not official NAS Parallel benchmark results.<br />
<strong>SCC</strong> processor 500MHz core, 1GHz routers, 25MHz system interface, and DDR3 memory at 800 MHz.<br />
Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of <strong>Intel</strong> products as measured by those tests. Any difference in system hardware or<br />
software design or configuration may affect actual performance. Buyers should consult other sources of information to evaluate the performance of systems or components they are considering purchasing. For more information<br />
on performance tests and on the performance of <strong>Intel</strong> products, reference or call (U.S.) 1-800-628-8686 or 1-916-356-3104.
Backup<br />
• Performance numbers<br />
• A survey of <strong>SCC</strong> research<br />
• Additional Details<br />
Many-core Application<br />
Research Community<br />
•123 Contracts signed worldwide<br />
•77 Unique Institutions<br />
•39 Research Partners in EU<br />
•28 Research Partners in USA<br />
•10 Research Partners in other<br />
Countries including China,<br />
India, South Korea, Brazil,<br />
Canada<br />
•233 MARC website Participants
<strong>SCC</strong> Bare Metal<br />
Development Framework<br />
Bandwidth Studies<br />
Source: <strong>SCC</strong> MARC symposium, Santa Clara, CA (March 2011)<br />
• ET International has investigated performance with lock-free message queues.<br />
• Queue head/tail pointers reside in the MPBs, and the message buffers reside in DRAM.<br />
• Message buffers in DRAM are mapped twice into a given core’s virtual memory: once for<br />
read using the MPBT bit, and once for write with all caching disabled.<br />
• The MPBT cache control bit allows both the MPBs and the DRAM read mapping to be flushed<br />
via the <strong>SCC</strong>’s CL1FLUSH instruction.<br />
Many-Core Applications Research Community– http://communities.intel.com/community/marc
X10 on the <strong>SCC</strong><br />
Keith Chapman, Ahmed Hussein, Antony Hosking<br />
Purdue University<br />
Source: <strong>SCC</strong> MARC symposium, Santa Clara, CA (March 2011)<br />
Porting and Preliminary Performance<br />
•RCCE-X10: An extension to RCCE tailored for X10 runtime<br />
•Performance for X10 benchmarks … RCCE-X10 performs better than MPI RCCE<br />
• HF (Hartree-Fock Quantum chemistry) and BC (Betweenness Centrality).<br />
[Figures: BC speedup (normalized to RCCE-X10) for varying workloads; HF dynamic vs. static load balancing (normalized to RCCE-X10)]<br />
Black <strong>Cloud</strong> OS<br />
Microsoft Research<br />
Source: <strong>SCC</strong> MARC symposium, Santa Clara, CA (March 2011)<br />
What will an operating system look like when the chip becomes a<br />
cloud?<br />
•Is breaking cache coherency a way to scale beyond a dozen cores? Does message<br />
passing benefit from network-on-chip design? The Black <strong>Cloud</strong> operating system tries to<br />
answer these questions by making the message passing features of the <strong>SCC</strong> first-class<br />
citizens.<br />
•Black <strong>Cloud</strong> OS:<br />
•Based on Singularity/Helios operating systems.<br />
•Written in managed code.<br />
•Runs a single operating system instance across all 48 cores.<br />
•Each core runs a separate kernel.<br />
•Processes can be hosted by any of the kernels.<br />
•All inter-process communication is done via messages.<br />
•Local and remote channels are available via the same APIs.<br />
•Inspired by “The Invincible” by Stanislaw Lem.<br />
RCKMPI<br />
MPI designed for <strong>SCC</strong><br />
This proof-of-concept software stack shows that it’s possible to use out-of-the-box<br />
programming models on <strong>SCC</strong> while modifying the physical layer to use the low-latency<br />
hardware acceleration features of <strong>SCC</strong>. Thus, it can serve as a valuable example for the<br />
development of new programming models. Because the physical layer has access to both the<br />
message passing buffers and shared memory, it also enables research on the “right mix” of<br />
MPB- vs. SHM-based communication. Standard MPI debug tools (like ITAC) can be used to<br />
analyze the impact of modifications.<br />
MESSY & Faroe<br />
Hrishikesh Amur, Alexander Merritt, Sudarsun Kannan, Priyanka Tembey<br />
Vishal Gupta, Min Lee, Ada Gavrilovska, Karsten Schwan<br />
CERCS, Georgia Institute of Technology<br />
MESSY – Library for Software Coherence<br />
•Aims to re-evaluate conventional wisdom regarding effects of memory consistency<br />
models on application performance using the fast on-chip interconnect<br />
•Implements a key-value store that allows multiple consistency models to be<br />
applied on different data.<br />
Faroe – Memory Balancing for Clusters on <strong>Chip</strong><br />
•Aims to evaluate more scalable memory borrow/release protocols<br />
•The on-chip interconnect facilitates faster communication between cores,<br />
hence enabling newer coordination mechanisms at the OS layer.<br />
A Software SVM<br />
for the <strong>SCC</strong> Architecture<br />
Junghyun Kim, Sangmin Seo, Jun Lee and Jaejin Lee<br />
Center for Manycore Programming, Seoul National University, Seoul 151-744, Korea<br />
http://aces.snu.ac.kr<br />
To provide an illusion of coherent shared memory to the programmer<br />
• With comparable performance to and better scalability than<br />
the ccNUMA architecture<br />
• <strong>Using</strong> the CRF memory model<br />
Performance Modeling of <strong>SCC</strong>’s On-chip<br />
Interconnect<br />
Research Overview<br />
Our findings:<br />
• We characterize the <strong>SCC</strong> on-chip interconnection network with micro-benchmarks<br />
• Observed point-to-point latency, bandwidth<br />
• Designed a performance model from observations<br />
• We present new collective communication algorithms for the <strong>SCC</strong> by efficiently<br />
utilizing the message-passing buffer (MPB)<br />
• Broadcast 22x faster than RCCE<br />
• Reduce 6.4x faster than RCCE<br />
Current and Future work:<br />
• Quantifying the price of cache coherence<br />
• Study other collective patterns<br />
• Project real application scalability on the <strong>SCC</strong><br />
• Power studies<br />
Aparna Chandramowlishwaran, Richard Vuduc<br />
Georgia Institute of Technology<br />
Performance Model<br />
[Figures: remote MPB read/write latency vs. hop count; local MPB read/write performance vs. L1-cache read]<br />
• 64 byte messages. The slope gives a per-hop router latency (~5 ns)<br />
• Write to the local MPB is ~1.4x slower than read<br />
Collective Communication<br />
• The performance model guides the design of optimal collective communication algorithms<br />
• Two case studies: Reduce and Broadcast<br />
Reduce:<br />
• Naïve: get-based approach, sequential. E.g., core 0 reads from the remote MPBs and performs a local reduce<br />
• Tree-based reduce: scales as log p<br />
Broadcast:<br />
• Naïve: put-based approach, sequential. E.g., core 0 writes to the remote MPBs of the other cores<br />
• Parallel broadcast algorithm: all cores fetch data from core 0’s MPB<br />
[Figures: Reduce and Broadcast parallel scaling, naïve vs. tree-based/parallel]<br />
Configuration: Cores at 533 MHz, Router at 800 MHz, DDR3 800 memory<br />
Acknowledgements<br />
We would like to thank <strong>Intel</strong> for access to the <strong>SCC</strong> processor hosted at the<br />
manycore research community (MARC) data center.<br />
C++ Front-end for Distributed Transactional Memory<br />
Sam Vafaee, Natalie Enright Jerger<br />
Department of Electrical & <strong>Computer</strong> Engineering, University of Toronto, Canada<br />
<strong>SCC</strong>-TM is a Compiler-Agnostic Framework for Rapid Development of Distributed Apps<br />
• <strong>SCC</strong>-TM makes <strong>SCC</strong> more accessible by providing an easier alternative to the<br />
message passing paradigm<br />
• Builds on <strong>SCC</strong>’s on-chip inter-core communication<br />
• Writing distributed apps is as easy as atomic { … }<br />
• First compiler-agnostic TM library without annotations<br />
• Status: Initial version with some centralized operations<br />
Architecture:<br />
[Diagram: a FRONTEND (smart pointers, memory protection, hooks on read, on write, TX begin, TX commit) sits above a TM PROTOCOL BACKEND (version management, contention management, conflict detection, memory/cache management), which builds on RCCE Send/Recv/Bcast/Malloc, the message passing buffers, and globally mapped shared memory]<br />
Interface (SPMD program):<br />
TMMain(int argc, char **argv)<br />
{<br />
// Launched on desired # of cores<br />
// Node Id: TMNodeId(), # of nodes: TMNodes()<br />
}<br />
Collective declaration and atomic region example:<br />
Given:<br />
struct Account<br />
{<br />
int withdrawlLimit;<br />
int balance;<br />
Account() : withdrawlLimit(100), balance(1000) {};<br />
};<br />
Declaration:<br />
tm_shared account = new(TMGlobal) Account();<br />
Atomic region:<br />
atomic<br />
{<br />
if (100 <= account->withdrawlLimit<br />
&& account->balance >= 100)<br />
{<br />
account->balance -= 100;<br />
}<br />
}<br />
Dense Matrix Computations on <strong>SCC</strong>: A Programmability Study<br />
We study programmability with RCCE by porting Elemental, a well-layered,<br />
high-performance library for distributed-memory computers, to the <strong>SCC</strong>.<br />
Enabling Features for Port:<br />
•Used low-latency, low-overhead RCCE for synchronous collective communication to replace MPI calls<br />
•C++ Elemental code, which is object-oriented and modular, made required changes very limited and easy to<br />
implement<br />
•Implemented a few well-defined MPI collectives with RCCE to complete port<br />
•Algorithm variants enabled incremental porting and validation<br />
•Algorithm variants also allowed testing for best algorithms<br />
PartitionDownDiagonal<br />
( A, ATL, ATR,<br />
ABL, ABR, 0 );<br />
while( ABR.Height() > 0 )<br />
{<br />
RepartitionDownDiagonal<br />
( ATL, /**/ ATR, A00, /**/ A01, A02,<br />
/*************/ /******************/<br />
/**/ A10, /**/ A11, A12,<br />
ABL, /**/ ABR, A20, /**/ A21, A22 );<br />
A12_Star_MC.AlignWith( A22 );<br />
A12_Star_MR.AlignWith( A22 );<br />
A12_Star_VR.AlignWith( A22 );<br />
//----------------------------------//<br />
A11_Star_Star = A11;<br />
advanced::internal::LocalChol<br />
( Upper, A11_Star_Star );<br />
A11 = A11_Star_Star;<br />
A12_Star_VR = A12;<br />
basic::internal::LocalTrsm<br />
( Left, Upper,<br />
ConjugateTranspose, NonUnit,<br />
(F)1, A11_Star_Star, A12_Star_VR );<br />
A12_Star_MC = A12_Star_VR;<br />
A12_Star_MR = A12_Star_VR;<br />
basic::internal::LocalTriangularRankK<br />
( Upper, ConjugateTranspose,<br />
(F)-1, A12_Star_MC, A12_Star_MR,<br />
(F)1, A22 );<br />
A12 = A12_Star_MR;<br />
//----------------------------------//<br />
A12_Star_MC.FreeAlignments();<br />
A12_Star_MR.FreeAlignments();<br />
A12_Star_VR.FreeAlignments();<br />
SlidePartitionDownDiagonal<br />
( ATL, /**/ ATR, A00, A01, /**/ A02,<br />
/**/ A10, A11, /**/ A12,<br />
/*************/ /******************/<br />
ABL, /**/ ABR, A20, A21, /**/ A22 );<br />
}<br />
Bryan Marker, Ernie Chan, Jack Poulson, Robert van de Geijn, Rob Van der Wijngaart, Timothy <strong>Mattson</strong>, and Theodore Kubaska
<strong>SCC</strong> + QED for<br />
Effective Post-Si Validation<br />
Quick Error Detection (QED) is a post-Si validation technique which reduces error detection latency and improves coverage of existing<br />
validation tests. Error detection latency is the time elapsed between the occurrence of an error and its manifestation. Long error<br />
detection latencies extend the post-Si validation process, delaying product shipments. By incorporating QED with the <strong>SCC</strong>, the QED<br />
technique could be exercised by dedicated cores, greatly reducing the amount of software support and further reducing error detection<br />
latency.<br />
QED technique<br />
• Effective for single-cores [Hong ITC 2010]<br />
• 4x coverage improvement<br />
• 10^6 decrease in error detection latency<br />
• Current research efforts<br />
• Prove efficacy of QED on multi-core errors<br />
• Prove applicability of QED on uncore errors<br />
<strong>Intel</strong> <strong>SCC</strong><br />
• Core clusters configurable for specific operating points<br />
• Allows creation of unreliable system cores within normal operating cores<br />
• Can see effects of single core unreliability on a multi-core system<br />
• Can create dedicated cores for performing QED<br />
• Can test uncore errors using the extensive on-chip fabric<br />
Distributed Power/Thermal Management for <strong>Single</strong> <strong>Chip</strong> <strong>Cloud</strong> <strong>Computer</strong><br />
1 st Step - Thermal Modeling and Characterization<br />
•Motivation: Mapping and scheduling of tasks affects the peak temperature and thermal gradients<br />
•Goal: allocate workload evenly across space and time with the awareness of heat dissipation and lateral heat transfer<br />
•Approach: proactive distributed task migration<br />
•<strong>SCC</strong>: a unique many-core system with flexible DVFS capability and easy-to-access temperature sensors. It allows us to<br />
•Understand the thermal influence and workload dependency among on-chip processors, and<br />
•Evaluate the impact of those dependencies on the efficiency of power/thermal management<br />
•Key metric<br />
•Step 1: thermal modeling of the <strong>SCC</strong><br />
•Step 2: Framework of distributed thermal management on <strong>SCC</strong><br />
•Step 3: Performance evaluation and sensitivity analysis<br />
Pregel on <strong>SCC</strong><br />
PageRank<br />
[Figure: execution time, split into comp and comm portions, vs. number of cores: 1, 2, 4, 8, 16, 32, 48]<br />
Graph with 6144 nodes, 10 edges/node, 96 partitions, 100 steps<br />
Pregel: a system for large-scale graph processing (SIGMOD ’10)<br />
Source: Chuntao HONG, Tsinghua University, Jan 2011<br />
66
… and beyond<br />
• Supermatrix—E. Chan et al., UT Austin; in progress<br />
• Software Managed Coherence (SMC)—<br />
Xiaocheng Zhou et al., <strong>Intel</strong>, China <strong>SCC</strong> symposium,<br />
Jan 2011<br />
• An OpenCL Framework for Homogeneous<br />
Manycores with no Hardware Cache Coherence<br />
Support—Jaejin Lee et al., Seoul National<br />
University, submitted to PLDI 11<br />
Backup<br />
• Performance numbers<br />
• A survey of <strong>SCC</strong> research<br />
• Additional Details<br />
68
<strong>SCC</strong> Platform Board Overview<br />
69
How to use the atomic counters<br />
• Two sets of 48 atomic counters (32 bit)<br />
• Reachable with LUT* entry 0xf9<br />
• Each set starts at 4k page boundaries<br />
– 0xf900E000 -> Block 0<br />
– 0xf900F000 -> Block 1<br />
• Each counter is represented by two registers:<br />
– Atomic increment register<br />
– Read: Returns counter value and decrements it.<br />
– Write: Increments counter value.<br />
– Initialization register allows preloading or<br />
reading counter value.<br />
0xE000: Atomic Increment, Counter #00<br />
0xE004: Initialization, Counter #00<br />
0xE008: Atomic Increment, Counter #01<br />
0xE00C: Initialization, Counter #01<br />
…<br />
0xE178: Atomic Increment, Counter #47<br />
0xE17C: Initialization, Counter #47<br />
*LUT: Memory look up table … address translation unit (described later)<br />
70
References<br />
• A 48-Core IA-32 Message-Passing Processor with DVFS in 45nm CMOS, J.<br />
Howard, S. Dighe, Y. Hoskote, S. Vangal, D. Finan, G. Ruhl, D. Jenkins, H. Wilson, N.<br />
Borkar, G. Schrom, F. Pailet, S. Jain, T. Jacob, S. Yada, S. Marella, P. Salihundam,<br />
V. Erraguntla, M. Konow, M. Riepen, G. Droege, J. Lindemann, M. Gries, T. Apel, K.<br />
Henriss, T. Lund-Larsen, S. Steibl, S. Borkar, V. De, R. Van Der Wijngaart, T.<br />
<strong>Mattson</strong>, Proceedings of the International Solid-State Circuits Conference, Feb 2010<br />
• The 48-core <strong>SCC</strong> processor: the programmer’s view, T. G. <strong>Mattson</strong>, R. F. Van der<br />
Wijngaart, M. Riepen, T. Lehnig, P. Brett, W. Haas, P. Kennedy, J. Howard, S.<br />
Vangal, N. Borkar, G. Ruhl, S. Dighe, Proceedings SC10, New Orleans, November<br />
2010<br />
• Light-weight Communications on <strong>Intel</strong>'s <strong>Single</strong>-<strong>Chip</strong>-<strong>Cloud</strong> <strong>Computer</strong><br />
Processor, Rob F. van der Wijngaart, Timothy G. <strong>Mattson</strong>, Werner Haas, Operating<br />
Systems Review, ACM, vol 45, number 1, pp. 73-83, January 2011.<br />
• Programming many-core Architectures – a case study: dense Matrix<br />
computations on the <strong>Intel</strong> <strong>SCC</strong> Processor, B. Marker, E. Chan, J. Poulson, R. van<br />
de Geijn, R. van der Wijngaart, T. <strong>Mattson</strong>, T. Kubaska, submitted to Concurrency and<br />
Computation: practice and experience, 2011.<br />
• Many Core Applications Research Community, An online community of users of<br />
the <strong>SCC</strong> processor. http://communities.intel.com/community/marc<br />
71