Babak Falsafi<br />
Parallel Systems Architecture Lab<br />
<strong>EPFL</strong><br />
parsa.epfl.ch<br />
ecocloud.ch<br />
© 2010 Babak Falsafi
Energy: Shaping IT’s Future<br />
• 40 years of energy scalability<br />
Doubling transistors every two years<br />
Quadratic reduction in energy from voltage scaling<br />
• But, while Moore’s law continues<br />
Voltages have started to level off<br />
ITRS projections in 2000 for voltage levels in<br />
2009 are way off!<br />
An exponential increase in energy usage<br />
every generation?<br />
Household IT Energy Usage<br />
(from Sun)<br />
[Chart: annual energy usage of household IT devices, from cell phones on up]<br />
Shift towards Cloud Computing Helps<br />
Data Centers<br />
• Ubiquitous connectivity & access to data<br />
• Consolidate servers ➔ amortize energy costs<br />
But, the Cloud has hit a wall!<br />
Trends:<br />
• Moore’s law continues<br />
• Server density is increasing<br />
• But, voltage scaling has slowed<br />
➔ It’s too expensive to buy/cool servers<br />
A 1,000 m² datacenter draws 1.5 MW!<br />
(carbon footprint of airlines in 2012)<br />
Enterprise IT Energy Usage<br />
Kenneth Brill (Uptime Institute)<br />
• “Economic Meltdown of Moore’s Law”<br />
• In 2012: energy over a server’s lifetime costs 50% more<br />
than the server itself<br />
And 2% of all carbon footprint in the US<br />
Energy Star report to Congress:<br />
• Datacenter energy 2x from 2000 to 2006<br />
• Roughly 2% of all electricity & growing<br />
Example Projections for Datacenters<br />
[Chart: billion kWh/year, 2001-2017 — Energy Star’s projections, projections without voltage reduction, and Energy Star projections up to 2011]<br />
• Projections for 2011 are already off<br />
• Exponential increase in usage<br />
What lies ahead?<br />
For the next ten years:<br />
• CMOS is still the cheapest technology<br />
But,<br />
• need ~100x reduction in energy just to keep up with Moore’s Law<br />
Chip design recommendations:<br />
Short-term, lean chips (squeeze all fat)<br />
Cores, caches, NoC<br />
Long-term, cannot power up all of chip<br />
Live with “<strong>dark</strong> <strong>silicon</strong>”, specialize<br />
ecocloud.ch<br />
10 faculty, CSEM & industrial affiliates<br />
HP, Intel, IBM, Microsoft, …<br />
Research:<br />
• Energy-minimal cloud computing<br />
• Elastic data bricks and storage<br />
• Scalable cloud applications & services<br />
Making tomorrow’s clouds green & sustainable<br />
Outline<br />
• Where are we?<br />
➔ Energy scalability for servers<br />
• Where do we go from here?<br />
• Future on-chip caches<br />
• Future NoCs<br />
• Summary<br />
Where does server energy go?<br />
Many sources of power consumption:<br />
• Server only [Fan, ISCA’07]<br />
Processor chips (37%)<br />
Memory (17%)<br />
Peripherals (29%)<br />
…<br />
• Infrastructure (another 50%)<br />
Cooling<br />
Power distribution<br />
How did we get here?<br />
Leakage killed the supply voltage<br />
Historically,<br />
Power ∝ V²f (V: voltage, f: frequency)<br />
Four decades of reducing V to keep up<br />
But, we can no longer reduce V due to leakage!<br />
Exponential in area<br />
Exponential in temperature<br />
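A quick arithmetic sketch of why stalled voltage scaling turns Moore’s Law into an energy problem. The scaling factors below are idealized textbook values, purely illustrative:

```python
# Classic dynamic-power relation P ~ C * V^2 * f, applied across one
# process generation (~2x transistors in the same area).
def dynamic_power(c, v, f):
    """Dynamic switching power: capacitance * voltage^2 * frequency."""
    return c * v ** 2 * f

base = dynamic_power(c=1.0, v=1.0, f=1.0)

# Ideal (Dennard) scaling: per-device C and V both shrink ~0.7x per
# generation, f rises ~1/0.7x, so 2x devices cost roughly the same power.
dennard = 2 * dynamic_power(c=0.7, v=0.7, f=1.0 / 0.7)

# Voltage leveled off: same generation, but V stays at 1.0.
stalled = 2 * dynamic_power(c=0.7, v=1.0, f=1.0 / 0.7)

print(round(dennard / base, 2))  # ~1.0: chip power stays flat
print(round(stalled / base, 2))  # ~2.0: chip power doubles every generation
```

With V stuck, each generation roughly doubles power — exactly the "exponential increase in energy usage every generation" from the earlier slide.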
Voltages have already leveled off<br />
[Chart: power supply V_dd (0-1.4V) vs. year, 2002-2022 — ITRS projections from 2002, 2004, 2007, and 2008 vs. today’s actual value: > 2x error]<br />
ITRS estimates for today were off by > 2x<br />
A Study of Server Chip Scalability<br />
Actual server workloads today<br />
Easily parallelizable (performance-scalable)<br />
Actual physical char. of processors/memory<br />
ITRS projections for technology nodes<br />
Modeled power/performance across nodes<br />
For server chips<br />
Bandwidth is near-term limiter<br />
→ Energy is the ultimate limiter<br />
A few words about our model<br />
Physical char. modeled after Niagara, scaled across tech. nodes<br />
Area: cores/caches (72% of die)<br />
Power:<br />
Active: projected V_dd (ITRS)<br />
Core = scaled, cache = f(miss), crossbar = f(hops)<br />
Leakage: projected V_th (ITRS), f(area), 62°C<br />
Performance:<br />
Parameters from real server workloads (DB2, Oracle, Apache, Zeus)<br />
Cache miss rate model (validated)<br />
CPI model based on miss rate<br />
Caveat: Simple<br />
Parallelizable Workloads<br />
Workloads are assumed parallel<br />
• Scaling server workloads is reasonable<br />
CPI model:<br />
• Works well for workloads with low MLP<br />
• OLTP, Web & DSS are mostly memory-latency dependent<br />
Future servers will run a mix of workloads<br />
Area vs. Power Envelope (22nm)<br />
Good news: can fit hundreds of cores<br />
× Cannot use them all at highest speed<br />
Of course, one could pack more, slower cores and cheaper cache<br />
• Result: a performance/power trade-off<br />
• Assuming bandwidth is unlimited<br />
But, limited pin b/w favors<br />
fewer cores + more cache<br />
• For clarity, only showing two bandwidth lines<br />
• Where would the best performance be?<br />
Peak Performing with<br />
Conventional Memory<br />
• B/W constrained, then power constrained<br />
• Fewer slower cores, lots of cache<br />
Mitigating B/W Limitations:<br />
3D-stacked Memory [Loh, ISCA’08]<br />
• Delivers TB/sec of bandwidth<br />
Peak Performing w/<br />
3D-stacked Memory<br />
• Only power-constrained<br />
• Virtually eliminates on-chip cache<br />
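The trade-off on these slides — pin bandwidth binds first, then the power envelope — can be sketched as a toy model. The function name and every number below are illustrative, not the study’s actual parameters:

```python
def active_cores(n_fit, core_w, budget_w, pin_bw, bw_per_core):
    """Cores that can actually run at full speed, capped by area (n_fit),
    the chip power envelope (watts), and off-chip bandwidth (GB/s)."""
    by_power = budget_w / core_w        # cores the envelope can feed
    by_bw = pin_bw / bw_per_core        # cores the pins can feed
    return int(min(n_fit, by_power, by_bw))

# Conventional memory: pin bandwidth is the near-term limiter.
print(active_cores(n_fit=512, core_w=0.5, budget_w=130,
                   pin_bw=100, bw_per_core=1))   # 100 cores

# 3D-stacked memory (TB/s-class bandwidth): only power limits.
print(active_cores(n_fit=512, core_w=0.5, budget_w=130,
                   pin_bw=4000, bw_per_core=1))  # 260 cores
```

Under these made-up numbers, lifting the bandwidth cap with 3D stacking exposes power as the ultimate limiter — the study’s conclusion.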
Core Scaling across Technologies<br />
[Chart: core count vs. technology node, at peak frequency and at peak performance]<br />
• Assumes a 130-Watt chip envelope<br />
• Pin b/w keeps Niagara from scaling<br />
Niagara + 3D-stacked Memory<br />
[Chart: number of cores (1-1024, log scale) vs. year of technology introduction (2004-2019) — area-limited core count vs. power-limited counts for DB2 TPC-C, DB2 TPC-H, and Apache with 3D memory, at peak frequency and peak performance]<br />
• Power limits Niagara to 75% area!<br />
But, even Niagara is overkill!<br />
Servers mostly access memory<br />
Benefit little from core complexity<br />
Niagara cores are too big!<br />
E.g., Kgil et al., ASPLOS’06:<br />
• Servers on embedded cores + 3D<br />
Can we run servers with embedded cores?<br />
ARM9 + 3D-stacked Memory<br />
[Chart: number of cores (1-1024, log scale) vs. year of technology introduction (2004-2019) — area-limited vs. power-limited core counts for DB2 TPC-C, DB2 TPC-H, and Apache with 3D memory, at peak frequency and peak performance]<br />
• Cannot scale within a 130-Watt envelope!<br />
• On-chip hierarchy + interconnect not scalable<br />
Long-term:<br />
Where to go from here?<br />
1. Redo SW stack<br />
Minimize joules/work (from algorithms down to HW)<br />
Program for locality + heterogeneity<br />
2. Pray for technology<br />
Energy-scalable <strong>silicon</strong> devices<br />
Emerging nanoscale technologies?<br />
3. Infrastructure technology<br />
Renewable/carbon-neutral energy<br />
Scalable cooling + power delivery<br />
Short-term Scaling Implications<br />
• Caches are getting huge<br />
Need cache architectures to deal with >> MB<br />
E.g., Reactive NUCA [ISCA’09]<br />
• Interconnect + cache hierarchy power<br />
Need lean on-chip communication/storage<br />
Eurocloud chip: ARM+3D [ACLD’10]<br />
• Dark Silicon<br />
Specialized processors<br />
Use only parts of the chip at a time<br />
Outline<br />
• Where are we?<br />
• Energy scalability for servers<br />
• Where do we go from here<br />
➔ Future on-chip caches<br />
• Future NoCs<br />
• Summary<br />
Optimal Data Placement in Large<br />
On-chip Caches<br />
[Diagram: 8x4 tiled CMP — each tile pairs a core with an L2 cache slice]<br />
" Data placement determines performance<br />
" Goal: place data on chip close to where they are used<br />
Prior Work<br />
• Several proposals for CMP cache management<br />
ASR, cooperative caching, victim replication,<br />
CMP-NuRapid, D-NUCA<br />
• ...but suffer from shortcomings<br />
complex, high-latency lookup/coherence<br />
don’t scale<br />
lower effective cache capacity<br />
optimize only for subset of accesses<br />
We need:<br />
" Simple, scalable mechanism for fast access to all data<br />
Our Proposal: Reactive NUCA<br />
[ISCA’09, IEEE Micro Top Picks ‘10]<br />
• Cache accesses can be classified at run-time<br />
Each class amenable to different placement<br />
• Per-class block placement<br />
Simple, scalable, transparent<br />
No need for HW coherence mechanisms at LLC<br />
• Speedup<br />
Up to 32% speedup<br />
Within 5% on avg. of an ideal cache organization<br />
Terminology: Data Types<br />
[Diagram: three access patterns — a single core reading and writing a block (Private), two cores both reading a block (Shared Read-Only), and one core writing while another reads (Shared Read-Write)]<br />
Conventional Multicore Caches<br />
Shared (address-interleave blocks):<br />
+ High effective capacity<br />
− Slow access<br />
Private (each block cached locally):<br />
+ Fast access (local)<br />
− Low capacity (replicas)<br />
− Coherence via indirection (distributed directory)<br />
✓ We want: high capacity (shared) + fast access (private)<br />
Where to Place the Data?<br />
• Close to where they are used!<br />
• Accessed by single core: migrate locally<br />
• Accessed by many cores: replicate (?)<br />
If read-only, replication is OK<br />
If read-write, coherence a problem<br />
Low reuse: evenly distribute across sharers<br />
[Diagram: placement policy vs. number of sharers — migrate private data, replicate read-only data, share (address-interleave) read-write data]<br />
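The decision rule above can be written as a tiny function. This is a hypothetical helper for illustration, not the actual R-NUCA mechanism:

```python
def place(sharers, read_write):
    """Pick a placement for a cache block from its observed access class."""
    if sharers == 1:
        return "migrate"     # private: keep in the accessing core's slice
    if not read_write:
        return "replicate"   # shared read-only: copies near each reader
    return "share"           # shared read-write: one address-interleaved home

print(place(sharers=1, read_write=True))   # migrate
print(place(sharers=8, read_write=False))  # replicate
print(place(sharers=8, read_write=True))   # share
```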
Methodology<br />
Flexus: Full-system cycle-accurate timing simulation<br />
Workloads:<br />
• OLTP: TPC-C 3.0, 100 warehouses (IBM DB2 v8, Oracle 10g)<br />
• DSS: TPC-H queries 6, 8, 13 (IBM DB2 v8)<br />
• Web: SPECweb99 on Apache 2.0<br />
• Multiprogrammed: SPEC2K<br />
• Scientific: em3d<br />
Model parameters:<br />
• Tiled CMP, LLC = L2<br />
• 16 cores, 1MB L2 per core<br />
• OoO cores, 2GHz, 96-entry ROB<br />
• Folded 2D torus (2-cycle router, 1-cycle link)<br />
• 45ns memory<br />
Cache Access Classification<br />
• Each bubble: cache blocks shared by x cores<br />
• Size of bubble proportional to % L2 accesses<br />
• y axis: % blocks in bubble that are read-write<br />
Cache Access Clustering<br />
[Charts: % R/W blocks vs. number of sharers for server apps and scientific/MP apps, annotated with the migrate, replicate, and share (address-interleave) regions]<br />
✓ Accesses naturally form 3 clusters<br />
Instruction Replication<br />
• Instruction working set too large for one<br />
cache slice<br />
[Diagram: tiled CMP — instruction blocks distributed across a cluster of neighboring L2 slices, with the cluster’s contents replicated in each neighborhood]<br />
✓ Distribute in a cluster of neighbors, replicate across clusters<br />
Coherence:<br />
No Need for HW Mechanisms at LLC<br />
• Reactive NUCA placement guarantee<br />
Each R/W datum in unique & known location<br />
Shared data: addr-interleave<br />
Private data: local slice<br />
[Diagram: tiled CMP — each R/W block lives in exactly one slice]<br />
✓ Fast access, eliminates HW overhead<br />
Evaluation<br />
[Chart: speedup of ASR (A), Shared (S), R-NUCA (R), and Ideal (I) across workloads]<br />
✓ Delivers robust performance across workloads<br />
✓ vs. Shared: same for Web, DSS; 17% better for OLTP, MIX<br />
✓ vs. Private: 17% better for OLTP, Web, DSS; same for MIX<br />
R-NUCA Conclusions<br />
Near-optimal block placement<br />
and replication in distributed caches<br />
• Cache accesses can be classified at run-time<br />
Each class amenable to different placement<br />
• Reactive NUCA: placement of each class<br />
Simple, scalable, low-overhead, transparent<br />
Obviates HW coherence mechanisms for LLC<br />
• Robust performance across server workloads<br />
Near-optimal placement (within 5% of ideal on avg.)<br />
Outline<br />
• Overview<br />
• Where are we?<br />
• Energy scalability for servers<br />
• Where do we go from here?<br />
• Future on-chip caches<br />
➔ Future NoCs<br />
• Summary<br />
Optimal Interconnect<br />
On-chip interconnect is an energy hog!<br />
• Over 30% of chip energy (e.g., SCC, RAW)<br />
Modern NoCs:<br />
• Optimize for worst-case traffic<br />
• Virtualize req/rep to avoid protocol deadlock<br />
But,<br />
• Cache-coherent chips have bimodal traffic<br />
Requests are control, replies are blocks<br />
Typically just load cache blocks<br />
CCNoC:<br />
Optimal for Bimodal Traffic<br />
• Narrow request plane<br />
• Full-width response<br />
plane<br />
• No need for virtual<br />
channels<br />
➝ 30%-40% power<br />
improvement<br />
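The idea can be sketched as two physically separate planes, one per traffic mode. The plane widths and message sizes below are illustrative assumptions, not the actual CCNoC design parameters:

```python
PLANES = {
    "request": {"width_bytes": 8},    # narrow: control messages (addr + cmd)
    "response": {"width_bytes": 72},  # wide: 64B cache block + header
}

def route(msg_type, size_bytes):
    """Requests and responses travel on disjoint planes, so a reply can
    never be blocked behind a request: no protocol deadlock, no virtual
    channels needed."""
    plane = "request" if msg_type == "req" else "response"
    flits = -(-size_bytes // PLANES[plane]["width_bytes"])  # ceiling division
    return plane, flits

print(route("req", 8))    # ('request', 1)
print(route("resp", 72))  # ('response', 1)
```

Sizing each plane to its traffic mode is where the power saving comes from: the common case of a short request no longer pays for a block-wide datapath.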
Bringing it all together:<br />
The EuroCloud Chip<br />
(www.eurocloudserver.com)<br />
Datacenters with mobile<br />
processors<br />
• ARM cores<br />
Will likely have to be<br />
multithreaded!<br />
• 3D-stacked memory<br />
• Nokia’s Ovi Cloud<br />
applications<br />
Your 1-Watt Future<br />
Datacenter Chip<br />
[ACLD’10]<br />
Design for Dark Silicon<br />
Long-term: Vertically Integrate<br />
Cannot power up the entire chip?<br />
➠ Specialize!<br />
Vertically-integrated server architecture (VISA):<br />
Identify services which are energy hogs<br />
Integrate SW/HW to minimize energy/service<br />
Provide a service API, not an ISA<br />
E.g., Intel’s TCP/IP processor @ 1W<br />
Good places to start: OS, DBMS, machine learning<br />
Summary<br />
• Moore’s law continues (for another decade)<br />
• CMOS is still cheap<br />
• But, energy scaling has slowed down<br />
Recommendation: Energy-Centric Computing<br />
• Can’t get there with parallelism alone<br />
• Holistic approach to energy<br />
Time to put the “embedded”<br />
into all of computing!<br />
For more information<br />
please visit us online at parsa.epfl.ch, ecocloud.ch<br />