Dark Silicon

Babak Falsafi
Parallel Systems Architecture Lab
EPFL
parsa.epfl.ch
ecocloud.ch

© 2010 Babak Falsafi


Energy: Shaping IT's Future

• 40 years of energy scalability
  Doubling transistors every two years
  Quadratic reduction in energy from voltage scaling
• But, while Moore's law continues
  Voltages have started to level off
  ITRS projections made in 2000 for 2009 voltage levels are way off!

An exponential increase in energy usage every generation?


Household IT Energy Usage (from Sun)

[Chart: energy usage of household IT devices, including cell phones (data from Sun)]


Shift towards Cloud Computing Helps Data Centers

• Ubiquitous connectivity & access to data
• Consolidate servers → amortize energy costs


But, the Cloud has hit a wall!

Trends:
• Moore's law continues
• Server density is increasing
• But, voltage scaling has slowed
➔ It's too expensive to buy/cool servers

A 1,000 m² datacenter draws 1.5 MW!
(matching the carbon footprint of airlines in 2012)


Enterprise IT Energy Usage

Kenneth Brill (Uptime Institute)
• "Economic Meltdown of Moore's Law"
• In 2012: lifetime energy cost per server is 50% more than the server's price
  And 2% of the entire US carbon footprint

Energy Star report to Congress:
• Datacenter energy doubled from 2000 to 2006
• Roughly 2% of all electricity & growing
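
As a quick, hedged illustration of how the Energy Star doubling plays out, the snippet below extrapolates datacenter energy use assuming growth simply keeps doubling every six years; the 61 TWh baseline for 2006 is an assumed figure for illustration, not taken from the slides.

```python
# Hedged back-of-the-envelope: if datacenter energy doubled between 2000 and
# 2006 (per the Energy Star report above) and kept growing at that rate,
# this is what the exponential trend alone would project.
# The 61 TWh baseline for 2006 is an illustrative assumption.

def projected_energy_twh(year, base_year=2006, base_twh=61.0, doubling_years=6):
    """Extrapolate energy use assuming a fixed doubling period."""
    return base_twh * 2 ** ((year - base_year) / doubling_years)

for y in (2006, 2011, 2017):
    print(y, round(projected_energy_twh(y), 1), "TWh/year")
```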


Example Projections for Datacenters

[Chart: billion kilowatt-hours/year, 2001-2017, comparing Energy Star's projections up to 2011, Energy Star's longer-term projections, and projections without voltage reduction]

• Projections for 2011 are already off
• Exponential increase in usage


What lies ahead?

For the next ten years:
• CMOS is still the cheapest technology
But,
• Need ~100x reduction in energy just to keep up with Moore's Law

Chip design recommendations:
• Short-term: lean chips (squeeze all fat)
  Cores, caches, NoC
• Long-term: cannot power up all of the chip
  Live with "dark silicon", specialize


ecocloud.ch

10 faculty, CSEM & industrial affiliates
HP, Intel, IBM, Microsoft, …

Research:
• Energy-minimal cloud computing
• Elastic data bricks and storage
• Scalable cloud applications & services

Making tomorrow's clouds green & sustainable


Outline

• Where are we?
➔ Energy scalability for servers
• Where do we go from here?
• Future on-chip caches
• Future NoCs
• Summary


Where does server energy go?

Many sources of power consumption:
• Server only [Fan, ISCA'07]
  Processor chips (37%)
  Memory (17%)
  Peripherals (29%)
  …
• Infrastructure (another 50%)
  Cooling
  Power distribution


How did we get here?

Leakage killed the supply voltage

Historically,
  Power ∝ V²·f   (V: supply voltage, f: clock frequency)

Four decades of reducing V to keep up
But, we can no longer reduce V due to leakage!!
  Exponential in area
  Exponential in temperature
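
To see why flat voltages turn Moore's law into an energy problem, here is a minimal sketch of the P ∝ V²·f relation under two scaling regimes; the normalized numbers (0.7x voltage per node, 2x capacitance) are illustrative assumptions.

```python
# Minimal sketch of Dennard-style scaling vs. today's regime.
# All numbers are normalized, illustrative assumptions, not measurements.

def dynamic_power(c_total, vdd, freq):
    """Classic switching-power relation: P = C * V^2 * f."""
    return c_total * vdd ** 2 * freq

c0, v0, f0 = 1.0, 1.0, 1.0          # normalized baseline chip
p0 = dynamic_power(c0, v0, f0)

# Classic Dennard scaling: voltage drops ~0.7x per node,
# absorbing the ~2x switched capacitance of the next generation.
p_dennard = dynamic_power(2 * c0, 0.7 * v0, f0)

# Post-2005 regime: voltage stays roughly flat, so power tracks transistor count.
p_flat_v = dynamic_power(2 * c0, v0, f0)

print(f"baseline: {p0:.2f}, Dennard next node: {p_dennard:.2f}, flat-V next node: {p_flat_v:.2f}")
```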


Voltages have already leveled off

[Chart: power supply Vdd (0-1.4 V) vs. year, 2002-2022, with successive ITRS estimates from 2002, 2004, 2007, and 2008 overlaid against today's actual value]

ITRS estimates for today were off by > 2x


A Study of Server Chip Scalability

• Actual server workloads today
  Easily parallelizable (performance-scalable)
• Actual physical characteristics of processors/memory
• ITRS projections for technology nodes
• Modeled power/performance across nodes

For server chips:
• Bandwidth is the near-term limiter
➔ Energy is the ultimate limiter


A few words about our model

Physical characteristics modeled after Niagara,
scaled across technology nodes
• Area: cores/caches (72% of die)
• Power:
  Active: projected Vdd (ITRS)
  Core = scaled, cache = f(miss rate), crossbar = f(hops)
  Leakage: projected Vth (ITRS), f(area), at 62 °C
• Performance:
  Parameters from real server workloads (DB2, Oracle, Apache, Zeus)
  Cache miss rate model (validated)
  CPI model based on miss rate (see sketch below)
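
The slide above only lists the model's ingredients; below is a minimal sketch of how such a first-order power/performance model could be wired together. This is a reconstruction for illustration, not the authors' model: the constants, the linear CPI-versus-miss-rate form, and the leakage term are all assumed.

```python
# First-order server-chip model sketch: active + leakage power, CPI from misses.
# All constants are illustrative assumptions, not the study's calibrated values.
from dataclasses import dataclass

@dataclass
class ChipConfig:
    cores: int
    cache_mb: float
    vdd: float             # projected supply voltage at this node (V)
    freq_ghz: float
    core_cap_nf: float     # effective switched capacitance per core (assumed)
    leak_w_per_mm2: float  # leakage density at a fixed 62 C (assumed)
    area_mm2: float

def active_power_w(c: ChipConfig) -> float:
    # P_active ~ N_cores * C * V^2 * f
    return c.cores * c.core_cap_nf * 1e-9 * c.vdd ** 2 * c.freq_ghz * 1e9

def leakage_power_w(c: ChipConfig) -> float:
    # Leakage scales with powered-on area at fixed temperature.
    return c.leak_w_per_mm2 * c.area_mm2

def cpi(miss_rate: float, cpi_base: float = 1.0, miss_penalty_cycles: float = 200.0) -> float:
    # Memory-latency-bound workloads: CPI grows with L2 miss rate (assumed linear form).
    return cpi_base + miss_rate * miss_penalty_cycles

def throughput_gips(c: ChipConfig, miss_rate: float) -> float:
    return c.cores * c.freq_ghz / cpi(miss_rate)

cfg = ChipConfig(cores=64, cache_mb=16, vdd=0.9, freq_ghz=2.0,
                 core_cap_nf=0.5, leak_w_per_mm2=0.2, area_mm2=300)
print("power (W):", round(active_power_w(cfg) + leakage_power_w(cfg), 1),
      "throughput (GIPS):", round(throughput_gips(cfg, miss_rate=0.05), 1))
```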


Caveat: Simple Parallelizable Workloads

Workloads are assumed parallel
• Scaling server workloads is reasonable

CPI model:
• Works well for workloads with low MLP
• OLTP, Web & DSS are mostly memory-latency dependent

Future servers will run a mix of workloads


Area vs. Power Envelope (22nm)

Good news: can fit hundreds of cores
✗ Cannot use them all at highest speed
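
A rough illustration of the area-versus-power tension on this slide: the sketch below computes how many cores fit under an assumed die-area budget versus an assumed 130 W envelope (the per-core area and power figures are made-up placeholders, not the study's numbers).

```python
# Area-vs-power envelope sketch at an assumed 22nm design point.
# Per-core area/power and the die size are illustrative assumptions;
# the 130 W envelope matches the figure used later in the talk.

DIE_AREA_MM2 = 300.0             # assumed die size
CORE_AREA_MM2 = 1.2              # assumed area of a small core + L1s at 22nm
CORE_POWER_FULL_SPEED_W = 1.5    # assumed per-core power at peak frequency
POWER_ENVELOPE_W = 130.0

cores_by_area = int(DIE_AREA_MM2 // CORE_AREA_MM2)
cores_by_power = int(POWER_ENVELOPE_W // CORE_POWER_FULL_SPEED_W)

print(f"fit by area:  {cores_by_area} cores")
print(f"fit by power: {cores_by_power} cores at full speed")
print(f"dark/dim fraction: {1 - cores_by_power / cores_by_area:.0%}")
```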


Of course, one could pack more, slower cores and cheaper cache

• Result: a performance/power trade-off
• Assuming bandwidth is unlimited


But, limited pin b/w favors fewer cores + more cache

• For clarity, only showing two bandwidth lines
• Where would the best performance be?


Peak Performing with Conventional Memory

• B/W constrained, then power constrained
• Fewer slower cores, lots of cache


Mitigating B/W Limitations: 3D-stacked Memory [Loh, ISCA'08]

• Delivers TB/sec of bandwidth


Peak Performing with 3D-stacked Memory

• Only power-constrained
• Virtually eliminates on-chip cache


Core Scaling across Technologies

[Chart: core counts across technology nodes at peak frequency and at peak performance]

• Assumes a 130-Watt chip envelope
• Pin b/w keeps Niagara from scaling


Niagara + 3D-stacked Memory

[Chart: number of cores (1-1024, log scale) vs. year of technology introduction (2004-2019); curves for the area budget, peak-frequency and peak-performance operation, and DB2 TPC-C, DB2 TPC-H, and Apache under the power envelope and with 3D-stacked memory]

• Power limits Niagara to 75% of the die area!


But, even Niagara is overkill!

Servers mostly access memory
• Benefit little from core complexity
• Niagara cores are too big!

E.g., Kgil et al., ASPLOS'06:
• Servers on embedded cores + 3D

Can we run servers with embedded cores?


ARM9 + 3D-stacked Memory

[Chart: number of cores (1-1024, log scale) vs. year of technology introduction (2004-2019); same curves as the Niagara figure, now for ARM9 cores with 3D-stacked memory]

• Cannot scale within a 130-Watt envelope!!!
• On-chip hierarchy + interconnect not scalable


Long-term: Where to go from here?

1. Redo the SW stack
   Minimize joules/work (from algorithms down to HW)
   Program for locality + heterogeneity
2. Pray for technology
   Energy-scalable silicon devices
   Emerging nanoscale technologies?
3. Infrastructure technology
   Renewable/carbon-neutral energy
   Scalable cooling + power delivery


Short-term Scaling Implications

• Caches are getting huge
  Need cache architectures to deal with >> MB
  E.g., Reactive NUCA [ISCA'09]
• Interconnect + cache hierarchy power
  Need lean on-chip communication/storage
  EuroCloud chip: ARM + 3D [ACLD'10]
• Dark Silicon
  Specialized processors
  Use only parts of the chip at a time


Outline

• Where are we?
• Energy scalability for servers
• Where do we go from here?
➔ Future on-chip caches
• Future NoCs
• Summary


Optimal Data Placement in Large On-chip Caches

[Figure: tiled multicore, each tile pairing a core with an L2 cache slice]

• Data placement determines performance
• Goal: place data on chip close to where they are used


Prior Work

• Several proposals for CMP cache management
  ASR, cooperative caching, victim replication, CMP-NuRapid, D-NUCA
• ...but they suffer from shortcomings
  complex, high-latency lookup/coherence
  don't scale
  lower effective cache capacity
  optimize only for a subset of accesses

We need:
• A simple, scalable mechanism for fast access to all data


Our Proposal: Reactive NUCA
[ISCA'09, IEEE Micro Top Picks '10]

• Cache accesses can be classified at run-time
  Each class amenable to a different placement
• Per-class block placement
  Simple, scalable, transparent
  No need for HW coherence mechanisms at the LLC
• Speedup
  Up to 32% speedup
  Within 5% on avg. of an ideal cache organization


Terminology: Data Types

[Figure: access patterns defining three data classes]
• Private: read or written by a single core
• Shared Read-Only: read by multiple cores
• Shared Read-Write: read and written by multiple cores


Conventional Multicore Caches

Shared (address-interleave blocks across slices):
+ High effective capacity
− Slow access

Private (each block cached locally):
+ Fast access (local)
− Low capacity (replicas)
− Coherence via indirection (distributed directory)

• We want: high capacity (shared) + fast access (private)


Where to Place the Data?

• Close to where they are used!
• Accessed by a single core: migrate locally
• Accessed by many cores: replicate (?)
  If read-only, replication is OK
  If read-write, coherence is a problem
  Low reuse: evenly distribute across sharers

[Figure: placement decision by data class and sharer count — read-only data with many sharers is replicated, read-write data is shared, single-sharer data migrates]
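
The placement policy on this slide reduces to a tiny decision rule. The sketch below is one way to express it; the enum names and the exact rule boundaries are assumptions for illustration, not R-NUCA's published mechanism.

```python
# Sketch of the per-class placement rule described above.
# Class names and rule boundaries are illustrative assumptions.
from enum import Enum, auto

class Placement(Enum):
    MIGRATE_LOCAL = auto()      # private data: keep it at its only user's slice
    REPLICATE = auto()          # shared read-only data: copy near each reader
    SHARE_INTERLEAVED = auto()  # shared read-write data: one copy, address-interleaved

def place(read_write: bool, sharers: int) -> Placement:
    if sharers <= 1:
        return Placement.MIGRATE_LOCAL
    if not read_write:
        return Placement.REPLICATE      # replication is safe: no coherence needed
    return Placement.SHARE_INTERLEAVED  # read-write: avoid replicas, distribute evenly

# Example: a lock variable touched by 8 cores is shared read-write.
print(place(read_write=True, sharers=8))   # Placement.SHARE_INTERLEAVED
print(place(read_write=False, sharers=8))  # Placement.REPLICATE
print(place(read_write=True, sharers=1))   # Placement.MIGRATE_LOCAL
```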


Methodology

Flexus: full-system, cycle-accurate timing simulation

Workloads:
• OLTP: TPC-C 3.0, 100 WH (IBM DB2 v8, Oracle 10g)
• DSS: TPC-H Queries 6, 8, 13 (IBM DB2 v8)
• SPECweb99 on Apache 2.0
• Multiprogrammed: SPEC2K
• Scientific: em3d

Model parameters:
• Tiled, LLC = L2
• 16 cores, 1 MB/core
• OoO, 2 GHz, 96-entry ROB
• Folded 2D torus (2-cycle router, 1-cycle link)
• 45 ns memory


Cache Access Classification

• Each bubble: cache blocks shared by x cores
• Size of bubble proportional to % of L2 accesses
• y axis: % of blocks in the bubble that are read-write

[Figure: bubble chart of cache access classes]


Cache Access Clustering

[Figure: classification bubble charts for server apps and scientific/MP apps, annotated with the placement for each cluster — R/W blocks: share (address-interleave); R/O blocks with many sharers: replicate; single-sharer blocks: migrate locally]

• Accesses naturally form 3 clusters


Instruction Replication

• Instruction working set too large for one cache slice

[Figure: tiled multicore with instruction blocks distributed over a cluster of neighboring slices, and the cluster pattern replicated across the chip]

• Distribute within a cluster of neighbors, replicate across clusters
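
A minimal sketch of the idea above: spread instruction blocks over a fixed-size cluster of neighboring slices and let every cluster keep its own copy. The 4x4 mesh, the 2x2 clusters, and the address hash are illustrative assumptions, not the paper's exact rotational-interleaving scheme.

```python
# Sketch: instruction blocks spread over a cluster of neighboring L2 slices,
# with the same set of blocks replicated in every cluster.
# Grid size, cluster size, and the slice-selection hash are assumptions.

MESH_DIM = 4          # 4x4 tiled chip (16 tiles)
CLUSTER_SIZE = 4      # each instruction block lives in 1 of 4 nearby slices

def cluster_of(tile: int) -> int:
    """Group tiles into 2x2 neighborhoods (tiles 0..15 -> clusters 0..3)."""
    row, col = divmod(tile, MESH_DIM)
    return (row // 2) * (MESH_DIM // 2) + (col // 2)

def slice_for_instruction(block_addr: int, requesting_tile: int) -> int:
    """Pick a slice inside the requester's own cluster, by address hash."""
    cluster = cluster_of(requesting_tile)
    members = [t for t in range(MESH_DIM * MESH_DIM) if cluster_of(t) == cluster]
    return members[block_addr % CLUSTER_SIZE]

# The same instruction block maps to a nearby slice for every requester,
# and to a different physical slice in a different cluster (a replica).
print(slice_for_instruction(0x40, requesting_tile=0))   # a slice within cluster 0
print(slice_for_instruction(0x40, requesting_tile=15))  # a slice within cluster 3
```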


Coherence: No Need for HW Mechanisms at the LLC

• Reactive NUCA placement guarantee
  Each R/W datum is in a unique & known location
  Shared data: address-interleaved
  Private data: local slice

[Figure: tiled multicore with core + L2-slice tiles]

• Fast access, eliminates HW overhead
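
To make the placement guarantee concrete, here is a hypothetical lookup sketch: a shared block lives at an address-interleaved slice and a private block lives in its owner's local slice, so the home location is computable without any directory. The page-classification table and the hash below are assumptions for illustration.

```python
# Sketch of R-NUCA-style LLC lookup: the location is a pure function of the
# block's class and address, so no directory indirection is required.
# The page-granularity classification dict and the hash are assumptions.

N_SLICES = 16
PAGE_SHIFT = 12

# Hypothetical per-page classification (in R-NUCA this bookkeeping is OS/TLB-assisted).
page_class = {}   # page number -> "private" or "shared"
page_owner = {}   # page number -> owning core for private pages

def home_slice(block_addr: int, requesting_core: int) -> int:
    page = block_addr >> PAGE_SHIFT
    cls = page_class.get(page, "private")
    if cls == "shared":
        # Shared read-write data: single copy, address-interleaved across all slices.
        return (block_addr >> 6) % N_SLICES
    # Private data: always in the owner's local slice (here, the requester).
    return page_owner.get(page, requesting_core)

# Example: mark one page shared, another private to core 3.
page_class[0x100] = "shared"
page_class[0x200] = "private"
page_owner[0x200] = 3

print(home_slice(0x100040, requesting_core=7))  # interleaved slice for shared data
print(home_slice(0x200000, requesting_core=3))  # core 3's local slice
```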


Evaluation

Compared schemes: ASR (A), Shared (S), R-NUCA (R), Ideal (I)

[Figure: performance of each scheme across OLTP, Web, DSS, and MIX workloads]

• R-NUCA delivers robust performance across workloads
• vs. Shared: same for Web, DSS; 17% better for OLTP, MIX
• vs. Private: 17% better for OLTP, Web, DSS; same for MIX


R-NUCA Conclusions

Near-optimal block placement and replication in distributed caches
• Cache accesses can be classified at run-time
  Each class amenable to a different placement
• Reactive NUCA: per-class placement
  Simple, scalable, low-overhead, transparent
  Obviates HW coherence mechanisms for the LLC
• Robust performance across server workloads
  Near-optimal placement (within 5% of ideal on avg.)


Outline

• Overview
• Where are we?
• Energy scalability for servers
• Where do we go from here?
• Future on-chip caches
➔ Future NoCs
• Summary


Optimal Interconnect

On-chip interconnect is an energy hog!
• Over 30% of chip energy (e.g., SCC, RAW)

Modern NoCs:
• Optimize for worst-case traffic
• Virtualize request/reply channels to avoid protocol deadlock

But,
• Cache-coherent chips have bimodal traffic
  Requests are control messages, replies are cache blocks
  Typically just load cache blocks


CCNoC: Optimal for Bimodal Traffic

• Narrow request plane
• Full-width response plane
• No need for virtual channels
➝ 30%-40% power improvement
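
As a rough illustration of why two asymmetric planes save energy, the sketch below compares link energy for a single wide NoC against a narrow request plane plus a full-width reply plane; the flit widths, traffic mix, and per-bit energy are assumptions, not CCNoC's reported parameters.

```python
# Back-of-the-envelope comparison: one wide NoC vs. CCNoC-style dual planes.
# Flit widths, traffic mix, and energy-per-bit-hop are illustrative assumptions.

E_PER_BIT_HOP = 1.0        # normalized link+router energy per bit per hop
AVG_HOPS = 4
REQ_BITS = 64              # control message (address + command), assumed
REPLY_BITS = 512 + 64      # 64 B cache block + header, assumed

def energy(msg_bits: int, channel_bits: int) -> float:
    """Energy of one message: flits traversed * channel width * hops."""
    flits = -(-msg_bits // channel_bits)   # ceiling division
    return flits * channel_bits * AVG_HOPS * E_PER_BIT_HOP

# Traffic is bimodal: roughly one request per reply.
single_wide = energy(REQ_BITS, 576) + energy(REPLY_BITS, 576)
dual_plane  = energy(REQ_BITS, 64)  + energy(REPLY_BITS, 576)

print(f"single wide plane: {single_wide:.0f}")
print(f"dual planes:       {dual_plane:.0f}  "
      f"({1 - dual_plane / single_wide:.0%} lower link energy)")
```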


Bringing it all together: The EuroCloud Chip
(www.eurocloudserver.com)

Datacenters with mobile processors
• ARM cores
  Will likely have to be multithreaded!
• 3D-stacked memory
• Nokia's Ovi Cloud applications

Your 1-Watt future datacenter chip [ACLD'10]

UCyprus


Design for Dark Silicon
Long-term: Vertically Integrate

Cannot power up the entire chip?
➠ Specialize!

Vertically-Integrated Server Architecture (VISA):
• Identify services which are energy hogs
• Integrate SW/HW to minimize energy/service
• Provide a service API, not an ISA
  E.g., Intel's TCP/IP processor @ 1W

Good places to start: OS, DBMS, machine learning


Summary

• Moore's law continues (for another decade)
• CMOS is still cheap
• But, energy scaling has slowed down

Recommendation: Energy-Centric Computing
• Can't get there with parallelism alone
• Holistic approach to energy

Time to put the "embedded" into all of computing!


For more information, please visit us online at parsa.epfl.ch and ecocloud.ch
