Slides - Stanford PPL - Stanford University

ppl.stanford.edu


THE ZCACHE: DECOUPLING WAYS AND ASSOCIATIVITY

Daniel Sanchez and Christos Kozyrakis

Stanford University

MICRO-43, December 6th, 2010


Executive Summary



Mitigating the memory wall requires large, highly associative caches

Last-level caches take ~50% chip area, have 24-32 ways in latest CMPs

More ways → large energy, latency and area overheads


ZCache: A highly associative cache with a low number of ways

Improves associativity by increasing number of replacement candidates

Retains low energy/hit, latency and area of caches with few ways

Based on skew-associative caches and cuckoo hashing


Analytical framework explains why zcache works

Associativity depends on number of replacement candidates, not ways or locations a block can be in


Outline


Introduction

ZCache

Analytical Framework

Evaluation


Introduction


Uses of high associativity:

Improve performance by reducing conflict misses

Partitioning, pinning, storing speculative data (e.g., TM, TLS)

Increasing number of ways affects area, delay, energy

[Figure: increase over 4-way (%) in area, hit latency and hit energy for 16-way and 32-way caches; one bar is labeled 101%]

[Figure: IPC improvement vs 4-way (%) for 16-way and 32-way caches on rand0 and ammp_m]


Techniques for high associativity (1/2): Increase number of locations

Allow multiple locations per way

Column-associative caches [Agarwal93], set-balancing cache [Rolan09], …

Hit latency ↑, hit energy ↑

Use a victim cache

VC [Jouppi90], Scavenger [Basu07], …

Area ↑, hit latency ↑, hit energy ↑

Use indirection in the tag array

IIC [Hallnor00], V-Way cache [Qureshi05]

Area ↑, hit latency ↑, hit energy ↑


Techniques for high associativity (2/2): Better hashing

Use a hash function to index the cache

Simple hashing significantly reduces conflicts [Kharbutli04]

Skew-associative caches [Seznec93]

Index each way using a different hash function

A line conflicts with a different set of lines on each way, reducing conflict misses

No sets, cannot use replacement policy that relies on set ordering

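[Figure: hashed set-associative cache: the line address goes through a single hash function H to produce one set index shared by Way0, Way1 and Way2]

[Figure: skew-associative cache: the line address goes through hash functions H0, H1 and H2 to produce a separate index for each of Way0, Way1 and Way2]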
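As a rough illustration of per-way indexing, the Python sketch below builds three independent hash functions and indexes each way with its own one. The H3-style construction (each index bit is the parity of the address ANDed with a random mask), the 32-bit address width, the 8-row ways and the names `make_h3`, `WAY_HASHES` and `skew_indexes` are assumptions made for this example, not details taken from the slides.

```python
import random


def make_h3(index_bits, addr_bits, rng):
    """Return one H3-style hash: each output bit is the parity of (addr & mask)."""
    masks = [rng.getrandbits(addr_bits) for _ in range(index_bits)]

    def h(addr):
        index = 0
        for bit, mask in enumerate(masks):
            parity = bin(addr & mask).count("1") & 1
            index |= parity << bit
        return index

    return h


# Assumed geometry: 3 ways, 8 rows per way (3 index bits), 32-bit line addresses.
_rng = random.Random(0)
WAY_HASHES = [make_h3(3, 32, _rng) for _ in range(3)]


def skew_indexes(line_addr):
    """Row index of line_addr in each way; one independent hash per way."""
    return [h(line_addr) for h in WAY_HASHES]


print(skew_indexes(0x1234ABCD))
```

Because every way uses a different function, two lines that collide in one way will usually land in different rows of the other ways, which is the property the slide attributes to skew-associative caches.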




The ZCache Design

Lookups and hits happen as in a skew-associative cache

[Figure: the line address is hashed by H0, H1 and H2 to produce the indexes into Way0, Way1 and Way2]

Misses exploit the multiple hash functions to obtain an arbitrarily large number of replacement candidates

Phase 1: Walk the tag array, get best candidate

Phase 2: Move a few lines to fit everything

This happens infrequently (on misses) and off the critical path

Draws on prior research in cuckoo hashing


ZCache Replacement

[Figure sequence: a 3-way zcache (Way 0, Way 1, Way 2) with rows 0 to 7 in each way, animated over one replacement]

1. MISS: Y hashes to H0=5, H1=4, H2=0 and is in none of those locations. Start replacement process while fetching Y.

2. 1st-level candidates: A, D, M (the blocks currently at Y's three locations). Instead of evicting A, can move it and evict K or X.

3. 2nd-level candidates: B K X P Z S (the blocks at the alternative locations of A, D and M).

Addr  Y  A  D  M  B  K  X  P  Z  S
H0    5  5  3  2  3  7  4  2  6  1
H1    4  2  4  5  6  2  3  3  5  2
H2    0  1  7  0  1  0  1  5  3  7

4. The walk continues one more level, forming a tree of candidates rooted at Y:

Y
  A
    K: L M
    X: N E
  D
    B: T X
    S: F K
  M
    P: E Q
    Z: G R

The victim is chosen by the replacement policy (e.g. LRU block).

5. Relocations: in this example N is evicted, X moves into N's location (Way 0, row 4), A moves into X's old location (Way 2, row 1), and Y is written into A's old location (Way 0, row 5).

6. HIT: Hits always take a single lookup. A later access to Y checks H0=5, H1=4, H2=0 and finds it in Way 0.
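The replacement example can be replayed with a short Python sketch, assuming the hash values from the slide table. The array contents are only the subset that can be inferred from that table and the backup timeline; the function names (`walk`, `relocate`, `lookup`) and the explicit choice of N as the victim stand in for hardware and for a real replacement policy, and are assumptions of this sketch.

```python
# (H0, H1, H2) for each block, copied from the slide table.
HASHES = {
    "Y": (5, 4, 0), "A": (5, 2, 1), "D": (3, 4, 7), "M": (2, 5, 0),
    "B": (3, 6, 1), "K": (7, 2, 0), "X": (4, 3, 1), "P": (2, 3, 5),
    "Z": (6, 5, 3), "S": (1, 2, 7),
}


def initial_ways():
    """Partial contents, way -> {row: block}; only the rows the walk touches."""
    return [
        {1: "F", 2: "P", 3: "B", 4: "N", 5: "A", 6: "G", 7: "L"},
        {2: "K", 3: "E", 4: "D", 5: "Z", 6: "T"},
        {0: "M", 1: "X", 3: "R", 5: "Q", 7: "S"},
    ]


def walk(ways, incoming, extra_levels):
    """Breadth-first walk. Each candidate is (block, way, row, parent_index)."""
    cands = []
    frontier = []
    for way, row in enumerate(HASHES[incoming]):
        cands.append((ways[way][row], way, row, None))
        frontier.append(len(cands) - 1)
    for _ in range(extra_levels):
        next_frontier = []
        for idx in frontier:
            block, cur_way, _, _ = cands[idx]
            for way, row in enumerate(HASHES[block]):
                if way == cur_way:
                    continue  # a block has exactly one location per way
                cands.append((ways[way][row], way, row, idx))
                next_frontier.append(len(cands) - 1)
        frontier = next_frontier
    return cands


def relocate(ways, incoming, cands, victim_idx):
    """Evict the victim, then shift its ancestors one hop down the walk path."""
    evicted = cands[victim_idx][0]
    idx = victim_idx
    while idx is not None:
        _, way, row, parent = cands[idx]
        ways[way][row] = incoming if parent is None else cands[parent][0]
        idx = parent
    return evicted


def lookup(ways, block):
    """A hit needs one lookup: check the single candidate row in each way."""
    return [w for w, row in enumerate(HASHES[block]) if ways[w].get(row) == block]


ways = initial_ways()
cands = walk(ways, "Y", extra_levels=2)
victim_idx = [c[0] for c in cands].index("N")  # stand-in for the policy's choice
evicted = relocate(ways, "Y", cands, victim_idx)
print(len(cands), evicted, lookup(ways, "Y"))
```

The sketch ignores a detail a real design has to handle: the same block can appear more than once in the walk (X and K do here), so a relocation path must not move a block twice.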


ZCache Implementation Overview

Replacements take place:

Off the critical path

Concurrently with other operations

Walk accesses are pipelined

Do not saturate tag bandwidth in practice

No effect on hit latency

Energy per miss mostly determined by walk

Similar to set-associative cache of same associativity

Cheap to implement

SRAM with 10s of bits to track candidates

Leverages existing MSHRs

See paper for more details



Number of Candidates

An L-level walk on a W-way zcache gets R candidates:

$$R = W \sum_{n=0}^{L} (W-1)^n$$

 L \ W    2    3    4    8
   0      2    3    4    8
   1      4    9   16   64
   2      6   21   52  456

Few ways (W=4) give many candidates with shallow walks

Ratio of tag bandwidth vs bandwidth of next level limits number of candidates
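The candidate count can be checked numerically. The sketch below evaluates R = W · Σ (W−1)^n for n from 0 to L and rebuilds the slide's table; the function name `num_candidates` is an assumption of this example.

```python
def num_candidates(ways, levels):
    """R = W * sum_{n=0}^{L} (W - 1)^n for a W-way zcache and an L-level walk."""
    return ways * sum((ways - 1) ** n for n in range(levels + 1))


WAYS = (2, 3, 4, 8)
table = {L: [num_candidates(W, L) for W in WAYS] for L in (0, 1, 2)}

for L, row in table.items():
    print(L, row)
```

Each extra level multiplies the newly reached candidates by W−1, since every candidate block has one alternative location in each of the other ways.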




An Analytical Associativity Framework


Comparing associativity across cache designs is hard

Ways do not mean much

Conflict misses are workload and architecture-specific

Goals

Find a general way to characterize associativity

Analyze what determines the performance of a zcache


General Cache Model


Cache array:

Holds tags and data

Implements associative lookup by address

On a replacement, gives list of replacement candidates

Model assumes nothing about array organization

Replacement policy: Maintains a global rank of which cache blocks to replace

All policies conceptually do this (LRU, LFU, OPT, …)

Implementations do not need to maintain the rank explicitly


Associativity Distribution

Eviction priority: Rank of a block given by the replacement function, normalized to [0,1]

Higher is better to evict

Associativity distribution: Probability distribution of the eviction priorities of evicted blocks

Higher associativity → distribution more skewed towards 1.0

Measures how well the array does, not the replacement policy

For good performance, replacement policy also needs to do a good job!
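One way to make the eviction-priority definition concrete is to take LRU as the replacement policy and normalize each block's global rank to [0,1]. The sketch below does this for a handful of blocks; the function name, the use of last-access times as input and the convention that the most recently used block gets priority 0.0 are assumptions for illustration.

```python
def eviction_priorities(last_access):
    """Map block -> eviction priority in [0, 1] under LRU (1.0 = best to evict)."""
    # Most recently used first, so the least recently used block ranks last.
    order = sorted(last_access, key=last_access.get, reverse=True)
    n = len(order)
    if n == 1:
        return {order[0]: 1.0}
    return {block: rank / (n - 1) for rank, block in enumerate(order)}


priorities = eviction_priorities({"a": 10, "b": 5, "c": 1})

# The array offers some candidates; the policy evicts the highest priority one.
offered = ["a", "b"]
victim = max(offered, key=priorities.get)
print(priorities, victim)
```

Recording the priority of each evicted block over many replacements gives the associativity distribution: an array that offers better candidates evicts blocks with priorities closer to 1.0.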


Uniformity Assumption

If the cache array gives R replacement candidates with uniformly distributed priorities,

$$E_1, \ldots, E_R \sim \text{i.i.d. } U[0,1]$$

$$A = \max\{E_1, \ldots, E_R\}$$

$$F_A(x) = P(A \le x) = x^R, \quad x \in [0,1]$$
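The closed form follows because the maximum of R independent draws is at most x only if every draw is at most x, which happens with probability x^R. A small Monte Carlo sketch can confirm it; the function name, trial count and seed are arbitrary choices for this example.

```python
import random


def empirical_cdf(num_candidates, x, trials=20000, seed=1):
    """Estimate P(max of R i.i.d. U[0,1] priorities <= x) by simulation."""
    rng = random.Random(seed)
    below = 0
    for _ in range(trials):
        evicted_priority = max(rng.random() for _ in range(num_candidates))
        if evicted_priority <= x:
            below += 1
    return below / trials


for R in (4, 16, 52):
    print(R, empirical_cdf(R, 0.9), 0.9 ** R)
```

Under the uniformity assumption, the chance that the evicted block falls outside the top 10% of eviction priorities is 0.9^R, which shrinks quickly as R grows.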


Associativity Distributions in Practice

Set-associative caches do significantly worse than UA

Hashing (H3) improves associativity, but is still noticeably worse than UA


Associativity Distributions for ZCaches

Skew-associative caches (1-level zcaches) are very close to UA

Increasing candidates but not ways still yields distributions very close to UA


Analytical Framework: Conclusions

In caches with good hashing, the number of replacement candidates R determines associativity

ZCaches provide a large number of candidates with few ways → Decouple ways and associativity




Methodology

Infrastructure:

CACTI-based models for cache cost estimates

McPAT for full-CMP area, power estimations

Microarchitectural simulation with Pin-based simulator

Target system:

32 in-order x86-64 cores (single-issue, 2GHz, 32KB I/D L1s)

Fully shared L2, 8MB, 8 × 1MB banks (set-assoc/zcache)

All L2 caches use hashing (H3)

72 workloads:

Multithreaded: PARSEC, SPECOMP

Multiprogrammed: SPECCPU2006

See paper for more details



Cache Costs

[Figure: area (mm2), hit latency (ns), hit energy (nJ) and miss energy (nJ) for SA 4-way, SA 16-way, SA 32-way, Z 4/16 and Z 4/52]

Each design is optimized for area*latency*energy

ZCaches:

Retain area, hit latency, hit energy of a 4-way SA cache

Energy per miss comparable to similarly-associative SA cache


Performance and Energy-Efficiency

[Figure: IPC improvement vs 4-way (%) and BIPS/W improvement vs 4-way (%) for SetAssoc 32-way and Z 4-way/52-rc, shown for ammp_m, rand0, cactusADM, gmean (72) and gmean (10)]


Conclusions

ZCaches enable efficient highly-associative caches

Low number of ways

Associativity gained by increasing replacement candidates

Costs of high associativity (energy, tag bandwidth) paid only on misses

Analytical framework shows that replacement candidates determine associativity


THANK YOU FOR YOUR ATTENTION

QUESTIONS?


Backup: Replacement Timeline

[Figure: timeline of the replacement of Y in three phases (miss, walk, relocations); time axis marked 0, 5, 10, 15, 20, 105]

Signal                   Way0                    Way1            Way2
Address for read/write   5 3 2 7 4 6 1 4 5 4 5   4 2 5 6 3 3 2   0 1 7 1 0 5 3 1 1
Tag port out/in          A B P L N G F N A X Y   D K Z T E E K   M X S X M Q R X A
Data port out/in         A N A X Y               D               M X A

Memory bus: Y (fetch on miss), N (writeback, if needed), Y (miss response)


Backup: LRU with coarse-grain timestamps


8-bit timestamp per tag

Tag each block with a global timestamp counter

Increment timestamp every k=5% accesses

Wraparounds are rare

[Figure: distribution of block timestamps over the range 0 to 255, with the current timestamp marked]
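A minimal Python sketch of timestamp-based LRU in this style follows. The class name `CoarseLRU`, the `accesses_per_tick` parameter (standing in for k) and the age computation by modular subtraction are assumptions of this example; the slides only state the 8-bit width and that the counter advances every k accesses.

```python
class CoarseLRU:
    """Approximate LRU using small timestamps and a slowly advancing counter."""

    def __init__(self, accesses_per_tick, bits=8):
        self.accesses_per_tick = accesses_per_tick  # plays the role of k
        self.modulus = 1 << bits                    # 256 values for 8 bits
        self.current = 0                            # global timestamp counter
        self.accesses = 0
        self.timestamp = {}                         # block -> stored timestamp

    def touch(self, block):
        """Tag the block with the current timestamp; advance the counter every k accesses."""
        self.timestamp[block] = self.current
        self.accesses += 1
        if self.accesses == self.accesses_per_tick:
            self.accesses = 0
            self.current = (self.current + 1) % self.modulus

    def age(self, block):
        """Distance from the block's timestamp to now, tolerant of one wraparound."""
        return (self.current - self.timestamp[block]) % self.modulus

    def victim(self, candidates):
        """Pick the oldest candidate; ties go to the first one listed."""
        return max(candidates, key=self.age)


lru = CoarseLRU(accesses_per_tick=2)
for block in ["a", "b", "c", "d"]:
    lru.touch(block)
print(lru.victim(["c", "a", "d"]))
```

Blocks touched within the same tick are indistinguishable, which is the price of the coarse grain; in exchange each tag carries only a few bits of replacement state.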
