
slides - Computing Systems Laboratory


Introduction
• Cores in CMPs typically share some level of the memory hierarchy
• Applications compete for the limited shared space
• Need for efficient use of the shared cache
– Requests to off-chip memory are expensive (latency and power)


Introduction
• LRU (or approximations) is typically employed
• Partitions the cache implicitly on a demand basis
– The application with the highest demand gets the majority of the cache resources
– Could be suboptimal (e.g. streaming applications)
• Thread-blind policy
– Cannot detect and deal with inter-thread interference


Motivation
• Applications can be classified into 3 different categories [Qureshi and Patt, MICRO ’06]
• High Utility
– Applications that continue to benefit significantly as the cache space is increased


Motivation
• Low Utility
– Applications that do not benefit significantly as the cache space is gradually increased


Motivation
• Saturating Utility
– Applications that initially benefit as the cache space is increased, but whose benefit then levels off

Target : Exploit the differences in the cache utility of concurrently executed applications
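A minimal sketch (not from the slides) of how these three categories could be told apart from a measured utility curve, i.e. hits as a function of allocated ways. The function name and the 5% step threshold are illustrative assumptions, not part of the original work:

    def classify(hits_by_ways, step_threshold=0.05):
        """Label a utility curve; hits_by_ways[i] = hits at i+1 ways."""
        total = hits_by_ways[-1] or 1
        # Fraction of the final hit count gained by each extra way.
        steps = [(b - a) / total
                 for a, b in zip(hits_by_ways, hits_by_ways[1:])]
        if all(s > step_threshold for s in steps):
            return "high utility"        # keeps benefiting with every way
        if all(s <= step_threshold for s in steps):
            return "low utility"         # extra ways barely help
        return "saturating utility"      # benefits early, then flattens

    print(classify([10, 30, 50, 70, 90, 110]))       # high utility
    print(classify([100, 101, 102, 102, 103, 103]))  # low utility
    print(classify([10, 60, 90, 95, 96, 96]))        # saturating utility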


Static Cache Partitioning


Static Cache Partitioning
• Two major drawbacks
– The system must be aware of each application’s profile
– Partitions remain the same throughout the execution; programs are known to have distinct phases of behaviour
• Need for a scheme that can partition the cache dynamically
– Acquire the applications’ profiles at run-time
– Repartition when the phase of an application changes


Dynamic Cache Partitioning
• LRU’s “stack property” [Mattson et al. 1970]
“An access that hits in an N-way associative cache using the LRU replacement policy is guaranteed to also hit if the cache had more than N ways, provided that the number of sets remains the same.”
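A minimal sketch (not from the slides) illustrating the stack property on a single cache set: an access hits in an N-way LRU set exactly when its stack distance is below N, so the hit count can only grow as ways are added. The function names and the trace are illustrative:

    def stack_distances(accesses):
        """Yield the LRU stack distance of each access (inf on first touch)."""
        stack = []  # most-recently-used tag sits at index 0
        for tag in accesses:
            d = stack.index(tag) if tag in stack else float("inf")
            if d != float("inf"):
                stack.pop(d)
            stack.insert(0, tag)
            yield d

    def hits(accesses, ways):
        """Hits this access stream sees in one `ways`-way LRU cache set."""
        return sum(1 for d in stack_distances(accesses) if d < ways)

    trace = ["A", "B", "C", "A", "B", "D", "A", "C"]
    for n in range(1, 5):
        print(f"{n}-way: {hits(trace, n)} hits")  # 0, 0, 3, 4 -- never decreases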


ABFCP : Overview
• Adaptive Bloom Filter Cache Partitioning (ABFCP)
• Partitioning Module
– Tracks misses and hits
– Partitioning algorithm
– Replacement support to enforce partitions
[Figure: CMP with cores 0 to n, each with private I-cache and D-cache; a Partitioning Module sits alongside the shared L2 cache, which connects to DRAM]


ABFCP : Tracking system
• Far Misses
– Misses that would have been hits had the application been allowed to use more cache ways
– Tracked by Bloom filters
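A minimal sketch (not from the slides) of the idea, assuming a per-core Bloom filter that records tags evicted from that core’s partition; on a later miss, a filter hit suggests the line would have survived with more ways, i.e. a far miss. The hash construction and sizes are illustrative, and Bloom filters can give false positives, so the far-miss count is an estimate:

    import hashlib

    class BloomFilter:
        def __init__(self, bits=256, hashes=3):
            self.bits, self.hashes = bits, hashes
            self.array = [False] * bits

        def _positions(self, tag):
            # Derive `hashes` bit positions from the tag (illustrative choice).
            for i in range(self.hashes):
                digest = hashlib.sha256(f"{i}:{tag}".encode()).digest()
                yield int.from_bytes(digest[:4], "big") % self.bits

        def insert(self, tag):
            for p in self._positions(tag):
                self.array[p] = True

        def maybe_contains(self, tag):
            return all(self.array[p] for p in self._positions(tag))

    far_miss_filter = BloomFilter()
    far_miss_filter.insert(0x1A2B)  # tag just evicted from this core's ways
    # Later miss on the same tag: filter hit -> count it as a far miss.
    print(far_miss_filter.maybe_contains(0x1A2B))  # True
    print(far_miss_filter.maybe_contains(0x9999))  # almost certainly False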


ABFCP : Partitioning Algorithm
• 2 counters per core per cache set
– C_LRU
– C_FarMiss
• Each core’s allocation can be changed by ±1 way
• Estimate performance loss/gain
– −1 way : hits in the LRU position will become misses
perf. loss → C_LRU
– +1 way : a portion of the far misses will become hits
perf. gain → a * C_FarMiss, where a = (1 − ways/assoc)
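A minimal sketch (not from the slides) of these two estimates; the function names and the example figures are illustrative:

    def estimated_loss(c_lru):
        """Hits that become misses if this core loses one way."""
        return c_lru

    def estimated_gain(c_far_miss, ways, assoc):
        """Far misses expected to become hits with one extra way,
        scaled by a = (1 - ways/assoc)."""
        return (1 - ways / assoc) * c_far_miss

    # A core holding 8 of 32 ways, with 40 far misses and 10 LRU-position hits:
    print(estimated_gain(40, ways=8, assoc=32))  # 30.0 hits gained by +1 way
    print(estimated_loss(10))                    # 10 hits lost by -1 way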


ABFCP : Partitioning Algorithm
• Select the partition that maximises performance (hits)
• Complexity
– cores = 2 → possible partitions = 3
– cores = 4 → possible partitions = 19
– cores = 8 → possible partitions = 1107
– cores = 16 → possible partitions = 5196627
• Linear algorithm that selects the best partition or a good approximation thereof
– N/2 comparisons (worst case) → O(N)
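The slides do not spell out the selection step, so the following is only one plausible reading, not necessarily the authors’ algorithm: move a way from the core with the smallest estimated loss to the core with the largest estimated gain when the trade is profitable. This is a single-transfer simplification of the full ±1-way scheme:

    def repartition(gains, losses, allocation):
        """gains[i]/losses[i]: estimated hit gain/loss for core i at +/-1 way;
        allocation[i]: ways currently assigned to core i."""
        winner = max(range(len(gains)), key=lambda i: gains[i])
        # A donor must be a different core and keep at least one way.
        donors = [i for i in range(len(losses))
                  if i != winner and allocation[i] > 1]
        if donors:
            loser = min(donors, key=lambda i: losses[i])
            if gains[winner] > losses[loser]:
                allocation[winner] += 1
                allocation[loser] -= 1
        return allocation

    print(repartition(gains=[30, 2], losses=[10, 4], allocation=[8, 24]))
    # -> [9, 23]: core 0 gains a way (30 hits gained > 4 hits lost by core 1)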


ABFCP : Way Partitioning
• Way partitioning support [Suh et al. HPCA ’02, Qureshi and Patt MICRO ’06]
• Each line has a core-id field
• On a miss, the ways occupied by the miss-causing application are counted
– ways_occupied < partition_limit → the victim is the LRU line of another application
– Otherwise the victim is the LRU line of the miss-causing application
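A minimal sketch (not from the slides) of this replacement rule; the data layout, an MRU-to-LRU list of (tag, core-id) pairs, is an illustrative assumption:

    def pick_victim(cache_set, requester, partition_limit):
        """cache_set: (tag, core_id) pairs ordered MRU -> LRU.
        Returns the index of the line to evict on the requester's miss."""
        occupied = sum(1 for _, owner in cache_set if owner == requester)
        if occupied < partition_limit:
            # Under quota: evict the LRU line belonging to another core.
            for i in range(len(cache_set) - 1, -1, -1):
                if cache_set[i][1] != requester:
                    return i
        # At quota (or no foreign line found): evict the requester's own LRU line.
        for i in range(len(cache_set) - 1, -1, -1):
            if cache_set[i][1] == requester:
                return i

    cache_set = [(0x10, 0), (0x22, 1), (0x31, 0), (0x44, 1)]  # MRU -> LRU
    print(pick_victim(cache_set, requester=0, partition_limit=2))  # 2 (own LRU)
    print(pick_victim(cache_set, requester=0, partition_limit=3))  # 3 (core 1's LRU)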


Evaluation
• Configuration
– 2, 4, 8 single-issue, in-order cores
– Private L1 I- and D-caches (32KB, 4-way associative, 32B line size, 1-cycle access latency)
– Unified shared on-chip L2 cache (4MB, 32-way associative, 32B line size, 16-cycle access latency)
– Main memory (32 outstanding requests, 100-cycle access latency)
• Benchmarks
– 9 apps from JavaGrande + NAS
– One application per processor
– Simulation stops when one of the benchmarks finishes


Results (Dual core system)


Results (Dual core system)


Results (Quad core system)


Results (Eight core system)


Evaluation
• Increasing promise as the number of cores increases
• Hardware cost per core
– BF arrays (4096 sets × 32b) → 16KB
– Counters (4096 sets × 2 counters × 8b) → 8KB
– L2 cache (240KB tags + 4MB data) → 4336KB
– 0.55% increase in area
• 8-core system
– 48KB for the per-line core-id fields
– Total overhead 240KB → 5.5% increase over the L2
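A quick check (not from the slides) of the per-core overhead arithmetic above:

    BITS_PER_KB = 8 * 1024
    SETS = 4096

    bf_kb = SETS * 32 / BITS_PER_KB          # one 32-bit BF array per set
    counter_kb = SETS * 2 * 8 / BITS_PER_KB  # two 8-bit counters per set
    l2_kb = 240 + 4 * 1024                   # 240KB tags + 4MB data

    print(bf_kb, counter_kb)                             # 16.0 8.0
    print(round((bf_kb + counter_kb) / l2_kb * 100, 2))  # 0.55 (% area increase)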


Evaluation


Related Work
• Cache Partitioning Aware Replacement Policy [Dybdahl et al. HiPC ’06]
– Cannot deal with applications with non-convex miss-rate curves
• Utility-Based Cache Partitioning [Qureshi and Patt MICRO ’06]
– Smaller overhead
– Enforces the same partition over all the cache sets


Conclusions
• It is important to share the cache efficiently in CMPs
• LRU does not achieve optimal sharing of the cache
• Cache partitioning can alleviate its consequences
• ABFCP
– shows increasing promise as the number of cores increases
– provides better performance than LRU at a reasonable cost (a 5.5% overhead increase for an 8-core system achieves results similar to using LRU with a 50% bigger L2 cache)


Any Questions?
Thank you!


Utility-Based Cache Partitioning


Utility-Based Cache Partitioning
• High hardware overhead
• Dynamic Set Sampling (monitor only 32 lines)
– Smaller UMONs
• Enforce the same partition for the whole cache
– Fewer counters


Utility-Based Cache Partitioning


ABFCP Comparison with UCP
• UCP has a lower storage overhead (70KB for an 8-core system)
• If it attempted to partition on a line basis, it would require 11MB per processor
• ABFCP is more robust
• ABFCP performs better as the number of cores increases


ABFCP Comparison with UCP


CPARP


Conclusions


Evaluation
• UCP acquires a more accurate profile than CPARP
• Example
– curr_hits = 135
– if app2 gets 6 ways, then hits = 145 (UCP)
– CPARP does not modify the partition
