

As a consequence, some threads can suffer a severe degradation in performance. Previous work has tried to solve this problem by using static and dynamic partitioning algorithms that monitor L2 cache accesses and decide on a partition for a fixed number of cycles in order to maximize throughput [5,6,7] or fairness [8]. Basically, these proposals predict the number of misses per application for each possible cache partition. Then, they use the cache partition that leads to the minimum number of misses for the next interval.
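As an illustration of this miss-minimizing step, the following is a minimal sketch, not the exact mechanism of [5,6,7]: the misses table stands in for per-thread monitoring hardware, and the exhaustive search stands in for whatever search heuristic each proposal actually uses.

# Minimal sketch of miss-driven partitioning (illustrative only):
# misses[t][w] is assumed to hold the predicted miss count for
# thread t when granted w ways of the shared L2.
from itertools import product

def best_partition(misses, total_ways):
    """Return the way allocation (one entry per thread) with the
    fewest total predicted misses; every thread gets at least one way."""
    best_alloc, best_misses = None, float("inf")
    for alloc in product(range(1, total_ways + 1), repeat=len(misses)):
        if sum(alloc) != total_ways:
            continue
        total = sum(misses[t][w] for t, w in enumerate(alloc))
        if total < best_misses:
            best_alloc, best_misses = alloc, total
    return best_alloc

# Two threads sharing an 8-way L2; index w = number of ways (0..8).
misses = [
    [900, 700, 520, 400, 330, 300, 290, 285, 283],  # thread 0
    [900, 450, 260, 200, 180, 175, 172, 171, 170],  # thread 1
]
print(best_partition(misses, 8))  # -> (5, 3) for these made-up numbers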

A common characteristic of these proposals is that they treat all L2 misses equally. However, it has been shown that L2 misses affect performance differently depending on how clustered they are. An isolated L2 miss has approximately the same miss penalty as a whole cluster of L2 misses, since clustered misses can be served in parallel if they all fit in the reorder buffer (ROB) [9]. Figure 1 illustrates this behavior. We represent an ideal IPC curve that is constant until an L2 miss occurs. After some cycles, commit stops. When the cache line arrives from main memory, commit ramps up to its steady-state value. As a consequence, an isolated L2 miss has a higher impact on performance than a miss in a burst of misses, since the memory latency is shared by all clustered misses.

Fig. 1. Isolated and clustered L2 misses: (a) an isolated L2 miss; (b) clustered L2 misses
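This effect can be captured with simple arithmetic (round, assumed numbers, not measurements from the paper): an isolated miss pays the full memory latency, while k misses that overlap in the ROB share one latency among them.

# Illustrative arithmetic for the MLP effect (assumed round numbers,
# not values from the paper): overlapping misses share one latency.
MEM_LATENCY = 300  # cycles, assumed

def per_miss_cost(cluster_size):
    """Approximate stall cycles attributable to each miss in a cluster."""
    return MEM_LATENCY / cluster_size

print(per_miss_cost(1))  # isolated miss: 300.0 cycles
print(per_miss_cost(4))  # each of 4 clustered misses: 75.0 cycles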

Based on this fact, we propose a new DCP algorithm that assigns a cost to each L2 access according to its impact on final performance. We detect isolated and clustered misses and assign a higher cost to isolated misses. Then, our algorithm determines the partition that minimizes the total cost over all threads, which is used in the next interval (a sketch of this step is given after the list of contributions). Our results show that differentiating between clustered and isolated L2 misses leads to cache partitions with higher performance than previous proposals. The main contributions of this work are the following.

1) A runtime mechanism to dynamically partition shared L2 caches in a CMP scenario that takes into account the MLP of each L2 access. We obtain improvements over LRU of up to 63.9% (10.6% on average) and over previous proposals of up to 15.4% (4.1% on average) in a four-core architecture.

2) We extend previous workload classifications to CMP architectures with more than two cores, so that results can be analyzed per workload group.

3) We present a sampling technique that reduces the hardware storage cost to under 1% of the total L2 cache size, with an average throughput degradation of 0.76% (compared to the throughput obtained without sampling).
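The following is a minimal sketch of the cost-based partitioning step referenced above, under assumed interfaces (the names and data layout are hypothetical, not the paper's hardware monitors): every predicted miss is weighted by the memory latency divided by the number of misses overlapping with it, and the search from the earlier sketch then minimizes total cost instead of miss count.

# Minimal sketch of the MLP-aware partitioning step (hypothetical
# interfaces; the paper's monitoring hardware is not reproduced here).
from itertools import product

MEM_LATENCY = 300  # cycles, assumed

def mlp_cost(overlap_counts):
    """overlap_counts: for each predicted L2 miss, the number of misses
    (including itself) in flight with it in the ROB. Isolated misses
    (count 1) get the full latency; clustered misses share it."""
    return sum(MEM_LATENCY / k for k in overlap_counts)

def best_partition_by_cost(cost, total_ways):
    """cost[t][w]: predicted MLP-aware cost for thread t with w ways.
    Same exhaustive search as before, minimizing cost, not misses."""
    best_alloc, best_cost = None, float("inf")
    for alloc in product(range(1, total_ways + 1), repeat=len(cost)):
        if sum(alloc) == total_ways:
            c = sum(cost[t][w] for t, w in enumerate(alloc))
            if c < best_cost:
                best_alloc, best_cost = alloc, c
    return best_alloc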

The rest of this paper is structured as follows. In Section 2 we introduce the methods that have been previously proposed to decide L2 cache partitions and
