Sampling Dead Block Prediction for Last-Level Caches
TABLE IV: Multi-core workload mixes with cache sensitivity curves giving LLC misses per 1000 instructions (MPKI) on the y-axis for last-level cache sizes 128KB through 32MB on the x-axis. [The per-mix sensitivity curve plots are not reproduced here.]

Name    Benchmarks
mix1    mcf hmmer libquantum omnetpp
mix2    gobmk soplex libquantum lbm
mix3    zeusmp leslie3d libquantum xalancbmk
mix4    gamess cactusADM soplex libquantum
mix5    bzip2 gamess mcf sphinx3
mix6    gcc calculix libquantum sphinx3
mix7    perlbench milc hmmer lbm
mix8    bzip2 gcc gobmk lbm
mix9    gamess mcf tonto xalancbmk
mix10   milc namd sphinx3 xalancbmk

For the multi-core workloads, we report the weighted speedup normalized to LRU. That is, for each thread i sharing the 8MB cache, we compute IPC_i. We then find SingleIPC_i as the IPC of the same program running in isolation with an 8MB cache under LRU replacement. The weighted IPC is Σ_i IPC_i / SingleIPC_i. Finally, we normalize this weighted IPC to the weighted IPC obtained with the LRU replacement policy.
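To make the metric concrete, the sketch below computes the normalized weighted speedup from per-thread IPC measurements. This is our own illustration, not code from the paper; the function names and the sample numbers are invented.

    def weighted_ipc(ipc, single_ipc):
        # Sum of each thread's IPC normalized to its IPC when running
        # alone with an 8MB LRU-replaced cache: sum_i IPC_i / SingleIPC_i.
        return sum(i / s for i, s in zip(ipc, single_ipc))

    def normalized_weighted_speedup(ipc_policy, ipc_lru, single_ipc):
        # Weighted IPC under the evaluated policy, normalized to the
        # weighted IPC of the same mix under the LRU baseline.
        return weighted_ipc(ipc_policy, single_ipc) / weighted_ipc(ipc_lru, single_ipc)

    # Hypothetical 4-thread mix: values > 1.0 indicate speedup over LRU.
    print(normalized_weighted_speedup(
        ipc_policy=[1.10, 0.52, 0.84, 1.30],
        ipc_lru=[1.00, 0.50, 0.80, 1.25],
        single_ipc=[1.40, 0.70, 1.00, 1.60]))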
B. Optimal Replacement and Bypass Policy

For simulating misses, we also compare with an optimal block replacement and bypass policy. That is, we enhance Belady's MIN replacement policy [3] with a bypass policy that refuses to place a block in a set when that block's next access will not occur until after the next accesses to all other blocks in the set. We use trace-based simulation to determine the optimal number of misses, using the same sequence of memory accesses made by the out-of-order simulator. The out-of-order simulator does not include the optimal replacement and bypass policy, so we report optimal numbers only for cache miss reduction and not for speedup. (A sketch of this trace-based simulation appears at the end of this section.)

VII. EXPERIMENTAL RESULTS

In this section we discuss the results of our experiments. In the graphs that follow, several techniques are referred to by abbreviated names; Table V gives a legend for these names. For TDBP and CDBP, we simulate a dead block bypass and replacement policy just as described previously, dropping in the reftrace and counting predictors, respectively, in place of our sampling predictor. (A sketch of this bypass-and-replacement decision also appears at the end of this section.)

A. Dead Block Replacement with LRU Baseline

We explore the use of sampling prediction to drive replacement and bypass in a cache with default LRU replacement, compared with several other techniques, for the single-thread benchmarks.

1) LLC Misses: Figure 4 shows LLC cache misses normalized to a 2MB LRU cache for each benchmark. On average, dynamic insertion (DIP) reduces cache misses to 93.9% of the baseline LRU, a reduction of 6.1%. RRIP reduces misses by 8.1%. The reftrace-predictor-driven policy (TDBP) increases misses on average by 8.0% (mostly due to 473.astar), decreasing misses on only 11 of the 19 benchmarks. CDBP reduces average misses by 4.6%. The sampling predictor reduces average misses by 11.7%. The optimal policy reduces misses by 18.6% over LRU; thus, the sampling predictor achieves 63% of the improvement of the optimal policy.

2) Speedup: Reducing cache misses translates into improved performance. Figure 5 shows the speedup (i.e., new IPC divided by old IPC) over LRU for the predictor-driven policies with a default LRU cache. DIP improves performance by a geometric mean of 3.1%. TDBP provides a speedup on some benchmarks and a slowdown on others, resulting in a geometric mean speedup of approximately 0%. The counting predictor delivers a geometric mean speedup of 2.3% and does not significantly slow down any benchmarks. RRIP yields an average speedup of 4.1%. The sampling predictor gives a geometric mean speedup of 5.9%. It improves performance by at least 4% for eight of the benchmarks, as opposed to only five benchmarks for RRIP and CDBP and two for TDBP. The sampling predictor delivers performance superior to each of the other techniques tested.

Speedup and cache misses are particularly poor for 473.astar. As we will see in Section VII-C, dead block prediction accuracy is bad for this benchmark. However, the sampling predictor minimizes the damage by making fewer predictions than the other predictors.

3) Poor Performance for Trace-Based Predictor: Note that the reftrace predictor performs quite poorly compared with its observed behavior in previous work [15]. In that work, reftrace was used for L1 or L2 caches with significant temporal locality in the streams of references reaching the predictor. Reftrace learns from these streams of temporal locality. In this work, the predictor optimizes the LLC, in which most temporal locality has been filtered out by the 256KB middle-level cache. In this situation, it is easier for the predictor to simply learn the last PC to reference a block rather than a sparse reference trace.
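As a concrete companion to Section B, the following trace-based sketch simulates the optimal replacement-and-bypass policy: on a miss to a full set, the incoming block is bypassed when its next access lies beyond the next accesses of every resident block; otherwise Belady's MIN victim (the resident block reused farthest in the future) is evicted. This is our own illustration; the function name and the simple modulo set-indexing are assumptions, not the paper's code.

    from collections import defaultdict

    def optimal_misses(trace, num_sets, assoc):
        # Belady's MIN replacement enhanced with bypass, simulated over a
        # trace of block addresses (offset bits already stripped).
        INF = float('inf')

        # For each trace position, precompute the position of the next
        # access to the same block (INF if the block is never used again).
        next_use = [INF] * len(trace)
        last_seen = {}
        for i in range(len(trace) - 1, -1, -1):
            next_use[i] = last_seen.get(trace[i], INF)
            last_seen[trace[i]] = i

        sets = defaultdict(dict)   # set index -> {block address: next use}
        misses = 0
        for i, addr in enumerate(trace):
            blocks = sets[addr % num_sets]
            if addr in blocks:
                blocks[addr] = next_use[i]        # hit: refresh next-use time
                continue
            misses += 1
            if len(blocks) < assoc:
                blocks[addr] = next_use[i]        # free way: insert
                continue
            victim = max(blocks, key=blocks.get)  # reused farthest in future
            if next_use[i] > blocks[victim]:
                continue                          # bypass: incoming block's next
                                                  # use is beyond every resident's
            del blocks[victim]                    # MIN eviction
            blocks[addr] = next_use[i]
        return misses

    # Hypothetical tiny example: a 2-way, single-set cache.
    # optimal_misses([1, 2, 1, 3, 2, 1], num_sets=1, assoc=2)

Because this policy needs the entire future reference stream, it runs offline over the access sequence produced by the out-of-order simulator, which is why optimal numbers are reported for miss reduction only and not for speedup.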

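Lastly, referring back to the policies of Section VII, this is a sketch of the dead block bypass-and-replacement decision into which the reftrace (TDBP), counting (CDBP), or sampling predictors are dropped. The predictor interface (predict_dead) and the MRU-to-LRU list representation are our own abstractions, not the paper's code.

    def fill_block(mru_list, incoming, predictor, assoc):
        # Called on an LLC miss. `mru_list` holds the set's blocks ordered
        # from MRU (front) to LRU (back).
        if predictor.predict_dead(incoming):
            return                      # bypass: predicted dead on arrival
        if len(mru_list) >= assoc:
            # Replacement: prefer evicting a block predicted dead;
            # otherwise fall back to the default LRU victim.
            victim = next((b for b in mru_list if predictor.predict_dead(b)),
                          mru_list[-1])
            mru_list.remove(victim)
        mru_list.insert(0, incoming)    # place the new block at MRU

When the predictor mispredicts, this loop bypasses or evicts live blocks; that is why prediction accuracy matters for 473.astar, and why the sampling predictor's tendency to make fewer dead predictions limits the damage there.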