Sampling Dead Block Prediction for Last-Level Caches - TAMU ...


Name     Benchmarks
mix1     mcf hmmer libquantum omnetpp
mix2     gobmk soplex libquantum lbm
mix3     zeusmp leslie3d libquantum xalancbmk
mix4     gamess cactusADM soplex libquantum
mix5     bzip2 gamess mcf sphinx3
mix6     gcc calculix libquantum sphinx3
mix7     perlbench milc hmmer lbm
mix8     bzip2 gcc gobmk lbm
mix9     gamess mcf tonto xalancbmk
mix10    milc namd sphinx3 xalancbmk

[Cache sensitivity curve column: per-mix plots not recoverable from the extracted text.]

TABLE IV: Multi-core workload mixes with cache sensitivity curves giving LLC misses per 1000 instructions (MPKI) on the y-axis for last-level cache sizes 128KB through 32MB on the x-axis.

For the multi-core workloads, we report the weighted speedup normalized to LRU. That is, for each thread i sharing the 8MB cache, we compute IPC_i. Then we find SingleIPC_i as the IPC of the same program running in isolation with an 8MB cache with LRU replacement. Then we compute the weighted IPC as ∑_i IPC_i / SingleIPC_i. We then normalize this weighted IPC with the weighted IPC using the LRU replacement policy.

B. Optimal Replacement and Bypass Policy

For simulating misses, we also compare with an optimal block replacement and bypass policy. That is, we enhance Belady’s MIN replacement policy [3] with a bypass policy that refuses to place a block in a set when that block’s next access will not occur until after the next accesses to all other blocks in the set. We use trace-based simulation to determine the optimal number of misses using the same sequence of memory accesses made by the out-of-order simulator. The out-of-order simulator does not include the optimal replacement and bypass policy, so we report optimal numbers only for cache miss reduction and not for speedup.

VII. EXPERIMENTAL RESULTS

In this section we discuss results of our experiments. In the graphs that follow, several techniques are referred to with abbreviated names.
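The optimal replacement and bypass policy of Section VI-B can be illustrated with a small trace-driven sketch. This is a minimal illustration and not the simulator used for the reported results: the function name optimal_misses, the trace format (a list of integer block addresses), and the modulo set-index mapping are assumptions made for the example.

```python
from collections import defaultdict, deque

INF = float("inf")


def optimal_misses(trace, num_sets, ways, bypass=True):
    """Count misses under Belady's MIN, optionally extended with bypass.

    trace    -- sequence of integer block addresses (hypothetical format)
    num_sets -- number of cache sets; set index is block % num_sets (assumed)
    ways     -- associativity of each set
    """
    # For every block, record the positions of all of its accesses.
    future = defaultdict(deque)
    for pos, block in enumerate(trace):
        future[block].append(pos)

    def next_use(b):
        # Position of the next access to b, or infinity if never reused.
        return future[b][0] if future[b] else INF

    sets = [set() for _ in range(num_sets)]
    misses = 0
    for block in trace:
        future[block].popleft()          # consume the current access
        resident = sets[block % num_sets]
        if block in resident:
            continue                     # hit
        misses += 1
        if len(resident) < ways:
            resident.add(block)          # free way available
            continue
        # MIN victim: the resident block whose next access is farthest away.
        victim = max(resident, key=next_use)
        if bypass and next_use(block) >= next_use(victim):
            continue                     # incoming block is reused last: bypass
        resident.remove(victim)
        resident.add(block)
    return misses
```

For example, on the trace [1, 2, 1] with a single one-way set, plain MIN must place block 2 and therefore misses three times, whereas the bypass variant keeps block 1 resident and misses only twice.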
Table V gives a legend for these names. For TDBP and CDBP, we simulate a dead block bypass and replacement policy just as described previously, dropping in the reftrace and counting predictors, respectively, in place of our sampling predictor.

A. Dead Block Replacement with LRU Baseline

We explore the use of sampling prediction to drive replacement and bypass in a cache with a default LRU replacement policy, compared with several other techniques for the single-thread benchmarks.

1) LLC Misses: Figure 4 shows LLC cache misses normalized to a 2MB LRU cache for each benchmark. On average, dynamic insertion (DIP) reduces cache misses to 93.9% of the baseline LRU, a reduction of 6.1%. RRIP reduces misses by 8.1%. The reftrace-predictor-driven policy (TDBP) increases misses on average by 8.0% (mostly due to 473.astar), decreasing misses on only 11 of the 19 benchmarks. CDBP reduces average misses by 4.6%. The sampling predictor reduces average misses by 11.7%. The optimal policy reduces misses by 18.6% over LRU; thus, the sampling predictor achieves 63% of the improvement of the optimal policy.

2) Speedup: Reducing cache misses translates into improved performance. Figure 5 shows the speedup (i.e., new IPC divided by old IPC) over LRU for the predictor-driven policies with a default LRU cache.

DIP improves performance by a geometric mean of 3.1%. TDBP provides a speedup on some benchmarks and a slowdown on others, resulting in a geometric mean speedup of approximately 0%. The counting predictor delivers a geometric mean speedup of 2.3%, and does not significantly slow down any benchmarks. RRIP yields an average speedup of 4.1%. The sampling predictor gives a geometric mean speedup of 5.9%. It improves performance by at least 4% for eight of the benchmarks, as opposed to only five benchmarks for RRIP and CDBP and two for TDBP. The sampling predictor delivers performance superior to each of the other techniques tested.

Speedup and cache misses are particularly poor for 473.astar.
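The summary metrics used in this paper (per-benchmark speedup, geometric mean speedup, the normalized weighted speedup defined for the multi-core workloads, and the fraction of the optimal miss reduction achieved) can be sketched as follows. This is an illustrative sketch only; all function names are hypothetical, and the IPC values in the usage note are invented for the example.

```python
import math


def speedup(new_ipc, old_ipc):
    """Speedup as defined in the text: new IPC divided by old IPC."""
    return new_ipc / old_ipc


def geometric_mean(values):
    """Geometric mean of positive values, computed in the log domain."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


def weighted_ipc(ipcs, single_ipcs):
    """Sum over threads i of IPC_i / SingleIPC_i."""
    return sum(ipc / single for ipc, single in zip(ipcs, single_ipcs))


def normalized_weighted_speedup(policy_ipcs, lru_ipcs, single_ipcs):
    """Weighted IPC under a policy, normalized to the weighted IPC under LRU."""
    return weighted_ipc(policy_ipcs, single_ipcs) / weighted_ipc(lru_ipcs, single_ipcs)


def fraction_of_optimal(policy_reduction, optimal_reduction):
    """Share of the optimal policy's miss reduction that a policy achieves."""
    return policy_reduction / optimal_reduction
```

With the miss reductions reported above, fraction_of_optimal(11.7, 18.6) evaluates to roughly 0.63, matching the 63% figure for the sampling predictor. As a made-up multi-core example, two threads with IPCs of 1.1 and 0.9 under some policy, 1.0 and 1.0 under LRU, and isolated IPCs of 2.0 and 1.0 give a normalized weighted speedup of 1.45 / 1.5, or about 0.97.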
As we will see in Section VII-C, dead block prediction accuracy is bad for this benchmark. However, the sampling predictor minimizes the damage by making fewer predictions than the other predictors.

3) Poor Performance for Trace-Based Predictor: Note that the reftrace predictor performs quite poorly compared with its observed behavior in previous work [15]. In that work, reftrace was used for L1 or L2 caches with significant temporal locality in streams of reference reaching the predictor. Reftrace learns from these streams of temporal locality. In this work, the predictor optimizes the LLC, in which most temporal locality has been filtered by the 256KB middle-level cache. In this situation, it is easier for the predictor to try to simply learn the last PC to reference a block rather than a sparse reference trace that might not be repeated often enough to learn from. Note that we have simulated reftrace correctly with access to the original source code used for the cache bursts paper. In our simulations and in previous work, reftrace works quite well when there is no middle-level cache to filter the temporal locality between the small L1 and large LLC. However, many real systems have middle-level caches.

Name             Technique
Sampler          Dead block bypass and replacement with sampling predictor, default LRU policy
TDBP             Dead block bypass and replacement with reftrace, default LRU policy
CDBP             Dead block bypass and replacement with counting predictor, default LRU policy
DIP              Dynamic Insertion Policy, default LRU policy
RRIP             Re-reference interval prediction
TADIP            Thread-aware DIP, default LRU policy
Random Sampler   Dead block bypass and replacement with sampling predictor, default random policy
Random CDBP      Dead block bypass and replacement with counting predictor, default random policy
Optimal          Optimal replacement and bypass policy as described in Section VI-B

TABLE V: Legend for various cache optimization techniques.

[Figure 4 plot. y-axis: Normalized MPKI. Series: TDBP, CDBP, DIP, RRIP, Sampler, Optimal. x-axis: 400.perlbench, 401.bzip2, 403.gcc, 429.mcf, 433.milc, 434.zeusmp, 435.gromacs, 436.cactusADM, 437.leslie3d, 450.soplex, 456.hmmer, 459.GemsFDTD, 462.libquantum, 470.lbm, 471.omnetpp, 473.astar, 481.wrf, 482.sphinx3, 483.xalancbmk, and the arithmetic mean (amean).]

Fig. 4: Reduction in LLC misses for various policies.

[Figure 5 plot. y-axis: Speedup. Series: TDBP, CDBP, DIP, RRIP, Sampler. x-axis: the same 19 benchmarks and the geometric mean (gmean).]

Fig. 5: Speedup for various policies.

4) Contribution of Components: Aside from using only the last PC, there are three other components that contribute to our predictor’s performance with dead block replacement and bypass (DBRB): 1) using a sampler, 2) using reduced associativity in the sampler, and 3) using a skewed predictor. Figure 6 shows the speedup achieved on the single-thread benchmarks for every feasible combination of presence or absence of these components. We find that these three components interact synergistically to improve performance.

The PC-only predictor (“DBRB alone”) without any of the other enhancements achieves a speedup of 3.4% over the LRU baseline. This predictor is equivalent to the reftrace predictor using the last PC instead of the trace signature. Adding a skewed predictor with three tables (“DBRB+3 tables”), each one-fourth the size of the single-table predictor, results in a reduced speedup of 2.3%. The advantage of a skewed predictor is its ability to improve accuracy in the presence of a moderate amount of conflict. However, with no sampler to filter the onslaught of a large working set of PCs, the skewed predictor experiences significant conflict with a commensurate reduction in coverage and accuracy.

The sampler with no other enhancements (“DBRB+sampler”) yields a speedup of 3.8%. The improvement over DBRB alone is due to the filtering effect on the predictor: learning from far fewer examples is sufficient to learn the general behavior of the program, but results in much less conflict in the prediction table. Adding the skewed predictor to this scenario (“DBRB+sampler+3 tables”) slightly improves speedup to 4.0%, addressing the