
Sampling Dead Block Prediction for Last-Level Caches - TAMU ...

Name            Technique
Sampler         Dead block bypass and replacement with sampling predictor, default LRU policy
TDBP            Dead block bypass and replacement with reftrace, default LRU policy
CDBP            Dead block bypass and replacement with counting predictor, default LRU policy
DIP             Dynamic Insertion Policy, default LRU policy
RRIP            Re-reference interval prediction
TADIP           Thread-aware DIP, default LRU policy
Random Sampler  Dead block bypass and replacement with sampling predictor, default random policy
Random CDBP     Dead block bypass and replacement with counting predictor, default random policy
Optimal         Optimal replacement and bypass policy as described in Section VI-B

TABLE V: Legend for various cache optimization techniques.

[Figure omitted: normalized MPKI for TDBP, CDBP, DIP, RRIP, Sampler, and Optimal across the SPEC CPU2006 benchmarks.]
Fig. 4: Reduction in LLC misses for various policies.

[Figure omitted: speedup for TDBP, CDBP, DIP, RRIP, and Sampler across the SPEC CPU2006 benchmarks.]
Fig. 5: Speedup for various policies.

trace that might not be repeated often enough to learn from. Note that we have simulated reftrace correctly, with access to the original source code used for the cache bursts paper. In our simulations, as in previous work, reftrace works quite well when there is no middle-level cache to filter the temporal locality between the small L1 and the large LLC.
However, many real systems have middle-level caches.

4) Contribution of Components: Aside from using only the last PC, there are three other components that contribute to our predictor's performance with dead block replacement and bypass (DBRB): 1) using a sampler, 2) using reduced associativity in the sampler, and 3) using a skewed predictor. Figure 6 shows the speedup achieved on the single-thread benchmarks for every feasible combination of the presence or absence of these components. We find that these three components interact synergistically to improve performance.

The PC-only predictor ("DBRB alone"), without any of the other enhancements, achieves a speedup of 3.4% over the LRU baseline. This predictor is equivalent to the reftrace predictor using the last PC instead of the trace signature. Adding a skewed predictor with three tables ("DBRB+3 tables"), each one-fourth the size of the single-table predictor, results in a reduced speedup of 2.3%. The advantage of a skewed predictor is its ability to improve accuracy in the presence of a moderate amount of conflict. However, with no sampler to filter the onslaught of a large working set of PCs, the skewed predictor experiences significant conflict, with a commensurate reduction in coverage and accuracy.

The sampler with no other enhancements ("DBRB+sampler") yields a speedup of 3.8%. The improvement over DBRB-only is due to the filtering effect on the predictor: learning from far fewer examples is sufficient to capture the general behavior of the program, but results in much less conflict in the prediction table. Adding the skewed predictor to this scenario ("DBRB+sampler+3 tables") slightly improves speedup to 4.0%, addressing the
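As a concrete illustration of the skewed-predictor component discussed above, the sketch below keeps three small counter tables, each indexed by a different hash of the last PC to touch a block, and predicts "dead" when the sum of the three counters crosses a threshold. The table size, hash functions, counter width, and threshold here are illustrative assumptions for exposition, not the configuration used in the paper.

```python
# Minimal sketch of a skewed dead-block predictor (illustrative, not the
# paper's exact design): three tables indexed by distinct hashes of the PC,
# with saturating counters summed against a threshold.

class SkewedDeadBlockPredictor:
    def __init__(self, entries_per_table=1024, threshold=8):
        # Three small tables; in the paper each is roughly one-fourth the
        # size of the equivalent single-table predictor.
        self.tables = [[0] * entries_per_table for _ in range(3)]
        self.entries = entries_per_table
        self.threshold = threshold

    def _indices(self, pc):
        # Three cheap, distinct hashes of the PC (assumed for illustration).
        return [
            pc % self.entries,
            (pc ^ (pc >> 7)) % self.entries,
            (pc ^ (pc >> 13) ^ 0x5A5A) % self.entries,
        ]

    def predict_dead(self, pc):
        # A high combined count means blocks last touched by this PC are
        # usually dead (never reused before eviction).
        counters = (t[i] for t, i in zip(self.tables, self._indices(pc)))
        return sum(counters) >= self.threshold

    def train(self, pc, block_was_dead):
        # Saturating 4-bit counters; in the full design, training events
        # come only from sampler sets, which filters the PC working set
        # before it reaches these tables.
        for t, i in zip(self.tables, self._indices(pc)):
            if block_was_dead:
                t[i] = min(t[i] + 1, 15)
            else:
                t[i] = max(t[i] - 1, 0)
```

Skewing helps because two PCs that collide in one table are unlikely to collide in all three, so the summed vote tolerates moderate conflict; as the text notes, without the sampler's filtering the conflict level overwhelms this benefit.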
