
Intel® 64 and IA-32 Architectures Optimization Reference Manual

OPTIMIZING CACHE USAGE

If an application uses a large data set that can be reused across multiple passes of a loop, it will benefit from strip mining. Data sets larger than the cache will be processed in groups small enough to fit into cache. This allows temporal data to reside in the cache longer, reducing bus traffic.

Data set size and temporal locality (data characteristics) fundamentally affect how PREFETCH instructions are applied to strip-mined code. Figure 9-7 shows two simplified scenarios for temporally-adjacent data and temporally-non-adjacent data.

[Figure 9-7. Cache Blocking – Temporally Adjacent and Non-adjacent Passes. Temporally-adjacent passes: Pass 1 and Pass 2 operate on Dataset A, Pass 3 and Pass 4 on Dataset B. Temporally-non-adjacent passes: the passes alternate, Dataset A in Passes 1 and 3, Dataset B in Passes 2 and 4.]

In the temporally-adjacent scenario, subsequent passes use the same data and find it already in second-level cache. Prefetch issues aside, this is the preferred situation. In the temporally non-adjacent scenario, data used in pass m is displaced by pass (m+1), requiring data re-fetch into the first level cache and perhaps into the second level cache if a later pass reuses the data. If both data sets fit into the second-level cache, load operations in passes 3 and 4 become less expensive.

Figure 9-8 shows how prefetch instructions and strip-mining can be applied to increase performance in both of these scenarios.
