Intel® 64 and IA-32 Architectures Optimization Reference Manual

OPTIMIZING CACHE USAGE

each iteration. As a result, when iteration n, vertex Vn, is being processed, the requested data is already brought into the cache. In the meantime, the front-side bus is transferring the data needed for iteration n+1, vertex Vn+1. Because there is no dependence between the Vn+1 data and the execution of Vn, the latency for the data access of Vn+1 can be entirely hidden behind the execution of Vn. Under such circumstances, no "bubbles" are present in the pipelines, and thus the best possible performance can be achieved.

Prefetching is useful for inner loops that have heavy computations, or are close to the boundary between being compute-bound and memory-bandwidth-bound. It is probably not very useful for loops which are predominately memory-bandwidth-bound.

When data is already located in the first-level cache, prefetching can be useless and could even slow down performance, because the extra µops either back up waiting for outstanding memory accesses or may be dropped altogether.
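The overlap described above — issuing the fetch for vertex Vn+1 while computing on Vn — can be sketched with the GCC/Clang `__builtin_prefetch` builtin, which compiles to a PREFETCH instruction on IA-32/Intel 64 targets. The `Vertex` type and `process()` routine below are hypothetical stand-ins for the real per-vertex work:

```c
#include <stddef.h>

/* Hypothetical vertex layout; padded to 16 bytes. */
typedef struct { float x, y, z, pad; } Vertex;

/* Dummy stand-in for the per-vertex computation. */
static float process(const Vertex *v) {
    return v->x * v->x + v->y * v->y + v->z * v->z;
}

float transform_all(const Vertex *verts, size_t n) {
    float acc = 0.0f;
    for (size_t i = 0; i < n; i++) {
        /* Request the next iteration's vertex while this one computes.
         * Args: address, rw=0 (read), locality=3 (keep in all caches). */
        if (i + 1 < n)
            __builtin_prefetch(&verts[i + 1], 0, 3);
        acc += process(&verts[i]);
    }
    return acc;
}
```

Prefetching only one iteration ahead pays off only when `process()` is expensive enough to cover the memory latency; otherwise a larger distance is needed, which is the subject of Section 9.6.6.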
This behavior is platform-specific and may change in the future.

9.6.5 Software Prefetching Usage Checklist

The following checklist covers issues that need to be addressed and/or resolved to use the software PREFETCH instruction properly:
• Determine the software prefetch scheduling distance.
• Use software prefetch concatenation.
• Minimize the number of software prefetches.
• Mix software prefetch with computation instructions.
• Use cache blocking techniques (for example, strip mining).
• Balance single-pass versus multi-pass execution.
• Resolve memory bank conflict issues.
• Resolve cache management issues.

Subsequent sections discuss the above items.

9.6.6 Software Prefetch Scheduling Distance

Determining the ideal prefetch placement in the code depends on many architectural parameters, including: the amount of memory to be prefetched, cache lookup latency, system memory latency, and an estimate of the computation cycle. The ideal distance for prefetching data is processor- and platform-dependent. If the distance is too short, the prefetch will not hide the latency of the fetch behind computation. If the prefetch is too far ahead, the prefetched data may be flushed out of the cache by the time it is required.

Since prefetch distance is not a well-defined metric, for this discussion we define a new term, prefetch scheduling distance (PSD), which is represented by a number of iterations. For large loops, prefetch scheduling distance can be set to 1 (that is,
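As a rough sketch of applying a prefetch scheduling distance, the loop below issues a prefetch PSD iterations ahead of the element currently being processed. The value of PSD here is an arbitrary placeholder; as the text notes, the right value is processor- and platform-dependent and would have to be tuned empirically:

```c
#include <stddef.h>

/* Assumed prefetch scheduling distance, in iterations; tune per platform. */
#define PSD 8

void scale(float *dst, const float *src, size_t n, float k) {
    for (size_t i = 0; i < n; i++) {
        /* Prefetch the element PSD iterations ahead so it arrives in
         * cache by the time the loop reaches it. */
        if (i + PSD < n)
            __builtin_prefetch(&src[i + PSD], 0, 3);
        dst[i] = src[i] * k;
    }
}
```

Note the guard `i + PSD < n`: prefetching past the end of the array is architecturally harmless (PREFETCH never faults) but wastes issue slots.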
