13.07.2015 Views

Intel® 64 and IA-32 Architectures Optimization Reference Manual

GENERAL OPTIMIZATION GUIDELINES

boundaries. Frequently, moving at doubleword granularity performs better with addresses that are 8-byte aligned.

• REP string move vs. SIMD move: Implementing general-purpose memory functions using SIMD extensions usually requires adding some prolog code to ensure the availability of SIMD instructions, and preamble code to facilitate aligned data movement requirements at runtime. Throughput comparison must also take into consideration the overhead of the prolog when considering a REP string implementation versus a SIMD approach.

• Cache eviction: If the amount of data to be processed by a memory routine approaches half the size of the last level on-die cache, temporal locality of the cache may suffer. Using streaming store instructions (for example: MOVNTQ, MOVNTDQ) can minimize the effect of flushing the cache. The threshold to start using a streaming store depends on the size of the last level cache. Determine the size using the deterministic cache parameter leaf of CPUID.

  Techniques for using streaming stores for implementing a MEMSET()-type library must also consider that the application can benefit from this technique only if it has no immediate need to reference the target addresses. This assumption is easily upheld when testing a streaming-store implementation on a micro-benchmark configuration, but violated in a full-scale application situation.

When applying general heuristics to the design of general-purpose, high-performance library routines, the following guidelines are useful when optimizing an arbitrary counter value N and address alignment. Different techniques may be necessary for optimal performance, depending on the magnitude of N:

• When N is less than some small count (where the small count threshold will vary between microarchitectures; empirically, 8 may be a good value when optimizing for Intel NetBurst microarchitecture), each case can be coded directly without the overhead of a looping structure. For example, 11 bytes can be processed using two MOVSD instructions explicitly and a MOVSB with REP counter equaling 3.

• When N is not small but still less than some threshold value (which may vary for different microarchitectures, but can be determined empirically), a SIMD implementation using run-time CPUID and alignment prolog will likely deliver less throughput due to the overhead of the prolog. A REP string implementation should favor using a REP string of doublewords. To improve address alignment, a small piece of prolog code using MOVSB/STOSB with a count less than 4 can be used to peel off the non-aligned data moves before starting to use MOVSD/STOSD.

• When N is less than half the size of last level cache, throughput consideration may favor either:

  - An approach using a REP string with the largest data granularity because a REP string has little overhead for loop iteration, and the branch misprediction overhead in the prolog/epilogue code to handle address alignment is amortized over many iterations.
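The size tiers and the alignment prolog described above can be sketched in portable C. This is only an illustrative sketch, not the manual's implementation: the names `choose_strategy`, `plan_copy`, `copy_strategy`, and `copy_plan` are hypothetical, and the thresholds `small_count` and `prolog_break_even` are assumed to have been determined empirically by the caller. `llc_bytes` is assumed to have been obtained elsewhere, for example from the deterministic cache parameter leaf of CPUID.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical labels for the size tiers discussed in the text. */
typedef enum {
    COPY_DIRECT,      /* N below the small count: code each case without a loop */
    COPY_REP_DWORD,   /* N below the empirical threshold: REP string of doublewords */
    COPY_BULK_CACHED, /* N below half the last level cache, e.g. a REP string
                         with the largest data granularity */
    COPY_STREAMING    /* N at or above half the last level cache: streaming
                         stores, useful only if the target is not read soon */
} copy_strategy;

/* All three thresholds are assumptions supplied by the caller. */
copy_strategy choose_strategy(size_t n, size_t small_count,
                              size_t prolog_break_even, size_t llc_bytes)
{
    if (n < small_count)       return COPY_DIRECT;
    if (n < prolog_break_even) return COPY_REP_DWORD;
    if (n < llc_bytes / 2)     return COPY_BULK_CACHED;
    return COPY_STREAMING;
}

/* Split of an N-byte move into byte prolog, doubleword body, byte tail. */
typedef struct {
    size_t head;   /* bytes peeled off with byte moves, always less than 4 */
    size_t dwords; /* number of 4-byte moves */
    size_t tail;   /* remaining bytes after the doubleword moves */
} copy_plan;

copy_plan plan_copy(uintptr_t dst_addr, size_t n)
{
    copy_plan p;
    /* Bytes needed to advance dst_addr to the next 4-byte boundary. */
    p.head = (size_t)((4u - (dst_addr & 3u)) & 3u);
    if (p.head > n)
        p.head = n;
    p.dwords = (n - p.head) / 4;
    p.tail = (n - p.head) % 4;
    /* The three parts must account for every byte exactly once. */
    assert(p.head < 4 && p.head + 4 * p.dwords + p.tail == n);
    return p;
}
```

For an 11-byte move to a destination one byte past a 4-byte boundary, `plan_copy` yields a 3-byte prolog followed by two doubleword moves, the same 3 + 2 × 4 split as the 11-byte example in the text. Keeping the thresholds as parameters reflects the text's point that they vary between microarchitectures.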
