
Intel® 64 and IA-32 Architectures Optimization Reference Manual



GENERAL OPTIMIZATION GUIDELINES

In some situations, the byte count of the data is known from the context (as opposed to being known from a parameter passed by a call), and one can take a simpler approach than those required for a general-purpose library routine. For example, if the byte count is also small, using REP MOVSB/STOSB with a count less than four can ensure good address alignment and loop-unrolling to finish the remaining data; using MOVSD/STOSD can reduce the overhead associated with iteration.

Using a REP prefix with string move instructions can provide high performance in the situations described above. However, using a REP prefix with string scan instructions (SCASB, SCASW, SCASD, SCASQ) or compare instructions (CMPSB, CMPSW, CMPSD, CMPSQ) is not recommended for high performance. Consider using SIMD instructions instead.

3.8 FLOATING-POINT CONSIDERATIONS

When programming floating-point applications, it is best to start with a high-level programming language such as C, C++, or Fortran. Many compilers perform floating-point scheduling and optimization when it is possible. However, in order to produce optimal code, the compiler may need some assistance.

3.8.1 Guidelines for Optimizing Floating-point Code

User/Source Coding Rule 13. (M impact, M generality) Enable the compiler’s use of SSE, SSE2 or SSE3 instructions with appropriate switches.

Follow this procedure to investigate the performance of your floating-point application:
• Understand how the compiler handles floating-point code.
• Look at the assembly dump and see what transforms are already performed on the program.
• Study the loop nests in the application that dominate the execution time.
• Determine why the compiler is not creating the fastest code.
• See if there is a dependence that can be resolved.
• Determine the problem area: bus bandwidth, cache locality, trace cache bandwidth, or instruction latency. Focus on optimizing the problem area. For example, adding PREFETCH instructions will not help if the bus is already saturated. If trace cache bandwidth is the problem, added prefetch µops may degrade performance.

Also, in general, follow the general coding recommendations discussed in this chapter, including:
• Blocking the cache
• Using prefetch
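
The "blocking the cache" recommendation above can be illustrated with a small sketch. The example below is not taken from this manual: it assumes a square, row-major matrix of doubles, and the names transpose_naive, transpose_blocked and the TILE value of 32 are hypothetical choices made only for illustration. It shows one possible way to restructure a loop nest so that each block of data is reused while it is still likely to be cache-resident; whether this helps on a given processor should be confirmed by measurement.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical tile edge, chosen for illustration only. Tune it so that
   the blocks being worked on fit in the cache level being targeted. */
#define TILE 32

/* Straightforward transpose: dst is written with stride n, so for large n
   consecutive stores tend to land on different cache lines. */
void transpose_naive(const double *src, double *dst, size_t n)
{
    assert(src != NULL && dst != NULL && src != dst);
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            dst[j * n + i] = src[i * n + j];
}

/* Blocked transpose: same result, but the work proceeds one TILE x TILE
   block at a time so the lines of src and dst in use can stay in cache. */
void transpose_blocked(const double *src, double *dst, size_t n)
{
    assert(src != NULL && dst != NULL && src != dst);
    for (size_t ii = 0; ii < n; ii += TILE) {
        /* Clamp the block edge so n need not be a multiple of TILE. */
        size_t i_end = (ii + TILE < n) ? ii + TILE : n;
        for (size_t jj = 0; jj < n; jj += TILE) {
            size_t j_end = (jj + TILE < n) ? jj + TILE : n;
            for (size_t i = ii; i < i_end; i++)
                for (size_t j = jj; j < j_end; j++)
                    dst[j * n + i] = src[i * n + j];
        }
    }
}
```

Both functions produce the same output, so the naive version can serve as a correctness reference while the tile size is tuned. Per Rule 13, such code would typically be built with the compiler's SSE-enabling switches, and the assembly dump inspected to see what the compiler actually generated.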
