13.07.2015 Views

Intel® 64 and IA-32 Architectures Optimization Reference Manual

GENERAL OPTIMIZATION GUIDELINES

prefetched by IP-based prefetches, and rows can be prefetched by DPL and the L2 streamer.

3.7.5 Cacheability Instructions

SSE2 provides additional cacheability instructions that extend those provided in SSE. The new cacheability instructions include:

• new streaming store instructions
• new cache line flush instruction
• new memory fencing instructions

For more information, see Chapter 9, “Optimizing Cache Usage.”

3.7.6 REP Prefix and Data Movement

The REP prefix is commonly used with string move instructions for memory-related library functions such as MEMCPY (using REP MOVSD) or MEMSET (using REP STOS). These STRING/MOV instructions with the REP prefixes are implemented in MS-ROM and have several implementation variants with different performance levels.

The specific variant of the implementation is chosen at execution time based on data layout, alignment and the counter (ECX) value. For example, MOVSB/STOSB with the REP prefix should be used with counter value less than or equal to three for best performance.

String MOVE/STORE instructions have multiple data granularities. For efficient data movement, larger data granularities are preferable. This means better efficiency can be achieved by decomposing an arbitrary counter value into a number of doublewords plus single byte moves with a count value less than or equal to 3.

Because software can use SIMD data movement instructions to move 16 bytes at a time, the following paragraphs discuss general guidelines for designing and implementing high-performance library functions such as MEMCPY(), MEMSET(), and MEMMOVE(). Four factors are to be considered:

• Throughput per iteration: If two pieces of code have approximately identical path lengths, efficiency favors choosing the instruction that moves larger pieces of data per iteration. Also, smaller code size per iteration will in general reduce overhead and improve throughput. Sometimes, this may involve a comparison of the relative overhead of an iterative loop structure versus using REP prefix for iteration.
• Address alignment: Data movement instructions with highest throughput usually have alignment restrictions, or they operate more efficiently if the destination address is aligned to its natural data size. Specifically, 16-byte moves need to ensure the destination address is aligned to 16-byte boundaries, and 8-byte moves perform better if the destination address is aligned to 8-byte boundaries.
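
The count decomposition and destination alignment described above can be sketched in portable C. This is only an illustration of the arithmetic, not the manual's implementation: the names copy_plan, split_count, bytes_to_align16 and sketch_copy are hypothetical, and a tuned library routine would likely use REP MOVSD or SIMD moves where this sketch uses plain loops.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical plan: how many doubleword (4-byte) moves and how many
   trailing single-byte moves (always 0..3) cover a given byte count. */
typedef struct {
    size_t dwords;
    size_t tail_bytes;
} copy_plan;

/* Decompose an arbitrary byte count into doublewords plus at most 3 bytes. */
static copy_plan split_count(size_t n)
{
    copy_plan p = { n / 4, n % 4 };
    return p;
}

/* Number of bytes to advance so that dst sits on a 16-byte boundary
   (0 if it is already aligned). */
static size_t bytes_to_align16(const void *dst)
{
    return (16u - ((uintptr_t)dst & 15u)) & 15u;
}

/* Illustrative copy for non-overlapping buffers:
   1. byte moves until the destination is 16-byte aligned,
   2. doubleword moves for the bulk of the data,
   3. at most 3 trailing byte moves. */
static void *sketch_copy(void *dst, const void *src, size_t n)
{
    unsigned char *d = (unsigned char *)dst;
    const unsigned char *s = (const unsigned char *)src;

    size_t head = bytes_to_align16(d);
    if (head > n)
        head = n;
    for (size_t i = 0; i < head; i++)
        *d++ = *s++;
    n -= head;

    copy_plan p = split_count(n);
    for (size_t i = 0; i < p.dwords; i++) {
        uint32_t w;
        memcpy(&w, s, sizeof w);   /* source may be unaligned */
        memcpy(d, &w, sizeof w);   /* destination is at least 4-byte aligned here */
        d += 4;
        s += 4;
    }
    for (size_t i = 0; i < p.tail_bytes; i++)
        *d++ = *s++;

    return dst;
}
```

For example, a count of 11 bytes splits into 2 doubleword moves plus 3 single-byte moves. The sketch aligns the destination and leaves the source unaligned, following the guideline's emphasis on destination alignment.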
