13.07.2015 Views

Intel® 64 and IA-32 Architectures Optimization Reference Manual


OPTIMIZING CACHE USAGE

NOTE
Failure to map the region as WC may allow the line to be speculatively read into the processor caches (via the wrong path of a mispredicted branch).

In case the region is not mapped as WC, the streaming store might update in-place in the cache, and a subsequent SFENCE would not result in the data being written to system memory. Explicitly mapping the region as WC in this case ensures that any data read from this region will not be placed in the processor's caches. Otherwise, a read of this memory location by a non-coherent I/O device would return incorrect or out-of-date results.

For a processor which solely implements Case 2 (Section 9.5.1.3), a streaming store can be used in this non-coherent domain without requiring the memory region to also be mapped as WB, since any cached data will be flushed to memory by the streaming store.

9.5.3 Streaming Store Instruction Descriptions

MOVNTQ/MOVNTDQ (non-temporal store of packed integer in an MMX technology or Streaming SIMD Extensions register) store data from a register to memory. They are implicitly weakly-ordered, do not write-allocate, and so minimize cache pollution.

MOVNTPS (non-temporal store of packed single precision floating point) is similar to MOVNTQ. It stores data from a Streaming SIMD Extensions register to memory in 16-byte granularity. Unlike MOVNTQ, the memory address must be aligned to a 16-byte boundary or a general protection exception will occur. The instruction is implicitly weakly-ordered, does not write-allocate, and thus minimizes cache pollution.

MASKMOVQ/MASKMOVDQU (non-temporal byte mask store of packed integer in an MMX technology or Streaming SIMD Extensions register) store data from a register to the location specified by the EDI register. The most significant bit in each byte of the second (mask) register selects, on a per-byte basis, which bytes of the first register are written. The instructions are implicitly weakly-ordered (that is, successive stores may not write memory in original program order), do not write-allocate, and thus minimize cache pollution.

9.5.4 FENCE Instructions

The following fence instructions are available: SFENCE, LFENCE, and MFENCE.

9.5.4.1 SFENCE Instruction

The SFENCE (STORE FENCE) instruction ensures that every STORE instruction that precedes an SFENCE in program order is globally visible before any STORE that follows the SFENCE. SFENCE provides an efficient way of ensuring ordering between routines that produce weakly-ordered results.
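The streaming stores and the SFENCE described above can be illustrated with compiler intrinsics. The following is a minimal sketch, not taken from the manual: it assumes an x86 compiler that provides the SSE2 header emmintrin.h, and the helper names stream_fill and stream_masked16 are hypothetical. The first helper issues MOVNTDQ-style stores through _mm_stream_si128 and therefore requires a 16-byte aligned destination; the second issues a MASKMOVDQU-style byte-masked store through _mm_maskmoveu_si128. Each helper ends with _mm_sfence so that the weakly-ordered stores become globally visible before any later store.

```c
#include <emmintrin.h>  /* SSE2 intrinsics: _mm_stream_si128, _mm_maskmoveu_si128, _mm_sfence */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: fill a buffer with one byte value using
 * non-temporal 16-byte stores (MOVNTDQ).
 * Assumptions: dst is 16-byte aligned and n_bytes is a multiple of 16. */
static void stream_fill(void *dst, uint8_t value, size_t n_bytes)
{
    __m128i v = _mm_set1_epi8((char)value);
    __m128i *p = (__m128i *)dst;

    for (size_t i = 0; i < n_bytes / 16; ++i) {
        /* Non-temporal store: weakly-ordered, does not write-allocate. */
        _mm_stream_si128(p + i, v);
    }

    /* SFENCE: the stores above become globally visible before any
     * store that follows this point in program order. */
    _mm_sfence();
}

/* Hypothetical helper: byte-masked non-temporal store (MASKMOVDQU).
 * Byte i of src is written to dst[i] only if the most significant
 * bit of mask[i] is set; dst does not need to be aligned. */
static void stream_masked16(uint8_t *dst, const uint8_t src[16],
                            const uint8_t mask[16])
{
    __m128i data = _mm_loadu_si128((const __m128i *)src);
    __m128i sel  = _mm_loadu_si128((const __m128i *)mask);

    _mm_maskmoveu_si128(data, sel, (char *)dst);
    _mm_sfence();
}
```

A producer routine might call stream_fill on an output buffer and only afterwards set a "ready" flag with an ordinary store; the trailing SFENCE is what orders the flag write after the streamed data. Whether streaming stores actually help depends on the access pattern and the memory type of the destination region, as discussed in the surrounding sections.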
