13.07.2015

Intel® 64 and IA-32 Architectures Optimization Reference Manual


GENERAL OPTIMIZATION GUIDELINES

Assembly/Compiler Coding Rule 57. (H impact, L generality) Always put code and data on separate pages. Avoid self-modifying code wherever possible. If code is to be modified, try to do it all at once and make sure the code that performs the modifications and the code being modified are on separate 4-KByte pages or on separate aligned 1-KByte subpages.

3.6.8.1 Self-modifying Code

Self-modifying code (SMC) that ran correctly on Pentium III processors and prior implementations will run correctly on subsequent implementations. SMC and cross-modifying code (when multiple processors in a multiprocessor system are writing to a code page) should be avoided when high performance is desired.

Software should avoid writing to a code page in the same 1-KByte subpage that is being executed, or fetching code in the same 2-KByte subpage that is being written. In addition, sharing a page containing directly or speculatively executed code with another processor as a data page can trigger an SMC condition, which causes the entire pipeline of the machine and the trace cache to be cleared.

Dynamic code need not cause the SMC condition if the code written fills up a data page before that page is accessed as code. Dynamically-modified code (for example, from target fix-ups) is likely to suffer from the SMC condition and should be avoided where possible. Avoid the condition by introducing indirect branches: keep data tables on data pages (not code pages) and branch through them with register-indirect calls.

3.6.9 Write Combining

Write combining (WC) improves performance in two ways:

• On a write miss to the first-level cache, it allows multiple stores to the same cache line to occur before that cache line is read for ownership (RFO) from further out in the cache/memory hierarchy. Then the rest of the line is read, and the bytes that have not been written are combined with the unmodified bytes in the returned line.
• Write combining allows multiple writes to be assembled and written further out in the cache hierarchy as a unit. This saves port and bus traffic. Saving traffic is particularly important for avoiding partial writes to uncached memory.

There are six write-combining buffers; on Pentium 4 and Intel Xeon processors with a CPUID signature of family encoding 15, model encoding 3, there are eight. Two of these buffers may be written out to higher cache levels and freed up for use on other write misses. Only four write-combining buffers are guaranteed to be available for simultaneous use. Write combining applies to memory type WC; it does not apply to memory type UC.
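Because only four write-combining buffers are guaranteed to be available at once, a loop that stores to more than four output arrays per iteration may keep more partially written cache lines in flight than the buffers can hold. One possible response, sketched below, is to split such a loop so that each pass writes at most four streams. The function names, the stream count of six, and the scaling computation are illustrative assumptions rather than anything prescribed by the text above, and whether the split helps should be confirmed by measurement on the target processor.

```c
#include <assert.h>
#include <stddef.h>

enum {
    NUM_STREAMS = 6,    /* hypothetical number of output arrays */
    MAX_WC_STREAMS = 4  /* write-combining buffers guaranteed available */
};

/* Fused form: every iteration stores to six different arrays, so six
 * distinct cache lines are being filled at the same time. */
void store_fused(size_t n, const float *src, float *out[NUM_STREAMS])
{
    assert(src != NULL && out != NULL);
    for (size_t i = 0; i < n; i++)
        for (int s = 0; s < NUM_STREAMS; s++)
            out[s][i] = src[i] * (float)(s + 1);
}

/* Split form: the same stores, issued in passes that each touch at most
 * MAX_WC_STREAMS arrays. The results are identical to store_fused. */
void store_split(size_t n, const float *src, float *out[NUM_STREAMS])
{
    assert(src != NULL && out != NULL);
    for (int first = 0; first < NUM_STREAMS; first += MAX_WC_STREAMS) {
        int last = first + MAX_WC_STREAMS;
        if (last > NUM_STREAMS)
            last = NUM_STREAMS;
        for (size_t i = 0; i < n; i++)
            for (int s = first; s < last; s++)
                out[s][i] = src[i] * (float)(s + 1);
    }
}
```

The split form reads the source array twice, so it trades extra load traffic for fewer simultaneous store streams.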
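For the self-modifying code guidance in Section 3.6.8.1, the suggestion to replace target fix-ups with indirect branches through data tables can be sketched in C as a table of function pointers. Retargeting a call then becomes a store to a data page instead of a write into the instruction stream. The handler names, table size, and helper functions here are hypothetical, and a compiler will typically, but is not required to, emit a register-indirect or memory-indirect call for the dispatch.

```c
#include <assert.h>
#include <stddef.h>

typedef int (*handler_fn)(int);

enum { NUM_SLOTS = 2 };  /* hypothetical table size */

static int handler_add_one(int x) { return x + 1; }
static int handler_double(int x)  { return x * 2; }

/* The table is an initialized, writable object, so it is placed in a
 * data section rather than alongside the code that uses it. */
static handler_fn dispatch_table[NUM_SLOTS] = {
    handler_add_one,
    handler_double
};

/* Change which function a slot calls by writing data, not code. */
void retarget(size_t slot, handler_fn fn)
{
    assert(slot < NUM_SLOTS && fn != NULL);
    dispatch_table[slot] = fn;
}

/* Call through the table: an indirect call via a pointer loaded from data. */
int dispatch(size_t slot, int arg)
{
    assert(slot < NUM_SLOTS);
    return dispatch_table[slot](arg);
}
```

A caller would use `dispatch(0, 5)` before and after `retarget(0, handler_double)` to switch behavior without any code bytes being modified.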
