13.07.2015 Views

Intel® 64 and IA-32 Architectures Optimization Reference Manual

GENERAL OPTIMIZATION GUIDELINES

With profile-guided optimization, the compiler can lay out basic blocks to eliminate branches for the most frequently executed paths of a function, or at least improve their predictability. Branch prediction need not be a concern at the source level. For more information, see the Intel C++ Compiler documentation.

3.4.2 Fetch and Decode Optimization

Intel Core microarchitecture provides several mechanisms to increase front end throughput. Techniques to take advantage of some of these features are discussed below.

3.4.2.1 Optimizing for Micro-fusion

An instruction that operates on a register and a memory operand decodes into more μops than its corresponding register-register version. Replacing the equivalent work of the former instruction using the register-register version usually requires a sequence of two instructions. The latter sequence is likely to result in reduced fetch bandwidth.

Assembly/Compiler Coding Rule 18. (ML impact, M generality) For improving fetch/decode throughput, give preference to the memory flavor of an instruction over the register-only flavor of the same instruction, if such an instruction can benefit from micro-fusion.

The following examples are some of the types of micro-fusions that can be handled by all decoders:

• All stores to memory, including store immediate. Stores execute internally as two separate μops: store-address and store-data.
• All “read-modify” (load+op) instructions between register and memory, for example:
    ADDPS XMM9, OWORD PTR [RSP+40]
    FADD DOUBLE PTR [RDI+RSI*8]
    XOR RAX, QWORD PTR [RBP+32]
• All instructions of the form “load and jump,” for example:
    JMP [RDI+200]
    RET
• CMP and TEST with immediate operand and memory

An Intel 64 instruction with RIP relative addressing is not micro-fused in the following cases:

• When an additional immediate is needed, for example:
    CMP [RIP+400], 27
    MOV [RIP+3000], 142
• When RIP is needed for control flow purposes, for example:
    JMP [RIP+5000000]
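
As a minimal sketch of what Coding Rule 18 asks for, the fragment below contrasts the two flavors using the XOR example from the list of micro-fusion candidates. The scratch register RCX is an assumption chosen only for illustration; any free register would serve, and the fragment is not taken from the manual.

```asm
; Register-only flavor: the load and the operation are separate
; instructions, so two instructions must be fetched and decoded.
; RCX is a hypothetical scratch register and is clobbered here.
MOV  RCX, QWORD PTR [RBP+32]    ; load the memory operand
XOR  RAX, RCX                   ; register-register XOR

; Memory flavor: a single load+op instruction that can benefit
; from micro-fusion and needs no scratch register.
XOR  RAX, QWORD PTR [RBP+32]
```

Both sequences leave the same value in RAX. The memory flavor occupies one instruction slot in the fetch stream instead of two, which is the fetch/decode benefit the rule describes.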
