13.07.2015 Views

Intel® 64 and IA-32 Architectures Optimization Reference Manual

GENERAL OPTIMIZATION GUIDELINES

False LCP stalls occur when instructions with LCP (a) are encoded using the F7 opcodes and (b) are located at offset 14 of a fetch line. These instructions are: not, neg, div, idiv, mul, and imul. A false LCP incurs a delay because the instruction length decoder cannot determine the length of the instruction before the next fetch line, which holds the exact opcode of the instruction in its ModR/M byte.

The following techniques can help avoid false LCP stalls:

• Upcast all short operations from the F7 group of instructions to long, using the full 32-bit version.
• Ensure that the F7 opcode never starts at offset 14 of a fetch line.

Assembly/Compiler Coding Rule 22. (M impact, ML generality) Ensure that instructions using the 0xF7 opcode byte do not start at offset 14 of a fetch line, and avoid using these instructions to operate on 16-bit data; upcast short data to 32 bits.

Example 3-15. Avoiding False LCP Delays with 0xF7 Group Instructions

A Sequence Causing Delay in the Decoder:

    neg word ptr a

Alternate Sequence to Avoid Delay:

    movsx eax, word ptr a
    neg eax
    mov word ptr a, AX

3.4.2.4 Optimizing the Loop Stream Detector (LSD)

Loops that fit the following criteria are detected by the LSD and replayed from the instruction queue:

• Must be less than or equal to four 16-byte fetches.
• Must be less than or equal to 18 instructions.
• Can contain no more than four taken branches, and none of them can be a RET.
• Should usually have more than 64 iterations.

Many calculation-intensive loops, searches, and software string moves match these characteristics. These loops exceed the BPU prediction capacity and always terminate in a branch misprediction.

Assembly/Compiler Coding Rule 23. (MH impact, MH generality) Break up a loop containing a long sequence of instructions into loops of shorter instruction blocks of no more than 18 instructions.

Assembly/Compiler Coding Rule 24. (MH impact, M generality) Avoid unrolling loops containing LCP stalls if the unrolled block exceeds 18 instructions.

3.4.2.5 Scheduling Rules for the Pentium 4 Processor Decoder

Processors based on Intel NetBurst microarchitecture have a single decoder that can decode instructions at the maximum rate of one instruction per clock. Complex instructions must enlist the help of the microcode ROM.
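
The offset-14 condition for false LCP stalls and the LSD criteria listed in Section 3.4.2.4 can be written as simple checks that a code-layout or analysis tool might apply. The following C fragment is only a sketch under stated assumptions: the function names, the `loop_info` structure, and its fields are hypothetical and not part of any Intel tool, the fetch line is taken to be 16 bytes (as implied by "four 16-byte fetches"), and the caller is assumed to already know the address of the 0xF7 opcode byte and the loop's byte range, instruction count, and branch counts.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed fetch line size, taken from "four 16-byte fetches". */
#define FETCH_LINE_BYTES 16u

/* True if an 0xF7 opcode byte at this address sits at offset 14
 * of its fetch line, the placement the text says to avoid. */
bool f7_at_false_lcp_offset(uintptr_t opcode_addr)
{
    return (opcode_addr % FETCH_LINE_BYTES) == 14u;
}

/* Number of 16-byte fetch lines touched by the byte range [start, end). */
unsigned fetch_lines_spanned(uintptr_t start, uintptr_t end)
{
    assert(end > start);
    return (unsigned)((end - 1) / FETCH_LINE_BYTES
                      - start / FETCH_LINE_BYTES) + 1u;
}

/* Hypothetical summary of a loop, filled in by the caller. */
struct loop_info {
    uintptr_t start;          /* address of first instruction byte  */
    uintptr_t end;            /* one past the last instruction byte */
    unsigned  instructions;   /* instruction count of the loop body */
    unsigned  taken_branches; /* taken branches per iteration       */
    bool      contains_ret;   /* any of the branches is a RET       */
    unsigned  iterations;     /* expected trip count                */
};

/* Applies the four LSD criteria listed in the text. */
bool fits_lsd(const struct loop_info *l)
{
    return fetch_lines_spanned(l->start, l->end) <= 4u
        && l->instructions <= 18u
        && l->taken_branches <= 4u
        && !l->contains_ret
        && l->iterations > 64u;
}
```

Note that a 64-byte loop body fits in four fetches only when it starts on a 16-byte boundary; the same body starting at offset 8 touches five fetch lines and fails the first criterion, so loop alignment matters as much as loop size in this sketch.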
