13.07.2015 Views

Intel® 64 and IA-32 Architectures Optimization Reference Manual


INTEL® 64 AND IA-32 PROCESSOR ARCHITECTURES

Due to stalls in the rest of the machine, front end starvation does not usually cause performance degradation. For extremely fast code with larger instructions (such as SSE2 integer media kernels), it may be beneficial to use targeted alignment to prevent instruction starvation.

Instruction PreDecode

The predecode unit accepts the sixteen bytes from the instruction cache or prefetch buffers and carries out the following tasks:
• Determine the length of the instructions.
• Decode all prefixes associated with instructions.
• Mark various properties of instructions for the decoders (for example, “is branch.”).

The predecode unit can write up to six instructions per cycle into the instruction queue. If a fetch contains more than six instructions, the predecoder continues to decode up to six instructions per cycle until all instructions in the fetch are written to the instruction queue. Subsequent fetches can only enter predecoding after the current fetch completes.

For a fetch of seven instructions, the predecoder decodes the first six in one cycle, and then only one in the next cycle. This process would support decoding 3.5 instructions per cycle. Even if the instruction per cycle (IPC) rate is not fully optimized, it is higher than the performance seen in most applications. In general, software usually does not have to take any extra measures to prevent instruction starvation.

The following instruction prefixes cause problems during length decoding. These prefixes can dynamically change the length of instructions and are known as length changing prefixes (LCPs):
• Operand Size Override (66H) preceding an instruction with a word immediate data
• Address Size Override (67H) preceding an instruction with a mod R/M in real, 16-bit protected or 32-bit protected modes

When the predecoder encounters an LCP in the fetch line, it must use a slower length decoding algorithm. With the slower length decoding algorithm, the predecoder decodes the fetch in 6 cycles, instead of the usual 1 cycle.

Normal queuing within the processor pipeline usually cannot hide LCP penalties.

The REX prefix (4xh) in the Intel 64 architecture instruction set can change the size of two classes of instruction: MOV offset and MOV immediate. Nevertheless, it does not cause an LCP penalty and hence is not considered an LCP.

2.1.2.3 Instruction Queue (IQ)

The instruction queue is 18 instructions deep. It sits between the instruction predecode unit and the instruction decoders. It sends up to five instructions per cycle, and
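The predecode figures quoted above (six instructions per cycle, 3.5 instructions per cycle for a seven-instruction fetch, and 6 cycles for a fetch containing an LCP) can be reproduced with a small back-of-the-envelope model. The sketch below is only an illustration of that arithmetic, not a description of the hardware. The function names `predecode_cycles` and `predecode_rate` are hypothetical, and treating an LCP fetch as a flat 6 cycles regardless of instruction count is a simplifying assumption.

```python
import math

# Figures taken from the text above.
MAX_PREDECODE_PER_CYCLE = 6  # up to six instructions written per cycle
LCP_FETCH_CYCLES = 6         # a fetch containing an LCP takes 6 cycles


def predecode_cycles(instructions_in_fetch, has_lcp=False):
    """Cycles spent predecoding one 16-byte fetch in this toy model.

    Assumption: a fetch holds at least one instruction, and an LCP
    anywhere in the fetch costs a flat LCP_FETCH_CYCLES.
    """
    if instructions_in_fetch < 1:
        raise ValueError("a fetch must contain at least one instruction")
    if has_lcp:
        return LCP_FETCH_CYCLES
    # Without an LCP, the predecoder retires up to six instructions per
    # cycle, and the next fetch waits until the current one completes.
    return math.ceil(instructions_in_fetch / MAX_PREDECODE_PER_CYCLE)


def predecode_rate(fetches):
    """Average instructions per cycle over (count, has_lcp) fetches."""
    total_instructions = sum(count for count, _ in fetches)
    total_cycles = sum(predecode_cycles(count, lcp) for count, lcp in fetches)
    return total_instructions / total_cycles


if __name__ == "__main__":
    print(predecode_cycles(7))                       # seven-instruction fetch
    print(predecode_rate([(7, False)]))              # rate for that fetch
    print(predecode_rate([(4, False), (4, True)]))   # one of two fetches has an LCP
```

In this model a seven-instruction fetch takes two cycles, which gives the 3.5 instructions per cycle stated in the text. A single LCP in an otherwise ordinary fetch raises its cost from one cycle to six, which suggests why the penalty is hard to hide with normal queuing.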
