13.07.2015 Views

Intel® 64 and IA-32 Architectures Optimization Reference Manual

GENERAL OPTIMIZATION GUIDELINES

Because μops are delivered from the trace cache in the common cases, decoding rules and code alignment are not required.

3.4.2.6 Scheduling Rules for the Pentium M Processor Decoder

The Pentium M processor has three decoders, but the decoding rules to supply μops at high bandwidth are less stringent than those of the Pentium III processor. This provides an opportunity to build a front-end tracker in the compiler and try to schedule instructions correctly. The decoder limitations are:

• The first decoder is capable of decoding one macroinstruction made up of four or fewer μops in each clock cycle. It can handle any number of bytes up to the maximum of 15. Multiple prefix instructions require additional cycles.
• The two additional decoders can each decode one macroinstruction per clock cycle (assuming the instruction is one μop up to seven bytes in length).
• Instructions composed of more than four μops take multiple cycles to decode.

Assembly/Compiler Coding Rule 25. (M impact, M generality) Avoid putting explicit references to ESP in a sequence of stack operations (POP, PUSH, CALL, RET).

3.4.2.7 Other Decoding Guidelines

Assembly/Compiler Coding Rule 26. (ML impact, L generality) Use simple instructions that are less than eight bytes in length.

Assembly/Compiler Coding Rule 27. (M impact, MH generality) Avoid using prefixes to change the size of immediate and displacement.

Long instructions (more than seven bytes) limit the number of decoded instructions per cycle on the Pentium M processor. Each prefix adds one byte to the length of instruction, possibly limiting the decoder’s throughput. In addition, multiple prefixes can only be decoded by the first decoder. These prefixes also incur a delay when decoded.
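
The decoder limits above lend themselves to a small model of the kind a compiler's front-end tracker might use. The following is a minimal sketch, not the manual's method: the names Insn, is_simple, and decode_cycles are hypothetical, and two numbers are assumptions because the text only says "multiple cycles" and "additional cycles" (an instruction of more than four μops is charged ceil(uops / 4) cycles and decodes alone; multiple prefixes are charged one extra cycle).

```python
from dataclasses import dataclass
from math import ceil


@dataclass(frozen=True)
class Insn:
    """Hypothetical instruction summary; field names are illustrative only."""
    name: str
    uops: int          # number of micro-ops the instruction decodes into
    length: int        # encoded length in bytes, prefixes included (max 15)
    prefixes: int = 0  # number of prefix bytes


def is_simple(insn):
    """True if the second or third decoder could take this instruction.

    Per the text: one uop and at most seven bytes. Instructions with
    multiple prefixes are excluded because only the first decoder
    handles them.
    """
    return insn.uops == 1 and insn.length <= 7 and insn.prefixes < 2


def decode_cycles(stream, prefix_penalty=1):
    """Estimate decode cycles for an in-order instruction stream.

    ASSUMPTIONS (not stated in the text): an instruction of more than
    four uops decodes alone and takes ceil(uops / 4) cycles; multiple
    prefixes cost `prefix_penalty` extra cycles.
    """
    cycles = 0
    i = 0
    while i < len(stream):
        head = stream[i]  # the first decoder always takes the next instruction
        i += 1
        cycles += 1
        if head.prefixes >= 2:
            cycles += prefix_penalty
        if head.uops > 4:
            cycles += ceil(head.uops / 4) - 1
            continue  # assumed to decode alone
        # The two additional decoders each take one following simple instruction.
        taken = 0
        while taken < 2 and i < len(stream) and is_simple(stream[i]):
            i += 1
            taken += 1
    return cycles


# Hypothetical one-uop instructions: one short, one longer than seven bytes.
short_op = Insn("short_op", uops=1, length=2)
long_op = Insn("long_op", uops=1, length=8)

print(decode_cycles([short_op, short_op, short_op]))  # 1
print(decode_cycles([short_op, long_op, short_op]))   # 2
```

In this model the eight-byte instruction cannot use the second or third decoder, so it waits for the first decoder in the next cycle. That is the effect Rule 26 is meant to avoid. Real hardware has further constraints (fetch alignment, microcode sequencing) that the sketch ignores.
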
If multiple prefixes or a prefix that changes the size of an immediate or displacement cannot be avoided, schedule them behind instructions that stall the pipe for some other reason.

3.5 OPTIMIZING THE EXECUTION CORE

The superscalar, out-of-order execution core(s) in recent generations of microarchitectures contain multiple execution hardware resources that can execute multiple μops in parallel. These resources generally ensure that μops execute efficiently and
