
2. Optimizing subroutines in assembly language

An optimization guide for x86 platforms

By Agner Fog. Copenhagen University College of Engineering.
Copyright © 1996 - 2008. Last updated 2008-10-31.

Contents

1 Introduction
    1.1 Reasons for using assembly code
    1.2 Reasons for not using assembly code
    1.3 Microprocessors covered by this manual
    1.4 Operating systems covered by this manual
2 Before you start
    2.1 Things to decide before you start programming
    2.2 Make a test strategy
    2.3 Common coding pitfalls
3 The basics of assembly coding
    3.1 Assemblers available
    3.2 Register set and basic instructions
    3.3 Addressing modes
    3.4 Instruction code format
    3.5 Instruction prefixes
4 ABI standards
    4.1 Register usage
    4.2 Data storage
    4.3 Function calling conventions
    4.4 Name mangling and name decoration
    4.5 Function examples
5 Using intrinsic functions in C++
    5.1 Using intrinsic functions for system code
    5.2 Using intrinsic functions for instructions not available in standard C++
    5.3 Using intrinsic functions for vector operations
    5.4 Availability of intrinsic functions
6 Using inline assembly in C++
    6.1 MASM style inline assembly
    6.2 Gnu style inline assembly
7 Using an assembler
    7.1 Static link libraries
    7.2 Dynamic link libraries
    7.3 Libraries in source code form
    7.4 Making classes in assembly
    7.5 Thread-safe functions
    7.6 Makefiles
8 Making function libraries compatible with multiple compilers and platforms
    8.1 Supporting multiple name mangling schemes
    8.2 Supporting multiple calling conventions in 32 bit mode
    8.3 Supporting multiple calling conventions in 64 bit mode
    8.4 Supporting different object file formats
    8.5 Supporting other high level languages
9 Optimizing for speed
    9.1 Identify the most critical parts of your code
    9.2 Out of order execution
    9.3 Instruction fetch, decoding and retirement
    9.4 Instruction latency and throughput
    9.5 Break dependency chains
    9.6 Jumps and calls
10 Optimizing for size
    10.1 Choosing shorter instructions
    10.2 Using shorter constants and addresses
    10.3 Reusing constants
    10.4 Constants in 64-bit mode
    10.5 Addresses and pointers in 64-bit mode
    10.6 Making instructions longer for the sake of alignment
11 Optimizing memory access
    11.1 How caching works
    11.2 Trace cache
    11.3 Alignment of data
    11.4 Alignment of code
    11.5 Organizing data for improved caching
    11.6 Organizing code for improved caching
    11.7 Cache control instructions
12 Loops
    12.1 Minimize loop overhead
    12.2 Induction variables
    12.3 Move loop-invariant code
    12.4 Find the bottlenecks
    12.5 Instruction fetch, decoding and retirement in a loop
    12.6 Distribute µops evenly between execution units
    12.7 An example of analysis for bottlenecks on PM
    12.8 Same example on Core2
    12.9 Loop unrolling
    12.10 Optimize caching
    12.11 Parallelization
    12.12 Analyzing dependences
    12.13 Loops on processors without out-of-order execution
    12.14 Macro loops
13 Vector programming
    13.1 Conditional moves in SIMD registers
    13.2 Using vector instructions with other types of data than they are intended for
    13.3 Shuffling data
    13.4 Generating constants
    13.5 Accessing unaligned data
    13.6 Using AVX instruction set and YMM registers
    13.7 Vector operations in general purpose registers
14 Multithreading
15 CPU dispatching
    15.1 Checking for operating system support for XMM and YMM registers
16 Problematic Instructions
    16.1 LEA instruction (all processors)
    16.2 INC and DEC (all Intel processors)
    16.3 XCHG (all processors)
    16.4 Shifts and rotates (P4)
    16.5 Rotates through carry (all processors)
    16.6 Bit test (all processors)
    16.7 LAHF and SAHF (all processors)
    16.8 Integer multiplication (all processors)
    16.9 Division (all processors)
    16.10 String instructions (all processors)
    16.11 WAIT instruction (all processors)
    16.12 FCOM + FSTSW AX (all processors)
    16.13 FPREM (all processors)
    16.14 FRNDINT (all processors)
    16.15 FSCALE and exponential function (all processors)
    16.16 FPTAN (all processors)
    16.17 FSQRT (SSE processors)
    16.18 FLDCW (Most Intel processors)
17 Special topics
    17.1 XMM versus floating point registers (Processors with SSE)
    17.2 MMX versus XMM registers (Processors with SSE2)
    17.3 XMM versus YMM registers (Processors with AVX)
    17.4 Freeing floating point registers (all processors)
    17.5 Transitions between floating point and MMX instructions (Processors with MMX)
    17.6 Converting from floating point to integer (All processors)
    17.7 Using integer instructions for floating point operations
    17.8 Using floating point instructions for integer operations
    17.9 Moving blocks of data (All processors)
    17.10 Self-modifying code (All processors)
18 Measuring performance
    18.1 Testing speed
    18.2 The pitfalls of unit-testing
19 Literature


1 Introduction

This is the second in a series of five manuals:

1. Optimizing software in C++: An optimization guide for Windows, Linux and Mac platforms.

2. Optimizing subroutines in assembly language: An optimization guide for x86 platforms.

3. The microarchitecture of Intel and AMD CPU's: An optimization guide for assembly programmers and compiler makers.

4. Instruction tables: Lists of instruction latencies, throughputs and micro-operation breakdowns for Intel and AMD CPU's.

5. Calling conventions for different C++ compilers and operating systems.

The latest versions of these manuals are always available from www.agner.org/optimize. Copyright conditions are listed in manual 5.

The present manual explains how to combine assembly code with a high level programming language and how to optimize CPU-intensive code for speed by using assembly code.

This manual is intended for advanced assembly programmers and compiler makers. It is assumed that the reader has a good understanding of assembly language and some experience with assembly coding. Beginners are advised to seek information elsewhere and get some programming experience before trying the optimization techniques described here. I can recommend the various introductions, tutorials, discussion forums and newsgroups on the Internet (see links from www.agner.org/optimize) and the book "Introduction to 80x86 Assembly Language and Computer Architecture" by R. C. Detmer, 2nd ed. 2006.

The present manual covers all platforms that use the x86 instruction set. This instruction set is used by most microprocessors from Intel and AMD. Operating systems that can use this instruction set include DOS (16 and 32 bits), Windows (16, 32 and 64 bits), Linux (32 and 64 bits), FreeBSD/OpenBSD (32 and 64 bits), and Intel-based Mac OS (32 bits).

Optimization techniques that are not specific to assembly language are discussed in manual 1: "Optimizing software in C++". Details that are specific to a particular microprocessor are covered by manual 3: "The microarchitecture of Intel and AMD CPU's". Tables of instruction timings etc. are provided in manual 4: "Instruction tables: Lists of instruction latencies, throughputs and micro-operation breakdowns for Intel and AMD CPU's". Details about calling conventions for different operating systems and compilers are covered in manual 5: "Calling conventions for different C++ compilers and operating systems".

Programming in assembly language is much more difficult than in a high-level language. Making bugs is very easy, and finding them is very difficult. Now you have been warned! Please don't send your programming questions to me. Such mails will not be answered. There are various discussion forums on the Internet where you can get answers to your programming questions if you cannot find the answers in the relevant books and manuals.

Good luck with your hunt for nanoseconds!


1.1 Reasons for using assembly code

Assembly coding is not used as much today as previously. However, there are still reasons for learning and using assembly code. The main reasons are:

1. Educational reasons. It is important to know how microprocessors and compilers work at the instruction level in order to be able to predict which coding techniques are most efficient, to understand how various constructs in high level languages work, and to track hard-to-find errors.

2. Debugging and verifying. Looking at compiler-generated assembly code or the disassembly window in a debugger is useful for finding errors and for checking how well a compiler optimizes a particular piece of code.

3. Making compilers. Understanding assembly coding techniques is necessary for making compilers, debuggers and other development tools.

4. Embedded systems. Small embedded systems have fewer resources than PC's and mainframes. Assembly programming can be necessary for optimizing code for speed or size in small embedded systems.

5. Hardware drivers and system code. Accessing hardware, system control registers etc. may sometimes be difficult or impossible with high level code.

6. Accessing instructions that are not accessible from high level language. Certain assembly instructions have no high-level language equivalent.

7. Self-modifying code. Self-modifying code is generally not profitable because it interferes with efficient code caching. It may, however, be advantageous, for example, to include a small compiler in math programs where a user-defined function has to be calculated many times.

8. Optimizing code for size. Storage space and memory is so cheap nowadays that it is not worth the effort to use assembly language for reducing code size. However, cache size is still such a critical resource that it may be useful in some cases to optimize a critical piece of code for size in order to make it fit into the code cache.

9. Optimizing code for speed. Modern C++ compilers generally optimize code quite well in most cases. But there are still many cases where compilers perform poorly and where dramatic increases in speed can be achieved by careful assembly programming.

10. Function libraries. The total benefit of optimizing code is higher in function libraries that are used by many programmers.

11. Making function libraries compatible with multiple compilers and operating systems. It is possible to make library functions with multiple entries that are compatible with different compilers and different operating systems. This requires assembly programming.

The main focus in this manual is on optimizing code for speed, though some of the other topics are also discussed.

1.2 Reasons for not using assembly code

There are so many disadvantages and problems involved in assembly programming that it is advisable to consider the alternatives before deciding to use assembly code for a particular task. The most important reasons for not using assembly programming are:


1. Development time. Writing code in assembly language takes much longer than in a high level language.

2. Reliability and security. It is easy to make errors in assembly code. The assembler does not check whether the calling conventions and register save conventions are obeyed. Nobody checks for you whether the number of PUSH and POP instructions is the same in all possible branches and paths. There are so many possibilities for hidden errors in assembly code that it affects the reliability and security of the project unless you have a very systematic approach to testing and verifying.

3. Debugging and verifying. Assembly code is more difficult to debug and verify because there are more possibilities for errors than in high level code.

4. Maintainability. Assembly code is more difficult to modify and maintain because the language allows unstructured spaghetti code and all kinds of dirty tricks that are difficult for others to understand. Thorough documentation and a consistent programming style are needed.

5. Portability. Assembly code is very platform-specific. Porting to a different platform is difficult.

6. System code can use intrinsic functions instead of assembly. The best modern C++ compilers have intrinsic functions for accessing system control registers and other system instructions. Assembly code is no longer needed for device drivers and other system code when intrinsic functions are available.

7. Application code can use intrinsic functions or vector classes instead of assembly. The best modern C++ compilers have intrinsic functions for vector operations and other special instructions that previously required assembly programming. It is no longer necessary to use old fashioned assembly code to take advantage of the XMM vector instructions. See page 33.

8. Compilers have been improved a lot in recent years. The best compilers are now quite good. It takes a lot of expertise and experience to optimize better than the best C++ compiler.

1.3 Microprocessors covered by this manual

The following families of x86-family microprocessors are discussed in this manual:

Microprocessor name                                 Abbreviation
Intel Pentium (without name suffix)                 P1
Intel Pentium MMX                                   PMMX
Intel Pentium Pro                                   PPro
Intel Pentium II                                    P2
Intel Pentium III                                   P3
Intel Pentium 4 (NetBurst)                          P4
Intel Pentium 4 with EM64T, Pentium D, etc.         P4E
Intel Pentium M, Core Solo, Core Duo                PM
Intel Core 2                                        Core2
AMD Athlon                                          AMD K7
AMD Athlon 64, Opteron, etc., 64-bit                AMD K8
AMD Family 10h, Phenom, third generation Opteron    AMD K10

Table 1.1. Microprocessor families


See manual 3: "The microarchitecture of Intel and AMD CPU's" for details.

1.4 Operating systems covered by this manual

The following operating systems can use x86 family microprocessors:

16 bit: DOS, Windows 3.x.
32 bit: Windows, Linux, FreeBSD, OpenBSD, NetBSD, Intel-based Mac OS X.
64 bit: Windows, Linux, FreeBSD, OpenBSD, NetBSD, Intel-based Mac OS X.

All the UNIX-like operating systems (Linux, BSD, Mac OS) use the same calling conventions, with very few exceptions. Everything that is said in this manual about Linux also applies to other UNIX-like systems, possibly including systems not mentioned here.

2 Before you start

2.1 Things to decide before you start programming

Before you start to program in assembly, you have to think about why you want to use assembly language, which part of your program you need to make in assembly, and what programming method to use. If you haven't made your development strategy clear, then you will soon find yourself wasting time optimizing the wrong parts of the program, doing things in assembly that could have been done in C++, attempting to optimize things that cannot be optimized further, making spaghetti code that is difficult to maintain, and making code that is full of errors and difficult to debug.

Here is a checklist of things to consider before you start programming:

• Never make the whole program in assembly. That is a waste of time. Assembly code should be used only where speed is critical and where a significant improvement in speed can be obtained. Most of the program should be made in C or C++. These are the programming languages that are most easily combined with assembly code.

• If the purpose of using assembly is to make system code or use special instructions that are not available in standard C++, then you should isolate the part of the program that needs these instructions in a separate function or class with a well-defined functionality. Use intrinsic functions (see p. 33) if possible.

• If the purpose of using assembly is to optimize for speed, then you have to identify the part of the program that consumes the most CPU time, possibly with the use of a profiler. Check if the bottleneck is file access, memory access, CPU instructions, or something else, as described in manual 1: "Optimizing software in C++". Isolate the critical part of the program into a function or class with a well-defined functionality.

• If the purpose of using assembly is to make a function library, then you should clearly define the functionality of the library. Decide whether to make a function library or a class library. Decide whether to use static linking (.lib in Windows, .a in Linux) or dynamic linking (.dll in Windows, .so in Linux). Static linking is usually more efficient, but dynamic linking may be necessary if the library is called from languages such as C# and Visual Basic. You may possibly make both a static and a dynamic link version of the library.


• If the purpose of using assembly is to optimize an embedded application for size or speed, then find a development tool that supports both C/C++ and assembly, and make as much as possible in C or C++.

• Decide if the code is reusable or application-specific. Spending time on careful optimization is more justified if the code is reusable. Reusable code is most appropriately implemented as a function library or class library.

• Decide if the code should support multithreading. A multithreading application can take advantage of microprocessors with multiple cores. Any data that must be preserved from one function call to the next on a per-thread basis should be stored in a C++ class or a per-thread buffer supplied by the calling program.

• Decide if portability is important for your application. Should the application work in Windows, Linux and Intel-based Mac OS? Should it work in both 32-bit and 64-bit mode? Should it work on non-x86 platforms? This is important for the choice of compiler, assembler and programming method.

• Decide if your application should work on old microprocessors. If so, then you may make one version for microprocessors with, for example, the SSE2 instruction set, and another version which is compatible with old microprocessors. You may even make several versions, each optimized for a particular CPU. It is recommended to make automatic CPU dispatching (see page 130).

• There are three assembly programming methods to choose between: (1) use intrinsic functions and vector classes in a C++ compiler; (2) use inline assembly in a C++ compiler; (3) use an assembler. These three methods and their relative advantages and disadvantages are described in chapters 5, 6 and 7 respectively (pages 33, 35 and 44).

• If you are using an assembler, then you have to choose between different syntax dialects. It may be preferable to use an assembler that is compatible with the assembly code that your C++ compiler can generate.

• Make your code in C++ first and optimize it as much as you can, using the methods described in manual 1: "Optimizing software in C++". Make the compiler translate the code to assembly. Look at the compiler-generated code and see if there are any possibilities for improvement.

• Highly optimized code tends to be very difficult to read and understand for others, and even for yourself when you get back to it after some time. In order to make it possible to maintain the code, it is important that you organize it into small logical units (procedures or macros) with a well-defined interface and calling convention and appropriate comments. Decide on a consistent strategy for code comments and documentation.

• Save the compiler, assembler and all other development tools together with the source code and project files for later maintenance.
Compatible tools may not be available in a few years when updates and modifications to the code are needed.

2.2 Make a test strategy

Assembly code is error prone, difficult to debug, difficult to make in a clearly structured way, difficult to read, and difficult to maintain, as I have already mentioned. A consistent test strategy can ameliorate some of these problems and save you a lot of time.


My recommendation is to make the assembly code as an isolated module, function, class or library with a well-defined interface to the calling program. Make it all in C++ first. Then make a test program which can test all aspects of the code you want to optimize. It is easier and safer to use a test program than to test the module in the final application.

The test program has two purposes. The first purpose is to verify that the assembly code works correctly in all situations. And the second purpose is to test the speed of the assembly code without invoking the user interface, file access and other parts of the final application program that may make the speed measurements less accurate and less reproducible.

You should use the test program repeatedly after each step in the development process and after each modification of the code.

Make sure the test program works correctly. It is quite common to spend a lot of time looking for an error in the code under test when in fact the error is in the test program.

There are different test methods that can be used for verifying that the code works correctly. A white box test supplies a carefully chosen series of different sets of input data to make sure that all branches, paths and special cases in the code are tested. A black box test supplies a series of random input data and verifies that the output is correct.
A very long series of random data from a good random number generator can sometimes find rarely occurring errors that the white box test hasn't found.

The test program may compare the output of the assembly code with the output of a C++ implementation to verify that it is correct. The test should cover all boundary cases and preferably also illegal input data to see if the code generates the correct error responses.

The speed test should supply a realistic set of input data. A significant part of the CPU time may be spent on branch mispredictions in code that contains a lot of branches. The amount of branch mispredictions depends on the degree of randomness in the input data. You may experiment with the degree of randomness in the input data to see how much it influences the computation time, and then decide on a realistic degree of randomness that matches a typical real application.

An automatic test program that supplies a long stream of test data will typically find more errors, and find them much faster, than testing the code in the final application. A good test program will find most errors, but you cannot be sure that it finds all errors. It is possible that some errors show up only in combination with the final application.

2.3 Common coding pitfalls

The following list points out some of the most common programming errors in assembly code.

1. Forgetting to save registers. Some registers have callee-save status, for example EBX.
These registers must be saved in the prolog of a function and restored in the epilog if they are modified inside the function. Remember that the order of POP instructions must be the opposite of the order of PUSH instructions. See page 27 for a list of callee-save registers.

2. Unmatched PUSH and POP instructions. The number of PUSH and POP instructions must be equal for all possible paths through a function. Example:

Example 2.1. Unmatched push/pop
push ebx
test ecx, ecx
jz   Finished


...
pop  ebx
Finished: ret             ; Wrong! Label should be before pop ebx

Here, the value of EBX that is pushed is not popped again if ECX is zero. The result is that the RET instruction will pop the former value of EBX and jump to a wrong address.

3. Using a register that is reserved for another purpose. Some compilers reserve the use of EBP or EBX for a frame pointer or other purpose. Using these registers for a different purpose in inline assembly can cause errors.

4. Stack-relative addressing after push. When addressing a variable relative to the stack pointer, you must take into account all preceding instructions that modify the stack pointer. Example:

Example 2.2. Stack-relative addressing
mov  [esp+4], edi
push ebp
push ebx
cmp  esi, [esp+4]        ; Probably wrong!

Here, the programmer probably intended to compare ESI with EDI, but the value of ESP that is used for addressing has been changed by the two PUSH instructions, so that ESI is in fact compared with EBP instead.

5. Confusing value and address of a variable. Example:

Example 2.3. Value versus address (MASM syntax)
.data
MyVariable DD 0               ; Define variable
.code
mov  eax, MyVariable          ; Gets value of MyVariable
mov  eax, offset MyVariable   ; Gets address of MyVariable
lea  eax, MyVariable          ; Gets address of MyVariable
mov  ebx, [eax]               ; Gets value of MyVariable through pointer
mov  ebx, [100]               ; Gets the constant 100 despite brackets
mov  ebx, ds:[100]            ; Gets value from address 100

6. Ignoring calling conventions.
It is important to observe the calling conventions for functions, such as the order of parameters, whether parameters are transferred on the stack or in registers, and whether the stack is cleaned up by the caller or the called function. See page 27.

7. Function name mangling. C++ code that calls an assembly function should use extern "C" to avoid name mangling. Some systems require that an underscore (_) is put in front of the name in the assembly code. See page 30.

8. Forgetting return. A function declaration must end with both RET and ENDP. Using one of these is not enough. Execution will continue in the code after the procedure if there is no RET.

9. Forgetting stack alignment. The stack pointer must point to an address divisible by 16 before any call statement, except in 16-bit systems and 32-bit Windows. See page 27.
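Pitfalls 1, 2 and 8 can be illustrated together. A 32-bit function that modifies EBX might use a prolog and epilog like the following (a MASM-style sketch constructed for illustration; MyFunction is a hypothetical name, not from the examples above):

```asm
; Sketch: saving a callee-save register (32-bit, MASM syntax)
MyFunction PROC
    push ebx              ; save callee-save register in the prolog
    mov  ebx, ecx         ; the body may now modify EBX freely
    add  eax, ebx
    pop  ebx              ; restore in the epilog, opposite order of PUSH
    ret                   ; PUSH and POP are matched on every path
MyFunction ENDP
```

Note that both RET and ENDP are present, and that the single exit point keeps the PUSH/POP counts balanced on all paths.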


10. Forgetting shadow space in 64-bit Windows. It is required to reserve 32 bytes of empty stack space before any function call in 64-bit Windows. See page 29.

11. Mixing calling conventions. The calling conventions in 64-bit Windows and 64-bit Linux are different. See page 27.

12. Forgetting to clean up the floating point stack. All floating point stack registers that are used by a function must be cleared, typically with FSTP ST(0), before the function returns, except for ST(0) if it is used for the return value. It is necessary to keep track of exactly how many floating point registers are in use. If a function pushes more values on the floating point register stack than it pops, then the register stack will grow each time the function is called. An exception is generated when the stack is full. This exception may occur somewhere else in the program.

13. Forgetting to clear MMX state. A function that uses MMX registers must clear these with the EMMS instruction before any call or return.

14. Forgetting to clear YMM state. A function that uses YMM registers must clear these with the VZEROUPPER or VZEROALL instruction before any call or return.

15. Forgetting to clear the direction flag. Any function that sets the direction flag with STD must clear it with CLD before any call or return.

16. Mixing signed and unsigned integers. Unsigned integers are compared using the JB and JA instructions.
Signed integers are compared using the JL and JG instructions. Mixing signed and unsigned integers can have unintended consequences.

17. Forgetting to scale an array index. An array index must be multiplied by the size of one array element. For example: mov eax, MyIntegerArray[ebx*4].

18. Exceeding array bounds. An array with n elements is indexed from 0 to n - 1, not from 1 to n. A defective loop writing outside the bounds of an array can cause errors elsewhere in the program that are hard to find.

19. Loop with ECX = 0. A loop that ends with the LOOP instruction will repeat 2^32 times if ECX is zero. Be sure to check whether ECX is zero before the loop.

20. Reading the carry flag after INC or DEC. The INC and DEC instructions do not change the carry flag. Do not use instructions that read the carry flag, such as ADC, SBB, JC, JBE, SETA, etc., after INC or DEC. Use ADD and SUB instead of INC and DEC to avoid this problem.

3 The basics of assembly coding

3.1 Assemblers available

There are several assemblers available for the x86 instruction set, but currently none of them is good enough for universal recommendation. Assembly programmers are in the unfortunate situation that there is no universally agreed syntax for x86 assembly. Different assemblers use different syntax variants. The most common assemblers are listed below.

MASM

The Microsoft assembler is included with Microsoft C++ compilers.
Free versions can sometimes be obtained by downloading the Microsoft Windows driver kit (WDK) or the platform software development kit (SDK), or as an add-on to the free Visual C++ Express


Edition. MASM has been a de-facto standard in the Windows world for many years, and the assembly output of most Windows compilers uses MASM syntax. MASM has many advanced language features. The syntax is somewhat messy and inconsistent due to a heritage that dates back to the very first assemblers for the 8086 processor. Microsoft is still maintaining MASM in order to provide a complete set of development tools for Windows, but it is obviously not profitable, and the maintenance of MASM is apparently kept at a minimum. New instruction sets are still added regularly, but the 64-bit version has several deficiencies. The newest version can run only when the compiler is installed and only in Windows XP or later. Version 6 and earlier can run on any system, including Linux with a Windows emulator. Such versions are circulating on the web.

GAS

The Gnu assembler is part of the Gnu Binutils package that is included with most distributions of Linux, BSD and Mac OS X. The Gnu compilers produce assembly output that goes through the Gnu assembler before it is linked. The Gnu assembler traditionally uses the AT&T syntax, which works well for machine-generated code but is very inconvenient for human-generated assembly code. The AT&T syntax has the operands in an order that differs from all other x86 assemblers and from the instruction documentation published by Intel and AMD.
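For illustration, here are the same two instructions written in both syntaxes (a constructed example, not from the manual):

```asm
; Intel/MASM syntax: destination operand first
mov eax, ebx          ; copy EBX into EAX
add eax, 4            ; EAX = EAX + 4

# AT&T syntax: source operand first, register and immediate prefixes
movl %ebx, %eax       # copy EBX into EAX
addl $4, %eax         # EAX = EAX + 4
```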
It uses various prefixes like % and $ for specifying operand types.

Fortunately, newer versions of the Gnu assembler have an option for using Intel syntax instead. The Gnu-Intel syntax is almost identical to MASM syntax. The Gnu-Intel syntax defines only the syntax for instruction codes, not for directives, functions, macros, etc. The directives still use the old Gnu-AT&T syntax. Specify .intel_syntax noprefix to use the Intel syntax. Specify .att_syntax prefix to return to the AT&T syntax before leaving inline assembly in C or C++ code.

The Gnu assembler is available for all platforms, but the Windows version is currently not fully up to date.

NASM

NASM is a free open source assembler with support for several platforms and object file formats. The syntax is clearer and more consistent than MASM syntax. NASM has fewer high-level features than MASM, but it is sufficient for most purposes.

YASM

YASM is very similar to NASM and uses exactly the same syntax. YASM has turned out to be more versatile, stable and reliable than NASM in my tests, and to produce better code. It does not support the OMF format. YASM has often been the first assembler to support new instruction sets. YASM would be my best recommendation for a free multi-platform assembler if you don't need MASM syntax.

FASM

The Flat assembler is another open source assembler for multiple platforms. The syntax is not compatible with other assemblers. FASM is itself written in assembly language - an enticing idea, but unfortunately this makes its development and maintenance less efficient.
FASM is currently not fully up to date with the latest instruction sets.

WASM

The WASM assembler is included with the Open Watcom C++ compiler. The syntax resembles MASM but is somewhat different. It is not fully up to date.


JWASM

JWASM is a further development of WASM. It is fully compatible with MASM syntax, including advanced macro and high-level directives. It does not currently support 64-bit mode. JWASM is a good choice if MASM syntax is desired.

TASM

Borland Turbo Assembler is included with CodeGear C++ Builder. It is compatible with MASM syntax except for some newer syntax additions. TASM is no longer maintained but is still available. It is obsolete and does not support current instruction sets.

HLA

High Level Assembler is actually a high-level language compiler that allows assembly-like statements and produces assembly output. This was probably a good idea at the time it was invented, but today, where the best C++ compilers support intrinsic functions, I believe that HLA is no longer needed.

Inline assembly

Microsoft and Intel C++ compilers support inline assembly using a subset of the MASM syntax. It is possible to access C++ variables, functions and labels simply by inserting their names in the assembly code. This is easy, but does not support C++ register variables. See page 35.

The Gnu assembler supports inline assembly with access to the full range of instructions and directives of the Gnu assembler in both Intel and AT&T syntax.
The access to C++ variables from assembly uses a quite complicated method.

The Intel compilers for Linux and Mac systems support both the Microsoft style and the Gnu style of inline assembly.

Intrinsic functions in C++

This is the newest and most convenient way of combining low-level and high-level code. Intrinsic functions are high-level language representatives of machine instructions. For example, you can do a vector addition in C++ by calling the intrinsic function that is equivalent to an assembly instruction for vector addition. Furthermore, it is possible to define a vector class with an overloaded + operator so that a vector addition is obtained simply by writing +. Intrinsic functions are supported by Microsoft, Intel and Gnu compilers. See page 33 and manual 1: "Optimizing software in C++".

Which assembler to choose?

In most cases, the easiest solution is to use intrinsic functions in C++ code. The compiler can take care of most of the optimization so that the programmer can concentrate on choosing the best algorithm and organizing the data into vectors.
System programmers can access system instructions by using intrinsic functions without having to use assembly language.

Where real low-level programming is needed, such as in highly optimized function libraries, you may use an assembler.

It may be preferable to use an assembler that is compatible with the C++ compiler you are using. This allows you to use the compiler for translating C++ to assembly, optimize the assembly code further, and then assemble it. If the assembler is not compatible with the syntax generated by the compiler, then you may generate an object file with the compiler and disassemble the object file to the assembly syntax you need. The objconv disassembler supports several different syntax dialects.


The examples in this manual use MASM syntax, unless otherwise noted. The MASM syntax is described in the Microsoft Macro Assembler Reference at msdn.microsoft.com.

See www.agner.org/optimize for links to various syntax manuals, coding manuals and discussion forums.

3.2 Register set and basic instructions

Registers in 16 bit mode

General purpose and integer registers:

Full register   Partial register   Partial register
bit 0 - 15      bit 8 - 15         bit 0 - 7
AX              AH                 AL
BX              BH                 BL
CX              CH                 CL
DX              DH                 DL
SI
DI
BP
SP
Flags
IP
Table 3.1. General purpose registers in 16 bit mode.

The 32-bit registers are also available in 16-bit mode if supported by the microprocessor and operating system. The high word of ESP should not be used because it is not saved during interrupts.

Floating point registers:

Full register
bit 0 - 79
ST(0)
ST(1)
ST(2)
ST(3)
ST(4)
ST(5)
ST(6)
ST(7)
Table 3.2. Floating point stack registers

MMX registers may be available if supported by the microprocessor. XMM registers may be available if supported by the microprocessor and operating system.

Segment registers:

Full register
bit 0 - 15
CS
DS
ES
SS
Table 3.3. Segment registers in 16 bit mode

Registers FS and GS may be available.


Registers in 32 bit mode

General purpose and integer registers:

Full register   Partial register   Partial register   Partial register
bit 0 - 31      bit 0 - 15         bit 8 - 15         bit 0 - 7
EAX             AX                 AH                 AL
EBX             BX                 BH                 BL
ECX             CX                 CH                 CL
EDX             DX                 DH                 DL
ESI             SI
EDI             DI
EBP             BP
ESP             SP
EFlags          Flags
EIP             IP
Table 3.4. General purpose registers in 32 bit mode

Floating point and 64-bit vector registers:

Full register   Partial register
bit 0 - 79      bit 0 - 63
ST(0)           MM0
ST(1)           MM1
ST(2)           MM2
ST(3)           MM3
ST(4)           MM4
ST(5)           MM5
ST(6)           MM6
ST(7)           MM7
Table 3.5. Floating point and MMX registers

The MMX registers are only available if supported by the microprocessor. The ST and MMX registers cannot be used in the same part of the code. A section of code using MMX registers must be separated from any subsequent section using ST registers by executing an EMMS instruction.

128- and 256-bit integer and floating point vector registers:

Full register   Full or partial register
bit 0 - 255     bit 0 - 127
YMM0            XMM0
YMM1            XMM1
YMM2            XMM2
YMM3            XMM3
YMM4            XMM4
YMM5            XMM5
YMM6            XMM6
YMM7            XMM7
Table 3.6. XMM and YMM registers in 32 bit mode

The XMM registers are only available if supported both by the microprocessor and the operating system. Scalar floating point instructions use only 32 or 64 bits of the XMM registers for single or double precision, respectively. The YMM registers are expected to be available in 2010.

Segment registers:


Full register
bit 0 - 15
CS
DS
ES
FS
GS
SS
Table 3.7. Segment registers in 32 bit mode

Registers in 64 bit mode

General purpose and integer registers:

Full register   Partial register   Partial register   Partial register   Partial register
bit 0 - 63      bit 0 - 31         bit 0 - 15         bit 8 - 15         bit 0 - 7
RAX             EAX                AX                 AH                 AL
RBX             EBX                BX                 BH                 BL
RCX             ECX                CX                 CH                 CL
RDX             EDX                DX                 DH                 DL
RSI             ESI                SI                                    SIL
RDI             EDI                DI                                    DIL
RBP             EBP                BP                                    BPL
RSP             ESP                SP                                    SPL
R8              R8D                R8W                                   R8B
R9              R9D                R9W                                   R9B
R10             R10D               R10W                                  R10B
R11             R11D               R11W                                  R11B
R12             R12D               R12W                                  R12B
R13             R13D               R13W                                  R13B
R14             R14D               R14W                                  R14B
R15             R15D               R15W                                  R15B
RFlags                             Flags
RIP
Table 3.8. Registers in 64 bit mode

The high 8-bit registers AH, BH, CH, DH can only be used in instructions that have no REX prefix.

Note that modifying a 32-bit partial register will set the rest of the register (bit 32 - 63) to zero, but modifying an 8-bit or 16-bit partial register does not affect the rest of the register. This can be illustrated by the following sequence:

; Example 3.1. 8, 16, 32 and 64 bit registers
mov rax, 1111111111111111H   ; rax = 1111111111111111H
mov eax, 22222222H           ; rax = 0000000022222222H
mov ax,  3333H               ; rax = 0000000022223333H
mov al,  44H                 ; rax = 0000000022223344H

There is a good reason for this inconsistency. Setting the unused part of a register to zero is more efficient than leaving it unchanged because this removes a false dependence on previous values. But the principle of resetting the unused part of a register cannot be extended to 16-bit and 8-bit partial registers because this would break backwards compatibility with 32-bit and 16-bit modes.

The only instruction that can have a 64-bit immediate data operand is MOV. Other integer instructions can only have a 32-bit sign-extended operand. Examples:


; Example 3.2. Immediate operands, full and sign extended
mov rax, 1111111111111111H   ; Full 64 bit immediate operand
mov rax, -1                  ; 32 bit sign-extended operand
mov eax, 0ffffffffH          ; 32 bit zero-extended operand
add rax, 1                   ; 8 bit sign-extended operand
add rax, 100H                ; 32 bit sign-extended operand
add eax, 100H                ; 32 bit operand, result is zero-extended
mov rbx, 100000000H          ; 64 bit immediate operand
add rax, rbx                 ; Use an extra register if big operand

It is not possible to use a 16-bit sign-extended operand. If you need to add an immediate value to a 64-bit register, then it is necessary to first move the value into another register if the value is too big to fit into a 32-bit sign-extended operand.

Floating point and 64-bit vector registers:

Full register   Partial register
bit 0 - 79      bit 0 - 63
ST(0)           MM0
ST(1)           MM1
ST(2)           MM2
ST(3)           MM3
ST(4)           MM4
ST(5)           MM5
ST(6)           MM6
ST(7)           MM7
Table 3.9. Floating point and MMX registers

The ST and MMX registers cannot be used in the same part of the code. A section of code using MMX registers must be separated from any subsequent section using ST registers by executing an EMMS instruction. The ST and MMX registers cannot be used in device drivers for 64-bit Windows.

128- and 256-bit integer and floating point vector registers:

Full register   Full or partial register
bit 0 - 255     bit 0 - 127
YMM0            XMM0
YMM1            XMM1
YMM2            XMM2
YMM3            XMM3
YMM4            XMM4
YMM5            XMM5
YMM6            XMM6
YMM7            XMM7
YMM8            XMM8
YMM9            XMM9
YMM10           XMM10
YMM11           XMM11
YMM12           XMM12
YMM13           XMM13
YMM14           XMM14
YMM15           XMM15
Table 3.10. XMM and YMM registers in 64 bit mode

Scalar floating point instructions use only 32 or 64 bits of the XMM registers for single or double precision, respectively.
The YMM registers are expected to be available in 2010.

Segment registers:

Full register


it 0 - 15CSFSGSTable 3.11. Segment registers <strong>in</strong> 64 bit modeSegment registers are only used for special purposes.3.3 Address<strong>in</strong>g modesAddress<strong>in</strong>g <strong>in</strong> 16-bit mode16-bit code uses a segmented memory model. A memory operand can have any of thesecomponents:• A segment specification. This can be any segment register or a segment or groupname associated with a segment register. (The default segment is DS, except if BP isused as base register). The segment can be implied from a label def<strong>in</strong>ed <strong>in</strong>side asegment.• A label def<strong>in</strong><strong>in</strong>g a relocatable offset. The offset relative to the start of the segment iscalculated by the l<strong>in</strong>ker.• An immediate offset. This is a constant. If there is also a relocatable offset then thevalues are added.• A base register. This can only be BX or BP.• An <strong>in</strong>dex register. This can only be SI or DI. There can be no scale factor.A memory operand can have all of these components. An operand conta<strong>in</strong><strong>in</strong>g only animmediate offset is not <strong>in</strong>terpreted as a memory operand by the MASM assembler, even if ithas a []. Examples:; Example 3.3. Memory operands <strong>in</strong> 16-bit modeMOV AX, DS:[100H] ; Address has segment and immediate offsetADD AX, MEM[SI]+4 ; Has relocatable offset and <strong>in</strong>dex and immediateData structures bigger than 64 kb are handled <strong>in</strong> the follow<strong>in</strong>g ways. In real mode andvirtual mode (DOS): Add<strong>in</strong>g 1 to the segment register corresponds to add<strong>in</strong>g 10H to theoffset. In protected mode (W<strong>in</strong>dows 3.x): Add<strong>in</strong>g 8 to the segment register corresponds toadd<strong>in</strong>g 10000H to the offset. 
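The real-mode rule (adding 1 to the segment equals adding 10H to the offset) follows from how the linear address is formed from a segment:offset pair. A minimal C sketch of that arithmetic; the function name is illustrative:

```c
#include <stdint.h>

/* Real and virtual mode: linear address = segment * 16 + offset,
   so the segment and offset overlap by 12 bits. */
uint32_t real_mode_linear(uint16_t segment, uint16_t offset) {
    return ((uint32_t)segment << 4) + offset;
}
```

Because of the overlap, many different segment:offset pairs address the same byte, e.g. 1235:0000 and 1234:0010.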
The value added to the segment must be a multiple of 8.

Addressing in 32-bit mode

32-bit code uses a flat memory model in most cases. Segmentation is possible but only used for special purposes (e.g. thread environment block in FS). A memory operand can have any of these components:

• A segment specification. Not used in flat mode.
• A label defining a relocatable offset. The offset relative to the FLAT segment group is calculated by the linker.
• An immediate offset. This is a constant. If there is also a relocatable offset then the values are added.
• A base register. This can be any 32 bit register.
• An index register. This can be any 32 bit register except ESP.
• A scale factor applied to the index register. Allowed values are 1, 2, 4, 8.

A memory operand can have all of these components. Examples:

; Example 3.4. Memory operands in 32-bit mode
mov eax, fs:[10H]       ; Address has segment and immediate offset
add eax, mem[esi]       ; Has relocatable offset and index
add eax, [esp+ecx*4+8]  ; Base, index, scale and immediate offset

Position-independent code in 32-bit mode

Position-independent code is required for making shared objects (*.so) in 32-bit Unix-like systems. The most common method for making position-independent code in 32-bit Linux and BSD is to use a global offset table (GOT) containing the addresses of all static objects. The GOT method is quite inefficient because the code has to fetch an address from the GOT every time it reads or writes data in the data segment. A faster method is to use an arbitrary reference point, as shown in the following example:

; Example 3.5. Position-independent code, 32 bit, YASM syntax
SECTION .data
alpha: dd 1
beta:  dd 2

SECTION .text
funca:  ; This function returns alpha + beta
        call get_thunk_ecx             ; get ecx = eip
refpoint:                              ; ecx points here
        mov  eax, [ecx+alpha-refpoint] ; relative address
        add  eax, [ecx+beta -refpoint] ; relative address
        ret

get_thunk_ecx:  ; Function for reading instruction pointer
        mov  ecx, [esp]
        ret

The only instruction that can read the instruction pointer in 32-bit mode is the call instruction. In example 3.5 we are using call get_thunk_ecx for reading the instruction pointer (eip) into ecx. ecx will then point to the first instruction after the call. This is our reference point, named refpoint. (get_thunk_ecx must be a separate function with its own return because a call without a return would cause mispredictions of subsequent returns). All objects in the data segment can now be addressed relative to refpoint with ecx as pointer.

This method is commonly used in Mac systems, where the Mach-O file format supports references relative to an arbitrary point. Other file formats don't support this kind of reference, but it is possible to use a self-relative reference with an offset. The YASM and Gnu assemblers will do this automatically, while most other assemblers are unable to handle this situation. It is therefore necessary to use the YASM or Gnu assembler if you want to generate position-independent code in 32-bit mode with this method.
The code may look strange in a debugger or disassembler, but it executes without any problems in all 32-bit x86 operating systems.

The GOT method would use the same reference point method as in example 3.5 for addressing the GOT, and then use the GOT to read the addresses of alpha and beta into other pointer registers. This is an unnecessary waste of time and registers, because if we can access the GOT relative to the reference point, then we can just as well access alpha and beta relative to the reference point.

The pointer tables used in switch/case statements can use the same reference point for the table and for the pointers in the table:

; Example 3.6. Position-independent switch, 32 bit, YASM syntax
SECTION .data
jumptable: dd case1-refpoint, case2-refpoint, case3-refpoint

SECTION .text
funcb:  ; This function implements a switch statement
        mov  eax, [esp+4]    ; function parameter
        call get_thunk_ecx   ; get ecx = eip
refpoint:                    ; ecx points here
        cmp  eax, 3
        jnb  case_default    ; index out of range
        mov  eax, [ecx+eax*4+jumptable-refpoint] ; read table entry
        ; The jump addresses are relative to refpoint, get absolute address:
        add  eax, ecx
        jmp  eax             ; jump to desired case

case1:  ...
        ret
case2:  ...
        ret
case3:  ...
        ret
case_default:
        ...
        ret

get_thunk_ecx:  ; Function for reading instruction pointer
        mov  ecx, [esp]
        ret

Addressing in 64-bit mode

64-bit code always uses a flat memory model. Segmentation is impossible except for FS and GS, which are used for special purposes only (thread environment block, etc.).

There are several different addressing modes in 64-bit mode: RIP-relative, 32-bit absolute, 64-bit absolute, and relative to a base register.

RIP-relative addressing

This is the preferred addressing mode for static data. The address contains a 32-bit sign-extended offset relative to the instruction pointer. The address cannot contain any segment register or index register, and no base register other than RIP, which is implicit. Example:

; Example 3.7a. RIP-relative memory operand, MASM syntax
mov eax, [mem]

; Example 3.7b. RIP-relative memory operand, NASM/YASM syntax
default rel
mov eax, [mem]

; Example 3.7c. RIP-relative memory operand, Gas/Intel syntax
mov eax, [mem+rip]
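The arithmetic behind a RIP-relative operand can be sketched in C: the assembler stores the distance from the end of the current instruction to the target, and the processor adds that distance back to RIP. A hedged sketch with illustrative names and made-up example addresses:

```c
#include <stdint.h>

/* What the assembler/linker computes: the 32-bit signed distance
   from the address of the NEXT instruction to the target. */
int32_t rip_disp32(uint64_t target, uint64_t next_instruction) {
    return (int32_t)(target - next_instruction);
}

/* What the processor computes at run time: RIP (which already points
   to the next instruction) plus the sign-extended displacement. */
uint64_t rip_effective(uint64_t next_instruction, int32_t disp32) {
    return next_instruction + (int64_t)disp32;
}
```

The round trip is exact as long as the target is within ±2^31 of the instruction, which is why RIP-relative addressing needs no load-time relocation.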


The MASM assembler always generates RIP-relative addresses for static data when no explicit base or index register is specified. On other assemblers you must remember to specify relative addressing.

32-bit absolute addressing in 64 bit mode

A 32-bit constant address is sign-extended to 64 bits. This addressing mode works only if it is certain that all addresses are below 2^31 (or above -2^31 for system code).

It is safe to use 32-bit absolute addressing in Linux and BSD, where all addresses are below 2^31 by default. It is unsafe in Windows DLLs because they might be relocated to any address. It will not work in Mac systems, where addresses are above 2^32 by default.

Note that the NASM, YASM and Gas assemblers can make 32-bit absolute addresses when you do not explicitly specify rip-relative addresses. You have to specify default rel in NASM/YASM or [mem+rip] in Gas to avoid 32-bit absolute addresses.

There is absolutely no reason to use absolute addresses for simple memory operands. RIP-relative addresses make instructions shorter, they eliminate the need for relocation at load time, and they are safe to use in all systems. Absolute addresses are needed only for accessing arrays where there is an index register, e.g.

; Example 3.8. 32 bit absolute addresses in 64-bit mode
mov al, [chararray + rsi]
mov ebx, [intarray + rsi*4]

This method can be used only if the address is guaranteed to be < 2^31, as explained above. See below for alternative methods of addressing static arrays.

The MASM assembler generates absolute addresses only when a base or index register is specified together with a memory label as in example 3.8 above. The index register should preferably be a 64-bit register, not a 32-bit register. Segmentation is possible only with FS or GS.

64-bit absolute addressing

This uses a 64-bit absolute virtual address. The address cannot contain any segment register, base or index register. 64-bit absolute addresses can only be used with the MOV instruction, and only with AL, AX, EAX or RAX as source or destination.

; Example 3.9. 64 bit absolute address, YASM/NASM syntax
mov eax, dword [qword a]

This addressing mode is not supported by the MASM assembler, but it is supported by other assemblers.

Addressing relative to 64-bit base register

A memory operand in this mode can have any of these components:

• A base register. This can be any 64 bit integer register.
• An index register. This can be any 64 bit integer register except RSP.
• A scale factor applied to the index register. The only possible values are 1, 2, 4, 8.
• An immediate offset. This is a constant offset relative to the base register.

A base register is always needed for this addressing mode. The other components are optional. Examples:


; Example 3.10. Base register addressing in 64 bit mode
mov eax, [rsi]
add eax, [rsp + 4*rcx + 8]

This addressing mode is used for data on the stack, for structure and class members, and for arrays.

Addressing static arrays in 64 bit mode

It is not possible to access static arrays with RIP-relative addressing and an index register. There are several possible alternatives. The following examples address static arrays. The C++ code for this example is:

// Example 3.11a. Static arrays in 64 bit mode
// C++ code:
static int a[100], b[100];
for (int i = 0; i < 100; i++) {
   b[i] = -a[i];
}

The simplest solution is to use 32-bit absolute addresses. This is possible as long as the addresses are below 2^31.

; Example 3.11b. Use 32-bit absolute addresses (64 bit Linux)
; Assumes that image base < 80000000H
.data
A  DD 100 dup (?)        ; Define static array A
B  DD 100 dup (?)        ; Define static array B

.code
        xor ecx, ecx         ; i = 0
TOPOFLOOP:                   ; Top of loop
        mov eax, [A+rcx*4]   ; 32-bit address + scaled index
        neg eax
        mov [B+rcx*4], eax   ; 32-bit address + scaled index
        add ecx, 1
        cmp ecx, 100         ; i < 100
        jb  TOPOFLOOP        ; Loop

The assembler will generate a 32-bit relocatable address for A and B in example 3.11b because it cannot combine a RIP-relative address with an index register.

This method is used by the Gnu and Intel compilers in 64-bit Linux to access static arrays. It is not used by any compiler for 64-bit Windows that I have seen, but it works in Windows as well if the address is less than 2^31. The image base is typically 2^22 for application programs and between 2^28 and 2^29 for DLLs, so this method will work in most cases, but not all. This method cannot normally be used in 64-bit Mac systems because all addresses are above 2^32 by default.

The second method is to use image-relative addressing. The following solution loads the image base into register RBX by using a LEA instruction with a RIP-relative address:

; Example 3.11c. Address relative to image base
; 64 bit, Windows only, MASM assembler
.data
A  DD 100 dup (?)
B  DD 100 dup (?)
extern __ImageBase:byte


.code
        lea rbx, __ImageBase ; Use RIP-relative address of image base
        xor ecx, ecx         ; i = 0
TOPOFLOOP:                   ; Top of loop
        ; imagerel(A) = address of A relative to image base:
        mov eax, [(imagerel A) + rbx + rcx*4]
        neg eax
        mov [(imagerel B) + rbx + rcx*4], eax
        add ecx, 1
        cmp ecx, 100
        jb  TOPOFLOOP

This method is used in 64 bit Windows only. In Linux, the image base is available as __executable_start, but image-relative addresses are not supported in the ELF file format. The Mach-O format allows addresses relative to an arbitrary reference point, including the image base, which is available as __mh_execute_header.

The third solution loads the address of array A into register RBX by using a LEA instruction with a RIP-relative address. The address of B is calculated relative to A.

; Example 3.11d. Load address of array into base register
; (All 64-bit systems)
.data
A  DD 100 dup (?)
B  DD 100 dup (?)

.code
        lea rbx, [A]         ; Load RIP-relative address of A
        xor ecx, ecx         ; i = 0
TOPOFLOOP:                   ; Top of loop
        mov eax, [rbx + 4*rcx]         ; A[i]
        neg eax
        mov [(B-A) + rbx + 4*rcx], eax ; Use offset of B relative to A
        add ecx, 1
        cmp ecx, 100
        jb  TOPOFLOOP

Note that we can use a 32-bit instruction for incrementing the index (ADD ECX,1), even though we are using the 64-bit register for the index (RCX). This works because we are sure that the index is non-negative and less than 2^32.
This method can use any address in the data segment as a reference point and calculate other addresses relative to this reference point.

If an array is more than 2^31 bytes away from the instruction pointer then we have to load the full 64 bit address into a base register. For example, we can replace LEA RBX,[A] with MOV RBX,OFFSET A in example 3.11d.

Position-independent code in 64-bit mode

Position-independent code is rarely required in 64-bit systems, but it is easy to make. Static data can be accessed with rip-relative addressing. Static arrays can be accessed as in example 3.11d.

The pointer tables of switch statements can be made relative to an arbitrary reference point. It may be convenient to use the table itself as the reference point:

; Example 3.12. switch with relative pointers, 64 bit, YASM syntax
SECTION .data
jumptable: dd case1-jumptable, case2-jumptable, case3-jumptable


SECTION .text
default rel                  ; use relative addresses

funcb:  ; This function implements a switch statement
        mov  eax, [rsp+8]    ; function parameter
        cmp  eax, 3
        jnb  case_default    ; index out of range
        lea  rdx, [jumptable]          ; address of table
        movsxd rax, dword [rdx+rax*4]  ; read table entry
        ; The jump addresses are relative to jumptable, get absolute address:
        add  rax, rdx
        jmp  rax             ; jump to desired case

case1:  ...
        ret
case2:  ...
        ret
case3:  ...
        ret
case_default:
        ...
        ret

This method can be useful for reducing the size of long pointer tables because it uses 32-bit relative pointers rather than 64-bit absolute pointers.

The MASM assembler cannot generate the relative tables in example 3.12 unless the jump table is placed in the code segment. It is preferable to place the jump table in the data segment for optimal caching and code prefetching, and this can be done with the YASM or Gnu assembler.

3.4 Instruction code format

The format for instruction codes is described in detail in manuals from Intel and AMD. The basic principles of instruction encoding are explained here because of their relevance to microprocessor performance. In general, you can rely on the assembler to generate the smallest possible encoding of an instruction.

Each instruction can consist of the following elements, in the order mentioned:

1. Prefixes (0-5 bytes)
These are prefixes that modify the meaning of the opcode that follows. There are several different kinds of prefixes, as described in table 3.12 below.

2. Opcode (1-3 bytes)
This is the instruction code.
The first byte is 0Fh if there are two or three bytes. The second byte is 38h - 3Ah if there are three bytes.

3. mod-reg-r/m byte (0-1 byte)
This byte specifies the operands. It consists of three fields. The mod field is two bits specifying the addressing mode, the reg field is three bits specifying a register for the first operand (most often the destination operand), and the r/m field is three bits specifying the second operand (most often the source operand), which can be a register or a memory operand. The reg field can be part of the opcode if there is only one operand.

4. SIB byte (0-1 byte)
This byte is used for memory operands with complex addressing modes, and only if there is a mod-reg-r/m byte. It has two bits for a scale factor, three bits specifying a scaled index register, and three bits specifying a base pointer register. A SIB byte is needed in the following cases:
a. If a memory operand has two pointer or index registers,
b. If a memory operand has a scaled index register,
c. If a memory operand has the stack pointer (ESP or RSP) as base pointer,
d. If a memory operand in 64-bit mode uses a 32-bit sign-extended direct memory address rather than a RIP-relative address.
A SIB byte cannot be used in 16-bit addressing mode.

5. Displacement (0, 1, 2, 4 or 8 bytes)
This is part of the address of a memory operand. It is added to the value of the pointer registers (base or index or both), if any.
A 1-byte sign-extended displacement is possible in all addressing modes if a pointer register is specified.
A 2-byte displacement is possible only in 16-bit addressing mode.
A 4-byte displacement is possible in 32-bit addressing mode.
A 4-byte sign-extended displacement is possible in 64-bit addressing mode. If there are any pointer registers specified then the displacement is added to these. If there is no pointer register specified and no SIB byte then the displacement is added to RIP. If there is a SIB byte and no pointer register then the sign-extended value is an absolute direct address.
An 8-byte absolute direct address is possible in 64-bit addressing mode for a few MOV instructions that have no mod-reg-r/m byte.

6. Immediate operand (0, 1, 2, 4 or 8 bytes)
This is a data constant which in most cases is a source operand for the operation.
A 1-byte sign-extended immediate operand is possible in all modes for all instructions that can have immediate operands, except MOV, CALL and RET.
A 2-byte immediate operand is possible for instructions with 16-bit operand size.
A 4-byte immediate operand is possible for instructions with 32-bit operand size.
A 4-byte sign-extended immediate operand is possible for instructions with 64-bit operand size.
An 8-byte immediate operand is possible only for moves into a 64-bit register.

3.5 Instruction prefixes

The following table summarizes the use of instruction prefixes.

prefix for:                                     16 bit mode  32 bit mode  64 bit mode
8 bit operand size                              none         none         none
16 bit operand size                             none         66h          66h
32 bit operand size                             66h          none         none
64 bit operand size                             n.a.         n.a.         REX.W (48h)
packed integers in mmx register                 none         none         none
packed integers in xmm register                 66h          66h          66h
packed single-precision floats in xmm register  none         none         none
packed double-precision floats in xmm register  66h          66h          66h
scalar single-precision floats in xmm register  F3h          F3h          F3h
scalar double-precision floats in xmm register  F2h          F2h          F2h
16 bit address size                             none         67h          n.a.
32 bit address size                             67h          none         67h
64 bit address size                             n.a.         n.a.         none
CS segment                                      2Eh          2Eh          n.a.
DS segment                                      3Eh          3Eh          n.a.
ES segment                                      26h          26h          n.a.
SS segment                                      36h          36h          n.a.


FS segment                                      64h          64h          64h
GS segment                                      65h          65h          65h
REP or REPE string operation                    F3h          F3h          F3h
REPNE string operation                          F2h          F2h          F2h
Locked memory operand                           F0h          F0h          F0h
Register R8 - R15, XMM8 - XMM15 in reg field    n.a.         n.a.         REX.R (44h)
Register R8 - R15, XMM8 - XMM15 in r/m field    n.a.         n.a.         REX.B (41h)
Register R8 - R15 in SIB.base field             n.a.         n.a.         REX.B (41h)
Register R8 - R15 in SIB.index field            n.a.         n.a.         REX.X (42h)
Register SIL, DIL, BPL, SPL                     n.a.         n.a.         REX (40h)
Predict branch taken first time                 3Eh          3Eh          3Eh
Predict branch not taken first time             2Eh          2Eh          2Eh
VEX prefix, 2 bytes                             C5h          C5h          C5h
VEX prefix, 3 bytes                             C4h          C4h          C4h
Table 3.12. Instruction prefixes

Segment prefixes are rarely needed in a flat memory model. The DS segment prefix is only needed if a memory operand has base register BP, EBP or ESP and the DS segment is desired rather than SS.

The lock prefix is only allowed on certain instructions that read, modify and write a memory operand.

The branch prediction prefixes work only on Pentium 4 and are rarely needed.

There can be no more than one REX prefix. If more than one REX prefix is needed then the values are OR'ed into a single byte with a value in the range 40h to 4Fh. These prefixes are available only in 64-bit mode. The bytes 40h to 4Fh are instruction opcodes (INC r and DEC r) in 16-bit and 32-bit mode; these instructions are coded differently in 64-bit mode.

The prefixes can be inserted in any order, except for the REX and VEX prefixes, which must come after any other prefixes.

The future AVX instruction set will use 2- and 3-byte prefixes called VEX prefixes. The VEX prefixes include bits to replace all 66h, F2h, F3h and REX prefixes as well as the first byte of two-byte opcodes and the first two bytes of three-byte opcodes. VEX prefixes also include bits for specifying YMM registers, an extra register operand, and bits for future extensions. The only prefixes allowed before a VEX prefix are segment prefixes and address size prefixes. No additional prefixes are allowed after a VEX prefix.

Meaningless, redundant or misplaced prefixes are ignored, except for the LOCK and VEX prefixes. But prefixes that have no effect in a particular context may have an effect in future processors.

Unnecessary prefixes may be used instead of NOPs for aligning code, but an excessive number of prefixes can slow down instruction decoding.

There can be any number of prefixes as long as the total instruction length does not exceed 15 bytes. For example, a MOV EAX,EBX with ten ES segment prefixes will still work, but it may take a long time to decode.
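The field layout of the mod-reg-r/m and SIB bytes described in section 3.4 can be sketched as two bit-packing helpers. This is a sketch for illustration, not an assembler; the register numbers follow the standard encoding (EAX=0, ECX=1, EDX=2, EBX=3, ESP=4, EBP=5, ESI=6, EDI=7):

```c
#include <stdint.h>

/* mod-reg-r/m byte: mod (2 bits) | reg (3 bits) | r/m (3 bits) */
uint8_t modrm(unsigned mod, unsigned reg, unsigned rm) {
    return (uint8_t)(((mod & 3) << 6) | ((reg & 7) << 3) | (rm & 7));
}

/* SIB byte: scale (2 bits, log2 of the factor) | index (3 bits) | base (3 bits) */
uint8_t sib(unsigned scale_log2, unsigned index, unsigned base) {
    return (uint8_t)(((scale_log2 & 3) << 6) | ((index & 7) << 3) | (base & 7));
}
```

For example, MOV EBX,EAX uses opcode 89h with mod = 3 (register operand), reg = 0 (EAX, the source) and r/m = 3 (EBX, the destination), giving the two-byte encoding 89 C3. The operand [eax+ecx*4] needs a SIB byte with scale factor 4 (scale field 2), index ECX and base EAX.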


4 ABI standards

ABI stands for Application Binary Interface. An ABI is a standard for how functions are called, how parameters and return values are transferred, and which registers a function is allowed to change. It is important to obey the appropriate ABI standard when combining assembly with high level language. The details of calling conventions etc. are covered in manual 5: "Calling conventions for different C++ compilers and operating systems". The most important rules are summarized here for your convenience.

4.1 Register usage

Registers that can be used freely:
  16 bit DOS, Windows:   AX, BX, CX, DX, ES, ST(0)-ST(7)
  32 bit Windows, Unix:  EAX, ECX, EDX, ST(0)-ST(7), XMM0-XMM7, YMM0-YMM7
  64 bit Windows:        RAX, RCX, RDX, R8-R11, ST(0)-ST(7), XMM0-XMM5, YMM0-YMM5, YMM6H-YMM15H
  64 bit Unix:           RAX, RCX, RDX, RSI, RDI, R8-R11, ST(0)-ST(7), XMM0-XMM15, YMM0-YMM15

Registers that must be saved and restored:
  16 bit DOS, Windows:   SI, DI, BP, DS
  32 bit Windows, Unix:  EBX, ESI, EDI, EBP
  64 bit Windows:        RBX, RSI, RDI, RBP, R12-R15, XMM6-XMM15
  64 bit Unix:           RBX, RBP, R12-R15

Registers that cannot be changed:
  32 bit Windows, Unix:  DS, ES, FS, GS, SS

Registers used for parameter transfer:
  32 bit Windows, Unix:  (ECX)
  64 bit Windows:        RCX, RDX, R8, R9, XMM0-XMM3, YMM0-YMM3
  64 bit Unix:           RDI, RSI, RDX, RCX, R8, R9, XMM0-XMM7, YMM0-YMM7

Registers used for return values:
  16 bit DOS, Windows:   AX, DX, ST(0)
  32 bit Windows, Unix:  EAX, EDX, ST(0)
  64 bit Windows:        RAX, XMM0, YMM0
  64 bit Unix:           RAX, RDX, XMM0, XMM1, YMM0, ST(0), ST(1)

Table 4.1. Register usage

The floating point registers ST(0) - ST(7) must be empty before any call or return, except when used for a function return value. The MMX registers must be cleared by EMMS before any call or return.

The arithmetic flags can be changed freely. The direction flag may be set temporarily, but must be cleared before any call or return in 32-bit and 64-bit systems. The interrupt flag cannot be cleared in protected operating systems.
The floating point control word and bits 6-15 of the MXCSR register must be saved and restored in functions that modify them.

Registers FS and GS are used for thread information blocks etc. and should not be changed. Other segment registers should not be changed, except in segmented 16-bit models.

4.2 Data storage

Variables and objects that are declared inside a function in C or C++ are stored on the stack and addressed relative to the stack pointer or a stack frame. This is the most efficient way of storing data, for two reasons. Firstly, the stack space used for local storage is released when the function returns and may be reused by the next function that is called; using the same memory area repeatedly improves data caching. Secondly, data stored on the stack can often be addressed with an 8-bit offset relative to a pointer rather than the 32 bits required for addressing data in the data segment. This makes the code more compact, so that it takes less space in the code cache or trace cache.

Global and static data in C++ are stored in the data segment and addressed with 32-bit absolute addresses in 32-bit systems and with 32-bit RIP-relative addresses in 64-bit systems. A third way of storing data in C++ is to allocate space with new or malloc. This method should be avoided if speed is critical.

4.3 Function calling conventions

Calling convention in 16 bit mode DOS and Windows 3.x

Function parameters are passed on the stack with the first parameter at the lowest address. This corresponds to pushing the last parameter first. The stack is cleaned up by the caller. Parameters of 8 or 16 bits size use one word of stack space. Parameters bigger than 16 bits are stored in little-endian form, i.e. with the least significant word at the lowest address. All stack parameters are aligned by 2.

Function return values are passed in registers in most cases.
8-bit <strong>in</strong>tegers are returned <strong>in</strong>AL, 16-bit <strong>in</strong>tegers and near po<strong>in</strong>ters <strong>in</strong> AX, 32-bit <strong>in</strong>tegers and far po<strong>in</strong>ters <strong>in</strong> DX:AX,Booleans <strong>in</strong> AX, and float<strong>in</strong>g po<strong>in</strong>t values <strong>in</strong> ST(0).Call<strong>in</strong>g convention <strong>in</strong> 32 bit W<strong>in</strong>dows, L<strong>in</strong>ux, BSD, Mac OS XFunction parameters are passed on the stack accord<strong>in</strong>g to the follow<strong>in</strong>g call<strong>in</strong>g conventions:Call<strong>in</strong>g convention Parameter order on stack Parameters removed by__cdecl First par. at low address Caller__stdcall First par. at low address Subrout<strong>in</strong>e__fastcall Microsoft First 2 parameters <strong>in</strong> ecx, edx. Subrout<strong>in</strong>eand GnuRest as __stdcall__fastcall Borland First 3 parameters <strong>in</strong> eax, edx, Subrout<strong>in</strong>eecx. Rest as __stdcall_pascal First par. at high address Subrout<strong>in</strong>e__thiscall Microsoft this <strong>in</strong> ecx. Rest as __stdcall Subrout<strong>in</strong>eTable 4.2. Call<strong>in</strong>g conventions <strong>in</strong> 32 bit modeThe __cdecl call<strong>in</strong>g convention is the default <strong>in</strong> L<strong>in</strong>ux. In W<strong>in</strong>dows, the __cdecl conventionis also the default except for member functions, system functions and DLL's. Staticallyl<strong>in</strong>ked modules <strong>in</strong> .obj and .lib files should preferably use __cdecl, while dynamic l<strong>in</strong>klibraries <strong>in</strong> .dll files should use __stdcall. 
The Microsoft, Intel, Digital Mars and Codeplay compilers use __thiscall by default for member functions under Windows; the Borland compiler uses __cdecl with 'this' as the first parameter.

The fastest calling convention for functions with integer parameters is __fastcall, but this calling convention is not standardized.

Remember that the stack pointer is decreased when a value is pushed on the stack. This means that the parameter pushed first will be at the highest address, in accordance with the _pascal convention. You must push parameters in reverse order to satisfy the __cdecl and __stdcall conventions.

Parameters of 32 bits size or less use 4 bytes of stack space. Parameters bigger than 32 bits are stored in little-endian form, i.e. with the least significant DWORD at the lowest address, and aligned by 4.
Mac OS X and the Gnu compiler version 3 and later align the stack by 16 before every call instruction, though this behavior is not consistent. Sometimes the stack is aligned by 4. This discrepancy is an unresolved issue at the time of writing. See manual 5: "Calling conventions for different C++ compilers and operating systems" for details.

Function return values are passed in registers in most cases. 8-bit integers are returned in AL, 16-bit integers in AX, 32-bit integers, pointers, references and Booleans in EAX, 64-bit integers in EDX:EAX, and floating point values in ST(0).

See manual 5: "Calling conventions for different C++ compilers and operating systems" for details about parameters of composite types (struct, class, union) and vector types (__m64, __m128).

Calling conventions in 64-bit Windows

The first parameter is transferred in RCX if it is an integer or in XMM0 if it is a float or double. The second parameter is transferred in RDX or XMM1. The third parameter is transferred in R8 or XMM2. The fourth parameter is transferred in R9 or XMM3. Note that RCX is not used for parameter transfer if XMM0 is used, and vice versa. No more than four parameters can be transferred in registers, regardless of type. Any further parameters are transferred on the stack with the first parameter at the lowest address and aligned by 8.
Member functions have 'this' as the first parameter.

The caller must allocate 32 bytes of free space on the stack in addition to any parameters transferred on the stack. This is a shadow space where the called function can save the four parameter registers if it needs to. The shadow space is the place where the first four parameters would have been stored if they were transferred on the stack according to the __cdecl rule. The shadow space belongs to the called function, which is allowed to store the parameters (or anything else) in the shadow space. The caller must reserve the 32 bytes of shadow space even for functions that have no parameters. The caller must clean up the stack, including the shadow space. Return values are in RAX or XMM0.

The stack pointer must be aligned by 16 before any CALL instruction, so that the value of RSP is 8 modulo 16 at the entry of a function. The function can rely on this alignment when storing XMM registers to the stack.

See manual 5: "Calling conventions for different C++ compilers and operating systems" for details about parameters of composite types (struct, class, union) and vector types (__m64, __m128).

Calling conventions in 64-bit Linux, BSD and Mac OS X

The first six integer parameters are transferred in RDI, RSI, RDX, RCX, R8, R9, respectively. The first eight floating point parameters are transferred in XMM0 - XMM7. All these registers can be used, so that a maximum of fourteen parameters can be transferred in registers. Any further parameters are transferred on the stack with the first parameters at the lowest address and aligned by 8.
The stack is cleaned up by the caller if there are any parameters on the stack. There is no shadow space. Member functions have 'this' as the first parameter. Return values are in RAX or XMM0.

The stack pointer must be aligned by 16 before any CALL instruction, so that the value of RSP is 8 modulo 16 at the entry of a function. The function can rely on this alignment when storing XMM registers to the stack.

The address range from [RSP-1] to [RSP-128] is called the red zone. A function can safely store data above the stack in the red zone as long as this is not overwritten by any PUSH or CALL instructions.
See manual 5: "Calling conventions for different C++ compilers and operating systems" for details about parameters of composite types (struct, class, union) and vector types (__m64, __m128, __m256).

4.4 Name mangling and name decoration

The support for function overloading in C++ makes it necessary to supply information about the parameters of a function to the linker. This is done by appending codes for the parameter types to the function name. This is called name mangling. The name mangling codes have traditionally been compiler specific. Fortunately, there is a growing tendency towards standardization in this area in order to improve compatibility between different compilers. The name mangling codes for different compilers are described in detail in manual 5: "Calling conventions for different C++ compilers and operating systems".

The problem of incompatible name mangling codes is most easily solved by using extern "C" declarations. Functions with extern "C" declaration have no name mangling. The only decoration is an underscore prefix in 16- and 32-bit Windows and 32- and 64-bit Mac OS. There is some additional decoration of the name for functions with __stdcall and __fastcall declarations.

The extern "C" declaration cannot be used for member functions, overloaded functions, operators, and other constructs that are not supported in the C language. You can avoid name mangling in these cases by defining a mangled function that calls an unmangled function.
If the mangled function is defined as inline then the compiler will simply replace the call to the mangled function by the call to the unmangled function. For example, to define an overloaded C++ operator in assembly without name mangling:

class C1;

// unmangled assembly function:
extern "C" C1 cplus (C1 const & a, C1 const & b);

// mangled C++ operator
inline C1 operator + (C1 const & a, C1 const & b) {
   // operator + replaced inline by function cplus
   return cplus(a, b);
}

Overloaded functions can be inlined in the same way. Class member functions can be translated to friend functions as illustrated in example 7.1b page 48.

4.5 Function examples

The following examples show how to code a function in assembly that obeys the calling conventions. First the code in C++:

// Example 4.1a
extern "C" double sinxpnx (double x, int n) {
   return sin(x) + n * x;
}

The same function can be coded in assembly. The following examples show the same function coded for different platforms.

; Example 4.1b. 16-bit DOS and Windows 3.x
ALIGN 4
_sinxpnx PROC NEAR
; parameter x = [SP+2]
; parameter n = [SP+10]
; return value = ST(0)
   push bp                 ; bp must be saved
   mov bp, sp              ; stack frame
   fild word ptr [bp+12]   ; n
   fld qword ptr [bp+4]    ; x
   fmul st(1), st(0)       ; n*x
   fsin                    ; sin(x)
   fadd                    ; sin(x) + n*x
   pop bp                  ; restore bp
   ret                     ; return value is in st(0)
_sinxpnx ENDP

In 16-bit mode we need BP as a stack frame because SP cannot be used as base pointer. The integer n is only 16 bits. I have used the hardware instruction FSIN for the sin function.

; Example 4.1c. 32-bit Windows
EXTRN _sin:near
ALIGN 4
_sinxpnx PROC near
; parameter x = [ESP+4]
; parameter n = [ESP+12]
; return value = ST(0)
   fld qword ptr [esp+4]   ; x
   sub esp, 8              ; make space for parameter x
   fstp qword ptr [esp]    ; store parameter for sin; clear st(0)
   call _sin               ; library function for sin()
   add esp, 8              ; clean up stack after call
   fild dword ptr [esp+12] ; n
   fmul qword ptr [esp+4]  ; n*x
   fadd                    ; sin(x) + n*x
   ret                     ; return value is in st(0)
_sinxpnx ENDP

Here, I have chosen to use the library function _sin instead of FSIN. This may be faster in some cases because FSIN gives higher precision than needed. The parameter for _sin is transferred as 8 bytes on the stack.

; Example 4.1d. 32-bit Linux
EXTRN sin:near
ALIGN 4
sinxpnx PROC near
; parameter x = [ESP+4]
; parameter n = [ESP+12]
; return value = ST(0)
   fld qword ptr [esp+4]   ; x
   sub esp, 12             ; keep stack aligned by 16 before call
   fstp qword ptr [esp]    ; store parameter for sin; clear st(0)
   call sin                ; library proc. may be faster than fsin
   add esp, 12             ; clean up stack after call
   fild dword ptr [esp+12] ; n
   fmul qword ptr [esp+4]  ; n*x
   fadd                    ; sin(x) + n*x
   ret                     ; return value is in st(0)
sinxpnx ENDP

In 32-bit Linux there is no underscore on function names. The stack must be kept aligned by 16 in Linux (GCC version 3 or later). The call to sinxpnx subtracts 4 from ESP. We are subtracting 12 more from ESP so that the total amount subtracted is 16. We may subtract more, as long as the total amount is a multiple of 16. In example 4.1c we subtracted only 8 from ESP because the stack is only aligned by 4 in 32-bit Windows.
; Example 4.1e. 64-bit Windows
EXTRN sin:near
ALIGN 4
sinxpnx PROC
; parameter x = xmm0
; parameter n = edx
; return value = xmm0
   push rbx               ; rbx must be saved
   movaps [rsp+16], xmm6  ; save xmm6 in my shadow space
   sub rsp, 32            ; shadow space for call to sin
   mov ebx, edx           ; save n
   movsd xmm6, xmm0       ; save x
   call sin               ; xmm0 = sin(xmm0)
   cvtsi2sd xmm1, ebx     ; convert n to double
   mulsd xmm1, xmm6       ; n * x
   addsd xmm0, xmm1       ; sin(x) + n * x
   add rsp, 32            ; restore stack pointer
   movaps xmm6, [rsp+16]  ; restore xmm6
   pop rbx                ; restore rbx
   ret                    ; return value is in xmm0
sinxpnx ENDP

Function parameters are transferred in registers in 64-bit Windows. ECX is not used for parameter transfer because the first parameter is not an integer. We are using RBX and XMM6 for storing n and x across the call to sin. We have to use registers with callee-save status for this, and we have to save these registers on the stack before using them. The stack must be aligned by 16 before the call to sin. The call to sinxpnx subtracts 8 from RSP; the PUSH RBX instruction subtracts 8; and the SUB instruction subtracts 32. The total amount subtracted is 8+8+32 = 48, which is a multiple of 16. The extra 32 bytes on the stack is shadow space for the call to sin. Note that example 4.1e does not include support for exception handling. It is necessary to add tables with stack unwind information if the program relies on catching exceptions generated in the function sin.

; Example 4.1f. 64-bit Linux
EXTRN sin:near
ALIGN 4
sinxpnx PROC
PUBLIC sinxpnx
; parameter x = xmm0
; parameter n = edi
; return value = xmm0
   push rbx               ; rbx must be saved
   sub rsp, 16            ; make local space and align stack by 16
   movaps [rsp], xmm0     ; save x
   mov ebx, edi           ; save n
   call sin               ; xmm0 = sin(xmm0)
   cvtsi2sd xmm1, ebx     ; convert n to double
   mulsd xmm1, [rsp]      ; n * x
   addsd xmm0, xmm1       ; sin(x) + n * x
   add rsp, 16            ; restore stack pointer
   pop rbx                ; restore rbx
   ret                    ; return value is in xmm0
sinxpnx ENDP

64-bit Linux does not use the same registers for parameter transfer as 64-bit Windows does. There are no XMM registers with callee-save status, so we have to save x on the stack across the call to sin, even though saving it in a register might be faster (saving x in a 64-bit integer register is possible, but slow). n can still be saved in a general purpose register with callee-save status. The stack is aligned by 16. There is no need for shadow space on the stack. The red zone cannot be used because it would be overwritten by the call to sin. Note that example 4.1f does not include support for exception handling. It is necessary to add tables with stack unwind information if the program relies on catching exceptions generated in the function sin.

5 Using intrinsic functions in C++

As already mentioned, there are three different ways of making assembly code: using intrinsic functions and vector classes in C++, using inline assembly in C++, and making separate assembly modules. Intrinsic functions are described in this chapter. The other two methods are described in the following chapters.

Intrinsic functions and vector classes are highly recommended because they are much easier and safer to use than assembly language syntax.

The Microsoft, Intel and Gnu C++ compilers have support for intrinsic functions. Most of the intrinsic functions generate one machine instruction each. An intrinsic function is therefore equivalent to an assembly instruction.

Coding with intrinsic functions is a kind of high-level assembly.
It can easily be combined with C++ language constructs such as if-statements, loops, functions, classes and operator overloading. Using intrinsic functions is an easier way of doing high-level assembly coding than using .if constructs etc. in an assembler or using the so-called high-level assembler (HLA).

The invention of intrinsic functions has made it much easier to do programming tasks that previously required coding with assembly syntax. The advantages of using intrinsic functions are:

• No need to learn assembly language syntax.

• Seamless integration into C++ code.

• Branches, loops, functions, classes, etc. are easily made with C++ syntax.

• The compiler takes care of calling conventions, register usage conventions, etc.

• The code is portable to almost all x86 platforms: 32-bit and 64-bit Windows, Linux, Mac OS, etc. Some intrinsic functions can even be used on Itanium and other non-x86 platforms.

• The code is compatible with Microsoft, Gnu and Intel compilers.

• The compiler takes care of register variables, register allocation and register spilling. The programmer doesn't have to care about which register is used for which variable.

• Different instances of the same inlined function or operator can use different registers for its parameters. This eliminates the need for register-to-register moves. The same function coded with assembly syntax would typically use a specific register for each parameter, and a move instruction would be required if the value happens to be in a different register.

• It is possible to define overloaded operators for the intrinsic functions. For example, the instruction that adds two 4-element vectors of floats is coded as ADDPS in assembly language, and as _mm_add_ps when intrinsic functions are used. But an overloaded operator can be defined for the latter so that it is simply coded as a + when using the so-called vector classes. This makes the code look like plain old C++.

• The compiler can optimize the code further, for example by common subexpression elimination, loop-invariant code motion, scheduling and reordering, etc. This would have to be done manually if assembly syntax was used. The Gnu and Intel compilers provide the best optimization.

The disadvantages of using intrinsic functions are:

• Not all assembly instructions have intrinsic function equivalents.

• The function names are sometimes long and difficult to remember.

• An expression with many intrinsic functions looks kludgy and is difficult to read.

• Requires a good understanding of the underlying mechanisms.

• The compilers may not be able to optimize code containing intrinsic functions as much as they optimize other code, especially when constant propagation is needed.

• Unskilled use of intrinsic functions can make the code less efficient than simple C++ code.

• The compiler can modify the code or implement it in a less efficient way than the programmer intended. It may be necessary to look at the code generated by the compiler to see if it is optimized in the way the programmer intended.

• Mixture of __m128 and __m256 types can cause severe delays if the programmer doesn't follow certain rules (see Intel AVX manuals).

5.1 Using intrinsic functions for system code

Intrinsic functions are useful for making system code and accessing system registers that are not accessible with standard C++. Some of these functions are listed below.

Functions for accessing system registers:
__rdtsc, __readpmc, __readmsr, __readcr0, __readcr2, __readcr3, __readcr4, __readcr8, __writecr0, __writecr3, __writecr4, __writecr8, __writemsr, _mm_getcsr, _mm_setcsr, __getcallerseflags.

Functions for input and output:
__inbyte, __inword, __indword, __outbyte, __outword, __outdword.

Functions for atomic memory read/write operations:
_InterlockedExchange, etc.

Functions for accessing FS and GS segments:
__readfsbyte, __writefsbyte, etc.

Cache control instructions (require SSE or SSE2 instruction set):
_mm_prefetch, _mm_stream_si32, _mm_stream_pi, _mm_stream_si128, _ReadBarrier, _WriteBarrier, _ReadWriteBarrier, _mm_sfence.
Other system functions:
__cpuid, __debugbreak, _disable, _enable.

5.2 Using intrinsic functions for instructions not available in standard C++

Some simple instructions that are not available in standard C++ can be coded with intrinsic functions, for example functions for bit-rotate, bit-scan, etc.:
_rotl8, _rotr8, _rotl16, _rotr16, _rotl, _rotr, _rotl64, _rotr64, _BitScanForward, _BitScanReverse.

5.3 Using intrinsic functions for vector operations

Vector instructions are very useful for improving the speed of code with inherent parallelism. There are intrinsic functions for almost all vector operations using the MMX and XMM registers.

The use of these intrinsic functions for vector operations is thoroughly described in manual 1: "Optimizing software in C++".

5.4 Availability of intrinsic functions

The intrinsic functions are available on newer versions of the Microsoft, Gnu and Intel compilers. Most intrinsic functions have the same names in all three compilers. You have to include a header file named intrin.h or emmintrin.h to get access to the intrinsic functions.
The Codeplay compiler has limited support for intrinsic vector functions, but the function names are not compatible with the other compilers.

The intrinsic functions are listed in the help documentation for each compiler, in the appropriate header files, in msdn.microsoft.com, in "Intel 64 and IA-32 Architectures Software Developer's Manual", volume 2A and 2B (developer.intel.com), and in "Intel Intrinsic Guide" (softwareprojects.intel.com/avx/).

6 Using inline assembly in C++

Inline assembly is another way of putting assembly code into a C++ file. The keyword asm or _asm or __asm or __asm__ tells the compiler that the code is assembly. Different compilers have different syntaxes for inline assembly. The different syntaxes are explained below.

The advantages of using inline assembly are:

• It is easy to combine with C++.

• Variables and other symbols defined in C++ code can be accessed from the assembly code.

• Only the part of the code that cannot be coded in C++ is coded in assembly.

• All assembly instructions are available.

• The code generated is exactly what you write.

• It is possible to optimize in detail.

• The compiler takes care of calling conventions, name mangling and saving registers.
• The compiler can inline a function containing inline assembly.

• Portable to different x86 platforms when using the Intel compiler.

The disadvantages of using inline assembly are:

• Different compilers use different syntax.

• Requires knowledge of assembly language.

• Requires a good understanding of how the compiler works. It is easy to make errors.

• The allocation of registers is mostly done manually. The compiler may allocate different registers for the same variables.

• The compiler cannot optimize well across the inline assembly code.

• It may be difficult to control function prolog and epilog code.

• It may not be possible to define data.

• It may not be possible to use macros and other directives.

• It may not be possible to make functions with multiple entries.

• The Microsoft compiler does not support inline assembly on 64-bit platforms.

The following sections illustrate how to make inline assembly with different compilers.

6.1 MASM style inline assembly

The most common syntax for inline assembly is a MASM-style syntax. This is the easiest way of making inline assembly and it is supported by most compilers, but not the Gnu compiler. Unfortunately, the syntax for inline assembly is poorly documented or not documented at all in the compiler manuals.
I will therefore briefly describe the syntax here.

The following examples show a function that raises a floating point number x to an integer power n. The algorithm is to multiply x^1, x^2, x^4, x^8, etc. according to each bit in the binary representation of n. Actually, it is not necessary to code this in assembly because a good compiler will optimize it almost as much when you just write pow(x,n). My purpose here is just to illustrate the syntax of inline assembly.

First the code in C++ to illustrate the algorithm:

// Example 6.1a. Raise double x to the power of int n.
double ipow (double x, int n) {
   unsigned int nn = abs(n);   // absolute value of n
   double y = 1.0;             // used for multiplication
   while (nn != 0) {           // loop for each bit in nn
      if (nn & 1) y *= x;      // multiply if bit = 1
      x *= x;                  // square x
      nn >>= 1;                // get next bit of nn
   }
   if (n < 0) y = 1.0 / y;     // reciprocal if n is negative
   return y;                   // return y = pow(x,n)
}
And then the optimized code using inline assembly with MASM style syntax:

// Example 6.1b. MASM style inline assembly, 32 bit mode
double ipow (double x, int n) {
   __asm {
      mov eax, n          // Move n to eax
      // abs(n) is calculated by inverting all bits and adding 1 if n < 0:
      cdq                 // Get sign bit into all bits of edx
      xor eax, edx        // Invert bits if negative
      sub eax, edx        // Add 1 if negative. Now eax = abs(n)
      fld1                // st(0) = 1.0
      jz L9               // End if n = 0
      fld qword ptr x     // st(0) = x, st(1) = 1.0
      jmp L2              // Jump into loop
L1:                       // Top of loop
      fmul st(0), st(0)   // Square x
L2:                       // Loop entered here
      shr eax, 1          // Get each bit of n into carry flag
      jnc L1              // No carry. Skip multiplication, goto next
      fmul st(1), st(0)   // Multiply by x squared i times for bit # i
      jnz L1              // End of loop. Stop when nn = 0
      fstp st(0)          // Discard st(0)
      test edx, edx       // Test if n was negative
      jns L9              // Finish if n was not negative
      fld1                // st(0) = 1.0, st(1) = x^abs(n)
      fdivr               // Reciprocal
L9:                       // Finish
   }                      // Result is in st(0)
   #pragma warning(disable:1011) // Don't warn for missing return value
}

Note that the function entry and parameters are declared with C++ syntax. The function body, or part of it, can then be coded with inline assembly. The parameters x and n, which are declared with C++ syntax, can be accessed directly in the assembly code using the same names. The compiler simply replaces x and n in the assembly code with the appropriate memory operands, probably [esp+4] and [esp+12].
If the inline assembly code needs to access a variable that happens to be in a register, then the compiler will store it to a memory variable on the stack and then insert the address of this memory variable in the inline assembly code.

The result is returned in st(0) according to the 32-bit calling convention. The compiler will normally issue a warning because there is no return y; statement in the end of the function. This statement is not needed if you know which register to return the value in. The #pragma warning(disable:1011) removes the warning. If you want the code to work with different calling conventions (e.g. 64-bit systems) then it is necessary to store the result in a temporary variable inside the assembly block:

// Example 6.1c. MASM style, independent of calling convention
double ipow (double x, int n) {
   double result;           // Define temporary variable for result
   __asm {
      mov eax, n
      cdq
      xor eax, edx
      sub eax, edx
      fld1
      jz L9
      fld qword ptr x
      jmp L2
L1:
      fmul st(0), st(0)
L2:
      shr eax, 1
      jnc L1
      fmul st(1), st(0)
      jnz L1
      fstp st(0)
      test edx, edx
      jns L9
      fld1
      fdivr
L9:
      fstp qword ptr result // store result to temporary variable
   }
   return result;
}

Now the compiler takes care of all aspects of the calling convention and the code works on all x86 platforms.

The compiler inspects the inline assembly code to see which registers are modified. The compiler will automatically save and restore these registers if required by the register usage convention. In some compilers it is not allowed to modify register ebp or ebx in the inline assembly code because these registers are needed for a stack frame. The compiler will generally issue a warning in this case.

It is possible to remove the automatically generated prolog and epilog code by adding __declspec(naked) to the function declaration. In this case it is the responsibility of the programmer to add any necessary prolog and epilog code and to save any modified registers if necessary. The only thing the compiler takes care of in a naked function is name mangling. Automatic variable name substitution may not work with naked functions because it depends on how the function prolog is made. A naked function cannot be inlined.

Accessing register variables

Register variables cannot be accessed directly by their symbolic names in MASM-style inline assembly.
Accessing a variable by name in inline assembly code will force the compiler to store the variable to a temporary memory location.

If you know which register a variable is in, then you can simply write the name of the register. This makes the code more efficient but less portable.

For example, if the code in the above example is used in 64-bit Windows, then x will be in register XMM0 and n will be in register EDX. Taking advantage of this knowledge, we can improve the code:

// Example 6.1d. MASM style, 64-bit Windows
double ipow (double x, int n) {
   const double one = 1.0;    // Define constant 1.0
   __asm {
      // x is in xmm0
      mov eax, edx            // Get n into eax
      cdq
      xor eax, edx
      sub eax, edx
      movsd xmm1, one         // Load 1.0
      jz L9
      jmp L2
L1:   mulsd xmm0, xmm0        // Square x
L2:   shr eax, 1
      jnc L1
      mulsd xmm1, xmm0        // Multiply by x squared i times
      jnz L1
      movsd xmm0, xmm1        // Put result in xmm0
      test edx, edx
      jns L9
      movsd xmm0, one
      divsd xmm0, xmm1        // Reciprocal
L9:
   }
   #pragma warning(disable:1011)  // Don't warn for missing return value
}

In 64-bit Linux we will have n in register EDI, so the line mov eax,edx should be changed to mov eax,edi.

Accessing class members and structure members

Let's take as an example a C++ class containing a list of integers:

// Example 6.2a. Accessing class data members
// define C++ class
class MyList {
protected:
   int length;                // Number of items in list
   int buffer[100];           // Store items
public:
   MyList();                  // Constructor
   void AttItem(int item);    // Add item to list
   int Sum();                 // Compute sum of items
};

MyList::MyList() {            // Constructor
   length = 0;
}

void MyList::AttItem(int item) {  // Add item to list
   if (length < 100) {
      buffer[length++] = item;
   }
}

int MyList::Sum() {           // Member function Sum
   int i, sum = 0;
   for (i = 0; i < length; i++) sum += buffer[i];
   return sum;
}

Below, I will show how to code the member function MyList::Sum in inline assembly. I have not tried to optimize the code; my purpose here is simply to show the syntax. Class members are accessed by loading 'this' into a pointer register and addressing class data members relative to the pointer with the dot operator (.).

// Example 6.2b. Accessing class members (32-bit)
int MyList::Sum() {
   __asm {
      mov ecx, this           // 'this' pointer
      xor eax, eax            // sum = 0
      xor edx, edx            // Loop index, i = 0
      cmp [ecx].length, 0     // if (this->length != 0)
      je L9
L1:   add eax, [ecx].buffer[edx*4]  // sum += buffer[i]
      add edx, 1              // i++
      cmp edx, [ecx].length   // while (i < length)
      jb L1
L9:
   }                          // Return value is in eax
   #pragma warning(disable:1011)
}


Here the 'this' pointer is accessed by its name 'this', and all class data members are addressed relative to 'this'. The offset of a class member relative to 'this' is obtained by writing the member name preceded by the dot operator. The index into the array named buffer must be multiplied by the size of each element in buffer: [edx*4].

Some 32-bit compilers for Windows put 'this' in ecx, so the instruction mov ecx,this can be omitted. 64-bit systems require 64-bit pointers, so ecx should be replaced by rcx and edx by rdx. 64-bit Windows has 'this' in rcx, while 64-bit Linux has 'this' in rdi.

Structure members are accessed in the same way by loading a pointer to the structure into a register and using the dot operator. There is no syntax check against accessing private and protected members. There is no way to resolve the ambiguity if more than one structure or class has a member with the same name. The MASM assembler can resolve such ambiguities by using the assume directive or by putting the name of the structure before the dot, but this is not possible with inline assembly.

Calling functions

Functions are called by their name in inline assembly. Member functions can only be called from other member functions of the same class. Overloaded functions cannot be called because there is no way to resolve the ambiguity. It is not possible to use mangled function names.
It is the responsibility of the programmer to put any function parameters on the stack or in the right registers before calling a function and to clean up the stack after the call. It is also the programmer's responsibility to save, around the call, any registers that must be preserved, unless these registers have callee-save status. Because of these complications, I recommend that you go out of the assembly block and use C++ syntax when making function calls.

Syntax overview

The syntax for MASM-style inline assembly is not well described in any compiler manual I have seen. I will therefore summarize the most important rules here.

In most cases, the MASM-style inline assembly is interpreted by the compiler without invoking any assembler. You can therefore not assume that the compiler will accept everything that the assembler understands.

The inline assembly code is marked by the keyword __asm. Some compilers allow the alias _asm. The assembly code must be enclosed in curly brackets {} unless there is only one line. The assembly instructions are separated by newlines. Alternatively, you may separate the assembly instructions by the __asm keyword without any semicolons.

Instructions and labels are coded in the same way as in MASM. The size of memory operands can be specified with the PTR operator, for example INC DWORD PTR [ESI]. The names of instructions and registers are not case sensitive.

Variables, functions, and goto labels declared in C++ can be accessed by the same names in the inline assembly code. These names are case sensitive.

Data members of structures, classes and unions are accessed relative to a pointer register using the dot operator.

Comments are initiated with a semicolon (;) or a double slash (//).


Hard-coded opcodes are made with _emit followed by a byte constant, where you would use DB in MASM. For example, _emit 0x90 is equivalent to NOP.

Directives and macros are not allowed.

The compiler takes care of calling conventions and register saving for the function that contains inline assembly, but not for any function calls from the inline assembly.

Compilers using MASM style inline assembly

MASM-style inline assembly is supported by 16-bit and 32-bit Microsoft C++ compilers, but the 64-bit Microsoft compiler has no inline assembly.

The Intel C++ compiler supports MASM-style inline assembly in both 32-bit and 64-bit Windows as well as 32-bit and 64-bit Linux (and Mac OS ?). This is the preferred compiler for making portable code with inline assembly. The Intel compiler under Linux requires the command line option -use-msasm to recognize this syntax for inline assembly. Only the keyword __asm works in this case, not the synonyms asm or __asm__. The Intel compiler converts the MASM syntax to AT&T syntax before sending it to the assembler. The Intel compiler can therefore be useful as a syntax converter.

MASM-style inline assembly is also supported by the Borland, Digital Mars, Watcom and Codeplay compilers.
The Borland assembler is not up to date, and the Watcom compiler has some limitations on inline assembly.

6.2 Gnu style inline assembly

Inline assembly works quite differently on the Gnu compiler because the Gnu compiler compiles via assembly rather than producing object code directly, as most other compilers do. The assembly code is entered as a string constant which is passed through to the assembler with very little change. The default syntax is the AT&T syntax that the Gnu assembler uses.

Gnu-style inline assembly has the advantage that you can pass any instruction or directive on to the assembler. Everything is possible. The disadvantage is that the syntax is very elaborate and difficult to learn, as the examples below show.

Gnu-style inline assembly is supported by the Gnu compiler and the Intel compiler for Linux in both 32-bit and 64-bit mode.

AT&T syntax

Let's return to example 6.1b and translate it to Gnu style inline assembly with AT&T syntax:

// Example 6.1e. Gnu-style inline assembly, AT&T syntax
double ipow (double x, int n) {
   double y;
   __asm__ (
      "cltd                   \n"   // cdq
      "xorl %%edx, %%eax      \n"
      "subl %%edx, %%eax      \n"
      "fld1                   \n"
      "jz 9f                  \n"   // Forward jump to nearest 9:
      "fldl %[xx]             \n"   // Substituted with x
      "jmp 2f                 \n"   // Forward jump to nearest 2:
      "1:                     \n"   // Local label 1:
      "fmul %%st(0), %%st(0)  \n"
      "2:                     \n"   // Local label 2:
      "shrl $1, %%eax         \n"
      "jnc 1b                 \n"   // Backward jump to nearest 1:
      "fmul %%st(0), %%st(1)  \n"
      "jnz 1b                 \n"   // Backward jump to nearest 1:
      "fstp %%st(0)           \n"
      "testl %%edx, %%edx     \n"
      "jns 9f                 \n"   // Forward jump to nearest 9:
      "fld1                   \n"
      "fdivp %%st(0), %%st(1) \n"
      "9:                     \n"   // Assembly string ends here
      // Use extended assembly syntax to specify operands:
      // Output operands:
      : "=t" (y)                    // Output top of stack to y
      // Input operands:
      : [xx] "m" (x), "a" (n)       // Input operand %[xx] = x, eax = n
      // Clobbered registers:
      : "%edx", "%st(1)"            // Clobber edx and st(1)
   );                               // __asm__ statement ends here
   return y;
}

We immediately notice that the syntax is very different from the previous examples. Many of the instruction codes have suffixes for specifying the operand size. Integer instructions use b for BYTE, w for WORD, l for DWORD, q for QWORD. Floating point instructions use s for DWORD (float), l for QWORD (double), t for TBYTE (long double). Some instruction codes are completely different; for example, CDQ is changed to CLTD. The order of the operands is the opposite of MASM syntax: the source operand comes before the destination operand. Register names have a %% prefix, which is changed to % before the string is passed on to the assembler. Constants have a $ prefix. Memory operands are also different. For example, [ebx+ecx*4+20h] is changed to 0x20(%%ebx,%%ecx,4).

Jump labels can be coded in the same way as in MASM, e.g. L1:, L2:, but I have chosen to use the syntax for local labels, which is a decimal number followed by a colon. The jump instructions can then use jmp 1b for jumping backwards to the nearest preceding 1: label, and jmp 1f for jumping forwards to the nearest following 1: label.
The reason why I have used this kind of labels is that the compiler will produce multiple instances of the inline assembly code if the function is inlined, which is quite likely to happen. If we use normal labels like L1: then there will be more than one label with the same name in the final assembly file, which of course will produce an error. If you want to use normal labels then add __attribute__((noinline)) to the function declaration to prevent inlining of the function.

Gnu style inline assembly does not allow you to put the names of local C++ variables directly into the assembly code. Only global variable names will work because they have the same names in the assembly code. Instead you can use the so-called extended syntax as illustrated above. The extended syntax looks like this:

__asm__ ("assembly code string" : [output list] : [input list] : [clobbered registers list] );

The assembly code string is a string constant containing assembly instructions separated by newline characters (\n).

In the above example, the output list is "=t" (y). t means the top-of-stack floating point register st(0), and y means that this should be stored in the C++ variable named y after the assembly code string.


There are two input operands in the input list. [xx] "m" (x) means: replace %[xx] in the assembly string with a memory operand for the C++ variable x. "a" (n) means: load the C++ variable n into register eax before the assembly string. There are many different constraint symbols for specifying different kinds of operands and registers for input and output. See the GCC manual for details.

The clobbered registers list "%edx", "%st(1)" tells that registers edx and st(1) are modified by the inline assembly code. The compiler would falsely assume that these registers were unchanged if they didn't occur in the clobber list.

Intel syntax

The above example will be a little easier to code if we use Intel syntax for the assembly string. The Gnu assembler now accepts Intel syntax with the directive .intel_syntax noprefix. The noprefix means that registers don't need a %-sign as prefix.

// Example 6.1f. Gnu-style inline assembly, Intel syntax
double ipow (double x, int n) {
   double y;
   __asm__ (
      ".intel_syntax noprefix \n"   // use Intel syntax for convenience
      "cdq                    \n"
      "xor eax, edx           \n"
      "sub eax, edx           \n"
      "fld1                   \n"
      "jz 9f                  \n"
      ".att_syntax prefix     \n"   // AT&T syntax needed for %[xx]
      "fldl %[xx]             \n"   // memory operand substituted with x
      ".intel_syntax noprefix \n"   // switch to Intel syntax again
      "jmp 2f                 \n"
      "1:                     \n"
      "fmul st(0), st(0)      \n"
      "2:                     \n"
      "shr eax, 1             \n"
      "jnc 1b                 \n"
      "fmul st(1), st(0)      \n"
      "jnz 1b                 \n"
      "fstp st(0)             \n"
      "test edx, edx          \n"
      "jns 9f                 \n"
      "fld1                   \n"
      "fdivrp                 \n"
      "9:                     \n"
      ".att_syntax prefix     \n"   // switch back to AT&T syntax
      // output operands:
      : "=t" (y)                    // output in top-of-stack goes to y
      // input operands:
      : [xx] "m" (x), "a" (n)       // input memory %[xx] for x, eax for n
      // clobbered registers:
      : "%edx", "%st(1)"            // edx and st(1) are modified
   );
   return y;
}

Here, I have inserted .intel_syntax noprefix at the start of the assembly string, which allows me to use Intel syntax for the instructions. The string must end with .att_syntax prefix to return to the default AT&T syntax, because this syntax is used in the subsequent code generated by the compiler. The instruction that loads the memory operand x must use AT&T syntax because the operand substitution mechanism uses AT&T syntax for the operand substituted for %[xx]. The instruction fldl %[xx] must therefore be written in AT&T syntax. We can still use AT&T-style local labels. The lists of output operands, input operands and clobbered registers are the same as in example 6.1e.

7 Using an assembler

There are certain limitations on what you can do with intrinsic functions and inline assembly. These limitations can be overcome by using an assembler. The principle is to write one or more assembly files containing the most critical functions of a program and to write the less critical parts in C++. The different modules are then linked together into an executable file.

The advantages of using an assembler are:

• There are almost no limitations on what you can do.

• You have complete control of all details of the final executable code.

• All aspects of the code can be optimized, including function prolog and epilog, parameter transfer methods, register usage, data alignment, etc.

• It is possible to make functions with multiple entries.

• It is possible to make code that is compatible with multiple compilers and multiple operating systems (see page 51).

• MASM and some other assemblers have a powerful macro language which opens up possibilities that are absent in most compiled high-level languages (see page 106).

The disadvantages of using an assembler are:

• Assembly language is difficult to learn.
There are many instructions to remember.

• Coding in assembly takes more time than coding in a high level language.

• The assembly language syntax is not fully standardized.

• Assembly code tends to become poorly structured and spaghetti-like. It takes a lot of discipline to make assembly code well structured and readable for others.

• Assembly code is not portable between different microprocessor architectures.

• The programmer must know all details of the calling conventions and obey these conventions in the code.

• The assembler provides very little syntax checking. Many programming errors are not detected by the assembler.

• There are many things that you can do wrong in assembly code and the errors can have serious consequences.

• Errors in assembly code can be difficult to trace. For example, the error of not saving a register can cause a completely different part of the program to malfunction.


• Assembly language is not suitable for making a complete program. Most of the program has to be made in a different programming language.

The best way to start if you want to make assembly code is to first make the entire program in C or C++. Optimize the program with the use of the methods described in manual 1: "Optimizing software in C++". If any part of the program needs further optimization then isolate this part in a separate module. Then translate the critical module from C++ to assembly. There is no need to do this translation manually. Most C++ compilers can produce assembly code. Turn on all relevant optimization options in the compiler when translating the C++ code to assembly. The assembly code thus produced by the compiler is a good starting point for further optimization. The compiler-generated assembly code is sure to have the calling conventions right. (The output produced by 64-bit compilers for Windows is not yet fully compatible with any assembler.)

Inspect the assembly code produced by the compiler to see if there are any possibilities for further optimization. Sometimes compilers are very smart and produce code that is better optimized than what an average assembly programmer can do. In other cases, compilers are incredibly stupid and do things in very awkward and inefficient ways.
It is in the latter case that it is justified to spend time on assembly coding.

Most IDE's (Integrated Development Environments) provide a way of including assembly modules in a C++ project. For example, in Microsoft Visual Studio you can define a "custom build step" for an assembly source file. The specification for the custom build step may, for example, look like this. Command line: ml /c /Cx /Zi /coff $(InputName).asm. Outputs: $(InputName).obj. Alternatively, you may use a makefile (see page 49) or a batch file.

The C++ files that call the functions in the assembly module should include a header file (*.h) containing function prototypes for the assembly functions. It is recommended to add extern "C" to the function prototypes to remove the compiler-specific name mangling codes from the function names.

Examples of assembly functions for different platforms are provided in paragraph 4.5, page 30ff.

7.1 Static link libraries

It is convenient to collect assembled code from multiple assembly files into a function library. The advantages of organizing assembly modules into function libraries are:

• The library can contain many functions and modules. The linker will automatically pick the modules that are needed in a particular project and leave out the rest, so that no superfluous code is added to the project.

• A function library is easy and convenient to include in a C++ project. All C++ compilers and IDE's support function libraries.

• A function library is reusable. The extra time spent on coding and testing a function in assembly language is better justified if the code can be reused in different projects.

• Making a reusable function library forces you to make well tested and well documented code with a well defined functionality and a well defined interface to the calling program.


• A reusable function library with a general functionality is easier to maintain and verify than application-specific assembly code with a less well-defined responsibility.

• A function library can be used by other programmers who have no knowledge of assembly language.

A static link function library for Windows is built by using the library manager (e.g. lib.exe) to combine one or more *.obj files into a *.lib file.

A static link function library for Linux is built by using the archive manager (ar) to combine one or more *.o files into an *.a file.

A function library must be supplemented by a header file (*.h) containing function prototypes for the functions in the library. This header file is included in the C++ files that call the library functions (e.g. #include "mylibrary.h").

It is convenient to use a makefile (see page 49) for managing the commands necessary for building and updating a function library.

7.2 Dynamic link libraries

The difference between static linking and dynamic linking is that the static link library is linked into the executable program file, so that the executable file contains a copy of the necessary parts of the library.
A dynamic link library (*.dll in Windows, *.so in Linux) is distributed as a separate file which is loaded at runtime by the executable file.

The advantages of dynamic link libraries are:

• Only one instance of the dynamic link library is loaded into memory when multiple programs running simultaneously use the same library.

• The dynamic link library can be updated without modifying the executable file.

• A dynamic link library can be called from most programming languages, such as Pascal, C#, Visual Basic. (Calling from Java is possible but difficult.)

The disadvantages of dynamic link libraries are:

• The whole library is loaded into memory even when only a small part of it is needed.

• Loading a dynamic link library takes extra time when the executable program file is loaded.

• Calling a function in a dynamic link library is less efficient than a static library because of extra call overhead and because of less efficient code cache use.

• The dynamic link library must be distributed together with the executable file.

• Multiple programs installed on the same computer must use the same version of a dynamic link library. This can cause many compatibility problems.

A DLL for Windows is made with the Microsoft linker (link.exe). The linker must be supplied one or more .obj or .lib files containing the necessary library functions and a DllEntry function, which just returns 1.
A module definition file (*.def) is also needed. Note that DLL functions in 32-bit Windows use the __stdcall calling convention, while static link library functions use the __cdecl calling convention by default. An example source code can be found in www.agner.org/random/randoma.zip. Further instructions can be found in the Microsoft compiler documentation and in Iczelion's tutorials at win32asm.cjb.net.

I have no experience in making dynamic link libraries (shared objects) for Linux.

7.3 Libraries in source code form

A problem with subroutine libraries in binary form is that the compiler cannot optimize the function call. This problem can be solved by supplying the library functions as C++ source code.

If the library functions are supplied as C++ source code then the compiler can optimize away the function calling overhead by inlining the function. It can optimize register allocation across the function. It can do constant propagation. It can move invariant code out of a loop when the function is called inside the loop, etc.

The compiler can only do these optimizations with C++ source code, not with assembly code. The code may contain inline assembly or intrinsic function calls. The compiler can do further optimizations if the code uses intrinsic function calls, but not if it uses inline assembly. Note that different compilers will not optimize the code equally well.

If the compiler uses whole program optimization then the library functions can simply be supplied as a C++ source file.
If not, then the library code must be included with #include statements in order to enable optimization across the function calls. A function defined in an included file should be declared static and/or inline in order to avoid clashes between multiple instances of the function.

Some compilers with whole program optimization features can produce half-compiled object files that allow further optimization at the link stage. Unfortunately, the format of such files is not standardized - not even between different versions of the same compiler. It is possible that future compiler technology will allow a standardized format for half-compiled code. This format should, as a minimum, specify which registers are used for parameter transfer and which registers are modified by each function. It should preferably also allow register allocation at link time, constant propagation, common subexpression elimination across functions, and invariant code motion.

As long as such facilities are not available, we may consider using the alternative strategy of putting the entire innermost loop into an optimized library function rather than calling the library function from inside a C++ loop. This solution is used in Intel's Math Kernel Library (www.intel.com). If, for example, you need to calculate a thousand logarithms, then you can supply an array of a thousand arguments to a vector logarithm function in the library and receive an array of a thousand results back from the library.
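In C++ terms, such a vector function has an array-in, array-out interface. The following is a minimal sketch of that interface (vector_log is a hypothetical name; a real library such as Intel MKL would implement the loop with optimized vector instructions rather than a plain scalar loop):

```cpp
#include <cmath>
#include <cstddef>

// Hypothetical vector function: one call computes the logarithm of
// every element, so the entire inner loop runs inside the library.
void vector_log(const double* x, double* result, std::size_t n) {
    for (std::size_t i = 0; i < n; i++) {
        result[i] = std::log(x[i]);
    }
}
```

The caller then makes a single call with a thousand-element array instead of a thousand calls to a scalar logarithm function.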
This has the disadvantage that intermediate results have to be stored in arrays rather than transferred in registers.

7.4 Making classes in assembly

Classes are coded as structures in assembly, and member functions are coded as functions that receive a pointer to the class/structure as a parameter.

It is not possible to apply the extern "C" declaration to a member function in C++ because extern "C" refers to the calling conventions of the C language, which doesn't have classes and member functions. The most logical solution is to use the mangled function name. Returning to example 6.2a and b page 39, we can write the member function int MyList::Sum() with a mangled name as follows:


; Example 7.1a (Example 6.2b translated to stand alone assembly)
; Member function, 32-bit Windows, Microsoft compiler

; Define structure corresponding to class MyList:
MyList STRUC
length_  DD ?              ; length is a reserved name. Use length_
buffer   DD 100 DUP (?)    ; int buffer[100];
MyList ENDS

; int MyList::Sum()
; Mangled function name compatible with Microsoft compiler (32 bit):
?Sum@MyList@@QAEHXZ PROC near
; Microsoft compiler puts 'this' in ECX
        assume ecx: ptr MyList        ; ecx points to structure MyList
        xor  eax, eax                 ; sum = 0
        xor  edx, edx                 ; Loop index i = 0
        cmp  [ecx].length_, 0         ; this->length
        je   L9                       ; Skip if length = 0
L1:     add  eax, [ecx].buffer[edx*4] ; sum += buffer[i]
        add  edx, 1                   ; i++
        cmp  edx, [ecx].length_       ; while (i < length)
        jb   L1                       ; Loop
L9:     ret                           ; Return value is in eax
?Sum@MyList@@QAEHXZ ENDP              ; End of int MyList::Sum()
        assume ecx: nothing           ; ecx no longer points to anything

The mangled function name ?Sum@MyList@@QAEHXZ is compiler specific. Other compilers may have other name-mangling codes. Furthermore, other compilers may put 'this' on the stack rather than in a register. These incompatibilities can be solved by using a friend function rather than a member function. This solves the problem that a member function cannot be declared extern "C". The declaration in the C++ header file must then be changed to the following:

// Example 7.1b. Member function changed to friend function:

// An incomplete class declaration is needed here:
class MyList;

// Function prototype for friend function with 'this' as parameter:
extern "C" int MyList_Sum(MyList * ThisP);

// Class declaration:
class MyList {
protected:
   int length;                     // Data members:
   int buffer[100];
public:
   MyList();                       // Constructor
   void AttItem(int item);         // Add item to list

   // Make MyList_Sum a friend:
   friend int MyList_Sum(MyList * ThisP);

   // Translate Sum to MyList_Sum by inline call:
   int Sum() {return MyList_Sum(this);}
};

The prototype for the friend function must come before the class declaration because some compilers do not allow extern "C" inside the class declaration. An incomplete class declaration is needed because the friend function needs a pointer to the class.


The above declarations will make the compiler replace any call to MyList::Sum by a call to MyList_Sum because the latter function is inlined into the former. The assembly implementation of MyList_Sum does not need a mangled name:

; Example 7.1c. Friend function, 32-bit mode

; Define structure corresponding to class MyList:
MyList STRUC
length_  DD ?              ; length is a reserved name. Use length_
buffer   DD 100 DUP (?)    ; int buffer[100];
MyList ENDS

; extern "C" friend int MyList_Sum()
_MyList_Sum PROC near
; Parameter ThisP is on stack
        mov  ecx, [esp+4]             ; ThisP
        assume ecx: ptr MyList        ; ecx points to structure MyList
        xor  eax, eax                 ; sum = 0
        xor  edx, edx                 ; Loop index i = 0
        cmp  [ecx].length_, 0         ; this->length
        je   L9                       ; Skip if length = 0
L1:     add  eax, [ecx].buffer[edx*4] ; sum += buffer[i]
        add  edx, 1                   ; i++
        cmp  edx, [ecx].length_       ; while (i < length)
        jb   L1                       ; Loop
L9:     ret                           ; Return value is in eax
_MyList_Sum ENDP                      ; End of int MyList_Sum()
        assume ecx: nothing           ; ecx no longer points to anything

7.5 Thread-safe functions

A thread-safe or reentrant function is a function that works correctly when it is called simultaneously from more than one thread. Multithreading is used for taking advantage of computers with multiple CPU cores. It is therefore reasonable to require that a function library intended for speed-critical applications should be thread-safe.

Functions are thread-safe when no variables are shared between threads, except for intended communication between the threads. Constant data can be shared between threads without problems. Variables that are stored on the stack are thread-safe because each thread has its own stack.
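The distinction can be sketched in C++ (the function names here are illustrative, not from the library):

```cpp
// Thread-safe: all state lives in parameters and local variables,
// which are kept on each thread's own stack (or in registers).
double mean(const double* x, int n) {
    double sum = 0.0;                // local: one copy per thread
    for (int i = 0; i < n; i++) sum += x[i];
    return n > 0 ? sum / n : 0.0;
}

// NOT thread-safe: the running total is a static variable in the
// data segment, shared by every thread that calls the function.
double running_total(double x) {
    static double total = 0.0;       // shared between all threads
    total += x;                      // data race under concurrent calls
    return total;
}
```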
The problem arises only with static variables stored in the data segment. Static variables are used when data have to be saved from one function call to the next. It is possible to make thread-local static variables, but this is inefficient and system-specific.

The best way to store data from one function call to the next in a thread-safe way is to let the calling function allocate storage space for these data. The most elegant way to do this is to encapsulate the data and the functions that need them in a class. Each thread must create an object of the class and call the member functions on this object. The previous paragraph shows how to make member functions in assembly.

If the thread-safe assembly function has to be called from C or another language that does not support classes, or does so in an incompatible way, then the solution is to allocate a storage buffer in each thread and supply a pointer to this buffer to the function.

7.6 Makefiles

A make utility is a universal tool to manage software projects. It keeps track of all the source files, object files, library files, executable files, etc. in a software project. It does so by means of a general set of rules based on the date/time stamps of all the files. If a source file is newer than the corresponding object file, then the object file has to be re-made. If the object file is newer than the executable file, then the executable file has to be re-made.

Any IDE (Integrated Development Environment) contains a make utility which is activated from a graphical user interface, but in most cases it is also possible to use a command-line version of the make utility. The command line make utility (called make or nmake) is based on a set of rules that you can define in a so-called makefile. The advantage of using a makefile is that it is possible to define rules for any type of files, such as source files in any programming language, object files, library files, module definition files, resource files, executable files, zip files, etc. The only requirement is that a tool exists for converting one type of file to another and that this tool can be called from a command line with the file names as parameters.

The syntax for defining rules in a makefile is almost the same for all the different make utilities that come with different compilers for Windows and Linux.

Many IDEs also provide features for user-defined make rules for file types not known to the IDE, but these utilities are often less general and flexible than a stand-alone make utility.

The following is an example of a makefile for making a function library mylibrary.lib from three assembly source files func1.asm, func2.asm, func3.asm and packing it together with the corresponding header file mylibrary.h into a zip file mylibrary.zip.

# Example 7.2. makefile for mylibrary
mylibrary.zip: mylibrary.lib mylibrary.h
   wzzip $@ $?

mylibrary.lib: func1.obj func2.obj func3.obj
   lib /out:$@ $**

.asm.obj:
   ml /c /Cx /coff /Fo$@ $*.asm

The line mylibrary.zip: mylibrary.lib mylibrary.h tells that the file mylibrary.zip is built from mylibrary.lib and mylibrary.h, and that it must be rebuilt if any of these has been modified later than the zip file. The next line, which must be indented, specifies the command needed for building the target file mylibrary.zip from its dependents mylibrary.lib and mylibrary.h. (wzzip is a command line version of WinZip. Any other command line zip utility may be used.) The next two lines tell how to build the library file mylibrary.lib from the three object files. The line .asm.obj is a generic rule. It tells that any file with extension .obj can be built from a file with the same name and extension .asm by using the rule in the following indented line.

The build rules can use the following macros for specifying file names:

$@  = Current target's full name
$<  = Full name of dependent file
$*  = Current target's base name without extension
$** = All dependents of the current target, separated by spaces (nmake)
$+  = All dependents of the current target, separated by spaces (Gnu make)
$?  = All dependents with a later timestamp than the current target

The make utility is activated with the command nmake /Fmakefile or make -f makefile.


See the manual for the particular make utility for details.

8 Making function libraries compatible with multiple compilers and platforms

There are a number of compatibility problems to take care of if you want to make a function library that is compatible with multiple compilers, multiple programming languages, and multiple operating systems. The most important compatibility problems have to do with:

1. Name mangling
2. Calling conventions
3. Object file formats

The easiest solution to these portability problems is to make the code in a high level language such as C++ and make any necessary low-level constructs with the use of intrinsic functions or inline assembly. The code can then be compiled with different compilers for the different platforms. Note that not all C++ compilers support intrinsic functions or inline assembly and that the syntax may differ.

If assembly language programming is necessary or desired, then there are various methods for overcoming the compatibility problems between different x86 platforms. These methods are discussed in the following paragraphs.

8.1 Supporting multiple name mangling schemes

The easiest way to deal with the problems of compiler-specific name mangling schemes is to turn off name mangling with the extern "C" directive, as explained on page 30.

The extern "C" directive cannot be used for class member functions, overloaded functions and operators.
This problem can be solved by making an inline function with a mangled name to call an assembly function with an unmangled name:

// Example 8.1. Avoid name mangling of overloaded functions in C++
// Prototypes for unmangled assembly functions:
extern "C" double power_d (double x, double n);
extern "C" double power_i (double x, int n);

// Wrap these into overloaded functions:
inline double power (double x, double n) {return power_d(x, n);}
inline double power (double x, int n)    {return power_i(x, n);}

The compiler will simply replace a call to the mangled function with a call to the appropriate unmangled assembly function without any extra code. The same method can be used for class member functions, as explained on page 48.

However, in some cases it is desired to preserve the name mangling, either because it makes the C++ code simpler, or because the mangled names contain information about calling conventions and other compatibility issues.

An assembly function can be made compatible with multiple name mangling schemes simply by giving it multiple public names. Returning to example 4.1c page 31, we can add mangled names for multiple compilers in the following way:

; Example 8.2. (Example 4.1c rewritten)
; Function with multiple mangled names (32-bit mode)
; double sinxpnx (double x, int n) {return sin(x) + n*x;}

ALIGN 4
_sinxpnx PROC NEAR            ; extern "C" name

; Make public names for each name mangling scheme:
?sinxpnx@@YANNH@Z LABEL NEAR  ; Microsoft compiler
@sinxpnx$qdi      LABEL NEAR  ; Borland compiler
_Z7sinxpnxdi      LABEL NEAR  ; Gnu compiler for Linux
__Z7sinxpnxdi     LABEL NEAR  ; Gnu compiler for Windows and Mac OS
PUBLIC ?sinxpnx@@YANNH@Z, @sinxpnx$qdi, _Z7sinxpnxdi, __Z7sinxpnxdi

; parameter x = [ESP+4]
; parameter n = [ESP+12]
; return value = ST(0)

        fild dword ptr [esp+12]   ; n
        fld  qword ptr [esp+4]    ; x
        fmul st(1), st(0)         ; n*x
        fsin                      ; sin(x)
        fadd                      ; sin(x) + n*x
        ret                       ; return value is in st(0)
_sinxpnx ENDP

Example 8.2 works with most compilers in both 32-bit Windows and 32-bit Linux because the calling conventions are the same. A function can have multiple public names, and the linker will simply search for a name that matches the call from the C++ file. But a function call cannot have more than one external name.

The syntax for name mangling for different compilers is described in manual 5: "Calling conventions for different C++ compilers and operating systems". Applying this syntax manually is a difficult job. It is much easier and safer to generate each mangled name by compiling the function in C++ with the appropriate compiler. Command line versions of most compilers are available for free or as trial versions.

The Intel, Digital Mars and Codeplay compilers for Windows are compatible with the Microsoft name mangling scheme. The Intel compiler for Linux is compatible with the Gnu name mangling scheme. Gnu compilers version 2.x and earlier have a different name mangling scheme which I have not included in example 8.2. Mangled names for the Watcom compiler contain special characters which are only allowed by the Watcom assembler.

8.2 Supporting multiple calling conventions in 32 bit mode

Member functions in 32-bit Windows do not always have the same calling convention. The Microsoft-compatible compilers use the __thiscall convention with 'this' in register ecx, while Borland and Gnu compilers use the __cdecl convention with 'this' on the stack. One solution is to use friend functions as explained on page 48. Another possibility is to make a function with multiple entries. The following example is a rewrite of example 7.1a page 48 with two entries for the two different calling conventions:

; Example 8.3a (Example 7.1a with two entries)
; Member function, 32-bit mode
; int MyList::Sum()

; Define structure corresponding to class MyList:
MyList STRUC
length_  DD ?
buffer   DD 100 DUP (?)
MyList ENDS

_MyList_Sum PROC NEAR         ; for extern "C" friend function

; Make mangled names for compilers with __cdecl convention:
@MyList@Sum$qv    LABEL NEAR  ; Borland compiler
_ZN6MyList3SumEv  LABEL NEAR  ; Gnu comp. for Linux
__ZN6MyList3SumEv LABEL NEAR  ; Gnu comp. for Windows and Mac OS
PUBLIC @MyList@Sum$qv, _ZN6MyList3SumEv, __ZN6MyList3SumEv

; Move 'this' from the stack to register ecx:
        mov  ecx, [esp+4]

; Make mangled names for compilers with __thiscall convention:
?Sum@MyList@@QAEHXZ LABEL NEAR ; Microsoft compiler
PUBLIC ?Sum@MyList@@QAEHXZ

        assume ecx: ptr MyList        ; ecx points to structure MyList
        xor  eax, eax                 ; sum = 0
        xor  edx, edx                 ; Loop index i = 0
        cmp  [ecx].length_, 0         ; this->length
        je   L9                       ; Skip if length = 0
L1:     add  eax, [ecx].buffer[edx*4] ; sum += buffer[i]
        add  edx, 1                   ; i++
        cmp  edx, [ecx].length_       ; while (i < length)
        jb   L1                       ; Loop
L9:     ret                           ; Return value is in eax
_MyList_Sum ENDP                      ; End of int MyList::Sum()
        assume ecx: nothing           ; ecx no longer points to anything

The difference in name mangling schemes is actually an advantage here because it enables the linker to direct the call to the entry that corresponds to the right calling convention.

The method becomes more complicated if the member function has more parameters. Consider the function void MyList::AttItem(int item) on page 39. The __thiscall convention has the parameter 'this' in ecx and the parameter item on the stack at [esp+4], and requires that the stack is cleaned up by the function. The __cdecl convention has both parameters on the stack, with 'this' at [esp+4] and item at [esp+8], and the stack cleaned up by the caller.
A solution with two function entries requires a jump:

; Example 8.3b
; void MyList::AttItem(int item);

_MyList_AttItem PROC NEAR     ; for extern "C" friend function

; Make mangled names for compilers with __cdecl convention:
@MyList@AttItem$qi    LABEL NEAR ; Borland compiler
_ZN6MyList7AttItemEi  LABEL NEAR ; Gnu comp. for Linux
__ZN6MyList7AttItemEi LABEL NEAR ; Gnu comp. for Windows and Mac OS
PUBLIC @MyList@AttItem$qi, _ZN6MyList7AttItemEi, __ZN6MyList7AttItemEi

; Move parameters into registers:
        mov  ecx, [esp+4]             ; ecx = this
        mov  edx, [esp+8]             ; edx = item
        jmp  L0                       ; jump into common section

; Make mangled names for compilers with __thiscall convention:
?AttItem@MyList@@QAEXH@Z LABEL NEAR ; Microsoft compiler
PUBLIC ?AttItem@MyList@@QAEXH@Z

        pop  eax                      ; Remove return address from stack
        pop  edx                      ; Get parameter 'item' from stack
        push eax                      ; Put return address back on stack

L0:     ; common section where parameters are in registers
        ; ecx = this, edx = item
        assume ecx: ptr MyList        ; ecx points to structure MyList
        mov  eax, [ecx].length_       ; eax = this->length
        cmp  eax, 100                 ; Check if too high
        jnb  L9                       ; List is full. Exit
        mov  [ecx].buffer[eax*4], edx ; buffer[length] = item
        add  eax, 1                   ; length++
        mov  [ecx].length_, eax
L9:     ret
_MyList_AttItem ENDP                  ; End of MyList::AttItem
        assume ecx: nothing           ; ecx no longer points to anything

In example 8.3b, the two function entries each load all parameters into registers and then jump to a common section that doesn't need to read parameters from the stack. The __thiscall entry must remove the parameters from the stack before the common section.

Another compatibility problem occurs when we want to have a static and a dynamic link version of the same function library in 32-bit Windows. The static link library uses the __cdecl convention by default, while the dynamic link library uses the __stdcall convention by default. The static link library is the most efficient solution for C++ programs, but the dynamic link library is needed for several other programming languages.

One solution to this problem is to specify the __cdecl or the __stdcall convention for both libraries. Another solution is to make functions with two entries.

The following example shows the function from example 8.2 with two entries for the __cdecl and __stdcall calling conventions. Both conventions have the parameters on the stack.
The difference is that the stack is cleaned up by the caller in the __cdecl convention and by the called function in the __stdcall convention.

; Example 8.4a (Example 8.2 with __stdcall and __cdecl entries)
; Function with entries for __stdcall and __cdecl (32-bit Windows):

ALIGN 4
; __stdcall entry:
; extern "C" double __stdcall sinxpnx (double x, int n);
_sinxpnx@12 PROC NEAR
; Get all parameters into registers:
        fild dword ptr [esp+12]   ; n
        fld  qword ptr [esp+4]    ; x
; Remove parameters from stack:
        pop  eax                  ; Pop return address
        add  esp, 12              ; remove 12 bytes of parameters
        push eax                  ; Put return address back on stack
        jmp  L0

; __cdecl entry:
; extern "C" double __cdecl sinxpnx (double x, int n);
_sinxpnx LABEL NEAR
PUBLIC _sinxpnx
; Get all parameters into registers:
        fild dword ptr [esp+12]   ; n
        fld  qword ptr [esp+4]    ; x
; Don't remove parameters from the stack. This is done by caller

L0:     ; Common entry with parameters all in registers
; parameter x = st(0)
; parameter n = st(1)
        fmul st(1), st(0)         ; n*x
        fsin                      ; sin(x)
        fadd                      ; sin(x) + n*x
        ret                       ; return value is in st(0)
_sinxpnx@12 ENDP

The method of removing parameters from the stack in the function prolog rather than in the epilog is admittedly rather kludgy. A more efficient solution is to use conditional assembly:

; Example 8.4b
; Function with versions for __stdcall and __cdecl (32-bit Windows)

; Choose function prolog according to calling convention:
IFDEF STDCALL_                ; If STDCALL_ is defined
_sinxpnx@12 PROC NEAR         ; extern "C" __stdcall function name
ELSE
_sinxpnx PROC NEAR            ; extern "C" __cdecl function name
ENDIF

; Function body common to both calling conventions:
        fild dword ptr [esp+12]   ; n
        fld  qword ptr [esp+4]    ; x
        fmul st(1), st(0)         ; n*x
        fsin                      ; sin(x)
        fadd                      ; sin(x) + n*x

; Choose function epilog according to calling convention:
IFDEF STDCALL_                ; If STDCALL_ is defined
        ret  12               ; Clean up stack if __stdcall
_sinxpnx@12 ENDP              ; End of function
ELSE
        ret                   ; Don't clean up stack if __cdecl
_sinxpnx ENDP                 ; End of function
ENDIF

This solution requires that you make two versions of the object file, one with the __cdecl calling convention for the static link library and one with the __stdcall calling convention for the dynamic link library. The distinction is made on the command line for the assembler.
The __stdcall version is assembled with /DSTDCALL_ on the command line to define the macro STDCALL_, which is detected by the IFDEF conditional.

8.3 Supporting multiple calling conventions in 64 bit mode

Calling conventions are better standardized in 64-bit systems than in 32-bit systems. There is only one calling convention for 64-bit Windows and one calling convention for 64-bit Linux and other Unix-like systems. Unfortunately, the two sets of calling conventions are quite different. The most important differences are:

• Function parameters are transferred in different registers in the two systems.

• Registers RSI, RDI, and XMM6 - XMM15 have callee-save status in 64-bit Windows but not in 64-bit Linux.

• The caller must reserve a "shadow space" of 32 bytes on the stack for the called function in 64-bit Windows but not in Linux.

• A "red zone" of 128 bytes below the stack pointer is available for storage in 64-bit Linux but not in Windows.

• The Microsoft name mangling scheme is used in 64-bit Windows, the Gnu name mangling scheme is used in 64-bit Linux.


Both systems have the stack aligned by 16 before any call, and both systems have the stack cleaned up by the caller.

It is possible to make functions that can be used in both systems when the differences between the two systems are taken into account. The function should save the registers that have callee-save status in Windows or leave them untouched. The function should not use the shadow space or the red zone. The function should reserve a shadow space for any function it calls. The function needs two entries in order to resolve the differences in registers used for parameter transfer if it has any integer parameters.

Let's use example 4.1 page 30 once more and make an implementation that works in both 64-bit Windows and 64-bit Linux.

; Example 8.5a (Example 4.1e/f combined).
; Support for both 64-bit Windows and 64-bit Linux
; double sinxpnx (double x, int n) {return sin(x) + n * x;}

EXTRN sin:near

ALIGN 8
; 64-bit Linux entry:
_Z7sinxpnxdi PROC NEAR        ; Gnu name mangling
; Linux has n in edi, Windows has n in edx. Move it:
        mov  edx, edi

; 64-bit Windows entry:
?sinxpnx@@YANNH@Z LABEL NEAR  ; Microsoft name mangling
PUBLIC ?sinxpnx@@YANNH@Z

; parameter x = xmm0
; parameter n = edx
; return value = xmm0

        push rbx                  ; rbx must be saved
        sub  rsp, 48              ; space for x, shadow space f. sin, align
        movapd [rsp+32], xmm0     ; save x across call to sin
        mov  ebx, edx             ; save n across call to sin
        call sin                  ; xmm0 = sin(xmm0)
        cvtsi2sd xmm1, ebx        ; convert n to double
        mulsd xmm1, [rsp+32]      ; n * x
        addsd xmm0, xmm1          ; sin(x) + n * x
        add  rsp, 48              ; restore stack pointer
        pop  rbx                  ; restore rbx
        ret                       ; return value is in xmm0
_Z7sinxpnxdi ENDP                 ; End of function

We are not using the extern "C" declaration here because we are relying on the different name mangling schemes for distinguishing between Windows and Linux. The two entries are used for resolving the differences in parameter transfer. If the function declaration had n before x, i.e. double sinxpnx (int n, double x);, then the Windows version would have x in XMM1 and n in ecx, while the Linux version would still have x in XMM0 and n in EDI.

The function is storing x on the stack across the call to sin because there are no XMM registers with callee-save status in 64-bit Linux. The function reserves 32 bytes of shadow space for the call to sin even though this is not needed in Linux.

8.4 Supporting different object file formats

Another compatibility problem stems from differences in the formats of object files.


Borland, Digital Mars and 16-bit Microsoft compilers use the OMF format for object files. Microsoft, Intel and Gnu compilers for 32-bit Windows use the COFF format, also called PE32. Gnu and Intel compilers under 32-bit Linux prefer the ELF32 format. Gnu and Intel compilers for Mac OS X use the 32- and 64-bit Mach-O formats. The 32-bit Codeplay compiler supports the OMF, PE32 and ELF32 formats. All compilers for 64-bit Windows use the COFF/PE32+ format, while compilers for 64-bit Linux use the ELF64 format.

The MASM assembler can produce OMF, COFF/PE32 and COFF/PE32+ format object files, but not ELF format. The NASM assembler can produce OMF, COFF/PE32 and ELF32 formats. The YASM assembler can produce OMF, COFF/PE32, ELF32/64, COFF/PE32+ and MachO32/64 formats. The Gnu assembler (Gas) can produce ELF32/64 and MachO32/64 formats.

It is possible to do cross-platform development if you have an assembler that supports all the object file formats you need, or a suitable object file conversion utility. This is useful for making function libraries that work on multiple platforms. An object file converter and cross-platform library manager named objconv is available from www.agner.org/optimize.

The objconv utility can change function names in the object file as well as converting to a different object file format. This eliminates the need for name mangling.
Repeating example 8.5 without name mangling:

; Example 8.5b.
; Support for both 64-bit Windows and 64-bit Unix systems.
; double sinxpnx (double x, int n) {return sin(x) + n * x;}

EXTRN   sin:near

ALIGN   8
; 64-bit Linux entry:
Unix_sinxpnx PROC NEAR          ; Linux, BSD, Mac entry
; Unix has n in edi, Windows has n in edx. Move it:
        mov     edx, edi
; 64-bit Windows entry:
Win_sinxpnx LABEL NEAR          ; Microsoft entry
PUBLIC  Win_sinxpnx
; parameter x = xmm0
; parameter n = edx
; return value = xmm0
        push    rbx             ; rbx must be saved
        sub     rsp, 48         ; space for x, shadow space f. sin, align
        movapd  [rsp+32], xmm0  ; save x across call to sin
        mov     ebx, edx        ; save n across call to sin
        call    sin             ; xmm0 = sin(xmm0)
        cvtsi2sd xmm1, ebx      ; convert n to double
        mulsd   xmm1, [rsp+32]  ; n * x
        addsd   xmm0, xmm1      ; sin(x) + n * x
        add     rsp, 48         ; restore stack pointer
        pop     rbx             ; restore rbx
        ret                     ; return value is in xmm0
Unix_sinxpnx ENDP               ; End of function

This function can now be assembled and converted to multiple file formats with the following commands:

ml64 /c sinxpnx.asm
objconv -cof64 -np:Win_: sinxpnx.obj sinxpnx_win.obj
objconv -elf64 -np:Unix_: sinxpnx.obj sinxpnx_linux.o


objconv -mac64 -np:Unix_:_ sinxpnx.obj sinxpnx_mac.o

The first line assembles the code using the Microsoft 64-bit assembler ml64 and produces a COFF object file.

The second line replaces "Win_" with nothing in the beginning of function names in the object file. The result is a COFF object file for 64-bit Windows where the Windows entry for our function is available as extern "C" double sinxpnx(double x, int n). The name Unix_sinxpnx for the Unix entry is still unchanged in the object file, but is not used.

The third line converts the file to ELF format for 64-bit Linux and BSD, and replaces "Unix_" with nothing in the beginning of function names in the object file. This makes the Unix entry for the function available as sinxpnx, while the unused Windows entry is Win_sinxpnx.

The fourth line does the same for the Mach-O file format, and puts an underscore prefix on the function name, as required by Mac compilers.

Objconv can also build and convert static library files (*.lib, *.a). This makes it possible to build a multi-platform function library on a single source platform.

An example of using this method is the multi-platform function library asmlib.zip available from www.agner.org/optimize/. asmlib.zip includes a makefile (see page 49) that makes multiple versions of the same library by using the object file converter objconv.

More details about object file formats can be found in the book "Linkers and Loaders" by J. R. Levine (Morgan Kaufmann Publ. 2000).

8.5 Supporting other high level languages

If you are using a high-level language other than C++, and the compiler manual has no information on how to link with assembly, then see if the manual has any information on how to link with C or C++ modules. You can probably find out how to link with assembly from this information.

In general, it is preferable to use simple functions without name mangling, compatible with the extern "C" and __cdecl or __stdcall conventions in C++. This will work with most compiled languages. Arrays and strings are usually implemented differently in different languages.

Many modern programming languages such as C# and Visual Basic.NET cannot link to static link libraries. You have to make a dynamic link library instead. Delphi Pascal may have problems linking to object files - it is easier to use a DLL.

Calling assembly code from Java is quite complicated.
You have to compile the code to a DLL and use the Java Native Interface (JNI).

9 Optimizing for speed

9.1 Identify the most critical parts of your code

Optimizing software is not just a question of fiddling with the right assembly instructions. Many modern applications use much more time on loading modules, resource files, databases, interface frameworks, etc. than on actually doing the calculations the program is made for. Optimizing the calculation time does not help when the program spends 99.9% of its time on something other than calculation. It is important to find out where the biggest time consumers are before you start to optimize anything. Sometimes the solution can be to change from C# to C++, to use a different user interface framework, to organize file input and output differently, to cache network data, to avoid dynamic memory allocation, etc., rather than using assembly language. See manual 1: "Optimizing software in C++" for further discussion.

The use of assembly code for optimizing software is relevant only for highly CPU-intensive programs such as sound and image processing, encryption, sorting, data compression and complicated mathematical calculations.

In CPU-intensive programs, you will often find that more than 99% of the CPU time is used in the innermost loop. Identifying the most critical part of the software is therefore necessary if you want to improve the speed of computation. Optimizing less critical parts of the code is not only a waste of time; it also makes the code less clear and harder to debug and maintain. Most compiler packages include a profiler that can tell you which part of the code is most critical. If you don't have a profiler, and if it is not obvious which part of the code is most critical, then set up a number of counter variables that are incremented at different places in the code to see which part is executed most often. Use the methods described on page 154 for measuring how long each part of the code takes.

It is important to study the algorithm used in the critical part of the code to see if it can be improved. Often you can gain more speed simply by choosing the optimal algorithm than by any other optimization method.

9.2 Out of order execution

All modern x86 processors can execute instructions out of order.
Consider this example:

; Example 9.1a, Out-of-order execution
mov eax, [mem1]
imul eax, 6
mov [mem2], eax
mov ebx, [mem3]
add ebx, 2
mov [mem4], ebx

This piece of code is doing two things that have nothing to do with each other: multiplying [mem1] by 6 and adding 2 to [mem3]. If it happens that [mem1] is not in the cache, then the CPU has to wait many clock cycles while this operand is being fetched from main memory. The CPU will look for something else to do in the meantime. It cannot do the second instruction, imul eax,6, because it depends on the output of the first instruction. But the third instruction, mov ebx,[mem3], is independent of the preceding instructions, so it is possible to execute mov ebx,[mem3] and add ebx,2 while the CPU is waiting for [mem1]. CPUs have many features to support efficient out-of-order execution. Most important is, of course, the ability to detect whether an instruction depends on the output of a previous instruction. Another important feature is register renaming. Assume that we are using the same register for multiplying and adding in example 9.1a because there are no more spare registers:

; Example 9.1b, Out-of-order execution with register renaming
mov eax, [mem1]
imul eax, 6
mov [mem2], eax
mov eax, [mem3]
add eax, 2
mov [mem4], eax

Example 9.1b will work exactly as fast as example 9.1a because the CPU is able to use different physical registers for the same logical register eax. This works in a very elegant way. The CPU assigns a new physical register to hold the value of eax every time eax is written to.
This means that the above code is changed inside the CPU to code that uses four different physical registers for eax. The first register is used for the value loaded from [mem1]. The second register is used for the output of the imul instruction. The third register is used for the value loaded from [mem3]. And the fourth register is used for the output of the add instruction. The use of different physical registers for the same logical register enables the CPU to make the last three instructions in example 9.1b independent of the first three instructions. The CPU must have a lot of physical registers for this mechanism to work efficiently. The number of physical registers is different for different microprocessors, but you can generally assume that the number is sufficient for quite a lot of instruction reordering.

Some CPUs can keep different parts of a register separate, while other CPUs always treat a register as a whole. If we change example 9.1b so that the second part uses 16-bit registers, then we have the problem of a false dependence:

; Example 9.1c, False dependence of partial register
mov eax, [mem1]         ; 32 bit memory operand
imul eax, 6
mov [mem2], eax
mov ax, [mem3]          ; 16 bit memory operand
add ax, 2
mov [mem4], ax

Here the instruction mov ax,[mem3] changes only the lower 16 bits of register eax, while the upper 16 bits retain the value they got from the imul instruction. Many CPUs from both Intel and AMD are unable to rename a partial register. The consequence is that the mov ax,[mem3] instruction has to wait for the imul instruction to finish, because it needs to combine the 16 lower bits from [mem3] with the 16 upper bits from the imul instruction. Other CPUs are able to resolve the situation and avoid the false dependence in more or less efficient ways. The problem is avoided by replacing mov ax,[mem3] with movzx eax,[mem3].
This resets the high bits of eax and breaks the dependence on any previous value of eax. In 64-bit mode, it is sufficient to write to the 32-bit register, because this always resets the upper part of a 64-bit register. Thus, movzx eax,[mem3] and movzx rax,[mem3] are doing exactly the same thing. The 32-bit version of the instruction is one byte shorter than the 64-bit version. Any use of the high 8-bit registers AH, BH, CH, DH should be avoided because it can cause false dependences and less efficient code.

Another important feature is the splitting of instructions into micro-operations (abbreviated µops or uops). The following example shows the advantage of this:

; Example 9.2, Splitting instructions into uops
push eax
call SomeFunction

The push eax instruction does two things. It subtracts 4 from the stack pointer and stores eax to the address pointed to by the stack pointer. Assume now that eax is the result of a long and time-consuming calculation. This delays the push instruction. The call instruction depends on the value of the stack pointer, which is modified by the push instruction. If instructions were not split into µops, then the call instruction would have to wait until the push instruction was finished. But the CPU splits the push eax instruction into sub esp,4 followed by mov [esp],eax.
The sub esp,4 micro-operation can be executed before eax is ready, so the call instruction will wait only for sub esp,4, not for mov [esp],eax.

Out-of-order execution becomes even more efficient when the CPU can do more than one thing at the same time. Many CPUs can do two, three or four things at the same time if the things to do are independent of each other and do not use the same execution units in the CPU. Most CPUs have at least two integer ALUs (Arithmetic Logic Units) so that they can do two or more integer additions per clock cycle. There is usually one floating point add unit and one floating point multiplication unit, so that it is possible to do a floating point addition and a multiplication at the same time. There is at least one memory read unit and one memory write unit, so that it is possible to read and write to memory at the same time. The maximum average number of µops per clock cycle is three or four, so that it is possible, for example, to do an integer operation, a floating point operation, and a memory operation in the same clock cycle. The maximum number of arithmetic operations (i.e. anything other than memory read or write) is limited to two or three µops per clock cycle, depending on the CPU.

Floating point operations are usually pipelined, so that a new floating point addition can start before the previous addition is finished. MMX and XMM instructions use the floating point execution units even for integer instructions on many CPUs. The details about which instructions can be executed simultaneously or pipelined, and how many clock cycles each instruction takes, are CPU specific. The details for each type of CPU are explained in manual 3: "The microarchitecture of Intel and AMD CPU's" and manual 4: "Instruction tables".

The most important things you have to be aware of in order to take maximum advantage of out-of-order execution are:

• At least the following registers can be renamed: all integer registers, the stack pointer, the flags register, floating point registers, MMX registers, and XMM registers.
  Some CPUs can also rename segment registers and the floating point control word.

• Prevent false dependences by writing to a full register rather than a partial register.

• The INC and DEC instructions should be avoided because they write to only part of the flags register (excluding the carry flag). Use ADD or SUB instead to avoid false dependences or inefficient splitting of the flags register.

• A chain of instructions where each instruction depends on the previous one cannot execute out of order. Avoid long dependency chains. (See page 63).

• Memory operands cannot be renamed.

• A memory read can execute before a preceding memory write to a different address. Any pointer or index registers should be calculated as early as possible so that the CPU can verify that the addresses of memory operands are different.

• A memory write cannot execute before a preceding write, but the write buffers can hold a number of pending writes, typically four or six.

• A memory read can execute before another preceding read on Intel processors, but not on AMD processors.

• The CPU can do more things simultaneously if the code contains a good mixture of instructions from different categories, such as: simple integer instructions, floating point addition, multiplication, memory read, memory write.

9.3 Instruction fetch, decoding and retirement

Instruction fetching can be a bottleneck. Many processors cannot fetch more than 16 bytes of instruction code per clock cycle.
It may be necessary to make instructions as short as possible if this limit turns out to be critical. One way of making instructions shorter is to replace memory operands by pointers (see chapter 10, page 71). The addresses of memory operands can possibly be loaded into pointer registers outside of a loop if instruction fetching is a bottleneck. Large constants can likewise be loaded into registers.


Instruction fetching is delayed by jumps on most processors. It is important to minimize the number of jumps in critical code. Branches that are not taken and are correctly predicted do not delay instruction fetching. It is therefore advantageous to organize if-else branches so that the branch that is followed most commonly is the one where the conditional jump is not taken.

Most processors fetch instructions in aligned 16-byte or 32-byte blocks. It can be advantageous to align critical loop entries and subroutine entries by 16 in order to minimize the number of 16-byte boundaries in the code. Alternatively, make sure that there is no 16-byte boundary in the first few instructions after a critical loop entry or subroutine entry.

Instruction decoding is often a bottleneck. The organization of instructions that gives the optimal decoding is processor-specific. Intel PM processors require a 4-1-1 decode pattern. This means that instructions which generate 2, 3 or 4 µops should be interspersed by two single-µop instructions. On Core2 processors the optimal decode pattern is 4-1-1-1. On AMD processors it is preferable to avoid instructions that generate more than 2 µops.

Instructions with multiple prefixes can slow down decoding. The maximum number of prefixes that an instruction can have without slowing down decoding is 1 on 32-bit Intel processors, 2 on Intel P4E processors, 3 on AMD processors, and unlimited on Core2. Avoid address size prefixes. Avoid operand size prefixes on instructions with an immediate operand.
For example, it is preferable to replace MOV AX,2 by MOV EAX,2.

Decoding is rarely a bottleneck on processors with a trace cache, but there are specific requirements for optimal use of the trace cache.

Most Intel processors have a problem called register read stalls. This occurs if the code has several registers which are often read from but seldom written to.

Instruction retirement can be a bottleneck on most processors. AMD processors and Intel PM and P4 processors can retire no more than 3 µops per clock cycle. Core2 processors can retire 4 µops per clock cycle. No more than one taken jump can retire per clock cycle.

All these details are processor-specific. See manual 3: "The microarchitecture of Intel and AMD CPU's" for details.

9.4 Instruction latency and throughput

The latency of an instruction is the number of clock cycles it takes from the time the instruction starts to execute till the result is ready. The time it takes to execute a dependency chain is the sum of the latencies of all instructions in the chain.

The throughput of an instruction is the maximum number of instructions of the same kind that can be executed per clock cycle if the instructions are independent. I prefer to list the reciprocal throughputs because this makes it easier to compare latency and throughput. The reciprocal throughput is the average time from when one instruction starts to execute until another independent instruction of the same type can start to execute, or the number of clock cycles per instruction in a series of independent instructions of the same kind.
For example, floating point addition on a Core 2 processor has a latency of 3 clock cycles and a reciprocal throughput of 1 clock per instruction. This means that the processor uses 3 clock cycles per addition if each addition depends on the result of the preceding addition, but only 1 clock cycle per addition if the additions are independent.

Manual 4: "Instruction tables" contains detailed lists of latencies and throughputs for almost all instructions on many different microprocessors from Intel and AMD.

The following list shows some typical values.


Instruction                                 Typical latency  Typical reciprocal throughput
Integer move                                1                0.33-0.5
Integer addition                            1                0.33-0.5
Integer Boolean                             1                0.33-1
Integer shift                               1                0.33-1
Integer multiplication                      3-10             1-2
Integer division                            20-80            20-40
Floating point addition                     3-6              1
Floating point multiplication               4-8              1-2
Floating point division                     20-45            20-45
Integer vector addition (XMM)               1-2              0.5-2
Integer vector multiplication (XMM)         3-7              1-2
Floating point vector addition (XMM)        3-5              1-2
Floating point vector multiplication (XMM)  4-7              1-4
Floating point vector division (XMM)        20-60            20-60
Memory read (cached)                        3-4              0.5-1
Memory write (cached)                       3-4              1
Jump or call                                0                1-2

Table 9.1. Typical instruction latencies and throughputs

9.5 Break dependency chains

In order to take advantage of out-of-order execution, you have to avoid long dependency chains. Consider the following C++ example, which calculates the sum of 100 numbers:

// Example 9.3a, Loop-carried dependency chain
double list[100], sum = 0.;
for (int i = 0; i < 100; i++) sum += list[i];

This code is doing a hundred additions, and each addition depends on the result of the preceding one. This is a loop-carried dependency chain. A loop-carried dependency chain can be very long and completely prevent out-of-order execution for a long time.
Only the calculation of i can be done in parallel with the floating point addition.

Assuming that floating point addition has a latency of 4 and a reciprocal throughput of 1, the optimal implementation will have four accumulators, so that we always have four additions in the pipeline of the floating point adder. In C++ this will look like:

// Example 9.3b, Multiple accumulators
double list[100], sum1 = 0., sum2 = 0., sum3 = 0., sum4 = 0.;
for (int i = 0; i < 100; i += 4) {
    sum1 += list[i];
    sum2 += list[i+1];
    sum3 += list[i+2];
    sum4 += list[i+3];
}
sum1 = (sum1 + sum2) + (sum3 + sum4);

Here we have four dependency chains running in parallel, and each dependency chain is one fourth as long as the original one. See page 63 for examples of assembly code for loops with multiple accumulators.

It may not be possible to obtain the theoretical maximum throughput. The more parallel dependency chains there are, the more difficult it is for the CPU to schedule and reorder the µops optimally. It is particularly difficult if the dependency chains are branched or entangled.


Dependency chains occur not only in loops but also in linear code. Such dependency chains can also be broken up. For example, y = a + b + c + d can be changed to y = (a + b) + (c + d) so that the two parentheses can be calculated in parallel.

Sometimes there are different possible ways of implementing the same calculation with different latencies. For example, you may have the choice between a branch and a conditional move. The branch has the shortest latency, but the conditional move avoids the risk of branch misprediction (see page 65). Which implementation is optimal depends on how predictable the branch is and how long the dependency chain is.

A common way of setting a register to zero is XOR EAX,EAX or SUB EAX,EAX. Some processors recognize that these instructions are independent of the prior value of the register. So any instruction that uses the new value of the register will not have to wait for the value prior to the XOR or SUB instruction to be ready. These instructions are useful for breaking an unnecessary dependence. The following list summarizes which instructions are recognized as breaking dependence when source and destination are the same, on different processors:

Instruction    P3 and earlier  P4  PM  Core2  AMD
XOR            -               x   -   x      x
SUB            -               x   -   x      x
SBB            -               -   -   -      x
CMP            -               -   -   -      -
PXOR           -               x   -   x      x
XORPS, XORPD   -               -   -   x      x
PANDN          -               -   -   -      -
PSUBxx         -               -   -   x      -
PCMPxx         -               -   -   x      -

Table 9.2. Instructions that break dependence when source and destination are the same

You should not break a dependence by an 8-bit or 16-bit part of a register. For example, XOR AX,AX breaks a dependence on some processors, but not all.
But XOR EAX,EAX is sufficient for breaking the dependence on RAX in 64-bit mode.

SBB EAX,EAX is of course dependent on the carry flag, even when it does not depend on EAX.

You may also use these instructions for breaking dependences on the flags. For example, rotate instructions have a false dependence on the flags in Intel processors. This can be removed in the following way:

; Example 9.4, Break dependence on flags
ror eax, 1
sub edx, edx            ; Remove false dependence on the flags
ror ebx, 1

You cannot use CLC for breaking dependences on the carry flag.

9.6 Jumps and calls

Jumps, branches, calls and returns do not necessarily add to the execution time of the code, because they will typically be executed in parallel with something else. The number of jumps etc. should nevertheless be kept at a minimum in critical code for the following reasons:


• Instruction prefetching is less efficient after a jump, especially if the target is near the end of a 16-byte block.

• The code cache becomes fragmented and less efficient when jumping around between noncontiguous subroutines.

• Microprocessors with a trace cache are likely to store multiple instances of the same code in the trace cache when the code contains many jumps.

• The branch target buffer (BTB) can store only a limited number of jump target addresses. A BTB miss costs many clock cycles.

• Conditional jumps are predicted according to advanced branch prediction mechanisms. Mispredictions are expensive, as explained below.

• On most processors, branches can interfere with each other in the global branch pattern history table and the branch history register. One branch may therefore reduce the prediction rate of other branches.

• Returns are predicted by the use of a return stack buffer, which can hold only a limited number of return addresses, typically 8 or 16.

• Indirect jumps and indirect calls are poorly predicted on older processors.

All modern CPUs have an execution pipeline that contains stages for instruction prefetching, decoding, register renaming, µop reordering and scheduling, execution, retirement, etc. The number of stages in the pipeline ranges from 12 to 22, depending on the specific microarchitecture. When a branch instruction is fed into the pipeline, the CPU doesn't know for sure which instruction is the next one to fetch into the pipeline.
It takes at least 12 more clock cycles before the branch instruction is executed so that it is known with certainty which way the branch goes. This uncertainty is likely to break the flow through the pipeline.

Rather than waiting 12 or more clock cycles for an answer, the CPU attempts to guess which way the branch will go. The guess is based on the previous behavior of the branch. If the branch has gone the same way the last several times then it is predicted that it will go the same way this time. If the branch has alternated regularly between the two ways then it is predicted that it will continue to alternate.

If the prediction is right then the CPU has saved a lot of time by loading the right branch into the pipeline and started to decode and speculatively execute the instructions in the branch. If the prediction was wrong then the mistake is discovered after several clock cycles and the mistake has to be fixed by flushing the pipeline and discarding the results of the speculative executions. The cost of a branch misprediction ranges from 12 to more than 50 clock cycles, depending on the length of the pipeline and other details of the microarchitecture. This cost is so high that very advanced algorithms have been implemented in order to refine the branch prediction.
These algorithms are explained in detail in manual 3: "The microarchitecture of Intel and AMD CPU's".

In general, you can assume that branches are predicted correctly most of the time in these cases:

• If the branch always goes the same way.

• If the branch follows a simple repetitive pattern and is inside a loop with few or no other branches.


• If the branch is correlated with a preceding branch.

• If the branch is a loop with a constant, small repeat count and few or no conditional jumps inside the loop.

The worst case is a branch that goes either way approximately 50% of the time, does not follow any regular pattern, and is not correlated with any preceding branch. Such a branch will be mispredicted 50% of the time. This is so costly that the branch should be replaced by conditional moves or a table lookup if possible.

In general, you should try to keep the number of poorly predicted branches at a minimum and keep the number of branches inside a loop at a minimum. It may be useful to split up or unroll a loop if this can reduce the number of branches inside the loop.

Indirect jumps and indirect calls are often poorly predicted. Most processors will simply predict an indirect jump or call to go the same way as it did last time. The PM and Core2 processors are able to recognize simple repetitive patterns for indirect jumps.

Returns are predicted by means of a so-called return stack buffer which is a first-in-last-out buffer that mirrors the return addresses pushed on the stack. A return stack buffer with 16 entries can correctly predict all returns for subroutines at a nesting level up to 16. If the subroutine nesting level is deeper than the size of the return stack buffer then the failure will be seen at the outer nesting levels, not the presumably more critical inner nesting levels.
A return stack buffer size of 8 or 16 is therefore sufficient in most cases, except for deeply nested recursive functions.

The return stack buffer will fail if there is a call without a matching return or a return without a preceding call. It is therefore important to always match calls and returns. Do not jump out of a subroutine by any other means than by a RET instruction. And do not use the RET instruction as an indirect jump. Far calls should be matched with far returns.

Eliminating calls

It is possible to replace a call followed by a return by a jump:

; Example 9.5a, call/ret sequence (32-bit Windows)
Func1 PROC NEAR
      ...
      call Func2
      ret
Func1 ENDP

This can be changed to:

; Example 9.5b, call+ret replaced by jmp
Func1 PROC NEAR
      ...
      jmp Func2
Func1 ENDP

This modification does not conflict with the return stack buffer mechanism because the call to Func1 is matched with the return from Func2. In systems with stack alignment, it is necessary to restore the stack pointer before the jump:

; Example 9.6a, call/ret sequence (64-bit Windows or Linux)
Func1 PROC NEAR
      sub rsp, 8     ; Align stack by 16
      ...
      call Func2     ; This call can be eliminated
      add rsp, 8


      ret
Func1 ENDP

This can be changed to:

; Example 9.6b, call+ret replaced by jmp with stack aligned
Func1 PROC NEAR
      sub rsp, 8
      ...
      add rsp, 8     ; Restore stack pointer before jump
      jmp Func2
Func1 ENDP

Eliminating unconditional jumps

It is often possible to eliminate a jump by copying the code that it jumps to. The code that is copied can typically be a loop epilog or function epilog. The following example is a function with an if-else branch inside a loop:

; Example 9.7a, Function with jump that can be eliminated
FuncA PROC NEAR
      push ebp
      mov ebp, esp
      sub esp, StackSpaceNeeded
      lea edx, EndOfSomeArray
      xor eax, eax
Loop1:                       ; Loop starts here
      cmp [edx+eax*4], eax   ; if-else
      je ElseBranch
      ...                    ; First branch
      jmp End_If
ElseBranch:
      ...                    ; Second branch
End_If:
      add eax, 1             ; Loop epilog
      jnz Loop1
      mov esp, ebp           ; Function epilog
      pop ebp
      ret
FuncA ENDP

The jump to End_If may be eliminated by duplicating the loop epilog:

; Example 9.7b, Loop epilog copied to eliminate jump
FuncA PROC NEAR
      push ebp
      mov ebp, esp
      sub esp, StackSpaceNeeded
      lea edx, EndOfSomeArray
      xor eax, eax
Loop1:                       ; Loop starts here
      cmp [edx+eax*4], eax   ; if-else
      je ElseBranch
      ...                    ; First branch
      add eax, 1             ; Loop epilog for first branch
      jnz Loop1
      jmp AfterLoop
ElseBranch:
      ...                    ; Second branch
      add eax, 1             ; Loop epilog for second branch
      jnz Loop1


AfterLoop:
      mov esp, ebp           ; Function epilog
      pop ebp
      ret
FuncA ENDP

In example 9.7b, the unconditional jump inside the loop has been eliminated by making two copies of the loop epilog. The branch that is executed most often should come first because the first branch is fastest. The unconditional jump to AfterLoop can also be eliminated. This is done by copying the function epilog:

; Example 9.7c, Function epilog copied to eliminate jump
FuncA PROC NEAR
      push ebp
      mov ebp, esp
      sub esp, StackSpaceNeeded
      lea edx, EndOfSomeArray
      xor eax, eax
Loop1:                       ; Loop starts here
      cmp [edx+eax*4], eax   ; if-else
      je ElseBranch
      ...                    ; First branch
      add eax, 1             ; Loop epilog for first branch
      jnz Loop1
      mov esp, ebp           ; Function epilog 1
      pop ebp
      ret
ElseBranch:
      ...                    ; Second branch
      add eax, 1             ; Loop epilog for second branch
      jnz Loop1
      mov esp, ebp           ; Function epilog 2
      pop ebp
      ret
FuncA ENDP

The gain that is obtained by eliminating the jump to AfterLoop is less than the gain obtained by eliminating the jump to End_If because it is outside the loop. But I have shown it here to illustrate the general method of duplicating a function epilog.

Replacing conditional jumps with conditional moves

The most important jumps to eliminate are conditional jumps, especially if they are poorly predicted. Example:

// Example 9.8a. C++ branch to optimize
a = b > c ? d : e;

This can be implemented with either a conditional jump or a conditional move:

; Example 9.8b. Branch implemented with conditional jump
      mov eax, [b]
      cmp eax, [c]
      jng L1
      mov eax, [d]
      jmp L2
L1:   mov eax, [e]
L2:   mov [a], eax


; Example 9.8c. Branch implemented with conditional move
      mov eax, [b]
      cmp eax, [c]
      mov eax, [d]
      cmovng eax, [e]
      mov [a], eax

Conditional move instructions may not be available on old processors from the 1990's.

The advantage of a conditional move is that it avoids branch mispredictions. But it has the disadvantage that it increases the length of a dependency chain, while a predicted branch breaks the dependency chain. If the code in example 9.8c is part of a dependency chain then the cmov instruction adds to the length of the chain. The latency of cmov is two clock cycles on Intel processors, and one clock cycle on AMD processors. If the same code is implemented with a conditional jump as in example 9.8b and if the branch is predicted correctly, then the result doesn't have to wait for b and c to be ready. It only has to wait for either d or e, whichever is chosen. This means that the dependency chain is broken by the predicted branch, while the implementation with a conditional move has to wait for both b, c, d and e to be available. If d and e are complicated expressions, then both have to be calculated when the conditional move is used, while only one of them has to be calculated if a conditional jump is used.

As a rule of thumb, we can say that a conditional jump is faster than a conditional move if the code is part of a dependency chain and the prediction rate is better than 75%. A conditional jump is also preferred if we can avoid a lengthy calculation of d or e when the other operand is chosen.

Loop-carried dependency chains are particularly sensitive to the disadvantages of conditional moves.
For example, the code in example 12.13a on page 107 works more efficiently with a branch inside the loop than with a conditional move, even if the branch is poorly predicted. This is because the floating point conditional move adds to the loop-carried dependency chain and because the implementation with a conditional move has to calculate all the power*xp values, even when they are not used.

Another example of a loop-carried dependency chain is a binary search in a sorted list. If the items to search for are randomly distributed over the entire list then the branch prediction rate will be close to 50% and it will be faster to use conditional moves. But if the items are often close to each other so that the prediction rate will be better, then it is more efficient to use conditional jumps than conditional moves because the dependency chain is broken every time a correct branch prediction is made.

It is also possible to do conditional moves in vector registers on an element-by-element basis. See page 109ff for details. There are special vector instructions for getting the minimum or maximum of two numbers. It may be faster to use vector registers than integer or floating point registers for finding minimums or maximums.

Replacing conditional jumps with conditional set instructions

If a conditional jump is used for setting a Boolean variable to 0 or 1 then it is often more efficient to use the conditional set instruction. Example:

// Example 9.9a. Set a bool variable on some condition
int b, c;
bool a = b > c;

; Example 9.9b. Implementation with conditional set
      mov eax, [b]
      cmp eax, [c]
      setg al
      mov [a], al


The conditional set instruction writes only to 8-bit registers. If a 32-bit result is needed then set the rest of the register to zero before the compare:

; Example 9.9c. Implementation with conditional set, 32 bits
      mov eax, [b]
      xor ebx, ebx   ; zero register before cmp to avoid changing flags
      cmp eax, [c]
      setg bl
      mov [a], ebx

If there is no vacant register then use movzx:

; Example 9.9d. Implementation with conditional set, 32 bits
      mov eax, [b]
      cmp eax, [c]
      setg al
      movzx eax, al
      mov [a], eax

If a value of all ones is needed for true then use neg eax.

An implementation with conditional jumps may be faster than conditional set if the prediction rate is good and the code is part of a long dependency chain, as explained in the previous section (page 68).

Replacing conditional jumps with bit-manipulation instructions

It is sometimes possible to obtain the same effect as a branch by ingenious manipulation of bits and flags.
The carry flag is particularly useful for bit manipulation tricks:

; Example 9.10, Set carry flag if eax is zero:
      cmp eax, 1

; Example 9.11, Set carry flag if eax is not zero:
      neg eax

; Example 9.12, Increment eax if carry flag is set:
      adc eax, 0

; Example 9.13, Copy carry flag to all bits of eax:
      sbb eax, eax

; Example 9.14, Copy bits one by one from carry into a bit vector:
      rcl eax, 1

It is possible to calculate the absolute value of a signed integer without branching:

; Example 9.15, Calculate absolute value of eax
      cdq            ; Copy sign bit of eax to all bits of edx
      xor eax, edx   ; Invert all bits if negative
      sub eax, edx   ; Add 1 if negative

The following example finds the minimum of two unsigned numbers: if (b > a) b = a;

; Example 9.16a, Find minimum of eax and ebx (unsigned):
      sub eax, ebx   ; = a-b
      sbb edx, edx   ; = (b > a) ? 0xFFFFFFFF : 0
      and edx, eax   ; = (b > a) ? a-b : 0
      add ebx, edx   ; Result is in ebx

Or, for signed numbers, ignoring overflow:


; Example 9.16b, Find minimum of eax and ebx (signed):
      sub eax, ebx   ; Will not work if overflow here
      cdq            ; = (b > a) ? 0xFFFFFFFF : 0
      and edx, eax   ; = (b > a) ? a-b : 0
      add ebx, edx   ; Result is in ebx

The next example chooses between two numbers: if (a < 0) d = b; else d = c;

; Example 9.17a, Choose between two numbers
      test eax, eax
      mov edx, ecx
      cmovs edx, ebx ; = (a < 0) ? b : c

Conditional moves are not very efficient on Intel processors and not available on old processors. Alternative implementations may be faster in some cases. The following example gives the same result as example 9.17a:

; Example 9.17b, Choose between two numbers without conditional move:
      cdq            ; = (a < 0) ? 0xFFFFFFFF : 0
      xor ebx, ecx   ; b ^ c = bits that differ between b and c
      and edx, ebx   ; = (a < 0) ? (b ^ c) : 0
      xor edx, ecx   ; = (a < 0) ? b : c

Example 9.17b may be faster than 9.17a on processors where conditional moves are inefficient. Example 9.17b destroys the value of ebx.

Whether these tricks are faster than a conditional jump depends on the prediction rate, as explained above.

10 Optimizing for size

The code cache can hold from 8 to 32 kb of code, as explained in chapter 11 page 79. If there are problems keeping the critical parts of the code within the code cache, then you may consider reducing the size of the code. Reducing the code size can also improve the decoding of instructions.
Loops that have no more than 64 bytes of code perform particularly fast on the Core2 processor. You may even want to reduce the size of the code at the cost of reduced speed if speed is not important.

32-bit code is usually bigger than 16-bit code because addresses and data constants take 4 bytes in 32-bit code and only 2 bytes in 16-bit code. However, 16-bit code has other penalties, especially because of segment prefixes. 64-bit code does not need more bytes for addresses than 32-bit code because it can use 32-bit RIP-relative addresses. 64-bit code may be slightly bigger than 32-bit code because of REX prefixes and other minor differences, but it may as well be smaller than 32-bit code because the increased number of registers reduces the need for memory variables.

10.1 Choosing shorter instructions

Certain instructions have short forms. PUSH and POP instructions with an integer register take only one byte. XCHG EAX,reg32 is also a single-byte instruction and thus takes less space than a MOV instruction, but XCHG is slower than MOV. INC and DEC with a 32-bit register in 32-bit mode, or a 16-bit register in 16-bit mode, take only one byte. The short form of INC and DEC is not available in 64-bit mode.


The following instructions take one byte less when they use the accumulator than when they use any other register: ADD, ADC, SUB, SBB, AND, OR, XOR, CMP, TEST with an immediate operand without sign extension. This also applies to the MOV instruction with a memory operand and no pointer register in 16- and 32-bit mode, but not in 64-bit mode. Examples:

; Example 10.1. Instruction sizes
add eax,1000 is smaller than add ebx,1000
mov eax,[mem] is smaller than mov ebx,[mem], except in 64-bit mode.

Instructions with pointers take one byte less when they have only a base pointer (except ESP, RSP or R12) and a displacement than when they have a scaled index register, or both base pointer and index register, or ESP, RSP or R12 as base pointer. Examples:

; Example 10.2. Instruction sizes
mov eax,array[ebx] is smaller than mov eax,array[ebx*4]
mov eax,[ebp+12] is smaller than mov eax,[esp+12]

Instructions with BP, EBP, RBP or R13 as base pointer and no displacement and no index take one byte more than with other registers:

; Example 10.3. Instruction sizes
mov eax,[ebx] is smaller than mov eax,[ebp], but
mov eax,[ebx+4] is same size as mov eax,[ebp+4].

Instructions with a scaled index pointer and no base pointer must have a four bytes displacement, even when it is 0:

; Example 10.4. Instruction sizes
lea eax,[ebx+ebx] is shorter than lea eax,[ebx*2].

Instructions in 64-bit mode need a REX prefix if at least one of the registers R8 - R15 or XMM8 - XMM15 is used. Instructions that use these registers are therefore one byte longer than instructions that use other registers, unless a REX prefix is needed anyway for other reasons:

; Example 10.5a. Instruction sizes (64-bit mode)
mov eax,[rbx] is smaller than mov eax,[r8].

; Example 10.5b. Instruction sizes (64-bit mode)
mov rax,[rbx] is same size as mov rax,[r8].

In example 10.5a, we can avoid a REX prefix by using register RBX instead of R8 as pointer. But in example 10.5b, we need a REX prefix anyway for the 64-bit operand size, and the instruction cannot have more than one REX prefix.

Floating point calculations can be done either with the old floating point stack registers ST(0)-ST(7) or the new XMM registers on microprocessors with the SSE2 or later instruction set. The former instructions are more compact than the latter, for example:

; Example 10.6. Floating point instruction sizes
fadd st(0), st(1)   ; 2 bytes
addsd xmm0, xmm1    ; 4 bytes

The use of ST(0)-ST(7) may be advantageous even if it requires extra FXCH instructions. There is no big difference in execution speed between the two types of floating point instructions on current processors.


Processors supporting the AVX instruction set (expected available in 2010) can code XMM instructions in two different ways, with a VEX prefix or with the old prefixes. Sometimes the VEX version is shorter and sometimes the old version is shorter. However, there is a severe performance penalty to mixing XMM instructions without VEX prefix with instructions using YMM registers.

10.2 Using shorter constants and addresses

Many jump addresses, data addresses, and data constants can be expressed as sign-extended 8-bit constants. This saves a lot of space. A sign-extended byte can only be used if the value is within the interval from -128 to +127.

For jump addresses, this means that short jumps take two bytes of code, whereas jumps beyond 127 bytes take 5 bytes if unconditional and 6 bytes if conditional.

Likewise, data addresses take less space if they can be expressed as a pointer and a displacement between -128 and +127. The following example assumes that [mem1] and [mem2] are static memory addresses in the data segment and that the distance between them is less than 128 bytes:

; Example 10.7a, Static memory operands
mov ebx, [mem1]                 ; 6 bytes
add ebx, [mem2]                 ; 6 bytes

Reduce to:

; Example 10.7b, Replace addresses by pointer
mov eax, offset mem1            ; 5 bytes
mov ebx, [eax]                  ; 2 bytes
add ebx, [eax] + (mem2 - mem1)  ; 3 bytes

In 64-bit mode you need to replace mov eax,offset mem1 with lea rax,[mem1], which is one byte longer. The advantage of using a pointer obviously increases if you can use the same pointer many times.
Storing data on the stack and using EBP or ESP as pointer will thus make the code smaller than if you use static memory locations and absolute addresses, provided of course that the data are within +/-127 bytes of the pointer. Using PUSH and POP to write and read temporary integer data is even shorter.

Data constants may also take less space if they are between -128 and +127. Most instructions with immediate operands have a short form where the operand is a sign-extended single byte. Examples:

; Example 10.8, Sign-extended operands
push 200        ; 5 bytes
push 100        ; 2 bytes, sign extended
add ebx, 128    ; 6 bytes
sub ebx, -128   ; 3 bytes, sign extended

The only instructions with an immediate operand that do not have a short form with a sign-extended 8-bit constant are MOV, TEST, CALL and RET. You may save space, for example by replacing TEST with AND:

; Example 10.9, Sign-extended operands
test ebx, 8     ; 6 bytes
test eax, 8     ; 5 bytes
and ebx, 8      ; 3 bytes
bt ebx, 3       ; 4 bytes (uses carry flag)
cmp ebx, 8      ; 3 bytes


Shorter alternatives for MOV register, constant are often useful. Examples:

; Example 10.10, Loading constants into 32-bit registers
mov eax, 0               ; 5 bytes
sub eax, eax             ; 2 bytes
mov eax, 1               ; 5 bytes
sub eax, eax / inc eax   ; 3 bytes
push 1 / pop eax         ; 3 bytes
mov eax, -1              ; 5 bytes
or eax, -1               ; 3 bytes

You may also consider reducing the size of static data. Obviously, an array can be made smaller by using a smaller data size for the elements, for example 16-bit integers instead of 32-bit integers if the data are sure to fit into the smaller data size. The code for accessing 16-bit integers is slightly bigger than for accessing 32-bit integers, but the increase in code size is small compared to the decrease in data size for a large array. Instructions with 16-bit immediate data operands should be avoided in 32-bit and 64-bit mode because of inefficient decoding.

10.3 Reusing constants

If the same address or constant is used more than once then you may load it into a register. A MOV with a 4-byte immediate operand may sometimes be replaced by an arithmetic instruction if the value of the register before the MOV is known.
Example:

; Example 10.11a, Loading 32-bit constants
mov [mem1], 200     ; 10 bytes
mov [mem2], 201     ; 10 bytes
mov eax, 100        ; 5 bytes
mov ebx, 150        ; 5 bytes

Replace with:

; Example 10.11b, Reuse constants
mov eax, 200        ; 5 bytes
mov [mem1], eax     ; 5 bytes
inc eax             ; 1 byte
mov [mem2], eax     ; 5 bytes
sub eax, 101        ; 3 bytes
lea ebx, [eax+50]   ; 3 bytes

10.4 Constants in 64-bit mode

In 64-bit mode, there are three ways to move a constant into a 64-bit register: with a 64-bit constant, with a 32-bit sign-extended constant, and with a 32-bit zero-extended constant:

; Example 10.12, Loading constants into 64-bit registers
mov rax, 123456789abcdef0h   ; 10 bytes (64-bit constant)
mov rax, -100                ; 7 bytes (32-bit sign-extended)
mov eax, 100                 ; 5 bytes (32-bit zero-extended)

Some assemblers use the sign-extended version rather than the shorter zero-extended version, even when the constant is within the range that fits into a zero-extended constant. You can force the assembler to use the zero-extended version by specifying a 32-bit destination register. Writes to a 32-bit register are always zero-extended into the 64-bit register.


10.5 Addresses and pointers in 64-bit mode

64-bit code should preferably use 64-bit register size for base and index in addresses, and 32-bit register size for everything else. Example:

; Example 10.13, 64-bit versus 32-bit registers
mov eax, [rbx + 4*rcx]
inc rcx

Here, you can save one byte by changing inc rcx to inc ecx. This will work because the value of the index register is certain to be less than 2^32. The base pointer, however, may be bigger than 2^32 in some systems, so you can't replace add rbx,4 by add ebx,4. Never use 32-bit registers as base or index inside the square brackets in 64-bit mode.

The rule of using 64-bit registers inside the square brackets of an indirect address and 32-bit registers everywhere else also applies to the LEA instruction. Examples:

; Example 10.14. LEA in 64-bit mode
lea eax, [ebx + ecx]   ; 4 bytes (needs address size prefix)
lea eax, [rbx + rcx]   ; 3 bytes (no prefix)
lea rax, [ebx + ecx]   ; 5 bytes (address size and REX prefix)
lea rax, [rbx + rcx]   ; 4 bytes (needs REX prefix)

The form with 32-bit destination and 64-bit address is preferred unless a 64-bit result is needed. This version takes no more time to execute than the version with 64-bit destination. The forms with address size prefix should never be used.

An array of 64-bit pointers in a 64-bit program can be made smaller by using 32-bit pointers relative to the image base or to some reference point.
This makes the array of pointers smaller at the cost of making the code that uses the pointers bigger since it needs to add the image base. Whether this gives a net advantage depends on the size of the array. Example:

; Example 10.15a. Jump-table in 64-bit mode
.data
JumpTable DQ Label1, Label2, Label3, ...

.code
mov eax, [n]               ; Index
lea rdx, JumpTable         ; Address of jump table
jmp qword ptr [rdx+rax*8]  ; Jump to JumpTable[n]

Implementation with image-relative pointers:

; Example 10.15b. Image-relative jump-table in 64-bit Windows
.data
JumpTable DD imagerel(Label1), imagerel(Label2), imagerel(Label3), ..

extrn __ImageBase:byte

.code
mov eax, [n]                              ; Index
lea rdx, __ImageBase                      ; Image base
mov eax, [rdx+rax*4+imagerel(JumpTable)]  ; Load image rel. address
add rax, rdx                              ; Add image base to address
jmp rax                                   ; Jump to computed address

The Gnu compiler for Mac OS X uses the jump table itself as a reference point:

; Example 10.15c. Self-relative jump-table in 64-bit Mac
.data
JumpTable DD Label1-JumpTable, Label2-JumpTable, Label3-JumpTable


.code
mov eax, [n]               ; Index
lea rdx, JumpTable         ; Table and reference point
movsxd rax, [rdx + rax*4]  ; Load address relative to table
add rax, rdx               ; Add table base to address
jmp rax                    ; Jump to computed address

The assembler may not be able to make self-relative references from the data segment to the code segment.

The shortest alternative is to use 32-bit absolute pointers. This method can be used only if there is certainty that all addresses are less than 2^31:

; Example 10.15d. 32-bit absolute jump table in 64-bit Linux
; Requires that addresses < 2^31
.data
JumpTable DD Label1, Label2, Label3, ..   ; 32-bit addresses

.code
mov eax, [n]               ; Index
mov eax, JumpTable[rax*4]  ; Load 32-bit address
jmp rax                    ; Jump to zero-extended address

In example 10.15d, the address of JumpTable is a 32-bit relocatable address which is zero-extended or sign-extended to 64 bits. This works if the address is less than 2^31. The addresses of Label1, etc., are zero-extended, so this will work if the addresses are less than 2^32. The method of example 10.15d can be used if there is certainty that the image base plus the program size is less than 2^31, which will be true in most cases for application programs in Windows, Linux and BSD, but not in Mac OS X (see page 22).

It is even possible to replace the 64-bit or 32-bit pointers with 16-bit offsets relative to a suitable reference point:

; Example 10.15e.
16-bit offsets to a reference point
.data
JumpTable DW 0, Label2-Label1, Label3-Label1, ..

.code
mov eax, [n]                     ; Index
lea rdx, JumpTable               ; Address of table (RIP-relative)
movsx rax, word ptr [rdx+rax*2]  ; Sign-extend 16-bit offset
lea rdx, Label1                  ; Use Label1 as reference point
add rax, rdx                     ; Add offset to reference point
jmp rax                          ; Jump to computed address

Example 10.15e uses Label1 as a reference point. It works only if all labels are within the interval Label1 ± 2^15. The table contains the 16-bit offsets which are sign-extended and added to the reference point.

The examples above show different methods for storing code pointers. The same methods can be used for data pointers. A pointer can be stored as a 64-bit absolute address, a 32-bit relative address, a 32-bit absolute address, or a 16-bit offset relative to a suitable reference point. The methods that use pointers relative to the image base or a reference point are only worth the extra code if there are many pointers. This is typically the case in large switch statements and in linked lists.

10.6 Making instructions longer for the sake of alignment

There are situations where it can be advantageous to reverse the advice of the previous paragraphs in order to make instructions longer. Most important is the case where a loop


entry needs to be aligned (see p. 83). Rather than inserting NOP's to align the loop entry label, you may make the preceding instructions longer than their minimum lengths in such a way that the loop entry becomes properly aligned. The longer versions of the instructions do not take longer time to execute, so we can save the time it takes to execute the NOP's.

The assembler will normally choose the shortest possible form of an instruction. It is often possible to choose a longer form of the same or an equivalent instruction. This can be done in several ways.

Use the general form instead of the short form of an instruction
The short forms of INC, DEC, PUSH, POP, XCHG, ADD, MOV do not have a mod-reg-r/m byte (see p. 24). The same instructions can be coded in the general form with a mod-reg-r/m byte. Examples:

; Example 10.16. Making instructions longer
        inc  eax             ; Short form. 1 byte (in 32-bit mode only)
        DB   0FFH, 0C0H      ; Long form of INC EAX, 2 bytes
        push ebx             ; Short form. 1 byte
        DB   0FFH, 0F3H      ; Long form of PUSH EBX, 2 bytes

Use an equivalent instruction that is longer
Examples:

; Example 10.17. Making instructions longer
        inc  eax             ; 1 byte (in 32-bit mode only)
        add  eax, 1          ; 3-byte replacement for INC EAX
        mov  eax, ebx        ; 2 bytes
        lea  eax, [ebx]      ; Can be any length from 2 to 8 bytes, see below

Use a 4-byte immediate operand
Instructions with a sign-extended 8-bit immediate operand can be replaced by the version with a 32-bit immediate operand:

; Example 10.18. Making instructions longer
        add  ebx, 1          ; 3 bytes. Uses sign-extended 8-bit operand
        add  ebx, 9999       ; Use dummy constant too big for 8-bit operand
        ORG  $ - 4           ; Go back 4 bytes..
        DD   1               ; and overwrite the 9999 operand with a 1

The above will encode ADD EBX,1 using 6 bytes of code.

Add a zero displacement to a pointer
An instruction with a pointer can have a displacement of 1 or 4 bytes in 32-bit or 64-bit mode (1 or 2 bytes in 16-bit mode). A dummy displacement of zero can be used for making the instruction longer:

; Example 10.19. Making instructions longer
        mov  eax, [ebx]      ; 2 bytes
        mov  eax, [ebx+1]    ; Add 1-byte displacement. Total length = 3
        ORG  $ - 1           ; Go back one byte..
        DB   0               ; and overwrite displacement with 0
        mov  eax, [ebx+9999] ; Add 4-byte displacement. Total length = 6
        ORG  $ - 4           ; Go back 4 bytes..
        DD   0               ; and overwrite displacement with 0


The same can be done with LEA EAX,[EBX+0] as a replacement for MOV EAX,EBX.

Use a SIB byte
An instruction with a memory operand can have a SIB byte (see p. 24). A SIB byte can be added to an instruction that doesn't already have one to make the instruction one byte longer. A SIB byte cannot be used in 16-bit mode or in 64-bit mode with a RIP-relative address. Example:

; Example 10.20. Making instructions longer
        mov  eax, [ebx]         ; Length = 2 bytes
        DB   8BH, 04H, 23H      ; Same with SIB byte. Length = 3 bytes
        DB   8BH, 44H, 23H, 00H ; With SIB byte and displacement. 4 bytes

Use prefixes
An easy way to make an instruction longer is to add unnecessary prefixes. All instructions with a memory operand can have a segment prefix. The DS segment prefix is rarely needed, but it can be added without changing the meaning of the instruction:

; Example 10.21. Making instructions longer
        DB   3EH                ; DS segment prefix
        mov  eax, [ebx]         ; Prefix + instruction = 3 bytes

All instructions with a memory operand can have a segment prefix, including LEA. It is actually possible to add a segment prefix even to instructions without a memory operand. Such meaningless prefixes are simply ignored. But there is no absolute guarantee that the meaningless prefix will not have some meaning on future processors. For example, the P4 uses segment prefixes on branch instructions as branch prediction hints. The probability is very low, I would say, that segment prefixes will have any adverse effect on future processors for instructions that could have a memory operand, i.e.
instructions with a mod-reg-r/m byte.

CS, DS, ES and SS segment prefixes have no effect in 64-bit mode, but they are still allowed, according to AMD64 Architecture Programmer's Manual, Volume 3: General-Purpose and System Instructions, 2003.

In 64-bit mode, you can also use an empty REX prefix to make instructions longer:

; Example 10.22. Making instructions longer
        DB   40H                ; Empty REX prefix
        mov  eax, [rbx]         ; Prefix + instruction = 3 bytes

Empty REX prefixes can safely be applied to almost all instructions in 64-bit mode that do not already have a REX prefix, except instructions that use AH, BH, CH or DH. REX prefixes cannot be used in 32-bit or 16-bit mode. A REX prefix must come after any other prefixes, and no instruction can have more than one REX prefix.

AMD's optimization manual recommends the use of up to three operand size prefixes (66H) as fillers. But this prefix can only be used on instructions that are not affected by it, i.e. NOP and floating point ST() instructions. Segment prefixes are more widely applicable and have the same effect - or rather lack of effect.

It is possible to add multiple identical prefixes to any instruction as long as the total instruction length does not exceed 15. For example, you can have an instruction with two or three DS segment prefixes. But instructions with multiple prefixes take extra time to decode, especially on older processors. Do not use more than three prefixes on AMD processors, including any necessary prefixes. There is no limit to the number of prefixes that can be decoded efficiently on Core2 processors, as long as the total instruction length is no more


than 15 bytes, but earlier Intel processors can handle no more than one or two prefixes without penalty.

It is not a good idea to use address size prefixes as fillers because this may slow down instruction decoding.

Do not place dummy prefixes immediately before a jump label to align it:

; Example 10.23. Wrong way of making instructions longer
L1:     mov  ecx, 1000
        DB   3EH                ; DS segment prefix. Wrong!
L2:     mov  eax, [esi]         ; Executed both with and without prefix

In this example, the MOV EAX,[ESI] instruction will be decoded with a DS segment prefix when we come from L1, but without the prefix when we come from L2. This works in principle, but some microprocessors remember where the instruction boundaries are, and such processors will be confused when the same instruction begins at two different locations. There may be a performance penalty for this.

Processors supporting the AVX instruction set use VEX prefixes, which are 2 or 3 bytes long. A 2-byte VEX prefix can always be replaced by a 3-byte VEX prefix. A VEX prefix can be preceded by segment prefixes but not by any other prefixes. No other prefix is allowed after the VEX prefix. Only instructions using XMM or YMM registers can have a VEX prefix. The VEX prefix is optional only for instructions that read XMM registers but do not write to XMM registers. It is possible that VEX prefixes will be allowed on other instructions in future instruction sets.

It is recommended to check hand-coded instructions with a debugger or disassembler to make sure they are correct.

11 Optimizing memory access

Reading from the level-1 cache takes approximately 3 clock cycles.
Reading from the level-2 cache takes on the order of 10 clock cycles. Reading from main memory takes on the order of 100 clock cycles. The access time is even longer if a DRAM page boundary is crossed, and extremely long if the memory area has been swapped to disk. I cannot give exact access times here because they depend on the hardware configuration, and the figures keep changing thanks to fast technological development.

However, it is obvious from these numbers that caching of code and data is extremely important for performance. If the code has many cache misses, and each cache miss costs more than a hundred clock cycles, then this can be a very serious bottleneck for the performance.

More advice on how to organize data for optimal caching is given in manual 1: "Optimizing software in C++". Processor-specific details are given in manual 3: "The microarchitecture of Intel and AMD CPU's" and in Intel's and AMD's software optimization manuals.

11.1 How caching works

A cache is a means of temporary storage that is closer to the microprocessor than the main memory. Data and code that is used often, or that is expected to be used soon, is stored in a cache so that it is accessed faster. Different microprocessors have one, two or three levels of cache. The level-1 cache is close to the microprocessor kernel and is accessed in


just a few clock cycles. A bigger level-2 cache is placed on the same chip or at least in the same housing.

The level-1 data cache in the P4 processor, for example, can contain 8 kb of data. It is organized as 128 lines of 64 bytes each. The cache is 4-way set-associative. This means that the data from a particular memory address cannot be assigned to an arbitrary cache line, but only to one of four possible lines. The line length in this example is 2^6 = 64, so each line must be aligned to an address divisible by 64. The least significant 6 bits, i.e. bits 0 - 5, of the memory address are used for addressing a byte within the 64 bytes of the cache line. As each set comprises 4 lines, there will be 128 / 4 = 32 = 2^5 different sets. The next five bits, i.e. bits 6 - 10, of a memory address will therefore select between these 32 sets. The remaining bits can have any value. The conclusion of this mathematical exercise is that if bits 6 - 10 of two memory addresses are equal, then they will be cached in the same set of cache lines. The 64-byte memory blocks that contend for the same set of cache lines are spaced 2^11 = 2048 bytes apart. No more than 4 such addresses can be cached at the same time.

Let me illustrate this by the following piece of code, where EDI holds an address divisible by 64:

; Example 11.1. Level-1 cache contention
again:  mov  eax, [edi]
        mov  ebx, [edi + 0804h]
        mov  ecx, [edi + 1000h]
        mov  edx, [edi + 5008h]
        mov  esi, [edi + 583ch]
        sub  ebp, 1
        jnz  again

The five addresses used here all have the same set-value because the differences between the addresses, with the lower 6 bits truncated, are multiples of 2048 = 800H. This loop will perform poorly because at the time we read ESI, there is no free cache line with the proper set-value, so the processor takes the least recently used of the four possible cache lines - that is the one which was used for EAX - and fills it with the data from [EDI+5800H] to [EDI+583FH] and reads ESI. Next, when reading EAX, we find that the cache line that held the value for EAX has now been discarded, so the processor takes the least recently used line, which is the one holding the EBX value, and so on. We have nothing but cache misses, but if the fifth line is changed to MOV ESI,[EDI + 5840H] then we have crossed a 64-byte boundary, so that we do not have the same set-value as in the first four lines, and there will be no problem assigning a cache line to each of the five addresses.

The cache sizes, cache line sizes, and set associativity on different microprocessors are listed in manual 4: "Instruction tables". The performance penalty for level-1 cache line contention can be quite considerable on older microprocessors, but on newer processors such as the P4 we lose only a few clock cycles because the data are likely to be prefetched from the level-2 cache, which is accessed quite fast through a full-speed 256-bit data bus.
The improved efficiency of the level-2 cache in the P4 compensates for the smaller level-1 data cache.

The cache lines are always aligned to physical addresses divisible by the cache line size (in the above example 64). When we have read a byte at an address divisible by 64, the next 63 bytes will be cached as well, and can be read or written at almost no extra cost. We can take advantage of this by arranging data items that are used near each other together into aligned blocks of 64 bytes of memory.

The level-1 code cache works in the same way as the data cache, except on processors with a trace cache (see below). The level-2 cache is usually shared between code and data.


11.2 Trace cache

The P4 and P4E processors have a trace cache instead of a code cache. The trace cache stores the code after it has been translated to micro-operations (µops), while a normal code cache stores the raw code without translation. The trace cache makes the design more RISC-like and removes the bottleneck of instruction decoding. Another difference between a code cache and a trace cache is that the trace cache attempts to store the code in the order in which it is executed rather than the order in which it occurs in memory. This reduces the number of jumps in the trace cache.

The main disadvantage of a trace cache is that the code takes more space in a trace cache than in a code cache, because µops take more space than CISC code and because the same code may occur in several traces. The trace cache is therefore most advantageous when the critical hot spot of the program is a small innermost loop that fits into the trace cache. The trace cache is not an advantage when the critical part of the code is distributed over so many functions and loops that it can fill up the trace cache.

The newest processors do not have a trace cache.

11.3 Alignment of data

All data in RAM should be aligned at addresses divisible by a power of 2 according to this scheme:

Operand size               Alignment
 1 (byte)                      1
 2 (word)                      2
 4 (dword)                     4
 6 (fword)                     8
 8 (qword)                     8
10 (tbyte)                    16
16 (oword, xmmword)           16
32 (ymmword)                  32

Table 11.1. Preferred data alignment

The following example illustrates alignment of static data:

; Example 11.2, alignment of static data
.data
A       DQ ?, ?              ; A is aligned by 16
B       DB 32 DUP (?)
C       DD ?
D       DW ?
ALIGN   16                   ; E must be aligned by 16
E       DQ ?, ?

.code
        movdqa xmm0, [A]
        movdqa [E], xmm0

In the above example, A, B and C all start at addresses divisible by 16. D starts at an address divisible by 4, which is more than sufficient because it only needs to be aligned by 2. An alignment directive must be inserted before E because the address after D is not divisible by 16 as required by the MOVDQA instruction. Alternatively, E could be placed after A or B to make it aligned.

All microprocessors have a penalty of several clock cycles when accessing misaligned data that cross a cache line boundary. AMD K8 and earlier processors also have a penalty when


misaligned data cross an 8-byte boundary, and some early Intel processors (P1, PMMX) have a penalty for misaligned data crossing a 4-byte boundary. Most processors have a penalty when reading a misaligned operand shortly after writing to the same operand.

Most XMM instructions that read or write 16-byte memory operands require that the operand is aligned by 16. Instructions that accept unaligned 16-byte operands can be quite inefficient.

Aligning data stored on the stack can be done by rounding down the value of the stack pointer. The old value of the stack pointer must of course be restored before returning. A function with aligned local data may look like this:

; Example 11.3a, Explicit alignment of stack (32-bit Windows)
_FuncWithAlign PROC NEAR
        push ebp                   ; Prolog code
        mov  ebp, esp              ; Save value of stack pointer
        sub  esp, LocalSpace       ; Allocate space for local data
        and  esp, 0FFFFFFF0H       ; (= -16) Align ESP by 16
        mov  eax, [ebp+8]          ; Function parameter = array
        movdqu xmm0, [eax]         ; Load from unaligned array
        movdqa [esp], xmm0         ; Store in aligned space
        call SomeOtherFunction     ; Call some other function
        ...
        mov  esp, ebp              ; Epilog code. Restore esp
        pop  ebp                   ; Restore ebp
        ret
_FuncWithAlign ENDP

This function uses EBP to address function parameters, and ESP to address aligned local data. ESP is rounded down to the nearest value divisible by 16 simply by AND'ing it with -16. You can align the stack by any power of 2 by AND'ing the stack pointer with the negative of that value.

All 64-bit operating systems, and some 32-bit operating systems (Mac OS, optional in Linux), keep the stack aligned by 16 at all CALL instructions.
This eliminates the need for the AND instruction and the frame pointer. It is necessary to propagate this alignment from one CALL instruction to the next by proper adjustment of the stack pointer in each function:

; Example 11.3b, Propagate stack alignment (32-bit Linux)
FuncWithAlign PROC NEAR
        sub  esp, 28               ; Allocate space for local data
        mov  eax, [esp+32]         ; Function parameter = array
        movdqu xmm0, [eax]         ; Load from unaligned array
        movdqa [esp], xmm0         ; Store in aligned space
        call SomeOtherFunction     ; This call must be aligned
        ...
        ret
FuncWithAlign ENDP

In example 11.3b we are relying on the fact that the stack pointer is aligned by 16 before the call to FuncWithAlign. The CALL FuncWithAlign instruction (not shown here) has pushed the return address on the stack, whereby 4 is subtracted from the stack pointer. We have to subtract another 12 from the stack pointer before it is aligned by 16 again. The 12 is not enough for the local variable that needs 16 bytes, so we have to subtract 28 to keep the stack pointer aligned by 16. 4 for the return address + 28 = 32, which is divisible by 16. Remember to include any PUSH instructions in the calculation. If, for example, there had been one PUSH instruction in the function prolog, then we would subtract 24 from ESP to keep it aligned by 16. Example 11.3b needs to align the stack for two reasons: the MOVDQA instruction needs an aligned operand, and the CALL SomeOtherFunction needs to be aligned in order to propagate the correct stack alignment to SomeOtherFunction.


The principle is the same in 64-bit mode:

; Example 11.3c, Propagate stack alignment (64-bit Linux)
FuncWithAlign PROC
        sub  rsp, 24               ; Allocate space for local data
        mov  rax, rdi              ; Function parameter rdi = array
        movdqu xmm0, [rax]         ; Load from unaligned array
        movdqa [rsp], xmm0         ; Store in aligned space
        call SomeOtherFunction     ; This call must be aligned
        ...
        ret
FuncWithAlign ENDP

Here, the return address takes 8 bytes and we subtract 24 from RSP, so that the total amount subtracted is 8 + 24 = 32, which is divisible by 16. Every PUSH instruction subtracts 8 from RSP in 64-bit mode.

Alignment issues are also important when mixing C++ and assembly language. Consider this C++ structure:

// Example 11.4a, C++ structure
struct abcd {
    unsigned char a;    // takes 1 byte storage
    int b;              // 4 bytes storage
    short int c;        // 2 bytes storage
    double d;           // 8 bytes storage
} x;

Most compilers (but not all) will insert three empty bytes between a and b, and six empty bytes between c and d in order to give each element its natural alignment. You may change the structure definition to:

// Example 11.4b, C++ structure
struct abcd {
    double d;           // 8 bytes storage
    int b;              // 4 bytes storage
    short int c;        // 2 bytes storage
    unsigned char a;    // 1 byte storage
    char unused[1];     // fill up to 16 bytes
} x;

This has several advantages: the implementation is identical on compilers with and without automatic alignment, the structure is easily translated to assembly, all members are properly aligned, and there are fewer unused bytes.
The extra unused character at the end ensures that all elements in an array of structures are properly aligned.

See page 153 for how to move unaligned blocks of data efficiently.

11.4 Alignment of code

Most microprocessors fetch code in aligned 16-byte or 32-byte blocks. If an important subroutine entry or jump label happens to be near the end of a 16-byte block, then the microprocessor will only get a few useful bytes of code when fetching that block of code. It may have to fetch the next 16 bytes too before it can decode the first instructions after the label. This can be avoided by aligning important subroutine entries and loop entries by 16. Aligning by 8 will assure that at least 8 bytes of code can be loaded with the first instruction fetch, which may be sufficient if the instructions are small. We may align subroutine entries by the cache line size (typically 64 bytes) if the subroutine is part of a critical hot spot and the preceding code is unlikely to be executed in the same context.


A disadvantage of code alignment is that some cache space is lost to empty spaces before the aligned code entries.

In most cases, the effect of code alignment is minimal. So my recommendation is to align code only in the most critical cases, like critical subroutines and critical innermost loops.

Aligning a subroutine entry is as simple as putting as many NOP's as needed before the subroutine entry to make the address divisible by 8, 16, 32 or 64, as desired. The assembler does this with the ALIGN directive. The NOP's that are inserted will not slow down the performance because they are never executed.

It is more problematic to align a loop entry because the preceding code is also executed. It may require up to 15 NOP's to align a loop entry by 16. These NOP's will be executed before the loop is entered and this will cost processor time. It is more efficient to use longer instructions that do nothing than to use a lot of single-byte NOP's. The best modern assemblers will do just that and use instructions like MOV EAX,EAX and LEA EBX,[EBX+00000000H] to fill the space before an ALIGN nn statement. The LEA instruction is particularly flexible. It is possible to give an instruction like LEA EBX,[EBX] any length from 2 to 8 by variously adding a SIB byte, a segment prefix, and an offset of one or four bytes of zero. Don't use a two-byte offset in 32-bit mode as this will slow down decoding. And don't use more than one prefix because this will slow down decoding on older Intel processors.

Using an instruction like LEA EBX,[EBX+0] as a NOP has the disadvantage that it has a false dependence on EBX.
If this causes unwanted dependences, then you may alternatively use the multi-byte NOP instruction, which has the code 0FH, 1FH, mod-000-r/m. This code can have any combination of offset, SIB byte, and prefix without doing anything. This multi-byte NOP instruction is available in Intel processors with conditional move or SSE and in AMD processors K7 and later.

Another useful filler is FXCH ST(0) because this instruction is resolved by register renaming without going to any execution port on most modern Intel processors. Remember that FXCH cannot be used in code that uses MMX registers.

A more efficient way of aligning a loop entry is to code the preceding instructions in ways that are longer than necessary. In most cases, this will not add to the execution time, but possibly to the instruction fetch time. See page 76 for details on how to code instructions in longer versions.

The most efficient way to align an innermost loop is to move the preceding subroutine entry. The following example shows how to do this:

; Example 11.5, Aligning loop entry
ALIGN 16
X1 = 9                            ; Replace value with whatever X2 is
DB (-X1 AND 0FH) DUP (90H)        ; Insert calculated number of NOP's
INNERFUNCTION PROC NEAR           ; This address will be adjusted
        mov  eax, [esp+4]
        mov  ecx, 10000
INNERLOOP:                        ; Loop entry will be aligned by 16
X2 = INNERLOOP - INNERFUNCTION    ; This value is needed above
.ERRNZ X1 NE X2                   ; Make error message if X1 != X2
        ; ...
        sub  ecx, 1
        jnz  INNERLOOP
        ret
INNERFUNCTION ENDP


This code looks awkward because most assemblers cannot resolve the forward reference to a difference between two labels in this case. X2 is the distance from INNERFUNCTION to INNERLOOP. X1 must be manually adjusted to the same value as X2 by looking at the assembly listing. The .ERRNZ line will generate an error message from the assembler if X1 and X2 are different. The number of bytes to insert before INNERFUNCTION in order to align INNERLOOP by 16 is ((-X1) modulo 16). The modulo 16 is calculated here by AND'ing with 15. The DB line calculates this value and inserts the appropriate number of NOP's with opcode 90H.

INNERLOOP is aligned here by misaligning INNERFUNCTION. The cost of misaligning INNERFUNCTION is negligible compared to the gain from aligning INNERLOOP, because the latter label is jumped to 10000 times as often.

11.5 Organizing data for improved caching

The caching of data works best if critical data are contained in a small contiguous area of memory. The best place to store critical data is on the stack. The stack space that is allocated by a subroutine is released when the subroutine returns. The same stack space is then reused by the next subroutine that is called. Reusing the same memory area gives the optimal caching. Variables should therefore be stored on the stack rather than in the data segment when possible.

Floating point constants are typically stored in the data segment.
This is a problem because it is difficult to keep the constants used by different subroutines contiguous. An alternative is to store the constants in the code. In 64-bit mode it is possible to load a double precision constant via an integer register to avoid using the data segment. Example:

; Example 11.6a. Loading double constant from data segment
.data
C1      DQ SomeConstant

.code
        movsd xmm0, C1

This can be changed to:

; Example 11.6b. Loading double constant from register (64-bit mode)
.code
        mov  rax, SomeConstant
        movq xmm0, rax       ; Some assemblers use 'movd' for this instruction

See page 117 for various methods of generating constants without loading data from memory. This is advantageous if data cache misses are expected, but not if data caching is efficient.

Constant tables are typically stored in the data segment. It may be advantageous to copy such a table from the data segment to the stack outside the innermost loop if this can improve caching inside the loop.

Static variables are variables that are preserved from one function call to the next. Such variables are typically stored in the data segment. It may be a better alternative to encapsulate the function together with its data in a class. The class may be declared in the C++ part of the code even when the member function is coded in assembly.

Data structures that are too large for the data cache should preferably be accessed in a linear, forward way for optimal prefetching and caching. Non-sequential access can cause cache line contentions if the stride is a high power of 2.
Manual 1: "Optimizing software in C++" contains examples of how to avoid access strides that are high powers of 2.


11.6 Organizing code for improved caching

The caching of code works best if the critical part of the code is contained within a contiguous area of memory no bigger than the code cache. Avoid scattering critical subroutines around at random memory addresses. Rarely accessed code such as error handling routines should be kept separate from the critical hot spot code.

It may be useful to split the code segment into different segments for different parts of the code. For example, you may make a hot code segment for the code that is executed most often and a cold code segment for code that is not speed-critical.

Alternatively, you may control the order in which modules are linked, so that modules that are used in the same part of the program are linked at addresses near each other.

Dynamic linking of function libraries (DLLs or shared objects) makes code caching less efficient. Dynamic link libraries are typically loaded at round memory addresses. This can cause cache contentions if the distances between multiple DLLs are divisible by high powers of 2.

11.7 Cache control instructions

Memory writes are more expensive than reads when cache misses occur in a write-back cache. A whole cache line has to be read from memory, modified, and written back in case of a cache miss. This can be avoided by using the non-temporal write instructions MOVNTI, MOVNTQ, MOVNTDQ, MOVNTPD, MOVNTPS.
These instructions should be used when writing to a memory location that is unlikely to be cached and unlikely to be read from again before the would-be cache line is evicted. As a rule of thumb, it can be recommended to use non-temporal writes only when writing a memory block that is bigger than half the size of the largest-level cache.

Explicit data prefetching with the PREFETCH instructions can sometimes improve cache performance, but in most cases the automatic prefetching is sufficient.

12 Loops

The critical hot spot of a CPU-intensive program is almost always a loop. The clock frequency of modern computers is so high that even the most time-consuming instructions, cache misses and inefficient exceptions are finished in a fraction of a microsecond. The delay caused by inefficient code is only noticeable when repeated millions of times. Such high repeat counts are likely to be seen only in the innermost level of a series of nested loops. The things that can be done to improve the performance of loops are discussed in this chapter.

12.1 Minimize loop overhead

The loop overhead is the instructions needed for jumping back to the beginning of the loop and for determining when to exit the loop. Optimizing these instructions is a fairly general technique that can be applied in many situations. Optimizing the loop overhead is not needed, however, if some other bottleneck is limiting the speed.
See page 90ff for a description of possible bottlenecks in a loop.

A typical loop in C++ may look like this:

// Example 12.1a. Typical for-loop in C++


for (int i = 0; i < n; i++) {
    // (loop body)
}

Without optimization, the assembly implementation will look like this:

; Example 12.1b. For-loop, not optimized
        mov  ecx, n          ; Load n
        xor  eax, eax        ; i = 0
LoopTop:
        cmp  eax, ecx        ; i < n
        jge  LoopEnd         ; Exit when i >= n
        ; (loop body)        ; Loop body goes here
        add  eax, 1          ; i++
        jmp  LoopTop         ; Jump back
LoopEnd:

It may be unwise to use the inc instruction for adding 1 to the loop counter. The inc instruction has a problem with writing to only part of the flags register, which makes it less efficient than the add instruction on P4 processors and may cause false dependences on other processors.

The most important problem with the loop in example 12.1b is that there are two jump instructions. We can eliminate one jump from the loop by putting the branch instruction in the end:

; Example 12.1c. For-loop with branch in the end
        mov  ecx, n          ; Load n
        test ecx, ecx        ; Test n
        jng  LoopEnd         ; Skip if n <= 0
        xor  eax, eax        ; i = 0
LoopTop:
        ; (loop body)        ; Loop body goes here
        add  eax, 1          ; i++
        cmp  eax, ecx        ; i < n
        jl   LoopTop         ; Loop back if i < n
LoopEnd:

Now there is only one jump instruction per iteration of the loop. The extra check before the loop makes sure the loop body is skipped if n is not positive.

A loop can also have its exit condition in the middle of the loop body:

// Example 12.2a. Loop with exit condition in the middle
int i = 0;
while (true) {
    FuncA();               // Upper loop body
    if (++i >= n) break;   // Exit condition here
    FuncB();               // Lower loop body
}

This can be implemented in assembly by reorganizing the loop so that the exit comes in the end and the entry comes in the middle:

; Example 12.2b. Assembly loop with entry in the middle
        xor  eax, eax        ; i = 0
        jmp  LoopEntry       ; Jump into middle of loop
LoopTop:


        call FuncB           ; Lower loop body comes first
LoopEntry:
        call FuncA           ; Upper loop body comes last
        add  eax, 1
        cmp  eax, n
        jl   LoopTop         ; Exit condition in the end

The cmp instruction in examples 12.1c and 12.2b can be eliminated if the counter ends at zero, because we can rely on the instruction that adjusts the counter for setting the zero flag. This can be done by counting down from n to zero rather than counting up from zero to n:

; Example 12.3. Loop with counting down
        mov  ecx, n          ; Load n
        test ecx, ecx        ; Test n
        jng  LoopEnd         ; Skip if n <= 0
LoopTop:
        ; (loop body)        ; Loop body goes here
        sub  ecx, 1          ; n--, sets zero flag
        jnz  LoopTop         ; Loop back if not zero
LoopEnd:

The disadvantage of counting down is that the counter can no longer be used as an array index, because arrays are best accessed in the forward direction. Instead, the index may count up through negative values from -n to zero while the base pointer points to the end of the array, so that the sign flag can be used as exit condition.


A slightly different solution is to multiply n by 4 and count from -4*n to zero:

; Example 12.4c. For-loop with neg. index multiplied by element size
        mov  ecx, n              ; Load n
        shl  ecx, 2              ; n * 4
        jng  LoopEnd             ; Skip if (4*n) <= 0
        lea  esi, [Array + ecx]  ; Point to end of array
        neg  ecx                 ; i = -4*n
LoopTop:
        ; (loop body)            ; Access array element as [esi+ecx]
        add  ecx, 4              ; i += 4, sets sign flag
        js   LoopTop             ; Loop back while i < 0
LoopEnd:


12.2 Induction variables

The following loop calculates a table of sin(x) for x = 0, 0.01, ..., 0.99. Instead of computing x as i * 0.01 from the loop counter, each x value is obtained by adding 0.01 to the previous value:

; Example 12.5c. Sine table with induction variable
        xor  eax, eax        ; i = 0
        fld  M0_01           ; Load constant 0.01
        fldz                 ; x = 0.
LoopTop:
        fld  st(0)           ; Copy x
        fsin                 ; sin(x)
        fstp _Table[eax*8]   ; Table[i] = sin(x)
        fadd st(0), st(1)    ; x += 0.01
        add  eax, 1          ; i++
        cmp  eax, 100        ; i < 100
        jb   LoopTop         ; Loop
        fcompp               ; Discard st(0) and st(1)

There is no need to optimize the loop overhead in this case because the speed is limited by the floating point calculations. Another possible optimization is to use a library function that calculates two or four sine values at a time in an XMM register. Such functions can be found in libraries from Intel and AMD, see manual 1: "Optimizing software in C++".

The method of calculating x in example 12.5c by adding 0.01 to the previous value rather than multiplying i by 0.01 is commonly known as using an induction variable. Induction variables are useful whenever it is easier to calculate some value in a loop from the previous value than to calculate it from the loop counter. An induction variable can be integer or floating point. The most common use of induction variables is for calculating array addresses, as in example 12.4c, but induction variables can also be used for more complex expressions.
Any function that is an n'th degree polynomial of the loop counter can be calculated with just n additions and no multiplications by the use of n induction variables. See manual 1: "Optimizing software in C++" for an example.

The calculation of a function of the loop counter by the induction variable method makes a loop-carried dependency chain. If this chain is too long then it may be advantageous to calculate each value from a value that is two or more iterations back, as in the Taylor expansion example on page 104.

12.3 Move loop-invariant code

The calculation of any expression that doesn't change inside the loop should be moved out of the loop.

The same applies to if-else branches with a condition that doesn't change inside the loop. Such a branch can be avoided by making two loops, one for each branch, and making a branch that chooses between the two loops.

12.4 Find the bottlenecks

There are a number of possible bottlenecks that can limit the performance of a loop. The most likely bottlenecks are:

• Cache misses and cache contentions
• Loop-carried dependency chains
• Instruction fetching
• Instruction decoding
• Instruction retirement


• Register read stalls
• Execution port throughput
• Execution unit throughput
• Suboptimal reordering and scheduling of µops
• Branch mispredictions
• Floating point exceptions and denormal operands

If one particular bottleneck is limiting the performance then it doesn't help to optimize anything else. It is therefore very important to analyze the loop carefully in order to identify which bottleneck is the limiting factor. Only when the narrowest bottleneck has successfully been removed does it make sense to look at the next bottleneck. The various bottlenecks are discussed in the following sections. All these details are processor-specific. See manual 3: "The microarchitecture of Intel and AMD CPU's" for explanation of the processor-specific details mentioned below.

Sometimes a lot of experimentation is needed in order to find and fix the limiting bottleneck. It is important to remember that a solution found by experimentation is CPU-specific and unlikely to be optimal on CPUs with a different microarchitecture.

12.5 Instruction fetch, decoding and retirement in a loop

The details about how to optimize instruction fetching, decoding, retirement, etc. are processor-specific, as mentioned on page 61.

If code fetching is a bottleneck then it is necessary to align the loop entry by 16 and reduce instruction sizes in order to minimize the number of 16-byte boundaries in the loop.

If instruction decoding is a bottleneck then it is necessary to observe the CPU-specific rules about decoding patterns.
Avoid complex instructions such as LOOP, JECXZ, LODS, STOS, etc.

Register read stalls can occur in most Intel processors. If register read stalls are likely to occur then it can be necessary to reorder instructions or to refresh registers that are read multiple times but not written to inside the loop.

Jumps and calls inside the loop should be avoided because they delay code fetching. Subroutines that are called inside the loop should be inlined if possible.

Branches inside the loop should be avoided if possible because they interfere with the prediction of the loop exit branch. However, branches should not be replaced by conditional moves if this increases the length of a loop-carried dependency chain.

If instruction retirement is a bottleneck then it may be preferred to make the total number of µops inside the loop a multiple of the retirement rate (4 for Core2, 3 for all other processors). Some experimentation is needed in this case.

12.6 Distribute µops evenly between execution units

Manual 4: "Instruction tables" contains tables of how many µops each instruction generates and which execution port each µop goes to. This information is CPU-specific, of course. It is necessary to calculate how many µops the loop generates in total and how many of these µops go to each execution port and each execution unit.


The time it takes to retire all instructions in the loop is the total number of µops divided by the retirement rate. The retirement rate is 4 µops per clock cycle for Core2 processors and 3 for all other processors. The calculated retirement time is the minimum execution time for the loop. This value is useful as a norm against which other potential bottlenecks can be compared.

The throughput for an execution port is 1 µop per clock cycle on most Intel processors. The load on a particular execution port is calculated as the number of µops that go to this port divided by the throughput of the port. If this value exceeds the retirement time as calculated above, then this particular execution port is likely to be a bottleneck. AMD processors do not have execution ports, but they have three pipelines with similar throughputs.

There may be more than one execution unit on each execution port on Intel processors. Most execution units have the same throughput as the execution port. If this is the case then the execution unit cannot be a narrower bottleneck than the execution port. But an execution unit can be a bottleneck in the following situations: (1) if the throughput of the execution unit is lower than the throughput of the execution port, e.g. for multiplication and division; (2) if the execution unit is accessible through more than one execution port, e.g. floating point addition on PM; and (3) on AMD processors that have no execution ports.

The load on a particular execution unit is calculated as the total number of µops going to that execution unit multiplied by the reciprocal throughput for that unit.
If this value exceeds the retirement time as calculated above, then this particular execution unit is likely to be a bottleneck.

12.7 An example of analysis for bottlenecks on PM

The way to do these calculations is illustrated in the following example, which is the so-called DAXPY algorithm used in linear algebra:

// Example 12.6a. C++ code for DAXPY algorithm
int i;  const int n = 100;
double X[n];  double Y[n];  double DA;
for (i = 0; i < n; i++)  Y[i] = Y[i] - DA * X[i];

The following implementation is for a processor with the SSE2 instruction set in 32-bit mode, assuming that X and Y are aligned by 16:

; Example 12.6b. DAXPY algorithm, 32-bit mode
n = 100                      ; Define constant n (even and positive)
        mov    ecx, n * 8    ; Load n * sizeof(double)
        xor    eax, eax      ; i = 0
        lea    esi, X        ; X must be aligned by 16
        lea    edi, Y        ; Y must be aligned by 16
        movsd  xmm2, DA      ; Load DA
        shufpd xmm2, xmm2, 0 ; Get DA into both qwords of xmm2
; This loop does 2 DAXPY calculations per iteration, using vectors:
L1:     movapd xmm1, [esi+eax]   ; X[i], X[i+1]
        mulpd  xmm1, xmm2        ; X[i] * DA, X[i+1] * DA
        movapd xmm0, [edi+eax]   ; Y[i], Y[i+1]
        subpd  xmm0, xmm1        ; Y[i]-X[i]*DA, Y[i+1]-X[i+1]*DA
        movapd [edi+eax], xmm0   ; Store result
        add    eax, 16           ; Add size of two elements to index
        cmp    eax, ecx          ; Compare with n*8
        jl     L1                ; Loop back

Now let's analyze this code for bottlenecks on a Pentium M processor, assuming that there are no cache misses. The CPU-specific details that I am referring to are explained in manual 3: "The microarchitecture of Intel and AMD CPU's".


We are only interested in the loop, i.e. the code after L1. We need to list the µop breakdown for all instructions in the loop, using the table in manual 4: "Instruction tables". The list looks as follows:

Instruction             µops    µops for each execution port      execution units
                        fused   p0   p1   p0/1  p2   p3   p4      FADD   FMUL
movapd xmm1,[esi+eax]     2                      2
mulpd  xmm1, xmm2         2      2                                        2
movapd xmm0,[edi+eax]     2                      2
subpd  xmm0, xmm1         2                2                       2
movapd [edi+eax],xmm0     2                           2    2
add    eax, 16            1                1
cmp    eax, ecx           1                1
jl     L1                 1           1
Total                    13      2    1    4     4    2    2       2      2

Table 12.1. Total µops for DAXPY loop on Pentium M

The total number of µops going to all the ports is 15. The 4 µops from movapd [edi+eax],xmm0 are fused into 2 µops, so the total number of µops in the fused domain is 13. The total retirement time is 13 µops / 3 µops per clock cycle = 4.33 clock cycles.

Now we will calculate the number of µops going to each port. There are 2 µops for port 0, 1 for port 1, and 4 which can go to either port 0 or port 1. This totals 7 µops for port 0 and port 1 combined, so that each of these ports will get 3.5 µops on average per iteration. Each port has a throughput of 1 µop per clock cycle, so the total load corresponds to 3.5 clock cycles. This is less than the 4.33 clock cycles for retirement, so port 0 and 1 will not be bottlenecks. Port 2 receives 4 µops per iteration, so it is close to being saturated. Port 3 and 4 receive only 2 µops each.

The next analysis concerns the execution units. The FADD unit has a throughput of 1 µop per clock and it can receive µops from both port 0 and 1. The total load is 2 µops / 1 µop per clock = 2 clocks. Thus the FADD unit is far from saturated. The FMUL unit has a throughput of 0.5 µops per clock and it can receive µops only from port 0. The total load is 2 µops / 0.5 µops per clock = 4 clocks.
Thus the FMUL unit is close to being saturated.

There is a dependency chain in the loop. The latencies are: 2 for memory read, 5 for multiplication, 3 for subtraction, and 3 for memory write, which totals 13 clock cycles. This is three times as much as the retirement time, but it is not a loop-carried dependency chain because the results from each iteration are saved to memory and not reused in the next iteration. The out-of-order execution mechanism and pipelining make it possible for each calculation to start before the preceding calculation is finished. The only loop-carried dependency chain is add eax,16, which has a latency of only 1.

The time needed for instruction fetching can be calculated from the lengths of the instructions. The total length of the eight instructions in the loop is 30 bytes. The processor needs to fetch two 16-byte blocks if the loop entry is aligned by 16, or at most three 16-byte blocks in the worst case. Instruction fetching is therefore not a bottleneck.

Instruction decoding in the PM processor follows the 4-1-1 pattern. The pattern of (fused) µops for each instruction in the loop in example 12.6b is 2-2-2-2-2-1-1-1. This is not optimal, and it will take 6 clock cycles to decode. This is more than the retirement time, so we can conclude that instruction decoding is the bottleneck in example 12.6b. The total execution time is 6 clock cycles per iteration, or 3 clock cycles per calculated Y[i] value.
The decoding can be improved by moving one of the 1-µop instructions and by changing the sign of xmm2


so that movapd xmm0,[edi+eax] and subpd xmm0,xmm1 can be combined into one instruction addpd xmm1,[edi+eax]. We will therefore change the code of the loop as follows:

; Example 12.6c. Loop of DAXPY algorithm with improved decoding
.data
align 16
SignBit DD 0, 80000000H      ; qword with sign bit set
n = 100                      ; Define constant n (even and positive)
.code
        mov    ecx, n * 8    ; Load n * sizeof(double)
        xor    eax, eax      ; i = 0
        lea    esi, X        ; X must be aligned by 16
        lea    edi, Y        ; Y must be aligned by 16
        movsd  xmm2, DA      ; Load DA
        xorpd  xmm2, SignBit ; Change sign
        shufpd xmm2, xmm2, 0 ; Get -DA into both qwords of xmm2
L1:     movapd xmm1, [esi+eax]    ; X[i], X[i+1]
        mulpd  xmm1, xmm2         ; X[i] * (-DA), X[i+1] * (-DA)
        addpd  xmm1, [edi+eax]    ; Y[i]-X[i]*DA, Y[i+1]-X[i+1]*DA
        add    eax, 16            ; Add size of two elements to index
        movapd [edi+eax-16], xmm1 ; Address corrected for changed eax
        cmp    eax, ecx           ; Compare with n*8
        jl     L1                 ; Loop back

The number of µops in the loop is the same as before, but the decode pattern is now 2-2-4-1-2-1-1 and the decode time is reduced from 6 to 4 clock cycles, so that decoding is no longer a bottleneck.

An experimental test shows that the expected improvement is not obtained. The execution time is only reduced from 6.0 to 5.6 clock cycles per iteration. There are three reasons why the execution time is higher than the expected 4.3 clocks per iteration.

The first reason is register read stalls. The reorder buffer can handle no more than three reads per clock cycle from registers that have not been modified recently. Registers esi, edi, ecx and xmm2 are not modified inside the loop. xmm2 counts as two because it is implemented as two 64-bit registers. eax also contributes to the register read stalls because its value has enough time to retire before it is used again.
This causes a register read stall in two out of three iterations. The execution time can be reduced to 5.4 by moving the add eax,16 instruction up before the mulpd, so that the reads of registers esi, xmm2 and edi are spaced further apart and therefore prevented from going into the reorder buffer in the same clock cycle.

The second reason is retirement. The number of fused µops in the loop is not divisible by 3. Therefore, the taken jump to L1 will not always go into the first slot of the retirement station, which it has to. This can sometimes cost a clock cycle.

The third reason is that the two µops for addpd may be issued in the same clock cycle through port 0 and port 1, respectively, though there is only one execution unit for floating point addition. This is the consequence of a bad design, as explained in manual 3: "The microarchitecture of Intel and AMD CPU's".

A solution which happens to work better is to get rid of the cmp instruction by using a negative index from the end of the arrays:

; Example 12.6d. Loop of DAXPY algorithm with negative indexes
.data
align 16


SignBit DD 0, 80000000H       ; qword with sign bit set
n = 100                       ; Define constant n (even and positive)
.code
        mov    eax, -n * 8    ; Index = -n * sizeof(double)
        lea    esi, X + 8 * n ; Point to end of array X (aligned)
        lea    edi, Y + 8 * n ; Point to end of array Y (aligned)
        movsd  xmm2, DA       ; Load DA
        xorpd  xmm2, SignBit  ; Change sign
        shufpd xmm2, xmm2, 0  ; Get -DA into both qwords of xmm2
L1:     movapd xmm1, [esi+eax]   ; X[i], X[i+1]
        mulpd  xmm1, xmm2        ; X[i] * (-DA), X[i+1] * (-DA)
        addpd  xmm1, [edi+eax]   ; Y[i]-X[i]*DA, Y[i+1]-X[i+1]*DA
        movapd [edi+eax], xmm1   ; Store result
        add    eax, 16           ; Add size of two elements to index
        js     L1                ; Loop back

This removes one µop from the loop. My measurements show an execution time for example 12.6d of 5.0 clock cycles per iteration on a PM processor. The theoretical minimum is 4. The register read stalls have disappeared because eax now has less time to retire before it is used again. The retirement is also improved because the number of fused µops in the loop is now 12, which is divisible by the retirement rate of 3. The problem with the floating point addition µops clashing remains, and this is responsible for the extra clock cycle. This problem can only be targeted by experimentation. I found that the optimal order of the instructions has the add instruction immediately after the mulpd:

; Example 12.6e. Loop of DAXPY.
; Optimal solution for PM
.data
align 16
SignBit DD 0, 80000000H       ; qword with sign bit set
n = 100                       ; Define constant n (even and positive)
.code
        mov    eax, -n * 8    ; Index = -n * sizeof(double)
        lea    esi, X + 8 * n ; Point to end of array X (aligned)
        lea    edi, Y + 8 * n ; Point to end of array Y (aligned)
        movsd  xmm2, DA       ; Load DA
        xorpd  xmm2, SignBit  ; Change sign
        shufpd xmm2, xmm2, 0  ; Get -DA into both qwords of xmm2
L1:     movapd xmm1, [esi+eax]
        mulpd  xmm1, xmm2
        add    eax, 16            ; Optimal position of add instruction
        addpd  xmm1, [edi+eax-16] ; Address corrected for changed eax
        movapd [edi+eax-16], xmm1 ; Address corrected for changed eax
        js     L1

The execution time is now reduced to 4.0 clock cycles per iteration, which is the theoretical minimum. An analysis of the bottlenecks in the loop of example 12.6e gives the following results: The decoding time is 4 clock cycles. The retirement time is 12 / 3 = 4 clock cycles. Port 0 and 1 are used at 75% of their capacity. Port 2 is used 100%. Port 3 and 4 are used 50%. The FMUL execution unit is used at 100% of its maximum throughput. The FADD unit is used 50%. The conclusion is that the speed is limited by four equally narrow bottlenecks and that no further improvement is possible.

The fact that we found an instruction order that removes the floating point µop clashing problem is sheer luck. In more complicated cases there may not exist a solution that eliminates this problem.


12.8 Same example on Core2

The same loop will run more efficiently on a Core2 thanks to better µop fusion and more powerful execution units. Examples 12.6c and 12.6d are good candidates for running on a Core2. The jl L1 instruction in example 12.6c should be changed to jb L1 in order to enable macro-op fusion in 32-bit mode. This change is possible because the loop counter cannot be negative in example 12.6c.

We will now analyze the resource use of the loop in example 12.6d for a Core2 processor. The resource use of each instruction in the loop is listed in table 12.2 below.

Instruction             Instruction  µops fused  µops for each execution port
                        length       domain      p0    p1    p2    p3    p4    p5
movapd xmm1, [esi+eax]     5            1                          1
mulpd  xmm1, xmm2          4            1         1
addpd  xmm1, [edi+eax]     5            1               1          1
movapd [edi+eax], xmm1     5            1                                1     1
add    eax, 16             3            1         x     x     x
js     L1                  2            1                     1
Total                     24            6        1.33  1.33  1.33  2     1     1

Table 12.2. Total µops for DAXPY loop on Core2 (example 12.6d).

Predecoding is not a problem here because the total size of the loop is less than 64 bytes. There is no need to align the loop because the size is less than 50 bytes. If the loop had been bigger than 64 bytes then we would notice that the predecoder can handle only the first three instructions in one clock cycle because it cannot handle more than 16 bytes.

There are no restrictions on the decoders because all the instructions generate only a single µop each. The decoders can handle four instructions per clock cycle, so the decoding time will be 6/4 = 1.5 clock cycles per iteration.

Port 0, 1 and 2 each have one µop that can go nowhere else. In addition, the add eax,16 instruction can go to any one of these three ports.
If these instructions are distributed evenly between ports 0, 1 and 2, then these ports will be busy for 1.33 clock cycles per iteration on average.

Port 3 receives two µops per iteration for reading X[i] and Y[i]. We can therefore conclude that the memory reads on port 3 are the bottleneck for the DAXPY loop, and that the execution time is 2 clock cycles per iteration.

Example 12.6c has one instruction more than example 12.6d, which we have just analyzed. The extra instruction is cmp eax,ecx, which can go to any of the ports 0, 1 and 2. This will increase the pressure on these ports to 1.67, which is still less than the 2 µops on port 3. The extra µop can be eliminated by macro-op fusion in 32-bit mode, but not in 64-bit mode.

The floating point addition and multiplication units all have a throughput of 1. These units are therefore not bottlenecks in the DAXPY loop.

We need to consider the possibility of register read stalls in the loop. Example 12.6c has four registers that are read, but not modified, inside the loop. These are the registers ESI, EDI, XMM2


and ECX. This could cause a register read stall if all of these registers were read within four consecutive µops. But the µops that read these registers are spaced so far apart that no more than three of them can possibly be read in four consecutive µops. Example 12.6d has one register less to read because ECX is not used. Register read stalls are therefore unlikely to cause trouble in these loops.

The conclusion is that the Core2 processor runs the DAXPY loop in half as many clock cycles as the PM. The problem with µop clashing that made the optimization particularly difficult on the PM has been eliminated in the Core2 design.

12.9 Loop unrolling

A loop that does n repetitions can be replaced by a loop that repeats n / r times and does r calculations for each repetition, where r is the unroll factor. n should preferably be divisible by r.

Loop unrolling can be used for the following purposes:

• Reducing loop overhead. The loop overhead per calculation is divided by the loop unroll factor r. This is only useful if the loop overhead contributes significantly to the calculation time. There is no reason to unroll a loop if some other bottleneck limits the execution speed. For example, the loop in example 12.6e above cannot benefit from further unrolling.

• Vectorization. A loop must be rolled out by r or a multiple of r in order to use vector registers with r elements. The loop in example 12.6e is rolled out by 2 in order to use vectors of two double-precision numbers. If we had used single-precision numbers then we would have rolled out the loop by 4 and used vectors of 4 elements.

• Improve branch prediction.
The prediction of the loop exit branch can be improvedby unroll<strong>in</strong>g the loop so much that the repeat count n / r does not exceed themaximum repeat count that can be predicted on a specific CPU.• Improve cach<strong>in</strong>g. If the loop suffers from many data cache misses or cachecontentions then it may be advantageous to schedule memory reads and writes <strong>in</strong>the way that is optimal for a specific processor. See the optimization manual fromthe microprocessor vendor for details.• Elim<strong>in</strong>ate <strong>in</strong>teger divisions. If the loop conta<strong>in</strong>s an expression where the loop counteri is divided by an <strong>in</strong>teger r or the modulo of i by r is calculated, then the <strong>in</strong>tegerdivision can be avoided by unroll<strong>in</strong>g the loop by r.• Elim<strong>in</strong>ate branch <strong>in</strong>side loop. If there is a branch or a switch statement <strong>in</strong>side theloop with a repetitive pattern of period r then this can be elim<strong>in</strong>ated by unroll<strong>in</strong>g theloop by r. For example, if an if-else branch goes either way every second time thenthis branch can be elim<strong>in</strong>ated by roll<strong>in</strong>g out by 2.• Break loop-carried dependency cha<strong>in</strong>. A loop-carried dependency cha<strong>in</strong> can <strong>in</strong> somecases be broken up by us<strong>in</strong>g multiple accumulators. The unroll factor r is equal tothe number of accumulators. See example 9.3b page 63.• Reduce dependence of <strong>in</strong>duction variable. 
If the latency of calculat<strong>in</strong>g an <strong>in</strong>ductionvariable from the value <strong>in</strong> the previous iteration is so long that it becomes abottleneck then it may be possible to solve this problem by unroll<strong>in</strong>g by r andcalculate each value of the <strong>in</strong>duction variable from the value that is r places beh<strong>in</strong>d<strong>in</strong> the sequence.97


• Improving µop retirement. If the number of µops in the loop is not divisible by the retirement rate, and retirement is inefficient for this reason, then this problem may in some cases be solved by loop unrolling.

• Parallel operations on microprocessors without out-of-order capabilities. The old P1 and PMMX processors cannot do calculations out of order, but they can execute two instructions in parallel. Rolling out by 2 can facilitate this.

• Complete unrolling. A loop is completely unrolled when r = n. This eliminates the loop overhead completely. Every expression that is a function of the loop counter can be replaced by constants. Every branch that depends only on the loop counter can be eliminated. See page 106 for examples.

There is a problem with loop unrolling when the repeat count n is not divisible by the unroll factor r. There will be a remainder of n modulo r extra calculations that are not done inside the loop. These extra calculations have to be done either before or after the main loop. Getting the extra calculations right can be somewhat tricky if we are using a negative index as in examples 12.6d and e. The following example shows the DAXPY algorithm again, this time with single precision and unrolled by 4. In this example n is a variable which may or may not be divisible by 4. The arrays X and Y must be aligned by 16. (The optimization that was specific to the PM processor has been omitted for the sake of clarity.)

; Example 12.7. Unrolled loop of DAXPY, single precision
.data
align 16
SignBitS DD 80000000H             ; dword with sign bit set
.code
        mov     eax, n            ; Number of calculations, n
        sub     eax, 4            ; n - 4
        lea     esi, [X + eax*4]  ; Point to X[n-4]
        lea     edi, [Y + eax*4]  ; Point to Y[n-4]
        movss   xmm2, DA          ; Load DA
        xorps   xmm2, SignBitS    ; Change sign
        shufps  xmm2, xmm2, 0     ; Get -DA into all four dwords of xmm2
        neg     eax               ; -(n-4)
        jg      L2                ; Skip main loop if n < 4
L1:     ; Main loop rolled out by 4
        movaps  xmm1, [esi+eax*4] ; Load 4 values from X
        mulps   xmm1, xmm2        ; Multiply with -DA
        addps   xmm1, [edi+eax*4] ; Add 4 values from Y
        movaps  [edi+eax*4], xmm1 ; Store 4 results in Y
        add     eax, 4            ; i += 4
        jle     L1                ; Loop as long as eax <= 0
L2:     ; Check for remaining calculations
        sub     eax, 4            ; = -(n mod 4)
        jns     L4                ; Skip if no remainder
L3:     ; Loop for remainder, one calculation at a time
        movss   xmm1, [esi+eax*4+16] ; Load one value from X
        mulss   xmm1, xmm2           ; Multiply with -DA
        addss   xmm1, [edi+eax*4+16] ; Add one value from Y
        movss   [edi+eax*4+16], xmm1 ; Store one result in Y
        inc     eax               ; i += 1
        js      L3                ; Loop for remainder
L4:
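The main-loop-plus-remainder structure of example 12.7 can be sketched in portable C++ (the function name and signature below are illustrative, not from the original; a vectorizing compiler can map the unrolled body to one packed instruction):

```cpp
#include <cstddef>

// DAXPY: Y[i] = Y[i] - DA * X[i], unrolled by 4 plus a scalar
// remainder loop for the n mod 4 leftover elements.
void daxpy_unrolled(const float* x, float* y, float da, size_t n) {
    size_t i = 0;
    // Main loop: four calculations per iteration.
    for (; i + 4 <= n; i += 4) {
        y[i]     -= da * x[i];
        y[i + 1] -= da * x[i + 1];
        y[i + 2] -= da * x[i + 2];
        y[i + 3] -= da * x[i + 3];
    }
    // Remainder loop: handles n mod 4 extra calculations.
    for (; i < n; i++) {
        y[i] -= da * x[i];
    }
}
```

As in the assembly version, the remainder loop adds branches and code size, which is why the next paragraph's alternative of padding the arrays can be attractive.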


An alternative solution for an unrolled loop that does calculations on arrays is to extend the arrays with up to r-1 unused spaces and round the repeat count n up to the nearest multiple of the unroll factor r. This eliminates the need for calculating the remainder (n mod r) and for the extra loop for the remaining calculations. The unused array elements must be initialized to zero or some other valid floating point value in order to avoid denormal numbers, NAN, overflow, underflow, or any other condition that can slow down the floating point calculations. If the arrays are of integer type, then the only condition you have to avoid is division by zero.

Loop unrolling should only be used when there is a reason to do so and a significant gain in speed can be obtained. Excessive loop unrolling should be avoided. The disadvantages of loop unrolling are:

• The code becomes bigger and takes more space in the code cache. This can cause code cache misses that cost more than what is gained by the unrolling. Note that the code cache misses are not detected when the loop is tested in isolation.

• The Core2 processor performs much better on loops that are no bigger than 64 bytes of code. Loop unrolling will decrease performance on the Core2 if it makes the loop bigger than 64 bytes. (See manual 3: "The microarchitecture of Intel and AMD CPU's".)

• The need to do extra calculations outside the unrolled loop in case n is not divisible by r makes the code more complicated and clumsy and increases the number of branches.

• The unrolled loop may need more registers, e.g. for multiple accumulators.

12.10 Optimize caching

Memory access is likely to take more time than anything else in a loop that accesses uncached memory. Data should be kept contiguous if possible and accessed sequentially, as explained in chapter 11 page 79.

The number of arrays accessed in a loop should not exceed the number of read/write buffers in the microprocessor. One way of reducing the number of data streams is to combine multiple arrays into an array of structures so that the multiple data streams are interleaved into a single stream.

Some microprocessors have advanced data prefetching mechanisms. These mechanisms can detect regularities in the data access pattern, such as accessing data with a particular stride. It is recommended to take advantage of such prefetching mechanisms by keeping the number of different data streams at a minimum and keeping the access stride constant if possible. Automatic data prefetching often works better than explicit data prefetching when the data access pattern is sufficiently regular.

Explicit prefetching of data with the prefetch instructions may be necessary in cases where the data access pattern is too irregular to be predicted by the automatic prefetch mechanisms. A good deal of experimentation is often needed to find the optimal prefetching strategy for a program that accesses data in an irregular manner.

It is possible to put the data prefetching into a separate thread if the system has multiple CPU cores. The Intel C++ compiler has a feature for doing this.

Data access with a stride that is a high power of 2 is likely to cause cache line contentions. This can be avoided by changing the stride or by loop blocking. See the chapter on optimizing memory access in manual 1: "Optimizing software in C++" for details.
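The array-of-structures interleaving mentioned above can be sketched as follows (a hypothetical example with illustrative names, not from the original):

```cpp
#include <cstddef>

// Three separate arrays (float a[N], b[N], c[N]) give the prefetcher
// three data streams. Combining them into an array of structures
// interleaves the streams, so a loop that touches all three fields
// reads one contiguous stream with a constant stride of sizeof(Point).
struct Point {
    float a, b, c;
};

float sum_fields(const Point* p, size_t n) {
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++) {
        // One sequential data stream instead of three.
        sum += p[i].a + p[i].b + p[i].c;
    }
    return sum;
}
```

The trade-off is that a loop which needs only one of the fields now reads the other two as well, so this layout pays off only when the fields are usually accessed together.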


The non-temporal write instructions are useful for writing to uncached memory that is unlikely to be accessed again soon. You may use vector instructions in order to minimize the number of non-temporal write instructions.

12.11 Parallelization

The most important way of improving the performance of CPU-intensive code is to do things in parallel. The main methods of doing things in parallel are:

• Improve the possibilities of the CPU to do out-of-order execution. This is done by breaking long dependency chains (see page 63) and distributing µops evenly between the different execution units or execution ports (see page 91).

• Use vector instructions. See chapter 13 page 108.

• Use multiple threads. See chapter 14 page 130.

Loop-carried dependency chains can be broken by using multiple accumulators, as explained on page 63. The optimal number of accumulators, if the CPU has nothing else to do, is the latency of the most critical instruction in the dependency chain divided by the reciprocal throughput for that instruction. For example, the latency of floating point addition on an AMD processor is 4 clock cycles and the reciprocal throughput is 1. This means that the optimal number of accumulators is 4. Example 12.8b below shows a loop that adds numbers with four floating point registers as accumulators.

// Example 12.8a, Loop-carried dependency chain
// (Same as example 9.3a page 63)
double list[100], sum = 0.;
for (int i = 0; i < 100; i++) sum += list[i];

An implementation with 4 floating point registers as accumulators looks like this:

; Example 12.8b, Four floating point accumulators
        lea     esi, list              ; Pointer to list
        fld     qword ptr [esi]        ; accum1 = list[0]
        fld     qword ptr [esi+8]      ; accum2 = list[1]
        fld     qword ptr [esi+16]     ; accum3 = list[2]
        fld     qword ptr [esi+24]     ; accum4 = list[3]
        fxch    st(3)                  ; Get accum1 to top
        add     esi, 800               ; Point to end of list
        mov     eax, 32-800            ; Index to list[4] from end of list
L1:     fadd    qword ptr [esi+eax]    ; Add list[i]
        fxch    st(1)                  ; Swap accumulators
        fadd    qword ptr [esi+eax+8]  ; Add list[i+1]
        fxch    st(2)                  ; Swap accumulators
        fadd    qword ptr [esi+eax+16] ; Add list[i+2]
        fxch    st(3)                  ; Swap accumulators
        add     eax, 24                ; i += 3
        js      L1                     ; Loop
        faddp   st(1), st(0)           ; Add two accumulators together
        fxch    st(1)                  ; Swap accumulators
        faddp   st(2), st(0)           ; Add the two other accumulators
        faddp   st(1), st(0)           ; Add these sums
        fstp    qword ptr [sum]        ; Store the result

In example 12.8b, I have loaded the four accumulators with the first four values from list.


Then the number of additions to do in the loop happens to be divisible by the rollout factor, which is 3. The funny thing about using floating point registers as accumulators is that the number of accumulators is equal to the rollout factor plus one. This is a consequence of the way the fxch instructions are used for swapping the accumulators. You have to play computer and follow the position of each accumulator on the floating point register stack to verify that the four accumulators are actually rotated one place after each iteration of the loop, so that each accumulator is used for every fourth addition despite the fact that the loop is only rolled out by three.

The loop in example 12.8b takes 1 clock cycle per addition, which is the maximum throughput of the floating point adder. The latency of 4 clock cycles for floating point addition is taken care of by using four accumulators. The fxch instructions have zero latency because they are translated to register renaming on both Intel and AMD processors.

The fxch instructions can be avoided on processors with the SSE2 instruction set by using XMM registers instead of floating point stack registers, as shown in example 12.8c. The latency of floating point vector addition is 4 on an AMD and the reciprocal throughput is 2, so the optimal number of accumulators is 2 vector registers.

; Example 12.8c, Two XMM vector accumulators
        lea     esi, list            ; list must be aligned by 16
        movapd  xmm0, [esi]          ; list[0], list[1]
        movapd  xmm1, [esi+16]       ; list[2], list[3]
        add     esi, 800             ; Point to end of list
        mov     eax, 32-800          ; Index to list[4] from end of list
L1:     addpd   xmm0, [esi+eax]      ; Add list[i], list[i+1]
        addpd   xmm1, [esi+eax+16]   ; Add list[i+2], list[i+3]
        add     eax, 32              ; i += 4
        js      L1                   ; Loop
        addpd   xmm0, xmm1           ; Add the two accumulators together
        movhlps xmm1, xmm0           ; There is no movhlpd instruction
        addsd   xmm0, xmm1           ; Add the two vector elements
        movsd   [sum], xmm0          ; Store the result

Examples 12.8b and 12.8c are exactly equally fast on an AMD processor because both are limited by the throughput of the floating point adder. Example 12.8c is faster than 12.8b on an Intel Core2 processor because this processor has a 128 bits wide floating point adder that can handle a whole vector in one operation.

12.12 Analyzing dependences

A loop may have several interlocked dependency chains. Such complex cases require a careful analysis. The next example is a Taylor expansion. As you probably know, many functions can be approximated by a Taylor polynomial of the form

    f(x) ≈ Σ (i = 0 to n) c_i · x^i

Each power x^i is conveniently calculated by multiplying the preceding power x^(i-1) with x. The coefficients c_i are stored in a table.

; Example 12.9a. Taylor expansion
.data
align 16
one    dq 1.0             ; 1.0
x      dq ?               ; x
coeff  dq c0, c1, c2, ... ; Taylor coefficients
coeff_end label qword     ; end of coefficient list
.code
        movsd   xmm2, [x]             ; xmm2 = x
        movsd   xmm1, [one]           ; xmm1 = x^i
        pxor    xmm0, xmm0            ; xmm0 = sum, initialized to 0
        mov     eax, offset coeff     ; point to c[i]
L1:     movsd   xmm3, [eax]           ; c[i]
        mulsd   xmm3, xmm1            ; c[i] * x^i
        mulsd   xmm1, xmm2            ; x^(i+1)
        addsd   xmm0, xmm3            ; sum += c[i] * x^i
        add     eax, 8                ; point to c[i+1]
        cmp     eax, offset coeff_end ; stop at end of list
        jb      L1                    ; loop

(If your assembler confuses the movsd instruction with the string instruction of the same name, then code it as DB 0F2H / movups.)

And now to the analysis. The list of coefficients is so short that we can expect it to stay cached. Trace cache and retirement are obviously not limiting factors in this example. In order to check whether latencies are important, we have to look at the dependences in this code. The dependences are shown in figure 12.1.

Figure 12.1: Dependences in example 12.9.

There are two continued dependency chains, one for calculating x^i and one for calculating the sum. The latency of the mulsd instruction is longer than the addsd on Intel processors.


The vertical multiplication chain is therefore more critical than the addition chain. The additions have to wait for c_i · x^i, which comes one multiplication latency after x^i, and later than the preceding additions. If nothing else limits the performance, then we can expect this code to take one multiplication latency per iteration, which is 4 - 7 clock cycles, depending on the processor.

Throughput is not a limiting factor because the throughput of the multiplication unit is one multiplication per clock cycle on processors with a 128 bit execution unit.

Measuring the time of example 12.9a, we find that the loop takes at least one clock cycle more than the multiplication latency. The explanation is as follows: Both multiplications have to wait for the value of x^(i-1) in xmm1 from the preceding iteration. Thus, both multiplications are ready to start at the same time. We would like the vertical multiplication in the ladder of figure 12.1 to start first, because it is part of the most critical dependency chain. But the microprocessor sees no reason to swap the order of the two multiplications, so the horizontal multiplication in figure 12.1 starts first. The vertical multiplication is delayed for one clock cycle, which is the reciprocal throughput of the floating point multiplication unit. The problem can be solved by putting the horizontal multiplication in figure 12.1 after the vertical multiplication:

; Example 12.9b. Taylor expansion
        movsd   xmm2, [x]             ; xmm2 = x
        movsd   xmm1, [one]           ; xmm1 = x^i
        pxor    xmm0, xmm0            ; xmm0 = sum, initialized to 0
        mov     eax, offset coeff     ; point to c[i]
L1:     movapd  xmm3, xmm1            ; copy x^i
        mulsd   xmm1, xmm2            ; x^(i+1) (vertical multiplication)
        mulsd   xmm3, [eax]           ; c[i]*x^i (horizontal multiplication)
        add     eax, 8                ; point to c[i+1]
        cmp     eax, offset coeff_end ; stop at end of list
        addsd   xmm0, xmm3            ; sum += c[i] * x^i
        jb      L1                    ; loop

Here we need an extra instruction for copying x^i (movapd xmm3,xmm1) so that we can do the vertical multiplication before the horizontal multiplication. This makes the execution one clock cycle faster per iteration of the loop.

We are still using only the lower part of the xmm registers. There is more to gain by doing two multiplications simultaneously and taking two steps in the Taylor expansion at a time. The trick is to calculate each value of x^i from x^(i-2) by multiplying with x^2:

; Example 12.9c. Taylor expansion, double steps
        movsd   xmm4, [x]             ; x
        mulsd   xmm4, xmm4            ; x^2
        movlhps xmm4, xmm4            ; x^2, x^2
        movapd  xmm1, xmmword ptr [one] ; xmm1(H)=x, xmm1(L)=1.0
        pxor    xmm0, xmm0            ; xmm0 = sum, initialized to 0
        mov     eax, offset coeff     ; point to c[i]
L1:     movapd  xmm3, xmm1            ; copy x^(i+1), x^i
        mulpd   xmm1, xmm4            ; x^(i+3), x^(i+2)
        mulpd   xmm3, [eax]           ; c[i+1]*x^(i+1), c[i]*x^i
        addpd   xmm0, xmm3            ; sum += c[i] * x^i
        add     eax, 16               ; point to c[i+2]
        cmp     eax, offset coeff_end ; stop at end of list
        jb      L1                    ; loop
        haddpd  xmm0, xmm0            ; join the two parts of sum

The speed is now doubled. The loop still takes the multiplication latency per iteration, but the number of iterations is halved. The multiplication latency is 5 clock cycles for an Intel Core 2 and 4 for an AMD K10. Doing two vector multiplies per iteration, we are using only 40 or 50% of the maximum multiplier throughput. This means that there is more to gain by further parallelization. We can do this by taking four Taylor steps at a time and calculating each value of x^i from x^(i-4) by multiplying with x^4:

; Example 12.9d. Taylor expansion, quadruple steps
        movsd   xmm4, [x]             ; x
        mulsd   xmm4, xmm4            ; x^2
        movlhps xmm4, xmm4            ; x^2, x^2
        movapd  xmm2, xmmword ptr [one] ; xmm2(H)=x, xmm2(L)=1.0
        movapd  xmm1, xmm2            ; xmm1 = x, 1
        mulpd   xmm2, xmm4            ; xmm2 = x^3, x^2
        mulpd   xmm4, xmm4            ; xmm4 = x^4, x^4
        pxor    xmm5, xmm5            ; xmm5 = sum, initialized to 0
        pxor    xmm6, xmm6            ; xmm6 = sum, initialized to 0
        mov     eax, offset coeff     ; point to c[i]
L1:     movapd  xmm3, xmm1            ; copy x^(i+1), x^i
        movapd  xmm0, xmm2            ; copy x^(i+3), x^(i+2)
        mulpd   xmm1, xmm4            ; x^(i+5), x^(i+4)
        mulpd   xmm2, xmm4            ; x^(i+7), x^(i+6)
        mulpd   xmm3, xmmword ptr [eax]    ; term(i+1), term(i)
        mulpd   xmm0, xmmword ptr [eax+16] ; term(i+3), term(i+2)
        addpd   xmm5, xmm3            ; add to sum
        addpd   xmm6, xmm0            ; add to sum
        add     eax, 32               ; point to c[i+4]
        cmp     eax, offset coeff_end ; stop at end of list
        jb      L1                    ; loop
        addpd   xmm5, xmm6            ; join two accumulators
        haddpd  xmm5, xmm5            ; final sum

We are using two registers, xmm1 and xmm2, for keeping four consecutive powers of x in example 12.9d in order to split the vertical dependency chain in figure 12.1 into four parallel chains. Each power x^i is calculated from x^(i-4). Similarly, we are using two accumulators for the sum, xmm5 and xmm6, in order to split the addition chain into four parallel chains. The four sums are added together after the loop.
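The quadruple-stepping idea of example 12.9d can be expressed in scalar C++: four power chains, each stepping by x^4, feed independent partial sums that are joined after the loop (an illustrative sketch; the function name is not from the original, and n is assumed to be divisible by 4):

```cpp
// Evaluate sum of c[i] * x^i for i = 0..n-1 with four parallel chains.
// Each power chain steps by x4 = x^4, so the chains are independent
// and can overlap in an out-of-order core.
double taylor4(const double* c, int n, double x) {
    double x2 = x * x, x4 = x2 * x2;
    double p0 = 1.0, p1 = x, p2 = x2, p3 = x2 * x;  // x^0 .. x^3
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;  // partial sums
    for (int i = 0; i < n; i += 4) {
        s0 += c[i]     * p0;  p0 *= x4;  // chain 0: x^0, x^4, x^8, ...
        s1 += c[i + 1] * p1;  p1 *= x4;  // chain 1: x^1, x^5, x^9, ...
        s2 += c[i + 2] * p2;  p2 *= x4;  // chain 2: x^2, x^6, ...
        s3 += c[i + 3] * p3;  p3 *= x4;  // chain 3: x^3, x^7, ...
    }
    return (s0 + s1) + (s2 + s3);  // join the four accumulators
}
```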
A measurement on a Core 2 shows 5.04 clock cycles per iteration, or 1.26 clock cycles per Taylor term, for example 12.9d. This is close to the expectation of 5 clock cycles per iteration as determined by the multiplication latency. This solution uses 80% of the maximum throughput of the multiplier, which is a very satisfactory result. There is hardly anything to gain by unrolling further. The vector additions and other instructions do not add to the execution time because they do not need to use the same execution units as the multiplications. Instruction fetch and decoding is not a bottleneck in this case.

The time spent outside the loop is of course higher in example 12.9d than in the previous examples, but the performance gain is still very significant even for a small number of iterations. A little final twist is to put the first coefficient into the accumulator before the loop and start with x^1 rather than x^0. A better precision is obtained by adding the first term last, but at the cost of an extra addition latency.

It is common to stop a Taylor expansion when the terms become negligible. However, it may be wise to always include the worst-case maximum number of terms in order to avoid the floating point comparisons needed for checking the stop condition. A constant repetition count will also prevent misprediction of the loop exit. Set the mxcsr register to "Flush to zero" mode in order to avoid the possible penalty of underflows when doing more iterations than necessary.

12.13 Loops on processors without out-of-order execution

The P1 and PMMX processors have no capabilities for out-of-order execution. These processors are of course obsolete today, but the principles described here may be applied to smaller embedded processors.


I have chosen the simple example of a procedure that reads integers from an array, changes the sign of each integer, and stores the results in another array. In C++ the procedure would look like this:

// Example 12.10a. Loop to change sign
void ChangeSign (int * A, int * B, int N) {
   for (int i = 0; i < N; i++) B[i] = -A[i];
}

An assembly implementation optimized for P1 may look like this:

; Example 12.10b
        mov     esi, [A]
        mov     eax, [N]
        mov     edi, [B]
        xor     ecx, ecx
        lea     esi, [esi+4*eax]   ; point to end of array A
        sub     ecx, eax           ; -n
        lea     edi, [edi+4*eax]   ; point to end of array B
        jz      short L3
        xor     ebx, ebx           ; start first calculation
        mov     eax, [esi+4*ecx]
        inc     ecx
        jz      short L2
L1:     sub     ebx, eax           ; u
        mov     eax, [esi+4*ecx]   ; v (pairs)
        mov     [edi+4*ecx-4], ebx ; u
        inc     ecx                ; v (pairs)
        mov     ebx, 0             ; u
        jnz     L1                 ; v (pairs)
L2:     sub     ebx, eax           ; end last calculation
        mov     [edi+4*ecx-4], ebx
L3:

Here the iterations are overlapped in order to improve pairing opportunities. We begin reading the second value before we have stored the first one. The mov ebx,0 instruction has been put in between inc ecx and jnz L1, not to improve pairing, but to avoid the AGI stall that would result from using ecx as address index in the first instruction pair after it has been incremented.

Loops with floating point operations are somewhat different because the floating point instructions are pipelined and overlapping rather than pairing. Consider the DAXPY loop of example 12.6 page 92.
An optimal solution for P1 is as follows:

; Example 12.11. DAXPY optimized for P1 and PMMX
        mov     eax, [n]              ; number of elements
        mov     esi, [X]              ; pointer to X
        mov     edi, [Y]              ; pointer to Y
        xor     ecx, ecx
        lea     esi, [esi+8*eax]      ; point to end of X
        sub     ecx, eax              ; -n
        lea     edi, [edi+8*eax]      ; point to end of Y
        jz      short L3              ; test for n = 0
        fld     qword ptr [DA]        ; start first calculation
        fmul    qword ptr [esi+8*ecx] ; DA * X[0]
        jmp     short L2              ; jump into loop
L1:     fld     qword ptr [DA]
        fmul    qword ptr [esi+8*ecx]   ; DA * X[i]
        fxch                            ; get old result
        fstp    qword ptr [edi+8*ecx-8] ; store Y[i]
L2:     fsubr   qword ptr [edi+8*ecx]   ; subtract from Y[i]
        inc     ecx                     ; increment index
        jnz     L1                      ; loop
        fstp    qword ptr [edi+8*ecx-8] ; store last result
L3:

Here we are using the loop counter as array index and counting through negative values up to zero. Each operation begins before the previous one is finished, in order to improve calculation overlaps. Processors with out-of-order execution will do this automatically, but for processors with no out-of-order capabilities we have to do this overlapping explicitly.

The interleaving of floating point operations works perfectly here: The 2 clock stall between FMUL and FSUBR is filled with the FSTP of the previous result. The 3 clock stall between FSUBR and FSTP is filled with the loop overhead and the first two instructions of the next operation. An address generation interlock (AGI) stall has been avoided by reading the only parameter that doesn't depend on the index counter in the first clock cycle after the index has been incremented.

This solution takes 6 clock cycles per iteration, which is better than unrolled solutions published elsewhere.

12.14 Macro loops

If the repetition count for a loop is small and constant, then it is possible to unroll the loop completely. The advantage of this is that calculations that depend only on the loop counter can be done at assembly time rather than at execution time. The disadvantage is, of course, that it takes up more space in the code cache if the repeat count is high.

The MASM syntax includes a powerful macro language with full support of metaprogramming, which can be quite useful.
Other assemblers have similar capabilities, but the syntax is different.

If, for example, we need a list of square numbers, then the C++ code may look like this:

// Example 12.12a. Loop to make list of squares
int squares[10];
for (int i = 0; i < 10; i++) squares[i] = i*i;

The same list can be generated by a macro loop in MASM language:

; Example 12.12b. Macro loop to produce data
.DATA
squares LABEL DWORD  ; label at start of array
I = 0                ; temporary counter
REPT 10              ; repeat 10 times
DD I * I             ; define one array element
I = I + 1            ; increment counter
ENDM                 ; end of REPT loop

Here, I is a preprocessing variable. The I loop is run at assembly time, not at execution time. The variable I and the statement I = I + 1 never make it into the final code, and hence take no time to execute. In fact, example 12.12b generates no executable code, only data. The macro preprocessor will translate the above code to:

; Example 12.12c. Result of macro loop expansion
squares LABEL DWORD  ; label at start of array
DD 0
DD 1
DD 4
DD 9
DD 16
DD 25
DD 36
DD 49
DD 64
DD 81

Macro loops are also useful for generating code. The next example calculates x^n, where x is a floating point number and n is a positive integer. This is done most efficiently by repeatedly squaring x and multiplying together the factors that correspond to the binary digits in n. The algorithm can be expressed by the C++ code:

// Example 12.13a. Calculate pow(x,n) where n is a positive integer
double x, xp, power;
unsigned int n, i;
xp = x;  power = 1.0;
for (i = n; i != 0; i >>= 1) {
   if (i & 1) power *= xp;
   xp *= xp;
}

If n is known at assembly time, then the power function can be implemented using the following macro loop:

; Example 12.13b.
; This macro will raise two packed double-precision floats in X
; to the power of N, where N is a positive integer constant.
; The result is returned in Y. X and Y must be two different
; XMM registers. X is not preserved.
; (Only for processors with SSE2)
INTPOWER MACRO X, Y, N
LOCAL I, YUSED       ; define local identifiers
I = N                ; I used for shifting N
YUSED = 0            ; remember if Y contains valid data
REPT 32              ; maximum repeat count is 32
   IF I AND 1        ; test bit 0
      IF YUSED       ; if Y already contains data
         mulsd Y, X  ; multiply Y with a power of X
      ELSE           ; if this is the first time Y is used:
         movsd Y, X  ; copy data to Y
         YUSED = 1   ; remember that Y now contains data
      ENDIF          ; end of IF YUSED
   ENDIF             ; end of IF I AND 1
   I = I SHR 1       ; shift I right one place
   IF I EQ 0         ; stop when I = 0
      EXITM          ; exit REPT 32 loop prematurely
   ENDIF             ; end of IF I EQ 0
   mulsd X, X        ; square X
ENDM                 ; end of REPT 32 loop
ENDM                 ; end of INTPOWER macro definition

This macro generates the minimum number of instructions needed to do the job. There is no loop overhead, prolog or epilog in the final code. And, most importantly, no branches. All branches have been resolved by the macro preprocessor. To calculate xmm0 to the power of 12, you write:

; Example 12.13c. Macro invocation
INTPOWER xmm0, xmm1, 12

This will be expanded to:


; Example 12.13d. Result of macro expansion
mulsd xmm0, xmm0     ; x^2
mulsd xmm0, xmm0     ; x^4
movsd xmm1, xmm0     ; save x^4
mulsd xmm0, xmm0     ; x^8
mulsd xmm1, xmm0     ; x^4 * x^8 = x^12

This even has fewer instructions than an optimized assembly loop without unrolling. The macro can also work on vectors when mulsd is replaced by mulpd and movsd is replaced by movapd.

13 Vector programming

Since there are technological limits to the maximum clock frequency of microprocessors, the trend goes towards increasing processor throughput by handling multiple data in parallel. When optimizing code, it is important to consider whether there are data that can be handled in parallel. The principle of Single-Instruction-Multiple-Data (SIMD) programming is that a vector or set of data are packed together in one large register and handled together in one operation. Several hundred different SIMD instructions are available. These instructions are listed in "IA-32 Intel Architecture Software Developer's Manual" vol. 2A and 2B, and in "AMD64 Architecture Programmer's Manual", vol.
4.

Multiple data can be packed into 64-bit MMX registers, 128-bit XMM registers or 256-bit YMM registers in the following ways:

data type       data per pack  register size   instruction set  microprocessor
8-bit integer   8              64 bit (MMX)    MMX              PMMX and later
16-bit integer  4              64 bit (MMX)    MMX              PMMX and later
32-bit integer  2              64 bit (MMX)    MMX              PMMX and later
64-bit integer  1              64 bit (MMX)    SSE2             P4 and later
32-bit float    2              64 bit (MMX)    3DNow            AMD only
8-bit integer   16             128 bit (XMM)   SSE2             P4 and later
16-bit integer  8              128 bit (XMM)   SSE2             P4 and later
32-bit integer  4              128 bit (XMM)   SSE2             P4 and later
64-bit integer  2              128 bit (XMM)   SSE2             P4 and later
32-bit float    4              128 bit (XMM)   SSE              P3 and later
64-bit float    2              128 bit (XMM)   SSE2             P4 and later
32-bit float    8              256 bit (YMM)   AVX              future (2010)
64-bit float    4              256 bit (YMM)   AVX              future (2010)
Table 13.1. Vector types

The 64- and 128-bit packing modes are available on the latest microprocessors from Intel and AMD, except for the 3DNow mode, which is available only on AMD processors. Whether the different instruction sets are supported on a particular microprocessor can be determined with the CPUID instruction, as explained on page 130. The 64-bit MMX registers cannot be used together with the floating point registers. The XMM and YMM registers can only be used if supported by the operating system. See page 132 for how to check if the use of XMM and YMM registers is enabled by the operating system.

It is advantageous to choose the smallest data size that fits the purpose in order to pack as many data as possible into one vector register.
Mathematical computations may require double precision (64-bit) floats in order to avoid loss of precision in the intermediate calculations, even if single precision is sufficient for the final result.


Before you choose to use vector instructions, you have to consider whether the resulting code will be faster than the simple instructions without vectors. With vector code, you may spend more instructions on trivial things, such as moving data into the right positions in the registers and emulating conditional moves, than on the actual calculations. Example 13.5 below is an example of this. Vector instructions are relatively slow on older processors, but the newest processors can do a vector calculation just as fast as a scalar (single) calculation.

For floating point calculations, it is often advantageous to use XMM registers, even if there are no opportunities for handling data in parallel. The XMM registers are handled in a more straightforward way than the old x87 register stack.

Memory operands for XMM instructions have to be aligned by 16. Memory operands for YMM instructions have to be aligned by 32 in some cases. See page 83 for how to align data in memory.

; Example 13.1a. Adding two arrays using vectors
; float a[100], b[100], c[100];
; for (int i = 0; i < 100; i++) a[i] = b[i] + c[i];
; Assume that a, b and c are aligned by 16
        xor ecx, ecx          ; Loop counter i
L:      movaps xmm0, b[ecx]   ; Load 4 elements from b
        addps  xmm0, c[ecx]   ; Add 4 elements from c
        movaps a[ecx], xmm0   ; Store result in a
        add ecx, 16           ; 4 elements * 4 bytes = 16
        cmp ecx, 400          ; 100 elements * 4 bytes = 400
        jb  L                 ; Loop

13.1 Conditional moves in SIMD registers

Consider this C++ code, which finds the biggest values in four pairs of values:

// Example 13.2a.
// Loop to find maximums
float a[4], b[4], c[4];
for (int i = 0; i < 4; i++) {
   c[i] = a[i] > b[i] ? a[i] : b[i];
}

If we want to implement this code with XMM registers, then we cannot use a conditional jump for the branch inside the loop, because the branch condition is not the same for all four elements. Fortunately, there is a maximum instruction that does the same:

; Example 13.2b. Maximum in XMM
movaps xmm0, a       ; Load a vector
maxps  xmm0, b       ; max(a,b)
movaps c, xmm0       ; c = a > b ? a : b

Minimum and maximum vector instructions exist for single and double precision floats and for 8-bit and 16-bit integers. The absolute value of floating point vector elements is calculated by AND'ing out the sign bit, as shown in example 13.8 page 119. Instructions for the absolute value of integer vector elements exist in the "Supplementary SSE3" instruction set. The integer saturated addition vector instructions (e.g. PADDSW) can also be used for finding maximum or minimum or for limiting values to a specific range.

These methods are not very general, however. The most general way of doing conditional moves in vector registers is to use Boolean vector instructions. The following example is a modification of the above example where we cannot use the MAXPS instruction:

// Example 13.3a. Branch in loop


float a[4], b[4], c[4], x[4], y[4];
for (int i = 0; i < 4; i++) {
   c[i] = x[i] > y[i] ? a[i] : b[i];
}

The necessary conditional move is done by making a mask that consists of all 1's when the condition is true and all 0's when the condition is false. a[i] is AND'ed with this mask, and b[i] is AND'ed with the inverted mask:

; Example 13.3b. Conditional move in XMM registers
movaps  xmm1, y      ; Load y vector
cmpltps xmm1, x      ; Compare with x. xmm1 = mask for y < x
movaps  xmm0, a      ; Load a vector
andps   xmm0, xmm1   ; a AND mask
andnps  xmm1, b      ; b AND NOT mask
orps    xmm0, xmm1   ; (a AND mask) OR (b AND NOT mask)
movaps  c, xmm0      ; c = x > y ? a : b

The vectors that make the condition (x and y in example 13.3b) and the vectors that are selected (a and b in example 13.3b) need not be the same type. For example, x and y could be integers. But they should have the same number of elements per vector. If a and b are doubles with two elements per vector, and x and y are 32-bit integers with four elements per vector, then we have to duplicate each element in x and y in order to get the right size of the mask (see example 13.5b below).

Note that the AND-NOT instruction (andnps, andnpd, pandn) inverts the destination operand, not the source operand. This means that it destroys the mask. Therefore we must have andps before andnps in example 13.3b. If the mask is needed more than once, then it may be more efficient to AND the mask with an XOR combination of a and b. This is illustrated in the next example, which makes a conditional swapping of a and b:

// Example 13.4a.
// Conditional swapping in loop
float a[4], b[4], x[4], y[4], temp;
for (int i = 0; i < 4; i++) {
   if (x[i] > y[i]) {
      temp = a[i];   // Swap a[i] and b[i] if x[i] > y[i]
      a[i] = b[i];
      b[i] = temp;
   }
}

And now the assembly code using XMM vectors:

; Example 13.4b. Conditional swapping in XMM registers
movaps  xmm2, y      ; Load y vector
cmpltps xmm2, x      ; Compare with x. xmm2 = mask for y < x
movaps  xmm0, a      ; Load a vector
movaps  xmm1, b      ; Load b vector
xorps   xmm0, xmm1   ; a XOR b
andps   xmm2, xmm0   ; (a XOR b) AND mask
xorps   xmm1, xmm2   ; b XOR ((a XOR b) AND mask)
xorps   xmm2, a      ; a XOR ((a XOR b) AND mask)
movaps  b, xmm1      ; (x[i] > y[i]) ? a[i] : b[i]
movaps  a, xmm2      ; (x[i] > y[i]) ? b[i] : a[i]

The xorps xmm0,xmm1 instruction generates a pattern of the bits that differ between a and b. This bit pattern is AND'ed with the mask, so that xmm2 contains the bits that need to be changed if a and b should be swapped, and zeroes if they should not be swapped. The last two xorps instructions flip the bits that have to be changed if a and b should be swapped, and leave the values unchanged if not.


The mask used for conditional moves can also be generated by copying the sign bit into all bit positions using the arithmetic shift right instruction psrad. This is illustrated in the next example, where vector elements are raised to different integer powers. We are using the method in example 12.13a page 107 for calculating powers.

// Example 13.5a. Raise vector elements to different integer powers
double x[2], y[2];  unsigned int n[2];
for (int i = 0; i < 2; i++) {
   y[i] = pow(x[i], n[i]);
}

If the elements of n are equal, then the simplest solution is to use a branch. But if the powers are different, then we have to use conditional moves:

; Example 13.5b. Raise vector to power, using integer mask
.data                        ; Data segment
align 16                     ; Must be aligned
ONE  DQ 1.0, 1.0             ; Make constant 1.0
X    DQ ?, ?                 ; x[0], x[1]
Y    DQ ?, ?                 ; y[0], y[1]
N    DD ?, ?                 ; n[0], n[1]

.code
; Register use:
; xmm0 = xp
; xmm1 = power
; xmm2 = i (i0 and i1 each stored twice as DWORD integers)
; xmm3 = 1.0 if not(i & 1)
; xmm4 = xp if (i & 1)
        movq      xmm2, [N]      ; Load n0, n1
        punpckldq xmm2, xmm2     ; Copy to get n0, n0, n1, n1
        movapd    xmm0, [X]      ; Load x0, x1
        movapd    xmm1, [ONE]    ; power initialized to 1.0
        mov       eax, [N]       ; n0
        or        eax, [N+4]     ; n0 OR n1 to get highest significant bit
        xor       ecx, ecx       ; 0 if n0 and n1 are both zero
        bsr       ecx, eax       ; Compute repeat count for max(n0,n1)
L1:     movdqa    xmm3, xmm2     ; Copy i
        pslld     xmm3, 31       ; Get least significant bit of i
        psrad     xmm3, 31       ; Copy to all bit positions to make mask
        psrld     xmm2, 1        ; i >>= 1
        movapd    xmm4, xmm0     ; Copy of xp
        andpd     xmm4, xmm3     ; xp if bit = 1
        andnpd    xmm3, [ONE]    ; 1.0 if bit = 0
        orpd      xmm3, xmm4     ; (i & 1) ? xp : 1.0
        mulpd     xmm1, xmm3     ; power *= (i & 1) ? xp : 1.0
        mulpd     xmm0, xmm0     ; xp *= xp
        sub       ecx, 1         ; Loop counter
        jns       L1             ; Repeat ecx+1 times
        movapd    [Y], xmm1      ; Store result

The repeat count of the loop is calculated separately outside the loop in order to reduce the number of instructions inside the loop.

Timing analysis for example 13.5b on a P4E: There are four continued dependency chains: xmm0: 7 clocks, xmm1: 7 clocks, xmm2: 4 clocks, ecx: 1 clock. Throughput for the different execution units: MMX-SHIFT: 3 µops, 6 clocks. MMX-ALU: 3 µops, 6 clocks. FP-MUL: 2 µops, 4 clocks. Throughput for port 1: 8 µops, 8 clocks. Thus, the loop appears to be limited by port 1 throughput. The best timing we can hope for is 8 clocks per iteration, which is the


number of µops that must go to port 1. However, three of the continued dependency chains are interconnected by two broken, but quite long, dependency chains involving xmm3 and xmm4, which take 23 and 19 clocks, respectively. This tends to hinder the optimal reordering of µops. The measured time is approximately 10 clocks per iteration. This timing actually requires a quite impressive reordering capability, considering that several iterations must be overlapped and several dependency chains interwoven in order to satisfy the restrictions on all ports and execution units.

Conditional moves in general purpose registers using CMOVcc and in floating point registers using FCMOVcc are no faster than in XMM registers.

A faster implementation of conditional moves in XMM registers is provided by newer instruction sets, which, unfortunately, are different for Intel and AMD processors. On Intel processors with the SSE4.1 instruction set we can use the PBLENDVB, BLENDVPS or BLENDVPD instructions instead of the AND/OR operations:

; Example 13.3c. Conditional move in XMM registers, Intel SSE4.1 only
movaps   xmm0, y           ; Load y vector
cmpltps  xmm0, x           ; Compare with x. xmm0 = mask for y < x
movaps   xmm1, b           ; Load b vector
blendvps xmm1, a, xmm0     ; Conditionally replace b[i] by a[i]
movaps   c, xmm1           ; c = x > y ? a : b

On future AMD processors with the SSE5 instruction set we can use the PCMOV instruction in the same way:

; Example 13.3d.
; Conditional move in XMM registers, AMD SSE5 only
movaps  xmm0, y             ; Load y vector
cmpltps xmm0, x             ; Compare with x. xmm0 = mask for y < x
movaps  xmm1, b             ; Load b vector
pcmov   xmm1, xmm1, a, xmm0 ; Conditionally replace b[i] by a[i]
movaps  c, xmm1             ; c = x > y ? a : b

13.2 Using vector instructions with other types of data than they are intended for

Most XMM and YMM instructions are 'typed' in the sense that they are intended for a particular type of data. For example, it doesn't make sense to use an instruction for adding integers on floating point data. But instructions that only move data around will work with any type of data, even though they are intended for one particular type of data. This can be quite useful if an equivalent instruction doesn't exist for the type of data you have, or if an instruction for another type of data is more efficient.

All XMM and YMM instructions that move, shuffle, blend or shift data, as well as the Boolean instructions, can be used for other types of data than they are intended for. But instructions that do any kind of arithmetic operation, type conversion or precision conversion can only be used for the type of data they are intended for. For example, the FLD instruction does more than move floating point data; it also converts to a different precision. If you try to use FLD and FSTP for moving integer data, then you may get exceptions for denormal operands in case the integer data do not happen to represent a normal floating point number.
The instruction may even change the value of the data in some cases. But the instruction MOVAPS, which is also intended for moving floating point data, does not convert precision or anything else. It just moves the data. Therefore, it is OK to use MOVAPS for moving integer data.

If you are in doubt whether a particular instruction will work with any type of data, then check the software manual from Intel or AMD. If the instruction can generate any kind of "floating point exception", then it should not be used for any other kind of data than it is intended for.


There is a penalty for using the wrong type of instructions in some cases. This is because the processor may have different data buses or different execution units for integer and floating point data. Moving data between the integer and floating point units takes one extra clock cycle on Intel processors and one or two clock cycles on AMD processors. See manual 3: "The microarchitecture of Intel and AMD CPU's" and manual 4: "Instruction tables" for details on specific processors. In some cases there can be another reason for the delay. Some AMD processors store extra information about floating point numbers in order to remember whether the number is zero, normal or denormal. This information is lost if an instruction intended for integer vectors is used for moving floating point data. The processor needs one or two extra clock cycles for regenerating the lost information. This is called a reformatting delay. The reformatting delay occurs whenever the output of an integer XMM instruction is used as input for a floating point XMM instruction, except when the floating point XMM instruction does nothing else than write the value to memory.
Interestingly, there is no reformatting delay when using single-precision XMM instructions for double-precision data or vice versa.

Using an instruction of a wrong type can be advantageous in cases where there is no extra delay, and in cases where the gain from using a particular instruction is more than the delay. Some cases are described below.

Using the shortest instruction

The instructions for packed single precision floating point numbers, with names ending in PS, are one byte shorter than equivalent instructions for double precision or integers. For example, you may use MOVAPS instead of MOVAPD or MOVDQA for moving data to or from memory or between registers. A reformatting delay occurs on AMD processors when using MOVAPS for moving the result of an integer instruction to another register, but not when moving data to or from memory.

Using the most efficient instruction

There are several different ways of reading an XMM register from unaligned memory. The typed instructions are MOVDQU, MOVUPD and MOVUPS. These are all quite inefficient on older processors. LDDQU is faster, but requires the SSE3 instruction set. On some processors, the most efficient way of reading an XMM register from unaligned memory is to read 64 bits at a time using MOVQ and MOVHPS.
Likewise, the fastest way of writing to unaligned memory may be to use MOVLPS and MOVHPS. See page 119 for more discussion of unaligned access.

An efficient way of setting a vector register to zero is PXOR XMM0,XMM0. Many processors recognize this instruction as being independent of the previous value of XMM0, while not all processors recognize the same for XORPS and XORPD. The PXOR instruction is therefore preferred for setting a register to zero.

The integer versions of the Boolean vector instructions (PAND, PANDN, POR, PXOR) can use either the FADD or the FMUL unit in an AMD K8 processor, while the floating point versions can use only the FMUL unit.

Using an instruction that is not available for other types of data

There are many situations where it is advantageous to use an instruction intended for a different type of data, simply because an equivalent instruction doesn't exist for the type of data you have.

The instructions for single precision float vectors are available in the SSE instruction set, while the equivalent instructions for double precision and integers require the SSE2 instruction set. Using MOVAPS instead of MOVAPD or MOVDQA for moving data makes the code compatible with processors that have SSE but not SSE2.


There are many useful instructions for data shuffling that are available for only one type of data. These instructions can easily be used for other types of data than they are intended for. The transition delay, if any, is often less than the cost of alternative solutions. The data shuffling instructions are listed in the next paragraph.

13.3 Shuffling data

Vectorized code sometimes needs a lot of instructions for swapping and copying vector elements and putting data into the right positions in the vectors. The need for these extra instructions reduces the advantage of using vector operations. It can often be an advantage to use a shuffling instruction that is intended for a different type of data than you have, as explained in the previous paragraph. Some instructions that are useful for data shuffling are listed below.

Moving data between different elements of a register
(Use same register for source and destination)

Instruction  Block size, bits  Description                                  Instruction set
PSHUFD       32                Universal shuffle                            SSE2
PSHUFLW      16                Shuffles low half of register only           SSE2
PSHUFHW      16                Shuffles high half of register only          SSE2
SHUFPS       32                Shuffle                                      SSE
SHUFPD       64                Shuffle                                      SSE2
PSLLDQ       8                 Shifts to a different position and sets      SSE2
                               the original position to zero
PSHUFB       8                 Universal shuffle                            Suppl. SSE3
PALIGNR      8                 Rotate vector of 8 or 16 bytes               Suppl. SSE3

Table 13.2.
Shuffle instructions

Moving data from one register to different elements of another register

Instruction  Block size, bits  Description                                  Instruction set
PSHUFD       32                Universal shuffle                            SSE2
PSHUFLW      16                Shuffles low half of register only           SSE2
PSHUFHW      16                Shuffles high half of register only          SSE2
PINSRB       8                 Insert byte into vector                      SSE4.1
PINSRW       16                Insert word into vector                      SSE
PINSRD       32                Insert dword into vector                     SSE4.1
PINSRQ       64                Insert qword into vector                     SSE4.1
INSERTPS     32                Insert dword into vector                     SSE4.1
VPERMILPS    32                Shuffle with variable selector               AVX
VPERMIL2PS   32                Shuffle with variable selector               AVX
VPERMILPD    64                Shuffle with variable selector               AVX
VPERMIL2PD   64                Shuffle with variable selector               AVX
PPERM        8                 Shuffle with variable selector               AMD SSE5
PERMPS       32                Shuffle with variable selector               AMD SSE5
PERMPD       64                Shuffle with variable selector               AMD SSE5

Table 13.3. Move-and-shuffle instructions

Combining data from two different sources

Instruction  Block size, bits  Description                                  Instruction set


SHUFPS       32                Lower 2 dwords from any position of source,  SSE
                               higher 2 dwords from any position of
                               destination
SHUFPD       64                Low qword from any position of source,       SSE2
                               high qword from any position of destination
MOVLPS/D     64                Low qword from memory,                       SSE/SSE2
                               high qword unchanged
MOVHPS/D     64                High qword from memory,                      SSE/SSE2
                               low qword unchanged
MOVLHPS      64                Low qword unchanged,                         SSE
                               high qword from low of source
MOVHLPS      64                Low qword from high of source,               SSE
                               high qword unchanged
MOVSS        32                Lowest dword from source (register only),    SSE
                               bits 32-127 unchanged
MOVSD        64                Low qword from source (register only),       SSE2
                               high qword unchanged
PUNPCKLBW    8                 Low 8 bytes from source and destination      SSE2
                               interleaved
PUNPCKLWD    16                Low 4 words from source and destination      SSE2
                               interleaved
PUNPCKLDQ    32                Low 2 dwords from source and destination     SSE2
                               interleaved
PUNPCKLQDQ   64                Low qword unchanged,                         SSE2
                               high qword from low of source
PUNPCKHBW    8                 High 8 bytes from source and destination     SSE2
                               interleaved
PUNPCKHWD    16                High 4 words from source and destination     SSE2
                               interleaved
PUNPCKHDQ    32                High 2 dwords from source and destination    SSE2
                               interleaved
PUNPCKHQDQ   64                Low qword from high of destination,          SSE2
                               high qword from high of source
PACKUSWB     8                 Low 8 bytes from 8 words of destination,     SSE2
                               high 8 bytes from 8 words of source.
                               Converted with unsigned saturation.
PACKSSWB     8                 Low 8 bytes from 8 words of destination,     SSE2
                               high 8 bytes from 8 words of source.
                               Converted with signed saturation.
PACKSSDW     16                Low 4 words from 4 dwords of destination,    SSE2
                               high 4 words from 4 dwords of source.
                               Converted with signed saturation.
MOVQ         64                Low qword from source,                       SSE2
                               high qword set to zero
PINSRB       8                 Insert byte into vector                      SSE4.1
PINSRW       16                Insert word into vector                      SSE
PINSRD       32                Insert dword into vector                     SSE4.1
INSERTPS     32                Insert dword into vector                     SSE4.1
PINSRQ       64                Insert qword into vector                     SSE4.1
PBLENDW      16                Blend from two different sources             SSE4.1
BLENDPS      32                Blend from two different sources             SSE4.1
BLENDPD      64                Blend from two different sources             SSE4.1
PBLENDVB     8                 Multiplexer                                  SSE4.1
BLENDVPS     32                Multiplexer                                  SSE4.1
BLENDVPD     64                Multiplexer                                  SSE4.1
PCMOV        1                 Multiplexer                                  AMD SSE5
PALIGNR      8                 Double shift (analogous to SHRD)             Suppl. SSE3


Table 13.4. Combine data

Copying data to multiple elements of a register (broadcast)
(Use same register for source and destination)

Instruction  Block size, bits  Description                                  Instruction set
PSHUFD       32                Broadcast any dword                          SSE2
SHUFPS       32                Broadcast dword                              SSE
SHUFPD       64                Broadcast qword                              SSE2
MOVLHPS      64                Broadcast qword                              SSE2
MOVHLPS      64                Broadcast high qword                         SSE2
MOVDDUP      64                Broadcast qword                              SSE3
MOVSLDUP     32                Copy dword 0 to 1, copy dword 2 to 3         SSE3
MOVSHDUP     32                Copy dword 1 to 0, copy dword 3 to 2         SSE3
PUNPCKLBW    8                 Duplicate each of the lower 8 bytes          SSE2
PUNPCKLWD    16                Duplicate each of the lower 4 words          SSE2
PUNPCKLDQ    32                Duplicate each of the lower 2 dwords         SSE2
PUNPCKLQDQ   64                Broadcast qword                              SSE2
PUNPCKHBW    8                 Duplicate each of the higher 8 bytes         SSE2
PUNPCKHWD    16                Duplicate each of the higher 4 words         SSE2
PUNPCKHDQ    32                Duplicate each of the higher 2 dwords        SSE2
PUNPCKHQDQ   64                Broadcast high qword                         SSE2
PSHUFB       8                 Broadcast any byte or larger element         Suppl. SSE3

Table 13.5. Broadcast data

Copying data from one register to all elements of another register (broadcast)

Instruction            Block size, bits  Description                        Instruction set
PSHUFD xmm2,xmm1,0     32                Broadcast dword                    SSE2
PSHUFD xmm2,xmm1,0EEH  64                Broadcast qword                    SSE2
MOVDDUP                64                Broadcast qword                    SSE3
MOVSLDUP               32                2 copies of each of dword 0 and 2  SSE3
MOVSHDUP               32                2 copies of each of dword 1 and 3  SSE3
VBROADCASTSS           32                Broadcast dword from memory        AVX
VBROADCASTSD           64                Broadcast qword from memory        AVX
VBROADCASTF128         128               Broadcast 16 bytes from memory     AVX

Table 13.6. Move and broadcast data

Example: Horizontal addition

The following examples show how to add all elements of a vector.

; Example 13.6a.
; Add 16 elements in vector of 8-bit unsigned integers
movaps xmm1, source      ; Source vector, 16 8-bit unsigned integers
pxor   xmm0, xmm0        ; 0
psadbw xmm1, xmm0        ; Sum of 8 differences
pshufd xmm0, xmm1, 0EH   ; Get bits 64-127 from xmm1 (or use movhlps)
paddd  xmm0, xmm1        ; Sum
movd   eax, xmm0         ; Sum is in eax
mov    [sum], ax         ; Store sum

(If the Suppl. SSE3 instruction set is available, use PHADDD after PSADBW.)


; Example 13.6b. Add eight elements in vector of 16-bit integers
movaps  xmm0, source     ; Source vector, 8 16-bit integers
pshufd  xmm1, xmm0, 0EH  ; Get bits 64-127 from xmm0 (or use movhlps)
paddw   xmm0, xmm1       ; Sums are in 4 words
pshufd  xmm1, xmm0, 01H  ; Get bits 32-63 from xmm0
paddw   xmm0, xmm1       ; Sums are in 2 words
pshuflw xmm1, xmm0, 01H  ; Get bits 16-31 from xmm0
paddw   xmm0, xmm1       ; Sum is in one word
movd    eax, xmm0        ; Sum is in low word of eax
mov     [sum], ax        ; Store sum

(If the Suppl. SSE3 instruction set is available, use PHADDW three times.)

; Example 13.6c. Add four elements in vector of 32-bit integers
movaps xmm0, source      ; Source vector, 4 32-bit integers
pshufd xmm1, xmm0, 0EH   ; Get bits 64-127 from xmm0 (or use movhlps)
paddd  xmm0, xmm1        ; Sums are in 2 dwords
pshufd xmm1, xmm0, 01H   ; Get bits 32-63 from xmm0
paddd  xmm0, xmm1        ; Sum is in one dword
movd   [sum], xmm0       ; Store sum

(If the Suppl. SSE3 instruction set is available, use PHADDD twice.)

; Example 13.6d. Add two elements in vector of 64-bit integers
movaps xmm0, source      ; Source vector, 2 64-bit integers
pshufd xmm1, xmm0, 0EH   ; Get bits 64-127 from xmm0 (or use movhlps)
paddq  xmm0, xmm1        ; Sum is in one qword
movlps [sum], xmm0       ; Store sum

; Example 13.6e. Add four elements in vector of 32-bit floats
movaps  xmm0, source     ; Source vector, 4 32-bit floats
movhlps xmm1, xmm0       ; Get bits 64-127 from xmm0
addps   xmm0, xmm1       ; Sums are in 2 dwords
pshufd  xmm1, xmm0, 01H  ; Get bits 32-63 from xmm0
addss   xmm0, xmm1       ; Sum is in one dword
movss   [sum], xmm0      ; Store sum

(If the SSE3 instruction set is available, use HADDPS twice.)

; Example 13.6f.
Add two elements <strong>in</strong> vector of 64-bit doublesmovaps xmm0, source ; Source vector, 2 64-bit doublesmovhlps xmm1, xmm0 ; Get bit 64-127 from xmm1addsd xmm0, xmm1 ; Sum is <strong>in</strong> one qwordmovsd [sum],xmm0 ; Store sum(If SSE3 <strong>in</strong>struction set is available, use HADDPD).13.4 Generat<strong>in</strong>g constantsThere is no <strong>in</strong>struction for mov<strong>in</strong>g a constant <strong>in</strong>to an XMM register. The default way ofputt<strong>in</strong>g a constant <strong>in</strong>to an XMM register is to load it from a memory constant. This is also themost efficient way if cache misses are rare. But if cache misses are frequent then we maylook for alternatives.One alternative is to copy the constant from a static memory location to the stack outsidethe <strong>in</strong>nermost loop and use the stack copy <strong>in</strong>side the <strong>in</strong>nermost loop. A memory location onthe stack is less likely to cause cache misses than a memory location <strong>in</strong> a constant datasegment. However, this option may not be possible <strong>in</strong> library functions.117


A second alternative is to store the constant to stack memory using integer instructions and then load the value from the stack memory to the XMM register.

A third alternative is to generate the constant by clever use of various instructions. This does not use the data cache but takes more space in the code cache. The code cache is less likely to cause cache misses because the code is contiguous.

The constants may be reused many times as long as the register is not needed for something else.

Table 13.7 below shows how to make various integer constants in XMM registers. The same value is generated in all cells in the vector:

Making constants for integer vectors in XMM registers

Value 0:
  8/16/32/64 bit: pxor xmm0,xmm0
Value 1:
  8 bit:  pcmpeqw xmm0,xmm0 / psrlw xmm0,15 / packuswb xmm0,xmm0
  16 bit: pcmpeqw xmm0,xmm0 / psrlw xmm0,15
  32 bit: pcmpeqd xmm0,xmm0 / psrld xmm0,31
  64 bit: pcmpeqw xmm0,xmm0 / psrlq xmm0,63
Value 2:
  8 bit:  pcmpeqw xmm0,xmm0 / psrlw xmm0,15 / psllw xmm0,1 / packuswb xmm0,xmm0
  16 bit: pcmpeqw xmm0,xmm0 / psrlw xmm0,15 / psllw xmm0,1
  32 bit: pcmpeqd xmm0,xmm0 / psrld xmm0,31 / pslld xmm0,1
  64 bit: pcmpeqw xmm0,xmm0 / psrlq xmm0,63 / psllq xmm0,1
Value 3:
  8 bit:  pcmpeqw xmm0,xmm0 / psrlw xmm0,14 / packuswb xmm0,xmm0
  16 bit: pcmpeqw xmm0,xmm0 / psrlw xmm0,14
  32 bit: pcmpeqd xmm0,xmm0 / psrld xmm0,30
  64 bit: pcmpeqw xmm0,xmm0 / psrlq xmm0,62
Value 4:
  8 bit:  pcmpeqw xmm0,xmm0 / psrlw xmm0,15 / psllw xmm0,2 / packuswb xmm0,xmm0
  16 bit: pcmpeqw xmm0,xmm0 / psrlw xmm0,15 / psllw xmm0,2
  32 bit: pcmpeqd xmm0,xmm0 / psrld xmm0,31 / pslld xmm0,2
  64 bit: pcmpeqw xmm0,xmm0 / psrlq xmm0,63 / psllq xmm0,2
Value -1:
  8/16 bit: pcmpeqw xmm0,xmm0
  32 bit:   pcmpeqd xmm0,xmm0
  64 bit:   pcmpeqw xmm0,xmm0
Value -2:
  8 bit:  pcmpeqw xmm0,xmm0 / psllw xmm0,1 / packsswb xmm0,xmm0
  16 bit: pcmpeqw xmm0,xmm0 / psllw xmm0,1
  32 bit: pcmpeqd xmm0,xmm0 / pslld xmm0,1
  64 bit: pcmpeqw xmm0,xmm0 / psllq xmm0,1
Other value:
  8 bit:  mov eax,value*01010101H / movd xmm0,eax / pshufd xmm0,xmm0,0
  16 bit: mov eax,value*10001H / movd xmm0,eax / pshufd xmm0,xmm0,0
  32 bit: mov eax,value / movd xmm0,eax / pshufd xmm0,xmm0,0
  64 bit: mov rax,value / movq xmm0,rax / punpcklqdq xmm0,xmm0 (64-bit mode only)

Table 13.7. Generate integer vector constants

Table 13.8 below shows how to make various floating point constants in XMM registers. The same value is generated in one or all cells in the vector:

Making floating point constants in XMM registers

Value 0.0:
  scalar or vector, single or double: pxor xmm0,xmm0
Value 0.5:
  single: pcmpeqw xmm0,xmm0 / pslld xmm0,26 / psrld xmm0,2
  double: pcmpeqw xmm0,xmm0 / psllq xmm0,55 / psrlq xmm0,2
Value 1.0:
  single: pcmpeqw xmm0,xmm0 / pslld xmm0,25 / psrld xmm0,2
  double: pcmpeqw xmm0,xmm0 / psllq xmm0,54 / psrlq xmm0,2
Value 1.5:
  single: pcmpeqw xmm0,xmm0 / pslld xmm0,24 / psrld xmm0,2
  double: pcmpeqw xmm0,xmm0 / psllq xmm0,53 / psrlq xmm0,2
Value 2.0:
  single: pcmpeqw xmm0,xmm0 / pslld xmm0,31 / psrld xmm0,1
  double: pcmpeqw xmm0,xmm0 / psllq xmm0,63 / psrlq xmm0,1
Value -2.0:
  single: pcmpeqw xmm0,xmm0 / pslld xmm0,30
  double: pcmpeqw xmm0,xmm0 / psllq xmm0,62
Sign bit:
  single: pcmpeqw xmm0,xmm0 / pslld xmm0,31
  double: pcmpeqw xmm0,xmm0 / psllq xmm0,63
Not sign bit:
  single: pcmpeqw xmm0,xmm0 / psrld xmm0,1
  double: pcmpeqw xmm0,xmm0 / psrlq xmm0,1
Other value (32-bit mode):
  scalar single: mov eax,value / movd xmm0,eax
  scalar double: mov eax,value>>32 / movd xmm0,eax / psllq xmm0,32
  vector single: mov eax,value / movd xmm0,eax / shufps xmm0,xmm0,0
  vector double: mov eax,value>>32 / movd xmm0,eax / pshufd xmm0,xmm0,22H
Other value (64-bit mode):
  scalar single: mov eax,value / movd xmm0,eax
  scalar double: mov rax,value / movq xmm0,rax
  vector single: mov eax,value / movd xmm0,eax / shufps xmm0,xmm0,0
  vector double: mov rax,value / movq xmm0,rax / shufpd xmm0,xmm0,0

Table 13.8. Generate floating point vector constants

The "sign bit" is a value with the sign bit set and all other bits = 0. This is used for changing or setting the sign of a variable. For example, to change the sign of a 2*double vector in xmm0:

; Example 13.7. Change sign of 2*double vector
pcmpeqw xmm7, xmm7      ; All 1's
psllq   xmm7, 63        ; Shift out the lower 63 1's
xorpd   xmm0, xmm7      ; Flip sign bit of xmm0

The "not sign bit" is the inverted value of "sign bit". It has the sign bit = 0 and all other bits = 1. This is used for getting the absolute value of a variable. For example, to get the absolute value of a 2*double vector in xmm0:

; Example 13.8. Absolute value of 2*double vector
pcmpeqw xmm6, xmm6      ; All 1's
psrlq   xmm6, 1         ; Shift out the highest bit
andpd   xmm0, xmm6      ; Set sign bit to 0

Generating an arbitrary double precision value in 32-bit mode is more complicated. The method in table 13.8 uses only the upper 32 bits of the 64-bit representation of the number, assuming that the lower binary decimals of the number are zero or that an approximation is acceptable. For example, to generate the double value 9.25, we first use a compiler or assembler to find that the hexadecimal representation of 9.25 is 4022800000000000H.
The lower 32 bits can be ignored, so we can do as follows:

; Example 13.9a. Set 2*double vector to arbitrary value (32 bit mode)
mov    eax, 40228000H   ; High 32 bits of 9.25
movd   xmm0, eax        ; Move to xmm0
pshufd xmm0, xmm0, 22H  ; Get value into dword 1 and 3

In 64-bit mode, we can use 64-bit integer registers:

; Example 13.9b. Set 2*double vector to arbitrary value (64 bit mode)
mov    rax, 4022800000000000H  ; Full representation of 9.25
movq   xmm0, rax               ; Move to xmm0
shufpd xmm0, xmm0, 0           ; Broadcast

Note that some assemblers use the very misleading name movd instead of movq for the instruction that moves 64 bits between a general purpose register and an XMM register.

13.5 Accessing unaligned data

All data that are read or written with vector registers should be aligned by the vector size if possible. See page 81 for how to align data.
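As a hedged C-intrinsics illustration of the aligned/unaligned distinction (SSE code via `<emmintrin.h>`; the buffer and function name are mine, not from this manual): `_mm_load_ps` corresponds to the aligned MOVAPS and requires a 16-byte-aligned address, while `_mm_loadu_ps` corresponds to MOVUPS and accepts any address:

```c
#include <emmintrin.h>  /* pulls in the SSE float intrinsics too */

/* An aligned buffer supports the fast MOVAPS-class loads; a pointer
   that is off by 4 bytes must use the unaligned MOVUPS-class loads. */
static float demo_aligned_access(void)
{
    _Alignas(16) static const float buf[8] =
        {1.0f, 2.0f, 3.0f, 4.0f, 5.0f, 6.0f, 7.0f, 8.0f};
    __m128 a = _mm_load_ps(buf);        /* aligned load  (MOVAPS) */
    __m128 u = _mm_loadu_ps(buf + 1);   /* misaligned by 4 (MOVUPS) */
    __m128 s = _mm_add_ps(a, u);        /* {3,5,7,9} */
    float out[4];
    _mm_storeu_ps(out, s);
    return out[0] + out[3];             /* 3 + 9 = 12 */
}
```

Using `_mm_load_ps` on the misaligned `buf + 1` would fault at run time, which is why the unaligned variants and the workarounds below exist.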


However, there are situations where alignment is not possible, for example if a library function receives a pointer to an array and it is unknown whether the array is aligned or not. The following methods can be used for reading unaligned vectors:

Using unaligned read instructions

The instructions movdqu, movups, movupd and lddqu are all able to read unaligned vectors. lddqu is faster than the alternatives on some processors, but requires the SSE3 instruction set. The unaligned read instructions are relatively slow on current processors (2008), but it is expected that these instructions will be faster on future AMD and Intel processors.

; Example 13.10. Unaligned vector read
; esi contains pointer to unaligned array
movdqu xmm0, [esi]      ; Read vector unaligned

Reading 8 bytes at a time

Instructions that read 8 bytes or less have no alignment requirements and are often quite fast unless the read crosses a cache boundary. Example:

; Example 13.11. Unaligned vector read split in two
; esi contains pointer to unaligned array
movq   xmm0, qword ptr [esi]    ; Read lower half of vector
movhps xmm0, qword ptr [esi+8]  ; Read upper half of vector

This method is faster than using unaligned read instructions on current processors (2008), but probably not on future processors.

Partially overlapping reads

Make the first read from the unaligned address and the next read from the nearest following 16-bytes boundary. The first two reads will therefore possibly overlap:

; Example 13.12. First unaligned read overlaps next aligned read
; esi contains pointer to unaligned array
movdqu xmm1, [esi]      ; Read vector unaligned
add    esi, 10H
and    esi, -10H        ; = nearest following 16B boundary
movdqa xmm2, [esi]      ; Read next vector aligned

Here, the data in xmm1 and xmm2 will be partially overlapping if esi is not divisible by 16. This, of course, only works if the following algorithm allows the redundant data. The last vector read can also overlap with the last-but-one if the end of the array is not aligned.

Reading from the nearest preceding 16-bytes boundary

It is possible to start reading from the nearest preceding 16-bytes boundary of an unaligned array. This will put irrelevant data into part of the vector register, and these data must be ignored. Likewise, the last read may go past the end of the array until the next 16-bytes boundary:

; Example 13.13. Reading from nearest preceding 16-bytes boundary
; esi contains pointer to unaligned array
mov    eax, esi         ; Copy pointer
and    esi, -10H        ; Round down to value divisible by 16
and    eax, 0FH         ; Array is misaligned by this value
movdqa xmm1, [esi]      ; Read from preceding 16B boundary
movdqa xmm2, [esi+10H]  ; Read next block


In the above example, xmm1 contains eax bytes of junk followed by (16-eax) useful bytes. xmm2 contains the remaining eax bytes of the vector followed by (16-eax) bytes which are either junk or belonging to the next vector. The value in eax should then be used for masking out or ignoring the part of the register that contains junk data.

Here we are taking advantage of the fact that vector sizes, cache line sizes and memory page sizes are always powers of 2. The cache line boundaries and memory page boundaries will therefore coincide with vector boundaries. While we are reading some irrelevant data with this method, we will never load any unnecessary cache line, because cache lines are always aligned by some multiple of 16. There is therefore no cost to reading the irrelevant data. And, more importantly, we will never get a read fault for reading from non-existing memory addresses, because data memory is always allocated in pages of 4096 (= 2^12) bytes or more, so the aligned vector read can never cross a memory page boundary.

An example using this method is shown in the strlenSSE2.asm example in the appendix www.agner.org/optimize/asmexamples.zip.

Combine two unaligned vector parts into one aligned vector

The above method can be extended by combining the valid parts of two registers into one full register.
If xmm1 in example 13.13 is shifted right by the unalignment value (eax) and xmm2 is shifted left by (16-eax) then the two registers can be combined into one vector containing only valid data.

Unfortunately, there is no instruction to shift a whole vector register right or left by a variable count (analogous to shr eax,cl), so we have to use a shuffle instruction. The next example uses a byte shuffle instruction, PSHUFB, with appropriate mask values for shifting a vector register right or left:

; Example 13.14. Combining two unaligned parts into one vector.
; (Supplementary-SSE3 instruction set required)
; This example takes the squareroot of n floats in an unaligned
; array src and stores the result in an aligned array dest.
; C++ code:
; const int n = 100;
; float * src;
; float dest[n];
; for (int i = 0; i < n; i++) dest[i] = sqrt(src[i]);


L:  ; Loop
movdqa xmm1, [esi+ecx]      ; Read from preceding boundary
movdqa xmm2, [esi+ecx+10H]  ; Read next block
pshufb xmm1, xmm4           ; Shift right by a
pshufb xmm2, xmm5           ; Shift left by 16-a
por    xmm1, xmm2           ; Combine blocks
sqrtps xmm1, xmm1           ; Compute four squareroots
movaps [edi+ecx], xmm1      ; Save result aligned
add    ecx, 10H             ; Loop to next four values
cmp    ecx, 400             ; 4*n
jb     L                    ; Loop

This method gets easier if the value of the misalignment (eax) is a known constant. We can use PSRLDQ and PSLLDQ instead of PSHUFB for shifting the registers right and left. PSRLDQ and PSLLDQ belong to the SSE2 instruction set, while PSHUFB requires supplementary-SSE3. The number of instructions can be reduced by using the PALIGNR instruction (supplementary-SSE3) when the shift count is constant. Certain tricks with MOVSS or MOVSD are possible when the misalignment is 4, 8 or 12, as shown in example 13.15b below.

The fastest implementation has sixteen branches of code for each of the possible alignment values in eax. See the examples for memcpy and memset in the appendix www.agner.org/optimize/asmexamples.zip.

Combining unaligned vector parts from same array

Misalignment is inevitable when an operation is performed between two elements in the same array and the distance between the elements is not a multiple of the vector size. In the following C++ example, each array element is added to the preceding element:

// Example 13.15a. Add each array element to the preceding element
float x[100];
for (int i = 0; i < 99; i++) x[i] += x[i+1];

Here x[i+1] is inevitably misaligned relative to x[i].
We can still use the method of combining parts of two aligned vectors, and take advantage of the fact that the misalignment is a known constant, 4.

; Example 13.15b. Add each array element to the preceding element
xor    ecx, ecx             ; Loop counter i = 0
lea    esi, x               ; Pointer to aligned array
movaps xmm0, [esi]          ; x3,x2,x1,x0
L:  ; Loop
movaps xmm2, xmm0           ; x3,x2,x1,x0
movaps xmm1, [esi+ecx+10H]  ; x7,x6,x5,x4
; Use trick with movss to combine parts of two vectors
movss  xmm2, xmm1           ; x3,x2,x1,x4
; Rotate right by 4 bytes
shufps xmm2, xmm2, 00111001B ; x4,x3,x2,x1
addps  xmm0, xmm2           ; x3+x4,x2+x3,x1+x2,x0+x1
movaps [esi+ecx], xmm0      ; Save aligned
movaps xmm0, xmm1           ; Save for next iteration
add    ecx, 16              ; i += 4
cmp    ecx, 400-16          ; Repeat for 96 values
jb     L
; Do the last odd one:
movaps xmm2, xmm0           ; x99,x98,x97,x96
xorps  xmm1, xmm1           ; 0,0,0,0
movss  xmm2, xmm1           ; x99,x98,x97,0
; Rotate right by 4 bytes
shufps xmm2, xmm2, 00111001B ; 0,x99,x98,x97
addps  xmm0, xmm2           ; x99+0,x98+x99,x97+x98,x96+x97


movaps [esi+ecx], xmm0      ; Save aligned

Now we have discussed several methods for reading unaligned vectors. Of course we also have to discuss how to write vectors to unaligned addresses. Some of the methods are virtually the same.

Using unaligned write instructions

The instructions movdqu, movups, and movupd are all able to write unaligned vectors. The unaligned write instructions are relatively slow on current processors (2008), but it is expected that these instructions will be faster on future AMD and Intel processors.

; Example 13.16. Unaligned vector write
; edi contains pointer to unaligned array
movdqu [edi], xmm0      ; Write vector unaligned

Writing 8 bytes at a time

Instructions that write 8 bytes or less have no alignment requirements and are often quite fast unless the write crosses a cache boundary. Example:

; Example 13.17. Unaligned vector write split in two
; edi contains pointer to unaligned array
movq   qword ptr [edi], xmm0    ; Write lower half of vector
movhps qword ptr [edi+8], xmm0  ; Write upper half of vector

This method is faster than using unaligned write instructions on current processors (2008), but probably not on future processors.

Partially overlapping writes

Make the first write to the unaligned address and the next write to the nearest following 16-bytes boundary. The first two writes will therefore possibly overlap:

; Example 13.18. First unaligned write overlaps next aligned write
; edi contains pointer to unaligned array
movdqu [edi], xmm1      ; Write vector unaligned
add    edi, 10H
and    edi, -10H        ; = nearest following 16B boundary
movdqa [edi], xmm2      ; Write next vector aligned

Here, the data from xmm1 and xmm2 will be partially overlapping if edi is not divisible by 16. This, of course, only works if the algorithm can generate the overlapping data. The last vector write can also overlap with the last-but-one if the end of the array is not aligned. See the memset example in the appendix www.agner.org/optimize/asmexamples.zip.

Writing the beginning and the end separately

Use non-vector instructions for writing from the beginning of an unaligned array until the first 16-bytes boundary, and again from the last 16-bytes boundary to the end of the array. See the memcpy example in the appendix www.agner.org/optimize/asmexamples.zip.

Using masked write

The instruction MASKMOVDQU can be used for writing to the first part of an unaligned array up to the first 16-bytes boundary as well as the last part after the last 16-bytes boundary. This method is quite slow because it doesn't use the cache.

13.6 Using AVX instruction set and YMM registers

A planned extension of the 128-bit XMM registers to 256-bit YMM registers is expected to be available in 2010. Further extensions to 512 or 1024 bits are likely in the future. Each XMM register forms the lower half of a YMM register. A YMM register can hold a vector of 8 single precision or 4 double precision floating point numbers. Integer vector instructions may be available in later versions. Most XMM and YMM instructions will allow three operands: one destination and two source operands.

Almost all the existing XMM instructions will have two versions in the AVX instruction set: a legacy version which leaves the upper half (bit 128-255) of the target YMM register unchanged, and a VEX-prefix version which sets the upper half to zero in order to remove false dependences. If a code sequence contains both 128-bit and 256-bit instructions then it is important to use the VEX-prefix version of the 128-bit instructions. Mixing 256-bit instructions with legacy 128-bit instructions without VEX prefix will cause the processor to switch the register file between different states, which may cost many clock cycles.

For the sake of compatibility with existing code, the register set will have three different states:

A. The upper half of all YMM registers is unused and known to be zero.

B. The upper half of at least one YMM register is used and contains data.

C. All YMM registers are split in two. The lower half is used by legacy XMM instructions which leave the upper part unchanged. All the upper-part halves are stored in a scratchpad. The two parts of each register will be joined together again if needed by a transition to state B.

Two instructions are available for the sake of fast transition between the states:
VZEROALL, which sets all YMM registers to zero, and VZEROUPPER, which sets the upper half of all the registers to zero. Both instructions give state A. The state transitions can be illustrated by the following state transition table:

                        Current state:
Instruction             A      B      C
VZEROALL / VZEROUPPER   A      A      A
XMM                     A      C      C
VEX XMM                 A      B      B
YMM                     B      B      B

Table 13.9. YMM state transitions

State A is the neutral initial state. State B is needed when the full YMM registers are used. State C is needed for making legacy XMM code fast and free of false dependences when called from state B. A transition from state B to C is costly because all YMM registers must be split in two halves which are stored separately. A transition from state C to B is equally costly because all the registers have to be merged together again. A transition from state C to A is also costly because the scratchpad containing the upper part of the registers does not support out-of-order access and must be protected from speculative execution in possibly mispredicted branches. The transitions B → C, C → B and C → A can take up to 50 clock cycles or more because they have to wait for all registers to retire. Transitions A → B and B → A are fast, probably taking only one clock cycle. C should be regarded as an undesired state, and the transition A → C is not possible.

The following examples illustrate the state transitions:

; Example 13.19a. Transition between YMM states
vaddps ymm0, ymm1, ymm2     ; State B
addss  xmm3, xmm4           ; State C
vmulps ymm0, ymm0, ymm5     ; State B


Example 13.19a has two expensive state transitions: from B to C, and back to state B. This can be avoided by replacing the legacy ADDSS instruction by the VEX-coded version VADDSS:

; Example 13.19b. Transition between YMM states avoided
vaddps ymm0, ymm1, ymm2     ; State B
vaddss xmm3, xmm3, xmm4     ; State B
vmulps ymm0, ymm0, ymm5     ; State B

This method cannot be used when calling a library function that uses XMM instructions. The solution here is to save any used YMM registers and go to state A:

; Example 13.19c. Transition to state A
vaddps  ymm0, ymm1, ymm2    ; State B
vmovaps [mem], ymm0         ; Save ymm0
vzeroupper                  ; State A
call    XMM_Function        ; Legacy function
vmovaps ymm0, [mem]         ; Restore ymm0
vmulps  ymm0, ymm0, ymm5    ; State B
...
XMM_Function proc near
addss   xmm3, xmm4          ; State A
ret

Any program that mixes XMM and YMM code should follow certain guidelines in order to avoid the costly transitions to and from state C:

• A function that uses YMM instructions should issue a VZEROALL or VZEROUPPER before returning if there is a possibility that it might be called from XMM code.

• A function that uses YMM instructions should save any used YMM register and issue a VZEROALL or VZEROUPPER before calling any function that might contain XMM code.

• A function that has a CPU dispatcher for choosing YMM code if available and XMM code otherwise, should issue a VZEROALL or VZEROUPPER before leaving the YMM part.

In other words: The ABI allows any function to use XMM or YMM registers, but a function that uses YMM registers must leave the registers in state A before calling any ABI compliant function and before returning to any ABI compliant function.

Obviously, this does not apply to functions that use full YMM registers for parameter
transfer or return. A function that uses YMM registers for parameter transfer allows state B on entry. A function that uses a YMM register (YMM0) for the return value can be in state B on return. State C should always be avoided.

It is more efficient to use VZEROALL than VZEROUPPER to end a section of YMM code because VZEROALL tells the CPU that all YMM registers are now unused. This frees resources in the out-of-order core and it also makes state transitions faster because the YMM registers do not have to be saved.

VZEROALL cannot be used in 64-bit Windows because the ABI specifies that registers XMM6 - XMM15 have callee-save status. In other words, the calling function can assume that registers XMM6 - XMM15, but not the upper halves of the YMM registers, are unchanged after return. No XMM or YMM registers have callee-save status in 32-bit Windows or in any Unix system (Linux, BSD, Mac). Therefore it is OK to use VZEROALL in e.g. 64-bit Linux. Obviously,


VZEROALL cannot be used if any XMM register contains a function parameter or return value. VZEROUPPER must be used in these cases.

YMM and system code

A situation where transitions between state B and C must take place is when YMM code is interrupted and the interrupt handler contains legacy XMM code that saves the XMM registers but not the full YMM registers. In fact, state C was invented exactly for the sake of preserving the upper part of the YMM registers in this situation.

It is very important to follow certain rules when writing device drivers and other system code that might be called from interrupt handlers. If any YMM register is modified by a YMM instruction or a VEX-XMM instruction then it is necessary to save the entire register state with XSAVE first and restore it with XRSTOR before returning. It is not sufficient to save the individual YMM registers because future processors, which may extend the YMM registers further to 512-bit ZMM registers (or whatever they will be called), will zero-extend the YMM registers to ZMM when executing YMM instructions and thereby destroy the highest part of the ZMM registers. XSAVE / XRSTOR is the only way of saving these registers that is compatible with future extensions beyond 256 bits. Future extensions will not use the complicated method of having two versions of every YMM instruction.

If a device driver does not use XSAVE / XRSTOR then there is a risk that it might inadvertently use YMM registers even if the programmer did not intend this. The compiler may insert implicit calls to library functions such as memset and memcpy.
These functions typically have their own CPU dispatcher which selects the largest register size available. Letting the operating system save everything with XSAVE before calling a device driver from an interrupt is the safest solution, but not the most efficient solution. See manual 5: "Calling conventions" for further discussion.

Using non-destructive three-operand instructions

The AVX instruction set which defines the YMM instructions also defines an alternative encoding of all existing XMM instructions by replacing existing prefixes and escape codes with the new VEX prefix. The VEX prefix has the further advantage that it defines an extra register operand. Almost all XMM instructions that previously had two operands now have three operands when the VEX prefix is used. This includes the integer vector XMM instructions that do not yet support the full YMM registers.

The two-operand version of an instruction typically uses the same register for the destination and for one of the source operands:

; Example 13.20a. Two-operand instruction
addss xmm1, xmm2            ; xmm1 = xmm1 + xmm2

This has the disadvantage that the result overwrites the value of the source operand in xmm1. The three-operand version allows both source operands to remain unchanged:

; Example 13.20b. Three-operand instruction
vaddss xmm0, xmm1, xmm2     ; xmm0 = xmm1 + xmm2

Here the source operand in xmm1 is not destroyed because the result can be stored in a different destination register.
This can be useful for avoiding register-to-register moves. The instructions in example 13.20a and 13.20b have exactly the same length. Therefore there is no penalty for using the three-operand version. The instructions with names ending in ps (packed single precision) are one byte shorter in the two-operand version than the three-operand version if the source register is not xmm8 - xmm15. The three-operand version is shorter than the two-operand version in a few cases. In most cases the two- and three-operand versions have the same length.
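When writing intrinsics in C, the non-destructive three-operand pattern is what you get implicitly: the compiler keeps both source operands live and, when compiling for AVX, can emit the VEX-coded versions. A hedged sketch using scalar SSE intrinsics (the function name is mine, not from this manual):

```c
#include <emmintrin.h>  /* SSE/SSE2 intrinsics */

/* With intrinsics the non-destructive form of VADDSS falls out
   naturally: both sources stay usable after the addition, so no
   extra register-to-register move is needed to preserve them. */
static float demo_nondestructive(void)
{
    __m128 x = _mm_set_ss(3.0f);
    __m128 y = _mm_set_ss(4.0f);
    __m128 sum  = _mm_add_ss(x, y);  /* like vaddss xmm0,xmm1,xmm2 */
    /* x and y are still intact, e.g. for a second result: */
    __m128 prod = _mm_mul_ss(x, y);  /* like vmulss xmm3,xmm1,xmm2 */
    return _mm_cvtss_f32(sum) + _mm_cvtss_f32(prod); /* 7 + 12 = 19 */
}
```

Without VEX, a compiler translating the same source must insert a MOVAPS-class copy before the destructive ADDSS in order to keep `x` alive for the multiplication.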


It is possible to mix two-operand and three-operand instructions in the same code as long as the register set is in state A. But if the register set happens to be in state C for whatever reason then the mixing of XMM instructions with and without VEX will cause a costly state change every time the instruction type changes. It is therefore better to use VEX versions only or non-VEX only. If YMM registers are used (state B) then you should use only the VEX-prefix version for all XMM instructions until the VEX section of code is ended with a VZEROALL or VZEROUPPER.

The 64-bit MMX instructions and general purpose register instructions do not have three-operand versions. There is no penalty for mixing MMX and VEX instructions. It is possible that some general purpose register instructions will have three-operand versions in a future instruction set.

Compilers should use the VEX-prefix version for all XMM instructions, including intrinsic functions, if compiling for the AVX instruction set. The Intel compiler will have switches and #pragmas for this as well as for making inline assembly instructions with or without VEX prefix. The compiler using AVX should also issue a VZEROALL or VZEROUPPER before calling any library function unless there is a separate VEX version of the function.

Taylor series example

The following example illustrates the use of the AVX instruction set by using three-operand instructions and a vector size of 256 bits.
The example computes the same Taylor series as<strong>in</strong> example 12.9 page 101.; Example 13.21a. Taylor expansion us<strong>in</strong>g AVX (64-bit mode); (This example has not been tested because the relevant; processor is not available yet).dataone dq 1.0 ; 1.0align 32coeff dq c0, c1, c2, ... ; Taylor coefficientscoeff_end label qword ; end of coeff. list.code; double Taylor (double x) {; double sum = 0;; for (<strong>in</strong>t i = 0; i < NumCoeff; i++) sum += coeff[i]*pow(x,i);; return sum;; }Taylor proc near; xmm0 = xvshufpd xmm2, xmm0, xmm0, 0 ; xmm2 = x, xvmovsd xmm1, [one] ; xmm1 = 1vshufpd xmm1, xmm1, xmm0, 0 ; xmm1 = x, 1vmulpd xmm2, xmm2, xmm2 ; xmm2 = x^2, x^2vmulpd xmm3, xmm1, xmm2 ; xmm3 = x^3, x^2vmulpd xmm2, xmm2, xmm2 ; xmm2 = x^4, x^4<strong>in</strong>sertf128 ymm2, ymm2, ymm2, 1 ; ymm2 = x^4, x^4, x^4, x^4<strong>in</strong>sertf128 ymm1, ymm1, ymm3, 1 ; ymm1 = x^3, x^2, x^1, 1vxorpd xmm0, xmm0, xmm0 ; ymm0 = 0lea rax, coeff ; po<strong>in</strong>t to coeff[i]lea rdx, coeff_end ; po<strong>in</strong>t to end of coeffL1: ; Loop for all coefficientsvmovapd ymm3, ymm1 ; save ymm1 before multiply<strong>in</strong>gvmulpd ymm1, ymm1, ymm2 ; ymm1 = x^7, x^6, x^5, x^4vmulpd ymm3, ymm3, coeff[rax] ; ymm3 = c3*x^3,..,c0*1vaddpd ymm0, ymm0, ymm3 ; ymm0 = sum3,sum2,sum1,sum0add rax, 32 ; next four termscmp rax, rdx ; stop at end of coeff listjb L1 ; loop127


        ; Add all elements in ymm0 to get result
        vextractf128 xmm3, ymm0, 1     ; xmm3 = sum3,sum2
        vaddpd  xmm0, xmm0, xmm3       ; xmm0 = sum1+sum3,sum0+sum2
        vhaddpd xmm0, xmm0, xmm0       ; xmm0 = return value
        ; clear upper registers for compatibility with non-AVX code:
        vzeroupper
        ret
Taylor  endp

We have used 128-bit instructions rather than 256-bit instructions where possible because this may be faster on processors with 128-bit execution units. For example, the vxorpd instruction sets the entire ymm0 register to zero, even though it uses only the 128-bit xmm0 as operand. The v in front of the instruction names indicates that the VEX version is used in order to avoid mixing VEX and non-VEX instructions.

The execution time for the L1 loop is determined by the critical dependency chain of multiplying ymm1 with ymm2. The loop in example 13.21a is limited by the latency of the floating point multiplier, just as in example 12.9d page 104. If the latency is the same then nothing is gained by using ymm registers. The first processors with VEX instructions will probably have 128-bit execution units. Later versions may have 256-bit execution units. This will double the throughput of the multiplier, so that we can double the speed by unrolling the loop in example 13.21a once more and multiplying by x^8 each time.

We may use fused multiply-and-add instructions if the FMA instruction set is available:

; Example 13.21b. Taylor expansion using AVX and FMA
L1:     ; Loop for all coefficients
        vmovapd ymm3, ymm1             ; save ymm1 before multiplying
        vmulpd  ymm1, ymm1, ymm2       ; ymm1 = x^7, x^6, x^5, x^4
        vfmaddpd ymm0, ymm3, coeff[rax], ymm0 ; ymm0 += c3*x^3,..,c0*1
        add     rax, 32                ; next four terms
        cmp     rax, rdx               ; stop at end of coeff list
        jb      L1

However, this probably gives no advantage in this case because the bottleneck is the latency of vmulpd.

13.7 Vector operations in general purpose registers

Sometimes it is possible to handle packed data in 32-bit or 64-bit general purpose registers. You may use this method on processors where integer operations are faster than vector operations, or where vector operations are not available.

A 64-bit register can hold two 32-bit integers, four 16-bit integers, eight 8-bit integers, or 64 Booleans. When doing calculations on packed integers in 32-bit or 64-bit registers, you have to take special care to avoid carries from one operand going into the next operand if overflow is possible. Carry does not occur if all operands are positive and so small that overflow cannot occur. For example, you can pack four positive 16-bit integers into RAX and use ADD RAX,RBX instead of PADDW MM0,MM1 if you are sure that overflow will not occur. If carry cannot be ruled out then you have to mask out the highest bit, as in the following example, which adds 2 to all four bytes in EAX:

; Example 13.22. Byte vector addition in 32-bit register
        mov     eax, [esi]             ; read 4-byte operand
        mov     ebx, eax               ; copy into ebx
        and     eax, 7f7f7f7fh         ; get lower 7 bits of each byte in eax
        xor     ebx, eax               ; get the highest bit of each byte
        add     eax, 02020202h         ; add desired value to all four bytes
        xor     eax, ebx               ; combine bits again


        mov     [edi], eax             ; store result

Here the highest bit of each byte is masked out to avoid a possible carry from each byte into the next one when adding. The code uses XOR rather than ADD to put the high bit back, in order to avoid carry. If the second addend may have the high bit set as well, it must be masked too. No masking is needed if neither of the two addends has the high bit set.

It is also possible to search for a specific byte. This C code illustrates the principle by checking if a 32-bit integer contains at least one byte of zero:

// Example 13.23. Return nonzero if dword contains null byte
inline int dword_has_nullbyte(int w) {
   return ((w - 0x01010101) & ~w & 0x80808080);
}

The output is zero if all four bytes of w are nonzero. The output will have 0x80 in the position of the first zero byte. If there is more than one zero byte then the subsequent bytes are not necessarily indicated. (This method was invented by Alan Mycroft and published in 1987. I published the same method in 1996 in the first version of this manual, unaware that Mycroft had made the same invention before me).

This principle can be used for finding the length of a zero-terminated string by searching for the first byte of zero.
It is faster than using REPNE SCASB:

; Example 13.24, optimized strlen procedure (32-bit):
; extern "C" int strlen (const char * s);
_strlen PROC NEAR
        push    ebx                    ; ebx must be saved
        mov     ecx, [esp+8]           ; get pointer to string
        mov     eax, ecx               ; copy pointer
        and     ecx, 3                 ; lower 2 bits, check alignment
        jz      L2                     ; s is aligned by 4. Go to loop
        and     eax, -4                ; align pointer by 4
        mov     ebx, [eax]             ; read from preceding boundary
        shl     ecx, 3                 ; *8 = displacement in bits
        mov     edx, -1
        shl     edx, cl                ; make byte mask
        not     edx                    ; mask = 0FFH for false bytes
        or      ebx, edx               ; mask out false bytes
        ; check first four bytes for zero
        lea     ecx, [ebx-01010101H]   ; subtract 1 from each byte
        not     ebx                    ; invert all bytes
        and     ecx, ebx               ; and these two
        and     ecx, 80808080H         ; test all sign bits
        jnz     L3                     ; zero-byte found
        ; Main loop, read 4 bytes aligned
L1:     add     eax, 4                 ; increment pointer
L2:     mov     ebx, [eax]             ; read 4 bytes of string
        lea     ecx, [ebx-01010101H]   ; subtract 1 from each byte
        not     ebx                    ; invert all bytes
        and     ecx, ebx               ; and these two
        and     ecx, 80808080H         ; test all sign bits
        jz      L1                     ; no zero bytes, continue loop
L3:     bsf     ecx, ecx               ; find right-most 1-bit
        shr     ecx, 3                 ; divide by 8 = byte index
        sub     eax, [esp+8]           ; subtract start address
        add     eax, ecx               ; add index to byte
        pop     ebx                    ; restore ebx


        ret                            ; return value in eax
_strlen ENDP

The alignment check makes sure that we are only reading from addresses aligned by 4. The function may read both before the beginning of the string and after the end, but since all reads are aligned, we will not cross any cache line boundary or page boundary unnecessarily. Most importantly, we will not get any page fault for reading beyond allocated memory, because page boundaries are always aligned by 2^12 or more.

If the SSE2 instruction set is available then it may be faster to use XMM instructions with the same alignment trick. Example 13.24, as well as a similar function using XMM registers, is provided in the appendix at www.agner.org/optimize/asmexamples.zip.

Other common functions that search for a specific byte, such as strcpy, strchr and memchr, can use the same trick. To search for a byte different from zero, just XOR with the desired byte, as illustrated in this C code:

// Example 13.25. Return nonzero if byte b contained in dword w
inline int dword_has_byte(int w, unsigned char b) {
   w ^= b * 0x01010101;
   return ((w - 0x01010101) & ~w & 0x80808080);
}

Note if searching backwards (e.g. strrchr) that the above method will indicate the position of the first occurrence of b in w in case there is more than one occurrence.

14 Multithreading

There is a limit to how much processing power you can get out of a single CPU. Therefore, many modern computer systems have multiple CPU cores. The way to make use of multiple CPU cores is to divide the computing job between multiple threads. The optimal number of threads is equal to the number of CPU cores. The workload should ideally be divided evenly between the threads.

Multithreading is useful where the code has an inherent parallelism that is coarse-grained. Multithreading cannot be used for fine-grained parallelism because there is a considerable overhead cost of starting and stopping threads and synchronizing them. Communication between threads can be quite costly. Therefore, the computing job should preferably be divided into threads at the highest possible level. If the outermost loop can be parallelized, then it should be divided into one loop for each thread, each doing its share of the whole job.

Thread-local storage should preferably use the stack, because accessing thread-local static memory is inefficient.

See manual 1: "Optimizing software in C++" for more details on multithreading.

15 CPU dispatching

What is optimal for one microprocessor may not be optimal for another. Therefore, you may make the most critical part of your program in different versions, each optimized for a specific microprocessor, and select the desired version at runtime after detecting which microprocessor the program is running on. The CPUID instruction tells which instructions the microprocessor supports. If you are using instructions that are not supported by all microprocessors, then you must first check if the program is running on a microprocessor that supports these instructions. If your program can benefit significantly from using Single-Instruction-Multiple-Data (SIMD) instructions, then you may make one version of a critical


part of the program that uses these instructions, and another version which does not and which is compatible with old microprocessors.

If you are in doubt how many different versions of the code to make, you should consider the expected lifetime of your software application and what kind of hardware the typical user can be expected to have. Most software applications have a lifetime of at least several years. The newest microprocessors that are available today will probably be the most common microprocessors in a few years' time. Therefore, it makes sense to optimize for the newest microarchitecture. Few users will run a speed-critical application on an old microprocessor. In most cases it will be sufficient to make one code version which is optimized for the newest processors and one version that is compatible with older processors. There is no reason to require the latest instruction set if you cannot take advantage of the instructions it contains. The SSE2 or SSE3 instruction set will satisfy most needs.

CPU dispatching can be implemented with branches or with a code pointer, as shown in the following example:

; Example 15.1. Function with CPU dispatching
MyFunction proc near
        ; Jump through pointer. The code pointer initially points to
        ; MyFunctionDispatch. MyFunctionDispatch changes the pointer
        ; so that it points to the appropriate version of MyFunction.
        ; The next time MyFunction is called, it jumps directly to
        ; the right version of the function
        jmp     [MyFunctionPoint]

; Code for each version. Put the most probable version first:
MyFunctionSSE3:
        ; SSE3 version of MyFunction
        ret
MyFunctionSSE2:
        ; SSE2 version of MyFunction
        ret
MyFunction386:
        ; Generic/80386 version of MyFunction
        ret

MyFunctionDispatch:
        ; Detect which instruction set is supported.
        ; Function InstructionSet is in asmlib
        call    InstructionSet         ; eax indicates instruction set
        mov     edx, offset MyFunction386
        cmp     eax, 4                 ; eax >= 4 if SSE2
        jb      DispEnd
        mov     edx, offset MyFunctionSSE2
        cmp     eax, 5                 ; eax >= 5 if SSE3
        jb      DispEnd
        mov     edx, offset MyFunctionSSE3
DispEnd:
        ; Save pointer to appropriate version of MyFunction
        mov     [MyFunctionPoint], edx
        jmp     edx                    ; Jump to this version
.data
MyFunctionPoint DD MyFunctionDispatch  ; Code pointer
.code
MyFunction endp


The function InstructionSet, which detects which instruction set is supported, is provided in the library that can be downloaded from www.agner.org/optimize/asmlib.zip. Most operating systems also have functions for this purpose. Obviously, it is recommended to store the output from InstructionSet rather than calling it again each time the information is needed. See also www.agner.org/optimize/asmexamples.zip for detailed examples of functions with CPU dispatching.

For assemblers that don't support the newest instruction sets, you may use the macros at www.agner.org/optimize/macros.zip for coding the new instructions.

15.1 Checking for operating system support for XMM and YMM registers

Unfortunately, the information that can be obtained from the CPUID instruction is not sufficient for determining whether it is possible to use XMM registers. The operating system has to save these registers during a task switch and restore them when the task is resumed. The microprocessor can disable the use of the XMM registers in order to prevent their use under old operating systems that do not save these registers. Operating systems that support the use of XMM registers must set bit 9 of the control register CR4 to enable the use of XMM registers and indicate their ability to save and restore these registers during task switches. (Saving and restoring registers is actually faster when XMM registers are enabled).

Unfortunately, the CR4 register can only be read in privileged mode. Application programs therefore have a problem determining whether they are allowed to use the XMM registers or not. According to official Intel documents, the only way for an application program to determine whether the operating system supports the use of XMM registers is to try to execute an XMM instruction and see if you get an invalid opcode exception. This is ridiculous, because not all operating systems, compilers and programming languages provide facilities for application programs to catch invalid opcode exceptions. The advantage of using XMM registers evaporates completely if you have no way of knowing whether you can use these registers without crashing your software.

These serious problems led me to search for an alternative way of checking if the operating system supports the use of XMM registers, and fortunately I have found a way that works reliably. If XMM registers are enabled, then the FXSAVE and FXRSTOR instructions can read and modify the XMM registers. If XMM registers are disabled, then FXSAVE and FXRSTOR cannot access these registers. It is therefore possible to check if XMM registers are enabled by trying to read and write these registers with FXSAVE and FXRSTOR. The subroutines in www.agner.org/optimize/asmlib.zip use this method. These subroutines can be called from assembly as well as from high-level languages, and provide an easy way of detecting whether XMM registers can be used.

In order to verify that this detection method works correctly with all microprocessors, I first checked various manuals. The 1999 version of Intel's software developer's manual says about the FXRSTOR instruction: "The Streaming SIMD Extension fields in the save image (XMM0-XMM7 and MXCSR) will not be loaded into the processor if the CR4.OSFXSR bit is not set." AMD's Programmer's Manual says effectively the same. However, the 2003 version of Intel's manual says that this behavior is implementation dependent. In order to clarify this, I contacted Intel Technical Support and got the reply, "If the OSFXSR bit in CR4 is not set, then XMMx registers are not restored when FXRSTOR is executed". They further confirmed that this is true for all versions of Intel microprocessors and all microcode updates. I regard this as a guarantee from Intel that my detection method will work on all Intel microprocessors. We can rely on the method working correctly on AMD processors as well, since the AMD manual is unambiguous on this question. It appears to be safe to rely on this method working correctly on future microprocessors as well, because any microprocessor that deviates from the above specification would introduce a security problem as well as failing to run existing programs. Compatibility with existing programs is of great concern to microprocessor producers.

The subroutines in www.agner.org/optimize/asmlib.zip are constructed so that the detection will give a correct answer unless FXSAVE and FXRSTOR are both buggy. My detection method has been further verified by testing on many different versions of Intel and AMD processors and different operating systems.

The detection method recommended in Intel manuals has the drawback that it relies on the ability of the compiler and the operating system to catch invalid opcode exceptions. A Windows application using Intel's detection method would therefore have to be tested in all compatible operating systems, including various Windows emulators running under a number of other operating systems. My detection method does not have this problem because it is independent of compiler and operating system. My method has the further advantage that it makes modular programming easier, because a module, subroutine library, or DLL using XMM instructions can include the detection procedure, so that the problem of XMM support is of no concern to the calling program, which may even be written in a different programming language.
Some operating systems provide system functions that tell which instruction set is supported, but the method mentioned above is independent of the operating system.

See "Intel Advanced Vector Extensions Programming Reference" for details on how to check both the CPU and the operating system for support of YMM registers.

The above discussion has relied on the following documents:

Intel application note AP-900: "Identifying support for Streaming SIMD Extensions in the Processor and Operating System". 1999.

Intel application note AP-485: "Intel Processor Identification and the CPUID Instruction". 2002.

"Intel Architecture Software Developer's Manual, Volume 2: Instruction Set Reference", 1999.

"IA-32 Intel Architecture Software Developer's Manual, Volume 2: Instruction Set Reference", 2003.

"AMD64 Architecture Programmer's Manual, Volume 4: 128-Bit Media Instructions", 2003.

"Intel Advanced Vector Extensions Programming Reference", 2008.

16 Problematic Instructions

16.1 LEA instruction (all processors)

The LEA instruction is useful for many purposes because it can do a shift operation, two additions, and a move in just one instruction. Example:

; Example 16.1a, LEA instruction
        lea     eax, [ebx+8*ecx-1000]

is much faster than

; Example 16.1b
        mov     eax, ecx
        shl     eax, 3
        add     eax, ebx
        sub     eax, 1000

A typical use of LEA is as a three-register addition: lea eax,[ebx+ecx]. The LEA instruction can also be used for doing an addition or shift without changing the flags.

The processors have no documented addressing mode with a scaled index register and nothing else. Therefore, an instruction like lea eax,[ebx*2] is actually coded as lea eax,[ebx*2+00000000H] with an immediate displacement of 4 bytes. The size of this instruction can be reduced by writing lea eax,[ebx+ebx]. If you happen to have a register that is zero (like a loop counter after a loop) then you may use it as a base register to reduce the code size:

; Example 16.2, LEA instruction without base pointer
        lea     eax, [ebx*4]           ; 7 bytes
        lea     eax, [ecx+ebx*4]       ; 3 bytes

LEA with a scale factor is slow on the P4, and may be replaced by additions. This applies only to the LEA instruction, not to instructions accessing memory with a scaled index register.

The size of the base and index registers can be changed with an address size prefix. The size of the destination register can be changed with an operand size prefix (see prefixes, page 25). If the operand size is less than the address size then the result is truncated. If the operand size is more than the address size then the result is zero-extended.

The shortest version of LEA in 64-bit mode has 32-bit operand size and 64-bit address size, e.g. LEA EAX,[RBX+RCX], see page 75. Use this version when the result is sure to be less than 2^32.
Use the version with a 64-bit destination register for address calculation in 64-bit mode when the address may be bigger than 2^32.

The preferred version in 32-bit mode has 32-bit operand size and 32-bit address size. LEA with a 16-bit operand size is slow on AMD processors. LEA with a 16-bit address size in 32-bit mode should be avoided because the decoding of this instruction is slow on many processors.

LEA can also be used in 64-bit mode for loading a RIP-relative address. A RIP-relative address cannot be combined with base or index registers.

16.2 INC and DEC (all Intel processors)

The INC and DEC instructions do not modify the carry flag, but they do modify the other arithmetic flags. Writing to only part of the flags register costs an extra µop on P4 and P4E. It can cause a partial flags stall on other Intel processors if a subsequent instruction reads the carry flag or all the flag bits. On all processors, it can cause a false dependence on the carry flag from a previous instruction.

Use ADD and SUB when optimizing for speed. Use INC and DEC when optimizing for size or when no penalty is expected.

16.3 XCHG (all processors)

The XCHG register,[memory] instruction is dangerous. This instruction always has an implicit LOCK prefix which prevents it from using the cache. This instruction is therefore very time consuming, and should always be avoided.


The XCHG instruction with register operands may be useful when optimizing for size, as explained on page 71.

16.4 Shifts and rotates (P4)

Shifts and rotates on general purpose registers are slow on the P4. You may consider using MMX or XMM registers instead, or replacing left shifts by additions.

16.5 Rotates through carry (all processors)

RCR and RCL with CL, or with a count different from one, are slow on all processors and should be avoided.

16.6 Bit test (all processors)

BT, BTC, BTR, and BTS instructions should preferably be replaced by instructions like TEST, AND, OR, XOR, or shifts on older processors. Bit tests with a memory operand should be avoided on Intel processors. BTC, BTR, and BTS use 2 µops on AMD processors. Bit test instructions are useful when optimizing for size.

16.7 LAHF and SAHF (all processors)

LAHF is slow on P4 and P4E. Use SETcc instead for storing the value of a flag.

SAHF is slow on P4E and AMD processors. Use TEST instead for testing a bit in AH. Use FCOMI if available as a replacement for the sequence FCOM / FNSTSW AX / SAHF.

LAHF and SAHF are not available in 64-bit mode on some early 64-bit Intel processors.

16.8 Integer multiplication (all processors)

An integer multiplication takes from 3 to 14 clock cycles, depending on the processor. It is therefore often advantageous to replace a multiplication by a constant with a combination of other instructions such as SHL, ADD, SUB, and LEA. For example, IMUL EAX,5 can be replaced by LEA EAX,[EAX+4*EAX].

16.9 Division (all processors)

Both integer division and floating point division are quite time consuming on all processors. Various methods for reducing the number of divisions are explained in manual 1: "Optimizing software in C++". Several methods to improve code that contains division are discussed below.

Integer division by a power of 2 (all processors)

Integer division by a power of two can be done by shifting right. Dividing an unsigned integer by 2^N:

; Example 16.3. Divide unsigned integer by 2^N
        shr     eax, N

Dividing a signed integer by 2^N:

; Example 16.4. Divide signed integer by 2^N
        cdq
        and     edx, (1 shl n) - 1     ; (Or: shr edx,32-n)
        add     eax, edx


        sar     eax, n

Obviously, you should prefer the unsigned version if the dividend is certain to be non-negative.

Integer division by a constant (all processors)

Dividing by a constant can be done by multiplying with the reciprocal. A useful algorithm for integer division by a constant is as follows:

To calculate the unsigned integer division q = x / d in integers with a word size of w bits, you first calculate the reciprocal of the divisor, f = 2^r / d, where r defines the position of the binary decimal point (radix point). Then multiply x with f and shift right r positions. The maximum value of r is w+b, where b is the number of binary digits in d minus 1 (b is the highest integer for which 2^b ≤ d). Use r = w+b to cover the maximum range for the value of the dividend x.

This method needs some refinement in order to compensate for rounding errors. The following algorithm will give you the correct result for unsigned integer division with truncation, i.e. the same result as the DIV instruction gives (thanks to Terje Mathisen who (re-)invented this method):

b = (the number of significant bits in d) - 1
r = w + b
f = 2^r / d

If f is an integer then d is a power of 2: go to case A.
If f is not an integer, then check if the fractional part of f is < 0.5.
If the fractional part of f < 0.5: go to case B.
If the fractional part of f > 0.5: go to case C.

case A (d = 2^b):
   result = x SHR b

case B (fractional part of f < 0.5):
   round f down to nearest integer
   result = ((x+1) * f) SHR r

case C (fractional part of f > 0.5):
   round f up to nearest integer
   result = (x * f) SHR r

Example: Assume that you want to divide by 5.

5 = 101B.
w = 32.
b = (number of significant binary digits) - 1 = 2
r = 32+2 = 34
f = 2^34 / 5 = 3435973836.8 = 0CCCCCCCC.CCC...(hexadecimal)

The fractional part is greater than a half: use case C. Round f up to 0CCCCCCCDH.

The following code divides EAX by 5 and returns the result in EDX:

; Example 16.5a. Divide unsigned integer by 5
        mov     edx, 0CCCCCCCDH
        mul     edx
        shr     edx, 2


After the multiplication, EDX contains the product shifted right 32 places. Since r = 34, you have to shift 2 more places to get the result. To divide by 10, just change the last line to SHR EDX,3.

In case B we would have:

; Example 16.5b. Divide unsigned integer, case B
        add     eax, 1
        mov     edx, f
        mul     edx
        shr     edx, b

This code works for all values of x except 0FFFFFFFFH, which gives zero because of overflow in the ADD EAX,1 instruction. If x = 0FFFFFFFFH is possible, then change the code to:

; Example 16.5c. Divide unsigned integer, case B, check for overflow
        mov     edx, f
        add     eax, 1
        jc      DOVERFL
        mul     edx
DOVERFL:shr     edx, b

If the value of x is limited, then you may use a lower value of r, i.e. fewer digits. There can be several reasons for using a lower value of r:

• You may set r = w = 32 to avoid the SHR EDX,b in the end.

• You may set r = 16+b and use a multiplication instruction that gives a 32-bit result rather than 64 bits. This will free the EDX register:

; Example 16.5d. Divide unsigned integer by 5, limited range
        imul    eax, 0CCCDH
        shr     eax, 18

• You may choose a value of r that gives case C rather than case B in order to avoid the ADD EAX,1 instruction.

The maximum value for x in these cases is at least 2^(r-b), sometimes higher. You have to do a systematic test if you want to know the exact maximum value of x for which the code works correctly.

You may want to replace the slow multiplication instruction with faster instructions as explained on page 135.

The following example divides EAX by 10 and returns the result in EAX. I have chosen r = 17 rather than 19 because it happens to give code that is easier to optimize, and covers the same range for x.
f = 2^17 / 10 = 3333H, case B: q = ((x+1) * 3333H) SHR 17:

; Example 16.5e. Divide unsigned integer by 10, limited range
        lea     ebx, [eax+2*eax+3]
        lea     ecx, [eax+2*eax+3]
        shl     ebx, 4
        mov     eax, ecx
        shl     ecx, 8
        add     eax, ebx
        shl     ebx, 8
        add     eax, ecx
        add     eax, ebx
        shr     eax, 17
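The LEA/shift chain above is a strength-reduced evaluation of ((x+1) * 3333H) SHR 17. A C sketch of the same arithmetic (the function name div10_small is mine, for illustration) makes the equivalence easy to verify:

```c
#include <assert.h>
#include <stdint.h>

/* What Example 16.5e computes, step by step: each statement mirrors one
   instruction, accumulating 13107*(x+1) = 3333H*(x+1) before the shift. */
static uint32_t div10_small(uint32_t x)
{
    uint32_t b = 3 * x + 3;   /* lea ebx: 3*(x+1) */
    uint32_t c = b;           /* lea ecx: 3*(x+1) */
    uint32_t a;
    b <<= 4;                  /* 48*(x+1) */
    a = c;                    /* 3*(x+1) */
    c <<= 8;                  /* 768*(x+1) */
    a += b;                   /* 51*(x+1) */
    b <<= 8;                  /* 12288*(x+1) */
    a += c;                   /* 819*(x+1) */
    a += b;                   /* 13107*(x+1) = 3333H*(x+1) */
    return a >> 17;           /* shr eax, 17 */
}
```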


A systematic test shows that this code works correctly for all x < 10004H.

The division method can also be used for vector operands. Example 16.5f divides eight unsigned 16-bit integers by 10:

; Example 16.5f. Divide vector of unsigned integers by 10
.data
align 16
RECIPDIV DW 8 dup (0CCCDH)     ; Vector of reciprocal divisor

.code
        pmulhuw xmm0, RECIPDIV
        psrlw   xmm0, 3

Repeated integer division by the same value (all processors)
If the divisor is not known at assembly time, but you are dividing repeatedly by the same divisor, then you may use the same method as above. The code has to distinguish between cases A, B and C and calculate f before doing the divisions.

The code that follows shows how to do multiple divisions with the same divisor (unsigned division with truncation). First call SET_DIVISOR to specify the divisor and calculate the reciprocal, then call DIVIDE_FIXED for each value to divide by the same divisor.

; Example 16.6. Repeated integer division with same divisor
.DATA
RECIPROCAL_DIVISOR DD ?        ; rounded reciprocal divisor
CORRECTION         DD ?        ; case A: -1, case B: 1, case C: 0
BSHIFT             DD ?
                               ; number of bits in divisor - 1

.CODE
SET_DIVISOR PROC NEAR          ; divisor in EAX
        push    ebx
        mov     ebx, eax
        bsr     ecx, eax       ; b = number of bits in divisor - 1
        mov     edx, 1
        jz      ERROR          ; error: divisor is zero
        shl     edx, cl        ; 2^b
        mov     [BSHIFT], ecx  ; save b
        cmp     eax, edx
        mov     eax, 0
        je      short CASE_A   ; divisor is a power of 2
        div     ebx            ; 2^(32+b) / d
        shr     ebx, 1         ; divisor / 2
        xor     ecx, ecx
        cmp     edx, ebx       ; compare remainder with divisor/2
        setbe   cl             ; 1 if case B
        mov     [CORRECTION], ecx ; correction for rounding errors
        xor     ecx, 1
        add     eax, ecx       ; add 1 if case C
        mov     [RECIPROCAL_DIVISOR], eax ; rounded reciprocal divisor
        pop     ebx
        ret
CASE_A: mov     [CORRECTION], -1 ; remember that we have case A
        pop     ebx
        ret
SET_DIVISOR ENDP

DIVIDE_FIXED PROC NEAR         ; dividend in EAX, result in EAX
        mov     edx, [CORRECTION]
        mov     ecx, [BSHIFT]


        test    edx, edx
        js      DSHIFT         ; divisor is a power of 2
        add     eax, edx       ; correct for rounding error
        jc      DOVERFL        ; correct for overflow
        mul     [RECIPROCAL_DIVISOR] ; multiply with reciprocal divisor
        mov     eax, edx
DSHIFT: shr     eax, cl        ; adjust for number of bits
        ret
DOVERFL:mov     eax, [RECIPROCAL_DIVISOR] ; dividend = 0FFFFFFFFH
        shr     eax, cl        ; do division by shifting
        ret
DIVIDE_FIXED ENDP

This code gives the same result as the DIV instruction for 0 ≤ x < 2^32, 0 < d < 2^32.

The line JC DOVERFL and its target are not needed if you are certain that x < 0FFFFFFFFH.

If powers of 2 occur so seldom that it is not worth optimizing for them, then you may leave out the jump to DSHIFT and instead do a multiplication with CORRECTION = 0 for case A.

Floating point division (all processors)
Two or more floating point divisions can be combined into one, using the method described in manual 1: "Optimizing software in C++".

The time it takes to make a floating point division depends on the precision. When floating point registers are used, you can make division faster by specifying a lower precision in the floating point control word. This also speeds up the FSQRT instruction, but not any other instructions.
When XMM registers are used, you don't have to change any control word. Just use single-precision instructions if your application allows this.

It is not possible to do a floating point division and an integer division at the same time because they use the same execution unit on most processors.

Using reciprocal instruction for fast division (processors with SSE)
On processors with the SSE instruction set, you can use the fast reciprocal instruction RCPSS or RCPPS on the divisor and then multiply with the dividend. However, the precision is only 12 bits. You can increase the precision to 23 bits by using the Newton-Raphson method described in Intel's application note AP-803:

   x0 = rcpss(d)
   x1 = x0 * (2 - d * x0) = 2*x0 - d*x0*x0

where x0 is the first approximation to the reciprocal of the divisor d, and x1 is a better approximation. You must use this formula before multiplying with the dividend.

; Example 16.7. Fast division, single precision (SSE)
        movaps  xmm1, [divisors]  ; load divisors
        rcpps   xmm0, xmm1        ; approximate reciprocal
        mulps   xmm1, xmm0        ; Newton-Raphson formula
        mulps   xmm1, xmm0
        addps   xmm0, xmm0
        subps   xmm0, xmm1
        mulps   xmm0, [dividends] ; results in xmm0

This makes four divisions in approximately 23 clock cycles on a PM with a precision of 23 bits. Increasing the precision further by repeating the Newton-Raphson formula with double precision is possible, but not very advantageous.
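The Newton-Raphson step is easy to sanity-check in scalar C. In this sketch a deliberately perturbed 1/d stands in for the 12-bit RCPSS approximation; the function name is illustrative, not from the manual:

```c
#include <assert.h>

/* One Newton-Raphson iteration for the reciprocal: x1 = x0*(2 - d*x0).
   An input with relative error e comes out with error about e squared. */
static float rcp_refine(float d, float x0)
{
    return x0 * (2.0f - d * x0);
}
```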


If you want to use this method for integer divisions, then you have to check for rounding errors. The following code makes four integer divisions with truncation on packed word-size integers in approximately 39 clock cycles on the PM. It gives exactly the same results as the DIV instruction for 0 ≤ dividend ≤ 7FFFH and 0 < divisor ≤ 7FFFH:

; Example 16.8. Integer division with packed 16-bit words (SSE2)
; compute QUOTIENTS = DIVIDENDS / DIVISORS
        movq      xmm1, [DIVISORS]  ; load four divisors
        movq      xmm2, [DIVIDENDS] ; load four dividends
        pxor      xmm0, xmm0        ; temporary 0
        punpcklwd xmm1, xmm0        ; convert divisors to dwords
        punpcklwd xmm2, xmm0        ; convert dividends to dwords
        cvtdq2ps  xmm1, xmm1        ; convert divisors to floats
        cvtdq2ps  xmm2, xmm2        ; convert dividends to floats
        rcpps     xmm0, xmm1        ; approximate reciprocal of divisors
        mulps     xmm1, xmm0        ; improve precision with..
        mulps     xmm1, xmm0        ; Newton-Raphson method
        addps     xmm0, xmm0
        subps     xmm0, xmm1        ; reciprocal divisors (23 bit precision)
        mulps     xmm0, xmm2        ; multiply with dividends
        cvttps2dq xmm0, xmm0        ; truncate result of division
        packssdw  xmm0, xmm0        ; convert quotients to word size
        movq      xmm1, [DIVISORS]  ; load divisors again
        movq      xmm2, [DIVIDENDS] ; load dividends again
        psubw     xmm2, xmm1        ; dividends - divisors
        pmullw    xmm1, xmm0        ; divisors * quotients
        pcmpgtw   xmm1, xmm2        ; -1 if quotient not too small
        pcmpeqw   xmm2, xmm2        ; make integer -1's
        pxor      xmm1, xmm2        ; -1 if quotient too small
        psubw     xmm0, xmm1        ; correct quotient
        movq      [QUOTIENTS], xmm0 ; save the four corrected quotients

This code checks if the result is too small and makes the appropriate correction.
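The correction idea can be modeled in scalar C. In this sketch an exact 1.0f/d plays the role of the refined 23-bit reciprocal, and the if-statement mirrors what the PCMPGTW / PXOR / PSUBW sequence does per lane; the function name is an illustrative assumption:

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of Example 16.8: divide via a float reciprocal, truncate,
   then bump the quotient if it came out one too small. With a reciprocal
   this accurate and operands below 8000H, the raw quotient is never too
   large and never short by more than one. */
static uint16_t div16_approx(uint16_t x, uint16_t d)
{
    float r = 1.0f / (float)d;                       /* refined reciprocal */
    uint16_t q = (uint16_t)((float)x * r);           /* truncate, like CVTTPS2DQ */
    if ((int32_t)x - (int32_t)(q * d) >= (int32_t)d) /* quotient one too small? */
        q++;                                         /* correct it */
    return q;
}
```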
It is not necessary to check if the result is too big.

16.10 String instructions (all processors)
String instructions without a repeat prefix are too slow and should be replaced by simpler instructions. The same applies to LOOP on all processors and to JECXZ on some processors.

REP MOVSD and REP STOSD are quite fast if the repeat count is not too small. Always use the largest word size possible (DWORD in 32-bit mode, QWORD in 64-bit mode), and make sure that both source and destination are aligned by the word size. In most cases, however, it is faster to use XMM registers. Moving data in XMM registers is faster than REP MOVSD and REP STOSD in most cases. See page 153 for details.

Note that while the REP MOVS instruction writes a word to the destination, it reads the next word from the source in the same clock cycle. You can have a cache bank conflict if bits 2-4 are the same in these two addresses on P2 and P3. In other words, you will get a penalty of one clock extra per iteration if ESI+WORDSIZE-EDI is divisible by 32. The easiest way to avoid cache bank conflicts is to align both source and destination by 8. Never use MOVSB or MOVSW in optimized code, not even in 16-bit mode.

On PM, REP MOVS and REP STOS can perform fast by moving an entire cache line at a time. This happens only when the following conditions are met:

• Both source and destination must be aligned by 8.

• Direction must be forward (direction flag cleared).


• The difference between EDI and ESI must be numerically greater than or equal to 64.

• The memory type for both source and destination must be either write-back or write-combining (you can normally assume this).

Under these conditions, the speed is approximately 5-6 bytes per clock cycle for both instructions, which is more than 4 times as fast as when the above conditions are not met. The speed is higher for high values of ECX.

On P4, the number of clock cycles for REP MOVSD is difficult to predict, but it is always faster to use MOVAPS for moving data, except possibly for small repeat counts if a loop would suffer a branch misprediction penalty.

While the string instructions can be quite convenient, it must be emphasized that other solutions are faster in almost all cases. If the above conditions for fast move are not met, then there is a lot to gain by using other methods. See page 153 for alternatives to REP MOVS.

REP LODS, REP SCAS, and REP CMPS take more time per iteration than simple loops. See page 129 for alternatives to REPNE SCASB.

16.11 WAIT instruction (all processors)
You can often increase speed by omitting the WAIT instruction (also known as FWAIT). The WAIT instruction has three functions:

A. The old 8087 processor requires a WAIT before every floating point instruction to make sure the coprocessor is ready to receive it.

B. WAIT is used for coordinating memory access between the floating point unit and the integer unit. Examples:

; Example 16.9.
; Uses of WAIT:
B1:     fistp   [mem32]
        wait                   ; wait for FPU to write before..
        mov     eax, [mem32]   ; reading the result with the integer unit

B2:     fild    [mem32]
        wait                   ; wait for FPU to read value..
        mov     [mem32], eax   ; before overwriting it with integer unit

B3:     fld     qword ptr [esp]
        wait                   ; prevent an accidental interrupt from..
        add     esp, 8         ; overwriting value on stack

C. WAIT is sometimes used to check for exceptions. It will generate an interrupt if an unmasked exception bit in the floating point status word has been set by a preceding floating point instruction.

Regarding A:
The functionality in point A is never needed on any processor other than the old 8087. Unless you want your 16-bit code to be compatible with the 8087, you should tell your assembler not to put in these WAITs by specifying a higher processor. An 8087 floating point emulator also inserts WAIT instructions. You should therefore tell your assembler not to generate emulation code unless you need it.

Regarding B:


WAIT instructions to coordinate memory access are definitely needed on the 8087 and 80287, but not on the Pentiums. It is not quite clear whether they are needed on the 80387 and 80486. I have made several tests on these Intel processors and have not been able to provoke any error by omitting the WAIT on any 32-bit Intel processor, although Intel manuals say that the WAIT is needed for this purpose except after FNSTSW and FNSTCW. Omitting WAIT instructions for coordinating memory access is not 100 % safe, even when writing 32-bit code, because the code may be able to run on the very rare combination of an 80386 main processor with a 287 coprocessor, which requires the WAIT. Also, I have no information on non-Intel processors, and I have not tested all possible hardware and software combinations, so there may be other situations where the WAIT is needed.

If you want to be certain that your code will work on even the oldest 32-bit processors, then I would recommend that you include the WAIT here in order to be safe. If rare and obsolete hardware platforms such as the combination of 80386 and 80287 can be ruled out, then you may omit the WAIT.

Regarding C:
The assembler automatically inserts a WAIT for this purpose before the following instructions: FCLEX, FINIT, FSAVE, FSTCW, FSTENV, FSTSW. You can omit the WAIT by writing FNCLEX, etc. My tests show that the WAIT is unnecessary in most cases because these instructions without WAIT will still generate an interrupt on exceptions, except for FNCLEX and FNINIT on the 80387.
(There is some inconsistency about whether the IRET from the interrupt points to the FN.. instruction or to the next instruction.)

Almost all other floating point instructions will also generate an interrupt if a previous floating point instruction has set an unmasked exception bit, so the exception is likely to be detected sooner or later anyway. You may insert a WAIT after the last floating point instruction in your program to be sure to catch all exceptions.

You may still need the WAIT if you want to know exactly where an exception occurred in order to be able to recover from the situation. Consider, for example, the code under B3 above: if you want to be able to recover from an exception generated by the FLD here, then you need the WAIT because an interrupt after ADD ESP,8 might overwrite the value to load.

FNOP may be faster than WAIT on some processors and serve the same purpose.

16.12 FCOM + FSTSW AX (all processors)
The FNSTSW instruction is very slow on all processors. Most processors have FCOMI instructions to avoid the slow FNSTSW. Using FCOMI instead of the common sequence FCOM / FNSTSW AX / SAHF will save 4 - 8 clock cycles.
You should therefore use FCOMI to avoid FNSTSW wherever possible, even in cases where it costs some extra code.

On P1 and PMMX processors, which don't have FCOMI instructions, the usual way of doing floating point comparisons is:

; Example 16.10a.
        fld     [a]
        fcomp   [b]
        fstsw   ax
        sahf
        jb      ASmallerThanB

You may improve this code by using FNSTSW AX rather than FSTSW AX and testing AH directly rather than using the non-pairable SAHF:

; Example 16.10b.
        fld     [a]


        fcomp   [b]
        fnstsw  ax
        shr     ah, 1
        jc      ASmallerThanB

Testing for zero or equality:

; Example 16.10c.
        ftst
        fnstsw  ax
        and     ah, 40H        ; Don't use TEST instruction, it's not pairable
        jnz     IsZero         ; (the zero flag is inverted!)

Test if greater:

; Example 16.10d.
        fld     [a]
        fcomp   [b]
        fnstsw  ax
        and     ah, 41H
        jz      AGreaterThanB

On the P1 and PMMX, the FNSTSW instruction takes 2 clocks, but it is delayed for an additional 4 clocks after any floating point instruction because it is waiting for the status word to retire from the pipeline. You may fill this gap with integer instructions.

It is sometimes faster to use integer instructions for comparing floating point values, as described on pages 150 and 152.

16.13 FPREM (all processors)
The FPREM and FPREM1 instructions are slow on all processors. You may replace them by the following algorithm: multiply by the reciprocal divisor, get the fractional part by subtracting the truncated value, and then multiply by the divisor. (See page 149 on how to truncate on processors that don't have truncate instructions.)

Some documents say that these instructions may give incomplete reductions and that it is therefore necessary to repeat the FPREM or FPREM1 instruction until the reduction is complete. I have tested this on several processors, beginning with the old 8087, and have found no situation where a repetition of the FPREM or FPREM1 was needed.

16.14 FRNDINT (all processors)
This instruction is slow on all processors.
Replace it by:

; Example 16.11.
        fistp   qword ptr [TEMP]
        fild    qword ptr [TEMP]

This code is faster despite a possible penalty for attempting to read from [TEMP] before the write is finished. It is recommended to put other instructions in between in order to avoid this penalty. See page 149 on how to truncate on processors that don't have truncate instructions. On processors with SSE instructions, use the conversion instructions such as CVTSS2SI and CVTTSS2SI.


16.15 FSCALE and exponential function (all processors)
FSCALE is slow on all processors. Computing integer powers of 2 can be done much faster by inserting the desired power in the exponent field of the floating point number. To calculate 2^N, where N is a signed integer, select from the examples below the one that fits your range of N:

For |N| < 2^7 - 1 you can use single precision:

; Example 16.12a.
        mov     eax, [N]
        shl     eax, 23
        add     eax, 3f800000h
        mov     dword ptr [TEMP], eax
        fld     dword ptr [TEMP]

For |N| < 2^10 - 1 you can use double precision:

; Example 16.12b.
        mov     eax, [N]
        shl     eax, 20
        add     eax, 3ff00000h
        mov     dword ptr [TEMP], 0
        mov     dword ptr [TEMP+4], eax
        fld     qword ptr [TEMP]

For |N| < 2^14 - 1 use long double precision:

; Example 16.12c.
        mov     eax, [N]
        add     eax, 00003fffh
        mov     dword ptr [TEMP], 0
        mov     dword ptr [TEMP+4], 80000000h
        mov     dword ptr [TEMP+8], eax
        fld     tbyte ptr [TEMP]

On processors with SSE2 instructions, you can do these operations in XMM registers without the need for a memory intermediate (see page 152).

FSCALE is often used in the calculation of exponential functions. The following code shows an exponential function without the slow FRNDINT and FSCALE instructions:

; Example 16.13. Exponential function
; extern "C" long double exp (double x);
_exp    PROC    NEAR
        PUBLIC  _exp
        fldl2e
        fld     qword ptr [esp+4]  ; x
        fmul                       ; z = x*log2(e)
        fist    dword ptr [esp+4]  ; round(z)
        sub     esp, 12
        mov     dword ptr [esp], 0
        mov     dword ptr [esp+4], 80000000h
        fisub   dword ptr [esp+16] ; z - round(z)
        mov     eax, [esp+16]
        add     eax, 3fffh
        mov     [esp+8], eax
        jle     short UNDERFLOW
        cmp     eax, 8000h
        jge     short OVERFLOW
        f2xm1
        fld1
        fadd                       ; 2^(z-round(z))


        fld     tbyte ptr [esp]    ; 2^(round(z))
        add     esp, 12
        fmul                       ; 2^z = e^x
        ret
UNDERFLOW:
        fstp    st
        fldz                       ; return 0
        add     esp, 12
        ret
OVERFLOW:
        push    07f800000h         ; +infinity
        fstp    st
        fld     dword ptr [esp]    ; return infinity
        add     esp, 16
        ret
_exp    ENDP

16.16 FPTAN (all processors)
According to the manuals, FPTAN returns two values, X and Y, and leaves it to the programmer to divide Y by X to get the result; but in fact it always returns 1 in X, so you can save the division. My tests show that on all 32-bit Intel processors with a floating point unit or coprocessor, FPTAN always returns 1 in X regardless of the argument. If you want to be absolutely sure that your code will run correctly on all processors, then you may test if X is 1, which is faster than dividing by X. The Y value may be very high, but never infinity, so you don't have to test if Y contains a valid number if you know that the argument is valid.

16.17 FSQRT (SSE processors)
A fast way of calculating an approximate square root on processors with SSE is to multiply the reciprocal square root of x by x:

   sqrt(x) = x * rsqrt(x)

The instruction RSQRTSS or RSQRTPS gives the reciprocal square root with a precision of 12 bits. You can improve the precision to 23 bits by using the Newton-Raphson formula described in Intel's application note AP-803:

   x0 = rsqrtss(a)
   x1 = 0.5 * x0 * (3 - (a * x0) * x0)

where x0 is the first approximation to the reciprocal square root of a, and x1 is a better approximation. The order of evaluation is important.
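The refinement step can be sanity-checked in scalar C; here a deliberately perturbed value stands in for the 12-bit RSQRTSS output, and the function name is an illustrative assumption:

```c
#include <assert.h>

/* One Newton-Raphson iteration for the reciprocal square root:
   x1 = 0.5*x0*(3 - (a*x0)*x0), evaluated in the stated order. */
static float rsqrt_refine(float a, float x0)
{
    return 0.5f * x0 * (3.0f - (a * x0) * x0);
}
```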
You must use this formula before multiplying with a to get the square root.

16.18 FLDCW (Most Intel processors)
Many processors have a serious stall after the FLDCW instruction if followed by any floating point instruction which reads the control word (which almost all floating point instructions do).

When C or C++ code is compiled, it often generates a lot of FLDCW instructions because conversion of floating point numbers to integers is done with truncation while other floating point instructions use rounding. After translation to assembly, you can improve this code by using rounding instead of truncation where possible, or by moving the FLDCW out of a loop where truncation is needed inside the loop.


On the P4, this stall is even longer, approximately 143 clocks. But the P4 has made a special case out of the situation where the control word is alternating between two different values. This is the typical case in C++ programs where the control word is changed to specify truncation when a floating point number is converted to integer, and changed back to rounding after this conversion. The latency for FLDCW is 3 when the new value loaded is the same as the value of the control word before the preceding FLDCW. The latency is still 143, however, when loading the same value into the control word as it already has, if this is not the same as the value it had one time earlier.

See page 149 on how to convert floating point numbers to integers without changing the control word. On processors with SSE, use truncation instructions such as CVTTSS2SI instead.

17 Special topics

17.1 XMM versus floating point registers (Processors with SSE)
Processors with the SSE instruction set can do single precision floating point calculations in XMM registers. Processors with the SSE2 instruction set can also do double precision calculations in XMM registers. Floating point calculations are approximately equally fast in XMM registers and ST() registers.
The decision of whether to use the floating point stack registers ST(0) - ST(7) or the new XMM registers depends on the following factors.

Advantages of using ST() registers:

• Compatible with old processors without SSE or SSE2.

• Compatible with old operating systems without XMM support.

• Supports long double precision.

• Intermediate results are always calculated with long double precision.

• Precision conversions are free in the sense that they require no extra instructions and take no extra time. Use ST() registers for expressions where operands have mixed precision.

• Mathematical functions such as logarithms and trigonometric functions are supported by hardware instructions. These functions are useful when optimizing for size, but not necessarily faster than library functions using XMM registers.

• Conversions to and from decimal numbers can use the FBLD and FBSTP instructions when optimizing for size.

• Floating point instructions using ST() registers are smaller than the corresponding instructions using XMM registers. For example, FADD ST(0),ST(1) is 2 bytes, while ADDSD XMM0,XMM1 is 4 bytes.

Advantages of using XMM or YMM registers:

• Can do multiple operations with a single vector instruction.

• Avoids the need to use FXCH for getting the desired register to the top of the stack.


• No need to clean up the register stack after use.

• Can be used together with MMX instructions.

• No need for memory intermediates when converting between integers and floating point numbers.

• 64-bit systems have 16 XMM/YMM registers, but only 8 ST() registers.

• ST() registers cannot be used in device drivers in 64-bit Windows.

• ST() registers may be implemented less efficiently in future processors when most code uses vector registers.

17.2 MMX versus XMM registers (Processors with SSE2)
Integer vector instructions can use either the 64-bit MMX registers or the 128-bit XMM registers.

Advantages of using MMX registers:

• Compatible with older microprocessors since the PMMX.

• Compatible with old operating systems without XMM support.

• No need for data alignment.

Advantages of using XMM registers:

• The number of elements per vector in XMM registers is double the number in MMX registers.

• MMX registers cannot be used together with ST() registers; XMM registers can.

• A series of MMX instructions must end with EMMS; XMM code needs no such cleanup.

• 64-bit systems have 16 XMM registers, but only 8 MMX registers.

• XMM instructions have higher throughput than MMX instructions.

• MMX registers cannot be used in device drivers in 64-bit Windows.

• MMX registers are going out of use and may be implemented less efficiently in future processors.

17.3 XMM versus YMM registers (Processors with AVX)
Vector instructions can use the 128-bit XMM registers or their 256-bit extension named YMM registers (expected available 2010).

Advantages of using YMM registers:

• Double vector size.

Advantages of
using XMM registers:


• Compatible with existing processors.

• There is a penalty for switching between YMM instructions and XMM instructions without VEX prefix; see page 124.

• YMM registers cannot be used in device drivers without saving everything with XSAVE / XRSTOR; see page 126.

17.4 Freeing floating point registers (all processors)
You have to free all used floating point ST() registers before exiting a subroutine, except for any register used for the result.

The fastest way of freeing one register is FSTP ST. To free two registers you may use either FCOMPP or FSTP ST twice, whichever fits best into the decoding sequence or port load.

It is not recommended to use FFREE.

17.5 Transitions between floating point and MMX instructions (Processors with MMX)
It is not possible to use 64-bit MMX registers and 80-bit floating point ST() registers in the same part of the code. You must issue an EMMS instruction after the last instruction that uses MMX registers if there is a possibility that later code uses floating point registers.
You may avoid this problem by using 128-bit XMM registers instead.

On PMMX there is a high penalty for switching between floating point and MMX instructions. The first floating point instruction after an EMMS takes approximately 58 clocks extra, and the first MMX instruction after a floating point instruction takes approximately 38 clocks extra. On processors with out-of-order execution there is no such penalty.

17.6 Converting from floating point to integer (All processors)

All conversions between floating point registers and integer registers must go via a memory location:

; Example 17.1.
        fistp dword ptr [TEMP]
        mov   eax, [TEMP]

On many processors, and especially the P4, this code is likely to have a penalty for attempting to read from [TEMP] before the write to [TEMP] is finished. It doesn't help to put in a WAIT. It is recommended that you put other instructions between the write to [TEMP] and the read from [TEMP] if possible in order to avoid this penalty. This applies to all the examples that follow.

The specifications for the C and C++ languages require that conversion from floating point numbers to integers use truncation rather than rounding. The method used by most C libraries is to change the floating point control word to indicate truncation before using an FISTP instruction, and change it back again afterwards. This method is very slow on all processors. On many processors, the floating point control word cannot be renamed, so all subsequent floating point instructions must wait for the FLDCW instruction to retire. See page 145.


On processors with SSE or SSE2 instructions you can avoid all these problems by using XMM registers instead of floating point registers and using the CVT.. instructions to avoid the memory intermediate.

Whenever you have a conversion from a floating point register to an integer register, you should consider whether you can use rounding to nearest integer instead of truncation.

If you need truncation inside a loop then you should change the control word only outside the loop if the rest of the floating point instructions in the loop can work correctly in truncation mode.

You may use various tricks for truncating without changing the control word, as illustrated in the examples below. These examples presume that the control word is set to default, i.e. rounding to nearest or even.

; Example 17.2a. Rounding to nearest or even:
; extern "C" int round (double x);
_round  PROC NEAR                  ; (32 bit mode)
        PUBLIC _round
        fld   qword ptr [esp+4]
        fistp dword ptr [esp+4]
        mov   eax, dword ptr [esp+4]
        ret
_round  ENDP

; Example 17.2b. Truncation towards zero:
; extern "C" int truncate (double x);
_truncate PROC NEAR                ; (32 bit mode)
        PUBLIC _truncate
        fld   qword ptr [esp+4]    ; x
        sub   esp, 12              ; space for local variables
        fist  dword ptr [esp]      ; rounded value
        fst   dword ptr [esp+4]    ; float value
        fisub dword ptr [esp]      ; subtract rounded value
        fstp  dword ptr [esp+8]    ; difference
        pop   eax                  ; rounded value
        pop   ecx                  ; float value
        pop   edx                  ; difference (float)
        test  ecx, ecx             ; test sign of x
        js    short NEGATIVE
        add   edx, 7FFFFFFFH       ; produce carry if difference < -0
        sbb   eax, 0               ; subtract 1 if x-round(x) < -0
        ret
NEGATIVE:
        xor   ecx, ecx
        test  edx, edx
        setg  cl                   ; 1 if difference > 0
        add   eax, ecx             ; add 1 if x-round(x) > 0
        ret
_truncate ENDP

; Example 17.2c. Truncation towards minus infinity:
; extern "C" int ifloor (double x);
_ifloor PROC NEAR                  ; (32 bit mode)
        PUBLIC _ifloor
        fld   qword ptr [esp+4]    ; x
        sub   esp, 8               ; space for local variables
        fist  dword ptr [esp]      ; rounded value
        fisub dword ptr [esp]      ; subtract rounded value
        fstp  dword ptr [esp+4]    ; difference
        pop   eax                  ; rounded value
        pop   edx                  ; difference (float)
        add   edx, 7FFFFFFFH       ; produce carry if difference < -0
        sbb   eax, 0               ; subtract 1 if x-round(x) < -0
        ret
_ifloor ENDP

These procedures work for -2^31 < x < 2^31-1. They do not check for overflow or NANs.

17.7 Using integer instructions for floating point operations

Integer instructions are generally faster than floating point instructions, so it is often advantageous to use integer instructions for doing simple floating point operations. The most obvious example is moving data. For example,

; Example 17.3a. Moving floating point data
        fld  qword ptr [esi]
        fstp qword ptr [edi]

can be replaced by:

; Example 17.3b
        mov eax, [esi]
        mov ebx, [esi+4]
        mov [edi], eax
        mov [edi+4], ebx

or:

; Example 17.3c
        movq mm0, [esi]
        movq [edi], mm0

In 64-bit mode, use:

; Example 17.3d
        mov rax, [rsi]
        mov [rdi], rax

Many other manipulations are possible if you know how floating point numbers are represented in binary format. See the chapter "Using integer operations for manipulating floating point variables" in manual 1: "Optimizing software in C++".

The bit positions are shown in this table:

precision              mantissa     always 1   exponent      sign
single (32 bits)       bit 0 - 22              bit 23 - 30   bit 31
double (64 bits)       bit 0 - 51              bit 52 - 62   bit 63
long double (80 bits)  bit 0 - 62   bit 63     bit 64 - 78   bit 79

Table 17.1. Floating point formats

From this table we can find that the value 1.0 is represented as 3F80,0000H in single precision format, 3FF0,0000,0000,0000H in double precision, and 3FFF,8000,0000,0000,0000H in long double precision.

It is possible to generate simple floating point constants without using data in memory as explained on page 117.

Testing if a floating point value is zero

To test if a floating point number is zero, we have to test all bits except the sign bit, which may be either 0 or 1. For example:


; Example 17.4a. Testing floating point value for zero
        fld    dword ptr [ebx]
        ftst
        fnstsw ax
        and    ah, 40h
        jnz    IsZero

can be replaced by

; Example 17.4b. Testing floating point value for zero
        mov eax, [ebx]
        add eax, eax
        jz  IsZero

where the ADD EAX,EAX shifts out the sign bit. Double precision floats have 63 bits to test, but if denormal numbers can be ruled out, then you can be certain that the value is zero if the exponent bits are all zero. Example:

; Example 17.4c. Testing double value for zero
        fld    qword ptr [ebx]
        ftst
        fnstsw ax
        and    ah, 40h
        jnz    IsZero

can be replaced by

; Example 17.4d. Testing double value for zero
        mov eax, [ebx+4]
        add eax, eax
        jz  IsZero

Manipulating the sign bit

A floating point number is negative if the sign bit is set and at least one other bit is set. Example (single precision):

; Example 17.5. Testing floating point value for negative
        mov eax, [NumberToTest]
        cmp eax, 80000000H
        ja  IsNegative

You can change the sign of a floating point number simply by flipping the sign bit. This is useful when XMM registers are used, because there is no XMM change sign instruction. Example:

; Example 17.6. Change sign of four single-precision floats in xmm0
        pcmpeqd xmm1, xmm1   ; generate all 1's
        pslld   xmm1, 31     ; 1 in the leftmost bit of each dword only
        xorps   xmm0, xmm1   ; change sign of xmm0

You can get the absolute value of a floating point number by AND'ing out the sign bit:

; Example 17.7. Absolute value of four single-precision floats in xmm0
        pcmpeqd xmm1, xmm1   ; generate all 1's
        psrld   xmm1, 1      ; 1 in all but the leftmost bit of each dword
        andps   xmm0, xmm1   ; set sign bits to 0

You can extract the sign bit of a floating point number:

; Example 17.8. Generate a bit-mask if single-precision floats in
; xmm0 are negative or -0.0
        psrad xmm0, 31       ; copy sign bit into all bit positions

Manipulating the exponent

You can multiply a non-zero number by a power of 2 by simply adding to the exponent:

; Example 17.9. Multiply vector by power of 2
        movaps xmm0, [x]     ; four single-precision floats
        movdqa xmm1, [n]     ; four 32-bit positive integers
        pslld  xmm1, 23      ; shift integers into exponent field
        paddd  xmm0, xmm1    ; x * 2^n

Likewise, you can divide by a power of 2 by subtracting from the exponent. Note that this code does not work if x is zero or if overflow or underflow is possible.

Manipulating the mantissa

You can convert an integer to a floating point number in an interval of length 1.0 by putting bits into the mantissa field. The following code computes x = n / 2^32, where n is an unsigned integer in the interval 0 ≤ n < 2^32, and the resulting x is in the interval 0 ≤ x < 1.0.

; Example 17.10. Convert bits to value between 0 and 1
.data
one  dq 1.0
x    dq ?
n    dd ?

.code
        movsd xmm0, [one]    ; 1.0, double precision
        movd  xmm1, [n]      ; n, 32-bit unsigned integer
        psllq xmm1, 20       ; align n left in mantissa field
        orpd  xmm1, xmm0     ; combine mantissa and exponent
        subsd xmm1, xmm0     ; subtract 1.0
        movsd [x], xmm1      ; store result

In the above code, the exponent from 1.0 is combined with a mantissa containing the bits of n. This gives a double-precision value in the interval 1.0 ≤ x < 2.0. The SUBSD instruction subtracts 1.0 to get x into the desired interval. This is useful for random number generators.

Comparing numbers

Thanks to the fact that the exponent is stored in a biased format and to the left of the mantissa, it is possible to use integer instructions for comparing positive floating point numbers. Example (single precision):

; Example 17.11a. Compare single precision float numbers
        fld    [a]
        fcomp  [b]
        fnstsw ax
        and    ah, 1
        jnz    ASmallerThanB

can be replaced by:

; Example 17.11b. Compare single precision float numbers
        mov eax, [a]
        mov ebx, [b]
        cmp eax, ebx
        jb  ASmallerThanB

This method works only if you are certain that none of the numbers have the sign bit set. You may compare absolute values by shifting out the sign bit of both numbers. For double-precision numbers, you can make an approximate comparison by comparing the upper 32 bits using integer instructions.

17.8 Using floating point instructions for integer operations

While there are no problems using integer instructions for moving floating point data, it is not always safe to use floating point instructions for moving integer data. For example, you may be tempted to use FLD QWORD PTR [ESI] / FSTP QWORD PTR [EDI] to move 8 bytes at a time. However, this method may fail if the data do not represent valid floating point numbers. The FLD instruction may generate an exception and it may even change the value of the data. If you want your code to be compatible with processors that don't have MMX and XMM registers then you can only use the slower FILD and FISTP for moving 8 bytes at a time.

However, some floating point instructions can handle integer data without generating exceptions or modifying data. See page 112 for details.

Converting binary to decimal numbers

The FBLD and FBSTP instructions provide a simple and convenient way of converting numbers between binary and decimal, but this is not the fastest method.

17.9 Moving blocks of data (All processors)

There are several ways of moving large blocks of data. The most common method is REP MOVS, but this is rarely the fastest method.
On processors with SSE or later instruction sets, the fastest way of moving data is to use vector registers if the conditions for fast REP MOVS described on page 140 are not met or if the destination is in the level 1 or level 2 cache. If SSE is available then use XMM registers. If AVX is available then use YMM registers.

It is preferable to use the largest available register size (i.e. XMM or YMM) to move the largest possible number of bytes per operation. Make sure that both source and destination are aligned by the register size. If the size of the block you want to move is not a multiple of the register size and you are using a loop for moving data, then it is better to pad the buffers with extra space at the end and move a little more data than needed than to move the extra data outside the loop. If the amount of data is constant and so small that you want to unroll the loop completely then there is no problem in using smaller registers for moving the last data.

If the source and destination are not aligned properly according to the size of the registers used, then the optimal way of moving data is to align all the reads at the proper address boundaries (16-byte boundaries for XMM registers) and shift and combine consecutive reads into the registers used for writing so that the writes to the destination can also be aligned. The speed can typically be increased by a factor of 4 or more by this method if source and destination are both in the level-1 cache and unaligned. See the memcpySSE2.asm example in the appendix www.agner.org/optimize/asmexamples.zip. This code is so complicated that it should be placed in a function library. The library at www.agner.org/optimize/asmlib.zip contains an implementation of the memcpy function using this method. See manual 1: "Optimizing software in C++" for a comparison of function libraries.

On processors with SSE you also have the option of writing directly to RAM memory without involving the cache by using the MOVNTQ, MOVNTPS and similar instructions. This can be useful if you don't want the destination to go into a cache. As a rule of thumb, it is recommended to use these non-temporal writes when the amount of data written is more than half the size of the largest-level cache.


More advice on improving memory access can be found in Intel's "IA-32 Intel Architecture Optimization Reference Manual" and AMD's "Software Optimization Guide for AMD64 Processors".

17.10 Self-modifying code (All processors)

The penalty for executing a piece of code immediately after modifying it is approximately 19 clocks for P1, 31 for PMMX, and 150-300 for PPro, P2, P3, PM. The P4 will purge the entire trace cache after self-modifying code. The 80486 and earlier processors require a jump between the modifying and the modified code in order to flush the code cache.

To get permission to modify code in a protected operating system you need to call special system functions: in 16-bit Windows call ChangeSelector; in 32-bit and 64-bit Windows call VirtualProtect and FlushInstructionCache. The trick of putting the code in a data segment doesn't work in newer systems.

Self-modifying code is not considered good programming practice. It should be used only if the gain in speed is substantial and the modified code is executed so many times that the advantage outweighs the penalties for using self-modifying code.

Self-modifying code can be useful, for example, in a math program where a user-defined function has to be evaluated many times. The program may contain a small compiler that converts the function to binary code.

18 Measuring performance

18.1 Testing speed

Many compilers have a profiler which makes it possible to measure how many times each function in a program is called and how long it takes. This is very useful for finding any hot spot in the program. If a particular hot spot is taking a high proportion of the total execution time then this hot spot should be the target of your optimization efforts.

Many profilers are not very accurate, and certainly not accurate enough for fine-tuning a small part of the code. The most accurate way of testing the speed of a piece of code is to use the so-called time stamp counter. This is an internal 64-bit clock counter which can be read into EDX:EAX using the instruction RDTSC (read time stamp counter). The time stamp counter counts at the CPU clock frequency so that one count equals one clock cycle, which is the smallest relevant time unit. Some overclocked processors count at a slightly different frequency.

The time stamp counter is very useful because it can measure how many clock cycles a piece of code takes. Some processors are able to change the clock frequency in response to changing workloads. However, when searching for bottlenecks in small pieces of code, it is more informative to measure clock cycles than to measure microseconds.

The resolution of the time measurement is equal to the value of the clock multiplier (e.g. 11) on Intel processors with SpeedStep technology. The resolution is one clock cycle on other processors.
The processors with SpeedStep technology have a performance counter for unhalted core cycles which gives more accurate measurements than the time stamp counter. Privileged access is necessary to enable this counter.

On all processors with out-of-order execution, you have to insert XOR EAX,EAX / CPUID before and after each read of the counter in order to prevent it from executing in parallel with anything else. CPUID is a serializing instruction, which means that it flushes the
pipeline and waits for all pending operations to finish before proceeding. This is very useful for testing purposes.

The biggest problem when counting clock ticks is to avoid interrupts. Protected operating systems do not allow you to clear the interrupt flag, so you cannot avoid interrupts and task switches during the test. This makes test results inaccurate and irreproducible. There are several alternative ways to overcome this problem:

1. Run the test code with a high priority to minimize the risk of interrupts and task switches.

2. If the piece of code you are testing is not too long then you may repeat the test several times and assume that the lowest of the clock counts measured represents a situation where no interrupt has occurred.

3. If the piece of code you are testing takes so long that interrupts are unavoidable then you may repeat the test many times and take the average of the clock count measurements.

4. Make a virtual device driver to clear the interrupt flag.

5. Use an operating system that allows clearing the interrupt flag (e.g. Windows 98 without network, in console mode).

6. Start the test program in real mode using the old DOS operating system.

I have made a series of test programs that use methods 1, 2 and possibly 6. These programs are available at www.agner.org/optimize/testp.zip.

A further complication occurs on processors with multiple cores if a thread can jump from one core to another. The time stamp counters on different cores are not necessarily synchronized. This is not a problem when testing small pieces of code if the above precautions are taken to minimize interrupts. But it can be a problem when measuring longer time intervals. You may need to lock the process to a single CPU core, for example with the function SetProcessAffinityMask in Windows. This is discussed in the document "Game Timing and Multicore Processors", Microsoft 2005, msdn2.microsoft.com/en-us/library/bb173458.aspx.

You will soon observe when measuring clock cycles that a piece of code always takes longer the first time it is executed, when it is not in the cache. Furthermore, it may take two or three iterations before the branch predictor has adapted to the code. The first measurement gives the execution time when code and data are not in the cache. The subsequent measurements give the execution time with the best possible caching.

The alignment effects on the PPro, P2 and P3 processors make time measurements very difficult on these processors. Assume that you have a piece of code and you want to make a change which you expect to make the code a few clocks faster. The modified code does not have exactly the same size as the original. This means that the code below the modification will be aligned differently and the instruction fetch blocks will be different. If instruction fetch and decoding is a bottleneck, which is often the case on these processors, then the change in the alignment may make the code several clock cycles faster or slower. The change in the alignment may actually have a larger effect on the clock count than the modification you have made. So you may be unable to verify whether the modification in itself makes the code faster or slower. It can be quite difficult to predict where each instruction fetch block begins, as explained in manual 3: "The microarchitecture of Intel and AMD CPUs".

Other processors do not have serious alignment problems. The P4 does, however, have a somewhat similar, though less severe, effect. This effect is caused by changes in the
alignment of µops in the trace cache. The time it takes to jump to the least common (but predicted) branch after a conditional jump instruction may differ by up to two clock cycles on different alignments if trace cache delivery is the bottleneck. The alignment of µops in the trace cache lines is difficult to predict.

Most x86 processors also have a set of so-called performance monitor counters which can count events such as cache misses, misalignments, branch mispredictions, etc. These are very useful for diagnosing performance problems. The performance monitor counters are processor-specific. You need a different test setup for each type of CPU.

Details about the performance monitor counters can be found in Intel's "IA-32 Intel Architecture Software Developer's Manual", vol. 3 and in AMD's "BIOS and Kernel Developer's Guide".

You need privileged access to set up the performance monitor counters. This is done most conveniently with a device driver. The test programs at www.agner.org/optimize/testp.zip give access to the performance monitor counters under 32-bit and 64-bit Windows and 16-bit real mode DOS. These test programs support the different kinds of performance monitor counters in most Intel and AMD processors.

Intel and AMD provide profilers that use the performance monitor counters of their respective processors. Intel's profiler is called VTune and AMD's profiler is called CodeAnalyst.

18.2 The pitfalls of unit-testing

If you want to find out which version of a function performs best then it is not sufficient to measure clock cycles in a small test program that calls the function many times. Such a test is unlikely to give a realistic measure of cache misses because the test program may use less memory than the cache size. For example, an isolated test may show that it is advantageous to roll out a loop, while a test where the function is inserted in the final program shows a large amount of cache misses when the loop is rolled out.

Therefore, it is important not only to count clock cycles when evaluating the performance of a function, but also to consider how much space it uses in the code cache, data cache and branch target buffer.

See the section named "The pitfalls of unit-testing" in manual 1: "Optimizing software in C++" for further discussion of this problem.

19 Literature

The present manual is part of a series available from www.agner.org/optimize as mentioned in the introduction on page 4. See manual 3: "The microarchitecture of Intel and AMD CPUs" for a list of relevant literature on specific processors.

A lot of other sources also have useful information. These sources are listed in the FAQ for the newsgroup comp.lang.asm.x86. For other internet resources follow the links from www.agner.org/optimize.

Some useful textbooks:

• R. C. Detmer: Introduction to 80x86 Assembly Language and Computer Architecture, 2nd ed. Jones & Bartlett, 2006.
• R. C. Detmer: Essentials of 80x86 Assembly Language. Jones & Bartlett, 2006.
• J. L. Hennessy and D. A. Patterson: Computer Architecture: A Quantitative Approach, 3rd ed. 2002.
• S. Goedecker and A. Hoisie: Performance Optimization of Numerically Intensive Codes. SIAM, 2001.
• John R. Levine: Linkers and Loaders. Morgan Kaufmann, 2000.
