
INSTITUT FÜR INFORMATIK
Softwaretechnik und Programmiersprachen
Universitätsstr. 1
D-40225 Düsseldorf

An ARM Backend for PyPy's Tracing JIT

David Schneider

Master's Thesis

Work begun: 24 November 2010
Work submitted: 23 March 2011

Examiners: Prof. Dr. Michael Leuschel
           Prof. Dr. Michael Schöttner


Declaration

I hereby declare that I have written this Master's thesis independently and that I have used no sources or aids other than those indicated.

Düsseldorf, 23 March 2011
David Schneider


Abstract

A large part of the computing done today is performed on mobile devices. Phones, tablets and netbooks are changing the way we see and use computers. Devices built for mobile usage are constrained in different ways: battery lifetime, speed, power consumption, storage and memory resources are issues compared to today's desktop computers. A very large number of the mobile devices in use or being created nowadays use ARM-based CPUs. ARM, a RISC architecture, is specialized in the development of CPUs for mobile and embedded systems, producing reasonably fast and energy-efficient CPU designs. Due to the wide usage of ARM CPUs on mobile devices, ARM is becoming an increasingly important platform for software developers. The frameworks for creating software for mobile phones are mostly based on statically typed languages such as C#, Java or Objective-C.

Although dynamic languages have seen a rise in interest and usage over the last years, many of the virtual machines for these languages are rather straightforward and do not offer performance comparable to that of statically typed languages, making them less apt for developing software for platforms with restricted resources. Just-in-time compilation has shown to be a viable way to improve the performance of virtual machines for dynamic languages. Just-in-time compilers perform profiling at runtime while interpreting a given program and generate specialized code for frequently used code paths. There are different approaches to how profiling is performed, either based on methods or on loops.

The PyPy project, a toolset for writing dynamic languages and an implementation of the Python language which are both written in Python, provides a way to write interpreters for dynamic languages at a high level of abstraction by leaving out platform-specific details, which are added to the interpreter when it is translated. The translation toolchain allows targeting different architectures such as x86, x86_64, the JVM or the .NET CLR virtual machine and also ARM, while only having to maintain one language implementation. PyPy also provides a generator for tracing JITs that can be added to a language implementation by providing a few hints in the language specification. The tracing JIT already gives good results on x86, speeding up the execution of Python programs by about 4x compared to CPython. In this work we present the backend developed for PyPy's JIT that targets the ARM architecture, describing the specifics of how dynamic compilation can be implemented on ARM in the scope of PyPy.

The ARM backend shows promising results in terms of speed; the current results show a similar relative performance when comparing it to CPython as is seen on x86. Further speed-ups are to be expected when integrating PyPy's own GCs.


Contents

6.4.3 Allocation
6.4.4 Memory Access
6.4.5 Calls
6.4.6 Forcing the Frame Object
6.5 Guards and Leaving the Loop
6.6 Bridges
7 Cross Translation
8 Evaluation
8.1 Hardware
8.2 Benchmarks
8.2.1 Python Benchmarks
8.2.2 Prolog Benchmarks
9 Related Work
10 Conclusion and Future Work
11 Annex
References
List of Figures
List of Tables


1 Introduction

ARM cores are present in most of the devices we use day to day, be it a phone, a tablet or a music player. Additionally, ARM-based CPUs are beginning to be used in netbooks and energy-efficient servers. The large-scale deployment of ARM cores and the variety of devices they power make ARM an increasingly interesting platform to be targeted by developers.

At the same time there is a growing interest in dynamic programming languages and an increase in their usage, due to improvements in the implementations of these languages and the increased computational power of computers. These developments make it more and more feasible to develop software, from small scripts to large-scale server systems, using dynamic languages.

ARM-based devices are usually small, energy-efficient and portable devices where execution speed and power consumption are critical concerns when creating software. To bridge the gap between the high-level expressiveness and ease of development of dynamic languages and the restricted resources found on small systems, we want to create a backend for PyPy's tracing JIT [PyP11b]. Based on the powerful toolchain provided by PyPy, we are able to provide a way to create interpreters for different dynamic languages. These interpreters are able to run on ARM-based systems, in addition to the platforms already targeted by PyPy. At the same time these interpreters may benefit from the optimizations provided by just-in-time compilation as done by PyPy. This will make it possible to write more and faster software for mobile devices using high-level languages such as Python.

PyPy is the tool of choice to bring dynamic languages and their dynamic compilation to ARM, because it is a general toolchain for writing dynamic languages that already has a very compliant Python 2.7 implementation as well as a Prolog implementation and several experimental language implementations. Due to its modular architecture, PyPy allows all language implementations to potentially benefit from and use the new ARM backend for the JIT. PyPy has a powerful JIT compiler generator that provides very good results on x86. We hope to be able to bring these good results to ARM.

In this thesis we present an ARM backend for PyPy's tracing JIT compiler. After giving an introduction to the ARM architecture and to just-in-time compilation, we describe how this works in PyPy. Based on this foundation we describe how we translate the operations of PyPy's JIT to machine code in a way that can be executed efficiently on ARM-based devices. Additionally, we present a new translation platform for PyPy to cross-translate programs written in RPython to target ARM.

The rest of this thesis is structured as follows: in Section 2 we present the ARM architecture and the company behind it, describing the different features that ARM provides, focusing especially on those that are central to our goal of creating a backend for PyPy's JIT. In Section 3 we give a general introduction to just-in-time compilation and the different techniques which have been developed to improve the execution speed of high-level languages. In Section 4 we focus on one particular approach to developing dynamic languages, presenting the PyPy project and some of its core ideas. In Section 5 we describe PyPy's approach to just-in-time compilation. Based on these concepts, in Section 6 we elaborate on how a backend for PyPy's tracing JIT is built and what the particularities of creating such a backend for ARM are. In Section 7 we describe how we cross-translate interpreters using PyPy's toolchain to run on resource-constrained devices. In Section 8 the results we have gathered from benchmarks evaluating our ARM backend are compared to another backend for PyPy's JIT. Finally, in Sections 9 and 10 our work is related to previous research and development and we outline topics of future work related to the subject of this thesis.


2 ARM

ARM is at the same time the name of a company as well as the name of its main product. The CPUs designed by ARM and based on the ARM architecture are nowadays widely deployed, being built into most current mobile devices such as phones and tablets. At the same time, ARM CPUs are entering the computer and server markets. The CPUs designed by ARM are based on a RISC architecture and have several unique features. In this section we give a short overview of the company behind these CPUs and introduce the ARM architecture itself, focusing on aspects relevant for the PyPy JIT backend presented in this thesis.

2.1 Company and History

ARM Holdings, the company that develops the ARM CPUs, was founded in 1990 under the name Advanced RISC Machines as a spin-off of Acorn and Apple Computer. ARM itself does not sell chips; instead, its business model is based on designing 32-bit RISC CPUs and GPUs as well as tools and software targeting these designs. The intellectual property, in the form of chip and architecture designs, is licensed to partners. Partner companies that license the designs can adapt them to their needs and actually build the chips, paying ARM a license fee per chip built. An example of such chips are Apple's A4 and A5 processors used in the iPad and iPhone. These processors are system-on-a-chip designs based on an ARM CPU combined with additional components on the same chip, such as a GPU.

The main markets that ARM targets with its technologies are mobile computing, home entertainment systems and embedded computing in areas such as household appliances and car controls. Currently ARM is the largest vendor of intellectual property for mobile computing, being present in about 95% of all mobile handsets, according to company numbers [ARM11]. With the growth of the mobile computing market, mainly due to smartphones and tablets, the importance of ARM as a platform has also increased. This has made ARM an interesting target platform. Large vendors of mobile devices and software, such as Apple and Google, are already building on ARM as a platform, and lately Microsoft announced that their next operating system will also run on ARM [1]. NVIDIA announced that it will build CPUs based on ARM designs [2].

2.2 The ARM Architecture

The architecture used in the chips developed by ARM, which is called the ARM architecture, is based on a RISC model with a special focus on small code size, energy efficiency and performance, which are all relevant features for embedded and mobile computing.

[1] http://www.microsoft.com/presspass/press/2011/jan11/01-05socsupport.mspx
[2] http://pressroom.nvidia.com/easyir/customrel.do?easyirid=A0D622CE9F579F09&version=live&releasejsp=release_157&xhtml=true&prid=705184


As ARM states [ARM10], the architecture sports a set of typical RISC characteristics, with a few additional properties only provided by ARM processors. The ARM architecture provides:

• a large set of multi-purpose registers
• data processing operations that work on register contents and not directly on data stored in memory
• a simple and direct addressing mode based only on register contents and constant offsets

Among the special features only found on ARM are:

• the combination of shift operations with logical and arithmetic operations, meaning that it is possible to apply a shift to the arguments of operations before they are evaluated
• auto-decrement and auto-increment addressing modes
• instructions to load and store multiple register values at once, to improve data throughput
• conditional execution of most instructions, to avoid jump operations and improve performance and cache behaviour

2.2.1 ARM/THUMB

The latest version of the ARM architecture is version 7, usually referred to as ARMv7. This architecture defines two main instruction sets and corresponding execution modes.

Chronologically the first was the ARM instruction set, introduced in 1983 by Acorn Computers. The ARM instruction set is a fixed-width 32-bit instruction set, providing the best performance and the biggest flexibility in the interaction with the processor. The second was the Thumb instruction set, first introduced with some cores of the ARMv4 processor family around 1993. It is a fixed-width 16-bit instruction set designed to improve code density at the cost of execution speed. The speed penalty exists because Thumb instructions are translated at runtime to the corresponding ARM instructions [ARM95]. The Thumb instruction set evolved into Thumb-2, a variable-width instruction set that tries to combine the code density of Thumb with performance comparable to the ARM instruction set.

There are also additional instruction sets and execution modes targeting virtual machines, such as Jazelle, which specifically targets Java bytecode. Jazelle has been deprecated in favor of ThumbEE, which extends the Thumb instruction set with null pointer checks and some other additional features.


The current implementation of the JIT backend for ARM targets the ARM instruction set, but it should not be hard to adapt the backend to generate Thumb-2 or ThumbEE code in order to evaluate the memory usage versus speed trade-offs. Unless stated explicitly, we refer to the ARM instruction set and mode in version 7 for the rest of this thesis.

2.2.2 Profiles

With the specification of the ARMv7 architecture, ARM introduced the concept of profiles for processors, defining a profile for applications, one for embedded platforms and one for real-time processors. The requirements for the different profiles contain a set of necessary and optional features for the processors to provide. Each processor implementing a profile provides a defined set of core functionality. The profiles are:

• ARMv7-A: the Application profile provides different execution modes and instruction sets such as ARM and Thumb, and it supports virtual memory.
• ARMv7-R: the Real-time profile is similar to the ARMv7-A profile but provides a simpler and more direct memory access model.
• ARMv7-M: the Microcontroller profile defines a profile for low-level programming, also with a simple memory model, and only supports the Thumb instruction set.

The ARM backend for PyPy's JIT targets the ARMv7-A processor profile but should also be compatible with the ARMv7-R profile.

2.2.3 Calling Convention

The ARM architecture procedure call standard (AAPCS) [Ear09] describes the calling convention in the form of a specification with rules that the calling and called side of a subroutine call have to follow. The AAPCS also defines the different data types available in procedure calls and how they are passed as arguments and return values. By following this convention, independently defined, compiled and assembled libraries can work with each other.

ARM processors have a register set of 16 core registers accessible in ARM and Thumb modes with slightly different rules. In ARM mode all of these registers can be read and written, including the one used as program counter. The registers r12 to r15 have a special role in the AAPCS, being used to hold the current execution state. The names and roles of the registers are defined as follows:


• Registers r0 to r11 are freely usable general purpose registers used for storing values during computation.
• Register r11 was used as a frame pointer (FP), but this use has been deprecated in the AAPCS and it can now be used as a general purpose register.
• Register r12, also called the intra-procedure-call scratch register, is used to store temporary values during the execution of a subroutine. The value of this register is not guaranteed to be preserved across subroutine calls.
• Register r13 is used as the stack pointer (SP), pointing to the top of the currently used stack (the lowest address).
• Register r14, also called the link register (LR), is used to store the return address of the calling subroutine for procedure calls.
• Register r15 is the program counter (PC), containing the address of the current instruction. Writing to this register results in an unconditional branch.

According to the AAPCS, arguments to procedures are passed in registers r0 to r3 and on the stack. By passing the first arguments in registers, calls with up to four integer arguments can profit from avoiding the overhead of passing the arguments on the stack. The return values of sub-procedure calls are returned in the same registers used to pass arguments. For this reason the caller has to take care of preserving values stored in these registers. On the other hand, the callee has to take care to preserve the values stored in the registers r4 to r10, which are to be used for local variables.

Figure 1 shows an example of a procedure written in ARM assembler that follows the calling convention defined in the AAPCS; Figure 2 shows the equivalent algorithm written in Python. The procedure in this example calculates the Fibonacci number of an input passed in register r0 using a recursive algorithm. When entering the procedure, the callee-saved registers that are going to be used and the link register that contains the return address are saved on the stack. Next, by checking if the input passed in register r0 is less than two, we check for the base cases (r0=0 or r0=1). In case we hit such a base case, using a conditionally executed POP instruction (see Section 2.2.5 for details), we restore the saved registers and load the return address into the program counter, thus performing a jump to it. Otherwise we perform the recursive calculation of the two previous Fibonacci numbers by calling the same procedure using the branch and link instruction (BL) and passing the argument in register r0. The BL instruction stores the return address in the link register and performs a jump to the address or label passed. Finally we calculate the sum of both numbers and store it in r0 as the return value. To return, using a POP instruction, we restore the saved registers and put the saved value of the link register into the program counter, performing an unconditional branch to the return address.

The AAPCS defines, based on a 32-bit word size, as core data types signed and unsigned bytes, halfwords, words and double words as well as half, single and double precision floating point numbers and finally vector data types.
According to the size of the data types, the values are stored in one, two or four consecutive registers. Section 5.5 of the AAPCS describes in detail the algorithm used to pass arguments to procedures, describing the encoding of the arguments in registers, on the stack and in the coprocessor registers, taking into account the different sizes of data types described above.


Larger data structures, or data structures with variable size, are passed to and returned from subroutines by reference, storing the data in memory and passing its address, according to the AAPCS, as the actual parameter for the procedure call. Smaller values such as numbers are passed by value.

fib:
    push  {r5, r6, lr}    @ save the link register and required registers
    cmp   r0, #2          @ check for base cases
    poplt {r5, r6, pc}    @ return if n < 2
    mov   r5, r0          @ store input in r5
    sub   r0, r5, #1      @ put n - 1 in r0
    bl    fib             @ call fib(n-1)
    mov   r6, r0          @ store result in r6
    sub   r0, r5, #2      @ put n - 2 in r0
    bl    fib             @ call fib(n-2)
    add   r0, r0, r6      @ add both results
    pop   {r5, r6, pc}    @ restore registers and return

Figure 1: Calculation of Fibonacci numbers written in ARM assembler

def fib(n):
    if n < 2:
        return n
    n1 = fib(n-1)
    n2 = fib(n-2)
    return n1 + n2

Figure 2: Calculation of Fibonacci numbers written in Python

2.2.4 Barrel Shifter

The arithmetic logic units built into ARM CPUs provide a 32-bit barrel shifter, which can perform efficient shift operations. ARM supports five types of shift operations, which are available as standalone operations as well as applicable to the immediate parameter of the operations that support them [ARM10]. The types of shift operations are:

• Logical shift left: shifts a value left by a given number of bits.
• Logical shift right: shifts a value right by a given number of bits.
• Arithmetic shift right: shifts a value right by a given number of bits; the highest bit is copied and shifted into the highest bits, preserving the sign of the value.
• Rotate right: rotates a value right by a given number of bits; the shifted-out low bits are shifted in at the highest bits.
• Rotate right extended: rotates the value right by one bit; the carry bit is shifted in at the highest bit of the value.


Given that the ARM instruction set is a fixed-width instruction set, the immediate values that can be passed to operations are encoded in one byte of the instruction. This means that values larger than one byte cannot be encoded as immediate values. One use of the barrel shifter in combination with immediate values is to load larger constants into a register; for an application of this see Section 6.1.1. As an example of using shifted immediate values see Figure 3, where we move the value of register r1, shifted by one to the left, into register r0. If the value of register r1 is 0xF0, then the resulting value stored in register r0 after this operation is 0x1E0, which corresponds to a multiplication by two.

MOV r0, r1, LSL #1    @ r0 = r1 * 2

Figure 3: Use of ARM's barrel shifter for multiplication by two

2.2.5 Conditional Execution

A feature supported by almost all instructions in the ARM instruction set is conditional execution. Conditional execution allows specifying a condition under which an instruction should be executed, depending on the processor flags. Encoded in the four highest bits of each instruction, there are 16 different conditions under which a given instruction can be executed. The execution is based on the state of the processor flags, and the default is to always execute the instruction. Conditional execution can be used for branched execution while avoiding jumps, possibly improving cache behaviour. Available condition codes include:

• EQ, NE for equality checks
• VS, VC for overflow checks
• GE, LT, LE, GT for less-than and greater-than checks
• etc.

Figure 4 shows how the operation r1 = (r6 >= r7) could be encoded using conditional execution. The CMP operation sets the condition flags according to the values of the operands; the next two operations move the result, either 0 or 1, to the register r1 by executing one of the MOV operations and ignoring the other, depending on the flags set by the CMP operation.

CMP   r6, r7
MOVGE r1, 1
MOVLT r1, 0

Figure 4: Use of conditional execution to assign a value to a register


3 Just-in-Time Compilation

Just-in-time compilation is a technique used for high-level languages to improve the speed of program execution by analyzing the program at runtime and generating optimized code for frequently used code paths. The tools that perform this kind of optimization are called just-in-time compilers (JITs).

In the last years JIT technology, especially that of tracing JITs, has been investigated by implementations of dynamic languages seeking to improve the performance of program execution on their virtual machines. There are several projects integrating different kinds of JITs into their virtual machines, such as V8, Mozilla's Trace- and JägerMonkey or Spur for Javascript [GFE+09, BBF+10], PyPy for Python [BCFR09], LuaJIT for Lua [Pal11] and Rubinius in the case of Ruby [Pho10]. Other platforms not primarily targeting dynamic languages that also use dynamic compilation to improve execution speed are the Java Virtual Machine [PVC01] and the .NET CLR [MSD05].

The execution and translation of programs is usually divided into two groups:

• Static compilation, which translates a given program to another representation, in most cases at a lower level. The target representation is then suited for execution on a chosen platform, which might be either a virtual machine or natively on a processor.
• Interpretation, where the compilation and execution phases are combined into one execution context. The interpreter analyzes a program, or parts of it, transforming it into a format more suitable for evaluation, and executes the result immediately, possibly interleaving further analysis and execution steps.

A JIT compiler tries to combine the advantages of static compilation and of interpretation, taking advantage of the additional information available at runtime. Following Aycock's historical overview of just-in-time compilation [Ayc03], the advantages of this approach are:

• the fast execution speed of compiled programs
• smaller program size because of the higher level of abstraction usually found in interpreted languages
• portability of interpreted programs, given that the executing VM is available on different systems
• the availability of runtime information such as control and data flow

Dynamic compilation is not a novel optimization method. Aycock dates the first JIT-related publication to a paper about LISP by McCarthy published in 1960, which mentions the dynamic compilation of functions into machine language at runtime [McC60]. A further step was taken by analyzing which parts of a program should be optimized. This was achieved by searching for frequently executed paths, detected using execution counters, as described in Hansen's work on Fortran in 1974 [HS74].


In Hansen's approach, when the execution count for a certain block of code passed a defined threshold, this block would be further optimized and transformed to machine code. Another project that innovated in the area of dynamic compilation is the SELF project (http://labs.oracle.com/self/). SELF is a dynamic object-oriented language, inspired in many ways by Smalltalk, such as in its use of message passing, and built on a prototype-based object model [SU94, UCCH91]. Because of the dynamic nature of the language it is difficult to optimize. The SELF project developed different approaches, finally creating a compiler that, based on runtime profiling, generated different optimized code paths for one SELF method depending on the types of the input arguments [CUL89]. For an in-depth historical overview of dynamic compilation see [Ayc03], which also contains a more detailed overview as well as different categorization approaches for JIT compilers.

There are two main groups into which JIT compilers can be classified, depending on how they perform profiling and optimize the instructions before generating low-level code.

3.1 Method Based Just-In-Time Compilation

Method-based JITs, as the name suggests, perform profiling and compilation on a per-method basis. There are different approaches in this area: some systems compile methods to lower-level code when a method is first executed, such as some Smalltalk systems [DS84]. Other systems perform profiling by counting the executions of methods to select candidate methods for compilation, such as the Rubinius Ruby VM [Pho10]. And finally there are approaches such as the Java HotSpot server compiler and Google's V8 Javascript engine, which do not contain an interpreter but perform two-step compilation: first a version of the code is compiled with a fast compiler that does not perform all possible optimizations; in a second step the generated code is profiled to select candidates that are recompiled with more aggressive optimizations [PVC01, V8 11].

3.2 Trace Based Just-In-Time Compilation

An alternative approach to profiling and optimizing programs at runtime is taken by tracing JITs. Tracing JITs take loops as the basic unit for profiling and optimization. The exact definition of a loop can vary depending on the VM, but for bytecode-based VMs it is usually defined as a stream of bytecodes ending with a backwards jump, the jump target being the beginning of the loop. The basic assumptions of tracing JITs are that:

• A program spends most of its time in loops.
• Iterations of a loop usually take similar paths through the code.

A tracing JIT runs a program by interpreting it, first converting it to bytecode and recording profiling information on the interpreted code. Once a certain threshold is passed, the tracing JIT records the operations performed during one execution of the loop. This list of operations, which corresponds to one possible path through the loop, is called a trace.
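To make this threshold-based switch to tracing concrete, the following is a schematic sketch only; it is not PyPy's actual implementation, and names such as HOT_THRESHOLD, on_backward_jump and the injected trace_and_compile callable are made up for illustration. It shows a dispatch loop helper that counts how often each backward-jump target is reached and switches to tracing once a loop becomes hot:

    # Schematic sketch only (hypothetical names): profiling of backward
    # jumps in a bytecode dispatch loop. Once a jump target has been
    # reached HOT_THRESHOLD times, the loop starting there is traced and
    # compiled; afterwards the compiled version is used instead.
    HOT_THRESHOLD = 1000

    class LoopProfiler(object):
        def __init__(self, trace_and_compile):
            self.trace_and_compile = trace_and_compile  # supplied by the JIT
            self.counters = {}        # loop entry pc -> iteration count
            self.compiled = {}        # loop entry pc -> compiled loop

        def on_backward_jump(self, pc):
            """Called by the interpreter whenever it jumps backwards to pc."""
            if pc in self.compiled:
                return self.compiled[pc]          # run the optimized loop
            self.counters[pc] = self.counters.get(pc, 0) + 1
            if self.counters[pc] >= HOT_THRESHOLD:
                self.compiled[pc] = self.trace_and_compile(pc)
                return self.compiled[pc]
            return None                           # keep interpreting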


Performing optimizations by recording the program flow was initially explored within the Dynamo project [BDB00], a tracing optimizer for machine code. Dynamo acts as an interpreter for machine code, profiling and optimizing the machine code during execution. Optimized pieces of assembler are then executed directly on the processor. Based on DynamoRIO [BEES04], the successor of Dynamo, a tracing JIT for bytecode-based dynamic languages has been created. This tracing JIT uses partial evaluation techniques to optimize the traced operations [SBB+03]. There have also been experimental implementations of the JVM using trace optimization techniques, such as [GPF06]. The application of tracing JITs to dynamic languages has lately been investigated by several projects, such as TraceMonkey [GFE+09] and Spur [BBF+10] for Javascript, LuaJIT for Lua [Pal11] and Adobe's Tamarin ActionScript VM used in the Flash Player [Moz11], among others, showing that dynamic trace-based compilation can help improve the execution speed of dynamic languages.

In a loop, defined as above, a backward jump is an operation that makes the VM branch to a previous location in the bytecode stream, i.e. by decrementing the program counter. Profiling is performed on a per-loop basis at the backward jump instructions, counting the number of iterations of a loop. Once a certain threshold in the number of iterations is reached, the loop is considered hot. Once a loop is considered hot, the interpreter switches to tracing mode for the next execution of the loop. In tracing mode the interpreter records all operations performed during the execution of the loop, possibly inlining method calls performed within the loop. The resulting trace of operations is the basis for all further optimization and finally for the translation to machine code.

Because the recorded trace corresponds to only one of the possible paths through the traced loop, special care needs to be taken regarding conditional paths such as if/else branches. At every branching point a special instruction is inserted into the trace that checks that the condition valid at that branching point during tracing is still valid when the trace is executed. These special instructions are called guards [GPF06].

A guard fails during the execution of a loop when a condition at a branching point that was true during tracing is now false. The failure of a guard leads to stopping the execution of the optimized version of the loop and returning to executing the interpreted version of it.

Once a loop has been compiled, the JIT also performs profiling on the guards inserted into the trace, counting how often the trace is left through a specific failing guard. Once the failure count for a guard reaches a certain threshold, the execution is traced again starting from the failing guard until the entry point of a loop is reached; this trace is also compiled and executed the next time the guard fails, instead of leaving the optimized code.
This kind of trace is called a bridge.

Tracing can end for different reasons: either because the tracing interpreter jumps to a previous position in the bytecode, thus closing the loop, or because tracing is aborted due to different conditions, such as a trace that is too long. If tracing finished without aborting, the traced list of instructions is passed to the trace optimizer. Depending on the implementation, different optimizations are performed on the trace, such as constant folding, loop-invariant code motion, allocation removal, etc. The optimized trace is then passed to the backend to be transformed into machine code for the corresponding platform.
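As an illustration of the guard profiling described above, the following sketch uses made-up names (BRIDGE_THRESHOLD, on_failure, the injected trace_from_guard callable) and is not taken from PyPy; it only shows how per-guard failure counts could trigger the compilation of a bridge:

    # Schematic sketch only (hypothetical names): counting guard failures
    # and attaching a bridge once a guard has failed often enough.
    BRIDGE_THRESHOLD = 200

    class Guard(object):
        def __init__(self, trace_from_guard):
            self.trace_from_guard = trace_from_guard  # supplied by the JIT
            self.failures = 0
            self.bridge = None        # compiled code attached to this guard

        def on_failure(self, interpreter_state):
            if self.bridge is not None:
                return self.bridge    # keep running compiled code
            self.failures += 1
            if self.failures >= BRIDGE_THRESHOLD:
                # trace from the failing guard up to the loop entry, compile it
                self.bridge = self.trace_from_guard(self, interpreter_state)
                return self.bridge
            return None               # fall back to the interpreter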


Once compiled, the optimized version of the loop becomes available to the interpreter. When the interpreter is going to execute bytecode corresponding to the loop, it can instead pass control to the compiled version of the loop. Once the optimized loop finishes, the interpreter updates its state based on the results of the optimized loop and resumes interpretation.

As an example of a trace we will look at the RPython functions defined in Figure 5; the details of PyPy's JIT will be introduced in Section 5. This example is taken from [BCFR09]. The function strange_sum executes a loop that calls the function f, which does some calculations on its inputs. When the loop is considered hot, after executing it a certain number of times, the execution is traced. The tracing records all operations executed in one iteration of the loop; the recorded trace is also shown in Figure 5. The tracing only recorded the hot else path, inserting a guard to check the condition. The tracing also inlined the call to f. The operations recorded correspond to the internal representation of the functions. The trace ends with a jump to its beginning and is only left when one of the guards generated for the conditional paths fails.

def strange_sum(n):
    result = 0
    while n >= 0:
        result = f(result, n)
        n -= 1
    return result

def f(a, b):
    if b % 46 == 41:
        return a - b
    else:
        return a + b

# loop_start(result0, n0)
# i0 = int_mod(n0, Const(46))
# i1 = int_eq(i0, Const(41))
# guard_false(i1)
# result1 = int_add(result0, n0)
# n1 = int_sub(n0, Const(1))
# i2 = int_ge(n1, Const(0))
# guard_true(i2)
# jump(result1, n1)

Figure 5: Two simple RPython functions, and a trace recorded when executing the function strange_sum


4 PyPy

The PyPy project was started in 2003 with the goal of creating a high-level implementation of Python which should be easily extensible and portable while not sacrificing performance [Bol10]. The language chosen for the implementation is Python. To achieve the goals of flexibility and extensibility, PyPy is written in a very modular way, which also led to it becoming a general tool for writing interpreters for other dynamic languages. Besides the Python implementation there are interpreters for Smalltalk [BKL+08], Prolog [BLS10], Javascript, etc. When creating an interpreter using PyPy, the language specification is written in a restricted subset of Python called RPython [AACM07]. This allows the interpreter author to concentrate on the language semantics and not mix these with the low-level details of a target platform. This subset of the language is chosen in such a way that it is possible to perform abstract interpretation on it to do data flow analysis and type inference. Based on the inferred types, the RPython code can be transformed to a lower-level representation such as C or JVM/.NET bytecode and then compiled to an executable for the corresponding platform.

RPython is a subset of Python which was chosen in such a way that it does not have all the dynamic features of Python itself and thus can be statically analyzed. RPython is a statically typed, object-oriented and garbage-collected language which supports single inheritance. Being a subset of Python, programs written in RPython can be run and tested using the standard Python interpreter, providing fast feedback and quick prototyping and testing tools. Some of the restrictions imposed on RPython programs compared to Python are:

• RPython programs are not allowed to modify the layout of instances at runtime
• it is not possible to create classes at runtime
• the reflection features are restricted
• the types of objects used in the same place, e.g. in lists, must have a common base type

Among the advantages of using RPython as the language to encode the low-level details of language specifications is the fact that it is a subset of Python, so programs written in RPython can be run, debugged and tested on top of CPython, emulating the low-level details.

RPython cannot be discussed separately from the translation framework; RPython as a language evolved as part of PyPy's translation toolchain. The translation toolchain translates RPython programs to a lower-level representation and generates an executable program [HRP05]. The translation process in PyPy does not work on the source level but performs program analysis on live objects loaded into memory. To translate a program it is loaded into a Python interpreter and then the following steps are applied [PyP11c]:


• A control flow graph is generated for each function.
• The graphs are annotated, through whole-program type inference, with information about the type each variable will have at runtime.
• The graphs are transformed into graphs using operations closer to those of the target platform.
• Several transformations can be performed on the graphs, such as optimizations or the insertion of a garbage collector.
• Finally the graphs are converted to source code for the selected platform and compiled.

Being a subset of Python, RPython shares the characteristic that it is a garbage-collected language. This means that programs are free of low-level details about memory management. For this reason RPython programs are not bound to a certain memory management model. During the translation process, different implementations of memory management and garbage collection can be added to the generated program without affecting the semantics of the program itself. PyPy supports several of its own garbage collector implementations as well as the Boehm-Demers-Weiser conservative garbage collector [BW88]. Other aspects of RPython and of the programs implemented in it can be chosen at translation time, such as the use of tagged pointers or boxes to represent integers, or how classes should be represented in memory. These details are added to the program during the translation process.

Because RPython is translated from live objects loaded within the Python interpreter, the full power of the Python language can be used as a preprocessor for RPython, allowing RPython code to be generated dynamically at load time, which can then be analyzed, transformed and translated to a lower-level target [RP06].
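The following small sketch is purely illustrative (the names are made up and it is not code from PyPy); it shows the idea behind this load-time preprocessing: ordinary Python runs before translation and generates the functions, and only the generated functions need to be valid RPython:

    # Illustrative sketch only: plain Python acting as a load-time
    # "preprocessor" that generates RPython functions. The generator
    # itself never has to be RPython; the translator only ever sees
    # the functions it produced, in which fieldname is a constant.
    def make_field_reader(fieldname):
        def reader(obj):
            return getattr(obj, fieldname)
        reader.__name__ = 'read_' + fieldname
        return reader

    # executed at load time, before translation starts
    read_x = make_field_reader('x')
    read_y = make_field_reader('y')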


5 PyPy's Approach to Tracing JITs

One of the features provided by PyPy is that it includes a JIT compiler generator. This generator can be applied to a language implementation, generating a JIT compiler for the language at translation time. Although the procedure is similar to adding other low-level details such as garbage collection, the JIT generator requires a few hints which need to be provided by the language implementor. In this section we describe the details of PyPy's JIT. This section, as well as the examples used here, is based on the paper "Tracing the meta-level: PyPy's tracing JIT compiler" [BCFR09].

PyPy's tracing JIT is not specific to the Python interpreter; instead, PyPy provides a tracing JIT generator which can be applied to any language implementation which is itself based on PyPy. Because there are two interpreters involved in the discussion of PyPy's JIT, it is helpful to distinguish some terminology. Following the naming scheme proposed in [BCFR09], there is first the language interpreter, which runs a given program. There is also the interpreter used to perform tracing, which runs the language interpreter. We will call this the tracing interpreter. Programs running on the language interpreter are called user programs and loops within them are called user loops, while interpreter loops refers to loops within the language interpreter.

As a running example for a language interpreter we use the simple bytecode interpreter shown in Figure 6, taken from [BCFR09]. It is a bytecode interpreter with 256 registers managed in a list. The inputs the interpreter takes are the bytecode as a list of characters and an integer as an accumulator. A user program for our bytecode interpreter is shown in Figure 7; it calculates the square of the accumulator.

A traditional tracing JIT, applied to a user program, would detect hot loops in it. In contrast to this approach, PyPy's tracing JIT is applied to the language interpreter and not to the user program. The main hot loop of a language interpreter is its bytecode dispatch loop. One iteration of the dispatch loop corresponds to the execution of one bytecode instruction. It is rather unlikely that this loop will take similar code paths in subsequent iterations, rendering one of the core assumptions of tracing JITs invalid.

To solve this problem, PyPy traces several iterations of the bytecode dispatch loop. The loop is unrolled so far that it corresponds to a user loop. As mentioned before, a loop in a user program occurs when the program counter repeatedly has the same value, which can only happen through backward jumps. Usually the program counter consists of the current list of bytecodes and the pointer or index to the currently executed bytecode. Since the JIT is not aware of which variables contain the program counter, it needs to get this information via hints placed by the implementor of the language interpreter.
The tracing JIT can detect a user loop by checking the values that make up the program counter; a loop is found if these variables have a value they already had at a previous point of the execution. The program counter can only have a previous value again if it is explicitly set to an earlier value. This can only happen at a backward jump in the language interpreter. Because the tracing JIT also cannot know which parts of the language interpreter correspond to backward jumps, it needs an additional hint to mark these places. The example bytecode interpreter with the hints applied is shown in Figure 8. In the example, the variables passed in the greens list to the JitDriver class are those that compose the program counter, and the variables passed in the reds list are the ones that are not used for the program counter.


def interpret(bytecode, a):
    regs = [0] * 256
    pc = 0
    while True:
        opcode = ord(bytecode[pc])
        pc += 1
        if opcode == JUMP_IF_A:
            target = ord(bytecode[pc])
            pc += 1
            if a:
                pc = target
        elif opcode == MOV_A_R:
            n = ord(bytecode[pc])
            pc += 1
            regs[n] = a
        elif opcode == MOV_R_A:
            n = ord(bytecode[pc])
            pc += 1
            a = regs[n]
        elif opcode == ADD_R_TO_A:
            n = ord(bytecode[pc])
            pc += 1
            a += regs[n]
        elif opcode == DECR_A:
            a -= 1
        elif opcode == RETURN_A:
            return a

Figure 6: A very simple bytecode interpreter with registers and an accumulator.

MOV_A_R    0    # i = a
MOV_A_R    1    # copy of 'a'
# 4:
MOV_R_A    0    # i--
DECR_A
MOV_A_R    0
MOV_R_A    2    # res += a
ADD_R_TO_A 1
MOV_A_R    2
MOV_R_A    0    # if i!=0: goto 4
JUMP_IF_A  4
MOV_R_A    2    # return res
RETURN_A

Figure 7: Example bytecode: compute the square of the accumulator


The call to can_enter_jit on the instance of the JitDriver is used to mark a backward jump in the language interpreter and thus a possible entry point for the JIT. Finally, the call to jit_merge_point tells the JIT where it can resume normal interpretation in case it had to leave the optimized code.

tlrjitdriver = JitDriver(greens = ['pc', 'bytecode'],
                         reds = ['a', 'regs'])

def interpret(bytecode, a):
    regs = [0] * 256
    pc = 0
    while True:
        tlrjitdriver.jit_merge_point()
        opcode = ord(bytecode[pc])
        pc += 1
        if opcode == JUMP_IF_A:
            target = ord(bytecode[pc])
            pc += 1
            if a:
                if target < pc:
                    tlrjitdriver.can_enter_jit()
                pc = target
        elif opcode == MOV_A_R:
            ... # rest unmodified

Figure 8: Simple bytecode interpreter with hints applied

With these hints applied, the tracing JIT can perform profiling based on loops found in the user program while tracing the operations the tracing interpreter performs when executing the user loop. The downside of this is that the tracing interpreter also records instructions that are related to manipulating the internal state and data structures of the language interpreter. How this problem is handled is explained in Section 5.2. The trace recorded by the tracing interpreter for a user loop is passed to the optimizer before compiling the trace.

5.1 The Shape of a Trace

A trace in PyPy's JIT is composed of operations and input arguments. The operations are a list of instructions provided in SSA form [CFR+91]. The input arguments are provided as a list of variables which are used within the loop. Figure 10 shows the trace for a loop within the user program shown in Figure 7. Because the trace corresponds to the execution of a loop, the instructions in the trace build an infinite loop that can only be left through a failing guard.

Input Arguments and Types. The argument types used in the PyPy JIT either represent variables or constants. Arguments referring to variables are called boxes and carry information about the type of the value they box. Constants are also represented by containers; these containers carry the type of the value and the value of the constant.
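As a simplified, purely illustrative sketch (PyPy's actual classes and their names differ), boxes and constants can be pictured like this:

    # Hypothetical sketch of trace arguments; only meant to illustrate
    # the distinction between boxes and constants described above.
    class Box(object):
        """A variable in the trace: knows the type of the value it holds,
        but the value itself only exists while the trace is executed."""
        def __init__(self, type):
            self.type = type

    class Const(object):
        """A constant in the trace: carries both its type and its value."""
        def __init__(self, type, value):
            self.type = type
            self.value = value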


5.2 Optimizations

In this section we describe some of the optimizations performed by PyPy's tracing JIT on the recorded traces before they are transformed into machine code. The optimizations applicable to traces are well-known compiler optimizations, but they have the advantage that the trace they operate on is a linear structure, so they do not have to take branching and conditional execution into account. Some optimizations are described below; others are:

• reducing memory accesses by removing read and write operations and reusing previous results
• reusing the results of pure operations
• removing redundant guards

Green Folding. Because the tracing JIT is applied to the language interpreter, many of the operations in the trace deal with manipulating the data structures of the language interpreter and not with the calculation performed by the user program itself. Specifically, most of these operations work on the bytecode and the program counter. The values of these data structures are known when entering the loop, since they are part of the hints for the JIT. Operations on these variables can be constant-folded, because the operations involved are side-effect free and the string holding the bytecode is immutable. Figure 9 shows the trace before removing the operations on the interpreter data structures by constant folding, and Figure 10 shows the resulting trace after removing these operations.

Allocation Removal. This optimization, described in [BCF+11], tries to avoid object allocations by virtualizing objects allocated within the trace. If an object does not escape the trace, i.e. it is only used within the trace, the allocation can be removed completely and the object is replaced by its fields, which are treated as local variables.

Loop Invariant Code Motion. This is a technique to optimize loops by moving loop-invariant expressions [ALSU06] outside of the loop body, i.e. values that do not change over iterations and do not depend on variables modified in the loop. The core idea of how this optimization is applied to traces is described in [Ard11].
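To give a feel for allocation removal, the following made-up before/after fragment only resembles the trace notation used in this section; the operation names are invented for illustration and are not taken from an actual PyPy trace:

    # Before allocation removal (p0 never escapes the trace):
    #   p0 = new(SomeClass)
    #   setfield(p0, 'value', i0)
    #   i1 = getfield(p0, 'value')
    #   i2 = int_add(i1, Const(1))
    #
    # After allocation removal the object is virtualized: its only field
    # becomes a local variable and the allocation disappears:
    #   i2 = int_add(i0, Const(1))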


loop_start(a0, regs0, bytecode0, pc0)
# MOV_R_A 0
opcode0 = strgetitem(bytecode0, pc0)
pc1 = int_add(pc0, Const(1))
guard_value(opcode0, Const(2))
n1 = strgetitem(bytecode0, pc1)
pc2 = int_add(pc1, Const(1))
a1 = list_getitem(regs0, n1)
# DECR_A
opcode1 = strgetitem(bytecode0, pc2)
pc3 = int_add(pc2, Const(1))
guard_value(opcode1, Const(7))
a2 = int_sub(a1, Const(1))
# MOV_A_R 0
opcode1 = strgetitem(bytecode0, pc3)
pc4 = int_add(pc3, Const(1))
guard_value(opcode1, Const(1))
n2 = strgetitem(bytecode0, pc4)
pc5 = int_add(pc4, Const(1))
list_setitem(regs0, n2, a2)
# MOV_R_A 2
...
# ADD_R_TO_A 1
opcode3 = strgetitem(bytecode0, pc7)
pc8 = int_add(pc7, Const(1))
guard_value(opcode3, Const(5))
n4 = strgetitem(bytecode0, pc8)
pc9 = int_add(pc8, Const(1))
i0 = list_getitem(regs0, n4)
a4 = int_add(a3, i0)
# MOV_A_R 2
...
# MOV_R_A 0
...
# JUMP_IF_A 4
opcode6 = strgetitem(bytecode0, pc13)
pc14 = int_add(pc13, Const(1))
guard_value(opcode6, Const(3))
target0 = strgetitem(bytecode0, pc14)
pc15 = int_add(pc14, Const(1))
i1 = int_is_true(a5)
guard_true(i1)
jump(a5, regs0, bytecode0, target0)

Figure 9: Trace when executing the Square function of Figure 7, with the corresponding bytecodes as comments.


loop_start(a0, regs0)
# MOV_R_A 0
a1 = list_getitem(regs0, Const(0))
# DECR_A
a2 = int_sub(a1, Const(1))
# MOV_A_R 0
list_setitem(regs0, Const(0), a2)
# MOV_R_A 2
a3 = list_getitem(regs0, Const(2))
# ADD_R_TO_A 1
i0 = list_getitem(regs0, Const(1))
a4 = int_add(a3, i0)
# MOV_A_R 2
list_setitem(regs0, Const(2), a4)
# MOV_R_A 0
a5 = list_getitem(regs0, Const(0))
# JUMP_IF_A 4
i1 = int_is_true(a5)
guard_true(i1)
jump(a5, regs0)

Figure 10: Trace when executing the Square function of Figure 7, with the corresponding opcodes as comments. The constant folding of operations on green variables has been applied.


6 A Backend for PyPy's JIT

In this section we explain how a backend for PyPy's tracing JIT compiler is structured, taking a particular look at how a backend is built to generate code suited for ARM processors. Many of the ideas presented here were also used before in the x86 and x86_64 backends of PyPy's JIT.

The tasks necessary to compile and execute such a loop are manifold. There are several, rather independent components that need to be in place before we can generate and execute code for a loop. First we need a way to generate executable machine code. We want a well-structured way to interface with it, to avoid duplicating the logic of encoding the instructions; this is described in Subsection 6.1. Additionally we need a way to manage the registers. This tool should take care of associating registers with variables, keeping track of the bindings and taking care of freeing the registers. This interface is described in Subsection 6.2. With these tools in place we can start creating the execution context for the loop. We want to create a procedure that acts as a container for the compiled loop and also provides a context to hold the state while executing the loop; how the frame layout used in the backend to hold this state is organized is described in Subsection 6.3.1, and how the procedure that is used to encapsulate the trace works is described in Subsection 6.3.2. With all these components in place we can proceed to compile the operations contained in the trace, which is described in Subsection 6.4. How the loop is left, and how we pass information to the frontend so it can restore the execution state before resuming interpretation, is described in Subsection 6.5. Finally, how bridges are handled is described in Subsection 6.6.

6.1 Low Level Code Generation Interface

Before we can generate code from the traces passed from the frontend into the platform-specific backends, we need a way to encode the machine instructions. In classical compilers this is done by the assembler, which provides a higher-level interface to the instructions supported by the processor, transforming them into their binary representation, which is then executable by the processor.

Because the JIT compiler operates at runtime, we need to provide an assembler that is available at runtime. To achieve this, to avoid having to repeatedly encode all instructions as numbers by hand, and to have a clean interface to the machine language, we generate an assembler interface which provides functions named after the assembler instructions. Such a function encodes the arguments passed to it according to the platform documentation and stores the corresponding instruction to memory. In the ARM backend this interface is provided by a class named CodeBuilder, so for simplicity we will further refer to this interface as the codebuilder. The assembler functions are exposed in the codebuilder using PyPy's metaprogramming features by generating a function for each required instruction from an encoded representation of the specification at load time.

As mentioned earlier, the current ARM backend targets the ARM state/instruction set, which is a fixed-width 32-bit instruction set.


which is a fixed-width 32-bit instruction set.

The ARM Architecture Reference Manual [ARM10] defines different instruction groups and for each of these groups it describes how the arguments to an instruction are encoded in the bits that actually make up the instruction. In this section we are going to use the group of data-processing (immediate) instructions as an example. These instructions take one register and one immediate value as their arguments. Instructions such as AND and CMP are defined in this group. The bit encoding for the instructions defined there is shown in Figure 11. The empty part in the lowest bits of the encoding is defined on a per-instruction basis. As explained in Section 2.2.5 almost every instruction can take a conditional execution flag; this flag is stored in the upper 4 bits of the instruction.

Figure 11: Bit encoding for the data processing instructions with immediate values [ARM10]

To encode the instructions for the backend, all instructions of one group are stored in a Python hashmap. This grouping is possible because most of the instructions in a group have a common interface. For each instruction the hashmap contains the name of the operation, its arguments and a description of the variable parts of the encoding. Figure 12 shows how some of the instructions of our example group are stored. For example the AND_ri instruction has the operation number 0 and it has a result and a register argument. This instruction performs a bitwise AND between the register and the immediate value arguments, storing the result in a result register.

data_proc_imm = {
    'AND_ri': {'op': 0,    'result': True,  'base': True},
    'EOR_ri': {'op': 0x2,  'result': True,  'base': True},
    'SUB_ri': {'op': 0x4,  'result': True,  'base': True},
    'RSB_ri': {'op': 0x6,  'result': True,  'base': True},
    'CMN_ri': {'op': 0x17, 'result': False, 'base': True},
    'ADD_ri': {'op': 0x8,  'result': True,  'base': True},
    'ORR_ri': {'op': 0x18, 'result': True,  'base': True},
    'MVN_ri': {'op': 0x1E, 'result': True,  'base': False},
    #snip
}

Figure 12: Encoding of the group of data processing instructions with immediate values as arguments as a Python hashmap

At load time, when the modules defining the backend are loaded into memory, these instruction encodings are transformed into RPython functions and attached to the CodeBuilder class. When the codebuilder is translated all required functions are already in place and translated with the class, and thus available at runtime to provide a high level interface to the machine instructions. Figure 13 shows the function used to generate the functions for data processing instructions with immediate values. It checks if


the instruction has a return value and if it has a base register, and returns a function with the corresponding signature. When the returned function is called at runtime it writes the binary representation of the instruction it implements to memory.

def define_data_proc_imm_func(name, table):
    n = (0x1
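The rest of the generator builds, for each entry of the table, a function that packs the condition, the operation number and the operands into a 32-bit word and writes it to memory. The following is only a simplified sketch of that idea, not the code of Figure 13: the bit layout is an assumption based on the data-processing (immediate) format of Figure 11, and the special cases for instructions without a result or base register are omitted.

# Sketch only: an assumed ARMv7 data-processing (immediate) layout of
#   cond | 0 0 1 | op | Rn | Rd | imm12
# is used; the real generator also handles instructions without a
# result or base register, which is left out here for brevity.
def define_data_proc_imm_func(name, table):
    op = table['op']

    def f(self, rd, rn, imm, cond=0xE):           # 0xE is the AL (always) condition
        word = (cond << 28 | 0x1 << 25 | op << 20
                | (rn & 0xF) << 16 | (rd & 0xF) << 12 | (imm & 0xFFF))
        self.write32(word)                        # append the instruction word
    f.__name__ = name
    return f

class CodeBuilder(object):                        # minimal stand-in for the real class
    def __init__(self):
        self.buf = []
    def write32(self, word):
        self.buf.append(word)

# at load time one method per instruction of the group is attached to the class
# (only one entry of the table from Figure 12 is repeated here)
for name, spec in {'ADD_ri': {'op': 0x8, 'result': True, 'base': True}}.items():
    setattr(CodeBuilder, name, define_data_proc_imm_func(name, spec))

cb = CodeBuilder()
cb.ADD_ri(4, 4, 1)          # encodes ADD r4, r4, #1 and stores the word in cb.buf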


To load a constant that does not fit into the immediate field of an instruction there are two options: the value can be built up in the target register step by step, or it can be stored in the program stream and addressed by its offset to the program counter. The first solution has the advantage that it works with conditional execution. Figure 16 shows the function used to load constants of up to 32 bits into a register, passing in the variable r that marks the target register where the value will be stored after executing the generated instructions. If the value fits in 2 bytes, or if the instructions are to be executed conditionally, we load the constant byte by byte into the target register (see the function _load_by_shifting in Figure 16). Initially the lowest byte is moved into the target register; all further bytes are written to the register by an ORR operation which applies a logical OR to the argument register and the shifted 8 bit immediate value. Figure 14 shows how the constant 0xBDBC would be loaded into register r4 using this approach. For the second approach, used when the constant requires three or four bytes and we do not need conditional execution, we encode the constant in the program stream, read it using an address based on an offset to the value of the PC and perform a small unconditional jump to continue execution after the location of the value. With this method loading a 32 bit constant takes at most 3 words (2 instructions plus 1 data word); see Figure 15 for an example of the code generated to load the constant 0x6A5EF3C5 into register r4.

...
MOV r4, #179
ORR r4, r4, #189 LSL #8
...

Figure 14: Loading a constant by shifting the bytes into the target register

...
LDR R4, [PC + #8]
ADD PC, PC, #4
0x6A5EF3C5
...

Figure 15: Loading a constant encoded in the program stream

We also experimented with using exclusively one or the other approach, but this mixed approach gives the best results in terms of speed, although the differences are small. A further method worth investigating, which we have not tried so far, is using a constant pool: all used constants are stored in memory in a location that can be accessed by offset addressing relative to the PC, and a constant is then loaded into a register by referencing its offset to the PC. This technique would save the jump and work with conditional execution.

6.2 Register Allocation

An important aspect for the results produced by a compiler, which is also valid for just-in-time compilers, is register allocation. Register allocation is the process of assigning CPU registers to the variables used in the intermediate representation of a program. This process should be fast, especially for JITs, and at the same time produce good results.


def gen_load_int(self, r, value, cond=cond.AL):
    from pypy.jit.backend.arm.conditions import AL
    if cond != AL or 0 <= value <= 0xFFFF:
        ...
        b = (value >> offset) & 0xFF
        if b == 0:
            continue
        t = b | (shift ...

Figure 16: gen_load_int and _load_by_shifting, the functions used to load a 32 bit constant into a register
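The decision implemented by gen_load_int can be restated in the following sketch, which produces assembler mnemonics instead of calling the real codebuilder methods; the function name, the operand formatting and the PC offsets are illustrative only and mirror Figures 14 and 15.

# Illustrative sketch of the two constant-loading strategies described above;
# it returns a list of pseudo-instructions rather than emitting machine code.
def load_int_plan(reg, value, conditional=False):
    if conditional or 0 <= value <= 0xFFFF:
        # strategy 1 (Figure 14): build the value byte by byte with MOV/ORR
        plan = [('MOV', reg, '#%d' % (value & 0xFF))]
        for shift in (8, 16, 24):
            byte = (value >> shift) & 0xFF
            if byte:
                plan.append(('ORR', reg, reg, '#%d LSL #%d' % (byte, shift)))
        return plan
    # strategy 2 (Figure 15): embed the constant in the instruction stream,
    # load it PC-relative and jump over the data word
    return [('LDR', reg, '[PC + #8]'),
            ('ADD', 'PC', 'PC', '#4'),
            ('.word', '0x%X' % value)]

print(load_int_plan('r4', 0x6A5EF3C5))   # the unconditional, PC-relative case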


The choice of the variable to be spilled is based on the previously calculated longevity of the variables, spilling the variable that survives for the longest time. This selection scheme should avoid blocking registers by linking them to very long lived variables. A possible drawback is that if the variable is long-lived and often used, it may need to be read from memory frequently if spilling occurs between usages. Once a variable is selected for spilling we generate an instruction to move it to the spilling area on the stack, and we record in the register allocator where on the stack the variable is stored. After moving the variable away we can free the previously bound register and associate it with a new variable.

6.3 Setup to Execute a Compiled Loop

Apart from compiling a trace, we need a way to actually execute the compiled instructions. This means that we need a mechanism to call into the compiled code once the frontend tries to execute the same loop again. Similar to the approach described in [GFE+09] we generate code that follows the platform calling conventions (see Section 2.2.3 for how to create a procedure interface [Ear09]) so it can be cast to a function pointer at runtime and called as a normal function/procedure.

The compilation process for a loop is started when the frontend passes an optimized trace to the backend. Before the instructions contained in the trace can be compiled the backend needs to take some preparatory steps to provide an interface to the compiled trace that can be executed. This consists of creating a callable interface and generating the instructions to set up the frame and load the arguments for the loop.

6.3.1 Frame Layout

Before we can execute instructions we need to set up the frame of the procedure. For this we generate instructions that create a frame as described below. The frame layout is composed of four parts:

• callee saved registers, according to the calling convention
• a slot to store the force index, a value used to check if the interpreter level frame was forced (see Section 6.4.6)
• an area to store spilled variables
• and the stack where push and pop instructions operate

The frame begins with the registers the function has to save according to the calling convention described in Section 2.2.3, which on ARM are the registers r4 up to r11 (the FP). After these registers one word on the stack is left to store the force index, see Section 6.4.6. After the location for the force index the frame contains space for spilled register values. The address of the beginning of this area is stored in the FP register and spilled values are addressed by their offset from the FP. After this area the classic stack area begins,


where intermediate results are stored and registers are pushed and popped around subprocedure calls. The stack pointer (SP) always points to the last value pushed on the stack, so at the beginning of the execution it points to the end of the spilling area and is modified automatically every time a value is pushed on the stack or popped from it. Deviating from the calling convention we use the frame pointer to mark the location where the spilled registers are stored in the frame. The exact size of the spilling area is determined by the number of registers the register allocator needs to spill for a particular loop and is only known after compiling the instructions for the loop, so the exact position of the SP is patched once the size of the spilling area is known. Figure 17 shows the layout of the frame at the beginning of the execution of a compiled loop.

Figure 17: Frame layout used by the ARM backend. From top to bottom the frame contains the previous frame, the callee saved registers r4 to r10 and FP, the force index slot (marked by the frame pointer), the spilling area and the stack area (marked by the stack pointer).

6.3.2 Function Interface

The interface mentioned above to call the compiled code is generated by creating an interface that follows the rules imposed by the AAPCS and then performs a set of operations that set up the frame and the state to execute the loop.

The function interface generated for ARM starts by pushing the callee saved registers on the stack. As a next step we generate instructions to move the stack pointer by one word to leave room for the force index. Then we generate four no-ops, where we later patch in the instructions that set up the size of the spilling area.
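As a rough sketch, with helper names that are assumptions rather than the backend's real interface, the prologue just described could be emitted like this:

# Minimal stand-in that records emitted pseudo-instructions instead of bytes.
class MachineCodeStub(object):
    def __init__(self):
        self.ops = []
    def PUSH(self, regs):          self.ops.append(('PUSH', regs))
    def SUB_ri(self, rd, rn, imm): self.ops.append(('SUB', rd, rn, imm))
    def NOP(self):                 self.ops.append(('NOP',))
    def currpos(self):             return len(self.ops)

WORD = 4

def emit_prologue(mc):
    mc.PUSH(['r%d' % i for i in range(4, 12)])  # callee saved registers r4-r11/FP
    mc.SUB_ri('sp', 'sp', WORD)                 # reserve one word for the force index
    patch_pos = mc.currpos()                    # remembered for later patching
    for _ in range(4):
        mc.NOP()                                # placeholders, later replaced by the SP
                                                # adjustment for the spilling area once
                                                # its size is known
    return patch_pos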


With these instructions the entry point is set up and we can proceed to load the input arguments and compile the operations in the trace.

After generating the function interface we generate instructions to load the input arguments of the loop into locations controlled by the register allocator, setting up the state for the loop. Input values for a loop are passed to the backend through pre-allocated arrays, one for each argument type. These types can be floats, integers, long longs and pointers, although at the current time the backend only supports integers and pointers. At runtime we allocate a register for each box passed to the loop using these lists of input arguments. For each box we then generate the instructions to load the value stored in the corresponding list in memory into an allocated register.

6.4 Compiling Trace Operations

Once the frame is set up and the instructions to load the input arguments into registers have been generated, the next step is to generate instructions for each of the operations in the trace. The code generation for operations is rather straightforward and is divided into two steps. For each operation the first step is allocating the registers for the operation's arguments and, if present, for the result. The second step is the instruction selection and generation step, which actually emits the instructions that implement the operation using the allocated registers. A goal we try to achieve during instruction selection is to emit the instructions that provide the best execution speed for the kind of arguments of an operation, i.e. selecting the correct instruction to add a register and a small constant. A sketch of how the two steps are driven for a whole trace is shown after the following overview.

We are going to look at how these operations are implemented by taking a look at the different groups of trace operations that are handled by the backend. The operations can be categorized as follows:

• arithmetic operations
• comparison operations
• memory allocation
• memory access
• calls
• forcing of frames
• guards
• jumps
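Before looking at the individual groups, the following sketch illustrates how the two steps described above could be driven for a whole trace; the prepare_op_*/emit_op_* naming mirrors the figures in the following subsections, but the driver itself is only an illustration, not the backend's actual main loop.

# Illustrative driver for the two-step scheme: for every operation the register
# allocator decides on locations first, then the matching emit function
# generates the machine code for exactly those locations.
def compile_trace_operations(assembler, regalloc, operations, fcond='AL'):
    for op in operations:
        opname = op.getopname()                      # e.g. 'int_sub'
        prepare = getattr(regalloc, 'prepare_op_' + opname)
        emit = getattr(assembler, 'emit_op_' + opname)
        arglocs = prepare(op, fcond)                 # step 1: allocate registers
        emit(op, arglocs, regalloc, fcond)           # step 2: select and emit instructions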


6.4.1 Arithmetic Operations

The JIT implements operations for unary and binary arithmetic on signed and unsigned integers and long longs as well as on floating point numbers.

According to the scheme described above, for arithmetic operations we first allocate registers for the operands and for the result. How this is done for the int_sub operation is shown in the function prepare_op_int_sub in Figure 18. This function allocates the registers for the variables involved in the operation and makes sure the corresponding values are stored in those registers. Once we have allocated registers for the current operation we emit the actual instructions, done by the function emit_op_int_sub also shown in Figure 18, taking care to select the most suitable instructions depending on the input variables. For this operation we check if the second operand fits in an immediate value and emit an operation which takes one register and one immediate operand. If the first argument is an immediate value we make use of ARM's reverse subtract operation, which subtracts the second operand from the value stored in the register passed as first operand. Finally, if both values are stored in registers we emit an instruction which takes both arguments in registers. This procedure is similar for all arithmetic operations, except for those not provided by the underlying platform.

def prepare_op_int_sub(self, op, fcond):
    a0, a1 = op.getarglist()
    l0 = self.make_sure_var_in_reg(a0)
    l1 = self.make_sure_var_in_reg(a1)
    res = self.force_allocate_reg(op.result)
    return [l0, l1, res]

def emit_op_int_sub(self, op, arglocs, regalloc, fcond):
    l0, l1, res = arglocs
    if l0.is_imm():
        self.mc.RSB_ri(res.value, l1.value, l0.value)
    elif l1.is_imm():
        self.mc.SUB_ri(res.value, l0.value, l1.value)
    else:
        self.mc.SUB_rr(res.value, l0.value, l1.value)

Figure 18: Implementation of the int_sub operation

For operations such as division and modulo, which are not supported by the ARMv7-A profile of the ARM instruction set, we provide pre-compiled wrapper functions that implement the behaviour expected by the JIT operations. The ARMv7-R profile supports these operations and would not need these helper functions. The functions are compiled when the backend is translated and rely on the implementations of these operations provided by the compiler vendors, which usually comply with ARM's EABI. The EABI [Smi09] provides a binary interface which, among other things, defines interfaces for arithmetic functions to be followed by compiler vendors. As defined in the EABI, compiler vendors can reuse the implementation of functions defined in the EABI from other libraries at link time, or they can generate code for the used functions themselves. When generating EABI functions the compiler can emit code that uses the features provided by the selected version of the EABI and the hardware platform, exploiting the presence of vector and/or floating point units or falling back to software based implementations, to make the best use of the platform while still adhering to a standardized interface which is reusable across toolchains.
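For illustration, lowering one of these unsupported operations could look roughly as follows. The operation shown, the helper address and the emit interface are assumptions; the only fixed points are that, following the AAPCS, the two arguments travel in r0 and r1 and the result comes back in r0.

# Sketch (not the backend's real code) of lowering an integer division to a
# call to a pre-compiled helper following the EABI/AAPCS calling convention.
r0, r1 = 0, 1

def emit_op_int_floordiv(self, op, arglocs, regalloc, fcond):
    l0, l1, res = arglocs
    self.mc.MOV_rr(r0, l0.value)           # dividend into r0
    self.mc.MOV_rr(r1, l1.value)           # divisor into r1
    self.mc.BL(self.division_helper_addr)  # branch-and-link to the wrapper function
    self.mc.MOV_rr(res.value, r0)          # the result is returned in r0
    return fcond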


6.4.2 Comparison Operations

Among the operations a trace can contain there are several comparison operations that perform boolean, equality, greater-than or less-than checks between two variables. We implemented these operations using the standard ARM CMP operation, which compares two registers or a register and a constant and sets the condition flags of the processor according to the result. Using the conditional execution feature we can then assign the result to a register according to the state of the condition flags. The same procedure is used for boolean checks, where we compare a variable against a constant, considering all non-zero values as true.

As an example, the i2 = int_lt(i0, i1) operation checks if i0 is less than i1 and assigns the result of the comparison to the variable i2. Assume that i0 is stored in register r9, the variable i1 is stored in register r4 and the result should be stored in register r0. The generated code for the operation would then look like the one shown in Figure 19.

CMP r9, r4
MOVLT r0, #1
MOVGE r0, #0

Figure 19: Instructions generated for the int_lt(i0, i1) operation

6.4.3 Allocation

In order to improve the performance of operations that perform memory allocation, especially in conjunction with PyPy's own garbage collectors, the JIT provides operations to allocate new objects, strings and arrays. These objects are allocated in the same way as done in the interpreter, so they can be passed to the frontend when the loop is left. Currently the ARM backend for PyPy's JIT only supports the Boehm-Demers-Weiser conservative garbage collector [BW88], which is a library that replaces the standard malloc implementation. In the future we plan to support PyPy's own garbage collectors. With the supported garbage collector the allocation procedure is similar for all three types of allocations. We calculate how much memory is required based on the number and size of fields or elements the object has and the size of the meta data stored in the object. Once the size is known we generate a call to the malloc implementation provided by the garbage collector, which returns the address of the allocated memory. In the freshly allocated object we store its metadata, i.e. the length of a list or string, and finally we associate the object with the corresponding variable.

6.4.4 Memory Access

Having optimized operations for memory allocation, we also want to be able to optimize the operations that interact with objects stored in memory: reading and writing fields of an object, or reading and writing items of strings and arrays. For these operations we first generate instructions that calculate the address of the field based on the address


of the object and the offset within it. Then, depending on the size of the field, we generate an instruction that reads or writes a byte, a half-word or a word and stores the result in the corresponding allocated register.

6.4.5 Calls

There are two cases where we might want to perform external calls from our traces. The first case is calling an externally defined function, such as malloc for allocation. The second is when a trace calls an already compiled trace.

The first kind of call is handled by the call operation, used to call any function that follows the AAPCS (see Section 2.2.3). When compiling this instruction we move the arguments to the registers r0 to r3 and to the stack, according to the AAPCS, and update the stack pointer. We write the return address to the link register and perform the jump to the procedure. When we return from the call we associate the return value stored in register r0 with the corresponding variable, so it is further managed by the register allocator.

The other kind of call operation supported by the JIT is called call_assembler. This operation is used to make a call from one compiled loop to another. It has the advantage that we do not need to compile all the instructions of the called loop again, as would be the case when inlining the operations. The loop is invoked as any other procedure, passing the arguments to the loop following the AAPCS. We call from one compiled loop to another in this way to avoid the overhead of passing all arguments through the preallocated lists, as is done when calling the loop from the JIT frontend. To be able to call the compiled loop in this way we generate a second entry point to the function that wraps the compiled loop. The additional entry point is generated after the actual instructions in the loop have been compiled, because then we know where all the variables should be stored when we enter the loop body. When we call the loop using the second entry point we generate instructions that first set up the frame and then read the values passed in registers and on the stack and put them in the locations the loop operations expect them in. Once the arguments are in place we jump directly to the loop body, skipping the original setup instructions.

6.4.6 Forcing the Frame Object

One of the optimizations performed by the JIT is to remove object allocations. In languages such as Python the frame of an interpreter level function is accessible from the interpreter and is a heap allocated object that contains references to the local variables. Among the optimizations performed by PyPy's JIT is avoiding the allocation of the objects within the frame by virtualizing them. In case some function called from the optimized loop accesses the frame we need to forcibly allocate the objects in the frames for the scope of the optimized loop; this operation is called forcing.

When the frame needs to be forced the backend stores the variables used in the loop to a location where the frontend can read them. The frontend then allocates the objects referenced


in the frames. When forcing happens we cannot continue with the execution of the optimized loop and have to fall back to interpretation. Because of this, around calls that can trigger such a frame forcing we need to check if forcing happened, in order to leave the optimized loop in that case.

To perform the steps described above we make use of the force_index mentioned before when describing the frame layout. We use the slot of the force_index on the stack to store a value that is changed when we force the frame. Using this value we can check if the frame was forced and exit the loop if necessary. Additionally we make use of the address of the force_index, which is stored in the FP register, to restore the variables used in the loop so the frontend can create the frame objects. We can restore the variables using the address of the force_index because the variables are stored at a known offset from this address.

6.5 Guards and Leaving the Loop

As described in Section 3.2 traces correspond to only one possible path through the traced loop. Operations called guards are inserted in the trace at all points where the execution could take a different path than the one recorded during tracing. These operations check that the conditions valid during tracing are still valid during execution. Examples for such guards are guard_value, guard_class, guard_true, etc., which check the condition indicated by their name to be true.

Because a trace contains a large number of guards and guards fail rarely, we want to implement guards in such a way that the case where the guard does not fail is efficient. The efficiency of the failure case is not as important.

Guards are implemented by generating a comparison instruction on the input argument of the guard and the provided condition. If the condition passes we jump over the block of instructions generated for the guard and continue with the execution of the loop. In case the check fails we enter the guard block. Within the guard block we first save the locations of all variables, i.e. the registers they are associated with, and jump to a procedure that takes care of all the steps needed to exit the loop. Each guard is associated with a list of variables that should be passed to the frontend to restore the state of the execution in case the guard fails. When we generate the instructions for a guard, we additionally allocate a small piece of memory which is associated with the guard. In this piece of memory we store, in an encoded form, the information about where each of the arguments associated with the guard is currently stored. Because of the large number of guards generated we want to keep this encoding as compact as possible to reduce the memory footprint of the compiled loops. For each variable the encoding consists of up to 6 bytes: one byte for the type (int, pointer or float), one for the location (stack, register or immediate) and four for the value itself; in the case of register locations we only store the type and the register number, which on ARM fits into one byte.

After a guard failure we jump to a pre-generated procedure that serves as a common exit path for all compiled loops.
This procedure takes care of saving all values that should be passed to the frontend and then finally leaves the compiled loop by returning from the function that executes the trace. The saving of the values is done by reading the encoding


of the variable locations generated for the guard. With that information we can load the corresponding values and save them to preallocated arrays where the frontend can read them. The caller or the frontend can then read all values from the corresponding lists and restore the state before resuming interpretation. Once all values are saved we generate instructions to return from the function we generated for the trace.

To illustrate this we describe the procedure for a guard_true operation that produces the code shown in Figure 20. The guard in this example has one input argument (i0) and two variables (p0 and i0) that should be passed to the frontend in case the guard fails.

...
CMP r1, 0
ADDNE PC, 20
LDR lr, pc + 8
LDR pc, pc + 8
ADDRESS OF ENCODED LOCATIONS
ADDRESS OF EXIT PROCEDURE
...

Figure 20: Generated machine code for guard_true(i0, [p0, i0])

Let's assume the variable i0 is associated with register r1 and p0 is currently stored on the stack with an offset of one word from the FP. The instructions generated for the guard can be seen in Figure 20. First we generate a CMP instruction to compare the value of i0 with 0; in PyPy we compare with 0 because all non-zero values are true. If the value in register r1 is zero, the conditional ADDNE instruction that would otherwise skip the guard block has no effect; we then load the address of the memory location containing the encoded locations of the variables into the LR register and finally load the address of the exit procedure into the PC register, thus performing an unconditional jump to it.

The encoding of the variable locations for our guard is shown in Figure 21. We first encode the location of the variable p0. The information that p0 is a pointer is encoded as \xEE. Then we encode the information that it is stored on the stack, represented as \xFC, and finally we store the offset to the FP as a sequence of four bytes. After p0 we encode the variable i0, first encoding the information that it is an integer (represented as \xEF) and then the number of the register it is stored in. In the case of values stored in registers we omit the information about the location in the encoding.

\xEE \xFC 0 0 0 1
\xEF 1

Figure 21: Encoding of variable locations for guard_true(i0, [p0, i0]). Variable p0 is spilled on the stack and variable i0 is stored in register r1
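A decoder for this format, as used by the exit procedure and later by the bridge setup, could be sketched as follows. The byte values for the pointer and integer types and for stack locations are those of Figure 21; the byte order of the offset and the omission of immediate locations are simplifying assumptions.

# Sketch of decoding the per-guard location encoding described above.
# 'encoded' is a list of byte values; register numbers fit into one byte.
TYPE_PTR, TYPE_INT = 0xEE, 0xEF     # type markers (Figure 21)
LOC_STACK = 0xFC                    # marker for a stack location

def decode_locations(encoded):
    locations, i = [], 0
    while i < len(encoded):
        kind = encoded[i]                      # int, pointer (or float)
        if encoded[i + 1] == LOC_STACK:        # stack: four bytes of FP offset follow
            b = encoded[i + 2:i + 6]
            offset = b[0] << 24 | b[1] << 16 | b[2] << 8 | b[3]   # byte order assumed
            locations.append((kind, 'stack', offset))
            i += 6
        else:                                  # register: only the register number
            locations.append((kind, 'reg', encoded[i + 1]))
            i += 2
    return locations

# the encoding of Figure 21: p0 on the stack at offset 1, i0 in register r1
print(decode_locations([0xEE, 0xFC, 0, 0, 0, 1, 0xEF, 1]))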


6.6 Bridges

When a specific guard fails often enough that its failure count passes a defined threshold, the JIT frontend begins to trace the execution from the failing guard. The trace is recorded until the entry point of another loop is found. This kind of trace is called a bridge. Once a bridge has been traced it is passed to the backend to be compiled. The compilation of the instructions contained in a bridge is the same as for a loop. The difference lies in the setup of the bridge and how it is entered. A bridge is entered from a failing guard. To make entering the bridge as efficient as possible we patch the previously generated code for the guard in the loop with a jump to the compiled bridge. The patched guard from our example above would then look as shown in Figure 22. Now in case the guard fails, instead of jumping to the exit procedure it jumps directly to the code of the bridge. This means that when we enter a bridge we are in the same frame as the loop, and thus have the same state, i.e. allocated registers, spilled values etc. When we generate a bridge we need to set up the state to be the same as it was in the loop at the time the guard failed; for this we use the same encoded information used when leaving the loop, decoding it and restoring the register and stack bindings of the variables.

...
CMP r1, 0
ADDNE PC, 20
LDR pc, pc + 4
ADDRESS OF THE BRIDGE
ADDRESS OF ENCODED LOCATIONS
ADDRESS OF EXIT PROCEDURE
...

Figure 22: Patched machine code for guard_true(i0, [p0, i0])

7 Cross Translation

Translating PyPy is a memory and computation intensive task. To translate the full Python interpreter at least 2GB of memory are required. Most ARM devices target either mobile or embedded platforms; these devices usually do not have the power or resources to translate larger programs such as PyPy's Python interpreter on the device itself. Therefore, based on the work done for PyPy on Maemo [PyP11a], we implemented a cross translation target to translate PyPy on a more powerful host machine and generate code for ARM processors.

PyPy has a mechanism to select different platforms and compilers to translate the generated C code. This mechanism also provides an interface to gather information about the platform and its libraries and to compile and execute programs on the platform. For this thesis we implemented an additional platform for PyPy which can generate a binary for


ARM using a cross compiler, such as the GCC ARM cross compiler toolchain⁴, and which uses Scratchbox2⁵ to provide a chrooted environment with the libraries of the target system and the possibility to run small programs in an emulated environment based on QEMU⁶. During the translation process PyPy needs to compile and execute several small programs to gather information about the target platform it will generate code for. With the cross-translation approach the Python interpreter that performs the translation runs on the host system, and all calls to the compiler and all executions of the compiled programs are redirected through the Scratchbox, working as if they were run on an ARM based system. This approach has some drawbacks, but gives a translation speed that is comparable to the speed when targeting the host environment. The biggest drawback is that the Python interpreter running the translation process needs to be compiled for the same word size as the target system provides. So it is impossible to cross-translate from a 64-bit Python to ARM, which is a 32-bit platform.

4 http://gcc.gnu.org
5 http://freedesktop.org/wiki/Software/sbox2
6 http://www.qemu.org


8 Evaluation

Figure 23: BeagleBoard-xM development board

In order to evaluate the performance of the ARM backend and to analyze how it behaves compared to PyPy on x86, in this section we present the results of different benchmarks. We gathered results from benchmarks executed on PyPy's Python interpreter and on Pyrolog [BLS10], a Prolog interpreter written in RPython.

8.1 Hardware

All tests and benchmarks on ARM were run on an otherwise idle BeagleBoard-xM⁷ running Ubuntu 10.10 for ARM⁸. The BeagleBoard-xM is an ARM based development board with an ARM Cortex-A8 CPU at 1 GHz and 512MB of memory, running Linux 2.6.35. See Figure 23 for a picture of the board.

The benchmarks on x86 were performed on an otherwise idle Intel Core2 Duo P8400 processor with 2.26 GHz and 3072 KB of cache on a machine with 3GB RAM running Linux 2.6.35.

8.2 Benchmarks

To measure the performance of the ARM backend we performed two different sets of benchmarks, one running the PyPy benchmarks and one running the Prolog benchmarks described in [BLS10]. All interpreters were translated using a fixed revision (370c23f085d7) of the ARM branch of PyPy's mercurial repository⁹.

7 http://beagleboard.org/hardware-xM
8 https://wiki.ubuntu.com/ARM
9 https://bitbucket.org/pypy/pypy/src/arm-backend-2


8.2.1 Python Benchmarks

To run the PyPy Python benchmarks we created a set of different versions of PyPy's Python interpreter. For ARM we created a version of PyPy with and one without the JIT, both using the Boehm-Demers-Weiser garbage collector [BW88] (which we will refer to as the Boehm GC), to compare them against CPython on ARM. Additionally we created four versions of PyPy's Python interpreter for x86, covering all combinations of JIT, no JIT, Boehm GC and PyPy's own GC. The CPython version on ARM and x86 is version 2.6.6 revision r266:84292. The benchmarks are a mix of small and medium sized Python programs composed of synthetic benchmarks and real applications. We had to remove the spambayes, spitfire, pyflate-fast and rietveld benchmarks from the full set, because they were hitting an issue in the ARM backend that is still open. The full set of benchmarks as well as the current results on x86 can be seen on the PyPy Speed Center website¹⁰; the benchmarks' source code can be found in the PyPy benchmarks repository¹¹. All benchmarks were run 50 times in the same instance of the VM, taking the average of all runs, so that the results contain interpretation, warm-up, code generation and execution of the generated code. The errors were calculated using a confidence interval with a 95% confidence level [GBE07].

10 http://speed.pypy.org
11 http://codespeak.net/svn/pypy/benchmarks/
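Assuming the usual construction of such an interval (the exact recipe of [GBE07] is not restated here), the reported error for a benchmark with n = 50 runs, sample mean x̄ and sample standard deviation s corresponds to

    x̄ ± t(0.975, n-1) · s / sqrt(n)

with t(0.975, 49) approximately 2.01.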
The results for the benchmarks run on the BeagleBoard are shown below. In Table 1 we present the absolute runtimes for the different benchmarks. Table 2 shows the relative runtimes, which, as well as the graphed data in Figure 24, are normalized to CPython. To put the results gathered on ARM into context we also performed the same benchmarks on x86 with all four PyPy versions mentioned above. The graph of the relative speed compared to CPython on x86 can be seen in Figure 25. The tables with the absolute results (Table 6) and the relative results (Table 7) can be found in Section 11. The benchmarks show that running PyPy with the Boehm GC is, with one exception, always slower than the corresponding version of PyPy with its own GC. Also, on x86 the versions using the Boehm GC and the JIT are almost always faster than the version with the Boehm GC and no JIT.

Looking at the graph in Figure 24 it can be seen that on ARM the JIT almost always improves over the non-jitted version of PyPy. At the same time the results compared to CPython are not as clear. While PyPy without a JIT is always slower than CPython, the jitted version only outperforms CPython in about half of the benchmarks. In some benchmarks the jitted version is significantly slower than CPython, even being slower than PyPy without a JIT on the twisted_pb benchmark. The mixed results on ARM and the results on x86, especially those for the versions of PyPy using the Boehm GC, show the importance of a fast GC that is well integrated with the JIT.

An important result can be seen when comparing the relative results calculated for both platforms. When looking at the relative results comparing the PyPy versions to CPython on both platforms, as shown in Table 3, we see that on ARM we get relative results that are comparable to those achieved on x86 for the interpreters using the Boehm GC with and without the JIT. The cases where the relative results differ might be due to several micro-optimizations that have not yet been done in the ARM backend as well as some


architectural differences. For some benchmarks, such as the float benchmark, the ARM backend is currently slower than the x86 backend because on ARM the JIT does not yet support optimized floating point operations. On x86 the JIT in combination with PyPy's own GCs performs significantly better than with Boehm. Given the similar results on both platforms using the Boehm GC, it is to be expected that we will see speedups comparable to those on x86 once PyPy's GCs are integrated on ARM.

Benchmark | cpython [ms] | nojit, boehm [ms] | jit, boehm [ms]
ai | 5931.82 ± 7.15 | 18630.77 ± 58.60 | 13316.69 ± 492.45
bm_mako | 1976.47 ± 12.54 | 6045.01 ± 274.69 | 3080.70 ± 234.98
chaos | 6548.26 ± 19.39 | 16049.27 ± 67.81 | 4100.33 ± 851.92
crypto_pyaes | 26746.72 ± 9.91 | 127704.81 ± 53.35 | 12099.35 ± 1280.47
django | 13111.72 ± 264.58 | 27510.31 ± 605.54 | 6240.66 ± 222.82
fannkuch | 17592.60 ± 65.52 | 61427.77 ± 1056.54 | 18752.91 ± 240.71
float | 8642.43 ± 129.71 | 16318.45 ± 335.48 | 4527.81 ± 293.27
go | 11457.86 ± 38.30 | 67172.65 ± 60.21 | 11438.12 ± 801.13
html5lib | 177656.38 ± 785.98 | 660478.63 ± 58354.54 | 339913.73 ± 15962.25
meteor-contest | 3758.29 ± 4.33 | 10932.99 ± 70.55 | 8207.28 ± 222.83
nbody_modified | 7533.80 ± 52.81 | 11473.04 ± 61.70 | 7840.77 ± 109.42
raytrace-simple | 33938.57 ± 528.15 | 77114.42 ± 960.34 | 17280.65 ± 414.07
richards | 3712.72 ± 14.59 | 10840.95 ± 93.97 | 1187.04 ± 118.55
slowspitfire | 6378.05 ± 1.84 | 25646.74 ± 1234.26 | 21371.59 ± 8929.30
spectral-norm | 6990.59 ± 30.40 | 28566.51 ± 25.40 | 4492.29 ± 171.88
spitfire_cstringio | 153292.00 ± 125.83 | 257690.40 ± 636.23 | 189844.20 ± 979.49
telco | 15434.80 ± 9.53 | 61941.56 ± 49.16 | 10938.12 ± 219.95
twisted_iteration | 2055.69 ± 4.71 | 7319.68 ± 86.70 | 1305.72 ± 149.00
twisted_names | 139.98 ± 0.98 | 362.43 ± 6.55 | 385.62 ± 138.98
twisted_pb | 1076.19 ± 66.42 | 3400.00 ± 471.40 | 4066.67 ± 1394.84
twisted_tcp | 13386.24 ± 81.21 | 43825.39 ± 3205.68 | 33173.15 ± 4922.12
waf | 77516.19 ± 3944.68 | 80218.64 ± 8958.83 | 77198.68 ± 6029.46

Table 1: Absolute runtimes for standard PyPy benchmarks run on ARM


Benchmark | cpython | nojit, boehm | jit, boehm
ai | 1.00 | 3.14 | 2.24
bm_mako | 1.00 | 3.06 | 1.56
chaos | 1.00 | 2.45 | 0.63
crypto_pyaes | 1.00 | 4.77 | 0.45
django | 1.00 | 2.10 | 0.48
fannkuch | 1.00 | 3.49 | 1.07
float | 1.00 | 1.89 | 0.52
go | 1.00 | 5.86 | 1.00
html5lib | 1.00 | 3.72 | 1.91
meteor-contest | 1.00 | 2.91 | 2.18
nbody_modified | 1.00 | 1.52 | 1.04
raytrace-simple | 1.00 | 2.27 | 0.51
richards | 1.00 | 2.92 | 0.32
slowspitfire | 1.00 | 4.02 | 3.35
spectral-norm | 1.00 | 4.09 | 0.64
spitfire_cstringio | 1.00 | 1.68 | 1.24
telco | 1.00 | 4.01 | 0.71
twisted_iteration | 1.00 | 3.56 | 0.64
twisted_names | 1.00 | 2.59 | 2.75
twisted_pb | 1.00 | 3.16 | 3.78
twisted_tcp | 1.00 | 3.27 | 2.48
waf | 1.00 | 1.03 | 1.00

Table 2: Relative runtimes for standard PyPy benchmarks run on ARM, normalized to CPython


Benchmark | ARM, nojit | ARM, jit | x86, nojit | x86, jit
ai | 3.14 | 2.24 | 2.82 | 1.69
bm_mako | 3.06 | 1.56 | 3.38 | 1.89
chaos | 2.45 | 0.63 | 2.13 | 0.32
crypto_pyaes | 4.77 | 0.45 | 3.14 | 0.34
django | 2.10 | 0.48 | 1.76 | 0.40
fannkuch | 3.49 | 1.07 | 2.07 | 0.81
float | 1.89 | 0.52 | 1.62 | 0.22
go | 5.86 | 1.00 | 4.74 | 0.74
html5lib | 3.72 | 1.91 | 3.53 | 1.78
meteor-contest | 2.91 | 2.18 | 2.28 | 1.63
nbody_modified | 1.52 | 1.04 | 1.53 | 0.73
raytrace-simple | 2.27 | 0.51 | 2.06 | 0.20
richards | 2.92 | 0.32 | 1.97 | 0.23
slowspitfire | 4.02 | 3.35 | 2.91 | 2.97
spectral-norm | 4.09 | 0.64 | 4.43 | 0.34
spitfire_cstringio | 1.68 | 1.24 | 2.09 | 1.76
telco | 4.01 | 0.71 | 3.63 | 0.60
twisted_iteration | 3.56 | 0.64 | 3.54 | 0.65
twisted_names | 2.59 | 2.75 | 2.43 | 1.10
twisted_pb | 3.16 | 3.78 | 3.91 | 2.32
twisted_tcp | 3.27 | 2.48 | 3.94 | 3.09
waf | 1.03 | 1.00 | 1.00 | 1.01

Table 3: Relative results of the PyPy benchmarks using the Boehm GC on ARM and x86, normalized to CPython for the corresponding platform


Figure 24: PyPy benchmarks run on ARM, normalized to CPython


Figure 25: PyPy benchmarks run on x86, normalized to CPython


8.2.2 Prolog Benchmarks

To see how the ARM backend performs when the JIT is applied to an interpreter that is not based on a bytecode dispatch loop, we used the Pyrolog interpreter and the benchmarks described in [BLS10]. Pyrolog is a Prolog interpreter written in RPython that can be translated using the PyPy toolchain. On ARM we created two versions and compared them, as for Python, to the same interpreter translated in different versions for x86. For ARM we created one version with and one version without the JIT, both using the Boehm GC. On x86 we created the same set of versions as we did for Python. On both platforms we compared the generated versions of Pyrolog to SWI-Prolog¹², which is available for Ubuntu 10.10 on ARM and x86 in version 5.8.2. For comparability the benchmarks were run in the same way as described in [BLS10], running each benchmark twice in the same process to give the JIT time to warm up and generate code, and measuring the second run.

Benchmark | no jit, boehm | jit, boehm | SWI
iterate_assert | 1270 ms | 10 ms | 110 ms
iterate_call | 3060 ms | 10 ms | 3190 ms
iterate_cut | 2690 ms | 360 ms | 70 ms
iterate_exception | 3140 ms | 350 ms | 4990 ms
iterate_failure | 2780 ms | 130 ms | 430 ms
iterate_findall | 4190 ms | – | 740 ms
iterate_if | 7410 ms | 10 ms | 140 ms
iterate | 1190 ms | 10 ms | 50 ms

Table 4: Absolute runtimes for Pyrolog iterate benchmarks run on ARM

From the set of benchmarks we had to remove the arithmetic and reducer benchmarks, because they did not terminate in a reasonable amount of time on either platform. This, as well as the slower results compared to [BLS10], might be due to recent changes in the optimizer of PyPy's JIT that affect the Pyrolog interpreter negatively. Also we had to adapt some of the benchmarks so they could run on ARM, reducing the number of iterations they perform. The adapted benchmarks can be found in the arm branch of the pyrolog-benchmark repository¹³. All graphs of the results of the Prolog benchmarks are presented in a logarithmic scale.

Additionally, on ARM we performed the iterate benchmarks, which are rather atypical Prolog programs but show how a loop in the form of recursive calls is optimized by the JIT, compared to SWI-Prolog. From this set of benchmarks the iterate_findall benchmark crashed when executed with the JIT, thus producing no result. Also worth noting is the bad performance of iterate_cut, which is due to the bad handling of the cut operation by the JIT. An in-depth evaluation of these benchmarks has been done in [BLS10]. The results for the iterate set of benchmarks confirm the findings of [BLS10] that the JIT optimizes recursive calls and can optimize dynamic invocation particularly well, also giving a speedup on ARM in the cases where it does so on x86.

12 http://www.swi-prolog.org/
13 https://bitbucket.org/cfbolz/pyrolog-benchmark/changeset/arm


The results are shown in Table 4 and the graph of the relative runtimes normalized to SWI-Prolog is shown in Figure 27.

For the standard Prolog benchmarks we get the same kind of results as for Python. The benchmarks of the relative speed compared to SWI shown in Figure 28 were run on x86 using different GCs and show that also for this kind of interpreter the choice of the GC has a significant impact on the speed of the resulting interpreter. See Figure 26 for a graph of the relative runtimes on ARM normalized to SWI-Prolog and Table 5 for the absolute runtimes. The graph of the relative runtimes on x86 can be found in Figure 28. In this set of benchmarks we can also see that the relative results for Pyrolog on ARM and x86 compared to the corresponding versions of SWI-Prolog are similar, confirming that the ARM backend produces code with a relative performance similar to that of the x86 backend.

Figure 26: Pyrolog benchmarks run on ARM, normalized to SWI-Prolog


Figure 27: Pyrolog iterate benchmarks run on ARM, normalized to SWI-Prolog

Benchmark | own-interp | own-JIT warm | SWI
boyer | 45220 ms | 125840 ms | 890 ms
chat_parser | 761760 ms | 2140 ms | 62790 ms
crypt | 15970 ms | 44860 ms | 4520 ms
deriv | 48860 ms | 60 ms | 3780 ms
meta_nrev | 44630 ms | 11160 ms | 7590 ms
mu | 32190 ms | 1960 ms | 4360 ms
nrev | 16180 ms | 2250 ms | 1830 ms
poly | 7980 ms | 10960 ms | 80 ms
primes | 46390 ms | 14420 ms | 4320 ms
qsort | 41080 ms | 740 ms | 4000 ms
queens | 95930 ms | 38020 ms | 16860 ms
tak | 18930 ms | 8960 ms | 630 ms
zebra | 97320 ms | 230 ms | 17330 ms

Table 5: Absolute runtimes for Pyrolog benchmarks run on ARM


Figure 28: Pyrolog benchmarks run on x86, normalized to SWI-Prolog


9 Related Work

There are many virtual machines for dynamic languages that aim to improve their performance by including just-in-time compilers in their implementations. For instance, as described earlier, SELF [CUL89] contains a JIT that generates specialized code for different code paths depending on the input arguments of the optimized functions. The HotSpot JVM [PVC01] contains a JIT that directly generates machine code for functions and then profiles the compiled functions to find candidates for optimized recompilation. Other examples are some Smalltalk systems [DS84] that compile methods on their first execution.

Lately there have been several dynamic language implementations using tracing JIT technology to improve the performance of their virtual machines, such as Spur [BBF+10] or TraceMonkey [GFE+09]. Some of these virtual machines, especially those developed with mobile platforms in mind, also target the ARM architecture. The Dalvik VM, part of Google's Android mobile operating system, has contained a tracing JIT that targets ARM since version 2.2¹⁴. The TraceMonkey JavaScript engine used in the Firefox browser has been ported to ARM and tested in the mobile version of Firefox running on Android¹⁵. The LuaJIT project, which adds a trace based JIT compiler to Lua, is working on adding ARM support to its JIT¹⁶.

There are also efforts that take a more general approach, as done by Libjit¹⁷, which is part of the DotGNU project¹⁸. Libjit aims to provide tools for developers to add a JIT to their language implementations, providing a library that encapsulates the low level details of a JIT and provides backends for different platforms, also supporting dynamic code generation on ARM [TCAR09].

For Python there have been previous attempts to bring dynamic compilation to the language, most notably Psyco [Rig04], a specializing JIT compiler for Python. Psyco is a handwritten JIT for Python that can be used with CPython as an extension module. This makes it hard to maintain and dependent on changes made to CPython. As an extension module written in C, Psyco can only be used with CPython and only works on x86 32-bit processors. In contrast, PyPy can be used to implement different languages and is constructed in such a way that it can target different architectures, allowing a JIT compiler to be generated at translation time for the target architecture. Additionally, because PyPy is written in Python itself, it provides a higher level of abstraction and better maintainability.

14 http://developer.android.com/sdk/android-2.2-highlights.html
15 http://blog.cdleary.com/2011/02/picky-monkeys-pic-arm/
16 http://lua-users.org/lists/lua-l/2011-01/msg01238.html
17 http://www.gnu.org/software/dotgnu/libjit-doc/libjit.html
18 http://www.gnu.org/software/dotgnu/


10 Conclusion and Future Work

In this thesis, after presenting the technical and theoretical background of ARM, PyPy and just-in-time compilation, we have described how a JIT backend for PyPy's tracing JIT is built, with a special focus on targeting the ARM architecture. We showed how the backend translates the instructions recorded by PyPy's tracing JIT during execution and how these are prepared and executed on an ARM based platform. We presented how it is possible, by creating a cross-translation target for ARM, to translate an interpreter written in RPython using PyPy so that it includes the presented backend while being translated on a different platform. With a cross-translation target it is possible to translate on systems that provide enough resources to translate PyPy and still target ARM devices. Based on these results we performed several benchmarks on PyPy's Python implementation and on Pyrolog, a Prolog implementation written in RPython. The benchmarks showed that our results are comparable to those achieved on x86 in terms of relative speed compared to the reference implementations. Based on the results of the benchmarks we can expect to see further performance improvements on ARM once PyPy's own garbage collectors are integrated with PyPy on ARM.

There are still several open or unfinished tasks in relation to the ARM backend for PyPy's JIT compiler. As mentioned before, the current implementation of the backend only supports the Boehm-Demers-Weiser garbage collector. One of the next steps is to integrate the ARM backend with PyPy's own exact garbage collectors, which should also increase the speed of the code generated by the backend. Furthermore, the support for floating point numbers should be finished and merged soon. Additionally, the JIT gained, about two months ago, support for long long operations, providing specialized code for 64 bit integer operations; we plan to implement these operations in the ARM backend too. Another open topic, not directly related to the ARM JIT backend, is to bring PyPy to Android phones, which run a special version of Linux on ARM processors. To do this we need to adapt the translation toolchain to target Android's NDK and C library and to investigate viable ways of integrating PyPy with the host environment. This would allow applications for mobile phones to be developed in Python, or in other languages implemented using PyPy, and to benefit from the speed gains of using a JIT compiler. We would also like to explore how PyPy's JIT performs on ARM compared with Android's Dalvik VM.

Acknowledgment

I would like to thank, in alphabetical order, Carl Friedrich Bolz, Antonio Cuni and Armin Rigo for their help during the development of the ARM backend as well as for their helpful comments during the writing of this thesis.


11 Annex

Benchmark | cpython [ms] | nojit, boehm [ms] | jit, boehm [ms] | nojit, gc [ms] | jit, gc [ms]
ai | 515.25 ± 1.65 | 1455.26 ± 5.21 | 869.16 ± 11.40 | 1045.98 ± 20.31 | 381.40 ± 23.65
bm_mako | 128.29 ± 0.63 | 434.11 ± 30.57 | 241.92 ± 28.00 | 378.61 ± 1.10 | 138.65 ± 12.26
chaos | 501.00 ± 4.65 | 1065.10 ± 7.04 | 159.68 ± 70.19 | 888.07 ± 2.01 | 42.24 ± 61.28
crypto_pyaes | 2828.41 ± 22.29 | 8875.09 ± 6.10 | 958.48 ± 82.07 | 5063.13 ± 8.38 | 177.35 ± 71.35
django | 990.67 ± 1.29 | 1744.97 ± 22.64 | 394.94 ± 21.74 | 1635.51 ± 2.07 | 155.88 ± 7.22
fannkuch | 1958.91 ± 7.24 | 4050.90 ± 62.17 | 1584.03 ± 19.76 | 2667.34 ± 12.80 | 402.22 ± 11.86
float | 569.37 ± 11.11 | 922.31 ± 27.85 | 127.43 ± 37.63 | 949.14 ± 20.16 | 106.10 ± 11.32
go | 932.47 ± 2.92 | 4416.65 ± 12.89 | 687.29 ± 34.19 | 2966.42 ± 5.25 | 203.37 ± 45.28
html5lib | 14451.03 ± 72.15 | 50954.23 ± 6874.04 | 25751.72 ± 3445.27 | 28898.94 ± 65.72 | 7008.61 ± 1505.37
meteor-contest | 351.97 ± 4.96 | 802.46 ± 5.54 | 574.82 ± 16.10 | 640.06 ± 2.24 | 282.25 ± 11.35
nbody_modified | 618.89 ± 4.65 | 943.92 ± 5.41 | 452.87 ± 10.95 | 660.92 ± 12.68 | 99.62 ± 4.31
raytrace-simple | 2655.83 ± 6.56 | 5475.93 ± 14.32 | 529.33 ± 15.65 | 3652.94 ± 33.11 | 130.93 ± 13.46
richards | 353.98 ± 3.62 | 697.24 ± 5.75 | 81.67 ± 10.90 | 552.73 ± 1.64 | 17.42 ± 3.27
slowspitfire | 630.94 ± 2.65 | 1837.00 ± 90.08 | 1872.80 ± 203.10 | 1828.18 ± 18.68 | 1135.90 ± 11.02
spectral-norm | 472.93 ± 1.17 | 2095.51 ± 6.29 | 158.45 ± 14.06 | 1069.36 ± 1.82 | 36.32 ± 8.60
spitfire_cstringio | 9231.00 ± 91.39 | 19257.20 ± 71.14 | 16227.60 ± 229.90 | 11522.40 ± 22.00 | 5056.60 ± 67.18
telco | 1202.20 ± 7.64 | 4360.59 ± 14.72 | 716.04 ± 15.16 | 2900.18 ± 4.92 | 188.49 ± 24.36
twisted_iteration | 154.48 ± 0.74 | 546.64 ± 1.85 | 100.91 ± 0.02 | 422.62 ± 1.24 | 37.75 ± 0.16
twisted_names | 9.63 ± 0.02 | 23.41 ± 0.13 | 10.63 ± 0.13 | 18.84 ± 0.11 | 5.36 ± 0.12
twisted_pb | 69.96 ± 1.24 | 273.84 ± 6.78 | 162.35 ± 13.11 | 145.03 ± 2.43 | 32.06 ± 1.44
twisted_tcp | 887.33 ± 3.12 | 3495.82 ± 4.88 | 2745.48 ± 26.98 | 1810.98 ± 10.89 | 931.62 ± 18.28
waf | 5415.81 ± 36.76 | 5405.10 ± 20.46 | 5493.30 ± 228.29 | 5413.62 ± 50.76 | 5373.34 ± 69.82

Table 6: Absolute runtimes for standard PyPy benchmarks run on x86


Benchmark | cpython | nojit, boehm | jit, boehm | nojit, gc | jit, gc
ai | 1.00 | 2.82 | 1.69 | 2.03 | 0.74
bm_mako | 1.00 | 3.38 | 1.89 | 2.95 | 1.08
chaos | 1.00 | 2.13 | 0.32 | 1.77 | 0.08
crypto_pyaes | 1.00 | 3.14 | 0.34 | 1.79 | 0.06
django | 1.00 | 1.76 | 0.40 | 1.65 | 0.16
fannkuch | 1.00 | 2.07 | 0.81 | 1.36 | 0.21
float | 1.00 | 1.62 | 0.22 | 1.67 | 0.19
go | 1.00 | 4.74 | 0.74 | 3.18 | 0.22
html5lib | 1.00 | 3.53 | 1.78 | 2.00 | 0.48
meteor-contest | 1.00 | 2.28 | 1.63 | 1.82 | 0.80
nbody_modified | 1.00 | 1.53 | 0.73 | 1.07 | 0.16
raytrace-simple | 1.00 | 2.06 | 0.20 | 1.38 | 0.05
richards | 1.00 | 1.97 | 0.23 | 1.56 | 0.05
slowspitfire | 1.00 | 2.91 | 2.97 | 2.90 | 1.80
spectral-norm | 1.00 | 4.43 | 0.34 | 2.26 | 0.08
spitfire_cstringio | 1.00 | 2.09 | 1.76 | 1.25 | 0.55
telco | 1.00 | 3.63 | 0.60 | 2.41 | 0.16
twisted_iteration | 1.00 | 3.54 | 0.65 | 2.74 | 0.24
twisted_names | 1.00 | 2.43 | 1.10 | 1.96 | 0.56
twisted_pb | 1.00 | 3.91 | 2.32 | 2.07 | 0.46
twisted_tcp | 1.00 | 3.94 | 3.09 | 2.04 | 1.05
waf | 1.00 | 1.00 | 1.01 | 1.00 | 0.99

Table 7: Relative runtimes for standard PyPy benchmarks run on x86 normalized to CPython

Benchmark | no-jit, boehm | jit, boehm | no-jit, own gc | jit, own gc | SWI
boyer | 4540 ms | 9450 ms | 1300 ms | 5430 ms | 100 ms
chat_parser | 58020 ms | 170 ms | 52920 ms | 40 ms | 9060 ms
crypt | 1200 ms | 3920 ms | 950 ms | 3170 ms | 490 ms
deriv | 3780 ms | 10 ms | 3340 ms | 00 ms | 420 ms
meta_nrev | 2910 ms | 1270 ms | 2180 ms | 490 ms | 650 ms
mu | 2660 ms | 130 ms | 1930 ms | 40 ms | 490 ms
nrev | 1040 ms | 300 ms | 970 ms | 410 ms | 180 ms
poly | 710 ms | 990 ms | 390 ms | 690 ms | 10 ms
primes | 5060 ms | 1190 ms | 1740 ms | 380 ms | 400 ms
qsort | 3020 ms | 70 ms | 2560 ms | 20 ms | 450 ms
queens | 7200 ms | 1890 ms | 5500 ms | 240 ms | 1870 ms
tak | 1840 ms | 820 ms | 700 ms | 180 ms | 80 ms
zebra | 7390 ms | 20 ms | 6990 ms | 10 ms | 2230 ms

Table 8: Absolute runtimes for Pyrolog benchmarks run on x86


References

[AACM07] ANCONA, Davide ; ANCONA, Massimo ; CUNI, Antonio ; MATSAKIS, Nicholas: RPython: a Step Towards Reconciling Dynamically and Statically Typed OO Languages. In: DLS '07: Proceedings of the 2007 symposium on Dynamic languages (2007), Oct. http://portal.acm.org/citation.cfm?id=1297081.1297091

[ALSU06] AHO, Alfred ; LAM, Monica ; SETHI, Ravi ; ULLMAN, Jeffrey: Compilers: Principles, Techniques, and Tools (2nd Edition). (2006), Aug. http://portal.acm.org/citation.cfm?id=1177220&coll=DL&dl=GUIDE&CFID=8318957&CFTOKEN=98469675

[Ard11] ARDÖ, Håkan: Loop invariant code motion. http://morepypy.blogspot.com/2010/12/we-are-not-heroes-just-very-patient.html. Version: 2011. – [Online; accessed 10-February-2011]

[ARM95] ARM HOLDINGS: An Introduction to Thumb™. (1995), Mar, S. 1–29

[ARM10] ARM HOLDINGS: ARM® Architecture Reference Manual. (2010), Sep, S. 1–2158

[ARM11] ARM HOLDINGS: Company Profile - ARM. http://www.arm.com/about/company-profile/index.php. Version: 2011. – [Online; accessed 13-January-2011]

[Ayc03] AYCOCK, J: A Brief History of Just-In-Time. In: ACM Computing Surveys (CSUR) 35 (2003), Nr. 2, S. 97–113

[BBF+10] BEBENITA, Michael ; BRANDNER, Florian ; FAHNDRICH, Manuel ; LOGOZZO, Francesco ; SCHULTE, Wolfram ; TILLMANN, Nikolai ; VENTER, Herman: SPUR: A Trace-Based JIT Compiler for CIL. In: OOPSLA '10: Proceedings of the ACM international conference on Object oriented programming systems languages and applications (2010), Oct. http://portal.acm.org/ft_gateway.cfm?id=1869517&type=pdf&coll=DL&dl=GUIDE&CFID=8318957&CFTOKEN=98469675

[BCF+11] BOLZ, Carl F. ; CUNI, Antonio ; FIJALKOWSKI, Maciej ; LEUSCHEL, Michael ; PEDRONI, Samuele ; RIGO, Armin: Allocation Removal by Partial Evaluation in a Tracing JIT. In: PEPM '11: Proceedings of the 20th ACM SIGPLAN workshop on Partial evaluation and program manipulation (2011), Jan. http://portal.acm.org/ft_gateway.cfm?id=1929508&type=pdf&coll=DL&dl=GUIDE&CFID=8318957&CFTOKEN=98469675

[BCFR09] BOLZ, Carl F. ; CUNI, Antonio ; FIJALKOWSKI, Maciej ; RIGO, Armin: Tracing the Meta-Level: PyPy's Tracing JIT Compiler. In: ICOOOLPS '09: Proceedings of the 4th workshop on the Implementation, Compilation, Optimization of Object-Oriented Languages and Programming Systems (2009), Jul. http://portal.acm.org/citation.cfm?id=1565824.1565827

[BDB00] BALA, Vasanth ; DUESTERWALD, Evelyn ; BANERJIA, Sanjeev: Dynamo: A Transparent Dynamic Optimization System. In: PLDI '00: Proceedings of the ACM SIGPLAN 2000 conference on Programming language design and implementation (2000), Aug. http://portal.acm.org/citation.cfm?id=349299.349303

[BEES04] BRUENING, Derek L.: Efficient, Transparent, and Comprehensive Runtime Code Manipulation. Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science. (2004), Jan, 306. http://books.google.com/books?id=WPpdNwAACAAJ

[BKL+08] BOLZ, Carl F. ; KUHN, Adrian ; LIENHARD, A ; MATSAKIS, Nicholas ; NIERSTRASZ, Oscar ; RENGGLI, Lukas ; RIGO, Armin ; VERWAEST, Toon: Back to the future in one week—implementing a Smalltalk VM in PyPy. In: Self-Sustaining Systems (2008), S. 123–139

[BLS10] BOLZ, Carl F. ; LEUSCHEL, Michael ; SCHNEIDER, David: Towards a Jitting VM for Prolog Execution. In: PPDP '10: Proceedings of the 12th international ACM SIGPLAN symposium on Principles and practice of declarative programming (2010), Jul. http://portal.acm.org/citation.cfm?id=1836089.1836102

[Bol10] BOLZ, Carl F.: We are not heroes, just very patient. http://morepypy.blogspot.com/2010/12/we-are-not-heroes-just-very-patient.html. Version: 2010. – [Online; accessed 02-February-2011]

[BW88] BOEHM, HJ ; WEISER, M: Garbage Collection in an Uncooperative Environment. In: Software - Practice and Experience 18 (1988), Nr. 9, S. 807–820. – Read on Jan. 4

[CFR+91] CYTRON, Ron ; FERRANTE, Jeanne ; ROSEN, Barry ; WEGMAN, Mark ; ZADECK, F: Efficiently Computing Static Single Assignment Form and the Control Dependence Graph. In: ACM Transactions on Programming Languages and Systems (TOPLAS) 13 (1991), Oct, Nr. 4. http://portal.acm.org/ft_gateway.cfm?id=115320&type=pdf&coll=DL&dl=GUIDE&CFID=13052475&CFTOKEN=24119255

[CUL89] CHAMBERS, Craig ; UNGAR, David ; LEE, Elgin: An efficient implementation of SELF, a dynamically-typed object-oriented language based on prototypes. In: ACM SIGPLAN Notices 24 (1989), Nr. 10, S. 49–70

[DS84] DEUTSCH, LP ; SCHIFFMAN, AM: Efficient Implementation of the Smalltalk-80 System. In: Proceedings of the 11th ACM SIGACT-SIGPLAN symposium on Principles of programming languages (1984), S. 302

[Ear09] EARNSHAW, Richard: Procedure Call Standard for the ARM Architecture. (2009), Oct, S. 1–34. http://infocenter.arm.com/help/topic/com.arm.doc.ihi0042d/IHI0042D_aapcs.pdf

[GBE07] GEORGES, Andy ; BUYTAERT, Dries ; EECKHOUT, Lieven: Statistically Rigorous Java Performance Evaluation. In: OOPSLA '07: Proceedings of the 22nd annual ACM SIGPLAN conference on Object-oriented programming systems and applications (2007), Oct. http://portal.acm.org/ft_gateway.cfm?id=1297033&type=pdf&coll=DL&dl=GUIDE&CFID=12089051&CFTOKEN=21716971

[GFE+09] GAL, Andreas ; FRANZ, Michael ; EICH, B ; SHAVER, M ; ANDERSON, David: Trace-based Just-in-Time Type Specialization for Dynamic Languages. In: Proceedings of the ACM SIGPLAN 2009 conference on Programming language design and implementation (2009), Jan. http://portal.acm.org/citation.cfm?id=1542528&dl=

[GPF06] GAL, A ; PROBST, C.W ; FRANZ, Michael: HotpathVM: An Effective JIT Compiler for Resource-constrained Devices. In: Proceedings of the 2nd international conference on Virtual execution environments (2006), S. 144–153

[HRP05] HUDSON, M ; RIGO, A ; PEDRONI, Samuele: Compiling Dynamic Language Implementations. In: Project co-funded by the European Commission within the Sixth Framework Programme (2002-2006) (2005)

[HS74] HANSEN, Gilbert J.: Adaptive Systems for the Dynamic Run-time Optimization of Programs. Carnegie-Mellon University, Pittsburgh PA, Dept. of Computer Science. (1974), Jan, 179. http://books.google.com/books?id=pNwicAAACAAJ

[McC60] MCCARTHY, John: Recursive Functions of Symbolic Expressions and Their Computation by Machine, Part I. In: Communications of the ACM 3 (1960), Apr, Nr. 4. http://portal.acm.org/ft_gateway.cfm?id=367199&type=pdf&coll=DL&dl=GUIDE&CFID=13052475&CFTOKEN=24119255

[Moz11] MOZILLA FOUNDATION / ADOBE: Tamarin Project. http://www.mozilla.org/projects/tamarin/. Version: 2011. – [Online; accessed 09-February-2011]

[MSD05] MSDN TEAM: Compiling MSIL to Native Code. http://msdn.microsoft.com/en-us/library/ht8ecch6(v=vs.80).aspx. Version: 2005. – [Online; accessed 09-March-2011]

[Pal11] PALL, Mike: LuaJIT. http://luajit.org/luajit.html. Version: 2011. – [Online; accessed 09-February-2011]

[Pho10] PHOENIX, Evan: Making Ruby Fast: The Rubinius JIT. http://www.engineyard.com/blog/2010/making-ruby-fast-the-rubinius-jit/. Version: 2010. – [Online; accessed 25-February-2011]

[PS99] POLETTO, M ; SARKAR, V: Linear Scan Register Allocation. In: ACM Transactions on Programming Languages and Systems (TOPLAS) 21 (1999), Nr. 5, S. 895–913

[PVC01] PALECZNY, M ; VICK, C ; CLICK, C: The Java HotSpot™ server compiler. In: Proceedings of the 2001 Symposium on Java™ Virtual Machine Research and Technology Symposium - Volume 1 (2001), S. 1

[PyP11a] PYPY DEVELOPERS: How to run PyPy on top of maemo platform. http://codespeak.net/pypy/dist/pypy/doc/maemo.html. Version: 2011. – [Online; accessed 18-February-2011]

[PyP11b] PYPY DEVELOPERS: PyPy - Getting Started. http://codespeak.net/pypy/dist/pypy/doc/getting-started.html. Version: 2011. – [Online; accessed 16-March-2011]

[PyP11c] PYPY DEVELOPERS: PyPy - Goals and Architecture Overview. http://codespeak.net/pypy/dist/pypy/doc/architecture.html. Version: 2011. – [Online; accessed 02-March-2011]

[Rig04] RIGO, Armin: Representation-based just-in-time specialization and the psyco prototype for python. In: PEPM '04: Proceedings of the 2004 ACM SIGPLAN symposium on Partial evaluation and semantics-based program manipulation (2004), Aug. http://portal.acm.org/ft_gateway.cfm?id=1014010&type=pdf&coll=Portal&dl=GUIDE&CFID=108138688&CFTOKEN=34923119

[RP06] RIGO, Armin ; PEDRONI, Samuele: PyPy's approach to virtual machine construction. In: OOPSLA '06: Companion to the 21st ACM SIGPLAN symposium on Object-oriented programming systems, languages, and applications (2006), Oct. http://portal.acm.org/citation.cfm?id=1176617.1176753

[SBB+03] SULLIVAN, Gregory ; BRUENING, Derek ; BARON, Iris ; GARNETT, Timothy ; AMARASINGHE, Saman: Dynamic native optimization of interpreters. In: IVME '03: Proceedings of the 2003 workshop on Interpreters, virtual machines and emulators (2003), Jun. http://portal.acm.org/citation.cfm?id=858570.858576

[Smi09] SMITH, Lee: Run-time ABI for the ARM Architecture. (2009), Oct, S. 1–28

[SU94] SMITH, Randall B. ; UNGAR, David: Self: The power of simplicity. In: Sun Microsystems, Inc. Technical Reports; Vol. TR-94-30 (1994)

[TCAR09] TARTARA, Michele ; CAMPANONI, Simone ; AGOSTA, Giovanni ; REGHIZZI, Stefano: Just-In-Time compilation on ARM processors. In: ICOOOLPS '09: Proceedings of the 4th workshop on the Implementation, Compilation, Optimization of Object-Oriented Languages and Programming Systems (2009), Jul. http://portal.acm.org/citation.cfm?id=1565824.1565834

[UCCH91] UNGAR, David ; CHAMBERS, Craig ; CHANG, BW ; HÖLZLE, Urs: Organizing programs without classes. In: Lisp and Symbolic Computation 4 (1991), Nr. 3, S. 223–242

[V8 11] V8 TEAM: Design Elements - V8 JavaScript Engine - Google Code. http://code.google.com/apis/v8/design.html#mach_code. Version: 2011. – [Online; accessed 21-February-2011]


List of Figures

1 Calculation of Fibonacci numbers written in ARM assembler
2 Calculation of Fibonacci numbers written in Python
3 Use of ARM's barrel shifter for multiplication by two
4 Use of conditional execution to assign a value to a register
5 Two simple RPython functions, and a trace recorded when executing the function strange_sum
6 A very simple bytecode interpreter with registers and an accumulator
7 Example bytecode: Compute the square of the accumulator
8 Simple bytecode interpreter with hints applied
9 Trace when executing the Square function of Figure 7, with the corresponding bytecodes as comments
10 Trace when executing the Square function of Figure 7, with the corresponding opcodes as comments. The constant-folding of operations on green variables is enabled
11 Bit encoding for the data processing instructions with immediate values [ARM10]
12 Encoding of the group of data processing instructions with immediate values as arguments as a Python hashmap
13 Function that creates a function for data processing instructions based on the encoded instructions shown in Figure 12. It takes the encoded description of an assembler instruction and generates a function that, when called, writes the binary representation of the instruction to memory
14 Loading a constant by shifting the bytes into the target register
15 Loading a constant encoded in the program stream
16 Function to load constants into a register
17 Frame layout used by the ARM backend
18 Implementation of the int_sub operation
19 Instructions generated for the int_lt(i0, i1) operation
20 Generated machine code for guard_true(i0, [p0, i0])
21 Encoding of variable locations for guard_true(i0, [p0, i0]). Variable p0 is spilled on the stack and variable i0 is stored in register r1
22 Patched machine code for guard_true(i0, [p0, i0])
23 BeagleBoard-xM development board
24 PyPy benchmarks run on ARM, normalized to CPython
25 PyPy benchmarks run on x86, normalized to CPython
26 Pyrolog benchmarks run on ARM, normalized to SWI-Prolog
27 Pyrolog iterate benchmarks run on ARM, normalized to SWI-Prolog
28 Pyrolog benchmarks run on x86, normalized to SWI-Prolog

List of Tables

1 Absolute runtimes for standard PyPy benchmarks run on ARM
2 Relative runtimes for standard PyPy benchmarks run on ARM normalized to CPython
3 Relative results of the PyPy benchmarks using the Boehm GC on ARM and x86, normalized to CPython for the corresponding platform
4 Absolute runtimes for Pyrolog iterate benchmarks run on ARM
5 Absolute runtimes for Pyrolog benchmarks run on ARM
6 Absolute runtimes for standard PyPy benchmarks run on x86
7 Relative runtimes for standard PyPy benchmarks run on x86 normalized to CPython
8 Absolute runtimes for Pyrolog benchmarks run on x86
