
INSTITUT FÜR INFORMATIK
Softwaretechnik und Programmiersprachen
Universitätsstr. 1
D-40225 Düsseldorf

An ARM Backend for PyPy's Tracing JIT

David Schneider

Master's Thesis

Work begun: 24 November 2010
Work submitted: 23 March 2011

Examiners: Prof. Dr. Michael Leuschel
           Prof. Dr. Michael Schöttner


Declaration

I hereby declare that I have written this Master's thesis independently and that I have used no sources or aids other than those indicated.

Düsseldorf, 23 March 2011
David Schneider


Abstract

A large part of the computing done today is performed on mobile devices. Phones, tablets and netbooks are changing the way we see and use computers. Devices built for mobile usage are constrained in different ways: battery lifetime, speed, power consumption, storage and memory resources are issues compared to today's desktop computers. A very large number of the mobile devices in use or being created nowadays use ARM-based CPUs. ARM, a RISC architecture, is specialized in the development of CPUs for mobile and embedded systems, producing reasonably fast and energy-efficient CPU designs. Due to the wide usage of ARM CPUs on mobile devices, ARM is becoming an increasingly important platform for software developers. The frameworks for creating software for mobile phones are mostly based on statically typed languages such as C#, Java or Objective-C.

Although dynamic languages have seen a rise in interest and usage over the last years, many of the virtual machines for these languages are rather straightforward and do not offer performance comparable to that of statically typed languages, making them less apt for developing software for platforms with restricted resources. Just-in-time compilation has shown to be a viable way to improve the performance of virtual machines for dynamic languages. Just-in-time compilers perform profiling at runtime while interpreting a given program and generate specialized code for frequently used code paths. There are different approaches to how profiling is performed, either based on methods or on loops.

The PyPy project, a toolset for writing dynamic languages and an implementation of the Python language which are both written in Python, provides a way to write interpreters for dynamic languages at a high level of abstraction by leaving out platform-specific details, which are added to the interpreter when it is translated. The translation toolchain allows targeting different architectures such as x86, x86_64, the JVM or the .NET CLR virtual machine and also ARM, while only having to maintain one language implementation. PyPy also provides a generator for tracing JITs that can be added to a language implementation by providing a few hints in the language specification. The tracing JIT already gives good results on x86, speeding up the execution of Python programs by about 4x compared to CPython. In this work we present the backend developed for PyPy's JIT that targets the ARM architecture, describing the specifics of how dynamic compilation can be implemented on ARM in the scope of PyPy.

The ARM backend shows promising results in terms of speed; the current results show a similar relative performance when comparing it to CPython as is seen on x86. Further speed-ups are to be expected when integrating PyPy's own GCs.


Contents

6.4.3 Allocation
6.4.4 Memory Access
6.4.5 Calls
6.4.6 Forcing the Frame Object
6.5 Guards and Leaving the Loop
6.6 Bridges
7 Cross Translation
8 Evaluation
8.1 Hardware
8.2 Benchmarks
8.2.1 Python Benchmarks
8.2.2 Prolog Benchmarks
9 Related Work
10 Conclusion and Future Work
11 Annex
References
List of Figures
List of Tables


1 Introduction

ARM cores are present in most of the devices we use day to day, be it a phone, a tablet or a music player. Additionally, ARM-based CPUs are beginning to be used in netbooks and energy-efficient servers. The large-scale deployment of ARM cores and the variety of devices they power make ARM an increasingly interesting platform to be targeted by developers.

At the same time there is a growing interest in dynamic programming languages and an increase in their usage, due to improvements in the implementations of these languages and the increased computational power of computers. These developments make it more and more feasible to develop software, from small scripts to large-scale server systems, using dynamic languages.

ARM-based devices are usually small, energy-efficient and portable devices where execution speed and power consumption are critical concerns when creating software. To bridge the gap between the high-level expressiveness and ease of development of dynamic languages and the restricted resources found on small systems, we want to create a backend for PyPy's tracing JIT [PyP11b]. Based on the powerful toolchain provided by PyPy, we are able to provide a way to create interpreters for different dynamic languages. These interpreters are able to run on ARM-based systems, in addition to the platforms already targeted by PyPy. At the same time these interpreters may benefit from the optimizations provided by just-in-time compilation as done by PyPy. This will make it possible to write more and faster software for mobile devices using high-level languages such as Python.

PyPy is the tool of choice to bring dynamic languages and their dynamic compilation to ARM, because it is a general toolchain for writing dynamic languages that already has a very compliant Python 2.7 implementation as well as a Prolog implementation and several experimental language implementations. Due to its modular architecture, PyPy allows all language implementations to potentially benefit from and use the new ARM backend for the JIT. PyPy has a powerful JIT compiler generator that provides very good results on x86. We hope to be able to bring these good results to ARM.

In this thesis we present an ARM backend for PyPy's tracing JIT compiler. After giving an introduction to the ARM architecture and to just-in-time compilation, we describe how this works in PyPy. Based on this foundation we describe how we translate the operations of PyPy's JIT to machine code in a way that can be executed efficiently on ARM-based devices. Additionally, we present a new translation platform for PyPy to cross-translate programs written in RPython to target ARM.

The rest of this thesis is structured as follows: in Section 2 we present the ARM architecture and the company behind it, describing the different features that ARM provides, focusing especially on those that are central to our goal of creating a backend for PyPy's JIT. In Section 3 we give a general introduction to just-in-time compilation and the different techniques which have been developed to improve the execution speed of high-level languages. In Section 4 we focus on one particular approach to developing dynamic languages, presenting the PyPy project and some of its core ideas. In Section 5 we describe PyPy's approach to just-in-time compilation. Based on these concepts, in Section 6 we elaborate on how a backend for PyPy's tracing JIT is built and what the particularities of creating such a backend for ARM are. In Section 7 we describe how we cross-translate interpreters using PyPy's toolchain to run on resource-constrained devices. In Section 8 the results we have gathered from benchmarks evaluating our ARM backend are compared to another backend for PyPy's JIT. Finally, in Sections 9 and 10 our work is related to previous research and development and we outline topics of future work related to the subject of this thesis.


2 ARM

ARM is at the same time the name of a company as well as the name of its main product. The CPUs designed by ARM and based on the ARM architecture are nowadays widely deployed, being built into most current mobile devices such as phones and tablets. At the same time, ARM CPUs are entering the computer and server markets. The CPUs designed by ARM are based on a RISC architecture and have several unique features. In this section we give a short overview of the company behind these CPUs and introduce the ARM architecture itself, focusing on aspects relevant for the PyPy JIT backend presented in this thesis.

2.1 Company and History

ARM Holdings, the company that develops the ARM CPUs, was founded in 1990 under the name Advanced RISC Machines as a spin-off of Acorn and Apple Computer. ARM itself does not sell chips; instead, its business model is based on designing 32-bit RISC CPUs and GPUs as well as tools and software targeting these designs. The intellectual property, in the form of chip and architecture designs, is licensed to partners. Partner companies that license the designs can adapt them to their needs and actually build the chips, paying ARM a license fee per chip built. An example of such chips are Apple's A4 and A5 processors used in the iPad and iPhone. These processors are system-on-a-chip designs based on an ARM CPU combined with additional components on the same chip, such as a GPU.

The main markets that ARM targets with its technologies are mobile computing, home entertainment systems and embedded computing in areas such as household appliances and car controls. Currently ARM is the largest vendor of intellectual property for mobile computing, being present in about 95% of all mobile handsets, according to company numbers [ARM11]. With the growth of the mobile computing market, mainly due to smartphones and tablets, the importance of ARM as a platform has also increased. This has made ARM an interesting target platform. Large vendors of mobile devices and software, such as Apple and Google, are already building on ARM as a platform, and lately Microsoft announced that their next operating system will also run on ARM [1]. NVIDIA announced that it will build CPUs based on ARM designs [2].

2.2 The ARM Architecture

The architecture used in the chips developed by ARM, which is called the ARM architecture, is based on a RISC model with a special focus on small code size, energy efficiency and performance, which are all relevant features for embedded and mobile computing.

[1] http://www.microsoft.com/presspass/press/2011/jan11/01-05socsupport.mspx
[2] http://pressroom.nvidia.com/easyir/customrel.do?easyirid=A0D622CE9F579F09&version=live&releasejsp=release_157&xhtml=true&prid=705184


As ARM states [ARM10], the architecture sports a set of typical RISC characteristics, with a few additional properties only provided by ARM processors. The ARM architecture provides:

• a large set of multi-purpose registers
• data processing operations that work on register contents and not directly on data stored in memory
• a simple and direct addressing mode based only on register contents and constant offsets

Among the special features only found on ARM are:

• the combination of shift operations with logical and arithmetic operations, meaning that it is possible to apply a shift to the arguments of operations before they are evaluated
• auto-decrement and auto-increment addressing modes
• instructions to load and store multiple register values at once, to improve data throughput
• conditional execution of most instructions, to avoid jump operations and improve performance and cache behaviour

2.2.1 ARM/THUMB

The latest version of the ARM architecture is version 7, usually referred to as ARMv7. This architecture defines two main instruction sets and corresponding execution modes.

Chronologically the first was the ARM instruction set, introduced in 1983 by Acorn Computers. The ARM instruction set is a fixed-width 32-bit instruction set, providing the best performance and the biggest flexibility in the interaction with the processor. The second was the Thumb instruction set, first introduced with some cores of the ARMv4 processor family around 1993. It is a fixed-width 16-bit instruction set designed to improve code density at the cost of execution speed. The speed penalty exists because Thumb instructions are translated at runtime to the corresponding ARM instructions [ARM95]. The Thumb instruction set evolved into Thumb-2, a variable-width instruction set that tries to combine the code density of Thumb with performance comparable to the ARM instruction set.

There are also additional instruction sets and execution modes targeting virtual machines, such as Jazelle, which specifically targets Java bytecode. Jazelle has been deprecated in favor of ThumbEE, which extends the Thumb instruction set with null pointer checks and some other additional features.


The current implementation of the JIT backend for ARM targets the ARM instruction set, but it should not be hard to adapt the backend to generate Thumb-2 or ThumbEE code in order to evaluate the memory usage versus speed trade-offs. Unless stated explicitly, we refer to the ARM instruction set and mode in version 7 for the rest of this thesis.

2.2.2 Profiles

With the specification of the ARMv7 architecture, ARM introduced the concept of profiles for processors, defining a profile for applications, one for embedded platforms and one for real-time processors. The requirements for the different profiles contain a set of necessary and optional features for the processors to provide. Each processor implementing a profile provides a defined set of core functionality. The profiles are:

• ARMv7-A: the Application profile provides different execution modes and instruction sets such as ARM and Thumb, and it supports virtual memory.
• ARMv7-R: the Real-time profile is similar to the ARMv7-A profile but provides a simpler and more direct memory access model.
• ARMv7-M: the Microcontroller profile defines a profile for low-level programming, also with a simple memory model, and only supports the Thumb instruction set.

The ARM backend for PyPy's JIT targets the ARMv7-A processor profile but should also be compatible with the ARMv7-R profile.

2.2.3 Calling Convention

The ARM architecture procedure call standard (AAPCS) [Ear09] describes the calling convention in the form of a specification with rules that the calling and called side of a subroutine call have to follow. The AAPCS also defines the different data types available in procedure calls and how they are passed as arguments and return values. By following this convention, independently defined, compiled and assembled libraries can work with each other.

ARM processors have a register set of 16 core registers accessible in ARM and Thumb modes with slightly different rules. In ARM mode all of these registers can be read and written, including the one used as program counter. The registers r12 to r15 have a special role in the AAPCS, being used to hold the current execution state. The names and roles of the registers are defined as follows:


• Registers r0 to r11 are freely usable general purpose registers used for storing values during computation.
• Register r11 was used as a frame pointer (FP), but this use has been deprecated in the AAPCS and it can now be used as a general purpose register.
• Register r12, also called the intra-procedure-call scratch register, is used to store temporary values during the execution of a subroutine. The value of this register is not guaranteed to be preserved across subroutine calls.
• Register r13 is used as the stack pointer (SP), pointing to the top of the currently used stack (the lowest address).
• Register r14, also called the link register (LR), is used to store the return address of the calling subroutine for procedure calls.
• Register r15 is the program counter (PC), containing the address of the current instruction. Writing to this register results in an unconditional branch.

According to the AAPCS, arguments to procedures are passed in registers r0 to r3 and on the stack. By passing the first arguments in registers, calls with up to four integer arguments can profit from avoiding the overhead of passing the arguments on the stack. The return values of sub-procedure calls are returned in the same registers used to pass arguments. For this reason the caller has to take care of preserving values stored in these registers. On the other hand, the callee has to take care to preserve the values stored in the registers r4 to r10, which are to be used for local variables.

Figure 1 shows an example of a procedure written in ARM assembler that follows the calling convention defined in the AAPCS; Figure 2 shows the equivalent algorithm written in Python. The procedure in this example calculates the Fibonacci number of an input passed in register r0 using a recursive algorithm. When entering the procedure, the callee-saved registers that are going to be used and the link register that contains the return address are saved on the stack. Next, by checking if the input passed in register r0 is less than two, we check for the base cases (r0=0 or r0=1). In case we hit such a base case, using a conditionally executed POP instruction (see Section 2.2.5 for details), we restore the saved registers and load the return address into the program counter, thus performing a jump to it. Otherwise we perform the recursive calculation of the two previous Fibonacci numbers by calling the same procedure using the branch and link instruction (BL) and passing the argument in register r0. The BL instruction stores the return address in the link register and performs a jump to the address or label passed. Finally we calculate the sum of both numbers and store it in r0 as the return value. To return, using a POP instruction, we restore the saved registers and put the saved value of the link register into the program counter, performing an unconditional branch to the return address.

The AAPCS defines, based on a 32-bit word size, as core data types signed and unsigned bytes, halfwords, words and double words as well as half, single and double precision floating point numbers and finally vector data types.
According to the size of the data types, the values are stored in one, two or four consecutive registers. Section 5.5 of the AAPCS describes in detail the algorithm used to pass arguments to procedures, describing the encoding of the arguments in registers, on the stack and in the coprocessor registers, taking into account the different sizes of data types described above.


Larger data structures, or data structures with variable size, are passed to and returned from subroutines by reference, storing the data in memory and passing its address, according to the AAPCS, as the actual parameter for the procedure call. Smaller values such as numbers are passed by value.

fib:
    push  {r5, r6, lr}    @ save the link register and required registers
    cmp   r0, #2          @ check for base cases
    poplt {r5, r6, pc}    @ return if n < 2
    mov   r5, r0          @ store input in r5
    sub   r0, r5, #1      @ put n - 1 in r0
    bl    fib             @ call fib(n-1)
    mov   r6, r0          @ store result in r6
    sub   r0, r5, #2      @ put n - 2 in r0
    bl    fib             @ call fib(n-2)
    add   r0, r0, r6      @ add both results
    pop   {r5, r6, pc}    @ restore registers and return

Figure 1: Calculation of Fibonacci numbers written in ARM assembler

def fib(n):
    if n < 2:
        return n
    n1 = fib(n-1)
    n2 = fib(n-2)
    return n1 + n2

Figure 2: Calculation of Fibonacci numbers written in Python

2.2.4 Barrel Shifter

The arithmetic logic units built into ARM CPUs provide a 32-bit barrel shifter, which can perform efficient shift operations. ARM supports five types of shift operations, which are available as standalone operations as well as applicable to the immediate parameter of the operations that support them [ARM10]. The types of shift operations are:

• Logical shift left: shifts a value left by a given number of bits.
• Logical shift right: shifts a value right by a given number of bits.
• Arithmetic shift right: shifts a value right by a given number of bits; the highest bit is copied and shifted into the highest bits, preserving the sign of the value.
• Rotate right: rotates a value right by a given number of bits; the shifted-out low bits are shifted in at the highest bits.
• Rotate right extended: rotates the value right by one bit; the carry bit is shifted in at the highest bit of the value.


Given that the ARM instruction set is a fixed-width instruction set, the immediate values that can be passed to operations are encoded in one byte of the instruction. This means that values larger than one byte cannot be encoded as immediate values. One use of the barrel shifter in combination with immediate values is to load larger constants into a register; for an application of this see Section 6.1.1. As an example of using shifted immediate values see Figure 3, where we move the value of register r1, shifted by one to the left, into register r0. If the value of register r1 is 0xF0, then the resulting value stored in register r0 after this operation is 0x1E0, which corresponds to a multiplication by two.

MOV r0, r1, LSL #1    @ r0 = r1 * 2

Figure 3: Use of ARM's barrel shifter for multiplication by two

2.2.5 Conditional Execution

A feature supported by almost all instructions in the ARM instruction set is conditional execution. Conditional execution allows specifying a condition under which an instruction should be executed, depending on the processor flags. Encoded in the four highest bits of each instruction, there are 16 different conditions under which a given instruction can be executed. The execution is based on the state of the processor flags, and the default is to always execute the instruction. Conditional execution can be used for branched execution while avoiding jumps, possibly improving cache behaviour. Available condition codes include:

• EQ, NE for equality checks
• VS, VC for overflow checks
• GE, LT, LE, GT for less-than and greater-than checks
• etc.

Figure 4 shows how the operation r1 = (r6 >= r7) could be encoded using conditional execution. The CMP operation sets the condition flags according to the values of the operands; the next two operations move the result, either 0 or 1, to the register r1 by executing one of the MOV operations and ignoring the other, depending on the flags set by the CMP operation.

CMP   r6, r7
MOVGE r1, 1
MOVLT r1, 0

Figure 4: Use of conditional execution to assign a value to a register


3 Just-in-Time Compilation

Just-in-time compilation is a technique used for high-level languages to improve the speed of program execution by analyzing the program at runtime and generating optimized code for frequently used code paths. The tools that perform this kind of optimization are called just-in-time compilers (JITs).

In the last years JIT technology, especially that of tracing JITs, has been investigated by implementations of dynamic languages seeking to improve the performance of program execution on their virtual machines. There are several projects integrating different kinds of JITs into their virtual machines, such as V8, Mozilla's Trace- and JägerMonkey or Spur for Javascript [GFE+09, BBF+10], PyPy for Python [BCFR09], LuaJIT for Lua [Pal11] and Rubinius in the case of Ruby [Pho10]. Other platforms not primarily targeting dynamic languages that also use dynamic compilation to improve execution speed are the Java Virtual Machine [PVC01] and the .NET CLR [MSD05].

The execution and translation of programs is usually divided into two groups:

• Static compilation, which translates a given program to another representation, in most cases at a lower level. The target representation is then suited for execution on a chosen platform, which might be either a virtual machine or natively on a processor.
• Interpretation, where the compilation and execution phases are combined into one execution context. The interpreter analyzes a program, or parts of it, transforming it into a format more suitable for evaluation, and executes the result immediately, possibly interleaving further analysis and execution steps.

A JIT compiler tries to combine the advantages of static compilation and of interpretation, taking advantage of the additional information available at runtime. Following Aycock's historical overview of just-in-time compilation [Ayc03], the advantages of this approach are:

• the fast execution speed of compiled programs
• smaller program size because of the higher level of abstraction usually found in interpreted languages
• portability of interpreted programs, given that the executing VM is available on different systems
• the availability of runtime information such as control and data flow

Dynamic compilation is not a novel optimization method. Aycock dates the first JIT-related publication to a paper about LISP by McCarthy published in 1960, which mentions the dynamic compilation of functions into machine language at runtime [McC60]. A further step was taken by analyzing which parts of a program should be optimized. This was achieved by searching for frequently executed paths, detected using execution counters, as described in Hansen's work on Fortran in 1974 [HS74].


In Hansen's approach, when the execution count for a certain block of code passed a defined threshold, this block would be further optimized and transformed to machine code. Another project that innovated in the area of dynamic compilation is the SELF project (http://labs.oracle.com/self/). SELF is a dynamic object-oriented language, inspired in many ways by Smalltalk, such as in its use of message passing, and built on a prototype-based object model [SU94, UCCH91]. Because of the dynamic nature of the language it is difficult to optimize. The SELF project developed different approaches, finally creating a compiler that, based on runtime profiling, generated different optimized code paths for one SELF method depending on the types of the input arguments [CUL89]. For an in-depth historical overview of dynamic compilation see [Ayc03], which also contains a more detailed overview as well as different categorization approaches for JIT compilers.

There are two main groups into which JIT compilers can be classified, depending on how they perform profiling and optimize the instructions before generating low-level code.

3.1 Method Based Just-In-Time Compilation

Method-based JITs, as the name suggests, perform profiling and compilation on a per-method basis. There are different approaches in this area: some systems compile methods to lower-level code when a method is first executed, such as some Smalltalk systems [DS84]. Other systems perform profiling by counting the executions of methods to select candidate methods for compilation, such as the Rubinius Ruby VM [Pho10]. And finally there are approaches such as the Java HotSpot server compiler and Google's V8 Javascript engine, which do not contain an interpreter but perform two-step compilation: first a version of the code is compiled with a fast compiler that does not perform all possible optimizations; in a second step the generated code is profiled to select candidates that are recompiled with more aggressive optimizations [PVC01, V8 11].

3.2 Trace Based Just-In-Time Compilation

An alternative approach to profiling and optimizing programs at runtime is taken by tracing JITs. Tracing JITs take loops as the basic unit for profiling and optimization. The exact definition of a loop can vary depending on the VM, but for bytecode-based VMs it is usually defined as a stream of bytecodes ending with a backwards jump, the jump target being the beginning of the loop. The basic assumptions of tracing JITs are that:

• A program spends most of its time in loops.
• Iterations of a loop usually take similar paths through the code.

A tracing JIT runs a program by interpreting it, first converting it to bytecode and recording profiling information on the interpreted code. Once a certain threshold is passed, the tracing JIT records the operations performed during one execution of the loop. This list of operations, which corresponds to one possible path through the loop, is called a trace.
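To make this threshold-based switch to tracing concrete, the following is a schematic sketch only; it is not PyPy's actual implementation, and names such as HOT_THRESHOLD, on_backward_jump and the injected trace_and_compile callable are made up for illustration. It shows a dispatch loop helper that counts how often each backward-jump target is reached and switches to tracing once a loop becomes hot:

    # Schematic sketch only (hypothetical names): profiling of backward
    # jumps in a bytecode dispatch loop. Once a jump target has been
    # reached HOT_THRESHOLD times, the loop starting there is traced and
    # compiled; afterwards the compiled version is used instead.
    HOT_THRESHOLD = 1000

    class LoopProfiler(object):
        def __init__(self, trace_and_compile):
            self.trace_and_compile = trace_and_compile  # supplied by the JIT
            self.counters = {}        # loop entry pc -> iteration count
            self.compiled = {}        # loop entry pc -> compiled loop

        def on_backward_jump(self, pc):
            """Called by the interpreter whenever it jumps backwards to pc."""
            if pc in self.compiled:
                return self.compiled[pc]          # run the optimized loop
            self.counters[pc] = self.counters.get(pc, 0) + 1
            if self.counters[pc] >= HOT_THRESHOLD:
                self.compiled[pc] = self.trace_and_compile(pc)
                return self.compiled[pc]
            return None                           # keep interpreting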


Performing optimizations by recording the program flow was initially explored within the Dynamo project [BDB00], a tracing optimizer for machine code. Dynamo acts as an interpreter for machine code, profiling and optimizing the machine code during execution. Optimized pieces of assembler are then executed directly on the processor. Based on DynamoRIO [BEES04], the successor of Dynamo, a tracing JIT for bytecode-based dynamic languages has been created. This tracing JIT uses partial evaluation techniques to optimize the traced operations [SBB+03]. There have also been experimental implementations of the JVM using trace optimization techniques, such as [GPF06]. The application of tracing JITs to dynamic languages has lately been investigated by several projects, such as TraceMonkey [GFE+09] and Spur [BBF+10] for Javascript, LuaJIT for Lua [Pal11] and Adobe's Tamarin ActionScript VM used in the Flash Player [Moz11], among others, showing that dynamic trace-based compilation can help improve the execution speed of dynamic languages.

In a loop, defined as above, a backward jump is an operation that makes the VM branch to a previous location in the bytecode stream, i.e. by decrementing the program counter. Profiling is performed on a per-loop basis at the backward jump instructions, counting the number of iterations of a loop. Once a certain threshold in the number of iterations is reached, the loop is considered hot. Once a loop is considered hot, the interpreter switches to tracing mode for the next execution of the loop. In tracing mode the interpreter records all operations performed during the execution of the loop, possibly inlining method calls performed within the loop. The resulting trace of operations is the basis for all further optimization and finally for the translation to machine code.

Because the recorded trace corresponds to only one of the possible paths through the traced loop, special care needs to be taken regarding conditional paths such as if/else branches. At every branching point a special instruction is inserted into the trace that checks that the condition valid at that branching point during tracing is still valid when the trace is executed. These special instructions are called guards [GPF06].

A guard fails during the execution of a loop when a condition at a branching point that was true during tracing is now false. The failure of a guard leads to stopping the execution of the optimized version of the loop and returning to executing the interpreted version of it.

Once a loop has been compiled, the JIT also performs profiling on the guards inserted into the trace, counting how often the trace is left through a specific failing guard. Once the failure count for a guard reaches a certain threshold, the execution is traced again starting from the failing guard until the entry point of a loop is reached; this trace is also compiled and executed the next time the guard fails, instead of leaving the optimized code.
This kind of trace is called a bridge.

Tracing can end for different reasons: either because the tracing interpreter jumps to a previous position in the bytecode, thus closing the loop, or because tracing is aborted due to different conditions, such as a trace that is too long. If tracing finished without aborting, the traced list of instructions is passed to the trace optimizer. Depending on the implementation, different optimizations are performed on the trace, such as constant folding, loop-invariant code motion, allocation removal, etc. The optimized trace is then passed to the backend to be transformed into machine code for the corresponding platform.
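As an illustration of the guard profiling described above, the following sketch uses made-up names (BRIDGE_THRESHOLD, on_failure, the injected trace_from_guard callable) and is not taken from PyPy; it only shows how per-guard failure counts could trigger the compilation of a bridge:

    # Schematic sketch only (hypothetical names): counting guard failures
    # and attaching a bridge once a guard has failed often enough.
    BRIDGE_THRESHOLD = 200

    class Guard(object):
        def __init__(self, trace_from_guard):
            self.trace_from_guard = trace_from_guard  # supplied by the JIT
            self.failures = 0
            self.bridge = None        # compiled code attached to this guard

        def on_failure(self, interpreter_state):
            if self.bridge is not None:
                return self.bridge    # keep running compiled code
            self.failures += 1
            if self.failures >= BRIDGE_THRESHOLD:
                # trace from the failing guard up to the loop entry, compile it
                self.bridge = self.trace_from_guard(self, interpreter_state)
                return self.bridge
            return None               # fall back to the interpreter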


Once compiled, the optimized version of the loop becomes available to the interpreter. When the interpreter is going to execute bytecode corresponding to the loop, it can instead pass control to the compiled version of the loop. Once the optimized loop finishes, the interpreter updates its state based on the results of the optimized loop and resumes interpretation.

As an example of a trace we will look at the RPython functions defined in Figure 5; the details of PyPy's JIT will be introduced in Section 5. This example is taken from [BCFR09]. The function strange_sum executes a loop that calls the function f, which does some calculations on its inputs. When the loop is considered hot, after executing it a certain number of times, the execution is traced. The tracing records all operations executed in one iteration of the loop; the recorded trace is also shown in Figure 5. The tracing only recorded the hot else path, inserting a guard to check the condition. The tracing also inlined the call to f. The operations recorded correspond to the internal representation of the functions. The trace ends with a jump to its beginning and is only left when one of the guards generated for the conditional paths fails.

def strange_sum(n):
    result = 0
    while n >= 0:
        result = f(result, n)
        n -= 1
    return result

def f(a, b):
    if b % 46 == 41:
        return a - b
    else:
        return a + b

# loop_start(result0, n0)
# i0 = int_mod(n0, Const(46))
# i1 = int_eq(i0, Const(41))
# guard_false(i1)
# result1 = int_add(result0, n0)
# n1 = int_sub(n0, Const(1))
# i2 = int_ge(n1, Const(0))
# guard_true(i2)
# jump(result1, n1)

Figure 5: Two simple RPython functions, and a trace recorded when executing the function strange_sum


4 PyPy

The PyPy project was started in 2003 with the goal of creating a high-level implementation of Python which should be easily extensible and portable while not sacrificing performance [Bol10]. The language chosen for the implementation is Python. To achieve the goals of flexibility and extensibility, PyPy is written in a very modular way, which also led to it becoming a general tool for writing interpreters for other dynamic languages. Besides the Python implementation there are interpreters for Smalltalk [BKL+08], Prolog [BLS10], Javascript, etc. When creating an interpreter using PyPy, the language specification is written in a restricted subset of Python called RPython [AACM07]. This allows the interpreter author to concentrate on the language semantics and not mix these with the low-level details of a target platform. This subset of the language is chosen in such a way that it is possible to perform abstract interpretation on it to do data flow analysis and type inference. Based on the inferred types, the RPython code can be transformed to a lower-level representation such as C or JVM/.NET bytecode and then compiled to an executable for the corresponding platform.

RPython is a subset of Python which was chosen in such a way that it does not have all the dynamic features of Python itself and thus can be statically analyzed. RPython is a statically typed, object-oriented and garbage-collected language which supports single inheritance. Being a subset of Python, programs written in RPython can be run and tested using the standard Python interpreter, providing fast feedback and quick prototyping and testing tools. Some of the restrictions imposed on RPython programs compared to Python are:

• RPython programs are not allowed to modify the layout of instances at runtime
• it is not possible to create classes at runtime
• the reflection features are restricted
• the types of objects used in the same place, e.g. in lists, must have a common base type

Among the advantages of using RPython as the language to encode the low-level details of language specifications is the fact that it is a subset of Python, so programs written in RPython can be run, debugged and tested on top of CPython, emulating the low-level details.

RPython cannot be discussed separately from the translation framework; RPython as a language evolved as part of PyPy's translation toolchain. The translation toolchain translates RPython programs to a lower-level representation and generates an executable program [HRP05]. The translation process in PyPy does not work on the source level but performs program analysis on live objects loaded into memory. To translate a program it is loaded into a Python interpreter and then the following steps are applied [PyP11c]:


• A control flow graph is generated for each function.
• The graphs are annotated, through whole-program type inference, with information about the type each variable will have at runtime.
• The graphs are transformed into graphs using operations closer to those of the target platform.
• Several transformations can be performed on the graphs, such as optimizations or the insertion of a garbage collector.
• Finally the graphs are converted to source code for the selected platform and compiled.

Being a subset of Python, RPython shares the characteristic that it is a garbage-collected language. This means that programs are free of low-level details about memory management. For this reason RPython programs are not bound to a certain memory management model. During the translation process, different implementations of memory management and garbage collection can be added to the generated program without affecting the semantics of the program itself. PyPy supports several of its own garbage collector implementations as well as the Boehm-Demers-Weiser conservative garbage collector [BW88]. Other aspects of RPython and of the programs implemented in it can be chosen at translation time, such as the use of tagged pointers or boxes to represent integers, or how classes should be represented in memory. These details are added to the program during the translation process.

Because RPython is translated from live objects loaded within the Python interpreter, the full power of the Python language can be used as a preprocessor for RPython, allowing RPython code to be generated dynamically at load time, which can then be analyzed, transformed and translated to a lower-level target [RP06].
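The following small sketch is purely illustrative (the names are made up and it is not code from PyPy); it shows the idea behind this load-time preprocessing: ordinary Python runs before translation and generates the functions, and only the generated functions need to be valid RPython:

    # Illustrative sketch only: plain Python acting as a load-time
    # "preprocessor" that generates RPython functions. The generator
    # itself never has to be RPython; the translator only ever sees
    # the functions it produced, in which fieldname is a constant.
    def make_field_reader(fieldname):
        def reader(obj):
            return getattr(obj, fieldname)
        reader.__name__ = 'read_' + fieldname
        return reader

    # executed at load time, before translation starts
    read_x = make_field_reader('x')
    read_y = make_field_reader('y')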


5 PyPy's Approach to Tracing JITs

One of the features provided by PyPy is that it includes a JIT compiler generator. This generator can be applied to a language implementation, generating a JIT compiler for the language at translation time. Although the procedure is similar to adding other low-level details such as garbage collection, the JIT generator requires a few hints which need to be provided by the language implementor. In this section we describe the details of PyPy's JIT. This section, as well as the examples used here, is based on the paper "Tracing the meta-level: PyPy's tracing JIT compiler" [BCFR09].

PyPy's tracing JIT is not specific to the Python interpreter; instead, PyPy provides a tracing JIT generator which can be applied to any language implementation which is itself based on PyPy. Because there are two interpreters involved in the discussion of PyPy's JIT, it is helpful to distinguish some terminology. Following the naming scheme proposed in [BCFR09], there is first the language interpreter, which runs a given program. There is also the interpreter used to perform tracing, which runs the language interpreter. We will call this the tracing interpreter. Programs running on the language interpreter are called user programs and loops within them are called user loops, while interpreter loops refers to loops within the language interpreter.

As a running example for a language interpreter we use the simple bytecode interpreter shown in Figure 6, taken from [BCFR09]. It is a bytecode interpreter with 256 registers managed in a list. The inputs the interpreter takes are the bytecode as a list of characters and an integer as an accumulator. A user program for our bytecode interpreter is shown in Figure 7; it calculates the square of the accumulator.

A traditional tracing JIT, applied to a user program, would detect hot loops in it. In contrast to this approach, PyPy's tracing JIT is applied to the language interpreter and not to the user program. The main hot loop of a language interpreter is its bytecode dispatch loop. One iteration of the dispatch loop corresponds to the execution of one bytecode instruction. It is rather unlikely that this loop will take similar code paths in subsequent iterations, rendering one of the core assumptions of tracing JITs invalid.

To solve this problem, PyPy traces several iterations of the bytecode dispatch loop. The loop is unrolled so far that it corresponds to a user loop. As mentioned before, a loop in a user program occurs when the program counter repeatedly has the same value, which can only happen through backward jumps. Usually the program counter consists of the current list of bytecodes and the pointer or index to the currently executed bytecode. Since the JIT is not aware of which variables contain the program counter, it needs to get this information via hints placed by the implementor of the language interpreter.
The tracing JIT can detect a user loop by checking the values that make up the program counter; a loop is found if these variables have a value they already had at a previous point of the execution. The program counter can only have a previous value again if it is explicitly set to an earlier value. This can only happen at a backward jump in the language interpreter. Because the tracing JIT also cannot know which parts of the language interpreter correspond to backward jumps, it needs an additional hint to mark these places. The example bytecode interpreter with the hints applied is shown in Figure 8. In the example, the variables passed in the greens list to the JitDriver class are those that compose the program counter, and the variables passed in the reds list are the ones that are not used for the program counter.


def interpret(bytecode, a):
    regs = [0] * 256
    pc = 0
    while True:
        opcode = ord(bytecode[pc])
        pc += 1
        if opcode == JUMP_IF_A:
            target = ord(bytecode[pc])
            pc += 1
            if a:
                pc = target
        elif opcode == MOV_A_R:
            n = ord(bytecode[pc])
            pc += 1
            regs[n] = a
        elif opcode == MOV_R_A:
            n = ord(bytecode[pc])
            pc += 1
            a = regs[n]
        elif opcode == ADD_R_TO_A:
            n = ord(bytecode[pc])
            pc += 1
            a += regs[n]
        elif opcode == DECR_A:
            a -= 1
        elif opcode == RETURN_A:
            return a

Figure 6: A very simple bytecode interpreter with registers and an accumulator.

MOV_A_R    0    # i = a
MOV_A_R    1    # copy of 'a'
# 4:
MOV_R_A    0    # i--
DECR_A
MOV_A_R    0
MOV_R_A    2    # res += a
ADD_R_TO_A 1
MOV_A_R    2
MOV_R_A    0    # if i!=0: goto 4
JUMP_IF_A  4
MOV_R_A    2    # return res
RETURN_A

Figure 7: Example bytecode: compute the square of the accumulator


The call to can_enter_jit on the instance of the JitDriver is used to mark a backward jump in the language interpreter and thus a possible entry point for the JIT. Finally, the call to jit_merge_point tells the JIT where it can resume normal interpretation in case it had to leave the optimized code.

tlrjitdriver = JitDriver(greens = ['pc', 'bytecode'],
                         reds = ['a', 'regs'])

def interpret(bytecode, a):
    regs = [0] * 256
    pc = 0
    while True:
        tlrjitdriver.jit_merge_point()
        opcode = ord(bytecode[pc])
        pc += 1
        if opcode == JUMP_IF_A:
            target = ord(bytecode[pc])
            pc += 1
            if a:
                if target < pc:
                    tlrjitdriver.can_enter_jit()
                pc = target
        elif opcode == MOV_A_R:
            ... # rest unmodified

Figure 8: Simple bytecode interpreter with hints applied

With these hints applied, the tracing JIT can perform profiling based on loops found in the user program while tracing the operations the tracing interpreter performs when executing the user loop. The downside of this is that the tracing interpreter also records instructions that are related to manipulating the internal state and data structures of the language interpreter. How this problem is handled is explained in Section 5.2. The trace recorded by the tracing interpreter for a user loop is passed to the optimizer before compiling the trace.

5.1 The Shape of a Trace

A trace in PyPy's JIT is composed of operations and input arguments. The operations are a list of instructions provided in SSA form [CFR+91]. The input arguments are provided as a list of variables which are used within the loop. Figure 10 shows the trace for a loop within the user program shown in Figure 7. Because the trace corresponds to the execution of a loop, the instructions in the trace build an infinite loop that can only be left through a failing guard.

Input Arguments and Types. The argument types used in the PyPy JIT either represent variables or constants. Arguments referring to variables are called boxes and carry information about the type of the value they box. Constants are also represented by containers; these containers carry the type of the value and the value of the constant.
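As a simplified, purely illustrative sketch (PyPy's actual classes and their names differ), boxes and constants can be pictured like this:

    # Hypothetical sketch of trace arguments; only meant to illustrate
    # the distinction between boxes and constants described above.
    class Box(object):
        """A variable in the trace: knows the type of the value it holds,
        but the value itself only exists while the trace is executed."""
        def __init__(self, type):
            self.type = type

    class Const(object):
        """A constant in the trace: carries both its type and its value."""
        def __init__(self, type, value):
            self.type = type
            self.value = value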


5.2 Optimizations

In this section we describe some of the optimizations performed by PyPy's tracing JIT on the recorded traces before they are transformed into machine code. The optimizations applicable to traces are well-known compiler optimizations, but they have the advantage that the trace they operate on is a linear structure, so they do not have to take branching and conditional execution into account. Some optimizations are described below; others are:

• reducing memory accesses by removing read and write operations and reusing previous results
• reusing the results of pure operations
• removing redundant guards

Green Folding. Because the tracing JIT is applied to the language interpreter, many of the operations in the trace deal with manipulating the data structures of the language interpreter and not with the calculation performed by the user program itself. Specifically, most of these operations work on the bytecode and the program counter. The values of these data structures are known when entering the loop, since they are part of the hints for the JIT. Operations on these variables can be constant-folded, because the operations involved are side-effect free and the string holding the bytecode is immutable. Figure 9 shows the trace before removing the operations on the interpreter data structures by constant folding, and Figure 10 shows the resulting trace after removing these operations.

Allocation Removal. This optimization, described in [BCF+11], tries to avoid object allocations by virtualizing objects allocated within the trace. If an object does not escape the trace, i.e. it is only used within the trace, the allocation can be removed completely and the object is replaced by its fields, which are treated as local variables.

Loop Invariant Code Motion. This is a technique to optimize loops by moving loop-invariant expressions [ALSU06] outside of the loop body, i.e. values that do not change over iterations and do not depend on variables modified in the loop. The core idea of how this optimization is applied to traces is described in [Ard11].
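To give a feel for allocation removal, the following made-up before/after fragment only resembles the trace notation used in this section; the operation names are invented for illustration and are not taken from an actual PyPy trace:

    # Before allocation removal (p0 never escapes the trace):
    #   p0 = new(SomeClass)
    #   setfield(p0, 'value', i0)
    #   i1 = getfield(p0, 'value')
    #   i2 = int_add(i1, Const(1))
    #
    # After allocation removal the object is virtualized: its only field
    # becomes a local variable and the allocation disappears:
    #   i2 = int_add(i0, Const(1))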


loop_start(a0, regs0, bytecode0, pc0)
# MOV_R_A 0
opcode0 = strgetitem(bytecode0, pc0)
pc1 = int_add(pc0, Const(1))
guard_value(opcode0, Const(2))
n1 = strgetitem(bytecode0, pc1)
pc2 = int_add(pc1, Const(1))
a1 = list_getitem(regs0, n1)
# DECR_A
opcode1 = strgetitem(bytecode0, pc2)
pc3 = int_add(pc2, Const(1))
guard_value(opcode1, Const(7))
a2 = int_sub(a1, Const(1))
# MOV_A_R 0
opcode1 = strgetitem(bytecode0, pc3)
pc4 = int_add(pc3, Const(1))
guard_value(opcode1, Const(1))
n2 = strgetitem(bytecode0, pc4)
pc5 = int_add(pc4, Const(1))
list_setitem(regs0, n2, a2)
# MOV_R_A 2
...
# ADD_R_TO_A 1
opcode3 = strgetitem(bytecode0, pc7)
pc8 = int_add(pc7, Const(1))
guard_value(opcode3, Const(5))
n4 = strgetitem(bytecode0, pc8)
pc9 = int_add(pc8, Const(1))
i0 = list_getitem(regs0, n4)
a4 = int_add(a3, i0)
# MOV_A_R 2
...
# MOV_R_A 0
...
# JUMP_IF_A 4
opcode6 = strgetitem(bytecode0, pc13)
pc14 = int_add(pc13, Const(1))
guard_value(opcode6, Const(3))
target0 = strgetitem(bytecode0, pc14)
pc15 = int_add(pc14, Const(1))
i1 = int_is_true(a5)
guard_true(i1)
jump(a5, regs0, bytecode0, target0)

Figure 9: Trace when executing the Square function of Figure 7, with the corresponding bytecodes as comments.


loop_start(a0, regs0)
# MOV_R_A 0
a1 = list_getitem(regs0, Const(0))
# DECR_A
a2 = int_sub(a1, Const(1))
# MOV_A_R 0
list_setitem(regs0, Const(0), a2)
# MOV_R_A 2
a3 = list_getitem(regs0, Const(2))
# ADD_R_TO_A 1
i0 = list_getitem(regs0, Const(1))
a4 = int_add(a3, i0)
# MOV_A_R 2
list_setitem(regs0, Const(2), a4)
# MOV_R_A 0
a5 = list_getitem(regs0, Const(0))
# JUMP_IF_A 4
i1 = int_is_true(a5)
guard_true(i1)
jump(a5, regs0)

Figure 10: Trace when executing the Square function of Figure 7, with the corresponding opcodes as comments. The constant folding of operations on green variables has been applied.


6 A Backend for PyPy's JIT

In this section we explain how a backend for PyPy's tracing JIT compiler is structured, taking a particular look at how a backend is built to generate code suited for ARM processors. Many of the ideas presented here were also used before in the x86 and x86_64 backends of PyPy's JIT.

The tasks necessary to compile and execute such a loop are manifold. There are several, rather independent components that need to be in place before we can generate and execute code for a loop. First we need a way to generate executable machine code. We want a well-structured way to interface with it, to avoid duplicating the logic of encoding the instructions; this is described in Subsection 6.1. Additionally we need a way to manage the registers. This tool should take care of associating registers with variables, keeping track of the bindings and taking care of freeing the registers. This interface is described in Subsection 6.2. With these tools in place we can start creating the execution context for the loop. We want to create a procedure that acts as a container for the compiled loop and also provides a context to hold the state while executing the loop; how the frame layout used in the backend to hold this state is organized is described in Subsection 6.3.1, and how the procedure that is used to encapsulate the trace works is described in Subsection 6.3.2. With all these components in place we can proceed to compile the operations contained in the trace, which is described in Subsection 6.4. How the loop is left, and how we pass information to the frontend so it can restore the execution state before resuming interpretation, is described in Subsection 6.5. Finally, how bridges are handled is described in Subsection 6.6.

6.1 Low Level Code Generation Interface

Before we can generate code from the traces passed from the frontend into the platform-specific backends, we need a way to encode the machine instructions. In classical compilers this is done by the assembler, which provides a higher-level interface to the instructions supported by the processor, transforming them into their binary representation, which is then executable by the processor.

Because the JIT compiler operates at runtime, we need to provide an assembler that is available at runtime. To achieve this, to avoid having to repeatedly encode all instructions as numbers by hand, and to have a clean interface to the machine language, we generate an assembler interface which provides functions named after the assembler instructions. Such a function encodes the arguments passed to it according to the platform documentation and stores the corresponding instruction to memory. In the ARM backend this interface is provided by a class named CodeBuilder, so for simplicity we will further refer to this interface as the codebuilder. The assembler functions are exposed in the codebuilder using PyPy's metaprogramming features by generating a function for each required instruction from an encoded representation of the specification at load time.

As mentioned earlier, the current ARM backend targets the ARM state/instruction set, which is a fixed-width 32-bit instruction set.


which is a fixed-width 32-bit instruction set.

The ARM Architecture Reference Manual [ARM10] defines different instruction groups and for each of these groups it describes how the arguments to an instruction are encoded in the bits that actually make up the instruction. In this section we are going to use the group of data-processing (immediate) instructions as an example. These instructions take one register and one immediate value as their arguments. Instructions such as AND and CMP are defined in this group. The bit encoding for the instructions defined there is shown in Figure 11. The empty part in the lowest bits of the encoding is defined on a per-instruction basis. As explained in Section 2.2.5 almost every instruction can take a conditional execution flag; this flag is stored in the upper 4 bits of the instruction.

Figure 11: Bit encoding for the data processing instructions with immediate values [ARM10]

To encode the instructions for the backend, all instructions of one group are stored in a Python hashmap. This grouping is possible because most of the instructions in a group have a common interface. For each instruction the hashmap contains the name of the operation, its arguments and a description of the variable parts of the encoding. Figure 12 shows how some of the instructions of our example group are stored. For example the AND_ri instruction has the operation number 0 and it has a result and a register argument. This instruction performs a bitwise AND between the register and the immediate value arguments, storing the result in a result register.

data_proc_imm = {
    'AND_ri': {'op': 0,    'result': True,  'base': True},
    'EOR_ri': {'op': 0x2,  'result': True,  'base': True},
    'SUB_ri': {'op': 0x4,  'result': True,  'base': True},
    'RSB_ri': {'op': 0x6,  'result': True,  'base': True},
    'CMN_ri': {'op': 0x17, 'result': False, 'base': True},
    'ADD_ri': {'op': 0x8,  'result': True,  'base': True},
    'ORR_ri': {'op': 0x18, 'result': True,  'base': True},
    'MVN_ri': {'op': 0x1E, 'result': True,  'base': False},
    #snip
}

Figure 12: Encoding of the group of data processing instructions with immediate values as arguments as a Python hashmap

At load time, when the modules defining the backend are loaded into memory, these instruction encodings are transformed into RPython functions and attached to the CodeBuilder class. When the codebuilder is translated all required functions are already in place and translated with the class, and thus available at runtime to provide a high level interface to the machine instructions. Figure 13 shows the function used to generate the functions for data processing instructions with immediate values. It checks if


the instruction has a return value and if it has a base register, and returns a function with the corresponding signature. When the returned function is called at runtime it writes the binary representation of the instruction it implements to memory.

def define_data_proc_imm_func(name, table):
    n = (0x1
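The rest of the generator builds, for each entry of the table, a function that packs the condition, the operation number and the operands into a 32-bit word and writes it to memory. The following is only a simplified sketch of that idea, not the code of Figure 13: the bit layout is an assumption based on the data-processing (immediate) format of Figure 11, and the special cases for instructions without a result or base register are omitted.

# Sketch only: an assumed ARMv7 data-processing (immediate) layout of
#   cond | 0 0 1 | op | Rn | Rd | imm12
# is used; the real generator also handles instructions without a
# result or base register, which is left out here for brevity.
def define_data_proc_imm_func(name, table):
    op = table['op']

    def f(self, rd, rn, imm, cond=0xE):           # 0xE is the AL (always) condition
        word = (cond << 28 | 0x1 << 25 | op << 20
                | (rn & 0xF) << 16 | (rd & 0xF) << 12 | (imm & 0xFFF))
        self.write32(word)                        # append the instruction word
    f.__name__ = name
    return f

class CodeBuilder(object):                        # minimal stand-in for the real class
    def __init__(self):
        self.buf = []
    def write32(self, word):
        self.buf.append(word)

# at load time one method per instruction of the group is attached to the class
# (only one entry of the table from Figure 12 is repeated here)
for name, spec in {'ADD_ri': {'op': 0x8, 'result': True, 'base': True}}.items():
    setattr(CodeBuilder, name, define_data_proc_imm_func(name, spec))

cb = CodeBuilder()
cb.ADD_ri(4, 4, 1)          # encodes ADD r4, r4, #1 and stores the word in cb.buf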


To load a constant that does not fit into the immediate field of an instruction there are two options: the value can be built up in the target register step by step, or it can be stored in the program stream and addressed by its offset to the program counter. The first solution has the advantage that it works with conditional execution. Figure 16 shows the function used to load constants of up to 32 bits into a register, passing in the variable r that marks the target register where the value will be stored after executing the generated instructions. If the value fits in 2 bytes, or if the instructions are to be executed conditionally, we load the constant byte by byte into the target register (see the function _load_by_shifting in Figure 16). Initially the lowest byte is moved into the target register; all further bytes are written to the register by an ORR operation which applies a logical OR to the argument register and the shifted 8 bit immediate value. Figure 14 shows how the constant 0xBDBC would be loaded into register r4 using this approach. For the second approach, used when the constant requires three or four bytes and we do not need conditional execution, we encode the constant in the program stream, read it using an address based on an offset to the value of the PC and perform a small unconditional jump to continue execution after the location of the value. With this method loading a 32 bit constant takes at most 3 words (2 instructions plus 1 data word); see Figure 15 for an example of the code generated to load the constant 0x6A5EF3C5 into register r4.

...
MOV r4, #179
ORR r4, r4, #189 LSL #8
...

Figure 14: Loading a constant by shifting the bytes into the target register

...
LDR R4, [PC + #8]
ADD PC, PC, #4
0x6A5EF3C5
...

Figure 15: Loading a constant encoded in the program stream

We also experimented with using exclusively one or the other approach, but this mixed approach gives the best results in terms of speed, although the differences are small. A further method worth investigating, which we have not tried so far, is using a constant pool: all used constants are stored in memory in a location that can be accessed by offset addressing relative to the PC, and a constant is then loaded into a register by referencing its offset to the PC. This technique would save the jump and work with conditional execution.

6.2 Register Allocation

An important aspect for the results produced by a compiler, which is also valid for just-in-time compilers, is register allocation. Register allocation is the process of assigning CPU registers to the variables used in the intermediate representation of a program. This process should be fast, especially for JITs, and at the same time produce good results.


def gen_load_int(self, r, value, cond=cond.AL):
    from pypy.jit.backend.arm.conditions import AL
    if cond != AL or 0 <= value <= 0xFFFF:
        ...
        b = (value >> offset) & 0xFF
        if b == 0:
            continue
        t = b | (shift ...

Figure 16: gen_load_int and _load_by_shifting, the functions used to load a 32 bit constant into a register
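The decision implemented by gen_load_int can be restated in the following sketch, which produces assembler mnemonics instead of calling the real codebuilder methods; the function name, the operand formatting and the PC offsets are illustrative only and mirror Figures 14 and 15.

# Illustrative sketch of the two constant-loading strategies described above;
# it returns a list of pseudo-instructions rather than emitting machine code.
def load_int_plan(reg, value, conditional=False):
    if conditional or 0 <= value <= 0xFFFF:
        # strategy 1 (Figure 14): build the value byte by byte with MOV/ORR
        plan = [('MOV', reg, '#%d' % (value & 0xFF))]
        for shift in (8, 16, 24):
            byte = (value >> shift) & 0xFF
            if byte:
                plan.append(('ORR', reg, reg, '#%d LSL #%d' % (byte, shift)))
        return plan
    # strategy 2 (Figure 15): embed the constant in the instruction stream,
    # load it PC-relative and jump over the data word
    return [('LDR', reg, '[PC + #8]'),
            ('ADD', 'PC', 'PC', '#4'),
            ('.word', '0x%X' % value)]

print(load_int_plan('r4', 0x6A5EF3C5))   # the unconditional, PC-relative case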


The choice of the variable to be spilled is based on the previously calculated longevity of the variables, spilling the variable that survives for the longest time. This selection scheme should avoid blocking registers by linking them to very long lived variables. A possible drawback is that if the variable is long-lived and often used, it may need to be read from memory frequently if spilling occurs between usages. Once a variable is selected for spilling we generate an instruction to move it to the spilling area on the stack, and we record in the register allocator where on the stack the variable is stored. After moving the variable away we can free the previously bound register and associate it with a new variable.

6.3 Setup to Execute a Compiled Loop

Apart from compiling a trace, we need a way to actually execute the compiled instructions. This means that we need a mechanism to call into the compiled code once the frontend tries to execute the same loop again. Similar to the approach described in [GFE+09] we generate code that follows the platform calling conventions (see Section 2.2.3 for how to create a procedure interface [Ear09]) so it can be cast to a function pointer at runtime and called as a normal function/procedure.

The compilation process for a loop is started when the frontend passes an optimized trace to the backend. Before the instructions contained in the trace can be compiled the backend needs to take some preparatory steps to provide an interface to the compiled trace that can be executed. This consists of creating a callable interface and generating the instructions to set up the frame and load the arguments for the loop.

6.3.1 Frame Layout

Before we can execute instructions we need to set up the frame of the procedure. For this we generate instructions that create a frame as described below. The frame layout is composed of four parts:

• callee saved registers, according to the calling convention
• a slot to store the force index, a value used to check if the interpreter level frame was forced (see Section 6.4.6)
• an area to store spilled variables
• and the stack where push and pop instructions operate

The frame begins with the registers the function has to save according to the calling convention described in Section 2.2.3, which on ARM are the registers r4 up to r11 (the FP). After these registers one word on the stack is left to store the force index, see Section 6.4.6. After the location for the force index the frame contains space for spilled register values. The address of the beginning of this area is stored in the FP register and spilled values are addressed by their offset from the FP. After this area the classic stack area begins,


where intermediate results are stored and registers are pushed and popped around subprocedure calls. The stack pointer (SP) always points to the last value pushed on the stack, so at the beginning of the execution it points to the end of the spilling area and is modified automatically every time a value is pushed on the stack or popped from it. Deviating from the calling convention we use the frame pointer to mark the location where the spilled registers are stored in the frame. The exact size of the spilling area is determined by the number of registers the register allocator needs to spill for a particular loop and is only known after compiling the instructions for the loop, so the exact position of the SP is patched once the size of the spilling area is known. Figure 17 shows the layout of the frame at the beginning of the execution of a compiled loop.

Figure 17: Frame layout used by the ARM backend. From top to bottom the frame contains the previous frame, the callee saved registers r4 to r10 and FP, the force index slot (marked by the frame pointer), the spilling area and the stack area (marked by the stack pointer).

6.3.2 Function Interface

The interface mentioned above to call the compiled code is generated by creating an interface that follows the rules imposed by the AAPCS and then performs a set of operations that set up the frame and the state to execute the loop.

The function interface generated for ARM starts by pushing the callee saved registers on the stack. As a next step we generate instructions to move the stack pointer by one word to leave room for the force index. Then we generate four no-ops, where we later patch in the instructions that set up the size of the spilling area.
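As a rough sketch, with helper names that are assumptions rather than the backend's real interface, the prologue just described could be emitted like this:

# Minimal stand-in that records emitted pseudo-instructions instead of bytes.
class MachineCodeStub(object):
    def __init__(self):
        self.ops = []
    def PUSH(self, regs):          self.ops.append(('PUSH', regs))
    def SUB_ri(self, rd, rn, imm): self.ops.append(('SUB', rd, rn, imm))
    def NOP(self):                 self.ops.append(('NOP',))
    def currpos(self):             return len(self.ops)

WORD = 4

def emit_prologue(mc):
    mc.PUSH(['r%d' % i for i in range(4, 12)])  # callee saved registers r4-r11/FP
    mc.SUB_ri('sp', 'sp', WORD)                 # reserve one word for the force index
    patch_pos = mc.currpos()                    # remembered for later patching
    for _ in range(4):
        mc.NOP()                                # placeholders, later replaced by the SP
                                                # adjustment for the spilling area once
                                                # its size is known
    return patch_pos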


With these instructions the entry point is set up and we can proceed to load the input arguments and compile the operations in the trace.

After generating the function interface we generate instructions to load the input arguments of the loop into locations controlled by the register allocator, setting up the state for the loop. Input values for a loop are passed to the backend through pre-allocated arrays, one for each argument type. These types can be floats, integers, long longs and pointers, although at the current time the backend only supports integers and pointers. At runtime we allocate a register for each box passed to the loop using these lists of input arguments. For each box we then generate the instructions to load the value stored in the corresponding list in memory into an allocated register.

6.4 Compiling Trace Operations

Once the frame is set up and the instructions to load the input arguments into registers have been generated, the next step is to generate instructions for each of the operations in the trace. The code generation for operations is rather straightforward and is divided into two steps. For each operation the first step is allocating the registers for the operation's arguments and, if present, for the result. The second step is the instruction selection and generation step, which actually emits the instructions that implement the operation using the allocated registers. A goal we try to achieve during instruction selection is to emit the instructions that provide the best execution speed for the kind of arguments of an operation, i.e. selecting the correct instruction to add a register and a small constant. A sketch of how the two steps are driven for a whole trace is shown after the following overview.

We are going to look at how these operations are implemented by taking a look at the different groups of trace operations that are handled by the backend. The operations can be categorized as follows:

• arithmetic operations
• comparison operations
• memory allocation
• memory access
• calls
• forcing of frames
• guards
• jumps
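Before looking at the individual groups, the following sketch illustrates how the two steps described above could be driven for a whole trace; the prepare_op_*/emit_op_* naming mirrors the figures in the following subsections, but the driver itself is only an illustration, not the backend's actual main loop.

# Illustrative driver for the two-step scheme: for every operation the register
# allocator decides on locations first, then the matching emit function
# generates the machine code for exactly those locations.
def compile_trace_operations(assembler, regalloc, operations, fcond='AL'):
    for op in operations:
        opname = op.getopname()                      # e.g. 'int_sub'
        prepare = getattr(regalloc, 'prepare_op_' + opname)
        emit = getattr(assembler, 'emit_op_' + opname)
        arglocs = prepare(op, fcond)                 # step 1: allocate registers
        emit(op, arglocs, regalloc, fcond)           # step 2: select and emit instructions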


6.4.1 Arithmetic Operations

The JIT implements operations for unary and binary arithmetic on signed and unsigned integers and long longs as well as on floating point numbers.

According to the scheme described above, for arithmetic operations we first allocate registers for the operands and for the result. How this is done for the int_sub operation is shown in the function prepare_op_int_sub in Figure 18. This function allocates the registers for the variables involved in the operation and makes sure the corresponding values are stored in those registers. Once we have allocated registers for the current operation we emit the actual instructions, done by the function emit_op_int_sub also shown in Figure 18, taking care to select the most suitable instructions depending on the input variables. For this operation we check if the second operand fits in an immediate value and emit an operation which takes one register and one immediate operand. If the first argument is an immediate value we make use of ARM's reverse subtract operation, which subtracts the second operand from the value stored in the register passed as first operand. Finally, if both values are stored in registers we emit an instruction which takes both arguments in registers. This procedure is similar for all arithmetic operations, except for those not provided by the underlying platform.

def prepare_op_int_sub(self, op, fcond):
    a0, a1 = op.getarglist()
    l0 = self.make_sure_var_in_reg(a0)
    l1 = self.make_sure_var_in_reg(a1)
    res = self.force_allocate_reg(op.result)
    return [l0, l1, res]

def emit_op_int_sub(self, op, arglocs, regalloc, fcond):
    l0, l1, res = arglocs
    if l0.is_imm():
        self.mc.RSB_ri(res.value, l1.value, l0.value)
    elif l1.is_imm():
        self.mc.SUB_ri(res.value, l0.value, l1.value)
    else:
        self.mc.SUB_rr(res.value, l0.value, l1.value)

Figure 18: Implementation of the int_sub operation

For operations such as division and modulo, which are not supported by the ARMv7-A profile of the ARM instruction set, we provide pre-compiled wrapper functions that implement the behaviour expected by the JIT operations. The ARMv7-R profile supports these operations and would not need these helper functions. The functions are compiled when the backend is translated and rely on the implementations of these operations provided by the compiler vendors, which usually comply with ARM's EABI. The EABI [Smi09] provides a binary interface which, among other things, defines interfaces for arithmetic functions to be followed by compiler vendors. As defined in the EABI, compiler vendors can reuse the implementation of functions defined in the EABI from other libraries at link time, or they can generate code for the used functions themselves. When generating EABI functions the compiler can emit code that uses the features provided by the selected version of the EABI and the hardware platform, exploiting the presence of vector and/or floating point units or falling back to software based implementations, to make the best use of the platform while still adhering to a standardized interface which is reusable across toolchains.
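For illustration, lowering one of these unsupported operations could look roughly as follows. The operation shown, the helper address and the emit interface are assumptions; the only fixed points are that, following the AAPCS, the two arguments travel in r0 and r1 and the result comes back in r0.

# Sketch (not the backend's real code) of lowering an integer division to a
# call to a pre-compiled helper following the EABI/AAPCS calling convention.
r0, r1 = 0, 1

def emit_op_int_floordiv(self, op, arglocs, regalloc, fcond):
    l0, l1, res = arglocs
    self.mc.MOV_rr(r0, l0.value)           # dividend into r0
    self.mc.MOV_rr(r1, l1.value)           # divisor into r1
    self.mc.BL(self.division_helper_addr)  # branch-and-link to the wrapper function
    self.mc.MOV_rr(res.value, r0)          # the result is returned in r0
    return fcond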


6.4.2 Comparison Operations

Among the operations a trace can contain there are several comparison operations that perform boolean, equality, greater-than or less-than checks between two variables. We implemented these operations using the standard ARM CMP operation, which compares two registers or a register and a constant and sets the condition flags of the processor according to the result. Using the conditional execution feature we can then assign the result to a register according to the state of the condition flags. The same procedure is used for boolean checks, where we compare a variable against a constant, considering all non-zero values as true.

As an example, the i2 = int_lt(i0, i1) operation checks if i0 is less than i1 and assigns the result of the comparison to the variable i2. Assume that i0 is stored in register r9, the variable i1 is stored in register r4 and the result should be stored in register r0. The generated code for the operation would then look like the one shown in Figure 19.

CMP r9, r4
MOVLT r0, #1
MOVGE r0, #0

Figure 19: Instructions generated for the int_lt(i0, i1) operation

6.4.3 Allocation

In order to improve the performance of operations that perform memory allocation, especially in conjunction with PyPy's own garbage collectors, the JIT provides operations to allocate new objects, strings and arrays. These objects are allocated in the same way as done in the interpreter, so they can be passed to the frontend when the loop is left. Currently the ARM backend for PyPy's JIT only supports the Boehm-Demers-Weiser conservative garbage collector [BW88], which is a library that replaces the standard malloc implementation. In the future we plan to support PyPy's own garbage collectors. With the supported garbage collector the allocation procedure is similar for all three types of allocations. We calculate how much memory is required based on the number and size of fields or elements the object has and the size of the meta data stored in the object. Once the size is known we generate a call to the malloc implementation provided by the garbage collector, which returns the address of the allocated memory. In the freshly allocated object we store its metadata, i.e. the length of a list or string, and finally we associate the object with the corresponding variable.

6.4.4 Memory Access

Having optimized operations for memory allocation, we also want to be able to optimize the operations that interact with objects stored in memory: reading and writing fields of an object, or reading and writing items of strings and arrays. For these operations we first generate instructions that calculate the address of the field based on the address


of the object and the offset within it. Then, depending on the size of the field, we generate an instruction that reads or writes a byte, a half-word or a word and stores the result in the corresponding allocated register.

6.4.5 Calls

There are two cases where we might want to perform external calls from our traces. The first case is calling an externally defined function, such as malloc for allocation. The second is when a trace calls an already compiled trace.

The first kind of call is handled by the call operation, used to call any function that follows the AAPCS (see Section 2.2.3). When compiling this instruction we move the arguments to the registers r0 to r3 and to the stack, according to the AAPCS, and update the stack pointer. We write the return address to the link register and perform the jump to the procedure. When we return from the call we associate the return value stored in register r0 with the corresponding variable, so it is further managed by the register allocator.

The other kind of call operation supported by the JIT is called call_assembler. This operation is used to make a call from one compiled loop to another. It has the advantage that we do not need to compile all the instructions of the called loop again, as would be the case when inlining the operations. The loop is invoked as any other procedure, passing the arguments to the loop following the AAPCS. We call from one compiled loop to another in this way to avoid the overhead of passing all arguments through the preallocated lists, as is done when calling the loop from the JIT frontend. To be able to call the compiled loop in this way we generate a second entry point to the function that wraps the compiled loop. The additional entry point is generated after the actual instructions in the loop have been compiled, because then we know where all the variables should be stored when we enter the loop body. When we call the loop using the second entry point we generate instructions that first set up the frame and then read the values passed in registers and on the stack and put them in the locations the loop operations expect them in. Once the arguments are in place we jump directly to the loop body, skipping the original setup instructions.

6.4.6 Forcing the Frame Object

One of the optimizations performed by the JIT is to remove object allocations. In languages such as Python the frame of an interpreter level function is accessible from the interpreter and is a heap allocated object that contains references to the local variables. Among the optimizations performed by PyPy's JIT is avoiding the allocation of the objects within the frame by virtualizing them. In case some function called from the optimized loop accesses the frame we need to forcibly allocate the objects in the frames for the scope of the optimized loop; this operation is called forcing.

When the frame needs to be forced the backend stores the variables used in the loop to a location where the frontend can read them. The frontend then allocates the objects referenced


in the frames. When forcing happens we cannot continue with the execution of the optimized loop and have to fall back to interpretation. Because of this, around calls that can trigger such a frame forcing we need to check if forcing happened, in order to leave the optimized loop in that case.

To perform the steps described above we make use of the force_index mentioned before when describing the frame layout. We use the slot of the force_index on the stack to store a value that is changed when we force the frame. Using this value we can check if the frame was forced and exit the loop if necessary. Additionally we make use of the address of the force_index, which is stored in the FP register, to restore the variables used in the loop so the frontend can create the frame objects. We can restore the variables using the address of the force_index because the variables are stored at a known offset from this address.

6.5 Guards and Leaving the Loop

As described in Section 3.2 traces correspond to only one possible path through the traced loop. Operations called guards are inserted in the trace at all points where the execution could take a different path than the one recorded during tracing. These operations check that the conditions valid during tracing are still valid during execution. Examples for such guards are guard_value, guard_class, guard_true, etc., which check the condition indicated by their name to be true.

Because a trace contains a large number of guards and guards fail rarely, we want to implement guards in such a way that the case where the guard does not fail is efficient. The efficiency of the failure case is not as important.

Guards are implemented by generating a comparison instruction on the input argument of the guard and the provided condition. If the condition passes we jump over the block of instructions generated for the guard and continue with the execution of the loop. In case the check fails we enter the guard block. Within the guard block we first save the locations of all variables, i.e. the registers they are associated with, and jump to a procedure that takes care of all the steps needed to exit the loop. Each guard is associated with a list of variables that should be passed to the frontend to restore the state of the execution in case the guard fails. When we generate the instructions for a guard, we additionally allocate a small piece of memory which is associated with the guard. In this piece of memory we store, in an encoded form, the information about where each of the arguments associated with the guard is currently stored. Because of the large number of guards generated we want to keep this encoding as compact as possible to reduce the memory footprint of the compiled loops. For each variable the encoding consists of up to 6 bytes: one byte for the type (int, pointer or float), one for the location (stack, register or immediate) and four for the value itself; in the case of register locations we only store the type and the register number, which on ARM fits into one byte.

After a guard failure we jump to a pre-generated procedure that serves as a common exit path for all compiled loops.
This procedure takes care of saving all values that should be passed to the frontend and then finally leaves the compiled loop by returning from the function that executes the trace. The saving of the values is done by reading the encoding


of the variable locations generated for the guard. With that information we can load the corresponding values and save them to preallocated arrays where the frontend can read them. The caller or the frontend can then read all values from the corresponding lists and restore the state before resuming interpretation. Once all values are saved we generate instructions to return from the function we generated for the trace.

To illustrate this we describe the procedure for a guard_true operation that produces the code shown in Figure 20. The guard in this example has one input argument (i0) and two variables (p0 and i0) that should be passed to the frontend in case the guard fails.

...
CMP r1, 0
ADDNE PC, 20
LDR lr, pc + 8
LDR pc, pc + 8
ADDRESS OF ENCODED LOCATIONS
ADDRESS OF EXIT PROCEDURE
...

Figure 20: Generated machine code for guard_true(i0, [p0, i0])

Let's assume the variable i0 is associated with register r1 and p0 is currently stored on the stack with an offset of one word from the FP. The instructions generated for the guard can be seen in Figure 20. First we generate a CMP instruction to compare the value of i0 with 0; in PyPy we compare with 0 because all non-zero values are true. If the value in register r1 is zero, the conditional ADDNE instruction that would otherwise skip the guard block has no effect; we then load the address of the memory location containing the encoded locations of the variables into the LR register and finally load the address of the exit procedure into the PC register, thus performing an unconditional jump to it.

The encoding of the variable locations for our guard is shown in Figure 21. We first encode the location of the variable p0. The information that p0 is a pointer is encoded as \xEE. Then we encode the information that it is stored on the stack, represented as \xFC, and finally we store the offset to the FP as a sequence of four bytes. After p0 we encode the variable i0, first encoding the information that it is an integer (represented as \xEF) and then the number of the register it is stored in. In the case of values stored in registers we omit the information about the location in the encoding.

\xEE \xFC 0 0 0 1
\xEF 1

Figure 21: Encoding of variable locations for guard_true(i0, [p0, i0]). Variable p0 is spilled on the stack and variable i0 is stored in register r1
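A decoder for this format, as used by the exit procedure and later by the bridge setup, could be sketched as follows. The byte values for the pointer and integer types and for stack locations are those of Figure 21; the byte order of the offset and the omission of immediate locations are simplifying assumptions.

# Sketch of decoding the per-guard location encoding described above.
# 'encoded' is a list of byte values; register numbers fit into one byte.
TYPE_PTR, TYPE_INT = 0xEE, 0xEF     # type markers (Figure 21)
LOC_STACK = 0xFC                    # marker for a stack location

def decode_locations(encoded):
    locations, i = [], 0
    while i < len(encoded):
        kind = encoded[i]                      # int, pointer (or float)
        if encoded[i + 1] == LOC_STACK:        # stack: four bytes of FP offset follow
            b = encoded[i + 2:i + 6]
            offset = b[0] << 24 | b[1] << 16 | b[2] << 8 | b[3]   # byte order assumed
            locations.append((kind, 'stack', offset))
            i += 6
        else:                                  # register: only the register number
            locations.append((kind, 'reg', encoded[i + 1]))
            i += 2
    return locations

# the encoding of Figure 21: p0 on the stack at offset 1, i0 in register r1
print(decode_locations([0xEE, 0xFC, 0, 0, 0, 1, 0xEF, 1]))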


6.6 Bridges

When a specific guard fails often enough that its failure count passes a defined threshold, the JIT frontend begins to trace the execution from the failing guard. The trace is recorded until the entry point of another loop is found. This kind of trace is called a bridge. Once a bridge has been traced it is passed to the backend to be compiled. The compilation of the instructions contained in a bridge is the same as for a loop. The difference lies in the setup of the bridge and how it is entered. A bridge is entered from a failing guard. To make entering the bridge as efficient as possible we patch the previously generated code for the guard in the loop with a jump to the compiled bridge. The patched guard from our example above would then look as shown in Figure 22. Now in case the guard fails, instead of jumping to the exit procedure it jumps directly to the code of the bridge. This means that when we enter a bridge we are in the same frame as the loop, and thus have the same state, i.e. allocated registers, spilled values etc. When we generate a bridge we need to set up the state to be the same as it was in the loop at the time the guard failed; for this we use the same encoded information used when leaving the loop, decoding it and restoring the register and stack bindings of the variables.

...
CMP r1, 0
ADDNE PC, 20
LDR pc, pc + 4
ADDRESS OF THE BRIDGE
ADDRESS OF ENCODED LOCATIONS
ADDRESS OF EXIT PROCEDURE
...

Figure 22: Patched machine code for guard_true(i0, [p0, i0])

7 Cross Translation

Translating PyPy is a memory and computation intensive task. To translate the full Python interpreter at least 2GB of memory are required. Most ARM devices target either mobile or embedded platforms; these devices usually do not have the power or resources to translate larger programs such as PyPy's Python interpreter on the device itself. Therefore, based on the work done for PyPy on Maemo [PyP11a], we implemented a cross translation target to translate PyPy on a more powerful host machine and generate code for ARM processors.

PyPy has a mechanism to select different platforms and compilers to translate the generated C code. This mechanism also provides an interface to gather information about the platform and its libraries and to compile and execute programs on the platform. For this thesis we implemented an additional platform for PyPy which can generate a binary for


ARM using a cross compiler, such as the GCC ARM cross compiler toolchain⁴, and which uses Scratchbox2⁵ to provide a chrooted environment with the libraries of the target system and the possibility to run small programs in an emulated environment based on QEMU⁶. During the translation process PyPy needs to compile and execute several small programs to gather information about the target platform it will generate code for. With the cross-translation approach the Python interpreter that performs the translation runs on the host system, and all calls to the compiler and all executions of the compiled programs are redirected through the Scratchbox, working as if they were run on an ARM based system. This approach has some drawbacks, but gives a translation speed that is comparable to the speed when targeting the host environment. The biggest drawback is that the Python interpreter running the translation process needs to be compiled for the same word size as the target system provides. So it is impossible to cross-translate from a 64-bit Python to ARM, which is a 32-bit platform.

4 http://gcc.gnu.org
5 http://freedesktop.org/wiki/Software/sbox2
6 http://www.qemu.org


8 Evaluation

Figure 23: BeagleBoard-xM development board

In order to evaluate the performance of the ARM backend and to analyze how it behaves compared to PyPy on x86, in this section we present the results of different benchmarks. We gathered results from benchmarks executed on PyPy's Python interpreter and on Pyrolog [BLS10], a Prolog interpreter written in RPython.

8.1 Hardware

All tests and benchmarks on ARM were run on an otherwise idle BeagleBoard-xM⁷ running Ubuntu 10.10 for ARM⁸. The BeagleBoard-xM is an ARM based development board with an ARM Cortex-A8 CPU at 1 GHz and 512MB of memory, running Linux 2.6.35. See Figure 23 for a picture of the board.

The benchmarks on x86 were performed on an otherwise idle Intel Core2 Duo P8400 processor with 2.26 GHz and 3072 KB of cache on a machine with 3GB RAM running Linux 2.6.35.

8.2 Benchmarks

To measure the performance of the ARM backend we performed two different sets of benchmarks, one running the PyPy benchmarks and one running the Prolog benchmarks described in [BLS10]. All interpreters were translated using a fixed revision (370c23f085d7) of the ARM branch of PyPy's mercurial repository⁹.

7 http://beagleboard.org/hardware-xM
8 https://wiki.ubuntu.com/ARM
9 https://bitbucket.org/pypy/pypy/src/arm-backend-2


8.2.1 Python Benchmarks

To run the PyPy Python benchmarks we created a set of different versions of PyPy's Python interpreter. For ARM we created a version of PyPy with and one without the JIT, both using the Boehm-Demers-Weiser garbage collector [BW88] (which we will refer to as the Boehm GC), to compare them against CPython on ARM. Additionally we created four versions of PyPy's Python interpreter for x86, covering all combinations of JIT, no JIT, Boehm GC and PyPy's own GC. The CPython version on ARM and x86 is version 2.6.6 revision r266:84292. The benchmarks are a mix of small and medium sized Python programs composed of synthetic benchmarks and real applications. We had to remove the spambayes, spitfire, pyflate-fast and rietveld benchmarks from the full set, because they were hitting an issue in the ARM backend that is still open. The full set of benchmarks as well as the current results on x86 can be seen on the PyPy Speed Center website¹⁰; the benchmarks' source code can be found in the PyPy benchmarks repository¹¹. All benchmarks were run 50 times in the same instance of the VM, taking the average of all runs, so that the results contain interpretation, warm-up, code generation and execution of the generated code. The errors were calculated using a confidence interval with a 95% confidence level [GBE07].

10 http://speed.pypy.org
11 http://codespeak.net/svn/pypy/benchmarks/
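Assuming the usual construction of such an interval (the exact recipe of [GBE07] is not restated here), the reported error for a benchmark with n = 50 runs, sample mean x̄ and sample standard deviation s corresponds to

    x̄ ± t(0.975, n-1) · s / sqrt(n)

with t(0.975, 49) approximately 2.01.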
The results for the benchmarks run on the BeagleBoard are shown below. In Table 1 we present the absolute runtimes for the different benchmarks. Table 2 shows the relative runtimes, which, as well as the graphed data in Figure 24, are normalized to CPython. To put the results gathered on ARM into context we also performed the same benchmarks on x86 with all four PyPy versions mentioned above. The graph of the relative speed compared to CPython on x86 can be seen in Figure 25. The tables with the absolute results (Table 6) and the relative results (Table 7) can be found in Section 11. The benchmarks show that running PyPy with the Boehm GC is, with one exception, always slower than the corresponding version of PyPy with its own GC. Also, on x86 the versions using the Boehm GC and the JIT are almost always faster than the version with the Boehm GC and no JIT.

Looking at the graph in Figure 24 it can be seen that on ARM the JIT almost always improves over the non-jitted version of PyPy. At the same time the results compared to CPython are not as clear. While PyPy without a JIT is always slower than CPython, the jitted version only outperforms CPython in about half of the benchmarks. In some benchmarks the jitted version is significantly slower than CPython, even being slower than PyPy without a JIT on the twisted_pb benchmark. The mixed results on ARM and the results on x86, especially those for the versions of PyPy using the Boehm GC, show the importance of a fast GC that is well integrated with the JIT.

An important result can be seen when comparing the relative results calculated for both platforms. When looking at the relative results comparing the PyPy versions to CPython on both platforms, as shown in Table 3, we see that on ARM we get relative results that are comparable to those achieved on x86 for the interpreters using the Boehm GC with and without the JIT. The cases where the relative results differ might be due to several micro-optimizations that have not yet been done in the ARM backend as well as some


architectural differences. For some benchmarks, such as the float benchmark, the ARM backend is currently slower than the x86 backend because on ARM the JIT does not yet support optimized floating point operations. On x86 the JIT in combination with PyPy's own GCs performs significantly better than with Boehm. Given the similar results on both platforms using the Boehm GC, it is to be expected that we will see speedups comparable to those on x86 once PyPy's GCs are integrated on ARM.

Benchmark | cpython [ms] | nojit, boehm [ms] | jit, boehm [ms]
ai | 5931.82 ± 7.15 | 18630.77 ± 58.60 | 13316.69 ± 492.45
bm_mako | 1976.47 ± 12.54 | 6045.01 ± 274.69 | 3080.70 ± 234.98
chaos | 6548.26 ± 19.39 | 16049.27 ± 67.81 | 4100.33 ± 851.92
crypto_pyaes | 26746.72 ± 9.91 | 127704.81 ± 53.35 | 12099.35 ± 1280.47
django | 13111.72 ± 264.58 | 27510.31 ± 605.54 | 6240.66 ± 222.82
fannkuch | 17592.60 ± 65.52 | 61427.77 ± 1056.54 | 18752.91 ± 240.71
float | 8642.43 ± 129.71 | 16318.45 ± 335.48 | 4527.81 ± 293.27
go | 11457.86 ± 38.30 | 67172.65 ± 60.21 | 11438.12 ± 801.13
html5lib | 177656.38 ± 785.98 | 660478.63 ± 58354.54 | 339913.73 ± 15962.25
meteor-contest | 3758.29 ± 4.33 | 10932.99 ± 70.55 | 8207.28 ± 222.83
nbody_modified | 7533.80 ± 52.81 | 11473.04 ± 61.70 | 7840.77 ± 109.42
raytrace-simple | 33938.57 ± 528.15 | 77114.42 ± 960.34 | 17280.65 ± 414.07
richards | 3712.72 ± 14.59 | 10840.95 ± 93.97 | 1187.04 ± 118.55
slowspitfire | 6378.05 ± 1.84 | 25646.74 ± 1234.26 | 21371.59 ± 8929.30
spectral-norm | 6990.59 ± 30.40 | 28566.51 ± 25.40 | 4492.29 ± 171.88
spitfire_cstringio | 153292.00 ± 125.83 | 257690.40 ± 636.23 | 189844.20 ± 979.49
telco | 15434.80 ± 9.53 | 61941.56 ± 49.16 | 10938.12 ± 219.95
twisted_iteration | 2055.69 ± 4.71 | 7319.68 ± 86.70 | 1305.72 ± 149.00
twisted_names | 139.98 ± 0.98 | 362.43 ± 6.55 | 385.62 ± 138.98
twisted_pb | 1076.19 ± 66.42 | 3400.00 ± 471.40 | 4066.67 ± 1394.84
twisted_tcp | 13386.24 ± 81.21 | 43825.39 ± 3205.68 | 33173.15 ± 4922.12
waf | 77516.19 ± 3944.68 | 80218.64 ± 8958.83 | 77198.68 ± 6029.46

Table 1: Absolute runtimes for standard PyPy benchmarks run on ARM


Benchmark | cpython | nojit, boehm | jit, boehm
ai | 1.00 | 3.14 | 2.24
bm_mako | 1.00 | 3.06 | 1.56
chaos | 1.00 | 2.45 | 0.63
crypto_pyaes | 1.00 | 4.77 | 0.45
django | 1.00 | 2.10 | 0.48
fannkuch | 1.00 | 3.49 | 1.07
float | 1.00 | 1.89 | 0.52
go | 1.00 | 5.86 | 1.00
html5lib | 1.00 | 3.72 | 1.91
meteor-contest | 1.00 | 2.91 | 2.18
nbody_modified | 1.00 | 1.52 | 1.04
raytrace-simple | 1.00 | 2.27 | 0.51
richards | 1.00 | 2.92 | 0.32
slowspitfire | 1.00 | 4.02 | 3.35
spectral-norm | 1.00 | 4.09 | 0.64
spitfire_cstringio | 1.00 | 1.68 | 1.24
telco | 1.00 | 4.01 | 0.71
twisted_iteration | 1.00 | 3.56 | 0.64
twisted_names | 1.00 | 2.59 | 2.75
twisted_pb | 1.00 | 3.16 | 3.78
twisted_tcp | 1.00 | 3.27 | 2.48
waf | 1.00 | 1.03 | 1.00

Table 2: Relative runtimes for standard PyPy benchmarks run on ARM, normalized to CPython


Benchmark | ARM, nojit | ARM, jit | x86, nojit | x86, jit
ai | 3.14 | 2.24 | 2.82 | 1.69
bm_mako | 3.06 | 1.56 | 3.38 | 1.89
chaos | 2.45 | 0.63 | 2.13 | 0.32
crypto_pyaes | 4.77 | 0.45 | 3.14 | 0.34
django | 2.10 | 0.48 | 1.76 | 0.40
fannkuch | 3.49 | 1.07 | 2.07 | 0.81
float | 1.89 | 0.52 | 1.62 | 0.22
go | 5.86 | 1.00 | 4.74 | 0.74
html5lib | 3.72 | 1.91 | 3.53 | 1.78
meteor-contest | 2.91 | 2.18 | 2.28 | 1.63
nbody_modified | 1.52 | 1.04 | 1.53 | 0.73
raytrace-simple | 2.27 | 0.51 | 2.06 | 0.20
richards | 2.92 | 0.32 | 1.97 | 0.23
slowspitfire | 4.02 | 3.35 | 2.91 | 2.97
spectral-norm | 4.09 | 0.64 | 4.43 | 0.34
spitfire_cstringio | 1.68 | 1.24 | 2.09 | 1.76
telco | 4.01 | 0.71 | 3.63 | 0.60
twisted_iteration | 3.56 | 0.64 | 3.54 | 0.65
twisted_names | 2.59 | 2.75 | 2.43 | 1.10
twisted_pb | 3.16 | 3.78 | 3.91 | 2.32
twisted_tcp | 3.27 | 2.48 | 3.94 | 3.09
waf | 1.03 | 1.00 | 1.00 | 1.01

Table 3: Relative results of the PyPy benchmarks using the Boehm GC on ARM and x86, normalized to CPython for the corresponding platform


Figure 24: PyPy benchmarks run on ARM, normalized to CPython


Figure 25: PyPy benchmarks run on x86, normalized to CPython


8.2.2 Prolog Benchmarks

To see how the ARM backend performs when the JIT is applied to an interpreter that is not based on a bytecode dispatch loop, we used the Pyrolog interpreter and the benchmarks described in [BLS10]. Pyrolog is a Prolog interpreter written in RPython that can be translated using the PyPy toolchain. On ARM we created two versions and compared them, as for Python, to the same interpreter translated in different versions for x86. For ARM we created one version with and one version without the JIT, both using the Boehm GC. On x86 we created the same set of versions as we did for Python. On both platforms we compared the generated versions of Pyrolog to SWI-Prolog¹², which is available for Ubuntu 10.10 on ARM and x86 in version 5.8.2. For comparability the benchmarks were run in the same way as described in [BLS10], running each benchmark twice in the same process to give the JIT time to warm up and generate code, and measuring the second run.

Benchmark | no jit, boehm | jit, boehm | SWI
iterate_assert | 1270 ms | 10 ms | 110 ms
iterate_call | 3060 ms | 10 ms | 3190 ms
iterate_cut | 2690 ms | 360 ms | 70 ms
iterate_exception | 3140 ms | 350 ms | 4990 ms
iterate_failure | 2780 ms | 130 ms | 430 ms
iterate_findall | 4190 ms | – | 740 ms
iterate_if | 7410 ms | 10 ms | 140 ms
iterate | 1190 ms | 10 ms | 50 ms

Table 4: Absolute runtimes for Pyrolog iterate benchmarks run on ARM

From the set of benchmarks we had to remove the arithmetic and reducer benchmarks, because they did not terminate in a reasonable amount of time on either platform. This, as well as the slower results compared to [BLS10], might be due to recent changes in the optimizer of PyPy's JIT that affect the Pyrolog interpreter negatively. Also we had to adapt some of the benchmarks so they could run on ARM, reducing the number of iterations they perform. The adapted benchmarks can be found in the arm branch of the pyrolog-benchmark repository¹³. All graphs of the results of the Prolog benchmarks are presented in a logarithmic scale.

Additionally, on ARM we performed the iterate benchmarks, which are rather atypical Prolog programs but show how a loop in the form of recursive calls is optimized by the JIT, compared to SWI-Prolog. From this set of benchmarks the iterate_findall benchmark crashed when executed with the JIT, thus producing no result. Also worth noting is the bad performance of iterate_cut, which is due to the bad handling of the cut operation by the JIT. An in-depth evaluation of these benchmarks has been done in [BLS10]. The results for the iterate set of benchmarks confirm the findings of [BLS10] that the JIT optimizes recursive calls and can optimize dynamic invocation particularly well, also giving a speedup on ARM in the cases where it does so on x86.

12 http://www.swi-prolog.org/
13 https://bitbucket.org/cfbolz/pyrolog-benchmark/changeset/arm


The results are shown in Table 4 and the graph of the relative runtimes normalized to SWI-Prolog is shown in Figure 27.

For the standard Prolog benchmarks we get the same kind of results as for Python. The benchmarks of the relative speed compared to SWI shown in Figure 28 were run on x86 using different GCs and show that also for this kind of interpreter the choice of the GC has a significant impact on the speed of the resulting interpreter. See Figure 26 for a graph of the relative runtimes on ARM normalized to SWI-Prolog and Table 5 for the absolute runtimes. The graph of the relative runtimes on x86 can be found in Figure 28. In this set of benchmarks we can also see that the relative results for Pyrolog on ARM and x86 compared to the corresponding versions of SWI-Prolog are similar, confirming that the ARM backend produces code with a relative performance similar to that of the x86 backend.

Figure 26: Pyrolog benchmarks run on ARM, normalized to SWI-Prolog


Figure 27: Pyrolog iterate benchmarks run on ARM, normalized to SWI-Prolog

Benchmark | own-interp | own-JIT warm | SWI
boyer | 45220 ms | 125840 ms | 890 ms
chat_parser | 761760 ms | 2140 ms | 62790 ms
crypt | 15970 ms | 44860 ms | 4520 ms
deriv | 48860 ms | 60 ms | 3780 ms
meta_nrev | 44630 ms | 11160 ms | 7590 ms
mu | 32190 ms | 1960 ms | 4360 ms
nrev | 16180 ms | 2250 ms | 1830 ms
poly | 7980 ms | 10960 ms | 80 ms
primes | 46390 ms | 14420 ms | 4320 ms
qsort | 41080 ms | 740 ms | 4000 ms
queens | 95930 ms | 38020 ms | 16860 ms
tak | 18930 ms | 8960 ms | 630 ms
zebra | 97320 ms | 230 ms | 17330 ms

Table 5: Absolute runtimes for Pyrolog benchmarks run on ARM


Figure 28: Pyrolog benchmarks run on x86, normalized to SWI-Prolog


9 Related Work

There are many virtual machines for dynamic languages that aim to improve their performance by including just-in-time compilers in their implementations. For instance, as described earlier, SELF [CUL89] contains a JIT that generates specialized code for different code paths depending on the input arguments of the optimized functions. The HotSpot JVM [PVC01] contains a JIT that directly generates machine code for functions and then profiles the compiled functions to find candidates for optimized recompilation. Other examples are some Smalltalk systems [DS84] that compile methods on their first execution.

Lately there have been several dynamic language implementations using tracing JIT technology to improve the performance of their virtual machines, such as Spur [BBF+10] or TraceMonkey [GFE+09]. Some of these virtual machines, especially those developed with mobile platforms in mind, also target the ARM architecture. The Dalvik VM, part of Google's Android mobile operating system, has contained a tracing JIT that targets ARM since version 2.2¹⁴. The TraceMonkey JavaScript engine used in the Firefox browser has been ported to ARM and tested in the mobile version of Firefox running on Android¹⁵. The LuaJIT project, which adds a trace based JIT compiler to Lua, is working on adding ARM support to its JIT¹⁶.

There are also efforts that take a more general approach, as done by Libjit¹⁷, which is part of the DotGNU project¹⁸. Libjit aims to provide tools for developers to add a JIT to their language implementations, providing a library that encapsulates the low level details of a JIT and provides backends for different platforms, also supporting dynamic code generation on ARM [TCAR09].

For Python there have been previous attempts to bring dynamic compilation to the language, most notably Psyco [Rig04], a specializing JIT compiler for Python. Psyco is a handwritten JIT for Python that can be used with CPython as an extension module. This makes it hard to maintain and dependent on changes made to CPython. As an extension module written in C, Psyco can only be used with CPython and only works on x86 32-bit processors. In contrast, PyPy can be used to implement different languages and is constructed in such a way that it can target different architectures, allowing a JIT compiler to be generated at translation time for the target architecture. Additionally, because PyPy is written in Python itself, it provides a higher level of abstraction and better maintainability.

14 http://developer.android.com/sdk/android-2.2-highlights.html
15 http://blog.cdleary.com/2011/02/picky-monkeys-pic-arm/
16 http://lua-users.org/lists/lua-l/2011-01/msg01238.html
17 http://www.gnu.org/software/dotgnu/libjit-doc/libjit.html
18 http://www.gnu.org/software/dotgnu/


10 Conclusion and Future Work

In this thesis, after presenting the technical and theoretical background of ARM, PyPy and just-in-time compilation, we have described how a JIT backend for PyPy's tracing JIT is built, with a special focus on targeting the ARM architecture. We showed how the backend translates the instructions recorded by PyPy's tracing JIT during execution and how these are prepared and executed on an ARM based platform. We presented how it is possible, by creating a cross-translation target for ARM, to translate an interpreter written in RPython using PyPy so that it includes the presented backend while being translated on a different platform. With a cross-translation target it is possible to translate on systems that provide enough resources to translate PyPy and still target ARM devices. Based on these results we performed several benchmarks on PyPy's Python implementation and on Pyrolog, a Prolog implementation written in RPython. The benchmarks showed that our results are comparable to those achieved on x86 in terms of relative speed compared to the reference implementations. Based on the results of the benchmarks we can expect to see further performance improvements on ARM once PyPy's own garbage collectors are integrated with PyPy on ARM.

There are still several open or unfinished tasks in relation to the ARM backend for PyPy's JIT compiler. As mentioned before, the current implementation of the backend only supports the Boehm-Demers-Weiser garbage collector. One of the next steps is to integrate the ARM backend with PyPy's own exact garbage collectors, which should also increase the speed of the code generated by the backend. Furthermore, the support for floating point numbers should be finished and merged soon. Additionally, the JIT gained, about two months ago, support for long long operations, providing specialized code for 64 bit integer operations; we plan to implement these operations in the ARM backend too. Another open topic, not directly related to the ARM JIT backend, is to bring PyPy to Android phones, which run a special version of Linux on ARM processors. To do this we need to adapt the translation toolchain to target Android's NDK and C library and to investigate viable ways of integrating PyPy with the host environment. This would allow applications for mobile phones to be developed in Python, or in other languages implemented using PyPy, and to benefit from the speed gains of using a JIT compiler. We would also like to explore how PyPy's JIT performs on ARM compared with Android's Dalvik VM.

Acknowledgment

I would like to thank, in alphabetical order, Carl Friedrich Bolz, Antonio Cuni and Armin Rigo for their help during the development of the ARM backend as well as for their helpful comments during the writing of this thesis.


11 Annex

Benchmark | cpython [ms] | nojit, boehm [ms] | jit, boehm [ms] | nojit, gc [ms] | jit, gc [ms]
ai | 515.25 ± 1.65 | 1455.26 ± 5.21 | 869.16 ± 11.40 | 1045.98 ± 20.31 | 381.40 ± 23.65
bm_mako | 128.29 ± 0.63 | 434.11 ± 30.57 | 241.92 ± 28.00 | 378.61 ± 1.10 | 138.65 ± 12.26
chaos | 501.00 ± 4.65 | 1065.10 ± 7.04 | 159.68 ± 70.19 | 888.07 ± 2.01 | 42.24 ± 61.28
crypto_pyaes | 2828.41 ± 22.29 | 8875.09 ± 6.10 | 958.48 ± 82.07 | 5063.13 ± 8.38 | 177.35 ± 71.35
django | 990.67 ± 1.29 | 1744.97 ± 22.64 | 394.94 ± 21.74 | 1635.51 ± 2.07 | 155.88 ± 7.22
fannkuch | 1958.91 ± 7.24 | 4050.90 ± 62.17 | 1584.03 ± 19.76 | 2667.34 ± 12.80 | 402.22 ± 11.86
float | 569.37 ± 11.11 | 922.31 ± 27.85 | 127.43 ± 37.63 | 949.14 ± 20.16 | 106.10 ± 11.32
go | 932.47 ± 2.92 | 4416.65 ± 12.89 | 687.29 ± 34.19 | 2966.42 ± 5.25 | 203.37 ± 45.28
html5lib | 14451.03 ± 72.15 | 50954.23 ± 6874.04 | 25751.72 ± 3445.27 | 28898.94 ± 65.72 | 7008.61 ± 1505.37
meteor-contest | 351.97 ± 4.96 | 802.46 ± 5.54 | 574.82 ± 16.10 | 640.06 ± 2.24 | 282.25 ± 11.35
nbody_modified | 618.89 ± 4.65 | 943.92 ± 5.41 | 452.87 ± 10.95 | 660.92 ± 12.68 | 99.62 ± 4.31
raytrace-simple | 2655.83 ± 6.56 | 5475.93 ± 14.32 | 529.33 ± 15.65 | 3652.94 ± 33.11 | 130.93 ± 13.46
richards | 353.98 ± 3.62 | 697.24 ± 5.75 | 81.67 ± 10.90 | 552.73 ± 1.64 | 17.42 ± 3.27
slowspitfire | 630.94 ± 2.65 | 1837.00 ± 90.08 | 1872.80 ± 203.10 | 1828.18 ± 18.68 | 1135.90 ± 11.02
spectral-norm | 472.93 ± 1.17 | 2095.51 ± 6.29 | 158.45 ± 14.06 | 1069.36 ± 1.82 | 36.32 ± 8.60
spitfire_cstringio | 9231.00 ± 91.39 | 19257.20 ± 71.14 | 16227.60 ± 229.90 | 11522.40 ± 22.00 | 5056.60 ± 67.18
telco | 1202.20 ± 7.64 | 4360.59 ± 14.72 | 716.04 ± 15.16 | 2900.18 ± 4.92 | 188.49 ± 24.36
twisted_iteration | 154.48 ± 0.74 | 546.64 ± 1.85 | 100.91 ± 0.02 | 422.62 ± 1.24 | 37.75 ± 0.16
twisted_names | 9.63 ± 0.02 | 23.41 ± 0.13 | 10.63 ± 0.13 | 18.84 ± 0.11 | 5.36 ± 0.12
twisted_pb | 69.96 ± 1.24 | 273.84 ± 6.78 | 162.35 ± 13.11 | 145.03 ± 2.43 | 32.06 ± 1.44
twisted_tcp | 887.33 ± 3.12 | 3495.82 ± 4.88 | 2745.48 ± 26.98 | 1810.98 ± 10.89 | 931.62 ± 18.28
waf | 5415.81 ± 36.76 | 5405.10 ± 20.46 | 5493.30 ± 228.29 | 5413.62 ± 50.76 | 5373.34 ± 69.82

Table 6: Absolute runtimes for standard PyPy benchmarks run on x86


Benchmark | cpython | nojit, boehm | jit, boehm | nojit, gc | jit, gc
ai | 1.00 | 2.82 | 1.69 | 2.03 | 0.74
bm_mako | 1.00 | 3.38 | 1.89 | 2.95 | 1.08
chaos | 1.00 | 2.13 | 0.32 | 1.77 | 0.08
crypto_pyaes | 1.00 | 3.14 | 0.34 | 1.79 | 0.06
django | 1.00 | 1.76 | 0.40 | 1.65 | 0.16
fannkuch | 1.00 | 2.07 | 0.81 | 1.36 | 0.21
float | 1.00 | 1.62 | 0.22 | 1.67 | 0.19
go | 1.00 | 4.74 | 0.74 | 3.18 | 0.22
html5lib | 1.00 | 3.53 | 1.78 | 2.00 | 0.48
meteor-contest | 1.00 | 2.28 | 1.63 | 1.82 | 0.80
nbody_modified | 1.00 | 1.53 | 0.73 | 1.07 | 0.16
raytrace-simple | 1.00 | 2.06 | 0.20 | 1.38 | 0.05
richards | 1.00 | 1.97 | 0.23 | 1.56 | 0.05
slowspitfire | 1.00 | 2.91 | 2.97 | 2.90 | 1.80
spectral-norm | 1.00 | 4.43 | 0.34 | 2.26 | 0.08
spitfire_cstringio | 1.00 | 2.09 | 1.76 | 1.25 | 0.55
telco | 1.00 | 3.63 | 0.60 | 2.41 | 0.16
twisted_iteration | 1.00 | 3.54 | 0.65 | 2.74 | 0.24
twisted_names | 1.00 | 2.43 | 1.10 | 1.96 | 0.56
twisted_pb | 1.00 | 3.91 | 2.32 | 2.07 | 0.46
twisted_tcp | 1.00 | 3.94 | 3.09 | 2.04 | 1.05
waf | 1.00 | 1.00 | 1.01 | 1.00 | 0.99

Table 7: Relative runtimes for standard PyPy benchmarks run on x86 normalized to CPython

Benchmark | no-jit, boehm | jit, boehm | no-jit, own gc | jit, own gc | SWI
boyer | 4540 ms | 9450 ms | 1300 ms | 5430 ms | 100 ms
chat_parser | 58020 ms | 170 ms | 52920 ms | 40 ms | 9060 ms
crypt | 1200 ms | 3920 ms | 950 ms | 3170 ms | 490 ms
deriv | 3780 ms | 10 ms | 3340 ms | 00 ms | 420 ms
meta_nrev | 2910 ms | 1270 ms | 2180 ms | 490 ms | 650 ms
mu | 2660 ms | 130 ms | 1930 ms | 40 ms | 490 ms
nrev | 1040 ms | 300 ms | 970 ms | 410 ms | 180 ms
poly | 710 ms | 990 ms | 390 ms | 690 ms | 10 ms
primes | 5060 ms | 1190 ms | 1740 ms | 380 ms | 400 ms
qsort | 3020 ms | 70 ms | 2560 ms | 20 ms | 450 ms
queens | 7200 ms | 1890 ms | 5500 ms | 240 ms | 1870 ms
tak | 1840 ms | 820 ms | 700 ms | 180 ms | 80 ms
zebra | 7390 ms | 20 ms | 6990 ms | 10 ms | 2230 ms

Table 8: Absolute runtimes for Pyrolog benchmarks run on x86


References

[AACM07] ANCONA, Davide ; ANCONA, Massimo ; CUNI, Antonio ; MATSAKIS, Nicholas: RPython: a Step Towards Reconciling Dynamically and Statically Typed OO Languages. In: DLS '07: Proceedings of the 2007 symposium on Dynamic languages (2007), Oct. http://portal.acm.org/citation.cfm?id=1297081.1297091

[ALSU06] AHO, Alfred ; LAM, Monica ; SETHI, Ravi ; ULLMAN, Jeffrey: Compilers: Principles, Techniques, and Tools (2nd Edition). (2006), Aug. http://portal.acm.org/citation.cfm?id=1177220&coll=DL&dl=GUIDE&CFID=8318957&CFTOKEN=98469675

[Ard11] ARDÖ, Håkan: Loop invariant code motion. http://morepypy.blogspot.com/2010/12/we-are-not-heroes-just-very-patient.html. Version: 2011. – [Online; accessed 10-February-2011]

[ARM95] ARM HOLDINGS: An Introduction to Thumb™. (1995), Mar, S. 1–29

[ARM10] ARM HOLDINGS: ARM® Architecture Reference Manual. (2010), Sep, S. 1–2158

[ARM11] ARM HOLDINGS: Company Profile - ARM. http://www.arm.com/about/company-profile/index.php. Version: 2011. – [Online; accessed 13-January-2011]

[Ayc03] AYCOCK, J: A Brief History of Just-In-Time. In: ACM Computing Surveys (CSUR) 35 (2003), Nr. 2, S. 97–113

[BBF+10] BEBENITA, Michael ; BRANDNER, Florian ; FAHNDRICH, Manuel ; LOGOZZO, Francesco ; SCHULTE, Wolfram ; TILLMANN, Nikolai ; VENTER, Herman: SPUR: A Trace-Based JIT Compiler for CIL. In: OOPSLA '10: Proceedings of the ACM international conference on Object oriented programming systems languages and applications (2010), Oct. http://portal.acm.org/ft_gateway.cfm?id=1869517&type=pdf&coll=DL&dl=GUIDE&CFID=8318957&CFTOKEN=98469675

[BCF+11] BOLZ, Carl F. ; CUNI, Antonio ; FIJALKOWSKI, Maciej ; LEUSCHEL, Michael ; PEDRONI, Samuele ; RIGO, Armin: Allocation Removal by Partial Evaluation in a Tracing JIT. In: PEPM '11: Proceedings of the 20th ACM SIGPLAN workshop on Partial evaluation and program manipulation (2011), Jan. http://portal.acm.org/ft_gateway.cfm?id=1929508&type=pdf&coll=DL&dl=GUIDE&CFID=8318957&CFTOKEN=98469675

[BCFR09] BOLZ, Carl F. ; CUNI, Antonio ; FIJALKOWSKI, Maciej ; RIGO, Armin: Tracing the Meta-Level: PyPy's Tracing JIT Compiler. In: ICOOOLPS '09: Proceedings of the 4th workshop on the Implementation, Compilation, Optimization of Object-Oriented Languages and Programming Systems (2009), Jul. http://portal.acm.org/citation.cfm?id=1565824.1565827

[BDB00] BALA, Vasanth ; DUESTERWALD, Evelyn ; BANERJIA, Sanjeev: Dynamo: A Transparent Dynamic Optimization System. In: PLDI '00: Proceedings of the ACM SIGPLAN 2000 conference on Programming language design and implementation (2000), Aug. http://portal.acm.org/citation.cfm?id=349299.349303

[BEES04] BRUENING, Derek L.: Efficient, Transparent, and Comprehensive Runtime Code Manipulation. Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science. (2004), Jan, 306. http://books.google.com/books?id=WPpdNwAACAAJ

[BKL+08] BOLZ, Carl F. ; KUHN, Adrian ; LIENHARD, A ; MATSAKIS, Nicholas ; NIERSTRASZ, Oscar ; RENGGLI, Lukas ; RIGO, Armin ; VERWAEST, Toon: Back to the future in one week—implementing a Smalltalk VM in PyPy. In: Self-Sustaining Systems (2008), S. 123–139

[BLS10] BOLZ, Carl F. ; LEUSCHEL, Michael ; SCHNEIDER, David: Towards a Jitting VM for Prolog Execution. In: PPDP '10: Proceedings of the 12th international ACM SIGPLAN symposium on Principles and practice of declarative programming (2010), Jul. http://portal.acm.org/citation.cfm?id=1836089.1836102

[Bol10] BOLZ, Carl F.: We are not heroes, just very patient. http://morepypy.blogspot.com/2010/12/we-are-not-heroes-just-very-patient.html. Version: 2010. – [Online; accessed 02-February-2011]

[BW88] BOEHM, HJ ; WEISER, M: Garbage Collection in an Uncooperative Environment. In: Software - Practice and Experience 18 (1988), Nr. 9, S. 807–820. – Read on Jan. 4

[CFR+91] CYTRON, Ron ; FERRANTE, Jeanne ; ROSEN, Barry ; WEGMAN, Mark ; ZADECK, F: Efficiently Computing Static Single Assignment Form and the Control Dependence Graph. In: ACM Transactions on Programming Languages and Systems (TOPLAS) 13 (1991), Oct, Nr. 4. http://portal.acm.org/ft_gateway.cfm?id=115320&type=pdf&coll=DL&dl=GUIDE&CFID=13052475&CFTOKEN=24119255

[CUL89] CHAMBERS, Craig ; UNGAR, David ; LEE, Elgin: An efficient implementation of SELF, a dynamically-typed object-oriented language based on prototypes. In: ACM SIGPLAN Notices 24 (1989), Nr. 10, S. 49–70

[DS84] DEUTSCH, LP ; SCHIFFMAN, AM: Efficient Implementation of the Smalltalk-80 System. In: Proceedings of the 11th ACM SIGACT-SIGPLAN symposium on Principles of programming languages (1984), S. 302

[Ear09] EARNSHAW, Richard: Procedure Call Standard for the ARM Architecture. (2009), Oct, S. 1–34. http://infocenter.arm.com/help/topic/com.arm.doc.ihi0042d/IHI0042D_aapcs.pdf

[GBE07] GEORGES, Andy ; BUYTAERT, Dries ; EECKHOUT, Lieven: Statistically Rigorous Java Performance Evaluation. In: OOPSLA '07: Proceedings of the 22nd annual ACM SIGPLAN conference on Object-oriented programming systems and applications (2007), Oct. http://portal.acm.org/ft_gateway.cfm?id=1297033&type=pdf&coll=DL&dl=GUIDE&CFID=12089051&CFTOKEN=21716971

[GFE+09] GAL, Andreas ; FRANZ, Michael ; EICH, B ; SHAVER, M ; ANDERSON, David: Trace-based Just-in-Time Type Specialization for Dynamic Languages. In: Proceedings of the ACM SIGPLAN 2009 conference on Programming language design and implementation (2009), Jan. http://portal.acm.org/citation.cfm?id=1542528&dl=

[GPF06] GAL, A ; PROBST, C.W ; FRANZ, Michael: HotpathVM: An Effective JIT Compiler for Resource-constrained Devices. In: Proceedings of the 2nd international conference on Virtual execution environments (2006), S. 144–153

[HRP05] HUDSON, M ; RIGO, A ; PEDRONI, Samuele: Compiling Dynamic Language Implementations. In: Project co-funded by the European Commission within the Sixth Framework Programme (2002-2006) (2005)

[HS74] HANSEN, Gilbert J.: Adaptive Systems for the Dynamic Run-time Optimization of Programs. Carnegie-Mellon University, Pittsburgh PA, Dept. of Computer Science. (1974), Jan, 179. http://books.google.com/books?id=pNwicAAACAAJ

[McC60] MCCARTHY, John: Recursive Functions of Symbolic Expressions and Their Computation by Machine, Part I. In: Communications of the ACM 3 (1960), Apr, Nr. 4. http://portal.acm.org/ft_gateway.cfm?id=367199&type=pdf&coll=DL&dl=GUIDE&CFID=13052475&CFTOKEN=24119255

[Moz11] MOZILLA FOUNDATION / ADOBE: Tamarin Project. http://www.mozilla.org/projects/tamarin/. Version: 2011. – [Online; accessed 09-February-2011]

[MSD05] MSDN TEAM: Compiling MSIL to Native Code. http://msdn.microsoft.com/en-us/library/ht8ecch6(v=vs.80).aspx. Version: 2005. – [Online; accessed 09-March-2011]

[Pal11] PALL, Mike: LuaJIT. http://luajit.org/luajit.html. Version: 2011. – [Online; accessed 09-February-2011]

[Pho10] PHOENIX, Evan: Making Ruby Fast: The Rubinius JIT. http://www.engineyard.com/blog/2010/making-ruby-fast-the-rubinius-jit/. Version: 2010. – [Online; accessed 25-February-2011]

[PS99] POLETTO, M ; SARKAR, V: Linear Scan Register Allocation. In: ACM Transactions on Programming Languages and Systems (TOPLAS) 21 (1999), Nr. 5, S. 895–913

[PVC01] PALECZNY, M ; VICK, C ; CLICK, C: The Java HotSpot™ server compiler. In: Proceedings of the 2001 Symposium on Java™ Virtual Machine Research and Technology Symposium - Volume 1 (2001), S. 1

[PyP11a] PYPY DEVELOPERS: How to run PyPy on top of maemo platform. http://codespeak.net/pypy/dist/pypy/doc/maemo.html. Version: 2011. – [Online; accessed 18-February-2011]

[PyP11b] PYPY DEVELOPERS: PyPy - Getting Started. http://codespeak.net/pypy/dist/pypy/doc/getting-started.html. Version: 2011. – [Online; accessed 16-March-2011]

[PyP11c] PYPY DEVELOPERS: PyPy - Goals and Architecture Overview. http://codespeak.net/pypy/dist/pypy/doc/architecture.html. Version: 2011. – [Online; accessed 02-March-2011]

[Rig04] RIGO, Armin: Representation-based just-in-time specialization and the psyco prototype for python. In: PEPM '04: Proceedings of the 2004 ACM SIGPLAN symposium on Partial evaluation and semantics-based program manipulation (2004), Aug. http://portal.acm.org/ft_gateway.cfm?id=1014010&type=pdf&coll=Portal&dl=GUIDE&CFID=108138688&CFTOKEN=34923119

[RP06] RIGO, Armin ; PEDRONI, Samuele: PyPy's approach to virtual machine construction. In: OOPSLA '06: Companion to the 21st ACM SIGPLAN symposium on Object-oriented programming systems, languages, and applications (2006), Oct. http://portal.acm.org/citation.cfm?id=1176617.1176753

[SBB+03] SULLIVAN, Gregory ; BRUENING, Derek ; BARON, Iris ; GARNETT, Timothy ; AMARASINGHE, Saman: Dynamic native optimization of interpreters. In: IVME '03: Proceedings of the 2003 workshop on Interpreters, virtual machines and emulators (2003), Jun. http://portal.acm.org/citation.cfm?id=858570.858576

[Smi09] SMITH, Lee: Run-time ABI for the ARM Architecture. (2009), Oct, S. 1–28

[SU94] SMITH, Randall B. ; UNGAR, David: Self: The power of simplicity. In: Sun Microsystems, Inc. Technical Reports; Vol. TR-94-30 (1994)

[TCAR09] TARTARA, Michele ; CAMPANONI, Simone ; AGOSTA, Giovanni ; REGHIZZI, Stefano: Just-In-Time compilation on ARM processors. In: ICOOOLPS '09: Proceedings of the 4th workshop on the Implementation, Compilation, Optimization of Object-Oriented Languages and Programming Systems (2009), Jul. http://portal.acm.org/citation.cfm?id=1565824.1565834

[UCCH91] UNGAR, David ; CHAMBERS, Craig ; CHANG, BW ; HÖLZLE, Urs: Organizing programs without classes. In: Lisp and Symbolic Computation 4 (1991), Nr. 3, S. 223–242

[V8 11] V8 TEAM: Design Elements - V8 JavaScript Engine - Google Code. http://code.google.com/apis/v8/design.html#mach_code. Version: 2011. – [Online; accessed 21-February-2011]


List of Figures

1 Calculation of Fibonacci numbers written in ARM assembler
2 Calculation of Fibonacci numbers written in Python
3 Use of ARM's barrel shifter for multiplication by two
4 Use of conditional execution to assign a value to a register
5 Two simple RPython functions, and a trace recorded when executing the function strange_sum
6 A very simple bytecode interpreter with registers and an accumulator
7 Example bytecode: Compute the square of the accumulator
8 Simple bytecode interpreter with hints applied
9 Trace when executing the Square function of Figure 7, with the corresponding bytecodes as comments
10 Trace when executing the Square function of Figure 7, with the corresponding opcodes as comments. The constant-folding of operations on green variables is enabled
11 Bit encoding for the data processing instructions with immediate values [ARM10]
12 Encoding of the group of data processing instructions with immediate values as arguments as a Python hashmap
13 Function that creates a function for data processing instructions based on the encoded instructions shown in Figure 12. It takes the encoded description of an assembler instruction and generates a function that, when called, writes the binary representation of the instruction to memory
14 Loading a constant by shifting the bytes into the target register
15 Loading a constant encoded in the program stream
16 Function to load constants into a register
17 Frame layout used by the ARM backend
18 Implementation of the int_sub operation
19 Instructions generated for the int_lt(i0, i1) operation
20 Generated machine code for guard_true(i0, [p0, i0])
21 Encoding of variable locations for guard_true(i0, [p0, i0]). Variable p0 is spilled on the stack and variable i0 is stored in register r1
22 Patched machine code for guard_true(i0, [p0, i0])
23 BeagleBoard-xM development board
24 PyPy benchmarks run on ARM, normalized to CPython
25 PyPy benchmarks run on x86, normalized to CPython
26 Pyrolog benchmarks run on ARM, normalized to SWI-Prolog
27 Pyrolog iterate benchmarks run on ARM, normalized to SWI-Prolog
28 Pyrolog benchmarks run on x86, normalized to SWI-Prolog

List of Tables

1 Absolute runtimes for standard PyPy benchmarks run on ARM
2 Relative runtimes for standard PyPy benchmarks run on ARM normalized to CPython
3 Relative results of the PyPy benchmarks using the Boehm GC on ARM and x86, normalized to CPython for the corresponding platform
4 Absolute runtimes for Pyrolog iterate benchmarks run on ARM
5 Absolute runtimes for Pyrolog benchmarks run on ARM
6 Absolute runtimes for standard PyPy benchmarks run on x86
7 Relative runtimes for standard PyPy benchmarks run on x86 normalized to CPython
8 Absolute runtimes for Pyrolog benchmarks run on x86
