
Trace-based Just-in-Time Type Specialization for Dynamic Languages


Andreas Gal∗+, Brendan Eich∗, Mike Shaver∗, David Anderson∗, David Mandelin∗,
Mohammad R. Haghighat$, Blake Kaplan∗, Graydon Hoare∗, Boris Zbarsky∗, Jason Orendorff∗,
Jesse Ruderman∗, Edwin Smith#, Rick Reitmaier#, Michael Bebenita+, Mason Chang+#, Michael Franz+

Mozilla Corporation∗
{gal,brendan,shaver,danderson,dmandelin,mrbkap,graydon,bz,jorendorff,jruderman}@mozilla.com

Adobe Corporation#
{edwsmith,rreitmai}@adobe.com

Intel Corporation$
{mohammad.r.haghighat}@intel.com

University of California, Irvine+
{mbebenit,changm,franz}@uci.edu

Abstract

Dynamic languages such as JavaScript are more difficult to compile than statically typed ones. Since no concrete type information is available, traditional compilers need to emit generic code that can handle all possible type combinations at runtime. We present an alternative compilation technique for dynamically-typed languages that identifies frequently executed loop traces at run-time and then generates machine code on the fly that is specialized for the actual dynamic types occurring on each path through the loop. Our method provides cheap inter-procedural type specialization, and an elegant and efficient way of incrementally compiling lazily discovered alternative paths through nested loops. We have implemented a dynamic compiler for JavaScript based on our technique and we have measured speedups of 10x and more for certain benchmark programs.

Categories and Subject Descriptors D.3.4 [Programming Languages]: Processors — Incremental compilers, code generation.

General Terms Design, Experimentation, Measurement, Performance.

Keywords JavaScript, just-in-time compilation, trace trees.

1. Introduction

Dynamic languages such as JavaScript, Python, and Ruby are popular since they are expressive, accessible to non-experts, and make deployment as easy as distributing a source file. They are used for small scripts as well as for complex applications. JavaScript, for example, is the de facto standard for client-side web programming and is used for the application logic of browser-based productivity applications such as Google Mail, Google Docs and Zimbra Collaboration Suite. In this domain, in order to provide a fluid user experience and enable a new generation of applications, virtual machines must provide a low startup time and high performance.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
PLDI'09, June 15–20, 2009, Dublin, Ireland.
Copyright © 2009 ACM 978-1-60558-392-1/09/06...$5.00

Compilers for statically typed languages rely on type information to generate efficient machine code. In a dynamically typed programming language such as JavaScript, the types of expressions may vary at runtime. This means that the compiler can no longer easily transform operations into machine instructions that operate on one specific type. Without exact type information, the compiler must emit slower generalized machine code that can deal with all potential type combinations. While compile-time static type inference might be able to gather type information to generate optimized machine code, traditional static analysis is very expensive and hence not well suited for the highly interactive environment of a web browser.

We present a trace-based compilation technique for dynamic languages that reconciles speed of compilation with excellent performance of the generated machine code. Our system uses a mixed-mode execution approach: the system starts running JavaScript in a fast-starting bytecode interpreter. As the program runs, the system identifies hot (frequently executed) bytecode sequences, records them, and compiles them to fast native code. We call such a sequence of instructions a trace.

Unlike method-based dynamic compilers, our dynamic compiler operates at the granularity of individual loops. This design choice is based on the expectation that programs spend most of their time in hot loops. Even in dynamically typed languages, we expect hot loops to be mostly type-stable, meaning that the types of values are invariant. (12) For example, we would expect loop counters that start as integers to remain integers for all iterations. When both of these expectations hold, a trace-based compiler can cover the program execution with a small number of type-specialized, efficiently compiled traces.
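
To make the notion concrete, here is a minimal illustration (ours, not the paper's). The first loop is type-stable, while the second changes a variable's type and so cannot be covered by a single typed trace:

// Type-stable: i and sum hold integer values on every iteration, so
// one type-specialized trace can cover the whole loop.
var sum = 0;
for (var i = 0; i < 1000; ++i)
    sum += i;

// Type-unstable: x starts as an integer but becomes a double on the
// first iteration, so a differently-typed trace must also be recorded.
var x = 1;
for (var i = 0; i < 1000; ++i)
    x = x + 0.5;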

Each compiled trace covers one path through the program with one mapping of values to types. When the VM executes a compiled trace, it cannot guarantee that the same path will be followed or that the same types will occur in subsequent loop iterations. Hence, recording and compiling a trace speculates that the path and typing will be exactly as they were during recording for subsequent iterations of the loop.

Every compiled trace contains all the guards (checks) required to validate the speculation. If one of the guards fails (if control flow is different, or a value of a different type is generated), the trace exits. If an exit becomes hot, the VM can record a branch trace starting at the exit to cover the new path. In this way, the VM records a trace tree covering all the hot paths through the loop.
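
For example (an illustration of ours, not from the paper), a biased branch inside a hot loop first yields a trace for the common arm; once the rare arm's side exit becomes hot, a branch trace is attached there, and together the traces form a small trace tree. Both helper functions below are hypothetical:

for (var i = 0; i < 1000; ++i) {
    if (i % 100 == 0)
        handleRareCase(i);    // rare arm: covered later by a branch trace
    else
        handleCommonCase(i);  // common arm: covered by the root trace
}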

Nested loops can be difficult to optimize for tracing VMs. In a naïve implementation, inner loops would become hot first, and the VM would start tracing there. When the inner loop exits, the VM would detect that a different branch was taken. The VM would try to record a branch trace, and find that the trace reaches not the inner loop header, but the outer loop header. At this point, the VM could continue tracing until it reaches the inner loop header again, thus tracing the outer loop inside a trace tree for the inner loop. But this requires tracing a copy of the outer loop for every side exit and type combination in the inner loop. In essence, this is a form of unintended tail duplication, which can easily overflow the code cache. Alternatively, the VM could simply stop tracing, and give up on ever tracing outer loops.

We solve the nested loop problem by recording nested trace trees. Our system traces the inner loop exactly as the naïve version. The system stops extending the inner tree when it reaches an outer loop, but then it starts a new trace at the outer loop header. When the outer loop reaches the inner loop header, the system tries to call the trace tree for the inner loop. If the call succeeds, the VM records the call to the inner tree as part of the outer trace and finishes the outer trace as normal. In this way, our system can trace any number of loops nested to any depth without causing excessive tail duplication.

These techniques allow a VM to dynamically translate a program to nested, type-specialized trace trees. Because traces can cross function call boundaries, our techniques also achieve the effects of inlining. Because traces have no internal control-flow joins, they can be optimized in linear time by a simple compiler (10). Thus, our tracing VM efficiently performs the same kind of optimizations that would require interprocedural analysis in a static optimization setting. This makes tracing an attractive and effective tool to type specialize even complex function call-rich code.

We implemented these techniques for an existing JavaScript interpreter, SpiderMonkey. We call the resulting tracing VM TraceMonkey. TraceMonkey supports all the JavaScript features of SpiderMonkey, with a 2x-20x speedup for traceable programs.

This paper makes the following contributions:

• We explain an algorithm for dynamically forming trace trees to cover a program, representing nested loops as nested trace trees.

• We explain how to speculatively generate efficient type-specialized code for traces from dynamic language programs.

• We validate our tracing techniques in an implementation based on the SpiderMonkey JavaScript interpreter, achieving 2x-20x speedups on many programs.

The remainder of this paper is organized as follows. Section 3 is a general overview of the trace tree based compilation we use to capture and compile frequently executed code regions. In Section 4 we describe our approach of covering nested loops using a number of individual trace trees. In Section 5 we describe the trace-compilation based speculative type specialization approach we use to generate efficient machine code from recorded bytecode traces. Our implementation of a dynamic type-specializing compiler for JavaScript is described in Section 6. In Section 7 we evaluate our dynamic compiler based on a set of industry benchmarks, and related work is discussed in Section 8. The paper ends with conclusions in Section 9 and an outlook on future work in Section 10.

1 for (var i = 2; i < 100; ++i) {
2     if (!primes[i])
3         continue;
4     for (var k = i + i; k < 100; k += i)
5         primes[k] = false;
6 }

Figure 1. Sample program: sieve of Eratosthenes. primes is initialized to an array of 100 true values on entry to this code snippet.

[Figure 2: state-machine diagram. Activities: Interpret Bytecodes; Monitor; Record LIR Trace; Compile LIR Trace; Enter Compiled Trace; Execute Compiled Trace; Leave Compiled Trace. Transitions: loop edge; hot loop/exit; cold/blacklisted loop/exit; abort recording; finish at loop header; compiled trace ready; loop edge with same types; side exit to existing trace; side exit, no existing trace.]

Figure 2. State machine describing the major activities of TraceMonkey and the conditions that cause transitions to a new activity. In the dark box, TM executes JS as compiled traces. In the light gray boxes, TM executes JS in the standard interpreter. White boxes are overhead. Thus, to maximize performance, we need to maximize time spent in the darkest box and minimize time spent in the white boxes. The best case is a loop where the types at the loop edge are the same as the types on entry; then TM can stay in native code until the loop is done.


2. Overview: Example Tracing Run

This section provides an overview of our system by describing how TraceMonkey executes an example program. The example program, shown in Figure 1, computes the first 100 prime numbers with nested loops. The narrative should be read along with Figure 2, which describes the activities TraceMonkey performs and when it transitions between them.

TraceMonkey always begins executing a program in the bytecode interpreter. Every loop back edge is a potential trace point. When the interpreter crosses a loop edge, TraceMonkey invokes the trace monitor, which may decide to record or execute a native trace. At the start of execution, there are no compiled traces yet, so the trace monitor counts the number of times each loop back edge is executed until a loop becomes hot, currently after 2 crossings. Note that the way our loops are compiled, the loop edge is crossed before entering the loop, so the second crossing occurs immediately after the first iteration.

Here is the sequence of events broken down by outer loop iteration:


v0 := ld state[748]                  // load primes from the trace activation record
st sp[0], v0                         // store primes to interpreter stack
v1 := ld state[764]                  // load k from the trace activation record
v2 := i2f(v1)                        // convert k from int to double
st sp[8], v1                         // store k to interpreter stack
st sp[16], 0                         // store false to interpreter stack
v3 := ld v0[4]                       // load class word for primes
v4 := and v3, -4                     // mask out object class tag for primes
v5 := eq v4, Array                   // test whether primes is an array
xf v5                                // side exit if v5 is false
v6 := js_Array_set(v0, v2, false)    // call function to set array element
v7 := eq v6, 0                       // test return value from call
xt v7                                // side exit if js_Array_set returns false

Figure 3. LIR snippet for sample program. This is the LIR recorded for line 5 of the sample program in Figure 1. The LIR encodes the semantics in SSA form using temporary variables. The LIR also encodes all the stores that the interpreter would do to its data stack. Sometimes these stores can be optimized away as the stack locations are live only on exits to the interpreter. Finally, the LIR records guards and side exits to verify the assumptions made in this recording: that primes is an array and that the call to set its element succeeds.

mov edx, ebx(748)      // load primes from the trace activation record
mov edi(0), edx        // (*) store primes to interpreter stack
mov esi, ebx(764)      // load k from the trace activation record
mov edi(8), esi        // (*) store k to interpreter stack
mov edi(16), 0         // (*) store false to interpreter stack
mov eax, edx(4)        // (*) load object class word for primes
and eax, -4            // (*) mask out object class tag for primes
cmp eax, Array         // (*) test whether primes is an array
jne side_exit_1        // (*) side exit if primes is not an array
sub esp, 8             // bump stack for call alignment convention
push false             // push last argument for call
push esi               // push first argument for call
call js_Array_set      // call function to set array element
add esp, 8             // clean up extra stack space
mov ecx, ebx           // (*) created by register allocator
test eax, eax          // (*) test return value of js_Array_set
je side_exit_2         // (*) side exit if call failed
...
side_exit_1:
mov ecx, ebp(-4)       // restore ecx
mov esp, ebp           // restore esp
jmp epilog             // jump to ret statement

Figure 4. x86 snippet for sample program. This is the x86 code compiled from the LIR snippet in Figure 3. Most LIR instructions compile to a single x86 instruction. Instructions marked with (*) would be omitted by an idealized compiler that knew that none of the side exits would ever be taken. The 17 instructions generated by the compiler compare favorably with the 100+ instructions that the interpreter would execute for the same code snippet, including 4 indirect jumps.

i=2. This is the first iteration of the outer loop. The loop on lines 4-5 becomes hot on its second iteration, so TraceMonkey enters recording mode on line 4. In recording mode, TraceMonkey records the code along the trace in a low-level compiler intermediate representation we call LIR. The LIR trace encodes all the operations performed and the types of all operands. The LIR trace also encodes guards, which are checks that verify that the control flow and types are identical to those observed during trace recording. Thus, on later executions, if and only if all guards are passed, the trace has the required program semantics.

TraceMonkey stops recording when execution returns to the loop header or exits the loop. In this case, execution returns to the loop header on line 4.

After recording is finished, TraceMonkey compiles the trace to native code using the recorded type information for optimization. The result is a native code fragment that can be entered if the interpreter PC and the types of values match those observed when trace recording was started. The first trace in our example, T45, covers lines 4 and 5. This trace can be entered if the PC is at line 4, i and k are integers, and primes is an object. After compiling T45, TraceMonkey returns to the interpreter and loops back to line 1.

i=3. Now the loop header at line 1 has become hot, so TraceMonkey starts recording. When recording reaches line 4, TraceMonkey observes that it has reached an inner loop header that already has a compiled trace, so TraceMonkey attempts to nest the inner loop inside the current trace. The first step is to call the inner trace as a subroutine. This executes the loop on line 4 to completion and then returns to the recorder. TraceMonkey verifies that the call was successful and then records the call to the inner trace as part of the current trace. Recording continues until execution reaches line 1, at which point TraceMonkey finishes and compiles a trace for the outer loop, T16.


i=4. On this iteration, TraceMonkey calls T16. Because i=4, the if statement on line 2 is taken. This branch was not taken in the original trace, so this causes T16 to fail a guard and take a side exit. The exit is not yet hot, so TraceMonkey returns to the interpreter, which executes the continue statement.

i=5. TraceMonkey calls T16, which in turn calls the nested trace T45. T16 loops back to its own header, starting the next iteration without ever returning to the monitor.

i=6. On this iteration, the side exit on line 2 is taken again. This time, the side exit becomes hot, so a trace T23,1 is recorded that covers line 3 and returns to the loop header. Thus, the end of T23,1 jumps directly to the start of T16. The side exit is patched so that on future iterations, it jumps directly to T23,1.

At this point, TraceMonkey has compiled enough traces to cover the entire nested loop structure, so the rest of the program runs entirely as native code.

3. Trace Trees

In this section, we describe traces, trace trees, and how they are formed at run time. Although our techniques apply to any dynamic language interpreter, we will describe them assuming a bytecode interpreter to keep the exposition simple.

3.1 Traces

A trace is simply a program path, which may cross function call boundaries. TraceMonkey focuses on loop traces, which originate at a loop edge and represent a single iteration through the associated loop.

Similar to an extended basic block, a trace is only entered at the top, but may have many exits. In contrast to an extended basic block, a trace can contain join nodes. Since a trace only ever follows one single path through the original program, however, join nodes are not recognizable as such in a trace and have a single predecessor node like regular nodes.

A typed trace is a trace annotated with a type for every variable (including temporaries) on the trace. A typed trace also has an entry type map giving the required types for variables used on the trace before they are defined. For example, a trace could have a type map (x: int, b: boolean), meaning that the trace may be entered only if the value of the variable x is of type int and the value of b is of type boolean. The entry type map is much like the signature of a function.

In this paper, we only discuss typed loop traces, and we will refer to them simply as "traces". The key property of typed loop traces is that they can be compiled to efficient machine code using the same techniques used for typed languages.

In TraceMonkey, traces are recorded in trace-flavored SSA LIR (low-level intermediate representation). In trace-flavored SSA (or TSSA), phi nodes appear only at the entry point, which is reached both on entry and via loop edges. The important LIR primitives are constant values, memory loads and stores (by address and offset), integer operators, floating-point operators, function calls, and conditional exits. Type conversions, such as integer to double, are represented by function calls. This makes the LIR used by TraceMonkey independent of the concrete type system and type conversion rules of the source language. The LIR operations are generic enough that the backend compiler is language independent. Figure 3 shows an example LIR trace.

Bytecode interpreters typically represent values in various complex data structures (e.g., hash tables) in a boxed format (i.e., with attached type tag bits). Since a trace is intended to represent efficient code that eliminates all that complexity, our traces operate on unboxed values in simple variables and arrays as much as possible.

A trace records all its intermediate values in a small activation record area. To make variable accesses fast on trace, the trace also imports local and global variables by unboxing them and copying them to its activation record. Thus, the trace can read and write these variables with simple loads and stores from a native activation record, independently of the boxing mechanism used by the interpreter. When the trace exits, the VM boxes the values from this native storage location and copies them back to the interpreter structures.
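
The following sketch (ours; all names are hypothetical, and the real implementation is native code, not JavaScript) illustrates this import/export discipline, assuming boxed interpreter values of the form {tag, payload} and a flat activation record:

// Attempt to enter a compiled trace from the interpreter.
function enterTrace(trace, interpSlots) {
    var ar = [];                                // native activation record
    for (var i = 0; i < trace.entryTypeMap.length; ++i) {
        var box = interpSlots[i];
        if (box.tag != trace.entryTypeMap[i])
            return false;                       // entry type map mismatch: stay in the interpreter
        ar[i] = box.payload;                    // unbox: copy the raw value into the record
    }
    var exit = trace.run(ar);                   // run the native code; returns the exit taken
    for (var i = 0; i < ar.length; ++i)         // box values back into interpreter structures
        interpSlots[i] = { tag: exit.typeMap[i], payload: ar[i] };
    return true;
}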

For every control-flow branch in the source program, the recorder generates conditional exit LIR instructions. These instructions exit from the trace if the actual control flow is different from what it was at trace recording, ensuring that the trace instructions are run only if they are supposed to. We call these instructions guard instructions.

Most of our traces represent loops and end with the special loop LIR instruction. This is just an unconditional branch to the top of the trace. Such traces can be left only via guards.

Now, we describe the key optimizations that are performed as part of recording LIR. All of these optimizations reduce complex dynamic language constructs to simple typed constructs by specializing for the current trace. Each optimization requires guard instructions to verify its assumptions about the state and exit the trace if necessary.

Type specialization. All LIR primitives apply to operands of specific types. Thus, LIR traces are necessarily type-specialized, and a compiler can easily produce a translation that requires no type dispatches. A typical bytecode interpreter carries tag bits along with each value, and to perform any operation, must check the tag bits, dynamically dispatch, mask out the tag bits to recover the untagged value, perform the operation, and then reapply tags. LIR omits everything except the operation itself.
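
As an illustration (ours, not the paper's; the tag constants and helpers are hypothetical), here is roughly what a boxed interpreter must do for a single addition, all of which a type-specialized trace replaces with one machine add plus an occasional guard:

function interpAdd(a, b) {
    if (a.tag == INT && b.tag == INT) {         // check the tag bits
        var r = a.payload + b.payload;          // operate on the untagged values
        return fitsInInt(r)
            ? { tag: INT, payload: r }          // reapply tags to the result
            : { tag: DOUBLE, payload: r };
    }
    return genericAdd(a, b);                    // dynamic dispatch for other tag combinations
}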

A potential problem is that some operations can produce values of unpredictable types. For example, reading a property from an object could yield a value of any type, not necessarily the type observed during recording. The recorder emits guard instructions that conditionally exit if the operation yields a value of a different type from that seen during recording. These guard instructions guarantee that as long as execution is on trace, the types of values match those of the typed trace. When the VM observes a side exit along such a type guard, a new typed trace is recorded originating at the side exit location, capturing the new type of the operation in question.

Representation specialization: objects. In JavaScript, name lookup semantics are complex and potentially expensive because they include features like object inheritance and eval. To evaluate an object property read expression like o.x, the interpreter must search the property map of o and all of its prototypes and parents. Property maps can be implemented with different data structures (e.g., per-object hash tables or shared hash tables), so the search process also must dispatch on the representation of each object found during search. TraceMonkey can simply observe the result of the search process and record the simplest possible LIR to access the property value. For example, the search might find the value of o.x in the prototype of o, which uses a shared hash-table representation that places x in slot 2 of a property vector. Then the recorder can generate LIR that reads o.x with just two or three loads: one to get the prototype, possibly one to get the property value vector, and one more to get slot 2 from the vector. This is a vast simplification and speedup compared to the original interpreter code. Inheritance relationships and object representations can change during execution, so the simplified code requires guard instructions that ensure the object representation is the same. In TraceMonkey, objects' representations are assigned an integer key called the object shape. Thus, the guard is a simple equality check on the object shape.
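
Written out (our sketch; guard, shapeOf, protoOf, and slotsOf are hypothetical stand-ins for the recorded LIR, and the shape value 42 is made up), the specialized read of o.x under the scenario above looks roughly like:

guard(shapeOf(o) == 42);      // side exit if o's representation has changed
var proto = protoOf(o);       // the prototype where x was found during recording
var vec = slotsOf(proto);     // the shared property value vector
var x = vec[2];               // x lives in slot 2 under this shape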

Representation specialization: numbers. JavaScript has no integer type, only a Number type that is the set of 64-bit IEEE-754 floating-point numbers ("doubles"). But many JavaScript operators, in particular array accesses and bitwise operators, really operate on integers, so they first convert the number to an integer, and then convert any integer result back to a double.¹ Clearly, a JavaScript VM that wants to be fast must find a way to operate on integers directly and avoid these conversions.

In TraceMonkey, we support two representations for numbers: integers and doubles. The interpreter uses integer representations as much as it can, switching for results that can only be represented as doubles. When a trace is started, some values may be imported and represented as integers. Some operations on integers require guards. For example, adding two integers can produce a value too large for the integer representation.
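
On trace this becomes (our illustration; guard and fitsInInt are hypothetical stand-ins for the recorded LIR):

var r = a + b;            // specialized integer add
guard(fitsInInt(r));      // side exit on overflow; the result must be widened to a double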

Function inlining. LIR traces can cross function boundaries in either direction, achieving function inlining. Move instructions need to be recorded for function entry and exit to copy arguments in and return values out. These move statements are then optimized away by the compiler using copy propagation. In order to be able to return to the interpreter, the trace must also generate LIR to record that a call frame has been entered and exited. The frame entry and exit LIR saves just enough information to allow the interpreter call stack to be restored later and is much simpler than the interpreter's standard call code. If the function being entered is not constant (which in JavaScript includes any call by function name), the recorder must also emit LIR to guard that the function is the same.
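
For example (ours), in the loop below the body of square is recorded directly into the loop trace; since square is called by name and is thus not constant, the recorder also emits a guard that the callee is still the same function:

function square(x) { return x * x; }

var total = 0;
for (var i = 0; i < 100; ++i)
    total += square(i);    // inlined into the trace, behind a guard on the callee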

Guards and side exits. Each optimization described above requires one or more guards to verify the assumptions made in doing the optimization. A guard is just a group of LIR instructions that performs a test and conditional exit. The exit branches to a side exit, a small off-trace piece of LIR that returns a pointer to a structure that describes the reason for the exit, along with the interpreter PC at the exit point and any other data needed to restore the interpreter's state structures.

Aborts. Some constructs are difficult to record in LIR traces. For example, eval or calls to external functions can change the program state in unpredictable ways, making it difficult for the tracer to know the current type map in order to continue tracing. A tracing implementation can also have any number of other limitations, e.g., a small-memory device may limit the length of traces. When any situation occurs that prevents the implementation from continuing trace recording, the implementation aborts trace recording and returns to the trace monitor.

3.2 Trace Trees

Especially simple loops, namely those where control flow, value types, value representations, and inlined functions are all invariant, can be represented by a single trace. But most loops have at least some variation, and so the program will take side exits from the main trace. When a side exit becomes hot, TraceMonkey starts a new branch trace from that point and patches the side exit to jump directly to that trace. In this way, a single trace expands on demand to a single-entry, multiple-exit trace tree.

This section explains how trace trees are formed during execution. The goal is to form trace trees during execution that cover all the hot paths of the program.

¹ Arrays are actually worse than this: if the index value is a number, it must be converted from a double to a string for the property access operator, and then to an integer internally to the array implementation.

Starting a tree. Trace trees always start at loop headers, because they are a natural place to look for hot paths. In TraceMonkey, loop headers are easy to detect: the bytecode compiler ensures that a bytecode is a loop header iff it is the target of a backward branch. TraceMonkey starts a tree when a given loop header has been executed a certain number of times (2 in the current implementation). Starting a tree just means starting to record a trace for the current point and type map and marking the trace as the root of a tree. Each tree is associated with a loop header and type map, so there may be several trees for a given loop header.
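
A minimal sketch of this hotness check (ours; the helper names are hypothetical, the threshold is the paper's), keyed on the loop header's program counter:

var HOT_LOOP = 2;             // threshold used by the current implementation
var counters = {};

function onLoopEdge(pc) {
    counters[pc] = (counters[pc] || 0) + 1;
    if (counters[pc] >= HOT_LOOP)
        startRecording(pc, currentTypeMap());   // root a new tree here
}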

Closing the loop. Trace recording can end in several ways. Ideally, the trace reaches the loop header where it started with the same type map as on entry. This is called a type-stable loop iteration. In this case, the end of the trace can jump right to the beginning, as all the value representations are exactly as needed to enter the trace. The jump can even skip the usual code that would copy out the state at the end of the trace and copy it back in to the trace activation record to enter a trace.

In certain cases the trace might reach the loop header with a different type map. This scenario is sometimes observed for the first iteration of a loop. Some variables inside the loop might initially be undefined, before they are set to a concrete type during the first loop iteration. When recording such an iteration, the recorder cannot link the trace back to its own loop header since it is type-unstable. Instead, the iteration is terminated with a side exit that will always fail and return to the interpreter. At the same time a new trace is recorded with the new type map. Every time an additional type-unstable trace is added to a region, its exit type map is compared to the entry map of all existing traces in case they complement each other. With this approach we are able to cover type-unstable loop iterations as long as they eventually form a stable equilibrium.
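
A typical instance of this pattern (our example) is a variable that is undefined on loop entry but an integer on every later iteration; one trace covers the undefined-to-int first iteration, and a second, type-stable trace covers the rest:

var t;                        // undefined on entry to the loop
for (var i = 0; i < 100; ++i)
    t = (t || 0) + i;         // from the second iteration on, t is always an int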

Finally, the trace might exit the loop before reaching the loop header, for example because execution reaches a break or return statement. In this case, the VM simply ends the trace with an exit to the trace monitor.

As mentioned previously, we may speculatively choose to represent certain Number-typed values as integers on trace. We do so when we observe that Number-typed variables contain an integer value at trace entry. If during trace recording the variable is unexpectedly assigned a non-integer value, we have to widen the type of the variable to a double. As a result, the recorded trace becomes inherently type-unstable since it starts with an integer value but ends with a double value. This represents a mis-speculation, since at trace entry we specialized the Number-typed value to an integer, assuming that at the loop edge we would again find an integer value in the variable, allowing us to close the loop. To avoid future speculative failures involving this variable, and to obtain a type-stable trace, we note in an advisory data structure which we call the "oracle" that the variable in question has been observed to sometimes hold non-integer values.
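
A minimal sketch of such an oracle (ours; the real implementation's keying and API may differ, and isInteger is a hypothetical helper), with variables keyed by some stable identifier:

var oracle = {};                         // advisory: slots seen holding non-integer values

function markNonInteger(slotKey) {       // called on an integer mis-speculation
    oracle[slotKey] = true;
}

function importType(slotKey, value) {    // consulted when importing a Number at trace entry
    if (isInteger(value) && !oracle[slotKey])
        return "int";                    // safe to speculate towards an integer
    return "double";                     // adverse information known: import as a double
}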

When compiling loops, we consult the oracle before specializing values to integers. Speculation towards integers is performed only if no adverse information is known to the oracle about that particular variable. Whenever we accidentally compile a loop that is type-unstable due to mis-speculation of a Number-typed variable, we immediately trigger the recording of a new trace, which based on the now updated oracle information will start with a double value and thus become type-stable.

Extending a tree. Side exits lead to different paths through the loop, or paths with different types or representations. Thus, to completely cover the loop, the VM must record traces starting at all side exits. These traces are recorded much like root traces: there is a counter for each side exit, and when the counter reaches a hotness threshold, recording starts. Recording stops exactly as for the root trace, using the loop header of the root trace as the target to reach.


Our implementation does not extend at all side exits. It extends only if the side exit is for a control-flow branch, and only if the side exit does not leave the loop. In particular we do not want to extend a trace tree along a path that leads to an outer loop, because we want to cover such paths in an outer tree through tree nesting.

3.3 Blacklisting

Sometimes, a program follows a path that cannot be compiled into a trace, usually because of limitations in the implementation. For example, TraceMonkey does not currently support recording throwing and catching of arbitrary exceptions. This design trade-off was chosen because exceptions are usually rare in JavaScript. However, if a program uses exceptions intensively, we would incur a punishing runtime overhead if we repeatedly tried to record a trace for this path and repeatedly failed to do so, since we abort tracing every time we observe an exception being thrown.

As a result, if a hot loop contains traces that always fail, the VM could potentially run much more slowly than the base interpreter: the VM repeatedly spends time trying to record traces, but is never able to run any. To avoid this problem, whenever the VM is about to start tracing, it must try to predict whether it will finish the trace.

Our prediction algorithm is based on blacklisting traces that have been tried and failed. When the VM fails to finish a trace starting at a given point, the VM records that a failure has occurred. The VM also sets a counter so that it will not try to record a trace starting at that point until it is passed a few more times (32 in our implementation). This backoff counter gives temporary conditions that prevent tracing a chance to end. For example, a loop may behave differently during startup than during its steady-state execution. After a given number of failures (2 in our implementation), the VM marks the fragment as blacklisted, which means the VM will never again start recording at that point.
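
The backoff logic amounts to something like the following sketch (ours; the field and function names are hypothetical, the constants are the paper's):

function onRecordingAbort(header) {
    header.failures = (header.failures || 0) + 1;
    if (header.failures >= 2)
        header.blacklisted = true;    // never attempt to record here again
    else
        header.cooldown = 32;         // skip the next 32 crossings before retrying
}

function shouldRecord(header) {
    if (header.blacklisted) return false;
    if (header.cooldown > 0) { header.cooldown--; return false; }
    return true;
}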

After implementing this basic strategy, we observed that for small loops that get blacklisted, the system can spend a noticeable amount of time just finding the loop fragment and determining that it has been blacklisted. We now avoid that problem by patching the bytecode. We define an extra no-op bytecode that indicates a loop header. The VM calls into the trace monitor every time the interpreter executes a loop header no-op. To blacklist a fragment, we simply replace the loop header no-op with a regular no-op. Thus, the interpreter will never again even call into the trace monitor.

There is a related problem we have not yet solved, which occurs when a loop meets all of these conditions:

• The VM can form at least one root trace for the loop.

• There is at least one hot side exit for which the VM cannot complete a trace.

• The loop body is short.

In this case, the VM will repeatedly pass the loop header, search for a trace, find it, execute it, and fall back to the interpreter. With a short loop body, the overhead of finding and calling the trace is high, and causes performance to be even slower than the basic interpreter. So far, in this situation we have improved the implementation so that the VM can complete the branch trace. But it is hard to guarantee that this situation will never happen. As future work, this situation could be avoided by detecting and blacklisting loops for which the average trace call executes few bytecodes before returning to the interpreter.

4. Nested Trace Tree Formation

Figure 7 shows basic trace tree compilation (11) applied to a nested loop where the inner loop contains two paths. Usually, the inner loop (with header at i2) becomes hot first, and a trace tree is rooted at that point. For example, the first recorded trace may be a cycle through the inner loop, {i2, i3, i5, α}. The α symbol is used to indicate that the trace loops back to the tree anchor.


Figure 5. A tree with two traces, a trunk trace and one branch trace. The trunk trace contains a guard to which a branch trace was attached. The branch trace contains a guard that may fail and trigger a side exit. Both the trunk and the branch trace loop back to the tree anchor, which is the beginning of the trace tree.


Figure 6. We handle type-unstable loops by allowing traces to compile that cannot loop back to themselves due to a type mismatch. As such traces accumulate, we attempt to connect their loop edges to form groups of trace trees that can execute without having to side-exit to the interpreter to cover odd type cases. This is particularly important for nested trace trees, where an outer tree tries to call an inner tree (or in this case a forest of inner trees), since inner loops frequently have initially undefined values which change type to a concrete value after the first iteration.


When execution leaves the inner loop, the basic design has two choices. First, the system can stop tracing and give up on compiling the outer loop, clearly an undesirable solution. The other choice is to continue tracing, compiling traces for the outer loop inside the inner loop's trace tree.

For example, the program might exit at i5 and record a branch trace that incorporates the outer loop: {i5, i7, i1, i6, i7, i1, α}. Later, the program might take the other branch at i2 and then exit, recording another branch trace incorporating the outer loop: {i2, i4, i5, i7, i1, i6, i7, i1, α}. Thus, the outer loop is recorded and compiled twice, and both copies must be retained in the trace cache.



Figure 7. Control flow graph of a nested loop with an if statement inside the innermost loop (a). An inner tree captures the inner loop, and is nested inside an outer tree which "calls" the inner tree. The inner tree returns to the outer tree once it exits along its loop condition guard (b).

In general, if loops are nested to depth k, and each loop has n paths (on geometric average), this naïve strategy yields O(n^k) traces, which can easily fill the trace cache.

In order to execute programs with nested loops efficiently, a tracing system needs a technique for covering the nested loops with native code without exponential trace duplication.

4.1 Nesting Algorithm

The key insight is that if each loop is represented by its own trace tree, the code for each loop can be contained only in its own tree, and outer loop paths will not be duplicated. Another key fact is that we are not tracing arbitrary bytecodes that might have irreducible control flow graphs, but rather bytecodes produced by a compiler for a language with structured control flow. Thus, given two loop edges, the system can easily determine whether they are nested and which is the inner loop. Using this knowledge, the system can compile inner and outer loops separately, and make the outer loop's traces call the inner loop's trace tree.

The algorithm for building nested trace trees is as follows. We start tracing at loop headers exactly as in the basic tracing system. When we exit a loop (detected by comparing the interpreter PC with the range given by the loop edge), we stop the trace. The key step of the algorithm occurs when we are recording a trace for loop L_R (R for loop being recorded) and we reach the header of a different loop L_O (O for other loop). Note that L_O must be an inner loop of L_R because we stop the trace when we exit a loop.

• If L_O has a type-matching compiled trace tree, we call L_O as a nested trace tree. If the call succeeds, then we record the call in the trace for L_R. On future executions, the trace for L_R will call the inner trace directly.

• If L_O does not have a type-matching compiled trace tree yet, we have to obtain it before we are able to proceed. In order to do this, we simply abort recording the first trace. The trace monitor will see the inner loop header, and will immediately start recording the inner loop.²
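The decision at an inner loop header can be sketched as follows. This is a minimal, self-contained C++ sketch; TraceCache, TypeMap, and the function names are illustrative inventions of ours, not TraceMonkey's actual interfaces.

#include <map>
#include <utility>
#include <vector>

struct TypeMap {
    std::vector<char> types;  // one type code per imported variable
    bool operator<(const TypeMap& o) const { return types < o.types; }
};

struct TraceTree {};  // stands in for a compiled trace tree

struct TraceCache {
    // Trees are indexed by loop-header PC and entry type map.
    std::map<std::pair<const void*, TypeMap>, TraceTree*> trees;
    TraceTree* find(const void* headerPc, const TypeMap& tm) {
        auto it = trees.find({headerPc, tm});
        return it == trees.end() ? nullptr : it->second;
    }
};

enum RecordResult { CONTINUE_RECORDING, ABORT_RECORDING };

// Invoked while recording a trace for L_R when the recorder reaches
// the header of an inner loop L_O.
RecordResult atInnerLoopHeader(TraceCache& cache, const void* innerHeaderPc,
                               const TypeMap& currentTypes) {
    if (cache.find(innerHeaderPc, currentTypes)) {
        // Record a "tree call" to the inner tree in L_R's trace, plus a
        // guard that the call returned through the expected side exit
        // with the expected type map (emission elided in this sketch).
        return CONTINUE_RECORDING;
    }
    // No type-matching inner tree yet: abort the outer recording. The
    // monitor then sees the hot inner loop header and records it first.
    return ABORT_RECORDING;
}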

If all the loops in a nest are type-stable, then loop nesting creates no duplication. Otherwise, if loops are nested to a depth k, and each loop is entered with m different type maps (on geometric average), then we compile O(m^k) copies of the innermost loop. As long as m is close to 1, the resulting trace trees will be tractable.

² Instead of aborting the outer recording, we could in principle merely suspend the recording, but that would require the implementation to be able to record several traces simultaneously, complicating the implementation, while saving only a few iterations in the interpreter.

Figure 8. Control flow graph of a loop with two nested loops (left) and its nested trace tree configuration (right). The outer tree calls the two inner nested trace trees and places guards at their side exit locations.

An important detail is that the call to the inner trace tree must act like a function call site: it must return to the same point every time. The goal of nesting is to make inner and outer loops independent; thus when the inner tree is called, it must exit to the same point in the outer tree every time with the same type map. Because we cannot actually guarantee this property, we must guard on it after the call, and side exit if the property does not hold. A common reason for the inner tree not to return to the same point would be if the inner tree took a new side exit for which it had never compiled a trace. At this point, the interpreter PC is in the inner tree, so we cannot continue recording or executing the outer tree. If this happens during recording, we abort the outer trace, to give the inner tree a chance to finish growing. A future execution of the outer tree would then be able to properly finish and record a call to the inner tree. If an inner tree side exit happens during execution of a compiled trace for the outer tree, we simply exit the outer trace and start recording a new branch in the inner tree.

4.2 Blacklisting with Nesting

The blacklisting algorithm needs modification to work well with nesting. The problem is that outer loop traces often abort during startup (because the inner tree is not available or takes a side exit), which would lead to their being quickly blacklisted by the basic algorithm.

The key observation is that when an outer trace aborts because the inner tree is not ready, this is probably a temporary condition. Thus, we should not count such aborts toward blacklisting as long as we are able to build up more traces for the inner tree.

In our implementation, when an outer tree aborts on the inner tree, we increment the outer tree's blacklist counter as usual and back off on compiling it. When the inner tree finishes a trace, we decrement the blacklist counter on the outer loop, "forgiving" the outer loop for aborting previously. We also undo the backoff so that the outer tree can start immediately trying to compile the next time we reach it.
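A sketch of this bookkeeping follows; the field names and the threshold are illustrative, not TraceMonkey's actual values.

// Per-loop blacklisting state with the nesting-aware adjustments.
struct LoopInfo {
    int blacklistCounter = 0;  // aborts counted toward blacklisting
    int backoff = 0;           // how long to wait before trying again
};

const int BLACKLIST_THRESHOLD = 32;  // illustrative value

// An outer trace aborted because the inner tree was missing or exited.
void onOuterAbortOnInnerTree(LoopInfo& outer) {
    ++outer.blacklistCounter;                 // count the abort as usual
    outer.backoff = 2 * (outer.backoff + 1);  // and back off on compiling
}

// The inner tree finished a trace: "forgive" the outer loop.
void onInnerTreeFinished(LoopInfo& outer) {
    if (outer.blacklistCounter > 0)
        --outer.blacklistCounter;
    outer.backoff = 0;  // outer tree may retry compilation immediately
}

bool isBlacklisted(const LoopInfo& loop) {
    return loop.blacklistCounter >= BLACKLIST_THRESHOLD;
}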

5. Trace Tree Optimization

This section explains how a recorded trace is translated to an optimized machine code trace. The trace compilation subsystem, NANOJIT, is separate from the VM and can be used for other applications.


5.1 Optimizations

Because traces are in SSA form and have no join points or φ-nodes, certain optimizations are easy to implement. In order to get good startup performance, the optimizations must run quickly, so we chose a small set of optimizations. We implemented the optimizations as pipelined filters so that they can be turned on and off independently, and yet all run in just two loop passes over the trace: one forward and one backward.

Every time the trace recorder emits a LIR instruction, the instruction is immediately passed to the first filter in the forward pipeline. Thus, forward filter optimizations are performed as the trace is recorded. Each filter may pass each instruction to the next filter unchanged, write a different instruction to the next filter, or write no instruction at all. For example, the constant folding filter can replace a multiply instruction like v13 := mul 3, 1000 with a constant instruction v13 := 3000.
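The filter interface can be sketched as follows. This is minimal C++; LIns and the opcode encoding are simplified stand-ins, not nanojit's actual LIR.

#include <cstdint>

// A drastically simplified LIR instruction.
enum Opcode { IMM, MUL /* ... */ };
struct LIns { Opcode op; int64_t imm; LIns* a; LIns* b; };

// Each filter forwards each instruction to the next filter:
// unchanged, rewritten, or not at all.
struct Filter {
    Filter* next;
    explicit Filter(Filter* n) : next(n) {}
    virtual ~Filter() {}
    virtual LIns* ins(LIns* i) { return next ? next->ins(i) : i; }
};

// Constant folding as a forward filter: mul of two constants becomes a
// constant, e.g. v13 := mul 3, 1000 becomes v13 := 3000.
struct ConstFoldFilter : Filter {
    explicit ConstFoldFilter(Filter* n) : Filter(n) {}
    LIns* ins(LIns* i) override {
        if (i->op == MUL && isConst(i->a) && isConst(i->b))
            return Filter::ins(new LIns{IMM, i->a->imm * i->b->imm,
                                        nullptr, nullptr});
        return Filter::ins(i);  // pass through unchanged
    }
    static bool isConst(const LIns* x) { return x && x->op == IMM; }
};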

We currently apply four forward filters:

• On ISAs without floating-point instructions, a soft-float filter converts floating-point LIR instructions to sequences of integer instructions.

• CSE (common subexpression elimination),

• expression simplification, including constant folding and a few algebraic identities (e.g., a − a = 0), and

• source language semantic-specific expression simplification, primarily algebraic identities that allow DOUBLE to be replaced with INT. For example, LIR that converts an INT to a DOUBLE and then back again would be removed by this filter.

When trace recording is completed, nanojit runs the backward optimization filters. These are used for optimizations that require backward program analysis. When running the backward filters, nanojit reads one LIR instruction at a time, and the reads are passed through the pipeline.

We currently apply three backward filters:

• Dead data-stack store elimination. The LIR trace encodes many stores to locations in the interpreter stack. But these values are never read back before exiting the trace (by the interpreter or another trace). Thus, stores to the stack that are overwritten before the next exit are dead. Stores to locations that are off the top of the interpreter stack at future exits are also dead.

• Dead call-stack store elimination. This is the same optimization as above, except applied to the interpreter's call stack used for function call inlining.

• Dead code elimination. This eliminates any operation whose result is never used.

After a LIR instruction is successfully read ("pulled") from the backward filter pipeline, nanojit's code generator emits native machine instruction(s) for it.

5.2 Register Allocation

We use a simple greedy register allocator that makes a single backward pass over the trace (it is integrated with the code generator). By the time the allocator has reached an instruction like v3 := add v1, v2, it has already assigned a register to v3. If v1 and v2 have not yet been assigned registers, the allocator assigns a free register to each. If there are no free registers, a value is selected for spilling. We use a classic heuristic that selects the "oldest" register-carried value (6).

The heuristic considers as spill candidates the set R of values v that are in registers immediately after the current instruction. For each such v, let v_m be the last instruction before the current one at which v is referred to. The heuristic selects the v with minimum v_m. The motivation is that this frees up a register for as long as possible given a single spill.

If we need to spill a value v_s at this point, we generate the restore code just after the code for the current instruction. The corresponding spill code is generated just after the last point where v_s was used. The register that was assigned to v_s is marked free for the preceding code, because that register can now be used freely without affecting the following code.

Tag  JS Type                      Description
xx1  number                       31-bit integer representation
000  object                       pointer to JSObject handle
010  number                       pointer to double handle
100  string                       pointer to JSString handle
110  boolean, null, or undefined  enumeration for null, undefined, true, false

Figure 9. Tagged values in the SpiderMonkey JS interpreter. Testing tags, unboxing (extracting the untagged value) and boxing (creating tagged values) are significant costs. Avoiding these costs is a key benefit of tracing.
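The spill-candidate choice above (select the register-carried value whose last prior use is earliest) reduces to a minimum search; the types and names in this sketch are illustrative.

#include <vector>

struct Value {
    int lastUse;  // index of the last instruction before the current
                  // one at which this value is referred to
    int reg;      // register currently holding the value
};

// Among values live in registers immediately after the current
// instruction, pick the one whose last prior use is earliest; spilling
// it frees its register for the longest stretch of preceding code.
Value* pickSpillCandidate(const std::vector<Value*>& inRegisters) {
    Value* best = nullptr;
    for (Value* v : inRegisters)
        if (!best || v->lastUse < best->lastUse)
            best = v;
    return best;
}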

6. Implementation

To demonstrate the effectiveness of our approach, we have implemented a trace-based dynamic compiler for the SpiderMonkey JavaScript Virtual Machine (4). SpiderMonkey is the JavaScript VM embedded in Mozilla's Firefox open-source web browser (2), which is used by more than 200 million users world-wide. The core of SpiderMonkey is a bytecode interpreter implemented in C++.

In SpiderMonkey, all JavaScript values are represented by the type jsval. A jsval is a machine word in which up to 3 of the least significant bits are a type tag, and the remaining bits are data. See Figure 9 for details. All pointers contained in jsvals point to GC-controlled blocks aligned on 8-byte boundaries.
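The tag layout of Figure 9 implies tag tests and (un)boxing of roughly the following shape. This sketch assumes a 32-bit machine word; the names are ours, not SpiderMonkey's.

#include <cstdint>

using jsval = uintptr_t;

const jsval TAG_MASK   = 0x7;  // up to 3 low bits of type tag
const jsval INT_FLAG   = 0x1;  // xx1: 31-bit integer
const jsval TAG_OBJECT = 0x0;  // 000: pointer to JSObject
const jsval TAG_DOUBLE = 0x2;  // 010: pointer to double
const jsval TAG_STRING = 0x4;  // 100: pointer to JSString

bool isInt(jsval v)    { return (v & INT_FLAG) != 0; }
bool isObject(jsval v) { return (v & TAG_MASK) == TAG_OBJECT; }

// Unboxing: an arithmetic shift recovers the 31-bit integer; masking
// off the tag recovers an 8-byte-aligned pointer.
int32_t unboxInt(jsval v)     { return static_cast<int32_t>(v) >> 1; }
void*   unboxPointer(jsval v) { return reinterpret_cast<void*>(v & ~TAG_MASK); }

// Boxing an integer: shift left one bit and set the integer flag.
jsval boxInt(int32_t i) {
    return (static_cast<jsval>(static_cast<uint32_t>(i)) << 1) | INT_FLAG;
}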

JavaScript object values are mappings of string-valued property names to arbitrary values. They are represented in one of two ways in SpiderMonkey. Most objects are represented by a shared structural description, called the object shape, that maps property names to array indexes using a hash table. The object stores a pointer to the shape and the array of its own property values. Objects with large, unique sets of property names store their properties directly in a hash table.

The garbage collector is an exact, non-generational, stop-the-world mark-and-sweep collector.

In the rest of this section we discuss key areas of the TraceMonkey implementation.

6.1 Calling Compiled Traces

Compiled traces are stored in a trace cache, indexed by interpreter PC and type map. Traces are compiled so that they may be called as functions using standard native calling conventions (e.g., FASTCALL on x86).

The interpreter must hit a loop edge and enter the monitor in order to call a native trace for the first time. The monitor computes the current type map, checks the trace cache for a trace for the current PC and type map, and if it finds one, executes the trace.

To execute a trace, the monitor must build a trace activation record containing imported local and global variables, temporary stack space, and space for arguments to native calls. The local and global values are then copied from the interpreter state to the trace activation record. Then, the trace is called like a normal C function pointer.
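A sketch of the call sequence follows; the record layout and the TraceFn signature are illustrative, not TraceMonkey's actual definitions.

#include <cstdint>
#include <vector>

// Hypothetical trace activation record layout.
struct TraceActivationRecord {
    std::vector<uintptr_t> importedLocals;   // copied from interpreter frame
    std::vector<uintptr_t> importedGlobals;  // cached global variables
    std::vector<uintptr_t> stack;            // temporary trace stack space
    std::vector<uintptr_t> nativeCallArgs;   // arguments for native calls
};

// A compiled trace is callable with a standard native calling
// convention; the return value identifies the side exit taken.
using TraceFn = int (*)(TraceActivationRecord*);

int callTrace(TraceFn trace, TraceActivationRecord& rec) {
    // The monitor has already copied locals and globals from the
    // interpreter state into rec before this point.
    int exitId = trace(&rec);  // called like a normal C function pointer
    // Afterwards the monitor copies imported values back and restores
    // interpreter state according to exitId.
    return exitId;
}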


When a trace call returns, the monitor restores the interpreter state. First, the monitor checks the reason for the trace exit and applies blacklisting if needed. Then, it pops or synthesizes interpreter JavaScript call stack frames as needed. Finally, it copies the imported variables back from the trace activation record to the interpreter state.

At least in the current implementation, these steps have a non-negligible runtime cost, so minimizing the number of interpreter-to-trace and trace-to-interpreter transitions is essential for performance (see also Section 3.3). Our experiments (see Figure 12) show that for programs we can trace well such transitions happen infrequently and hence do not contribute significantly to total runtime. In a few programs, where the system is prevented from recording branch traces for hot side exits by aborts, this cost can rise to up to 10% of total execution time.

6.2 Trace Stitching

Transitions from a trace to a branch trace at a side exit avoid the costs of calling traces from the monitor, in a feature called trace stitching. At a side exit, the exiting trace only needs to write live register-carried values back to its trace activation record. In our implementation, identical type maps yield identical activation record layouts, so the trace activation record can be reused immediately by the branch trace.

In programs with branchy trace trees with small traces, trace stitching has a noticeable cost. Although writing to memory and then soon reading back would be expected to have a high L1 cache hit rate, for small traces the increased instruction count has a noticeable cost. Also, if the writes and reads are very close in the dynamic instruction stream, we have found that current x86 processors often incur penalties of 6 cycles or more (e.g., if the instructions use different base registers with equal values, the processor may not be able to detect that the addresses are the same right away).

The alternate solution is to recompile an entire trace tree, thus achieving inter-trace register allocation (10). The disadvantage is that tree recompilation takes time quadratic in the number of traces. We believe that the cost of recompiling a trace tree every time a branch is added would be prohibitive. That problem might be mitigated by recompiling only at certain points, or only for very hot, stable trees.

In the future, multicore hardware is expected to be common, making background tree recompilation attractive. In a closely related project (13), background recompilation yielded speedups of up to 1.25x on benchmarks with many branch traces. We plan to apply this technique to TraceMonkey as future work.

6.3 Trace Recording

The job of the trace recorder is to emit LIR with identical semantics to the currently running interpreter bytecode trace. A good implementation should have low impact on non-tracing interpreter performance and a convenient way for implementers to maintain semantic equivalence.

In our implementation, the only direct modification to the interpreter is a call to the trace monitor at loop edges. In our benchmark results (see Figure 12) the total time spent in the monitor (for all activities) is usually less than 5%, so we consider the interpreter impact requirement met. Incrementing the loop hit counter is expensive because it requires us to look up the loop in the trace cache, but we have tuned the system so that loops become hot and begin tracing very quickly (on the second iteration). The hit counter implementation could be improved, which might give us a small increase in overall performance, as well as more flexibility with tuning hotness thresholds. Once a loop is blacklisted we never call into the trace monitor for that loop (see Section 3.3).

Recording is activated by a pointer swap that sets the interpreter's dispatch table to call a single "interrupt" routine for every bytecode. The interrupt routine first calls a bytecode-specific recording routine. Then, it turns off recording if necessary (e.g., the trace ended). Finally, it jumps to the standard interpreter bytecode implementation. Some bytecodes have effects on the type map that cannot be predicted before executing the bytecode (e.g., calling String.charCodeAt, which returns an integer or NaN if the index argument is out of range). For these, we arrange for the interpreter to call into the recorder again after executing the bytecode. Since such hooks are relatively rare, we embed them directly into the interpreter, with an additional runtime check to see whether a recorder is currently active.
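The pointer-swap activation can be sketched as follows. This is a toy dispatch model; SpiderMonkey's actual interpreter is threaded and considerably more involved.

#include <cstddef>

const size_t NUM_OPCODES = 256;
using OpHandler = void (*)(unsigned char opcode);

OpHandler normalTable[NUM_OPCODES];     // standard handlers (filled at VM startup)
OpHandler interruptTable[NUM_OPCODES];  // every entry -> interruptHandler
OpHandler* dispatch = normalTable;      // interpreter dispatches through this

void recordOpcode(unsigned char opcode) { /* bytecode-specific recording */ }
bool traceEnded() { return false; }     // e.g., loop edge reached
void stopRecording() { dispatch = normalTable; }

// While recording is active, every bytecode comes through here.
void interruptHandler(unsigned char opcode) {
    recordOpcode(opcode);            // first, call the recording routine
    if (traceEnded())
        stopRecording();             // turn off recording if necessary
    if (normalTable[opcode])
        normalTable[opcode](opcode); // then run the standard implementation
}

// A single pointer swap activates recording for every bytecode.
void startRecording() {
    for (size_t i = 0; i < NUM_OPCODES; ++i)
        interruptTable[i] = interruptHandler;
    dispatch = interruptTable;
}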

While separating the interpreter from the recorder reduces individual code complexity, it also requires careful implementation and extensive testing to achieve semantic equivalence.

In some cases achieving this equivalence is difficult since SpiderMonkey follows a fat-bytecode design, which was found to be beneficial to pure interpreter performance.

In fat-bytecode designs, individual bytecodes can implement complex processing (e.g., the getprop bytecode, which implements full JavaScript property value access, including special cases for cached and dense array access).

Fat bytecodes have two advantages: fewer bytecodes means lower dispatch cost, and bigger bytecode implementations give the compiler more opportunities to optimize the interpreter.

Fat bytecodes are a problem for TraceMonkey because they require the recorder to reimplement the same special case logic in the same way. Also, the advantages are reduced because (a) dispatch costs are eliminated entirely in compiled traces, (b) the traces contain only one special case, not the interpreter's large chunk of code, and (c) TraceMonkey spends less time running the base interpreter.

One way we have mitigated these problems is by implementing certain complex bytecodes in the recorder as sequences of simple bytecodes. Expressing the original semantics this way is not too difficult, and recording simple bytecodes is much easier. This enables us to retain the advantages of fat bytecodes while avoiding some of their problems for trace recording. This is particularly effective for fat bytecodes that recurse back into the interpreter, for example to convert an object into a primitive value by invoking a well-known method on the object, since it lets us inline this function call.

It is important to note that we split fat opcodes into thinner opcodes only during recording. When running purely interpretively (i.e., code that has been blacklisted), the interpreter directly and efficiently executes the fat opcodes.

6.4 Preemption

SpiderMonkey, like many VMs, needs to preempt the user program periodically. The main reasons are to prevent infinitely looping scripts from locking up the host system and to schedule GC.

In the interpreter, this had been implemented by setting a "preempt now" flag that was checked on every backward jump. This strategy carried over into TraceMonkey: the VM inserts a guard on the preemption flag at every loop edge. We measured less than a 1% increase in runtime on most benchmarks for this extra guard. In practice, the cost is detectable only for programs with very short loops.
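The guard amounts to a single flag test per loop edge, as in this illustrative sketch:

#include <atomic>

std::atomic<bool> preemptNow{false};  // set by the VM to request preemption

// The interpreter checks this on every backward jump; compiled traces
// carry the equivalent test as a guard at every loop edge. A false
// result makes the trace side-exit back to the VM, which can then run
// the GC or terminate a runaway script.
bool loopEdgeGuard() {
    return !preemptNow.load(std::memory_order_relaxed);
}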

We tested and rejected a solution that avoided the guards by compiling the loop edge as an unconditional jump, and patching the jump target to an exit routine when preemption is required. This solution can make the normal case slightly faster, but then preemption becomes very slow. The implementation was also very complex, especially trying to restart execution after the preemption.


6.5 Calling External Functions

Like most interpreters, SpiderMonkey has a foreign function interface (FFI) that allows it to call C builtins and host system functions (e.g., web browser control and DOM access). The FFI has a standard signature for JS-callable functions, the key argument of which is an array of boxed values. External functions called through the FFI interact with the program state through an interpreter API (e.g., to read a property from an argument). There are also certain interpreter builtins that do not use the FFI, but interact with the program state in the same way, such as the CallIteratorNext function used with iterator objects. TraceMonkey must support this FFI in order to speed up code that interacts with the host system inside hot loops.

Calling external functions from TraceMonkey is potentially difficult because traces do not update the interpreter state until exiting. In particular, external functions may need the call stack or the global variables, but they may be out of date.

For the out-of-date call stack problem, we refactored some of the interpreter API implementation functions to re-materialize the interpreter call stack on demand.

We developed a C++ static analysis and annotated some interpreter functions in order to verify that the call stack is refreshed at any point it needs to be used. In order to access the call stack, a function must be annotated as either FORCESSTACK or REQUIRESSTACK. These annotations are also required in order to call REQUIRESSTACK functions, which are presumed to access the call stack transitively. FORCESSTACK is a trusted annotation, applied to only 5 functions, that means the function refreshes the call stack. REQUIRESSTACK is an untrusted annotation that means the function may only be called if the call stack has already been refreshed.
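The discipline can be pictured as follows; the macro expansions and the functions shown are our illustrative stand-ins, not SpiderMonkey's actual declarations.

// In a build without the static analysis these expand to nothing; the
// analysis tool reads them as annotations.
#define FORCESSTACK    /* trusted: this function refreshes the call stack */
#define REQUIRESSTACK  /* checked: callable only once the stack is fresh  */

struct InterpState;  // interpreter state (opaque here)

// Trusted primitive: re-materializes the interpreter call stack from
// the running trace's activation record.
FORCESSTACK void refreshCallStack(InterpState* state);

// Reads the call stack, so it (and, transitively, its callers) must be
// annotated REQUIRESSTACK; the analysis verifies that every call path
// reaches it only after a FORCESSTACK function has run.
REQUIRESSTACK void readCallerFrame(InterpState* state);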

Similarly, we detect when host functions attempt to directly read or write global variables, and force the currently running trace to side exit. This is necessary since we cache and unbox global variables into the activation record during trace execution.

Since both call-stack access and global variable access are rarely performed by host functions, performance is not significantly affected by these safety mechanisms.

Another problem is that external functions can reenter the interpreter by calling scripts, which in turn again might want to access the call stack or global variables. To address this problem, we made the VM set a flag whenever the interpreter is reentered while a compiled trace is running.

Every call to an external function then checks this flag and exits the trace immediately after returning from the external function call if it is set. Many external functions seldom or never reenter; these can be called without problem and cause a trace exit only when necessary.

The FFI's boxed value array requirement has a performance cost, so we defined a new FFI that allows C functions to be annotated with their argument types so that the tracer can call them directly, without unnecessary argument conversions.
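Such an annotation might look like the following sketch; the descriptor encoding is our invention, not the actual TraceMonkey FFI.

#include <cmath>

// Unboxed implementation of a builtin, callable directly from a trace.
double js_math_pow_impl(double x, double y) { return std::pow(x, y); }

// The tracer consults a descriptor like this and emits a direct call
// with unboxed arguments, skipping the boxed-value array entirely.
struct BuiltinDescriptor {
    void* address;         // native entry point
    const char* argTypes;  // e.g. "dd" = two unboxed doubles
    char returnType;       // e.g. 'd' = unboxed double
};

const BuiltinDescriptor mathPowDescriptor = {
    reinterpret_cast<void*>(&js_math_pow_impl), "dd", 'd'
};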

Currently, we do not support calling native property get and set override functions or DOM functions directly from trace. Support is planned future work.

6.6 Correctness

During development, we had access to existing JavaScript test suites, but most of them were not designed with tracing VMs in mind and contained few loops.

One tool that helped us greatly was Mozilla's JavaScript fuzz tester, JSFUNFUZZ, which generates random JavaScript programs by nesting random language elements. We modified JSFUNFUZZ to generate loops, and also to test more heavily certain constructs we suspected would reveal flaws in our implementation. For example, we suspected bugs in TraceMonkey's handling of type-unstable loops.

7. Evaluation

[Figure 10 graphic: per-benchmark speedup bars for the three VMs; see the caption below.]

Figure 10. Speedup vs. a baseline JavaScript interpreter (SpiderMonkey) for our trace-based JIT compiler, Apple's SquirrelFish Extreme inline threading interpreter, and Google's V8 JS compiler. Our system generates particularly efficient code for programs that benefit most from type specialization, which includes SunSpider Benchmark programs that perform bit manipulation. We type-specialize the code in question to use integer arithmetic, which substantially improves performance. For one of the benchmark programs we execute 25 times faster than the SpiderMonkey interpreter, and almost 5 times faster than V8 and SFX. For a large number of benchmarks all three VMs produce similar results. We perform worst on benchmark programs that we do not trace and instead fall back onto the interpreter. This includes the recursive benchmarks access-binary-trees and controlflow-recursive, for which we currently don't generate any native code.

In particular, the bitops benchmarks are short programs that perform many bitwise operations, so TraceMonkey can cover the entire program with one or two traces that operate on integers. TraceMonkey runs all the other programs in this set almost entirely as native code.

regexp-dna is dominated by regular expression matching, which is implemented in all three VMs by a special regular expression compiler. Thus, performance on this benchmark has little relation to the trace compilation approach discussed in this paper.

TraceMonkey's smaller speedups on the other benchmarks can be attributed to a few specific causes:

• The implementation does not currently trace recursion, so TraceMonkey achieves a small speedup or no speedup on benchmarks that use recursion extensively: 3d-cube, 3d-raytrace, access-binary-trees, string-tagcloud, and controlflow-recursive.

• The implementation does not currently trace eval and some other functions implemented in C. Because date-format-tofte and date-format-xparb use such functions in their main loops, we do not trace them.

• The implementation does not currently trace through regular expression replace operations. The replace function can be passed a function object used to compute the replacement text. Our implementation currently does not trace functions called as replace functions. The run time of string-unpack-code is dominated by such a replace call.

• Two programs trace well, but have a long compilation time. access-nbody forms a large number of traces (81). crypto-md5 forms one very long trace. We expect to improve performance on these programs by improving the compilation speed of nanojit.

• Some programs trace very well, and speed up compared to the interpreter, but are not as fast as SFX and/or V8, namely bitops-bits-in-byte, bitops-nsieve-bits, access-fannkuch, access-nsieve, and crypto-aes. The reason is not clear, but all of these programs have nested loops with small bodies, so we suspect that the implementation has a relatively high cost for calling nested traces. string-fasta traces well, but its run time is dominated by string processing builtins, which are unaffected by tracing and seem to be less efficient in SpiderMonkey than in the two other VMs.

Detailed performance metrics. In Figure 11 we show the fraction of instructions interpreted and the fraction of instructions executed as native code. This figure shows that for many programs, we are able to execute almost all the code natively.

Figure 12 breaks down the total execution time into four activities: interpreting bytecodes while not recording, recording traces (including time taken to interpret the recorded trace), compiling traces to native code, and executing native code traces.

These detailed metrics allow us to estimate parameters for a simple model of tracing performance. These estimates should be considered very rough, as the values observed on the individual benchmarks have large standard deviations (on the order of the mean).


Benchmark                  Loops  Trees  Traces  Aborts  Flushes  Trees/Loop  Traces/Tree  Traces/Loop  Speedup
3d-cube                       25     27      29       3        0         1.1          1.1          1.2    2.20x
3d-morph                       5      8       8       2        0         1.6          1.0          1.6    2.86x
3d-raytrace                   10     25     100      10        1         2.5          4.0         10.0    1.18x
access-binary-trees            0      0       0       5        0           -            -            -    0.93x
access-fannkuch               10     34      57      24        0         3.4          1.7          5.7    2.20x
access-nbody                   8     16      18       5        0         2.0          1.1          2.3    4.19x
access-nsieve                  3      6       8       3        0         2.0          1.3          2.7    3.05x
bitops-3bit-bits-in-byte       2      2       2       0        0         1.0          1.0          1.0   25.47x
bitops-bits-in-byte            3      3       4       1        0         1.0          1.3          1.3    8.67x
bitops-bitwise-and             1      1       1       0        0         1.0          1.0          1.0   25.20x
bitops-nsieve-bits             3      3       5       0        0         1.0          1.7          1.7    2.75x
controlflow-recursive          0      0       0       1        0           -            -            -    0.98x
crypto-aes                    50     72      78      19        0         1.4          1.1          1.6    1.64x
crypto-md5                     4      4       5       0        0         1.0          1.3          1.3    2.30x
crypto-sha1                    5      5      10       0        0         1.0          2.0          2.0    5.95x
date-format-tofte              3      3       4       7        0         1.0          1.3          1.3    1.07x
date-format-xparb              3      3      11       3        0         1.0          3.7          3.7    0.98x
math-cordic                    2      4       5       1        0         2.0          1.3          2.5    4.92x
math-partial-sums              2      4       4       1        0         2.0          1.0          2.0    5.90x
math-spectral-norm            15     20      20       0        0         1.3          1.0          1.3    7.12x
regexp-dna                     2      2       2       0        0         1.0          1.0          1.0    4.21x
string-base64                  3      5       7       0        0         1.7          1.4          2.3    2.53x
string-fasta                   5     11      15       6        0         2.2          1.4          3.0    1.49x
string-tagcloud                3      6       6       5        0         2.0          1.0          2.0    1.09x
string-unpack-code             4      4      37       0        0         1.0          9.3          9.3    1.20x
string-validate-input          6     10      13       1        0         1.7          1.3          2.2    1.86x

Figure 13. Detailed trace recording statistics for the SunSpider benchmark set.

We exclude regexp-dna from the following calculations, because most of its time is spent in the regular expression matcher, which has much different performance characteristics from the other programs. (Note that this only makes a difference of about 10% in the results.) Dividing the total execution time in processor clock cycles by the number of bytecodes executed in the base interpreter shows that on average, a bytecode executes in about 35 cycles. Native traces take about 9 cycles per bytecode, a 3.9x speedup over the interpreter.

Using similar computations, we find that trace recording takes about 3800 cycles per bytecode, and compilation 3150 cycles per bytecode. Hence, during recording and compiling the VM runs at 1/200 the speed of the interpreter. Because it costs 6950 cycles to compile a bytecode, and we save 26 cycles each time that code is run natively, we break even after running a trace 270 times.
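Written out, the break-even arithmetic from the estimates above is:

\[
\underbrace{3800 + 3150}_{\text{record}+\text{compile}} = 6950\ \text{cycles/bytecode},\qquad
\underbrace{35 - 9}_{\text{saved per native run}} = 26\ \text{cycles},\qquad
\frac{6950}{26} \approx 267 \approx 270\ \text{executions}.
\]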

The other VMs we compared with achieve an overall speedup of 3.0x relative to our baseline interpreter. Our estimated native code speedup of 3.9x is significantly better. This suggests that our compilation techniques can generate more efficient native code than any other current JavaScript VM.

These estimates also indicate that our startup performance could be substantially better if we improved the speed of trace recording and compilation. The estimated 200x slowdown for recording and compilation is very rough, and may be influenced by startup factors in the interpreter (e.g., caches that have not warmed up yet during recording). One observation supporting this conjecture is that in the tracer, interpreted bytecodes take about 180 cycles to run. Still, recording and compilation are clearly both expensive, and a better implementation, possibly including redesign of the LIR abstract syntax or encoding, would improve startup performance.

Our performance results confirm that type specialization using trace trees substantially improves performance. We are able to outperform the fastest available JavaScript compiler (V8) and the fastest available JavaScript inline threaded interpreter (SFX) on 9 of 26 benchmarks.

8. Related Work

Trace optimization for dynamic languages. The closest area of related work is on applying trace optimization to type-specialize dynamic languages. Existing work shares the idea of generating type-specialized code speculatively with guards along interpreter traces.

To our knowledge, Rigo's Psyco (16) is the only published type-specializing trace compiler for a dynamic language (Python). Psyco does not attempt to identify hot loops or inline function calls. Instead, Psyco transforms loops to mutual recursion before running and traces all operations.

Pall's LuaJIT is a Lua VM in development that uses trace compilation ideas (1). There are no publications on LuaJIT, but the creator has told us that LuaJIT has a similar design to our system, but will use a less aggressive type speculation (e.g., using a floating-point representation for all number values) and does not generate nested traces for nested loops.

General trace optimization. General trace optimization has a longer history that has treated mostly native code and typed languages like Java. Thus, these systems have focused less on type specialization and more on other optimizations.

Dynamo (7) by Bala et al. introduced native code tracing as a replacement for profile-guided optimization (PGO). A major goal was to perform PGO online so that the profile was specific to the current execution. Dynamo used loop headers as candidate hot traces, but did not try to create loop traces specifically.

Trace trees were originally proposed by Gal et al. (11) in the context of Java, a statically typed language. Their trace trees actually inlined parts of outer loops within the inner loops (because inner loops become hot first), leading to much greater tail duplication.

Acknowledgments

The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official views, policies or endorsements, either expressed or implied, of the National Science Foundation (NSF), any other agency of the U.S. Government, or any of the companies mentioned above.

References

[1] LuaJIT roadmap 2008 - http://lua-users.org/lists/lua-l/2008-02/msg00051.html.

[2] Mozilla — Firefox web browser and Thunderbird email client - http://www.mozilla.com.

[3] SPECJVM98 - http://www.spec.org/jvm98/.

[4] SpiderMonkey (JavaScript-C) Engine - http://www.mozilla.org/js/spidermonkey/.

[5] Surfin' Safari - Blog Archive - Announcing SquirrelFish Extreme - http://webkit.org/blog/214/introducing-squirrelfish-extreme/.

[6] A. Aho, R. Sethi, J. Ullman, and M. Lam. Compilers: Principles, Techniques, and Tools, 2006.

[7] V. Bala, E. Duesterwald, and S. Banerjia. Dynamo: A transparent dynamic optimization system. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation, pages 1–12. ACM Press, 2000.

[8] M. Berndl, B. Vitale, M. Zaleski, and A. Brown. Context Threading: A Flexible and Efficient Dispatch Technique for Virtual Machine Interpreters. In Code Generation and Optimization, 2005. CGO 2005. International Symposium on, pages 15–26, 2005.

[9] C. Chambers and D. Ungar. Customization: Optimizing Compiler Technology for SELF, a Dynamically-Typed Object-Oriented Programming Language. In Proceedings of the ACM SIGPLAN 1989 Conference on Programming Language Design and Implementation, pages 146–160. ACM New York, NY, USA, 1989.

[10] A. Gal. Efficient Bytecode Verification and Compilation in a Virtual Machine. PhD thesis, University of California, Irvine, 2006.

[11] A. Gal, C. W. Probst, and M. Franz. HotpathVM: An effective JIT compiler for resource-constrained devices. In Proceedings of the International Conference on Virtual Execution Environments, pages 144–153. ACM Press, 2006.

[12] C. Garrett, J. Dean, D. Grove, and C. Chambers. Measurement and Application of Dynamic Receiver Class Distributions. 1994.

[13] J. Ha, M. R. Haghighat, S. Cong, and K. S. McKinley. A concurrent trace-based just-in-time compiler for JavaScript. Dept. of Computer Sciences, The University of Texas at Austin, TR-09-06, 2009.

[14] B. McCloskey. Personal communication.

[15] I. Piumarta and F. Riccardi. Optimizing direct threaded code by selective inlining. In Proceedings of the ACM SIGPLAN 1998 Conference on Programming Language Design and Implementation, pages 291–300. ACM New York, NY, USA, 1998.

[16] A. Rigo. Representation-Based Just-In-Time Specialization and the Psyco Prototype for Python. In PEPM, 2004.

[17] M. Salib. Starkiller: A Static Type Inferencer and Compiler for Python. Master's thesis, 2004.

[18] T. Suganuma, T. Yasue, and T. Nakatani. A Region-Based Compilation Technique for Dynamic Compilers. ACM Transactions on Programming Languages and Systems (TOPLAS), 28(1):134–174, 2006.

[19] M. Zaleski, A. D. Brown, and K. Stoodley. YETI: A graduallY Extensible Trace Interpreter. In Proceedings of the International Conference on Virtual Execution Environments, pages 83–93. ACM Press, 2007.
