Trace-based Just-in-Time Type Specialization for Dynamic Languages
Andreas Gal∗+, Brendan Eich∗, Mike Shaver∗, David Anderson∗, David Mandelin∗,
Mohammad R. Haghighat$, Blake Kaplan∗, Graydon Hoare∗, Boris Zbarsky∗, Jason Orendorff∗,
Jesse Ruderman∗, Edwin Smith#, Rick Reitmaier#, Michael Bebenita+, Mason Chang+#, Michael Franz+

Mozilla Corporation∗
{gal,brendan,shaver,danderson,dmandelin,mrbkap,graydon,bz,jorendorff,jruderman}@mozilla.com
Adobe Corporation#
{edwsmith,rreitmai}@adobe.com
Intel Corporation$
{mohammad.r.haghighat}@intel.com
University of California, Irvine+
{mbebenit,changm,franz}@uci.edu
Abstract

Dynamic languages such as JavaScript are more difficult to compile than statically typed ones. Since no concrete type information is available, traditional compilers need to emit generic code that can handle all possible type combinations at runtime. We present an alternative compilation technique for dynamically-typed languages that identifies frequently executed loop traces at run-time and then generates machine code on the fly that is specialized for the actual dynamic types occurring on each path through the loop. Our method provides cheap inter-procedural type specialization, and an elegant and efficient way of incrementally compiling lazily discovered alternative paths through nested loops. We have implemented a dynamic compiler for JavaScript based on our technique and we have measured speedups of 10x and more for certain benchmark programs.
Categories and Subject Descriptors D.3.4 [Programming Languages]: Processors – Incremental compilers, code generation.

General Terms Design, Experimentation, Measurement, Performance.

Keywords JavaScript, just-in-time compilation, trace trees.

1. Introduction
Dynamic languages such as JavaScript, Python, and Ruby, are popular since they are expressive, accessible to non-experts, and make deployment as easy as distributing a source file. They are used for small scripts as well as for complex applications. JavaScript, for example, is the de facto standard for client-side web programming and is used for the application logic of browser-based productivity applications such as Google Mail, Google Docs and Zimbra Collaboration Suite. In this domain, in order to provide a fluid user experience and enable a new generation of applications, virtual machines must provide a low startup time and high performance.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
PLDI'09, June 15–20, 2009, Dublin, Ireland.
Copyright © 2009 ACM 978-1-60558-392-1/09/06...$5.00
Compilers for statically typed languages rely on type information to generate efficient machine code. In a dynamically typed programming language such as JavaScript, the types of expressions may vary at runtime. This means that the compiler can no longer easily transform operations into machine instructions that operate on one specific type. Without exact type information, the compiler must emit slower generalized machine code that can deal with all potential type combinations. While compile-time static type inference might be able to gather type information to generate optimized machine code, traditional static analysis is very expensive and hence not well suited for the highly interactive environment of a web browser.
We present a trace-based compilation technique for dynamic languages that reconciles speed of compilation with excellent performance of the generated machine code. Our system uses a mixed-mode execution approach: the system starts running JavaScript in a fast-starting bytecode interpreter. As the program runs, the system identifies hot (frequently executed) bytecode sequences, records them, and compiles them to fast native code. We call such a sequence of instructions a trace.
Unlike method-based dynamic compilers, our dynamic compiler operates at the granularity of individual loops. This design choice is based on the expectation that programs spend most of their time in hot loops. Even in dynamically typed languages, we expect hot loops to be mostly type-stable, meaning that the types of values are invariant. (12) For example, we would expect loop counters that start as integers to remain integers for all iterations. When both of these expectations hold, a trace-based compiler can cover the program execution with a small number of type-specialized, efficiently compiled traces.
Each compiled trace covers one path through the program with one mapping of values to types. When the VM executes a compiled trace, it cannot guarantee that the same path will be followed or that the same types will occur in subsequent loop iterations. Hence, recording and compiling a trace speculates that the path and typing will be exactly as they were during recording for subsequent iterations of the loop.

Every compiled trace contains all the guards (checks) required to validate the speculation. If one of the guards fails (if control flow is different, or a value of a different type is generated), the trace exits. If an exit becomes hot, the VM can record a branch trace starting at the exit to cover the new path. In this way, the VM records a trace tree covering all the hot paths through the loop.
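The monitoring side of this scheme can be sketched as follows. This is a minimal illustration only; the names `TraceMonitor`, `HOT_LOOP_THRESHOLD`, and the string return values are invented for exposition and are not TraceMonkey internals.

```javascript
// Hypothetical sketch of a trace monitor's hot-loop bookkeeping.
const HOT_LOOP_THRESHOLD = 2;

class TraceMonitor {
  constructor() {
    this.hitCounts = new Map(); // loop-header PC -> back-edge crossings
    this.traces = new Map();    // loop-header PC -> compiled trace
  }

  // Called by the interpreter every time it crosses a loop back edge.
  onLoopEdge(pc) {
    if (this.traces.has(pc)) return "execute"; // enter the compiled trace
    const hits = (this.hitCounts.get(pc) || 0) + 1;
    this.hitCounts.set(pc, hits);
    // A loop becomes hot once it has been crossed enough times.
    return hits >= HOT_LOOP_THRESHOLD ? "record" : "interpret";
  }
}
```

The same counting can be applied to side exits, so that a hot exit triggers recording of a branch trace rather than a loop trace.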
Nested loops can be difficult to optimize for tracing VMs. In a naïve implementation, inner loops would become hot first, and the VM would start tracing there. When the inner loop exits, the VM would detect that a different branch was taken. The VM would try to record a branch trace, and find that the trace reaches not the inner loop header, but the outer loop header. At this point, the VM could continue tracing until it reaches the inner loop header again, thus tracing the outer loop inside a trace tree for the inner loop. But this requires tracing a copy of the outer loop for every side exit and type combination in the inner loop. In essence, this is a form of unintended tail duplication, which can easily overflow the code cache. Alternatively, the VM could simply stop tracing, and give up on ever tracing outer loops.
We solve the nested loop problem by recording nested trace trees. Our system traces the inner loop exactly as the naïve version. The system stops extending the inner tree when it reaches an outer loop, but then it starts a new trace at the outer loop header. When the outer loop reaches the inner loop header, the system tries to call the trace tree for the inner loop. If the call succeeds, the VM records the call to the inner tree as part of the outer trace and finishes the outer trace as normal. In this way, our system can trace any number of loops nested to any depth without causing excessive tail duplication.
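In rough sketch form, calling an inner tree from an outer trace is a guarded subroutine call. All names here are hypothetical, chosen only to illustrate the shape of the mechanism:

```javascript
// Hypothetical sketch: an outer trace invokes the inner loop's trace tree
// as a subroutine instead of inlining (and duplicating) the outer loop
// body for every inner side exit.
function callInnerTree(innerTree, vars) {
  // If the live values match no entry type map of the inner tree, the
  // outer trace must side-exit back to the interpreter.
  if (!innerTree.entryTypesMatch(vars)) throw new Error("side exit");
  return innerTree.run(vars); // runs the inner loop to completion
}
```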
These techniques allow a VM to dynamically translate a program to nested, type-specialized trace trees. Because traces can cross function call boundaries, our techniques also achieve the effects of inlining. Because traces have no internal control-flow joins, they can be optimized in linear time by a simple compiler (10). Thus, our tracing VM efficiently performs the same kind of optimizations that would require interprocedural analysis in a static optimization setting. This makes tracing an attractive and effective tool to type specialize even complex function call-rich code.
We implemented these techniques for an existing JavaScript interpreter, SpiderMonkey. We call the resulting tracing VM TraceMonkey. TraceMonkey supports all the JavaScript features of SpiderMonkey, with a 2x-20x speedup for traceable programs.
This paper makes the following contributions:

• We explain an algorithm for dynamically forming trace trees to cover a program, representing nested loops as nested trace trees.
• We explain how to speculatively generate efficient type-specialized code for traces from dynamic language programs.
• We validate our tracing techniques in an implementation based on the SpiderMonkey JavaScript interpreter, achieving 2x-20x speedups on many programs.
The remainder of this paper is organized as follows. Section 3 is a general overview of the trace-tree-based compilation technique we use to capture and compile frequently executed code regions. In Section 4 we describe our approach of covering nested loops using a number of individual trace trees. In Section 5 we describe the trace-compilation-based speculative type specialization approach we use to generate efficient machine code from recorded bytecode traces. Our implementation of a dynamic type-specializing compiler for JavaScript is described in Section 6. Related work is discussed in Section 8. In Section 7 we evaluate our dynamic compiler based on
1 for (var i = 2; i < 100; ++i) {
2   if (!primes[i])
3     continue;
4   for (var k = i + i; k < 100; k += i)
5     primes[k] = false;
6 }

Figure 1. Sample program: sieve of Eratosthenes. primes is initialized to an array of 100 true values on entry to this code snippet.
[Figure 2: state-machine diagram. States: Interpret Bytecodes and Monitor (interpreting); Record LIR Trace and Compile LIR Trace (overhead); Enter Compiled Trace, Execute Compiled Trace, and Leave Compiled Trace (native). Transition labels include: loop edge; hot loop/exit; abort recording; finish at loop header; compiled trace ready; loop edge with same types; side exit, no existing trace; side exit to existing trace; cold/blacklisted loop/exit.]

Figure 2. State machine describing the major activities of TraceMonkey and the conditions that cause transitions to a new activity. In the dark box, TM executes JS as compiled traces. In the light gray boxes, TM executes JS in the standard interpreter. White boxes are overhead. Thus, to maximize performance, we need to maximize time spent in the darkest box and minimize time spent in the white boxes. The best case is a loop where the types at the loop edge are the same as the types on entry; then TM can stay in native code until the loop is done.
a set of industry benchmarks. The paper ends with conclusions in Section 9 and an outlook on future work in Section 10.
2. Overview: Example Tracing Run

This section provides an overview of our system by describing how TraceMonkey executes an example program. The example program, shown in Figure 1, computes the prime numbers below 100 with nested loops. The narrative should be read along with Figure 2, which describes the activities TraceMonkey performs and when it transitions between them.

TraceMonkey always begins executing a program in the bytecode interpreter. Every loop back edge is a potential trace point. When the interpreter crosses a loop edge, TraceMonkey invokes the trace monitor, which may decide to record or execute a native trace. At the start of execution, there are no compiled traces yet, so the trace monitor counts the number of times each loop back edge is executed until a loop becomes hot, currently after 2 crossings. Note that the way our loops are compiled, the loop edge is crossed before entering the loop, so the second crossing occurs immediately after the first iteration.

Here is the sequence of events broken down by outer loop iteration:
v0 := ld state[748]               // load primes from the trace activation record
st sp[0], v0                      // store primes to interpreter stack
v1 := ld state[764]               // load k from the trace activation record
v2 := i2f(v1)                     // convert k from int to double
st sp[8], v1                      // store k to interpreter stack
st sp[16], 0                      // store false to interpreter stack
v3 := ld v0[4]                    // load class word for primes
v4 := and v3, -4                  // mask out object class tag for primes
v5 := eq v4, Array                // test whether primes is an array
xf v5                             // side exit if v5 is false
v6 := js_Array_set(v0, v2, false) // call function to set array element
v7 := eq v6, 0                    // test return value from call
xt v7                             // side exit if js_Array_set returns false

Figure 3. LIR snippet for sample program. This is the LIR recorded for line 5 of the sample program in Figure 1. The LIR encodes the semantics in SSA form using temporary variables. The LIR also encodes all the stores that the interpreter would do to its data stack. Sometimes these stores can be optimized away as the stack locations are live only on exits to the interpreter. Finally, the LIR records guards and side exits to verify the assumptions made in this recording: that primes is an array and that the call to set its element succeeds.
mov edx, ebx(748)    // load primes from the trace activation record
mov edi(0), edx      // (*) store primes to interpreter stack
mov esi, ebx(764)    // load k from the trace activation record
mov edi(8), esi      // (*) store k to interpreter stack
mov edi(16), 0       // (*) store false to interpreter stack
mov eax, edx(4)      // (*) load object class word for primes
and eax, -4          // (*) mask out object class tag for primes
cmp eax, Array       // (*) test whether primes is an array
jne side_exit_1      // (*) side exit if primes is not an array
sub esp, 8           // bump stack for call alignment convention
push false           // push last argument for call
push esi             // push first argument for call
call js_Array_set    // call function to set array element
add esp, 8           // clean up extra stack space
mov ecx, ebx         // (*) created by register allocator
test eax, eax        // (*) test return value of js_Array_set
je side_exit_2       // (*) side exit if call failed
...
side_exit_1:
mov ecx, ebp(-4)     // restore ecx
mov esp, ebp         // restore esp
jmp epilog           // jump to ret statement

Figure 4. x86 snippet for sample program. This is the x86 code compiled from the LIR snippet in Figure 3. Most LIR instructions compile to a single x86 instruction. Instructions marked with (*) would be omitted by an idealized compiler that knew that none of the side exits would ever be taken. The 17 instructions generated by the compiler compare favorably with the 100+ instructions that the interpreter would execute for the same code snippet, including 4 indirect jumps.
i=2. This is the first iteration of the outer loop. The loop on lines 4-5 becomes hot on its second iteration, so TraceMonkey enters recording mode on line 4. In recording mode, TraceMonkey records the code along the trace in a low-level compiler intermediate representation we call LIR. The LIR trace encodes all the operations performed and the types of all operands. The LIR trace also encodes guards, which are checks that verify that the control flow and types are identical to those observed during trace recording. Thus, on later executions, if and only if all guards are passed, the trace has the required program semantics.

TraceMonkey stops recording when execution returns to the loop header or exits the loop. In this case, execution returns to the loop header on line 4.

After recording is finished, TraceMonkey compiles the trace to native code using the recorded type information for optimization. The result is a native code fragment that can be entered if the interpreter PC and the types of values match those observed when trace recording was started. The first trace in our example, T45, covers lines 4 and 5. This trace can be entered if the PC is at line 4, i and k are integers, and primes is an object. After compiling T45, TraceMonkey returns to the interpreter and loops back to line 1.

i=3. Now the loop header at line 1 has become hot, so TraceMonkey starts recording. When recording reaches line 4, TraceMonkey observes that it has reached an inner loop header that already has a compiled trace, so TraceMonkey attempts to nest the inner loop inside the current trace. The first step is to call the inner trace as a subroutine. This executes the loop on line 4 to completion and then returns to the recorder. TraceMonkey verifies that the call was successful and then records the call to the inner trace as part of the current trace. Recording continues until execution reaches line 1, at which point TraceMonkey finishes and compiles a trace for the outer loop, T16.
i=4. On this iteration, TraceMonkey calls T16. Because i=4, the if statement on line 2 is taken. This branch was not taken in the original trace, so this causes T16 to fail a guard and take a side exit. The exit is not yet hot, so TraceMonkey returns to the interpreter, which executes the continue statement.

i=5. TraceMonkey calls T16, which in turn calls the nested trace T45. T16 loops back to its own header, starting the next iteration without ever returning to the monitor.

i=6. On this iteration, the side exit on line 2 is taken again. This time, the side exit becomes hot, so a trace T23,1 is recorded that covers line 3 and returns to the loop header. Thus, the end of T23,1 jumps directly to the start of T16. The side exit is patched so that on future iterations, it jumps directly to T23,1.

At this point, TraceMonkey has compiled enough traces to cover the entire nested loop structure, so the rest of the program runs entirely as native code.
3. Trace Trees

In this section, we describe traces, trace trees, and how they are formed at run time. Although our techniques apply to any dynamic language interpreter, we will describe them assuming a bytecode interpreter to keep the exposition simple.

3.1 Traces

A trace is simply a program path, which may cross function call boundaries. TraceMonkey focuses on loop traces, which originate at a loop edge and represent a single iteration through the associated loop.
Similar to an extended basic block, a trace is only entered at the top, but may have many exits. In contrast to an extended basic block, a trace can contain join nodes. Since a trace always follows one single path through the original program, however, join nodes are not recognizable as such in a trace and have a single predecessor node like regular nodes.
A typed trace is a trace annotated with a type for every variable (including temporaries) on the trace. A typed trace also has an entry type map giving the required types for variables used on the trace before they are defined. For example, a trace could have a type map (x: int, b: boolean), meaning that the trace may be entered only if the value of the variable x is of type int and the value of b is of type boolean. The entry type map is much like the signature of a function.
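A minimal sketch of such an entry check follows. The helper names and the simplified int/double distinction are invented for illustration:

```javascript
// Illustrative only: classify a JS value the way a type map would.
function typeOf(v) {
  if (typeof v === "number") return Number.isInteger(v) ? "int" : "double";
  return typeof v; // "boolean", "string", "object", ...
}

// A trace may be entered only if every variable matches its mapped type.
function matchesTypeMap(typeMap, vars) {
  return Object.entries(typeMap).every(([name, t]) => typeOf(vars[name]) === t);
}
```

For the example above, `matchesTypeMap({ x: "int", b: "boolean" }, vars)` plays the role of checking the signature before the trace is entered.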
In this paper, we only discuss typed loop traces, and we will refer to them simply as "traces". The key property of typed loop traces is that they can be compiled to efficient machine code using the same techniques used for typed languages.
In TraceMonkey, traces are recorded in trace-flavored SSA LIR (low-level intermediate representation). In trace-flavored SSA (or TSSA), phi nodes appear only at the entry point, which is reached both on entry and via loop edges. The important LIR primitives are constant values, memory loads and stores (by address and offset), integer operators, floating-point operators, function calls, and conditional exits. Type conversions, such as integer to double, are represented by function calls. This makes the LIR used by TraceMonkey independent of the concrete type system and type conversion rules of the source language. The LIR operations are generic enough that the backend compiler is language independent. Figure 3 shows an example LIR trace.
Bytecode interpreters typically represent values in various complex data structures (e.g., hash tables) in a boxed format (i.e., with attached type tag bits). Since a trace is intended to represent efficient code that eliminates all that complexity, our traces operate on unboxed values in simple variables and arrays as much as possible.
A trace records all its intermediate values in a small activation record area. To make variable accesses fast on trace, the trace also imports local and global variables by unboxing them and copying them to its activation record. Thus, the trace can read and write these variables with simple loads and stores from a native activation record, independently of the boxing mechanism used by the interpreter. When the trace exits, the VM boxes the values from this native storage location and copies them back to the interpreter structures.
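The import/export step might look like the following sketch. The `{ tag, value }` box layout and the function names are invented for illustration, not TraceMonkey's actual value representation:

```javascript
// On trace entry, unbox interpreter variables into a flat activation record.
function importVars(boxedVars) {
  const record = {};
  for (const [name, box] of Object.entries(boxedVars)) record[name] = box.value;
  return record; // raw values: plain loads/stores while on trace
}

// On trace exit, box the values back into the interpreter's structures.
function exportVars(record, boxedVars) {
  for (const [name, value] of Object.entries(record))
    boxedVars[name] = { tag: typeof value, value };
}
```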
For every control-flow branch in the source program, the recorder generates conditional exit LIR instructions. These instructions exit from the trace if the actual control flow is different from what it was at trace recording, ensuring that the trace instructions are run only if they are supposed to. We call these instructions guard instructions.
Most of our traces represent loops and end with the special loop LIR instruction. This is just an unconditional branch to the top of the trace. Such traces return only via guards.
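In sketch form (with hypothetical names, and guards modeled as thrown side-exit records), such a loop trace is a straight-line body that branches back to its own top and can only leave via a guard:

```javascript
// Illustrative sketch of executing a compiled loop trace.
function runLoopTrace(body, state) {
  while (true) {        // the final "loop" instruction: branch to the top
    try {
      body(state);      // straight-line, type-specialized trace body
    } catch (sideExit) {
      return sideExit;  // such traces return only via guards
    }
  }
}
```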
Now, we describe the key optimizations that are performed as part of recording LIR. All of these optimizations reduce complex dynamic language constructs to simple typed constructs by specializing for the current trace. Each optimization requires guard instructions to verify their assumptions about the state and exit the trace if necessary.
Type specialization. All LIR primitives apply to operands of specific types. Thus, LIR traces are necessarily type-specialized, and a compiler can easily produce a translation that requires no type dispatches. A typical bytecode interpreter carries tag bits along with each value, and to perform any operation, must check the tag bits, dynamically dispatch, mask out the tag bits to recover the untagged value, perform the operation, and then reapply tags. LIR omits everything except the operation itself.
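The contrast can be sketched as follows. The tag encoding and function names are invented for illustration and do not reflect TraceMonkey's actual value representation:

```javascript
const TAG_INT = 0, TAG_DOUBLE = 1;

// Interpreter path: check tags, dispatch, unbox, operate, retag.
function genericAdd(a, b) {
  if (a.tag === TAG_INT && b.tag === TAG_INT)
    return { tag: TAG_INT, value: (a.value + b.value) | 0 };
  return { tag: TAG_DOUBLE, value: a.value + b.value };
}

// Trace path: a guard verifies the recorded types, then only the raw
// operation remains; a failed guard is a side exit.
function specializedIntAdd(a, b) {
  if (a.tag !== TAG_INT || b.tag !== TAG_INT) throw new Error("side exit");
  return (a.value + b.value) | 0; // unboxed integer add
}
```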
A potential problem is that some operations can produce values of unpredictable types. For example, reading a property from an object could yield a value of any type, not necessarily the type observed during recording. The recorder emits guard instructions that conditionally exit if the operation yields a value of a different type from that seen during recording. These guard instructions guarantee that as long as execution is on trace, the types of values match those of the typed trace. When the VM observes a side exit along such a type guard, a new typed trace is recorded originating at the side exit location, capturing the new type of the operation in question.
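A minimal sketch of this idea, with an invented recorder API: the recorder notes the type seen during recording, and the "compiled" read carries a guard that side-exits when a later execution yields a different type.

```python
# Hypothetical type-guard sketch; SideExit and the recorder API are invented.
class SideExit(Exception):
    pass

def record_property_read(obj, name):
    observed_type = type(obj[name])          # type seen during recording
    def traced_read(o):
        v = o[name]
        if type(v) is not observed_type:     # guard: exit trace on type change
            raise SideExit(name)
        return v
    return traced_read

read_x = record_property_read({"x": 1}, "x")   # recorded with an int value
assert read_x({"x": 5}) == 5                   # same type: stays on trace
```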
Representation specialization: objects. In JavaScript, name lookup semantics are complex and potentially expensive because they include features like object inheritance and eval. To evaluate an object property read expression like o.x, the interpreter must search the property map of o and all of its prototypes and parents. Property maps can be implemented with different data structures (e.g., per-object hash tables or shared hash tables), so the search process also must dispatch on the representation of each object found during search. TraceMonkey can simply observe the result of the search process and record the simplest possible LIR to access the property value. For example, the search might find the value of o.x in the prototype of o, which uses a shared hash-table representation that places x in slot 2 of a property vector. Then the recorder can generate LIR that reads o.x with just two or three loads: one to get the prototype, possibly one to get the property value vector, and one more to get slot 2 from the vector. This is a vast simplification and speedup compared to the original interpreter code. Inheritance relationships and object representations can change during execution, so the simplified code requires guard instructions that ensure the object representation is the same. In TraceMonkey, objects’ representations are assigned an integer key called the object shape. Thus, the guard is a simple equality check on the object shape.
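The shape-guarded fast path can be sketched as follows; the field names (shape, slots) and the recorder interface are illustrative, not TraceMonkey's actual data layout.

```python
# Hypothetical shape-guarded property read.
class SideExit(Exception):
    pass

class Obj:
    def __init__(self, shape, slots):
        self.shape = shape    # integer key identifying the representation
        self.slots = slots    # property value vector

def record_get(obj, slot):
    shape = obj.shape                     # shape observed during recording
    def traced_get(o):
        if o.shape != shape:              # guard: same representation?
            raise SideExit("shape changed")
        return o.slots[slot]              # direct slot load, no map search
    return traced_get

get_x = record_get(Obj(7, ["a", "b", "c"]), 2)
assert get_x(Obj(7, [0, 1, 42])) == 42    # same shape: fast path
```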
Representation specialization: numbers. JavaScript has no integer type, only a Number type that is the set of 64-bit IEEE-754 floating-point numbers (“doubles”). But many JavaScript operators, in particular array accesses and bitwise operators, really operate on integers, so they first convert the number to an integer, and then convert any integer result back to a double.¹ Clearly, a JavaScript VM that wants to be fast must find a way to operate on integers directly and avoid these conversions.
In TraceMonkey, we support two representations for numbers: integers and doubles. The interpreter uses integer representations as much as it can, switching for results that can only be represented as doubles. When a trace is started, some values may be imported and represented as integers. Some operations on integers require guards. For example, adding two integers can produce a value too large for the integer representation.
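An overflow guard of this kind might look like the following sketch; the 32-bit bounds and the SideExit mechanism are illustrative assumptions.

```python
# Hypothetical overflow guard on specialized integer addition.
class SideExit(Exception):
    pass

INT32_MIN, INT32_MAX = -2**31, 2**31 - 1

def traced_int_add(a, b):
    r = a + b
    if not (INT32_MIN <= r <= INT32_MAX):   # guard: still fits the int repr?
        raise SideExit("overflow: fall back to double representation")
    return r

assert traced_int_add(1, 2) == 3
```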
Function inlining. LIR traces can cross function boundaries in either direction, achieving function inlining. Move instructions need to be recorded for function entry and exit to copy arguments in and return values out. These move statements are then optimized away by the compiler using copy propagation. In order to be able to return to the interpreter, the trace must also generate LIR to record that a call frame has been entered and exited. The frame entry and exit LIR saves just enough information to allow the interpreter call stack to be restored later and is much simpler than the interpreter’s standard call code. If the function being entered is not constant (which in JavaScript includes any call by function name), the recorder must also emit LIR to guard that the function is the same.
Guards and side exits. Each optimization described above requires one or more guards to verify the assumptions made in doing the optimization. A guard is just a group of LIR instructions that performs a test and conditional exit. The exit branches to a side exit, a small off-trace piece of LIR that returns a pointer to a structure that describes the reason for the exit along with the interpreter PC at the exit point and any other data needed to restore the interpreter’s state structures.
Aborts. Some constructs are difficult to record in LIR traces. For example, eval or calls to external functions can change the program state in unpredictable ways, making it difficult for the tracer to know the current type map in order to continue tracing. A tracing implementation can also have any number of other limitations, e.g., a small-memory device may limit the length of traces. When any situation occurs that prevents the implementation from continuing trace recording, the implementation aborts trace recording and returns to the trace monitor.
3.2 Trace Trees
Especially simple loops, namely those where control flow, value types, value representations, and inlined functions are all invariant, can be represented by a single trace. But most loops have at least some variation, and so the program will take side exits from the main trace. When a side exit becomes hot, TraceMonkey starts a new branch trace from that point and patches the side exit to jump directly to that trace. In this way, a single trace expands on demand to a single-entry, multiple-exit trace tree.
This section explains how trace trees are formed during execution. The goal is to form trace trees during execution that cover all the hot paths of the program.
¹ Arrays are actually worse than this: if the index value is a number, it must be converted from a double to a string for the property access operator, and then to an integer internally to the array implementation.
Starting a tree. Trace trees always start at loop headers, because they are a natural place to look for hot paths. In TraceMonkey, loop headers are easy to detect: the bytecode compiler ensures that a bytecode is a loop header iff it is the target of a backward branch. TraceMonkey starts a tree when a given loop header has been executed a certain number of times (2 in the current implementation). Starting a tree just means starting recording a trace for the current point and type map and marking the trace as the root of a tree. Each tree is associated with a loop header and type map, so there may be several trees for a given loop header.
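The hotness check can be sketched as a per-header counter; the monitor API below is invented, and the threshold of 2 matches the value cited in the text.

```python
# Hypothetical trace-monitor hotness check (threshold 2, per the text).
HOT_THRESHOLD = 2

class TraceMonitor:
    def __init__(self):
        self.counts = {}       # loop-header PC -> execution count
        self.roots = []        # (pc, type_map) pairs rooted so far
    def at_loop_header(self, pc, type_map):
        self.counts[pc] = self.counts.get(pc, 0) + 1
        if self.counts[pc] == HOT_THRESHOLD:
            self.roots.append((pc, tuple(type_map)))  # root a tree here
            return True        # begin recording a root trace
        return False

m = TraceMonitor()
assert m.at_loop_header(10, ["int"]) is False   # first pass: not hot yet
assert m.at_loop_header(10, ["int"]) is True    # second pass: root a tree
```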
Closing the loop. Trace recording can end in several ways. Ideally, the trace reaches the loop header where it started with the same type map as on entry. This is called a type-stable loop iteration. In this case, the end of the trace can jump right to the beginning, as all the value representations are exactly as needed to enter the trace. The jump can even skip the usual code that would copy out the state at the end of the trace and copy it back in to the trace activation record to enter a trace.
In certain cases the trace might reach the loop header with a different type map. This scenario is sometimes observed for the first iteration of a loop. Some variables inside the loop might initially be undefined, before they are set to a concrete type during the first loop iteration. When recording such an iteration, the recorder cannot link the trace back to its own loop header since it is type-unstable. Instead, the iteration is terminated with a side exit that will always fail and return to the interpreter. At the same time a new trace is recorded with the new type map. Every time an additional type-unstable trace is added to a region, its exit type map is compared to the entry map of all existing traces in case they complement each other. With this approach we are able to cover type-unstable loop iterations as long as they eventually form a stable equilibrium.
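The matching step can be sketched as follows: a trace whose exit type map equals another trace's entry type map gets its loop edge linked there. The data layout is invented; only the matching rule comes from the text.

```python
# Hypothetical linking of type-unstable traces via their type maps.
def link_complementary(traces):
    # traces: dicts with 'entry'/'exit' type maps and a 'link' slot
    for a in traces:
        if a["link"] is None:
            for b in traces:
                if a["exit"] == b["entry"]:
                    a["link"] = b      # a's loop edge now jumps into b
                    break

t1 = {"entry": ("int",), "exit": ("double",), "link": None}     # unstable
t2 = {"entry": ("double",), "exit": ("double",), "link": None}  # stable
link_complementary([t1, t2])
assert t1["link"] is t2      # t1 falls through into t2...
assert t2["link"] is t2      # ...which closes on itself: stable equilibrium
```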
Finally, the trace might exit the loop before reaching the loop header, for example because execution reaches a break or return statement. In this case, the VM simply ends the trace with an exit to the trace monitor.
As mentioned previously, we may speculatively choose to represent certain Number-typed values as integers on trace. We do so when we observe that Number-typed variables contain an integer value at trace entry. If during trace recording the variable is unexpectedly assigned a non-integer value, we have to widen the type of the variable to a double. As a result, the recorded trace becomes inherently type-unstable since it starts with an integer value but ends with a double value. This represents a mis-speculation, since at trace entry we specialized the Number-typed value to an integer, assuming that at the loop edge we would again find an integer value in the variable, allowing us to close the loop. To avoid future speculative failures involving this variable, and to obtain a type-stable trace, we note in an advisory data structure, which we call the “oracle”, that the variable in question has been observed to sometimes hold non-integer values.
When compiling loops, we consult the oracle before specializing values to integers. Speculation towards integers is performed only if no adverse information is known to the oracle about that particular variable. Whenever we accidentally compile a loop that is type-unstable due to mis-speculation of a Number-typed variable, we immediately trigger the recording of a new trace, which based on the now updated oracle information will start with a double value and thus become type stable.
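The oracle is essentially an advisory set; the interface below is an invented sketch of the consult-before-specialize behavior described above.

```python
# Hypothetical oracle: variables observed to hold non-integer values.
class Oracle:
    def __init__(self):
        self._demoted = set()
    def mark_non_int(self, var):
        self._demoted.add(var)            # record the mis-speculation
    def int_speculation_ok(self, var):
        return var not in self._demoted   # no adverse information known

def choose_repr(oracle, var, value):
    # Speculate toward int only if the oracle has nothing against it.
    if isinstance(value, int) and oracle.int_speculation_ok(var):
        return "int"
    return "double"

o = Oracle()
assert choose_repr(o, "i", 1) == "int"
o.mark_non_int("i")                        # variable later held a double
assert choose_repr(o, "i", 1) == "double"  # future compiles widen up front
```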
Extending a tree. Side exits lead to different paths through the loop, or paths with different types or representations. Thus, to completely cover the loop, the VM must record traces starting at all side exits. These traces are recorded much like root traces: there is a counter for each side exit, and when the counter reaches a hotness threshold, recording starts. Recording stops exactly as for the root trace, using the loop header of the root trace as the target to reach.
Our implementation does not extend at all side exits. It extends only if the side exit is for a control-flow branch, and only if the side exit does not leave the loop. In particular we do not want to extend a trace tree along a path that leads to an outer loop, because we want to cover such paths in an outer tree through tree nesting.
3.3 Blacklisting
Sometimes, a program follows a path that cannot be compiled into a trace, usually because of limitations in the implementation. TraceMonkey does not currently support recording throwing and catching of arbitrary exceptions. This design trade-off was chosen because exceptions are usually rare in JavaScript. However, if a program opts to use exceptions intensively, we would suddenly incur a punishing runtime overhead if we repeatedly try to record a trace for this path and repeatedly fail to do so, since we abort tracing every time we observe an exception being thrown.
As a result, if a hot loop contains traces that always fail, the VM could potentially run much more slowly than the base interpreter: the VM repeatedly spends time trying to record traces, but is never able to run any. To avoid this problem, whenever the VM is about to start tracing, it must try to predict whether it will finish the trace.
Our prediction algorithm is based on blacklisting traces that have been tried and failed. When the VM fails to finish a trace starting at a given point, the VM records that a failure has occurred. The VM also sets a counter so that it will not try to record a trace starting at that point until it is passed a few more times (32 in our implementation). This backoff counter gives temporary conditions that prevent tracing a chance to end. For example, a loop may behave differently during startup than during its steady-state execution. After a given number of failures (2 in our implementation), the VM marks the fragment as blacklisted, which means the VM will never again start recording at that point.
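The bookkeeping can be sketched as follows; the class is invented, but the constants (32 backoff passes, 2 failures) are those cited in the text.

```python
# Hypothetical backoff-and-blacklist bookkeeping for one recording site.
BACKOFF_PASSES, MAX_FAILURES = 32, 2

class RecordingSite:
    def __init__(self):
        self.failures = 0
        self.skip = 0
        self.blacklisted = False
    def should_record(self):
        if self.blacklisted:
            return False
        if self.skip > 0:
            self.skip -= 1        # still backing off from the last failure
            return False
        return True
    def recording_failed(self):
        self.failures += 1
        if self.failures >= MAX_FAILURES:
            self.blacklisted = True   # never attempt this point again
        else:
            self.skip = BACKOFF_PASSES

s = RecordingSite()
assert s.should_record()
s.recording_failed()                  # first failure: back off 32 passes
assert not s.should_record()
s.recording_failed()                  # second failure: blacklist for good
assert s.blacklisted
```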
After implementing this basic strategy, we observed that for small loops that get blacklisted, the system can spend a noticeable amount of time just finding the loop fragment and determining that it has been blacklisted. We now avoid that problem by patching the bytecode. We define an extra no-op bytecode that indicates a loop header. The VM calls into the trace monitor every time the interpreter executes a loop header no-op. To blacklist a fragment, we simply replace the loop header no-op with a regular no-op. Thus, the interpreter will never again even call into the trace monitor.
There is a related problem we have not yet solved, which occurs when a loop meets all of these conditions:
• The VM can form at least one root trace for the loop.
• There is at least one hot side exit for which the VM cannot complete a trace.
• The loop body is short.
In this case, the VM will repeatedly pass the loop header, search for a trace, find it, execute it, and fall back to the interpreter. With a short loop body, the overhead of finding and calling the trace is high, and causes performance to be even slower than the basic interpreter. So far, in this situation we have improved the implementation so that the VM can complete the branch trace. But it is hard to guarantee that this situation will never happen. As future work, this situation could be avoided by detecting and blacklisting loops for which the average trace call executes few bytecodes before returning to the interpreter.
4. Nested Trace Tree Formation
Figure 7 shows basic trace tree compilation (11) applied to a nested loop where the inner loop contains two paths. Usually, the inner loop (with header at i2) becomes hot first, and a trace tree is rooted at that point. For example, the first recorded trace may be a cycle
Figure 5. A tree with two traces, a trunk trace and one branch trace. The trunk trace contains a guard to which a branch trace was attached. The branch trace contains a guard that may fail and trigger a side exit. Both the trunk and the branch trace loop back to the tree anchor, which is the beginning of the trace tree.
Figure 6. We handle type-unstable loops by allowing traces to compile that cannot loop back to themselves due to a type mismatch. As such traces accumulate, we attempt to connect their loop edges to form groups of trace trees that can execute without having to side-exit to the interpreter to cover odd type cases. This is particularly important for nested trace trees where an outer tree tries to call an inner tree (or in this case a forest of inner trees), since inner loops frequently have initially undefined values which change type to a concrete value after the first iteration.
through the inner loop, {i2, i3, i5, α}. The α symbol is used to indicate that the trace loops back to the tree anchor.
When execution leaves the inner loop, the basic design has two choices. First, the system can stop tracing and give up on compiling the outer loop, clearly an undesirable solution. The other choice is to continue tracing, compiling traces for the outer loop inside the inner loop’s trace tree.
For example, the program might exit at i5 and record a branch trace that incorporates the outer loop: {i5, i7, i1, i6, i7, i1, α}. Later, the program might take the other branch at i2 and then exit, recording another branch trace incorporating the outer loop: {i2, i4, i5, i7, i1, i6, i7, i1, α}. Thus, the outer loop is recorded and compiled twice, and both copies must be retained in the trace cache.
Figure 7. Control flow graph of a nested loop with an if statement inside the innermost loop (a). An inner tree captures the inner loop, and is nested inside an outer tree which “calls” the inner tree. The inner tree returns to the outer tree once it exits along its loop condition guard (b).
In general, if loops are nested to depth k, and each loop has n paths (on geometric average), this naïve strategy yields O(n^k) traces, which can easily fill the trace cache.
In order to execute programs with nested loops efficiently, a tracing system needs a technique for covering the nested loops with native code without exponential trace duplication.
4.1 Nesting Algorithm
The key insight is that if each loop is represented by its own trace tree, the code for each loop can be contained only in its own tree, and outer loop paths will not be duplicated. Another key fact is that we are not tracing arbitrary bytecodes that might have irreducible control flow graphs, but rather bytecodes produced by a compiler for a language with structured control flow. Thus, given two loop edges, the system can easily determine whether they are nested and which is the inner loop. Using this knowledge, the system can compile inner and outer loops separately, and make the outer loop’s traces call the inner loop’s trace tree.
The algorithm for building nested trace trees is as follows. We start tracing at loop headers exactly as in the basic tracing system. When we exit a loop (detected by comparing the interpreter PC with the range given by the loop edge), we stop the trace. The key step of the algorithm occurs when we are recording a trace for loop L_R (R for loop being recorded) and we reach the header of a different loop L_O (O for other loop). Note that L_O must be an inner loop of L_R because we stop the trace when we exit a loop.
• If L_O has a type-matching compiled trace tree, we call L_O as a nested trace tree. If the call succeeds, then we record the call in the trace for L_R. On future executions, the trace for L_R will call the inner trace directly.
• If L_O does not have a type-matching compiled trace tree yet, we have to obtain it before we are able to proceed. In order to do this, we simply abort recording the first trace. The trace monitor will see the inner loop header, and will immediately start recording the inner loop.²
² Instead of aborting the outer recording, we could in principle merely suspend the recording, but that would require the implementation to be able to record several traces simultaneously, complicating the implementation, while saving only a few iterations in the interpreter.
Figure 8. Control flow graph of a loop with two nested loops (left) and its nested trace tree configuration (right). The outer tree calls the two inner nested trace trees and places guards at their side exit locations.
If all the loops in a nest are type-stable, then loop nesting creates no duplication. Otherwise, if loops are nested to a depth k, and each
loop is entered with m different type maps (on geometric average), then we compile O(m^k) copies of the innermost loop. As long as m is close to 1, the resulting trace trees will be tractable.
An important detail is that the call to the inner trace tree must act like a function call site: it must return to the same point every time. The goal of nesting is to make inner and outer loops independent; thus when the inner tree is called, it must exit to the same point in the outer tree every time with the same type map. Because we cannot actually guarantee this property, we must guard on it after the call, and side exit if the property does not hold. A common reason for the inner tree not to return to the same point would be if the inner tree took a new side exit for which it had never compiled a trace. At this point, the interpreter PC is in the inner tree, so we cannot continue recording or executing the outer tree. If this happens during recording, we abort the outer trace, to give the inner tree a chance to finish growing. A future execution of the outer tree would then be able to properly finish and record a call to the inner tree. If an inner tree side exit happens during execution of a compiled trace for the outer tree, we simply exit the outer trace and start recording a new branch in the inner tree.
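The post-call guard can be sketched as follows; the calling convention (an inner tree returning its exit point and type map) is an invented illustration of the check described above.

```python
# Hypothetical guard after a nested-tree call: the inner tree must come
# back to the expected exit point with the expected type map.
class SideExit(Exception):
    pass

def call_inner_tree(inner_tree, expected_exit, expected_types, state):
    exit_point, exit_types = inner_tree(state)   # run inner tree to an exit
    if (exit_point, exit_types) != (expected_exit, expected_types):
        raise SideExit(exit_point)   # unusual exit: side-exit the outer trace
    return state

inner = lambda s: ("loop-exit", ("int",))        # well-behaved inner tree
assert call_inner_tree(inner, "loop-exit", ("int",), {"i": 3}) == {"i": 3}
```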
4.2 Blacklisting with Nesting
The blacklisting algorithm needs modification to work well with nesting. The problem is that outer loop traces often abort during startup (because the inner tree is not available or takes a side exit), which would lead to their being quickly blacklisted by the basic algorithm.
The key observation is that when an outer trace aborts because the inner tree is not ready, this is probably a temporary condition. Thus, we should not count such aborts toward blacklisting as long as we are able to build up more traces for the inner tree.
In our implementation, when an outer tree aborts on the inner tree, we increment the outer tree’s blacklist counter as usual and back off on compiling it. When the inner tree finishes a trace, we decrement the blacklist counter on the outer loop, “forgiving” the outer loop for aborting previously. We also undo the backoff so that the outer tree can start immediately trying to compile the next time we reach it.
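The forgiveness rule can be sketched as a pair of counter updates; the class and field names are invented, with 32 reused from the backoff constant cited earlier.

```python
# Hypothetical "forgiveness" bookkeeping for an outer tree's recording site.
class OuterSite:
    def __init__(self):
        self.blacklist_count = 0
        self.backoff = 0
    def aborted_on_inner_tree(self):
        self.blacklist_count += 1     # counted toward blacklisting as usual...
        self.backoff = 32
    def inner_tree_finished_trace(self):
        if self.blacklist_count > 0:
            self.blacklist_count -= 1 # ...then forgiven: inner made progress
        self.backoff = 0              # retry immediately next time we arrive

o = OuterSite()
o.aborted_on_inner_tree()
o.inner_tree_finished_trace()
assert o.blacklist_count == 0 and o.backoff == 0
```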
5. Trace Tree Optimization
This section explains how a recorded trace is translated to an optimized machine code trace. The trace compilation subsystem, NANOJIT, is separate from the VM and can be used for other applications.
5.1 Optimizations
Because traces are in SSA form and have no join points or φ-nodes, certain optimizations are easy to implement. In order to get good startup performance, the optimizations must run quickly, so we chose a small set of optimizations. We implemented the optimizations as pipelined filters so that they can be turned on and off independently, and yet all run in just two passes over the trace: one forward and one backward.
Every time the trace recorder emits a LIR instruction, the instruction is immediately passed to the first filter in the forward pipeline. Thus, forward filter optimizations are performed as the trace is recorded. Each filter may pass each instruction to the next filter unchanged, write a different instruction to the next filter, or write no instruction at all. For example, the constant folding filter can replace a multiply instruction like v13 := mul 3, 1000 with a constant instruction v13 := 3000.
We currently apply four <strong>for</strong>ward filters:<br />
• On ISAs without float<strong>in</strong>g-po<strong>in</strong>t <strong>in</strong>structions, a soft-float filter<br />
converts float<strong>in</strong>g-po<strong>in</strong>t LIR <strong>in</strong>structions to sequences of <strong>in</strong>teger<br />
<strong>in</strong>structions.<br />
• CSE (constant subexpression elim<strong>in</strong>ation),<br />
• expression simplification, <strong>in</strong>clud<strong>in</strong>g constant fold<strong>in</strong>g and a few<br />
algebraic identities (e.g., a − a = 0), and<br />
• source language semantic-specific expression simplification,<br />
primarily algebraic identities that allow DOUBLE to be replaced<br />
with INT. For example, LIR that converts an INT to a DOUBLE<br />
and then back aga<strong>in</strong> would be removed by this filter.<br />
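As a sketch of how such a pipelined filter could look, the following minimal model uses a hypothetical two-opcode LIR (all names here are illustrative; real nanojit LIR has many more opcodes and an arena allocator):

```cpp
#include <cassert>
#include <memory>
#include <vector>

// Hypothetical two-opcode LIR: constants and multiplies.
struct Ins {
    enum Op { Const, Mul } op;
    int imm;            // constant value (Const only)
    const Ins *a, *b;   // operands (Mul only)
    Ins(Op o, int v = 0, const Ins *x = nullptr, const Ins *y = nullptr)
        : op(o), imm(v), a(x), b(y) {}
};

// A filter sees each instruction as the recorder emits it and may pass
// it on unchanged, substitute another instruction, or emit nothing.
struct Filter {
    Filter *next;
    explicit Filter(Filter *n = nullptr) : next(n) {}
    virtual const Ins *ins(const Ins *i) { return next ? next->ins(i) : i; }
    virtual ~Filter() {}
};

// Constant folding: mul(Const a, Const b) becomes Const(a*b).
struct ConstFoldFilter : Filter {
    std::vector<std::unique_ptr<Ins>> pool; // owns the folded instructions
    explicit ConstFoldFilter(Filter *n = nullptr) : Filter(n) {}
    const Ins *ins(const Ins *i) override {
        if (i->op == Ins::Mul &&
            i->a->op == Ins::Const && i->b->op == Ins::Const) {
            pool.emplace_back(new Ins(Ins::Const, i->a->imm * i->b->imm));
            i = pool.back().get();
        }
        return Filter::ins(i); // hand the (possibly new) instruction onward
    }
};
```

Chaining more filters is just linking more `Filter` objects, which is what lets each optimization be switched on and off independently.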
When trace recording is completed, nanojit runs the backward
optimization filters. These are used for optimizations that require
backward program analysis. When running the backward filters,
nanojit reads one LIR instruction at a time, and the reads are passed
through the pipeline.
We currently apply three backward filters:
• Dead data-stack store elimination. The LIR trace encodes many
stores to locations in the interpreter stack. But these values are
never read back before exiting the trace (by the interpreter or
another trace). Thus, stores to the stack that are overwritten
before the next exit are dead. Stores to locations that are off
the top of the interpreter stack at future exits are also dead.
• Dead call-stack store elimination. This is the same optimization
as above, except applied to the interpreter’s call stack used for
function call inlining.
• Dead code elimination. This eliminates any operation that
stores to a value that is never used.
After a LIR instruction is successfully read (“pulled”) from
the backward filter pipeline, nanojit’s code generator emits native
machine instruction(s) for it.
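The dead data-stack store elimination can be sketched as a single backward sweep. This is a simplified model over a hypothetical mini-LIR; the real filter also tracks the stack depth live at each exit:

```cpp
#include <set>
#include <vector>

// Hypothetical mini-LIR: stores/loads to interpreter-stack slots, plus exits.
struct StackIns {
    enum Op { Store, Load, Exit } op;
    int slot; // slot index (unused for Exit)
};

// Backward pass: a store is dead if a later store overwrites the same
// slot with no intervening load or trace exit. (The real filter also
// kills stores above the stack top at future exits.)
std::vector<bool> findDeadStores(const std::vector<StackIns> &trace) {
    std::vector<bool> dead(trace.size(), false);
    std::set<int> shadowed; // slots overwritten later, not read in between
    for (int i = (int)trace.size() - 1; i >= 0; --i) {
        const StackIns &ins = trace[i];
        if (ins.op == StackIns::Exit)
            shadowed.clear();         // the exit may read every slot
        else if (ins.op == StackIns::Load)
            shadowed.erase(ins.slot); // a read keeps earlier stores alive
        else if (shadowed.count(ins.slot))
            dead[i] = true;           // overwritten before any read or exit
        else
            shadowed.insert(ins.slot);
    }
    return dead;
}
```

For example, in a trace `store 0; store 0; exit; store 0; exit` only the first store is dead: the second is read by the first exit, and the third by the second exit.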
5.2 Register Allocation
We use a simple greedy register allocator that makes a single
backward pass over the trace (it is integrated with the code generator).
By the time the allocator has reached an instruction like
v3 = add v1, v2, it has already assigned a register to v3. If v1 and
v2 have not yet been assigned registers, the allocator assigns a free
register to each. If there are no free registers, a value is selected for
spilling. We use a classic heuristic that selects the “oldest”
register-carried value (6).
The heuristic considers for spilling the set R of values v in registers
immediately after the current instruction. For each v, let v_m be the last
instruction before the current one at which v is referred to. Then the
Tag  JS Type                      Description
xx1  number                       31-bit integer representation
000  object                       pointer to JSObject handle
010  number                       pointer to double handle
100  string                       pointer to JSString handle
110  boolean, null, or undefined  enumeration for null, undefined, true, false
Figure 9. Tagged values in the SpiderMonkey JS interpreter.
Testing tags, unboxing (extracting the untagged value) and boxing
(creating tagged values) are significant costs. Avoiding these costs
is a key benefit of tracing.
heuristic selects the v with minimum v_m. The motivation is that this
frees up a register for as long as possible given a single spill.
If we need to spill a value v_s at this point, we generate the
restore code just after the code for the current instruction. The
corresponding spill code is generated just after the last point where
v_s was used. The register that was assigned to v_s is marked free for
the preceding code, because that register can now be used freely
without affecting the following code.
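The spill-victim selection can be sketched as follows (hypothetical types; the real allocator tracks this information incrementally as it scans backward rather than recomputing it):

```cpp
#include <climits>
#include <map>

// For each register-carried value id, prevUse holds the index of the
// last instruction before the current one that refers to it. The
// "oldest" value -- the one with the minimum previous-use index -- is
// the spill victim, since spilling it frees a register for the longest
// stretch of code given a single spill.
int pickSpillVictim(const std::map<int, int> &prevUse) {
    int victim = -1, earliest = INT_MAX;
    for (std::map<int, int>::const_iterator it = prevUse.begin();
         it != prevUse.end(); ++it) {
        if (it->second < earliest) {
            earliest = it->second;
            victim = it->first;
        }
    }
    return victim;
}
```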
6. Implementation
To demonstrate the effectiveness of our approach, we have
implemented a trace-based dynamic compiler for the SpiderMonkey
JavaScript Virtual Machine (4). SpiderMonkey is the JavaScript
VM embedded in Mozilla’s Firefox open-source web browser (2),
which is used by more than 200 million users world-wide. The core
of SpiderMonkey is a bytecode interpreter implemented in C++.
In SpiderMonkey, all JavaScript values are represented by the
type jsval. A jsval is a machine word in which up to 3 of the
least significant bits are a type tag, and the remaining bits are data.
See Figure 9 for details. All pointers contained in jsvals point to
GC-controlled blocks aligned on 8-byte boundaries.
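A minimal sketch of this tagging scheme for the integer case (illustrative names, not SpiderMonkey’s actual macros):

```cpp
#include <cstdint>

// Low bit 1 marks a 31-bit integer (tag pattern xx1 in Figure 9); the
// payload lives in the upper bits. Pointer-valued jsvals instead keep
// an 8-byte-aligned pointer whose free low three bits hold the tag.
typedef uintptr_t jsval_t;

jsval_t boxInt(int32_t i) {
    return ((uintptr_t)(uint32_t)i << 1) | 1; // shift payload up, set tag
}
bool isInt(jsval_t v) {
    return (v & 1) != 0;                      // cheap tag test
}
int32_t unboxInt(jsval_t v) {
    return (int32_t)((intptr_t)v >> 1);       // drop tag, recover payload
}
```

Every generic operation in the interpreter pays for tests like `isInt` plus box/unbox round trips; a type-specialized trace performs them once at entry and then works on raw machine values.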
JavaScript object values are mappings of string-valued property
names to arbitrary values. They are represented in one of two ways
in SpiderMonkey. Most objects are represented by a shared
structural description, called the object shape, that maps property names
to array indexes using a hash table. The object stores a pointer to
the shape and the array of its own property values. Objects with
large, unique sets of property names store their properties directly
in a hash table.
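The shared-shape representation could be sketched like this (illustrative types; real SpiderMonkey slots hold boxed jsvals, not doubles):

```cpp
#include <map>
#include <string>
#include <vector>

// Many objects share one Shape mapping property names to slot indexes;
// each object then stores just a shape pointer and its own slot array.
struct Shape {
    std::map<std::string, int> slotIndex; // property name -> slot number
};

struct Object {
    const Shape *shape;        // shared structural description
    std::vector<double> slots; // this object's own property values
    double get(const std::string &name) const {
        return slots[shape->slotIndex.at(name)];
    }
};
```

Because objects created by the same code path share a shape, a trace can guard once on the shape pointer and then access properties by fixed slot index.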
The garbage collector is an exact, non-generational, stop-the-world
mark-and-sweep collector.
In the rest of this section we discuss key areas of the TraceMonkey
implementation.
6.1 Calling Compiled Traces
Compiled traces are stored in a trace cache, indexed by interpreter
PC and type map. Traces are compiled so that they may be
called as functions using standard native calling conventions (e.g.,
FASTCALL on x86).
The interpreter must hit a loop edge and enter the monitor in
order to call a native trace for the first time. The monitor computes
the current type map, checks the trace cache for a trace for the
current PC and type map, and if it finds one, executes the trace.
To execute a trace, the monitor must build a trace activation
record containing imported local and global variables, temporary
stack space, and space for arguments to native calls. The local and
global values are then copied from the interpreter state to the trace
activation record. Then, the trace is called like a normal C function
pointer.
When a trace call returns, the monitor restores the interpreter
state. First, the monitor checks the reason for the trace exit and
applies blacklisting if needed. Then, it pops or synthesizes
interpreter JavaScript call stack frames as needed. Finally, it copies the
imported variables back from the trace activation record to the
interpreter state.
At least in the current implementation, these steps have a
non-negligible runtime cost, so minimizing the number of
interpreter-to-trace and trace-to-interpreter transitions is essential for
performance (see also Section 3.3). Our experiments (see Figure 12)
show that for programs we can trace well such transitions happen
infrequently and hence do not contribute significantly to total
runtime. In a few programs, where the system is prevented from
recording branch traces for hot side exits by aborts, this cost can
rise to up to 10% of total execution time.
6.2 Trace Stitching
Transitions from a trace to a branch trace at a side exit avoid the
costs of calling traces from the monitor, in a feature called trace
stitching. At a side exit, the exiting trace only needs to write live
register-carried values back to its trace activation record. In our
implementation, identical type maps yield identical activation record
layouts, so the trace activation record can be reused immediately
by the branch trace.
In programs with branchy trace trees and small traces, trace
stitching has a noticeable cost. Although writing to memory and
then soon reading back would be expected to have a high L1
cache hit rate, for small traces the increased instruction count alone
is measurable. Also, if the writes and reads are very close
in the dynamic instruction stream, we have found that current
x86 processors often incur penalties of 6 cycles or more (e.g., if
the instructions use different base registers with equal values, the
processor may not be able to detect that the addresses are the same
right away).
The alternative solution is to recompile an entire trace tree, thus
achieving inter-trace register allocation (10). The disadvantage is
that tree recompilation takes time quadratic in the number of traces.
We believe that the cost of recompiling a trace tree every time
a branch is added would be prohibitive. That problem might be
mitigated by recompiling only at certain points, or only for very
hot, stable trees.
In the future, multicore hardware is expected to be common,
making background tree recompilation attractive. In a closely
related project (13) background recompilation yielded speedups of
up to 1.25x on benchmarks with many branch traces. We plan to
apply this technique to TraceMonkey as future work.
6.3 Trace Recording
The job of the trace recorder is to emit LIR with identical semantics
to the currently running interpreter bytecode trace. A good
implementation should have low impact on non-tracing interpreter
performance and a convenient way for implementers to maintain semantic
equivalence.
In our implementation, the only direct modification to the
interpreter is a call to the trace monitor at loop edges. In our benchmark
results (see Figure 12) the total time spent in the monitor (for all
activities) is usually less than 5%, so we consider the interpreter
impact requirement met. Incrementing the loop hit counter is expensive
because it requires us to look up the loop in the trace cache,
but we have tuned our loops to become hot and trace very quickly
(on the second iteration). The hit counter implementation could be
improved, which might give us a small increase in overall performance,
as well as more flexibility with tuning hotness thresholds.
Once a loop is blacklisted we never call into the trace monitor for
that loop (see Section 3.3).
Recording is activated by a pointer swap that sets the interpreter’s
dispatch table to call a single “interrupt” routine for every
bytecode. The interrupt routine first calls a bytecode-specific
recording routine. Then, it turns off recording if necessary (e.g.,
the trace ended). Finally, it jumps to the standard interpreter bytecode
implementation. Some bytecodes have effects on the type map
that cannot be predicted before executing the bytecode (e.g., calling
String.charCodeAt, which returns an integer or NaN if the
index argument is out of range). For these, we arrange for the interpreter
to call into the recorder again after executing the bytecode.
Since such hooks are relatively rare, we embed them directly into
the interpreter, with an additional runtime check to see whether a
recorder is currently active.
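The dispatch-table swap can be sketched with a hypothetical two-opcode interpreter (all names illustrative):

```cpp
#include <vector>

// While recording, every opcode routes through one "interrupt" routine
// that records first, then runs the normal handler.
typedef void (*OpHandler)(int op);

static std::vector<int> recordedOps; // the trace being recorded

static void opAdd(int) { /* standard bytecode implementation */ }
static void opMul(int) { /* standard bytecode implementation */ }

static OpHandler normalTable[] = { opAdd, opMul };

static void interrupt(int op) {
    recordedOps.push_back(op); // bytecode-specific recording hook
    normalTable[op](op);       // then execute the opcode as usual
}

static OpHandler recordingTable[] = { interrupt, interrupt };

// The interpreter dispatches through this pointer; a single pointer
// swap turns recording on or off for every bytecode at once, so the
// non-recording interpreter pays no per-opcode cost.
static OpHandler *dispatch = normalTable;
void startRecording() { dispatch = recordingTable; }
void stopRecording()  { dispatch = normalTable; }
```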
While separating the interpreter from the recorder reduces individual
code complexity, it also requires careful implementation and
extensive testing to achieve semantic equivalence.
In some cases achieving this equivalence is difficult since SpiderMonkey
follows a fat-bytecode design, which was found to be
beneficial to pure interpreter performance.
In fat-bytecode designs, individual bytecodes can implement
complex processing (e.g., the getprop bytecode, which implements
full JavaScript property value access, including special cases
for cached and dense array access).
Fat bytecodes have two advantages: fewer bytecodes means
lower dispatch cost, and bigger bytecode implementations give the
compiler more opportunities to optimize the interpreter.
Fat bytecodes are a problem for TraceMonkey because they
require the recorder to reimplement the same special case logic
in the same way. Also, the advantages are reduced because (a)
dispatch costs are eliminated entirely in compiled traces, (b) the
traces contain only one special case, not the interpreter’s large
chunk of code, and (c) TraceMonkey spends less time running the
base interpreter.
One way we have mitigated these problems is by implementing
certain complex bytecodes in the recorder as sequences of simple
bytecodes. Expressing the original semantics this way is not too difficult,
and recording simple bytecodes is much easier. This enables
us to retain the advantages of fat bytecodes while avoiding some of
their problems for trace recording. This is particularly effective for
fat bytecodes that recurse back into the interpreter, for example to
convert an object into a primitive value by invoking a well-known
method on the object, since it lets us inline this function call.
It is important to note that we split fat opcodes into thinner opcodes
only during recording. When running purely interpretively
(i.e., on code that has been blacklisted), the interpreter directly and
efficiently executes the fat opcodes.
6.4 Preemption
SpiderMonkey, like many VMs, needs to preempt the user program
periodically. The main reasons are to prevent infinitely looping
scripts from locking up the host system and to schedule GC.
In the interpreter, this had been implemented by setting a “preempt
now” flag that was checked on every backward jump. This
strategy carried over into TraceMonkey: the VM inserts a guard on
the preemption flag at every loop edge. We measured less than a
1% increase in runtime on most benchmarks for this extra guard.
In practice, the cost is detectable only for programs with very short
loops.
We tested and rejected a solution that avoided the guards by
compiling the loop edge as an unconditional jump, and patching
the jump target to an exit routine when preemption is required.
This solution can make the normal case slightly faster, but then
preemption becomes very slow. The implementation was also very
complex, especially trying to restart execution after the preemption.
6.5 Calling External Functions
Like most interpreters, SpiderMonkey has a foreign function interface
(FFI) that allows it to call C builtins and host system functions
(e.g., web browser control and DOM access). The FFI has a standard
signature for JS-callable functions, the key argument of which
is an array of boxed values. External functions called through the
FFI interact with the program state through an interpreter API (e.g.,
to read a property from an argument). There are also certain interpreter
builtins that do not use the FFI, but interact with the program
state in the same way, such as the CallIteratorNext function
used with iterator objects. TraceMonkey must support this FFI in
order to speed up code that interacts with the host system inside hot
loops.
Calling external functions from TraceMonkey is potentially difficult
because traces do not update the interpreter state until exiting.
In particular, external functions may need the call stack or the
global variables, but they may be out of date.
For the out-of-date call stack problem, we refactored some of
the interpreter API implementation functions to re-materialize the
interpreter call stack on demand.
We developed a C++ static analysis and annotated some interpreter
functions in order to verify that the call stack is refreshed
at any point it needs to be used. In order to access the call stack,
a function must be annotated as either FORCESSTACK or
REQUIRESSTACK. These annotations are also required in order to call
REQUIRESSTACK functions, which are presumed to access the call
stack transitively. FORCESSTACK is a trusted annotation, applied
to only 5 functions, that means the function refreshes the call stack.
REQUIRESSTACK is an untrusted annotation that means the function
may only be called if the call stack has already been refreshed.
Similarly, we detect when host functions attempt to directly
read or write global variables, and force the currently running trace
to side exit. This is necessary since we cache and unbox global
variables into the activation record during trace execution.
Since both call-stack access and global variable access are
rarely performed by host functions, performance is not significantly
affected by these safety mechanisms.
Another problem is that external functions can reenter the interpreter
by calling scripts, which in turn again might want to access
the call stack or global variables. To address this problem, we made
the VM set a flag whenever the interpreter is reentered while a compiled
trace is running.
Every call to an external function then checks this flag and, if it
is set, exits the trace immediately after returning from the external
function call. Many external functions seldom or never reenter; they
can be called without problem and cause a trace exit only when
necessary.
The FFI’s boxed value array requirement has a performance
cost, so we defined a new FFI that allows C functions to be
annotated with their argument types so that the tracer can call them
directly, without unnecessary argument conversions.
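The contrast between the two FFI styles might look like this (all names and types are illustrative, not the real SpiderMonkey declarations):

```cpp
#include <cmath>

// The standard FFI passes an array of boxed jsvals; the annotated FFI
// records the C argument types so traces can pass raw machine values.
typedef unsigned long jsval_t; // boxed tag+payload word (see Figure 9)

// Standard JS-callable signature: every argument arrives boxed and
// must be unboxed through the interpreter API.
typedef jsval_t (*JSNative)(int argc, jsval_t *argv);

// Annotated builtin: both arguments are declared as doubles, so a
// trace can call it directly with unboxed values and no conversions.
double js_math_pow(double base, double exponent) {
    return std::pow(base, exponent);
}

struct TypedNative {
    const char *argTypes; // "dd" = (double, double)
    void *fn;             // entry point called directly from a trace
};

static const TypedNative js_math_pow_info = { "dd", (void *)&js_math_pow };
```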
Currently, we do not support calling native property get and set
override functions or DOM functions directly from trace. Support
is planned as future work.
6.6 Correctness
During development, we had access to existing JavaScript test
suites, but most of them were not designed with tracing VMs in
mind and contained few loops.
One tool that helped us greatly was Mozilla’s JavaScript fuzz
tester, JSFUNFUZZ, which generates random JavaScript programs
by nesting random language elements. We modified JSFUNFUZZ
to generate loops, and also to test more heavily certain constructs
we suspected would reveal flaws in our implementation. For example,
we suspected bugs in TraceMonkey’s handling of type-unstable
Figure 10. Speedup vs. a baseline JavaScript interpreter (SpiderMonkey) for our trace-based JIT compiler, Apple’s SquirrelFish Extreme
inline threading interpreter and Google’s V8 JS compiler. Our system generates particularly efficient code for programs that benefit most from
type specialization, which includes SunSpider Benchmark programs that perform bit manipulation. We type-specialize the code in question
to use integer arithmetic, which substantially improves performance. For one of the benchmark programs we execute 25 times faster than
the SpiderMonkey interpreter, and almost 5 times faster than V8 and SFX. For a large number of benchmarks all three VMs produce similar
results. We perform worst on benchmark programs that we do not trace and instead fall back onto the interpreter. This includes the recursive
benchmarks access-binary-trees and control-flow-recursive, for which we currently don’t generate any native code.
In particular, the bitops benchmarks are short programs that perform
many bitwise operations, so TraceMonkey can cover the entire
program with 1 or 2 traces that operate on integers. TraceMonkey
runs all the other programs in this set almost entirely as native
code.
regexp-dna is dominated by regular expression matching,
which is implemented in all 3 VMs by a special regular expression
compiler. Thus, performance on this benchmark has little relation
to the trace compilation approach discussed in this paper.
TraceMonkey’s smaller speedups on the other benchmarks can
be attributed to a few specific causes:
• The implementation does not currently trace recursion, so
TraceMonkey achieves a small speedup or no speedup on
benchmarks that use recursion extensively: 3d-cube, 3d-raytrace,
access-binary-trees, string-tagcloud, and
controlflow-recursive.
• The implementation does not currently trace eval and some
other functions implemented in C. Because date-format-tofte
and date-format-xparb use such functions in their
main loops, we do not trace them.
• The implementation does not currently trace through regular
expression replace operations. The replace function can be
passed a function object used to compute the replacement text.
Our implementation currently does not trace functions called
as replace functions. The run time of string-unpack-code is
dominated by such a replace call.
• Two programs trace well, but have a long compilation time.
access-nbody forms a large number of traces (81). crypto-md5
forms one very long trace. We expect to improve performance
on these programs by improving the compilation speed of nanojit.
• Some programs trace very well, and speed up compared to
the interpreter, but are not as fast as SFX and/or V8, namely
bitops-bits-in-byte, bitops-nsieve-bits, access-fannkuch,
access-nsieve, and crypto-aes. The reason is
not clear, but all of these programs have nested loops with
small bodies, so we suspect that the implementation has a relatively
high cost for calling nested traces. string-fasta traces
well, but its run time is dominated by string processing builtins,
which are unaffected by tracing and seem to be less efficient in
SpiderMonkey than in the two other VMs.
Detailed performance metrics. In Figure 11 we show the fraction
of instructions interpreted and the fraction of instructions executed
as native code. This figure shows that for many programs, we
are able to execute almost all the code natively.
Figure 12 breaks down the total execution time into four activities:
interpreting bytecodes while not recording, recording traces
(including time taken to interpret the recorded trace), compiling
traces to native code, and executing native code traces.
These detailed metrics allow us to estimate parameters for a
simple model of tracing performance. These estimates should be
considered very rough, as the values observed on the individual
benchmarks have large standard deviations (on the order of the
                          Loops  Trees  Traces  Aborts  Flushes  Trees/Loop  Traces/Tree  Traces/Loop  Speedup
3d-cube                      25     27      29       3        0         1.1          1.1          1.2    2.20x
3d-morph                      5      8       8       2        0         1.6          1.0          1.6    2.86x
3d-raytrace                  10     25     100      10        1         2.5          4.0         10.0    1.18x
access-binary-trees           0      0       0       5        0           -            -            -    0.93x
access-fannkuch              10     34      57      24        0         3.4          1.7          5.7    2.20x
access-nbody                  8     16      18       5        0         2.0          1.1          2.3    4.19x
access-nsieve                 3      6       8       3        0         2.0          1.3          2.7    3.05x
bitops-3bit-bits-in-byte      2      2       2       0        0         1.0          1.0          1.0   25.47x
bitops-bits-in-byte           3      3       4       1        0         1.0          1.3          1.3    8.67x
bitops-bitwise-and            1      1       1       0        0         1.0          1.0          1.0   25.20x
bitops-nsieve-bits            3      3       5       0        0         1.0          1.7          1.7    2.75x
controlflow-recursive         0      0       0       1        0           -            -            -    0.98x
crypto-aes                   50     72      78      19        0         1.4          1.1          1.6    1.64x
crypto-md5                    4      4       5       0        0         1.0          1.3          1.3    2.30x
crypto-sha1                   5      5      10       0        0         1.0          2.0          2.0    5.95x
date-format-tofte             3      3       4       7        0         1.0          1.3          1.3    1.07x
date-format-xparb             3      3      11       3        0         1.0          3.7          3.7    0.98x
math-cordic                   2      4       5       1        0         2.0          1.3          2.5    4.92x
math-partial-sums             2      4       4       1        0         2.0          1.0          2.0    5.90x
math-spectral-norm           15     20      20       0        0         1.3          1.0          1.3    7.12x
regexp-dna                    2      2       2       0        0         1.0          1.0          1.0    4.21x
string-base64                 3      5       7       0        0         1.7          1.4          2.3    2.53x
string-fasta                  5     11      15       6        0         2.2          1.4          3.0    1.49x
string-tagcloud               3      6       6       5        0         2.0          1.0          2.0    1.09x
string-unpack-code            4      4      37       0        0         1.0          9.3          9.3    1.20x
string-validate-input         6     10      13       1        0         1.7          1.3          2.2    1.86x
Figure 13. Detailed trace recording statistics for the SunSpider benchmark set.
mean). We exclude regexp-dna from the following calculations,
because most of its time is spent in the regular expression matcher,
which has very different performance characteristics from the
other programs. (Note that this exclusion changes the results by
only about 10%.) Dividing the total execution time in processor
clock cycles by the number of bytecodes executed in the base
interpreter shows that on average, a bytecode executes in about
35 cycles. Native traces take about 9 cycles per bytecode, a 3.9x
speedup over the interpreter.
Using similar computations, we find that trace recording takes
about 3800 cycles per bytecode, and compilation 3150 cycles per
bytecode. Hence, during recording and compiling the VM runs at
about 1/200 the speed of the interpreter. Because it costs 6950 cycles
to record and compile a bytecode, and we save 26 cycles each time
that code is run natively, we break even after running a trace about
270 times.
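The arithmetic behind these figures can be sketched directly from the cycle counts quoted above; the following Python fragment is only a back-of-the-envelope model using the paper's rough averages, not exact measurements:

```python
# Break-even model for trace compilation, from the per-bytecode cycle
# estimates in the text (rough averages, large per-benchmark variance).
INTERP_CYCLES = 35     # base interpreter: cycles per bytecode
NATIVE_CYCLES = 9      # compiled trace: cycles per bytecode
RECORD_CYCLES = 3800   # one-time recording cost per bytecode
COMPILE_CYCLES = 3150  # one-time compilation cost per bytecode

speedup = INTERP_CYCLES / NATIVE_CYCLES          # ~3.9x native vs. interpreter
saved_per_run = INTERP_CYCLES - NATIVE_CYCLES    # 26 cycles saved per native run
one_time_cost = RECORD_CYCLES + COMPILE_CYCLES   # 6950 cycles to trace-compile
break_even = -(-one_time_cost // saved_per_run)  # ceiling division

print(f"{speedup:.1f}x speedup; break even after ~{break_even} runs")
```

The exact ceiling works out to 268 runs, which the text rounds to about 270.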
The other VMs we compared with achieve an overall speedup
of 3.0x relative to our baseline interpreter. Our estimated native
code speedup of 3.9x is significantly better. This suggests that
our compilation techniques can generate more efficient native code
than any other current JavaScript VM.
These estimates also indicate that our startup performance could
be substantially better if we improved the speed of trace recording
and compilation. The estimated 200x slowdown for recording and
compilation is very rough, and may be influenced by startup factors
in the interpreter (e.g., caches that have not warmed up yet during
recording). One observation supporting this conjecture is that in
the tracer, interpreted bytecodes take about 180 cycles to run. Still,
recording and compilation are clearly both expensive, and a better
implementation, possibly including a redesign of the LIR abstract
syntax or encoding, would improve startup performance.
Our performance results confirm that type specialization using
trace trees substantially improves performance. We are able to
outperform the fastest available JavaScript compiler (V8) and the
fastest available JavaScript inline threaded interpreter (SFX) on 9
of 26 benchmarks.
8. Related Work

Trace optimization for dynamic languages. The closest area of
related work is on applying trace optimization to type-specialize
dynamic languages. Existing work shares the idea of generating
type-specialized code speculatively with guards along interpreter
traces.
To our knowledge, Rigo's Psyco (16) is the only published
type-specializing trace compiler for a dynamic language (Python).
Psyco does not attempt to identify hot loops or inline function calls.
Instead, Psyco transforms loops to mutual recursion before running
and traces all operations.
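The loops-to-mutual-recursion idea can be sketched as follows; this is an illustrative Python fragment with hypothetical names (`loop_header`, `loop_body`), not Psyco's actual machinery. The point is that once the loop is a pair of mutually recursive functions, a specializer that works at function entries naturally covers every iteration:

```python
# Sketch of "loops to mutual recursion": the loop test and the loop body
# become two mutually recursive functions, so each iteration passes through
# a function entry where operand types can be observed and specialized on.

def loop_header(i, n, acc):
    # corresponds to the loop test: while i <= n: ...
    if i > n:
        return acc
    return loop_body(i, n, acc)

def loop_body(i, n, acc):
    # one iteration of the body, then back to the header
    return loop_header(i + 1, n, acc + i)

# equivalent to: acc = 0; for i in range(1, 101): acc += i
print(loop_header(1, 100, 0))  # → 5050
```

Because every iteration is a call, this approach needs no separate notion of a hot loop, at the cost of tracing all operations rather than only hot paths.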
Pall's LuaJIT is a Lua VM in development that uses trace
compilation ideas (1). There are no publications on LuaJIT, but the
creator has told us that LuaJIT's design is similar to our system's,
except that it will use less aggressive type speculation (e.g., using a
floating-point representation for all number values) and does not
generate nested traces for nested loops.
General trace optimization. General trace optimization has
a longer history, treating mostly native code and typed
languages like Java. Thus, these systems have focused less on type
specialization and more on other optimizations.
Dynamo (7), by Bala et al., introduced native code tracing as a
replacement for profile-guided optimization (PGO). A major goal
was to perform PGO online so that the profile was specific to
the current execution. Dynamo used loop headers as candidate hot
traces, but did not try to create loop traces specifically.
Trace trees were originally proposed by Gal et al. (11) in the
context of Java, a statically typed language. Their trace trees actually
inlined parts of outer loops within the inner loops (because
not be interpreted as necessarily representing the official views,
policies or endorsements, either expressed or implied, of the National
Science Foundation (NSF), any other agency of the U.S. Government,
or any of the companies mentioned above.
References

[1] LuaJIT roadmap 2008 - http://lua-users.org/lists/lua-l/2008-02/msg00051.html.
[2] Mozilla - Firefox web browser and Thunderbird email client - http://www.mozilla.com.
[3] SPECJVM98 - http://www.spec.org/jvm98/.
[4] SpiderMonkey (JavaScript-C) Engine - http://www.mozilla.org/js/spidermonkey/.
[5] Surfin' Safari - Blog Archive - Announcing SquirrelFish Extreme - http://webkit.org/blog/214/introducing-squirrelfish-extreme/.
[6] A. Aho, R. Sethi, J. Ullman, and M. Lam. Compilers: Principles, Techniques, and Tools, 2006.
[7] V. Bala, E. Duesterwald, and S. Banerjia. Dynamo: A Transparent Dynamic Optimization System. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation, pages 1–12. ACM Press, 2000.
[8] M. Berndl, B. Vitale, M. Zaleski, and A. Brown. Context Threading: A Flexible and Efficient Dispatch Technique for Virtual Machine Interpreters. In Code Generation and Optimization, 2005. CGO 2005. International Symposium on, pages 15–26, 2005.
[9] C. Chambers and D. Ungar. Customization: Optimizing Compiler Technology for SELF, a Dynamically-Typed Object-Oriented Programming Language. In Proceedings of the ACM SIGPLAN 1989 Conference on Programming Language Design and Implementation, pages 146–160. ACM New York, NY, USA, 1989.
[10] A. Gal. Efficient Bytecode Verification and Compilation in a Virtual Machine. PhD thesis, University of California, Irvine, 2006.
[11] A. Gal, C. W. Probst, and M. Franz. HotpathVM: An Effective JIT Compiler for Resource-Constrained Devices. In Proceedings of the International Conference on Virtual Execution Environments, pages 144–153. ACM Press, 2006.
[12] C. Garrett, J. Dean, D. Grove, and C. Chambers. Measurement and Application of Dynamic Receiver Class Distributions. 1994.
[13] J. Ha, M. R. Haghighat, S. Cong, and K. S. McKinley. A Concurrent Trace-based Just-in-Time Compiler for JavaScript. Dept. of Computer Sciences, The University of Texas at Austin, TR-09-06, 2009.
[14] B. McCloskey. Personal communication.
[15] I. Piumarta and F. Riccardi. Optimizing Direct Threaded Code by Selective Inlining. In Proceedings of the ACM SIGPLAN 1998 Conference on Programming Language Design and Implementation, pages 291–300. ACM New York, NY, USA, 1998.
[16] A. Rigo. Representation-Based Just-In-Time Specialization and the Psyco Prototype for Python. In PEPM, 2004.
[17] M. Salib. Starkiller: A Static Type Inferencer and Compiler for Python. Master's thesis, 2004.
[18] T. Suganuma, T. Yasue, and T. Nakatani. A Region-Based Compilation Technique for Dynamic Compilers. ACM Transactions on Programming Languages and Systems (TOPLAS), 28(1):134–174, 2006.
[19] M. Zaleski, A. D. Brown, and K. Stoodley. YETI: A graduallY Extensible Trace Interpreter. In Proceedings of the International Conference on Virtual Execution Environments, pages 83–93. ACM Press, 2007.