INSTITUT FÜR INFORMATIK
Softwaretechnik und Programmiersprachen
Universitätsstr. 1, D-40225 Düsseldorf

Memory Allocation and Access Patterns in Dynamic Languages

Christian Boland

Master's thesis

Start of work: 17 April 2012
Submission: 31 October 2012
Examiners: Prof. Dr. Michael Leuschel, Prof. Dr. Michael Schöttner


Declaration

I hereby declare that I have written this master's thesis independently and that I have used no sources or aids other than those indicated.

Düsseldorf, 31 October 2012

Christian Boland


Abstract

Integer tagging is an optimization for virtual machines that improves performance and additionally reduces the memory footprint. Although used by many virtual machines, the effects of integer tagging have not been analysed in detail yet. This thesis evaluates how integer tagging influences the execution times of an existing Smalltalk interpreter. For this purpose, different versions of the interpreter have been created and analysed in detail using benchmarks. The results indicate that if the interpreter already offers fast allocations by using a modern garbage collector, almost no speedups can be achieved. In some cases the performance with integer tagging even decreases.


Contents

1 Introduction
2 Background
  2.1 Smalltalk
  2.2 SOM
  2.3 Tagging
    2.3.1 Integer Tagging
    2.3.2 NaN-tagging
3 Tagging Implementation
4 Implementation of Integer Caching
5 General Optimizations and Modifications
  5.1 Lightweight Base Class for Simple VMObjects
  5.2 Lookup Cache
  5.3 Support for Different Configurations and Other Architectures
6 Evaluation
  6.1 Description of Benchmarks
  6.2 The Characteristics of Integer Usage
  6.3 Evaluation of Benchmarks
  6.4 Integer Caching as Alternative
  6.5 Tagging on Other Architectures
7 Related Work
8 Conclusion
  8.1 Threats to Validity
Bibliography
List of Figures


1 Introduction

Over the last years dynamic programming languages like JavaScript, Python or Ruby have gained more and more attention. Reasons for the increased popularity are, for example, the dynamic typing of objects and the automatic memory management that comes with most of these languages. These features can speed up the development of software considerably. However, they do not come for free: they incur a certain overhead and prevent several static optimizations that are possible in languages like Java or C. Therefore other optimizations have been introduced to minimize this overhead.

One of these optimizations, used by some virtual machines (VMs), is integer tagging, which tries to avoid boxing of integers. Boxing of primitive data types is needed in dynamic languages, since the type of an object must be determinable at runtime. Although used by several VMs, the influence of integer tagging on performance has not been evaluated in detail so far.

The aim of this thesis is to analyse the performance impact of integer tagging for interpreters. To this end an existing Smalltalk VM is extended with integer tagging, and the performance of different configurations of this VM is measured using benchmarks. The evaluation of the results focuses on interpreters, as VMs using just-in-time (JIT) compilation have different trade-offs.

The implementation of integer tagging is described in Section 3. The idea of integer caching, an alternative optimization to reduce the memory footprint, and its implementation are described in Section 4. Section 5 describes some important additional optimizations and modifications implemented for the SOM++ virtual machine. The performance of integer tagging is then evaluated in Section 6. First the benchmarks and the characteristics of their integer usage are described. Then the benchmark results are evaluated. This section ends with a short evaluation of integer caching and of integer tagging on x86-64 and ARM based systems. Section 7 lists related research on integer tagging. Finally, Section 8 concludes this thesis.


2 Background

2.1 Smalltalk

Smalltalk is a dynamically typed, purely object-oriented programming language and a full development environment at the same time. It was created by the Learning Research Group at Xerox PARC during the 1970s and made publicly available as Smalltalk-80 [1] in 1982.

Smalltalk influenced many other programming languages like Java, Objective-C and Python, as well as Windows and the Macintosh with its windowing system. Some of its key concepts are:

• everything is an object (including integers, strings, booleans and methods)
• everything happens by sending messages
• single inheritance
• you can change everything, even running programs

Major Smalltalk implementations nowadays are Pharo (http://www.pharo-project.org), Squeak (http://squeak.org), Cincom Smalltalk (http://www.cincomsmalltalk.com) and GNU Smalltalk (http://smalltalk.gnu.org). Figure 1 shows Pharo Smalltalk running unit tests.

2.2 SOM

In this thesis the SOM++ virtual machine is used as a basis. It is implemented in C++ [2] and is part of the SOM (Simple Object Machine) virtual machine family [3], whose members all host the same Smalltalk dialect but are implemented in different languages. SOM was developed at the University of Århus and the Hasso-Plattner-Institut in Potsdam (http://hpi.uni-potsdam.de/hirschfeld/projects/som/) with a focus on education and research. Therefore its architecture and implementation were kept simple, which eases modifications to the VM and the evaluation of their effects. SOM's instruction set architecture, for example, consists of only 16 bytecode instructions, which are executed by a simple bytecode interpreter [4].

For the author's bachelor thesis [5], SOM++, which originally used a mark-sweep garbage collector, was extended by a copying and a generational garbage collector. The garbage collector implementation to use can be chosen by setting compile switches. The implementations and their key advantages and disadvantages are explained briefly below.


Figure 1: Pharo Smalltalk running unit tests


mark-sweep garbage collector: SOM++ uses a fairly straightforward implementation of the mark-sweep algorithm [6]. Allocation of objects is simply handed over to stdlib's malloc. Reclamation of garbage is achieved in two steps. First, all live objects are traversed starting from the roots and marked by setting a mark-bit inside the object header. In a second step, all objects whose mark-bit is unset are deleted. This results in relatively expensive allocations, since malloc requests additionally have to avoid fragmentation, which costs time. The complexity of the cleanup phase is proportional to the overall number of allocated objects (including garbage), since all objects have to be visited when being marked or deleted.

copying garbage collector: As a second alternative a semispace copying collector [7] is implemented. In this scheme, the heap is divided into two semispaces called fromspace and tospace. Objects are allocated linearly in tospace as long as there is sufficient room. In case of a garbage collection the two semispaces change roles. Starting from the roots, all live objects are then copied from the new fromspace to tospace. The copying garbage collector offers very fast allocation, which is achieved by simply bumping a pointer to the next free location. Fragmentation can be ignored, since memory is always compacted after a collection. The complexity of a collection is related to the number of live objects, since all other objects are freed indirectly by switching the roles of the semispaces. The main drawback of this approach is that objects surviving multiple collections have to be copied from one semispace to the other every time.

generational garbage collector: The generational garbage collector [8] combines a copying collector with a mark-sweep collector. New (young) objects are allocated inside a contiguous part of memory called the nursery, which corresponds to one semispace of the copying collector. Once the nursery becomes full, a minor collection is triggered which copies all live objects out of the nursery. These objects are then considered old and are collected by a mark-sweep collector. This scheme offers very fast allocation like the copying collector, but does not suffer from having to copy old objects again and again, since the mark-sweep collector just leaves objects in place. However, it should be noted that old and new objects cannot be collected independently, since there might be intergenerational references. These references have to be tracked, which imposes an additional overhead.

2.3 Tagging

One characteristic of dynamic languages is that "everything is an object". This means that primitive data types like integers or floating point numbers must also behave like regular objects. For example, they have a class, can receive function calls and are normally garbage collected. To achieve this behavior, primitive data types are usually boxed into regular objects. Figure 2 shows a point object that does not store its coordinates as primitives but points to boxed integer instances.

Obviously, boxing of primitive values has the disadvantage that it causes allocations of heap objects and therefore increases the overall memory usage of the program. This is especially disadvantageous in the case of integers, since they are often allocated for intermediate results only.


Figure 2: A point object referencing two boxed integers

Another characteristic of dynamic languages is that variables are typeless. Unlike in statically typed languages, where the types of variables are known at compile time, dynamically typed languages have to perform type checks at runtime, which usually involves calling a method like GetType() on the object.

2.3.1 Integer Tagging

Integer tagging is a technique that tries to circumvent the allocation of boxed integer instances. To achieve this, the memory alignment of objects is exploited. Most modern CPUs expect data structures to be aligned at multiples of the word size (for example 4 bytes on 32-bit machines). Reading from or writing to unaligned addresses is considerably slower [9] or may even result in an error. Therefore compilers usually add padding to data structures, and memory requests return aligned addresses. One side effect of data alignment is that the last bits of pointers to objects are known to be zero (addresses are usually required to be word-aligned, which results in two zero bits on 32-bit systems and three zero bits on 64-bit systems).

This allows the last bit to be used to distinguish between pointers to regular objects and integers. If, for example, this bit is set, the remaining bits can be used to encode the integer value. All pointers with the last bit unset point to regular objects. Figure 3 shows examples of how tagged and untagged pointers are interpreted.

Figure 3: Examples of a tagged integer with value 2 (tag bit set, 31 bits encode the value) and an ordinary pointer to the object at 0xf7291c8c (tag bit unset)


One advantage of integer tagging is that the allocation of boxed integers is not necessary in most cases. Besides saving the time for allocation, this also reduces the number of allocated objects and thus lowers the pressure on the garbage collector. Moreover, in some cases, like the addition of integers, no heap objects have to be accessed at all, since the values are already encoded in the "pointers".

But integer tagging also has some downsides. Whenever the interpreter makes calls to objects, a tag-check has to be performed. If the tag-check succeeds (i.e. in case of a tagged integer), the call has to be faked in some way, since there is no real object to call. Also, the garbage collector has to distinguish between references to heap objects and tagged integers, which adds some overhead. Furthermore, it is not possible to represent the whole integer range in a tagged integer, since the last bit is always used as the tag bit. On 32-bit CPUs, for example, the integer value must be encodable in 31 bits, otherwise a boxed integer has to be created. It should also be noted that integer tagging requires integers to be immutable objects.

Integer tagging can be implemented in two ways. One way is to indicate a tagged integer by setting the last bit to 1. This approach favours pointers over other objects, since they can be used without modification; to retrieve the integer value from the tagged representation, it has to be bit-shifted once. With the other way (the last bit indicates a pointer), the tag bit has to be masked off before the pointer can be used.

Tagged integers are for example used in several Smalltalk VMs and in JavaScript engines like Google's V8 [10] or older Mozilla engines [11]. The SPARC [12] processor architecture even has support for tagged integer operations.

2.3.2 NaN-tagging

Another variant of tagging, used for example in LuaJIT [13] and Mozilla's JägerMonkey [11], is NaN-tagging. Here pointers are represented as double precision IEEE 754 [14] floating point numbers. According to the standard, such a number is represented by 1 sign bit, followed by 11 exponent bits and 52 fraction bits. Besides floating point numbers, special bit patterns are defined to represent certain non-numbers. Positive and negative infinity are represented by all exponent bits set and all fraction bits unset. All exponent bits set and at least one fraction bit set represents "not a number" (NaN), which is the return value of invalid operations.
For example, the square root of a negative number, the logarithm of a negative number, the division 0/0 or any other operation involving NaNs returns NaN. Figure 4 shows the binary representations of the value π and the special values ±∞ and NaN.

Although there are 2^53 − 1 bit patterns that represent NaN, current software and hardware practically emit only one canonical NaN value. This allows the remaining NaN representations to be used for tagging objects.

One obvious way to implement NaN-tagging on 32-bit machines is to use the first 20 of the fraction bits as tag bits and the remaining 32 bits to encode the value, since this is the natural word size of 32-bit systems. Pointers and integers can then easily be read without masking off other bits. Additionally, the large number of possible tags allows not only integers to be tagged, but also other types such as strings or the special objects true and false. Figure 5 shows the binary representations of the integer 52, the true object, a string and one other object using NaN-tagging on 32-bit machines.


Figure 4: Binary representation of the IEEE 754 floating point values π, ±∞ and NaN

Figure 5: Examples of 32-bit NaN-tagging


Using NaN-tagging on 64-bit machines seems impossible at first glance, since 64-bit pointers cannot be encoded in the only 53 bits available. However, most operating systems and processors do not support the full 64-bit address range. AMD's current x86-64 architecture, for example, only supports 48-bit virtual addresses [15], and Microsoft Windows even makes use of 44-bit virtual addresses only. This allows 5 bits to be used for tagging, leaving 48 bits to encode any pointer. Of course, 64-bit integers cannot be encoded completely: integers needing more than 48 bits have to be boxed. Figure 6 shows the binary representations of the integer 52, the true object, a string and one other object using NaN-tagging on 64-bit machines.

Figure 6: Examples of 64-bit NaN-tagging

Since NaN-tagging by its nature favours floating point values over pointers, it makes most sense if a large percentage of objects represents floating point values. This is, for example, the case for JavaScript programs, as there even integers are represented by floating point numbers.
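To make the 64-bit encoding more concrete, the following is a small, self-contained sketch of such a scheme in C++. It is illustrative only and not taken from LuaJIT, JägerMonkey or SOM++; the tag values, helper names and the 48-bit payload layout are assumptions.

    #include <cstdint>
    #include <cstring>

    // Hypothetical 64-bit NaN-tagging sketch. Plain doubles are stored as-is;
    // every value whose top bits match the quiet-NaN pattern below carries a
    // 3-bit type tag and a 48-bit payload (pointer or small integer).
    typedef uint64_t Value;

    const uint64_t QUIET_NAN    = 0xFFF8000000000000ULL; // sign, exponent and quiet bit set
    const int      TAG_SHIFT    = 48;
    const uint64_t TAG_MASK     = 0x7ULL << TAG_SHIFT;
    const uint64_t PAYLOAD_MASK = (1ULL << TAG_SHIFT) - 1;
    const uint64_t TAG_INT      = 1ULL << TAG_SHIFT;      // illustrative tag values
    const uint64_t TAG_OBJECT   = 2ULL << TAG_SHIFT;

    // the canonical NaN emitted by hardware has the sign bit clear,
    // so it is still classified as a plain double here
    inline bool isDouble(Value v) { return (v & QUIET_NAN) != QUIET_NAN; }
    inline bool isInt(Value v)    { return !isDouble(v) && (v & TAG_MASK) == TAG_INT; }
    inline bool isObject(Value v) { return !isDouble(v) && (v & TAG_MASK) == TAG_OBJECT; }

    inline Value  boxDouble(double d)  { Value v; std::memcpy(&v, &d, sizeof v); return v; }
    inline double unboxDouble(Value v) { double d; std::memcpy(&d, &v, sizeof d); return d; }

    // integers wider than 48 bits would have to fall back to a boxed object
    inline Value   boxInt(int64_t i) { return QUIET_NAN | TAG_INT | ((uint64_t)i & PAYLOAD_MASK); }
    inline int64_t unboxInt(Value v) { return (int64_t)(v << 16) >> 16; } // sign-extend the 48-bit payload

    inline Value boxObject(void* p)   { return QUIET_NAN | TAG_OBJECT | ((uint64_t)(uintptr_t)p & PAYLOAD_MASK); }
    inline void* unboxObject(Value v) { return (void*)(uintptr_t)(v & PAYLOAD_MASK); }

A real implementation would reserve further tags for strings, true, false and so on, as sketched in Figures 5 and 6.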


3 Tagging Implementation

To evaluate the performance effects of tagged representations, I implemented integer tagging. Although NaN-tagging might gain an additional advantage by tagging more types than just integers, the evaluation of integer tagging is a good indicator for the performance of tagging in general.

Another reason not to implement NaN-tagging is the complexity of the implementation, especially for an existing virtual machine. Since tagging is a cross-cutting concern [16], it affects the source code in a large number of places. Besides the special treatment of tagged objects in the garbage collector and in primitives, a tag-check must be performed before every call to a potentially tagged object. This makes the implementation of tagging a time-consuming and often tedious task, since forgetting to check for tagged objects will usually crash the program.

To represent tagged integers, two's complement notation is used. On 32-bit systems this allows all values to be tagged that can be represented in 31-bit two's complement representation, i.e. the range [−1073741824, 1073741823]. To ease the implementation of integer tagging, helper macros have been defined. IS_TAGGED checks if X is a tagged integer, and AS_POINTER casts X to an object pointer. The macros TAG_INTEGER and UNTAG_INTEGER convert between integer values and their tagged representations. Listing 1 shows the implementation of these macros.

Listing 1: Implementation of the helper macros

    /**
     * max value for tagged integers
     * 01111111 11111111 11111111 1111111 X
     */
    #define VMTAGGEDINTEGER_MAX  1073741823

    /**
     * min value for tagged integers
     * 10000000 00000000 00000000 0000000 X
     */
    #define VMTAGGEDINTEGER_MIN -1073741824

    // check if the last bit is set
    #define IS_TAGGED(X) (((long)X) & 1)

    // simple cast
    #define AS_POINTER(X) ((AbstractVMObject*)X)

    // return a tagged integer if the value is in the valid range,
    // otherwise fall back to a boxed integer allocated on the heap
    #define TAG_INTEGER(X)                                      \
        ((X >= VMTAGGEDINTEGER_MIN && X <= VMTAGGEDINTEGER_MAX) \
            ? ((pVMInteger)(((X) << 1) | 1))                    \
            : (new (_HEAP) VMInteger(X)))

    // return the plain integer value of a tagged or boxed integer
    #define UNTAG_INTEGER(X)                        \
        (IS_TAGGED(X)                               \
            ? (((long)X) >> 1)                      \
            : (((VMInteger*)X)->GetEmbeddedInteger()))


Most of the implementation effort was spent inserting checks of whether a pointer is a tagged integer before calling functions on it, since calling functions on tagged integers would cause segmentation faults. If a call to a tagged integer is encountered, it is executed on a prototypical boxed integer instance. Some calls can even be skipped, since the result is known in advance and the call is side-effect free. For example, calls to GetClass() always return the object representing the class of integers. Listing 2 shows examples of these checks.

Listing 2: Examples of tag-checks

    // GlobalBox::IntegerBox() returns a prototypical boxed integer instance
    if (IS_TAGGED(sender))
        GlobalBox::IntegerBox()->Send(eB, arguments, 1);
    else
        AS_POINTER(sender)->Send(eB, arguments, 1);

    // example where the call can be skipped entirely
    pVMClass receiverClass = IS_TAGGED(receiver)
        ? integerClass
        : AS_POINTER(receiver)->GetClass();

Another part that has to be modified is the garbage collector. Tagged integers are not allocated as heap objects and therefore do not have to be garbage collected. This means that the implemented garbage collectors must ignore them. Listing 3 shows the marking phase of the mark-sweep collector.

Listing 3: Marking phase of the mark-sweep collector

    VMOBJECT_PTR mark_object(VMOBJECT_PTR obj) {
        // ignore tagged integers
        if (IS_TAGGED(obj))
            return obj;
        // also ignore objects that have been visited before
        if (obj->GetGCField())
            return obj;

        // mark this object and the objects it references
        obj->SetGCField(GC_MARKED);
        obj->WalkObjects(mark_object);
        return obj;
    }
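As a final illustration of the fast path mentioned above, the following hypothetical sketch shows how an integer addition primitive could be built on the macros from Listing 1. It is not the actual SOM++ primitive: the function signature and the pVMObject type are assumptions, and a real primitive would receive its operands via the interpreter's frame.

    // Hypothetical sketch, not the real SOM++ primitive: adds two integers that
    // may each be tagged or boxed. If both are tagged, no heap object is read.
    pVMObject integerPlus(pVMObject left, pVMObject right) {
        long a = UNTAG_INTEGER(left);   // shift for tagged values, GetEmbeddedInteger() for boxed ones
        long b = UNTAG_INTEGER(right);
        long sum = a + b;
        // TAG_INTEGER falls back to allocating a boxed VMInteger when the
        // result does not fit into the 31-bit tagged range.
        return (pVMObject) TAG_INTEGER(sum);
    }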


4 Implementation of Integer Caching

An alternative optimization to save heap space is integer caching. The key observation behind integer caching is that most integer objects store values in a small range. This results in multiple instances of integer objects all storing the same value. As can be seen later in Section 6.2, this is a valid assumption for SOM++ as well.

To circumvent this issue, some virtual machines preallocate integers in a certain range. When an integer object within this range is needed, the preallocated instance is returned instead of boxing a new one. As with integer tagging, it should be noted that integer caching also requires integers to be immutable objects.

Listing 4: Smalltalk example where boxed integer objects can be reused

    point := Point new.
    point pos_x: 36.
    point pos_y: 36.
    point pos_z: 36.

Listing 4 shows an example where a three-dimensional point is constructed. Without integer caching, three different integer objects would be allocated, all storing the value 36, as can be seen in Figure 7. With integer caching only one instance storing the value 36 is created, so the allocation of at least two boxed integer instances can be avoided, as shown in Figure 8.

Figure 7: Object graph without integer caching

Figure 8: Object graph with integer caching

Integer caching is for example used in CPython, which caches the values from -5 to 256 (see http://hg.python.org/cpython/file/tip/Objects/longobject.c), and in several Java VMs, which cache the values from -128 to 127 (see http://hg.openjdk.java.net/jdk6/jdk6-gate/jdk/file/tip/share/classes/java/lang/Integer.java).
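Before looking at the lookup path in Listing 5 below, it helps to picture how such a cache could be set up once at VM startup. The following is only a sketch: the array name prebuildInts and the range constants are taken from Listing 5, while the method name and the surrounding details are assumptions, not the actual SOM++ code.

    // Hypothetical initialization sketch: allocate one boxed VMInteger per value
    // in the cached range once at startup. These shared instances live for the
    // whole VM run, so they must be reachable as garbage collection roots (or
    // live outside the collected heap).
    void Universe::InitializeIntegerCache() {
        const long count = INT_CACHE_MAX_VALUE - INT_CACHE_MIN_VALUE;
        prebuildInts = new pVMInteger[count];
        for (long i = 0; i < count; ++i) {
            prebuildInts[i] = new (_HEAP) VMInteger(INT_CACHE_MIN_VALUE + i);
        }
    }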


Listing 5 shows how the allocation of integers is implemented with integer caching in SOM++. Before a new boxed integer is allocated, it is tested whether the desired value lies inside the range of cached integers. If so, an integer from prebuildInts, an array containing all cached integers which is allocated on VM startup, is returned. Otherwise a new boxed integer has to be created.

Listing 5: Implementation of cached integers

    // Universe::NewInteger is called to create integer instances
    pVMInteger Universe::NewInteger(long value) const {
        unsigned long index =
            (unsigned long)value - (unsigned long)INT_CACHE_MIN_VALUE;
        if (index < (unsigned long)(INT_CACHE_MAX_VALUE - INT_CACHE_MIN_VALUE)) {
            return prebuildInts[index];
        }
        return new (_HEAP) VMInteger(value);
    }

To keep the overhead of integer caching small, the range check has been implemented using only one comparison. This version is supposed to be faster than the obvious solution, which would be: if (value >= INT_CACHE_MIN_VALUE && value < INT_CACHE_MAX_VALUE).
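The single comparison works because the subtraction is performed on unsigned numbers: it maps the cached range onto [0, range), and every value below the minimum wraps around to a very large unsigned number, so one "less than" test covers both bounds. A small standalone check of the equivalence (the bounds here are example values, not SOM++'s configuration):

    #include <cassert>

    // Example bounds for demonstration only.
    const long INT_CACHE_MIN_VALUE = -5;
    const long INT_CACHE_MAX_VALUE = 256;

    bool inCacheOneCompare(long value) {
        // same shape as the check in Listing 5
        unsigned long index =
            (unsigned long)value - (unsigned long)INT_CACHE_MIN_VALUE;
        return index < (unsigned long)(INT_CACHE_MAX_VALUE - INT_CACHE_MIN_VALUE);
    }

    bool inCacheTwoCompares(long value) {
        // the "obvious" version with two comparisons
        // (upper bound exclusive, matching Listing 5)
        return value >= INT_CACHE_MIN_VALUE && value < INT_CACHE_MAX_VALUE;
    }

    int main() {
        for (long v = -100000; v <= 100000; ++v)
            assert(inCacheOneCompare(v) == inCacheTwoCompares(v));
        return 0;
    }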


5 General Optimizations and Modifications

SOM++ is a Smalltalk VM designed for education and research. The focus was laid on a simple design which makes it easy to understand and to start working with. Unfortunately this approach usually has a negative impact on performance: most optimizations increase the complexity of the implementation and are therefore not used in SOM++. Evaluating tagging with this unoptimized VM would have had two downsides. First, it would be questionable whether conclusions from this analysis would be transferable to other interpreters, since the absence of important optimizations can influence the effect of integer tagging. Additionally, the benefit from using integer tagging might be hardly visible, as it would be dominated by the VM's interpretative overhead.

Besides the implementation of a generational and a copying garbage collector [5] in the author's bachelor thesis, several other optimizations have been applied to SOM++, of which some important ones are described now.

5.1 Lightweight Base Class for Simple VMObjects

One important optimization was the introduction of a lightweight base class for simple objects. Before, all objects representing Smalltalk objects had the same base class called VMObject. This class had members for a hash value, the size of the object, the number of fields this object provides to Smalltalk, and a field storing the class of this object. This resulted in a total size for VMIntegers of 28 bytes (4 bytes for each of the 6 fields + vtable), as can be seen in Figure 9.

Figure 9: Before optimization of base classes

However, for some classes like VMInteger, VMDouble or VMString, the object size can be reduced. Since the class, the object size and the number of fields are fixed, their values don't have to be stored in each object but can be returned with the help of virtual functions. The implementation of GetClass() for VMInteger, for example, will always return the same object called integerClass. Also, the hash can be computed at runtime and doesn't have to be stored for each instance. With this optimization, the size of integer objects could be reduced to 12 bytes (8 bytes for 2 fields + vtable), as shown in Figure 10. This optimization helps to keep the memory footprint small and reduces the pressure on the garbage collector.
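The idea can be summarised in a few lines of C++. This is only a sketch with hypothetical member names, not the actual SOM++ class hierarchy: the constant per-class information is answered by virtual functions, so a simple object carries nothing but its vtable pointer and its actual data.

    #include <cstddef>

    class VMClass;
    extern VMClass* integerClass;   // the single shared class object for integers

    // Sketch of the lightweight base class: no per-instance metadata at all.
    class AbstractVMObject {
    public:
        virtual ~AbstractVMObject() {}
        virtual VMClass* GetClass() const = 0;
        virtual size_t   GetObjectSize() const = 0;
        virtual long     GetHash() const = 0;
    };

    // A simple object like an integer now only stores its GC field and its
    // value: on a 32-bit system that is 12 bytes (vtable pointer + 2 fields)
    // instead of the 28 bytes needed when class, size, field count and hash
    // were stored in every instance.
    class VMInteger : public AbstractVMObject {
    public:
        explicit VMInteger(long value) : gcField(0), embeddedInteger(value) {}
        VMClass* GetClass() const      { return integerClass; }      // constant, not stored
        size_t   GetObjectSize() const { return sizeof(VMInteger); } // constant, not stored
        long     GetHash() const       { return embeddedInteger; }   // computed on demand
        long     GetEmbeddedInteger() const { return embeddedInteger; }
    private:
        long gcField;          // GC bookkeeping (mark bit, forwarding info)
        long embeddedInteger;  // the actual integer value
    };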


Figure 10: After optimization of base classes

5.2 Lookup Cache

Another optimization was the implementation of a lookup cache [17]. Without appropriate optimizations, virtual machines for dynamic languages spend a lot of time determining the correct function for a message send. A naive implementation, for example, would have to fetch the class of the object and walk up the inheritance chain until the right function is found, which can be quite expensive in terms of speed.

Lookup caches reduce the overhead of message sends by mapping (receiver class, method name) pairs to their corresponding functions. In case of a message send, the cache is consulted first. Only if no corresponding entry exists is the expensive lookup procedure performed, and its result is then stored in the cache. Lookup caches have proven to be effective for several Smalltalk virtual machines (a sketch of such a cache is given at the end of this section).

5.3 Support for Different Configurations and Other Architectures

Integer tagging is examined by comparing benchmark results of different configurations of SOM++. All these configurations can be built from the same code base just by setting compile flags. This includes the choice of the garbage collector, integer tagging, and integer caching with the desired caching range. Additionally, one special version exists to generate information on the executed benchmarks, as can be seen in Section 6.2. All together, benchmarks were executed on 19 different configurations of SOM++.

Furthermore, SOM++ was originally developed for 32-bit operating systems only. In order to test performance on other architectures as well, it was modified to run on 64-bit systems, which turned out to be a time-consuming task. Additionally, SOM++ was made compilable for ARM systems, which was much easier since only the makefile had to be adapted.
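To make the lookup cache from Section 5.2 concrete, the sketch below shows one possible shape of such a cache: a small direct-mapped table keyed on the (receiver class, selector) pair. The structure, the size and the LookupInvokable slow path are assumptions for illustration, not the SOM++ implementation.

    #include <cstdint>
    #include <cstddef>

    class VMSymbol;
    class VMMethod;
    class VMClass {
    public:
        // assumed slow path: walks the superclass chain until the method is found
        VMMethod* LookupInvokable(VMSymbol* selector);
    };

    // One cache slot maps a (receiver class, selector) pair to the resolved method.
    struct CacheEntry {
        VMClass*  receiverClass;
        VMSymbol* selector;
        VMMethod* method;
    };

    const size_t CACHE_SIZE = 1024;             // power of two, so masking replaces modulo
    static CacheEntry lookupCache[CACHE_SIZE];  // zero-initialized: all slots start empty

    static size_t slotFor(VMClass* cls, VMSymbol* sel) {
        // cheap hash over the two pointers; collisions simply overwrite the slot
        return (((uintptr_t)cls >> 2) ^ ((uintptr_t)sel >> 2)) & (CACHE_SIZE - 1);
    }

    VMMethod* lookupWithCache(VMClass* cls, VMSymbol* sel) {
        CacheEntry& e = lookupCache[slotFor(cls, sel)];
        if (e.receiverClass == cls && e.selector == sel)
            return e.method;                         // hit: no superclass walk needed
        VMMethod* found = cls->LookupInvokable(sel); // miss: do the expensive lookup
        e.receiverClass = cls;                       // remember the result for next time
        e.selector = sel;
        e.method = found;
        return found;
    }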


6 Evaluation

In order to evaluate the effects of integer tagging, several benchmarks were executed in different configurations. This section first briefly describes the benchmarks and the characteristics of their integer usage. Afterwards the benchmark results are presented and analysed in detail.

6.1 Description of Benchmarks

The SOM++ VM itself contains 16 simple benchmarks:

Bounce: Simulates jumping balls inside a box and counts how often they hit the wall.
BubbleSort: Sorts an array of integers using the bubble sort algorithm.
Dispatch: Tests message dispatching by calling methods in a loop.
Fibonacci: Recursively calculates a Fibonacci number.
IntegerLoop: Performs integer arithmetic in a loop.
List: Creates linked lists and uses them in a recursive algorithm.
Loop: Sums up numbers in two nested loops.
Permute: Permutes items in an array.
Queens: Solves the N-Queens problem.
QuickSort: Sorts an array of integers using the quicksort algorithm.
Recurse: Tests calling recursive functions.
Sieve: Calculates the number of primes using the sieve of Eratosthenes.
Storage: Creates a tree storing integer values.
Sum: Calculates triangular numbers and sums them up in a loop.
Towers: An implementation of the famous Tower of Hanoi. This benchmark moves all disks from the first tower to the fourth.
TreeSort: Creates a sorted binary tree with integer values.

Richards is a well known benchmark which simulates an operating system's task dispatcher. It was originally written by Martin Richards [18] at Cambridge University and has since been translated to several other languages. The benchmark version used here was derived from a Smalltalk version [19].

GCBench is a benchmark designed to test garbage collector performance. It was originally written by John Ellis and Pete Kovac of Post Communications. The version used here is derived from a modified version by Hans Boehm [20]. This benchmark allocates and then drops binary trees of different sizes. Additionally, it keeps some data structures permanently live throughout the benchmark.


6.2 The Characteristics of Integer Usage

Prior to evaluating the benchmark results it is necessary to have a look at some benchmark characteristics; they will help to analyse the performance effects of tagging later. Obviously, one important characteristic is the number of allocated integers, since this is exactly what tagging tries to optimize. Additionally, it gives an impression of the pressure that lies on the garbage collector.

Figure 11: Number of allocated objects for IntegerLoop benchmark

Figure 11 shows the number of allocated instances per class for SOM's IntegerLoop benchmark. Although this is one of the benchmarks with the highest integer allocation rates, it turns out that integers only account for 25% of all allocations; most objects, with 62%, were frames. The ratio of allocated integers is usually even smaller, as can be seen in Figure 12 for the Richards benchmark and in Figure 13 for GCBench. Here integers only account for 2.5% (GCBench) of all objects.

But not only the number of allocated objects is interesting, the size of the allocated objects should also be taken into account. For example, the number of garbage collections triggered depends on the total size of allocated objects. Figure 14 shows the size of allocated objects per class for the IntegerLoop benchmark. Although 25% of the objects were integers, they only account for 6.3% of the allocated space. The majority of 85.3% was used for frames. This can be explained by the size of the objects, which is 12 bytes for integers and 66 bytes on average for frames.

These figures indicate the potential of integer tagging. As it only applies to integer objects, the importance of this optimization is bound to the ratio of allocated integers to other objects. However, it turns out that most allocations are for other types of objects.


Figure 12: Number of allocated objects for Richards benchmark

Figure 13: Number of allocated objects for GCBench benchmark


Figure 14: Size of allocated objects for IntegerLoop benchmark

The high number of frames, for example, results from the tendency towards small methods in Smalltalk programs. In general it can be said that most allocations result from the interpreter implementation and not from domain objects.

Another interesting characteristic is the distribution of allocated integer values. To evaluate this, the values of all allocated integer objects were recorded. Figure 15 shows the distribution of integer values over all SOM benchmarks as a log/log plot, Figure 16 for the Richards benchmark and Figure 17 for GCBench.

One observation is that all values can be represented as tagged integers, since they fall into the encodable range of [−1073741824, 1073741823] (see Section 3 for encoding details). This means that no memory would have to be allocated for storing integers. Furthermore, these histograms are important for integer caching, which is evaluated later in Section 6.4, since they indicate appropriate ranges to cache. For the SOM benchmarks, 49.9% of integers fall into the caching range of [−5, 100]; for Richards it is 53.7% and for GCBench even 89.6%. These values show that the range for caching seems to be well chosen.

The last characteristic evaluated is the number of sends to objects. This is interesting since, before a send (function call) to an object can be made, a tag-check has to be performed to determine whether the pointer points to a real object or is a tagged integer. Furthermore, it is interesting whether the function to be called is implemented as a primitive (on the interpreter level in C++) or in Smalltalk. Sends to integers not implemented by primitives have to be executed on a boxed integer instance; only sends to integers implemented by primitives can benefit from not having to access heap objects.


Figure 15: Histogram of allocated integers for SOM benchmarks

Figure 16: Histogram of allocated integers for Richards benchmark


Figure 17: Histogram of allocated integers for GCBench benchmark

Figure 18: Sends in SOM benchmarks (types of sends for IntegerLoop, primitive vs. non-primitive)


Figure 19: Sends in Richards benchmark (primitive vs. non-primitive)

Figure 20: Sends in GCBench benchmark (primitive vs. non-primitive)


Figure 18 shows all sends executed for SOM's IntegerLoop benchmark. Here 44% of all calls go to integers, and 75% of those are implemented by primitives. This means that in potentially 33% of all sends no heap object has to be accessed. When looking at less integer-heavy benchmarks such as Richards, the percentage of sends to integers implemented by primitives drops significantly. As can be seen in Figure 19, only 9% of sends go to integers, of which 60% are implemented by primitives, so the beneficial effect of not having to access heap objects (in 6% of sends) is rather small. For GCBench, 30% of calls go to integers, of which 73% are implemented as primitives, as shown in Figure 20.


6.3 Evaluation of Benchmarks

In order to evaluate integer tagging performance, all benchmarks described in Section 6.1 were executed on an Intel Core 2 Duo T7300 CPU with 2 GHz and 4 MB cache in a machine with 2 GB RAM running Linux 3.2.0-32. The average execution time was calculated from 20 iterations. Each iteration starts a new SOM++ instance, and the time until termination is measured. Errors were computed using a confidence interval with a 95% confidence level [21].

Figure 21: Results of SOM benchmarks comparing tagging and non-tagging versions

At first, the performance of SOM++ using integer tagging is compared with a non-tagging version of SOM++ for each implemented garbage collector. Figure 21 shows the average execution times for the SOM benchmarks. The most obvious result is that the VMs using the mark-sweep garbage collector are remarkably slower than the generational or copy-collected VMs. The reason is that in these small benchmarks nearly all objects created are short-lived, which imposes much higher garbage collection costs on the mark-sweep collector than on the others. The more important result for this thesis is that the mark-sweep collected SOM++ benefits more from integer tagging than the others do. On average, integer tagging gives an 8.5% speedup for the mark-sweep collected VM (process A being 8% faster than B means that time(A) = (1 − 0.08) time(B)), whereas the other versions are only sped up by 2.5% (generational) and 3.7% (copying) on average. Tagging is especially effective on benchmarks with high integer allocation rates; a speedup of 14.3% is for example achieved in the Storage benchmark when using the mark-sweep garbage collector.


Figure 22: Results of Richards benchmark comparing tagging and non-tagging versions

Figure 23: Results of GCBench comparing tagging and non-tagging versions


When integers make up only a small part of the allocated objects, the effect of integer tagging is barely visible, as shown in Figure 22 for the Richards benchmark. Here it seems that the lower allocation and garbage collection times just make up for the time needed for the additional tag-checks. Figure 23 shows the results of GCBench. Although the VMs using a moving GC achieve moderate speedups with tagging (3% for generational and 6% for copying), the mark-sweep collected VM surprisingly gets slower when using tagging. Here especially the overhead added to the garbage collector seems to dominate any benefits gained.

To analyse the speedups of integer tagging in more detail, the time spent on garbage collection was measured. It should be noted that for the generational GC the write-barrier costs are not accounted to the GC time, since the overhead of measuring them would be too large. The time for allocating objects is not considered either. Figure 24 displays the results for the SOM benchmarks. The leftmost two bars show times for the generational GC, followed by the copying GC, and the rightmost two bars show times for the mark-sweep GC.

Figure 24: Garbage collection times for SOM benchmarks

While the other VMs spent only little time on garbage collection (on average 1% for the generational collected VM and 4% for the copy-collected VM), the mark-sweep collected VM uses 23% of its time for garbage collection. With integer tagging, the time spent on garbage collection can be reduced by 14% on average, for IntegerLoop even by 20%. The reason is that no integer objects are allocated anymore that can become garbage. But the overall improvement achieved with integer tagging cannot be explained by reduced collection times alone. The time outside the GC was also reduced, by 7% on average, when using the mark-sweep garbage collector, while the improvement is only 2% for the generational and 3% for the copy-collected VM. This results from the saved allocations of integers, which are more expensive for the mark-sweep collector than for the other collectors.


Figure 25: Garbage collection times for the Richards benchmark (average execution time in ms, split into GC time and remaining time)

For the Richards benchmark the results are quite different, as can be seen in Figure 25. Although the time spent collecting garbage was reduced by 11% for the mark-sweep collected VM, no overall speedup could be achieved, since more time was spent outside the GC. It seems that the benefit of not having to allocate integer objects is nullified by the numerous tag-checks integer tagging incurs, especially for sends. The speedup for the copying collected VM is also rather small at 2%, while the generational collected VM is even slowed down marginally.

As can be expected, the proportion of time spent garbage collecting is much higher for GCBench. The times spent on garbage collection are 5% for the generational, 16% for the copying and even 43% for the mark-sweep version of SOM++, which shows the advantages of the generational garbage collection scheme. While integer tagging still improves the collection times for the copying collector, it slows down the mark-sweep collector by another 6%. This is due to the large number of objects allocated in this benchmark. Since the mark-sweep collector has to visit all objects (live and garbage, unlike the other collectors), the overhead of checking the tag-bit while recursing through the object tree seems to be especially disadvantageous.

To estimate which proportion of integer tagging's improvements comes from reduced allocation overhead, a special tagging version is evaluated which additionally allocates boxed integers although it internally uses tagged integers. This version is then compared to the normal tagging version of SOM++ without integer allocations, which makes the overhead of the allocations measurable.
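This "tagging + allocation" variant can be pictured roughly as follows: whenever a tagged integer is produced, the boxed object is allocated anyway and immediately dropped, so the interpreter still pays the allocation (and later collection) cost while computing on the tagged value. The sketch below illustrates that idea under this interpretation; the names and the allocator stand-in are hypothetical, not the actual SOM++ code.

```cpp
#include <cstdint>

// Hypothetical boxed integer object; in the VM this would live on the GC heap.
struct VMInteger {
    std::int64_t value;
};

// Stand-in for the VM's garbage-collected allocator.
VMInteger* allocateBoxedInteger(std::int64_t value) {
    return new VMInteger{value};   // in SOM++ this would go through the GC heap
}

inline void* tagInteger(std::int64_t value) {
    return reinterpret_cast<void*>((static_cast<std::uintptr_t>(value) << 1) | 1u);
}

// Normal tagging version: produce only the tagged word, no heap allocation.
void* newIntegerTagged(std::int64_t value) {
    return tagInteger(value);
}

// "tagging + allocation" measurement variant: still returns the tagged word,
// but additionally performs the boxed allocation, whose result is simply
// dropped and left to the garbage collector. The runtime difference between
// the two variants therefore isolates the cost of allocating and collecting
// the boxed integers.
void* newIntegerTaggedWithAllocation(std::int64_t value) {
    allocateBoxedInteger(value);   // paid for, never used
    return tagInteger(value);
}
```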


Figure 26: Garbage collection times for GCBench (average execution time in s, split into GC time and remaining time)

Figure 27: Comparing non-tagging and additional allocation versions for SOM benchmarks (average execution time in ms; each collector compared with its tagging+allocation variant)


Figure 28: Comparing non-tagging and additional allocation versions for the Richards benchmark (average execution time in ms)

Figure 29: Comparing non-tagging and additional allocation versions for GCBench (average execution time in s)


The results of these benchmarks are shown in Figures 27 to 29. The generational and the copying collected SOM++ cannot benefit in any of the benchmarks, since they already provide fast allocations. The mark-sweep collected SOM++, however, benefits in all benchmarks, although the achievable speedups depend on the number of allocations prevented. How important the effect of not having to allocate is for integer tagging is shown in Figure 30. It compares the overall speedup achieved by tagging with the speedup achieved by avoiding allocations; for instance, a benchmark in which tagging yields a 10% overall speedup while the allocation-only comparison yields 7% attributes 70% of the gain to the saved allocations. It turns out that on average 70% of the speedups come from these saved allocations.

Figure 30: Comparing speedups based on allocations and tagging for the mark-sweep collected SOM++ (speedup in % per benchmark)


6.4 Integer Caching as Alternative

As mentioned in Section 2.3, another advantage of tagged integers is a reduced memory footprint. Section 6.2 shows that integer caching can also be an optimization to save heap space, since a large proportion of allocated integers usually falls into a small range. This section evaluates how integer caching affects the speed of the interpreter. For this purpose, the non-tagging version of SOM++ is compared with a version that caches integers in a reasonable range and a version that caches integers in a bad range. By caching integers in the range [−5, 100] (good cache), a substantial amount of integer allocations can be avoided by returning preallocated integers. This effect will be much less visible for the caching range [100000, 100105] (bad cache); this version is used to indicate the overhead introduced by integer caching. The benchmarks were executed once with a mark-sweep GC and once with a generational GC.

Figure 31: SOM benchmarks with integer caching using a mark-sweep garbage collector (average execution time in ms for no cache, bad cache and good cache)

As Figure 31 shows, the mark-sweep collected SOM++ can benefit from integer caching if the number of allocated integers is high. The speedup in IntegerLoop or Storage, for example, is achieved by saving allocations. For the other SOM benchmarks the performance is on the same level as that of the non-caching VM. It is also shown that caching a bad range does not make the VM considerably slower. For Richards and GCBench (Figures 32 and 33) it can likewise be seen that the introduction of integer caching does not negatively affect performance.
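To illustrate the caching scheme evaluated here, the following sketch shows an allocation routine that returns a shared, preallocated integer object for values inside the cached range. The class and function names are illustrative stand-ins, not the actual SOM++ interfaces; only the range bounds correspond to the "good cache" configuration described above.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Illustrative stand-in for the VM's boxed integer class.
struct VMInteger {
    std::int64_t value;
    explicit VMInteger(std::int64_t v) : value(v) {}
};

// Integer caching (sketch): values in [CACHE_LOW, CACHE_HIGH] are created once
// and shared afterwards, so repeated allocations of common small integers are avoided.
constexpr std::int64_t CACHE_LOW  = -5;
constexpr std::int64_t CACHE_HIGH = 100;

// In the VM, the cached objects would have to be registered as GC roots so
// they are never collected.
static std::array<VMInteger*, CACHE_HIGH - CACHE_LOW + 1> integerCache{};

VMInteger* newInteger(std::int64_t value) {
    if (value >= CACHE_LOW && value <= CACHE_HIGH) {
        VMInteger*& cached = integerCache[static_cast<std::size_t>(value - CACHE_LOW)];
        if (cached == nullptr) {
            cached = new VMInteger(value);   // in the VM: a GC-managed allocation
        }
        return cached;                       // shared, preallocated instance
    }
    return new VMInteger(value);             // values outside the range are still boxed per use
}
```

The per-allocation overhead of this scheme is a single range check, which is why the "bad cache" configuration serves as a measure of the cost when the cache is essentially never hit.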


Figure 32: Richards benchmark with integer caching using a mark-sweep garbage collector (average execution time in ms for no cache, bad cache and good cache)

Figure 33: GCBench benchmark with integer caching using a mark-sweep garbage collector (average execution time in s for no cache, bad cache and good cache)


Figure 34: SOM benchmarks with integer caching using a generational garbage collector (average execution time in ms for no cache, bad cache and good cache)

Figures 34 to 36 show the results for the generational GC. When using integer caching with a generational GC, the effect on performance is very small. As allocations are already fast, no substantial speedup can be achieved. On the other hand, SOM++ is also not slowed down, as the overhead of integer caching seems to be negligible.

Although integer caching does not improve the speed when using a modern garbage collector, it can be considered a good alternative to integer tagging: memory can be saved with nearly no negative effects on performance.


Figure 35: Richards benchmark with integer caching using a generational garbage collector (average execution time in ms for no cache, bad cache and good cache)

Figure 36: GCBench benchmark with integer caching using a generational garbage collector (average execution time in s for no cache, bad cache and good cache)


6.5 Tagging on Other Architectures

Finally, the performance of integer tagging is evaluated on x86_64 and ARM machines. Benchmarks for ARM were executed on an otherwise idle BeagleBoard-xM (http://beagleboard.org/hardware-xM) running Ubuntu 12.04. This development board has an ARM Cortex-A8 CPU at 1 GHz with 512 MB of memory. Benchmarks for x86_64 were executed on an Intel Xeon W3580 CPU with 3.33 GHz and 8 MB cache in a machine with 12 GB RAM running Ubuntu 12.04.1.

Figure 37: SOM benchmarks on x86_64 (average execution time in ms)

Figures 37 to 39 confirm that the results found for x86 are also valid for 64-bit systems. Again, the mark-sweep collected VM can benefit from integer tagging if a sufficient amount of integers is allocated. The other garbage collectors either barely benefit in some SOM benchmarks or even get slower in the others. The same applies to the benchmarks executed on ARM, as can be seen in Figures 40 to 42.


Figure 38: Richards benchmark on x86_64 (average execution time in ms)

Figure 39: GCBench benchmark on x86_64 (average execution time in s)


Figure 40: SOM benchmarks on ARM (average execution time in ms)

Figure 41: Richards benchmark on ARM (average execution time in s)


Figure 42: GCBench benchmark on ARM (average execution time in s)


7 Related Work

As mentioned before, only a few sources could be found that cover integer tagging as an optimization for dynamic language virtual machines.

A good overview of various ways to represent type information can be found in “Representing Type Information in Dynamically Typed Languages” by David Gudeman [22]. This report covers tagged integers and also briefly describes NaN-tagging. Although it gives costs in terms of machine cycles required for the various tagging operations, no specific implementation is analysed.

Another excellent description of integer tagging can be found in Rob Sayre's blog [11]. He explains how integer tagging and NaN-tagging are used in Mozilla's JavaScript virtual machines, but here, too, no evaluation of performance gains can be found.

In July 2004 James Y Knight proposed on the Python-Dev mailing list [23] to use integer tagging for CPython (which uses reference counting for garbage collection). He also included a patch implementing tagged integers for Python 2.3. Several posts later he reported the pystones benchmark to be 3.3% and pybench to be 7.2% faster on average. Despite the speedups, two main reasons were discussed for not using tagged integers in CPython. One was that the introduction of tagged integers would break compatibility with third-party libraries. The other reason was the increased code complexity. Guido van Rossum, the creator of Python, replied for example:

“-1000 here. In a past life (the ABC implementation) we used this and it was a horrible experience. Basically, there were too many places where you had to make sure you didn't have an integer before dereferencing a pointer, and finding the omissions one core dump at a time was a nightmare.” (http://mail.python.org/pipermail/python-dev/2004-July/046147.html)

Another VM where integer tagging was briefly tested was SPY [24], a reimplementation of the Squeak Smalltalk VM using the PyPy toolchain (http://pypy.org/). This toolchain allows programs implemented in RPython (a restricted subset of Python) to be translated to efficient C code, including a tracing JIT and generational garbage collection. The result of using tagged integers here was:

“A small bit of experimentation seemed to indicate that using tagged pointers for small integers actually worsens performance. The necessity of checking whether a pointer is a heap pointer or a small integer around every method call offsets all the benefits of the smaller memory footprint that comes with tagged small integers.” [24]


8 Conclusion

During this thesis the integer tagging optimization was implemented for the SOM++ Smalltalk VM. The effects on performance were then analysed in detail by evaluating benchmarks.

The results show that integer tagging can improve the speed of interpreters if the allocation of new objects is slow. This is the case for the mark-sweep collected VM, which achieves its improvements mainly by avoiding expensive allocations. VMs using modern garbage collectors, however, cannot profit as much from avoided allocations, since they usually offer very fast allocations and can handle short-lived objects very well. Here the tag-checking overhead can even reduce performance.

When focussing on the memory footprint, the results in Section 6.4 indicate that integer caching can be a good alternative. It adds nearly no overhead to interpretation and is much easier to implement.

Considering the implementation effort, integer tagging should only be used for interpreters with slow allocations. With fast allocation, or with a JIT that applies optimizations like allocation removal [25], integer tagging is not worth implementing.

8.1 Threats to Validity

Although the results hold for SOM++, it is not quite clear whether they can also be applied to other interpreters. One reason might be the partly unrealistic design. Being a VM developed for education and research, its architecture was kept simple and is not as elaborate as the architecture of other VMs such as Google's V8 or CPython. Although several optimizations have been applied during this thesis, the Richards benchmark, for example, is still 4 times slower on SOM++ than on CPython. Architectural improvements or other optimizations could influence the results.

Another reason might be that the benchmarks evaluated during this thesis are either too small or too specialized to allow general assumptions about the performance of tagging. GCBench, for example, is a benchmark focussing on garbage collector performance. The speed of this benchmark does not necessarily correspond to the speed of other real-life programs.


Bibliography

[1] Adele Goldberg and David Robson. Smalltalk-80: The Language and Its Implementation. Longman Higher Education, December 1983.
[2] Bjarne Stroustrup. The C++ Programming Language: Third Edition. Addison-Wesley, 3rd edition, June 1997.
[3] Michael Haupt, Robert Hirschfeld, Tobias Pape, Gregor Gabrysiak, Stefan Marr, Arne Bergmann, Arvid Heise, Matthias Kleine, and Robert Krahn. The SOM family: virtual machines for teaching and research. In Proceedings of the fifteenth annual conference on Innovation and technology in computer science education, ITiCSE '10, pages 18–22, New York, NY, USA, 2010. ACM.
[4] M. Anton Ertl and David Gregg. The structure and performance of efficient interpreters. Journal of Instruction-Level Parallelism, 5:2003, 2001.
[5] Christian Boland. A Moving Garbage Collector for a Smalltalk VM. Bachelor's thesis, Heinrich-Heine-Universität, Düsseldorf, September 2011.
[6] John McCarthy. Recursive functions of symbolic expressions and their computation by machine, Part I. Commun. ACM, 3(4):184–195, April 1960.
[7] Robert R. Fenichel and Jerome C. Yochelson. A LISP garbage-collector for virtual-memory computer systems. Commun. ACM, 12(11):611–612, November 1969.
[8] David Ungar. Generation Scavenging: A non-disruptive high performance storage reclamation algorithm. In Proceedings of the first ACM SIGSOFT/SIGPLAN software engineering symposium on Practical software development environments, SDE 1, pages 157–167, New York, NY, USA, 1984. ACM.
[9] Richard Gerber, Aart J. C. Bik, Kevin Smith, and Xinmin Tian. The Software Optimization Cookbook: High Performance Recipes for IA-32 Platforms, 2nd Edition. Intel Press, 2nd edition, December 2005.
[10] Google. Design Elements - Chrome V8 — Google Developers. https://developers.google.com/v8/design?hl=en.
[11] Rob Sayre. Mozilla's New JavaScript Value Representation. http://evilpie.github.com/sayrer-fatval-backup/cache.aspx.htm, August 2010.
[12] D. L. Weaver and T. Germond. The SPARC architecture manual. PTR Prentice Hall, 1994.
[13] Mike Pall. LuaJIT 2.0 intellectual property disclosure and research opportunities. http://article.gmane.org/gmane.comp.lang.lua.general/58908, February 2009.
[14] IEEE Standard for Floating-Point Arithmetic. IEEE Std 754-2008, pages 1–58, 2008.
[15] AMD64 Architecture Programmer's Manual Volume 2: System Programming, September 2012.


[16] Michael Haupt, Stefan Marr, and Robert Hirschfeld. CSOM/PL — A Virtual Machine Product Line. Journal of Object Technology, 10:12:1–30, 2011.
[17] Urs Hölzle, Craig Chambers, and David Ungar. Optimizing dynamically-typed object-oriented languages with polymorphic inline caches. In Pierre America, editor, ECOOP'91 European Conference on Object-Oriented Programming, volume 512 of Lecture Notes in Computer Science, pages 21–38. Springer Berlin / Heidelberg, 1991.
[18] Martin Richards. Bench. http://www.cl.cam.ac.uk/~mr10/Bench.html.
[19] Amy Briggs. http://www.cs.middlebury.edu/~briggs/Courses/CS313-F12/smalltalk/stx/goodies/benchmarks/richards/.
[20] Hans Boehm. An Artificial Garbage Collection Benchmark. http://www.hpl.hp.com/personal/Hans_Boehm/gc/gc_bench.html.
[21] Andy Georges, Dries Buytaert, and Lieven Eeckhout. Statistically rigorous Java performance evaluation. In Proceedings of the 22nd annual ACM SIGPLAN conference on Object-oriented programming systems and applications, OOPSLA '07, pages 57–76, New York, NY, USA, 2007. ACM.
[22] D. Gudeman. Representing type information in dynamically typed languages. Technical Report TR93-27, University of Arizona, Department of Computer Science, Tucson, Arizona, 1993.
[23] James Y Knight. [Python-Dev] mailing list - Tagged integers. http://mail.python.org/pipermail/python-dev/2004-July/046139.html, July 2004.
[24] C. Bolz, A. Kuhn, A. Lienhard, N. Matsakis, O. Nierstrasz, L. Renggli, A. Rigo, and T. Verwaest. Back to the future in one week — implementing a Smalltalk VM in PyPy. Self-Sustaining Systems, pages 123–139, 2008.
[25] C. F. Bolz, A. Cuni, M. Fijałkowski, M. Leuschel, S. Pedroni, and A. Rigo. Allocation removal by partial evaluation in a tracing JIT. In Proceedings of the 20th ACM SIGPLAN workshop on Partial evaluation and program manipulation, pages 43–52, 2011.


List of Figures

1  Pharo Smalltalk running unit tests  4
2  A point object referencing two boxed integers  6
3  Examples of a tagged integer with value 2 and an ordinary pointer  6
4  Binary representation of IEEE 754 floating point values π, ±∞ and NaN  8
5  Examples of 32-bit NaN-tagging  8
6  Examples using 64-bit NaN-tagging  9
7  Object graph without integer caching  12
8  Object graph with integer caching  12
9  Before optimization of base classes  14
10  After optimization of base classes  15
11  Number of allocated objects for IntegerLoop benchmark  17
12  Number of allocated objects for Richards benchmark  18
13  Number of allocated objects for GCBench benchmark  18
14  Size of allocated objects for IntegerLoop benchmark  19
15  Histogram of allocated integers for SOM benchmarks  20
16  Histogram of allocated integers for Richards benchmark  20
17  Histogram of allocated integers for GCBench benchmark  21
18  Sends in SOM benchmarks  21
19  Sends in Richards benchmark  22
20  Sends in GCBench benchmark  22
21  Results of SOM benchmarks comparing tagging and non-tagging versions  24
22  Results of Richards benchmark comparing tagging and non-tagging versions  25
23  Results of GCBench comparing tagging and non-tagging versions  25
24  Garbage collection times for SOM benchmarks  26
25  Garbage collection times for Richards benchmark  27
26  Garbage collection times for GCBench  28
27  Comparing non-tagging and additional allocation versions for SOM benchmarks  28
28  Comparing non-tagging and additional allocation versions for Richards benchmark  29
29  Comparing non-tagging and additional allocation versions for GCBench  29
30  Comparing speedups based on allocations and tagging for the mark-sweep collected SOM++  30


31  SOM benchmarks with integer caching using a mark-sweep garbage collector  31
32  Richards benchmark with integer caching using a mark-sweep garbage collector  32
33  GCBench benchmark with integer caching using a mark-sweep garbage collector  32
34  SOM benchmarks with integer caching using a generational garbage collector  33
35  Richards benchmark with integer caching using a generational garbage collector  34
36  GCBench benchmark with integer caching using a generational garbage collector  34
37  SOM benchmarks on x86_64  35
38  Richards benchmark on x86_64  36
39  GCBench benchmark on x86_64  36
40  SOM benchmarks on ARM  37
41  Richards benchmark on ARM  37
42  GCBench benchmark on ARM  38
