INSTITUT FÜR INFORMATIK
Softwaretechnik und Programmiersprachen
Universitätsstr. 1, D-40225 Düsseldorf

Memory Allocation and Access Patterns in Dynamic Languages

Christian Boland

Master's thesis

Start of work: 17 April 2012
Submission: 31 October 2012
Examiners: Prof. Dr. Michael Leuschel, Prof. Dr. Michael Schöttner


Declaration

I hereby declare that I have written this master's thesis independently and that I have used no sources or aids other than those indicated.

Düsseldorf, 31 October 2012

Christian Boland


Abstract

Integer tagging is an optimization for virtual machines that improves performance and additionally reduces the memory footprint. Although used by many virtual machines, the effects of integer tagging have not been analysed in detail yet. This thesis evaluates how integer tagging influences the execution times of an existing Smalltalk interpreter. For this purpose, different versions of the interpreter have been created and analysed in detail using benchmarks. The results indicate that if the interpreter already offers fast allocations by using a modern garbage collector, almost no speedups can be achieved. In some cases the performance with integer tagging even decreases.


Contents

1 Introduction
2 Background
  2.1 Smalltalk
  2.2 SOM
  2.3 Tagging
    2.3.1 Integer Tagging
    2.3.2 NaN-tagging
3 Tagging Implementation
4 Implementation of Integer Caching
5 General Optimizations and Modifications
  5.1 Lightweight Base Class for Simple VMObjects
  5.2 Lookup Cache
  5.3 Support for Different Configurations and Other Architectures
6 Evaluation
  6.1 Description of Benchmarks
  6.2 The Characteristics of Integer Usage
  6.3 Evaluation of Benchmarks
  6.4 Integer Caching as Alternative
  6.5 Tagging on Other Architectures
7 Related Work
8 Conclusion
  8.1 Threats to Validity
Bibliography
List of Figures


1 Introduction

Over the last years dynamic programming languages like JavaScript, Python or Ruby have gained more and more attention. Reasons for the increased popularity are, for example, the dynamic typing of objects and the automatic memory management that comes with most of these languages. These features can speed up the development of software considerably. However, they do not come for free: they incur a certain overhead and prevent several static optimizations that are possible in languages like Java or C. Therefore other optimizations have been introduced to minimize this overhead.

One of these optimizations, used by some virtual machines (VMs), is integer tagging, which tries to avoid boxing of integers. Boxing of primitive data types is needed in dynamic languages, since the type of an object must be determinable at runtime. Although used by several VMs, the influence of integer tagging on performance has not been evaluated in detail so far.

The aim of this thesis is to analyse the performance impact of integer tagging for interpreters. To this end an existing Smalltalk VM is extended with integer tagging, and the performance of different configurations of this VM is measured using benchmarks. The evaluation of the results focuses on interpreters, as VMs using just-in-time (JIT) compilation have different trade-offs.

The implementation of integer tagging is described in Section 3. The idea of integer caching, an alternative optimization to reduce the memory footprint, and its implementation are described in Section 4. Section 5 describes some important additional optimizations and modifications implemented for the SOM++ virtual machine. The performance of integer tagging is then evaluated in Section 6. First the benchmarks and the characteristics of their integer usage are described. Then the benchmark results are evaluated. This section ends with a short evaluation of integer caching and of integer tagging on x86-64 and ARM based systems. Section 7 lists related research on integer tagging. Finally, Section 8 concludes this thesis.


2 Background

2.1 Smalltalk

Smalltalk is a dynamically typed, purely object-oriented programming language and a full development environment at the same time. It was created by the Learning Research Group at Xerox PARC during the 1970s and made publicly available as Smalltalk-80 [1] in 1982.

Smalltalk influenced many other programming languages like Java, Objective-C and Python, as well as Windows and the Macintosh with its windowing system. Some of its key concepts are:

• everything is an object (including integers, strings, booleans and methods)
• everything happens by sending messages
• single inheritance
• you can change everything, even running programs

Major Smalltalk implementations nowadays are Pharo (http://www.pharo-project.org), Squeak (http://squeak.org), Cincom Smalltalk (http://www.cincomsmalltalk.com) and GNU Smalltalk (http://smalltalk.gnu.org). Figure 1 shows Pharo Smalltalk running unit tests.

2.2 SOM

In this thesis the SOM++ virtual machine is used as a basis. It is implemented in C++ [2] and is part of the SOM (Simple Object Machine) virtual machine family [3], whose members all host the same Smalltalk dialect but are implemented in different languages. SOM was developed at the University of Århus and the Hasso-Plattner-Institut in Potsdam (http://hpi.uni-potsdam.de/hirschfeld/projects/som/) with a focus on education and research. Therefore its architecture and implementation were kept simple, which eases modifications to the VM and the evaluation of their effects. SOM's instruction set architecture, for example, consists of only 16 bytecode instructions, which are executed by a simple bytecode interpreter [4].

For the author's bachelor thesis [5], SOM++, which originally used a mark-sweep garbage collector, was extended by a copying and a generational garbage collector. The garbage collector implementation to use can be chosen by setting compile switches. The implementations and their key advantages and disadvantages are explained briefly below.


Figure 1: Pharo Smalltalk running unit tests


mark-sweep garbage collector: SOM++ uses a fairly straightforward implementation of the mark-sweep algorithm [6]. Allocation of objects is simply handed over to stdlib's malloc. Reclamation of garbage is achieved in two steps. First, all live objects are traversed starting from the roots and marked by setting a mark-bit inside the object header. In a second step, all objects whose mark-bit is unset are deleted. This results in relatively expensive allocations, since malloc requests additionally have to avoid fragmentation, which costs time. The complexity of the cleanup phase is proportional to the overall number of allocated objects (including garbage), since all objects have to be visited when being marked or deleted.

copying garbage collector: As a second alternative a semispace copying collector [7] is implemented. In this scheme, the heap is divided into two semispaces called fromspace and tospace. Objects are allocated linearly in tospace as long as there is sufficient room. In case of a garbage collection the two semispaces change roles. Starting from the roots, all live objects are then copied from the new fromspace to tospace. The copying garbage collector offers very fast allocation, which is achieved by simply bumping a pointer to the next free location. Fragmentation can be ignored, since memory is always compacted after a collection. The complexity of a collection is related to the number of live objects, since all other objects are freed indirectly by switching the roles of the semispaces. The main drawback of this approach is that objects surviving multiple collections have to be copied from one semispace to the other every time.

generational garbage collector: The generational garbage collector [8] combines a copying collector with a mark-sweep collector. New (young) objects are allocated inside a contiguous part of memory called the nursery, which corresponds to one semispace of the copying collector. Once the nursery becomes full, a minor collection is triggered which copies all live objects out of the nursery. These objects are then considered old and are collected by a mark-sweep collector. This scheme offers very fast allocation like the copying collector, but does not suffer from having to copy old objects again and again, since the mark-sweep collector just leaves objects in place. However, it should be noted that old and new objects cannot be collected independently, since there might be intergenerational references. These references have to be tracked, which imposes an additional overhead.

2.3 Tagging

One characteristic of dynamic languages is that "everything is an object". This means that primitive data types like integers or floating point numbers must also behave like regular objects. For example, they have a class, can receive function calls and are normally garbage collected. To achieve this behavior, primitive data types are usually boxed into regular objects. Figure 2 shows a point object that does not store its coordinates as primitives but points to boxed integer instances.

Obviously, boxing of primitive values has the disadvantage that it causes allocations of heap objects and therefore increases the overall memory usage of the program. This is especially disadvantageous in the case of integers, since they are often allocated for intermediate results only.


Figure 2: A point object referencing two boxed integers

Another characteristic of dynamic languages is that variables are typeless. Unlike in statically typed languages, where the types of variables are known at compile time, dynamically typed languages have to perform type checks at runtime, which usually involves calling a method like GetType() on the object.

2.3.1 Integer Tagging

Integer tagging is a technique that tries to circumvent the allocation of boxed integer instances. To achieve this, the memory alignment of objects is exploited. Most modern CPUs expect data structures to be aligned at multiples of the word size (for example 4 bytes on 32-bit machines). Reading from or writing to unaligned addresses is considerably slower [9] or may even result in an error. Therefore compilers usually add padding to data structures, and memory requests return aligned addresses. One side effect of data alignment is that the last bits of pointers to objects are known to be zero (addresses are usually required to be word-aligned, which results in two zero bits on 32-bit systems and three zero bits on 64-bit systems).

This allows the last bit to be used to distinguish between pointers to regular objects and integers. If, for example, this bit is set, the remaining bits can be used to encode the integer value. All pointers with the last bit unset point to regular objects. Figure 3 shows examples of how tagged and untagged pointers are interpreted.

Figure 3: Examples of a tagged integer with value 2 (tag bit set, 31 bits encode the value) and an ordinary pointer to the object at 0xf7291c8c (tag bit unset)


One advantage of integer tagging is that the allocation of boxed integers is not necessary in most cases. Besides saving the time for allocation, this also reduces the number of allocated objects and thus lowers the pressure on the garbage collector. Moreover, in some cases, like the addition of integers, no heap objects have to be accessed at all, since the values are already encoded in the "pointers".

But integer tagging also has some downsides. Whenever the interpreter makes calls to objects, a tag-check has to be performed. If the tag-check succeeds (i.e. in case of a tagged integer), the call has to be faked in some way, since there is no real object to call. Also, the garbage collector has to distinguish between references to heap objects and tagged integers, which adds some overhead. Furthermore, it is not possible to represent the whole integer range in a tagged integer, since the last bit is always used as the tag bit. On 32-bit CPUs, for example, the integer value must be encodable in 31 bits, otherwise a boxed integer has to be created. It should also be noted that integer tagging requires integers to be immutable objects.

Integer tagging can be implemented in two ways. One way is to indicate a tagged integer by setting the last bit to 1. This approach favours pointers over other objects, since they can be used without modification; to retrieve the integer value from the tagged representation, it has to be bit-shifted once. With the other way (the last bit indicates a pointer), the tag bit has to be masked off before the pointer can be used.

Tagged integers are for example used in several Smalltalk VMs and in JavaScript engines like Google's V8 [10] or older Mozilla engines [11]. The SPARC [12] processor architecture even has support for tagged integer operations.

2.3.2 NaN-tagging

Another variant of tagging, used for example in LuaJIT [13] and Mozilla's JägerMonkey [11], is NaN-tagging. Here pointers are represented as double precision IEEE 754 [14] floating point numbers. According to the standard, such a number is represented by 1 sign bit, followed by 11 exponent bits and 52 fraction bits. Besides floating point numbers, special bit patterns are defined to represent certain non-numbers. Positive and negative infinity are represented by all exponent bits set and all fraction bits unset. All exponent bits set and at least one fraction bit set represents "not a number" (NaN), which is the return value of invalid operations.
For example, the square root of a negative number, the logarithm of a negative number, the division 0/0 or any other operation involving NaNs returns NaN. Figure 4 shows the binary representations of the value π and the special values ±∞ and NaN.

Although there are 2^53 − 1 bit patterns that represent NaN, current software and hardware practically emit only one canonical NaN value. This allows the remaining NaN representations to be used for tagging objects.

One obvious way to implement NaN-tagging on 32-bit machines is to use the first 20 of the fraction bits as tag bits and the remaining 32 bits to encode the value, since this is the natural word size of 32-bit systems. Pointers and integers can then easily be read without masking off other bits. Additionally, the large number of possible tags allows not only integers to be tagged, but also other types such as strings or the special objects true and false. Figure 5 shows the binary representations of the integer 52, the true object, a string and one other object using NaN-tagging on 32-bit machines.


Figure 4: Binary representation of the IEEE 754 floating point values π, ±∞ and NaN

Figure 5: Examples of 32-bit NaN-tagging


Using NaN-tagging on 64-bit machines seems impossible at first glance, since 64-bit pointers cannot be encoded in the only 53 bits available. However, most operating systems and processors do not support the full 64-bit address range. AMD's current x86-64 architecture, for example, only supports 48-bit virtual addresses [15], and Microsoft Windows even makes use of 44-bit virtual addresses only. This allows 5 bits to be used for tagging, leaving 48 bits to encode any pointer. Of course, 64-bit integers cannot be encoded completely: integers needing more than 48 bits have to be boxed. Figure 6 shows the binary representations of the integer 52, the true object, a string and one other object using NaN-tagging on 64-bit machines.

Figure 6: Examples of 64-bit NaN-tagging

Since NaN-tagging by its nature favours floating point values over pointers, it makes most sense if a large percentage of objects represents floating point values. This is, for example, the case for JavaScript programs, as there even integers are represented by floating point numbers.
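To make the 64-bit encoding more concrete, the following is a small, self-contained sketch of such a scheme in C++. It is illustrative only and not taken from LuaJIT, JägerMonkey or SOM++; the tag values, helper names and the 48-bit payload layout are assumptions.

    #include <cstdint>
    #include <cstring>

    // Hypothetical 64-bit NaN-tagging sketch. Plain doubles are stored as-is;
    // every value whose top bits match the quiet-NaN pattern below carries a
    // 3-bit type tag and a 48-bit payload (pointer or small integer).
    typedef uint64_t Value;

    const uint64_t QUIET_NAN    = 0xFFF8000000000000ULL; // sign, exponent and quiet bit set
    const int      TAG_SHIFT    = 48;
    const uint64_t TAG_MASK     = 0x7ULL << TAG_SHIFT;
    const uint64_t PAYLOAD_MASK = (1ULL << TAG_SHIFT) - 1;
    const uint64_t TAG_INT      = 1ULL << TAG_SHIFT;      // illustrative tag values
    const uint64_t TAG_OBJECT   = 2ULL << TAG_SHIFT;

    // the canonical NaN emitted by hardware has the sign bit clear,
    // so it is still classified as a plain double here
    inline bool isDouble(Value v) { return (v & QUIET_NAN) != QUIET_NAN; }
    inline bool isInt(Value v)    { return !isDouble(v) && (v & TAG_MASK) == TAG_INT; }
    inline bool isObject(Value v) { return !isDouble(v) && (v & TAG_MASK) == TAG_OBJECT; }

    inline Value  boxDouble(double d)  { Value v; std::memcpy(&v, &d, sizeof v); return v; }
    inline double unboxDouble(Value v) { double d; std::memcpy(&d, &v, sizeof d); return d; }

    // integers wider than 48 bits would have to fall back to a boxed object
    inline Value   boxInt(int64_t i) { return QUIET_NAN | TAG_INT | ((uint64_t)i & PAYLOAD_MASK); }
    inline int64_t unboxInt(Value v) { return (int64_t)(v << 16) >> 16; } // sign-extend the 48-bit payload

    inline Value boxObject(void* p)   { return QUIET_NAN | TAG_OBJECT | ((uint64_t)(uintptr_t)p & PAYLOAD_MASK); }
    inline void* unboxObject(Value v) { return (void*)(uintptr_t)(v & PAYLOAD_MASK); }

A real implementation would reserve further tags for strings, true, false and so on, as sketched in Figures 5 and 6.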


3 Tagging Implementation

To evaluate the performance effects of tagged representations, I implemented integer tagging. Although NaN-tagging might gain an additional advantage by tagging more types than just integers, the evaluation of integer tagging is a good indicator for the performance of tagging in general.

Another reason not to implement NaN-tagging is the complexity of the implementation, especially for an existing virtual machine. Since tagging is a cross-cutting concern [16], it affects the source code in a large number of places. Besides the special treatment of tagged objects in the garbage collector and in primitives, a tag-check must be performed before every call to a potentially tagged object. This makes the implementation of tagging a time-consuming and often tedious task, since forgetting to check for tagged objects will usually crash the program.

To represent tagged integers, two's complement notation is used. On 32-bit systems this allows all values to be tagged that can be represented in 31-bit two's complement representation, i.e. the range [−1073741824, 1073741823]. To ease the implementation of integer tagging, helper macros have been defined. IS_TAGGED checks if X is a tagged integer, and AS_POINTER casts X to an object pointer. The macros TAG_INTEGER and UNTAG_INTEGER convert between integer values and their tagged representations. Listing 1 shows the implementation of these macros.

Listing 1: Implementation of the helper macros

    /**
     * max value for tagged integers
     * 01111111 11111111 11111111 1111111 X
     */
    #define VMTAGGEDINTEGER_MAX  1073741823

    /**
     * min value for tagged integers
     * 10000000 00000000 00000000 0000000 X
     */
    #define VMTAGGEDINTEGER_MIN -1073741824

    // check if the last bit is set
    #define IS_TAGGED(X) (((long)X) & 1)

    // simple cast
    #define AS_POINTER(X) ((AbstractVMObject*)X)

    // return a tagged integer if the value is in the valid range,
    // otherwise fall back to a boxed integer allocated on the heap
    #define TAG_INTEGER(X)                                      \
        ((X >= VMTAGGEDINTEGER_MIN && X <= VMTAGGEDINTEGER_MAX) \
            ? ((pVMInteger)(((X) << 1) | 1))                    \
            : (new (_HEAP) VMInteger(X)))

    // return the plain integer value of a tagged or boxed integer
    #define UNTAG_INTEGER(X)                        \
        (IS_TAGGED(X)                               \
            ? (((long)X) >> 1)                      \
            : (((VMInteger*)X)->GetEmbeddedInteger()))


Most of the implementation effort was spent inserting checks of whether a pointer is a tagged integer before calling functions on it, since calling functions on tagged integers would cause segmentation faults. If a call to a tagged integer is encountered, it is executed on a prototypical boxed integer instance. Some calls can even be skipped, since the result is known in advance and the call is side-effect free. For example, calls to GetClass() always return the object representing the class of integers. Listing 2 shows examples of these checks.

Listing 2: Examples of tag-checks

    // GlobalBox::IntegerBox() returns a prototypical boxed integer instance
    if (IS_TAGGED(sender))
        GlobalBox::IntegerBox()->Send(eB, arguments, 1);
    else
        AS_POINTER(sender)->Send(eB, arguments, 1);

    // example where the call can be skipped entirely
    pVMClass receiverClass = IS_TAGGED(receiver)
        ? integerClass
        : AS_POINTER(receiver)->GetClass();

Another part that has to be modified is the garbage collector. Tagged integers are not allocated as heap objects and therefore do not have to be garbage collected. This means that the implemented garbage collectors must ignore them. Listing 3 shows the marking phase of the mark-sweep collector.

Listing 3: Marking phase of the mark-sweep collector

    VMOBJECT_PTR mark_object(VMOBJECT_PTR obj) {
        // ignore tagged integers
        if (IS_TAGGED(obj))
            return obj;
        // also ignore objects that have been visited before
        if (obj->GetGCField())
            return obj;

        // mark this object and the objects it references
        obj->SetGCField(GC_MARKED);
        obj->WalkObjects(mark_object);
        return obj;
    }
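As a final illustration of the fast path mentioned above, the following hypothetical sketch shows how an integer addition primitive could be built on the macros from Listing 1. It is not the actual SOM++ primitive: the function signature and the pVMObject type are assumptions, and a real primitive would receive its operands via the interpreter's frame.

    // Hypothetical sketch, not the real SOM++ primitive: adds two integers that
    // may each be tagged or boxed. If both are tagged, no heap object is read.
    pVMObject integerPlus(pVMObject left, pVMObject right) {
        long a = UNTAG_INTEGER(left);   // shift for tagged values, GetEmbeddedInteger() for boxed ones
        long b = UNTAG_INTEGER(right);
        long sum = a + b;
        // TAG_INTEGER falls back to allocating a boxed VMInteger when the
        // result does not fit into the 31-bit tagged range.
        return (pVMObject) TAG_INTEGER(sum);
    }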


4 Implementation of Integer Caching

An alternative optimization to save heap space is integer caching. The key observation behind integer caching is that most integer objects store values in a small range. This results in multiple instances of integer objects all storing the same value. As can be seen later in Section 6.2, this is a valid assumption for SOM++ as well.

To circumvent this issue, some virtual machines preallocate integers in a certain range. When an integer object within this range is needed, the preallocated instance is returned instead of boxing a new one. As with integer tagging, it should be noted that integer caching also requires integers to be immutable objects.

Listing 4: Smalltalk example where boxed integer objects can be reused

    point := Point new.
    point pos_x: 36.
    point pos_y: 36.
    point pos_z: 36.

Listing 4 shows an example where a three-dimensional point is constructed. Without integer caching, three different integer objects would be allocated, all storing the value 36, as can be seen in Figure 7. With integer caching only one instance storing the value 36 is created, so the allocation of at least two boxed integer instances can be avoided, as shown in Figure 8.

Figure 7: Object graph without integer caching

Figure 8: Object graph with integer caching

Integer caching is for example used in CPython, which caches the values from -5 to 256 (see http://hg.python.org/cpython/file/tip/Objects/longobject.c), and in several Java VMs, which cache the values from -128 to 127 (see http://hg.openjdk.java.net/jdk6/jdk6-gate/jdk/file/tip/share/classes/java/lang/Integer.java).
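Before looking at the lookup path in Listing 5 below, it helps to picture how such a cache could be set up once at VM startup. The following is only a sketch: the array name prebuildInts and the range constants are taken from Listing 5, while the method name and the surrounding details are assumptions, not the actual SOM++ code.

    // Hypothetical initialization sketch: allocate one boxed VMInteger per value
    // in the cached range once at startup. These shared instances live for the
    // whole VM run, so they must be reachable as garbage collection roots (or
    // live outside the collected heap).
    void Universe::InitializeIntegerCache() {
        const long count = INT_CACHE_MAX_VALUE - INT_CACHE_MIN_VALUE;
        prebuildInts = new pVMInteger[count];
        for (long i = 0; i < count; ++i) {
            prebuildInts[i] = new (_HEAP) VMInteger(INT_CACHE_MIN_VALUE + i);
        }
    }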


Listing 5 shows how the allocation of integers is implemented with integer caching in SOM++. Before a new boxed integer is allocated, it is tested whether the desired value lies inside the range of cached integers. If so, an integer from prebuildInts, an array containing all cached integers which is allocated on VM startup, is returned. Otherwise a new boxed integer has to be created.

Listing 5: Implementation of cached integers

    // Universe::NewInteger is called to create integer instances
    pVMInteger Universe::NewInteger(long value) const {
        unsigned long index =
            (unsigned long)value - (unsigned long)INT_CACHE_MIN_VALUE;
        if (index < (unsigned long)(INT_CACHE_MAX_VALUE - INT_CACHE_MIN_VALUE)) {
            return prebuildInts[index];
        }
        return new (_HEAP) VMInteger(value);
    }

To keep the overhead of integer caching small, the range check has been implemented using only one comparison. This version is supposed to be faster than the obvious solution, which would be: if (value >= INT_CACHE_MIN_VALUE && value < INT_CACHE_MAX_VALUE).
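The single comparison works because the subtraction is performed on unsigned numbers: it maps the cached range onto [0, range), and every value below the minimum wraps around to a very large unsigned number, so one "less than" test covers both bounds. A small standalone check of the equivalence (the bounds here are example values, not SOM++'s configuration):

    #include <cassert>

    // Example bounds for demonstration only.
    const long INT_CACHE_MIN_VALUE = -5;
    const long INT_CACHE_MAX_VALUE = 256;

    bool inCacheOneCompare(long value) {
        // same shape as the check in Listing 5
        unsigned long index =
            (unsigned long)value - (unsigned long)INT_CACHE_MIN_VALUE;
        return index < (unsigned long)(INT_CACHE_MAX_VALUE - INT_CACHE_MIN_VALUE);
    }

    bool inCacheTwoCompares(long value) {
        // the "obvious" version with two comparisons
        // (upper bound exclusive, matching Listing 5)
        return value >= INT_CACHE_MIN_VALUE && value < INT_CACHE_MAX_VALUE;
    }

    int main() {
        for (long v = -100000; v <= 100000; ++v)
            assert(inCacheOneCompare(v) == inCacheTwoCompares(v));
        return 0;
    }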


5 General Optimizations and Modifications

SOM++ is a Smalltalk VM designed for education and research. The focus was laid on a simple design which makes it easy to understand and to start working with. Unfortunately this approach usually has a negative impact on performance: most optimizations increase the complexity of the implementation and are therefore not used in SOM++. Evaluating tagging with this unoptimized VM would have had two downsides. First, it would be questionable whether conclusions from this analysis would be transferable to other interpreters, since the absence of important optimizations can influence the effect of integer tagging. Additionally, the benefit from using integer tagging might be hardly visible, as it would be dominated by the VM's interpretative overhead.

Besides the implementation of a generational and a copying garbage collector [5] in the author's bachelor thesis, several other optimizations have been applied to SOM++, of which some important ones are described now.

5.1 Lightweight Base Class for Simple VMObjects

One important optimization was the introduction of a lightweight base class for simple objects. Before, all objects representing Smalltalk objects had the same base class called VMObject. This class had members for a hash value, the size of the object, the number of fields this object provides to Smalltalk, and a field storing the class of this object. This resulted in a total size for VMIntegers of 28 bytes (4 bytes for each of the 6 fields + vtable), as can be seen in Figure 9.

Figure 9: Before optimization of base classes

However, for some classes like VMInteger, VMDouble or VMString, the object size can be reduced. Since the class, the object size and the number of fields are fixed, their values don't have to be stored in each object but can be returned with the help of virtual functions. The implementation of GetClass() for VMInteger, for example, will always return the same object called integerClass. Also, the hash can be computed at runtime and doesn't have to be stored for each instance. With this optimization, the size of integer objects could be reduced to 12 bytes (8 bytes for 2 fields + vtable), as shown in Figure 10. This optimization helps to keep the memory footprint small and reduces the pressure on the garbage collector.
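The idea can be summarised in a few lines of C++. This is only a sketch with hypothetical member names, not the actual SOM++ class hierarchy: the constant per-class information is answered by virtual functions, so a simple object carries nothing but its vtable pointer and its actual data.

    #include <cstddef>

    class VMClass;
    extern VMClass* integerClass;   // the single shared class object for integers

    // Sketch of the lightweight base class: no per-instance metadata at all.
    class AbstractVMObject {
    public:
        virtual ~AbstractVMObject() {}
        virtual VMClass* GetClass() const = 0;
        virtual size_t   GetObjectSize() const = 0;
        virtual long     GetHash() const = 0;
    };

    // A simple object like an integer now only stores its GC field and its
    // value: on a 32-bit system that is 12 bytes (vtable pointer + 2 fields)
    // instead of the 28 bytes needed when class, size, field count and hash
    // were stored in every instance.
    class VMInteger : public AbstractVMObject {
    public:
        explicit VMInteger(long value) : gcField(0), embeddedInteger(value) {}
        VMClass* GetClass() const      { return integerClass; }      // constant, not stored
        size_t   GetObjectSize() const { return sizeof(VMInteger); } // constant, not stored
        long     GetHash() const       { return embeddedInteger; }   // computed on demand
        long     GetEmbeddedInteger() const { return embeddedInteger; }
    private:
        long gcField;          // GC bookkeeping (mark bit, forwarding info)
        long embeddedInteger;  // the actual integer value
    };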


Figure 10: After optimization of base classes

5.2 Lookup Cache

Another optimization was the implementation of a lookup cache [17]. Without appropriate optimizations, virtual machines for dynamic languages spend a lot of time determining the correct function for a message send. A naive implementation, for example, would have to fetch the class of the object and walk up the inheritance chain until the right function is found, which can be quite expensive in terms of speed.

Lookup caches reduce the overhead of message sends by mapping (receiver class, method name) pairs to their corresponding functions. In case of a message send, the cache is consulted first. Only if no corresponding entry exists is the expensive lookup procedure performed, and its result is then stored in the cache. Lookup caches have proven to be effective for several Smalltalk virtual machines (a sketch of such a cache is given at the end of this section).

5.3 Support for Different Configurations and Other Architectures

Integer tagging is examined by comparing benchmark results of different configurations of SOM++. All these configurations can be built from the same code base just by setting compile flags. This includes the choice of the garbage collector, integer tagging, and integer caching with the desired caching range. Additionally, one special version exists to generate information on the executed benchmarks, as can be seen in Section 6.2. All together, benchmarks were executed on 19 different configurations of SOM++.

Furthermore, SOM++ was originally developed for 32-bit operating systems only. In order to test performance on other architectures as well, it was modified to run on 64-bit systems, which turned out to be a time-consuming task. Additionally, SOM++ was made compilable for ARM systems, which was much easier since only the makefile had to be adapted.
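To make the lookup cache from Section 5.2 concrete, the sketch below shows one possible shape of such a cache: a small direct-mapped table keyed on the (receiver class, selector) pair. The structure, the size and the LookupInvokable slow path are assumptions for illustration, not the SOM++ implementation.

    #include <cstdint>
    #include <cstddef>

    class VMSymbol;
    class VMMethod;
    class VMClass {
    public:
        // assumed slow path: walks the superclass chain until the method is found
        VMMethod* LookupInvokable(VMSymbol* selector);
    };

    // One cache slot maps a (receiver class, selector) pair to the resolved method.
    struct CacheEntry {
        VMClass*  receiverClass;
        VMSymbol* selector;
        VMMethod* method;
    };

    const size_t CACHE_SIZE = 1024;             // power of two, so masking replaces modulo
    static CacheEntry lookupCache[CACHE_SIZE];  // zero-initialized: all slots start empty

    static size_t slotFor(VMClass* cls, VMSymbol* sel) {
        // cheap hash over the two pointers; collisions simply overwrite the slot
        return (((uintptr_t)cls >> 2) ^ ((uintptr_t)sel >> 2)) & (CACHE_SIZE - 1);
    }

    VMMethod* lookupWithCache(VMClass* cls, VMSymbol* sel) {
        CacheEntry& e = lookupCache[slotFor(cls, sel)];
        if (e.receiverClass == cls && e.selector == sel)
            return e.method;                         // hit: no superclass walk needed
        VMMethod* found = cls->LookupInvokable(sel); // miss: do the expensive lookup
        e.receiverClass = cls;                       // remember the result for next time
        e.selector = sel;
        e.method = found;
        return found;
    }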


6 Evaluation

In order to evaluate the effects of integer tagging, several benchmarks were executed in different configurations. This section first briefly describes the benchmarks and the characteristics of their integer usage. Afterwards the benchmark results are presented and analysed in detail.

6.1 Description of Benchmarks

The SOM++ VM itself contains 16 simple benchmarks:

Bounce: Simulates jumping balls inside a box and counts how often they hit the wall.
BubbleSort: Sorts an array of integers using the bubble sort algorithm.
Dispatch: Tests message dispatching by calling methods in a loop.
Fibonacci: Recursively calculates a Fibonacci number.
IntegerLoop: Performs integer arithmetic in a loop.
List: Creates linked lists and uses them in a recursive algorithm.
Loop: Sums up numbers in two nested loops.
Permute: Permutes items in an array.
Queens: Solves the N-Queens problem.
QuickSort: Sorts an array of integers using the quicksort algorithm.
Recurse: Tests calling recursive functions.
Sieve: Calculates the number of primes using the sieve of Eratosthenes.
Storage: Creates a tree storing integer values.
Sum: Calculates triangular numbers and sums them up in a loop.
Towers: An implementation of the famous Tower of Hanoi. This benchmark moves all disks from the first tower to the fourth.
TreeSort: Creates a sorted binary tree with integer values.

Richards is a well known benchmark which simulates an operating system's task dispatcher. It was originally written by Martin Richards [18] at Cambridge University and has since been translated to several other languages. The benchmark version used here was derived from a Smalltalk version [19].

GCBench is a benchmark designed to test garbage collector performance. It was originally written by John Ellis and Pete Kovac of Post Communications. The version used here is derived from a modified version by Hans Boehm [20]. This benchmark allocates and then drops binary trees of different sizes. Additionally, it keeps some data structures permanently live throughout the benchmark.


6.2 The Characteristics of Integer Usage

Prior to evaluating the benchmark results it is necessary to have a look at some benchmark characteristics; they will help to analyse the performance effects of tagging later. Obviously, one important characteristic is the number of allocated integers, since this is exactly what tagging tries to optimize. Additionally, it gives an impression of the pressure that lies on the garbage collector.

Figure 11: Number of allocated objects for IntegerLoop benchmark

Figure 11 shows the number of allocated instances per class for SOM's IntegerLoop benchmark. Although this is one of the benchmarks with the highest integer allocation rates, it turns out that integers only account for 25% of all allocations; most objects, with 62%, were frames. The ratio of allocated integers is usually even smaller, as can be seen in Figure 12 for the Richards benchmark and in Figure 13 for GCBench. Here integers only account for 2.5% (GCBench) of all objects.

But not only the number of allocated objects is interesting, the size of the allocated objects should also be taken into account. For example, the number of garbage collections triggered depends on the total size of allocated objects. Figure 14 shows the size of allocated objects per class for the IntegerLoop benchmark. Although 25% of the objects were integers, they only account for 6.3% of the allocated space. The majority of 85.3% was used for frames. This can be explained by the size of the objects, which is 12 bytes for integers and 66 bytes on average for frames.

These figures indicate the potential of integer tagging. As it only applies to integer objects, the importance of this optimization is bound to the ratio of allocated integers to other objects. However, it turns out that most allocations are for other types of objects.


Figure 12: Number of allocated objects for Richards benchmark

Figure 13: Number of allocated objects for GCBench benchmark


Figure 14: Size of allocated objects for IntegerLoop benchmark

The high number of frames, for example, results from the tendency towards small methods in Smalltalk programs. In general it can be said that most allocations result from the interpreter implementation and not from domain objects.

Another interesting characteristic is the distribution of allocated integer values. To evaluate this, the values of all allocated integer objects were recorded. Figure 15 shows the distribution of integer values over all SOM benchmarks as a log/log plot, Figure 16 for the Richards benchmark and Figure 17 for GCBench.

One observation is that all values can be represented as tagged integers, since they fall into the encodable range of [−1073741824, 1073741823] (see Section 3 for encoding details). This means that no memory would have to be allocated for storing integers. Furthermore, these histograms are important for integer caching, which is evaluated later in Section 6.4, since they indicate appropriate ranges to cache. For the SOM benchmarks, 49.9% of integers fall into the caching range of [−5, 100]; for Richards it is 53.7% and for GCBench even 89.6%. These values show that the range for caching seems to be well chosen.

The last characteristic evaluated is the number of sends to objects. This is interesting since, before a send (function call) to an object can be made, a tag-check has to be performed to determine whether the pointer points to a real object or is a tagged integer. Furthermore, it is interesting whether the function to be called is implemented as a primitive (on the interpreter level in C++) or in Smalltalk. Sends to integers not implemented by primitives have to be executed on a boxed integer instance; only sends to integers implemented by primitives can benefit from not having to access heap objects.


Figure 15: Histogram of allocated integers for SOM benchmarks

Figure 16: Histogram of allocated integers for Richards benchmark


Figure 17: Histogram of allocated integers for GCBench benchmark

Figure 18: Sends in SOM benchmarks (types of sends for IntegerLoop, primitive vs. non-primitive)


Figure 19: Sends in Richards benchmark (primitive vs. non-primitive)

Figure 20: Sends in GCBench benchmark (primitive vs. non-primitive)


Figure 18 shows all sends executed for SOM's IntegerLoop benchmark. Here 44% of all calls go to integers, and 75% of those are implemented by primitives. This means that in potentially 33% of all sends no heap object has to be accessed. When looking at less integer-heavy benchmarks such as Richards, the percentage of sends to integers implemented by primitives drops significantly. As can be seen in Figure 19, only 9% of sends go to integers, of which 60% are implemented by primitives, so the beneficial effect of not having to access heap objects (in 6% of sends) is rather small. For GCBench, 30% of calls go to integers, of which 73% are implemented as primitives, as shown in Figure 20.


6.3 Evaluation of Benchmarks

In order to evaluate integer tagging performance, all benchmarks described in Section 6.1 were executed on an Intel Core 2 Duo T7300 CPU with 2 GHz and 4 MB cache in a machine with 2 GB RAM running Linux 3.2.0-32. The average execution time was calculated from 20 iterations. Each iteration starts a new SOM++ instance, and the time until termination is measured. Errors were computed using a confidence interval with a 95% confidence level [21].

Figure 21: Results of SOM benchmarks comparing tagging and non-tagging versions

At first, the performance of SOM++ using integer tagging is compared with a non-tagging version of SOM++ for each implemented garbage collector. Figure 21 shows the average execution times for the SOM benchmarks. The most obvious result is that the VMs using the mark-sweep garbage collector are remarkably slower than the generational or copy-collected VMs. The reason is that in these small benchmarks nearly all objects created are short-lived, which imposes much higher garbage collection costs on the mark-sweep collector than on the others. The more important result for this thesis is that the mark-sweep collected SOM++ benefits more from integer tagging than the others do. On average, integer tagging gives an 8.5% speedup for the mark-sweep collected VM (process A being 8% faster than B means that time(A) = (1 − 0.08) time(B)), whereas the other versions are only sped up by 2.5% (generational) and 3.7% (copying) on average. Tagging is especially effective on benchmarks with high integer allocation rates; a speedup of 14.3% is for example achieved in the Storage benchmark when using the mark-sweep garbage collector.


Figure 22: Results of Richards benchmark comparing tagging and non-tagging versions

Figure 23: Results of GCBench comparing tagging and non-tagging versions


When integers make up only a small part of the allocated objects, the effect of integer tagging is barely visible, as shown in Figure 22 for the Richards benchmark. Here it seems that the lower allocation and garbage collection times just make up for the time needed for the additional tag-checks. Figure 23 shows the results of GCBench. Although the VMs using a moving GC achieve moderate speedups with tagging (3% for generational and 6% for copying), the mark-sweep collected VM surprisingly gets slower when using tagging. Here especially the overhead added to the garbage collector seems to dominate any benefits gained.

To analyse the speedups of integer tagging in more detail, the time spent on garbage collection was measured. It should be noted that for the generational GC the write-barrier costs are not accounted to the GC time, since the overhead of measuring them would be too large. The time for allocating objects is not considered either. Figure 24 displays the results for the SOM benchmarks. The leftmost two bars show times for the generational GC, followed by the copying GC, and the rightmost two bars show times for the mark-sweep GC.

Figure 24: Garbage collection times for SOM benchmarks

While the other VMs spent only little time on garbage collection (on average 1% for the generational collected VM and 4% for the copy-collected VM), the mark-sweep collected VM uses 23% of its time for garbage collection. With integer tagging, the time spent on garbage collection can be reduced by 14% on average, for IntegerLoop even by 20%. The reason is that no integer objects are allocated anymore that can become garbage. But the overall improvement achieved with integer tagging cannot be explained by reduced collection times alone. The time outside the GC was also reduced, by 7% on average, when using the mark-sweep garbage collector, while the improvement is only 2% for the generational and 3% for the copy-collected VM. This results from the saved allocations of integers, which are more expensive for the mark-sweep collector than for the other collectors.


Figure 25: Garbage collection times for the Richards benchmark (average execution time in ms, split into GC time and remaining time)

For the Richards benchmark the results are quite different, as can be seen in Figure 25. Although the time spent collecting garbage was reduced by 11% for the mark-sweep collected VM, no overall speedup could be achieved, since more time was spent outside the GC. It seems that the benefit of not having to allocate integer objects is nullified by the numerous tag-checks integer tagging incurs, especially for sends. The speedup for the copying collected VM is also rather small at 2%, while the generational collected VM is even slowed down marginally.

As can be expected, the proportion of time spent garbage collecting is much higher for GCBench. The times spent on garbage collection are 5% for the generational, 16% for the copying and even 43% for the mark-sweep version of SOM++, which shows the advantages of the generational garbage collection scheme. While integer tagging still improves the collection times for the copying collector, it slows down the mark-sweep collector by another 6%. This is due to the large number of objects allocated in this benchmark. Since the mark-sweep collector has to visit all objects (live and garbage, unlike the other collectors), the overhead of checking the tag-bit while recursing through the object tree seems to be especially disadvantageous.

To estimate which proportion of integer tagging's improvements comes from reduced allocation overhead, a special tagging version is evaluated which additionally allocates boxed integers although it internally uses tagged integers. This version is then compared to the normal tagging version of SOM++ without integer allocations, which makes the overhead of the allocations measurable.
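This "tagging + allocation" variant can be pictured roughly as follows: whenever a tagged integer is produced, the boxed object is allocated anyway and immediately dropped, so the interpreter still pays the allocation (and later collection) cost while computing on the tagged value. The sketch below illustrates that idea under this interpretation; the names and the allocator stand-in are hypothetical, not the actual SOM++ code.

```cpp
#include <cstdint>

// Hypothetical boxed integer object; in the VM this would live on the GC heap.
struct VMInteger {
    std::int64_t value;
};

// Stand-in for the VM's garbage-collected allocator.
VMInteger* allocateBoxedInteger(std::int64_t value) {
    return new VMInteger{value};   // in SOM++ this would go through the GC heap
}

inline void* tagInteger(std::int64_t value) {
    return reinterpret_cast<void*>((static_cast<std::uintptr_t>(value) << 1) | 1u);
}

// Normal tagging version: produce only the tagged word, no heap allocation.
void* newIntegerTagged(std::int64_t value) {
    return tagInteger(value);
}

// "tagging + allocation" measurement variant: still returns the tagged word,
// but additionally performs the boxed allocation, whose result is simply
// dropped and left to the garbage collector. The runtime difference between
// the two variants therefore isolates the cost of allocating and collecting
// the boxed integers.
void* newIntegerTaggedWithAllocation(std::int64_t value) {
    allocateBoxedInteger(value);   // paid for, never used
    return tagInteger(value);
}
```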


Figure 26: Garbage collection times for GCBench (average execution time in s, split into GC time and remaining time)

Figure 27: Comparing non-tagging and additional allocation versions for SOM benchmarks (average execution time in ms; each collector compared with its tagging+allocation variant)


Figure 28: Comparing non-tagging and additional allocation versions for the Richards benchmark (average execution time in ms)

Figure 29: Comparing non-tagging and additional allocation versions for GCBench (average execution time in s)


The results of these benchmarks are shown in Figures 27 to 29. The generational and the copying collected SOM++ cannot benefit in any of the benchmarks, since they already provide fast allocations. The mark-sweep collected SOM++, however, benefits in all benchmarks, although the achievable speedups depend on the number of allocations prevented. How important the effect of not having to allocate is for integer tagging is shown in Figure 30. It compares the overall speedup achieved by tagging with the speedup achieved by avoiding allocations; for instance, a benchmark in which tagging yields a 10% overall speedup while the allocation-only comparison yields 7% attributes 70% of the gain to the saved allocations. It turns out that on average 70% of the speedups come from these saved allocations.

Figure 30: Comparing speedups based on allocations and tagging for the mark-sweep collected SOM++ (speedup in % per benchmark)


6.4 Integer Caching as Alternative

As mentioned in Section 2.3, another advantage of tagged integers is a reduced memory footprint. Section 6.2 shows that integer caching can also be an optimization to save heap space, since a large proportion of allocated integers usually falls into a small range. This section evaluates how integer caching affects the speed of the interpreter. For this purpose, the non-tagging version of SOM++ is compared with a version that caches integers in a reasonable range and a version that caches integers in a bad range. By caching integers in the range [−5, 100] (good cache), a substantial amount of integer allocations can be avoided by returning preallocated integers. This effect will be much less visible for the caching range [100000, 100105] (bad cache); this version is used to indicate the overhead introduced by integer caching. The benchmarks were executed once with a mark-sweep GC and once with a generational GC.

Figure 31: SOM benchmarks with integer caching using a mark-sweep garbage collector (average execution time in ms for no cache, bad cache and good cache)

As Figure 31 shows, the mark-sweep collected SOM++ can benefit from integer caching if the number of allocated integers is high. The speedup in IntegerLoop or Storage, for example, is achieved by saving allocations. For the other SOM benchmarks the performance is on the same level as that of the non-caching VM. It is also shown that caching a bad range does not make the VM considerably slower. For Richards and GCBench (Figures 32 and 33) it can likewise be seen that the introduction of integer caching does not negatively affect performance.
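To illustrate the caching scheme evaluated here, the following sketch shows an allocation routine that returns a shared, preallocated integer object for values inside the cached range. The class and function names are illustrative stand-ins, not the actual SOM++ interfaces; only the range bounds correspond to the "good cache" configuration described above.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Illustrative stand-in for the VM's boxed integer class.
struct VMInteger {
    std::int64_t value;
    explicit VMInteger(std::int64_t v) : value(v) {}
};

// Integer caching (sketch): values in [CACHE_LOW, CACHE_HIGH] are created once
// and shared afterwards, so repeated allocations of common small integers are avoided.
constexpr std::int64_t CACHE_LOW  = -5;
constexpr std::int64_t CACHE_HIGH = 100;

// In the VM, the cached objects would have to be registered as GC roots so
// they are never collected.
static std::array<VMInteger*, CACHE_HIGH - CACHE_LOW + 1> integerCache{};

VMInteger* newInteger(std::int64_t value) {
    if (value >= CACHE_LOW && value <= CACHE_HIGH) {
        VMInteger*& cached = integerCache[static_cast<std::size_t>(value - CACHE_LOW)];
        if (cached == nullptr) {
            cached = new VMInteger(value);   // in the VM: a GC-managed allocation
        }
        return cached;                       // shared, preallocated instance
    }
    return new VMInteger(value);             // values outside the range are still boxed per use
}
```

The per-allocation overhead of this scheme is a single range check, which is why the "bad cache" configuration serves as a measure of the cost when the cache is essentially never hit.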


Figure 32: Richards benchmark with integer caching using a mark-sweep garbage collector (average execution time in ms for no cache, bad cache and good cache)

Figure 33: GCBench benchmark with integer caching using a mark-sweep garbage collector (average execution time in s for no cache, bad cache and good cache)


Figure 34: SOM benchmarks with integer caching using a generational garbage collector (average execution time in ms for no cache, bad cache and good cache)

Figures 34 to 36 show the results for the generational GC. When using integer caching with a generational GC, the effect on performance is very small. As allocations are already fast, no substantial speedup can be achieved. On the other hand, SOM++ is also not slowed down, as the overhead of integer caching seems to be negligible.

Although integer caching does not improve the speed when using a modern garbage collector, it can be considered a good alternative to integer tagging: memory can be saved with nearly no negative effects on performance.


Figure 35: Richards benchmark with integer caching using a generational garbage collector (average execution time in ms for no cache, bad cache and good cache)

Figure 36: GCBench benchmark with integer caching using a generational garbage collector (average execution time in s for no cache, bad cache and good cache)


6.5 Tagging on Other Architectures

Finally, the performance of integer tagging is evaluated on x86_64 and ARM machines. Benchmarks for ARM were executed on an otherwise idle BeagleBoard-xM (http://beagleboard.org/hardware-xM) running Ubuntu 12.04. This development board has an ARM Cortex-A8 CPU at 1 GHz with 512 MB of memory. Benchmarks for x86_64 were executed on an Intel Xeon W3580 CPU with 3.33 GHz and 8 MB cache in a machine with 12 GB RAM running Ubuntu 12.04.1.

Figure 37: SOM benchmarks on x86_64 (average execution time in ms)

Figures 37 to 39 confirm that the results found for x86 are also valid for 64-bit systems. Again, the mark-sweep collected VM can benefit from integer tagging if a sufficient amount of integers is allocated. The other garbage collectors either barely benefit in some SOM benchmarks or even get slower in the others. The same applies to the benchmarks executed on ARM, as can be seen in Figures 40 to 42.


Figure 38: Richards benchmark on x86_64 (average execution time in ms)

Figure 39: GCBench benchmark on x86_64 (average execution time in s)


Figure 40: SOM benchmarks on ARM (average execution time in ms)

Figure 41: Richards benchmark on ARM (average execution time in s)


Figure 42: GCBench benchmark on ARM (average execution time in s)


7 Related Work

As mentioned before, only a few sources could be found that cover integer tagging as an optimization for dynamic language virtual machines.

A good overview of various ways to represent type information can be found in “Representing Type Information in Dynamically Typed Languages” by David Gudeman [22]. This report covers tagged integers and also briefly describes NaN-tagging. Although it gives costs in terms of machine cycles required for the various tagging operations, no specific implementation is analysed.

Another excellent description of integer tagging can be found in Rob Sayre's blog [11]. He explains how integer tagging and NaN-tagging are used in Mozilla's JavaScript virtual machines, but here, too, no evaluation of performance gains can be found.

In July 2004 James Y Knight proposed on the Python-Dev mailing list [23] to use integer tagging for CPython (which uses reference counting for garbage collection). He also included a patch implementing tagged integers for Python 2.3. Several posts later he reported the pystones benchmark to be 3.3% and pybench to be 7.2% faster on average. Despite the speedups, two main reasons were discussed for not using tagged integers in CPython. One was that the introduction of tagged integers would break compatibility with third-party libraries. The other reason was the increased code complexity. Guido van Rossum, the creator of Python, replied for example:

“-1000 here. In a past life (the ABC implementation) we used this and it was a horrible experience. Basically, there were too many places where you had to make sure you didn't have an integer before dereferencing a pointer, and finding the omissions one core dump at a time was a nightmare.” (http://mail.python.org/pipermail/python-dev/2004-July/046147.html)

Another VM where integer tagging was briefly tested was SPY [24], a reimplementation of the Squeak Smalltalk VM using the PyPy toolchain (http://pypy.org/). This toolchain allows programs implemented in RPython (a restricted subset of Python) to be translated to efficient C code, including a tracing JIT and generational garbage collection. The result of using tagged integers here was:

“A small bit of experimentation seemed to indicate that using tagged pointers for small integers actually worsens performance. The necessity of checking whether a pointer is a heap pointer or a small integer around every method call offsets all the benefits of the smaller memory footprint that comes with tagged small integers.” [24]


8 Conclusion

During this thesis the integer tagging optimization was implemented for the SOM++ Smalltalk VM. The effects on performance were then analysed in detail by evaluating benchmarks.

The results show that integer tagging can improve the speed of interpreters if the allocation of new objects is slow. This is the case for the mark-sweep collected VM, which achieves its improvements mainly by avoiding expensive allocations. VMs using modern garbage collectors, however, cannot profit as much from avoided allocations, since they usually offer very fast allocations and can handle short-lived objects very well. Here the tag-checking overhead can even reduce performance.

When focussing on the memory footprint, the results in Section 6.4 indicate that integer caching can be a good alternative. It adds nearly no overhead to interpretation and is much easier to implement.

Considering the implementation effort, integer tagging should only be used for interpreters with slow allocations. With fast allocation, or with a JIT that applies optimizations like allocation removal [25], integer tagging is not worth implementing.

8.1 Threats to Validity

Although the results hold for SOM++, it is not quite clear whether they can also be applied to other interpreters. One reason might be the partly unrealistic design. Being a VM developed for education and research, its architecture was kept simple and is not as elaborate as the architecture of other VMs such as Google's V8 or CPython. Although several optimizations have been applied during this thesis, the Richards benchmark, for example, is still 4 times slower on SOM++ than on CPython. Architectural improvements or other optimizations could influence the results.

Another reason might be that the benchmarks evaluated during this thesis are either too small or too specialized to allow general assumptions about the performance of tagging. GCBench, for example, is a benchmark focussing on garbage collector performance. The speed of this benchmark does not necessarily correspond to the speed of other real-life programs.


Bibliography

[1] Adele Goldberg and David Robson. Smalltalk-80: The Language and Its Implementation. Longman Higher Education, December 1983.
[2] Bjarne Stroustrup. The C++ Programming Language: Third Edition. Addison-Wesley, 3rd edition, June 1997.
[3] Michael Haupt, Robert Hirschfeld, Tobias Pape, Gregor Gabrysiak, Stefan Marr, Arne Bergmann, Arvid Heise, Matthias Kleine, and Robert Krahn. The SOM family: virtual machines for teaching and research. In Proceedings of the fifteenth annual conference on Innovation and technology in computer science education, ITiCSE '10, pages 18–22, New York, NY, USA, 2010. ACM.
[4] M. Anton Ertl and David Gregg. The structure and performance of efficient interpreters. Journal of Instruction-Level Parallelism, 5:2003, 2001.
[5] Christian Boland. A Moving Garbage Collector for a Smalltalk VM. Bachelor's thesis, Heinrich-Heine-Universität, Düsseldorf, September 2011.
[6] John McCarthy. Recursive functions of symbolic expressions and their computation by machine, Part I. Commun. ACM, 3(4):184–195, April 1960.
[7] Robert R. Fenichel and Jerome C. Yochelson. A LISP garbage-collector for virtual-memory computer systems. Commun. ACM, 12(11):611–612, November 1969.
[8] David Ungar. Generation Scavenging: A non-disruptive high performance storage reclamation algorithm. In Proceedings of the first ACM SIGSOFT/SIGPLAN software engineering symposium on Practical software development environments, SDE 1, pages 157–167, New York, NY, USA, 1984. ACM.
[9] Richard Gerber, Aart J. C. Bik, Kevin Smith, and Xinmin Tian. The Software Optimization Cookbook: High Performance Recipes for IA-32 Platforms, 2nd Edition. Intel Press, 2nd edition, December 2005.
[10] Google. Design Elements - Chrome V8 — Google Developers. https://developers.google.com/v8/design?hl=en.
[11] Rob Sayre. Mozilla's New JavaScript Value Representation. http://evilpie.github.com/sayrer-fatval-backup/cache.aspx.htm, August 2010.
[12] D. L. Weaver and T. Germond. The SPARC architecture manual. PTR Prentice Hall, 1994.
[13] Mike Pall. LuaJIT 2.0 intellectual property disclosure and research opportunities. http://article.gmane.org/gmane.comp.lang.lua.general/58908, February 2009.
[14] IEEE Standard for Floating-Point Arithmetic. IEEE Std 754-2008, pages 1–58, 2008.
[15] AMD64 Architecture Programmer's Manual Volume 2: System Programming, September 2012.


[16] Michael Haupt, Stefan Marr, and Robert Hirschfeld. CSOM/PL — A Virtual Machine Product Line. Journal of Object Technology, 10:12:1–30, 2011.
[17] Urs Hölzle, Craig Chambers, and David Ungar. Optimizing dynamically-typed object-oriented languages with polymorphic inline caches. In Pierre America, editor, ECOOP'91 European Conference on Object-Oriented Programming, volume 512 of Lecture Notes in Computer Science, pages 21–38. Springer Berlin / Heidelberg, 1991.
[18] Martin Richards. Bench. http://www.cl.cam.ac.uk/~mr10/Bench.html.
[19] Amy Briggs. http://www.cs.middlebury.edu/~briggs/Courses/CS313-F12/smalltalk/stx/goodies/benchmarks/richards/.
[20] Hans Boehm. An Artificial Garbage Collection Benchmark. http://www.hpl.hp.com/personal/Hans_Boehm/gc/gc_bench.html.
[21] Andy Georges, Dries Buytaert, and Lieven Eeckhout. Statistically rigorous Java performance evaluation. In Proceedings of the 22nd annual ACM SIGPLAN conference on Object-oriented programming systems and applications, OOPSLA '07, pages 57–76, New York, NY, USA, 2007. ACM.
[22] D. Gudeman. Representing type information in dynamically typed languages. Technical Report TR93-27, University of Arizona, Department of Computer Science, Tucson, Arizona, 1993.
[23] James Y Knight. [Python-Dev] mailing list - Tagged integers. http://mail.python.org/pipermail/python-dev/2004-July/046139.html, July 2004.
[24] C. Bolz, A. Kuhn, A. Lienhard, N. Matsakis, O. Nierstrasz, L. Renggli, A. Rigo, and T. Verwaest. Back to the future in one week — implementing a Smalltalk VM in PyPy. Self-Sustaining Systems, pages 123–139, 2008.
[25] C. F. Bolz, A. Cuni, M. Fijałkowski, M. Leuschel, S. Pedroni, and A. Rigo. Allocation removal by partial evaluation in a tracing JIT. In Proceedings of the 20th ACM SIGPLAN workshop on Partial evaluation and program manipulation, pages 43–52, 2011.


List of Figures

1  Pharo Smalltalk running unit tests  4
2  A point object referencing two boxed integers  6
3  Examples of a tagged integer with value 2 and an ordinary pointer  6
4  Binary representation of IEEE 754 floating point values π, ±∞ and NaN  8
5  Examples of 32-bit NaN-tagging  8
6  Examples using 64-bit NaN-tagging  9
7  Object graph without integer caching  12
8  Object graph with integer caching  12
9  Before optimization of base classes  14
10  After optimization of base classes  15
11  Number of allocated objects for IntegerLoop benchmark  17
12  Number of allocated objects for Richards benchmark  18
13  Number of allocated objects for GCBench benchmark  18
14  Size of allocated objects for IntegerLoop benchmark  19
15  Histogram of allocated integers for SOM benchmarks  20
16  Histogram of allocated integers for Richards benchmark  20
17  Histogram of allocated integers for GCBench benchmark  21
18  Sends in SOM benchmarks  21
19  Sends in Richards benchmark  22
20  Sends in GCBench benchmark  22
21  Results of SOM benchmarks comparing tagging and non-tagging versions  24
22  Results of Richards benchmark comparing tagging and non-tagging versions  25
23  Results of GCBench comparing tagging and non-tagging versions  25
24  Garbage collection times for SOM benchmarks  26
25  Garbage collection times for Richards benchmark  27
26  Garbage collection times for GCBench  28
27  Comparing non-tagging and additional allocation versions for SOM benchmarks  28
28  Comparing non-tagging and additional allocation versions for Richards benchmark  29
29  Comparing non-tagging and additional allocation versions for GCBench  29
30  Comparing speedups based on allocations and tagging for the mark-sweep collected SOM++  30


31  SOM benchmarks with integer caching using a mark-sweep garbage collector  31
32  Richards benchmark with integer caching using a mark-sweep garbage collector  32
33  GCBench benchmark with integer caching using a mark-sweep garbage collector  32
34  SOM benchmarks with integer caching using a generational garbage collector  33
35  Richards benchmark with integer caching using a generational garbage collector  34
36  GCBench benchmark with integer caching using a generational garbage collector  34
37  SOM benchmarks on x86_64  35
38  Richards benchmark on x86_64  36
39  GCBench benchmark on x86_64  36
40  SOM benchmarks on ARM  37
41  Richards benchmark on ARM  37
42  GCBench benchmark on ARM  38
