Using Kilim's isolation types for multicore efficiency

UNIVERSITY OF CAMBRIDGE

Alan Mycroft (∗) 

Computer Laboratory, University of Cambridge 

http://www.cl.cam.ac.uk/users/am/ 

5 October 2011 

(*) I gratefully acknowledge the contributions of Sriram Srinivasan [Kilim, ECOOP’2008] and Robert Ennals and Richard Sharp [PacLang, ESOP’2004].

FoVeOOS’2011

AM’s abstract 

We identify a ‘memory isolation’ property which enables multi-core 

programs to avoid slowdown due to cache contention. 

We give a tutorial on existing work on Kilim and its isolation-type 

system building bridges with both substructural types and memory 

isolation. 


The big picture 

• Traditional object-orientation mode of thinking: 

– everything is an object 

– any object can reference (or even alias) any other – limited 

only by type 

Spaghetti data! 

• Modern hardware isn’t like this: 

– memory isn’t uniform (or sequentially consistent) 

– multi-core parallelism 

• Spaghetti is bad for software engineering too 

• Programming languages can help [PacLang, Singularity, Kilim] 


What’s Kilim, and what’s new? 

• Kilim is a message-passing (actor) framework for Java with 

ultra-lightweight threads and zero-copy message passing 

[ECOOP’2008]. 

• A type system constrains pointer aliasing (using Java annotations) to make message passing (pass-by-value) and pass-by-reference equivalent

• What’s new? No single idea, but the combination of ideas gives 

an effective language design point. 

• This talk highlights issues caused by current multi-core 

processors. 


Why does multi-core affect things? 

• We want more than “call and wait for result” – otherwise we’re 

only using one CPU. So: threads, spawn . . . 

• Threads, locking, races, sequentially-inconsistent memory . . . 

• Caches mean that where computations are done greatly affects 

speed (doing all computations on one CPU is clearly bad). 

Method placement: by user or by system? 

• Communication costs, heterogeneous multi-core: placement is a 

big issue. 


How do programming languages help? 

• Specify invariants which analysers may find hard to discover 

• Provide compiler-enforced invariants 

• Examples: types, “this method always executes on processor 42”. 

• Needs careful design: types are ubiquitously accepted but 

“always execute on processor 42” is likely to be too low-level and 

inflexible during system evolution. 

• This talk: type-like invariants saying “no other thread has an 

alias to my data”. 


Talk structure

• an example

• modern hardware

• sub-structural type systems, ownership

• Kilim


Example banking code 

void checktotal(AccountList p, int expected)
{ int sum = 0;
  for (i in p) sum += i.balance;
  if (sum != expected) report_oddity();
}

void movemoney(AccountList p, int amount)
{ p.first.balance -= amount;
  p.second.balance += amount;
}

Questions: 

1. when can movemoney() and checktotal() be run in parallel? 

2. when can movemoney() be run in parallel with itself? 

3. Locking? Do you know how much this costs? 

It helps for compilers to understand aliasing. 
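To make question 2 concrete, here is a minimal Java sketch. The `Account` class and the `AliasCheck.disjoint` helper are illustrative names, not part of the talk. Two `movemoney()` calls are certainly free of races, without any locking, when their account lists share no object; the helper tests that by reference identity, which is the kind of fact an alias-aware compiler would want to know statically.

```java
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Set;

// Hypothetical account type, standing in for the elements of AccountList.
final class Account {
    int balance;
    Account(int balance) { this.balance = balance; }
}

final class AliasCheck {
    // True when no Account object occurs in both lists (compared by identity,
    // not by equals), i.e. the two lists do not alias each other's accounts.
    static boolean disjoint(List<Account> a, List<Account> b) {
        Set<Account> seen = Collections.newSetFromMap(new IdentityHashMap<>());
        seen.addAll(a);
        for (Account acc : b) {
            if (seen.contains(acc)) return false;
        }
        return true;
    }
}
```

A run-time check like this costs time on every call; the point of the talk is that a type system can establish the same disjointness once, at compile time.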







Introduction to multi-core hardware semantics 

Two ‘obvious’ but incorrect statements: 

1. If a program with two threads runs on a single-core processor 

then it will run unchanged on a two-core processor. 

2. (After correcting problems in 1.) a two-thread program will run 

faster on a two-core processor than it runs on a single-core 

processor. 

Brief answer: we need to think of multi-core processors as distributed systems, not merely as concurrent systems.


Multi-core and sequential consistency 

On a single-core (implementing tasking using interrupts) the 

instructions in each thread are interleaved. 

volatile int x=0,y=0; 

thread1: { x=1; print "y=",y; } 

thread2: { y=1; print "x=",x; } 

For most executions this program prints x=0, y=1 or x=1, y=0; 

• relatively rarely it prints x=1, y=1; 

• but it never prints x=0, y=0. 

But on multi-core x=0, y=0 can happen! Not all executions interleave 

instructions from the two threads (they do on a single-core processor 

with interrupt-driven scheduling). Failure of sequential consistency. 
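The claim that interleaving can never print x=0, y=0 can be checked mechanically. The following is a minimal sketch (class and method names are illustrative) that enumerates every interleaving of the two threads that respects each thread's program order, and collects the values each thread would print.

```java
import java.util.Set;
import java.util.TreeSet;

final class ScLitmus {
    // Returns every outcome reachable by interleaving
    //   thread1: x=1; read y      thread2: y=1; read x
    static Set<String> outcomes() {
        Set<String> result = new TreeSet<>();
        explore(0, 0, 0, 0, -1, -1, result);
        return result;
    }

    // pc1/pc2: how many of its two steps each thread has executed.
    // seenY is what thread1 prints for y; seenX is what thread2 prints for x.
    private static void explore(int pc1, int pc2, int x, int y,
                                int seenY, int seenX, Set<String> out) {
        if (pc1 == 2 && pc2 == 2) {
            out.add("x=" + seenX + ", y=" + seenY);
            return;
        }
        if (pc1 == 0) explore(1, pc2, 1, y, seenY, seenX, out); // thread1 writes x
        if (pc1 == 1) explore(2, pc2, x, y, y, seenX, out);     // thread1 reads y
        if (pc2 == 0) explore(pc1, 1, x, 1, seenY, seenX, out); // thread2 writes y
        if (pc2 == 1) explore(pc1, 2, x, y, seenY, x, out);     // thread2 reads x
    }
}
```

The resulting set contains the three outcomes listed above and not x=0, y=0, so seeing that outcome on real hardware means the execution was not an interleaving at all.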


Multi-core and sequential consistency (2) 

volatile int x=0,y=0; 

thread1: { x=1; print "y=",y; } 

thread2: { y=1; print "x=",x; } 

Why can x=0, y=0 happen? 

• On an isolated computation it is quite valid for a read and a 

write to distinct locations to be re-ordered 

• Single-core processors exploit this for speed (pipelining) 

• Manufacturers of multi-core processors want a single CPU of a 

multicore processor to be as fast as a CPU of a single core. 

Solution: programmer’s responsibility(!) to fix such races, e.g. by 

locking, or using mfence [both expensive in time]. 

Find “[Relaxed] Memory Model” for guidance. 


Multi-core hardware semantics (2) 

Incorrect statement: a two-thread program will run faster on a 

two-core processor than it runs on a single-core processor. 

Reason: caches [ . . . implicit data copying] 


A programmer’s view of memory

[Figure: a CPU connected directly to MEMORY; each access costs 1 cycle.]

(This model was pretty accurate in 1985.)

A 2004-era single-core view of memory and timings

[Figure: CPU → L1 cache (2) → L2 cache (10) → MEMORY (200); the numbers are the access costs marked on each link.]


Multi-core-chip memory models 

Today’s model (cache simplified to one level): 

[Figure: block diagram. Labels: CPU 1 with CACHE 1 (cost 2); CPU 2–15 with CACHES 2–15 (cost 2); coherency between the caches; other CPU or GPU etc. with FAST MEMORY (cost 2), marked incoherency; DMA; MEMORY (cost 200).]


Why can a program run slower on multicore? 

• CPU1 writes a cache line ⇒ data in CPU2’s cache is discarded. 

• CPU2 now writes this cache line ⇒ the line must be reloaded 

(even from memory) and data in CPU1’s cache discarded. 

• A two-core version of virtual memory ‘thrashing’ (or of repeated 

cache reloading in naive big-matrix multiplication). 

It’s harmless (or even good) if two threads running on the same 

processor core both access a cache line (since all accesses will hit the 

cache), but . . . 

it’s bad if two threads running on different processor cores access the 

same cache line. Program runs slower on two cores than one! 

Multicore cache model is like multiple-reader/single-writer . . . 


How should we react? 

Shared memory and parallelism is problematic: 

• hard to write correct code (locks, mfence etc. needed) 

• sequential consistency surprises in semantics 

• cache effect surprises in performance 

But lots of these problems go away if the programmer knows when 

shared data is logically being transferred between cores. 

Type systems (richer than usual) can express and enforce this . . . 


Aside: shared memory vs. message passing 

Which is more efficient? 

Classical answer is shared memory (because message passing is 

‘implemented on top of it’). 

However, note that on a dual-core processor the illusion of shared 

memory is implemented via message passing in the cache-coherency 

protocol! 


Caches are multiple-reader/single-writer 

Caches are like ownership: 

• A core writing seizes ownership of the cache line (rescinding all 

ownership held by other cores) 

• A core reading shares ownership with the most recent writer and 

all intervening readers of that cache line. 

The transition from no-ownership to ownership costs time. 
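The ownership reading of caches can be turned into a toy cost model. This is a minimal sketch under simplifying assumptions (one cache line, no capacity evictions, every transition to ownership counted as one unit of cost); `CacheLineModel` and its methods are illustrative names, not a description of any real coherency protocol.

```java
import java.util.HashSet;
import java.util.Set;

// Toy model of one cache line under a multiple-reader/single-writer discipline.
final class CacheLineModel {
    private final Set<Integer> holders = new HashSet<>(); // cores holding a valid copy
    private int transfers = 0;                            // costly ownership transitions

    void write(int core) {
        // A writer must be the sole holder; otherwise it seizes ownership.
        if (!(holders.size() == 1 && holders.contains(core))) {
            holders.clear();
            holders.add(core);
            transfers++;
        }
    }

    void read(int core) {
        // A reader shares ownership; joining the holders is the costly step.
        if (holders.add(core)) transfers++;
    }

    int transfers() { return transfers; }

    // Two cores alternately write the same line: every write is a transfer.
    static int pingPong(int rounds) {
        CacheLineModel line = new CacheLineModel();
        for (int i = 0; i < rounds; i++) { line.write(0); line.write(1); }
        return line.transfers();
    }

    // Each core writes only its own line: one transfer per line, ever.
    static int disjoint(int rounds) {
        CacheLineModel a = new CacheLineModel();
        CacheLineModel b = new CacheLineModel();
        for (int i = 0; i < rounds; i++) { a.write(0); b.write(1); }
        return a.transfers() + b.transfers();
    }
}
```

In this model `pingPong(100)` incurs 200 transitions while `disjoint(100)` incurs 2, which is the gap between cores sharing a written line and cores working on disjoint memory.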


Effective use of modern hardware – conclusions 

• Need careful control of concurrent memory access. 

• For efficiency, the program has to be aware when data (a cache line) is first accessed by a new CPU. Writes invalidate all but the writer CPU’s copy of a cache line: a single-writer, multiple-reader view.

• Solution: need to express and control inter-processor aliasing at 

source level. 

(understanding aliasing is also a good idea for software 

engineering purposes!) 

• Best speed-up is when separate cores access disjoint memory. 

• Minimise the number of ownership transfers (i.e. cache-level data 

movement). 


Classical Type Systems 

Compilers and the theory of type systems model the variables in scope with an environment of type assumptions Γ, a set of pairs x : t.

We have the judgement form Γ ⊢ e : t which holds when expression e 

has type t under assumptions Γ. 

Note that Γ changes only at scope entry/exit. 

The standard rule for variables says: 

Γ ⊢ x : t 

(provided x : t ∈ Γ) 

Note that together these say (almost too obvious to notice/question): 

“each use of a variable in a scope has the same type”. 


Type Systems – weakness 

Variables keeping the same type throughout a block means that the 

two errors in the following program can’t be detected by the type 

system: 

{ char *x = malloc(10); // x has type char * 

foo(x); // x has type char * 

free(x); // x has type char * (but should not??) 

foo(x); // a well-known disaster... 

x = malloc(20); // x has type char * (and should) 

// problem: a memory leak now occurs 

} // as x goes out-of-scope 

Replacing free(x) with pass_ownership(x,task42) shows the 

weakness of classical types at controlling sharing of data. 
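To see what a classical type system misses, here is a minimal Java sketch of a single-owner handle that performs the missing check at run time. `Owned` and `passOwnership` are hypothetical names chosen to echo `pass_ownership(x,task42)`; they are not from Kilim or any library.

```java
// Hypothetical single-owner handle. The Java type of a handle never changes,
// so only a run-time flag can record that ownership has moved on.
final class Owned<T> {
    private T value;

    Owned(T value) { this.value = value; }

    // Access the value; faults if ownership was already passed on.
    T get() {
        if (value == null) {
            throw new IllegalStateException("used after ownership transfer");
        }
        return value;
    }

    // Hand the value to a new owner and invalidate this handle.
    Owned<T> passOwnership() {
        Owned<T> next = new Owned<>(get());
        value = null;
        return next;
    }
}
```

Calling `get()` on the old handle after `passOwnership()` type-checks in Java and fails only when executed. The substructural systems developed next aim to reject such a program before it runs.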


Type Systems – weakness (2) 

• We want each variable to be free’d exactly (∗) once. 

• ‘Once’ means ‘once on each possible control path’. 

• Let’s develop a type system in which each variable can only be 

used once – and then refine it to “at most one free-like 

operation” 

(∗) at most once if we don’t care about space leaks! 


Classical and Linear Type Systems (1) 

Standard ML-like type system: 

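\[
\text{(VAR)}\;\; \Gamma[x : t] \vdash x : t
\qquad
\text{(INT)}\;\; \Gamma \vdash n : \mathit{int}
\qquad
\text{(BOOL)}\;\; \Gamma \vdash b : \mathit{bool}
\]

\[
\text{(LAM)}\;\; \frac{\Gamma[x : t] \vdash e : t'}{\Gamma \vdash \lambda x.e : t \to t'}
\qquad
\text{(APP)}\;\; \frac{\Gamma \vdash e_1 : t \to t' \qquad \Gamma \vdash e_2 : t}{\Gamma \vdash e_1\, e_2 : t'}
\]

\[
\text{(COND)}\;\; \frac{\Gamma \vdash e_1 : \mathit{bool} \qquad \Gamma \vdash e_2 : t \qquad \Gamma \vdash e_3 : t}{\Gamma \vdash \mathtt{if}\ e_1\ \mathtt{then}\ e_2\ \mathtt{else}\ e_3 : t}
\]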

Note that Γ only changes on scope change (here LAM). 


Classical and Linear Type Systems (2) 

An equivalent type system. Assumptions Γ, ∆ are now multi-sets (i.e. 

permutable lists): 

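\[
\text{(VAR)}\;\; [x : t] \vdash x : t
\qquad
\text{(INT)}\;\; [\,] \vdash n : \mathit{int}
\qquad
\text{(BOOL)}\;\; [\,] \vdash b : \mathit{bool}
\]

\[
\text{(LAM)}\;\; \frac{\Gamma[x : t] \vdash e : t'}{\Gamma \vdash \lambda x.e : t \to t'}
\qquad
\text{(APP)}\;\; \frac{\Gamma \vdash e_1 : t \to t' \qquad \Delta \vdash e_2 : t}{\Gamma, \Delta \vdash e_1\, e_2 : t'}
\]

\[
\text{(COND)}\;\; \frac{\Gamma \vdash e_1 : \mathit{bool} \qquad \Delta \vdash e_2 : t \qquad \Delta \vdash e_3 : t}{\Gamma, \Delta \vdash \mathtt{if}\ e_1\ \mathtt{then}\ e_2\ \mathtt{else}\ e_3 : t}
\]

\[
\text{(WEAKEN)}\;\; \frac{\Gamma \vdash e : t}{\Gamma, \Delta \vdash e : t}
\qquad
\text{(CONTRACT)}\;\; \frac{\Gamma, \Delta, \Delta \vdash e : t}{\Gamma, \Delta \vdash e : t}
\]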

Why do this? Answer: substructural type systems – can now adjust 

WEAKEN/CONTRACT. 


Substructural type systems 

(WEAKEN) and (CONTRACT) are known as structural rules. 

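\[
\text{(WEAKEN)}\;\; \frac{\Gamma \vdash e : t}{\Gamma, \Delta \vdash e : t}
\qquad
\text{(CONTRACT)}\;\; \frac{\Gamma, \Delta, \Delta \vdash e : t}{\Gamma, \Delta \vdash e : t}
\]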

They control how the assumptions Γ can be passed within an inference tree. [There’s a correspondence between combinators S and K and (CONTRACT) and (WEAKEN).]

Using both WEAKEN and CONTRACT gives a system equivalent to 

the previous one. 

Without CONTRACT every variable must be used at most once. 

Without WEAKEN every variable must be used at least once. 

This is a good start for banning use-after-free and memory-leak 

errors. [Apology: I’m slightly cheating in that higher-order functions are considerably more complicated.]
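The two usage disciplines can be illustrated by simply counting occurrences. This is a minimal sketch, assuming a first-order program flattened to a list of variable uses on one control path (so branches and higher-order functions, the complications mentioned above, are ignored); `UsageCheck` and its method names are illustrative.

```java
import java.util.Collections;
import java.util.List;

final class UsageCheck {
    // Discipline without CONTRACT: no declared variable is used more than once.
    static boolean atMostOnce(List<String> declared, List<String> uses) {
        for (String v : declared) {
            if (Collections.frequency(uses, v) > 1) return false;
        }
        return true;
    }

    // Discipline without WEAKEN: every declared variable is used.
    static boolean atLeastOnce(List<String> declared, List<String> uses) {
        for (String v : declared) {
            if (Collections.frequency(uses, v) < 1) return false;
        }
        return true;
    }

    // Dropping both structural rules: every variable is used exactly once.
    static boolean exactlyOnce(List<String> declared, List<String> uses) {
        return atMostOnce(declared, uses) && atLeastOnce(declared, uses);
    }
}
```

A double use of x (the shape of use-after-free) fails `atMostOnce`, and a variable that is declared but never consumed (the shape of a leak) fails `atLeastOnce`.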


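How to read this intuitively

Consider:

\[
\text{(COND)}\;\; \frac{\Gamma \vdash e_1 : \mathit{bool} \qquad \Delta \vdash e_2 : t \qquad \Delta \vdash e_3 : t}{\Gamma, \Delta \vdash \mathtt{if}\ e_1\ \mathtt{then}\ e_2\ \mathtt{else}\ e_3 : t}
\]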

• Use of ∆ says both arms of if-then-else get the same variables. 

• The use of Γ, ∆ in the result of the inference rule says: “share 

out the variables in scope: passing some (Γ) to e1 and some (∆) 

to e2 (and e3).” 

It helps to read inference rules from the bottom (goal-oriented). 

However, we want to refine this all-or-nothing partition – we’d like to 

allow (e.g.) x.f to be tested in e1 before x is freed in e2 (and not 

vice-versa). 


An algebra of capabilities 

Above in Γ, ∆, the “,” operator shares each x : t into either Γ or ∆. 

(Cf. partial function.) 

We’d like an algebra of capabilities, e.g. {unused, free, readonly}, where unused models the partiality above and free means ‘this variable is allowed to be free’d’. These are combined with the “;” (afterwards) operator, giving for example:

• unused ; x = x 

• x ; unused = x 

• readonly ; readonly = readonly 

• readonly ; free = free 

• free ; readonly is undefined 
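The equations above amount to a partial binary operation, which can be written down directly. The following is a minimal sketch; the enum and method names are illustrative, and the case free ; free, which the list above does not mention, is treated as undefined here as an assumption (it would correspond to freeing twice).

```java
import java.util.Optional;

enum Capability {
    UNUSED, READONLY, FREE;

    // this ; next : the capability needed overall when a use at 'this'
    // is followed by a use at 'next'. Optional.empty() means undefined.
    Optional<Capability> then(Capability next) {
        if (this == UNUSED) return Optional.of(next);   // unused ; x = x
        if (next == UNUSED) return Optional.of(this);   // x ; unused = x
        if (this == READONLY) return Optional.of(next); // readonly ; readonly, readonly ; free
        return Optional.empty();                        // free ; readonly (and, by assumption, free ; free)
    }
}
```

Note that the operation is not commutative: `READONLY.then(FREE)` is defined but `FREE.then(READONLY)` is not, which is exactly the ordering constraint that a plain partition of variables cannot express.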


An algebra of capabilities (2)

[A subtlety in the type system.]

We actually need another operator on environments to deal with passing the same variable as multiple arguments to a procedure:

int g(free x, readonly y)
{ if (...) { deallocate(x); return y.field; }
  else { int z = y.field; deallocate(x); return z; }
}

int f(free x) { return g(x,x); }

The function g() is perfectly fine – but a problem arises when its 

arguments alias (as in the call from f()). The “+” operator on 

environments forbids f() by being slightly more restrictive than “;” 

on argument tuples. 

• both free + readonly and readonly + free are undefined. 


From algebra to algorithm 

This ‘immaculate’ splitting “;” is great for type theorists, but it is not obvious how to implement it. Observe that it behaves like a sum on usages: e.g. the sum effect of a readonly use and a free use is a free.

When implementing this in a traditional type-checking left-right 

depth-first tree walk we implement the sum above as difference: 

• if x has type free [able to be freed] and the LHS operand only 

uses x as readonly then free still remains for the RHS 

• I.e. free − readonly = free 

• But readonly − free is undefined 

• Similarly x − unused = x 

• But unused − x is undefined (for x ≠ unused).
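The difference operation can be sketched as a function that a left-to-right checker threads through the uses of a variable. This is a minimal illustration with hypothetical names. The cases readonly − readonly = readonly and free − free = unused are not stated in the talk; they are assumptions chosen to be consistent with reading “;” as a sum.

```java
import java.util.Optional;

enum Cap { UNUSED, READONLY, FREE }

final class CapDifference {
    // available − used: what is left for the code to the right,
    // or Optional.empty() if the use is not permitted at all.
    static Optional<Cap> minus(Cap available, Cap used) {
        if (used == Cap.UNUSED) return Optional.of(available);  // x − unused = x
        if (available == Cap.UNUSED) return Optional.empty();   // unused − x undefined
        if (available == Cap.READONLY && used == Cap.FREE) {
            return Optional.empty();                            // readonly − free undefined
        }
        if (used == Cap.READONLY) {
            // free − readonly = free (from the talk);
            // readonly − readonly = readonly (assumption)
            return Optional.of(available);
        }
        return Optional.of(Cap.UNUSED);                         // free − free = unused (assumption)
    }

    // Walks the uses of one variable left to right, threading the remainder.
    static boolean check(Cap declared, Cap... uses) {
        Optional<Cap> remaining = Optional.of(declared);
        for (Cap use : uses) {
            if (!remaining.isPresent()) return false;
            remaining = minus(remaining.get(), use);
        }
        return remaining.isPresent();
    }
}
```

For example, `check(Cap.FREE, Cap.READONLY, Cap.FREE)` succeeds (read, then free), while `check(Cap.FREE, Cap.FREE, Cap.READONLY)` fails because nothing remains after the free.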


Aside: Separation logic 

People familiar with judgements in separation logic will recognise its 

rules as being a substructural system: H1 ∗ H2 on heaps (partial 

maps from locations to values) is only defined when 

dom H1 ∩ dom H2 = {}. 

Indeed Parkinson’s generalisation to fractional ownership, e.g. 

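\[
[x \overset{0.4}{\to} a,\; y \overset{0.5}{\to} b] \;*\; [x \overset{0.3}{\to} a,\; z \overset{0.6}{\to} c] \;=\; [x \overset{0.7}{\to} a,\; y \overset{0.5}{\to} b,\; z \overset{0.6}{\to} c]
\]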

is similarly inspired by avoiding “all-or-nothing” ownership and 

parallels ‘;’ and ‘+’ above. 
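The fractional example can be replayed in code. This is a minimal sketch with illustrative names; shares are stored as integer percentages (an assumption made to avoid floating-point rounding), and the combination is taken to be undefined when two heaps disagree on a value or when the shares of a location would exceed full ownership.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

final class FractionalHeap {
    // One heap cell: a value plus an ownership share in percent (100 = full).
    static final class Cell {
        final String value;
        final int percent;
        Cell(String value, int percent) { this.value = value; this.percent = percent; }
    }

    // h1 * h2: shares of common locations are added; Optional.empty() when
    // the values disagree or a combined share would exceed 100 percent.
    static Optional<Map<String, Cell>> star(Map<String, Cell> h1, Map<String, Cell> h2) {
        Map<String, Cell> out = new HashMap<>(h1);
        for (Map.Entry<String, Cell> e : h2.entrySet()) {
            Cell mine = out.get(e.getKey());
            Cell theirs = e.getValue();
            if (mine == null) {
                out.put(e.getKey(), theirs);
                continue;
            }
            int total = mine.percent + theirs.percent;
            if (!mine.value.equals(theirs.value) || total > 100) return Optional.empty();
            out.put(e.getKey(), new Cell(mine.value, total));
        }
        return Optional.of(out);
    }
}
```

With all-or-nothing ownership (every share equal to 100) the same function degenerates to the disjoint-domain condition stated above for H1 ∗ H2.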






• Kilim – a Java-based practical solution 


Kilim syntax 

• Syntax same as Java, with additional type qualifiers: @free, 

@cuttable, @safe and we add @readonly. 

• These qualify classes marked as Message. 

• Fields of Message types can only be other Message types and 

Java primitive types. 

• Design choice: Message types are trees and have no heap 

aliasing. Subversive question: what does 

class Tree { int val; Tree left,right; }

define in Java? A tree? A graph? 

• Only Message types can be transferred between actors [next 

slide]. 


Kilim qualifiers 

Kilim requires Messages to be tree-like. Qualifiers restrict them further:

@free forbids an object to have heap-pointers to it (a tree root). 

[Not if-and-only-if for static analysis reasons.] 

@cuttable marks a value that becomes @free once its parent’s pointer to it is nullified

@safe forbids structural modification of the Message recursively 

@readonly forbids any modification of the Message recursively. 

Note these are deep qualifiers, unlike const in C. 

• Only @free Message types can be transferred between actors. 

(And, more subtly, as method return types!) 


Examples 

void send(@free x) { ... builtin ...} 

bool send2(@free x) { send(x); send(x); } // illegal 

bool foo(@safe x) { x.count++; return length(x) != 42; } 

void maybesend1(@free x) { if (foo(x)) send(x); } // OK 

void maybesend2(@free x) { @safe y = x;
                           if (foo(y)) send(x);
                           // accessing y is illegal here.
                         }

Note the last subtlety. Although qualifiers appear to act on variables, 

they actually operate on values (capabilities). 

So qualifiers on variables need to be downgraded due to dataflow. 
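The idea that qualifiers follow values rather than variables can be illustrated with a toy checker. This is a minimal sketch, not Kilim’s algorithm: it handles only straight-line code, tracks only whether a value has been sent, and all names are hypothetical.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Toy checker: variables name values; sending a value invalidates
// every variable that aliases it, not just the one named in send().
final class SendChecker {
    private final Map<String, Integer> valueOf = new HashMap<>(); // variable -> value id
    private final Set<Integer> sent = new HashSet<>();            // values already sent
    private int nextId = 0;

    // x := new
    void fresh(String x) { valueOf.put(x, nextId++); }

    // y = x (y now names the same value as x)
    void alias(String y, String x) { valueOf.put(y, valueOf.get(x)); }

    // Any access to x; returns false (a fault) if its value was already sent.
    boolean use(String x) { return !sent.contains(valueOf.get(x)); }

    // send(x); returns false (a fault) if the value is no longer owned.
    boolean send(String x) {
        if (!use(x)) return false;
        sent.add(valueOf.get(x));
        return true;
    }
}
```

Replaying `send2` gives a fault on the second `send("x")`, and replaying `maybesend2` gives a fault on any `use("y")` after `send("x")`, even though y was never passed to `send`.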


Core Kilim syntax 

FuncDcl ::= free_opt m(p : α) { (lb : Stmt)∗ ; }

Stmt    ::= x := new | x := y
          | x := y.f | x.f := y | x := cut(y.f)
          | x := y[·] | x[·] := y | x := cut(y[·])
          | x := m(y) | if/goto lb | return x

x, y, p ∈ variable names     f ∈ field names

lb ∈ label names             m ∈ function names

sel ∈ field names ∪ {[·]}    ([·] is a pseudo field name for array access)

α, β ∈ isolation qualifiers {free, cuttable, safe}

null is treated as a special readonly variable


Kilim – aliasing and de-aliasing (role of cut) 

• Note that the x.sel = y form is the only creator of heap aliases. 

This can degrade y (e.g.) from @free to @cuttable. 

Note also that attempts to create heap-sharing will be faulted 

(next slide). 

• The cut operator severs a subtree from its @cuttable parent 

thus: 

y = cut(x.sel)  ≝  y = x.sel; x.sel = null;

Crucially it also marks y as @free; the right-hand side above would not do this (especially on arrays). Doing just y = x.sel would leave y as @cuttable; the x.sel = null makes good the semantic promise that y has no pointers to it and is hence @free.
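The run-time effect of cut is small enough to show in plain Java. This is a minimal sketch using a hypothetical `TreeMsg` class rather than Kilim’s own Message types; the static part of cut, re-qualifying the result as @free, has no counterpart in plain Java and is only noted in the comments.

```java
// Hypothetical tree-shaped message type (not one of Kilim's classes).
final class TreeMsg {
    int val;
    TreeMsg left, right;

    TreeMsg(int val) { this.val = val; }

    // y = cut(x.left): detach the left subtree and return it.
    // After the call x holds no pointer to the result, which is what
    // would justify a checker treating the result as @free.
    static TreeMsg cutLeft(TreeMsg x) {
        TreeMsg y = x.left;
        x.left = null;
        return y;
    }
}
```

After `TreeMsg sub = TreeMsg.cutLeft(root)`, the subtree is reachable only through `sub`, so handing it to another actor cannot leave the sender with an alias through `root`.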


Kilim isolation type checking 

Many fine details [ECOOP’2008]. Principal steps: 

1. Do a simplified shape analysis on the program to over-estimate 

possible aliasing. At each program point get a shape graph: 

• Nodes: finite, abstracting run-time heap locations. 

• Nodes labelled by which local variables may point to them. 

Special node ∅ for unknown internal structure. 

[Core Kilim syntax enforces “no unnamed temporaries”] 

• Edges represent possible run-time references. 

• Edges labelled with field names. 

A simple forward-dataflow iteration. 

2. At each program point downgrade declared qualifier information 

by dataflow and fault if inconsistent. 






• Kilim – a Java-based practical solution 

• Some random observations 


One neat point 

When values are used linearly (no longer referenced (∗) by the caller 

after being passed to another method/thread) then: 

• call-by-reference (pass-by-identity) = call-by-value (pass-by-copy) 

This sidesteps the well-known problem that RMI is a non-trivial solution to exploiting multicore: it is heavyweight, and its marshalling implies pass-by-copy rather than pass-by-identity.

(*) Of course, the callee can later pass such a value back to the caller. 
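The equivalence can be seen with standard Java concurrency classes. This is a minimal sketch that uses `java.util.concurrent.ArrayBlockingQueue` as a stand-in for a mailbox; it does not use Kilim’s API, and nothing in plain Java stops the sender from touching the message afterwards, which is the gap the isolation types close.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

final class Handoff {
    // Sends one message to a consumer thread and reports whether the very
    // same object arrived (pass-by-identity: no copy, no marshalling).
    static boolean sameObjectArrives() {
        BlockingQueue<int[]> mailbox = new ArrayBlockingQueue<>(1);
        int[][] received = new int[1][];
        Thread consumer = new Thread(() -> {
            try {
                received[0] = mailbox.take();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        consumer.start();
        try {
            int[] message = {1, 2, 3};
            mailbox.put(message);
            consumer.join(); // join makes the consumer's write to received[0] visible
            // A linear sender would not refer to 'message' again; the
            // comparison below exists only to demonstrate identity.
            return received[0] == message;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

Because the sender gives up the reference, the receiver cannot tell this handoff apart from receiving a fresh copy, yet only a pointer was transferred.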


Another neat point 

Having a non-modified data-structure shared by several threads is a 

common programming idiom. 

Remember we said that caches work like multiple-reader 

single-writer? Hence they work efficiently on the above idiom (each 

CPU gets a private cache copy of it, and no invalidations occur). 

• Kilim above only allows Message-type objects to be transferred 

between actors – other Java types are faulted. 

• Some of these are harmless (e.g. String) 

• Marking these as Sharable directs Kilim to allow them to be 

passed between actors. 


What I didn’t say about Kilim 

• There are also classical Java objects (no restrictions on aliasing) 

but these cannot be passed between threads – except via an 

(unsafe) loophole. 

• Implementation: embedded in Java via a bytecode rewriter.

• Ultra-fast task switching (one million threads) implemented by 

stack reflect/reify using @pausable attribute. 

• Thread creation and messaging compare favourably with Erlang.


What I didn’t say about ownership type systems 

• The Kilim ownership system is based on heap-pointer uniqueness: each heap object has at most one pointer to it.

• In addition there are also ownership type systems based on: 

– owners as dominators 

– owners as modifiers 


Conclusions 

• multi-core works best under process memory isolation: each 

process owns disjoint memory. A good model (actor-like). 

• in general need ownership transfer – hardware cost of this is that 

of a copy. 

• good if ownership transfer is policed by a type system; syntactic 

markers give the compiler the right to optimise actual data 

transfer. 

• substructural types provide such a type system 

• might need loopholes/exceptions to the type system for real 

programs (most of the data is still protected). 

• Kilim is one solution (a sweet-spot).
