
3.2. SOME PITFALLS OF FUNCTIONAL PARALLELISM ON THE MULTICORE ARCHITECTURE

a GC thread for each core, so garbage collection scales better [19]. Coming back to the two examples introduced in Chapter 2, one reason for the linear speedup of the π calculation (see Table 2.1) is that it performs no further significant memory allocation beyond the input array. In contrast, the sublinear speedup of the MergeSort algorithm (see Figure 2.3) can be explained by the many rounds of garbage collection that occur when intermediate arrays are discarded.
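As a rough illustration of this effect (our sketch, not from the thesis; the function names are invented), the two parallel loops below differ only in whether the body allocates: the first updates the input in place, while the second creates a fresh intermediate array on every iteration and therefore keeps the garbage collector busy:

open System.Threading.Tasks

// In-place update: no allocation inside the loop, so the GC stays
// idle and the speedup tends to stay close to linear.
let squareInPlace (xs: float[]) =
    Parallel.For(0, xs.Length, fun i -> xs.[i] <- xs.[i] * xs.[i]) |> ignore

// Allocates a fresh array on every iteration; frequent gen0
// collections then limit the speedup, as with MergeSort above.
let squareWithGarbage (xss: float[][]) =
    Parallel.For(0, xss.Length, fun i ->
        xss.[i] <- Array.map (fun x -> x * x) xss.[i]) |> ignore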

False Cache-line Sharing

When a CPU loads a memory location into its cache, it also loads nearby memory locations into the same cache line, so that accesses to this memory cell and its neighbours are faster. In the context of multithreading, different threads writing to the same cache line may invalidate that line in every CPU's cache and significantly damage performance. In the functional-programming setting, false cache-line sharing is less critical because each value is typically written only once, when it is initialized. However, consecutive memory allocations can still place independent data in the same cache line, which causes the same problem. Common workarounds are padding data which are concurrently accessed, or allocating memory locally in threads.

We illustrate the problem with a small experiment: an array whose size equals the number of cores is created, and each array element is updated 10,000,000 times [25]. Because the array is small, all its elements tend to fall into the same cache line, so the many concurrent updates invalidate that cache line over and over and badly hurt performance. The code fragment below performs concurrent updates on the same cache line:

open System.Threading.Tasks

let par1() =
    let cores = System.Environment.ProcessorCount
    let counts = Array.zeroCreate cores
    Parallel.For(0, cores, fun i ->
        // Every thread repeatedly writes its own element, but all
        // elements share one cache line.
        for j = 1 to 10000000 do
            counts.[i] <- counts.[i] + 1) |> ignore

Measuring the sequential and parallel versions on the 8-core machine gives the following results:

> Real: 00:00:00.647, CPU: 00:00:00.670, GC gen0: 0, gen1: 0, gen2: 0 // sequential
> Real: 00:00:00.769, CPU: 00:00:11.310, GC gen0: 0, gen1: 0, gen2: 0 // parallel
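The timing format above is the one printed by F# Interactive's #time directive, so the measurement session presumably looked similar to the following (our assumption; the thesis does not show the harness):

#time "on"   // FSI now prints Real/CPU/GC statistics after each evaluation
par1()
#time "off"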

The parallel variant is even slower than the sequential one; note that it burns more than 11 seconds of CPU time across the cores to do the same work, which is the signature of cache lines being invalidated back and forth. We can fix the problem by padding the array with garbage data, so that concurrently updated elements land on different cache lines; this approach is 17× faster than the naive sequential one:

let par1Fix1() =
    let cores = System.Environment.ProcessorCount
    // 128 bytes spans two typical 64-byte cache lines, so each
    // counter gets a padded slot of its own.
    let padding = 128 / sizeof<int>
    let counts = Array.zeroCreate ((1 + cores) * padding)
    Parallel.For(0, cores, fun i ->
        let paddedI = (1 + i) * padding
        for j = 1 to 10000000 do
            counts.[paddedI] <- counts.[paddedI] + 1) |> ignore
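The second workaround mentioned earlier, allocating memory locally in threads, can be sketched as follows (par1Fix2 is our hypothetical name, not from the text): each thread accumulates into a local mutable variable and touches the shared array only once, so no cache line is repeatedly invalidated.

let par1Fix2() =
    let cores = System.Environment.ProcessorCount
    let counts = Array.zeroCreate cores
    Parallel.For(0, cores, fun i ->
        // Accumulate locally; the shared array is written once per
        // thread, so its cache line is not ping-ponged between cores.
        let mutable localCount = 0
        for j = 1 to 10000000 do
            localCount <- localCount + 1
        counts.[i] <- localCount) |> ignore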

