Presburger Arithmetic and Its Use in Verification
3.2. SOME PITFALLS OF FUNCTIONAL PARALLELISM ON THE MULTICORE ARCHITECTURE
a GC thread for each core, so scalability of garbage collection is better [19]. Coming back to the two examples introduced in Chapter 2, one of the reasons for the linear speedup of the π calculation (see Table 2.1) is that there is no further significant memory allocation except the input array. By contrast, the sublinear speedup of the MergeSort algorithm (see Figure 2.3) can be explained by the many rounds of garbage collection that occur when intermediate arrays are discarded.
False Cache-line Sharing
When a CPU loads a memory location into cache, it also loads nearby memory locations into the same cache line, so that access to this memory cell and its neighbours is faster. In the context of multithreading, different threads writing to the same cache line may invalidate all CPUs' caches and significantly damage performance. In the functional-programming setting, false cache-line sharing is less critical because each value is often written only once, when it is initialized. However, the fact that consecutive memory allocations make independent data fall into the same cache line also causes problems. Some workarounds are padding data which are accessed concurrently, or allocating memory locally in threads.
We illustrate the problem by a small experiment as follows: an array has a size equal to the number of cores, and each array element is updated 10000000 times [25]. Because the array is small, all its elements tend to fall into the same cache line; the many concurrent updates therefore invalidate that cache line over and over and badly influence performance. The code fragment below shows concurrent updates on the same cache line:
let par1() =
    let cores = System.Environment.ProcessorCount
    let counts = Array.zeroCreate cores
    Parallel.For(0, cores, fun i ->
        for j = 1 to 10000000 do
            counts.[i] <- counts.[i] + 1) |> ignore
The measurement of sequential and parallel versions on the 8-core machine is shown as follows:

> Real: 00:00:00.647, CPU: 00:00:00.670, GC gen0: 0, gen1: 0, gen2: 0 // sequential
> Real: 00:00:00.769, CPU: 00:00:11.310, GC gen0: 0, gen1: 0, gen2: 0 // parallel
The parallel variant is even slower than the sequential one. Note the gap between CPU time and real time in the parallel run: the cores burn roughly 11 s of CPU time to deliver 0.77 s of wall-clock time, mostly invalidating each other's cache lines. We can fix the problem by padding the array with garbage data; this approach is 17× faster than the naive sequential one:
let par1Fix1() =
    let cores = System.Environment.ProcessorCount
    let padding = 128 / sizeof<int>
    let counts = Array.zeroCreate ((1 + cores) * padding)
    Parallel.For(0, cores, fun i ->
        let paddedI = (1 + i) * padding
        for j = 1 to 10000000 do
            counts.[paddedI] <- counts.[paddedI] + 1) |> ignore
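The second workaround mentioned above, allocating memory locally in threads, can be sketched as follows. Each thread accumulates its count in a local mutable variable, which lives in the thread's own stack frame (or a register), and writes to the shared array only once at the end, so no cache line is contended during the hot loop. The name par1Fix2 is a hypothetical label, not taken from the original text:

```fsharp
open System.Threading.Tasks

// Sketch of the thread-local-accumulation workaround (assumed name par1Fix2).
let par1Fix2() =
    let cores = System.Environment.ProcessorCount
    let counts = Array.zeroCreate cores
    Parallel.For(0, cores, fun i ->
        // The accumulator is private to this thread, so the shared
        // cache line holding 'counts' is written only once per thread.
        let mutable localCount = 0
        for j = 1 to 10000000 do
            localCount <- localCount + 1
        counts.[i] <- localCount) |> ignore
```

Compared with padding, this version wastes no memory and keeps the array layout unchanged, at the cost of restructuring the loop body; it is applicable whenever the per-element updates can be accumulated privately before a single final write.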