Is Parallel Programming Hard, And, If So, What Can You Do About It?


5.4 Parallel Fastpath

...of blocks and then frees it, with the size of the group being the "allocation run length" displayed on the x-axis. The y-axis shows the number of successful allocation/free pairs per microsecond; failed allocations are not counted. The "X"s are from a two-thread run, while the "+"s are from a single-threaded run.

[Figure 5.32: Allocator Cache Performance. X-axis: allocation run length (0 to 25); y-axis: allocations/frees per microsecond (0 to 30).]

Note that run lengths up to six scale linearly and give excellent performance, while run lengths greater than six show poor performance and almost always also show negative scaling. It is therefore quite important to size TARGET_POOL_SIZE sufficiently large, which fortunately is usually quite easy to do in actual practice [MSK01], especially given today's large memories. For example, in most systems, it is quite reasonable to set TARGET_POOL_SIZE to 100, in which case allocations and frees are guaranteed to be confined to per-thread pools at least 99% of the time.

As can be seen from the figure, the situations where common-case data ownership applies (run lengths up to six) provide greatly improved performance compared to the cases where locks must be acquired. Avoiding locking in the common case will be a recurring theme through this book.

Quick Quiz 5.15: In Figure 5.32, there is a pattern of performance rising with increasing run length in groups of three samples, for example, for run lengths 10, 11, and 12. Why?

Quick Quiz 5.16: Allocation failures were observed in the two-thread tests at run lengths of 19 and greater. Given the global-pool size of 40 and the per-CPU target pool size of three, what is the smallest allocation run length at which failures can occur?

5.4.4.7 Real-World Design

The toy parallel resource allocator was quite simple, but real-world designs expand on this approach in a number of ways.

First, real-world allocators are required to handle a wide range of allocation sizes, as opposed to the single size shown in this toy example. One popular way to do this is to offer a fixed set of sizes, spaced so as to balance external and internal fragmentation, such as in the late-1980s BSD memory allocator [MK88]. Doing this would mean that the "globalmem" variable would need to be replicated on a per-size basis, and that the associated lock would similarly be replicated, resulting in data locking rather than the toy program's code locking.

Second, production-quality systems must be able to repurpose memory, meaning that they must be able to coalesce blocks into larger structures, such as pages [MS93]. This coalescing will also need to be protected by a lock, which again could be replicated on a per-size basis.

Third, coalesced memory must be returned to the underlying memory system, and pages of memory must also be allocated from the underlying memory system. The locking required at this level will depend on that of the underlying memory system, but could well be code locking. Code locking can often be tolerated at this level, because this level is so infrequently reached in well-designed systems [MSK01].

Despite this real-world design's greater complexity, the underlying idea is the same: repeated application of parallel fastpath, as shown in Table 5.1.

Table 5.1: Schematic of Real-World Parallel Allocator

    Level               Locking          Purpose
    Per-thread pool     Data ownership   High-speed allocation
    Global block pool   Data locking     Distributing blocks among threads
    Coalescing          Data locking     Combining blocks into pages
    System memory       Code locking     Memory from/to system
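The data-ownership fastpath discussed above can be sketched roughly as follows. The toy pool sizes of three and 40 come from the quizzes above, but the function names, the spill threshold of twice TARGET_POOL_SIZE, and the array-based pool layout are illustrative assumptions, not the book's actual listing.

```c
#include <pthread.h>
#include <stddef.h>

#define TARGET_POOL_SIZE 3      /* per-thread pool target (toy value) */
#define GLOBAL_POOL_SIZE 40     /* global pool capacity (toy value) */

struct memblock { struct memblock *next; };

static struct memblock blocks[GLOBAL_POOL_SIZE];

/* Global pool: shared among threads, so protected by a lock (code locking). */
static struct memblock *global_pool[GLOBAL_POOL_SIZE];
static int global_count;
static pthread_mutex_t global_lock = PTHREAD_MUTEX_INITIALIZER;

/* Per-thread pool: owned by one thread, so no locking (data ownership). */
static __thread struct memblock *thread_pool[2 * TARGET_POOL_SIZE];
static __thread int thread_count;

void memblock_init(void)
{
	for (int i = 0; i < GLOBAL_POOL_SIZE; i++)
		global_pool[i] = &blocks[i];
	global_count = GLOBAL_POOL_SIZE;
}

struct memblock *memblock_alloc(void)
{
	if (thread_count == 0) {        /* slowpath: refill from global pool */
		pthread_mutex_lock(&global_lock);
		while (thread_count < TARGET_POOL_SIZE && global_count > 0)
			thread_pool[thread_count++] = global_pool[--global_count];
		pthread_mutex_unlock(&global_lock);
		if (thread_count == 0)
			return NULL;        /* allocation failure */
	}
	return thread_pool[--thread_count]; /* fastpath: no lock acquired */
}

void memblock_free(struct memblock *p)
{
	if (thread_count == 2 * TARGET_POOL_SIZE) { /* slowpath: spill */
		pthread_mutex_lock(&global_lock);
		while (thread_count > TARGET_POOL_SIZE)
			global_pool[global_count++] = thread_pool[--thread_count];
		pthread_mutex_unlock(&global_lock);
	}
	thread_pool[thread_count++] = p;    /* fastpath: no lock acquired */
}
```

A run of alloc/free pairs short enough to stay between the refill and spill thresholds never touches global_lock, which is why short run lengths in Figure 5.32 scale so well.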

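The per-size replication that turns the toy program's code locking into data locking might look like the following sketch. The eight power-of-two size classes and the helper names are assumptions for illustration, not taken from the BSD allocator [MK88].

```c
#include <pthread.h>
#include <stddef.h>

#define NSIZES 8   /* e.g., power-of-two classes 32, 64, ..., 4096 bytes */

struct global_pool {
	pthread_mutex_t lock;   /* one lock per size class, not one overall */
	void *blocks[40];
	int count;
};

static struct global_pool pools[NSIZES];

void pool_init(void)
{
	for (int i = 0; i < NSIZES; i++)
		pthread_mutex_init(&pools[i].lock, NULL);
}

/* Map a request size to its class; power-of-two spacing balances
 * internal and external fragmentation. */
int size_to_class(size_t size)
{
	int c = 0;
	size_t class_size = 32;

	while (class_size < size && c < NSIZES - 1) {
		class_size <<= 1;
		c++;
	}
	return c;
}

/* Each allocation acquires only the lock for its own size class, so
 * requests for different sizes never contend: data locking. */
void *pool_alloc(size_t size)
{
	struct global_pool *p = &pools[size_to_class(size)];
	void *block = NULL;

	pthread_mutex_lock(&p->lock);
	if (p->count > 0)
		block = p->blocks[--p->count];
	pthread_mutex_unlock(&p->lock);
	return block;
}
```

The same replication pattern applies to the coalescing level in Table 5.1, while the rarely reached system-memory level can keep a single code lock.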