10.07.2015 Views

Is Parallel Programming Hard, And, If So, What Can You Do About It?


34 CHAPTER 4. COUNTING

vanish when that thread exits.

Quick Quiz 4.22: Fine, but the Linux kernel doesn’t have to acquire a lock when reading out the aggregate value of per-CPU counters. So why should user-space code need to do this???

4.2.5 Discussion

These two implementations show that it is possible to obtain uniprocessor performance for statistical counters, despite running on a parallel machine.

Quick Quiz 4.23: What fundamental difference is there between counting packets and counting the total number of bytes in the packets, given that the packets vary in size?

Quick Quiz 4.24: Given that the reader must sum all the threads’ counters, this could take a long time given large numbers of threads. Is there any way that the increment operation can remain fast and scalable while allowing readers to also enjoy reasonable performance and scalability?

Given what has been presented in this section, you should now be able to answer the Quick Quiz about statistical counters for networking near the beginning of this chapter.

4.3 Approximate Limit Counters

Another special case of counting involves limit checking. For example, as noted in the approximate structure-allocation limit problem in the Quick Quiz on page 29, suppose that you need to maintain a count of the number of structures allocated in order to fail any allocations once the number of structures in use exceeds a limit, in this case, 10,000. Suppose further that these structures are short-lived, and that this limit is rarely exceeded.

4.3.1 Design

One possible design for limit counters is to divide the limit of 10,000 by the number of threads, and give each thread a fixed pool of structures. For example, given 100 threads, each thread would manage its own pool of 100 structures. This approach is simple, and in some cases works well, but it does not handle the common case where a given structure is allocated by one thread and freed by another [MS93].
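The fixed-pool design can be sketched in a few lines of C. This is only an illustration under stated assumptions: the names pool, pools_init(), pool_alloc(), pool_free(), and simulate_handoff() are hypothetical, thread identities are modeled as plain array indices rather than real threads, and the freeing thread is assumed to take credit for what it frees.

```c
#include <assert.h>

#define NR_THREADS   100     /* thread count from the example in the text */
#define GLOBAL_LIMIT 10000   /* overall limit from the text */

/* Hypothetical per-thread pools: credits remaining for each thread. */
int pool[NR_THREADS];

/* Give every thread an equal, fixed share of the limit (100 each). */
void pools_init(void)
{
	for (int t = 0; t < NR_THREADS; t++)
		pool[t] = GLOBAL_LIMIT / NR_THREADS;
}

/* Thread t allocates one structure: 1 on success, 0 if its pool is empty. */
int pool_alloc(int t)
{
	assert(t >= 0 && t < NR_THREADS);
	if (pool[t] == 0)
		return 0;
	pool[t]--;
	return 1;
}

/* Thread t frees one structure and takes the credit for it. */
void pool_free(int t)
{
	assert(t >= 0 && t < NR_THREADS);
	pool[t]++;
}

/*
 * Model a hand-off pattern: `producer` allocates n structures and
 * `consumer` frees each one. Returns how many allocations succeeded.
 */
int simulate_handoff(int producer, int consumer, int n)
{
	int ok = 0;

	for (int i = 0; i < n; i++) {
		if (pool_alloc(producer)) {
			ok++;
			pool_free(consumer);
		}
	}
	return ok;
}
```

Calling pools_init() and then simulate_handoff(0, 1, 150) lets only the first 100 allocations succeed: thread 0's pool is drained even though no structures remain in use, while thread 1 accumulates credits it never spends.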
On the one hand, if a given thread takes credit for any structures it frees, then the thread doing most of the allocating runs out of structures, while the threads doing most of the freeing have lots of credits that they cannot use. On the other hand, if freed structures are credited to the CPU that allocated them, it will be necessary for CPUs to manipulate each other’s counters, which will require lots of expensive atomic instructions. Furthermore, because structures come in different sizes, rather than supporting inc_count() and dec_count() interfaces, we implement add_count() and sub_count() to allow variable-sized structures to be properly accounted for.

In short, for many important workloads, we cannot fully partition the counter. However, we can partially partition the counter, so that in the common case, each thread need only manipulate its own private state, while still allowing counts to flow between threads as needed. The statistical counting scheme discussed in Section 4.2.4 provides an interesting starting point, in that it maintains a global counter as well as per-thread counters, with the aggregate value being the sum of all of these counters, global along with per-thread. The key change is to pull each thread’s counter into the global sum while that thread is still running, rather than waiting for thread exit. Clearly, we want threads to pull in their own counts, as cross-thread accesses are expensive and scale poorly.

This leaves open the question of exactly when a given thread’s counter should be pulled into the global counter. In the initial implementation, we will start by maintaining a limit on the value of the per-thread counter. When this limit would be exceeded, the thread pulls its counter into the global counter. Of course, we cannot simply add to the counter when a structure is allocated: we must also subtract from the counter when a structure is freed. We must therefore make use of the global counter when a subtraction would otherwise reduce the value of the per-thread counter below zero.
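The pull-into-global rule just described might look roughly like the following C11 sketch. It is a simplified illustration, not the implementation that Section 4.3.2 goes on to present: only add_count() and sub_count() are names from the text, while MAX_THREADS, PER_THREAD_MAX, the spinlock helpers, and read_count_approx() are assumptions made here. The sketch also assumes a signed global counter and a slowpath that conservatively reserves room for counts that may still be cached in per-thread counters.

```c
#include <assert.h>
#include <stdatomic.h>

#define GLOBAL_LIMIT   10000L  /* limit from the text */
#define MAX_THREADS    4L      /* assumed upper bound on thread count */
#define PER_THREAD_MAX 100L    /* assumed cap on each per-thread counter */

long global_count;                      /* guarded by global_lock; may go negative */
atomic_flag global_lock = ATOMIC_FLAG_INIT;
_Thread_local long local_count;         /* owner-only; 0 <= value <= PER_THREAD_MAX */

/* Minimal spinlock built on atomic_flag (assumed helper, not from the text). */
void lock_global(void)
{
	while (atomic_flag_test_and_set_explicit(&global_lock, memory_order_acquire))
		continue;
}

void unlock_global(void)
{
	atomic_flag_clear_explicit(&global_lock, memory_order_release);
}

/* Add delta to the count: 1 on success, 0 if the limit might be exceeded. */
int add_count(long delta)
{
	int ok = 0;

	assert(delta >= 0);
	if (delta <= PER_THREAD_MAX - local_count) {
		local_count += delta;   /* fastpath: private state only */
		return 1;
	}
	lock_global();
	/* Hold back MAX_THREADS * PER_THREAD_MAX for counts cached in threads. */
	if (global_count + local_count + delta <=
	    GLOBAL_LIMIT - MAX_THREADS * PER_THREAD_MAX) {
		global_count += local_count + delta;  /* pull local count into global */
		local_count = 0;
		ok = 1;
	}
	unlock_global();
	return ok;
}

/* Subtract delta; falls back to the global counter when the local one is too small. */
void sub_count(long delta)
{
	assert(delta >= 0);
	if (delta <= local_count) {
		local_count -= delta;   /* fastpath */
		return;
	}
	lock_global();
	global_count += local_count - delta;  /* pull local count in, then subtract */
	local_count = 0;
	unlock_global();
}

/* Approximate read: omits counts still cached in per-thread counters. */
long read_count_approx(void)
{
	long v;

	lock_global();
	v = global_count;
	unlock_global();
	return v;
}
```

Under these assumptions the sum of the global and per-thread counters never exceeds GLOBAL_LIMIT, at the cost that add_count() may start failing once the count comes within MAX_THREADS * PER_THREAD_MAX of the limit. The choice of PER_THREAD_MAX therefore trades the accuracy of the limit against how often the slowpath runs.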
However, if the limit is reasonably large, almost all of the addition and subtraction operations should be handled by the per-thread counter, which should give us good performance and scalability.

This design is an example of “parallel fastpath”, which is an important design pattern in which the common case executes with no expensive instructions and no interactions between threads, but where occasional use is also made of a more conservatively designed global algorithm.

4.3.2 Simple Limit Counter Implementation

Figure 4.9 shows both the per-thread and global variables used by this implementation. The per-
