10.07.2015 Views

Is Parallel Programming Hard, And, If So, What Can You Do About It?



4.2. STATISTICAL COUNTERS

```c
 1 long __thread counter = 0;
 2 long *counterp[NR_THREADS] = { NULL };
 3 long finalcount = 0;
 4 DEFINE_SPINLOCK(final_mutex);
 5
 6 void inc_count(void)
 7 {
 8   counter++;
 9 }
10
11 long read_count(void)
12 {
13   int t;
14   long sum;
15
16   spin_lock(&final_mutex);
17   sum = finalcount;
18   for_each_thread(t)
19     if (counterp[t] != NULL)
20       sum += *counterp[t];
21   spin_unlock(&final_mutex);
22   return sum;
23 }
24
25 void count_register_thread(void)
26 {
27   int idx = smp_thread_id();
28
29   spin_lock(&final_mutex);
30   counterp[idx] = &counter;
31   spin_unlock(&final_mutex);
32 }
33
34 void count_unregister_thread(int nthreadsexpected)
35 {
36   int idx = smp_thread_id();
37
38   spin_lock(&final_mutex);
39   finalcount += counter;
40   counterp[idx] = NULL;
41   spin_unlock(&final_mutex);
42 }
```

Figure 4.8: Per-Thread Statistical Counters

Quick Quiz 4.17: Won’t the single global thread in the function eventual() of Figure 4.7 be just as severe a bottleneck as a global lock would be?

Quick Quiz 4.18: Won’t the estimate returned by read_count() in Figure 4.7 become increasingly inaccurate as the number of threads rises?

4.2.4 Per-Thread-Variable-Based Implementation

Fortunately, gcc provides an __thread storage class that provides per-thread storage. This can be used as shown in Figure 4.8 (count_end.c) to implement a statistical counter that not only scales, but that also incurs little or no performance penalty to incrementers compared to simple non-atomic increment.

Lines 1-4 define needed variables: counter is the per-thread counter variable, the counterp[] array allows threads to access each other’s counters, finalcount accumulates the total as individual threads exit, and final_mutex coordinates between threads accumulating the total value of the counter and exiting threads.

Quick Quiz 4.19: Why do we need an explicit array to find the other threads’ counters?
Why doesn’t gcc provide a per_thread() interface, similar to the Linux kernel’s per_cpu() primitive, to allow threads to more easily access each other’s per-thread variables?

The inc_count() function used by updaters is quite simple, as can be seen on lines 6-9.

The read_count() function used by readers is a bit more complex. Line 16 acquires a lock to exclude exiting threads, and line 21 releases it. Line 17 initializes the sum to the count accumulated by those threads that have already exited, and lines 18-20 sum the counts being accumulated by threads currently running. Finally, line 22 returns the sum.

Quick Quiz 4.20: Why on earth do we need something as heavyweight as a lock guarding the summation in the function read_count() in Figure 4.8?

Lines 25-32 show the count_register_thread() function, which must be called by each thread before its first use of this counter. This function simply sets up this thread’s element of the counterp[] array to point to its per-thread counter variable.

Quick Quiz 4.21: Why on earth do we need to acquire the lock in count_register_thread() in Figure 4.8? It is a single properly aligned machine-word store to a location that no other thread is modifying, so it should be atomic anyway, right?

Lines 34-42 show the count_unregister_thread() function, which must be called prior to exit by each thread that previously called count_register_thread(). Line 38 acquires the lock, and line 41 releases it, thus excluding any calls to read_count() as well as other calls to count_unregister_thread(). Line 39 adds this thread’s counter to the global finalcount, and line 40 then NULLs out its counterp[] array entry. A subsequent call to read_count() will see the exiting thread’s count in the global finalcount, and will skip the exiting thread when sequencing through the counterp[] array, thus obtaining the correct total.

This approach gives updaters almost exactly the same performance as a non-atomic add, and also scales linearly.
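
The register / increment / unregister / read sequence described above can be tried outside the book's test harness with a rough sketch along the following lines. It is not the code from count_end.c: it assumes POSIX threads, uses a pthread_mutex_t in place of DEFINE_SPINLOCK()/spin_lock(), a plain loop in place of for_each_thread(), and an explicit idx argument in place of smp_thread_id(). The NR_THREADS value and the names worker(), struct worker_arg, and run_demo() are hypothetical, chosen only for this illustration.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define NR_THREADS 8   /* assumed upper bound on worker threads */

static __thread long counter = 0;              /* this thread's private count */
static long *counterp[NR_THREADS] = { NULL };  /* pointers to running threads' counters */
static long finalcount = 0;                    /* total from threads that have exited */
static pthread_mutex_t final_mutex = PTHREAD_MUTEX_INITIALIZER; /* stands in for the spinlock */

void inc_count(void)
{
    counter++;  /* plain, non-atomic increment of thread-local storage */
}

long read_count(void)
{
    long sum;

    pthread_mutex_lock(&final_mutex);    /* exclude exiting threads */
    sum = finalcount;
    for (int t = 0; t < NR_THREADS; t++) /* replaces for_each_thread(t) */
        if (counterp[t] != NULL)
            sum += *counterp[t];
    pthread_mutex_unlock(&final_mutex);
    return sum;
}

/* Assumption: the caller supplies idx instead of calling smp_thread_id(). */
void count_register_thread(int idx)
{
    pthread_mutex_lock(&final_mutex);
    counterp[idx] = &counter;
    pthread_mutex_unlock(&final_mutex);
}

void count_unregister_thread(int idx)
{
    pthread_mutex_lock(&final_mutex);
    finalcount += counter;  /* fold this thread's count into the total */
    counterp[idx] = NULL;   /* readers now skip this thread */
    pthread_mutex_unlock(&final_mutex);
}

struct worker_arg {         /* hypothetical per-thread parameters */
    int idx;
    long nincs;
};

static void *worker(void *p)
{
    struct worker_arg *arg = p;

    count_register_thread(arg->idx);
    for (long i = 0; i < arg->nincs; i++)
        inc_count();
    count_unregister_thread(arg->idx);
    return NULL;
}

/* Hypothetical driver: start nthreads workers, wait for all of them to
 * exit, then return read_count(). Returns -1 if a thread cannot be created. */
long run_demo(int nthreads, long nincs)
{
    pthread_t tid[NR_THREADS];
    struct worker_arg args[NR_THREADS];
    int started = 0;

    assert(nthreads > 0 && nthreads <= NR_THREADS);
    for (int i = 0; i < nthreads; i++) {
        args[i].idx = i;
        args[i].nincs = nincs;
        if (pthread_create(&tid[i], NULL, worker, &args[i]) != 0)
            break;
        started++;
    }
    for (int i = 0; i < started; i++)
        pthread_join(tid[i], NULL);
    return started == nthreads ? read_count() : -1;
}
```

Built with something like gcc -pthread, a first call such as run_demo(4, 100000) should return 400000, because every worker folds its private count into finalcount before exiting. The sketch calls read_count() only after the workers have been joined. Calling it while they are still incrementing, as the book intends, returns an approximate value, and strictly conforming C11 code would want relaxed atomic accesses to the per-thread counters for that case.
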
On the other hand, concurrent reads contend for a single global lock, and therefore perform poorly and scale abysmally. However, this is not a problem for statistical counters, where incrementing happens often and readout happens almost never. In addition, this approach is considerably more complex than the array-based scheme, due to the fact that a given thread’s per-thread variables
