
Actas JP2011 - Universidad de La Laguna


Proceedings of the XXII Jornadas de Paralelismo (JP2011), La Laguna, Tenerife, 7-9 September 2011

TABLE I
CMP baseline configuration.

Number of cores          32
Core                     3 GHz, in-order 2-way
Cache line size          64 bytes
L1 I/D-Cache             32 KB, 4-way, 2 cycles
L2 Cache (per core)      256 KB, 4-way, 12+4 cycles
Memory access time       400 cycles
Network configuration    2D-mesh
Network bandwidth        75 GB/s
Link width               75 bytes
Technology               45 nm

shown in Figure 4(b). In this case, R would choose S1 by following the round-robin scheduling policy already discussed and would send the TOKEN signal at cycle 3. At cycle 4 and based on the round-robin policy, S1 chooses Core0 and sends the TOKEN signal granting access to the lock.

Figure 4(c) shows the scenario in which an Sx can grant the lock ownership without involving any additional notifications to R. More specifically, once Core0 releases the lock at cycle m, its controller sends the REL signal (by writing to the local f0 flag, as we mentioned) to S1. Next, at cycle m + 1, S1 grants the lock ownership (by means of the TOKEN signal) to the next core by following the round-robin policy from the active fx flags. In this case, Core1 becomes the new lock holder. In the same way, Core2 would be granted the lock in cycle n + 1 (m < n).

Finally, in Figure 4(d) we illustrate the scenario when an S finishes its scheduling because either it has reached the last active f or there are no more pending local requests for the lock. In this case, S must send the REL signal towards R, which will choose another available Sj lock manager from those that activated the fSx flags. In the figure, S1 sends the REL signal to R at cycle p + 1 (n < p), which following the round-robin policy grants the lock to S2. Finally, S2 sends the TOKEN signal giving access to the lock to Core3 at cycle p + 3.

III. Evaluation

In this section we give details of our experimental methodology and performance results.

A. Testbed

In order to support GLocks, the Sim-PowerCMP [8] performance simulator has been extended. Sim-PowerCMP is a detailed architecture-level power-performance simulation tool that simulates tiled-CMP architectures with a shared L2 cache on-chip and a MESI directory-based cache coherence protocol. Table I summarizes the values of the main configurable parameters assumed in this work.

B. Benchmarks

To evaluate the performance benefits derived from GLocks, five microbenchmarks and three scientific applications are used. On the one hand, the microbenchmarks (SCTR, MCTR, DBLL, PRCO and ACTR) exhibit different highly-contended access patterns to shared data that can be commonly found in parallel applications. To implement the microbenchmarks we follow a methodology similar to the one used in [9]. On the other hand, regarding real applications, we have considered two programs belonging to the SPLASH-2 benchmark suite [6] (Ocean and Raytrace), and a well-known sorting algorithm (QSORT). These applications were chosen since they present a significant lock synchronization overhead due to the existence of highly-contended locks². In fact, these locks are accessed following similar patterns to those of the microbenchmarks.

We summarize the characteristics of the microbenchmarks and applications used in this work in Table II. For each of them we account for the input size, the total number of different locks, the number of these locks that are highly-contended (H-C Locks), and point out the highly-contended lock access patterns in terms of the microbenchmarks they are similar to.

C. Lock Implementations

To fairly quantify the benefits of our GLocks mechanism, we consider the case that highly-contended locks found in the benchmarks previously described are implemented by using MCS Locks.
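
For readers unfamiliar with the baseline, the following is a minimal sketch of the queue-based hand-off that characterizes MCS Locks. It is not the implementation evaluated in this work. The names `QNode`, `MCSLock` and `run_demo` are hypothetical, and a small mutex is assumed to stand in for the atomic swap and compare-and-swap instructions that a real implementation would use.

```python
import threading
import time


class QNode:
    """Per-thread queue node; a waiter spins only on its own `locked` flag."""

    def __init__(self):
        self.locked = False
        self.next = None


class MCSLock:
    def __init__(self):
        self._tail = None
        # Assumption: this mutex only emulates hardware atomic swap / CAS.
        self._atomic = threading.Lock()

    def _swap_tail(self, node):
        # Emulated atomic swap: install `node` as tail, return the old tail.
        with self._atomic:
            prev, self._tail = self._tail, node
        return prev

    def _cas_tail(self, expected, new):
        # Emulated compare-and-swap on the tail pointer.
        with self._atomic:
            if self._tail is expected:
                self._tail = new
                return True
        return False

    def acquire(self, node):
        node.next = None
        node.locked = True
        prev = self._swap_tail(node)
        if prev is not None:
            # Queue was not empty: link behind the predecessor and wait.
            prev.next = node
            while node.locked:
                time.sleep(0)  # yield while spinning on the local flag

    def release(self, node):
        if node.next is None:
            # No visible successor: try to mark the queue as empty.
            if self._cas_tail(node, None):
                return
            # A successor is enqueueing; wait until it links itself.
            while node.next is None:
                time.sleep(0)
        node.next.locked = False  # hand the lock to the successor


def run_demo(num_threads=4, iterations=200):
    """Increment a shared counter under the lock and return its final value."""
    lock = MCSLock()
    counter = [0]

    def worker():
        node = QNode()
        for _ in range(iterations):
            lock.acquire(node)
            counter[0] += 1
            lock.release(node)

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter[0]
```

Calling `run_demo(4, 200)` should return 800 if mutual exclusion holds. The point of the sketch is that each waiter spins on its own node, and a release touches only the successor's flag.
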
We use MCS Locks because they are considered one of the most efficient software algorithms for lock synchronization under high contention. In particular, MCS Locks gracefully manage high-contention situations by having a distributed queue of waiting lock requesters. On the other hand, for the rest of the locks (non-contended ones), we employ the Simple Lock algorithm enhanced with the test-and-test&set optimization, because it has been shown to lead to lower latencies when threads try to acquire a lock without competition. Finally, since the number of highly-contended locks is commonly very small in real applications (up to 2 in the applications evaluated in this work), we assume that two GLocks are provided at hardware level. We would like to point out that, to determine the contention of locks, we performed a post-mortem analysis of the benchmarks under study in which locks use the Simple Lock algorithm enhanced with the test-and-test&set optimization. For further details of this analysis we refer to [10].

D. Performance Results

In this section, we evaluate the performance benefits derived from our GLocks mechanism.

D.1 Execution Time

Figure 5 shows the execution times obtained for the set of benchmarks under study when either GLocks or MCS Locks are employed for the highly-contended locks (GL bars and MCS bars, respectively). In particular, execution times have been

² In this work, highly-contended locks are those locks accessed by all threads simultaneously or very close in time.
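
As a reminder of what the GL configuration does differently from a software queue, the two-level round-robin hand-off walked through for Figure 4 can be modeled in a few lines. This is a simplified sketch under stated assumptions, not the hardware design: it covers a single lock and a fixed set of requests, and it ignores cycle timing. The function name `glock_grant_order` and the mapping of cores to secondary managers are hypothetical.

```python
def glock_grant_order(local_flags, first_manager=0):
    """Return the order in which requesting cores receive the TOKEN.

    local_flags[s][i] is True when core i attached to secondary lock
    manager s has requested the lock (its local f flag is active).
    The result is a list of (manager index, local core index) pairs.
    """
    flags = [list(row) for row in local_flags]  # work on a copy
    order = []
    s = first_manager
    # R keeps scheduling while any secondary manager has a pending request.
    while any(any(row) for row in flags):
        if any(flags[s]):
            # R sends TOKEN to manager s, which sweeps its active f flags
            # once, granting the lock to each requester in turn.
            for i in range(len(flags[s])):
                if flags[s][i]:
                    order.append((s, i))
                    flags[s][i] = False  # the core releases the lock (REL)
            # Manager s has passed its last active flag: REL goes back to R.
        s = (s + 1) % len(flags)  # R moves on in round-robin order
    return order


# Hypothetical layout: three requesters under the first manager, one under
# the second.
requests = [
    [True, True, True, False],
    [True, False, False, False],
]
print(glock_grant_order(requests))
# prints [(0, 0), (0, 1), (0, 2), (1, 0)]
```

In this model the root manager is consulted only once per secondary manager, while hand-offs between cores under the same manager stay local, which mirrors the scenario described for Figure 4(c).
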
