
Actas XXII Jornadas de Paralelismo (JP2011), La Laguna, Tenerife, 7-9 September 2011

Efficient hardware support for lock synchronization in Many-core CMPs

José L. Abellán, Juan Fernández and Manuel E. Acacio
Departamento de Ingeniería y Tecnología de Computadores, Universidad de Murcia
e-mail: {jl.abellan,juanf,meacacio}@ditec.um.es

Abstract—Synchronization is of paramount importance to exploit thread-level parallelism on many-core CMPs. In these architectures, synchronization mechanisms usually rely on shared variables to coordinate multithreaded access to shared data structures, thus avoiding data dependency conflicts. Lock synchronization is known to be a key limitation to performance and scalability. On the one hand, lock acquisition through busy waiting on shared variables generates additional coherence activity that interferes with the applications. On the other hand, lock contention causes serialization, which results in performance degradation. This paper proposes and evaluates GLocks, a hardware-supported implementation of highly-contended locks in the context of many-core CMPs. GLocks use a token-based message-passing protocol over a dedicated network built on state-of-the-art technology. This approach skips the memory hierarchy to provide a non-intrusive, extremely efficient and fair lock implementation with negligible impact on energy consumption or die area. A comprehensive comparison against the most efficient shared-memory-based lock implementation on a set of microbenchmarks and real applications quantifies the benefits of GLocks. Performance results show average reductions of 42% and 14% in execution time, 76% and 23% in network traffic, and 78% and 28% in the energy-delay² product (ED²P) metric for the full CMP, for the microbenchmarks and the real applications, respectively. In light of these results, we conclude that GLocks satisfy our initial working hypothesis: they minimize the cache-coherence network traffic due to lock synchronization, which translates into reduced power consumption and execution time.

Keywords—Many-core CMP, lock synchronization, global line.

I. Introduction and Motivation

While the number of cores currently offered in general-purpose CMPs has already gone above ten (e.g., the 12-core, 2-die AMD Magny-Cours design [1]), the well-known Moore's Law suggests that the on-chip resources required to integrate dozens or even hundreds of cores will soon be available. CMPs of this kind are commonly referred to as many-core CMPs; one example is the experimental 48-core Single-chip Cloud Computer research microprocessor [2].

If current trends continue, future many-core CMP architectures will implement the hardware-managed, implicitly-addressed, coherent-caches memory model [3]. With this memory model, all on-chip storage is used for private and shared caches that are kept coherent by hardware. Communication between threads is performed by writing to and reading from shared memory. To guarantee the integrity of shared data structures, most current systems support synchronization through a combination of hardware (atomic read-modify-write instructions such as test&set) and software (higher-level mechanisms such as locks or barriers implemented atop the underlying hardware primitives) [4]. In this way, lock implementations usually rely on shared variables that are atomically updated.
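As a concrete illustration of such a shared-variable lock, the following sketch shows a test-and-test&set (TATAS) lock written with C11 atomics. This is a minimal example of ours, not code from the paper, and the type and function names are hypothetical. The inner read-only loop spins on a locally cached copy of the flag, so coherence traffic is generated mainly when the lock is released and every waiter's cached copy is invalidated; this is the activity that grows with core count.

    #include <stdatomic.h>
    #include <stdbool.h>

    /* Minimal test-and-test&set (TATAS) spin lock sketch (illustrative only). */
    typedef struct {
        atomic_bool held;
    } tatas_lock_t;

    static void tatas_init(tatas_lock_t *l) {
        atomic_init(&l->held, false);
    }

    static void tatas_acquire(tatas_lock_t *l) {
        for (;;) {
            /* Test: spin on the locally cached copy while the lock is held. */
            while (atomic_load_explicit(&l->held, memory_order_relaxed))
                ;
            /* Test&set: one atomic read-modify-write; on failure another
               core won the race and we return to local spinning. */
            if (!atomic_exchange_explicit(&l->held, true, memory_order_acquire))
                return;
        }
    }

    static void tatas_release(tatas_lock_t *l) {
        /* The releasing store invalidates every waiter's cached copy,
           triggering a burst of coherence traffic. */
        atomic_store_explicit(&l->held, false, memory_order_release);
    }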
The use of shared variables for lock synchronization has two important implications for performance and scalability in many-core CMPs. First, the cache coherence protocol must come into play to keep shared variables consistent across all levels of the memory hierarchy. This coherence activity translates into traffic injected into the interconnection network; as a result, an ever-growing amount of resources may need to be devoted to lock synchronization as the number of cores in many-core CMPs increases. Moreover, the timing of lock acquisition and release operations is deeply affected by the performance and scalability of the cache coherence protocol, especially in the presence of highly-contended locks. Second, lock contention has long been recognized as a key impediment to performance and scalability, since it causes serialization [5]. Consequently, the longer the idle time spent on lock acquisition and release operations, the larger the reduction in parallel efficiency.

Fig. 1. Potential benefits for Raytrace when using ideal locks.

As evidence, Figure 1 shows the potential performance benefits when lock synchronizations do not involve the cache coherence protocol and have zero latency. To that end, the Raytrace application from the SPLASH-2 benchmark suite [6] is run with distinct lock implementations (for details about the evaluation see Section III). In each case, the fraction of the execution time due to locks is highlighted in gray. Shared-memory-based locks use test-and-test&set (see the TATAS bar in Figure 1). In turn, ideal locks (see the IDEAL bar in Figure 1) do not involve the cache coherence protocol, which eliminates any inherited performance or scalability side effects. Besides, lock acquisition and release
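To picture how a lock can avoid the coherence protocol altogether, the following single-file simulation sketches the token-based idea behind GLocks: a token circulates among the cores over a dedicated path, and a core may enter its critical section only while holding the token, so no core ever spins on a shared memory location. This is a conceptual illustration of ours with an assumed round-robin hand-off, not the actual GLocks hardware protocol or its network.

    #include <stdio.h>

    /* Conceptual, single-threaded simulation of token-based lock arbitration
       (illustrative only; not the GLocks implementation). One token visits
       the cores in round-robin order, which makes hand-off inherently fair. */

    #define NUM_CORES 4

    int main(void) {
        int token = 0;                            /* core holding the token */
        int remaining[NUM_CORES] = {2, 1, 3, 2};  /* critical sections left */
        int pending = 2 + 1 + 3 + 2;

        for (int cycle = 0; pending > 0; cycle++) {
            if (remaining[token] > 0) {
                /* Only the token holder enters its critical section:
                   arbitration happens on the dedicated token path, not
                   in the memory hierarchy. */
                printf("cycle %d: core %d in critical section\n", cycle, token);
                remaining[token]--;
                pending--;
            }
            token = (token + 1) % NUM_CORES;      /* pass token to next core */
        }
        return 0;
    }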
