
Actas JP2011 - Universidad de La Laguna


Actas XXII Jornadas de Paralelismo (JP2011), La Laguna, Tenerife, September 7-9, 2011

[Figure 11: bar chart. Y axis: normalized execution time (0.0 to 1.3). Series: Flushing, Flushing DC:2, Flushing DC:4, Flushing DC:8. X axis: Barnes, Cholesky, FFT, Ocean, Radiosity, Raytrace-opt, Volrend, Water-Nsq, Tomcatv, Unstructured, FaceRec, MPGdec, MPGenc, SpeechRec, Blackscholes, Canneal, Swaptions, Fluidanimate, x264, Apache, SPEC-JBB, Average. The plot also carries the value labels 1.60 and 2.42.]

Fig. 11. Execution time normalized to the base system. DC:2, DC:4, and DC:8 stand for directory caches with their size divided by 2, 4, and 8, respectively.

[Figure 12: bar chart. Y axis: dynamic energy (0.0 to 2.0). Bars per benchmark: 1. Base, 2. Updating, 3. Flushing, 4. Flushing DC:2, 5. Flushing DC:4, 6. Flushing DC:8. Legend (as extracted): Directory Cache, Memory Controller, Network. X axis: same benchmarks as Figure 11. The plot also carries the value label 3.02.]

Fig. 12. Dynamic energy consumption normalized to the base system. DC:2, DC:4, and DC:8 stand for directory caches with their size divided by 2, 4, and 8, respectively.

The static energy consumption is not shown in Figure 12 because it is closely tied to application runtime. In directory caches, it also depends on the directory size. Thus, when using directory caches 2, 4, and 8 times smaller than that in the base system, the static power consumption is reduced by 48%, 74%, and 86%, respectively.

VI. Conclusions

The proposal made in this paper aims to improve the effectiveness of directory caches. It takes advantage of the fact that most referenced memory blocks are private and therefore do not require coherence maintenance. Thus, directory caches do not keep track of them.
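The effect of not tracking private blocks can be illustrated with a minimal sketch. It is a toy model under simplifying assumptions, not the simulation infrastructure used in the paper: the class `DirectoryCache`, the function `run`, the LRU replacement policy, the capacity of 4 entries, and the synthetic trace are all hypothetical, and the private/shared classification is reduced to an `is_private` flag attached to each access. Transitions from private to shared are not modeled.

```python
from collections import OrderedDict


class DirectoryCache:
    """Toy directory cache with LRU replacement (illustrative assumption).

    Evicting a directory entry forces the tracked block to be invalidated
    in the processor caches, which is what `invalidations` counts.
    """

    def __init__(self, capacity, track_private):
        self.capacity = capacity
        self.track_private = track_private
        self.entries = OrderedDict()  # block address -> None, oldest first
        self.invalidations = 0

    def access(self, block, is_private):
        # Private blocks need no coherence, so they get no directory entry
        # when private tracking is disabled.
        if is_private and not self.track_private:
            return
        if block in self.entries:
            self.entries.move_to_end(block)  # refresh LRU position
            return
        if len(self.entries) >= self.capacity:
            self.entries.popitem(last=False)  # evict least recently used entry
            self.invalidations += 1           # its cached copies are invalidated
        self.entries[block] = None


def run(trace, capacity, track_private):
    """Replay (block, is_private) accesses and return the invalidation count."""
    directory = DirectoryCache(capacity, track_private)
    for block, is_private in trace:
        directory.access(block, is_private)
    return directory.invalidations


# Synthetic trace: two shared blocks (0, 1) and six private blocks (100-105).
one_pass = [
    (0, False), (100, True), (1, False), (101, True), (102, True),
    (0, False), (103, True), (1, False), (104, True), (105, True),
]
trace = one_pass * 3

baseline = run(trace, capacity=4, track_private=True)   # every block tracked
filtered = run(trace, capacity=4, track_private=False)  # shared blocks only

print("invalidations when tracking all blocks:", baseline)
print("invalidations when skipping private blocks:", filtered)
```

In this trace the two shared blocks fit in the directory cache on their own, so the filtered run never evicts an entry, while the baseline run has to evict repeatedly because eight distinct blocks compete for four entries.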
Since the amount of information stored by directory caches is drastically reduced, the number of blocks invalidated from processor caches due to replacements in directory caches also drops (by about 57% on average). This helps either to increase system performance (by 15%) or to reduce the storage requirements of directory caches (by a factor of 8).

Due to the simplicity of the proposed technique, it can be implemented without modifying the coherence protocol or the processor hardware, which makes its implementation feasible in real systems.

Acknowledgments

This work has been supported by Generalitat Valenciana under Grant PROMETEO/2008/060.
