Actas JP2011 - Universidad de La Laguna


Actas XXII Jornadas de Paralelismo (JP2011), La Laguna, Tenerife, 7-9 September 2011

to the conventional baseline, D-NUCA with I columns and J rows, L-NUCAs with I levels, and L-NUCAs combined with D-NUCA organizations, respectively. In addition, each configuration name includes the total size for the L2 and L-NUCAs, and the bank size for D-NUCA.

TABLE III
Sample sizes for the different metrics and configurations in 4SMT execution

Configuration      STP    ANTT   IPC throughput   Fairness
L2-256KB-8Banks    126    111     85              162
L2-1MB-8Banks      126    113     90              160
NC-8x2-512KB       111    132    119              218
NC-8x4-256KB       118    141    128              220
LN3-240KB          103     96     51              194
LN4-448KB          103     96     51              193
LN2-NC-8x2         104     96     48              189
LN2-NC-8x4         105     97     47              188
LN3-NC-8x2         106     96     46              210
LN3-NC-8x4         103     95     47              211

Irrespective of the organization, we observe that the required sample size follows a common trend: computing IPC throughput requires small sample sizes, while fairness, due to its greater variance, requires doubling or even quadrupling the sample size. STP and ANTT lie in a middle ground. Across all organizations, IPC throughput is the metric with the most uneven sample sizes, meaning that L-NUCA organizations achieve less IPC variability among individual thread combinations. From these results, we conclude that, in general, adding a new configuration to test does not require a large investment in new simulations for the preexisting configurations. Finally, fairness is the only metric that requires more samples as the number of threads rises. While many threads are able to keep all functional units busy, they tend to interfere with each other, starving some of them. Conclusions are similar for other thread counts, and we do not show them for the sake of brevity.

B. STP, ANTT, IPC throughput, and Fairness

Figure 5 shows the results for the four metrics of interest for the best configuration of each hierarchy organization.
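As a reference for the four metrics, the following minimal Python sketch computes them from per-thread IPC values. Fairness follows the definition given in this subsection (lowest slowdown divided by highest slowdown). The STP and ANTT formulas are the commonly used ones (sum of per-thread normalized progress and mean per-thread slowdown); they are not stated in this section and are assumed here. The names `WorkloadMetrics` and `workload_metrics` and the sample IPC values are hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class WorkloadMetrics:
    stp: float             # system throughput (higher is better)
    antt: float            # average normalized turnaround time (lower is better)
    ipc_throughput: float  # sum of per-thread IPCs in multithreaded mode
    fairness: float        # 0 = one thread starved, 1 = equal slowdowns


def workload_metrics(ipc_single: Sequence[float],
                     ipc_multi: Sequence[float]) -> WorkloadMetrics:
    """Compute the four metrics for one thread combination.

    ipc_single[i]: IPC of program i running alone (must be > 0).
    ipc_multi[i]:  IPC of program i when co-scheduled with the others.
    """
    if not ipc_single or len(ipc_single) != len(ipc_multi):
        raise ValueError("need one single-thread and one multithread IPC per program")
    if any(s <= 0 for s in ipc_single):
        raise ValueError("single-thread IPC must be positive")
    if any(m < 0 for m in ipc_multi):
        raise ValueError("multithread IPC cannot be negative")

    # Normalized progress of each program; slowdown is its reciprocal.
    progress = [m / s for s, m in zip(ipc_single, ipc_multi)]

    stp = sum(progress)
    # A starved thread has infinite slowdown, so ANTT is unbounded.
    if min(progress) == 0:
        antt = float("inf")
    else:
        antt = sum(1.0 / p for p in progress) / len(progress)
    # min slowdown / max slowdown equals min progress / max progress.
    fairness = 0.0 if max(progress) == 0 else min(progress) / max(progress)

    return WorkloadMetrics(stp, antt, sum(ipc_multi), fairness)


if __name__ == "__main__":
    # Hypothetical two-thread workload, for illustration only.
    print(workload_metrics([2.0, 1.0], [1.0, 0.25]))
```

Writing fairness in terms of normalized progress avoids a division by zero when one thread is completely starved, while giving the same value as the slowdown-based definition.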
STP and ANTT are metrics relative to the performance of a single-thread system, aimed at programs that may spend time in spin-lock loops, which is not our case [24]. In both metrics, L2 outperforms the rest of the hierarchies from 4 threads onwards, and the L-NUCA ones dominate in the 2-thread case. These results are mostly due to the fact that the L2 has the worst IPC in single-thread mode and the maximum IPC for all configurations is 4, so the L2 relative slowdowns become smaller as the number of threads rises.

IPC throughput is an absolute metric representing the number of committed instructions per unit of time. LN+NC achieves the best results regardless of the number of threads, closely followed by LN. The small difference between them is mostly due to the NUCA partial tags not present in the L3 [5]. As the number of threads and the bandwidth demand increase, the L2 outperforms the NC.

Fairness is defined as the quotient between the slowdowns of the programs that have suffered the lowest and the highest slowdowns. A value of zero means complete starvation of one thread, and a value of one means all threads experience the same slowdown.

VI. Conclusions

The adoption of thread-level parallelism as the mainstream way to continue improving the performance pace of computers requires novel mechanisms and the reevaluation of well-established ones, such as cache hierarchies. While large Last Level Caches have received a lot of attention in recent years, first and second levels have been largely left aside.

This work analyzes multiple state-of-the-art cache hierarchies executing multiprogrammed workloads from 2 to 8 threads. In order to provide accurate results in a reasonable amount of time, we propose a sampling-based methodology that reduces simulation time by up to 4 orders of magnitude for 8-thread workloads.
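To illustrate how a sampling methodology can tie the number of simulated thread combinations to a metric's variance, the sketch below uses the textbook normal-approximation estimate n = (z * cv / e)^2, where cv is the coefficient of variation of the metric, e the tolerated relative error, and z the two-sided normal quantile for the chosen confidence level. This formula, the function name `required_sample_size`, and the example numbers are assumptions for illustration; this section does not specify the authors' exact procedure.

```python
import math
from statistics import NormalDist


def required_sample_size(coeff_variation: float,
                         rel_error: float,
                         confidence: float) -> int:
    """Samples needed so the sample mean lies within rel_error of the
    true mean at the given confidence level (normal approximation).

    coeff_variation: standard deviation divided by mean of the metric.
    rel_error:       tolerated relative error, e.g. 0.03 for 3%.
    confidence:      confidence level in (0, 1), e.g. 0.97.
    """
    if coeff_variation <= 0 or rel_error <= 0:
        raise ValueError("coeff_variation and rel_error must be positive")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")

    # Two-sided quantile of the standard normal distribution.
    z = NormalDist().inv_cdf(1.0 - (1.0 - confidence) / 2.0)
    return math.ceil((z * coeff_variation / rel_error) ** 2)


if __name__ == "__main__":
    # Hypothetical coefficients of variation, not taken from the paper.
    for cv in (0.10, 0.20):
        print(cv, required_sample_size(cv, rel_error=0.03, confidence=0.97))
```

Under this approximation the sample size grows with the square of the coefficient of variation, so doubling the variability roughly quadruples the number of samples, which matches the intuition that a high-variance metric such as fairness needs more samples than IPC throughput.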
These savings do not come at the cost of fidelity, because the error is less than 3% at a 97% confidence level for a high-variance metric such as fairness.

From the analysis we observe that, regardless of the Last Level Cache and the number of threads, L-NUCA provides the most efficient solution in terms of both throughput and fairness.

Acknowledgement

This work was partially supported by grants TIN2010-21291-C02-01 (Spanish Government and European ERDF), gaZ: T48 research group (Aragón Government and European ESF), Consolider CSD2007-00050 (Spanish Government), and HiPEAC-2 NoE (European FP7/ICT 217068).

References

[1] Tom R. Halfhill, “Netlogic broadens XLP family,” Microprocessor Report, vol. 24, no. 7, pp. 1–11, 2010.
[2] J. L. Shin, K. Tam, D. Huang, B. Petrick, H. Pham, Changku Hwang, Hongping Li, A. Smith, T. Johnson, F. Schumacher, D. Greenhill, A. S. Leon, and A. Strong, “A 40nm 16-core 128-thread CMT SPARC SoC processor,” in Proc. IEEE Int. Solid-State Circuits Conf. Digest of Technical Papers (ISSCC), 2010, pp. 98–99.
[3] Ron Kalla, Balaram Sinharoy, William J. Starke, and Michael Floyd, “Power7: IBM’s next-generation server processor,” IEEE Micro, vol. 30, pp. 7–15, 2010.
[4] Darío Suárez, Teresa Monreal, Fernando Vallejo, Ramón Beivide, and Víctor Viñals, “Light NUCA: a proposal for bridging the inter-cache latency gap,” in Proceedings of the 12th Design, Automation and Test in Europe Conference and Exhibition (DATE’09), April 2009.
[5] Changkyu Kim, Doug Burger, and Stephen W. Keckler, “An adaptive, non-uniform cache structure for wire-delay dominated on-chip caches,” in Proceedings of the 10th international conference on architectural support for programming languages and operating systems (ASPLOS-X). October 2002, pp. 211–222, ACM Press.
[6] D. M. Tullsen, S. J. Eggers, and H. M. Levy, “Simultaneous multithreading: Maximizing on-chip parallelism,” in Proceedings. 22nd Annual International Symposium on Computer Architecture, Jun 1995, pp. 392–403.
[7] Dean M. Tullsen and Jeffery A. Brown, “Handling long-latency loads in a simultaneous multithreading processor,” in MICRO 34: Proceedings of the 34th annual
