
NoC Design & Optimization of Multicore Media Processors

A Thesis
Submitted for the Degree of
Doctor of Philosophy
in the Faculty of Engineering

by
Basavaraj T

DEPARTMENT OF ELECTRICAL AND COMMUNICATION ENGINEERING
INDIAN INSTITUTE OF SCIENCE
BANGALORE – 560 012, INDIA

July 2013


Abstract

Networks on Chip[1][2][3][4] are critical elements of modern System on Chip (SoC) as well as Chip Multiprocessor (CMP) designs. Networks on Chip (NoCs) help manage the high complexity of designing large chips by decoupling computation from communication. SoCs and CMPs have a multiplicity of communicating entities like programmable processing elements, hardware acceleration engines, memory blocks as well as off-chip interfaces. With power having become a serious design constraint[5], there is a great need for designing NoCs which meet the target communication requirements while minimizing power using all the techniques available at the architecture, microarchitecture and circuit levels of the design. This thesis presents a holistic, QoS based, power optimal design solution for a NoC inside a CMP, taking into account link microarchitecture and processor tile configurations.

Guaranteeing QoS in NoCs involves guaranteeing bandwidth and throughput for connections and deterministic latencies in communication paths. The Label Switching based Network-on-Chip (LS-NoC) uses a centralized LS-NoC Management framework that engineers traffic into QoS guaranteed routes. LS-NoC uses label switching, enables bandwidth reservation, allows physical link sharing and leverages advantages of both packet and circuit switching techniques. A flow identification algorithm takes into account the bandwidth available in individual links to establish QoS guaranteed routes. LS-NoC caters to the requirements of streaming applications where communication channels are fixed over the lifetime of the application. The proposed NoC framework inherently supports heterogeneous and ad-hoc SoC designs.

A multicast and broadcast capable label switched router for the LS-NoC has been designed, verified, synthesized, placed and routed, and timing analyzed. A 5 port, 256-bit data bus, 4-bit label router occupies 0.431 mm² in 130nm and delivers a peak bandwidth of 80 Gbits/s per link at 312.5 MHz. The LS Router is estimated to consume 43.08 mW. Bandwidth and latency guarantees of LS-NoC have been demonstrated on streaming applications like HiperLAN/2 and an Object Recognition Processor, on Constant Bit Rate traffic patterns, and on video decoder traffic representing Variable Bit Rate traffic. LS-NoC was found to have a competitive (Area × Power)/Throughput figure of merit compared with state-of-the-art NoCs providing QoS. We envision the use of LS-NoC in general purpose CMPs where applications demand deterministic latencies and hard bandwidth requirements.

Design variables for interconnect exploration include wire width, wire spacing, repeater size and spacing, degree of pipelining, supply and threshold voltages, and activity and coupling factors. An optimal link configuration in terms of the number of pipeline stages for a given link length and desired operating frequency is arrived at. Optimal configurations of all links in the NoC are identified and a power-performance optimal NoC is presented. We present a latency, power and performance trade-off study of NoCs using link microarchitecture exploration. The design and implementation of a framework for such a design space exploration study is also presented. We present the trade-off study on NoCs by varying microarchitectural (e.g. pipelining) and circuit level (e.g. frequency and voltage) parameters.

A SystemC based NoC exploration framework is used to explore the impact of various architectural and microarchitectural parameters of NoC elements on the power and performance of the NoC. The framework enables the designer to choose from a variety of architectural options like topology, routing policy, etc., as well as allows experimentation with various microarchitectural options for the individual links like length, wire width, pitch, pipelining, supply voltage and frequency. The framework also supports a flexible traffic generation and communication model. Latency, power and throughput results using this framework to study a 4x4 CMP are presented. The framework is used to study NoC designs of a CMP using different classes of parallel computing benchmarks[6].

One of the key findings is that the average latency of a link can be reduced, up to a point, by increasing pipeline depth, as it enables link operation at higher frequencies. There exists an optimum degree of pipelining which minimizes the energy-delay product of the link. In a 2D Torus, the least latency (1.56 times the minimum) is achieved when the longest link is pipelined by 4 stages, at which point power (40% of max) and throughput (64% of max) are nominal. Using frequency scaling experiments, power variations of up to 40%, 26.6% and 24% are seen in the 2D Torus, Reduced 2D Torus and Tree based NoC between various pipeline configurations achieving the same frequency at constant voltage. In some cases, we find that switching to a deeper pipelining configuration can actually help reduce power, as the links can be designed with smaller repeaters. We also find that the overall performance of the ICNs is determined by the lengths of the links needed to support the communication patterns. Thus the Mesh performs best amongst the three topologies (Mesh, Torus and Folded Torus) considered in the case studies.

The effects of communication overheads on the performance, power and energy of a multiprocessor chip, using L1 and L2 cache sizes as primary exploration parameters and accurate interconnect, processor, on-chip and off-chip memory modelling, are presented. On-chip and off-chip communication times have a significant impact on the execution time and energy efficiency of CMPs. Large caches imply a larger tile area, which results in longer inter-tile communication link lengths and latencies, thus adversely impacting communication time. Smaller caches potentially incur more misses and more frequent off-tile communication. Energy efficient tile design is treated as a configuration exploration and trade-off study using different cache sizes and tile areas to identify a power-performance optimal configuration for the CMP.

Trade-offs are explored using a detailed, cycle accurate, multicore simulation framework which includes superscalar processor cores, cache coherent memory hierarchies, on-chip point-to-point communication networks and a detailed interconnect model including pipelining and latency. Sapphire, a detailed multiprocessor execution environment integrating SESC, Ruby and DRAMSim, was used to run applications from the Splash2 benchmark suite (64K point FFT). Link latencies are estimated for a 16 core CMP simulation on Sapphire. Each tile has a single processor, L1 and L2 caches and a router. Different L1 and L2 sizes lead to different tile clock speeds, tile miss rates and tile areas, and hence different interconnect latencies.

Simulations across various L1 and L2 sizes indicate that the tile configuration that maximizes energy efficiency is related to minimizing communication time. Experiments also indicate different optimal tile configurations for performance, energy and energy efficiency. A clustered interconnection network, communication aware cache bank mapping and thread mapping to physical cores are also explored as potential energy saving solutions. Results indicate that ignoring link latencies can lead to large errors in estimates of program completion times, of up to 17%. Performance optimal configurations are achieved at smaller L1 and moderate L2 cache sizes due to higher operating frequencies, smaller link lengths and comparatively less communication. Using the minimal L1 cache size to operate at the highest frequency may not always be the performance-power optimal choice. Larger L1 sizes, despite a drop in frequency, offer an energy advantage due to less miss-induced communication.

Clustered tile placement experiments for FFT show a performance per watt improvement of 1.2%. Remapping the L2 banks most accessed by a process to the same core or neighbouring cores after communication traffic analysis offers power and performance advantages. Remapped processes and banks in clustered tile placement show a performance per watt improvement of 5.25% and an energy reduction of 2.53%. This suggests that processors could execute a program in multiple modes, for example, minimum energy or maximum performance.


Acknowledgements

I thank my advisor, Prof. Bharadwaj Amrutur, for his invaluable guidance throughout my Ph.D. I thank all of you who have shared many precious moments with me and enriched my journey through life.


Publications

Journals
• Basavaraj Talwar and Bharadwaj Amrutur, "Traffic Engineered NoC for Streaming Applications", Microprocessors and Microsystems, 37 (2013), 333-344.

Conferences
• Basavaraj Talwar and Bharadwaj Amrutur, "A System-C based Microarchitectural Exploration Framework for Latency, Power and Performance Trade-offs of On-Chip Interconnection Networks", First International Workshop on Network on Chip Architectures, Nov. 2008.
• Basavaraj Talwar, Shailesh Kulkarni and Bharadwaj Amrutur, "Latency, Power and Performance Trade-offs in Network-on-Chips by Link Microarchitecture Exploration", 22nd Intl. Conference on VLSI Design, Jan. 2009.


Contents

Abstract
Acknowledgements

1 Introduction
  1.1 Network-on-Chip
  1.2 Switching Policies
    1.2.1 Circuit Switching
    1.2.2 Packet Switching
    1.2.3 Label Switching
  1.3 QoS in NoCs
  1.4 QoS Guaranteed NoC Design
  1.5 Contributions of the Thesis
    1.5.1 Link Microarchitecture Exploration
    1.5.2 Optimal CMP Tile Configuration
    1.5.3 QoS in NoCs
  1.6 Organization of the Thesis

2 Related Work
  2.1 Traffic Engineered NoC for Streaming Applications
    2.1.1 QoS in Packet Switched Networks
    2.1.2 QoS in Circuit Switched Networks
    2.1.3 QoS by Space Division Multiplexing
    2.1.4 Static routing in NoCs
    2.1.5 MPLS and Label Switching in NoCs
    2.1.6 Label Switched NoC
  2.2 Link Microarchitecture and Tile Area Exploration
    2.2.1 NoC Design Space Exploration
  2.3 Simulation Tools
    2.3.1 Link Exploration Tools
    2.3.2 Router Power and Architecture Exploration Tools
    2.3.3 Complete NoC Exploration
    2.3.4 CMP Exploration Tools
    2.3.5 Communication in CMPs - Performance Exploration
  2.4 Summary

3 Link Microarchitecture Exploration
  3.1 Motivation for a Microarchitectural Exploration Framework
  3.2 NoC Microarchitectural Exploration Framework
    3.2.1 Traffic Generation and Distribution Models
    3.2.2 Router Model
    3.2.3 Power Model
  3.3 Case Study: Mesh, Torus & Folded-Torus
    3.3.1 NoC Topologies
    3.3.2 Round Trip Flit Latency & NoC Throughput
    3.3.3 NoC Power/Performance/Latency Tradeoffs
    3.3.4 Power-Performance Tradeoff With Frequency Scaling
    3.3.5 Power-Performance Tradeoff With Voltage and Frequency Scaling
  3.4 Case Study: Torus, Reduced Torus & Tree based NoC
    3.4.1 NoC Topologies
    3.4.2 NoC Throughput
    3.4.3 NoC Power/Performance/Latency Tradeoffs
    3.4.4 Power-Performance Tradeoff With Frequency Scaling
    3.4.5 Power-Performance Tradeoff With Voltage and Frequency Scaling
  3.5 Conclusion

4 Tile Exploration
  4.1 Motivation
  4.2 Observations and Contributions
  4.3 Background
  4.4 Communication Time and Energy Efficiency
  4.5 Experimental Setup
    4.5.1 Experimental Methodology
  4.6 Effect of Link Latency on Performance of a CMP
  4.7 Communication in CMPs
  4.8 Program Completion Time
  4.9 Ideal Interconnects, Custom Floorplanning, L2 Banks and Process Mapping
  4.10 Remarks & Conclusion

5 Label Switched NoC
  5.1 Streaming Applications in Media Processors
    5.1.1 HiperLAN/2
    5.1.2 Object Recognition Processor
  5.2 LS-NoC - Motivation
  5.3 LS-NoC - The Concept
  5.4 LS-NoC - Working
  5.5 Label Switched Router Design
    5.5.1 Pipes & Labels
    5.5.2 Label Swapping
  5.6 Simulation and Functional Verification
  5.7 Synthesis Results
  5.8 Conclusion

6 LS-NoC Management
  6.1 LS-NoC Management
    6.1.1 NoC Manager
    6.1.2 Traffic Engineering in LS-NoC
  6.2 Flow Based Pipe Identification
  6.3 Fault Tolerance in LS-NoC
  6.4 Overhead of NoC Manager
    6.4.1 Computational Latency
    6.4.2 Configuration Latency
    6.4.3 Scalability of LS-NoC
  6.5 Number of Pipes in an NoC
    6.5.1 Minimum, Maximum and Typical Pipes in a Network
  6.6 Conclusion

7 Label Switched NoC
  7.1 HiperLAN/2 baseband processing + Object Recognition Processor SoC
  7.2 Video Streaming Applications
  7.3 Discussion
    7.3.1 Design Philosophy of LS-NoC
    7.3.2 LS-NoC Application
    7.3.3 LS-NoC Evaluation
  7.4 Conclusion

8 Conclusion and Future Work
  8.1 Link Microarchitecture Exploration
  8.2 Optimal CMP Tile Configuration
  8.3 Label Switched NoC for Streaming Applications
  8.4 Future Work

A Interface and Outputs of the SystemC Framework

B Testing & Validation of LS-NoC
  B.1 Implementation of LS-NoC Router
  B.2 Testing and Validation of LS-NoC Router
    B.2.1 Individual Router
    B.2.2 Router in 8×8 Mesh
  B.3 Synthesis & Place and Route

C The Flow Algorithm
  C.1 Ford-Fulkerson's MaxFlow Algorithm
  C.2 Input Graph
  C.3 Edges in the Input Graph

Bibliography


List of Tables

3.1 ICN exploration framework parameters.
3.2 Traffic Generation/Distribution Model and Experiment Setup for the Mesh, Torus & Folded-Torus case study.
3.3 Links and pipelining details of NoCs.
3.4 DLA traffic, Frequency crossover points in 2D Mesh.
3.5 Comparison of 3 topologies for DLA traffic.
3.6 Experimental Setup.
3.7 Links and pipelining details of NoCs.
3.8 Power optimal frequency trip points in various NoCs.
3.9 Comparison of 3 topologies. Maximum interconnect network performance and power consumption for varying pipe stages.
4.1 Configuration parameters of processors, caches & interconnection network used in experiments.
4.2 Scaled processor power over L1 configurations.
4.3 Primary and Secondary cache parameters (access time, area) obtained from cacti. L2 access latencies as a function of L1 access times are also shown.
4.4 Max operating frequencies, Dynamic energy per access of various L1/L2 caches. Values were calculated using cacti power models using 32nm PTM.
4.5 Lengths of links between L1/L2 caches & routers and between routers of neighbouring tiles for a regular mesh placement. No. of pipeline stages required to meet the maximum frequency are also shown.
4.6 FFT. Power spent in links (in mW).
4.7 Total messages in transit (in Millions).
4.8 Clustered tile placement floorplan for L1: 256KB and L2: 512KB. Lengths of links between neighbouring routers and number of pipeline stages are shown. Frequency: 1.38 GHz.
5.1 Communication characteristics between HiperLAN/2 nodes.
5.2 Routing table of an n port (n = 5) router with an lw bit (lw = 4) label, indexed by labels used in the label switched NoC. Size of the routing table = 2^lw × n × lw.
5.3 Simulation parameters used for functional verification of the label switched router design.
5.4 Synthesis Parameters.
5.5 Synthesis results: 2-Router and Mesh networks. Area of a Router is 0.431 mm².
6.1 NoC Manager Overhead.
7.1 Pipes set up for HiperLAN/2 baseband processing SoC and Object Recognition Processor SoC (Figure 7.1(a)). PEC[0-7]→PEC[0-7]: every PEC communicates with every other PEC.
7.2 Standard test videos used in experiments.
7.3 Evaluation of the proposed Label Switched Router and NoC. CS: Circuit switched, PS: Packet switched.
A.1 ICN exploration framework parameters and their default values.
C.1 Routing tables at R0 I0, R0 I2 and R1 I4 nodes after pipes P0 and P1 have been set up.


List of Figures

1.1 Design space exploration of NoCs in CMPs is closely related to link microarchitecture, router design and tile configurations.
2.1 Floorplan used in estimating wire lengths. Wire lengths estimated from these floorplans are used as input to Intacte to arrive at a power optimal configuration and latency in clock cycles. Horizontal R-R: Link between neighbouring routers in the horizontal direction, Vertical R-R: Link between neighbouring routers in the vertical direction.
3.1 Architecture of the SystemC framework.
3.2 Flow of the ICN exploration framework.
3.3 Flit header format. DSTID/SRCID: Destination/Source ID, SQ: Sequence Number, RQ & RP: Request and Response Flags, and a 13 bit flit id.
3.4 Example flit header formats considered in this experiment. (DST/SRCID: Destination/Source ID, HC: Hop Count, CHNx: Direction at hop x).
3.5 Schematic of 3 compared topologies (L to R: Mesh, Torus, Folded Torus). Routers are shaded and Processing Elements (PE) are not.
3.6 Normalized average round trip latency in cycles vs. Traffic injection rate in all the 3 NoCs.
3.7 Max. frequency of links in 3 topologies. Lengths of longest links in Mesh, Torus and Folded 2D Torus are 2.5mm, 8.15mm and 5.5mm.
3.8 Total NoC throughput in 3 topologies, DLA traffic.
3.9 Avg. round trip flit latency in 3 NoCs, DLA traffic.
3.10 2D Mesh Power/Throughput/Latency trade-offs for DLA traffic. Normalized results are shown.
3.11 2D Mesh Power/Throughput/Latency trade-offs for SLA traffic.
3.12 DLA Traffic, 2D Torus Power/Throughput/Latency trade-offs. Normalized results are shown.
3.13 DLA Traffic, Folded 2D Torus Power/Throughput/Latency trade-offs. Normalized results are shown.
3.14 Frequency scaling on 3 topologies, DLA Traffic.
3.15 Dynamic voltage scaling on 2D Mesh, DLA Traffic. Frequency scaled curve for P=8 is also shown.
3.16 Schematic representation of the three compared topologies (L to R: 2D Torus, Tree, Reduced 2D Torus). Shaded rectangles are Routers and white boxes are source/sink Processing Element (PE) nodes.
3.17 Floorplans of the three compared topologies.
3.18 Maximum attainable frequency by links in the respective topologies. Estimated length of the longest link in a 2D Torus is 7mm. Estimated longest link in the Tree based and Reduced 2D Torus is 3.5mm.
3.19 Variation of total NoC throughput with varying pipeline stages in all three topologies.
3.20 2D Torus Power/Throughput/Latency trade-offs. Normalized results are shown.
3.21 Reduced 2D Torus Power/Throughput/Latency trade-offs. Normalized results are shown.
3.22 Variation of NoC power with throughput for each topology.
3.23 Effects of dynamic voltage scaling on the power and performance of a 2D Torus. Highest frequencies of operation for P=1, 2, 4 and 7 are 0.93GHz, 1.68GHz, 2.92GHz and 4.22GHz. Power consumption of the frequency scaled NoC is shown for comparison.
4.1 Error in performance measurement between real and ideal interconnect experiments.
4.2 Schematic of a multiprocessor architecture comprising tiles and an interconnecting network. Each tile is made up of a processor, L1 and L2 caches.
4.3 Flowchart illustrating the steps in the experimental procedure.
4.4 Tile floorplans for different (L1, L2) sizes. From left: (8KB, 64KB), (64KB, 1MB), (128KB, 4MB).
4.5 Mesh floorplans used in experiments. From left: Conventional 2D Mesh topology, a clustered topology, a clustered topology with L2 bank and thread mapping, and a mesh topology with L2 bank and thread mapping.
4.6 Benchmark execution time vs. Communication time - DRAM access time and On-chip transit time vs. L2 cache size vs. Program completion time.
4.7 Program energy vs. Communication time.
4.8 64K point FFT benchmark execution time vs. Total time spent in on-chip message transit. L2 cache sizes are in the order 64KB, 128KB, 256KB, 512KB, 1M, 2M, 4M.
4.9 64K point FFT execution time vs. Total time spent in DRAM (off-chip) accesses. L2 cache sizes are in the order 64KB, 128KB, 256KB, 512KB, 1M, 2M, 4M.
4.10 Total messages over all the links during the execution of the benchmark and Average transit time of a message.
4.11 FFT. Total instructions executed and power spent in the memory hierarchy and on-chip links during the execution.
4.12 FFT Benchmark. Energy per Instruction and Instructions per second² per Watt.
4.13 Y1: PCT, Y2: on-chip transit and off-chip comm. times.
4.14 FFT benchmark results. (PCT: Program Completion Time, comm.: communication).
4.15 FFT benchmark results.
4.16 Program Completion Times.
4.17 Alternative Tile Placements, custom process scheduling example and ideal interconnect comparison results. Benchmark: FFT, L1: 256K, L2: 512K.
5.1 (a) Process graph of a HiperLAN/2 baseband processing SoC[7] and (b) NoC of the Object recognition processor[8].
5.2 A 64 Node, 8 × 8 2D LS-NoC along with NoC Manager interface to routing tables.
5.3 Pipe establishment and label swapping example in a 3×3 LS-NoC.
5.4 Label Switched Router with single cycle flit traversal. Valid signal identifies Data and Label as valid. PauseIn and PauseOut are flow control signals for downstream and upstream routers. Routing table has output port and label swap information. Arbiter receives input from all the input ports along with the flow control signal from the downstream router.
5.5 Label conflict at R1 resolved using Label swapping. il: Input Label, Dir: Direction, ol: Output Label.
6.1 Surveillance system showing the application of LS-NoC in the Video computation server.
6.2 (a) A 2 router, 6 communicating node linear network. (b) Multiple source, multiple sink flow calculation in a network.
6.3 (a) Number of pipes in a linear network (Fig. 6.2(a)), lw = 3 bits, varying constraints. Constraint 1: Max 1 pipe per sink. (b) Max. number of pipes in 2D Mesh (Fig. 5.2).
7.1 (a) Process blocks of HiperLAN/2 baseband processing SoC and Object recognition processor mapped on to an 8 × 8 LS-NoC. Pipe 1: PEC0 → PEC6, Pipe 2: MP → PEC3. (b) Flows set up for CBR & VBR traffic.
7.2 Latency of HiperLAN/2 and ORP pipes in LS-NoC over varying injection rates of non-streaming application nodes. Latency of non-provisioned paths are titled (U).
7.3 (a) Latency of CBR traffic over various injection rates of non-streaming nodes in LS-NoC. (b) Latency of VBR traffic over various injection rates of non-streaming nodes in LS-NoC.
7.4 LS-NoC being used alongside a best effort NoC.
B.1 Modules in LS-NoC router design shown along with testbench, implemented in Verilog.
B.2 Test cases used to verify an individual LS-NoC router.
B.3 8×8 mesh used for testing LS-NoC.
B.4 Traffic test cases used to verify proper functioning of the LS-NoC router.
B.5 Flowchart illustrating the Synthesis and Place & Route steps of the LS-NoC router.
B.6 Placed and routed output - Single Router.
C.1 Steps in the flow algorithm example. (a) Input Graph. Maximum flows have to be identified between nodes X & Y. (b) Available capacities of links after flows X→A→C→Y & X→B→C→Y are set up. (c) Residual network showing available capacities of links in the forward direction and utilized capacity in the reverse. (d) Residual network after adding the flow X→A→C→B→D→E→Y. (e) Final output of the maxflow algorithm showing 3 flows from X to Y.
C.2 (a) A 2 router, 6 source+sink system used for validation of the LS-NoC router design. Graph representation of the system used as input to the flow algorithm is shown in (b).
C.3 The NoC after two pipes, P0 and P1, have been established. P0: R0 S0→R1 D2 and P1: R0 S2→R1 D0.


Chapter 1

Introduction

1.1 Network-on-Chip

Networks on Chip[1][2][3][4] are critical elements of modern Chip Multiprocessors (CMPs) and Systems on Chip (SoCs). Networks on Chip (NoCs) help manage the high complexity of designing large chips by decoupling computation from communication. SoCs and CMPs have a multiplicity of communicating entities like programmable processing elements, hardware acceleration engines, memory blocks as well as off-chip interfaces. Using a NoC enables modular design of communicating blocks and network interfaces. NoCs help achieve a well structured design, enabling higher performance while servicing larger bandwidths compared to bus based systems[1]. Links in NoCs designed with controlled electrical parameters can use aggressive signaling circuits to reduce power and delay[9]. Network resources are utilized more efficiently in NoCs as compared to global wires[10].

Communication patterns between communicating entities are application dependent. As a result, NoCs are expected to cater to diverse connections varying in connectivity, burstiness, latency and bandwidth requirements. NoCs servicing communication requirements in CMPs or SoCs are expected to meet Quality of Service (QoS) demands such as maximum or average latency, typical or peak bandwidth and required throughput of executing applications. Further, with power having become a serious design constraint[5], there is a great need for designing NoCs which meet the target communication requirements while minimizing power using various strategies at the architecture, microarchitecture and circuit levels of the design.


1.2 Switching Policies

Switching policies configure paths in routers to facilitate data transfer between input and output ports. Programming of internal switches in routers to connect input ports to output ports, and determination of when and which data units are transferred, is accomplished using switching policies. Flow control mechanisms synchronize data transfer between a router and traffic sources, and between two routers. Switching policies and flow control mechanisms influence the design of internal switches, routing and arbitration units, and the amount of buffering in a router. The major types of switching policies are introduced here.

1.2.1 Circuit Switching

Circuit switching is a reservation based switching policy in which network resources are allocated to a communication path before data is transferred. At the end of data transfer, reserved resources are de-allocated and are available for future circuits. As circuits are used on a reservation basis, circuit switching requires a simple router design with few or no buffers.

Circuits are established using path identifying probe packets that reserve resources as they propagate towards the destination. Circuit establishment is complete after an acknowledgment message is received by the source. Data is transferred along the circuit without further monitoring or control. After the transfer is complete, the circuit is torn down and resources are freed using a tail packet. Popular examples of circuit switched networks are the Autonomous Error-Tolerant Cell[11], Asynchronous SoC[12], Crossroad[13], dTDMA[14], a point-to-point network for real time systems[15], a programmable NoC for FPGA-based systems[16], ProtoNoC[17], the Space Division Multiplexing based NoC[18], SoCBuS[19], the Reconfigurable Circuit Switched NoC[7], etc.


1.2.2 Packet Switching

In packet switching, the message to be transmitted is partitioned and transmitted as fixed-length packets. Routing and control are handled on a per packet basis. The packet header includes routing and other control information needed for the packet to reach the destination. Packet switching increases network resource utilization as communication channels share resources along the path. Buffers and arbitration units in routers manage resource conflicts and storage demands in communication paths. Packet switching networks aid IP block re-use and are scalable[20]. Packet switching is more flexible than circuit switching, though it requires buffering and introduces unpredictable latency (jitter). Popular packet switched networks are the Asynchronous NoC[21], FAUST[22], Arteris NoC[23], Butterfly Fat Tree[24], DyAD[25], Eclipse[26], MANGO[27], Proteo[28], QNoC[29], SPIN[30], etc. Some NoC designs can adaptively work in circuit or packet switched modes based on traffic requirements; a few examples are Æthereal[31], Heterogeneous IP Block Interconnection[32], a dynamically reconfigurable NoC[33], Octagon[34], etc.

1.2.3 Label Switching

Label switching is used by technologies such as ATM[35][36] and Multiprotocol Label Switching (MPLS)[37] as a packet relaying technique. Individual packets carry route information in the form of labels. A label denotes a common route that a set of data packets traverse. Therefore, a minimalistic label identifies the source hop and the destination hop along with the intermediate transit routers. Along with routing information, labels can be used to specify service priorities for packets. This feature enables differentiated services for packets using common labels. Routers along the path use the label to identify the next hop, forwarding information, traffic priority, Quality of Service guarantees and the next label to be assigned. Label switching inherently supports traffic engineering, as labels can be chosen based on the desired next hop or required QoS services. A few proposals of label switched NoCs are MPLS NoC[38], Nexus[39] and Blackbus[40].
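As a concrete illustration of this mechanism, the sketch below shows what a label-indexed forwarding table with label swapping might look like in software. It is a minimal example under assumed sizes (a 4-bit label and a 5-port router); it is not the LS-NoC router implementation, which is an RTL design described in Chapter 5.

```cpp
#include <array>
#include <cstdint>
#include <optional>

// Hypothetical forwarding entry: the incoming label selects the output port
// and the label that the flit carries on the next hop (label swap).
struct LabelEntry {
    bool    valid    = false;
    uint8_t outPort  = 0;   // 0..4 for the assumed 5-port router
    uint8_t outLabel = 0;   // next-hop label
};

// One router's label table: 2^4 = 16 entries for the assumed 4-bit label.
class LabelTable {
public:
    // Programmed once per pipe at route-setup time (e.g. by a central manager).
    void program(uint8_t inLabel, uint8_t outPort, uint8_t outLabel) {
        table_.at(inLabel) = LabelEntry{true, outPort, outLabel};
    }

    // Per-flit forwarding decision: a single indexed lookup, no per-packet
    // route computation. Returns nothing if no pipe uses this label.
    std::optional<LabelEntry> lookup(uint8_t inLabel) const {
        const LabelEntry& e = table_.at(inLabel);
        if (!e.valid) return std::nullopt;
        return e;
    }

private:
    std::array<LabelEntry, 16> table_{};
};
```

Because the lookup is indexed by a small label rather than matched against full source and destination addresses, per-hop state stays small, and swapping the label at each hop lets different pipes reuse the same label values on different links.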


1.3 QoS in NoCs

NoCs servicing CMPs and SoCs are expected to meet the Quality of Service (QoS) demands of executing applications. Latency sensitive applications demand a guaranteed average and maximum latency on communication traffic. Jitter sensitive applications may tolerate longer latencies but require fixed delay along communication paths. Further, among classes of applications some have higher priority than others. For example, application data usually has higher priority than acknowledgment packets or control information. The two basic approaches in NoC designs to enable QoS guarantees are: creation of reserved connections between sources and destinations via circuit switching, or support for prioritized routing (in the case of packet switched, connectionless paths).

Circuit switched NoCs guarantee high data transfer rates in an energy efficient manner by reducing intra-route data storage[41]. Circuit switched NoCs provide guaranteed QoS for worst case traffic scenarios, leading to higher network resource requirements[42]. They are well suited for streaming traffic generated by media processors where communication requirements are well known a priori. One of the drawbacks is under-utilization of network resources, as resources are reserved for peak bandwidth while the average requirement might be lower.

Packet switched networks provide efficient interconnect utilization and high throughputs[43] while providing fairness amongst best effort flows. However, network resources in packet switched networks need to be over-provisioned to support QoS for various traffic classes, and routers have high buffer requirements. Packet switching networks usually provide QoS through differentiated services, classifying traffic into various classes[29]. Prioritized services are provided to traffic belonging to each class. Due to the sharing of network resources, packet switched networks can be configured to provide Guaranteed Throughput (GT) for a few classes of traffic and Best Effort (BE) services for the remaining classes.

With traffic engineering enabled label switching networks, communication loads can be distributed over the NoC, resulting in fair allocation of network resources. Network resource guarantees enable paths with little or no jitter while keeping network utilization fairly high. Further, the design of routers is simplified compared to conventional wormhole routers[40].


1.4 QoS Guaranteed NoC Design

Media processors with streaming traffic, such as HiperLAN/2 baseband processors[7], real-time object recognition processors[8] and H.264 encoders[44][45], demand adequate bandwidth and bounded latencies between communicating entities. They also have well known communication patterns and bandwidth requirements. Adequate throughput, latency and bandwidth guarantees between process blocks have to be provided for such applications. The nature of streaming applications in media processors and the characteristics of streaming traffic are illustrated in Section 5.1 of Chapter 5.

Guaranteeing QoS in NoCs involves guaranteeing bandwidth and throughput for connections and deterministic latencies in communication paths. This thesis proposes a QoS guaranteeing NoC using label switching, in which bandwidth can be reserved while links are shared. The traffic is engineered during route setup, and the scheme leverages advantages of both packet and circuit switching techniques. We propose a QoS based Label Switched NoC (LS-NoC) router design. We present a latency, power and performance optimal interconnect design methodology considering low level circuit and system parameters. Further, optimal tile configurations are identified using the effects of application communication traffic on performance and energy in chip multiprocessors (Figure 4.2).

A label switched, QoS guaranteeing NoC that retains the advantages of both packet switched and circuit switched networks is the main focus of this thesis. Congestion free communication pipes are identified by a centralized Manager with complete network visibility. The Label Switched NoC (LS-NoC) sets up communication channels (pipes) between communicating nodes that are independent of existing pipes and are contention free at the routers. Deterministic delays and bandwidth are guaranteed in newly established pipes, taking into account established flows. Residual bandwidth in links reserved by a pipe can be utilized by other pipes, thus enabling sharing of physical links between pipes without compromising QoS guarantees. LS-NoC provides throughput guarantees irrespective of the spatial separation of the communicating entities.
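The sketch below illustrates, under assumed data structures, the per-link bookkeeping that such sharing implies: a new pipe is admitted along a route only if every link on it still has enough unreserved bandwidth, and that amount is then set aside. It is only an admission-check illustration, not the LS-NoC Manager itself, which is described in Chapter 6.

```cpp
#include <map>
#include <utility>
#include <vector>

using Link = std::pair<int, int>;    // (from-router, to-router); IDs are assumed

// Hypothetical residual-bandwidth ledger used to decide pipe admission.
class BandwidthLedger {
public:
    explicit BandwidthLedger(double linkCapacityGbps) : capacity_(linkCapacityGbps) {}

    // Admit a pipe along 'route' only if every link can still spare 'demand'
    // Gbit/s; on success the bandwidth is reserved on each link.
    bool reservePipe(const std::vector<Link>& route, double demand) {
        for (const Link& l : route)
            if (capacity_ - reserved_[l] < demand)
                return false;          // admitting would break an existing guarantee
        for (const Link& l : route)
            reserved_[l] += demand;    // residual bandwidth on the link shrinks
        return true;
    }

    // Tear down a pipe and return its bandwidth to the links on its route.
    void releasePipe(const std::vector<Link>& route, double demand) {
        for (const Link& l : route)
            reserved_[l] -= demand;
    }

private:
    double capacity_;                  // assumed uniform link capacity
    std::map<Link, double> reserved_;  // bandwidth already promised per link
};
```

Whatever a link's reservations leave over is exactly the residual bandwidth that later pipes may claim, which is how physical links are shared without weakening the guarantees of pipes that are already established.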


Figure 1.1: Design space exploration of NoCs in CMPs is closely related to link microarchitecture, router design and tile configurations.

Interconnect delay and power contribute significantly towards the final performance and power numbers of a CMP[46]. Design variables for interconnect exploration include wire width, wire spacing, repeater size and spacing, degree of pipelining, supply voltage (Vdd), threshold voltage (Vth), and activity and coupling factors. A power and performance optimal link microarchitecture can be arrived at by optimizing these low level link parameters. A methodology to arrive at the optimal link configuration, in terms of the number of pipeline stages (cycle latency) for a given link length and desired operating frequency, is presented. Optimal configurations of all links in the NoC are identified, and a power-performance optimal NoC is thus achieved.
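A hedged sketch of the search this methodology implies is shown below: for a link of a given length, sweep the number of pipeline stages and keep the configuration with the lowest energy-delay product. The delay and energy models here are crude placeholders (an RC-like term that grows quadratically with segment length plus a fixed flop overhead); the thesis relies on detailed wire and repeater models instead, so the numbers are illustrative only.

```cpp
#include <cstdio>

// Placeholder link models, assumed for illustration only. An unrepeated wire
// segment is given RC-like delay that grows quadratically with its length, so
// cutting a long link into shorter pipelined segments raises the attainable
// clock frequency.
struct LinkModel {
    double rcDelayCoeff  = 0.03;  // ns per mm^2 (assumed)
    double flopOverhead  = 0.05;  // ns added per pipeline stage (assumed)
    double energyPerMm   = 0.10;  // pJ/bit/mm of wire (assumed)
    double energyPerFlop = 0.02;  // pJ/bit per pipeline stage (assumed)
};

int main() {
    const LinkModel m;
    const double lengthMm = 8.0;   // e.g. the longest link in a 2D torus
    int    bestStages = 1;
    double bestEdp    = 1e300;

    for (int stages = 1; stages <= 8; ++stages) {
        double segMm     = lengthMm / stages;
        double cycleNs   = m.rcDelayCoeff * segMm * segMm + m.flopOverhead;
        double latencyNs = cycleNs * stages;                    // link traversal time
        double energyPj  = m.energyPerMm * lengthMm + m.energyPerFlop * stages;
        double edp       = energyPj * latencyNs;                // energy-delay product

        std::printf("stages=%d  f=%.2f GHz  latency=%.2f ns  EDP=%.3f pJ*ns\n",
                    stages, 1.0 / cycleNs, latencyNs, edp);
        if (edp < bestEdp) { bestEdp = edp; bestStages = stages; }
    }
    std::printf("EDP-optimal pipeline depth for this toy model: %d stages\n", bestStages);
    return 0;
}
```

With this kind of model the absolute latency first drops as stages are added, because shorter segments can be clocked faster, and then flattens while flop energy keeps growing; the interior minimum of the energy-delay product is the optimum degree of pipelining referred to above.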


Primary and secondary cache sizes have a major bearing on the amount of on-chip and off-chip communication in a Chip Multiprocessor (CMP). On-chip and off-chip communication times have a significant impact on the execution time and energy efficiency of CMPs. From a performance point of view, cache accesses should suffer minimum delay and off-tile communication due to cache misses should be negligible. Large caches dissipate more leakage energy and may exceed area budgets, though they reduce cache misses and decrease off-tile communication. Larger caches also result in longer inter-tile communication link lengths and latencies, thus adversely impacting communication time. Small caches reduce occupied tile area, have higher activity and hence dissipate less leakage energy. The drawback of smaller caches is a potentially higher number of misses and more frequent off-tile communication. This illustrates the trade-off between cache size, miss rate, NoC communication latency and power. Energy efficient tile design is thus a configuration exploration and trade-off study using different cache sizes and tile areas to identify a power-performance optimal cache size and NoC configuration for the CMP.

1.5 Contributions of the Thesis

This thesis presents methodologies for label switched QoS guaranteed NoC design, link microarchitecture exploration and optimal Chip Multiprocessor (CMP) tile configuration. The contributions of this thesis are listed here:

1.5.1 Link Microarchitecture Exploration

• Optimal Link Design and Exploration Framework: We present a simulation framework developed in SystemC which allows the designer to explore NoC designs across low level link parameters such as pipelining, link width, wire pitch, supply voltage and operating frequency, and NoC architectural parameters such as router type and topology of the interconnection network. We use the simulation framework to identify the power-performance (energy-delay) optimal link configuration in a given NoC over particular traffic patterns. Such an optimum exists because increasing pipelining allows for shorter wire segments which can be operated either faster or with lower power at the same speed.

• Optimum Pipe Depth: Contrary to intuition, we find that increasing pipeline depth can actually help reduce latency in absolute time units, by allowing shorter links and hence a higher frequency of operation. In some cases, we find that switching to a deeper pipelining configuration can actually help reduce power, as the links can be designed with smaller repeaters. Larger NoC power savings can be achieved by voltage scaling along with frequency scaling. Hence it is important to include the link microarchitecture parameters as well as circuit parameters like supply voltage during architecture design exploration of NoCs.


1.5.2 Optimal CMP Tile Configuration

• Optimal Cache Size: The performance-power optimal L1/L2 configuration of a tile is close to the configuration that spends the least amount of time in on-chip and off-chip communication.

• Effect of Floorplanning and Process Mapping: Communication aware floorplanning can save up to 2.6% of the energy spent in executing an instruction and up to 11% of the communication power spent during the execution of the program. Mapping L2 banks to the same core as the processes accessing them reduces the time spent in communication, and hence the overall program completion time, and also has a bearing on the total energy spent in the execution of the program. Experiments have revealed that as much as 2% of energy per instruction can be saved by communication-aware process scheduling compared to conventional thread mapping policies in a 2D Mesh architecture.

1.5.3 QoS in NoCs

• A Label Switching NoC providing QoS guarantees: We present LS-NoC to service the QoS demands of streaming traffic in media processors. A centralized NoC Manager capable of traffic engineering establishes bandwidth guaranteed communication channels between nodes. LS-NoC guarantees deterministic path latencies, satisfies bandwidth requirements and delivers constant throughput. Delay and throughput guaranteed paths (pipes) are established between sources and destinations along contention free, bandwidth provisioned routes. Pipes are identified by labels unique to each source node. Labels need fewer bits compared to node identification numbers, potentially decreasing memory usage in routing tables.


identification algorithms to identify contention-free, bandwidth provisioned paths in LS-NoC called pipes. The LS-NoC Manager has complete visibility of the state of LS-NoC. Bandwidth requirements of the application are taken into account by the flow identification algorithm to provision routes between communicating nodes. The flow based pipe establishment algorithm is topology independent and hence the NoC Manager supports applications mapped to both regular chip multiprocessors (CMPs) and customized SoCs with non-conventional NoC topologies. Additionally, fault tolerance is achieved by the NoC Manager by considering link status during pipe establishment.

• Design of a Label Switched Router: The Label Switched (LS) Router used in LS-NoC achieves single cycle traversal delay under no contention and is multicast and broadcast capable. Source nodes in the LS-NoC can work asynchronously as cycle-level scheduling is not required in the LS Router. The LS router supports multiple clock domain operation. Dual clock buffers can be used at output ports in the LS-NoC router. This eases clock domain crossovers and reduces the need for a single globally synchronous clock. As a result, clock tree design is less complex and clock power is potentially saved.

1.6 Organization of the Thesis

Chapter 2 highlights several works from current literature related to the broad areas of QoS guaranteed NoCs, link microarchitecture, design space exploration of NoCs and effects of communication on energy and performance trade-offs in CMPs.

Chapter 3 presents a latency, power and performance trade-off study of NoCs through link microarchitecture exploration using microarchitectural and circuit level parameters. The NoC exploration framework used in the trade-off studies is described. The interface to the SystemC framework and sample output logs generated are presented in Appendix A.

Effects of on-chip and off-chip communication due to various CMP tile configurations are explored in Chapter 4. The need to use detailed interconnection network models to
identify optimal energy and performance configurations is also highlighted. On-chip and off-chip communication effects on power and performance of CMPs are explored. Effects of communication on program execution times and program execution energy are presented. Further, energy-performance results for tile configurations and effects of custom L2 bank mapping and thread mapping on power and performance of CMPs are presented.

Design and implementation of a label switching, traffic engineering capable NoC delivering guaranteed QoS for streaming traffic in media processors is presented in Chapter 5. Traffic characteristics of streaming applications are also presented in that chapter. Functional verification of the LS-NoC router using various test cases is presented in Appendix B. Chapter 6 illustrates the LS-NoC management framework and the flow identification algorithm used to establish pipes. An example of the use of the flow algorithm is presented in Appendix C. Streaming application test cases and various types of video traffic are used to establish LS-NoC as a QoS guaranteeing framework in Chapter 7. The thesis concludes in Chapter 8 after listing some possible future extensions of the proposed work.


Chapter 2

Related Work

Several publications have highlighted the need for solutions to pressing problems in various domains in the broad area of Network-on-Chips[47][48][49][50]. This chapter introduces relevant works in the broad areas of QoS guaranteed Network-on-Chips, design space exploration of NoCs and effects of communication on energy and performance trade-offs in CMPs.

2.1 Traffic Engineered NoC for Streaming Applications

Providing QoS guarantees in on-chip communication networks has been identified as one of the major research problems in NoCs[48]. QoS solutions in packet switched networks use priority based services while circuit switched NoCs use some form of resource reservation. We introduce a few well known QoS solutions from the literature and compare our work with the state of the art. Packet switched NoCs use differentiated services for traffic classes[29][22][21][8] to provide latency and bandwidth guarantees. Circuit switched NoCs use resource reservation mechanisms to guarantee QoS[34][51][41][19]. Resource reservation mechanisms involve identifying a sufficiently resource rich path, reserving resources along the path, configuration, actual communication and path tear down. A fairly extensive survey of NoC proposals has been presented in [50]. Relevant QoS NoCs are discussed in
this section.

2.1.1 QoS in Packet Switched Networks

QoS NoC (QNoC) presented by Bolotin et al. [29] is a customized QoS NoC architecture based on a 2D Mesh that satisfies QoS by allocating frequently communicating nodes close by, doing away with unnecessary links, tailoring link width to meet bandwidth requirements and balancing link utilization. Inter-module communication traffic is classified into four classes of service: signaling, real-time, RD/WR and block-transfer. FAUST[22] is a reconfigurable baseband platform based on an asynchronous NoC providing a programmable communication framework linking heterogeneous resources. FAUST uses a 2-level priority based virtual circuit design in its Network Interface (NI) to provide QoS guarantees. Asynchronous NoCs[21] use clock-free interconnect to improve reliability and delay-insensitive arbiters to solve routing conflicts. A QoS Router with both soft (Soft GT) and hard (Hard GT) guarantees for globally asynchronous, locally synchronous (GALS) NoCs is presented in [52]. Leftover bandwidth in routers servicing Hard GT is utilized by Soft GT connections and best effort traffic. NoCs presented in [21], [52] and [53] employ multiple priority levels to provide differentiated services and guarantee QoS. The MANGO [27][54] NoC provides hard GT by prioritizing each GT connection and adopts Asynchronous Latency Guarantee (ALG) scheduling to prevent starvation of packets with lower priority.

One of the major drawbacks of priority based QoS schemes is that an increase in traffic in one priority class affects the delay of traffic belonging to other classes. A priority network will lose the differentiated services advantage if all traffic belongs to the same priority level. Further, deadlock-free routing algorithms using virtual circuits with a priority approach may lead to degradation in NoC throughput. In cases where connections cannot be overlapped with each other (e.g. MANGO NoC), an increased number of hard GT connections will lead to increased cost in network resources.

Another class of packet switched NoCs using priority based QoS solutions are application specific SoCs. A tree based hierarchical packet-switched NoC for a real-time object recognition processor is implemented in [8]. The tree topology NoC with three crossbar
switches, which interconnects 12 IPs, supports both bursty (for image traffic) and non-bursty (for control and synchronization signals) traffic. Network resources in this NoC are tailored to meet the throughput and bandwidth demands of the application and hence the design is not a generic solution for servicing QoS in a CMP environment.

2.1.2 QoS in Circuit Switched Networks

Resource reservation between communicating nodes involves identification of a path using point-to-point links, a path probing service network, or an intelligent, traffic aware distributed or centralized manager. Hu et al.[15] introduce point-to-point (P2P) communication synthesis to meet timing demands between communicating nodes using bus width synthesis. Circuit switched bus based QoS solutions such as Crossroad[13], dTDMA[14] and Heterogeneous IP Block Interconnection (HIBI)[32] rely on communication localization to satisfy timing demands. NEXUS[39] is a resource reservation based QoS NoC for globally asynchronous, locally synchronous (GALS) architectures. NEXUS uses an asynchronous crossbar to connect synchronous modules through asynchronous channels and clock-domain converters.

P2P networks do not share communication links between multiple nodes, leading to inefficient utilization of network resources. This increases wiring resources inside the chip and results in poor scalability. Crossbar based solutions using protocol handshakes (for example, 4-way handshakes in NEXUS[39] and ProtoNoC[17]) force communicating nodes to wait till the handshake is complete and the path is established. Non-interference of communication channels is achieved by over-provisioning resources in the crossbar. This leads to complex and poorly scalable networks. Connecting frequently communicating nodes on a single bus will increase demand on the bus and lead to larger waiting times at the nodes. Static routing along shortest paths does not guarantee latency bound routes due to arbitration delays in the network.

Amongst the NoCs that use probe based circuit establishment solutions are Intel's 8×8 circuit switched NoC[41], SoCBUS[19][55] and the distributed programming model in
Æthereal[51]. In these NoCs, probe packets are used to reconnoiter shortest communication paths and configure routing tables if a path (circuit) is available. Routers are locked down and no other circuits can use the port during the lifetime of an established circuit. If the shortest X-Y path is not available, the probe packets initiate route discovery mechanisms on other paths. The method involves some dynamic behaviour as the probe might repeat route discovery steps or retry after a random period of time if circuit set up does not succeed. This leads to indeterministic and sometimes large route setup times which may be unacceptable for real time application performance.

Centralized Circuit Management

Reserved communication channels can be identified and configured using an application-aware hardware or software entity[51][34]. Such a traffic manager can provide programmability of routes.

The Æthereal NoC [51] aims at providing hard guaranteed QoS using Time Division Multiplexing (TDM) to avoid contention in a synchronous network. The centralized programming model in the Æthereal NoC[51] uses a root process to identify free slots and configure network interfaces. Time slot tables are used in routers to reserve output ports per input port in a particular time slot. To avoid collisions and the loss of data, consecutive time slots are then reserved in routers along the circuit path. The number of paths established in the NoC is restricted by the scheduling constraints during time slot reservation. Increasing the number of time slots in TDM based NoCs increases router size. In cases where a communication channel cannot be found due to slot exhaustion, traffic division over multiple physical paths may be required[56]. Traffic division involves reordering packets at the target node, leading to increased memory and computational costs.
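To make the slot-table mechanism concrete, the following is a minimal sketch of how consecutive time slots could be checked and reserved along a circuit path. It is an illustration only, not Æthereal's implementation: the slot count, the Router structure and the reserve_circuit interface are assumptions made for this example.

```cpp
#include <array>
#include <cstddef>
#include <vector>

// Illustrative TDM slot table: each router on a path may reserve one output
// port per time slot. A circuit occupies slot (s + hop) at the hop-th router
// so that a flit injected in slot s never collides with another circuit.
constexpr std::size_t kSlots = 8;          // slots per TDM wheel (assumed)
constexpr int kFree = -1;

struct Router {
    std::array<int, kSlots> slot_owner{};  // circuit id holding each slot
    Router() { slot_owner.fill(kFree); }
};

// Try to reserve consecutive slots for 'circuit' along 'path', starting at
// injection slot 'start'. Commits the reservation only if every hop is free.
bool reserve_circuit(std::vector<Router>& path, int circuit, std::size_t start) {
    for (std::size_t hop = 0; hop < path.size(); ++hop)
        if (path[hop].slot_owner[(start + hop) % kSlots] != kFree)
            return false;                  // scheduling constraint violated
    for (std::size_t hop = 0; hop < path.size(); ++hop)
        path[hop].slot_owner[(start + hop) % kSlots] = circuit;
    return true;
}

// Example use: std::vector<Router> path(3); reserve_circuit(path, 1, 2);
```

The sketch also makes the scheduling constraint visible: a single occupied slot anywhere along the path forces the whole reservation attempt to fail, which is why slot exhaustion limits the number of circuits that can be established.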


TDM techniques using slot tables in Æthereal[51] and sequencers in Adaptive System-on-Chip[12] require a single synchronous clock distributed over the chip. Accurate global synchronous clock distribution is expensive in terms of power. Global synchronicity can be achieved in a distributed manner using tokens such that every router synchronizes every slot with all of its neighbors [57]. This method will bring down the operating speed of the NoC as the slowest router will dictate the speed of the NoC. Further, power management techniques such as multiple clock domains are not feasible with this approach. AElite[58] and dAElite[59] have been proposed as improved next generation Æthereal NoCs. AElite inherits the guaranteed services model from Æthereal. To overcome the global synchronicity problem, AElite proposes the use of asynchronous and mesochronous links as a possibility. As noted in the paper[58], using mesochronous links alone may not be sufficient if routers and NIs are plesiochronous[60]. One of the drawbacks of AElite was the number of slots occupied by the header flits. A header flit in AElite occupied one in three slots and the overhead rises to up to 33%. dAElite circumvents the header flit overhead by routing based on the times of packet injection and packet reception. One of the disadvantages of dAElite is an increase in the number of link wires, due to the configuration network and also because of separate wires for end-to-end credit communication.

The Octagon NoC[34] implements a centralized best fit scheduler to configure and manage non-overlapping connections. The scheduler cannot establish a new connection through a port if it is blocked by another connection. This results in increased connection establishment time at the routers and also packet losses.

2.1.3 QoS by Space Division Multiplexing

As an alternative to TDM techniques, Spatial Division Multiplexing (SDM) techniques for QoS have been proposed in [23], [61] and [62]. SDM techniques involve sharing fractions of links between connections simultaneously, based on the bandwidth requirements of the corresponding connections. An approach comparable to a static version of SDM called Lane-Division-Multiplexing has been proposed in [7]. Lane-Division-Multiplexing is based on a reconfigurable circuit switched router composed of a crossbar and data converters. A disadvantage of the solution in [7] is that it does not support channel sharing and BE traffic. An additional network is required for configuring the switches and for carrying the BE traffic. Sharing a subset of wires between connections as in [63] leads to a more complex switch design with large delay. SDM and TDM techniques have been combined
in [64], allowing the number of supported connections to be increased either by increasing the number of sub-channels in the link or by increasing the number of time slots. This increases path establishment probability in the NoC.

In SDM based techniques, the sender serializes data on the allocated wires and the receiver deserializes the data before forwarding it to the IP block. One of the issues in SDM based circuits is the complexity of implementation of serializers and deserializers.

2.1.4 Static routing in NoCs

Most NoCs use traffic oblivious static routing[51] to establish communication channels between nodes. Dimension ordered routing[41][53][17][51][34] or routes decided at design time[65] are not flexible and cannot circumvent congested paths. Routing in FPGAs also presents a similar scenario where routes between communicating nodes are bandwidth and latency guaranteed, but are static. These routes occupy network resources along the path for the entire lifetime of the application. QoS is guaranteed in this case by over-provisioning resources along the route.

2.1.5 MPLS and Label Switching in NoCs

Use of Multi-Protocol Label Switching for QoS[38] in NoCs and the advantages of identifying communication channels using labels have been investigated in [39],[40]. A conventional NoC is connected to an MPLS backbone using Label Edge Routers (LERs)[38]. The MPLS backbone applies traffic engineering and priority based QoS services to communication channels identified by labels. The work is a direct mapping of the MPLS implementation in the Internet to NoCs. The router and NoC design approach is not optimized for a hardware implementation. Results from Network Simulator-2 (NS-2) are at a functional level and may not reflect the exact performance achievable inside a chip.

Use of labels to identify communication channels instead of source and destination identification numbers reduces the amount of metadata transmitted in the NoC. Unique addressing at source allows label reuse and enables efficient use of label space. Implementation of label based addressing in streaming applications has resulted in significant
reduction in router area[40]. The work employs a method similar to label switching to achieve non-global label addressing, hence reducing label bit width. A C×N ↦→ C routing strategy is described in conjunction with the label addressing scheme. The work in [40] presents a simple data transfer scheme and does not concentrate on rendering QoS between communicating nodes. The route establishment process has not been explicitly mentioned and one can assume that standard routing algorithms will be used.

2.1.6 Label Switched NoC

In the proposed work, we describe a Label Switched QoS guaranteeing NoC that retains advantages of both packet switched and circuit switched networks. Contention at output ports is tackled using communication pipes. Pipes are communication routes established along a bandwidth rich, contention free router path. Pipes are identified by a centralized Manager with complete network visibility.

The NoC Manager utilizes flow identification algorithms[66][67] (Algorithm 1) to establish pipes. The flow identification algorithm guarantees a deterministic delay in identifying and configuring pipes. The flow identification algorithm takes into account the bandwidth available in individual links to establish QoS guaranteed pipes. This guarantees QoS serviced communication paths between communicating nodes. Multiple pipes can be set up in a single link if the QoS requirements of all the pipes are satisfied. This enables sharing of physical links between pipes without compromising QoS guarantees. LS-NoC provides throughput guarantees irrespective of the spatial separation of communicating entities.
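As an illustration of the kind of computation involved in flow identification, the sketch below performs a simplified breadth-first search over residual link bandwidth, so that a pipe is only routed over links that can still supply the requested bandwidth. This is not the actual Algorithm 1 used by the LS-NoC Manager; the graph representation, bandwidth units and function names are assumptions made for this example.

```cpp
#include <cstddef>
#include <queue>
#include <vector>

// Illustrative flow identification: each link carries a residual bandwidth
// value (e.g. in Mb/s). A pipe is admitted only along a path whose every link
// can still supply the requested bandwidth, so existing pipes are undisturbed.
struct Link { std::size_t to; double residual_bw; };

using Topology = std::vector<std::vector<Link>>;   // adjacency list per node

// Breadth-first search for a contention-free, bandwidth-provisioned route.
// Returns the node sequence from src to dst, or an empty vector if no route
// with sufficient residual bandwidth exists.
std::vector<std::size_t> find_pipe(const Topology& g, std::size_t src,
                                   std::size_t dst, double demand_bw) {
    std::vector<std::size_t> parent(g.size(), g.size());   // g.size() marks unvisited
    std::queue<std::size_t> q;
    q.push(src);
    parent[src] = src;
    while (!q.empty()) {
        std::size_t u = q.front(); q.pop();
        if (u == dst) break;
        for (const Link& l : g[u])
            if (parent[l.to] == g.size() && l.residual_bw >= demand_bw) {
                parent[l.to] = u;
                q.push(l.to);
            }
    }
    if (parent[dst] == g.size()) return {};                 // no feasible pipe
    std::vector<std::size_t> path{dst};
    for (std::size_t v = dst; v != src; v = parent[v]) path.push_back(parent[v]);
    return {path.rbegin(), path.rend()};
}
```

Once such a path is found, the Manager would configure labels in the routers along it and decrement the residual bandwidth of every link on the path by the pipe's demand, so that subsequent requests see only the remaining capacity.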


2.2 Link Microarchitecture and Tile Area Exploration

2.2.1 NoC Design Space Exploration

Current research in architectural level exploration of NoCs in SoCs concentrates on understanding the impacts of varying topologies, link and router parameters on the overall throughput, area and power consumption of the system (SoCs and Multicore chips) using suitable traffic models; such a study is discussed in [68]. The paper illustrates a consistent comparison and evaluation methodology based on a set of quantifiable critical parameters for NoCs. The work suggests that evaluation of NoCs must take applications into account. The usual most critical evaluation parameters are not exhaustive and different applications may require additional parameters such as testability, dependability, and reliability.

Work in [69] emphasizes the need for co-design of interconnects, processing elements and memory blocks to understand the effects on overall system characteristics. Results from this work show that the architecture of the interconnect interacts closely with the design and architecture of the cores and caches. The work studies the area-bandwidth-performance trade-off of on-chip interconnects. The increase in area demands of shared caches in CMPs is also documented. Not using detailed interconnect models during CMP design leads to non-optimal, larger shared L2 caches inside the chip.

2.3 Simulation Tools

Simulation tools have been developed to aid designers in interconnection network (ICN) space exploration[70][71]. Kogel et al.[70] present a modular exploration framework to capture the performance of point-to-point, shared bus and crossbar topologies.

2.3.1 Link Exploration Tools

Works on link exploration tools make a case for microarchitectural wire management in future processors where communication is a prominent contributor to power and performance. Wire exploration tools such as those presented in [71], [72], [73], [74] and [75] give an estimate of the delay of a wire in terms of latency for a particular wire length and operating frequency.

Orion [71] is a power-performance interconnection network simulator that is capable of providing power and performance statistics. The Orion model estimates power consumed by
router elements (crossbars, FIFOs and arbiters) by calculating the switching capacitances of individual circuit elements. Orion contains a library of architectural level parameterized power models.

The more recent Orion 2.0 presented in [76] is an enhanced NoC power and area simulator offering improved accuracy compared to the original Orion framework. Some of the additions in Orion 2.0 include flip-flop and clock dynamic and leakage power models and link power models, leveraging models developed in [74]. The Virtual Channel (VC) allocator microarchitecture uses a VC allocation model based on the microarchitecture and pipeline proposed in [77]. Application-specific, technology-level fine tuning of parameters using different Vth values and transistor widths is used to increase the accuracy of power estimation.

Work in [72] explores the use of heterogeneous interconnects optimized for delay, bandwidth or power by varying design parameters such as buffer sizes, wire width and the number of repeaters on the interconnects. The work uses Energy-Delay² as the optimization parameter. An evaluation of different configurations of heterogeneous interconnects is made. The evaluation shows that an optimal configuration (for delay, bandwidth, power, or power and bandwidth) of wires can reduce the total processor ED² value by up to 11% compared to a NoC with homogeneous interconnect in a typical processor.

Courtay et al.[73] have developed a high-level delay and power estimation tool for link exploration that offers statistics similar to those of Intacte. The tool allows changing architectural level parameters such as different signal coding techniques to analyze the effects on wire delay/power.

Work in [74] proposes delay and power models for buffered interconnects. The models can be constructed from sources such as Liberty[78], LEF/ITF[79], ITRS[80], and PTM[81]. The buffered delay models take into account the effects of input and output slews of circuit elements in calculating intrinsic delays. The power models include leakage and dynamic power dissipation of gates. The area models include technology dependent coefficients that can be estimated by linear regression techniques per technology node to estimate repeater areas.
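A minimal sketch of the kind of first-order buffered-wire model such tools implement is given below. The RC constants, repeater parameters, supply voltage and activity factor are illustrative placeholders and are not taken from [74], Orion or Intacte.

```cpp
#include <cstdio>

// Illustrative first-order model of a repeated (buffered) wire.
// A wire of length L driven through n equally spaced repeaters is treated as
// n distributed-RC segments plus n repeater delays; dynamic power follows
// alpha * C * V^2 * f. All constants are assumed, not library data.
struct WireTech {
    double r_per_mm;      // wire resistance per mm (ohm)     - assumed
    double c_per_mm;      // wire capacitance per mm (fF)     - assumed
    double t_repeater;    // intrinsic repeater delay (ps)    - assumed
    double c_repeater;    // repeater input capacitance (fF)  - assumed
};

double delay_ps(const WireTech& t, double len_mm, int repeaters) {
    double seg = len_mm / repeaters;
    // 0.38*R*C distributed-wire delay per segment; ohm * fF = 1e-3 ps.
    double seg_delay = 0.38 * (t.r_per_mm * seg) * (t.c_per_mm * seg) * 1e-3
                       + t.t_repeater;
    return repeaters * seg_delay;
}

double dyn_power_mw(const WireTech& t, double len_mm, int repeaters,
                    double vdd, double freq_ghz, double activity) {
    double c_total_f = (t.c_per_mm * len_mm + t.c_repeater * repeaters) * 1e-15;
    return activity * c_total_f * vdd * vdd * freq_ghz * 1e9 * 1e3;   // mW
}

int main() {
    WireTech tech{75.0, 200.0, 25.0, 5.0};   // placeholder, 130nm-like values
    for (int n = 1; n <= 8; ++n)
        std::printf("repeaters=%d  delay=%6.1f ps  dyn power=%.3f mW\n", n,
                    delay_ps(tech, 5.0, n),
                    dyn_power_mw(tech, 5.0, n, 1.2, 0.5, 0.3));
    return 0;
}
```

Even this skeleton exposes the delay minimum that repeater insertion is meant to achieve; production models add slew, leakage and regression-fitted area terms on top of such a structure.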


Intacte[82] is used for interconnect delay and power estimates. Design variables for Intacte's interconnect optimization are wire width, wire spacing, repeater size and spacing, degree of pipelining, supply (Vdd) and threshold voltage (Vth). Intacte can be used to arrive at the power optimal number, sizes and spacing of repeaters for a given wire length to achieve a desired frequency. Intacte outputs the total power dissipated, including short circuit and leakage power values.

A high level power estimation tool accounting for interconnect effects is presented in [83]. The work presents an interconnect length estimation model based on Rent's rule[84] and a high level area (gate count) prediction method. Different place and route engines and cell libraries can be used with this proposed model after some minor adaptations.

2.3.2 Router Power and Architecture Exploration Tools

Most router exploration tools model ICN elements at a higher level abstraction of switches, links and buffers and help in power/performance trade-off studies[85][86]. These are used to research the design of router architectures[87] and ICN topologies[34] with varying area/performance trade-offs for general purpose SoCs or to cater to specific applications.

A high level power estimation methodology for NoC routers, based on the number of traversing flits as the unit of abstraction, has been proposed in [85]. The macro model of the framework incurs a minor absolute cycle error compared to gate level analysis. Providing a fast and cycle accurate power profile at an early stage of router design enables power optimizations such as power-aware compilers, core mapping, and scheduling techniques for CMPs to be incorporated into the final design. The power macro model uses state information of the FSM in a router assigned to reserve channels during packet forwarding for wormhole flow control. This enhances the accuracy of the power macro model. The power macro model based on regression analysis can be migrated to different technology libraries.

An architectural-level power model for interconnection network routers has been presented in [88]. The work specifically considers the Alpha 21364 and InfiniBand routers
for modelling case studies. Memory arrays, crossbars and arbiters form the basic building blocks of all router models using this framework. Each of these building blocks has been modelled in detail to estimate switching capacitance. Switching activity is estimated based on traffic models assuming certain arrival rates at the input ports. The power numbers for both the Alpha 21364 and InfiniBand routers have been found to match the vendors' estimates within a minor error margin.

The high level power model presented in [86] to estimate power consumption in semi-global and global interconnects considers switching power and power due to vias and repeaters. The high level model estimates switching power within an error of 6% with a speedup of three to four orders of magnitude. Error in via power is under 3%. A segment length distribution model has been presented for cases where Rent's rule is insufficient. The segment length distribution model has been validated by analyzing netlists of a set of complex designs.

A wormhole router implementing a minimal adaptive routing algorithm with near optimal performance and feasible design complexity is proposed in [87]. The work also estimates the optimal size of the FIFO in an adaptive router with a fixed priority scheme. The optimal size of the FIFO is derived to be equal to the length of the packet in flits in this work.

2.3.3 Complete NoC Exploration

Several frameworks have been proposed for complete NoC exploration[89][90][91]. These frameworks can be used as tools to derive a first cut analysis of the effect of certain NoC configurations at an early design phase. Such frameworks are the first steps in roadmapping the future of on-chip networks.

A technology aware NoC topology exploration tool has been presented in [89]. The NoC exploration is optimized for energy consumption of the entire SoC. The work characterizes 2D Meshes and Torii, along with higher dimensions, multiple hierarchies and express channels, for energy spent in the network. The work presents analytical models based on NoC parameters such as average hop count and average flit traversal energy to
predict the most energy-efficient topology for future technologies.

A holistic approach to designing energy-efficient cluster interconnects has been proposed in [90]. The work uses a cycle-accurate simulator with designs of an InfiniBand Architecture (IBA) compliant interconnect fabric. The system is modelled to comprise switches, network interface cards and links. The study reveals that the links and switch buffers consume the major portion of the SoC power. The work proposes dynamic voltage scaling and dynamic link shutdown as viable methods to save power during SoC operation. A system-level roadmapping toolchain for interconnection networks has been presented in [91]. The framework is titled Polaris and iterates through available NoC designs to identify a power optimal one based on network traffic, architectures and process characteristics.

Several complete NoC simulators have been developed and are in use by the NoC research community[92][93][94]. The Network-on-Chip Simulator, Noxim[92], was developed at the University of Catania, Italy. Several NoC parameters such as network size, buffer size, packet size distribution, routing algorithm, selection strategy, packet injection rate, traffic time distribution, traffic pattern and hot-spot traffic distribution can be input to this framework. The simulator allows NoC evaluation based on throughput, flit delay and power consumption. The Nostrum NoC Simulation Environment (NNSE) [94] is part of the Nostrum project[65] and contains a SystemC based simulator. Inputs to this simulator are network size, topology, routing policy and traffic patterns. Based on these configuration parameters a simulator is built and executed to produce a desired set of results in a variety of graphs.

2.3.4 CMP Exploration Tools

Wattch was one of the first architectural level frameworks for analyzing and optimizing microprocessor power dissipation[95]. Wattch was orders of magnitude faster than layout-level power tools, and its accuracy was within 10% of verified industry tools on leading-edge designs. Wattch is an architecture-level, parameterizable simulator framework that can accurately quantify potential power consumption in microprocessors.
The Wattch framework quantifies the power consumption of all the major units of the processor, parameterizes them, and integrates these power estimates into a high-level simulator. Wattch models the main processor units as array structures, fully associative content-addressable memories, combinational logic and wires, or clocking elements. Individual capacitances of each of these elements are estimated and power is calculated. Work presented in [95] integrates Wattch into the SimpleScalar architectural simulator[96].

A tool like Ruby[97] allows one to simulate a complete distributed memory hierarchy with an on-chip network as in Orion. However, it needs to be augmented with a detailed interconnect model which accounts for the physical area of the tiles and their placements.

Network Processor exploration and power estimation tools utilize models for smaller components and quote the integrated power for the system[98][99][100]. They use cycle accurate register, cache and arbiter models introduced previously here. NePSim[99] is an open-source integrated simulation infrastructure. Typical network processors can be simulated with the cycle accurate simulator included in the framework. Testing and validation of results can be performed by an automatic verification framework. NePSim combines power models from XCacti[101], Wattch[95] and Orion[71] for the different hardware structures in NePSim. XCacti[101] is an enhanced version of Cacti 2.0 that includes power modeling for cache writes, misses, and writebacks. NePSim classifies the network processor components into categories such as ALU and shifter, registers, caches, queues and arbiters. The processor's power consumption can be calculated using a power estimation tool built into the framework.

Sapphire

The tile area optimization problem is closely knit with interconnect, cache and processor architecture exploration. There is a need for a co-design of interconnects, processing elements and memory blocks to fully optimize the overall multi-core chip performance. This necessitates a simulation framework which allows a co-simulation of processor cores, a detailed cache memory hierarchy and an on-chip network, along with a low-level interconnect model.


Sapphire[102] is a detailed chip multiprocessor simulation framework. Sapphire integrates SESC[103], Ruby[97], Intacte[82] and DRAMSim[104]. Sapphire enables cycle accurate simulations of a multi-core chip having a distributed memory hierarchy and an on-chip network, with interconnect latencies which are consistent with the physical sizes and placement of the cores. It is a multi-processor/multicore simulator where the memory hierarchy, interconnect and off-chip DRAM are parameterized and can be configured to model various configurations. Power consumed by DRAM is modelled using the MICRON power model. Sapphire also integrates interconnect power models from Intacte. Thus Sapphire allows users to explore the power and performance implications of all main system components like the processor, interconnect, cache hierarchy and off-chip DRAM.

Modeling Interconnect in Sapphire

Interconnect models from Intacte[75] are used in Sapphire for link level exploration and power estimation. Wire length estimation is the first step in latency estimation. Interconnect lengths are estimated by constructing floorplans of each CMP tile. A CMP tile contains a processor, an L1 Cache Controller connected to the L1 instruction (L1-I) and L1 data (L1-D) caches, an L2 Cache Controller and L2 Cache, a router and an optional memory controller. The memory is placed off-chip and is not a part of the node. The floorplan of a typical tile is shown in Figure 2.1. The area of the processor is estimated from available commercial processors. Areas of L1 and L2 caches are obtained from CACTI. The area of the router is negligible compared to the tile area at 32nm[105]. This method of wire length estimation has been used in Section 4.5.1 of Chapter 4. The Horizontal R-R and Vertical R-R distances denote the distances between horizontal and vertical routers in a homogeneous NoC such as a 2D Mesh.

Figure 2.1: Floorplan used in estimating wire lengths. Wire lengths estimated from these floorplans are used as input to Intacte to arrive at a power optimal configuration and latency in clock cycles. Horizontal R-R: Link between neighboring routers in the horizontal direction; Vertical R-R: Link between neighbouring routers in the vertical direction.
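A rough sketch of this floorplan-based estimation is given below. The component areas are placeholders (they would normally come from CACTI and commercial processor data), and the tile is assumed to be square, which simplifies the floorplan of Figure 2.1 to a single router-to-router distance.

```cpp
#include <cmath>
#include <cstdio>

// Illustrative tile-area to wire-length estimation. Component areas (mm^2)
// below are assumed values, not CACTI output. With roughly square tiles laid
// out in a 2D mesh, the router-to-router link length is about one tile edge.
int main() {
    double core_mm2 = 2.0;     // processor core            (assumed)
    double l1_mm2   = 0.5;     // L1-I + L1-D + controller  (assumed)
    double l2_mm2   = 4.0;     // L2 bank + controller      (assumed)
    double misc_mm2 = 0.5;     // router, memory controller, wiring overhead (assumed)

    double tile_mm2     = core_mm2 + l1_mm2 + l2_mm2 + misc_mm2;
    double tile_edge_mm = std::sqrt(tile_mm2);      // square-tile assumption

    std::printf("tile area          = %.2f mm^2\n", tile_mm2);
    std::printf("R-R link length    = %.2f mm (horizontal and vertical)\n", tile_edge_mm);
    return 0;
}
```

The resulting link lengths would then be fed to Intacte to obtain a power optimal repeater configuration and a latency in clock cycles, as described above.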


2.3.5 Communication in CMPs - Performance Exploration

Analysis of communication delays and their effect on the power and performance of CMPs has been a subject of interest in recent years. Mitigating communication delays through compiler techniques and micro-architecture has been looked at.

Cache Based Microarchitectural Techniques

Strided prefetching[106] has been compared with block migration and on-chip transmission lines as a means of managing on-chip wire delay in CMP caches to improve performance. Work in [106] combines block migration strategies, transmission line caches and stride-based hardware to reduce cache based communication latencies in CMPs. Dynamic Non-Uniform Cache Architecture (DNUCA)[107] uses block migration techniques to reduce cache latencies. Block migration involves moving frequently accessed blocks in the cache into banks with lower latencies. On-chip transmission lines are used to provide low latency to all banks in Transmission Line Caches[108]. Around 40%-60% of L2 cache hits are satisfied in central banks in CMPs with shared L2, rendering block migration ineffective. TLCs suffer from bandwidth contention, reducing their advantage.

Communication energy and delay can be minimized by migrating frequently accessed cache lines to banks closest to the accessing processors[109][110][111]. Non-Uniform Cache Architectures (NUCAs) with policies that allow important cache lines to migrate toward the processor within the same level of the cache have been proposed in [109][110]. The cache architecture proposed in [109][110] shows that a dynamic NUCA structure achieves higher IPC than a traditional Uniform Cache Architecture (UCA) when maintaining the same size and manufacturing technology. The D-NUCA mapping policy attempts to provide fastest bank access to all bank sets by sharing the closest banks among multiple bank sets. In this
case, banks in the cache are n-way set associative if they share a single bank. The proposed method results in low latency access, technology scalability and performance stability due to flattening of the memory hierarchy.

Work in [111] analyses the influence of network routers on performance in NUCA caches. Specifically, the work looks at the cut-through latency and buffering capacity of the network routers connecting the cache controller to different sub-banks in NUCA caches. They conclude that the effect of cut-through latency on the performance of the caches is high and that modest buffering is sufficient to achieve a good performance level. The work proposes a clustered NUCA organization that places an upper limit on the average number of hops experienced by cache accesses. This method simplifies router implementation, scales better with cut-through latencies and is more effective compared to hybrid network solutions.

Compiler Assisted Techniques

Instruction steering [112] and instruction replication[113] coupled with clustering have been researched as effective techniques to reduce the impact of wire delays. In a clustered superscalar microarchitecture, critical components are partitioned into smaller processing units called clusters. The centralized instruction window is replaced with multiple smaller windows. Simpler, smaller blocks allow clusters to operate at higher clock speeds with better energy efficiency and are scalable. However, clustering trades off IPC to score on these parameters. Clusters reduce the impact of wire delays by reducing the complexity and power consumption in clusters. Instruction steering involves assigning instructions to clusters in such a manner that workload is balanced and off-cluster communication is reduced. Work in [112] reduces communication during steering. The work proposes an Accurate Rebalancing technique to offload processes from the most utilized processor to the lesser utilized processors such that intra-cluster communication is minimal. Topology aware steering has also been proposed as a method to decrease latency between source and destination clusters.


Work in [113] presents an instruction replication method along with a clustering approach to decrease inter-cluster (inter-PE) communication. The load balancing algorithm used to distribute instructions among clusters, along with the amount of inter-cluster communication, dictates the performance of a clustered processor. The work aims to reduce inter-cluster communication by replicating instructions in the processing elements (PEs) where their results are utilized. Resources idle and available in PEs are used by replicated instructions such that load balancing is maintained.

Data transfer on long latency wires can be reduced by value prediction[114] and cache line replication[115][116] techniques. Work presented in [114] reduces long wire delays by predicting the data being communicated. The predicted value is then validated locally where it was produced. Correctly predicted values do not incur the long wire delay. The stride value predictor[117][118] predicts source operands of instructions to be executed.

Victim cache line replication presented in [115] replicates evicted primary cache lines into the L2 slice local to the CMP tile. The work considers a CMP with each tile containing a slice of the total L2. Cache line replication is a hybrid cache management policy combining a private local L2 slice and shared L2. The total effective capacity of L2 is reduced when every tile has a local copy of accessed cache lines. On the other hand, a single shared L2 may incur large latencies when cache lines have to be accessed from remote tiles. Hits to replicated cache lines reduce the effective latency of the shared L2 cache and hence reduce latency effects from communication in CMPs.

GALS & Floorplanning Techniques

Scalable microarchitectural techniques to reduce the impact of wire delay have been looked at[119][120]. Work in [119] investigates the interconnect bottleneck in FPGA based systems and proposes Globally Asynchronous Locally Synchronous (GALS) design as a potential solution. The work proposes a design flow to investigate the optimal GALS island size to balance the amount of inter-island communication and the asynchronous communication overhead between GALS islands.

Floorplanning techniques to overcome long latencies between the processor and the
Level-2 cache have been experimented on[121]. The work states that floorplans should aid CMP systems in distinguishing shared and private data accessed by cores. The paper introduces a floorplan topology that partitions shared and private cached data and thus enables fast access to shared data for all processors while preserving the vicinity of private data to each processor.

2.4 Summary

Providing QoS guarantees in on-chip communication networks has been identified as one of the major research problems in NoCs[48]. A Label Switched QoS guaranteeing NoC that retains advantages of both packet switched and circuit switched networks is described in this thesis. LS-NoC sets up communication channels (pipes) between communicating nodes that are independent of existing pipes and contention free at routers. Flow identification algorithms identify contention free and resource rich paths taking into account the existing routes in the NoC. LS-NoC is described in Chapters 5, 6 and 7.

Wire lengths have a significant influence on the latency of interconnects, and hence need to be included in the simulation framework. It is clear from works emphasizing the effects of communication on chip performance that there is a need for a co-design of interconnects, processing elements and memory blocks to fully optimize overall system-on-chip performance. This necessitates a simulation framework which allows a co-simulation of the communicating entities along with ICN simulation. Additionally, to optimize power fully, one also needs to incorporate link-level microarchitectural choices such as pipelining.

A SystemC framework which enables NoC designers to assemble communicating entities along with the ICN, and also allows for exploration of architectural and microarchitectural parameters of the ICN in order to obtain the latency, throughput and power trade-offs, is presented in Chapter 3. Chapter 3 presents results for NoC power by considering the effects of various pipelining configurations, frequency and voltage scaling values. Various traffic generation and distribution models have been used to mimic realistic traffic patterns and activity in NoCs. Trade-off studies in this chapter consider the Energy-Delay
product (of the NoC) as the optimization parameter.

Communication effects need to be accounted for in simulations as communication times have a significant impact on the performance and energy of CMPs. The trade-off between tile size, communication time and energy efficiency is explored in Chapter 4. We explore these trade-offs using a detailed, cycle accurate, multicore simulation framework which includes superscalar processor cores, cache coherent memory hierarchies, on-chip point-to-point communication networks and a detailed interconnect model including pipelining and latency. CACTI[122] cache models are used to estimate the area, energy per access and leakage power of L1 & L2 caches, and values from the SPARC processor[123] are used for processor power estimates.


Chapter 3

Link Microarchitecture Exploration

This chapter presents latency, power and performance trade-offs of microarchitectural parameters of NoCs. The design and implementation of a framework for such a design space exploration study is also presented. The chapter presents the trade-off study on NoCs by varying microarchitectural (e.g. pipelining) and circuit level (e.g. frequency and voltage) parameters. Though the study is done in the context of a packet based NoC, the results are applicable to all NoCs as the parameters considered are low level link parameters which are common to all NoCs.

A SystemC based NoC exploration framework presented here is used to explore the impacts of various architectural and microarchitectural level parameters of NoC elements on the power and performance of the NoC. The framework enables the designer to choose from a variety of architectural options like topology, routing policy, etc., and also allows experimentation with various microarchitectural options for the individual links like length, wire width, pitch, pipelining, supply voltage and frequency. Network-on-Chip design parameters such as topology generation and link pipelining have varying impacts on the throughput of the network, the latency of flits and the power dissipation of the NoC in an SoC. The framework uses Intacte[82] to estimate delay and power based on micro-architecture parameters such as wire length, wire width and activity for a given technology and voltage. The framework also supports a flexible traffic generation and communication model. Latency, power and throughput results using this framework to study a 4x4 CMP are presented. The chapter
presents results on power-performance trade-off studies on Mesh, Torus, Folded Torus, Tree based network and Reduced 2D Torus topologies by varying pipelining in links and frequency and voltage scaling.

The framework is used to study NoC designs of a CMP using different classes of parallel computing benchmarks[6]. Traffic patterns from dense and sparse linear algebra applications are used in this study. The traffic consists of both Request-Response messages (mimicking cache accesses) and One-Way messages. One of the key findings is that the average latency of a link can be reduced by increasing pipeline depth to a certain extent, as it enables link operation at higher link frequencies. There exists an optimum degree of pipelining which minimizes the energy-delay product of the link.

Using frequency scaling experiments we show that switching to a higher degree of link pipelining to achieve higher frequency, instead of adding larger buffers, is advantageous from a power perspective. Two case studies comparing topologies based on throughput are presented. A SystemC based simulation framework containing parameterizable routers, links, traffic generators and sink nodes is used for NoC exploration.

Organization of the Chapter

The rest of the chapter is organized as follows. A few works related to router modelling, design space exploration and power estimation are introduced in Section 3.1. A detailed literature survey of router modelling and design space exploration of NoCs has been presented in Section 2.2 of Chapter 2. The NoC exploration framework used in the trade-off studies is described in Section 3.2. Latency, power and performance trade-offs, and frequency scaling and voltage scaling results for two case studies are presented in Sections 3.3 and 3.4. Section 3.3 presents design exploration results for a 4×4 2-dimensional Mesh, a similar Torus and an equivalent Folded Torus NoC. Section 3.4 presents design exploration results for a 16 node (4×4) Torus, Reduced Torus and a Tree based NoC. Results and findings are summarized in Section 3.5.


3.1 Motivation for a Microarchitectural Exploration Framework

Current research in architectural level exploration of NoCs in SoCs concentrates on understanding the impacts of varying topologies, link and router parameters on the overall throughput, area and power consumption of the system (SoCs and Multicore chips) using suitable traffic models[68]. Work in [69] emphasizes the need for co-design of interconnects, processing elements and memory blocks to understand the effects on overall system characteristics. Simulation tools have been developed to aid designers in ICN space exploration[70][71]. Kogel et al.[70] present a modular exploration framework to capture the performance of point-to-point, shared bus and crossbar topologies. Orion[71] is a power-performance interconnection network simulator that is capable of providing power and performance statistics. The Orion model estimates power consumed by router elements (crossbars, FIFOs and arbiters) by calculating the switching capacitances of individual circuit elements. Most of these tools do not allow for exploration of the various link level options of wire width, pitch, serialization, repeater sizing, pipelining, supply voltage and operating frequency.

NoC exploration tools usually model ICN elements at a higher level abstraction of switches, links and buffers and help in power/performance trade-off studies[86]. Another area of active research is the design of router architectures[124][87] and ICN topologies[34] with varying area/performance trade-offs for general purpose SoCs or to cater to specific applications.

On the other hand, tools exist to separately explore low level link options to various degrees, as in [82], [72] and [73]. Work in [72] explores the use of heterogeneous interconnects optimized for delay, bandwidth or power by varying design parameters such as buffer sizes, wire width and the number of repeaters on the interconnects. Courtay et al.[73] have developed a high-level delay and power estimation tool for link exploration that offers statistics similar to those of Intacte. The tool allows changing architectural level parameters such as different signal coding techniques to analyze the effects on wire delay/power.
Intacte[82] provides a similar capability to explore link level design options and is used in this research.

It is clear from the aforementioned works that there is a need for a co-design of interconnects, processing elements and memory blocks to fully optimize overall system-on-chip performance. This necessitates a simulation framework which allows a co-simulation of the communicating entities along with ICN simulation. Additionally, to fully optimize power, one also needs to incorporate link-level microarchitectural choices such as pipelining. A SystemC framework which enables NoC designers to assemble communicating entities along with the ICN, and also allows exploration of architectural and microarchitectural parameters of the ICN in order to obtain latency, throughput and power trade-offs, has been developed and is presented in Section 3.2.

Further, previous works largely concentrate on router power and do not take into account various link microarchitectural parameters for power and performance trade-off calculations. This chapter presents results for NoC power by considering the effects of various pipelining configurations, frequency and voltage scaling values. Various traffic generation and distribution models have been used to mimic realistic traffic patterns and activity in NoCs. The trade-off studies in this chapter consider the energy-delay product of the NoC as the optimization parameter.

3.2 NoC Microarchitectural Exploration Framework

The NoC exploration framework (Figure 3.1) has been built upon Open Core Protocol-IP models[125] using OSCI SystemC 2.0.1[126] on Linux (2.6.8-24.25-default). The framework contains Router, Link and Processing Element (PE) modules, and each can be customized via various parameters. The NoC modules can be interconnected to form a desired NoC. The PE module represents any communicating entity on the SoC and not just the processing element. We can connect an actual executable model of the entity or some abstract model representing its communication characteristics. For abstract models, we support many different traffic generation and communication patterns. The link module can be used to customize the bit-width of the links as well as the degree of pipelining in the link.


Figure 3.1: Architecture of the SystemC framework.

Figure 3.2: Flow of the ICN exploration framework.

A single run (Figure 3.2) uses these models to run a communication task and outputs data files of message transfer logs. From these log files, one-way and round trip flit latency, throughput and link capacitance activity factors are extracted. Intacte is then used to obtain the final power numbers for different operating frequency and supply voltage options. Table 3.1 summarizes the various parameters that can be varied in the framework.
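As an illustration of this post-processing step, the following minimal sketch parses a hypothetical flit log (one record per received flit: source, destination, injection cycle, reception cycle, payload bits) and derives the average one-way latency and the total throughput. The log layout and field names are assumptions made for illustration; the actual framework's log format may differ.

    // Sketch of the post-processing that follows a simulation run.
    // The record layout assumed here is hypothetical.
    #include <algorithm>
    #include <fstream>
    #include <iostream>
    #include <sstream>
    #include <string>

    int main(int argc, char** argv) {
        if (argc < 3) {
            std::cerr << "usage: extract <flit_log> <clock_freq_hz>\n";
            return 1;
        }
        std::ifstream log(argv[1]);
        const double freq_hz = std::stod(argv[2]);

        long long flits = 0, total_latency_cycles = 0, bits_received = 0;
        long long last_cycle = 0;
        std::string line;
        while (std::getline(log, line)) {
            std::istringstream rec(line);
            int src, dst;
            long long inject_cycle, receive_cycle, bits;
            if (!(rec >> src >> dst >> inject_cycle >> receive_cycle >> bits))
                continue;                               // skip malformed records
            ++flits;
            total_latency_cycles += receive_cycle - inject_cycle;
            bits_received        += bits;
            last_cycle            = std::max(last_cycle, receive_cycle);
        }
        // One-way flit latency in cycles and NoC throughput in bits/s,
        // following the definitions used in this chapter.
        double avg_latency = flits ? double(total_latency_cycles) / flits : 0.0;
        double sim_time_s  = last_cycle / freq_hz;
        double throughput  = sim_time_s > 0 ? bits_received / sim_time_s : 0.0;
        std::cout << "avg one-way latency (cycles): " << avg_latency << "\n"
                  << "throughput (bits/s): " << throughput << "\n";
        return 0;
    }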


Table 3.1: ICN exploration framework parameters.

  Parameter                       Description
  NoC Parameters
    Routing Algorithms            Source Routing and Table based routing
    Switching Policy              Packet, Circuit, Wormhole, VC switching
    Traffic Paradigm              Request-Response & One-Way Traffic
    Traffic Generation Scheme     Deterministic, Self-Similar
    Traffic Distribution Scheme   Deterministic, Uniformly random, HotSpot, Localized, First Matrix Transpose
  Router Microarchitecture
    No. of Input/Output Ports     2-8 (based on topology to be generated)
    Input/Output buffer sizes     Flit-level buffers
    Crossbar switching capacity   In terms of flits (default = 1)
  Link Microarchitecture
    Length of interconnect        Longest link in mm
    Bit width of the interconnect
  Circuit Parameters
    Frequency, Supply Voltage

3.2.1 Traffic Generation and Distribution Models

To test NoCs on realistic multi-core applications we set up traffic generation and distribution to mimic various communication patterns. The traffic models implemented in the Traffic Generator module are Deterministic, Uniformly Random, Localized, Hotspot and First Matrix Transpose traffic. The models differ in how many destination nodes receive flits from a given generator, and how often. Request-Response (RR) and One-Way Traffic (OWT) generation are supported. For example, in multi-core chips the former can correspond to activities like cache line loads and the latter can correspond to cache line write backs. The traffic distribution input is given using two matrices of size N×N, where N is the number of communicating entities. Item (i,j) in a matrix gives the probability of PE i communicating with PE j in the current cycle. The two separate matrices correspond to RR and OWT generation. The probability of choosing between the two matrices depends on a global input that decides the percentage of RR traffic to be generated for the simulation run, as sketched below. This model can be further expanded to capture burst characteristics as well as message size.
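The following sketch illustrates how a traffic generator at PE i could implement this selection each cycle: a Bernoulli draw against the global RR fraction picks the matrix, and row i of that matrix is then sampled for a destination. Only the "row i gives the probability of i sending to j" convention comes from the description above; the helper names and sampling details are illustrative assumptions.

    // Illustrative destination selection from the RR and OWT matrices.
    #include <random>
    #include <vector>

    using Matrix = std::vector<std::vector<double>>;

    // Sample a destination j with probability proportional to prob[i][j].
    int pick_destination(const Matrix& prob, int i, std::mt19937& rng) {
        std::discrete_distribution<int> dist(prob[i].begin(), prob[i].end());
        return dist(rng);
    }

    struct FlitRequest { int src, dst; bool request_response; };

    // One generation decision for PE 'src'. 'rr_fraction' is the global input
    // deciding what percentage of generated traffic is Request-Response.
    // (The per-cycle decision of whether to generate a flit at all is omitted.)
    FlitRequest generate(int src, const Matrix& rr_matrix, const Matrix& owt_matrix,
                         double rr_fraction, std::mt19937& rng) {
        std::bernoulli_distribution is_rr(rr_fraction);
        bool rr = is_rr(rng);
        int dst = pick_destination(rr ? rr_matrix : owt_matrix, src, rng);
        return {src, dst, rr};
    }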


Figure 3.3: Flit header format. DSTID/SRCID: Destination/Source ID, SQ: Sequence Number, RQ & RP: Request and Response Flags, and a 13 bit flit id.

Table 3.2: Traffic Generation/Distribution Model and Experiment Setup for the Mesh, Torus & Folded-Torus case study.

  Parameter                   Values
  Communication Patterns      DLA Traffic        SLA Traffic
  NoCs Simulated              2D Mesh, Torus and Folded Torus
  Localization Factor         0.7                0.5
  Traffic injection rate      20%
  RR Factor                   0.03               0.1
  Size of phit (Wire width)   32 bits
  OW and RR Request Flit      2 phits
  RR Response Flit            6 phits
  Simulation Time             40000 cycles
  Process                     45nm
  Environment                 Linux (2.6.8-24.25-default) + OSCI SystemC 2.0.1 + Matlab 7.4

Flit Header Format - Mesh, Torus & Folded-Torus

The communication packets are broken into a sequence of flit transfers. The flit header format is shown in Figure 3.3. For this case study, table based routing is assumed. The SQ field is used to identify in-order arrival of all flits. Response flits have the first 2 bits set to 11. The SRCID, DSTID and FlitID fields are preserved in the Response flit for the sake of error checking and latency calculations in the framework. The traffic receiver reads the header to determine whether the flit type is RR or not (flag RQ). If RQ is set, the Traffic Generator is notified and the flit header is sent to the Traffic Generator. RR traffic has priority over OWT and hence the request will be immediately serviced without breaking an OW flit.
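For illustration, one possible packing of the header of Figure 3.3 into a 32-bit word is shown below. The field names, the RQ/RP flags and the 13-bit flit id come from the figure; the widths assumed for DSTID, SRCID and SQ (4, 4 and 9 bits, enough for a 16-node NoC) and the field ordering are not fully specified in the text and are chosen only to make the example concrete.

    // One possible layout of the flit header of Figure 3.3 (assumed widths).
    // Note: bit-field packing order is implementation defined in C++.
    #include <cstdint>

    struct FlitHeader {
        uint32_t dst_id  : 4;   // destination PE (16 nodes assumed)
        uint32_t src_id  : 4;   // source PE
        uint32_t sq      : 9;   // sequence number, checks in-order arrival
        uint32_t rq      : 1;   // request flag (set for RR request flits)
        uint32_t rp      : 1;   // response flag (set for RR response flits)
        uint32_t flit_id : 13;  // 13-bit flit identifier
    };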


Figure 3.4: Example flit header formats considered in this experiment. (a) Header used in table based routing. (b) Header used in source routing. (DST/SRCID: Destination/Source ID, HC: Hop Count, CHNx: Direction at hop x).

The Response flit to a request flit has RP set and RQ reset. In a received flit, if RQ is not set, no action is taken. Table 3.2 lists the parameters used in our traffic model and in the experiments. The framework is also capable of generating Deterministic, Uniformly Random, Hotspot and First Matrix Transpose traffic distributions.

Flit Header Format - Torus, Reduced Torus & Tree based NoC

In this case study, the flit header formats are varied based on the type of routing scheme used. The flit header formats for source routing and table based routing are shown in Figure 3.4. The source routing header (Figure 3.4(b)) is larger, as it contains the output port number for every hop the flit has to traverse. The table based routing header (Figure 3.4(a)) contains only the final destination address.

3.2.2 Router Model

The router model is a parameterized, scalable module of a generic router[68]. Router microarchitecture parameters include the number of input/output ports, the sizes of the input/output buffers, the switching capacity of the crossbar (the number of bits that can be transferred from the input to the output buffers in a cycle), etc. (Table 3.1). Flow control is implemented through sideband signals[125]. Example routing algorithms are source and table based routing. Switching policies such as circuit switching, packet switching and wormhole switching have been implemented. Flow control prevents traffic generators from spewing phits into the network after the input buffer fills to a threshold.


The router model has been carefully designed to be easily adapted for use in various topologies (with varying flit header formats as shown in Figures 3.3 and 3.4) with minimal changes.

3.2.3 Power Model

Intacte[82] is used for interconnect delay and power estimates. The design variables for Intacte's interconnect optimization are wire width, wire spacing, repeater size and spacing, degree of pipelining, supply voltage (Vdd) and threshold voltage (Vth). Activity and coupling factors are input to Intacte from the SystemC simulation results. Intacte arrives at a power optimal number of repeaters, their sizes and spacing for a given wire to achieve a desired frequency. The wire width (in bits) is known per simulation run. The tool also includes flop and driver overheads in the power and delay calculations. Intacte outputs the total power dissipated, including short circuit and leakage power values. We arrive at approximate wire lengths using floorplans (Figure 3.17). Minimum wire spacing is obtained from foundry rules. Intacte solves an optimization problem to arrive at the optimal number of repeaters and repeater spacing for a given frequency and voltage. Other physical parameters are obtained from Predictive Technology Models[81] for 65nm and 45nm.

3.3 Case Study: Mesh, Torus & Folded-Torus

In this case study, we study a 4x4 chip multiprocessor (CMP) for three different network topologies - Mesh, Torus and Folded-Torus. We use two communication patterns from [6], the Dense Linear Algebra (DLA) and Sparse Linear Algebra (SLA) benchmarks. DLA applications exhibit highly localized communication. The traffic model for DLA generates 70% of its traffic to immediate neighbors and distributes the remaining traffic uniformly to the other nodes. SLA communication is reproduced using 50% localized traffic, with the rest of the traffic destined to half of the remaining nodes. Further, we assume all RR traffic to be localized. For example, 10% of the traffic generated per PE over the simulation will be of Request type if RR=0.1. All Request flits are destined to immediate neighbors. 70% of the flits generated by any PE over the simulation time are destined to immediate neighbors if the localization factor is 0.7 (as in the case of DLA). A sketch of how such a distribution matrix can be constructed is shown below.
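The sketch builds the matrix for a 4x4 mesh: the localization fraction (0.7 for DLA) is split equally among a PE's immediate neighbours and the remainder is spread uniformly over the other nodes. Only the 70%-to-neighbours rule comes from the text; the construction details are an assumption for illustration.

    // Illustrative DLA-style distribution matrix for a dim x dim mesh.
    #include <cstdlib>
    #include <vector>

    std::vector<std::vector<double>> dla_matrix(int dim = 4, double localization = 0.7) {
        const int n = dim * dim;
        std::vector<std::vector<double>> p(n, std::vector<double>(n, 0.0));
        auto row = [dim](int id) { return id / dim; };
        auto col = [dim](int id) { return id % dim; };
        for (int i = 0; i < n; ++i) {
            std::vector<int> neighbours, others;
            for (int j = 0; j < n; ++j) {
                if (j == i) continue;
                int manhattan = std::abs(row(i) - row(j)) + std::abs(col(i) - col(j));
                (manhattan == 1 ? neighbours : others).push_back(j);  // 1 hop away?
            }
            // 'localization' of PE i's traffic goes to immediate neighbours,
            // the rest is distributed uniformly over the other nodes.
            for (int j : neighbours) p[i][j] = localization / neighbours.size();
            for (int j : others)     p[i][j] = (1.0 - localization) / others.size();
        }
        return p;
    }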


Figure 3.5: Schematic of the 3 compared topologies (L to R: Mesh, Torus, Folded Torus). Routers are shaded and Processing Elements (PE) are not.

Experiments are designed to calculate the latency (clock cycles), throughput (Gigabits/sec) and power (milliWatts) of the various topologies. Table 3.2 lists some of the simulation setup parameters used in the following experiments.

3.3.1 NoC Topologies

In this work we consider three similar topologies for trade-off studies. Router and processing elements are identical in all three topologies. In fact, the same communication trace is played out for all the different ICN parameter explorations. The schematic of the three NoCs is shown in Figure 3.5, with the floorplans largely following the schematics. The floorplans are used to estimate the wire lengths, which are then input to Intacte. Processing element sizes are estimated by scaling down the processor in [123] to 45nm, to be of size 2.25×1.75mm. The routers are of size 0.3×0.3mm. The lengths of the longest links in the Mesh, Torus and Folded Torus are estimated as 2.5mm, 8.15mm and 5.5mm respectively. The longest links in the Torus connect the routers at opposite sides. The routing policy for all topologies is table based. The routing policy was made to keep the worst case latency in check. Where routes had an alternative longer link, the shorter one has been chosen.


Longer links had minimal activity in the experiments. The lengths of the links in each of the topologies and the pipelining factors are illustrated in Table 3.3. The pipelining factor corresponds to the longest link in the NoC. A pipelining factor of 1 means the longest link is unpipelined, P=2 indicates it has a two cycle latency, and so on.

Table 3.3: Links and pipelining details of NoCs

  Topology          Length in mm (no. of links)   Pipelining
  2D Mesh           2.5  (24)                     1 2 3 4 5 6 7 8
                    2.0  (56)                     1 2 3 4 4 5 6 7
  2D Torus          8.15 (8)                      1 2 3 4 5 6 7 8
                    6.65 (8)                      1 2 3 4 4 5 6 7
                    2.5  (24)                     1 1 1 2 2 2 3 3
                    2.0  (56)                     1 1 1 1 2 2 2 2
  Folded 2D Torus   5.5  (16)                     1 2 3 4 5 6 7 8
                    4.5  (16)                     1 2 3 4 5 5 6 7
                    2.75 (16)                     1 1 2 2 3 3 4 4
                    2.25 (16)                     1 1 2 2 3 3 3 4
                    2.0  (32)                     1 1 2 2 2 3 3 3

3.3.2 Round Trip Flit Latency & NoC Throughput

Round trip flit (link-level flow control unit) latency is calculated starting from the injection of the first phit (physical transfer unit in an NoC) to the reception of the last phit. In the case of OW traffic the latency is one way. In the case of RR traffic it is the delay in clock cycles from the beginning of request injection to the completion of response arrival. Communication traces are analyzed using error checking (for phit loss, out-of-order reception, erroneous transit etc.) and latency calculation scripts to ensure functional correctness of the system.

Figure 3.6 shows the effect of increasing traffic rate on the average round trip latency (normalized, in clock cycles) for all three NoCs. As input buffers start filling up, the waiting time of flits increases and hence latencies increase at higher traffic rates. Similar results have been shown in other NoC exploration frameworks[127].

The total throughput of the NoC (in bits/sec) is calculated as the total number of bits received (phits_r × bits_phit) at the sink nodes divided by the total (real) time ((1/f) × sim_cycles) spent (Eqn 3.1):

    Th_total = (phits_r × bits_phit) / ((1/f) × sim_cycles)        (3.1)
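A small numerical illustration of Eqn 3.1 follows. The 32-bit phit width and 40000-cycle simulation length are taken from Table 3.2, while the phit count and clock frequency are made-up example values.

    // Worked example of Eqn 3.1: throughput = bits drained at sinks / wall-clock time.
    #include <iostream>

    int main() {
        const double phits_received = 2.0e5;   // phits drained at all sinks (example value)
        const double bits_per_phit  = 32.0;    // phit width used in this case study
        const double freq_hz        = 1.0e9;   // link/router clock (example value)
        const double sim_cycles     = 40000.0; // simulation length from Table 3.2

        double sim_time_s = sim_cycles / freq_hz;                   // (1/f) * sim_cycles
        double throughput = phits_received * bits_per_phit / sim_time_s;
        std::cout << "Th_total = " << throughput / 1e9 << " Gbit/s\n";  // 160 Gbit/s here
        return 0;
    }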


Figure 3.6: Normalized average round trip latency in cycles vs. traffic injection rate for all 3 NoCs.

The maximum achievable frequency of a wire of a given length is obtained using Intacte (Figure 3.7). The maximum throughput of each NoC running DLA traffic at P=1 is shown in Figure 3.8. The 2D Mesh has the shortest links and the highest achievable frequency, and hence the highest throughput.

Average round trip latencies in nanoseconds over various pipeline configurations in all 3 NoCs are shown in Figure 3.9. The results show that the overall latency of flits actually decreases up to a certain point with pipelining. Average latencies are larger for RR type traffic, which also involves a larger number of phits (2 Request + 6 Response). Clearly, there is a latency advantage in pipelining links in NoCs up to a point. This is because as the number of pipe stages increases, the operating frequency can also be increased, since the length of the wire segment in each pipe stage decreases. Real time latencies do not vary much after pipelining configuration P=5, as the delay of the flops starts to dominate and there is only a marginal increase in frequency.


Figure 3.7: Max. frequency of links in the 3 topologies. The lengths of the longest links in the Mesh, Torus and Folded 2D Torus are 2.5mm, 8.15mm and 5.5mm.

The throughput and latency behaviour for SLA traffic is identical (not shown here).

3.3.3 NoC Power/Performance/Latency Tradeoffs

2D Mesh

Figures 3.10 and 3.11 show the combined normalized results of the NoC power, throughput and latency experiments on a 2D Mesh for DLA and SLA traffic. Throughput and power consumption are lowest at P=1 and highest at P=8. The normalized average round trip flit latency for both OW and RR traffic is shown (the curves overlap). From the graph it is seen that the growth in power makes configurations beyond P=5 less desirable. Link pipelines with P=1, 2 and 3 are also not optimal with respect to latency in both these benchmarks. The rise in throughput also starts to fade as configurations beyond P=6 are used. The optimal point of operation indicated by the results from both communication patterns is P=5. The Energy curve is obtained as the product of the normalized Latency and Power values; the Energy.Delay curve is the product of the Energy and Latency values, as sketched below.
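The bookkeeping behind these curves can be summarised in a few lines. The sketch normalizes per-depth power and latency to the P=1 values, forms Energy and Energy.Delay as defined above, and reports the depth with the minimum energy-delay product. The per-depth numbers below are placeholders, not measured values from the experiments.

    // Normalized Energy and Energy.Delay across pipeline depths (placeholder data).
    #include <cstddef>
    #include <iostream>
    #include <vector>

    int main() {
        // Index k corresponds to pipeline depth P = k+1 (placeholder numbers).
        std::vector<double> power_mw   = {55, 82, 110, 150, 190, 240, 300, 380};
        std::vector<double> latency_ns = {40, 30, 25, 22, 21, 20.5, 20.2, 20};

        std::size_t best = 0;
        double best_ed = 1e300;
        for (std::size_t k = 0; k < power_mw.size(); ++k) {
            double p = power_mw[k]   / power_mw[0];    // normalize to P=1
            double l = latency_ns[k] / latency_ns[0];
            double energy       = p * l;               // Energy = Power x Latency
            double energy_delay = energy * l;          // Energy.Delay = Energy x Latency
            if (energy_delay < best_ed) { best_ed = energy_delay; best = k; }
            std::cout << "P=" << k + 1 << "  E=" << energy
                      << "  E.D=" << energy_delay << "\n";
        }
        std::cout << "energy-delay optimum at P=" << best + 1 << "\n";
        return 0;
    }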


Figure 3.8: Total NoC throughput in the 3 topologies, DLA traffic.

Figure 3.9: Avg. round trip flit latency in the 3 NoCs, DLA traffic.


Figure 3.10: 2D Mesh Power/Throughput/Latency trade-offs for DLA traffic. Normalized results are shown.

Figure 3.11: 2D Mesh Power/Throughput/Latency trade-offs for SLA traffic.


Figure 3.12: DLA Traffic, 2D Torus Power/Throughput/Latency trade-offs. Normalized results are shown.

Energy for communication increases with pipeline depth. Quantitatively, the optimal point of operation is when the longest link has five pipeline segments (P=5). In DLA traffic, the average round trip flit latency of phits in the NoC is 1.23 times the minimum and 32% of the maximum possible. The NoC power consumed is 57% of the maximum and the throughput is 80.5% of the maximum possible value.

2D Torus and Folded 2D Torus

Similar power, throughput and latency trade-off studies are done for both communication patterns on the 2D Torus (Fig. 3.12) and Folded 2D Torus (Fig. 3.13) NoCs. Results obtained in the 2D Torus experiments indicate that the growth in power makes configurations beyond P=5 undesirable. The latencies of phits in pipeline configurations P=1-4 are large. The rise in throughput also starts to fade as configurations beyond P=5 are used. The optimal point of operation indicated by the Energy.Delay curves in both DLA and SLA traffic (not shown here) for the 2D Torus is P=5. In DLA traffic, this configuration shows the power consumed by the NoC is 50% of the value consumed at P=8, and the throughput is 70.5% of the maximum value.


Figure 3.13: DLA Traffic, Folded 2D Torus Power/Throughput/Latency trade-offs. Normalized results are shown.

The average round trip latency of phits for both OW & RR traffic is 1.4 times the minimum and 24% of the maximum (when P=1).

The trade-off curves for the Folded 2D Torus show trends similar to the 2D Torus. The average round trip flit latency reduction and throughput gain after P=6 are not considerable. There is no single optimum obtained from the Energy.Delay curve. Pipeline configurations from P=5 to P=7 present various throughput and energy configurations for approximately the same Energy.Delay product.

3.3.4 Power-Performance Tradeoff With Frequency Scaling

We discuss the combined effects of pipelining links and frequency scaling on the power consumption and throughput of the 3 topologies (Figure 3.5) running DLA traffic. The maximum possible frequency of operation at full supply voltage (1.0V) is determined using Intacte.

Figure 3.14 shows NoC power consumption for the 3 example topologies over a pair of pipelining configurations along with frequency scaling (at Vdd). As observed from the graph, the power consumption of a lower pipeline configuration exceeds the power consumed by a higher configuration after a certain frequency.


Figure 3.14: Frequency scaling on the 3 topologies, DLA Traffic.

Larger buffers (repeaters) are added to push frequencies to the maximum possible value. The power dissipated by these circuit elements starts to outweigh the speed advantage after a certain frequency. We call this the "crossover" frequency. The graph shows an example pair of configurations from each of the topologies to illustrate this fact.

The maximum frequency of operation of the unpipelined longest link in a 2D Mesh (2.5mm) is determined to be 1.71GHz. This maximum throughput point is determined for each pipeline configuration in each topology. The frequency is scaled down from this point and power measurements are made for the NoC activity obtained using the SystemC framework for DLA traffic. At crossover frequencies it is advantageous to switch to higher pipelining configurations to save power and increase throughput. For example, in a 2D Mesh a link frequency of 3.5GHz can be achieved by pipelining configurations of 3 and above. NoC power consumption can be reduced by 54% by switching to a 3 stage pipeline configuration from an 8 stage pipeline configuration. In other words, a desired frequency can be achieved by more than one pipeline configuration. For example, in a 2D Torus a frequency (throughput) of 2.0GHz can be achieved by using pipeline configurations from 4 to 8. NoC power consumption can be reduced by 13.8% by switching from P=8 to P=4 and still achieve similar throughput.


Figure 3.15: Dynamic voltage scaling on the 2D Mesh, DLA Traffic. The frequency scaled curve for P=8 is also shown.

3.3.5 Power-Performance Tradeoff With Voltage and Frequency Scaling

In each topology, the frequency is scaled down from the maximum, the least voltage required to meet the scaled frequency is estimated using Intacte, and the power consumption and throughput results are presented. Voltages are scaled from 1.0V until 0.1GHz is reached for each pipelining configuration in each NoC. Similar to the frequency scaling results, there exists a crossover frequency in a pipelining configuration after which it is power and throughput optimal to switch to a higher pipelining stage (Table 3.4). Figure 3.15 compares the power and throughput values obtained by voltage and frequency scaling with a frequency scaled P=8 curve for the 2D Mesh with DLA traffic. Scaling voltage along with frequency, compared to scaling frequency alone, can result in power savings of up to 14%, 27% and 51% in the cases of P=7, P=5 and P=2 respectively.


Table 3.4: DLA traffic, frequency crossover points in the Mesh, Torus and Folded Torus NoCs.

  Pipe Stages   Trip Frequency (in GHz)
                Mesh    Torus   Folded Torus
  1-2           1.7     0.25    0.45
  2-3           2.96    0.7     1.5
  3-4           3.93    1.1     2.0
  4-5           4.69    2.0     2.76
  5-6           5.31    2.2     3.2
  6-7           5.83    2.8     3.69
  7-8           6.23    3.0     4.07

A comparison of all 3 NoCs is presented in Table 3.5.

Table 3.5: Comparison of the 3 topologies for DLA traffic.

  Topology        Pipe Stages   Power (mW)   Performance (Gbps)
  Mesh            1             55.18        42.82
                  2             109.87       74.12
                  4             250.83       117.44
                  7             464.16       156.00
  Torus           1             27.26        14.67
                  2             45.71        27.89
                  4             97.48        50.78
                  7             206.22       78.33
  Folded Torus    1             28.32        21.03
                  2             55.95        39.31
                  4             119.75       69.11
                  7             287.18       101.91


Table 3.6: Experimental Setup

  Traffic Injection Rate      20%
  Traffic Model               Localized Traffic (6%)
  Framework simulation time   35000 cycles
  Process Technology          65nm
  Models                      PTM[81]
  Frequency of Operation      1 GHz (unless mentioned otherwise)
  Environment                 Linux (2.6.8-24.25-default) + OSCI SystemC 2.0.1

Figure 3.16: Schematic representation of the three compared topologies (L to R: 2D Torus, Tree, Reduced 2D Torus). Shaded rectangles are Routers and white boxes are source/sink Processing Element (PE) nodes.

3.4 Case Study: Torus, Reduced Torus & Tree based NoC

In this case study, experiments are designed to calculate the latency (clock cycles), throughput (Gigabits/sec) and power (milliWatts) of three related topologies - 2D Torus, Reduced 2D Torus and a Tree based NoC. Table 3.6 lists some of the simulation setup parameters used in the following experiments. We did not observe significant variation in the activity factor, and hence in the power and throughput of the NoC, by running the simulation for durations greater than 35000 cycles.


Figure 3.17: Floorplans of the three compared topologies.

Table 3.7: Links and pipelining details of NoCs

  Topology           Length in mm (no. of links)   Pipelining
  2D Torus           7    (8)                      1 2 3 4 5 6 7 8
                     1.5  (88)                     1 1 1 1 2 2 2 2
  Reduced 2D Torus   3.5  (12)                     1 2 3 4 5 6 7 8
                     2.5  (16)                     1 2 3 3 4 5 5 6
                     2.0  (44)                     1 2 2 3 3 3 4 4
  Tree NoC (8+32)    3.5  (8)                      1 2 3 4 5 6 7 8
                     0.75 (32)                     1 1 1 1 2 2 2 2

3.4.1 NoC Topologies

Starting from a 2D Torus, two topologies (a hierarchical star topology and a reduced Torus) containing an equal number of source and sink nodes are derived by removing/reconnecting links. Router and processing elements are identical in all three topologies. The three topologies are shown in Figures 3.16 (schematic) and 3.17 (floorplan). Processing elements (PEs) are assumed to be of size 1.5×1.5mm[128]. Routers are assumed to be 15% of the PE size. The length of the longest link in the 2D Torus is estimated to be 7mm, and in the Reduced 2D Torus and Tree based NoC it is 3.5mm. The routing policy for all topologies is table based. The routing policy was formed keeping in view the latency (in cycles) of packets on the links. Routes were chosen such that, where there was a choice of long and short links, the shorter links were chosen. The lengths of the links in each of the topologies and the pipelining factors are illustrated in Table 3.7.


Figure 3.18: Maximum attainable frequency of the links in the respective topologies. The estimated length of the longest link in the 2D Torus is 7mm. The estimated longest link in the Tree based and Reduced 2D Torus NoCs is 3.5mm.

The pipelining factor corresponds to the longest link in the NoC. A pipelining factor of 1 means the longest link is unpipelined, P=2 indicates it has a two cycle latency, and so on.

3.4.2 NoC Throughput

The throughput of each of the NoC topologies is calculated. A localized traffic generation scheme (each traffic generator sends 6% of its traffic to its immediate neighbors) with self-similar traffic distribution is used. Throughput is a measure of the total data consumption at the sink nodes. The total throughput of the NoC (in bits/sec) is calculated as the total number of bits received (phits_r × bits_phit) at the sink nodes divided by the total (real) time ((1/f) × sim_cycles) spent (Eqn 3.1).

The maximum achievable frequency of a wire of a given length is shown in Figure 3.18. The maximum throughput of each NoC is presented in Figure 3.19. The Tree based NoC supports localized traffic the best (at least two neighbours at one hop distance) and hence shows the highest throughput. Both the Tree NoC and the Reduced 2D Torus show higher throughput because of shorter links, resulting in a higher frequency of operation.


Figure 3.19: Variation of total NoC throughput with varying pipeline stages in all three topologies.

The Reduced 2D Torus has higher throughput than a conventional 2D Torus, as the minimum distance between two neighbours is 1 hop (2 hops in the case of a Torus).

3.4.3 NoC Power/Performance/Latency Tradeoffs

Figure 3.20 shows the combined normalized results of the power, throughput and latency experiments on a 2D Torus. The power consumption of the 2D Torus increases at a higher rate after P=4 due to the insertion of flops in the shorter links (1.5mm) after P=5. Latency is calculated as the real time spent in transit by all phits in the NoC over the complete simulation time. The decrease in latency after P=5 is not considerable, as delays from inserted flops start to dominate the clock cycle time, and after a certain pipeline configuration latencies will increase (not shown here). From the graph it is seen that the growth in power makes configurations beyond P=5 less desirable. Link pipelines with P=1, 2 and 3 are also not optimal when latency is considered. The rise in throughput also starts to fade as configurations beyond P=5 are used. The optimal point of operation indicated by the results is P=4. At this point the same number of flops as P=1, 2 and 3 are used, but the least latency (1.56 times the minimum) is achieved, and power (40% of max) and throughput (64% of max) are at nominal points.


Figure 3.20: 2D Torus Power/Throughput/Latency trade-offs. Normalized results are shown.

The graph also shows the energy (Power × Latency) required for the communication. The energy for communication increases with pipeline depth. However, the energy-delay product reduces initially with increasing pipeline depth and then increases, with a minimum around P=4.

Trade-off results for the Reduced 2D Torus are shown in Figure 3.21. The latency and throughput curves show a similar trend as in the 2D Torus. The latency reduction and throughput gain after P=4 are not considerable. The power optimal point of operation indicated by the results is P=3. At P=3 the latency is 1.6 times the minimum, with power at 49% of max and throughput at 61% of max. There is a shallow minimum in the energy-delay product from P=3-7.

3.4.4 Power-Performance Tradeoff With Frequency Scaling

We discuss the combined effects of pipelining links and frequency scaling on the power consumption and throughput of the three example topologies (Figure 3.16) in this sub-section. The maximum possible frequency of operation at full supply voltage (1.1V) is determined using Intacte.


Figure 3.21: Reduced 2D Torus Power/Throughput/Latency trade-offs. Normalized results are shown.

Figure 3.22 shows the NoC power consumption for the 3 example topologies over various pipelining factors along with frequency scaling. The maximum frequency of operation of the unpipelined longest link in a 2D Torus (we consider 7mm) is determined to be 0.93GHz. This maximum throughput point is determined for each pipeline configuration in each topology. The frequency is scaled down from this point and power measurements are made for the NoC activity obtained using the SystemC framework for Localized traffic with a 20% injection rate and a 6% localization factor. Allowing for some overheads (extra cycles of latency), the frequency of operation required to achieve equivalent throughput with pipelined links is 0.94-0.96GHz. A higher frequency translates to higher throughput (Eqn. 3.1). A pipelining factor of 1 is unpipelined and has a single cycle delay, a factor of 2 means the link has two cycles of delay, and so on. Experiments for each topology show the existence of crossover frequencies after which it is better to switch to a higher degree of pipelining to save power and achieve higher throughput. Larger buffers are required to drive links at higher frequencies. The power consumed by the buffers starts to overshadow the frequency gain at these frequencies. Experiments on the 2D Torus show that a link frequency of 2.5GHz can be achieved by pipelining the link in stages 4 to 7.


Figure 3.22: Variation of NoC power with throughput for each topology.

NoC power consumption can be reduced by 40.97% by switching to a 4 stage pipeline from a 7 stage pipeline. Another interesting result is the effect of larger buffers as the upper limits of frequency are reached within a single pipeline configuration. For instance, at P=3 from 2.3GHz to 2.4GHz, the buffers start to consume almost the same power as a link with P=4.

The sizes of the links of the Reduced 2D Torus are estimated to be 3.5mm, 2.5mm and 2.0mm. NoC power consumption across the various pipeline stages differs by smaller amounts compared to the 2D Torus, as the number of links is smaller (32+16 bidirectional compared to 12+8+16 bidirectional links). Results show that frequencies of 4.22GHz - 4.56GHz can be achieved by both P=4 and P=5 (5% power difference). On the other hand, for a given frequency there exists more than one pipeline configuration with varying power consumption. A frequency of 3.5GHz can be achieved by pipelining into 3 to 7 stages. NoC power consumption can be reduced by 26.6% by switching from P=7 to P=3 and still achieve 3.5GHz. Table 3.8 lists the 'trip' frequencies at which switching from one pipeline configuration to the next higher one becomes power optimal.

The estimated sizes of the links in a Tree based NoC are 3.5mm and 0.75mm. Results for the frequency scaling experiment follow a similar pattern as the previous two configurations (Figure 3.22).


Table 3.8: Power optimal frequency trip points in the various NoCs.

  Pipe Stages   Trip Frequency (in GHz)
                2DT     R2DT    Tree NoC
  1-2           0.93    1.65    1.05
  2-3           1.71    2.75    2.1
  3-4           2.36    3.55    3.05
  4-5           3.4     4.22    4.45
  5-6           3.84    5.13    4.75
  6-7           4.22    5.3     5.13

Figure 3.23: Effects of dynamic voltage scaling on the power and performance of a 2D Torus. The highest frequencies of operation for P=1, 2, 4 and 7 are 0.93GHz, 1.68GHz, 2.92GHz and 4.22GHz. The power consumption of the frequency scaled NoC is shown for comparison.

The differences between the power numbers of the various configurations are the smallest in this network, as this NoC contains the least number of links amongst the three compared (4+16 bidirectional). Trip frequencies are recorded in Table 3.8. A maximum of 21.27% of power can be saved (at f = 3.84GHz) by switching over to P=4 from P=3 after 3.05GHz. On the other hand, 4GHz can be achieved by P=4 to P=7, and the NoC power consumption at P=4 is 76% of the power consumed at P=7.
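The selection rule implied by these trip points can be stated compactly: stay at the shallowest pipeline depth until the target link frequency passes the next trip frequency, then move one stage deeper. A minimal sketch follows, using the 2D Torus column of Table 3.8 as illustrative data; the actual decision in the experiments is made from the measured power curves, not from this simplified rule.

    // Pick a pipeline depth from a list of trip frequencies (Table 3.8 style).
    #include <iostream>
    #include <vector>

    int pipeline_depth_for(double target_ghz, const std::vector<double>& trips_ghz) {
        int depth = 1;
        for (double trip : trips_ghz)
            if (target_ghz > trip) ++depth;   // past the P -> P+1 crossover point
        return depth;
    }

    int main() {
        // Trip frequencies 1-2, 2-3, ..., 6-7 for the 2D Torus (Table 3.8).
        std::vector<double> torus_trips = {0.93, 1.71, 2.36, 3.4, 3.84, 4.22};
        std::cout << pipeline_depth_for(2.5, torus_trips) << "\n";  // prints 4
        return 0;
    }

With these numbers, a 2.5GHz target maps to a 4 stage pipeline, which is consistent with the earlier observation that 2.5GHz is reachable with 4 to 7 stages and that switching to the 4 stage configuration saves power.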


Table 3.9: Comparison of the 3 topologies. Maximum interconnect network performance and power consumption for varying pipe stages.

  Topology           Pipe Stages   Power (mW)   Performance (Gbps)
  2D Torus           1             32.01        31.5
                     2             49.44        53.27
                     4             101.42       115.34
                     7             146.41       268.61
  Reduced 2D Torus   1             49.05        100.2
                     2             91.75        230.95
                     4             142.5        496.25
                     7             181.7        742.27
  Tree Network       1             53.22        52.93
                     2             90.66        99.46
                     4             141.17       191.07
                     7             179.77       307.6

3.4.5 Power-Performance Tradeoff With Voltage and Frequency Scaling

In each of the topologies, the frequency is scaled down from the maximum, the least voltage required to meet the scaled frequency is estimated using Intacte, and the power consumption and throughput results are presented in this section. Voltages are scaled from 1.1V to 0.65V. The NoC parameters are identical to the ones used in Section 3.4.2. Figure 3.23 shows the results of DVS on the 2D Torus network. Similar to the frequency scaling results, there exists a frequency point in a pipelining configuration after which it is power and throughput optimal to switch to a higher pipelining stage. For throughput higher than 90Gbps, P=7 offers the highest power reduction of 21.74%, at 101Gbps. The frequency scaled curve is obtained by scaling only the frequency while the NoC is run at full supply voltage. Scaling voltage along with frequency, compared to scaling frequency alone, can result in power savings of up to 57% and 63% in the cases of P=7 and P=4 respectively. A comparison of all three topologies is presented in Table 3.9.
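A first-order way to see why scaling voltage together with frequency saves more power than frequency scaling alone is the textbook switching-power approximation P ≈ α·C·V²·f. This is only a rough model and is not what Intacte computes (Intacte additionally accounts for repeater sizing, short circuit and leakage power); the activity, capacitance and voltage values in the sketch below are placeholders.

    // First-order dynamic power comparison: frequency scaling vs. V+f scaling.
    #include <iostream>

    double dynamic_power_w(double activity, double cap_f, double vdd, double freq_hz) {
        return activity * cap_f * vdd * vdd * freq_hz;   // P ~ alpha * C * V^2 * f
    }

    int main() {
        const double activity = 0.2, cap_f = 1.0e-9;     // example switched capacitance
        double p_full  = dynamic_power_w(activity, cap_f, 1.1, 4.0e9);
        double p_fscal = dynamic_power_w(activity, cap_f, 1.1, 2.0e9); // frequency scaled only
        double p_dvfs  = dynamic_power_w(activity, cap_f, 0.8, 2.0e9); // assumed lower Vdd at 2 GHz
        std::cout << "full: " << p_full << " W, f-scaled: " << p_fscal
                  << " W, V+f scaled: " << p_dvfs << " W\n";
        return 0;
    }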


3.5 Conclusion

NoC design specifications can be met by varying a large number of system and circuit parameters. An SoC can be better optimized if low level link parameters and architectural parameters such as pipelining, link width, wire pitch, supply voltage, operating frequency, router type, topology of the interconnection network etc. are considered. This chapter presents a simulation framework developed in SystemC that is able to explore NoC designs through all the aforementioned parameters. The framework also allows co-simulation with models of the communicating entities along with the ICN. The interface to the SystemC framework and sample output logs produced are documented in Appendix A.

The study presented in Section 3.3 on a 4x4 multi-core ICN for Mesh, Torus and Folded Torus topologies and the Dense Linear Algebra (DLA) and Sparse Linear Algebra (SLA) benchmarks' communication patterns indicates that there is an optimum degree of pipelining of the links which minimizes the average communication latency. There is also an optimum degree of pipelining which minimizes the energy-delay product. Such an optimum exists because increasing pipelining allows for shorter wire segments, which can be operated either faster or with lower power at the same speed.

We also find that the overall performance of the ICNs is determined by the lengths of the links needed to support the communication patterns. Thus the Mesh seems to perform the best amongst the three topologies considered in this case study.

Another study (Section 3.4) uses 3 example topologies - a 16 node 2D Torus, a Tree network and a Reduced 2D Torus - to show the variation of latency, throughput and NoC power consumption over link pipelining configurations with voltage and frequency scaling. We find that, contrary to intuition, increasing pipeline depth can help reduce latency in absolute time units by allowing shorter links and hence a higher frequency of operation. In a 2D Torus, when the longest link is pipelined with 4 stages, the least latency (1.56 times the minimum) is achieved while power (40% of max) and throughput (64% of max) are nominal. Using frequency scaling experiments, power variations of up to 40%, 26.6% and 24% can be seen in the 2D Torus, Reduced 2D Torus and Tree based NoC between the various pipeline configurations that achieve the same frequency at constant voltage. Also in some cases, we find that switching to a higher pipelining configuration can actually help reduce power, as the links can be designed with smaller repeaters.


Larger NoC power savings can be achieved by voltage scaling along with frequency scaling. Hence it is important to include the link microarchitecture parameters, as well as circuit parameters like supply voltage, during the architecture design exploration of a NoC.

The studies also point to an overall optimization problem that exists in the architecture of the individual PEs versus the overall SoC, since smaller PEs lead to shorter links between PEs but more traffic, thus pointing to the existence of a sweet spot in terms of the PE size.


Chapter 4

Optimal Energy and Performance Configurations using Communication Centric CMP Simulations

On-chip and off-chip communication times have a significant impact on the execution time and the energy efficiency (Instructions per second² per Watt, IPS²/W) of Chip Multiprocessors (CMPs), and need to be accounted for in CMP simulation frameworks. A larger amount of time spent in communication leads to a longer execution time and hence increased losses due to leakage energy. Communication time is a function of the total number of messages to be communicated and their latency. The composition of an individual tile has a significant impact on overall communication time. Larger caches in the tile will reduce cache misses and decrease off-tile messages. However, larger caches also imply larger area and hence longer inter-tile communication link lengths and latencies, thus adversely impacting communication time. There exists a trade-off in the tile size which leads to optimum communication time and hence energy efficiency. This indicates a need for strategies to reduce communication time without increasing tile area.

We explore these trade-offs using a detailed, cycle accurate, multicore simulation framework which includes superscalar processor cores, cache coherent memory hierarchies, on-chip point to point communication networks and a detailed interconnect model including pipelining and latency.


Link latencies are estimated for a 16 core CMP simulation on a framework having superscalar cores, cache coherent memory hierarchies, on-chip point to point communication networks and a NoC model, running a SPLASH2[129] benchmark. Each tile has a single processor, L1 and L2 caches and a router. Different sizes of L1 and L2 lead to different tile clock speeds, tile miss rates and tile area, and hence interconnect latency.

Simulations across a range of sizes of 8KB-256KB for L1 and 64KB-4MB for L2 indicate that there is an optimal size which maximizes energy efficiency, and that this is related to minimizing communication time. Reducing off-chip communication time requires large on-chip caches, which in turn burn more leakage energy. This indicates a need for strategies to reduce communication time without increasing tile area. Experiments across a range of L1/L2 configurations indicate different optima for performance, energy and energy efficiency. Simulation frameworks lacking detailed interconnect models give inaccurate power-performance optimal L1/L2 configurations. Additionally, a clustered interconnection network, communication aware cache bank mapping and thread mapping to physical cores are explored as potential energy saving solutions.

Organization of the Chapter

The need to use detailed interconnection network models to identify optimal energy and performance configurations is motivated in Section 4.1. Section 4.2 highlights the research contributions of this chapter. A brief literature survey on the broad areas of communication centric performance estimation/optimization, system architecture of Network-on-Chips and System-on-Chips, and interconnection network exploration is presented in Section 4.3. A detailed literature survey was presented in Section 2.3.5 of Chapter 2. Section 4.6 describes on-chip and off-chip communication effects on the power and performance of a CMP using the FFT benchmark as a case study. Section 4.4 builds the relationship between energy efficiency and communication time and states the optimization problem. Section 4.5 illustrates the experimental setup and methodology. Experimental results analyzing the effects of on-chip and off-chip communication are presented in Section 4.7.


Effects of communication on program execution times and program execution energy are presented in Section 4.8. Energy-performance results for various L1/L2 configurations, and the effect of custom L2 bank mapping and thread mapping on the power and performance of a multicore chip, are presented in Section 4.9. The chapter concludes in Section 4.10.

4.1 Motivation

Transistor scaling and higher transistor densities have enabled the semiconductor industry to implement large scale chip multiprocessors[123][130][131][132]. As on-chip wire delays continue to grow, intra-chip latencies are now a significant parameter in the total performance of a CMP[120][133].

The work presented in this chapter analyzes communication as the primary parameter in CMPs and highlights the importance of including link latencies in cycle accurate simulation frameworks. We establish the relationship between Program Completion Time (PCT) and Communication Time in a CMP over various runs of a parallel application. Communication in a CMP is either inter-tile (on-chip) or off-chip. On-chip (or intra-chip) communication is the communication between tiles in the CMP due to off-tile L2 accesses. Off-chip communication consists of accesses to DRAM due to L2 misses in the CMP.

Experiments with the FFT benchmark on a 4×4 CMP show that performance estimates are off by up to 17% if detailed interconnect models are not included in CMP simulations (Figure 4.1). The ideal interconnect measurements were taken assuming single cycle interconnect latency. The FFT benchmark was run to completion using various L1 (8K - 256K) and L2 (64K - 4M) cache sizes. Errors are larger for smaller L1 and larger L2 caches. In these cases, the miss rate, and hence off-core communication, is the largest. Large L2 caches increase the area of individual cores, which in turn increases link latencies. Communication time during benchmark execution is affected by the delays of the wires between tiles, which in turn depend on individual tile sizes. Larger tiles accommodate more on-chip memory, increasing link lengths and hence communication time. The length of an on-chip inter-tile link depends on the size of the tiles in the CMP. A smaller tile area results in shorter wires and lower individual link latency. L1 and L2 caches in smaller tiles incur more misses, increasing the time spent in communication.


Figure 4.1: Error in performance measurement between the real and ideal interconnect experiments.

Hence there exists an optimum cache size which minimizes the overall Inter Tile Communication Time. Increasing cache size also impacts Energy, as both the dynamic and leakage power increase. However, a reduction in PCT helps reduce leakage energy (the product of leakage power and PCT), and hence one can again expect an optimum for the Program Energy Efficiency.

Time spent in communication is also a function of the communication latency between a core running a process and the core containing the L2 bank to be accessed. Identifying frequently communicating processes and the most accessed L2 banks, and mapping those processes and L2 banks to the same or neighbouring cores, offers communication time savings in a shared, distributed L2 cache with an S-NUCA policy. This potentially decreases the overall time spent in transit and hence increases performance and program energy efficiency. Tile placement (floorplan) strategies and thread mapping algorithms also play a role in the final performance index. Time spent in communication influences the Total Energy consumed during execution of a program. This relationship between power, performance, energy efficiency and time spent in communication is explored in this chapter.
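The energy-efficiency metric used in this chapter, IPS²/W, can be computed directly from a run's instruction count, completion time and total energy, as in the minimal sketch below; the input numbers are placeholders rather than simulation results.

    // Minimal sketch of the IPS^2/W energy-efficiency metric (placeholder inputs).
    #include <iostream>

    int main() {
        const double instructions = 5.0e9;   // instructions retired (example value)
        const double pct_s        = 2.0;     // program completion time in seconds (example)
        const double energy_j     = 40.0;    // total energy over the run (example)

        double ips        = instructions / pct_s;
        double avg_power  = energy_j / pct_s;       // W
        double efficiency = ips * ips / avg_power;  // IPS^2 / W
        std::cout << "IPS^2/W = " << efficiency << "\n";
        return 0;
    }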


4.2 Observations and Contributions

The research contributions and observations of the current chapter are highlighted here.

• Time spent in communication influences the total execution time of a program in CMPs. The current work analyzes on-chip message transit times and off-chip DRAM access times to establish the relationship between time spent in communication and the program completion time.

• On-chip transit and DRAM access times track the execution time of a process. CMP configurations operating at higher frequencies (lower cycle time) have a performance edge over others when the communication time is comparatively similar.

• Communication time is influenced by the floorplanning of the CMP. Communication aware floorplanning can reduce the energy spent in executing an instruction by up to 2.6% and save up to 11% of the communication power during the execution of the program.

4.3 Background

The tile area optimization problem is closely knit with interconnect, cache and processor architecture exploration. It is clear from works like [69][106] that there is a need for a co-design of interconnects, processing elements and memory blocks to fully optimize overall multi-core chip performance. This necessitates a simulation framework which allows co-simulation of processor cores, a detailed cache memory hierarchy and an on-chip network, along with a low-level interconnect model.

The effect of communication delays on the power and performance of CMPs has been a subject of interest in recent years. Mitigating communication delays through compiler techniques and micro-architecture has been looked at. Strided prefetching[106] has been compared with block migration and on-chip transmission lines as a way to manage on-chip wire delay in CMP caches and improve performance. Instruction steering[112] and instruction replication[113] coupled with clustering have been researched as effective techniques to reduce the impact of wire delays.


Instruction steering[112] and instruction replication[113] coupled with clustering have been researched as effective techniques to reduce the impact of wire delays. Data transfer on long-latency wires can be reduced by value prediction[114] and cache line replication[115][116] techniques. Communication energy and delay can be minimized by migrating frequently accessed cache lines to banks closest to the accessing processors[109][111]. Scalable micro-architectural techniques to reduce the impact of wire delay have also been studied[119][120]. Floorplanning techniques to overcome long latencies between the processor and the Level-2 cache have been experimented with[121].

Many separate tools have been developed for interconnection network (ICN) design space exploration[70][71]. Most of these tools model the ICN elements at a higher level of abstraction (switches, links and buffers) and help in power/performance trade-off studies[86]. They are used to research the design of router architectures[124][87] and ICN topologies[34] with varying area/performance trade-offs for general purpose SoCs or to cater to specific applications. Kogel et al.[70] present a modular exploration framework to capture the performance of point-to-point, shared bus and crossbar topologies. The impact of varying topologies, link and router parameters on the overall throughput, area and power consumption of the system (SoCs and multicore chips) using relevant traffic models is discussed in [68]. Orion[71] is a power-performance interconnection network simulator capable of providing power and performance statistics. The Orion model estimates the power consumed by router elements by calculating the switching capacitances of individual circuit elements. A tool like Ruby[97] allows one to simulate a complete distributed memory hierarchy with an on-chip network, as in Orion. However, it needs to be augmented with a detailed interconnect model which accounts for the physical area of the tiles and their placements. Wire lengths have a significant influence on the latency of the interconnect, and hence need to be included in the simulation framework. Separate wire exploration tools as in [72], [73] and [75] give an estimate of the delay of a wire for a particular wire length and operating frequency.

The Sapphire[102] framework used in this work integrates SESC, Ruby, Intacte and DRAMSim[104].


Figure 4.2: Schematic of a multiprocessor architecture comprising tiles and an interconnecting network. Each tile is made up of a processor, L1 and L2 caches.

Sapphire enables cycle-accurate simulations of a multi-core chip having a distributed memory hierarchy and an on-chip network, with interconnect latencies that are consistent with the physical sizes and placement of the cores. CACTI[122] cache models are used to estimate the area, energy per access and leakage power of the L1 and L2 caches, and values from the SPARC processor[123] are used for processor power estimates.

4.4 Communication Time and Energy Efficiency

Consider a tiled multi-core chip as shown in Figure 4.2. Each tile contains a processor, private instruction and data L1 caches, a shared, distributed L2 cache and a router for on-chip communication. The L2 banks are distributed in the CMP using the S-NUCA policy. The mapping of data into L2 banks is predetermined based on the address, and a given datum can reside in only one bank of the L2 cache. The tiles are interconnected via a network, which is usually a mesh or 2D-torus for large networks.

Data to be accessed by a program in such a CMP is present in one of the following sources: the private (on-tile) L1, the local (on-tile) L2, an (off-tile) remote L2 or the off-chip DRAM. If P_{L1}, P_{l.L2}, P_{r.L2} and P_{dram} are the probabilities of finding the required data in L1, local L2, remote L2 and off-chip DRAM respectively, then

P_{L1} + P_{l.L2} + P_{r.L2} + P_{dram} = 1
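As a quick illustration of this decomposition, the following sketch (a hypothetical Python helper, not part of the Sapphire tool flow; the function and variable names are invented for this example) derives the four probabilities from raw per-level access counts and checks that they sum to one.

def access_probabilities(n_l1, n_local_l2, n_remote_l2, n_dram):
    """Derive P_L1, P_l.L2, P_r.L2 and P_dram from raw access counts.

    Each argument is the number of accesses satisfied by that level
    (e.g. counts reported by a cache/DRAM simulator for one thread).
    """
    total = n_l1 + n_local_l2 + n_remote_l2 + n_dram
    p = {
        "P_L1":   n_l1 / total,
        "P_l.L2": n_local_l2 / total,
        "P_r.L2": n_remote_l2 / total,
        "P_dram": n_dram / total,
    }
    # The four sources are exhaustive and mutually exclusive,
    # so the probabilities must sum to 1.
    assert abs(sum(p.values()) - 1.0) < 1e-12
    return p

# Example with made-up counts: most accesses hit in L1.
print(access_probabilities(9_000_000, 600_000, 350_000, 50_000))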


Tile placement strategies and process scheduling policies in CMPs also influence link latencies. The current work asserts the importance of T_comm in program execution and examines the effects of T_comm on the power, performance and energy efficiency of a CMP, using the variables L1, L2, tile placement strategies and non-conventional process scheduling. The metric used for measuring the performance of a CMP is the Energy × Delay product (ED). Instructions per second^2 per Watt (IPS^2/W), which has the same dimensions as 1/ED, is used as the performance metric to compare the various CMP configurations in the current work.

Processes (threads) of a program access the private L1, local or remote L2 cache banks, or off-chip DRAM during the course of program execution in a CMP. During its execution, a process generates an address sequence of data to be accessed. The generated address sequence is independent of the core's L1/L2 size. Let the address sequence generated by the n-th process, containing k addresses, be A^n_k:

{A^n_k},  n = 0, ..., N_T − 1

N_T is the total number of processes (or threads) the executing program is parallelized into. N_c is the number of cores in the CMP. The core id C^n_k, where the L2 banks of the required k data addresses of the n-th process reside, can be identified using the following notation:

C^n_k = A^n_k % N_c    (4.1)

C^n_k is the set of all local and remote L2 accesses (including hits and misses) of the n-th process during the execution of a program in a CMP. The Logical Communication Pattern of a program running N_T threads on the CMP is the set of all memory accesses over all the processes. Hence the Logical Communication Pattern (LCP) is

LCP = { {C^0_k}, ..., {C^{N_T − 1}_k} }    (4.2)

Each of the N_T threads of a program is mapped to an available processor. A new thread is assigned to the first available core in the CMP.
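To make the relationship between the two metrics introduced above concrete, the sketch below computes ED and IPS^2/W for a single run and checks that, for a fixed instruction count, IPS^2/W equals instructions^2/ED, i.e. it behaves as 1/ED. The numbers are illustrative placeholders, not measurements from this work.

def energy_delay_product(power_watts, exec_time_s):
    """ED = E_exec * T_exec, with E_exec = power * time."""
    energy = power_watts * exec_time_s
    return energy * exec_time_s

def ips2_per_watt(instructions, power_watts, exec_time_s):
    """IPS^2/W for a run that executes `instructions` in `exec_time_s`."""
    ips = instructions / exec_time_s
    return ips ** 2 / power_watts

# Placeholder numbers for one hypothetical configuration.
instructions = 45e6          # instructions executed
power        = 20.0          # total chip power in W
time_s       = 0.028         # program completion time in s

ed = energy_delay_product(power, time_s)
m  = ips2_per_watt(instructions, power, time_s)

# For a fixed instruction count, IPS^2/W = instructions^2 / ED,
# i.e. it is proportional to 1/ED.
assert abs(m - instructions ** 2 / ed) < 1e-6 * m
print(f"ED = {ed:.3e} J.s, IPS^2/W = {m:.3e}")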


Mathematically, the logical mapping of N_T threads to physical cores can be written as the mapping function M_T:

M_T : {T_0, ..., T_{N_T − 1}} → {P_{T_0}, ..., P_{T_{k−1}}}    (4.3)

Assuming that the L2 size in each of the N_c cores is constant, let N_{L2} be the number of L2 cache lines per core. The L2 bank, residing in the core C^n_k, to be accessed by the address A^n_k is a function of the number of L2 cache lines per core and the number of executing threads. The higher bits of the requested address A^n_k are used to select the L2 bank (using N_{L2}) and the core id is deduced using the number of executing threads N_T. The core id expression in Eqn. 4.1 can be refined as:

C^n_k ⇒ {(A^n_k / N_{L2}) % N_T}

In a CMP with L2 banks distributed according to the S-NUCA policy, the above distribution of L2 banks results in L2 Bank 0 assigned to Core 0, ..., L2 Bank N_c − 1 assigned to Core N_c − 1, L2 Bank N_c assigned to Core 0, and so on. This is the logical mapping of L2 bank ids to core ids. Given the flexibility, the data belonging to these L2 banks can be assigned to any available physical core. Assuming an equal number of cores (N_c) and L2 banks, this mapping (M_C) of L2 banks to physical cores is represented as

M_C : {C^n_0, ..., C^n_{k−1}} → {P_{C^n_0}, ..., P_{C^n_{k−1}}}

The logical communication pattern equation in Eqn. 4.2 can be extended to a Physical Communication Pattern. The Physical Communication Pattern (P_{A^n_k}) is the sequence of physical addresses generated from the logical address sequence A^n_k:

{A^0_k, ..., A^{N_T − 1}_k} ⇒ {P_{A^0_k}, ..., P_{A^{N_T − 1}_k}}
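A minimal sketch of this logical mapping is shown below, assuming a 64-byte cache line (Table 4.1). The helper names are invented for illustration, and the mapping is a simplification of the S-NUCA address decoding rather than the exact hardware function; the modulus is written in terms of the core count, which the text expresses via the thread count when the two are equal.

LINE_SIZE = 64          # bytes per cache line (Table 4.1)

def l2_home_core(addr, n_l2_lines_per_core, n_cores):
    """Logical S-NUCA home core of an address.

    Follows the refined core-id expression above: the high-order part of
    the line address selects the L2 bank, and banks are striped across
    cores round-robin (bank 0 -> core 0, ..., bank N_c -> core 0, ...).
    """
    line_addr = addr // LINE_SIZE
    bank = line_addr // n_l2_lines_per_core
    return bank % n_cores

def logical_communication_pattern(address_seqs, n_l2_lines_per_core, n_cores):
    """Set of home cores touched by each thread's address sequence (cf. Eqn. 4.2)."""
    return [
        {l2_home_core(a, n_l2_lines_per_core, n_cores) for a in seq}
        for seq in address_seqs
    ]

# Tiny example: two threads, 16 cores, 4096 L2 lines per core.
seqs = [[0x0000, 0x40000, 0x80000], [0x100000, 0x140000]]
print(logical_communication_pattern(seqs, 4096, 16))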


The Average Memory Access Time (T_mem) is the average of the access times of all L1 (N_l1), local and remote L2 (N_l.l2 and N_r.l2) and DRAM accesses (N_dram) by a thread executing in the CMP. T_l1 is the average L1 access time over the execution of the thread. T_l.l2 and T^n_r.l2 are the average local and remote L2 access times for the n-th thread. P^n_l1 is the L1 miss probability. P^n_l.L2 and P^n_r.L2 are the probabilities of the address missing in the L2 local to the core running the process and in the remote L2 core. T_net is the average network transit time to the DRAM controller. T_dram is the average DRAM access time. The average memory access time (T_mem) is expressed mathematically as:

T^n_mem = [ T_l1 + P^n_l1 T_l.l2 + (P^n_r.l2 × 2.T_r.l2) + P^n_l.l2 . P^n_r.l2 (2.T_net + T_dram) ] × N_mem    (4.4)

where the total number of memory accesses is N_mem = N_l1 + N_l.l2 + N_r.l2 + N_dram. In an S-NUCA setup, if the access to L1 is a miss (P^n_l1), then the data exists in either the local L2 or the remote L2. A time penalty (T_l.l2) is incurred on an L2 access. If the address resolves to a remote L2 (P^n_r.l2), an additional round-trip network access time is incurred (2.T_r.l2). If the required data misses in L2, both local and remote (P^n_l.l2 . P^n_r.l2), an off-chip DRAM access is required. The off-chip access incurs a round-trip delay to the DRAM controller (2.T_net) and the penalty of a DRAM access (T_dram).

The total execution time (T_exec) of a program with N_T parallel processes in a CMP is the execution time of the thread that spends the longest in execution:

T_exec = Max{ T^n_exec },  n = 0, ..., N_T − 1    (4.5)

The time spent in network transit (T_tran) over L2 accesses is the sum of the times spent in local L2 accesses (T_l.l2) and round-trip remote L2 accesses over the network (T_r.l2) by all the N_T processes:

T_tran = Σ_{n=0}^{N_T − 1} { P^n_l1 × N_mem . [ 2.T_l.l2 + P^n_l.L2 × 2.T_r.l2 ] + P^n_l.l2 . P^n_r.l2 [ 2.T_net ] }    (4.6)
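The sketch below is a direct transliteration of Eqn. 4.4 (as reconstructed above) for one thread, with placeholder latencies and probabilities rather than values measured in this work; Eqns. 4.5 – 4.7 aggregate the same quantities over threads in the same manner.

def avg_memory_access_time(t_l1, t_l_l2, t_r_l2, t_net, t_dram,
                           p_l1_miss, p_l_l2_miss, p_r_l2, n_mem):
    """Total memory access time of one thread, per Eqn. 4.4.

    p_l1_miss  : probability an access misses in L1
    p_r_l2     : probability the address resolves to a remote L2 bank
    p_l_l2_miss: probability the access also misses in (local) L2
    n_mem      : total number of memory accesses issued by the thread
    """
    per_access = (t_l1
                  + p_l1_miss * t_l_l2                          # L2 lookup on an L1 miss
                  + p_r_l2 * 2 * t_r_l2                         # round trip to a remote L2 bank
                  + p_l_l2_miss * p_r_l2 * (2 * t_net + t_dram))  # off-chip DRAM access
    return per_access * n_mem

# Placeholder latencies (ns) and probabilities, for illustration only.
print(avg_memory_access_time(t_l1=0.61, t_l_l2=1.0, t_r_l2=6.0,
                             t_net=10.0, t_dram=50.0,
                             p_l1_miss=0.05, p_l_l2_miss=0.02,
                             p_r_l2=0.6, n_mem=10_000_000))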


The time spent in off-chip DRAM transit, each access incurring a round-trip delay of 2.T_net to the DRAM controller, summed over all processes, is:

T_dram = Σ_{n=0}^{N_T − 1} { P^n_l1 . P^n_l.l2 . P^n_r.l2 × N_mem } . T_net    (4.7)

The energy to execute the program (E_exec) is given as:

E_exec = (P_dyn + P_leak) × T_exec    (4.8)

P_dyn is the dynamic power and P_leak is the leakage power spent during the execution of the program.

The Energy-Delay (ED) product of the executed program is the product of the total energy spent and the time taken for the execution to complete:

ED = T_exec × E_exec

ED is also a function of the following parameters:

ED = f(A^n_k, M_C, M_T, L1, L2, f, NoC, DRAM)

The Energy × Delay product is a function of the memory access patterns (A^n_k) of the processes in a CMP. This pattern dictates the data, instruction and coherence communication inside the CMP and the DRAM accesses outside it. A^n_k in turn depends on the number of processes a program is parallelized into. Assigning a single core to a process for execution is Thread Mapping (M_T). The method of allocating L2 banks to physical cores is L2 cache mapping (M_C). L1 and L2 are the sizes of the caches per tile. NoC is the interconnection network connecting the various tiles in the CMP; it encapsulates the topology, flit sizes, wire widths, routing protocols and other design parameters of the interconnection network. DRAM encapsulates DRAM parameters such as size, operating frequency, access time, off-chip access delay, number of rows and columns, and so on.


If T_exec(L1,L2) is the time taken for the benchmark to complete execution on a CMP with each tile containing a private Level 1 cache of size L1 and a distributed, shared Level 2 cache of size L2, ED can be written as:

ED_{L1,L2} = E_exec(L1,L2) × T_exec(L1,L2)

The optimal design is the one with the least ED product. The optimization problem can be formulated as:

ED_opt = min{ ED_{L1,L2} (∀L1, ∀L2) }

where ED_opt is the CMP configuration with the least ED product over the various L1 and L2 sizes. Consider N_NoC tile placement (floorplanning), N_cm cache mapping and N_tm thread mapping strategies available during the design of the CMP. The optimal design is the one with the least ED product over all M_C and M_T, over all L1 and L2 cache sizes, and over the various floorplanning techniques. This optimization problem can be formulated as:

ED_best = min{ ED_opt_1, ..., ED_opt_n },  ∀n

A compact sketch of this sweep over the (L1, L2) grid is given at the end of this section.

The exploration work carried out in this chapter uses the parameters L1, L2, f, M_C and M_T in varying degrees. The L1 size per tile is varied from 8KB – 256KB in powers of 2, and L2 sizes per tile are varied from 64K – 4M. M_C and M_T are discussed in the alternative tile placement and thread scheduling section (Section 4.9). Four tile placement and process scheduling strategies are considered, starting from a conventional 2D mesh as the base design. The operating frequency of the CMP is decided based on the access time of L1. Single-cycle L1 access is assumed and hence 6 different operating frequencies are used in the exploration experiments (1.64GHz – 1.38GHz). L1 and L2 cache parameters (access time, area) are obtained from Cacti[134][135]. A CMP with a 4×4 2D-mesh NoC connecting 16 cores is the example used throughout the chapter. A 667MHz DDR2 DRAM is used in the current work. More DRAM parameters are tabulated in Table 4.1. The FFT SPLASH2 benchmark is used as the case study for power/performance evaluation of CMPs. The next subsection analyzes the effect of link latencies on the performance of a CMP.
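The sketch below illustrates the (L1, L2) sweep behind ED_opt. It assumes a hypothetical run_config callback returning total power and completion time for one configuration (in the actual flow these numbers come from the Sapphire simulations described in Section 4.5); the stand-in model at the bottom exists only so the example runs.

L1_SIZES_KB = [8, 16, 32, 64, 128, 256]
L2_SIZES_KB = [64, 128, 256, 512, 1024, 2048, 4096]

def best_ed_config(run_config):
    """Sweep the (L1, L2) grid and return the configuration with minimum ED.

    `run_config(l1_kb, l2_kb)` is assumed to return
    (total_power_watts, exec_time_seconds) for that configuration.
    """
    best = None
    for l1 in L1_SIZES_KB:
        for l2 in L2_SIZES_KB:
            power, t_exec = run_config(l1, l2)
            ed = (power * t_exec) * t_exec      # E_exec * T_exec
            if best is None or ed < best[0]:
                best = (ed, l1, l2)
    return best

# Stand-in model so the sketch runs; it merely favours mid-sized caches.
def fake_run(l1, l2):
    t = 0.03 - 0.001 * (l1 >= 16) - 0.002 * (l2 >= 256) + 0.000001 * l2
    return 20.0 + 0.01 * l1 + 0.002 * l2, t

print(best_ed_config(fake_run))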


4.5 Experimental Setup

Table 4.1 lists the Sapphire framework[102] settings used for the experiments. Each tile consists of a processor, its private L1 cache and a section of the L2 shared cache. L1 cache sizes are varied from 8 KB to 256 KB in powers of 2. The sizes of L1-I and L1-D are set to the same value in all experiments. The unified L2 cache size in each tile varies from 64KB to 4 MB. 16 tiles are interconnected using a 4×4 2D mesh. The Garnet Flexible Network model in Ruby provides an abstraction of all interconnection network models, while allowing the router pipeline to be flexibly adjusted; this model is used in all experiments. Network interfaces at nodes and outgoing buffers at the routers were monitored to calculate activity and coupling factors on the links. The router in the NoC routes flits using the deterministic XY routing protocol and uses credit-based VC flow control.

Intacte was used to compute the interconnect delay and power. Intacte operates on a number of design variables such as wire width, wire spacing, repeater size and spacing, degree of pipelining, supply voltage (V_dd) and threshold voltage (V_th). Activity and coupling factors, which are inputs to Intacte, are obtained from the Sapphire simulations. Intacte determines the most power-optimal configuration of the interconnect by iterating over the parameter space of repeater sizes and spacings for a given wire to achieve a desired frequency. The tool also includes flop and driver overheads in its power and delay calculations. Intacte outputs the total power dissipated, including short-circuit and leakage power values.

DRAMsim[104] is a detailed and highly configurable C-based memory system simulator that implements detailed timing models for a variety of existing memories, including SDRAM, DDR, DDR2, DRDRAM and FB-DIMM. It also models the power consumption of SDRAM and its derivatives. It can be used as a stand-alone simulator or as part of a more comprehensive system-level model. DRAMSim is integrated into the Sapphire framework[102].

4.5.1 Experimental Methodology

Figure 4.3 depicts the steps followed in the experiments.


Table 4.1: Configuration parameters of processors, caches & interconnection network used in experiments.

Processor:
  Technology                 32nm (PTM)
  Processor Frequency        Equal to L1 frequency
  Processor Power            Scaled from SPARC[123]
  No. of Processors          16
  Tile Configuration         1 processor, private L1 I/D cache & shared L2 cache

L1 & L2 Caches (common):
  Line Size                  64 bytes
  Associativity              4
  Power Model                CACTI([134],[135])

L1 Cache:
  Size per Tile              8, 16, 32, 64, 128, 256 KB
  No. of Banks               2
  Frequency                  Estimated using CACTI
  Access Time                Single cycle

L2 Cache:
  Size per Tile              64KB, 128KB, 256KB, 512KB, 1, 2, 4 MB
  No. of Banks               4

DRAM:
  Type                       DDR2
  Frequency                  667MHz
  Bank Count                 8
  Refresh Time               64ms
  Number of Rows             2^14
  Number of Columns          2^10
  Power Model                DRAMSim[104]

Interconnection Network:
  Model                      Garnet[136]
  Topology                   4×4 2D Mesh
  Flit Size                  32 bits
  Routing Protocol           Deterministic XY routing
  Router Pipeline            4 stages
  Virtual Channels           4 per port
  Flow Control               Credit-based VC
  Link Power Model           Intacte[75]


Figure 4.3: Flowchart illustrating the steps in the experimental procedure.

• L1 & L2 (per-tile) sizes are input to CACTI to obtain the access time, energy per access, leakage power and area for each L1/L2 cache size. Energy per access, access time and the total number of accesses per cache are combined to estimate the power consumed by the L1 cache in the final stages of the experiment. L1 sizes range from 8K – 256K and L2 sizes range from 64K – 4M. Other CACTI input parameters are tabulated in Table 4.1. The processor frequency, type of DRAM used, channel width and count, clock granularity, number of rows and columns, and refresh time are input to DRAMSim as parameters.

• An individual tile contains a processor, a private L1 and a shared L2 cache. The processor area is estimated by scaling down the area of a SPARC processor[123] (fabricated in a 65nm process) to 32nm. The area of the processor logic used in the floorplans is 2.4515 × 1.9989 mm^2. The floorplan of a single tile is drawn using the cache areas obtained from CACTI and this processor area. Example floorplans are shown in Figure 4.4.

• 16 individual tiles are arranged in a 4×4 2D mesh network and the lengths of the interconnect links are estimated from the complete CMP floorplan. The lengths of the links between L1 and the router, and between L2 and the router, within each tile are also estimated. Tiles are re-arranged for other sets of experiments and the link lengths are recalculated. The three interconnection networks used in the current work are shown in Figure 4.5.


• The processor is assumed to run at the same frequency as the L1 cache. Using the frequency of operation and the length of the link, the power-optimal configuration of the link is obtained using Intacte. The power-optimal link configuration is obtained by varying the number and sizes of repeaters, among other parameters, on the link. The link latency obtained for each link is input into the interconnect network configuration files of Sapphire. The link lengths for the various cache sizes are tabulated in Table 4.5. The access latency of the L2 cache in cycles is determined from the access time (in ns) obtained from CACTI divided by the L1 access time, i.e. multiplied by the operating frequency (see the sketch after this list):

Latency_L2 = AccessTime × Frequency

The L2 cache access latency is copied into the Ruby configuration file.

• Sapphire is used to run the FFT benchmark to completion. The experiments are repeated for various configurations of L1 cache, L2 cache and flit sizes (Table 4.1). Traffic monitored on the interconnect links is used to estimate activity and coupling factors. Cache statistics output from Ruby are used to calculate miss rates and numbers of accesses. Cache access counts are used to estimate cache power. The benchmark completion time, combined with the number of cycles and the frequency of operation, is used to tabulate IPC, program completion time and instructions-per-second results per benchmark per L1/L2 size.

• Link activity and coupling factors are input to Intacte to obtain the power spent in communication over the benchmark execution. Processor power is estimated by scaling down from values published in [123] for a 65nm SPARC processor. Processor powers used over the various L1 configurations are tabulated in Table 4.2. The total power spent during the benchmark execution is the sum of the power spent in the processor, the memory hierarchy (L1 & L2 caches) and the interconnect links. The energy spent in executing the benchmark is obtained from the total power spent and the program completion time. Energy per instruction and IPS^2/W are also calculated. Results from these experiments using various benchmarks are illustrated in Section 4.9.
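The latency and cache-power bookkeeping in the bullets above reduces to a pair of one-line conversions, sketched below with example values from Tables 4.3 and 4.4 (a 64KB L2 with 0.93 ns access time at 1.64 GHz, 0.162 nJ per access, 0.057 mW leakage). The function names and the rounding-up of the latency are assumptions for illustration, not the exact scripts used in this work.

import math

def l2_latency_cycles(l2_access_time_ns, freq_ghz):
    """L2 access latency in cycles: AccessTime x Frequency, rounded up."""
    return math.ceil(l2_access_time_ns * freq_ghz)

def cache_power_watts(n_accesses, energy_per_access_nj, exec_time_s, leakage_w):
    """Average cache power = dynamic energy over the run / run time + leakage."""
    dynamic_j = n_accesses * energy_per_access_nj * 1e-9
    return dynamic_j / exec_time_s + leakage_w

# 64KB L2 at 1.64 GHz (Tables 4.3 and 4.4): 0.93 ns -> 2 cycles.
print(l2_latency_cycles(0.93, 1.64))
# Illustrative power estimate for that L2 bank over a 0.03 s run.
print(cache_power_watts(2_000_000, 0.162, 0.03, 0.057e-3))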


Figure 4.4: Tile floorplans for different (L1, L2) sizes. From left: (8KB, 64KB), (64KB, 1MB), (128KB, 4MB).

Table 4.2: Scaled processor power over L1 configurations.

L1 (→)            8K       16K      32K      64K     128K    256K
Power (in Watts)  118.062  115.918  105.572  92.155  79.923  70.342

Figure 4.5: Mesh floorplans used in experiments. From left: conventional 2D mesh topology, a clustered topology, a clustered topology with L2 bank and thread mapping, and a mesh topology with L2 bank and thread mapping.

Table 4.3: Primary and secondary cache parameters (access time, area) obtained from CACTI. L2 access latencies as a function of L1 access times are also shown.

L2 Size                64 KB   128 KB  256 KB  512 KB  1 MB   2 MB   4 MB
L2 Access Time (ns)    0.93    0.96    0.99    1.08    1.17   1.55   2.23
L2 Cache Area (mm^2)   2.29    2.42    2.69    3.96    5.08   7.56   16.44

L1 Size  L1 Access (ns)  L1 Area (mm^2)  Frequency (GHz)  L2 Access Latency (cycles, L2 64KB – 4MB)
8 KB     0.61            0.28            1.64             2 2 2 2 2 3 4
16 KB    0.61            0.56            1.63             2 2 2 2 2 3 4
32 KB    0.63            0.6             1.58             2 2 2 2 2 3 4
64 KB    0.66            0.98            1.51             2 2 2 2 2 3 4
128 KB   0.69            1.12            1.44             2 2 2 2 2 3 4
256 KB   0.73            1.4             1.38             2 2 2 2 2 3 4


Table 4.4: Maximum operating frequencies and dynamic energy per access of various L1/L2 caches. Values were calculated using CACTI power models with 32nm PTM.

L1 Cache                                       L2 Cache
Size     Dynamic Energy    Leakage             Size    Dynamic Energy    Leakage
         Per Access (nJ)   Power (W)                   Per Access (nJ)   Power (mW)
8 KB     0.0775            0.0065              64KB    0.162             0.057
16 KB    0.078             0.013               128KB   0.167             0.071
32 KB    0.0816            0.016               256KB   0.18              0.099
64 KB    0.143             0.033               512KB   0.256             0.175
128 KB   0.1515            0.047               1 MB    0.288             0.293
256 KB   0.1665            0.074               2 MB    0.373             0.551
                                               4 MB    0.928             1.212

Table 4.3 lists the L1 and L2 cache parameters (access time, area) obtained from CACTI. The L2 access delay in cycles, as a function of L1 access time, is also shown. The difference in area between L1 and L2 caches of the same size (e.g. 64KB) is due to the difference in the number of banks (2 vs. 4), which results in twice as many read/write ports in the L2 cache. Also, the tag area in L2 caches is larger (0.0149 mm^2 in a 64KB L1 vs. 0.021 mm^2 in a 64KB L2), and the bank areas in the data array and tag array of L2 caches are larger (e.g., 64KB L2 vs. 64KB L1: data array 1250µm × 1814µm vs. 489µm × 1962µm, tag array 91µm × 229µm vs. 61µm × 241µm). Table 4.4 shows the dynamic energy per access and leakage power values obtained from CACTI; these values are used for power estimation of the caches. Table 4.5 shows the estimated lengths of the links between L1/L2 caches and routers, and between routers of neighbouring tiles, for the regular mesh placement. Lengths of links between the router and the caches were estimated from the floorplans shown in Figure 4.4. Lengths of links between routers were estimated using the regular mesh placement shown in Figure 4.5. The power-optimal pipeline configuration of each link operating at the desired frequency was obtained from Intacte. Dynamic power for the processor was scaled down from statistics recorded in [137]. Leakage power is assumed to be 33% of the total processor power[138].


Table 4.5: Lengths of links between L1/L2 caches & routers and between routers of neighbouring tiles for a regular mesh placement. The number of pipeline stages required to meet the maximum frequency is also shown.

L2 Size  Tile area     L1 to Router   L2 to Router   Router to Router  Horiz. & Vert.  Directory to
(MB)     (mm × mm)     mm    stages   mm    stages   mm    stages      mm    stages    Router (mm)

L1 8KB. Max. frequency: 1.64 GHz
0.064    3.5 × 2.45    1.25  2        2.04  4        3.6   7           2.55  5         3.6/0.2
0.128    3.55 × 2.45   1.27  2        2.08  4        3.65  7           2.55  5         3.65/0.2
0.256    3.65 × 2.45   1.36  3        2.17  4        3.75  8           2.55  5         3.75/0.2
0.512    4.0 × 3.0     2.75  5        2.5   5        4.1   8           3.1   6         4.1/0.2
1        4.25 × 3.0    3.05  5        2.78  5        4.35  8           3.1   6         4.35/0.2
2        4.75 × 3.0    3.5   6        2.75  5        4.85  9           3.1   6         4.85/0.2
4        6.0 × 4.0     1.25  2        6.45  12       6.1   11          4.1   8         6.1/0.2

L1 16KB. Max. frequency: 1.63 GHz
0.064    3.5 × 2.5     1.2   2        2.26  4        3.6   7           2.6   5         3.6/0.2
0.128    3.55 × 2.5    1.22  2        2.31  5        3.65  7           2.6   5         3.65/0.2
0.256    3.65 × 2.5    1.3   3        2.39  5        3.75  8           2.6   5         3.75/0.2
0.512    4.0 × 3.1     3.12  5        2.75  5        4.1   8           3.2   6         4.1/0.2
1        4.25 × 3.1    3.36  6        3.04  5        4.35  8           3.2   6         4.35/0.2
2        4.75 × 3.1    3.88  7        2.75  5        4.85  9           3.2   6         4.85/0.2
4        6.0 × 4.0     1.9   4        6.45  12       6.1   11          4.1   8         6.1/0.2

L1 32KB. Max. frequency: 1.58 GHz
0.064    3.5 × 2.5     1.22  2        2.28  4        3.6   7           2.6   5         3.6/0.2
0.128    3.55 × 2.5    1.24  2        2.32  4        3.65  7           2.6   5         3.65/0.2
0.256    3.65 × 2.5    1.32  3        2.4   5        3.75  7           2.6   5         3.75/0.2
0.512    4.0 × 3.12    3.14  6        2.77  5        4.1   7           3.22  6         4.1/0.2
1        4.25 × 3.12   3.38  7        3.08  6        4.35  8           4.35  8         4.35/0.2
2        4.75 × 3.12   3.9   8        2.75  5        4.85  9           4.85  9         4.85/0.2
4        6.0 × 4.0     1.92  4        6.45  12       6.1   11          6.1   11        6.1/0.2

L1 64KB. Max. frequency: 1.51 GHz
0.064    3.95 × 3.0    3.0   5        2.5   5        4.05  8           3.1   6         4.05/0.2
0.128    4.0 × 3.0     3.1   6        2.5   5        4.1   8           3.1   6         4.1/0.2
0.256    4.1 × 3.0     3.14  6        2.6   5        4.2   8           3.1   6         4.2/0.2
0.512    4.0 × 3.45    3.5   7        3.0   5        4.1   8           3.55  7         4.1/0.2
1        4.25 × 3.45   3.75  8        3.25  6        4.35  8           3.55  7         4.35/0.2
2        4.75 × 3.45   4.25  7        2.75  5        4.85  8           3.55  7         4.85/0.2
4        6.0 × 4.0     2.05  4        6.45  12       6.1   11          4.1   8         6.1/0.2

L1 128KB. Max. frequency: 1.44 GHz
0.064    3.95 × 3.1    3.1   6        2.57  4        4.05  7           3.2   6         4.05/0.2
0.128    4.0 × 3.1     3.2   6        2.61  4        4.1   7           3.2   6         4.1/0.2
0.256    4.1 × 3.1     3.23  6        2.7   5        4.2   7           3.2   6         4.2/0.2
0.512    4.0 × 3.55    3.56  6        3.06  5        4.1   7           3.65  7         4.1/0.2
1        4.25 × 3.55   3.8   7        3.3   6        4.35  8           3.65  7         4.35/0.2
2        4.75 × 3.55   4.3   8        2.75  5        4.85  8           3.65  7         4.85/0.2
4        6.0 × 4.0     2.1   4        6.45  12       6.1   11          4.1   7         6.1/0.2

L1 256KB. Max. frequency: 1.38 GHz
0.064    3.95 × 3.2    3.3   6        2.7   5        4.05  8           3.3   6         4.05/0.2
0.128    4.0 × 3.2     3.35  6        2.7   5        4.1   8           3.3   6         4.1/0.2
0.256    4.1 × 3.2     3.44  6        2.8   5        4.2   8           3.3   6         4.2/0.2
0.512    4.0 × 3.65    3.8   6        3.2   6        4.1   8           3.75  6         4.1/0.2
1        4.25 × 3.65   4.05  7        3.45  6        4.35  8           3.75  6         4.35/0.2
2        4.75 × 3.65   4.55  8        2.75  5        4.85  8           3.75  6         4.85/0.2
4        6.0 × 4.0     2.15  4        6.45  12       6.1   11          4.1   8         6.1/0.2


Figure 4.6: Benchmark execution time vs. communication time: DRAM access time and on-chip transit time vs. L2 cache size vs. program completion time.

4.6 Effect of Link Latency on Performance of a CMP

Intra-tile and off-tile link latencies affect communication delays and have a major influence on the execution times of processes in CMPs. Figures 4.6 and 4.7 record sample results from FFT benchmark execution on a 4 × 4 CMP with individual tiles containing an L1 of 16K and L2 sizes varied from 64KB – 4MB per execution. Figure 4.6 illustrates the effect of varying tile sizes, due to varying L2 sizes, on on-chip and off-chip communication times during the execution of the program. The graph also illustrates the relationship between the two communication times and the program completion time. DRAM accesses decrease with increasing L2 size until L2:256KB. The effect of off-chip communication saturates after L2:256K as the working set has been accommodated inside the CMP.

Program execution time decreases from 52.9M cycles (0.0324 secs) to 42.6M cycles (0.026 secs) as the L2 size changes from 64KB to 256KB. One of the factors is the decrease in overall DRAM communication time from 0.025 to 0.007 secs.


Figure 4.7: Program energy vs. communication time.

The effect of intra-chip transit time is pronounced once the DRAM communication time saturates, in runs with L2 greater than 256K. The increase in program communication times between L2 sizes 512K and 4M can be attributed to intra-chip transit time, as seen in Figure 4.6. The intra-chip transit time depends on the latencies of the L2-cache-to-router links and the inter-tile links. The L2 – router links have latencies of 5, 5, 5 and 6 cycles and the inter-tile latencies are 6, 6, 6 and 8 cycles for L2 sizes 512K – 4M (Table 4.5). The program completion time at L2:4M is larger than those for L2:512K, 1M and 2M due to the larger inter-tile link latencies.

Figure 4.7 illustrates the relationship between the two communication times and program energy. The communication times are also an indicator of the energy spent in executing the program: the minimum-communication-time L1/L2 configuration is the same as the minimum-energy point. The total energy spent in execution is directly influenced by the total execution time of the program. The energy spent during the execution of the benchmark is also a function of the power spent in the processor, the memory hierarchies and the communication links. Increasing on-chip communication results in increasing link power consumption and adds to the total energy of benchmark execution. From the results in Figures 4.6 and 4.7 it is clear that both on-chip transit time and off-chip communication time are important parameters in determining the performance of a CMP.


Figure 4.8: 64K-point FFT benchmark execution time vs. total time spent in on-chip message transit. Panels (a) – (f) correspond to L1 sizes 8K, 16K, 32K, 64K, 128K and 256K; in each panel the L2 cache sizes are in the order 64KB, 128KB, 256KB, 512KB, 1M, 2M, 4M.

4.7 Communication in CMPs

Figures 4.8 and 4.9 show the relationship between the time spent in on-chip message transit and in off-chip DRAM accesses during a 64K-point FFT benchmark execution. Total on-chip transit time (Fig. 4.8) is the sum of the latencies of all messages on all links during the execution of the program. Total off-chip communication time (Fig. 4.9) is the sum of the DRAM access latencies of all DRAM accesses over the execution of the program. The experimental procedure is detailed in Section 4.5. Each point in the graphs represents the total execution time (in seconds) for a particular pair of L1 and L2 cache sizes. The ratio of the cycle times for the L1 cache configurations is 1 : 1.00613 : 1.03797 : 1.08609 : 1.13889 : 1.18841 (from Tables 4.3 and 4.5, frequencies range from 1.64GHz – 1.38GHz).
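The cycle-time ratios quoted above follow directly from the per-L1 operating frequencies in Table 4.3; the short sketch below merely recomputes them as an arithmetic check.

# Operating frequencies (GHz) for L1 = 8K ... 256K, from Table 4.3.
freqs_ghz = [1.64, 1.63, 1.58, 1.51, 1.44, 1.38]

# Cycle time is 1/f; ratios are normalised to the fastest (8K) configuration.
cycle_times = [1.0 / f for f in freqs_ghz]
ratios = [t / cycle_times[0] for t in cycle_times]

print(" : ".join(f"{r:.5f}" for r in ratios))
# -> 1.00000 : 1.00613 : 1.03797 : 1.08609 : 1.13889 : 1.18841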


Figure 4.8 relates total benchmark execution time with on-chip transit time over FFT runs with varying L1 (8K – 256K) and L2 (64K – 4M) sizes. Each curve records the program execution times for a fixed L1 configuration. On-chip transit times for runs on L1:64K are the largest in each of the L2 configurations. Both program completion and on-chip transit times are at their minimum in all L2 configurations when L1:256K. Program completion times in each of the L1 curves start from L2:64K, decrease with increasing L2 size until L2:256KB, and then increase. Off-chip DRAM accesses decrease from L2:64K to L2:256K and are constant for larger L2 sizes (512K – 4M). The decrease in program completion times is due to the decrease in DRAM access times between L2:64K and L2:256K. The decrease in DRAM access time over this range is due to the decrease in the global L2 miss count from L2:64K – L2:256K (Figure 4.10(b)). Global L2 miss counts saturate after L2:256K as the working set of the FFT benchmark fits inside the CMP once per-tile L2 sizes are 256K or greater. This leads to an almost constant time spent in DRAM accesses in executions with L2 sizes 256K – 4M. The increase in on-chip transit time contributes to the increase in program completion time in configurations with L2 sizes greater than 256K. On-chip transit times depend on the link latencies between the caches and the router inside a tile and on the latencies of the links between routers of different tiles. From Table 4.5, tile configurations having larger L2 sizes have larger link latencies. CMPs with L2 sizes of 4M have the largest link latencies in every L1 configuration. The sums of all on-chip latencies of a CMP with L1:16K and L2 sizes from 256K – 4M are 21, 24, 25, 27 and 35 cycles respectively. The effect of on-chip latencies starts to show after DRAM latencies have saturated, and hence an increase in the program completion times in configurations with L2 sizes 256K – 4M can be seen. An L2 size of 256K, with any L1, is the configuration with minimum on-chip transit time and minimum DRAM access latency.

Configurations with L2:256K, for every L1 size, have the least number of total messages sent over all the links (Figure 4.10(a)). The number of DRAM accesses saturates after L2:256K in the FFT experiment. Depending on the overall working set size and the program run on the CMP, an L2 size exists where the number of DRAM accesses is minimum and the effect of inter-tile latencies is minimal. This is the optimal L2 size for best program performance. Increasing the size of L2 further will increase coherence messages and will result in increased inter-tile traffic. The resulting on-chip traffic affects the total program completion time.


Decreasing the L2 size below the performance-optimal L2 results in increased DRAM traffic that adversely affects the performance of the CMP for that program.

CMPs of (L1,L2) sizes (128K,2M), (64K,4M) and (16K,128K) complete execution in 0.0296 secs even though they spend increasing amounts of time (0.079 – 0.181 secs) in on-chip communication. The time spent in DRAM accesses during execution in these configurations is 0.0065, 0.01 and 0.013 seconds respectively. The increase in the time spent in DRAM accesses contributes to the benchmark completion time.

CMPs of (L1,L2) sizes (16K,256K), (32K,256K), (128K,1M) and (256K,4M) spend around 0.075 secs in on-chip communication but have benchmark execution times (0.026 – 0.032 seconds) in increasing order. The time spent in DRAM accesses during execution of these configurations is 7.1ms, 7.4ms, 6.5ms and 5.8ms respectively. The cycle times of these configurations are 0.61, 0.63, 0.69 and 0.73 ns respectively. The total instructions executed in these configurations are 42.6M, 42.0M, 42.6M and 43.9M respectively. The CMP with tile configuration L1:256K, L2:4M has the longest execution time due to the larger cycle time and greater number of instructions executed. The configurations (32K,256K) and (128K,1M) have decreasing execution times due to the decreasing number of instructions and decreasing cycle time. The CMP with tile configuration 16K,256K has the least execution time due to the combination of lower DRAM access time and the smallest cycle time. CMP configurations running at a higher operating frequency have a performance edge over others when the communication time is comparable.

The effect of a large operating cycle time can be seen in the L1:256K curve. FFT runs on L1:256K configurations take the longest to complete because of the comparatively low operating frequency, despite the L2:256K configuration consuming the least on-chip transit time. L1:256K, L2:256K spends the least time in on-chip transit due to the least number of messages generated during FFT execution (Figure 4.10(a)). The advantage of the smallest cycle time in L1:8K configurations is lost to L1:16K configurations due to the small L1 cache size. In comparison with the L1:16K configurations, L1:8K configurations have larger on-chip transit times (up to 10%) due to the larger number of L1 misses.


Figure 4.9: 64K-point FFT execution time vs. total time spent in DRAM (off-chip) accesses. In each panel the L2 cache sizes are in the order 64KB, 128KB, 256KB, 512KB, 1M, 2M, 4M.

This results in a larger number of messages (up to 12%), leading to potentially larger waiting times and hence more spin-lock instructions, resulting in a larger number of executed instructions (up to 1.5M) for the benchmark.

Figure 4.10(a) records the total number of messages over all the links throughout the execution of the benchmark. On-chip messages include data from L2 caches due to L1 misses, synchronization variables, cache coherence messages and data from DRAM accesses. As L1 sizes increase, L1 misses decrease and hence a sharp decrease in overall traffic is observed. The total number of messages on the links in a CMP decreases with increasing L2 size until 256K. The data set of the FFT benchmark fits in the CMP when L2 is 256K or higher, and accesses to DRAM saturate after L2:256KB, as observed in Figure 4.10(b). The messages in the CMP for L2 sizes from 256K – 4M are due to intra-tile data movement and coherence messages generated from L1 misses. A slight increase in the number of messages can be observed between L2 sizes of 256K – 4M in each L1 configuration; for instance, at L1:128K the number of messages at L2:4M is 0.22% more than at L2:256K.


Figure 4.10: Total messages over all the links during the execution of the benchmark and average transit time of a message. (a) Total messages over all the links during the execution of the benchmark. (b) Total global L2 misses (total number of DRAM reads).


The increase is due to the increased exchange of synchronization variables and coherence messages as L2 sizes increase. Global L2 misses result in DRAM accesses for instruction/data retrieval. Figure 4.10(b) records the number of global L2 misses during the execution of the FFT benchmark. DRAM accesses saturate after L2 sizes of 256K as the 64K-point input of the FFT benchmark fits into the L2 cache.

Figure 4.11(a) records the number of instructions executed during the FFT benchmark execution over the various L1/L2 configurations. The difference in the number of instructions is due to the spin-lock instructions executed by cores waiting for the release of resources. The time spent in communication by a program is thus an indicator of the number of spin-lock instructions executed by a core while waiting. Configurations with L1:256K (over all L2 sizes) spend the least time in on-chip communication, and Figure 4.11(a) shows that L1:256K over all L2 sizes executes the least number of instructions. The increased number of instructions executed in the configurations (8K,64K), (8K,128K), (16K,64K), (16K,128K), (32K,64K) and (32K,128K) is due to the increased time spent in DRAM accesses in these configurations (Figure 4.9). The waiting times in these configurations are abnormally large due to the longer DRAM accesses, resulting in an increased number of spin-lock instructions. The number of instructions also increases in configurations with L2 greater than 256K because of the increased transit time (and hence the waiting time).

The power consumed in the CMP memory hierarchy during L1/L2 accesses and the power consumed by the links during on-chip transit and off-chip accesses are shown in Figure 4.11(b). The power consumed in L1 and L2 accesses is proportional to the sizes of the L1 and L2 caches in all the configurations. Thus the power spent in the memory hierarchy increases as L1 and L2 sizes increase. DRAM access power is largest in the lower L2 cache configurations (64KB and 128KB), over all L1 sizes, due to the larger number of global L2 misses in these configurations (Figure 4.10(b)). DRAM accesses saturate after L2:256KB and hence the power consumed in configurations with L2 equal to or greater than 256KB is comparable. For instance, the ratio of the power consumed in the DRAM in the L1:16KB configuration over L2:64KB – L2:4M is 1.47 : 1.275 : 1.007 : 1.006 : 1.0048 : 1.0045 : 1. The link power is tabulated separately in Table 4.6. CMPs with configurations having L2:4M have the longest links and largest latencies.


Figure 4.11: FFT. Total instructions executed and power spent in the memory hierarchy and on-chip links during the execution. (a) Total instructions. (b) Communication power + cache hierarchy power.


Table 4.6: FFT. Power spent in links (in mW).

L2 (→)    64K     128K    256K    512K    1M       2M       4M
L1:8K     76.887  75.842  77.547  93.798  95.702   104.743  129.186
L1:16K    75.555  76.177  78.408  89.785  97.107   105.709  128.699
L1:32K    74.996  73.070  76.728  90.975  105.980  114.220  138.661
L1:64K    84.181  86.746  87.801  99.491  98.306   94.788   121.621
L1:128K   76.061  76.177  78.735  80.918  90.167   91.740   113.227
L1:256K   77.891  77.425  78.226  81.005  84.421   84.900   107.982

The longest links consume more power in each L1 case due to the larger number of flops (Figure 4.10(a)). Link power dips in configurations with L2:64K and L2:128K for L1:8K, 32K and 256K. Total link latencies in L2:64K and L2:128K, for instance in the L1:32K configuration, are equal (18 cycles); the difference in power is due to the difference in activity factors (average activity factor per link of 0.00152 vs. 0.0013) in the links during the benchmark execution. The maximum link power is consumed in the 32K,4M configuration. Total link latencies of the L2:4M configurations over L1 sizes 8K – 256K are 33, 35, 38, 35, 34 and 35 cycles respectively. The average link activity factors in these configurations are 0.00104, 0.00094, 0.00088, 0.00084, 0.00072 and 0.00056 respectively. The differences in activity factors between these configurations are not large enough to influence the final link power drastically. The dominating factor in the overall link power is the number of flops in the links, and hence the 32K,4M configuration has the highest power consumption among these configurations. The differences in frequencies between L1:8K, 16K and 32K (1.64GHz, 1.63GHz and 1.58GHz) are not large enough to compensate for the power spent in flops (102mW, 102.7mW and 110mW respectively).

Figure 4.12(a) shows the energy spent per instruction during the execution of the FFT benchmark. The energy spent in the execution of the benchmark is the product of the total power spent during the execution and the time taken for the execution to complete. The total power includes the processor power and the power spent in the L1/L2 cache hierarchy, DRAM accesses and the on-chip links in the CMP. Energy per Instruction (EPI) is the ratio of the total energy spent during the execution to the number of instructions executed.


Figure 4.12: FFT benchmark. (a) Energy per Instruction. (b) Instructions per second^2 per Watt.

Processor power is a significant percentage of the overall power of the CMP (85% – 98% in L1:8K configurations and 75% – 93% in L1:256K configurations). The execution time of the FFT run in L1:256K configurations is the largest amongst all L1 configurations, as seen in Figures 4.8 and 4.9. The power consumed by the L1:256K configurations is the lowest compared to the other L1 configurations, owing largely to the lower operating frequency. The energy consumed in each L2 configuration with L1:256K is less than the energy spent in the corresponding L2 sizes of the other L1 configurations. The total power spent in the CMP dominates the EPI number and hence the configuration with the least total power results in the least EPI. The operating frequency of the CMP strongly influences the total power consumed; L1:256K operates at the lowest frequency (1.38GHz) and hence this configuration has the least total power. Experiments show that the energy-optimal configurations are those with the lowest operating frequency. This is in contrast with the performance-optimal configurations, where configurations with higher frequencies are performance optimal. Energy per Instruction increases for L2 greater than 128K in all the L1 configurations; the combined effect of the total power consumed and the increasing execution time in these configurations overcomes the effect of the increase in total instructions executed. CMPs with L1:256K have the lowest operating frequencies and hence the largest program execution times, but the time spent in communication is lower owing to fewer L1 misses and hence the least power is spent in communication.


Due to the fewer messages in transit, L1:256K has lower waiting times in traversal and hence executes fewer instructions during benchmark execution. The loss in performance in this configuration is complemented by the energy savings.

L1:16K is the performance-optimal L1 configuration, as seen in Figures 4.8 and 4.9. The power consumed in L1:16K configurations is up to 33% larger than that consumed in L1:256K configurations due to the higher operating frequency. Additionally, L1:256K configurations execute fewer instructions due to fewer L1 misses and have fewer spin-lock instructions during execution. From an EPI perspective, tiles having larger private L1s, operating at lower frequencies, are better than faster tiles with smaller L1 sizes.

Amongst the L1:256K configurations, L2:4M has the longest execution time and largest total power, contributing to the increased EPI number. Larger L2 sizes result in larger L2 access power (L2 power for L2:64K – 4M ranges from 1.6W to 19.4W), as seen in Figure 4.11(b).

Figure 4.12(b) records the IPS^2/W for the FFT benchmark. The IPS^2/W metric decreases with increasing L2 size from 256K – 4M in all the L1 configurations. The total power spent in the CMP increases monotonically between L2:256K and 4M over all L1 configurations. The dominating, larger power number in the higher L2 configurations is due to the larger L2 access power, as seen in Figure 4.11. Larger L2 sizes lead to larger tile sizes, leading to larger link latencies, increased waiting time and hence a larger number of executed instructions; they also lead to increased time spent in execution. All these factors adversely affect the IPS^2/W metric and hence larger-L2 tiles are not the best choice for performance-power optimal execution of the FFT benchmark. The IPS^2/W-optimal L1/L2 configuration is L2:128K in most of the L1 configurations. This is a benchmark-specific optimal configuration; it is close to the point where the time spent in communication in the CMP is the least (L2:256K).

4.8 Program Completion Time

Program completion time follows the on-chip transit (Eqn. 4.6) and off-chip communication times (Eqn. 4.7), as shown in Figure 4.13.


Figure 4.13: Y1: PCT; Y2: on-chip transit and off-chip communication times (FFT, L1:16K, L2 64K – 4M).

Figure 4.14: FFT benchmark results. (a) Memory hierarchy and interconnect power. (b) Number of L2 misses. (PCT: Program Completion Time; comm.: communication.)


Table 4.7: Total messages in transit (in millions).

L2 (→)   64K    128K   256K   512K   1M     2M     4M
L1:8K    25.9   21.2   14.5   14.5   14.5   14.5   14.5
L1:16K   23.0   18.3   12.5   12.5   12.5   12.5   12.5
L1:32K   21.2   17.0   11.8   11.8   11.7   11.7   11.7
L1:64K   17.7   14.2   11.1   11.1   11.1   11.1   11.1
L1:128K  12.4   9.7    9.3    9.3    9.3    9.3    9.3
L1:256K  8.3    7.2    7.2    7.2    7.2    7.1    7.2

T_tran is the sum of the latencies of all messages on all links during program execution. T_dram is the sum of the access latencies of all DRAM accesses over the program execution. For a given L1 size, the decrease in PCT is a direct result of the decrease in on-chip and off-chip communication times from L2:64K – L2:256K. Off-chip communication saturates after L2:256K as L2 misses saturate; the input data set for the 64K-point FFT is accommodated inside the CMP after L2:256K. On-chip transit times decrease (for a given L1) due to fewer off-tile L2 accesses as L2 increases from 64K – 256K. The effect of on-chip latencies starts to show after DRAM latencies have saturated, and hence an increase in the program completion times in configurations with L2 sizes 256K – 4M can be seen.

Configurations with L2:256K, for every L1 size, have the least number of total messages sent over all the links (Table 4.7). Depending on the overall working set size and the program run on the CMP, an L2 size exists where the number of DRAM accesses is minimum and the effect of inter-tile latencies is minimal. This is the optimal L2 size for best program performance. Increasing the size of L2 further will increase coherence messages and will result in increased inter-tile traffic; the resulting on-chip traffic affects the total program completion time. Decreasing the L2 size below the performance-optimal L2 results in increased DRAM traffic that adversely affects the performance of the CMP for that program.

Smaller caches suffer from frequent cache line evictions when data sizes are larger than the cache size. Hence, smaller L1s have a greater number of L2 accesses compared to larger L1 caches.


Figure 4.15: FFT benchmark results. (a) Program execution energy: energy per instruction (nJ) against L1 cache sizes (8K – 256K) for L2 sizes 64K – 4M, real interconnect. (b) IPS²/W against L1 cache sizes for the same L2 sizes.

L2 misses saturate after L2:256K as the program data size fits into the combined L2 cache of the CMP. For L2:64K and 128K over all L1 caches, the large number of L1 misses increases on-chip transit time and results in larger program completion times. Larger L2 configurations result in large tile sizes and hence larger program completion times. PCT decreases from L1:8K to L1:16K due to the decreasing number of L1 misses for a given L2 configuration. PCT for a given L2 decreases from L1:32K to L1:128K due to the decrease in the number of L1 cache misses and hence the decrease in on-chip transit time. In L2:4M configurations, the effect of long-latency on-chip links (due to large tile sizes) shows up, increasing PCT. The L1:16K, L2:256K configuration achieves a balance between the number of L1 and L2 misses and the on-chip link latencies, and is the performance-optimal configuration for FFT. Small L1 and L2 caches suffer from too many misses; large L2 cache configurations result in huge interconnect latencies. The results show that the performance-optimal, energy-optimal and ED-optimal L1/L2 configurations are different.

Memory and interconnect power is dominated by L2 power (Figure 4.14(a)). Memory and interconnect power is small for L1 up to 64K and L2 up to 256K due to the small sizes of the caches. Program completion energy (Figure 4.15(a)) is a function of the total power spent and the time of execution. Total power decreases as L1 increases (as the operating frequency decreases). Minimum energy is spent at L1:256K, L2:128K. Chip power is minimum at L2:64K in the L1:256K configurations. Execution time decreases from L2:64K till 256K due to the decrease in L2 misses (Figure 4.13). The decrease in execution time between L2:64K and L2:128K compensates for the increase in L2 accesses, and the least energy is spent at L1:256K, L2:128K.


Figure 4.16: Program Completion Times. (a) Real interconnect experiment. (b) Ideal interconnect experiment. (Execution time in seconds against L1 cache sizes 8K – 256K for L2 sizes 64K – 4M.)

Large L1 caches and moderate L2 cache configurations are closer to the minimal program execution energy point.

IPS²/W (the same as 1/ED) results favor larger L1 cache configurations. The higher operating frequencies of small L1 cache configurations result in better performance numbers but lose out on the power consumption index. Larger L1 caches (128K) with small L2 caches (64K – 256K) are closer to the optimal ED point. At L1:128K the miss rate is relatively small, and at L2:64K–256K the effect of interconnect latencies is not as pronounced as in larger L2 cache configurations.

The impact of ignoring accurate interconnect parameters on architectural choices is evaluated next. Consider the performance results of the FFT benchmark with (Figure 4.16(a)) and without (Figure 4.16(b)) interconnect latencies included. All interconnect latencies are set to 1 cycle regardless of tile area in the ideal interconnect experiment. Given a target completion time of 0.024 secs, the ideal interconnect graph (Figure 4.16(b)) shows 10 qualifying configurations in L1:8K & L1:16K, but the real interconnect graph shows none. Consider choosing the best energy configuration within a performance requirement of 0.027 secs. Ideal interconnect results indicate L1:128K, L2:256K, while real interconnect results allow configurations in L1:8K and 16K to be considered. Thus, ignoring accurate interconnect latencies can lead to wrong architectural choices.


4.9 Ideal Interconnects, Custom Floorplanning, L2 Banks and Process Mapping

This section presents comparative performance and power results for CMPs with ideal interconnects, alternative floorplanning, and process and L2 bank mapping. Table 4.8 presents the recalculated link lengths of the L1: 256KB and L2: 512KB configuration used in the experiments. The CMP floorplans are shown in Figure 4.5:

• Conventional 2D Mesh. Processes are scheduled starting from processor 0 (the top left corner of the mesh CMP).

• Clustered 2D Mesh. The 4×4 mesh is divided into 4 clusters of 2×2 processors each. The tiles in a cluster are rotated such that link lengths between routers are minimal (0.2 mm). These intra-cluster links have a delay of 1 clock cycle. The clusters are connected by longer links between neighbouring routers that span the tile's border. The links connecting clusters are 8 mm (13 cycles) and 7.5 mm (12 cycles) long. Communication within a cluster costs far less in terms of power and performance than communication across clusters in this CMP.

• Process and L2 Bank Mapped Clustered Mesh. Clusters of 4 processes each are identified after analysis of the communication patterns between tiles in a conventional 2D mesh CMP. Clusters of frequently communicating tiles are identified and processes are mapped as shown in the figure. Conventionally, processes are mapped to the first available processor – Tile 0. Analysis of communication traffic in all benchmarks has shown Tile 0 to be the most communicated with. (An illustrative mapping sketch follows this list.)

• Process and L2 Bank Mapped 2D Mesh. Similar thread and L2 bank mapping is done for a conventional 2D mesh topology. Mapping the L2 banks that receive the maximum number of requests to tiles towards the center of the CMP decreases the average message traversal distance and communication latency. Example results from the FFT benchmark are shown in this section. L2 banks allocated to Tile 0 are mapped to the tile numbered 0 in Figure 4.5, and so on.
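The communication-aware placement described above can be pictured as a simple greedy assignment: order the tiles by distance from the mesh centre and place the processes that generate the most traffic on the most central tiles. The sketch below is only an illustration of that idea under assumed inputs (a matrix of pairwise message counts); it is not the mapping procedure actually used in these experiments.

# Illustrative greedy mapping of 16 processes onto a 4x4 mesh based on
# pairwise message counts.  traffic[i][j] = messages from process i to j.
def map_processes(traffic, mesh_dim=4):
    n = mesh_dim * mesh_dim
    centre = (mesh_dim - 1) / 2.0
    # Tiles ordered by Manhattan distance from the mesh centre.
    tiles = sorted(range(n),
                   key=lambda t: abs(t // mesh_dim - centre) + abs(t % mesh_dim - centre))
    # Processes ordered by the total traffic they send and receive.
    volume = [sum(traffic[p]) + sum(row[p] for row in traffic) for p in range(n)]
    processes = sorted(range(n), key=lambda p: -volume[p])
    return {p: tiles[i] for i, p in enumerate(processes)}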


Table 4.8: Clustered tile placement floorplan for L1: 256KB and L2: 512KB. Lengths of links between neighbouring routers and the number of pipeline stages are shown. Frequency: 1.38 GHz.

                                           Length (mm)   Stages
Intra-cluster                              0.2           1
Between clusters (horizontal & vertical)   8.0           13
Between clusters (horizontal & vertical)   7.5           12
Directory to router                        4.1           8

The four configurations are additionally compared with a hypothetical chip having ideal interconnects. Ideal interconnects have a latency of 1 cycle between routers. Figure 4.17 compares Instructions per Cycle, program completion time, time spent by messages in transit, intra-chip communication power, total chip power, energy per instruction and energy efficiency (IPS²/W) between the said configurations. The results are shown for the FFT benchmark with each core containing an L1 of 256KB and an L2 of 512KB.

IPC increases by 1.3%, 2.6%, 0.8% and 5.6% in the case of the clustered mesh, the process and bank mapped clustered mesh, the process and bank mapped 2D mesh and the ideal interconnect setups respectively, in comparison to the conventional 2D mesh. Reducing the latency in the links has an effect on the overall IPC of the CMP. The IPC and IPS results suggest that the reduction in time spent in transit of messages has a positive impact on the performance of the CMP.

The time spent in transit of messages is significantly reduced in the clustered configurations (up to 11%) compared to the conventional 2D mesh CMP. The ideal interconnect CMP spends up to 40% less time transmitting messages than the conventional 2D mesh.

A significant reduction in the power spent in communication is noticed in both clustered mesh formations (up to 12%). This is due to the large amount of intra-cluster communication in the FFT benchmark experiment. The reduction is smaller (0.3%) in the case of the process and bank mapped 2D mesh experiment. Analysis of communication traffic and reordering of the cores is an iterative process. An example tile ordering (process scheduling) is used in the experiments; several others may yield different power-performance results.
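The pipeline-stage counts in Table 4.8 follow from the link length and the target frequency: the repeated-wire delay of the link divided by the clock period, rounded up. The sketch below reproduces that calculation; the per-millimetre delay used here is an assumed illustrative figure, not a value taken from the Intacte wire models used in the thesis, so small deviations from the table are expected.

import math

def pipeline_stages(length_mm, freq_hz, delay_ns_per_mm=1.15):
    # Stages needed so that each segment of the repeated wire fits in one clock period.
    # delay_ns_per_mm is an assumed, technology-dependent figure.
    wire_delay_ns = length_mm * delay_ns_per_mm
    period_ns = 1e9 / freq_hz
    return max(1, math.ceil(wire_delay_ns / period_ns))

# At 1.38 GHz: pipeline_stages(8.0, 1.38e9) -> 13 and pipeline_stages(7.5, 1.38e9) -> 12,
# in line with Table 4.8; pipeline_stages(0.2, 1.38e9) -> 1.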


Figure 4.17: Alternative tile placements, custom process scheduling example and ideal interconnect comparison results. Benchmark: FFT, L1: 256K, L2: 512K. (Normalized data across the Conventional 2D Mesh, Clustered Mesh, Process and L2 Bank Remapped Clustered Mesh, Process and L2 Bank Remapped 2D Mesh and Ideal Interconnect configurations; comparison parameters: IPC, Time Transit, CP Power, EPI, MPW.)

The total power of the CMP is dominated by the processor and memory hierarchy power in these experiments, and hence the advantage gained by the communication power reduction is not visible. A reduction of 2.5% in the energy spent per instruction is observed in the communication-aware clustered tile placement experiment. EPI reduces by up to 1% in the conventional mesh experiment. The average time taken for an instruction to complete reduces with the decrease in time spent during message transit. Thus the overall energy spent in the system per instruction reduces. An increase of 1.2%, 5.2% and 1.65% respectively is observed in the energy efficiency metric, IPS²/W. Instructions executed per second have increased due to the reduction in communication cycles spent, and hence the performance increase is observed.


4.10 Remarks & Conclusion

This chapter estimated the effects of communication overheads on the performance, power and energy of a multiprocessor chip, using L1 and L2 cache sizes as the primary exploration parameters and accurate interconnect, processor, on-chip and off-chip memory modelling. A detailed multiprocessor execution environment named Sapphire, integrating SESC, Ruby and DRAMSim, was presented. Sapphire was used to run applications from the Splash2 benchmark suite. A 4×4 2D mesh of 16 out-of-order cores, each consisting of a private L1 and a unified L2 cache, was used as the case study in the exploration. Detailed low-level wire delay models from Intacte were used to calculate the power consumed by the interconnection network. Architectural choices based on inaccurate interconnect estimates are not optimal, and the error is severe in cases where applications have heavy communication requirements.

Performance-optimal configurations are achieved at lower L1 cache sizes and at moderate L2 cache sizes due to higher operating frequencies, smaller link lengths and comparatively less communication. Using the minimal L1 cache size in order to operate at the highest frequency may not always be the performance-power optimal choice. Larger L1 sizes, despite a drop in frequency, offer an energy advantage due to the reduced miss-induced communication. Clustered tile placement experiments for FFT (L1:256KB and L2:512KB) show a considerable performance-per-watt improvement (1.2%). Remapped processes and banks in the clustered tile placement show a performance-per-watt improvement of 5.25% and an energy reduction of 2.53%. Remapping threads and frequently accessed L2 banks closer together in a conventional 2D mesh improves performance per watt by 1.6%. EPI and program completion time results indicate that the minimum-energy cache configurations are not the same as the minimum-execution-time configurations.


Chapter 5

Label Switched NoC - Motivation & Design

NoCs servicing generic CMPs or customized media processors are expected to meet the Quality of Service (QoS) demands of the executing applications. The two basic approaches in NoC designs to enable QoS guarantees are the creation of reserved connections between sources and destinations via circuit switching, and support for prioritized routing (in the case of packet switched, connectionless paths). Packet switched networks provide efficient interconnect utilization and high throughputs[43]. However, they need to be over-provisioned to support QoS for various traffic classes and have high buffer requirements in routers. On the other hand, circuit switched NoCs guarantee high data transfer rates in an energy efficient manner by reducing intra-route data storage[41]. These are well suited for streaming applications where communication requirements are well known a priori.

In this thesis, we present a Label Switching based Network-on-Chip (LS-NoC) motivated by the throughput guarantees offered by bandwidth reservation. Such an NoC can be used to provide hard bandwidth and throughput guarantees to streaming applications in a multiprocessor environment amidst resource-competing processes. The Label Switched (LS) Router used in LS-NoC achieves single cycle traversal delay under no contention and is multicast and broadcast capable. Source nodes in the LS-NoC can work asynchronously, as cycle-level scheduling is not required in the LS Router. The LS router supports multiple clock domain operation.


LS-NoC enables circuit switching without requiring a globally synchronous clock and hence eases clock tree design and reduces global clock distribution power.

Media processors with streaming traffic such as HiperLAN/2 baseband processors[7], real-time object recognition processors[8] and H.264 encoders[44][45] demand adequate bandwidth and bounded latencies between communicating entities. Adequate throughput, latency and bandwidth guarantees between process blocks can be provided by establishing provisioned, contention free routes between nodes.

A centralized LS-NoC Management framework engineers traffic into QoS guaranteed routes. LS-NoC caters to the requirements of streaming applications where communication channels are fixed over the lifetime of the application. The proposed NoC framework inherently supports heterogeneous and ad-hoc SoC designs. The LS-NoC can be used in conjunction with a conventional best effort NoC as a QoS guaranteed communication network, or as a replacement for the conventional NoC. The LS-NoC management framework is the focus of the next chapter.

This chapter describes the design of the router used in the label switched, QoS guaranteeing NoC. A multicast, broadcast capable label switched router for the LS-NoC is presented. The LS router (Figure 5.4) costs a single cycle flit traversal delay during no contention. Table based look-up for routing and label translation is provided. A 5 port, 256 bit data bus, 4 bit label router occupies 0.431 mm² in 130nm and delivers a peak bandwidth of 80 Gbits/s per link at 312.5MHz.

Organization of the Chapter

Work related to the Label Switched NoC spans three chapters detailing the router architecture, the NoC management framework and the results respectively. A detailed literature survey of QoS guaranteed NoCs was presented in Section 2.1 of Chapter 2. Section 5.4 illustrates the working of an LS-NoC. Streaming applications and their traffic characteristics are introduced in Section 5.1. Salient features of LS-NoC are listed in Section 5.3. The design of the LS-NoC router and its verification are presented in Section 5.5. Section 6.1 outlines the LS-NoC framework and the tasks of the NoC Manager. Experiments with streaming application case studies are presented in Chapter 7.


Table 5.1: Communication characteristics between HiperLAN/2 nodes.

Stream                    Edge(s)   Bandwidth [Mbit/s]
S/P → Pre-fix Removal     1-2       640
Pre-fix Removal → FFT     3-4       512
FFT → Channel eq.         5-6       416
Channel eq. → De-map      7         384

5.1 Streaming Applications in Media Processors

Applications exhibiting a pipelined mode of operation between processing blocks have stream-based traffic characteristics. We explain HiperLAN/2[139][140] and object recognition applications[8] in the context of streaming traffic characteristics.

5.1.1 HiperLAN/2

HiperLAN/2 is a radio technology operating in the unlicensed 5 GHz bands, used to provide wireless connectivity in WLAN networks. HiperLAN/2 operates with data rates of up to 54 Mbps[139]. The physical layer of HiperLAN/2 modulates the bits originating on the transmitter side and demodulates them on the receiver side using Orthogonal Frequency Division Multiplexing (OFDM). Processing in OFDM receivers is performed in block mode, and this results in a block-based communication stream within nodes. HiperLAN/2 has been mapped onto a multi-tile architecture in [140]. NoC QoS specifications in HiperLAN/2 implementations are driven by OFDM symbol processing. An OFDM symbol is processed every 4µs, and the underlying NoC should guarantee sufficient bandwidth and latency between communicating nodes to facilitate this. Table 5.1, taken from [7], provides the QoS requirements of HiperLAN/2.
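As a rough illustration of what the 4µs symbol period implies for the NoC, the snippet below converts the per-stream bandwidths of Table 5.1 into the data that must be moved per OFDM symbol and into 256-bit flits (the LS-NoC data bus width introduced later in this chapter). It is only a back-of-the-envelope check, not part of the thesis framework.

# Data moved per 4 us OFDM symbol for each HiperLAN/2 stream of Table 5.1,
# and the equivalent number of 256-bit flits.
SYMBOL_PERIOD_S = 4e-6
FLIT_BITS = 256

streams_mbit_s = {
    "S/P -> Pre-fix Removal": 640,
    "Pre-fix Removal -> FFT": 512,
    "FFT -> Channel eq.": 416,
    "Channel eq. -> De-map": 384,
}

for name, mbps in streams_mbit_s.items():
    bits_per_symbol = mbps * 1e6 * SYMBOL_PERIOD_S
    print(f"{name}: {bits_per_symbol:.0f} bits, {bits_per_symbol / FLIT_BITS:.0f} flits per symbol")
# The heaviest stream (640 Mbit/s) amounts to 2560 bits, i.e. 10 flits, every 4 us.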


Figure 5.1: (a) Process graph of a HiperLAN/2 baseband processing SoC[7] and (b) NoC of the object recognition processor[8]. (Panel (b) shows the Main Processor (MP), Visual Attention Engine (VAE), Matching Accelerator (MA), external interface and processing element clusters PEC0 – PEC7 connected through crossbar switches.)

Figure 5.1(a) presents the processing blocks with communication directions in the HiperLAN/2 implementation. HiperLAN/2 blocks connect in a pipelined manner, with the previous block's output serving as input to the current block, as shown in the figure. The Serial-to-Parallel (S2P) block generates HiperLAN/2 application traffic to serve the Demapping block (DM) throughout the simulation.

5.1.2 Object Recognition Processor

Object recognition is widely used in applications such as robot navigation, autonomous vehicle control, video surveillance and natural human-machine interfaces[8]. These applications require huge computational power and real-time response.

The object recognition process is sped up by visual attention, which involves selecting the salient parts of images where the desired object lies. A hardware implementation of such a vision system involves processing elements (PEs) capable of data transactions that facilitate object-level parallel processing. An implemented object recognition system[8] is presented in Figure 5.1(b).

The Main Processor (MP) orchestrates computational data between the processor engine clusters (PECs), the Visual Attention Engine and the Matching Accelerator. An external interface is provided for off-chip multimedia data input/output. The MP broadcasts periodically to all PECs while the PECs set up bursty communications between themselves. Although the communication pipes are fixed over the experiments, the destinations for bursty and non-bursty traffic are chosen at random. The NoC should support high bandwidth for image block transfer, as each PEC is capable of producing 12.8 Gb/s of aggregated throughput. The underlying NoC should also provide low-latency communication for real-time operation.
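A quick capacity check shows why bandwidth reservation with link sharing (the approach described next) is attractive for this workload: the 80 Gbit/s peak link bandwidth reported for the LS-NoC later in this chapter can accommodate several 12.8 Gb/s PEC streams on one physical link. The arithmetic below is only an illustration; it ignores protocol overheads and assumes the link figures from Section 5.7.

# How many 12.8 Gb/s PEC streams fit on one 80 Gbit/s LS-NoC link (ignoring overheads)?
LINK_GBPS = 80.0      # peak LS-NoC link bandwidth (Section 5.7)
PEC_GBPS = 12.8       # aggregated throughput of one PEC
print(int(LINK_GBPS // PEC_GBPS))   # -> 6 streams, with ~3.2 Gb/s headroom left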


5.2 LS-NoC - Motivation

Existing packet switched networks use priority based QoS solutions where higher priority routes are given network resources for larger amounts of time. Such schemes work well when the amounts of traffic in the various priority classes are static and do not change over time. A priority based QoS mechanism may not offer advantages when all routes belong to the same priority level. Further, the best effort nature of packet switched networks results in a larger number of buffers inside routers.

Circuit switching uses path based resource reservation techniques to guarantee QoS. Such schemes work well for real time traffic where the latency of communication paths has to be guaranteed. Reserving resources along a path for its entire lifetime may result in under-utilization of network resources, as no other traffic can use the path when it is free of traffic. These issues are discussed in detail in Chapter 2, which describes QoS related NoC works.

Streaming applications use long lived connections between processing blocks and require constant throughput, with paths demanding real time latency. The characteristics of the streaming traffic generated by media processor applications are listed here:

• The maximum bandwidth and throughput to be guaranteed by the NoC, for the application to be serviced, are known a priori.

• In some streaming applications such as HiperLAN/2, data arrives at every block at a fixed rate due to the pipelined nature of the application. This is possible when the NoC delivers guaranteed constant throughput to the communication paths of the application.


• Streaming applications require either guaranteed real time communication or a fixed, deterministic delay between processing blocks. Jitter in communication delays may result in packets being dropped or in uneven processing in the media processing pipeline.

• Communicating nodes in a streaming application use long lived connections between them. While communication paths are long lived, usage may be intermittent and the traffic bursty in nature. Reserving resources for the entire time period may result in under-utilization of NoC resources. Hence a balance has to be maintained between resource reservation and fair distribution of resources among all communication paths.

• The amount of control information is a small percentage of the processing data traffic in such a system. Control information such as source and destination addresses does not change often, as connections between communicating nodes are long lived.

Packet switched networks are not favoured for servicing streaming applications due to their best effort nature. Circuit switched networks promise to deliver the required QoS but result in an under-utilized network and hence lower energy efficiency. There is a need to build a hybrid NoC that uses the best of both packet and circuit switched QoS techniques for servicing streaming application traffic. Label switching offers a low-overhead solution by reducing the meta-data in packets. Labels uniquely identify a source and destination pair along with the intermediate routers on the route. Smaller labels result in simpler routing tables and less logic in the router. The path can be chosen based on the QoS requirements of the participating nodes. Links can be shared between multiple label paths as long as the QoS requirements are met. Such a traffic engineered path with provisioned network resources reduces buffer requirements in routers, as communication paths are more evenly distributed in the NoC. A simple router design with reduced buffer requirements, together with link sharing, results in an energy efficient NoC. The traffic characteristics of streaming applications motivate the design of a pipe based, resource reserved, link shared, label switched Network on Chip (LS-NoC). The LS-NoC concept is introduced in the following section.


5.3 LS-NoC - The Concept

Long lived, throughput demanding connections between communicating nodes motivate the use of pipe based communication in LS-NoC. A pipe is identified by the source, the destination and a throughput guaranteed path between the source and destination nodes (Section 5.5.1). A pipe reserves the required amount of resources along the route for the lifetime of the connection. Reserving only the required amount of resources enables multiple pipes to share the same physical link as long as their QoS requirements are met.

Pipes are unique to a source and destination pair of nodes and hence can be addressed using pipe-ids (labels). Using labels instead of node addresses to identify the source and destination nodes potentially saves meta-data bits in routing tables and headers in packets. LS-NoC uses sideband signals to transmit flow control and label information.

A physical link shared by multiple pipes may result in conflicting label addresses in routing tables. Label conflicts can be resolved by assigning aliases to pipes. This process is called label swapping in LS-NoC and is dealt with in Section 5.5.2.

Establishing a new QoS guaranteed pipe between nodes requires knowledge of the existing pipes and the resources in the NoC. An NoC-state aware entity having knowledge of the existing connections and the remaining resources in the NoC is essential for this purpose. Such an entity (the Manager) may be a software process running on a core or a separate hardware accelerator block. The Manager is responsible for mapping out a resource rich, contention-less path between communicating nodes. Configuration of the routing tables in the routers along the path of the pipe is also handled by the Manager.

Figure 5.2 shows an example 8×8 LS-NoC in a 2-dimensional mesh topology. As shown in Figure 5.2, after pipe 0 has been set up by the Manager between A & B, pipe 1 is set up on a contention free route. The pipe between X & Y was set up as pipe 0 at the origin; its label is swapped to pipe 1 in the last hop to avoid a label conflict with the pipe between A & B.
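A minimal sketch of the bookkeeping such a Manager needs is given below: each link tracks its remaining capacity and free labels, and each pipe records its source, sink, bandwidth and the label used at every hop. The class and field names are illustrative assumptions, not the data structures of the thesis implementation.

from dataclasses import dataclass, field
from typing import List, Set, Tuple

@dataclass
class Link:
    capacity: float                 # peak bandwidth the link can support
    reserved: float = 0.0           # bandwidth already promised to pipes
    free_labels: Set[int] = field(default_factory=lambda: set(range(16)))  # 4-bit label space

    def can_admit(self, bw: float) -> bool:
        # A new pipe fits if enough bandwidth and at least one label remain.
        return self.capacity - self.reserved >= bw and bool(self.free_labels)

@dataclass
class Pipe:
    source: int
    sink: int
    bandwidth: float
    hops: List[Tuple[int, int]] = field(default_factory=list)   # (router, label) per hop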


Figure 5.2: A 64 node, 8 × 8 2D LS-NoC along with the NoC Manager interface to the routing tables. (The example pipes A→B: pipe 0, A→C: pipe 1 and X→Y: pipe 0, pipe 1 are marked on the mesh.)

5.4 LS-NoC - Working

Figures 5.3(a)–5.3(e) illustrate a pipe establishment and label swapping example in LS-NoC.

• Figure 5.3(a): The NoC Manager contains the flow graph of the network. A few of the edges stored in a data structure are shown in this figure. Total labels available = 8. At the initialization stage all labels are free for use.

• Figure 5.3(b): A pipe (3→4→5→2) is set up between nodes 3 & 2. The data structure is updated in the rows associated with edges 34, 45 and 52. The pipe occupies the entire bandwidth of the link and uses label = 0.

• Figure 5.3(c): Another pipe is established between 7 & 2. The pipe is established in a non-intersecting manner w.r.t. the previous pipe. The pipe (7→4→1→2) occupies 50% of the bandwidth available on the links.

• Figure 5.3(d): A pipe has to be established between nodes 6 & 4. The flow algorithm has identified the route 6→7→4 for the pipe. Label 0 is the first available label at node 6. The label conflicts at the South port of node 4 – label 0 has already been utilized by pipe 0 of node 7 (7→2).


• Figure 5.3(e): Label swapping enables the pipe to be established with label 1.

5.5 Label Switched Router Design

A Label Switched (LS) router (Figure 5.4) costs a single cycle flit traversal delay during no contention. Labels are sent out of band, juxtaposed with the data along the links. Separating the label from the data link reduces meta-data management at the network interfaces at the ingress and egress of the LS-NoC. Applications are free to choose data formats, granting them design flexibility. Further, wires are relatively inexpensive in CMPs[141] – a 4 bit label incurs an overhead of 1.5% on a 256 bit wide data link. The label accompanying the data bus is used by the routing table to identify the intended outgoing port. The routing table is indexed by established labels and has two fields, Direction Bits and New Label, as shown in Table 5.2. Note that the 'Input Label' need not be stored if the memory data structure enables indexing using the input label bits. The Direction Bits field contains as many bits as there are output ports in the router. A bit corresponding to an output port is set if the label being routed is to exit from that output port. Multiple bits set in the Direction Bits field enable multicast or broadcast. The New Label field is maintained to enable label swapping; label swapping is explained in Section 5.5.2. In Table 5.2, incoming label 0 exits through port 0 of the router. Label 1 does not pass through this input port. Labels 2 and 15 are broadcast and multicast messages respectively. Data, label and valid signals are replicated and sent to every output port. Output port flags in the routing table serve as valid signals during arbitration. Routing table signals are used during pipe setup and tear down by the NoC Manager. WriteData and ReadData on the PTR location are controlled by the WR/RD signals.

The combinational circuitry from an input port to an output port is shown in Figure 5.4. An accompanying valid signal denotes whether flopped-in data should be processed by the input port. Incoming data flits are written into the FIFO in two cases. Case 1: flits are queued in the input port buffer for traversal. Case 2: contention occurs at the desired output port and the input port loses during arbitration. An ORed grant signal from all output ports is available to identify whether the input port won the desired output port during arbitration.


Figure 5.3: Pipe establishment and label swapping example in a 3×3 LS-NoC. (a) Initial state of a few edges in the NoC Manager. (b) LS-NoC state after pipe 3→2 has been established. (c) LS-NoC state after pipe 7→2 has been established; the pipe has no contention with the previously established pipe. (d) Label conflict at node 7's North port during 6→4 pipe establishment. (e) Label swapping completed and label 1 assigned to pipe 6→4. (Each panel shows the links table and the shadow routing table data structures held by the Manager.)

Table 5.2: Routing table of an n port (n = 5) router with an lw bit (lw = 4) label, indexed by the labels used in the label switched NoC. Size of the routing table = 2^lw × n × lw.

Input Label   Direction Bits   New Label
0000          00001            0000
0001          00000            0001
0010          11111            0010
...           ...              ...
1111          00101            1111
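The routing-table behaviour summarised in Table 5.2 can be modelled in a few lines: the incoming label selects an entry, the direction bits indicate which output ports the flit is replicated to (more than one set bit means multicast or broadcast), and the New Label field implements label swapping. The sketch below is an illustrative software model, not the Verilog implementation; the port ordering is an assumption.

# Illustrative model of the LS router routing-table lookup of Table 5.2.
# Port ordering (bit 0 = port 0, ...) is an assumption.
PORTS = ["P0", "P1", "P2", "P3", "P4"]

# routing_table[input_label] = (direction_bits, new_label)
routing_table = {
    0b0000: (0b00001, 0b0000),   # unicast: exits through port 0 only
    0b0010: (0b11111, 0b0010),   # broadcast: replicated to all five ports
    0b1111: (0b00101, 0b1111),   # multicast: replicated to ports 0 and 2
}

def route(label):
    """Return the output ports and the (possibly swapped) outgoing label."""
    direction_bits, new_label = routing_table[label]
    out_ports = [PORTS[i] for i in range(len(PORTS)) if direction_bits & (1 << i)]
    return out_ports, new_label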


Figure 5.4: Label Switched Router with single cycle flit traversal. The Valid signal identifies Data and Label as valid. PauseIn and PauseOut are flow control signals for the downstream and upstream routers. The routing table holds the output port and label swap information. The arbiter receives input from all the input ports along with the flow control signal from the downstream router. (The figure shows, per input port, the data, label, valid and PauseIn signals, the FIFO with its control block, the routing table with its NoC Manager interface, and the MUX and arbiter feeding the output ports.)

The FIFO Control Block (FCB) handles the FIFO pointer arithmetic and generates the Pause (flow control) signal when the FIFO fills above a threshold. The threshold accounts for the traversal delay, in cycles, of the link connecting the current router to the upstream router to which the flow control signal is input. The FCB also generates the MUX control signal that decides whether the incoming data on the link traverses the router or is written into the FIFO.

5.5.1 Pipes & Labels

A pipe (P) is a triplet (S, K, R) where S and K denote the source and sink nodes and R is a non-empty set of intermediate routers connecting S and K. A source and a sink can have many pipes between them, varying the set of intermediate routers, R.

A label belonging to a source S uniquely identifies a communication pipe and the intended destination K – though the value of the label can change en route. Labels can be reused across sources.


Label reuse requires reserved entries for each input port in the routing table, or an additional field in the routing table to identify the incoming port. Conceptually, each input port has a separate routing table of up to 2^lw entries, where lw is the label width. Independent routing tables at input ports enable label reuse, resulting in more efficient usage of the label space. Label reuse by sources may give rise to label collisions between pipes sharing a link. Label swapping reassigns unused labels to avert label collisions at input ports (Section 5.5.2).

Using labels, a pipe can be represented as the set S, l_0, l_1, ..., l_{h-1}, K, where the pipe connects source S to destination K through h routers. l_0 is the value of the label at router R_0, and so on. Router R_0 connects to S and R_{h-1} connects to K. Without label swapping, l_0 = l_1 = ... = l_{h-1}.

5.5.2 Label Swapping

A major advantage of LS-NoC is the provisioning of routes with guaranteed throughput between nodes. With an increasing number of pipes in the LS-NoC, the probability of label collision increases. A label collision on a link results in a routing table entry clash, as shown in Figure 5.5. The pipes entering the North and South ports of Router 0 both carry label 0. Both pipes are destined to leave the router through the East port and reach the West port of Router 1. There is no label conflict in Router 0 (contention exists at the East port) as routing tables are individual to each input port. The conflict occurs at the West port of Router 1, in Routing Table 2, through which both pipes need to be routed. Furthermore, consider that the routing table entry for label 0 is already used and there are at least 2 routing table entries free for use. In such a situation, neither of the pipes carrying label 0 from Router 0 can pass through the West port of Router 1. This results in inefficient utilization of links and of the label space at input ports.

Label swapping reassigns the labels of conflicting pipes using the available label space at the next router. This allows complete utilization of the available label space. Figure 5.5 illustrates label swapping for conflicting pipes. Routing table 0 at the North port of Router 0 swaps the conflicting label 0 to label 1, which is available in Router 1. Similarly, routing table 1 at the South port of Router 0 swaps label 0 to 2, ensuring both pipes are set up to pass through the West port of Router 1.


Figure 5.5: Label conflict at R1 resolved using label swapping. il: Input Label, Dir: Direction, ol: Output Label.

5.6 Simulation and Functional Verification

Functional verification of the Verilog designs of the router and the networks was done using the Icarus Verilog Simulator[142]. Table 5.3 lists the simulation parameters used in the functional verification of the label switched router. Routing tables are populated in the initialization stage. The flow identification algorithm (Algorithm 1) is implemented in Perl, and the routing tables in the NoC are configured based on the algorithm. The label switched router has been implemented and tested on the following networks:

• Single Router Network: A single router connected to 4 nodes (4 sources and 4 sinks) was tested for 10^8 cycles with different directed traffic and random traffic permutations.

• 2D Mesh: A 64 node, 64 router 2-dimensional mesh with traffic from each source to random destinations was tested for 10^8 cycles. Appendix B presents more details on the test cases used for functional verification of the LS-NoC router.


Table 5.3: Simulation parameters used for functional verification of the label switched router design.

Network                 64 node, 8 × 8, 2D mesh (Figure 5.2)
Data Bus Width          256 bits
Label Width             4 bits
Input Buffer Depth      8
Simulation Time         10^8 cycles
Simulation Framework    Icarus Verilog [142]

Table 5.4: Synthesis parameters.

Process                 UMC 130nm, High Speed
Library                 Faraday
Process                 1.00
Temperature             25 °C
Voltage                 1.2V
Interconnect Model      Worst Case Tree
Metal Layers            8 (2 thick layers)

5.7 Synthesis Results

The logic and combinational circuit blocks of the LS Router were synthesized using Faraday's FSC0H_D 130nm library with high-performance and high-density generic core cells for the UMC 0.13µm eHS (FSG) process in the typical-NMOS, typical-PMOS case (Table 5.4). Timing and area results from synthesis are tabulated in Table 5.5. The router operates at 312.5MHz in 130nm technology. Taking into account the effects of scaling, the estimated frequency of operation of the router at 45nm is above 1.2GHz[80]. A link width of 256 bits was chosen to service a peak bandwidth per link of 300Gbits/s, comparable with GDDR5 bandwidth. The area and power of the memory elements inside the LS Router (input buffers and routing tables) were estimated using UMC's Memory Compiler.
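The peak link bandwidth figures quoted here follow directly from the data width and the clock: 256 bits per cycle at 312.5 MHz gives 80 Gbit/s, and at the scaled 1.2 GHz estimate the same 256-bit link carries roughly 300 Gbit/s. A one-line check:

# Peak link bandwidth = data width x clock frequency.
flit_bits = 256
print(flit_bits * 312.5e6 / 1e9)   # 80.0 Gbit/s at 130nm, 312.5 MHz
print(flit_bits * 1.2e9 / 1e9)     # 307.2 Gbit/s at the 45nm estimate (~300 Gbit/s)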


Table 5.5: Synthesis results of the Router and Mesh networks. The area of a Router is 0.431 mm².

Router details: 5 ports, 256 bit data width, 4 bit label width.

Synthesis results, LS-NoC Router:
                     Buffers + Routing Table     Combinational Logic
                     (Storage Elements)
Area (mm²)           0.077                       0.354
Power                11.01 mW                    32.07 mW
Total Area (mm²)     0.431
Total Power          43.08 mW
Max Frequency        312.5MHz
Bandwidth/link       80 Gbps

Synthesis of the functionally verified Verilog HDL design of the router was performed in Synopsys Design Compiler. The switching activity, the timing and design constraints and the synthesized netlist are input to Cadence SOC Encounter for place and route, from which the area is obtained. The placed and routed netlist, along with the extracted parasitics file (SPEF), was used to obtain the power of a router with no buffers. Area and power measurements of the memory components for the buffers were done using UMC's Memory Compiler tool. The total power consumption is estimated to be 43.08mW at 312.5MHz, 1.2V. Section B.3 of Appendix B provides more details on the steps involved in synthesis and place and route of the LS-NoC router.

The area of a processing engine cluster from [8] was used as a case study to identify wire lengths in the mesh and the feasibility of single cycle operation. The area of a PEC was estimated to be 2.538 mm², giving a side length of 1.593 mm. Intacte[82] was used to estimate the maximum frequency of operation of a 1.6mm link in 130nm. It was found that the 1.6mm link operates with single cycle latency at 312.5MHz.


5.8 Conclusion

We have presented an LS-NoC which services the QoS demands of streaming applications with the help of a centralized NoC Manager that performs traffic engineering. LS-NoC for streaming applications guarantees deterministic path latencies, satisfies bandwidth requirements and delivers constant throughput. Delay and throughput guaranteed paths (pipes) are established between sources and destinations along contention free, bandwidth provisioned routes. Pipes are identified by labels unique to each source node. Labels need fewer bits than node identification numbers – potentially decreasing memory usage in routing tables.

The concept of LS-NoC was presented. The Label Switching router has been verified, synthesized, placed and routed and timing analyzed. The Label Switched (LS) Router used in LS-NoC achieves two cycle traversal delay during no contention and is multicast and broadcast capable. A 5 port, 256 bit data bus, 4 bit label, 1 bit flow control, 8 buffers per input port router occupies an area of 0.431 mm² in the 130nm Faraday library in the typical corner and operates at 312.5MHz. The LS Router is estimated to consume 43.08 mW. The following two chapters illustrate the LS-NoC management framework and the LS-NoC results respectively.


Chapter 6

LS-NoC Management

Streaming applications require hard bandwidth and throughput guarantees in a multiprocessor environment amidst resource-competing processes. In this chapter we present the management framework of the Label Switching based Network-on-Chip (LS-NoC), motivated by the throughput guarantees offered by bandwidth reservation. A centralized LS-NoC Management framework engineers traffic into QoS guaranteed routes. LS-NoC caters to the requirements of streaming applications where communication channels are fixed over the lifetime of the application. The proposed NoC framework inherently supports heterogeneous and ad-hoc SoC designs. The LS-NoC can be used in conjunction with a conventional best effort NoC as a QoS guaranteed communication network, or as a replacement for the conventional NoC. A multicast, broadcast capable label switched router for the LS-NoC was presented in the previous chapter. The bandwidth and latency guarantees of LS-NoC are demonstrated in the next chapter on streaming applications like HiperLAN/2 and the Object Recognition Processor, Constant Bit Rate traffic patterns and video decoder traffic representing Variable Bit Rate traffic.

Organization of the Chapter

Section 6.1 outlines the LS-NoC management framework. The flow identification algorithm used to establish pipes is presented in Section 6.2. The chapter concludes in Section 6.6.


6.1 LS-NoC Management

The NoC Manager is the central entity responsible for identifying and configuring QoS guaranteed routes between communicating blocks. A two stage router with broadcast and multicast enabled, table based routing is used in the LS-NoC. Label based forwarding requires every route to be identified by a unique label set. Using labels instead of node ids decreases routing table sizes.

6.1.1 NoC Manager

The primary job of the NoC Manager is to identify QoS guaranteed communication paths (termed pipes) between communicating nodes and to update the routing tables along each pipe. The NoC Manager can be implemented as a software thread running on one of the cores (in a CMP environment) or as a separate hardware accelerator (in an SoC environment). A managed LS-NoC is useful in a non-homogeneous, unstructured SoC where communication channels may not be required between all blocks. The ability of the Manager to establish pipes dynamically, considering network traffic and the availability of contention free routes, allows for guaranteed bandwidth and deterministic latency pipes. The Manager can also be configured to monitor the status of the links in the NoC. Such a Manager would work as a fault tolerant route set up system in the LS-NoC. A study of fault management strategies for LS-NoC using the Manager is not dealt with in this work. The NoC Manager requires interfaces at the routers to update routing tables when new flows are set up in the NoC. In SoCs where applications are fixed during operation, this incurs a one time set up latency at the initialization phase of the application.

6.1.2 Traffic Engineering in LS-NoC

The NoC Manager has complete visibility of the current state of the pipes in the NoC. This empowers the Manager to set up a new flow along a contention free, QoS guaranteed pipe. The traffic engineering capabilities of LS-NoC enable flit forwarding through non-shortest, non-congested paths, resulting in deterministic flit traversal latencies. The traffic engineering abilities of the NoC Manager depend on the pipe establishment algorithms (Section 6.2) and are independent of the network.


This enables support for custom, application-specific SoCs that contain non-homogeneous or ad-hoc NoCs. The pipe set up algorithms use a graph representation of the NoC. Faulty or occupied links have nil communication capacity; for this reason fault tolerance is built into LS-NoC. Fault tolerance in LS-NoC is the subject of Section 6.3. A centralized NoC Manager is adequate in most CMPs running streaming applications, as there are a fixed number of communicating entities and the SoC is not a dynamically growing system. In a standard SoC, loss of scalability may not be a serious concern.

6.2 Flow Based Pipe Identification

We propose to use a centralized flow manager, called the LS-NoC Manager, which helps manage pipe set up and tear down. The LS-NoC Manager takes into account the bandwidth reserved in the links used by existing pipes while establishing a new pipe. The routing tables in the routers along the pipe are configured by the LS-NoC Manager. Flow identification algorithms[66][67] are used to identify available pipes between communicating entities. The flow algorithm used by the NoC Manager to calculate pipes is based on the Ford-Fulkerson algorithm[67]. The Ford-Fulkerson algorithm was chosen for its ease of implementation and its ability to converge in polynomial time. The computation latency of the pipe establishment algorithm is analyzed in Section 6.4.1.

The maximum capacity of a link is the measure of the peak bandwidth the link can support. Reserving the required capacity during the establishment of a new pipe ensures that adequate bandwidth for the pipe is reserved. Once a link has exhausted its capacity, no new pipes can be set up through it. Setting up an end-to-end flow between two communicating nodes by reserving the required bandwidth from the link capacities ensures a contention free, QoS guaranteed route. The NoC Manager pseudo-code to identify a QoS guaranteeing pipe is presented in Algorithm 1.

One of the inputs to the algorithm is the flow graph of the NoC.


Algorithm 1 Identify Pipe. P: Pipe Stack.
1: Identify Pipe
Require: Input Network: {E_ij} = {..., {n_i, n_j, u_ij, v_ij, L_ij^used, L_ij^unused}, ...}, Source s, Destination d, Required Capacity: c
2: Residual Graph, RG = {E_ji} = {..., {n_j, n_i, v_ij}, ...}
3: EdgesCount: k ← 0
4: if s == d then
5:   push s onto {P}
6:   return true
7: else
8:   for all edges starting from d do
9:     if v_dj > c then
10:      if call Identify Pipe(RG, j, s, c) == true then
11:        push j onto {P}
12:        return true
13:      else
14:        pop from {P}
15:      end if
16:    end if
17:  end for
18: end if
19: return false
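For readers who prefer an executable form, the following is an illustrative Python rendering of Algorithm 1: a depth-first search over the residual graph that only follows edges with spare capacity of at least c, returning the identified pipe as a list of nodes. The thesis implementation is in Perl and additionally tracks per-edge label lists; the dictionary-based graph representation and the explicit visited set below are simplifying assumptions.

# Illustrative rendering of Algorithm 1 (the thesis implementation is in Perl).
# residual[node] is a list of (next_node, available_capacity) pairs; the search
# walks from the destination d back towards the source s, as in the pseudocode.
def identify_pipe(residual, s, d, c, visited=None):
    """Return a pipe [s, ..., d] whose edges all have spare capacity >= c, else None."""
    visited = {d} if visited is None else visited | {d}
    if s == d:                                   # lines 4-6: source reached
        return [s]
    for j, capacity in residual.get(d, []):      # lines 8-17: depth-first search
        if capacity >= c and j not in visited:
            sub = identify_pipe(residual, s, j, c, visited)
            if sub is not None:                  # lines 10-12: extend the pipe
                return sub + [d]
    return None                                  # line 19: no route with enough capacity

Once a pipe is found, the Manager would update the used and available capacities and assign labels (with swapping where needed) along it, as described next.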


An edge, E_ij, of the flow graph is represented as:

E_ij = {n_i, n_j, u_ij, v_ij, L_ij^used, L_ij^unused}     (6.1)

where edge E_ij connects node n_i to node n_j. Nodes can be traffic sources, traffic sinks, router input ports or router output ports. u_ij and v_ij are the utilized and available flow capacities of the edge respectively. L_ij^used is the list of labels used by pipes through E_ij. L_ij^unused is the list of labels available for assignment to future pipes. During the initialization stage, u_ij and L_ij^used are empty; v_ij holds the maximum capacity E_ij can support, and L_ij^unused is the list of all labels available for pipes through E_ij. Edges ending at the input ports of routers support a maximum capacity equal to the maximum number of pipes supported by the edge (2^lw = 16 in this work); these are the bottleneck edges. Edges not ending at input ports of routers are assigned infinite capacity and do not affect the final pipe route. The other inputs to the algorithm are the source and destination nodes, s and d, between which a pipe of requested capacity, c, has to be established.

The first step in the algorithm is to build a residual flow graph from the input flow graph. The residual flow network contains the same number of edges as the flow graph, with directions reversed: all edges E_ij change to E_ji in the residual graph. The residual graph stores only the available capacities of every link. The residual graph is used to identify a flow that satisfies the requested capacity, starting from the desired destination, d, and traversing the graph depth-first till the source, s, is found (Lines 8–17).

Starting from the destination node d, edges satisfying the requested capacity are searched for in the residual network. If an edge satisfies the requested capacity, it is stored in the pipe stack ({P}). The next edge is searched for from the terminating node of the previous edge (Line 10). If a node is reached from which no edge satisfies the requested capacity, the edge is popped out of the pipe stack (Line 14) and the search is backtracked. If the source is reached through edges servicing the requested capacity, the final edge is pushed onto the pipe stack and the flow has been identified (Lines 4–6).

After the pipe has been identified, the pipe stack {P} is used to update the used (L_ij^used) and available (L_ij^unused) flow capacities in the flow graph. During the configuration of the routing tables, label swapping is performed in intermediate routers wherever necessary.


An available label from the first router input node in {P} is used. Along the route through {P}, the L_ij^used list of each edge is checked for conflicts. If a conflict occurs, an unused label from the pipe is used. The routing table data structure at a node, n_i, can be represented as:

n_i, l_old → n_j, l_new     (6.2)

where l_old is the label of the pipe on the edge ending at n_i and l_new is the label of the pipe on the edge E_ij. The procedure is repeated for every node along the pipe. This data structure is used by the NoC Manager to update the routing tables along {P}. An example is presented in Appendix C.

6.3 Fault Tolerance in LS-NoC

Fault tolerance is built into LS-NoC. The steps taken by the LS-NoC Manager after a link is discovered to be faulty are listed below.

• After a link is recognized to be faulty, the LS-NoC Manager updates the network graph to reflect the health of the link.

• The link's capacity is updated to 0.

• Existing pipes through the link are invalidated. The pipes are re-identified and the routing tables are updated.

• Network graphs are updated as usual after the pipes are configured.

6.4 Overhead of NoC Manager

A detailed analysis of the amount of time spent in identifying pipes and updating routing tables in LS-NoC is presented in this section. The overhead of the NoC Manager comprises two components: computation and configuration. The computational overhead involves identifying a pipe using the flow based algorithm (Algorithm 1); the configuration overhead includes transmitting the routing table configuration over the network and updating the routing tables (Table 6.1).


Computational overhead involves identifying a pipe using the flow-based algorithm (Algorithm 1). Configuration overhead includes transmitting the routing table configuration over the network and updating routing tables (Table 6.1).

Table 6.1: NoC Manager Overhead.

              T_comp          T_conf      T_overhead
Single Pipe   35157 cycles    17 cycles   35174 cycles (35.2 µs)

6.4.1 Computational Latency

One of the issues with probe-based circuit establishment in prior works[41][19][55] is the unpredictability of circuit setup time. In this section we present an upper bound on pipe establishment time using the flow algorithm implemented in the NoC Manager. The NoC Manager can be implemented as a software or a hardware entity residing on the CMP or SoC. The LS router is equipped with an interface to read from/write into a single-port routing table memory structure (Figure 5.4). We assume that a NoC Manager process resides in the first node, as shown in Figure 5.2. The flow algorithm presented in Algorithm 1 is used to establish pipes between communicating nodes. The flow algorithm has complexity O(E·f), where E is the total number of edges in the network graph and f is the number of flows to be identified. The total number of edges, E, in the graph representation of a 2D mesh of degree d grows as 2 × d². The total number of flows to be established depends on the applications and the number of process nodes. The computational overhead of establishing a single pipe in an 8×8 LS-NoC is presented in Table 6.1. The algorithm was executed on a Cortex A8 processor (ARM v7 architecture, operating frequency = 1 GHz). 35157 cycles were spent on the identification of a single pipe. Identification of a single pipe involves building the residual network from the flow network graph, identifying a bandwidth-satisfying pipe between source and destination, and updating the current state of the network (Algorithm 1). As the number of steps involved in a pipe identification is the same in every case, the time taken to identify p pipes is p × (time taken to identify one pipe).
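As a quick worked figure: at the 1 GHz clock of the Cortex A8, 35157 cycles correspond to about 35.2 µs per pipe, so identifying, for instance, 20 pipes (an assumed pipe count, purely for illustration) would take roughly 20 × 35157 ≈ 7 × 10^5 cycles, i.e. about 0.7 ms, which remains small against the lifetime of a streaming application.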


6.4.2 Configuration Latency

The worst-case configuration overhead to update the routing table in the bottom-right corner of the LS-NoC is 17 cycles. In the case of the maximum pipe setup, where all ports of all routers in the LS-NoC have to be updated by the NoC Manager, T_conf = 5372 cycles. The time to configure in the maximum-pipes case is derived as follows:

T_conf = T_network + T_rt cycles    (6.3)

where T_network is the network latency to transmit routes over the network and T_rt is the time to update the routing tables in a router. In the worst case, where all routers have to be updated, and assuming a regular 2D mesh,

T_network = (deg + 1) × {deg × (deg − 1) / 2} cycles    (6.4)

T_rt = Size_rt × deg² cycles    (6.5)

where deg is the degree of the regular 2D mesh and Size_rt is the size of the routing table (the number of writes required to fill the routing table). In the current work, deg = 8 and Size_rt = 80 (16 per port × 5 ports), giving T_conf = 5372 cycles.

Streaming applications mapped onto a generic multi-core CMP will have a few communicating nodes and far fewer pipes to be set up between communicating entities. Configuration of a pipe is done by the Manager as pipes need to be established. Given that streaming processes run over a large time frame, the NoC Manager overhead is acceptable in most applications.
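Substituting the values used in this work into Equations 6.4 and 6.5 makes the 5372-cycle figure explicit: T_network = (8 + 1) × (8 × 7 / 2) = 9 × 28 = 252 cycles, T_rt = 80 × 8² = 5120 cycles, and hence T_conf = 252 + 5120 = 5372 cycles.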


6.4.3 Scalability of LS-NoC

The size of the network considered in our studies is an 8×8 chip multiprocessor. This is large enough to cover most large chip multiprocessors of today. Given the polynomial completion time (complexity: O(E·f)) of the pipe establishment algorithm, we envision that the time required for pipe establishment will not be prohibitive with respect to the application's lifetime. An example video surveillance streaming application is depicted in Figure 6.1. Such an application is on and works continuously for days. The implemented pipe establishment algorithm spends less than half a second to establish a pipe, and this delay is negligible considering the lifetime of the application. A centralized NoC Manager is adequate in most CMPs running streaming applications, as there is a fixed number of communicating entities and the SoC is not a dynamically growing system. In a standard SoC, loss of scalability may not be a serious concern.

Figure 6.1: Surveillance system (camera, transmission and a video computation server performing decode, low-level image segmentation and object recognition) showing the application of LS-NoC in the video computation server.

6.5 Number of Pipes in an NoC

The size of the routing table at each input port of the LS Router is 2^lw entries, where lw is the label width. Each source node can have a maximum of 2^lw unique pipes. Labels are unique per source. The total number of routing table entries, rt, in a p-port router is rt = p × 2^lw. In an r-router network, if all the routing table entries are full (the entire label space is utilized), the maximum number of pipes that can be set up is rt × r. In reality, the maximum depends on factors such as the network topology, communication pattern, bandwidth and latency guarantees, the current traffic scenario, the number of existing entries in routing tables (which depends on already established pipes) and the algorithms used to set up new pipes.
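With the parameters used in this work (lw = 4 and p = 5 ports), each router holds rt = 5 × 2^4 = 80 routing table entries, the same Size_rt = 80 used in Section 6.4.2, and the 64 routers of the 8×8 LS-NoC give an upper bound of rt × r = 80 × 64 = 5120 pipes before topology and traffic constraints reduce the practically achievable number.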


In the current work, multi-source, multi-sink max-flow algorithms[66][67] are used to identify (an upper bound on) the maximum possible number of pipes in the network. An example linear network with 2 routers (Figure 6.2(a)) is used to illustrate the maximum-pipes calculation. A communicating element is represented by a pair of source (S) and sink (K) nodes. The capacities of edges are a measure of the number of buffers at the router input port at the end of the link.

With no constraints applied, the maximum number of pipes, P_max, in an r-router linear network is

P_max = 2^lw × Σ_{i=0}^{r−1} C_i

where C_i is the number of communicating entities attached to router i. The maximum is reached when all pipes are set up to terminate in the originating router. For example, in R_0 (Figure 6.2(a)), all the 2^lw pipes originating from source 0 end in source 1, all pipes from source 1 end in source 2, and so on.

With constraints applied, the network graph can be modified and the maximum number of pipes can be obtained using the max-flow algorithm. Consider the constraint that a source can set up at most one pipe to a sink. Given that the source node connects to the router node with an edge of capacity 2^lw, the source node can be divided into 2^lw unique source nodes connecting to 2^lw router nodes with edge capacity = 1 (Figure 6.2(b)). The minimum and maximum number of pipes that can be established in the LS-NoC are shown in Section 6.5.1.

6.5.1 Minimum, Maximum and Typical Pipes in a Network

Figure 6.3(a) shows the number of pipes set up in an NoC with a label width of lw. The maximum number of pipes is obtained when all pipes are local to a router; pipes neither contend for nor share links in this case. The minimum number of pipes occurs when pipes originating from R_0 terminate at the farthest router R_(r−1) and no local pipes are allowed. Considering pipes in both to and fro directions, the minimum number of pipes equals 2 × 2^lw. For the 2-router network shown in Figure 6.2(a), the maximum number of pipes is 2 × 3 × 2^lw (2 routers × number of ports with sources × 2^lw).
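Working the same 2-router example through the expression for P_max: with three communicating entities per router (C_0 = C_1 = 3) and lw = 3, P_max = 2^3 × (3 + 3) = 48 pipes, the same count as the 2 × 3 × 2^3 figure above, while the corresponding minimum is 2 × 2^3 = 16 pipes.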


Figure 6.2: (a) A 2-router, 6-communicating-node linear network. (b) Multiple-source, multiple-sink flow calculation in a network.

The constraint curve shows the total number of pipes possible when every source can connect to a sink with at most one pipe. Most streaming application nodes will communicate with fewer than all the nodes, and a single QoS-guaranteed communication pipe between two nodes suffices. The width of the label can be chosen based on the number of pipes required to be set up with the single-pipe-per-destination constraint applied.

Figure 6.3(b) shows the total number of pipes set up in a 2D mesh network for labels of sizes 2, 3 and 4 bits. The total number of sources (or sinks) in the mesh is equal to Degree². The graph shows pipes calculated with the constraint that sources connect to unique sinks. The Max 4 bits curve shows the practical limit on the number of pipes that can be established in the mesh (Degree² × 2^lw). The results are a measure of routing table utilization when pipes are set up from every source to unique destinations. The difference between the number of established pipes on the Max 4 bits curve and the 4 bits curve indicates the number of non-utilized routing table entries in the NoC routers. Routing table entries in intermediate routers along pipes may remain non-utilized after sources have exhausted the label space. Non-utilized routing table slots in the NoC steadily increase from 3.1% (Degree 4) to 6.4% (Degree 14) as routes get longer and more intermediate routers have unutilized slots in them.

Figure 6.3: (a) Number of pipes in a linear network (Fig. 6.2(a)), lw = 3 bits, varying constraints. Constraint 1: at most 1 pipe per sink. (b) Maximum number of pipes in a 2D mesh (Fig. 5.2).

The label width to use in the NoC is decided after identifying the maximum number of nodes each node communicates with. Consider the tree-based 3-router, 12-node NoC of the real-time object recognition processor in Figure 5.1.


Using the tree-based NoC topology used in that work, and assuming every node communicates with all other nodes, 12 pipes per source need to be set up, i.e. a label size of 4 bits. The H.264 decoders presented in [143] and [144] have 8 blocks, so a label width of 4 bits suffices there as well. The current work uses lw = 4 in all experiments.

If identification of the source at the destination is mandatory, the width of the label has to be at least equal to the number of bits used to uniquely identify a source node. This requires assigning the source-id in place of the label at the last hop, which is internally supported in the routing table through the label swap field (Section 5.5.2).

6.6 Conclusion

The proposed LS-NoC services the QoS demands of streaming applications using a traffic-engineering-capable NoC Manager. The centrally managed, bandwidth-provisioned, traffic-engineering-capable Manager utilizes flow identification algorithms to identify contention-free, bandwidth-provisioned paths. Network visibility enables the NoC Manager to configure bounded-latency pipes in homogeneous and heterogeneous networks alike.


The flow identification algorithm takes into account the bandwidth available in individual links to establish QoS-guaranteed pipes. The algorithm allows sharing of physical links between pipes without compromising QoS guarantees. The flow-based pipe establishment algorithm is topology independent, and hence the NoC Manager supports applications mapped to both regular chip multiprocessors (CMPs) and customized SoCs with non-conventional NoC topologies. Link status enables the NoC Manager to establish pipes in a fault-tolerant manner. Doing away with node identification numbers in favor of labels reduces memory usage in routing tables. Traffic engineering and pipe-based communication channels are direct descendants of the Multi-Protocol Label Switching[37] knowledge base. The overhead of setting up a pipe by a software LS-NoC Manager executing on a Cortex A8 processor is 35174 cycles (up to 35.2 µs).

The Label Switched Network-on-Chip (LS-NoC) framework was designed with the view that pipe establishment and tear-down are rare events. The pipe establishment overhead incurred by the NoC Manager should not overshadow the QoS advantages gained by the LS-NoC framework. Streaming applications, with their fixed communication channels, fit this model perfectly. In principle, the LS-NoC can be used in all types of Systems-on-Chip and Chip Multi-Processors where functional blocks need to communicate during operation. Quality-of-Service can be guaranteed only in systems where communication channels are fixed over the lifetime of the application.

Communication channels in streaming applications do not change during the lifetime of the system except during link failures or reconfiguration. Ideally, pipe establishment is a one-time operation in a non-reconfigurable system. If the system allows reconfiguration of functional blocks, then the pipe establishment and tear-down procedures have to be executed at every reconfiguration step. In such a system, if the performance hit due to a software NoC Manager is unacceptable, a hardware NoC Manager is justified.

A Label Switching router was presented in the previous chapter. Evaluation of LS-NoC over example streaming applications, CBR and VBR traffic is demonstrated in the next chapter.


Chapter 7

LS NoC - Case Studies

We presented a Label Switching based Network-on-Chip (LS-NoC), motivated by the throughput guarantees offered by bandwidth reservation, in the previous chapter. LS-NoC contains a centralized management framework to engineer traffic into QoS-guaranteed routes. A multicast and broadcast capable label switched router for the LS-NoC has also been proposed. In the current chapter, bandwidth and latency guarantees of LS-NoC are demonstrated on streaming applications like HiperLAN/2 and the Object Recognition Processor, Constant Bit Rate traffic patterns and video decoder traffic representing Variable Bit Rate traffic. LS-NoC has been found to have a competitive Area×Power/Throughput figure of merit compared with state-of-the-art NoCs providing QoS. Circuit switching with link sharing abilities and support for asynchronous operation make LS-NoC a desirable choice for QoS servicing in CMPs.

Organization of the Chapter

The previous chapters presented the LS-NoC architecture and concepts. The current chapter evaluates the QoS guarantees offered by LS-NoC using streaming application case studies. Section 7.1 uses processes from HiperLAN/2 baseband processing and the Object Recognition Processor SoC mapped onto a generic CMP as a framework to evaluate LS-NoC QoS services. Constant and variable bit rate video traffic is used to evaluate LS-NoC in Section 7.2. A discussion of the concepts of LS-NoC and a comparison with existing works is presented in Section 7.3. The chapter concludes in Section 7.4.


7.1 HiperLAN/2 baseband processing + Object Recognition Processor SoC

The HiperLAN/2 baseband processing SoC (Figure 5.1(a))[7] and the real-time object recognition processor (Figure 5.1(b))[8] (Section 5.1) were mapped onto an 8×8 LS-NoC (Figure 7.1(a)). Application pipes were set up as shown in Figure 7.1(a) and Table 7.1.

Table 7.1: Pipes set up for the HiperLAN/2 baseband processing SoC and the Object Recognition Processor SoC (Figure 7.1(a)). PEC[0-7] → PEC[0-7]: every PEC communicates with every other PEC.

HiperLAN/2 baseband processing SoC:
S2P → PR → FOC → FFT → POC → CE → DeMap, S2P → Sync, Sync → POC

Object Recognition Processor:
PEC[0-7] → PEC[0-7], MP → PEC[0-7], MP → VAE, MP → Ext. I/f, MP → MA, VAE → PEC[0-7], MA → MP

The behaviour of pipe latencies in an 8×8 LS-NoC with a 64-node CMP running streaming applications is presented in Figure 7.2. The x-axis records the injection rates of non-streaming application nodes. The LS-NoC was configured to support the 64-node CMP shown in Figure 7.1(a). LS-NoC establishes higher-capacity pipes between communicating nodes of streaming applications. Resources are provisioned in pipes based on bandwidth requirements. A pipe might occupy a major portion or all of the available capacity of a link (Lines 18-19 in Algorithm 1). The demand capacity of a provisioned pipe (C_req) may be tuned based on the bandwidth requirements of the pipe. This ensures contention-free pipe setup and guaranteed bandwidth for the pipe. A non-provisioned pipe will share link resources equally with other pipes, resulting in increased latencies as the injection rate increases. The latency curves of non-provisioned pipes (labeled 'U') clearly fail to guarantee QoS compared to provisioned pipes. Variation in the injection rates of traffic generated by non-application nodes does not affect the latency of provisioned pipes.


Figure 7.1: (a) Process blocks of the HiperLAN/2 baseband processing SoC and the object recognition processor mapped onto an 8×8 LS-NoC. Pipe 1: PEC0 → PEC6, Pipe 2: MP → PEC3. (b) Flows set up for CBR and VBR traffic.

From the graph, the average latency of flits traversing a provisioned pipe is almost constant over varying injection rates of other source nodes. The average latency of packets in the network does not change with injection rate, owing to the reservation and provisioning of LS-NoC resources for the duration of the application. An aggregate bandwidth of 120 Gbits/s at the maximum injection rate satisfies the communication requirements of both the HiperLAN/2 and Object Recognition Processor applications.

7.2 Video Streaming Applications

LS-NoC has been tested on both Constant Bit Rate (1.55 Mbps and 55 Mbps) and Variable Bit Rate traffic. Flows for CBR and VBR traffic were set up assuming worst-case spatial separation between producer and consumer nodes (Figure 7.1(b)). Results for the CBR and VBR experiments are shown in Figure 7.3. Videos used in H.264 standards evaluation were used for the VBR experiments and are tabulated in Table 7.2. It is observed that the latency of CBR and VBR traffic is unaffected by varying injection rates of non-video sources.


Figure 7.2: Latency of HiperLAN/2 and ORP pipes in LS-NoC over varying injection rates of non-streaming application nodes. Latencies of non-provisioned paths are labeled (U).

Table 7.2: Standard test videos used in experiments.

Sl. No.   Video          Frames per Second   Frames Simulated
1         Bridge Close   10                  200
2         Flower         10                  250
3         Foreman        10                  300
4         Hall           10                  300

All flows in the LS-NoC are provisioned such that CBR/VBR traffic experiences the least contention at the routers. CBR/VBR traffic flow is guaranteed throughput and has deterministic latency in the LS-NoC.


Figure 7.3: (a) Latency of CBR traffic (55 Mbps and 11 Mbps streams) over various injection rates of non-streaming nodes in LS-NoC. (b) Latency of VBR traffic (Videos 1-4) over various injection rates of non-streaming nodes in LS-NoC.


7.3 Discussion

7.3.1 Design Philosophy of LS-NoC

The design philosophy of LS-NoC retains the advantages of both packet switched and circuit switched networks. Network visibility enables the NoC Manager to set up pipes in a congestion-free and fault-tolerant manner in homogeneous and heterogeneous networks. The traffic-engineering-capable LS-NoC Manager allows identification and configuration of contention-less, non-shortest, QoS-guaranteed communication channels. The NoC Manager has complete visibility of the state of the network (existing pipes, utilization of links, required capacity of the requested connection, remaining capacity of the links). This enables non-shortest, bandwidth-provisioned pipe setup in LS-NoC. The flow identification algorithm in the NoC Manager identifies a route which satisfies the bandwidth requirements of the pipe to be established. The flow identification algorithm takes into account the bandwidth available in individual links to establish QoS-guaranteed pipes. This allows sharing of physical links between pipes without compromising QoS guarantees. LS-NoC takes a centralized, traffic-analysis-based approach through the NoC Manager. The LS Router implements minimal buffers to support flow control operations. LS-NoC provides throughput guarantees irrespective of the spatial separation of communicating entities.

LS-NoC sets up communication channels (pipes) between communicating nodes that are independent of existing pipes and contention-free at routers. Establishing pipes based on bandwidth demands, independent of existing pipes, allows non-interference of traffic between pipes, and end-to-end latency is guaranteed. Adding new pipes incrementally does not affect the guaranteed performance of already established pipes. LS-NoC uses flow-based algorithms[66][67] (Algorithm 1) to identify pipes with sufficient bandwidth between communicating nodes. The pipe establishment algorithm is independent of network topology and hence is a generic solution for homogeneous and heterogeneous NoCs. The latency of establishing pipes using the flow-based algorithm in LS-NoC (Algorithm 1) depends only on the size of the NoC (the number of edges in the communication graph). The flow-based algorithm identifies a bandwidth-satisfying route between nodes. For a given LS-NoC, a definite upper bound on the time to establish a path is known.


The computational overhead of the pipe establishment algorithm is documented in Section 6.4.1.

The NoC Manager has knowledge of the established pipes in the network. New pipes are added along non-interfering and contention-free routes, thus guaranteeing end-to-end latency. LS-NoC allows pipes to share a single physical link as long as the bandwidth requirements of both pipes are met. In cases where bandwidth demands are not met, alternative routes with adequate bandwidth are identified. Both Æthereal and Octagon use static shortest-path routing algorithms oblivious to the current state of NoC traffic. LS-NoC borrows traffic engineering and label switching concepts from MPLS. The NoC Manager is aware of the state of the NoC, and its flow identification algorithm identifies bandwidth-guaranteed non-shortest paths when a shortest path is unavailable for a pipe. Label switching reduces metadata transfer in the NoC and the buffer requirement at routers. Circuit switching requires minimal arbitration, has higher energy efficiency, and consumes less area and power than a corresponding packet switched router[7]. Router ports reserved for a circuit in a circuit switched router cannot be utilized by other connections. The flow identification algorithm in the LS-NoC Manager supports multiple pipes on a physical link as long as the bandwidth requirements of the pipes are met. This increases the utilization of network resources while guaranteeing bandwidth to pipes.

Source nodes in the LS-NoC can work asynchronously, as cycle-level scheduling is not required in the LS Router. Bi-synchronous or mesochronous input buffers in the LS router enable multiple clock domain operation without globally synchronous clocks.

7.3.2 LS-NoC Application

For streaming applications, the results (Figures 7.2, 7.3) show that the LS-NoC guarantees predictable latency and guaranteed throughput. LS-NoC is suitable for applications whose communication patterns do not vary over the lifetime of the application and which require guaranteed throughput.

In CMPs and complex SoCs, LS-NoC can be used as a separate NoC to service applications requiring hard QoS guarantees. Network interfaces at the ingress of the NoC can be configured to identify traffic belonging to QoS classes.


Based on the type of the traffic being injected into the communication medium, either the conventional best effort NoC or the LS-NoC can be chosen. Multiple NoCs servicing individual classes of data have been present in commercially available multi-core chips[145]. The concept is illustrated in Figure 7.4.

Figure 7.4: LS-NoC being used alongside a best effort (BE) NoC; processing elements P attach to both NoCs through network interfaces NI.

7.3.3 LS-NoC Evaluation

Table 7.3 presents a comparative illustration of the proposed LS-NoC with a few proposed QoS NoCs. The link width of 256 bits for the current work was chosen to service peak bandwidth comparable with GDDR5 bandwidth, and hence the throughput is higher than that of the other designs.

Table 7.3: Evaluation of the proposed Label Switched Router and NoC. CS: Circuit switched, PS: Packet switched.

Reference (Router Type)          | Tech. (L), Voltage | Area (A) (mm²)  | Router Power (P) (mW) | Throughput (T) (per link, Gbps) | Energy Efficiency (Tb/s/W) | FoM    | Supports Async Clock Domains | Share Link?
This work (LS), 8×8 Mesh         | 130nm, 1.2V        | 0.431           | 43.08                 | 80                              | 1.85                       | 9.6    | Yes                          | Yes
Nexus [39]                       | 130nm, 1.2V        | 1.75            | –                     | 48.75                           | –                          | –      | Yes                          | Yes
SoCBUS [19]                      | 180nm              | 0.06            | –                     | 16                              | –                          | –      | No                           | –
SDM [23]                         | 1.2V               | 0.135           | 1.790                 | 0.64 (20 MHz)                   | 0.3575                     | 29.72  | No                           | Yes
Æthereal (CS) [51]               | 130nm, 1.2V        | 0.26            | –                     | 16                              | –                          | –      | No                           | Yes
8×10 Mesh (CS) [41]              | 45nm, 1.1V         | 0.030 (approx.) | 21 – 74               | 11.78                           | 0.159 – 0.560              | 188.46 | No                           | No
Realtime ORP (PS) [8]            | 130nm, 1.2V        | 0.2 (approx.)   | 46                    | 12.8                            | 0.278                      | 29.8   | Yes                          | Yes
HiperLAN/2 Baseband SoC (CS) [7] | 130nm, 1.2V        | 0.05 (est.)     | 17.2                  | 17.2                            | 1.0                        | 2.8    | No                           | No
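As a consistency check on the first row of Table 7.3, the energy efficiency follows directly from the tabulated throughput and power: 80 Gbps / 43.08 mW ≈ 1.86 Tb/s/W, in line with the listed 1.85 Tb/s/W.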


Each port in the LS-NoC contains 8 input buffers. Every buffer is 256 bits wide, contributing to the high area and power. The area numbers of the works in [41] and [8] were estimated from total chip area. Custom design and low buffer usage in these routers have brought down the area significantly. Nexus[39] implements a 16-port, 36-bit asynchronous crossbar, resulting in high area consumption.

Voltage scaling enables Intel's NoC[41] to operate in the 21 mW - 74 mW range. The area and power consumption of the ORP router[8] are similar to the LS Router owing to similar buffer area. The 7-port router has 32-bit input buffers with up to 10 buffers per input. The low power and low area of the HiperLAN/2 SoC router is due to the absence of input buffers and an arbitration unit in the circuit switched router design. LS-NoC has the maximum throughput owing to the high link width. LS-NoC also shows high energy efficiency due to high throughput at a nominal frequency of 312.5 MHz in 130nm. The results show that LS-NoC is the best design in terms of bits transmitted per watt consumed.

The normalized Figure of Merit (Area × Power / Throughput) is a technology-independent comparison parameter. The Figure of Merit (FoM) of the 180nm and 130nm designs is scaled to 45nm by multiplying by the ratio of the cubes of channel lengths, as shown below:

FoM = (Area × Power / Throughput) × (L_45³ / L_Tech³)

where L_Tech is the channel length of the corresponding technology. With constant voltage and frequency, power varies linearly with technology owing to the effects of capacitance. Length and breadth vary linearly, contributing a squared relation with respect to technology. Hence the ratio of cubed channel lengths normalizes the FoM in the expression above. FoM is a measure of the resources spent per bit transmitted (hence, the lower the better). The buffer-less, low-area design of the HiperLAN/2 router contributes to the low FoM number for that design. Such a buffer-less circuit switched router cannot accommodate multiple connections through a physical link. Resource reservation by a single circuit for the lifetime of the application may result in inefficient network utilization in such networks. LS-NoC fares fairly well owing to its high throughput, though the area cost is high.
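Working the scaling through for two of the table entries makes the FoM values concrete: for this work, (0.431 × 43.08 / 80) × (45/130)³ ≈ 0.0096, and for the ORP router, (0.2 × 46 / 12.8) × (45/130)³ ≈ 0.0298, consistent with the tabulated FoM figures of 9.6 and 29.8, which therefore appear to be reported in units of 10⁻³ mm²·mW/Gbps.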


The high power consumption of 74 mW in 45nm contributes to the maximum FoM number for Intel's circuit switched router. Globally synchronous designs demand power-hungry single clock distribution over the chip. Bi-synchronous or mesochronous FIFOs in the LS Router enable the network to operate with out-of-phase clocks. This saves the power spent in global clock distribution. One of the major features of packet switched networks borrowed into LS-NoC is link sharing by pipes without bandwidth compromise. Physical link sharing by multiple pipes results in higher network utilization than purely circuit switched techniques. From the tabulated results, the LS router has high energy efficiency and provides hard QoS guarantees at a reasonable power budget.

7.4 Conclusion

The proposed LS-NoC services the QoS demands of streaming applications using a traffic-engineering-capable NoC Manager. The centrally managed, bandwidth-provisioned, traffic-engineering-capable Manager utilizes flow identification algorithms to identify contention-free, bandwidth-provisioned paths. Network visibility enables the NoC Manager to configure bounded-latency pipes in homogeneous and heterogeneous networks alike. The flow identification algorithm takes into account the bandwidth available in individual links to establish QoS-guaranteed pipes. The algorithm allows sharing of physical links between pipes without compromising QoS guarantees. Link status enables the NoC Manager to establish pipes in a fault-tolerant manner. Doing away with node identification numbers in favor of labels reduces memory usage in routing tables. Traffic engineering and pipe-based communication channels are direct descendants of the Multi-Protocol Label Switching[37] knowledge base.

A Label Switching router has been designed, verified, synthesized, placed and routed, and timing analyzed. The Label Switched (LS) Router used in LS-NoC achieves a two-cycle traversal delay under no contention and is multicast and broadcast capable. A 5-port router with a 256-bit data bus, 4-bit label, 1-bit flow control and 8 buffers per input port occupies an area of 0.431 mm² in the 130nm Faraday library at the typical corner and operates at 312.5 MHz. The LS Router is estimated to consume 43.08 mW. LS-NoC has been evaluated over example streaming applications, CBR and VBR traffic, and QoS guarantees have been demonstrated.


The overhead of setting up a pipe by a software LS-NoC Manager executing on a Cortex A8 processor is 35174 cycles (up to 35.2 µs).


Chapter 8

Conclusion and Future Work

The work in this thesis presents methodologies for QoS-guaranteed NoC design, link microarchitecture exploration and optimal Chip Multiprocessor (CMP) tile configurations.

8.1 Link Microarchitecture Exploration

NoC design specifications can be met by varying a large number of system and circuit parameters. An SoC can be better optimized if low-level link parameters and architectural parameters such as pipelining, link width, wire pitch, supply voltage, operating frequency, router type, topology of the interconnection network, etc. are considered. A simulation framework developed in SystemC, able to explore NoC design through all the aforementioned parameters, was presented. The framework also allows co-simulation with models of the communicating entities along with the ICN.

The study presented in Section 3.3 on a 4x4 multi-core ICN for Mesh, Torus and Folded-torus topologies and the communication patterns of Dense Linear Algebra (DLA) and Sparse Linear Algebra (SLA) benchmarks indicates that there is an optimum degree of pipelining of the links which minimizes the average communication latency. There is also an optimum degree of pipelining which minimizes the energy-delay product. Such an optimum exists because increasing pipelining allows for shorter wire segments, which can be operated either faster or with lower power at the same speed.


We also find that the overall performance of the ICNs is determined by the lengths of the links needed to support the communication patterns. Thus the mesh seems to perform the best amongst the three topologies considered in this case study.

Another study (Section 3.4) uses 3 example topologies - a 16-node 2D Torus, a Tree network and a Reduced 2D Torus - to show the variation of latency, throughput and NoC power consumption over link pipelining configurations with voltage and frequency scaling. We find that, contrary to intuition, increasing pipeline depth can help reduce latency in absolute time units by allowing shorter links and hence a higher frequency of operation. In a 2D Torus, the least latency (1.56 times the minimum) is achieved when the longest link is pipelined by 4 stages, at which point power (40% of the maximum) and throughput (64% of the maximum) are nominal. Using frequency scaling experiments, power variations of up to 40%, 26.6% and 24% can be seen in the 2D Torus, Reduced 2D Torus and Tree-based NoC between various pipeline configurations achieving the same frequency at constant voltages. Also, in some cases we find that switching to a higher pipelining configuration can actually help reduce power, as the links can be designed with smaller repeaters. Larger NoC power savings can be achieved by voltage scaling along with frequency scaling. Hence it is important to include the link microarchitecture parameters as well as circuit parameters like supply voltage during the architecture design exploration of a NoC.

The studies also point to an overall optimization problem that exists in the architecture of the individual PEs versus the overall SoC, since smaller PEs lead to shorter links between PEs but more traffic, thus pointing to the existence of a sweet spot in terms of PE size.

8.2 Optimal CMP Tile Configuration

The effects of communication overheads on the performance, power and energy of a multiprocessor chip, using L1 and L2 cache sizes as the primary exploration parameters and using accurate interconnect, processor, on-chip and off-chip memory modeling, were studied in Chapter 4. Sapphire, a detailed multiprocessor execution environment integrating SESC, Ruby and DRAMSim, was used to run applications from the Splash2 benchmark suite (64K-point FFT).


A 4×4 2D Mesh formation of 16 out-of-order cores, each consisting of a private L1 and a unified L2 cache, was used as the case study in the exploration. Detailed low-level wire delay models from Intacte were used to calculate the power consumed by the interconnection network. The results indicate that ignoring link latencies can lead to large errors in estimates of program completion times, of up to 17%. Furthermore, architectural choices based on inaccurate interconnect estimates are not optimal, and the error is severe in cases where applications have heavy communication requirements.

Communication delays in wires adversely affect performance in a CMP due to the larger time spent in message transit. Small L1 caches (more misses) result in more off-tile accesses. Large L2 cache sizes increase individual tile area and result in longer interconnects. Performance-optimal configurations are achieved at smaller L1 caches and moderate L2 cache sizes due to higher operating frequencies, smaller link lengths and comparatively less communication. Using the minimal L1 cache size to operate at the highest frequency may not always be the performance-power optimal choice. Larger L1 sizes, despite a drop in frequency, offer an energy advantage due to less miss-induced communication.

Experimental results show that custom floorplanning after communication pattern analysis between tiles helps reduce power and increase performance in a chip multiprocessor. Clustered tile placement experiments for FFT (L1: 256KB and L2: 512KB) show a considerable performance-per-watt improvement (1.2%). Remapping the L2 banks most accessed by a process to the same core or neighbouring cores after communication traffic analysis offers power and performance advantages. Remapped processes and banks in clustered tile placement show a performance-per-watt improvement of 5.25% and an energy reduction of 2.53%. Non-conventional tile placements that place frequently communicating cores nearby can bring down communication latency and power significantly. Remapping threads and frequently accessed L2 banks closer in a conventional 2D Mesh shows a performance-per-watt improvement of 1.6%.

EPI and program completion time results indicate that minimum-energy cache configurations are not the same as minimum-execution-time configurations. This suggests that processors could execute a program in multiple modes, for example minimum energy or maximum performance.


A minimum-energy mode can be achieved using the dynamic voltage scaling facilities available in processors for each L1/L2 cache configuration. Adaptively changing L1/L2 sizes can be one of the ways to achieve maximum performance.

Level 1 and Level 2 cache sizes are important parameters in the design of a multicore chip. These sizes directly affect the tile area and have a huge bearing on the final cost of the chip. An architect faced with such an exploration problem also has to take into account the lengths, area and power consumed by interconnects between cores and the physical design of the chip. The power spent in data fetches from off-chip DRAM is also an important component of the overall power of the system. The estimated power of the system increases with the amount of off-chip communication in the CMP once DRAM power is accounted for. Analysis of such parameters requires the use of delay, area and power models of physical interconnects. Realistic traffic has to be obtained by executing appropriate benchmarks on detailed processor simulation environments. The Sapphire + Intacte + DRAMSim framework can be applied to architectural problems where the physical aspects of wires, placement of tiles and floorplans affect the performance and power consumption of a multicore chip.

8.3 Label Switched NoC for Streaming Applications

Streaming applications have deterministic communication patterns due to their pipelined nature of operation. Traffic engineering in LS-NoC guarantees QoS and delivers constant throughput in such applications. Existing priority-based QoS mechanisms for packet switched NoCs are ineffective when traffic in a single priority class increases. Resource reservation mechanisms in existing circuit switched networks suffer from inefficient network usage, non-deterministic circuit establishment times and routes oblivious to the current network state.

The proposed LS-NoC services the QoS demands of streaming applications using a traffic-engineering-capable NoC Manager. The centrally managed, bandwidth-provisioned, traffic-engineering-capable Manager utilizes flow identification algorithms to identify contention-free, bandwidth-provisioned paths.


Network visibility enables the NoC Manager to configure bounded-latency pipes in homogeneous and heterogeneous networks alike. The flow identification algorithm takes into account the bandwidth available in individual links to establish QoS-guaranteed pipes. The algorithm allows sharing of physical links between pipes without compromising QoS guarantees. Link status enables the NoC Manager to establish pipes in a fault-tolerant manner. Pipes are identified by labels unique to each source node. Doing away with node identification numbers in favor of labels reduces memory usage in routing tables. Traffic engineering and pipe-based communication channels are direct descendants of the Multi-Protocol Label Switching[37] knowledge base.

A Label Switching router has been designed, verified, synthesized, placed and routed, and timing analyzed. The Label Switched (LS) Router used in LS-NoC achieves a two-cycle traversal delay under no contention and is multicast and broadcast capable. Source nodes in the LS-NoC can work asynchronously, as cycle-level scheduling is not required in the LS Router. Bi-synchronous or mesochronous input buffers in the LS router enable multiple clock domain operation without globally synchronous clocks. This reduces power and enables flexible design of the clock tree. A 5-port router with a 256-bit data bus, 4-bit label, 1-bit flow control and 8 buffers per input port occupies an area of 0.431 mm² in the 130nm Faraday library at the typical corner and operates at 312.5 MHz. The LS Router is estimated to consume 43.08 mW. LS-NoC has been evaluated over example streaming applications, CBR and VBR traffic, and QoS guarantees have been demonstrated. The overhead of setting up a pipe by a software LS-NoC Manager executing on a Cortex A8 processor is 35174 cycles (up to 35.2 µs). The time required to identify multiple pipes is the product of the number of pipes and the individual pipe identification time. Given that streaming processes run over a large time frame and pipe configuration is a one-time process, this overhead is acceptable in most applications.

We envision the use of LS-NoC in general purpose CMPs where applications demand deterministic latencies and hard bandwidth requirements. LS-NoC can be used as a separate layer catering to applications requiring hard QoS guarantees. Based on the type of traffic being injected into the communication medium, either the conventional best effort NoC or the LS-NoC can be chosen.


8.4 Future Work

QoS-guaranteed NoCs are a wide research area with several directions for research, as future CMPs will impose increasingly demanding QoS requirements on NoCs. The label switched NoC is a step towards building a complete NoC solution for granting QoS to processes in CMPs and SoCs. We envision a multilayer NoC providing varying levels of QoS guarantees; an example 2-layer NoC was shown in Figure 7.4 of Chapter 7. The design of network interfaces for a multilayer NoC, the coding of required QoS levels inside packet headers, solutions to classify traffic into various QoS classes, network interfaces to feed NoCs with the relevant traffic, and load balancing in NoCs to maintain fairness and QoS guarantees are some of the interesting future directions that can be pursued after the LS-NoC design.


Appendix A

Interface and Outputs of the SystemC Framework

The SystemC framework was developed as a design space exploration tool for exploring communication infrastructure parameters: topology, routing policy, link length, wire width, pitch, pipelining, supply voltage and frequency. The overall architecture and the working of the tool are presented in Figures 3.1 and 3.2.

The framework helps chart the impact of various architectural and microarchitectural level parameters of the on-chip interconnection network elements on its power and performance. The framework also supports a flexible traffic generation and communication model.

In terms of the user interface, a configuration file is populated and read by the framework. The list of parameters supplied through the configuration file to the SystemC framework was presented in Table 3.1. Table A.1 presents a few of the configuration parameters with their default values.


Table A.1: ICN exploration framework parameters and their default values.

Parameter                     Default Value

NoC Parameters
Topology                      2D Mesh
Width of the Bus (in bits)    32
Phit Size                     Width of the Wire
Flit Size                     2
Mesh Rows                     4
Mesh Columns                  4
Routing                       Table Based
Switching Policy              Wormhole Switching

Traffic Parameters
Injection Rate                0.02
Traffic Type                  One-Way/Request Response
Traffic Pattern               DLA/SLA
Localization Factor           0.4

Individual Link Parameters
Length of the link            1mm
Wire Pitch                    Intacte specified
Pipelining                    4
Supply Voltage                1.1V
Frequency                     Obtained from Intacte


The user interface was otherwise terminal based, with a compiler script made available. The make script is reproduced here for reference.

# use gmake
TARGET_ARCH = linux

CC = g++

opt: OPT = -O
debug: OPT = -g

FLAGS = -Wall -Wno-deprecated

SOURCES = \
	Consumer.o \
	Producer.5.TBR.o \
	Router.Mesh.o \
	packet_format.mesh.tbr.o \
	top.mesh.o

PROGRAM = icn.mesh

INC = -I../ -I../parser -I.
INC += -I$(SYSTEMC)/include

.cpp.o:
	$(CC) $(OPT) $(FLAGS) $(INC) -c $<


clean:
	- rm *.o

An output file records the activity and coupling factors for the 5 outgoing links of each router. A snapshot of such an output file is shown below. The activity and coupling factors in the third column are all zero because this output file corresponds to the router in the second row, fourth column of a 4×4 2D Mesh.

INPUT ACTIVITY   0.0251406 0.0162656 0 0.0236562 0.0175156
OUTPUT ACTIVITY  0.0236406 0.0195625 0 0.0217344 0.0191406
INPUT COUPLING   0.0278281 0.0206875 0 0.02875 0.02025
OUTPUT COUPLING  0.0309375 0.0224219 0 0.0262187 0.0212344


Appendix B

Testing & Validation of LS-NoC

The design and implementation of a Label Switched NoC router were presented in Chapter 5. This appendix presents validation and testing details.

B.1 Implementation of the LS-NoC Router

The router design was shown in Figure 5.4 of Chapter 5. The router was implemented in Verilog. The router modules and their interactions are illustrated in Figure B.1. Each port of the router has its own FIFO and routing table blocks. Individual arbiters exist per output port. The input port bus contains data, label, valid and flow control bits. All input port buses are connected to the test bench, router_tb.

B.2 Testing and Validation of the LS-NoC Router

B.2.1 Individual Router

The router was rigorously tested in both individual and mesh environments. A 4-port LS-NoC router was connected to 4 sources and sinks as shown in Figure B.2. The first tests ensured proper transit of individual packets from all sources to all destinations. All metadata, including flow control signals, were in place during all tests. Stress tests done in this setting include:


Figure B.1: Modules in the LS-NoC router design (router, fifo_block, routing_table and router_arbiter), shown along with the testbench (router_tb); implemented in Verilog. Port buses carry data, label, valid and pause signals.

Figure B.2: Test cases used to verify an individual LS-NoC router.


• Individual sources sending data to a single destination in every cycle (Figure B.2(a)).

• Individual sources sending out data serially to all destinations in every clock cycle (Figure B.2(b)).

• Individual sources sending out data to destinations chosen at random every cycle (Figure B.2(c)).

B.2.2 Router in an 8×8 Mesh

Figure B.3 shows the 8×8 mesh used for the various test cases to verify and validate the design of the LS-NoC router.

Figure B.3: 8×8 mesh used for testing LS-NoC.

Figure B.4 shows the traffic test cases used to verify the functioning of the LS-NoC router. The label width was set at 4 bits, hence every node potentially had 16 destination nodes to send data to. All nodes attempt to send out traffic in every cycle. Traffic injection into the NoC is throttled by the flow control signal only. The test cases are:

• Figure B.4(a): Every node s in the 8×8 mesh sends data serially to all its possible destinations.


• Figure B.4(b): Every node s in the mesh sends data to one of the 16 possible destinations, choosing the destination at random.

• Figure B.4(c): Every node s sends out data to the same node through different routes.

Figure B.4: Traffic test cases used to verify proper functioning of the LS-NoC router.

B.3 Synthesis & Place and Route

The flowchart in Figure B.5 lists the operations performed during synthesis and place and route of the LS-NoC router. Synthesis parameters and results have already been listed in Table 5.4 and Table 5.5 respectively.

Synthesis of the functionally verified Verilog HDL design of the LS-NoC router was performed using Synopsys Design Compiler. Timing was analyzed and the cycle time was estimated from the FIFO at the input to the output buffers. The switching activity, timing and design constraints and the synthesized netlist are input to Cadence SOC Encounter for place and route. The placed and routed output of a single router is shown in Figure B.6. Area numbers were obtained from Encounter for the HDL designs along with the placed and routed (P & R) netlist and the parasitics (RC) extracted file (SPEF). The P & R netlist and the SPEF files are used by Synopsys Primetime PX for timing analysis and power calculations.


Figure B.5: Flowchart illustrating the synthesis and place & route steps of the LS-NoC router: the functionally verified Verilog HDL and the Faraday 130nm library (fsc0h*.db) feed synthesis in Synopsys Design Compiler; the synthesized netlist, Synopsys Design Constraints (SDC) and switching activity (VCD) feed place & route in SOC Encounter, producing the design area, P&R netlist and parasitics extraction file (SPEF); these are used by Synopsys Primetime PX for timing and power analysis.

Figure B.6: Placed and routed output of a single router.


Appendix C

The Flow Algorithm

This appendix presents the working of the flow algorithm with an example.

C.1 Ford-Fulkerson's MaxFlow Algorithm

Ford-Fulkerson's maxflow algorithm (Algorithm 1, Section 6.2) [67] is used by the LS-NoC Manager to identify pipes between communicating entities.

An example illustration of the MaxFlow algorithm is presented in Figure C.1. The steps are explained in the following list; a short code sketch of the search follows.

1. The network is represented as a set of nodes connected by unidirectional links. Each connecting link has a capacity associated with it. The capacity of a link represents the available bandwidth of the link between the communicating nodes.

2. Two flows, X→A→C→Y and X→B→C→Y, are set up between nodes X and Y. A capacity value of 1 is assumed to be used up by a flow in each link it passes through. Available capacity values are also shown in the graph. Links having at least one flow passing through them are drawn in dashed lines.

3. A residual network is constructed from the graph. Every unidirectional edge from the graph is replaced by two edges in opposite directions. The capacity of a forward link is equal to the available capacity of the link. The reverse link capacity is equal to the capacity reserved by flows.

4. Flows between the source and destination nodes are searched for in the residual network. If a flow exists in the residual network, it exists in the input graph too. The path is identified and added. In the illustration in Figures C.1(c) and C.1(d), a new flow X→A→C→B→D→E→Y is added and the residual network is rebuilt.

5. The algorithm terminates when no new flows can be identified between the nodes from the residual network.

Figure C.1: Steps in the flow algorithm example. (a) Input graph; maximum flows have to be identified between nodes X and Y. (b) Available capacities of links after flows X→A→C→Y and X→B→C→Y are set up. (c) Residual network showing available capacities of links in the forward direction and utilized capacity in the reverse. (d) Residual network after adding the flow X→A→C→B→D→E→Y. (e) Final output of the maxflow algorithm showing 3 flows from X to Y.
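To make the search concrete, the following is a minimal C++ sketch of the augmenting-path search outlined in steps 1-5, run on the link capacities of the Figure C.1 example. It finds unit-bandwidth pipes one at a time, as assumed in step 2, but omits the label bookkeeping of Section C.3; the node numbering and function names are ours and are not part of the LS-NoC Manager implementation.

```cpp
// Sketch of the max-flow (augmenting-path) search described above, using the
// link capacities of the Figure C.1 example. Node indices: X=0, A=1, B=2, C=3, D=4, E=5, Y=6.
// Each pipe is assumed to consume one unit of capacity on every link it traverses.
#include <cstdio>
#include <queue>
#include <vector>

const int N = 7;
int cap[N][N] = {};   // residual capacities: forward = free capacity, reverse = reserved

// Breadth-first search for an augmenting path from s to t in the residual network.
bool findAugmentingPath(int s, int t, std::vector<int>& parent) {
    std::vector<bool> visited(N, false);
    std::queue<int> q;
    q.push(s);
    visited[s] = true;
    parent[s] = -1;
    while (!q.empty()) {
        int u = q.front();
        q.pop();
        for (int v = 0; v < N; ++v) {
            if (!visited[v] && cap[u][v] > 0) {   // residual capacity left on u -> v
                visited[v] = true;
                parent[v] = u;
                if (v == t) return true;
                q.push(v);
            }
        }
    }
    return false;   // step 5: no augmenting path exists any more
}

int maxUnitPipes(int s, int t) {
    int pipes = 0;
    std::vector<int> parent(N);
    while (findAugmentingPath(s, t, parent)) {    // step 4: keep adding flows
        for (int v = t; v != s; v = parent[v]) {  // walk the path back from t to s
            int u = parent[v];
            cap[u][v] -= 1;   // one unit of forward capacity is reserved by the pipe
            cap[v][u] += 1;   // reverse edge records the reserved capacity (step 3)
        }
        ++pipes;
    }
    return pipes;
}

int main() {
    // Unidirectional links and capacities of the Figure C.1 input graph (step 1).
    cap[0][1] = 3;  // X -> A
    cap[0][2] = 1;  // X -> B
    cap[1][3] = 3;  // A -> C
    cap[2][3] = 5;  // B -> C
    cap[2][4] = 4;  // B -> D
    cap[3][6] = 2;  // C -> Y
    cap[4][5] = 2;  // D -> E
    cap[5][6] = 3;  // E -> Y
    std::printf("Unit-bandwidth pipes found from X to Y: %d\n", maxUnitPipes(0, 6));
    return 0;
}
```

Run on this graph, the sketch reports three pipes from X to Y, matching the final output shown in Figure C.1(e).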


C.2 Input Graph

Ford-Fulkerson's MaxFlow algorithm has been used to identify bandwidth-guaranteed pipes between communicating nodes in LS-NoC. The input to the algorithm is a network graph representing the communication channels in LS-NoC. An illustration of how the input graph is built, using a 2 router, 6 communication node scenario, is presented in this section.


Figure C.2: (a) A 2 router, 6 source+sink system used for validation of the LS-NoC router design. (b) Graph representation of the system used as input to the flow algorithm.

Figure C.2(a) shows the 2 router system. Each router has 4 input and output ports. The east port of router 0 is connected to the west port of router 1. All other ports are connected to source and sink nodes. Blanked-out links represent ingress links and shaded links represent egress links.

The graph representation of the 2 router system is shown in Figure C.2(b). The node R0S0 denotes the source attached to port 0 on router 0, and so on. R0_I0 is the node on the router's input port interface at the router side. R0_I0 connects to all sinks on router 0, viz., R0_D0, R0_D1 and R0_D2. Each of R0_I0, R0_I1 and R0_I2 also connects to the ingress port of router 1, R1_I4. R1_I4 connects to each of the sinks on R1: R1_D0, R1_D1 and R1_D2. The capacity numbers on links ending in the ingress ports of routers 0 and 1 are the capacities of the input buffers at those input ports. These are the bottleneck links, and the remaining links can be assigned infinite capacities. This example illustrates one half of the graph; an analogous graph is built from the source nodes in router 1 towards router 0. This is the input graph used by the NoC Manager to identify pipes between communicating nodes.
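As a rough sketch of how this half of the graph could be encoded for the flow algorithm, the capacity map below follows the node names of Figure C.2(b); the ingress-link capacity of 4 matches the edge definitions given in Section C.3, while modelling the unconstrained links with a single large constant is our own simplification.

```cpp
// Sketch: one half of the Figure C.2(b) input graph as a capacity map.
// Only links that end at a router ingress port carry a finite capacity
// (the input-buffer depth); all other links are given a very large capacity.
#include <map>
#include <string>

using Graph = std::map<std::string, std::map<std::string, int>>;

Graph buildHalfGraph() {
    const int BUF_CAP = 4;          // input-buffer depth, as in edges E0..E5
    const int INF_CAP = 1 << 20;    // stand-in for "infinite" capacity
    Graph g;

    // Sources on router 0 -> router 0 ingress ports (bottleneck links).
    g["R0S0"]["R0_I0"] = BUF_CAP;   // edge E0
    g["R0S1"]["R0_I1"] = BUF_CAP;   // edge E1
    g["R0S2"]["R0_I2"] = BUF_CAP;   // edge E2

    for (const char* in : {"R0_I0", "R0_I1", "R0_I2"}) {
        // Router 0 ingress ports -> local sinks on router 0.
        for (const char* sink : {"R0_D0", "R0_D1", "R0_D2"})
            g[in][sink] = INF_CAP;
        // Router 0 ingress ports -> ingress port of router 1; this link ends at an
        // ingress port, so it is again limited by the input-buffer depth.
        g[in]["R1_I4"] = BUF_CAP;
    }

    // Router 1 ingress port -> sinks on router 1.
    for (const char* sink : {"R1_D0", "R1_D1", "R1_D2"})
        g["R1_I4"][sink] = INF_CAP;

    return g;
}
```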


C.3 Edges in the Input Graph

The representation of an edge (Section 6.2) in the graph is:

E_ij = {n_i, n_j, u_ij, v_ij, L_ij^used, L_ij^unused}   (C.1)

where edge E_ij connects node n_i to node n_j. Nodes can be traffic sources, traffic sinks, or router input or output ports. u_ij and v_ij are the utilized and available flow capacities of the edge, respectively. L_ij^used is the list of labels used by pipes passing through E_ij, and L_ij^unused is the list of labels available for assignment to future pipes. During the initialization stage, u_ij and L_ij^used are null; v_ij holds the maximum capacity E_ij can support, and L_ij^unused is the list of all labels available for pipes through E_ij.

Example edges E0, ..., E5 are shown in the figure, where E0 is the edge connecting source S0 to the router interface I0 (node R0S0→R0_I0). E0 is represented as

E0 = {R0S0, R0_I0, 0, 4, {}, {0,...,3}}   (C.2)

where edge E0, of capacity 4, connects node R0S0 to R0_I0. All of its resources are free (utilized capacity = 0). The labels associated with the edge are {0,...,3} and none of them has been used yet. Similarly, edges E2 and E5 are represented as

E2 = {R0S2, R0_I2, 0, 4, {}, {0,...,3}}   (C.3)

E5 = {R0_I2, R1_I4, 0, 4, {}, {0,...,3}}   (C.4)

Figure C.3: The NoC after two pipes, P0 and P1, have been established. P0: R0S0→R1_D2 and P1: R0S2→R1_D0.
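A possible in-memory form of this edge record is sketched below; the struct, field and function names are ours, not taken from the LS-NoC sources.

```cpp
// Sketch of the per-edge record E_ij = {n_i, n_j, u_ij, v_ij, L_ij^used, L_ij^unused}.
#include <set>
#include <string>

struct Edge {
    std::string from;            // n_i : node the edge starts at
    std::string to;              // n_j : node the edge ends at
    int used;                    // u_ij: capacity already reserved by established pipes
    int avail;                   // v_ij: capacity still available for new pipes
    std::set<int> usedLabels;    // L_ij^used  : labels carried by pipes through this edge
    std::set<int> freeLabels;    // L_ij^unused: labels still free for future pipes
};

// E0 at initialization (Equation C.2): capacity 4, no pipes, all four labels free.
Edge E0 = {"R0S0", "R0_I0", 0, 4, {}, {0, 1, 2, 3}};

// Reserving one unit of bandwidth and one label when a pipe is routed over an edge.
void reservePipe(Edge& e, int label) {
    e.used += 1;
    e.avail -= 1;
    e.freeLabels.erase(label);
    e.usedLabels.insert(label);
}
```

Calling reservePipe(E0, 0) when pipe P0 is set up leaves E0 as {R0S0, R0_I0, 1, 3, {0}, {1, 2, 3}}, the state given in Equation C.5 below.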


Table C.1: Routing tables at the R0_I0, R0_I2 and R1_I4 nodes after pipes P0 and P1 have been set up.

    RT: R0_I0            RT: R0_I2            RT: R1_I4
    il   Route   ol      il   Route   ol      il   Route   ol
    0    R1_I4   0       ...  ...     ...     0    R1_D2   0
    ...  ...     ...     0    R1_I4   1       1    R1_D0   1

Figure C.3 presents the state of the NoC after two pipes, P0 and P1, have been established in the network. P0 connects R0S0→R1_D2 and P1 connects R0S2→R1_D0. The values of E0 and E2 are now:

E0 = {R0S0, R0_I0, 1, 3, {0}, {1,...,3}};  E2 = {R0S2, R0_I2, 1, 3, {0}, {1,...,3}}   (C.5)

The labels from R0S0 and R0S2 conflict at edge E5, which ends in node R1_I4. Label swapping is performed and the label on pipe P1 is aliased from 0 to 1 in the routing table at node R1_I4. Table C.1 presents a section of the routing table at node R1_I4.

Similar routing table configuration and label assignment operations are performed at all router input port nodes.
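A minimal sketch of the per-node routing table and the label-swap lookup implied by Table C.1 follows; the data-structure and function names are assumptions, not the thesis implementation.

```cpp
// Sketch: routing-table entries and label swapping at a router input-port node.
// Each entry maps an incoming label (il) to the next hop (Route) and an outgoing label (ol).
#include <map>
#include <string>
#include <utility>

struct RouteEntry {
    std::string nextHop;   // Route: node (sink or next ingress port) the data is sent to
    int outLabel;          // ol   : label the data carries on the outgoing edge
};

using RoutingTable = std::map<int, RouteEntry>;   // keyed by incoming label (il)

// Routing table at node R1_I4 after pipes P0 and P1 are set up (Table C.1).
// Both pipes would otherwise arrive with label 0, so P1 is aliased to label 1.
RoutingTable rtR1I4 = {
    {0, {"R1_D2", 0}},   // pipe P0: R0S0 -> R1_D2
    {1, {"R1_D0", 1}},   // pipe P1: R0S2 -> R1_D0, after the label swap
};

// Forwarding step: look up the incoming label, return the next hop and the swapped label.
std::pair<std::string, int> forward(const RoutingTable& rt, int inLabel) {
    const RouteEntry& entry = rt.at(inLabel);
    return {entry.nextHop, entry.outLabel};
}
```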


Bibliography

[1] W. J. Dally and B. Towles. Route packets, not wires: on-chip interconnection networks. In Design Automation Conference, 2001. Proceedings, pages 684–689, 2001.
[2] L. Benini and G. De Micheli. Networks on chips: a new SoC paradigm. Computer, 35(1):70–78, Jan 2002.
[3] Luca Benini and G. De Micheli, editors. Networks on Chips: Technology and Tools. Morgan Kaufmann, CA, USA, 2006.
[4] Axel Jantsch and Hannu Tenhunen, editors. Networks on Chip. Kluwer Academic Publishers, Hingham, MA, USA, 2003.
[5] Anand Raghunathan, Niraj K. Jha, and Sujit Dey. High-Level Power Analysis and Optimization. Kluwer Academic Publishers, Norwell, MA, USA, 1998.
[6] Krste Asanovic, Ras Bodik, Bryan Christopher Catanzaro, Joseph James Gebis, Parry Husbands, Kurt Keutzer, David A. Patterson, William Lester Plishker, John Shalf, Samuel Webb Williams, and Katherine A. Yelick. The landscape of parallel computing research: A view from Berkeley. Technical Report UCB/EECS-2006-183, EECS Department, University of California, Berkeley, Dec 2006.
[7] Pascal T. Wolkotte, Gerard J. M. Smit, Gerard K. Rauwerda, and Lodewijk T. Smit. An energy efficient reconfigurable circuit switched network-on-chip. In Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS 2005) - 12th Reconfigurable Architecture Workshop (RAW 2005), page 155a, 2005.


[8] Kwanho Kim, Joo-Young Kim, Seungjin Lee, Minsu Kim, and Hoi-Jun Yoo. A 76.8 Gb/s 46 mW low-latency network-on-chip for real-time object recognition processor. In Solid-State Circuits Conference, 2008. A-SSCC '08. IEEE Asian, pages 189–192, Nov. 2008.
[9] William J. Dally and John W. Poulton. Digital Systems Engineering. Cambridge University Press, New York, NY, USA, 1998.
[10] J. Duato, S. Yalamanchili, and L. Ni. Interconnection Networks: An Engineering Approach. Morgan Kaufmann Publishers, 2003.
[11] Tuomas Valtonen, Tero Nurmi, Jouni Isoaho, and Hannu Tenhunen. An autonomous error-tolerant cell for scalable network-on-chip architectures. In Proceedings of the 19th IEEE Nordic Event in ASIC Design (NorChip 2001), number 0, pages 198–203, Kista, Sweden, Nov 2001.
[12] Jian Liang, A. Laffely, S. Srinivasan, and R. Tessier. An architecture and compiler for scalable on-chip communication. Very Large Scale Integration (VLSI) Systems, IEEE Transactions on, 12(7):711–726, July 2004.
[13] Kuei-Chung Chang, Jih-Sheng Shen, and Tien-Fu Chen. Evaluation and design trade-offs between circuit-switched and packet-switched NoCs for application-specific SoCs. In Proceedings of the 43rd Annual Design Automation Conference, DAC '06, pages 143–148, New York, NY, USA, 2006. ACM.
[14] T. D. Richardson, C. Nicopoulos, D. Park, V. Narayanan, Yuan Xie, C. Das, and V. Degalahal. A hybrid SoC interconnect with dynamic TDMA-based transaction-less buses and on-chip networks. In VLSI Design, 2006. Held jointly with 5th International Conference on Embedded Systems and Design, 19th International Conference on, 8 pp., Jan. 2006.
[15] Jingcao Hu, Yangdong Deng, and Radu Marculescu. System-level point-to-point communication synthesis using floorplanning information. In Proc. ASP-DAC, pages 573–579, 2002.


[16] C. Hilton and B. Nelson. PNoC: a flexible circuit-switched NoC for FPGA-based systems. Computers and Digital Techniques, IEE Proceedings, 153(3):181–188, May 2006.
[17] D. Castells-Rufas, J. Joven, and J. Carrabina. A validation and performance evaluation tool for ProtoNoC. In System-on-Chip, 2006. International Symposium on, pages 1–4, Nov. 2006.
[18] A. Leroy, P. Marchal, A. Shickova, F. Catthoor, F. Robert, and D. Verkest. Spatial division multiplexing: a novel approach for guaranteed throughput on NoCs. In Proceedings of the 3rd IEEE/ACM/IFIP International Conference on Hardware/Software Codesign and System Synthesis, CODES+ISSS '05, pages 81–86, New York, NY, USA, 2005. ACM.
[19] Daniel Wiklund and Dake Liu. SoCBUS: Switched network on chip for hard real time embedded systems. In Proceedings of the 17th International Symposium on Parallel and Distributed Processing, IPDPS '03, pages 78.1–, Washington, DC, USA, 2003. IEEE Computer Society.
[20] K. Goossens, J. van Meerbergen, A. Peeters, and R. Wielage. Networks on silicon: combining best-effort and guaranteed services. In Design, Automation and Test in Europe Conference and Exhibition, 2002. Proceedings, pages 423–425, 2002.
[21] E. Beigne, F. Clermidy, P. Vivet, A. Clouard, and M. Renaudin. An asynchronous NoC architecture providing low latency service and its multi-level design framework. In Asynchronous Circuits and Systems, 2005. ASYNC 2005. Proceedings. 11th IEEE International Symposium on, pages 54–63, March 2005.
[22] D. Lattard, E. Beigne, F. Clermidy, Y. Durand, R. Lemaire, P. Vivet, and F. Berens. A reconfigurable baseband platform based on an asynchronous network-on-chip. Solid-State Circuits, IEEE Journal of, 43(1):223–235, Jan. 2008.


[23] A. Leroy, D. Milojevic, D. Verkest, F. Robert, and F. Catthoor. Concepts and implementation of spatial division multiplexing for guaranteed throughput in networks-on-chip. Computers, IEEE Transactions on, 57(9):1182–1195, Sept. 2008.
[24] P. P. Pande, C. S. Grecu, A. Ivanov, and R. A. Saleh. Switch-based interconnect architecture for future systems on chip. Proc. SPIE - Int. Soc. Opt. Eng. (USA), 5117:228–237, 2003.
[25] Jingcao Hu and Radu Marculescu. DyAD: smart routing for networks-on-chip. In Proceedings of the 41st Annual Design Automation Conference, DAC '04, pages 260–263, New York, NY, USA, 2004. ACM.
[26] Martti Forsell. A scalable high-performance computing solution for networks on chips. IEEE Micro, 22(5):46–55, September 2002.
[27] T. Bjerregaard and J. Sparso. A router architecture for connection-oriented service guarantees in the MANGO clockless network-on-chip. In Design, Automation and Test in Europe, 2005. Proceedings, pages 1226–1231 Vol. 2, March 2005.
[28] David Sigüenza-Tortosa, Tapani Ahonen, and Jari Nurmi. Issues in the development of a practical NoC: the Proteo concept. Integr. VLSI J., 38(1):95–105, October 2004.
[29] Evgeny Bolotin, Israel Cidon, Ran Ginosar, and Avinoam Kolodny. QNoC: QoS architecture and design process for network on chip. Journal of Systems Architecture, 50:105–128, 2004.
[30] P. Guerrier and A. Greiner. A generic architecture for on-chip packet-switched interconnections. In Design, Automation and Test in Europe Conference and Exhibition 2000. Proceedings, pages 250–256, 2000.
[31] E. Rijpkema, K. G. W. Goossens, A. Radulescu, J. Dielissen, J. van Meerbergen, P. Wielage, and E. Waterlander. Trade-offs in the design of a router with both guaranteed and best-effort services for networks on chip. IEE Proc.-Computer Digital Technology, 150(5):294–302, Sep 2003.


[32] Erno Salminen, Tero Kangas, Timo D. Hämäläinen, Jouni Riihimäki, Vesa Lahtinen, and Kimmo Kuusilinna. HIBI communication network for system-on-chip. J. VLSI Signal Process. Syst., 43(2-3):185–205, June 2006.
[33] B. Ahmad, Ahmet T. Erdogan, and Sami Khawam. Architecture of a dynamically reconfigurable NoC for adaptive reconfigurable MPSoC. In Proceedings of the First NASA/ESA Conference on Adaptive Hardware and Systems, AHS '06, pages 405–411, Washington, DC, USA, 2006. IEEE Computer Society.
[34] Faraydon Karim, Anh Nguyen, and Sujit Dey. An interconnect architecture for networking systems on chips. IEEE Micro, 22(5):36–45, September 2002.
[35] S. Blake, D. Black, M. Carlson, E. Davies, Z. Wang, and W. Weiss. An Architecture for Differentiated Service. United States, 1998.
[36] Amos E. Joel. Asynchronous Transfer Mode Switching. IEEE, 1993.
[37] E. Rosen, A. Viswanathan, and R. Callon. Multiprotocol Label Switching Architecture. RFC 3031, Jan 2001.
[38] M. Kim, D. Kim, and G. E. Sobelman. Network-on-chip quality-of-service through multiprotocol label switching. In Circuits and Systems, 2006. ISCAS 2006. Proceedings. 2006 IEEE International Symposium on, page 1843, 2006.
[39] A. Lines. Asynchronous interconnect for synchronous SoC design. Micro, IEEE, 24(1):32–41, Jan.-Feb. 2004.
[40] Michihiro Koibuchi, Kenichiro Anjo, Yutaka Yamada, Akiya Jouraku, and Hideharu Amano. A simple data transfer technique using local address for networks-on-chips. IEEE Trans. Parallel Distrib. Syst., 17(12):1425–1437, December 2006.
[41] M. A. Anders, H. Kaul, S. K. Hsu, A. Agarwal, S. K. Mathew, F. Sheikh, R. K. Krishnamurthy, and S. Borkar. A 4.1Tb/s bisection-bandwidth 560Gb/s/W streaming circuit-switched 8x8 mesh network-on-chip in 45nm CMOS. In Solid-State Circuits Conference Digest of Technical Papers (ISSCC), 2010 IEEE International, pages 110–111, Feb. 2010.


[42] E. Rijpkema, K. G. W. Goossens, A. Radulescu, J. Dielissen, J. van Meerbergen, P. Wielage, and E. Waterlander. Trade offs in the design of a router with both guaranteed and best-effort services for networks on chip. In Design, Automation, and Test in Europe, pages 10350–10355, 2003.
[43] S. Bell. TILE64 processor: A 64-core SoC with mesh interconnect. In Solid-State Circuits Conference Digest of Technical Papers (ISSCC), 2008 IEEE International, pages 88–89, Feb. 2008.
[44] T. Wiegand, G. J. Sullivan, G. Bjontegaard, and A. Luthra. Overview of the H.264/AVC video coding standard. Circuits and Systems for Video Technology, IEEE Transactions on, 13(7):560–576, July 2003.
[45] Thomas Canhao Xu, Alexander Wei Yin, Pasi Liljeberg, and Hannu Tenhunen. A study of 3D network-on-chip design for data parallel H.264 coding. Microprocess. Microsyst., 35(7):603–612, October 2011.
[46] R. Ho, K. Mai, and M. Horowitz. The future of wires. In Proc. of the IEEE, pages 490–504, April 2001.
[47] Tobias Bjerregaard and Shankar Mahadevan. A survey of research and practices of network-on-chip. ACM Comput. Surv., 38(1), June 2006.
[48] R. Marculescu, U. Y. Ogras, Li-Shiuan Peh, N. E. Jerger, and Y. Hoskote. Outstanding research problems in NoC design: System, microarchitecture, and circuit perspectives. Computer-Aided Design of Integrated Circuits and Systems, IEEE Transactions on, 28(1):3–21, Jan. 2009.
[49] Erno Salminen, Ari Kulmala, and Timo D. Hämäläinen. On network-on-chip comparison. Digital Systems Design, Euromicro Symposium on, 0:503–510, 2007.


[50] Erno Salminen, Ari Kulmala, and Timo D. Hämäläinen. Survey of Network-on-chip Proposals. White Paper, pages 1–13, March 2008.
[51] E. Rijpkema, K. Goossens, A. Radulescu, J. Dielissen, J. van Meerbergen, P. Wielage, and E. Waterlander. Trade-offs in the design of a router with both guaranteed and best-effort services for networks on chip. Computers and Digital Techniques, IEE Proceedings, 150(5):294–302, Sept. 2003.
[52] Xiaowen Wu, Yilang Wu, Ling Wang, and Xiaoqing Yang. QoS router with both soft and hard guarantee for network-on-chip. In NORCHIP, 2009, pages 1–6, Nov. 2009.
[53] Y. Salah and R. Tourki. Design and FPGA implementation of a QoS router for networks-on-chip. In Next Generation Networks and Services (NGNS), 2011 3rd International Conference on, pages 84–89, Dec. 2011.
[54] T. Bjerregaard and J. Sparso. Scheduling discipline for latency and bandwidth guarantees in asynchronous network-on-chip. In Asynchronous Circuits and Systems, 2005. ASYNC 2005. Proceedings. 11th IEEE International Symposium on, pages 34–43, March 2005.
[55] Jin Ouyang and Yuan Xie. LOFT: A high performance network-on-chip providing quality-of-service support. In Proceedings of the 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture, MICRO '43, pages 409–420, Washington, DC, USA, 2010. IEEE Computer Society.
[56] R. Stefan and K. Goossens. A TDM slot allocation flow based on multipath routing in NoCs. Microprocessors and Microsystems, 35(2):130–138, 2011. Special issue on Network-on-Chip Architectures and Design Methodologies.
[57] Edwin Rijpkema, Kees Goossens, and Paul Wielage. A router architecture for networks on silicon. In Proceedings of Progress 2001, 2nd Workshop on Embedded Systems, pages 181–188, 2001.


[58] Andreas Hansson, Mahesh Subburaman, and Kees Goossens. aelite: A flit-synchronous network on chip with composable and predictable services. In Proceedings of the Design, Automation & Test in Europe Conference and Exhibition, Los Alamitos, April 2009. IEEE Computer Society Press.
[59] Radu Stefan, Anca Molnos, and Kees Goossens. dAElite: A TDM NoC supporting QoS, multicast, and fast connection set-up. IEEE Transactions on Computers, 99(PrePrints), 2012.
[60] D. G. Messerschmitt. Synchronization in digital system design. IEEE J. Sel. A. Commun., 8(8):1404–1419, September 2006.
[61] Arnab Banerjee, Pascal T. Wolkotte, Robert D. Mullins, Simon W. Moore, and Gerard J. M. Smit. An energy and performance exploration of network-on-chip architectures. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 17(3):319–329, March 2009.
[62] Z. Joseph Yang, A. Kumar, and Yajun Ha. An area-efficient dynamically reconfigurable spatial division multiplexing network-on-chip with static throughput guarantee. In Field-Programmable Technology (FPT), 2010 International Conference on, pages 389–392, Dec. 2010.
[63] Fayez Gebali, Haytham Elmiligi, and Mohamed Watheq El-Kharashi. Networks-on-Chips: Theory and Practice. CRC Press, Inc., Boca Raton, FL, USA, 1st edition, 2009.
[64] A. K. Lusala and J.-D. Legat. A SDM-TDM based circuit-switched router for on-chip networks. In Reconfigurable Communication-centric Systems-on-Chip (ReCoSoC), 2011 6th International Workshop on, pages 1–8, June 2011.
[65] M. Millberg, E. Nilsson, R. Thid, and A. Jantsch. Guaranteed bandwidth using looped containers in temporally disjoint networks within the Nostrum network on chip. In Design, Automation and Test in Europe Conference and Exhibition, 2004. Proceedings, volume 2, pages 890–895 Vol. 2, Feb. 2004.


[66] N. Megiddo. Optimal flows in networks with sources and sinks. Mathematical Programming, 7:97–107, 1974.
[67] L. R. Ford and D. R. Fulkerson. A simple algorithm for finding maximal network flows and an application to the Hitchcock problem. Canadian Journal of Mathematics, 9:210–218, 1957.
[68] P. P. Pande, C. Grecu, M. Jones, A. Ivanov, and R. Saleh. Performance evaluation and design trade-offs for network-on-chip interconnect architectures. IEEE Transactions on Computers, 54:1025–1040, August 2005.
[69] R. Kumar, V. Zyuban, and D. M. Tullsen. Interconnections in multi-core architectures: Understanding mechanisms, overheads and scaling. In Proc. of Computer Architecture, ISCA '05, 32nd International Symposium on, pages 408–418, 2005.
[70] T. Kogel et al. A modular simulation framework for architectural exploration of on-chip interconnection networks. In Proc. of Hardware/Software Codesign and System Synthesis, 2003 (CODES+ISSS '03), Intl. Conf. on, pages 338–351, October 2003.
[71] Hang-Sheng Wang, Xinping Zhu, Li-Shiuan Peh, and Sharad Malik. Orion: A power-performance simulator for interconnection networks. In Proc. of MICRO 35, 2002.
[72] R. Balasubramonian, N. Muralimanohar, K. Ramani, and V. Venkatachalapthy. Microarchitectural wire management for performance and power in partitioned architectures. In Proc. of High-Performance Computer Architecture, HPCA-11, 11th International Symposium on, pages 28–39, February 2005.
[73] A. Courtey, O. Sentieys, J. Laurent, and N. Julien. High-level interconnect delay and power estimation. Journal of Low Power Electronics, 4:1–13, 2008.
[74] L. Carloni, A. B. Kahng, S. Muddu, A. Pinto, K. Samadi, and P. Sharma. Interconnect modeling for improved system-level design optimization. In Design Automation Conference, 2008. ASPDAC 2008. Asia and South Pacific, pages 258–264, March 2008.


[75] Rahul Nagpal, Arvind Madan, Bharadwaj Amrutur, and Y. N. Srikant. Intacte: an interconnect area, delay, and energy estimation tool for microarchitectural explorations. In CASES '07: Proceedings of the 2007 International Conference on Compilers, Architecture, and Synthesis for Embedded Systems, pages 238–247, New York, NY, USA, 2007. ACM.
[76] A. B. Kahng, Bin Li, Li-Shiuan Peh, and K. Samadi. ORION 2.0: A power-area simulator for interconnection networks. Very Large Scale Integration (VLSI) Systems, IEEE Transactions on, 20(1):191–196, Jan. 2012.
[77] A. Kumar, P. Kundu, A. P. Singh, L.-S. Peh, and N. K. Jha. A 4.6Tbits/s 3.6GHz single-cycle NoC router with a novel switch allocator in 65nm CMOS. In Computer Design, 2007. ICCD 2007. 25th International Conference on, pages 63–70, Oct. 2007.
[78] http://www.synopsys.com/products/libertyccs/libertyccs.html. Liberty File Format.
[79] LEF/DEF exchange format. http://openeda.si2.org/projects/lefdef.
[80] ITRS. International Technology Roadmap for Semiconductors, 2011.
[81] http://www.eas.asu.edu/~ptm/. Predictive technology models.
[82] R. Nagpal, M. Arvind, Y. N. Srikanth, and B. Amrutur. Intacte: Tool for interconnect modelling. In Proc. of 2007 Intl Conf. on Compilers, Architecture and Synthesis for Embedded Systems (CASES 2007), pages 238–247, 2007.
[83] K. M. Buyuksahin and F. N. Najm. High-level power estimation with interconnect effects. In Low Power Electronics and Design, 2000. ISLPED '00. Proceedings of the 2000 International Symposium on, pages 197–202, 2000.


[84] B. S. Landman and R. L. Russo. On a pin versus block relationship for partitions of logic graphs. IEEE Transactions on Computers, 20:1469–1479, 1971.
[85] Seung Eun Lee and Nader Bagherzadeh. A high level power model for Network-on-Chip (NoC) router. Computers & Electrical Engineering, 35(6):837–845, 2009. High Performance Computing Architectures, HPCA.
[86] P. Gupta, L. Zhong, and N. K. Jha. A high-level interconnect power model for design space exploration. In Proc. of Computer Aided Design (ICCAD '03), Intl. Conf. on, pages 551–558, 2003.
[87] Seung Eun Lee, Jun Ho Bahn, and Nader Bagherzadeh. Design of a feasible on-chip interconnection network for a chip multiprocessor (CMP). In Proc. of Computer Architecture and High Performance Computing, Intl. Symp. on, pages 211–218, 2007.
[88] Hang-Sheng Wang, Li-Shiuan Peh, and S. Malik. A power model for routers: modeling Alpha 21364 and InfiniBand routers. Micro, IEEE, 23(1):26–35, Jan/Feb 2003.
[89] Hangsheng Wang, Li-Shiuan Peh, and S. Malik. A technology-aware and energy-oriented topology exploration for on-chip networks. In Design, Automation and Test in Europe, 2005. Proceedings, pages 1238–1243 Vol. 2, March 2005.
[90] Eun Jung Kim, Greg M. Link, Ki Hwan Yum, N. Vijaykrishnan, Mahmut Kandemir, Mary J. Irwin, and Chita R. Das. A holistic approach to designing energy-efficient cluster interconnects. IEEE Trans. Comput., 54(6):660–671, June 2005.
[91] V. Soteriou, N. Eisley, Hangsheng Wang, Bin Li, and Li-Shiuan Peh. Polaris: A system-level roadmapping toolchain for on-chip interconnection networks. Very Large Scale Integration (VLSI) Systems, IEEE Transactions on, 15(8):855–868, Aug. 2007.
[92] N. Choudhary, M. S. Gaur, and V. Laxmi. Irregular NoC simulation framework: IrNIRGAM. In Emerging Trends in Networks and Computer Communications (ETNCC), 2011 International Conference on, pages 1–5, April 2011.


[93] Noxim - the NoC simulator. http://noxim.sourceforge.net.
[94] NNSE: The Nostrum NoC simulation environment. http://www.ict.kth.se/nostrum/NNSE.
[95] D. Brooks, V. Tiwari, and M. Martonosi. Wattch: a framework for architectural-level power analysis and optimizations. In Computer Architecture, 2000. Proceedings of the 27th International Symposium on, pages 83–94, June 2000.
[96] T. Austin, E. Larson, and D. Ernst. SimpleScalar: an infrastructure for computer system modeling. Computer, 35(2):59–67, Feb 2002.
[97] Milo M. K. Martin, Daniel J. Sorin, Bradford M. Beckmann, Michael R. Marty, Min Xu, Alaa R. Alameldeen, Kevin E. Moore, Mark D. Hill, and David A. Wood. Multifacet's general execution-driven multiprocessor simulator (GEMS) toolset. SIGARCH Comput. Archit. News, 33(4):92–99, November 2005.
[98] Seung Eun Lee, Jun Ho Bahn, Yoon Seok Yang, and Nader Bagherzadeh. A generic network interface architecture for a networked processor array (NePA). In Proceedings of the 21st International Conference on Architecture of Computing Systems, ARCS '08, pages 247–260, Berlin, Heidelberg, 2008. Springer-Verlag.
[99] Yan Luo, Jun Yang, L. N. Bhuyan, and Li Zhao. NePSim: a network processor simulator with a power evaluation framework. Micro, IEEE, 24(5):34–44, Sept.-Oct. 2004.
[100] S. Huang, Y. Luo, and W. Feng. Modeling and analysis of power in multicore network processors. In Parallel and Distributed Processing, 2008. IPDPS 2008. IEEE International Symposium on, pages 1–8, April 2008.


[101] Michael Huang, J. Renau, Seung-Moon Yoo, and J. Torrellas. L1 data cache decomposition for energy efficiency. In Low Power Electronics and Design, International Symposium on, 2001, pages 10–15, 2001.
[102] Aparna Mandke, Keshavan Varadarajan, Basavaraj Talwar, Bharadwaj Amrutur, and Y. N. Srikant. Sapphire: A framework to explore power/performance implications of tiled architecture on chip multicore platform. Technical Report IISc-CSA-TR-2010-03, CSA, IISc, July 2010.
[103] Jose Renau, Basilio Fraguela, James Tuck, Wei Liu, Milos Prvulovic, Luis Ceze, Smruti Sarangi, Paul Sack, Karin Strauss, and Pablo Montesinos. SESC simulator, January 2005. http://sesc.sourceforge.net.
[104] David Wang, Brinda Ganesh, Nuengwong Tuaycharoen, Katie Baynes, Aamer Jaleel, and Bruce Jacob. DRAMsim: A memory-system simulator. SIGARCH Computer Architecture News, 33(4):100–107, September 2005.
[105] J. Nimmy, C. Ramesh Reddy, K. Varadarajan, M. Alle, A. Fell, S. K. Nandy, and R. Narayan. REDEFINE Reconnect: A NoC for polymorphic ASICs using a low overhead single cycle router. In Application-Specific Systems, Architectures and Processors, 2008. ASAP 2008. International Conference on, pages 251–256, July 2008.
[106] Bradford M. Beckmann and David A. Wood. Managing wire delay in large chip-multiprocessor caches. In IEEE/ACM International Symposium on Microarchitecture, pages 319–330. IEEE Computer Society, 2004.
[107] Changkyu Kim, Doug Burger, and Stephen W. Keckler. An adaptive, non-uniform cache structure for wire-delay dominated on-chip caches. SIGOPS Oper. Syst. Rev., 36(5):211–222, October 2002.
[108] B. M. Beckmann and D. A. Wood. TLC: transmission line caches. In Microarchitecture, 2003. MICRO-36. Proceedings. 36th Annual IEEE/ACM International Symposium on, pages 43–54, Dec. 2003.


[109] Changkyu Kim, Doug Burger, and Stephen W. Keckler. An adaptive, non-uniform cache structure for wire-delay dominated on-chip caches. In ACM SIGPLAN Notices, pages 211–222. ACM, 2002.
[110] Changkyu Kim, D. Burger, and S. W. Keckler. Nonuniform cache architectures for wire-delay dominated on-chip caches. Micro, IEEE, 23(6):99–107, Nov.-Dec. 2003.
[111] A. Bardine, M. Comparetti, P. Foglia, G. Gabrielli, and C. A. Prete. Impact of on-chip network parameters on NUCA cache performances. Computers Digital Techniques, IET, 3(5):501–512, September 2009.
[112] J.-M. Parcerisa, J. Sahuquillo, A. González, and J. Duato. On-chip interconnects and instruction steering schemes for clustered microarchitectures. Parallel and Distributed Systems, IEEE Transactions on, 16(2):130–144, February 2005.
[113] A. Aggarwal and M. Franklin. Instruction replication for reducing delays due to inter-PE communication latency. Computers, IEEE Transactions on, 54(12):1496–1507, Dec. 2005.
[114] Joan Manuel Parcerisa and Antonio González. Reducing wire delay penalty through value prediction. In International Symposium on Microarchitecture, pages 317–326, 2000.
[115] M. Zhang and K. Asanovic. Victim replication: maximizing capacity while hiding wire delay in tiled chip multiprocessors. In Computer Architecture, 2005. ISCA '05. Proceedings. 32nd International Symposium on, pages 336–345, 2005.
[116] Bradford M. Beckmann, Michael R. Marty, and David A. Wood. ASR: Adaptive selective replication for CMP caches. In Microarchitecture, 2006. MICRO-39. 39th Annual IEEE/ACM International Symposium on, pages 443–454, Dec. 2006.
[117] Freddy Gabbay. Speculative execution based on value prediction. Technical Report EE Department TR 1080, Technion - Israel Institute of Technology, 1996.


[118] J. González and A. González. Memory address prediction for data speculation. Technical Report UPC-DAC-1996-50, Univ. Politècnica de Catalunya, Spain, 1996.
[119] Xin Jia and Ranga Vemuri. Using GALS architecture to reduce the impact of long wire delay on FPGA performance. In ASP-DAC '05: Proceedings of the 2005 Asia and South Pacific Design Automation Conference, pages 1260–1263, New York, NY, USA, 2005. ACM.
[120] S. W. Keckler, D. Burger, C. R. Moore, R. Nagarajan, K. Sankaralingam, V. Agarwal, M. S. Hrishikesh, N. Ranganathan, and P. Shivakumar. A wire-delay scalable microprocessor architecture for high performance systems. In Solid-State Circuits Conference, 2003. Digest of Technical Papers. ISSCC. 2003 IEEE International, pages 168–169 Vol. 1, 2003.
[121] Z. Guz, I. Keidar, A. Kolodny, and U. C. Weiser. Nahalal: Cache organization for chip multiprocessors. Computer Architecture Letters, 6(1):21–24, January-June 2007.
[122] Naveen Muralimanohar, Rajeev Balasubramonian, and Norman P. Jouppi. Architecting efficient interconnects for large caches with CACTI 6.0. IEEE Micro, 28(1):69–79, 2008.
[123] G. Konstadinidis, M. Rashid, P. F. Lai, Y. Otaguro, Y. Orginos, S. Parampalli, M. Steigerwald, S. Gundala, R. Pyapali, L. Rarick, I. Elkin, Yuefei Ge, and I. Parulkar. Implementation of a third-generation 16-core 32-thread chip-multithreading SPARC processor. In Solid-State Circuits Conference, 2008. ISSCC 2008. Digest of Technical Papers. IEEE International, pages 84–597, Feb. 2008.
[124] K. Lee, Se-Joong Lee, and Hoi-Jun Yoo. Low-power network-on-chip for high-performance SoC design. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 14:148–160, February 2006.
[125] http://www.ocpip.org/socket/systemc/. OCP-IP, SystemC OCP models.


[126] http://www.systemc.org/. Open SystemC Initiative.
[127] Hang-Sheng Wang, X. Zhu, Li-Shiuan Peh, and S. Malik. Orion: A power-performance simulator for interconnection networks. In Proceedings of the 35th Annual ACM/IEEE International Symposium on Microarchitecture, pages 294–305, 2002.
[128] S. R. Vangal, J. Howard, G. Ruhl, S. Dighe, H. Wilson, J. Tschanz, D. Finan, A. Singh, T. Jacob, S. Jain, V. Erraguntla, C. Roberts, Y. Hoskote, N. Borkar, and S. Borkar. An 80-tile sub-100-W TeraFLOPS processor in 65-nm CMOS. IEEE Journal of Solid-State Circuits, 43:29–41, January 2008.
[129] Steven Cameron Woo, Moriyoshi Ohara, Evan Torrie, Jaswinder Pal Singh, and Anoop Gupta. The SPLASH-2 programs: characterization and methodological considerations. In ISCA '95: Proceedings of the 22nd Annual International Symposium on Computer Architecture, pages 24–36, New York, NY, USA, 1995. ACM.
[130] V. Yalala, D. Brasili, D. Carlson, A. Hughes, A. Jain, T. Kiszely, K. Kodandapani, A. Varadharajan, and T. Xanthopoulos. A 16-core RISC microprocessor with network extensions. In Solid-State Circuits Conference, 2006. ISSCC 2006. Digest of Technical Papers. IEEE International, pages 305–314, Feb. 2006.
[131] S. Rusu, S. Tam, H. Muljono, D. Ayers, J. Chang, B. Cherkauer, J. Stinson, J. Benoit, R. Varada, J. Leung, R. D. Limaye, and S. Vora. A 65-nm dual-core multithreaded Xeon processor with 16-MB L3 cache. Solid-State Circuits, IEEE Journal of, 42(1):17–25, 2007.
[132] J. L. Shin, K. Tam, D. Huang, B. Petrick, H. Pham, Changku Hwang, Hongping Li, A. Smith, T. Johnson, F. Schumacher, D. Greenhill, A. S. Leon, and A. Strong. A 40nm 16-core 128-thread CMT SPARC SoC processor. In Solid-State Circuits Conference Digest of Technical Papers (ISSCC), 2010 IEEE International, pages 98–99, Feb. 2010.


[133] Ron Ho, Kenneth W. Mai, and Mark A. Horowitz. The future of wires. In Proceedings of the IEEE, pages 490–504, 2001.
[134] http://www.hpl.hp.com/research/cacti. HP Labs: CACTI.
[135] http://quid.hpl.hp.com:9081/cacti. CACTI 5.3, rev 174.
[136] Niket Agarwal, Li-Shiuan Peh, and Niraj Jha. GARNET: A detailed interconnection network model inside a full-system simulation framework. Technical Report CE-P08-001, Princeton University, 2008.
[137] G. K. Konstadinidis, M. Tremblay, S. Chaudhry, M. Rashid, P. F. Lai, Y. Otaguro, Y. Orginos, and S. Parampalli. Architecture and physical implementation of a third-generation 65 nm, 16 core, 32 thread chip-multithreading SPARC processor. IEEE Journal of Solid-State Circuits, 44(1):7–17, 2009.
[138] Dejan Markovic, Vladimir Stojanovic, Borivoje Nikolic, Mark A. Horowitz, and Robert W. Brodersen. Methods for true energy-performance optimization. IEEE Journal of Solid-State Circuits, 39(8):1282–1293, August 2004.
[139] ETSI. Broadband Radio Access Networks (BRAN); HiperLAN Type 2; Physical (PHY) Layer. ETSI TS 101 475, V 1.2.2, 2001.
[140] Paul M. Heysters, Gerard K. Rauwerda, and Gerard J. M. Smit. Implementation of a HiperLAN/2 receiver on the reconfigurable Montium architecture. In 18th International Parallel and Distributed Processing Symposium, IPDPS 2004. IEEE, 2004.
[141] Tilera Corporation. TILE-Gx 3000 Series Overview, 2011.
[142] http://iverilog.wikia.com/. Icarus iVerilog, 2011.
[143] T. Fujiyoshi, S. Shiratake, S. Nomura, T. Nishikawa, Y. Kitasho, H. Arakida, Y. Okuda, Y. Tsuboi, M. Hamada, H. Hara, T. Fujita, F. Hatori, T. Shimazawa, K. Yahagi, H. Takeda, M. Murakata, F. Minami, N. Kawabe, T. Kitahara, K. Seta, M. Takahashi, Y. Oowaki, and T. Furuyama. A 63-mW H.264/MPEG-4 audio/visual codec LSI with module-wise dynamic voltage/frequency scaling. Solid-State Circuits, IEEE Journal of, 41(1):54–62, Jan. 2006.


BIBLIOGRAPHY 177codec lsi with module-wise dynamic voltage/frequency scaling. Solid-State Circuits,IEEE Journal of, 41(1):54 – 62, jan. 2006.[144] A. Luczak, P. Garstecki, O. Stankiewicz, <strong>and</strong> M. Stepniewska. Network-on-chipbased architecture of h.264 video decoder. In Signals <strong>and</strong> Electronic Systems, 2008.ICSES ’08. International Conference on, pages 419 –422, sept. 2008.[145] TileraGX. Tile-GX Processor Family, 2011.
