
NoC Design & Optimization of Multicore Media Processors

A Thesis
Submitted for the Degree of
Doctor of Philosophy
in the Faculty of Engineering

by
Basavaraj T

DEPARTMENT OF ELECTRICAL AND COMMUNICATION ENGINEERING
INDIAN INSTITUTE OF SCIENCE
BANGALORE – 560 012, INDIA

July 2013


Abstract

Networks on Chip[1][2][3][4] are critical elements of modern System on Chip (SoC) as well as Chip Multiprocessor (CMP) designs. Networks on Chip (NoCs) help manage the high complexity of designing large chips by decoupling computation from communication. SoCs and CMPs have a multiplicity of communicating entities like programmable processing elements, hardware acceleration engines, memory blocks as well as off-chip interfaces. With power having become a serious design constraint[5], there is a great need for designing NoCs which meet the target communication requirements while minimizing power using all the techniques available at the architecture, microarchitecture and circuit levels of the design. This thesis presents a holistic, QoS based, power optimal design solution for a NoC inside a CMP, taking into account link microarchitecture and processor tile configurations.

Guaranteeing QoS in NoCs involves guaranteeing bandwidth and throughput for connections and deterministic latencies in communication paths. The Label Switching based Network-on-Chip (LS-NoC) uses a centralized LS-NoC Management framework that engineers traffic into QoS guaranteed routes. LS-NoC uses label switching, enables bandwidth reservation, allows physical link sharing and leverages advantages of both packet and circuit switching techniques. A flow identification algorithm takes into account the bandwidth available in individual links to establish QoS guaranteed routes. LS-NoC caters to the requirements of streaming applications where communication channels are fixed over the lifetime of the application. The proposed NoC framework inherently supports heterogeneous and ad-hoc SoC designs.

A multicast and broadcast capable label switched router for the LS-NoC has been designed, verified, synthesized, placed and routed, and timing analyzed. A 5 port, 256-bit data bus, 4-bit label router occupies 0.431 mm² in 130nm and delivers a peak bandwidth of 80 Gbits/s per link at 312.5 MHz. The LS Router is estimated to consume 43.08 mW. Bandwidth and latency guarantees of LS-NoC have been demonstrated on streaming applications like HiperLAN/2 and an Object Recognition Processor, on Constant Bit Rate traffic patterns, and on video decoder traffic representing Variable Bit Rate traffic. LS-NoC was found to have a competitive (Area × Power)/Throughput figure of merit compared with state-of-the-art NoCs providing QoS. We envision the use of LS-NoC in general purpose CMPs where applications demand deterministic latencies and hard bandwidth requirements.

Design variables for interconnect exploration include wire width, wire spacing, repeater size and spacing, degree of pipelining, supply and threshold voltages, and activity and coupling factors. An optimal link configuration in terms of the number of pipeline stages for a given link length and desired operating frequency is arrived at. Optimal configurations of all links in the NoC are identified and a power-performance optimal NoC is presented. We present a latency, power and performance trade-off study of NoCs using link microarchitecture exploration. The design and implementation of a framework for such a design space exploration study is also presented. We present the trade-off study on NoCs by varying microarchitectural (e.g. pipelining) and circuit level (e.g. frequency and voltage) parameters.

A SystemC based NoC exploration framework is used to explore the impact of various architectural and microarchitectural parameters of NoC elements on the power and performance of the NoC. The framework enables the designer to choose from a variety of architectural options like topology, routing policy, etc., as well as allows experimentation with various microarchitectural options for the individual links like length, wire width, pitch, pipelining, supply voltage and frequency. The framework also supports a flexible traffic generation and communication model. Latency, power and throughput results using this framework to study a 4x4 CMP are presented. The framework is used to study NoC designs of a CMP using different classes of parallel computing benchmarks[6].

One of the key findings is that the average latency of a link can be reduced, up to a point, by increasing pipeline depth, as it enables link operation at higher frequencies. There exists an optimum degree of pipelining which minimizes the energy-delay product of the link. In a 2D Torus, the least latency (1.56 times the minimum) is achieved when the longest link is pipelined by 4 stages, at which point power (40% of max) and throughput (64% of max) are nominal. Using frequency scaling experiments, power variations of up to 40%, 26.6% and 24% are seen in the 2D Torus, Reduced 2D Torus and Tree based NoC between various pipeline configurations achieving the same frequency at constant voltage. In some cases, we find that switching to a deeper pipelining configuration can actually help reduce power, as the links can be designed with smaller repeaters. We also find that the overall performance of the ICNs is determined by the lengths of the links needed to support the communication patterns. Thus the Mesh performs best amongst the three topologies (Mesh, Torus and Folded Torus) considered in the case studies.

The effects of communication overheads on the performance, power and energy of a multiprocessor chip, using L1 and L2 cache sizes as primary exploration parameters and accurate interconnect, processor, on-chip and off-chip memory modelling, are presented. On-chip and off-chip communication times have a significant impact on the execution time and energy efficiency of CMPs. Large caches imply a larger tile area, which results in longer inter-tile communication link lengths and latencies, thus adversely impacting communication time. Smaller caches potentially incur more misses and more frequent off-tile communication. Energy efficient tile design is treated as a configuration exploration and trade-off study using different cache sizes and tile areas to identify a power-performance optimal configuration for the CMP.

Trade-offs are explored using a detailed, cycle accurate, multicore simulation framework which includes superscalar processor cores, cache coherent memory hierarchies, on-chip point-to-point communication networks and a detailed interconnect model including pipelining and latency. Sapphire, a detailed multiprocessor execution environment integrating SESC, Ruby and DRAMSim, was used to run applications from the Splash2 benchmark suite (64K point FFT). Link latencies are estimated for a 16 core CMP simulation on Sapphire. Each tile has a single processor, L1 and L2 caches and a router. Different L1 and L2 sizes lead to different tile clock speeds, tile miss rates and tile areas, and hence different interconnect latencies.

Simulations across various L1 and L2 sizes indicate that the tile configuration that maximizes energy efficiency is related to minimizing communication time. Experiments also indicate different optimal tile configurations for performance, energy and energy efficiency. A clustered interconnection network, communication aware cache bank mapping and thread mapping to physical cores are also explored as potential energy saving solutions. Results indicate that ignoring link latencies can lead to large errors in estimates of program completion times, of up to 17%. Performance optimal configurations are achieved at smaller L1 and moderate L2 cache sizes due to higher operating frequencies, smaller link lengths and comparatively less communication. Using the minimal L1 cache size to operate at the highest frequency may not always be the performance-power optimal choice. Larger L1 sizes, despite a drop in frequency, offer an energy advantage due to less miss-induced communication.

Clustered tile placement experiments for FFT show a performance per watt improvement of 1.2%. Remapping the L2 banks most accessed by a process to the same core or neighbouring cores after communication traffic analysis offers power and performance advantages. Remapped processes and banks in clustered tile placement show a performance per watt improvement of 5.25% and an energy reduction of 2.53%. This suggests that processors could execute a program in multiple modes, for example, minimum energy or maximum performance.


Acknowledgements

I thank my advisor, Prof. Bharadwaj Amrutur, for his invaluable guidance throughout my Ph.D. I thank all of you who have shared many precious moments with me and enriched my journey through life.


Publications

Journals
• Basavaraj Talwar and Bharadwaj Amrutur, "Traffic Engineered NoC for Streaming Applications", Microprocessors and Microsystems, 37 (2013), 333-344.

Conferences
• Basavaraj Talwar and Bharadwaj Amrutur, "A System-C based Microarchitectural Exploration Framework for Latency, Power and Performance Trade-offs of On-Chip Interconnection Networks", First International Workshop on Network on Chip Architectures, Nov. 2008.
• Basavaraj Talwar, Shailesh Kulkarni and Bharadwaj Amrutur, "Latency, Power and Performance Trade-offs in Network-on-Chips by Link Microarchitecture Exploration", 22nd Intl. Conference on VLSI Design, Jan. 2009.


Contents

Abstract
Acknowledgements

1 Introduction
  1.1 Network-on-Chip
  1.2 Switching Policies
    1.2.1 Circuit Switching
    1.2.2 Packet Switching
    1.2.3 Label Switching
  1.3 QoS in NoCs
  1.4 QoS Guaranteed NoC Design
  1.5 Contributions of the Thesis
    1.5.1 Link Microarchitecture Exploration
    1.5.2 Optimal CMP Tile Configuration
    1.5.3 QoS in NoCs
  1.6 Organization of the Thesis

2 Related Work
  2.1 Traffic Engineered NoC for Streaming Applications
    2.1.1 QoS in Packet Switched Networks
    2.1.2 QoS in Circuit Switched Networks
    2.1.3 QoS by Space Division Multiplexing
    2.1.4 Static routing in NoCs
    2.1.5 MPLS and Label Switching in NoCs
    2.1.6 Label Switched NoC
  2.2 Link Microarchitecture and Tile Area Exploration
    2.2.1 NoC Design Space Exploration
  2.3 Simulation Tools
    2.3.1 Link Exploration Tools
    2.3.2 Router Power and Architecture Exploration Tools
    2.3.3 Complete NoC Exploration
    2.3.4 CMP Exploration Tools
    2.3.5 Communication in CMPs - Performance Exploration
  2.4 Summary

3 Link Microarchitecture Exploration
  3.1 Motivation for a Microarchitectural Exploration Framework
  3.2 NoC Microarchitectural Exploration Framework
    3.2.1 Traffic Generation and Distribution Models
    3.2.2 Router Model
    3.2.3 Power Model
  3.3 Case Study: Mesh, Torus & Folded-Torus
    3.3.1 NoC Topologies
    3.3.2 Round Trip Flit Latency & NoC Throughput
    3.3.3 NoC Power/Performance/Latency Tradeoffs
    3.3.4 Power-Performance Tradeoff With Frequency Scaling
    3.3.5 Power-Performance Tradeoff With Voltage and Frequency Scaling
  3.4 Case Study: Torus, Reduced Torus & Tree based NoC
    3.4.1 NoC Topologies
    3.4.2 NoC Throughput
    3.4.3 NoC Power/Performance/Latency Tradeoffs
    3.4.4 Power-Performance Tradeoff With Frequency Scaling
    3.4.5 Power-Performance Tradeoff With Voltage and Frequency Scaling
  3.5 Conclusion

4 Tile Exploration
  4.1 Motivation
  4.2 Observations and Contributions
  4.3 Background
  4.4 Communication Time and Energy Efficiency
  4.5 Experimental Setup
    4.5.1 Experimental Methodology
  4.6 Effect of Link Latency on Performance of a CMP
  4.7 Communication in CMPs
  4.8 Program Completion Time
  4.9 Ideal Interconnects, Custom Floorplanning, L2 Banks and Process Mapping
  4.10 Remarks & Conclusion

5 Label Switched NoC
  5.1 Streaming Applications in Media Processors
    5.1.1 HiperLAN/2
    5.1.2 Object Recognition Processor
  5.2 LS-NoC - Motivation
  5.3 LS-NoC - The Concept
  5.4 LS-NoC - Working
  5.5 Label Switched Router Design
    5.5.1 Pipes & Labels
    5.5.2 Label Swapping
  5.6 Simulation and Functional Verification
  5.7 Synthesis Results
  5.8 Conclusion

6 LS-NoC Management
  6.1 LS-NoC Management
    6.1.1 NoC Manager
    6.1.2 Traffic Engineering in LS-NoC
  6.2 Flow Based Pipe Identification
  6.3 Fault Tolerance in LS-NoC
  6.4 Overhead of NoC Manager
    6.4.1 Computational Latency
    6.4.2 Configuration Latency
    6.4.3 Scalability of LS-NoC
  6.5 Number of Pipes in an NoC
    6.5.1 Minimum, Maximum and Typical Pipes in a Network
  6.6 Conclusion

7 Label Switched NoC
  7.1 HiperLAN/2 baseband processing + Object Recognition Processor SoC
  7.2 Video Streaming Applications
  7.3 Discussion
    7.3.1 Design Philosophy of LS-NoC
    7.3.2 LS-NoC Application
    7.3.3 LS-NoC Evaluation
  7.4 Conclusion

8 Conclusion and Future Work
  8.1 Link Microarchitecture Exploration
  8.2 Optimal CMP Tile Configuration
  8.3 Label Switched NoC for Streaming Applications
  8.4 Future Work

A Interface and Outputs of the SystemC Framework

B Testing & Validation of LS-NoC
  B.1 Implementation of LS-NoC Router
  B.2 Testing and Validation of LS-NoC Router
    B.2.1 Individual Router
    B.2.2 Router in 8×8 Mesh
  B.3 Synthesis & Place and Route

C The Flow Algorithm
  C.1 Ford-Fulkerson's MaxFlow Algorithm
  C.2 Input Graph
  C.3 Edges in the Input Graph

Bibliography


List of Tables

3.1 ICN exploration framework parameters.
3.2 Traffic Generation/Distribution Model and Experiment Setup for the Mesh, Torus & Folded-Torus case study.
3.3 Links and pipelining details of NoCs.
3.4 DLA traffic, Frequency crossover points in 2D Mesh.
3.5 Comparison of 3 topologies for DLA traffic.
3.6 Experimental Setup.
3.7 Links and pipelining details of NoCs.
3.8 Power optimal frequency trip points in various NoCs.
3.9 Comparison of 3 topologies. Maximum interconnect network performance and power consumption for varying pipe stages.
4.1 Configuration parameters of processors, caches & interconnection network used in experiments.
4.2 Scaled processor power over L1 configurations.
4.3 Primary and Secondary cache parameters (access time, area) obtained from cacti. L2 access latencies as a function of L1 access times are also shown.
4.4 Max operating frequencies, Dynamic energy per access of various L1/L2 caches. Values were calculated using cacti power models using 32nm PTM.
4.5 Lengths of links between L1/L2 caches & routers and between routers of neighbouring tiles for a regular mesh placement. No. of pipeline stages required to meet the maximum frequency are also shown.
4.6 FFT. Power spent in links (in mW).
4.7 Total messages in transit (in Millions).
4.8 Clustered tile placement floorplan for L1: 256KB and L2: 512KB. Lengths of links between neighbouring routers and number of pipeline stages are shown. Frequency: 1.38 GHz.
5.1 Communication characteristics between HiperLAN/2 nodes.
5.2 Routing table of an n port (n = 5) router with an lw bit (lw = 4) label, indexed by labels used in the label switched NoC. Size of the routing table = 2^lw × n × lw.
5.3 Simulation parameters used for functional verification of the label switched router design.
5.4 Synthesis Parameters.
5.5 Synthesis results: 2-Router and Mesh networks. Area of a Router is 0.431 mm².
6.1 NoC Manager Overhead.
7.1 Pipes set up for HiperLAN/2 baseband processing SoC and Object Recognition Processor SoC (Figure 7.1(a)). PEC[0-7]→PEC[0-7]: every PEC communicates with every other PEC.
7.2 Standard test videos used in experiments.
7.3 Evaluation of the proposed Label Switched Router and NoC. CS: Circuit switched, PS: Packet switched.
A.1 ICN exploration framework parameters and their default values.
C.1 Routing tables at R0 I0, R0 I2 and R1 I4 nodes after pipes P0 and P1 have been set up.


List of Figures

1.1 Design space exploration of NoCs in CMPs is closely related to link microarchitecture, router design and tile configurations.
2.1 Floorplan used in estimating wire lengths. Wire lengths estimated from these floorplans are used as input to Intacte to arrive at a power optimal configuration and latency in clock cycles. Horizontal R-R: Link between neighbouring routers in the horizontal direction, Vertical R-R: Link between neighbouring routers in the vertical direction.
3.1 Architecture of the SystemC framework.
3.2 Flow of the ICN exploration framework.
3.3 Flit header format. DSTID/SRCID: Destination/Source ID, SQ: Sequence Number, RQ & RP: Request and Response Flags, and a 13 bit flit id.
3.4 Example flit header formats considered in this experiment. (DST/SRCID: Destination/Source ID, HC: Hop Count, CHNx: Direction at hop x).
3.5 Schematic of 3 compared topologies (L to R: Mesh, Torus, Folded Torus). Routers are shaded and Processing Elements (PE) are not.
3.6 Normalized average round trip latency in cycles vs. Traffic injection rate in all the 3 NoCs.
3.7 Max. frequency of links in 3 topologies. Lengths of longest links in Mesh, Torus and Folded 2D Torus are 2.5mm, 8.15mm and 5.5mm.
3.8 Total NoC throughput in 3 topologies, DLA traffic.
3.9 Avg. round trip flit latency in 3 NoCs, DLA traffic.
3.10 2D Mesh Power/Throughput/Latency trade-offs for DLA traffic. Normalized results are shown.
3.11 2D Mesh Power/Throughput/Latency trade-offs for SLA traffic.
3.12 DLA Traffic, 2D Torus Power/Throughput/Latency trade-offs. Normalized results are shown.
3.13 DLA Traffic, Folded 2D Torus Power/Throughput/Latency trade-offs. Normalized results are shown.
3.14 Frequency scaling on 3 topologies, DLA Traffic.
3.15 Dynamic voltage scaling on 2D Mesh, DLA Traffic. Frequency scaled curve for P=8 is also shown.
3.16 Schematic representation of the three compared topologies (L to R: 2D Torus, Tree, Reduced 2D Torus). Shaded rectangles are Routers and white boxes are source/sink Processing Element (PE) nodes.
3.17 Floorplans of the three compared topologies.
3.18 Maximum attainable frequency by links in the respective topologies. Estimated length of the longest link in a 2D Torus is 7mm. Estimated longest link in the Tree based and Reduced 2D Torus is 3.5mm.
3.19 Variation of total NoC throughput with varying pipeline stages in all three topologies.
3.20 2D Torus Power/Throughput/Latency trade-offs. Normalized results are shown.
3.21 Reduced 2D Torus Power/Throughput/Latency trade-offs. Normalized results are shown.
3.22 Variation of NoC power with throughput for each topology.
3.23 Effects of dynamic voltage scaling on the power and performance of a 2D Torus. Highest frequencies of operation for P=1, 2, 4 and 7 are 0.93GHz, 1.68GHz, 2.92GHz and 4.22GHz. Power consumption of the frequency scaled NoC is shown for comparison.
4.1 Error in performance measurement between real and ideal interconnect experiments.
4.2 Schematic of a multiprocessor architecture comprising tiles and an interconnecting network. Each tile is made up of a processor, L1 and L2 caches.
4.3 Flowchart illustrating the steps in the experimental procedure.
4.4 Tile floorplans for different (L1, L2) sizes. From left: (8KB, 64KB), (64KB, 1MB), (128KB, 4MB).
4.5 Mesh floorplans used in experiments. From left: Conventional 2D Mesh topology, a clustered topology, a clustered topology with L2 bank and thread mapping, and a mesh topology with L2 bank and thread mapping.
4.6 Benchmark execution time vs. Communication time - DRAM access time and On-chip transit time vs. L2 cache size vs. Program completion time.
4.7 Program energy vs. Communication time.
4.8 64K point FFT benchmark execution time vs. Total time spent in on-chip message transit. L2 cache sizes are in the order 64KB, 128KB, 256KB, 512KB, 1M, 2M, 4M.
4.9 64K point FFT execution time vs. Total time spent in DRAM (off-chip) accesses. L2 cache sizes are in the order 64KB, 128KB, 256KB, 512KB, 1M, 2M, 4M.
4.10 Total messages over all the links during the execution of the benchmark and Average transit time of a message.
4.11 FFT. Total instructions executed and power spent in the memory hierarchy and on-chip links during the execution.
4.12 FFT Benchmark. Energy per Instruction and Instructions per second² per Watt.
4.13 Y1: PCT, Y2: on-chip transit and off-chip comm. times.
4.14 FFT benchmark results. (PCT: Program Completion Time, comm.: communication).
4.15 FFT benchmark results.
4.16 Program Completion Times.
4.17 Alternative Tile Placements, custom process scheduling example and ideal interconnect comparison results. Benchmark: FFT, L1: 256K, L2: 512K.
5.1 (a) Process graph of a HiperLAN/2 baseband processing SoC[7] and (b) NoC of the Object recognition processor[8].
5.2 A 64 Node, 8 × 8 2D LS-NoC along with NoC Manager interface to routing tables.
5.3 Pipe establishment and label swapping example in a 3×3 LS-NoC.
5.4 Label Switched Router with single cycle flit traversal. Valid signal identifies Data and Label as valid. PauseIn and PauseOut are flow control signals for downstream and upstream routers. Routing table has output port and label swap information. Arbiter receives input from all the input ports along with the flow control signal from the downstream router.
5.5 Label conflict at R1 resolved using Label swapping. il: Input Label, Dir: Direction, ol: Output Label.
6.1 Surveillance system showing the application of LS-NoC in the Video computation server.
6.2 (a) A 2 router, 6 communicating node linear network. (b) Multiple source, multiple sink flow calculation in a network.
6.3 (a) Number of pipes in a linear network (Fig. 6.2(a)), lw = 3 bits, varying constraints. Constraint 1: Max 1 pipe per sink. (b) Max. number of pipes in 2D Mesh (Fig. 5.2).
7.1 (a) Process blocks of HiperLAN/2 baseband processing SoC and Object recognition processor mapped on to an 8 × 8 LS-NoC. Pipe 1: PEC0 → PEC6, Pipe 2: MP → PEC3. (b) Flows set up for CBR & VBR traffic.
7.2 Latency of HiperLAN/2 and ORP pipes in LS-NoC over varying injection rates of non-streaming application nodes. Latency of non-provisioned paths are titled (U).
7.3 (a) Latency of CBR traffic over various injection rates of non-streaming nodes in LS-NoC. (b) Latency of VBR traffic over various injection rates of non-streaming nodes in LS-NoC.
7.4 LS-NoC being used alongside a best effort NoC.
B.1 Modules in LS-NoC router design shown along with testbench, implemented in Verilog.
B.2 Test cases used to verify an individual LS-NoC router.
B.3 8×8 mesh used for testing LS-NoC.
B.4 Traffic test cases used to verify proper functioning of the LS-NoC router.
B.5 Flowchart illustrating the Synthesis and Place & Route steps of the LS-NoC router.
B.6 Placed and routed output - Single Router.
C.1 Steps in the flow algorithm example. (a) Input Graph. Maximum flows have to be identified between nodes X & Y. (b) Available capacities of links after flows X→A→C→Y & X→B→C→Y are set up. (c) Residual network showing available capacities of links in the forward direction and utilized capacity in the reverse. (d) Residual network after adding the flow X→A→C→B→D→E→Y. (e) Final output of the maxflow algorithm showing 3 flows from X to Y.
C.2 (a) A 2 router, 6 source+sink system used for validation of the LS-NoC router design. Graph representation of the system used as input to the flow algorithm is shown in (b).
C.3 The NoC after two pipes, P0 and P1, have been established. P0: R0 S0→R1 D2 and P1: R0 S2→R1 D0.


Chapter 1

Introduction

1.1 Network-on-Chip

Networks on Chip[1][2][3][4] are critical elements of modern Chip Multiprocessors (CMPs) and Systems on Chip (SoCs). Networks on Chip (NoCs) help manage the high complexity of designing large chips by decoupling computation from communication. SoCs and CMPs have a multiplicity of communicating entities like programmable processing elements, hardware acceleration engines, memory blocks as well as off-chip interfaces. Using a NoC enables modular design of communicating blocks and network interfaces. NoCs help achieve a well structured design, enabling higher performance while servicing larger bandwidths compared to bus based systems[1]. Links in NoCs designed with controlled electrical parameters can use aggressive signaling circuits to reduce power and delay[9]. Network resources are utilized more efficiently in NoCs as compared to global wires[10].

Communication patterns between communicating entities are application dependent. As a result, NoCs are expected to cater to diverse connections varying in connectivity, burstiness, latency and bandwidth requirements. NoCs servicing communication requirements in CMPs or SoCs are expected to meet Quality of Service (QoS) demands such as maximum or average latency, typical or peak bandwidth and required throughput of executing applications. Further, with power having become a serious design constraint[5], there is a great need for designing NoCs which meet the target communication requirements while minimizing power using various strategies at the architecture, microarchitecture and circuit levels of the design.


1.2 Switching Policies

Switching policies configure paths in routers to facilitate data transfer between input and output ports. Programming of internal switches in routers to connect input ports to output ports, and determination of when and which data units are transferred, is accomplished using switching policies. Flow control mechanisms synchronize data transfer between a router and traffic sources, and between two routers. Switching policies and flow control mechanisms influence the design of internal switches, routing and arbitration units, and the amount of buffering in a router. The major types of switching policies are introduced here.

1.2.1 Circuit Switching

Circuit switching is a reservation based switching policy in which network resources are allocated to a communication path before data is transferred. At the end of data transfer, reserved resources are de-allocated and are available for future circuits. As circuits are used on a reservation basis, circuit switching requires a simple router design with few or no buffers.

Circuits are established using path identifying probe packets that reserve resources as they propagate towards the destination. Circuit establishment is complete after an acknowledgment message is received by the source. Data is transferred along the circuit without further monitoring or control. After the transfer is complete, the circuit is torn down and resources are freed using a tail packet. Popular examples of circuit switched networks are the Autonomous Error-Tolerant Cell[11], Asynchronous SoC[12], Crossroad[13], dTDMA[14], a point-to-point network for real time systems[15], a programmable NoC for FPGA-based systems[16], ProtoNoC[17], the Space Division Multiplexing based NoC[18], SoCBuS[19], the Reconfigurable Circuit Switched NoC[7], etc.


1.2.2 Packet Switching

In packet switching, the message to be transmitted is partitioned and transmitted as fixed-length packets. Routing and control are handled on a per packet basis. The packet header includes routing and other control information needed for the packet to reach the destination. Packet switching increases network resource utilization as communication channels share resources along the path. Buffers and arbitration units in routers manage resource conflicts and storage demands in communication paths. Packet switching networks aid IP block re-use and are scalable[20]. Packet switching is more flexible than circuit switching, though it requires buffering and introduces unpredictable latency (jitter). Popular packet switched networks are the Asynchronous NoC[21], FAUST[22], Arteris NoC[23], Butterfly Fat Tree[24], DyAD[25], Eclipse[26], MANGO[27], Proteo[28], QNoC[29], SPIN[30], etc. Some NoC designs can adaptively work in circuit or packet switched modes based on traffic requirements; a few examples are Æthereal[31], Heterogeneous IP Block Interconnection[32], a dynamically reconfigurable NoC[33], Octagon[34], etc.

1.2.3 Label Switching

Label switching is used by technologies such as ATM[35][36] and Multiprotocol Label Switching (MPLS)[37] as a packet relaying technique. Individual packets carry route information in the form of labels. A label denotes a common route that a set of data packets traverse. Therefore, a minimalistic label identifies the source hop and the destination hop along with the intermediate transit routers. Along with routing information, labels can be used to specify service priorities for packets. This feature enables differentiated services for packets using common labels. Routers along the path use the label to identify the next hop, forwarding information, traffic priority, Quality of Service guarantees and the next label to be assigned. Label switching inherently supports traffic engineering, as labels can be chosen based on the desired next hop or required QoS services. A few proposals of label switched NoCs are MPLS NoC[38], Nexus[39] and Blackbus[40].
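As a concrete illustration of this mechanism, the sketch below shows what a label-indexed forwarding table with label swapping might look like in software. It is a minimal example under assumed sizes (a 4-bit label and a 5-port router); it is not the LS-NoC router implementation, which is an RTL design described in Chapter 5.

```cpp
#include <array>
#include <cstdint>
#include <optional>

// Hypothetical forwarding entry: the incoming label selects the output port
// and the label that the flit carries on the next hop (label swap).
struct LabelEntry {
    bool    valid    = false;
    uint8_t outPort  = 0;   // 0..4 for the assumed 5-port router
    uint8_t outLabel = 0;   // next-hop label
};

// One router's label table: 2^4 = 16 entries for the assumed 4-bit label.
class LabelTable {
public:
    // Programmed once per pipe at route-setup time (e.g. by a central manager).
    void program(uint8_t inLabel, uint8_t outPort, uint8_t outLabel) {
        table_.at(inLabel) = LabelEntry{true, outPort, outLabel};
    }

    // Per-flit forwarding decision: a single indexed lookup, no per-packet
    // route computation. Returns nothing if no pipe uses this label.
    std::optional<LabelEntry> lookup(uint8_t inLabel) const {
        const LabelEntry& e = table_.at(inLabel);
        if (!e.valid) return std::nullopt;
        return e;
    }

private:
    std::array<LabelEntry, 16> table_{};
};
```

Because the lookup is indexed by a small label rather than matched against full source and destination addresses, per-hop state stays small, and swapping the label at each hop lets different pipes reuse the same label values on different links.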


1.3 QoS in NoCs

NoCs servicing CMPs and SoCs are expected to meet the Quality of Service (QoS) demands of executing applications. Latency sensitive applications demand a guaranteed average and maximum latency on communication traffic. Jitter sensitive applications may tolerate longer latencies but require fixed delay along communication paths. Further, among classes of applications some have higher priority than others. For example, application data usually has higher priority than acknowledgment packets or control information. The two basic approaches in NoC designs to enable QoS guarantees are: creation of reserved connections between sources and destinations via circuit switching, or support for prioritized routing (in the case of packet switched, connectionless paths).

Circuit switched NoCs guarantee high data transfer rates in an energy efficient manner by reducing intra-route data storage[41]. Circuit switched NoCs provide guaranteed QoS for worst case traffic scenarios, leading to higher network resource requirements[42]. They are well suited for streaming traffic generated by media processors where communication requirements are well known a priori. One of the drawbacks is under-utilization of network resources, as resources are reserved for peak bandwidth while the average requirement might be lower.

Packet switched networks provide efficient interconnect utilization and high throughputs[43] while providing fairness amongst best effort flows. However, network resources in packet switched networks need to be over-provisioned to support QoS for various traffic classes, and routers have high buffer requirements. Packet switching networks usually provide QoS through differentiated services, classifying traffic into various classes[29]. Prioritized services are provided to traffic belonging to each class. Due to the sharing of network resources, packet switched networks can be configured to provide Guaranteed Throughput (GT) for a few classes of traffic and Best Effort (BE) services for the remaining classes.

With traffic engineering enabled label switching networks, communication loads can be distributed over the NoC, resulting in fair allocation of network resources. Network resource guarantees enable paths with little or no jitter while keeping network utilization fairly high. Further, the design of routers is simplified compared to conventional wormhole routers[40].


1.4 QoS Guaranteed NoC Design

Media processors with streaming traffic, such as HiperLAN/2 baseband processors[7], real-time object recognition processors[8] and H.264 encoders[44][45], demand adequate bandwidth and bounded latencies between communicating entities. They also have well known communication patterns and bandwidth requirements. Adequate throughput, latency and bandwidth guarantees between process blocks have to be provided for such applications. The nature of streaming applications in media processors and the characteristics of streaming traffic are illustrated in Section 5.1 of Chapter 5.

Guaranteeing QoS in NoCs involves guaranteeing bandwidth and throughput for connections and deterministic latencies in communication paths. This thesis proposes a QoS guaranteeing NoC using label switching, in which bandwidth can be reserved while links are shared. The traffic is engineered during route setup, and the scheme leverages advantages of both packet and circuit switching techniques. We propose a QoS based Label Switched NoC (LS-NoC) router design. We present a latency, power and performance optimal interconnect design methodology considering low level circuit and system parameters. Further, optimal tile configurations are identified using the effects of application communication traffic on performance and energy in chip multiprocessors (Figure 4.2).

A label switched, QoS guaranteeing NoC that retains the advantages of both packet switched and circuit switched networks is the main focus of this thesis. Congestion free communication pipes are identified by a centralized Manager with complete network visibility. The Label Switched NoC (LS-NoC) sets up communication channels (pipes) between communicating nodes that are independent of existing pipes and are contention free at the routers. Deterministic delays and bandwidth are guaranteed in newly established pipes, taking into account established flows. Residual bandwidth in links reserved by a pipe can be utilized by other pipes, thus enabling sharing of physical links between pipes without compromising QoS guarantees. LS-NoC provides throughput guarantees irrespective of the spatial separation of the communicating entities.
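The sketch below illustrates, under assumed data structures, the per-link bookkeeping that such sharing implies: a new pipe is admitted along a route only if every link on it still has enough unreserved bandwidth, and that amount is then set aside. It is only an admission-check illustration, not the LS-NoC Manager itself, which is described in Chapter 6.

```cpp
#include <map>
#include <utility>
#include <vector>

using Link = std::pair<int, int>;    // (from-router, to-router); IDs are assumed

// Hypothetical residual-bandwidth ledger used to decide pipe admission.
class BandwidthLedger {
public:
    explicit BandwidthLedger(double linkCapacityGbps) : capacity_(linkCapacityGbps) {}

    // Admit a pipe along 'route' only if every link can still spare 'demand'
    // Gbit/s; on success the bandwidth is reserved on each link.
    bool reservePipe(const std::vector<Link>& route, double demand) {
        for (const Link& l : route)
            if (capacity_ - reserved_[l] < demand)
                return false;          // admitting would break an existing guarantee
        for (const Link& l : route)
            reserved_[l] += demand;    // residual bandwidth on the link shrinks
        return true;
    }

    // Tear down a pipe and return its bandwidth to the links on its route.
    void releasePipe(const std::vector<Link>& route, double demand) {
        for (const Link& l : route)
            reserved_[l] -= demand;
    }

private:
    double capacity_;                  // assumed uniform link capacity
    std::map<Link, double> reserved_;  // bandwidth already promised per link
};
```

Whatever a link's reservations leave over is exactly the residual bandwidth that later pipes may claim, which is how physical links are shared without weakening the guarantees of pipes that are already established.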


Figure 1.1: Design space exploration of NoCs in CMPs is closely related to link microarchitecture, router design and tile configurations.

Interconnect delay and power contribute significantly towards the final performance and power numbers of a CMP[46]. Design variables for interconnect exploration include wire width, wire spacing, repeater size and spacing, degree of pipelining, supply voltage (Vdd), threshold voltage (Vth), and activity and coupling factors. A power and performance optimal link microarchitecture can be arrived at by optimizing these low level link parameters. A methodology to arrive at the optimal link configuration, in terms of the number of pipeline stages (cycle latency) for a given link length and desired operating frequency, is presented. Optimal configurations of all links in the NoC are identified, and a power-performance optimal NoC is thus achieved.
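A hedged sketch of the search this methodology implies is shown below: for a link of a given length, sweep the number of pipeline stages and keep the configuration with the lowest energy-delay product. The delay and energy models here are crude placeholders (an RC-like term that grows quadratically with segment length plus a fixed flop overhead); the thesis relies on detailed wire and repeater models instead, so the numbers are illustrative only.

```cpp
#include <cstdio>

// Placeholder link models, assumed for illustration only. An unrepeated wire
// segment is given RC-like delay that grows quadratically with its length, so
// cutting a long link into shorter pipelined segments raises the attainable
// clock frequency.
struct LinkModel {
    double rcDelayCoeff  = 0.03;  // ns per mm^2 (assumed)
    double flopOverhead  = 0.05;  // ns added per pipeline stage (assumed)
    double energyPerMm   = 0.10;  // pJ/bit/mm of wire (assumed)
    double energyPerFlop = 0.02;  // pJ/bit per pipeline stage (assumed)
};

int main() {
    const LinkModel m;
    const double lengthMm = 8.0;   // e.g. the longest link in a 2D torus
    int    bestStages = 1;
    double bestEdp    = 1e300;

    for (int stages = 1; stages <= 8; ++stages) {
        double segMm     = lengthMm / stages;
        double cycleNs   = m.rcDelayCoeff * segMm * segMm + m.flopOverhead;
        double latencyNs = cycleNs * stages;                    // link traversal time
        double energyPj  = m.energyPerMm * lengthMm + m.energyPerFlop * stages;
        double edp       = energyPj * latencyNs;                // energy-delay product

        std::printf("stages=%d  f=%.2f GHz  latency=%.2f ns  EDP=%.3f pJ*ns\n",
                    stages, 1.0 / cycleNs, latencyNs, edp);
        if (edp < bestEdp) { bestEdp = edp; bestStages = stages; }
    }
    std::printf("EDP-optimal pipeline depth for this toy model: %d stages\n", bestStages);
    return 0;
}
```

With this kind of model the absolute latency first drops as stages are added, because shorter segments can be clocked faster, and then flattens while flop energy keeps growing; the interior minimum of the energy-delay product is the optimum degree of pipelining referred to above.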


Primary and secondary cache sizes have a major bearing on the amount of on-chip and off-chip communication in a Chip Multiprocessor (CMP). On-chip and off-chip communication times have a significant impact on the execution time and energy efficiency of CMPs. From a performance point of view, cache accesses should suffer minimum delay and off-tile communication due to cache misses should be negligible. Large caches dissipate more leakage energy and may exceed area budgets, though they reduce cache misses and decrease off-tile communication. Larger caches also result in longer inter-tile communication link lengths and latencies, thus adversely impacting communication time. Small caches reduce occupied tile area, have higher activity and hence dissipate less leakage energy. The drawback of smaller caches is a potentially higher number of misses and more frequent off-tile communication. This illustrates the trade-off between cache size, miss rate, NoC communication latency and power. Energy efficient tile design is thus a configuration exploration and trade-off study using different cache sizes and tile areas to identify a power-performance optimal cache size and NoC configuration for the CMP.

1.5 Contributions of the Thesis

This thesis presents methodologies for label switched QoS guaranteed NoC design, link microarchitecture exploration and optimal Chip Multiprocessor (CMP) tile configuration. The contributions of this thesis are listed here:

1.5.1 Link Microarchitecture Exploration

• Optimal Link Design and Exploration Framework: We present a simulation framework developed in SystemC which allows the designer to explore NoC designs across low level link parameters such as pipelining, link width, wire pitch, supply voltage and operating frequency, and NoC architectural parameters such as router type and topology of the interconnection network. We use the simulation framework to identify the power-performance (energy-delay) optimal link configuration in a given NoC over particular traffic patterns. Such an optimum exists because increasing pipelining allows for shorter wire segments which can be operated either faster or with lower power at the same speed.

• Optimum Pipe Depth: Contrary to intuition, we find that increasing pipeline depth can actually help reduce latency in absolute time units, by allowing shorter links and hence a higher frequency of operation. In some cases, we find that switching to a deeper pipelining configuration can actually help reduce power, as the links can be designed with smaller repeaters. Larger NoC power savings can be achieved by voltage scaling along with frequency scaling. Hence it is important to include the link microarchitecture parameters as well as circuit parameters like supply voltage during architecture design exploration of NoCs.


1.5.2 Optimal CMP Tile Configuration

• Optimal Cache Size: The performance-power optimal L1/L2 configuration of a tile is close to the configuration that spends the least amount of time in on-chip and off-chip communication.

• Effect of Floorplanning and Process Mapping: Communication aware floorplanning can save up to 2.6% of the energy spent in executing an instruction and up to 11% of the communication power spent during the execution of the program. Mapping L2 banks to the same core as the processes accessing them reduces the time spent in communication, and hence the overall program completion time, and also has a bearing on the total energy spent in the execution of the program. Experiments have revealed that as much as 2% of energy per instruction can be saved by communication-aware process scheduling compared to conventional thread mapping policies in a 2D Mesh architecture.

1.5.3 QoS in NoCs

• A Label Switching NoC providing QoS guarantees: We present LS-NoC to service the QoS demands of streaming traffic in media processors. A centralized NoC Manager capable of traffic engineering establishes bandwidth guaranteed communication channels between nodes. LS-NoC guarantees deterministic path latencies, satisfies bandwidth requirements and delivers constant throughput. Delay and throughput guaranteed paths (pipes) are established between sources and destinations along contention free, bandwidth provisioned routes. Pipes are identified by labels unique to each source node. Labels need fewer bits compared to node identification numbers, potentially decreasing memory usage in routing tables.


identification algorithms to identify contention-free, bandwidth provisioned paths in LS-NoC called pipes. The LS-NoC Manager has complete visibility of the state of LS-NoC. Bandwidth requirements of the application are taken into account by the flow identification algorithm to provision routes between communicating nodes. The flow based pipe establishment algorithm is topology independent and hence the NoC Manager supports applications mapped to both regular chip multiprocessors (CMPs) and customized SoCs with non-conventional NoC topologies. Additionally, fault tolerance is achieved by the NoC Manager by considering link status during pipe establishment.

• Design of a Label Switched Router: The Label Switched (LS) Router used in LS-NoC achieves single cycle traversal delay under no contention and is multicast and broadcast capable. Source nodes in the LS-NoC can work asynchronously as cycle-level scheduling is not required in the LS Router. The LS router supports multiple clock domain operation. Dual clock buffers can be used at output ports in the LS-NoC router. This eases clock domain crossovers and reduces the need for a single globally synchronous clock. As a result, clock tree design is less complex and clock power is potentially saved.

1.6 Organization of the Thesis

Chapter 2 highlights several works from current literature related to the broad areas of QoS guaranteed NoCs, link microarchitecture, design space exploration of NoCs and effects of communication on energy and performance trade-offs in CMPs.

Chapter 3 presents a latency, power and performance trade-off study of NoCs through link microarchitecture exploration using microarchitectural and circuit level parameters. The NoC exploration framework used in the trade-off studies is described. The interface to the SystemC framework and sample output logs generated are presented in Appendix A.

Effects of on-chip and off-chip communication due to various CMP tile configurations are explored in Chapter 4. The need to use detailed interconnection network models to
identify optimal energy and performance configurations is also highlighted. On-chip and off-chip communication effects on power and performance of CMPs are explored. Effects of communication on program execution times and program execution energy are presented. Further, energy-performance results for tile configurations and effects of custom L2 bank mapping and thread mapping on power and performance of CMPs are presented.

Design and implementation of a label switching, traffic engineering capable NoC delivering guaranteed QoS for streaming traffic in media processors is presented in Chapter 5. Traffic characteristics of streaming applications are also presented in that chapter. Functional verification of the LS-NoC router using various test cases is presented in Appendix B. Chapter 6 illustrates the LS-NoC management framework and the flow identification algorithm used to establish pipes. An example of the use of the flow algorithm is presented in Appendix C. Streaming application test cases and various types of video traffic are used to establish LS-NoC as a QoS guaranteeing framework in Chapter 7. The thesis concludes in Chapter 8 after listing some possible future extensions of the proposed work.


Chapter 2

Related Work

Several publications have highlighted the need for solutions to pressing problems in various domains in the broad area of Network-on-Chips[47][48][49][50]. This chapter introduces relevant works in the broad areas of QoS guaranteed Network-on-Chips, design space exploration of NoCs and effects of communication on energy and performance trade-offs in CMPs.

2.1 Traffic Engineered NoC for Streaming Applications

Providing QoS guarantees in on-chip communication networks has been identified as one of the major research problems in NoCs[48]. QoS solutions in packet switched networks use priority based services while circuit switched NoCs use some form of resource reservation. We introduce a few well known QoS solutions from the literature and compare our work with the state of the art. Packet switched NoCs use differentiated services for traffic classes[29][22][21][8] to provide latency and bandwidth guarantees. Circuit switched NoCs use resource reservation mechanisms to guarantee QoS[34][51][41][19]. Resource reservation mechanisms involve identifying a sufficiently resource rich path, reserving resources along the path, configuration, actual communication and path tear down. A fairly extensive survey of NoC proposals has been presented in [50]. Relevant QoS NoCs are discussed in
this section.

2.1.1 QoS in Packet Switched Networks

QoS NoC (QNoC) presented by Bolotin et al. [29] is a customized QoS NoC architecture based on a 2D Mesh that satisfies QoS by allocating frequently communicating nodes close by, doing away with unnecessary links, tailoring link width to meet bandwidth requirements and balancing link utilization. Inter-module communication traffic is classified into four classes of service: signaling, real-time, RD/WR and block-transfer. FAUST[22] is a reconfigurable baseband platform based on an asynchronous NoC providing a programmable communication framework linking heterogeneous resources. FAUST uses a 2-level priority based virtual circuit design in its Network Interface (NI) to provide QoS guarantees. Asynchronous NoCs[21] use clock-free interconnect to improve reliability and delay-insensitive arbiters to solve routing conflicts. A QoS Router with both soft (Soft GT) and hard (Hard GT) guarantees for globally asynchronous, locally synchronous (GALS) NoCs is presented in [52]. Leftover bandwidth in routers servicing Hard GT is utilized by Soft GT connections and best effort traffic. NoCs presented in [21], [52] and [53] employ multiple priority levels to provide differentiated services and guarantee QoS. The MANGO [27][54] NoC provides hard GT by prioritizing each GT connection and adopts Asynchronous Latency Guarantee (ALG) scheduling to prevent starvation of packets with lower priority.

One of the major drawbacks of priority based QoS schemes is that an increase in traffic in one priority class affects the delay of traffic belonging to other classes. A priority network will lose the differentiated services advantage if all traffic belongs to the same priority level. Further, deadlock-free routing algorithms using virtual circuits with a priority approach may lead to degradation in NoC throughput. In cases where connections cannot be overlapped with each other (e.g. MANGO NoC), an increased number of hard GT connections will lead to increased cost in network resources.

Another class of packet switched NoCs using priority based QoS solutions are application specific SoCs. A tree based hierarchical packet-switched NoC for a real-time object recognition processor is implemented in [8]. The tree topology NoC with three crossbar
switches, which interconnects 12 IPs, supports both bursty (for image traffic) and non-bursty (for control and synchronization signals) traffic. Network resources in this NoC are tailored to meet the throughput and bandwidth demands of the application and hence the design is not a generic solution for servicing QoS in a CMP environment.

2.1.2 QoS in Circuit Switched Networks

Resource reservation between communicating nodes involves identification of a path using point-to-point links, a path probing service network, or an intelligent, traffic aware distributed or centralized manager. Hu et al.[15] introduce point-to-point (P2P) communication synthesis to meet timing demands between communicating nodes using bus width synthesis. Circuit switched bus based QoS solutions such as Crossroad[13], dTDMA[14] and Heterogeneous IP Block Interconnection (HIBI)[32] rely on communication localization to satisfy timing demands. NEXUS[39] is a resource reservation based QoS NoC for globally asynchronous, locally synchronous (GALS) architectures. NEXUS uses an asynchronous crossbar to connect synchronous modules through asynchronous channels and clock-domain converters.

P2P networks do not share communication links between multiple nodes, leading to inefficient utilization of network resources. This increases wiring resources inside the chip and results in poor scalability. Crossbar based solutions using protocol handshakes (for example, 4-way handshakes in NEXUS[39] and ProtoNoC[17]) force communicating nodes to wait till the handshake is complete and the path is established. Non-interference of communication channels is achieved by over-provisioning resources in the crossbar. This leads to complex and poorly scalable networks. Connecting frequently communicating nodes on a single bus will increase demand on the bus and lead to larger waiting times at the nodes. Static routing along shortest paths does not guarantee latency bound routes due to arbitration delays in the network.

Amongst the NoCs that use probe based circuit establishment solutions are Intel's 8×8 circuit switched NoC[41], SoCBUS[19][55] and the distributed programming model in
Æthereal[51]. In these NoCs, probe packets are used to reconnoiter shortest communication paths and configure routing tables if a path (circuit) is available. Routers are locked down and no other circuits can use the port during the lifetime of an established circuit. If the shortest X-Y path is not available, the probe packets initiate route discovery mechanisms on other paths. The method involves some dynamic behaviour as the probe might repeat route discovery steps or retry after a random period of time if circuit set up does not succeed. This leads to indeterministic and sometimes large route setup times which may be unacceptable for real time application performance.

Centralized Circuit Management

Reserved communication channels can be identified and configured using an application-aware hardware or software entity[51][34]. Such a traffic manager can provide programmability of routes.

The Æthereal NoC [51] aims at providing hard guaranteed QoS using Time Division Multiplexing (TDM) to avoid contention in a synchronous network. The centralized programming model in the Æthereal NoC[51] uses a root process to identify free slots and configure network interfaces. Time slot tables are used in routers to reserve output ports per input port in a particular time slot. To avoid collisions and the loss of data, consecutive time slots are then reserved in routers along the circuit path. The number of paths established in the NoC is restricted by the scheduling constraints during time slot reservation. Increasing the number of time slots in TDM based NoCs increases router size. In cases where a communication channel cannot be found due to slot exhaustion, traffic division over multiple physical paths may be required[56]. Traffic division involves reordering packets at the target node, leading to increased memory and computational costs.
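To make the slot-table mechanism concrete, the following is a minimal sketch of how consecutive time slots could be checked and reserved along a circuit path. It is an illustration only, not Æthereal's implementation: the slot count, the Router structure and the reserve_circuit interface are assumptions made for this example.

```cpp
#include <array>
#include <cstddef>
#include <vector>

// Illustrative TDM slot table: each router on a path may reserve one output
// port per time slot. A circuit occupies slot (s + hop) at the hop-th router
// so that a flit injected in slot s never collides with another circuit.
constexpr std::size_t kSlots = 8;          // slots per TDM wheel (assumed)
constexpr int kFree = -1;

struct Router {
    std::array<int, kSlots> slot_owner{};  // circuit id holding each slot
    Router() { slot_owner.fill(kFree); }
};

// Try to reserve consecutive slots for 'circuit' along 'path', starting at
// injection slot 'start'. Commits the reservation only if every hop is free.
bool reserve_circuit(std::vector<Router>& path, int circuit, std::size_t start) {
    for (std::size_t hop = 0; hop < path.size(); ++hop)
        if (path[hop].slot_owner[(start + hop) % kSlots] != kFree)
            return false;                  // scheduling constraint violated
    for (std::size_t hop = 0; hop < path.size(); ++hop)
        path[hop].slot_owner[(start + hop) % kSlots] = circuit;
    return true;
}

// Example use: std::vector<Router> path(3); reserve_circuit(path, 1, 2);
```

The sketch also makes the scheduling constraint visible: a single occupied slot anywhere along the path forces the whole reservation attempt to fail, which is why slot exhaustion limits the number of circuits that can be established.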


TDM techniques using slot tables in Æthereal[51] and sequencers in Adaptive System-on-Chip[12] require a single synchronous clock distributed over the chip. Accurate global synchronous clock distribution is expensive in terms of power. Global synchronicity can be achieved in a distributed manner using tokens such that every router synchronizes every slot with all of its neighbors [57]. This method will bring down the operating speed of the NoC as the slowest router will dictate the speed of the NoC. Further, power management techniques such as multiple clock domains are not feasible with this approach. AElite[58] and dAElite[59] have been proposed as improved next generation Æthereal NoCs. AElite inherits the guaranteed services model from Æthereal. To overcome the global synchronicity problem, AElite proposes the use of asynchronous and mesochronous links as a possibility. As noted in the paper[58], using mesochronous links alone may not be sufficient if routers and NIs are plesiochronous[60]. One of the drawbacks of AElite was the number of slots occupied by the header flits. A header flit in AElite occupied one in three slots and the overhead rises to up to 33%. dAElite circumvents the header flit overhead by routing based on the times of packet injection and packet reception. One of the disadvantages of dAElite is an increase in the number of link wires, due to the configuration network and also because of separate wires for end-to-end credit communication.

The Octagon NoC[34] implements a centralized best fit scheduler to configure and manage non-overlapping connections. The scheduler cannot establish a new connection through a port if it is blocked by another connection. This results in increased connection establishment time at the routers and also packet losses.

2.1.3 QoS by Space Division Multiplexing

As an alternative to TDM techniques, Spatial Division Multiplexing (SDM) techniques for QoS have been proposed in [23], [61] and [62]. SDM techniques involve sharing fractions of links between connections simultaneously, based on the bandwidth requirements of the corresponding connections. An approach comparable to a static version of SDM called Lane-Division-Multiplexing has been proposed in [7]. Lane-Division-Multiplexing is based on a reconfigurable circuit switched router composed of a crossbar and data converters. A disadvantage of the solution in [7] is that it does not support channel sharing and BE traffic. An additional network is required for configuring the switches and for carrying the BE traffic. Sharing a subset of wires between connections as in [63] leads to a more complex switch design with large delay. SDM and TDM techniques have been combined
in [64], allowing the number of supported connections to be increased either by increasing the number of sub-channels in the link or by increasing the number of time slots. This increases path establishment probability in the NoC.

In SDM based techniques, the sender serializes data on the allocated wires and the receiver deserializes the data before forwarding it to the IP block. One of the issues in SDM based circuits is the complexity of implementation of serializers and deserializers.

2.1.4 Static routing in NoCs

Most NoCs use traffic oblivious static routing[51] to establish communication channels between nodes. Dimension ordered routing[41][53][17][51][34] or routes decided at design time[65] are not flexible and cannot circumvent congested paths. Routing in FPGAs also presents a similar scenario where routes between communicating nodes are bandwidth and latency guaranteed, but are static. These routes occupy network resources along the path for the entire lifetime of the application. QoS is guaranteed in this case by over-provisioning resources along the route.

2.1.5 MPLS and Label Switching in NoCs

Use of Multi-Protocol Label Switching for QoS[38] in NoCs and the advantages of identifying communication channels using labels have been investigated in [39],[40]. A conventional NoC is connected to an MPLS backbone using Label Edge Routers (LERs)[38]. The MPLS backbone applies traffic engineering and priority based QoS services to communication channels identified by labels. The work is a direct mapping of the MPLS implementation in the Internet to NoCs. The router and NoC design approach is not optimized for a hardware implementation. Results from Network Simulator-2 (NS-2) are at a functional level and may not reflect the exact performance achievable inside a chip.

Use of labels to identify communication channels instead of source and destination identification numbers reduces the amount of metadata transmitted in the NoC. Unique addressing at source allows label reuse and enables efficient use of label space. Implementation of label based addressing in streaming applications has resulted in significant
reduction in router area[40]. The work employs a method similar to label switching to achieve non-global label addressing, hence reducing label bit width. A C×N ↦→ C routing strategy is described in conjunction with the label addressing scheme. The work in [40] presents a simple data transfer scheme and does not concentrate on rendering QoS between communicating nodes. The route establishment process has not been explicitly mentioned and one can assume that standard routing algorithms will be used.

2.1.6 Label Switched NoC

In the proposed work, we describe a Label Switched QoS guaranteeing NoC that retains advantages of both packet switched and circuit switched networks. Contention at output ports is tackled using communication pipes. Pipes are communication routes established along a bandwidth rich, contention free router path. Pipes are identified by a centralized Manager with complete network visibility.

The NoC Manager utilizes flow identification algorithms[66][67] (Algorithm 1) to establish pipes. The flow identification algorithm guarantees a deterministic delay in identifying and configuring pipes. The flow identification algorithm takes into account the bandwidth available in individual links to establish QoS guaranteed pipes. This guarantees QoS serviced communication paths between communicating nodes. Multiple pipes can be set up in a single link if the QoS requirements of all the pipes are satisfied. This enables sharing of physical links between pipes without compromising QoS guarantees. LS-NoC provides throughput guarantees irrespective of the spatial separation of communicating entities.
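As an illustration of the kind of computation involved in flow identification, the sketch below performs a simplified breadth-first search over residual link bandwidth, so that a pipe is only routed over links that can still supply the requested bandwidth. This is not the actual Algorithm 1 used by the LS-NoC Manager; the graph representation, bandwidth units and function names are assumptions made for this example.

```cpp
#include <cstddef>
#include <queue>
#include <vector>

// Illustrative flow identification: each link carries a residual bandwidth
// value (e.g. in Mb/s). A pipe is admitted only along a path whose every link
// can still supply the requested bandwidth, so existing pipes are undisturbed.
struct Link { std::size_t to; double residual_bw; };

using Topology = std::vector<std::vector<Link>>;   // adjacency list per node

// Breadth-first search for a contention-free, bandwidth-provisioned route.
// Returns the node sequence from src to dst, or an empty vector if no route
// with sufficient residual bandwidth exists.
std::vector<std::size_t> find_pipe(const Topology& g, std::size_t src,
                                   std::size_t dst, double demand_bw) {
    std::vector<std::size_t> parent(g.size(), g.size());   // g.size() marks unvisited
    std::queue<std::size_t> q;
    q.push(src);
    parent[src] = src;
    while (!q.empty()) {
        std::size_t u = q.front(); q.pop();
        if (u == dst) break;
        for (const Link& l : g[u])
            if (parent[l.to] == g.size() && l.residual_bw >= demand_bw) {
                parent[l.to] = u;
                q.push(l.to);
            }
    }
    if (parent[dst] == g.size()) return {};                 // no feasible pipe
    std::vector<std::size_t> path{dst};
    for (std::size_t v = dst; v != src; v = parent[v]) path.push_back(parent[v]);
    return {path.rbegin(), path.rend()};
}
```

Once such a path is found, the Manager would configure labels in the routers along it and decrement the residual bandwidth of every link on the path by the pipe's demand, so that subsequent requests see only the remaining capacity.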


2.2 Link Microarchitecture and Tile Area Exploration

2.2.1 NoC Design Space Exploration

Current research in architectural level exploration of NoCs in SoCs concentrates on understanding the impacts of varying topologies, link and router parameters on the overall throughput, area and power consumption of the system (SoCs and Multicore chips) using suitable traffic models; such a study is discussed in [68]. The paper illustrates a consistent comparison and evaluation methodology based on a set of quantifiable critical parameters for NoCs. The work suggests that evaluation of NoCs must take applications into account. The usual most critical evaluation parameters are not exhaustive and different applications may require additional parameters such as testability, dependability, and reliability.

Work in [69] emphasizes the need for co-design of interconnects, processing elements and memory blocks to understand the effects on overall system characteristics. Results from this work show that the architecture of the interconnect interacts closely with the design and architecture of the cores and caches. The work studies the area-bandwidth-performance trade-off of on-chip interconnects. The increase in area demands of shared caches in CMPs is also documented. Not using detailed interconnect models during CMP design leads to non-optimal, larger shared L2 caches inside the chip.

2.3 Simulation Tools

Simulation tools have been developed to aid designers in interconnection network (ICN) space exploration[70][71]. Kogel et al.[70] present a modular exploration framework to capture the performance of point-to-point, shared bus and crossbar topologies.

2.3.1 Link Exploration Tools

Works on link exploration tools make a case for microarchitectural wire management in future processors where communication is a prominent contributor to power and performance. Wire exploration tools such as those presented in [71], [72], [73], [74] and [75] give an estimate of the delay of a wire in terms of latency for a particular wire length and operating frequency.

Orion [71] is a power-performance interconnection network simulator that is capable of providing power and performance statistics. The Orion model estimates power consumed by
router elements (crossbars, FIFOs and arbiters) by calculating the switching capacitances of individual circuit elements. Orion contains a library of architectural level parameterized power models.

The more recent Orion 2.0 presented in [76] is an enhanced NoC power and area simulator offering improved accuracy compared to the original Orion framework. Some of the additions in Orion 2.0 include flip-flop and clock dynamic and leakage power models and link power models, leveraging models developed in [74]. The Virtual Channel (VC) allocator microarchitecture uses a VC allocation model based on the microarchitecture and pipeline proposed in [77]. Application-specific, technology-level fine tuning of parameters using different Vth values and transistor widths is used to increase the accuracy of power estimation.

Work in [72] explores the use of heterogeneous interconnects optimized for delay, bandwidth or power by varying design parameters such as buffer sizes, wire width and the number of repeaters on the interconnects. The work uses Energy-Delay² as the optimization parameter. An evaluation of different configurations of heterogeneous interconnects is made. The evaluation shows that an optimal configuration (for delay, bandwidth, power, or power and bandwidth) of wires can reduce the total processor ED² value by up to 11% compared to a NoC with homogeneous interconnect in a typical processor.

Courtay et al.[73] have developed a high-level delay and power estimation tool for link exploration that offers statistics similar to those of Intacte. The tool allows changing architectural level parameters such as different signal coding techniques to analyze the effects on wire delay/power.

Work in [74] proposes delay and power models for buffered interconnects. The models can be constructed from sources such as Liberty[78], LEF/ITF[79], ITRS[80], and PTM[81]. The buffered delay models take into account the effects of input and output slews of circuit elements in calculating intrinsic delays. The power models include leakage and dynamic power dissipation of gates. The area models include technology dependent coefficients that can be estimated by linear regression techniques per technology node to estimate repeater areas.
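A minimal sketch of the kind of first-order buffered-wire model such tools implement is given below. The RC constants, repeater parameters, supply voltage and activity factor are illustrative placeholders and are not taken from [74], Orion or Intacte.

```cpp
#include <cstdio>

// Illustrative first-order model of a repeated (buffered) wire.
// A wire of length L driven through n equally spaced repeaters is treated as
// n distributed-RC segments plus n repeater delays; dynamic power follows
// alpha * C * V^2 * f. All constants are assumed, not library data.
struct WireTech {
    double r_per_mm;      // wire resistance per mm (ohm)     - assumed
    double c_per_mm;      // wire capacitance per mm (fF)     - assumed
    double t_repeater;    // intrinsic repeater delay (ps)    - assumed
    double c_repeater;    // repeater input capacitance (fF)  - assumed
};

double delay_ps(const WireTech& t, double len_mm, int repeaters) {
    double seg = len_mm / repeaters;
    // 0.38*R*C distributed-wire delay per segment; ohm * fF = 1e-3 ps.
    double seg_delay = 0.38 * (t.r_per_mm * seg) * (t.c_per_mm * seg) * 1e-3
                       + t.t_repeater;
    return repeaters * seg_delay;
}

double dyn_power_mw(const WireTech& t, double len_mm, int repeaters,
                    double vdd, double freq_ghz, double activity) {
    double c_total_f = (t.c_per_mm * len_mm + t.c_repeater * repeaters) * 1e-15;
    return activity * c_total_f * vdd * vdd * freq_ghz * 1e9 * 1e3;   // mW
}

int main() {
    WireTech tech{75.0, 200.0, 25.0, 5.0};   // placeholder, 130nm-like values
    for (int n = 1; n <= 8; ++n)
        std::printf("repeaters=%d  delay=%6.1f ps  dyn power=%.3f mW\n", n,
                    delay_ps(tech, 5.0, n),
                    dyn_power_mw(tech, 5.0, n, 1.2, 0.5, 0.3));
    return 0;
}
```

Even this skeleton exposes the delay minimum that repeater insertion is meant to achieve; production models add slew, leakage and regression-fitted area terms on top of such a structure.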


Intacte[82] is used for interconnect delay and power estimates. Design variables for Intacte's interconnect optimization are wire width, wire spacing, repeater size and spacing, degree of pipelining, supply (Vdd) and threshold voltage (Vth). Intacte can be used to arrive at the power optimal number, sizes and spacing of repeaters for a given wire length to achieve a desired frequency. Intacte outputs the total power dissipated, including short circuit and leakage power values.

A high level power estimation tool accounting for interconnect effects is presented in [83]. The work presents an interconnect length estimation model based on Rent's rule[84] and a high level area (gate count) prediction method. Different place and route engines and cell libraries can be used with this proposed model after some minor adaptations.

2.3.2 Router Power and Architecture Exploration Tools

Most router exploration tools model ICN elements at a higher level abstraction of switches, links and buffers and help in power/performance trade-off studies[85][86]. These are used to research the design of router architectures[87] and ICN topologies[34] with varying area/performance trade-offs for general purpose SoCs or to cater to specific applications.

A high level power estimation methodology for NoC routers, based on the number of traversing flits as the unit of abstraction, has been proposed in [85]. The macro model of the framework incurs a minor absolute cycle error compared to gate level analysis. Providing a fast and cycle accurate power profile at an early stage of router design enables power optimizations such as power-aware compilers, core mapping, and scheduling techniques for CMPs to be incorporated into the final design. The power macro model uses state information of the FSM in a router assigned to reserve channels during packet forwarding for wormhole flow control. This enhances the accuracy of the power macro model. The power macro model based on regression analysis can be migrated to different technology libraries.

An architectural-level power model for interconnection network routers has been presented in [88]. The work specifically considers the Alpha 21364 and InfiniBand routers
for modelling case studies. Memory arrays, crossbars and arbiters form the basic building blocks of all router models using this framework. Each of these building blocks has been modelled in detail to estimate switching capacitance. Switching activity is estimated based on traffic models assuming certain arrival rates at the input ports. The power numbers for both the Alpha 21364 and InfiniBand routers have been found to match the vendors' estimates within a minor error margin.

The high level power model presented in [86] to estimate power consumption in semi-global and global interconnects considers switching power and power due to vias and repeaters. The high level model estimates switching power within an error of 6% with a speedup of three to four orders of magnitude. Error in via power is under 3%. A segment length distribution model has been presented for cases where Rent's rule is insufficient. The segment length distribution model has been validated by analyzing netlists of a set of complex designs.

A wormhole router implementing a minimal adaptive routing algorithm with near optimal performance and feasible design complexity is proposed in [87]. The work also estimates the optimal size of the FIFO in an adaptive router with a fixed priority scheme. The optimal size of the FIFO is derived to be equal to the length of the packet in flits in this work.

2.3.3 Complete NoC Exploration

Several frameworks have been proposed for complete NoC exploration[89][90][91]. These frameworks can be used as tools to derive a first cut analysis of the effect of certain NoC configurations at an early design phase. Such frameworks are the first steps in roadmapping the future of on-chip networks.

A technology aware NoC topology exploration tool has been presented in [89]. The NoC exploration is optimized for energy consumption of the entire SoC. The work characterizes 2D Meshes and Torii, along with higher dimensions, multiple hierarchies and express channels, for energy spent in the network. The work presents analytical models based on NoC parameters such as average hop count and average flit traversal energy to
predict the most energy-efficient topology for future technologies.

A holistic approach to designing energy-efficient cluster interconnects has been proposed in [90]. The work uses a cycle-accurate simulator with designs of an InfiniBand Architecture (IBA) compliant interconnect fabric. The system is modelled to comprise switches, network interface cards and links. The study reveals that the links and switch buffers consume the major portion of the SoC power. The work proposes dynamic voltage scaling and dynamic link shutdown as viable methods to save power during SoC operation. A system-level roadmapping toolchain for interconnection networks has been presented in [91]. The framework is titled Polaris and iterates through available NoC designs to identify a power optimal one based on network traffic, architectures and process characteristics.

Several complete NoC simulators have been developed and are in use by the NoC research community[92][93][94]. The Network-on-Chip Simulator, Noxim[92], was developed at the University of Catania, Italy. Several NoC parameters such as network size, buffer size, packet size distribution, routing algorithm, selection strategy, packet injection rate, traffic time distribution, traffic pattern and hot-spot traffic distribution can be input to this framework. The simulator allows NoC evaluation based on throughput, flit delay and power consumption. The Nostrum NoC Simulation Environment (NNSE) [94] is part of the Nostrum project[65] and contains a SystemC based simulator. Inputs to this simulator are network size, topology, routing policy and traffic patterns. Based on these configuration parameters a simulator is built and executed to produce a desired set of results in a variety of graphs.

2.3.4 CMP Exploration Tools

Wattch was one of the first architectural level frameworks for analyzing and optimizing microprocessor power dissipation[95]. Wattch was orders of magnitude faster than layout-level power tools, and its accuracy was within 10% of verified industry tools on leading-edge designs. Wattch is an architecture-level, parameterizable simulator framework that can accurately quantify potential power consumption in microprocessors.
The Wattch framework quantifies the power consumption of all the major units of the processor, parameterizes them, and integrates these power estimates into a high-level simulator. Wattch models the main processor units as array structures, fully associative content-addressable memories, combinational logic and wires, or clocking elements. Individual capacitances of each of these elements are estimated and power is calculated. Work presented in [95] integrates Wattch into the SimpleScalar architectural simulator[96].

A tool like Ruby[97] allows one to simulate a complete distributed memory hierarchy with an on-chip network as in Orion. However, it needs to be augmented with a detailed interconnect model which accounts for the physical area of the tiles and their placements.

Network Processor exploration and power estimation tools utilize models for smaller components and quote the integrated power for the system[98][99][100]. They use cycle accurate register, cache and arbiter models introduced previously here. NePSim[99] is an open-source integrated simulation infrastructure. Typical network processors can be simulated with the cycle accurate simulator included in the framework. Testing and validation of results can be performed by an automatic verification framework. NePSim combines power models from XCacti[101], Wattch[95] and Orion[71] for the different hardware structures in NePSim. XCacti[101] is an enhanced version of Cacti 2.0 that includes power modeling for cache writes, misses, and writebacks. NePSim classifies the network processor components into categories such as ALU and shifter, registers, caches, queues and arbiters. The processor's power consumption can be calculated using a power estimation tool built into the framework.

Sapphire

The tile area optimization problem is closely knit with interconnect, cache and processor architecture exploration. There is a need for a co-design of interconnects, processing elements and memory blocks to fully optimize the overall multi-core chip performance. This necessitates a simulation framework which allows a co-simulation of processor cores, a detailed cache memory hierarchy and an on-chip network, along with a low-level interconnect model.


Sapphire[102] is a detailed chip multiprocessor simulation framework. Sapphire integrates SESC[103], Ruby[97], Intacte[82] and DRAMSim[104]. Sapphire enables cycle accurate simulations of a multi-core chip having a distributed memory hierarchy and an on-chip network, with interconnect latencies which are consistent with the physical sizes and placement of the cores. It is a multi-processor/multicore simulator where the memory hierarchy, interconnect and off-chip DRAM are parameterized and can be configured to model various configurations. Power consumed by DRAM is modelled using the MICRON power model. Sapphire also integrates interconnect power models from Intacte. Thus Sapphire allows users to explore the power and performance implications of all main system components like the processor, interconnect, cache hierarchy and off-chip DRAM.

Modeling Interconnect in Sapphire

Interconnect models from Intacte[75] are used in Sapphire for link level exploration and power estimation. Wire length estimation is the first step in latency estimation. Interconnect lengths are estimated by constructing floorplans of each CMP tile. A CMP tile contains a processor, an L1 Cache Controller connected to the L1 instruction (L1-I) and L1 data (L1-D) caches, an L2 Cache Controller and L2 Cache, a router and an optional memory controller. The memory is placed off-chip and is not a part of the node. The floorplan of a typical tile is shown in Figure 2.1. The area of the processor is estimated from available commercial processors. Areas of L1 and L2 caches are obtained from CACTI. The area of the router is negligible compared to the tile area at 32nm[105]. This method of wire length estimation has been used in Section 4.5.1 of Chapter 4. The Horizontal R-R and Vertical R-R distances denote the distances between horizontal and vertical routers in a homogeneous NoC such as a 2D Mesh.

Figure 2.1: Floorplan used in estimating wire lengths. Wire lengths estimated from these floorplans are used as input to Intacte to arrive at a power optimal configuration and latency in clock cycles. Horizontal R-R: Link between neighboring routers in the horizontal direction; Vertical R-R: Link between neighbouring routers in the vertical direction.
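A rough sketch of this floorplan-based estimation is given below. The component areas are placeholders (they would normally come from CACTI and commercial processor data), and the tile is assumed to be square, which simplifies the floorplan of Figure 2.1 to a single router-to-router distance.

```cpp
#include <cmath>
#include <cstdio>

// Illustrative tile-area to wire-length estimation. Component areas (mm^2)
// below are assumed values, not CACTI output. With roughly square tiles laid
// out in a 2D mesh, the router-to-router link length is about one tile edge.
int main() {
    double core_mm2 = 2.0;     // processor core            (assumed)
    double l1_mm2   = 0.5;     // L1-I + L1-D + controller  (assumed)
    double l2_mm2   = 4.0;     // L2 bank + controller      (assumed)
    double misc_mm2 = 0.5;     // router, memory controller, wiring overhead (assumed)

    double tile_mm2     = core_mm2 + l1_mm2 + l2_mm2 + misc_mm2;
    double tile_edge_mm = std::sqrt(tile_mm2);      // square-tile assumption

    std::printf("tile area          = %.2f mm^2\n", tile_mm2);
    std::printf("R-R link length    = %.2f mm (horizontal and vertical)\n", tile_edge_mm);
    return 0;
}
```

The resulting link lengths would then be fed to Intacte to obtain a power optimal repeater configuration and a latency in clock cycles, as described above.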


2.3.5 Communication in CMPs - Performance Exploration

Analysis of communication delays and their effect on the power and performance of CMPs has been a subject of interest in recent years. Mitigating communication delays through compiler techniques and micro-architecture has been looked at.

Cache Based Microarchitectural Techniques

Strided prefetching[106] has been compared with block migration and on-chip transmission lines as a means of managing on-chip wire delay in CMP caches to improve performance. Work in [106] combines block migration strategies, transmission line caches and stride-based hardware to reduce cache based communication latencies in CMPs. Dynamic Non-Uniform Cache Architecture (DNUCA)[107] uses block migration techniques to reduce cache latencies. Block migration involves moving frequently accessed blocks in the cache into banks with lower latencies. On-chip transmission lines are used to provide low latency to all banks in Transmission Line Caches[108]. Around 40%-60% of L2 cache hits are satisfied in central banks in CMPs with shared L2, rendering block migration ineffective. TLCs suffer from bandwidth contention, reducing their advantage.

Communication energy and delay can be minimized by migrating frequently accessed cache lines to banks closest to the accessing processors[109][110][111]. Non-Uniform Cache Architectures (NUCAs) with policies that allow important cache lines to migrate toward the processor within the same level of the cache have been proposed in [109][110]. The cache architecture proposed in [109][110] shows that a dynamic NUCA structure achieves higher IPC than a traditional Uniform Cache Architecture (UCA) when maintaining the same size and manufacturing technology. The D-NUCA mapping policy attempts to provide fastest bank access to all bank sets by sharing the closest banks among multiple bank sets. In this
case, banks in the cache are n-way set associative if they share a single bank. The proposed method results in low latency access, technology scalability and performance stability due to flattening of the memory hierarchy.

Work in [111] analyses the influence of network routers on performance in NUCA caches. Specifically, the work looks at the cut-through latency and buffering capacity of the network routers connecting the cache controller to different sub-banks in NUCA caches. They conclude that the effect of cut-through latency on the performance of the caches is high and that modest buffering is sufficient to achieve a good performance level. The work proposes a clustered NUCA organization that places an upper limit on the average number of hops experienced by cache accesses. This method simplifies router implementation, scales better with cut-through latencies and is more effective compared to hybrid network solutions.

Compiler Assisted Techniques

Instruction steering [112] and instruction replication[113] coupled with clustering have been researched as effective techniques to reduce the impact of wire delays. In a clustered superscalar microarchitecture, critical components are partitioned into smaller processing units called clusters. The centralized instruction window is replaced with multiple smaller windows. Simpler, smaller blocks allow clusters to operate at higher clock speeds with better energy efficiency and are scalable. However, clustering trades off IPC to score on these parameters. Clusters reduce the impact of wire delays by reducing the complexity and power consumption in clusters. Instruction steering involves assigning instructions to clusters in such a manner that workload is balanced and off-cluster communication is reduced. Work in [112] reduces communication during steering. The work proposes an Accurate Rebalancing technique to offload processes from the most utilized processor to the lesser utilized processors such that intra-cluster communication is minimal. Topology aware steering has also been proposed as a method to decrease latency between source and destination clusters.


Work in [113] presents an instruction replication method along with a clustering approach to decrease inter-cluster (inter-PE) communication. The load balancing algorithm used to distribute instructions among clusters, along with the amount of inter-cluster communication, dictates the performance of a clustered processor. The work aims to reduce inter-cluster communication by replicating instructions in the processing elements (PEs) where their results are utilized. Resources idle and available in PEs are used by replicated instructions such that load balancing is maintained.

Data transfer on long latency wires can be reduced by value prediction[114] and cache line replication[115][116] techniques. Work presented in [114] reduces long wire delays by predicting the data being communicated. The predicted value is then validated locally where it was produced. Correctly predicted values do not incur the long wire delay. The stride value predictor[117][118] predicts source operands of instructions to be executed.

Victim cache line replication presented in [115] replicates evicted primary cache lines into the L2 slice local to the CMP tile. The work considers a CMP with each tile containing a slice of the total L2. Cache line replication is a hybrid cache management policy combining a private local L2 slice and shared L2. The total effective capacity of L2 is reduced when every tile has a local copy of accessed cache lines. On the other hand, a single shared L2 may incur large latencies when cache lines have to be accessed from remote tiles. Hits to replicated cache lines reduce the effective latency of the shared L2 cache and hence reduce latency effects from communication in CMPs.

GALS & Floorplanning Techniques

Scalable microarchitectural techniques to reduce the impact of wire delay have been looked at[119][120]. Work in [119] investigates the interconnect bottleneck in FPGA based systems and proposes Globally Asynchronous Locally Synchronous (GALS) design as a potential solution. The work proposes a design flow to investigate the optimal GALS island size to balance the amount of inter-island communication and the asynchronous communication overhead between GALS islands.

Floorplanning techniques to overcome long latencies between the processor and the
Level-2 cache have been experimented on[121]. The work states that floorplans should aid CMP systems in distinguishing shared and private data accessed by cores. The paper introduces a floorplan topology that partitions shared and private cached data and thus enables fast access to shared data for all processors while preserving the vicinity of private data to each processor.

2.4 Summary

Providing QoS guarantees in on-chip communication networks has been identified as one of the major research problems in NoCs[48]. A Label Switched QoS guaranteeing NoC that retains advantages of both packet switched and circuit switched networks is described in this thesis. LS-NoC sets up communication channels (pipes) between communicating nodes that are independent of existing pipes and contention free at routers. Flow identification algorithms identify contention free and resource rich paths taking into account the existing routes in the NoC. LS-NoC is described in Chapters 5, 6 and 7.

Wire lengths have a significant influence on the latency of interconnects, and hence need to be included in the simulation framework. It is clear from works emphasizing the effects of communication on chip performance that there is a need for a co-design of interconnects, processing elements and memory blocks to fully optimize overall system-on-chip performance. This necessitates a simulation framework which allows a co-simulation of the communicating entities along with ICN simulation. Additionally, to optimize power fully, one also needs to incorporate link-level microarchitectural choices such as pipelining.

A SystemC framework which enables NoC designers to assemble communicating entities along with the ICN, and also allows for exploration of architectural and microarchitectural parameters of the ICN in order to obtain the latency, throughput and power trade-offs, is presented in Chapter 3. Chapter 3 presents results for NoC power by considering the effects of various pipelining configurations, frequency and voltage scaling values. Various traffic generation and distribution models have been used to mimic realistic traffic patterns and activity in NoCs. Trade-off studies in this chapter consider the Energy-Delay
product (of the NoC) as the optimization parameter.

Communication effects need to be accounted for in simulations as communication times have a significant impact on the performance and energy of CMPs. The trade-off between tile size, communication time and energy efficiency is explored in Chapter 4. We explore these trade-offs using a detailed, cycle accurate, multicore simulation framework which includes superscalar processor cores, cache coherent memory hierarchies, on-chip point-to-point communication networks and a detailed interconnect model including pipelining and latency. CACTI[122] cache models are used to estimate the area, energy per access and leakage power of L1 & L2 caches, and values from the SPARC processor[123] are used for processor power estimates.


Chapter 3

Link Microarchitecture Exploration

This chapter presents latency, power and performance trade-offs of microarchitectural parameters of NoCs. The design and implementation of a framework for such a design space exploration study is also presented. The chapter presents the trade-off study on NoCs by varying microarchitectural (e.g. pipelining) and circuit level (e.g. frequency and voltage) parameters. Though the study is done in the context of a packet based NoC, the results are applicable to all NoCs as the parameters considered are low level link parameters which are common to all NoCs.

A SystemC based NoC exploration framework presented here is used to explore the impacts of various architectural and microarchitectural level parameters of NoC elements on the power and performance of the NoC. The framework enables the designer to choose from a variety of architectural options like topology, routing policy, etc., and also allows experimentation with various microarchitectural options for the individual links like length, wire width, pitch, pipelining, supply voltage and frequency. Network-on-Chip design parameters such as topology generation and link pipelining have varying impacts on the throughput of the network, the latency of flits and the power dissipation of the NoC in an SoC. The framework uses Intacte[82] to estimate delay and power based on micro-architecture parameters such as wire length, wire width and activity for a given technology and voltage. The framework also supports a flexible traffic generation and communication model. Latency, power and throughput results using this framework to study a 4x4 CMP are presented. The chapter
presents results on power-performance trade-off studies on Mesh, Torus, Folded Torus, Tree based network and Reduced 2D Torus topologies by varying pipelining in links and frequency and voltage scaling.

The framework is used to study NoC designs of a CMP using different classes of parallel computing benchmarks[6]. Traffic patterns from dense and sparse linear algebra applications are used in this study. The traffic consists of both Request-Response messages (mimicking cache accesses) and One-Way messages. One of the key findings is that the average latency of a link can be reduced by increasing pipeline depth to a certain extent, as it enables link operation at higher link frequencies. There exists an optimum degree of pipelining which minimizes the energy-delay product of the link.

Using frequency scaling experiments we show that switching to a higher degree of link pipelining to achieve higher frequency, instead of adding larger buffers, is advantageous from a power perspective. Two case studies comparing topologies based on throughput are presented. A SystemC based simulation framework containing parameterizable routers, links, traffic generators and sink nodes is used for NoC exploration.

Organization of the Chapter

The rest of the chapter is organized as follows. A few works related to router modelling, design space exploration and power estimation are introduced in Section 3.1. A detailed literature survey of router modelling and design space exploration of NoCs has been presented in Section 2.2 of Chapter 2. The NoC exploration framework used in the trade-off studies is described in Section 3.2. Latency, power and performance trade-offs, and frequency scaling and voltage scaling results for two case studies are presented in Sections 3.3 and 3.4. Section 3.3 presents design exploration results for a 4×4 2-dimensional Mesh, a similar Torus and an equivalent Folded Torus NoC. Section 3.4 presents design exploration results for a 16 node (4×4) Torus, Reduced Torus and a Tree based NoC. Results and findings are summarized in Section 3.5.


3.1 Motivation for a Microarchitectural Exploration Framework

Current research in architectural level exploration of NoCs in SoCs concentrates on understanding the impacts of varying topologies, link and router parameters on the overall throughput, area and power consumption of the system (SoCs and Multicore chips) using suitable traffic models[68]. Work in [69] emphasizes the need for co-design of interconnects, processing elements and memory blocks to understand the effects on overall system characteristics. Simulation tools have been developed to aid designers in ICN space exploration[70][71]. Kogel et al.[70] present a modular exploration framework to capture the performance of point-to-point, shared bus and crossbar topologies. Orion[71] is a power-performance interconnection network simulator that is capable of providing power and performance statistics. The Orion model estimates power consumed by router elements (crossbars, FIFOs and arbiters) by calculating the switching capacitances of individual circuit elements. Most of these tools do not allow for exploration of the various link level options of wire width, pitch, serialization, repeater sizing, pipelining, supply voltage and operating frequency.

NoC exploration tools usually model ICN elements at a higher level abstraction of switches, links and buffers and help in power/performance trade-off studies[86]. Another area of active research is the design of router architectures[124][87] and ICN topologies[34] with varying area/performance trade-offs for general purpose SoCs or to cater to specific applications.

On the other hand, tools exist to separately explore low level link options to various degrees, as in [82], [72] and [73]. Work in [72] explores the use of heterogeneous interconnects optimized for delay, bandwidth or power by varying design parameters such as buffer sizes, wire width and the number of repeaters on the interconnects. Courtay et al.[73] have developed a high-level delay and power estimation tool for link exploration that offers statistics similar to those of Intacte. The tool allows changing architectural level parameters such as different signal coding techniques to analyze the effects on wire delay/power.
Intacte[82] provides a similar capability to explore link level design options and is used in this research.

It is clear from the aforementioned works that there is a need for a co-design of interconnects, processing elements and memory blocks to fully optimize overall system-on-chip performance. This necessitates a simulation framework which allows a co-simulation of the communicating entities along with ICN simulation. Additionally, to fully optimize power, one also needs to incorporate link-level microarchitectural choices such as pipelining. A SystemC framework which enables NoC designers to assemble communicating entities along with the ICN, and also allows exploration of architectural and microarchitectural parameters of the ICN in order to obtain latency, throughput and power trade-offs, has been developed and is presented in Section 3.2.

Further, previous works largely concentrate on router power and do not take into account various link microarchitectural parameters for power and performance trade-off calculations. This chapter presents results for NoC power by considering the effects of various pipelining configurations, frequency and voltage scaling values. Various traffic generation and distribution models have been used to mimic realistic traffic patterns and activity in NoCs. The trade-off studies in this chapter consider the energy-delay product of the NoC as the optimization parameter.

3.2 NoC Microarchitectural Exploration Framework

The NoC exploration framework (Figure 3.1) has been built upon Open Core Protocol-IP models[125] using OSCI SystemC 2.0.1[126] on Linux (2.6.8-24.25-default). The framework contains Router, Link and Processing Element (PE) modules, and each can be customized via various parameters. The NoC modules can be interconnected to form a desired NoC. The PE module represents any communicating entity on the SoC and not just the processing element. We can connect an actual executable model of the entity or some abstract model representing its communication characteristics. For abstract models, we support many different traffic generation and communication patterns. The link module can be used to customize the bit-width of the links as well as the degree of pipelining in the link.


Figure 3.1: Architecture of the SystemC framework.

Figure 3.2: Flow of the ICN exploration framework.

A single run (Figure 3.2) uses these models to run a communication task and outputs data files of message transfer logs. From these log files, one-way and round trip flit latency, throughput and link capacitance activity factors are extracted. Intacte is then used to obtain the final power numbers for different operating frequency and supply voltage options. Table 3.1 summarizes the various parameters that can be varied in the framework.
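As an illustration of this post-processing step, the following minimal sketch parses a hypothetical flit log (one record per received flit: source, destination, injection cycle, reception cycle, payload bits) and derives the average one-way latency and the total throughput. The log layout and field names are assumptions made for illustration; the actual framework's log format may differ.

    // Sketch of the post-processing that follows a simulation run.
    // The record layout assumed here is hypothetical.
    #include <algorithm>
    #include <fstream>
    #include <iostream>
    #include <sstream>
    #include <string>

    int main(int argc, char** argv) {
        if (argc < 3) {
            std::cerr << "usage: extract <flit_log> <clock_freq_hz>\n";
            return 1;
        }
        std::ifstream log(argv[1]);
        const double freq_hz = std::stod(argv[2]);

        long long flits = 0, total_latency_cycles = 0, bits_received = 0;
        long long last_cycle = 0;
        std::string line;
        while (std::getline(log, line)) {
            std::istringstream rec(line);
            int src, dst;
            long long inject_cycle, receive_cycle, bits;
            if (!(rec >> src >> dst >> inject_cycle >> receive_cycle >> bits))
                continue;                               // skip malformed records
            ++flits;
            total_latency_cycles += receive_cycle - inject_cycle;
            bits_received        += bits;
            last_cycle            = std::max(last_cycle, receive_cycle);
        }
        // One-way flit latency in cycles and NoC throughput in bits/s,
        // following the definitions used in this chapter.
        double avg_latency = flits ? double(total_latency_cycles) / flits : 0.0;
        double sim_time_s  = last_cycle / freq_hz;
        double throughput  = sim_time_s > 0 ? bits_received / sim_time_s : 0.0;
        std::cout << "avg one-way latency (cycles): " << avg_latency << "\n"
                  << "throughput (bits/s): " << throughput << "\n";
        return 0;
    }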


Table 3.1: ICN exploration framework parameters.

  Parameter                       Description
  NoC Parameters
    Routing Algorithms            Source Routing and Table based routing
    Switching Policy              Packet, Circuit, Wormhole, VC switching
    Traffic Paradigm              Request-Response & One-Way Traffic
    Traffic Generation Scheme     Deterministic, Self-Similar
    Traffic Distribution Scheme   Deterministic, Uniformly random, HotSpot, Localized, First Matrix Transpose
  Router Microarchitecture
    No. of Input/Output Ports     2-8 (based on topology to be generated)
    Input/Output buffer sizes     Flit-level buffers
    Crossbar switching capacity   In terms of flits (default = 1)
  Link Microarchitecture
    Length of interconnect        Longest link in mm
    Bit width of the interconnect
  Circuit Parameters
    Frequency, Supply Voltage

3.2.1 Traffic Generation and Distribution Models

To test NoCs on realistic multi-core applications we set up traffic generation and distribution to mimic various communication patterns. The traffic models implemented in the Traffic Generator module are Deterministic, Uniformly Random, Localized, Hotspot and First Matrix Transpose traffic. The models differ in how many destination nodes receive flits from a given generator, and how often. Request-Response (RR) and One-Way Traffic (OWT) generation are supported. For example, in multi-core chips the former can correspond to activities like cache line loads and the latter can correspond to cache line write backs. The traffic distribution input is given using two matrices of size N×N, where N is the number of communicating entities. Item (i,j) in a matrix gives the probability of PE i communicating with PE j in the current cycle. The two separate matrices correspond to RR and OWT generation. The probability of choosing between the two matrices depends on a global input that decides the percentage of RR traffic to be generated for the simulation run, as sketched below. This model can be further expanded to capture burst characteristics as well as message size.
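The following sketch illustrates how a traffic generator at PE i could implement this selection each cycle: a Bernoulli draw against the global RR fraction picks the matrix, and row i of that matrix is then sampled for a destination. Only the "row i gives the probability of i sending to j" convention comes from the description above; the helper names and sampling details are illustrative assumptions.

    // Illustrative destination selection from the RR and OWT matrices.
    #include <random>
    #include <vector>

    using Matrix = std::vector<std::vector<double>>;

    // Sample a destination j with probability proportional to prob[i][j].
    int pick_destination(const Matrix& prob, int i, std::mt19937& rng) {
        std::discrete_distribution<int> dist(prob[i].begin(), prob[i].end());
        return dist(rng);
    }

    struct FlitRequest { int src, dst; bool request_response; };

    // One generation decision for PE 'src'. 'rr_fraction' is the global input
    // deciding what percentage of generated traffic is Request-Response.
    // (The per-cycle decision of whether to generate a flit at all is omitted.)
    FlitRequest generate(int src, const Matrix& rr_matrix, const Matrix& owt_matrix,
                         double rr_fraction, std::mt19937& rng) {
        std::bernoulli_distribution is_rr(rr_fraction);
        bool rr = is_rr(rng);
        int dst = pick_destination(rr ? rr_matrix : owt_matrix, src, rng);
        return {src, dst, rr};
    }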


Figure 3.3: Flit header format. DSTID/SRCID: Destination/Source ID, SQ: Sequence Number, RQ & RP: Request and Response Flags, and a 13 bit flit id.

Table 3.2: Traffic Generation/Distribution Model and Experiment Setup for the Mesh, Torus & Folded-Torus case study.

  Parameter                   Values
  Communication Patterns      DLA Traffic        SLA Traffic
  NoCs Simulated              2D Mesh, Torus and Folded Torus
  Localization Factor         0.7                0.5
  Traffic injection rate      20%
  RR Factor                   0.03               0.1
  Size of phit (Wire width)   32 bits
  OW and RR Request Flit      2 phits
  RR Response Flit            6 phits
  Simulation Time             40000 cycles
  Process                     45nm
  Environment                 Linux (2.6.8-24.25-default) + OSCI SystemC 2.0.1 + Matlab 7.4

Flit Header Format - Mesh, Torus & Folded-Torus

The communication packets are broken into a sequence of flit transfers. The flit header format is shown in Figure 3.3. For this case study, table based routing is assumed. The SQ field is used to identify in-order arrival of all flits. Response flits have the first 2 bits set to 11. The SRCID, DSTID and FlitID fields are preserved in the Response flit for the sake of error checking and latency calculations in the framework. The traffic receiver reads the header to determine whether the flit type is RR or not (flag RQ). If RQ is set, the Traffic Generator is notified and the flit header is sent to the Traffic Generator. RR traffic has priority over OWT and hence the request will be immediately serviced without breaking an OW flit.
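For illustration, one possible packing of the header of Figure 3.3 into a 32-bit word is shown below. The field names, the RQ/RP flags and the 13-bit flit id come from the figure; the widths assumed for DSTID, SRCID and SQ (4, 4 and 9 bits, enough for a 16-node NoC) and the field ordering are not fully specified in the text and are chosen only to make the example concrete.

    // One possible layout of the flit header of Figure 3.3 (assumed widths).
    // Note: bit-field packing order is implementation defined in C++.
    #include <cstdint>

    struct FlitHeader {
        uint32_t dst_id  : 4;   // destination PE (16 nodes assumed)
        uint32_t src_id  : 4;   // source PE
        uint32_t sq      : 9;   // sequence number, checks in-order arrival
        uint32_t rq      : 1;   // request flag (set for RR request flits)
        uint32_t rp      : 1;   // response flag (set for RR response flits)
        uint32_t flit_id : 13;  // 13-bit flit identifier
    };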


Figure 3.4: Example flit header formats considered in this experiment. (a) Header used in table based routing. (b) Header used in source routing. (DST/SRCID: Destination/Source ID, HC: Hop Count, CHNx: Direction at hop x).

The Response flit to a request flit has RP set and RQ reset. In a received flit, if RQ is not set, no action is taken. Table 3.2 lists the parameters used in our traffic model and in the experiments. The framework is also capable of generating Deterministic, Uniformly Random, Hotspot and First Matrix Transpose traffic distributions.

Flit Header Format - Torus, Reduced Torus & Tree based NoC

In this case study, the flit header formats are varied based on the type of routing scheme used. The flit header formats for source routing and table based routing are shown in Figure 3.4. The source routing header (Figure 3.4(b)) is larger, as it contains the output port number for every hop the flit has to traverse. The table based routing header (Figure 3.4(a)) contains only the final destination address.

3.2.2 Router Model

The router model is a parameterized, scalable module of a generic router[68]. Router microarchitecture parameters include the number of input/output ports, the sizes of the input/output buffers, the switching capacity of the crossbar (the number of bits that can be transferred from the input to the output buffers in a cycle), etc. (Table 3.1). Flow control is implemented through sideband signals[125]. Example routing algorithms are source and table based routing. Switching policies such as circuit switching, packet switching and wormhole switching have been implemented. Flow control prevents traffic generators from spewing phits into the network after the input buffer fills to a threshold.


The router model has been carefully designed to be easily adapted for use in various topologies (with varying flit header formats as shown in Figures 3.3 and 3.4) with minimal changes.

3.2.3 Power Model

Intacte[82] is used for interconnect delay and power estimates. The design variables for Intacte's interconnect optimization are wire width, wire spacing, repeater size and spacing, degree of pipelining, supply voltage (Vdd) and threshold voltage (Vth). Activity and coupling factors are input to Intacte from the SystemC simulation results. Intacte arrives at a power optimal number of repeaters, their sizes and spacing for a given wire to achieve a desired frequency. The wire width (in bits) is known per simulation run. The tool also includes flop and driver overheads in the power and delay calculations. Intacte outputs the total power dissipated, including short circuit and leakage power values. We arrive at approximate wire lengths using floorplans (Figure 3.17). Minimum wire spacing is obtained from foundry rules. Intacte solves an optimization problem to arrive at the optimal number of repeaters and repeater spacing for a given frequency and voltage. Other physical parameters are obtained from Predictive Technology Models[81] for 65nm and 45nm.

3.3 Case Study: Mesh, Torus & Folded-Torus

In this case study, we study a 4x4 chip multiprocessor (CMP) for three different network topologies - Mesh, Torus and Folded-Torus. We use two communication patterns from [6], the Dense Linear Algebra (DLA) and Sparse Linear Algebra (SLA) benchmarks. DLA applications exhibit highly localized communication. The traffic model for DLA generates 70% of its traffic to immediate neighbors and distributes the remaining traffic uniformly to the other nodes. SLA communication is reproduced using 50% localized traffic, with the rest of the traffic destined to half of the remaining nodes. Further, we assume all RR traffic to be localized. For example, 10% of the traffic generated per PE over the simulation will be of Request type if RR=0.1. All Request flits are destined to immediate neighbors. 70% of the flits generated by any PE over the simulation time are destined to immediate neighbors if the localization factor is 0.7 (as in the case of DLA). A sketch of how such a distribution matrix can be constructed is shown below.
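The sketch builds the matrix for a 4x4 mesh: the localization fraction (0.7 for DLA) is split equally among a PE's immediate neighbours and the remainder is spread uniformly over the other nodes. Only the 70%-to-neighbours rule comes from the text; the construction details are an assumption for illustration.

    // Illustrative DLA-style distribution matrix for a dim x dim mesh.
    #include <cstdlib>
    #include <vector>

    std::vector<std::vector<double>> dla_matrix(int dim = 4, double localization = 0.7) {
        const int n = dim * dim;
        std::vector<std::vector<double>> p(n, std::vector<double>(n, 0.0));
        auto row = [dim](int id) { return id / dim; };
        auto col = [dim](int id) { return id % dim; };
        for (int i = 0; i < n; ++i) {
            std::vector<int> neighbours, others;
            for (int j = 0; j < n; ++j) {
                if (j == i) continue;
                int manhattan = std::abs(row(i) - row(j)) + std::abs(col(i) - col(j));
                (manhattan == 1 ? neighbours : others).push_back(j);  // 1 hop away?
            }
            // 'localization' of PE i's traffic goes to immediate neighbours,
            // the rest is distributed uniformly over the other nodes.
            for (int j : neighbours) p[i][j] = localization / neighbours.size();
            for (int j : others)     p[i][j] = (1.0 - localization) / others.size();
        }
        return p;
    }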


Figure 3.5: Schematic of the 3 compared topologies (L to R: Mesh, Torus, Folded Torus). Routers are shaded and Processing Elements (PE) are not.

Experiments are designed to calculate the latency (clock cycles), throughput (Gigabits/sec) and power (milliWatts) of the various topologies. Table 3.2 lists some of the simulation setup parameters used in the following experiments.

3.3.1 NoC Topologies

In this work we consider three similar topologies for trade-off studies. Router and processing elements are identical in all three topologies. In fact, the same communication trace is played out for all the different ICN parameter explorations. The schematic of the three NoCs is shown in Figure 3.5, with the floorplans largely following the schematics. The floorplans are used to estimate the wire lengths, which are then input to Intacte. Processing element sizes are estimated by scaling down the processor in [123] to 45nm, to be of size 2.25×1.75mm. The routers are of size 0.3×0.3mm. The lengths of the longest links in the Mesh, Torus and Folded Torus are estimated as 2.5mm, 8.15mm and 5.5mm respectively. The longest links in the Torus connect the routers at opposite sides. The routing policy for all topologies is table based. The routing policy was made to keep the worst case latency in check. Where routes had an alternative longer link, the shorter one has been chosen.


Longer links had minimal activity in the experiments. The lengths of the links in each of the topologies and the pipelining factors are illustrated in Table 3.3. The pipelining factor corresponds to the longest link in the NoC. A pipelining factor of 1 means the longest link is unpipelined, P=2 indicates it has a two cycle latency, and so on.

Table 3.3: Links and pipelining details of NoCs

  Topology          Length in mm (no. of links)   Pipelining
  2D Mesh           2.5  (24)                     1 2 3 4 5 6 7 8
                    2.0  (56)                     1 2 3 4 4 5 6 7
  2D Torus          8.15 (8)                      1 2 3 4 5 6 7 8
                    6.65 (8)                      1 2 3 4 4 5 6 7
                    2.5  (24)                     1 1 1 2 2 2 3 3
                    2.0  (56)                     1 1 1 1 2 2 2 2
  Folded 2D Torus   5.5  (16)                     1 2 3 4 5 6 7 8
                    4.5  (16)                     1 2 3 4 5 5 6 7
                    2.75 (16)                     1 1 2 2 3 3 4 4
                    2.25 (16)                     1 1 2 2 3 3 3 4
                    2.0  (32)                     1 1 2 2 2 3 3 3

3.3.2 Round Trip Flit Latency & NoC Throughput

Round trip flit (link-level flow control unit) latency is calculated starting from the injection of the first phit (physical transfer unit in an NoC) to the reception of the last phit. In the case of OW traffic the latency is one way. In the case of RR traffic it is the delay in clock cycles from the beginning of request injection to the completion of response arrival. Communication traces are analyzed using error checking (for phit loss, out-of-order reception, erroneous transit etc.) and latency calculation scripts to ensure functional correctness of the system.

Figure 3.6 shows the effect of increasing traffic rate on the average round trip latency (normalized, in clock cycles) for all three NoCs. As input buffers start filling up, the waiting time of flits increases and hence latencies increase at higher traffic rates. Similar results have been shown in other NoC exploration frameworks[127].

The total throughput of the NoC (in bits/sec) is calculated as the total number of bits received (phits_r × bits_phit) at the sink nodes divided by the total (real) time ((1/f) × sim_cycles) spent (Eqn 3.1):

    Th_total = (phits_r × bits_phit) / ((1/f) × sim_cycles)        (3.1)
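A small numerical illustration of Eqn 3.1 follows. The 32-bit phit width and 40000-cycle simulation length are taken from Table 3.2, while the phit count and clock frequency are made-up example values.

    // Worked example of Eqn 3.1: throughput = bits drained at sinks / wall-clock time.
    #include <iostream>

    int main() {
        const double phits_received = 2.0e5;   // phits drained at all sinks (example value)
        const double bits_per_phit  = 32.0;    // phit width used in this case study
        const double freq_hz        = 1.0e9;   // link/router clock (example value)
        const double sim_cycles     = 40000.0; // simulation length from Table 3.2

        double sim_time_s = sim_cycles / freq_hz;                   // (1/f) * sim_cycles
        double throughput = phits_received * bits_per_phit / sim_time_s;
        std::cout << "Th_total = " << throughput / 1e9 << " Gbit/s\n";  // 160 Gbit/s here
        return 0;
    }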


Figure 3.6: Normalized average round trip latency in cycles vs. traffic injection rate for all 3 NoCs.

The maximum achievable frequency of a wire of a given length is obtained using Intacte (Figure 3.7). The maximum throughput of each NoC running DLA traffic at P=1 is shown in Figure 3.8. The 2D Mesh has the shortest links and the highest achievable frequency, and hence the highest throughput.

Average round trip latencies in nanoseconds over various pipeline configurations in all 3 NoCs are shown in Figure 3.9. The results show that the overall latency of flits actually decreases up to a certain point with pipelining. Average latencies are larger for RR type traffic, which also involves a larger number of phits (2 Request + 6 Response). Clearly, there is a latency advantage in pipelining links in NoCs up to a point. This is because as the number of pipe stages increases, the operating frequency can also be increased, since the length of the wire segment in each pipe stage decreases. Real time latencies do not vary much after pipelining configuration P=5, as the delay of the flops starts to dominate and there is only a marginal increase in frequency.


Figure 3.7: Max. frequency of links in the 3 topologies. The lengths of the longest links in the Mesh, Torus and Folded 2D Torus are 2.5mm, 8.15mm and 5.5mm.

The throughput and latency behaviour for SLA traffic is identical (not shown here).

3.3.3 NoC Power/Performance/Latency Tradeoffs

2D Mesh

Figures 3.10 and 3.11 show the combined normalized results of the NoC power, throughput and latency experiments on a 2D Mesh for DLA and SLA traffic. Throughput and power consumption are lowest at P=1 and highest at P=8. The normalized average round trip flit latency for both OW and RR traffic is shown (the curves overlap). From the graph it is seen that the growth in power makes configurations beyond P=5 less desirable. Link pipelines with P=1, 2 and 3 are also not optimal with respect to latency in both these benchmarks. The rise in throughput also starts to fade as configurations beyond P=6 are used. The optimal point of operation indicated by the results from both communication patterns is P=5. The Energy curve is obtained as the product of the normalized Latency and Power values; the Energy.Delay curve is the product of the Energy and Latency values, as sketched below.
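The bookkeeping behind these curves can be summarised in a few lines. The sketch normalizes per-depth power and latency to the P=1 values, forms Energy and Energy.Delay as defined above, and reports the depth with the minimum energy-delay product. The per-depth numbers below are placeholders, not measured values from the experiments.

    // Normalized Energy and Energy.Delay across pipeline depths (placeholder data).
    #include <cstddef>
    #include <iostream>
    #include <vector>

    int main() {
        // Index k corresponds to pipeline depth P = k+1 (placeholder numbers).
        std::vector<double> power_mw   = {55, 82, 110, 150, 190, 240, 300, 380};
        std::vector<double> latency_ns = {40, 30, 25, 22, 21, 20.5, 20.2, 20};

        std::size_t best = 0;
        double best_ed = 1e300;
        for (std::size_t k = 0; k < power_mw.size(); ++k) {
            double p = power_mw[k]   / power_mw[0];    // normalize to P=1
            double l = latency_ns[k] / latency_ns[0];
            double energy       = p * l;               // Energy = Power x Latency
            double energy_delay = energy * l;          // Energy.Delay = Energy x Latency
            if (energy_delay < best_ed) { best_ed = energy_delay; best = k; }
            std::cout << "P=" << k + 1 << "  E=" << energy
                      << "  E.D=" << energy_delay << "\n";
        }
        std::cout << "energy-delay optimum at P=" << best + 1 << "\n";
        return 0;
    }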


Figure 3.8: Total NoC throughput in the 3 topologies, DLA traffic.

Figure 3.9: Avg. round trip flit latency in the 3 NoCs, DLA traffic.


Figure 3.10: 2D Mesh Power/Throughput/Latency trade-offs for DLA traffic. Normalized results are shown.

Figure 3.11: 2D Mesh Power/Throughput/Latency trade-offs for SLA traffic.


Figure 3.12: DLA Traffic, 2D Torus Power/Throughput/Latency trade-offs. Normalized results are shown.

Energy for communication increases with pipeline depth. Quantitatively, the optimal point of operation is when the longest link has five pipeline segments (P=5). In DLA traffic, the average round trip flit latency of phits in the NoC is 1.23 times the minimum and 32% of the maximum possible. The NoC power consumed is 57% of the maximum and the throughput is 80.5% of the maximum possible value.

2D Torus and Folded 2D Torus

Similar power, throughput and latency trade-off studies are done for both communication patterns on the 2D Torus (Fig. 3.12) and Folded 2D Torus (Fig. 3.13) NoCs. Results obtained in the 2D Torus experiments indicate that the growth in power makes configurations beyond P=5 undesirable. The latencies of phits in pipeline configurations P=1-4 are large. The rise in throughput also starts to fade as configurations beyond P=5 are used. The optimal point of operation indicated by the Energy.Delay curves in both DLA and SLA traffic (not shown here) for the 2D Torus is P=5. In DLA traffic, this configuration shows the power consumed by the NoC is 50% of the value consumed at P=8, and the throughput is 70.5% of the maximum value.


Figure 3.13: DLA Traffic, Folded 2D Torus Power/Throughput/Latency trade-offs. Normalized results are shown.

The average round trip latency of phits for both OW & RR traffic is 1.4 times the minimum and 24% of the maximum (when P=1).

The trade-off curves for the Folded 2D Torus show trends similar to the 2D Torus. The average round trip flit latency reduction and throughput gain after P=6 are not considerable. There is no single optimum obtained from the Energy.Delay curve. Pipeline configurations from P=5 to P=7 present various throughput and energy configurations for approximately the same Energy.Delay product.

3.3.4 Power-Performance Tradeoff With Frequency Scaling

We discuss the combined effects of pipelining links and frequency scaling on the power consumption and throughput of the 3 topologies (Figure 3.5) running DLA traffic. The maximum possible frequency of operation at full supply voltage (1.0V) is determined using Intacte.

Figure 3.14 shows NoC power consumption for the 3 example topologies over a pair of pipelining configurations along with frequency scaling (at Vdd). As observed from the graph, the power consumption of a lower pipeline configuration exceeds the power consumed by a higher configuration after a certain frequency.


Figure 3.14: Frequency scaling on the 3 topologies, DLA Traffic.

Larger buffers (repeaters) are added to push frequencies to the maximum possible value. The power dissipated by these circuit elements starts to outweigh the speed advantage after a certain frequency. We call this the "crossover" frequency. The graph shows an example pair of configurations from each of the topologies to illustrate this fact.

The maximum frequency of operation of the unpipelined longest link in a 2D Mesh (2.5mm) is determined to be 1.71GHz. This maximum throughput point is determined for each pipeline configuration in each topology. The frequency is scaled down from this point and power measurements are made for the NoC activity obtained using the SystemC framework for DLA traffic. At crossover frequencies it is advantageous to switch to higher pipelining configurations to save power and increase throughput. For example, in a 2D Mesh a link frequency of 3.5GHz can be achieved by pipelining configurations of 3 and above. NoC power consumption can be reduced by 54% by switching to a 3 stage pipeline configuration from an 8 stage pipeline configuration. In other words, a desired frequency can be achieved by more than one pipeline configuration. For example, in a 2D Torus a frequency (throughput) of 2.0GHz can be achieved by using pipeline configurations from 4 to 8. NoC power consumption can be reduced by 13.8% by switching from P=8 to P=4 and still achieve similar throughput.


Figure 3.15: Dynamic voltage scaling on the 2D Mesh, DLA Traffic. The frequency scaled curve for P=8 is also shown.

3.3.5 Power-Performance Tradeoff With Voltage and Frequency Scaling

In each topology, the frequency is scaled down from the maximum, the least voltage required to meet the scaled frequency is estimated using Intacte, and the power consumption and throughput results are presented. Voltages are scaled from 1.0V until 0.1GHz is reached for each pipelining configuration in each NoC. Similar to the frequency scaling results, there exists a crossover frequency in a pipelining configuration after which it is power and throughput optimal to switch to a higher pipelining stage (Table 3.4). Figure 3.15 compares the power and throughput values obtained by voltage and frequency scaling with a frequency scaled P=8 curve for the 2D Mesh with DLA traffic. Scaling voltage along with frequency, compared to scaling frequency alone, can result in power savings of up to 14%, 27% and 51% in the cases of P=7, P=5 and P=2 respectively.


Table 3.4: DLA traffic, frequency crossover points in the Mesh, Torus and Folded Torus NoCs.

  Pipe Stages   Trip Frequency (in GHz)
                Mesh    Torus   Folded Torus
  1-2           1.7     0.25    0.45
  2-3           2.96    0.7     1.5
  3-4           3.93    1.1     2.0
  4-5           4.69    2.0     2.76
  5-6           5.31    2.2     3.2
  6-7           5.83    2.8     3.69
  7-8           6.23    3.0     4.07

A comparison of all 3 NoCs is presented in Table 3.5.

Table 3.5: Comparison of the 3 topologies for DLA traffic.

  Topology        Pipe Stages   Power (mW)   Performance (Gbps)
  Mesh            1             55.18        42.82
                  2             109.87       74.12
                  4             250.83       117.44
                  7             464.16       156.00
  Torus           1             27.26        14.67
                  2             45.71        27.89
                  4             97.48        50.78
                  7             206.22       78.33
  Folded Torus    1             28.32        21.03
                  2             55.95        39.31
                  4             119.75       69.11
                  7             287.18       101.91


Table 3.6: Experimental Setup

  Traffic Injection Rate      20%
  Traffic Model               Localized Traffic (6%)
  Framework simulation time   35000 cycles
  Process Technology          65nm
  Models                      PTM[81]
  Frequency of Operation      1 GHz (unless mentioned otherwise)
  Environment                 Linux (2.6.8-24.25-default) + OSCI SystemC 2.0.1

Figure 3.16: Schematic representation of the three compared topologies (L to R: 2D Torus, Tree, Reduced 2D Torus). Shaded rectangles are Routers and white boxes are source/sink Processing Element (PE) nodes.

3.4 Case Study: Torus, Reduced Torus & Tree based NoC

In this case study, experiments are designed to calculate the latency (clock cycles), throughput (Gigabits/sec) and power (milliWatts) of three related topologies - 2D Torus, Reduced 2D Torus and a Tree based NoC. Table 3.6 lists some of the simulation setup parameters used in the following experiments. We did not observe significant variation in the activity factor, and hence in the power and throughput of the NoC, by running the simulation for durations greater than 35000 cycles.


Figure 3.17: Floorplans of the three compared topologies.

Table 3.7: Links and pipelining details of NoCs

  Topology           Length in mm (no. of links)   Pipelining
  2D Torus           7    (8)                      1 2 3 4 5 6 7 8
                     1.5  (88)                     1 1 1 1 2 2 2 2
  Reduced 2D Torus   3.5  (12)                     1 2 3 4 5 6 7 8
                     2.5  (16)                     1 2 3 3 4 5 5 6
                     2.0  (44)                     1 2 2 3 3 3 4 4
  Tree NoC (8+32)    3.5  (8)                      1 2 3 4 5 6 7 8
                     0.75 (32)                     1 1 1 1 2 2 2 2

3.4.1 NoC Topologies

Starting from a 2D Torus, two topologies (a hierarchical star topology and a reduced Torus) containing an equal number of source and sink nodes are derived by removing/reconnecting links. Router and processing elements are identical in all three topologies. The three topologies are shown in Figures 3.16 (schematic) and 3.17 (floorplan). Processing elements (PEs) are assumed to be of size 1.5×1.5mm[128]. Routers are assumed to be 15% of the PE size. The length of the longest link in the 2D Torus is estimated to be 7mm, and in the Reduced 2D Torus and Tree based NoC it is 3.5mm. The routing policy for all topologies is table based. The routing policy was formed keeping in view the latency (in cycles) of packets on the links. Routes were chosen such that, where there was a choice of long and short links, the shorter links were chosen. The lengths of the links in each of the topologies and the pipelining factors are illustrated in Table 3.7.


Figure 3.18: Maximum attainable frequency of the links in the respective topologies. The estimated length of the longest link in the 2D Torus is 7mm. The estimated longest link in the Tree based and Reduced 2D Torus NoCs is 3.5mm.

The pipelining factor corresponds to the longest link in the NoC. A pipelining factor of 1 means the longest link is unpipelined, P=2 indicates it has a two cycle latency, and so on.

3.4.2 NoC Throughput

The throughput of each of the NoC topologies is calculated. A localized traffic generation scheme (each traffic generator sends 6% of its traffic to its immediate neighbors) with self-similar traffic distribution is used. Throughput is a measure of the total data consumption at the sink nodes. The total throughput of the NoC (in bits/sec) is calculated as the total number of bits received (phits_r × bits_phit) at the sink nodes divided by the total (real) time ((1/f) × sim_cycles) spent (Eqn 3.1).

The maximum achievable frequency of a wire of a given length is shown in Figure 3.18. The maximum throughput of each NoC is presented in Figure 3.19. The Tree based NoC supports localized traffic the best (at least two neighbours at one hop distance) and hence shows the highest throughput. Both the Tree NoC and the Reduced 2D Torus show higher throughput because of shorter links, resulting in a higher frequency of operation.


Figure 3.19: Variation of total NoC throughput with varying pipeline stages in all three topologies.

The Reduced 2D Torus has higher throughput than a conventional 2D Torus, as the minimum distance between two neighbours is 1 hop (2 hops in the case of a Torus).

3.4.3 NoC Power/Performance/Latency Tradeoffs

Figure 3.20 shows the combined normalized results of the power, throughput and latency experiments on a 2D Torus. The power consumption of the 2D Torus increases at a higher rate after P=4 due to the insertion of flops in the shorter links (1.5mm) after P=5. Latency is calculated as the real time spent in transit by all phits in the NoC over the complete simulation time. The decrease in latency after P=5 is not considerable, as delays from inserted flops start to dominate the clock cycle time, and after a certain pipeline configuration latencies will increase (not shown here). From the graph it is seen that the growth in power makes configurations beyond P=5 less desirable. Link pipelines with P=1, 2 and 3 are also not optimal when latency is considered. The rise in throughput also starts to fade as configurations beyond P=5 are used. The optimal point of operation indicated by the results is P=4. At this point the same number of flops as P=1, 2 and 3 are used, but the least latency (1.56 times the minimum) is achieved, and power (40% of max) and throughput (64% of max) are at nominal points.


Figure 3.20: 2D Torus Power/Throughput/Latency trade-offs. Normalized results are shown.

The graph also shows the energy (Power × Latency) required for the communication. The energy for communication increases with pipeline depth. However, the energy-delay product reduces initially with increasing pipeline depth and then increases, with a minimum around P=4.

Trade-off results for the Reduced 2D Torus are shown in Figure 3.21. The latency and throughput curves show a similar trend as in the 2D Torus. The latency reduction and throughput gain after P=4 are not considerable. The power optimal point of operation indicated by the results is P=3. At P=3 the latency is 1.6 times the minimum, with power at 49% of max and throughput at 61% of max. There is a shallow minimum in the energy-delay product from P=3-7.

3.4.4 Power-Performance Tradeoff With Frequency Scaling

We discuss the combined effects of pipelining links and frequency scaling on the power consumption and throughput of the three example topologies (Figure 3.16) in this sub-section. The maximum possible frequency of operation at full supply voltage (1.1V) is determined using Intacte.


Figure 3.21: Reduced 2D Torus Power/Throughput/Latency trade-offs. Normalized results are shown.

Figure 3.22 shows the NoC power consumption for the 3 example topologies over various pipelining factors along with frequency scaling. The maximum frequency of operation of the unpipelined longest link in a 2D Torus (we consider 7mm) is determined to be 0.93GHz. This maximum throughput point is determined for each pipeline configuration in each topology. The frequency is scaled down from this point and power measurements are made for the NoC activity obtained using the SystemC framework for Localized traffic with a 20% injection rate and a 6% localization factor. Allowing for some overheads (extra cycles of latency), the frequency of operation required to achieve equivalent throughput with pipelined links is 0.94-0.96GHz. A higher frequency translates to higher throughput (Eqn. 3.1). A pipelining factor of 1 is unpipelined and has a single cycle delay, a factor of 2 means the link has two cycles of delay, and so on. Experiments for each topology show the existence of crossover frequencies after which it is better to switch to a higher degree of pipelining to save power and achieve higher throughput. Larger buffers are required to drive links at higher frequencies. The power consumed by the buffers starts to overshadow the frequency gain at these frequencies. Experiments on the 2D Torus show that a link frequency of 2.5GHz can be achieved by pipelining the link in stages 4 to 7.


Figure 3.22: Variation of NoC power with throughput for each topology.

NoC power consumption can be reduced by 40.97% by switching to a 4 stage pipeline from a 7 stage pipeline. Another interesting result is the effect of larger buffers as the upper limits of frequency are reached within a single pipeline configuration. For instance, at P=3 from 2.3GHz to 2.4GHz, the buffers start to consume almost the same power as a link with P=4.

The sizes of the links of the Reduced 2D Torus are estimated to be 3.5mm, 2.5mm and 2.0mm. NoC power consumption across the various pipeline stages differs by smaller amounts compared to the 2D Torus, as the number of links is smaller (32+16 bidirectional compared to 12+8+16 bidirectional links). Results show that frequencies of 4.22GHz - 4.56GHz can be achieved by both P=4 and P=5 (5% power difference). On the other hand, for a given frequency there exists more than one pipeline configuration with varying power consumption. A frequency of 3.5GHz can be achieved by pipelining into 3 to 7 stages. NoC power consumption can be reduced by 26.6% by switching from P=7 to P=3 and still achieve 3.5GHz. Table 3.8 lists the 'trip' frequencies at which switching from one pipeline configuration to the next higher one becomes power optimal.

The estimated sizes of the links in a Tree based NoC are 3.5mm and 0.75mm. Results for the frequency scaling experiment follow a similar pattern as the previous two configurations (Figure 3.22).


Table 3.8: Power optimal frequency trip points in the various NoCs.

  Pipe Stages   Trip Frequency (in GHz)
                2DT     R2DT    Tree NoC
  1-2           0.93    1.65    1.05
  2-3           1.71    2.75    2.1
  3-4           2.36    3.55    3.05
  4-5           3.4     4.22    4.45
  5-6           3.84    5.13    4.75
  6-7           4.22    5.3     5.13

Figure 3.23: Effects of dynamic voltage scaling on the power and performance of a 2D Torus. The highest frequencies of operation for P=1, 2, 4 and 7 are 0.93GHz, 1.68GHz, 2.92GHz and 4.22GHz. The power consumption of the frequency scaled NoC is shown for comparison.

The differences between the power numbers of the various configurations are the smallest in this network, as this NoC contains the least number of links amongst the three compared (4+16 bidirectional). Trip frequencies are recorded in Table 3.8. A maximum of 21.27% of power can be saved (at f = 3.84GHz) by switching over to P=4 from P=3 after 3.05GHz. On the other hand, 4GHz can be achieved by P=4 to P=7, and the NoC power consumption at P=4 is 76% of the power consumed at P=7.
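The selection rule implied by these trip points can be stated compactly: stay at the shallowest pipeline depth until the target link frequency passes the next trip frequency, then move one stage deeper. A minimal sketch follows, using the 2D Torus column of Table 3.8 as illustrative data; the actual decision in the experiments is made from the measured power curves, not from this simplified rule.

    // Pick a pipeline depth from a list of trip frequencies (Table 3.8 style).
    #include <iostream>
    #include <vector>

    int pipeline_depth_for(double target_ghz, const std::vector<double>& trips_ghz) {
        int depth = 1;
        for (double trip : trips_ghz)
            if (target_ghz > trip) ++depth;   // past the P -> P+1 crossover point
        return depth;
    }

    int main() {
        // Trip frequencies 1-2, 2-3, ..., 6-7 for the 2D Torus (Table 3.8).
        std::vector<double> torus_trips = {0.93, 1.71, 2.36, 3.4, 3.84, 4.22};
        std::cout << pipeline_depth_for(2.5, torus_trips) << "\n";  // prints 4
        return 0;
    }

With these numbers, a 2.5GHz target maps to a 4 stage pipeline, which is consistent with the earlier observation that 2.5GHz is reachable with 4 to 7 stages and that switching to the 4 stage configuration saves power.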


Table 3.9: Comparison of the 3 topologies. Maximum interconnect network performance and power consumption for varying pipe stages.

  Topology           Pipe Stages   Power (mW)   Performance (Gbps)
  2D Torus           1             32.01        31.5
                     2             49.44        53.27
                     4             101.42       115.34
                     7             146.41       268.61
  Reduced 2D Torus   1             49.05        100.2
                     2             91.75        230.95
                     4             142.5        496.25
                     7             181.7        742.27
  Tree Network       1             53.22        52.93
                     2             90.66        99.46
                     4             141.17       191.07
                     7             179.77       307.6

3.4.5 Power-Performance Tradeoff With Voltage and Frequency Scaling

In each of the topologies, the frequency is scaled down from the maximum, the least voltage required to meet the scaled frequency is estimated using Intacte, and the power consumption and throughput results are presented in this section. Voltages are scaled from 1.1V to 0.65V. The NoC parameters are identical to the ones used in Section 3.4.2. Figure 3.23 shows the results of DVS on the 2D Torus network. Similar to the frequency scaling results, there exists a frequency point in a pipelining configuration after which it is power and throughput optimal to switch to a higher pipelining stage. For throughput higher than 90Gbps, P=7 offers the highest power reduction of 21.74%, at 101Gbps. The frequency scaled curve is obtained by scaling only the frequency while the NoC is run at full supply voltage. Scaling voltage along with frequency, compared to scaling frequency alone, can result in power savings of up to 57% and 63% in the cases of P=7 and P=4 respectively. A comparison of all three topologies is presented in Table 3.9.
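A first-order way to see why scaling voltage together with frequency saves more power than frequency scaling alone is the textbook switching-power approximation P ≈ α·C·V²·f. This is only a rough model and is not what Intacte computes (Intacte additionally accounts for repeater sizing, short circuit and leakage power); the activity, capacitance and voltage values in the sketch below are placeholders.

    // First-order dynamic power comparison: frequency scaling vs. V+f scaling.
    #include <iostream>

    double dynamic_power_w(double activity, double cap_f, double vdd, double freq_hz) {
        return activity * cap_f * vdd * vdd * freq_hz;   // P ~ alpha * C * V^2 * f
    }

    int main() {
        const double activity = 0.2, cap_f = 1.0e-9;     // example switched capacitance
        double p_full  = dynamic_power_w(activity, cap_f, 1.1, 4.0e9);
        double p_fscal = dynamic_power_w(activity, cap_f, 1.1, 2.0e9); // frequency scaled only
        double p_dvfs  = dynamic_power_w(activity, cap_f, 0.8, 2.0e9); // assumed lower Vdd at 2 GHz
        std::cout << "full: " << p_full << " W, f-scaled: " << p_fscal
                  << " W, V+f scaled: " << p_dvfs << " W\n";
        return 0;
    }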


3.5 Conclusion

NoC design specifications can be met by varying a large number of system and circuit parameters. An SoC can be better optimized if low level link parameters and architectural parameters such as pipelining, link width, wire pitch, supply voltage, operating frequency, router type, topology of the interconnection network etc. are considered. This chapter presents a simulation framework developed in SystemC that is able to explore NoC designs through all the aforementioned parameters. The framework also allows co-simulation with models of the communicating entities along with the ICN. The interface to the SystemC framework and sample output logs produced are documented in Appendix A.

The study presented in Section 3.3 on a 4x4 multi-core ICN for Mesh, Torus and Folded Torus topologies and the Dense Linear Algebra (DLA) and Sparse Linear Algebra (SLA) benchmarks' communication patterns indicates that there is an optimum degree of pipelining of the links which minimizes the average communication latency. There is also an optimum degree of pipelining which minimizes the energy-delay product. Such an optimum exists because increasing pipelining allows for shorter wire segments, which can be operated either faster or with lower power at the same speed.

We also find that the overall performance of the ICNs is determined by the lengths of the links needed to support the communication patterns. Thus the Mesh seems to perform the best amongst the three topologies considered in this case study.

Another study (Section 3.4) uses 3 example topologies - a 16 node 2D Torus, a Tree network and a Reduced 2D Torus - to show the variation of latency, throughput and NoC power consumption over link pipelining configurations with voltage and frequency scaling. We find that, contrary to intuition, increasing pipeline depth can help reduce latency in absolute time units by allowing shorter links and hence a higher frequency of operation. In a 2D Torus, when the longest link is pipelined with 4 stages, the least latency (1.56 times the minimum) is achieved while power (40% of max) and throughput (64% of max) are nominal. Using frequency scaling experiments, power variations of up to 40%, 26.6% and 24% can be seen in the 2D Torus, Reduced 2D Torus and Tree based NoC between the various pipeline configurations that achieve the same frequency at constant voltage. Also in some cases, we find that switching to a higher pipelining configuration can actually help reduce power, as the links can be designed with smaller repeaters.


Larger NoC power savings can be achieved by voltage scaling along with frequency scaling. Hence it is important to include the link microarchitecture parameters, as well as circuit parameters like supply voltage, during the architecture design exploration of a NoC.

The studies also point to an overall optimization problem that exists in the architecture of the individual PEs versus the overall SoC, since smaller PEs lead to shorter links between PEs but more traffic, thus pointing to the existence of a sweet spot in terms of the PE size.


Chapter 4

Optimal Energy and Performance Configurations using Communication Centric CMP Simulations

On-chip and off-chip communication times have a significant impact on the execution time and the energy efficiency (Instructions per second² per Watt, IPS²/W) of Chip Multiprocessors (CMPs), and need to be accounted for in CMP simulation frameworks. A larger amount of time spent in communication leads to a longer execution time and hence increased losses due to leakage energy. Communication time is a function of the total number of messages to be communicated and their latency. The composition of an individual tile has a significant impact on overall communication time. Larger caches in the tile will reduce cache misses and decrease off-tile messages. However, larger caches also imply larger area and hence longer inter-tile communication link lengths and latencies, thus adversely impacting communication time. There exists a trade-off in the tile size which leads to optimum communication time and hence energy efficiency. This indicates a need for strategies to reduce communication time without increasing tile area.

We explore these trade-offs using a detailed, cycle accurate, multicore simulation framework which includes superscalar processor cores, cache coherent memory hierarchies, on-chip point to point communication networks and a detailed interconnect model including pipelining and latency.


Link latencies are estimated for a 16 core CMP simulation on a framework having superscalar cores, cache coherent memory hierarchies, on-chip point to point communication networks and a NoC model, running a SPLASH2[129] benchmark. Each tile has a single processor, L1 and L2 caches and a router. Different sizes of L1 and L2 lead to different tile clock speeds, tile miss rates and tile area, and hence interconnect latency.

Simulations across a range of sizes of 8KB-256KB for L1 and 64KB-4MB for L2 indicate that there is an optimal size which maximizes energy efficiency, and that this is related to minimizing communication time. Reducing off-chip communication time requires large on-chip caches, which in turn burn more leakage energy. This indicates a need for strategies to reduce communication time without increasing tile area. Experiments across a range of L1/L2 configurations indicate different optima for performance, energy and energy efficiency. Simulation frameworks lacking detailed interconnect models give inaccurate power-performance optimal L1/L2 configurations. Additionally, a clustered interconnection network, communication aware cache bank mapping and thread mapping to physical cores are explored as potential energy saving solutions.

Organization of the Chapter

The need to use detailed interconnection network models to identify optimal energy and performance configurations is motivated in Section 4.1. Section 4.2 highlights the research contributions of this chapter. A brief literature survey on the broad areas of communication centric performance estimation/optimization, system architecture of Network-on-Chips and System-on-Chips, and interconnection network exploration is presented in Section 4.3. A detailed literature survey was presented in Section 2.3.5 of Chapter 2. Section 4.6 describes on-chip and off-chip communication effects on the power and performance of a CMP using the FFT benchmark as a case study. Section 4.4 builds the relationship between energy efficiency and communication time and states the optimization problem. Section 4.5 illustrates the experimental setup and methodology. Experimental results analyzing the effects of on-chip and off-chip communication are presented in Section 4.7.


Effects of communication on program execution times and program execution energy are presented in Section 4.8. Energy-performance results for various L1/L2 configurations, and the effect of custom L2 bank mapping and thread mapping on the power and performance of a multicore chip, are presented in Section 4.9. The chapter concludes in Section 4.10.

4.1 Motivation

Transistor scaling and higher transistor densities have enabled the semiconductor industry to implement large scale chip multiprocessors[123][130][131][132]. As on-chip wire delays continue to grow, intra-chip latencies are now a significant parameter in the total performance of a CMP[120][133].

The work presented in this chapter analyzes communication as the primary parameter in CMPs and highlights the importance of including link latencies in cycle accurate simulation frameworks. We establish the relationship between Program Completion Time (PCT) and Communication Time in a CMP over various runs of a parallel application. Communication in a CMP is either inter-tile (on-chip) or off-chip. On-chip (or intra-chip) communication is the communication between tiles in the CMP due to off-tile L2 accesses. Off-chip communication consists of accesses to DRAM due to L2 misses in the CMP.

Experiments with the FFT benchmark on a 4×4 CMP show that performance estimates are off by up to 17% if detailed interconnect models are not included in CMP simulations (Figure 4.1). The ideal interconnect measurements were taken assuming single cycle interconnect latency. The FFT benchmark was run to completion using various L1 (8K - 256K) and L2 (64K - 4M) cache sizes. Errors are larger for smaller L1 and larger L2 caches. In these cases, the miss rate, and hence off-core communication, is the largest. Large L2 caches increase the area of individual cores, which in turn increases link latencies. Communication time during benchmark execution is affected by the delays of the wires between tiles, which in turn depend on individual tile sizes. Larger tiles accommodate more on-chip memory, increasing link lengths and hence communication time. The length of an on-chip inter-tile link depends on the size of the tiles in the CMP. A smaller tile area results in shorter wires and lower individual link latency. L1 and L2 caches in smaller tiles incur more misses, increasing the time spent in communication.


Figure 4.1: Error in performance measurement between the real and ideal interconnect experiments.

Hence there exists an optimum cache size which minimizes the overall Inter Tile Communication Time. Increasing cache size also impacts Energy, as both the dynamic and leakage power increase. However, a reduction in PCT helps reduce leakage energy (the product of leakage power and PCT), and hence one can again expect an optimum for the Program Energy Efficiency.

Time spent in communication is also a function of the communication latency between a core running a process and the core containing the L2 bank to be accessed. Identifying frequently communicating processes and the most accessed L2 banks, and mapping those processes and L2 banks to the same or neighbouring cores, offers communication time savings in a shared, distributed L2 cache with an S-NUCA policy. This potentially decreases the overall time spent in transit and hence increases performance and program energy efficiency. Tile placement (floorplan) strategies and thread mapping algorithms also play a role in the final performance index. Time spent in communication influences the Total Energy consumed during execution of a program. This relationship between power, performance, energy efficiency and time spent in communication is explored in this chapter.
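The energy-efficiency metric used in this chapter, IPS²/W, can be computed directly from a run's instruction count, completion time and total energy, as in the minimal sketch below; the input numbers are placeholders rather than simulation results.

    // Minimal sketch of the IPS^2/W energy-efficiency metric (placeholder inputs).
    #include <iostream>

    int main() {
        const double instructions = 5.0e9;   // instructions retired (example value)
        const double pct_s        = 2.0;     // program completion time in seconds (example)
        const double energy_j     = 40.0;    // total energy over the run (example)

        double ips        = instructions / pct_s;
        double avg_power  = energy_j / pct_s;       // W
        double efficiency = ips * ips / avg_power;  // IPS^2 / W
        std::cout << "IPS^2/W = " << efficiency << "\n";
        return 0;
    }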


4.2 Observations and Contributions

The research contributions and observations of the current chapter are highlighted here.

• Time spent in communication influences the total execution time of a program in CMPs. The current work analyzes on-chip message transit times and off-chip DRAM access times to establish the relationship between time spent in communication and the program completion time.

• On-chip transit and DRAM access times track the execution time of a process. CMP configurations operating at higher frequencies (lower cycle time) have a performance edge over others when the communication time is comparatively similar.

• Communication time is influenced by the floorplanning of the CMP. Communication aware floorplanning can reduce the energy spent in executing an instruction by up to 2.6% and save up to 11% of the communication power during the execution of the program.

4.3 Background

The tile area optimization problem is closely knit with interconnect, cache and processor architecture exploration. It is clear from works like [69][106] that there is a need for a co-design of interconnects, processing elements and memory blocks to fully optimize overall multi-core chip performance. This necessitates a simulation framework which allows co-simulation of processor cores, a detailed cache memory hierarchy and an on-chip network, along with a low-level interconnect model.

The effect of communication delays on the power and performance of CMPs has been a subject of interest in recent years. Mitigating communication delays through compiler techniques and micro-architecture has been looked at. Strided prefetching[106] has been compared with block migration and on-chip transmission lines as a way to manage on-chip wire delay in CMP caches and improve performance. Instruction steering[112] and instruction replication[113] coupled with clustering have been researched as effective techniques to reduce the impact of wire delays.


Instruction steering[112] and instruction replication[113] coupled with clustering have been researched as effective techniques to reduce the impact of wire delays. Data transfer on long-latency wires can be reduced by value prediction[114] and cache line replication[115][116] techniques. Communication energy and delay can be minimized by migrating frequently accessed cache lines to banks closest to the accessing processors[109][111]. Scalable micro-architectural techniques to reduce the impact of wire delay have also been studied[119][120]. Floorplanning techniques to overcome long latencies between the processor and the Level-2 cache have been experimented with[121].

Many separate tools have been developed for interconnection network (ICN) design space exploration[70][71]. Most of these tools model the ICN elements at a higher level of abstraction (switches, links and buffers) and help in power/performance trade-off studies[86]. They are used to research the design of router architectures[124][87] and ICN topologies[34] with varying area/performance trade-offs for general purpose SoCs or to cater to specific applications. Kogel et al.[70] present a modular exploration framework to capture the performance of point-to-point, shared bus and crossbar topologies. The impact of varying topologies, link and router parameters on the overall throughput, area and power consumption of the system (SoCs and multicore chips) using relevant traffic models is discussed in [68]. Orion[71] is a power-performance interconnection network simulator capable of providing power and performance statistics. The Orion model estimates the power consumed by router elements by calculating the switching capacitances of individual circuit elements. A tool like Ruby[97] allows one to simulate a complete distributed memory hierarchy with an on-chip network, as in Orion. However, it needs to be augmented with a detailed interconnect model which accounts for the physical area of the tiles and their placements. Wire lengths have a significant influence on the latency of the interconnect, and hence need to be included in the simulation framework. Separate wire exploration tools as in [72], [73] and [75] give an estimate of the delay of a wire for a particular wire length and operating frequency.

The Sapphire[102] framework used in this work integrates SESC, Ruby, Intacte and DRAMSim[104].


Figure 4.2: Schematic of a multiprocessor architecture comprising tiles and an interconnecting network. Each tile is made up of a processor, L1 and L2 caches.

Sapphire enables cycle-accurate simulations of a multi-core chip having a distributed memory hierarchy and an on-chip network, with interconnect latencies that are consistent with the physical sizes and placement of the cores. CACTI[122] cache models are used to estimate the area, energy per access and leakage power of the L1 and L2 caches, and values from the SPARC processor[123] are used for processor power estimates.

4.4 Communication Time and Energy Efficiency

Consider a tiled multi-core chip as shown in Figure 4.2. Each tile contains a processor, private instruction and data L1 caches, a shared, distributed L2 cache and a router for on-chip communication. The L2 banks are distributed in the CMP using the S-NUCA policy. The mapping of data into L2 banks is predetermined based on the address, and a given datum can reside in only one bank of the L2 cache. The tiles are interconnected via a network, which is usually a mesh or 2D-torus for large networks.

Data to be accessed by a program in such a CMP is present in one of the following sources: the private (on-tile) L1, the local (on-tile) L2, an (off-tile) remote L2 or the off-chip DRAM. If P_{L1}, P_{l.L2}, P_{r.L2} and P_{dram} are the probabilities of finding the required data in L1, local L2, remote L2 and off-chip DRAM respectively, then

P_{L1} + P_{l.L2} + P_{r.L2} + P_{dram} = 1
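As a quick illustration of this decomposition, the following sketch (a hypothetical Python helper, not part of the Sapphire tool flow; the function and variable names are invented for this example) derives the four probabilities from raw per-level access counts and checks that they sum to one.

def access_probabilities(n_l1, n_local_l2, n_remote_l2, n_dram):
    """Derive P_L1, P_l.L2, P_r.L2 and P_dram from raw access counts.

    Each argument is the number of accesses satisfied by that level
    (e.g. counts reported by a cache/DRAM simulator for one thread).
    """
    total = n_l1 + n_local_l2 + n_remote_l2 + n_dram
    p = {
        "P_L1":   n_l1 / total,
        "P_l.L2": n_local_l2 / total,
        "P_r.L2": n_remote_l2 / total,
        "P_dram": n_dram / total,
    }
    # The four sources are exhaustive and mutually exclusive,
    # so the probabilities must sum to 1.
    assert abs(sum(p.values()) - 1.0) < 1e-12
    return p

# Example with made-up counts: most accesses hit in L1.
print(access_probabilities(9_000_000, 600_000, 350_000, 50_000))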


Tile placement strategies and process scheduling policies in CMPs also influence link latencies. The current work asserts the importance of T_comm in program execution and examines the effects of T_comm on the power, performance and energy efficiency of a CMP, using the variables L1, L2, tile placement strategies and non-conventional process scheduling. The metric used for measuring the performance of a CMP is the Energy × Delay product (ED). Instructions per second^2 per Watt (IPS^2/W), which has the same dimensions as 1/ED, is used as the performance metric to compare the various CMP configurations in the current work.

Processes (threads) of a program access the private L1, local or remote L2 cache banks, or off-chip DRAM during the course of program execution in a CMP. During its execution, a process generates an address sequence of data to be accessed. The generated address sequence is independent of the core's L1/L2 size. Let the address sequence generated by the n-th process, containing k addresses, be A^n_k:

{A^n_k},  n = 0, ..., N_T − 1

N_T is the total number of processes (or threads) the executing program is parallelized into. N_c is the number of cores in the CMP. The core id C^n_k, where the L2 banks of the required k data addresses of the n-th process reside, can be identified using the following notation:

C^n_k = A^n_k % N_c    (4.1)

C^n_k is the set of all local and remote L2 accesses (including hits and misses) of the n-th process during the execution of a program in a CMP. The Logical Communication Pattern of a program running N_T threads on the CMP is the set of all memory accesses over all the processes. Hence the Logical Communication Pattern (LCP) is

LCP = { {C^0_k}, ..., {C^{N_T − 1}_k} }    (4.2)

Each of the N_T threads of a program is mapped to an available processor. A new thread is assigned to the first available core in the CMP.
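To make the relationship between the two metrics introduced above concrete, the sketch below computes ED and IPS^2/W for a single run and checks that, for a fixed instruction count, IPS^2/W equals instructions^2/ED, i.e. it behaves as 1/ED. The numbers are illustrative placeholders, not measurements from this work.

def energy_delay_product(power_watts, exec_time_s):
    """ED = E_exec * T_exec, with E_exec = power * time."""
    energy = power_watts * exec_time_s
    return energy * exec_time_s

def ips2_per_watt(instructions, power_watts, exec_time_s):
    """IPS^2/W for a run that executes `instructions` in `exec_time_s`."""
    ips = instructions / exec_time_s
    return ips ** 2 / power_watts

# Placeholder numbers for one hypothetical configuration.
instructions = 45e6          # instructions executed
power        = 20.0          # total chip power in W
time_s       = 0.028         # program completion time in s

ed = energy_delay_product(power, time_s)
m  = ips2_per_watt(instructions, power, time_s)

# For a fixed instruction count, IPS^2/W = instructions^2 / ED,
# i.e. it is proportional to 1/ED.
assert abs(m - instructions ** 2 / ed) < 1e-6 * m
print(f"ED = {ed:.3e} J.s, IPS^2/W = {m:.3e}")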


Mathematically, the logical mapping of N_T threads to physical cores can be written as the mapping function M_T:

M_T : {T_0, ..., T_{N_T − 1}} → {P_{T_0}, ..., P_{T_{k−1}}}    (4.3)

Assuming that the L2 size in each of the N_c cores is constant, let N_{L2} be the number of L2 cache lines per core. The L2 bank, residing in the core C^n_k, to be accessed by the address A^n_k is a function of the number of L2 cache lines per core and the number of executing threads. The higher bits of the requested address A^n_k are used to select the L2 bank (using N_{L2}) and the core id is deduced using the number of executing threads N_T. The core id expression in Eqn. 4.1 can be refined as:

C^n_k ⇒ {(A^n_k / N_{L2}) % N_T}

In a CMP with L2 banks distributed according to the S-NUCA policy, the above distribution of L2 banks results in L2 Bank 0 assigned to Core 0, ..., L2 Bank N_c − 1 assigned to Core N_c − 1, L2 Bank N_c assigned to Core 0, and so on. This is the logical mapping of L2 bank ids to core ids. Given the flexibility, the data belonging to these L2 banks can be assigned to any available physical core. Assuming an equal number of cores (N_c) and L2 banks, this mapping (M_C) of L2 banks to physical cores is represented as

M_C : {C^n_0, ..., C^n_{k−1}} → {P_{C^n_0}, ..., P_{C^n_{k−1}}}

The logical communication pattern equation in Eqn. 4.2 can be extended to a Physical Communication Pattern. The Physical Communication Pattern (P_{A^n_k}) is the sequence of physical addresses generated from the logical address sequence A^n_k:

{A^0_k, ..., A^{N_T − 1}_k} ⇒ {P_{A^0_k}, ..., P_{A^{N_T − 1}_k}}
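A minimal sketch of this logical mapping is shown below, assuming a 64-byte cache line (Table 4.1). The helper names are invented for illustration, and the mapping is a simplification of the S-NUCA address decoding rather than the exact hardware function; the modulus is written in terms of the core count, which the text expresses via the thread count when the two are equal.

LINE_SIZE = 64          # bytes per cache line (Table 4.1)

def l2_home_core(addr, n_l2_lines_per_core, n_cores):
    """Logical S-NUCA home core of an address.

    Follows the refined core-id expression above: the high-order part of
    the line address selects the L2 bank, and banks are striped across
    cores round-robin (bank 0 -> core 0, ..., bank N_c -> core 0, ...).
    """
    line_addr = addr // LINE_SIZE
    bank = line_addr // n_l2_lines_per_core
    return bank % n_cores

def logical_communication_pattern(address_seqs, n_l2_lines_per_core, n_cores):
    """Set of home cores touched by each thread's address sequence (cf. Eqn. 4.2)."""
    return [
        {l2_home_core(a, n_l2_lines_per_core, n_cores) for a in seq}
        for seq in address_seqs
    ]

# Tiny example: two threads, 16 cores, 4096 L2 lines per core.
seqs = [[0x0000, 0x40000, 0x80000], [0x100000, 0x140000]]
print(logical_communication_pattern(seqs, 4096, 16))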


The Average Memory Access Time (T_mem) is the average of the access times of all L1 (N_l1), local and remote L2 (N_l.l2 and N_r.l2) and DRAM accesses (N_dram) by a thread executing in the CMP. T_l1 is the average L1 access time over the execution of the thread. T_l.l2 and T^n_r.l2 are the average local and remote L2 access times for the n-th thread. P^n_l1 is the L1 miss probability. P^n_l.L2 and P^n_r.L2 are the probabilities of the address missing in the L2 local to the core running the process and in the remote L2 core. T_net is the average network transit time to the DRAM controller. T_dram is the average DRAM access time. The average memory access time (T_mem) is expressed mathematically as:

T^n_mem = [ T_l1 + P^n_l1 T_l.l2 + (P^n_r.l2 × 2.T_r.l2) + P^n_l.l2 . P^n_r.l2 (2.T_net + T_dram) ] × N_mem    (4.4)

where the total number of memory accesses is N_mem = N_l1 + N_l.l2 + N_r.l2 + N_dram. In an S-NUCA setup, if the access to L1 is a miss (P^n_l1), then the data exists in either the local L2 or the remote L2. A time penalty (T_l.l2) is incurred on an L2 access. If the address resolves to a remote L2 (P^n_r.l2), an additional round-trip network access time is incurred (2.T_r.l2). If the required data misses in L2, both local and remote (P^n_l.l2 . P^n_r.l2), an off-chip DRAM access is required. The off-chip access incurs a round-trip delay to the DRAM controller (2.T_net) and the penalty of a DRAM access (T_dram).

The total execution time (T_exec) of a program with N_T parallel processes in a CMP is the execution time of the thread that spends the longest in execution:

T_exec = Max{ T^n_exec },  n = 0, ..., N_T − 1    (4.5)

The time spent in network transit (T_tran) over L2 accesses is the sum of the times spent in local L2 accesses (T_l.l2) and round-trip remote L2 accesses over the network (T_r.l2) by all the N_T processes:

T_tran = Σ_{n=0}^{N_T − 1} { P^n_l1 × N_mem . [ 2.T_l.l2 + P^n_l.L2 × 2.T_r.l2 ] + P^n_l.l2 . P^n_r.l2 [ 2.T_net ] }    (4.6)
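The sketch below is a direct transliteration of Eqn. 4.4 (as reconstructed above) for one thread, with placeholder latencies and probabilities rather than values measured in this work; Eqns. 4.5 – 4.7 aggregate the same quantities over threads in the same manner.

def avg_memory_access_time(t_l1, t_l_l2, t_r_l2, t_net, t_dram,
                           p_l1_miss, p_l_l2_miss, p_r_l2, n_mem):
    """Total memory access time of one thread, per Eqn. 4.4.

    p_l1_miss  : probability an access misses in L1
    p_r_l2     : probability the address resolves to a remote L2 bank
    p_l_l2_miss: probability the access also misses in (local) L2
    n_mem      : total number of memory accesses issued by the thread
    """
    per_access = (t_l1
                  + p_l1_miss * t_l_l2                          # L2 lookup on an L1 miss
                  + p_r_l2 * 2 * t_r_l2                         # round trip to a remote L2 bank
                  + p_l_l2_miss * p_r_l2 * (2 * t_net + t_dram))  # off-chip DRAM access
    return per_access * n_mem

# Placeholder latencies (ns) and probabilities, for illustration only.
print(avg_memory_access_time(t_l1=0.61, t_l_l2=1.0, t_r_l2=6.0,
                             t_net=10.0, t_dram=50.0,
                             p_l1_miss=0.05, p_l_l2_miss=0.02,
                             p_r_l2=0.6, n_mem=10_000_000))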


The time spent in off-chip DRAM transit, each access incurring a round-trip delay of 2.T_net to the DRAM controller, summed over all processes, is:

T_dram = Σ_{n=0}^{N_T − 1} { P^n_l1 . P^n_l.l2 . P^n_r.l2 × N_mem } . T_net    (4.7)

The energy to execute the program (E_exec) is given as:

E_exec = (P_dyn + P_leak) × T_exec    (4.8)

P_dyn is the dynamic power and P_leak is the leakage power spent during the execution of the program.

The Energy-Delay (ED) product of the executed program is the product of the total energy spent and the time taken for the execution to complete:

ED = T_exec × E_exec

ED is also a function of the following parameters:

ED = f(A^n_k, M_C, M_T, L1, L2, f, NoC, DRAM)

The Energy × Delay product is a function of the memory access patterns (A^n_k) of the processes in a CMP. This pattern dictates the data, instruction and coherence communication inside the CMP and the DRAM accesses outside it. A^n_k in turn depends on the number of processes a program is parallelized into. Assigning a single core to a process for execution is Thread Mapping (M_T). The method of allocating L2 banks to physical cores is L2 cache mapping (M_C). L1 and L2 are the sizes of the caches per tile. NoC is the interconnection network connecting the various tiles in the CMP; it encapsulates the topology, flit sizes, wire widths, routing protocols and other design parameters of the interconnection network. DRAM encapsulates DRAM parameters such as size, operating frequency, access time, off-chip access delay, number of rows and columns, and so on.


If T_exec(L1,L2) is the time taken for the benchmark to complete execution on a CMP with each tile containing a private Level 1 cache of size L1 and a distributed, shared Level 2 cache of size L2, ED can be written as:

ED_{L1,L2} = E_exec(L1,L2) × T_exec(L1,L2)

The optimal design is the one with the least ED product. The optimization problem can be formulated as:

ED_opt = min{ ED_{L1,L2} (∀L1, ∀L2) }

where ED_opt is the CMP configuration with the least ED product over the various L1 and L2 sizes. Consider N_NoC tile placement (floorplanning), N_cm cache mapping and N_tm thread mapping strategies available during the design of the CMP. The optimal design is the one with the least ED product over all M_C and M_T, over all L1 and L2 cache sizes, and over the various floorplanning techniques. This optimization problem can be formulated as:

ED_best = min{ ED_opt_1, ..., ED_opt_n },  ∀n

A compact sketch of this sweep over the (L1, L2) grid is given at the end of this section.

The exploration work carried out in this chapter uses the parameters L1, L2, f, M_C and M_T in varying degrees. The L1 size per tile is varied from 8KB – 256KB in powers of 2, and L2 sizes per tile are varied from 64K – 4M. M_C and M_T are discussed in the alternative tile placement and thread scheduling section (Section 4.9). Four tile placement and process scheduling strategies are considered, starting from a conventional 2D mesh as the base design. The operating frequency of the CMP is decided based on the access time of L1. Single-cycle L1 access is assumed and hence 6 different operating frequencies are used in the exploration experiments (1.64GHz – 1.38GHz). L1 and L2 cache parameters (access time, area) are obtained from Cacti[134][135]. A CMP with a 4×4 2D-mesh NoC connecting 16 cores is the example used throughout the chapter. A 667MHz DDR2 DRAM is used in the current work. More DRAM parameters are tabulated in Table 4.1. The FFT SPLASH2 benchmark is used as the case study for power/performance evaluation of CMPs. The next subsection analyzes the effect of link latencies on the performance of a CMP.
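The sketch below illustrates the (L1, L2) sweep behind ED_opt. It assumes a hypothetical run_config callback returning total power and completion time for one configuration (in the actual flow these numbers come from the Sapphire simulations described in Section 4.5); the stand-in model at the bottom exists only so the example runs.

L1_SIZES_KB = [8, 16, 32, 64, 128, 256]
L2_SIZES_KB = [64, 128, 256, 512, 1024, 2048, 4096]

def best_ed_config(run_config):
    """Sweep the (L1, L2) grid and return the configuration with minimum ED.

    `run_config(l1_kb, l2_kb)` is assumed to return
    (total_power_watts, exec_time_seconds) for that configuration.
    """
    best = None
    for l1 in L1_SIZES_KB:
        for l2 in L2_SIZES_KB:
            power, t_exec = run_config(l1, l2)
            ed = (power * t_exec) * t_exec      # E_exec * T_exec
            if best is None or ed < best[0]:
                best = (ed, l1, l2)
    return best

# Stand-in model so the sketch runs; it merely favours mid-sized caches.
def fake_run(l1, l2):
    t = 0.03 - 0.001 * (l1 >= 16) - 0.002 * (l2 >= 256) + 0.000001 * l2
    return 20.0 + 0.01 * l1 + 0.002 * l2, t

print(best_ed_config(fake_run))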


4.5 Experimental Setup

Table 4.1 lists the Sapphire framework[102] settings used for the experiments. Each tile consists of a processor, its private L1 cache and a section of the L2 shared cache. L1 cache sizes are varied from 8 KB to 256 KB in powers of 2. The sizes of L1-I and L1-D are set to the same value in all experiments. The unified L2 cache size in each tile varies from 64KB to 4 MB. 16 tiles are interconnected using a 4×4 2D mesh. The Garnet Flexible Network model in Ruby provides an abstraction of all interconnection network models, while allowing the router pipeline to be flexibly adjusted; this model is used in all experiments. Network interfaces at nodes and outgoing buffers at the routers were monitored to calculate activity and coupling factors on the links. The router in the NoC routes flits using the deterministic XY routing protocol and uses credit-based VC flow control.

Intacte was used to compute the interconnect delay and power. Intacte operates on a number of design variables such as wire width, wire spacing, repeater size and spacing, degree of pipelining, supply voltage (V_dd) and threshold voltage (V_th). Activity and coupling factors, which are inputs to Intacte, are obtained from the Sapphire simulations. Intacte determines the most power-optimal configuration of the interconnect by iterating over the parameter space of repeater sizes and spacings for a given wire to achieve a desired frequency. The tool also includes flop and driver overheads in its power and delay calculations. Intacte outputs the total power dissipated, including short-circuit and leakage power values.

DRAMsim[104] is a detailed and highly configurable C-based memory system simulator that implements detailed timing models for a variety of existing memories, including SDRAM, DDR, DDR2, DRDRAM and FB-DIMM. It also models the power consumption of SDRAM and its derivatives. It can be used as a stand-alone simulator or as part of a more comprehensive system-level model. DRAMSim is integrated into the Sapphire framework[102].

4.5.1 Experimental Methodology

Figure 4.3 depicts the steps followed in the experiments.


Table 4.1: Configuration parameters of processors, caches & interconnection network used in experiments.

Processor:
  Technology                 32nm (PTM)
  Processor Frequency        Equal to L1 frequency
  Processor Power            Scaled from SPARC[123]
  No. of Processors          16
  Tile Configuration         1 processor, private L1 I/D cache & shared L2 cache

L1 & L2 Caches (common):
  Line Size                  64 bytes
  Associativity              4
  Power Model                CACTI([134],[135])

L1 Cache:
  Size per Tile              8, 16, 32, 64, 128, 256 KB
  No. of Banks               2
  Frequency                  Estimated using CACTI
  Access Time                Single cycle

L2 Cache:
  Size per Tile              64KB, 128KB, 256KB, 512KB, 1, 2, 4 MB
  No. of Banks               4

DRAM:
  Type                       DDR2
  Frequency                  667MHz
  Bank Count                 8
  Refresh Time               64ms
  Number of Rows             2^14
  Number of Columns          2^10
  Power Model                DRAMSim[104]

Interconnection Network:
  Model                      Garnet[136]
  Topology                   4×4 2D Mesh
  Flit Size                  32 bits
  Routing Protocol           Deterministic XY routing
  Router Pipeline            4 stages
  Virtual Channels           4 per port
  Flow Control               Credit-based VC
  Link Power Model           Intacte[75]


Figure 4.3: Flowchart illustrating the steps in the experimental procedure.

• L1 & L2 (per-tile) sizes are input to CACTI to obtain the access time, energy per access, leakage power and area for each L1/L2 cache size. Energy per access, access time and the total number of accesses per cache are combined to estimate the power consumed by the L1 cache in the final stages of the experiment. L1 sizes range from 8K – 256K and L2 sizes range from 64K – 4M. Other CACTI input parameters are tabulated in Table 4.1. The processor frequency, type of DRAM used, channel width and count, clock granularity, number of rows and columns, and refresh time are input to DRAMSim as parameters.

• An individual tile contains a processor, a private L1 and a shared L2 cache. The processor area is estimated by scaling down the area of a SPARC processor[123] (fabricated in a 65nm process) to 32nm. The area of the processor logic used in the floorplans is 2.4515 × 1.9989 mm^2. The floorplan of a single tile is drawn using the cache areas obtained from CACTI and this processor area. Example floorplans are shown in Figure 4.4.

• 16 individual tiles are arranged in a 4×4 2D mesh network and the lengths of the interconnect links are estimated from the complete CMP floorplan. The lengths of the links between L1 and the router, and between L2 and the router, within each tile are also estimated. Tiles are re-arranged for other sets of experiments and the link lengths are recalculated. The three interconnection networks used in the current work are shown in Figure 4.5.


• The processor is assumed to run at the same frequency as the L1 cache. Using the frequency of operation and the length of the link, the power-optimal configuration of the link is obtained using Intacte. The power-optimal link configuration is obtained by varying the number and sizes of repeaters, among other parameters, on the link. The link latency obtained for each link is input into the interconnect network configuration files of Sapphire. The link lengths for the various cache sizes are tabulated in Table 4.5. The access latency of the L2 cache in cycles is determined from the access time (in ns) obtained from CACTI divided by the L1 access time, i.e. multiplied by the operating frequency (see the sketch after this list):

Latency_L2 = AccessTime × Frequency

The L2 cache access latency is copied into the Ruby configuration file.

• Sapphire is used to run the FFT benchmark to completion. The experiments are repeated for various configurations of L1 cache, L2 cache and flit sizes (Table 4.1). Traffic monitored on the interconnect links is used to estimate activity and coupling factors. Cache statistics output from Ruby are used to calculate miss rates and numbers of accesses. Cache access counts are used to estimate cache power. The benchmark completion time, combined with the number of cycles and the frequency of operation, is used to tabulate IPC, program completion time and instructions-per-second results per benchmark per L1/L2 size.

• Link activity and coupling factors are input to Intacte to obtain the power spent in communication over the benchmark execution. Processor power is estimated by scaling down from values published in [123] for a 65nm SPARC processor. Processor powers used over the various L1 configurations are tabulated in Table 4.2. The total power spent during the benchmark execution is the sum of the power spent in the processor, the memory hierarchy (L1 & L2 caches) and the interconnect links. The energy spent in executing the benchmark is obtained from the total power spent and the program completion time. Energy per instruction and IPS^2/W are also calculated. Results from these experiments using various benchmarks are illustrated in Section 4.9.
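The latency and cache-power bookkeeping in the bullets above reduces to a pair of one-line conversions, sketched below with example values from Tables 4.3 and 4.4 (a 64KB L2 with 0.93 ns access time at 1.64 GHz, 0.162 nJ per access, 0.057 mW leakage). The function names and the rounding-up of the latency are assumptions for illustration, not the exact scripts used in this work.

import math

def l2_latency_cycles(l2_access_time_ns, freq_ghz):
    """L2 access latency in cycles: AccessTime x Frequency, rounded up."""
    return math.ceil(l2_access_time_ns * freq_ghz)

def cache_power_watts(n_accesses, energy_per_access_nj, exec_time_s, leakage_w):
    """Average cache power = dynamic energy over the run / run time + leakage."""
    dynamic_j = n_accesses * energy_per_access_nj * 1e-9
    return dynamic_j / exec_time_s + leakage_w

# 64KB L2 at 1.64 GHz (Tables 4.3 and 4.4): 0.93 ns -> 2 cycles.
print(l2_latency_cycles(0.93, 1.64))
# Illustrative power estimate for that L2 bank over a 0.03 s run.
print(cache_power_watts(2_000_000, 0.162, 0.03, 0.057e-3))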


Figure 4.4: Tile floorplans for different (L1, L2) sizes. From left: (8KB, 64KB), (64KB, 1MB), (128KB, 4MB).

Table 4.2: Scaled processor power over L1 configurations.

L1 (→)            8K       16K      32K      64K     128K    256K
Power (in Watts)  118.062  115.918  105.572  92.155  79.923  70.342

Figure 4.5: Mesh floorplans used in experiments. From left: conventional 2D mesh topology, a clustered topology, a clustered topology with L2 bank and thread mapping, and a mesh topology with L2 bank and thread mapping.

Table 4.3: Primary and secondary cache parameters (access time, area) obtained from CACTI. L2 access latencies as a function of L1 access times are also shown.

L2 Size                64 KB   128 KB  256 KB  512 KB  1 MB   2 MB   4 MB
L2 Access Time (ns)    0.93    0.96    0.99    1.08    1.17   1.55   2.23
L2 Cache Area (mm^2)   2.29    2.42    2.69    3.96    5.08   7.56   16.44

L1 Size  L1 Access (ns)  L1 Area (mm^2)  Frequency (GHz)  L2 Access Latency (cycles, L2 64KB – 4MB)
8 KB     0.61            0.28            1.64             2 2 2 2 2 3 4
16 KB    0.61            0.56            1.63             2 2 2 2 2 3 4
32 KB    0.63            0.6             1.58             2 2 2 2 2 3 4
64 KB    0.66            0.98            1.51             2 2 2 2 2 3 4
128 KB   0.69            1.12            1.44             2 2 2 2 2 3 4
256 KB   0.73            1.4             1.38             2 2 2 2 2 3 4


Table 4.4: Maximum operating frequencies and dynamic energy per access of various L1/L2 caches. Values were calculated using CACTI power models with 32nm PTM.

L1 Cache                                       L2 Cache
Size     Dynamic Energy    Leakage             Size    Dynamic Energy    Leakage
         Per Access (nJ)   Power (W)                   Per Access (nJ)   Power (mW)
8 KB     0.0775            0.0065              64KB    0.162             0.057
16 KB    0.078             0.013               128KB   0.167             0.071
32 KB    0.0816            0.016               256KB   0.18              0.099
64 KB    0.143             0.033               512KB   0.256             0.175
128 KB   0.1515            0.047               1 MB    0.288             0.293
256 KB   0.1665            0.074               2 MB    0.373             0.551
                                               4 MB    0.928             1.212

Table 4.3 lists the L1 and L2 cache parameters (access time, area) obtained from CACTI. The L2 access delay in cycles, as a function of L1 access time, is also shown. The difference in area between L1 and L2 caches of the same size (e.g. 64KB) is due to the difference in the number of banks (2 vs. 4), which results in twice as many read/write ports in the L2 cache. Also, the tag area in L2 caches is larger (0.0149 mm^2 in a 64KB L1 vs. 0.021 mm^2 in a 64KB L2), and the bank areas in the data array and tag array of L2 caches are larger (e.g., 64KB L2 vs. 64KB L1: data array 1250µm × 1814µm vs. 489µm × 1962µm, tag array 91µm × 229µm vs. 61µm × 241µm). Table 4.4 shows the dynamic energy per access and leakage power values obtained from CACTI; these values are used for power estimation of the caches. Table 4.5 shows the estimated lengths of the links between L1/L2 caches and routers, and between routers of neighbouring tiles, for the regular mesh placement. Lengths of links between the router and the caches were estimated from the floorplans shown in Figure 4.4. Lengths of links between routers were estimated using the regular mesh placement shown in Figure 4.5. The power-optimal pipeline configuration of each link operating at the desired frequency was obtained from Intacte. Dynamic power for the processor was scaled down from statistics recorded in [137]. Leakage power is assumed to be 33% of the total processor power[138].


Table 4.5: Lengths of links between L1/L2 caches & routers and between routers of neighbouring tiles for a regular mesh placement. The number of pipeline stages required to meet the maximum frequency is also shown.

L2 Size  Tile area     L1 to Router   L2 to Router   Router to Router  Horiz. & Vert.  Directory to
(MB)     (mm × mm)     mm    stages   mm    stages   mm    stages      mm    stages    Router (mm)

L1 8KB. Max. frequency: 1.64 GHz
0.064    3.5 × 2.45    1.25  2        2.04  4        3.6   7           2.55  5         3.6/0.2
0.128    3.55 × 2.45   1.27  2        2.08  4        3.65  7           2.55  5         3.65/0.2
0.256    3.65 × 2.45   1.36  3        2.17  4        3.75  8           2.55  5         3.75/0.2
0.512    4.0 × 3.0     2.75  5        2.5   5        4.1   8           3.1   6         4.1/0.2
1        4.25 × 3.0    3.05  5        2.78  5        4.35  8           3.1   6         4.35/0.2
2        4.75 × 3.0    3.5   6        2.75  5        4.85  9           3.1   6         4.85/0.2
4        6.0 × 4.0     1.25  2        6.45  12       6.1   11          4.1   8         6.1/0.2

L1 16KB. Max. frequency: 1.63 GHz
0.064    3.5 × 2.5     1.2   2        2.26  4        3.6   7           2.6   5         3.6/0.2
0.128    3.55 × 2.5    1.22  2        2.31  5        3.65  7           2.6   5         3.65/0.2
0.256    3.65 × 2.5    1.3   3        2.39  5        3.75  8           2.6   5         3.75/0.2
0.512    4.0 × 3.1     3.12  5        2.75  5        4.1   8           3.2   6         4.1/0.2
1        4.25 × 3.1    3.36  6        3.04  5        4.35  8           3.2   6         4.35/0.2
2        4.75 × 3.1    3.88  7        2.75  5        4.85  9           3.2   6         4.85/0.2
4        6.0 × 4.0     1.9   4        6.45  12       6.1   11          4.1   8         6.1/0.2

L1 32KB. Max. frequency: 1.58 GHz
0.064    3.5 × 2.5     1.22  2        2.28  4        3.6   7           2.6   5         3.6/0.2
0.128    3.55 × 2.5    1.24  2        2.32  4        3.65  7           2.6   5         3.65/0.2
0.256    3.65 × 2.5    1.32  3        2.4   5        3.75  7           2.6   5         3.75/0.2
0.512    4.0 × 3.12    3.14  6        2.77  5        4.1   7           3.22  6         4.1/0.2
1        4.25 × 3.12   3.38  7        3.08  6        4.35  8           4.35  8         4.35/0.2
2        4.75 × 3.12   3.9   8        2.75  5        4.85  9           4.85  9         4.85/0.2
4        6.0 × 4.0     1.92  4        6.45  12       6.1   11          6.1   11        6.1/0.2

L1 64KB. Max. frequency: 1.51 GHz
0.064    3.95 × 3.0    3.0   5        2.5   5        4.05  8           3.1   6         4.05/0.2
0.128    4.0 × 3.0     3.1   6        2.5   5        4.1   8           3.1   6         4.1/0.2
0.256    4.1 × 3.0     3.14  6        2.6   5        4.2   8           3.1   6         4.2/0.2
0.512    4.0 × 3.45    3.5   7        3.0   5        4.1   8           3.55  7         4.1/0.2
1        4.25 × 3.45   3.75  8        3.25  6        4.35  8           3.55  7         4.35/0.2
2        4.75 × 3.45   4.25  7        2.75  5        4.85  8           3.55  7         4.85/0.2
4        6.0 × 4.0     2.05  4        6.45  12       6.1   11          4.1   8         6.1/0.2

L1 128KB. Max. frequency: 1.44 GHz
0.064    3.95 × 3.1    3.1   6        2.57  4        4.05  7           3.2   6         4.05/0.2
0.128    4.0 × 3.1     3.2   6        2.61  4        4.1   7           3.2   6         4.1/0.2
0.256    4.1 × 3.1     3.23  6        2.7   5        4.2   7           3.2   6         4.2/0.2
0.512    4.0 × 3.55    3.56  6        3.06  5        4.1   7           3.65  7         4.1/0.2
1        4.25 × 3.55   3.8   7        3.3   6        4.35  8           3.65  7         4.35/0.2
2        4.75 × 3.55   4.3   8        2.75  5        4.85  8           3.65  7         4.85/0.2
4        6.0 × 4.0     2.1   4        6.45  12       6.1   11          4.1   7         6.1/0.2

L1 256KB. Max. frequency: 1.38 GHz
0.064    3.95 × 3.2    3.3   6        2.7   5        4.05  8           3.3   6         4.05/0.2
0.128    4.0 × 3.2     3.35  6        2.7   5        4.1   8           3.3   6         4.1/0.2
0.256    4.1 × 3.2     3.44  6        2.8   5        4.2   8           3.3   6         4.2/0.2
0.512    4.0 × 3.65    3.8   6        3.2   6        4.1   8           3.75  6         4.1/0.2
1        4.25 × 3.65   4.05  7        3.45  6        4.35  8           3.75  6         4.35/0.2
2        4.75 × 3.65   4.55  8        2.75  5        4.85  8           3.75  6         4.85/0.2
4        6.0 × 4.0     2.15  4        6.45  12       6.1   11          4.1   8         6.1/0.2


Figure 4.6: Benchmark execution time vs. communication time: DRAM access time and on-chip transit time vs. L2 cache size vs. program completion time.

4.6 Effect of Link Latency on Performance of a CMP

Intra-tile and off-tile link latencies affect communication delays and have a major influence on the execution times of processes in CMPs. Figures 4.6 and 4.7 record sample results from FFT benchmark execution on a 4 × 4 CMP with individual tiles containing an L1 of 16K and L2 sizes varied from 64KB – 4MB per execution. Figure 4.6 illustrates the effect of varying tile sizes, due to varying L2 sizes, on on-chip and off-chip communication times during the execution of the program. The graph also illustrates the relationship between the two communication times and the program completion time. DRAM accesses decrease with increasing L2 size until L2:256KB. The effect of off-chip communication saturates after L2:256K as the working set has been accommodated inside the CMP.

Program execution time decreases from 52.9M cycles (0.0324 secs) to 42.6M cycles (0.026 secs) as the L2 size changes from 64KB to 256KB. One of the factors is the decrease in overall DRAM communication time from 0.025 to 0.007 secs.


Figure 4.7: Program energy vs. communication time.

The effect of intra-chip transit time is pronounced once the DRAM communication time saturates, in runs with L2 greater than 256K. The increase in program communication times between L2 sizes 512K and 4M can be attributed to intra-chip transit time, as seen in Figure 4.6. The intra-chip transit time depends on the latencies of the L2-cache-to-router links and the inter-tile links. The L2 – router links have latencies of 5, 5, 5 and 6 cycles and the inter-tile latencies are 6, 6, 6 and 8 cycles for L2 sizes 512K – 4M (Table 4.5). The program completion time at L2:4M is larger than those for L2:512K, 1M and 2M due to the larger inter-tile link latencies.

Figure 4.7 illustrates the relationship between the two communication times and program energy. The communication times are also an indicator of the energy spent in executing the program: the minimum-communication-time L1/L2 configuration is the same as the minimum-energy point. The total energy spent in execution is directly influenced by the total execution time of the program. The energy spent during the execution of the benchmark is also a function of the power spent in the processor, the memory hierarchies and the communication links. Increasing on-chip communication results in increasing link power consumption and adds to the total energy of benchmark execution. From the results in Figures 4.6 and 4.7 it is clear that both on-chip transit time and off-chip communication time are important parameters in determining the performance of a CMP.


Figure 4.8: 64K-point FFT benchmark execution time vs. total time spent in on-chip message transit. Panels (a) – (f) correspond to L1 sizes 8K, 16K, 32K, 64K, 128K and 256K; in each panel the L2 cache sizes are in the order 64KB, 128KB, 256KB, 512KB, 1M, 2M, 4M.

4.7 Communication in CMPs

Figures 4.8 and 4.9 show the relationship between the time spent in on-chip message transit and in off-chip DRAM accesses during a 64K-point FFT benchmark execution. Total on-chip transit time (Fig. 4.8) is the sum of the latencies of all messages on all links during the execution of the program. Total off-chip communication time (Fig. 4.9) is the sum of the DRAM access latencies of all DRAM accesses over the execution of the program. The experimental procedure is detailed in Section 4.5. Each point in the graphs represents the total execution time (in seconds) for a particular pair of L1 and L2 cache sizes. The ratio of the cycle times for the L1 cache configurations is 1 : 1.00613 : 1.03797 : 1.08609 : 1.13889 : 1.18841 (from Tables 4.3 and 4.5, frequencies range from 1.64GHz – 1.38GHz).
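The cycle-time ratios quoted above follow directly from the per-L1 operating frequencies in Table 4.3; the short sketch below merely recomputes them as an arithmetic check.

# Operating frequencies (GHz) for L1 = 8K ... 256K, from Table 4.3.
freqs_ghz = [1.64, 1.63, 1.58, 1.51, 1.44, 1.38]

# Cycle time is 1/f; ratios are normalised to the fastest (8K) configuration.
cycle_times = [1.0 / f for f in freqs_ghz]
ratios = [t / cycle_times[0] for t in cycle_times]

print(" : ".join(f"{r:.5f}" for r in ratios))
# -> 1.00000 : 1.00613 : 1.03797 : 1.08609 : 1.13889 : 1.18841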


Figure 4.8 relates total benchmark execution time with on-chip transit time over FFT runs with varying L1 (8K – 256K) and L2 (64K – 4M) sizes. Each curve records the program execution times for a fixed L1 configuration. On-chip transit times for runs on L1:64K are the largest in each of the L2 configurations. Both program completion and on-chip transit times are at their minimum in all L2 configurations when L1:256K. Program completion times in each of the L1 curves start from L2:64K, decrease with increasing L2 size until L2:256KB, and then increase. Off-chip DRAM accesses decrease from L2:64K to L2:256K and are constant for larger L2 sizes (512K – 4M). The decrease in program completion times is due to the decrease in DRAM access times between L2:64K and L2:256K. The decrease in DRAM access time over this range is due to the decrease in the global L2 miss count from L2:64K – L2:256K (Figure 4.10(b)). Global L2 miss counts saturate after L2:256K as the working set of the FFT benchmark fits inside the CMP once per-tile L2 sizes are 256K or greater. This leads to an almost constant time spent in DRAM accesses in executions with L2 sizes 256K – 4M. The increase in on-chip transit time contributes to the increase in program completion time in configurations with L2 sizes greater than 256K. On-chip transit times depend on the link latencies between the caches and the router inside a tile and on the latencies of the links between routers of different tiles. From Table 4.5, tile configurations having larger L2 sizes have larger link latencies. CMPs with L2 sizes of 4M have the largest link latencies in every L1 configuration. The sums of all on-chip latencies of a CMP with L1:16K and L2 sizes from 256K – 4M are 21, 24, 25, 27 and 35 cycles respectively. The effect of on-chip latencies starts to show after DRAM latencies have saturated, and hence an increase in the program completion times in configurations with L2 sizes 256K – 4M can be seen. An L2 size of 256K, with any L1, is the configuration with minimum on-chip transit time and minimum DRAM access latency.

Configurations with L2:256K, for every L1 size, have the least number of total messages sent over all the links (Figure 4.10(a)). The number of DRAM accesses saturates after L2:256K in the FFT experiment. Depending on the overall working set size and the program run on the CMP, an L2 size exists where the number of DRAM accesses is minimum and the effect of inter-tile latencies is minimal. This is the optimal L2 size for best program performance. Increasing the size of L2 further will increase coherence messages and will result in increased inter-tile traffic. The resulting on-chip traffic affects the total program completion time.


Decreasing the L2 size below the performance-optimal L2 results in increased DRAM traffic that adversely affects the performance of the CMP for that program.

CMPs of (L1,L2) sizes (128K,2M), (64K,4M) and (16K,128K) complete execution in 0.0296 secs even though they spend increasing amounts of time (0.079 – 0.181 secs) in on-chip communication. The time spent in DRAM accesses during execution in these configurations is 0.0065, 0.01 and 0.013 seconds respectively. The increase in the time spent in DRAM accesses contributes to the benchmark completion time.

CMPs of (L1,L2) sizes (16K,256K), (32K,256K), (128K,1M) and (256K,4M) spend around 0.075 secs in on-chip communication but have benchmark execution times (0.026 – 0.032 seconds) in increasing order. The time spent in DRAM accesses during execution of these configurations is 7.1ms, 7.4ms, 6.5ms and 5.8ms respectively. The cycle times of these configurations are 0.61, 0.63, 0.69 and 0.73 ns respectively. The total instructions executed in these configurations are 42.6M, 42.0M, 42.6M and 43.9M respectively. The CMP with tile configuration L1:256K, L2:4M has the longest execution time due to the larger cycle time and greater number of instructions executed. The configurations (32K,256K) and (128K,1M) have decreasing execution times due to the decreasing number of instructions and decreasing cycle time. The CMP with tile configuration 16K,256K has the least execution time due to the combination of lower DRAM access time and the smallest cycle time. CMP configurations running at a higher operating frequency have a performance edge over others when the communication time is comparable.

The effect of a large operating cycle time can be seen in the L1:256K curve. FFT runs on L1:256K configurations take the longest to complete because of the comparatively low operating frequency, despite the L2:256K configuration consuming the least on-chip transit time. L1:256K, L2:256K spends the least time in on-chip transit due to the least number of messages generated during FFT execution (Figure 4.10(a)). The advantage of the smallest cycle time in L1:8K configurations is lost to L1:16K configurations due to the small L1 cache size. In comparison with the L1:16K configurations, L1:8K configurations have larger on-chip transit times (up to 10%) due to the larger number of L1 misses.


Figure 4.9: 64K-point FFT execution time vs. total time spent in DRAM (off-chip) accesses. In each panel the L2 cache sizes are in the order 64KB, 128KB, 256KB, 512KB, 1M, 2M, 4M.

This results in a larger number of messages (up to 12%), leading to potentially larger waiting times and hence more spin-lock instructions, resulting in a larger number of executed instructions (up to 1.5M) for the benchmark.

Figure 4.10(a) records the total number of messages over all the links throughout the execution of the benchmark. On-chip messages include data from L2 caches due to L1 misses, synchronization variables, cache coherence messages and data from DRAM accesses. As L1 sizes increase, L1 misses decrease and hence a sharp decrease in overall traffic is observed. The total number of messages on the links in a CMP decreases with increasing L2 size until 256K. The data set of the FFT benchmark fits in the CMP when L2 is 256K or higher, and accesses to DRAM saturate after L2:256KB, as observed in Figure 4.10(b). The messages in the CMP for L2 sizes from 256K – 4M are due to intra-tile data movement and coherence messages generated from L1 misses. A slight increase in the number of messages can be observed between L2 sizes of 256K – 4M in each L1 configuration; for instance, at L1:128K the number of messages at L2:4M is 0.22% more than at L2:256K.


Figure 4.10: Total messages over all the links during the execution of the benchmark and average transit time of a message. (a) Total messages over all the links during the execution of the benchmark. (b) Total global L2 misses (total number of DRAM reads).


The increase is due to the increased exchange of synchronization variables and coherence messages as L2 sizes increase. Global L2 misses result in DRAM accesses for instruction/data retrieval. Figure 4.10(b) records the number of global L2 misses during the execution of the FFT benchmark. DRAM accesses saturate after L2 sizes of 256K as the 64K-point input of the FFT benchmark fits into the L2 cache.

Figure 4.11(a) records the number of instructions executed during the FFT benchmark execution over the various L1/L2 configurations. The difference in the number of instructions is due to the spin-lock instructions executed by cores waiting for the release of resources. The time spent in communication by a program is thus an indicator of the number of spin-lock instructions executed by a core while waiting. Configurations with L1:256K (over all L2 sizes) spend the least time in on-chip communication, and Figure 4.11(a) shows that L1:256K over all L2 sizes executes the least number of instructions. The increased number of instructions executed in the configurations (8K,64K), (8K,128K), (16K,64K), (16K,128K), (32K,64K) and (32K,128K) is due to the increased time spent in DRAM accesses in these configurations (Figure 4.9). The waiting times in these configurations are abnormally large due to the longer DRAM accesses, resulting in an increased number of spin-lock instructions. The number of instructions also increases in configurations with L2 greater than 256K because of the increased transit time (and hence the waiting time).

The power consumed in the CMP memory hierarchy during L1/L2 accesses and the power consumed by the links during on-chip transit and off-chip accesses are shown in Figure 4.11(b). The power consumed in L1 and L2 accesses is proportional to the sizes of the L1 and L2 caches in all the configurations. Thus the power spent in the memory hierarchy increases as L1 and L2 sizes increase. DRAM access power is largest in the lower L2 cache configurations (64KB and 128KB), over all L1 sizes, due to the larger number of global L2 misses in these configurations (Figure 4.10(b)). DRAM accesses saturate after L2:256KB and hence the power consumed in configurations with L2 equal to or greater than 256KB is comparable. For instance, the ratio of the power consumed in the DRAM in the L1:16KB configuration over L2:64KB – L2:4M is 1.47 : 1.275 : 1.007 : 1.006 : 1.0048 : 1.0045 : 1. The link power is tabulated separately in Table 4.6. CMPs with configurations having L2:4M have the longest links and largest latencies.


Figure 4.11: FFT. Total instructions executed and power spent in the memory hierarchy and on-chip links during the execution. (a) Total instructions. (b) Communication power + cache hierarchy power.


Table 4.6: FFT. Power spent in links (in mW).

L2 (→)    64K     128K    256K    512K    1M       2M       4M
L1:8K     76.887  75.842  77.547  93.798  95.702   104.743  129.186
L1:16K    75.555  76.177  78.408  89.785  97.107   105.709  128.699
L1:32K    74.996  73.070  76.728  90.975  105.980  114.220  138.661
L1:64K    84.181  86.746  87.801  99.491  98.306   94.788   121.621
L1:128K   76.061  76.177  78.735  80.918  90.167   91.740   113.227
L1:256K   77.891  77.425  78.226  81.005  84.421   84.900   107.982

The longest links consume more power in each L1 case due to the larger number of flops (Figure 4.10(a)). Link power dips in configurations with L2:64K and L2:128K for L1:8K, 32K and 256K. Total link latencies in L2:64K and L2:128K, for instance in the L1:32K configuration, are equal (18 cycles); the difference in power is due to the difference in activity factors (average activity factor per link of 0.00152 vs. 0.0013) in the links during the benchmark execution. The maximum link power is consumed in the 32K,4M configuration. Total link latencies of the L2:4M configurations over L1 sizes 8K – 256K are 33, 35, 38, 35, 34 and 35 cycles respectively. The average link activity factors in these configurations are 0.00104, 0.00094, 0.00088, 0.00084, 0.00072 and 0.00056 respectively. The differences in activity factors between these configurations are not large enough to influence the final link power drastically. The dominating factor in the overall link power is the number of flops in the links, and hence the 32K,4M configuration has the highest power consumption among these configurations. The differences in frequencies between L1:8K, 16K and 32K (1.64GHz, 1.63GHz and 1.58GHz) are not large enough to compensate for the power spent in flops (102mW, 102.7mW and 110mW respectively).

Figure 4.12(a) shows the energy spent per instruction during the execution of the FFT benchmark. The energy spent in the execution of the benchmark is the product of the total power spent during the execution and the time taken for the execution to complete. The total power includes the processor power and the power spent in the L1/L2 cache hierarchy, DRAM accesses and the on-chip links in the CMP. Energy per Instruction (EPI) is the ratio of the total energy spent during the execution to the number of instructions executed.


Figure 4.12: FFT benchmark. (a) Energy per Instruction. (b) Instructions per second^2 per Watt.

Processor power is a significant percentage of the overall power of the CMP (85% – 98% in L1:8K configurations and 75% – 93% in L1:256K configurations). The execution time of the FFT run in L1:256K configurations is the largest amongst all L1 configurations, as seen in Figures 4.8 and 4.9. The power consumed by the L1:256K configurations is the lowest compared to the other L1 configurations, owing largely to the lower operating frequency. The energy consumed in each L2 configuration with L1:256K is less than the energy spent in the corresponding L2 sizes of the other L1 configurations. The total power spent in the CMP dominates the EPI number and hence the configuration with the least total power results in the least EPI. The operating frequency of the CMP strongly influences the total power consumed; L1:256K operates at the lowest frequency (1.38GHz) and hence this configuration has the least total power. Experiments show that the energy-optimal configurations are those with the lowest operating frequency. This is in contrast with the performance-optimal configurations, where configurations with higher frequencies are performance optimal. Energy per Instruction increases for L2 greater than 128K in all the L1 configurations; the combined effect of the total power consumed and the increasing execution time in these configurations overcomes the effect of the increase in total instructions executed. CMPs with L1:256K have the lowest operating frequencies and hence the largest program execution times, but the time spent in communication is lower owing to fewer L1 misses and hence the least power is spent in communication.


Due to the fewer messages in transit, L1:256K has lower waiting times in traversal and hence executes fewer instructions during benchmark execution. The loss in performance in this configuration is complemented by the energy savings.

L1:16K is the performance-optimal L1 configuration, as seen in Figures 4.8 and 4.9. The power consumed in L1:16K configurations is up to 33% larger than that consumed in L1:256K configurations due to the higher operating frequency. Additionally, L1:256K configurations execute fewer instructions due to fewer L1 misses and have fewer spin-lock instructions during execution. From an EPI perspective, tiles having larger private L1s, operating at lower frequencies, are better than faster tiles with smaller L1 sizes.

Amongst the L1:256K configurations, L2:4M has the longest execution time and largest total power, contributing to the increased EPI number. Larger L2 sizes result in larger L2 access power (L2 power for L2:64K – 4M ranges from 1.6W to 19.4W), as seen in Figure 4.11(b).

Figure 4.12(b) records the IPS^2/W for the FFT benchmark. The IPS^2/W metric decreases with increasing L2 size from 256K – 4M in all the L1 configurations. The total power spent in the CMP increases monotonically between L2:256K and 4M over all L1 configurations. The dominating, larger power number in the higher L2 configurations is due to the larger L2 access power, as seen in Figure 4.11. Larger L2 sizes lead to larger tile sizes, leading to larger link latencies, increased waiting time and hence a larger number of executed instructions; they also lead to increased time spent in execution. All these factors adversely affect the IPS^2/W metric and hence larger-L2 tiles are not the best choice for performance-power optimal execution of the FFT benchmark. The IPS^2/W-optimal L1/L2 configuration is L2:128K in most of the L1 configurations. This is a benchmark-specific optimal configuration; it is close to the point where the time spent in communication in the CMP is the least (L2:256K).

4.8 Program Completion Time

Program completion time follows the on-chip transit (Eqn. 4.6) and off-chip communication times (Eqn. 4.7), as shown in Figure 4.13.


Figure 4.13: Y1: PCT; Y2: on-chip transit and off-chip communication times (FFT, L1:16K, L2 64K – 4M).

Figure 4.14: FFT benchmark results. (a) Memory hierarchy and interconnect power. (b) Number of L2 misses. (PCT: Program Completion Time; comm.: communication.)


Table 4.7: Total messages in transit (in millions).

L2 (→)   64K    128K   256K   512K   1M     2M     4M
L1:8K    25.9   21.2   14.5   14.5   14.5   14.5   14.5
L1:16K   23.0   18.3   12.5   12.5   12.5   12.5   12.5
L1:32K   21.2   17.0   11.8   11.8   11.7   11.7   11.7
L1:64K   17.7   14.2   11.1   11.1   11.1   11.1   11.1
L1:128K  12.4   9.7    9.3    9.3    9.3    9.3    9.3
L1:256K  8.3    7.2    7.2    7.2    7.2    7.1    7.2

T_tran is the sum of the latencies of all messages on all links during program execution. T_dram is the sum of the access latencies of all DRAM accesses over the program execution. For a given L1 size, the decrease in PCT is a direct result of the decrease in on-chip and off-chip communication times from L2:64K – L2:256K. Off-chip communication saturates after L2:256K as L2 misses saturate; the input data set for the 64K-point FFT is accommodated inside the CMP after L2:256K. On-chip transit times decrease (for a given L1) due to fewer off-tile L2 accesses as L2 increases from 64K – 256K. The effect of on-chip latencies starts to show after DRAM latencies have saturated, and hence an increase in the program completion times in configurations with L2 sizes 256K – 4M can be seen.

Configurations with L2:256K, for every L1 size, have the least number of total messages sent over all the links (Table 4.7). Depending on the overall working set size and the program run on the CMP, an L2 size exists where the number of DRAM accesses is minimum and the effect of inter-tile latencies is minimal. This is the optimal L2 size for best program performance. Increasing the size of L2 further will increase coherence messages and will result in increased inter-tile traffic; the resulting on-chip traffic affects the total program completion time. Decreasing the L2 size below the performance-optimal L2 results in increased DRAM traffic that adversely affects the performance of the CMP for that program.

Smaller caches suffer from frequent cache line evictions when data sizes are larger than the cache size. Hence, smaller L1s have a greater number of L2 accesses compared to larger L1 caches.


Figure 4.15: FFT benchmark results. (a) Program execution energy: energy per instruction (nJ) against L1 cache sizes (8K – 256K) for L2 sizes 64K – 4M, real interconnect. (b) IPS²/W against L1 cache sizes for the same L2 sizes.

L2 misses saturate after L2:256K as the program data size fits into the combined L2 cache of the CMP. For L2:64K and 128K over all L1 caches, the large number of L1 misses increases on-chip transit time and results in larger program completion times. Larger L2 configurations result in large tile sizes and hence larger program completion times. PCT decreases from L1:8K to L1:16K due to the decreasing number of L1 misses for a given L2 configuration. PCT for a given L2 decreases from L1:32K to L1:128K due to the decrease in the number of L1 cache misses and hence the decrease in on-chip transit time. In L2:4M configurations, the effect of long-latency on-chip links (due to large tile sizes) shows up, increasing PCT. The L1:16K, L2:256K configuration achieves a balance between the number of L1 and L2 misses and the on-chip link latencies, and is the performance-optimal configuration for FFT. Small L1 and L2 caches suffer from too many misses; large L2 cache configurations result in huge interconnect latencies. The results show that the performance-optimal, energy-optimal and ED-optimal L1/L2 configurations are different.

Memory and interconnect power is dominated by L2 power (Figure 4.14(a)). Memory and interconnect power is small for L1 up to 64K and L2 up to 256K due to the small sizes of the caches. Program completion energy (Figure 4.15(a)) is a function of the total power spent and the time of execution. Total power decreases as L1 increases (as the operating frequency decreases). Minimum energy is spent at L1:256K, L2:128K. Chip power is minimum at L2:64K in the L1:256K configurations. Execution time decreases from L2:64K till 256K due to the decrease in L2 misses (Figure 4.13). The decrease in execution time between L2:64K and L2:128K compensates for the increase in L2 accesses, and the least energy is spent at L1:256K, L2:128K.


Figure 4.16: Program Completion Times. (a) Real interconnect experiment. (b) Ideal interconnect experiment. (Execution time in seconds against L1 cache sizes 8K – 256K for L2 sizes 64K – 4M.)

Large L1 caches and moderate L2 cache configurations are closer to the minimal program execution energy point.

IPS²/W (the same as 1/ED) results favor larger L1 cache configurations. The higher operating frequencies of small L1 cache configurations result in better performance numbers but lose out on the power consumption index. Larger L1 caches (128K) with small L2 caches (64K – 256K) are closer to the optimal ED point. At L1:128K the miss rate is relatively small, and at L2:64K–256K the effect of interconnect latencies is not as pronounced as in larger L2 cache configurations.

The impact of ignoring accurate interconnect parameters on architectural choices is evaluated next. Consider the performance results of the FFT benchmark with (Figure 4.16(a)) and without (Figure 4.16(b)) interconnect latencies included. All interconnect latencies are set to 1 cycle regardless of tile area in the ideal interconnect experiment. Given a target completion time of 0.024 secs, the ideal interconnect graph (Figure 4.16(b)) shows 10 qualifying configurations in L1:8K & L1:16K, but the real interconnect graph shows none. Consider choosing the best energy configuration within a performance requirement of 0.027 secs. Ideal interconnect results indicate L1:128K, L2:256K, while real interconnect results allow configurations in L1:8K and 16K to be considered. Thus, ignoring accurate interconnect latencies can lead to wrong architectural choices.


4.9 Ideal Interconnects, Custom Floorplanning, L2 Banks and Process Mapping

This section presents comparative performance and power results for CMPs with ideal interconnects, alternative floorplanning, and process and L2 bank mapping. Table 4.8 presents the recalculated link lengths of the L1: 256KB and L2: 512KB configuration used in the experiments. The CMP floorplans are shown in Figure 4.5:

• Conventional 2D Mesh. Processes are scheduled starting from processor 0 (the top left corner of the mesh CMP).

• Clustered 2D Mesh. The 4×4 mesh is divided into 4 clusters of 2×2 processors each. The tiles in a cluster are rotated such that link lengths between routers are minimal (0.2 mm). These intra-cluster links have a delay of 1 clock cycle. The clusters are connected by longer links between neighbouring routers that span the tile's border. The links connecting clusters are 8 mm (13 cycles) and 7.5 mm (12 cycles) long. Communication within a cluster costs far less in terms of power and performance than communication across clusters in this CMP.

• Process and L2 Bank Mapped Clustered Mesh. Clusters of 4 processes each are identified after analysis of the communication patterns between tiles in a conventional 2D mesh CMP. Clusters of frequently communicating tiles are identified and processes are mapped as shown in the figure. Conventionally, processes are mapped to the first available processor – Tile 0. Analysis of communication traffic in all benchmarks has shown Tile 0 to be the most communicated with. (An illustrative mapping sketch follows this list.)

• Process and L2 Bank Mapped 2D Mesh. Similar thread and L2 bank mapping is done for a conventional 2D mesh topology. Mapping the L2 banks that receive the maximum number of requests to tiles towards the center of the CMP decreases the average message traversal distance and communication latency. Example results from the FFT benchmark are shown in this section. L2 banks allocated to Tile 0 are mapped to the tile numbered 0 in Figure 4.5, and so on.
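The communication-aware placement described above can be pictured as a simple greedy assignment: order the tiles by distance from the mesh centre and place the processes that generate the most traffic on the most central tiles. The sketch below is only an illustration of that idea under assumed inputs (a matrix of pairwise message counts); it is not the mapping procedure actually used in these experiments.

# Illustrative greedy mapping of 16 processes onto a 4x4 mesh based on
# pairwise message counts.  traffic[i][j] = messages from process i to j.
def map_processes(traffic, mesh_dim=4):
    n = mesh_dim * mesh_dim
    centre = (mesh_dim - 1) / 2.0
    # Tiles ordered by Manhattan distance from the mesh centre.
    tiles = sorted(range(n),
                   key=lambda t: abs(t // mesh_dim - centre) + abs(t % mesh_dim - centre))
    # Processes ordered by the total traffic they send and receive.
    volume = [sum(traffic[p]) + sum(row[p] for row in traffic) for p in range(n)]
    processes = sorted(range(n), key=lambda p: -volume[p])
    return {p: tiles[i] for i, p in enumerate(processes)}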


Table 4.8: Clustered tile placement floorplan for L1: 256KB and L2: 512KB. Lengths of links between neighbouring routers and the number of pipeline stages are shown. Frequency: 1.38 GHz.

                                           Length (mm)   Stages
Intra-cluster                              0.2           1
Between clusters (horizontal & vertical)   8.0           13
Between clusters (horizontal & vertical)   7.5           12
Directory to router                        4.1           8

The four configurations are additionally compared with a hypothetical chip having ideal interconnects. Ideal interconnects have a latency of 1 cycle between routers. Figure 4.17 compares Instructions per Cycle, program completion time, time spent by messages in transit, intra-chip communication power, total chip power, energy per instruction and energy efficiency (IPS²/W) between the said configurations. The results are shown for the FFT benchmark with each core containing an L1 of 256KB and an L2 of 512KB.

IPC increases by 1.3%, 2.6%, 0.8% and 5.6% in the case of the clustered mesh, the process and bank mapped clustered mesh, the process and bank mapped 2D mesh and the ideal interconnect setups respectively, in comparison to the conventional 2D mesh. Reducing the latency in the links has an effect on the overall IPC of the CMP. The IPC and IPS results suggest that the reduction in time spent in transit of messages has a positive impact on the performance of the CMP.

The time spent in transit of messages is significantly reduced in the clustered configurations (up to 11%) compared to the conventional 2D mesh CMP. The ideal interconnect CMP spends up to 40% less time transmitting messages than the conventional 2D mesh.

A significant reduction in the power spent in communication is noticed in both clustered mesh formations (up to 12%). This is due to the large amount of intra-cluster communication in the FFT benchmark experiment. The reduction is smaller (0.3%) in the case of the process and bank mapped 2D mesh experiment. Analysis of communication traffic and reordering of the cores is an iterative process. An example tile ordering (process scheduling) is used in the experiments; several others may yield different power-performance results.
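The pipeline-stage counts in Table 4.8 follow from the link length and the target frequency: the repeated-wire delay of the link divided by the clock period, rounded up. The sketch below reproduces that calculation; the per-millimetre delay used here is an assumed illustrative figure, not a value taken from the Intacte wire models used in the thesis, so small deviations from the table are expected.

import math

def pipeline_stages(length_mm, freq_hz, delay_ns_per_mm=1.15):
    # Stages needed so that each segment of the repeated wire fits in one clock period.
    # delay_ns_per_mm is an assumed, technology-dependent figure.
    wire_delay_ns = length_mm * delay_ns_per_mm
    period_ns = 1e9 / freq_hz
    return max(1, math.ceil(wire_delay_ns / period_ns))

# At 1.38 GHz: pipeline_stages(8.0, 1.38e9) -> 13 and pipeline_stages(7.5, 1.38e9) -> 12,
# in line with Table 4.8; pipeline_stages(0.2, 1.38e9) -> 1.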


Figure 4.17: Alternative tile placements, custom process scheduling example and ideal interconnect comparison results. Benchmark: FFT, L1: 256K, L2: 512K. (Normalized data across the Conventional 2D Mesh, Clustered Mesh, Process and L2 Bank Remapped Clustered Mesh, Process and L2 Bank Remapped 2D Mesh and Ideal Interconnect configurations; comparison parameters: IPC, Time Transit, CP Power, EPI, MPW.)

The total power of the CMP is dominated by the processor and memory hierarchy power in these experiments, and hence the advantage gained by the communication power reduction is not visible. A reduction of 2.5% in the energy spent per instruction is observed in the communication-aware clustered tile placement experiment. EPI reduces by up to 1% in the conventional mesh experiment. The average time taken for an instruction to complete reduces with the decrease in time spent during message transit. Thus the overall energy spent in the system per instruction reduces. An increase of 1.2%, 5.2% and 1.65% respectively is observed in the energy efficiency metric, IPS²/W. Instructions executed per second have increased due to the reduction in communication cycles spent, and hence the performance increase is observed.


4.10 Remarks & Conclusion

This chapter estimated the effects of communication overheads on the performance, power and energy of a multiprocessor chip, using L1 and L2 cache sizes as the primary exploration parameters and accurate interconnect, processor, on-chip and off-chip memory modelling. A detailed multiprocessor execution environment named Sapphire, integrating SESC, Ruby and DRAMSim, was presented. Sapphire was used to run applications from the Splash2 benchmark suite. A 4×4 2D mesh of 16 out-of-order cores, each consisting of a private L1 and a unified L2 cache, was used as the case study in the exploration. Detailed low-level wire delay models from Intacte were used to calculate the power consumed by the interconnection network. Architectural choices based on inaccurate interconnect estimates are not optimal, and the error is severe in cases where applications have heavy communication requirements.

Performance-optimal configurations are achieved at lower L1 cache sizes and at moderate L2 cache sizes due to higher operating frequencies, smaller link lengths and comparatively less communication. Using the minimal L1 cache size in order to operate at the highest frequency may not always be the performance-power optimal choice. Larger L1 sizes, despite a drop in frequency, offer an energy advantage due to the reduced miss-induced communication. Clustered tile placement experiments for FFT (L1:256KB and L2:512KB) show a considerable performance-per-watt improvement (1.2%). Remapped processes and banks in the clustered tile placement show a performance-per-watt improvement of 5.25% and an energy reduction of 2.53%. Remapping threads and frequently accessed L2 banks closer together in a conventional 2D mesh improves performance per watt by 1.6%. EPI and program completion time results indicate that the minimum-energy cache configurations are not the same as the minimum-execution-time configurations.


Chapter 5

Label Switched NoC - Motivation & Design

NoCs servicing generic CMPs or customized media processors are expected to meet the Quality of Service (QoS) demands of the executing applications. The two basic approaches in NoC designs to enable QoS guarantees are the creation of reserved connections between sources and destinations via circuit switching, and support for prioritized routing (in the case of packet switched, connectionless paths). Packet switched networks provide efficient interconnect utilization and high throughputs[43]. However, they need to be over-provisioned to support QoS for various traffic classes and have high buffer requirements in routers. On the other hand, circuit switched NoCs guarantee high data transfer rates in an energy efficient manner by reducing intra-route data storage[41]. These are well suited for streaming applications where communication requirements are well known a priori.

In this thesis, we present a Label Switching based Network-on-Chip (LS-NoC) motivated by the throughput guarantees offered by bandwidth reservation. Such an NoC can be used to provide hard bandwidth and throughput guarantees to streaming applications in a multiprocessor environment amidst resource-competing processes. The Label Switched (LS) Router used in LS-NoC achieves single cycle traversal delay under no contention and is multicast and broadcast capable. Source nodes in the LS-NoC can work asynchronously, as cycle-level scheduling is not required in the LS Router. The LS router supports multiple clock domain operation.


LS-NoC enables circuit switching without requiring a globally synchronous clock and hence eases clock tree design and reduces global clock distribution power.

Media processors with streaming traffic such as HiperLAN/2 baseband processors[7], real-time object recognition processors[8] and H.264 encoders[44][45] demand adequate bandwidth and bounded latencies between communicating entities. Adequate throughput, latency and bandwidth guarantees between process blocks can be provided by establishing provisioned, contention free routes between nodes.

A centralized LS-NoC Management framework engineers traffic into QoS guaranteed routes. LS-NoC caters to the requirements of streaming applications where communication channels are fixed over the lifetime of the application. The proposed NoC framework inherently supports heterogeneous and ad-hoc SoC designs. The LS-NoC can be used in conjunction with a conventional best effort NoC as a QoS guaranteed communication network, or as a replacement for the conventional NoC. The LS-NoC management framework is the focus of the next chapter.

This chapter describes the design of the router used in the label switched, QoS guaranteeing NoC. A multicast, broadcast capable label switched router for the LS-NoC is presented. The LS router (Figure 5.4) costs a single cycle flit traversal delay during no contention. Table based look-up for routing and label translation is provided. A 5 port, 256 bit data bus, 4 bit label router occupies 0.431 mm² in 130nm and delivers a peak bandwidth of 80 Gbits/s per link at 312.5MHz.

Organization of the Chapter

Work related to the Label Switched NoC spans three chapters detailing the router architecture, the NoC management framework and the results respectively. A detailed literature survey of QoS guaranteed NoCs was presented in Section 2.1 of Chapter 2. Section 5.4 illustrates the working of an LS-NoC. Streaming applications and their traffic characteristics are introduced in Section 5.1. Salient features of LS-NoC are listed in Section 5.3. The design of the LS-NoC router and its verification are presented in Section 5.5. Section 6.1 outlines the LS-NoC framework and the tasks of the NoC Manager. Experiments with streaming application case studies are presented in Chapter 7.


Table 5.1: Communication characteristics between HiperLAN/2 nodes.

Stream                    Edge(s)   Bandwidth [Mbit/s]
S/P → Pre-fix Removal     1-2       640
Pre-fix Removal → FFT     3-4       512
FFT → Channel eq.         5-6       416
Channel eq. → De-map      7         384

5.1 Streaming Applications in Media Processors

Applications exhibiting a pipelined mode of operation between processing blocks have stream-based traffic characteristics. We explain HiperLAN/2[139][140] and object recognition applications[8] in the context of streaming traffic characteristics.

5.1.1 HiperLAN/2

HiperLAN/2 is a radio technology operating in the unlicensed 5 GHz bands, used to provide wireless connectivity in WLAN networks. HiperLAN/2 operates with data rates of up to 54 Mbps[139]. The physical layer of HiperLAN/2 modulates the bits originating on the transmitter side and demodulates them on the receiver side using Orthogonal Frequency Division Multiplexing (OFDM). Processing in OFDM receivers is performed in block mode, and this results in a block-based communication stream within nodes. HiperLAN/2 has been mapped onto a multi-tile architecture in [140]. NoC QoS specifications in HiperLAN/2 implementations are driven by OFDM symbol processing. An OFDM symbol is processed every 4µs, and the underlying NoC should guarantee sufficient bandwidth and latency between communicating nodes to facilitate this. Table 5.1, taken from [7], provides the QoS requirements of HiperLAN/2.
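As a rough illustration of what the 4µs symbol period implies for the NoC, the snippet below converts the per-stream bandwidths of Table 5.1 into the data that must be moved per OFDM symbol and into 256-bit flits (the LS-NoC data bus width introduced later in this chapter). It is only a back-of-the-envelope check, not part of the thesis framework.

# Data moved per 4 us OFDM symbol for each HiperLAN/2 stream of Table 5.1,
# and the equivalent number of 256-bit flits.
SYMBOL_PERIOD_S = 4e-6
FLIT_BITS = 256

streams_mbit_s = {
    "S/P -> Pre-fix Removal": 640,
    "Pre-fix Removal -> FFT": 512,
    "FFT -> Channel eq.": 416,
    "Channel eq. -> De-map": 384,
}

for name, mbps in streams_mbit_s.items():
    bits_per_symbol = mbps * 1e6 * SYMBOL_PERIOD_S
    print(f"{name}: {bits_per_symbol:.0f} bits, {bits_per_symbol / FLIT_BITS:.0f} flits per symbol")
# The heaviest stream (640 Mbit/s) amounts to 2560 bits, i.e. 10 flits, every 4 us.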


Figure 5.1: (a) Process graph of a HiperLAN/2 baseband processing SoC[7] and (b) NoC of the object recognition processor[8]. (Panel (b) shows the Main Processor (MP), Visual Attention Engine (VAE), Matching Accelerator (MA), external interface and processing element clusters PEC0 – PEC7 connected through crossbar switches.)

Figure 5.1(a) presents the processing blocks with communication directions in the HiperLAN/2 implementation. HiperLAN/2 blocks connect in a pipelined manner, with the previous block's output serving as input to the current block, as shown in the figure. The Serial-to-Parallel (S2P) block generates HiperLAN/2 application traffic to serve the Demapping block (DM) throughout the simulation.

5.1.2 Object Recognition Processor

Object recognition is widely used in applications such as robot navigation, autonomous vehicle control, video surveillance and natural human-machine interfaces[8]. These applications require huge computational power and real-time response.

The object recognition process is sped up by visual attention, which involves selecting the salient parts of images where the desired object lies. A hardware implementation of such a vision system involves processing elements (PEs) capable of data transactions that facilitate object-level parallel processing. An implemented object recognition system[8] is presented in Figure 5.1(b).

The Main Processor (MP) orchestrates computational data between the processor engine clusters (PECs), the Visual Attention Engine and the Matching Accelerator. An external interface is provided for off-chip multimedia data input/output. The MP broadcasts periodically to all PECs while the PECs set up bursty communications between themselves. Although the communication pipes are fixed over the experiments, the destinations for bursty and non-bursty traffic are chosen at random. The NoC should support high bandwidth for image block transfer, as each PEC is capable of producing 12.8 Gb/s of aggregated throughput. The underlying NoC should also provide low-latency communication for real-time operation.
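A quick capacity check shows why bandwidth reservation with link sharing (the approach described next) is attractive for this workload: the 80 Gbit/s peak link bandwidth reported for the LS-NoC later in this chapter can accommodate several 12.8 Gb/s PEC streams on one physical link. The arithmetic below is only an illustration; it ignores protocol overheads and assumes the link figures from Section 5.7.

# How many 12.8 Gb/s PEC streams fit on one 80 Gbit/s LS-NoC link (ignoring overheads)?
LINK_GBPS = 80.0      # peak LS-NoC link bandwidth (Section 5.7)
PEC_GBPS = 12.8       # aggregated throughput of one PEC
print(int(LINK_GBPS // PEC_GBPS))   # -> 6 streams, with ~3.2 Gb/s headroom left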


5.2 LS-NoC - Motivation

Existing packet switched networks use priority based QoS solutions where higher priority routes are given network resources for larger amounts of time. Such schemes work well when the amounts of traffic in the various priority classes are static and do not change over time. A priority based QoS mechanism may not offer advantages when all routes belong to the same priority level. Further, the best effort nature of packet switched networks results in a larger number of buffers inside routers.

Circuit switching uses path based resource reservation techniques to guarantee QoS. Such schemes work well for real time traffic where the latency of communication paths has to be guaranteed. Reserving resources along a path for its entire lifetime may result in under-utilization of network resources, as no other traffic can use the path when it is free of traffic. These issues are discussed in detail in Chapter 2, which describes QoS related NoC works.

Streaming applications use long lived connections between processing blocks and require constant throughput, with paths demanding real time latency. The characteristics of the streaming traffic generated by media processor applications are listed here:

• The maximum bandwidth and throughput to be guaranteed by the NoC, for the application to be serviced, are known a priori.

• In some streaming applications such as HiperLAN/2, data arrives at every block at a fixed rate due to the pipelined nature of the application. This is possible when the NoC delivers guaranteed constant throughput to the communication paths of the application.


• Streaming applications require either guaranteed real time communication or a fixed, deterministic delay between processing blocks. Jitter in communication delays may result in packets being dropped or in uneven processing in the media processing pipeline.

• Communicating nodes in a streaming application use long lived connections between them. While communication paths are long lived, usage may be intermittent and the traffic bursty in nature. Reserving resources for the entire time period may result in under-utilization of NoC resources. Hence a balance has to be maintained between resource reservation and fair distribution of resources among all communication paths.

• The amount of control information is a small percentage of the processing data traffic in such a system. Control information such as source and destination addresses does not change often, as connections between communicating nodes are long lived.

Packet switched networks are not favoured for servicing streaming applications due to their best effort nature. Circuit switched networks promise to deliver the required QoS but result in an under-utilized network and hence lower energy efficiency. There is a need to build a hybrid NoC that uses the best of both packet and circuit switched QoS techniques for servicing streaming application traffic. Label switching offers a low-overhead solution by reducing the meta-data in packets. Labels uniquely identify a source and destination pair along with the intermediate routers on the route. Smaller labels result in simpler routing tables and less logic in the router. The path can be chosen based on the QoS requirements of the participating nodes. Links can be shared between multiple label paths as long as the QoS requirements are met. Such a traffic engineered path with provisioned network resources reduces buffer requirements in routers, as communication paths are more evenly distributed in the NoC. A simple router design with reduced buffer requirements, together with link sharing, results in an energy efficient NoC. The traffic characteristics of streaming applications motivate the design of a pipe based, resource reserved, link shared, label switched Network on Chip (LS-NoC). The LS-NoC concept is introduced in the following section.


5.3 LS-NoC - The Concept

Long lived, throughput demanding connections between communicating nodes motivate the use of pipe based communication in LS-NoC. A pipe is identified by the source, the destination and a throughput guaranteed path between the source and destination nodes (Section 5.5.1). A pipe reserves the required amount of resources along the route for the lifetime of the connection. Reserving only the required amount of resources enables multiple pipes to share the same physical link as long as their QoS requirements are met.

Pipes are unique to a source and destination pair of nodes and hence can be addressed using pipe-ids (labels). Using labels instead of node addresses to identify the source and destination nodes potentially saves meta-data bits in routing tables and headers in packets. LS-NoC uses sideband signals to transmit flow control and label information.

A physical link shared by multiple pipes may result in conflicting label addresses in routing tables. Label conflicts can be resolved by assigning aliases to pipes. This process is called label swapping in LS-NoC and is dealt with in Section 5.5.2.

Establishing a new QoS guaranteed pipe between nodes requires knowledge of the existing pipes and the resources in the NoC. An NoC-state aware entity having knowledge of the existing connections and the remaining resources in the NoC is essential for this purpose. Such an entity (the Manager) may be a software process running on a core or a separate hardware accelerator block. The Manager is responsible for mapping out a resource rich, contention-less path between communicating nodes. Configuration of the routing tables in the routers along the path of the pipe is also handled by the Manager.

Figure 5.2 shows an example 8×8 LS-NoC in a 2-dimensional mesh topology. As shown in Figure 5.2, after pipe 0 has been set up by the Manager between A & B, pipe 1 is set up on a contention free route. The pipe between X & Y was set up as pipe 0 at the origin; its label is swapped to pipe 1 in the last hop to avoid a label conflict with the pipe between A & B.
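A minimal sketch of the bookkeeping such a Manager needs is given below: each link tracks its remaining capacity and free labels, and each pipe records its source, sink, bandwidth and the label used at every hop. The class and field names are illustrative assumptions, not the data structures of the thesis implementation.

from dataclasses import dataclass, field
from typing import List, Set, Tuple

@dataclass
class Link:
    capacity: float                 # peak bandwidth the link can support
    reserved: float = 0.0           # bandwidth already promised to pipes
    free_labels: Set[int] = field(default_factory=lambda: set(range(16)))  # 4-bit label space

    def can_admit(self, bw: float) -> bool:
        # A new pipe fits if enough bandwidth and at least one label remain.
        return self.capacity - self.reserved >= bw and bool(self.free_labels)

@dataclass
class Pipe:
    source: int
    sink: int
    bandwidth: float
    hops: List[Tuple[int, int]] = field(default_factory=list)   # (router, label) per hop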


Figure 5.2: A 64 node, 8 × 8 2D LS-NoC along with the NoC Manager interface to the routing tables. (The example pipes A→B: pipe 0, A→C: pipe 1 and X→Y: pipe 0, pipe 1 are marked on the mesh.)

5.4 LS-NoC - Working

Figures 5.3(a)–5.3(e) illustrate a pipe establishment and label swapping example in LS-NoC.

• Figure 5.3(a): The NoC Manager contains the flow graph of the network. A few of the edges stored in a data structure are shown in this figure. Total labels available = 8. At the initialization stage all labels are free for use.

• Figure 5.3(b): A pipe (3→4→5→2) is set up between nodes 3 & 2. The data structure is updated in the rows associated with edges 34, 45 and 52. The pipe occupies the entire bandwidth of the link and uses label = 0.

• Figure 5.3(c): Another pipe is established between 7 & 2. The pipe is established in a non-intersecting manner w.r.t. the previous pipe. The pipe (7→4→1→2) occupies 50% of the bandwidth available on the links.

• Figure 5.3(d): A pipe has to be established between nodes 6 & 4. The flow algorithm has identified the route 6→7→4 for the pipe. Label 0 is the first available label at node 6. The label conflicts at the South port of node 4 – label 0 has already been utilized by pipe 0 of node 7 (7→2).


• Figure 5.3(e): Label swapping enables the pipe to be established with label 1.

5.5 Label Switched Router Design

A Label Switched (LS) router (Figure 5.4) costs a single cycle flit traversal delay during no contention. Labels are sent out of band, juxtaposed with the data along the links. Separating the label from the data link reduces meta-data management at the network interfaces at the ingress and egress of the LS-NoC. Applications are free to choose data formats, granting them design flexibility. Further, wires are relatively inexpensive in CMPs[141] – a 4 bit label incurs an overhead of 1.5% on a 256 bit wide data link. The label accompanying the data bus is used by the routing table to identify the intended outgoing port. The routing table is indexed by established labels and has two fields, Direction Bits and New Label, as shown in Table 5.2. Note that the 'Input Label' need not be stored if the memory data structure enables indexing using the input label bits. The Direction Bits field contains as many bits as there are output ports in the router. A bit corresponding to an output port is set if the label being routed is to exit from that output port. Multiple bits set in the Direction Bits field enable multicast or broadcast. The New Label field is maintained to enable label swapping; label swapping is explained in Section 5.5.2. In Table 5.2, incoming label 0 exits through port 0 of the router. Label 1 does not pass through this input port. Labels 2 and 15 are broadcast and multicast messages respectively. Data, label and valid signals are replicated and sent to every output port. Output port flags in the routing table serve as valid signals during arbitration. Routing table signals are used during pipe setup and tear down by the NoC Manager. WriteData and ReadData on the PTR location are controlled by the WR/RD signals.

The combinational circuitry from an input port to an output port is shown in Figure 5.4. An accompanying valid signal denotes whether flopped-in data should be processed by the input port. Incoming data flits are written into the FIFO in two cases. Case 1: flits are queued in the input port buffer for traversal. Case 2: contention occurs at the desired output port and the input port loses during arbitration. An ORed grant signal from all output ports is available to identify whether the input port won the desired output port during arbitration.


Figure 5.3: Pipe establishment and label swapping example in a 3×3 LS-NoC. (a) Initial state of a few edges in the NoC Manager. (b) LS-NoC state after pipe 3→2 has been established. (c) LS-NoC state after pipe 7→2 has been established; the pipe has no contention with the previously established pipe. (d) Label conflict at node 7's North port during 6→4 pipe establishment. (e) Label swapping completed and label 1 assigned to pipe 6→4. (Each panel shows the links table and the shadow routing table data structures held by the Manager.)

Table 5.2: Routing table of an n port (n = 5) router with an lw bit (lw = 4) label, indexed by the labels used in the label switched NoC. Size of the routing table = 2^lw × n × lw.

Input Label   Direction Bits   New Label
0000          00001            0000
0001          00000            0001
0010          11111            0010
...           ...              ...
1111          00101            1111
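The routing-table behaviour summarised in Table 5.2 can be modelled in a few lines: the incoming label selects an entry, the direction bits indicate which output ports the flit is replicated to (more than one set bit means multicast or broadcast), and the New Label field implements label swapping. The sketch below is an illustrative software model, not the Verilog implementation; the port ordering is an assumption.

# Illustrative model of the LS router routing-table lookup of Table 5.2.
# Port ordering (bit 0 = port 0, ...) is an assumption.
PORTS = ["P0", "P1", "P2", "P3", "P4"]

# routing_table[input_label] = (direction_bits, new_label)
routing_table = {
    0b0000: (0b00001, 0b0000),   # unicast: exits through port 0 only
    0b0010: (0b11111, 0b0010),   # broadcast: replicated to all five ports
    0b1111: (0b00101, 0b1111),   # multicast: replicated to ports 0 and 2
}

def route(label):
    """Return the output ports and the (possibly swapped) outgoing label."""
    direction_bits, new_label = routing_table[label]
    out_ports = [PORTS[i] for i in range(len(PORTS)) if direction_bits & (1 << i)]
    return out_ports, new_label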


Figure 5.4: Label Switched Router with single cycle flit traversal. The Valid signal identifies Data and Label as valid. PauseIn and PauseOut are flow control signals for the downstream and upstream routers. The routing table holds the output port and label swap information. The arbiter receives input from all the input ports along with the flow control signal from the downstream router. (The figure shows, per input port, the data, label, valid and PauseIn signals, the FIFO with its control block, the routing table with its NoC Manager interface, and the MUX and arbiter feeding the output ports.)

The FIFO Control Block (FCB) handles the FIFO pointer arithmetic and generates the Pause (flow control) signal when the FIFO fills above a threshold. The threshold accounts for the traversal delay, in cycles, of the link connecting the current router to the upstream router to which the flow control signal is input. The FCB also generates the MUX control signal that decides whether the incoming data on the link traverses the router or is written into the FIFO.

5.5.1 Pipes & Labels

A pipe (P) is a triplet (S, K, R) where S and K denote the source and sink nodes and R is a non-empty set of intermediate routers connecting S and K. A source and a sink can have many pipes between them, varying the set of intermediate routers, R.

A label belonging to a source S uniquely identifies a communication pipe and the intended destination K – though the value of the label can change en route. Labels can be reused across sources.


Label reuse requires reserved entries for each input port in the routing table, or an additional field in the routing table to identify the incoming port. Conceptually, each input port has a separate routing table of up to 2^lw entries, where lw is the label width. Independent routing tables at input ports enable label reuse, resulting in more efficient usage of the label space. Label reuse by sources may give rise to label collisions between pipes sharing a link. Label swapping reassigns unused labels to avert label collisions at input ports (Section 5.5.2).

Using labels, a pipe can be represented as the set S, l_0, l_1, ..., l_{h-1}, K, where the pipe connects source S to destination K through h routers. l_0 is the value of the label at router R_0, and so on. Router R_0 connects to S and R_{h-1} connects to K. Without label swapping, l_0 = l_1 = ... = l_{h-1}.

5.5.2 Label Swapping

A major advantage of LS-NoC is the provisioning of routes with guaranteed throughput between nodes. With an increasing number of pipes in the LS-NoC, the probability of label collision increases. A label collision on a link results in a routing table entry clash, as shown in Figure 5.5. The pipes entering the North and South ports of Router 0 both carry label 0. Both pipes are destined to leave the router through the East port and reach the West port of Router 1. There is no label conflict in Router 0 (contention exists at the East port) as routing tables are individual to each input port. The conflict occurs at the West port of Router 1, in Routing Table 2, through which both pipes need to be routed. Furthermore, consider that the routing table entry for label 0 is already used and there are at least 2 routing table entries free for use. In such a situation, neither of the pipes carrying label 0 from Router 0 can pass through the West port of Router 1. This results in inefficient utilization of links and of the label space at input ports.

Label swapping reassigns the labels of conflicting pipes using the available label space at the next router. This allows complete utilization of the available label space. Figure 5.5 illustrates label swapping for conflicting pipes. Routing table 0 at the North port of Router 0 swaps the conflicting label 0 to label 1, which is available in Router 1. Similarly, routing table 1 at the South port of Router 0 swaps label 0 to 2, ensuring both pipes are set up to pass through the West port of Router 1.


Figure 5.5: Label conflict at R1 resolved using label swapping. il: Input Label, Dir: Direction, ol: Output Label.

5.6 Simulation and Functional Verification

Functional verification of the Verilog designs of the router and the networks was done using the Icarus Verilog Simulator[142]. Table 5.3 lists the simulation parameters used in the functional verification of the label switched router. Routing tables are populated in the initialization stage. The flow identification algorithm (Algorithm 1) is implemented in Perl, and the routing tables in the NoC are configured based on the algorithm. The label switched router has been implemented and tested on the following networks:

• Single Router Network: A single router connected to 4 nodes (4 sources and 4 sinks) was tested for 10^8 cycles with different directed traffic and random traffic permutations.

• 2D Mesh: A 64 node, 64 router 2-dimensional mesh with traffic from each source to random destinations was tested for 10^8 cycles. Appendix B presents more details on the test cases used for functional verification of the LS-NoC router.


Table 5.3: Simulation parameters used for functional verification of the label switched router design.

Network                 64 node, 8 × 8, 2D mesh (Figure 5.2)
Data Bus Width          256 bits
Label Width             4 bits
Input Buffer Depth      8
Simulation Time         10^8 cycles
Simulation Framework    Icarus Verilog [142]

Table 5.4: Synthesis parameters.

Process                 UMC 130nm, High Speed
Library                 Faraday
Process                 1.00
Temperature             25 °C
Voltage                 1.2V
Interconnect Model      Worst Case Tree
Metal Layers            8 (2 thick layers)

5.7 Synthesis Results

The logic and combinational circuit blocks of the LS Router were synthesized using Faraday's FSC0H_D 130nm library with high-performance and high-density generic core cells for the UMC 0.13µm eHS (FSG) process in the typical-NMOS, typical-PMOS case (Table 5.4). Timing and area results from synthesis are tabulated in Table 5.5. The router operates at 312.5MHz in 130nm technology. Taking into account the effects of scaling, the estimated frequency of operation of the router at 45nm is above 1.2GHz[80]. A link width of 256 bits was chosen to service a peak bandwidth per link of 300Gbits/s, comparable with GDDR5 bandwidth. The area and power of the memory elements inside the LS Router (input buffers and routing tables) were estimated using UMC's Memory Compiler.
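The peak link bandwidth figures quoted here follow directly from the data width and the clock: 256 bits per cycle at 312.5 MHz gives 80 Gbit/s, and at the scaled 1.2 GHz estimate the same 256-bit link carries roughly 300 Gbit/s. A one-line check:

# Peak link bandwidth = data width x clock frequency.
flit_bits = 256
print(flit_bits * 312.5e6 / 1e9)   # 80.0 Gbit/s at 130nm, 312.5 MHz
print(flit_bits * 1.2e9 / 1e9)     # 307.2 Gbit/s at the 45nm estimate (~300 Gbit/s)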


Table 5.5: Synthesis results of the Router and Mesh networks. The area of a Router is 0.431 mm².

Router details: 5 ports, 256 bit data width, 4 bit label width.

Synthesis results, LS-NoC Router:
                     Buffers + Routing Table     Combinational Logic
                     (Storage Elements)
Area (mm²)           0.077                       0.354
Power                11.01 mW                    32.07 mW
Total Area (mm²)     0.431
Total Power          43.08 mW
Max Frequency        312.5MHz
Bandwidth/link       80 Gbps

Synthesis of the functionally verified Verilog HDL design of the router was performed in Synopsys Design Compiler. The switching activity, the timing and design constraints and the synthesized netlist are input to Cadence SOC Encounter for place and route, from which the area is obtained. The placed and routed netlist, along with the extracted parasitics file (SPEF), was used to obtain the power of a router with no buffers. Area and power measurements of the memory components for the buffers were done using UMC's Memory Compiler tool. The total power consumption is estimated to be 43.08mW at 312.5MHz, 1.2V. Section B.3 of Appendix B provides more details on the steps involved in synthesis and place and route of the LS-NoC router.

The area of a processing engine cluster from [8] was used as a case study to identify wire lengths in the mesh and the feasibility of single cycle operation. The area of a PEC was estimated to be 2.538 mm², giving a side length of 1.593 mm. Intacte[82] was used to estimate the maximum frequency of operation of a 1.6mm link in 130nm. It was found that the 1.6mm link operates with single cycle latency at 312.5MHz.


5.8 Conclusion

We have presented an LS-NoC which services the QoS demands of streaming applications with the help of a centralized NoC Manager that performs traffic engineering. LS-NoC for streaming applications guarantees deterministic path latencies, satisfies bandwidth requirements and delivers constant throughput. Delay and throughput guaranteed paths (pipes) are established between sources and destinations along contention free, bandwidth provisioned routes. Pipes are identified by labels unique to each source node. Labels need fewer bits than node identification numbers – potentially decreasing memory usage in routing tables.

The concept of LS-NoC was presented. The Label Switching router has been verified, synthesized, placed and routed and timing analyzed. The Label Switched (LS) Router used in LS-NoC achieves two cycle traversal delay during no contention and is multicast and broadcast capable. A 5 port, 256 bit data bus, 4 bit label, 1 bit flow control, 8 buffers per input port router occupies an area of 0.431 mm² in the 130nm Faraday library in the typical corner and operates at 312.5MHz. The LS Router is estimated to consume 43.08 mW. The following two chapters illustrate the LS-NoC management framework and the LS-NoC results respectively.


Chapter 6

LS-NoC Management

Streaming applications require hard bandwidth and throughput guarantees in a multiprocessor environment amidst resource-competing processes. In this chapter we present the management framework of the Label Switching based Network-on-Chip (LS-NoC), motivated by the throughput guarantees offered by bandwidth reservation. A centralized LS-NoC Management framework engineers traffic into QoS guaranteed routes. LS-NoC caters to the requirements of streaming applications where communication channels are fixed over the lifetime of the application. The proposed NoC framework inherently supports heterogeneous and ad-hoc SoC designs. The LS-NoC can be used in conjunction with a conventional best effort NoC as a QoS guaranteed communication network, or as a replacement for the conventional NoC. A multicast, broadcast capable label switched router for the LS-NoC was presented in the previous chapter. The bandwidth and latency guarantees of LS-NoC are demonstrated in the next chapter on streaming applications like HiperLAN/2 and the Object Recognition Processor, Constant Bit Rate traffic patterns and video decoder traffic representing Variable Bit Rate traffic.

Organization of the Chapter

Section 6.1 outlines the LS-NoC management framework. The flow identification algorithm used to establish pipes is presented in Section 6.2. The chapter concludes in Section 6.6.


6.1 LS-NoC Management

The NoC Manager is the central entity responsible for identifying and configuring QoS guaranteed routes between communicating blocks. A two stage router with broadcast and multicast enabled, table based routing is used in the LS-NoC. Label based forwarding requires every route to be identified by a unique label set. Using labels instead of node ids decreases routing table sizes.

6.1.1 NoC Manager

The primary job of the NoC Manager is to identify QoS guaranteed communication paths (termed pipes) between communicating nodes and to update the routing tables along each pipe. The NoC Manager can be implemented as a software thread running on one of the cores (in a CMP environment) or as a separate hardware accelerator (in an SoC environment). A managed LS-NoC is useful in a non-homogeneous, unstructured SoC where communication channels may not be required between all blocks. The ability of the Manager to establish pipes dynamically, considering network traffic and the availability of contention free routes, allows for guaranteed bandwidth and deterministic latency pipes. The Manager can also be configured to monitor the status of the links in the NoC. Such a Manager would work as a fault tolerant route set up system in the LS-NoC. A study of fault management strategies for LS-NoC using the Manager is not dealt with in this work. The NoC Manager requires interfaces at the routers to update routing tables when new flows are set up in the NoC. In SoCs where applications are fixed during operation, this incurs a one time set up latency at the initialization phase of the application.

6.1.2 Traffic Engineering in LS-NoC

The NoC Manager has complete visibility of the current state of the pipes in the NoC. This empowers the Manager to set up a new flow along a contention free, QoS guaranteed pipe. The traffic engineering capabilities of LS-NoC enable flit forwarding through non-shortest, non-congested paths, resulting in deterministic flit traversal latencies. The traffic engineering abilities of the NoC Manager depend on the pipe establishment algorithms (Section 6.2) and are independent of the network.


This enables support for custom, application-specific SoCs that contain non-homogeneous or ad-hoc NoCs. The pipe set up algorithms use a graph representation of the NoC. Faulty or occupied links have nil communication capacity; for this reason fault tolerance is built into LS-NoC. Fault tolerance in LS-NoC is the subject of Section 6.3. A centralized NoC Manager is adequate in most CMPs running streaming applications, as there are a fixed number of communicating entities and the SoC is not a dynamically growing system. In a standard SoC, loss of scalability may not be a serious concern.

6.2 Flow Based Pipe Identification

We propose to use a centralized flow manager, called the LS-NoC Manager, which helps manage pipe set up and tear down. The LS-NoC Manager takes into account the bandwidth reserved in the links used by existing pipes while establishing a new pipe. The routing tables in the routers along the pipe are configured by the LS-NoC Manager. Flow identification algorithms[66][67] are used to identify available pipes between communicating entities. The flow algorithm used by the NoC Manager to calculate pipes is based on the Ford-Fulkerson algorithm[67]. The Ford-Fulkerson algorithm was chosen for its ease of implementation and its ability to converge in polynomial time. The computation latency of the pipe establishment algorithm is analyzed in Section 6.4.1.

The maximum capacity of a link is the measure of the peak bandwidth the link can support. Reserving the required capacity during the establishment of a new pipe ensures that adequate bandwidth for the pipe is reserved. Once a link has exhausted its capacity, no new pipes can be set up through it. Setting up an end-to-end flow between two communicating nodes by reserving the required bandwidth from the link capacities ensures a contention free, QoS guaranteed route. The NoC Manager pseudo-code to identify a QoS guaranteeing pipe is presented in Algorithm 1.

One of the inputs to the algorithm is the flow graph of the NoC.


Algorithm 1 Identify Pipe. P: Pipe Stack.
1: Identify Pipe
Require: Input Network: {E_ij} = {..., {n_i, n_j, u_ij, v_ij, L_ij^used, L_ij^unused}, ...}, Source s, Destination d, Required Capacity: c
2: Residual Graph, RG = {E_ji} = {..., {n_j, n_i, v_ij}, ...}
3: EdgesCount: k ← 0
4: if s == d then
5:   push s onto {P}
6:   return true
7: else
8:   for all edges starting from d do
9:     if v_dj > c then
10:      if call Identify Pipe(RG, j, s, c) == true then
11:        push j onto {P}
12:        return true
13:      else
14:        pop from {P}
15:      end if
16:    end if
17:  end for
18: end if
19: return false
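For readers who prefer an executable form, the following is an illustrative Python rendering of Algorithm 1: a depth-first search over the residual graph that only follows edges with spare capacity of at least c, returning the identified pipe as a list of nodes. The thesis implementation is in Perl and additionally tracks per-edge label lists; the dictionary-based graph representation and the explicit visited set below are simplifying assumptions.

# Illustrative rendering of Algorithm 1 (the thesis implementation is in Perl).
# residual[node] is a list of (next_node, available_capacity) pairs; the search
# walks from the destination d back towards the source s, as in the pseudocode.
def identify_pipe(residual, s, d, c, visited=None):
    """Return a pipe [s, ..., d] whose edges all have spare capacity >= c, else None."""
    visited = {d} if visited is None else visited | {d}
    if s == d:                                   # lines 4-6: source reached
        return [s]
    for j, capacity in residual.get(d, []):      # lines 8-17: depth-first search
        if capacity >= c and j not in visited:
            sub = identify_pipe(residual, s, j, c, visited)
            if sub is not None:                  # lines 10-12: extend the pipe
                return sub + [d]
    return None                                  # line 19: no route with enough capacity

Once a pipe is found, the Manager would update the used and available capacities and assign labels (with swapping where needed) along it, as described next.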


An edge, E_ij, of the flow graph is represented as:

E_ij = {n_i, n_j, u_ij, v_ij, L_ij^used, L_ij^unused}     (6.1)

where edge E_ij connects node n_i to node n_j. Nodes can be traffic sources, traffic sinks, router input ports or router output ports. u_ij and v_ij are the utilized and available flow capacities of the edge respectively. L_ij^used is the list of labels used by pipes through E_ij. L_ij^unused is the list of labels available for assignment to future pipes. During the initialization stage, u_ij and L_ij^used are empty; v_ij holds the maximum capacity E_ij can support, and L_ij^unused is the list of all labels available for pipes through E_ij. Edges ending at the input ports of routers support a maximum capacity equal to the maximum number of pipes supported by the edge (2^lw = 16 in this work); these are the bottleneck edges. Edges not ending at input ports of routers are assigned infinite capacity and do not affect the final pipe route. The other inputs to the algorithm are the source and destination nodes, s and d, between which a pipe of requested capacity, c, has to be established.

The first step in the algorithm is to build a residual flow graph from the input flow graph. The residual flow network contains the same number of edges as the flow graph, with directions reversed: all edges E_ij change to E_ji in the residual graph. The residual graph stores only the available capacities of every link. The residual graph is used to identify a flow that satisfies the requested capacity, starting from the desired destination, d, and traversing the graph depth-first till the source, s, is found (Lines 8–17).

Starting from the destination node d, edges satisfying the requested capacity are searched for in the residual network. If an edge satisfies the requested capacity, it is stored in the pipe stack ({P}). The next edge is searched for from the terminating node of the previous edge (Line 10). If a node is reached from which no edge satisfies the requested capacity, the edge is popped out of the pipe stack (Line 14) and the search is backtracked. If the source is reached through edges servicing the requested capacity, the final edge is pushed onto the pipe stack and the flow has been identified (Lines 4–6).

After the pipe has been identified, the pipe stack {P} is used to update the used (L_ij^used) and available (L_ij^unused) flow capacities in the flow graph. During the configuration of the routing tables, label swapping is performed in intermediate routers wherever necessary.


An available label from the first router input node in {P} is used. Along the route through {P}, the L_ij^used list of each edge is checked for conflicts. If a conflict occurs, an unused label from the pipe is used. The routing table data structure at a node, n_i, can be represented as:

n_i, l_old → n_j, l_new     (6.2)

where l_old is the label of the pipe on the edge ending at n_i and l_new is the label of the pipe on the edge E_ij. The procedure is repeated for every node along the pipe. This data structure is used by the NoC Manager to update the routing tables along {P}. An example is presented in Appendix C.

6.3 Fault Tolerance in LS-NoC

Fault tolerance is built into LS-NoC. The steps taken by the LS-NoC Manager after a link is discovered to be faulty are listed below.

• After a link is recognized to be faulty, the LS-NoC Manager updates the network graph to reflect the health of the link.

• The link's capacity is updated to 0.

• Existing pipes through the link are invalidated. The pipes are re-identified and the routing tables are updated.

• Network graphs are updated as usual after the pipes are configured.

6.4 Overhead of NoC Manager

A detailed analysis of the amount of time spent in identifying pipes and updating routing tables in LS-NoC is presented in this section. The overhead of the NoC Manager comprises two components: computation and configuration. The computational overhead involves identifying a pipe using the flow based algorithm (Algorithm 1); the configuration overhead includes transmitting the routing table configuration over the network and updating the routing tables (Table 6.1).


Computational overhead involves identifying a pipe using the flow-based algorithm (Algorithm 1). Configuration overhead includes transmitting the routing table configuration over the network and updating routing tables (Table 6.1).

Table 6.1: NoC Manager Overhead.

              T_comp          T_conf      T_overhead
Single Pipe   35157 cycles    17 cycles   35174 cycles (35.2 µs)

6.4.1 Computational Latency

One of the issues with probe-based circuit establishment in prior works[41][19][55] is the unpredictability of circuit setup time. In this section we present an upper bound on pipe establishment time using the flow algorithm implemented in the NoC Manager. The NoC Manager can be implemented as a software or a hardware entity residing on the CMP or SoC. The LS router is equipped with an interface to read from/write into a single-port routing table memory structure (Figure 5.4). We assume that a NoC Manager process resides in the first node, as shown in Figure 5.2. The flow algorithm presented in Algorithm 1 is used to establish pipes between communicating nodes. The flow algorithm has complexity O(E·f), where E is the total number of edges in the network graph and f is the number of flows to be identified. The total number of edges, E, in the graph representation of a 2D mesh of degree d grows as 2 × d². The total number of flows to be established depends on the applications and the number of process nodes. The computational overhead of establishing a single pipe in an 8×8 LS-NoC is presented in Table 6.1. The algorithm was executed on a Cortex A8 processor (ARM v7 architecture, operating frequency = 1 GHz). 35157 cycles were spent on the identification of a single pipe. Identification of a single pipe involves building the residual network from the flow network graph, identifying a bandwidth-satisfying pipe between source and destination, and updating the current state of the network (Algorithm 1). As the number of steps involved in a pipe identification is the same in every case, the time taken to identify p pipes is p × (time taken to identify one pipe).
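As a quick worked figure: at the 1 GHz clock of the Cortex A8, 35157 cycles correspond to about 35.2 µs per pipe, so identifying, for instance, 20 pipes (an assumed pipe count, purely for illustration) would take roughly 20 × 35157 ≈ 7 × 10^5 cycles, i.e. about 0.7 ms, which remains small against the lifetime of a streaming application.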


6.4.2 Configuration Latency

The worst-case configuration overhead to update the routing table in the bottom-right corner of the LS-NoC is 17 cycles. In the case of the maximum pipe setup, where all ports of all routers in the LS-NoC have to be updated by the NoC Manager, T_conf = 5372 cycles. The time to configure in the maximum-pipes case is derived as follows:

T_conf = T_network + T_rt cycles    (6.3)

where T_network is the network latency to transmit routes over the network and T_rt is the time to update the routing tables in a router. In the worst case, where all routers have to be updated, and assuming a regular 2D mesh,

T_network = (deg + 1) × {deg × (deg − 1) / 2} cycles    (6.4)

T_rt = Size_rt × deg² cycles    (6.5)

where deg is the degree of the regular 2D mesh and Size_rt is the size of the routing table (the number of writes required to fill the routing table). In the current work, deg = 8 and Size_rt = 80 (16 per port × 5 ports), giving T_conf = 5372 cycles.

Streaming applications mapped onto a generic multi-core CMP will have a few communicating nodes and far fewer pipes to be set up between communicating entities. Configuration of a pipe is done by the Manager as pipes need to be established. Given that streaming processes run over a large time frame, the NoC Manager overhead is acceptable in most applications.
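Substituting the values used in this work into Equations 6.4 and 6.5 makes the 5372-cycle figure explicit: T_network = (8 + 1) × (8 × 7 / 2) = 9 × 28 = 252 cycles, T_rt = 80 × 8² = 5120 cycles, and hence T_conf = 252 + 5120 = 5372 cycles.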


6.4.3 Scalability of LS-NoC

The size of the network considered in our studies is an 8×8 chip multiprocessor. This is large enough to cover most large chip multiprocessors of today. Given the polynomial completion time (complexity: O(E·f)) of the pipe establishment algorithm, we envision that the time required for pipe establishment will not be prohibitive with respect to the application's lifetime. An example video surveillance streaming application is depicted in Figure 6.1. Such an application is on and works continuously for days. The implemented pipe establishment algorithm spends less than half a second to establish a pipe, and this delay is negligible considering the lifetime of the application. A centralized NoC Manager is adequate in most CMPs running streaming applications, as there is a fixed number of communicating entities and the SoC is not a dynamically growing system. In a standard SoC, loss of scalability may not be a serious concern.

Figure 6.1: Surveillance system (camera, transmission and a video computation server performing decode, low-level image segmentation and object recognition) showing the application of LS-NoC in the video computation server.

6.5 Number of Pipes in an NoC

The size of the routing table at each input port of the LS Router is 2^lw entries, where lw is the label width. Each source node can have a maximum of 2^lw unique pipes. Labels are unique per source. The total number of routing table entries, rt, in a p-port router is rt = p × 2^lw. In an r-router network, if all the routing table entries are full (the entire label space is utilized), the maximum number of pipes that can be set up is rt × r. In reality, the maximum depends on factors such as the network topology, communication pattern, bandwidth and latency guarantees, the current traffic scenario, the number of existing entries in routing tables (which depends on already established pipes) and the algorithms used to set up new pipes.
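With the parameters used in this work (lw = 4 and p = 5 ports), each router holds rt = 5 × 2^4 = 80 routing table entries, the same Size_rt = 80 used in Section 6.4.2, and the 64 routers of the 8×8 LS-NoC give an upper bound of rt × r = 80 × 64 = 5120 pipes before topology and traffic constraints reduce the practically achievable number.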


In the current work, multi-source, multi-sink max-flow algorithms[66][67] are used to identify (an upper bound on) the maximum possible number of pipes in the network. An example linear network with 2 routers (Figure 6.2(a)) is used to illustrate the maximum-pipes calculation. A communicating element is represented by a pair of source (S) and sink (K) nodes. The capacities of edges are a measure of the number of buffers at the router input port at the end of the link.

With no constraints applied, the maximum number of pipes, P_max, in an r-router linear network is

P_max = 2^lw × Σ_{i=0}^{r−1} C_i

where C_i is the number of communicating entities attached to router i. The maximum is reached when all pipes are set up to terminate in the originating router. For example, in R_0 (Figure 6.2(a)), all the 2^lw pipes originating from source 0 end in source 1, all pipes from source 1 end in source 2, and so on.

With constraints applied, the network graph can be modified and the maximum number of pipes can be obtained using the max-flow algorithm. Consider the constraint that a source can set up at most one pipe to a sink. Given that the source node connects to the router node with an edge of capacity 2^lw, the source node can be divided into 2^lw unique source nodes connecting to 2^lw router nodes with edge capacity = 1 (Figure 6.2(b)). The minimum and maximum number of pipes that can be established in the LS-NoC are shown in Section 6.5.1.

6.5.1 Minimum, Maximum and Typical Pipes in a Network

Figure 6.3(a) shows the number of pipes set up in an NoC with a label width of lw. The maximum number of pipes is obtained when all pipes are local to a router; pipes neither contend for nor share links in this case. The minimum number of pipes occurs when pipes originating from R_0 terminate at the farthest router R_(r−1) and no local pipes are allowed. Considering pipes in both to and fro directions, the minimum number of pipes equals 2 × 2^lw. For the 2-router network shown in Figure 6.2(a), the maximum number of pipes is 2 × 3 × 2^lw (2 routers × number of ports with sources × 2^lw).
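Working the same 2-router example through the expression for P_max: with three communicating entities per router (C_0 = C_1 = 3) and lw = 3, P_max = 2^3 × (3 + 3) = 48 pipes, the same count as the 2 × 3 × 2^3 figure above, while the corresponding minimum is 2 × 2^3 = 16 pipes.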


Figure 6.2: (a) A 2-router, 6-communicating-node linear network. (b) Multiple-source, multiple-sink flow calculation in a network.

The constraint curve shows the total number of pipes possible when every source can connect to a sink with at most one pipe. Most streaming application nodes will communicate with fewer than all the nodes, and a single QoS-guaranteed communication pipe between two nodes suffices. The width of the label can be chosen based on the number of pipes required to be set up with the single-pipe-per-destination constraint applied.

Figure 6.3(b) shows the total number of pipes set up in a 2D mesh network for labels of sizes 2, 3 and 4 bits. The total number of sources (or sinks) in the mesh is equal to Degree². The graph shows pipes calculated with the constraint that sources connect to unique sinks. The Max 4 bits curve shows the practical limit on the number of pipes that can be established in the mesh (Degree² × 2^lw). The results are a measure of routing table utilization when pipes are set up from every source to unique destinations. The difference between the number of established pipes on the Max 4 bits curve and the 4 bits curve indicates the number of non-utilized routing table entries in the NoC routers. Routing table entries in intermediate routers along pipes may remain non-utilized after sources have exhausted the label space. Non-utilized routing table slots in the NoC steadily increase from 3.1% (Degree 4) to 6.4% (Degree 14) as routes get longer and more intermediate routers have unutilized slots in them.

Figure 6.3: (a) Number of pipes in a linear network (Fig. 6.2(a)), lw = 3 bits, varying constraints. Constraint 1: at most 1 pipe per sink. (b) Maximum number of pipes in a 2D mesh (Fig. 5.2).

The label width to use in the NoC is decided after identifying the maximum number of nodes each node communicates with. Consider the tree-based 3-router, 12-node NoC of the real-time object recognition processor in Figure 5.1.


Using the tree-based NoC topology used in that work, and assuming every node communicates with all other nodes, 12 pipes per source need to be set up, i.e. a label size of 4 bits. The H.264 decoders presented in [143] and [144] have 8 blocks, so a label width of 4 bits suffices there as well. The current work uses lw = 4 in all experiments.

If identification of the source at the destination is mandatory, the width of the label has to be at least equal to the number of bits used to uniquely identify a source node. This requires assigning the source-id in place of the label at the last hop, which is internally supported in the routing table through the label swap field (Section 5.5.2).

6.6 Conclusion

The proposed LS-NoC services the QoS demands of streaming applications using a traffic-engineering-capable NoC Manager. The centrally managed, bandwidth-provisioned, traffic-engineering-capable Manager utilizes flow identification algorithms to identify contention-free, bandwidth-provisioned paths. Network visibility enables the NoC Manager to configure bounded-latency pipes in homogeneous and heterogeneous networks alike.


The flow identification algorithm takes into account the bandwidth available in individual links to establish QoS-guaranteed pipes. The algorithm allows sharing of physical links between pipes without compromising QoS guarantees. The flow-based pipe establishment algorithm is topology independent, and hence the NoC Manager supports applications mapped to both regular chip multiprocessors (CMPs) and customized SoCs with non-conventional NoC topologies. Link status enables the NoC Manager to establish pipes in a fault-tolerant manner. Doing away with node identification numbers in favor of labels reduces memory usage in routing tables. Traffic engineering and pipe-based communication channels are direct descendants of the Multi-Protocol Label Switching[37] knowledge base. The overhead of setting up a pipe by a software LS-NoC Manager executing on a Cortex A8 processor is 35174 cycles (up to 35.2 µs).

The Label Switched Network-on-Chip (LS-NoC) framework was designed with the view that pipe establishment and tear-down are rare events. The pipe establishment overhead incurred by the NoC Manager should not overshadow the QoS advantages gained by the LS-NoC framework. Streaming applications, with their fixed communication channels, fit this model perfectly. In principle, the LS-NoC can be used in all types of Systems-on-Chip and Chip Multi-Processors where functional blocks need to communicate during operation. Quality-of-Service can be guaranteed only in systems where communication channels are fixed over the lifetime of the application.

Communication channels in streaming applications do not change during the lifetime of the system except during link failures or reconfiguration. Ideally, pipe establishment is a one-time operation in a non-reconfigurable system. If the system allows reconfiguration of functional blocks, then the pipe establishment and tear-down procedures have to be executed at every reconfiguration step. In such a system, if the performance hit due to a software NoC Manager is unacceptable, a hardware NoC Manager is justified.

A Label Switching router was presented in the previous chapter. Evaluation of LS-NoC over example streaming applications, CBR and VBR traffic is demonstrated in the next chapter.


Chapter 7

LS NoC - Case Studies

We presented a Label Switching based Network-on-Chip (LS-NoC), motivated by the throughput guarantees offered by bandwidth reservation, in the previous chapter. LS-NoC contains a centralized management framework to engineer traffic into QoS-guaranteed routes. A multicast and broadcast capable label switched router for the LS-NoC has also been proposed. In the current chapter, bandwidth and latency guarantees of LS-NoC are demonstrated on streaming applications like HiperLAN/2 and the Object Recognition Processor, Constant Bit Rate traffic patterns and video decoder traffic representing Variable Bit Rate traffic. LS-NoC has been found to have a competitive Area×Power/Throughput figure of merit compared with state-of-the-art NoCs providing QoS. Circuit switching with link sharing abilities and support for asynchronous operation make LS-NoC a desirable choice for QoS servicing in CMPs.

Organization of the Chapter

The previous chapters presented the LS-NoC architecture and concepts. The current chapter evaluates the QoS guarantees offered by LS-NoC using streaming application case studies. Section 7.1 uses processes from HiperLAN/2 baseband processing and the Object Recognition Processor SoC mapped onto a generic CMP as a framework to evaluate LS-NoC QoS services. Constant and variable bit rate video traffic is used to evaluate LS-NoC in Section 7.2. A discussion of the concepts of LS-NoC and a comparison with existing works is presented in Section 7.3. The chapter concludes in Section 7.4.


7.1 HiperLAN/2 baseband processing + Object Recognition Processor SoC

The HiperLAN/2 baseband processing SoC (Figure 5.1(a))[7] and the real-time object recognition processor (Figure 5.1(b))[8] (Section 5.1) were mapped onto an 8×8 LS-NoC (Figure 7.1(a)). Application pipes were set up as shown in Figure 7.1(a) and Table 7.1.

Table 7.1: Pipes set up for the HiperLAN/2 baseband processing SoC and the Object Recognition Processor SoC (Figure 7.1(a)). PEC[0-7] → PEC[0-7]: every PEC communicates with every other PEC.

HiperLAN/2 baseband processing SoC:
S2P → PR → FOC → FFT → POC → CE → DeMap, S2P → Sync, Sync → POC

Object Recognition Processor:
PEC[0-7] → PEC[0-7], MP → PEC[0-7], MP → VAE, MP → Ext. I/f, MP → MA, VAE → PEC[0-7], MA → MP

The behaviour of pipe latencies in an 8×8 LS-NoC with a 64-node CMP running streaming applications is presented in Figure 7.2. The x-axis records the injection rates of non-streaming application nodes. The LS-NoC was configured to support the 64-node CMP shown in Figure 7.1(a). LS-NoC establishes higher-capacity pipes between communicating nodes of streaming applications. Resources are provisioned in pipes based on bandwidth requirements. A pipe might occupy a major portion or all of the available capacity of a link (Lines 18-19 in Algorithm 1). The demand capacity of a provisioned pipe (C_req) may be tuned based on the bandwidth requirements of the pipe. This ensures contention-free pipe setup and guaranteed bandwidth for the pipe. A non-provisioned pipe will share link resources equally with other pipes, resulting in increased latencies as the injection rate increases. The latency curves of non-provisioned pipes (labeled 'U') clearly fail to guarantee QoS compared to provisioned pipes. Variation in the injection rates of traffic generated by non-application nodes does not affect the latency of provisioned pipes.


Figure 7.1: (a) Process blocks of the HiperLAN/2 baseband processing SoC and the object recognition processor mapped onto an 8×8 LS-NoC. Pipe 1: PEC0 → PEC6, Pipe 2: MP → PEC3. (b) Flows set up for CBR and VBR traffic.

From the graph, the average latency of flits traversing a provisioned pipe is almost constant over varying injection rates of other source nodes. The average latency of packets in the network does not change with injection rate, owing to the reservation and provisioning of LS-NoC resources for the duration of the application. An aggregate bandwidth of 120 Gbits/s at the maximum injection rate satisfies the communication requirements of both the HiperLAN/2 and Object Recognition Processor applications.

7.2 Video Streaming Applications

LS-NoC has been tested on both Constant Bit Rate (1.55 Mbps and 55 Mbps) and Variable Bit Rate traffic. Flows for CBR and VBR traffic were set up assuming worst-case spatial separation between producer and consumer nodes (Figure 7.1(b)). Results for the CBR and VBR experiments are shown in Figure 7.3. Videos used in H.264 standards evaluation were used for the VBR experiments and are tabulated in Table 7.2. It is observed that the latency of CBR and VBR traffic is unaffected by varying injection rates of non-video sources.


Figure 7.2: Latency of HiperLAN/2 and ORP pipes in LS-NoC over varying injection rates of non-streaming application nodes. Latencies of non-provisioned paths are labeled (U).

Table 7.2: Standard test videos used in experiments.

Sl. No.   Video          Frames per Second   Frames Simulated
1         Bridge Close   10                  200
2         Flower         10                  250
3         Foreman        10                  300
4         Hall           10                  300

All flows in the LS-NoC are provisioned such that CBR/VBR traffic experiences the least contention at the routers. CBR/VBR traffic flow is guaranteed throughput and has deterministic latency in the LS-NoC.


Figure 7.3: (a) Latency of CBR traffic (55 Mbps and 11 Mbps streams) over various injection rates of non-streaming nodes in LS-NoC. (b) Latency of VBR traffic (Videos 1-4) over various injection rates of non-streaming nodes in LS-NoC.


7.3 Discussion

7.3.1 Design Philosophy of LS-NoC

The design philosophy of LS-NoC retains the advantages of both packet switched and circuit switched networks. Network visibility enables the NoC Manager to set up pipes in a congestion-free and fault-tolerant manner in homogeneous and heterogeneous networks. The traffic-engineering-capable LS-NoC Manager allows identification and configuration of contention-less, non-shortest, QoS-guaranteed communication channels. The NoC Manager has complete visibility of the state of the network (existing pipes, utilization of links, required capacity of the requested connection, remaining capacity of the links). This enables non-shortest, bandwidth-provisioned pipe setup in LS-NoC. The flow identification algorithm in the NoC Manager identifies a route which satisfies the bandwidth requirements of the pipe to be established. The flow identification algorithm takes into account the bandwidth available in individual links to establish QoS-guaranteed pipes. This allows sharing of physical links between pipes without compromising QoS guarantees. LS-NoC takes a centralized, traffic-analysis-based approach through the NoC Manager. The LS Router implements minimal buffers to support flow control operations. LS-NoC provides throughput guarantees irrespective of the spatial separation of communicating entities.

LS-NoC sets up communication channels (pipes) between communicating nodes that are independent of existing pipes and contention-free at routers. Establishing pipes based on bandwidth demands, independent of existing pipes, allows non-interference of traffic between pipes, and end-to-end latency is guaranteed. Adding new pipes incrementally does not affect the guaranteed performance of already established pipes. LS-NoC uses flow-based algorithms[66][67] (Algorithm 1) to identify pipes with sufficient bandwidth between communicating nodes. The pipe establishment algorithm is independent of network topology and hence is a generic solution for homogeneous and heterogeneous NoCs. The latency of establishing pipes using the flow-based algorithm in LS-NoC (Algorithm 1) depends only on the size of the NoC (the number of edges in the communication graph). The flow-based algorithm identifies a bandwidth-satisfying route between nodes. For a given LS-NoC, a definite upper bound on the time to establish a path is known.


The computational overhead of the pipe establishment algorithm is documented in Section 6.4.1.

The NoC Manager has knowledge of the established pipes in the network. New pipes are added along non-interfering and contention-free routes, thus guaranteeing end-to-end latency. LS-NoC allows pipes to share a single physical link as long as the bandwidth requirements of both pipes are met. In cases where bandwidth demands are not met, alternative routes with adequate bandwidth are identified. Both Æthereal and Octagon use static shortest-path routing algorithms oblivious to the current state of NoC traffic. LS-NoC borrows traffic engineering and label switching concepts from MPLS. The NoC Manager is aware of the state of the NoC, and its flow identification algorithm identifies bandwidth-guaranteed non-shortest paths when a shortest path is unavailable for a pipe. Label switching reduces metadata transfer in the NoC and the buffer requirement at routers. Circuit switching requires minimal arbitration, has higher energy efficiency, and consumes less area and power than a corresponding packet switched router[7]. Router ports reserved for a circuit in a circuit switched router cannot be utilized by other connections. The flow identification algorithm in the LS-NoC Manager supports multiple pipes on a physical link as long as the bandwidth requirements of the pipes are met. This increases the utilization of network resources while guaranteeing bandwidth to pipes.

Source nodes in the LS-NoC can work asynchronously, as cycle-level scheduling is not required in the LS Router. Bi-synchronous or mesochronous input buffers in the LS router enable multiple clock domain operation without globally synchronous clocks.

7.3.2 LS-NoC Application

For streaming applications, the results (Figures 7.2, 7.3) show that the LS-NoC guarantees predictable latency and guaranteed throughput. LS-NoC is suitable for applications whose communication patterns do not vary over the lifetime of the application and which require guaranteed throughput.

In CMPs and complex SoCs, LS-NoC can be used as a separate NoC to service applications requiring hard QoS guarantees. Network interfaces at the ingress of the NoC can be configured to identify traffic belonging to QoS classes.


Based on the type of the traffic being injected into the communication medium, either the conventional best effort NoC or the LS-NoC can be chosen. Multiple NoCs servicing individual classes of data have been present in commercially available multi-core chips[145]. The concept is illustrated in Figure 7.4.

Figure 7.4: LS-NoC being used alongside a best effort (BE) NoC; processing elements P attach to both NoCs through network interfaces NI.

7.3.3 LS-NoC Evaluation

Table 7.3 presents a comparative illustration of the proposed LS-NoC with a few proposed QoS NoCs. The link width of 256 bits for the current work was chosen to service peak bandwidth comparable with GDDR5 bandwidth, and hence the throughput is higher than that of the other designs.

Table 7.3: Evaluation of the proposed Label Switched Router and NoC. CS: Circuit switched, PS: Packet switched.

Reference (Router Type)          | Tech. (L), Voltage | Area (A) (mm²)  | Router Power (P) (mW) | Throughput (T) (per link, Gbps) | Energy Efficiency (Tb/s/W) | FoM    | Supports Async Clock Domains | Share Link?
This work (LS), 8×8 Mesh         | 130nm, 1.2V        | 0.431           | 43.08                 | 80                              | 1.85                       | 9.6    | Yes                          | Yes
Nexus [39]                       | 130nm, 1.2V        | 1.75            | –                     | 48.75                           | –                          | –      | Yes                          | Yes
SoCBUS [19]                      | 180nm              | 0.06            | –                     | 16                              | –                          | –      | No                           | –
SDM [23]                         | 1.2V               | 0.135           | 1.790                 | 0.64 (20 MHz)                   | 0.3575                     | 29.72  | No                           | Yes
Æthereal (CS) [51]               | 130nm, 1.2V        | 0.26            | –                     | 16                              | –                          | –      | No                           | Yes
8×10 Mesh (CS) [41]              | 45nm, 1.1V         | 0.030 (approx.) | 21 – 74               | 11.78                           | 0.159 – 0.560              | 188.46 | No                           | No
Realtime ORP (PS) [8]            | 130nm, 1.2V        | 0.2 (approx.)   | 46                    | 12.8                            | 0.278                      | 29.8   | Yes                          | Yes
HiperLAN/2 Baseband SoC (CS) [7] | 130nm, 1.2V        | 0.05 (est.)     | 17.2                  | 17.2                            | 1.0                        | 2.8    | No                           | No
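As a consistency check on the first row of Table 7.3, the energy efficiency follows directly from the tabulated throughput and power: 80 Gbps / 43.08 mW ≈ 1.86 Tb/s/W, in line with the listed 1.85 Tb/s/W.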


Each port in the LS-NoC contains 8 input buffers. Every buffer is 256 bits wide, contributing to the high area and power. The area numbers of the works in [41] and [8] were estimated from total chip area. Custom design and low buffer usage in these routers have brought down the area significantly. Nexus[39] implements a 16-port, 36-bit asynchronous crossbar, resulting in high area consumption.

Voltage scaling enables Intel's NoC[41] to operate in the 21 mW - 74 mW range. The area and power consumption of the ORP router[8] are similar to the LS Router owing to similar buffer area. The 7-port router has 32-bit input buffers with up to 10 buffers per input. The low power and low area of the HiperLAN/2 SoC router is due to the absence of input buffers and an arbitration unit in the circuit switched router design. LS-NoC has the maximum throughput owing to the high link width. LS-NoC also shows high energy efficiency due to high throughput at a nominal frequency of 312.5 MHz in 130nm. The results show that LS-NoC is the best design in terms of bits transmitted per watt consumed.

The normalized Figure of Merit (Area × Power / Throughput) is a technology-independent comparison parameter. The Figure of Merit (FoM) of the 180nm and 130nm designs is scaled to 45nm by multiplying by the ratio of the cubes of channel lengths, as shown below:

FoM = (Area × Power / Throughput) × (L_45³ / L_Tech³)

where L_Tech is the channel length of the corresponding technology. With constant voltage and frequency, power varies linearly with technology owing to the effects of capacitance. Length and breadth vary linearly, contributing a squared relation with respect to technology. Hence the ratio of cubed channel lengths normalizes the FoM in the expression above. FoM is a measure of the resources spent per bit transmitted (hence, the lower the better). The buffer-less, low-area design of the HiperLAN/2 router contributes to the low FoM number for that design. Such a buffer-less circuit switched router cannot accommodate multiple connections through a physical link. Resource reservation by a single circuit for the lifetime of the application may result in inefficient network utilization in such networks. LS-NoC fares fairly well owing to its high throughput, though the area cost is high.
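Working the scaling through for two of the table entries makes the FoM values concrete: for this work, (0.431 × 43.08 / 80) × (45/130)³ ≈ 0.0096, and for the ORP router, (0.2 × 46 / 12.8) × (45/130)³ ≈ 0.0298, consistent with the tabulated FoM figures of 9.6 and 29.8, which therefore appear to be reported in units of 10⁻³ mm²·mW/Gbps.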


The high power consumption of 74 mW in 45nm contributes to the maximum FoM number for Intel's circuit switched router. Globally synchronous designs demand power-hungry single clock distribution over the chip. Bi-synchronous or mesochronous FIFOs in the LS Router enable the network to operate with out-of-phase clocks. This saves the power spent in global clock distribution. One of the major features of packet switched networks borrowed into LS-NoC is link sharing by pipes without bandwidth compromise. Physical link sharing by multiple pipes results in higher network utilization than purely circuit switched techniques. From the tabulated results, the LS router has high energy efficiency and provides hard QoS guarantees at a reasonable power budget.

7.4 Conclusion

The proposed LS-NoC services the QoS demands of streaming applications using a traffic-engineering-capable NoC Manager. The centrally managed, bandwidth-provisioned, traffic-engineering-capable Manager utilizes flow identification algorithms to identify contention-free, bandwidth-provisioned paths. Network visibility enables the NoC Manager to configure bounded-latency pipes in homogeneous and heterogeneous networks alike. The flow identification algorithm takes into account the bandwidth available in individual links to establish QoS-guaranteed pipes. The algorithm allows sharing of physical links between pipes without compromising QoS guarantees. Link status enables the NoC Manager to establish pipes in a fault-tolerant manner. Doing away with node identification numbers in favor of labels reduces memory usage in routing tables. Traffic engineering and pipe-based communication channels are direct descendants of the Multi-Protocol Label Switching[37] knowledge base.

A Label Switching router has been designed, verified, synthesized, placed and routed, and timing analyzed. The Label Switched (LS) Router used in LS-NoC achieves a two-cycle traversal delay under no contention and is multicast and broadcast capable. A 5-port router with a 256-bit data bus, 4-bit label, 1-bit flow control and 8 buffers per input port occupies an area of 0.431 mm² in the 130nm Faraday library at the typical corner and operates at 312.5 MHz. The LS Router is estimated to consume 43.08 mW. LS-NoC has been evaluated over example streaming applications, CBR and VBR traffic, and QoS guarantees have been demonstrated.


The overhead of setting up a pipe by a software LS-NoC Manager executing on a Cortex A8 processor is 35174 cycles (up to 35.2 µs).


Chapter 8

Conclusion and Future Work

The work in this thesis presents methodologies for QoS-guaranteed NoC design, link microarchitecture exploration and optimal Chip Multiprocessor (CMP) tile configurations.

8.1 Link Microarchitecture Exploration

NoC design specifications can be met by varying a large number of system and circuit parameters. An SoC can be better optimized if low-level link parameters and architectural parameters such as pipelining, link width, wire pitch, supply voltage, operating frequency, router type, topology of the interconnection network, etc. are considered. A simulation framework developed in SystemC, able to explore NoC design through all the aforementioned parameters, was presented. The framework also allows co-simulation with models of the communicating entities along with the ICN.

The study presented in Section 3.3 on a 4x4 multi-core ICN for Mesh, Torus and Folded-torus topologies and the communication patterns of Dense Linear Algebra (DLA) and Sparse Linear Algebra (SLA) benchmarks indicates that there is an optimum degree of pipelining of the links which minimizes the average communication latency. There is also an optimum degree of pipelining which minimizes the energy-delay product. Such an optimum exists because increasing pipelining allows for shorter wire segments, which can be operated either faster or with lower power at the same speed.


We also find that the overall performance of the ICNs is determined by the lengths of the links needed to support the communication patterns. Thus the mesh seems to perform the best amongst the three topologies considered in this case study.

Another study (Section 3.4) uses 3 example topologies - a 16-node 2D Torus, a Tree network and a Reduced 2D Torus - to show the variation of latency, throughput and NoC power consumption over link pipelining configurations with voltage and frequency scaling. We find that, contrary to intuition, increasing pipeline depth can help reduce latency in absolute time units by allowing shorter links and hence a higher frequency of operation. In a 2D Torus, the least latency (1.56 times the minimum) is achieved when the longest link is pipelined by 4 stages, at which point power (40% of the maximum) and throughput (64% of the maximum) are nominal. Using frequency scaling experiments, power variations of up to 40%, 26.6% and 24% can be seen in the 2D Torus, Reduced 2D Torus and Tree-based NoC between various pipeline configurations achieving the same frequency at constant voltages. Also, in some cases we find that switching to a higher pipelining configuration can actually help reduce power, as the links can be designed with smaller repeaters. Larger NoC power savings can be achieved by voltage scaling along with frequency scaling. Hence it is important to include the link microarchitecture parameters as well as circuit parameters like supply voltage during the architecture design exploration of a NoC.

The studies also point to an overall optimization problem that exists in the architecture of the individual PEs versus the overall SoC, since smaller PEs lead to shorter links between PEs but more traffic, thus pointing to the existence of a sweet spot in terms of PE size.

8.2 Optimal CMP Tile Configuration

The effects of communication overheads on the performance, power and energy of a multiprocessor chip, using L1 and L2 cache sizes as the primary exploration parameters and using accurate interconnect, processor, on-chip and off-chip memory modeling, were studied in Chapter 4. Sapphire, a detailed multiprocessor execution environment integrating SESC, Ruby and DRAMSim, was used to run applications from the Splash2 benchmark suite (64K-point FFT).


A 4×4 2D Mesh formation of 16 out-of-order cores, each consisting of a private L1 and a unified L2 cache, was used as the case study in the exploration. Detailed low-level wire delay models from Intacte were used to calculate the power consumed by the interconnection network. The results indicate that ignoring link latencies can lead to large errors in estimates of program completion times, of up to 17%. Furthermore, architectural choices based on inaccurate interconnect estimates are not optimal, and the error is severe in cases where applications have heavy communication requirements.

Communication delays in wires adversely affect performance in a CMP due to the larger time spent in message transit. Small L1 caches (more misses) result in more off-tile accesses. Large L2 cache sizes increase individual tile area and result in longer interconnects. Performance-optimal configurations are achieved at smaller L1 caches and moderate L2 cache sizes due to higher operating frequencies, smaller link lengths and comparatively less communication. Using the minimal L1 cache size to operate at the highest frequency may not always be the performance-power optimal choice. Larger L1 sizes, despite a drop in frequency, offer an energy advantage due to less miss-induced communication.

Experimental results show that custom floorplanning after communication pattern analysis between tiles helps reduce power and increase performance in a chip multiprocessor. Clustered tile placement experiments for FFT (L1: 256KB and L2: 512KB) show a considerable performance-per-watt improvement (1.2%). Remapping the L2 banks most accessed by a process to the same core or neighbouring cores after communication traffic analysis offers power and performance advantages. Remapped processes and banks in clustered tile placement show a performance-per-watt improvement of 5.25% and an energy reduction of 2.53%. Non-conventional tile placements that place frequently communicating cores nearby can bring down communication latency and power significantly. Remapping threads and frequently accessed L2 banks closer in a conventional 2D Mesh shows a performance-per-watt improvement of 1.6%.

EPI and program completion time results indicate that minimum-energy cache configurations are not the same as minimum-execution-time configurations. This suggests that processors could execute a program in multiple modes, for example minimum energy or maximum performance.


A minimum-energy mode can be achieved using the dynamic voltage scaling facilities available in processors for each L1/L2 cache configuration. Adaptively changing L1/L2 sizes can be one of the ways to achieve maximum performance.

Level 1 and Level 2 cache sizes are important parameters in the design of a multicore chip. These sizes directly affect the tile area and have a huge bearing on the final cost of the chip. An architect faced with such an exploration problem also has to take into account the lengths, area and power consumed by interconnects between cores and the physical design of the chip. The power spent in data fetches from off-chip DRAM is also an important component of the overall power of the system. The estimated power of the system increases with the amount of off-chip communication in the CMP once DRAM power is accounted for. Analysis of such parameters requires the use of delay, area and power models of physical interconnects. Realistic traffic has to be obtained by executing appropriate benchmarks on detailed processor simulation environments. The Sapphire + Intacte + DRAMSim framework can be applied to architectural problems where the physical aspects of wires, placement of tiles and floorplans affect the performance and power consumption of a multicore chip.

8.3 Label Switched NoC for Streaming Applications

Streaming applications have deterministic communication patterns due to their pipelined nature of operation. Traffic engineering in LS-NoC guarantees QoS and delivers constant throughput in such applications. Existing priority-based QoS mechanisms for packet switched NoCs are ineffective when traffic in a single priority class increases. Resource reservation mechanisms in existing circuit switched networks suffer from inefficient network usage, non-deterministic circuit establishment times and routes oblivious to the current network state.

The proposed LS-NoC services the QoS demands of streaming applications using a traffic-engineering-capable NoC Manager. The centrally managed, bandwidth-provisioned, traffic-engineering-capable Manager utilizes flow identification algorithms to identify contention-free, bandwidth-provisioned paths.


Network visibility enables the NoC Manager to configure bounded-latency pipes in homogeneous and heterogeneous networks alike. The flow identification algorithm takes into account the bandwidth available in individual links to establish QoS-guaranteed pipes. The algorithm allows sharing of physical links between pipes without compromising QoS guarantees. Link status enables the NoC Manager to establish pipes in a fault-tolerant manner. Pipes are identified by labels unique to each source node. Doing away with node identification numbers in favor of labels reduces memory usage in routing tables. Traffic engineering and pipe-based communication channels are direct descendants of the Multi-Protocol Label Switching[37] knowledge base.

A Label Switching router has been designed, verified, synthesized, placed and routed, and timing analyzed. The Label Switched (LS) Router used in LS-NoC achieves a two-cycle traversal delay under no contention and is multicast and broadcast capable. Source nodes in the LS-NoC can work asynchronously, as cycle-level scheduling is not required in the LS Router. Bi-synchronous or mesochronous input buffers in the LS router enable multiple clock domain operation without globally synchronous clocks. This reduces power and enables flexible design of the clock tree. A 5-port router with a 256-bit data bus, 4-bit label, 1-bit flow control and 8 buffers per input port occupies an area of 0.431 mm² in the 130nm Faraday library at the typical corner and operates at 312.5 MHz. The LS Router is estimated to consume 43.08 mW. LS-NoC has been evaluated over example streaming applications, CBR and VBR traffic, and QoS guarantees have been demonstrated. The overhead of setting up a pipe by a software LS-NoC Manager executing on a Cortex A8 processor is 35174 cycles (up to 35.2 µs). The time required to identify multiple pipes is the product of the number of pipes and the individual pipe identification time. Given that streaming processes run over a large time frame and pipe configuration is a one-time process, this overhead is acceptable in most applications.

We envision the use of LS-NoC in general purpose CMPs where applications demand deterministic latencies and hard bandwidth requirements. LS-NoC can be used as a separate layer catering to applications requiring hard QoS guarantees. Based on the type of traffic being injected into the communication medium, either the conventional best effort NoC or the LS-NoC can be chosen.


8.4 Future Work

QoS-guaranteed NoCs are a wide research area with several directions for research, as future CMPs will impose increasingly demanding QoS requirements on NoCs. The label switched NoC is a step towards building a complete NoC solution for granting QoS to processes in CMPs and SoCs. We envision a multilayer NoC providing varying levels of QoS guarantees; an example 2-layer NoC was shown in Figure 7.4 of Chapter 7. The design of network interfaces for a multilayer NoC, the coding of required QoS levels inside packet headers, solutions to classify traffic into various QoS classes, network interfaces to feed NoCs with the relevant traffic, and load balancing in NoCs to maintain fairness and QoS guarantees are some of the interesting future directions that can be pursued after the LS-NoC design.


Appendix A

Interface and Outputs of the SystemC Framework

The SystemC framework was developed as a design space exploration tool for exploring communication infrastructure parameters: topology, routing policy, link length, wire width, pitch, pipelining, supply voltage and frequency. The overall architecture and the working of the tool are presented in Figures 3.1 and 3.2.

The framework helps chart the impact of various architectural and microarchitectural level parameters of the on-chip interconnection network elements on its power and performance. The framework also supports a flexible traffic generation and communication model.

In terms of the user interface, a configuration file is populated and read by the framework. The list of parameters supplied through the configuration file to the SystemC framework was presented in Table 3.1. Table A.1 presents a few of the configuration parameters with their default values.


Table A.1: ICN exploration framework parameters and their default values.

Parameter                     Default Value

NoC Parameters
Topology                      2D Mesh
Width of the Bus (in bits)    32
Phit Size                     Width of the Wire
Flit Size                     2
Mesh Rows                     4
Mesh Columns                  4
Routing                       Table Based
Switching Policy              Wormhole Switching

Traffic Parameters
Injection Rate                0.02
Traffic Type                  One-Way/Request Response
Traffic Pattern               DLA/SLA
Localization Factor           0.4

Individual Link Parameters
Length of the link            1mm
Wire Pitch                    Intacte specified
Pipelining                    4
Supply Voltage                1.1V
Frequency                     Obtained from Intacte


The user interface was otherwise terminal based, with a compiler script made available. The make script is reproduced here for reference.

# use gmake
TARGET_ARCH = linux

CC = g++

opt: OPT = -O
debug: OPT = -g

FLAGS = -Wall -Wno-deprecated

SOURCES = \
	Consumer.o \
	Producer.5.TBR.o \
	Router.Mesh.o \
	packet_format.mesh.tbr.o \
	top.mesh.o

PROGRAM = icn.mesh

INC = -I../ -I../parser -I.
INC += -I$(SYSTEMC)/include

.cpp.o:
	$(CC) $(OPT) $(FLAGS) $(INC) -c $<


clean:
	- rm *.o

An output file records the activity and coupling factors for the 5 outgoing links of each router. A snapshot of such an output file is shown below. The activity and coupling factors in the third column are all zero because this output file corresponds to the router in the second row, fourth column of a 4×4 2D Mesh.

INPUT ACTIVITY   0.0251406 0.0162656 0 0.0236562 0.0175156
OUTPUT ACTIVITY  0.0236406 0.0195625 0 0.0217344 0.0191406
INPUT COUPLING   0.0278281 0.0206875 0 0.02875 0.02025
OUTPUT COUPLING  0.0309375 0.0224219 0 0.0262187 0.0212344


Appendix B

Testing & Validation of LS-NoC

The design and implementation of a Label Switched NoC router were presented in Chapter 5. This appendix presents validation and testing details.

B.1 Implementation of the LS-NoC Router

The router design was shown in Figure 5.4 of Chapter 5. The router was implemented in Verilog. The router modules and their interactions are illustrated in Figure B.1. Each port of the router has its own FIFO and routing table blocks. Individual arbiters exist per output port. The input port bus contains data, label, valid and flow control bits. All input port buses are connected to the test bench, router_tb.

B.2 Testing and Validation of the LS-NoC Router

B.2.1 Individual Router

The router was rigorously tested in both individual and mesh environments. A 4-port LS-NoC router was connected to 4 sources and sinks as shown in Figure B.2. The first tests ensured proper transit of individual packets from all sources to all destinations. All metadata, including flow control signals, were in place during all tests. Stress tests done in this setting include:


Figure B.1: Modules in the LS-NoC router design (router, fifo_block, routing_table and router_arbiter), shown along with the testbench (router_tb); implemented in Verilog. Port buses carry data, label, valid and pause signals.

Figure B.2: Test cases used to verify an individual LS-NoC router.


• Individual sources sending data to a single destination in every cycle (Figure B.2(a)).

• Individual sources sending out data serially to all destinations in every clock cycle (Figure B.2(b)).

• Individual sources sending out data to destinations chosen at random every cycle (Figure B.2(c)).

B.2.2 Router in an 8×8 Mesh

Figure B.3 shows the 8×8 mesh used for the various test cases to verify and validate the design of the LS-NoC router.

Figure B.3: 8×8 mesh used for testing LS-NoC.

Figure B.4 shows the traffic test cases used to verify the functioning of the LS-NoC router. The label width was set at 4 bits, hence every node potentially had 16 destination nodes to send data to. All nodes attempt to send out traffic in every cycle. Traffic injection into the NoC is throttled by the flow control signal only. The test cases are:

• Figure B.4(a): Every node s in the 8×8 mesh sends data serially to all its possible destinations.


• Figure B.4(b): Every node s in the mesh sends data to one of the 16 possible destinations, choosing the destination at random.

• Figure B.4(c): Every node s sends out data to the same node through different routes.

Figure B.4: Traffic test cases used to verify proper functioning of the LS-NoC router.

B.3 Synthesis & Place and Route

The flowchart in Figure B.5 lists the operations performed during synthesis and place and route of the LS-NoC router. Synthesis parameters and results have already been listed in Table 5.4 and Table 5.5 respectively.

Synthesis of the functionally verified Verilog HDL design of the LS-NoC router was performed using Synopsys Design Compiler. Timing was analyzed and the cycle time was estimated from the FIFO at the input to the output buffers. The switching activity, timing and design constraints and the synthesized netlist are input to Cadence SOC Encounter for place and route. The placed and routed output of a single router is shown in Figure B.6. Area numbers were obtained from Encounter for the HDL designs along with the placed and routed (P & R) netlist and the parasitics (RC) extracted file (SPEF). The P & R netlist and the SPEF files are used by Synopsys Primetime PX for timing analysis and power calculations.


Figure B.5: Flowchart illustrating the synthesis and place & route steps of the LS-NoC router: the functionally verified Verilog HDL and the Faraday 130nm library (fsc0h*.db) feed synthesis in Synopsys Design Compiler; the synthesized netlist, Synopsys Design Constraints (SDC) and switching activity (VCD) feed place & route in SOC Encounter, producing the design area, P&R netlist and parasitics extraction file (SPEF); these are used by Synopsys Primetime PX for timing and power analysis.

Figure B.6: Placed and routed output of a single router.


Appendix C

The Flow Algorithm

This appendix presents the working of the flow algorithm with an example.

C.1 Ford-Fulkerson's MaxFlow Algorithm

Ford-Fulkerson's maxflow algorithm (Algorithm 1, Section 6.2) [67] is used by the LS-NoC Manager to identify pipes between communicating entities.

An example illustration of the MaxFlow algorithm is presented in Figure C.1. The steps are explained in the following list; a short code sketch of the search follows.

1. The network is represented as a set of nodes connected by unidirectional links. Each connecting link has a capacity associated with it. The capacity of a link represents the available bandwidth of the link between the communicating nodes.

2. Two flows, X→A→C→Y and X→B→C→Y, are set up between nodes X and Y. A capacity value of 1 is assumed to be used up by a flow in each link it passes through. Available capacity values are also shown in the graph. Links having at least one flow passing through them are drawn in dashed lines.

3. A residual network is constructed from the graph. Every unidirectional edge from the graph is replaced by two edges in opposite directions. The capacity of a forward link is equal to the available capacity of the link. The reverse link capacity is equal to the capacity reserved by flows.

4. Flows between the source and destination nodes are searched for in the residual network. If a flow exists in the residual network, it exists in the input graph too. The path is identified and added. In the illustration in Figures C.1(c) and C.1(d), a new flow X→A→C→B→D→E→Y is added and the residual network is rebuilt.

5. The algorithm terminates when no new flows can be identified between the nodes from the residual network.

Figure C.1: Steps in the flow algorithm example. (a) Input graph; maximum flows have to be identified between nodes X and Y. (b) Available capacities of links after flows X→A→C→Y and X→B→C→Y are set up. (c) Residual network showing available capacities of links in the forward direction and utilized capacity in the reverse. (d) Residual network after adding the flow X→A→C→B→D→E→Y. (e) Final output of the maxflow algorithm showing 3 flows from X to Y.
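To make the search concrete, the following is a minimal C++ sketch of the augmenting-path search outlined in steps 1-5, run on the link capacities of the Figure C.1 example. It finds unit-bandwidth pipes one at a time, as assumed in step 2, but omits the label bookkeeping of Section C.3; the node numbering and function names are ours and are not part of the LS-NoC Manager implementation.

```cpp
// Sketch of the max-flow (augmenting-path) search described above, using the
// link capacities of the Figure C.1 example. Node indices: X=0, A=1, B=2, C=3, D=4, E=5, Y=6.
// Each pipe is assumed to consume one unit of capacity on every link it traverses.
#include <cstdio>
#include <queue>
#include <vector>

const int N = 7;
int cap[N][N] = {};   // residual capacities: forward = free capacity, reverse = reserved

// Breadth-first search for an augmenting path from s to t in the residual network.
bool findAugmentingPath(int s, int t, std::vector<int>& parent) {
    std::vector<bool> visited(N, false);
    std::queue<int> q;
    q.push(s);
    visited[s] = true;
    parent[s] = -1;
    while (!q.empty()) {
        int u = q.front();
        q.pop();
        for (int v = 0; v < N; ++v) {
            if (!visited[v] && cap[u][v] > 0) {   // residual capacity left on u -> v
                visited[v] = true;
                parent[v] = u;
                if (v == t) return true;
                q.push(v);
            }
        }
    }
    return false;   // step 5: no augmenting path exists any more
}

int maxUnitPipes(int s, int t) {
    int pipes = 0;
    std::vector<int> parent(N);
    while (findAugmentingPath(s, t, parent)) {    // step 4: keep adding flows
        for (int v = t; v != s; v = parent[v]) {  // walk the path back from t to s
            int u = parent[v];
            cap[u][v] -= 1;   // one unit of forward capacity is reserved by the pipe
            cap[v][u] += 1;   // reverse edge records the reserved capacity (step 3)
        }
        ++pipes;
    }
    return pipes;
}

int main() {
    // Unidirectional links and capacities of the Figure C.1 input graph (step 1).
    cap[0][1] = 3;  // X -> A
    cap[0][2] = 1;  // X -> B
    cap[1][3] = 3;  // A -> C
    cap[2][3] = 5;  // B -> C
    cap[2][4] = 4;  // B -> D
    cap[3][6] = 2;  // C -> Y
    cap[4][5] = 2;  // D -> E
    cap[5][6] = 3;  // E -> Y
    std::printf("Unit-bandwidth pipes found from X to Y: %d\n", maxUnitPipes(0, 6));
    return 0;
}
```

Run on this graph, the sketch reports three pipes from X to Y, matching the final output shown in Figure C.1(e).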


C.2 Input Graph

Ford-Fulkerson's MaxFlow algorithm has been used to identify bandwidth-guaranteed pipes between communicating nodes in LS-NoC. The input to the algorithm is a network graph representing the communication channels in LS-NoC. An illustration of how the input graph is built, using a 2 router, 6 communication node scenario, is presented in this section.


Figure C.2: (a) A 2 router, 6 source+sink system used for validation of the LS-NoC router design. (b) Graph representation of the system used as input to the flow algorithm.

Figure C.2(a) shows the 2 router system. Each router has 4 input and output ports. The east port of router 0 is connected to the west port of router 1. All other ports are connected to source and sink nodes. Blanked-out links represent ingress links and shaded links represent egress links.

The graph representation of the 2 router system is shown in Figure C.2(b). The node R0S0 denotes the source attached to port 0 on router 0, and so on. R0_I0 is the node on the router's input port interface at the router side. R0_I0 connects to all sinks on router 0, viz., R0_D0, R0_D1 and R0_D2. Each of R0_I0, R0_I1 and R0_I2 also connects to the ingress port of router 1, R1_I4. R1_I4 connects to each of the sinks on R1: R1_D0, R1_D1 and R1_D2. The capacity numbers on links ending in the ingress ports of routers 0 and 1 are the capacities of the input buffers at those input ports. These are the bottleneck links, and the remaining links can be assigned infinite capacities. This example illustrates one half of the graph; an analogous graph is built from the source nodes in router 1 towards router 0. This is the input graph used by the NoC Manager to identify pipes between communicating nodes.
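As a rough sketch of how this half of the graph could be encoded for the flow algorithm, the capacity map below follows the node names of Figure C.2(b); the ingress-link capacity of 4 matches the edge definitions given in Section C.3, while modelling the unconstrained links with a single large constant is our own simplification.

```cpp
// Sketch: one half of the Figure C.2(b) input graph as a capacity map.
// Only links that end at a router ingress port carry a finite capacity
// (the input-buffer depth); all other links are given a very large capacity.
#include <map>
#include <string>

using Graph = std::map<std::string, std::map<std::string, int>>;

Graph buildHalfGraph() {
    const int BUF_CAP = 4;          // input-buffer depth, as in edges E0..E5
    const int INF_CAP = 1 << 20;    // stand-in for "infinite" capacity
    Graph g;

    // Sources on router 0 -> router 0 ingress ports (bottleneck links).
    g["R0S0"]["R0_I0"] = BUF_CAP;   // edge E0
    g["R0S1"]["R0_I1"] = BUF_CAP;   // edge E1
    g["R0S2"]["R0_I2"] = BUF_CAP;   // edge E2

    for (const char* in : {"R0_I0", "R0_I1", "R0_I2"}) {
        // Router 0 ingress ports -> local sinks on router 0.
        for (const char* sink : {"R0_D0", "R0_D1", "R0_D2"})
            g[in][sink] = INF_CAP;
        // Router 0 ingress ports -> ingress port of router 1; this link ends at an
        // ingress port, so it is again limited by the input-buffer depth.
        g[in]["R1_I4"] = BUF_CAP;
    }

    // Router 1 ingress port -> sinks on router 1.
    for (const char* sink : {"R1_D0", "R1_D1", "R1_D2"})
        g["R1_I4"][sink] = INF_CAP;

    return g;
}
```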


C.3 Edges in the Input Graph

The representation of an edge (Section 6.2) in the graph is:

E_ij = {n_i, n_j, u_ij, v_ij, L_ij^used, L_ij^unused}   (C.1)

where edge E_ij connects node n_i to node n_j. Nodes can be traffic sources, traffic sinks, or router input or output ports. u_ij and v_ij are the utilized and available flow capacities of the edge, respectively. L_ij^used is the list of labels used by pipes passing through E_ij, and L_ij^unused is the list of labels available for assignment to future pipes. During the initialization stage, u_ij and L_ij^used are null; v_ij holds the maximum capacity E_ij can support, and L_ij^unused is the list of all labels available for pipes through E_ij.

Example edges E0, ..., E5 are shown in the figure, where E0 is the edge connecting source S0 to the router interface I0 (node R0S0→R0_I0). E0 is represented as

E0 = {R0S0, R0_I0, 0, 4, {}, {0,...,3}}   (C.2)

where edge E0, of capacity 4, connects node R0S0 to R0_I0. All of its resources are free (utilized capacity = 0). The labels associated with the edge are {0,...,3} and none of them has been used yet. Similarly, edges E2 and E5 are represented as

E2 = {R0S2, R0_I2, 0, 4, {}, {0,...,3}}   (C.3)

E5 = {R0_I2, R1_I4, 0, 4, {}, {0,...,3}}   (C.4)

Figure C.3: The NoC after two pipes, P0 and P1, have been established. P0: R0S0→R1_D2 and P1: R0S2→R1_D0.
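A possible in-memory form of this edge record is sketched below; the struct, field and function names are ours, not taken from the LS-NoC sources.

```cpp
// Sketch of the per-edge record E_ij = {n_i, n_j, u_ij, v_ij, L_ij^used, L_ij^unused}.
#include <set>
#include <string>

struct Edge {
    std::string from;            // n_i : node the edge starts at
    std::string to;              // n_j : node the edge ends at
    int used;                    // u_ij: capacity already reserved by established pipes
    int avail;                   // v_ij: capacity still available for new pipes
    std::set<int> usedLabels;    // L_ij^used  : labels carried by pipes through this edge
    std::set<int> freeLabels;    // L_ij^unused: labels still free for future pipes
};

// E0 at initialization (Equation C.2): capacity 4, no pipes, all four labels free.
Edge E0 = {"R0S0", "R0_I0", 0, 4, {}, {0, 1, 2, 3}};

// Reserving one unit of bandwidth and one label when a pipe is routed over an edge.
void reservePipe(Edge& e, int label) {
    e.used += 1;
    e.avail -= 1;
    e.freeLabels.erase(label);
    e.usedLabels.insert(label);
}
```

Calling reservePipe(E0, 0) when pipe P0 is set up leaves E0 as {R0S0, R0_I0, 1, 3, {0}, {1, 2, 3}}, the state given in Equation C.5 below.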


Table C.1: Routing tables at the R0_I0, R0_I2 and R1_I4 nodes after pipes P0 and P1 have been set up.

    RT: R0_I0            RT: R0_I2            RT: R1_I4
    il   Route   ol      il   Route   ol      il   Route   ol
    0    R1_I4   0       ...  ...     ...     0    R1_D2   0
    ...  ...     ...     0    R1_I4   1       1    R1_D0   1

Figure C.3 presents the state of the NoC after two pipes, P0 and P1, have been established in the network. P0 connects R0S0→R1_D2 and P1 connects R0S2→R1_D0. The values of E0 and E2 are now:

E0 = {R0S0, R0_I0, 1, 3, {0}, {1,...,3}};  E2 = {R0S2, R0_I2, 1, 3, {0}, {1,...,3}}   (C.5)

The labels from R0S0 and R0S2 conflict at edge E5, which ends in node R1_I4. Label swapping is performed and the label on pipe P1 is aliased from 0 to 1 in the routing table at node R1_I4. Table C.1 presents a section of the routing table at node R1_I4.

Similar routing table configuration and label assignment operations are performed at all router input port nodes.
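A minimal sketch of the per-node routing table and the label-swap lookup implied by Table C.1 follows; the data-structure and function names are assumptions, not the thesis implementation.

```cpp
// Sketch: routing-table entries and label swapping at a router input-port node.
// Each entry maps an incoming label (il) to the next hop (Route) and an outgoing label (ol).
#include <map>
#include <string>
#include <utility>

struct RouteEntry {
    std::string nextHop;   // Route: node (sink or next ingress port) the data is sent to
    int outLabel;          // ol   : label the data carries on the outgoing edge
};

using RoutingTable = std::map<int, RouteEntry>;   // keyed by incoming label (il)

// Routing table at node R1_I4 after pipes P0 and P1 are set up (Table C.1).
// Both pipes would otherwise arrive with label 0, so P1 is aliased to label 1.
RoutingTable rtR1I4 = {
    {0, {"R1_D2", 0}},   // pipe P0: R0S0 -> R1_D2
    {1, {"R1_D0", 1}},   // pipe P1: R0S2 -> R1_D0, after the label swap
};

// Forwarding step: look up the incoming label, return the next hop and the swapped label.
std::pair<std::string, int> forward(const RoutingTable& rt, int inLabel) {
    const RouteEntry& entry = rt.at(inLabel);
    return {entry.nextHop, entry.outLabel};
}
```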


Bibliography

[1] W. J. Dally and B. Towles. Route packets, not wires: on-chip interconnection networks. In Design Automation Conference, 2001. Proceedings, pages 684–689, 2001.
[2] L. Benini and G. De Micheli. Networks on chips: a new SoC paradigm. Computer, 35(1):70–78, Jan 2002.
[3] Luca Benini and G. De Micheli, editors. Networks on Chips: Technology and Tools. Morgan Kaufmann, CA, USA, 2006.
[4] Axel Jantsch and Hannu Tenhunen, editors. Networks on Chip. Kluwer Academic Publishers, Hingham, MA, USA, 2003.
[5] Anand Raghunathan, Niraj K. Jha, and Sujit Dey. High-Level Power Analysis and Optimization. Kluwer Academic Publishers, Norwell, MA, USA, 1998.
[6] Krste Asanovic, Ras Bodik, Bryan Christopher Catanzaro, Joseph James Gebis, Parry Husbands, Kurt Keutzer, David A. Patterson, William Lester Plishker, John Shalf, Samuel Webb Williams, and Katherine A. Yelick. The landscape of parallel computing research: A view from Berkeley. Technical Report UCB/EECS-2006-183, EECS Department, University of California, Berkeley, Dec 2006.
[7] Pascal T. Wolkotte, Gerard J. M. Smit, Gerard K. Rauwerda, and Lodewijk T. Smit. An energy efficient reconfigurable circuit switched network-on-chip. In Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS 2005) - 12th Reconfigurable Architecture Workshop (RAW 2005), page 155a, 2005.


[8] Kwanho Kim, Joo-Young Kim, Seungjin Lee, Minsu Kim, and Hoi-Jun Yoo. A 76.8 Gb/s 46 mW low-latency network-on-chip for real-time object recognition processor. In Solid-State Circuits Conference, 2008. A-SSCC '08. IEEE Asian, pages 189–192, Nov. 2008.
[9] William J. Dally and John W. Poulton. Digital Systems Engineering. Cambridge University Press, New York, NY, USA, 1998.
[10] J. Duato, S. Yalamanchili, and L. Ni. Interconnection Networks: An Engineering Approach. Morgan Kaufmann Publishers, 2003.
[11] Tuomas Valtonen, Tero Nurmi, Jouni Isoaho, and Hannu Tenhunen. An autonomous error-tolerant cell for scalable network-on-chip architectures. In Proceedings of the 19th IEEE Nordic Event in ASIC Design (NorChip 2001), number 0, pages 198–203, Kista, Sweden, Nov 2001.
[12] Jian Liang, A. Laffely, S. Srinivasan, and R. Tessier. An architecture and compiler for scalable on-chip communication. Very Large Scale Integration (VLSI) Systems, IEEE Transactions on, 12(7):711–726, July 2004.
[13] Kuei-Chung Chang, Jih-Sheng Shen, and Tien-Fu Chen. Evaluation and design trade-offs between circuit-switched and packet-switched NoCs for application-specific SoCs. In Proceedings of the 43rd Annual Design Automation Conference, DAC '06, pages 143–148, New York, NY, USA, 2006. ACM.
[14] T. D. Richardson, C. Nicopoulos, D. Park, V. Narayanan, Yuan Xie, C. Das, and V. Degalahal. A hybrid SoC interconnect with dynamic TDMA-based transaction-less buses and on-chip networks. In VLSI Design, 2006. Held jointly with 5th International Conference on Embedded Systems and Design, 19th International Conference on, 8 pp., Jan. 2006.
[15] Jingcao Hu, Yangdong Deng, and Radu Marculescu. System-level point-to-point communication synthesis using floorplanning information. In Proc. ASP-DAC, pages 573–579, 2002.


[16] C. Hilton and B. Nelson. PNoC: a flexible circuit-switched NoC for FPGA-based systems. Computers and Digital Techniques, IEE Proceedings, 153(3):181–188, May 2006.
[17] D. Castells-Rufas, J. Joven, and J. Carrabina. A validation and performance evaluation tool for ProtoNoC. In System-on-Chip, 2006. International Symposium on, pages 1–4, Nov. 2006.
[18] A. Leroy, P. Marchal, A. Shickova, F. Catthoor, F. Robert, and D. Verkest. Spatial division multiplexing: a novel approach for guaranteed throughput on NoCs. In Proceedings of the 3rd IEEE/ACM/IFIP International Conference on Hardware/Software Codesign and System Synthesis, CODES+ISSS '05, pages 81–86, New York, NY, USA, 2005. ACM.
[19] Daniel Wiklund and Dake Liu. SoCBUS: Switched network on chip for hard real time embedded systems. In Proceedings of the 17th International Symposium on Parallel and Distributed Processing, IPDPS '03, pages 78.1–, Washington, DC, USA, 2003. IEEE Computer Society.
[20] K. Goossens, J. van Meerbergen, A. Peeters, and R. Wielage. Networks on silicon: combining best-effort and guaranteed services. In Design, Automation and Test in Europe Conference and Exhibition, 2002. Proceedings, pages 423–425, 2002.
[21] E. Beigne, F. Clermidy, P. Vivet, A. Clouard, and M. Renaudin. An asynchronous NoC architecture providing low latency service and its multi-level design framework. In Asynchronous Circuits and Systems, 2005. ASYNC 2005. Proceedings. 11th IEEE International Symposium on, pages 54–63, March 2005.
[22] D. Lattard, E. Beigne, F. Clermidy, Y. Durand, R. Lemaire, P. Vivet, and F. Berens. A reconfigurable baseband platform based on an asynchronous network-on-chip. Solid-State Circuits, IEEE Journal of, 43(1):223–235, Jan. 2008.


[23] A. Leroy, D. Milojevic, D. Verkest, F. Robert, and F. Catthoor. Concepts and implementation of spatial division multiplexing for guaranteed throughput in networks-on-chip. Computers, IEEE Transactions on, 57(9):1182–1195, Sept. 2008.
[24] P. P. Pande, C. S. Grecu, A. Ivanov, and R. A. Saleh. Switch-based interconnect architecture for future systems on chip. Proc. SPIE - Int. Soc. Opt. Eng. (USA), 5117:228–237, 2003.
[25] Jingcao Hu and Radu Marculescu. DyAD: smart routing for networks-on-chip. In Proceedings of the 41st Annual Design Automation Conference, DAC '04, pages 260–263, New York, NY, USA, 2004. ACM.
[26] Martti Forsell. A scalable high-performance computing solution for networks on chips. IEEE Micro, 22(5):46–55, September 2002.
[27] T. Bjerregaard and J. Sparso. A router architecture for connection-oriented service guarantees in the MANGO clockless network-on-chip. In Design, Automation and Test in Europe, 2005. Proceedings, pages 1226–1231 Vol. 2, March 2005.
[28] David Sigüenza-Tortosa, Tapani Ahonen, and Jari Nurmi. Issues in the development of a practical NoC: the Proteo concept. Integr. VLSI J., 38(1):95–105, October 2004.
[29] Evgeny Bolotin, Israel Cidon, Ran Ginosar, and Avinoam Kolodny. QNoC: QoS architecture and design process for network on chip. Journal of Systems Architecture, 50:105–128, 2004.
[30] P. Guerrier and A. Greiner. A generic architecture for on-chip packet-switched interconnections. In Design, Automation and Test in Europe Conference and Exhibition 2000. Proceedings, pages 250–256, 2000.
[31] E. Rijpkema, K. G. W. Goossens, A. Radulescu, J. Dielissen, J. van Meerbergen, P. Wielage, and E. Waterlander. Trade-offs in the design of a router with both guaranteed and best-effort services for networks on chip. IEE Proc.-Computer Digital Technology, 150(5):294–302, Sep 2003.


[32] Erno Salminen, Tero Kangas, Timo D. Hämäläinen, Jouni Riihimäki, Vesa Lahtinen, and Kimmo Kuusilinna. HIBI communication network for system-on-chip. J. VLSI Signal Process. Syst., 43(2-3):185–205, June 2006.
[33] B. Ahmad, Ahmet T. Erdogan, and Sami Khawam. Architecture of a dynamically reconfigurable NoC for adaptive reconfigurable MPSoC. In Proceedings of the First NASA/ESA Conference on Adaptive Hardware and Systems, AHS '06, pages 405–411, Washington, DC, USA, 2006. IEEE Computer Society.
[34] Faraydon Karim, Anh Nguyen, and Sujit Dey. An interconnect architecture for networking systems on chips. IEEE Micro, 22(5):36–45, September 2002.
[35] S. Blake, D. Black, M. Carlson, E. Davies, Z. Wang, and W. Weiss. An Architecture for Differentiated Service. United States, 1998.
[36] Amos E. Joel. Asynchronous Transfer Mode Switching. IEEE, 1993.
[37] E. Rosen, A. Viswanathan, and R. Callon. Multiprotocol Label Switching Architecture. RFC 3031, Jan 2001.
[38] M. Kim, D. Kim, and G. E. Sobelman. Network-on-chip quality-of-service through multiprotocol label switching. In Circuits and Systems, 2006. ISCAS 2006. Proceedings. 2006 IEEE International Symposium on, page 1843, 2006.
[39] A. Lines. Asynchronous interconnect for synchronous SoC design. Micro, IEEE, 24(1):32–41, Jan.-Feb. 2004.
[40] Michihiro Koibuchi, Kenichiro Anjo, Yutaka Yamada, Akiya Jouraku, and Hideharu Amano. A simple data transfer technique using local address for networks-on-chips. IEEE Trans. Parallel Distrib. Syst., 17(12):1425–1437, December 2006.
[41] M. A. Anders, H. Kaul, S. K. Hsu, A. Agarwal, S. K. Mathew, F. Sheikh, R. K. Krishnamurthy, and S. Borkar. A 4.1Tb/s bisection-bandwidth 560Gb/s/W streaming circuit-switched 8x8 mesh network-on-chip in 45nm CMOS. In Solid-State Circuits Conference Digest of Technical Papers (ISSCC), 2010 IEEE International, pages 110–111, Feb. 2010.


[42] E. Rijpkema, K. G. W. Goossens, A. Radulescu, J. Dielissen, J. van Meerbergen, P. Wielage, and E. Waterlander. Trade offs in the design of a router with both guaranteed and best-effort services for networks on chip. In Design, Automation, and Test in Europe, pages 10350–10355, 2003.
[43] S. Bell. TILE64 processor: A 64-core SoC with mesh interconnect. In Solid-State Circuits Conference Digest of Technical Papers (ISSCC), 2008 IEEE International, pages 88–89, Feb. 2008.
[44] T. Wiegand, G. J. Sullivan, G. Bjontegaard, and A. Luthra. Overview of the H.264/AVC video coding standard. Circuits and Systems for Video Technology, IEEE Transactions on, 13(7):560–576, July 2003.
[45] Thomas Canhao Xu, Alexander Wei Yin, Pasi Liljeberg, and Hannu Tenhunen. A study of 3D network-on-chip design for data parallel H.264 coding. Microprocess. Microsyst., 35(7):603–612, October 2011.
[46] R. Ho, K. Mai, and M. Horowitz. The future of wires. In Proc. of the IEEE, pages 490–504, April 2001.
[47] Tobias Bjerregaard and Shankar Mahadevan. A survey of research and practices of network-on-chip. ACM Comput. Surv., 38(1), June 2006.
[48] R. Marculescu, U. Y. Ogras, Li-Shiuan Peh, N. E. Jerger, and Y. Hoskote. Outstanding research problems in NoC design: System, microarchitecture, and circuit perspectives. Computer-Aided Design of Integrated Circuits and Systems, IEEE Transactions on, 28(1):3–21, Jan. 2009.
[49] Erno Salminen, Ari Kulmala, and Timo D. Hämäläinen. On network-on-chip comparison. Digital Systems Design, Euromicro Symposium on, 0:503–510, 2007.


[50] Erno Salminen, Ari Kulmala, and Timo D. Hämäläinen. Survey of Network-on-chip Proposals. White Paper, pages 1–13, March 2008.
[51] E. Rijpkema, K. Goossens, A. Radulescu, J. Dielissen, J. van Meerbergen, P. Wielage, and E. Waterlander. Trade-offs in the design of a router with both guaranteed and best-effort services for networks on chip. Computers and Digital Techniques, IEE Proceedings, 150(5):294–302, Sept. 2003.
[52] Xiaowen Wu, Yilang Wu, Ling Wang, and Xiaoqing Yang. QoS router with both soft and hard guarantee for network-on-chip. In NORCHIP, 2009, pages 1–6, Nov. 2009.
[53] Y. Salah and R. Tourki. Design and FPGA implementation of a QoS router for networks-on-chip. In Next Generation Networks and Services (NGNS), 2011 3rd International Conference on, pages 84–89, Dec. 2011.
[54] T. Bjerregaard and J. Sparso. Scheduling discipline for latency and bandwidth guarantees in asynchronous network-on-chip. In Asynchronous Circuits and Systems, 2005. ASYNC 2005. Proceedings. 11th IEEE International Symposium on, pages 34–43, March 2005.
[55] Jin Ouyang and Yuan Xie. LOFT: A high performance network-on-chip providing quality-of-service support. In Proceedings of the 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture, MICRO '43, pages 409–420, Washington, DC, USA, 2010. IEEE Computer Society.
[56] R. Stefan and K. Goossens. A TDM slot allocation flow based on multipath routing in NoCs. Microprocessors and Microsystems, 35(2):130–138, 2011. Special issue on Network-on-Chip Architectures and Design Methodologies.
[57] Edwin Rijpkema, Kees Goossens, and Paul Wielage. A router architecture for networks on silicon. In Proceedings of Progress 2001, 2nd Workshop on Embedded Systems, pages 181–188, 2001.


[58] Andreas Hansson, Mahesh Subburaman, and Kees Goossens. aelite: A flit-synchronous network on chip with composable and predictable services. In Proceedings of the Design, Automation & Test in Europe Conference and Exhibition, Los Alamitos, April 2009. IEEE Computer Society Press.
[59] Radu Stefan, Anca Molnos, and Kees Goossens. dAElite: A TDM NoC supporting QoS, multicast, and fast connection set-up. IEEE Transactions on Computers, 99(PrePrints), 2012.
[60] D. G. Messerschmitt. Synchronization in digital system design. IEEE J. Sel. A. Commun., 8(8):1404–1419, September 2006.
[61] Arnab Banerjee, Pascal T. Wolkotte, Robert D. Mullins, Simon W. Moore, and Gerard J. M. Smit. An energy and performance exploration of network-on-chip architectures. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 17(3):319–329, March 2009.
[62] Z. Joseph Yang, A. Kumar, and Yajun Ha. An area-efficient dynamically reconfigurable spatial division multiplexing network-on-chip with static throughput guarantee. In Field-Programmable Technology (FPT), 2010 International Conference on, pages 389–392, Dec. 2010.
[63] Fayez Gebali, Haytham Elmiligi, and Mohamed Watheq El-Kharashi. Networks-on-Chips: Theory and Practice. CRC Press, Inc., Boca Raton, FL, USA, 1st edition, 2009.
[64] A. K. Lusala and J.-D. Legat. A SDM-TDM based circuit-switched router for on-chip networks. In Reconfigurable Communication-centric Systems-on-Chip (ReCoSoC), 2011 6th International Workshop on, pages 1–8, June 2011.
[65] M. Millberg, E. Nilsson, R. Thid, and A. Jantsch. Guaranteed bandwidth using looped containers in temporally disjoint networks within the Nostrum network on chip. In Design, Automation and Test in Europe Conference and Exhibition, 2004. Proceedings, volume 2, pages 890–895 Vol. 2, Feb. 2004.


[66] N. Megiddo. Optimal flows in networks with sources and sinks. Mathematical Programming, 7:97–107, 1974.
[67] L. R. Ford and D. R. Fulkerson. A simple algorithm for finding maximal network flows and an application to the Hitchcock problem. Canadian Journal of Mathematics, 9:210–218, 1957.
[68] P. P. Pande, C. Grecu, M. Jones, A. Ivanov, and R. Saleh. Performance evaluation and design trade-offs for network-on-chip interconnect architectures. IEEE Transactions on Computers, 54:1025–1040, August 2005.
[69] R. Kumar, V. Zyuban, and D. M. Tullsen. Interconnections in multi-core architectures: Understanding mechanisms, overheads and scaling. In Proc. of Computer Architecture, ISCA '05, 32nd International Symposium on, pages 408–418, 2005.
[70] T. Kogel et al. A modular simulation framework for architectural exploration of on-chip interconnection networks. In Proc. of Hardware/Software Codesign and System Synthesis, 2003 (CODES+ISSS '03), Intl. Conf. on, pages 338–351, October 2003.
[71] Hang-Sheng Wang, Xinping Zhu, Li-Shiuan Peh, and Sharad Malik. Orion: A power-performance simulator for interconnection networks. In Proc. of MICRO 35, 2002.
[72] R. Balasubramonian, N. Muralimanohar, K. Ramani, and V. Venkatachalapthy. Microarchitectural wire management for performance and power in partitioned architectures. In Proc. of High-Performance Computer Architecture, HPCA-11, 11th International Symposium on, pages 28–39, February 2005.
[73] A. Courtey, O. Sentieys, J. Laurent, and N. Julien. High-level interconnect delay and power estimation. Journal of Low Power Electronics, 4:1–13, 2008.
[74] L. Carloni, A. B. Kahng, S. Muddu, A. Pinto, K. Samadi, and P. Sharma. Interconnect modeling for improved system-level design optimization. In Design Automation Conference, 2008. ASPDAC 2008. Asia and South Pacific, pages 258–264, March 2008.


[75] Rahul Nagpal, Arvind Madan, Bharadwaj Amrutur, and Y. N. Srikant. Intacte: an interconnect area, delay, and energy estimation tool for microarchitectural explorations. In CASES '07: Proceedings of the 2007 International Conference on Compilers, Architecture, and Synthesis for Embedded Systems, pages 238–247, New York, NY, USA, 2007. ACM.
[76] A. B. Kahng, Bin Li, Li-Shiuan Peh, and K. Samadi. ORION 2.0: A power-area simulator for interconnection networks. Very Large Scale Integration (VLSI) Systems, IEEE Transactions on, 20(1):191–196, Jan. 2012.
[77] A. Kumar, P. Kundu, A. P. Singh, L.-S. Peh, and N. K. Jha. A 4.6Tbits/s 3.6GHz single-cycle NoC router with a novel switch allocator in 65nm CMOS. In Computer Design, 2007. ICCD 2007. 25th International Conference on, pages 63–70, Oct. 2007.
[78] http://www.synopsys.com/products/libertyccs/libertyccs.html. Liberty File Format.
[79] LEF/DEF exchange format. http://openeda.si2.org/projects/lefdef.
[80] ITRS. International Technology Roadmap for Semiconductors, 2011.
[81] http://www.eas.asu.edu/~ptm/. Predictive technology models.
[82] R. Nagpal, M. Arvind, Y. N. Srikanth, and B. Amrutur. Intacte: Tool for interconnect modelling. In Proc. of 2007 Intl Conf. on Compilers, Architecture and Synthesis for Embedded Systems (CASES 2007), pages 238–247, 2007.
[83] K. M. Buyuksahin and F. N. Najm. High-level power estimation with interconnect effects. In Low Power Electronics and Design, 2000. ISLPED '00. Proceedings of the 2000 International Symposium on, pages 197–202, 2000.


[84] B. S. Landman and R. L. Russo. On a pin versus block relationship for partitions of logic graphs. IEEE Transactions on Computers, 20:1469–1479, 1971.
[85] Seung Eun Lee and Nader Bagherzadeh. A high level power model for Network-on-Chip (NoC) router. Computers & Electrical Engineering, 35(6):837–845, 2009. High Performance Computing Architectures, HPCA.
[86] P. Gupta, L. Zhong, and N. K. Jha. A high-level interconnect power model for design space exploration. In Proc. of Computer Aided Design (ICCAD '03), Intl. Conf. on, pages 551–558, 2003.
[87] Seung Eun Lee, Jun Ho Bahn, and Nader Bagherzadeh. Design of a feasible on-chip interconnection network for a chip multiprocessor (CMP). In Proc. of Computer Architecture and High Performance Computing, Intl. Symp. on, pages 211–218, 2007.
[88] Hang-Sheng Wang, Li-Shiuan Peh, and S. Malik. A power model for routers: modeling Alpha 21364 and InfiniBand routers. Micro, IEEE, 23(1):26–35, Jan/Feb 2003.
[89] Hangsheng Wang, Li-Shiuan Peh, and S. Malik. A technology-aware and energy-oriented topology exploration for on-chip networks. In Design, Automation and Test in Europe, 2005. Proceedings, pages 1238–1243 Vol. 2, March 2005.
[90] Eun Jung Kim, Greg M. Link, Ki Hwan Yum, N. Vijaykrishnan, Mahmut Kandemir, Mary J. Irwin, and Chita R. Das. A holistic approach to designing energy-efficient cluster interconnects. IEEE Trans. Comput., 54(6):660–671, June 2005.
[91] V. Soteriou, N. Eisley, Hangsheng Wang, Bin Li, and Li-Shiuan Peh. Polaris: A system-level roadmapping toolchain for on-chip interconnection networks. Very Large Scale Integration (VLSI) Systems, IEEE Transactions on, 15(8):855–868, Aug. 2007.
[92] N. Choudhary, M. S. Gaur, and V. Laxmi. Irregular NoC simulation framework: IrNIRGAM. In Emerging Trends in Networks and Computer Communications (ETNCC), 2011 International Conference on, pages 1–5, April 2011.


[93] Noxim - the NoC simulator. http://noxim.sourceforge.net.
[94] NNSE: The Nostrum NoC simulation environment. http://www.ict.kth.se/nostrum/NNSE.
[95] D. Brooks, V. Tiwari, and M. Martonosi. Wattch: a framework for architectural-level power analysis and optimizations. In Computer Architecture, 2000. Proceedings of the 27th International Symposium on, pages 83–94, June 2000.
[96] T. Austin, E. Larson, and D. Ernst. SimpleScalar: an infrastructure for computer system modeling. Computer, 35(2):59–67, Feb 2002.
[97] Milo M. K. Martin, Daniel J. Sorin, Bradford M. Beckmann, Michael R. Marty, Min Xu, Alaa R. Alameldeen, Kevin E. Moore, Mark D. Hill, and David A. Wood. Multifacet's general execution-driven multiprocessor simulator (GEMS) toolset. SIGARCH Comput. Archit. News, 33(4):92–99, November 2005.
[98] Seung Eun Lee, Jun Ho Bahn, Yoon Seok Yang, and Nader Bagherzadeh. A generic network interface architecture for a networked processor array (NePA). In Proceedings of the 21st International Conference on Architecture of Computing Systems, ARCS '08, pages 247–260, Berlin, Heidelberg, 2008. Springer-Verlag.
[99] Yan Luo, Jun Yang, L. N. Bhuyan, and Li Zhao. NePSim: a network processor simulator with a power evaluation framework. Micro, IEEE, 24(5):34–44, Sept.-Oct. 2004.
[100] S. Huang, Y. Luo, and W. Feng. Modeling and analysis of power in multicore network processors. In Parallel and Distributed Processing, 2008. IPDPS 2008. IEEE International Symposium on, pages 1–8, April 2008.


[101] Michael Huang, J. Renau, Seung-Moon Yoo, and J. Torrellas. L1 data cache decomposition for energy efficiency. In Low Power Electronics and Design, International Symposium on, 2001, pages 10–15, 2001.
[102] Aparna Mandke, Keshavan Varadarajan, Basavaraj Talwar, Bharadwaj Amrutur, and Y. N. Srikant. Sapphire: A framework to explore power/performance implications of tiled architecture on chip multicore platform. Technical Report IISc-CSA-TR-2010-03, CSA, IISc, July 2010.
[103] Jose Renau, Basilio Fraguela, James Tuck, Wei Liu, Milos Prvulovic, Luis Ceze, Smruti Sarangi, Paul Sack, Karin Strauss, and Pablo Montesinos. SESC simulator, January 2005. http://sesc.sourceforge.net.
[104] David Wang, Brinda Ganesh, Nuengwong Tuaycharoen, Katie Baynes, Aamer Jaleel, and Bruce Jacob. DRAMsim: A memory-system simulator. SIGARCH Computer Architecture News, 33(4):100–107, September 2005.
[105] J. Nimmy, C. Ramesh Reddy, K. Varadarajan, M. Alle, A. Fell, S. K. Nandy, and R. Narayan. REDEFINE Reconnect: A NoC for polymorphic ASICs using a low overhead single cycle router. In Application-Specific Systems, Architectures and Processors, 2008. ASAP 2008. International Conference on, pages 251–256, July 2008.
[106] Bradford M. Beckmann and David A. Wood. Managing wire delay in large chip-multiprocessor caches. In IEEE/ACM International Symposium on Microarchitecture, pages 319–330. IEEE Computer Society, 2004.
[107] Changkyu Kim, Doug Burger, and Stephen W. Keckler. An adaptive, non-uniform cache structure for wire-delay dominated on-chip caches. SIGOPS Oper. Syst. Rev., 36(5):211–222, October 2002.
[108] B. M. Beckmann and D. A. Wood. TLC: transmission line caches. In Microarchitecture, 2003. MICRO-36. Proceedings. 36th Annual IEEE/ACM International Symposium on, pages 43–54, Dec. 2003.


[109] Changkyu Kim, Doug Burger, and Stephen W. Keckler. An adaptive, non-uniform cache structure for wire-delay dominated on-chip caches. In ACM SIGPLAN Notices, pages 211–222. ACM, 2002.
[110] Changkyu Kim, D. Burger, and S. W. Keckler. Nonuniform cache architectures for wire-delay dominated on-chip caches. Micro, IEEE, 23(6):99–107, Nov.-Dec. 2003.
[111] A. Bardine, M. Comparetti, P. Foglia, G. Gabrielli, and C. A. Prete. Impact of on-chip network parameters on NUCA cache performances. Computers Digital Techniques, IET, 3(5):501–512, September 2009.
[112] J.-M. Parcerisa, J. Sahuquillo, A. González, and J. Duato. On-chip interconnects and instruction steering schemes for clustered microarchitectures. Parallel and Distributed Systems, IEEE Transactions on, 16(2):130–144, February 2005.
[113] A. Aggarwal and M. Franklin. Instruction replication for reducing delays due to inter-PE communication latency. Computers, IEEE Transactions on, 54(12):1496–1507, Dec. 2005.
[114] Joan Manuel Parcerisa and Antonio González. Reducing wire delay penalty through value prediction. In International Symposium on Microarchitecture, pages 317–326, 2000.
[115] M. Zhang and K. Asanovic. Victim replication: maximizing capacity while hiding wire delay in tiled chip multiprocessors. In Computer Architecture, 2005. ISCA '05. Proceedings. 32nd International Symposium on, pages 336–345, 2005.
[116] Bradford M. Beckmann, Michael R. Marty, and David A. Wood. ASR: Adaptive selective replication for CMP caches. In Microarchitecture, 2006. MICRO-39. 39th Annual IEEE/ACM International Symposium on, pages 443–454, Dec. 2006.
[117] Freddy Gabbay. Speculative execution based on value prediction. Technical Report EE Department TR 1080, Technion - Israel Institute of Technology, 1996.


[118] J. González and A. González. Memory address prediction for data speculation. Technical Report UPC-DAC-1996-50, Univ. Politècnica de Catalunya, Spain, 1996.
[119] Xin Jia and Ranga Vemuri. Using GALS architecture to reduce the impact of long wire delay on FPGA performance. In ASP-DAC '05: Proceedings of the 2005 Asia and South Pacific Design Automation Conference, pages 1260–1263, New York, NY, USA, 2005. ACM.
[120] S. W. Keckler, D. Burger, C. R. Moore, R. Nagarajan, K. Sankaralingam, V. Agarwal, M. S. Hrishikesh, N. Ranganathan, and P. Shivakumar. A wire-delay scalable microprocessor architecture for high performance systems. In Solid-State Circuits Conference, 2003. Digest of Technical Papers. ISSCC. 2003 IEEE International, pages 168–169 Vol. 1, 2003.
[121] Z. Guz, I. Keidar, A. Kolodny, and U. C. Weiser. Nahalal: Cache organization for chip multiprocessors. Computer Architecture Letters, 6(1):21–24, January-June 2007.
[122] Naveen Muralimanohar, Rajeev Balasubramonian, and Norman P. Jouppi. Architecting efficient interconnects for large caches with CACTI 6.0. IEEE Micro, 28(1):69–79, 2008.
[123] G. Konstadinidis, M. Rashid, P. F. Lai, Y. Otaguro, Y. Orginos, S. Parampalli, M. Steigerwald, S. Gundala, R. Pyapali, L. Rarick, I. Elkin, Yuefei Ge, and I. Parulkar. Implementation of a third-generation 16-core 32-thread chip-multithreading SPARC processor. In Solid-State Circuits Conference, 2008. ISSCC 2008. Digest of Technical Papers. IEEE International, pages 84–597, Feb. 2008.
[124] K. Lee, Se-Joong Lee, and Hoi-Jun Yoo. Low-power network-on-chip for high-performance SoC design. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 14:148–160, February 2006.
[125] http://www.ocpip.org/socket/systemc/. OCP-IP, SystemC OCP models.


[126] http://www.systemc.org/. Open SystemC Initiative.
[127] Hang-Sheng Wang, X. Zhu, Li-Shiuan Peh, and S. Malik. Orion: A power-performance simulator for interconnection networks. In Proceedings of the 35th Annual ACM/IEEE International Symposium on Microarchitecture, pages 294–305, 2002.
[128] S. R. Vangal, J. Howard, G. Ruhl, S. Dighe, H. Wilson, J. Tschanz, D. Finan, A. Singh, T. Jacob, S. Jain, V. Erraguntla, C. Roberts, Y. Hoskote, N. Borkar, and S. Borkar. An 80-tile sub-100-W TeraFLOPS processor in 65-nm CMOS. IEEE Journal of Solid-State Circuits, 43:29–41, January 2008.
[129] Steven Cameron Woo, Moriyoshi Ohara, Evan Torrie, Jaswinder Pal Singh, and Anoop Gupta. The SPLASH-2 programs: characterization and methodological considerations. In ISCA '95: Proceedings of the 22nd Annual International Symposium on Computer Architecture, pages 24–36, New York, NY, USA, 1995. ACM.
[130] V. Yalala, D. Brasili, D. Carlson, A. Hughes, A. Jain, T. Kiszely, K. Kodandapani, A. Varadharajan, and T. Xanthopoulos. A 16-core RISC microprocessor with network extensions. In Solid-State Circuits Conference, 2006. ISSCC 2006. Digest of Technical Papers. IEEE International, pages 305–314, Feb. 2006.
[131] S. Rusu, S. Tam, H. Muljono, D. Ayers, J. Chang, B. Cherkauer, J. Stinson, J. Benoit, R. Varada, J. Leung, R. D. Limaye, and S. Vora. A 65-nm dual-core multithreaded Xeon processor with 16-MB L3 cache. Solid-State Circuits, IEEE Journal of, 42(1):17–25, 2007.
[132] J. L. Shin, K. Tam, D. Huang, B. Petrick, H. Pham, Changku Hwang, Hongping Li, A. Smith, T. Johnson, F. Schumacher, D. Greenhill, A. S. Leon, and A. Strong. A 40nm 16-core 128-thread CMT SPARC SoC processor. In Solid-State Circuits Conference Digest of Technical Papers (ISSCC), 2010 IEEE International, pages 98–99, Feb. 2010.


[133] Ron Ho, Kenneth W. Mai, and Mark A. Horowitz. The future of wires. In Proceedings of the IEEE, pages 490–504, 2001.
[134] http://www.hpl.hp.com/research/cacti. HP Labs: CACTI.
[135] http://quid.hpl.hp.com:9081/cacti. CACTI 5.3, rev 174.
[136] Niket Agarwal, Li-Shiuan Peh, and Niraj Jha. GARNET: A detailed interconnection network model inside a full-system simulation framework. Technical Report CE-P08-001, Princeton University, 2008.
[137] G. K. Konstadinidis, M. Tremblay, S. Chaudhry, M. Rashid, P. F. Lai, Y. Otaguro, Y. Orginos, and S. Parampalli. Architecture and physical implementation of a third-generation 65 nm, 16 core, 32 thread chip-multithreading SPARC processor. IEEE Journal of Solid-State Circuits, 44(1):7–17, 2009.
[138] Dejan Markovic, Vladimir Stojanovic, Borivoje Nikolic, Mark A. Horowitz, and Robert W. Brodersen. Methods for true energy-performance optimization. IEEE Journal of Solid-State Circuits, 39(8):1282–1293, August 2004.
[139] ETSI. Broadband Radio Access Networks (BRAN); HiperLAN Type 2; Physical (PHY) Layer. ETSI TS 101 475, V 1.2.2, 2001.
[140] Paul M. Heysters, Gerard K. Rauwerda, and Gerard J. M. Smit. Implementation of a HiperLAN/2 receiver on the reconfigurable Montium architecture. In 18th International Parallel and Distributed Processing Symposium, IPDPS 2004. IEEE, 2004.
[141] Tilera Corporation. TILE-Gx 3000 Series Overview, 2011.
[142] http://iverilog.wikia.com/. Icarus iVerilog, 2011.
[143] T. Fujiyoshi, S. Shiratake, S. Nomura, T. Nishikawa, Y. Kitasho, H. Arakida, Y. Okuda, Y. Tsuboi, M. Hamada, H. Hara, T. Fujita, F. Hatori, T. Shimazawa, K. Yahagi, H. Takeda, M. Murakata, F. Minami, N. Kawabe, T. Kitahara, K. Seta, M. Takahashi, Y. Oowaki, and T. Furuyama. A 63-mW H.264/MPEG-4 audio/visual codec LSI with module-wise dynamic voltage/frequency scaling. Solid-State Circuits, IEEE Journal of, 41(1):54–62, Jan. 2006.


BIBLIOGRAPHY 177codec lsi with module-wise dynamic voltage/frequency scaling. Solid-State Circuits,IEEE Journal of, 41(1):54 – 62, jan. 2006.[144] A. Luczak, P. Garstecki, O. Stankiewicz, <strong>and</strong> M. Stepniewska. Network-on-chipbased architecture of h.264 video decoder. In Signals <strong>and</strong> Electronic Systems, 2008.ICSES ’08. International Conference on, pages 419 –422, sept. 2008.[145] TileraGX. Tile-GX Processor Family, 2011.
