Logic Simulation - GPU Technology Conference

  • No tags were found...

Logic Simulation - GPU Technology Conference

GPU Accelerated High PerformanceLogic Simulation of Integrated CircuitsYangdong Steve Dengdengyd@tsinghua.edu.cnInstitute of MicroelectronicsTsinghua University

2Research Overview• Parallel algorithms/applications• Massively parallel simulation for System-on-Chips• Internet routing processing• GPU architecture• Heterogeneous integration for routing processing• Supporting task-pipelined parallelism on GPUs• Compiler• Polyhedral based automatic parallelization for GPUs• Automatic porting of ARM code to x86• Sponsored by China National Key Projects, CUDACenter of Excellence, Tsinghua-Intel Center of AdvancedMobile Computing Technology, Intel University Program,NVidia Professor Partnership Award11109876543210j'i'0 1 2 3 4 5 6 72

Outline• Motivation• Simulation algorithms• Massively parallel logic simulation• Gate Level simulation• Compiled HDL simulation• Ongoing work33

Integrated Circuits44

Simplified IC Design FlowSystem leveldescriptionSystem level simulationHDL descriptionHDL simulationSynthesisGate level netlistGate level simulationPhysical design5Layout5

Logic Simulation• Major method for IC design verification• Performed on a proper model of the design• Apply input patterns as stimuli• Observe output signals: satisfy design requirements?• The simulator evaluates the operation of the model1@0ns01@1ns1@0ns?@3ns0@0ns66

Complexity of Logic Simulation7• For a 1-bit register with 1 input• 2 possible states and 2 possible inputs• Number of test patterns required = 2 1 x2 1 =4• For a Z80 microprocessor (~5K gates)• Has 208 register bits and 13 primary inputs• Possible state transitions = 2 bits+inputs = 2 221• It takes 1 year to simulate all transitions at 1000MIPS• For a chip with 500M gatesd q clk q Next0 0 00 1 01 0 11 1 1• ????? years?• No exhaustive simulation any more• But has to meet a given coverage ratio7

Verification Gap• Functional verification has been the bottleneck!• Now consumes >60% of design turn-around time• Significantly lag behind fabrication capacityDistribution of IC design effortWidening design & verification gap88

Problem• Can we unleashing the power of GPUs forlogic simulation??99

Outline• Motivation• Simulation algorithms• Massively parallel logic simulation• Gate Level simulation• Compiled HDL simulation• Ongoing work1010

Event Driven Simulation• Most used algorithm for discrete event systems• Event: state transition + time stamp• Always evaluate an event with the earliest stampEvent Queue1@0nsaf?@3nsb: 01@1ns01@1ns1@0nsbceg?@3nse: 10@2ns1@0nsdf: 01@3ns11g: 01@3ns11

Synchronous Parallel Logic Simulation• Simultaneously process events with identical time-stamps• Insufficient parallelism• Generally

Asynchronous Parallel Logic Simulation• Conservative simulation13abcd• Chandy-Misra-Bryant (CMB) algorithm• Send events as messagesa=1@9nsProcess 1ed=1@7nsProcess 2• Optimistic simulation• Look-ahead simulationfe=1 until 10nsgProcess 3abcda=1@9nse=0@10nsee=0@10nsd=1@7nsa=1@9nse=0@10nsfgd=1@7nse=0@10ns13

Outline• Motivation• Simulation algorithms• Massively parallel logic simulation• Gate Level simulation• Compiled HDL simulation• Ongoing work1414

GPU Based CMB Parallel Simulation Flowwhile not finished// kernel 1: primary input updatefor each primary input(PI) doextract the first message in the PI queue;insert the message into the PI output array;end for// kernel 2: input pin updatefor each input pin doinsert messages from output to input pins;end for// kernel 3: gate evaluationfor each gate doextract the earliest message from its pins;evaluate messages and update gate status;write the gate output to the output array;end forend whileabcdProcess 2a=1@9nsfee=1 until 10nsgProcess 1d=1@7nsProcess 31515

Data Structure: Distributed Event Queues• Classical CMB allocates anevent queue to each gate• Need sorting when insertion• Hard to maintain on GPUa=1@9nsae=1@10nsebcde=1@10nsd=1@7nsPick up theearliest eventfgPick up theearliest evente: 10@5nse: 01@6nsa: 01@9nse: 10@5nse: 01@6nsd: 01@7ns16• Our work: every input pin of agate has its own event queue• A queue is naturally ordered• Irregular distribution ofrequired queue sizesPeaka# EventsCircuit 1 Circuit 2 Circuit 3on pinsfeb 0~99 34444 26106 158086c100~9999 737 1764 g40d >=10000 3 3 116

Dynamic GPU Memory Management• A dynamic memory allocator for GPU!• Proposed earlier than NVIDIA’s GPU memory allocatorpage_to_allocatepage_to_releasea16170…bef18………cdg170 1 2 3 4 5 6 7 8 9 10 11 12 13 14 1516 17 18 19 20 21 22 23 24 25 26 27 28 29 30 3132 … … … … … … … … …GPU Memory17

Dynamic GPU Memory Management• CPU-GPU collaborated memory management• Overlap update_pin(GPU) and update page_to_release(CPU)• Overlap evaluate_gate(GPU) and update page_to_allocate(CPU)• Exploit zero-copy to transfer memory maintenance informationGPU evaluate events update input update pins evaluate eventstimeCPUupdatepage_to_allocateupdatepage_to_releaseupdatepage_to_allocateZero-copy1818

Parallel Organization of Event Evaluation• A thread works on a gate• Event evaluation through truth-table lookup• Try assigning gates of the same type into a single warpBlock 2Block 1Block 0SIMD processinga b yNAND 0 * 1… … …NOR 1 * 0.. … … …1919

Gate Level Simulation Results• World’s fastest logic simulator on general purpose hardware 1• 30X speed-up on average 2 (100X for random patterns)• 1 month on CPU vs. 1 day on GPUDesign Baseline Simulator 1 (s) GPU-Based Simulator (s) SpeedupAES 109.90 4.45 24.70DES 183.11 4.50 40.66SHA 56.66 0.41 138.20R2000 9.20 3.15 2.92JPEG 136.33 43.09 3.16NOC 5389.42 347.95 15.49M1 118.48 22.43 5.28201Published on DAC 2010 and ACM Trans. on Design Automation 20112In-house simulator (~20% faster than Synopsys VCS)20

Outline• Motivation• Simulation algorithms• Massively parallel logic simulation• Gate Level simulation• Compiled HDL simulation• Ongoing work2121

Simulation of Hardware Description• Hardware Description Language (HDL)• Computer languages to describe a circuit’s operation, its design andorganization, and tests to verify its operation by means of simulationa=10@1nsb=2@1nsmodule example(input a, b, output x);wire c, d;always@(a,b)beginc = a + b; d = a - b;end;x=?@2nsProcess: the unit ofconcurrency22always@(c,d)beginx = c * d;end;endmodule;22

Parallel HDL Simulationalways@(a,b)beginf = a + b; g = a - b;end;always@(c,d,e)beginh = c + d; i = c + e; k = d - e;end;always@(f,h)beginx = f * h; y = f + h;end;23always@(posedge clk)beginz

Challenges• Verilog HDL as input• Similar to C but has a hardware orientedsemantics• Challenge 1:• Arbitrarily complex behavior in a Verilogprocess• Cannot be captured by truth tables• Challenge 2:• Divergent behaviors among differentprocesses and different sensitizations of asingle processVerilog HDLalways@(posedge clk)beginif (en == 1)q = d;endend;always@(a, b, op)beginif(op == 0)sum

Compiled Simulation• Translating Verilog into equivalent CUDA code• Also merge simple processesalways@(a,b)beginf

Handling the Semantic Differences• Careful on the semantic differences between HDL and CUDA!always@(clk)beginK2

Super-Multithreaded Execution• Divergence among differentprocesses Block 0 and differentsensitizations of the sameprocess• Will have zillions of divergentbranches• Solution• GPU as a super-multithreadwdprocessor• Drop SIMD executionVerilog HDLalways@(posedge clk)beginif (en == 1)q

CMB Based HDL Simulation2828

HDL Simulation Results• 20-50X speed-up depending on circuit structureDesignMentor GraphicsModelSim (s)GPU-BasedSimulator (s)SpeedupMux network forhierarchical bus83.32 4.019 20.73XAdder network forlarge numbers258.47 10.21 25.32XASIC(des) 178.66 3.54 50.47XASIC(aes) 104.63 2.67 39.19X*Published on ICCAD11 and Integration, the VLSI Journal2929

Related Publications• H. Qian and Y. Deng, “Accelerating RTL Simulation withGPUs,” IEEE/ACM International Conference on Computer-Aided Design, Nov. 2011.• Y. Zhu, B. Wang, and Y. Deng, “Massively Parallel LogicSimulation with GPUs, ” ACM Transaction on DesignAutomation of Electronics Systems, Vol.16, No.3, June, 2011.• B. Wang, Y. Zhu, and Y. Deng, “Distributed Time,Conservative Parallel Logic Simulation on GPUs,” IEEE/ACMDesign Automation Conference, Jun. 2010.3030

Outline• Motivation• Simulation algorithms• Massively parallel logic simulation• Gate Level simulation• Compiled HDL simulation• Ongoing work3131

Ongoing Work 1: Scalable MassivelyParallel Simulation• Now we have GPU-acceleratedlogic simulators• But single GPU cannot handle reallybig designs• How to scale GPU logic simulationto billion-transistor ICs?• Multi-GPU simulation is essential• However, fine-grain messages arehard to manage across PCIe buses32PCIe bus32

Hybrid Multi-GPU Based Logic Simulation• Intra-GPUaa=1@9nsf• Conservative simulation(i.e. CMB)bcee=1 until 10nsgdd=1@7ns• Inter-GPU• Optimistic simulationaa=1@9nsf• Less number of messagesbut needs roll-backbcee=1->0 @10nsgd33d=1@7ns33

Feasibility Study of Optimistic Simulation• Using a 64-core Tilera64 as prototyping platforms• Automatically translating Verilog HDL into C/C++• Automatic mapping “chunks of gates” to processors• Encouraging early results• A hybrid simulation engine is under development3434

Ongoing Work 2: System Level Simulationfor System-on-Chips (SoCs)• 4 Cortex-A9 cores in high-performance circuit• 1 Cortex-A9 core in low-power circuit• 12-core GeForce GPU• 1 ARM7 core for power management• HD video encoder/decoder• Audio processor• Security engine• HDMI3535

SoC Design Flow#include "shiftreg.h"int sc_main(int argc, char* argv[]) {sc_signal reset, din, dout;sc_clock clk("clk",10,SC_NS);// Local signals// Create a 10ns period clock signalElectronic SystemLevel Designshiftreg DUT("shiftreg");DUT.clk(clk);// Instantiate Device Under Test// Connect portsDUT.reset(reset);DUT.din(din);DUT.dout(dout);…}Input SpecificationArchitecture explorationSoCSimulationModelHW/SW Partitioning36Application Mapping36

Electronic System Level (ESL) Design• ESL design of SoC heavily depends onsimulation• Design verification and validation• Architecture exploration• Performance analysis• Early software development• Existing problems• System level simulation is costly in time• Lack of support for capturing full-system I/Obehaviors• Need multiple abstraction levelsPoliciesMentor Graphics’ Vista ToolVista Model BuilderTLMTLMOn Chip BusTLMTLMTLMProcessorTLMTransaction Level PlatformRTL IP37Verification & validationSW developmentLatencyPower37

A Hardware Accelerated SoC SimulationPlatformHAPS FPGAboardCPUGraphics CardGraphics code38• Key technologies• A unified GPU-based asynchronous logic simulation engine• SystemC/Verilog-to-CUDA translation engineGPU-accelerated full-system simulation forSystem-on-Chips• Compiling and running RTL code on FPGA prototyping board• Native running of graphics application on a graphics card• Emulate system peripherals with host PC’s peripherals38

Conclusion• Logic simulation is the essentialmeans of IC verification• World’s 1st successful asynchronouslogic simulation engine on manycore• Already done• Gate level simulation (30X speedup)• Verilog simulation (40X speedup)• Ongoing• Multi-GPU simulation• Full-system behavior simulationsupporting mixed abstraction levelsBlock 2Block 1HAPS FPGABoardBlock 0Graphics CardSIMD processingCPU3939

Future Work• Potential applications• Network simulation• Factory simulation• Military simulation• Business process simulation• …• Conservative and optimistic simulations areactually parallel scheduling mechanisms• Potentially better load balance than otherapproaches (e.g., work stealing)4040

Acknowledgement• Contributions from David Wang (StanfordUniversity), Yuhao Zhu (University of Texas atAustin), Hao Qian (AMD), and Lingfeng Wang(Tsinghua University)• Insightful discussions with Dr. Gilbert Chen(Sandvine), Prof. Radu Marculescu (CarnegieMellon University), Dr. Li Shang (Intel), Xin Zhou(Intel), and Jen-Hsun Huang (NVIDIA)4141

More magazines by this user
Similar magazines