Chapter 4 Introduction

Morgan Kaufmann Publishers, 31 January 2013
CSE 420, Chapter 4: The Processor

§4.1 Introduction

Introduction
- CPU performance factors
  - Instruction count
    - Determined by ISA and compiler
  - CPI and cycle time
    - Determined by CPU hardware
- We will examine two MIPS implementations
  - A simplified version
  - A more realistic pipelined version
- Simple subset, shows most aspects
  - Memory reference: lw, sw
  - Arithmetic/logical: add, sub, and, or, slt
  - Control transfer: beq, j

Instruction Execution
- PC → instruction memory, fetch instruction
- Register numbers → register file, read registers
- Depending on instruction class
  - Use ALU to calculate
    - Arithmetic result
    - Memory address for load/store
    - Branch target address
  - Access data memory for load/store
  - PC ← target address or PC + 4

CPU Overview
[Figure]

Multiplexers
- Can’t just join wires together
  - Use multiplexers
[Figure]

Control
[Figure]

§4.2 Logic Design Conventions

Logic Design Basics
- Information encoded in binary
  - Low voltage = 0, High voltage = 1
  - One wire per bit
  - Multi-bit data encoded on multi-wire buses
- Combinational element
  - Operate on data
  - Output is a function of input
- State (sequential) elements
  - Store information

Combinational Elements
- AND-gate
  - Y = A & B
- Adder
  - Y = A + B
- Multiplexer
  - Y = S ? I1 : I0
- Arithmetic/Logic Unit
  - Y = F(A, B)
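The combinational elements listed above are pure functions of their inputs, which a few lines of Python can illustrate. This is only a sketch: the 32-bit bus width (`MASK32`) and the function names are assumptions made for the example, not part of the slides.

```python
MASK32 = 0xFFFFFFFF  # assumed 32-bit buses, as in the MIPS datapath


def and_gate(a: int, b: int) -> int:
    """Y = A & B, bitwise over the whole bus."""
    return a & b & MASK32


def adder(a: int, b: int) -> int:
    """Y = A + B, wrapping around at 32 bits."""
    return (a + b) & MASK32


def mux(s: int, i0: int, i1: int) -> int:
    """Y = S ? I1 : I0 (select input 1 when S is 1, else input 0)."""
    return i1 if s else i0
```

None of these functions keeps any state between calls, which is exactly what separates them from the sequential elements introduced next.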

Sequential Elements
- Register: stores data in a circuit
  - Uses a clock signal to determine when to update the stored value
  - Edge-triggered: update when Clk changes, e.g. from 0 to 1
[Figure: register with inputs D and Clk and output Q, with timing diagram]

Sequential Elements
- Register with write control
  - Only updates on clock edge when write control input is 1
  - Used when stored value is required later
[Figure: register with inputs D, Write, and Clk and output Q, with timing diagram]

Clocking Methodology
- Combinational logic transforms data during clock cycles
  - Between clock edges
  - Input from state elements, output to state element
  - Longest delay determines clock period
[Figure]

§4.3 Building a Datapath

Building a Datapath
- Datapath
  - Elements that process data and addresses in the CPU
    - Registers, ALUs, mux’s, memories, …
- We will build a MIPS datapath incrementally
  - Refining the overview design

Instruction Fetch
[Figure: the PC is a 32-bit register; an adder increments it by 4 for the next instruction]

R-Format Instructions
- Read two register operands
- Perform arithmetic/logical operation
- Write register result
[Figure]

Load/Store Instructions
- Read register operands
- Calculate address using 16-bit offset
  - Use ALU, but sign-extend offset
- Load: Read memory and update register
- Store: Write register value to memory

Branch Instructions
- Read register operands
- Compare operands
  - Use ALU, subtract and check Zero output
- Calculate target address
  - Sign-extend displacement
  - Shift left 2 places (word displacement)
  - Add to PC + 4
    - Already calculated by instruction fetch
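The branch target calculation above (sign-extend, shift left 2, add to PC + 4) can be checked with a short sketch. The function names are hypothetical. The sample values come from the branch-taken example later in the chapter, where a beq at address 40 with displacement 7 targets address 72.

```python
def sign_extend16(imm: int) -> int:
    """Interpret the low 16 bits of imm as a two's-complement value."""
    imm &= 0xFFFF
    return imm - 0x10000 if imm & 0x8000 else imm


def branch_target(pc: int, imm16: int) -> int:
    """Target = (PC + 4) + (sign-extended displacement << 2), kept to 32 bits."""
    return (pc + 4 + (sign_extend16(imm16) << 2)) & 0xFFFFFFFF


print(branch_target(40, 7))  # 40 + 4 + 7*4 = 72
```

A displacement of 0xFFFF (that is, -1) makes the branch target the branch instruction itself, since PC + 4 - 4 = PC.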

Branch Instructions
[Figure: the shift left by 2 just re-routes wires; sign extension replicates the sign-bit wire]

Composing the Elements
- First-cut data path does an instruction in one clock cycle
  - Each data-path element can only do one function at a time
  - Hence, we need separate instruction and data memories
- Use multiplexers where alternate data sources are used for different instructions

R-Type/Load/Store Datapath
[Figure]

Full Data Path
[Figure]

§4.4 A Simple Implementation Scheme

ALU Control
- ALU used for
  - Load/Store: F = add
  - Branch: F = subtract
  - R-type: F depends on funct field

ALU control  Function
0000         AND
0001         OR
0010         add
0110         subtract
0111         set-on-less-than
1100         NOR

ALU Control
- Assume 2-bit ALUOp derived from opcode
  - Combinational logic derives ALU control

opcode  ALUOp  Operation         funct   ALU function      ALU control
lw      00     load word         XXXXXX  add               0010
sw      00     store word        XXXXXX  add               0010
beq     01     branch equal      XXXXXX  subtract          0110
R-type  10     add               100000  add               0010
               subtract          100010  subtract          0110
               AND               100100  AND               0000
               OR                100101  OR                0001
               set-on-less-than  101010  set-on-less-than  0111
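The ALU control table above is small enough to write out as a lookup. The following is a minimal sketch of that combinational logic. The function and dictionary names are illustrative, and real hardware would implement this with gates rather than a table.

```python
# funct field -> 4-bit ALU control, for R-type instructions (ALUOp = 10)
FUNCT_TO_CONTROL = {
    0b100000: 0b0010,  # add
    0b100010: 0b0110,  # subtract
    0b100100: 0b0000,  # AND
    0b100101: 0b0001,  # OR
    0b101010: 0b0111,  # set-on-less-than
}


def alu_control(alu_op: int, funct: int = 0) -> int:
    """Derive the 4-bit ALU control from the 2-bit ALUOp and the funct field."""
    if alu_op == 0b00:  # lw / sw: funct is a don't-care, ALU adds
        return 0b0010
    if alu_op == 0b01:  # beq: funct is a don't-care, ALU subtracts
        return 0b0110
    if alu_op == 0b10:  # R-type: operation depends on funct
        return FUNCT_TO_CONTROL[funct]
    raise ValueError("ALUOp value not covered by the table")
```

For example, `alu_control(0b10, 0b101010)` yields `0b0111`, the set-on-less-than row of the table.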

The Main Control Unit
- Control signals derived from instruction

R-type      0         rs     rt     rd     shamt  funct
            31:26     25:21  20:16  15:11  10:6   5:0
Load/Store  35 or 43  rs     rt     address
            31:26     25:21  20:16  15:0
Branch      4         rs     rt     address
            31:26     25:21  20:16  15:0

- Bits 31:26: opcode
- rs: always read
- rt: read, except for load
- rd (R-type) or rt (load): write for R-type and load
- address: sign-extend and add

Data Path With Control
[Figure]

R-Type Instruction
[Figure]

Load Instruction
[Figure]

Branch-on-Equal Instruction
[Figure]

Implementing Jumps

Jump  2      address
      31:26  25:0

- Jump uses word address
- Update PC with concatenation of
  - Top 4 bits of old PC
  - 26-bit jump address
  - 00
- Need an extra control signal decoded from opcode
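The concatenation that forms the jump target can be sketched with masks and shifts. The function name is hypothetical, and the sketch follows the slide's wording in taking the top 4 bits from the old PC.

```python
def jump_target(pc: int, addr26: int) -> int:
    """Concatenate: top 4 bits of PC | 26-bit jump address | 00."""
    upper = pc & 0xF0000000               # top 4 bits of the old PC
    middle = (addr26 & 0x03FFFFFF) << 2   # 26-bit word address, shifted to bytes
    return upper | middle


print(hex(jump_target(0x40000000, 0x100)))  # 0x40000400
```

Because only 26 bits come from the instruction, a jump can reach targets only within the 256 MB region selected by the upper 4 PC bits.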

Datapath With Jumps Added
[Figure]

Performance Issues
- Longest delay determines clock period
  - Critical path: load instruction
  - Instruction memory → register file → ALU → data memory → register file
- Not feasible to vary period for different instructions
- Violates design principle
  - Making the common case fast
- We will improve performance by pipelining

§4.5 An Overview of Pipelining

Pipelining Analogy
- Pipelined laundry: overlapping execution
  - Parallelism improves performance
- Four stages: Wash, Dry, Fold, Hang
- Four loads:
  - Speedup = 8/3.5 = 2.3
- Non-stop:
  - Speedup = 2n/(0.5n + 1.5) ≈ 4 = number of stages
[Figure]

MIPS Pipeline
- Five stages, one step per stage
  1. IF: Instruction fetch from memory
  2. ID: Instruction decode & register read
  3. EX: Execute operation or calculate address
  4. MEM: Access memory operand
  5. WB: Write result back to register
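The speedup figures in the laundry analogy follow from a simple count. The sketch below assumes four stages of half an hour each, which is what the formula 2n/(0.5n + 1.5) implies. The function name and parameter names are illustrative.

```python
def laundry_speedup(loads: int, stages: int = 4, stage_hours: float = 0.5) -> float:
    """Ratio of sequential time to pipelined time for a number of loads."""
    sequential = loads * stages * stage_hours        # 2n hours
    pipelined = (stages + loads - 1) * stage_hours   # 0.5n + 1.5 hours
    return sequential / pipelined


print(round(laundry_speedup(4), 1))          # 2.3
print(round(laundry_speedup(1_000_000), 3))  # 4.0
```

The pipelined time is the time to fill the pipeline with the first load plus one stage time for each further load, so the speedup approaches the number of stages only as the number of loads grows.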

Pipeline Performance
- Assume time for stages is
  - 100ps for register read or write
  - 200ps for other stages
- Compare pipelined datapath with single-cycle datapath

Instr     Instr fetch  Register read  ALU op  Memory access  Register write  Total time
lw        200ps        100ps          200ps   200ps          100ps           800ps
sw        200ps        100ps          200ps   200ps                          700ps
R-format  200ps        100ps          200ps                  100ps           600ps
beq       200ps        100ps          200ps                                  500ps

Pipeline Performance
[Figure: single-cycle (Tc = 800ps) versus pipelined (Tc = 200ps) execution]
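Using the stage times from the table, total execution time for a run of instructions can be compared under both clocking schemes. This is a sketch that assumes an ideal pipeline with no stalls. The names are illustrative.

```python
# Stage latencies in picoseconds, taken from the table above.
STAGE_PS = {"IF": 200, "ID": 100, "EX": 200, "MEM": 200, "WB": 100}


def single_cycle_time_ps(n: int) -> int:
    """Every instruction takes one long cycle sized for lw (800 ps)."""
    return n * sum(STAGE_PS.values())


def pipelined_time_ps(n: int) -> int:
    """Clock is set by the slowest stage (200 ps); assumes no stalls."""
    clock = max(STAGE_PS.values())
    return (len(STAGE_PS) + n - 1) * clock


print(single_cycle_time_ps(3), pipelined_time_ps(3))  # 2400 1400
```

For three instructions the gain is modest, but for a long run the ratio approaches 800/200 = 4 rather than 5, because the stages are not balanced.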

Pipeline Speedup
- If all stages are balanced
  - i.e., all take the same time
  - Time between instructions (pipelined) = Time between instructions (nonpipelined) / Number of stages
- If not balanced, speedup is less
- Speedup is due to increased throughput
  - Latency (time for each instruction) does not decrease

Pipelining and ISA Design
- MIPS ISA designed for pipelining
  - All instructions are 32 bits
    - Easier to fetch and decode in one cycle
    - cf. x86: 1- to 17-byte instructions
  - Few and regular instruction formats
    - Can decode and read registers in one step
  - Load/store addressing
    - Can calculate address in 3rd stage, access memory in 4th stage
  - Alignment of memory operands
    - Memory access takes only one cycle

Hazards
- Situations that prevent starting the next instruction in the next cycle
- Structural hazard
  - A required resource is busy
- Data hazard
  - Need to wait for previous instruction to complete its data read/write
- Control hazard
  - Deciding on control action depends on previous instruction

Structure Hazards
- Conflict for use of a resource
- In MIPS pipeline with a single memory
  - Load/store requires data access
  - Instruction fetch would have to stall for that cycle
    - Would cause a pipeline “bubble”
- Hence, pipelined data paths require separate instruction & data memories
  - So separate instruction & data caches

Data Hazards
- An instruction depends on completion of data access by a previous instruction

    add $s0, $t0, $t1
    sub $t2, $s0, $t3

[Figure]

Forwarding (a.k.a. Bypassing)
- Use result when it is computed
  - Don’t wait for it to be stored in a register
  - Requires extra connections in the datapath
[Figure]

Load-Use Data Hazard
- Can’t always avoid stalls by forwarding
  - If value not computed when needed
  - Can’t forward backward in time!
[Figure]

Code Scheduling to Avoid Stalls
- Reorder code to avoid use of load result in the next instruction
- C code for A = B + E; C = B + F;

Original order:

    lw  $t1, 0($t0)
    lw  $t2, 4($t0)
    (stall)
    add $t3, $t1, $t2
    sw  $t3, 12($t0)
    lw  $t4, 8($t0)
    (stall)
    add $t5, $t1, $t4
    sw  $t5, 16($t0)

Reordered:

    lw  $t1, 0($t0)
    lw  $t2, 4($t0)
    lw  $t4, 8($t0)
    add $t3, $t1, $t2
    sw  $t3, 12($t0)
    add $t5, $t1, $t4
    sw  $t5, 16($t0)

- Eliminates 2 stall cycles
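The stall counts for the two orderings of the A = B + E; C = B + F code can be reproduced by scanning for a load whose result is used by the very next instruction. This sketch assumes a five-stage pipeline with full forwarding, so that only load-use pairs stall. The tuple encoding `(op, destination, sources)` is an assumption made for the example.

```python
# Each instruction: (opcode, destination register or None, source registers)
ORIGINAL = [
    ("lw", "$t1", ["$t0"]),
    ("lw", "$t2", ["$t0"]),
    ("add", "$t3", ["$t1", "$t2"]),
    ("sw", None, ["$t3", "$t0"]),
    ("lw", "$t4", ["$t0"]),
    ("add", "$t5", ["$t1", "$t4"]),
    ("sw", None, ["$t5", "$t0"]),
]

SCHEDULED = [
    ("lw", "$t1", ["$t0"]),
    ("lw", "$t2", ["$t0"]),
    ("lw", "$t4", ["$t0"]),
    ("add", "$t3", ["$t1", "$t2"]),
    ("sw", None, ["$t3", "$t0"]),
    ("add", "$t5", ["$t1", "$t4"]),
    ("sw", None, ["$t5", "$t0"]),
]


def load_use_stalls(program):
    """Count instructions that read a register loaded by the previous lw."""
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        op, dest, _ = prev
        if op == "lw" and dest in cur[2]:
            stalls += 1
    return stalls


print(load_use_stalls(ORIGINAL), load_use_stalls(SCHEDULED))  # 2 0
```

With 7 instructions in a 5-stage pipeline, that corresponds to 5 + 6 + 2 = 13 cycles for the original order and 11 cycles after reordering.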

Control Hazards
- Branch determines flow of control
  - Fetching next instruction depends on branch outcome
    - Still working on ID stage of branch
- In MIPS pipeline
  - Need to compare registers and compute target early in the pipeline
  - Add hardware to do it in ID stage

Stall on Branch
- Wait until branch outcome determined before fetching next instruction
[Figure]

Branch Prediction
- Longer pipelines cannot readily determine branch outcome early
  - Stall penalty becomes unacceptable
- Predict outcome of branch
  - Only stall if prediction is wrong
- In MIPS pipeline
  - Can predict branches not taken
  - Fetch instruction after branch, with no delay

MIPS with Predict Not Taken
[Figure: two pipeline diagrams, one with the prediction correct and one with the prediction incorrect]

More-Realistic Branch Prediction
- Static branch prediction
  - Based on typical branch behavior
  - Example: loop and if-statement branches
    - Predict backward branches taken
    - Predict forward branches not taken
- Dynamic branch prediction
  - Hardware measures actual branch behavior
    - e.g., record recent history of each branch
  - Assume future behavior will continue the trend
    - When wrong, stall while re-fetching, and update history

Pipeline Summary
The BIG Picture
- Pipelining improves performance by increasing instruction throughput
  - Executes multiple instructions in parallel
  - Each instruction has the same latency
- Subject to hazards
  - Structure, data, control
- Instruction set design affects complexity of pipeline implementation

§4.6 Pipelined Datapath and Control

MIPS Pipelined Datapath
[Figure: pipelined datapath with the MEM and WB paths marked; right-to-left flow leads to hazards]

Pipeline registers
- Need registers between stages
  - To hold information produced in previous cycle
[Figure]

Pipeline Operation
- Cycle-by-cycle flow of instructions through the pipelined data path
  - “Single-clock-cycle” pipeline diagram
    - Shows pipeline usage in a single cycle
    - Highlight resources used
  - cf. “multi-clock-cycle” diagram
    - Graph of operation over time
- We’ll look at “single-clock-cycle” diagrams for load & store

IF for Load, Store, …
[Figure]

ID for Load, Store, …
[Figure]

EX for Load
[Figure]

MEM for Load
[Figure]

WB for Load
[Figure: annotation “Wrong register number”]

Corrected Datapath for Load
[Figure]

EX for Store
[Figure]

MEM for Store
[Figure]

WB for Store
[Figure]

Multi-Cycle Pipeline Diagram
[Figure]

Multi-Cycle Pipeline Diagram
[Figure]

Single-Cycle Pipeline Diagram
[Figure]

Pipelined Control (Simplified)
[Figure]

Pipelined Control
- Control signals derived from instruction
  - As in single-cycle implementation
[Figure]

Pipelined Control
[Figure]

§4.7 Data Hazards: Forwarding vs. Stalling

Data Hazards in ALU Instructions
- Consider this sequence:

    sub $2, $1, $3
    and $12, $2, $5
    or  $13, $6, $2
    add $14, $2, $2
    sw  $15, 100($2)

- We can resolve hazards with forwarding
  - How do we detect when to forward?

Dependencies & Forwarding
[Figure: blue lines show the problem; red lines represent forwarding]

Detecting the Need to Forward
- Pass register numbers along pipeline
  - e.g., ID/EX.RegisterRs = register number for Rs sitting in ID/EX pipeline register
- ALU operand register numbers in EX stage are given by
  - ID/EX.RegisterRs, ID/EX.RegisterRt
- Data hazards when
  1a. EX/MEM.RegisterRd = ID/EX.RegisterRs
  1b. EX/MEM.RegisterRd = ID/EX.RegisterRt
  2a. MEM/WB.RegisterRd = ID/EX.RegisterRs
  2b. MEM/WB.RegisterRd = ID/EX.RegisterRt
  - 1a, 1b: forward from EX/MEM pipeline register
  - 2a, 2b: forward from MEM/WB pipeline register

Detecting the Need to Forward
- But only if forwarding instruction will write to a register!
  - EX/MEM.RegWrite, MEM/WB.RegWrite
- And only if Rd for that instruction is not $zero
  - EX/MEM.RegisterRd ≠ 0, MEM/WB.RegisterRd ≠ 0

Forwarding Paths
[Figure]

Forwarding Conditions
- EX hazard
  - if (EX/MEM.RegWrite and (EX/MEM.RegisterRd ≠ 0)
        and (EX/MEM.RegisterRd = ID/EX.RegisterRs))
      ForwardA = 10
  - if (EX/MEM.RegWrite and (EX/MEM.RegisterRd ≠ 0)
        and (EX/MEM.RegisterRd = ID/EX.RegisterRt))
      ForwardB = 10
- MEM hazard
  - if (MEM/WB.RegWrite and (MEM/WB.RegisterRd ≠ 0)
        and (MEM/WB.RegisterRd = ID/EX.RegisterRs))
      ForwardA = 01
  - if (MEM/WB.RegWrite and (MEM/WB.RegisterRd ≠ 0)
        and (MEM/WB.RegisterRd = ID/EX.RegisterRt))
      ForwardB = 01

Double Data Hazard
- Consider the sequence:

    add $1, $1, $2
    add $1, $1, $3
    add $1, $1, $4

- Both hazards occur
  - Want to use the most recent
- Revise MEM hazard condition
  - Only fwd if EX hazard condition isn’t true

Revised Forwarding Condition
- MEM hazard
  - if (MEM/WB.RegWrite and (MEM/WB.RegisterRd ≠ 0)
        and not (EX/MEM.RegWrite and (EX/MEM.RegisterRd ≠ 0)
                 and (EX/MEM.RegisterRd = ID/EX.RegisterRs))
        and (MEM/WB.RegisterRd = ID/EX.RegisterRs))
      ForwardA = 01
  - if (MEM/WB.RegWrite and (MEM/WB.RegisterRd ≠ 0)
        and not (EX/MEM.RegWrite and (EX/MEM.RegisterRd ≠ 0)
                 and (EX/MEM.RegisterRd = ID/EX.RegisterRt))
        and (MEM/WB.RegisterRd = ID/EX.RegisterRt))
      ForwardB = 01
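The EX and revised MEM hazard conditions can be folded into one selection function per ALU operand. In the sketch below, checking the EX hazard first and returning early plays the role of the "and not (EX hazard)" term. The function name and argument names are hypothetical. Call it once with ID/EX.RegisterRs to get ForwardA and once with ID/EX.RegisterRt to get ForwardB.

```python
def forward_select(ex_mem_regwrite: bool, ex_mem_rd: int,
                   mem_wb_regwrite: bool, mem_wb_rd: int,
                   id_ex_reg: int) -> int:
    """Return the 2-bit forwarding mux select for one ALU operand."""
    # EX hazard: most recent result, still in the EX/MEM pipeline register.
    if ex_mem_regwrite and ex_mem_rd != 0 and ex_mem_rd == id_ex_reg:
        return 0b10
    # MEM hazard: only reached when the EX hazard condition is false.
    if mem_wb_regwrite and mem_wb_rd != 0 and mem_wb_rd == id_ex_reg:
        return 0b01
    return 0b00  # no forwarding: use the value read from the register file


# Double data hazard (add $1,$1,$2; add $1,$1,$3; add $1,$1,$4):
# both pipeline registers hold a pending write to $1, so EX/MEM wins.
print(forward_select(True, 1, True, 1, 1))  # 2, i.e. binary 10
```

Writes to register 0 never forward, since $zero must always read as 0.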

Datapath with Forwarding
[Figure]

Load-Use Data Hazard
[Figure: need to stall for one cycle]

Load-Use Hazard Detection
- Check when using instruction is decoded in ID stage
- ALU operand register numbers in ID stage are given by
  - IF/ID.RegisterRs, IF/ID.RegisterRt
- Load-use hazard when
  - ID/EX.MemRead and
    ((ID/EX.RegisterRt = IF/ID.RegisterRs) or
     (ID/EX.RegisterRt = IF/ID.RegisterRt))
- If detected, stall and insert bubble

How to Stall the Pipeline
- Force control values in ID/EX register to 0
  - EX, MEM and WB do nop (no-operation)
- Prevent update of PC and IF/ID register
  - Using instruction is decoded again
  - Following instruction is fetched again
  - 1-cycle stall allows MEM to read data for lw
    - Can subsequently forward to EX stage
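The load-use hazard condition above translates directly into a boolean function. This is a minimal sketch with hypothetical argument names mirroring the pipeline register fields.

```python
def load_use_hazard(id_ex_memread: bool, id_ex_rt: int,
                    if_id_rs: int, if_id_rt: int) -> bool:
    """True when the instruction in ID needs the result of a load in EX."""
    return bool(id_ex_memread and
                (id_ex_rt == if_id_rs or id_ex_rt == if_id_rt))


# lw $2, 20($1) in EX followed by and $4, $2, $5 in ID: must stall.
print(load_use_hazard(True, 2, 2, 5))  # True
```

When the function returns True, the hazard unit would zero the control values entering ID/EX and hold the PC and IF/ID register for one cycle, as the stall steps describe.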

Stall/Bubble in the Pipeline
[Figure: stall inserted here]

Stall/Bubble in the Pipeline
[Figure: or, more accurately…]

Datapath with Hazard Detection
[Figure]

Stalls and Performance
The BIG Picture
- Stalls reduce performance
  - But are required to get correct results
- Compiler can arrange code to avoid hazards and stalls
  - Requires knowledge of the pipeline structure

§4.8 Control Hazards

Branch Hazards
- If branch outcome determined in MEM
[Figure: flush these instructions (set control values to 0); PC]

Reducing Branch Delay
- Move hardware to determine outcome to ID stage
  - Target address adder
  - Register comparator
- Example: branch taken

    36: sub $10, $4, $8
    40: beq $1, $3, 7
    44: and $12, $2, $5
    48: or  $13, $2, $6
    52: add $14, $4, $2
    56: slt $15, $6, $7
        ...
    72: lw  $4, 50($7)

Example: Branch Taken
[Figure]

Example: Branch Taken
[Figure]

Data Hazards for Branches
- If a comparison register is a destination of 2nd or 3rd preceding ALU instruction

    add $1, $2, $3        IF  ID  EX  MEM WB
    add $4, $5, $6            IF  ID  EX  MEM WB
    …                             IF  ID  EX  MEM WB
    beq $1, $4, target                IF  ID  EX  MEM WB

  - Can resolve using forwarding

Data Hazards for Branches
- If a comparison register is a destination of preceding ALU instruction or 2nd preceding load instruction
  - Need 1 stall cycle

    lw  $1, addr          IF  ID  EX  MEM WB
    add $4, $5, $6            IF  ID  EX  MEM WB
    beq stalled                   IF  ID
    beq $1, $4, target                ID  EX  MEM WB

Data Hazards for Branches
- If a comparison register is a destination of immediately preceding load instruction
  - Need 2 stall cycles

    lw  $1, addr          IF  ID  EX  MEM WB
    beq stalled               IF  ID
    beq stalled                   ID
    beq $1, $0, target                ID  EX  MEM WB

Dynamic Branch Prediction
- In deeper and superscalar pipelines, branch penalty is more significant
- Use dynamic prediction
  - Branch prediction buffer (a.k.a. branch history table)
  - Indexed by recent branch instruction addresses
  - Stores outcome (taken/not taken)
  - To execute a branch
    - Check table, expect the same outcome
    - Start fetching from fall-through or target
    - If wrong, flush pipeline and flip prediction

1-Bit Predictor: Shortcoming
- Inner loop branches mispredicted twice!

    outer: …
           …
    inner: …
           …
           beq …, …, inner
           …
           beq …, …, outer

  - Mispredict as taken on last iteration of inner loop
  - Then mispredict as not taken on first iteration of inner loop next time around

2-Bit Predictor
- Only change prediction on two successive mispredictions
[Figure: predictor states labeled “weak” and “strong”]
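The difference between the two predictors shows up in a small simulation of the inner-loop branch. This sketch models the 2-bit predictor as a saturating counter, which is one common realization and an assumption here, since the slide's state diagram may differ in how the weak states transition. The starting states and the 4-iteration inner loop are also chosen only for illustration.

```python
def mispredictions_1bit(outcomes, predict_taken=True):
    """1-bit predictor: always predict the previous outcome."""
    misses = 0
    for taken in outcomes:
        if taken != predict_taken:
            misses += 1
            predict_taken = taken  # flip prediction on every miss
    return misses


def mispredictions_2bit(outcomes, counter=3):
    """2-bit saturating counter: 0-1 predict not taken, 2-3 predict taken."""
    misses = 0
    for taken in outcomes:
        if taken != (counter >= 2):
            misses += 1
        counter = min(3, counter + 1) if taken else max(0, counter - 1)
    return misses


# Inner-loop branch: taken 3 times, then falls through; outer loop runs 3 times.
inner_branch = ([True] * 3 + [False]) * 3
print(mispredictions_1bit(inner_branch))  # 5
print(mispredictions_2bit(inner_branch))  # 3
```

After the first pass, the 1-bit predictor misses twice per pass through the inner loop (on exit and on re-entry), while the 2-bit predictor misses only on exit.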

Calculating the Branch Target
- Even with predictor, still need to calculate the target address
  - 1-cycle penalty for a taken branch
- Branch target buffer
  - Cache of target addresses
  - Indexed by PC when instruction fetched
    - If hit and instruction is branch predicted taken, can fetch target immediately

§4.9 Exceptions

Exceptions and Interrupts
- “Unexpected” events requiring change in flow of control
  - Different ISAs use the terms differently
- Exception
  - Arises within the CPU
    - e.g., undefined opcode, overflow, syscall, …
- Interrupt
  - From an external I/O controller
- Dealing with them without sacrificing performance is hard

Handling Exceptions
- In MIPS, exceptions managed by a System Control Coprocessor (CP0)
- Save PC of offending (or interrupted) instruction
  - In MIPS: Exception Program Counter (EPC)
- Save indication of the problem
  - In MIPS: Cause register
  - We’ll assume 1-bit
    - 0 for undefined opcode, 1 for overflow
- Jump to handler at 8000 0180

An Alternate Mechanism
- Vectored Interrupts
  - Handler address determined by the cause
- Example:
  - Undefined opcode: C000 0000
  - Overflow: C000 0020
  - …: C000 0040
- Instructions either
  - Deal with the interrupt, or
  - Jump to real handler
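The example vector addresses above are spaced 0x20 bytes apart, which leaves room for eight 32-bit instructions per entry. A sketch of the address calculation follows, under the assumption that causes are numbered 0, 1, 2, … in the order listed. The slides define only codes 0 and 1, so the numbering beyond that and the names used here are illustrative.

```python
VECTOR_BASE = 0xC0000000   # first handler address in the example
VECTOR_STRIDE = 0x20       # spacing between the example addresses (8 instructions)


def vector_address(cause: int) -> int:
    """Handler address for a given cause code (assumed numbering 0, 1, 2, ...)."""
    return VECTOR_BASE + cause * VECTOR_STRIDE


print(hex(vector_address(1)))  # 0xc0000020 (overflow)
```

Eight instructions is enough to handle a trivial case in place, which is why each entry otherwise just jumps to the real handler.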

Handler Actions
- Read cause, and transfer to relevant handler
- Determine action required
- If restartable
  - Take corrective action
  - Use EPC to return to program
- Otherwise
  - Terminate program
  - Report error using EPC, cause, …

Exceptions in a Pipeline
- Another form of control hazard
- Consider overflow on add in EX stage

    add $1, $2, $1

  - Prevent $1 from being clobbered
  - Complete previous instructions
  - Flush add and subsequent instructions
  - Set Cause and EPC register values
  - Transfer control to handler
- Similar to mispredicted branch
  - Use much of the same hardware

Pipeline with Exceptions
[Figure]

Exception Properties
- Restartable exceptions
  - Pipeline can flush the instruction
  - Handler executes, then returns to the instruction
    - Refetched and executed from scratch
- PC saved in EPC register
  - Identifies causing instruction
  - Actually PC + 4 is saved
    - Handler must adjust

Exception Example
- Exception on add in

    40  sub $11, $2, $4
    44  and $12, $2, $5
    48  or  $13, $2, $6
    4C  add $1, $2, $1
    50  slt $15, $6, $7
    54  lw  $16, 50($7)
    …

- Handler

    80000180  sw $25, 1000($0)
    80000184  sw $26, 1004($0)
    …

Exception Example
[Figure]

Exception Example
[Figure]

Multiple Exceptions
- Pipelining overlaps multiple instructions
  - Could have multiple exceptions at once
- Simple approach: deal with exception from earliest instruction
  - Flush subsequent instructions
  - “Precise” exceptions
- In complex pipelines
  - Multiple instructions issued per cycle
  - Out-of-order completion
  - Maintaining precise exceptions is difficult!

Imprecise Exceptions
- Just stop pipeline and save state
  - Including exception cause(s)
- Let the handler work it out
  - Which instruction(s) had exceptions
  - Which to complete or flush
    - May require “manual” completion
- Simplifies hardware, but more complex handler software
- Not feasible for complex, multiple-issue, out-of-order pipelines

§4.10 Parallelism and Advanced Instruction Level Parallelism

Instruction-Level Parallelism (ILP)
- Pipelining: executing multiple instructions in parallel
- To better utilize ILP
  - Deeper pipeline
    - Less work per stage ⇒ shorter clock cycle
  - Multiple issue
    - Replicate pipeline stages ⇒ multiple pipelines
    - Start multiple instructions per clock cycle
    - CPI < 1, so use Instructions Per Cycle (IPC)
    - E.g., 4GHz 4-way multiple-issue
      - 16 BIPS, peak CPI = 0.25, peak IPC = 4
    - But dependencies reduce this in practice
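The peak figures quoted for the 4GHz 4-way example are simple products and reciprocals. A sketch of the arithmetic, with an illustrative function name:

```python
def peak_rates(clock_ghz: float, issue_width: int):
    """Return (peak BIPS, peak CPI, peak IPC) for an ideal multiple-issue CPU."""
    peak_ipc = issue_width               # at best, every issue slot is filled
    peak_cpi = 1 / issue_width           # CPI is the reciprocal of IPC
    peak_bips = clock_ghz * issue_width  # billions of instructions per second
    return peak_bips, peak_cpi, peak_ipc


print(peak_rates(4, 4))  # (16, 0.25, 4)
```

These are upper bounds only. Any cycle in which a dependency leaves an issue slot empty pushes the achieved IPC below the issue width.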

Multiple Issue
- Static multiple issue
  - Compiler groups instructions to be issued together
  - Packages them into “issue slots”
  - Compiler detects and avoids hazards
- Dynamic multiple issue
  - CPU examines instruction stream and chooses instructions to issue each cycle
  - Compiler can help by reordering instructions
  - CPU resolves hazards using advanced techniques at runtime

Speculation
- “Guess” what to do with an instruction
  - Start operation as soon as possible
  - Check whether guess was right
    - If so, complete the operation
    - If not, roll-back and do the right thing
- Common to static and dynamic multiple issue
- Examples
  - Speculate on branch outcome
    - Roll back if path taken is different
  - Speculate on load
    - Roll back if location is updated

Compiler/Hardware Speculation
- Compiler can reorder instructions
  - e.g., move load before branch
  - Can include “fix-up” instructions to recover from incorrect guess
- Hardware can look ahead for instructions to execute
  - Buffer results until it determines they are actually needed
  - Flush buffers on incorrect speculation

Speculation and Exceptions
- What if exception occurs on a speculatively executed instruction?
  - e.g., speculative load before null-pointer check
- Static speculation
  - Can add ISA support for deferring exceptions
- Dynamic speculation
  - Can buffer exceptions until instruction completion (which may not occur)

Static Multiple Issue
- Compiler groups instructions into “issue packets”
  - Group of instructions that can be issued on a single cycle
  - Determined by pipeline resources required
- Think of an issue packet as a very long instruction
  - Specifies multiple concurrent operations
  - ⇒ Very Long Instruction Word (VLIW)

Scheduling Static Multiple Issue
- Compiler must remove some/all hazards
  - Reorder instructions into issue packets
  - No dependencies within a packet
  - Possibly some dependencies between packets
    - Varies between ISAs; compiler must know!
  - Pad with nop if necessary

MIPS with Static Dual Issue
- Two-issue packets
  - One ALU/branch instruction
  - One load/store instruction
  - 64-bit aligned
    - ALU/branch, then load/store
    - Pad an unused instruction with nop

Address  Instruction type  Pipeline Stages
n        ALU/branch        IF  ID  EX  MEM WB
n + 4    Load/store        IF  ID  EX  MEM WB
n + 8    ALU/branch            IF  ID  EX  MEM WB
n + 12   Load/store            IF  ID  EX  MEM WB
n + 16   ALU/branch                IF  ID  EX  MEM WB
n + 20   Load/store                IF  ID  EX  MEM WB

MIPS with Static Dual Issue
[Figure]

Hazards in the Dual-Issue MIPS
- More instructions executing in parallel
- EX data hazard
  - Forwarding avoided stalls with single-issue
  - Now can’t use ALU result in load/store in same packet

        add  $t0, $s0, $s1
        load $s2, 0($t0)

    - Split into two packets, effectively a stall
- Load-use hazard
  - Still one cycle use latency, but now two instructions
- More aggressive scheduling required

Scheduling Example
- Schedule this for dual-issue MIPS

    Loop: lw   $t0, 0($s1)       # $t0=array element
          addu $t0, $t0, $s2     # add scalar in $s2
          sw   $t0, 0($s1)       # store result
          addi $s1, $s1, -4      # decrement pointer
          bne  $s1, $zero, Loop  # branch $s1!=0

          ALU/branch             Load/store        cycle
    Loop: nop                    lw  $t0, 0($s1)   1
          addi $s1, $s1, -4      nop               2
          addu $t0, $t0, $s2     nop               3
          bne  $s1, $zero, Loop  sw  $t0, 4($s1)   4

  - IPC = 5/4 = 1.25 (cf. peak IPC = 2)

Loop Unrolling
- Replicate loop body to expose more parallelism
  - Reduces loop-control overhead
- Use different registers per replication
  - Called “register renaming”
  - Avoid loop-carried “anti-dependencies”
    - Store followed by a load of the same register
    - Aka “name dependence”
      - Reuse of a register name

Loop Unrolling Example

    Loop: lw   $t0, 0($s1)       # $t0=array element
          addu $t0, $t0, $s2     # add scalar in $s2
          sw   $t0, 0($s1)       # store result
          addi $s1, $s1, -4      # decrement pointer
          bne  $s1, $zero, Loop  # branch $s1!=0

          ALU/branch             Load/store        cycle
    Loop: addi $s1, $s1, -16     lw  $t0, 0($s1)   1
          nop                    lw  $t1, 12($s1)  2
          addu $t0, $t0, $s2     lw  $t2, 8($s1)   3
          addu $t1, $t1, $s2     lw  $t3, 4($s1)   4
          addu $t2, $t2, $s2     sw  $t0, 16($s1)  5
          addu $t3, $t3, $s2     sw  $t1, 12($s1)  6
          nop                    sw  $t2, 8($s1)   7
          bne  $s1, $zero, Loop  sw  $t3, 4($s1)   8

- IPC = 14/8 = 1.75
  - Closer to 2, but at cost of registers and code size

Dynamic Multiple Issue
- “Superscalar” processors
- CPU decides whether to issue 0, 1, 2, … each cycle
  - Avoiding structural and data hazards
- Avoids the need for compiler scheduling
  - Though it may still help
  - Code semantics ensured by the CPU

Dynamic Pipeline Scheduling
- Allow the CPU to execute instructions out of order to avoid stalls
  - But commit result to registers in order
- Example

    lw   $t0, 20($s2)
    addu $t1, $t0, $t2
    sub  $s4, $s4, $t3
    slti $t5, $s4, 20

  - Can start sub while addu is waiting for lw

Dynamically Scheduled CPU
[Figure, with annotations:]
- Preserves dependencies
- Hold pending operands
- Results also sent to any waiting reservation stations
- Reorder buffer for register writes
- Can supply operands for issued instructions

Register Renaming
- Reservation stations and reorder buffer effectively provide register renaming
- On instruction issue to reservation station
  - If operand is available in register file or reorder buffer
    - Copied to reservation station
    - No longer required in the register, so it can be overwritten
  - If operand is not yet available
    - It will be provided to the reservation station by a function unit
    - Register update may not be required

Speculation
- Predict branch and continue issuing
  - Don’t commit until branch outcome determined
- Load speculation
  - Avoid load and cache miss delay
    - Predict the effective address
    - Predict loaded value
    - Load before completing outstanding stores
    - Bypass stored values to load unit
  - Don’t commit load until speculation cleared

Why Do Dynamic Scheduling?
- Why not let the compiler schedule code?
- Not all stalls are predictable
  - e.g., cache misses
- Can’t always schedule around branches
  - Branch outcome is dynamically determined
- Different implementations of an ISA have different latencies and hazards

Does Multiple Issue Work?
The BIG Picture
- Yes, but not as much as we’d like
- Programs have real dependencies that limit ILP
- Some dependencies are hard to eliminate
  - e.g., pointer aliasing
- Some parallelism is hard to expose
  - Limited window size during instruction issue
- Memory delays and limited bandwidth
  - Hard to keep pipelines full
- Speculation can help if done well

Power Efficiency
- Complexity of dynamic scheduling and speculation requires power
- Multiple simpler cores may be better

Microprocessor  Year  Clock Rate  Pipeline Stages  Issue width  Out-of-order/Speculation  Cores  Power
i486            1989  25MHz       5                1            No                        1      5W
Pentium         1993  66MHz       5                2            No                        1      10W
Pentium Pro     1997  200MHz      10               3            Yes                       1      29W
P4 Willamette   2001  2000MHz     22               3            Yes                       1      75W
P4 Prescott     2004  3600MHz     31               3            Yes                       1      103W
Core            2006  2930MHz     14               4            Yes                       2      75W
UltraSparc III  2003  1950MHz     14               4            No                        1      90W
UltraSparc T1   2005  1200MHz     6                1            No                        8      70W

§4.11 Real Stuff: The AMD Opteron X4 (Barcelona) Pipeline

The Opteron X4 Microarchitecture
[Figure: annotation “72 physical registers”]

The Opteron X4 Pipeline Flow
- For integer operations
[Figure]
  - FP is 5 stages longer
  - Up to 106 RISC-ops in progress
- Bottlenecks
  - Complex instructions with long dependencies
  - Branch mispredictions
  - Memory access delays

§4.13 Fallacies and Pitfalls

Fallacies
- Pipelining is easy (!)
  - The basic idea is easy
  - The devil is in the details
    - e.g., detecting data hazards
- Pipelining is independent of technology
  - So why haven’t we always done pipelining?
  - More transistors make more advanced techniques feasible
  - Pipeline-related ISA design needs to take account of technology trends
    - e.g., predicated instructions

Pitfalls
- Poor ISA design can make pipelining harder
  - e.g., complex instruction sets (VAX, IA-32)
    - Significant overhead to make pipelining work
    - IA-32 micro-op approach
  - e.g., complex addressing modes
    - Register update side effects, memory indirection
  - e.g., delayed branches
    - Advanced pipelines have long delay slots

§4.14 Concluding Remarks

Concluding Remarks
- ISA influences design of datapath and control
- Datapath and control influence design of ISA
- Pipelining improves instruction throughput using parallelism
  - More instructions completed per second
  - Latency for each instruction not reduced
- Hazards: structural, data, control
- Multiple issue and dynamic scheduling (ILP)
  - Dependencies limit achievable parallelism
  - Complexity leads to the power wall
