
Tutorial 11, Final Review - ECSE 425 – Computer Organization and ...


Tutorial 11: Final Exam Review


Introduction
● Instruction set architecture: the contract between programmers and designers (e.g., IA-32, IA-64, x86-64)
● Computer organization: describes the functional units, cache hierarchy, etc. (e.g., Opteron vs. Pentium 4)
● Hardware: manufacturing technology, packaging, etc. (e.g., Pentium 4 vs. Mobile Pentium 4)


Trends in technology
● Bandwidth versus latency: the former increased much faster than the latter
● Moore's law: exponential growth in transistor count in ICs
● Power consumption: static (leakage) versus dynamic (switching) power consumption
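
The slide only names the two power components; as a rough illustration (not from the tutorial), the usual first-order models are dynamic power ~ C * V^2 * f for switching and static power ~ I_leak * V for leakage. A minimal Python sketch with made-up numbers:

    def dynamic_power(c_load, voltage, frequency, activity=1.0):
        # Switching power: energy per transition (C*V^2/2) times how often nodes toggle.
        return 0.5 * activity * c_load * voltage**2 * frequency

    def static_power(i_leakage, voltage):
        # Leakage power: drawn even when nothing switches.
        return i_leakage * voltage

    # Halving the supply voltage cuts dynamic power to a quarter at the same frequency.
    print(dynamic_power(1e-9, 1.2, 2e9))   # ~1.44 W with these invented values
    print(dynamic_power(1e-9, 0.6, 2e9))   # ~0.36 W
    print(static_power(0.05, 1.2))         # 0.06 W of leakage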


Trends in IC cost
● ICs are produced on silicon wafers
● Multiple dies are produced per wafer
● Costs are split between:
  ● Fixed expenses: masks
  ● Recurring expenses: materials, manufacturing, testing, packaging, losses
● Wafers contain defects, and manufacturing can produce defective parts
  ● Yield: proportion of good dies on a wafer


Trends in IC cost
● Four equations to determine the final cost of an integrated circuit:
  ● Dies per wafer
  ● Die yield
  ● Die cost
  ● IC cost
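
One common textbook formulation of those four equations, sketched in Python. The yield model (with its alpha complexity parameter) varies between editions, and all numbers below are invented for illustration:

    import math

    def dies_per_wafer(wafer_diameter_cm, die_area_cm2):
        # Usable dies: wafer area over die area, minus the dies lost around the edge.
        wafer_area = math.pi * (wafer_diameter_cm / 2) ** 2
        edge_loss = math.pi * wafer_diameter_cm / math.sqrt(2 * die_area_cm2)
        return int(wafer_area / die_area_cm2 - edge_loss)

    def die_yield(wafer_yield, defects_per_cm2, die_area_cm2, alpha=4.0):
        # Fraction of dies that work; alpha models process complexity.
        return wafer_yield * (1 + defects_per_cm2 * die_area_cm2 / alpha) ** (-alpha)

    def die_cost(wafer_cost, wafer_diameter_cm, die_area_cm2, wafer_yield, defects_per_cm2):
        n = dies_per_wafer(wafer_diameter_cm, die_area_cm2)
        y = die_yield(wafer_yield, defects_per_cm2, die_area_cm2)
        return wafer_cost / (n * y)

    def ic_cost(die_cost_value, test_cost, packaging_cost, final_test_yield):
        # Packaged, tested part; dividing by final test yield spreads losses over good parts.
        return (die_cost_value + test_cost + packaging_cost) / final_test_yield

    # Example with invented numbers: a 1 cm^2 die on a 30 cm wafer.
    c = die_cost(wafer_cost=5000, wafer_diameter_cm=30, die_area_cm2=1.0,
                 wafer_yield=1.0, defects_per_cm2=0.4)
    print(ic_cost(c, test_cost=3.0, packaging_cost=2.0, final_test_yield=0.95))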


Dependability
● Mean Time To Failure (MTTF): how long before a failure occurs, on average
● Failures In Time (FIT): number of failures per billion hours (10^9 / MTTF)
● Assume independent failures, exponentially distributed lifetimes
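
For instance (numbers invented): with independent, exponentially distributed failures, component failure rates add, so the system MTTF is the reciprocal of the summed rates.

    def fit(mttf_hours):
        # Failures In Time: failures per 10^9 device-hours.
        return 1e9 / mttf_hours

    def system_mttf(component_mttfs):
        # Independent components: rates (1/MTTF) add; system MTTF is the inverse of the sum.
        total_rate = sum(1.0 / m for m in component_mttfs)
        return 1.0 / total_rate

    disks = [1_000_000] * 10          # ten disks, 1,000,000-hour MTTF each
    print(fit(1_000_000))             # 1000 FIT per disk
    print(system_mttf(disks))         # 100,000 hours for the set of ten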


Locality
● A processor spends most of its time in small portions of code
  ● Spatial locality: nearby addresses tend to be referenced together
  ● Temporal locality: you reuse things you've accessed recently
● Amdahl's law: compute the speedup resulting from an improvement to a certain portion of a system
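
A worked instance of Amdahl's law (fractions chosen purely for illustration): if a fraction f of execution time is enhanced by a factor s, the overall speedup is 1 / ((1 - f) + f/s).

    def amdahl_speedup(fraction_enhanced, speedup_enhanced):
        # Time after the improvement, relative to an original time of 1.
        new_time = (1 - fraction_enhanced) + fraction_enhanced / speedup_enhanced
        return 1 / new_time

    # Speeding up 40% of the program by 10x yields only ~1.56x overall.
    print(amdahl_speedup(0.4, 10))   # 1.5625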


Performance
● CPU performance equation: CPU time = instruction count × CPI × clock cycle time
● Benchmarks: programs that allow you to get performance measurements by simulating real-life workloads
● The geometric mean is used to average unitless benchmark results; otherwise, use the arithmetic mean
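
A small sketch of the CPU performance equation and of which mean fits which data; the benchmark figures below are made up.

    import math

    def cpu_time(instruction_count, cpi, clock_period_s):
        return instruction_count * cpi * clock_period_s

    # 1 billion instructions, CPI of 1.5, 2 GHz clock (0.5 ns period).
    print(cpu_time(1e9, 1.5, 0.5e-9))           # 0.75 s

    def arithmetic_mean(values):
        return sum(values) / len(values)

    def geometric_mean(values):
        return math.exp(sum(math.log(v) for v in values) / len(values))

    exec_times_s = [2.0, 8.0, 4.0]              # raw times: arithmetic mean is meaningful
    ratios_to_reference = [1.25, 0.8, 2.0]      # normalized, unitless: use the geometric mean
    print(arithmetic_mean(exec_times_s))        # ~4.67 s
    print(geometric_mean(ratios_to_reference))  # ~1.26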


Pipelining
● Split an instruction into multiple consecutive stages which can be overlapped. Multiple instructions are each in a different stage of execution at any given time.
● An N-stage pipeline means overlapped execution of N instructions. Ideal speedup = N.
● Never ideal!
  ● New delays introduced by the pipeline
  ● Hazards: structural, data, control
  ● Others: memory contention, pipeline imbalance
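
A quick sketch of why the ideal speedup of N is never reached: with an ideal pipelined CPI of 1, hazards add stall cycles, so speedup = depth / (1 + stalls per instruction). The stall figures are illustrative.

    def pipeline_speedup(depth, stalls_per_instruction):
        # Ideal pipelined CPI is 1; hazards add stall cycles on top of it.
        pipelined_cpi = 1 + stalls_per_instruction
        return depth / pipelined_cpi

    print(pipeline_speedup(5, 0.0))   # 5.0: the ideal case
    print(pipeline_speedup(5, 0.5))   # ~3.33: half a stall per instruction costs a third of the speedup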


Pipelining
● Simple 5-stage RISC pipeline: IF / ID / EX / MEM / WB


Pipeline hazards
● Structural: contention over a resource
● Data: unavailability of a result until a later time
  ● RAW, WAR, WAW
  ● RAW can be mitigated using forwarding
● Control: branch resolution causes a stall
● Many ways to mitigate hazards:
  ● Compiler techniques: reordering, register renaming, profiling-assisted branch prediction
  ● Hardware techniques: speculation, register renaming, write buffers


Control hazards
● Freeze/flush the pipeline: when a branch is encountered, stall for 1 CC to resolve the branch
● Predict not-taken: start executing from PC+4
● Predict taken: start executing from PC+offset (requires target address resolution in IF)
● Delayed branch: insert "neutral" instructions right after a branch to give the CPU time to resolve the branch outcome
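
As a rough comparison of these schemes (frequencies and penalties invented, and ignoring clock-period effects), the cost of control hazards can be folded into CPI as branch frequency times the average penalty:

    def cpi_with_branches(base_cpi, branch_freq, taken_rate, penalty_taken, penalty_not_taken):
        # Average branch penalty weighted by how often branches go each way.
        branch_penalty = taken_rate * penalty_taken + (1 - taken_rate) * penalty_not_taken
        return base_cpi + branch_freq * branch_penalty

    # 20% branches, 60% of them taken, 1-cycle penalty when the scheme guesses wrong.
    print(cpi_with_branches(1.0, 0.2, 0.6, penalty_taken=1, penalty_not_taken=0))  # predict not-taken: 1.12
    print(cpi_with_branches(1.0, 0.2, 0.6, penalty_taken=1, penalty_not_taken=1))  # always stall: 1.20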


Multicycle operations
● The FP pipeline: separate pipelines for different types of operations: integer, FP multiply, FP add, FP divide
● Those extra data paths either take more than 1 CC or have multiple execution stages (add, mul, ...)


Multicycle operations
● Instructions can complete out of order: WAR and WAW hazards
● Exceptions can occur out of order
● New structural hazards, e.g., a DIV unit that is not pipelined
● Superpipelining: subdividing the operations further, e.g., multicycle memory access


Instruction-level parallelism
● By overlapping the execution of multiple instructions, we exploit ILP
● To maximize ILP, we want the freedom to reorder instructions; program order only has to be preserved where changing it would affect the result of the program
● Three types of dependences can cause hazards:
  ● True/data dependence: an instruction depends on a result produced by a previous instruction (RAW)
  ● Name dependences: two instructions use the same register or memory location but don't exchange information
    ● Anti-dependence: the later instruction overwrites a location the earlier one reads (WAR)
    ● Output dependence: both instructions write to the same location, so their write order must be preserved (WAW)
  ● Control dependence: dependence on the outcome of a branch


Instruction-level parallelism


Loop unrolling
● We want to keep the pipeline full
● Replicate the body of a loop multiple times to expose ILP and reduce the number of control dependences
● This technique yields larger executables and requires many registers
● We don't always know whether we can unroll a loop, since the upper bound is not always known statically
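
The idea, sketched in Python purely for illustration (in practice the compiler does this on the machine-level loop): replicate the body four times so the loop-control overhead and branch are paid once per four elements, with a cleanup loop for leftover iterations when the trip count isn't a multiple of the unroll factor.

    def sum_rolled(a):
        total = 0
        for i in range(len(a)):        # one branch and index update per element
            total += a[i]
        return total

    def sum_unrolled_by_4(a):
        total = 0
        n = len(a)
        i = 0
        while i + 4 <= n:              # one branch per four elements
            total += a[i]
            total += a[i + 1]
            total += a[i + 2]
            total += a[i + 3]
            i += 4
        while i < n:                   # cleanup loop for the remaining 0-3 elements
            total += a[i]
            i += 1
        return total

    data = list(range(10))
    assert sum_rolled(data) == sum_unrolled_by_4(data) == 45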


Branch prediction
● No prediction: flushing, delayed branch
● Static prediction: predict taken or predict not-taken
● Dynamic prediction: use past behavior
  ● 1-bit predictor: repeat the past outcome
  ● 2-bit predictor: repeat the past outcome, provided it has occurred at least twice in a row
  ● Correlating predictor: an (m,n) predictor uses 2^m n-bit predictors, selected by the outcome of the past m branches
  ● Tournament predictor: two predictors, one local and one global; dynamically select between the two
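
A minimal sketch (not from the tutorial) of the 2-bit saturating-counter scheme: states 0-1 predict not-taken, states 2-3 predict taken, and the prediction only flips after two consecutive surprises.

    class TwoBitPredictor:
        def __init__(self):
            self.state = 0  # start strongly not-taken

        def predict(self):
            return self.state >= 2  # True means "predict taken"

        def update(self, taken):
            # Move one step toward the actual outcome, saturating at 0 and 3.
            if taken:
                self.state = min(3, self.state + 1)
            else:
                self.state = max(0, self.state - 1)

    # A loop branch taken 9 times then not taken: only the transitions mispredict.
    p = TwoBitPredictor()
    outcomes = [True] * 9 + [False]
    mispredicts = 0
    for taken in outcomes:
        if p.predict() != taken:
            mispredicts += 1
        p.update(taken)
    print(mispredicts)   # 3: two warm-up misses plus the final not-taken branch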


Dynamic scheduling
● Hardware rearranges instructions to reduce stalls
● Allows out-of-order execution and completion
● Two stages instead of ID:
  ● Dispatch: decode and check for structural hazards (RS, ROB)
  ● Issue: wait on data and structural hazards (FU)
● In-order dispatch, but instructions can issue out of order and execute out of order
● Tomasulo's algorithm for FP operations


Tomasulo's algorithm
● Performs implicit register renaming to get rid of WAR and WAW hazards
● Uses reservation stations to wait out RAW hazards
● Uses a common data bus (CDB) to broadcast and listen for execution results
● Two-step loads and stores: compute the effective address, then place it in the load/store buffers
● Can be adapted for speculative operation


Tomasulo's algorithm


Speculative Tomasulo
● Branching can limit the ILP exploitable by Tomasulo's algorithm
● Speculate on the branch outcome and start executing the instructions that follow the branch
● Dangerous: speculation can modify the processor state irreversibly or raise unwanted exceptions
  ● Keep track of speculative execution and undo it on a misprediction
  ● Execute out of order but commit in order
  ● Use a re-order buffer (ROB) to hold uncommitted results
  ● The register file is updated only when instructions commit
  ● Reservation stations buffer instructions between issue and execution, but renaming is done by the ROB
  ● Mispredicted branches flush the later ROB entries and restart execution
  ● The ROB can provide precise exceptions since it commits in order


Speculative Tomasulo
(Figure: (C) 2003 Elsevier Science (USA). All rights reserved.)


Multiple-issue processors
● A CPU can have parallel pipelines
● Superscalar: schedule instructions when possible; more than one can issue at once
  ● Static: fetch multiple instructions at once, issue all of them if possible; in-order execution
  ● Dynamic: fetch multiple instructions at once, with dynamic scheduling thereafter = out-of-order execution
● VLIW (a static superscalar with no hardware hazard detection): pack instructions into fixed-size long words, statically scheduled by the compiler


Memory hierarchy
● Memory performance has not increased as quickly as processor performance
● We would like an unlimited amount of very fast memory, but it's not feasible
● Principle of locality: temporal/spatial
● Hierarchy of memories, progressively larger but slower, organized in levels
● Multiple levels of cache down to main memory
  ● Hit/miss in a cache


Memory hierarchy
● Memory accesses per instruction: one to fetch the instruction, plus (possibly) a load/store to memory
● Miss penalty: how many CC it takes to get the data from a slower memory
● Miss rate: how often you miss in a cache
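
These three quantities combine as memory stall cycles per instruction = accesses per instruction x miss rate x miss penalty, which adds directly onto the base CPI. A sketch with invented numbers:

    def cpi_with_memory(base_cpi, accesses_per_instr, miss_rate, miss_penalty_cycles):
        # Stall cycles the cache misses add to every instruction, on average.
        stall_cycles = accesses_per_instr * miss_rate * miss_penalty_cycles
        return base_cpi + stall_cycles

    # 1 instruction fetch + 0.3 data accesses per instruction, 2% miss rate, 100-cycle penalty.
    print(cpi_with_memory(1.0, 1.3, 0.02, 100))   # 3.6: misses dominate the base CPI of 1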


Caches
● Block placement: where do you put a block? (direct mapped, set associative, fully associative)
● Block identification: how do you know you have the right block? (tag, index, offset)
● Block replacement: what do you do when a spot is taken in the cache? (random, LRU, FIFO)
● Write strategy: what do you do on a write? (write-through, write-back)
● Write-miss strategy: what do you do on a write miss? (write allocate, no-write allocate)


Caches
● How cache speedups can be calculated and integrated into the CPU time equation
● Cache addressing:
  ● Index
  ● Tag
  ● Offset
● Validity bit, dirty bit
● The three Cs of cache misses:
  ● Compulsory: you've never accessed this data before
  ● Capacity: the working set is too large for the cache
  ● Conflict: the mapping rules map too many blocks to the same location
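
A sketch of how the tag, index, and offset widths follow from the cache geometry (the sizes are example values, and 32-bit addresses are assumed):

    import math

    def cache_fields(cache_bytes, block_bytes, associativity, address_bits=32):
        num_sets = cache_bytes // (block_bytes * associativity)
        offset_bits = int(math.log2(block_bytes))            # selects a byte within the block
        index_bits = int(math.log2(num_sets))                # selects the set
        tag_bits = address_bits - index_bits - offset_bits   # identifies the block within the set
        return tag_bits, index_bits, offset_bits

    # 32 KiB cache, 64-byte blocks, 4-way set associative: 128 sets.
    print(cache_fields(32 * 1024, 64, 4))   # (19, 7, 6)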


Caches
● Miss rate
● Multi-level caches:
  ● Global miss rate
  ● Local miss rate
● Unified vs. split caches (data vs. instructions)
● Average Memory Access Time (AMAT)
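
AMAT = hit time + miss rate x miss penalty; with two cache levels, the L1 miss penalty is itself an AMAT into L2 (using L2's local miss rate), and the global L2 miss rate is the product of the two. The numbers below are illustrative.

    def amat(hit_time, miss_rate, miss_penalty):
        return hit_time + miss_rate * miss_penalty

    def amat_two_level(l1_hit, l1_miss_rate, l2_hit, l2_local_miss_rate, memory_penalty):
        # The penalty of an L1 miss is the average time to be served by L2 (or memory beyond it).
        l1_miss_penalty = amat(l2_hit, l2_local_miss_rate, memory_penalty)
        return amat(l1_hit, l1_miss_rate, l1_miss_penalty)

    print(amat(1, 0.05, 100))                       # single level: 6.0 cycles
    print(amat_two_level(1, 0.05, 10, 0.2, 100))    # with an L2: 2.5 cycles
    print(0.05 * 0.2)                               # global L2 miss rate: 1% of all accesses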


Cache optimizations
● Reduce the miss rate:
  ● Increase block size
  ● Increase cache size
  ● Increase associativity
● Reduce the miss penalty:
  ● Multilevel caching
  ● Prioritize reads over writes
● Reduce hit time:
  ● Avoid address translation during cache indexing


Cache coherence protocols
● Coherence: memory accesses to a single location are seen by all CPUs in the same order
● Consistency: memory accesses to different locations are seen by all CPUs in a consistent order
● Two classes of protocols:
  ● Directory based: a central location tracks sharing state
  ● Snooping: all caches observe a global bus
● Two ways to ensure cache coherence:
  ● Write invalidate: broadcast invalidation messages
  ● Write update: broadcast the new value


Cache coherence protocols
● Snooping + write-back + write-invalidate
● Three states: Invalid, Modified, Shared (MSI)
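
A sketch of the resulting MSI state machine for one cache block, in one common textbook formulation; PrRd/PrWr are this processor's reads and writes, BusRd/BusRdX are snooped requests from other caches, and the exact event names are assumptions, not from the tutorial.

    # (state, event) -> (next state, action on the bus)
    MSI = {
        ('I', 'PrRd'):   ('S', 'issue BusRd'),
        ('I', 'PrWr'):   ('M', 'issue BusRdX'),
        ('S', 'PrRd'):   ('S', None),
        ('S', 'PrWr'):   ('M', 'issue BusRdX (invalidate other copies)'),
        ('S', 'BusRd'):  ('S', None),
        ('S', 'BusRdX'): ('I', None),
        ('M', 'PrRd'):   ('M', None),
        ('M', 'PrWr'):   ('M', None),
        ('M', 'BusRd'):  ('S', 'write block back / supply data'),
        ('M', 'BusRdX'): ('I', 'write block back / supply data'),
    }

    def step(state, event):
        next_state, action = MSI[(state, event)]
        return next_state, action

    # CPU A writes a block, then CPU B reads it: A goes I -> M, then is demoted to S on B's BusRd.
    state = 'I'
    for event in ['PrWr', 'BusRd']:
        state, action = step(state, event)
        print(state, action)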


Virtual memory
● Each program gets a very large virtual address space, which is then mapped to physical memory
● Pages are moved in and out of memory as needed, depending on RAM size
● Good for multitasking: isolation and memory sharing, because virtual memory is relocatable
● The sum of all the memory needed by all programs can be larger than RAM; use the disk as extra memory space


Virtual memory
● Virtual memory requires a mapping to physical memory
● Block placement: fully associative
● Block identification: page table
● Block replacement: LRU, to minimize page faults
● Write strategy: write-back, because the disk is slow
● A translation look-aside buffer (TLB) is used to store recent translations
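
A toy sketch of the translation path (the page size, page-table contents, and dictionary-based TLB are all invented for illustration): split the virtual address into a virtual page number and an offset, check the TLB first, and fall back to the page table on a TLB miss.

    PAGE_SIZE = 4096                       # 4 KiB pages -> 12 offset bits
    OFFSET_BITS = 12

    page_table = {0: 7, 1: 3, 2: 9}        # virtual page number -> physical frame number
    tlb = {}                               # small cache of recent translations

    def translate(virtual_address):
        vpn = virtual_address >> OFFSET_BITS
        offset = virtual_address & (PAGE_SIZE - 1)
        if vpn in tlb:                     # TLB hit: no page-table walk needed
            frame = tlb[vpn]
        else:                              # TLB miss: walk the page table, then cache the entry
            if vpn not in page_table:
                raise RuntimeError('page fault: page %d not in memory' % vpn)
            frame = page_table[vpn]
            tlb[vpn] = frame
        return (frame << OFFSET_BITS) | offset

    print(hex(translate(0x1ABC)))          # VPN 1 maps to frame 3: 0x3abc
    print(hex(translate(0x1FFF)))          # second access to the same page hits in the TLB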


Final exam
● December 9th at 9 am
● 2 double-sided crib sheets allowed
  ● You have to hand them in with your exam
● 3-hour exam
● 120 points total
  ● 5 problems: 20 pts each = 100 pts
  ● 20 short-answer questions = 20 pts
