
Tutorial 11, Final Review - ECSE 425 – Computer Organization and ...


Tutorial 11: Final Exam Review


Introduction
● Instruction set architecture: the contract between programmers and designers (e.g., IA-32, IA-64, x86-64)
● Computer organization: describes the functional units, cache hierarchy, etc. (e.g., Opteron vs. Pentium 4)
● Hardware: manufacturing technology, packaging, etc. (e.g., Pentium 4 vs. Mobile Pentium 4)


Trends in technology
● Bandwidth versus latency: the former increased much faster than the latter
● Moore's law: exponential growth in transistor count in ICs
● Power consumption: static (leakage) versus dynamic (switching) power consumption
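
The slide only names the two power components; as a rough illustration (not from the tutorial), the usual first-order models are dynamic power ~ C * V^2 * f for switching and static power ~ I_leak * V for leakage. A minimal Python sketch with made-up numbers:

    def dynamic_power(c_load, voltage, frequency, activity=1.0):
        # Switching power: energy per transition (C*V^2/2) times how often nodes toggle.
        return 0.5 * activity * c_load * voltage**2 * frequency

    def static_power(i_leakage, voltage):
        # Leakage power: drawn even when nothing switches.
        return i_leakage * voltage

    # Halving the supply voltage cuts dynamic power to a quarter at the same frequency.
    print(dynamic_power(1e-9, 1.2, 2e9))   # ~1.44 W with these invented values
    print(dynamic_power(1e-9, 0.6, 2e9))   # ~0.36 W
    print(static_power(0.05, 1.2))         # 0.06 W of leakage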


Trends in IC cost
● ICs are produced on silicon wafers
● Multiple dies are produced per wafer
● Costs are split between:
  ● Fixed expenses: masks
  ● Recurring expenses: materials, manufacturing, testing, packaging, losses
● Wafers contain defects, and manufacturing can produce defective parts
  ● Yield: proportion of good dies on a wafer


Trends in IC cost
● Four equations to determine the final cost of an integrated circuit:
  ● Dies per wafer
  ● Die yield
  ● Die cost
  ● IC cost
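
One common textbook formulation of those four equations, sketched in Python. The yield model (with its alpha complexity parameter) varies between editions, and all numbers below are invented for illustration:

    import math

    def dies_per_wafer(wafer_diameter_cm, die_area_cm2):
        # Usable dies: wafer area over die area, minus the dies lost around the edge.
        wafer_area = math.pi * (wafer_diameter_cm / 2) ** 2
        edge_loss = math.pi * wafer_diameter_cm / math.sqrt(2 * die_area_cm2)
        return int(wafer_area / die_area_cm2 - edge_loss)

    def die_yield(wafer_yield, defects_per_cm2, die_area_cm2, alpha=4.0):
        # Fraction of dies that work; alpha models process complexity.
        return wafer_yield * (1 + defects_per_cm2 * die_area_cm2 / alpha) ** (-alpha)

    def die_cost(wafer_cost, wafer_diameter_cm, die_area_cm2, wafer_yield, defects_per_cm2):
        n = dies_per_wafer(wafer_diameter_cm, die_area_cm2)
        y = die_yield(wafer_yield, defects_per_cm2, die_area_cm2)
        return wafer_cost / (n * y)

    def ic_cost(die_cost_value, test_cost, packaging_cost, final_test_yield):
        # Packaged, tested part; dividing by final test yield spreads losses over good parts.
        return (die_cost_value + test_cost + packaging_cost) / final_test_yield

    # Example with invented numbers: a 1 cm^2 die on a 30 cm wafer.
    c = die_cost(wafer_cost=5000, wafer_diameter_cm=30, die_area_cm2=1.0,
                 wafer_yield=1.0, defects_per_cm2=0.4)
    print(ic_cost(c, test_cost=3.0, packaging_cost=2.0, final_test_yield=0.95))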


Dependability
● Mean Time To Failure (MTTF): how long before a failure occurs, on average
● Failures In Time (FIT): number of failures per billion hours (10^9 / MTTF)
● Assume independent failures, exponentially distributed lifetimes
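
For instance (numbers invented): with independent, exponentially distributed failures, component failure rates add, so the system MTTF is the reciprocal of the summed rates.

    def fit(mttf_hours):
        # Failures In Time: failures per 10^9 device-hours.
        return 1e9 / mttf_hours

    def system_mttf(component_mttfs):
        # Independent components: rates (1/MTTF) add; system MTTF is the inverse of the sum.
        total_rate = sum(1.0 / m for m in component_mttfs)
        return 1.0 / total_rate

    disks = [1_000_000] * 10          # ten disks, 1,000,000-hour MTTF each
    print(fit(1_000_000))             # 1000 FIT per disk
    print(system_mttf(disks))         # 100,000 hours for the set of ten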


Locality
● A processor spends most of its time in small portions of code
  ● Spatial locality: nearby addresses tend to be referenced together
  ● Temporal locality: you reuse things you've accessed recently
● Amdahl's law: compute the speedup resulting from an improvement to a certain portion of a system
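
A worked instance of Amdahl's law (fractions chosen purely for illustration): if a fraction f of execution time is enhanced by a factor s, the overall speedup is 1 / ((1 - f) + f/s).

    def amdahl_speedup(fraction_enhanced, speedup_enhanced):
        # Time after the improvement, relative to an original time of 1.
        new_time = (1 - fraction_enhanced) + fraction_enhanced / speedup_enhanced
        return 1 / new_time

    # Speeding up 40% of the program by 10x yields only ~1.56x overall.
    print(amdahl_speedup(0.4, 10))   # 1.5625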


Performance
● CPU performance equation: CPU time = instruction count × CPI × clock cycle time
● Benchmarks: programs that allow you to get performance measurements by simulating real-life workloads
● The geometric mean is used to average unitless benchmark results; otherwise, use the arithmetic mean
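
A small sketch of the CPU performance equation and of which mean fits which data; the benchmark figures below are made up.

    import math

    def cpu_time(instruction_count, cpi, clock_period_s):
        return instruction_count * cpi * clock_period_s

    # 1 billion instructions, CPI of 1.5, 2 GHz clock (0.5 ns period).
    print(cpu_time(1e9, 1.5, 0.5e-9))           # 0.75 s

    def arithmetic_mean(values):
        return sum(values) / len(values)

    def geometric_mean(values):
        return math.exp(sum(math.log(v) for v in values) / len(values))

    exec_times_s = [2.0, 8.0, 4.0]              # raw times: arithmetic mean is meaningful
    ratios_to_reference = [1.25, 0.8, 2.0]      # normalized, unitless: use the geometric mean
    print(arithmetic_mean(exec_times_s))        # ~4.67 s
    print(geometric_mean(ratios_to_reference))  # ~1.26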


Pipelining
● Split an instruction into multiple consecutive stages which can be overlapped. Multiple instructions are each in a different stage of execution at any given time.
● An N-stage pipeline means overlapped execution of N instructions. Ideal speedup = N.
● Never ideal!
  ● New delays introduced by the pipeline
  ● Hazards: structural, data, control
  ● Others: memory contention, pipeline imbalance
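
A quick sketch of why the ideal speedup of N is never reached: with an ideal pipelined CPI of 1, hazards add stall cycles, so speedup = depth / (1 + stalls per instruction). The stall figures are illustrative.

    def pipeline_speedup(depth, stalls_per_instruction):
        # Ideal pipelined CPI is 1; hazards add stall cycles on top of it.
        pipelined_cpi = 1 + stalls_per_instruction
        return depth / pipelined_cpi

    print(pipeline_speedup(5, 0.0))   # 5.0: the ideal case
    print(pipeline_speedup(5, 0.5))   # ~3.33: half a stall per instruction costs a third of the speedup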


Pipelining
● Simple 5-stage RISC pipeline: IF / ID / EX / MEM / WB


Pipeline hazards
● Structural: contention over a resource
● Data: unavailability of a result until a later time
  ● RAW, WAR, WAW
  ● RAW can be mitigated using forwarding
● Control: branch resolution causes a stall
● Many ways to mitigate hazards:
  ● Compiler techniques: reordering, register renaming, profiling-assisted branch prediction
  ● Hardware techniques: speculation, register renaming, write buffers


Control hazards
● Freeze/flush the pipeline: when a branch is encountered, stall for 1 CC to resolve the branch
● Predict not-taken: start executing from PC+4
● Predict taken: start executing from PC+offset (requires target address resolution in IF)
● Delayed branch: insert "neutral" instructions right after a branch to give the CPU time to resolve the branch outcome
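
As a rough comparison of these schemes (frequencies and penalties invented, and ignoring clock-period effects), the cost of control hazards can be folded into CPI as branch frequency times the average penalty:

    def cpi_with_branches(base_cpi, branch_freq, taken_rate, penalty_taken, penalty_not_taken):
        # Average branch penalty weighted by how often branches go each way.
        branch_penalty = taken_rate * penalty_taken + (1 - taken_rate) * penalty_not_taken
        return base_cpi + branch_freq * branch_penalty

    # 20% branches, 60% of them taken, 1-cycle penalty when the scheme guesses wrong.
    print(cpi_with_branches(1.0, 0.2, 0.6, penalty_taken=1, penalty_not_taken=0))  # predict not-taken: 1.12
    print(cpi_with_branches(1.0, 0.2, 0.6, penalty_taken=1, penalty_not_taken=1))  # always stall: 1.20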


Multicycle operations
● The FP pipeline: separate pipelines for different types of operations: integer, FP multiply, FP add, FP divide
● Those extra data paths either take more than 1 CC or have multiple execution stages (add, mul, ...)


Multicycle operations
● Instructions can complete out of order: WAR and WAW hazards
● Exceptions can occur out of order
● New structural hazards, e.g., a DIV unit that is not pipelined
● Superpipelining: subdividing the operations further, e.g., multicycle memory access


Instruction-level parallelism
● By overlapping the execution of multiple instructions, we exploit ILP
● To maximize ILP, we want the freedom to reorder instructions; program order only has to be preserved where changing it would affect the result of the program
● Three types of dependences can cause hazards:
  ● True/data dependence: an instruction depends on a result produced by a previous instruction (RAW)
  ● Name dependences: two instructions use the same register or memory location but don't exchange information
    ● Anti-dependence: the later instruction overwrites a location the earlier one reads (WAR)
    ● Output dependence: both instructions write to the same location, so their write order must be preserved (WAW)
  ● Control dependence: dependence on the outcome of a branch


Instruction-level parallelism


Loop unrolling
● We want to keep the pipeline full
● Replicate the body of a loop multiple times to expose ILP and reduce the number of control dependences
● This technique yields larger executables and requires many registers
● We don't always know whether we can unroll a loop, since the upper bound is not always known statically
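
The idea, sketched in Python purely for illustration (in practice the compiler does this on the machine-level loop): replicate the body four times so the loop-control overhead and branch are paid once per four elements, with a cleanup loop for leftover iterations when the trip count isn't a multiple of the unroll factor.

    def sum_rolled(a):
        total = 0
        for i in range(len(a)):        # one branch and index update per element
            total += a[i]
        return total

    def sum_unrolled_by_4(a):
        total = 0
        n = len(a)
        i = 0
        while i + 4 <= n:              # one branch per four elements
            total += a[i]
            total += a[i + 1]
            total += a[i + 2]
            total += a[i + 3]
            i += 4
        while i < n:                   # cleanup loop for the remaining 0-3 elements
            total += a[i]
            i += 1
        return total

    data = list(range(10))
    assert sum_rolled(data) == sum_unrolled_by_4(data) == 45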


Branch prediction
● No prediction: flushing, delayed branch
● Static prediction: predict taken or predict not-taken
● Dynamic prediction: use past behavior
  ● 1-bit predictor: repeat the past outcome
  ● 2-bit predictor: repeat the past outcome, provided it has occurred at least twice in a row
  ● Correlating predictor: an (m,n) predictor uses 2^m n-bit predictors, selected by the outcome of the past m branches
  ● Tournament predictor: two predictors, one local and one global; dynamically select between the two
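
A minimal sketch (not from the tutorial) of the 2-bit saturating-counter scheme: states 0-1 predict not-taken, states 2-3 predict taken, and the prediction only flips after two consecutive surprises.

    class TwoBitPredictor:
        def __init__(self):
            self.state = 0  # start strongly not-taken

        def predict(self):
            return self.state >= 2  # True means "predict taken"

        def update(self, taken):
            # Move one step toward the actual outcome, saturating at 0 and 3.
            if taken:
                self.state = min(3, self.state + 1)
            else:
                self.state = max(0, self.state - 1)

    # A loop branch taken 9 times then not taken: only the transitions mispredict.
    p = TwoBitPredictor()
    outcomes = [True] * 9 + [False]
    mispredicts = 0
    for taken in outcomes:
        if p.predict() != taken:
            mispredicts += 1
        p.update(taken)
    print(mispredicts)   # 3: two warm-up misses plus the final not-taken branch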


Dynamic scheduling
● Hardware rearranges instructions to reduce stalls
● Allows out-of-order execution and completion
● Two stages instead of ID:
  ● Dispatch: decode and check for structural hazards (RS, ROB)
  ● Issue: wait on data and structural hazards (FU)
● In-order dispatch, but instructions can issue out of order and execute out of order
● Tomasulo's algorithm for FP operations


Tomasulo's algorithm
● Performs implicit register renaming to get rid of WAR and WAW hazards
● Uses reservation stations to wait out RAW hazards
● Uses a common data bus (CDB) to broadcast and listen for execution results
● Two-step loads and stores: compute the effective address, then place it in the load/store buffers
● Can be adapted for speculative operation


Tomasulo's algorithm


Speculative Tomasulo
● Branching can limit the ILP exploitable by Tomasulo's algorithm
● Speculate on the branch outcome and start executing the instructions that follow the branch
● Dangerous: speculation can modify the processor state irreversibly or raise unwanted exceptions
  ● Keep track of speculative execution and undo it on a misprediction
  ● Execute out of order but commit in order
  ● Use a re-order buffer (ROB) to hold uncommitted results
  ● The register file is updated only when instructions commit
  ● Reservation stations buffer instructions between issue and execution, but renaming is done by the ROB
  ● Mispredicted branches flush the later ROB entries and restart execution
  ● The ROB can provide precise exceptions since it commits in order


Speculative Tomasulo
(Figure: (C) 2003 Elsevier Science (USA). All rights reserved.)


Multiple-issue processors
● A CPU can have parallel pipelines
● Superscalar: schedule instructions when possible; more than one can issue at once
  ● Static: fetch multiple instructions at once, issue all of them if possible; in-order execution
  ● Dynamic: fetch multiple instructions at once, with dynamic scheduling thereafter = out-of-order execution
● VLIW (a static superscalar with no hardware hazard detection): pack instructions into fixed-size long words, statically scheduled by the compiler


Memory hierarchy
● Memory performance has not increased as quickly as processor performance
● We would like an unlimited amount of very fast memory, but it's not feasible
● Principle of locality: temporal/spatial
● Hierarchy of memories, progressively larger but slower, organized in levels
● Multiple levels of cache down to main memory
  ● Hit/miss in a cache


Memory hierarchy
● Memory accesses per instruction: one to fetch the instruction, plus (possibly) a load/store to memory
● Miss penalty: how many CC it takes to get the data from a slower memory
● Miss rate: how often you miss in a cache
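
These three quantities combine as memory stall cycles per instruction = accesses per instruction x miss rate x miss penalty, which adds directly onto the base CPI. A sketch with invented numbers:

    def cpi_with_memory(base_cpi, accesses_per_instr, miss_rate, miss_penalty_cycles):
        # Stall cycles the cache misses add to every instruction, on average.
        stall_cycles = accesses_per_instr * miss_rate * miss_penalty_cycles
        return base_cpi + stall_cycles

    # 1 instruction fetch + 0.3 data accesses per instruction, 2% miss rate, 100-cycle penalty.
    print(cpi_with_memory(1.0, 1.3, 0.02, 100))   # 3.6: misses dominate the base CPI of 1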


Caches
● Block placement: where do you put a block? (direct mapped, set associative, fully associative)
● Block identification: how do you know you have the right block? (tag, index, offset)
● Block replacement: what do you do when a spot is taken in the cache? (random, LRU, FIFO)
● Write strategy: what do you do on a write? (write-through, write-back)
● Write-miss strategy: what do you do on a write miss? (write allocate, no-write allocate)


Caches
● How cache speedups can be calculated and integrated into the CPU time equation
● Cache addressing:
  ● Index
  ● Tag
  ● Offset
● Validity bit, dirty bit
● The three Cs of cache misses:
  ● Compulsory: you've never accessed this data before
  ● Capacity: the working set is too large for the cache
  ● Conflict: the mapping rules map too many blocks to the same location
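
A sketch of how the tag, index, and offset widths follow from the cache geometry (the sizes are example values, and 32-bit addresses are assumed):

    import math

    def cache_fields(cache_bytes, block_bytes, associativity, address_bits=32):
        num_sets = cache_bytes // (block_bytes * associativity)
        offset_bits = int(math.log2(block_bytes))            # selects a byte within the block
        index_bits = int(math.log2(num_sets))                # selects the set
        tag_bits = address_bits - index_bits - offset_bits   # identifies the block within the set
        return tag_bits, index_bits, offset_bits

    # 32 KiB cache, 64-byte blocks, 4-way set associative: 128 sets.
    print(cache_fields(32 * 1024, 64, 4))   # (19, 7, 6)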


Caches
● Miss rate
● Multi-level caches:
  ● Global miss rate
  ● Local miss rate
● Unified vs. split caches (data vs. instructions)
● Average Memory Access Time (AMAT)
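
AMAT = hit time + miss rate x miss penalty; with two cache levels, the L1 miss penalty is itself an AMAT into L2 (using L2's local miss rate), and the global L2 miss rate is the product of the two. The numbers below are illustrative.

    def amat(hit_time, miss_rate, miss_penalty):
        return hit_time + miss_rate * miss_penalty

    def amat_two_level(l1_hit, l1_miss_rate, l2_hit, l2_local_miss_rate, memory_penalty):
        # The penalty of an L1 miss is the average time to be served by L2 (or memory beyond it).
        l1_miss_penalty = amat(l2_hit, l2_local_miss_rate, memory_penalty)
        return amat(l1_hit, l1_miss_rate, l1_miss_penalty)

    print(amat(1, 0.05, 100))                       # single level: 6.0 cycles
    print(amat_two_level(1, 0.05, 10, 0.2, 100))    # with an L2: 2.5 cycles
    print(0.05 * 0.2)                               # global L2 miss rate: 1% of all accesses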


Cache optimizations
● Reduce the miss rate:
  ● Increase block size
  ● Increase cache size
  ● Increase associativity
● Reduce the miss penalty:
  ● Multilevel caching
  ● Prioritize reads over writes
● Reduce hit time:
  ● Avoid address translation during cache indexing


Cache coherence protocols
● Coherence: memory accesses to a single location are seen by all CPUs in the same order
● Consistency: memory accesses to different locations are seen by all CPUs in a consistent order
● Two classes of protocols:
  ● Directory based: a central location tracks sharing state
  ● Snooping: all caches observe a global bus
● Two ways to ensure cache coherence:
  ● Write invalidate: broadcast invalidation messages
  ● Write update: broadcast the new value


Cache coherence protocols
● Snooping + write-back + write-invalidate
● Three states: Invalid, Modified, Shared (MSI)
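
A sketch of the resulting MSI state machine for one cache block, in one common textbook formulation; PrRd/PrWr are this processor's reads and writes, BusRd/BusRdX are snooped requests from other caches, and the exact event names are assumptions, not from the tutorial.

    # (state, event) -> (next state, action on the bus)
    MSI = {
        ('I', 'PrRd'):   ('S', 'issue BusRd'),
        ('I', 'PrWr'):   ('M', 'issue BusRdX'),
        ('S', 'PrRd'):   ('S', None),
        ('S', 'PrWr'):   ('M', 'issue BusRdX (invalidate other copies)'),
        ('S', 'BusRd'):  ('S', None),
        ('S', 'BusRdX'): ('I', None),
        ('M', 'PrRd'):   ('M', None),
        ('M', 'PrWr'):   ('M', None),
        ('M', 'BusRd'):  ('S', 'write block back / supply data'),
        ('M', 'BusRdX'): ('I', 'write block back / supply data'),
    }

    def step(state, event):
        next_state, action = MSI[(state, event)]
        return next_state, action

    # CPU A writes a block, then CPU B reads it: A goes I -> M, then is demoted to S on B's BusRd.
    state = 'I'
    for event in ['PrWr', 'BusRd']:
        state, action = step(state, event)
        print(state, action)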


Virtual memory
● Each program gets a very large virtual address space, which is then mapped to physical memory
● Pages are moved in and out of memory as needed, depending on RAM size
● Good for multitasking: isolation and memory sharing, because virtual memory is relocatable
● The sum of all the memory needed by all programs can be larger than RAM; use the disk as extra memory space


Virtual memory
● Virtual memory requires a mapping to physical memory
● Block placement: fully associative
● Block identification: page table
● Block replacement: LRU, to minimize page faults
● Write strategy: write-back, because the disk is slow
● A translation look-aside buffer (TLB) is used to store recent translations
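
A toy sketch of the translation path (the page size, page-table contents, and dictionary-based TLB are all invented for illustration): split the virtual address into a virtual page number and an offset, check the TLB first, and fall back to the page table on a TLB miss.

    PAGE_SIZE = 4096                       # 4 KiB pages -> 12 offset bits
    OFFSET_BITS = 12

    page_table = {0: 7, 1: 3, 2: 9}        # virtual page number -> physical frame number
    tlb = {}                               # small cache of recent translations

    def translate(virtual_address):
        vpn = virtual_address >> OFFSET_BITS
        offset = virtual_address & (PAGE_SIZE - 1)
        if vpn in tlb:                     # TLB hit: no page-table walk needed
            frame = tlb[vpn]
        else:                              # TLB miss: walk the page table, then cache the entry
            if vpn not in page_table:
                raise RuntimeError('page fault: page %d not in memory' % vpn)
            frame = page_table[vpn]
            tlb[vpn] = frame
        return (frame << OFFSET_BITS) | offset

    print(hex(translate(0x1ABC)))          # VPN 1 maps to frame 3: 0x3abc
    print(hex(translate(0x1FFF)))          # second access to the same page hits in the TLB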


Final exam
● December 9th at 9 am
● 2 double-sided crib sheets allowed
  ● You have to hand them in with your exam
● 3-hour exam
● 120 points total
  ● 5 problems: 20 pts each = 100 pts
  ● 20 short-answer questions = 20 pts
