The PowerPC 604 RISC Microprocessor - eisber.net

Recommendations

Info

17-bit signed integer. It can sustain onemultiply per rwo cycles for larger 32x32-bit operands. The CFX also uses themultiply pipeline stages to execute adivide instruction in 20 cycles. The 604Stwo simple integer units execute allother integer instructions in one cycle.The chree-stage floating-point unit cansustain one double-precision multiplyaddper cycle, one single-precisiondivide every 18 cycles, or one doubleprecisiondivide every 31 cycles. It compliesfully with the IEEE-Std 754floating-point arithmetic standard. The604 provides hardware support fordenormalization, exceptions, and threegraphics instructions. It also provides aGPR result busesIntegerunit 1non-IEEE mode for graphics support. Tne non-IEEE modeconverts a denorm.alized result to zero to avoid prenormalizationin subsequent operations.Instruction completion and write back. After aninstruction executes, the execution unit copies results to itsrename buffer entries and the execution status to its reorderbuffer entry. Among other things, the execution status indicateswhether the instruction Finished execution without anexception. Of the four reorder buffer entries examined everycycle, up to fcur instructions that finished without an exceptioncomplete in program order.Other than the in-order completion necessary to supportprecise exceptions, the 604 imposes only a few additionalrestrictions on instruction completion. They are1) Stop before a store instruction. Since a store dataoperand is read from the register file in the completionstage, a store instruction cannot complete if its storeoperand is still in the rename buffer. Stopping completionbefore a store instruction allows the store operandto be written to the register File, even if it is producedby an instruction currently being completed.2) Stop after a taken branch instruction. Since a takenbranch instruction sets the program counter to its targetaddress. it is speed critical to advance the programcounter from the new target address by the number ofinstructions completed after the taken branch in onecycle. Stopping completion after a taken branch instructionavoids this logic altogether.To minimize effects of long execution latenc'i on in-ordercompletion. the completion stage overlaps with the last executioncycle for those instructions with multicycle executionstage. These include the multiply. divide. store. load miss, andexecution serialized instructions A store instruction completesas soon :is it is translated without an exception. Similarly. aIntegerunit 2Complexinteger unitGPRrenamebufferGPR General-purpose registerOp OperandFigure 4. GPR result buses, reservation stations, and rename buffer.load instruction that misses in the data cache completes upontranslation without an exception. Since the reservation stationscan forward the load data when it is available to the dependentinstructions, the load miss can safely be completed.Most superscalar designs impose additional restrictionsdue to a limited number of ports to register files. For instance.four write ports would be required to complete up to fourinstructions if each one can update one register. The 604GPR file would require eight write ports to complete fourload-with-update instnictions per cycle. The 604 avoids thisproblem by decoupling instruction completion from registerFile updates using the write-back stage. Instructions completewithout regard to the type or number of registers theyupdate. The completion stage updates their results if ports areavailable; if not. the write-back stage updates them. Therename buffer entries function as temporary buffers for thoseinstructions that are not completed and as write-back buffersfor those that are. All three GPR. FPR. and CR rename bufferscontain two read ports for write back. Correspondingly, thethree register files have two write ports for write back.Memory operationsHigh-speed superscalar processors require a greater memorybandwidth to sustain their performance. The 604 meetsthe increased demand with large on-chip caches, nonblockingmemory operations, and a high-bandwidth systeminterface. The 604 takes advantage of the weakly orderedmemory model, to which the PowerPC architecture subscribes,to offer efficient memory operations. Although loadsand stores that hit in the data cache can bypass earlier loadsand stores, program order memory access can be enforcedwith instructions provided for this purpose.Load/store unit Figure 5 (next page) shows a block diagramof the load/store unit and the memory queues. Tnisunit has a two-cycle execution stage It calculates the memoryaddress and translates that address with a 128-entry, two-October 1994 13
PowerPC 604Reservation stationRA (0:31) pipeline registerLoad queue Store queueLoad miss registerGPR result busData cacheFigure 5. Load/store unit and memory queues.Byte alignment-4--1way set-associative translation lookaside buffer (T1B) in thefirst cycle. The second cycle processes loads making a speculativecache access and aligns bytes when the access hits inthe cache. The pipelined execution stage executes one loador store instruction per cycle.In the first half cycle, the load/store unit calculates a loadinstruction's memory address, denoted as EA in the figure.It translates the real address, denoted as RA in the figure,and the data cache access begins in the second half cycle. Ifthe access hits in the cache, the unit aligns the data and forwardsit to the rename buffer and the execution units in thesecond cycle. If the access misses. the unit places the instructionand its real address in the four-entry load queue. Whena load miss completes, it accesses the cache a second time.If the load is still a miss, the unit moves it to the load missregister while reloading the missing cache line. This permitsa second load miss to access the cache and to initiate thesecond cache line reload before the first is brought in.The unit calculates the memory address of a store and translatesit in the first cycle. It does not write the data to the cache,however, until after the store instruction completes. The unitplaces the instruction and its real address in the six-entry storequeue. Since the data cache is not accessed in the secondcycle, it is available for an earlier store from the store queue(if necessary) or load miss from the load queue (if necessary).When a store instruction completes. the load/store unitmarks it completed in the store queue so that instruction completioncan continue without waiting for storage to the cacheor memory. If the store hits in the cache, the unit writes it tothe cache and removes it from the store queue. If the storeIs a miss. the unit a ili bypass it in the store queue to alloylater stores to take place while cache reloading proceeds.Multiple store misses can be bypassed in the store queue.Figure 6 shows the store queue structure. Four pointersidentify the state of the store instructions in the circular storequeue. When a store has finished execution (or successfultranslation). the load.'store unit places it in the finished state.When it completes. the finish pointer advances to place it inthe completed state. When it is committed to cache or memory,the completion pointer advances to place it in the committedstate. If the store hits in the cache. advancing thecommit pointer removes it from the queue. If the store ismiss. the commit pointer does not advance until the missingcache line is reloaded and the store is written to the cacheWhile the cache line is being reloaded. the next store indicatedby the completion pointer can access the cache. If thisstore hits in the cache. the unit removes it from the queueIf it misses. another cache line reload begins.Caches. The 604 provides separate instruction and datacaches to allow simultaneous instruction and data . accessesevery cycle. Both 16-Kbyte caches provide byte parity protectionand a four-way set-associative organization with 32-byte lines. They are indexed with physical addresses. havephysical tags. and make use of the least recently usedreplacement algorithm.The instruction cache provides a 16-byte interface to thefetch unit to support the four-instruction dispatch design.This nonblocking cache allows subsequent instructions tobe fetched while a prior cache line is being reloaded. Thisdesign is particularly beneficial if the missing cache linebelongs in a mispredicted path. since it allows the correctinstructions to be accessed immediately after a branch mispredictionrecovery. The instruction cache also providesstreaming, a mechanism to forward instructions as they arereceived from off-chip cache or memory.The instruction cache does not maintain coherency:instead. the architecture provides a set of instructions forsoftware to manage coherency. In particular, the instructioncache block invalidate (ICBI) instruction causes all copies ofthe addressed cache line to be invalidated in the system. TheICBI generates an invalidation request, to which all coherentcaches must comply.The data cache contains a 64-bit data interface to theload/store unit for data access, ?AMU for tablewalking, andbus interface unit (BIU) for cache line reloading and snooping.(The architecture specifies an algorithm to traverse pagetable entries that define the virtual-to-physical memory mappings.- Tablewalk" refers to a hardware implementation ofthe algorithm.) The data cache's rwo copy-back buffers supportnonblocking cache operations. A copy-back buffer holdsa dirty (modified) cache line that is being replaced or that hitson a snoop request. The data cache moves an entire cacheline into a copy-back buffer in one cycle to minimize the14 IEEE Micro
Page 1 and 2: The PowerPC 604 RISC Microprocessor
Page 3 and 4: Rename valid I Reg num Result Resul
Page 5: updated by speculatively executed m
Page 9 and 10: Data addressInstruction address•M
Page 11 and 12: Threaded Codethreaded = aufgefadelt
Page 13 and 14: token threaded codeindirect token t
Page 15 and 16: Kosten auf dem MIPS 83000indirect t
Page 17 and 18: CTIL'Start / Restart' 14ZIP Funktio
Page 19 and 20: SystemparameierStackbefehleKarmen a
Page 21 and 22: KontrollstrukturenBEGIN ... END Sch
Page 23 and 24: Stack Framelocal stacklocalsparamet
Page 25 and 26: 122 The P-code Machine [Ch.Thus the
Page 27 and 28: 68Pascal Implementation: Compiler a
Page 29 and 30: 72 Pascal Implementation: Compiler
Page 35 and 36: Tables of Lexical Analysis123456frw
Page 37 and 38: JVIII ■ 111t14 /1.11111ySIS ai Y
Page 39 and 40: 4Pascal implementation: Compiler an
Page 41 and 42: 8 Pascal Implementation: Compiler a
Page 43 and 44: 12 Pascal implementation: Compiler
Page 45 and 46: 16Pascal Implementation: Compiler a
Page 47 and 48: 30 Pascal inipletnentatIon: Compile
Page 49 and 50: 34 Pascal implementation: Complier
Page 51 and 52: if) 1993, 1994, 1995 Sim Microsyste
Page 53 and 54: 0-PrefaceThis document describes ve
Page 55 and 56: method calls into actual method cal
Page 57 and 58:
auper_el arsThis field is an index
Page 59 and 60:
ATtagbytesCONSTANT_Integer_infoul t
Page 61 and 62:
2.6.1 SourceFileThe "SourceFile" at
Page 63 and 64:
local ..verieble_table_lengthThis f
Page 65 and 66:
1dc2wPush long or double from const
Page 67 and 68:
I 3.4Storing Stack Values into Loca
Page 69 and 70:
anewarrayAllocate new array of refe
Page 71 and 72:
PastoreStore into single float arra
Page 73 and 74:
laddLong integer addfSubSingle floa
Page 75 and 76:
dreminegInegInegdnegDouble float re
Page 77 and 78:
IliLong integer to integerSyntax:L
Page 79 and 80:
ifleBranch if less than or equal to
Page 81 and 82:
dcmpgDouble float compare (1 on NaN
Page 83 and 84:
3.13 Table JumpingtableswitchAccess
Page 85 and 86:
the matched method is found. The me
Page 87 and 88:
Appendix A: An OptimizationA.2 Push
Page 89 and 90:
cpgetfield2_quickFetch held from ob
Page 91 and 92:
checkcast_quickMake sure object is
Page 93 and 94:
Efficient JavaVM Just-in-Time Compi
Page 95 and 96:
The second pass of the compiler tra
Page 97 and 98:
chain length 1 2 3 4 5 6 7 8 9 >9oc
Page 99 and 100:
sieve JavaLex javac espresso i Toba
Page 101 and 102:
Technical Overview of the Common La
Page 103 and 104:
In contrast to the JVM where all st
Page 105 and 106:
The ldind t instruction expects an
Page 107 and 108:
It should be obvious that having va
Page 109 and 110:
9 Interaction between value and ref
Page 111 and 112:
2. CLI Partition II: Metadata. http
Page 113 and 114:
Anforderungen an den Zwischencode
Page 115 and 116:
DieRoboterprogrammiersprache• kei
Page 117 and 118:
Implementierung des Interpreters•
Page 119 and 120:
Printed by andi from a0.complang.tu
Page 121 and 122:
Page 123 and 124:
Page 125 and 126:
Page 127 and 128:
Page 129 and 130:
power but retain the disciplined vi
Page 131 and 132:
I- be reached from the present curs
Page 133 and 134:
the editor will be invoked asking t
Page 135 and 136:
immediately that 'I' has type integ
Page 137 and 138:
--y-PASC7,1,.The useof a formal lan
Page 139 and 140:
Implementation Techniques for Prolo
Page 141 and 142:
x(X) a(A)a(C) b(C), c(C)b(s(0))c(s(
Page 143 and 144:
Unification in general consists of
Page 145 and 146:
for the classification of temporary
Page 147 and 148:
copy stack1trailenvironment stack1t
Page 149:
process reduction attempt are elimi
show all

The PowerPC 604 RISC Microprocessor - eisber.net

You also want an ePaper? Increase the reach of your titles

Delete template?

Save as template?