15.01.2013 Views

U. Glaeser



Modern Designs

Most high-performance processors now incorporate some form of superscalar processing. Even many simple processors can decode and execute one integer instruction along with one floating-point instruction per cycle. We briefly survey three representative processors in the following subsections. Other notable superscalar designs include the Compaq Alpha 21264, HP PA-8000, and MIPS R10000. Designers of IBM mainframes developed a superscalar implementation, the IBM ES/9000 Model 520, in 1992, but more recent implementations have reverted to scalar pipelines.

UltraSPARC, 1995, [8]

The UltraSPARC is an example of an in-order superscalar processor. It provides four-way instruction issue to nine functional units. The design team extensively simulated many alternatives and concluded that an out-of-order approach would have required a 20% penalty in clock cycle time and increased the time to market by up to half a year. The final design involves a nine-stage pipeline. This includes a decoupled front-end pipeline (fetch and decode stages) that performs branch prediction and places decoded instructions in a 12-entry buffer. A grouping stage then selects up to four instructions in-order from the buffer to be issued in the next cycle. Precise exceptions are provided by padding out most function unit pipelines to four stages each (i.e., the required length for the floating-point pipelines) so that most four-instruction groups complete in-order. The final two stages resolve any exceptions in the groups and write back the results.
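The grouping stage can be illustrated with a minimal sketch. The text gives only the group size (four) and the in-order constraint; the stopping rules used here (a register dependence on an earlier member of the same group, or no free functional unit of the needed kind) are simplifying assumptions, and the names `Instr` and `select_group` and the unit labels are hypothetical.

```python
from collections import deque
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

MAX_GROUP = 4  # four-way issue, as stated in the text


@dataclass(frozen=True)
class Instr:
    name: str
    dest: Optional[str]       # register written, or None
    srcs: Tuple[str, ...]     # registers read
    unit: str                 # hypothetical functional-unit class


def select_group(buffer: deque, free_units: Dict[str, int]) -> List[Instr]:
    """Take up to MAX_GROUP instructions from the head of the buffer, in order.

    Selection stops at the first instruction that cannot join the group, so a
    younger instruction never issues ahead of an older one.
    """
    group: List[Instr] = []
    written = set()                 # registers written by earlier group members
    units = dict(free_units)        # copy: the caller's counts stay untouched
    for instr in buffer:
        if len(group) == MAX_GROUP:
            break
        # Assumed rule: no dependence on an earlier member of the same group.
        if any(s in written for s in instr.srcs) or instr.dest in written:
            break
        # Assumed rule: a functional unit of the right kind must be free.
        if units.get(instr.unit, 0) == 0:
            break
        units[instr.unit] -= 1
        if instr.dest is not None:
            written.add(instr.dest)
        group.append(instr)
    for _ in group:                 # remove the selected instructions
        buffer.popleft()
    return group
```

With a buffer holding `add r1`, `ld r4`, `sub r6 <- r1`, and `fadd f0`, the first call returns only `add` and `ld`: `sub` reads `r1`, so it ends the group, and `fadd` must wait behind it even though it is independent. That loss is the cost of in-order issue that the design team accepted in exchange for cycle time.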

PowerPC 750, 1997, [9]

The PowerPC 750 is an example of an out-of-order processor with distributed reservation stations and a reorder buffer (called the completion buffer in the 750). The 750 has six function units, including two integer units. Each unit has one reservation station, except the load/store unit, which has two. Instructions can issue, when ready, from these reservation stations. (This limited form of out-of-order execution is sometimes called “function unit slip.”) The 750 also includes six rename registers for renaming the 32 integer registers and six rename registers for renaming the 32 floating-point registers.
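A small sketch may help show why this arrangement permits only limited reordering. The station counts follow the text; the unit labels, the `Entry`, `dispatch`, and `issue_ready` names, and the assumption that every result is available one step after issue are illustrative simplifications, not details of the 750.

```python
from dataclasses import dataclass
from typing import Dict, List, Set, Tuple

# Station counts follow the text: one per unit, two for the load/store unit.
# The unit labels are illustrative assumptions, not names from the source.
CAPACITY: Dict[str, int] = {
    "int1": 1, "int2": 1, "fpu": 1, "lsu": 2, "branch": 1, "system": 1,
}


@dataclass
class Entry:
    name: str
    srcs: Tuple[str, ...]   # source registers that must be ready
    dest: str               # register produced by the instruction


def dispatch(stations: Dict[str, List[Entry]], unit: str, entry: Entry) -> bool:
    """Place an instruction in a unit's reservation station if one is free."""
    queue = stations.setdefault(unit, [])
    if len(queue) >= CAPACITY[unit]:
        return False            # station(s) occupied: dispatch must stall
    queue.append(entry)
    return True


def issue_ready(stations: Dict[str, List[Entry]], ready: Set[str]) -> List[str]:
    """Issue the oldest entry of each unit whose sources are all ready."""
    issued: List[Entry] = []
    for queue in stations.values():
        if queue and all(src in ready for src in queue[0].srcs):
            issued.append(queue.pop(0))
    # Assumption: results become visible in the next step, regardless of unit.
    ready.update(entry.dest for entry in issued)
    return [entry.name for entry in issued]
```

If an `add` waiting on `r1` sits in an integer station while a load producing `r1` and an independent floating-point multiply sit in theirs, the first call to `issue_ready` issues the load and the multiply, and the `add` follows on the next call. Instructions slip past one another only across units; within a unit the single station keeps them in order.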

The overall pipeline works as follows. A decoding stage is not needed since instructions are predecoded into a wider representation as they are filled into the instruction cache. Up to four instructions are fetched per cycle into a six-entry instruction buffer. Logic associated with the instruction buffer removes any nops or predict-untaken branches and overwrites predict-taken branches with target-path instructions so that no instruction buffer entries are required for nops or branches. (However, predicted branches are kept in the branch unit until resolution to provide for misprediction recovery.) Up to two instructions can be dispatched per cycle to the reservation stations and can be allocated entries in the six-entry completion buffer. The integer units require a single cycle for execution, while the load/store unit and the floating-point unit require two and three cycles, respectively. After execution, results are placed into the assigned entries in the completion buffer. Up to two entries per cycle can be written back from the completion buffer to the register files.
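The interaction between dispatch width, execution latency, and in-order write-back can be sketched with a toy cycle counter. The widths, buffer size, and latencies are the figures quoted above; everything else is a simplifying assumption made for illustration: there are no data dependences or reservation-station conflicts, and a result may be written back in the cycle after it finishes executing. The function name `simulate` is hypothetical.

```python
from collections import deque

LATENCY = {"int": 1, "lsu": 2, "fpu": 3}   # execution cycles from the text
DISPATCH_WIDTH = 2       # instructions dispatched per cycle
RETIRE_WIDTH = 2         # completion-buffer entries written back per cycle
COMPLETION_ENTRIES = 6   # completion buffer size


def simulate(program):
    """Return {instruction name: cycle in which it is written back}.

    `program` is a list of (name, unit kind) pairs in program order.
    """
    pending = deque(program)
    in_flight = deque()       # completion buffer: (name, finish_cycle)
    retired = {}
    cycle = 0
    while pending or in_flight:
        cycle += 1
        # Write back from the head only, so results leave in program order.
        for _ in range(RETIRE_WIDTH):
            if in_flight and in_flight[0][1] < cycle:
                name, _finish = in_flight.popleft()
                retired[name] = cycle
            else:
                break
        # Dispatch while there is width and a free completion-buffer entry.
        for _ in range(DISPATCH_WIDTH):
            if pending and len(in_flight) < COMPLETION_ENTRIES:
                name, kind = pending.popleft()
                in_flight.append((name, cycle + LATENCY[kind]))
    return retired
```

For the program `fmul` (fpu), `add` (int), `lwz` (lsu), `sub` (int), this sketch writes back `fmul` and `add` together in cycle 5 and `lwz` and `sub` in cycle 6. The `add` finishes executing long before the `fmul` but waits behind it in the completion buffer, which is how the buffer preserves program order for write-back.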

Pentium 4, 2000, [5]

The Pentium 4 is the most recent 32-bit processor from Intel and is an example of a very aggressive out-of-order processor with a centralized instruction window and a reorder buffer. The original Pentium combines two integer pipelines, each similar in design to the pipeline of the 486, and can thus decode and execute up to two instructions in-order per cycle. Intel then developed the P6 core microarchitecture, which serves as the basis for the Pentium Pro, Pentium II, and Pentium III. After branch prediction and instruction fetch, the P6 core decodes up to three variable-length Intel IA-32 instructions each cycle and translates them into up to six fixed-length uops (microoperations). Up to three uops are processed by register renaming logic each cycle, and these are placed into the 20-entry centralized instruction window along with being allocated entries in the 40-entry reorder buffer. The window is scanned each cycle in a pseudo-FIFO manner in an attempt to issue up to four uops. Preference is given to back-to-back uops to reduce the amount of operand forwarding among the execution units. The actual scanning and issue requires two cycles, while most instructions require single-cycle execution. At maximum, the reorder

© 2002 by CRC Press LLC
