15.01.2013 Views

U. Glaeser


FIGURE 8.19 Trace processor. [Block diagram not reproduced. Labels in the diagram: PE 0 through PE 3; local register file; local bypass; trace window; functional units (FU); trace dispatch bus; global register file; global bypass; global result buses; cache ports; branch predictor; instruction cache; trace predictor; trace cache; ARB; D$; global rename maps and freelist.]

Trace Processor: Efficient High-Bandwidth Instruction Execution

Instruction execution is inefficient in wide-issue superscalar processors because all data dependences are handled uniformly. When an instruction issues, its data-dependent instructions wake up with uniform latency, usually a single cycle, regardless of their location in the window. Resolving all dependences in a single cycle optimizes parallelism, but cycle time is extended to accommodate the full length of the window. Increasing processor cycle time penalizes the entire pipeline. A better alternative is to increase the number of cycles to resolve data dependences, e.g., two cycles instead of one, so other pipeline stages are unaffected. However, all data dependences are then slow to resolve.

Fortunately, there is a compromise between optimizing for parallelism and optimizing for cycle time if data dependences are handled nonuniformly. A trace processor [21,24,27–29,31] hierarchically divides the processor into smaller processing elements (PEs), as shown in Fig. 8.19. The approach preserves a fast clock and resolves many data dependences in one clock cycle (data dependences within PEs), at the expense of resolving some data dependences in two or more clock cycles (data dependences among PEs). The microarchitecture shown in Fig. 8.19 is described in the remainder of this section.
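The nonuniform handling of dependences can be illustrated with a small latency model. The following is a minimal sketch, not the authors' simulator: the names `Instr`, `wakeup_latency`, and `chain_latency` are hypothetical, and the inter-PE latency of exactly two cycles is an assumption standing in for the text's "two or more."

```python
from dataclasses import dataclass
from typing import Sequence

INTRA_PE_LATENCY = 1  # cycles; dependence resolved within one PE (from the text)
INTER_PE_LATENCY = 2  # cycles; assumed value for "two or more" across PEs


@dataclass(frozen=True)
class Instr:
    name: str
    pe: int  # index of the processing element holding this instruction


def wakeup_latency(producer: Instr, consumer: Instr) -> int:
    """Cycles from producer issue until a dependent consumer can wake up."""
    if producer.pe == consumer.pe:
        return INTRA_PE_LATENCY
    return INTER_PE_LATENCY


def chain_latency(chain: Sequence[Instr]) -> int:
    """Total wakeup latency along a serial dependence chain."""
    return sum(wakeup_latency(p, c) for p, c in zip(chain, chain[1:]))


a, b, c = Instr("a", pe=0), Instr("b", pe=0), Instr("c", pe=1)
print(chain_latency([a, b, c]))  # 1 (a to b, same PE) + 2 (b to c, across PEs) = 3
```

In this model, a chain that stays inside one PE pays one cycle per dependence, while each crossing between PEs adds the longer global latency.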

Instruction Supply

The trace predictor and trace cache supply a single trace per cycle. The conventional branch predictor and instruction cache shown in Fig. 8.19 are secondary, backup mechanisms for constructing traces that miss in the cache or that were mispredicted [21,23,24].
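The fast path and the backup path can be sketched as a cache lookup with a fallback. This is an illustrative sketch only: the class `TraceSupply`, the `build_trace` callback (standing in for trace construction with the conventional branch predictor and instruction cache), and the trace identifier made of a start PC plus branch outcomes are all assumptions, not details given in the text. Only the trace cache miss case is modeled; recovery from a mispredicted trace is omitted.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical trace identifier: (start PC, outcomes of embedded branches).
TraceId = Tuple[int, Tuple[bool, ...]]
Trace = List[str]


class TraceSupply:
    def __init__(self, build_trace: Callable[[TraceId], Trace]) -> None:
        self.trace_cache: Dict[TraceId, Trace] = {}
        self.build_trace = build_trace  # slow path: construct the trace
        self.hits = 0
        self.misses = 0

    def fetch(self, predicted: TraceId) -> Trace:
        """Return the predicted trace, constructing it on a trace cache miss."""
        trace = self.trace_cache.get(predicted)
        if trace is not None:
            self.hits += 1  # fast path: whole trace supplied at once
            return trace
        self.misses += 1
        trace = self.build_trace(predicted)  # backup mechanism
        self.trace_cache[predicted] = trace  # fill the cache for later fetches
        return trace


def slow_path(trace_id: TraceId) -> Trace:
    start_pc, _outcomes = trace_id
    return [f"instr@{start_pc + 4 * i:#x}" for i in range(4)]


supply = TraceSupply(slow_path)
tid = (0x100, (True, False))
first = supply.fetch(tid)   # miss: trace is constructed and cached
second = supply.fetch(tid)  # hit: same trace comes from the trace cache
print(supply.misses, supply.hits)  # 1 1
```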

Register Renaming

Register renaming determines data dependences among all newly fetched instructions, and between newly fetched instructions and other instructions already in the window. The first aspect, determining data dependences among 16 or 32 fetched instructions, almost certainly takes more than a single clock cycle.
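Both aspects of renaming can be seen in a sequential software model with a rename map and a freelist. This is a minimal sketch under stated assumptions: the function `rename`, the register names, and the tuple encoding of instructions are hypothetical, and the loop processes instructions one at a time, whereas hardware must resolve the same dependences across the whole fetch group at once.

```python
from collections import deque
from typing import Deque, Dict, List, Tuple

# Hypothetical encoding: (destination logical register, source logical registers).
LogicalInstr = Tuple[str, List[str]]
RenamedInstr = Tuple[str, List[str]]


def rename(group: List[LogicalInstr],
           rename_map: Dict[str, str],
           freelist: Deque[str]) -> List[RenamedInstr]:
    """Rename a fetch group in program order."""
    renamed = []
    for dest, srcs in group:
        # Read sources first: the map reflects both instructions already in
        # the window and earlier instructions of this same group.
        phys_srcs = [rename_map[s] for s in srcs]
        new_dest = freelist.popleft()  # allocate a free physical register
        rename_map[dest] = new_dest    # later readers of dest see this writer
        renamed.append((new_dest, phys_srcs))
    return renamed


rename_map = {"r1": "p1", "r2": "p2", "r3": "p3"}
freelist = deque(["p4", "p5"])
group = [("r1", ["r2", "r3"]),   # r1 = r2 op r3
         ("r2", ["r1", "r3"])]   # r2 = r1 op r3, depends on the instruction above
renamed = rename(group, rename_map, freelist)
print(renamed)  # [('p4', ['p2', 'p3']), ('p5', ['p4', 'p3'])]
```

In the second renamed instruction, `p4` is a dependence within the fetch group, while `p3` is a dependence on a value produced by an instruction already in the window.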

© 2002 by CRC Press LLC

