
[FIGURE 8.16 High-level view of a superscalar processor: instruction window and decoupled fetch and execute engines. The instruction fetch engine feeds the instruction window, which issues instructions to multiple functional units (FUs) in the instruction execution engine.]

Instruction Fetch Bottleneck

Taken branches in the dynamic instruction stream cause frequent disruptions in the flow of instructions into the window. The best conventional instruction cache and next-program-counter logic incurs a single-cycle disruption when a taken branch is encountered. At best, sustained fetch bandwidth is equal to the average number of instructions between taken branches, which is typically from 6 to 8 instructions per cycle for integer programs [2,19,32]. Moreover, conventional branch predictors predict at most one branch per cycle, limiting fetch bandwidth to a single basic block per cycle.
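To see why fetch bandwidth saturates at the average basic-block size, consider a toy model (mine, not from the text): each cycle, fetch delivers contiguous instructions up to the fetch width or the next taken branch, whichever comes first. Even with an arbitrarily wide fetch unit, at most one basic block arrives per cycle.

```python
def sustained_fetch_bandwidth(fetch_width, insns_per_taken_branch):
    """Toy model: instructions fetched per cycle when fetch stops at every
    taken branch. A block of `insns_per_taken_branch` instructions needs
    ceil(block / fetch_width) cycles, so bandwidth = block / cycles."""
    block = insns_per_taken_branch
    cycles = -(-block // fetch_width)  # ceiling division
    return block / cycles

# With 7 instructions between taken branches, widening fetch from 4 to 16
# only raises sustained bandwidth from 3.5 to 7 instructions per cycle:
# the taken branch, not the fetch width, is the limiter.
```

In this model, `sustained_fetch_bandwidth(16, 7)` is 7.0, matching the text's point that conventional fetch cannot exceed the average number of instructions between taken branches.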

It is possible to modify conventional instruction caches and the next-program-counter logic to remove taken-branch disruptions; however, that approach is typically complex, and low latency is sacrificed for high bandwidth. A trace cache [8,14,18,20] changes the way instructions are stored to optimize instruction fetching for both high bandwidth and low latency.
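The key idea of a trace cache can be sketched with a toy model (illustrative only; real designs differ in indexing, associativity, and fill policy). A trace line stores a dynamic sequence of instructions, possibly spanning several taken branches, keyed by its starting PC and the predicted outcomes of the branches it contains, so a single low-latency access supplies multiple basic blocks:

```python
class TraceCache:
    """Toy trace cache: stores dynamic instruction sequences so that one
    access can deliver instructions spanning several taken branches."""

    def __init__(self):
        # (start_pc, branch_outcomes) -> list of instructions
        self.lines = {}

    def fill(self, start_pc, branch_outcomes, instructions):
        # Record a dynamic sequence (built as instructions retire) as one line.
        self.lines[(start_pc, tuple(branch_outcomes))] = list(instructions)

    def fetch(self, start_pc, predicted_outcomes):
        # Hit only if the predictor's outcomes match the stored trace's path;
        # on a miss the fetch engine falls back to the instruction cache.
        return self.lines.get((start_pc, tuple(predicted_outcomes)))

tc = TraceCache()
# A trace crossing two branches (taken, then not-taken):
tc.fill(0x400, [True, False], ["i1", "i2", "i3", "i4", "i5", "i6"])
hit = tc.fetch(0x400, [True, False])    # whole trace in one access
miss = tc.fetch(0x400, [False, False])  # different path: None
```

Because instructions are stored in dynamic rather than static order, the taken-branch disruption of a conventional cache disappears on a trace-cache hit.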

Inefficient High-Bandwidth Execution Mechanisms

The scheduling mechanism in a superscalar processor converts an artificially sequential program into an instruction-level parallel program. The scheduling mechanism is composed of register rename logic (identifies true dependences among instructions and removes artificial dependences), the scheduling window (resolves dependences near-optimally by issuing instructions out-of-order), and the register file with result bypasses (moves data to and from the functional units as instructions issue and complete, respectively). All of the circuits are monolithic and their speed does not scale well for 8 or more instructions per cycle [13].
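The rename step described above can be illustrated with a minimal sketch (the function and table names are mine, not from [13]). A map table records the latest physical register for each architectural register; each destination gets a fresh physical register, which removes artificial (WAR/WAW) dependences, while source reads of the current mappings expose only the true (RAW) dependences:

```python
def rename(instructions, num_arch_regs=8):
    """Sketch of register rename. `instructions` is a list of
    (dest, src1, src2) architectural register numbers in program order;
    returns the same stream with physical register names substituted."""
    # Initially, architectural register r maps to physical register p<r>.
    map_table = {r: f"p{r}" for r in range(num_arch_regs)}
    next_phys = num_arch_regs
    renamed = []
    for dest, src1, src2 in instructions:
        # Sources read the current mappings: these are the true dependences.
        s1, s2 = map_table[src1], map_table[src2]
        # A fresh physical destination breaks WAR/WAW hazards on `dest`.
        d = f"p{next_phys}"
        next_phys += 1
        map_table[dest] = d
        renamed.append((d, s1, s2))
    return renamed
```

For example, two back-to-back writes to architectural register 1 receive distinct physical registers, and the second instruction's read of register 1 is renamed to the first instruction's physical destination, making the true dependence explicit. The hardware analogue of this loop is a multi-ported map table checked by every instruction in the rename group each cycle, which is why the logic grows superlinearly with rename width.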

Trace processors [21,24,27,29,31] use a more efficient, hierarchical scheduling mechanism to optimize for both high-bandwidth execution and a fast clock.
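One way to see why a hierarchy helps (a sketch under my own assumptions, ignoring register redefinition for brevity): if the dynamic stream is split into fixed-length traces, each trace's register values divide into ones consumed only within the trace and ones live across a trace boundary. The former never need to leave a processing element, so per-element rename, window, and bypass structures can stay small and fast:

```python
def classify_values(instructions, trace_size):
    """Sketch: split a dynamic stream of (dest_reg, src_regs) tuples into
    traces and classify each trace's produced values as trace-local or
    cross-trace. Redefinitions are ignored for brevity."""
    traces = [instructions[i:i + trace_size]
              for i in range(0, len(instructions), trace_size)]
    result = []
    for t_idx, trace in enumerate(traces):
        produced = {dest for dest, _ in trace}
        # Registers read by any later trace are live across the boundary.
        consumed_later = {src
                          for later in traces[t_idx + 1:]
                          for _, srcs in later
                          for src in srcs}
        result.append({
            "local": produced - consumed_later,   # stays inside one element
            "global": produced & consumed_later,  # crosses elements
        })
    return result
```

Only the "global" values need the slow, shared interconnect between processing elements; everything else is handled by small local structures, which is the source of the fast clock.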

Control and Data Dependence Bottlenecks

Most control dependences are removed by branch prediction, but branch mispredictions incur large performance penalties because all instructions after a mispredicted branch are flushed from the window, even control- and data-independent instructions. Exploiting control independence preserves useful instructions

© 2002 by CRC Press LLC
