15.01.2013 Views

U. Glaeser


FIGURE 6.15 Pipeline behavior with branch prediction. In this diagram, the branch’s outcome is predicted. Immediately in the next cycle, subsequent instructions are fetched and executed speculatively (black boxes). If the prediction is correct, the speculative instructions do useful work and the bubble has been eliminated. If the prediction is incorrect, the speculative instructions are squashed. (From Skadron, K., “Characterizing and removing branch mispredictions.” PhD thesis, Princeton Univ., June 1999. With permission.)
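The behavior in the caption can be sketched as a toy model. This is only an illustration under stated assumptions: a one-wide pipeline with three stages (fetch, decode, execute), a branch whose outcome is known once it has been in the execute stage, and one instruction fetched per cycle. The function names and the `predicted_taken` / `actually_taken` parameters are hypothetical and do not come from the source.

```python
# Toy model (assumption): one-wide pipeline with stages F, D, X.
STAGES = ("F", "D", "X")  # fetch, decode, execute


def speculative_fetch_cycles(branch_fetch_cycle, stages=STAGES):
    """Cycles in which fetch proceeds while the branch is still unresolved.

    The branch occupies one stage per cycle, so it is in the last stage
    (execute) in cycle branch_fetch_cycle + len(stages) - 1. Every fetch
    after the branch's own fetch, up to and including that cycle, is
    speculative.
    """
    resolve_cycle = branch_fetch_cycle + len(stages) - 1
    return list(range(branch_fetch_cycle + 1, resolve_cycle + 1))


def surviving_speculative_instructions(branch_fetch_cycle,
                                       predicted_taken, actually_taken):
    """Count speculatively fetched instructions that are kept at resolution."""
    window = speculative_fetch_cycles(branch_fetch_cycle)
    if predicted_taken == actually_taken:
        return len(window)  # speculative work becomes non-speculative
    return 0                # misprediction: the whole window is squashed


window = speculative_fetch_cycles(1)
kept = surviving_speculative_instructions(1, True, True)
squashed_case = surviving_speculative_instructions(1, True, False)
print(window, kept, squashed_case)
```

With the branch fetched in cycle 1, the model treats the fetches in cycles 2 and 3 as speculative. A correct prediction keeps both instructions; an incorrect one discards both, which costs the same two slots that a stall would have wasted.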

Why Is It Needed?

Branch prediction is necessary because branches are frequent: they make up 15–25% of the instructions in a typical program. Without prediction, the pipeline would stall for each branch’s resolution (refer again to Fig. 6.14) and impose a substantial performance penalty. Even if the processor could issue only one instruction per cycle, and branch resolution stalled the pipeline for only one cycle, this would impose a performance penalty of 15–25%. But today’s pipelines are substantially longer (to permit faster clock speeds) and wider (to exploit instruction-level parallelism, or ILP), making the penalties much more severe in terms of wasted instruction-issue opportunities. Every additional stage in the pipeline between fetch and execute adds a cycle to the branch resolution delay. In addition, in today’s wide-issue “superscalar” pipelines, the penalty is equal to the resolution delay times the issue width. The minimum resolution delay in the Compaq®¹ Alpha 21264, a four-wide superscalar processor, is seven cycles [8], and the minimum resolution delay in the Intel Pentium®² Pro, a three-wide superscalar organization, and its successors is eleven cycles [9]. The corresponding penalties are 28 and 33 instruction-issue slots. Of course, programs often do not exhibit enough ILP to use the full issue width all the time, so the actual penalties are not quite so severe. On the other hand, the resolution delays just specified are only the minimum delays. The out-of-order nature of many high-performance processors’ execution engines means that instructions may spend an arbitrary amount of time in decoupling buffers, and this makes the pipeline seem longer and exacerbates the branch resolution delays. A correct branch prediction eliminates these stall cycles. A further problem

¹ Compaq Computer Corp., Houston, Texas.
² Intel Corp., Santa Clara, California.
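The penalty figures in the passage above follow from two small formulas, sketched here as a check on the arithmetic. The function names are hypothetical, and the stall-overhead calculation assumes an otherwise ideal pipeline that completes one instruction per cycle (a base CPI of 1), which is how the 15–25% figure is read here.

```python
def wasted_issue_slots(resolution_delay_cycles, issue_width):
    """Issue slots lost while one branch is unresolved.

    Per the text: penalty = resolution delay x issue width.
    """
    return resolution_delay_cycles * issue_width


def stall_overhead(branch_fraction, stall_cycles_per_branch, base_cpi=1.0):
    """Fractional slowdown when every branch stalls the pipeline.

    Assumption: without stalls, each instruction costs base_cpi cycles.
    """
    extra_cycles_per_instruction = branch_fraction * stall_cycles_per_branch
    return extra_cycles_per_instruction / base_cpi


# Minimum resolution delays and issue widths quoted in the text.
alpha_21264_slots = wasted_issue_slots(7, 4)    # seven cycles, four-wide
pentium_pro_slots = wasted_issue_slots(11, 3)   # eleven cycles, three-wide

# One-wide pipeline, one stall cycle per branch, 15% and 25% branches.
low_overhead = stall_overhead(0.15, 1)
high_overhead = stall_overhead(0.25, 1)

print(alpha_21264_slots, pentium_pro_slots)
print(f"{low_overhead:.0%} to {high_overhead:.0%}")
```

These are upper bounds on wasted slots per branch at the minimum delay. As the text notes, limited ILP lowers the real cost, while time spent in decoupling buffers raises the effective delay.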

© 2002 by CRC Press LLC

[Figure 6.15 diagram. Horizontal axis: cycles 1 to 4. Stage labels: F = fetch, D, X = execute. Legend: known-correct instruction; speculative instruction. Annotations: branch is fetched; successor instruction is fetched speculatively; branch is decoded; speculative execution continues; branch is executed; branch has been resolved; speculative instructions become non-speculative or are squashed; execution continues in nonspeculative mode.]
