
FIGURE 6.16 The placement of the branch prediction components in the pipeline.

will be the appropriate target address. As branches change their taken/not-taken behavior, this next-line address is updated accordingly. The next-line predictor is, therefore, a combination of the functionality of a BTB and a bimodal predictor (see the section on “Bimodal Prediction”). If a more sophisticated direction predictor is present, it overrides the next-line predictor. One motivation for such an organization is that it permits a larger, slower, but more accurate direction predictor that may not be accessible in a single cycle. The Alpha 21264 takes such an approach [25], using as its slower but more accurate direction predictor the hybrid predictor described in the section on “Hybrid Prediction.”
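The override arrangement above can be sketched in a few lines of Python. This is a minimal illustration, not the Alpha 21264's actual implementation: the class names, table size, and indexing scheme are our own choices, and the bimodal predictor is the 2-bit saturating-counter scheme referenced in the “Bimodal Prediction” section.

```python
class BimodalPredictor:
    """Table of 2-bit saturating counters, indexed by branch PC."""

    def __init__(self, entries=1024):
        self.entries = entries
        self.counters = [2] * entries  # initialize to "weakly taken"

    def _index(self, pc):
        # Drop the byte offset within an instruction, then fold into the table.
        return (pc >> 2) % self.entries

    def predict(self, pc):
        # Counter values 2 and 3 mean "predict taken".
        return self.counters[self._index(pc)] >= 2

    def update(self, pc, taken):
        # Saturate at 0 (strongly not-taken) and 3 (strongly taken).
        i = self._index(pc)
        if taken:
            self.counters[i] = min(self.counters[i] + 1, 3)
        else:
            self.counters[i] = max(self.counters[i] - 1, 0)


def resolve_direction(next_line_taken, slow_predictor_taken):
    """The slower, more accurate predictor overrides the next-line guess.

    Returns the final direction and whether an override occurred (an
    override costs a small fetch bubble, not a full pipeline flush).
    """
    override = (slow_predictor_taken != next_line_taken)
    return slow_predictor_taken, override
```

For example, after a branch at `0x400` goes not-taken twice, the bimodal table flips to predicting not-taken, and `resolve_direction` reports an override whenever the next-line predictor still guesses taken.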

Pipeline Issues

In the most efficient organization, both the BTAC and the branch direction predictor are consulted during the fetch stage, as shown in Fig. 6.16. In this way, the PC can be updated immediately and the processor can fetch from the appropriate location (taken or not-taken) in the next cycle. This avoids introducing pipeline bubbles unless there is a BTAC miss or a branch misprediction.
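The next-PC selection that Fig. 6.16's muxes implement can be sketched as follows. The function name and the dictionary standing in for the BTAC are ours, purely for illustration: on a BTAC hit with a taken prediction, the PC is redirected to the cached target in the same cycle; on a not-taken prediction or a BTAC miss, fetch continues sequentially.

```python
def next_pc(pc, btac, predict_taken, inst_bytes=4):
    """Select the fetch address for the next cycle (Fig. 6.16's PC mux)."""
    target = btac.get(pc)            # BTAC lookup: taken-target, or None on a miss
    if target is not None and predict_taken:
        return target                # redirect fetch to the predicted taken target
    return pc + inst_bytes           # sequential fetch: not-taken, or BTAC miss

# One cached branch: instruction at 0x1000 with taken-target 0x2000.
btac = {0x1000: 0x2000}
assert next_pc(0x1000, btac, predict_taken=True) == 0x2000
assert next_pc(0x1000, btac, predict_taken=False) == 0x1004
assert next_pc(0x1004, btac, predict_taken=True) == 0x1008  # BTAC miss: fall through
```

Note the asymmetry: a taken prediction is only useful if the BTAC also hits, since without the cached target the fetch stage has nowhere to redirect to until the branch is decoded.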

Unfortunately, some problems occur with probing the branch-prediction hardware in the fetch stage. One concern is the branch-predictor and BTAC lookup times. These tables must be fast enough, and hence small enough, to permit the lookup to complete and the PC to be updated within a single cycle; otherwise the fetch stage falls behind. Current processors use predictors as large as 32 kbits, but Jiménez et al. [26] argue that the predictor size feasible for single-cycle access will shrink in the coming years. The reason is that even though the feature size on a processor die continues to shrink with Moore’s law [27], electrical RC delays are not shrinking accordingly, and hence wire delays are not shrinking as fast as logic delays. As feature size shrinks, large structures therefore seem to be getting relatively slower.

Another problem is that in a typical organization, the fetch stage cannot determine whether the instructions being fetched from the instruction cache contain any branches; that information must wait until the instructions are decoded. Several solutions are available. The first technique is for the instructions to be “pre-decoded” before they are installed into the instruction cache, to indicate which instructions are branches. The predictor structures can then be indexed using the actual addresses of the branches. Note that this means either that the predictor must be multi-ported to cope with fetch blocks that contain more than one branch, or that the predictor can predict only one branch at a time. This is not necessarily a major restriction, since if the predicted result is not-taken, the remaining instructions in
© 2002 by CRC Press LLC

[Figure 6.16 datapath labels: PC, +4 adder, two muxes, I-cache, bpred, BTAC; signals: T/NT, BTB hit?, predicted taken target, computed taken target (from decode); fetched instructions to decode.]
