15.01.2013 Views

U. Glaeser



fetch engine to identify which instructions are branches or to compute branch targets. Unfortunately, in most programs more than half of branches are taken [17], making the performance of static-not-taken usually quite poor. On the other hand, a static taken policy either requires the fetch engine to identify which instructions are branches and immediately identify their taken targets, or requires some delay while instructions are decoded and the target is computed.

A third policy takes advantage of the fact that backward conditional branches almost always correspond to loops, which tend to iterate multiple times, so these branches are likely to be taken. Non-backward branches, on the other hand, are less biased. Hennessy and Patterson [10] found that 85% of backward branches are taken while only 60% of forward branches are taken. This suggests a static policy of backwards-taken, forwards-not-taken, or BTFNT. The problem of computing branch targets remains.
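
As a rough illustration of the BTFNT rule, the sketch below predicts a branch taken when its target lies at or before the branch's own address. The function name `predict_btfnt` and the example addresses are hypothetical, and the sketch assumes the target is already known, which, as noted above, is exactly what the fetch engine still has to compute.

```python
def predict_btfnt(branch_pc: int, target_pc: int) -> bool:
    """Backwards-taken, forwards-not-taken static prediction.

    Returns True (predict taken) when the target is at or before the
    branch itself, i.e. the branch jumps backward as a loop would.
    Returns False (predict not taken) for forward branches.
    """
    return target_pc <= branch_pc


# Hypothetical addresses, chosen only for illustration.
loop_branch = predict_btfnt(branch_pc=0x1040, target_pc=0x1000)  # backward branch
skip_branch = predict_btfnt(branch_pc=0x1040, target_pc=0x1080)  # forward branch
```

Here `loop_branch` is predicted taken and `skip_branch` is predicted not taken.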

These policies were described by Smith [17] along with the core, bimodal dynamic prediction technique described in the section on “Bimodal Prediction.” Another seminal paper from this era is the exploration of branch predictor and branch target address cache design choices by Lee and Smith [18]. Both papers also survey the earliest literature on branch handling.

Branch Target Address Caches

Not only static techniques, but in fact all branch-prediction techniques have the problem that on a predicted-taken branch, the branch’s target must be computed. This requires extracting the offset field from the branch instruction and adding it to the PC, tasks that typically cannot be performed until the instruction-decode stage. If this is the case, some stall cycles result, called a “branch-taken bubble.” A second type of predictor, a branch target predictor, can eliminate this problem. In its simplest form, this is a small on-chip memory in the fetch stage that serves as a table of recently seen branches, a branch target address cache or BTAC [19,20]. (The BTAC is also often referred to as a branch target buffer or BTB, but this latter term is too heavily overloaded.) The BTAC is indexed with the branch’s address (in other words, the PC, or program counter, used to fetch the branch). It may be direct-mapped or associative, and tagged or not tagged. Omitting tags reduces cost, but then a BTAC miss cannot be identified, the predicted-taken branch will use the wrong target, and this will not be discovered until the branch resolves. For this reason, BTACs are best tagged.
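
The role of the tag can be made concrete with a minimal sketch of a direct-mapped, tagged BTAC. Everything here is an assumption made for illustration rather than a description of any real design: the class name `TaggedBTAC`, the table size, the fixed 4-byte instruction size, and the way the PC is split into index and tag bits.

```python
from typing import List, Optional


class TaggedBTAC:
    """Direct-mapped, tagged branch target address cache (illustrative only)."""

    def __init__(self, num_entries: int = 64, insn_bytes: int = 4) -> None:
        # Assumption: both sizes are powers of two so index/tag are bit fields.
        if num_entries <= 0 or num_entries & (num_entries - 1):
            raise ValueError("num_entries must be a power of two")
        if insn_bytes <= 0 or insn_bytes & (insn_bytes - 1):
            raise ValueError("insn_bytes must be a power of two")
        self.num_entries = num_entries
        self.insn_shift = insn_bytes.bit_length() - 1
        self.index_bits = num_entries.bit_length() - 1
        self.tags: List[Optional[int]] = [None] * num_entries
        self.targets: List[int] = [0] * num_entries

    def _split(self, pc: int):
        """Split the branch address into (index, tag)."""
        word = pc >> self.insn_shift            # drop the byte-offset bits
        index = word & (self.num_entries - 1)   # low bits select the entry
        tag = word >> self.index_bits           # remaining bits identify the branch
        return index, tag

    def lookup(self, pc: int) -> Optional[int]:
        """Return the cached target on a hit, or None on a (detectable) miss."""
        index, tag = self._split(pc)
        if self.tags[index] == tag:
            return self.targets[index]
        return None

    def update(self, pc: int, target: int) -> None:
        """Record the resolved target of the branch fetched at pc."""
        index, tag = self._split(pc)
        self.tags[index] = tag
        self.targets[index] = target


btac = TaggedBTAC(num_entries=4)
btac.update(0x1000, 0x2000)
hit = btac.lookup(0x1000)     # same branch: the stored target is returned
alias = btac.lookup(0x1010)   # maps to the same entry, but the tag differs
```

In this sketch `alias` is `None` because the tag comparison fails. Without the tag check, the lookup at `0x1010` would silently return the target recorded for the branch at `0x1000`, which is the undetected wrong-target case described above.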

The dynamic hardware schemes described later in this section maintain tables in which they track state about conditional branch directions. These direction-prediction tables are often indexed using the branch address. Because the BTAC table is also indexed by branch address, it may be convenient with these dynamic schemes to store the direction-prediction information in the BTAC along with each branch’s target. Aside from the convenience of integrating these different sources of information into one table, this confers the advantage that if the BTAC is tagged, any branch prediction state stored in the BTAC is also tagged. While some processors use this organization, Calder and Grunwald [21] point out that many branches are not taken and hence do not require the BTAC to store a target. Decoupling the direction-prediction state from the target-prediction state therefore permits a smaller BTAC. It also improves flexibility, as some predictors, such as global-history predictors (see the section on “Two-level Prediction”), do not keep a one-to-one mapping between branch addresses and direction-prediction entries.
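
One way to picture the decoupled organization is the sketch below, in which direction state and target state live in separate structures and a target is stored only for branches that have actually been taken. This is a loose illustration under several assumptions, not the design from [21]: the class `DecoupledFrontEnd` is hypothetical, the one-bit "last outcome" table merely stands in for whatever direction predictor is used, and a Python dictionary stands in for a tagged BTAC of unlimited capacity.

```python
from typing import Dict, List


class DecoupledFrontEnd:
    """Separate direction table and BTAC (illustrative only)."""

    def __init__(self, insn_bytes: int = 4, direction_entries: int = 1024) -> None:
        self.insn_bytes = insn_bytes
        self.direction_entries = direction_entries
        # Placeholder direction predictor: remembers each entry's last outcome.
        self.last_taken: List[bool] = [False] * direction_entries
        # Stand-in for a tagged BTAC: maps branch address -> taken target.
        self.btac: Dict[int, int] = {}

    def _dir_index(self, pc: int) -> int:
        return (pc // self.insn_bytes) % self.direction_entries

    def next_fetch_pc(self, pc: int) -> int:
        """Choose the next fetch address for the instruction fetched at pc."""
        target = self.btac.get(pc)
        if self.last_taken[self._dir_index(pc)] and target is not None:
            return target                  # predicted taken, and target is known
        return pc + self.insn_bytes        # otherwise fall through sequentially

    def resolve(self, pc: int, taken: bool, target: int) -> None:
        """Update both structures once the branch outcome is known."""
        self.last_taken[self._dir_index(pc)] = taken
        if taken:
            self.btac[pc] = target         # only taken branches occupy the BTAC


front_end = DecoupledFrontEnd()
front_end.resolve(0x100, taken=False, target=0x200)   # never-taken branch
entries_after_not_taken = len(front_end.btac)
front_end.resolve(0x300, taken=True, target=0x080)    # loop-style branch
predicted = front_end.next_fetch_pc(0x300)
```

After these two updates only the taken branch at `0x300` occupies a BTAC entry, while the direction table holds state for both branches. That asymmetry is the source of the space saving the text attributes to decoupling.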

Instead of a BTAC, the processor might employ a branch target instruction cache, which stores some actual instructions from the branch target rather than merely the target address. This replicates quite a bit of state from the instruction cache, so this organization is rarely seen, although it does appear in the Motorola®⁵ PowerPC®⁶ G4 [22], for example.

The BTAC can also be integrated with the instruction cache. Each cache line can simply store the target address of one or more of its branches in case that branch is predicted taken. Alternatively, the I-cache can implement a next-line predictor [23]. Each cache line now stores the index of the next cache line to be fetched (and also the set if the cache is associative) [24]. If no branches are taken in the current line, the next-line address will be the next sequential address. If there is a taken branch, the next-line address

5 Motorola, Inc., Schaumburg, Illinois.<br />

6 International Business Machines Corp., Armonk, New York.<br />

© 2002 by CRC Press LLC
