Hybrid Branch Prediction Strategies
Paul Jaeger, Colin Thompkins, Brad Werth
This paper introduces the generalized notion of hybrid branch prediction, of which the
tournament predictor is a specific example. Two-component hybrid prediction schemes
composed of local two-level lookup, global history lookup, and global perceptrons are
compared against the performance of their component pieces. A strategy for three-component selection is demonstrated, and three-component hybrid branch predictors are
evaluated. The ideal distribution of the bit budget between components for hybrid prediction
schemes is also explored.
When designing a microprocessor architecture, efficiency is gained at the cost of additional hardware complexity and expense. Early microprocessor designs dealt with high hardware
component costs by keeping the architecture as simple as possible. As manufacturing
processes improved, hardware costs dropped, allowing microprocessor designs to grow in
complexity and efficiency. The modern microprocessor architect has a large, but finite
"budget" of hardware to utilize in the effort to improve processor efficiency. One of the
significant sources of inefficiency in program execution is the failure to correctly predict
whether or not a branch is taken. It is for this reason that all modern microprocessors employ a branch prediction scheme.
Branch prediction became necessary when microprocessor designs started employing the
technique of pipelining. In a pipelined architecture, a conditional branch instruction (like all
instructions) takes several cycles to complete. During this time, the pipeline continues to fill
with instructions -- work which will be discarded if the instructions have been retrieved from
the incorrect location in memory. In such a case the penalty to efficiency is at least one (and
perhaps many more) cycles' worth of wasted work. Correct prediction, on the other hand,
results in no adverse efficiency impact. Since the average program contains one conditional
branch in every ten instructions, correct branch prediction is a clear priority.
Several basic, single-strategy prediction schemes have been developed, with reasonably
good results. Two of the most common basic predictors are a global history lookup, and a
two-level local history lookup. Both of these schemes have a number of parameters that can
be tuned to find an ideal solution for a given hardware budget. With sufficient hardware
resources, single-strategy prediction schemes can achieve 90% or better prediction success.
However, a common problem with single-strategy schemes is that they can be "vulnerable" to
certain execution patterns which cause them to perform sub-optimally in some circumstances. No amount of additional hardware can compensate for these fundamental flaws in the underlying strategy.
One way to overcome this problem is to create a multiple-strategy prediction scheme, a
construction we will call a "hybrid". A hybrid scheme uses several component prediction
schemes running concurrently and contains some method to arbitrate between them. The
"tournament"1 predictor is a popular and successful hybrid of the global and two-level local
prediction schemes. Tournament predictors can achieve 93% or better prediction success,
with sufficient hardware resources. Because of this very high rate of success, most research
in hybrid schemes has focused on fine-tuning the tournament predictor scheme. One of the
goals of this paper is to re-open hybrid schemes to more general composition.
Research into new branch prediction schemes continues. Some new types of single-strategy schemes have been developed since the tournament predictor was first introduced. One of
the most promising is the global perceptron predictor, which can achieve a prediction success
rate similar to the tournament predictor. Instead of considering this scheme to be a terminal,
self-contained solution, this predictor can be viewed as a powerful new component predictor
to be used in hybrid schemes.
Hybrid prediction schemes need not be limited to two components, either. Including more
than two components adds some complexity to the component selection process, but not
insurmountably so. In our experiments we tested a variety of three-component hybrid
schemes using a new selection process designed to be simple and effective.
Discussion of Predictor Components and Parameters
For our experiments, we used a building-block approach for the composition of each
prediction scheme. We used a number of component branch predictors and predictor
selectors as building blocks. The workings of each type of component are described below.
The local two-level predictor2 hashes the branch instruction's address bits into a table of local
history entries. These history entries store the taken / not taken results for the last several
executions of the branches referenced at that index. The history entry is then used as an
index into a table of saturating counters, functioning as predictors. After the result of the
branch is known, the predictor and the history entry are both updated. This scheme has
three parameters that we manipulated in our analysis: the number of entries in the history
table, the length of the history entries, and the number of bits used by the predictors.
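As a concrete illustration, the lookup and update paths just described can be sketched as follows. The parameter values here (1024 history entries, 8 history bits, 2-bit counters) are illustrative choices for the sketch, not values drawn from our experiments:

```python
# Sketch of a local two-level predictor; parameter values are illustrative.
HIST_ENTRIES = 1024   # entries in the per-branch history table
HIST_LEN = 8          # taken/not-taken bits kept per history entry
CTR_BITS = 2          # width of each saturating counter

CTR_MAX = (1 << CTR_BITS) - 1
history = [0] * HIST_ENTRIES
counters = [CTR_MAX // 2 + 1] * (1 << HIST_LEN)  # initialize weakly taken

def predict(pc):
    h = history[pc % HIST_ENTRIES]       # hash branch address into history table
    return counters[h] > CTR_MAX // 2    # counter's upper half means "taken"

def update(pc, taken):
    idx = pc % HIST_ENTRIES
    h = history[idx]
    # train the saturating counter selected by this branch's local history
    if taken:
        counters[h] = min(CTR_MAX, counters[h] + 1)
    else:
        counters[h] = max(0, counters[h] - 1)
    # shift the new outcome into the local history entry
    history[idx] = ((h << 1) | int(taken)) & ((1 << HIST_LEN) - 1)
```

The three tunable parameters of the text appear directly as the three constants at the top.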
The global history predictor3 maintains a single global branch history buffer which stores the
results of the last several branches executed. This history value is used as an index into a
table of saturating counters functioning as predictors. After the branch result is known, the
history buffer and the predictor are both updated. This scheme has two parameters to
consider: the length of the history buffer, and the number of bits used by the predictors.
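The global predictor's state is smaller still: one shared history register and one counter table. A minimal sketch, again with illustrative sizes:

```python
# Sketch of a global history predictor; parameter values are illustrative.
GHIST_LEN = 12        # bits of global branch history
CTR_BITS = 2
CTR_MAX = (1 << CTR_BITS) - 1

ghist = 0
counters = [CTR_MAX // 2 + 1] * (1 << GHIST_LEN)  # initialize weakly taken

def predict():
    return counters[ghist] > CTR_MAX // 2   # global history indexes the counters

def update(taken):
    global ghist
    if taken:
        counters[ghist] = min(CTR_MAX, counters[ghist] + 1)
    else:
        counters[ghist] = max(0, counters[ghist] - 1)
    # shift the new outcome into the shared history buffer
    ghist = ((ghist << 1) | int(taken)) & ((1 << GHIST_LEN) - 1)
```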
The global perceptron predictor4, like the global history predictor, maintains a global branch
history buffer. Also, like the local two-level predictor, the perceptron predictor hashes the
branch instruction's address bits into a table. Instead of containing local branch history, this
table contains perceptrons which are used to weight the global branch history in order to
arrive at a prediction. After the branch result is known, the global history is updated, and the
perceptron used in the prediction has its weights modified. This predictor has three
parameters we considered: the number of entries in the table, the number of weight bits used
by the perceptrons, and the length of the global branch history. There is one additional
parameter we did not consider: the perceptron training threshold. We derived this value from the history buffer length using the formula proposed by Jimenez and Lin, which has been found to be experimentally optimal.
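The perceptron's prediction and training paths can be sketched as below. Table and history sizes are illustrative; the threshold formula theta = floor(1.93h + 14) is the Jimenez/Lin formula referenced above:

```python
# Sketch of a global perceptron predictor; sizes are illustrative.
TABLE_ENTRIES = 256
GHIST_LEN = 16                      # h, the global history length
WEIGHT_BITS = 8
W_MAX = (1 << (WEIGHT_BITS - 1)) - 1
W_MIN = -(1 << (WEIGHT_BITS - 1))
THETA = int(1.93 * GHIST_LEN + 14)  # Jimenez/Lin training threshold

# each perceptron holds a bias weight plus one weight per history bit
table = [[0] * (GHIST_LEN + 1) for _ in range(TABLE_ENTRIES)]
ghist = [1] * GHIST_LEN             # outcomes stored as +1 (taken) / -1 (not)

def output(pc):
    w = table[pc % TABLE_ENTRIES]   # hash branch address into perceptron table
    return w[0] + sum(wi * xi for wi, xi in zip(w[1:], ghist))

def predict(pc):
    return output(pc) >= 0

def update(pc, taken):
    y = output(pc)
    t = 1 if taken else -1
    w = table[pc % TABLE_ENTRIES]
    # train on a misprediction, or while |output| is below the threshold
    if (y >= 0) != taken or abs(y) <= THETA:
        w[0] = max(W_MIN, min(W_MAX, w[0] + t))
        for i, xi in enumerate(ghist):
            w[i + 1] = max(W_MIN, min(W_MAX, w[i + 1] + t * xi))
    ghist.pop()                      # shift the new outcome into history
    ghist.insert(0, t)
```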
For hybrid prediction schemes, a selection process must exist to choose which component
predictor should be used for a given case. This functionality is contained within a data
construct called a selector. We used a variety of selectors in our experiments.
Fundamentally, for all our hybrids, we use a table of selectors hashed by the branch
instruction's address bits. The existence of this table indicates one of the parameters of a
hybrid scheme: the size of the selector table.
For our two-component hybrids, we used two-state selectors -- the same selectors used in
most tournament prediction schemes. Two-state selectors function similarly to saturating
counters, but the logic for updating their state is more complicated. If the two predictors are A
and B, and the selector is in a state indicating confidence in predictor A, confidence in A will be
increased only if A is correct and B is wrong. Likewise, confidence in A will decrease (and
confidence in B will increase) when B is correct and A is wrong. Two-state selectors have a
single parameter: the number of bits used.
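The update rule for a two-state selector can be sketched directly from the description above. We assume a 2-bit encoding for illustration, with the lower half of the range favoring A and the upper half favoring B:

```python
# Sketch of a two-state (tournament-style) selector; 2-bit encoding assumed.
SEL_BITS = 2
SEL_MAX = (1 << SEL_BITS) - 1

def choose(state):
    return 'A' if state <= SEL_MAX // 2 else 'B'

def update(state, a_correct, b_correct):
    # the state moves only when exactly one component predictor is correct
    if a_correct and not b_correct:
        return max(0, state - 1)        # grow confidence in A
    if b_correct and not a_correct:
        return min(SEL_MAX, state + 1)  # grow confidence in B
    return state                        # tie (both right or both wrong)
```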
For three-component hybrids, the selection process must be able to indicate which of the
three predictors should be used. We designed a three-state selector specifically for this
purpose. Our general three-state selector has three sets of states that represent increasing
confidence in one of the three predictors. The update logic of the basic three-state selector is
fairly complicated. Figure 1 is a diagram of the update logic for the three-state selector. In
the diagram, A, B, and C represent the three component predictors, and the oblong shapes represent the sets of states dedicated to them. The triplet of numbers on each state transition represents, in order, whether A, B, or C is correct:
Figure 1. A left handed three-state selector
This arrangement of states introduces a favoritism in the case of a "tie" between the
predictors. For example: if the selector is in a state of minimal confidence in A, and A is wrong
while both B and C are correct, to which state should the selector transition? Clearly
remaining on the current state is not a valuable strategy. In this case, a definite favoritism is
needed, which we call "handedness". The favoritism of A -> B -> C -> A we call "right
handedness" and A -> C -> B -> A we call "left handedness". The selector in Figure 1 is left handed.
Clearly, the mirror-nature of the state arrangement makes the distinction of handedness
unnecessary for simple three-state selection -- by changing the order of components in the
hybrid, we can induce in a given selector whatever favoritism is desired. This distinction can
be useful, however, in a "switching" three-state selector. By using an additional bit of
information for the selector, it can track its handedness and switch when appropriate. For
example, a switching three-state selector that is currently right handed and indicating A would
switch to being left handed whenever C is correct and B is wrong, regardless of the overall
state transition undertaken. For this reason, our three-state selectors have two parameters:
the number of bits used, and whether or not the handedness switches during state transitions.
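Figure 1's exact state machine is not reproduced here, but the three ingredients described above -- a favored predictor with saturating confidence, handedness-based tie-breaking, and the switching rule -- can be sketched with a simplified encoding of our own (not the paper's bit layout):

```python
# Simplified sketch of a three-state selector with handedness; the
# encoding here is illustrative, not the Figure 1 state machine.
CONF_MAX = 1  # one extra confidence level per predictor, for illustration

LEFT  = {'A': 'C', 'B': 'A', 'C': 'B'}   # left handed:  A -> C -> B -> A
RIGHT = {'A': 'B', 'B': 'C', 'C': 'A'}   # right handed: A -> B -> C -> A

def update(fav, conf, handed_right, correct):
    """correct maps each of 'A', 'B', 'C' to whether it predicted correctly."""
    order = RIGHT if handed_right else LEFT
    if correct[fav]:
        return fav, min(CONF_MAX, conf + 1), handed_right
    winners = [p for p in 'ABC' if correct[p]]
    if not winners:
        return fav, conf, handed_right      # all wrong: stay put
    if conf > 0:
        return fav, conf - 1, handed_right  # lose confidence before moving
    # tie between correct predictors: handedness chooses the new favorite
    nxt = order[fav] if order[fav] in winners else winners[0]
    return nxt, 0, handed_right

def maybe_switch(fav, handed_right, correct):
    # switching variant: flip handedness when the predictor disfavored by
    # the current handedness is correct and the favored-next one is wrong
    order = RIGHT if handed_right else LEFT
    other = LEFT if handed_right else RIGHT
    if correct[other[fav]] and not correct[order[fav]]:
        return not handed_right
    return handed_right
```

The `maybe_switch` rule matches the example in the text: a right-handed selector indicating A switches to left-handed when C is correct and B is wrong.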
For this experiment, a number of traces were used as input data. Some of these traces are
from the SPEC95 benchmark suite5, and some from the SPEC2000 benchmark suite6. The traces used were:
099.go -- An internationally ranked go-playing program.
124.m88ksim -- A chip simulator for the Motorola 88100 microprocessor.
126.gcc -- Based on the GNU C compiler version 2.5.3.
130.li -- Xlisp interpreter.
132.ijpeg -- Image compression/decompression on in-memory images.
134.perl -- An interpreter for the Perl language.
181.mcf -- Combinatorial optimization.
All traces were transformed into a common format, and run through a simulator. For each
type of predictor (simple, two-component, three-component), all the traces were run through
various permutations of the parameters for that type of predictor. Early ad-hoc testing for
reasonable boundary values of each parameter gave definition to an experimental domain.
This domain was further stratified by the artificial imposition of a hardware budget on the test
predictors. The simulator used each predictor type's parameters in various permutations in
this parameter space. For a given hardware budget, the best-performing predictors of each
type were identified and compared against each other. The final misprediction rate for each predictor was calculated as the harmonic mean of its misprediction rates across all the traces.
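The scoring computation is a one-liner, sketched here for concreteness:

```python
# Harmonic mean of per-trace misprediction rates (all rates assumed > 0).
def harmonic_mean(rates):
    return len(rates) / sum(1.0 / r for r in rates)
```

The harmonic mean weights the lower rates more heavily than an arithmetic mean would, so a predictor cannot hide one badly handled trace behind several easy ones.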
The hardware budget for these predictors can be reduced to the simpler concept of a "bit budget". The cost and size of the various logical constructs of the hardware pale in
comparison to the cost and size of the memory requirements of a given prediction scheme.
For this reason, we use the bit budget as a reflection of how much data can be used by each
scheme, and this is a reasonable stand-in for the actual hardware cost and size. We considered bit budgets of 16K, 32K, and 64K, intended to be reflective of the restrictions faced by the microprocessor architects of today. For embedded microprocessors, smaller budgets would be more relevant; as hardware costs continue to decrease, bit budgets of future microprocessors may be much larger.
For each simple predictor, we first did some ad hoc testing to assess reasonable boundary
values for all parameters. Then, for each type of predictor, various permutations of the
parameters were considered. The best performing simple predictors for each type and bit budget:
Figure 2. Performance of best simple predictors for each bit budget
For the two-component hybrids, we kept the parameter domain at the same size as the
simple predictor experiments, with the exception that we constrained table sizes to powers of
two. We did not want to rule out the possibility that some parameter combination that
performed poorly in a simple predictor could achieve superior results as part of a hybrid. This
considerably multiplied the computation time necessary to do all the testing on each hybrid.
The best two-component hybrids for each type and bit budget:
Figure 3. Performance of best two-component hybrids for each bit budget
For the three-component hybrids, it was necessary to constrain the parameter domain further
in order to run our experiments in a reasonable amount of time. We identified the highest-performing simple and two-component hybrid predictors and modeled the parameter domain around values that appeared to be most successful in those conditions. The best three-component hybrids for each bit budget:
Figure 4. Performance of best three-component hybrid for each bit budget
Since a lower misprediction rate indicates superior performance, the data shows that the
simple global perceptron predictor is the best performing predictor overall. The hybrid
perceptron-local predictor does slightly better at the 64K bit budget. Here is a graph of the
misprediction rates of all seven predictor types, per bit budget:
Figure 5. Performance of seven predictor types at each bit budget
For the best hybrid predictors of the four hybrid types considered, below are some graphs
that show the percentage of the budget allocated to each component. Data is shown for each bit budget:
Figure 6a. Distribution of bit budget in best global-local hybrids at each bit budget
Figure 6b. Distribution of bit budget in best perceptron-global hybrids at each bit budget
Figure 6c. Distribution of bit budget in best perceptron-local hybrids at each bit budget
Figure 6d. Distribution of bit budget in best global-local-perceptron hybrids at each bit budget
Our results show that although hybridization of predictors can improve performance in some
circumstances, the technique itself is not a guarantee of performance. Hybrids appear to
produce the greatest gain in performance when the components involved have
complementary prediction strategies. The success of the global-local hybrid, relative to its two
components, is an indication of the fundamentally different approaches to branch prediction
taken by the two components. On the other hand, the global perceptron predictor shares common behavior with both the global history predictor (both use a global branch history buffer) and the local two-level predictor (both hash the branch instruction address into a table of entries). As none of the perceptron hybrids
significantly outperform the basic perceptron predictor, it would appear that this commonality
of formulation produces commonality of performance that makes hybridization less effective.
Hybridization also improves relative to basic predictors as the bit budget increases. At the
64K bit budget, the perceptron-local hybrid very narrowly outperforms the basic perceptron
predictor. For all of the hybrids, the rate of increase in prediction accuracy as bit budgets
increase is greater than the rate of increase for simple predictors. An experiment with larger
bit budgets may reveal a continuation of this trend and show large hybrids to be superior to
the simple perceptron predictor, although larger bit budgets create much larger parameter
spaces to investigate.
The chaotic nature of the optimal hybrid solution space also merits attention. The parameters
of the best predictor of a certain hybrid type often share very little in common with the
parameters of the next best performing predictors. This indicates that the "tuning" which can
be undertaken with simple predictors is not generally transferrable to the components of a
hybrid. Again, this means a large parameter space must be considered if optimality is
desired. In terms of overall size of components within a hybrid, Figures 6a through 6d
indicate that the best hybrid predictors can vary widely with their distribution of the bit budget
amongst components. So there does not even appear to be a budget guideline for optimality.
The three-component hybrids we studied did not improve upon the performance of the basic
perceptron predictor or the perceptron-local hybrid. Performance might be improved by a better three-state selector, or by the introduction of a fundamentally new type of basic predictor. And, like two-state hybrids, three-state hybrids improve relative to basic
predictors as bit budgets are increased.
This experiment with three-component hybrids naturally leads to a consideration of n-component hybrids. In our efforts to create a three-component hybrid, it has become clear
that the two major obstacles to n-component hybrids are the complexity of the selector, and
the size of the parameter space. When applied to an n-state selector, the "spoke" approach
of our three-state selector degenerates into a non-saturating ring of states which provides no
improvement over a simple "last used" register. Also, the parameter space of n-component
hybrids becomes prohibitively large, and would have to be explored in a very guided fashion
in order to discover optimal hybrids.
Overall, our research shows that hybridization is a valuable technique that can produce
superior levels of prediction accuracy compared to some simple predictors at the same bit
budget. The trends indicated by our tests suggest that hybrids do better and better relative to
simple predictors as the bit budget increases. Furthermore, the development of new simple
prediction schemes may open up new opportunities for further gains by hybrid predictors.
Unfortunately, there does not appear to be a consistent pattern of composition for the most
successful hybrids, making the development of optimal predictors very time-consuming to
pursue exhaustively as bit budgets increase.
References
1. R. Kessler. "The Alpha 21264 microprocessor," IEEE Micro 19:2 (March/April), 1999.
2. T. Yeh, Y. Patt. "Two Level Adaptive Training Branch Prediction," Proceedings 24th Annual International Symposium on Microarchitecture, 1991, pp 51-61.
3. S.-T. Pan, K. So, J. T. Rameh. "Improving the accuracy of dynamic branch prediction using branch correlation," Proc. Fifth Conf. on Architectural Support for Programming Languages and Operating Systems, IEEE/ACM (October), 1992, pp 76-84.
4. D. Jimenez, C. Lin. "Dynamic Branch Prediction with Perceptrons," Proceedings 7th International Symposium on High Performance Computer Architecture, 2001.