HBPS JTW 2003.pdf - Members.efn.org

Hybrid Branch Prediction Strategies

Paul Jaeger, Colin Thompkins, Brad Werth

Abstract

This paper introduces the generalized notion of hybrid branch prediction, of which the

tournament predictor is a specific example. Two-component hybrid prediction schemes

composed of local two-level lookup, global history lookup, and global perceptrons are

compared against the performance of their component pieces. A strategy for three-component

selection is demonstrated, and three-component hybrid branch predictors are

evaluated. The ideal distribution of the bit budget between components for hybrid prediction

schemes is also explored.

Introduction

When designing a microprocessor architecture, efficiency is gained at the cost of additional

hardware complexity and cost. Early microprocessor designs dealt with high hardware

component costs by keeping the architecture as simple as possible. As manufacturing

processes improved, hardware costs dropped, allowing microprocessor designs to grow in

complexity and efficiency. The modern microprocessor architect has a large, but finite

"budget" of hardware to utilize in the effort to improve processor efficiency. One of the

significant sources of inefficiency in program execution is the failure to correctly predict

whether or not a branch is taken. It is for this reason that all modern microprocessors employ

a branch prediction scheme.

Branch prediction became necessary when microprocessor designs started employing the

technique of pipelining. In a pipelined architecture, a conditional branch instruction (like all

instructions) takes several cycles to complete. During this time, the pipeline continues to fill

with instructions -- work which will be discarded if the instructions have been retrieved from

the incorrect location in memory. In such a case the penalty to efficiency is at least one (and

perhaps many more) cycles' worth of wasted work. Correct prediction, on the other hand,

results in no adverse efficiency impact. Since the average program contains one conditional

branch in every ten instructions, correct branch prediction is a clear priority.

Several basic, single-strategy prediction schemes have been developed, with reasonably

good results. Two of the most common basic predictors are a global history lookup, and a

two-level local history lookup. Both of these schemes have a number of parameters that can

be tuned to find an ideal solution for a given hardware budget. With sufficient hardware

resources, single-strategy prediction schemes can achieve 90% or better prediction success.

However, a common problem with single-strategy schemes is that they can be "vulnerable" to

certain execution patterns which cause them to perform sub-optimally in some circumstances.

No amount of additional hardware can compensate for these fundamental flaws in the

scheme.

One way to overcome this problem is to create a multiple-strategy prediction scheme, a

construction we will call a "hybrid". A hybrid scheme uses several component prediction

schemes running concurrently and contains some method to arbitrate between them. The

"tournament" predictor [1] is a popular and successful hybrid of the global and two-level local

prediction schemes. Tournament predictors can achieve 93% or better prediction success,

with sufficient hardware resources. Because of this very high rate of success, most research

in hybrid schemes has focused on fine-tuning the tournament predictor scheme. One of the

goals of this paper is to re-open hybrid schemes to more general composition.


Research into new branch prediction schemes continues. Some new types of single-strategy

schemes have been developed since the tournament predictor was first developed. One of

the most promising is the global perceptron predictor, which can achieve a prediction success

rate similar to the tournament predictor. Instead of considering this scheme to be a terminal,

self-contained solution, this predictor can be viewed as a powerful new component predictor

to be used in hybrid schemes.

Hybrid prediction schemes need not be limited to two components, either. Including more

than two components adds some complexity to the component selection process, but not

insurmountably so. In our experiments we tested a variety of three-component hybrid

schemes using a new selection process designed to be simple and effective.

Discussion of Predictor Components and Parameters

For our experiments, we used a building-block approach for the composition of each

prediction scheme. We used a number of component branch predictors and predictor

selectors as building blocks. The workings of each type of component are described below.

The local two-level predictor [2] hashes the branch instruction's address bits into a table of local

history entries. These history entries store the taken / not taken results for the last several

executions of the branches referenced at that index. The history entry is then used as an

index into a table of saturating counters, functioning as predictors. After the result of the

branch is known, the predictor and the history entry are both updated. This scheme has

three parameters that we manipulated in our analysis: the number of entries in the history

table, the length of the history entries, and the number of bits used by the predictors.
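
As an illustration, the lookup and update steps described above can be sketched in Python. This is a minimal sketch and not the simulator used in these experiments: the class name, the default parameter values, the modulo hash of the branch address, and the initial counter value are all assumptions made for the example.

```python
class LocalTwoLevelPredictor:
    """Sketch of a two-level local predictor (names and defaults are hypothetical)."""

    def __init__(self, history_entries=1024, history_bits=10, counter_bits=2):
        self.history_entries = history_entries
        self.history_mask = (1 << history_bits) - 1
        self.counter_max = (1 << counter_bits) - 1
        # A counter at or above this value predicts "taken".
        self.threshold = 1 << (counter_bits - 1)
        # First level: one local history entry per table slot.
        self.histories = [0] * history_entries
        # Second level: one saturating counter per possible history value.
        # Starting just below the threshold is an arbitrary choice.
        self.counters = [self.threshold - 1] * (1 << history_bits)

    def _slot(self, pc):
        # Assumed hash: the branch address modulo the table size.
        return pc % self.history_entries

    def predict(self, pc):
        history = self.histories[self._slot(pc)]
        return self.counters[history] >= self.threshold

    def update(self, pc, taken):
        slot = self._slot(pc)
        history = self.histories[slot]
        # Train the counter that produced the prediction.
        if taken:
            self.counters[history] = min(self.counter_max, self.counters[history] + 1)
        else:
            self.counters[history] = max(0, self.counters[history] - 1)
        # Shift the outcome into this slot's local history.
        self.histories[slot] = ((history << 1) | int(taken)) & self.history_mask
```

The three constructor arguments correspond to the three parameters named above: the number of history entries, the history length, and the counter width.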

The global history predictor [3] maintains a single global branch history buffer which stores the

results of the last several branches executed. This history value is used as an index into a

table of saturating counters functioning as predictors. After the branch result is known, the

history buffer and the predictor are both updated. This scheme has two parameters to

consider: the length of the history buffer, and the number of bits used by the predictors.
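
A minimal Python sketch of this scheme follows. It is illustrative only: the class name, defaults, and initial counter value are assumptions, and the branch address argument is accepted but ignored so that the interface matches an address-indexed predictor.

```python
class GlobalHistoryPredictor:
    """Sketch of a global history predictor (names and defaults are hypothetical)."""

    def __init__(self, history_bits=12, counter_bits=2):
        self.history_mask = (1 << history_bits) - 1
        self.counter_max = (1 << counter_bits) - 1
        # A counter at or above this value predicts "taken".
        self.threshold = 1 << (counter_bits - 1)
        # Single global history buffer shared by every branch.
        self.history = 0
        # One saturating counter per possible history value.
        self.counters = [self.threshold - 1] * (1 << history_bits)

    def predict(self, pc):
        # pc is unused: only the global history selects the counter.
        return self.counters[self.history] >= self.threshold

    def update(self, pc, taken):
        index = self.history
        if taken:
            self.counters[index] = min(self.counter_max, self.counters[index] + 1)
        else:
            self.counters[index] = max(0, self.counters[index] - 1)
        # Shift the outcome into the global history buffer.
        self.history = ((index << 1) | int(taken)) & self.history_mask
```

With a three-bit history, for example, this sketch learns a repeating taken, taken, not-taken pattern, because each position in the pattern is preceded by a distinct history value.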

The global perceptron predictor [4], like the global history predictor, maintains a global branch

history buffer. Also, like the local two-level predictor, the perceptron predictor hashes the

branch instruction's address bits into a table. Instead of containing local branch history, this

table contains perceptrons which are used to weight the global branch history in order to

arrive at a prediction. After the branch result is known, the global history is updated, and the

perceptron used in the prediction has its weights modified. This predictor has three

parameters we considered: the number of entries in the table, the number of weight bits used

by the perceptrons, and the length of the global branch history. There is one additional

parameter we did not consider: the perceptron training threshold. We derived this value from

the history buffer length using the formula proposed by Jimenez and Lin, which has been

found to be experimentally optimal.
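
The prediction and training steps can be sketched as follows. The threshold formula, floor(1.93h + 14) for a history length h, and the training rule are those published by Jimenez and Lin; the class name, defaults, modulo hash, and zero-initialized weights are assumptions made for this example, which is a sketch and not the simulator used in these experiments.

```python
class PerceptronPredictor:
    """Sketch of a global perceptron predictor (names and defaults are hypothetical)."""

    def __init__(self, entries=64, history_bits=12, weight_bits=8):
        self.entries = entries
        # Weights are signed and saturate at the range of weight_bits.
        self.weight_max = (1 << (weight_bits - 1)) - 1
        self.weight_min = -(1 << (weight_bits - 1))
        # Training threshold from Jimenez and Lin: floor(1.93 * h + 14).
        self.theta = int(1.93 * history_bits + 14)
        # weights[i][0] is the bias weight; the rest pair with history bits.
        self.weights = [[0] * (history_bits + 1) for _ in range(entries)]
        # Global history as +1 (taken) or -1 (not taken), most recent first.
        self.history = [-1] * history_bits

    def _output(self, pc):
        # Assumed hash: the branch address modulo the table size.
        w = self.weights[pc % self.entries]
        return w[0] + sum(wi * xi for wi, xi in zip(w[1:], self.history))

    def predict(self, pc):
        return self._output(pc) >= 0

    def update(self, pc, taken):
        y = self._output(pc)
        t = 1 if taken else -1
        # Train on a misprediction or when the output is not yet confident.
        if (y >= 0) != taken or abs(y) <= self.theta:
            w = self.weights[pc % self.entries]
            inputs = [1] + self.history  # the bias input is always 1
            for i, x in enumerate(inputs):
                w[i] = max(self.weight_min, min(self.weight_max, w[i] + t * x))
        # Shift the outcome into the global history.
        self.history = [t] + self.history[:-1]
```

For a history length of 12 the formula gives a threshold of 37.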

For hybrid prediction schemes, a selection process must exist to choose which component

predictor should be used for a given case. This functionality is contained within a data

construct called a selector. We used a variety of selectors in our experiments.

Fundamentally, for all our hybrids, we use a table of selectors hashed by the branch

instruction's address bits. The existence of this table indicates one of the parameters of a

hybrid scheme: the size of the selector table.

For our two-component hybrids, we used two-state selectors -- the same selectors used in

most tournament prediction schemes. Two-state selectors function similarly to saturating


counters, but the logic for updating their state is more complicated. If the two predictors are A

and B, and the selector is in a state indicating confidence in predictor A, confidence in A will be

increased only if A is correct and B is wrong. Likewise, confidence in A will decrease (and

confidence in B will increase) when B is correct and A is wrong. Two-state selectors have a

single parameter: the number of bits used.
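
The update rule above, together with the address-hashed selector table, can be sketched in Python. The names, the modulo hash, and the initial state are assumptions for illustration. The key point is that the selector changes state only when exactly one of the two predictors is correct.

```python
class TwoStateSelector:
    """Sketch of a saturating chooser between predictors A and B (hypothetical names)."""

    def __init__(self, bits=2):
        self.max_state = (1 << bits) - 1
        # States below the midpoint favor A; states at or above it favor B.
        self.midpoint = 1 << (bits - 1)
        # Starting weakly in favor of A is an arbitrary choice.
        self.state = self.midpoint - 1

    def choose_b(self):
        return self.state >= self.midpoint

    def update(self, a_correct, b_correct):
        # Both right or both wrong: no evidence either way, so no change.
        if a_correct == b_correct:
            return
        if b_correct:
            self.state = min(self.max_state, self.state + 1)
        else:
            self.state = max(0, self.state - 1)


class SelectorTable:
    """Table of selectors indexed by a hash of the branch address."""

    def __init__(self, entries=1024, bits=2):
        self.selectors = [TwoStateSelector(bits) for _ in range(entries)]

    def lookup(self, pc):
        # Assumed hash: the branch address modulo the table size.
        return self.selectors[pc % len(self.selectors)]
```

A hybrid built from these pieces would call `lookup(pc).choose_b()` to pick a component prediction, then call `update` with the correctness of both components once the branch outcome is known.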

For three-component hybrids, the selection process must be able to indicate which of the

three predictors should be used. We designed a three-state selector specifically for this

purpose. Our general three-state selector has three sets of states that represent increasing

confidence in one of the three predictors. The update logic of the basic three-state selector is

fairly complicated. Figure 1 is a diagram of the update logic for the three-state selector. In

the diagram, A, B, and C represent the three component predictors, and the oblong shapes

represent the sets of states dedicated to each of them. The triplet of numbers on each state

transition indicates, in order, whether A, B, and C are correct:

Figure 1. A left handed three-state selector

This arrangement of states introduces a favoritism in the case of a "tie" between the

predictors. For example: if the selector is in a state of minimal confidence in A, and A is wrong

while both B and C are correct, to which state should the selector transition? Clearly,

remaining in the current state is not a useful strategy. In this case, a definite favoritism is

needed, which we call "handedness". The favoritism of A -> B -> C -> A we call "right

handedness" and A -> C -> B -> A we call "left handedness". The selector in Figure 1 is left

handed.

Clearly, the mirror-nature of the state arrangement makes the distinction of handedness

unnecessary for simple three-state selection -- by changing the order of components in the

hybrid, we can induce in a given selector whatever favoritism is desired. This distinction can

be useful, however, in a "switching" three-state selector. By using an additional bit of

information for the selector, it can track its handedness and switch when appropriate. For

example, a switching three-state selector that is currently right handed and indicating A would


switch to being left handed whenever C is correct and B is wrong, regardless of the overall

state transition undertaken. For this reason, our three-state selectors have two parameters:

the number of bits used, and whether or not the handedness switches during state transitions.
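
The tie-breaking and switching behavior can be sketched in Python. Only two things here come from the text: the handedness orders (A -> B -> C -> A for right, A -> C -> B -> A for left) and the rule that a right-handed selector indicating A becomes left handed when C is correct and B is wrong. Everything else is an assumption made for illustration, since the state diagram of Figure 1 is not reproduced: the `levels` parameter (standing in for the number of bits), the confidence transitions, and the mirror-image rule for switching back to right handedness.

```python
class ThreeStateSelector:
    """Hypothetical sketch; the confidence transitions are assumed, not Figure 1."""

    def __init__(self, levels=2, right_handed=True, switching=False):
        self.max_confidence = levels - 1
        self.favored = 0        # 0 = A, 1 = B, 2 = C
        self.confidence = 0     # 0 is minimal confidence
        self.right_handed = right_handed
        self.switching = switching

    def update(self, correct):
        """correct is a triple of booleans for A, B, and C, in that order."""
        right = (self.favored + 1) % 3   # next along A -> B -> C -> A
        left = (self.favored - 1) % 3    # next along A -> C -> B -> A
        if correct[self.favored]:
            # Assumed: gain confidence only when some other predictor was wrong.
            if not (correct[left] and correct[right]):
                self.confidence = min(self.max_confidence, self.confidence + 1)
        elif correct[left] or correct[right]:
            if self.confidence > 0:
                # Assumed: lose confidence before abandoning the favored predictor.
                self.confidence -= 1
            elif correct[left] and correct[right]:
                # Tie between the other two: handedness decides.
                self.favored = right if self.right_handed else left
            else:
                self.favored = right if correct[right] else left
        if self.switching:
            # Neighbors are relative to the predictor indicated before the transition.
            if self.right_handed and correct[left] and not correct[right]:
                self.right_handed = False
            elif not self.right_handed and correct[right] and not correct[left]:
                self.right_handed = True  # assumed mirror image of the rule above
```

In this sketch, a right-handed selector at minimal confidence in A moves to B when A is wrong and both B and C are correct, while a left-handed one moves to C.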

Experimental Methodology

For this experiment, a number of traces were used as input data. Some of these traces are

from the SPEC95 benchmark suite [5], and some from the SPEC2000 benchmark suite [6]. The

traces:

099.go -- An internationally ranked go-playing program.

124.m88ksim -- A chip simulator for the Motorola 88100 microprocessor.

126.gcc -- Based on the GNU C compiler version 2.5.3.

130.li -- Xlisp interpreter.

132.ijpeg -- Image compression/decompression on in-memory images.

134.perl -- An interpreter for the Perl language.

181.mcf -- Combinatorial optimization.

All traces were transformed into a common format, and run through a simulator. For each

type of predictor (simple, two-component, three-component), all the traces were run through

various permutations of the parameters for that type of predictor. Early ad-hoc testing for

reasonable boundary values of each parameter gave definition to an experimental domain.

This domain was further stratified by the artificial imposition of a hardware budget on the test

predictors. The simulator used each predictor type's parameters in various permutations in

this parameter space. For a given hardware budget, the best-performing predictors of each

type were identified and compared against each other. The final misprediction rate for each

predictor was calculated as the harmonic mean of its misprediction rates across all the

traces.
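
For reference, the harmonic mean of n rates is n divided by the sum of their reciprocals. A short Python sketch follows; the function name is hypothetical, and the guard against zero rates is an assumption, since a trace with no mispredictions would make the reciprocal undefined.

```python
def harmonic_mean_rate(rates):
    """Harmonic mean of per-trace misprediction rates (each rate must be > 0)."""
    rates = list(rates)
    if not rates or any(r <= 0 for r in rates):
        raise ValueError("need at least one rate, and every rate must be positive")
    return len(rates) / sum(1.0 / r for r in rates)
```

The harmonic mean never exceeds the arithmetic mean and is pulled toward the smaller values, so traces with low misprediction rates weigh more heavily in the combined figure.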

The hardware budget for these predictors can be reduced to the simpler concept

of a "bit budget". The cost and size of the various logical constructs of the hardware pale in

comparison to the cost and size of the memory requirements of a given prediction scheme.

For this reason, we use the bit budget as a reflection of how much data can be used by each

scheme, and this is a reasonable stand-in for the actual hardware cost and size. The bit

budgets considered were intended to be reflective of modern computer microprocessor

design. We considered bit budgets of 16K, 32K, and 64K. For embedded microprocessors,

smaller budgets would be more relevant. And, as hardware costs continue to decrease, bit

budgets of future microprocessors may be much larger. These budgets are intended to be

reflective of the restrictions faced by the microprocessor architects of today.
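
One plausible way to count the bits used by each component is sketched below in Python. The paper does not spell out its exact accounting, so these formulas are assumptions: each table is charged its entry count times its entry width, each global history buffer is charged its length, and "16K" is taken to mean 16 * 1024 bits. All function names are hypothetical.

```python
def global_history_bits(history_bits, counter_bits):
    """History buffer plus one counter per possible history value."""
    return history_bits + (1 << history_bits) * counter_bits


def local_two_level_bits(history_entries, history_bits, counter_bits):
    """Local history table plus one counter per possible history value."""
    return history_entries * history_bits + (1 << history_bits) * counter_bits


def perceptron_bits(entries, history_bits, weight_bits):
    """Each perceptron holds one weight per history bit plus a bias weight."""
    return entries * (history_bits + 1) * weight_bits + history_bits


def selector_table_bits(entries, selector_bits):
    """Selector table for a hybrid scheme."""
    return entries * selector_bits


def fits_budget(total_bits, budget_kbits):
    """Assumes a budget of N K means N * 1024 bits."""
    return total_bits <= budget_kbits * 1024
```

Under these assumptions, a global-local hybrid with a 12-bit global history, a 1024-entry local table of 10-bit histories, 2-bit counters throughout, and a 1024-entry table of 2-bit selectors costs 8204 + 12288 + 2048 = 22540 bits, which fits a 32K budget but not a 16K one.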


Results

For each simple predictor, we first did some ad hoc testing to assess reasonable boundary

values for all parameters. Then, for each type of predictor, various permutations of the

parameters were considered. The best performing simple predictors for each type and bit

budget:

Figure 2. Performance of best simple predictors for each bit budget

For the two-component hybrids, we kept the parameter domain at the same size as the

simple predictor experiments, with the exception that we constrained table sizes to powers of

two. We did not want to rule out the possibility that some parameter combination that

performed poorly in a simple predictor could achieve superior results as part of a hybrid. This

considerably multiplied the computation time necessary to do all the testing on each hybrid.

The best two-component hybrids for each type and bit budget:

Figure 3. Performance of best two-component hybrids for each bit budget


For the three-component hybrids, it was necessary to constrain the parameter domain further

in order to run our experiments in a reasonable amount of time. We identified the highest-performing

simple and two-component hybrid predictors and modeled the parameter domain

around values that appeared to be most successful in those conditions. The best three-component

hybrids for each bit-budget:

Figure 4. Performance of best three-component hybrid for each bit budget

Since a lower misprediction rate indicates superior performance, the data shows that the

simple global perceptron predictor is the best-performing predictor overall, although the hybrid

perceptron-local predictor does slightly better at the 64K bit budget. Here is a graph of the

misprediction rates of all seven predictor types, per bit budget:

Figure 5. Performance of seven predictor types at each bit budget

For the best hybrid predictors of the four hybrid types considered, below are some graphs

that show the percentage of the budget allocated to each component. Data is shown for each

bit budget:

Figure 6a. Distribution of bit budget in best global-local hybrids at each bit budget


Figure 6b. Distribution of bit budget in best perceptron-global hybrids at each bit budget

Figure 6c. Distribution of bit budget in best perceptron-local hybrids at each bit budget

Figure 6d. Distribution of bit budget in best global-local-perceptron hybrids at each bit budget

Conclusion

Our results show that although hybridization of predictors can improve performance in some

circumstances, the technique itself is not a guarantee of performance. Hybrids appear to

produce the greatest gain in performance when the components involved have

complementary prediction strategies. The success of the global-local hybrid, relative to its two

components, is an indication of the fundamentally different approaches to branch prediction

taken by the two components. On the other hand, the global perceptron predictor shares

common behavior with both the global buffer predictor (they both use a global branch history

buffer) and also has common behavior with the local two-level predictor (they both hash the

branch instruction address into a table of entries). As none of the perceptron hybrids

significantly outperform the basic perceptron predictor, it would appear that this commonality

of formulation produces commonality of performance that makes hybridization less effective.


Hybridization also improves relative to basic predictors as the bit budget increases. At the

64K bit budget, the perceptron-local hybrid very narrowly outperforms the basic perceptron

predictor. For all of the hybrids, the rate of increase in prediction accuracy as bit budgets

increase is greater than the rate of increase for simple predictors. An experiment with larger

bit budgets may reveal a continuation of this trend and show large hybrids to be superior to

the simple perceptron predictor, although larger bit budgets create much larger parameter

spaces to investigate.

The chaotic nature of the optimal hybrid solution space also merits attention. The parameters

of the best predictor of a certain hybrid type often share very little in common with the

parameters of the next best performing predictors. This indicates that the "tuning" which can

be undertaken with simple predictors is not generally transferable to the components of a

hybrid. Again, this means a large parameter space must be considered if optimality is

desired. In terms of overall size of components within a hybrid, Figures 6a through 6d

indicate that the best hybrid predictors can vary widely with their distribution of the bit budget

amongst components. So there does not even appear to be a budget guideline for optimality.

The three-component hybrids we studied did not improve upon the performance of the basic

perceptron predictor or the perceptron-local hybrid. Performance might be improved by a

better three-state selector, or by the introduction of a fundamentally new type

of basic predictor. And, like two-component hybrids, three-component hybrids improve relative to basic

predictors as bit budgets are increased.

This experiment with three-component hybrids naturally leads to a consideration of n-component

hybrids. In our efforts to create a three-component hybrid, it has become clear

that the two major obstacles to n-component hybrids are the complexity of the selector, and

the size of the parameter space. When applied to an n-state selector, the "spoke" approach

of our three-state selector degenerates into a non-saturating ring of states which provides no

improvement over a simple "last used" register. Also, the parameter space of n-component

hybrids becomes prohibitively large, and would have to be explored in a very guided fashion

in order to discover optimal hybrids.

Overall, our research shows that hybridization is a valuable technique that can produce

superior levels of prediction accuracy compared to some simple predictors at the same bit

budget. The trends indicated by our tests suggest that hybrids do better and better relative to

simple predictors as the bit budget increases. Furthermore, the development of new simple

prediction schemes may open up new opportunities for further gains by hybrid predictors.

Unfortunately, there does not appear to be a consistent pattern of composition for the most

successful hybrids, making the development of optimal predictors very time-consuming to

pursue exhaustively as bit budgets increase.

References

1. R. Kessler. "The Alpha 21264 Microprocessor," IEEE Micro 19:2 (March/April), 1999,

pp. 24-36.

2. T. Yeh, Y. Patt. "Two-Level Adaptive Training Branch Prediction," Proceedings 24th Annual

International Symposium on Microarchitecture, 1991, pp. 51-61.

3. S.-T. Pan, K. So, J. T. Rahmeh. "Improving the Accuracy of Dynamic Branch Prediction

Using Branch Correlation," Proceedings Fifth Conference on Architectural Support for Programming

Languages and Operating Systems, IEEE/ACM (October), 1992, pp. 76-84.


4. D. Jimenez, C. Lin. "Dynamic Branch Prediction with Perceptrons," Proceedings 7th

International Symposium on High Performance Computer Architecture, 2001.

5. http://spec.unipv.it/cpu95/CINT95/

6. http://spec.unipv.it/cpu2000/CINT2000/
