
Studies of Instruction-Level Parallelism

In the early 1970s, two studies on decoding and executing multiple instructions per cycle were published: one by Gary Tjaden and Mike Flynn on a design of a multiple-issue IBM 7094 [2], and the other by Ed Riseman and Caxton Foster on the effect of branches in CDC 3600 programs [3]. Both papers concluded that only a small amount of instruction-level parallelism exists in sequential programs: 1.86 and 1.72 instructions per cycle, respectively. Thus, these studies clearly demonstrated the limiting effect of data and control dependencies on instruction-level parallelism, and the result was to encourage researchers to look for parallelism in other arenas, such as vector processors and multiprocessors. However, the Riseman and Foster study did examine the effect of relaxing control dependencies and found increasing levels of parallelism, up to 51 instructions per cycle, as more and more branches were eliminated (albeit in an impractical way). Later studies, in which false data dependencies were eliminated as well as control dependencies, found much more available parallelism, the highest published estimate being 90 instructions per cycle, reported by Alexandru Nicolau and Josh Fisher as part of their VLIW research [4].

Techniques to Increase Instruction-Level Parallelism

Just as the limit studies indicated, performance can be increased if dependencies can be eliminated or reduced. Let us address the dependencies in the reverse order from their enumeration above. First, many structural dependencies can be avoided by providing duplicate copies of necessary resources. Even scalar pipelines provide two paths for memory access (i.e., separate instruction and data caches) and multiple adders (i.e., a branch target adder and the main ALU). Superscalar processors have even more resource requirements, and it is not unusual to find duplicated function units and even multiple ports to the data cache (e.g., true multiporting, multiple banks, or accessing a single-ported cache multiple times per cycle).
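
To make the idea concrete, here is a minimal C sketch of the issue-side check that resource duplication addresses: an instruction may issue only if a functional unit of the kind it needs is free, so providing duplicate units directly reduces these structural stalls. The unit types and counts are hypothetical, chosen purely for illustration.

```c
#include <stdbool.h>

enum unit_type { UNIT_ALU, UNIT_MEM_PORT, UNIT_FPU, UNIT_TYPE_COUNT };

/* Hypothetical counts: two ALUs, two data-cache ports, one FPU. */
static int free_units[UNIT_TYPE_COUNT] = { 2, 2, 1 };

/* Issue succeeds only if a unit of the required type is available;
 * otherwise the instruction stalls on a structural dependency. */
bool try_issue(enum unit_type needed)
{
    if (free_units[needed] == 0)
        return false;              /* structural stall this cycle */
    free_units[needed]--;          /* claim the unit */
    return true;
}

/* Called when execution finishes and the unit becomes free again. */
void release_unit(enum unit_type t)
{
    free_units[t]++;
}
```

In this toy model, raising a count in free_units is the software analog of adding a second ALU or another data-cache port in hardware.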

Control dependencies can be eliminated by compiler techniques such as loop unrolling and “if conversion” (i.e., using conditional or predicated execution of instructions so that a control-dependent instruction is transformed into a data-dependent instruction). However, the main approach to reducing the impact of control dependencies is the use of sophisticated branch prediction. For example, the Pentium 4 keeps the history of over 4000 branches [5]. Branch prediction techniques allow instructions from the predicted path to begin before the branch is resolved and to execute in a speculative manner. Of course, if a prediction is incorrect, there must be a way to recover and restart execution along the correct path.
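
Both ideas can be sketched in C. The first pair of functions shows “if conversion” at source level: a control-dependent assignment becomes a branch-free select that depends only on data. The second shows a classic two-bit saturating-counter branch predictor; the table size here is arbitrary, and real predictors (such as the Pentium 4’s, which tracks thousands of branches) are considerably more elaborate.

```c
#include <stdbool.h>
#include <stdint.h>

/* --- If conversion: control dependence becomes data dependence --- */

/* Original form: the result is control-dependent on the branch. */
int max_branching(int a, int b)
{
    if (a > b)
        return a;
    return b;
}

/* If-converted form: a branch-free select that a compiler could also
 * realize with a conditional-move or predicated instruction. */
int max_converted(int a, int b)
{
    int take_a = -(a > b);             /* all-ones mask if a > b, else 0 */
    return (a & take_a) | (b & ~take_a);
}

/* --- Two-bit saturating-counter branch predictor (classic scheme) --- */

#define PRED_ENTRIES 1024              /* arbitrary table size */

static uint8_t counters[PRED_ENTRIES]; /* 0,1: predict not taken; 2,3: taken */

bool predict_taken(uint32_t pc)
{
    return counters[(pc >> 2) % PRED_ENTRIES] >= 2;
}

void train(uint32_t pc, bool taken)
{
    uint8_t *c = &counters[(pc >> 2) % PRED_ENTRIES];
    if (taken && *c < 3)
        (*c)++;                        /* saturate at strongly taken */
    else if (!taken && *c > 0)
        (*c)--;                        /* saturate at strongly not taken */
}
```

The two-bit counter tolerates a single anomalous outcome (e.g., a loop exit) without abandoning its prediction, which is why it outperforms a one-bit scheme on loop-closing branches.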

False data dependencies can be eliminated or reduced by better compiler techniques (e.g., register and memory allocation algorithms that avoid reuse) or by the use of register and memory renaming hardware on the processor. Register renaming can be accomplished in the hardware by incorporating a larger set of physical registers than are available in the instruction set architecture. Thus, as each instruction is decoded, that instruction’s architectural destination register is mapped to a new physical register, and future uses of that architectural register will be mapped to the assigned physical register. Hardware renaming is especially important for older instruction sets that have few architectural registers and for legacy codes that for one reason or another will not be recompiled.
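
The following C sketch illustrates the renaming step just described, assuming hypothetical sizes of 16 architectural and 64 physical registers: each decoded instruction’s sources read the current mapping, and its destination is assigned a fresh physical register from a free list.

```c
#define ARCH_REGS 16                   /* illustrative sizes */
#define PHYS_REGS 64

static int map_table[ARCH_REGS];       /* architectural -> physical */
static int free_list[PHYS_REGS];
static int free_top;

void rename_init(void)
{
    for (int a = 0; a < ARCH_REGS; a++)
        map_table[a] = a;              /* initial identity mapping */
    free_top = 0;
    for (int p = ARCH_REGS; p < PHYS_REGS; p++)
        free_list[free_top++] = p;     /* remaining registers start free */
}

/* Rename "dst = op(src1, src2)".  Returns the physical destination,
 * or -1 when no physical register is free and rename must stall. */
int rename(int dst, int src1, int src2, int *psrc1, int *psrc2)
{
    *psrc1 = map_table[src1];          /* sources use current mappings */
    *psrc2 = map_table[src2];
    if (free_top == 0)
        return -1;                     /* out of physical registers */
    int pdst = free_list[--free_top];
    map_table[dst] = pdst;             /* later readers of dst see pdst */
    return pdst;
}
```

Reclaiming the previous physical register for dst is deliberately omitted; in a real pipeline it can be returned to the free list only once the renaming instruction retires and no older instruction can still need the old value.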

True data dependencies have been viewed as the fundamental limit for program execution; however, value prediction has been proposed in the past few years as somewhat of an analog of branch prediction, in which instructions that depend on easily predicted source values can be started earlier. As with branch prediction, there must be a way to recover from mispredictions. Another method currently being proposed to reduce the impact of true data dependencies is simultaneous multithreading, in which instructions from multiple threads are interleaved on a single processor; of course, instructions from different threads are independent by definition.
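
One simple form of value prediction is a last-value predictor, sketched below in C: a table indexed by instruction address remembers the most recent result, and dependent instructions may start speculatively with that guess, to be squashed and re-executed on a mismatch, just as with a branch misprediction. The table size and indexing here are illustrative.

```c
#include <stdbool.h>
#include <stdint.h>

#define VP_ENTRIES 512                 /* arbitrary table size */

struct vp_entry {
    uint32_t value;                    /* most recent result seen */
    bool     valid;
};

static struct vp_entry vp_table[VP_ENTRIES];

/* Returns true and a predicted result so dependent instructions can
 * begin early; a later mismatch must trigger recovery. */
bool value_predict(uint32_t pc, uint32_t *pred)
{
    struct vp_entry *e = &vp_table[(pc >> 2) % VP_ENTRIES];
    if (!e->valid)
        return false;                  /* no history: do not speculate */
    *pred = e->value;
    return true;
}

/* Called with the actual result once the instruction executes. */
void value_train(uint32_t pc, uint32_t actual)
{
    struct vp_entry *e = &vp_table[(pc >> 2) % VP_ENTRIES];
    e->value = actual;
    e->valid = true;
}
```

More elaborate schemes track strides or value context rather than a single last value, at the cost of larger tables.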

Out-of-Order Completion

All processors that attempt to execute instructions in parallel must deal with variations in instruction execution times. That is, some instructions, such as those involving simple integer arithmetic or logic operations, will need only one cycle for execution, while others, such as floating-point instructions, will
need multiple cycles for execution. If these different instructions are started at the same time, as in a superscalar processor, they can complete out of program order.

