
synchronization) and usually contain multiple execution units (EUs). The PEs are interconnected by some network, or through centralized resources such as a register file and memory, for inter-PE communication. Our definition of a PE is somewhat loose. On one extreme, the PEs in some multithreaded processors are separate processor-memory systems with their own instruction cache, decode unit, register file, and execution units; on the other extreme, the PEs in some multithreaded processors even share the execution units, as in the dynamic multithreading processor [1]. Such a loose definition allows us to discuss a wide spectrum of multithreaded processors under a common framework.

Number of PEs and PE Organization

The number of PEs in a multiprocessor is an important hardware parameter. This number is strongly tied to the parallelism available in the targeted application domain and to the nature of the threads. On one extreme, we have single-PE multithreaded processors that perform time sharing. On the other extreme, we have massively parallel processors (MPPs) consisting of thousands of PEs, which are the most powerful machines available today for many time-critical applications [4]. Because of the sharp increase in the number of transistors integrated in a single chip, there is significant interest in integrating multiple PEs in the same chip. This has been the motivation behind many of the SpMT processing models.

Processor Context Interleaving

When the number of parallel threads exceeds the number of PEs, it is possible to time-share a single PE among multiple threads in a way that minimizes the time required to switch threads. This is accomplished by sharing as much as possible of the program execution environment between the different threads, so that very little state needs to be saved and restored when changing threads. This type of low-overhead interleaving is given the name multithreading in many circles [2,3,17]. Interleaving-based multithreading differs from conventional multitasking (or multiprogramming) in that the concurrent threads share more of their environment with each other than do concurrent tasks under multitasking. Threads may be distinguished only by the values of their program counters and stack pointers while sharing a single address space and set of global variables. As a result, there is very little protection of one thread from another, in contrast to multitasking. Interleaving-based multithreading can thus be used for very fine-grain multitasking, at the level of a few instructions, and so can hide latency by keeping the processor busy after one thread issues a long-latency instruction on which subsequent instructions in that thread depend.
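As a toy illustration (not taken from any processor described here), Python generators behave much like such lightweight threads: all of them share the module's global state, and each one privately holds only its suspension point (a program counter) and its local frame (a stack), so a "context switch" is merely resuming a different generator. All names below are invented for the sketch.

```python
# Sketch: generators as minimal threads sharing one address space.
counter = 0  # shared global state, visible to every "thread"

def worker(tag, steps):
    """A lightweight thread: its whole private context is one generator
    frame (program counter + locals); everything else is shared."""
    global counter
    for _ in range(steps):
        counter += 1      # unprotected update of shared state
        yield tag         # suspend: the cheap "context switch" point

def interleave(threads):
    """Round-robin scheduler; a switch is just resuming another generator."""
    trace = []
    while threads:
        t = threads.pop(0)
        try:
            trace.append(next(t))
            threads.append(t)   # thread still live; requeue it
        except StopIteration:
            pass                # thread finished; drop it
    return trace

trace = interleave([worker("A", 2), worker("B", 2)])
# trace == ["A", "B", "A", "B"]; counter == 4
```

Note that nothing stops one worker from observing another's half-finished update to `counter`, mirroring the lack of inter-thread protection described above.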

• Cycle-level interleaving: In this scheme, a PE switches to a different thread after each instruction fetch; i.e., an instruction of another thread is fetched and fed into the execution pipeline in the next clock cycle. Cycle-level interleaving is typically used for coarse-grain threads—processes or lightweight processes. The motivation for this is that it eliminates control and data dependences between the instructions that are simultaneously active in the pipeline. Thus, there is no need to build complex forwarding paths, permitting a simple and potentially fast pipeline. Furthermore, the context switch latency is zero cycles. Memory latency is tolerated by not scheduling a thread until the memory access has been completed. For this interleaving to work well, there must be at least as many threads as the worst-case instruction latency in cycles. Interleaving the instructions from many threads limits the processing speed of a single thread, thereby degrading single-thread performance. The most well-known examples of cycle-level interleaving processors are HEP [29], Horizon [33], and Tera MTA [2].

• Block interleaving: In this scheme, the instructions of a thread are executed successively until a long-latency event occurs, which causes a context switch. A typical long-latency operation is a remote memory access. Compared to the cycle-level interleaving technique, a smaller number of threads is sufficient, and a single thread can execute at full speed until the next context switch. The events that cause a context switch can be determined statically or dynamically.
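The two policies above can be contrasted with a toy single-issue, single-PE simulator. This is an illustrative model, not a description of HEP, Horizon, or the Tera MTA; the function names, latency numbers, and the zero-cost switch assumption are all invented for the sketch.

```python
def simulate(threads, block_interleaving=False):
    """Return total cycles to drain all threads on one single-issue PE.

    threads: one list of instruction latencies (in cycles) per thread;
    a latency > 1 models a long-latency operation such as a memory access.
    Each instruction depends on its predecessor in the same thread, so a
    thread may not issue again until its last instruction completes.
    Context switches are assumed free for both policies.
    """
    n = len(threads)
    pc = [0] * n          # next instruction index per thread
    ready_at = [0] * n    # earliest cycle the thread may issue again
    cycle, last = 0, -1

    def done(i):
        return pc[i] >= len(threads[i])

    while not all(done(i) for i in range(n)):
        if block_interleaving and last >= 0:
            # Stay on the current thread until a long-latency op stalls it.
            order = [last] + [(last + k) % n for k in range(1, n)]
        else:
            # Cycle-level: prefer a *different* thread every cycle.
            order = [(last + k) % n for k in range(1, n + 1)]
        for i in order:
            if not done(i) and ready_at[i] <= cycle:
                lat = threads[i][pc[i]]
                pc[i] += 1
                ready_at[i] = cycle + lat
                last = i
                break
        cycle += 1        # one cycle elapses, whether we issued or stalled

    return max(ready_at)

# Two threads; the "3" is a long-latency access that each policy overlaps
# with the other thread's work. Serial execution would take 1+3+1+4 = 9.
workload = [[1, 3, 1], [1, 1, 1, 1]]
cycle_level = simulate(workload)                            # → 7
block = simulate(workload, block_interleaving=True)         # → 7
```

On this workload both policies hide the 3-cycle latency and finish in 7 cycles instead of 9; with a single thread, `simulate([[1, 5, 1]])` shows the stall cycles that cannot be hidden, matching the text's point that enough threads must exist to cover worst-case latencies.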

When hardware technology allows more PEs to be integrated in a processor, PE interleaving becomes less attractive, because computational throughput will clearly improve when multiple threads execute in parallel on separate PEs.
© 2002 by CRC Press LLC
