
Inter-PE Memory Communication and Synchronization

When threads do not share a common memory address space (as in the message passing model), it is straightforward to provide a memory system for each PE, as we do not need to worry about inter-thread memory communication and synchronization.

Memory System Implementation

When threads do share a common memory address space, the multithreaded processor needs to provide appropriate mechanisms for inter-thread memory communication as well as synchronization. One option is to provide a central memory system, in which all memory accesses take roughly the same amount of time. Such a system is called a uniform memory access (UMA) system. An important class of UMA systems is the symmetric multiprocessor (SMP).

A UMA system may provide uniformly slow access times for every memory access. Instead of slowing down every access, we can provide fast access times for most accesses by distributing the memory system (or at least the top portions of the memory hierarchy). Shared memory multiprocessors that use partitioning are called distributed shared memory (DSM) systems. As with the register file structure, we can use two techniques, partitioning and replication, to distribute the memory.

• Memory Partitioning: Partitioning is useful if most of the memory accesses made in a PE can be confined to its own partition. Partitioning the top portion of the memory hierarchy may not be attractive, at least for irregular, non-numeric applications, because the addresses of most loads and stores are not known at compile time, making such confinement difficult. Partitioning of the lower portion of the memory hierarchy is often done, however, as this portion needs to handle only those accesses that miss in the PEs' local caches.
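Partitioning the lower portion of the hierarchy typically means giving each memory block a fixed home partition. A minimal sketch of one common policy, block-level interleaving across PEs, is shown below; the function name and parameters are illustrative, not from the text.

```python
def home_pe(address: int, num_pes: int, block_size: int = 64) -> int:
    """Map a physical address to the PE whose partition holds it.

    Cache-block-sized chunks are interleaved across the PEs, so
    consecutive blocks land in different partitions and no single
    partition becomes a hot spot.
    """
    return (address // block_size) % num_pes
```

With 4 PEs and 64-byte blocks, addresses 0-63 map to PE 0, 64-127 to PE 1, and so on, wrapping around after PE 3.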

• Memory Replication: It is impractical to replicate the entire memory system; therefore, only the top part of the memory hierarchy is replicated. The basic motivation behind replicating the top portion of the memory hierarchy among local caches is to satisfy most of the memory accesses made in a PE from its local cache. Notice that a replicated cache structure must maintain coherence among all the duplicate copies of data.
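The coherence requirement above can be illustrated with a toy software model of a write-invalidate protocol: before a PE updates a location, every other replica of that location is invalidated, so no cache can keep serving a stale copy. This is a deliberately simplified sketch (two states, write-through); real protocols such as MESI track more states and use write-back.

```python
class CoherentCaches:
    """Toy model: per-PE caches kept coherent by write-invalidate."""

    def __init__(self, num_pes: int):
        self.caches = [dict() for _ in range(num_pes)]  # addr -> value
        self.memory = {}                                # backing store

    def read(self, pe: int, addr: int):
        cache = self.caches[pe]
        if addr not in cache:                 # miss: fetch a fresh copy
            cache[addr] = self.memory.get(addr, 0)
        return cache[addr]

    def write(self, pe: int, addr: int, value):
        # Invalidate every other replica before updating, so stale
        # copies can never be read (the write-invalidate step).
        for other, cache in enumerate(self.caches):
            if other != pe:
                cache.pop(addr, None)
        self.caches[pe][addr] = value
        self.memory[addr] = value             # write-through for simplicity
```

A write by PE 0 followed by a read on PE 1 misses in PE 1's cache and fetches the up-to-date value, which is exactly the property coherence must guarantee.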

DSMs often use a combination of partitioning and replication, i.e., portions of the memory hierarchy are replicated and the rest are partitioned. One common organization uses replicated cache memories and partitioned main memories. An interesting variation is the cache only memory architecture (COMA) system. A COMA multiprocessor partitions the entire memory system across the PEs; however, no fixed partition is assigned to a particular memory location. Rather, the partition associated with a memory location changes dynamically based on the PEs that access that location. Several other shared memory organizations are also possible [3,17].
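The dynamic COMA behavior can be sketched as an "attraction memory": a block has no fixed home and simply migrates to the partition of the last PE that touched it. This is an illustrative model only; real COMA machines also replicate blocks and must guarantee that the last copy of a block is never evicted.

```python
class AttractionMemory:
    """Toy COMA model: blocks migrate to the last accessing PE."""

    def __init__(self, num_pes: int):
        self.partitions = [dict() for _ in range(num_pes)]  # addr -> value

    def _find_owner(self, addr: int):
        for pe, part in enumerate(self.partitions):
            if addr in part:
                return pe
        return None

    def access(self, pe: int, addr: int, value=None):
        """Read (value=None) or write a location; either way the
        block ends up living in the accessing PE's partition."""
        owner = self._find_owner(addr)
        if owner is None:
            current = 0                                  # cold miss
        else:
            current = self.partitions[owner].pop(addr)   # migrate away
        if value is not None:
            current = value
        self.partitions[pe][addr] = current              # new home: `pe`
        return current
```

After PE 0 writes a location and PE 1 reads it, the block resides in PE 1's partition, so PE 1's subsequent accesses are local, which is the point of the organization.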

Inter-PE Data Dependence Speculation

In the parallel threads model, synchronization of threads is carried out with special mechanisms such as locks and barriers. In the sequential threads model, ensuring sequential semantics ensures proper memory synchronization. This means, however, that when a load instruction is encountered in a PE, the PE has to ensure that the load's producer store has already executed. This is difficult to determine if the producer store belongs to another thread: memory addresses are calculated at run time, and the producer store instruction may not even have been fetched yet. To overcome this problem, processors based on sequential threads incorporate some form of thread-level data speculation [11]. The idea is to speculate on whether a memory operation has to wait for inter-thread synchronization. This speculation can be as simple as predicting that the producer store has already executed, or it can be more complex, based on the past behavior of the load instruction. Below we discuss some of the hardware schemes proposed for carrying out thread-level data speculation.
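Before turning to those schemes, the "past behavior" predictor mentioned above can be sketched as a table of saturating counters indexed by load PC; this particular organization is an assumption for illustration, not a mechanism described in the text.

```python
class SyncPredictor:
    """Per-load predictor: speculate past synchronization, or wait?

    Each load PC gets a small saturating counter. A high count means
    the load has rarely been caught depending on an unexecuted store,
    so it is issued speculatively; repeated misspeculations drive the
    counter down until the load is made to wait instead.
    """

    def __init__(self, bits: int = 2):
        self.max = (1 << bits) - 1
        self.counters = {}                    # load PC -> counter

    def predict_speculate(self, load_pc: int) -> bool:
        # Unseen loads default to speculating (optimistic start).
        return self.counters.get(load_pc, self.max) > self.max // 2

    def update(self, load_pc: int, misspeculated: bool):
        c = self.counters.get(load_pc, self.max)
        c = max(c - 1, 0) if misspeculated else min(c + 1, self.max)
        self.counters[load_pc] = c
```

Two consecutive misspeculations flip a 2-bit counter from "speculate" to "wait"; correct speculations move it back, so a load whose producer store usually completes in time keeps running ahead.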

• Address Resolution Buffer (ARB): The ARB [11] is a hardware buffer that stores different versions of several memory locations, as well as information about the loads and stores executed from the currently active threads. Each entry in the ARB buffers all versions of the same memory location.
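A single ARB entry can be modeled in software as follows. Threads are numbered in sequential program order; the entry records, per thread, whether the location was loaded and what value, if any, was stored. A load returns the value from the nearest preceding store, and a store that finds a load already performed by a logically later thread reports those threads as misspeculated. This is a toy model; the real ARB in [11] is a set-associative hardware buffer with per-stage load/store bits.

```python
class ARBEntry:
    """Toy model of one ARB entry (one memory location)."""

    def __init__(self):
        self.versions = {}   # thread id -> {"loaded": bool, "value": ...}

    def load(self, tid: int):
        """Record the load; return the nearest preceding store's value,
        or None if the access must fall through to the memory system."""
        self.versions.setdefault(tid, {"loaded": False, "value": None})
        self.versions[tid]["loaded"] = True
        for t in sorted((t for t in self.versions if t <= tid),
                        reverse=True):
            if self.versions[t]["value"] is not None:
                return self.versions[t]["value"]
        return None

    def store(self, tid: int, value):
        """Record the store; return the (possibly empty) list of later
        threads that loaded this location too early and must be squashed."""
        self.versions.setdefault(tid, {"loaded": False, "value": None})
        self.versions[tid]["value"] = value
        return [t for t, v in self.versions.items()
                if t > tid and v["loaded"]]
```

If thread 2 loads a location that only thread 0 has stored, it gets thread 0's value; when thread 1 later stores to the same location, the entry reports thread 2 as having loaded a stale value, triggering recovery.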

© 2002 by CRC Press LLC
