15.01.2013 Views

U. Glaeser

a common register space, it is important that we still provide a separate register file in each PE to support fast register access, as it is difficult for a centralized register file to provide a 1-cycle multi-port access time with today’s high clock rates. This decentralization can be achieved in two ways, both of which provide faster register access times due to physical proximity and fewer access ports per physical register file.

• RF Partitioning: In this approach, each physical register file implements (or maps) an independent set of ISA-visible registers. Notice that a PE may occasionally need a register value stored in a nonlocal register file, in which case the value is fetched through an interconnection network that interconnects the PEs.

• RF Replication: With the replication scheme, a physical copy of the register file is kept in each PE, so that each PE has a local copy of the shared register space. These register file replicas maintain different versions of the register space, i.e., the multiple copies of the register file store register values that correspond to the processor state at different points in a sequential execution of the program. In general, replication avoids unnecessary communication; however, if not done carefully, it might increase communication by replicating data that is not used in the future. A multithreaded processor that uses the replication scheme is the multiscalar processor [9].
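The trade-off between the two schemes can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the modulo mapping of registers to PEs and the function names `owner_pe`, `count_remote_reads`, and `wasted_forwards` are hypothetical and do not come from the text or from any particular processor.

```python
def owner_pe(reg: int, num_pes: int) -> int:
    """Hypothetical partitioning: register r is held by PE (r mod num_pes)."""
    return reg % num_pes


def count_remote_reads(pe: int, source_regs: list[int], num_pes: int) -> int:
    """RF partitioning: count source operands that live in a nonlocal
    register file and must be fetched over the PE interconnect."""
    return sum(1 for r in source_regs if owner_pe(r, num_pes) != pe)


def wasted_forwards(written: set[int], read_later: set[int]) -> set[int]:
    """RF replication: registers whose new values are copied to another
    replica even though no later thread reads them (assumes every written
    register is forwarded)."""
    return written - read_later


if __name__ == "__main__":
    # A thread on PE 0 of 4 reads r0, r1, r4, r5; r1 and r5 are nonlocal.
    print(count_remote_reads(0, [0, 1, 4, 5], num_pes=4))
    # A thread writes r1, r2, r3 but only r2 is consumed downstream.
    print(sorted(wasted_forwards({1, 2, 3}, {2})))
```

With partitioning the cost appears on reads of nonlocal registers; with replication reads are local, and the cost appears as forwarded values that may never be used.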

PE Interconnect for Register Values<br />

When threads share a common register space, and a distributed RF structure is used, an important hardware attribute is the type of interconnect used to send register values from one PE to another. The interconnects that have been proposed in the context of multithreaded processors are bus, ring (unidirectional and bidirectional), crossbar, mesh, and hypercube; of course, it is possible to use other types of interconnects as well.

Bus: The bus is a simple, fully connected network. However, it permits only one data transmission at any time, providing a bandwidth of only O(1). In fact, the bandwidth scaling is worse than O(1), because the bus operating speed drops as the number of ports grows, due to the increase in capacitance. Therefore, a bus may be a poor choice for inter-PE register communication, whose volume can be nontrivial, especially when a large number of PEs is used.

Crossbar: A crossbar interconnect also provides full connectivity from every PE to every other PE. It provides O(N) bandwidth, but the cost of the interconnect is proportional to the number of cross-points, or O(N^2). When using a crossbar, all PEs are equidistant from one another; hence the thread allocation algorithm becomes straightforward. However, a crossbar may not scale as easily as a ring or mesh. It is important to note that fast crossbars can be built on a single chip. With a crossbar-type interconnect, there is no notion of neighboring PEs, so all PEs become equally far away. Therefore, the cross-chip wire delays begin to dominate the inter-PE communication latency.
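The bus and crossbar figures quoted above can be tabulated with a small Python sketch. It assumes an N x N crossbar in which every transfer targets a distinct destination PE; the function names are hypothetical, and the slowdown of a bus with added ports is not modeled.

```python
def peak_concurrent_transfers(kind: str, num_pes: int) -> int:
    """Upper bound on simultaneous register transfers (idealized)."""
    if kind == "bus":
        return 1            # one transmission at a time: O(1)
    if kind == "crossbar":
        return num_pes      # one transfer per distinct destination: O(N)
    raise ValueError(f"unknown interconnect: {kind}")


def crossbar_crosspoints(num_pes: int) -> int:
    """Cross-point count of an N x N crossbar: O(N^2) cost."""
    return num_pes * num_pes


if __name__ == "__main__":
    for n in (4, 8, 16):
        print(n,
              peak_concurrent_transfers("bus", n),
              peak_concurrent_transfers("crossbar", n),
              crossbar_crosspoints(n))
```

Doubling the number of PEs doubles the crossbar's peak bandwidth but quadruples its cross-point count, which is the scaling concern raised in the text.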

Ring: With a ring-type interconnect, the PEs are connected as a circular loop, and there is a notion of neighboring PEs and distant PEs. Routing in a ring is trivial because there is exactly one route between any pair of nodes (two routes if it is a bidirectional ring). The ring can be easily laid out with O(N) space using short wires (as depicted in Fig. 5.15), which can be easily widened. A ring is ideal if most of the inter-PE register communication can be localized to neighboring PEs (which is typically the case in a sequential threads processor that uses the circular queue PE organization [36]), but is a poor choice if a lot of communication happens across distant PEs. An advantage of the ring is that it easily supports the scaling up of the number of PEs, as allowed by technological advances.
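Why a ring favors neighbor-to-neighbor traffic can be seen from a hop count. The following is a minimal sketch, assuming PEs are numbered 0 to N-1 around the loop and that a unidirectional ring forwards values toward increasing PE numbers; `ring_hops` is a hypothetical helper, not something defined in the text.

```python
def ring_hops(src: int, dst: int, num_pes: int,
              bidirectional: bool = False) -> int:
    """Number of links a register value crosses from PE src to PE dst."""
    forward = (dst - src) % num_pes      # hops in the forward direction
    if not bidirectional:
        return forward                   # only one route exists
    return min(forward, num_pes - forward)  # take the shorter of two routes


if __name__ == "__main__":
    n = 8
    print(ring_hops(0, 1, n))                      # successor PE
    print(ring_hops(1, 0, n))                      # predecessor, one-way ring
    print(ring_hops(1, 0, n, bidirectional=True))  # predecessor, two-way ring
```

Sending to the successor always costs one hop, while on a unidirectional ring sending to the predecessor costs N-1 hops; a bidirectional ring caps the distance at N/2 (rounded down).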

Mesh: Rings generalize naturally to higher dimensions, including 2D grids and 3D cubes (with end-around connections). The main advantages of a mesh are its regular structure and its ability to provide full connectivity between four neighboring PEs (as opposed to two PEs with the ring). Similar to a ring, a mesh can easily support the scaling up of the number of PEs. The mesh suffers from the same disadvantages as a ring in communicating with distant PEs. Moreover, thread allocation for a mesh topology is more complex than that for a ring or a crossbar.
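The distance behavior of a 2D mesh with end-around connections can be sketched the same way. The code below assumes PEs are numbered in row-major order on a rows x cols grid and that values take a shortest path; `torus_hops` and the numbering are illustrative assumptions only.

```python
def torus_hops(src: int, dst: int, rows: int, cols: int) -> int:
    """Shortest hop count between two PEs on a 2D mesh with
    end-around (wraparound) links in both dimensions."""
    src_row, src_col = divmod(src, cols)
    dst_row, dst_col = divmod(dst, cols)
    d_row = abs(src_row - dst_row)
    d_col = abs(src_col - dst_col)
    # In each dimension, go whichever way around is shorter.
    return min(d_row, rows - d_row) + min(d_col, cols - d_col)


if __name__ == "__main__":
    # 16 PEs arranged as a 4 x 4 grid.
    worst = max(torus_hops(0, d, 4, 4) for d in range(16))
    print(worst)
```

For 16 PEs, the farthest PE on this 4 x 4 arrangement is 4 hops away, compared with 8 hops on a bidirectional ring of the same size; the price is that deciding which PE should run which thread is no longer a simple matter of picking the next PE in the loop.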

© 2002 by CRC Press LLC
