© 2004 by Derek Brendan Gottlieb. All rights reserved.


MEMORY HIERARCHY AND NETWORK DESIGN TRADEOFFS FOR CLUSTERED PROCESSORS

BY

DEREK BRENDAN GOTTLIEB

B.S., University of Rochester, 1999

THESIS

Submitted in partial fulfillment of the requirements
for the degree of Master of Science in Electrical Engineering
in the Graduate College of the
University of Illinois at Urbana-Champaign, 2004

Urbana, Illinois


To my wife, for putting up with life with a grad student.


ACKNOWLEDGMENTS

This thesis did not appear overnight, but was the result of many years of hard work and video games. Without the love and support of my parents, I wouldn't be where I am today, so they have to take a lot of credit for what I've become. I would also like to thank my wife for her support through the sometimes joyous, sometimes frustrating life that is graduate school. My research would not be what it is without the insightful input of my adviser, Professor Nicholas Carter. I am also indebted to my fellow Amalgam team members, past and present: Jeff Cook, Richard Kujoth, Lee Baugh, Chi-Wei Wang, Brian Greskamp, Josh Walstrom, and Steve Ferrera. With their help and good humor, I've managed to remain mostly sane throughout this process. I must also thank Chris Grier and Dan Fay, without whom I would have spent much longer writing benchmarks.

This thesis work was supported by the Office of Naval Research under Award No. N00014-01-1-0824.


TABLE OF CONTENTS

LIST OF FIGURES
LIST OF TABLES

CHAPTER 1 INTRODUCTION
1.1 Motivation
1.2 Clustered Architectures
1.3 Contributions of this Thesis
1.4 Thesis Organization

CHAPTER 2 RELATED WORK
2.1 Wire Delay Studies
2.2 Architectures
2.2.1 Chip multiprocessors
2.2.2 Clustered processor architectures
2.2.3 Gridded processors
2.3 On-Chip Network Studies

CHAPTER 3 AMALGAM ARCHITECTURE
3.1 Cluster Types
3.1.1 Programmable clusters
3.1.2 Reconfigurable cluster
3.2 Communication and Synchronization Mechanisms
3.2.1 Shared-memory communication
3.2.2 Register-based communication
3.3 Cluster Barriers

CHAPTER 4 ON-CHIP NETWORK
4.1 Bus
4.2 Crossbar
4.3 Hierarchical Networks

CHAPTER 5 MEMORY SYSTEM
5.1 Baseline Memory Hierarchy
5.2 Local Data Caches
5.3 Coherence Protocol
5.3.1 Implementation
5.3.2 Types of coherence messages

CHAPTER 6 EXPERIMENTAL METHODOLOGY
6.1 Simulation Framework
6.1.1 Programmable cluster modeling
6.1.2 Network modeling
6.1.2.1 Network description language
6.1.2.2 Network simulation
6.2 Network Latency Modeling
6.3 Area Estimation
6.4 Latency Models
6.5 Benchmarks
6.5.1 Image dithering
6.5.2 DNA sequence matching
6.5.3 GNU radio FIR
6.5.4 Mergesort
6.5.5 MPEG encoding
6.5.6 Rijndael encryption
6.5.7 Traveling salesman

CHAPTER 7 EXPERIMENTAL RESULTS
7.1 Increasing Impact of Wire Delays
7.2 Scaling Results
7.2.1 Full-chip latency scaling
7.2.2 Core-only latency scaling
7.3 Effects of Local Data Caches
7.3.1 Area impact of local caches
7.4 Effects of Intercluster Register Writes

CHAPTER 8 CONCLUSION

APPENDIX A AMALSIM CONFIGURATION OPTIONS
A.1 Cluster Options
A.1.1 Programmable (-proc clust:)
A.1.2 Reconfigurable (-rec clust:)
A.2 Branch Predictor (-bpred:)
A.3 Memory Options (-memory:)
A.4 Cache Options
A.5 Logging Options
A.6 General Options

APPENDIX B NETWORK DESCRIPTION LANGUAGE
B.1 Network Description File
B.2 Configuring Nodes
B.3 Expressions and Variable Scoping
B.4 Formal Grammar

APPENDIX C AMALSIM COMMAND REFERENCE
C.1 Basic Simulator Control
C.2 Print Simulator State
C.3 Modify Simulator State
C.4 Stream Utilities

APPENDIX D AMALGAM ISA (AMISA)

REFERENCES


LIST OF FIGURES

3.1 The Amalgam Clustered Processor
3.2 Programmable Cluster Detail
3.3 Reconfigurable Cluster Detail
3.4 Idealized Shared Memory vs. Register Forwarding
3.5 Cluster Barrier Register
4.1 Bus
4.2 Crossbar
4.3 Hierarchical Network
5.1 Illinois Protocol State Diagram
5.2 Coherence Directory Example
6.1 Amalsim System Hierarchy
6.2 NDL Example
7.1 Wire Delay in Future Technologies
7.2 Cache Access Time in Future Technologies
7.3 Full-Chip Summary
7.4 Individual Benchmark Performance on Full-Chip Configuration (Bus)
7.5 Performance When Scaling the Architecture Core
7.6 Individual Benchmark Performance on Core-Only Configuration (Bus)
7.7 Multicluster Speedup on Core-Only Configurations
7.8 Multicluster Speedup on Core-Only Configurations for Technology Endpoints
7.9 Multicluster Speedup on Core-Only Configurations with Stack Tweaks
7.10 Local Data Cache Performance
7.11 Best Cache Configuration Comparison (Crossbar)
7.12 Fractional Chip Area for Different Local Cache Sizes
7.13 Benefits of Intercluster Register Communication (Pipelined Crossbar)
7.14 Benefits of Intercluster Register Communication (Bus)


LIST OF TABLES

5.1 Coherence Message Types
6.1 Projected Fabrication Technology Parameters
6.2 Calculated Technology Parameters and Projected Wire Delays
6.3 Projected Cache Memory Latencies
6.4 Projected System Latencies
A.1 Programmable Cluster Options
A.2 Reconfigurable Cluster Options
A.3 Branch Predictor Options
A.4 Memory Options
A.5 Cache Options
A.6 Coherence Options
A.7 Logging Options
A.8 General Options
D.1 AMISA


CHAPTER 1

INTRODUCTION

As the semiconductor industry develops deep submicron technology that will permit billions of transistors on a single die, computer architects face increasing pressure to maintain the rate of improvement in microprocessor performance witnessed in the past few decades. Historically, these advances in performance have been driven largely by geometric rates of improvement in fabrication technologies, which allow chips to operate at higher clock frequencies. These technology improvements also provide larger transistor budgets that allow architects to include new performance-enhancing techniques such as out-of-order execution and branch prediction. However, maintaining this rate of improvement in deep submicron technologies is becoming increasingly difficult, due to the poor scaling of global on-chip wires.

Architects have proposed clustered architectures as a possible solution to the wire delay problem. This thesis examines the performance of one such clustered architecture, Amalgam, as fabrication technology scales from feature sizes of 180 nm down to 22 nm, and studies the performance impact of several design options over this range.

1.1 Motivation

High-performance processor designers face several obstacles to maintaining the historic rate of performance improvement witnessed in the past few decades. In particular, the poor scaling characteristics of on-chip wires impose limitations on future performance improvements. As transistor processes improve, wire delays fail to scale as well as transistors as a result of the increased resistance caused by shrinking wire cross sections. With the anticipated increases in clock rate as fabrication technologies improve, these wire delays lead to significant global communication latencies, which place severe performance restrictions on architectures that rely on centralized control logic and register files [1, 2].
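As a first-order illustration of why this happens (a standard distributed-RC estimate, not a result from this thesis), the delay of an unrepeated wire of length $L$ with resistance $r$ and capacitance $c$ per unit length grows quadratically with length:

t_{\text{wire}} \approx \tfrac{1}{2}\, r\, c\, L^2

Shrinking the wire cross section increases $r$, and a global wire's length does not shrink with the feature size, so at best its absolute delay stays constant while gate delays and cycle times continue to fall. Measured in clock cycles, cross-chip communication therefore becomes steadily more expensive.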

1.2 Clustered Architectures

Clustered processors are a class of distributed processor architecture that has been proposed as a potential solution to the wire delay problem [3]. These architectures lie somewhere in the spectrum between traditional multiprocessor systems and hardware-scheduled superscalar processors. Like chip multiprocessors (CMPs), clustered microprocessors divide their execution units into several independent processing cores (known as clusters) that communicate with each other and the memory system via an on-chip network. Each cluster is composed of a set of functional units, a register file, and a network interface. By dividing the execution resources in this manner, designers base the clock speed of a clustered processor on the critical path through a single cluster, not the propagation delay across the on-chip network. Operations (such as memory references or intercluster communication) that require communication across the network see its delay as one or more cycles of additional latency. This latency is visible to the compiler, allowing it to be taken into account during scheduling. These architectures also include low-latency communication mechanisms, which allow direct communication between clusters instead of relying solely on much slower shared-memory communication. Clustered processors also include low-latency synchronization mechanisms that allow fine-grained parallelism to be exploited as effectively as coarse-grained parallelism. The combination of these architectural techniques should allow clustered processors to scale well in future technologies, despite the wire delay problem.

This thesis examines one such clustered architecture, Amalgam, over the major technology nodes from 180 nm to 22 nm. During this process, we pay particular attention to the system's ability to handle increasing wire delays when faced with rapidly increasing clock rates. Additionally, we study the role of various design options in overall performance, including instruction issue policy and the presence and size of local data caches within the clusters. By studying these aspects of clustered processors, we gauge their effectiveness in dealing with the wire delay problem, the limitations of cache memories, and the relative importance of network and memory hierarchy design.

1.3 Contributions of this Thesis

The contributions of this thesis are threefold:

1. A cache coherence protocol was developed for a clustered processor with an arbitrary on-chip network to explore the benefit of introducing local data caches into the programmable clusters. This protocol is loosely based on the Illinois protocol [4], but includes extensions to improve its performance on networks that prohibit the snoopy optimizations typically associated with bus-based systems.

2. A set of experiments was conducted to perform a thorough examination of the effects of technology scaling on the performance of clustered processors, paying particular attention to memory system and communication latencies. Execution time improves by more than a factor of three over the technology range examined, due in part to aggressive clock rate improvements. These clock rate improvements also contribute to the widening processor-memory performance gap that largely dominates the performance of these systems. As a result, the performance of the memory hierarchy overshadows many of the issues in on-chip network design. Using a complex crossbar network yields a performance improvement of a few percent over the simpler bus network for the applications tested.

3. Our study demonstrated that the traditional cache hierarchy does not effectively mitigate increasing memory latencies. Introducing additional levels into the cache hierarchy yields a performance gain, which tends to increase as memory latencies increase. However, this performance gain does not increase as quickly as the memory gap is widening. To improve this situation, designers must explore alternatives such as small software-managed memories that can achieve lower latencies while maintaining high hit rates.

1.4 Thesis Organization

The remainder of this thesis is organized as follows: Chapter 2 summarizes prior efforts to examine the effects of technology scaling on performance and potential solutions to the problems facing traditional architectures. Chapter 3 reviews the key details of the Amalgam architecture. Chapter 4 examines the tradeoffs associated with the network topologies examined for Amalgam. Chapter 5 discusses Amalgam's memory system, paying particular attention to the details related to local data caches and the coherence protocol required to keep them synchronized. Chapter 6 explains the methodology behind the empirical phase of this thesis, and Chapter 7 discusses the results of experiments designed to evaluate the effects of technology scaling on the performance of clustered processors and the effectiveness of local data caches in mitigating some of these effects. Chapter 8 presents some conclusions of this work. At the end of this thesis are four appendices that provide further details about the functionality of our simulator and present the formal instruction set used by Amalgam's programmable clusters. Simulator functionality is presented in the form of the supported configuration options, the network description language developed to integrate various network models into the Amalgam simulator, and a command-line reference detailing the various options available to interact with the simulator.


CHAPTER 2

RELATED WORK

Work related to our research can be divided into a few categories. The first category is the examination of the future of wire delays and their impact on high-performance computer architecture. Research in this category estimates the performance of different architectural paradigms over the next several technology generations and identifies architectural bottlenecks. This work motivates the remaining categories, the first of which focuses on alternative architectures. As several studies show, conventional superscalar architectures face numerous performance limitations as global communication becomes increasingly expensive. This has led to a variety of projects that have divided the chip into smaller processing elements. The third category is the design of on-chip networks that address the communication needs of clustered processors.

2.1 Wire Delay Studies

To better understand the magnitude of the wire delay problem, studies have examined the behavior and performance of wires in future technologies [2]. These studies found that local wires that shorten in length as technology scales exhibit favorable scaling behavior, with delays that closely follow or slowly grow relative to gate delays. As a result, architectural components that are simply scaled down to a new technology perform similarly to the original implementation, and the number of gates that can communicate within a cycle remains roughly constant. On the other hand, global wires do not scale in length since they provide paths to communicate global signals across the chip. At best, the delay of these wires will remain constant, resulting in a steady increase relative to gate delays. The end result of this work is that architects must explore the impact of increasing wire delays on existing architectures and explore alternatives that provide better performance opportunities when faced with these limitations.

Previous work has examined the impact of increasing wire delays on the performance of conventional superscalar architectures [1]. Using models for wire and microarchitectural component delay, the study performed a series of architectural simulations of an aggressive out-of-order microarchitecture scaled over a 15-year range of technologies. The simulations included different clock scaling strategies as well as two different microarchitectural scaling strategies, which are differentiated by how they achieve the desired clock rates. The first strategy kept the capacity of microarchitectural structures constant and relied on deeper pipelines to handle the increasing delays associated with these structures. The second strategy attempted to limit the depth of the pipeline by reducing the capacity of microarchitectural structures to limit their delays. Even with optimistic estimates for wire scaling, superscalar processor performance improvements are limited to no better than 12.5% per year, which is in stark contrast to the rate of 50-60% per year witnessed over the past decade.

On this basis, other studies have examined the design space of future chip multiprocessors, or CMPs [5]. This study compares the performance and area trade-offs for CMPs to determine the ideal architectural makeup that maximizes server application throughput in future technologies. In particular, they weigh the tradeoffs of in-order and out-of-order processing cores, the number of these cores, and the capacity of the on-chip caches given finite off-chip bandwidth. These studies suggest that while future CMP designs will continue to offer improvements in overall throughput with increasing transistor budgets, the diminishing returns associated with conventional superscalars on general-purpose applications remain an issue. Furthermore, the number of cores that may effectively be utilized on future CMPs will largely be determined by the off-chip bandwidth. This suggests that something more effective than the traditional cache hierarchy is required to offset the increasing processor-memory performance gap.

Other studies have examined the effects of increasing wire delays on gridded architectures [6]. These architectures divide available chip area into a grid of identical, programmable tiles to provide better scalability than conventional architectures. By exposing physical resources as explicit architectural entities, programmers can maximize performance by mapping applications to these tiles around increasing wire delays. The specifics of gridded architectures are discussed in further detail in Section 2.2.3. Comparing a relatively idealized gridded processor against idealized superscalar and VLIW processor models, the gridded processor dominates, with an average IPC more than twice that simulated on the ideal superscalar core. These studies support the assertions that gridded processors allow for better scalability, enabling the continued scaling of clock rate and instruction throughput.

2.2 Architectures

Several architectures have been proposed to continue the historic rates of increase for processor performance in light of the increasing pressures of wire delay and design complexity. All of these approaches rely on some form of modular design with a number of somewhat independent processing elements that rely on fast local communication with reduced reliance on global signals. These different architectures are presented here, from the fairly conventional chip multiprocessors to clustered architectures to far-reaching gridded processors.

2.2.1 Chip multiprocessors

Researchers have explored several potential solutions to the hurdles faced by the architecture community. Chip multiprocessors, or CMPs, integrate multiple processors and their primary caches onto a single chip together with a shared secondary cache. CMPs offer implementation and performance advantages over conventional wide-issue superscalar designs. Traditionally, applications for CMPs must be explicitly parallelized into independent threads with proper synchronization operations inserted as appropriate. The parallelization of an application may be performed either by the programmer or the compiler. In the programmer-centric paradigm, the source code specifies the independent threads of execution and explicitly controls their synchronization and communication. Alternatively, uniprocessor applications are compiled using an automatic parallelizing compiler that analyzes the code and extracts existing parallelism in order to generate independent threads.

Speculative CMPs, such as Hydra [7], are a further extension of this work that aim to eliminate the need for recompilation of sequential code by speculatively running multiple threads in parallel. In these systems, potentially independent threads of execution are identified in a sequential binary. Each thread is then speculatively executed on a separate processor in a CMP. Since these systems are able to execute code in parallel that a compiler is not able to identify statically, it is possible for these systems to obtain higher levels of parallelism than a parallelizing compiler-based CMP. However, this may also result in large amounts of useless work, depending on the accuracy of the speculation. Furthermore, a compiler may more effectively exploit the identifiable parallelism within an application, since it may perform substantially more analysis than a run-time system. Performing this analysis in the compiler allows parallelization over a wider program scope, since it is not confined by the limitations of a hardware structure and does not lead to the overhead of a software run-time system.

Krishnan and Torrellas's work uses speculation in the same manner as Hydra, but differs in several ways from other speculative CMP research [8]. Their architecture includes hardware for memory disambiguation that does not rely on snoopy-based cache coherence protocols to detect interthread memory dependence violations. Instead, their scheme uses a centralized table located near the L2 cache that is similar in nature to those used for directory-based cache coherence. As a result, systems based on their approach are not limited to using low-bandwidth bus networks for on-chip communication. This work is also distinguished by the inclusion of a relatively simple hardware mechanism that enables the communication of thread live-out register values between on-chip processors. In order to implement this efficiently, this approach is limited to the broadcast of a single register value over a dedicated bus, but performs well given the nature of parallelization and communication in this system.

2.2.2 Clustered processor architectures

Clustered processor architectures, such as the M-Machine [3] and Multiscalar [9], divide a processor's functional resources into independent clusters, much like the cores in a CMP, that communicate with each other and the memory system via an on-chip network. The clock rates of these designs are based on the wire delay within a single cluster, with multicycle global communication delays across the on-chip network exposed to the compiler. Clustered processors provide some nonmemory communication path between processing elements, which distinguishes them from CMPs that rely solely on shared memory. Architectures such as M-Machine and Amalgam support a distributed register file paradigm, where the processor's register file is divided evenly amongst the clusters. Operations that execute on a given cluster may only read their inputs from that cluster's register file, but may write their results into any of the registers on the chip via the on-chip network. Previous work has shown that this additional register-based communication mechanism yields significantly better performance than a similar architecture that communicates only through shared memory [10]. As network and memory latencies increase relative to processor clock frequencies, this benefit should increase dramatically.

In addition to the description used in this work, clustered processors have taken other forms that differ primarily in their approach to parallelization. The most traditional approach is the clustered superscalar that utilizes a single instruction stream [11-14]. Commercial implementations include the Alpha 21264 [15], MIPS R10000 [16], and IBM Power4 [17]. While this scheme has some clock rate advantages over traditional superscalar implementations, it still suffers from numerous scaling problems. These systems have centralized issue logic that determines the assignment of instructions to specific clusters at runtime, thus maintaining the traditional, single-threaded uniprocessor execution model. This centralized issue logic is likely to become a performance bottleneck as clock rates increase, leading to a conflict between decreasing the instruction window to meet clock rate goals and increasing the number of clusters to exploit additional parallelism.

Cluster assignment is further complicated by the uniprocessor execution model. To reduce register file access time, the register file is replicated in each cluster, requiring hardware bypassing between the clusters in addition to the bypass between functional units in a single cluster. Since this hardware bypass must connect all of the clusters in the system, it will involve a multicycle communication delay, due to the long wires involved. In order to achieve optimal performance, the issue logic must assign dependent operations to the same cluster in order to avoid the additional latency of intercluster register bypassing. Furthermore, replicating the register file in this manner requires an increasing number of register ports proportional to the number of clusters and the number of writes a cluster may commit per cycle. This results in a substantial increase in the area consumed by the register file, which may be avoided if a cluster can only read from a portion of the register file.
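To make the port count concrete (a rough estimate under the replication scheme just described, not a figure from the cited work): if each of $N$ clusters may commit $w$ register writes per cycle and every result must be bypassed into every replica of the register file, each replica needs on the order of

P_{\text{write}} \approx N \cdot w

write ports, whereas a partitioned register file in which a cluster reads only its own registers needs only about $w$ write ports per cluster plus whatever ports serve incoming intercluster writes.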

Alternatively, parallelization of the application may be performed by the programmer and compiler, breaking execution into multiple threads that are assigned to specific clusters. In this paradigm, the clusters are free to execute their instruction streams independent of the other clusters except when they reach an explicit synchronization operation. This allows clusters to continue to make forward progress in the event that a single cluster is held up by a cache miss. This approach also avoids many of the drawbacks associated with clustered processors that rely on centralized issue logic to perform cluster assignment. Issue logic may be kept small, since it is only operating on a subset of the total program, and is responsible for a small number of functional units instead of a large number of clusters consisting of multiple functional units. Register files are also more effective, since they only reflect local state instead of processor-wide state, allowing the compiler the flexibility to pass around only the values that are required for the computation, and to take global wire delays into account when scheduling these intercluster register writes.

2.2.3 Gridded processors

Gridded processors, such as RAW [18], are comprised of a set of replicated tiles arranged in a grid. In RAW, each tile contains a small RISC processor, some configurable logic, and a block of instruction and data memory, and maintains an independent program counter. To enable communication between these tiles, a programmable switch is associated with each tile, connecting the tiles in a configurable, wide-channel point-to-point interconnect. By supporting complete software control over these resources, RAW translates low-level physical entities such as gates, wire delays, and pins into higher-level architectural concepts such as tiles, network hops, and I/O ports. In this manner, the compiler has full knowledge of the physical limitations of the underlying hardware and can exploit parallelism in a way that best utilizes the available hardware and avoids costly global communication.

The polymorphous TRIPS architecture takes a slightly different approach [19]. A TRIPS processor is composed of an array of memory tiles and a small number of polymorphous cores, which are composed of a grid of nodes containing an ALU, reservation station, and routing hardware. While nodes are directly connected to only their nearest neighbors, the routing network supports the forwarding of results to any ALU within the array. This network also provides access to the banked supporting resources, including the register file and the instruction and L1 data caches. In addition to the static architectural structures, TRIPS includes a variety of configurable or polymorphous resources to support multiple execution strategies, such as speculative execution and multithreading. Reservation station indices may be used to designate parallel threads. Each core also includes highly configurable block sequencing control logic that determines when a block of instructions has completed execution, when it should be deallocated from frame space, and which new block should replace it. This logic may be configured differently depending on the nature of the application. The flexibility of the memory tiles allows them to behave as a NUCA L2 cache, scratchpad memory, synchronization buffers for multithreaded applications, or stream register files. Using these resources, TRIPS achieves high performance when exploiting instruction-, thread-, and data-level parallelism.


2.3 On-Chip Network Studies

Clustered microarchitectures rely heavily on a low-latency intercluster interconnection network to achieve high performance. The on-chip networks for these architectures are also likely to have noticeably different requirements and characteristics than those used for traditional multiprocessor systems. As a result, the study of effective networks is critical for these systems. As illustrated by the Alpha 21264, implementing an efficient, contention-free interconnect for a simple two-cluster architecture is possible by directly connecting each functional unit output to a register file write port in the other cluster. Unfortunately, this fully connected network does not scale well as the number of clusters increases, leading to excessive cost and complexity. Traditional multiprocessor systems have frequently resorted to a simple shared bus for interprocessor communication, which substantially reduces the complexity of the interconnect at the expense of substantially higher contention.

Other studies have sought to explore the design trade-offs between interconnect complexity and communication latency for clustered architectures relying on centralized issue logic. Parcerisa et al. explore a variety of interconnection schemes for four- and eight-cluster processors, ranging from simple bus-per-cluster interconnects to several types of point-to-point interconnects [20]. Point-to-point interconnects for the four-cluster processors are limited to a basic ring topology, but for the eight-cluster processors the study also examines the more complex mesh and torus topologies. Additionally, their study examines the impact of using a topology-aware instruction steering scheme that minimizes communication distances for point-to-point interconnects, which potentially reduces communication latency significantly when compared to a scheme that does not take this into account. The increased connectivity provided by mesh and torus topologies yields substantial performance gains for the system studied. Furthermore, it is possible to design a partially asynchronous interconnect with low hardware requirements that achieves performance close to an equivalent idealized interconnect.

Aggarwal and Franklin's work explores alternative interconnects and cluster instruction distribution algorithms that support better scalability than existing approaches [21]. This study presents hierarchical interconnects as an alternative that provides scalable performance as the number of on-chip clusters increases. Crossbar interconnects provide full connectivity between all clusters at the expense of gradually increasing latency with the number of clusters, due to the increasing physical distance between clusters. Ring interconnects only rely on short wires, since they provide communication between neighboring clusters, but the maximum latency between distant clusters increases dramatically with the number of clusters. On the other hand, hierarchical interconnects can provide low-latency communication between a small number of clusters by connecting groups of four with a crossbar, and scale well to higher numbers of clusters by providing a scalable interconnect between these small groups, such as a ring network connecting the small crossbar networks. Dividing clusters in this manner also provides the additional benefit of reducing the complexity of instruction distribution, as it may also be spread out over the different levels of the hierarchy. As a result, their approach provides substantially better scalability than existing approaches while achieving better performance due to simpler hardware.

This chapter presented research related to our work. Studies have shown that wire delays do not scale well relative to transistors, which has a negative impact on conventional architectures that rely on large, centralized structures. Distributed architectures, such as CMPs, clustered processors, and gridded processors, show promise in meeting future performance goals given global communication constraints. Studies have also examined the communication requirements of clustered processors and alternative on-chip interconnects that address these needs.


CHAPTER 3

AMALGAM ARCHITECTURE

Amalgam, as illustrated in Figure 3.1, is a clustered processor that supports implementation at high clock rates, providing higher performance and better scaling characteristics than conventional uniprocessor designs. Each cluster contains either a simple programmable processor core or a block of reconfigurable logic. These clusters communicate with each other and the memory system via an on-chip network, exposing all global communication to the compiler. By doing so, architects limit the impact of wire delays by relying on cluster locality when possible and carefully scheduling operations that require global communication. In addition to the implementation advantages, clustered processor designs provide an attractive fabric for heterogeneous computing systems because the distributed register file abstraction provides a convenient mechanism for integrating different types of computational resources. To exploit the advantages of such a system, the architecture must also provide efficient communication and synchronization mechanisms to effectively utilize an application's inherent parallelism.

To expand on this general description, the remainder of this chapter elaborates on Amalgam's cluster types and the methods they use to coordinate during computation.

3.1 Cluster Types

Amalgam supports programmable and reconfigurable clusters. This mix of processing resources allows Amalgam to execute each component of an algorithm on the most appropriate resource.


Figure 3.1 The Amalgam Clustered Processor

The following sections present the architectural details for each type of cluster.

3.1.1 Programmable clusters

Each programmable cluster (Figure 3.2) consists of a dual-issue, in-order 32-bit processor, a 32-entry, 32-bit register file, a network interface, and an instruction cache. To minimize complexity, the programmable cluster uses a simple not-taken branch prediction scheme. Programmable clusters execute independent instruction streams consisting of a MIPS-like instruction set that is loosely based on the DLX ISA presented in [22]. The Amalgam ISA (Appendix D) contains three key extensions to the DLX ISA that support clustered architectures, based on the instruction set used on the M-Machine [3]: a barrier instruction to reduce synchronization overhead, specification of both a destination cluster and a destination register for the result of most instructions, and an EMPTY instruction that prepares registers to receive results from other clusters.

Figure 3.2 Programmable Cluster Detail

To further explore cluster design options, we have compared Amalgam's baseline in-order programmable cluster against a few alternatives that include additional functionality. The first extension compared against the baseline introduces support for out-of-order execution. This alternative core uses register renaming, a 128-entry reorder buffer, and a 64-entry temporary register file to supplement the 32-entry architectural register file. It also employs a more accurate branch prediction scheme involving a 256-entry table of 2-bit saturating counters and a branch target buffer with 128 sets and 8-way associativity. The out-of-order core does not support any form of dynamic memory disambiguation. The second extension that was compared against the baseline was the inclusion of an additional level of data cache in the clusters. These data caches added to the alternative programmable clusters use a variant of the Illinois protocol [4] to maintain coherence between the local caches. This protocol is discussed in further detail in Section 5.3. Cache-coherence messages travel over the on-chip network and are therefore affected by changes in the network delay and topology.
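For reference, the Illinois protocol is a MESI-style invalidation protocol; a minimal sketch of its four stable line states is shown below in C. The names and comments are illustrative only, and Amalgam's extensions for networks without snooping are left to Chapter 5.

/* Stable line states of an Illinois (MESI-style) invalidation protocol.
 * Illustrative sketch only; the simulator's actual encoding may differ. */
enum cache_line_state {
    LINE_INVALID,    /* no valid copy of the block in this cache          */
    LINE_SHARED,     /* clean copy; other caches may also hold the block  */
    LINE_EXCLUSIVE,  /* clean copy; no other cache holds the block        */
    LINE_MODIFIED    /* dirty copy; must be written back before eviction  */
};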

Figure 3.3 Reconfigurable Cluster Detail

3.1.2 Reconfigurable cluster

Conceptually, a reconfigurable cluster consists of a simple row-based field-programmable gate array (FPGA) with a register file that supports Amalgam's register-based communication mechanisms. A more detailed discussion of Amalgam's reconfigurable cluster and its design may be found in [23]. Each cluster contains a 32-entry, 32-bit register file, a 32 x 32 array of 4-input logic blocks, a network interface, and an array control unit (ACU). The register file is divided into four equally sized banks and interleaved with the reconfigurable array, which is also divided into four equally sized segments, as depicted in Figure 3.3.

This approach eliminates the complex logic required to generate register indices in reconfigurable logic while providing substantial bandwidth between the array and the register file. Each register in a bank continuously drives its output on a vertical wire that can be read by any logic block in the corresponding bit column of the segment below the bank by appropriately configuring the block's input multiplexors. Similarly, the input to each register bit is taken from a vertical wire that can be driven by any logic block in the corresponding column of the segment above the bank. This organization makes the entire contents of the register file available to the array on every cycle, significantly increasing the number of computations that can be carried out in parallel.

3.2 Communication and Synchronization Mechanisms

When compared to a modern microprocessor, the simple architecture of a programmable cluster is relatively uninteresting. A similar comparison may be drawn between a modern FPGA and a single reconfigurable cluster. However, when several programmable and reconfigurable clusters are combined with efficient communication and synchronization mechanisms, these relatively simple processing resources can coordinate to provide impressive performance on a variety of applications. The following sections will discuss traditional shared-memory communication, low-latency register-based communication and its advantages over shared memory, and the hardware-based cluster barrier operation that provides efficient synchronization between programmable clusters.

3.2.1 Shared-memory communication

In Amalgam, programmable clusters may communicate via the memory hierarchy using the shared-memory techniques employed in more traditional multiprocessor systems. To communicate data between clusters, a producing cluster writes its result to a shared portion of memory. Once it has been written, other clusters may use this data by reading from that location in memory. To do this reliably, shared-memory systems rely on a locking mechanism where the producing cluster grabs the lock, guaranteeing exclusive access to a region of memory, writes its data, and then releases the lock. When other clusters check this lock, the release of the lock indicates that the data is available, allowing them to proceed with their computation using this data.
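The pattern just described looks roughly like the following sketch, written with POSIX threads purely for illustration; Amalgam's actual lock primitives and memory layout are not specified by this passage, so all names here are assumptions.

#include <pthread.h>

/* Shared region and the lock/flag protecting it (illustrative only). */
static int shared_value;
static int value_ready = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

/* Producing cluster: acquire the lock, publish the data, release. */
void produce(int result)
{
    pthread_mutex_lock(&lock);
    shared_value = result;
    value_ready  = 1;
    pthread_mutex_unlock(&lock);
}

/* Consuming cluster: poll the lock-protected flag, then read the data. */
int consume(void)
{
    for (;;) {
        pthread_mutex_lock(&lock);
        if (value_ready) {
            int v = shared_value;
            value_ready = 0;
            pthread_mutex_unlock(&lock);
            return v;
        }
        pthread_mutex_unlock(&lock);
    }
}

Every acquire, release, and poll in this sketch turns into memory traffic over the on-chip network, which is exactly why the cycle counts in the next example grow so quickly.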

Figure 3.4 Idealized Shared Memory vs. Register Forwarding

An idealized example of this form of communication is illustrated in Figure 3.4, where cluster Px is producing a value consumed by cluster Py. Assuming a two-cycle network delay and a one-cycle hit in the main cache, transferring data via shared memory requires a minimum of eight cycles. When Px finishes computing a data word, it sends a write message to the main cache via the on-chip network, requiring a minimum of three cycles: two cycles to traverse the network and one cycle to write the data into the cache. Once Py knows the data is available, it sends a request to the main cache that then returns the data via the on-chip network, resulting in a total latency of at least five cycles. Unfortunately, even this is a very optimistic estimate, since we have not addressed how Py knows the data is available. As mentioned previously, this would likely involve some form of locking mechanism that would require several additional memory accesses and network traversals.

As a result, the relatively long latency of shared-memory-based communication limits the frequency of intercluster synchronization and data sharing, thus limiting the performance of applications requiring fine-grained computation.

3.2.2 Register-based communication

When more frequent communication is required, Amalgam supports a low-latency communication scheme using intercluster register writes. Part of the Amalgam specification requires that all clusters provide data storage that other clusters may access by way of intercluster register writes. In the case of a programmable cluster, this is implemented as a traditional register file, but it may be implemented in any manner appropriate for the internal organization of nonprogrammable clusters as long as this interface abstraction is maintained. While each cluster may only read from its local registers, it may write to any register in the chip by specifying the destination cluster and register and passing this information along with the desired data to the network interface controller. The network interface then generates an appropriate packet that is routed to the destination cluster via the on-chip network. Once received, the destination cluster writes this data into its local register file.

Register scoreboarding is used to indicate when an instruction may issue based on the availability of its input operands. In this scheme, a central scoreboard tracks register availability by associating a valid bit with each register in the processor. When an instruction issues, it clears the valid bit associated with its destination register. When the instruction completes the writeback stage, it writes its result to the appropriate destination register and sets the valid bit in the scoreboard. Dependent operations may only issue once the valid bits are set for each of their input operands. This mechanism is easily extended to support clustered architectures by properly handling intercluster register writes. When an instruction targets a remote cluster's register, the issuing cluster is unable to clear the valid bit in the remote cluster's scoreboard. Therefore, the Amalgam ISA includes an EMPTY instruction, which clears the valid bit for the specified register. When a cluster expects to receive an intercluster register write, it must first execute an EMPTY instruction, thereby stalling the issue of any dependent operations until the data arrives via the on-chip network. When an intercluster register write arrives at a cluster, the receiving cluster writes the data into the appropriate register and sets the valid bit in the scoreboard, allowing dependent operations to proceed.
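A minimal sketch of this bookkeeping is given below in C. The 32-register file size matches the programmable cluster of Section 3.1.1, but the structure and function names are illustrative rather than taken from the Amalgam simulator.

#include <stdbool.h>
#include <stdint.h>

#define NUM_REGS 32   /* per-cluster architectural register file */

/* One valid bit per register, as in the scoreboard described above. */
typedef struct {
    bool valid[NUM_REGS];
} scoreboard_t;

/* An instruction may issue only when all of its source registers are valid. */
bool can_issue(const scoreboard_t *sb, const uint8_t *srcs, int nsrcs)
{
    for (int i = 0; i < nsrcs; i++)
        if (!sb->valid[srcs[i]])
            return false;
    return true;
}

/* Issue clears the destination's valid bit; writeback sets it again. */
void on_issue(scoreboard_t *sb, uint8_t dst)     { sb->valid[dst] = false; }
void on_writeback(scoreboard_t *sb, uint8_t dst) { sb->valid[dst] = true;  }

/* EMPTY clears the valid bit of a register that will be written remotely;
 * the arrival of the intercluster register write sets it again. */
void on_empty(scoreboard_t *sb, uint8_t reg)        { sb->valid[reg] = false; }
void on_remote_write(scoreboard_t *sb, uint8_t reg) { sb->valid[reg] = true;  }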

Previous studies have shown that register-based communication mechanisms improve parallel application performance over systems that rely solely on shared-memory mechanisms [10, 24]. This is largely due to the difference in latency between the two approaches. This disparity may be clearly seen in the example presented in Figure 3.4. As discussed previously, shared-memory systems would require a minimum of eight cycles to communicate a single result between clusters Px and Py due to multiple network and cache accesses. This estimate is overly optimistic, because it ignores lock overhead and any additional memory or network delays. On the other hand, forwarding the result directly from Px to Py using an intercluster register write only requires a single traversal of the network and incurs no memory delays. As a result, register-based mechanisms are substantially faster than shared-memory mechanisms when employed effectively.
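Collecting the cycle counts from the Figure 3.4 example (two-cycle network traversal, one-cycle cache hit; write message plus cache write, then read request, cache read, and reply):

T_{\text{shared}} \ge (2 + 1) + (2 + 1 + 2) = 8 \text{ cycles}, \qquad T_{\text{register}} \approx 2 \text{ cycles},

and the shared-memory figure still excludes the locking traffic needed for Py to learn that the data is ready.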

Register-based communication has a critical limitation not present in shared-memory mechanisms. The small size of a register file places stringent constraints on the quantity of data that can be shared at any one time. As a result, it is best suited for sharing small amounts of data that form the critical path of a computation. However, this limitation may be reduced by implementing the register file as a series of register queues instead of traditional registers. In this scheme, EMPTY operations would simply advance the queue until it no longer holds valid data. While this would increase the complexity of the register file, a queue structure would make application mapping easier and increase the amount of data that may be transferred between synchronization operations.

The combination of shared memory and register-based communication mechanisms provides Amalgam with the flexibility to handle applications with varying levels of parallelism. Applications with coarse-grained parallelism map onto Amalgam as they would onto a conventional CMP, relying on shared memory for communication. Applications that allow for fine-grained parallelism may treat Amalgam's clusters as loosely coupled functional units, relying more heavily on register-based communication.

3.3 Cluster Barriers

In addition to providing efficient communication mechanisms, clustered processors must provide low-overhead, low-latency mechanisms for synchronizing computation between clusters. In Amalgam, this is achieved using a hardware-implemented cluster barrier, which serves as a mechanism for synchronizing all of the programmable clusters in Amalgam.

Figure 3.5 Cluster Barrier Register

When a cluster executes a CBAR instruction, it stalls the issue process until all programmable clusters in the system have also reached the barrier. Cluster barriers are implemented using a global condition register with a bit representing each programmable cluster, as depicted in Figure 3.5. Since this register is accessed via global wires, barrier latency is affected by the increasing wire delays associated with technology scaling. Initially, all of the bits in the register are unset, indicating that no programmable cluster is waiting on a cluster barrier. The execution of a CBAR instruction sets the bit in the register associated with the executing cluster. This is illustrated by the shaded clusters in the figure, while the unshaded clusters are still actively computing. When all bits in the register are set, all programmable clusters have reached the barrier; the register is then cleared and execution proceeds normally. Using this implementation, a cluster barrier may only be used to synchronize all of the programmable clusters within the system. When synchronizing a subset of the programmable clusters, or synchronizing with reconfigurable clusters, a combination of register scoreboarding and intercluster register writes may be used.
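The barrier register's behavior can be summarized with a small sketch. The following C fragment models the global condition register under the assumption of eight programmable clusters; the names and the busy-wait structure are illustrative, not a hardware description.

#include <stdint.h>

#define NUM_PCLUSTERS 8                     /* assumed number of programmable clusters */
static volatile uint8_t barrier_reg = 0;    /* one bit per programmable cluster        */

/* Behavior of a CBAR instruction issued by cluster `id` (0..7): set this
 * cluster's bit, then stall until every bit is set, at which point the
 * register is cleared and all clusters proceed. */
void cbar(int id) {
    barrier_reg |= (uint8_t)(1u << id);
    while (barrier_reg != 0xFF && barrier_reg != 0) {
        /* cluster stalls; in hardware this is an issue-stage stall,
         * not a software spin loop */
    }
    barrier_reg = 0;   /* all clusters arrived: clear and release */
}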

This chapter has discussed the core functional elements of an Amalgam processor and how they interact. The following chapters detail the system components that are the primary focus of this thesis. Chapter 4 discusses the on-chip network, focusing on the different topologies examined, including Amalgam's hierarchical network. Chapter 5 describes the base memory system and details the hardware-managed coherence protocol created to support local data caches within the programmable clusters.


CHAPTER 4

ON-CHIP NETWORK

The growing communication latencies caused by the poor scaling characteristics of global wires are an increasing concern in deep submicron technologies. While clustered architectures seek to deal with this issue by limiting single-cycle wiring delay to local communication and explicitly exposing the multicycle delay of global communication to the compiler, careful network design remains important to help minimize the effects of these increasing delays. To this end, this study examined a variety of network topologies for use in a clustered processor such as Amalgam.

In addition to latency, architects must consider other characteristics of network topologies, including their hardware complexity, scalability as the number of attached devices increases, and performance behavior as transistors shrink. These characteristics combine to control how well a given topology handles increasing communication latency and the corresponding increase in network contention as network resources remain busy for greater lengths of time for the same operation. The performance of clustered processors is closely tied to this behavior, as intercluster register communication places different demands on the on-chip network than traditional CMPs, which are limited to shared-memory communication. To address these issues, the following sections describe the basic characteristics of bus and crossbar topologies along with a proposed hierarchical topology, and attempt to predict their effectiveness in a clustered architecture.

Figure 4.1 Bus

4.1 Bus

Bus networks (Figure 4.1) connect all of the clusters and cache banks with a single set of shared wires. Communication on a bus is broken into discrete transactions with a specific sender and receiver. In order to initiate a transaction, the sender must first gain control of the bus and become the bus master. Once the sender has control of the bus, it broadcasts its message to the receiver. Once the receiver has received the message, the bus becomes free for another device to become the bus master.

The bus has a few advantages over more complex networks, due to its inherent simplicity. Buses are easy to design and have a consistent interface that is independent of the number of devices on the bus. Buses represent the bare minimum in hardware required, since all devices on the bus share a single set of wires for all communication and only an additional network interface is needed to connect another device to the network. While this minimizes the complexity of the network, it also results in a constant, or O(1), bandwidth regardless of the number of devices that are attached to the network. Realistically, bus performance worsens as the number of devices increases, because total wire length must increase, leading to higher latency. This is a distinct disadvantage of bus networks, since total network communication would be expected to increase with the number of devices on the network. This is particularly true of clustered processors, due to the large amount of local communication between neighboring clusters via intercluster register writes.

Figure 4.2 Crossbar

4.2 Crossbar

Crossbar networks provide substantially more communication bandwidth than bus networks by connecting the processing nodes to a grid of crossed wires. Figure 4.2 depicts a somewhat simplified example of a crossbar network, where each cluster may drive an associated horizontal wire that connects to a series of switches. These switches in turn can connect the clusters to the vertical wires associated with the cache banks. The switches operate independently, allowing multiple transactions per cycle as long as there is no contention between the transactions. In other words, as long as each cluster wants to communicate with a different cache bank, there will be no contention. If two or more clusters need to access the same memory, however, one will be blocked until the switch reconfigures itself. For example, cluster 0 may communicate with bank 1 during the same cycle as cluster 4 communicates with bank 0, but not at the same time as cluster 7 communicates with bank 1. This is in direct contrast with a bus-based network, which is limited to a single transaction per cycle.

As a result, crossbar networks have clear advantages over the simpler bus-based networks, especially for high-performance designs with substantial communication requirements. For such systems, crossbars provide bandwidth on the order of O(n), where n is the number of connected devices (clusters or cache banks), and support any permutation of inputs to outputs. However, the advantages of crossbar networks come at the cost of increased hardware, which grows on the order of O(n^2) for n processing elements.

This study examines two variants of the crossbar network that are differentiated by how they scale beyond single-cycle latency. In the unpipelined crossbar model, the path between a source and destination remains busy for the entire latency of a communication across the network. Since this limits performance in high clock rate designs, we also study a more complex pipelined crossbar that can accept a new message for each destination on each cycle. In addition to improving the performance of communication bursts between two points, this would also improve the performance of communication bursts from a single source to multiple destinations.

4.3 Hierarchical Networks

It is anticipated that a single crossbar may be insufficient to meet the performance goals of future clustered processors. To exploit potential locality in intercluster communication, hierarchical networks have been proposed as a simple approach to provide lower-latency communication between neighboring clusters [21]. This topology augments the global, chip-wide network with smaller local networks that may achieve lower latencies than the global network by constraining their overall length. This study briefly examines a two-level hierarchy consisting of a global network and a pair of local networks that provide communication between the clusters in each half of the chip. This is illustrated in Figure 4.3, where the clusters in one half may communicate with each other over the low-latency local network, but fall back to the global network when they communicate with memory or clusters in the other half of the chip.

Figure 4.3 Hierarchical Network (global network: 2 cycles; local networks: 1 cycle)

Splitting the chip into halves is motivated by the current signal delay estimates for an Amalgam implemented in 130-nm technology. These estimates imply that a cross-chip signal will require a full two clock cycles, while a signal travelling only half-way across the chip will still fit within a single cycle.

This approach may be further extended to an arbitrary number of levels, depending on the expected communication needs of the targeted application domain. In an eight-cluster Amalgam, a designer could potentially include an additional level of hierarchy that would provide even lower-latency communication between adjacent cluster pairs beyond that already provided by the half-chip local networks. Introducing a nonuniform interconnection pattern increases the importance of designing and compiling applications around the topology to achieve maximum performance. Scheduling threads that communicate frequently onto clusters that are connected via a low-latency local network, and threads that communicate rarely onto separate halves of the chip, will yield noticeably better performance for communication-intensive applications than a scheduling algorithm that ignores locality.


CHAPTER 5

MEMORY SYSTEM

The previous chapter discussed the importance of on-chip network design due to increasing communication delays. Aggressive clock rates and technology scaling will also lead to increased cache and memory access times, which suggests that the design of the memory system is as important to the overall performance of the system as the design of the on-chip network. Conventional cache structures require multicycle access times in 130-nm technology for capacities as small as 4 kB. This situation will only worsen as memory latencies continue to increase relative to the logic in a single pipeline stage. This chapter examines Amalgam's baseline memory hierarchy, as well as the introduction of local data caches into each programmable cluster and the protocol implemented to maintain coherence among them.

5.1 Baseline Memory Hierarchy

Amalgam's baseline memory hierarchy consists of an off-chip memory supplemented by a shared on-chip data cache. The shared on-chip data cache is divided into four banks to support up to four memory references per cycle, which matches the maximum number of references that may be generated per cycle in an Amalgam processor consisting of four programmable and four reconfigurable clusters. Our studies have shown that this is not a limiting factor in an eight-programmable-cluster system, as there is little performance gain from increasing to eight cache banks, with at most a 6% performance improvement for one benchmark. However, programmers that do not carefully design their data structures and memory access patterns around the memory system may suffer from significant cache aliasing, since Amalgam uses four-way associativity. Increasing this to eight-way associativity results in a substantial performance increase for two of our benchmarks, one gaining greater than a twofold improvement and the other running nearly six times as fast. Both benchmark implementations suffer from aliasing between the different cluster stack addresses, and minor changes to function calls largely eliminate this disparity. Unless otherwise specified, the results presented do not use these modified versions.

Memory addresses are interleaved across the banks on a word-by-word basis, so bank 0 contains all words whose addresses end in 0 (mod 4), bank 1 contains the words whose addresses end in 1 (mod 4), and so on. To simplify data cache write-backs, the data cache banks are required to contain the same set of cache lines, so a cache line is never present in some of the banks and invalid in others. In addition, each programmable cluster contains a private instruction cache that interfaces directly to off-chip memory, bypassing the on-chip network and shared data cache.
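As a concrete illustration of this word interleaving, the following C fragment maps a byte address to its cache bank under the assumption of 32-bit (4-byte) words; the function name is illustrative.

#include <stdint.h>

/* Word-interleaved banking: the word index modulo 4 selects the bank,
 * so consecutive words fall in banks 0, 1, 2, 3, 0, 1, ... */
static int bank_of(uint32_t byte_addr) {
    uint32_t word_index = byte_addr >> 2;   /* assume 4-byte words */
    return (int)(word_index & 0x3);         /* word_index mod 4    */
}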

5.2 Local Data Caches

As mentioned in Section 3.1.1, designers typically introduce additional levels to the cache hierarchy in order to enhance the performance of the memory system. To explore this design option, we looked at extending the programmable cluster architecture to include a small data cache in order to weigh the performance benefits against the increased complexity. To simplify software complexity and provide better performance, this modified Amalgam implements a hardware-managed, directory-based coherence protocol in the cache controllers throughout the system, which is discussed further in Section 5.3. Cache-coherence messages travel over the on-chip network, and are therefore affected by changes in network delay and topology. To fully explore the capacity/latency tradeoff, this study examines local data cache sizes ranging from 1 kB to 64 kB, including the increased access latency of these larger caches.

Figure 5.1 Illinois Protocol State Diagram (states: Invalid, Exclusive, Shared, Modified; key: PR = processor read, PW = processor write, BR = observed bus read, BW = observed bus write, S/~S = shared/not shared)

5.3 Coherence Protocol

Amalgam implements a directory-based coherence protocol loosely based on the Illinois protocol [4]. Originally developed for bus-based multiprocessor systems, the Illinois protocol represents the status of a cache block using four states: Modified (M), Exclusive (E), Shared (S), or Invalid (I). Figure 5.1 illustrates the basic state diagram for this protocol. In this protocol, all cache lines start in the Invalid state. When a processor initially requests access to an Invalid block, the processor obtains an Exclusive copy. If the processor writes to this block, it transitions to the Modified state. If another processor requests this block, the first processor writes back the block if it has been Modified, and then both processors have Shared copies of the block.

In a traditional bus-based system, it is possible to exploit certain "snoopy" optimizations to improve the performance of this protocol. Since memory requests are visible to all processors on the bus, explicit coherence messages are not required to update the status of local copies for several state transitions. As a result, it is possible for a processor to respond to memory requests from other processors for blocks in its local cache, thus avoiding expensive memory delays. Since bus networks are not expected to scale well, Amalgam focuses more heavily on crossbar networks, which eliminate this possible optimization. While they remain viable in a bus-based Amalgam, "snoopy" optimizations are not currently implemented for those systems.

5.3.1 Implementation

To implement the aforementioned protocol, the main cache controller must be extended to maintain a central coherence directory. This directory tracks the status of all cache lines in the system, as illustrated in Figure 5.2. A directory entry for a cache block consists of the address of the first word in the cache block and an associated status vector in which each bit represents a programmable cluster. The cache controller uses this status vector to determine the current state of the associated block. If no bits are set, no local copies of the block exist, as illustrated by address 0x10f0 in the figure. If one bit is set, the programmable cluster corresponding to the bit currently has Exclusive access to that block, as illustrated by address 0x00f0. If multiple bits are set, the block is shared among the specified clusters, as illustrated by addresses 0x0a00 and 0x20a0. Since the line length for the local caches may be shorter than the lines in the main cache, the directory maintains a separate status vector for each subblock of the main cache line that may be locally cached.
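A minimal C sketch of such a directory entry and the state it implies is shown below. The field widths (eight programmable clusters, hence an 8-bit status vector) follow the example in Figure 5.2; the type and function names are illustrative assumptions.

#include <stdint.h>

typedef enum { DIR_UNCACHED, DIR_EXCLUSIVE, DIR_SHARED } dir_state_t;

typedef struct {
    uint32_t block_addr;     /* address of the first word in the block */
    uint8_t  status_vector;  /* one bit per programmable cluster       */
} dir_entry_t;

/* Derive the directory-level state from the sharer bits:
 * no bits set -> no local copies; one bit -> Exclusive owner;
 * multiple bits -> block is Shared among the flagged clusters. */
static dir_state_t dir_state(const dir_entry_t *e) {
    uint8_t v = e->status_vector;
    if (v == 0)
        return DIR_UNCACHED;
    if ((v & (v - 1)) == 0)      /* exactly one bit set */
        return DIR_EXCLUSIVE;
    return DIR_SHARED;
}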

In addition to the information stored in the main cache controller, the controllers in each programmable cluster maintain similar status information for locally cached blocks. To determine whether a cluster has Exclusive or Shared access to a locally cached block, a bit is associated with each local block that is set when the cluster has exclusive access, as illustrated by address 0x00f0. The remaining addresses, 0x0a00 and 0x20a0, are both shared between multiple clusters, and hence have this bit unset in the clusters. The cluster uses a conventional dirty bit to differentiate between Exclusive and Modified blocks.

Since network messages may not be visible to all clusters in the system, the main cache controller must inform clusters of status changes to locally cached blocks. In particular, this involves Exclusive → Shared and Valid → Invalid transitions. Local cache controllers are designed to maintain more accurate status information for local lines, since it is possible to distinguish between Exclusive and Modified blocks at this level.

Figure 5.2 Coherence Directory Example

5.3.2 Types of coherence messages

To communicate coherence information across Amalgam's on-chip network, network messages include information about the status of data being transmitted. Table 5.1 lists the coherence message types supported by Amalgam, along with the event that causes a cluster to send each type and the cluster's response to receiving each type. A few key message types are required to implement an Illinois-like protocol, including READ, WRITE, SHARED, and INVALIDATE. These messages allow a cluster to read data into its local cache, write back any modifications it has made to shared memory, transition from an exclusive to a shared copy, and evict blocks. While this set of message types is largely sufficient to implement our protocol, the inclusion of additional message types and transitions may allow for substantially improved performance.

Table 5.1 Coherence Message Types

Message Type    Cluster Initiates                              Cluster Receives
READ            Load address not in cache                      Block marked S
READ X          Store address that is I or S                   Block marked E; store completes
WRITE           Receives SHARED message for an M block         NA
WRITEBACK       Receives INVALIDATE* message for an M block    NA
ACK             Receives INVALIDATE* message for an E block    NA
FLUSH           Cluster evicts an E/S block                    NA
SHARED          NA                                             Block marked S
INVALIDATE      NA                                             Block marked I; no reply
INVALIDATE X    NA                                             Block marked I; send WRITEBACK if M, ACK if E
INVALIDATE G    NA                                             All blocks in global line marked I; send WRITEBACK if M, ACK if E

To this end, these message types are supplemented with additional types that provide further insight into a block's status and how each message is to be handled. In this scheme, the ordinary READ is supplemented by the READ X type, which is used to request exclusive access to a block, thus ensuring that the cluster may immediately write to that location once it obtains a copy. When a cluster receives a SHARED message for a block it has modified, it sends a WRITE message with the modified data, updating the main cache and retaining a local copy. This is distinguished from WRITEBACK messages, which send modified data back to the main cache for a block that is no longer locally cached. A FLUSH message informs the main cache controller that a programmable cluster has evicted an unmodified block and no longer has a copy.

Our protocol includes three different invalidate messages that are distinguished by the expected response. An INVALIDATE message is only sent to clusters that have a Shared copy of a block, and requires no reply message from the cluster. Exclusive/Modified blocks are removed using an INVALIDATE X message. If the block has been modified, the cluster returns the data with a series of WRITEBACK messages and invalidates its local copy. If the block is unmodified, the cluster may send a single ACK message indicating it no longer has a copy. When the main cache replaces a global cache line, it sends a single INVALIDATE G message to each cluster that has copies of any subblocks of that line. The cluster's cache controller is then responsible for evicting all locally cached subblocks of the specified global line.

Some of these additional message types are necessary because coherence messages are not atomic and are not visible to all the other clusters in the system. In the "snoopy" implementation of the Illinois protocol, READ X messages are not required because a cluster may simply write to a location and broadcast the result to the rest of the system. In Amalgam, this message type is necessary because a cluster must have Exclusive access to a block before writing to it. Thus, when a cluster attempts to store a result to a Shared block, it must first request Exclusive access via a READ X message and may only complete the store once it has received Exclusive access. Other message types, such as INVALIDATE G and ACK, have been included in order to reduce the number of network messages required to communicate certain status changes. If the main cache must evict a cache line, it can simply transmit a single INVALIDATE G message to each cluster that has access to any of the subblocks of that line, as opposed to sending an INVALIDATE message for each locally cached subblock for each cluster. Similarly, ACK messages reduce the number of messages required to write back an unmodified local line when a cluster receives an INVALIDATE* message.
INVALIDATE* message.<br />

CHAPTER 6

EXPERIMENTAL METHODOLOGY

This chapter describes the methodology behind the empirical phase of this thesis work. To this end, the chapter begins by describing the simulator framework used to examine clustered programmable-reconfigurable architectures, paying particular attention to the areas related to this study. Next, it presents the models used for estimating network latency over the range of process technologies of interest. This is followed by a brief discussion of area estimation for the clusters and cache memories, and the basic latency models used for these components. Finally, this chapter ends by describing the applications in Amalgam's current benchmark suite.

6.1 Simulation Framework

Microarchitectural timing simulations are performed using amalsim, a cycle-accurate simulator for clustered programmable-reconfigurable processors. Amalsim runs as a command-line shell that allows users to interact with an application as it is running by typing commands. Among other things, amalsim supports commands for setting simulator breakpoints, printing debug information, and stepping through the execution of an application. Supported commands are discussed in further detail in Appendix C.

To facilitate the exploration of a variety of design tradeoffs in clustered processor design, amalsim allows the user to define a wide range of clustered architectures through a configuration file interface, which is described in Appendix A. Users can define the number of programmable and reconfigurable clusters, the number of ALUs in each programmable cluster, and the memory hierarchy. In addition, the simulator supports a network description language that allows the modeling of virtually any conceivable network topology and latency. This language is discussed in greater detail in Section 6.1.2.1. This high level of configurability makes it possible to evaluate a wide range of design trade-offs without modifying the simulator.

For this study, several parameters are varied, including the number of clusters, the issue policy of the programmable clusters, the network topology and latency, and cache parameters. All other configuration options are kept constant. Clusters are modeled as dual-issue in-order processors with five-stage pipelines. The memory system used for these experiments has 256 kB of on-chip, shared data cache (four banks of 64 kB each). Each cluster's instruction cache is 4 kB in size. To understand how local data cache design affects overall performance and network design, clusters are simulated without local caches and with local caches varying from 4 kB/cluster to 64 kB/cluster. All of the caches studied (shared data, local data, and instruction) are four-way set-associative.

Amalsim was designed around a set of components: the programmable cluster, the reconfigurable cluster, the on-chip network, the register file, the branch predictor, the cache, and the main memory system. After parsing the configuration file, the simulator builds up a hierarchical structure representing the entire system, as illustrated in Figure 6.1. Each component of the system has a set of pointers to its immediate children in the hierarchy and a pointer to its immediate parent. This approach allows any component to communicate with any other part of the simulator, although this is carefully controlled to meet expected hardware constraints. At the beginning of each cycle, the simulator clocks the top-level structure, which recursively clocks all components beneath it in the hierarchy until all components have been stepped.

To better understand the results presented later, it is useful to have a good understanding of how some aspects of this hierarchy are modeled and how the various components communicate. To that end, the following subsections briefly touch on how different components of the system are modeled, focusing on the programmable cluster and the on-chip network.

Figure 6.1 Amalsim System Hierarchy (the top-level Amalgam component contains the memory system, shared data cache banks, on-chip network, and the programmable and reconfigurable clusters)

6.1.1 Programmable cluster modeling

As mentioned previously, Amalgam's programmable clusters are dual-issue in-order processors with five-stage pipelines that execute the MIPS-like instruction set detailed in Appendix D. Simulation of an in-order programmable cluster relies on a central scoreboard and two register files: architectural and temporary. The architectural register file contains the correct architectural state of the processor and is only written during the final writeback stage of the pipeline or when processing incoming network messages. The contents of the temporary register file reflect more transient values and serve as the basis for pipeline bypassing. When instructions issue, their source operands are read from the temporary register file, which holds values that have been generated in the execution stage but have not yet been written back to the architectural register file. As opposed to maintaining a simple valid bit, the central scoreboard maintains valid times, indicating when a source value will be available via bypassing. Out-of-order execution is handled in a similar manner, but relies on register renaming, a larger physical register file, and a reorder buffer. When instructions complete, they write their results back into the architectural register file. We maintain two separate register files in order to differentiate between in-flight data that may potentially be squashed by a mispredict and architecturally correct data. In the event of an exception, data from the architectural register file may be used to correctly resume computation.

Branch prediction is kept relatively simple for both in-order and out-of-order programmable clusters, limiting the clock rate and area impact of prediction structures. For all in-order cores modelled, the simulator uses a simple not-taken prediction scheme. Out-of-order cores require more advanced techniques in order to obtain reasonable performance benefit from out-of-order execution, so a table of 2-bit saturating counters is used, combined with a small branch target buffer.
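For reference, a 2-bit saturating counter predictor of the kind mentioned above can be modeled in a few lines of C; the table size and indexing here are illustrative assumptions, not the simulator's actual parameters.

#include <stdint.h>
#include <stdbool.h>

#define PHT_ENTRIES 1024                 /* assumed table size */
static uint8_t pht[PHT_ENTRIES];         /* counters in the range 0..3 */

/* Predict taken when the counter is in one of the two "taken" states. */
static bool predict(uint32_t pc) {
    return pht[(pc >> 2) % PHT_ENTRIES] >= 2;
}

/* Update saturates at 0 and 3 so a single mispredict does not
 * immediately flip a strongly biased branch. */
static void update(uint32_t pc, bool taken) {
    uint8_t *c = &pht[(pc >> 2) % PHT_ENTRIES];
    if (taken && *c < 3) (*c)++;
    else if (!taken && *c > 0) (*c)--;
}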

6.1.2 Network modeling

On-chip networks are described in a textual format developed for Amalgam known as the Network Description Language, or NDL. This format describes the network as a set of nodes, where each node is defined by its input and output ports, and the transfer of messages between these input and output ports is determined by the node's mode. Arbitrarily complex networks may be constructed by tying the output ports of nodes together with the input ports of other nodes. Clusters and cache banks are connected to the network by associating the input or output ports of a node with the desired component. Once these connections have been defined, a component may send a message via the network by placing the message on the input port of the node associated with it, where it is promptly transferred between nodes in the network until it arrives at a node that connects to the input port of the destination. The following subsections detail how this simulation is actually performed, as well as describing the features of the NDL.

6.1.2.1 Network description language

The NDL is intended to efficiently describe the topology of the Amalgam on-chip network. The simulator was initially designed to model a specific network topology with a hard-coded latency. While simple to model, this approach only supported systems that met the constraints of the predefined model and required substantial effort and recompilation of the simulator in order to study additional topologies. To deal with these limitations, a more generic network simulation infrastructure was developed that relies on the NDL to express the network topology in a textual format that is parsed by the simulator at run-time. This section briefly discusses the key features of the NDL, describes how it is simulated, and provides a few brief examples.

node_cfg {
    idx 0;
    latency 2;
    mode XBAR;
    in P 0, P 1;
    out N 1, P 0, P 1;
}

node_cfg {
    idx 1;
    latency 3;
    mode BUS;
    in N 0;
    out B 0, B 1;
}

Figure 6.2 NDL Example

The node_cfg block serves as the basic element of a network description. Each node_cfg block contains a node index, latency, mode, and input and output connections. The node index uniquely identifies a given node, and is referenced when specifying internode connections. The latency is simply the time required to traverse the node in clock cycles. Latency modeling varies depending on the mode specification, which currently includes a bus, a crossbar, and a pipelined crossbar. Node inputs and outputs are specified as a comma-delimited list of supported unit types, including generic clusters (C), programmable clusters (P), reconfigurable clusters (R), cache banks (B), and other network nodes (N). A unit type combined with the appropriate index is used to designate a particular functional element connected to the network. In addition, the NDL grammar supports a variety of additional constructs such as for loops, if statements, and expressions that may be used to efficiently describe more complex interconnection networks. Appendix B provides a more in-depth examination of the language, including the complete formal specification.

Figure 6.2 illustrates the simulator's interpretation of the short example NDL code on the left. The first node_cfg block describes network node 0, which is depicted by the upper circle in the figure. This node may receive messages from programmable clusters 0 and 1, represented by the upper rectangles, and may transfer messages to programmable clusters 0 and 1 as well as network node 1, represented by the lower circle. The second node_cfg block describes network node 1, which may receive messages from node 0 and transfer them to cache banks 0 or 1, represented by the lower rectangles.

6.1.2.2 Network simulation

After parsing the network description, the simulator generates an internal representation of the described interconnect that facilitates simulation of arbitrary topologies. This internal representation relies upon individual network elements, or nodes, as the core building block for any desired topology. As specified in the node_cfg block, a network node has an associated latency, a mode of operation, and a collection of input and output ports. These ports are connected to other network nodes, clusters, or cache banks. This port information is later used to develop the routing information for the network, which includes the shortest paths from each network input to all reachable outputs.

After the complete internal representation has been created to reflect all nodes in the network description, the simulator develops a routing table for each node consisting of a list of final destinations that may be reached from this node, the cost of the shortest known route to each final destination, and the next hop for that route. This table is constructed using Dijkstra's forward search as described in [25]. In this approach, the simulator fills in directly connected neighbors as it initially builds up the network. Using this information, the simulator iteratively develops a complete routing table for every node that includes all possible destinations for this node along with the optimal route to each. At present, the cost function relies solely on the latency for all routes examined, but may be easily extended if additional routing constraints are desired.
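The routing-table construction can be sketched as a standard shortest-path computation. The C fragment below builds, for one source node, the minimum-latency cost and next hop to every reachable node from a latency-weighted adjacency matrix; it is a generic Dijkstra sketch under assumed data structures, not amalsim's actual code.

#include <limits.h>
#include <stdbool.h>

#define MAX_NODES 16
#define INF       INT_MAX

/* lat[i][j] > 0 gives the latency of a direct link i -> j; 0 means no link. */
void build_routes(int n, const int lat[MAX_NODES][MAX_NODES], int src,
                  int cost[MAX_NODES], int next_hop[MAX_NODES]) {
    bool done[MAX_NODES] = { false };
    for (int i = 0; i < n; i++) { cost[i] = INF; next_hop[i] = -1; }
    cost[src] = 0;

    for (int iter = 0; iter < n; iter++) {
        /* Pick the cheapest node whose route is not yet finalized. */
        int u = -1;
        for (int i = 0; i < n; i++)
            if (!done[i] && cost[i] != INF && (u == -1 || cost[i] < cost[u]))
                u = i;
        if (u == -1) break;
        done[u] = true;

        /* Relax all links out of u; remember the first hop taken from src. */
        for (int v = 0; v < n; v++) {
            if (lat[u][v] > 0 && cost[u] + lat[u][v] < cost[v]) {
                cost[v]     = cost[u] + lat[u][v];
                next_hop[v] = (u == src) ? v : next_hop[u];
            }
        }
    }
}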

Actual network simulation occurs in two distinct phases. At the beginning of every cycle, the simulator iterates over all of the nodes, attempting to route all available operations from each node's input ports to its output ports. Once this has been completed, the simulator proceeds to re-evaluate all of the nodes, attempting to transfer any operations residing on the output ports of a node to the input ports of subsequent units. A simple round-robin scheme maintains fairness by rotating routing priority among the various inputs.
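A minimal sketch of such round-robin arbitration is shown below in C: on each cycle the search for a ready input starts one position past the last winner, so priority rotates fairly. The port count and the ready-bitmask interface are illustrative assumptions.

#define NUM_INPUTS 8   /* assumed number of input ports on a node */

/* Pick the next ready input to service, given a bitmask of ports with
 * waiting operations; priority rotates one past the last granted port. */
int rr_arbitrate(unsigned ready_mask, int *last_granted) {
    for (int k = 1; k <= NUM_INPUTS; k++) {
        int i = (*last_granted + k) % NUM_INPUTS;
        if (ready_mask & (1u << i)) {
            *last_granted = i;   /* this port has lowest priority next time */
            return i;
        }
    }
    return -1;                   /* no input has work this cycle */
}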

While this succinctly describes the modeling for most network nodes, the simulator handles pipelined crossbar nodes in a slightly different manner. As discussed in Chapter 4, pipelined crossbars allow additional messages to be sent every cycle, dividing the total latency across multiple pipeline stages. To describe this in a concise manner, the NDL includes support for the pipelined crossbar mode. For nodes operating in this mode, the simulator constructs a series of n subnodes, where n equals the overall latency of the node. The simulator associates two indices with these subnodes: a major node index that matches the one specified in the node_cfg block, and a minor node index that designates the subnode within the series. The first subnode uses the input connections specified for the overall node, while the last subnode uses the output connections. Internode connections are automatically created between the subnodes and contain buffer space equal to the number of final output connections. A single copy of the routing table is maintained for the overall node being modeled, since routing between subnodes is trivial.

6.2 Network Latency Modeling

To estimate network latencies for future technologies, approximate wire delays for global on-chip wires are calculated using the future technology parameters projected in the 2001 International Technology Roadmap for Semiconductors [26]. Values from this roadmap relevant to the calculations presented here are listed in Table 6.1. The wire delays estimated in the manner presented here serve as the approximate latency for the on-chip network. For the purposes of this study, two different metrics are used to estimate the wire length of the on-chip network: the scaled Amalgam core, and the SIA estimate for a full chip.

Table 6.1 Projected Fabrication Technology Parameters

Gate Length (nm)   Dielectric Constant κ   Metal ρ (µΩ-cm)   Wire Width (nm)   Aspect Ratio   Clock Rate (GHz)
130                3.0                     2.2               335               2.0            1.684
90                 2.6                     2.2               230               2.1            3.99
65                 2.3                     2.2               145               2.2            6.739
45                 2.1                     2.2               102.5             2.3            11.511
32                 1.9                     2.2               70                2.4            19.348
22                 1.8                     2.2               50                2.5            28.751

Area estimates for a single programmable cluster serve as the basis for the estimate of a complete Amalgam core. These estimates are discussed in more detail in Section 6.3. Using the side of a square programmable cluster as our basis, we estimate the longest global wire length as the edge-to-edge distance for eight programmable clusters arranged in a single row. This wire length is then used to estimate wire delay in Amalgam's on-chip network. This wire length is also compared against a similar estimate using four cache banks to verify that the programmable clusters dominate the chip area.

To predict how networks will scale for full-size chips, networks traversing the side of a 140-mm² chip are also studied. The 140-mm² chip is the expected chip size for mass-produced processors across all process generations [26]. While Amalgam clearly does not consume the same area regardless of process technology, designers will undoubtedly find uses for the additional transistors, resulting in a relatively consistent chip area across future generations. Since designers will be facing the wire delays present in full-size chips, the impact of these delays on network topologies is equally important.

Since the delay of a wire is directly proportional to the product of its resistance and capacitance, models are employed for these parameters across all of the technology generations studied. Wire resistance per unit length (mΩ/µm) is obtained using the equation R = ρ / (W · T), where wire resistivity is denoted by ρ, wire width by W, and wire thickness by T. Computing capacitance per unit length (fF/µm) is more complex. Top-level global wires are modeled as a set of parallel lines on one plate using the equations presented in [27] (reproduced here as Equations (6.1) and (6.2)). Using this model, capacitance is decomposed into two components: the coupling capacitance between neighboring wires and the capacitance between the wire and the underlying plane. The flux to an adjacent wire, C_couple, is calculated using Equation (6.1).

C_{couple} = \epsilon_{ox}\left[1.144\,\frac{T}{S}\left(\frac{H}{H+2.059S}\right)^{0.0944} + 0.7428\left(\frac{W}{W+1.592S}\right)^{1.144} + 1.158\left(\frac{W}{W+1.874S}\right)^{0.1612}\left(\frac{H}{H+0.9801S}\right)^{1.179}\right]    (6.1)

The area and fringe flux to the underlying plane, C_af, is calculated using Equation (6.2).

C_{af} = \epsilon_{ox}\left[\frac{W}{H} + 2.217\left(\frac{S}{S+0.702H}\right)^{3.193} + 1.171\left(\frac{S}{S+1.510H}\right)^{0.7642}\left(\frac{T}{T+4.532H}\right)^{0.1204}\right]    (6.2)

In the preceding equations, wire width is denoted by W, wire thickness by T, interwire spacing by S, and dielectric thickness by H. Wire spacing is assumed to be equal to wire width, which is estimated to be half the projected pitch for global wires. Using the above relations, total capacitance is calculated as C_total = C_af + 2 C_couple.
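The resistance and capacitance model above translates directly into code. The C sketch below evaluates R = ρ/(W·T) and Equations (6.1) and (6.2) for one technology point; the unit handling and the example dielectric thickness H (which Table 6.1 does not list) are assumptions on my part and should be checked against the thesis's actual calculations.

#include <math.h>
#include <stdio.h>

#define EPS0_FF_PER_UM 0.008854   /* vacuum permittivity in fF/um */

/* Wire resistance per unit length, R = rho / (W * T).
 * rho in ohm*m, w and t in m; result in ohm/m. */
static double r_per_length(double rho, double w, double t) {
    return rho / (w * t);
}

/* Coupling capacitance per unit length, Equation (6.1); lengths in um,
 * kappa is the relative dielectric constant, result in fF/um. */
static double c_couple(double kappa, double w, double t, double s, double h) {
    double eox = kappa * EPS0_FF_PER_UM;
    return eox * (1.144 * (t / s) * pow(h / (h + 2.059 * s), 0.0944)
                + 0.7428 * pow(w / (w + 1.592 * s), 1.144)
                + 1.158  * pow(w / (w + 1.874 * s), 0.1612)
                         * pow(h / (h + 0.9801 * s), 1.179));
}

/* Area plus fringe capacitance to the underlying plane, Equation (6.2). */
static double c_af(double kappa, double w, double t, double s, double h) {
    double eox = kappa * EPS0_FF_PER_UM;
    return eox * (w / h + 2.217 * pow(s / (s + 0.702 * h), 3.193)
                        + 1.171 * pow(s / (s + 1.510 * h), 0.7642)
                                * pow(t / (t + 4.532 * h), 0.1204));
}

int main(void) {
    /* 130-nm node from Table 6.1: rho = 2.2 uOhm-cm, W = S = 335 nm,
     * aspect ratio 2.0 so T = 670 nm.  H is a placeholder value. */
    double w = 0.335, s = 0.335, t = 0.670, h = 0.335;      /* um */
    double r = r_per_length(2.2e-8, 335e-9, 670e-9) * 1e-3; /* ohm/m -> mOhm/um */
    double c = c_af(3.0, w, t, s, h) + 2.0 * c_couple(3.0, w, t, s, h);
    printf("R_wire = %.0f mOhm/um, C_wire = %.3f fF/um\n", r, c);
    return 0;
}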

Table 6.2 lists the projected wire parameters from 130-nm to 22-nm technologies. The estimated resistance per unit length (R_wire) and capacitance per unit length (C_wire) are shown for top-level metal layers, as on-chip networks are assumed to be implemented in high-level metal. R_wire increases dramatically across the technology generations. This occurs despite projected increases in wire aspect ratios that attempt to reduce the increase in R_wire at the expense of an increased coupling capacitance (C_couple). The increase in C_couple is mitigated by material improvements that reduce the dielectric constants of the insulators between wires. Despite the advances in fabrication materials, the intrinsic delay of a wire continues to increase with each process generation. This is consistent with the results of other studies [1, 2, 28].

Table 6.2 Calculated Technology Parameters and Projected Wire Delays

Gate Length (nm)   R_wire (mΩ/µm)   C_wire (fF/µm)   140-mm² Delay (cycles)
130                98               0.737            1.49
90                 198              0.690            2.15
65                 475              0.567            3.9
45                 910              0.517            5.9
32                 1870             0.462            8.5
22                 3520             0.440            13.8

The estimated values for R_wire and C_wire form the core of the wire delay model used in this study. The basic delay model for a wire of length L is D_wire = 0.38 R_wire C_wire L². To avoid this quadratic dependence of delay on wire length, repeaters are inserted periodically along a wire. For each process, the optimal repeater size and number is determined in order to minimize overall wire delay. Assuming optimal repeater placement, this reduces the delay's dependence on wire length from quadratic to linear. Overall wire delay is then determined using Equation (6.3).

D_{wire} = 0.38\,R_{wire}C_{wire}\frac{L^2}{M} + (M-1)\,t_{repeater}    (6.3)

The number of segments is denoted by M and the delay of a repeater by t_repeater. This delay is then multiplied by the projected SIA clock frequency to determine the number of cycles required for a signal to traverse the wire in that technology. This wire delay estimate is used for the network latency in the technology being examined.
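As a quick check of Equation (6.3), the following C fragment converts a repeatered wire's delay into clock cycles. The wire length, segment count, and repeater delay are placeholders chosen for illustration; the thesis's actual optimal-repeater sizing is not reproduced here.

#include <stdio.h>

/* Equation (6.3): delay of a wire of length len_um (um) split into m
 * repeated segments.  r and c are per-unit-length values (ohm/um and
 * F/um), and t_rep is the delay of one repeater in seconds. */
static double wire_delay(double r, double c, double len_um, int m, double t_rep) {
    return 0.38 * r * c * (len_um * len_um) / m + (m - 1) * t_rep;
}

int main(void) {
    /* Illustrative 130-nm numbers: R_wire = 98 mOhm/um, C_wire = 0.737 fF/um,
     * a 12-mm cross-chip wire, 8 segments, and a guessed 20-ps repeater. */
    double d = wire_delay(98e-3, 0.737e-15, 12000.0, 8, 20e-12);
    double cycles = d * 1.684e9;   /* multiply by the projected clock rate */
    printf("delay = %.2e s = %.2f cycles at 1.684 GHz\n", d, cycles);
    return 0;
}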

6.3 Area Estimation

To examine the effect of scaling the Amalgam core for each process technology, information provided in [29] is used to estimate the size of a programmable cluster in each of the fabrication technologies and cluster configurations we studied. For the purpose of these estimates, each programmable cluster includes two 32-bit arithmetic-logic units (ALUs), two 32-bit integer multiplier units, a 32-entry, 32-bit integer register file, a 4-kB instruction cache, one 64-bit floating point unit (FPU), a 16-entry, 64-bit floating point register file, and the appropriate amount of data cache. While Amalgam does not currently support floating point execution, a high-performance clustered processor design would realistically include hardware support for floating point data. Cacti [30] is used to generate area estimates for the shared data cache to verify our assumption that the size of the clusters determined the length of the on-chip network.

Table 6.3 Projected Cache Memory Latencies (all latencies in cycles)

Gate Length (nm)   256 kB Main   1 kB   2 kB   4 kB   8 kB   16 kB   32 kB   64 kB
180                1.2           0.85   0.87   0.91   0.96   0.99    1.1     1.2
130                2.0           1.4    1.4    1.5    1.6    1.6     1.7     1.9
90                 3.3           2.2    2.3    2.4    2.5    2.6     2.9     3.2
65                 4.0           2.8    2.8    3.0    3.0    3.2     3.5     4.0
45                 4.8           3.3    3.3    3.5    3.6    3.8     4.2     5.0
32                 5.7           3.9    4.0    4.2    4.3    4.5     5.0     6.0
22                 5.8           4.0    4.1    4.3    4.4    4.6     5.0     6.1

6.4 Latency Models

In addition to the latency model used for the on-chip network, this study uses Cacti [30] to accurately model the latencies of all on-chip memories, including the main shared data cache, instruction caches, and local data caches. Table 6.3 presents these latencies for all technologies and cache sizes examined.

Rough estimates have also been made to account for increasing pipeline depth and the performance disparity between logic and DRAM processes. To obtain the clock frequencies predicted in the SIA roadmap, designers must aggressively pipeline all aspects of the datapath. To model this, we have calculated the total number of FO4 delays for our pipeline using a simple five-stage pipeline implemented in 180 nm. This FO4 delay forms the basis for estimating the pipeline depth required to meet SIA clock rate goals in each process, given the improvements in transistor performance from moving to each process. Since designers spend significant effort on ALU performance, the bypass latency for operations is expected to increase at a slower but similar rate.

Table 6.4 Projected System Latencies

Gate Length (nm)   Pipeline Depth   Memory (cycles)   Start Delay (cycles)   Word Delay (cycles)
180                5                156               52                     26
130                9                300               100                    50
90                 14               582               194                    97
65                 17               798               266                    133
45                 20               1110              370                    185
32                 24               1524              508                    254
22                 25               1848              616                    308

The baseline memory latency is based on the CAS latency of PC133 SDRAM. The time to communicate with the memory system is approximated as the CAS latency plus 50% overhead. Using this latency, the time to load a line of data into the cache is estimated as two transits across the memory bus to send the address, followed by one transit per word in the line, with a line length of four words. This cache line latency serves as the memory latency for the 180-nm technology node. To predict the memory latency for future technologies, this latency is multiplied by an expected performance increase of roughly 7% per year. Using this information and the timeline presented in the SIA roadmap, memory latencies may be approximated for each major technology node. The time to load a cache line based on these memory latencies, along with pipeline depths, is depicted in Table 6.4.
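To connect the description above with Table 6.4, the 180-nm row is consistent with a simple start-plus-words model: the start delay is two bus transits and the line fill adds one transit per word. The C fragment below reproduces that arithmetic; treating the 26-cycle figure as the per-transit cost is an inference from the table, not a value stated in the text.

#include <stdio.h>

int main(void) {
    int word_delay  = 26;                 /* one bus transit, 180-nm row of Table 6.4 */
    int words       = 4;                  /* cache line length in words               */
    int start_delay = 2 * word_delay;     /* two transits to send the address -> 52   */
    int line_fill   = start_delay + words * word_delay;   /* 52 + 4*26 = 156 cycles   */
    printf("start = %d, line fill = %d cycles\n", start_delay, line_fill);
    return 0;
}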

The past several sections presented the major aspects of our experimental methodology, including our overall simulation framework and the technological basis for our latency estimates. To complete this discussion, it is necessary to describe the benchmarking tools used to evaluate our architecture. The following section presents the set of applications that compose the Amalgam benchmark suite used in our studies.


6.5 Benchmarks

The Amalgam benchmark set includes seven applications selected from benchmarks previously used to evaluate multiprocessor systems or reconfigurable architectures: Image dithering, DNA sequence matching, GNU Radio FIR, Mergesort, MPEG encoding, Rijndael encryption, and the Traveling Salesman problem. To map these applications to Amalgam, a programmer writes hand-parallelized C code that explicitly specifies the execution flow for each programmable cluster in the system, which is then compiled down to an Amalgam binary using our compilation tool flow [31]. Hardware barriers are specified by the programmer using an Amalgam-specific pragma. Intercluster register writes are handled using variable names of the form cx_iy, where x specifies the destination cluster index and y specifies the destination register. The next seven sections summarize the benchmarks.
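As an illustration of this convention, a producer thread might forward a value to another cluster simply by assigning to one of these special variables. The snippet below is a hypothetical sketch: the exact identifier spelling (assumed here to be cx_iy, e.g. c1_i5 for cluster 1, register 5) and the declaration style are assumptions, since the full programming interface is documented elsewhere in the thesis.

/* Hypothetical producer running on cluster 0: forward a partial sum to
 * cluster 1's register 5 via an intercluster register write.  The consumer
 * on cluster 1 would execute an EMPTY on register 5 before reading it. */
extern volatile int c1_i5;      /* assumed compiler-recognized name: cluster 1, reg 5 */

void producer(const int *data, int n) {
    int sum = 0;
    for (int i = 0; i < n; i++)
        sum += data[i];
    c1_i5 = sum;                /* becomes an intercluster register write packet */
}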

6.5.1 Image dithering

Image dithering maps an N × N pixel image to a reduced color palette. The version implemented for Amalgam uses Floyd-Steinberg error diffusion [32] to convert a 128 × 128 pixel image from an 8-bit red-green-blue (RGB) palette to a six-color RGB palette, resulting in a reduction from 256³ colors to 6³. This is performed by dividing the original color by 51 (255/5) and propagating the remainder as error to some of the adjacent pixels.
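For reference, the per-pixel operation can be sketched as below in C. The error weights (7/16, 3/16, 5/16, 1/16) are the standard Floyd-Steinberg coefficients, and the single-channel layout is an illustrative simplification of the benchmark.

/* Quantize one channel of a row-major image to six levels (0..5) and
 * diffuse the quantization error to the standard Floyd-Steinberg neighbors. */
void dither_channel(int *img, int w, int h) {
    for (int y = 0; y < h; y++) {
        for (int x = 0; x < w; x++) {
            int old = img[y * w + x];
            int q   = old / 51;          /* 51 = 255/5: map 0..255 to 0..5 */
            int err = old - q * 51;      /* remainder propagated as error  */
            img[y * w + x] = q;
            if (x + 1 < w)              img[y * w + x + 1]       += err * 7 / 16;
            if (y + 1 < h && x > 0)     img[(y + 1) * w + x - 1] += err * 3 / 16;
            if (y + 1 < h)              img[(y + 1) * w + x]     += err * 5 / 16;
            if (y + 1 < h && x + 1 < w) img[(y + 1) * w + x + 1] += err * 1 / 16;
        }
    }
}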

6.5.2 DNA sequence matching

DNA sequence matching computes the relative similarity between a 128-base input DNA sequence and a target 128-base sequence in a genetic database. When comparing genetic sequences [33], the edit distance has been shown to be a convenient way to quantify the similarity of two sequences. This edit distance is defined as the minimum cost of converting one sequence to another through character deletions, character insertions, and the substitution of one character for another. To compute the edit distance between two strings, a well-known dynamic programming algorithm is employed [34], which develops an m × n table of distances using Equation (6.4), where m and n are the lengths of the source and target sequences, respectively. Each element in the table is calculated by determining the set of minimum-cost operations required to transform the source sequence into the target sequence up to that point in each sequence. In other words, element d_{i,j} represents the edit distance between the first i characters of the source sequence and the first j characters of the target sequence. The total edit distance is simply d_{m,n} in this table and requires O(mn) time to compute using a straightforward sequential implementation. However, there is a certain amount of parallelism present in the recurrence for d_{i,j}, since each element only depends on adjacent distances (d_{i-1,j}, d_{i,j-1}, and d_{i-1,j-1}), which we exploit for improved performance.

\begin{aligned}
d_{0,0} &= 0 \\
d_{i,0} &= d_{i-1,0} + \text{deletion cost} \\
d_{0,j} &= d_{0,j-1} + \text{insertion cost} \\
d_{i,j} &= \min\begin{cases}
    d_{i-1,j} + \text{deletion cost} \\
    d_{i,j-1} + \text{insertion cost} \\
    d_{i-1,j-1} + \text{substitution cost} \\
    d_{i-1,j-1} \quad (\text{source}_i = \text{target}_j)
\end{cases}
\end{aligned}    (6.4)
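A sequential C implementation of the recurrence in Equation (6.4) is sketched below; unit costs for insertion, deletion, and substitution are assumed for illustration, and the full m × n table is kept to stay close to the equation. Because each entry depends only on its left, upper, and upper-left neighbors, entries along an anti-diagonal of the table are independent and can be computed concurrently; whether the Amalgam mapping uses exactly this wavefront scheme is not specified here.

#define MAXLEN 128   /* benchmark sequences are 128 bases long */

/* Edit distance between src (length m) and tgt (length n), Equation (6.4),
 * with unit insertion/deletion/substitution costs assumed. */
int edit_distance(const char *src, int m, const char *tgt, int n) {
    static int d[MAXLEN + 1][MAXLEN + 1];

    d[0][0] = 0;
    for (int i = 1; i <= m; i++) d[i][0] = d[i - 1][0] + 1;   /* deletions  */
    for (int j = 1; j <= n; j++) d[0][j] = d[0][j - 1] + 1;   /* insertions */

    for (int i = 1; i <= m; i++) {
        for (int j = 1; j <= n; j++) {
            int del = d[i - 1][j] + 1;
            int ins = d[i][j - 1] + 1;
            int sub = d[i - 1][j - 1] + (src[i - 1] == tgt[j - 1] ? 0 : 1);
            int best = del < ins ? del : ins;
            d[i][j] = sub < best ? sub : best;
        }
    }
    return d[m][n];
}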

6.5.3 GNU radio FIR

FIR is an implementation of an intermediate-frequency filter used in an FM radio example in GNU Radio [35], employing a 50-tap, causal FIR filter. It is parallelized in a pipelined manner, where each cluster does a fraction of the total calculations and then sends the results to the next cluster to process.

48
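The underlying computation is a standard FIR convolution, sketched below in serial form for a single output sample; the floating-point data type and function name are illustrative, and the pipelined split across clusters is not shown.

#define NTAPS 50

/* One output sample of a 50-tap FIR filter: y[n] = sum_k taps[k] * x[n-k].
 * history[k] holds the k-th most recent input sample; names are illustrative. */
float fir_sample(const float taps[NTAPS], const float history[NTAPS])
{
    float acc = 0.0f;
    for (int k = 0; k < NTAPS; k++)
        acc += taps[k] * history[k];
    return acc;
}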


6.5.4 Mergesort

Mergesort implements a recursive mergesort algorithm to sort a list of 1000 integers. Similar to Quicksort, the Mergesort algorithm is based on a divide-and-conquer strategy wherein the list is divided into two halves, each half is sorted independently, and the two sorted halves are merged into a sorted sequence. Amalgam exploits the parallelism of this algorithm by passing the original divisions off to other programmable clusters and then performing the final merge on the originating cluster. While the sublists may be operated on entirely in parallel, the sequential nature of the final merge limits the potential benefits of parallel execution.
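The merge that closes each level of the recursion is the step that remains sequential; a minimal sketch is shown below, with the buffer handling and names chosen for illustration rather than taken from the benchmark.

/* Merge two sorted runs a[0..na) and b[0..nb) into out[]. On Amalgam the
 * two runs would have been sorted on separate clusters before this merge
 * runs on the originating cluster.                                        */
void merge(const int *a, int na, const int *b, int nb, int *out)
{
    int i = 0, j = 0, k = 0;
    while (i < na && j < nb)
        out[k++] = (a[i] <= b[j]) ? a[i++] : b[j++];
    while (i < na)
        out[k++] = a[i++];
    while (j < nb)
        out[k++] = b[j++];
}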

6.5.5 MPEG encoding

MPEG is an implementation of an MPEG1 I-frame encoder [36] that takes a series of 256 × 256, 24-bit RGB-formatted BMP files corresponding to a series of video frames and generates an intraframe-only, MPEG1-compliant file. Each BMP frame is opened sequentially and organized into 16 × 16 pixel blocks, which are converted into the 4:2:0 YUV color space. The resultant data is organized into macroblocks consisting of six 8 × 8 blocks, referred to as Y0, Y1, U, V, Y2, and Y3. An 8 × 8 forward DCT (FDCT) is performed on each block, and the resulting coefficients are quantized to zero out many of the high-frequency DCT coefficients. At this point, the data is organized into slices that consist of rows containing 16 macroblocks. The 8 × 8 blocks within a slice are encoded into a bitstream using a form of Huffman coding, all slices are merged to form an encoded frame, and the next frame is processed until all frames have been completed.

The color space conversion is parallelized at the 16 × 16 pixel block level. Likewise, the DCT and quantization are parallelized at the macroblock level. The Huffman coding stage, however, is parallelized at the slice level because slices are the smallest objects in the MPEG bitstream that are byte aligned. Attempting to parallelize the Huffman coding stage at the macroblock level would require substantial shift and mask operations, which would hinder performance.
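The per-frame flow and the granularity at which each stage is distributed across clusters can be summarized as in the outline below; all of the type and function names are placeholders, and the encoder's actual data structures are not shown.

/* Outline of the per-frame encoding flow; names are placeholders.
 * Comments note the granularity at which each stage is parallelized. */
struct frame;        /* opaque per-frame pixel/coefficient storage */
struct bitstream;    /* opaque output MPEG1 bitstream              */

void rgb_to_yuv420(struct frame *f);                          /* per 16x16 pixel block           */
void fdct_and_quantize(struct frame *f);                      /* per macroblock (six 8x8 blocks) */
void huffman_encode(struct frame *f, struct bitstream *bs);   /* per slice (16 macroblocks)      */

void encode_frame(struct frame *f, struct bitstream *bs)
{
    rgb_to_yuv420(f);       /* color conversion                         */
    fdct_and_quantize(f);   /* forward DCT and coefficient quantization */
    huffman_encode(f, bs);  /* entropy coding of byte-aligned slices    */
}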


6.5.6 Rijndael encryption

The current Advanced Encryption Standard (AES), Rijndael, is an iterated block cipher that supports variable block and key lengths [37]. The number of iterations depends on the current block and key sizes. Each iteration performs a round composed of four distinct transformations. These transformations treat each data word as a row in an array, with each byte being an independent element. The first transformation, byte substitution, is a nonlinear byte substitution in which each byte is used to index into a look-up table referred to as the S-box; the byte in the S-box at the indexed location replaces the current byte. The second transformation, shift row, performs a cyclic shift of the bytes in each row, with each row using a different offset. The third transformation, mix column, treats the columns as polynomials over the Galois field GF(2⁸) and multiplies them modulo x⁴ + 1 by a fixed polynomial c(x) = ‘03’x³ + ‘01’x² + ‘01’x + ‘02’. During the final round, this step is skipped. The fourth transformation, round key addition, applies a round key to the output of the mix column operation using a simple bitwise XOR. The round key for each round is derived from the cipher key using a key schedule and is typically generated before executing the cipher, as it is independent of the block to encrypt.

The current Amalgam implementation uses a 128-bit block size, requiring 10 iterations of the round transformation over a sequence of 512 blocks.
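The round sequence for the 128-bit configuration can be sketched structurally as below. The transformation bodies, S-box, and key schedule are omitted, the names are placeholders, and the initial whitening key addition before the first round follows the standard AES specification rather than the description above.

/* Structural sketch of the ten-round Rijndael/AES-128 encryption of one
 * 128-bit block; transformation bodies and the key schedule are omitted. */
typedef unsigned char state_t[4][4];

void byte_substitution(state_t s);                            /* S-box lookup per byte        */
void shift_row(state_t s);                                    /* cyclic shift, per-row offset */
void mix_column(state_t s);                                   /* GF(2^8) column mixing        */
void add_round_key(state_t s, const unsigned char rk[16]);    /* bitwise XOR with round key   */

void encrypt_block(state_t s, const unsigned char round_keys[11][16])
{
    add_round_key(s, round_keys[0]);          /* initial whitening (standard AES)  */
    for (int round = 1; round <= 10; round++) {
        byte_substitution(s);
        shift_row(s);
        if (round != 10)                      /* mix column skipped in final round */
            mix_column(s);
        add_round_key(s, round_keys[round]);
    }
}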

6.5.7 Traveling salesman

The traveling salesman problem involves finding the tour of a list of cities that minimizes the total distance traveled. A tour starts at a city, visits each other city exactly once, and returns to the first city. The Branch-and-Cut [38] algorithm has been implemented for Amalgam and assumes a symmetric, Euclidean traveling salesman problem.

The Branch-and-Cut algorithm divides the problem into n search trees, with n being the number of points in the dataset. A recursive depth-first search is run on each of these trees, seeking to identify a valid, minimum-distance tour of all cities. At each recursion, the distance traveled so far is compared to the current shortest tour; the search aborts this path if it is already longer and continues down it otherwise. Aborting a long subpath at this point improves efficiency by eliminating subtrees that cannot produce a shorter tour than the current one. When a search reaches a leaf node, it has successfully found a shorter tour and updates the current record. This implementation is parallelized by dividing up the lowest-level search trees among the programmable clusters.
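The pruned search at the heart of each tree can be sketched as below; the distance table, the bitmask bookkeeping, and the shared best-tour variable are illustrative, and the cut-generation machinery of a full Branch-and-Cut solver is not shown.

#define NCITIES 16                       /* illustrative problem size */

double dist[NCITIES][NCITIES];           /* symmetric Euclidean distances            */
double best_len = 1e30;                  /* length of current shortest complete tour */

/* Depth-first tour search with pruning against the current best tour.
 * Initial call: search(0, 1, 1u << 0, 0.0);                            */
void search(int city, int visited_count, unsigned visited_mask, double len)
{
    if (len >= best_len)                 /* prune: this subtree cannot beat the record */
        return;
    if (visited_count == NCITIES) {      /* leaf: close the tour back to the start     */
        double total = len + dist[city][0];
        if (total < best_len)
            best_len = total;            /* update the record                          */
        return;
    }
    for (int next = 1; next < NCITIES; next++)
        if (!(visited_mask & (1u << next)))
            search(next, visited_count + 1, visited_mask | (1u << next),
                   len + dist[city][next]);
}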

This chapter described the experimental methodology used in this study. To thoroughly examine the scalability of clustered architectures and the associated performance tradeoffs, we have developed a flexible simulation infrastructure and estimated the critical system latencies required for this work. The applications used to evaluate the system were also discussed. The following chapter presents the results of our study.


CHAPTER 7

EXPERIMENTAL RESULTS

This chapter presents the results from the scaling studies of Amalgam's on-chip network and memory systems. The discussion begins with an analysis of the scaling characteristics of wire delay, focusing on its impact on global communication and cache access time. This motivates the remainder of the chapter, which presents the scaling behavior of Amalgam, the effects of introducing local data caches into the programmable clusters, and the effects of intercluster register writes.

7.1 Increasing Impact of Wire Delays

While transistor performance is expected to continue improving in the near future, wire delays are not expected to improve as quickly and may potentially worsen as wire cross sections continue to shrink. Using advanced dielectrics, the number of transistors that may be reached within a single cycle will remain relatively constant, but chip transistor budgets will continue to increase, leading to multi-cycle, cross-chip communication latencies. Figure 7.1 illustrates this effect under optimistic circumstances, assuming aggressive process advances and optimal repeater placement. The core trendline depicts the scaling behavior of global wires communicating between a fixed number of transistors, in this case depicting the area of an Amalgam processor with 8 programmable clusters and 256 kB of shared data cache.


Figure 7.1 Wire Delay in Future Technologies. (a) Wire Delay in ns; (b) Wire Delay in cycles. Both panels plot the core and full trendlines against gate length (nm).

While the communication delay in nanoseconds is decreasing, technology improvements are unable to keep up with rapidly increasing clock frequencies, as demonstrated by the slowly rising wire delay in cycles. The full trendline depicts the global wire delay between the chip boundaries of a 140-mm² die. As expected, the number of cycles required to propagate a signal from edge to edge of a chip will increase substantially over the next several generations. This suggests that greater design emphasis needs to be placed on exploiting spatial locality in the future, avoiding global communication wherever possible.
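The relationship between the two panels of Figure 7.1 is simply the product of the absolute delay and the clock frequency projected for each technology node; the notation below is shorthand introduced here rather than taken from the original text.

\[
t_{\text{wire}}^{\text{(cycles)}} = t_{\text{wire}}^{\text{(ns)}} \times f_{\text{clk}}^{\text{(GHz)}}
\]

Even a wire whose absolute delay halves therefore grows more expensive in cycles if the clock rate more than doubles over the same technology step.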

Figure 7.1 also illustrates that an eight-cluster Amalgam will require a die size greater than 140 mm² when implemented in either 180-nm or 130-nm processes. The 140-mm² die only serves as a guideline for average implementations in each technology, and aggressive, high-performance processors are expected to consume up to a 280-mm² die.

While the nontrivial latency of global communication must be addressed in future designs, traditional cache memories have the potential to limit performance even further when faced with increasing wire delays, aggressive clock rates, and the widening processor-memory performance gap. As illustrated in Figure 7.2, increasing cache latencies will hinder the ability of caches to mitigate this ever-widening gap. While the absolute cache access time will decrease with technology (Figure 7.2(a)), chip clock rates are expected to increase dramatically due to very deeply pipelined designs.


Figure 7.2 Cache Access Time in Future Technologies. (a) Access Time in ns; (b) Access Time in cycles. Both panels plot access time against cache size (kB) for the 180-nm through 22-nm technology nodes.

Caches as small as 1 kB will have access times of four cycles or more, which is likely to be a performance limitation in future systems. This suggests that designers must closely examine the role of traditional cache memory hierarchies in future designs and explore potential alternatives such as small, software-managed memories.

7.2 Scaling Results

Benchmark simulations estimate execution times for a range of Amalgam implementations with varying local data cache sizes and network topologies. These simulations provide insight into how network and memory latencies impact the performance of future clustered processors. The graphs presented in the upcoming sections rely on execution time as the performance metric, which is computed by dividing the number of cycles required to execute each benchmark on a given configuration by the SIA Roadmap's predicted clock rate for the fabrication process used in the configuration. Using this metric, several results are presented to illustrate various scaling characteristics of clustered processors.
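Expressed as a formula, the metric for a benchmark on a given configuration is the simulated cycle count divided by the roadmap clock rate for that configuration's fabrication process; the symbols below are shorthand introduced here.

\[
t_{\text{exec}} = \frac{N_{\text{cycles}}}{f_{\text{clk}}(\text{process})}
\]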

To explore the overall behavior of the system, total performance results are presented, in which the execution time of applications on an eight-cluster processor of various configurations is normalized against the performance of an eight-cluster processor implemented in a 180-nm process.


Figure 7.3 Full-Chip Summary. (a) In-Order Processor Cores; (b) Out-of-Order Processor Cores. Relative execution time (vs. 180-nm 8PC Bus) is plotted against gate length (nm) for the Bus, Crossbar, Pipelined Crossbar, and Unit Crossbar network topologies.

These results are supplemented with plots that directly illustrate the performance gains from parallelism by comparing varying numbers of programmable clusters against a single programmable cluster. To further clarify the benefits gained from parallelism, relative speedups are also presented for the 180-nm and 22-nm technology nodes.

7.2.1 Full-chip latency scaling

Figure 7.3 shows the average runtime for our benchmarks in different network topologies as a function of the fabrication process used, holding the chip size constant at 140 mm² and assuming no local data caches. Somewhat surprisingly, configurations that use the bus-based interconnect network perform only slightly worse than the crossbar-based network when using in-order cores (Figure 7.3(a)). To fully assess the importance of the on-chip network, this figure also includes the performance results for a system with a single-cycle crossbar, referred to as a Unit Crossbar in this figure. In current technologies, this topology performs better than systems with realistic communication delays. However, the performance benefit is negligible in future processes, due largely to the more dominant memory latency.


Figure 7.4 Individual Benchmark Performance on Full-Chip Configuration (Bus). Relative execution time (vs. 180-nm 8PC Bus) is plotted against gate length (nm) for the dither, dna, fir, msort, mpeg, rijndael, and tsp benchmarks.

Figure 7.3(b) shows a somewhat different story for out-of-order cores, where the flexibility of out-of-order execution largely mitigates any differences in latency between the topologies. It is interesting to note that even a simple bus-based interconnect can achieve performance on par with a pipelined crossbar, but this may be partially accounted for by our simple out-of-order core. Since our model provides no dynamic memory disambiguation, all memory references are forced to execute sequentially, which reduces the benefit one would expect to see from pipelining the interconnect.

Figure 7.4 shows the impact of process scaling on the performance of each of our benchmarks. While all of the benchmarks display similar overall performance curves, sensitivity to communication and memory delays varies between benchmarks, which results in some variance in the performance gains from technology scaling.

7.2.2 Core-only latency scaling


Figure 7.5 Performance When Scaling the Architecture Core. (a) In-Order Processor Cores; (b) Out-of-Order Processor Cores. Relative execution time (vs. 180-nm 8PC Bus) is plotted against gate length (nm) for the Bus, Crossbar, Pipelined Crossbar, and Unit Crossbar network topologies.

As fabrication technologies improve, the amount of chip area required to implement a given clustered processor will shrink significantly, mitigating some of the impact of network latency on overall performance. Figure 7.5 presents the average performance of our benchmarks as a function of interconnect topology and fabrication process if we hold the architecture constant and scale the size of the chip to match the estimated area required by the processor. The curves shown in this figure are very similar to those shown in Figure 7.3, indicating that the increase in the latency of the on-chip memories dominates the change in network latency for these configurations, which do not include local data caches in each cluster. In particular, the latency of the instruction caches in each cluster becomes as high as four cycles in the 22-nm technology, arguing that more aggressive instruction fetch mechanisms are required to maintain past performance trends.

Figure 7.6 shows the impact of process scaling on the performance of each of our benchmarks as the core is scaled. Again, these curves are very similar to the ones shown for full-chip scaling, arguing that memory latency is more important than network latencies for these configurations.

The performance of clustered processors derives from two factors: speedup from the parallelization of applications across the clusters, and clock rate improvements as fabrication technology improves. To understand the relative impact of these factors, Figure 7.7 shows the relative execution time of our benchmarks for configurations with a pipelined crossbar interconnect as a function of the number of clusters used and the fabrication technology.


Figure 7.6 Individual Benchmark Performance on Core-Only Configuration (Bus). Relative execution time (vs. 180-nm 8PC Bus) is plotted against gate length (nm) for the dither, dna, fir, msort, mpeg, rijndael, and tsp benchmarks.

Figure 7.7 Multicluster Speedup on Core-Only Configurations. (a) In-Order Processor Cores; (b) Out-of-Order Processor Cores. Relative execution time (vs. 180-nm 1PC Pipelined Crossbar) is plotted against the number of clusters for the 180-nm through 22-nm technology nodes.

For systems fabricated in the 180-nm process, the eight-cluster configuration achieves a speedup of approximately 3x over the one-cluster system, but the parallel speedup declines significantly in the more advanced fabrication processes, as would be expected due to the increased network latency.


Figure 7.8 Multicluster Speedup on Core-Only Configurations for Technology Endpoints. (a) 180 nm; (b) 22 nm. Speedup over a single programmable cluster (in-order, pipelined crossbar) is plotted against the number of clusters for the dither, dna, fir, msort, mpeg, rijndael, and tsp benchmarks.

In 22 nm, the average speedup for an eight-cluster configuration drops to 1.7x over the single-cluster system. While all of the configurations see significant performance improvement as the number of clusters increases from one to four, the performance difference between the four-cluster and eight-cluster configurations becomes very small and actually shifts to a performance loss in the out-of-order core, indicating that additional mechanisms to reduce the performance impact of network and memory latency will be required as we approach these fabrication technologies.

To further examine the nature of individual benchmarks, Figure 7.8 depicts the multicluster speedup over a single cluster for the 180-nm and 22-nm technology nodes using a pipelined crossbar interconnect and in-order cores. At the 180-nm node, most benchmarks demonstrate fairly linear speedups as the number of clusters is increased, indicating that the underlying system is able to effectively exploit the available parallelism within an application. Some benchmarks, such as MPEG1 and DNA, see reduced speedups when moving from four- to eight-cluster implementations. Unfortunately, TSP and Mergesort suffer from significant cache line aliasing in their stack address space and experience a performance drop when moving from four- to eight-cluster implementations. As discussed previously in Section 5.1, this problem may be eliminated by making changes to the function calls within an application to alter their stack allocation.


Figure 7.9 Multicluster Speedup on Core-Only Configurations with Stack Tweaks. (a) 180 nm; (b) 22 nm. Speedup over a single programmable cluster (in-order, pipelined crossbar) is plotted against the number of clusters for the dither, dna, fir, msort, mpeg, rijndael, and tsp benchmarks.

Figure 7.9 depicts the result of fixing stack allocation for TSP and Mergesort. As we can see, this increases our average parallel speedup for an eight-cluster system over a single-cluster system from 3x to 4x in 180 nm, and from 1.7x to 2.4x in 22 nm.

7.3 Effects of Local Data Caches

To reduce the impact of memory and network latency on performance, we examined the addition of a variety of local data caches to each programmable cluster, which required implementing a MESI directory-based cache-coherence protocol over the on-chip network to keep the local caches consistent. Figure 7.10 shows the average performance of our benchmark suite on an eight-cluster system as a function of the local cache size, network architecture, and fabrication process.

As would be expected, adding even a small local data cache to each cluster significantly improves performance in near-term technologies. As cache size increases, however, performance remains relatively constant, particularly for the more advanced fabrication processes. This somewhat unexpected result is due to the increase in cache access time as local cache capacity increases, which cancels out or even dominates the increased hit rate of the local cache.


Figure 7.10 Local Data Cache Performance. (a) Bus; (b) Crossbar. Relative execution time (vs. 180-nm Bus) is plotted against gate length (nm) for local data cache sizes ranging from none to 64 kB.

As memory latencies increase, local caches become less effective due to increasing access times. By 22 nm, little benefit, if any, is obtained from the additional complexity of local data caches. Another point of interest is the relatively minor performance difference between a bus and a crossbar network. Introducing an appropriately sized local data cache into each cluster largely mitigates the differences between topologies.

In almost all cases, the 8-kB local cache configuration gives the best performance, arguing that local cache access time is at least as important as hit rate for these architectures. This makes a strong argument in favor of optimizations that allow smaller local memories to capture significant portions of a program's data traffic. The inclusion of small, software-controlled memories would likely improve future performance substantially by reducing access times while maintaining high hit rates.

To illustrate the best performance possible with local, hardware-managed caches, Figure 7.11 compares the best cache configuration in each technology against a system without local caches. Clearly, local caches provide very limited benefit as access times increase. Hit rates do not change substantially as technology shrinks.


Figure 7.11 Best Cache Configuration Comparison (Crossbar). (a) In-Order Processor Cores; (b) Out-of-Order Processor Cores. Relative execution time (vs. 180 nm, no local caches) is plotted against gate length (nm) for systems with and without local caches.

It should be noted that clusters have independent stack address spaces, and that these addresses have been skewed in an attempt to avoid any cache line aliasing in the main cache. Since clusters are frequently executing identical code, they tend to allocate objects to the stack in the same manner. As a result, it is important to carefully assign the size and starting addresses for cluster stack space to avoid constant aliasing between the clusters.

7.3.1 Area impact of local caches

Figure 7.12 depicts the distribution of relative total chip area consumed by each major system component for the range of local data cache sizes being examined. In the system without local data caches, chip area is divided fairly evenly between programmable cluster resources and the 256 kB of global shared cache. As local data cache sizes increase, we see that the clusters with 64 kB of local data cache account for approximately 75% of the chip, with the local data caches alone consuming close to half.


Figure 7.12 Fractional Chip Area for Different Local Cache Sizes. The percentage of chip area devoted to cluster, local cache, and global cache resources is plotted for local data cache capacities from none up to 64 kB.

7.4 Effects of Intercluster Register Writes

As mentioned previously, studies have shown that register-based communication mechanisms have the potential to yield significant performance advantages over systems that rely solely on shared-memory mechanisms [10, 24]. Two of the applications in the Amalgam benchmark suite have been implemented using register-based communication to provide a comparison against shared-memory implementations in Amalgam. FIR utilizes register communication in the traditional sense, passing temporary results from the producing cluster to the consuming cluster, thus reducing the amount of data communicated through shared memory. DNA takes a different approach, passing information about the availability of segments of the output array to the consuming clusters to replace traditional lock operations.

Figure 7.13 illustrates the percent speedup from these register-based implementations for pipelined crossbar-based processors with two, four, and eight clusters over the technology range studied. FIR achieves speedup as high as 19% in a four-cluster Amalgam, yet exhibits gradually reduced improvements as technology advances.


Figure 7.13 Benefits of Intercluster Register Communication (Pipelined Crossbar). (a) FIR; (b) DNA. Speedup from register communication is plotted against gate length (nm) for two-, four-, and eight-cluster configurations.

DNA achieves speedup as high as 9%, but demonstrates more interesting behavior over the studied technologies. The performance gains from intercluster communication initially degrade with technology, but improve after 90 nm and stabilize thereafter. This can be attributed to the relative change in access latency between different microarchitectural components. The shared cache banks have a two-cycle access latency in both the 180-nm and 130-nm feature sizes, but this latency doubles during the shrink to 90 nm, which produces a sudden rise in the performance gain from register communication. The cache latency continues to rise as clock rates improve, but not as sharply as at the 90-nm transition, which results in somewhat more stable behavior.

Figure 7.14 illustrates the same experiment as Figure 7.13, but for systems that use a bus-based interconnect. On average, the relative performance gains from forwarding are better than with the pipelined crossbar interconnect, with as much as a 24% speedup for FIR. Since bus interconnects suffer from higher contention, the reduced number of network messages required when using register communication explains this trend. Interestingly, this results in a gradual increase in speedup for FIR on the eight-cluster Amalgam. This effect also keeps the four-cluster speedup higher than the two-cluster speedup on FIR. Unfortunately, both benchmarks demonstrate reduced benefit for the eight-cluster systems, only achieving speedups in the range of 1-7%.


Figure 7.14 Benefits of Intercluster Register Communication (Bus). (a) FIR; (b) DNA. Speedup from register communication is plotted against gate length (nm) for two-, four-, and eight-cluster configurations.

These limitations may be attributed to two key factors: increasing synchronization overhead and communication latency. As the number of clusters increases, the size of the core increases by a similar factor, resulting in longer global wires with longer latencies. This global wire delay directly correlates to the latency of the CBAR operation and the on-chip network. Furthermore, as the number of clusters increases, the time individual clusters spend waiting at a CBAR may increase substantially if the load between clusters is not balanced. For example, the time spent waiting at barriers varies substantially between different clusters in our FIR implementation, indicating that forward progress is typically limited by a single cluster. This cluster spends roughly 5% of the total execution time waiting at barrier instructions, which is due to the latency of the hardware barrier. In 180 nm, the remaining clusters in an eight-cluster Amalgam spend 66% of their time waiting at barriers. By 22 nm, these clusters remain idle at barriers 84% of the time.

To address some of these issues, clustered architectures must support low-latency, localized synchronization and communication networks. As discussed in Section 4.3, hierarchical networks have been proposed to provide fast communication between neighboring clusters. We have studied a simple two-level hierarchy that provides communication within one half of the chip at a lower latency than the chip-wide network. This topology provided negligible speedup for our benchmark suite for several reasons. First, the benchmarks were not implemented to rely solely on fast local communication, so performance was limited by register writes that used the global network. Exploiting locality in an application may be difficult given an arbitrary interconnect pattern. Including more local connectivity, such as with a ring network or gridded architecture, may simplify this process. Second, this hierarchical network fails to address the synchronization problem. Attempting to parallelize applications in a way that balances the execution time between all clusters is a difficult task, especially given the importance and complexity of predicting cache behavior. If the system only supports a single global synchronization construct, such as Amalgam's CBAR operation, it is possible to imagine a scenario where a single cluster's irregular memory access pattern leads to large numbers of cache misses, thereby holding up computation for the entire system. Depending on the nature of the application, it is also possible that there are several independent threads with only a small subset requiring synchronization. As such, providing fine-grained synchronization in addition to processor-wide synchronization would be desirable.

This chapter presented the results of our scaling studies for Amalgam's on-chip network and memory systems. As technology improves, the effects of wire delay on global communication and of increasing cache access latencies become more important to processor performance. While network implementation is important, performance differences may largely be eliminated using out-of-order execution. More importantly, memory delays dominate system performance, and cache memories are unable to mitigate this effect in future technologies. Clustered processors have the potential to reduce dependence on memory by replacing shared memory-based communication with register-based communication, but more work is required to improve their effectiveness and applicability to a wider range of programs.


CHAPTER 8

CONCLUSION

This thesis studied the performance of Amalgam as fabrication technology scales from feature sizes of 180 nm down to 22 nm. Amalgam divides the processor's functional resources into independent clusters and exposes global communication delays to the compiler to improve instruction scheduling. Amalgam supports efficient, low-latency mechanisms for communication and synchronization between these clusters. Through the combination of these techniques, Amalgam maintains a consistent rate of performance improvement over the technologies studied.

Clock rates are expected to increase by close to a factor of forty from 180 nm to 22 nm, due in part to improved transistor performance. The remainder of these clock rate advances derives from aggressive deep pipelining, with a reduction in FO4 delays from 14.81 to 3.16 per pipeline stage [39]. Deep pipelining results in multicycle latencies for every architectural component, including key structures such as caches, register files, and functional units. Global interconnects experience latencies as high as 14 cycles for a relatively conservative 140-mm² die, and as high as twenty cycles for high-performance processors, which are expected to consume up to a 280-mm² die. The time required to load a cache line from memory also rises to nearly two thousand cycles, with six-cycle access times for a banked 256-kB on-chip cache. High latencies limit the performance gains from clock rate advances. As a result, execution times improve by slightly more than a factor of three over this technology range, roughly an order of magnitude smaller than the improvement in clock rate.

The performance of clustered processors derives from more than clock rate improvements alone. Clustered processors also obtain speedup from parallelizing applications across the clusters. By spreading computation over multiple clusters, an eight-cluster Amalgam achieves a 4x speedup over a one-cluster system implemented in 180-nm technology. As would be expected, parallel speedup declines in the more advanced fabrication processes, due to increasing communication latencies. In 22 nm, speedup drops to 2.4x for an eight-cluster configuration over a single-cluster system.

In addition to exploring the scaling characteristics of Amalgam, this study examined the role of various design options in the performance of Amalgam, including network topology, instruction issue policy, and the presence and size of local data caches within the clusters. None of these options impacts performance as significantly as fabrication advances, but a combination of these options can result in a worthwhile performance boost over a baseline Amalgam. While network implementation remains important, the performance differences between a simple bus and a more complex crossbar are largely eliminated by using out-of-order execution. Many of these effects are actually overshadowed by the more dominant effects of increasing memory delays. Unfortunately, cache memories are unable to mitigate the effects of increasing memory latency in future technologies, due to prohibitive increases in access latency for even small caches.

Clustered processors can reduce the dependence of applications on memory by replacing costly shared memory-based communication with low-latency register writes. Register-based communication reduces cache pressure by moving temporary data directly from producer to consumer, allowing more useful data to be kept in the cache. Unfortunately, it is difficult to efficiently utilize intercluster register transfers in general applications. Since the register file is substantially smaller than shared memory, only limited amounts of data can be transferred between clusters using registers. Limited storage space increases the amount of synchronization needed, which further limits performance gains. We demonstrated speedups as high as 10-20% for applications that replace some shared memory communication with intercluster register writes. These speedups are reduced by increasing network latencies or increasing the number of clusters. To improve this behavior, we must look into networks that provide higher levels of connectivity and mechanisms that support the synchronization of smaller groups of clusters in addition to chip-wide synchronization. Mapping applications to effectively utilize these resources is also important, since computation that relies on global communication limits potential performance gains.

The work in this thesis provides motivation for future exploration. In particular, we will be investigating architectural techniques that effectively deal with the widening processor-memory performance gap. The first direction is to introduce small, software-managed memories into the clusters. These memories are designed to operate at higher clock rates than cache memories may achieve, but careful program analysis should allow similar hit rates for applications with regular memory access patterns. Since register communication has tremendous potential, the second direction is to design a system that increases the amount of data that may be transferred between clusters while reducing synchronization overhead. One possible option is to replace the conventional SRAM array-based register file with a set of hardware queues. In this system, data is loaded into the queue as it arrives, and reads to that register access the element at the head of the queue. When the programmable cluster is ready for the next data value, it executes an EMPTY operation as before, which increments the queue until it no longer holds valid data.
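A rough software model of such a queue-backed register is sketched below; the capacity, structure, and operation names are illustrative only, since the actual mechanism would be implemented in hardware within the register file.

#define QDEPTH 8     /* illustrative queue capacity */

/* Illustrative software model of a queue-backed intercluster register. */
struct reg_queue {
    int data[QDEPTH];
    int head, tail, count;
};

/* An arriving intercluster write is appended to the queue;
 * back-pressure/stall handling when the queue is full is omitted. */
void queue_write(struct reg_queue *q, int value)
{
    if (q->count < QDEPTH) {
        q->data[q->tail] = value;
        q->tail = (q->tail + 1) % QDEPTH;
        q->count++;
    }
}

/* Reads see the element at the head without consuming it. */
int queue_read(const struct reg_queue *q)
{
    return q->data[q->head];
}

/* The EMPTY operation advances the queue to the next data value. */
void queue_empty(struct reg_queue *q)
{
    if (q->count > 0) {
        q->head = (q->head + 1) % QDEPTH;
        q->count--;
    }
}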

In conclusion, this thesis demonstrated the ability of clustered architectures to provide scalable performance in future fabrication processes. The widening processor-memory gap is an issue of great importance in future high-performance systems and must be addressed to provide greater performance. Since cache memories are unlikely to remain effective in dealing with this problem, alternatives such as small, software-managed memories and efficient intercluster communication should be explored.


APPENDIX A

AMALSIM CONFIGURATION OPTIONS

This appendix describes the range of configuration options that are used to control amalsim. These options are stored in a text .cfg file that is passed to the simulator using the create command (described in Appendix C) at the amalsim prompt. Default values are built into the simulator for every option, but may be overridden by the options included in the .cfg file. To simulate an Amalgam with eight programmable clusters executing a binary dna.x with a network specified in proc8.ndf, the following configuration options are used:

-num_pclust 8
-codefile dna.x
-nconfig proc8.ndf

Upon loading this .cfg file, the simulator creates an Amalgam with eight programmable clusters, parses the binary executable for the starting PCs for each cluster, and parses and builds a model for the network specified in proc8.ndf.


A.1 Cluster Options

A.1.1 Programmable (-proc_clust:)

Table A.1 shows the available configuration options for the programmable clusters within amalsim as well as their default values.

Table A.1 Programmable Cluster Options

    Option               Description                                 Default
    out_of_order         Model ooo core if present                   FALSE
    pipeline_depth       Depth of the pipeline                       3
    num_alus             Number of ALUs                              2
    num_arch_registers   Number of architectural registers           32
    num_phys_registers   Number of physical registers                32
    rob_size             Reorder buffer size                         0
    retire_num_ops       Number of ops we can retire in one cycle    2
    decode_buffer_size   Decode buffer size                          2
    net_buffer_size      Network buffer size                         10
    ex_stage             EX Stage                                    0
    bypass_stage         Bypass Stage                                1
    br_stage             BR Stage                                    1
    mem_size             Local Memory Size                           0
    read_delay           Local Memory Read Delay                     1
    write_delay          Local Memory Write Delay                    1
    base_address         Local Memory Base Address                   0

A.1.2 Reconfigurable (-rec_clust:)

Table A.2 shows the available configuration options for the reconfigurable clusters within amalsim as well as their default values.


Table A.2 Reconfigurable Cluster Options

    Option            Description                             Default
    num_registers     Number of registers                     32
    net_buffer_size   Network buffer size                     10
    rc_latency        RC cycle latency (in Amalgam cycles)    1

A.2 Branch Predictor (-bpred:)

Table A.3 shows the available configuration options for the programmable cluster's branch predictor within amalsim as well as their default values.

Table A.3 Branch Predictor Options

    Option        Description                Default
    type          Type of predictor          nottaken
    bimod_size    Size of bimodal table      0
    l1size        Size of level 1 table      0
    l2size        Size of level 2 table      0
    shift_width   Shift width                0
    xor           XOR                        0
    btb_sets      Number of BTB sets         0
    btb_assoc     Associativity of the BTB   0

Supported predictor types: NotTaken (nottaken), Taken (taken), Bimodal (bimodal), Two-level (two_level).

A.3 Memory Options (-memory:)

Table A.4 shows the available configuration options for the off-chip memory within amalsim as well as their default values.


Table A.4 Memory Options

    Option           Description         Default
    memorySize       Memory Size         8388608
    readStartDelay   Read Start Delay    5
    readWordDelay    Read Word Delay     5
    wrtStartDelay    Write Start Delay   5
    wrtWordDelay     Write Word Delay    5
    base_address     Base Address        0

A.4 Cache Options

Table A.5 shows the available configuration options for the different cache memories within amalsim as well as their default values. All cache options are preceded by -cache: for the globally shared data cache, -icache: for the instruction cache located on each processor cluster, or -dcache: for the local data cache located on each processor cluster.

Table A.5 Cache Options

    Option         Description           Default (Cache)   Default (ICache)   Default (DCache)
    bank_size      Bank Size             16384             4096               4096
    bank_assoc     Bank Associativity    4                 4                  4
    bank_linelen   Bank Line Length      32                32                 32
    num_banks      Number of Banks       4                 1                  0
    buffer_size    Network buffer size   10                NA                 NA
    hit_latency    Hit Latency           1                 1                  1
    miss_latency   Miss Latency          1                 1                  1
    fetch_width    Fetch Width           NA                4                  NA
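As an example of combining these options, a configuration such as the sketch below could add an 8-kB, 4-way local data cache to each cluster of the eight-cluster setup shown earlier. The option spellings follow the underscore convention of the -num_pclust example above and should be checked against the simulator's built-in defaults; the values themselves are illustrative.

-num_pclust 8
-codefile dna.x
-nconfig proc8.ndf
-dcache:num_banks 1
-dcache:bank_size 8192
-dcache:bank_assoc 4
-dcache:hit_latency 1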

Table A.6 shows the available configuration options for logging cache coherence events and constraining the address space that may be cached locally. There are two options available to control logging of coherence messages (when local dcaches are active). These are preceded by -coherence:. If no logfile is specified, this option is disabled. At the moment, messages are printed that are received and sent by both the main cache and the local dcaches, and both contribute to the count of the number of messages (for comparison to the log limit).


Table A.6 Coherence Options

    Option              Description                                      Default
    logfile             Filename to dump output to                       NA
    log_limit           Limit on the number of messages to be printed    MAXINT
    uncacheable_start   Beginning of uncacheable region                  -1
    uncacheable_end     End of uncacheable region                        -1

A.5 Logging Options

Table A.7 shows the configuration options for logging other aspects of the system, including branch history, instruction traces, and RS logging. RS was originally used to refer to the reservation stations within the programmable clusters, but now refers to the structures in the simulator used to maintain the list of live instructions associated with a particular cluster. These are preceded by -rs:. If no logfile is specified, this option is disabled. At the moment, messages are printed when new ops are issued and when ops are retired. Both contribute to the count of the number of messages (for comparison to the log limit).

Table A.7 Logging Options

    Option      Description                                      Default
    logfile     Filename to dump output to                       NA
    log_limit   Limit on the number of messages to be printed    MAXINT

Furthermore, you can log branch history using -br: followed by logcluster, where cluster is the programmable cluster you wish to track. The message limit is set using log_limit.

Instruction traces may be generated using -trace: followed by logfile to set the filename, or log_limit to limit the number of instructions to be logged. At the moment, instruction traces are only supported for simple configurations (i.e., in-order Pclusters without local data caches). Instructions are added to the trace once they are committed to architectural state (hence the exact cycle a result is available is actually slightly earlier, but this should not significantly affect observed behavior).


Basic path profiling may be enabled using -path: followed by logfile to set the filename. By default, paths are limited to intraprocedural paths, but you may enable interprocedural profiling by including the -path:interproc option. Paths are currently broken by backwards (i.e., loop) branches, so the common path through a nested loop (i.e., the beginning of the outer loop, through the inner loop, followed by the remainder of the outer loop) is not yet supported.

A.6 General Options

Table A.8 shows the available configuration options for other aspects of amalsim as well as their default values.

Table A.8 General Options

    Option                     Description                                                                Default
    -codefile filename         Use the assembly code in filename                                          NA
    -nconfig filename          Configure the network using the specifications in filename                 NA
    -ckpt:file                 Checkpoint file prefix                                                     ckpt_default
    -rconfig:index filename    Configure the reconfig index using the specifications in filename          NA
    -fconfig:index filename    Configure the reconfig index's fsm using the specifications in filename    NA
    -num_pclust                Number of processor clusters                                               1
    -num_rclust                Number of reconfigurable clusters                                          0
    -flush_timeout             Timeout for flush command                                                  50
    -cbar_latency              Minimum latency of a CBAR instruction                                      1


APPENDIX B

NETWORK DESCRIPTION LANGUAGE

The Network Description Language, or NDL, is a language for describing the topology of the Amalgam on-chip network. Originally, the network was coded in C with a specific predefined topology. While simple to implement, this approach only worked as long as the system being modeled met the constraints for the predefined topology. As a result, this approach did not provide a convenient way to examine alternative topologies. To deal with these limitations, we developed the NDL to express network topologies in a textual format that is parsed by the simulator at runtime.

B.1 Network Description File

Each network topology resides in a separate text file called the Network Description File (NDF), which has the extension .ndf. The NDF contains the description of a single network topology in NDL, but the NDL is designed so that a single NDF will work for a variety of Amalgam system configurations. The following section discusses the means of configuring a network node, including the connections between it and other nodes as well as Amalgam system components.


B.2 Configuring Nodes

A NODE_CFG block configures a single major network node and consists of statements that specify the index of the node being configured, its latency, its mode of operation, and its inputs and outputs. For example,

node_cfg {
    idx 0;
    latency 3;
    mode BUS;
    in P 0, P 1, R 0, R 1;
    out N 1, B 0, B 1;
}

Put into words, node 0 has a latency of three cycles and should be modeled as a bus network. Its inputs connect to programmable clusters 0 and 1 and reconfigurable clusters 0 and 1, while its outputs connect to node 1 and cache banks 0 and 1.

The following is a list of parameters that are specified in a NODE_CFG block. For each parameter, we provide a brief description, an explanation of the available options, and restrictions imposed by the simulator.

idx

The node index. Used to uniquely identify the node. Node indexes must start at 0, be contiguous, and have no duplicates. Nodes actually have two coordinates denoted as major.minor, but the minor coordinate is only used internally for nonunit latency nodes and may not be referenced directly in the configuration file.


latency

The latency for this node. Node latency must be 1 or greater.

mode

The operating mode of this node. Used to determine how the node is modeled. The simulator currently supports three different operating modes:

• XBAR
Behaves like an unpipelined crossbar. If the node has a latency greater than 1, incoming ops will occupy the in port for the duration of this latency, and then be transferred to the out port at the end of that period.

• XBAR_PIPELINED
Behaves like a pipelined crossbar. If the node has a latency greater than 1, the node will be modeled as several nodes hooked together linearly. So in the previous example configuration, nodes 0.0, 0.1, and 0.2 are created to simulate a three-cycle latency. Internal node connections will be equal to the number of final out connections and are currently modeled as one large buffer.

• BUS
Behaves like a traditional bus. This is modeled using the same general code as an unpipelined crossbar, but we only allow one in port to be occupied at any time (i.e., we still have in ports connected to each unit, but only one can contain an op during a particular cycle).

in

The inputs to this node. Used to specify which units feed into this node.


programmable clusters, reconfigurable clusters, data cache banks, or other major<br />

network nodes and are expressed in the form of . Unit<br />

types are expressed as C (generic cluster), P (programmable cluster), R (reconfigurable<br />

cluster), B (data cache bank), and N (network node). Unit indexes may take the form<br />

of any expression supported <strong>by</strong> the language.<br />

Generic cluster refers to either a programmable cluster or a reconfigurable cluster and<br />

is differentiated when the number and type of clusters is known.<br />

The numbering<br />

scheme for generic cluster indexes is consistent with the rest of the simulator such<br />

that indexes in the range (0, P - 1) are programmable clusters and indexes in the range<br />

(P, N - 1) are reconfigurable clusters, where P is the number of programmable clusters<br />

and N is the total number of clusters. The addition of the generic cluster specification<br />

allows the use of a single network configuration on an assortment of different system<br />

configurations (ie 8P, 4P x 4R, etc), but special attention should be paid to how the<br />

network configuration will interact with a given system configuration. For example,<br />

the following input statement will generate very different networks depending on the<br />

number of clusters:<br />

in C 0, C 1, C 2, C 3;<br />

For a 2P x 2R system, this will be equivalent to:<br />

in P 0, P 1, R 0, R 1;<br />

But for a 4P x 4R system, it will be equivalent to:<br />

in P 0, P 1, P 2, P 3;<br />

out ;<br />

The outputs of this node. Used to specify which units are fed <strong>by</strong> this node. <br />

is the same as that described above.<br />
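
To illustrate how these parameters combine, the following hypothetical two-node configuration (the indices, latencies, and connections are invented for illustration) builds a pipelined crossbar that gathers traffic from four generic clusters and forwards it to a second, bus-mode node feeding two cache banks:

node_cfg {
    idx 0;
    latency 2;
    mode XBAR_PIPELINED;
    in C 0, C 1, C 2, C 3;
    out N 1;
}

node_cfg {
    idx 1;
    latency 1;
    mode BUS;
    in N 0;
    out B 0, B 1;
}

With a 2P x 2R system, node 0's inputs resolve to the two programmable and two reconfigurable clusters; with an 8P system, they resolve to programmable clusters 0 through 3.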



B.3 Expressions and Variable Scoping

All variables in NDL are 32-bit signed integers and have global scope. There are no variable declarations in NDL. Rather, a variable becomes available after its first assignment. NDL supports expressions involving variables, integer constants, and most of C's binary and unary operators, which have the usual semantics and precedence rules. See the formal grammar in Section B.4 for more details.
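
As a hypothetical sketch of how variables, expressions, and loops can be combined (the node indices, latency, and connections are invented; the statement forms follow the grammar in Section B.4), a for loop can stamp out one single-cycle crossbar node per cluster/bank pair:

for (i = 0; i < 4; i++) {
    node_cfg {
        idx i;
        latency 1;
        mode XBAR;
        in C i;
        out B i;
    }
}

Here i becomes available at its first assignment in the loop header, and the expression i is used both as the node index and as the unit index of the connected cluster and bank.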

B.4 Formal Grammar

This section presents the NDL grammar. All nonterminals are represented in the form <name>, and all terminals are represented in typewriter style. Regular expressions are rendered in plain text.

<program>       := <stmt_list>

<stmt_list>     := <stmt>
                 | <stmt_list> <stmt>

<stmt>          := <node_cfg>
                 | <for_stmt>
                 | <while_stmt>
                 | <if_stmt>

<for_stmt>      := for ( <expr> ; <expr> ; <expr> ) <block>

<while_stmt>    := while ( <expr> ) <block>

<if_stmt>       := if ( <expr> ) <block>
                 | if ( <expr> ) <block> else <block>
                 | if ( <expr> ) <block> else <if_stmt>

<block>         := { <stmt_list> }

<cfg_block>     := { <cfg_stmt_list> }

<node_cfg>      := NODE_CFG <cfg_block>

<cfg_stmt_list> := <cfg_stmt>
                 | <cfg_stmt_list> <cfg_stmt>

<cfg_stmt>      := <idx_stmt> ;
                 | <latency_stmt> ;
                 | <mode_stmt> ;
                 | <in_stmt> ;
                 | <out_stmt> ;

<idx_stmt>      := IDX <expr>

<latency_stmt>  := LATENCY <expr>

<mode_stmt>     := MODE <mode>

<mode>          := XBAR | XBAR_PIPELINED | BUS

<in_stmt>       := IN <unit_list>

<out_stmt>      := OUT <unit_list>

<expr>          := <integer>
                 | <identifier>
                 | <identifier> = <expr>
                 | <identifier> ++
                 | <identifier> --
                 | <unop> <expr>
                 | <expr> <binop> <expr>
                 | ( <expr> )

<unop>          := - | !

<binop>         := + | - | * | / | % | << | >> | && | || | == | != |
                   > | < | >= | <=


APPENDIX C

AMALSIM COMMAND REFERENCE

This appendix describes the range of commands that are used to control the simulator via the amalsim prompt.

C.1 Basic Simulator Control

• help command – Prints the usage information for the specified command. If command is omitted, lists all available commands.

• step n – Steps the design n times. If n is omitted, the design is stepped once. May be interrupted by the user with Ctrl-C.

• run cycle – Steps the design until a breakpoint is reached, a HALT instruction is executed, the simulator has reached cycle if cycle is specified, or the user interrupts execution with Ctrl-C.

• runlevel new runlevel – Sets the runlevel for the simulator (conditional execution of halts and flushes). Initially set to 0, so any halts/flushes with a higher runlevel specified are ignored. If new runlevel is omitted, displays the current runlevel.

• exit – Exits the simulator.

• source file – Opens a file of commands and executes them.

• create cfg file – Creates an Amalgam processor instance using the options in cfg file. If no file is specified, uses default.cfg.

• cdf cdf file fsm file – Creates a dummy reconfigurable cluster and tries to configure it using the specified cdf and fdf files. Provides a convenient way to check cdf/fdf syntax.

• destroy – Frees all memory associated with the processor currently being simulated.

• watch startAddr/label numwords/endAddr – Sets a watchpoint on a memory location or locations. The start address may be expressed in base 10, in hex (when preceded by 0x), or as a label. The second argument may be expressed as either an end address (when specified in hex with 0x) or as a number of words (when specified in base 10). If numwords/endAddr is omitted, sets a watchpoint only at startAddr/label. If a label is used, it is only associated with the first address and not the whole range.

• watch off startAddr/label numwords/endAddr – Deletes a watchpoint on a memory location or locations. The start address may be expressed in base 10, in hex (when preceded by 0x), or as a label. The second argument may be expressed as either an end address (when specified in hex with 0x) or as a number of words (when specified in base 10). If numwords/endAddr is omitted, deletes only the watchpoint at startAddr.

• regpt cluster register – Sets a watchpoint on cluster's register.

• regpt off cluster register – Deletes a watchpoint on cluster's register.

• break location/label – Sets a breakpoint on a location/label.

• break list – Lists all defined breakpoints.

• break off location/label – Deletes a breakpoint on a location/label.

• fsm on cluster – Prints the current state, the tasks executed in the last cycle, and the counter values for the specified reconfigurable cluster's fsm every cycle.

• fsm off cluster – Stops printing fsm information every cycle for the reconfigurable cluster.

• regress rdf file outfile – Starts a regression on the current Amalgam instance. If no rdf file is specified, uses default.rdf. If no outfile is specified, regression info is written to stdout.

• ckpt – Creates a checkpoint. The filename consists of the ckpt file prefix with "-num.ckpt" appended, where num is the number of this checkpoint (it starts at 1, or at the number from the last restore, and auto-increments). See option -ckpt in Appendix A for details on how to set the file prefix and its default value.

• ckptr ckpt file/ckpt num – Restores from a checkpoint. Can either specify a filename or simply a checkpoint number if an Amalgam processor instance already exists.

• coherence log log file – Enables/disables coherence logging. log file is only used when enabling and no logfile is specified in the cfg file.

• trace on/off – Enables/disables instruction trace logging. By default, traces are enabled if a logfile is specified in the cfg file.

• stall threshold num cycles – Sets the stall detection threshold to num cycles. If num cycles is omitted, displays the current threshold.
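
As an illustration only (the configuration file name, label, and addresses below are invented, and the exact prompt text may differ), a typical interactive session might combine these commands as follows:

amalsim> create my_system.cfg
amalsim> break main
amalsim> run
amalsim> watch 0x1000 16
amalsim> step 100
amalsim> ckpt
amalsim> exit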

C.2 Print Simulator State

• pipe cluster pipeline – Prints the contents of the programmable cluster's specified pipeline.

• dbop start pc num of ops – Prints num of ops instructions starting with start pc. If num of ops is omitted, prints only the instruction at start pc.

• pc cluster – Prints the programmable cluster's PC.

• next pc cluster – Prints the PC of the next op to be dispatched (first decode buffer entry), or the current PC if the decode buffer is empty.

• bpred cluster – Prints the contents of cluster's branch predictor.

• btb cluster – Prints the contents of cluster's BTB.

• rat cluster – Prints the contents of cluster's RAT.

• freelist cluster – Prints the contents of cluster's freelist.

• rob cluster – Prints the contents of cluster's ROB.

• stat cluster – Prints various status information associated with a cluster.

• stack cluster numwords – Prints the stack for a cluster.

• live cluster – Prints the live instruction list for the specified unit (cluster/bank). If you request a cluster number higher than the number of programmable and reconfigurable clusters in the system (i.e., num pclust + num rclust), displays the live instruction list for the cache (i.e., when using coherence).

• rs cluster – Has been replaced by the live command (in an attempt to avoid any comparison to reservation stations).

• db cluster – Prints the contents of the programmable cluster's decode buffer.

• nb cluster – Prints the contents of the cluster's network buffer.

• rc acu ctl cluster – Prints the PCU control bits for the cluster's RA.

• rc lb cluster row col – Prints information and status of the LB in the cluster's RA.

• rc segment cluster segment – Prints the present LB outputs in the cluster's RA.

• rc lb wires cluster row col connection – Prints the local wire connections around LB(row,col) in the cluster's RA.

  Connections:

  – ALL: all local connections
  – BC: broadcast connections, if they exist
  – LBout: connections between the LB output and Hwires/Vwires/WCH
  – VH: connections between Vwires and Hwires(row+1)

• rc reg wires cluster register – Prints the active WCH writing to the register and the active RCH reading from the register.

• rc wire cluster arguments – Prints the resource that is driving the wire and the resource(s) that are reading from the wire. Valid arguments:

  – VWIRE col index
  – HWIRE row index
  – RCH seg col index
  – WCH seg col index

• net print out ports – Prints the state of the network. If print out ports is nonzero, node out ports are printed in addition to in ports. If print out ports is omitted or 0, node out ports are excluded from the output.

• node idx idx internal print out ports – Prints the state of node idx.idx internal in the network. If print out ports is nonzero, node out ports are printed in addition to in ports. If print out ports is omitted or 0, node out ports are excluded from the output.

• netstat – Prints various network statistics.

• net rt – Prints the routing tables for all major network nodes.

• node rt idx – Prints the routing table for node idx in the network.

• net conn – Prints the connection lists for the units connected to the network.

• reg cluster – Prints the contents of cluster's regfile.

• lmem cluster startAddr numwords/endAddr – Prints the contents of the programmable cluster's local memory over the specified address range. The start address may be expressed in either base 10 or hex (when preceded by 0x). The second argument may be expressed as either an end address (when specified in hex with 0x) or as a number of words (when specified in base 10). If numwords/endAddr is omitted, displays only the contents of M[startAddr].

• mem startAddr numwords/endAddr – Prints the contents of memory over the specified address range. The start address may be expressed in either base 10 or hex (when preceded by 0x). The second argument may be expressed as either an end address (when specified in hex with 0x) or as a number of words (when specified in base 10). If numwords/endAddr is omitted, displays only the contents of M[startAddr].

• memstat – Prints the number of loads, stores, etc., that the data cache has handled so far.

• cache firstset numsets – Prints the contents of the specified sets (from firstset to firstset + numsets) within the cache (tags, valid bits, and last access times). If numsets is omitted, prints only firstset.

• is in cache address – Checks whether address is in the main cache and, if so, prints its location and other relevant info.

• cache nb bank – Prints the contents of the bank's network buffers. If bank is omitted, prints the contents of the network buffers for all banks.

• cache wait bank – Prints the contents of the bank's waiting on list. If bank is omitted, prints the waiting on lists for all banks.

• dcache cluster firstset numsets – Prints the contents of the specified sets (from firstset to firstset + numsets) within the programmable cluster's dcache. If numsets is omitted, prints only firstset.

• is in dcache cluster address – Checks whether address is in the programmable cluster's dcache and, if so, prints its location and other relevant info.

• icache cluster firstset numsets – Prints the contents of the specified sets (from firstset to firstset + numsets) within the programmable cluster's icache. If numsets is omitted, prints only firstset.

• is in icache cluster address – Checks whether address is in the programmable cluster's icache and, if so, prints its location and other relevant info.

C.3 Modify Simulator State

• ldreg cluster reg value – Loads cluster's reg with value (R[cluster.reg] ← value). Sets the valid entry for reg to the current cycle. Only works on programmable clusters.

• ldpc cluster pc – Loads the programmable cluster's PC with pc (PC[cluster] ← pc).

• ldmem address value – Loads the memory word located at address with value (M[address] ← value).
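
For example (the cluster, register number, addresses, and values here are invented), these commands can be combined with the print commands above to seed a small test by hand before stepping the design:

amalsim> ldmem 0x2000 42
amalsim> ldreg 0 4 0x2000
amalsim> ldpc 0 0x100
amalsim> step 20
amalsim> reg 0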

C.4 Stream Utilities

Streams provide terminal interfaces to the simulator, including stdin, stdout, and stderr.

• stream fd start stop show cycles – Displays the contents of stream fd. show cycles is 0 for no, 1 for yes. Optionally specify start and stop "cycle added" limits.

• stream dump file fd filename – Dumps the contents of stream fd into the file filename. Only works for stdin/stdout/stderr.

• stream inject fd string – Injects string into stream fd.

• stream inject file fd filename – Injects the contents of file filename into stream fd. Only works for stdin/stdout/stderr.


APPENDIX D

AMALGAM ISA (AMISA)

Table D.1 shows the initial Amalgam ISA, which is a subset of DLX, as defined in [22]. To start, we support the bulk of the arithmetic and logical instructions, excepting the instructions to set conditionals. Also, the MULT and DIV instructions are integer-based as opposed to floating point. Instructions with a U suffix imply unsigned arithmetic.

Conventions

cx.ry: Register Y on Cluster X
ry: Register Y on the executing cluster
#i: A constant (generally 8-bit)

Table D.1: AMISA

Op      Arguments           Definition
ADD     (cx.)rd, r1, r2     (cx.)rd ← r1 + r2
ADDI    (cx.)rd, r1, #i     (cx.)rd ← r1 + #i
ADDU    (cx.)rd, r1, r2     (cx.)rd ← r1 + r2
ADDUI   (cx.)rd, r1, #i     (cx.)rd ← r1 + #i
AND     (cx.)rd, r1, r2     (cx.)rd ← r1 & r2
ANDI    (cx.)rd, r1, #i     (cx.)rd ← r1 & 0x00.#i
BEQZ    r1, #i              if (r1 = 0) then PC ← (PC_branch + 4) + #i
BNEZ    r1, #i              if (r1 ≠ 0) then PC ← (PC_branch + 4) + #i
CBAR    NA                  halt instruction issue until all clusters execute CBAR
CMOV    (cx.)rd, r1, r2     if (r2 ≠ 0) then (cx.)rd ← r1
DIV     (cx.)rd, r1, r2     (cx.)rd ← r1 / r2 (int)
DIVU    (cx.)rd, r1, r2     (cx.)rd ← r1 / r2 (int)
EMPTY   rd                  invalidate rd
EXTB    (cx.)rd, r1, r2     extract byte - see below
EXTBI   (cx.)rd, r1, #i     extract byte - see below
EXTS    (cx.)rd, r1, r2     extract short - see below
EXTSI   (cx.)rd, r1, #i     extract short - see below
FLUSH   #i                  if the lower four bits of i are less than or equal to the current runlevel,
                            the upper four bits of i determine what is flushed (defined in opcode.h)
HALT    #i                  if i is less than or equal to the current runlevel,
                            halts the simulator and returns to user control
INSB    (cx.)rd, r1, r2     insert byte - see below
INSBI   (cx.)rd, r1, #i     insert byte - see below
INSS    (cx.)rd, r1, r2     insert short - see below
INSSI   (cx.)rd, r1, #i     insert short - see below
J       #i                  PC ← (PC_branch + 4) + #i
JAL     #i                  r31 ← (PC_branch + 4); PC ← (PC_branch + 4) + #i
JALR    r2                  r31 ← (PC_branch + 4); PC ← r2
JR      r2                  PC ← r2
LHI     (cx.)rd, #i         (cx.)rd ← #i.0x00
LW      (cx.)rd, r1, #i     (cx.)rd ← M[r1 + #i]
MASKB   (cx.)rd, r1, r2     mask byte - see below
MASKBI  (cx.)rd, r1, #i     mask byte - see below
MASKS   (cx.)rd, r1, r2     mask short - see below
MASKSI  (cx.)rd, r1, #i     mask short - see below
MULT    (cx.)rd, r1, r2     (cx.)rd ← r1 * r2
MULTU   (cx.)rd, r1, r2     (cx.)rd ← r1 * r2
NEG     (cx.)rd, r1         (cx.)rd ← −r1
OR      (cx.)rd, r1, r2     (cx.)rd ← r1 | r2
ORI     (cx.)rd, r1, #i     (cx.)rd ← r1 | #i
RET     r2                  PC ← r2
SEQ     (cx.)rd, r1, r2     if (r1 = r2) then (cx.)rd ← 1 else (cx.)rd ← 0
SEQI    (cx.)rd, r1, #i     if (r1 = #i) then (cx.)rd ← 1 else (cx.)rd ← 0
SEQUI   (cx.)rd, r1, #i     if (r1 = #i) then (cx.)rd ← 1 else (cx.)rd ← 0
SGE     (cx.)rd, r1, r2     if (r1 ≥ r2) then (cx.)rd ← 1 else (cx.)rd ← 0
SGEI    (cx.)rd, r1, #i     if (r1 ≥ #i) then (cx.)rd ← 1 else (cx.)rd ← 0
SGEU    (cx.)rd, r1, r2     if (r1 ≥ r2) then (cx.)rd ← 1 else (cx.)rd ← 0
SGEUI   (cx.)rd, r1, #i     if (r1 ≥ #i) then (cx.)rd ← 1 else (cx.)rd ← 0
SGT     (cx.)rd, r1, r2     if (r1 > r2) then (cx.)rd ← 1 else (cx.)rd ← 0
SGTI    (cx.)rd, r1, #i     if (r1 > #i) then (cx.)rd ← 1 else (cx.)rd ← 0
SGTU    (cx.)rd, r1, r2     if (r1 > r2) then (cx.)rd ← 1 else (cx.)rd ← 0
SGTUI   (cx.)rd, r1, #i     if (r1 > #i) then (cx.)rd ← 1 else (cx.)rd ← 0
SLE     (cx.)rd, r1, r2     if (r1 ≤ r2) then (cx.)rd ← 1 else (cx.)rd ← 0
SLEI    (cx.)rd, r1, #i     if (r1 ≤ #i) then (cx.)rd ← 1 else (cx.)rd ← 0
SLEU    (cx.)rd, r1, r2     if (r1 ≤ r2) then (cx.)rd ← 1 else (cx.)rd ← 0
SLEUI   (cx.)rd, r1, #i     if (r1 ≤ #i) then (cx.)rd ← 1 else (cx.)rd ← 0
SLL     (cx.)rd, r1, r2     (cx.)rd ← r1 ≪ r2
SLLI    (cx.)rd, r1, #i     (cx.)rd ← r1 ≪ #i
SLT     (cx.)rd, r1, r2     if (r1 < r2) then (cx.)rd ← 1 else (cx.)rd ← 0
SLTI    (cx.)rd, r1, #i     if (r1 < #i) then (cx.)rd ← 1 else (cx.)rd ← 0
SLTU    (cx.)rd, r1, r2     if (r1 < r2) then (cx.)rd ← 1 else (cx.)rd ← 0
SLTUI   (cx.)rd, r1, #i     if (r1 < #i) then (cx.)rd ← 1 else (cx.)rd ← 0
SNE     (cx.)rd, r1, r2     if (r1 ≠ r2) then (cx.)rd ← 1 else (cx.)rd ← 0
SNEI    (cx.)rd, r1, #i     if (r1 ≠ #i) then (cx.)rd ← 1 else (cx.)rd ← 0
SNEUI   (cx.)rd, r1, #i     if (r1 ≠ #i) then (cx.)rd ← 1 else (cx.)rd ← 0
SRA     (cx.)rd, r1, r2     (cx.)rd ← r1 ≫ r2 (arithmetic)
SRAI    (cx.)rd, r1, #i     (cx.)rd ← r1 ≫ #i (arithmetic)
SRL     (cx.)rd, r1, r2     (cx.)rd ← r1 ≫ r2 (logical)
SRLI    (cx.)rd, r1, #i     (cx.)rd ← r1 ≫ #i (logical)
SUB     (cx.)rd, r1, r2     (cx.)rd ← r1 − r2
SUBI    (cx.)rd, r1, #i     (cx.)rd ← r1 − #i
SUBU    (cx.)rd, r1, r2     (cx.)rd ← r1 − r2
SUBUI   (cx.)rd, r1, #i     (cx.)rd ← r1 − #i
SW      rd, r1, #i          M[r1 + #i] ← rd
TNS     (cx.)rd, r1, #i     (cx.)rd ← M[r1 + #i] and M[r1 + #i] ← 1
XOR     (cx.)rd, r1, r2     (cx.)rd ← r1 ⊕ r2
XORI    (cx.)rd, r1, #i     (cx.)rd ← r1 ⊕ #i


The following instruction descriptions have not been included in the above table due to their complexity. We include examples here to further clarify their functionality. Note that bytes and shorts are numbered starting from the most significant end of the word, as the examples illustrate.

EXTB (cx.)rd, r1, r2 – Extract byte. Use bits 0 and 1 of r2 to determine which byte of r1 to place into rd. The immediate version performs the selection using the immediate operand.

Example: extb rd, r1, r2
    r1 = 0xb3b2b1b0
    r2 = 0x00000003
    --> rd = 0x000000b0

EXTS (cx.)rd, r1, r2 – Extract short. Use bit 1 of r2 to determine which short of r1 to place into rd. The immediate version performs the selection using the immediate operand.

Example: exts rd, r1, r2
    r1 = 0xb3b2b1b0
    r2 = 0x00000002
    --> rd = 0x0000b1b0

INSB (cx.)rd, r1, r2 – Insert byte. Use bits 0 and 1 of r2 to determine which byte of rd to place r1 into. The immediate version performs the selection using the immediate operand.

Example: insb rd, r1, r2
    r1 = 0x000000ff
    r2 = 0x00000002
    --> rd = 0x0000ff00

INSS (cx.)rd, r1, r2 – Insert short. Use bit 1 of r2 to determine which short of rd to place r1 into. The immediate version performs the selection using the immediate operand.

Example: inss rd, r1, r2
    r1 = 0x0000b1b0
    r2 = 0x00000000
    --> rd = 0xb1b00000

MASKB (cx.)rd, r1, r2 – Mask byte. Use bits 0 and 1 of r2 to determine which byte of r1 to mask out, and place the result into rd. The immediate version performs the selection using the immediate operand.

Example: maskb rd, r1, r2
    r1 = 0xb3b2b1b0
    r2 = 0x00000001
    --> rd = 0xb300b1b0

MASKS (cx.)rd, r1, r2 – Mask short. Use bit 1 of r2 to determine which short of r1 to mask out, and place the result into rd. The immediate version performs the selection using the immediate operand.

Example: masks rd, r1, r2
    r1 = 0xb3b2b1b0
    r2 = 0x00000002
    --> rd = 0xb3b20000
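
To make the byte numbering concrete, the following C sketch (not part of the simulator source; the function names and the most-significant-byte-first numbering are our reading of the examples above) emulates the EXTB, INSB, and MASKB semantics:

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Byte positions count from the most significant byte:
 * sel = 0 selects bits 31..24, sel = 3 selects bits 7..0. */
static uint32_t shift_for(uint32_t sel) { return 8u * (3u - (sel & 0x3u)); }

/* EXTB: place the selected byte of r1 into the low byte of the result. */
static uint32_t extb(uint32_t r1, uint32_t r2) {
    return (r1 >> shift_for(r2)) & 0xffu;
}

/* INSB: place the low byte of r1 into the selected byte position
 * (other bytes zero, matching the worked example). */
static uint32_t insb(uint32_t r1, uint32_t r2) {
    return (r1 & 0xffu) << shift_for(r2);
}

/* MASKB: clear (mask out) the selected byte of r1. */
static uint32_t maskb(uint32_t r1, uint32_t r2) {
    return r1 & ~(0xffu << shift_for(r2));
}

int main(void) {
    /* Reproduces the worked examples: expect 000000b0, 0000ff00, b300b1b0. */
    printf("%08" PRIx32 "\n", extb(0xb3b2b1b0u, 3u));
    printf("%08" PRIx32 "\n", insb(0x000000ffu, 2u));
    printf("%08" PRIx32 "\n", maskb(0xb3b2b1b0u, 1u));
    return 0;
}

The short variants (EXTS, INSS, MASKS) work the same way, except that bit 1 of the selector chooses one of the two 16-bit halves.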



REFERENCES

[1] V. Agarwal, M. S. Hrishikesh, S. W. Keckler, and D. Burger, "Clock rate versus IPC: The end of the road for conventional microarchitectures," in Proceedings of the International Symposium on Computer Architecture, 2000, pp. 248–259.

[2] R. Ho, K. W. Mai, and M. A. Horowitz, "The future of wires," Proceedings of the IEEE, vol. 89, pp. 490–504, April 2001.

[3] M. Fillo, S. W. Keckler, W. J. Dally, N. P. Carter, A. Chang, Y. Gurevich, and W. S. Lee, "The M-Machine multicomputer," in Proceedings of the Annual International Symposium on Microarchitecture, 1995, pp. 172–180.

[4] M. Papamarcos and J. Patel, "A low overhead coherence solution for multiprocessors with private cache memories," in Proceedings of the International Symposium on Computer Architecture, 1984, pp. 348–354.

[5] J. Huh, D. Burger, and S. Keckler, "Exploring the design space of future CMPs," in Proceedings of the International Conference on Parallel Architectures and Compilation Techniques, 2001, pp. 199–210.

[6] R. Nagarajan, K. Sankaralingam, D. Burger, and S. Keckler, "A design space evaluation of grid processor architectures," in Proceedings of the International Symposium on Microarchitecture, 2001, pp. 40–51.

[7] L. Hammond, B. Hubbert, M. Siu, M. Prabhu, M. Chen, and K. Olukotun, "The Stanford Hydra CMP," IEEE Micro, vol. 20, pp. 71–84, March/April 2000.

[8] V. Krishnan and J. Torrellas, "A chip multiprocessor architecture with speculative multithreading," IEEE Transactions on Computers, Special Issue on Multithreaded Architecture, vol. 48, pp. 866–880, September 1999.

[9] G. S. Sohi, S. E. Breach, and T. N. Vijaykumar, "Multiscalar processors," in Proceedings of the International Symposium on Computer Architecture, 1995, pp. 414–425.

[10] S. W. Keckler, W. J. Dally, D. Maskit, N. P. Carter, A. Chang, and W. S. Lee, "Exploiting fine-grain thread level parallelism on the MIT Multi-ALU Processor," in Proceedings of the Annual International Symposium on Computer Architecture, 1998, pp. 306–317.

[11] K. I. Farkas, P. Chow, N. P. Jouppi, and Z. Vranesic, "The multicluster architecture: Reducing cycle time through partitioning," in Proceedings of the International Symposium on Microarchitecture, 1997, pp. 149–159.

[12] S. Palacharla, N. P. Jouppi, and J. Smith, "Complexity-effective superscalar processors," in Proceedings of the International Symposium on Computer Architecture, 1997, pp. 206–218.

[13] R. Canal, J.-M. Parcerisa, and A. González, "A cost-effective clustered architecture," in Proceedings of the International Conference on Parallel Architectures and Compilation Techniques (PACT 99), 1999, pp. 160–168.

[14] M. Tremblay, J. Chan, S. Chaudhry, A. Conigliaro, and S. Tse, "The MAJC architecture: A synthesis of parallelism and scalability," IEEE Micro, vol. 20, pp. 12–25, November/December 2000.

[15] R. E. Kessler, "The Alpha 21264 microprocessor," IEEE Micro, vol. 19, pp. 24–36, April 1999.

[16] K. Yeager, "The MIPS R10000 superscalar microprocessor," IEEE Micro, vol. 16, pp. 28–41, April 1996.

[17] J. M. Tendler, S. Dodson, S. Fields, H. Le, and B. Sinharoy, "POWER4 system microarchitecture," October 2001, http://www-1.ibm.com/servers/eserver/pseries/hardware/whitepapers/power4.pdf.

[18] M. B. Taylor, J. Kim, J. Miller, D. Wentzlaff, F. Ghodrat, B. Greenwald, H. Hoffmann, P. Johnson, J.-W. Lee, W. Lee, A. Ma, A. Saraf, M. Seneski, N. Shnidman, V. Strumpen, M. Frank, S. Amarasinghe, and A. Agarwal, "The Raw microprocessor: A computational fabric for software circuits and general purpose programs," IEEE Micro, vol. 22, pp. 25–35, March/April 2002.

[19] K. Sankaralingam, R. Nagarajan, H. Liu, C. Kim, and J. Huh, "Exploiting ILP, TLP, and DLP with the polymorphous TRIPS architecture," in Proceedings of the International Symposium on Computer Architecture, 2003.

[20] J.-M. Parcerisa, J. Sahuquillo, A. González, and J. Duato, "Efficient interconnects for clustered microarchitectures," in Proceedings of the International Conference on Parallel Architectures and Compilation Techniques, 2002, pp. 181–191.

[21] A. Aggarwal and M. Franklin, "Hierarchical interconnects for on-chip clustering," in Proceedings of the International Parallel and Distributed Processing Symposium, 2002.

[22] J. L. Hennessy and D. A. Patterson, Computer Architecture: A Quantitative Approach, 2nd ed. San Francisco, CA: Morgan Kaufmann Publishers, Inc., 1996.

[23] J. D. Walstrom, "The design of the Amalgam reconfigurable cluster," M.S. thesis, University of Illinois at Urbana-Champaign, 2002.

[24] D. B. Gottlieb, J. J. Cook, J. D. Walstrom, S. Ferrera, C.-W. Wang, and N. P. Carter, "Clustered programmable-reconfigurable processors," in Proceedings of the IEEE International Conference on Field Programmable Technology, 2002, pp. 134–141.

[25] L. L. Peterson and B. S. Davie, Computer Networks: A Systems Approach. San Francisco, CA: Morgan Kaufmann Publishers, Inc., 1996.

[26] "The international technology roadmap for semiconductors," Semiconductor Industry Association, 2001, http://public.itrs.net/Files/2001ITRS/Home.htm.

[27] S.-C. Wong, G.-Y. Lee, and D.-J. Ma, "Modeling of interconnect capacitance, delay, and crosstalk in VLSI," IEEE Transactions on Semiconductor Manufacturing, vol. 13, pp. 108–111, February 2000.

[28] D. Sylvester and K. Keutzer, "Rethinking deep-submicron circuit design," IEEE Computer, vol. 32, pp. 25–33, November 1999.

[29] S. Gupta, S. W. Keckler, and D. Burger, "Technology independent area and delay estimates for microprocessor building blocks," Department of Computer Sciences, The University of Texas at Austin, Tech. Rep. 2000-05, 2000.

[30] P. Shivakumar and N. P. Jouppi, "CACTI 3.0: An integrated cache timing, power, and area model," Compaq Western Research Laboratory, Tech. Rep. 2001/2, August 2001.

[31] J. J. Cook, "The Amalgam compiler infrastructure," M.S. thesis, University of Illinois at Urbana-Champaign, 2004.

[32] R. Ulichney, Digital Halftoning. Cambridge, MA: MIT Press, 1987.

[33] D. T. Hoang, "Searching genetic databases on Splash 2," in IEEE Workshop on FPGAs for Custom Computing Machines, 1993, pp. 185–192.

[34] D. Sankoff and J. Kruskal, Eds., Time Warps, String Edits, and Macromolecules: The Theory and Practice of Sequence Comparison. Reading, MA: Addison-Wesley, 1983.

[35] "The GNU Radio project," 2003, http://www.gnu.org/software/gnuradio/.

[36] C. Lee, M. Potkonjak, and W. H. Mangione-Smith, "MediaBench: A tool for evaluating and synthesizing multimedia and communication systems," in Proceedings of the International Symposium on Microarchitecture, 1997, pp. 330–335.

[37] "Specification for the Advanced Encryption Standard (AES)," Federal Information Processing Standards Publication 197, February 2001, http://csrc.nist.gov/publications/fips/fips197/fips-197.pdf.

[38] R. Dakin, "Dakin branch-and-bound TSP algorithm," July 1997, http://www.pcug.org.au/ dakin/tspbb.htm.

[39] R. B. Kujoth, C.-W. Wang, D. B. Gottlieb, J. J. Cook, and N. P. Carter, "A reconfigurable unit for a clustered programmable-reconfigurable processor," in Proceedings of the ACM International Symposium on Field-Programmable Gate Arrays, 2004, pp. 200–209.
