
For speed, one starts with the native speed of compiled C++. Software virtual platforms use sophisticated technology like just-in-time cross-compilation to deliver impressive execution speeds, for example booting an entire guest operating system in seconds. TLM models avoid the overheads of event-driven simulation (by sticking to SC_METHOD and avoiding SC_THREAD and SC_CTHREAD). If they must use events, they play tricks like “temporal decoupling” to amortize the overhead.
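The SC_METHOD-versus-SC_THREAD trade-off can be seen in a minimal SystemC sketch (the module and signal names here are illustrative, not taken from any particular virtual platform): an SC_METHOD process is a plain callback that runs to completion, whereas an SC_THREAD keeps a coroutine stack and pays a context switch at every wait().

    // Minimal SystemC sketch: all work happens in an SC_METHOD callback,
    // so the simulation kernel never pays thread-context-switch overhead.
    #include <systemc.h>

    SC_MODULE(FastTarget) {
        sc_in<bool>          req;
        sc_out<sc_uint<32>>  count;

        void handle_req() {
            // Runs to completion on every change of 'req'; no wait(), no stack.
            count.write(count.read() + 1);
        }

        SC_CTOR(FastTarget) {
            SC_METHOD(handle_req);     // cheap callback
            sensitive << req;
            dont_initialize();
            // An SC_THREAD with wait() calls would model timing more directly,
            // but every wait() costs a coroutine context switch.
        }
    };

    int sc_main(int, char**) {
        sc_signal<bool>         req("req");
        sc_signal<sc_uint<32>>  count("count");
        FastTarget t("t");
        t.req(req);
        t.count(count);
        sc_start(SC_ZERO_TIME);
        req.write(true);               // trigger one "transaction"
        sc_start(10, SC_NS);
        return 0;
    }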

While all these tricks are laudable (and perhaps adequate for some modeling purposes), performance unfortunately drops steeply as one adds almost any architectural detail, such as actual parallelism, interconnect contention, memory or I/O bottlenecks, cycle-accurate device models, and so on. Performance also drops steeply when one adds any instrumentation at all (which is necessary for performance modeling). Further, since the basic substrate is C++, it is extraordinarily difficult to achieve any speedup by exploiting multi-core hosts.

In summary, as you add architectural detail (for accuracy), you lose speed. Development time increases sharply as programmers contort the code, applying tricks to restore speed.

One attempted solution is to use FPGA-based execution for the architecturally detailed component(s) of the system model. This remains problematic: there are the logistical difficulties of dealing with multiple languages and paradigms for different components; the software side may still run too slowly; bandwidth constraints in communicating between the software and FPGA sides may limit speed in any case; and visibility into the FPGA-side components is limited. Further, conventional approaches have problems in creating the FPGA-side components at all (see the synthesis discussion below).

High-Level Synthesis (HLS) from C++ and SystemC is often held out in the industry as a potential way forward. Unfortunately, there are serious problems in using C++ and SystemC as a starting point. A central issue is architecture, architecture, architecture.

It is well understood that the key to high-performance software is good algorithms. Further, good algorithms are designed by people with training and experience, not by tools (a compiler only optimizes an algorithm we write; it does not improve the algorithm itself). Similarly, the key to high-performance hardware (measured in area, speed, and power) is good architecture. Hence, it is crucial to give designers the maximum power to express architecture; all the rest can be left to tools.

Unfortunately, HLS from C++ does exactly the opposite: it obscures architecture. These HLS tools try to choose an architecture based on internal heuristics. Designers have indirect, second-order controls on this activity, expressed as “constraints” such as bill-of-materials limits for hardware resources and pragmas for loop transformations (unrolling, peeling, fusion, etc.). It is hard (and often involves guesswork) to “steer” these tools towards a good architecture.
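As a concrete illustration of how indirect these controls are, here is a small, purely illustrative C kernel annotated in the style of one commercial HLS tool (the pragma spellings follow Xilinx Vitis HLS conventions; other tools use different directives with similar effect). The designer can only hint at unrolling and pipelining; the actual datapath, schedule, and resource sharing are chosen by the tool's heuristics, and the hints may simply not be honored if the tool's analysis cannot support them.

    // Illustrative HLS kernel: the architecture is requested via pragmas,
    // not expressed directly by the designer.
    void scale(const int in[64], int out[64], int k) {
        for (int i = 0; i < 64; ++i) {
    #pragma HLS UNROLL factor=4     // hint: replicate the loop body 4 ways
    #pragma HLS PIPELINE II=1       // hint: start a new iteration every cycle
            // Whether these hints are honored depends on array ports,
            // dependences, and the tool's internal heuristics.
            out[i] = in[i] * k;
        }
    }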

HLS from C++ has another serious limitation. It is fundamentally an exercise in “automatic parallelization”, i.e., in automatically transforming a sequential program into a parallel implementation. This problem has been well studied since the days of the earliest supercomputers such as the CDC 6600 and the Cray-1, and even there it works only moderately well, and only for “loop-and-array” codes. This is because the key technologies needed for automatic parallelization, namely dependence analysis, alias analysis, etc., on the CDFG (Control/Data Flow Graph), have been shown to be feasible only for simple loops, often statically bounded, working on arrays indexed by simple (affine, linear) combinations of the loop indexes. Such loop-and-array computations of course exist in a few components of an SoC, notably the signal-processing ones, but are absent in the majority of the SoC's components, which are more “control-oriented” (processors, memory systems, interconnects, DMAs and data movers, I/O peripherals, etc.). Even in the DSP space, modern algorithms are becoming more control-oriented in order to adapt to noise (in signal processing) and to actual content (in audio and video codecs).
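The distinction can be made concrete with two tiny, purely illustrative C++ fragments: the first is the kind of statically bounded, affine-indexed loop that dependence analysis handles well; the second is the kind of pointer-chasing, data-dependent code that dominates control-oriented SoC components and defeats automatic parallelization.

    // (a) Loop-and-array code: fixed trip count, and each iteration touches
    //     only element i (an affine function of the loop index), so the
    //     iterations can be proven independent and parallelized automatically.
    void affine_kernel(float a[1024], const float b[1024]) {
        for (int i = 0; i < 1024; ++i)
            a[i] = 0.5f * a[i] + 0.5f * b[i];
    }

    // (b) Control-oriented code: the trip count and memory-access pattern
    //     depend on run-time data, so alias and dependence analysis cannot
    //     bound what each iteration touches, and parallelization gives up.
    struct Node { int key; Node* next; };

    int count_matches(const Node* head, int key) {
        int n = 0;
        for (const Node* p = head; p != nullptr; p = p->next)
            if (p->key == key)
                ++n;
        return n;
    }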

Finally, HLS from C++ has yet another serious limitation. It is well known that, for the same computation, one may choose different algorithms depending on the computation model and the associated cost model of the platform. This is why there are entire textbooks on parallel algorithms: the best parallel algorithm for a problem is often quite different from the best sequential one.
