Reconfigurable systems in terms of computer architectures
Reconfigurable systems in terms of computer architectures
Reconfigurable systems in terms of computer architectures
Create successful ePaper yourself
Turn your PDF publications into a flip-book with our unique Google optimized e-Paper software.
<strong>Reconfigurable</strong> Systems <strong>in</strong> <strong>terms</strong><br />
<strong>of</strong> Computer Architectures<br />
Kees A. Vissers<br />
Research Fellow, UC Berkeley<br />
vissers@ieee.org<br />
www.eecs.berkeley.edu/~vissers<br />
July 10. 2003
MPSOC 2003, 7-11 July, Chamonix, France, K. Vissers 2-2
Overview<br />
Introduction<br />
Video Signal Processors, Network <strong>of</strong> Alus<br />
Dynamic Reconfiguration<br />
Conclusions<br />
MPSOC 2003, 7-11 July, Chamonix, France, K. Vissers 2-3
System on a chip<br />
Silicon Technology is provid<strong>in</strong>g the opportunity to<br />
add new functionality and <strong>in</strong>tegrate several functions<br />
and allow more programmable/reconfigurable <strong>systems</strong>.<br />
Hardware, dedicated solutions<br />
S<strong>of</strong>tware, programmable solutions<br />
First System Second System<br />
time<br />
MPSOC 2003, 7-11 July, Chamonix, France, K. Vissers 2-4
Theo Claasen: keynote Dac2000<br />
Rapid silicon prototyp<strong>in</strong>g<br />
Design cycle benefits<br />
Conventional Design Process sttime (1 right Si)<br />
RSP Design Process<br />
HW Development and Validation<br />
Production<br />
Release<br />
Placement, Rout<strong>in</strong>g & Physical Verification<br />
Silicon Fabrication<br />
Production<br />
Release<br />
SW Development and Validation<br />
• Faster chip development<br />
– More than 50% total design cycle reduction<br />
– True HW / SW co -development<br />
Higher probability <strong>of</strong> -pass first success<br />
MPSOC 2003, 7-11 July, Chamonix, France, K. Vissers 2-5<br />
PS - Theo ClaasenDAC 2000 -14
Future: The prototype = product<br />
Trade-<strong>of</strong>f between<br />
HW/SW trade-<strong>of</strong>f: Processor and reconfigurable fabric<br />
Granularity issue: f<strong>in</strong>e gra<strong>in</strong> process<strong>in</strong>g ILP, or Task level<br />
bit oriented – word oriented<br />
distributed on chip memory<br />
Large number <strong>of</strong> architectural choices:<br />
Which is ‘the right one’ for a particular application doma<strong>in</strong>?<br />
MPSOC 2003, 7-11 July, Chamonix, France, K. Vissers 2-6
Problem def<strong>in</strong>ition<br />
Perform process<strong>in</strong>g on a stream <strong>of</strong> data<br />
Sampl<strong>in</strong>g rates <strong>in</strong> the order <strong>of</strong> 100KHz to 100MHz<br />
per sample 1000 – 100,000 operations<br />
Perform 10 10- 10 11 operations/sec, like add, multiply etc.<br />
Take a 200MHz Alu, still need 50-500 <strong>of</strong> them,<br />
Solution: Time multiplex and multi-processor<br />
Note: performance is a required part <strong>of</strong> the solution!<br />
MPSOC 2003, 7-11 July, Chamonix, France, K. Vissers 2-7
Revisit processor <strong>architectures</strong><br />
S<strong>in</strong>gle Risc like processor<br />
ILP process<strong>in</strong>g: VLIW for DSP, superscalar for general<br />
purpose<br />
Very successful programm<strong>in</strong>g environment: compilers and<br />
OS.<br />
Multi processor: no clear w<strong>in</strong>n<strong>in</strong>g model, suggested to<br />
move to chapter 11 <strong>in</strong> Computer Architecture, a<br />
quantitative approach from Hennessy and Patterson<br />
Vector process<strong>in</strong>g<br />
MPSOC 2003, 7-11 July, Chamonix, France, K. Vissers 2-8
What are the solution options<br />
Concurrently execute 50- 500 operations every clock cycle.<br />
Multiple Risc cores, e.g. ARM, MIPS etc.<br />
Multiple VLIW oriented DSPs, e.g TI, Starcore etc<br />
Build a bit oriented FPGA and synthesize everyth<strong>in</strong>g on top<br />
<strong>of</strong> that, <strong>in</strong>clud<strong>in</strong>g processors cores, packet rout<strong>in</strong>g<br />
networks etc.<br />
Build a fabric <strong>of</strong> <strong>in</strong>terconnected ALUs (coarse gra<strong>in</strong>ed<br />
FPGA)<br />
SoC platforms exploit<strong>in</strong>g the best part for the specific<br />
application (part).<br />
MPSOC 2003, 7-11 July, Chamonix, France, K. Vissers 2-9
Multi processor challenges<br />
Programm<strong>in</strong>g language problem: no concurrency<br />
Limited extraction <strong>of</strong> ILP out <strong>of</strong> sequential program<br />
MPSOC 2003, 7-11 July, Chamonix, France, K. Vissers 2-10
Why can we program processors?<br />
Model <strong>of</strong> Computation matched to the Model <strong>of</strong> the<br />
Architecture<br />
C/C++: s<strong>in</strong>gle po<strong>in</strong>t <strong>of</strong> control, loop : Memory model,<br />
branch<br />
CSP (Hoare), Occam(2): Semantics <strong>of</strong> handshake <strong>in</strong><br />
Hardware: transputer channel<br />
Kahn Process Networks: distributed fifo organization,<br />
semantics <strong>in</strong> hardware or s<strong>of</strong>tware<br />
MPSOC 2003, 7-11 July, Chamonix, France, K. Vissers 2-11
<strong>Reconfigurable</strong> programm<strong>in</strong>g challenges<br />
In general 3 problems;<br />
1. Programm<strong>in</strong>g a network <strong>of</strong> ALUs/ bit cells <strong>in</strong> FPGA<br />
2. Sett<strong>in</strong>g up the buffers and partition<strong>in</strong>g for the network<br />
3. Dynamic reconfiguration management<br />
MPSOC 2003, 7-11 July, Chamonix, France, K. Vissers 2-12
Architecture Design: Y - Chart<br />
Benchmark driven, based on a Set <strong>of</strong> Applications<br />
Architecture Applications<br />
Mapp<strong>in</strong>g<br />
Performance<br />
Analysis<br />
Performance<br />
Data<br />
MPSOC 2003, 7-11 July, Chamonix, France, K. Vissers 2-13
What is an <strong>in</strong>struction set processor<br />
C/C++, Java programm<strong>in</strong>g<br />
Program control translated to branches (most <strong>of</strong> the time)<br />
for<br />
if<br />
case statements<br />
S<strong>in</strong>gle Program counter<br />
Data cache and Instruction cache<br />
Time-multiplex with <strong>in</strong>structions over ALUs<br />
Load, Store architecture, conta<strong>in</strong>s a Register File<br />
Debug with s<strong>in</strong>gle stepp<strong>in</strong>g, breakpo<strong>in</strong>ts and register views<br />
MPSOC 2003, 7-11 July, Chamonix, France, K. Vissers 2-14
Multi-processors<br />
Multiple <strong>in</strong>struction set processors:<br />
programmers model?<br />
cache coherence?<br />
granularity at the <strong>in</strong>struction level required<br />
Instruction Level parallelism limited to 4-5<br />
branch penalty <strong>in</strong> cycles<br />
Operand rout<strong>in</strong>g and memory hierarchy are the cost<br />
load-store <strong>in</strong>structions 30% <strong>of</strong> all <strong>in</strong>structions<br />
L1 cache is half <strong>of</strong> the processor area<br />
cache works poorly for stream oriented comput<strong>in</strong>g<br />
MPSOC 2003, 7-11 July, Chamonix, France, K. Vissers 2-15
The Programmability Issue<br />
Extract f<strong>in</strong>e gra<strong>in</strong> Parallelism out C/C++/ Matlab/Java<br />
alias analysis and po<strong>in</strong>ter problem<br />
Extract coarse gra<strong>in</strong> Parallelism out C/C++/ Matlab/Java<br />
process notion<br />
Start with an f<strong>in</strong>e gra<strong>in</strong> Parallel description<br />
Hardware description languages (VHDL/Verilog)<br />
Simul<strong>in</strong>k environment, Signal Flow Graphs<br />
Start with an coarse gra<strong>in</strong> Parallel description<br />
CSP, occam<br />
Kahn Process Networks<br />
System C<br />
...<br />
MPSOC 2003, 7-11 July, Chamonix, France, K. Vissers 2-16
<strong>in</strong>puts<br />
Exampe VSP architecture, 12-bit Elements<br />
ALE<br />
core<br />
ALEs<br />
ME<br />
core<br />
MEs<br />
BEs OEs<br />
P P<br />
P P<br />
outputs<br />
switch<br />
matrix<br />
MPSOC 2003, 7-11 July, Chamonix, France, K. Vissers 2-17<br />
silo<br />
program
VSP1 and VSP2 Layouts<br />
MPSOC 2003, 7-11 July, Chamonix, France, K. Vissers 2-18
Example programm<strong>in</strong>g VSP with SFG<br />
l<strong>in</strong>e delays for<br />
lum<strong>in</strong>ance signal<br />
match<strong>in</strong>g l<strong>in</strong>e delays<br />
lum<strong>in</strong>ance signal<br />
horizontal filter section<br />
vertical filter section<br />
sync signal<br />
chrom<strong>in</strong>ance signals<br />
contour level<br />
control<br />
clip function<br />
MPSOC 2003, 7-11 July, Chamonix, France, K. Vissers 2-19
Signal Flow Graph Mapp<strong>in</strong>g<br />
Retim<strong>in</strong>g, Delay management<br />
Partition<strong>in</strong>g, which part runs on which IC<br />
Schedul<strong>in</strong>g, and processor allocation<br />
All for statically scheduled multi-rate signal flow graphs<br />
Basic operation translates to basic <strong>in</strong>struction<br />
explicitly programmed data memory storage<br />
no po<strong>in</strong>ters, no branches<br />
<strong>in</strong>herent stream semantics<br />
Used for complete HDTV studio Camera Development, and<br />
next generation electronic X-ray equipment at Philips<br />
MPSOC 2003, 7-11 July, Chamonix, France, K. Vissers 2-20
Chameleon Second generation<br />
Network <strong>of</strong> <strong>in</strong>terconnected ALUs (close to a VLIW)<br />
Explicit retim<strong>in</strong>g methodology<br />
Programm<strong>in</strong>g <strong>in</strong> Simul<strong>in</strong>k<br />
Buffer programm<strong>in</strong>g explicit with hardware specific choices<br />
Benchmark results for video and next generation cell<br />
phones, base stations etc.<br />
MPSOC 2003, 7-11 July, Chamonix, France, K. Vissers 2-21
Fabric<br />
TILE<br />
Fabric<br />
TILE<br />
Pegasus Top Level<br />
Example Chameleon 6 tile Architecture<br />
High-Speed I/O Bus1<br />
(SB 128-bit 200 MHz)<br />
{ = 3.2 GB/s }<br />
High-Speed I/O Bus2<br />
(SB 128-bit 200 MHz)<br />
{ = 3.2 GB/s }<br />
Rapid IO<br />
port<br />
RAPID<br />
IO<br />
BLOCK<br />
Viterbi<br />
Decoder<br />
RAPID<br />
IO<br />
BLOCK<br />
Rapid IO<br />
port<br />
Fabric Memory Access Bus<br />
(SB 128-bit 200 MHz)<br />
{ = 3.2 GB/S }<br />
Turbo<br />
Decoder<br />
PCIX<br />
port<br />
PCIX<br />
IF<br />
BLOCK<br />
HIGH-SPEED<br />
MPIO<br />
HS-MPIO<br />
port<br />
Fabric<br />
TILE<br />
Fabric<br />
TILE<br />
Rapid IO<br />
port<br />
RAPID<br />
IO<br />
BLOCK<br />
Fabric<br />
TILE<br />
Fabric<br />
TILE<br />
RAPID<br />
IO<br />
BLOCK<br />
Rapid IO<br />
port<br />
SB<br />
Bridge<br />
SB<br />
Bridge<br />
SB 128-bit 200 MHz<br />
{ = 3.2 GB/s }<br />
SB<br />
Bridge<br />
SB<br />
Bridge<br />
USB 2.0<br />
DEV I/F<br />
USB<br />
2.0<br />
host<br />
DDR<br />
DRAM<br />
Controller<br />
Memory Bus<br />
(SB 128-bit 200 MHz)<br />
{ = 3.2 GB/S }<br />
MPSOC 2003, 7-11 July, Chamonix, France, K. Vissers 2-22<br />
SB<br />
Bridge<br />
USB 2.0<br />
HOST I/F<br />
USB<br />
2.0<br />
dev ice<br />
DMA<br />
SB<br />
Bridge<br />
AC97<br />
I/F<br />
AC97<br />
audio port<br />
Flash<br />
EPROM<br />
Parallel<br />
Flash<br />
I/F<br />
DDR<br />
DRAM<br />
JTAG<br />
I/F<br />
JTAG<br />
port<br />
Serial<br />
Flash<br />
EPROM<br />
Serial<br />
Flash<br />
I/F<br />
GPIO<br />
BLOCK<br />
GPIO<br />
(64 PINS)<br />
eJTAG<br />
port<br />
MIPS 64K<br />
CPU<br />
5f<br />
Cache<br />
Processor Bus<br />
(SB 64-bit 200 MHz)<br />
{ = 1.6 GB/S }<br />
LOW-SPEED<br />
MPIO<br />
LS-MPIO<br />
port<br />
Low-Speed I/O Bus<br />
(SB 64-bit 100 MHz)<br />
{ = 800 MB/S }
Tile Architecture<br />
16 DPUs with full Interconnect<br />
4 16KByte Memories<br />
DMA to System<br />
32 Connections to nearest Neighbor<br />
MPSOC 2003, 7-11 July, Chamonix, France, K. Vissers 2-23
DPU datapath +control (conceptual)<br />
MPSOC 2003, 7-11 July, Chamonix, France, K. Vissers 2-24
DPU datapath Architecture (conceptual)<br />
MPSOC 2003, 7-11 July, Chamonix, France, K. Vissers 2-25
Programm<strong>in</strong>g the System<br />
Resource Access<br />
DPU Simul<strong>in</strong>k Diagram Untimed<br />
Control Simul<strong>in</strong>k Diagram Untimed<br />
Fabric Memory mapped<br />
CPU C code L<strong>in</strong>ked via eFlow/INTs<br />
Peripherals Memory mapped/DMA<br />
Memories Memory mapped/DMA<br />
Programm<strong>in</strong>g Options<br />
Enter control and dataflow <strong>in</strong> Simul<strong>in</strong>k<br />
C code entry with reference to dataflow kernels <strong>in</strong> fabric<br />
Kernels are built with Simul<strong>in</strong>k<br />
MPSOC 2003, 7-11 July, Chamonix, France, K. Vissers 2-26
Simul<strong>in</strong>k based design entry <strong>systems</strong><br />
Integrated entry and simulation environment<br />
MPSOC 2003, 7-11 July, Chamonix, France, K. Vissers 2-27
Detail - Schedul<strong>in</strong>g and Mapp<strong>in</strong>g<br />
DFS<br />
DFS = Data Flow Scheduler<br />
DPU Pack<strong>in</strong>g<br />
Retime<br />
Delay DPU Allocation<br />
P+R<br />
Retime<br />
Fold<br />
Delay DPU Allocation<br />
depopulate<br />
Ctl Gen<br />
Bit Gen<br />
MPSOC 2003, 7-11 July, Chamonix, France, K. Vissers 2-28
<strong>Reconfigurable</strong> programm<strong>in</strong>g challenges<br />
In general 3 problems;<br />
1. Programm<strong>in</strong>g a network <strong>of</strong> ALUs/ bit cell <strong>in</strong> FPGA<br />
2. Sett<strong>in</strong>g up the buffers and partition<strong>in</strong>g for the network<br />
3. Dynamic reconfiguration management<br />
MPSOC 2003, 7-11 July, Chamonix, France, K. Vissers 2-29
Architecture Design: Y - Chart<br />
Benchmark driven, based on a Set <strong>of</strong> Applications<br />
Architecture Applications<br />
Mapp<strong>in</strong>g<br />
Performance<br />
Analysis<br />
Performance<br />
Data<br />
MPSOC 2003, 7-11 July, Chamonix, France, K. Vissers 2-30
Recommendation: Kahn Process Networks<br />
Formal semantics, theory and ‘pro<strong>of</strong> <strong>of</strong> concepts’ well<br />
established<br />
very well suited for stream oriented high-performance<br />
signal process<strong>in</strong>g, <strong>in</strong>clud<strong>in</strong>g dynamic dataflow.<br />
abstraction from specific Hardware, Operat<strong>in</strong>g System and<br />
device drivers<br />
Extensive prototypes available: UC Berkeley, Caltech,<br />
Leiden University, Philips Research<br />
Compute result is <strong>in</strong>dependent <strong>of</strong> schedule<br />
C++ stylized <strong>in</strong>dustry standard proposed <strong>in</strong> SystemC 2.0<br />
Satisfies the goals<br />
MPSOC 2003, 7-11 July, Chamonix, France, K. Vissers 2-31
Producer-Consumer example (YAPI flavor)<br />
#<strong>in</strong>clude "producer.h"<br />
#<strong>in</strong>clude <br />
Producer::Producer(const Id& n, Out& o) :<br />
Process(n),<br />
out( id("out"), o)<br />
{ }<br />
const char* Producer::type() const<br />
{<br />
return "Producer";<br />
}<br />
void Producer::ma<strong>in</strong>()<br />
{<br />
std::cout
Mapp<strong>in</strong>g Kahn Process Networks<br />
Solve 1,2,3<br />
Input language: Stylized flavors <strong>of</strong> C++<br />
recommendation: SystemC dataflow proposal<br />
Map one or more processes onto a processor:<br />
generate buffer <strong>in</strong>sertion (sometimes statically determ<strong>in</strong>ed)<br />
generate schedule for switch<strong>in</strong>g between processes on the same<br />
processor (sometimes statically determ<strong>in</strong>ed)<br />
generate overall process for manipulation, preferably runn<strong>in</strong>g on a<br />
general processor<br />
synthesize run-time schedule or determ<strong>in</strong>e run-time schedul<strong>in</strong>g<br />
technique, potentially us<strong>in</strong>g thread schedulers or RTOS or OS<br />
MPSOC 2003, 7-11 July, Chamonix, France, K. Vissers 2-33
New Systems<br />
Understand the application!<br />
On chip memory<br />
Multi processor, programmable and reconfigurable<br />
Power consumption <strong>of</strong> the complete IC needs to be constant<br />
The PROGRAMMERS view is mak<strong>in</strong>g the difference<br />
<strong>in</strong>put<br />
output<br />
1-4<br />
cpu<br />
Memory<br />
I$<br />
D$<br />
Serial I/O<br />
RCF<br />
timers fixed IP<br />
Memory<br />
I/O<br />
ReConfigurable<br />
Fabric<br />
MPSOC 2003, 7-11 July, Chamonix, France, K. Vissers 2-34
Programm<strong>in</strong>g <strong>Reconfigurable</strong> Systems<br />
The ONLY <strong>in</strong>terest<strong>in</strong>g <strong>architectures</strong> are the ones you can<br />
program/reconfigure.<br />
Logic synthesis for a specific architecture programm<strong>in</strong>g<br />
(FPGA problem)<br />
In conventional processors 2/3 <strong>of</strong> the silicon is <strong>in</strong> cache<br />
sub<strong>systems</strong>:<br />
abstraction for the programmer where the <strong>in</strong>structions or the data<br />
are resid<strong>in</strong>g<br />
conceptually a shared memory model, even cache coherent multi<br />
processors <strong>systems</strong><br />
It took the DSP world more then 10 years to learn: first<br />
assembly and MAC, now RISC and VLIW and C compilers<br />
MPSOC 2003, 7-11 July, Chamonix, France, K. Vissers 2-35
<strong>Reconfigurable</strong> embedded <strong>systems</strong><br />
SpecInt is irrelevant, EEMBC might be more relevant<br />
Reth<strong>in</strong>k time-multiplex<strong>in</strong>g:<br />
Processor<br />
<strong>in</strong>struction: compiler<br />
task: OS<br />
<strong>Reconfigurable</strong> system<br />
compute graph that is statically determ<strong>in</strong>ed<br />
reconfigure dynamically: runtime support and buffer management.<br />
Extract<strong>in</strong>g parallelism:<br />
Instruction Level Parallelsim: C is the problem NOT the application<br />
Task level: Wrong QUESTION!<br />
MPSOC 2003, 7-11 July, Chamonix, France, K. Vissers 2-36
Summary<br />
<strong>Reconfigurable</strong> comput<strong>in</strong>g has many advantages over ASIC<br />
and CPU/MPU<br />
Large parallelism with no <strong>in</strong>struction overhead<br />
Customizable data path size<br />
Flexible (reconfigurable!)<br />
It is still <strong>in</strong> its <strong>in</strong>fancy<br />
Semantic gap between algorithms and circuits is still a major<br />
obstacle<br />
Hardware platforms are only now emerg<strong>in</strong>g commercially that are<br />
designed for RC<br />
Mapp<strong>in</strong>gs are <strong>of</strong>ten architecture specific<br />
MPSOC 2003, 7-11 July, Chamonix, France, K. Vissers 2-37
Trends<br />
Th<strong>in</strong>k Y-chart: FIRST build a mapp<strong>in</strong>g environment, then<br />
ask the question is this a good architecture.<br />
Often ad-hoc match<strong>in</strong>g <strong>of</strong> the model <strong>of</strong> computation with<br />
the model <strong>of</strong> the architecture -> formalize<br />
Very excit<strong>in</strong>g time:<br />
new tools<br />
new <strong>architectures</strong><br />
Reverse the world: silicon is cheap, concurrency and<br />
communication is the problem,<br />
New programm<strong>in</strong>g paradigms<br />
MPSOC 2003, 7-11 July, Chamonix, France, K. Vissers 2-38
Selected References<br />
J. Henkel, W. Najjir, F. Vahid, K. Vissers,<br />
New Comput<strong>in</strong>g Platforms for Embedded Systems, Full day tutorial at 2002, Design Automation Conference.<br />
Andrew Mihal, Chidamber Kulkarni, Christian Sauer, Kees Vissers, Mathew Moskewicz, Mel Tsai, Niraj Shah, Scott<br />
Weber, Yujia J<strong>in</strong>,Kurt Keutzer, Sharad Malik, A Discipl<strong>in</strong>ed Approach to the Development <strong>of</strong> Architectural Platforms,<br />
2-12, 19, IEEE Design and Test <strong>of</strong> Computers, 2002<br />
M. Sima, S. Cot<strong>of</strong>ana, S. Vassiliades, J.T.J. van Eijndhoven, K. Vissers, MPEG Marcroblock Pars<strong>in</strong>g and Pel<br />
Reconstruction on an FPGA-augmented Trimedia Processor, best paper award International Conference on Computer<br />
Design (ICCD), 2001.<br />
J.T.J. van Eijndhoven, F.W. Sijstermans, K.A. Vissers, E.J.D. Pol, M.J.A. Tromp, P. Struik, R.H.J. Bloks, P. van der<br />
Wolf, A.D. Pimentel, H.P.E. Vranken, Trimedia CPU64 Architecture, International Conference on Computer Design<br />
(ICCD), 1999.<br />
F. Sijstermans, E.J. Pol, B. Riemens, K.A. Vissers, S. Rathnam, and G. Slavenburg. Design Space Exploration for<br />
Future Trimedia CPUs, ICASSP '98. 1998.<br />
B. Kienhuis, E. Deprettere, K.A. Vissers, and P. van der Wolf, An Approach for Quantitative Analysis <strong>of</strong> Applicationspecific<br />
Dataflow Architectures. Proceed<strong>in</strong>g <strong>of</strong> 11th Int. Conference <strong>of</strong> Applications-specific Systems Architectures<br />
and Processors (ASAP '97), pp. 338-349. 1997.<br />
K.A. Vissers, G. Ess<strong>in</strong>k, P.H.J van Gerwen, P.J.M. Janssen, O. Popp, E. Riddersma, W.J.M Smits, H.J.M. Veendrick.<br />
Architecture and Programm<strong>in</strong>g <strong>of</strong> Two Generations Video Signal Processors. Journal on Microprocess<strong>in</strong>g and<br />
Microprogramm<strong>in</strong>g . February, 1996.<br />
MPSOC 2003, 7-11 July, Chamonix, France, K. Vissers 2-39