23.07.2013 Views

Reconfigurable systems in terms of computer architectures

Reconfigurable systems in terms of computer architectures

Reconfigurable systems in terms of computer architectures

SHOW MORE
SHOW LESS

Create successful ePaper yourself

Turn your PDF publications into a flip-book with our unique Google optimized e-Paper software.

<strong>Reconfigurable</strong> Systems <strong>in</strong> <strong>terms</strong><br />

<strong>of</strong> Computer Architectures<br />

Kees A. Vissers<br />

Research Fellow, UC Berkeley<br />

vissers@ieee.org<br />

www.eecs.berkeley.edu/~vissers<br />

July 10. 2003


MPSOC 2003, 7-11 July, Chamonix, France, K. Vissers 2-2


Overview<br />

Introduction<br />

Video Signal Processors, Network <strong>of</strong> Alus<br />

Dynamic Reconfiguration<br />

Conclusions<br />

MPSOC 2003, 7-11 July, Chamonix, France, K. Vissers 2-3


System on a chip<br />

Silicon Technology is provid<strong>in</strong>g the opportunity to<br />

add new functionality and <strong>in</strong>tegrate several functions<br />

and allow more programmable/reconfigurable <strong>systems</strong>.<br />

Hardware, dedicated solutions<br />

S<strong>of</strong>tware, programmable solutions<br />

First System Second System<br />

time<br />

MPSOC 2003, 7-11 July, Chamonix, France, K. Vissers 2-4


Theo Claasen: keynote Dac2000<br />

Rapid silicon prototyp<strong>in</strong>g<br />

Design cycle benefits<br />

Conventional Design Process sttime (1 right Si)<br />

RSP Design Process<br />

HW Development and Validation<br />

Production<br />

Release<br />

Placement, Rout<strong>in</strong>g & Physical Verification<br />

Silicon Fabrication<br />

Production<br />

Release<br />

SW Development and Validation<br />

• Faster chip development<br />

– More than 50% total design cycle reduction<br />

– True HW / SW co -development<br />

Higher probability <strong>of</strong> -pass first success<br />

MPSOC 2003, 7-11 July, Chamonix, France, K. Vissers 2-5<br />

PS - Theo ClaasenDAC 2000 -14


Future: The prototype = product<br />

Trade-<strong>of</strong>f between<br />

HW/SW trade-<strong>of</strong>f: Processor and reconfigurable fabric<br />

Granularity issue: f<strong>in</strong>e gra<strong>in</strong> process<strong>in</strong>g ILP, or Task level<br />

bit oriented – word oriented<br />

distributed on chip memory<br />

Large number <strong>of</strong> architectural choices:<br />

Which is ‘the right one’ for a particular application doma<strong>in</strong>?<br />

MPSOC 2003, 7-11 July, Chamonix, France, K. Vissers 2-6


Problem def<strong>in</strong>ition<br />

Perform process<strong>in</strong>g on a stream <strong>of</strong> data<br />

Sampl<strong>in</strong>g rates <strong>in</strong> the order <strong>of</strong> 100KHz to 100MHz<br />

per sample 1000 – 100,000 operations<br />

Perform 10 10- 10 11 operations/sec, like add, multiply etc.<br />

Take a 200MHz Alu, still need 50-500 <strong>of</strong> them,<br />

Solution: Time multiplex and multi-processor<br />

Note: performance is a required part <strong>of</strong> the solution!<br />

MPSOC 2003, 7-11 July, Chamonix, France, K. Vissers 2-7


Revisit processor <strong>architectures</strong><br />

S<strong>in</strong>gle Risc like processor<br />

ILP process<strong>in</strong>g: VLIW for DSP, superscalar for general<br />

purpose<br />

Very successful programm<strong>in</strong>g environment: compilers and<br />

OS.<br />

Multi processor: no clear w<strong>in</strong>n<strong>in</strong>g model, suggested to<br />

move to chapter 11 <strong>in</strong> Computer Architecture, a<br />

quantitative approach from Hennessy and Patterson<br />

Vector process<strong>in</strong>g<br />

MPSOC 2003, 7-11 July, Chamonix, France, K. Vissers 2-8


What are the solution options<br />

Concurrently execute 50- 500 operations every clock cycle.<br />

Multiple Risc cores, e.g. ARM, MIPS etc.<br />

Multiple VLIW oriented DSPs, e.g TI, Starcore etc<br />

Build a bit oriented FPGA and synthesize everyth<strong>in</strong>g on top<br />

<strong>of</strong> that, <strong>in</strong>clud<strong>in</strong>g processors cores, packet rout<strong>in</strong>g<br />

networks etc.<br />

Build a fabric <strong>of</strong> <strong>in</strong>terconnected ALUs (coarse gra<strong>in</strong>ed<br />

FPGA)<br />

SoC platforms exploit<strong>in</strong>g the best part for the specific<br />

application (part).<br />

MPSOC 2003, 7-11 July, Chamonix, France, K. Vissers 2-9


Multi processor challenges<br />

Programm<strong>in</strong>g language problem: no concurrency<br />

Limited extraction <strong>of</strong> ILP out <strong>of</strong> sequential program<br />

MPSOC 2003, 7-11 July, Chamonix, France, K. Vissers 2-10


Why can we program processors?<br />

Model <strong>of</strong> Computation matched to the Model <strong>of</strong> the<br />

Architecture<br />

C/C++: s<strong>in</strong>gle po<strong>in</strong>t <strong>of</strong> control, loop : Memory model,<br />

branch<br />

CSP (Hoare), Occam(2): Semantics <strong>of</strong> handshake <strong>in</strong><br />

Hardware: transputer channel<br />

Kahn Process Networks: distributed fifo organization,<br />

semantics <strong>in</strong> hardware or s<strong>of</strong>tware<br />

MPSOC 2003, 7-11 July, Chamonix, France, K. Vissers 2-11


<strong>Reconfigurable</strong> programm<strong>in</strong>g challenges<br />

In general 3 problems;<br />

1. Programm<strong>in</strong>g a network <strong>of</strong> ALUs/ bit cells <strong>in</strong> FPGA<br />

2. Sett<strong>in</strong>g up the buffers and partition<strong>in</strong>g for the network<br />

3. Dynamic reconfiguration management<br />

MPSOC 2003, 7-11 July, Chamonix, France, K. Vissers 2-12


Architecture Design: Y - Chart<br />

Benchmark driven, based on a Set <strong>of</strong> Applications<br />

Architecture Applications<br />

Mapp<strong>in</strong>g<br />

Performance<br />

Analysis<br />

Performance<br />

Data<br />

MPSOC 2003, 7-11 July, Chamonix, France, K. Vissers 2-13


What is an <strong>in</strong>struction set processor<br />

C/C++, Java programm<strong>in</strong>g<br />

Program control translated to branches (most <strong>of</strong> the time)<br />

for<br />

if<br />

case statements<br />

S<strong>in</strong>gle Program counter<br />

Data cache and Instruction cache<br />

Time-multiplex with <strong>in</strong>structions over ALUs<br />

Load, Store architecture, conta<strong>in</strong>s a Register File<br />

Debug with s<strong>in</strong>gle stepp<strong>in</strong>g, breakpo<strong>in</strong>ts and register views<br />

MPSOC 2003, 7-11 July, Chamonix, France, K. Vissers 2-14


Multi-processors<br />

Multiple <strong>in</strong>struction set processors:<br />

programmers model?<br />

cache coherence?<br />

granularity at the <strong>in</strong>struction level required<br />

Instruction Level parallelism limited to 4-5<br />

branch penalty <strong>in</strong> cycles<br />

Operand rout<strong>in</strong>g and memory hierarchy are the cost<br />

load-store <strong>in</strong>structions 30% <strong>of</strong> all <strong>in</strong>structions<br />

L1 cache is half <strong>of</strong> the processor area<br />

cache works poorly for stream oriented comput<strong>in</strong>g<br />

MPSOC 2003, 7-11 July, Chamonix, France, K. Vissers 2-15


The Programmability Issue<br />

Extract f<strong>in</strong>e gra<strong>in</strong> Parallelism out C/C++/ Matlab/Java<br />

alias analysis and po<strong>in</strong>ter problem<br />

Extract coarse gra<strong>in</strong> Parallelism out C/C++/ Matlab/Java<br />

process notion<br />

Start with an f<strong>in</strong>e gra<strong>in</strong> Parallel description<br />

Hardware description languages (VHDL/Verilog)<br />

Simul<strong>in</strong>k environment, Signal Flow Graphs<br />

Start with an coarse gra<strong>in</strong> Parallel description<br />

CSP, occam<br />

Kahn Process Networks<br />

System C<br />

...<br />

MPSOC 2003, 7-11 July, Chamonix, France, K. Vissers 2-16


<strong>in</strong>puts<br />

Exampe VSP architecture, 12-bit Elements<br />

ALE<br />

core<br />

ALEs<br />

ME<br />

core<br />

MEs<br />

BEs OEs<br />

P P<br />

P P<br />

outputs<br />

switch<br />

matrix<br />

MPSOC 2003, 7-11 July, Chamonix, France, K. Vissers 2-17<br />

silo<br />

program


VSP1 and VSP2 Layouts<br />

MPSOC 2003, 7-11 July, Chamonix, France, K. Vissers 2-18


Example programm<strong>in</strong>g VSP with SFG<br />

l<strong>in</strong>e delays for<br />

lum<strong>in</strong>ance signal<br />

match<strong>in</strong>g l<strong>in</strong>e delays<br />

lum<strong>in</strong>ance signal<br />

horizontal filter section<br />

vertical filter section<br />

sync signal<br />

chrom<strong>in</strong>ance signals<br />

contour level<br />

control<br />

clip function<br />

MPSOC 2003, 7-11 July, Chamonix, France, K. Vissers 2-19


Signal Flow Graph Mapp<strong>in</strong>g<br />

Retim<strong>in</strong>g, Delay management<br />

Partition<strong>in</strong>g, which part runs on which IC<br />

Schedul<strong>in</strong>g, and processor allocation<br />

All for statically scheduled multi-rate signal flow graphs<br />

Basic operation translates to basic <strong>in</strong>struction<br />

explicitly programmed data memory storage<br />

no po<strong>in</strong>ters, no branches<br />

<strong>in</strong>herent stream semantics<br />

Used for complete HDTV studio Camera Development, and<br />

next generation electronic X-ray equipment at Philips<br />

MPSOC 2003, 7-11 July, Chamonix, France, K. Vissers 2-20


Chameleon Second generation<br />

Network <strong>of</strong> <strong>in</strong>terconnected ALUs (close to a VLIW)<br />

Explicit retim<strong>in</strong>g methodology<br />

Programm<strong>in</strong>g <strong>in</strong> Simul<strong>in</strong>k<br />

Buffer programm<strong>in</strong>g explicit with hardware specific choices<br />

Benchmark results for video and next generation cell<br />

phones, base stations etc.<br />

MPSOC 2003, 7-11 July, Chamonix, France, K. Vissers 2-21


Fabric<br />

TILE<br />

Fabric<br />

TILE<br />

Pegasus Top Level<br />

Example Chameleon 6 tile Architecture<br />

High-Speed I/O Bus1<br />

(SB 128-bit 200 MHz)<br />

{ = 3.2 GB/s }<br />

High-Speed I/O Bus2<br />

(SB 128-bit 200 MHz)<br />

{ = 3.2 GB/s }<br />

Rapid IO<br />

port<br />

RAPID<br />

IO<br />

BLOCK<br />

Viterbi<br />

Decoder<br />

RAPID<br />

IO<br />

BLOCK<br />

Rapid IO<br />

port<br />

Fabric Memory Access Bus<br />

(SB 128-bit 200 MHz)<br />

{ = 3.2 GB/S }<br />

Turbo<br />

Decoder<br />

PCIX<br />

port<br />

PCIX<br />

IF<br />

BLOCK<br />

HIGH-SPEED<br />

MPIO<br />

HS-MPIO<br />

port<br />

Fabric<br />

TILE<br />

Fabric<br />

TILE<br />

Rapid IO<br />

port<br />

RAPID<br />

IO<br />

BLOCK<br />

Fabric<br />

TILE<br />

Fabric<br />

TILE<br />

RAPID<br />

IO<br />

BLOCK<br />

Rapid IO<br />

port<br />

SB<br />

Bridge<br />

SB<br />

Bridge<br />

SB 128-bit 200 MHz<br />

{ = 3.2 GB/s }<br />

SB<br />

Bridge<br />

SB<br />

Bridge<br />

USB 2.0<br />

DEV I/F<br />

USB<br />

2.0<br />

host<br />

DDR<br />

DRAM<br />

Controller<br />

Memory Bus<br />

(SB 128-bit 200 MHz)<br />

{ = 3.2 GB/S }<br />

MPSOC 2003, 7-11 July, Chamonix, France, K. Vissers 2-22<br />

SB<br />

Bridge<br />

USB 2.0<br />

HOST I/F<br />

USB<br />

2.0<br />

dev ice<br />

DMA<br />

SB<br />

Bridge<br />

AC97<br />

I/F<br />

AC97<br />

audio port<br />

Flash<br />

EPROM<br />

Parallel<br />

Flash<br />

I/F<br />

DDR<br />

DRAM<br />

JTAG<br />

I/F<br />

JTAG<br />

port<br />

Serial<br />

Flash<br />

EPROM<br />

Serial<br />

Flash<br />

I/F<br />

GPIO<br />

BLOCK<br />

GPIO<br />

(64 PINS)<br />

eJTAG<br />

port<br />

MIPS 64K<br />

CPU<br />

5f<br />

Cache<br />

Processor Bus<br />

(SB 64-bit 200 MHz)<br />

{ = 1.6 GB/S }<br />

LOW-SPEED<br />

MPIO<br />

LS-MPIO<br />

port<br />

Low-Speed I/O Bus<br />

(SB 64-bit 100 MHz)<br />

{ = 800 MB/S }


Tile Architecture<br />

16 DPUs with full Interconnect<br />

4 16KByte Memories<br />

DMA to System<br />

32 Connections to nearest Neighbor<br />

MPSOC 2003, 7-11 July, Chamonix, France, K. Vissers 2-23


DPU datapath +control (conceptual)<br />

MPSOC 2003, 7-11 July, Chamonix, France, K. Vissers 2-24


DPU datapath Architecture (conceptual)<br />

MPSOC 2003, 7-11 July, Chamonix, France, K. Vissers 2-25


Programm<strong>in</strong>g the System<br />

Resource Access<br />

DPU Simul<strong>in</strong>k Diagram Untimed<br />

Control Simul<strong>in</strong>k Diagram Untimed<br />

Fabric Memory mapped<br />

CPU C code L<strong>in</strong>ked via eFlow/INTs<br />

Peripherals Memory mapped/DMA<br />

Memories Memory mapped/DMA<br />

Programm<strong>in</strong>g Options<br />

Enter control and dataflow <strong>in</strong> Simul<strong>in</strong>k<br />

C code entry with reference to dataflow kernels <strong>in</strong> fabric<br />

Kernels are built with Simul<strong>in</strong>k<br />

MPSOC 2003, 7-11 July, Chamonix, France, K. Vissers 2-26


Simul<strong>in</strong>k based design entry <strong>systems</strong><br />

Integrated entry and simulation environment<br />

MPSOC 2003, 7-11 July, Chamonix, France, K. Vissers 2-27


Detail - Schedul<strong>in</strong>g and Mapp<strong>in</strong>g<br />

DFS<br />

DFS = Data Flow Scheduler<br />

DPU Pack<strong>in</strong>g<br />

Retime<br />

Delay DPU Allocation<br />

P+R<br />

Retime<br />

Fold<br />

Delay DPU Allocation<br />

depopulate<br />

Ctl Gen<br />

Bit Gen<br />

MPSOC 2003, 7-11 July, Chamonix, France, K. Vissers 2-28


<strong>Reconfigurable</strong> programm<strong>in</strong>g challenges<br />

In general 3 problems;<br />

1. Programm<strong>in</strong>g a network <strong>of</strong> ALUs/ bit cell <strong>in</strong> FPGA<br />

2. Sett<strong>in</strong>g up the buffers and partition<strong>in</strong>g for the network<br />

3. Dynamic reconfiguration management<br />

MPSOC 2003, 7-11 July, Chamonix, France, K. Vissers 2-29


Architecture Design: Y - Chart<br />

Benchmark driven, based on a Set <strong>of</strong> Applications<br />

Architecture Applications<br />

Mapp<strong>in</strong>g<br />

Performance<br />

Analysis<br />

Performance<br />

Data<br />

MPSOC 2003, 7-11 July, Chamonix, France, K. Vissers 2-30


Recommendation: Kahn Process Networks<br />

Formal semantics, theory and ‘pro<strong>of</strong> <strong>of</strong> concepts’ well<br />

established<br />

very well suited for stream oriented high-performance<br />

signal process<strong>in</strong>g, <strong>in</strong>clud<strong>in</strong>g dynamic dataflow.<br />

abstraction from specific Hardware, Operat<strong>in</strong>g System and<br />

device drivers<br />

Extensive prototypes available: UC Berkeley, Caltech,<br />

Leiden University, Philips Research<br />

Compute result is <strong>in</strong>dependent <strong>of</strong> schedule<br />

C++ stylized <strong>in</strong>dustry standard proposed <strong>in</strong> SystemC 2.0<br />

Satisfies the goals<br />

MPSOC 2003, 7-11 July, Chamonix, France, K. Vissers 2-31


Producer-Consumer example (YAPI flavor)<br />

#<strong>in</strong>clude "producer.h"<br />

#<strong>in</strong>clude <br />

Producer::Producer(const Id& n, Out& o) :<br />

Process(n),<br />

out( id("out"), o)<br />

{ }<br />

const char* Producer::type() const<br />

{<br />

return "Producer";<br />

}<br />

void Producer::ma<strong>in</strong>()<br />

{<br />

std::cout


Mapp<strong>in</strong>g Kahn Process Networks<br />

Solve 1,2,3<br />

Input language: Stylized flavors <strong>of</strong> C++<br />

recommendation: SystemC dataflow proposal<br />

Map one or more processes onto a processor:<br />

generate buffer <strong>in</strong>sertion (sometimes statically determ<strong>in</strong>ed)<br />

generate schedule for switch<strong>in</strong>g between processes on the same<br />

processor (sometimes statically determ<strong>in</strong>ed)<br />

generate overall process for manipulation, preferably runn<strong>in</strong>g on a<br />

general processor<br />

synthesize run-time schedule or determ<strong>in</strong>e run-time schedul<strong>in</strong>g<br />

technique, potentially us<strong>in</strong>g thread schedulers or RTOS or OS<br />

MPSOC 2003, 7-11 July, Chamonix, France, K. Vissers 2-33


New Systems<br />

Understand the application!<br />

On chip memory<br />

Multi processor, programmable and reconfigurable<br />

Power consumption <strong>of</strong> the complete IC needs to be constant<br />

The PROGRAMMERS view is mak<strong>in</strong>g the difference<br />

<strong>in</strong>put<br />

output<br />

1-4<br />

cpu<br />

Memory<br />

I$<br />

D$<br />

Serial I/O<br />

RCF<br />

timers fixed IP<br />

Memory<br />

I/O<br />

ReConfigurable<br />

Fabric<br />

MPSOC 2003, 7-11 July, Chamonix, France, K. Vissers 2-34


Programm<strong>in</strong>g <strong>Reconfigurable</strong> Systems<br />

The ONLY <strong>in</strong>terest<strong>in</strong>g <strong>architectures</strong> are the ones you can<br />

program/reconfigure.<br />

Logic synthesis for a specific architecture programm<strong>in</strong>g<br />

(FPGA problem)<br />

In conventional processors 2/3 <strong>of</strong> the silicon is <strong>in</strong> cache<br />

sub<strong>systems</strong>:<br />

abstraction for the programmer where the <strong>in</strong>structions or the data<br />

are resid<strong>in</strong>g<br />

conceptually a shared memory model, even cache coherent multi<br />

processors <strong>systems</strong><br />

It took the DSP world more then 10 years to learn: first<br />

assembly and MAC, now RISC and VLIW and C compilers<br />

MPSOC 2003, 7-11 July, Chamonix, France, K. Vissers 2-35


<strong>Reconfigurable</strong> embedded <strong>systems</strong><br />

SpecInt is irrelevant, EEMBC might be more relevant<br />

Reth<strong>in</strong>k time-multiplex<strong>in</strong>g:<br />

Processor<br />

<strong>in</strong>struction: compiler<br />

task: OS<br />

<strong>Reconfigurable</strong> system<br />

compute graph that is statically determ<strong>in</strong>ed<br />

reconfigure dynamically: runtime support and buffer management.<br />

Extract<strong>in</strong>g parallelism:<br />

Instruction Level Parallelsim: C is the problem NOT the application<br />

Task level: Wrong QUESTION!<br />

MPSOC 2003, 7-11 July, Chamonix, France, K. Vissers 2-36


Summary<br />

<strong>Reconfigurable</strong> comput<strong>in</strong>g has many advantages over ASIC<br />

and CPU/MPU<br />

Large parallelism with no <strong>in</strong>struction overhead<br />

Customizable data path size<br />

Flexible (reconfigurable!)<br />

It is still <strong>in</strong> its <strong>in</strong>fancy<br />

Semantic gap between algorithms and circuits is still a major<br />

obstacle<br />

Hardware platforms are only now emerg<strong>in</strong>g commercially that are<br />

designed for RC<br />

Mapp<strong>in</strong>gs are <strong>of</strong>ten architecture specific<br />

MPSOC 2003, 7-11 July, Chamonix, France, K. Vissers 2-37


Trends<br />

Th<strong>in</strong>k Y-chart: FIRST build a mapp<strong>in</strong>g environment, then<br />

ask the question is this a good architecture.<br />

Often ad-hoc match<strong>in</strong>g <strong>of</strong> the model <strong>of</strong> computation with<br />

the model <strong>of</strong> the architecture -> formalize<br />

Very excit<strong>in</strong>g time:<br />

new tools<br />

new <strong>architectures</strong><br />

Reverse the world: silicon is cheap, concurrency and<br />

communication is the problem,<br />

New programm<strong>in</strong>g paradigms<br />

MPSOC 2003, 7-11 July, Chamonix, France, K. Vissers 2-38


Selected References<br />

J. Henkel, W. Najjir, F. Vahid, K. Vissers,<br />

New Comput<strong>in</strong>g Platforms for Embedded Systems, Full day tutorial at 2002, Design Automation Conference.<br />

Andrew Mihal, Chidamber Kulkarni, Christian Sauer, Kees Vissers, Mathew Moskewicz, Mel Tsai, Niraj Shah, Scott<br />

Weber, Yujia J<strong>in</strong>,Kurt Keutzer, Sharad Malik, A Discipl<strong>in</strong>ed Approach to the Development <strong>of</strong> Architectural Platforms,<br />

2-12, 19, IEEE Design and Test <strong>of</strong> Computers, 2002<br />

M. Sima, S. Cot<strong>of</strong>ana, S. Vassiliades, J.T.J. van Eijndhoven, K. Vissers, MPEG Marcroblock Pars<strong>in</strong>g and Pel<br />

Reconstruction on an FPGA-augmented Trimedia Processor, best paper award International Conference on Computer<br />

Design (ICCD), 2001.<br />

J.T.J. van Eijndhoven, F.W. Sijstermans, K.A. Vissers, E.J.D. Pol, M.J.A. Tromp, P. Struik, R.H.J. Bloks, P. van der<br />

Wolf, A.D. Pimentel, H.P.E. Vranken, Trimedia CPU64 Architecture, International Conference on Computer Design<br />

(ICCD), 1999.<br />

F. Sijstermans, E.J. Pol, B. Riemens, K.A. Vissers, S. Rathnam, and G. Slavenburg. Design Space Exploration for<br />

Future Trimedia CPUs, ICASSP '98. 1998.<br />

B. Kienhuis, E. Deprettere, K.A. Vissers, and P. van der Wolf, An Approach for Quantitative Analysis <strong>of</strong> Applicationspecific<br />

Dataflow Architectures. Proceed<strong>in</strong>g <strong>of</strong> 11th Int. Conference <strong>of</strong> Applications-specific Systems Architectures<br />

and Processors (ASAP '97), pp. 338-349. 1997.<br />

K.A. Vissers, G. Ess<strong>in</strong>k, P.H.J van Gerwen, P.J.M. Janssen, O. Popp, E. Riddersma, W.J.M Smits, H.J.M. Veendrick.<br />

Architecture and Programm<strong>in</strong>g <strong>of</strong> Two Generations Video Signal Processors. Journal on Microprocess<strong>in</strong>g and<br />

Microprogramm<strong>in</strong>g . February, 1996.<br />

MPSOC 2003, 7-11 July, Chamonix, France, K. Vissers 2-39

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!