
Floating Point Vector Processing on an FPGA

Prof. Miriam Leeser
Department of Electrical and Computer Engineering
Northeastern University
Boston, MA
mel@coe.neu.edu
http://www.coe.neu.edu/Research/rcl/index.php

Based on MS thesis by Jainik Kathiara, Jan 2011
and FCCM 2011 paper


Outline

• Introduction to Vector Processing
• Vector-scalar ISA
• Floating Point Vector Co-processor (FPVC)
• Vectorized Linear Algebra Kernels
• Results
• Future Directions


Xilinx Reconfigurable Architecture

• Rich set of reconfigurable elements
• Embedded processor: PowerPC
• Floating point implemented as instruction extensions to the PowerPC:
  – Emulated in software
  – Hardware coprocessor
• Alternative: build a custom FP pipeline
  – NU VFLOAT library


Conventional FPU Implementation

• FPGA Floating Point Unit serializes operations:
  – PowerPC fetches and executes instructions and data
  – Limited parallelism, limited speedup


How to do better?

• Vector processor: potential to operate on lots of data at the same time
  – Multiple data elements stored in a vector
  – Perform the same operation on all the data elements
• Eliminates loops
• FPVC does its own instruction fetch and execute
• Vector instructions are dense
  – Reduced program code size
  – Reduced dynamic instruction bandwidth
  – Reduced data hazards
• Parallel execution, parallel data
  – Improved performance


What is vector processing?

for (i=0; i < n; i++)
    Y[i] = A[i] * x + Y[i];

• BLAS library routine SAXPY / DAXPY
• 6 basic operations repeated for each element of vector Y
• In a vector ISA such operations are written very compactly: operate on the entire vector Y


MIPS code for SAXPY

        L.D    F0,a        ;load scalar a
        DADDIU R4,Rx,#512  ;last address to load
Loop:   L.D    F2,0(Rx)    ;load X(i)
        MUL.D  F2,F2,F0    ;a × X(i)
        L.D    F4,0(Ry)    ;load Y(i)
        ADD.D  F4,F4,F2    ;a × X(i) + Y(i)
        S.D    0(Ry),F4    ;store into Y(i)
        DADDIU Rx,Rx,#8    ;increment index to X
        DADDIU Ry,Ry,#8    ;increment index to Y
        DSUBU  R20,R4,Rx   ;compute bound
        BNEZ   R20,Loop    ;check if done


Vector MIPS code for SAXPY

        L.S     F0,a     ;load scalar a
        LV      V1,Rx    ;load vector X
        MULVS.S V2,V1,F0 ;vector-scalar multiply
        LV      V3,Ry    ;load vector Y
        ADDV.S  V4,V2,V3 ;add
        SV      Ry,V4    ;store the result

• Assumes vector length matches length of registers, etc.


Vector Processing

• Vector registers hold many operands at once
  – 64, 128, 256 typical
• Vector instructions operate on many operands at once:
  – LV, SV
  – VADD, VMULT
  – This reduces code size and dynamic instruction count
• What about processing?
  – Use one functional unit (e.g. MULT) and pipeline it
    • Start a new operand pair every clock cycle
  – Have multiple functional units operating at once: vector lanes
  – Do both: parallelism and pipelining


Vector Arithmetic Execution

• Use a deep pipeline (=> fast clock) to execute element operations
• Simplifies control of the deep pipeline because elements in a vector are independent (=> no hazards!)

[Figure: six-stage multiply pipeline. Images from Asanovic's PhD thesis.]


Vector Instruction Execution

ADDV C,A,B

[Figure: elements of A and B streaming through to produce C — execution using one pipelined functional unit vs. execution using four pipelined functional units.]


Vector Unit Structure

[Figure: vector registers partitioned across four lanes — elements 0, 4, 8, …; elements 1, 5, 9, …; elements 2, 6, 10, …; elements 3, 7, 11, … — each lane with its own functional unit, all connected to the memory subsystem.]


Vector Lane

• Each lane consists of a functional unit, a partition of the vector register file, and a vector flag register
• Similar to SIMD extensions in popular instruction sets:
  – Intel's MMX and SSE, PowerPC's AltiVec


Vector Length Control

m = n; i = 0;
while (m > MVL) {
    for (j = 0; j < MVL; j++)
        Y[i+j] = A[i+j] * x + Y[i+j];
    m = m - MVL;
    i = i + MVL;
}
/* handle the remaining m <= MVL elements the same way */


Vector Strip Mining

Problem: Vector registers have finite length
Solution: Break loops into pieces that fit into vector registers: "strip mining"

for (i = 0; i < n; i++)
    Y[i] = A[i] * x + Y[i];
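The strip-mined loop can be sketched in plain C. This is an illustration, not FPVC code; the MVL value is an arbitrary stand-in for the hardware's maximum vector length, and each inner loop stands in for one set of vector instructions.

```c
#include <assert.h>

#define MVL 64  /* stand-in for the hardware's maximum vector length */

/* Strip-mined SAXPY: process Y = a*X + Y in chunks of at most MVL
   elements, the way a vector unit with finite-length registers would. */
void saxpy_strip_mined(int n, float a, const float *X, float *Y)
{
    for (int low = 0; low < n; low += MVL) {
        int vl = (n - low < MVL) ? (n - low) : MVL;  /* this strip's length */
        for (int i = low; i < low + vl; i++)         /* one "vector op" */
            Y[i] = a * X[i] + Y[i];
    }
}
```

Note the last strip handles the leftover n mod MVL elements, which is exactly what the vector-length register controls in hardware.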


Related Work

• Among the earliest single-chip vector processors are VIRAM and T0, designed by Kozyrakis [1] and Asanovic [2] respectively
• VIRAM and T0 are implemented as ASICs
• Yiannacouras [3] and Yu [4] have designed FPGA-based soft vector processors inspired by VIRAM and T0
  – These works implement integer arithmetic, not floating point


The FPVC

• Our floating point vector co-processor differs from earlier work on floating point co-processors:
  – Fetches its own instructions
  – Operates on scalar data
    • Loop control is local to the FPVC
  – Is completely autonomous of the main processor
  – Includes divide and square root in the pipelined floating point datapath


Vector Chaining and Hybrid Vector/SIMD Architecture

• Vector chaining is pipeline forwarding in a vector processor
• Requires one read and one write port per functional unit
• Hybrid vector/SIMD computation executes in SIMD fashion across lanes and over time as in a traditional vector processor
• AMD GPU architectures implement a vector/SIMD architecture


Vector Scalar Instruction Set Architecture

• 32-bit instruction set
• Supports 32 vector registers
• All the instructions can be classified into categories:
  – Memory access instructions
  – Integer arithmetic instructions
  – Program flow control instructions
  – Floating point arithmetic instructions
  – Special instructions


Vector Register Organization

• Two types of organization:
  – Register partitioned
  – Element partitioned
• Vector register size is set by:
  – Number of vector lanes
  – Short vector size


Vector Lane, Short Vector, Vector Register, Scalar Register??

[Figure: vector registers laid out across the vector lanes (L), each lane holding a short vector (SV), with the scalar registers alongside.]


Memory Access Instructions

Memory instruction format:
op[5:0] rd[4:0] r1[4:0] r2[4:0] imd[10:0]

• Supported memory access patterns:
  – Unit stride
  – Non-unit stride
  – Permutation access
  – Look-up table access
  – Rake access
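The difference between unit-stride and non-unit-stride access can be modeled in C. This is a behavioral sketch of the access pattern only, not the FPVC's load unit:

```c
#include <assert.h>

/* Gather vl elements starting at base with a fixed element stride.
   stride = 1 is the unit-stride case (consecutive elements);
   stride = row length gathers one column of a row-major matrix. */
void gather_strided(const float *base, int stride, int vl, float *dst)
{
    for (int i = 0; i < vl; i++)
        dst[i] = base[i * stride];
}
```

Strided access is what lets a vector load pull a matrix column into a vector register in one instruction instead of a scalar loop.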


Vector Arithmetic Instructions

Arithmetic instruction with both register operands:
op[5:0] rd[4:0] r1[4:0] r2[4:0] exop[10:0]

Arithmetic instruction with 16-bit immediate value:
op[5:0] rd[4:0] r1[4:0] imd[15:0]

• Includes both integer and floating point instructions
• Masked instruction execution is also included


Scalar Arithmetic Instructions

• Same instruction format is used
• Only the first element of the first short vector of each vector register is used
• The result is replicated to all lanes and stored in the first short vector


Vector Compression and Expansion

• Compress: pack the elements whose mask bit is 1 to the front of the result

  Mask  Vector    Result
  1     A[0]      A[0]
  0     A[1]      A[2]
  1     A[2]      A[3]
  1     A[3]      A[4]
  1     A[4]      A[7]
  0     A[5]      -
  0     A[6]      -
  1     A[7]      -

• Expand: scatter a packed vector back to the positions whose mask bit is 1

  Mask  Result
  1     A[0]
  0     -
  1     A[2]
  1     A[3]
  1     A[4]
  0     -
  0     -
  1     A[7]
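The two operations can be written out as a behavioral sketch in C (the unselected slots after an expand are simply zeroed here; the hardware's choice for those slots may differ):

```c
#include <assert.h>

/* Compress: pack the elements of src whose mask bit is 1 to the front
   of dst; returns how many elements were selected. */
int vcompress(const float *src, const int *mask, int n, float *dst)
{
    int k = 0;
    for (int i = 0; i < n; i++)
        if (mask[i])
            dst[k++] = src[i];
    return k;
}

/* Expand: scatter the packed elements of src back to the positions of
   dst whose mask bit is 1; other positions are zeroed. */
void vexpand(const float *src, const int *mask, int n, float *dst)
{
    int k = 0;
    for (int i = 0; i < n; i++)
        dst[i] = mask[i] ? src[k++] : 0.0f;
}
```

Compress followed by expand with the same mask restores the selected elements to their original positions, as in the slide's example.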


FPVC Organization


Floating Point Vector Core

• Autonomous from the main processor
• Supports the vector scalar ISA
• 4-stage RISC pipeline
• In-order issue, out-of-order completion
  – Arbiter handles completion
• Unified vector scalar general purpose register file
• Uses NU VFLOAT library for floating point units


Compile time parameters


Memory Hierarchy

• Supports a modified Harvard-style memory architecture
• Separate instruction and data memory in local on-chip RAM
• Unified main memory (in other on-chip RAM)
• Local on-chip RAM reduces traffic on the system bus
• Program and data size are limited by local on-chip RAM size
  – Vector code is more compact than scalar code!


System Bus Interface

• FPVC is connected to the system bus through a PLB interface but is not limited to any bus protocol
• Two ports are provided for connection in an embedded system:
  – Slave port – for communication with the main processor
  – Master port – for main memory accesses
• The PLB interface can be configured for 32-, 64- or 128-bit data width
• The master port includes a DMA controller for main memory accesses


Experimental Setup

• Design implemented on a Xilinx ML510 board
• 32-bit PLB-based system bus
• Embedded system runs at 100 MHz
• PowerPC program code is compiled with gcc using -O2 optimization
  – FPU only used for comparison
• FPVC program code is written in machine code and unoptimized
• Program and data are stored in BRAM (main memory)
• Main metric for performance measurement is number of clock cycles


FPGA Resources Used


Program Flow

PowerPC_main()
{
    1. Start PowerPC timer();
    2. Write kernel parameters to FPVC's local data RAM;
    3. Configure and enable FPVC DMA for FPVC instruction load;
    4. Wait until FPVC completes execution;
    5. Stop PowerPC timer();
}

FPVC_main()
{
    wait for instruction load;
    load data;
    compute kernel();
    store result;
    HALT FPVC;
}


Linear Algebra Kernels

• Dot Product
• Matrix-Vector Product
• Matrix-Matrix Multiplication
• QR Decomposition
• Cholesky Decomposition


DOT Product

DOT_product_kernel() {
    load vector u from local data RAM;
    load vector v from local data RAM;
    mul_vector = multiply u and v;
    accumulate = reduction(mul_vector);
    store accumulate to local memory;
}

• BLAS level 1 routine
• Performs O(N) floating point operations
• Product can be formulated as: u · v = Σ_i u_i v_i
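The kernel's two phases (an elementwise multiply, then a reduction of the product vector) can be mirrored in plain C as a reference model:

```c
#include <assert.h>

/* Dot product in the kernel's two phases. tmp must hold n floats. */
float dot_product(const float *u, const float *v, int n, float *tmp)
{
    for (int i = 0; i < n; i++)
        tmp[i] = u[i] * v[i];   /* mul_vector = multiply u and v */
    float acc = 0.0f;
    for (int i = 0; i < n; i++)
        acc += tmp[i];          /* accumulate = reduction(mul_vector) */
    return acc;
}
```

On the FPVC the multiply is one vector instruction and the reduction collapses the product vector to a scalar; here both phases are explicit loops.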


DOT Product Performance for Short Vector Scaling

[Figure: DOT product with lanes (L) = 2. Performance improvement over PowerPC (0.4–1.8×) vs. number of vector elements (8–512), for SV = 8, 16, 32.]


DOT Product Performance for Lane Scaling

[Figure: DOT product with short vector size (SV) = 32. Performance improvement over PowerPC (0.4–2.4×) vs. vector length (8–512), for L = 1, 2, 4, 8.]


Matrix-Vector Product

MV_product_kernel() {
    loop (i = 0 to N-1)
        y_i = DOT_product_kernel(A_i, x);
        store result y_i to local memory;
    end loop;
}

• BLAS level 2 routine
• Performs O(N^2) floating point operations
• Product can be formulated as: y = Ax, with y_i = Σ_j A_{i,j} x_j
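As a reference model, the kernel is one dot product per matrix row; a C sketch (row-major storage assumed, not the FPVC's actual layout):

```c
#include <assert.h>

/* y = A * x for a row-major n x n matrix A:
   one dot product per row, matching y_i = DOT_product_kernel(A_i, x). */
void mv_product(const float *A, const float *x, float *y, int n)
{
    for (int i = 0; i < n; i++) {
        float acc = 0.0f;
        for (int j = 0; j < n; j++)
            acc += A[i * n + j] * x[j];
        y[i] = acc;
    }
}
```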


Matrix-Vector Product Performance for Lane Scaling

[Figure: MV product with short vector size (SV) = 32. Performance improvement over PowerPC (0.4–1.6×) vs. square matrix size (4–16), for L = 1, 2, 4, 8.]


Matrix-Matrix Multiplication

MM_product_kernel() {
    loop (i = 0 to N-1)
        C_i = MV_product_kernel(A, B_i);
        store result C_i to local memory;
    end loop;
}

• BLAS level 3 routine
• Performs O(N^3) floating point operations
• Product can be formulated as: C = AB, computed one column at a time: C_i = A B_i
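The kernel builds the product one column of B at a time; a plain-C reference model of that structure (row-major storage assumed for illustration):

```c
#include <assert.h>

/* C = A * B for row-major n x n matrices, computed one column of B
   (and of C) at a time, matching C_i = MV_product_kernel(A, B_i). */
void mm_product(const float *A, const float *B, float *C, int n)
{
    for (int col = 0; col < n; col++)           /* column B_i */
        for (int row = 0; row < n; row++) {
            float acc = 0.0f;
            for (int k = 0; k < n; k++)
                acc += A[row * n + k] * B[k * n + col];
            C[row * n + col] = acc;
        }
}
```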


Matrix-Matrix Multiplication Performance for Lane Scaling

[Figure: MM product with short vector size (SV) = 32. Performance improvement over PowerPC (0.7–1.7×) vs. square matrix size (4–16), for lanes = 1, 2, 4, 8.]


QR Decomposition

QR_Decomp_kernel() {
    loop (i = 0 up to M-1)
        loop (j = N-1 down to j > i)
            x = A[j-1][i];
            y = A[j][i];
            compute Q_{i,j};
            A[j-1:j][0:N-1] = MM_product_kernel(Q_{i,j}, A[j-1:j][0:N-1]);
        end loop;
    end loop;
}

• This kernel uses Givens rotations to decompose a matrix into an orthogonal matrix (Q) and an upper triangular matrix (R) such that A = QR.
• An N x N matrix A is zeroed out one element at a time using a 2 x 2 rotation matrix:

  Q_{i,j} = [  c  s ]   with c = x / sqrt(x^2 + y^2), s = y / sqrt(x^2 + y^2)
            [ -s  c ]

• Performs O(N^3) floating point operations.


QR Decomposition Performance for Lane Scaling


Cholesky Decomposition

Cholesky_Decomp_kernel() {
    loop (i = 0 up to N-1)
        pivot value = sqrt(A_{i,i});
        divide i-th column vector from i to N by pivot value;
        loop (j = i+1 up to N)
            accumulate row vector from 0 to i;
            subtract accumulated value from A_{j,i+1};
        end loop;
    end loop;
}

• This kernel decomposes a symmetric positive-definite matrix into triangular matrices such that A = LL^T.
• Each element of L can be defined as below:

  L_{j,j} = sqrt(A_{j,j} - Σ_{k<j} L_{j,k}^2)
  L_{i,j} = (A_{i,j} - Σ_{k<j} L_{i,k} L_{j,k}) / L_{j,j}   for i > j


Cholesky Decomposition Performance for Lane Scaling


Summary

• Designed and implemented a unified vector scalar floating point architecture
• Supports all the basic floating point operations:
  – Add, Sub, Mul, Div, Sqrt
• Initiated the design of a linear algebra library for basic matrix and vector arithmetic computation


Key Features

• FPVC is faster than the Xilinx FPU plus embedded processor
• FPVC is slower than a custom datapath
  – Easier to implement
• FPVC is autonomous from the embedded processor
  – Good choice for scientific apps that use the rest of the FPGA at the same time


Future Work

• Double Precision Floating Point Support
• Architectural Improvements
  – Memory Caching
• Improved Tools
  – Vector Compiler Tool Flow
• More applications
  – Demonstrate concurrent use of FPVC


References

[1] C. Kozyrakis and D. Patterson, "Overcoming the Limitations of Conventional Vector Processors," in Proceedings of the 30th International Symposium on Computer Architecture, San Diego, California, June 2003, pp. 399–409.
[2] K. Asanovic, J. Beck, B. Irissou, B. Kingsbury, and N. Morgan, "The T0 Vector Microprocessor," Hot Chips, vol. 7, pp. 187–196, 1995.
[3] P. Yiannacouras, J. G. Steffan, and J. Rose, "VESPA: Portable, Scalable, and Flexible FPGA-Based Vector Processors," International Conference on Compilers, Architecture and Synthesis for Embedded Systems (CASES), October 2008, Atlanta, GA.
[4] J. Yu, G. Lemieux, and C. Eagleston, "Vector Processing as a Soft-core CPU Accelerator," ACM International Symposium on FPGA, 2008.


Thank You!

Miriam Leeser
mel@coe.neu.edu
http://www.coe.neu.edu/Research/rcl/index.php

More details can be found in:
Jainik Kathiara's MS thesis, under the publications link
FCCM 2011 paper:
"An Autonomous Vector/Scalar Floating Point Coprocessor for FPGAs" by Jainik Kathiara and Miriam Leeser
