Floating Point Vector Processing on an FPGA
Prof. Miriam Leeser
Department of Electrical and Computer Engineering
Northeastern University
Boston, MA
mel@coe.neu.edu
http://www.coe.neu.edu/Research/rcl/index.php

Based on the MS thesis of Jainik Kathiara (Jan 2011) and an FCCM 2011 paper
Outline
• Introduction to Vector Processing
• Vector-scalar ISA
• Floating Point Vector Co-processor (FPVC)
• Vectorized Linear Algebra Kernels
• Results
• Future Directions
Xilinx Reconfigurable Architecture
• Rich set of reconfigurable elements
• Embedded processor: PowerPC
• Floating point implemented as instruction extensions to the PowerPC:
  – Emulated in software
  – Hardware coprocessor
• Alternative: build a custom FP pipeline
  – NU VFLOAT library
Conventional FPU Implementation
• The FPGA Floating Point Unit serializes operations:
  – PowerPC fetches and executes instructions and data
  – Limited parallelism, limited speedup
How to do better?
• Vector processor: potential to operate on lots of data at the same time
  – Multiple data elements stored in a vector
  – Perform the same operation on all the data elements
• Eliminates loops
• FPVC does its own instruction fetch and execute
• Vector instructions are dense
  – Reduced program code size
  – Reduced dynamic instruction bandwidth
  – Reduced data hazards
• Parallel execution, parallel data
  – Improved performance
What is vector processing?
for (i=0; i < n; i++)
    Y[i] = A[i] * x + Y[i];
• BLAS library routine SAXPY / DAXPY
• 6 basic operations repeated for each element of vector Y
• In a vector ISA such operations are written very compactly: operate on the entire vector Y
MIPS code for SAXPY
      L.D    F0,a        ;load scalar a
      DADDIU R4,Rx,#512  ;last address to load
Loop: L.D    F2,0(Rx)    ;load X(i)
      MUL.D  F2,F2,F0    ;a × X(i)
      L.D    F4,0(Ry)    ;load Y(i)
      ADD.D  F4,F4,F2    ;a × X(i) + Y(i)
      S.D    0(Ry),F4    ;store into Y(i)
      DADDIU Rx,Rx,#8    ;increment index to X
      DADDIU Ry,Ry,#8    ;increment index to Y
      DSUBU  R20,R4,Rx   ;compute bound
      BNEZ   R20,Loop    ;check if done
Vector MIPS code for SAXPY
    L.S     F0,a     ;load scalar a
    LV      V1,Rx    ;load vector X
    MULVS.S V2,V1,F0 ;vector-scalar multiply
    LV      V3,Ry    ;load vector Y
    ADDV.S  V4,V2,V3 ;add
    SV      Ry,V4    ;store the result
• Assumes vector length matches the length of the vector registers, etc.
Vector Processing
• Vector registers hold many operands at once
  – 64, 128, 256 typical
• Vector instructions operate on many operands at once:
  – LV, SV
  – VADD, VMULT
  – This reduces code size and dynamic instruction count
• What about processing?
  – Use one functional unit (e.g. MULT) and pipeline it
    • Start a new operand pair every clock cycle
  – Have multiple functional units operating at once: vector lanes
  – Do both: parallelism and pipelining
Vector Arithmetic Execution
• Use a deep pipeline (=> fast clock) to execute element operations
• Simplifies control of the deep pipeline because elements in a vector are independent (=> no hazards!)
[Figure: a six-stage multiply pipeline consuming vector registers V1 and V2 and producing V3. Images from Asanovic's PhD thesis.]
Vector Instruction Execution
ADDV C,A,B
• Execution using one pipelined functional unit
• Execution using four pipelined functional units
[Figure: with one functional unit, element pairs A[i], B[i] enter the pipeline one per cycle, producing C[0], C[1], C[2], …; with four units, four element pairs enter per cycle and results complete four at a time.]
Vector Unit Structure
[Figure: vector registers partitioned across four lanes; each lane contains a functional unit and holds elements 0, 4, 8, …; 1, 5, 9, …; 2, 6, 10, …; 3, 7, 11, …; all lanes connect to the memory subsystem.]
Vector Lane
• Each lane consists of a functional unit, a partition of the vector register file, and a vector flag register
• Similar to SIMD extensions in popular instruction sets:
  – Intel's MMX and SSE, PowerPC's AltiVec
Vector Length Control
m = n; i = 0;
while (m > MVL) {
    for (j = 0; j < MVL; j++)
        Y[i+j] = A[i+j] * x + Y[i+j];
    m = m - MVL; i = i + MVL;
}
for (j = 0; j < m; j++)
    Y[i+j] = A[i+j] * x + Y[i+j];
Vector Strip Mining
Problem: Vector registers have finite length
Solution: Break loops into pieces that fit into vector registers ("strip mining")
for (i = 0; i < n; i++)
    Y[i] = A[i] * x + Y[i];
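The strip-mined SAXPY loop can be sketched in plain C. This is an illustrative sketch, not FPVC code; `MVL`, the function name, and the first-chunk split are assumptions for the example. The inner loop is what a vector unit would execute as one LV/MULVS/ADDV/SV sequence.

```c
#include <stddef.h>

/* Maximum vector length the (hypothetical) hardware supports. */
#define MVL 64

/* SAXPY with strip mining: process the n-element loop in MVL-sized
 * chunks so each chunk fits in a vector register. */
void saxpy_stripmined(size_t n, float a, const float *x, float *y)
{
    size_t i = 0;
    /* Make the first chunk handle n % MVL elements, so every
     * remaining chunk is exactly MVL long (the classic split). */
    size_t len = n % MVL;
    if (len == 0 && n > 0)
        len = MVL;
    while (i < n) {
        for (size_t j = 0; j < len; j++)   /* one "vector" operation */
            y[i + j] = a * x[i + j] + y[i + j];
        i += len;
        len = MVL;                         /* full strips from now on */
    }
}
```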
Related Work
• Among the earliest single-chip vector processors are VIRAM and T0, designed by Kozyrakis [1] and Asanovic [2] respectively
• VIRAM and T0 are implemented as ASICs
• Yiannacouras [3] and Yu [4] have designed FPGA-based soft vector processors inspired by VIRAM and T0
  – These designs implement integer arithmetic, not floating point
The FPVC
• Our floating point vector co-processor differs from earlier work on floating point co-processors:
  – Fetches its own instructions
  – Operates on scalar data
    • Loop control is local to the FPVC
  – Is completely autonomous of the main processor
  – Includes divide and square root in the pipelined floating point data path
Vector Chaining and Hybrid Vector/SIMD Architecture
• Vector chaining is pipeline forwarding in a vector processor
• Requires one read and one write port per functional unit
• Hybrid vector/SIMD computation operates in SIMD fashion across lanes and over time as in a traditional vector machine
• AMD GPU architectures implement a vector/SIMD architecture
Vector Scalar Instruction Set Architecture
• 32-bit instruction set
• Supports 32 vector registers
• All the instructions can be classified into categories:
  – Memory access instructions
  – Integer arithmetic instructions
  – Program flow control instructions
  – Floating point arithmetic instructions
  – Special instructions
Vector Register Organization
• Two types of organization
  – Register partitioned
  – Element partitioned
• Vector register parameters:
  – Number of vector lanes
  – Short vector size
Vector Lane, Short Vector, Vector Register, Scalar Register?
[Figure: the register file viewed as vector lanes (L) across the top, short vectors (SV) within each lane, and scalar registers alongside.]
Memory Access Instructions
Memory instruction format:
op[5:0] rd[4:0] r1[4:0] r2[4:0] imd[10:0]
• The supported memory access patterns are:
  – Unit stride
  – Non-unit stride
  – Permutation access
  – Look-up table access
  – Rake access
Vector Arithmetic Instructions
Arithmetic instruction with both operands in registers:
op[5:0] rd[4:0] r1[4:0] r2[4:0] exop[10:0]
Arithmetic instruction with 16-bit immediate value:
op[5:0] rd[4:0] r1[4:0] imd[15:0]
• Includes both integer and floating point instructions
• Masked instruction execution is also included
Scalar Arithmetic Instructions
• The same instruction format is used
• Only the first element of the first short vector of each vector register is used
• The result is replicated to all lanes and stored in the first short vector
Vector Compression and Expansion
• Compress (mask = 1,0,1,1,1,0,0,1): pack the selected elements to the front
  input:  A[0] A[1] A[2] A[3] A[4] A[5] A[6] A[7]
  output: A[0] A[2] A[3] A[4] A[7] –    –    –
• Expand (same mask): scatter the packed elements back to the mask positions
  input:  A[0] A[2] A[3] A[4] A[7] –    –    –
  output: A[0] –    A[2] A[3] A[4] –    –    A[7]
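A software model of compress and expand makes the data movement concrete. The mask comes from the slide's example; the function names and signatures are made up for illustration:

```c
#include <stddef.h>

/* Compress: gather the elements of src whose mask bit is 1 into the
 * front of dst.  Returns the number of elements kept.  With mask
 * 1,0,1,1,1,0,0,1 over A[0..7] this yields A[0],A[2],A[3],A[4],A[7]. */
size_t vec_compress(size_t n, const int *mask, const float *src, float *dst)
{
    size_t k = 0;
    for (size_t i = 0; i < n; i++)
        if (mask[i])
            dst[k++] = src[i];
    return k;
}

/* Expand: the inverse operation -- scatter a packed vector back to the
 * positions where mask is 1; masked-off slots are set to fill. */
void vec_expand(size_t n, const int *mask, const float *packed,
                float *dst, float fill)
{
    size_t k = 0;
    for (size_t i = 0; i < n; i++)
        dst[i] = mask[i] ? packed[k++] : fill;
}
```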
FPVC Organization
Floating Point Vector Core
• Autonomous from the main processor
• Supports the vector-scalar ISA
• 4-stage RISC pipeline
• In-order issue, out-of-order completion
  – An arbiter handles completion
• Unified vector-scalar general purpose register file
• Uses the NU VFLOAT library for the floating point units
Compile time parameters
Memory Hierarchy
• Supports a modified Harvard-style memory architecture
• Separate instruction and data memory in local on-chip RAM
• Unified main memory (in other on-chip RAM)
• Local on-chip RAM reduces traffic on the system bus
• Program and data size are limited by the local on-chip RAM size
  – Vector code is more compact than scalar code!
System Bus Interface
• The FPVC is connected to the system bus through a PLB interface, but is not limited to any bus protocol
• Two ports are provided for connection in an embedded system:
  – Slave port – for communication with the main processor
  – Master port – for main memory accesses
• The PLB interface can be configured for 32-, 64- or 128-bit data width
• The master port includes a DMA controller for main memory accesses
Experimental Setup
• Design implemented on a Xilinx ML510 board
• 32-bit PLB-based system bus
• Embedded system runs at 100 MHz
• PowerPC program code is compiled with gcc using -O2 optimization
• The Xilinx FPU is used only for comparison
• FPVC program code is written in machine code and unoptimized
• Program and data are stored in BRAM (main memory)
• The main metric for performance measurement is the number of clock cycles
<strong>FPGA</strong> Resources Used
Program Flow
PowerPC_main() {
    1. Start PowerPC timer();
    2. Write kernel parameters to FPVC's local data RAM;
    3. Configure and enable FPVC DMA for FPVC instruction load;
    4. Wait until FPVC completes execution;
    5. Stop PowerPC timer();
}
FPVC_main() {
    wait for instruction load;
    load data;
    compute kernel();
    store result;
    HALT FPVC;
}
Linear Algebra Kernels
• Dot Product
• Matrix-Vector Product
• Matrix-Matrix Multiplication
• QR Decomposition
• Cholesky Decomposition
DOT Product
DOT_product_kernel() {
    load vector u from local data RAM;
    load vector v from local data RAM;
    mul_vector = multiply u and v;
    accumulate = reduction(mul_vector);
    store accumulate to local memory;
}
• BLAS level 1 routine
• Performs O(N) floating point operations
• The product can be formulated as: u · v = Σ_i u_i v_i
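The kernel above can be modeled in C. This is only a sketch: the hypothetical `LANES` constant mimics per-lane partial sums followed by the final reduction step, and is not the FPVC's actual machine code.

```c
#include <stddef.h>

#define LANES 4  /* number of parallel lanes, for illustration */

/* Dot product computed the way a multi-lane vector unit would:
 * each lane accumulates a strided partial sum (lanes run in parallel
 * in hardware), then a reduction combines the LANES partial sums.
 * Functionally equivalent to a plain loop. */
float dot_product(size_t n, const float *u, const float *v)
{
    float partial[LANES] = {0};
    for (size_t lane = 0; lane < LANES; lane++)
        for (size_t i = lane; i < n; i += LANES)
            partial[lane] += u[i] * v[i];
    float acc = 0;                       /* the reduction step */
    for (size_t lane = 0; lane < LANES; lane++)
        acc += partial[lane];
    return acc;
}
```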
DOT Product Performance for Short Vector Scaling
[Figure: performance improvement over PowerPC for DOT PRODUCT with lanes L = 2 and short vector sizes SV = 8, 16, 32, for 8 to 512 vector elements; improvement ranges from about 0.4× to 1.8×, growing with vector length.]
DOT Product Performance for Lane Scaling
[Figure: performance improvement over PowerPC for DOT PRODUCT with short vector size SV = 32 and lanes L = 1, 2, 4, 8, for vector lengths 8 to 512; improvement ranges from about 0.4× to 2.4×, growing with vector length and lane count.]
Matrix-Vector Product
MV_product_kernel() {
    loop (i = 0 to N-1)
        y_i = DOT_product_kernel(A_i, x);
        store result y_i to local memory;
    end loop;
}
• BLAS level 2 routine
• Performs O(N²) floating point operations
• The product can be formulated as: y_i = Σ_j A_i,j x_j
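A minimal C model of this kernel, assuming row-major storage; the function name is illustrative. Each output element is one dot product of a row of A with x, exactly as the pseudocode loops over DOT_product_kernel:

```c
#include <stddef.h>

/* Matrix-vector product y = A x for an n x n row-major matrix A. */
void mv_product(size_t n, const float *A, const float *x, float *y)
{
    for (size_t i = 0; i < n; i++) {   /* one dot product per row */
        float acc = 0;
        for (size_t j = 0; j < n; j++)
            acc += A[i * n + j] * x[j];
        y[i] = acc;
    }
}
```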
Matrix-Vector Product Performance for Lane Scaling
[Figure: performance improvement over PowerPC for the MV product with short vector size SV = 32 and lanes L = 1, 2, 4, 8, for square matrix sizes 4 to 16; improvement ranges from about 0.4× to 1.6×.]
Matrix-Matrix Multiplication
MM_product_kernel() {
    loop (i = 0 to N-1)
        C_i = MV_product_kernel(A, B_i);
        store result C_i to local memory;
    end loop;
}
• BLAS level 3 routine
• Performs O(N³) floating point operations
• The product can be formulated as: C_i,j = Σ_k A_i,k B_k,j
Matrix-Matrix Multiplication Performance for Lane Scaling
[Figure: performance improvement over PowerPC for the MM product with short vector size SV = 32 and lanes 1, 2, 4, 8, for square matrix sizes 4 to 16; improvement ranges from about 0.7× to 1.7×.]
QR Decomposition
QR_Decomp_kernel() {
    loop (i = 0 up to M-1)
        loop (j = N-1 down to j > i)
            x = A[j-1][i];
            y = A[j][i];
            compute Q_i,j;
            A[j-1:j][0:N-1] = MM_product_kernel(Q_i,j, A[j-1:j][0:N-1]);
        end loop;
    end loop;
}
• This kernel uses Givens rotations to decompose a matrix into an orthogonal matrix (Q) and an upper triangular matrix (R) such that A = QR.
• An N x N matrix A is zeroed out one element at a time using a 2 x 2 rotation matrix:
  Q_i,j = [ c  s ; -s  c ]  with  c = x / √(x² + y²),  s = y / √(x² + y²)
• Performs O(N³) floating point operations.
QR Decomposition Performance for Lane Scaling
Cholesky Decomposition
Cholesky_Decomp_kernel() {
    loop (i = 0 up to N-1)
        pivot value = sqrt(A_i,i);
        divide i-th column vector from i to N by pivot value;
        loop (j = i+1 up to N)
            accumulate row vector from 0 to i;
            subtract accumulated value from A_j,i+1;
        end loop;
    end loop;
}
• This kernel decomposes a symmetric positive-definite matrix into triangular matrices such that A = LLᵀ.
• Each element of L can be defined as below:
  L_j,j = √(A_j,j − Σ_{k<j} L_j,k²),   L_i,j = (A_i,j − Σ_{k<j} L_i,k L_j,k) / L_j,j  for i > j
Cholesky Decomposition Performance for Lane Scaling
Summary
• Designed and implemented a unified vector-scalar floating point architecture
• Supports all the basic floating point operations:
  – Add, Sub, Mul, Div, Sqrt
• Initiated the design of a linear algebra library for basic matrix and vector arithmetic computation
Key Features
• The FPVC is faster than the Xilinx FPU plus embedded processor
• The FPVC is slower than a custom datapath
  – But easier to implement
• The FPVC is autonomous from the embedded processor
  – A good choice for scientific applications that use the rest of the FPGA at the same time
Future Work
• Double Precision Floating Point Support
• Architectural Improvements
  – Memory Caching
• Improved Tools
  – Vector Compiler Tool Flow
• More applications
  – Demonstrate concurrent use of the FPVC
References
[1] C. Kozyrakis and D. Patterson, "Overcoming the Limitations of Conventional Vector Processors," in Proceedings of the 30th International Symposium on Computer Architecture, San Diego, California, June 2003, pp. 399–409.
[2] K. Asanovic, J. Beck, B. Irissou, B. Kingsbury, and N. Morgan, "The T0 Vector Microprocessor," Hot Chips, vol. 7, pp. 187–196, 1995.
[3] P. Yiannacouras, J. G. Steffan, and J. Rose, "VESPA: Portable, Scalable, and Flexible FPGA-Based Vector Processors," International Conference on Compilers, Architecture and Synthesis for Embedded Systems (CASES), Atlanta, GA, October 2008.
[4] J. Yu, G. Lemieux, and C. Eagleston, "Vector Processing as a Soft-core CPU Accelerator," ACM International Symposium on FPGAs, 2008.
Thank You!
Miriam Leeser
mel@coe.neu.edu
http://www.coe.neu.edu/Research/rcl/index.php

More details can be found in:
Jainik Kathiara's MS thesis, under the publications link
FCCM 2011 paper: "An Autonomous Vector/Scalar Floating Point Coprocessor for FPGAs," by Jainik Kathiara and Miriam Leeser