Floating Point Vector Processing on an FPGA
Prof. Miriam Leeser
Department of Electrical and Computer Engineering
Northeastern University
Boston, MA
mel@coe.neu.edu
http://www.coe.neu.edu/Research/rcl/index.php

Based on the MS thesis of Jainik Kathiara (Jan 2011) and an FCCM 2011 paper
Outline
• Introduction to Vector Processing
• Vector-scalar ISA
• Floating Point Vector Co-processor (FPVC)
• Vectorized Linear Algebra Kernels
• Results
• Future Directions
Xilinx Reconfigurable Architecture
• Rich set of reconfigurable elements
• Embedded processor: PowerPC
• Floating point implemented as instruction extensions to the PowerPC:
  – Emulated in software
  – Hardware coprocessor
• Alternative: build a custom FP pipeline
  – NU VFLOAT library
Conventional FPU Implementation
• The FPGA Floating Point Unit serializes operations:
  – PowerPC fetches and executes instructions and data
  – Limited parallelism, limited speedup
How to do better?
• Vector processor: potential to operate on lots of data at the same time
  – Multiple data elements stored in a vector
  – Perform the same operation on all the data elements
• Eliminates loops
• FPVC does its own instruction fetch and execute
• Vector instructions are dense
  – Reduced program code size
  – Reduced dynamic instruction bandwidth
  – Reduced data hazards
• Parallel execution, parallel data
  – Improved performance
What is vector processing?
for (i=0; i < n; i++)
    Y[i] = A[i] * x + Y[i];
• BLAS library routine SAXPY / DAXPY
• 6 basic operations repeated for each element of vector Y
• In a vector ISA such operations are written very compactly: operate on the entire vector Y
MIPS code for SAXPY
      L.D    F0,a        ;load scalar a
      DADDIU R4,Rx,#512  ;last address to load
Loop: L.D    F2,0(Rx)    ;load X(i)
      MUL.D  F2,F2,F0    ;a × X(i)
      L.D    F4,0(Ry)    ;load Y(i)
      ADD.D  F4,F4,F2    ;a × X(i) + Y(i)
      S.D    0(Ry),F4    ;store into Y(i)
      DADDIU Rx,Rx,#8    ;increment index to X
      DADDIU Ry,Ry,#8    ;increment index to Y
      DSUBU  R20,R4,Rx   ;compute bound
      BNEZ   R20,Loop    ;check if done
Vector MIPS code for SAXPY
    L.S     F0,a     ;load scalar a
    LV      V1,Rx    ;load vector X
    MULVS.S V2,V1,F0 ;vector-scalar multiply
    LV      V3,Ry    ;load vector Y
    ADDV.S  V4,V2,V3 ;add
    SV      Ry,V4    ;store the result
• Assumes vector length matches the length of the vector registers, etc.
Vector Processing
• Vector registers hold many operands at once
  – 64, 128, 256 typical
• Vector instructions operate on many operands at once:
  – LV, SV
  – VADD, VMULT
  – This reduces code size and dynamic instruction count
• What about processing?
  – Use one functional unit (e.g. MULT) and pipeline it
    • Start a new operand pair every clock cycle
  – Have multiple functional units operating at once: vector lanes
  – Do both: parallelism and pipelining
Vector Arithmetic Execution
• Use a deep pipeline (=> fast clock) to execute element operations
• Simplifies control of the deep pipeline because elements in a vector are independent (=> no hazards!)
[Figure: a six-stage multiply pipeline consuming vector registers V1 and V2 and producing V3. Images from Asanovic's PhD thesis.]
Vector Instruction Execution
ADDV C,A,B
• Execution using one pipelined functional unit
• Execution using four pipelined functional units
[Figure: with one functional unit, element pairs A[i], B[i] enter the pipeline one per cycle, producing C[0], C[1], C[2], …; with four units, four element pairs enter per cycle and results complete four at a time.]
Vector Unit Structure
[Figure: vector registers partitioned across four lanes; each lane contains a functional unit and holds elements 0, 4, 8, …; 1, 5, 9, …; 2, 6, 10, …; 3, 7, 11, …; all lanes connect to the memory subsystem.]
Vector Lane
• Each lane consists of a functional unit, a partition of the vector register file, and a vector flag register
• Similar to SIMD extensions in popular instruction sets:
  – Intel's MMX and SSE, PowerPC's AltiVec
Vector Length Control
m = n; i = 0;
while (m > MVL) {
    for (j = 0; j < MVL; j++)
        Y[i+j] = A[i+j] * x + Y[i+j];
    m = m - MVL; i = i + MVL;
}
for (j = 0; j < m; j++)
    Y[i+j] = A[i+j] * x + Y[i+j];
Vector Strip Mining
Problem: Vector registers have finite length
Solution: Break loops into pieces that fit into vector registers ("strip mining")
for (i = 0; i < n; i++)
    Y[i] = A[i] * x + Y[i];
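The strip-mined SAXPY loop can be sketched in plain C. This is an illustrative sketch, not FPVC code; `MVL`, the function name, and the first-chunk split are assumptions for the example. The inner loop is what a vector unit would execute as one LV/MULVS/ADDV/SV sequence.

```c
#include <stddef.h>

/* Maximum vector length the (hypothetical) hardware supports. */
#define MVL 64

/* SAXPY with strip mining: process the n-element loop in MVL-sized
 * chunks so each chunk fits in a vector register. */
void saxpy_stripmined(size_t n, float a, const float *x, float *y)
{
    size_t i = 0;
    /* Make the first chunk handle n % MVL elements, so every
     * remaining chunk is exactly MVL long (the classic split). */
    size_t len = n % MVL;
    if (len == 0 && n > 0)
        len = MVL;
    while (i < n) {
        for (size_t j = 0; j < len; j++)   /* one "vector" operation */
            y[i + j] = a * x[i + j] + y[i + j];
        i += len;
        len = MVL;                         /* full strips from now on */
    }
}
```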
Related Work
• Among the earliest single-chip vector processors are VIRAM and T0, designed by Kozyrakis [1] and Asanovic [2] respectively
• VIRAM and T0 are implemented as ASICs
• Yiannacouras [3] and Yu [4] have designed FPGA-based soft vector processors inspired by VIRAM and T0
  – These designs implement integer arithmetic, not floating point
The FPVC
• Our floating point vector co-processor differs from earlier work on floating point co-processors:
  – Fetches its own instructions
  – Operates on scalar data
    • Loop control is local to the FPVC
  – Is completely autonomous of the main processor
  – Includes divide and square root in the pipelined floating point data path
Vector Chaining and Hybrid Vector/SIMD Architecture
• Vector chaining is pipeline forwarding in a vector processor
• Requires one read and one write port per functional unit
• Hybrid vector/SIMD computation operates in SIMD fashion across lanes and over time as in a traditional vector machine
• AMD GPU architectures implement a vector/SIMD architecture
Vector Scalar Instruction Set Architecture
• 32-bit instruction set
• Supports 32 vector registers
• All the instructions can be classified into categories:
  – Memory access instructions
  – Integer arithmetic instructions
  – Program flow control instructions
  – Floating point arithmetic instructions
  – Special instructions
Vector Register Organization
• Two types of organization
  – Register partitioned
  – Element partitioned
• Vector register parameters:
  – Number of vector lanes
  – Short vector size
Vector Lane, Short Vector, Vector Register, Scalar Register?
[Figure: the register file viewed as vector lanes (L) across the top, short vectors (SV) within each lane, and scalar registers alongside.]
Memory Access Instructions
Memory instruction format:
op[5:0] rd[4:0] r1[4:0] r2[4:0] imd[10:0]
• The supported memory access patterns are:
  – Unit stride
  – Non-unit stride
  – Permutation access
  – Look-up table access
  – Rake access
Vector Arithmetic Instructions
Arithmetic instruction with both operands in registers:
op[5:0] rd[4:0] r1[4:0] r2[4:0] exop[10:0]
Arithmetic instruction with 16-bit immediate value:
op[5:0] rd[4:0] r1[4:0] imd[15:0]
• Includes both integer and floating point instructions
• Masked instruction execution is also included
Scalar Arithmetic Instructions
• The same instruction format is used
• Only the first element of the first short vector of each vector register is used
• The result is replicated to all lanes and stored in the first short vector
Vector Compression and Expansion
• Compress (mask = 1,0,1,1,1,0,0,1): pack the selected elements to the front
  input:  A[0] A[1] A[2] A[3] A[4] A[5] A[6] A[7]
  output: A[0] A[2] A[3] A[4] A[7] –    –    –
• Expand (same mask): scatter the packed elements back to the mask positions
  input:  A[0] A[2] A[3] A[4] A[7] –    –    –
  output: A[0] –    A[2] A[3] A[4] –    –    A[7]
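A software model of compress and expand makes the data movement concrete. The mask comes from the slide's example; the function names and signatures are made up for illustration:

```c
#include <stddef.h>

/* Compress: gather the elements of src whose mask bit is 1 into the
 * front of dst.  Returns the number of elements kept.  With mask
 * 1,0,1,1,1,0,0,1 over A[0..7] this yields A[0],A[2],A[3],A[4],A[7]. */
size_t vec_compress(size_t n, const int *mask, const float *src, float *dst)
{
    size_t k = 0;
    for (size_t i = 0; i < n; i++)
        if (mask[i])
            dst[k++] = src[i];
    return k;
}

/* Expand: the inverse operation -- scatter a packed vector back to the
 * positions where mask is 1; masked-off slots are set to fill. */
void vec_expand(size_t n, const int *mask, const float *packed,
                float *dst, float fill)
{
    size_t k = 0;
    for (size_t i = 0; i < n; i++)
        dst[i] = mask[i] ? packed[k++] : fill;
}
```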
FPVC Organization
Floating Point Vector Core
• Autonomous from the main processor
• Supports the vector-scalar ISA
• 4-stage RISC pipeline
• In-order issue, out-of-order completion
  – An arbiter handles completion
• Unified vector-scalar general purpose register file
• Uses the NU VFLOAT library for the floating point units
Compile time parameters
Memory Hierarchy
• Supports a modified Harvard-style memory architecture
• Separate instruction and data memory in local on-chip RAM
• Unified main memory (in other on-chip RAM)
• Local on-chip RAM reduces traffic on the system bus
• Program and data size are limited by the local on-chip RAM size
  – Vector code is more compact than scalar code!
System Bus Interface
• The FPVC is connected to the system bus through a PLB interface, but is not limited to any bus protocol
• Two ports are provided for connection in an embedded system:
  – Slave port – for communication with the main processor
  – Master port – for main memory accesses
• The PLB interface can be configured for 32-, 64- or 128-bit data width
• The master port includes a DMA controller for main memory accesses
Experimental Setup
• Design implemented on a Xilinx ML510 board
• 32-bit PLB-based system bus
• Embedded system runs at 100 MHz
• PowerPC program code is compiled with gcc using -O2 optimization
• The Xilinx FPU is used only for comparison
• FPVC program code is written in machine code and unoptimized
• Program and data are stored in BRAM (main memory)
• The main metric for performance measurement is the number of clock cycles
<strong>FPGA</strong> Resources Used
Program Flow
PowerPC_main() {
    1. Start PowerPC timer();
    2. Write kernel parameters to FPVC's local data RAM;
    3. Configure and enable FPVC DMA for FPVC instruction load;
    4. Wait until FPVC completes execution;
    5. Stop PowerPC timer();
}
FPVC_main() {
    wait for instruction load;
    load data;
    compute kernel();
    store result;
    HALT FPVC;
}
Linear Algebra Kernels
• Dot Product
• Matrix-Vector Product
• Matrix-Matrix Multiplication
• QR Decomposition
• Cholesky Decomposition
DOT Product
DOT_product_kernel() {
    load vector u from local data RAM;
    load vector v from local data RAM;
    mul_vector = multiply u and v;
    accumulate = reduction(mul_vector);
    store accumulate to local memory;
}
• BLAS level 1 routine
• Performs O(N) floating point operations
• The product can be formulated as: u · v = Σ_i u_i v_i
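The kernel above can be modeled in C. This is only a sketch: the hypothetical `LANES` constant mimics per-lane partial sums followed by the final reduction step, and is not the FPVC's actual machine code.

```c
#include <stddef.h>

#define LANES 4  /* number of parallel lanes, for illustration */

/* Dot product computed the way a multi-lane vector unit would:
 * each lane accumulates a strided partial sum (lanes run in parallel
 * in hardware), then a reduction combines the LANES partial sums.
 * Functionally equivalent to a plain loop. */
float dot_product(size_t n, const float *u, const float *v)
{
    float partial[LANES] = {0};
    for (size_t lane = 0; lane < LANES; lane++)
        for (size_t i = lane; i < n; i += LANES)
            partial[lane] += u[i] * v[i];
    float acc = 0;                       /* the reduction step */
    for (size_t lane = 0; lane < LANES; lane++)
        acc += partial[lane];
    return acc;
}
```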
DOT Product Performance for Short Vector Scaling
[Figure: performance improvement over PowerPC for DOT PRODUCT with lanes L = 2 and short vector sizes SV = 8, 16, 32, for 8 to 512 vector elements; improvement ranges from about 0.4× to 1.8×, growing with vector length.]
DOT Product Performance for Lane Scaling
[Figure: performance improvement over PowerPC for DOT PRODUCT with short vector size SV = 32 and lanes L = 1, 2, 4, 8, for vector lengths 8 to 512; improvement ranges from about 0.4× to 2.4×, growing with vector length and lane count.]
Matrix-Vector Product
MV_product_kernel() {
    loop (i = 0 to N-1)
        y_i = DOT_product_kernel(A_i, x);
        store result y_i to local memory;
    end loop;
}
• BLAS level 2 routine
• Performs O(N²) floating point operations
• The product can be formulated as: y_i = Σ_j A_i,j x_j
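A minimal C model of this kernel, assuming row-major storage; the function name is illustrative. Each output element is one dot product of a row of A with x, exactly as the pseudocode loops over DOT_product_kernel:

```c
#include <stddef.h>

/* Matrix-vector product y = A x for an n x n row-major matrix A. */
void mv_product(size_t n, const float *A, const float *x, float *y)
{
    for (size_t i = 0; i < n; i++) {   /* one dot product per row */
        float acc = 0;
        for (size_t j = 0; j < n; j++)
            acc += A[i * n + j] * x[j];
        y[i] = acc;
    }
}
```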
Matrix-Vector Product Performance for Lane Scaling
[Figure: performance improvement over PowerPC for the MV product with short vector size SV = 32 and lanes L = 1, 2, 4, 8, for square matrix sizes 4 to 16; improvement ranges from about 0.4× to 1.6×.]
Matrix-Matrix Multiplication
MM_product_kernel() {
    loop (i = 0 to N-1)
        C_i = MV_product_kernel(A, B_i);
        store result C_i to local memory;
    end loop;
}
• BLAS level 3 routine
• Performs O(N³) floating point operations
• The product can be formulated as: C_i,j = Σ_k A_i,k B_k,j
Matrix-Matrix Multiplication Performance for Lane Scaling
[Figure: performance improvement over PowerPC for the MM product with short vector size SV = 32 and lanes 1, 2, 4, 8, for square matrix sizes 4 to 16; improvement ranges from about 0.7× to 1.7×.]
QR Decomposition
QR_Decomp_kernel() {
    loop (i = 0 up to M-1)
        loop (j = N-1 down to j > i)
            x = A[j-1][i];
            y = A[j][i];
            compute Q_i,j;
            A[j-1:j][0:N-1] = MM_product_kernel(Q_i,j, A[j-1:j][0:N-1]);
        end loop;
    end loop;
}
• This kernel uses Givens rotations to decompose a matrix into an orthogonal matrix (Q) and an upper triangular matrix (R) such that A = QR.
• An N x N matrix A is zeroed out one element at a time using a 2 x 2 rotation matrix:
  Q_i,j = [ c  s ; -s  c ]  with  c = x / √(x² + y²),  s = y / √(x² + y²)
• Performs O(N³) floating point operations.
QR Decomposition Performance for Lane Scaling
Cholesky Decomposition
Cholesky_Decomp_kernel() {
    loop (i = 0 up to N-1)
        pivot value = sqrt(A_i,i);
        divide i-th column vector from i to N by pivot value;
        loop (j = i+1 up to N)
            accumulate row vector from 0 to i;
            subtract accumulated value from A_j,i+1;
        end loop;
    end loop;
}
• This kernel decomposes a symmetric positive-definite matrix into triangular matrices such that A = LLᵀ.
• Each element of L can be defined as below:
  L_j,j = √(A_j,j − Σ_{k<j} L_j,k²),   L_i,j = (A_i,j − Σ_{k<j} L_i,k L_j,k) / L_j,j  for i > j
Cholesky Decomposition Performance for Lane Scaling
Summary
• Designed and implemented a unified vector-scalar floating point architecture
• Supports all the basic floating point operations:
  – Add, Sub, Mul, Div, Sqrt
• Initiated the design of a linear algebra library for basic matrix and vector arithmetic computation
Key Features
• The FPVC is faster than the Xilinx FPU plus embedded processor
• The FPVC is slower than a custom datapath
  – But easier to implement
• The FPVC is autonomous from the embedded processor
  – A good choice for scientific applications that use the rest of the FPGA at the same time
Future Work
• Double Precision Floating Point Support
• Architectural Improvements
  – Memory Caching
• Improved Tools
  – Vector Compiler Tool Flow
• More applications
  – Demonstrate concurrent use of the FPVC
References
[1] C. Kozyrakis and D. Patterson, "Overcoming the Limitations of Conventional Vector Processors," in Proceedings of the 30th International Symposium on Computer Architecture, San Diego, California, June 2003, pp. 399–409.
[2] K. Asanovic, J. Beck, B. Irissou, B. Kingsbury, and N. Morgan, "The T0 Vector Microprocessor," Hot Chips, vol. 7, pp. 187–196, 1995.
[3] P. Yiannacouras, J. G. Steffan, and J. Rose, "VESPA: Portable, Scalable, and Flexible FPGA-Based Vector Processors," International Conference on Compilers, Architecture and Synthesis for Embedded Systems (CASES), Atlanta, GA, October 2008.
[4] J. Yu, G. Lemieux, and C. Eagleston, "Vector Processing as a Soft-core CPU Accelerator," ACM International Symposium on FPGAs, 2008.
Thank You!
Miriam Leeser
mel@coe.neu.edu
http://www.coe.neu.edu/Research/rcl/index.php

More details can be found in:
Jainik Kathiara's MS thesis, under the publications link
FCCM 2011 paper: "An Autonomous Vector/Scalar Floating Point Coprocessor for FPGAs," by Jainik Kathiara and Miriam Leeser