
Floating Point Vector Processing on an FPGA

Prof. Miriam Leeser
Department of Electrical and Computer Engineering
Northeastern University
Boston, MA
mel@coe.neu.edu
http://www.coe.neu.edu/Research/rcl/index.php

Based on MS thesis by Jainik Kathiara, Jan 2011
and FCCM 2011 paper


Outline

• Introduction to Vector Processing
• Vector-scalar ISA
• Floating Point Vector Co-processor (FPVC)
• Vectorized Linear Algebra Kernels
• Results
• Future Directions


Xilinx Reconfigurable Architecture

• Rich set of reconfigurable elements
• Embedded processor: PowerPC
• Floating point implemented as instruction extensions to the PowerPC:
  – Emulated in software
  – Hardware coprocessor
• Alternative: build a custom FP pipeline
  – NU VFLOAT library


Conventional FPU Implementation

• FPGA Floating Point Unit serializes operations:
  – PowerPC fetches and executes instructions and data
  – Limited parallelism, limited speedup


How to do better?

• Vector processor: potential to operate on lots of data at the same time
  – Multiple data elements stored in a vector
  – Perform the same operation on all the data elements
• Eliminates loops
• FPVC does its own instruction fetch and execute
• Vector instructions are dense
  – Reduced program code size
  – Reduced dynamic instruction bandwidth
  – Reduced data hazards
• Parallel execution, parallel data
  – Improved performance


What is vector processing?

for (i=0; i < n; i++)
    Y[i] = A[i] * x + Y[i];

• BLAS library routine SAXPY / DAXPY
• 6 basic operations repeated for each element of vector Y
• In a vector ISA such operations are written very compactly: operate on the entire vector Y


MIPS code for SAXPY

        L.D    F0,a        ;load scalar a
        DADDIU R4,Rx,#512  ;last address to load
Loop:   L.D    F2,0(Rx)    ;load X(i)
        MUL.D  F2,F2,F0    ;a × X(i)
        L.D    F4,0(Ry)    ;load Y(i)
        ADD.D  F4,F4,F2    ;a × X(i) + Y(i)
        S.D    0(Ry),F4    ;store into Y(i)
        DADDIU Rx,Rx,#8    ;increment index to X
        DADDIU Ry,Ry,#8    ;increment index to Y
        DSUBU  R20,R4,Rx   ;compute bound
        BNEZ   R20,Loop    ;check if done


Vector MIPS code for SAXPY

        L.S     F0,a     ;load scalar a
        LV      V1,Rx    ;load vector X
        MULVS.S V2,V1,F0 ;vector-scalar multiply
        LV      V3,Ry    ;load vector Y
        ADDV.S  V4,V2,V3 ;add
        SV      Ry,V4    ;store the result

• Assumes vector length matches length of registers, etc.


Vector Processing

• Vector registers hold many operands at once
  – 64, 128, 256 typical
• Vector instructions operate on many operands at once:
  – LV, SV
  – VADD, VMULT
  – This reduces code size and dynamic instruction count
• What about processing?
  – Use one functional unit (e.g. MULT) and pipeline it
    • Start a new operand pair every clock cycle
  – Have multiple functional units operating at once: vector lanes
  – Do both: parallelism and pipelining


Vector Arithmetic Execution

• Use a deep pipeline (=> fast clock) to execute element operations
• Simplifies control of the deep pipeline because elements in a vector are independent (=> no hazards!)

[Figure: six-stage multiply pipeline. Images from Asanovic's PhD thesis.]


Vector Instruction Execution

ADDV C,A,B

[Figure: elements of A and B streaming through to produce C — execution using one pipelined functional unit vs. execution using four pipelined functional units.]


Vector Unit Structure

[Figure: vector registers partitioned across four lanes — elements 0, 4, 8, …; elements 1, 5, 9, …; elements 2, 6, 10, …; elements 3, 7, 11, … — each lane with its own functional unit, all connected to the memory subsystem.]


Vector Lane

• Each lane consists of a functional unit, a partition of the vector register file, and a vector flag register
• Similar to SIMD extensions in popular instruction sets:
  – Intel's MMX and SSE, PowerPC's AltiVec


Vector Length Control

m = n; i = 0;
while (m > MVL) {
    for (j = 0; j < MVL; j++)
        Y[i+j] = A[i+j] * x + Y[i+j];
    m = m - MVL;
    i = i + MVL;
}
/* handle the remaining m <= MVL elements the same way */


Vector Strip Mining

Problem: Vector registers have finite length
Solution: Break loops into pieces that fit into vector registers: "strip mining"

for (i = 0; i < n; i++)
    Y[i] = A[i] * x + Y[i];
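The strip-mined loop can be sketched in plain C. This is an illustration, not FPVC code; the MVL value is an arbitrary stand-in for the hardware's maximum vector length, and each inner loop stands in for one set of vector instructions.

```c
#include <assert.h>

#define MVL 64  /* stand-in for the hardware's maximum vector length */

/* Strip-mined SAXPY: process Y = a*X + Y in chunks of at most MVL
   elements, the way a vector unit with finite-length registers would. */
void saxpy_strip_mined(int n, float a, const float *X, float *Y)
{
    for (int low = 0; low < n; low += MVL) {
        int vl = (n - low < MVL) ? (n - low) : MVL;  /* this strip's length */
        for (int i = low; i < low + vl; i++)         /* one "vector op" */
            Y[i] = a * X[i] + Y[i];
    }
}
```

Note the last strip handles the leftover n mod MVL elements, which is exactly what the vector-length register controls in hardware.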


Related Work

• Among the earliest single-chip vector processors are VIRAM and T0, designed by Kozyrakis [1] and Asanovic [2] respectively
• VIRAM and T0 are implemented as ASICs
• Yiannacouras [3] and Yu [4] have designed FPGA-based soft vector processors inspired by VIRAM and T0
  – These works implement integer arithmetic, not floating point


The FPVC

• Our floating point vector co-processor differs from earlier work on floating point co-processors:
  – Fetches its own instructions
  – Operates on scalar data
    • Loop control is local to the FPVC
  – Is completely autonomous of the main processor
  – Includes divide and square root in the pipelined floating point datapath


Vector Chaining and Hybrid Vector/SIMD Architecture

• Vector chaining is pipeline forwarding in a vector processor
• Requires one read and one write port per functional unit
• Hybrid vector/SIMD computation executes in SIMD fashion across lanes and over time as in a traditional vector processor
• AMD GPU architectures implement a vector/SIMD architecture


Vector Scalar Instruction Set Architecture

• 32-bit instruction set
• Supports 32 vector registers
• All the instructions can be classified into categories:
  – Memory access instructions
  – Integer arithmetic instructions
  – Program flow control instructions
  – Floating point arithmetic instructions
  – Special instructions


Vector Register Organization

• Two types of organization:
  – Register partitioned
  – Element partitioned
• Vector register size is set by:
  – Number of vector lanes
  – Short vector size


Vector Lane, Short Vector, Vector Register, Scalar Register??

[Figure: vector registers laid out across the vector lanes (L), each lane holding a short vector (SV), with the scalar registers alongside.]


Memory Access Instructions

Memory instruction format:
op[5:0] rd[4:0] r1[4:0] r2[4:0] imd[10:0]

• Supported memory access patterns:
  – Unit stride
  – Non-unit stride
  – Permutation access
  – Look-up table access
  – Rake access
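The difference between unit-stride and non-unit-stride access can be modeled in C. This is a behavioral sketch of the access pattern only, not the FPVC's load unit:

```c
#include <assert.h>

/* Gather vl elements starting at base with a fixed element stride.
   stride = 1 is the unit-stride case (consecutive elements);
   stride = row length gathers one column of a row-major matrix. */
void gather_strided(const float *base, int stride, int vl, float *dst)
{
    for (int i = 0; i < vl; i++)
        dst[i] = base[i * stride];
}
```

Strided access is what lets a vector load pull a matrix column into a vector register in one instruction instead of a scalar loop.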


Vector Arithmetic Instructions

Arithmetic instruction with both register operands:
op[5:0] rd[4:0] r1[4:0] r2[4:0] exop[10:0]

Arithmetic instruction with 16-bit immediate value:
op[5:0] rd[4:0] r1[4:0] imd[15:0]

• Includes both integer and floating point instructions
• Masked instruction execution is also included


Scalar Arithmetic Instructions

• Same instruction format is used
• Only the first element of the first short vector of each vector register is used
• The result is replicated to all lanes and stored in the first short vector


Vector Compression and Expansion

• Compress: pack the elements whose mask bit is 1 to the front of the result

  Mask  Vector    Result
  1     A[0]      A[0]
  0     A[1]      A[2]
  1     A[2]      A[3]
  1     A[3]      A[4]
  1     A[4]      A[7]
  0     A[5]      -
  0     A[6]      -
  1     A[7]      -

• Expand: scatter a packed vector back to the positions whose mask bit is 1

  Mask  Result
  1     A[0]
  0     -
  1     A[2]
  1     A[3]
  1     A[4]
  0     -
  0     -
  1     A[7]
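The two operations can be written out as a behavioral sketch in C (the unselected slots after an expand are simply zeroed here; the hardware's choice for those slots may differ):

```c
#include <assert.h>

/* Compress: pack the elements of src whose mask bit is 1 to the front
   of dst; returns how many elements were selected. */
int vcompress(const float *src, const int *mask, int n, float *dst)
{
    int k = 0;
    for (int i = 0; i < n; i++)
        if (mask[i])
            dst[k++] = src[i];
    return k;
}

/* Expand: scatter the packed elements of src back to the positions of
   dst whose mask bit is 1; other positions are zeroed. */
void vexpand(const float *src, const int *mask, int n, float *dst)
{
    int k = 0;
    for (int i = 0; i < n; i++)
        dst[i] = mask[i] ? src[k++] : 0.0f;
}
```

Compress followed by expand with the same mask restores the selected elements to their original positions, as in the slide's example.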


FPVC Organization


Floating Point Vector Core

• Autonomous from the main processor
• Supports the vector scalar ISA
• 4-stage RISC pipeline
• In-order issue, out-of-order completion
  – Arbiter handles completion
• Unified vector scalar general purpose register file
• Uses NU VFLOAT library for floating point units


Compile time parameters


Memory Hierarchy

• Supports a modified Harvard-style memory architecture
• Separate instruction and data memory in local on-chip RAM
• Unified main memory (in other on-chip RAM)
• Local on-chip RAM reduces traffic on the system bus
• Program and data size are limited by local on-chip RAM size
  – Vector code is more compact than scalar code!


System Bus Interface

• FPVC is connected to the system bus through a PLB interface but is not limited to any bus protocol
• Two ports are provided for connection in an embedded system:
  – Slave port – for communication with the main processor
  – Master port – for main memory accesses
• The PLB interface can be configured for 32-, 64- or 128-bit data width
• The master port includes a DMA controller for main memory accesses


Experimental Setup

• Design implemented on a Xilinx ML510 board
• 32-bit PLB-based system bus
• Embedded system runs at 100 MHz
• PowerPC program code is compiled with gcc using -O2 optimization
  – FPU only used for comparison
• FPVC program code is written in machine code and unoptimized
• Program and data are stored in BRAM (main memory)
• Main metric for performance measurement is number of clock cycles


FPGA Resources Used


Program Flow

PowerPC_main()
{
    1. Start PowerPC timer();
    2. Write kernel parameters to FPVC's local data RAM;
    3. Configure and enable FPVC DMA for FPVC instruction load;
    4. Wait until FPVC completes execution;
    5. Stop PowerPC timer();
}

FPVC_main()
{
    wait for instruction load;
    load data;
    compute kernel();
    store result;
    HALT FPVC;
}


Linear Algebra Kernels

• Dot Product
• Matrix-Vector Product
• Matrix-Matrix Multiplication
• QR Decomposition
• Cholesky Decomposition


DOT Product

DOT_product_kernel() {
    load vector u from local data RAM;
    load vector v from local data RAM;
    mul_vector = multiply u and v;
    accumulate = reduction(mul_vector);
    store accumulate to local memory;
}

• BLAS level 1 routine
• Performs O(N) floating point operations
• Product can be formulated as: u · v = Σ_i u_i v_i
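The kernel's two phases (an elementwise multiply, then a reduction of the product vector) can be mirrored in plain C as a reference model:

```c
#include <assert.h>

/* Dot product in the kernel's two phases. tmp must hold n floats. */
float dot_product(const float *u, const float *v, int n, float *tmp)
{
    for (int i = 0; i < n; i++)
        tmp[i] = u[i] * v[i];   /* mul_vector = multiply u and v */
    float acc = 0.0f;
    for (int i = 0; i < n; i++)
        acc += tmp[i];          /* accumulate = reduction(mul_vector) */
    return acc;
}
```

On the FPVC the multiply is one vector instruction and the reduction collapses the product vector to a scalar; here both phases are explicit loops.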


DOT Product Performance for Short Vector Scaling

[Figure: DOT product with lanes (L) = 2. Performance improvement over PowerPC (0.4–1.8×) vs. number of vector elements (8–512), for SV = 8, 16, 32.]


DOT Product Performance for Lane Scaling

[Figure: DOT product with short vector size (SV) = 32. Performance improvement over PowerPC (0.4–2.4×) vs. vector length (8–512), for L = 1, 2, 4, 8.]


Matrix-Vector Product

MV_product_kernel() {
    loop (i = 0 to N-1)
        y_i = DOT_product_kernel(A_i, x);
        store result y_i to local memory;
    end loop;
}

• BLAS level 2 routine
• Performs O(N^2) floating point operations
• Product can be formulated as: y = Ax, with y_i = Σ_j A_{i,j} x_j
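As a reference model, the kernel is one dot product per matrix row; a C sketch (row-major storage assumed, not the FPVC's actual layout):

```c
#include <assert.h>

/* y = A * x for a row-major n x n matrix A:
   one dot product per row, matching y_i = DOT_product_kernel(A_i, x). */
void mv_product(const float *A, const float *x, float *y, int n)
{
    for (int i = 0; i < n; i++) {
        float acc = 0.0f;
        for (int j = 0; j < n; j++)
            acc += A[i * n + j] * x[j];
        y[i] = acc;
    }
}
```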


Matrix-Vector Product Performance for Lane Scaling

[Figure: MV product with short vector size (SV) = 32. Performance improvement over PowerPC (0.4–1.6×) vs. square matrix size (4–16), for L = 1, 2, 4, 8.]


Matrix-Matrix Multiplication

MM_product_kernel() {
    loop (i = 0 to N-1)
        C_i = MV_product_kernel(A, B_i);
        store result C_i to local memory;
    end loop;
}

• BLAS level 3 routine
• Performs O(N^3) floating point operations
• Product can be formulated as: C = AB, computed one column at a time: C_i = A B_i
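The kernel builds the product one column of B at a time; a plain-C reference model of that structure (row-major storage assumed for illustration):

```c
#include <assert.h>

/* C = A * B for row-major n x n matrices, computed one column of B
   (and of C) at a time, matching C_i = MV_product_kernel(A, B_i). */
void mm_product(const float *A, const float *B, float *C, int n)
{
    for (int col = 0; col < n; col++)           /* column B_i */
        for (int row = 0; row < n; row++) {
            float acc = 0.0f;
            for (int k = 0; k < n; k++)
                acc += A[row * n + k] * B[k * n + col];
            C[row * n + col] = acc;
        }
}
```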


Matrix-Matrix Multiplication Performance for Lane Scaling

[Figure: MM product with short vector size (SV) = 32. Performance improvement over PowerPC (0.7–1.7×) vs. square matrix size (4–16), for lanes = 1, 2, 4, 8.]


QR Decomposition

QR_Decomp_kernel() {
    loop (i = 0 up to M-1)
        loop (j = N-1 down to j > i)
            x = A[j-1][i];
            y = A[j][i];
            compute Q_{i,j};
            A[j-1:j][0:N-1] = MM_product_kernel(Q_{i,j}, A[j-1:j][0:N-1]);
        end loop;
    end loop;
}

• This kernel uses Givens rotations to decompose a matrix into an orthogonal matrix (Q) and an upper triangular matrix (R) such that A = QR.
• An N x N matrix A is zeroed out one element at a time using a 2 x 2 rotation matrix:

  Q_{i,j} = [  c  s ]   with c = x / sqrt(x^2 + y^2), s = y / sqrt(x^2 + y^2)
            [ -s  c ]

• Performs O(N^3) floating point operations.


QR Decomposition Performance for Lane Scaling


Cholesky Decomposition

Cholesky_Decomp_kernel() {
    loop (i = 0 up to N-1)
        pivot value = sqrt(A_{i,i});
        divide i-th column vector from i to N by pivot value;
        loop (j = i+1 up to N)
            accumulate row vector from 0 to i;
            subtract accumulated value from A_{j,i+1};
        end loop;
    end loop;
}

• This kernel decomposes a symmetric positive-definite matrix into triangular matrices such that A = LL^T.
• Each element of L can be defined as below:

  L_{j,j} = sqrt(A_{j,j} - Σ_{k<j} L_{j,k}^2)
  L_{i,j} = (A_{i,j} - Σ_{k<j} L_{i,k} L_{j,k}) / L_{j,j}   for i > j


Cholesky Decomposition Performance for Lane Scaling


Summary

• Designed and implemented a unified vector scalar floating point architecture
• Supports all the basic floating point operations:
  – Add, Sub, Mul, Div, Sqrt
• Initiated the design of a linear algebra library for basic matrix and vector arithmetic computation


Key Features

• FPVC is faster than the Xilinx FPU plus embedded processor
• FPVC is slower than a custom datapath
  – Easier to implement
• FPVC is autonomous from the embedded processor
  – Good choice for scientific apps that use the rest of the FPGA at the same time


Future Work

• Double Precision Floating Point Support
• Architectural Improvements
  – Memory Caching
• Improved Tools
  – Vector Compiler Tool Flow
• More applications
  – Demonstrate concurrent use of FPVC


References

[1] C. Kozyrakis and D. Patterson, "Overcoming the Limitations of Conventional Vector Processors," in Proceedings of the 30th International Symposium on Computer Architecture, San Diego, California, June 2003, pp. 399–409.
[2] K. Asanovic, J. Beck, B. Irissou, B. Kingsbury, and N. Morgan, "The T0 Vector Microprocessor," Hot Chips, vol. 7, pp. 187–196, 1995.
[3] P. Yiannacouras, J. G. Steffan, and J. Rose, "VESPA: Portable, Scalable, and Flexible FPGA-Based Vector Processors," International Conference on Compilers, Architecture and Synthesis for Embedded Systems (CASES), October 2008, Atlanta, GA.
[4] J. Yu, G. Lemieux, and C. Eagleston, "Vector Processing as a Soft-core CPU Accelerator," ACM International Symposium on FPGA, 2008.


Thank You!

Miriam Leeser
mel@coe.neu.edu
http://www.coe.neu.edu/Research/rcl/index.php

More details can be found in:
Jainik Kathiara's MS thesis, under the publications link
FCCM 2011 paper:
"An Autonomous Vector/Scalar Floating Point Coprocessor for FPGAs" by Jainik Kathiara and Miriam Leeser
