Systolic arrays for matrix multiplications


Systolic arrays for matrix multiplications
Natalija Stojanovic
Faculty of Electronic Engineering


Outline
- Systolic arrays
- Matrix-vector multiplication on the fixed-size systolic array
- Systolic array for fault-tolerant matrix-vector multiplication
- Systolic array for fault-tolerant matrix-matrix multiplication


Accelerators
- To provide the desired performance for computationally intensive problems, high-speed computing systems are designed.
- Accelerators provide higher performance at lower cost, lower power, or with less development effort than general-purpose systems.
- They find application in many areas (computer graphics, computer games, bioinformatics, ...).


Accelerators -> systolic arrays
- In domains suited to accelerator-based solutions, the combination of parallelism, pipelining, and regularity of computation is necessary.
- Accelerators that possess all of these characteristics are systolic arrays.


Systolic arrays
- Systolic arrays (SA) are VLSI/WSI components implemented on a chip and attached to the host computer as accelerators.
- SA are parallel architectures of simple processing elements, usually of the same type, which are locally connected and communicate locally, executing operations rhythmically and using pipelining.
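This rhythmic, locally synchronized style of computation can be illustrated with a minimal Python sketch (not from the original slides) of a linear systolic array performing matrix-vector multiplication; the function name and the diagonal schedule are illustrative choices:

```python
import numpy as np

def systolic_matvec(A, x):
    """Software model of a linear systolic array computing y = A x.

    PE i owns the accumulator y[i]; at clock step t it receives the
    operand pair (A[i, t-i], x[t-i]) -- a classic diagonal schedule in
    which data pulses through the array one PE per cycle."""
    m, n = A.shape
    y = np.zeros(m)
    for t in range(m + n - 1):      # global clock
        for i in range(m):          # all PEs work in parallel in hardware
            j = t - i               # operand index reaching PE i this cycle
            if 0 <= j < n:
                y[i] += A[i, j] * x[j]
    return y
```

In hardware the inner loop over PEs happens simultaneously within one clock cycle; the sequential simulation only models the data movement.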


Systolic arrays
- SA are very well suited to solving compute-bound tasks.
- SA are algorithm-oriented architectures, synthesized for solving one problem or a group of similar problems.


Applications
Digital signal processing:
- FIR filters
- 1D and 2D convolution
- 1D and 2D correlation
- DFT
- FFT
- template matching
Basic matrix algorithms:
- matrix-matrix multiplication
- matrix-vector multiplication
- solution of triangular linear systems
Non-numeric applications:
- data structures
- graph algorithms
- language recognition


Basic SA topologies: 1D SA (BLSA, ULSA)


Basic SA topologies: 2D SA


Matrix-vector multiplication on fixed-size linear systolic arrays
- If the dimension of the problem is too large and exceeds realistic limits on the number of PEs in the SA, the use of a fixed-size SA becomes important.
- Solution: decompose the algorithm and realize it in a finite number of iterations on the fixed-size SA.
- Because the memory system is a bottleneck, specific hardware for address generation that optimizes memory access is necessary.


Address generators
- Address generators have to provide efficient access to data in memory during SA operation, and fast data transfer between the SA and the host before and after the computation.
- Special attention is devoted to address generation units, which are designed to speed up the evaluation of address expressions.


The global structure of the targetarchitecture


Solution
- Since the number of PEs (p) is smaller than the problem size (m), each element of the resulting vector must pass through the SA several times.
- Matrix A is partitioned into quasidiagonal blocks; each block consists of p quasidiagonals.
- To enable the second iteration to begin immediately, an index transformation of the block matrices and vectors is necessary. This transformation ensures that there are no null elements between iterations.
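As a rough Python model of this decomposition (an illustrative sketch for a square matrix; the slides' exact index transformation is omitted), the diagonals of A can be grouped into blocks of p and processed one block per pass:

```python
import numpy as np

def matvec_in_bands(A, x, p):
    """Model of y = A x on a fixed-size array of p PEs.

    The 2m-1 diagonals of A (indexed by d = j - i) are grouped into
    'quasidiagonal' blocks of p; each pass through the array processes
    one block, so every element of y accumulates partial results over
    ceil((2m-1)/p) passes."""
    m = A.shape[0]
    y = np.zeros(m)
    diagonals = list(range(-(m - 1), m))
    for start in range(0, len(diagonals), p):   # one pass per block of p diagonals
        for d in diagonals[start:start + p]:
            i = np.arange(max(0, -d), min(m, m - d))
            y[i] += A[i, i + d] * x[i + d]
    return y
```

The key point the model reflects is that the same p PEs are reused in every pass, which is what makes the fixed-size array independent of the problem size m.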


The ordering of elements of matrix A at the beginning of computation (BLSA)


The ordering of the input items at the beginning of computation (BLSA, m = 5, p = 2)


The ordering of elements of matrix A at the beginning of computation (ULSA)


The ordering of the input items at the beginning of computation (ULSA, m = 5, p = 2)


Hardware structure of the system
- The memory interface subsystem (MIS), located between the host and the SA, which provides the corresponding data transfers from/to the SA, is designed in detail.
- If this interface is not efficient, a significant decrease in performance can result.


Hardware structure of MIS_C


Hardware structure of MIS_A


Hardware structure of MIS_B


FPGA implementation
- Introducing address generators comes at the cost of area overhead, in terms of equivalent gate count.
- We considered FPGA implementations of the ULSA and BLSA, with the corresponding address generators, for various numbers of PEs and various operand sizes.


FPGA implementation
- Implementations were performed on a Xilinx VIRTEX-E device, v1600efg1156.
- We define hardware overhead as the ratio of total equivalent gates in the SA with and without address generators.


Area overhead (%) for p = 5, p = 10, and p = 15


Area overhead (%) for p = 20, p = 25, and p = 30


Fault tolerance
- Fault tolerance has become an essential requirement in the design of VLSI/WSI array processors.
- In systems involving a large number of PEs and intensive computations, it is highly likely that some components will be permanently or temporarily faulty.
- Systolic arrays are such systems.


How can fault tolerance be achieved?
- Through some kind of redundancy (information, space, and/or time redundancy) or by reconfiguration.
- We used space-time redundancy followed by majority voting.
- The purpose of our work was to synthesize an SA with the optimal number of PEs for a given problem size.
- Fault tolerance is achieved through triplicated computation of the same problem instance, followed by majority voting.
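The triplication-plus-voting idea can be sketched in a few lines of Python (illustrative only; the values and the injected fault are made up):

```python
def majority_vote(a, b, c):
    """Return the value on which at least two of the three replicas agree.

    Under the single-fault assumption at most one input is corrupted,
    so if a disagrees with both others, then b == c must hold."""
    return a if (a == b or a == c) else b

# A triplicated computation: three replicas of the same dot product,
# one of which suffers a simulated transient fault.
u, v = [1, 2, 3], [4, 5, 6]
good = sum(x * y for x, y in zip(u, v))   # 32
replicas = [good, good + 7, good]         # replica 1 is faulty
result = majority_vote(*replicas)         # the fault is masked -> 32
```

With a double fault the single-fault assumption is violated and the vote can return a wrong value; the slides' later results quantify which multiple-fault patterns are still tolerated.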


Fault-tolerant algorithm for SA
- The problem is how to obtain a fault-tolerant systolic algorithm from the basic systolic algorithm.
- We have solved this problem for the cases of matrix-vector and matrix-matrix multiplication.
(Figure: data flows for pipeline periods λ = 1, λ = 2, and λ = 3)


Fault-tolerant algorithm for SA
- An SA with λ = 3 can concurrently perform the original algorithm and two redundant algorithms derived from the original one.
- The redundant computations can be performed by the idle PEs in idle clock cycles.
- We propose a procedure that enables the synthesis of a BLSA and a ULSA that perform the fault-tolerant algorithm for matrix-vector multiplication.
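The use of idle cycles can be modeled as follows (an illustrative Python sketch, not the actual PE design): with pipeline period λ = 3, a PE is busy on the original stream only every third cycle, so the two idle cycles can carry the two redundant streams.

```python
def interleaved_accumulate(streams):
    """Model of one PE with pipeline period 3.

    streams holds three input sequences of equal length: the original
    computation and its two redundant copies. Clock cycle t is owned by
    replica t % 3, so the redundancy costs no extra PEs and no extra
    time beyond what the pipeline period already leaves idle."""
    assert len(streams) == 3
    sums = [0, 0, 0]
    for t in range(3 * len(streams[0])):   # global clock
        r = t % 3                          # replica scheduled this cycle
        sums[r] += streams[r][t // 3]
    return sums
```

Feeding the same data on all three streams yields three independent copies of the result, ready for majority voting.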


BLSA synthesized from the basicsystolic algorithm


BLSA with pipeline period λ=3


BLSA synthesized from the fault-tolerantsystolic algorithm


Results
- For both the BLSA and the ULSA, single transient faults and a number of multiple error patterns can be tolerated.


Matrix multiplication on a fault-tolerant hexagonal systolic array
- A procedure for synthesizing a hexagonal systolic array that performs a fault-tolerant matrix multiplication algorithm is proposed.
- The purpose of our work was to synthesize an SA with the minimal possible number of PEs needed to perform fault-tolerant matrix multiplication.


How do we achieve this?
- Fault tolerance is achieved through triplicated computation of each element of the resulting matrix, followed by majority voting.
- Two architectures are proposed:
  - voting performed at the end of the computation
  - voting performed after each computation step
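The difference between the two architectures can be sketched on a single accumulation (a hypothetical Python model; the `inject` parameter, which simulates a transient fault at one replica and step, is my own device):

```python
def vote(a, b, c):
    """Majority of three, under the single-fault assumption."""
    return a if (a == b or a == c) else b

def dot_vote_at_end(u, v, inject=None):
    """Three replicas accumulate independently; a single vote at the end."""
    sums = [0, 0, 0]
    for k, (x, y) in enumerate(zip(u, v)):
        for r in range(3):
            sums[r] += x * y
            if inject == (r, k):       # simulated transient fault
                sums[r] += 1
    return vote(*sums)

def dot_vote_each_step(u, v, inject=None):
    """Replicas are re-synchronized by a vote after every step, so a
    transient error is masked before it can reach the next step."""
    s = 0
    for k, (x, y) in enumerate(zip(u, v)):
        partials = [s + x * y for _ in range(3)]
        if inject and inject[1] == k:
            partials[inject[0]] += 1   # simulated transient fault
        s = vote(*partials)
    return s
```

The design trade-off this sketch highlights: in the first scheme a replica corrupted early carries its wrong partial sum to the final vote (still masked, but a second fault in another replica would defeat the voter), while in the second scheme every step restarts from a voted value, at the cost of voting hardware active on every step.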


Results
- With the proposed solutions, any single permanent or transient fault can be tolerated.
- A number of multiple fault patterns can also be tolerated.
- Errors are masked concurrently with the normal operation of the SA.


A detail of the voting mechanism when voting is performed at the end of the computation


A detail of the voting mechanism when voting is performed after each computation step


Which of the proposed solutions is better?
- We simulated matrix multiplication on the SA, inserting errors randomly, for various matrix dimensions and error rates.


Simulation results for matrix dimension 50x50 (number of operations: 300,000,000)


Simulation results for matrix dimension 200x200 (number of operations: 312,000,000)


Simulation results for matrix dimension 500x500 (number of operations: 375,000,000)


Conclusions
- Systolic arrays can be successfully used as accelerators in application domains that need matrix multiplication.
- When the dimension of the problem is too large, the use of fixed-size SA becomes important:
  - a mathematical model for the realization of matrix-vector multiplication on a fixed-size linear SA is derived
  - the memory interface that provides the corresponding data transfers from/to the SA is designed in detail
  - FPGA implementations of fixed-size systolic arrays and the corresponding address generators are performed


Conclusions
- Fault tolerance is one of the crucial requirements in the design of systolic arrays.
- A procedure for synthesizing linear systolic arrays (BLSA and ULSA) for fault-tolerant matrix-vector multiplication is derived.
- A procedure for synthesizing hexagonal systolic arrays for fault-tolerant matrix-matrix multiplication is derived.
