Systolic arrays for matrix multiplications


Systolic arrays for matrix multiplications
Natalija Stojanovic
Faculty of Electronic Engineering


Outline
- Systolic arrays
- Matrix-vector multiplication on the fixed-size systolic array
- Systolic array for fault-tolerant matrix-vector multiplication
- Systolic array for fault-tolerant matrix-matrix multiplication


Accelerators
- To provide the desired performance for computationally intensive problems, high-speed computing systems are designed.
- Accelerators provide higher performance at lower cost, lower power, or with less development effort than general-purpose systems.
- They find application in many areas (computer graphics, computer games, bioinformatics, ...).


Accelerators -> systolic arrays
- In domains suited to accelerator-based solutions, the combination of parallelism, pipelining, and regularity of computation is necessary.
- Accelerators that possess all of these characteristics are systolic arrays.


Systolic arrays
- Systolic arrays (SA) are VLSI/WSI components implemented on a chip and attached to the host computer as accelerators.
- SA are parallel architectures of simple processing elements, usually of the same type, which are locally connected and communicate locally, executing operations rhythmically and using pipelining.
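This rhythmic, locally synchronized style of computation can be illustrated with a minimal Python sketch (not from the original slides) of a linear systolic array performing matrix-vector multiplication; the function name and the diagonal schedule are illustrative choices:

```python
import numpy as np

def systolic_matvec(A, x):
    """Software model of a linear systolic array computing y = A x.

    PE i owns the accumulator y[i]; at clock step t it receives the
    operand pair (A[i, t-i], x[t-i]) -- a classic diagonal schedule in
    which data pulses through the array one PE per cycle."""
    m, n = A.shape
    y = np.zeros(m)
    for t in range(m + n - 1):      # global clock
        for i in range(m):          # all PEs work in parallel in hardware
            j = t - i               # operand index reaching PE i this cycle
            if 0 <= j < n:
                y[i] += A[i, j] * x[j]
    return y
```

In hardware the inner loop over PEs happens simultaneously within one clock cycle; the sequential simulation only models the data movement.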


Systolic arrays
- SA are very well suited to solving compute-bound tasks.
- SA are algorithm-oriented architectures, synthesized for solving one problem or a group of similar problems.


Applications
Digital signal processing:
- FIR filters
- 1D and 2D convolution
- 1D and 2D correlation
- DFT
- FFT
- template matching
Basic matrix algorithms:
- matrix-matrix multiplication
- matrix-vector multiplication
- solution of triangular linear systems
Non-numeric applications:
- data structures
- graph algorithms
- language recognition


Basic SA topologies: 1D SA (BLSA, ULSA)


Basic SA topologies: 2D SA


Matrix-vector multiplication on fixed-size linear systolic arrays
- If the dimension of the problem is too large and exceeds realistic limits on the number of PEs in the SA, the use of a fixed-size SA becomes important.
- Solution: decompose the algorithm and realize it in a finite number of iterations on the fixed-size SA.
- Because the memory system is a bottleneck, specific hardware for address generation that optimizes memory access is necessary.


Address generators
- Address generators have to provide efficient access to data in memory during SA operation, and fast data transfer between the SA and the host before and after the computation.
- Special attention is devoted to address generation units, which are designed to speed up the evaluation of address expressions.


The global structure of the targetarchitecture


Solution
- Since the number of PEs (p) is smaller than the problem size (m), each element of the resulting vector must pass through the SA several times.
- Matrix A is partitioned into quasidiagonal blocks; each block consists of p quasidiagonals.
- To enable the second iteration to begin immediately, an index transformation of the block matrices and vectors is necessary. This transformation ensures that there are no null elements between iterations.
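As a rough Python model of this decomposition (an illustrative sketch for a square matrix; the slides' exact index transformation is omitted), the diagonals of A can be grouped into blocks of p and processed one block per pass:

```python
import numpy as np

def matvec_in_bands(A, x, p):
    """Model of y = A x on a fixed-size array of p PEs.

    The 2m-1 diagonals of A (indexed by d = j - i) are grouped into
    'quasidiagonal' blocks of p; each pass through the array processes
    one block, so every element of y accumulates partial results over
    ceil((2m-1)/p) passes."""
    m = A.shape[0]
    y = np.zeros(m)
    diagonals = list(range(-(m - 1), m))
    for start in range(0, len(diagonals), p):   # one pass per block of p diagonals
        for d in diagonals[start:start + p]:
            i = np.arange(max(0, -d), min(m, m - d))
            y[i] += A[i, i + d] * x[i + d]
    return y
```

The key point the model reflects is that the same p PEs are reused in every pass, which is what makes the fixed-size array independent of the problem size m.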


The ordering of elements of matrix A at the beginning of computation (BLSA)


The ordering of the input items at the beginning of computation (BLSA, m = 5, p = 2)


The ordering of elements of matrix A at the beginning of computation (ULSA)


The ordering of the input items at the beginning of computation (ULSA, m = 5, p = 2)


Hardware structure of the system
- The memory interface subsystem (MIS), located between the host and the SA, which provides the corresponding data transfers from/to the SA, is designed in detail.
- If this interface is not efficient, a significant decrease in performance can result.


Hardware structure of MIS_C


Hardware structure of MIS_A


Hardware structure of MIS_B


FPGA implementation
- Introducing address generators comes at the cost of area overhead, in terms of equivalent gate count.
- We considered FPGA implementations of the ULSA and BLSA, with the corresponding address generators, for various numbers of PEs and various operand sizes.


FPGA implementation
- Implementations were performed on a Xilinx VIRTEX-E device, v1600efg1156.
- We define hardware overhead as the ratio of total equivalent gates in the SA with and without address generators.


Area overhead (%) for p = 5, p = 10, and p = 15


Area overhead (%) for p = 20, p = 25, and p = 30


Fault tolerance
- Fault tolerance has become an essential requirement in the design of VLSI/WSI array processors.
- In systems involving a large number of PEs and intensive computations, it is highly likely that some components will be permanently or temporarily faulty.
- Systolic arrays are such systems.


How can fault tolerance be achieved?
- Through some kind of redundancy (information, space, and/or time redundancy) or by reconfiguration.
- We used space-time redundancy followed by majority voting.
- The purpose of our work was to synthesize an SA with the optimal number of PEs for a given problem size.
- Fault tolerance is achieved through triplicated computation of the same problem instance, followed by majority voting.
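The triplication-plus-voting idea can be sketched in a few lines of Python (illustrative only; the values and the injected fault are made up):

```python
def majority_vote(a, b, c):
    """Return the value on which at least two of the three replicas agree.

    Under the single-fault assumption at most one input is corrupted,
    so if a disagrees with both others, then b == c must hold."""
    return a if (a == b or a == c) else b

# A triplicated computation: three replicas of the same dot product,
# one of which suffers a simulated transient fault.
u, v = [1, 2, 3], [4, 5, 6]
good = sum(x * y for x, y in zip(u, v))   # 32
replicas = [good, good + 7, good]         # replica 1 is faulty
result = majority_vote(*replicas)         # the fault is masked -> 32
```

With a double fault the single-fault assumption is violated and the vote can return a wrong value; the slides' later results quantify which multiple-fault patterns are still tolerated.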


Fault-tolerant algorithm for SA
- The problem is how to obtain a fault-tolerant systolic algorithm from the basic systolic algorithm.
- We have solved this problem for the cases of matrix-vector and matrix-matrix multiplication.
(Figure: data flows for pipeline periods λ = 1, λ = 2, and λ = 3)


Fault-tolerant algorithm for SA
- An SA with λ = 3 can concurrently perform the original algorithm and two redundant algorithms derived from the original one.
- The redundant computations can be performed by the idle PEs in idle clock cycles.
- We propose a procedure that enables the synthesis of a BLSA and a ULSA that perform the fault-tolerant algorithm for matrix-vector multiplication.
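The use of idle cycles can be modeled as follows (an illustrative Python sketch, not the actual PE design): with pipeline period λ = 3, a PE is busy on the original stream only every third cycle, so the two idle cycles can carry the two redundant streams.

```python
def interleaved_accumulate(streams):
    """Model of one PE with pipeline period 3.

    streams holds three input sequences of equal length: the original
    computation and its two redundant copies. Clock cycle t is owned by
    replica t % 3, so the redundancy costs no extra PEs and no extra
    time beyond what the pipeline period already leaves idle."""
    assert len(streams) == 3
    sums = [0, 0, 0]
    for t in range(3 * len(streams[0])):   # global clock
        r = t % 3                          # replica scheduled this cycle
        sums[r] += streams[r][t // 3]
    return sums
```

Feeding the same data on all three streams yields three independent copies of the result, ready for majority voting.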


BLSA synthesized from the basicsystolic algorithm


BLSA with pipeline period λ=3


BLSA synthesized from the fault-tolerantsystolic algorithm


Results
- For both the BLSA and the ULSA, single transient faults and a number of multiple error patterns can be tolerated.


Matrix multiplication on a fault-tolerant hexagonal systolic array
- A procedure for synthesizing a hexagonal systolic array that performs a fault-tolerant matrix multiplication algorithm is proposed.
- The purpose of our work was to synthesize an SA with the minimal possible number of PEs needed to perform fault-tolerant matrix multiplication.


How do we achieve this?
- Fault tolerance is achieved through triplicated computation of each element of the resulting matrix, followed by majority voting.
- Two architectures are proposed:
  - voting performed at the end of the computation
  - voting performed after each computation step
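The difference between the two architectures can be sketched on a single accumulation (a hypothetical Python model; the `inject` parameter, which simulates a transient fault at one replica and step, is my own device):

```python
def vote(a, b, c):
    """Majority of three, under the single-fault assumption."""
    return a if (a == b or a == c) else b

def dot_vote_at_end(u, v, inject=None):
    """Three replicas accumulate independently; a single vote at the end."""
    sums = [0, 0, 0]
    for k, (x, y) in enumerate(zip(u, v)):
        for r in range(3):
            sums[r] += x * y
            if inject == (r, k):       # simulated transient fault
                sums[r] += 1
    return vote(*sums)

def dot_vote_each_step(u, v, inject=None):
    """Replicas are re-synchronized by a vote after every step, so a
    transient error is masked before it can reach the next step."""
    s = 0
    for k, (x, y) in enumerate(zip(u, v)):
        partials = [s + x * y for _ in range(3)]
        if inject and inject[1] == k:
            partials[inject[0]] += 1   # simulated transient fault
        s = vote(*partials)
    return s
```

The design trade-off this sketch highlights: in the first scheme a replica corrupted early carries its wrong partial sum to the final vote (still masked, but a second fault in another replica would defeat the voter), while in the second scheme every step restarts from a voted value, at the cost of voting hardware active on every step.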


Results
- With the proposed solutions, any single permanent or transient fault can be tolerated.
- A number of multiple fault patterns can also be tolerated.
- Errors are masked concurrently with the normal operation of the SA.


A detail of the voting mechanism when voting is performed at the end of the computation


A detail of the voting mechanism when voting is performed after each computation step


Which of the proposed solutions is better?
- We simulated matrix multiplication on the SA, inserting errors randomly, for various matrix dimensions and error rates.


Simulation results for matrix dimension 50x50 (number of operations: 300,000,000)


Simulation results for matrix dimension 200x200 (number of operations: 312,000,000)


Simulation results for matrix dimension 500x500 (number of operations: 375,000,000)


Conclusions
- Systolic arrays can be successfully used as accelerators in application domains that need matrix multiplication.
- When the dimension of the problem is too large, the use of fixed-size SA becomes important:
  - a mathematical model for the realization of matrix-vector multiplication on a fixed-size linear SA is derived
  - the memory interface that provides the corresponding data transfers from/to the SA is designed in detail
  - FPGA implementations of fixed-size systolic arrays and the corresponding address generators are performed


Conclusions
- Fault tolerance is one of the crucial requirements in the design of systolic arrays.
- A procedure for synthesizing linear systolic arrays (BLSA and ULSA) for fault-tolerant matrix-vector multiplication is derived.
- A procedure for synthesizing hexagonal systolic arrays for fault-tolerant matrix-matrix multiplication is derived.
