The Hierarchically Tiled Arrays Programming Approach

Our toolbox has been implemented and tested on an IBM SP system, an HP HPC320 server with Alpha processors and Tru64 OS, and on a small cluster of PCs running Linux. We present experimental results for the IBM SP system. Most of our measurements were done on the IBM SP system because of its higher availability, larger number of processors, and the larger number of MATLAB licenses available to us. Our IBM SP system consists of two nodes. Each node has 8 Power3 processors that run at 375 MHz and 8 GB of memory. Two configurations have been used in the experiments: one with a 2 × 2 mesh of 4 servers, and another with a 3 × 3 mesh of 9 servers. Each node contributes half of the servers. There is an additional processor that executes the main thread of the program, thus acting as the client or master of the system.

The computational engine and interpreter for our HTA toolbox was implemented using MATLAB R13 (Section 4). The C files of the toolbox were compiled with the VisualAge C xlc compiler for AIX, version 5.0, with the O3 flag. The MPI library used was the one provided by the IBM Parallel Environment for AIX, version 3.1. Parallel MATLAB benchmarks were executed in this framework, while serial benchmarks were executed using plain MATLAB R13.

Two series of benchmarks were used to evaluate the HTA toolbox. The first one consists of five simple kernels: smv implements the sparse matrix-vector product shown in Fig. 4 and described in Section 3.1; cannon is an implementation of Cannon's matrix multiplication algorithm [4], whose main loop is depicted in Fig. 5; jacobi is the Jacobi relaxation code shown in Fig. 6; summa represents the SUMMA [7] matrix multiplication algorithm; and finally lu is an implementation of the LU factorization algorithm. Our second set of benchmarks consists of the well-known ep, mg, ft and cg NAS benchmarks, version 3.1 [1].

Four versions of each benchmark were written: two serial programs, one in MATLAB and the other in C/Fortran, and two parallel versions, one implemented using HTAs and the other using MPI+C/Fortran. In the MPI implementation of the benchmarks the blocks were mapped to the mesh of processors in the same way the HTAs map them.

The serial C and MPI implementations of cannon, summa and lu were written using the IBM ESSL library, while the corresponding MATLAB HTA codes use the MATLAB libraries, which are less optimized. The C codes were compiled with the xlc compiler with the O3 flag. The Fortran codes (serial and MPI versions of the NAS benchmarks) were compiled using the xlf compiler from IBM, also with the O3 flag.
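Since the HTA source for these kernels appears only in the referenced figures, the following is a minimal sketch of the blocked computation that the cannon kernel expresses (cf. Fig. 5), written in plain MATLAB with ordinary cell arrays standing in for HTAs. The function name, the cell-array layout and the explicit serial loops over tiles are illustrative assumptions, not the toolbox API; in the HTA version each tile-level operation is distributed across the mesh of servers.

function C = cannon_tiled(A, B, m)
% Plain-MATLAB sketch (not the HTA toolbox): cell arrays stand in for HTAs.
% A, B: m-by-m cell arrays whose entries are equally sized square blocks.
% Returns C = A*B as an m-by-m cell array of blocks.
C = cell(m, m);
for i = 1:m
    for j = 1:m
        C{i,j} = zeros(size(A{i,j}));
    end
end
% Initial skew: shift block row i of A left by i-1 positions and
% block column i of B up by i-1 positions.
for i = 2:m
    A(i,:) = circshift(A(i,:), [0, -(i-1)]);
    B(:,i) = circshift(B(:,i), [-(i-1), 0]);
end
% Main loop: multiply the aligned blocks, then rotate A one block to
% the left and B one block up.
for k = 1:m
    for i = 1:m
        for j = 1:m
            C{i,j} = C{i,j} + A{i,j} * B{i,j};
        end
    end
    A = circshift(A, [0, -1]);
    B = circshift(B, [-1, 0]);
end
end

A 2 × 2 tiling such as the one used in the 4-server runs can be obtained with mat2cell, e.g. A = mat2cell(X, [n n], [n n]) for a 2n × 2n matrix X.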
Figure 11: Speedup for HTA and MPI+C/Fortran on 4 servers for the simple kernels (smv, cannon, jacobi, summa, lu).

Figure 12: Speedup for HTA and MPI+C/Fortran on 9 servers for the simple kernels (smv, cannon, jacobi, summa, lu).

Finally, notice that although the main focus of our work is the achievement of parallelism, HTAs can also be used to express locality. In fact, we have run experiments showing that when multiplying matrices of sizes larger than 2000 × 2000, an implementation based on HTAs, but executed serially, can run 8% faster than the native MATLAB matrix product routine.

In the next sections we evaluate the performance and the ease of programming of the MPI and the HTA approaches to writing parallel programs.

5.2 Performance Analysis

5.2.1 Simple benchmarks

Figs. 11 and 12 show the speedup obtained by the simple kernels for the HTA and MPI+C/Fortran versions when running on 4 and 9 servers, respectively. The speedup of the HTA parallel versions is computed from the parallel MATLAB
