
2012 2nd International Conference on Future Computers in Education
Lecture Notes in Information Technology, Vols. 23-24

Parallel Algorithms for Matrix Multiplication

Petre Anghelescu

University of Pitesti, Faculty of Electronics, Communications and Computers
Str. Targu din Vale, No. 1, 110040, Pitesti, Arges, Romania
petre.anghelescu@upit.ro

Keywords: Parallel processing, Parallel algorithms, Matrix multiplication, MPI, Communication performance.

Abstract. Technological evolution in the field of communication using networked computers provides appropriate solutions for implementing parallel algorithms in order to achieve high-speed applications. In this paper, we show how matrix multiplication can be implemented on a network of computers using the MPI (Message Passing Interface) standard. We present an analysis of the time required by two different implementations of matrix multiplication: a sequential implementation and a distributed (parallel) one, and we analyze the performance of the parallel algorithms. The only real requirement is that the matrices used in the multiplication are square. Our experimental platform consists of homogeneous Intel computers, and the application is implemented using the MPI standard. Based on our experiments, we extract useful conclusions that can serve as guidelines for optimizing matrix multiplication.

1. Introduction

The development of parallel computation methods for solving time-consuming problems is always serious work. Matrices and matrix operations are widely used in the mathematical modeling of processes, phenomena and systems in physics, economics, computer graphics, and so on. Matrix calculations are the basis of many research and engineering computations.

Being time-consuming, matrix operations are a classical area for applying parallel computations. On the one hand, the use of high-performance multiprocessor systems makes it possible to significantly increase the complexity of the problems being solved. On the other hand, due to their simple formulation, matrix operations give a good opportunity to demonstrate many techniques and methods of parallel programming.

The paper is organized as follows. The next section defines sequential and parallel computation and introduces the concepts needed for both computation methods. Section 3 presents the principles used for parallel matrix multiplication: block-striped (row-wise and column-wise) decomposition and chessboard (checkerboard) block decomposition. Experimental results are presented in Section 4. The last section draws the conclusions of the paper.

2. Basics of parallel computation

Parallel computing is a computing paradigm in which multiple processors cooperate in the completion of a single task. Within the parallel computing paradigm, there are two memory models: shared memory and distributed memory. The shared-memory model distinguishes itself by presenting the programmer with the illusion of a single memory space. The distributed-memory model, on the other hand, presents the programmer with a separate memory space for each processor. Processors therefore have to share information by sending messages to each other. To send these messages, applications usually call a standard communication library, typically MPI [1] or PVM (Parallel Virtual Machine) [2], with MPI rapidly becoming the standard, rather than PVM, for writing scientific programs with explicit message passing.

An important component in the performance of a distributed-memory parallel computing application is the performance of the communication library that the application uses. Therefore, the hardware and software systems providing these communication functions must be tuned to the highest degree possible. An important class of information that would aid in the tuning of a communication library is an understanding of the communication patterns that occur within applications. This includes information such as the relative frequency with which the various functions within the communication library are called, the lengths of the messages involved, and the ordering of the messages.

Message-passing programming vs. sequential programming

The main concepts needed to build and program a serial computer are well understood. A physical device called a processor is connected to a memory, as illustrated in Fig. 1. The data in the memory can be read or overwritten by that processor.

Fig. 1. Sequential programming paradigm.

In the message-passing programming model, each process has a local memory, and no other process can directly read from or write to that local memory. The message-passing paradigm is illustrated in Fig. 2.

Fig. 2. Message-passing programming paradigm.

An MPI program consists of a set of processes and a logical communication medium connecting those processes. MPI is a controlled API standard for programming a wide array of parallel architectures. Though MPI was originally intended for classic distributed-memory architectures, it is used on various architectures, from networks of PCs through large shared-memory systems to massively parallel architectures such as the Cray T3D and Intel Paragon. The complete MPI API offers over 200 operations, which makes it a rather complex programming API. However, most MPI applications use only six to ten of the available operations.

MPI is intended for the Single Program Multiple Data (SPMD) programming paradigm: all nodes run the same application code. The SPMD paradigm is efficient and easy to use for a large set of scientific applications with a regular execution pattern. Other, less regular, applications are far less suited to this paradigm, and their implementation in MPI is tedious.
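As an illustration of this SPMD style (a minimal sketch of our own, not taken from the paper), the following C program uses only the handful of operations most MPI applications rely on: initialization, rank and size queries, point-to-point send/receive, and finalization. Every process executes the same code and branches on its rank.

/* Minimal SPMD sketch: every process runs this same code and branches on its rank. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, size;

    MPI_Init(&argc, &argv);                  /* enter the MPI environment        */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);    /* which process am I?              */
    MPI_Comm_size(MPI_COMM_WORLD, &size);    /* how many processes are running?  */

    if (rank == 0) {
        /* the root sends one integer to every other process */
        for (int dest = 1; dest < size; dest++)
            MPI_Send(&dest, 1, MPI_INT, dest, 0, MPI_COMM_WORLD);
    } else {
        int value;
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("process %d of %d received %d\n", rank, size, value);
    }

    MPI_Finalize();                          /* leave the MPI environment        */
    return 0;
}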

MPI supports both group broadcasting and global reductions. Being SPMD, all nodes have to meet at a group operation; i.e., a broadcast operation blocks until all the processes in the context have issued the broadcast operation. This is important because it turns all group operations into synchronization points in the application. The MPI API also supports scatter-gather for easy exchange of large data structures, as well as virtual architecture topologies, which allow source-code-compatible MPI applications to execute efficiently across different platforms [3, 4].
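The fragment below is a sketch of our own (assuming the matrix dimension n is divisible by the number of processes) showing how these collective operations are typically combined: the root scatters row blocks of A, broadcasts the whole of B, and gathers the partial results; each call blocks until every process in the communicator has reached it.

/* Sketch of the collective operations mentioned above; assumes n is divisible
 * by the number of processes and that A, B, C are allocated on the root. */
#include <mpi.h>
#include <stdlib.h>

void distribute_and_collect(double *A, double *B, double *C, int n)
{
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int rows = n / size;                       /* rows of A handled by each process */
    double *A_local = malloc((size_t)rows * n * sizeof(double));
    double *C_local = malloc((size_t)rows * n * sizeof(double));

    /* scatter row blocks of A from the root to all processes */
    MPI_Scatter(A, rows * n, MPI_DOUBLE, A_local, rows * n, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    /* broadcast the whole matrix B; the call blocks until every process has issued it */
    MPI_Bcast(B, n * n, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    /* ... each process computes C_local = A_local * B here ... */

    /* gather the row blocks of the result back on the root */
    MPI_Gather(C_local, rows * n, MPI_DOUBLE, C, rows * n, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    free(A_local);
    free(C_local);
}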

3. Parallelizing principles of matrix multiplication

As a result of multiplying the matrix A of dimension m x n by the matrix B of size n x l, we obtain the matrix C of size m x l, with each element defined according to expression (1):

c_{ij} = \sum_{k=0}^{n-1} a_{ik} \, b_{kj}, \quad 0 \le i < m, \ 0 \le j < l. \qquad (1)
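A direct sequential implementation of expression (1) is the classical triple loop. The sketch below (our illustration, with matrices stored as row-major one-dimensional arrays) is the kind of serial baseline against which the parallel versions are compared.

/* Serial baseline: C = A * B following expression (1).
 * A is m x n, B is n x l, C is m x l, all stored row-major. */
void matmul_serial(const double *A, const double *B, double *C,
                   int m, int n, int l)
{
    for (int i = 0; i < m; i++) {
        for (int j = 0; j < l; j++) {
            double sum = 0.0;
            for (int k = 0; k < n; k++)
                sum += A[i * n + k] * B[k * l + j];
            C[i * l + j] = sum;
        }
    }
}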

The development of algorithms (in particular, of methods of parallel computation) for solving complicated research and engineering problems can be a real challenge. Here we assume that the computational scheme for solving the problem of matrix multiplication is already known. The activities for determining efficient methods of parallel computation are the following:

- to analyze the available computation scheme and to decompose it into subtasks which may, to a great degree, be executed independently;
- to identify the information dependencies among the selected set of subtasks; these information dependencies must be satisfied in the course of the parallel computations;
- to determine the necessary or available computational system for solving the problem and to distribute the subtasks among the processors.

These stages of parallel algorithm development were first suggested by I. Foster [5]. Viewed in the large, it is obvious that the amount of computation assigned to each processor must be approximately the same; this makes it possible to balance the computational load across the processors. Besides, it is clear that the distribution of subtasks among the processors must be carried out so that the number of communication interactions among the subtasks is minimal.

The existence of various data distribution schemes generates a series of parallel algorithms for matrix computations: block-striped (rows and columns) matrix partitioning and chessboard block matrix partitioning [6, 7].

3.1 Block-striped matrix data decomposition

In the case of block-striped partitioning, each processor is assigned a certain subset of matrix rows (row-wise or horizontal partitioning) or matrix columns (column-wise or vertical partitioning); see Fig. 3 (a) and (b).

Fig. 3. Ways to distribute matrix elements among the processors. (a) Row-wise or horizontal partitioning; (b) column-wise or vertical partitioning.

Rows and columns are in most cases divided into stripes on a continuous sequential basis. The general scheme of informational interaction among subtasks in the course of the computations is shown in Fig. 4.



Fig. 4. Example of computation in the case of block-striped matrix partitioning.

To compute a row of the matrix C, each subtask must have a row of the matrix A and access to all columns of the matrix B. Possible ways to organize the parallel computations are described below.

The first algorithm is an iterative procedure in which the number of iterations is equal to the number of subtasks. Each subtask holds one row of the matrix A and one column of the matrix B at each iteration. At each iteration, the scalar products of the rows and columns contained in the subtasks are computed, and the corresponding elements of the result matrix C are obtained. After the computations of an iteration are completed, the columns of matrix B must be transmitted so that each subtask receives a new column of the matrix B and new elements of the matrix C can be calculated. This transmission of columns among the subtasks must be executed in such a way that every column of matrix B appears in each subtask in turn.
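A possible realization of this first scheme is sketched below (our own illustration, under the simplifying assumption that each process holds exactly one row of A and one column of B, i.e. n equals the number of processes): after every partial product, the columns of B are circulated around a ring of processes with MPI_Sendrecv_replace.

/* Block-striped sketch: each process holds one row of A and one column of B.
 * After every iteration the columns of B are shifted cyclically so that each
 * process eventually sees all columns (assumes n == number of processes). */
#include <mpi.h>

void matmul_striped(const double *a_row, double *b_col, double *c_row, int n)
{
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int next = (rank + 1) % size;            /* ring neighbours for the column shift */
    int prev = (rank - 1 + size) % size;

    for (int iter = 0; iter < size; iter++) {
        /* index of the column of B currently held by this process */
        int j = (rank + iter) % size;

        double sum = 0.0;
        for (int k = 0; k < n; k++)
            sum += a_row[k] * b_col[k];      /* scalar product of the row and column */
        c_row[j] = sum;

        /* pass the current column to the previous process and receive a new one */
        MPI_Sendrecv_replace(b_col, n, MPI_DOUBLE, prev, 0, next, 0,
                             MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }
}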

The second algorithm differs from the first in that the subtasks contain not columns but rows of the matrix B. As a result, the computation in each subtask is the multiplication of the row elements of the matrix B by the corresponding element of the row of the matrix A, so that a row of partial results for the matrix C is obtained in each subtask. For this scheme of data decomposition it is necessary to ensure that all rows of the matrix B visit each subtask in turn, that the row elements of the matrix B are multiplied by the corresponding element of the row of the matrix A, and that the new values are summed with the previously computed ones.
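The per-iteration update of this second scheme is just an accumulation of partial rows. A minimal sketch of that inner step is given below (our notation: a_row is the subtask's row of A, b_row is the row of B currently visiting the subtask, and k is that row's index).

/* Second block-striped scheme: when row k of B visits this subtask, every
 * element of that row is multiplied by the single element a_row[k] and the
 * products are accumulated into the partial result row of C. */
void accumulate_partial_row(const double *a_row, const double *b_row,
                            double *c_row, int k, int n)
{
    for (int j = 0; j < n; j++)
        c_row[j] += a_row[k] * b_row[j];
}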

With regard to the number and duration of the operations, the time for carrying out the computations of the parallel algorithm may be estimated as follows:

T_p = (n^2 / p)(2n - 1)\tau, \qquad (2)

where n is the matrix size and \tau is the execution time for a basic computational operation (this value has been computed in the course of testing the serial algorithm).
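For illustration (a numerical example of our own, not one of the measurements reported below): with n = 1024 and p = 4, Eq. (2) gives T_p = (1024^2/4)(2 \cdot 1024 - 1)\tau \approx 5.4 \times 10^8 \tau, while the corresponding serial estimate n^2(2n - 1)\tau \approx 2.1 \times 10^9 \tau, i.e. an ideal speedup close to p = 4.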

3.2 Chessboard matrix data decomposition

In this method of data decomposition, the initial matrices A and B and the result matrix C are subdivided into sets of blocks. To simplify the further explanation, we will assume that all the matrices are square of size n x n and that the number of vertical blocks equals the number of horizontal blocks, both being equal to q (i.e. each block has size k x k, with k = n/q). In this method the matrix is divided into rectangular sets of elements, as depicted in Fig. 5.

Fig. 5. Chessboard block matrix partitioning.

As a rule, this is done on a continuous basis. Let the number of processors be p = s x q, with the number of matrix rows divisible by s and the number of columns divisible by q. For this approach it is expedient that the computational system have a physical, or at least a logical, processor grid topology of s rows and q columns. Then, for data distribution on a continuous basis, processors that are neighbors in the grid structure process adjoining blocks of the original matrices. It should be noted, however, that cyclic alternation of rows and columns can also be used for the chessboard block scheme.

The execution of this algorithm requires q iterations, during which each processor multiplies its current blocks of the matrices A and B and adds the result to the current block of the matrix C. The execution time of the parallel algorithm (chessboard block matrix partitioning) corresponding to the processor calculations is:

T_p = q \left[ (n^2/p)(2n/q - 1) + n^2/p \right] \tau, \qquad (3)

where n is the matrix size, q is the number of columns of the grid topology, p is the number of processors in the grid, and \tau is the execution time for a basic computational operation.
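One common way to realize the grid topology and the q iterations described above is with MPI's Cartesian communicators. The sketch below is our own illustration in the style of Cannon's algorithm, not the paper's code; it assumes a square q x q grid (so p = q^2), that each process already holds its k x k blocks (k = n/q), and it omits the initial alignment (skewing) of the blocks for brevity.

/* Chessboard sketch in the style of Cannon's algorithm: a q x q process grid,
 * each process owns one k x k block of A, B and C (k = n/q).  At every one of
 * the q steps the local blocks are multiplied and accumulated, then the A
 * blocks shift left along grid rows and the B blocks shift up along columns. */
#include <mpi.h>

void matmul_chessboard(double *Ablk, double *Bblk, double *Cblk, int k, int q)
{
    int dims[2] = {q, q}, periods[2] = {1, 1};
    MPI_Comm grid;
    MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 0, &grid);

    int left, right, up, down;
    MPI_Cart_shift(grid, 1, 1, &left, &right);   /* neighbours along the grid row    */
    MPI_Cart_shift(grid, 0, 1, &up, &down);      /* neighbours along the grid column */

    for (int step = 0; step < q; step++) {
        /* multiply the current blocks and accumulate into the block of C */
        for (int i = 0; i < k; i++)
            for (int j = 0; j < k; j++)
                for (int t = 0; t < k; t++)
                    Cblk[i * k + j] += Ablk[i * k + t] * Bblk[t * k + j];

        /* shift A blocks left along the row, B blocks up along the column */
        MPI_Sendrecv_replace(Ablk, k * k, MPI_DOUBLE, left, 0, right, 0,
                             grid, MPI_STATUS_IGNORE);
        MPI_Sendrecv_replace(Bblk, k * k, MPI_DOUBLE, up, 1, down, 1,
                             grid, MPI_STATUS_IGNORE);
    }
    MPI_Comm_free(&grid);
}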

4. Experimental results

Results of the computational experiments for the parallel matrix multiplication algorithm based on block-striped matrix partitioning are presented in Fig. 6 (a). We calculate the speedup by dividing the serial execution time by the parallel execution time. The results are presented in Fig. 6 (b).

Fig. 6. (a) Theoretical and experimental execution times with respect to matrix size (block-striped matrix decomposition). (b) Speedup for the parallel algorithm of matrix multiplication (block-striped matrix decomposition).
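The speedup figures are obtained as the ratio of serial to parallel execution time. A typical way to take such measurements in an MPI code (a small sketch of our own, not the paper's instrumentation) is to wrap the multiplication kernel with MPI_Wtime and barriers:

/* Timing sketch: measure the parallel kernel with MPI_Wtime and let the root
 * report the speedup with respect to a previously measured serial time. */
#include <mpi.h>
#include <stdio.h>

double time_kernel(void (*kernel)(void), double t_serial)
{
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);          /* make sure every process starts together */
    double t0 = MPI_Wtime();
    kernel();                             /* the parallel multiplication under test  */
    MPI_Barrier(MPI_COMM_WORLD);
    double t_parallel = MPI_Wtime() - t0;

    if (rank == 0)
        printf("parallel time %.3f s, speedup %.2f\n",
               t_parallel, t_serial / t_parallel);
    return t_parallel;
}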

Results of the computational experiments for the parallel matrix multiplication algorithm based on chessboard block matrix partitioning are presented in Fig. 7 (a). We calculate the speedup by dividing the serial execution time by the parallel execution time. The results are presented in Fig. 7 (b).

Fig. 7. (a) Theoretical and experimental execution times for chessboard block matrix decomposition. (b) Speedup for the parallel algorithm of matrix multiplication with respect to the number of processors.

The summary graph in Fig. 8 presents the speedup values obtained as a result of the computational experiments for all the discussed algorithms. The computations have shown that increasing the number of processors improves the efficiency of the chessboard block multiplication algorithm.

Fig. 8. Speedup of the matrix multiplication algorithms according to the computational experiments (4 processors).

The measurements were obtained in a laboratory, on a network of 9 machines, each with the following configuration:

- Intel Pentium 4 CPU, 3 GHz;
- 1024 MB RAM.

5. Conclusions

In this paper we presented extensive experimental results regarding the performance of parallel matrix multiplication algorithms. Various ways of distributing the matrices among processors have been described: block-striped (rows and columns) matrix partitioning and chessboard block matrix partitioning. Since communication is a critical component of distributed-memory parallel computing, it is important that it be carefully optimized. Studies such as those in this paper can be used by hardware and software designers to tune their communication systems and increase their performance on real applications. This in turn should enable users to achieve higher performance and increased scalability in their codes. This article is a step towards creating a smooth transition from a sequential code base to a distributed equivalent. By reducing the time and effort needed to produce a parallel code base, more time may be dedicated to the problem domain itself.

6. Acknowledgement

This research was financially supported by the CNCSIS UEFISCSU, project number PN II-RU PD 369/2010, contract number 10/02.08.2010.

References

[1] M.P.I. Forum, Message-Passing Interface Standard, Technical report, University of Tennessee at Knoxville, May 1994.

[2] A. Beguelin, J. Dongarra, A. Geist, R. Manchek and V. Sunderam, User's Guide to PVM (Parallel Virtual Machine), Technical Report ORNL/TM-11826, Oak Ridge National Laboratory, Oak Ridge, TN, July 1991.

[3] http://www.mcs.anl.gov/mpi/mpich.

[4] http://www.mpi-forum.org/docs.

[5] I. Foster, Designing and Building Parallel Programs, Addison-Wesley, ISBN 0-201-57594-9, 1995.

[6] V. Kumar, A. Grama, A. Gupta, G. Karypis, Introduction to Parallel Computing, The Benjamin/Cummings Publishing Company, Inc., 2nd edition, ISBN 0-201-64865-2, 2003.

[7] M. J. Quinn, Parallel Programming in C with MPI and OpenMP, New York, NY: McGraw-Hill, ISBN 0-07-282256-2, 2004.

