2012 2nd International Conference on Future Computers in Education
Lecture Notes in Information Technology, Vols. 23-24
Parallel Algorithms for Matrix Multiplication
Petre Anghelescu
University of Pitesti, Faculty of Electronics, Communications and Computers
Str. Targu din Vale, No. 1, 110040, Pitesti, Arges, Romania
petre.anghelescu@upit.ro
Keywords: Parallel processing, Parallel algorithms, Matrix multiplication, MPI, Communication performance.
Abstract. Technological evolution in the field of communication over computer networks provides suitable platforms for implementing parallel algorithms and achieving high-speed applications. In this paper, we show how matrix multiplication on a network of computers can be implemented using the MPI (Message Passing Interface) standard. We present an analysis of the time required by two different implementations of matrix multiplication, a sequential one and a distributed (parallel) one, and we analyze the performance of the parallel algorithms. The only real requirement is that the matrices used in the multiplication are square. Our experimental platform consists of homogeneous Intel computers, and the application is implemented using the MPI standard. Based on our experiments, we draw conclusions that can serve as guidelines for optimizing matrix multiplication.
1. Introduction
The development of parallel computation methods for solving time-consuming problems is always a demanding task. Matrices and matrix operations are widely used in the mathematical modeling of processes, phenomena and systems in physics, economics, computer graphics, and so on. Matrix calculations are the basis of many research and engineering computations.
Being time-consuming, matrix operations are a classical area for applying parallel computation. On the one hand, the use of high-performance multiprocessor systems makes it possible to significantly increase the complexity of the problems being solved. On the other hand, because they are simple to formulate, matrix operations offer a good opportunity to demonstrate many techniques and methods of parallel programming.
The paper is organized as follows. The next section defines sequential and parallel computation and introduces the concepts needed for both computation methods. Section 3 presents the principles used for parallel matrix multiplication: block-striped (row-wise and column-wise) decomposition and chessboard (checkerboard) block decomposition. Experimental results are presented in Section 4. The last section draws the conclusions of the paper.
2. Basics of parallel computation
Parallel computing is a computing paradigm in which multiple processors cooperate to complete a single task. Within this paradigm there are two memory models: shared memory and distributed memory. The shared-memory model distinguishes itself by presenting the programmer with the illusion of a single memory space. The distributed-memory model, on the other hand, presents the programmer with a separate memory space for each processor. Processors therefore have to share information by sending messages to each other. To send these messages, applications usually call a standard communication library, typically MPI [1] or PVM (Parallel Virtual Machine) [2], with MPI having largely displaced PVM as the standard for writing scientific programs with explicit message passing.
978-1-61275-014-9/10/$25.00 ©2012 IERI ICFCE2012
An important component of the performance of a distributed-memory parallel application is the performance of the communication library the application uses. Therefore, the hardware and software systems providing these communication functions must be tuned to the highest degree possible. One important class of information that aids in tuning a communication library is an understanding of the communication patterns that occur within applications: the relative frequency with which the various functions of the communication library are called, the lengths of the messages involved, and the ordering of the messages.
Message-passing programming vs. sequential programming
The main concepts needed to build and program a serial computer are well understood: a physical device called a processor is connected to a memory, as illustrated in Fig. 1. The data in the memory can be read or overwritten by that processor.
Fig. 1. Sequential programming paradigm.
In the message-passing programming model, each process has a local memory, and no other process can directly read from or write to that local memory. The message-passing paradigm is illustrated in Fig. 2.
Fig. 2. Message-passing programming paradigm.
An MPI program consists of a set of processes and a logical communication medium connecting those processes. MPI is a controlled API standard for programming a wide array of parallel architectures. Though MPI was originally intended for classic distributed-memory architectures, it is used on architectures ranging from networks of PCs through large shared-memory systems to massively parallel machines such as the Cray T3D and the Intel Paragon. The complete MPI API offers over 200 operations, which makes it a rather complex programming API. However, most MPI applications use only six to ten of the available operations.
MPI is intended for the Single Program Multiple Data (SPMD) programming paradigm: all nodes run the same application code. The SPMD paradigm is efficient and easy to use for a large class of scientific applications with a regular execution pattern. Other, less regular, applications are far less suited to this paradigm, and implementing them in MPI is tedious.
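The SPMD message-passing model can be illustrated without an MPI installation by a small Python simulation, sketched below under stated assumptions: the names `run_spmd` and `program` are invented for this example, threads stand in for processes, and per-rank queues stand in for the communication medium. Every rank runs the same code and branches on its rank, as in real MPI.

```python
import threading
import queue

def run_spmd(program, nprocs):
    """Run program(rank, size, send, recv) in nprocs threads that
    communicate only through per-rank message queues (a toy MPI)."""
    queues = [queue.Queue() for _ in range(nprocs)]
    results = [None] * nprocs

    def send(dest, data):          # analogous to MPI_Send
        queues[dest].put(data)

    def make_recv(rank):
        def recv():                # analogous to MPI_Recv (any source)
            return queues[rank].get()
        return recv

    def worker(rank):
        results[rank] = program(rank, nprocs, send, make_recv(rank))

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(nprocs)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results

def program(rank, size, send, recv):
    # SPMD: every rank runs this same code and branches on its rank.
    local = sum(range(rank * 10, rank * 10 + 10))  # this rank's partial sum
    if rank != 0:
        send(0, local)             # workers send partial sums to rank 0
        return None
    # rank 0 performs a manual global reduction
    return local + sum(recv() for _ in range(size - 1))

print(run_spmd(program, 4)[0])     # prints 780 (sum of 0..39)
```

The gather on rank 0 mirrors what `MPI_Reduce` does in one call; the point of the sketch is only the structure of the model, not MPI's actual API.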
MPI supports both group broadcasting and global reductions. Being SPMD, all nodes have to meet at a group operation; i.e., a broadcast operation blocks until all the processes in the context have issued the broadcast operation. This is important because it turns all group operations into synchronization points in the application. The MPI API also supports scatter-gather for easy exchange of large data structures, as well as virtual architecture topologies, which allow source-code-compatible MPI applications to execute efficiently across different platforms [3, 4].
3. Parallelizing principles of matrix multiplication
As a result of multiplying the matrix A of dimension m × n by the matrix B of dimension n × l, we obtain the matrix C of size m × l, with each element defined according to Eq. (1):

c_ij = Σ_{k=0}^{n−1} a_ik · b_kj ,  0 ≤ i < m,  0 ≤ j < l.  (1)
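Eq. (1) translates directly into the triple-loop serial algorithm that the later measurements use as a baseline. A minimal sketch in Python (the function name `matmul` is ours, not from the paper):

```python
def matmul(A, B):
    """Sequential matrix product per Eq. (1): c_ij = sum_k a_ik * b_kj."""
    m, n, l = len(A), len(B), len(B[0])
    assert all(len(row) == n for row in A), "inner dimensions must agree"
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(l)]
            for i in range(m)]

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(matmul(A, B))  # [[19, 22], [43, 50]]
```

Each of the m · l result elements costs n multiplications and n − 1 additions, which is the operation count behind the time estimates in Eqs. (2) and (3).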
The development of algorithms (in particular, methods of parallel computation) for solving complicated research and engineering problems can be a real challenge. Here we assume that the computational scheme for solving the matrix multiplication problem is already known. The activities for determining efficient methods of parallel computation are the following:
- to analyze the available computational scheme and decompose it into subtasks that may be executed largely independently;
- to identify the information dependencies among the selected subtasks; these dependencies must be respected in the course of the parallel computation;
- to determine the necessary or available computational system for solving the problem and to distribute the subtasks among its processors.
These stages of parallel algorithm development were first suggested by I. Foster [5]. Viewed broadly, it is obvious that the amount of computation must be approximately the same for each processor; this makes it possible to balance the computational load of the processors. It is also clear that the subtasks must be distributed among the processors so that the number of communication interactions among the subtasks is minimal.
The existence of various data distribution schemes gives rise to a series of parallel matrix computation algorithms: block-striped (row and column) matrix partitioning and chessboard block matrix partitioning [6, 7].
3.1 Block-striped matrix data decomposition
In the case of block-striped partitioning, each processor is assigned a certain subset of matrix rows (row-wise or horizontal partitioning) or matrix columns (column-wise or vertical partitioning); see Fig. 3 (a) and (b).
Fig. 3. Ways to distribute matrix elements among the processors: (a) row-wise or horizontal partitioning, (b) column-wise or vertical partitioning.
In most cases, rows and columns are divided into stripes on a continuous, sequential basis. The general scheme of information interaction among the subtasks in the course of the computation is shown in Fig. 4.
Fig. 4. Example of computation in the case of block-striped matrix partitioning.
To compute a row of the matrix C, each subtask must have a row of the matrix A and access to all columns of the matrix B. Possible ways to organize the parallel computation are described below.
The first algorithm is an iterative procedure in which the number of iterations equals the number of subtasks. At each iteration, every subtask holds one row of the matrix A and one column of the matrix B; it computes the scalar product of that row and column, obtaining the corresponding element of the result matrix C. When the computations of an iteration are complete, the columns of the matrix B are transmitted among the subtasks, so that each subtask receives a new column of B and new elements of C can be calculated. This transmission of columns must be organized so that, over the course of the algorithm, every column of B visits each subtask exactly once.
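The first algorithm can be sketched as a single-process simulation in Python, assuming for simplicity one row per subtask (p = n); the function name and the cyclic indexing convention `(i + it) % n` are our illustrative choices, not the paper's code:

```python
def striped_multiply_v1(A, B):
    """Simulate block-striped algorithm 1: subtask i holds row i of A
    and, at iteration `it`, column (i + it) % n of B (cyclic shifts)."""
    n = len(A)
    # cols[j] is column j of B, stored as a flat list
    cols = [[B[k][j] for k in range(n)] for j in range(n)]
    C = [[0] * n for _ in range(n)]
    for it in range(n):                 # n iterations = number of subtasks
        for i in range(n):              # each subtask works independently
            j = (i + it) % n            # column of B it currently holds
            # scalar product of row i of A and column j of B -> one element of C
            C[i][j] = sum(A[i][k] * cols[j][k] for k in range(n))
        # after each iteration the columns are passed cyclically onward
    return C

print(striped_multiply_v1([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # [[19, 22], [43, 50]]
```

After n iterations every (row, column) pair has met exactly once, so C is complete; in a real MPI run the inner loop bodies execute concurrently on separate processes and the comment about passing columns becomes actual send/receive calls.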
The second algorithm differs from the first in that the subtasks hold not columns but rows of the matrix B. As a result, the computation in each subtask is the multiplication of a row of the matrix B by the corresponding element of the subtask's row of the matrix A, so that each subtask accumulates a row of partial results for the matrix C. With this scheme of data decomposition, all rows of the matrix B must pass sequentially through all the subtasks; at each step, the elements of the current row of B are multiplied by the corresponding element of the row of A and the new values are added to the previously computed ones.
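The second algorithm admits an analogous simulation. Again one row per subtask is assumed and the names are illustrative; the difference from the first sketch is that a whole row of C is accumulated from partial results rather than one element being produced per iteration:

```python
def striped_multiply_v2(A, B):
    """Simulate block-striped algorithm 2: subtask i holds row i of A;
    rows of B circulate and partial rows of C are accumulated."""
    n = len(A)
    C = [[0] * n for _ in range(n)]
    for it in range(n):
        for i in range(n):              # subtask i
            k = (i + it) % n            # row of B it holds this iteration
            for j in range(n):
                # scale row k of B by element A[i][k], accumulate into row i of C
                C[i][j] += A[i][k] * B[k][j]
        # rows of B are then passed cyclically among the subtasks
    return C

print(striped_multiply_v2([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # [[19, 22], [43, 50]]
```

Both variants perform the same arithmetic in a different order; they differ only in what is communicated (columns vs. rows of B) and in when the elements of C are finalized.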
With regard to the number and the duration of the operations, the time needed to carry out the computations of the parallel algorithm may be estimated as follows:

T_p = (n²/p) · (2n − 1) · τ ,  (2)

where n is the matrix size, p is the number of processors, and τ is the execution time of a basic computational operation (this value has been measured in the course of testing the serial algorithm).
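Eq. (2) counts, per processor, the n²/p elements of C it computes, each costing n multiplications and n − 1 additions. Dividing the corresponding serial time T₁ = n²(2n − 1)τ by T_p gives an ideal (communication-free) speedup of exactly p, as this small check illustrates (the function names and the sample value of τ are our own):

```python
def t_serial(n, tau):
    # every one of n^2 elements costs n multiplications + (n - 1) additions
    return n * n * (2 * n - 1) * tau

def t_parallel(n, p, tau):
    # Eq. (2): each processor computes n^2 / p elements of C
    return (n * n / p) * (2 * n - 1) * tau

n, p, tau = 512, 4, 1e-8
print(t_serial(n, tau) / t_parallel(n, p, tau))  # ideal speedup = p
```

The measured speedups in Section 4 fall below this bound because the estimate ignores the time spent transmitting the stripes of B among the subtasks.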
3.2 Chessboard matrix data decomposition
In this method of data decomposition, the initial matrices A and B and the result matrix C are subdivided into sets of blocks. To simplify the further explanation, we assume that all the matrices are square of size n × n and that the number of vertical blocks equals the number of horizontal blocks, both being q (i.e., each block is of size k × k, with k = n/q). In this method the matrix is divided into rectangular sets of elements, as depicted in Fig. 5.
Fig. 5. Chessboard block matrix partitioning.
As a rule, the division is done on a continuous basis. Let the number of processors be p = s × q, with the number of matrix rows divisible by s and the number of columns divisible by q. With this approach it is expedient that the computational system have a physical, or at least logical, processor grid topology of s rows and q columns. Then, for data distribution on a continuous basis, processors that neighbor each other in the grid structure process adjoining blocks of the original matrices. It should be noted, however, that a cyclic alternation of rows and columns can also be used in the chessboard block scheme.
The execution of this algorithm requires q iterations, during which each processor multiplies its current blocks of the matrices A and B and adds the result to the current block of the matrix C. The execution time of the parallel algorithm (chessboard block matrix partitioning) corresponding to the processor computations is

T_p = q · [ (n²/p) · (2n/q − 1) + (n²/p) ] · τ ,  (3)

where n is the matrix size, q is the number of columns of the grid topology, p is the number of processors in the grid, and τ is the execution time of a basic computational operation.
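The q-iteration block scheme can be sketched as a serial simulation on a q × q logical grid. The cyclic block index `(bi + bj + t) % q` is one concrete circulation rule (Cannon-style) chosen for illustration; the paper does not fix a particular one, only that after q iterations every processor has seen all the blocks it needs:

```python
def block_multiply(A, B, q):
    """Simulate chessboard block multiplication on a q x q logical grid:
    q iterations of block multiply-accumulate with cyclic block circulation."""
    n = len(A)
    k = n // q                          # block size; assumes q divides n
    C = [[0] * n for _ in range(n)]

    def block(M, bi, bj):               # extract block (bi, bj) of M
        return [row[bj * k:(bj + 1) * k] for row in M[bi * k:(bi + 1) * k]]

    for t in range(q):                  # q iterations
        for bi in range(q):
            for bj in range(q):         # "processor" (bi, bj)
                m = (bi + bj + t) % q   # index of the blocks it currently holds
                Ab, Bb = block(A, bi, m), block(B, m, bj)
                for i in range(k):      # C block += Ab * Bb
                    for j in range(k):
                        C[bi * k + i][bj * k + j] += sum(
                            Ab[i][x] * Bb[x][j] for x in range(k))
    return C
```

For fixed (bi, bj), the index m runs over all q values as t varies, so each C block accumulates the full sum over the k-blocks; the per-iteration cost in Eq. (3) is visible as the (n²/p)(2n/q − 1) block product plus the n²/p accumulation additions.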
4. Experimental results
The results of the computational experiments for the parallel matrix multiplication algorithm based on block-striped matrix partitioning are presented in Fig. 6 (a). We calculate the speedup by dividing the serial execution time by the parallel execution time; the results are presented in Fig. 6 (b).
Fig. 6. (a) Theoretical and experimental execution times with respect to matrix size (block-striped matrix decomposition). (b) Speedup of the parallel matrix multiplication algorithm (block-striped matrix decomposition).
The results of the computational experiments for the parallel matrix multiplication algorithm based on chessboard block matrix partitioning are presented in Fig. 7 (a). We calculate the speedup by dividing the serial execution time by the parallel execution time; the results are presented in Fig. 7 (b).
Fig. 7. (a) Theoretical and experimental execution times for chessboard block matrix decomposition. (b) Speedup of the parallel matrix multiplication algorithm with respect to the number of processors.
The summary graph in Fig. 8 presents the speedup values obtained in the computational experiments for all the discussed algorithms. The computations show that increasing the number of processors improves the efficiency of the chessboard block multiplication algorithm.
Fig. 8. Speedup of the matrix multiplication algorithms according to the computational experiments (4 processors).
The measurements were obtained on a laboratory network of 9 machines, each with the following configuration: Intel Pentium 4 CPU at 3 GHz, 1024 MB of RAM.
5. Conclusions
In this paper we presented extensive experimental results regarding the performance of parallel matrix multiplication algorithms. Various ways of distributing a matrix among processors have been described: block-striped (row and column) matrix partitioning and chessboard block matrix partitioning. Since communication is a critical component of distributed-memory parallel computing, it is important that it be carefully optimized. Studies such as those in this paper can be used by hardware and software designers to tune their communication systems and increase their performance on real applications. This, in turn, should enable users to achieve higher performance and increased scalability of their codes. This article is a step towards a smooth transition from a sequential code base to a distributed equivalent; by reducing the time and effort needed to produce a parallel code base, more time may be dedicated to the problem domain itself.
6. Acknowledgements
This research was financially supported by CNCSIS UEFISCSU, project number PN II-RU PD 369/2010, contract number 10/02.08.2010.
References
[1] MPI Forum, Message-Passing Interface Standard, Technical report, University of Tennessee at Knoxville, May 1994.
[2] A. Beguelin, J. Dongarra, A. Geist, R. Manchek and V. Sunderam, User's Guide to PVM (Parallel Virtual Machine), Technical Report ORNL/TM-11826, Oak Ridge National Laboratory, Oak Ridge, TN, July 1991.
[3] http://www.mcs.anl.gov/mpi/mpich
[4] http://www.mpi-forum.org/docs
[5] I. Foster, Designing and Building Parallel Programs, Addison-Wesley, ISBN 0-201-57594-9, 1995.
[6] V. Kumar, A. Grama, A. Gupta and G. Karypis, Introduction to Parallel Computing, The Benjamin/Cummings Publishing Company, Inc., 2nd edition, ISBN 0-201-64865-2, 2003.
[7] M. J. Quinn, Parallel Programming in C with MPI and OpenMP, New York, NY: McGraw-Hill, ISBN 0-07-282256-2, 2004.