2012 2nd International Conference on Future Computers in Education
Lecture Notes in Information Technology, Vols. 23-24
Parallel Algorithms for Matrix Multiplication
Petre Anghelescu
University of Pitesti, Faculty of Electronics, Communications and Computers
Str. Targu din Vale, No. 1, 110040, Pitesti, Arges, Romania
petre.anghelescu@upit.ro
Keywords: Parallel processing, Parallel algorithms, Matrix multiplication, MPI, Communication performance.
Abstract. Technological evolution in the field of communication over computer networks provides suitable platforms for implementing parallel algorithms and achieving high-speed applications. In this paper, we show how matrix multiplication on a network of computers can be implemented using the MPI (Message Passing Interface) standard. We present an analysis of the time required by two different implementations of matrix multiplication, a sequential one and a distributed (parallel) one, and we analyze the performance of the parallel algorithms. The only real requirement is that the matrices used in the multiplication are square. Our experimental platform consists of homogeneous Intel computers, and the application is implemented using the MPI standard. Based on our experiments, we draw conclusions that can serve as guidelines for optimizing matrix multiplication.
1. Introduction
The development of parallel computation methods for solving time-consuming problems is always a demanding task. Matrices and matrix operations are widely used in the mathematical modeling of processes, phenomena and systems in physics, economics, computer graphics, and so on. Matrix calculations are the basis of many research and engineering computations.
Being time-consuming, matrix operations are a classical area for applying parallel computation. On the one hand, the use of high-performance multiprocessor systems makes it possible to significantly increase the complexity of the problems being solved. On the other hand, because they are simple to formulate, matrix operations offer a good opportunity to demonstrate many techniques and methods of parallel programming.
The paper is organized as follows. The next section defines sequential and parallel computation and introduces the concepts needed for both computation methods. Section 3 presents the principles used for parallel matrix multiplication: block-striped (row-wise and column-wise) decomposition and chessboard (checkerboard) block decomposition. Experimental results are presented in Section 4. The last section draws the conclusions of the paper.
2. Basics of parallel computation
Parallel computing is a computing paradigm in which multiple processors cooperate to complete a single task. Within this paradigm there are two memory models: shared memory and distributed memory. The shared-memory model distinguishes itself by presenting the programmer with the illusion of a single memory space. The distributed-memory model, on the other hand, presents the programmer with a separate memory space for each processor. Processors therefore have to share information by sending messages to each other. To send these messages, applications usually call a standard communication library, typically MPI [1] or PVM (Parallel Virtual Machine) [2], with MPI having largely displaced PVM as the standard for writing scientific programs with explicit message passing.
978-1-61275-014-9/10/$25.00 ©2012 IERI ICFCE2012
An important component of the performance of a distributed-memory parallel application is the performance of the communication library the application uses. Therefore, the hardware and software systems providing these communication functions must be tuned to the highest degree possible. One important class of information that aids in tuning a communication library is an understanding of the communication patterns that occur within applications: the relative frequency with which the various functions of the communication library are called, the lengths of the messages involved, and the ordering of the messages.
Message-passing programming vs. sequential programming
The main concepts needed to build and program a serial computer are well understood: a physical device called a processor is connected to a memory, as illustrated in Fig. 1. The data in the memory can be read or overwritten by that processor.
Fig. 1. Sequential programming paradigm.
In the message-passing programming model, each process has a local memory, and no other process can directly read from or write to that local memory. The message-passing paradigm is illustrated in Fig. 2.
Fig. 2. Message-passing programming paradigm.
An MPI program consists of a set of processes and a logical communication medium connecting those processes. MPI is a controlled API standard for programming a wide array of parallel architectures. Though MPI was originally intended for classic distributed-memory architectures, it is used on architectures ranging from networks of PCs through large shared-memory systems to massively parallel machines such as the Cray T3D and the Intel Paragon. The complete MPI API offers over 200 operations, which makes it a rather complex programming API. However, most MPI applications use only six to ten of the available operations.
MPI is intended for the Single Program Multiple Data (SPMD) programming paradigm: all nodes run the same application code. The SPMD paradigm is efficient and easy to use for a large class of scientific applications with a regular execution pattern. Other, less regular, applications are far less suited to this paradigm, and implementing them in MPI is tedious.
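The SPMD message-passing model can be illustrated without an MPI installation by a small Python simulation, sketched below under stated assumptions: the names `run_spmd` and `program` are invented for this example, threads stand in for processes, and per-rank queues stand in for the communication medium. Every rank runs the same code and branches on its rank, as in real MPI.

```python
import threading
import queue

def run_spmd(program, nprocs):
    """Run program(rank, size, send, recv) in nprocs threads that
    communicate only through per-rank message queues (a toy MPI)."""
    queues = [queue.Queue() for _ in range(nprocs)]
    results = [None] * nprocs

    def send(dest, data):          # analogous to MPI_Send
        queues[dest].put(data)

    def make_recv(rank):
        def recv():                # analogous to MPI_Recv (any source)
            return queues[rank].get()
        return recv

    def worker(rank):
        results[rank] = program(rank, nprocs, send, make_recv(rank))

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(nprocs)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results

def program(rank, size, send, recv):
    # SPMD: every rank runs this same code and branches on its rank.
    local = sum(range(rank * 10, rank * 10 + 10))  # this rank's partial sum
    if rank != 0:
        send(0, local)             # workers send partial sums to rank 0
        return None
    # rank 0 performs a manual global reduction
    return local + sum(recv() for _ in range(size - 1))

print(run_spmd(program, 4)[0])     # prints 780 (sum of 0..39)
```

The gather on rank 0 mirrors what `MPI_Reduce` does in one call; the point of the sketch is only the structure of the model, not MPI's actual API.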
MPI supports both group broadcasting and global reductions. Being SPMD, all nodes have to meet at a group operation; i.e., a broadcast operation blocks until all the processes in the context have issued the broadcast operation. This is important because it turns all group operations into synchronization points in the application. The MPI API also supports scatter-gather for easy exchange of large data structures, as well as virtual architecture topologies, which allow source-code-compatible MPI applications to execute efficiently across different platforms [3, 4].
3. Parallelizing principles of matrix multiplication
As a result of multiplying the matrix A of dimension m × n by the matrix B of dimension n × l, we obtain the matrix C of size m × l, with each element defined according to Eq. (1):

c_ij = Σ_{k=0}^{n−1} a_ik · b_kj ,  0 ≤ i < m,  0 ≤ j < l.  (1)
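Eq. (1) translates directly into the triple-loop serial algorithm that the later measurements use as a baseline. A minimal sketch in Python (the function name `matmul` is ours, not from the paper):

```python
def matmul(A, B):
    """Sequential matrix product per Eq. (1): c_ij = sum_k a_ik * b_kj."""
    m, n, l = len(A), len(B), len(B[0])
    assert all(len(row) == n for row in A), "inner dimensions must agree"
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(l)]
            for i in range(m)]

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(matmul(A, B))  # [[19, 22], [43, 50]]
```

Each of the m · l result elements costs n multiplications and n − 1 additions, which is the operation count behind the time estimates in Eqs. (2) and (3).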
The development of algorithms (in particular, methods of parallel computation) for solving complicated research and engineering problems can be a real challenge. Here we assume that the computational scheme for solving the matrix multiplication problem is already known. The activities for determining efficient methods of parallel computation are the following:
- to analyze the available computational scheme and decompose it into subtasks that may be executed largely independently;
- to identify the information dependencies among the selected subtasks; these dependencies must be respected in the course of the parallel computation;
- to determine the necessary or available computational system for solving the problem and to distribute the subtasks among its processors.
These stages of parallel algorithm development were first suggested by I. Foster [5]. Viewed broadly, it is obvious that the amount of computation must be approximately the same for each processor; this makes it possible to balance the computational load of the processors. It is also clear that the subtasks must be distributed among the processors so that the number of communication interactions among the subtasks is minimal.
The existence of various data distribution schemes gives rise to a series of parallel matrix computation algorithms: block-striped (row and column) matrix partitioning and chessboard block matrix partitioning [6, 7].
3.1 Block-striped matrix data decomposition
In the case of block-striped partitioning, each processor is assigned a certain subset of matrix rows (row-wise or horizontal partitioning) or matrix columns (column-wise or vertical partitioning); see Fig. 3 (a) and (b).
Fig. 3. Ways to distribute matrix elements among the processors: (a) row-wise or horizontal partitioning, (b) column-wise or vertical partitioning.
In most cases, rows and columns are divided into stripes on a continuous, sequential basis. The general scheme of information interaction among the subtasks in the course of the computation is shown in Fig. 4.
Fig. 4. Example of computation in the case of block-striped matrix partitioning.
To compute a row of the matrix C, each subtask must have a row of the matrix A and access to all columns of the matrix B. Possible ways to organize the parallel computation are described below.
The first algorithm is an iterative procedure in which the number of iterations equals the number of subtasks. At each iteration, every subtask holds one row of the matrix A and one column of the matrix B; it computes the scalar product of that row and column, obtaining the corresponding element of the result matrix C. When the computations of an iteration are complete, the columns of the matrix B are transmitted among the subtasks, so that each subtask receives a new column of B and new elements of C can be calculated. This transmission of columns must be organized so that, over the course of the algorithm, every column of B visits each subtask exactly once.
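The first algorithm can be sketched as a single-process simulation in Python, assuming for simplicity one row per subtask (p = n); the function name and the cyclic indexing convention `(i + it) % n` are our illustrative choices, not the paper's code:

```python
def striped_multiply_v1(A, B):
    """Simulate block-striped algorithm 1: subtask i holds row i of A
    and, at iteration `it`, column (i + it) % n of B (cyclic shifts)."""
    n = len(A)
    # cols[j] is column j of B, stored as a flat list
    cols = [[B[k][j] for k in range(n)] for j in range(n)]
    C = [[0] * n for _ in range(n)]
    for it in range(n):                 # n iterations = number of subtasks
        for i in range(n):              # each subtask works independently
            j = (i + it) % n            # column of B it currently holds
            # scalar product of row i of A and column j of B -> one element of C
            C[i][j] = sum(A[i][k] * cols[j][k] for k in range(n))
        # after each iteration the columns are passed cyclically onward
    return C

print(striped_multiply_v1([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # [[19, 22], [43, 50]]
```

After n iterations every (row, column) pair has met exactly once, so C is complete; in a real MPI run the inner loop bodies execute concurrently on separate processes and the comment about passing columns becomes actual send/receive calls.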
The second algorithm differs from the first in that the subtasks hold not columns but rows of the matrix B. As a result, the computation in each subtask is the multiplication of a row of the matrix B by the corresponding element of the subtask's row of the matrix A, so that each subtask accumulates a row of partial results for the matrix C. With this scheme of data decomposition, all rows of the matrix B must pass sequentially through all the subtasks; at each step, the elements of the current row of B are multiplied by the corresponding element of the row of A and the new values are added to the previously computed ones.
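The second algorithm admits an analogous simulation. Again one row per subtask is assumed and the names are illustrative; the difference from the first sketch is that a whole row of C is accumulated from partial results rather than one element being produced per iteration:

```python
def striped_multiply_v2(A, B):
    """Simulate block-striped algorithm 2: subtask i holds row i of A;
    rows of B circulate and partial rows of C are accumulated."""
    n = len(A)
    C = [[0] * n for _ in range(n)]
    for it in range(n):
        for i in range(n):              # subtask i
            k = (i + it) % n            # row of B it holds this iteration
            for j in range(n):
                # scale row k of B by element A[i][k], accumulate into row i of C
                C[i][j] += A[i][k] * B[k][j]
        # rows of B are then passed cyclically among the subtasks
    return C

print(striped_multiply_v2([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # [[19, 22], [43, 50]]
```

Both variants perform the same arithmetic in a different order; they differ only in what is communicated (columns vs. rows of B) and in when the elements of C are finalized.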
With regard to the number and the duration of the operations, the time needed to carry out the computations of the parallel algorithm may be estimated as follows:

T_p = (n²/p) · (2n − 1) · τ ,  (2)

where n is the matrix size, p is the number of processors, and τ is the execution time of a basic computational operation (this value has been measured in the course of testing the serial algorithm).
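Eq. (2) counts, per processor, the n²/p elements of C it computes, each costing n multiplications and n − 1 additions. Dividing the corresponding serial time T₁ = n²(2n − 1)τ by T_p gives an ideal (communication-free) speedup of exactly p, as this small check illustrates (the function names and the sample value of τ are our own):

```python
def t_serial(n, tau):
    # every one of n^2 elements costs n multiplications + (n - 1) additions
    return n * n * (2 * n - 1) * tau

def t_parallel(n, p, tau):
    # Eq. (2): each processor computes n^2 / p elements of C
    return (n * n / p) * (2 * n - 1) * tau

n, p, tau = 512, 4, 1e-8
print(t_serial(n, tau) / t_parallel(n, p, tau))  # ideal speedup = p
```

The measured speedups in Section 4 fall below this bound because the estimate ignores the time spent transmitting the stripes of B among the subtasks.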
3.2 Chessboard matrix data decomposition
In this method of data decomposition, the initial matrices A and B and the result matrix C are subdivided into sets of blocks. To simplify the further explanation, we assume that all the matrices are square of size n × n and that the number of vertical blocks equals the number of horizontal blocks, both being q (i.e., each block is of size k × k, with k = n/q). In this method the matrix is divided into rectangular sets of elements, as depicted in Fig. 5.
Fig. 5. Chessboard block matrix partitioning.
As a rule, the division is done on a continuous basis. Let the number of processors be p = s × q, with the number of matrix rows divisible by s and the number of columns divisible by q. With this approach it is expedient that the computational system have a physical, or at least logical, processor grid topology of s rows and q columns. Then, for data distribution on a continuous basis, processors that neighbor each other in the grid structure process adjoining blocks of the original matrices. It should be noted, however, that a cyclic alternation of rows and columns can also be used in the chessboard block scheme.
The execution of this algorithm requires q iterations, during which each processor multiplies its current blocks of the matrices A and B and adds the result to the current block of the matrix C. The execution time of the parallel algorithm (chessboard block matrix partitioning) corresponding to the processor computations is

T_p = q · [ (n²/p) · (2n/q − 1) + (n²/p) ] · τ ,  (3)

where n is the matrix size, q is the number of columns of the grid topology, p is the number of processors in the grid, and τ is the execution time of a basic computational operation.
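The q-iteration block scheme can be sketched as a serial simulation on a q × q logical grid. The cyclic block index `(bi + bj + t) % q` is one concrete circulation rule (Cannon-style) chosen for illustration; the paper does not fix a particular one, only that after q iterations every processor has seen all the blocks it needs:

```python
def block_multiply(A, B, q):
    """Simulate chessboard block multiplication on a q x q logical grid:
    q iterations of block multiply-accumulate with cyclic block circulation."""
    n = len(A)
    k = n // q                          # block size; assumes q divides n
    C = [[0] * n for _ in range(n)]

    def block(M, bi, bj):               # extract block (bi, bj) of M
        return [row[bj * k:(bj + 1) * k] for row in M[bi * k:(bi + 1) * k]]

    for t in range(q):                  # q iterations
        for bi in range(q):
            for bj in range(q):         # "processor" (bi, bj)
                m = (bi + bj + t) % q   # index of the blocks it currently holds
                Ab, Bb = block(A, bi, m), block(B, m, bj)
                for i in range(k):      # C block += Ab * Bb
                    for j in range(k):
                        C[bi * k + i][bj * k + j] += sum(
                            Ab[i][x] * Bb[x][j] for x in range(k))
    return C
```

For fixed (bi, bj), the index m runs over all q values as t varies, so each C block accumulates the full sum over the k-blocks; the per-iteration cost in Eq. (3) is visible as the (n²/p)(2n/q − 1) block product plus the n²/p accumulation additions.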
4. Experimental results
The results of the computational experiments for the parallel matrix multiplication algorithm based on block-striped matrix partitioning are presented in Fig. 6 (a). We calculate the speedup by dividing the serial execution time by the parallel execution time; the results are presented in Fig. 6 (b).
Fig. 6. (a) Theoretical and experimental execution times with respect to matrix size (block-striped matrix decomposition). (b) Speedup of the parallel matrix multiplication algorithm (block-striped matrix decomposition).
The results of the computational experiments for the parallel matrix multiplication algorithm based on chessboard block matrix partitioning are presented in Fig. 7 (a). We calculate the speedup by dividing the serial execution time by the parallel execution time; the results are presented in Fig. 7 (b).
Fig. 7. (a) Theoretical and experimental execution times for chessboard block matrix decomposition. (b) Speedup of the parallel matrix multiplication algorithm with respect to the number of processors.
The summary graph in Fig. 8 presents the speedup values obtained in the computational experiments for all the discussed algorithms. The computations show that increasing the number of processors improves the efficiency of the chessboard block multiplication algorithm.
Fig. 8. Speedup of the matrix multiplication algorithms according to the computational experiments (4 processors).
The measurements were obtained on a laboratory network of 9 machines, each with the following configuration: Intel Pentium 4 CPU at 3 GHz, 1024 MB of RAM.
5. Conclusions
In this paper we presented extensive experimental results regarding the performance of parallel matrix multiplication algorithms. Various ways of distributing a matrix among processors have been described: block-striped (row and column) matrix partitioning and chessboard block matrix partitioning. Since communication is a critical component of distributed-memory parallel computing, it is important that it be carefully optimized. Studies such as those in this paper can be used by hardware and software designers to tune their communication systems and increase their performance on real applications. This, in turn, should enable users to achieve higher performance and increased scalability of their codes. This article is a step towards a smooth transition from a sequential code base to a distributed equivalent; by reducing the time and effort needed to produce a parallel code base, more time may be dedicated to the problem domain itself.
6. Acknowledgements
This research was financially supported by CNCSIS UEFISCSU, project number PN II-RU PD 369/2010, contract number 10/02.08.2010.
References
[1] MPI Forum, Message-Passing Interface Standard, Technical report, University of Tennessee at Knoxville, May 1994.
[2] A. Beguelin, J. Dongarra, A. Geist, R. Manchek and V. Sunderam, User's Guide to PVM (Parallel Virtual Machine), Technical Report ORNL/TM-11826, Oak Ridge National Laboratory, Oak Ridge, TN, July 1991.
[3] http://www.mcs.anl.gov/mpi/mpich
[4] http://www.mpi-forum.org/docs
[5] I. Foster, Designing and Building Parallel Programs, Addison-Wesley, ISBN 0-201-57594-9, 1995.
[6] V. Kumar, A. Grama, A. Gupta and G. Karypis, Introduction to Parallel Computing, The Benjamin/Cummings Publishing Company, Inc., 2nd edition, ISBN 0-201-64865-2, 2003.
[7] M. J. Quinn, Parallel Programming in C with MPI and OpenMP, New York, NY: McGraw-Hill, ISBN 0-07-282256-2, 2004.