
Discrete Mathematics and Theoretical Computer Science 2, 1998, 35-47

Lower bounds for sparse matrix vector multiplication on hypercubic networks

Giovanni Manzini

Dipartimento Scienze e Tecnologie Avanzate, I-15100 Alessandria, Italy and Istituto di Matematica Computazionale, CNR, I-56126 Pisa, Italy.
E-mail: manzini@mfn.al.unipmn.it

In this paper we consider the problem of computing on a local memory machine the product $y = Ax$, where $A$ is a random $n \times n$ sparse matrix with $\Theta(n)$ nonzero elements. To study the average case communication cost of this problem, we introduce four different probability measures on the set of sparse matrices. We prove that on most local memory machines with $p$ processors, this computation requires $\Omega((n/p)\log p)$ time on the average. We prove that the same lower bound also holds, in the worst case, for matrices with only $2n$ or $3n$ nonzero elements.

Keywords: sparse matrices, pseudo expanders, hypercubic networks, bisection width lower bounds

1 Introduction

In this paper we consider the problem of computing $y = Ax$, where $A$ is an $n \times n$ random sparse matrix with $\Theta(n)$ nonzero elements. We prove that, on most local memory machines with $p$ processors, this computation requires $\Omega((n/p)\log p)$ time in the average case. We also prove that this computation requires $\Omega((n/p)\log p)$ time in the worst case for matrices with only $3n$ nonzero elements and, under additional hypotheses, also for matrices with $2n$ nonzero elements. We prove these results by considering only the communication cost, i.e. the time required for routing the components of $x$ to their final destinations.

The average communication cost of computing $y = Ax$ has also been studied in [10] starting from different assumptions. In [10] we restricted our analysis to the algorithms in which the components of $x$ and $y$ are partitioned among the processors according to a fixed scheme not depending on the matrix $A$. However, it is natural to try to reduce the cost of sparse matrix vector multiplication by finding a 'good' partition of the components of $x$ and $y$. Such a partition can be based on the nonzero structure of the matrix $A$, which is assumed to be known in advance. This approach is justified if one has to compute many products involving the same matrix, or involving matrices with the same nonzero structure. The development of efficient partition schemes is a very active area of research; for some recent results see [3, 4, 5, 14, 15].

In this paper we prove lower bounds which also hold for algorithms which partition the components of $x$ and $y$ on the basis of the nonzero structure of $A$. First, we introduce four probability measures on the class of $n \times n$ sparse matrices with $\Theta(n)$ nonzero elements. Then, we show that with high probability the cost of computing $y = Ax$ on a $p$-processor hypercube is $\Omega((n/p)\log p)$, no matter how we distribute the components of $x$ and $y$ among the processors. Since the hypercube can simulate with constant slowdown any $O(p)$-processor butterfly, CCC, mesh of trees and shuffle exchange network, the same bounds hold for these communication networks as well (see [7] for descriptions and properties of the hypercube, and of the other networks mentioned above).

Our results are based on the expansion properties of a bipartite graph associated with the matrix $A$. Expansion properties, as well as the related concept of separators, have played a major role in the study of the fill-in generated by Gaussian and Cholesky factorization of a random sparse matrix (see [1, 8, 11, 12]), but to our knowledge, this is the first time they have been used for the analysis of parallel matrix vector multiplication algorithms.

We establish lower bounds for the hypercube assuming the following model of computation. Processors have only local memory and communicate with one another by sending packets of information through the hypercube links. At each step, each processor is allowed to perform a bounded amount of local computation, or to send one packet of bounded length to an adjacent processor (we are therefore considering the so-called weak hypercube).

This paper is organized as follows. In section 2 we introduce four different probability measures on the set of sparse matrices, and the notion of the pseudo-expander graph. We also prove that the dependency graph associated with a random matrix is a pseudo-expander with high probability. In section 3 we study the average cost of sparse matrix vector multiplication, and in section 4 we prove lower bounds for matrices with $3n$ and $2n$ nonzero elements. Section 5 contains some concluding remarks.

2 Preliminaries

Let $X = \{x_1, \ldots, x_n\}$ and $Y = \{y_1, \ldots, y_n\}$. Given an $n \times n$ matrix $A$, we associate with $A$ a directed bipartite graph $G(A)$, called the dependency graph. The vertex set of $G(A)$ is $X \cup Y$, and $(x_i, y_j)$ is an edge of $G(A)$ if and only if $a_{ji} \neq 0$. In other words, the vertices of $G(A)$ represent the components of $x$ and $y$, and $(x_i, y_j) \in G(A)$ if and only if the value $y_j$ depends upon $x_i$.
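As an illustration only (this code is not part of the original paper; the function names are illustrative), the dependency graph and the adjacency sets $\mathrm{Adj}(S)$ used throughout the paper can be materialized in a few lines of Python:

```python
# Illustrative sketch: build the dependency graph G(A) of a 0-1 matrix A
# and compute Adj(S) for a set S of x-indices.

def dependency_graph(A):
    """Return adj[i] = set of row indices j with A[j][i] != 0,
    i.e. the set of components y_j that depend on x_i."""
    n = len(A)
    adj = {i: set() for i in range(n)}
    for j in range(n):          # row j corresponds to y_j
        for i in range(n):      # column i corresponds to x_i
            if A[j][i] != 0:
                adj[i].add(j)
    return adj

def neighbourhood(adj, S):
    """Adj(S): the y-vertices adjacent to at least one x_i with i in S."""
    out = set()
    for i in S:
        out |= adj[i]
    return out

if __name__ == "__main__":
    A = [[1, 0, 0, 1],
         [0, 1, 0, 0],
         [1, 0, 1, 0],
         [0, 0, 1, 1]]
    adj = dependency_graph(A)
    print(neighbourhood(adj, {0, 2}))   # y-indices depending on x_0 or x_2 -> {0, 2, 3}
```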

Definition 2.1 Let $0 < \alpha < 1$ and $\beta \geq 1$ with $\alpha\beta \leq 1$. We say that the graph $G(A)$ is an $(\alpha, \beta, n)$-pseudo-expander if for each set $S \subseteq X$ the following property holds:
$$ \frac{\alpha n}{2} \leq |S| \leq \alpha n \;\Longrightarrow\; |\mathrm{Adj}(S)| \geq \beta\,|S|. $$

Note that we must have $\alpha\beta \leq 1$, since each set of $\alpha n$ vertices can be connected to at most $n$ vertices. The notion of a pseudo-expander graph is similar to the well known notion of an expander graph [2]. We recall that a graph is an $(\alpha, \beta, n)$-expander if $|S| \leq \alpha n$ implies $|\mathrm{Adj}(S)| \geq \beta|S|$ (hence, all expanders are pseudo-expanders).

The average case complexity of sparse matrix vector multiplication will be obtained combining two basic results: in this section, we prove that a random sparse matrix is a pseudo-expander with probability very close to one; in the next section we prove that computing $y = Ax$ on a $p$-processor hypercube requires $\Omega((n/p)\log p)$ time if $G(A)$ is a pseudo-expander.

To study the average cost of a sparse matrix vector multiplication, we introduce four probability measures on the set of $n \times n$ sparse matrices. Each measure represents a different type of sparse matrix with $\Theta(n)$ nonzero elements. Although the nonzero elements can be arbitrary real numbers, we assume the behaviour of the multiplication algorithms does not depend upon their numerical values. Hence, all measures are defined on the set of 0-1 matrices, where 1 represents a generic nonzero element.


1. Let $0 \leq d_1 \leq n$ be such that $d_1 n$ is an integer. We denote by $\mathcal{P}_1$ the probability measure such that $\Pr\{A\} \neq 0$ only if the matrix $A$ has exactly $d_1 n$ nonzero elements; moreover, all $\binom{n^2}{d_1 n}$ matrices with $d_1 n$ nonzero elements are equally likely.

2. Let $0 \leq d_2 \leq n$, and let $q = d_2/n$. We denote by $\mathcal{P}_2$ the probability measure such that for each entry $a_{ji}$, $\Pr\{a_{ji} \neq 0\} = q$. Hence, we have $\Pr\{A\} = q^{k}(1-q)^{n^2-k}$, where $k$ is the number of nonzero elements of $A$.

3. Let $d_3$ be an integer, $0 \leq d_3 \leq n$. We denote by $\mathcal{P}_3$ the probability measure such that $\Pr\{A\} \neq 0$ only if each row of $A$ contains exactly $d_3$ nonzero elements; moreover, all $\binom{n}{d_3}^{n}$ matrices with $d_3$ nonzero elements per row are equally likely.

4. Let $d_4$ be an integer, $0 \leq d_4 \leq n$. We denote by $\mathcal{P}_4$ the probability measure such that $\Pr\{A\} \neq 0$ only if each column of $A$ contains exactly $d_4$ nonzero elements; moreover, all $\binom{n}{d_4}^{n}$ matrices with $d_4$ nonzero elements per column are equally likely.
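For experimenting with the definitions above, the four measures can be sampled directly. The following sketch is illustrative only and is not part of the paper; it draws 0-1 matrices under $\mathcal{P}_1$-$\mathcal{P}_4$:

```python
# Illustrative samplers for the four measures (0-1 matrices only).
import random

def sample_P1(n, d1):
    """Exactly d1*n nonzeros, all placements equally likely."""
    A = [[0] * n for _ in range(n)]
    for c in random.sample(range(n * n), d1 * n):
        A[c // n][c % n] = 1
    return A

def sample_P2(n, d2):
    """Each entry is nonzero independently with probability d2/n."""
    q = d2 / n
    return [[1 if random.random() < q else 0 for _ in range(n)]
            for _ in range(n)]

def sample_P3(n, d3):
    """Exactly d3 nonzeros in every row."""
    A = [[0] * n for _ in range(n)]
    for j in range(n):
        for i in random.sample(range(n), d3):
            A[j][i] = 1
    return A

def sample_P4(n, d4):
    """Exactly d4 nonzeros in every column (transpose of a P3 sample)."""
    B = sample_P3(n, d4)
    return [[B[i][j] for i in range(n)] for j in range(n)]

if __name__ == "__main__":
    A = sample_P3(8, 3)
    print(sum(map(sum, A)))   # 24 nonzeros: 3 per row
```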

The above measures are clearly related since they all represent unstructured sparse matrices. We have chosen these measures since they are quite natural, but of course, they do not cover the whole spectrum of 'interesting' sparse matrices. In the following, the expression 'random matrix' will denote a matrix chosen according to one of these probability measures.

Given a random matrix $A$ we want to estimate the probability that the graph $G(A)$ is a pseudo-expander. In this section we prove that for the measures $\mathcal{P}_1$-$\mathcal{P}_4$ previously defined such probability is very close to one (Theorem 2.6). In the following, we make use of some well known inequalities. For $1 \leq k \leq n$,
$$ \binom{n}{k} \leq \left(\frac{en}{k}\right)^{k}, \qquad\qquad (1) $$
and, for any real number $x > 1$,
$$ \left(1 - \frac{1}{x}\right)^{x} \leq e^{-1}. \qquad\qquad (2) $$
In addition, a straightforward computation shows that for any triplet of positive integers $m, n, p$ such that $m + p \leq n$ we have
$$ \frac{\binom{n-m}{p}}{\binom{n}{p}} \leq \left(1 - \frac{m}{n}\right)^{p}. \qquad\qquad (3) $$
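A quick numeric check of (1) and (3), in the form in which they are reconstructed here, can be run as follows (illustrative only):

```python
# Numeric sanity check of inequalities (1) and (3) as stated above.
from math import comb, e

def check_1(n_max=40):
    # binom(n, k) <= (e*n/k)^k for 1 <= k <= n
    return all(comb(n, k) <= (e * n / k) ** k
               for n in range(1, n_max + 1) for k in range(1, n + 1))

def check_3(n_max=30):
    # binom(n-m, p) / binom(n, p) <= (1 - m/n)^p for m, p >= 1, m + p <= n
    ok = True
    for n in range(2, n_max + 1):
        for m in range(1, n):
            for p in range(1, n - m + 1):
                ok &= comb(n - m, p) / comb(n, p) <= (1 - m / n) ** p + 1e-12
    return ok

print(check_1(), check_3())   # expected output: True True
```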

Lemma 2.2 Let $A$ be a random matrix chosen according to the probability measure $\mathcal{P}_i$, $i = 1, \ldots, 4$. If $S \subseteq X$ and $T \subseteq Y$, we have
$$ \Pr[\,\mathrm{Adj}(S) \cap T = \emptyset\,] \leq \left(1 - \frac{d_i}{n}\right)^{|S|\,|T|}. \qquad\qquad (4) $$

Proof. We consider only $i = 1, 3$, since the cases $i = 2, 4$ are similar. Let $M = \{a_{ji} : x_i \in S,\ y_j \in T\}$. We have $|M| = |S|\,|T|$, and $\mathrm{Adj}(S) \cap T = \emptyset$ if and only if $a_{ji} = 0$ for all $a_{ji} \in M$. For the probability measure $\mathcal{P}_1$, using (3) we get
$$ \Pr[\,\mathrm{Adj}(S) \cap T = \emptyset\,] = \frac{\binom{n^2 - |S||T|}{d_1 n}}{\binom{n^2}{d_1 n}} = \frac{\binom{n^2 - d_1 n}{|S||T|}}{\binom{n^2}{|S||T|}} \leq \left(1 - \frac{d_1}{n}\right)^{|S||T|}. $$


For the probability measure $\mathcal{P}_3$, we have $\mathrm{Adj}(S) \cap T = \emptyset$ only if the rows corresponding to $T$ do not contain nonzero elements in the columns corresponding to $S$. By (3) we get
$$ \Pr[\,\mathrm{Adj}(S) \cap T = \emptyset\,] = \left(\frac{\binom{n - |S|}{d_3}}{\binom{n}{d_3}}\right)^{|T|} = \left(\frac{\binom{n - d_3}{|S|}}{\binom{n}{|S|}}\right)^{|T|} \leq \left(1 - \frac{d_3}{n}\right)^{|S||T|}. $$

Lemma 2.3 Let $S \subseteq X$ with $|S| = k$ and $\alpha n/2 \leq k \leq \alpha n$, and assume that (4) holds. If $0 < \gamma < 1$ and $d_i$ is larger than a suitable threshold, denoted (5) in the following, which depends only on $\alpha$, $\beta$ and $\gamma$, then
$$ \Pr[\,|\mathrm{Adj}(S)| \leq \beta k\,] \leq \left(1 - \frac{d_i}{n}\right)^{(1-\gamma)\,k\,(n - \beta k)}. $$

Proof. First note that $|\mathrm{Adj}(S)| \leq \beta k$ only if there exists a set $T \subseteq Y$, with $|T| = n - \lceil\beta k\rceil$, such that $\mathrm{Adj}(S) \cap T = \emptyset$. Since there are $\binom{n}{n - \lceil\beta k\rceil}$ such sets, if (4) holds we have
$$ \Pr[\,|\mathrm{Adj}(S)| \leq \beta k\,] \leq \binom{n}{n - \lceil\beta k\rceil}\left(1 - \frac{d_i}{n}\right)^{k(n - \lceil\beta k\rceil)}. $$
Hence, it suffices to show that $\binom{n}{n - \lceil\beta k\rceil} \leq (1 - d_i/n)^{-\gamma k (n - \lceil\beta k\rceil)}$. Since $n - \beta k \geq (1 - \alpha\beta)n$ and $k \geq \alpha n/2$, this follows from (1) and (2) by a straightforward computation using the hypothesis (5) on $d_i$.

Lemma 2.4 Suppose that the hypotheses of Lemma 2.3 hold. If, in addition, $d_i$ is larger than a second threshold, denoted (6), which again depends only on $\alpha$, $\beta$ and $\gamma$, then the probability that there exists a set $S \subseteq X$ with $|S| = k$ such that $|\mathrm{Adj}(S)| \leq \beta k$ is less than $2^{-k}$.

Proof. The probability that a set of size $k$ is connected to at most $\beta k$ vertices is bounded in Lemma 2.3. Since there exist $\binom{n}{k}$ such sets, we have
$$ \Pr[\,\exists S,\ |S| = k,\ |\mathrm{Adj}(S)| \leq \beta k\,] \leq \binom{n}{k}\left(1 - \frac{d_i}{n}\right)^{(1-\gamma)k(n - \beta k)}. $$
Therefore, we need to prove that
$$ \binom{n}{k}\left(1 - \frac{d_i}{n}\right)^{(1-\gamma)k(n - \beta k)} \leq 2^{-k}. \qquad\qquad (7) $$
Since $n - \beta k \geq (1 - \alpha\beta)n$ and $\alpha n/2 \leq k \leq \alpha n$, inequality (7) follows from (1) and (2) together with the hypothesis (6) on $d_i$.

Lemma 2.5 Let $A$ be an $n \times n$ random matrix for which (4) holds. If $d_i$ satisfies the thresholds (5) and (6) for every $k$ with $\alpha n/2 \leq k \leq \alpha n$, then the graph $G(A)$ is an $(\alpha, \beta, n)$-pseudo-expander with probability greater than $1 - 2^{\,1 - \alpha n/2}$.

Proof. By Lemma 2.4, for each $k$ with $\alpha n/2 \leq k \leq \alpha n$ the probability that some set $S$ of size $k$ has $|\mathrm{Adj}(S)| \leq \beta k$ is less than $2^{-k}$. Summing over $k$, the probability that $G(A)$ is not a pseudo-expander is less than $\sum_{k \geq \alpha n/2} 2^{-k} < 2^{\,1 - \alpha n/2}$.

Theorem 2.6 Let $A$ be a random matrix chosen according to the probability measure $\mathcal{P}_i$, $i = 1, \ldots, 4$. If $d_i$ is larger than a suitable threshold, denoted (11), which depends only on $\alpha$ and $\beta$, then the graph $G(A)$ is an $(\alpha, \beta, n)$-pseudo-expander with probability greater than $1 - 2^{\,1 - \alpha n/2}$.

Proof. By Lemma 2.2, inequality (4) holds for each of the measures $\mathcal{P}_1$-$\mathcal{P}_4$; the thesis follows from Lemma 2.5.

For example, if the $d_i$'s are larger than a moderate constant, Theorem 2.6 tells us that $G(A)$ is an $(\alpha, \beta, n)$-pseudo-expander with $\alpha = 1/2$ and $\beta$ bounded away from one, with probability exponentially close to one; larger values of the $d_i$'s yield pseudo-expanders with larger $\beta$.

If $G(A)$ is an $(\alpha, \beta, n)$-pseudo-expander, then, given $S \subseteq X$ such that $\alpha n/2 \leq |S| \leq \alpha n$, there are at least $\beta|S|$ values $y_j$ that depend on the values $x_i \in S$. Similarly, if $G(A^T)$ is an $(\alpha, \beta, n)$-pseudo-expander, then, given $T \subseteq Y$ with $\alpha n/2 \leq |T| \leq \alpha n$, the values $y_j \in T$ depend upon at least $\beta|T|$ values $x_i$. By symmetry considerations we have the following corollary of Theorem 2.6.

Corollary 2.7 Let $A$ be a random matrix chosen according to the probability measure $\mathcal{P}_i$, $i = 1, \ldots, 4$. If $d_i$ satisfies (11), then the graph $G(A^T)$ is an $(\alpha, \beta, n)$-pseudo-expander with probability greater than $1 - 2^{\,1 - \alpha n/2}$.

3 Study of the Average Case Communication Complexity

Let $A$ be an $n \times n$ sparse matrix, and let $p \leq n$ be the number of processors. In this section we prove a lower bound on the average case complexity of computing the product $y = Ax$ on a $p$-processor hypercube. Our analysis is based on the cost of routing the components of $x$ to their final destination. That is, we consider only the cost of routing, to the processor that computes $y_j$, the values $x_i$ for all $i$ such that $a_{ji} \neq 0$. A major difficulty for establishing a lower bound is that we can compute partial sums $s_j = a_{ji}x_i + \cdots + a_{ji'}x_{i'}$ during the routing process. In this case, by moving a single value $s_j$ we can 'move' several $x_i$'s. However, since the $a_{ji}$'s can be arbitrary, we assume that the partial sum $s_j = a_{ji}x_i + \cdots + a_{ji'}x_{i'}$ can be used only for computing the $j$-th component $y_j$.

In the previous section, we have shown that a random matrix is a pseudo-expander with high probability. Therefore, to establish an average case result it suffices to obtain a lower bound for this class of matrices. In the following, we assume that there exist two functions $\pi_x, \pi_y$ mapping $\{1, 2, \ldots, n\}$ into $\{0, \ldots, p-1\}$ such that

1. for $i = 1, \ldots, n$, the value $x_i$ is initially contained only in processor number $\pi_x(i)$;

2. for $j = 1, \ldots, n$, at the end of the computation the value $y_j$ is contained in processor number $\pi_y(j)$.

In the following, $\pi_x^{-1}(k)$ will denote the set $\{x_i : \pi_x(i) = k\}$, that is, the set of all components of $x$ initially contained in processor $k$. Similarly, $\pi_y^{-1}(k)$ will denote the set of components of $y$ contained in processor $k$ at the end of the computation.
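A concrete example of maps satisfying conditions 1-2, together with the even-distribution hypothesis used in the next theorems (each processor holding $\lfloor n/p\rfloor$ or $\lceil n/p\rceil$ components), is the cyclic distribution. The sketch below is only an illustration and is not prescribed by the paper:

```python
# Cyclic distribution pi(i) = i mod p: every processor receives either
# floor(n/p) or ceil(n/p) components, so it satisfies the even-distribution
# hypothesis of Theorems 3.1 and 3.2.  (Illustrative only.)
from collections import Counter

def cyclic_map(n, p):
    return [i % p for i in range(n)]

n, p = 37, 8
sizes = Counter(cyclic_map(n, p)).values()
print(sorted(sizes))                      # [4, 4, 4, 5, 5, 5, 5, 5]
assert all(s in (n // p, n // p + 1) for s in sizes)
```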

Theorem 3.1 Let $A$ be a matrix such that $G(A)$ is an $(\alpha, \beta, n)$-pseudo-expander with $\alpha = 1/2$ and $4/3 < \beta \leq 2$. If conditions 1-2 hold, and, for $0 \leq k < p$, $|\pi_y^{-1}(k)|$ equals $\lfloor n/p\rfloor$ or $\lceil n/p\rceil$, then any algorithm for computing the product $y = Ax$ on a $p$-processor hypercube requires $\Omega((n/p)\log p)$ time.

Proof. The two functions $\pi_x, \pi_y$ define a mapping of the vertices of $G(A)$ into the processors of the hypercube. If the edge $(x_i, y_j)$ belongs to $G(A)$ (that is, $a_{ji} \neq 0$) the value $y_j$ depends upon $x_i$. Hence, the value $x_i$ has to be moved from processor $\pi_x(i)$ to processor $\pi_y(j)$.

For $1 \leq k \leq \log p$ we consider the set of dimension-$k$ links of the hypercube (that is, the links connecting processors whose addresses differ in the $k$-th bit). If the dimension-$k$ links are removed, the hypercube is bisected into two sets $H_1, H_2$ with $p/2$ processors each. From the hypothesis on $\pi_y$ it follows that at the end of the computation each set contains at most $(p/2)\lceil n/p\rceil \leq 2n/3$ values $y_j$.

We can assume that $H_1$ initially contains at least $n/2$ values $x_i$ (if not, we exchange the roles of $H_1$ and $H_2$). A value $x_i \in H_1$ can reach $H_2$ either by itself or inside a partial sum $\sum_l a_{jl}x_l$. Note that a value $x_i$ that reaches $H_2$ by itself can be utilized for the computation of several $y_j$'s, but we assume that each partial sum can be utilized for computing only one $y_j$.

Let $n_1$ denote the number of values $x_i \in H_1$ that traverse the dimension-$k$ links by themselves, and let $n_2$ be the number of partial sums that traverse the dimension-$k$ links. We claim that $\max(n_1, n_2) \geq \gamma n$, where $\gamma = \frac{3\beta - 4}{6(\beta + 1)}$.

If $n_1 \leq \gamma n$, we consider the set $\bar S$ of the values $x_i \in H_1$ that do not traverse the dimension-$k$ links by themselves. We have
$$ |\bar S| \geq \frac{n}{2} - n_1 \geq \frac{n}{2} - \gamma n. \qquad\qquad (12) $$
Since $\gamma \leq 1/9$, we have $\alpha n/2 \leq n/2 - \gamma n \leq \alpha n$; since $G(A)$ is a pseudo-expander, at least $\beta(n/2 - \gamma n)$ values $y_j$ depend upon the values $x_i$ of a subset of $\bar S$ of size $\lceil n/2 - \gamma n\rceil$. At most $2n/3$ of these $y_j$'s are computed inside $H_1$; hence at least $\beta(n/2 - \gamma n) - 2n/3 \geq \gamma n$ values $y_j \in H_2$ depend upon values $x_i \in \bar S$. Moreover, the values $x_i \in \bar S$ can reach $H_2$ only inside the $n_2$ partial sums that traverse the dimension-$k$ links. Since each partial sum can be utilized for the computation of only one $y_j$, we have that $n_2 \geq \gamma n$, as claimed.

This proves that $\Omega(n)$ items must traverse the dimension-$k$ links. Since the same result holds for all $\log p$ dimensions, the sum of the lengths of the paths traversed by the data is $\Omega(n\log p)$. Since at each step at most $p$ links can be traversed, the computation of $y = Ax$ requires $\Omega((n/p)\log p)$ time.
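The flavour of this counting argument can be observed experimentally. The sketch below is only an illustration under simplifying assumptions (a random matrix with three nonzeros per row, cyclic distributions for $x$ and $y$, partial sums ignored): it counts how many distinct $x_i$ stored in one half of a bisection are needed by some $y_j$ assigned to the other half. The count is a constant fraction of $n$, in line with the $\Omega(n)$ traffic per dimension used in the proof.

```python
# Illustrative experiment: with x and y distributed cyclically over p processors,
# count the x_i stored in one half of a bisection that are needed by some y_j
# assigned to the other half.  Partial sums are ignored, so this only
# illustrates the flavour of the argument, not the full lower bound.
import random

def crossing_count(n, p, d, bit):
    rows = [random.sample(range(n), d) for _ in range(n)]  # d nonzeros per row (measure P3)
    proc = lambda i: i % p                      # cyclic distribution of x and y
    side = lambda k: (k >> bit) & 1             # bisection along one hypercube dimension
    needed = set()
    for j in range(n):                          # y_j lives on side(proc(j))
        for i in rows[j]:
            if side(proc(i)) != side(proc(j)):  # x_i is stored in the other half
                needed.add(i)
    return len(needed)

random.seed(0)
n, p, d = 4096, 64, 3
print(crossing_count(n, p, d, bit=0) / n)   # a constant fraction, roughly 0.7-0.8 here
```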

Theorem 3.2 Let $A$ be a matrix such that $G(A^T)$ is an $(\alpha, \beta, n)$-pseudo-expander with $\alpha = 1/2$ and $4/3 < \beta \leq 2$. If conditions 1-2 hold, and, for $0 \leq k < p$, $|\pi_x^{-1}(k)|$ equals $\lfloor n/p\rfloor$ or $\lceil n/p\rceil$, then any algorithm for computing the product $y = Ax$ on a $p$-processor hypercube requires $\Omega((n/p)\log p)$ time.

Proof. Let $H_1, H_2$ denote the sets obtained by removing the dimension-$k$ links of the hypercube. From the hypothesis on $\pi_x$ it follows that initially each set contains at most $(p/2)\lceil n/p\rceil \leq 2n/3$ values $x_i$. We can assume that at the end of the computation $H_2$ contains at least $n/2$ values $y_j$ (if not we consider $H_1$). Let $n_1, n_2$ be defined as in the proof of Theorem 3.1. Clearly, it suffices to prove that $\max(n_1, n_2) \geq \gamma n$ with $\gamma = \frac{3\beta - 4}{6(\beta + 1)}$.

If $n_2 \leq \gamma n$, we consider the set $\bar T$ of the values $y_j \in H_2$ that are computed without using partial sums crossing the bisection, i.e. such that every partial sum $\sum_l a_{jl}x_l$ contributing to $y_j$ is computed entirely inside $H_2$. We have
$$ |\bar T| \geq \frac{n}{2} - n_2 \geq \frac{n}{2} - \gamma n. $$
Since $G(A^T)$ is a pseudo-expander, the values $y_j$ of a subset of $\bar T$ of size $\lceil n/2 - \gamma n\rceil$ depend upon at least $\beta(n/2 - \gamma n)$ values $x_i$, and at most $2n/3$ of these $x_i$'s are initially stored in $H_2$. By construction, the remaining ones, at least $\beta(n/2 - \gamma n) - 2n/3 \geq \gamma n$ of them, must traverse the dimension-$k$ links by themselves. Hence $n_1 \geq \gamma n$.

By combining Theorems 3.1 and 3.2 with a result on the complexity of balancing on the hypercube, it is possible to prove that the computation of $y = Ax$ requires $\Omega((n/p)\log p)$ time no matter how the components of $x$ and $y$ are distributed among the processors. The balancing result we need is the following.

Lemma 3.3 ([13]) Suppose that $n$ items are distributed over the $p$ processors of the hypercube with at most $M$ items per processor. The algorithm BALANCE redistributes the items so that each processor contains $\lfloor n/p\rfloor$ or $\lceil n/p\rceil$ items; its running time is $O(M + (n/p)\log p)$.

We also need the converse of this lemma. Given a distribution of $n$ items over $p$ processors with no more than $M$ items per processor, there exists an algorithm UNBALANCE that constructs such a distribution starting from an initial setting in which each processor contains $\lfloor n/p\rfloor$ or $\lceil n/p\rceil$ items. The algorithm UNBALANCE is obtained by executing the operations of the algorithm BALANCE in the opposite order, hence its running time is again $O(M + (n/p)\log p)$.

Using BALANCE and UNBALANCE one can remove the even-distribution hypothesis from Theorems 3.1 and 3.2; together with Theorem 2.6 and Corollary 2.7, this establishes the average case lower bound stated in the introduction.

4 Worst Case Lower Bounds

The lower bounds of the previous section hold, with high probability, for random matrices with $\Theta(n)$ nonzero elements. It is natural to ask what is the minimum number of nonzero elements of $A$ for which this property holds. In this section we give a partial answer to this question.

Definition 4.1 Let $1 < \beta \leq 2$. Given a matrix $A$, we say that the graph $G(A)$ is a $\beta$-weak-expander if for each set $S \subseteq X$ the following property holds:
$$ |S| = \lfloor n/2\rfloor \;\Longrightarrow\; |\mathrm{Adj}(S)| \geq \beta\,|S|. $$

Obviously, all $(\alpha, \beta, n)$-pseudo-expanders with $\alpha \geq 1/2$ are weak-expanders, but the converse is not true. As we will see, weak-expanders can still be used to get lower bounds for matrix vector multiplication. In addition, the next lemma shows that there exist matrices with only $3n$ nonzero elements whose dependency graph is a weak-expander.

Lemma 4.2 For all sufficiently large $n$, there exists an $n \times n$ matrix $\bar A$ such that

1. $\bar A = P_1 + P_2 + P_3$, where $P_1, P_2, P_3$ are permutation matrices;

2. both $G(\bar A)$ and $G(\bar A^T)$ are $\beta_0$-weak-expanders, for a suitable constant $\beta_0 > 1$.

Proof. The existence of the matrix $\bar A$ is established in the proof of Theorem 4.3 in [11].

Lemma 4.3 Let $A$ be a matrix with a constant number of nonzero elements per row, and such that $G(A^T)$ is a $\beta$-weak-expander. If conditions 1-2 of section 3 hold, $p \leq n$, and, for $0 \leq k < p$, $|\pi_y^{-1}(k)| = n/p$, then any algorithm for computing the product $y = Ax$ on a $p$-processor hypercube requires $\Omega((n/p)\log p)$ time.

An analogous result (Lemma 4.4) holds with the roles of rows and columns exchanged, that is, for matrices with a constant number of nonzero elements per column such that $G(A)$ is a $\beta$-weak-expander, under the hypothesis $|\pi_x^{-1}(k)| = n/p$. Combining these lemmas with Lemma 4.2 we get the following result.

Theorem 4.5 For all sufficiently large $n$, there exists an $n \times n$ matrix with at most $3n$ nonzero elements, such that, if the components of one of the vectors $x$ or $y$ are evenly distributed among the processors, any algorithm for computing $y = Ax$ on a $p$-processor hypercube requires $\Omega((n/p)\log p)$ time.

The case of matrices with three nonzero elements per row appears to be the boundary between easy and difficult problems. In fact, if a matrix $B$ contains at most two nonzero elements in each row and in each column, we can arrange the components of $x$ and $y$ so that the product $y = Bx$ can be computed on the hypercube in $O(n/p)$ time. To see this, it suffices to notice that each vertex of $G(B)$ has degree at most two. Hence, the connected components of $G(B)$ can only be chains or rings and can be embedded in the hypercube with constant dilation and congestion.
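The structural fact used in this argument is easy to check programmatically: if every vertex of $G(B)$ has degree at most two, each connected component is a path (chain) or a cycle (ring). The following sketch (illustrative only) classifies the components of the dependency graph of a 0-1 matrix with at most two nonzeros per row and per column:

```python
# Classify the components of G(B) as chains (paths) or rings (cycles),
# assuming at most two nonzeros per row and per column of B.
# Vertices 0..n-1 are the x_i, vertices n..2n-1 are the y_j.  Illustrative only.
def classify_components(B):
    n = len(B)
    adj = {v: set() for v in range(2 * n)}
    for j in range(n):
        for i in range(n):
            if B[j][i]:
                adj[i].add(n + j)
                adj[n + j].add(i)
    assert all(len(nb) <= 2 for nb in adj.values()), "degree > 2: not chains/rings"
    seen, kinds = set(), []
    for s in adj:
        if s in seen:
            continue
        comp, stack = set(), [s]
        while stack:                    # collect the connected component of s
            v = stack.pop()
            if v in comp:
                continue
            comp.add(v)
            stack.extend(adj[v] - comp)
        seen |= comp
        edges = sum(len(adj[v]) for v in comp) // 2
        kinds.append('ring' if edges == len(comp) else 'chain')
    return kinds

B = [[1, 1, 0, 0],
     [0, 1, 1, 0],
     [0, 0, 1, 1],
     [1, 0, 0, 1]]
print(classify_components(B))   # ['ring']: G(B) here is a single 8-cycle
```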

We conclude this section by studying the complexity of sparse matrix vector multiplication when we require that, for $i = 1, \ldots, n$, the value $y_i$ is stored in the same processor containing $x_i$. A typical situation in which this requirement must be met is when the multiplication algorithm is utilized for computing the sequence $x^{(k+1)} = Ax^{(k)}$ generated by an iterative method. Note that, using the notation of section 3, this requirement is equivalent to assuming that $\pi_x = \pi_y$.

Theorem 4.6 Let $\mathcal{C}$ be the class of matrix vector multiplication algorithms such that, for $i = 1, \ldots, n$, the value $y_i$ is stored in the same processor containing $x_i$. For all sufficiently large $n$, there exists an $n \times n$ matrix $\bar B$ such that

1. each row and each column of $\bar B$ contains at most two nonzero elements;

2. if the components of $x$ are evenly distributed, any algorithm in $\mathcal{C}$ for computing $y = \bar Bx$ on a $p$-processor hypercube requires $\Omega((n/p)\log p)$ time.

Proof. By Lemmas 4.2 and 4.4, we know that there exists a matrix $\bar A$ such that the communication cost of computing $y = \bar Ax$, with the components of $x$ evenly distributed, is $\Omega((n/p)\log p)$ (note that this bound holds also for the algorithms not in $\mathcal{C}$). Moreover, we know that $\bar A = P_1 + P_2 + P_3$, where $P_1, P_2, P_3$ are permutation matrices. Now consider the matrix $\bar B = (P_1 + P_2)P_3^{-1}$: each row and each column of $\bar B$ contains at most two nonzero elements, and $\bar A = (\bar B + I)P_3$. Since the multiplication by a permutation matrix does not require any communication (for the algorithms not in $\mathcal{C}$, which may leave the components of the result wherever they are stored), an algorithm in $\mathcal{C}$ for the product $\bar Bz$ yields an algorithm, not in $\mathcal{C}$, that computes $y = \bar Ax = \bar B(P_3x) + P_3x$ at the cost of computing $\bar Bz$ with $z = P_3x$ evenly distributed, plus $O(n/p)$ local additions. Hence, any algorithm in $\mathcal{C}$ for computing the product $y = \bar Bx$ requires $\Omega((n/p)\log p)$ time.

5 Concluding Remarks

A natural question is whether it is possible to circumvent the above lower bound by allowing processors to contain multiple copies of the components of $x$. To be fair, our algorithm should produce multiple copies of the components of $y$ so that it can be used to compute sequences of the kind $x^{(k+1)} = Ax^{(k)}$.

References

[1] A. Agrawal, P. Klein and R. Ravi. Cutting down on fill using nested dissection: Provably good elimination orderings. In A. George, R. Gilbert and J. Liu, editors, Graph Theory and Sparse Matrix Computation, 31-55. Springer-Verlag, 1993.

[2] B. Bollobás. Random Graphs. Academic Press, 1985.

[3] U. Catalyuerek and C. Aykanat. Decomposing irregularly sparse matrices for parallel matrix-vector multiplication. Parallel Algorithms for Irregularly Structured Problems (IRREGULAR '96), LNCS 1117, 75-86. Springer-Verlag, 1996.

[4] T. Dehn, M. Eiermann, K. Giebermann and V. Sperling. Structured sparse matrix-vector multiplication on massively parallel SIMD architectures. Parallel Computing 21:1867-1894, 1995.

[5] P. Fernandes and P. Girdinio. A new storage scheme for an efficient implementation of the sparse matrix-vector product. Parallel Computing 12:327-333, 1989.

[6] V. Kumar, A. Grama, A. Gupta and G. Karypis. Introduction to Parallel Computing: Design and Analysis of Algorithms. Benjamin/Cummings, Redwood City, CA, 1994.

[7] F. T. Leighton. Introduction to Parallel Algorithms and Architectures: Arrays, Trees, Hypercubes. Morgan Kaufmann, San Mateo, CA, 1991.

[8] R. Lipton, D. Rose and R. Tarjan. Generalized nested dissection. SIAM J. Numer. Anal. 16:346-358, 1979.

[9] G. Manzini. Sparse matrix computations on the hypercube and related networks. Journal of Parallel and Distributed Computing 21:169-183, 1994.

[10] G. Manzini. Sparse matrix vector multiplication on distributed architectures: Lower bound and average complexity results. Information Processing Letters 50:231-238, 1994.

[11] G. Manzini. On the ordering of sparse linear systems. Theoretical Computer Science 156(1-2):301-313, 1996.

[12] V. Pan. Parallel solution of sparse linear and path systems. In J. H. Reif, editor, Synthesis of Parallel Algorithms, 621-678. Morgan Kaufmann, 1993.

[13] C. G. Plaxton. Load balancing, selection and sorting on the hypercube. Proc. of the 1st Annual ACM Symposium on Parallel Algorithms and Architectures, 64-73, 1989.

[14] L. Romero and E. Zapata. Data distributions for sparse matrix vector multiplication. Parallel Computing 21:583-605, 1995.

[15] L. Ziantz, C. Oezturan and B. Szymanski. Run-time optimization of sparse matrix-vector multiplication on SIMD machines. In Parallel Architectures and Languages Europe (PARLE '94), LNCS 817, 313-322. Springer-Verlag, 1994.
