BSP/CGM Algorithms - USP

Workshop Integrade 

26 e 27 de Janeiro de 2004 - IME-USP-São Paulo-SP 


BSP/CGM Algorithms 

Edson Norberto Cáceres 

Henrique Mongelli 

Siang Wun Song 

CNPq and FAPESP

Outline 

1. Parallel and Distributed Computation Systems √ 

2. Parallel Algorithms and Complexity √ 

3. Models of Parallel and Distributed Computation √ 

• Shared Memory Model √ 

• Network Model √ 

• Realistic Models 

4. Implementation of Parallel Algorithms 

• MPI 

• PVM 

• BSP 

• CGM 


Realistic Models 

At the end of the 1980s the area of parallel algorithms went through a serious crisis.

There were several theoretical results, many of them for specific machines (meshes and hypercubes).

Practical results gave disappointing speedups.

Programs were not portable.

• BSP 

• CGM 

• LogP √

The BSP Model 

• The BSP (Bulk Synchronous Parallel) model was introduced by Valiant.

• A BSP machine consists of p processors, each with local memory.

• The processors communicate with each other through some interconnection scheme, managed by a router.

• The machine also offers the capability of synchronizing the processors at regular intervals of L time units.


• A BSP algorithm consists of a sequence of supersteps.

• In each superstep each processor executes a combination of local computation, sending of messages to other processors and receiving of messages from them.

• After each period of L time units a barrier synchronization is carried out.



• Parameters:

– n: problem size.

– p: number of processors with local memory.

– L: maximum time of a superstep (periodicity); latency.

– g: computation/communication efficiency ratio.

• Cost of superstep i: $w_i + g h_i + L$, where $w_i = \max\{L, t_1, \ldots, t_p\}$ and $h_i = \max\{L, c_1, \ldots, c_p\}$, with $t_j$ and $c_j$ the computation and communication performed by processor $j$ in that superstep.

• Total cost: $W + gH + LT$, where $T$ is the number of supersteps, $W = \sum_{i=1}^{T} w_i$ and $H = \sum_{i=1}^{T} h_i$.
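The cost formulas above can be evaluated mechanically once the per-processor figures are known. The following is a minimal sketch, not part of the original slides: the function names and the row-major table layout are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Largest of floor_value and the p entries of v. */
static double max_with_floor(double floor_value, const double *v, size_t p)
{
    double m = floor_value;
    for (size_t j = 0; j < p; j++)
        if (v[j] > m)
            m = v[j];
    return m;
}

/*
 * Total cost W + g*H + L*T of a BSP algorithm with T supersteps on p
 * processors. t and c are row-major T x p tables (hypothetical layout):
 * t[s*p + j] is the computation and c[s*p + j] the communication of
 * processor j in superstep s.
 */
double bsp_total_cost(const double *t, const double *c,
                      size_t T, size_t p, double g, double L)
{
    assert(p > 0);
    double W = 0.0, H = 0.0;
    for (size_t s = 0; s < T; s++) {
        W += max_with_floor(L, t + s * p, p);  /* w_s = max{L, t_1..t_p} */
        H += max_with_floor(L, c + s * p, p);  /* h_s = max{L, c_1..c_p} */
    }
    return W + g * H + L * (double)T;
}
```

For instance, with two supersteps on two processors, g = 2 and L = 1, computation figures {3, 5} and {0.5, 0.2}, and communication figures {1, 4} and {0, 0}, the sketch gives W = 6, H = 5 and a total cost of 6 + 2·5 + 1·2 = 18.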

The CGM Model 

• The CGM (Coarse Grained Multicomputer) model was proposed by Dehne et al.

• It uses only two parameters:

1. n: input size.

2. p: number of processors.

• A CGM machine consists of a set of p processors, each with local memory of size O(n/p).

• The processors communicate through an interconnection medium.


[Figure: a CGM algorithm. Processors P_0, P_1, P_2, ..., P_{p−1} alternate computation rounds (local computation) and communication rounds (global communication), separated by synchronization barriers.]


• A CGM algorithm consists of an alternating sequence of computation rounds and communication rounds, separated by a synchronization barrier.

• In the communication phase each processor exchanges at most a total of O(n/p) data with the other processors.

• Cost:

– Computation: analogous to BSP.

– Communication: $\#H_{n,p}$, the number of communication rounds times $H_{n,p}$, where $H_{n,p}$ is the cost of each communication round.

BSP: Sum Algorithm 

BSP Sum Algorithm(i, p, n, B = b((i − 1)r + 1 : ir); S)

1  z ← b[1] + ... + b[r];
2  if i = 1
3     then S ← z;
4     else send(z, p_1);
5  bspsync()
6  if i = 1
7     then for j ← 2 to p
8        do receive(z, p_j);
9           S ← S + z;

BSP: Algorithm Complexity 

• Step 1: each p_i performs r operations.

• Steps 2-3: p_1 performs one operation.

• Step 4: every p_i other than p_1 sends one message.

• Step 5: synchronization.

• Steps 6-9: p_1 receives p − 1 messages and performs p − 1 operations.

One synchronization.

Two supersteps.

O(n/p) computation.
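To make the two supersteps concrete, here is a sequential simulation of the sum algorithm, offered only as a sketch. It runs on one machine with p virtual processors; the function name, the scratch array z and the requirement that n be a multiple of p are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Sequential simulation of the two-superstep sum on p virtual processors.
 * b has n elements and n is assumed to be a multiple of p, so that each
 * processor owns r = n/p consecutive elements. z is caller-provided
 * scratch space of length p standing in for the messages sent to p_1.
 */
long bsp_sum_simulated(const long *b, size_t n, size_t p, long *z)
{
    assert(p > 0 && n % p == 0);
    size_t r = n / p;

    /* Superstep 1 (steps 1-4): every processor adds its own block. */
    for (size_t i = 0; i < p; i++) {
        z[i] = 0;
        for (size_t k = 0; k < r; k++)
            z[i] += b[i * r + k];
    }

    /* The barrier of step 5 would separate the two supersteps here. */

    /* Superstep 2 (steps 6-9): p_1 adds the p - 1 values it received. */
    long S = z[0];
    for (size_t j = 1; j < p; j++)
        S += z[j];
    return S;
}
```

The first loop nest performs r additions per virtual processor and the second performs p − 1 additions, matching the operation counts listed above.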

CGM: Sum Algorithm 

CGM Sum Algorithm(i, p, n, B = b((i − 1)r + 1 : ir); S)

1  z_i ← b[1] + ... + b[r];
2  if i = 1
3     then S ← z_1;
4     else send(z_i, p_1);
5  if i = 1
6     then for j ← 2 to p
7        do receive(z_j, p_j);
8     S ← S + z_2 + . . . + z_p;

CGM: Sum Algorithm - Complexity 

• Step 1: each p_i performs r operations.

• Steps 2-3: p_1 performs one operation.

• Step 4: every p_i other than p_1 sends one message.

• Steps 5-8: p_1 receives p − 1 messages and performs p − 1 operations.

One communication round.

O(n/p) computation.

BSP/CGM: Prefix Sum Algorithm 

Given a vector A of n elements A[0], . . . , A[n − 1], prefix sum computes the values:

A[0]
A[0] ⊕ A[1]
A[0] ⊕ A[1] ⊕ A[2]
...
A[0] ⊕ . . . ⊕ A[n − 1]

where ⊕ is a binary associative operation.
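The definition works for any associative ⊕, not only addition. The sketch below illustrates this sequentially; the names `binop_fn`, `op_add`, `op_max` and `prefix_scan` are hypothetical and chosen only for this example.

```c
#include <assert.h>
#include <stddef.h>

/* A binary associative operation on ints (hypothetical type name). */
typedef int (*binop_fn)(int, int);

int op_add(int x, int y) { return x + y; }
int op_max(int x, int y) { return x > y ? x : y; }

/* out[k] = a[0] (+) a[1] (+) ... (+) a[k] for every k < n. */
void prefix_scan(const int *a, int *out, size_t n, binop_fn op)
{
    assert(op != NULL);
    if (n == 0)
        return;
    out[0] = a[0];
    for (size_t k = 1; k < n; k++)
        out[k] = op(out[k - 1], a[k]);  /* reuse the previous prefix */
}
```

Calling `prefix_scan` with `op_add` gives the usual prefix sums, while `op_max` gives the running maximum of the vector.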



Example: 

⊕ is addition. 

Input Vector: 

[3 7 5 1 2 4 0 9] 

Result Vector: 

[3 10 15 16 18 22 22 31]


Example:

Four processors: P_1, P_2, P_3, P_4

Input Vector: [3 7 5 1 2 4 0 9]

Blocks: P_1 = [3 7], P_2 = [5 1], P_3 = [2 4] and P_4 = [0 9]

Local sums: SP_1 = 10, SP_2 = 6, SP_3 = 6 and SP_4 = 9

Offsets: P_1 ← 0, P_2 ← SP_1, P_3 ← SP_1 + SP_2 and P_4 ← SP_1 + SP_2 + SP_3

that is, P_1 ← 0, P_2 ← 10, P_3 ← 16 and P_4 ← 22

Result Vector: [3 10 15 16 18 22 22 31]


Prefix Sum Algorithm(i, p, n, A[(i − 1)n/p + 1 : in/p]; S)

1  T[i] ← A[(i − 1) ∗ n/p + 1] + . . . + A[i ∗ n/p];
2  for j ← i + 1 to p
3     do send(T[i], p_j);
4  bspsync();
5  for k ← 1 to i − 1
6     do receive(T[k], p_k);
7  ST ← T[1] + . . . + T[i − 1];
8  for j ← (i − 1) ∗ n/p + 1 to i ∗ n/p
9     do S[j] ← ST + A[(i − 1) ∗ n/p + 1] + . . . + A[j];
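The algorithm can be checked against the four-processor example with a sequential simulation. This is a sketch under stated assumptions, not the authors' implementation: the function name and the scratch array `total` are hypothetical, n is assumed to be a multiple of p, and the communication round is replaced by direct reads of `total`.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Sequential simulation of the block prefix sum on p virtual processors.
 * a and s have n elements, n is assumed to be a multiple of p, and
 * total is caller-provided scratch space of length p.
 */
void block_prefix_sum_simulated(const int *a, int *s, size_t n, size_t p,
                                int *total)
{
    assert(p > 0 && n % p == 0);
    size_t r = n / p;

    /* Step 1: each processor sums its own block. */
    for (size_t i = 0; i < p; i++) {
        total[i] = 0;
        for (size_t k = 0; k < r; k++)
            total[i] += a[i * r + k];
    }

    /* Steps 2-6 (send, barrier, receive) become reads of total[]. */

    for (size_t i = 0; i < p; i++) {
        /* Step 7: offset = sum of the totals of the preceding processors. */
        int running = 0;
        for (size_t k = 0; k < i; k++)
            running += total[k];
        /* Steps 8-9: local prefix sums shifted by the offset. */
        for (size_t k = 0; k < r; k++) {
            running += a[i * r + k];
            s[i * r + k] = running;
        }
    }
}
```

With the input vector [3 7 5 1 2 4 0 9] and p = 4, the offsets computed in step 7 are 0, 10, 16 and 22, and the output is [3 10 15 16 18 22 22 31], as in the example.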

Prefix Sum - BSP 

// Step 1. Initialization
bsplib_saveargs(&argc, &argv);
bsplib_init(BSPLIB_STDPARAMS, &bsp);   // initialize PUB
size = bsp_nprocs(&bsp);               // number of tasks
rank = bsp_pid(&bsp);                  // task identifier

// Step 2. Send the data to the child tasks
bsp_push_reg(&bsp, &numelem, sizeof(int));
bsp_get(&bsp, 0, &numelem, 0, &numelem, sizeof(int));
bsp_sync(&bsp);
bsp_pop_reg(&bsp, &numelem);
tam = (int) numelem/size;              // subvector size

// Step 3. Receive in SubVetor the rank*tam-th part of VetorDados
bsp_push_reg(&bsp, VetorDados, numelem*sizeof(int));
bsp_get(&bsp, 0, VetorDados, rank*tam*sizeof(int), SubVetor, tam*sizeof(int));
bsp_sync(&bsp);                        // barrier: the pending bsp_get completes here
bsp_pop_reg(&bsp, VetorDados);

// Step 4. Compute the sum in each task
SomaP = 0;
for (i = 0; i < tam; i++)
    SomaP = SomaP + SubVetor[i];
soma = SomaP;


// Step 5. Receive the sums of the tasks with pid < rank
Soma = 0;
for (i = 0; i < size-1; i++) {
    bsp_push_reg(&bsp, &soma, sizeof(int));
    if (rank > i) {
        bsp_get(&bsp, i, &soma, 0, &soma, sizeof(int));
    }
    bsp_sync(&bsp);                    // barrier: the pending bsp_get calls complete here
    if (rank > i) {
        Soma = Soma + soma;
    }
    bsp_pop_reg(&bsp, &soma);
    soma = SomaP;                      // restore the local sum for the next round
}

// Step 6. Compute the prefix sums in each task
SPre[0] = Soma + SubVetor[0];
for (i = 1; i < tam; i++)
    SPre[i] = SPre[i-1] + SubVetor[i];

// Step 7. Print the prefix sums held by each task
for (i = 0; i < tam; i++)
    printf("rank %d and SPre[%d]: %d\n", rank, i, SPre[i]);

// Step 8. Finalize BSP
bsplib_done();

// Step 7. Send the values to the root
// (these calls use the BSP library, so they must run before bsplib_done())
bsp_push_reg(&bsp, SomaPre, numelem*sizeof(int));
bsp_put(&bsp, 0, SPre, SomaPre, tam*rank*sizeof(int), tam*sizeof(int));
bsp_sync(&bsp);                        // barrier: the pending bsp_put calls complete here
bsp_pop_reg(&bsp, SomaPre);

// Step 8. Store the prefix sums at task 0
if (rank == root) {
    // Step 8.1. Open the file
    ArqS = fopen("ArquivoS.txt", "a");
    // Step 8.2. Write to the file
    fprintf(ArqS, "%i %i\n", numelem, size);
    for (i = 0; i < numelem; i++) {
        fprintf(ArqS, "%d ", SomaPre[i]);
    }
    // Step 8.3. Close the file only after the last write
    fclose(ArqS);
}

Prefix Sum - MPI 

// Step 1. Initialization
MPI_Init(&argc, &argv);
MPI_Comm_size(MPI_COMM_WORLD, &size);  // number of tasks
MPI_Comm_rank(MPI_COMM_WORLD, &rank);  // task identifier
tam = (int) TAMMAX/size;               // subvector size

// Step 2. Send the data to the child tasks
MPI_Scatter(VD, tam, MPI_INT, SV, tam, MPI_INT, root, MPI_COMM_WORLD);

// Step 3. Compute the sum in each task
SomaP = 0;
for (i = 0; i < tam; i++)
    SomaP = SomaP + SV[i];

// Step 4. Receive the partial sums (inclusive scan: Soma covers ranks 0..rank)
MPI_Scan(&SomaP, &Soma, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

// Step 5. Compute the prefix sums in each task
// (Soma - SomaP is the sum held by the tasks with a smaller rank)
SomaPre[0] = Soma - SomaP + SV[0];
for (i = 1; i < tam; i++)
    SomaPre[i] = SomaPre[i-1] + SV[i];

// Step 6. Print the prefix sums of each task
for (i = 0; i < tam; i++)
    printf("rank %d and SomaPre[%d]: %d\n", rank, i, SomaPre[i]);

// Step 7. Finalize MPI
MPI_Finalize();

Conclusions 

• BSP/CGM is a suitable model for the design of parallel algorithms.

• The MPI library uses fewer explicit synchronizations.

• Excessive use of synchronization can compromise the execution of parallel programs in grid environments.

References 

• Cáceres, E. N., Mongelli, H. and Song, S. W. Algoritmos Paralelos usando CGM/PVM/MPI: Uma Introdução. JAI/SBC 2001, Fortaleza, 2001. (Updated version at www.ime.usp.br/~song.)
