
A new technique to reduce false sharing in parallel irregular codes based on distance functions

J. C. Pichel, D. B. Heras, J. C. Cabaleiro and F. F. Rivera

Universidade de Santiago de Compostela

I-SPAN 2005, Las Vegas, USA

© Juan C. Pichel 2005

Motivation

Irregular codes: accesses with low spatial and temporal locality.

Data locality optimization algorithms:
- Code restructuring techniques → loop transformations (loop interchange, fusion, fission, tiling/blocking, ...)
- Memory layout transformations → data reordering techniques

Motivation

Increasing locality on shared memory multiprocessors → another critical issue: false sharing.
- Multiple processors access (for both read and write) different words on the same cache block.
- Under a write-invalidate coherency protocol → extra invalidations and cache misses.
- Especially time consuming (remote accesses, coherence and consistency mechanisms).

Motivation

Reduction in false sharing → reductions in the execution time and improved program scalability.

Introduction

We have selected the sparse matrix-vector product (SpMxV) as the irregular kernel.

Previous works:
- A locality model and a procedure for increasing locality using data reordering techniques
- Uniprocessors
- Shared memory multiprocessors

Introduction

Outline of the locality model:
- General purpose
- Locality measured for consecutive pairs of rows/columns
- Based on two parameters: entry matches (a_elems) and block matches (a_blocks)
- Distance between columns x and y: d_i(x,y)

d_1(x,y) = max_elems - a_elems(x,y)
d_2(x,y) = n_blocks(x) + n_blocks(y) - 2*a_blocks(x,y)
d_3(x,y) = n_elems(x) + n_elems(y) - 2*a_elems(x,y)

Introduction

These definitions can be easily extended to rows.

Finally, for a given sparse matrix:

D_j = Σ_{i=0}^{N-2} d_j(i, i+1),  j = 1, 2, 3

© Juan C. Pichel 2005
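Under stated assumptions, the distance definitions above can be sketched in Python. The slides do not fix how columns or cache blocks are represented, so the set-of-row-indices encoding and the block size below are illustrative assumptions, not the paper's implementation:

```python
# Sketch of the d_1, d_2, d_3 distances and the total distance D_j.
# Assumptions (not fixed by the slides): a column is a list of row indices,
# and a "block" is a cache block of ELEMS_PER_BLOCK consecutive entries.

ELEMS_PER_BLOCK = 4  # assumed cache-block capacity in entries

def a_elems(x, y):
    """Entry matches: row indices present in both columns."""
    return len(set(x) & set(y))

def blocks(x):
    """Cache blocks touched by a column (assumed row // block_size mapping)."""
    return {r // ELEMS_PER_BLOCK for r in x}

def a_blocks(x, y):
    """Block matches: cache blocks touched by both columns."""
    return len(blocks(x) & blocks(y))

def d1(x, y, max_elems):
    return max_elems - a_elems(x, y)

def d2(x, y):
    return len(blocks(x)) + len(blocks(y)) - 2 * a_blocks(x, y)

def d3(x, y):
    return len(x) + len(y) - 2 * a_elems(x, y)

def D(cols, d):
    """Total distance: sum of d over consecutive column pairs (bind
    max_elems first if using d1)."""
    return sum(d(cols[i], cols[i + 1]) for i in range(len(cols) - 1))

cols = [[0, 1, 5], [0, 2, 5], [3, 4]]   # toy 3-column matrix
print(D(cols, d3))  # d3(c0,c1)=2, d3(c1,c2)=5 -> 7
```

Smaller distances between consecutive columns mean more shared entries/blocks, which is exactly what the reordering tries to achieve.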

Introduction

DO i = 1, N
   DO j = PTR(i), PTR(i+1)-1
      Y(Index(j)) = Y(Index(j)) + DA(j) * X(i)
   ENDDO
ENDDO

Matrix stored using CCS format:
- DA, Index and PTR (data, row indices and column pointers)
- X (source vector) and Y (result vector)

Locality properties:
- A closer grouping of elements in columns → nearer elements of Y → improving spatial locality
- A closer grouping of elements between two or more consecutive columns → increase in the temporal locality

© Juan C. Pichel 2005
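As a cross-check of the Fortran loop above, here is a minimal Python version of the CCS product (0-based indexing instead of Fortran's 1-based; the array names DA, Index and PTR follow the slides, the toy matrix is an illustrative assumption):

```python
# y = A*x with A stored in Compressed Column Storage (CCS).

def spmxv_ccs(n, ptr, index, da, x):
    y = [0.0] * n
    for i in range(len(ptr) - 1):            # loop over columns
        for j in range(ptr[i], ptr[i + 1]):  # nonzeros of column i
            y[index[j]] += da[j] * x[i]      # scatter into the result
    return y

# 3x3 example: A = [[1, 0, 2],
#                   [0, 3, 0],
#                   [4, 0, 5]]
ptr   = [0, 2, 3, 5]               # column pointers
index = [0, 2, 1, 0, 2]            # row indices
da    = [1.0, 4.0, 3.0, 2.0, 5.0]  # nonzero values, column by column
print(spmxv_ccs(3, ptr, index, da, [1.0, 1.0, 1.0]))  # [3.0, 3.0, 9.0]
```

Note the irregular, column-driven scatter into Y: which entries of Y a column touches depends entirely on Index, which is why consecutive columns with matching row indices improve locality.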

Introduction

SMPs: adding some directives → parallelized code → false sharing.

[Figure: parallel SpMxV example using four processors; the result vector Y is written by Proc. I-IV, so different processors may update words of the same cache block of Y.]

© Juan C. Pichel 2005
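A small sketch of where the false sharing comes from: when the columns are split among processors, several processors may write different words of the same cache block of Y. The block size and the two toy matrices below are assumptions for illustration only:

```python
# Which cache blocks of Y does each processor write in the CCS product?

WORDS_PER_BLOCK = 4  # assumed words per cache block

def blocks_written(cols_of_proc, index, ptr):
    """Cache blocks of Y touched while processing the given columns."""
    rows = set()
    for i in cols_of_proc:
        rows.update(index[ptr[i]:ptr[i + 1]])
    return {r // WORDS_PER_BLOCK for r in rows}

ptr = [0, 2, 4, 6, 8]                    # 4 columns, 2 nonzeros each
banded      = [0, 1, 2, 3, 4, 5, 6, 7]   # columns touch disjoint row ranges
interleaved = [0, 4, 1, 5, 2, 6, 3, 7]   # columns touch scattered rows

# Proc. I gets columns {0,1}, Proc. II gets columns {2,3}.
print(blocks_written([0, 1], banded, ptr) &
      blocks_written([2, 3], banded, ptr))       # set(): no false sharing
print(blocks_written([0, 1], interleaved, ptr) &
      blocks_written([2, 3], interleaved, ptr))  # {0, 1}: shared blocks
```

With the scattered row pattern both processors write into the same blocks of Y, so every write triggers invalidations on the other processor's cached copy.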

Data Reordering Technique

[Figure: overview of the technique on a 6×6 example matrix with d_1 and max_elems = 3. Stage I builds the distance graph of the columns; Stage II partitions it between Proc. I and Proc. II; Stage III reorders the columns, mapping 0 1 2 3 4 5 → 0 2 5 3 1 4, to obtain the reordered matrix.]

© Juan C. Pichel 2005

Data Reordering Technique

Stage I – Defining a graph of the problem:
- Node ↔ column/row
- Each edge has an associated weight
- Pairs with a_elems and a_blocks = 0 are omitted → the distance matrix is stored as a triangular sparse matrix

© Juan C. Pichel 2005
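A minimal sketch of Stage I under illustrative assumptions (columns as lists of row indices, d_3 as the edge weight; the slides do not prescribe these details). Only column pairs with some entry match get an edge, which is what keeps the triangular distance matrix sparse:

```python
# Stage I sketch: build the distance graph, one node per column,
# one weighted edge per column pair that shares at least one row index.

def d3(x, y):
    """d_3 distance between two columns given as row-index lists."""
    common = len(set(x) & set(y))
    return len(x) + len(y) - 2 * common

def build_distance_graph(cols):
    """Return {(i, j): d3(cols[i], cols[j])} for i < j with a_elems > 0,
    i.e. the upper triangle of the (sparse) distance matrix."""
    graph = {}
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            if set(cols[i]) & set(cols[j]):  # a_elems > 0: keep the edge
                graph[(i, j)] = d3(cols[i], cols[j])
    return graph

cols = [[0, 1], [1, 2], [4, 5]]
print(build_distance_graph(cols))  # {(0, 1): 2}
```

Columns 0 and 1 share row 1 and get an edge; column 2 shares nothing with the others, so no edges (and no stored distances) involve it.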


Data Reordering Technique

Stage II – Graph partitioning:

The objective is twofold:
- Equally partition the graph among the processors: we have used pmetis from the METIS software package.
- Avoid false sharing: related to inter-processor locality; equivalent to maximizing the distance between the subgraphs assigned to each pair of processors.

© Juan C. Pichel 2005
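The actual partitioning in the paper is done with pmetis. As a hedged stand-in that only illustrates the objective (balanced parts, with strongly-matching columns kept on the same processor so the distance between the subgraphs stays large), a greedy bisection might look like:

```python
# Illustrative greedy bisection, NOT the METIS algorithm used in the paper.
# graph maps (i, j) with i < j to a distance; missing edges mean "no
# matches at all", modelled here by an assumed large constant.

BIG = 10**9  # assumed distance for column pairs with no matches

def greedy_bisect(n, graph):
    """Split nodes 0..n-1 into two balanced halves, placing each node in
    the half it is closest to (smallest total distance)."""
    half = (n + 1) // 2
    parts = ([], [])

    def total_dist(node, part):
        if not part:
            return BIG  # nothing to be close to yet
        return sum(graph.get((min(node, m), max(node, m)), BIG)
                   for m in part)

    for node in range(n):
        order = sorted((0, 1), key=lambda k: total_dist(node, parts[k]))
        for k in order:
            if len(parts[k]) < half:  # respect the balance constraint
                parts[k].append(node)
                break
    return parts

graph = {(0, 1): 1, (2, 3): 1}   # columns 0-1 match, columns 2-3 match
print(greedy_bisect(4, graph))   # ([0, 1], [2, 3])
```

Matching pairs end up on the same processor, so the cross-processor distance is maximal: the two subgraphs share no cache blocks.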


Data Reordering Technique

Stage III – Subgraph reordering:
- Objective: increase the intra-processor locality
- P subgraphs were obtained (one per processor)
- Analogy with the TSP: finding a path of minimum length that goes through all the nodes of each subgraph → permutation vector
- Chained Lin-Kernighan heuristic

© Juan C. Pichel 2005
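The paper uses the Chained Lin-Kernighan heuristic for this TSP-like step. As a much simpler illustrative stand-in, a nearest-neighbour path through each subgraph yields the permutation vector:

```python
# Illustrative nearest-neighbour path builder, NOT Chained Lin-Kernighan:
# order one processor's columns along a short path in distance space.

BIG = 10**9  # assumed distance when two columns share nothing

def nn_path(nodes, graph):
    """Greedy short path through all nodes of one subgraph; the returned
    node order is the permutation applied to that processor's columns."""
    def dist(a, b):
        return graph.get((min(a, b), max(a, b)), BIG)

    path = [nodes[0]]
    left = set(nodes[1:])
    while left:
        nxt = min(left, key=lambda m: dist(path[-1], m))  # closest next
        path.append(nxt)
        left.remove(nxt)
    return path

graph = {(0, 1): 5, (0, 2): 1, (1, 2): 2}
print(nn_path([0, 1, 2], graph))  # [0, 2, 1]
```

A shorter path means consecutive columns are at small distance, which is exactly the D_j objective restricted to one processor's subgraph.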

Data Reordering Technique

[Figure: nonzero pattern of matrix syn12000a reordered using the d_2 and d_3 distances.]

© Juan C. Pichel 2005

Performance Evaluation

Matrices used in the evaluation:

  #  Matrix       N      NZ       Application Area
  1  nc5          19652  1499816  N-body simulation
  2  syn12000a    12000  1463806  Synthetic matrix
  3  nmos3        18588  237130   Semiconductor simul.
  4  igbt3        10938  130500   Semiconductor simul.
  5  garon2       13535  373235   FEM Navier-Stokes
  6  poisson3Da   13514  352762   FEM
  7  sme3Da       12504  874887   FEM

SGI Origin 2000:
- MIPS R10k processors
- L1 data cache: 32 KB, 32-byte line size
- L2 unified cache: 4 MB, 128-byte line size

Codes written in Fortran using OpenMP.

© Juan C. Pichel 2005

Performance Evaluation

Overhead of the reordering technique:

[Figure: overhead of Stages I, II and III for matrices 1-7, measured in number of equivalent SpMxV products.]

- Stage I is the most costly, but it is easily parallelized.
- The overhead is amortized by the repeated execution of parallel products (iterative methods).

© Juan C. Pichel 2005

Performance Evaluation

[Figure: normalized number of invalidations, L1 cache misses and L2 cache misses for matrices 1-7, comparing the original ordering with the D1, D2 and D3 reorderings.]

© Juan C. Pichel 2005

Performance Evaluation

Reductions in the execution time:

[Figure: normalized execution time for matrices 1-7, original ordering vs. D1, D2 and D3 reorderings.]

© Juan C. Pichel 2005

Performance Evaluation

Improved program scalability:

[Figure: speedup for 2-10 processors on matrices 6 and 7, original ordering vs. D1, D2 and D3 reorderings.]

© Juan C. Pichel 2005

Conclusions

A new technique to deal with the problem of locality and false sharing on SMPs is proposed.

Locality is established at run time.

The problem is solved as a graph partitioning, followed by a reordering process.

Important improvements:
- Reductions of up to 95% in the number of invalidations.
- The average decrease in the total execution time is about 35%.
- The average speedup using 10 processors is more than 7 with the reordered matrices, while with the originals it is only around 4.

© Juan C. Pichel 2005
