
A new technique to reduce false sharing in parallel irregular codes based on distance functions

J. C. Pichel, D. B. Heras, J. C. Cabaleiro and F. F. Rivera

Universidade de Santiago de Compostela

I-SPAN 2005, Las Vegas, USA



Motivation

Irregular codes: accesses with low spatial and temporal locality.
Data locality optimization algorithms:
  Code restructuring techniques: loop transformations (loop interchange, fusion, fission, tiling/blocking, …)
  Memory layout transformations: data reordering techniques


Motivation

Increasing locality on shared memory multiprocessors: another critical issue is false sharing.
False sharing: multiple processors access (for both reads and writes) different words of the same cache block.
With a write-invalidate coherency protocol this causes extra invalidations and cache misses.
These are especially time-consuming (remote accesses, coherence and consistency mechanisms).
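To make the effect concrete, here is a minimal sketch (not from the talk; program and variable names are made up): two threads repeatedly update neighbouring words of a shared array. ACC(1) and ACC(2) are different words but normally fall in the same cache block, so every write by one thread invalidates the block held by the other.

      PROGRAM FALSE_SHARING_DEMO
      USE OMP_LIB
      REAL :: ACC(2)
      INTEGER :: T, K
      ACC = 0.0
!$OMP PARALLEL NUM_THREADS(2) PRIVATE(T, K)
      T = OMP_GET_THREAD_NUM() + 1
      DO K = 1, 100000000
        ACC(T) = ACC(T) + 1.0      ! distinct words, same cache block
      ENDDO
!$OMP END PARALLEL
      PRINT *, ACC
      END PROGRAM FALSE_SHARING_DEMO

Padding ACC so that the two elements fall in different cache blocks, or giving each thread a private copy, removes the extra invalidations.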


Motivation

Reduction in false sharing leads to:
  Reductions in the execution time
  Improved program scalability


Introduction

We have selected the sparse matrix-vector product (SpMxV) as the irregular kernel.
Previous work: a locality model and a procedure for increasing locality using data reordering techniques, on:
  Uniprocessors
  Shared memory multiprocessors


Introduction

Outline of the locality model:
  General-purpose.
  Locality is measured for consecutive pairs of rows/columns.
  Based on two parameters:
    Entry matches (a_elems)
    Block matches (a_blocks)
  Distance between columns x and y: d_i(x, y)
    d_1(x, y) = max_elems - a_elems(x, y)
    d_2(x, y) = n_blocks(x) + n_blocks(y) - 2*a_blocks(x, y)
    d_3(x, y) = n_elems(x) + n_elems(y) - 2*a_elems(x, y)
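A small worked example (hypothetical columns, not taken from the talk) fixes the notation. Suppose column x has nonzeros in rows {1, 3, 4}, column y has nonzeros in rows {3, 4, 6}, a cache block holds two consecutive rows, and max_elems = 3 for this matrix. Then a_elems(x, y) = 2 (rows 3 and 4), x touches blocks {1-2, 3-4} and y touches blocks {3-4, 5-6}, so a_blocks(x, y) = 1, and:
  d_1(x, y) = 3 - 2 = 1
  d_2(x, y) = 2 + 2 - 2*1 = 2
  d_3(x, y) = 3 + 3 - 2*2 = 2
Smaller distances mean the two columns touch closer memory locations, so placing them next to each other improves locality.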


Introduction

These definitions can be easily extended to rows.
Finally, for a given sparse matrix:
  D_j = Σ_{i=0}^{N-2} d_j(i, i+1),   j = 1, 2 and 3


Introduction

DO i = 1, N
  DO j = PTR(i), PTR(i+1)-1
    Y(Index(j)) = Y(Index(j)) + DA(j) * X(i)
  ENDDO
ENDDO

Matrix stored using the CCS format:
  DA, Index and PTR (data, row indices and column pointers)
  X (source vector) and Y (result vector)

Locality properties:
  A closer grouping of the elements within a column means nearer elements of Y, improving spatial locality.
  A closer grouping of the elements across two or more consecutive columns increases temporal locality.
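As an illustration of how the locality model connects to this storage format, here is a minimal sketch (not the authors' code; the function name and the NROWS argument are made up, and 1-based CCS arrays with PTR(1) = 1 are assumed) that computes the total entry-based distance D_3 over consecutive columns:

      FUNCTION TOTAL_D3(N, NROWS, PTR, Index) RESULT(D3)
        INTEGER, INTENT(IN) :: N, NROWS
        INTEGER, INTENT(IN) :: PTR(N+1), Index(*)
        INTEGER :: D3, I, J, AELEMS
        LOGICAL :: MARK(NROWS)
        D3 = 0
        DO I = 1, N-1
          ! Mark the rows occupied in column I.
          MARK = .FALSE.
          DO J = PTR(I), PTR(I+1)-1
            MARK(Index(J)) = .TRUE.
          ENDDO
          ! Entry matches a_elems(I, I+1): rows occupied in both columns.
          AELEMS = 0
          DO J = PTR(I+1), PTR(I+2)-1
            IF (MARK(Index(J))) AELEMS = AELEMS + 1
          ENDDO
          ! d_3(I, I+1) = n_elems(I) + n_elems(I+1) - 2*a_elems(I, I+1)
          D3 = D3 + (PTR(I+1) - PTR(I)) + (PTR(I+2) - PTR(I+1)) - 2*AELEMS
        ENDDO
      END FUNCTION TOTAL_D3

The same marking pass gives a_blocks (and hence D_2) if row indices are first mapped to their cache blocks.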


Introduction

SMPs: adding some directives produces a parallelized code, but also false sharing.
[Figure: example using four processors; the entries of Y written by Proc. I-IV are interleaved, so different processors write to the same cache blocks of Y.]
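One possible form of the parallelized kernel, as a sketch only (these need not be the exact directives used by the authors): the columns are split statically among the threads, so different threads update entries of Y that can share a cache block, which is the false sharing pictured above. The ATOMIC guard is only needed because columns owned by different threads may also touch the same row.

!$OMP PARALLEL DO PRIVATE(i, j) SCHEDULE(STATIC)
      DO i = 1, N
        DO j = PTR(i), PTR(i+1)-1
!$OMP ATOMIC
          Y(Index(j)) = Y(Index(j)) + DA(j) * X(i)
        ENDDO
      ENDDO
!$OMP END PARALLEL DO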


Data Reordering Technique

[Figure: running example on a 6x6 matrix using d_1 with max_elems = 3. Stage I builds the distance graph over the columns (nodes 0-5, edge weights taken from a triangular distance matrix); Stage II partitions the nodes between Proc. I and Proc. II; Stage III reorders each subgraph, so the column order 0 1 2 3 4 5 of the input matrix becomes 0 2 5 3 1 4 in the reordered matrix.]


Data Reordering Technique

Stage I - Defining a graph of the problem:
  Each node represents a column/row.
  Each edge has an associated weight (the distance between its endpoints).
  Pairs with a_elems and a_blocks = 0 carry no edge, so the distance matrix is stored as a triangular sparse matrix.
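A deliberately simple sketch of this stage (not the authors' code; the subroutine name, the fixed-capacity edge arrays and the quadratic column scan are choices made here for clarity). It builds the d_3-weighted edge list, keeping only the triangle C > I and only pairs of columns that share at least one entry:

      SUBROUTINE BUILD_GRAPH(N, NROWS, PTR, Index, MAXEDGES, NEDGES, EI, EJ, W)
        INTEGER, INTENT(IN)  :: N, NROWS, MAXEDGES
        INTEGER, INTENT(IN)  :: PTR(N+1), Index(*)
        INTEGER, INTENT(OUT) :: NEDGES, EI(MAXEDGES), EJ(MAXEDGES), W(MAXEDGES)
        LOGICAL :: MARK(NROWS)
        INTEGER :: I, C, K, AELEMS
        NEDGES = 0
        DO I = 1, N
          ! Rows occupied in column I.
          MARK = .FALSE.
          DO K = PTR(I), PTR(I+1)-1
            MARK(Index(K)) = .TRUE.
          ENDDO
          DO C = I+1, N
            ! Entry matches a_elems(I, C).
            AELEMS = 0
            DO K = PTR(C), PTR(C+1)-1
              IF (MARK(Index(K))) AELEMS = AELEMS + 1
            ENDDO
            ! Keep an edge only when the two columns share some entry.
            IF (AELEMS > 0 .AND. NEDGES < MAXEDGES) THEN
              NEDGES = NEDGES + 1
              EI(NEDGES) = I
              EJ(NEDGES) = C
              ! Weight = d_3(I, C) = n_elems(I) + n_elems(C) - 2*a_elems(I, C)
              W(NEDGES) = (PTR(I+1)-PTR(I)) + (PTR(C+1)-PTR(C)) - 2*AELEMS
            ENDIF
          ENDDO
        ENDDO
      END SUBROUTINE BUILD_GRAPH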


Data Reordering Technique

[Running example figure repeated (see above).]


Data Reordering Technique

Stage II – Graph partitioning:
The objective is two-fold:
  Equally partition the graph among the processors: we have used pmetis from the METIS software package.
  Avoid false sharing: this is related to inter-processor locality, and is equivalent to maximizing the distance between the subgraphs assigned to each pair of processors.
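For illustration only (this file is not shown in the talk; the tiny graph, the file name and the weights are made up, and the exact invocation depends on the METIS version): the graph from Stage I can be dumped in the plain METIS graph-file format and split with the pmetis command-line program. Note that METIS minimizes the weight of the cut edges, so the weights handed to it would have to encode closeness between columns (for instance max_elems - d_1) rather than distance, keeping strongly related columns inside the same subgraph.

example.graph (6 vertices, 7 edges, fmt = 1 marks edge weights; line i lists "neighbour weight" pairs for vertex i):

  6 7 1
  2 3  4 2
  1 3  3 1  5 2
  2 1  6 3
  1 2  5 1
  2 2  4 1  6 2
  3 3  5 2

  pmetis example.graph 2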


Data Reordering Technique

[Running example figure repeated (see above).]


Data Reordering Technique

Stage III – Subgraph reordering:
  Objective: increase the intra-processor locality.
  P subgraphs were obtained (one per processor).
  Analogy with the TSP: finding a path of minimum length that goes through all the nodes of each subgraph yields a permutation vector.
  Chained Lin-Kernighan heuristic.
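Chained Lin-Kernighan itself is an involved heuristic; the greedy nearest-neighbour pass below (a stand-in sketch, not the authors' code, with made-up names and a dense distance array) only illustrates how a path through one subgraph turns into a permutation of its columns:

      SUBROUTINE PATH_ORDER(NSUB, DIST, PERM)
        INTEGER, INTENT(IN)  :: NSUB
        INTEGER, INTENT(IN)  :: DIST(NSUB, NSUB)   ! d_j(I, J); a large value if no edge
        INTEGER, INTENT(OUT) :: PERM(NSUB)
        LOGICAL :: USED(NSUB)
        INTEGER :: STEP, CUR, NEXT, J
        USED = .FALSE.
        CUR = 1                      ! start the path at an arbitrary node
        PERM(1) = CUR
        USED(CUR) = .TRUE.
        DO STEP = 2, NSUB
          ! Extend the path with the closest unvisited column.
          NEXT = 0
          DO J = 1, NSUB
            IF (.NOT. USED(J)) THEN
              IF (NEXT == 0) THEN
                NEXT = J
              ELSE IF (DIST(CUR, J) < DIST(CUR, NEXT)) THEN
                NEXT = J
              ENDIF
            ENDIF
          ENDDO
          PERM(STEP) = NEXT
          USED(NEXT) = .TRUE.
          CUR = NEXT
        ENDDO
      END SUBROUTINE PATH_ORDER

Chained Lin-Kernighan, roughly, improves such an initial path by repeated edge exchanges, restarting from perturbed solutions.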


Data Reordering Technique

[Figure: reordering of matrix syn12000a using the d_2 and d_3 distance functions.]


Performance Evaluation

#   Matrix       N       NZ        Application area
1   nc5          19652   1499816   N-body simulation
2   syn12000a    12000   1463806   Synthetic matrix
3   nmos3        18588   237130    Semiconductor simulation
4   igbt3        10938   130500    Semiconductor simulation
5   garon2       13535   373235    FEM Navier-Stokes
6   poisson3Da   13514   352762    FEM
7   sme3Da       12504   874887    FEM

SGI Origin 2000:
  MIPS R10k processors
  L1 data cache: 32 KB, 32-byte line size
  L2 unified cache: 4 MB, 128-byte line size
Codes written in Fortran using OpenMP.


Performance Evaluation

Overhead of the reordering technique:
[Chart: overhead per matrix (1-7), expressed as a number of SpMxV products (0-400) and split into Stage I, Stage II and Stage III.]
  Stage I is the most costly, and is easily parallelized.
  The overhead is amortized by the repeated execution of parallel products (iterative methods).


Performance Evaluation

[Charts: normalized number of invalidations, L1 cache misses and L2 cache misses for matrices 1-7, comparing the original matrices with the D1, D2 and D3 reorderings.]


Performance Evaluation

Reductions in the execution time:
[Chart: normalized execution time for matrices 1-7, comparing the original matrices with the D1, D2 and D3 reorderings.]


Performance Evaluation

Improved program scalability:
[Charts: speedup versus number of processors (2-10) for matrices 6 and 7, comparing the original matrices with the D1, D2 and D3 reorderings.]


Conclusions

A new technique to deal with the problems of locality and false sharing for irregular codes on SMPs is proposed.
Locality is established at run time.
The problem is solved as a graph partitioning followed by a reordering process.
Important improvements:
  Reductions of up to 95% in the number of invalidations.
  The average decrease in total execution time is about 35%.
  The average speedup using 10 processors is more than 7 with the reordered matrices, while with the originals it is only around 4.


A new technique to reduce false sharing in parallel irregular codes based on distance functions
J. C. Pichel, D. B. Heras, J. C. Cabaleiro and F. F. Rivera

Universidade de Santiago de Compostela

I-SPAN 2005, Las Vegas, USA
