
Investigating New Numerical Techniques for Reservoir Simulations on GPUs

Ahmad Abdelfattah, Hatem Ltaief and Rio Yokota


Credits

• This work is partially funded by:

• Strategic Initiative in Extreme Computing @ KAUST
  PI: Prof. David Keyes

• Saudi Aramco
  PI: Dr. Ali Dogru



Agenda

• Motivations
  (Computational Scientist, KAUST Supercomputing Lab)

• Preconditioning using Fast Multipole Methods
  (Research Scientist, KAUST SIEC)

• Optimizing the iterative solver using GPU-enabled SpMV
  (PhD Candidate, KAUST AMCS Dept)


Motivations

Slide courtesy of J. Dongarra


Motivations



It’s all about algorithms (at the petascale)

Given, for example:
  a “physics” phase that scales as O(N)
  a “solver” phase that scales as O(N^(3/2))
computation is almost all solver after several doublings.

Most applications groups have not yet “felt” this curve in their gut; as users get into queues with more processors, this will change.

Weak-scaling limit, assuming 100% efficiency in both the physics and solver phases: as the problem size grows, the solver takes 50% of the time on 128 procs and 97% of the time on 128K procs.
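As a quick sanity check on those two percentages (our own derivation, not from the slides), assume weak scaling with a fixed local problem size, so the global size N grows linearly with the processor count P:

```latex
% Per-processor time of each phase under weak scaling (N \propto P):
\[
  t_{\mathrm{physics}} \propto \frac{N}{P} = \mathrm{const}, \qquad
  t_{\mathrm{solver}}  \propto \frac{N^{3/2}}{P} \propto \sqrt{P}
\]
% Fraction of time spent in the solver, calibrated so that f(128) = 1/2:
\[
  f(P) = \frac{c\sqrt{P}}{1 + c\sqrt{P}}, \qquad c\sqrt{128} = 1
  \;\Longrightarrow\;
  f(128\mathrm{K}) = \frac{\sqrt{131072/128}}{1 + \sqrt{131072/128}}
                   = \frac{32}{33} \approx 0.97
\]
```

That is, once the solver and physics phases take equal time on 128 processors, the solver share climbs to roughly 97% on 128K processors, matching the figures above.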


FMM Preconditioner



Green’s Function Matrix



Compute Bound Preconditioner



Scalability of FMM



Related Work



Boundary Conditions

Tests using a 2-D finite element code (IFISS)


Boundary Element Method



Boundary Element Method



Eigenvalues of A



Eigenvalues of P*A



6 digit accuracy FMM



5 digit accuracy FMM



4 digit accuracy FMM



3 digit accuracy FMM



2 digit accuracy FMM



1 digit accuracy FMM



Bonsai

N = 2^24 in 1.2 s
θ = 0.75, p = 2
1650 GFlop/s on a K20c (758 MHz)

Jeroen Bédorf, Evghenii Gaburov
Thursday 10:00, Room 111


Optimizing GPU-enabled SpMV

Sparse Matrix-Vector Multiplication (SpMV) is the common bottleneck in most PDE-based scientific computing applications.

As part of the collaboration with Saudi Aramco, we focus on two sparse matrix storage formats:
  The Compressed Sparse Row format (CSR) – preliminary results
  The Blocked CSR format (BCSR) – future plans

The CSR format is very convenient for CPUs, but it is challenging to optimize on GPUs. We are going to mention the challenges and how they are tackled in our evolving CSR-based SpMV.

The BCSR format, however, might be more suitable for GPUs if the blocks are large enough:
  Blocks can be processed using dense GEMV/SYMV operations
  Our BCSR kernel will be based on highly optimized dense GEMV/SYMV kernels developed at KAUST


Optimizing GPU-enabled SpMV

CSR-based SpMV

The CSR format:
  No zero padding
  Three arrays: row pointer, column index, and values (the non-zeros)
  “column index” and “values” are always of the same length
  “row pointer” length is (1 + number of rows)

We want to compute y = α·A·x + β·y (a minimal reference loop follows below).
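To make the layout concrete, here is a minimal serial reference loop for y = α·A·x + β·y over the three CSR arrays (our sketch, not the slides' code; the names row_ptr, col_idx, and vals are ours):

```cpp
// Minimal CSR SpMV reference: y = alpha*A*x + beta*y.
// row_ptr has (nrows + 1) entries; col_idx and vals have the same length (nnz).
void spmv_csr_reference(int nrows,
                        const int    *row_ptr,   // row i owns entries row_ptr[i] .. row_ptr[i+1]-1
                        const int    *col_idx,   // column of each stored non-zero
                        const double *vals,      // value of each stored non-zero
                        double alpha, const double *x,
                        double beta,  double *y)
{
    for (int i = 0; i < nrows; ++i) {
        double dot = 0.0;
        for (int j = row_ptr[i]; j < row_ptr[i + 1]; ++j)
            dot += vals[j] * x[col_idx[j]];
        y[i] = alpha * dot + beta * y[i];
    }
}
```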



Optimizing GPU-enabled SpMV

CSR-based SpMV

What are the challenges?

  Irregular access for the vector x → a lot of cache misses
  Rows are not of the same length → non-coalesced memory access
  Rows are – in most cases – very short → thread divergence


Optimizing GPU-enabled SpMV

CSR-based SpMV

How do we tackle these challenges?

  Irregular access for the vector x: we direct fetches of the vector x to go through the texture cache (see the sketch below)
  Non-coalesced memory access: we propose a strategy to eliminate non-coalesced accesses (the super-row strategy)
  Thread divergence: still under investigation

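The slides do not show the kernel itself; purely as an illustration, on a Kepler-class GPU (compute capability 3.5, such as the K20c used in the results) fetches of x can be routed through the read-only/texture cache path with __ldg(), as in this simplified one-thread-per-row CSR kernel (a sketch with our own naming, not the authors' implementation):

```cpp
// Simplified CSR SpMV, one thread per row; loads of x go through the
// read-only (texture) cache via __ldg() on compute capability >= 3.5.
__global__ void spmv_csr_ldg(int nrows,
                             const int    *__restrict__ row_ptr,
                             const int    *__restrict__ col_idx,
                             const double *__restrict__ vals,
                             double alpha, const double *__restrict__ x,
                             double beta,  double       *y)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < nrows) {
        double dot = 0.0;
        for (int j = row_ptr[row]; j < row_ptr[row + 1]; ++j)
            dot += vals[j] * __ldg(&x[col_idx[j]]);   // texture-cached load of x
        y[row] = alpha * dot + beta * y[row];
    }
}
```

This per-row mapping still suffers from the non-coalesced reads and thread divergence listed above; the super-row strategy on the next slides targets the former.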


Optimizing GPU-enabled SpMV

CSR-based SpMV

Properties of the super-row strategy (an illustrative kernel sketch follows below):

  The matrix is processed in slices (multiple consecutive rows)
  We assign a warp to each slice
  Boundaries among rows are totally ignored while reading (column index – values) from global memory and while doing the element-wise multiplication (as if the slice were one long row)
  Lazy accumulation of partial products
  Non-coalesced access can occur at most twice

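As a rough sketch of what a warp-per-slice kernel with lazy accumulation might look like (our code, not the authors' kernel; it assumes one 32-thread warp per block, a fixed number of rows per slice, and that a slice's non-zeros fit in a shared-memory buffer):

```cpp
// Illustrative warp-per-slice CSR SpMV in the spirit of the super-row strategy.
// Launch with blockDim.x = 32 and gridDim.x = ceil(nrows / SLICE_ROWS).
#define SLICE_ROWS     8      // rows per slice (assumed fixed here)
#define MAX_SLICE_NNZ  1024   // assumed upper bound on non-zeros per slice

__global__ void spmv_csr_superrow(int nrows,
                                  const int    *__restrict__ row_ptr,
                                  const int    *__restrict__ col_idx,
                                  const double *__restrict__ vals,
                                  double alpha, const double *__restrict__ x,
                                  double beta,  double       *y)
{
    __shared__ double prod[MAX_SLICE_NNZ];

    int slice     = blockIdx.x;
    int row_begin = slice * SLICE_ROWS;
    int row_end   = min(row_begin + SLICE_ROWS, nrows);
    int begin     = row_ptr[row_begin];
    int end       = row_ptr[row_end];
    int lane      = threadIdx.x;

    // Phase 1: treat the slice as one long row -- embarrassingly parallel,
    // contiguous fetch-and-multiply; row boundaries are ignored here.
    for (int j = begin + lane; j < end; j += 32)
        prod[j - begin] = vals[j] * __ldg(&x[col_idx[j]]);
    __syncthreads();

    // Phase 2: lazy accumulation -- each lane reduces the partial products
    // belonging to one row of the slice and writes the result.
    for (int r = row_begin + lane; r < row_end; r += 32) {
        double dot = 0.0;
        for (int j = row_ptr[r]; j < row_ptr[r + 1]; ++j)
            dot += prod[j - begin];
        y[r] = alpha * dot + beta * y[r];
    }
}
```

Because the warp streams the slice's non-zeros contiguously, non-contiguity can only appear at the slice boundaries, which is in the spirit of the "at most twice" property above.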


Optimizing GPU-enabled SpMV

CSR-based SpMV

Principles of the super-row strategy


Optimizing GPU-enabled SpMV

CSR-based SpMV

Principles of the super-row strategy: the element-wise stage is embarrassingly parallel; the accumulation is a register-level reduction.


Optimizing GPU-enabled SpMV

CSR-based SpMV

How do we avoid non-coalesced memory access?


Optimizing GPU-enabled SpMV

CSR-based SpMV

The super-row strategy: summary

  Element-wise fetch-and-multiply is embarrassingly parallel
  Non-coalesced access can be avoided
  Lazy reduction should optimally be hidden while doing another fetch-and-multiply stage (not the case for the current implementation)


Optimizing GPU-enabled SpMV

CSR-based SpMV

Performance Results (Kepler K20c)

Single precision


Optimizing GPU-enabled SpMV

CSR-based SpMV

Super-row strategy: what's left?

  Variable slice size, with a balanced number of non-zeros per warp (see the slicing sketch below)
  Run multiple warps per slice, thus having more chances of latency hiding
  Reduce thread divergence in the reduction operation:
    We have element-wise partial products; how can we efficiently decide where to accumulate them?
    Decisions can be built completely offline, once the kernel configuration (number of thread blocks, number of threads per block) is determined
    The price is paid in terms of extra memory
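One possible way to realize the variable-slice-size item (our sketch, not the authors' code) is to cut slices offline by walking the CSR row pointer so that each slice carries roughly the same number of non-zeros:

```cpp
#include <vector>

// Partition rows into slices holding roughly `target_nnz` non-zeros each
// (a slice always contains whole rows). The result plays the same role as
// a row pointer, but over slices: slice s covers rows
// slice_ptr[s] .. slice_ptr[s+1]-1.
std::vector<int> build_slices(const int *row_ptr, int nrows, int target_nnz)
{
    std::vector<int> slice_ptr;
    slice_ptr.push_back(0);
    int nnz_in_slice = 0;
    for (int i = 0; i < nrows; ++i) {
        nnz_in_slice += row_ptr[i + 1] - row_ptr[i];
        if (nnz_in_slice >= target_nnz) {   // close the slice after this row
            slice_ptr.push_back(i + 1);
            nnz_in_slice = 0;
        }
    }
    if (slice_ptr.back() != nrows)          // last, possibly short, slice
        slice_ptr.push_back(nrows);
    return slice_ptr;
}
```

Such a slice pointer can be built once per matrix, in line with the "completely offline" decisions mentioned above, at the cost of the extra array.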


Optimizing GPU-enabled SpMV

Future Plans: BCSR-based SpMV

Blocked CSR format:
  For matrices with a dense block substructure
  Same as CSR, just replace elements with dense blocks
  Less overhead from integer indices:
    Row pointers are stored per row-of-blocks
    One column index per block (instead of one per element)

More chances for better performance: blocks can be processed using dense GEMV/SYMV operations (see the sketch below)
  We have a strong foundation for these dense kernels thanks to our highly optimized level-2 BLAS kernels (the KBLAS library)
  Our first BCSR SpMV will be based on our optimized GEMV/SYMV kernels
  Further tuning will be needed for very small block sizes (e.g. 2 or 4)
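For illustration only (our layout and names, with a fixed 4x4 block size), a BCSR structure and a serial block-wise SpMV in which each block is applied with a small dense GEMV-style loop could look like this:

```cpp
// Minimal BCSR (block size BS x BS) SpMV reference: y = alpha*A*x + beta*y.
// brow_ptr has (nbrows + 1) entries (one per block-row), bcol_idx has one
// column index per block, and vals stores each block densely (BS*BS values,
// row-major) in the same order as bcol_idx.
#define BS 4

void spmv_bcsr_reference(int nbrows,
                         const int    *brow_ptr,
                         const int    *bcol_idx,
                         const double *vals,
                         double alpha, const double *x,
                         double beta,  double *y)
{
    for (int bi = 0; bi < nbrows; ++bi) {
        double acc[BS] = {0.0};
        for (int b = brow_ptr[bi]; b < brow_ptr[bi + 1]; ++b) {
            const double *blk = vals + (size_t)b * BS * BS;  // dense BS x BS block
            const double *xb  = x + bcol_idx[b] * BS;        // matching piece of x
            // Small dense GEMV on the block: this is the part the slides
            // propose to hand to optimized GEMV/SYMV kernels on the GPU.
            for (int r = 0; r < BS; ++r)
                for (int c = 0; c < BS; ++c)
                    acc[r] += blk[r * BS + c] * xb[c];
        }
        for (int r = 0; r < BS; ++r)
            y[bi * BS + r] = alpha * acc[r] + beta * y[bi * BS + r];
    }
}
```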


Optimizing GPU-enabled SpMV

The KBLAS Library

Recently launched project at KAUST
The target is to optimize BLAS kernels on GPUs where there is still room for improvement
We have been successful for the GEMV and SYMV kernels so far, outperforming CUBLAS and MAGMA

The work has been published:

A. Abdelfattah, J. Dongarra, D. Keyes, and H. Ltaief, "Optimizing Memory-Bound SYMV Kernel on GPU Hardware Accelerators," in The 10th International Meeting on High Performance Computing for Computational Science (VECPAR), 2012.

A. Abdelfattah, D. Keyes, and H. Ltaief, "Systematic Approach in Optimizing Numerical Memory-Bound Kernels on GPU," in The International Workshop on Algorithms, Models and Tools for Parallel Computing on Heterogeneous Platforms (HeteroPar), 2012.


Optimizing GPU-enabled SpMV

The KBLAS Library Performance

SGEMV Performance on K20x

SSYMV Performance on K20x


Optimizing GPU-enabled SpMV

Conclusion

Generic CSR-based SpMV:
  We have an initial design that is competitive with cuSPARSE 5.0
  Challenges remain in reducing thread divergence and load balancing

BCSR-based SpMV:
  The design will be based on the KBLAS library
  Small dimensions should be taken into consideration for further optimization


Questions?
