Using MPI - Parallelization of the Spin-Orbit-Free GAS-CI Program LUCITA

Stefan Knecht

Institut für Theoretische Chemie und Computerchemie

Heinrich-Heine-Universität Düsseldorf

Talk given at the institute seminar

on 11 May 2006

Outline

1. Motivation

2. Theoretical Background

3. An Initial Application

4. Concluding Remarks

1. Motivation

SFB 663: molecular response upon

electronic excitation

Figure 1: adenine-thymine base pair.

❍ photophysics of the human

DNA/RNA base-pairs (BP) and

their corresponding monomers (M)

are of particular interest

❍ lifetimes in the ps or fs regime of excited (singlet) states of the BPs and Ms are known from gas-phase spectroscopy

❍ the challenge for theoretical chemists consists of:

✗ a precise determination and identification of excited states

✗ a proper characterization of charge-transfer (CT) states

✗ providing explanations for the observed behaviour

Requirements

For an accurate treatment of the DNA/RNA bases in calculations three

features are obviously mandatory:

❍ large basis sets including diffuse and polarization functions, e.g. aug-cc-pVXZ (X=D,T,Q,...)

❍ high-level ab initio electron correlation methods (DFT methods):

✗ perturbation theory methods like MP2

✗ Coupled-Cluster (CC) methods: CCSD, CCSD(T), ..., CC2, CC3 ...

✗ Configuration Interaction (CI) and MR-CI methods

❍ size of the molecules requires further approximations like resolution of the identity (RI)

✗ reduction of the number of (ij|kl)-integrals

✗ significant reduction of computational time

✗ employment of larger basis sets becomes possible

✗ already available: RI-MP2, RI-CC2, ...
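For orientation, the RI approximation is commonly written as an expansion of the four-index integrals in an auxiliary basis. The labels $P$, $Q$ for the auxiliary functions and $V$ for their Coulomb metric are notation assumed here; they do not appear on the slides.

```latex
(ij|kl) \;\approx\; \sum_{P,Q} (ij|P)\,\bigl[V^{-1}\bigr]_{PQ}\,(Q|kl),
\qquad V_{PQ} = (P|Q)
```

Only three-index quantities $(ij|P)$ have to be computed and stored, which is where the savings listed above come from.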

Requirements II

❍ e.g. ground states of interesting (biological) molecules may be multi-reference cases:

✗ one-determinant based methods fail in this case

✗ MR-CI methods are particularly suitable for such systems

→ LUCITA: a string-driven GAS-CI program

❍ despite the intended RI approximation, today’s computational resources call for a highly parallelized code

❍ the way the requested eigenvalues are computed offers two possible parallelization routes in LUCITA:

✗ coarse-grain

✗ fine-grain

❍ → Message Passing Interface (MPI) model

2. Theoretical Background

[Figure: orbital-space diagram with the labels Frozen Core, GAS I, GAS II, GAS III, External and Truncated.]

LUCITA - a GAS-CI program I

molecules ≫ 2 atoms: FCI wave functions are not feasible → truncated CI wave functions:

❍ complete active space (CAS) wave functions

❍ restricted active space (RAS) wave functions

❍ generalized active space (GAS) wave functions

✗ complete generalization of the RAS-CI method

✗ allows for an arbitrary division of orbital spaces w.r.t. the considered chemical or physical problem

✗ no occupation constraints → guarantees a very flexible definition of a wave function

❍ $HC = EC$ with $C = \sum_i C_i |i\rangle = \dfrac{|0\rangle + \hat{P}|c\rangle}{\sqrt{1 + \langle c|\hat{P}|c\rangle}}$

❍ large CI expansions (more than $10^6$ determinants): set-up and diagonalization of H in a straightforward fashion is not possible

❍ typically: only a few lowest eigenvalues are of interest
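To make the statement that FCI is not feasible more concrete, here is a minimal counting sketch. It assumes the textbook count of Slater determinants for a fixed number of alpha and beta electrons and ignores spatial and spin symmetry; the function name and the orbital counts are illustrative and not taken from LUCITA.

```python
from math import comb

def n_determinants(n_orbitals, n_alpha, n_beta):
    """Number of determinants when n_alpha and n_beta electrons are
    distributed freely over n_orbitals spatial orbitals (no symmetry)."""
    return comb(n_orbitals, n_alpha) * comb(n_orbitals, n_beta)

# 10 electrons (5 alpha, 5 beta) in a growing, hypothetical orbital space
for n_orb in (10, 20, 40):
    print(n_orb, n_determinants(n_orb, 5, 5))
```

Doubling the orbital space from 10 to 20 already raises the count from 63504 to roughly 2.4 x 10^8 determinants, which is why truncated schemes such as GAS are used.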

LUCITA - use of the modified Davidson algorithm

❍ → modified Davidson algorithm, a quasi-Newton method, featuring linear transformations: $\sigma = HC$

❍ algorithm:

1. calculate $H_{ii}$

2. set up trial vectors building a subspace $\{C_k^{(0)}\}$ of dimension $M$

3. calculate linear transformation $HC_k^{(0)} = \sigma_k$

4. calculate and diagonalize the projected matrix $\tilde{H}$ → $\lambda_k^{(M)}$, $\alpha_k^{(M)}$, with $\langle C_l^{(0)}|\sigma_k\rangle = \tilde{H}_{lk}$

5. calculate residuals $\|R_k\| = \|\sigma_k - \lambda_k^{(M)} \alpha_k^{(M)}\|$

6. if not converged → new reference vectors

$$C'_k = \frac{\left(H_{ii} - \lambda_k^{(M)} 1\right)^{-1} \left(\sigma_k - \lambda_k^{(M)} 1\, C_k^{(0)}\right)}{\left\| \left(H_{ii} - \lambda_k^{(M)} 1\right)^{-1} \left(\sigma_k - \lambda_k^{(M)} 1\, C_k^{(0)}\right) \right\|}$$

7. orthogonalize $C'_k$ to previous vectors
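The seven steps above can be sketched for a single root and a small dense symmetric matrix. This is a toy illustration under stated assumptions, not LUCITA's implementation: H is stored explicitly, only the lowest root is sought, the residual is formed in the full space from the Ritz vector, and the projected matrix is diagonalized with plain Jacobi rotations. All function names are hypothetical.

```python
import math

def matvec(H, v):
    return [sum(h * x for h, x in zip(row, v)) for row in H]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def lowest_eigpair(A, sweeps=50):
    """Lowest eigenpair of a small symmetric matrix via cyclic Jacobi rotations."""
    n = len(A)
    A = [row[:] for row in A]
    V = [[float(i == j) for j in range(n)] for i in range(n)]
    for _ in range(sweeps):
        if sum(A[p][q] ** 2 for p in range(n) for q in range(p + 1, n)) < 1e-30:
            break
        for p in range(n):
            for q in range(p + 1, n):
                if A[p][q] == 0.0:
                    continue
                d = A[q][q] - A[p][p]
                theta = math.pi / 4 if d == 0.0 else 0.5 * math.atan(2.0 * A[p][q] / d)
                cs, sn = math.cos(theta), math.sin(theta)
                for k in range(n):  # rotate columns p and q of A and V
                    A[k][p], A[k][q] = cs * A[k][p] - sn * A[k][q], sn * A[k][p] + cs * A[k][q]
                    V[k][p], V[k][q] = cs * V[k][p] - sn * V[k][q], sn * V[k][p] + cs * V[k][q]
                for k in range(n):  # rotate rows p and q of A
                    A[p][k], A[q][k] = cs * A[p][k] - sn * A[q][k], sn * A[p][k] + cs * A[q][k]
    i = min(range(n), key=lambda j: A[j][j])
    return A[i][i], [V[k][i] for k in range(n)]

def davidson_lowest(H, tol=1e-8, max_iter=50):
    n = len(H)
    diag = [H[i][i] for i in range(n)]                      # step 1
    start = min(range(n), key=lambda i: diag[i])
    basis = [[float(i == start) for i in range(n)]]         # step 2
    sigmas = [matvec(H, basis[0])]                          # step 3
    for _ in range(max_iter):
        m = len(basis)
        Hproj = [[dot(basis[l], sigmas[k]) for k in range(m)] for l in range(m)]
        lam, alpha = lowest_eigpair(Hproj)                  # step 4
        c = [sum(alpha[k] * basis[k][i] for k in range(m)) for i in range(n)]
        s = [sum(alpha[k] * sigmas[k][i] for k in range(m)) for i in range(n)]
        resid = [s[i] - lam * c[i] for i in range(n)]       # step 5
        if math.sqrt(dot(resid, resid)) < tol:
            break
        new = [r / (d - lam) if abs(d - lam) > 1e-12 else r
               for r, d in zip(resid, diag)]                # step 6 (diagonal preconditioner)
        for b in basis:                                     # step 7 (Gram-Schmidt)
            overlap = dot(b, new)
            new = [x - overlap * y for x, y in zip(new, b)]
        norm = math.sqrt(dot(new, new))
        if norm < 1e-12:
            break
        new = [x / norm for x in new]
        basis.append(new)
        sigmas.append(matvec(H, new))                       # step 3 for the new vector
    return lam, c

# diagonally dominant 6 x 6 test matrix
H = [[float(i + 1) if i == j else 0.1 for j in range(6)] for i in range(6)]
lam, c = davidson_lowest(H)
```

The only step that touches the full matrix is `matvec`, i.e. the σ-vector generation, which is why that step is the target of the parallelization.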

The two parallel routes - benefit

from the Sigma-Vector scheme

❍ time-consuming step in the algorithm: σ-vector generation → 2 schemes:

Figure 2: coarse-grain version.

Figure 3: fine-grain version.

[Diagrams not reproduced: both show a master and slave processes (proc 1, proc 2, ..., proc 5, ...) exchanging C(J_α, J_β) and σ(I_α, I_β); the master sums the contributions, Σ σ(I_α, I_β).]

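where

$$\sigma(I_\alpha, I_\beta) = \sum_{J_\alpha, J_\beta} \sum_{ijkl} \langle S(I_\beta)|a^\dagger_{i\beta} a_{j\beta}|S(J_\beta)\rangle \langle S(I_\alpha)|a^\dagger_{k\alpha} a_{l\alpha}|S(J_\alpha)\rangle \,(ij|kl)\, C(J_\alpha, J_\beta)$$

note: strings $S(I_\alpha)$ and $S(I_\beta)$ are ordered products of creation operators $a^\dagger$ for MOs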

coarse-grain ↔ fine-grain

❍ a comparison of both variants shows the superiority of the fine-grain variant over the coarse-grain one:

coarse-grain                                   | fine-grain
# of procs based on the # of eigenvalues (EV)  | # of procs arbitrary
every proc → one σ-vector                      | σ-vector split into an arbitrary # of batches
every proc → all MO integrals                  | only the relevant MO integrals
’static’ parallel version                      | ’dynamic’ parallel version

❍ the fine-grain version meets the demand for more flexibility

❍ → the fine-grain version of LUCITA will soon be implemented in the DIRAC program package

❍ the coarse-grain version is already available in two different variants

❍ both versions were, or will be, implemented using the MPI library

❍ no © → MPI is freeware
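The difference between the two columns can be illustrated with a small serial sketch in which plain loops stand in for MPI processes. It assumes a dense toy matrix and row batches of the σ-vector; the batching actually used in LUCITA is not specified on the slides, and all names are hypothetical.

```python
def sigma_rows(H, c, rows):
    """Selected rows of sigma = H c, returned as {row index: value}."""
    return {i: sum(h * x for h, x in zip(H[i], c)) for i in rows}

def coarse_grain(H, vectors):
    """One worker per trial vector: worker k builds the complete sigma_k."""
    n = len(H)
    sigmas = []
    for c in vectors:                      # each pass stands in for one process
        part = sigma_rows(H, c, range(n))
        sigmas.append([part[i] for i in range(n)])
    return sigmas

def fine_grain(H, vectors, n_procs, batch_size):
    """Every sigma vector is cut into batches handed to the least-loaded worker."""
    n = len(H)
    batches = [range(s, min(s + batch_size, n)) for s in range(0, n, batch_size)]
    load = [0] * n_procs                   # rows processed per worker
    sigmas = []
    for c in vectors:
        sigma = [0.0] * n
        for rows in batches:
            worker = load.index(min(load)) # simple stand-in for dynamic scheduling
            load[worker] += len(rows)
            for i, value in sigma_rows(H, c, rows).items():
                sigma[i] = value           # master gathers the partial result
        sigmas.append(sigma)
    return sigmas, load

H = [[2.0, 0.1, 0.0, 0.0],
     [0.1, 3.0, 0.1, 0.0],
     [0.0, 0.1, 4.0, 0.1],
     [0.0, 0.0, 0.1, 5.0]]
vectors = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
coarse = coarse_grain(H, vectors)
fine, load = fine_grain(H, vectors, n_procs=3, batch_size=1)
```

In the coarse-grain sketch the number of useful workers is tied to the number of vectors (two here), whereas the fine-grain sketch keeps three workers busy with the same two vectors.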

[Flowchart: one iteration in the two coarse-grain variants]

master: orthogonalization of the current reference vectors

master informs the slaves about the current # of roots; slaves that are not required return to a waiting loop

’broadcast route’:

1. master and slaves build a new communication group

2. distribution of all c-coefficients from M to S by MPI_Bcast

3. σ-vector calculation

4. slave sends new sigma-vector and subspace matrix elements to M

’send route’:

1. stepwise distribution of individual c-coefficients from M to S by MPI_Send

2. σ-vector calculation

3. slave sends new sigma-vectors to M by MPI_Send

4. master computes the subspace matrix elements

both routes: computation of the projected Hamiltonian matrix; slaves return to the waiting loop

master: diagonalization of the projected Hamiltonian matrix

The MPI library - a short review

❍ some essential features:

✗ communication between processes

✗ widely used: SPMD scheme (single program, multiple data)

✗ processes are unique w.r.t. their rank (within a communicator)

❍ MPI offers a huge number of library routines designed to satisfy nearly every need

❍ MPI_Send() and MPI_Recv() are the basic routines for (blocking) point-to-point communication

❍ a very effective way of one-to-all communication is possible with MPI_Bcast()

[Figure: tree-like broadcast among 8 processes over time: {0} → {0, 4} → {0, 2, 4, 6} → {0, 1, 2, 3, 4, 5, 6, 7}.]
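The process numbers in the broadcast figure (0, then 0 and 4, then 0, 2, 4, 6, then 0 to 7) follow a binomial-tree pattern, which can be reproduced with a short sketch. This is an illustration only: which algorithm an MPI library actually uses inside MPI_Bcast() is implementation-dependent, and the function name is hypothetical.

```python
def broadcast_schedule(n_procs):
    """Ranks that hold the data after each communication step (root = 0).

    In every step each rank that already has the data forwards it to the
    rank `stride` positions further on; the stride is halved after each step.
    """
    stride = 1
    while stride < n_procs:
        stride *= 2
    stride //= 2
    have = [0]
    steps = [list(have)]
    while stride >= 1:
        have += [r + stride for r in have if r + stride < n_procs]
        have.sort()
        steps.append(list(have))
        stride //= 2
    return steps

steps = broadcast_schedule(8)
for t, ranks in enumerate(steps):
    print(t, ranks)
```

For 8 processes the data reaches everyone in 3 steps, whereas a master sending to each slave in turn needs 7 consecutive sends.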

3. An Initial Application

’TIME IS MONEY !’

H2O - an appropriate test molecule

Figure 2: GAS-CI scheme for H2O used in the timing tests. [Diagram labels: Frozen Core, GAS I, GAS II, GAS III, External, Truncated; orbitals 1s, 2s, 2p, 2p(y), 1s(H), correlating orbitals; excitation levels SD and SDTQ.]

❍ employing Dirac-Coulomb Hamiltonian → spin-orbit interactions neglected

❍ correlation of all 10 electrons

❍ 147 basis functions in total (L+S)

❍ cut-off for MO-transformation: 25 a.u.

❍ 5 roots requested

❍ GAS-CI set-up (SDTQ-mrSD) results in a maximum of 9,472,264 determinants/combinations

❍ three routes of calculations:

✗ serial run (1 proc)

✗ parallel run (’send’-version; 5 procs)

✗ parallel run (’broadcast’-version; 5 procs)

H2O - timing test results

❍ calculations were performed on FATS and JUMP and stopped after 20 iterations

                 FATS                             JUMP
                 serial   ’send’   ’broadcast’    serial   ’send’   ’broadcast’
CPU-time (s)     70014    46033    56713          38275    26931    25182
WALL-time (s)    70005    47380    64683          56511    55567    38310

❍ FATS and JUMP: regarding CPU-times, the parallel calculations are faster than the serial one (by at least 19% and up to 35%)

❍ comparing WALL-time with CPU-time suggests an I/O or communication-traffic overload

❍ in a fine-grain variant these problems should be minimized because of a reduced I/O and communication load
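The percentages quoted above can be recomputed from the timing table. The sketch below reads "speed-up" as the relative saving in CPU time with respect to the serial run, which is an interpretation of the slide, and also reports the WALL/CPU ratio used in the second remark; the variable and function names are illustrative.

```python
cpu = {
    "FATS": {"serial": 70014, "send": 46033, "broadcast": 56713},
    "JUMP": {"serial": 38275, "send": 26931, "broadcast": 25182},
}
wall = {
    "FATS": {"serial": 70005, "send": 47380, "broadcast": 64683},
    "JUMP": {"serial": 56511, "send": 55567, "broadcast": 38310},
}

def reduction(times, variant):
    """Relative saving of a parallel variant w.r.t. the serial run, in percent."""
    return 100.0 * (1.0 - times[variant] / times["serial"])

savings = {(m, v): reduction(cpu[m], v) for m in cpu for v in ("send", "broadcast")}
overhead = {(m, v): wall[m][v] / cpu[m][v] for m in cpu for v in cpu[m]}

for key in sorted(savings):
    print(key, f"CPU saving {savings[key]:.1f} %", f"WALL/CPU {overhead[key]:.2f}")
```

The smallest saving is the ’broadcast’ run on FATS and the largest is the ’send’ run on FATS, while the ’send’ run on JUMP spends about twice as much wall-clock time as CPU time.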

H2O - more numbers ...

❍ another set of timing tests was run on the H2O molecule with the following set-up:

✗ 10 electrons, two GA spaces, 4 roots

✗ a maximum of either quadruple (SDTQ) or quintuple (SDTQ5) excitations → 5,354,278 and 62,529,742 determinants, respectively

❍ tests were done on FATS with 4 procs on 4 different nodes (4/1) or on 2 nodes with 2 procs each (2/2); serial calculations were performed with a single proc (1/2)

SDTQ (35 iterations)

                 4/1 ’send’   2/2 ’send’   4/1 ’bcast’   2/2 ’bcast’   1/2
CPU-time (s)     35586        39883        43596         35655         41877
WALL-time (s)    36124        39926        44576         37926         41878

SDTQ5 (41 iterations)

                 4/1 ’send’   2/2 ’send’   4/1 ’bcast’   2/2 ’bcast’   1/2
CPU-time (s)     422460       394740       383460        393660        544200
WALL-time (s)    555660       653940       541980        576660        689580

4. Concluding Remarks

Summary and Outlook

❍ in order to calculate excited states of molecules a variety of methods can be employed

❍ however, the size and electronic structure of most molecules limit the options

❍ the GAS-CI concept is a powerful technique with a wide application range

❍ a significant reduction of computational limits will especially be achieved by an efficient parallelization route

❍ current coarse-grain variants show a moderate to satisfactory speed-up w.r.t. a serial run → the not yet implemented fine-grain version should provide a more significant reduction in CPU- and WALL-time, as should an implementation of an RI approximation

❍ a fine-grain version will also be included in the 4c-GAS-CI code LUCIAREL, which is likewise available within the DIRAC program package + ...

I would like to thank:

⋆ Priv.-Doz. Dr. Timo Fleig

⋆ Prof. Dr. Christel Marian

⋆ Lasse Sørensen, Stephan Raub, Martin Kleinschmidt, ..., just the whole group

Thank You For Your Attention !