06.08.2012 Views

Uncontracted Rys Quadrature implementation of up to g functions on ...

Uncontracted Rys Quadrature implementation of up to g functions on ...

Uncontracted Rys Quadrature implementation of up to g functions on ...

SHOW MORE
SHOW LESS

Create successful ePaper yourself

Turn your PDF publications into a flip-book with our unique Google optimized e-Paper software.

Andrey Asadchev<br />

Jacob Felder<br />

Veerendra Allada<br />

Dr. Mark S Gord<strong>on</strong><br />

Dr. Theresa Windus<br />

Dr. Brett Bode<br />

GPU Technology C<strong>on</strong>ference , NVIDIA , San Jose, 2009


� Computati<strong>on</strong>al Quantum Chemistry<br />

� General A<str<strong>on</strong>g>to</str<strong>on</strong>g>mic Molecular Electr<strong>on</strong>ic Structure Systems -GAMESS<br />

� Electr<strong>on</strong> Repulsi<strong>on</strong> Integral (ERI) Problem<br />

� Our Approach<br />

� CUDA Implementati<strong>on</strong><br />

� Optimizati<strong>on</strong>s<br />

� Au<str<strong>on</strong>g>to</str<strong>on</strong>g>matically generated code<br />

� Performance Results<br />

� Future Goals<br />

� Questi<strong>on</strong>s & Discussi<strong>on</strong>


� Use computati<strong>on</strong>al methods <str<strong>on</strong>g>to</str<strong>on</strong>g> solve the electr<strong>on</strong>ic structure and<br />

properties <str<strong>on</strong>g>of</str<strong>on</strong>g> molecules.<br />

� Finds utility in the design <str<strong>on</strong>g>of</str<strong>on</strong>g> new drugs and materials<br />

� Underlying theory is based <strong>on</strong> Quantum Mechanics –Schrodinger<br />

wave equati<strong>on</strong><br />

� Properties calculated<br />

� Energies<br />

� Electr<strong>on</strong>ic charge distributi<strong>on</strong><br />

� Dipole moments, vibrati<strong>on</strong>al frequencies.<br />

� Methods employed<br />

� Ab initio Methods ( Solve from first principles)<br />

� Density Functi<strong>on</strong>al Theory (DFT)<br />

� Semi-empirical methods<br />

� Molecular Mechanics (MM)


� Ab initio molecular quantum chemistry s<str<strong>on</strong>g>of</str<strong>on</strong>g>tware<br />

� USDOE “SciDAC Basic Energy Sciences” (BES) applicati<strong>on</strong><br />

� Serial and parallel versi<strong>on</strong>s for several methods<br />

� In brief, GAMESS can compute<br />

� Self C<strong>on</strong>sistent Field (SCF) wave <str<strong>on</strong>g>functi<strong>on</strong>s</str<strong>on</strong>g> - RHF, ROHF, UHF, GVB,<br />

and MCSCF using the Hartree-Fock method<br />

� Correlati<strong>on</strong> correcti<strong>on</strong>s <str<strong>on</strong>g>to</str<strong>on</strong>g> SCF using c<strong>on</strong>figurati<strong>on</strong> interacti<strong>on</strong> (CI),<br />

sec<strong>on</strong>d order perturbati<strong>on</strong> theory, and co<str<strong>on</strong>g>up</str<strong>on</strong>g>led cluster theories (CC)<br />

� Density Functi<strong>on</strong>al Theory approximati<strong>on</strong>s<br />

Reference:"Advances in electr<strong>on</strong>ic structure theory: GAMESS a decade later" M.S.Gord<strong>on</strong>, M.W.Schmidt pp.<br />

1167-1189, in "Theory and Applicati<strong>on</strong>s <str<strong>on</strong>g>of</str<strong>on</strong>g> Computati<strong>on</strong>al Chemistry: the first forty years" C.E.Dykstra,<br />

G.Frenking, K.S.Kim, G.E.Scuseria (edi<str<strong>on</strong>g>to</str<strong>on</strong>g>rs), Elsevier, Amsterdam, 2005.


� Molecules are made <str<strong>on</strong>g>of</str<strong>on</strong>g> a<str<strong>on</strong>g>to</str<strong>on</strong>g>ms and a<str<strong>on</strong>g>to</str<strong>on</strong>g>ms have electr<strong>on</strong>s<br />

� Electr<strong>on</strong>s live in shells – s, p, d, f, g, h<br />

� Shells are made <str<strong>on</strong>g>of</str<strong>on</strong>g> sub-shells – all have the same angular<br />

momentum (L)<br />

� Shells are represented using the mathematical <str<strong>on</strong>g>functi<strong>on</strong>s</str<strong>on</strong>g><br />

� Gaussian <str<strong>on</strong>g>functi<strong>on</strong>s</str<strong>on</strong>g> are taken as standard primitive <str<strong>on</strong>g>functi<strong>on</strong>s</str<strong>on</strong>g> (S.F. Boys)<br />

�<br />

� x, y, z – Cartesian center<br />

� a x, a y, a z – Angular momenta comp<strong>on</strong>ents; L = a x + a y + a z<br />

� � is the exp<strong>on</strong>ent<br />

� Shells with low angular momentum are typically c<strong>on</strong>tracted<br />

▪<br />

a a<br />

x y az<br />

2<br />

(r)= x y z exp( � r )<br />

� �<br />

K<br />

� ( r) � �<br />

D � ( r)<br />

a ka k<br />

k<br />

▪ K is the c<strong>on</strong>tracti<strong>on</strong> coefficient. D k’s are the c<strong>on</strong>tracti<strong>on</strong> coefficients


cheap <strong>on</strong>e-time<br />

operati<strong>on</strong><br />

1<br />

H core<br />

(<strong>on</strong>e-electr<strong>on</strong> integrals)<br />

Kinetic Energy Integrals (T)<br />

Nuclear Attracti<strong>on</strong><br />

Integrals (V)<br />

Form the Fock Matrix<br />

F = H core + G<br />

Transformati<strong>on</strong>s<br />

F ’ = X’FX<br />

C’ � Diag<strong>on</strong>alize(F’)<br />

C � XC’<br />

C<strong>on</strong>vergence<br />

Checks<br />

yes<br />

S<str<strong>on</strong>g>to</str<strong>on</strong>g>p<br />

6<br />

7<br />

Molecule Specificati<strong>on</strong><br />

•List <str<strong>on</strong>g>of</str<strong>on</strong>g> A<str<strong>on</strong>g>to</str<strong>on</strong>g>ms ( A<str<strong>on</strong>g>to</str<strong>on</strong>g>mic Numbers Z)<br />

•List <str<strong>on</strong>g>of</str<strong>on</strong>g> Nuclear Coordinates (R)<br />

• Number <str<strong>on</strong>g>of</str<strong>on</strong>g> electr<strong>on</strong>s<br />

•List <str<strong>on</strong>g>of</str<strong>on</strong>g> Primitive Functi<strong>on</strong>s, exp<strong>on</strong>ents<br />

• Number <str<strong>on</strong>g>of</str<strong>on</strong>g> c<strong>on</strong>tracti<strong>on</strong>s<br />

Form the basis <str<strong>on</strong>g>functi<strong>on</strong>s</str<strong>on</strong>g> (M)<br />

2<br />

Initial guess <str<strong>on</strong>g>of</str<strong>on</strong>g> the wave<br />

functi<strong>on</strong><br />

Obtain the guess at the<br />

Density Matrix (P)<br />

O(M 2 )<br />

No<br />

8<br />

5<br />

4<br />

Update the density<br />

matrix from C<br />

Repeat steps 3, 4, 5, 6, 7<br />

3<br />

4<br />

G – Matrix<br />

O(M 2 )<br />

G = [(ij|kl) – ½(ik|jl)]*P<br />

•Required in every iterati<strong>on</strong><br />

•Very Expensive operati<strong>on</strong><br />

•S<str<strong>on</strong>g>to</str<strong>on</strong>g>red procedures not scalable<br />

•Re-compute in every iterati<strong>on</strong><br />

•Good target for GPU<br />

ERI<br />

Two Electr<strong>on</strong> Repulsi<strong>on</strong> Integral<br />

(ij|kl)<br />

O(M 3 ) <str<strong>on</strong>g>to</str<strong>on</strong>g> O(M 4 )


� Four-center two-electr<strong>on</strong> repulsi<strong>on</strong><br />

integral<br />

�<br />

1<br />

( ab|cd ) = �� �a( 1) �b( 1) �c( 2 ) �d(<br />

2 )<br />

r<br />

12<br />

� Major computati<strong>on</strong>al step in both Ab<br />

initio and DFT methods<br />

� Complexity is O(M 3 )-O(M 4 ), M is the<br />

number <str<strong>on</strong>g>of</str<strong>on</strong>g> basis <str<strong>on</strong>g>functi<strong>on</strong>s</str<strong>on</strong>g> (Gaussian<br />

<str<strong>on</strong>g>functi<strong>on</strong>s</str<strong>on</strong>g> are standard)<br />

� <str<strong>on</strong>g>Rys</str<strong>on</strong>g> <str<strong>on</strong>g>Quadrature</str<strong>on</strong>g> – proposed by D<str<strong>on</strong>g>up</str<strong>on</strong>g>ius,<br />

<str<strong>on</strong>g>Rys</str<strong>on</strong>g>, King (DRK)<br />

� Numerical Gaussian quadrature based <strong>on</strong> a<br />

set <str<strong>on</strong>g>of</str<strong>on</strong>g> orthog<strong>on</strong>al <str<strong>on</strong>g>Rys</str<strong>on</strong>g> polynomials<br />

� Numerically stable, low memory foot print<br />

� Amenable for GPUs and architectures with<br />

smaller caches<br />

ERI Inputs<br />

bra ket<br />

shell a shell b shell c shell d<br />

CGTO CGTO CGTO CGTO<br />

x y z<br />

a x a y a z<br />

C<strong>on</strong>tracti<strong>on</strong><br />

Coefficients<br />

Gaussian Exp<strong>on</strong>ents<br />

Relevant <str<strong>on</strong>g>to</str<strong>on</strong>g> computati<strong>on</strong>s<br />

Lower order <str<strong>on</strong>g>functi<strong>on</strong>s</str<strong>on</strong>g> are<br />

typically c<strong>on</strong>tracted


X �� ��<br />

( r ��<br />

r )<br />

2<br />

A B<br />

r �� ( �� r ����<br />

r ) / A<br />

A i i j j<br />

r �� ( �� r ����<br />

r ) / B<br />

B k k l l<br />

� Two electr<strong>on</strong> integral is expressed as<br />

1<br />

where , 2m2 F exp( ) and<br />

m(X)=<br />

�t<br />

�Xt<br />

dt<br />

0<br />

( ij |kl ) = C F ( X )<br />

� X depends <strong>on</strong> exp<strong>on</strong>ents, centers and is independent <str<strong>on</strong>g>of</str<strong>on</strong>g> angular<br />

momenta<br />

2<br />

1<br />

X ��( rA �rB)<br />

r �( � r ��<br />

r ) / A<br />

A<br />

i i j j<br />

r �( � r ��<br />

r ) / B<br />

B k k l l<br />

L= L a+L b+L c+Ld<br />

2<br />

� ( ij |kl ) = �exp(<br />

�Xt<br />

) PL ( t) dt , where PL(t) is polynomial <str<strong>on</strong>g>of</str<strong>on</strong>g> degree L in t2. Evaluated<br />

0<br />

N<br />

using N-point quadrature and hence ( ij |kl ) = � W�PL( t�)<br />

where N = L / 2+ 1<br />

� Using separati<strong>on</strong> <str<strong>on</strong>g>of</str<strong>on</strong>g> variables, P L(t) which is integral over dr 1dr 2 , can be<br />

written as a product <str<strong>on</strong>g>of</str<strong>on</strong>g> three (2-D) integrals over dx 1dx 2, dr 1dr 2, dz 1dz 2<br />

1/2<br />

� ( ij |kl ) = 2 ( �/ � ) � I x(t � )I y(t � )I z(t<br />

� )W�<br />

and I q( = x,y,z) (N, 0: L a, 0: L b, 0: Lc,0: Ld<br />

)<br />

�<br />

� Ix, Iy, Iz are computed using recurrence and transfer relati<strong>on</strong>s<br />

L<br />

�<br />

m�0<br />

� � AB / ( A �B)<br />

A �� ��<br />

i j<br />

B �� ��<br />

k l<br />

��1<br />

m m


<str<strong>on</strong>g>Rys</str<strong>on</strong>g> <str<strong>on</strong>g>Quadrature</str<strong>on</strong>g> Algorithm<br />

for all l do<br />

for all k do<br />

for all j do<br />

for all i do<br />

end for<br />

end for<br />

end for<br />

end for<br />

I( i, j, k, l) � � I x( �, ix, jx, kx, lx) I y( �, iy, jy, ky , ly ) I z( �,<br />

iz, jz, kz, lz<br />

)<br />

�<br />

� Summati<strong>on</strong> over the roots over all the intermediate 2-D integrals<br />

� floating point operati<strong>on</strong>s =<br />

3* N *<br />

La�1Lb�1Lc�1Ld �1<br />

2 2 2 2<br />

� �� �� �� �<br />

�� �� �� �� �� �� �� ��<br />

� �� �� �� �<br />

� Recurrence, transfer and roots have predictable memory access<br />

patterns, fewer flops. <str<strong>on</strong>g>Quadrature</str<strong>on</strong>g> step is the main focus here.


� Example: (dd|dd) ERI block<br />

� L a = L b = L c = L d = 2<br />

� Number <str<strong>on</strong>g>of</str<strong>on</strong>g> roots, N = 5<br />

� ERI size = 6 4 = 1296 elements<br />

� Intermediate 2-D integrals Ix, Iy, Iz size:3 4 *5 = 245<br />

Possible Optimizati<strong>on</strong>s<br />

� ERI computati<strong>on</strong>s are memory bound, hence optimize memory accesses<br />

� Intermediate 2-D integrals are reused multiple times <str<strong>on</strong>g>to</str<strong>on</strong>g> c<strong>on</strong>struct<br />

different ERI elements.<br />

� Generate the different combinati<strong>on</strong>s au<str<strong>on</strong>g>to</str<strong>on</strong>g>matically


SP1<br />

Registers<br />

SP5<br />

Registers<br />

SP2<br />

Registers<br />

SP6<br />

Registers<br />

Multithreaded Instructi<strong>on</strong> Unit<br />

SP3<br />

Registers<br />

SP7<br />

Registers<br />

Shared Memory<br />

C<strong>on</strong>stant Cache<br />

Texture Cache<br />

SP4<br />

Registers<br />

SP8<br />

Registers<br />

Device Memory<br />

SM – Streaming Multiprocessor<br />

SP – Scalar Processor Core<br />

SFU – Special Functi<strong>on</strong>al Unit<br />

DP – Double Precisi<strong>on</strong> Unit<br />

Symmetric Multiprocessor 2<br />

Symmetric Multiprocessor 1<br />

SFU<br />

SFU<br />

Symmetric Multiprocessor N<br />

DP unit<br />

Thread<br />

(0,0)<br />

Thread<br />

(0,1)<br />

Thread<br />

(0,2)<br />

Block<br />

(0,0)<br />

Block<br />

(0,1)<br />

Thread<br />

(1,0)<br />

Block<br />

(1,0)<br />

Block<br />

(1,1)<br />

Grid <str<strong>on</strong>g>of</str<strong>on</strong>g> Blocks<br />

Thread<br />

(1,1)<br />

Thread<br />

(1,2)<br />

Thread<br />

(2,0)<br />

Thread<br />

(2,1)<br />

Thread<br />

(2,2)<br />

Thread Block<br />

Block<br />

(2,0)<br />

Block<br />

(2,1)<br />

Thread<br />

(3,0)<br />

Thread<br />

(3,1)<br />

Thread<br />

(3,2)<br />

Thread<br />

(4,0)<br />

Thread<br />

(4,1)<br />

Thread<br />

(4,2)


� Since 2-D integrals are reused multiple times, load them in<str<strong>on</strong>g>to</str<strong>on</strong>g><br />

shared memory<br />

� However, shared memory access , synchr<strong>on</strong>izati<strong>on</strong> limited <str<strong>on</strong>g>to</str<strong>on</strong>g> thread block<br />

boundaries<br />

� ERI block should be mapped <strong>on</strong><str<strong>on</strong>g>to</str<strong>on</strong>g> a single thread block<br />

� Is it possible <str<strong>on</strong>g>to</str<strong>on</strong>g> map all the ERI elements <str<strong>on</strong>g>to</str<strong>on</strong>g> individual threads in a block ?<br />

� The answer depends <strong>on</strong> the ERI block under c<strong>on</strong>siderati<strong>on</strong><br />

� For a (dd|dd) ERI block, ERI size = 6 4 = 1296 elements<br />

� Maximum <str<strong>on</strong>g>of</str<strong>on</strong>g> 512 or 768 threads per block<br />

� Map i, j, k indices corresp<strong>on</strong>ding <str<strong>on</strong>g>to</str<strong>on</strong>g> the three shells <str<strong>on</strong>g>of</str<strong>on</strong>g> the block <str<strong>on</strong>g>to</str<strong>on</strong>g> unique<br />

threads and iterate over the l index<br />

� Thread blocks are three dimensi<strong>on</strong>al, the mapping <str<strong>on</strong>g>of</str<strong>on</strong>g> i, j, k is natural<br />

� For (ff|ff) ERI block, ERI size = 10 4 = 1000 elements<br />

� Map i, j indices corresp<strong>on</strong>ding <str<strong>on</strong>g>to</str<strong>on</strong>g> the first two shells <str<strong>on</strong>g>of</str<strong>on</strong>g> the block <str<strong>on</strong>g>to</str<strong>on</strong>g> unique<br />

threads and iterate over the l index


CUDA <str<strong>on</strong>g>Rys</str<strong>on</strong>g> quadrature: i, j, k mapping<br />

# map threads <str<strong>on</strong>g>to</str<strong>on</strong>g> ERI elements<br />

I = threadIdx.x, j = threadIdx.y, k = threadIdx.z<br />

# arrays LX, LY, LZ map <str<strong>on</strong>g>functi<strong>on</strong>s</str<strong>on</strong>g> <str<strong>on</strong>g>to</str<strong>on</strong>g> exp<strong>on</strong>ents<br />

(ix, iy, iz) � (LX[i], LY[i], LZ[i])<br />

(jx, jy, jz) � (LX[j], LY[j], LZ[j])<br />

(kx, ky, kz)� (LX[k], LY[k], LZ[k])<br />

for all l do<br />

syncthreads<br />

## load the 2-D integrals <str<strong>on</strong>g>to</str<strong>on</strong>g> shmem<br />

I x, shmem � Ix(:,:,:,LX[l])<br />

I y, shmem �Iy(:,:,:,LX[l])<br />

I z, shmem � Iz(:,:,:,LX[l])<br />

syncthreads<br />

I(i, j, k, l) �<br />

end for<br />

�<br />

N<br />

I I I<br />

x, shmem y, shmem z, shmem<br />

Further optimizati<strong>on</strong>s<br />

� (dd|dd) case<br />

� I {x,y,z},shmem = 5(3 3 ) = 135<br />

elements per 2-D block<br />

� Across iterati<strong>on</strong>s, some <str<strong>on</strong>g>of</str<strong>on</strong>g> the<br />

elements in shared memory can<br />

be reused<br />

d-shell<br />

d x 2 , dy 2 , dz 2 , dxy, d xz, d yz � 18 loads<br />

d y 2 , dz 2 , dyz, d xy, d xz, d x 2 � 13 loads<br />

I x 0* 0 0 1* 1 2*<br />

I y 2* 0* 1* 1 0* 0<br />

I z 0* 2* 1* 0* 1* 0*


CUDA <str<strong>on</strong>g>Rys</str<strong>on</strong>g> quadrature: i, j mapping<br />

# map threads <str<strong>on</strong>g>to</str<strong>on</strong>g> ERI elements<br />

I = threadIdx.x, j = threadIdx.y<br />

# arrays LX, LY, LZ map <str<strong>on</strong>g>functi<strong>on</strong>s</str<strong>on</strong>g> <str<strong>on</strong>g>to</str<strong>on</strong>g> exp<strong>on</strong>ents<br />

(ix, iy, iz) � (LX[i], LY[i], LZ[i])<br />

(jx, jy, jz) � (LX[j], LY[j], LZ[j])<br />

for all klz-block do<br />

syncthreads<br />

Iz, shmem � Iz(:,:,LZ[k],LZ[l])<br />

## load 2-D integrals <str<strong>on</strong>g>to</str<strong>on</strong>g> shmem<br />

for all klxy klz-block do<br />

syncthreads<br />

Ix, shmem � Ix(:, :, LX[k], LX[l])<br />

Iy, shmem � Iy(:, :, LY[k], LX[l])<br />

syncthreads<br />

I(i, j, k, l) � � I x, shmemI y, shmemI z, shmem<br />

end for N<br />

end for<br />

Further optimizati<strong>on</strong>s<br />

� (ff|ff) case<br />

� I {x,y,z},shmem = 7(4 2 ) = 112<br />

elements per 2-D block<br />

� 10 <str<strong>on</strong>g>functi<strong>on</strong>s</str<strong>on</strong>g> in the f-shell<br />

� Reorder them ( next slide)


3<br />

0<br />

0<br />

2<br />

2<br />

1<br />

1<br />

1<br />

0<br />

0<br />

3<br />

2<br />

1<br />

0<br />

0<br />

1<br />

2<br />

1<br />

0<br />

0<br />

X Y Z<br />

3 0 0 2 2 1 1 1 0 0<br />

3 2 1 0 0 1 2 1 0 0<br />

0<br />

3<br />

0<br />

1<br />

0<br />

2<br />

0<br />

1<br />

2<br />

1<br />

0 3 0 1 0 2 0 1 2 1<br />

f x 3 , fy 3 , fz 3 , fx 2 y, f x 2 z, f xy 2 , fxz 2 , fxyz, f y 2 z, f yz 2<br />

0<br />

1<br />

2<br />

3<br />

2<br />

1<br />

0<br />

0<br />

1<br />

0<br />

0 1 2 3 2 1 0 0 1 0<br />

f x 3 , fx 2 y , f xy 2 fy 3 , fy 2 z , f xyz , f x 2 z, f xz 2 , fyz 2 , fz 3<br />

0<br />

0<br />

3<br />

0<br />

1<br />

0<br />

2<br />

1<br />

1<br />

2<br />

0<br />

0<br />

0<br />

0<br />

1<br />

1<br />

1<br />

2<br />

2<br />

3<br />

0 0 3 0 1 0 2 1 1 2<br />

0 0 0 0 1 1 1 2 2 3


� Number <str<strong>on</strong>g>of</str<strong>on</strong>g> registers per thread, shared memory per thread<br />

block limits the thread blocks that can be assigned per SM<br />

� Loops implemented directly result in high register usage<br />

� Explicitly unroll the loops. How ? Manually it’s tedious and<br />

error-pr<strong>on</strong>e<br />

� Use a comm<strong>on</strong> template and generate all the cases<br />

� Pyth<strong>on</strong> based Cheetah template engine is used- reuse<br />

existing Pyth<strong>on</strong> utilities and program s<str<strong>on</strong>g>up</str<strong>on</strong>g>port modules<br />

easily.


ERI blocks flop count GFLOPS SP 3 GFLOPS DP 4<br />

map 5 i jk map 5 i j map 5 i jk map 5 i j<br />

(gg|gg) 2000 2733750000 n/a 45.23 n/a 22.55<br />

(gg|f f) 4000 2160000000 n/a 34.42 n/a 15.32<br />

(f f |gg) 4000 2160000000 n/a 30.91 n/a 14.11<br />

(gg|dd) 10000 1701000000 n/a 43.08 n/a 21.05<br />

(gg|pp) 40000 1458000000 n/a 36.53 n/a 17.08<br />

(pp|gg) 40000 1458000000 34.23 6.93 18.20 5.38<br />

(f f |f f) 10000 2100000000 n/a 40.43 n/a 20.11<br />

(f f |dd) 20000 1296000000 n/a 37.54 n/a 18.29<br />

(dd|f f ) 20000 1296000000 37.69 23.32 16.53 15.04<br />

(f f |pp) 80000 1080000000 27.43 31.46 15.23 17.05<br />

(pp|f f) 80000 1080000000 32.23 6.21 17.45 4.84<br />

(dd|dd) 60000 1166400000 31.10 20.17 16.38 13.67<br />

� ERIs with odd number <str<strong>on</strong>g>of</str<strong>on</strong>g> roots have<br />

maximum performance over the<br />

even roots<br />

� Odd roots - (gg|gg), (gg|dd),<br />

and (ff|ff) cases<br />

� Even roots – (ff|gg), (gg|ff),<br />

and (dd|gg)<br />

� The difference is as high as 25%<br />

� Difference in the single and double<br />

precisi<strong>on</strong> is roughly a fac<str<strong>on</strong>g>to</str<strong>on</strong>g>r <str<strong>on</strong>g>of</str<strong>on</strong>g> two<br />

� Larger ijk mapping perform better<br />

than the ij mappings


ERI blocks flop count GFLOPS SP 3 GFLOPS DP 4<br />

GTX 275 Tesla GTX 275 Tesla<br />

(gg|gg) 2000 2733750000 45.23 55.97 22.55 27.34<br />

(gg|f f ) 4000 2160000000 34.42 42.07 15.32 18.67<br />

( f f|gg) 4000 2160000000 30.91 37.70 14.11 17.19<br />

(gg|dd) 10000 1701000000 43.08 53.39 21.05 25.34<br />

(dd|gg) 10000 1701000000 23.63 24.03 16.35 29.88<br />

(gg|pp) 40000 1458000000 36.53 45.15 17.08 20.65<br />

(pp|gg) 40000 1458000000 34.23 42.42 18.20 22.09<br />

( f f|f f) 10000 2100000000 40.43 50.19 20.11 24.46<br />

( f f|dd) 20000 1296000000 37.54 46.15 18.29 22.44<br />

(dd|f f) 20000 1296000000 37.69 45.71 16.53 19.71<br />

( f f|pp) 80000 1080000000 31.46 39.38 17.05 20.10<br />

( pp|f f ) 80000 1080000000 32.23 40.33 17.45 21.46<br />

(dd|dd) 60000 1166400000 31.10 38.74 16.38 19.78<br />

Inferences<br />

� Performance depends <strong>on</strong> the ERI<br />

class under evaluati<strong>on</strong> and hence<br />

also <strong>on</strong> the mapping ( i,j,k vs. i,j)<br />

� Difference between single and<br />

double precisi<strong>on</strong> performance is<br />

roughly a fac<str<strong>on</strong>g>to</str<strong>on</strong>g>r <str<strong>on</strong>g>of</str<strong>on</strong>g> two<br />

� Difference between the GTX and<br />

Tesla T is roughly 30% ( c<strong>on</strong>sistent<br />

with the clock speeds)<br />

� In terms <str<strong>on</strong>g>of</str<strong>on</strong>g> register and shared<br />

memory usage both are identical


� <str<strong>on</strong>g>Rys</str<strong>on</strong>g>q quadrature <str<strong>on</strong>g>implementati<strong>on</strong></str<strong>on</strong>g> performance results are comparable or<br />

better than DGEMV BLAS routines.<br />

� Some more improvements are possible by caching (texture, c<strong>on</strong>stant)<br />

and also by more aggressive memory reuse possibly at the expense <str<strong>on</strong>g>of</str<strong>on</strong>g> recomputati<strong>on</strong><br />

� Very easy <str<strong>on</strong>g>to</str<strong>on</strong>g> generate the possible ERI shell combinati<strong>on</strong>s using a single<br />

template<br />

� Explicit unrolling can be c<strong>on</strong>trolled at different levels such as shells, roots<br />

<str<strong>on</strong>g>to</str<strong>on</strong>g> test for performance improvements<br />

� Being developed as a standal<strong>on</strong>e library and applicati<strong>on</strong> agnostic


� ERIs are 4-dimensi<strong>on</strong>al, hence it is very expensive <str<strong>on</strong>g>to</str<strong>on</strong>g> transfer them <str<strong>on</strong>g>to</str<strong>on</strong>g> the<br />

host memory after computati<strong>on</strong>.<br />

� Fock matrix is 2-dimensi<strong>on</strong>al. So, c<strong>on</strong>sume the ERI’s as they are formed<br />

<str<strong>on</strong>g>to</str<strong>on</strong>g> build the Fock matrix<br />

� Handle the c<strong>on</strong>tracted ERI’s<br />

� Mixed precisi<strong>on</strong> s<str<strong>on</strong>g>up</str<strong>on</strong>g>port<br />

� A complete working SCF algorithm


1) <str<strong>on</strong>g>Rys</str<strong>on</strong>g>, J.; D<str<strong>on</strong>g>up</str<strong>on</strong>g>uis, M.; King, H. J. Comput. Phys. 1976, 21, 144.<br />

2) Boys, S.F. Proc. R. Soc 1950, 200, 542.<br />

3) <str<strong>on</strong>g>Rys</str<strong>on</strong>g>, J.; D<str<strong>on</strong>g>up</str<strong>on</strong>g>uis, M.; King, H. J. Comput. Chem. 1983, 4, 154–157.<br />

4) Gord<strong>on</strong>, M. S.; Schmidt, M. W. Advances in electr<strong>on</strong>ic structure theory:<br />

GAMESS a decade later. In Theory and Applicati<strong>on</strong>s <str<strong>on</strong>g>of</str<strong>on</strong>g> Computati<strong>on</strong>al<br />

Chemistry: the first forty years;<br />

Dykstra, C. E.; Frenking, G.; Kim, K. S.; Scuseria, G. E., Eds.; Elsevier:<br />

Amsterdam, 2005.<br />

5) Ufimtsev, I. S.; Martinez, T. J. J. Chem. Theory Comput. 2008, 4, 222–231.<br />

6) Ufimtsev, I. S.; Martinez, T. J. J. Chem. Theory Comput. 2009, 5, 1004–1015.<br />

7) Yasuda, K. Journal <str<strong>on</strong>g>of</str<strong>on</strong>g> Computati<strong>on</strong>al Chemistry 2008, 29, 334-342.


US Department <str<strong>on</strong>g>of</str<strong>on</strong>g> Energy<br />

Department <str<strong>on</strong>g>of</str<strong>on</strong>g> Defense - DURIP Grant<br />

Ames Labora<str<strong>on</strong>g>to</str<strong>on</strong>g>ry, Iowa State University<br />

Air Force Office <str<strong>on</strong>g>of</str<strong>on</strong>g> Scientific Research<br />

Nati<strong>on</strong>al Science Foundati<strong>on</strong> - Petascale Applicati<strong>on</strong>s grant<br />

NVIDIA Corporati<strong>on</strong><br />

Pr<str<strong>on</strong>g>of</str<strong>on</strong>g>essor Todd Martinez and his gro<str<strong>on</strong>g>up</str<strong>on</strong>g><br />

asadchev@gmail.com<br />

jfelder@iastate.edu<br />

allada.v@gmail.com<br />

mark@ si.msg.chem.iastate.edu<br />

theresa@fi.ameslab.gov<br />

brett@si.msg.chem.iastate.edu

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!