Uncontracted Rys Quadrature implementation of up to g functions on ...
Uncontracted Rys Quadrature implementation of up to g functions on ...
Uncontracted Rys Quadrature implementation of up to g functions on ...
Create successful ePaper yourself
Turn your PDF publications into a flip-book with our unique Google optimized e-Paper software.
Andrey Asadchev<br />
Jacob Felder<br />
Veerendra Allada<br />
Dr. Mark S Gord<strong>on</strong><br />
Dr. Theresa Windus<br />
Dr. Brett Bode<br />
GPU Technology C<strong>on</strong>ference , NVIDIA , San Jose, 2009
� Computati<strong>on</strong>al Quantum Chemistry<br />
� General A<str<strong>on</strong>g>to</str<strong>on</strong>g>mic Molecular Electr<strong>on</strong>ic Structure Systems -GAMESS<br />
� Electr<strong>on</strong> Repulsi<strong>on</strong> Integral (ERI) Problem<br />
� Our Approach<br />
� CUDA Implementati<strong>on</strong><br />
� Optimizati<strong>on</strong>s<br />
� Au<str<strong>on</strong>g>to</str<strong>on</strong>g>matically generated code<br />
� Performance Results<br />
� Future Goals<br />
� Questi<strong>on</strong>s & Discussi<strong>on</strong>
� Use computati<strong>on</strong>al methods <str<strong>on</strong>g>to</str<strong>on</strong>g> solve the electr<strong>on</strong>ic structure and<br />
properties <str<strong>on</strong>g>of</str<strong>on</strong>g> molecules.<br />
� Finds utility in the design <str<strong>on</strong>g>of</str<strong>on</strong>g> new drugs and materials<br />
� Underlying theory is based <strong>on</strong> Quantum Mechanics –Schrodinger<br />
wave equati<strong>on</strong><br />
� Properties calculated<br />
� Energies<br />
� Electr<strong>on</strong>ic charge distributi<strong>on</strong><br />
� Dipole moments, vibrati<strong>on</strong>al frequencies.<br />
� Methods employed<br />
� Ab initio Methods ( Solve from first principles)<br />
� Density Functi<strong>on</strong>al Theory (DFT)<br />
� Semi-empirical methods<br />
� Molecular Mechanics (MM)
� Ab initio molecular quantum chemistry s<str<strong>on</strong>g>of</str<strong>on</strong>g>tware<br />
� USDOE “SciDAC Basic Energy Sciences” (BES) applicati<strong>on</strong><br />
� Serial and parallel versi<strong>on</strong>s for several methods<br />
� In brief, GAMESS can compute<br />
� Self C<strong>on</strong>sistent Field (SCF) wave <str<strong>on</strong>g>functi<strong>on</strong>s</str<strong>on</strong>g> - RHF, ROHF, UHF, GVB,<br />
and MCSCF using the Hartree-Fock method<br />
� Correlati<strong>on</strong> correcti<strong>on</strong>s <str<strong>on</strong>g>to</str<strong>on</strong>g> SCF using c<strong>on</strong>figurati<strong>on</strong> interacti<strong>on</strong> (CI),<br />
sec<strong>on</strong>d order perturbati<strong>on</strong> theory, and co<str<strong>on</strong>g>up</str<strong>on</strong>g>led cluster theories (CC)<br />
� Density Functi<strong>on</strong>al Theory approximati<strong>on</strong>s<br />
Reference:"Advances in electr<strong>on</strong>ic structure theory: GAMESS a decade later" M.S.Gord<strong>on</strong>, M.W.Schmidt pp.<br />
1167-1189, in "Theory and Applicati<strong>on</strong>s <str<strong>on</strong>g>of</str<strong>on</strong>g> Computati<strong>on</strong>al Chemistry: the first forty years" C.E.Dykstra,<br />
G.Frenking, K.S.Kim, G.E.Scuseria (edi<str<strong>on</strong>g>to</str<strong>on</strong>g>rs), Elsevier, Amsterdam, 2005.
� Molecules are made <str<strong>on</strong>g>of</str<strong>on</strong>g> a<str<strong>on</strong>g>to</str<strong>on</strong>g>ms and a<str<strong>on</strong>g>to</str<strong>on</strong>g>ms have electr<strong>on</strong>s<br />
� Electr<strong>on</strong>s live in shells – s, p, d, f, g, h<br />
� Shells are made <str<strong>on</strong>g>of</str<strong>on</strong>g> sub-shells – all have the same angular<br />
momentum (L)<br />
� Shells are represented using the mathematical <str<strong>on</strong>g>functi<strong>on</strong>s</str<strong>on</strong>g><br />
� Gaussian <str<strong>on</strong>g>functi<strong>on</strong>s</str<strong>on</strong>g> are taken as standard primitive <str<strong>on</strong>g>functi<strong>on</strong>s</str<strong>on</strong>g> (S.F. Boys)<br />
�<br />
� x, y, z – Cartesian center<br />
� a x, a y, a z – Angular momenta comp<strong>on</strong>ents; L = a x + a y + a z<br />
� � is the exp<strong>on</strong>ent<br />
� Shells with low angular momentum are typically c<strong>on</strong>tracted<br />
▪<br />
a a<br />
x y az<br />
2<br />
(r)= x y z exp( � r )<br />
� �<br />
K<br />
� ( r) � �<br />
D � ( r)<br />
a ka k<br />
k<br />
▪ K is the c<strong>on</strong>tracti<strong>on</strong> coefficient. D k’s are the c<strong>on</strong>tracti<strong>on</strong> coefficients
cheap <strong>on</strong>e-time<br />
operati<strong>on</strong><br />
1<br />
H core<br />
(<strong>on</strong>e-electr<strong>on</strong> integrals)<br />
Kinetic Energy Integrals (T)<br />
Nuclear Attracti<strong>on</strong><br />
Integrals (V)<br />
Form the Fock Matrix<br />
F = H core + G<br />
Transformati<strong>on</strong>s<br />
F ’ = X’FX<br />
C’ � Diag<strong>on</strong>alize(F’)<br />
C � XC’<br />
C<strong>on</strong>vergence<br />
Checks<br />
yes<br />
S<str<strong>on</strong>g>to</str<strong>on</strong>g>p<br />
6<br />
7<br />
Molecule Specificati<strong>on</strong><br />
•List <str<strong>on</strong>g>of</str<strong>on</strong>g> A<str<strong>on</strong>g>to</str<strong>on</strong>g>ms ( A<str<strong>on</strong>g>to</str<strong>on</strong>g>mic Numbers Z)<br />
•List <str<strong>on</strong>g>of</str<strong>on</strong>g> Nuclear Coordinates (R)<br />
• Number <str<strong>on</strong>g>of</str<strong>on</strong>g> electr<strong>on</strong>s<br />
•List <str<strong>on</strong>g>of</str<strong>on</strong>g> Primitive Functi<strong>on</strong>s, exp<strong>on</strong>ents<br />
• Number <str<strong>on</strong>g>of</str<strong>on</strong>g> c<strong>on</strong>tracti<strong>on</strong>s<br />
Form the basis <str<strong>on</strong>g>functi<strong>on</strong>s</str<strong>on</strong>g> (M)<br />
2<br />
Initial guess <str<strong>on</strong>g>of</str<strong>on</strong>g> the wave<br />
functi<strong>on</strong><br />
Obtain the guess at the<br />
Density Matrix (P)<br />
O(M 2 )<br />
No<br />
8<br />
5<br />
4<br />
Update the density<br />
matrix from C<br />
Repeat steps 3, 4, 5, 6, 7<br />
3<br />
4<br />
G – Matrix<br />
O(M 2 )<br />
G = [(ij|kl) – ½(ik|jl)]*P<br />
•Required in every iterati<strong>on</strong><br />
•Very Expensive operati<strong>on</strong><br />
•S<str<strong>on</strong>g>to</str<strong>on</strong>g>red procedures not scalable<br />
•Re-compute in every iterati<strong>on</strong><br />
•Good target for GPU<br />
ERI<br />
Two Electr<strong>on</strong> Repulsi<strong>on</strong> Integral<br />
(ij|kl)<br />
O(M 3 ) <str<strong>on</strong>g>to</str<strong>on</strong>g> O(M 4 )
� Four-center two-electr<strong>on</strong> repulsi<strong>on</strong><br />
integral<br />
�<br />
1<br />
( ab|cd ) = �� �a( 1) �b( 1) �c( 2 ) �d(<br />
2 )<br />
r<br />
12<br />
� Major computati<strong>on</strong>al step in both Ab<br />
initio and DFT methods<br />
� Complexity is O(M 3 )-O(M 4 ), M is the<br />
number <str<strong>on</strong>g>of</str<strong>on</strong>g> basis <str<strong>on</strong>g>functi<strong>on</strong>s</str<strong>on</strong>g> (Gaussian<br />
<str<strong>on</strong>g>functi<strong>on</strong>s</str<strong>on</strong>g> are standard)<br />
� <str<strong>on</strong>g>Rys</str<strong>on</strong>g> <str<strong>on</strong>g>Quadrature</str<strong>on</strong>g> – proposed by D<str<strong>on</strong>g>up</str<strong>on</strong>g>ius,<br />
<str<strong>on</strong>g>Rys</str<strong>on</strong>g>, King (DRK)<br />
� Numerical Gaussian quadrature based <strong>on</strong> a<br />
set <str<strong>on</strong>g>of</str<strong>on</strong>g> orthog<strong>on</strong>al <str<strong>on</strong>g>Rys</str<strong>on</strong>g> polynomials<br />
� Numerically stable, low memory foot print<br />
� Amenable for GPUs and architectures with<br />
smaller caches<br />
ERI Inputs<br />
bra ket<br />
shell a shell b shell c shell d<br />
CGTO CGTO CGTO CGTO<br />
x y z<br />
a x a y a z<br />
C<strong>on</strong>tracti<strong>on</strong><br />
Coefficients<br />
Gaussian Exp<strong>on</strong>ents<br />
Relevant <str<strong>on</strong>g>to</str<strong>on</strong>g> computati<strong>on</strong>s<br />
Lower order <str<strong>on</strong>g>functi<strong>on</strong>s</str<strong>on</strong>g> are<br />
typically c<strong>on</strong>tracted
X �� ��<br />
( r ��<br />
r )<br />
2<br />
A B<br />
r �� ( �� r ����<br />
r ) / A<br />
A i i j j<br />
r �� ( �� r ����<br />
r ) / B<br />
B k k l l<br />
� Two electr<strong>on</strong> integral is expressed as<br />
1<br />
where , 2m2 F exp( ) and<br />
m(X)=<br />
�t<br />
�Xt<br />
dt<br />
0<br />
( ij |kl ) = C F ( X )<br />
� X depends <strong>on</strong> exp<strong>on</strong>ents, centers and is independent <str<strong>on</strong>g>of</str<strong>on</strong>g> angular<br />
momenta<br />
2<br />
1<br />
X ��( rA �rB)<br />
r �( � r ��<br />
r ) / A<br />
A<br />
i i j j<br />
r �( � r ��<br />
r ) / B<br />
B k k l l<br />
L= L a+L b+L c+Ld<br />
2<br />
� ( ij |kl ) = �exp(<br />
�Xt<br />
) PL ( t) dt , where PL(t) is polynomial <str<strong>on</strong>g>of</str<strong>on</strong>g> degree L in t2. Evaluated<br />
0<br />
N<br />
using N-point quadrature and hence ( ij |kl ) = � W�PL( t�)<br />
where N = L / 2+ 1<br />
� Using separati<strong>on</strong> <str<strong>on</strong>g>of</str<strong>on</strong>g> variables, P L(t) which is integral over dr 1dr 2 , can be<br />
written as a product <str<strong>on</strong>g>of</str<strong>on</strong>g> three (2-D) integrals over dx 1dx 2, dr 1dr 2, dz 1dz 2<br />
1/2<br />
� ( ij |kl ) = 2 ( �/ � ) � I x(t � )I y(t � )I z(t<br />
� )W�<br />
and I q( = x,y,z) (N, 0: L a, 0: L b, 0: Lc,0: Ld<br />
)<br />
�<br />
� Ix, Iy, Iz are computed using recurrence and transfer relati<strong>on</strong>s<br />
L<br />
�<br />
m�0<br />
� � AB / ( A �B)<br />
A �� ��<br />
i j<br />
B �� ��<br />
k l<br />
��1<br />
m m
<str<strong>on</strong>g>Rys</str<strong>on</strong>g> <str<strong>on</strong>g>Quadrature</str<strong>on</strong>g> Algorithm<br />
for all l do<br />
for all k do<br />
for all j do<br />
for all i do<br />
end for<br />
end for<br />
end for<br />
end for<br />
I( i, j, k, l) � � I x( �, ix, jx, kx, lx) I y( �, iy, jy, ky , ly ) I z( �,<br />
iz, jz, kz, lz<br />
)<br />
�<br />
� Summati<strong>on</strong> over the roots over all the intermediate 2-D integrals<br />
� floating point operati<strong>on</strong>s =<br />
3* N *<br />
La�1Lb�1Lc�1Ld �1<br />
2 2 2 2<br />
� �� �� �� �<br />
�� �� �� �� �� �� �� ��<br />
� �� �� �� �<br />
� Recurrence, transfer and roots have predictable memory access<br />
patterns, fewer flops. <str<strong>on</strong>g>Quadrature</str<strong>on</strong>g> step is the main focus here.
� Example: (dd|dd) ERI block<br />
� L a = L b = L c = L d = 2<br />
� Number <str<strong>on</strong>g>of</str<strong>on</strong>g> roots, N = 5<br />
� ERI size = 6 4 = 1296 elements<br />
� Intermediate 2-D integrals Ix, Iy, Iz size:3 4 *5 = 245<br />
Possible Optimizati<strong>on</strong>s<br />
� ERI computati<strong>on</strong>s are memory bound, hence optimize memory accesses<br />
� Intermediate 2-D integrals are reused multiple times <str<strong>on</strong>g>to</str<strong>on</strong>g> c<strong>on</strong>struct<br />
different ERI elements.<br />
� Generate the different combinati<strong>on</strong>s au<str<strong>on</strong>g>to</str<strong>on</strong>g>matically
SP1<br />
Registers<br />
SP5<br />
Registers<br />
SP2<br />
Registers<br />
SP6<br />
Registers<br />
Multithreaded Instructi<strong>on</strong> Unit<br />
SP3<br />
Registers<br />
SP7<br />
Registers<br />
Shared Memory<br />
C<strong>on</strong>stant Cache<br />
Texture Cache<br />
SP4<br />
Registers<br />
SP8<br />
Registers<br />
Device Memory<br />
SM – Streaming Multiprocessor<br />
SP – Scalar Processor Core<br />
SFU – Special Functi<strong>on</strong>al Unit<br />
DP – Double Precisi<strong>on</strong> Unit<br />
Symmetric Multiprocessor 2<br />
Symmetric Multiprocessor 1<br />
SFU<br />
SFU<br />
Symmetric Multiprocessor N<br />
DP unit<br />
Thread<br />
(0,0)<br />
Thread<br />
(0,1)<br />
Thread<br />
(0,2)<br />
Block<br />
(0,0)<br />
Block<br />
(0,1)<br />
Thread<br />
(1,0)<br />
Block<br />
(1,0)<br />
Block<br />
(1,1)<br />
Grid <str<strong>on</strong>g>of</str<strong>on</strong>g> Blocks<br />
Thread<br />
(1,1)<br />
Thread<br />
(1,2)<br />
Thread<br />
(2,0)<br />
Thread<br />
(2,1)<br />
Thread<br />
(2,2)<br />
Thread Block<br />
Block<br />
(2,0)<br />
Block<br />
(2,1)<br />
Thread<br />
(3,0)<br />
Thread<br />
(3,1)<br />
Thread<br />
(3,2)<br />
Thread<br />
(4,0)<br />
Thread<br />
(4,1)<br />
Thread<br />
(4,2)
� Since 2-D integrals are reused multiple times, load them in<str<strong>on</strong>g>to</str<strong>on</strong>g><br />
shared memory<br />
� However, shared memory access , synchr<strong>on</strong>izati<strong>on</strong> limited <str<strong>on</strong>g>to</str<strong>on</strong>g> thread block<br />
boundaries<br />
� ERI block should be mapped <strong>on</strong><str<strong>on</strong>g>to</str<strong>on</strong>g> a single thread block<br />
� Is it possible <str<strong>on</strong>g>to</str<strong>on</strong>g> map all the ERI elements <str<strong>on</strong>g>to</str<strong>on</strong>g> individual threads in a block ?<br />
� The answer depends <strong>on</strong> the ERI block under c<strong>on</strong>siderati<strong>on</strong><br />
� For a (dd|dd) ERI block, ERI size = 6 4 = 1296 elements<br />
� Maximum <str<strong>on</strong>g>of</str<strong>on</strong>g> 512 or 768 threads per block<br />
� Map i, j, k indices corresp<strong>on</strong>ding <str<strong>on</strong>g>to</str<strong>on</strong>g> the three shells <str<strong>on</strong>g>of</str<strong>on</strong>g> the block <str<strong>on</strong>g>to</str<strong>on</strong>g> unique<br />
threads and iterate over the l index<br />
� Thread blocks are three dimensi<strong>on</strong>al, the mapping <str<strong>on</strong>g>of</str<strong>on</strong>g> i, j, k is natural<br />
� For (ff|ff) ERI block, ERI size = 10 4 = 1000 elements<br />
� Map i, j indices corresp<strong>on</strong>ding <str<strong>on</strong>g>to</str<strong>on</strong>g> the first two shells <str<strong>on</strong>g>of</str<strong>on</strong>g> the block <str<strong>on</strong>g>to</str<strong>on</strong>g> unique<br />
threads and iterate over the l index
CUDA <str<strong>on</strong>g>Rys</str<strong>on</strong>g> quadrature: i, j, k mapping<br />
# map threads <str<strong>on</strong>g>to</str<strong>on</strong>g> ERI elements<br />
I = threadIdx.x, j = threadIdx.y, k = threadIdx.z<br />
# arrays LX, LY, LZ map <str<strong>on</strong>g>functi<strong>on</strong>s</str<strong>on</strong>g> <str<strong>on</strong>g>to</str<strong>on</strong>g> exp<strong>on</strong>ents<br />
(ix, iy, iz) � (LX[i], LY[i], LZ[i])<br />
(jx, jy, jz) � (LX[j], LY[j], LZ[j])<br />
(kx, ky, kz)� (LX[k], LY[k], LZ[k])<br />
for all l do<br />
syncthreads<br />
## load the 2-D integrals <str<strong>on</strong>g>to</str<strong>on</strong>g> shmem<br />
I x, shmem � Ix(:,:,:,LX[l])<br />
I y, shmem �Iy(:,:,:,LX[l])<br />
I z, shmem � Iz(:,:,:,LX[l])<br />
syncthreads<br />
I(i, j, k, l) �<br />
end for<br />
�<br />
N<br />
I I I<br />
x, shmem y, shmem z, shmem<br />
Further optimizati<strong>on</strong>s<br />
� (dd|dd) case<br />
� I {x,y,z},shmem = 5(3 3 ) = 135<br />
elements per 2-D block<br />
� Across iterati<strong>on</strong>s, some <str<strong>on</strong>g>of</str<strong>on</strong>g> the<br />
elements in shared memory can<br />
be reused<br />
d-shell<br />
d x 2 , dy 2 , dz 2 , dxy, d xz, d yz � 18 loads<br />
d y 2 , dz 2 , dyz, d xy, d xz, d x 2 � 13 loads<br />
I x 0* 0 0 1* 1 2*<br />
I y 2* 0* 1* 1 0* 0<br />
I z 0* 2* 1* 0* 1* 0*
CUDA <str<strong>on</strong>g>Rys</str<strong>on</strong>g> quadrature: i, j mapping<br />
# map threads <str<strong>on</strong>g>to</str<strong>on</strong>g> ERI elements<br />
I = threadIdx.x, j = threadIdx.y<br />
# arrays LX, LY, LZ map <str<strong>on</strong>g>functi<strong>on</strong>s</str<strong>on</strong>g> <str<strong>on</strong>g>to</str<strong>on</strong>g> exp<strong>on</strong>ents<br />
(ix, iy, iz) � (LX[i], LY[i], LZ[i])<br />
(jx, jy, jz) � (LX[j], LY[j], LZ[j])<br />
for all klz-block do<br />
syncthreads<br />
Iz, shmem � Iz(:,:,LZ[k],LZ[l])<br />
## load 2-D integrals <str<strong>on</strong>g>to</str<strong>on</strong>g> shmem<br />
for all klxy klz-block do<br />
syncthreads<br />
Ix, shmem � Ix(:, :, LX[k], LX[l])<br />
Iy, shmem � Iy(:, :, LY[k], LX[l])<br />
syncthreads<br />
I(i, j, k, l) � � I x, shmemI y, shmemI z, shmem<br />
end for N<br />
end for<br />
Further optimizati<strong>on</strong>s<br />
� (ff|ff) case<br />
� I {x,y,z},shmem = 7(4 2 ) = 112<br />
elements per 2-D block<br />
� 10 <str<strong>on</strong>g>functi<strong>on</strong>s</str<strong>on</strong>g> in the f-shell<br />
� Reorder them ( next slide)
3<br />
0<br />
0<br />
2<br />
2<br />
1<br />
1<br />
1<br />
0<br />
0<br />
3<br />
2<br />
1<br />
0<br />
0<br />
1<br />
2<br />
1<br />
0<br />
0<br />
X Y Z<br />
3 0 0 2 2 1 1 1 0 0<br />
3 2 1 0 0 1 2 1 0 0<br />
0<br />
3<br />
0<br />
1<br />
0<br />
2<br />
0<br />
1<br />
2<br />
1<br />
0 3 0 1 0 2 0 1 2 1<br />
f x 3 , fy 3 , fz 3 , fx 2 y, f x 2 z, f xy 2 , fxz 2 , fxyz, f y 2 z, f yz 2<br />
0<br />
1<br />
2<br />
3<br />
2<br />
1<br />
0<br />
0<br />
1<br />
0<br />
0 1 2 3 2 1 0 0 1 0<br />
f x 3 , fx 2 y , f xy 2 fy 3 , fy 2 z , f xyz , f x 2 z, f xz 2 , fyz 2 , fz 3<br />
0<br />
0<br />
3<br />
0<br />
1<br />
0<br />
2<br />
1<br />
1<br />
2<br />
0<br />
0<br />
0<br />
0<br />
1<br />
1<br />
1<br />
2<br />
2<br />
3<br />
0 0 3 0 1 0 2 1 1 2<br />
0 0 0 0 1 1 1 2 2 3
� Number <str<strong>on</strong>g>of</str<strong>on</strong>g> registers per thread, shared memory per thread<br />
block limits the thread blocks that can be assigned per SM<br />
� Loops implemented directly result in high register usage<br />
� Explicitly unroll the loops. How ? Manually it’s tedious and<br />
error-pr<strong>on</strong>e<br />
� Use a comm<strong>on</strong> template and generate all the cases<br />
� Pyth<strong>on</strong> based Cheetah template engine is used- reuse<br />
existing Pyth<strong>on</strong> utilities and program s<str<strong>on</strong>g>up</str<strong>on</strong>g>port modules<br />
easily.
ERI blocks flop count GFLOPS SP 3 GFLOPS DP 4<br />
map 5 i jk map 5 i j map 5 i jk map 5 i j<br />
(gg|gg) 2000 2733750000 n/a 45.23 n/a 22.55<br />
(gg|f f) 4000 2160000000 n/a 34.42 n/a 15.32<br />
(f f |gg) 4000 2160000000 n/a 30.91 n/a 14.11<br />
(gg|dd) 10000 1701000000 n/a 43.08 n/a 21.05<br />
(gg|pp) 40000 1458000000 n/a 36.53 n/a 17.08<br />
(pp|gg) 40000 1458000000 34.23 6.93 18.20 5.38<br />
(f f |f f) 10000 2100000000 n/a 40.43 n/a 20.11<br />
(f f |dd) 20000 1296000000 n/a 37.54 n/a 18.29<br />
(dd|f f ) 20000 1296000000 37.69 23.32 16.53 15.04<br />
(f f |pp) 80000 1080000000 27.43 31.46 15.23 17.05<br />
(pp|f f) 80000 1080000000 32.23 6.21 17.45 4.84<br />
(dd|dd) 60000 1166400000 31.10 20.17 16.38 13.67<br />
� ERIs with odd number <str<strong>on</strong>g>of</str<strong>on</strong>g> roots have<br />
maximum performance over the<br />
even roots<br />
� Odd roots - (gg|gg), (gg|dd),<br />
and (ff|ff) cases<br />
� Even roots – (ff|gg), (gg|ff),<br />
and (dd|gg)<br />
� The difference is as high as 25%<br />
� Difference in the single and double<br />
precisi<strong>on</strong> is roughly a fac<str<strong>on</strong>g>to</str<strong>on</strong>g>r <str<strong>on</strong>g>of</str<strong>on</strong>g> two<br />
� Larger ijk mapping perform better<br />
than the ij mappings
ERI blocks flop count GFLOPS SP 3 GFLOPS DP 4<br />
GTX 275 Tesla GTX 275 Tesla<br />
(gg|gg) 2000 2733750000 45.23 55.97 22.55 27.34<br />
(gg|f f ) 4000 2160000000 34.42 42.07 15.32 18.67<br />
( f f|gg) 4000 2160000000 30.91 37.70 14.11 17.19<br />
(gg|dd) 10000 1701000000 43.08 53.39 21.05 25.34<br />
(dd|gg) 10000 1701000000 23.63 24.03 16.35 29.88<br />
(gg|pp) 40000 1458000000 36.53 45.15 17.08 20.65<br />
(pp|gg) 40000 1458000000 34.23 42.42 18.20 22.09<br />
( f f|f f) 10000 2100000000 40.43 50.19 20.11 24.46<br />
( f f|dd) 20000 1296000000 37.54 46.15 18.29 22.44<br />
(dd|f f) 20000 1296000000 37.69 45.71 16.53 19.71<br />
( f f|pp) 80000 1080000000 31.46 39.38 17.05 20.10<br />
( pp|f f ) 80000 1080000000 32.23 40.33 17.45 21.46<br />
(dd|dd) 60000 1166400000 31.10 38.74 16.38 19.78<br />
Inferences<br />
� Performance depends <strong>on</strong> the ERI<br />
class under evaluati<strong>on</strong> and hence<br />
also <strong>on</strong> the mapping ( i,j,k vs. i,j)<br />
� Difference between single and<br />
double precisi<strong>on</strong> performance is<br />
roughly a fac<str<strong>on</strong>g>to</str<strong>on</strong>g>r <str<strong>on</strong>g>of</str<strong>on</strong>g> two<br />
� Difference between the GTX and<br />
Tesla T is roughly 30% ( c<strong>on</strong>sistent<br />
with the clock speeds)<br />
� In terms <str<strong>on</strong>g>of</str<strong>on</strong>g> register and shared<br />
memory usage both are identical
� <str<strong>on</strong>g>Rys</str<strong>on</strong>g>q quadrature <str<strong>on</strong>g>implementati<strong>on</strong></str<strong>on</strong>g> performance results are comparable or<br />
better than DGEMV BLAS routines.<br />
� Some more improvements are possible by caching (texture, c<strong>on</strong>stant)<br />
and also by more aggressive memory reuse possibly at the expense <str<strong>on</strong>g>of</str<strong>on</strong>g> recomputati<strong>on</strong><br />
� Very easy <str<strong>on</strong>g>to</str<strong>on</strong>g> generate the possible ERI shell combinati<strong>on</strong>s using a single<br />
template<br />
� Explicit unrolling can be c<strong>on</strong>trolled at different levels such as shells, roots<br />
<str<strong>on</strong>g>to</str<strong>on</strong>g> test for performance improvements<br />
� Being developed as a standal<strong>on</strong>e library and applicati<strong>on</strong> agnostic
� ERIs are 4-dimensi<strong>on</strong>al, hence it is very expensive <str<strong>on</strong>g>to</str<strong>on</strong>g> transfer them <str<strong>on</strong>g>to</str<strong>on</strong>g> the<br />
host memory after computati<strong>on</strong>.<br />
� Fock matrix is 2-dimensi<strong>on</strong>al. So, c<strong>on</strong>sume the ERI’s as they are formed<br />
<str<strong>on</strong>g>to</str<strong>on</strong>g> build the Fock matrix<br />
� Handle the c<strong>on</strong>tracted ERI’s<br />
� Mixed precisi<strong>on</strong> s<str<strong>on</strong>g>up</str<strong>on</strong>g>port<br />
� A complete working SCF algorithm
1) <str<strong>on</strong>g>Rys</str<strong>on</strong>g>, J.; D<str<strong>on</strong>g>up</str<strong>on</strong>g>uis, M.; King, H. J. Comput. Phys. 1976, 21, 144.<br />
2) Boys, S.F. Proc. R. Soc 1950, 200, 542.<br />
3) <str<strong>on</strong>g>Rys</str<strong>on</strong>g>, J.; D<str<strong>on</strong>g>up</str<strong>on</strong>g>uis, M.; King, H. J. Comput. Chem. 1983, 4, 154–157.<br />
4) Gord<strong>on</strong>, M. S.; Schmidt, M. W. Advances in electr<strong>on</strong>ic structure theory:<br />
GAMESS a decade later. In Theory and Applicati<strong>on</strong>s <str<strong>on</strong>g>of</str<strong>on</strong>g> Computati<strong>on</strong>al<br />
Chemistry: the first forty years;<br />
Dykstra, C. E.; Frenking, G.; Kim, K. S.; Scuseria, G. E., Eds.; Elsevier:<br />
Amsterdam, 2005.<br />
5) Ufimtsev, I. S.; Martinez, T. J. J. Chem. Theory Comput. 2008, 4, 222–231.<br />
6) Ufimtsev, I. S.; Martinez, T. J. J. Chem. Theory Comput. 2009, 5, 1004–1015.<br />
7) Yasuda, K. Journal <str<strong>on</strong>g>of</str<strong>on</strong>g> Computati<strong>on</strong>al Chemistry 2008, 29, 334-342.
US Department <str<strong>on</strong>g>of</str<strong>on</strong>g> Energy<br />
Department <str<strong>on</strong>g>of</str<strong>on</strong>g> Defense - DURIP Grant<br />
Ames Labora<str<strong>on</strong>g>to</str<strong>on</strong>g>ry, Iowa State University<br />
Air Force Office <str<strong>on</strong>g>of</str<strong>on</strong>g> Scientific Research<br />
Nati<strong>on</strong>al Science Foundati<strong>on</strong> - Petascale Applicati<strong>on</strong>s grant<br />
NVIDIA Corporati<strong>on</strong><br />
Pr<str<strong>on</strong>g>of</str<strong>on</strong>g>essor Todd Martinez and his gro<str<strong>on</strong>g>up</str<strong>on</strong>g><br />
asadchev@gmail.com<br />
jfelder@iastate.edu<br />
allada.v@gmail.com<br />
mark@ si.msg.chem.iastate.edu<br />
theresa@fi.ameslab.gov<br />
brett@si.msg.chem.iastate.edu