Software Development on Quest
Stuck? Need help?
quest-help@northwestern.edu
Quest
Software Development on Quest
Compilers and Single Core Optimization
Pradeep Sivakumar
pradeep-sivakumar@northwestern.edu
Contents
• Compilers Introduction
• Compilers and parallel libraries available on Quest
• Basic compiler options
• Multi-core and Multi-process compilation
• Building programs using MPI and OpenMP
• Single Core Optimization
• Optimization level (-On)
• fast option
• Inter-procedural optimization
• Inlining
• Architecture specification
• Profile guided optimization
• Best practices
Compilers/Parallel libraries on Quest
• Compilers
– GNU (gcc/gfortran): 4.3.5, 4.6.1
– Intel (icc/ifort): 11.1, Composer XE 2011
• Parallel libraries
– OpenMPI-GNU: 1.4.0, 1.4.2, and 1.4.3
– OpenMPI-Intel: 1.4.0, 1.4.2, and 1.4.3
Basic compiler options
• How to invoke the compiler
– Directly from the command line
icc -O2 myprogram.c
– Indirectly from the command line using a Makefile
CC = icc
CFLAGS = -O2
LIBS =
SOURCES = myprogram.c
OBJECTS = myprogram.o

all: myprogram

# Link step; myprogram.o is built from myprogram.c by make's built-in rule
myprogram: $(OBJECTS)
	$(CC) $(CFLAGS) -o myprogram $(OBJECTS) $(LIBS)

clean:
	rm -f *.o
Basic compiler options
• icc or gcc [options] filenames [libraries]...
– filenames is the list of source files and object files passed to the compiler.
• -o: name of the executable.
• -g: include debugging information.
• -L: add a directory to the library search path (use -l to link with a library).
• -I: add a directory to the search path for include files.
• -static: link statically, so the executable does not depend on shared libraries at run time.
• -multiple-processes[=n]: (Intel only) creates multiple processes to compile a large number of source files at the same time.
GNU make
• Executes commands from the Makefile. By default it looks for GNUmakefile, makefile, and Makefile, in that order.
– To use a non-default filename: make -f Makefile.custom
• Uses file timestamps to decide which source files have been updated and which targets need to be rebuilt.
– When in doubt, use make clean to delete all *.o files and rebuild.
• To execute make in parallel:
– make -j [n]
– Runs up to n jobs simultaneously; output from concurrent jobs may be interleaved.
– Rule of thumb: use multiple cores when you need to compile faster on Quest, but use fewer cores when the node is heavily loaded.
Building programs with MPI
• Build MPI programs using OpenMPI's wrapper compilers (available for C, C++, FORTRAN77, and FORTRAN90)
% mpicc -o first first.c
% mpif90 -o first first.f90
• OpenMPI wrappers:
– mpicc (C), mpicxx (C++), mpif77 (FORTRAN77), mpif90 (FORTRAN90)
OpenMP
• API for writing multi-threaded applications.
o Explicit parallelism.
• Use the following compiler flags to "turn on" OpenMP compilation:
– GNU: -fopenmp
– Intel: -openmp
• Set OMP_NUM_THREADS prior to program execution to a value no greater than the number of available cores on the target platform.
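As an illustration of the explicit parallelism mentioned above, here is a minimal C sketch of an OpenMP loop. The function names dot and max_threads are hypothetical and chosen only for this example. The code also compiles without the OpenMP flag, in which case the pragma is ignored and the loop runs on one thread.

```c
#include <assert.h>
#include <stddef.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Dot product of two vectors (hypothetical example function).
   The loop iterations are shared among threads when the file is
   compiled with -fopenmp (GNU) or -openmp (Intel). */
double dot(const double *a, const double *b, int n)
{
    double sum = 0.0;
    int i;
    assert(a != NULL && b != NULL && n >= 0);
    /* Each thread accumulates a private partial sum; the reduction
       clause adds the partial sums together when the loop ends. */
#pragma omp parallel for reduction(+:sum)
    for (i = 0; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}

/* Upper bound on the threads a parallel region may use;
   returns 1 when the file is compiled without OpenMP. */
int max_threads(void)
{
#ifdef _OPENMP
    return omp_get_max_threads();
#else
    return 1;
#endif
}
```

With OMP_NUM_THREADS set to 4 before the program starts, max_threads() would normally report 4 and the loop in dot would be split across up to four threads.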
Single core optimization
• Iterative process.
• Entirely application dependent.
• Different ways to achieve it:
– Compiler options
– Performance libraries
– Code optimizations after identifying and modifying hotspots
Single core optimization
• Compilers can perform significant optimization if:
o The code is structured to make apparent what the compiler should do
o The code uses simple language constructs (e.g. avoids pointers)
• Use the latest compilers
o Check compiler options
o Look for architecture-specific options
• Experiment with different options
• Some routines may need routine-specific options
Ideal for large codes with many branches that are less computationally intensive.
Ideal for codes containing short loops that are executed regularly.
Optimization level (-On)
ref: Intel optimization manual
• -O0: no optimizations; use for debugging.
• -O1: optimize for speed, but disable optimizations that increase code size.
o Algebraic identity removal
o Common subexpression elimination
o Constant folding
o Redundant load and store elimination
• -O2: perform the optimizations that the compiler considers the best combination of compilation speed and runtime performance (for example, software pipelining).
o Inlining
o Constant propagation
o Loop unrolling
o Vectorization
o Strength reduction
o Dead code elimination
o Global register allocation
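To make two of the listed transformations concrete, the following C sketch applies common subexpression elimination and strength reduction by hand. It shows roughly what an optimizer may do and is not the output of any particular compiler. The function names sum_naive and sum_transformed are hypothetical.

```c
#include <assert.h>

/* As written by the programmer: x*y appears twice per iteration and
   the index i*stride costs a multiplication on every iteration. */
long sum_naive(const long *v, int n, int stride, long x, long y)
{
    long total = 0;
    int i;
    assert(v != 0 && n >= 0 && stride > 0);
    for (i = 0; i < n; i++)
        total += v[i * stride] * (x * y) + (x * y);
    return total;
}

/* Hand-applied equivalent of two optimizations:
   - common subexpression elimination: x*y is computed once;
   - strength reduction: the multiply i*stride becomes a running
     index that is advanced by addition. */
long sum_transformed(const long *v, int n, int stride, long x, long y)
{
    const long xy = x * y;
    long total = 0;
    int i, idx = 0;
    assert(v != 0 && n >= 0 && stride > 0);
    for (i = 0; i < n; i++) {
        total += v[idx] * xy + xy;
        idx += stride;
    }
    return total;
}
```

Both functions return the same value for the same input; the second simply does less work per iteration, which is the effect the compiler aims for at -O1 and -O2.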
Loop unrolling
Original loop:
do i=1,n
  A(i)=A(i) + B(i)*C
end do
Unrolled by four:
do i=1,n,4
  A(i)=A(i) + B(i)*C
  A(i+1)=A(i+1) + B(i+1)*C
  A(i+2)=A(i+2) + B(i+2)*C
  A(i+3)=A(i+3) + B(i+3)*C
end do
Unrolling performs more operations per iteration inside the loop, which reduces loop overhead. As written, the unrolled loop assumes n is a multiple of 4; otherwise the leftover iterations need a separate cleanup loop.
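The same idea can be sketched in C, this time with a cleanup loop so that any n is handled. This is a hand-written illustration of what the compiler may generate, and the function names axpy_simple and axpy_unrolled are hypothetical.

```c
#include <assert.h>

/* Straightforward loop: a[i] = a[i] + b[i]*c */
void axpy_simple(double *a, const double *b, double c, int n)
{
    int i;
    assert(n >= 0);
    for (i = 0; i < n; i++)
        a[i] += b[i] * c;
}

/* Same computation unrolled by four. The main loop handles groups of
   four elements; the second loop handles the 0 to 3 leftovers. */
void axpy_unrolled(double *a, const double *b, double c, int n)
{
    int i;
    assert(n >= 0);
    for (i = 0; i + 3 < n; i += 4) {
        a[i]     += b[i]     * c;
        a[i + 1] += b[i + 1] * c;
        a[i + 2] += b[i + 2] * c;
        a[i + 3] += b[i + 3] * c;
    }
    for (; i < n; i++)
        a[i] += b[i] * c;
}
```

In practice it is usually better to write the simple loop and let the compiler unroll it at -O2 or -O3; the manual version is shown only to make the transformation visible.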
Best for codes that are floating point intensive with large loops.
O3 - pros and cons
– GNU: -O3
– Intel: -O3
– Description: aggressive optimization
• Performs additional optimizations that are memory intensive, compile-time intensive, or both.
o Scalar replacement
o Cache blocking
o Prefetching
o Loop and memory access transformations
• Pros
o The compiler may produce faster code.
• Cons
o Code size may bloat.
o Compilation may take more time.
o May change program semantics and results, and can sometimes break code.
Useful compiler options on Quest
– GNU: -msse4.2
– Intel: -xSSE4.2
– Description: tells the compiler to generate code specialized for Intel Nehalem
• SSE = Streaming SIMD Extensions
• SSE instructions apply one operation to several data elements at once, so independent operations execute together and more work is done per clock cycle.
• These flags direct the compiler to use the most advanced instruction set available for the target architecture.
-fast (Intel only)
• GNU: partially included in -O3
• Intel: equivalent to -O3 -ipo -static -xHost -no-prec-div
o -no-prec-div: replaces floating point division with multiplication by the reciprocal (A/B = A*(1/B)), which is faster but slightly less precise.
o -xHost: generates binaries that are optimized for the architecture of the machine doing the compilation.
o -ipo: enables multifile inlining, constant propagation, code placement (i.e. function layout), dead code elimination and data placement. It must be given both when compiling and when linking.
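The effect of replacing division by reciprocal multiplication can be sketched in C as follows. This is a hand-written illustration of the A/B = A*(1/B) rewrite, not compiler output, and the function names scale_div, scale_recip, and max_rel_diff are hypothetical.

```c
#include <assert.h>

/* One floating point division per element. */
void scale_div(double *out, const double *a, double b, int n)
{
    int i;
    assert(b != 0.0 && n >= 0);
    for (i = 0; i < n; i++)
        out[i] = a[i] / b;
}

/* One division in total, then one multiplication per element.
   Each result is rounded twice (once for 1/b, once for the product),
   so it can differ from a[i]/b in the last bits. */
void scale_recip(double *out, const double *a, double b, int n)
{
    const double r = 1.0 / b;
    int i;
    assert(b != 0.0 && n >= 0);
    for (i = 0; i < n; i++)
        out[i] = a[i] * r;
}

/* Largest relative difference between two result arrays;
   assumes every p[i] is nonzero. */
double max_rel_diff(const double *p, const double *q, int n)
{
    double worst = 0.0;
    int i;
    for (i = 0; i < n; i++) {
        double d = p[i] - q[i];
        double m = p[i];
        assert(m != 0.0);
        if (d < 0.0) d = -d;
        if (m < 0.0) m = -m;
        if (d / m > worst) worst = d / m;
    }
    return worst;
}
```

The two versions agree to within a few units in the last place, which is why the rewrite is acceptable for many codes but is not enabled by default.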
Inlining
• What is inlining?
o Replacing a function call with the code from the function.
o Eliminates the cost of the function call and return, and can improve instruction cache behavior.
• When is inlining important?
o When the function is a hot spot.
o When the call-overhead to work ratio is high.
o When it can benefit from interprocedural optimization.
• Use -ipo or -ip to allow the compiler to inline.
Inlining
Before inlining:
program MAIN
  integer, parameter :: ndim=2, niter=1000000
  real*8 :: x(ndim), x0(ndim), r
  real*8 :: dist          ! result type of the external function
  integer :: i
  x = 1.d0                ! give the arrays defined values
  x0 = 0.d0
  do i=1,niter
    r=dist(x,x0,ndim)
  end do
end program

real*8 function dist(x,x0,n)
  integer :: j, n
  real*8 :: x0(n), x(n), r
  r = 0.d0
  do j=1,n
    r=r+(x(j) - x0(j))**2
  end do
  dist=r
end function

After inlining:
program MAIN
  integer, parameter :: ndim=2, niter=1000000
  real*8 :: x(ndim), x0(ndim), r
  integer :: i, j
  x = 1.d0
  x0 = 0.d0
  do i=1,niter
    r=0.d0
    do j=1,ndim
      r=r+(x(j)-x0(j))**2
    end do
  end do
end program
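A C analogue of the same before and after pair is sketched below. The names dist2, accumulate_calls, and accumulate_inlined are hypothetical; dist2 mirrors the squared-distance function in the Fortran example. Marking the small function static inline makes it an easy inlining candidate for the compiler, while the second routine shows the result written out by hand.

```c
#include <assert.h>

/* Small function called from a hot loop: the call overhead is large
   relative to the work done, so it is a good inlining candidate. */
static inline double dist2(const double *x, const double *x0, int n)
{
    double r = 0.0;
    int j;
    for (j = 0; j < n; j++)
        r += (x[j] - x0[j]) * (x[j] - x0[j]);
    return r;
}

/* Before inlining: one function call per iteration. */
double accumulate_calls(const double *x, const double *x0, int n, int niter)
{
    double total = 0.0;
    int i;
    assert(n >= 0 && niter >= 0);
    for (i = 0; i < niter; i++)
        total += dist2(x, x0, n);
    return total;
}

/* After inlining (done by hand here): the body of dist2 replaces
   the call, so there is no call or return inside the loop. */
double accumulate_inlined(const double *x, const double *x0, int n, int niter)
{
    double total = 0.0;
    int i, j;
    assert(n >= 0 && niter >= 0);
    for (i = 0; i < niter; i++) {
        double r = 0.0;
        for (j = 0; j < n; j++)
            r += (x[j] - x0[j]) * (x[j] - x0[j]);
        total += r;
    }
    return total;
}
```

Both routines compute the same value; with inlining enabled the compiler is expected to turn the first into something close to the second without any source change.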
Interprocedural optimization
The interprocedural optimization process:
– Source files
– Compile with -ipo
– .o files with IL information
– Link with -ipo
– Executable
• What you should know about IPO: it extends compilation time and memory usage. -ip-no-inlining disables inlining.
Profile guided optimization
• Improves instruction cache usage
o Moves frequently accessed code segments adjacent to one another and moves seldom accessed code to the end of the module, shrinking the hot code size and eliminating branches.
• Increases application performance by improving branch prediction.
• Applications suited to PGO
o Applications containing several functions that are executed frequently.
• Uses profile feedback from a test run on data sets that represent a typical application usage pattern.
Profile guided optimization
[Figure: little benefit vs. significant benefit]
Profile guided optimization
– Step 1: Compile with PGO to produce an instrumented executable.
– Step 2: Run the instrumented application to produce dynamic information files.
– Step 3: Feedback compile with PGO to produce the profile-guided application.
Compiler flags for profile guided optimization:
– GNU: -fprofile-generate & -fprofile-use
– Intel: -prof-gen & -prof-use
Best Practices
Performance Libraries
• Optimized for specific architectures
• Use them when possible; remember that even a Numerical Recipes book does not provide an optimized algorithm
• Performance libraries available on Quest:
– ATLAS (3.8.3, 3.9.16, 3.9.24, 3.9.45)
– Intel MKL, which includes BLAS, LAPACK, FFTW interfaces, and VML (Composer XE 2011 and 10.2)
– GNU Scientific Library (1.13, 1.14)
Best practices
Software development
• Write code to be clear and concise, and make sure it is well commented.
• Portability
• Minimize the number of divisions
• Cache: exploit spatial locality
– Row-major in C
– Column-major in Fortran
• Minimize pointer arithmetic; avoid typecasting and conversions.
• Avoid branches, conditionals, and I/O within loops
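The spatial locality point can be illustrated with a short C sketch. C stores two-dimensional arrays row by row, so the traversal whose inner loop runs over the last index touches memory contiguously. The array size N and the function names sum_row_order and sum_column_order are assumptions made for this example.

```c
#include <assert.h>

#define N 64

/* Cache-friendly in C: the inner loop walks along a row, which is
   contiguous in memory, so each cache line fetched is fully used. */
double sum_row_order(double m[N][N])
{
    double s = 0.0;
    int i, j;
    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++)
            s += m[i][j];
    return s;
}

/* Cache-unfriendly in C: the inner loop jumps N doubles between
   consecutive accesses. (In Fortran, which is column-major, this
   loop order would be the contiguous one.) */
double sum_column_order(double m[N][N])
{
    double s = 0.0;
    int i, j;
    for (j = 0; j < N; j++)
        for (i = 0; i < N; i++)
            s += m[i][j];
    return s;
}
```

Both functions return the same sum; on large arrays the row-order version is typically faster in C because of better cache use, although the size of the gap depends on the hardware and on whether the compiler interchanges the loops itself.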
References
• Intel C++ compiler documentation:
http://software.intel.com/en-us/articles/intel-c-compilerprofessional-edition-for-linux-documentation/
• Intel FORTRAN compiler documentation:
http://software.intel.com/en-us/articles/intel-fortran-compilerprofessional-edition-for-linux-documentation/
• GCC online documentation:
http://gcc.gnu.org/onlinedocs/
• OpenMPI v1.4.3 documentation:
http://www.open-mpi.org/doc/v1.4/
• Optimization:
http://cache-www.intel.com/cd/00/00/27/66/276615_276615.pdf
http://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html#Optimize-Options
Questions/Comments?