12.06.2013 Views

Software Development on Quest


Stuck? Need help? quest-help@northwestern.edu

Quest
Software Development on Quest
Compilers and Single Core Optimization
Pradeep Sivakumar
pradeep-sivakumar@northwestern.edu



Contents
• Compilers Introduction
  – Compilers and parallel libraries available on Quest
  – Basic compiler options
  – Multi-core and multi-process compilation
  – Building programs using MPI and OpenMP
• Single Core Optimization
  – Optimization level (On)
  – fast option
  – Inter-procedural optimization
  – Inlining
  – Architecture specification
  – Profile guided optimization
• Best practices



Compilers/Parallel libraries on Quest
• Compilers
  – GNU (gcc/gfortran): 4.3.5, 4.6.1
  – Intel (icc/ifort): 11.1, Composer XE 2011
• Parallel libraries
  – OpenMPI-GNU: 1.4.0, 1.4.2, and 1.4.3
  – OpenMPI-Intel: 1.4.0, 1.4.2, and 1.4.3



Basic compiler options
• How to invoke the compiler
  – Directly from the command line
      icc -O2 myprogram.c
  – Indirectly from the command line using a Makefile

# Compiler, flags, and libraries
CC = icc
CFLAGS = -O2
LIBS =
SOURCES = myprogram.c
OBJECTS = myprogram.o

all: myprogram

# Link the object files; recipe lines must start with a tab character
myprogram: $(OBJECTS)
	$(CC) $(CFLAGS) -o $@ $(OBJECTS) $(LIBS)

clean:
	rm -f *.o



Basic compiler options
• icc or gcc [options] filenames [libraries]...
  – The compiler accepts a list of source files and object files, given by filenames.
• -o: name of the output executable.
• -g: generate debugging information.
• -L: add a directory to the library search path (use -l to link with a library).
• -I: add a directory to the search path for include files.
• -static: create a statically linked executable, which does not depend on shared libraries at run time.
• -multiple-processes[=n]: (Intel only) creates multiple processes to compile a large number of source files at the same time.



GNU make
• Executes commands from the Makefile. By default it looks for GNUmakefile, makefile, and Makefile, in that order.
  – To use a non-default filename: make -f Makefile.custom
• Uses file timestamps to decide which source files have been updated and which targets need to be rebuilt.
  – When in doubt, use make clean to delete all *.o files and rebuild.
• To execute make in parallel:
  – make -j [n]
  – Runs several jobs simultaneously; output from the parallel jobs may be interleaved.
  – Rule of thumb: use multiple cores when you need to compile faster on Quest, but use fewer cores when the node is heavily loaded.



Building programs with MPI
• Building MPI programs using OpenMPI's wrapper compilers (available for C, C++, FORTRAN77, and FORTRAN90)
    % mpicc -o first first.c
    % mpif90 -o first first.f90
• OpenMPI wrappers:
  – mpicc, mpif77, mpif90



OpenMP
• API for writing multi-threaded applications.
  o Explicit parallelism.
• Use the following compiler flags to "turn on" OpenMP compilation:
    GNU:   -fopenmp
    Intel: -openmp
• Set OMP_NUM_THREADS prior to program execution to a value no greater than the number of available cores on the target platform.
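As a minimal sketch of explicit parallelism, the C function below sums an array with an OpenMP parallel loop. The name `sum_array` is an illustrative assumption and does not come from the slides. Built with -fopenmp (GNU) or -openmp (Intel), the loop iterations are divided among the threads; built without the flag, the pragma is ignored and the loop runs serially.

```c
/* Hypothetical example: sum n doubles using an OpenMP reduction.
 * Each thread accumulates a private partial sum, and OpenMP
 * combines the partial sums when the loop finishes. */
double sum_array(const double *a, int n)
{
    double sum = 0.0;
    int i;

#pragma omp parallel for reduction(+:sum)
    for (i = 0; i < n; i++) {
        sum += a[i];
    }
    return sum;
}
```

In a bash shell, for instance, `export OMP_NUM_THREADS=4` before running the program would limit the loop to four threads.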



Single core optimization
• Iterative process.
• Entirely application dependent.
• Different ways to achieve it
  o Compiler options
  o Performance libraries
  o Code optimizations after identifying and modifying hotspots



Single core optimization
• Compilers can perform significant optimization if
  o Code is structured to make apparent what the compiler should do
  o Simple language constructs are used (e.g. avoid pointers)
• Use the latest compilers
  o Check compiler options
  o Look for architecture-specific options
• Experiment with different options
• May need routine-specific options


Optimization level (-On)
ref: Intel optimization manual
• -O0: no optimizations; use for debugging.
• -O1: optimize for speed, but disable optimizations which increase code size. Ideal for large codes with many branches that are less computationally intensive.
  o Algebraic identity removal
  o Common subexpression elimination
  o Constant folding
  o Redundant load and store elimination
• -O2: perform the optimizations that the compiler considers the best combination of compilation speed and runtime performance (example: software pipelining). Ideal for codes containing short loops which are executed regularly.
  o Inlining
  o Constant propagation
  o Loop unrolling
  o Vectorization
  o Strength reduction
  o Dead code elimination
  o Global register allocation
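To make two of the listed transformations concrete, the sketch below writes the same computation twice in C: once naively, and once the way a compiler might rewrite it after common subexpression elimination and strength reduction. The function names are hypothetical, and a real compiler applies these rewrites to its internal representation rather than to the source.

```c
/* Naive form: (a + b) is computed twice per iteration, and the
 * index i * stride needs a multiplication on every iteration. */
long weighted_sum_naive(const long *v, int n, int stride, long a, long b)
{
    long total = 0;
    int i;
    for (i = 0; i < n; i++) {
        total += v[i * stride] * (a + b) + (a + b);
    }
    return total;
}

/* Hand-optimized form of the same computation. */
long weighted_sum_opt(const long *v, int n, int stride, long a, long b)
{
    long total = 0;
    long ab = a + b;   /* common subexpression computed once */
    int idx = 0;       /* strength reduction: add instead of multiply */
    int i;
    for (i = 0; i < n; i++) {
        total += v[idx] * ab + ab;
        idx += stride;
    }
    return total;
}
```

Both functions return the same value for the same input; at -O1 or -O2 the compiler would normally be expected to produce similar machine code for either form.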


Loop unrolling
Original loop:
  do i=1,n
    A(i)=A(i) + B(i)*C
  end do
Unrolled by four:
  ! assumes n is a multiple of 4
  do i=1,n,4
    A(i)=A(i) + B(i)*C
    A(i+1)=A(i+1) + B(i+1)*C
    A(i+2)=A(i+2) + B(i+2)*C
    A(i+3)=A(i+3) + B(i+3)*C
  end do
Perform more operations per iteration inside the loop
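The unrolled Fortran loop above is only valid when n is a multiple of 4; otherwise it reads and writes past the end of the arrays. The C sketch below shows one common way to handle the general case, with a cleanup loop for the leftover elements. The function names `axpy_rolled` and `axpy_unrolled` are illustrative assumptions.

```c
/* Reference loop: a[i] = a[i] + b[i] * c for every element. */
void axpy_rolled(double *a, const double *b, double c, int n)
{
    int i;
    for (i = 0; i < n; i++) {
        a[i] += b[i] * c;
    }
}

/* Same computation unrolled by four, followed by a cleanup loop. */
void axpy_unrolled(double *a, const double *b, double c, int n)
{
    int i;
    int limit = n - (n % 4);   /* largest multiple of 4 not above n */

    for (i = 0; i < limit; i += 4) {
        a[i]     += b[i]     * c;
        a[i + 1] += b[i + 1] * c;
        a[i + 2] += b[i + 2] * c;
        a[i + 3] += b[i + 3] * c;
    }
    for (; i < n; i++) {       /* remaining 0 to 3 elements */
        a[i] += b[i] * c;
    }
}
```

In practice the compiler usually performs this transformation itself at -O2 or -O3, so manual unrolling is mainly useful for understanding what the option does.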


O3 - pros and cons
GNU     Intel     Description
-O3     -O3       Aggressive optimization
Best for codes which are floating point intensive with large loops.
• Performs additional optimizations that are memory intensive, compile-time intensive, or both.
  o Scalar replacement
  o Cache blocking
  o Prefetching
  o Loop and memory access transformations
• Pros
  o The compiler may produce faster code.
• Cons
  o Code size may bloat.
  o Compilation may take more time.
  o May change semantics and results, and can sometimes break code.



Useful compiler options on Quest
GNU        Intel       Description
-msse4.2   -mSSE4.2    Tells the compiler to generate code specialized for Intel Nehalem
• SSE = Streaming SIMD Extensions
• SSE instructions pipeline and simultaneously execute independent operations to get multiple operations per clock cycle
• Directs the compiler to use the most advanced instruction set available for the target architecture.



-fast (Intel only)
• GNU: partially included in -O3
• Intel: -O3 -ipo -static -xHOST -no-prec-div
  o no-prec-div: replaces floating point division with multiplication by the reciprocal (A/B = A*(1/B)), which is faster but slightly less precise.
  o xHOST: generates binaries which are optimized for the architecture of the machine doing the compilation.
  o ipo: enables multifile inlining, constant propagation, code placement (i.e. function layout), dead code elimination and data placement. Must be given both when compiling and when linking.
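The effect of no-prec-div can be imitated by hand. The C sketch below, with hypothetical function names, divides an array by a constant in two ways: with one division per element, and with a single reciprocal followed by multiplications, which is the A/B = A*(1/B) rewrite described above. The two versions can differ in the last bits of the result, which is the precision being traded for speed.

```c
/* Divide every element by d using one division per element. */
void scale_div(double *x, int n, double d)
{
    int i;
    for (i = 0; i < n; i++) {
        x[i] = x[i] / d;
    }
}

/* Same result up to rounding: one division in total,
 * then one multiplication per element. */
void scale_recip(double *x, int n, double d)
{
    double inv = 1.0 / d;
    int i;
    for (i = 0; i < n; i++) {
        x[i] = x[i] * inv;
    }
}
```

When d is a power of two the reciprocal is exact and both versions agree bit for bit; for a divisor such as 3.0 the results may differ by a rounding error.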



Inlining
• What is inlining?
  o Replacing a function call with the code from the function.
  o Eliminates the cost of the function call and return, and can improve instruction cache usage.
• When is inlining important?
  o When the function is a hot spot.
  o When the call-overhead to work ratio is high.
  o When it can benefit from interprocedural optimization.
• Use -ipo or -ip to allow the compiler to inline.


Inlining
Before inlining:
  program MAIN
    integer, parameter :: ndim=2, niter=1000000
    real*8 :: x(ndim), x0(ndim), r
    real*8 :: dist          ! return type of the external function
    integer :: i
    x = 1.d0; x0 = 0.d0     ! example values, not in the original slide
    do i=1,niter
      r=dist(x,x0,ndim)
    end do
  end program

  real*8 function dist(x,x0,n)
    integer :: j, n
    real*8 :: x0(n), x(n), r
    r = 0.
    do j=1,n
      r=r+(x(j) - x0(j))**2
    end do
    dist=r
  end function

After inlining:
  program MAIN
    integer, parameter :: ndim=2, niter=1000000
    real*8 :: x(ndim), x0(ndim), r
    integer :: i, j
    x = 1.d0; x0 = 0.d0     ! example values, not in the original slide
    do i=1,niter
      r=0.
      do j=1,ndim
        r=r+(x(j)-x0(j))**2
      end do
    end do
  end program
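The same idea can be sketched in C. This is a hypothetical translation of the Fortran example, with the invented names `dist2` and `repeat_dist2`. Marking the small function `static inline` keeps it in the same translation unit as its caller and hints that the compiler may expand the body at the call site, which is the transformation shown by hand above.

```c
/* Squared distance between two n-dimensional points.
 * "static inline" is only a hint: the compiler decides whether
 * to expand the body at the call site or emit a real call. */
static inline double dist2(const double *x, const double *x0, int n)
{
    double r = 0.0;
    int j;
    for (j = 0; j < n; j++) {
        r += (x[j] - x0[j]) * (x[j] - x0[j]);
    }
    return r;
}

/* Calls dist2 niter times and returns the last result. */
double repeat_dist2(const double *x, const double *x0, int n, int niter)
{
    double r = 0.0;
    int i;
    for (i = 0; i < niter; i++) {
        r = dist2(x, x0, n);
    }
    return r;
}
```

When the function lives in a different source file from its caller, the compiler cannot see the body at the call site, which is where an option such as -ipo becomes relevant.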



Interprocedural optimization
The interprocedural optimization process:
  Source files -> Compile with ipo -> .o files with IL information -> Link with ipo -> Executable
• What you should know about IPO: It extends compilation time and memory usage. -ip-no-inlining disables inlining.



Profile guided optimization
• Improves instruction cache usage
  o Moves frequently accessed code segments adjacent to one another and moves seldom accessed code to the end of the module, shrinking code size and eliminating branches.
• Increases application performance by improving branch prediction.
• Applications suited to PGO
  o Applications containing several functions which are executed frequently.
• Uses profile feedback from a test run on data sets that represent a typical application pattern.



Profile guided optimization
[Figure: little benefit vs. significant benefit]


Profile guided optimization
Step 1: Compile with PGO -> Instrumented executable
Step 2: Run instrumented application to produce dynamic information files
Step 3: Feedback compile with PGO -> Profile-guided application

GNU                                   Intel                    Description
-fprofile-generate & -fprofile-use    -prof-gen & -prof-use    Profile guided optimization



Best Practices
Performance Libraries
• Optimized for specific architecture
• Use them when possible; remember that even a numerical recipes book does not provide an optimized algorithm
• Performance libraries available on Quest:
  – ATLAS (3.8.3, 3.9.16, 3.9.24, 3.9.45)
  – Intel MKL, which includes BLAS, LAPACK, FFTW, VML (ComposerXE 2011 and 10.2)
  – GNU Scientific Library (1.13, 1.14)



Best practices
Software development
• Write code that is clear and concise, and make sure it is well commented.
• Portability
• Minimize the number of divisions
• Cache: make use of spatial locality
  – Row-major in C
  – Column-major in Fortran
• Minimize pointer arithmetic; avoid typecasting and conversions.
• Avoid branches, conditionals, and IO within loops
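The spatial locality point can be illustrated with a small C sketch. C stores two-dimensional arrays in row-major order, so the elements of one row are adjacent in memory. The array size and function names below are illustrative assumptions; the arrays are kept tiny so the code is easy to check, and the difference in speed only shows up once the array is much larger than the cache.

```c
#define ROWS 4
#define COLS 5

/* Row-major traversal: the inner loop walks consecutive addresses,
 * which matches how C lays out m in memory. */
double sum_row_order(double m[ROWS][COLS])
{
    double s = 0.0;
    int i, j;
    for (i = 0; i < ROWS; i++) {
        for (j = 0; j < COLS; j++) {
            s += m[i][j];
        }
    }
    return s;
}

/* Column-first traversal: the inner loop jumps COLS elements
 * on each step, so consecutive accesses are not adjacent. */
double sum_col_order(double m[ROWS][COLS])
{
    double s = 0.0;
    int i, j;
    for (j = 0; j < COLS; j++) {
        for (i = 0; i < ROWS; i++) {
            s += m[i][j];
        }
    }
    return s;
}
```

In Fortran the situation is reversed: arrays are column-major, so the first index should vary fastest in the innermost loop.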



References
• Intel C++ compiler documentation:
  http://software.intel.com/en-us/articles/intel-c-compilerprofessional-edition-for-linux-documentation/
• Intel FORTRAN compiler documentation:
  http://software.intel.com/en-us/articles/intel-fortran-compilerprofessional-edition-for-linux-documentation/
• GCC online documentation:
  http://gcc.gnu.org/onlinedocs/
• OpenMPI v1.4.3 documentation:
  http://www.open-mpi.org/doc/v1.4/
• Optimization:
  http://cache-www.intel.com/cd/00/00/27/66/276615_276615.pdf
  http://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html



Questions/Comments?
