Software Development on Quest

Stuck? Need help? 

quest-help@northwestern.edu 


Compilers and Single Core Optimization 

Pradeep Sivakumar 

pradeep-sivakumar@northwestern.edu



Contents 

• Compilers Introduction 

• Compilers and parallel libraries available on Quest 

• Basic compiler options 

• Multi-core and Multi-process compilation 

• Building programs using MPI and OpenMP 

• Single Core Optimization 

• Optimization level (On) 

• fast option 

• Inter-procedural optimization 

• Inlining 

• Architecture specification 

• Profile guided optimization 

• Best practices



Compilers/Parallel libraries on Quest

• Compilers

              GNU (gcc/gfortran)        Intel (icc/ifort)
  Versions    4.3.5, 4.6.1              11.1, Composer XE 2011

• Parallel libraries

              OpenMPI-GNU               OpenMPI-Intel
  Versions    1.4.0, 1.4.2, and 1.4.3   1.4.0, 1.4.2, and 1.4.3



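Basic compiler options

• How to invoke the compiler

– Directly from the command line:

icc -O2 myprogram.c

– Indirectly from the command line using a Makefile:

CC = icc
CFLAGS = -O2
LIBS =
SOURCES = myprogram.c
OBJECTS = myprogram.o

all: myprogram

# Link step. myprogram.o is built from myprogram.c by make's
# implicit rule, which uses $(CC) and $(CFLAGS).
myprogram: $(OBJECTS)
	$(CC) $(CFLAGS) -o myprogram $(OBJECTS) $(LIBS)

# Recipe lines must begin with a tab character.
clean:
	rm -f *.o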



Basic compiler options

• icc or gcc [options] filenames [libraries]...

– The compiler accepts the source files and object files listed in filenames.

• -o: name of the output file (executable).

• -g: include debugging information.

• -L: add a directory to the library search path (use -l to link with a library).

• -I: add a directory to the search path for include files.

• -static: link libraries statically, so that the executable does not depend on shared libraries at run time.

• -multiple-processes[=n]: (Intel only) creates multiple processes to compile a large number of source files at the same time.



GNU make

• Executes commands from the Makefile. By default it looks for GNUmakefile, makefile, and Makefile, in that order.

– To use a non-default filename: make -f Makefile.custom

• Uses file timestamps to decide which source files have been updated and which targets need to be rebuilt.

– When in doubt, use make clean to delete all *.o files and rebuild.

• To execute make in parallel:

– make -j [n]

– Runs up to n commands simultaneously; output from the parallel jobs may be interleaved.

– Rule of thumb: use multiple cores when you need to compile faster on Quest, but use fewer cores when the node is heavily loaded.



Building programs with MPI

• Build MPI programs using OpenMPI's wrapper compilers (available for C, C++, FORTRAN77, and FORTRAN90):

% mpicc -o first first.c

% mpif90 -o first first.f90

• OpenMPI wrappers:

– mpicc, mpicxx, mpif77, mpif90



OpenMP 

• API for writing multi-threaded applications. 

o Explicit parallelism. 

• Use the following compiler flags to "turn on" OpenMP compilation:

  GNU         Intel
  -fopenmp    -openmp

• Set OMP_NUM_THREADS prior to program 

execution to a value no greater than the 

number of available cores on a target platform.
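As an illustration of explicit parallelism, here is a minimal sketch of an OpenMP loop in C. The function name dot and the choice of a dot product as the workload are assumptions made for this example, not part of the original slides. The pragma is guarded by _OPENMP so the same source also builds serially when the OpenMP flag is left out.

```c
/* Dot product of two vectors of length n.
   When the file is compiled with the OpenMP flag (-fopenmp for GNU,
   -openmp for Intel), the loop iterations are divided among threads
   and the per-thread partial sums are combined by the reduction clause.
   Without the flag, _OPENMP is not defined, the pragma is skipped,
   and the loop runs serially. */
double dot(const double *a, const double *b, int n)
{
    double sum = 0.0;
    int i;
#ifdef _OPENMP
#pragma omp parallel for reduction(+:sum)
#endif
    for (i = 0; i < n; i++) {
        sum += a[i] * b[i];
    }
    return sum;
}
```

Call dot from your own program, build it with the flag for your compiler, and set OMP_NUM_THREADS before running it.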



Single core optimization 

• Iterative process. 

• Entirely application dependent. 

• Different ways to achieve it:

o Compiler options

o Performance libraries

o Code optimizations after identifying and modifying hotspots



Single core optimization

• Compilers can perform significant optimization if:

o The code is structured so that it is apparent what the compiler should do

o Simple language constructs are used (e.g. avoid pointers where possible)

• Use the latest compilers

o Check compiler options

o Look for architecture-specific options

• Experiment with different options

• May need routine-specific options

Optimization level (-On)

ref: Intel optimization manual

• -O0: no optimizations; use for debugging.

• -O1: optimize for speed, but disable optimizations which increase code size. Ideal for large codes with many branches that are less computationally intensive.

o Algebraic identity removal

o Common subexpression elimination

o Constant folding

o Redundant load and store elimination

• -O2: perform the optimizations that the compiler considers the best combination of compilation speed and runtime performance (example: software pipelining). Ideal for codes containing short loops which are executed regularly.

o Inlining

o Constant propagation

o Loop unrolling

o Vectorization

o Strength reduction

o Dead code elimination

o Global register allocation
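To make a few of the listed transformations concrete, the sketch below applies them by hand in C. All function names are hypothetical and chosen for this example. An optimizing compiler may perform equivalent rewrites internally, and the exact result depends on the compiler and flags.

```c
/* As written: (x + y) appears three times and 2.0 * 3.0 is spelled out. */
double poly_naive(double x, double y)
{
    return (x + y) * (x + y) + 2.0 * 3.0 * (x + y);
}

/* After common subexpression elimination and constant folding. */
double poly_optimized(double x, double y)
{
    double s = x + y;        /* common subexpression computed once */
    return s * s + 6.0 * s;  /* 2.0 * 3.0 folded to the constant 6.0 */
}

/* As written: one multiplication per iteration to form the index. */
long sum_strided_mul(const long *v, int n, int stride)
{
    long total = 0;
    int i;
    for (i = 0; i < n; i++)
        total += v[i * stride];
    return total;
}

/* After strength reduction: the multiplication is replaced by a
   running addition on the index. */
long sum_strided_add(const long *v, int n, int stride)
{
    long total = 0;
    int i, idx = 0;
    for (i = 0; i < n; i++) {
        total += v[idx];
        idx += stride;
    }
    return total;
}
```

Each pair of functions returns the same value; only the amount of work per call differs.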

Loop unrolling

Original loop:

    do i=1,n
      A(i)=A(i) + B(i)*C
    end do

Unrolled by four (assumes n is a multiple of 4):

    do i=1,n,4
      A(i)=A(i) + B(i)*C
      A(i+1)=A(i+1) + B(i+1)*C
      A(i+2)=A(i+2) + B(i+2)*C
      A(i+3)=A(i+3) + B(i+3)*C
    end do

Perform more operations per iteration inside the loop.
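The same idea can be sketched in C. This version adds a cleanup loop so that it also works when n is not a multiple of 4, which the four-way unrolled loop on its own does not handle. The function names are hypothetical.

```c
/* Straightforward loop: one update per iteration. */
void axpy_simple(double *a, const double *b, double c, int n)
{
    int i;
    for (i = 0; i < n; i++)
        a[i] = a[i] + b[i] * c;
}

/* Unrolled by four: four updates per iteration of the main loop,
   followed by a cleanup loop for the 0 to 3 leftover elements. */
void axpy_unrolled(double *a, const double *b, double c, int n)
{
    int i;
    int limit = n - (n % 4);  /* largest multiple of 4 not above n */
    for (i = 0; i < limit; i += 4) {
        a[i]     = a[i]     + b[i]     * c;
        a[i + 1] = a[i + 1] + b[i + 1] * c;
        a[i + 2] = a[i + 2] + b[i + 2] * c;
        a[i + 3] = a[i + 3] + b[i + 3] * c;
    }
    for (; i < n; i++)
        a[i] = a[i] + b[i] * c;
}
```

In practice the compiler usually does this on its own at -O2 and above, so hand unrolling is mostly useful for understanding what the option does.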

O3 - pros and cons

  GNU    Intel    Description
  -O3    -O3      Aggressive optimization

Best for codes which are floating-point intensive with large loops.

• Perform additional optimizations that are memory intensive, 

compile-time intensive, or both. 

o Scalar replacement 

o Cache blocking 

o Prefetching 

o Loop and memory access transformations 

• Pros 

o The compiler may produce faster code. 

• Cons 

o Code size may bloat. 

o Compilation may take more time. 

o May change program semantics and numerical results, and can sometimes break code.
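One way results can change is through reordering of floating-point arithmetic, which is not associative. The sketch below is an assumption about the kind of effect meant here, not a statement about what any particular compiler does at -O3. Whether reordering is allowed depends on the compiler's floating-point settings. The function names are hypothetical.

```c
/* Sum three values, grouping the first two together. */
double sum_left(double a, double b, double c)
{
    return (a + b) + c;
}

/* Sum the same three values, grouping the last two together. */
double sum_right(double a, double b, double c)
{
    return a + (b + c);
}

/* With IEEE-754 doubles evaluated in double precision:
   sum_left(1e16, -1e16, 1.0) gives 1.0, because 1e16 + (-1e16) is
   exactly 0 before 1.0 is added.
   sum_right(1e16, -1e16, 1.0) gives 0.0, because -1e16 + 1.0 is not
   representable and rounds back to -1e16. */
```

If an optimized build gives slightly different numbers from a debug build, this kind of regrouping is a common cause to check for.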



Useful compiler options on Quest

  GNU         Intel       Description
  -msse4.2    -mSSE4.2    Tells the compiler to generate code specialized for Intel Nehalem

• SSE = Streaming SIMD Extensions

• SSE instructions pipeline and simultaneously 

execute independent operations to get 

multiple operations per clock cycle 

• Directs the compiler to use the most 

advanced instruction set available for the 

target architecture.
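The loop below is a sketch of the kind of code that suits SIMD instructions: every iteration is independent, and the restrict qualifiers (C99) promise the compiler that x and y do not overlap. The function name saxpy is an assumption for this example. Whether the compiler actually vectorizes the loop depends on the compiler version and the options used.

```c
/* y[i] = a * x[i] + y[i] for i = 0 .. n-1.
   Each iteration touches only its own elements, so several
   iterations can be executed by one SIMD instruction.
   "restrict" requires C99 (for example -std=c99 on older compilers). */
void saxpy(int n, float a, const float *restrict x, float *restrict y)
{
    int i;
    for (i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```

Without restrict, the compiler has to allow for x and y overlapping and may fall back to scalar code or add a runtime overlap check.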



-fast (Intel only)

• GNU: partially included in -O3

• Intel: -O3 -ipo -static -xHost -no-prec-div

o -no-prec-div: replaces floating-point division with multiplication by the reciprocal (A/B = A*(1/B)), which is faster but slightly less precise.

o -xHost: generates binaries optimized for the architecture of the machine on which they are compiled.

o -ipo: enables multifile inlining, constant propagation, code placement (i.e. function layout), dead code elimination, and data placement. Must be specified at both compile and link time.
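The A/B = A*(1/B) rewrite can be written out by hand in C, as sketched below with hypothetical function names. The two forms can differ in the last bits of the result, which is the precision trade-off behind the option. The third function shows the usual manual version: compute the reciprocal once and multiply inside the loop.

```c
/* Exact division: one divide. */
double div_exact(double a, double b)
{
    return a / b;
}

/* Reciprocal form: one divide to get 1/b, then a multiply.
   The result is rounded twice, so it may differ slightly
   from div_exact. */
double div_recip(double a, double b)
{
    double inv = 1.0 / b;
    return a * inv;
}

/* Dividing every element by b: a single divide outside the loop,
   then one multiply per element. */
void scale_by_reciprocal(double *v, int n, double b)
{
    double inv = 1.0 / b;
    int i;
    for (i = 0; i < n; i++)
        v[i] *= inv;
}
```

When bit-for-bit reproducibility matters, compare results with and without this kind of rewrite before adopting it.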



Inlining 

• What is inlining? 

o Replacing a function call with the code from the 

function. 

o Eliminates the cost of the function call and return, and can improve instruction cache behavior.

• When is inlining important? 

o When the function is a hot spot. 

o When the call-overhead to work ratio is high. 

o When it can benefit from interprocedural 

optimization. 

• Use -ipo or -ip to allow the compiler to inline.

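Inlining

Before inlining:

    program MAIN
      implicit none
      integer, parameter :: ndim=2, niter=1000000
      real*8 :: x(ndim), x0(ndim), r
      real*8 :: dist          ! return type of the external function
      integer :: i
      x = 1.d0                ! initialize the inputs
      x0 = 0.d0
      do i=1,niter
        r = dist(x,x0,ndim)
      end do
      print *, r
    end program

    real*8 function dist(x,x0,n)
      implicit none
      integer :: n, j
      real*8 :: x0(n), x(n), r
      r = 0.d0
      do j=1,n
        r = r+(x(j)-x0(j))**2
      end do
      dist = r
    end function

After inlining:

    program MAIN
      implicit none
      integer, parameter :: ndim=2, niter=1000000
      real*8 :: x(ndim), x0(ndim), r
      integer :: i, j
      x = 1.d0
      x0 = 0.d0
      do i=1,niter
        r = 0.d0
        do j=1,ndim
          r = r+(x(j)-x0(j))**2
        end do
      end do
      print *, r
    end program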
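For C codes, a comparable sketch is shown below. The names dist2 and repeated_dist are hypothetical. Declaring a small, frequently called function static inline in the same file as its caller makes its body visible to the compiler, which can then replace the call with the body. The keyword is a hint only, and the compiler still decides whether to inline.

```c
/* Squared distance between two n-dimensional points.
   Small body, called from a hot loop: a typical inlining candidate. */
static inline double dist2(const double *x, const double *x0, int n)
{
    double r = 0.0;
    int j;
    for (j = 0; j < n; j++)
        r += (x[j] - x0[j]) * (x[j] - x0[j]);
    return r;
}

/* Calls dist2 niter times and returns the last result.
   If the call is inlined, the loop body contains the distance
   computation directly, with no call and return overhead. */
double repeated_dist(const double *x, const double *x0, int n, long niter)
{
    double r = 0.0;
    long i;
    for (i = 0; i < niter; i++)
        r = dist2(x, x0, n);
    return r;
}
```

When the function lives in a different source file, the compiler cannot see its body at the call site, which is where -ipo comes in.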



The interprocedural optimization process

Source files -> Compile with -ipo -> .o files with IL information -> Link with -ipo -> Executable

• What you should know about IPO: it extends compilation time and memory usage. -ip-no-inlining disables inlining.



Profile guided optimization

• Improves instruction cache usage

o Moves frequently accessed code segments adjacent to one another and moves seldom-accessed code to the end of the module, shrinking code size and eliminating branches.

• Increases application performance by improving branch prediction.

• Applications suited to PGO

o Applications containing several functions which are executed frequently.

• Uses profile feedback from a test run on data sets that represent a typical usage pattern of the application.




Profile guided optimization

[Figure: Little benefit vs. significant benefit]

Step 1: Compile with PGO to produce an instrumented executable.

Step 2: Run the instrumented application to produce dynamic information files.

Step 3: Feedback compile with PGO to produce the profile-guided application.

  GNU                                   Intel
  -fprofile-generate & -fprofile-use    -prof-gen & -prof-use
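The sketch below shows the kind of code where profile feedback can help: a hot loop with a branch that is rarely taken on typical data. The function name process, the file name pgo_demo.c, and the input file name in the comments are all hypothetical. The commands assume a program that contains this function plus a main that calls it on representative input.

```c
/* Sums the non-negative entries of data and counts the negative ones.
   On typical input the error path is assumed to be rare; a profile
   from a representative run tells the compiler which side of the
   branch is hot so it can lay out the code accordingly.

   Three-step build with the GNU flags (hypothetical file names):
     gcc -O2 -fprofile-generate pgo_demo.c -o pgo_demo
     ./pgo_demo typical_input.dat
     gcc -O2 -fprofile-use pgo_demo.c -o pgo_demo
   With the Intel compiler the corresponding flags are
   -prof-gen and -prof-use. */
long process(const int *data, int n, long *errors)
{
    long total = 0;
    int i;
    for (i = 0; i < n; i++) {
        if (data[i] < 0) {   /* rare path on typical data */
            (*errors)++;
            continue;
        }
        total += data[i];    /* common path */
    }
    return total;
}
```

The profile is only as good as the training run, so use input that resembles production data.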



Best practices

Performance libraries

• Optimized for specific architectures

• Use them when possible; remember that even a Numerical Recipes book does not provide an optimized algorithm

• Performance libraries available on Quest:

– ATLAS (3.8.3, 3.9.16, 3.9.24, 3.9.45)

– Intel MKL, which includes BLAS, LAPACK, FFTW interfaces, and VML (Composer XE 2011 and 10.2)

– GNU Scientific Library (1.13, 1.14)



Best practices

Software development

• Write it to be clear and concise, make sure it 

is well commented. 

• Portability 

• Minimize number of divisions 

• Cache: make use of spatial locality by traversing arrays in storage order

– Row-major in C

– Column-major in Fortran

• Minimize pointer arithmetic, avoid typecasting 

and conversions. 

• Avoid branches, conditionals, and IO within 

loops
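The spatial-locality advice can be illustrated with a C sketch that sums a rows x cols matrix stored row-major in a flat array. The function names are hypothetical. Both functions return the same sum, but the first walks memory contiguously while the second jumps by cols elements on every access.

```c
/* Row-by-row traversal: the inner loop moves through consecutive
   memory locations, which matches C's row-major storage. */
double sum_row_order(const double *m, int rows, int cols)
{
    double total = 0.0;
    int i, j;
    for (i = 0; i < rows; i++)
        for (j = 0; j < cols; j++)
            total += m[i * cols + j];
    return total;
}

/* Column-by-column traversal: the inner loop strides by cols
   elements, so consecutive accesses land on different cache lines
   once the rows are long enough. */
double sum_col_order(const double *m, int rows, int cols)
{
    double total = 0.0;
    int i, j;
    for (j = 0; j < cols; j++)
        for (i = 0; i < rows; i++)
            total += m[i * cols + j];
    return total;
}
```

In Fortran the situation is reversed: the first index should vary fastest in the innermost loop.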



References 

• Intel C++ compiler documentation: 

http://software.intel.com/en-us/articles/intel-c-compilerprofessional-edition-for-linux-documentation/ 

• Intel FORTRAN compiler documentation: 

http://software.intel.com/en-us/articles/intel-fortran-compilerprofessional-edition-for-linux-documentation/ 

• GCC online documentation: 

http://gcc.gnu.org/onlinedocs/ 

• OpenMPI v1.4.3 documentation: 

http://www.open-mpi.org/doc/v1.4/ 

• Optimization:

http://cache-www.intel.com/cd/00/00/27/66/276615_276615.pdf,

http://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html#Optimize-Options



Questions/Comments?
