OpenMP Slides - Research Computing
<strong>Research</strong> <strong>Computing</strong>, University of Colorado

Introduction to <strong>OpenMP</strong>
Outline

Overview of OpenMP
Mechanics
Hello World!
Simple loops
Nested loops
Calculating PI
Tips and Scheduling

Consider checking out this great online tutorial:
https://computing.llnl.gov/tutorials/openMP/
Overview of <strong>OpenMP</strong>
Levels of Parallelism

[Diagram: main() spawns coarse-grain tasks 1..n via MPI; each task runs threads 1..m; within a thread, fine-grain parallelism over array elements a(i), b(i) is handled by the compiler and CPU.]
<strong>OpenMP</strong>

An application programming interface (API) for parallel programming on multiprocessors:
compiler directives
library functions
environment variables

<strong>OpenMP</strong> works in conjunction with Fortran, C, or C++.
Frees you from the details of programming threads directly (e.g. POSIX threads).
http://openmp.org
Shared-memory Model

[Diagram: four processors connected to a single shared memory.]

Processors interact and synchronize with each other through shared variables.
JANUS
Fork/Join Parallelism

Initially only the master thread is active.
The master thread executes sequential code.
Fork: the master thread creates or awakens additional threads to execute parallel code.
Join: at the end of the parallel code, the created threads die or are suspended.
[Diagram: over time, the master thread runs alone until a fork spawns the other threads; a join collapses execution back to the master; the fork/join pattern repeats.]
Shared-memory vs. Message Passing

Shared-memory model:
The number of active threads is one at the start and finish of the program, and changes dynamically during execution.
You can make your program parallel incrementally; a program may have only a single parallel loop.

Message-passing model:
All processes are active throughout the execution of the program.
The sequential-to-parallel transformation requires a major effort.
Compiling <strong>OpenMP</strong>

GNU:
gfortran -fopenmp -g basic.f90
gcc -fopenmp -g basic.c

Intel:
ifort -openmp -g basic.f90
icc -openmp -g basic.c

Environment variables:
setenv OMP_NUM_THREADS 12
export OMP_NUM_THREADS=12
Execution Context

Address space containing all of the variables a thread may access.
Every thread has its own execution context.
Contents:
static variables
dynamically allocated variables on the heap
variables on the stack

Shared variable: every thread has the same address in its execution context.
Private variable: every thread has a different address.
A thread cannot access the private variables of another thread.
Shared and Private Variables

[Diagram: a shared variable b and a pointer ptr appear at the same address in every thread's execution context, while each thread (master, thread 2, ...) holds its own private copy of i.]
Mechanics

Functions, directives, and clauses
Functions

int thread = omp_get_thread_num();   // thread id
int size = omp_get_num_threads();    // how many threads?
omp_set_num_threads(4);              // set the number of threads

What is another way to set the number of threads?
export OMP_NUM_THREADS=12
Directives

A way for the programmer to communicate with the compiler.
The compiler is free to ignore directives (they are hints).
The #pragma directive is used to instruct the compiler to use pragmatic or implementation-dependent features (e.g. <strong>OpenMP</strong>):

#pragma omp parallel
#pragma omp parallel private(var1, var2, ...)
Parallel

#pragma omp parallel [clause ...]
  if (scalar_expression)
  private (list)
  shared (list)
  default (shared | none)
  firstprivate (list)
  reduction (operator: list)
  copyin (list)
  num_threads (integer-expression)

  structured_block
Hello World!

#include <iostream>
#include <omp.h>
using namespace std;

int main (int argc, char *argv[])
{
  int thread, size;
  #pragma omp parallel private(thread, size)
  {
    // Obtain thread number
    thread = omp_get_thread_num();
    size = omp_get_num_threads();
    cout << "Hello World from thread " << thread
         << " of " << size << endl;
  }
  return 0;
}
General form

C/C++:
#pragma omp directive [clause]

Fortran:
!$OMP directive [clause]
Directives
parallel
master
for
atomic
critical
barrier
sections
ordered

Clauses
private
shared
num_threads
schedule
reduction
#pragma omp for [clause ...]
  schedule (type [,chunk])
  ordered
  private (list)
  firstprivate (list)
  lastprivate (list)
  shared (list)
  reduction (operator: list)
  collapse (n)
  nowait
Simple loop

#pragma omp parallel for
for (int i = 0; i < large_number; i++)
{
  // do something
}

// Equivalent:
#pragma omp parallel
{
  #pragma omp for
  for (int i = 0; i < large_number; i++)
  {
    // do something
  }
}
Example (nested loops)

for (int i = 0; i < ...; i++)
  for (int j = 0; j < ...; j++)
    ...
Directs the compiler to make one or more variables private:

#pragma omp parallel for private(i,j)
for (i = 0; i < ...; i++)
  for (j = 0; j < ...; j++)
    ...
Private

private
directs the compiler to make one or more variables private

firstprivate
private variables have initial values identical to the variable controlled by the master thread as the loop is entered
variables are initialized once per thread, not once per loop iteration

lastprivate
copies the private copy of the variable from the thread that executed the sequentially last iteration back to the master
Calculating PI

double x, sum = 0.0;
const double step = 1.0/static_cast<double>(large_number);
for (int i = 1; i <= large_number; i++)
{
  x = (i - 0.5) * step;
  sum += 4.0 / (1.0 + x * x);
}
const double pi = step * sum;
Calculating PI

If we simply parallelize the loop...

double x, sum = 0.0;
const double step = 1.0/static_cast<double>(large_number);
#pragma omp parallel for private(x)
for (int i = 1; i <= large_number; i++)
{
  x = (i - 0.5) * step;
  sum += 4.0 / (1.0 + x * x);   // every thread updates sum at once!
}
Race condition

... we set up a race condition, in which one thread may "race ahead" of another and not see its change to the shared variable sum:

for (int i = 1; i <= large_number; i++)
{
  x = (i - 0.5) * step;
  sum += 4.0 / (1.0 + x * x);   // two threads can read the same value of
                                // sum; one of the two updates is then lost
}
Critical directive

Critical section: a portion of code that only one thread at a time may execute.

#pragma omp parallel for private(x)
for (int i = 1; i <= large_number; i++)
{
  x = (i - 0.5) * step;
  #pragma omp critical
  sum += 4.0 / (1.0 + x * x);
}
Reductions

Reductions are so common that <strong>OpenMP</strong> provides support for them.
You may add a reduction clause to the parallel for directive.
Specify the reduction operation and the reduction variable.
<strong>OpenMP</strong> takes care of storing partial results in private variables and combining the partial results after the loop.

reduction(operation : var)

Operators: +, *, -, ... and more
#pragma omp parallel for reduction(+:sum) private(x)
for (int i = 1; i <= large_number; i++)
{
  x = (i - 0.5) * step;
  sum += 4.0 / (1.0 + x * x);
}
Sections

Divide sections among the threads in the team.

#pragma omp sections [clause ...]
  private (list)
  firstprivate (list)
  lastprivate (list)
  reduction (operator: list)
  nowait
{
  #pragma omp section
  structured_block
  #pragma omp section
  structured_block
}
#pragma omp parallel shared(a,b,c,d) private(i)
{
  #pragma omp sections nowait
  {
    #pragma omp section
    for (i = 0; i < N; i++)
      c[i] = a[i] + b[i];

    #pragma omp section
    for (i = 0; i < N; i++)
      d[i] = a[i] * b[i];
  } /* end of sections */
} /* end of parallel section */
Tips and Scheduling
Performance Improvement #1

Too many fork/joins can lower performance.
Inverting loops may help performance if:
the parallelism is in the inner loop
after inversion, the outer loop can be made parallel
the inversion does not significantly lower the cache hit rate

#pragma omp parallel for private(j)
for (int i = 0; i < ...; i++)
  ...
Performance Improvement #2

If a loop has too few iterations, the fork/join overhead is greater than the time saved by parallel execution.
The if clause instructs the compiler to insert code that determines at run time whether the loop should be executed in parallel:

#pragma omp parallel if (num > 500)
Performance Improvement #3

We can use the schedule clause to specify how iterations of a loop should be allocated to threads.
Static schedule: all iterations are allocated to threads before any iterations execute.
Dynamic schedule: only some iterations are allocated to threads at the beginning of the loop's execution; the remaining iterations are allocated to threads that complete their assigned iterations.
Scheduling: static and dynamic

Static scheduling:
low overhead
may exhibit high workload imbalance

Dynamic scheduling:
higher overhead
can reduce workload imbalance

schedule(type[,chunk])
chunk specifies the size of the iteration blocks
type: static, dynamic
Chunks

A chunk is a contiguous range of iterations.
Increasing the chunk size:
reduces overhead
may increase the cache hit rate
Decreasing the chunk size:
allows finer balancing of workloads
Scheduling options

schedule(static): block allocation of about n/t contiguous iterations to each thread
schedule(static,C): interleaved allocation of chunks of size C to threads
schedule(dynamic): dynamic one-at-a-time allocation of iterations to threads
schedule(dynamic,C): dynamic allocation of C iterations at a time to threads
Summary

<strong>OpenMP</strong>: an API for shared-memory parallel programming.
Shared-memory model based on fork/join parallelism.
Data parallelism:
parallel for
reduction clause