OpenMP Slides - Research Computing
<strong>Research</strong> <strong>Computing</strong>, University of Colorado

Introduction to <strong>OpenMP</strong>
Outline

Overview of OpenMP
Mechanics
Hello World!
Simple loops
Nested loops
Calculating PI
Tips and Scheduling

Consider checking out this great online tutorial:
https://computing.llnl.gov/tutorials/openMP/
Overview of <strong>OpenMP</strong>
Levels of Parallelism

[Diagram: main() spawns coarse-grain tasks 1..n via MPI; each task runs threads 1..m; within a thread, fine-grain parallelism over array elements a(i), b(i) is handled by the compiler and CPU.]
<strong>OpenMP</strong>

An application programming interface (API) for parallel programming on multiprocessors:
compiler directives
library functions
environment variables

<strong>OpenMP</strong> works in conjunction with Fortran, C, or C++.
Frees you from the details of programming threads directly (e.g. POSIX threads).
http://openmp.org
Shared-memory Model

[Diagram: four processors connected to a single shared memory.]

Processors interact and synchronize with each other through shared variables.
JANUS
Fork/Join Parallelism

Initially only the master thread is active.
The master thread executes sequential code.
Fork: the master thread creates or awakens additional threads to execute parallel code.
Join: at the end of the parallel code, the created threads die or are suspended.
[Diagram: over time, the master thread runs alone until a fork spawns the other threads; a join collapses execution back to the master; the fork/join pattern repeats.]
Shared-memory vs. Message Passing

Shared-memory model:
The number of active threads is one at the start and finish of the program, and changes dynamically during execution.
You can make your program parallel incrementally; a program may have only a single parallel loop.

Message-passing model:
All processes are active throughout the execution of the program.
The sequential-to-parallel transformation requires a major effort.
Compiling <strong>OpenMP</strong>

GNU:
gfortran -fopenmp -g basic.f90
gcc -fopenmp -g basic.c

Intel:
ifort -openmp -g basic.f90
icc -openmp -g basic.c

Environment variables:
setenv OMP_NUM_THREADS 12
export OMP_NUM_THREADS=12
Execution Context

Address space containing all of the variables a thread may access.
Every thread has its own execution context.
Contents:
static variables
dynamically allocated variables on the heap
variables on the stack

Shared variable: every thread has the same address in its execution context.
Private variable: every thread has a different address.
A thread cannot access the private variables of another thread.
Shared and Private Variables

[Diagram: a shared variable b and a pointer ptr appear at the same address in every thread's execution context, while each thread (master, thread 2, ...) holds its own private copy of i.]
Mechanics

Functions, directives, and clauses
Functions

int thread = omp_get_thread_num();   // thread id
int size = omp_get_num_threads();    // how many threads?
omp_set_num_threads(4);              // set the number of threads

What is another way to set the number of threads?
export OMP_NUM_THREADS=12
Directives

A way for the programmer to communicate with the compiler.
The compiler is free to ignore directives (they are hints).
The #pragma directive is used to instruct the compiler to use pragmatic or implementation-dependent features (e.g. <strong>OpenMP</strong>):

#pragma omp parallel
#pragma omp parallel private(var1, var2, ...)
Parallel

#pragma omp parallel [clause ...]
  if (scalar_expression)
  private (list)
  shared (list)
  default (shared | none)
  firstprivate (list)
  reduction (operator: list)
  copyin (list)
  num_threads (integer-expression)

  structured_block
Hello World!

#include <iostream>
#include <omp.h>
using namespace std;

int main (int argc, char *argv[])
{
  int thread, size;
  #pragma omp parallel private(thread, size)
  {
    // Obtain thread number
    thread = omp_get_thread_num();
    size = omp_get_num_threads();
    cout << "Hello World from thread " << thread
         << " of " << size << endl;
  }
  return 0;
}
General form

C/C++:
#pragma omp directive [clause]

Fortran:
!$OMP directive [clause]
Directives
parallel
master
for
atomic
critical
barrier
sections
ordered

Clauses
private
shared
num_threads
schedule
reduction
#pragma omp for [clause ...]
  schedule (type [,chunk])
  ordered
  private (list)
  firstprivate (list)
  lastprivate (list)
  shared (list)
  reduction (operator: list)
  collapse (n)
  nowait
Simple loop

#pragma omp parallel for
for (int i = 0; i < large_number; i++)
{
  // do something
}

// Equivalent:
#pragma omp parallel
{
  #pragma omp for
  for (int i = 0; i < large_number; i++)
  {
    // do something
  }
}
Example (nested loops)

for (int i = 0; i < ...; i++)
  for (int j = 0; j < ...; j++)
    ...
Directs the compiler to make one or more variables private:

#pragma omp parallel for private(i,j)
for (i = 0; i < ...; i++)
  for (j = 0; j < ...; j++)
    ...
Private

private
directs the compiler to make one or more variables private

firstprivate
private variables have initial values identical to the variable controlled by the master thread as the loop is entered
variables are initialized once per thread, not once per loop iteration

lastprivate
copies the private copy of the variable from the thread that executed the sequentially last iteration back to the master
Calculating PI

double x, sum = 0.0;
const double step = 1.0/static_cast<double>(large_number);
for (int i = 1; i <= large_number; i++)
{
  x = (i - 0.5) * step;
  sum += 4.0 / (1.0 + x * x);
}
const double pi = step * sum;
Calculating PI

If we simply parallelize the loop...

double x, sum = 0.0;
const double step = 1.0/static_cast<double>(large_number);
#pragma omp parallel for private(x)
for (int i = 1; i <= large_number; i++)
{
  x = (i - 0.5) * step;
  sum += 4.0 / (1.0 + x * x);   // every thread updates sum at once!
}
Race condition

... we set up a race condition, in which one thread may "race ahead" of another and not see its change to the shared variable sum:

for (int i = 1; i <= large_number; i++)
{
  x = (i - 0.5) * step;
  sum += 4.0 / (1.0 + x * x);   // two threads can read the same value of
                                // sum; one of the two updates is then lost
}
Critical directive

Critical section: a portion of code that only one thread at a time may execute.

#pragma omp parallel for private(x)
for (int i = 1; i <= large_number; i++)
{
  x = (i - 0.5) * step;
  #pragma omp critical
  sum += 4.0 / (1.0 + x * x);
}
Reductions

Reductions are so common that <strong>OpenMP</strong> provides support for them.
You may add a reduction clause to the parallel for directive.
Specify the reduction operation and the reduction variable.
<strong>OpenMP</strong> takes care of storing partial results in private variables and combining the partial results after the loop.

reduction(operation : var)

Operators: +, *, -, ... and more
#pragma omp parallel for reduction(+:sum) private(x)
for (int i = 1; i <= large_number; i++)
{
  x = (i - 0.5) * step;
  sum += 4.0 / (1.0 + x * x);
}
Sections

Divide sections among the threads in the team.

#pragma omp sections [clause ...]
  private (list)
  firstprivate (list)
  lastprivate (list)
  reduction (operator: list)
  nowait
{
  #pragma omp section
  structured_block
  #pragma omp section
  structured_block
}
#pragma omp parallel shared(a,b,c,d) private(i)
{
  #pragma omp sections nowait
  {
    #pragma omp section
    for (i = 0; i < N; i++)
      c[i] = a[i] + b[i];

    #pragma omp section
    for (i = 0; i < N; i++)
      d[i] = a[i] * b[i];
  } /* end of sections */
} /* end of parallel section */
Tips and Scheduling
Performance Improvement #1

Too many fork/joins can lower performance.
Inverting loops may help performance if:
the parallelism is in the inner loop
after inversion, the outer loop can be made parallel
the inversion does not significantly lower the cache hit rate

#pragma omp parallel for private(j)
for (int i = 0; i < ...; i++)
  ...
Performance Improvement #2

If a loop has too few iterations, the fork/join overhead is greater than the time saved by parallel execution.
The if clause instructs the compiler to insert code that determines at run time whether the loop should be executed in parallel:

#pragma omp parallel if (num > 500)
Performance Improvement #3

We can use the schedule clause to specify how iterations of a loop should be allocated to threads.
Static schedule: all iterations are allocated to threads before any iterations execute.
Dynamic schedule: only some iterations are allocated to threads at the beginning of the loop's execution; the remaining iterations are allocated to threads that complete their assigned iterations.
Scheduling: static and dynamic

Static scheduling:
low overhead
may exhibit high workload imbalance

Dynamic scheduling:
higher overhead
can reduce workload imbalance

schedule(type[,chunk])
chunk specifies the size of the iteration blocks
type: static, dynamic
Chunks

A chunk is a contiguous range of iterations.
Increasing the chunk size:
reduces overhead
may increase the cache hit rate
Decreasing the chunk size:
allows finer balancing of workloads
Scheduling options

schedule(static): block allocation of about n/t contiguous iterations to each thread
schedule(static,C): interleaved allocation of chunks of size C to threads
schedule(dynamic): dynamic one-at-a-time allocation of iterations to threads
schedule(dynamic,C): dynamic allocation of C iterations at a time to threads
Summary

<strong>OpenMP</strong>: an API for shared-memory parallel programming.
Shared-memory model based on fork/join parallelism.
Data parallelism:
parallel for
reduction clause