11.07.2015 Views

Guidelines on debugging and introduction to Totalview

Guidelines on debugging and introduction to Totalview

Guidelines on debugging and introduction to Totalview

SHOW MORE
SHOW LESS

You also want an ePaper? Increase the reach of your titles

YUMPU automatically turns print PDFs into web optimized ePapers that Google loves.

<str<strong>on</strong>g>Guidelines</str<strong>on</strong>g> <strong>on</strong> <strong>debugging</strong> <strong>and</strong>introducti<strong>on</strong> <strong>to</strong> <strong>Totalview</strong>Dominique Lucas – George Mozdzynski1 Com hpcf training course - 2004 DebuggingECMWF


Outline• Other approaches <strong>to</strong> <strong>debugging</strong>• Debugging with <strong>to</strong>talview2 Com hpcf training course - 2004 DebuggingECMWF


Introducti<strong>on</strong>• “Programming is an art” … <strong>and</strong> so is <strong>debugging</strong>“You d<strong>on</strong>’t need a sledgehammer <strong>to</strong> crack a nut”• Time spent writing better code will be more thancompensated by reduced debug time3 Com hpcf training course - 2004 DebuggingECMWF


D<strong>on</strong>’t Panic• Most problems are trivial <strong>and</strong> easy <strong>to</strong> fix.• Look at stack trace/point of failure.• Intuiti<strong>on</strong>, experience, luck all play a part.• Check your code “again”.• Explain/show your code <strong>to</strong> somebody else.• Use make <strong>to</strong> build your libraries/models, <strong>to</strong> guarantee thatthe same opti<strong>on</strong>s are used everywhere.4 Com hpcf training course - 2004 DebuggingECMWF


Some suggesti<strong>on</strong>s• Is the problem reproducible <strong>on</strong> a rerunDoes it fail in exactly the same way <strong>and</strong> place• Try THREADS=1• No stack trace?suspect ‘you’ have trashed memory• Try <strong>to</strong> reproduce problem at a lower resoluti<strong>on</strong>5 Com hpcf training course - 2004 DebuggingECMWF


The Universal Debug Tool(the print/write statement)• Will always be there for you!• What <strong>to</strong> write out.• Must be selective <strong>to</strong> keep file size(s) manageable.• Can be activated with an DEBUG variable or Namelistentry.6 Com hpcf training course - 2004 DebuggingECMWF


Memory C<strong>on</strong>straints• Task stack limit 4 Gbytes• Thread stack limitMaster thread 4 GbytesOther threads 256 Mbytes• Large arrays (>1 Mbyte) should be declared allocatableallocatable arrays use the (per task) heapno practical limit <strong>on</strong> heap size• Fortran 90/95 (xlf90/xlf95) put local data <strong>on</strong> stack.• Fortran 77 (xlf) puts statically allocated local data <strong>on</strong> heap.7 Com hpcf training course - 2004 DebuggingECMWF


Eoj – check memory <strong>and</strong> CPU utilisati<strong>on</strong>eoj vers1.4 run at Wed Jun 11 17:35:46 GMT 2003 <strong>on</strong> hpca2501 for jobstep hpca2301.347796.0Queued : Wed Jun 11 17:11:09 GMT 2003 for 898 sec<strong>on</strong>dsDispatched : Wed Jun 11 17:26:07 GMT 2003 for 579 sec<strong>on</strong>dsJob Name : edhm_model_fcgroup1Step Name : 0Owner : rdxUnix Group : rdAccount : ecpreopsSTDIN : /dev/nullSTDOUT : /hpca/rdx_dir/log/mpm/edhm/fc/fcgroup1/model.1STDERR : /hpca/rdx_dir/log/mpm/edhm/fc/fcgroup1/model.1Class : npStep Type : General ParallelNode Usage : sharedStep Cpus : 8Total Tasks :Blocking :Node actual : 1Adapter Req. : (csss,MPI,shared,US)Resources : C<strong>on</strong>sumableCpus(1) C<strong>on</strong>sumableMemory(900.000 mb)#*#* Next 3 times NOT up-<strong>to</strong>-date (TOTAL CPU TIME given later IS accurate)Step User Time : 00:13:41.940000Step System Time : 00:00:33.760000Step Total Time : 00:14:15.700000 (855.7 secs)#*#* Last 3 times NOT up-<strong>to</strong>-date (TOTAL CPU TIME given later IS accurate)C<strong>on</strong>text switches : involuntary = 28637, voluntary = 12378per sec<strong>on</strong>d = 49 21Page faults : with I/O = 13056, without I/O = 1203613per sec<strong>on</strong>d = 22 2078 Node ? #T #t secs/CPU (Eff%) (Now%) max/TSK mb (Eff%) (Now% - mb ) Task list-------- - -- -- ---------- ------ ------ ---------- ------ -------------- ---------hpca1503 M 8 1 120.33 ( 20%) ( 52%) 764.85 ( 84%) ( 88% - 7680) 0:1:2:3:4:5:6:7:-------- - -- -- ---------- ------ ------ ---------- ------ -------------- ---------Elapsed = 579 secs 900 mb = C<strong>on</strong>sumableMemoryCPU Tot = 962.64 ( 0+00:16:02) Average: 963 s/node, 120 s/taskSystem Billing Units used by this jobstep = 1.0378 Com hpcf training course - 2004 DebuggingECMWF


Debugging• checking:argument checking: -qextchk• xlf –qextchk prog.f –o progchecking d<strong>on</strong>e at compilati<strong>on</strong>/linkingarray bounds checking: -C• xlf –C prog.f –o prog• ./progchecking d<strong>on</strong>e at runtimeundefined reference checking• xlf –qinitau<strong>to</strong>=FF –qflttrap:inv:en prog.f –o progchecking d<strong>on</strong>e at runtime9 Com hpcf training course - 2004 DebuggingECMWF


Debugging – floating point excepti<strong>on</strong>s• nothing generated <strong>on</strong> floating point excepti<strong>on</strong>.• Floating point trapping% xlf –qflttrap=overflow:invalid:zerodivide:enable \ –qsigtrap prog.f –o prog% ./prog...CPU overhead of up <strong>to</strong> 20%.• Core files – how <strong>to</strong> get a traceback % dbx ./prog core


ECMWF local signal trap - ECLIBINTEGER*4 CORE_DUMP_FLAG, IRETURN, SIGNALS(1)REAL ACORE_DUMP_FLAG = 0SIGNALS(1) = 0IRETURN = SIGNAL_TRAP(CORE_DUMP_FLAG, SIGNALS)IF (IRETURN .LT. 0) THENPRINT *, 'ERROR'ELSE IF (IRETURN .EQ. 0) THENPRINT *, 'FPE TRAPPING IS NOT SET'ELSEPRINT *, 'FPE TRAPPING MODE =', IRETURNENDIFcall b(-2.)endsubroutine b(a)real awrite(*,*)sqrt(a)returnEND• Link using $ECLIB, e.g.xlf -c prog.fxlf prog.o -o prog $ECLIB11 Com hpcf training course - 2004 DebuggingECMWF


Signal_trap - arguments• CORE_DUM_FLAG:= 0: no core dumpedNot = 0: core dumped• SIGNALS – integer array with signals <strong>to</strong> trapSignals(1)=0 => SIGFPE, SIGILL, SIGBUS,SIGSEGV, SIGXCPU.See “kill –l “ for list of signals.12 Com hpcf training course - 2004 DebuggingECMWF


<strong>Totalview</strong>• Re-compile (part of) your applicati<strong>on</strong> without optimizati<strong>on</strong> -qnooptimize for all routines -qsmp=noopt for routines with OpenMP Beware –qsmp=omp implies optimizati<strong>on</strong>• Comm<strong>and</strong> Line interface exists $ <strong>to</strong>talviewcli• Recommended use of <strong>to</strong>talview in batch mode, i.e. with theGUI versi<strong>on</strong>.13 Com hpcf training course - 2004 DebuggingECMWF


<strong>Totalview</strong> – GUI interface• Before submitting batch job <strong>to</strong> launch <strong>to</strong>talview, request anX11 proxy at login time via ECaccess.• Include in your batch job the display export DISPLAY=• Include source code searchpath -searchpath=“dir_1/,dir_2/,….dir_n/”• For MPI-parallel (load-leveller jobs) <strong>to</strong>talview –searchPath=$searchpath poe –a • For serial/OpenMP <strong>on</strong>ly (interactive) <strong>to</strong>talview –searchPath=$searchpath -a 14 Com hpcf training course - 2004 DebuggingECMWF


15 Com hpcf training course - 2004 DebuggingECMWF


16 Com hpcf training course - 2004 DebuggingECMWF


17 Com hpcf training course - 2004 DebuggingECMWF


View > Lookup_Functi<strong>on</strong> …fshortcutscan2mdm_ orscan2mdm.F9018 Com hpcf training course - 2004 DebuggingECMWF


19 Com hpcf training course - 2004 DebuggingECMWF


20 Com hpcf training course - 2004 DebuggingECMWF


21 Com hpcf training course - 2004 DebuggingECMWF


To “restart” applicati<strong>on</strong>Group > Restart orGroup > Delete Ctrl+ZShortcut22 Com hpcf training course - 2004 DebuggingECMWF


File > Preferences23 Com hpcf training course - 2004 DebuggingECMWF


When breakpoint hit, s<strong>to</strong>p: Process…is more practical24 Com hpcf training course - 2004 DebuggingECMWF


25 Com hpcf training course - 2004 DebuggingECMWF


View > Dive Anew foradditi<strong>on</strong>al process windows26 Com hpcf training course - 2004 DebuggingECMWF


Tools > Fortran Modules<strong>to</strong> view module data27 Com hpcf training course - 2004 DebuggingECMWF


Tools > Visualize28 Com hpcf training course - 2004 DebuggingECMWF


Set slice <strong>to</strong>reduce data29 Com hpcf training course - 2004 DebuggingECMWF


30 Com hpcf training course - 2004 DebuggingECMWF


View > Sort > Descending31 Com hpcf training course - 2004 DebuggingECMWF


View module c<strong>on</strong>tents bydiving <strong>on</strong> module name32 Com hpcf training course - 2004 DebuggingECMWF


View all variables in aroutine by diving <strong>on</strong>name in Stack Frame33 Com hpcf training course - 2004 DebuggingECMWF


Tools > Memory UsageT L 255 model using 8 CPUs of st<strong>and</strong>ard memory node34 Com hpcf training course - 2004 DebuggingECMWF


35 Com hpcf training course - 2004 DebuggingECMWF


Logicals0 = false, 1 = trueWill be displayed asT/F in future release36 Com hpcf training course - 2004 DebuggingECMWF


37 Com hpcf training course - 2004 DebuggingECMWF


38 Com hpcf training course - 2004 DebuggingECMWF


http://www.etnus.com/Support/docs/39 Com hpcf training course - 2004 DebuggingECMWF


…or just type <strong>to</strong>talview <strong>and</strong> use Help pages• …you d<strong>on</strong>’t need a program <strong>to</strong> debug …40 Com hpcf training course - 2004 DebuggingECMWF

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!