QLogic OFED+ Host Software User Guide, Rev. B
F–Troubleshooting<br />
QLogic MPI Troubleshooting<br />
The following message indicates that a program on one node died:<br />
$ mpirun -np 2 -m ~/tmp/q mpi_latency 100000 1000000<br />
MPIRUN: node program unexpectedly quit: Exiting.<br />
The Quiescence Detected message is printed when an MPI job is not making<br />
progress. The default timeout is 900 seconds. When the timeout expires, all the<br />
node processes are terminated. To extend or disable this timeout, use the<br />
-quiescence-timeout option of mpirun (shown as -q in the following example).<br />
$ mpirun -np 2 -m ~/tmp/q -q 60 mpi_latency 1000000 1000000<br />
MPIRUN: MPI progress Quiescence Detected after 9000 seconds.<br />
MPIRUN: 2 out of 2 ranks showed no MPI send or receive progress.<br />
MPIRUN: Per-rank details are the following:<br />
MPIRUN: Rank 0 ( ) caused MPI progress Quiescence.<br />
MPIRUN: Rank 1 ( ) caused MPI progress Quiescence.<br />
MPIRUN: both MPI progress and Ping Quiescence Detected after 120<br />
seconds.<br />
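When job output is captured to a file, the quiescence report can be summarized automatically. The following is a minimal Python sketch, not part of the QLogic software; the function name `parse_quiescence` and the dictionary keys are hypothetical, and the regular expressions assume the message wording shown in the example above.

```python
import re

# Sample text copied from the mpirun output shown above.
SAMPLE = """\
MPIRUN: MPI progress Quiescence Detected after 9000 seconds.
MPIRUN: 2 out of 2 ranks showed no MPI send or receive progress.
MPIRUN: Per-rank details are the following:
MPIRUN: Rank 0 ( ) caused MPI progress Quiescence.
MPIRUN: Rank 1 ( ) caused MPI progress Quiescence.
"""

# Patterns assume the message wording shown in this guide.
DETECTED = re.compile(r"Quiescence Detected after\s+(\d+)")
STALLED = re.compile(
    r"(\d+) out of (\d+) ranks showed no MPI send or receive progress"
)
RANK = re.compile(r"Rank (\d+) \(.*?\) caused MPI progress Quiescence")


def parse_quiescence(text):
    """Return a summary dict, or None if the text has no quiescence report."""
    detected = DETECTED.search(text)
    if detected is None:
        return None
    stalled = STALLED.search(text)
    return {
        # Timeout value reported in the first "Detected after" message.
        "seconds": int(detected.group(1)),
        # Number of ranks without progress, and total ranks in the job.
        "stalled": int(stalled.group(1)) if stalled else None,
        "total": int(stalled.group(2)) if stalled else None,
        # Rank numbers named in the per-rank detail lines.
        "ranks": [int(r) for r in RANK.findall(text)],
    }


if __name__ == "__main__":
    print(parse_quiescence(SAMPLE))
```

A result in which `stalled` equals `total` suggests that the whole job stopped making progress, rather than a single rank waiting on its peers.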
Occasionally, a stray process continues to exist outside its job context. mpirun<br />
checks for stray processes, and they are killed when detected. The following output<br />
is an example of the messages displayed in this case:<br />
$ mpirun -np 2 -ppn 1 -m ~/tmp/mfast mpi_latency 500000 2000<br />
iqa-38: Received 1 out-of-context eager message(s) from stray<br />
process PID=29745<br />
running on host 192.168.9.218<br />
iqa-35: PSM pid 10513 on host IP 192.168.9.221 has detected that I<br />
am a stray process, exiting.<br />
2000 5.222116<br />
iqa-38:1.ips_ptl_report_strays: Process PID=29745 on host<br />
IP=192.168.9.218 sent<br />
1 stray message(s) and was told so 1 time(s) (first stray message<br />
at 0.7s (13%),last at 0.7s (13%) into application run)<br />
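The stray-process summary identifies the offending PID and host, which is useful when cleaning up after a job. The following is a minimal Python sketch, assuming the report wording shown above; the function name `find_strays` and the dictionary keys are hypothetical. Because the report may be wrapped across lines, as it is here, the sketch collapses whitespace before matching.

```python
import re

# Sample text copied from the output shown above, including its line wrapping.
SAMPLE = """\
iqa-38:1.ips_ptl_report_strays: Process PID=29745 on host
IP=192.168.9.218 sent
1 stray message(s) and was told so 1 time(s) (first stray message
at 0.7s (13%),last at 0.7s (13%) into application run)
"""

# Pattern assumes the report wording shown in this guide.
STRAY = re.compile(
    r"Process PID=(\d+) on host IP=(\S+) sent (\d+) stray message\(s\)"
)


def find_strays(text):
    """Return one dict per stray-process report found in the text."""
    # Collapse newlines and repeated spaces so wrapped reports still match.
    flat = " ".join(text.split())
    return [
        {"pid": int(pid), "host_ip": ip, "messages": int(count)}
        for pid, ip, count in STRAY.findall(flat)
    ]


if __name__ == "__main__":
    for stray in find_strays(SAMPLE):
        print(stray)
```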
The following message should never occur. If it does, notify Technical Support:<br />
Internal Error: NULL function/argument found:func_ptr(arg_ptr)<br />
Driver and Link Error Messages Reported by MPI Programs<br />
The following driver and link error messages are reported by MPI programs.<br />
When the InfiniBand link fails during a job, a message is reported once for each<br />
occurrence. The message is similar to the following:<br />
ipath_check_unit_status: IB Link is down<br />
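The messages described in this section can be picked out of a captured job log with simple substring checks. The following is a minimal Python sketch; the function name `triage` and the category labels are hypothetical, and the search strings are taken from the example messages in this section. It assumes each message appears on a single line of the log.

```python
# Each entry pairs a substring from a message shown in this section
# with a short, hypothetical label describing the condition.
PATTERNS = [
    ("node program unexpectedly quit", "node program died"),
    ("Quiescence Detected", "job stopped making progress"),
    ("stray process", "stray process detected"),
    ("Internal Error: NULL function/argument found",
     "internal error; notify Technical Support"),
    ("IB Link is down", "InfiniBand link failure"),
]


def triage(lines):
    """Return (line_number, label) pairs for every recognized message."""
    findings = []
    for number, line in enumerate(lines, start=1):
        for needle, label in PATTERNS:
            if needle in line:
                findings.append((number, label))
                break  # Report at most one label per line.
    return findings


if __name__ == "__main__":
    log = [
        "MPIRUN: node program unexpectedly quit: Exiting.",
        "2000 5.222116",
        "ipath_check_unit_status: IB Link is down",
    ]
    for number, label in triage(log):
        print(f"line {number}: {label}")
```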
D000046-005 B F-29