
QLogic OFED+ Host Software User Guide, Rev. B


F–Troubleshooting

QLogic MPI Troubleshooting

The following message indicates that a program on one of the nodes died:

$ mpirun -np 2 -m ~/tmp/q mpi_latency 100000 1000000
MPIRUN: node program unexpectedly quit: Exiting.

The quiescence-detected message is printed when an MPI job is not making progress. The default timeout is 900 seconds. When the timeout expires, all the node processes are terminated. The timeout can be extended or disabled with the -quiescence-timeout option of mpirun (given as -q in the following example).

$ mpirun -np 2 -m ~/tmp/q -q 60 mpi_latency 1000000 1000000
MPIRUN: MPI progress Quiescence Detected after 9000 seconds.
MPIRUN: 2 out of 2 ranks showed no MPI send or receive progress.
MPIRUN: Per-rank details are the following:
MPIRUN: Rank 0 ( ) caused MPI progress Quiescence.
MPIRUN: Rank 1 ( ) caused MPI progress Quiescence.
MPIRUN: both MPI progress and Ping Quiescence Detected after 120 seconds.
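When a long job ends with a quiescence report, it can help to pull out just the rank numbers that mpirun blamed. The following is a minimal sketch, not part of the QLogic software: the function name quiescent_ranks is hypothetical, and it assumes the per-rank lines in real output have the same shape as in the example above.

```shell
#!/bin/sh
# Hypothetical helper: print the number of every rank that mpirun
# reported as having caused MPI progress quiescence.
# Reads saved mpirun output on standard input, one rank number per line.
# Assumption: per-rank lines look like
#   MPIRUN: Rank 0 (...) caused MPI progress Quiescence.
quiescent_ranks() {
    sed -n 's/^MPIRUN: Rank \([0-9][0-9]*\) (.*) caused MPI progress Quiescence\.$/\1/p'
}
```

Assuming the job output was saved to a file (run.log is a placeholder name), the helper would be used as: quiescent_ranks < run.log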

Occasionally, a stray process continues to exist outside its context. mpirun checks for stray processes and kills them when they are detected. The following is an example of the type of message displayed in this case:

$ mpirun -np 2 -ppn 1 -m ~/tmp/mfast mpi_latency 500000 2000
iqa-38: Received 1 out-of-context eager message(s) from stray process PID=29745 running on host 192.168.9.218
iqa-35: PSM pid 10513 on host IP 192.168.9.221 has detected that I am a stray process, exiting.
2000 5.222116
iqa-38:1.ips_ptl_report_strays: Process PID=29745 on host IP=192.168.9.218 sent 1 stray message(s) and was told so 1 time(s) (first stray message at 0.7s (13%),last at 0.7s (13%) into application run)
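To find out which process was the stray, the PID and host address can be extracted from the ips_ptl_report_strays summary. The following is a rough sketch under two assumptions: the summary is emitted as a single line in the actual output (it is wrapped in the printed guide), and its wording matches the example above. The function name stray_senders is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: list "PID host-IP" for every stray sender named in
# an ips_ptl_report_strays summary line.
# Reads saved mpirun output on standard input.
# Assumption: the summary is one line of the form
#   <node>:<n>.ips_ptl_report_strays: Process PID=<pid> on host IP=<ip> sent ...
stray_senders() {
    sed -n 's/^.*ips_ptl_report_strays: Process PID=\([0-9][0-9]*\) on host IP=\([0-9.][0-9.]*\) sent.*$/\1 \2/p'
}
```

For the example output above, this would print the PID followed by the IP address of the stray sender, which can then be checked on that host with standard tools such as ps.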

The following message should never occur. If it does, notify Technical Support:

Internal Error: NULL function/argument found:func_ptr(arg_ptr)
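The MPI messages described so far can be told apart by fixed substrings, which makes a first-pass triage of a saved log straightforward. The sketch below is illustrative only: the function name classify_mpirun_line and the category labels are hypothetical, and the substrings are taken from the examples in this section rather than from a complete list of mpirun messages.

```shell
#!/bin/sh
# Hypothetical triage helper: print a category label for one line of
# mpirun output. Substrings come from the examples in this section;
# any line that matches none of them is labeled "other".
classify_mpirun_line() {
    case "$1" in
        *"node program unexpectedly quit"*)
            echo "node-died" ;;
        *"Quiescence Detected"*)
            echo "quiescence" ;;
        *"Internal Error: NULL function/argument found"*)
            # Per the guide, this one should be reported to Technical Support.
            echo "internal-error" ;;
        *"stray"*)
            echo "stray-process" ;;
        *)
            echo "other" ;;
    esac
}
```

A log can be processed line by line with a loop such as: while IFS= read -r line; do classify_mpirun_line "$line"; done < run.log (run.log being a placeholder file name).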

Driver and Link Error Messages Reported by MPI Programs

The following driver and link error messages are reported by MPI programs.

When the InfiniBand link fails during a job, a message is reported once per occurrence. The message will be similar to:

ipath_check_unit_status: IB Link is down
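Because the message is reported once per occurrence, counting the matching lines in a saved job log gives a rough count of link failures during the run. The following is a minimal sketch; the function name count_link_down is hypothetical, and it assumes the message text matches the example above exactly.

```shell
#!/bin/sh
# Hypothetical helper: count "IB Link is down" reports in saved job output.
# Reads the log on standard input and prints the number of matching lines.
count_link_down() {
    # grep -c prints the count but exits non-zero when the count is 0,
    # so the failure status is masked to keep the helper usable under set -e.
    grep -c 'ipath_check_unit_status: IB Link is down' || true
}
```

A result of 0 means no link-down report was found in the log that was supplied; it says nothing about output that was not captured.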

D000046-005 B
