QLogic OFED+ Host Software User Guide, Rev. B
F–Troubleshooting<br />
<strong>QLogic</strong> MPI Troubleshooting<br />
The following message usually indicates a node failure or malfunctioning link in<br />
the fabric:<br />
Couldn’t connect to IP (LID=lid:port:subport). Time<br />
elapsed 00:00:30. Still trying...<br />
IP is the MPI rank’s IP address, and lid, port, and subport are the rank’s LID,<br />
port, and subport.<br />
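When scanning mpirun output for this message, it can help to pull the fields out programmatically. The following is a minimal sketch in Python, assuming the message has the shape `Couldn't connect to IP (LID=lid:port:subport). Time elapsed HH:MM:SS. Still trying...` on a single line and that the LID, port, and subport are printed as decimal integers. The function name `parse_connect_failure` and the sample values are hypothetical.

```python
import re

# Assumed message shape (field layout inferred from the description above):
#   Couldn't connect to IP (LID=lid:port:subport). Time elapsed HH:MM:SS. Still trying...
# Both the ASCII and the typographic apostrophe are accepted.
CONNECT_RE = re.compile(
    r"Couldn['\u2019]t connect to (?P<ip>\S+) "
    r"\(LID=(?P<lid>\d+):(?P<port>\d+):(?P<subport>\d+)\)\. "
    r"Time elapsed (?P<elapsed>\d{2}:\d{2}:\d{2})\."
)


def parse_connect_failure(line):
    """Return the fields of a connection-retry message, or None if no match."""
    m = CONNECT_RE.search(line)
    if m is None:
        return None
    fields = m.groupdict()
    # Assumption: the three LID components are decimal integers.
    for key in ("lid", "port", "subport"):
        fields[key] = int(fields[key])
    return fields


# Hypothetical sample line; the address and LID values are made up.
sample = ("Couldn't connect to 192.168.0.12 (LID=5:1:0). "
          "Time elapsed 00:00:30. Still trying...")
info = parse_connect_failure(sample)
print(info)
```

Grouping the parsed results by `ip` or `lid` across all ranks is one way to see whether the failures point at a single node or link.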
If messages similar to the following are displayed, the program may be trying<br />
to receive into an invalid (unallocated) memory address, perhaps because of a logic<br />
error in the program, usually related to malloc/free:<br />
ipath_update_tid_err: Failed TID update for rendezvous, allocation<br />
problem<br />
kernel: infinipath: get_user_pages (0x41 pages starting at<br />
0x2aaaaeb50000<br />
kernel: infinipath: Failed to lock addr 0002aaaaeb50000, 65 pages:<br />
errno 12<br />
TID is short for Token ID, and is part of the <strong>QLogic</strong> hardware. This error indicates<br />
a failure of the program, not the hardware or driver.<br />
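The numbers in the kernel lines above can be cross-checked with a few lines of Python. This is a minimal sketch that only decodes the quoted example: it shows that `0x41` pages and `65 pages` are the same count, that both lines refer to the same starting address, and that errno 12 corresponds to `ENOMEM` on Linux. The regular expression is an assumption based on the single example line, not a documented log format.

```python
import errno
import re

# Kernel log line quoted from the example above (joined onto one line).
line = ("kernel: infinipath: Failed to lock addr 0002aaaaeb50000, "
        "65 pages: errno 12")

# Assumed layout: hexadecimal address, decimal page count, decimal errno.
m = re.search(
    r"Failed to lock addr (?P<addr>[0-9a-fA-F]+), "
    r"(?P<pages>\d+) pages: errno (?P<errno>\d+)",
    line,
)
addr = int(m.group("addr"), 16)
pages = int(m.group("pages"))
err = int(m.group("errno"))

# The get_user_pages line reports the same region as 0x41 pages
# starting at 0x2aaaaeb50000.
same_region = (pages == 0x41 and addr == 0x2AAAAEB50000)

# Map the numeric errno to its symbolic name (ENOMEM for 12 on Linux).
err_name = errno.errorcode.get(err, "unknown")
print(hex(addr), pages, err_name, same_region)
```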
MPI Messages<br />
Some MPI error messages are issued from the parts of the code inherited from<br />
the MPICH implementation. See the MPICH documentation for message<br />
descriptions. This section discusses the error messages specific to the <strong>QLogic</strong><br />
MPI implementation.<br />
These messages appear in the mpirun output. Most are followed by an abort,<br />
and possibly a backtrace. Each is preceded by the name of the function where the<br />
exception occurred.<br />
The following message is always followed by an abort. The processlabel is<br />
usually in the form of the host name followed by process rank:<br />
processlabel Fatal Error in filename line_no: error_string<br />
At the time of publication, the possible error_strings are:<br />
Illegal label format character.<br />
Memory allocation failed.<br />
Error creating shared memory object.<br />
Error setting size of shared memory object.<br />
Error mmapping shared memory.<br />
Error opening shared memory object.<br />
Error attaching to shared memory.<br />
Node table has inconsistent len! Hdr claims %d not %d<br />
Timeout waiting %d seconds to receive peer node table from mpirun<br />
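When filtering mpirun output for these fatal errors, the format and the error_string list above can be turned into patterns. The following is a minimal Python sketch under several assumptions: the line number appears as a decimal integer after the file name (optionally preceded by the word `line`), the message is on a single line, and `%d` stands for a printf-style integer. The names `template_to_regex` and `classify_fatal`, and the sample process label and file name, are hypothetical.

```python
import re

# error_string templates as listed above; %d is a printf-style integer.
ERROR_TEMPLATES = [
    "Illegal label format character.",
    "Memory allocation failed.",
    "Error creating shared memory object.",
    "Error setting size of shared memory object.",
    "Error mmapping shared memory.",
    "Error opening shared memory object.",
    "Error attaching to shared memory.",
    "Node table has inconsistent len! Hdr claims %d not %d",
    "Timeout waiting %d seconds to receive peer node table from mpirun",
]


def template_to_regex(template):
    """Escape the literal text and turn each %d into an integer capture group."""
    parts = [re.escape(p) for p in template.split("%d")]
    return re.compile(r"(-?\d+)".join(parts) + r"$")


ERROR_PATTERNS = [(t, template_to_regex(t)) for t in ERROR_TEMPLATES]

# Assumed line shape: processlabel Fatal Error in filename line_no: error_string
FATAL_RE = re.compile(
    r"^(?P<label>.+?) Fatal Error in (?P<filename>\S+) "
    r"(?:line )?(?P<line_no>\d+): (?P<error>.*)$"
)


def classify_fatal(line):
    """Split a fatal-error line into its fields and match the error_string."""
    m = FATAL_RE.match(line.strip())
    if m is None:
        return None
    result = m.groupdict()
    result["line_no"] = int(result["line_no"])
    result["template"] = None   # stays None for strings not in the list
    result["values"] = []
    for template, pattern in ERROR_PATTERNS:
        em = pattern.match(result["error"])
        if em:
            result["template"] = template
            result["values"] = [int(v) for v in em.groups()]
            break
    return result


# Hypothetical sample; the label, file name, and numbers are made up.
sample = ("node01:3 Fatal Error in example.c 212: "
          "Timeout waiting 60 seconds to receive peer node table from mpirun")
print(classify_fatal(sample))
```

Because the list is described as current only at the time of publication, the sketch leaves `template` as `None` for unrecognized strings rather than treating them as errors.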
D000046-005 B F-27