27.12.2014 Views

QLogic OFED+ Host Software User Guide, Rev. B


F–Troubleshooting

QLogic MPI Troubleshooting

The following message usually indicates a node failure or a malfunctioning link in the fabric:

Couldn’t connect to <IP> (LID=<lid>:<port>:<subport>). Time elapsed 00:00:30. Still trying...

<IP> is the MPI rank’s IP address, and <lid>, <port>, and <subport> are the rank’s lid, port, and subport.
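When scanning long mpirun logs for this message, it can help to pull out the peer that could not be reached. The following is a minimal Python sketch, assuming the message carries the rank's IP address followed by the lid, port, and subport separated by colons inside the `LID=` parentheses. The exact format of each field is not given in this guide, so the pattern matches them loosely. The function name and the sample values are hypothetical.

```python
import re

# Assumed layout: "Couldn't connect to IP (LID=lid:port:subport). Time elapsed HH:MM:SS."
# "Couldn.t" accepts either a straight or a typographic apostrophe.
CONNECT_RE = re.compile(
    r"Couldn.t connect to (?P<ip>\S+) "
    r"\(LID=(?P<lid>[^:]+):(?P<port>[^:]+):(?P<subport>[^)]+)\)\. "
    r"Time elapsed (?P<elapsed>\d+:\d+:\d+)"
)


def parse_connect_failure(line):
    """Return the peer fields from a connection-retry message, or None."""
    match = CONNECT_RE.search(line)
    return match.groupdict() if match else None


# Hypothetical sample line; the address and LID values are made up.
sample = "Couldn't connect to 192.0.2.7 (LID=5:1:0). Time elapsed 00:00:30. Still trying..."
info = parse_connect_failure(sample)
print(info)
```

Collecting the `ip` and `lid` values across many such lines may help narrow down which node or link is involved.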

If messages similar to the following display, it may mean that the program is trying to receive into an invalid (unallocated) memory address, perhaps due to a logic error in the program, usually related to malloc/free:

ipath_update_tid_err: Failed TID update for rendezvous, allocation problem

kernel: infinipath: get_user_pages (0x41 pages starting at 0x2aaaaeb50000

kernel: infinipath: Failed to lock addr 0002aaaaeb50000, 65 pages: errno 12

TID is short for Token ID, and is part of the QLogic hardware. This error indicates a failure of the program, not the hardware or driver.
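The numeric values in the kernel lines can be cross-checked by hand. The short Python sketch below decodes them: the hexadecimal page count 0x41 is the same 65 pages reported in the last line, both lines refer to the same address, and errno 12 corresponds to ENOMEM on Linux. The variable names are illustrative only.

```python
import errno
import os

# Page count from the get_user_pages line, given in hexadecimal.
page_count = int("0x41", 16)

# The two kernel lines print the same address with different zero padding.
same_address = int("0x2aaaaeb50000", 16) == int("0002aaaaeb50000", 16)

# Symbolic name and platform-specific description for errno 12.
error_name = errno.errorcode[12]
error_text = os.strerror(12)

print(page_count, same_address, error_name, error_text)
```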

MPI Messages

Some MPI error messages are issued from parts of the code inherited from the MPICH implementation. See the MPICH documentation for message descriptions. This section discusses the error messages specific to the QLogic MPI implementation.

These messages appear in the mpirun output. Most are followed by an abort, and possibly a backtrace. Each is preceded by the name of the function where the exception occurred.

The following message is always followed by an abort. The processlabel is usually the host name followed by the process rank:

processlabel Fatal Error in filename line_no: error_string<br />

At the time of publication, the possible error_strings are:

Illegal label format character.

Memory allocation failed.

Error creating shared memory object.

Error setting size of shared memory object.

Error mmapping shared memory.

Error opening shared memory object.

Error attaching to shared memory.

Node table has inconsistent len! Hdr claims %d not %d

Timeout waiting %d seconds to receive peer node table from mpirun
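The fatal-error format and the list of error strings above lend themselves to a small log filter. The following Python sketch is one possible approach, not part of the QLogic software. It assumes that the file name contains no spaces and that line_no is a decimal integer, and it treats each %d in the listed strings as a decimal number. The function names, the sample label `node01:3`, and the file name `shm.c` are hypothetical.

```python
import re

# Assumed layout: "processlabel Fatal Error in filename line_no: error_string".
FATAL_RE = re.compile(
    r"^(?P<label>.+?) Fatal Error in (?P<filename>\S+) (?P<line_no>\d+): (?P<error>.*)$"
)

# Error strings listed in this section; %d marks a number filled in at run time.
KNOWN_ERRORS = [
    "Illegal label format character.",
    "Memory allocation failed.",
    "Error creating shared memory object.",
    "Error setting size of shared memory object.",
    "Error mmapping shared memory.",
    "Error opening shared memory object.",
    "Error attaching to shared memory.",
    "Node table has inconsistent len! Hdr claims %d not %d",
    "Timeout waiting %d seconds to receive peer node table from mpirun",
]


def template_to_regex(template):
    """Escape the literal text and let each %d match a decimal integer."""
    parts = [re.escape(part) for part in template.split("%d")]
    return re.compile(r"(-?\d+)".join(parts))


ERROR_PATTERNS = [template_to_regex(t) for t in KNOWN_ERRORS]


def parse_fatal(line):
    """Split a fatal-error line into its fields, or return None if it does not match."""
    match = FATAL_RE.match(line.strip())
    if match is None:
        return None
    fields = match.groupdict()
    fields["line_no"] = int(fields["line_no"])
    # True when the error string is one of those listed in this section.
    fields["known"] = any(p.fullmatch(fields["error"]) for p in ERROR_PATTERNS)
    return fields


# Hypothetical sample line; the label and file name are made up.
sample = ("node01:3 Fatal Error in shm.c 120: "
          "Timeout waiting 60 seconds to receive peer node table from mpirun")
parsed = parse_fatal(sample)
print(parsed)
```

A line whose error string is not in the list still parses, with `known` set to False, which may point to a message inherited from MPICH or one added after publication.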

D000046-005 B F-27
