QLogic OFED+ Host Software User Guide, Rev. B
F–Troubleshooting<br />
<strong>QLogic</strong> MPI Troubleshooting<br />
The following message displays after installation:<br />
$ mpirun -m ~/tmp/sm -np 2 -mpi_latency 1000 1000000<br />
node-00:1.ipath_update_tid_err: failed: Cannot allocate memory<br />
mpi_latency:<br />
/fs2/scratch/infinipath-build-1.3/mpi-1.3/mpich/psm/src<br />
mq_ips.c:691:<br />
mq_ipath_sendcts: Assertion ‘rc == 0’ failed. MPIRUN: Node program<br />
unexpectedly quit. Exiting.<br />
You can check the ulimit -l value on all nodes by running ipath_checkout. A<br />
warning similar to the following displays if ulimit -l is less than 4096:<br />
!!!ERROR!!! Lockable memory less than 4096KB on x nodes<br />
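For a quick check on a single node without running ipath_checkout, the comparison can be sketched in shell. This is a minimal illustration, not part of the QLogic software: the function names locked_limit_ok and report_locked_limit are hypothetical, and the 4096 KB threshold is taken from the warning text above.

```shell
#!/bin/sh
# Minimum lockable memory in KB; 4096 is the threshold named in the warning text.
MIN_LOCKED_KB=4096

# locked_limit_ok LIMIT [MIN]  (hypothetical helper)
# Succeeds if LIMIT (as printed by `ulimit -l`) is "unlimited" or at least MIN KB.
# Any non-numeric value other than "unlimited" is treated as a failure.
locked_limit_ok() {
    limit=$1
    min=${2:-$MIN_LOCKED_KB}
    [ "$limit" = "unlimited" ] && return 0
    [ "$limit" -ge "$min" ] 2>/dev/null
}

# report_locked_limit  (hypothetical helper)
# Prints a one-line verdict for the limit in effect in the current shell.
report_locked_limit() {
    current=$(ulimit -l)
    if locked_limit_ok "$current"; then
        echo "ok: lockable memory limit is $current"
    else
        echo "too low: lockable memory limit is $current, need at least $MIN_LOCKED_KB KB"
        return 1
    fi
}
```

Calling report_locked_limit in the same kind of shell that launches the MPI job (for example, inside a batch job script) shows the limit that job would actually inherit, which may differ from the limit in an interactive login.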
To fix this error, install the infinipath RPM on the node and reboot it to ensure<br />
that /etc/initscript is run.<br />
Alternatively, you can create your own /etc/initscript and set the ulimit<br />
there.<br />
Error Creating Shared Memory Object<br />
<strong>QLogic</strong> MPI (and PSM) use Linux’s shared memory mapped files to share<br />
memory within a node. When an MPI job is started, a shared memory file is<br />
created on each node for all MPI ranks sharing memory on that one node. During<br />
job execution, the shared memory file remains in /dev/shm. At program exit, the
operating system removes the file automatically, provided the <strong>QLogic</strong> MPI
(InfiniPath) library exits properly. As a backup, the sequence of commands that
mpirun invokes for every MPI job launch also removes the file explicitly at
program termination.
However, when a program is terminated hard and explicitly (for example, by
kill -9 on the mpirun process), <strong>QLogic</strong> MPI cannot guarantee that the
/dev/shm file is properly removed. When many stale files accumulate on a node,
an error message like the following can appear at startup:
node023:6.Error creating shared memory object in shm_open(/dev/shm<br />
may have stale shm files that need to be removed):<br />
If this occurs, administrators should clean up all stale files by running this
command (as the root user):
# rm -rf /dev/shm/psm_shm.*<br />
You can also selectively identify stale files by using a combination of the fuser,<br />
ps, and rm commands (all files start with the psm_shm prefix). Once identified,<br />
you can issue rm commands on the stale files that you own.<br />
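The selective cleanup described above can be sketched as a small shell script. This is an illustration under stated assumptions, not a QLogic-supplied tool: the function names are hypothetical, only fuser is used to decide whether a file is still open (the ps step is left out), and the -s (silent) flag assumes the psmisc version of fuser found on most Linux distributions.

```shell
#!/bin/sh
# Directory and file prefix come from the text; SHM_DIR can be overridden to try
# the script against a scratch directory first.
SHM_DIR=${SHM_DIR:-/dev/shm}

# file_in_use FILE  (hypothetical helper)
# Succeeds if some process has FILE open or mapped.
# Assumes psmisc fuser, where -s suppresses all output.
file_in_use() {
    fuser -s "$1" 2>/dev/null
}

# list_stale_shm  (hypothetical helper)
# Prints each psm_shm file that you own and that no process is using.
list_stale_shm() {
    for f in "$SHM_DIR"/psm_shm.*; do
        [ -f "$f" ] || continue      # skip the literal pattern when nothing matches
        [ -O "$f" ] || continue      # only consider files owned by the current user
        file_in_use "$f" && continue # still in use by a running job: leave it alone
        echo "$f"
    done
}

# remove_stale_shm  (hypothetical helper)
# Deletes every file reported by list_stale_shm.
remove_stale_shm() {
    list_stale_shm | while IFS= read -r f; do
        rm -f -- "$f"
    done
}
```

Running list_stale_shm first and reviewing its output before calling remove_stale_shm is the safer order. When run as a non-root user, fuser may not see processes owned by other users, which is one reason the sketch restricts itself to files you own.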
D000046-005 B F-23