27.12.2014 Views

QLogic OFED+ Host Software User Guide, Rev. B


F–Troubleshooting

QLogic MPI Troubleshooting

The following message displays after installation:

$ mpirun -m ~/tmp/sm -np 2 -mpi_latency 1000 1000000
node-00:1.ipath_update_tid_err: failed: Cannot allocate memory
mpi_latency:
/fs2/scratch/infinipath-build-1.3/mpi-1.3/mpich/psm/src
mq_ips.c:691:
mq_ipath_sendcts: Assertion ‘rc == 0’ failed. MPIRUN: Node program
unexpectedly quit. Exiting.

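The "Cannot allocate memory" failure suggests that the lockable-memory limit on the node is too low. You can check the lockable-memory limit (ulimit -l) on all nodes by running ipath_checkout. A warning similar to the following displays if ulimit -l is less than 4096:

!!!ERROR!!! Lockable memory less than 4096KB on x nodes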
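For a quick spot check on a single node, the comparison against the 4096 KB threshold can be sketched in shell as follows. This is only an illustration, not part of the QLogic tools: the function name memlock_ok is hypothetical, the threshold is taken from the warning above, and a limit of unlimited is assumed to be sufficient.

```shell
#!/bin/sh
# Hypothetical helper (not part of the QLogic tools): succeeds if a
# locked-memory limit, as printed by `ulimit -l` (in KB), meets a minimum.
memlock_ok() {
    limit="$1"            # value as printed by `ulimit -l`
    min_kb="${2:-4096}"   # threshold from the ipath_checkout warning
    case "$limit" in
        unlimited) return 0 ;;
        ''|*[!0-9]*) return 2 ;;   # not a number: cannot decide
    esac
    [ "$limit" -ge "$min_kb" ]
}

# Check the limit of the current shell on this node.
current="$(ulimit -l 2>/dev/null)"
rc=0
memlock_ok "$current" || rc=$?
case "$rc" in
    0) echo "lockable memory OK: $current" ;;
    1) echo "lockable memory too low: $current KB (need at least 4096 KB)" ;;
    *) echo "could not read the lockable-memory limit" ;;
esac
```

Note that this only reports the limit of the shell it runs in; a job started through a batch system or a remote shell may see a different limit, which is why ipath_checkout remains the check across all nodes.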

To fix this error, install the infinipath RPM on the node and reboot it to ensure that /etc/initscript is run.

Alternatively, you can create your own /etc/initscript and set the ulimit there.
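As a rough illustration of the second option, the sketch below writes a minimal initscript draft. The layout (raise the limit, then eval exec "$4") follows the sysvinit initscript(5) convention, in which init passes the command to run as the fourth argument. The function name write_initscript, the draft path, and the limit value of 131072 KB are assumptions made for illustration; pick a limit suited to your site. The sketch deliberately writes to a path you supply instead of /etc/initscript so that the result can be reviewed before it is installed.

```shell
#!/bin/sh
# Hypothetical helper: writes a minimal initscript to the path given in $1,
# setting the locked-memory limit (in KB) given in $2.
write_initscript() {
    target="$1"
    limit_kb="$2"
    cat > "$target" <<EOF
# initscript: run by init for every process it spawns (see initscript(5)).
# Raise the locked-memory limit, in KB.
ulimit -l $limit_kb
# Run the command that init passed as the fourth argument.
eval exec "\$4"
EOF
}

# Example: generate a draft with an illustrative (assumed) 131072 KB limit.
draft="${TMPDIR:-/tmp}/initscript.draft"
write_initscript "$draft" 131072
cat "$draft"
```

After reviewing the draft, copy it to /etc/initscript as root and reboot the node so that init picks it up.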

Error Creating Shared Memory Object<br />

QLogic MPI (and PSM) use Linux shared memory mapped files to share memory within a node. When an MPI job starts, a shared memory file is created on each node for all MPI ranks sharing memory on that node. During job execution, the shared memory file remains in /dev/shm. At program exit, the operating system removes the file automatically when the QLogic MPI (InfiniPath) library exits properly. As an additional backup, the sequence of commands that mpirun invokes during every MPI job launch explicitly removes the file at program termination.

However, under circumstances such as hard, explicit program termination (for example, kill -9 on the mpirun process), QLogic MPI cannot guarantee that the /dev/shm file is properly removed. As stale files accumulate on each node, an error message like the following can appear at startup:

node023:6.Error creating shared memory object in shm_open(/dev/shm may have stale shm files that need to be removed):

If this occurs, administrators should clean up all stale files by running the following command as the root user:

# rm -rf /dev/shm/psm_shm.*

You can also selectively identify stale files by using a combination of the fuser, ps, and rm commands (all files start with the psm_shm prefix). Once identified, you can issue rm commands on the stale files that you own.
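The selective approach might be sketched as follows. This is a minimal example under stated assumptions, not a QLogic-supplied tool: the function name list_stale_psm_shm is hypothetical, and it relies on the psmisc version of fuser, whose -s (silent) option exits with a non-zero status when no process is accessing the file. It only lists candidates and removes nothing. When run as a non-root user, fuser may not see processes that belong to other users, so the sketch is limited to files owned by the current user.

```shell
#!/bin/sh
# Hypothetical helper: print psm_shm.* files in a directory (default /dev/shm)
# that the current user owns and that no process currently has open.
list_stale_psm_shm() {
    dir="${1:-/dev/shm}"
    command -v fuser >/dev/null 2>&1 || {
        echo "fuser not found; cannot tell which files are in use" >&2
        return 2
    }
    for f in "$dir"/psm_shm.*; do
        [ -f "$f" ] || continue          # skip an unmatched glob pattern
        [ -O "$f" ] || continue          # only files owned by this user
        # fuser exits non-zero when no process is accessing the file.
        if ! fuser -s "$f" 2>/dev/null; then
            printf '%s\n' "$f"
        fi
    done
}

# List candidates only; review the output before removing anything.
list_stale_psm_shm /dev/shm || echo "stale-file check skipped" >&2
```

Cross-check the listed files against running jobs with ps, and then remove the ones that are stale with rm.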

D000046-005 B F-23
