Lustre 1.6 Operations Manual

You may also receive this error if the MDS runs out of free blocks. Since the output of df is an aggregate of the data from the MDS and all of the OSTs, it may not show that the filesystem is full when one of the OSTs has run out of space. To determine which OST or MDS is running out of space, check the free space and inodes on a client:

grep '[0-9]' /proc/fs/lustre/osc/*/kbytes{free,avail,total}
grep '[0-9]' /proc/fs/lustre/osc/*/files{free,total}
grep '[0-9]' /proc/fs/lustre/mdc/*/kbytes{free,avail,total}
grep '[0-9]' /proc/fs/lustre/mdc/*/files{free,total}
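The grep commands above print raw counters. The following Python script is a minimal sketch that turns those counters into percentages used and lists the fullest target first. It assumes the same per-device files (kbytesavail, kbytestotal, filesfree, filestotal) exist under /proc/fs/lustre/osc/* and /proc/fs/lustre/mdc/*. The function names read_counter and collect_usage are hypothetical.

```python
import glob
import os


def read_counter(path):
    """Return the integer stored in a single-value /proc file."""
    with open(path) as f:
        return int(f.read().split()[0])


def collect_usage(root="/proc/fs/lustre"):
    """Return (device, kind, pct_blocks_used, pct_inodes_used), fullest first.

    Assumes each <root>/osc/* and <root>/mdc/* device directory contains
    kbytesavail, kbytestotal, filesfree and filestotal, as queried by the
    grep commands above.
    """
    rows = []
    for kind in ("osc", "mdc"):
        for dev in sorted(glob.glob(os.path.join(root, kind, "*"))):
            try:
                kb_avail = read_counter(os.path.join(dev, "kbytesavail"))
                kb_total = read_counter(os.path.join(dev, "kbytestotal"))
                f_free = read_counter(os.path.join(dev, "filesfree"))
                f_total = read_counter(os.path.join(dev, "filestotal"))
            except (OSError, ValueError, IndexError):
                continue  # skip plain files and entries without these counters
            blocks = 100.0 * (kb_total - kb_avail) / kb_total if kb_total else 0.0
            inodes = 100.0 * (f_total - f_free) / f_total if f_total else 0.0
            rows.append((os.path.basename(dev), kind, blocks, inodes))
    # Sort by whichever resource (blocks or inodes) is closer to exhaustion.
    rows.sort(key=lambda r: max(r[2], r[3]), reverse=True)
    return rows


if __name__ == "__main__":
    for name, kind, blocks, inodes in collect_usage():
        print(f"{kind} {name}: {blocks:5.1f}% blocks used, {inodes:5.1f}% inodes used")
```

Run the script on a client. It reports blocks and inodes separately, so a single full OST (or an MDS that is out of inodes) appears at the top of the list even when df shows plenty of aggregate free space. The script uses kbytesavail rather than kbytesfree, on the assumption that it reflects the space ordinary writes can actually use. Substitute kbytesfree if that is the figure you want.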

You can find other numeric error codes, along with their short names and textual descriptions, in /usr/include/asm/errno.h.
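You can also decode a numeric code without opening the header file: Python's standard errno and os modules expose the same symbolic names and descriptions. The sketch below uses a hypothetical helper, describe_errno. It takes the absolute value because kernel messages conventionally report errors as negative numbers such as -28.

```python
import errno
import os


def describe_errno(code):
    """Return 'NAME: description' for a numeric error code (sign ignored)."""
    code = abs(code)  # kernel-style return codes are negative, e.g. -28
    name = errno.errorcode.get(code, "UNKNOWN")
    return f"{name}: {os.strerror(code)}"


if __name__ == "__main__":
    print(describe_errno(-28))
```

On Linux, describe_errno(-28) returns "ENOSPC: No space left on device".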

22.3.15 Triggering Watchdog for PID NNN

In some cases, a server node triggers a watchdog timer. This causes a process stack to be dumped to the console and a Lustre kernel debug log to be dumped into /tmp (by default). A triggered watchdog does NOT mean that the thread OOPSed, but rather that it is taking longer than expected to complete a given operation. Sometimes this situation is expected. For example, if a RAID rebuild is significantly slowing down I/O on an OST, it might cause watchdog timers to trip. However, another message follows shortly thereafter, indicating that the thread in question has completed processing (after some number of seconds). Generally, this indicates a transient problem. In other cases, a watchdog may legitimately signal that a thread is stuck because of a software error (lock inversion, for example).
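A triggered watchdog does not mean the thread has crashed, because a watchdog only measures the time since a thread last reported progress. The class below is a simplified, hypothetical model of that idea. It is not Lustre's implementation in watchdog.c, and its message strings are invented for illustration. Times are passed in explicitly (in seconds) so the behavior is deterministic.

```python
class Watchdog:
    """Hypothetical per-thread watchdog: fires once when progress stalls."""

    def __init__(self, timeout, now=0.0):
        self.timeout = timeout      # seconds of inactivity allowed
        self.last_touch = now       # time of the last reported progress
        self.triggered = False

    def touch(self, now):
        """Called by the worker on progress; reports recovery after a trip."""
        msg = None
        if self.triggered:
            msg = f"thread completed after {now - self.last_touch:.0f}s"
            self.triggered = False
        self.last_touch = now
        return msg

    def check(self, now):
        """Called periodically by a monitor; fires only once per stall."""
        inactive = now - self.last_touch
        if not self.triggered and inactive > self.timeout:
            self.triggered = True
            return f"watchdog triggered: inactive for {inactive * 1000:.0f}ms"
        return None
```

For example, with Watchdog(timeout=100), check(101) fires once with "inactive for 101000ms" and later checks stay quiet. A subsequent touch(150) then reports that the thread completed after 150s. This sequence mirrors the transient RAID-rebuild case: the thread was slow, not dead. A thread that is genuinely stuck never calls touch() again.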

Lustre: 0:0:(watchdog.c:122:lcw_cb())

The message above indicates that the watchdog has been triggered for PID 933, which has been inactive for 100000 ms (100 seconds):

Lustre: 0:0:(linux-debug.c:132:portals_debug_dumpstack())
Showing stack for process:
933 ll_ost_25 D F896071A 0 933 1 934 932 (L-TLB)
f6d87c60 00000046 00000000 f896071a f8def7cc 00002710 00001822 2da48cae
0008cf1a f6d7c220 f6d7c3d0 f6d86000 f3529648 f6d87cc4 f3529640 f8961d3d
00000010 f6d87c9c ca65a13c 00001fff 00000001 00000001 00000000 00000001
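When scanning a long console log, it can help to extract the PID, thread name and scheduler state from the task line at the top of a stack dump. The sketch below assumes the line starts with the PID, followed by the thread name and a one-letter state, as in the dump above. The function parse_task_line and its regular expression are hypothetical, and the remaining fields are not interpreted.

```python
import re

# Common Linux task-state letters as printed in stack dumps (subset).
STATE_NAMES = {
    "R": "running",
    "S": "interruptible sleep",
    "D": "uninterruptible sleep",
    "T": "stopped",
    "Z": "zombie",
}

# PID, thread name, then a single upper-case state letter.
TASK_LINE = re.compile(r"^(?P<pid>\d+)\s+(?P<name>\S+)\s+(?P<state>[A-Z])\b")


def parse_task_line(line):
    """Return (pid, thread_name, state_description), or None if no match."""
    m = TASK_LINE.match(line.strip())
    if not m:
        return None
    state = m.group("state")
    return int(m.group("pid")), m.group("name"), STATE_NAMES.get(state, state)
```

For the task line above, the function returns (933, "ll_ost_25", "uninterruptible sleep"). The hexadecimal stack words that follow do not match, so the function returns None for them. A thread in state D is blocked waiting, often on I/O or a lock. That state fits both a slow RAID rebuild and a lock inversion, so on its own it does not tell you which case you are in.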

Lustre 1.6 Operations Manual • September 2008
