OOM.” On a large-memory Altix system, it can easily take much longer than that to complete the necessary calls to shrink_caches(). The result is that an Altix system never goes OOM in spite of the fact that swap space is full and there is no memory to be allocated.
It seemed to us that part of the problem here is the amount of time it can take for a swap-full condition (readily detectable in try_to_swap_out()) to bubble all the way up to the top level in try_to_free_pages_zone(), especially on a large-memory machine.
To solve this problem on Altix, we decided to drive the OOM killer directly off detection of a swap-space-full condition, provided that the system also continues to try to swap out additional pages. A count of the number of successful and unsuccessful swap attempts is maintained in try_to_swap_out(). If, in a 10-second interval, the number of successful swap-outs is less than one percent of the number of attempted swap-outs, and the total number of swap-out attempts exceeds a specified threshold, then try_to_swap_out() will directly wake the OOM killer thread (also new in our implementation). This thread will wait another 10 seconds, and if the out-of-swap condition persists, it will invoke oom_kill() to select a victim and kill it. The OOM killer thread will repeat this sleep-and-kill cycle until it appears that swap space is no longer full or the number of attempts to swap out new pages (since the thread went to sleep) falls below the threshold.
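The fragment below is a minimal, compilable sketch of that rate heuristic and the killer thread's cycle. The identifiers (swap_attempts, swap_successes, oom_kill_victim) and the attempt threshold are illustrative assumptions, not the names or values used in the actual Altix patch; in the real implementation the counters live in try_to_swap_out() and the victim selection is done by oom_kill().

/*
 * Sketch of the swap-failure heuristic: the swap-out path bumps the
 * counters on every attempt, and the check below runs once per interval.
 */
#include <stdbool.h>
#include <unistd.h>

#define OOM_CHECK_INTERVAL     10     /* seconds between evaluations      */
#define OOM_ATTEMPT_THRESHOLD  1000   /* minimum attempts before acting   */

static unsigned long swap_attempts;   /* bumped on every swap-out attempt */
static unsigned long swap_successes;  /* bumped only when a page swaps    */

/* Stand-in for oom_kill(): select a victim process and kill it. */
static void oom_kill_victim(void) { /* ... */ }

/*
 * Evaluated once per interval.  Returns true when swap looks full:
 * enough attempts were made, but fewer than 1% of them succeeded.
 * Resetting the counters starts a fresh interval, so the next check
 * sees only activity since the last evaluation.
 */
static bool swap_looks_full(void)
{
    bool full = swap_attempts > OOM_ATTEMPT_THRESHOLD &&
                swap_successes * 100 < swap_attempts;

    swap_attempts = swap_successes = 0;
    return full;
}

/*
 * Body of the OOM killer thread, woken by the swap-out path after one
 * bad interval.  It waits a further interval and kills only if the
 * condition persists, repeating until swap recovers or swap-out
 * attempts drop below the threshold.
 */
static void oom_thread_cycle(void)
{
    do {
        sleep(OOM_CHECK_INTERVAL);
        if (!swap_looks_full())
            break;
        oom_kill_victim();
    } while (1);
}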
In our experience, this has made invocation of the OOM killer much more reliable than it was before, at least on Altix. Once again, this implementation was for Linux 2.4.21; we are currently evaluating this problem and the associated fix on Linux 2.6.
Another fix we have made to the VM system in Linux 2.4.21 for Altix is in the handling of HUGETLB pages. The existing implementation in Linux 2.4.21 allocates HUGETLB pages to an address space at mmap() time (see hugetlb_prefault()); it also zeroes the pages at this time. This processing is done by the thread that makes the mmap() call. In particular, this means that zeroing of the allocated HUGETLB pages is done by a single processor. On a machine with 4 TB of memory and with as much memory allocated to HUGETLB pages as possible, our measurements have shown that it can take as long as 5,000 seconds to allocate and zero all available HUGETLB pages. Worse yet, the thread that does this operation holds the address space's mmap_sem and the page_table_lock for the entire 5,000 seconds. Unfortunately, many commands that query system state (such as ps and w) also wish to acquire one of these locks. The result is that the system appears to be hung for the entire 5,000 seconds.
We solved this problem on Altix by changing the implementation of HUGETLB page allocation from prefault to allocate-on-fault. Many others have created similar patches; our patch was unique in that it also allowed zeroing of pages to occur in parallel if the HUGETLB page faults occurred on different processors. This was crucial to allow a large HUGETLB page region to be faulted into an address space in parallel, using as many processors as possible. For example, we have observed speedups of 25x using 16 processors to touch O(100 GB) of HUGETLB pages. (The speedup is superlinear because if you use just one processor it has to zero many remote pages, whereas if you use more processors, at least some of the pages you are zeroing are local or on nearby nodes.) Assuming we can achieve the same kind of speedup on a 4 TB system, we would reduce the 5,000-second time stated above to 200 seconds.
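To make the parallel first-touch idea concrete, the program below is an illustrative user-space analogue: each thread touches its own slice of a huge-page mapping, so the allocate-and-zero work happens on whichever processor takes the fault. The MAP_HUGETLB flag shown is a later-kernel convenience (the 2.4.21-era systems described here would map a hugetlbfs file instead), and the thread count and region sizes are arbitrary example values, not the configuration we measured.

/*
 * Parallel first-touch of a huge-page region: one thread per slice, so
 * page allocation and zeroing occur on the faulting CPUs in parallel.
 */
#define _GNU_SOURCE
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>

#define HUGE_PAGE   (2UL * 1024 * 1024)   /* assume 2 MB huge pages */
#define NTHREADS    16
#define PER_THREAD  (64 * HUGE_PAGE)      /* 128 MB per thread      */

struct slice { char *base; size_t len; };

static void *touch_slice(void *arg)
{
    struct slice *s = arg;

    /* Writing one byte per huge page forces the fault that allocates
     * and zeroes that page on the CPU running this thread. */
    for (size_t off = 0; off < s->len; off += HUGE_PAGE)
        s->base[off] = 1;
    return NULL;
}

int main(void)
{
    size_t total = NTHREADS * PER_THREAD;
    char *p = mmap(NULL, total, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    pthread_t tid[NTHREADS];
    struct slice s[NTHREADS];
    for (int i = 0; i < NTHREADS; i++) {
        s[i] = (struct slice){ p + i * PER_THREAD, PER_THREAD };
        pthread_create(&tid[i], NULL, touch_slice, &s[i]);
    }
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(tid[i], NULL);

    munmap(p, total);
    return 0;
}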
Recently, we have worked with Kenneth Chen