

OOM.” On a large-memory Altix system, it can easily take much longer than that to complete the necessary calls to shrink_caches(). The result is that an Altix system never goes OOM in spite of the fact that swap space is full and there is no memory to be allocated.

It seemed to us that part of the problem here is the amount of time it can take for a swap-full condition (readily detectable in try_to_swap_out()) to bubble all the way up to the top level in try_to_free_pages_zone(), especially on a large-memory machine. To solve this problem on Altix, we decided to drive the OOM killer directly off detection of the swap-space-full condition, provided that the system also continues to try to swap out additional pages. A count of the number of successful swaps and unsuccessful swap attempts is maintained in try_to_swap_out(). If, in a 10-second interval, the number of successful swap-outs is less than one percent of the number of attempted swap-outs, and the total number of swap-out attempts exceeds a specified threshold, then try_to_swap_out() will directly wake the OOM killer thread (also new in our implementation). This thread will wait another 10 seconds and, if the out-of-swap condition persists, it will invoke oom_kill() to select a victim and kill it. The OOM killer thread will repeat this sleep-and-kill cycle until it appears that swap space is no longer full or the number of attempts to swap out new pages (since the thread went to sleep) falls below the threshold.
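
The following stand-alone C sketch illustrates the shape of this heuristic. The counter names, the activity threshold value, and the helper should_wake_oom_killer() are illustrative stand-ins for the counters kept in try_to_swap_out(), not the actual Altix patch.

#include <stdio.h>

#define SWAP_ATTEMPT_THRESHOLD 1000UL  /* assumed activity threshold */

/* Counters that the kernel would maintain in try_to_swap_out()
 * and reset at the end of each 10-second interval. */
static unsigned long attempted_swapouts;
static unsigned long successful_swapouts;

/* Decide, once per interval, whether to wake the OOM killer thread. */
static int should_wake_oom_killer(void)
{
    /* Only fire if the system is actively trying to swap pages out... */
    if (attempted_swapouts <= SWAP_ATTEMPT_THRESHOLD)
        return 0;
    /* ...and fewer than one percent of the attempts succeeded. */
    return successful_swapouts * 100 < attempted_swapouts;
}

int main(void)
{
    /* Simulated 10-second window: many attempts, almost no successes. */
    attempted_swapouts = 50000;
    successful_swapouts = 120;

    if (should_wake_oom_killer())
        printf("swap appears full: wake the OOM killer thread\n");
    else
        printf("swap is still making progress\n");

    /* In the kernel the counters would be reset for the next interval. */
    attempted_swapouts = successful_swapouts = 0;
    return 0;
}

In the real patch, of course, the check fires from inside try_to_swap_out() and only wakes the dedicated thread; the thread itself re-checks after a further 10 seconds before calling oom_kill().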

In our experience, this has made invocation of the OOM killer much more reliable than it was before, at least on Altix. Once again, this implementation was for Linux 2.4.21; we are in the process of evaluating this problem and the associated fix on Linux 2.6 at the present time.

Another fix we have made to the VM system in Linux 2.4.21 for Altix is in the handling of HUGETLB pages. The existing implementation in Linux 2.4.21 allocates HUGETLB pages to an address space at mmap() time (see hugetlb_prefault()); it also zeroes the pages at this time. This processing is done by the thread that makes the mmap() call. In particular, this means that zeroing of the allocated HUGETLB pages is done by a single processor. On a machine with 4 TB of memory and with as much memory allocated to HUGETLB pages as possible, our measurements have shown that it can take as long as 5,000 seconds to allocate and zero all available HUGETLB pages. Worse yet, the thread that does this operation holds the address space’s mmap_sem and the page_table_lock for the entire 5,000 seconds. Unfortunately, many commands that query system state (such as ps and w) also wish to acquire one of these locks. The result is that the system appears to be hung for the entire 5,000 seconds.
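
To get a feel for the serial-zeroing cost, the user-space sketch below times a single thread zeroing a 1 GB buffer. The buffer size and the use of memset() in place of the kernel’s page zeroing are illustrative assumptions; this is not the hugetlb_prefault() code itself, and it ignores allocation and page-table overhead entirely.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define REGION_SIZE (1UL << 30)   /* 1 GB test region (illustrative) */

int main(void)
{
    char *buf = malloc(REGION_SIZE);
    if (!buf) {
        perror("malloc");
        return 1;
    }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    memset(buf, 0, REGION_SIZE);          /* serial zeroing on one CPU */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    /* Scaling the measured rate: 4 TB = 4096 GB, so zeroing it serially
     * at this rate takes roughly 4096 * secs seconds. */
    printf("zeroed 1 GB in %.2f s; 4 TB at this rate: ~%.0f s\n",
           secs, 4096.0 * secs);

    free(buf);
    return 0;
}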

We solved this problem on Altix by changing the implementation of HUGETLB page allocation from prefault to allocate on fault. Many others have created similar patches; our patch was unique in that it also allowed zeroing of pages to occur in parallel if the HUGETLB page faults occurred on different processors. This was crucial to allow a large HUGETLB page region to be faulted into an address space in parallel, using as many processors as possible. For example, we have observed speedups of 25x using 16 processors to touch O(100 GB) of HUGETLB pages. (The speedup is superlinear because if you use just one processor it has to zero many remote pages, whereas if you use more processors, at least some of the pages you are zeroing are local or on nearby nodes.) Assuming we can achieve the same kind of speedup on a 4 TB system, we would reduce the 5,000-second time stated above to 200 seconds.
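
The sketch below shows the first-touch pattern that benefits from allocate-on-fault: each thread touches its own slice of a hugetlbfs mapping, so the faults, and the zeroing they trigger, run on different processors in parallel. The mount point, thread count, region size, and 2 MB page stride are assumptions chosen for illustration only.

#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

#define NTHREADS    16
#define REGION_SIZE (1UL << 30)   /* 1 GB region (illustrative)  */
#define PAGE_STRIDE (2UL << 20)   /* assumed 2 MB HUGETLB pages  */

static char *region;

/* Each thread first-touches one byte per huge page in its own slice,
 * so the faults (and the zeroing they trigger) run concurrently. */
static void *touch_slice(void *arg)
{
    long id = (long)arg;
    size_t slice = REGION_SIZE / NTHREADS;
    char *p = region + (size_t)id * slice;

    for (size_t off = 0; off < slice; off += PAGE_STRIDE)
        p[off] = 0;
    return NULL;
}

int main(void)
{
    pthread_t tid[NTHREADS];

    /* Illustrative hugetlbfs file; with allocate-on-fault the pages
     * are only allocated and zeroed when touched below. */
    int fd = open("/mnt/hugetlbfs/region", O_CREAT | O_RDWR, 0600);
    if (fd < 0) {
        perror("open");
        return 1;
    }
    if (ftruncate(fd, REGION_SIZE) < 0) {
        perror("ftruncate");
        return 1;
    }
    region = mmap(NULL, REGION_SIZE, PROT_READ | PROT_WRITE,
                  MAP_SHARED, fd, 0);
    if (region == MAP_FAILED) {
        perror("mmap");
        return 1;
    }

    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&tid[i], NULL, touch_slice, (void *)i);
    for (long i = 0; i < NTHREADS; i++)
        pthread_join(tid[i], NULL);

    printf("faulted %lu MB using %d threads\n",
           (unsigned long)(REGION_SIZE >> 20), NTHREADS);
    munmap(region, REGION_SIZE);
    close(fd);
    return 0;
}

Build with cc -pthread. Under a prefaulting kernel the entire region would instead be allocated and zeroed inside the single mmap() call, which is exactly the serialization the allocate-on-fault change removes.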

Recently, we have worked with Kenneth Chen
