Linux Symposium 2004 • Volume One • 141
(4) We modified the generic_file_write() path to call page_cache_alloc_limited() instead of page_cache_alloc(). page_cache_alloc_limited() examines the size of the page cache. If the total amount of free memory in the system is less than page_cache_limit_threshold and the size of the page cache is larger than page_cache_limit, then page_cache_alloc_limited() calls page_cache_reduce() to free enough page-cache pages on the system to bring the page-cache size down below page_cache_limit. If this succeeds, then page_cache_alloc_limited() calls page_cache_alloc() to allocate the page. If not, then we wake up bdflush and the current thread is put to sleep for 30 ms (a tunable parameter).
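The decision logic above can be sketched in ordinary C. The function and parameter names mirror the text, but this is an illustrative userspace model of the policy, not the actual 2.4.21 kernel patch:

```c
#include <stddef.h>

enum pc_action { PC_ALLOC, PC_REDUCE_THEN_ALLOC, PC_THROTTLE };

/* Model of what page_cache_alloc_limited() decides, given the amount
 * of free memory, the current page-cache size, the two tunables, and
 * whether page_cache_reduce() would succeed. */
enum pc_action page_cache_alloc_limited(size_t free_mem,
                                        size_t cache_size,
                                        size_t limit_threshold,
                                        size_t cache_limit,
                                        int reduce_succeeds)
{
    /* Plenty of free memory, or cache under its limit: allocate normally. */
    if (free_mem >= limit_threshold || cache_size <= cache_limit)
        return PC_ALLOC;

    /* Try to free enough page-cache pages to get below cache_limit,
     * then allocate. */
    if (reduce_succeeds)
        return PC_REDUCE_THEN_ALLOC;

    /* Reduction failed: wake bdflush and sleep ~30 ms (tunable). */
    return PC_THROTTLE;
}
```

Note that throttling only happens when both conditions hold: memory is scarce *and* the cache is over its limit, which matches the rationale for having two tunables.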
The rationale for the reclaim_list and toss_fast() was that when we needed to trim the page cache, practically all pages in the system would typically be on the inactive list. The existing toss() routine scanned the inactive list and thus was too slow to call from generic_file_write(). Moreover, most of the pages on the inactive list were not reclaimable anyway, whereas most of the pages on the reclaim_list are reclaimable. As a result, toss_fast() runs much faster and is more efficient at releasing idle page-cache pages than the old routine.
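The speed difference can be illustrated with a small userspace sketch (the struct and function names here are hypothetical stand-ins for the kernel's data structures): the same scan routine examines many pages per page freed when walking a mostly-unreclaimable inactive list, but examines no more pages than it frees when walking a list that holds only reclaimable pages:

```c
#include <stddef.h>

struct page {
    int reclaimable;        /* 1 if the page can be freed immediately */
    struct page *next;      /* link on the inactive or reclaim list   */
};

/* Walk a list, freeing up to 'want' reclaimable pages.  Returns the
 * number freed; *scanned counts how many pages had to be examined.
 * Passed the inactive list, this models the old toss(); passed a
 * reclaim list of known-reclaimable pages, it models toss_fast(). */
int scan_and_free(struct page *list, int want, int *scanned)
{
    int freed = 0;
    for (struct page *p = list; p && freed < want; p = p->next) {
        (*scanned)++;
        if (p->reclaimable)
            freed++;
    }
    return freed;
}
```

The work done is proportional to the length of the list walked, which is why keeping reclaimable pages on their own short list makes trimming cheap enough to do from the write path.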
The rationale for the page_cache_limit_threshold in addition to the page_cache_limit is that if there is plenty of free memory, then there is no reason to trim the page cache. One might think that because we only trim the page cache on the file-write path, this approach would still let the page cache grow arbitrarily due to file reads. However, this is not the case, since the Linux kernel in normal multiuser operation is constantly writing something to the disk. So a page-cache limit enforced at file-write time is also an effective limit on the size of the page cache due to file reads.
Finally, the rationale for delaying the calling task when page_cache_reduce() fails is that we do not want the system to start swapping to make space for new buffered-I/O pages, since that will reduce I/O bandwidth by as much as one-half, as well as take a lot of CPU time to figure out which pages to swap out. So it is better to reduce the I/O bandwidth directly, by limiting the rate of requested I/O, instead of allowing that I/O to proceed at a rate that causes the system to be overrun by page-cache pages.
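The throttling behavior amounts to a retry loop on the writer's side: wake bdflush, back off for the tunable interval, and try to trim the cache again. The sketch below models that loop; every name in it is an illustrative stand-in (the stub reduce function simply succeeds on the third try, as if writeback had caught up), not the kernel's actual code:

```c
#define THROTTLE_MS 30          /* tunable back-off, per the text */

static int attempts;            /* counts reduction attempts */

/* Stand-in for page_cache_reduce(): fails twice, then succeeds,
 * as if bdflush had finally written back enough dirty pages. */
static int page_cache_reduce(void) { return ++attempts >= 3; }

static void wakeup_bdflush(void) { /* kick writeback (no-op in sketch) */ }
static void sleep_ms(int ms)     { (void)ms; /* sleep elided in sketch */ }

/* Retry until the cache can be trimmed, throttling between attempts.
 * Returns the number of throttle intervals the writer slept. */
int throttled_write_alloc(void)
{
    int slept = 0;
    while (!page_cache_reduce()) {
        wakeup_bdflush();       /* start flushing dirty pages          */
        sleep_ms(THROTTLE_MS);  /* back off; this limits the I/O rate  */
        slept++;
    }
    return slept;
}
```

The writer's progress is thus gated by how fast writeback can drain the cache, which is exactly the "reduce the I/O bandwidth directly" effect described above.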
Thus far, we have had good experience with this algorithm. File I/O rates are not substantially reduced from what the hardware can provide, the system does not start swapping, and the system remains responsive and usable while the BIGIO is running.
Of course, this entire discussion is specific to Linux 2.4.21. For Linux 2.6, we plan to evaluate whether this is a problem in the system at all. In particular, we want to see whether setting vm_swappiness to zero can eliminate the “BIGIO causes swapping” problem. We are also interested in evaluating the recent set of VM patches that Nick Piggin [6] has assembled, to see if they eliminate this problem for systems the size of a large Altix.
VM and Memory Allocation Fixes<br />
In addition to the page-cache changes described in the last section, we have made a number of smaller changes related to virtual-memory and paging performance.
One set of such changes increased the parallelism of page-fault handling for anonymous