pressure).

In the interest of reducing cross-node traffic, and reducing memory access latency for frequently accessed data and text, it is desirable to replicate any such memory that is read-only to each node, and use the local copy on any accesses, rather than a remote copy. The obvious candidates for such replication are the kernel text itself, and the text of shared libraries such as libc. Of course, this faster access comes at the price of increased memory usage, but this is rarely a problem on large NUMA machines. Whilst it might be technically possible to replicate read/write mappings, this is complex, of dubious utility, and is unlikely to be implemented.
Kernel text is assumed by the kernel itself to appear at a fixed virtual address, and to change this would be problematic. Hence the easiest way to replicate it is to change the virtual-to-physical mappings for each node to point at a different address. On IA-64, this is easy, since the CPU provides hardware assistance in the form of a pinned TLB entry.
On other architectures this proves more difficult, and would depend on the structure of the pagetables. On IA-32 with PAE enabled, as long as the user-kernel split is aligned on a PMD boundary, we can have a separate kernel PMD for each node, and point the vmalloc area (which uses small page mappings) back to a globally shared set of PTE pages. The PMD entries for the ZONE_NORMAL areas normally never change, so they pose no problem. However, ioremap_nocache can change them (the GART code trips over this), and speculative execution means we must handle that case; a slow path that updates every node's copy of the PMDs would suffice.
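To make the pagetable arrangement concrete, here is a minimal userspace model of the IA-32/PAE scheme described above. All names, sizes, and indices are illustrative, not the kernel's actual structures: each node gets its own kernel PMD whose text entries point at a node-local replica, while the vmalloc entries all point back at one globally shared set of PTE pages.

#include <stdio.h>

#define NR_NODES          4     /* example machine: four NUMA nodes        */
#define PTRS_PER_PMD    512     /* entries in one PAE PMD page             */
#define VMALLOC_PMD_IDX 480     /* illustrative index where vmalloc starts */

/* One globally shared "PTE page" for the vmalloc area, represented by a
 * single token value: small-page mappings must look identical from
 * every node. */
static unsigned long shared_vmalloc_pte = 0xCAFE;

/* Per-node kernel PMDs: text entries differ per node, each pointing at
 * that node's local replica of the kernel image. */
static unsigned long node_pmd[NR_NODES][PTRS_PER_PMD];

static void build_node_pmds(void)
{
    for (int node = 0; node < NR_NODES; node++) {
        for (int i = 0; i < PTRS_PER_PMD; i++) {
            if (i >= VMALLOC_PMD_IDX)
                /* vmalloc range: point back at the shared PTE pages */
                node_pmd[node][i] = shared_vmalloc_pte;
            else
                /* large-page mapping of this node's replica */
                node_pmd[node][i] = 0x100000UL * node + i;
        }
    }
}

int main(void)
{
    build_node_pmds();
    printf("kernel text: node0=%#lx node1=%#lx (node-local replicas)\n",
           node_pmd[0][0], node_pmd[1][0]);
    printf("vmalloc:     node0=%#lx node1=%#lx (globally shared)\n",
           node_pmd[0][VMALLOC_PMD_IDX], node_pmd[1][VMALLOC_PMD_IDX]);
    return 0;
}

The key invariant is that only the replicated, read-only entries differ between nodes; anything that can change at runtime must either be shared or updated in every node's copy, which is why the ioremap_nocache slow path must touch all of the PMDs.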
Dave Hansen has created a patch to replicate read-only pagecache data by adding a per-node data structure to each node of the pagecache radix tree. As soon as any mapping is opened for writing, the replication is collapsed, making it safe. The patch gives a 5%–40% increase in performance, depending on the workload.
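A rough sketch of the shape such a scheme might take, assuming a per-node array hung off each radix-tree slot; the type, field, and function names here are hypothetical, not those of the actual patch:

#include <stdlib.h>

#define NR_NODES 4

/* Hypothetical per-node leaf: what a pagecache radix-tree slot might
 * hold under replication. Names are illustrative only. */
struct pcache_replica {
    void *copy[NR_NODES];   /* per-node replicas of one page, or NULL     */
    int   writable;         /* set once any mapping is opened for writing */
};

/* Read path: prefer the local node's replica while it is still valid. */
static void *pcache_lookup(struct pcache_replica *r, int node)
{
    if (!r->writable && r->copy[node])
        return r->copy[node];
    return r->copy[0];      /* master copy */
}

/* Write path: collapse the replication so all nodes see one copy. */
static void pcache_collapse(struct pcache_replica *r)
{
    for (int n = 1; n < NR_NODES; n++) {
        free(r->copy[n]);
        r->copy[n] = NULL;
    }
    r->writable = 1;
}

int main(void)
{
    struct pcache_replica r = { .writable = 0 };
    for (int n = 0; n < NR_NODES; n++)
        r.copy[n] = malloc(4096);
    void *p = pcache_lookup(&r, 2);   /* fast node-local read            */
    pcache_collapse(&r);              /* first writer collapses replicas */
    p = pcache_lookup(&r, 2);         /* now every node sees copy[0]     */
    (void)p;
    return 0;
}

The design mirrors the text replication above: reads are satisfied node-locally, and the first writer pays a one-time cost to collapse back to a single coherent copy.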
In the 2.6 Linux kernel, we have a per-node LRU for page management and a per-node LRU lock, in place of the global structures and locks of 2.4. Not only does this reduce contention through finer-grained locking, it also means we do not have to search other nodes' page lists to free up pages on a node that is under memory pressure. Moreover, we get much better locality, as only the local kswapd process accesses that node's pages. Before splitting the LRU into per-node lists, we were spending 50% of the system time during a kernel compile just spinning waiting for pagemap_lru_lock (which was the biggest global VM lock at the time). Contention for pagemap_lru_lock is now so small it is not measurable.
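The structure of the split is simple to sketch. The following is an illustrative userspace model, assuming pthread spinlocks in place of kernel ones; the names are illustrative rather than the kernel's own (the real lists live in the kernel's per-node and per-zone structures):

#include <pthread.h>

#define NR_NODES 4

struct page { struct page *next, *prev; };

/* Per-node LRU state modelling the 2.6 split: each node gets its own
 * lists and its own lock. */
struct node_lru {
    pthread_spinlock_t lru_lock;  /* replaces the global pagemap_lru_lock */
    struct page active;           /* list head: recently used pages       */
    struct page inactive;         /* list head: reclaim candidates        */
};

static struct node_lru lru[NR_NODES];

/* Reclaim takes only its own node's lock, so one node under memory
 * pressure never contends with, or pulls cache lines from, another. */
static void shrink_node(int node)
{
    pthread_spin_lock(&lru[node].lru_lock);
    /* ... scan lru[node].inactive and free pages ... */
    pthread_spin_unlock(&lru[node].lru_lock);
}

int main(void)
{
    for (int n = 0; n < NR_NODES; n++)
        pthread_spin_init(&lru[n].lru_lock, PTHREAD_PROCESS_PRIVATE);
    shrink_node(1);   /* kswapd on node 1 touches only node 1's lists */
    return 0;
}

Because each node's kswapd only ever takes its own lock, the old global serialization point simply disappears.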
4 Sched Domains—a Topology-aware Scheduler
The previous Linux scheduler, the O(1) scheduler, provided some needed improvements over the 2.4 scheduler, but shows its age as complex system topologies become increasingly common. With technologies such as NUMA, Symmetric Multi-Threading (SMT), and variations and combinations of these, the need for a more flexible mechanism to model system topology is evident.
4.1 Overview
In answer to this concern, the mainline 2.6 tree (linux-2.6.7-rc1 at the time of this writing) contains an updated scheduler that supports generic CPU topologies through a data structure, struct sched_domain, which models the architecture and defines scheduling policies.
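For orientation, here is a heavily abbreviated sketch of that structure; the field selection and comments are paraphrased rather than quoted, and the authoritative definition lives in include/linux/sched.h in the 2.6 tree:

/* Stand-ins so the sketch is self-contained; in the kernel, cpumask_t
 * and struct sched_group are defined elsewhere. */
typedef unsigned long cpumask_t;   /* one bit per CPU (simplified) */
struct sched_group;

/* Abbreviated sketch of struct sched_domain: a few representative
 * fields only. */
struct sched_domain {
    struct sched_domain *parent;   /* domains form a per-CPU hierarchy   */
    struct sched_group *groups;    /* balancing groups within this span  */
    cpumask_t span;                /* CPUs covered by this domain        */
    unsigned long min_interval;    /* minimum balance interval, in ms    */
    unsigned long max_interval;    /* maximum balance interval, in ms    */
    unsigned int imbalance_pct;    /* imbalance needed before migrating  */
    int flags;                     /* e.g. SD_BALANCE_EXEC, SD_WAKE_IDLE */
};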