One - The Linux Kernel Archives

More documents

Recommendations

Info

146 • Linux Symposium 2004 • Volume One a barrier attribute to certain DMA writes (to consistent PCI allocations on Altix) and to interrupts. This works well with controllers that use DMA writes to indicate command completions (for example a SCSI controller with a response queue, where the response queue is allocated using pci_alloc_consistent, so that writes to the response queue have the barrier attribute). When we ported Linux to Altix, this behavior became a problem, because many Linux PCI drivers use PIO read responses to imply a status of a DMA write. For example, on an IDE controller, a bit status register read is performed to find out if a command is complete (command complete status implies that DMA writes of that command’s data are completed). As a result, SGI had to implement a rather heavyweight mechanism to guarantee ordering of DMA writes and PIO reads. This mechanism involves doing an explicit flush of DMA write data after each PIO read. For the cases in which strong ordering of PIO read responses and DMA writes are not necessary, a new API was needed so that drivers could communicate that a given PIO read response could used relaxed ordering with respect to prior DMA writes. The read_ relaxed API [8] was added early in the 2.6 series for this purpose, and mirrors the normal read routines, which have variants for various sized reads. The results below show how expensive a normal PIO read transaction can be, especially on a system doing a lot of I/O (and thus DMA). Type of PIO Time (ns) normal PIO read 3875 relaxed PIO read 1299 Table 1: Normal vs. relaxed PIO reads on an idle system It remains to be seen whether this API will also apply to the newly added RO bit in the PCI- Type of PIO Time (ns) normal PIO read 4889 relaxed PIO read 1646 Table 2: Normal vs. relaxed PIO reads on a busy system X specification—the author is hopeful! Either way, it does give hardware vendors who want to support Linux some additional flexibility in their design. Ordering posted writes efficiently On many platforms, PIO writes from different CPUs will not necessarily arrive in order (i.e., they may be intermixed) even when locking is used. Since the platform has no way of knowing whether a given PIO read depends on preceding writes, it has to guarantee that all writes have completed before allowing a read transaction to complete. So performing a read prior to releasing a lock protecting a region doing writes is sufficient to guarantee that the writes arrive in the correct order. However, performing PIO reads can be an expensive operation, especially if the device is on a distant node. SGI chipset designers foresaw this problem, however, and provided a way to ensure ordering by simply reading a register from the chipset on the local node. When the register indicates that all PIO writes are complete, it means they have arrived at the chipset attached to the device, and so are guaranteed to arrive at the device in the intended order. The SGI sn2 specific portion of the Linux ia64 port (sn2 is the architecture name for Altix in the Linux kernel source tree) provides a small function, sn_mmiob() (for memory–mapped I/O barrier, analogous to the mb() macro), to do just that. It can be used in place of reads that are intended to deal with posted writes and provides some benefit:
Linux Symposium 2004 • Volume One • 147 Type of flush Time (ns) regular PIO read 5940 relaxed PIO read 2619 sn_mmiob() 1610 (local chipset read alone) 399 Table 3: Normal vs. fast flushing of 5 PIO writes Adding this API to Linux (i.e., making it nonsn2-specific) was discussed some time ago [9], and may need to be raised again, since it does appear to be useful on Altix, and is probably similarly useful on other platforms. Local allocation of consistent DMA mappings Consistent DMA mappings are used frequently by drivers to store command and status buffers. They are frequently read and written by the device that owns them, so making sure they can be accessed quickly is important. The table below shows the difference in the number of operations per second that can be achieved using local versus remote allocation of consistent DMA buffers. Local allocations were guaranteed by changing the pci_ alloc_consistent function so that it calls alloc_pages_node using the node closest to the PCI device in question. Type I/Os per second Local consistent buffer 46231 Remote consistent buffer 41295 Table 4: Local vs. remote DMA buffer allocation Although this change is platform specific, it can be made generic if a pci_to_node or pci_to_nodemask routine is added to the Linux topology API. Concluding Remarks Today, our Linux 2.4.21 kernel for Altix provides a productive platform for our highperformance-computing users who desire to exploit the features of the SGI Altix 3000 hardware. To achieve this goal, we have made a number of changes to our Linux for Altix kernel. We are now in the process of either moving those changes forward to Linux 2.6 for Altix, or of evaluating the Linux 2.6 kernel on Altix in order to determine if these changes are indeed needed at all. Our goal is to develop a version of the Linux 2.6 kernel for Altix that not only supports our HPC customers equally well as our existing Linux 2.4.21 kernel, but also consists as much as possible of community supported code. References [1] Ray Bryant and John Hawkes, Linux Scalability for Large NUMA Systems, Proceedings of the 2003 Ottawa Linux Symposium, Ottawa, Ontario, Canada, (July 2003). [2] Daniel Lenoski, James Laudon, Truman Joe, David Nakahira, Luis Stevens, Anoop Gupta, and John Hennesy, The DASH prototype: Logic overhead and performance, IEEE Transactions on Parallel and Distributed Systems, 4(1):41-61, January 1993. [3] Kenneth Chen, “hugetlb demand paging patch part [0/3],” linux-kernel@vger.kernel.org, 2004-04-13 23:17:04, http://marc.theaimsgroup. com/?l=linux-kernel&m= 108189860419356&w=2 [4] Andi Kleen, “Patch: NUMA API for Linux,” linux-kernel@vger.kernel.org,
Page 1:
Proceedings of the Linux Symposium
Page 4 and 5:
The State of ACPI in the Linux Kern
Page 7 and 8:
Conference Organizers Andrew J. Hut
Page 9 and 10:
TCP Connection Passing Werner Almes
Page 11 and 12:
Linux Symposium 2004 • Volume One
Page 13 and 14:
Page 15 and 16:
Page 17 and 18:
Page 19 and 20:
Page 21 and 22:
Page 23 and 24:
Cooperative Linux Dan Aloni da-x@co
Page 25 and 26:
Page 27 and 28:
Page 29 and 30:
Page 31 and 32:
Page 33 and 34:
Build your own Wireless Access Poin
Page 35 and 36:
Page 37 and 38:
Page 39 and 40:
Page 41 and 42:
Run-time testing of LSB Application
Page 43 and 44:
Page 45 and 46:
Page 47 and 48:
Page 49 and 50:
Page 51 and 52:
Linux Block IO—present and future
Page 53 and 54:
Page 55 and 56:
Page 57 and 58:
Page 59 and 60:
Page 61 and 62:
Page 63 and 64:
Linux AIO Performance and Robustnes
Page 65 and 66:
Page 67 and 68:
Page 69 and 70:
Page 71 and 72:
Page 73 and 74:
Page 75 and 76:
Page 77 and 78:
Page 79 and 80:
Methods to Improve Bootup Time in L
Page 81 and 82:
Page 83 and 84:
Page 85 and 86:
Page 87 and 88:
Page 89 and 90:
Linux on NUMA Systems Martin J. Bli
Page 91 and 92:
Page 93 and 94:
Page 95 and 96: Linux Symposium 2004 • Volume One
Page 103 and 104: Improving Kernel Performance by Unm
Page 113 and 114: Linux Virtualization on IBM POWER5
Page 121 and 122: The State of ACPI in the Linux Kern
Page 133 and 134: Scaling Linux® to the Extreme From
Page 145: Linux Symposium 2004 • Volume One
Page 149 and 150: Get More Device Drivers out of the
Page 163 and 164: Big Servers—2.6 compared to 2.4 W
Page 167 and 168: Multi-processor and Frequency Scali
Page 187 and 188: Dynamic Kernel Module Support: From
Page 197 and 198:
Page 199 and 200:
Page 201 and 202:
Page 203 and 204:
e100 Weight Reduction Program Writi
Page 205 and 206:
Page 207 and 208:
NFSv4 and rpcsec_gss for linux J. B
Page 209 and 210:
Page 211 and 212:
Page 213 and 214:
Page 215 and 216:
Comparing and Evaluating epoll, sel
Page 217 and 218:
Page 219 and 220:
Page 221 and 222:
Page 223 and 224:
Page 225 and 226:
Page 227 and 228:
The (Re)Architecture of the X Windo
Page 229 and 230:
Page 231 and 232:
Page 233 and 234:
Page 235 and 236:
Page 237 and 238:
Page 239 and 240:
IA64-Linux perf tools for IO dorks
Page 241 and 242:
Page 243 and 244:
Page 245 and 246:
Page 247 and 248:
Page 249 and 250:
Page 251 and 252:
Page 253 and 254:
Page 255 and 256:
Carrier Grade Server Features in th
Page 257 and 258:
Page 259 and 260:
Page 261 and 262:
Page 263 and 264:
Page 265 and 266:
Page 267 and 268:
Page 269 and 270:
Demands, Solutions, and Improvement
Page 271 and 272:
Page 273 and 274:
Page 275 and 276:
Page 277 and 278:
Page 279 and 280:
Page 281 and 282:
Page 283 and 284:
Page 285 and 286:
Page 287 and 288:
Hotplug Memory and the Linux VM Dav
Page 289 and 290:
Page 291 and 292:
Page 293 and 294:
show all

One - The Linux Kernel Archives

You also want an ePaper? Increase the reach of your titles

Delete template?

Save as template?