52 • Linux Symposium 2004 • Volume One
the IO scheduler. The latter is not an error condition; it will happen all the time for even moderate amounts of write back against a queue.
1.2 struct buffer_head
The main IO unit in the 2.4 kernel is the struct buffer_head. It's a fairly unwieldy structure, used at various kernel layers for different things: caching entity, file system block, and IO unit. As a result, it's suboptimal for any of them.
From the block layer point of view, the two biggest problems are the size of the structure and the limit on how large a data region it can describe. Bound by the file system's one-block semantics, it can describe at most PAGE_CACHE_SIZE worth of data, which for Linux on x86 hardware means 4KiB. Often it is even worse: raw IO typically uses the soft sector size of a queue (1KiB by default) for submitting IO, which means that queuing e.g. 32KiB of IO will enter the IO scheduler 32 times. To work around this limitation and get at least to a page at a time, a 2.4 hack called vary_io was introduced. A driver advertising this capability acknowledges that it can manage buffer_head's of varying sizes at the same time. File system read-ahead, another frequent submitter of larger sized IO, has no option but to submit the read-ahead window in units of the page size.
1.3 Scalability
With the limit on buffer_head IO size and elevator_linus runtime, it doesn't take a lot of thinking to discover obvious scalability problems in the Linux 2.4 IO path. To add insult to injury, the entire IO path is guarded by a single, global lock: io_request_lock. This lock is held during the entire IO queuing operation, and typically also from the other end, when a driver extracts requests for IO submission. A single global lock is a big enough problem on its own (bigger SMP systems will suffer immensely because of cache line bouncing), but add long runtimes on top of that and you have a really huge IO scalability problem.
Linux vendors have shipped lock scalability patches for quite some time to get around this problem. The adopted solution is typically to make the queue lock a pointer to a driver-local lock, so the driver has full control of the granularity and scope of the lock. This solution was adopted from the 2.5 kernel, as we'll see later. But this is another case where driver writers often need to differentiate between vendor and vanilla kernels.
1.4 API problems
Looking at the block layer as a whole (including both ends of the spectrum, the producers and consumers of the IO units going through the block layer), it is a typical example of code that has been hacked into existence without much thought to design. When things broke or new features were needed, they were grafted onto the existing mess. No well defined interface exists between the file system and the block layer, just a few scattered functions. Controlling IO unit flow from IO scheduler to driver was impossible: 2.4 exposes the IO scheduler data structures (the ->queue_head linked list used for queuing) directly to the driver. This fact alone makes it virtually impossible to implement more clever IO scheduling in 2.4. Even the recently (in the 2.4.20's) added lower latency work was horrible to work with because of this lack of boundaries. Verifying correctness of the code is extremely difficult, and peer review likewise, since a reviewer must be intimate with the block layer structures to follow the code.
Another example of the lack of clear direction is