Linux Symposium 2004 • Volume One • 57
Instead of moving on immediately to the next request from an unrelated process, the anticipatory IO scheduler (henceforth known as AS) opens a small window of opportunity for that process to submit a new IO request. If that happens, AS gives it a new chance, and so on. Internally it keeps a decaying histogram of IO think times to help make the anticipation as accurate as possible.
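The decaying think-time bookkeeping can be sketched roughly as follows. This is an illustrative model only, not the kernel's actual code; the names, the 7/8 decay weight, and the wait-decision rule are all assumptions:

```c
#include <assert.h>

/*
 * Sketch of a decaying think-time estimate in the spirit of the text:
 * recent samples dominate, older history fades out geometrically.
 */
struct think_time {
	unsigned long mean;	/* decayed mean think time, e.g. in jiffies */
};

/* Fold a new think-time sample into the estimate, weighting history 7:1. */
static void tt_update(struct think_time *tt, unsigned long sample)
{
	tt->mean = (tt->mean * 7 + sample) / 8;
}

/* Anticipate only while the process typically thinks briefly. */
static int tt_should_wait(const struct think_time *tt, unsigned long max_wait)
{
	return tt->mean <= max_wait;
}
```

The point of the decay is that a process whose behavior changes (say, from short think times to long ones) quickly stops being anticipated, bounding the throughput lost to idle waiting.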
Internally, AS is quite similar to deadline. It uses the same data structures and algorithms for sorting, lookups, and FIFO. If the think time is set to 0, it is very close to deadline in behavior; the only differences are various optimizations that have been applied to either scheduler, allowing them to diverge a little. If AS is able to reliably predict when waiting for a new request is worthwhile, it gets phenomenal performance with excellent interactiveness. Often the system throughput is sacrificed a little bit, so depending on the workload, AS might not always be the best choice. The IO storage hardware used also plays a role in this: a non-queuing ATA hard drive is a much better fit than a SCSI drive with a large queuing depth. The SCSI firmware reorders requests internally, often destroying any accounting that AS is trying to do.
2.4.4 CFQ

The third new IO scheduler in 2.6 is called CFQ. It is loosely based on the ideas of stochastic fair queuing (SFQ [McKenney]). SFQ is fair as long as its hashing doesn't collide, and to avoid collisions it uses a continually changing hashing function. Collisions can't be completely avoided, though; their frequency depends entirely on workload and timing. CFQ is an acronym for completely fair queuing, and it attempts to get around the collision problem that SFQ suffers from. To do so, CFQ does away with the fixed number of buckets that processes can be placed in. By using a regular hashing technique to find the appropriate bucket even in the case of collisions, fatal collisions are avoided.

CFQ deviates radically from the concepts that deadline and AS are based on. It doesn't assign deadlines to incoming requests to maintain fairness; instead it attempts to divide bandwidth equally among classes of processes based on some correlation between them. The default is to hash on the thread group id, tgid. This means that bandwidth is distributed as equally as possible among the processes in the system. Each class has its own request sort and hash list, again using red-black trees for sorting and regular hashing for back merges.

When dealing with writes, there is a little catch. A process will almost never be performing its own writes: data is marked dirty in the context of the process, but write back usually takes place from the pdflush kernel threads. So CFQ is actually dividing read bandwidth among processes, while treating each pdflush thread as a separate process. Usually this has very minor impact on write back performance. Latency is much less of an issue with writes, and good throughput is very easy to achieve due to their inherent asynchronous nature.
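The per-tgid bucketing can be sketched as a chained hash table, one way the "no fixed per-process buckets, no fatal collisions" idea works in practice. All names and the table size here are hypothetical, not CFQ's real data structures:

```c
#include <assert.h>
#include <stdlib.h>

/*
 * Illustrative per-process-group classes keyed by tgid.  Colliding
 * tgids share a bucket but chain to distinct classes, so a hash
 * collision never merges two groups' accounting.
 */
#define CLASS_HASH_SIZE 64

struct io_class {
	int tgid;			/* key: thread group id */
	struct io_class *next;		/* collision chain */
	/* per-class request sort tree and merge hash would hang off here */
};

static struct io_class *class_hash[CLASS_HASH_SIZE];

static unsigned int class_hash_fn(int tgid)
{
	return (unsigned int)tgid % CLASS_HASH_SIZE;
}

/* Find the class for a tgid, creating it on first use. */
static struct io_class *get_io_class(int tgid)
{
	unsigned int h = class_hash_fn(tgid);
	struct io_class *c;

	for (c = class_hash[h]; c; c = c->next)
		if (c->tgid == tgid)
			return c;

	c = calloc(1, sizeof(*c));
	c->tgid = tgid;
	c->next = class_hash[h];
	class_hash[h] = c;
	return c;
}
```

Contrast with SFQ: there a collision maps two flows onto the same fixed bucket and they share bandwidth until the hash function is reseeded; with chaining, each tgid always resolves to its own class.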
2.5 Request allocation

Each block driver in the system has at least one request_queue_t request queue structure associated with it. The recommended setup is to assign a queue to each logical spindle. In turn, each request queue has a struct request_list embedded which holds free struct request structures used for queuing IO. 2.4 improved on the situation in 2.2, where a single global free list was available, by adding one list per queue instead. This free list was split into two sections of equal size, for reads and writes, to prevent either