

fop->aio_read(fd, o3, 16384) = -EIOCBRETRY,
        after issuing readahead for 20KB, as the
        readahead logic sees all 4 pages of the
        read (the readahead window is maintained
        at 4+1=5 pages)
    .
    .
    and when read completes for o1
fop->aio_read(fd, o1, 16384) = 16384
    and when read completes for o2
fop->aio_read(fd, o2, 16384) = 16384
    and when read completes for o3
fop->aio_read(fd, o3, 16384) = 16384
    .
    .
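To make the trace above concrete, the following is a minimal user-space sketch, not taken from the paper, of a program that would drive three such 16KB buffered reads through fop->aio_read() via libaio; the file name and the offsets standing in for o1, o2, and o3 are placeholders.

/* Sketch: three 16KB buffered AIO reads submitted with libaio.
 * "datafile" and the offsets are placeholders, not from the paper. */
#include <libaio.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define NREQS 3
#define REQSZ 16384

int main(void)
{
    io_context_t ctx;
    struct iocb iocbs[NREQS], *iops[NREQS];
    struct io_event events[NREQS];
    long long offs[NREQS] = { 0, REQSZ, 2 * REQSZ };  /* stand-ins for o1, o2, o3 */
    char *bufs[NREQS];
    int fd, i;

    fd = open("datafile", O_RDONLY);          /* buffered, i.e., no O_DIRECT */
    if (fd < 0) { perror("open"); return 1; }

    memset(&ctx, 0, sizeof(ctx));
    if (io_setup(NREQS, &ctx) < 0) { perror("io_setup"); return 1; }

    for (i = 0; i < NREQS; i++) {
        bufs[i] = malloc(REQSZ);
        io_prep_pread(&iocbs[i], fd, bufs[i], REQSZ, offs[i]);
        iops[i] = &iocbs[i];
    }

    /* One io_submit() drives fop->aio_read() for each iocb; reads whose
     * pages are not yet up to date return -EIOCBRETRY internally and are
     * retried by the kernel as the I/O completes. */
    if (io_submit(ctx, NREQS, iops) < 0) { perror("io_submit"); return 1; }

    /* Each completion corresponds to a retried read finally returning 16384. */
    if (io_getevents(ctx, NREQS, NREQS, events, NULL) < 0)
        perror("io_getevents");

    io_destroy(ctx);
    close(fd);
    return 0;
}

Linking requires the libaio library, e.g., gcc example.c -laio.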

3.5 Upfront readahead and sendfile regressions

At first sight, upfront readahead appears to be a reasonable change for all situations, since it immediately passes the entire size of the request to the readahead logic. However, it has a potential unintended side effect: it can lose pipelining benefits for very large reads, or for operations like sendfile that involve post-processing I/O on the contents just read. One way to address this is to clip the maximum size of upfront readahead to the maximum readahead setting for the device. To see why even that may not suffice in certain situations, consider the following sequence for a webserver that uses non-blocking sendfile to serve a large (2GB) file.

sendfile(fd, 0, 2GB, fd2) = 8192,
    tells readahead about up to 128KB of the read
sendfile(fd, 8192, 2GB - 8192, fd2) = 8192,
    tells readahead about 8KB - 132KB of the read
sendfile(fd, 16384, 2GB - 16384, fd2) = 8192,
    tells readahead about 16KB - 140KB of the read
...
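The server loop producing this pattern can be sketched as follows; this is an illustration of the trace above, not code from the paper, and it assumes a connected socket sock already set to O_NONBLOCK.

/* Sketch: push a large file over a non-blocking socket with sendfile(2). */
#include <sys/types.h>
#include <sys/sendfile.h>
#include <errno.h>

static int serve_file(int sock, int fd, off_t size)
{
    off_t off = 0;

    while (off < size) {
        /* Each call asks for everything that is left, so upfront
         * readahead sees a near-2GB request every time, while the
         * socket buffer limits actual progress to ~8KB per call. */
        ssize_t n = sendfile(sock, fd, &off, size - off);
        if (n < 0) {
            if (errno == EAGAIN)
                continue;   /* a real server would wait in poll/epoll here */
            return -1;
        }
    }
    return 0;
}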

This confuses the readahead logic about the I/O pattern, which now appears to be 0–128K, 8K–132K, 16K–140K, instead of the plainly sequential 0–2GB access that is actually taking place. To avoid such unanticipated issues, upfront readahead was made a special case for AIO alone, limited to the maximum readahead setting for the device.
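As an illustration, the special case amounts to a clamp along the following lines; the helper and its names are hypothetical kernel-style code, not the actual implementation.

/* Hypothetical helper: clamp the size hinted to readahead for AIO
 * requests to the per-device maximum (ra_pages, in pages). */
static unsigned long clamp_upfront_readahead(unsigned long req_bytes,
                                             unsigned long ra_pages)
{
    unsigned long max_bytes = ra_pages << PAGE_SHIFT;

    return req_bytes < max_bytes ? req_bytes : max_bytes;
}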

3.6 Streaming AIO read microbenchmark comparisons

We explored streaming AIO throughput improvements with the retry-based AIO implementation and the optimizations discussed above, using a custom microbenchmark called aio-stress [2]. aio-stress issues a stream of AIO requests to one or more files, where one can vary several parameters, including I/O unit size, total I/O size, depth of iocbs submitted at a time, number of concurrent threads, and type and pattern of I/O operations; it reports the overall throughput attained.
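The streaming pattern that such a benchmark measures can be sketched as follows; this is an illustrative reconstruction using libaio, not aio-stress source, with error handling elided and the buffers and iocb array assumed to be prepared by the caller.

/* Sketch: keep "depth" iocbs in flight against one file, reaping one
 * completion at a time and immediately submitting the next request. */
#include <libaio.h>

static void stream_reads(io_context_t ctx, int fd, struct iocb *iocbs,
                         char **bufs, long depth, long reqsz,
                         long long filesz)
{
    struct io_event events[1];
    long long next = 0;
    long i, inflight = 0;

    /* prime the pipeline with "depth" reads */
    for (i = 0; i < depth && next < filesz; i++, next += reqsz) {
        struct iocb *iop = &iocbs[i];
        io_prep_pread(iop, fd, bufs[i], reqsz, next);
        io_submit(ctx, 1, &iop);
        inflight++;
    }

    /* reap a completion, then reuse its iocb at the next offset */
    while (inflight > 0) {
        io_getevents(ctx, 1, 1, events, NULL);
        inflight--;
        if (next < filesz) {
            struct iocb *iop = events[0].obj;
            void *buf = iop->u.c.buf;
            io_prep_pread(iop, fd, buf, reqsz, next);
            io_submit(ctx, 1, &iop);
            next += reqsz;
            inflight++;
        }
    }
}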

The hardware included a 4-way 700MHz Pentium® III machine with 512MB of RAM and a 1MB L2 cache. The disk subsystem used for the I/O tests consisted of an Adaptec AIC7896/97 Ultra2 SCSI controller connected to a disk enclosure with six 9GB disks, one of which was configured with an ext3 filesystem with a 4KB block size for testing.

The runs compared aio-stress throughputs for streaming random buffered I/O reads (i.e., without O_DIRECT), with and without the previously described changes. All the runs were for the case where the file was not already cached in memory. The above graph summarizes how the results varied across individual request sizes of 4KB to 64KB, where I/O was targeted to a single file of size 1GB, with the depth of iocbs outstanding at a time being 64. A third run was performed to find out how the results compared with equivalent runs using AIO-DIO.

With the changes applied, the results showed an approximate 2x improvement across all block sizes, bringing throughputs to levels that match the corresponding results using AIO-DIO.
