One - The Linux Kernel Archives

More documents

Recommendations

Info

250 • Linux Symposium 2004 • Volume One gsyprf3:~# pfmon31 --us-c --cpu-list=0 --resolve-addr --smpl-module=dear-hist-itanium2 \ -e data_ear_cache_lat4 --long-smpl-periods=5000 --smpl-periods-random=0xfff:10 \ --system-wide -k -- /usr/src/pktgen-testing/pktgen-single-tg3 added event set 0 only kernel symbols are resolved in system-wide mode Adding devices to run. Configuring devices Running... ctrl^C to stop 57: 8484552 0 IO-SAPIC-level eth1 Result: OK: 5938001(c5866437+d71564) usec, 5000000 (64byte) 842034pps 411Mb/sec (431121408bps) errors: 0 57: 9093642 0 IO-SAPIC-level eth1 # total_samples 795 # instruction addr view # sorted by count # showing per per distinct value # %L2 : percentage of L1 misses that hit L2 # %L3 : percentage of L1 misses that hit L3 # %RAM : percentage of L1 misses that hit memory # L2 : 5 cycles load latency # L3 : 12 cycles load latency # sampling period: 5000 # #count %self %cum %L2 %L3 %RAM instruction addr 95 11.95% 11.95% 0.00% 98.95% 1.05% 0xa00000020003d150 tg3_tx[tg3]+0x290 83 10.44% 22.39% 93.98% 4.82% 1.20% 0xa00000020003d030 tg3_tx[tg3]+0x170 21 2.64% 25.03% 0.00% 95.24% 4.76% 0xa0000001000180f0 ia64_handle_irq+0x170 20 2.52% 27.55% 5.00% 80.00% 15.00% 0xa00000020003d040 tg3_tx[tg3]+0x180 18 2.26% 29.81% 50.00% 11.11% 38.89% 0xa00000020003cfa0 tg3_tx[tg3]+0xe0 17 2.14% 31.95% 0.00% 0.00% 100.00% 0xa00000020003e671 tg3_interrupt[tg3] +0x1d1 17 2.14% 34.09% 0.00% 100.00% 0.00% 0xa00000020003e700 tg3_interrupt[tg3] +0x260 16 2.01% 36.10% 56.25% 43.75% 0.00% 0xa000000100012160 ia64_leave_kernel +0x180 16 2.01% 38.11% 62.50% 0.00% 37.50% 0xa00000020003cf60 tg3_tx[tg3]+0xa0 15 1.89% 40.00% 86.67% 6.67% 6.67% 0xa00000020003cfd0 tg3_tx[tg3]+0x110 15 1.89% 41.89% 0.00% 0.00% 100.00% 0xa000000100016041 do_IRQ+0x1a1 15 1.89% 43.77% 0.00% 53.33% 46.67% 0xa00000020003e370 tg3_poll[tg3]+0x350 . . . # level 0 : counts=226 avg_cycles=0.0ms 28.43% # level 1 : counts=264 avg_cycles=0.0ms 33.21% # level 2 : counts=305 avg_cycles=0.0ms 38.36% approx cost: 0.0s Figure 9: tg3 v3.6 lat4 output
Linux Symposium 2004 • Volume One • 251 and in only two locations. Adding one prefetch to pull data from L3 into L2 would help for the top offender. One needs to figure out which bit of data each recorded access refers to and determine how early one can prefetch that data. We can also rule out MMIO accesses as the top culprit. tg3_interrupt+0x1d1 could be an MMIO read but it doesn’t show up in Figure 8 like tg3_write_indirect_reg32 does. Note smpl-periods is 10x higher in Figure 9 than in Figure 8. Collecting 10x more samples with lat4 definitely disturbs the workload. 5 q-tools q-syscollect and q-view are trivial to use. An example and brief explanation for kernel usage follow. Please remember most applications spend most of the time in user space and not in the kernel. q-tools is especially good in user space. 5.1 q-syscollect q-syscollect -c 5000 -C 5000 -t 20 -k This will collect system wide kernel data during the 20 second period. Twenty to thrity seconds is usually long enough to get sufficient accuracy 3 . However, if the workload generates a very wide call graph with even distribution, one will likely need to sample for longer periods to get accuracy in the ±1% range. When in doubt, try sampling for longer periods to see if the call-counts change significantly. 3 See Page 7 of the David Mosberger’s Gelato talk [4] for a nice graph on accuracy which only applies to his example. The -c and -C set the call sample rate and code sample rate respectively. The call sample rate is used to collect function call counts. This is one of the key differences compared to traditional profiling tools: q-syscollect obtains call-counts in a statistical fashion, just as has been done traditionally for the execution-time profile. The code sample rate is used to collect a flat profile (CPU_CYCLES by default). The -e option allows one to change the event used to sample for the flat profile. The default is to sample CPU_CYCLES event. This provides traditional execution time in the flat profile. The data is stored in the current directory under .q/ directory. The next section demonstrates how q-view displays the data. 5.2 q-view I was running the netperf [7] TCP_RR test in the background to another server when I collected the following data. As Figure 10 shows, this particular TCP_RR test isn’t costing many cycles in tg3 driver. Or, at least not ones I can measure. tg3_interrupt() shows up in the flat profile with 0.314 seconds time associated with it. The time measurement is only possible because handle_IRQ_event() re-enables interrupts if the IRQ handler is not registered with SA_INTERRUPT (to indicate latency sensitive IRQ handler). do_IRQ() and other functions in that same call graph do NOT have any time measurements because interrupts are disabled. As noted before, the callgraph is sampled using a different part of the PMU than the part which samples the flat profile. Lastly, I’ve omitted the trailing output of q-view which explains the fields and columns more completely. Read that first be-
Page 1:
Proceedings of the Linux Symposium
Page 4 and 5:
The State of ACPI in the Linux Kern
Page 7 and 8:
Conference Organizers Andrew J. Hut
Page 9 and 10:
TCP Connection Passing Werner Almes
Page 11 and 12:
Linux Symposium 2004 • Volume One
Page 13 and 14:
Page 15 and 16:
Page 17 and 18:
Page 19 and 20:
Page 21 and 22:
Page 23 and 24:
Cooperative Linux Dan Aloni da-x@co
Page 25 and 26:
Page 27 and 28:
Page 29 and 30:
Page 31 and 32:
Page 33 and 34:
Build your own Wireless Access Poin
Page 35 and 36:
Page 37 and 38:
Page 39 and 40:
Page 41 and 42:
Run-time testing of LSB Application
Page 43 and 44:
Page 45 and 46:
Page 47 and 48:
Page 49 and 50:
Page 51 and 52:
Linux Block IO—present and future
Page 53 and 54:
Page 55 and 56:
Page 57 and 58:
Page 59 and 60:
Page 61 and 62:
Page 63 and 64:
Linux AIO Performance and Robustnes
Page 65 and 66:
Page 67 and 68:
Page 69 and 70:
Page 71 and 72:
Page 73 and 74:
Page 75 and 76:
Page 77 and 78:
Page 79 and 80:
Methods to Improve Bootup Time in L
Page 81 and 82:
Page 83 and 84:
Page 85 and 86:
Page 87 and 88:
Page 89 and 90:
Linux on NUMA Systems Martin J. Bli
Page 91 and 92:
Page 93 and 94:
Page 95 and 96:
Page 97 and 98:
Page 99 and 100:
Page 101 and 102:
Page 103 and 104:
Improving Kernel Performance by Unm
Page 105 and 106:
Page 107 and 108:
Page 109 and 110:
Page 111 and 112:
Page 113 and 114:
Linux Virtualization on IBM POWER5
Page 115 and 116:
Page 117 and 118:
Page 119 and 120:
Page 121 and 122:
The State of ACPI in the Linux Kern
Page 123 and 124:
Page 125 and 126:
Page 127 and 128:
Page 129 and 130:
Page 131 and 132:
Page 133 and 134:
Scaling Linux® to the Extreme From
Page 135 and 136:
Page 137 and 138:
Page 139 and 140:
Page 141 and 142:
Page 143 and 144:
Page 145 and 146:
Page 147 and 148:
Page 149 and 150:
Get More Device Drivers out of the
Page 151 and 152:
Page 153 and 154:
Page 155 and 156:
Page 157 and 158:
Page 159 and 160:
Page 161 and 162:
Page 163 and 164:
Big Servers—2.6 compared to 2.4 W
Page 165 and 166:
Page 167 and 168:
Multi-processor and Frequency Scali
Page 169 and 170:
Page 171 and 172:
Page 173 and 174:
Page 175 and 176:
Page 177 and 178:
Page 179 and 180:
Page 181 and 182:
Page 183 and 184:
Page 185 and 186:
Page 187 and 188:
Dynamic Kernel Module Support: From
Page 189 and 190:
Page 191 and 192:
Page 193 and 194:
Page 195 and 196:
Page 197 and 198:
Page 199 and 200: Linux Symposium 2004 • Volume One
Page 203 and 204: e100 Weight Reduction Program Writi
Page 207 and 208: NFSv4 and rpcsec_gss for linux J. B
Page 215 and 216: Comparing and Evaluating epoll, sel
Page 227 and 228: The (Re)Architecture of the X Windo
Page 239 and 240: IA64-Linux perf tools for IO dorks
Page 249: Linux Symposium 2004 • Volume One
Page 255 and 256: Carrier Grade Server Features in th
Page 269 and 270: Demands, Solutions, and Improvement
Page 287 and 288: Hotplug Memory and the Linux VM Dav
show all

One - The Linux Kernel Archives

Create successful ePaper yourself

Delete template?

Save as template?