High Performance I/O with NUMA Systems in Linux
Lance Shelton
Fusion-io Confidential—Copyright © 2013 Fusion-io, Inc. All rights reserved.
How prevalent is NUMA?
▸ All major vendors
• HP, Dell, IBM, Cisco, SuperMicro, Hitachi, etc.
▸ As small as 1U
▸ 2, 4, and 8 socket systems
▸ 2 to 10 cores per socket
▸ Number of cores doubles with HyperThreading
8 x 10 x 2 = 160 CPU cores
NUMA is mainstream.
May 10, 2013 · High Performance I/O with NUMA Servers
Why use large NUMA servers?
▸ NUMA is key to the current evolution of performance
• Storage rates are increasing drastically
• Processor speeds are stagnant
• Multi-core cannot scale infinitely without NUMA
▸ NUMA meets the needs of the application
• No partitioning is required
• Faster than clustering multiple servers
• Works well in combination with scale-out
Nodes, Sockets, Cores, Threads
[Diagram: memory + socket = NUMA node; each socket contains multiple cores, and each core runs hardware threads]
Non-Uniform I/O Access
I/O = PCIe
• Internal flash
• Host bus adapter to storage network
2, 4, and 8 Socket Servers
# numactl --hardware
node   0   1   2   3   4   5   6   7
  0:  10  12  17  17  19  19  19  19
  1:  12  10  17  17  19  19  19  19
  2:  17  17  10  12  19  19  19  19
  3:  17  17  12  10  19  19  19  19
  4:  19  19  19  19  10  12  17  17
  5:  19  19  19  19  12  10  17  17
  6:  19  19  19  19  17  17  10  12
  7:  19  19  19  19  17  17  12  10
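The distances above are the firmware's SLIT table: 10 means local, 12 the adjacent socket, 19 the far half of the machine. A small sketch (a hypothetical helper, not from the slides; assumes distance rows in the format shown above on stdin) picks the lowest-latency remote node for a given node:

```shell
# Print the lowest-latency remote node for node $1, reading
# "N: d0 d1 ..." distance rows (numactl --hardware format) on stdin.
nearest_remote_node() {
    awk -v n="$1" '
        $1 == n":" {                 # the distance row for our node
            best = 1e9
            for (i = 2; i <= NF; i++) {
                other = i - 2        # columns follow node order 0,1,2,...
                if (other != n && $i < best) { best = $i; pick = other }
            }
            print pick
        }'
}
```

Fed the table above, node 2's nearest remote neighbor is node 3 (distance 12).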
Enterprise Storage
[Diagram: clients connect to database servers, which connect to shared storage]
PCIe Local Nodes
[Diagram: on Intel Nehalem, PCIe devices attach through a separate I/O hub shared by two sockets; on Intel Sandy Bridge, PCIe lanes attach directly to each socket]
PCIe Local Nodes
[Diagram: Nehalem's shared I/O hub is equidistant from two sockets, yet sysfs reports a single local node]

# cat /sys/devices/pci0000:50/0000:50:09.0/numa_node
0

Only one local node is presented to the OS.
PCIe Local Node?
[Diagram: device locality unknown to the OS]

# cat /sys/devices/pci0000:50/0000:50:09.0/numa_node
-1

Not detected? Update the BIOS.
Localizing I/O from Storage
▸ 50-100% performance improvement vs. non-NUMA-aware volume placement
Localizing I/O – High Availability
▸ Node locality of volumes must still be maintained
Localizing I/O – HA Failover
▸ Node locality maintained
▸ All available ports are still used
Localizing I/O – Components
▸ HBA placement
▸ Interrupt affinity
▸ Kernel thread affinity
▸ Application affinity

Analyze everything in the data path. Localize all components to the HBA-local nodes.
Reduces latency and improves caching.
Discovering Device Locality
▸ Devices associated with each volume
• multipath -ll
• ls -d /sys/bus/pci/devices/*/host*
▸ Device location
• dmidecode -t slot
• lspci -tv
▸ NUMA node of devices
• /sys/devices/pci*/*/numa_node
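The sysfs lookups can be rolled into one short sketch (a hypothetical helper, not from the slides; the optional first argument exists only so it can be pointed at a scratch directory instead of the live /sys):

```shell
# List every PCI device together with its reported NUMA node.
# A value of -1 means the platform reported no locality (see the BIOS note later).
pci_numa_nodes() {
    root="${1:-/sys}"
    for dev in "$root"/bus/pci/devices/*; do
        [ -r "$dev/numa_node" ] || continue
        printf '%s %s\n' "${dev##*/}" "$(cat "$dev/numa_node")"
    done
}
```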
Pinning Interrupts
▸ Pin IRQs to the local node
▸ Distribute IRQs between cores in the local node

# grep [driver] /proc/interrupts
[num]: ...
# cat /proc/irq/[num]/node
[node]
# echo [CPU mask] > /proc/irq/[num]/smp_affinity

~100% improvement
Reduces latency and CPU contention, improves caching.
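smp_affinity takes a hex CPU mask, so a cpulist such as 20-29 has to be converted first. A minimal sketch (assumes CPU numbers below 64; kernels express larger masks as comma-separated 32-bit groups):

```shell
# Convert a CPU list like "2,4-7" into the hex mask smp_affinity expects.
cpulist_to_mask() {
    mask=0
    for part in $(printf '%s' "$1" | tr ',' ' '); do
        lo=${part%%-*}; hi=${part##*-}   # a bare "2" yields lo=hi=2
        c=$lo
        while [ "$c" -le "$hi" ]; do
            mask=$(( mask | (1 << c) ))
            c=$(( c + 1 ))
        done
    done
    printf '%x\n' "$mask"
}
```

For example, `echo $(cpulist_to_mask 20-29) > /proc/irq/[num]/smp_affinity` pins an IRQ to cores 20-29.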
Persistently Pinning Interrupts
▸ irqbalance: run-time load balancing
• Mixed results
• Better with RHEL 6.4 + kernel update + Sandy Bridge
▸ Customized init service to set affinity on boot
• Best results for a known application
Driver Kernel Thread Affinity
▸ Sometimes handled by Linux drivers
▸ Verify and adjust

# taskset -p -c [cpulist] [pid]
pid [pid]'s current affinity list: 0-159
pid [pid]'s new affinity list: 20-29
Block Device Tuning

# echo noop > /sys/block/[device]/queue/scheduler
• 10% improvement
# echo 0 > /sys/block/[device]/queue/add_random
• 10% improvement
# echo 2 > /sys/block/[device]/queue/rq_affinity
• 5% improvement
# echo [0 or 1] > /sys/block/[device]/queue/rotational
• Mixed results

Reduces latency and CPU utilization.
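The echoes above can be wrapped in one helper (a sketch, not from the slides; the optional second argument exists only so it can be exercised against a scratch directory rather than a live /sys):

```shell
# Apply the non-rotational tunings above to one block device ($1, e.g. "sdb").
tune_queue() {
    q="${2:-/sys/block/$1/queue}"
    echo noop > "$q/scheduler"    # no-op elevator: flash needs no reordering
    echo 0    > "$q/add_random"   # stop feeding I/O timing into the entropy pool
    echo 2    > "$q/rq_affinity"  # complete requests on the submitting CPU
}
```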
Persistent Block Device Tuning
▸ udev rules (/etc/udev/rules.d/)
• I/O devices:
ACTION=="add|change", SUBSYSTEM=="block", ATTR{device/vendor}=="FUSIONIO", ATTR{queue/scheduler}="noop", ATTR{queue/rq_affinity}="2", ATTR{queue/add_random}="0"
• Devices with I/O slave devices (DM multipath):
ACTION=="add|change", KERNEL=="dm-*", PROGRAM="/bin/bash -c 'cat /sys/block/$name/slaves/*/device/vendor | grep FUSIONIO'", ATTR{queue/scheduler}="noop", ATTR{queue/rq_affinity}="2", ATTR{queue/add_random}="0"

Note: udev rules may only rely on sysfs parameters that are available at the time the device is created.
Power/Performance Tuning
▸ Disable C-states in the BIOS
▸ Disable C-states in the boot loader (grub)
• intel_idle.max_cstate=0
• processor.max_cstate=0
▸ 10% improvement

Keeping processors in active states reduces latency.
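The boot loader change amounts to one line (a sketch; the file is /etc/default/grub on most distributions, and grub.cfg must be regenerated, e.g. with grub2-mkconfig, before rebooting):

```shell
# Kernel command line keeping both the intel_idle and acpi idle
# drivers out of deep C-states.
GRUB_CMDLINE_LINUX="intel_idle.max_cstate=0 processor.max_cstate=0"
```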
Application Tuning
▸ OS-provided tools
• taskset
• cgroups
• numad
▸ Application-specific settings
• Oracle
▸ _enable_NUMA_support
Benchmarking Performance
▸ Test thread affinity
• FIO
▸ cpus_allowed
▸ numa_cpu_nodes
▸ numa_mem_policy
• Oracle Orion and others
▸ taskset -c [cpulist] [test command]
▸ Measuring performance
• iostat
• Application-specific
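For FIO, the affinity options go straight into the job file. A minimal sketch (device path and node number are placeholders; the numa_* options require fio built with libnuma):

```ini
[global]
; placeholder device path
filename=/dev/nvme0n1
rw=randread
bs=4k
ioengine=libaio
iodepth=32
direct=1

[node0]
; run this job's threads on node 0's CPUs
numa_cpu_nodes=0
; allocate I/O buffers from node 0's memory
numa_mem_policy=bind:0
```

Comparing this job against one bound to a remote node shows the locality penalty directly.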
8 Socket Total Performance Gains
[Chart: 4K block size IOPS (0 to 800,000), reads and writes, untuned vs. tuned]
Tools for NUMA Tuning
numactl, taskset, dmidecode, irqbalance, top, htop, irqstat, cgroups, lstopo, sysfs, numad, numatop, tuna, tuned-adm
top: terminal is not big enough
[Screenshot: top in RHEL 6.4]
mpstat with 160 cores
/proc/interrupts with 160 cores
top: added NUMA support
https://gitorious.org/procps/procps
irqstat: IRQ viewer for NUMA
https://github.com/lanceshelton/irqstat
Must end users be NUMA-aware?
▸ Unfortunately, yes.
• Users must be aware of PCIe device slot placement
• Optimal NUMA tuning is not yet performed by the OS
• Persistent tuning is a non-trivial task
• Performance challenges are changing faster than tools
How can this be improved?
• NUMA architectures must be detected properly and tuned by default
▸ Phase out Nehalem, or add SLIT support for multiple local nodes
• Linux distributions need to provide optimal tuning across applications and devices at the OS level
• Improve existing tools
What’s next?
▸ More cores per socket
• 15-core CPUs by the end of next year?
▸ Removal of existing bottlenecks
• Multi-queue block layer: http://kernel.dk/blk-mq.pdf
▸ Improved tools
• numatop: https://01.org/numatop
• top: https://gitorious.org/procps/procps
• irqstat: https://github.com/lanceshelton/irqstat
References
▸ HP DL980 Architecture
http://h20195.www2.hp.com/V2/GetPDF.aspx/4AA3-0643ENW.pdf
▸ HP NUMA Support
http://bizsupport1.austin.hp.com/bc/docs/support/SupportManual/c03261871/c03261871.pdf
▸ Performance profiling methods
http://dtrace.org/blogs/brendan/2012/03/07/the-use-method-linux-performance-checklist/
THANK YOU
LShelton@fusionio.com
fusionio.com | REDEFINE WHAT’S POSSIBLE