
High Performance I/O with NUMA Systems in Linux

Lance Shelton

Fusion-io Confidential. Copyright © 2013 Fusion-io, Inc. All rights reserved.



How prevalent is NUMA?

▸ All major vendors
• HP, Dell, IBM, Cisco, SuperMicro, Hitachi, etc.
▸ As small as 1U
▸ 2, 4, and 8 socket systems
▸ 2 to 10 cores per socket
▸ Number of cores doubles with HyperThreading

8 sockets x 10 cores x 2 threads = 160 CPU cores

NUMA is mainstream.


Why use large NUMA servers?

▸ NUMA is key to the current evolution of performance
• Storage rates are drastically increasing
• Processor speeds are stagnant
• Multi-core cannot scale infinitely without NUMA
▸ NUMA meets the needs of the application
• No partitioning is required
• Faster than clustering multiple servers
• Works well in combination with scale-out


Nodes, Sockets, Cores, Threads

[Diagram: memory + socket = NUMA node. Each socket contains multiple cores, and each core runs one or more hardware threads.]


Non-Uniform I/O Access

I/O = PCIe
• Internal flash
• Host bus adapter to storage network


2, 4, and 8 Socket Servers

# numactl --hardware
node   0   1   2   3   4   5   6   7
  0:  10  12  17  17  19  19  19  19
  1:  12  10  17  17  19  19  19  19
  2:  17  17  10  12  19  19  19  19
  3:  17  17  12  10  19  19  19  19
  4:  19  19  19  19  10  12  17  17
  5:  19  19  19  19  12  10  17  17
  6:  19  19  19  19  17  17  10  12
  7:  19  19  19  19  17  17  12  10

In the distance matrix (the ACPI SLIT), 10 is a node's distance to its own local memory; larger values indicate relatively slower access to remote nodes.
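The same distance table is also exposed per node through sysfs, so it can be checked even without numactl installed (a minimal check, not from the original slides; the loop just prints each node's distance row):

# for n in /sys/devices/system/node/node[0-9]*; do echo "${n##*/}: $(cat $n/distance)"; done
node0: 10 12 17 17 19 19 19 19
...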


Enterprise Storage

[Diagram: clients, database servers, and shared enterprise storage.]


PCIe Local Nodes

[Diagram: PCIe attachment on Intel Nehalem (devices hang off a separate I/O hub) vs. Intel Sandy Bridge (PCIe lanes attach directly to a socket).]


PCIe Local Nodes

[Diagram: the same Nehalem and Sandy Bridge topologies as the previous slide.]

# cat /sys/devices/pci0000:50/0000:50:09.0/numa_node
0

Only one local node is presented to the OS, even when the I/O hub serves more than one socket.
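To see the node reported for every PCIe device at once, the same sysfs attribute can be walked in a loop (a minimal sketch, not from the original slides; paths and values vary per system):

# for d in /sys/bus/pci/devices/*; do echo "${d##*/} numa_node=$(cat $d/numa_node)"; done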


PCIe Local Node?

[Diagram: the same topologies, with the local node unknown (marked "?").]

# cat /sys/devices/pci0000:50/0000:50:09.0/numa_node
-1

Not detected? Update the BIOS.


Localizing I/O from Storage

▸ 50-100% performance improvement vs. non-NUMA-aware volume placement


Localizing I/O – High Availability

▸ Node locality of volumes must still be maintained


Localizing I/O – HA Failover

▸ Node locality maintained
▸ All available ports are still used


Localizing I/O – Components

▸ HBA placement
▸ Interrupt affinity
▸ Kernel thread affinity
▸ Application affinity

Analyze everything in the data path and localize all components to the HBA-local nodes. This reduces latency and improves caching.


Discovering Device Locality

▸ Devices associated with each volume
• multipath -ll
• ls -d /sys/bus/pci/devices/*/host*
▸ Device location
• dmidecode -t slot
• lspci -tv
▸ NUMA node of devices
• /sys/devices/pci*/*/numa_node
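Those commands can be combined into a single pass that maps each SCSI host to the PCI function it sits on and that function's NUMA node (a hedged sketch, not from the original slides; adjust the glob if the HBAs of interest expose something other than host* directories):

# for h in /sys/bus/pci/devices/*/host*; do d=${h%/host*}; echo "${h##*/} -> ${d##*/} numa_node=$(cat $d/numa_node)"; done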


Pinning Interrupts

▸ Pin IRQs to the local node
▸ Distribute IRQs between cores in the local node

# grep [driver] /proc/interrupts
[num]: ...
# cat /proc/irq/[num]/node
[node]
# echo [CPU mask] > /proc/irq/[num]/smp_affinity

~100% improvement. Reduces latency and CPU contention, improves caching.
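Scripted, the per-IRQ steps above become a short loop that restricts every interrupt of a driver to the CPUs of its local node (a sketch, not from the original slides; "driverX" is a hypothetical match string, and smp_affinity_list may not exist on older kernels, which instead take a hex mask via smp_affinity):

for irq in $(awk '/driverX/ {sub(":", "", $1); print $1}' /proc/interrupts); do
    node=$(cat /proc/irq/$irq/node)
    # allow the IRQ only on the CPUs of its local NUMA node
    cat /sys/devices/system/node/node$node/cpulist > /proc/irq/$irq/smp_affinity_list
done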


Persistently Pinning Interrupts

▸ irqbalance: run-time load balancing
• Mixed results
• Better with RHEL 6.4 + kernel update + Sandy Bridge
▸ Customized init service to set affinity on boot
• Best results for a known application
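On a RHEL 6-era system, the simplest custom service is a script containing the pinning loop from the previous slide, hooked into boot (a minimal sketch with a hypothetical script path; note that a running irqbalance daemon may later override these settings unless it is disabled or told to ignore those IRQs):

# chmod +x /usr/local/sbin/set-irq-affinity.sh          # script holding the pinning loop
# echo '/usr/local/sbin/set-irq-affinity.sh' >> /etc/rc.d/rc.local
# chkconfig irqbalance off                              # optional: keep irqbalance from rebalancing the pinned IRQs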


Driver Kernel Thread Affinity

▸ Sometimes handled by Linux drivers
▸ Verify and adjust

# taskset -p -c [cpu list] [pid]
pid [pid]'s current affinity list: 0-159
pid [pid]'s new affinity list: 20-29
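The PIDs to adjust can be located by thread name with ps and then moved with the same taskset call (a hedged sketch, not from the original slides; "driverX" is a hypothetical thread name and 20-29 is assumed to be the HBA-local node's core range):

# ps -eo pid,comm | grep driverX          # find the driver's kernel threads
# taskset -p -c 20-29 [pid]               # pin each one to the HBA-local node's cores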


Block Device Tuning

# echo noop > /sys/block/[device]/queue/scheduler
• 10% improvement
# echo 0 > /sys/block/[device]/queue/add_random
• 10% improvement
# echo 2 > /sys/block/[device]/queue/rq_affinity
• 5% improvement
# echo [0 or 1] > /sys/block/[device]/queue/rotational
• Mixed results

Reduces latency and CPU utilization.
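The current values can be dumped in one pass before and after tuning (a small check, not from the original slides; fioa is just an example device name):

# grep -H . /sys/block/fioa/queue/{scheduler,add_random,rq_affinity,rotational}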


Persistent Block Device Tuning

▸ udev rules (/etc/udev/rules.d/)
• I/O devices

ACTION=="add|change", SUBSYSTEM=="block", ATTR{device/vendor}=="FUSIONIO", ATTR{queue/scheduler}="noop", ATTR{queue/rq_affinity}="2", ATTR{queue/add_random}="0"

• Devices with I/O slave devices (DM multipath)

ACTION=="add|change", KERNEL=="dm-*", PROGRAM="/bin/bash -c 'cat /sys/block/$name/slaves/*/device/vendor | grep FUSIONIO'", ATTR{queue/scheduler}="noop", ATTR{queue/rq_affinity}="2", ATTR{queue/add_random}="0"

Note: udev rules may only rely on sysfs parameters that are available at the time the device is created.
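After editing a rules file, the rules can be reloaded and replayed against existing block devices without a reboot (standard udev workflow, not from the original slides; fioa is an example device):

# udevadm control --reload-rules
# udevadm trigger --subsystem-match=block --action=change
# udevadm test /sys/block/fioa              # dry run showing which rules matched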


Power/Performance Tuning

▸ Disable c-states in the BIOS
▸ Disable c-states in the boot loader (grub)
• intel_idle.max_cstate=0
• processor.max_cstate=0
▸ 10% improvement

Keeping processors in active states reduces latency.
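On a RHEL 6-era system these parameters are appended to the kernel line in /boot/grub/grub.conf; after a reboot the settings can be verified from the running kernel (a minimal check, not from the original slides):

# cat /proc/cmdline                                        # confirm the parameters were passed
# cat /sys/module/intel_idle/parameters/max_cstate         # confirm the intel_idle limit took effect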


Application Tuning

▸ OS-provided tools
• taskset
• cgroups
• numad
▸ Application-specific settings
• Oracle: _enable_NUMA_support
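As a concrete example, an application whose HBA is local to node 2 can be pinned with taskset from the list above, or with numactl for a combined CPU and memory policy (a hedged sketch, not from the original slides; the node number, CPU range, and command are placeholders):

# numactl --cpunodebind=2 --membind=2 [application command]     # CPUs and memory on node 2
# taskset -c 20-29 [application command]                        # CPU pinning only, no memory policy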


Benchmarking Performance

▸ Test thread affinity
• FIO (see the example job below)
▸ cpus_allowed
▸ numa_cpu_nodes
▸ numa_mem_policy
• Oracle Orion and others
▸ taskset -c [cpulist] [test command]
▸ Measuring performance
• iostat
• Application-specific
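A minimal FIO job restricted to one NUMA node could look like the following (a sketch under assumed device and node numbers, not from the original slides; numa_cpu_nodes and numa_mem_policy require fio built with libnuma):

[node2-randread]
filename=/dev/fioa
ioengine=libaio
direct=1
rw=randread
bs=4k
iodepth=32
numjobs=4
numa_cpu_nodes=2
numa_mem_policy=bind:2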


8 Socket Total Performance Gains

[Bar chart: 4K block size IOPS (0 to 800,000), untuned vs. tuned, for reads and writes.]


Tools for NUMA Tuning

numactl, cgroups, taskset, lstopo, dmidecode, sysfs, irqbalance, numad, top, numatop, htop, tuna, irqstat, tuned-adm


top: terminal is not big enough

[Screenshot: top in RHEL 6.4 on a 160-core system]


mpstat with 160 cores

[Screenshot: mpstat output on a 160-core system]


/proc/interrupts with 160 cores

[Screenshot: /proc/interrupts output on a 160-core system]


top: added NUMA support

[Screenshot: top with NUMA node support]
https://gitorious.org/procps/procps


irqstat: IRQ viewer for NUMA

[Screenshot: irqstat]
https://github.com/lanceshelton/irqstat


Must end users be NUMA-aware?

▸ Unfortunately, yes.
• Users must be aware of PCIe device slot placement
• Optimal NUMA tuning is not yet performed by the OS
• Persistent tuning is a non-trivial task
• Performance challenges are changing faster than the tools


How can this be improved?

• NUMA architectures must be detected properly and tuned by default
▸ Phase out Nehalem, or add SLIT support for multiple local nodes
• Linux distributions need to provide optimal tuning across applications and devices at the OS level
• Improve existing tools


What's next?

▸ More cores per socket
• 15-core CPUs by the end of next year?
▸ Removal of existing bottlenecks
• Multi-queue block layer: http://kernel.dk/blk-mq.pdf
▸ Improved tools
• numatop: https://01.org/numatop
• top: https://gitorious.org/procps/procps
• irqstat: https://github.com/lanceshelton/irqstat


References

▸ HP DL980 Architecture
• http://h20195.www2.hp.com/V2/GetPDF.aspx/4AA3-0643ENW.pdf
▸ HP NUMA Support
• http://bizsupport1.austin.hp.com/bc/docs/support/SupportManual/c03261871/c03261871.pdf
▸ Performance profiling methods
• http://dtrace.org/blogs/brendan/2012/03/07/the-use-method-linux-performance-checklist/



THANK YOU

LShelton@fusionio.com

fusionio.com | REDEFINE WHAT'S POSSIBLE
