09.07.2015 Views

Lustre SMP scalability and data I/O performance on Bullx ... - EOFS

Lustre SMP scalability and data I/O performance on Bullx ... - EOFS

Lustre SMP scalability and data I/O performance on Bullx ... - EOFS

SHOW MORE
SHOW LESS

You also want an ePaper? Increase the reach of your titles

YUMPU automatically turns print PDFs into web optimized ePapers that Google loves.

Compute Partiti<strong>on</strong>sWhat is it ?divide multi-CPU server into several processing partiti<strong>on</strong>spartiti<strong>on</strong>-local <str<strong>on</strong>g>and</str<strong>on</strong>g> per-partiti<strong>on</strong> <str<strong>on</strong>g>data</str<strong>on</strong>g> allocators<str<strong>on</strong>g>Lustre</str<strong>on</strong>g> threads can be bound to compute partiti<strong>on</strong>sC<strong>on</strong>figurati<strong>on</strong> parameterscpu_npartiti<strong>on</strong>s : # of CPU partiti<strong>on</strong>s opti<strong>on</strong>s libcfs cpu_npartiti<strong>on</strong>s=2cpu_pattern : CPU partiti<strong>on</strong>s pattern opti<strong>on</strong>s libcfs cpu_pattern="0[0-7] 1[8-15]" opti<strong>on</strong>s libcfs cpu_pattern="N 0[0] 1[2]"server0 1 2 3 4 5 6 7socket 0 socket 18 9 10 11socket 212 13 14 15socket 3©Bull 20127


Partiti<strong>on</strong>ed LNet <str<strong>on</strong>g>and</str<strong>on</strong>g> LNDWhat is it ?LND thread pool for each compute partiti<strong>on</strong>LNet per-partiti<strong>on</strong> <str<strong>on</strong>g>data</str<strong>on</strong>g>reduce lock c<strong>on</strong>tenti<strong>on</strong>©Bull 20129


Partiti<strong>on</strong>ed ptlrpc serviceWhat is it ?ptlrpc service thread pool for each partiti<strong>on</strong>request-queue <str<strong>on</strong>g>and</str<strong>on</strong>g> wait-queue for each partiti<strong>on</strong>to be completed with ptlrpc requests posting (lustre client)©Bull 201212


Partiti<strong>on</strong>ed ptlrpc serviceWhat is it ?ptlrpc service thread pool for each partiti<strong>on</strong>request-queue <str<strong>on</strong>g>and</str<strong>on</strong>g> wait-queue for each partiti<strong>on</strong>to be completed with ptlrpc requests posting (lustre client)C<strong>on</strong>figurati<strong>on</strong> parametersCPU partiti<strong>on</strong>s threads should run <strong>on</strong>– MDS threads : mds_num_cpts, mds_rdpg_num_cpts, mds_attr_num_cpts– OSS threads : oss_cpts, oss_io_cpts– ldlm threads : ldlm_cpts©Bull 201213


Data I/O Performance©Bull 201214


Data I/O <str<strong>on</strong>g>performance</str<strong>on</strong>g>Goalevaluate the effect <strong>on</strong> <str<strong>on</strong>g>data</str<strong>on</strong>g> I/O <str<strong>on</strong>g>performance</str<strong>on</strong>g>measure the improvement <strong>on</strong> bullx Supernodes– large <str<strong>on</strong>g>SMP</str<strong>on</strong>g> node (up to 128 cores), highly NUMIOA– follow up of NUMIOA presentati<strong>on</strong>s at LUG 2010 <str<strong>on</strong>g>and</str<strong>on</strong>g> LUG 2011ib3ib2BCSib1BCSib0 socket BCS 12 socket 13socket 8 socket 9BCSsocket 4 socket 50 1 2 3 socket 14 socket 15socketsocket1014 5 6 7socket 11QPIsocket 6 socket 7socket 2 socket 3©Bull 201216


Test Cluster Architecturekernel 2.6.32OFED 1.5.48 x 2-socketClientsbullx R424F3ClientClientClientClientClientClientClientClientIB networkFDR2 x IB2 x IBlustre 2.2.93two o2ib lustre networksOSSOSS8 x FC88 x FC8DiskArray16-socketClientbullx S6010-4Large <str<strong>on</strong>g>SMP</str<strong>on</strong>g>Client2 x IB QDR2-socket OSSsbullx R423E330 OSTsDDN SFA10k©Bull 201217


<str<strong>on</strong>g>Lustre</str<strong>on</strong>g> C<strong>on</strong>figurati<strong>on</strong>snocpt compute partiti<strong>on</strong>s disabledauto compute partiti<strong>on</strong>s automatic setupcpt2 two compute partiti<strong>on</strong>scpt2lcl two compute partiti<strong>on</strong>s, network interfaces bound to local CPTcpt2rmt two compute partiti<strong>on</strong>s, network interfaces bound to remote CPTlustre21CLIENTOSSfc0 fc1 fc2 fc3socket 0 socket 1QPIsocket 0 socket 1QPIib0 ib0:0o2ib0 o2ib1ib0o2ib0ib1o2ib1©Bull 201218


<str<strong>on</strong>g>Lustre</str<strong>on</strong>g> C<strong>on</strong>figurati<strong>on</strong>snocpt compute partiti<strong>on</strong>s disabledauto compute partiti<strong>on</strong>s automatic setupcpt2 two compute partiti<strong>on</strong>scpt2lcl two compute partiti<strong>on</strong>s, network interfaces bound to local CPTcpt2rmt two compute partiti<strong>on</strong>s, network interfaces bound to remote CPTlustre21CPT0CLIENTCPT0OSSfc0 fc1 fc2 fc3socket 0 socket 1QPIsocket 0 socket 1QPIib0 ib0:0o2ib0 o2ib1ib0o2ib0ib1o2ib1©Bull 201219


<str<strong>on</strong>g>Lustre</str<strong>on</strong>g> C<strong>on</strong>figurati<strong>on</strong>snocpt compute partiti<strong>on</strong>s disabledauto compute partiti<strong>on</strong>s automatic setupcpt2 two compute partiti<strong>on</strong>scpt2lcl two compute partiti<strong>on</strong>s, network interfaces bound to local CPTcpt2rmt two compute partiti<strong>on</strong>s, network interfaces bound to remote CPTlustre21CLIENTOSSfc0 fc1 fc2 fc3socket 0 socket 1QPICPT0 CPT1 CPT2 CPT3socket 0 socket 1QPICPT0 CPT1 CPT2 CPT3ib0 ib0:0o2ib0 o2ib1ib0o2ib0ib1o2ib1©Bull 201220


<str<strong>on</strong>g>Lustre</str<strong>on</strong>g> C<strong>on</strong>figurati<strong>on</strong>snocpt compute partiti<strong>on</strong>s disabledauto compute partiti<strong>on</strong>s automatic setupcpt2 two compute partiti<strong>on</strong>scpt2lcl two compute partiti<strong>on</strong>s, network interfaces bound to local CPTcpt2rmt two compute partiti<strong>on</strong>s, network interfaces bound to remote CPTlustre21CPT0CLIENTCPT1socket 0 socket 1QPICPT0OSSfc0 fc1 fc2 fc3CPT1socket 0 socket 1QPIib0 ib0:0o2ib0 o2ib1ib0o2ib0ib1o2ib1©Bull 201221


<str<strong>on</strong>g>Lustre</str<strong>on</strong>g> C<strong>on</strong>figurati<strong>on</strong>snocpt compute partiti<strong>on</strong>s disabledauto compute partiti<strong>on</strong>s automatic setupcpt2 two compute partiti<strong>on</strong>scpt2lcl two compute partiti<strong>on</strong>s, network interfaces bound to local CPTcpt2rmt two compute partiti<strong>on</strong>s, network interfaces bound to remote CPTlustre21CLIENTOSSfc0 fc1 fc2 fc3socket 0 socket 1QPIsocket 0 socket 1QPICPT0CPT1CPT0CPT1ib0 ib0:0o2ib0 o2ib1ib0o2ib0ib1o2ib1©Bull 201222


<str<strong>on</strong>g>Lustre</str<strong>on</strong>g> C<strong>on</strong>figurati<strong>on</strong>snocpt compute partiti<strong>on</strong>s disabledauto compute partiti<strong>on</strong>s automatic setupcpt2 two compute partiti<strong>on</strong>scpt2lcl two compute partiti<strong>on</strong>s, network interfaces bound to local CPTcpt2rmt two compute partiti<strong>on</strong>s, network interfaces bound to remote CPTlustre21BCSib1BCSib0 socket 12 socket 13BCSsocket 8 socket 9BCSsocket 4 socket 5socket 14 socket 15socket 0socketsocket101socket 11QPIsocket 6 socket 7socket 2 socket 3Large <str<strong>on</strong>g>SMP</str<strong>on</strong>g> CLIENT©Bull 201223


<str<strong>on</strong>g>Lustre</str<strong>on</strong>g> C<strong>on</strong>figurati<strong>on</strong>snocpt compute partiti<strong>on</strong>s disabledauto compute partiti<strong>on</strong>s automatic setupcpt2 two compute partiti<strong>on</strong>scpt2lcl two compute partiti<strong>on</strong>s, network interfaces bound to local CPTcpt2rmt two compute partiti<strong>on</strong>s, network interfaces bound to remote CPTlustre21BCSib1BCSib0 socket 12 socket 13BCSsocket 8 socket 9BCSsocket 4 socket 5socket 14 socket 15socket 0socketsocket101socket 11QPIsocket 6 socket 7socket 2 socket 3CPT0Large <str<strong>on</strong>g>SMP</str<strong>on</strong>g> CLIENT©Bull 201224


<str<strong>on</strong>g>Lustre</str<strong>on</strong>g> C<strong>on</strong>figurati<strong>on</strong>snocpt compute partiti<strong>on</strong>s disabledauto compute partiti<strong>on</strong>s automatic setupcpt2 two compute partiti<strong>on</strong>scpt2lcl two compute partiti<strong>on</strong>s, network interfaces bound to local CPTcpt2rmt two compute partiti<strong>on</strong>s, network interfaces bound to remote CPTlustre21CPT0BCSib1CPT6 BCS CPT7ib0 socket 12 socket 13CPT4 BCS CPT5socket 8 socket 9CPT2 BCS CPT3socket 4 socket 5CPT1 socket 14 socket 15socket 0socketsocket101socket 11QPIsocket 6 socket 7socket 2 socket 3Large <str<strong>on</strong>g>SMP</str<strong>on</strong>g> CLIENT©Bull 201225


<str<strong>on</strong>g>Lustre</str<strong>on</strong>g> C<strong>on</strong>figurati<strong>on</strong>snocpt compute partiti<strong>on</strong>s disabledauto compute partiti<strong>on</strong>s automatic setupcpt2 two compute partiti<strong>on</strong>scpt2lcl two compute partiti<strong>on</strong>s, network interfaces bound to local CPTcpt2rmt two compute partiti<strong>on</strong>s, network interfaces bound to remote CPTlustre21BCSib1BCSib0 socket 12 socket 13BCSsocket 8 socket 9BCSCPT1socket 4 socket 5socket 14CPT0socket 15socket 0socketsocket101socket 11QPIsocket 6 socket 7socket 2 socket 3Large <str<strong>on</strong>g>SMP</str<strong>on</strong>g> CLIENT©Bull 201226


<str<strong>on</strong>g>Lustre</str<strong>on</strong>g> C<strong>on</strong>figurati<strong>on</strong>snocpt compute partiti<strong>on</strong>s disabledauto compute partiti<strong>on</strong>s automatic setupcpt2 two compute partiti<strong>on</strong>scpt2lcl two compute partiti<strong>on</strong>s, network interfaces bound to local CPTcpt2rmt two compute partiti<strong>on</strong>s, network interfaces bound to remote CPTlustre21BCSib1BCSib0 socket 12 socket 13BCSsocket 8 socket 9BCSCPT1socket 4 socket 5socket 14CPT0socket 15socket 0socketsocket101socket 11QPIsocket 6 socket 7socket 2 socket 3Large <str<strong>on</strong>g>SMP</str<strong>on</strong>g> CLIENT©Bull 201227


Data I/O <str<strong>on</strong>g>performance</str<strong>on</strong>g> measurementsLNet<str<strong>on</strong>g>Lustre</str<strong>on</strong>g> <str<strong>on</strong>g>data</str<strong>on</strong>g> I/OObdfilterClientClientClientClientClientClientClientClientIB networkFDR2 x IB2 x IBOSSOSS8 x FC88 x FC8DiskArrayLarge <str<strong>on</strong>g>SMP</str<strong>on</strong>g>Client2 x IB QDR©Bull 201228


LNet <str<strong>on</strong>g>data</str<strong>on</strong>g> b<str<strong>on</strong>g>and</str<strong>on</strong>g>widthsingle client2-socket client : 5% improvement vs. lustre21 or no CPTlarge <str<strong>on</strong>g>SMP</str<strong>on</strong>g> client : 80% improvementLNet Selftests - <str<strong>on</strong>g>data</str<strong>on</strong>g> b<str<strong>on</strong>g>and</str<strong>on</strong>g>widthsingle client R424F3 (1 IB FDR)LNet Selftests - <str<strong>on</strong>g>data</str<strong>on</strong>g> b<str<strong>on</strong>g>and</str<strong>on</strong>g>widthsingle client S6010-4 (2 IB QDR)60006000Throughput (MB/s)50004000300020001000nocptautocpt2cpt2lclcpt2rmtlustre21Throughput (MB/s)50004000300020001000nocptautocpt2cpt2lclcpt2rmtlustre210w riteread0w riteread©Bull 201229


Obdfilter b<str<strong>on</strong>g>and</str<strong>on</strong>g>widthtest process placement according to NUMA node of each OSTenhanced versi<strong>on</strong> of obdfilter-surveyno NUMIOA effect observedThroughput (MB/s)120001000080006000400020000Throughput (MB/s)Obdfilter1 obj/OST - write12000100008000600040002000Obdfilter1 obj/OST - writeThroughput (MB/s)1200010000001 2 4 81 16 2 32 4 64 8 16 1 2 32 4 64 8 16 32 648000600040002000Obdfilter1 obj/OST - readthreads/OSTthreads/OSTthreads/OSTlustre21, no binding lustre21, lcl binding lustre21, rmt bindingauto, no binding auto, lcl binding auto, rmt binding©Bull 201230


File System <str<strong>on</strong>g>data</str<strong>on</strong>g> I/O B<str<strong>on</strong>g>and</str<strong>on</strong>g>widthsingle clientfar below LNet b<str<strong>on</strong>g>and</str<strong>on</strong>g>width:<str<strong>on</strong>g>performance</str<strong>on</strong>g> is limited by other <str<strong>on</strong>g>Lustre</str<strong>on</strong>g> client comp<strong>on</strong>entsLU-744 Single client's <str<strong>on</strong>g>performance</str<strong>on</strong>g> degradati<strong>on</strong> <strong>on</strong> 2.1IOR single client - file per processR424F3 - 16 processesIOR single client - file per processS6010-4 - 30 processes30003000Throughput (MB/s)2500200015001000500nocptautocpt2cpt2lclcpt2rmtlustre21Throughput (MB/s)2500200015001000500nocptautocpt2cpt2lclcpt2rmtlustre210w riteread0w riteread©Bull 201231


File System <str<strong>on</strong>g>data</str<strong>on</strong>g> I/O B<str<strong>on</strong>g>and</str<strong>on</strong>g>widthmultiple clientsOSSs have a low NUMIOA effect (obdfilter results)at least no regressi<strong>on</strong>8000IOR multiple clients - file per process7 R424F3 clients - 15 procs per clientThroughput (MB/s)7000600050004000300020001000nocptautocpt2cpt2lclcpt2rmtlustre210w riteread©Bull 201232


C<strong>on</strong>clusi<strong>on</strong><str<strong>on</strong>g>SMP</str<strong>on</strong>g> <str<strong>on</strong>g>scalability</str<strong>on</strong>g> <str<strong>on</strong>g>and</str<strong>on</strong>g> affinity featureefficient APIs to run <str<strong>on</strong>g>Lustre</str<strong>on</strong>g> <strong>on</strong> <str<strong>on</strong>g>SMP</str<strong>on</strong>g> <str<strong>on</strong>g>and</str<strong>on</strong>g> NUMIOA machinesc<strong>on</strong>figurati<strong>on</strong> is manual– "dynamic LNet c<strong>on</strong>figurati<strong>on</strong>" projectData I/O <str<strong>on</strong>g>performance</str<strong>on</strong>g> resultsat LNet level: visible improvementat filesystem level:– at least no regressi<strong>on</strong>– masked by client limitati<strong>on</strong>s©Bull 201233


©Bull 201234


Hardware / Software c<strong>on</strong>figurati<strong>on</strong>OSS2 bullx R423E3 - Intel S<str<strong>on</strong>g>and</str<strong>on</strong>g>yBridgeE5-2660, 2.20GHz, 16 CPU cores, 32GBMemory, 2 x IB FDR adapters, 4 x biportFC8 adaptersDDN SFA10k - 16 x FC8, 300 x 940GBSATA disks, 30 x RAID6 pools30 OSTs - 30 <str<strong>on</strong>g>data</str<strong>on</strong>g> + 30 journal LUNsClients8 bullx R424F3 - Intel S<str<strong>on</strong>g>and</str<strong>on</strong>g>yBridgeE5­2660, 2.20GHz, 16 CPU cores, 32GBMemory, IB FDR adaptersNetworkInfiniB<str<strong>on</strong>g>and</str<strong>on</strong>g> FDRtwo o2ib lustre networksOSTs restricted to <strong>on</strong>e of thelustre networkSoftware Stackbullxlinux 6.2 (based <strong>on</strong> rhel 6.2)OFED 1.5.4lustre 2.2.931 bullx S6010-4 - Intel NehalemEXX7550, 2.00GHz, 128 CPU cores, 256GBMemory, 2 x IB QDR adapters©Bull 201235


Test parametersLNet Selftestsbrw read/write, o2ib0+o2ib1, size=1M, c<strong>on</strong>currency=64, check=n<strong>on</strong>eObdfilter-surveyIORcase=disk, size=10GB, recordsize=1Mnobjlo=1, nobjhi=64, thrlo=1, thrhi=256verify=0numactl=n<strong>on</strong>e/local/remotenumactl --physcpubind=xxx --localalloc lctl test_brw ...stripe count=1, stripe size=1Mapi=POSIX, filePerProc=1, fsync=1, transferSize=1MBsingle R424F3 client: blockSize=4GB, numTasks=16single S6010-4 client: blockSize=2GB, numTasks=30multiple R424F3 clients: blockSize=4GB, numTasks=15*#clients©Bull 201236


<str<strong>on</strong>g>Lustre</str<strong>on</strong>g> <str<strong>on</strong>g>SMP</str<strong>on</strong>g> Node Affinity - MDS resultsfrom Liang Zhen's presentati<strong>on</strong> - Aug, 28 2012LNet Selftestslustre 2.3 ping is 900% (600%) of lustre 2.2 with Portal-RR OFF (ON)lustre 2.3 4K-BRW is 500%-700% of lustre2.2mdtestlustre 2.3 open-create <str<strong>on</strong>g>performance</str<strong>on</strong>g> is 350%-400% of lustre 2.2lustre 2.3 unlink <str<strong>on</strong>g>performance</str<strong>on</strong>g> is 150%-300% of lustre 2.2lustre 2.3 stat <str<strong>on</strong>g>performance</str<strong>on</strong>g> is 200%-400% of lustre 2.2©Bull 201237

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!