
ZFS in the Trenches

Ben Rockwood
Director of Systems Engineering
Joyent, Inc.


The Big Questions

Is node 5 of 150 struggling?
Is I/O as efficient as it can be?
How fast are requests being answered?
What is my mix of sync vs async I/O?
Do I need to tune?
How do I back it up?


Setting the Record Straight


Correcting Assumptions

Tuning ZFS is Evil (true)
ZFS doesn't require tuning (false)
ZFS is a memory hog (true)
ZFS is slow (false)
ZFS won't allow corruption (false)


Amazing Efficiency

ZFS ARC is extremely efficient
Example: 32 production zones (different customers), 120,000 reads per second... ZERO physical read I/O!
ZFS TXG Sync is extremely efficient, provides a clean and orderly flush of writes to disk
ZFS Prefetch intelligence is smarter than you... but true efficiency can vary based on workload


Physical vs Logical I/O


Observability


Kstat

ZFS ARC Kstats are incredibly useful
Tools such as arcstat or arc_summary are good examples
More Kstats (hopefully) are coming
Things get more interesting when combined with physical disk kstats and VFS layer kstats


root@quadra ~$ kstat -p -n arcstats
zfs:0:arcstats:c 264750592
zfs:0:arcstats:c_max 268435456
zfs:0:arcstats:c_min 126918784
zfs:0:arcstats:class misc
zfs:0:arcstats:crtime 11.976322232
zfs:0:arcstats:data_size 105994240
zfs:0:arcstats:deleted 871464
zfs:0:arcstats:demand_data_hits 2690899
zfs:0:arcstats:demand_data_misses 90664
zfs:0:arcstats:demand_metadata_hits 6643919
zfs:0:arcstats:demand_metadata_misses 340739
zfs:0:arcstats:evict_skip 19395259
zfs:0:arcstats:hash_chain_max 6
zfs:0:arcstats:hash_chains 4508
zfs:0:arcstats:hash_collisions 297073
zfs:0:arcstats:hash_elements 28047
zfs:0:arcstats:hash_elements_max 71400
zfs:0:arcstats:hdr_size 5048208
zfs:0:arcstats:hits 12065410
zfs:0:arcstats:l2_abort_lowmem 0
zfs:0:arcstats:l2_cksum_bad 0
zfs:0:arcstats:l2_evict_lock_retry 0
zfs:0:arcstats:l2_evict_reading 0
zfs:0:arcstats:l2_feeds 0
zfs:0:arcstats:l2_free_on_write 0
zfs:0:arcstats:l2_hdr_size 0
zfs:0:arcstats:l2_hits 0
zfs:0:arcstats:l2_io_error 0
zfs:0:arcstats:l2_misses 0
zfs:0:arcstats:l2_read_bytes 0
zfs:0:arcstats:l2_rw_clash 0
zfs:0:arcstats:l2_size 0
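
A minimal sketch, assuming only the "hits" and "misses" counters shown above: the overall ARC hit ratio can be derived with one pipeline.

#!/bin/ksh
# Overall ARC hit ratio from two arcstats counters: hits / (hits + misses)
kstat -p zfs:0:arcstats:hits zfs:0:arcstats:misses | \
    nawk '{ split($1, k, ":"); v[k[4]] = $2 }
          END { printf("ARC hit ratio: %.2f%%\n",
                       100 * v["hits"] / (v["hits"] + v["misses"])) }'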


MDB

mdb provides many useful features that aren't as difficult to use as you think (see zfs.c)
If you use just one, “::zfs_params” is handy to see all ZFS tunables in one shot
Several walkers are available, most handy when doing postmortem (Solaris CAT has wrappers for that)


root@quadra ~$ mdb -k
Loading modules: [ unix genunix s.....
> ::zfs_params
arc_reduce_dnlc_percent = 0x3
zfs_arc_max = 0x10000000
zfs_arc_min = 0x0
arc_shrink_shift = 0x5
zfs_mdcomp_disable = 0x0
zfs_prefetch_disable = 0x0
zfetch_max_streams = 0x8
zfetch_min_sec_reap = 0x2
zfetch_block_cap = 0x100
zfetch_array_rd_sz = 0x100000
zfs_default_bs = 0x9
zfs_default_ibs = 0xe
metaslab_aliquot = 0x80000
spa_max_replication_override = 0x3
mdb: variable spa_mode not found: unknown symbol name
zfs_flags = 0x0
zfs_txg_synctime = 0x5
zfs_txg_timeout = 0x1e
zfs_write_limit_min = 0x2000000
zfs_write_limit_max = 0x1e428200
zfs_write_limit_shift = 0x3
zfs_write_limit_override = 0x0
zfs_no_write_throttle = 0x0
zfs_vdev_cache_max = 0x4000
zfs_vdev_cache_size = 0xa00000
zfs_vdev_cache_bshift = 0x10
vdev_mirror_shift = 0x15
zfs_vdev_max_pending = 0x23
zfs_vdev_min_pending = 0x4
zfs_scrub_limit = 0xa
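
A minimal sketch of adjusting one of the parameters listed above on a live system with mdb in write mode; tuning remains evil, so verify the symbol exists on your build first.

# Flip zfs_prefetch_disable on the running kernel (write decimal 1 to the variable)
echo "zfs_prefetch_disable/W0t1" | mdb -kw
# Confirm the change (print as decimal)
echo "zfs_prefetch_disable/D" | mdb -k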


DTrace

I live by the FBT provider
Watch entry and return of every ZFS function (w00t!)
Most powerful when used for timing or aggregating stacks to learn code flow (see the sketches below)
The fsstat provider is a hidden gem
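
Two minimal FBT sketches in that spirit; zio_read is just an example function to aggregate stacks on, not the only interesting one.

# Count every function entered in the zfs kernel module
dtrace -n 'fbt:zfs::entry { @[probefunc] = count(); }'

# Aggregate kernel stacks leading into zio_read to learn the read code path
dtrace -n 'fbt::zio_read:entry { @[stack()] = count(); }'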


zdb

Examine on-disk structures
Can find and fix issues
Breakdown of disk utilization can be telling
Extremely interesting, but rarely handy
Have used it to recover deleted files


root@quadra ~$ zdb
quadra
    version=18
    name='quadra'
    state=0
    txg=20039
    pool_guid=15851861834417040262
    hostid=12739881
    hostname='quadra'
    vdev_tree
        type='root'
        id=0
        guid=15851861834417040262
        children[0]
            type='mirror'
            id=0
            guid=7006308066956826552
            whole_disk=0
            metaslab_array=23
            metaslab_shift=33
            ashift=9
            asize=1000188936192
            is_log=0
            children[0]
                type='disk'
                id=0
                guid=916181432694315835
                path='/dev/dsk/c2d0s0'
                devid='id1,cmdk@AHitachi_HDT721010SLA360=______STF605MH1TK48W/a'
                phys_path='/pci@0,0/pci-ide@1f,5/ide@0/cmdk@0,0:a'
                whole_disk=1
                DTL=18
            children[1]
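
Beyond the default config dump above, a couple of hedged zdb invocations for the utilization breakdown mentioned earlier; "quadra/somefs" is a hypothetical dataset name.

# Block statistics for the whole pool (traverses every block; the space breakdown can be telling)
zdb -b quadra

# Dump the object list of one dataset; repeat -d for more detail
zdb -dd quadra/somefs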


Keys to Observability

Use DTrace to hone your understanding of ZFS internals
Don't over-focus your instrumentation; leverage VFS and ZFS together to get a holistic picture
Avoid getting obsessed with the DTrace syscall provider (you can do better)
Hint: Study up on bdev_strategy & biodone (see the sketch below)
Kstats, kstats, kstats....
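
A minimal DTrace sketch of the bdev_strategy/biodone hint, assuming the buf pointer arrives as arg0 in both probes.

#!/usr/sbin/dtrace -s
/* Physical I/O latency: time from bdev_strategy() to biodone(),
   keyed on the buf pointer (arg0 in both probes). */
fbt::bdev_strategy:entry { start[arg0] = timestamp; }

fbt::biodone:entry
/start[arg0]/
{
        @["physical I/O latency (ns)"] = quantize(timestamp - start[arg0]);
        start[arg0] = 0;
}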


Inside ARC


The Cache Lists

Most Recently Used (MRU)
Most Frequently Used (MFU)
MRU Ghost
MFU Ghost
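
Each list is visible through arcstats; a minimal sketch pulling the per-list hit counters plus the adaptive MRU target size (p), assuming these counters exist on your build.

kstat -p zfs:0:arcstats:mru_hits zfs:0:arcstats:mfu_hits \
        zfs:0:arcstats:mru_ghost_hits zfs:0:arcstats:mfu_ghost_hits \
        zfs:0:arcstats:p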


ARC & Prefetch

ARC Kstats can tell you how data arrived in cache: prefetch or direct
The mix of direct to prefetched data can determine if your prefetch is worth the I/O
If prefetch hits are less than 10%, just disable it
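
A hedged sketch of checking that mix, and of disabling prefetch when it is not paying for its I/O.

# Prefetch vs demand hit counters from arcstats
kstat -p zfs:0:arcstats:prefetch_data_hits zfs:0:arcstats:prefetch_data_misses \
        zfs:0:arcstats:demand_data_hits

# If prefetch is not earning its keep, disable it in /etc/system (reboot required)
set zfs:zfs_prefetch_disable = 1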


ARC Sizing

ARC is HUGE! 7/8ths of physical memory by default!
Min and max sizes are tunable
Watch the ghost lists if you limit it
A >25GB ARC on a 32GB node is extremely common
Ghost list hit rate seems to be the best indicator that you should consider L2ARC
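
A minimal /etc/system sketch for capping the ARC on a shared host; the 4GB/1GB values are arbitrary examples, and the tunables are the same zfs_arc_max/zfs_arc_min shown by ::zfs_params earlier.

* Cap the ARC at 4GB with a 1GB floor (values in bytes, reboot required)
set zfs:zfs_arc_max = 0x100000000
set zfs:zfs_arc_min = 0x40000000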


$ ./arc_summary.pl
System Memory:
    Physical RAM: 16247 MB
    Free Memory : 1064 MB
    LotsFree: 253 MB
ZFS Tunables (/etc/system):
ARC Size:
    Current Size: 5931 MB (arcsize)
    Target Size (Adaptive): 5985 MB (c)
    Min Size (Hard Limit): 507 MB (zfs_arc_min)
    Max Size (Hard Limit): 15223 MB (zfs_arc_max)
ARC Size Breakdown:
    Most Recently Used Cache Size: 73% 4405 MB (p)
    Most Frequently Used Cache Size: 26% 1579 MB (c-p)
ARC Efficency:
    Cache Access Total: 30934237785
    Cache Hit Ratio: 99% 30933044424 [Defined State for buffer]
    Cache Miss Ratio: 0% 1193361 [Undefined State for Buffer]
    REAL Hit Ratio: 99% 30888972063 [MRU/MFU Hits Only]
    Data Demand Efficiency: 99%
    Data Prefetch Efficiency: 99%
    CACHE HITS BY CACHE LIST:
        Anon: 0% 43145095 [ New Customer, First Cache Hit ]
        Most Recently Used: 0% 56656566 (mru) [ Return Customer ]
        Most Frequently Used: 99% 30832315497 (mfu) [ Frequent Customer ]
        Most Recently Used Ghost: 0% 280356 (mru_ghost) [ Return Customer Evicted, Now Back ]
        Most Frequently Used Ghost: 0% 646910 (mfu_ghost) [ Frequent Customer Evicted, Now Back ]
    CACHE HITS BY DATA TYPE:
        Demand Data: 78% 24388331448
        Prefetch Data: 19% 6013656064
        Demand Metadata: 1% 472836583
        Prefetch Metadata: 0% 58220329
    CACHE MISSES BY DATA TYPE:
        Demand Data: 21% 255389
        Prefetch Data: 32% 385838
        Demand Metadata: 31% 381534
        Prefetch Metadata: 14% 170600
---------------------------------------------


Physical I/O


ZFS Breathing

Async writes are grouped into a Transaction Group (TXG) and “sync”ed to disk at regular intervals
txg_synctime represents the expected time to flush a TXG (5 sec)
txg_timeout is the typical frequency of flush (30 sec)
Pre-snv_87, txg_time flushed every 5 seconds (tunable)
Can be monitored via DTrace (FBT: spa_sync; see the sketch below)
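
A minimal sketch of the spa_sync monitoring mentioned above, timing each TXG flush from entry to return.

#!/usr/sbin/dtrace -s
/* Time each TXG sync (spa_sync entry to return). */
fbt::spa_sync:entry { self->ts = timestamp; }

fbt::spa_sync:return
/self->ts/
{
        @["spa_sync time (ms)"] = quantize((timestamp - self->ts) / 1000000);
        self->ts = 0;
}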


Read I/O

Reads are always synchronous
...but the ARC absorbs them very, very well


ZFS Intent Log

The ZIL throws a kink in the works
Without a Log Device (“SLOG”) the intent log is part of the pool
Can be monitored via DTrace (FBT: zil_commit_writer; see the sketch below)
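
A minimal sketch of the zil_commit_writer monitoring mentioned above, counting commits per second.

# ZIL commit rate, printed once per second
dtrace -n 'fbt::zil_commit_writer:entry { @commits = count(); }
    tick-1s { printa("ZIL commits/sec: %@d\n", @commits); trunc(@commits); }'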


Disabling ZIL

Sync I/O is treated as async
The fastest SSD won't match the speed of zil_disable=1
ZFS is always consistent on disk, regardless
NFS corruption issues are over-blown
If power is lost, in-flight data (uncommitted TXGs) is lost... this may cause logical corruption
In some situations it's a necessary evil... but have a good UPS
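
For builds of this era, a hedged sketch of how the disable is actually done; only with a good UPS and eyes open.

* In /etc/system (reboot required): treat all synchronous writes as asynchronous
set zfs:zil_disable = 1

It can also be flipped live with mdb -kw, in the same way as the zfs_prefetch_disable example earlier.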


IOSTAT IS DEAD


... deal with it.


iostat

asvc_t & %b have diminished meaning due to TXGs
Monitoring the physical devices is not as telling as monitoring from the VFS layer
What's more interesting is the gap between TXG syncs... which is better monitored via DTrace, rather than inferred
If you must, always use iostat & fsstat together, essentially to see what's going in and out of ZFS (see the sketch below)
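
A minimal sketch of that pairing, assuming one-second sampling intervals.

# Logical (VFS) view of all ZFS filesystems, one-second samples
fsstat zfs 1

# Physical device view; -z hides idle devices
iostat -xnz 1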


The Write Throttle

Delays processes that are over-zealous
Tunable and can be disabled
Provides an excellent hook to find out who is doing heavy I/O
Provides throughput numbers (fbt::dsl_pool_sync; see the sketch below)
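
Two hedged sketches: the first aggregates writers by process name at the VFS write entry point (zfs_write, an assumption of mine rather than the throttle hook itself), the second times dsl_pool_sync as the slide suggests.

# Who is pushing writes into ZFS?
dtrace -n 'fbt::zfs_write:entry { @[execname] = count(); }'

# How long is each pool sync taking?
dtrace -n 'fbt::dsl_pool_sync:entry { self->ts = timestamp; }
    fbt::dsl_pool_sync:return /self->ts/ {
        @["dsl_pool_sync (ms)"] = quantize((timestamp - self->ts) / 1000000);
        self->ts = 0; }'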


Backing it up


Backup Architecture

Today's data loads are getting too big for the traditional weekly-full, daily-incremental architecture
The full/incremental architecture was designed around tape rotation
Disk-to-disk backup is different; dump old assumptions
Backup is really an exercise in asynchronous replication
Joyent backup is rooted in NCDP ideology


Rsync & Co.

Useful for non-ZFS clients, but slow startup time can be a killer
Rsync to a ZFS dataset, snapshot the dataset, repeat. Instant backup retention solution.
For proper consistency with ZFS: snapshot, rsync the snap, then release. Example: DB data/logs.
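
A minimal sketch of the rsync-then-snapshot cycle; clientA, backup/clientA, and its /backup/clientA mountpoint are hypothetical names.

#!/bin/ksh
# Pull the client's data into a ZFS dataset, then snapshot for instant retention
rsync -a --delete clientA:/export/data/ /backup/clientA/
zfs snapshot backup/clientA@$(date +%Y-%m-%d)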


ZFS Send/Recv

An exercise in snapshot replication
No startup lag like rsync
Very fast; in head-to-head rsync vs zfs s/r, zfs s/r is on average 40% faster
Large performance improvements added in snv_105
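
A minimal sketch of the incremental replication cycle; pool/data, the @prev/@now snapshot names, backuphost, and backuppool are all hypothetical.

#!/bin/ksh
# Take a new snapshot, then send only the delta since the previous one
zfs snapshot pool/data@now
zfs send -i pool/data@prev pool/data@now | ssh backuphost zfs recv -d backuppool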


Block (Pool) Replication

AVS (SNDR) can do replication of the pool's block devices, but the pool is unusable
The same goes for similar block-level replication tools


NDMP

I'm watching NDMP with great interest
A native NDMPcopy for Solaris is a big win
The Solaris implementation will snapshot prior to copy and release behind
Supposedly it can use zfs s/r for data in addition to dump/tar
Would allow re-integration of traditional backup “managers” such as NetBackup


Parting Thoughts


Transactional Filesystem

Always bear in mind the transactional nature of ZFS
Throw away most of your UFS training... or at least update it
Consult the VFS layer before jumping to conclusions about device level activity


Re-Discover Storage

Spend time in the code with DTrace
Attempt to challenge old assumptions
Benchmark for fun and profit (use FileBench!)


Focus on the Application

More than ever, benchmark application performance
Don't assume that system-level I/O metrics tell the whole story
Use DTrace to explore the interaction between your app and the VFS


Thank You.

These people are awesome!

Robert Milkowski
Jason Williams
Marcelo Leal
Max Bruning
James Dickens
Adrian Cockroft
Jim Mauro
Richard McDougall
Tom Haynes
Peter Tribble
Jason King
Octave Orgeron
Brendan Gregg
Matty
Roch
Joerg Moellenkamp
...


Resources

Joyent: joyent.com
Cuddletech: cuddletech.com
Solaris Internals Wiki: solarisinternals.com/wiki
Also:
blogs.sun.com
planetsolaris.com
