ZFS in the Trenches<br />
Ben Rockwood<br />
Director of Systems Engineering<br />
Joyent, Inc.
The Big Questions<br />
Is node 5 of 150 struggling?<br />
Is I/O as efficient as it can be?<br />
How fast are requests being answered?<br />
What is my mix of sync vs async I/O?<br />
Do I need to tune?<br />
How do I back it up?
Setting the Record Straight
Correcting Assumptions<br />
Tuning ZFS is Evil (true)<br />
ZFS doesn’t require tuning (false)<br />
ZFS is a memory hog (true)<br />
ZFS is slow (false)<br />
ZFS won’t allow corruption (false)
Amazing Efficiency<br />
ZFS ARC is extremely efficient<br />
Example: 32 production zones (different customers),<br />
120,000 reads per second... ZERO physical read I/O!<br />
ZFS TXG Sync is extremely efficient, provides a clean<br />
and orderly flush of writes to disk<br />
ZFS Prefetch intelligence is smarter than you... but true<br />
efficiency can vary based on workload.
Physical vs Logical I/O
Observability
Kstat<br />
ZFS ARC Kstats are incredibly useful<br />
Tools such as arcstat or arc_summary are good<br />
examples<br />
More Kstats (hopefully) are coming<br />
Things get more interesting when combined with<br />
physical disk kstats and VFS layer kstats
root@quadra ~$ kstat -p -n arcstats<br />
zfs:0:arcstats:c 264750592<br />
zfs:0:arcstats:c_max 268435456<br />
zfs:0:arcstats:c_min 126918784<br />
zfs:0:arcstats:class misc<br />
zfs:0:arcstats:crtime 11.976322232<br />
zfs:0:arcstats:data_size 105994240<br />
zfs:0:arcstats:deleted 871464<br />
zfs:0:arcstats:demand_data_hits 2690899<br />
zfs:0:arcstats:demand_data_misses 90664<br />
zfs:0:arcstats:demand_metadata_hits 6643919<br />
zfs:0:arcstats:demand_metadata_misses 340739<br />
zfs:0:arcstats:evict_skip 19395259<br />
zfs:0:arcstats:hash_chain_max 6<br />
zfs:0:arcstats:hash_chains 4508<br />
zfs:0:arcstats:hash_collisions 297073<br />
zfs:0:arcstats:hash_elements 28047<br />
zfs:0:arcstats:hash_elements_max 71400<br />
zfs:0:arcstats:hdr_size 5048208<br />
zfs:0:arcstats:hits 12065410<br />
zfs:0:arcstats:l2_abort_lowmem 0<br />
zfs:0:arcstats:l2_cksum_bad 0<br />
zfs:0:arcstats:l2_evict_lock_retry 0<br />
zfs:0:arcstats:l2_evict_reading 0<br />
zfs:0:arcstats:l2_feeds 0<br />
zfs:0:arcstats:l2_free_on_write 0<br />
zfs:0:arcstats:l2_hdr_size 0<br />
zfs:0:arcstats:l2_hits 0<br />
zfs:0:arcstats:l2_io_error 0<br />
zfs:0:arcstats:l2_misses 0<br />
zfs:0:arcstats:l2_read_bytes 0<br />
zfs:0:arcstats:l2_rw_clash 0<br />
zfs:0:arcstats:l2_size 0
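The hit and miss counters above are all you need to compute cache efficiency by hand. A minimal sketch using the demand-data counters copied from the sample output (on a live system you would parse `kstat -p -n arcstats` instead of hard-coding the values):

```shell
# demand-data counters copied from the sample arcstats output above
hits=2690899
misses=90664

# integer hit rate in percent
rate=$(( hits * 100 / (hits + misses) ))
echo "demand data hit rate: ${rate}%"   # prints 96% for these sample values
```

The same arithmetic applies to the `demand_metadata_*` and `prefetch_*` counter pairs.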
MDB<br />
mdb provides many useful features that aren’t as<br />
difficult to use as you might think (see zfs.c)<br />
If you use just one, “::zfs_params” is handy to see all<br />
ZFS tunables in one shot<br />
Several walkers are available, most handy when doing<br />
postmortem (Solaris CAT has wrappers for that.)
root@quadra ~$ mdb -k<br />
Loading modules: [ unix genunix s.....<br />
> ::zfs_params<br />
arc_reduce_dnlc_percent = 0x3<br />
zfs_arc_max = 0x10000000<br />
zfs_arc_min = 0x0<br />
arc_shrink_shift = 0x5<br />
zfs_mdcomp_disable = 0x0<br />
zfs_prefetch_disable = 0x0<br />
zfetch_max_streams = 0x8<br />
zfetch_min_sec_reap = 0x2<br />
zfetch_block_cap = 0x100<br />
zfetch_array_rd_sz = 0x100000<br />
zfs_default_bs = 0x9<br />
zfs_default_ibs = 0xe<br />
metaslab_aliquot = 0x80000<br />
spa_max_replication_override = 0x3<br />
mdb: variable spa_mode not found: unknown symbol name<br />
zfs_flags = 0x0<br />
zfs_txg_synctime = 0x5<br />
zfs_txg_timeout = 0x1e<br />
zfs_write_limit_min = 0x2000000<br />
zfs_write_limit_max = 0x1e428200<br />
zfs_write_limit_shift = 0x3<br />
zfs_write_limit_override = 0x0<br />
zfs_no_write_throttle = 0x0<br />
zfs_vdev_cache_max = 0x4000<br />
zfs_vdev_cache_size = 0xa00000<br />
zfs_vdev_cache_bshift = 0x10<br />
vdev_mirror_shift = 0x15<br />
zfs_vdev_max_pending = 0x23<br />
zfs_vdev_min_pending = 0x4<br />
zfs_scrub_limit = 0xa
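`::zfs_params` prints everything in hex, which is easy to misread. Converting a few of the sample values above to human units (values copied from the output; plain shell arithmetic):

```shell
# zfs_arc_max = 0x10000000 bytes
printf 'zfs_arc_max:     %d MB\n' $(( 0x10000000 / 1048576 ))   # 256 MB

# zfs_txg_timeout = 0x1e seconds
printf 'zfs_txg_timeout: %d s\n'  $(( 0x1e ))                   # 30 s

# zfs_default_bs = 0x9 is a shift: default block size is 1 << 9
printf 'zfs_default_bs:  %d bytes\n' $(( 1 << 0x9 ))            # 512 bytes
```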
DTrace<br />
I live by the FBT provider<br />
Watch entry and return of every ZFS function (w00t!)<br />
Most powerful when used for timing or aggregating<br />
stacks to learn code flow<br />
The fsinfo provider is a hidden gem.
zdb<br />
Examine on-disk structures<br />
Can find and fix issues<br />
Breakdown of disk utilization can be telling<br />
Extremely interesting, but rarely handy<br />
Have used it to recover deleted files
root@quadra ~$ zdb<br />
quadra<br />
version=18<br />
name='quadra'<br />
state=0<br />
txg=20039<br />
pool_guid=15851861834417040262<br />
hostid=12739881<br />
hostname='quadra'<br />
vdev_tree<br />
type='root'<br />
id=0<br />
guid=15851861834417040262<br />
children[0]<br />
type='mirror'<br />
id=0<br />
guid=7006308066956826552<br />
whole_disk=0<br />
metaslab_array=23<br />
metaslab_shift=33<br />
ashift=9<br />
asize=1000188936192<br />
is_log=0<br />
children[0]<br />
type='disk'<br />
id=0<br />
guid=916181432694315835<br />
path='/dev/dsk/c2d0s0'<br />
devid='id1,cmdk@AHitachi_HDT721010SLA360=______STF605MH1TK48W/a'<br />
phys_path='/pci@0,0/pci-ide@1f,5/ide@0/cmdk@0,0:a'<br />
whole_disk=1<br />
DTL=18<br />
children[1]
Keys to Observability<br />
Use DTrace to hone your understanding of ZFS<br />
internals<br />
Don’t over-focus your instrumentation, leverage VFS<br />
and ZFS together to get a holistic picture<br />
Avoid getting obsessed with the DTrace syscall<br />
provider (you can do better)<br />
Hint: Study up on bdev_strategy & biodone<br />
Kstats, kstats, kstats....
Inside ARC
The Cache Lists<br />
Most Recently Used (MRU)<br />
Most Frequently Used (MFU)<br />
MRU Ghost<br />
MFU Ghost
ARC & Prefetch<br />
ARC Kstats can tell you how data arrived in cache:<br />
prefetch or direct<br />
The mix of direct to prefetched data can determine<br />
whether your prefetch is worth the I/O<br />
If prefetch hits are less than 10%, just disable it
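One way to measure that mix, sketched with the prefetch and total hit counters from the arc_summary output shown later in the deck (hard-coded here; a real script would parse `kstat -p -n arcstats`):

```shell
# sample counters: prefetch-data hits vs. total cache hits
prefetch_hits=6013656064
total_hits=30933044424

share=$(( prefetch_hits * 100 / total_hits ))
echo "prefetch share of cache hits: ${share}%"   # 19% for the sample data
```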
ARC Sizing<br />
ARC is HUGE! 7/8th of Physical Memory by default!<br />
Min and Max size is tunable<br />
Watch ghost lists if you limit<br />
A >25GB ARC on a 32GB node is extremely common<br />
Ghost list hit rate seems to be the best indicator that<br />
you should consider L2ARC
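A rough way to quantify that indicator from the ghost-list counters in the arc_summary output below: a ghost hit means a buffer was cached, evicted, then wanted again, which is exactly what an L2ARC would have retained. A sketch with the sample numbers hard-coded:

```shell
# mru_ghost + mfu_ghost hits vs. total cache accesses (sample values)
ghost_hits=$(( 280356 + 646910 ))
total=30934237785

pct=$(awk -v g="$ghost_hits" -v t="$total" 'BEGIN { printf "%.3f", g * 100 / t }')
echo "ghost hit rate: ${pct}% of all accesses"   # ~0.003%: L2ARC would not help here
```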
$ ./arc_summary.pl<br />
System Memory:<br />
Physical RAM: 16247 MB<br />
Free Memory : 1064 MB<br />
LotsFree: 253 MB<br />
ZFS Tunables (/etc/system):<br />
ARC Size:<br />
Current Size: 5931 MB (arcsize)<br />
Target Size (Adaptive): 5985 MB (c)<br />
Min Size (Hard Limit): 507 MB (zfs_arc_min)<br />
Max Size (Hard Limit): 15223 MB (zfs_arc_max)<br />
ARC Size Breakdown:<br />
Most Recently Used Cache Size: 73% 4405 MB (p)<br />
Most Frequently Used Cache Size: 26% 1579 MB (c-p)<br />
ARC Efficency:<br />
Cache Access Total: 30934237785<br />
Cache Hit Ratio: 99% 30933044424 [Defined State for buffer]<br />
Cache Miss Ratio: 0% 1193361 [Undefined State for Buffer]<br />
REAL Hit Ratio: 99% 30888972063 [MRU/MFU Hits Only]<br />
Data Demand Efficiency: 99%<br />
Data Prefetch Efficiency: 99%<br />
CACHE HITS BY CACHE LIST:<br />
Anon: 0% 43145095 [ New Customer, First Cache Hit ]<br />
Most Recently Used: 0% 56656566 (mru) [ Return Customer ]<br />
Most Frequently Used: 99% 30832315497 (mfu) [ Frequent Customer ]<br />
Most Recently Used Ghost: 0% 280356 (mru_ghost) [ Return Customer Evicted, Now Back ]<br />
Most Frequently Used Ghost: 0% 646910 (mfu_ghost) [ Frequent Customer Evicted, Now Back ]<br />
CACHE HITS BY DATA TYPE:<br />
Demand Data: 78% 24388331448<br />
Prefetch Data: 19% 6013656064<br />
Demand Metadata: 1% 472836583<br />
Prefetch Metadata: 0% 58220329<br />
CACHE MISSES BY DATA TYPE:<br />
Demand Data: 21% 255389<br />
Prefetch Data: 32% 385838<br />
Demand Metadata: 31% 381534<br />
Prefetch Metadata: 14% 170600<br />
---------------------------------------------
Physical I/O
ZFS Breathing<br />
Async writes grouped into Transaction Group (TXG)<br />
and “sync”ed to disk at regular intervals<br />
txg_synctime represents expected time to flush a TXG<br />
(5 sec)<br />
txg_timeout is typical frequency of flush (30 sec)<br />
Pre-snv_87 txg_time flushed every 5 seconds (tunable)<br />
Can be monitored via DTrace (FBT: spa_sync)
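A minimal sketch of that monitoring as a D script (not runnable here: it needs root and DTrace, and assumes the fbt probe names from the slide exist on your build):

```
#!/usr/sbin/dtrace -s
/* distribution of spa_sync() latency, i.e. how long each TXG flush takes */
fbt::spa_sync:entry
{
        self->ts = timestamp;
}

fbt::spa_sync:return
/self->ts/
{
        @["TXG sync time (ns)"] = quantize(timestamp - self->ts);
        self->ts = 0;
}
```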
Read I/O<br />
Reads are always synchronous<br />
...but the ARC absorbs them very, very well
ZFS Intent Log<br />
ZIL throws a kink in the works<br />
Without a Log Device (“SLOG”) the intent log is part of<br />
the pool<br />
Can be monitored via DTrace (FBT: zil_commit_writer)
Disabling ZIL<br />
Sync I/O treated as Async<br />
The fastest SSD won’t match the speed of<br />
zil_disable=1<br />
ZFS is always consistent on disk, regardless<br />
NFS corruption issues over-blown<br />
If power is lost, in-flight data (uncommitted TXGs) is lost...<br />
this may cause logical corruption<br />
In some situations, it’s a necessary evil... but have a<br />
good UPS
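For reference, on builds of that era this was an /etc/system setting; a config fragment (the global `zil_disable` tunable was later removed in favor of the per-dataset `sync=disabled` property):

```
* /etc/system fragment -- global, takes effect at boot
* (old builds only; later builds use: zfs set sync=disabled <dataset>)
set zfs:zil_disable = 1
```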
IOSTAT IS DEAD
... deal with it.
iostat<br />
asvc_t & %b have diminished meaning due to TXGs<br />
Monitoring the physical devices not as telling as from<br />
the VFS layer<br />
What’s more interesting is the gap between TXG<br />
syncs... which is better monitored via DTrace, rather<br />
than inferred.<br />
If you must, always use iostat & fsstat together,<br />
essentially to see what’s going in and out of ZFS.
The Write Throttle<br />
Delays processes that are over-zealous<br />
Tunable and can be disabled<br />
Provides excellent hook to find out who is doing heavy<br />
I/O<br />
Provides throughput numbers (fbt::dsl_pool_sync)
Backing it up
Backup Architecture<br />
Today’s data loads are getting too big for the traditional<br />
weekly-full, daily-incremental architecture<br />
Full/Incr architecture designed around tape rotation<br />
Disk-to-disk backup is different, dump old assumptions<br />
Backup is really an exercise in asynchronous replication<br />
Joyent backup is rooted in NCDP ideology
Rsync & Co.<br />
Useful for non-ZFS clients, but slow startup time can<br />
be a killer<br />
Rsync to ZFS dataset, snapshot dataset, repeat.<br />
Instant backup retention solution.<br />
For proper consistency with ZFS, snapshot, rsync the<br />
snap, then release. Example: DB Data/Logs.
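That snapshot-then-rsync sequence, sketched with hypothetical dataset and host names (requires a ZFS source and rsync; not runnable as-is):

```
# freeze a consistent view of the data
zfs snapshot data/db@backup

# copy from the snapshot directory, not the live filesystem
rsync -a /data/db/.zfs/snapshot/backup/ backuphost:/backups/db/

# release the snapshot once the copy completes
zfs destroy data/db@backup
```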
ZFS Send/Recv<br />
Exercise in snapshot replication<br />
No startup lag like rsync.<br />
Very fast; in head-to-head rsync vs zfs s/r, zfs s/r is on<br />
average 40% faster<br />
Large improvements in performance added in snv_105
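The replication loop itself is two commands; a sketch with hypothetical pool and host names (an initial full `zfs send` of the baseline snapshot is assumed to have been received already):

```
# take a new snapshot, then send only the delta since the last one
zfs snapshot data/db@now
zfs send -i data/db@prev data/db@now | ssh backuphost zfs recv backup/db

# rotate: @now becomes the baseline for the next incremental
```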
Block (Pool) Replication<br />
AVS (SNDR) can do replication of pool block devices,<br />
but the replicated pool is unusable.<br />
Same goes for similar block-level replication tools
NDMP<br />
I’m watching NDMP with great interest<br />
NDMPcopy native for Solaris is a big win<br />
Solaris implementation will snapshot prior to copy and<br />
release behind<br />
Supposedly can use zfs s/r for data in addition to<br />
dump/tar<br />
Would allow for re-integration of traditional backup<br />
“managers” such as NetBackup
Parting Thoughts
Transactional Filesystem<br />
Always bear in mind the transactional nature of ZFS<br />
Throw away most of your UFS training... or at least<br />
update it<br />
Consult the VFS layer before jumping to conclusions<br />
about device level activity
Re-Discover Storage<br />
Spend time in the code with DTrace<br />
Attempt to challenge old assumptions<br />
Benchmark for fun and profit (use FileBench!)
Focus on the Application<br />
More than ever, benchmark application performance<br />
Don’t assume that system level I/O metrics tell the<br />
whole story<br />
Use DTrace to explore the interaction between your app<br />
and the VFS
Thank You.<br />
These people are awesome!<br />
Robert Milkowski<br />
Jason Williams<br />
Marcelo Leal<br />
Max Bruning<br />
James Dickens<br />
Adrian Cockcroft<br />
Jim Mauro<br />
Richard McDougall<br />
Tom Haynes<br />
Peter Tribble<br />
Jason King<br />
Octave Orgeron<br />
Brendan Gregg<br />
Matty<br />
Roch<br />
Joerg Moellenkamp<br />
...
Resources<br />
Joyent: joyent.com<br />
Cuddletech: cuddletech.com<br />
Solaris Internals Wiki: solarisinternals.com/wiki<br />
Also:<br />
blogs.sun.com<br />
planetsolaris.com