ZFS in the Trenches<br />
Ben Rockwood<br />
Director of Systems Engineering<br />
Joyent, Inc.
The Big Questions<br />
Is node 5 of 150 struggling?<br />
Is I/O as efficient as it can be?<br />
How fast are requests being answered?<br />
What is my mix of sync vs async I/O?<br />
Do I need to tune?<br />
How do I back it up?
Setting the Record Straight
Correcting Assumptions<br />
Tuning ZFS is Evil (true)<br />
ZFS doesn’t require tuning (false)<br />
ZFS is a memory hog (true)<br />
ZFS is slow (false)<br />
ZFS won’t allow corruption (false)
Amazing Efficiency<br />
ZFS ARC is extremely efficient<br />
Example: 32 production zones (different customers),<br />
120,000 reads per second... ZERO physical read I/O!<br />
ZFS TXG Sync is extremely efficient, provides a clean<br />
and orderly flush of writes to disk<br />
ZFS Prefetch intelligence is smarter than you... but true<br />
efficiency can vary based on workload.
Physical vs Logical I/O
Observability
Kstat<br />
ZFS ARC Kstats are incredibly useful<br />
Tools such as arcstat or arc_summary are good<br />
examples<br />
More Kstats (hopefully) are coming<br />
Things get more interesting when combined with<br />
physical disk kstats and VFS layer kstats
root@quadra ~$ kstat -p -n arcstats<br />
zfs:0:arcstats:c 264750592<br />
zfs:0:arcstats:c_max 268435456<br />
zfs:0:arcstats:c_min 126918784<br />
zfs:0:arcstats:class misc<br />
zfs:0:arcstats:crtime 11.976322232<br />
zfs:0:arcstats:data_size 105994240<br />
zfs:0:arcstats:deleted 871464<br />
zfs:0:arcstats:demand_data_hits 2690899<br />
zfs:0:arcstats:demand_data_misses 90664<br />
zfs:0:arcstats:demand_metadata_hits 6643919<br />
zfs:0:arcstats:demand_metadata_misses 340739<br />
zfs:0:arcstats:evict_skip 19395259<br />
zfs:0:arcstats:hash_chain_max 6<br />
zfs:0:arcstats:hash_chains 4508<br />
zfs:0:arcstats:hash_collisions 297073<br />
zfs:0:arcstats:hash_elements 28047<br />
zfs:0:arcstats:hash_elements_max 71400<br />
zfs:0:arcstats:hdr_size 5048208<br />
zfs:0:arcstats:hits 12065410<br />
zfs:0:arcstats:l2_abort_lowmem 0<br />
zfs:0:arcstats:l2_cksum_bad 0<br />
zfs:0:arcstats:l2_evict_lock_retry 0<br />
zfs:0:arcstats:l2_evict_reading 0<br />
zfs:0:arcstats:l2_feeds 0<br />
zfs:0:arcstats:l2_free_on_write 0<br />
zfs:0:arcstats:l2_hdr_size 0<br />
zfs:0:arcstats:l2_hits 0<br />
zfs:0:arcstats:l2_io_error 0<br />
zfs:0:arcstats:l2_misses 0<br />
zfs:0:arcstats:l2_read_bytes 0<br />
zfs:0:arcstats:l2_rw_clash 0<br />
zfs:0:arcstats:l2_size 0
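The hit and miss counters above are all you need to compute cache efficiency by hand. A minimal sketch using the demand-data counters copied from the sample output (on a live system you would parse `kstat -p -n arcstats` instead of hard-coding the values):

```shell
# demand-data counters copied from the sample arcstats output above
hits=2690899
misses=90664

# integer hit rate in percent
rate=$(( hits * 100 / (hits + misses) ))
echo "demand data hit rate: ${rate}%"   # prints 96% for these sample values
```

The same arithmetic applies to the `demand_metadata_*` and `prefetch_*` counter pairs.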
MDB<br />
mdb provides many useful features that aren’t as<br />
difficult to use as you might think (see zfs.c)<br />
If you use just one, “::zfs_params” is handy to see all<br />
ZFS tunables in one shot<br />
Several walkers are available, most handy when doing<br />
postmortem (Solaris CAT has wrappers for that.)
root@quadra ~$ mdb -k<br />
Loading modules: [ unix genunix s.....<br />
> ::zfs_params<br />
arc_reduce_dnlc_percent = 0x3<br />
zfs_arc_max = 0x10000000<br />
zfs_arc_min = 0x0<br />
arc_shrink_shift = 0x5<br />
zfs_mdcomp_disable = 0x0<br />
zfs_prefetch_disable = 0x0<br />
zfetch_max_streams = 0x8<br />
zfetch_min_sec_reap = 0x2<br />
zfetch_block_cap = 0x100<br />
zfetch_array_rd_sz = 0x100000<br />
zfs_default_bs = 0x9<br />
zfs_default_ibs = 0xe<br />
metaslab_aliquot = 0x80000<br />
spa_max_replication_override = 0x3<br />
mdb: variable spa_mode not found: unknown symbol name<br />
zfs_flags = 0x0<br />
zfs_txg_synctime = 0x5<br />
zfs_txg_timeout = 0x1e<br />
zfs_write_limit_min = 0x2000000<br />
zfs_write_limit_max = 0x1e428200<br />
zfs_write_limit_shift = 0x3<br />
zfs_write_limit_override = 0x0<br />
zfs_no_write_throttle = 0x0<br />
zfs_vdev_cache_max = 0x4000<br />
zfs_vdev_cache_size = 0xa00000<br />
zfs_vdev_cache_bshift = 0x10<br />
vdev_mirror_shift = 0x15<br />
zfs_vdev_max_pending = 0x23<br />
zfs_vdev_min_pending = 0x4<br />
zfs_scrub_limit = 0xa
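`::zfs_params` prints everything in hex, which is easy to misread. Converting a few of the sample values above to human units (values copied from the output; plain shell arithmetic):

```shell
# zfs_arc_max = 0x10000000 bytes
printf 'zfs_arc_max:     %d MB\n' $(( 0x10000000 / 1048576 ))   # 256 MB

# zfs_txg_timeout = 0x1e seconds
printf 'zfs_txg_timeout: %d s\n'  $(( 0x1e ))                   # 30 s

# zfs_default_bs = 0x9 is a shift: default block size is 1 << 9
printf 'zfs_default_bs:  %d bytes\n' $(( 1 << 0x9 ))            # 512 bytes
```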
DTrace<br />
I live by the FBT provider<br />
Watch entry and return of every ZFS function (w00t!)<br />
Most powerful when used for timing or aggregating<br />
stacks to learn code flow<br />
The fsinfo provider is a hidden gem.
zdb<br />
Examine on-disk structures<br />
Can find and fix issues<br />
Breakdown of disk utilization can be telling<br />
Extremely interesting, but rarely handy<br />
Have used it to recover deleted files
root@quadra ~$ zdb<br />
quadra<br />
version=18<br />
name='quadra'<br />
state=0<br />
txg=20039<br />
pool_guid=15851861834417040262<br />
hostid=12739881<br />
hostname='quadra'<br />
vdev_tree<br />
type='root'<br />
id=0<br />
guid=15851861834417040262<br />
children[0]<br />
type='mirror'<br />
id=0<br />
guid=7006308066956826552<br />
whole_disk=0<br />
metaslab_array=23<br />
metaslab_shift=33<br />
ashift=9<br />
asize=1000188936192<br />
is_log=0<br />
children[0]<br />
type='disk'<br />
id=0<br />
guid=916181432694315835<br />
path='/dev/dsk/c2d0s0'<br />
devid='id1,cmdk@AHitachi_HDT721010SLA360=______STF605MH1TK48W/a'<br />
phys_path='/pci@0,0/pci-ide@1f,5/ide@0/cmdk@0,0:a'<br />
whole_disk=1<br />
DTL=18<br />
children[1]
Keys to Observability<br />
Use DTrace to hone your understanding of ZFS<br />
internals<br />
Don’t over-focus your instrumentation, leverage VFS<br />
and ZFS together to get a holistic picture<br />
Avoid getting obsessed with the DTrace syscall<br />
provider (you can do better)<br />
Hint: Study up on bdev_strategy & biodone<br />
Kstats, kstats, kstats....
Inside ARC
The Cache Lists<br />
Most Recently Used (MRU)<br />
Most Frequently Used (MFU)<br />
MRU Ghost<br />
MFU Ghost
ARC & Prefetch<br />
ARC Kstats can tell you how data arrived in cache:<br />
prefetch or direct<br />
The mix of direct to prefetched data can determine<br />
whether your prefetch is worth the I/O<br />
If prefetch hits are less than 10%, just disable it
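One way to measure that mix, sketched with the prefetch and total hit counters from the arc_summary output shown later in the deck (hard-coded here; a real script would parse `kstat -p -n arcstats`):

```shell
# sample counters: prefetch-data hits vs. total cache hits
prefetch_hits=6013656064
total_hits=30933044424

share=$(( prefetch_hits * 100 / total_hits ))
echo "prefetch share of cache hits: ${share}%"   # 19% for the sample data
```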
ARC Sizing<br />
ARC is HUGE! 7/8th of Physical Memory by default!<br />
Min and Max size is tunable<br />
Watch ghost lists if you limit<br />
A >25GB ARC on a 32GB node is extremely common<br />
Ghost list hit rate seems to be the best indicator that<br />
you should consider L2ARC
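A rough way to quantify that indicator from the ghost-list counters in the arc_summary output below: a ghost hit means a buffer was cached, evicted, then wanted again, which is exactly what an L2ARC would have retained. A sketch with the sample numbers hard-coded:

```shell
# mru_ghost + mfu_ghost hits vs. total cache accesses (sample values)
ghost_hits=$(( 280356 + 646910 ))
total=30934237785

pct=$(awk -v g="$ghost_hits" -v t="$total" 'BEGIN { printf "%.3f", g * 100 / t }')
echo "ghost hit rate: ${pct}% of all accesses"   # ~0.003%: L2ARC would not help here
```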
$ ./arc_summary.pl<br />
System Memory:<br />
Physical RAM: 16247 MB<br />
Free Memory : 1064 MB<br />
LotsFree: 253 MB<br />
ZFS Tunables (/etc/system):<br />
ARC Size:<br />
Current Size: 5931 MB (arcsize)<br />
Target Size (Adaptive): 5985 MB (c)<br />
Min Size (Hard Limit): 507 MB (zfs_arc_min)<br />
Max Size (Hard Limit): 15223 MB (zfs_arc_max)<br />
ARC Size Breakdown:<br />
Most Recently Used Cache Size: 73% 4405 MB (p)<br />
Most Frequently Used Cache Size: 26% 1579 MB (c-p)<br />
ARC Efficency:<br />
Cache Access Total: 30934237785<br />
Cache Hit Ratio: 99% 30933044424 [Defined State for buffer]<br />
Cache Miss Ratio: 0% 1193361 [Undefined State for Buffer]<br />
REAL Hit Ratio: 99% 30888972063 [MRU/MFU Hits Only]<br />
Data Demand Efficiency: 99%<br />
Data Prefetch Efficiency: 99%<br />
CACHE HITS BY CACHE LIST:<br />
Anon: 0% 43145095 [ New Customer, First Cache Hit ]<br />
Most Recently Used: 0% 56656566 (mru) [ Return Customer ]<br />
Most Frequently Used: 99% 30832315497 (mfu) [ Frequent Customer ]<br />
Most Recently Used Ghost: 0% 280356 (mru_ghost) [ Return Customer Evicted, Now Back ]<br />
Most Frequently Used Ghost: 0% 646910 (mfu_ghost) [ Frequent Customer Evicted, Now Back ]<br />
CACHE HITS BY DATA TYPE:<br />
Demand Data: 78% 24388331448<br />
Prefetch Data: 19% 6013656064<br />
Demand Metadata: 1% 472836583<br />
Prefetch Metadata: 0% 58220329<br />
CACHE MISSES BY DATA TYPE:<br />
Demand Data: 21% 255389<br />
Prefetch Data: 32% 385838<br />
Demand Metadata: 31% 381534<br />
Prefetch Metadata: 14% 170600<br />
---------------------------------------------
Physical I/O
ZFS Breathing<br />
Async writes grouped into Transaction Group (TXG)<br />
and “sync”ed to disk at regular intervals<br />
txg_synctime represents expected time to flush a TXG<br />
(5 sec)<br />
txg_timeout is typical frequency of flush (30 sec)<br />
Pre-snv_87 txg_time flushed every 5 seconds (tunable)<br />
Can be monitored via DTrace (FBT: spa_sync)
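A minimal sketch of that monitoring as a D script (not runnable here: it needs root and DTrace, and assumes the fbt probe names from the slide exist on your build):

```
#!/usr/sbin/dtrace -s
/* distribution of spa_sync() latency, i.e. how long each TXG flush takes */
fbt::spa_sync:entry
{
        self->ts = timestamp;
}

fbt::spa_sync:return
/self->ts/
{
        @["TXG sync time (ns)"] = quantize(timestamp - self->ts);
        self->ts = 0;
}
```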
Read I/O<br />
Reads are always synchronous<br />
...but the ARC absorbs them very, very well
ZFS Intent Log<br />
ZIL throws a kink in the works<br />
Without a Log Device (“SLOG”) the intent log is part of<br />
the pool<br />
Can be monitored via DTrace (FBT: zil_commit_writer)
Disabling ZIL<br />
Sync I/O treated as Async<br />
The fastest SSD won’t match the speed of<br />
zil_disable=1<br />
ZFS is always consistent on disk, regardless<br />
NFS corruption issues over-blown<br />
If power is lost, in-flight data (uncommitted TXGs) is lost...<br />
this may cause logical corruption<br />
In some situations, it’s a necessary evil... but have a<br />
good UPS
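For reference, on builds of that era this was an /etc/system setting; a config fragment (the global `zil_disable` tunable was later removed in favor of the per-dataset `sync=disabled` property):

```
* /etc/system fragment -- global, takes effect at boot
* (old builds only; later builds use: zfs set sync=disabled <dataset>)
set zfs:zil_disable = 1
```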
IOSTAT IS DEAD
... deal with it.
iostat<br />
asvc_t & %b have diminished meaning due to TXGs<br />
Monitoring the physical devices not as telling as from<br />
the VFS layer<br />
What’s more interesting is the gap between TXG<br />
syncs... which is better monitored via DTrace, rather<br />
than inferred.<br />
If you must, always use iostat & fsstat together,<br />
essentially to see what’s going in and out of ZFS.
The Write Throttle<br />
Delays processes that are over-zealous<br />
Tunable and can be disabled<br />
Provides excellent hook to find out who is doing heavy<br />
I/O<br />
Provides throughput numbers (fbt::dsl_pool_sync)
Backing it up
Backup Architecture<br />
Today’s data loads are getting too big for the traditional<br />
weekly-full, daily-incremental architecture<br />
Full/Incr architecture designed around tape rotation<br />
Disk-to-disk backup is different, dump old assumptions<br />
Backup is really an exercise in asynchronous replication<br />
Joyent backup is rooted in NCDP ideology
Rsync & Co.<br />
Useful for non-ZFS clients, but slow startup time can<br />
be a killer<br />
Rsync to ZFS dataset, snapshot dataset, repeat.<br />
Instant backup retention solution.<br />
For proper consistency with ZFS, snapshot, rsync the<br />
snap, then release. Example: DB Data/Logs.
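That snapshot-then-rsync sequence, sketched with hypothetical dataset and host names (requires a ZFS source and rsync; not runnable as-is):

```
# freeze a consistent view of the data
zfs snapshot data/db@backup

# copy from the snapshot directory, not the live filesystem
rsync -a /data/db/.zfs/snapshot/backup/ backuphost:/backups/db/

# release the snapshot once the copy completes
zfs destroy data/db@backup
```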
ZFS Send/Recv<br />
Exercise in snapshot replication<br />
No startup lag like rsync.<br />
Very fast; in head-to-head rsync vs zfs s/r, zfs s/r is on<br />
average 40% faster<br />
Large improvements in performance added in snv_105
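The replication loop itself is two commands; a sketch with hypothetical pool and host names (an initial full `zfs send` of the baseline snapshot is assumed to have been received already):

```
# take a new snapshot, then send only the delta since the last one
zfs snapshot data/db@now
zfs send -i data/db@prev data/db@now | ssh backuphost zfs recv backup/db

# rotate: @now becomes the baseline for the next incremental
```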
Block (Pool) Replication<br />
AVS (SNDR) can do replication of pool block devices,<br />
but the replicated pool is unusable.<br />
Same goes for similar block-level replication tools
NDMP<br />
I’m watching NDMP with great interest<br />
NDMPcopy native for Solaris is a big win<br />
Solaris implementation will snapshot prior to copy and<br />
release behind<br />
Supposedly can use zfs s/r for data in addition to<br />
dump/tar<br />
Would allow for re-integration of traditional backup<br />
“managers” such as NetBackup
Parting Thoughts
Transactional Filesystem<br />
Always bear in mind the transactional nature of ZFS<br />
Throw away most of your UFS training... or at least<br />
update it<br />
Consult the VFS layer before jumping to conclusions<br />
about device level activity
Re-Discover Storage<br />
Spend time in the code with DTrace<br />
Attempt to challenge old assumptions<br />
Benchmark for fun and profit (use FileBench!)
Focus on the Application<br />
More than ever, benchmark application performance<br />
Don’t assume that system level I/O metrics tell the<br />
whole story<br />
Use DTrace to explore the interaction between your app<br />
and the VFS
Thank You.<br />
These people are awesome!<br />
Robert Milkowski<br />
Jason Williams<br />
Marcelo Leal<br />
Max Bruning<br />
James Dickens<br />
Adrian Cockcroft<br />
Jim Mauro<br />
Richard McDougall<br />
Tom Haynes<br />
Peter Tribble<br />
Jason King<br />
Octave Orgeron<br />
Brendan Gregg<br />
Matty<br />
Roch<br />
Joerg Moellenkamp<br />
...
Resources<br />
Joyent: joyent.com<br />
Cuddletech: cuddletech.com<br />
Solaris Internals Wiki: solarisinternals.com/wiki<br />
Also:<br />
blogs.sun.com<br />
planetsolaris.com