ZFS Nuts and Bolts

A look at the internals of Sun's ZFS filesystem.

ZFS Nuts and Bolts Presentation Transcript

  • ZFS Nuts and Bolts Eric Sproul OmniTI Computer Consulting
  • Quick Overview * More than just another filesystem: it's a filesystem, a volume manager, and a RAID controller all in one * Production debut in Solaris 10 6/06 * 1 ZB = 1 billion TB * 128-bit * 2^64 snapshots, 2^48 files/directory, 2^64 bytes/filesystem, 2^78 bytes/pool, 2^64 devices/pool, 2^64 pools/system
  • Old & Busted Traditional storage stack: filesystem(upper): filename to object (inode) filesystem(lower): object to volume LBA volume manager: volume LBA to array LBA RAID controller: array LBA to disk LBA * Strict separation between layers * Each layer often comes from separate vendors * Complex, difficult to administer, hard to predict performance of a particular combination
  • New Hotness * Telescoped stack: ZPL: filename to object; DMU: object to DVA; SPA: DVA to disk LBA * Terms: * ZPL: ZFS POSIX layer (standard syscall interface) * DMU: Data Management Unit (transactional object store) * DVA: Data Virtual Address (vdev + offset) * SPA: Storage Pool Allocator (block allocation, data transformation)
  • New Hotness * No more separate tools to manage filesystems vs. volumes vs. RAID arrays * 2 commands: zpool(1M), zfs(1M) (RFE exists to combine these) * Pooled storage means never getting stuck with too much or too little space in your filesystems * Can expose block devices as well; "zvol" blocks map directly to DVAs
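
To make the two-command model concrete, here is a minimal sketch of creating a pool, a filesystem, and a zvol. The pool name "tank", dataset names, and device names are illustrative, not from the slides:

    # zpool create tank mirror c1t0d0 c1t1d0   (pool backed by a two-disk mirror)
    # zfs create tank/home                     (filesystem carved from the pool)
    # zfs create -V 10G tank/myvol             (zvol: a 10 GB block device)
    # ls /dev/zvol/dsk/tank/myvol              (the zvol appears as a device node)
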
  • ZFS Advantages * Fast * copy-on-write, pipelined I/O, dynamic striping, variable block size, intelligent resilvering * Simple management * End-to-end data integrity, self-healing * Checksum everything, all the time * Built-in goodies * block transforms * snapshots * NFS, CIFS, iSCSI sharing * Platform-neutral on-disk format
  • Getting Down to Brass Tacks How does ZFS achieve these feats?
  • ZFS I/O Life Cycle Writes 1. Translated to object transactions by the ZPL: "Make these 5 changes to these 2 objects." 2. Transactions bundled in DMU into transaction groups (TXGs) that flush when full (1/8 of system memory) or at regular intervals (30 seconds) 3. Blocks making up a TXG are transformed (if necessary), scheduled and then issued to physical media in the SPA
  • ZFS I/O Life Cycle Synchronous Writes * ZFS maintains a per-filesystem log called the ZFS Intent Log (ZIL). Each transaction gets a log sequence number. * When a synchronous command, such as fsync(), is issued, the ZIL commits blocks up to the current sequence number. This is a blocking operation. * The ZIL commits all necessary operations and flushes any write caches that may be enabled, ensuring that all bits have been committed to stable storage.
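
Synchronous-write latency can be reduced by giving the ZIL a dedicated, fast device, as sketched below; pool and device names are illustrative:

    # zpool add tank log c4t0d0   (separate intent-log device, a.k.a. "slog")
    # zpool status tank           (the device is listed under a "logs" section)
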
  • ZFS I/O Life Cycle Reads * ZFS makes heavy use of caching and prefetching * If requested blocks are not cached, issue a prioritized I/O that "cuts the line" ahead of pending writes * Writes are intelligently throttled to maintain acceptable read performance * ARC (Adaptive Replacement Cache) tracks recently and frequently used blocks in main memory * L2 ARC uses durable storage to extend the ARC
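
An L2ARC device is attached the same way, as a cache vdev; again a sketch with illustrative names:

    # zpool add tank cache c5t0d0   (SSD-backed L2ARC extends the in-memory ARC)
    # zpool iostat -v tank          (cache device reports its own I/O statistics)
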
  • Speed Is Life * Copy-on-write design means random writes can be made sequential * Pipelined I/O extracts maximum parallelism with out-of-order issue, sorting and aggregation * Dynamic striping across all underlying devices eliminates hot-spots * Variable block size = no wasted space or effort * Intelligent resilvering copies only live data, can do partial rebuild for transient outages
  • Copy-On-Write Initial block tree
  • Copy-On-Write New blocks represent changes; never modifies existing data
  • Copy-On-Write Indirect blocks also change
  • Copy-On-Write Atomically update uberblock to point at updated blocks. The uberblock is special in that it does get overwritten, but 4 copies are stored as part of the vdev label and are updated in transactional pairs. Therefore, integrity on disk is maintained.
  • Pipelined I/O Reorders writes to be as sequential as possible. [Animation: App #1 and App #2 each issue scattered writes. If left in original order, we waste a lot of time waiting for head and platter positioning: move head, spin wait, move head, move head, move head. Pipelining lets us examine writes as a group and optimize order: move head, move head.]
  • Dynamic Striping * Load distribution across top-level vdevs * Factors determining block allocation include: * Capacity * Latency & bandwidth * Device health
  • Dynamic Striping # zpool create tank mirror c1t0d0 c1t1d0 mirror c2t0d0 c2t1d0 (writes striped across both mirrors) # zpool add tank mirror c3t0d0 c3t1d0 (new data striped across three mirrors; no migration of existing data) Reads occur wherever data was written. Copy-on-write reallocates data over time, gradually spreading it across all three mirrors. * RFE for "on-demand" resilvering to explicitly re-balance
  • Variable Block Size * No single value works well with all types of files * Large blocks increase bandwidth and reduce metadata overhead, but can lead to wasted space * Small blocks save space for smaller files, but increase I/O operations on larger ones * Record-based files such as those used by databases have a fixed block size that must be matched by the filesystem to avoid extra overhead (blocks too small) or read-modify-write (blocks too large)
  • Variable Block Size * The DMU operates on units of a fixed record size; default is 128KB * Files that are less than the record size are written as a single filesystem block (FSB) of variable size in multiples of disk sectors (512B) * Files that are larger than the record size are stored in multiple FSBs equal to record size * DMU records are assembled into transaction groups and committed atomically
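
The record size is a per-dataset property, so a database filesystem can be matched to its page size. A sketch assuming an 8 KB page size and an illustrative dataset name:

    # zfs set recordsize=8K tank/db   (align filesystem blocks with DB pages)
    # zfs get recordsize tank/db      (confirm the effective value)
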
  • Variable Block Size * FSBs are the basic unit of ZFS datasets; checksums are maintained per FSB * Handled by the SPA, which can optionally transform them (compression, ditto blocks today; encryption, de-dupe in the future) * Compression improves I/O performance, as fewer operations are needed on the underlying disk
  • Intelligent Resilver * a.k.a. rebuild, resync, reconstruct * Traditional resilvering is basically a whole-disk copy in the mirror case; RAID-5 does XOR of the other disks to rebuild * No priority given to more important blocks (top of the tree) * If you've copied 99% of the blocks, but the last 1% contains the top few blocks in the tree, another failure ruins everything
  • Intelligent Resilver * The ZFS way is metadata-driven * Live blocks only: just walk the block tree; unallocated blocks are ignored * Top-down: Start with the most important blocks. Every block copied increases the amount of discoverable data. * Transactional pruning: If the failure is transient, repair by identifying the missed TXGs. Resilver time is only slightly longer than the outage time.
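
A resilver is typically kicked off by replacing a failed disk; a hedged sketch with illustrative device names:

    # zpool replace tank c1t0d0 c1t5d0   (swap in a new disk; resilver begins)
    # zpool status tank                  (shows progress; only live data is copied)
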
  • Keep It Simple * Unified management model: pools and datasets * Datasets are just a group of tagged bits with certain attributes: filesystems, volumes, snapshots, clones * Properties can be set while the dataset is active * Hierarchical arrangement: children inherit properties of parent * Datasets become administration points: give every user or application their own filesystem
  • Keep It Simple * Datasets only occupy as much space as they need * Compression, quotas and reservations are built-in properties * Pools may be grown dynamically without service interruption
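
Those properties are all set with the same command and inherited down the hierarchy; a minimal sketch with illustrative dataset names:

    # zfs create tank/home/alice
    # zfs set quota=10G tank/home/alice        (cap this user's space consumption)
    # zfs set reservation=2G tank/home/alice   (guarantee a minimum of space)
    # zfs set compression=on tank/home         (children inherit the setting)
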
  • Data Integrity * Not enough to be fast and simple; must be safe too * Silent corruption is our mortal enemy * Defects can occur anywhere: disks, firmware, cables, kernel drivers * Main memory has ECC; why shouldn't storage have something similar? * Other types of corruption are also killers: power outages, accidental overwrites, using a disk as swap
  • Data Integrity Traditional Method: Disk Block Checksum (checksum stored alongside the data block). Only detects problems after data is successfully written ("bit rot"). Won't catch silent corruption caused by issues in the I/O path between disk and host.
  • Data Integrity The ZFS Way * Store data checksum in parent block pointer * Isolates faults between checksum and data * Forms a hash tree, enabling validation of the entire pool * 256-bit checksums * fletcher2 (default, simple and fast) or SHA-256 (slower, more secure) * Can be validated at any time with 'zpool scrub'
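
Both the checksum algorithm and on-demand validation are exposed through the standard commands; names beyond 'zpool scrub' on the slide are illustrative:

    # zfs set checksum=sha256 tank/myfs   (use the stronger hash for this dataset)
    # zpool scrub tank                    (walk the block tree, verify every checksum)
    # zpool status -v tank                (reports scrub progress and any errors found)
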
  • Data Integrity [Animation: App reads a block through ZFS from a mirror. One copy fails its checksum; ZFS fetches the good copy from the other side, returns it to the App, and repairs the damaged copy in place. Self-healing mirror!]
  • Goodie Bag * Block Transforms * Snapshots & Clones * Sharing (NFS, CIFS, iSCSI) * Platform-neutral on-disk format
  • Block Transforms * Handled at SPA layer, transparent to upper layers * Available today: * Compression * zfs set compression=on tank/myfs * LZJB (default) or GZIP * Multi-threaded as of snv_79 * Duplication, a.k.a. "ditto blocks" * zfs set copies=N tank/myfs * In addition to mirroring/RAID-Z: One logical block = up to 3 physical blocks * Metadata always has 2+ copies, even without ditto blocks * Copies stored on different devices, or different places on same device * Future: de-duplication, encryption
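
A brief usage sketch of the two transforms named above; dataset names are illustrative:

    # zfs set compression=gzip-9 tank/archive   (maximum gzip: smaller, more CPU)
    # zfs set copies=2 tank/precious            (two physical copies of every block)
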
  • Snapshots & Clones * zfs snapshot tank/myfs@thursday * Based on block birth time, stored in block pointer * Nearly instantaneous (<1 sec) on idle system * Communicates structure, since it is based on object changes, not just a block delta * Occupies negligible space initially, and only grows as large as the block changeset * Clone is just a read/write snapshot
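
Snapshot, clone, and rollback in practice; a sketch reusing the slide's tank/myfs@thursday, with the clone name illustrative:

    # zfs snapshot tank/myfs@thursday
    # zfs list -t snapshot                        (occupies negligible space at first)
    # zfs clone tank/myfs@thursday tank/myclone   (read/write view of the snapshot)
    # zfs rollback tank/myfs@thursday             (revert the live filesystem)
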
  • Sharing * NFSv4 * zfs set sharenfs=on tank/myfs * Automatically updates /etc/dfs/sharetab * CIFS * zfs set sharesmb=on tank/myfs * Additional properties control the share name and workgroup * Supports full NT ACLs and user mapping, not just POSIX uid * iSCSI * zfs set shareiscsi=on tank/myvol * Makes sharing block devices as easy as sharing filesystems
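
The share properties also take options; for CIFS, the share name can be set inline. A sketch, with the share name illustrative:

    # zfs set sharesmb=name=myshare tank/myfs          (publish under an explicit name)
    # zfs get sharenfs,sharesmb,shareiscsi tank/myfs   (review all sharing properties)
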
  • On-Disk Format * Platform-neutral, adaptive endianness * Writes always use native endianness, recorded in a bit in the block pointer * Reads byteswap if necessary, based on comparison of host endianness to value of block pointer bit * Migrate between x86 and SPARC * No worries about device paths, fstab, mountpoints, it all just works * 'zpool export' on old host, move disks, 'zpool import' on new host * Also migrate between Solaris and non-Sun implementations, such as MacOS X and FreeBSD
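
The migration flow named above is just the two commands; a hedged sketch:

    # zpool export tank   (old host: unmount filesystems, mark the pool exported)
      ...move the disks to the new host...
    # zpool import        (new host: list pools available for import)
    # zpool import tank   (import by name; filesystems mount automatically)
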
  • Fin Further reading: ZFS Community: http://opensolaris.org/os/community/zfs ZFS Administration Guide: http://docs.sun.com/app/docs/doc/819-5461 Jeff Bonwick's blog: http://blogs.sun.com/bonwick/en_US/category/ZFS ZFS-related blog entries: http://blogs.sun.com/main/tags/zfs