ZFS on Linux: not so simple

After reading the article "ZFS on Linux: easy and simple", I decided to share my own modest experience of running this file system on a couple of Linux servers.

First, a lyrical digression. ZFS is great. It is so great that it outweighs all the drawbacks of a file system ported from an ideologically different platform. The Solaris kernel works with different primitives than Linux, so to make the port possible while reusing the Solaris code, the developers created a compatibility layer, SPL (Solaris Porting Layer). This shim seems to work fine, but it is extra kernel-mode code that can well become a source of failures.

ZFS is not fully compatible with the Linux VFS. For example, access control lists cannot be managed through the POSIX API and the getfacl/setfacl commands, which Samba strongly dislikes, since it stores NTFS permissions in ACLs. As a result, you will not be able to set proper permissions on Samba shares and files. In theory Samba supports ZFS ACLs, but that module still has to be built on Linux first... Extended file system attributes, on the other hand, are present in ZFS on Linux and work fine.

In addition, the 32-bit edition of Solaris uses a different memory allocation mechanism than Linux. So if you decide to try ZFS on Linux on x86 rather than x86_64, brace yourself for glitches: expect 100% CPU load on elementary operations and kilometers of errors in dmesg. As the ZFS on Linux developers put it: "You are strongly encouraged to use a 64-bit kernel. At the moment zfs will build in a 32-bit environment but will not run stably".

ZFS is a "thing in itself" and keeps parameters in its metadata that are unusual for Linux. For example, the mount point of a file system is set in the file system's own properties, and the file system is mounted with the zfs mount command, which automatically makes it incompatible with /etc/fstab and the other usual ways of mounting file systems on Linux. You can, of course, set mountpoint=legacy and use mount after all, but you have to admit that is not elegant. On Ubuntu the problem is solved by the mountall package, which ships ZFS-specific mount scripts and a patched mount command.

The next problem is instant system snapshots. ZFS has a very efficient snapshot implementation that lets you build a "time machine": a set of snapshots covering, say, a month, with a resolution of one snapshot every 15 minutes. The Ubuntu maintainers, of course, packaged this capability as zfs-auto-snapshot, which creates such a set of snapshots, although a sparser one. The problem is that every snapshot shows up in /dev as a block device. With the default snapshot rotation, over a month we end up with 4+24+4+31+1=64 block devices per pool volume. So if we have, say, 20 volumes (a perfectly normal number if the server is used for virtualization), we get 64*20=1280 devices in a month. When we then decide to reboot, a big surprise awaits: the boot takes a very long time. The reason is that at boot the blkid utility runs and probes every block device for file systems. Either its file system detection is implemented poorly or the block devices open slowly, but one way or another the blkid process is killed by the kernel after a 120-second timeout. Needless to say, blkid and every script based on it do not work after boot finishes either.
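To make the legacy-mount workaround mentioned above concrete, here is a minimal sketch; the dataset name (tank/data) and the mount point are made up for illustration:

root # zfs set mountpoint=legacy tank/data
root # mount -t zfs tank/data /srv/data

With mountpoint=legacy set, the dataset can also be listed in /etc/fstab like any other file system:

tank/data   /srv/data   zfs   defaults   0 0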
Suppose we have overcome all these problems and want to export a freshly created volume to other machines over iSCSI, FC, or something else via LIO-Target, the target subsystem built into the kernel. No such luck! When loaded, the zfs module uses major number 230 for the block devices it creates in /dev. Without the latest patches, LIO-Target (more precisely, the targetcli utility) does not consider a device with that major number eligible for export. The fix is to change one line in /usr/lib/python2.7/dist-packages/rtslib/utils.py, or to add a module load parameter for zfs in /etc/modprobe.d/zfs.conf:

options zfs zvol_major=240

And finally: as is well known, the incompatibility between the CDDL, under which ZFS is released, and the kernel's GPL v2 prevents the zfs module from being merged into the kernel. So on every kernel update the module is rebuilt via DKMS. Sometimes the rebuild succeeds, sometimes (when the kernel is too new) it does not. Consequently, you will get the newest features (and the KVM and LIO-Target bug fixes) from recent kernels with some delay.

What is the conclusion? Use ZFS in production with caution. Configurations that worked without problems on other file systems may not work, and commands that you ran on LVM without a second thought may cause deadlocks. In return, you now have all the features of ZFS pool version 28 available in production on Linux: deduplication, online compression, fault tolerance, a flexible volume manager (which, by the way, can be used on its own), and so on. Good luck!
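As a quick sanity check after a kernel upgrade (a hedged aside, not from the original article), you can verify that DKMS actually rebuilt the spl and zfs modules for the new kernel before rebooting into it:

root # dkms status | grep -E 'spl|zfs'
root # modinfo zfs | grep -i version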
ZFS On Linux With Ubuntu 12.04 LTS
It has been a while since we last benchmarked the ZFS file-system under Linux, but here are some benchmarks of the well-known Solaris file-system on Ubuntu 12.04 LTS, compared to EXT4 and Btrfs when using both a hard drive and a solid-state drive.
Published on June 27, 2012. Written by Michael Larabel.
Last year I benchmarked the official KQ Infotech ZFS implementation for Linux, but that port is no longer active. There is also ZFS-FUSE, but it has not been too performance-friendly and FUSE remains widely criticized. Lastly, there is the fledgling LLNL ZFS port for Linux. The "ZFS On Linux" port from Lawrence Livermore National Laboratory (LLNL) is what is being benchmarked here. It is basically the only serious ZFS implementation for Linux unless you count FUSE.
ZFS file-system support is not in the mainline Linux kernel since its license (CDDL) remains incompatible with the GPL code-base. Lawrence Livermore National Laboratory meanwhile makes all of their source-code publicly available so the user is free to build the kernel modules themselves. There is also some support for easily generating RPM and Debian packages for Linux ZFS as well as an Ubuntu PPA. The most recent release of ZFS on Linux is 0.6.0-rc9, which was released on 14 June and is what is being tested today. zfs-0.6.0-rc9 implements Zpool version 28 and FS version 5. There is also a matching SPL (Solaris Porting Layer) release. At ZFSOnLinux.org is much more information on the supported features, ZFS examples, and other
items.

On an SSD and HDD the ZFS 0.6.0-rc9 performance was compared to EXT4 and Btrfs from Ubuntu 12.04 LTS. All file-systems were tested with their stock mount options. The system was built around an Intel Core i7 3770K "Ivy Bridge" processor. The SSD used for benchmarking was an OCZ Solid 2 and the rotating drive was a Seagate Barracuda 7200.10 SATA 2.0 HDD (ST3320620AS).
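For reference, the Ubuntu PPA mentioned above was typically added roughly as follows at the time; the PPA and package names (zfs-native/stable, ubuntu-zfs) are given from memory as an assumption and should be checked against ZFSOnLinux.org:

root # add-apt-repository ppa:zfs-native/stable
root # apt-get update && apt-get install ubuntu-zfs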
ZFS
ZFS is an advanced filesystem developed by Sun Microsystems.
Contents
• 1 Features
• 2 Installation
  • 2.1 Modules
  • 2.2 USE flags
  • 2.3 Tweak
• 3 Installing into the kernel directory (for static installs)
• 4 Usage
  • 4.1 Preparation
  • 4.2 Zpools
    • 4.2.1 import/export Zpool
    • 4.2.2 One Hard Drive
    • 4.2.3 MIRROR Two Hard Drives
    • 4.2.4 RAIDZ1 Three Hard Drives
    • 4.2.5 RAIDZ2 Four Hard Drives
    • 4.2.6 Spares/Replace vdev
    • 4.2.7 Zpool Version Update
    • 4.2.8 Zpool Tips/Tricks
  • 4.3 Volumes
    • 4.3.1 Create Volumes
    • 4.3.2 Mount/Umount Volumes
    • 4.3.3 Remove Volumes
    • 4.3.4 Properties
    • 4.3.5 Set Mountpoint
    • 4.3.6 NFS Volume
  • 4.4 Snapshots
    • 4.4.1 Create Snapshots
    • 4.4.2 List Snapshots
    • 4.4.3 Rollback Snapshots
    • 4.4.4 Clone Snapshots
    • 4.4.5 Remove Snapshots
  • 4.5 Maintenance
    • 4.5.1 Scrubbing
    • 4.5.2 Log Files
    • 4.5.3 Monitor I/O
• 5 External resources
Features

ZFS includes many features like:

• Manage storage hardware as vdevs in zpools
• Manage volumes in zpools (like LVM)
• Redundancy with support for RAIDZ1 (RAID5), RAIDZ2 (RAID6) and MIRROR (RAID1)
• Resilvering of the file system
• Data deduplication
• Data compression with zle (zero-length encoding — fast, but only compresses sequences of zeros), LZJB or its replacement LZ4, or gzip (higher compression, but slower)
• Snapshots (like differential backups)
• NFS export of volumes
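As a small illustration of the compression and deduplication features listed above (assuming a pool named zfs_test, as used in the examples later in this page), the corresponding dataset properties can be enabled and inspected like this:

root # zfs set compression=on zfs_test
root # zfs set dedup=on zfs_test
root # zfs get compression,compressratio,dedup zfs_test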
Installation

Modules

There are out-of-tree Linux kernel modules available from the ZFSOnLinux project. The current release is version 0.6.1 (zpool version 28). This is the first release considered by the ZFSOnLinux project to be "ready for wide scale deployment on everything from desktops to super computers".

Note: All changes to the Git repository are subject to regression tests by LLNL.

Installing ZFS on Gentoo Linux requires the ~amd64 keyword for sys-fs/zfs and its dependencies sys-fs/zfs-kmod and sys-kernel/spl:

root # echo "sys-kernel/spl ~amd64" >> /etc/portage/package.accept_keywords
root # echo "sys-fs/zfs-kmod ~amd64" >> /etc/portage/package.accept_keywords
root # echo "sys-fs/zfs ~amd64" >> /etc/portage/package.accept_keywords
root # emerge -av zfs

The latest upstream versions require keywording the live ebuilds (optional):

root # echo "=sys-kernel/spl-9999 **" >> /etc/portage/package.accept_keywords
root # echo "=sys-fs/zfs-kmod-9999 **" >> /etc/portage/package.accept_keywords
root # echo "=sys-fs/zfs-9999 **" >> /etc/portage/package.accept_keywords

Add zfs to the boot runlevel to mount all zpools on boot:

root # rc-update add zfs boot
USE flags

→ Information about USE flags

USE flag       Default  Recommended  Description
custom-cflags  No       No           Build with user-specified CFLAGS (unsupported)
rootfs         Yes      Yes          Enable dependencies required for booting off a pool containing a rootfs
static-libs    No       No           Build static libraries
test-suite     No       No           Install regression test suite
Tweak

By default, ZFS uses as much memory as it can get for its ARC cache. The limit should not be set below 512MB; a good value is 1/4 of the available memory. This parameter can only be set at module load time. To restrict the ARC to 512MB:

root # echo "options zfs zfs_arc_max=536870912" >> /etc/modprobe.d/zfs.conf
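A hedged way to double-check the limit after the module is (re)loaded; both paths are the ones exposed by ZFS on Linux (a value of 0 in the module parameter means the built-in default is used):

root # cat /sys/module/zfs/parameters/zfs_arc_max
root # grep c_max /proc/spl/kstat/zfs/arcstats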
Installing into the kernel directory (for static installs)

This example uses the 9999 (live) ebuilds, but you can substitute the latest ~arch or stable version (when one exists). The only issue you may run into is having zfs and zfs-kmod out of sync with each other; try to avoid that.

The following generates the needed files and copies them into the kernel source directory:

root # (cd /var/tmp/portage/sys-kernel/spl-9999/work/spl-9999 && ./copy-builtin /usr/src/linux)
root # (cd /var/tmp/portage/sys-fs/zfs-kmod-9999/work/zfs-kmod-9999 && ./copy-builtin /usr/src/linux)

After this, you just need to edit the kernel config to enable CONFIG_SPL and CONFIG_ZFS and emerge the zfs userland binaries:

root # mkdir -p /etc/portage/profile
root # echo 'sys-fs/zfs -kernel-builtin' >> /etc/portage/profile/package.use.mask
root # echo 'sys-fs/zfs kernel-builtin' >> /etc/portage/package.use
root # emerge -1v sys-fs/zfs

The echo commands only need to be run once, but the emerge needs to be run every time you install a new version of zfs.
Usage

ZFS already includes all the programs needed to manage the hardware and the file systems; no additional tools are needed.
Preparation

To go through the different commands and scenarios we can create virtual hard drives using loopback devices. First we need to make sure the loopback module is loaded. If you want to play around with partitions, use the following option:

root # modprobe -r loop
root # modprobe loop max_part=63

Note: You cannot reload the module if it is built into the kernel.

The following commands create 2GB image files in /var/lib/zfs_img/ that we use as our hard drives (uses ~8GB disk space):

root # mkdir /var/lib/zfs_img
root # dd if=/dev/null of=/var/lib/zfs_img/zfs0.img bs=1024 seek=2097152
root # dd if=/dev/null of=/var/lib/zfs_img/zfs1.img bs=1024 seek=2097152
root # dd if=/dev/null of=/var/lib/zfs_img/zfs2.img bs=1024 seek=2097152
root # dd if=/dev/null of=/var/lib/zfs_img/zfs3.img bs=1024 seek=2097152
Now we check which loopback devices are in use:

root # losetup -a

We assume that all loopback devices are available and create our hard drives:

root # losetup /dev/loop0 /var/lib/zfs_img/zfs0.img
root # losetup /dev/loop1 /var/lib/zfs_img/zfs1.img
root # losetup /dev/loop2 /var/lib/zfs_img/zfs2.img
root # losetup /dev/loop3 /var/lib/zfs_img/zfs3.img
We now have /dev/loop[0-3] available as four hard drives.

Note: On the next reboot, all the loopback devices will be released and the folder /var/lib/zfs_img can be deleted.
Zpools

The program /usr/sbin/zpool is used for any operation regarding zpools.

import/export Zpool

To export (unmount) an existing zpool named zfs_test from the file system, use the following command:

root # zpool export zfs_test
root # zpool status

To import (mount) the zpool named zfs_test use this command:

root # zpool import zfs_test
root # zpool status

The root mountpoint of zfs_test is a property and can be changed the same way as for volumes. To import (mount) the zpool named zfs_test rooted on /mnt/gentoo, use this command:

root # zpool import -R /mnt/gentoo zfs_test
root # zpool status

Note: ZFS will automatically search the hard drives for the zpool named zfs_test.

One Hard Drive

Create a new zpool named zfs_test with one hard drive:

root # zpool create zfs_test /dev/loop0

The zpool will automatically be mounted; the default mountpoint is in the root file system, i.e. /zfs_test.

root # zpool status

To delete a zpool use this command:

root # zpool destroy zfs_test

Important: ZFS will not ask whether you really want to destroy the pool.

MIRROR Two Hard Drives

In ZFS you can have several hard drives in a MIRROR, where an identical copy of the data exists on each device. This increases performance and redundancy. To create a new zpool named zfs_test with two hard drives as a MIRROR:

root # zpool create zfs_test mirror /dev/loop0 /dev/loop1

Note: of the two 2GB hard drives only 2GB are effectively usable, i.e. total_space * 1/n.

root # zpool status

To delete the zpool:

root # zpool destroy zfs_test

RAIDZ1 Three Hard Drives

RAIDZ1 is the equivalent of RAID5: data plus one parity block per stripe are distributed across the drives. You need at least three hard drives; one can fail and the zpool stays ONLINE, but the faulty drive should be replaced as soon as possible. To create a pool with RAIDZ1 and three hard drives:

root # zpool create zfs_test raidz1 /dev/loop0 /dev/loop1 /dev/loop2

Note: of the three 2GB hard drives only 4GB are effectively usable, i.e. total_space * (1-1/n).

root # zpool status

To delete the zpool:

root # zpool destroy zfs_test

RAIDZ2 Four Hard Drives

RAIDZ2 is the equivalent of RAID6: data plus two parity blocks per stripe are distributed across the drives. You need at least four hard drives; two can fail and the zpool stays ONLINE, but the faulty drives should be replaced as soon as possible. To create a pool with RAIDZ2 and four hard drives:

root # zpool create zfs_test raidz2 /dev/loop0 /dev/loop1 /dev/loop2 /dev/loop3

Note: of the four 2GB hard drives only 4GB are effectively usable, i.e. total_space * (1-2/n).

root # zpool status

To delete the zpool:

root # zpool destroy zfs_test
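To see the effect of the redundancy levels above on usable space, it can help to compare the pool-level and dataset-level views (a quick illustration, not from the original page): zpool list reports raw capacity including parity, while zfs list reports the space actually available for data.

root # zpool list zfs_test
root # zfs list zfs_test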
Spares/Replace vdev

You can add hot spares to your zpool. In case of a failure, those are already installed and available to replace faulty vdevs. In this example, we use RAIDZ1 with three hard drives and a zpool named zfs_test:

root # zpool add zfs_test spare /dev/loop3
root # zpool status

The status of /dev/loop3 will stay AVAIL until it is set to be online. Now we let /dev/loop0 fail:

root # zpool offline zfs_test /dev/loop0
root # zpool status

	NAME        STATE     READ WRITE CKSUM
	zfs_test    DEGRADED     0     0     0
	  raidz1-0  DEGRADED     0     0     0
	    loop0   OFFLINE      0     0     0
	    loop1   ONLINE       0     0     0
	    loop2   ONLINE       0     0     0
	spares
	  loop3     AVAIL

We replace /dev/loop0 with our spare /dev/loop3:

root # zpool replace zfs_test /dev/loop0 /dev/loop3
root # zpool status

  pool: zfs_test
 state: DEGRADED
status: One or more devices has been taken offline by the administrator.
        Sufficient replicas exist for the pool to continue functioning in a degraded state.
action: Online the device using 'zpool online' or replace the device with 'zpool replace'.
 scrub: resilver completed after 0h0m with 0 errors on Sun Aug 21 22:29:22 2011
config:

	NAME         STATE     READ WRITE CKSUM
	zfs_test     DEGRADED     0     0     0
	  raidz1-0   DEGRADED     0     0     0
	    spare-0  DEGRADED     0     0     0
	      loop0  OFFLINE      0     0     0
	      loop3  ONLINE       0     0     0  46.5K resilvered
	    loop1    ONLINE       0     0     0
	    loop2    ONLINE       0     0     0
	spares
	  loop3      INUSE     currently in use

errors: No known data errors

Note: the file system was automatically resilvered onto /dev/loop3 and the zpool stayed online the whole time.

Now we remove the failed vdev /dev/loop0 and start a manual scrub:

root # zpool detach zfs_test /dev/loop0 && zpool scrub zfs_test
root # zpool status

  pool: zfs_test
 state: ONLINE
 scrub: scrub completed after 0h0m with 0 errors on Sun Aug 21 22:37:52 2011
config:

	NAME        STATE     READ WRITE CKSUM
	zfs_test    ONLINE       0     0     0
	  raidz1-0  ONLINE       0     0     0
	    loop3   ONLINE       0     0     0
	    loop1   ONLINE       0     0     0
	    loop2   ONLINE       0     0     0

errors: No known data errors
Zpool Version Update

With every update of sys-fs/zfs, you are likely to also get a more recent ZFS version. The status of your zpools will then show a warning that a new version is available and that the zpools could be upgraded. To display the pool version the system is running and the versions it supports:

root # zpool upgrade -v
This system is currently running ZFS pool version 28.

The following versions are supported:

VER  DESCRIPTION
---  --------------------------------------------------------
 1   Initial ZFS version
 2   Ditto blocks (replicated metadata)
 3   Hot spares and double parity RAID-Z
 4   zpool history
 5   Compression using the gzip algorithm
 6   bootfs pool property
 7   Separate intent log devices
 8   Delegated administration
 9   refquota and refreservation properties
 10  Cache devices
 11  Improved scrub performance
 12  Snapshot properties
 13  snapused property
 14  passthrough-x aclinherit
 15  user/group space accounting
 16  stmf property support
 17  Triple-parity RAID-Z
 18  Snapshot user holds
 19  Log device removal
 20  Compression using zle (zero-length encoding)
 21  Deduplication
 22  Received properties
 23  Slim ZIL
 24  System attributes
 25  Improved scrub stats
 26  Improved snapshot deletion performance
 27  Improved snapshot creation performance
 28  Multiple vdev replacements

For more information on a particular version, including supported releases,
see the ZFS Administration Guide.
Warning: systems with a lower ZFS version installed will not be able to import a zpool of a higher version.

To upgrade the version of zpool zfs_test:

root # zpool upgrade zfs_test

To upgrade the version of all zpools in the system:

root # zpool upgrade -a

Zpool Tips/Tricks

• You cannot shrink a zpool or remove vdevs after its initial creation.
• It is possible to add more vdevs to a MIRROR after its initial creation. Use the following command (/dev/loop0 is the first drive in the MIRROR):

root # zpool attach zfs_test /dev/loop0 /dev/loop2

• More than 9 vdevs in one RAIDZ could cause a performance regression. For example, it is better to use 2x RAIDZ with five vdevs each rather than 1x RAIDZ with 10 vdevs in a zpool.
• RAIDZ1 and RAIDZ2 cannot be resized after initial creation (you can only add additional hot spares). You can, however, replace the hard drives with bigger ones (one at a time), e.g. replace 1T drives with 2T drives to double the available space in the zpool.
• It is possible to mix MIRROR, RAIDZ1 and RAIDZ2 in a zpool. For example, for a zpool with RAIDZ1 named zfs_test, to add two more vdevs as a MIRROR use:

root # zpool add -f zfs_test mirror /dev/loop4 /dev/loop5

Note: this needs the -f option.

• It is possible to restore a destroyed zpool by reimporting it straight after the accident happened (a short sketch of the follow-up commands appears after this list):

root # zpool import -D
  pool: zfs_test
    id: 12744221975042547640
 state: ONLINE (DESTROYED)
action: The pool can be imported using its name or numeric identifier.

Note: the -D option searches all hard drives for existing zpools.
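Building on the tips above, two hedged follow-ups: actually restoring the destroyed pool that zpool import -D found, and the autoexpand pool property that is commonly enabled when replacing drives with bigger ones (property name taken from stock zpool documentation; verify on your version):

root # zpool import -D zfs_test
root # zpool set autoexpand=on zfs_test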
Volumes

The program /usr/sbin/zfs is used for any operation regarding volumes. To control the size of a volume you can set a quota, and you can reserve a certain amount of storage within a zpool; by default the complete storage size of the zpool is available.

Create Volumes

We use our zpool zfs_test to create a new volume called volume1:

root # zfs create zfs_test/volume1

The volume will be mounted automatically as /zfs_test/volume1/

root # zfs list

Mount/Umount Volumes

Volumes can be mounted with the following command; the mountpoint is defined by the mountpoint property of the volume:

root # zfs mount zfs_test/volume1

To unmount the volume:

root # zfs unmount zfs_test/volume1

The folder /zfs_test/volume1 stays behind without the volume mounted on it. If you write data into it and then try to mount the volume again, you will see the following error message:

cannot mount '/zfs_test/volume1': directory is not empty
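A hedged way out of the "directory is not empty" situation is to move (or delete) whatever was written into the stale mountpoint directory and then mount again; the destination path below is just an example:

root # mv /zfs_test/volume1/* /root/stray-files/
root # zfs mount zfs_test/volume1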
Remove Volumes

To remove the volume volume1 from zpool zfs_test:

root # zfs destroy zfs_test/volume1
root # zfs list

Note: you cannot destroy a volume if any snapshots of it exist.

Properties

Properties for volumes are inherited from the zpool. So you can either change a property on the zpool for all volumes, or individually for each volume, or a mix of both. To set a property for a volume:

root # zfs set <property>=<value> zfs_test/volume1

To show the setting for a particular property on a volume:

root # zfs get <property> zfs_test/volume1

Note: the more properties that are used on a volume (e.g. compression), the higher the version of that volume.

You can get a list of all properties set on any zpool with the following command:

root # zfs get all

This is a partial list of properties that can be set on either zpools or volumes; for a full list see man zfs:

Property      Value                 Function
quota=        20m,none              set a quota of 20MB for the volume
reservation=  20m,none              reserves 20MB for the volume within its zpool
compression=  zle,gzip,on,off       uses the given compression method, or the default method (gzip) when set to on
sharenfs=     on,off,ro,nfsoptions  shares the volume via NFS
exec=         on,off                controls whether programs can be executed on the volume
setuid=       on,off                controls whether SUID or GUID can be set on the volume
readonly=     on,off                sets the read-only attribute on/off
atime=        on,off                update access times for files in the volume
dedup=        on,off                sets deduplication on or off
mountpoint=   none,path             sets the mountpoint for the volume below the zpool or elsewhere in the file system; a mountpoint of none prevents the volume from being mounted
Set Mountpoint

To set the mountpoint for a volume, use the following command:

root # zfs set mountpoint=/mnt/data zfs_test/volume1

The volume will automatically be moved to /mnt/data.

NFS Volume

Create a volume as an NFS share:

root # zfs create -o sharenfs=on zfs_test/volume2

Check which file systems are shared via NFS:

root # exportfs

By default the volume is shared to all networks; to specify share options:

root # zfs set sharenfs="-maproot=root -alldir -network 192.168.1.254 -mask 255.255.255.0" zfs_test/volume2
root # exportfs

To stop sharing the volume:

root # zfs set sharenfs=off zfs_test/volume2
root # exportfs
Snapshots

Snapshots are volumes that initially take no space and preserve the state of another volume at the moment they were taken. As the original volume diverges from the snapshot, the snapshot grows in size.

Create Snapshots
To create a snapshot of a volume, use the following command:

root # zfs snapshot zfs_test/volume1@22082011

Note: volume1@22082011 is the full name of the snapshot; everything after the @ symbol can be any alphanumeric combination.

Every time a file in volume1 changes, the old data of the file remains referenced by the snapshot.
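To get periodic snapshots similar to the zfs-auto-snapshot behaviour discussed at the top of this document, a minimal hand-rolled sketch is a crontab entry with a timestamped snapshot name (the schedule and dataset name are arbitrary examples; note that % must be escaped in crontab):

0 * * * *  /usr/sbin/zfs snapshot zfs_test/volume1@$(date +\%Y\%m\%d-\%H\%M)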
List Snapshots
List all available snapshots:

root # zfs list -t snapshot -o name,creation

Rollback Snapshots

To roll back a full volume to a previous state:

root # zfs rollback zfs_test/volume1@21082011

Note: if there are other snapshots in between, you have to use the -r option. This removes all snapshots between the one you want to roll back to and the current state of the volume.

Clone Snapshots

ZFS can clone snapshots to new volumes, so you can access the files from previous states individually:

root # zfs clone zfs_test/volume1@21082011 zfs_test/volume1_restore

The folder /zfs_test/volume1_restore now contains the contents of the previous state and can be worked on.

Remove Snapshots
Remove snapshots of a volume with the following command: root # zfs destroy zfs_test/volume1@21082011
Maintenance

Scrubbing

Start a scrub of zpool zfs_test:

root # zpool scrub zfs_test

Note: this might take some time and is quite I/O intensive.

Log Files

To check the history of commands that were executed:

root # zpool history

Monitor I/O

Monitor I/O activity on all zpools (refreshes every 6 seconds):

root # zpool iostat 6
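Scrubbing is typically scheduled rather than run by hand; a hedged example is a crontab entry that scrubs the pool once a month (day and time are arbitrary):

0 3 1 * *  /usr/sbin/zpool scrub zfs_test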
External resources

• zfs-fuse.net
• ZFS for Linux
• ZFS Best Practices Guide
• ZFS Evil Tuning Guide
• article about ZFS on Linux/Gentoo (German)
LVM

LVM (Logical Volume Manager) is software that abstracts physical devices as PVs (Physical Volumes) and groups them into storage pools called VGs (Volume Groups). A physical volume can be a partition, a whole SATA hard drive, drives grouped as JBOD (Just a Bunch Of Disks), a RAID system, iSCSI, Fibre Channel, eSATA, etc.
Contents
• 1 Installation
  • 1.1 Kernel
  • 1.2 Software
• 2 Configuration
  • 2.1 Boot service
    • 2.1.1 openrc
    • 2.1.2 systemd
  • 2.2 LVM on root
• 3 Usage
  • 3.1 PV (Physical Volume)
    • 3.1.1 Partitioning
    • 3.1.2 Create PV
    • 3.1.3 List PV
    • 3.1.4 Remove PV
  • 3.2 VG (Volume Group)
    • 3.2.1 Create VG
    • 3.2.2 List VG
    • 3.2.3 Extend VG
    • 3.2.4 Reduce VG
    • 3.2.5 Remove VG
  • 3.3 LV (Logical Volume)
    • 3.3.1 Create LV
    • 3.3.2 List LV
    • 3.3.3 Extend LV
    • 3.3.4 Reduce LV
    • 3.3.5 LV Permissions
    • 3.3.6 Remove LV
  • 3.4 Thin metadata, pool, and LV
    • 3.4.1 Create thin pool
    • 3.4.2 Create a thin LV
    • 3.4.3 List thin pool and thin LV
    • 3.4.4 Extend thin pool
    • 3.4.5 Extend thin LV
    • 3.4.6 Reduce thin pool
    • 3.4.7 Reduce thin LV
    • 3.4.8 Thin pool Permissions
    • 3.4.9 Thin LV Permissions
    • 3.4.10 Thin pool Removal
    • 3.4.11 Thin LV Removal
• 4 Examples
  • 4.1 Preparation
Installation

Kernel

You need to activate the following kernel options:

Kernel configuration:
Device Drivers --->
  Multiple devices driver support (RAID and LVM) --->
    <*> Device mapper support
    <*>   Crypt target support
    <*>   Snapshot target
    <*>   Mirror target
    <*>   Multipath target
    <*>     I/O Path Selector based on the number of in-flight I/Os
    <*>     I/O Path Selector based on the service time
Note You probably don't need everything enabled, but some of the options are needed for #LVM2_Snapshots, #LVM2_MIRROR, #LVM2_Stripeset and encryption.
Software

Install sys-fs/lvm2:

→ Information about USE flags

USE flag     Default  Description
clvm         No       Allow users to build clustered lvm2
cman         No       Cman support for clustered lvm
lvm1         Yes      Allow users to build lvm2 with lvm1 support
readline     Yes      Enables support for libreadline, a GNU line-editing library that almost everyone wants
selinux      No       !!internal use only!! Security Enhanced Linux support, this must be set by the selinux profile or breakage will occur
static       No       !!do not set this during bootstrap!! Causes binaries to be statically linked instead of dynamically
static-libs  No       Build static libraries
thin         Yes      Support for thin volumes
udev         Yes      Enable sys-fs/udev integration (device discovery, power and storage device support, etc)

root # emerge --ask lvm2
Configuration The configuration file is /etc/lvm/lvm.conf
Boot service

openrc

To start LVM manually:

root # /etc/init.d/lvm start

To start LVM at boot time:

root # rc-update add lvm boot

systemd

To start LVM manually:

root # systemctl start lvm.service

To start LVM at boot time:

root # systemctl enable lvm.service
LVM on root

Most bootloaders cannot boot from LVM directly; neither GRUB legacy nor LILO can. GRUB 2 can boot from an LVM linear LV, a mirrored LV and possibly some kinds of RAID LVs. No bootloader currently supports thin LVs. For that reason, it is recommended to use a non-LVM /boot partition and mount the LVM root from an initramfs. Most users will want to use a prebuilt one. Genkernel, genkernel-next, and dracut can generate an initramfs suitable for most LV types (a hedged usage sketch follows this list):

• genkernel can boot from all types except thin volumes (as it neither builds nor copies the thin-provisioning-tools binaries from the build host) and maybe RAID10 (RAID10 support requires LVM2 2.02.98, but genkernel builds 2.02.89; however, if static binaries are available it can copy those)
• genkernel-next can boot from all types except thin volumes (it copies the thin-provisioning-tools binaries from the build host, but they are dynamically linked and the needed libraries are not copied over, so the resulting binaries are broken; thin-provisioning-tools does not support building static binaries)
• dracut can boot all types, but only includes thin support in the initramfs if the host it is being run on has a thin root. In that case it copies the thin-provisioning-tools binaries from the build host. It is unknown whether the resulting binaries are functional or not.
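As a hedged illustration of the genkernel route described above (the VG/LV names are examples; --lvm, dolvm and real_root are standard genkernel options/parameters, but check your version):

root # genkernel --lvm initramfs

Then pass the LVM parameters on the kernel command line in the bootloader configuration, for example:

real_root=/dev/vg0/root dolvm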
Usage LVM organizes storage in three different levels as follows: • hard drives, partitions, RAID systems or other means of storage are initialized as PV (Physical Volume) • Physical Volumes (PV) are grouped together in Volume Groups (VG) • Logical Volumes (LV) are managed in Volume Groups (VG)
PV (Physical Volume)

Physical Volumes are the actual hardware or storage systems that LVM builds upon.
Partitioning

The partition type for LVM is 8e (Linux LVM):

root # fdisk /dev/sdX

In fdisk, you can create MBR partitions using the n key and then change the partition type with the t key to 8e. We will end up with one primary partition /dev/sdX1 of partition type 8e (Linux LVM).

Note: This step is not needed, since LVM can initialize whole hard drives as PVs. Not using partitions also avoids restricting LVM to the limits of MBR or GPT tables.

Create PV

The following command creates a Physical Volume (PV) on the first primary partition of /dev/sdX and /dev/sdY:

root # pvcreate /dev/sd[X-Y]1

List PV

The following command lists all active Physical Volumes (PV) in the system:

root # pvdisplay

You can scan for PVs in the system, to troubleshoot improperly initialized or lost storage devices:

root # pvscan

Remove PV

LVM automatically distributes the data onto all available PVs, if not told otherwise. To make sure there is no data left on a device before we remove it, use the following command:

root # pvmove -v /dev/sdX1

This might take a long time; once it is finished, there should be no data left on /dev/sdX1. We first remove the PV from our Volume Group (VG) and then remove the actual PV:

root # vgreduce vg0 /dev/sdX1 && pvremove /dev/sdX1

Note: If a whole hard drive was once initialized as a PV, you have to remove the PV before the drive can be properly partitioned again, because PVs have no valid MBR table.
VG (Volume Group) Volume Groups (VG) consist of one or more Physical Volumes (PV) and show up as /dev/<VG name>/ in the device file system. Create VG The following command creates a Volume Group (VG) named vg0 on two previously initialized Physical Volumes (PV) named /dev/sdX1 and /dev/sdY1: root # vgcreate vg0 /dev/sd[X-Y]1
List VG

The following command lists all active Volume Groups (VG) in the system:

root # vgdisplay

You can scan for VGs in the system, to troubleshoot improperly created or lost VGs:

root # vgscan

Extend VG

With the following command, we extend the existing Volume Group (VG) vg0 onto the Physical Volume (PV) /dev/sdZ1:

root # vgextend vg0 /dev/sdZ1

Reduce VG

Before we can remove a Physical Volume (PV), we need to make sure that LVM has no data left on the device. To move all data off that PV and distribute it onto the others available, use the following command:

root # pvmove -v /dev/sdX1

This might take a while; once it is finished, we can remove the PV from our VG:

root # vgreduce vg0 /dev/sdX1

Remove VG

Before we can remove a Volume Group (VG), we have to remove all existing snapshots, all Logical Volumes (LV) and all Physical Volumes (PV) but one. The following command removes the VG named vg0:

root # vgremove vg0
LV (Logical Volume)

Logical Volumes (LV) are created and managed in Volume Groups (VG); once created they show up as /dev/<VG name>/<LV name> and can be used like normal partitions.

Create LV

With the following command, we create a Logical Volume (LV) named lvol1 in Volume Group (VG) vg0 with a size of 150MB:

root # lvcreate -L 150M -n lvol1 vg0

There are other useful options to set the size of a new LV, like:

• -l 100%FREE = maximum size of the LV within the VG
• -l 50%VG = 50% of the size of the whole VG

List LV

The following command lists all Logical Volumes (LV) in the system:

root # lvdisplay

You can scan for LVs in the system, to troubleshoot improperly created or lost LVs:

root # lvscan

Extend LV

With the following command, we can extend the Logical Volume (LV) named lvol1 in Volume Group (VG) vg0 to 500MB:

root # lvextend -L500M /dev/vg0/lvol1

Note: use -L+350M to increase the current size of an LV by 350MB.

Once the LV is extended, we need to grow the file system as well (in this example we used ext4 and the LV is mounted on /mnt/data):

Note: Some file systems, like ext4, support online resizing; otherwise you have to unmount the file system first.

root # resize2fs /mnt/data 500M

Reduce LV

Before we can reduce the size of our Logical Volume (LV) without corrupting existing data, we have to shrink the file system on it. In this example we used ext4; the LV needs to be unmounted to shrink the file system:

root # umount /mnt/data
root # e2fsck -f /dev/vg0/lvol1
root # resize2fs /dev/vg0/lvol1 150M

Now we are ready to reduce the size of our LV:

root # lvreduce -L150M /dev/vg0/lvol1

Note: use -L-350M to reduce the current size of an LV by 350MB.

LV Permissions

Logical Volumes (LV) can be set to be read-only storage devices:

root # lvchange -p r /dev/vg0/lvol1

The LV needs to be remounted for the change to take effect:

root # mount -o remount /dev/vg0/lvol1

To set the LV to be read/write again:

root # lvchange -p rw /dev/vg0/lvol1 && mount -o remount /dev/vg0/lvol1

Remove LV

Before we remove a Logical Volume (LV), we should unmount and deactivate it, so no further write activity can take place:

root # umount /dev/vg0/lvol1 && lvchange -a n /dev/vg0/lvol1
The following command removes the LV named lvol1 from VG named vg0: root # lvremove /dev/vg0/lvol1
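As a convenience related to the extend/reduce workflow above: recent lvm2 releases can grow the file system together with the LV in a single step via the -r/--resizefs switch. A hedged alternative to the separate lvextend + resize2fs calls:

root # lvextend -r -L+350M /dev/vg0/lvol1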
Thin metadata, pool, and LV

Recent versions of LVM2 (2.02.89) support "thin" volumes. Thin volumes are to block devices what sparse files are to filesystems. Thus, a thin LV within a pool can be "overcommitted": it can even be larger than the pool itself. Just like a sparse file, the "holes" are filled as the block device gets populated. If the filesystem has "discard" support, the "holes" can be recreated as files are deleted, reducing utilization of the thin pool.

Create thin pool

Warning: If the thin pool metadata overflows, the pool will be corrupted. LVM cannot recover from this.

Note: If the thin pool gets exhausted, any process that would cause the thin pool to overrun will be stuck in "killable sleep" state until either the thin pool is extended or the process receives SIGKILL.

Each thin pool has some metadata associated with it, which is added to the thin pool size. You can specify its size explicitly; otherwise lvm2 will compute one based on the size of the thin pool, as pool_chunks * 64 bytes or 2MiB, whichever is larger.

root # lvcreate -L 150M --type thin-pool --thinpool thin_pool vg0

This creates a thin pool named "thin_pool" with a size of 150MB (actually slightly bigger than 150MB because of the metadata).

root # lvcreate -L 150M --metadatasize 2M --type thin-pool --thinpool thin_pool vg0

This creates a thin pool named "thin_pool" with a size of 150MB and an explicit metadata size of 2MiB. Unfortunately, because the metadata size is added to the thin pool size, the intuitive way of filling a VG with a thin pool does not work:[1]

root # lvcreate -l 100%FREE --type thin-pool --thinpool thin_pool vg0
Insufficient suitable allocatable extents for logical volume thin_pool: 549 more required

Note: the thin pool does not have an associated device node like other LVs.

Create a thin LV

A thin LV is somewhat unusual in LVM: the thin pool itself is an LV, so a thin LV is an "LV-within-an-LV". Since the volumes are sparse, a virtual size instead of a physical size is specified:

root # lvcreate -T vg0/thin_pool -V 300M -n lvol1

Note how the LV is larger than the pool it is created in. It is also possible to create the thin metadata, pool and LV with a single command:

root # lvcreate -T vg0/thin_pool -V 300M -L150M -n lvol1

List thin pool and thin LV

Thin LVs, just like any other LV, are displayed using lvdisplay and scanned using lvscan.

Extend thin pool

Warning: As of LVM2 2.02.89, the metadata size of the thin pool cannot be expanded; it is fixed at creation.

The thin pool is expanded like a non-thin LV:

root # lvextend -L500M vg0/thin_pool

or

root # lvextend -L+350M vg0/thin_pool

Extend thin LV

A thin LV is expanded just like a regular LV:

root # lvextend -L1G vg0/lvol1

or

root # lvextend -L+700M vg0/lvol1

Note: this is asymmetric from creation, where the virtual size was specified with -V instead of -L/-l. The filesystem can then be expanded using that filesystem's tools.

Reduce thin pool

Currently, LVM cannot reduce the size of the thin pool.[2]

Reduce thin LV

Before shrinking an LV, shrink the filesystem first using that filesystem's tools. Some filesystems do not support shrinking. A thin LV is reduced just like a regular LV:

root # lvreduce -L300M vg0/lvol1

or

root # lvreduce -L-700M vg0/lvol1

Note: this is asymmetric from creation, where the virtual size was specified with -V instead of -L/-l.

Thin pool Permissions

It is not possible to change the permissions on the thin pool (nor would it make any sense to).

Thin LV Permissions

A thin LV can be set read-only/read-write the same way a regular LV is.

Thin pool Removal

The thin pool cannot be removed until all the thin LVs within it are removed. Once that is done, it can be removed:

root # lvremove vg0/thin_pool

Thin LV Removal

A thin LV is removed like a regular LV.
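Given the warnings above about thin pool overflow, it is worth keeping an eye on pool utilization; a hedged sketch using lvs reporting fields (field names as provided by stock lvm2; verify with lvs -o help on your version):

root # lvs -o lv_name,lv_size,data_percent,metadata_percent vg0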
Examples We can create some scenarios using loopback devices, so no real storage devices are used.
Preparation First we need to make sure the loopback module is loaded. If you want to play around with partitions, use the following option: root # modprobe -r loop && modprobe loop max_part=63 Note you cannot reload the module, if it is built into the kernel Now we need to either tell LVM to not use udev to scan for devices or change the filters in /etc/lvm/lvm.conf. In this case we just temporarily do not use udev: [Collapse] File/etc/lvm/lvm.conf obtain_device_list_from_udev = 0
Important this is for testing only; you want to change the setting back when dealing with real devices, since it is much faster.
We create some image files that will become our storage devices (uses ~10GB of real hard drive space):
root # mkdir /var/lib/lvm_img
root # dd if=/dev/null of=/var/lib/lvm_img/lvm0.img bs=1024 seek=2097152
root # dd if=/dev/null of=/var/lib/lvm_img/lvm1.img bs=1024 seek=2097152
root # dd if=/dev/null of=/var/lib/lvm_img/lvm2.img bs=1024 seek=2097152
root # dd if=/dev/null of=/var/lib/lvm_img/lvm3.img bs=1024 seek=2097152
root # dd if=/dev/null of=/var/lib/lvm_img/lvm4.img bs=1024 seek=2097152
Check which loopback devices are available:
root # losetup -a
We assume all loopback devices are available and create our hard drives:
root # losetup /dev/loop0 /var/lib/lvm_img/lvm0.img
root # losetup /dev/loop1 /var/lib/lvm_img/lvm1.img
root # losetup /dev/loop2 /var/lib/lvm_img/lvm2.img
root # losetup /dev/loop3 /var/lib/lvm_img/lvm3.img
root # losetup /dev/loop4 /var/lib/lvm_img/lvm4.img
Now we can use /dev/loop[0-4] as we would use any other hard drive in the system.
Note On the next reboot, all the loopback devices will be released and the folder /var/lib/lvm_img can be deleted.
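If you prefer to clean up without rebooting, a minimal sketch (assuming any test VGs on the loop devices have already been deactivated with vgchange -an):
root # for i in 0 1 2 3 4; do losetup -d /dev/loop$i; done
root # rm -r /var/lib/lvm_img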
LVM2 Linear volumes
Linear volumes are the most common kind of LVM volume. A linear volume can consume all or part of a VG. LVM will attempt to allocate the LV to be as physically contiguous as possible. If there is a PV large enough to hold the entire LV, LVM will allocate it there; otherwise it will split it across as few PVs as possible. A linear volume is actually implemented as a degenerate stripe set (containing a single stripe).

Creating a linear volume
To create a linear volume:
root # pvcreate /dev/loop[0-2]
root # vgcreate vg00 /dev/loop[0-2]
root # lvcreate -L3G -n lvm_stripe1 vg00
root # lvcreate -L2G -n lvm_stripe2 vg00
The linear volume is the default type.
root # pvscan
PV /dev/loop0   VG vg00   lvm2 [2.00 GiB / 0 free]
PV /dev/loop1   VG vg00   lvm2 [2.00 GiB / 1012.00 MiB free]
PV /dev/loop2   VG vg00   lvm2 [2.00 GiB / 0 free]
LVM allocated the first LV to use all of the first PV and part of the second, and the second LV to use all of the third PV. Because linear volumes have no special requirements, they are the easiest to manipulate and can be resized and relocated at will. If an LV is allocated across multiple PVs and any of those PVs are unavailable, that LV cannot be started and will be unusable.
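To see exactly which PVs each LV landed on, the segment mapping can be inspected; a minimal sketch:
root # lvdisplay -m vg00/lvm_stripe1
or, for a per-LV device summary:
root # lvs -o +devices vg00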
/etc/fstab
Here is an example of an entry in fstab (using ext4):
File: /etc/fstab
/dev/vg0/lvol1    /mnt/data    ext4    noatime            0 2
For thin volumes, add the discard option:
File: /etc/fstab
/dev/vg0/lvol1    /mnt/data    ext4    noatime,discard    0 2
LVM2 Snapshots and LVM2 Thin Snapshots
A snapshot is an LV that acts as a copy of another LV, preserving the content of the original LV as it was at the moment the snapshot was taken. We once again use our two hard drives and create LV lvol1, this time with 60% of VG vg0:
root # pvcreate /dev/loop[0-1]
root # vgcreate vg0 /dev/loop[0-1]
root # lvcreate -l 60%VG -n lvol1 vg0
root # mkfs.ext4 /dev/vg0/lvol1
root # mount /dev/vg0/lvol1 /mnt/data
LVM2 Snapshots
Now we create a snapshot of lvol1 named 08092011_lvol1 and give it 10% of VG vg0:
root # lvcreate -l 10%VG -s -n 08092011_lvol1 /dev/vg0/lvol1
Important if a snapshot exceeds its maximum size, it disappears.
Mount our snapshot somewhere else:
root # mkdir /mnt/08092011_data
root # mount /dev/vg0/08092011_lvol1 /mnt/08092011_data
We can now access the data in lvol1 from a previous state. LVM2 snapshots are writable LVs; we could use them to let a project branch in two different directions:
root # lvcreate -l 10%VG -s -n project1_lvol1 /dev/vg0/lvol1
root # lvcreate -l 10%VG -s -n project2_lvol1 /dev/vg0/lvol1
root # mkdir /mnt/project1 /mnt/project2
root # mount /dev/vg0/project1_lvol1 /mnt/project1
root # mount /dev/vg0/project2_lvol1 /mnt/project2
Now we have three different versions of LV lvol1: the original and two snapshots, which can be used in parallel, with changes written to the snapshots.
Note the original LV lvol1 cannot be reduced in size or removed if snapshots of it exist. Snapshots can be increased in size without growing the file system on them, but they cannot exceed the size of the original LV.

LVM2 Thin Snapshots
Note A thin snapshot can only be taken of a thin origin. It is possible to create a non-thin snapshot of a thin origin, however.
Creating a thin snapshot is simple:
root # lvcreate -s -n 08092011_lvol1 /dev/vg0/lvol1
Note how a size is not specified with -l/-L, nor the virtual size with -V. Snapshots have a virtual size the same as their origin, and a physical size of 0 like all new thin volumes. This also means it is not possible to limit the physical size of the snapshot. Thin snapshots are writable just like regular snapshots.
Important If -l/-L is specified, a snapshot will still be created, but the resulting snapshot will be a regular snapshot, not a thin snapshot.
Recursive snapshots can be created:
root # lvcreate -s -n 08092012_lvol1 /dev/vg0/08092011_lvol1
Thin snapshots have several advantages over regular snapshots. First, thin snapshots are independent of their origins once created. The origin can be shrunk or deleted without affecting the snapshot. Second, thin snapshots can be efficiently created recursively (snapshots of snapshots) without the "chaining" overhead of regular recursive LVM snapshots.

LVM2 Rollback Snapshots
To roll back the logical volume to the version of the snapshot, use the following command:
root # lvconvert --merge /dev/vg0/08092011_lvol1
This might take a couple of minutes, depending on the size of the volume.
Important the snapshot will disappear, and this is not reversible.

LVM2 Thin Rollback Snapshots
For thin volumes, lvconvert --merge does not work. Instead, delete the origin and rename the snapshot:
root # umount /dev/vg0/lvol1
root # lvremove /dev/vg0/lvol1
root # lvrename vg0/08092011_lvol1 lvol1
LVM2 Mirrors
LVM supports mirrored volumes, which provide fault tolerance in the event of a drive failure. Unlike RAID1, there is no performance benefit - all reads and writes are delivered to a single "leg" of the mirror. One additional PV is required for each mirror. Mirrors support three kinds of logs:
Note For all mirror log types except core, LVM prefers - and sometimes insists - that the mirror logs be kept on a PV that does not contain the mirrored LVs. If it is desired to have the mirror logs on the same PV as the mirrored LVs themselves, and LVM insists on a separate PV for the log, add the --alloc anywhere parameter.
• Disk mirror logs record the state of the mirror on disk in extra metadata extents. LVM keeps track of what is mirrored and can pick up where it left off if the sync is incomplete. This is the default.
• Mirrored logs are disk logs that are themselves mirrored.
• Core mirror logs record the state of the mirror in memory only. LVM will have to rebuild the mirror every time it is activated. Useful for temporary mirrors.
Creating a mirror LV
To create an LV with a single mirror:
root # pvcreate /dev/loop[0-1]
root # vgcreate vg00 /dev/loop[0-1]
root # lvcreate -m 1 --mirrorlog core -l 40%VG --nosync vg00
WARNING: New mirror won't be synchronised. Don't read what you didn't write!
The -m 1 indicates we want to create 1 (additional) mirror, requiring 2 PVs. The --nosync option is an optimization - without it, LVM will try to synchronize the mirror by copying empty sectors from one LV to another.

Creating a mirror of an existing LV
It is possible to create a mirror of an existing LV:
root # pvcreate /dev/loop[0-1]
root # vgcreate vg00 /dev/loop[0-1]
root # lvcreate -l 40%VG vg00
root # lvconvert -m 1 --mirrorlog core -b vg00/lvol0
This mirrors an existing LV onto a different PV. The -b option puts the operation into the background, as mirroring an LV takes a long time.

Removing a mirror of an existing LV
To remove the mirror, set the number of mirrors to 0:
root # lvconvert -m0 vg00/lvol0

Failed mirrors
To simulate a failure:
Warning Mirror failures can cause the device mapper to deadlock, requiring a reboot.
root # vgchange -an vg00
root # losetup -d /dev/loop1
root # rm /var/lib/lvm_img/lvm1.img
root # dd if=/dev/null of=/var/lib/lvm_img/lvm1.img bs=1024 seek=2097152
root # losetup /dev/loop1 /var/lib/lvm_img/lvm1.img
If part of the mirror is unavailable (usually because the disk containing the PV has failed), the VG will need to be brought up in degraded mode:
root # vgchange -ay --partial vg00
On the first write, LVM will notice the mirror is broken. The default policy ("remove") is to automatically reduce/break the mirror according to the number of pieces available. A 3-way mirror with a missing PV will be reduced to a 2-way mirror; a 2-way mirror will be reduced to a regular linear volume. If the failure is only transient, and the missing PV returns after LVM has broken the mirror, the mirrored LV will need to be recreated on it.
To recover the mirror, the failed PV needs to be removed from the VG and a replacement one added (or, if the VG has a free PV, the mirror can be recreated on a different PV); the mirror is then recreated with lvconvert and the old PV removed from the VG:
root # pvcreate /dev/loop1
root # vgextend vg00 /dev/loop1
root # lvconvert -b -m 1 --mirrorlog disk vg00/lvol0
root # vgreduce --removemissing vg00
It is possible to have LVM recreate the mirror with free extents on a different PV if a "leg" fails; to do that, set mirror_image_fault_policy to "allocate" in lvm.conf.

Thin mirrors
It is not (yet) possible to create a mirrored thin pool or thin volume directly. It is possible to create a mirrored thin pool by creating a normal mirrored LV and then converting it to a thin pool with lvconvert. Two LVs are required: one for the thin pool and one for the thin metadata; the conversion process will merge them into a single LV.
Warning LVM 2.02.98 or above is required for this to work properly. Prior versions are either not capable or will segfault and corrupt the VG. Also, conversion of a mirror into a thin pool destroys all existing data in the mirror!
root # lvcreate -m 1 --mirrorlog mirrored -l40%VG -n thin_pool vg00
root # lvcreate -m 1 --mirrorlog mirrored -L4MB -n thin_meta vg00
root # lvconvert --thinpool vg00/thin_pool --poolmetadata vg00/thin_meta
LVM2 RAID 0/Stripeset
Important If a linear volume suffers a disk failure, a giant, contiguous "hole" is created. It may be possible to recover data from outside that hole. If a striped volume suffers a disk failure, then instead of a contiguous hole the damage pattern is closer to Swiss cheese; the chances of recovering anything are slim to none.
Instead of a linear volume, where multiple contiguous volumes are appended, it is possible to create a striped or RAID 0 volume for better performance.

Creating a stripe set
To create a 3-PV striped volume:
root # pvcreate /dev/loop[0-2]
root # vgcreate vg00 /dev/loop[0-2]
root # lvcreate -i 3 -l 20%VG -n lvm_stripe vg00
Using default stripesize 64.00 KiB
The -i option indicates how many PVs to stripe over, in this case 3.
root # pvscan
PV /dev/loop0   VG vg00   lvm2 [2.00 GiB / 1.60 GiB free]
PV /dev/loop1   VG vg00   lvm2 [2.00 GiB / 1.60 GiB free]
PV /dev/loop2   VG vg00   lvm2 [2.00 GiB / 1.60 GiB free]
On each PV, 400 MB was reserved for LV lvm_stripe in VG vg00.
It is possible to mirror a stripe set. The -i and -m options can be combined to create a striped mirror:
root # lvcreate -i 2 -m 1 -l 10%VG vg00
This creates a 2-PV stripe set and mirrors it on 2 different PVs, for a total of 4 PVs. An existing stripe set can be mirrored with lvconvert.
A thin pool can be striped like any other LV. All the thin volumes created from the pool inherit that setting - do not specify it manually when creating a thin volume.
It is not possible to stripe an existing volume, nor reshape the stripes across more/fewer PVs, nor to convert to a different RAID level/linear volume. A stripe set can be mirrored. It is possible to extend a stripe set across additional PVs, but they must be added in multiples of the original stripe set (which will effectively linearly append a new stripe set), or --alloc anywhere must be specified (which can hurt performance). In the above example, 3 additional PVs would be required without --alloc anywhere.
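A minimal sketch of such an extension, assuming three additional PVs are available (only /dev/loop3 and /dev/loop4 exist in the test setup above, so /dev/loop5 is hypothetical here):
root # vgextend vg00 /dev/loop[3-5]
root # lvextend -l +20%VG vg00/lvm_stripe
lvextend should reuse the stripe parameters of the LV's last segment; with older LVM2 releases you may need to pass -i 3 explicitly.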
LVM2 RAID 1
Unlike RAID 0, which is striping, RAID 1 is mirroring, but implemented differently than the original LVM mirror. Under RAID 1, reads are spread out across PVs, improving performance. RAID 1 mirror failures do not cause I/O to block, because LVM does not need to break the mirror on write. Any place where an LVM mirror could be used, a RAID 1 mirror can be used in its place. It is possible to have LVM create RAID 1 mirrors instead of regular mirrors implicitly by setting mirror_segtype_default in lvm.conf to raid1.

Creating a RAID 1 LV
To create an LV with a single mirror:
root # pvcreate /dev/loop[0-1]
root # vgcreate vg00 /dev/loop[0-1]
root # lvcreate -m 1 --type raid1 -l 40%VG --nosync -n lvm_raid1 vg00
WARNING: New raid1 won't be synchronised. Don't read what you didn't write!
root # pvscan
PV           VG    Fmt   Attr  PSize  PFree
/dev/loop0   vg00  lvm2  a-    2.00g  408.00m
/dev/loop1   vg00  lvm2  a-    2.00g  408.00m
On each PV, about 1.2 GB was reserved for LV lvm_raid1 in VG vg00.
Note the differences from creating an LVM mirror: there is no mirror log specified, because a RAID 1 LV does not have an explicit mirror log - it is built into the LV. Second, --type raid1 is added; it was not needed with the LVM mirror before. Also note the similarities: -m 1 for a single mirror (-i 1 works too for RAID 1, unlike an LVM mirror), and --nosync to skip the initial sync.
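To see the mirror legs and the synchronization progress of a RAID 1 LV, lvs can be queried; a minimal sketch:
root # lvs -a -o name,copy_percent,devices vg00
With -a, the hidden lvm_raid1_rimage_* and lvm_raid1_rmeta_* sub-LVs that make up the RAID 1 LV are listed alongside the copy percentage.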
Converting an existing LV to RAID 1
It is possible to convert an existing LV to RAID 1:
root # pvcreate /dev/loop[0-1]
root # vgcreate vg00 /dev/loop[0-1]
root # lvcreate -n lvm_raid1 -l20%VG vg00
root # lvconvert -m 1 --type raid1 -b vg00/lvm_raid1
Conversion is similar to creating a mirror from an existing LV.

Removing a RAID 1 mirror
To remove a RAID 1 mirror, set the number of mirrors to 0:
root # lvconvert -m0 vg00/lvm_raid1
Same as an LVM mirror.

Failed RAID 1
Simulating a failure is the same as for an LVM mirror. If part of the RAID 1 is unavailable (usually because the disk containing the PV has failed), the VG will need to be brought up in degraded mode:
root # vgchange -ay --partial vg00
Unlike an LVM mirror, writing to a RAID 1 with a missing PV does NOT break the mirroring. If the failure is only transient and the missing PV returns, LVM will resync the mirror by copying over only the out-of-date segments instead of the entire LV.
To recover the RAID 1, the failed PV needs to be removed from the VG and a replacement one added (or, if the VG has a free PV, the repair done on a different PV); the mirror is then repaired with lvconvert and the old PV removed from the VG:
root # pvcreate /dev/loop1
root # vgextend vg00 /dev/loop1
root # lvconvert --repair -b vg00/lvm_raid1
root # vgreduce --removemissing vg00
Thin RAID 1
It is not (yet) possible to create a RAID 1 thin pool or thin volume directly. It is possible to create a RAID 1 thin pool by creating a normal RAID 1 LV and then converting it to a thin pool with lvconvert. Two LVs are required: one for the thin pool and one for the thin metadata; the conversion process will merge them into a single LV.
Warning LVM 2.02.98 or above is required for this to work properly. Prior versions are either not capable or will segfault and corrupt the VG. Also, conversion of a RAID 1 LV into a thin pool destroys all existing data in the mirror!
root # lvcreate -m 1 --type raid1 -l40%VG -n thin_pool vg00
root # lvcreate -m 1 --type raid1 -L4MB -n thin_meta vg00
root # lvconvert --thinpool vg00/thin_pool --poolmetadata vg00/thin_meta
LVM2 Stripeset with Parity (RAID 4 and RAID 5)
Note Stripesets with parity require at least 3 PVs.
RAID 0 is not fault-tolerant - if any of the PVs fail, the LV is unusable. By adding a parity stripe to RAID 0, the LV can still function with a single missing PV. A new PV can then be added to restore fault tolerance.
Stripesets with parity come in two flavors: RAID 4 and RAID 5. Under RAID 4, all the parity stripes are stored on the same PV. That PV can become a bottleneck because all writes hit it, and this gets worse the more PVs are in the array. With RAID 5, the parity data is distributed evenly across the PVs, so no single PV is a bottleneck. For that reason, RAID 4 is rare and considered obsolete/historical; in practice all stripesets with parity are RAID 5.

Creating a RAID 5 LV
root # pvcreate /dev/loop[0-2]
root # vgcreate vg00 /dev/loop[0-2]
root # lvcreate --type raid5 -l 20%VG -i 2 -n lvm_raid5 vg00
Like the RAID 0/stripe without parity, the -i option is used to specify the number of PVs to stripe over. However, only the data PVs are specified with -i - LVM adds the parity one automatically. Thus for a 3-PV RAID 5, it is -i 2 and not -i 3.
root # pvscan
PV /dev/loop0   VG vg00   lvm2 [2.00 GiB / 1.39 GiB free]
PV /dev/loop1   VG vg00   lvm2 [2.00 GiB / 1.39 GiB free]
PV /dev/loop2   VG vg00   lvm2 [2.00 GiB / 1.39 GiB free]
On each PV, about 600 MB was reserved for LV lvm_raid5 in VG vg00.

Recovering from a failed RAID 5
To simulate a failure:
root # vgchange -an vg00
root # losetup -d /dev/loop1
root # rm /var/lib/lvm_img/lvm1.img
root # dd if=/dev/null of=/var/lib/lvm_img/lvm1.img bs=1024 seek=2097152
root # losetup /dev/loop1 /var/lib/lvm_img/lvm1.img
The VG will need to be brought up in degraded mode:
root # vgchange -ay --partial vg00
The volume will work normally at this point; however, the array is degraded to RAID 0 until a replacement PV is added. Performance is unlikely to be affected while the array is degraded - while it does need to recompute the missing data via parity, this only requires a simple XOR of the parity block with the remaining data. The overhead is negligible compared to the disk I/O. To repair the RAID 5:
root # pvcreate /dev/loop1
root # vgextend vg00 /dev/loop1
root # lvconvert --repair vg00/lvm_raid5
root # vgreduce --removemissing vg00
It is possible to replace a still-working PV in a RAID 5 as well:
root # pvcreate /dev/loop3
root # vgextend vg00 /dev/loop3
root # lvconvert --replace /dev/loop1 vg00/lvm_raid5
root # vgreduce vg00 /dev/loop1
The same restrictions of stripe sets apply to stripe sets with parity as well: it is not possible to add parity striping to an existing volume, nor reshape the stripes with parity across more/fewer PVs, nor to convert to a different RAID level/linear volume. A stripe set with parity can be mirrored. It is possible to extend a stripe set with parity across additional PVs, but they must be added in multiples of the original stripe set with parity (which will effectively linearly append a new stripe set with parity), or --alloc anywhere must be specified (which can hurt performance). In the above example, 3 additional PVs would be required without --alloc anywhere.

Thin RAID 5 LV
It is not (yet) possible to create a stripe set with parity (RAID 5) thin pool or thin volume directly. It is possible to create a RAID 5 thin pool by creating a normal RAID 5 LV and then converting the LV into a thin pool with lvconvert. Two LVs are required: one for the thin pool and one for the thin metadata; the conversion process will merge them into a single LV.
Warning LVM 2.02.98 or above is required for this to work properly. Prior versions are either not capable or will segfault and corrupt the VG. Also, conversion of a RAID 5 LV into a thin pool destroys all existing data in the LV!
root # lvcreate --type raid5 -i 2 -l20%VG -n thin_pool vg00
root # lvcreate --type raid5 -i 2 -L4MB -n thin_meta vg00
root # lvconvert --thinpool vg00/thin_pool --poolmetadata vg00/thin_meta
LVM2 RAID 6
Note RAID 6 requires at least 5 PVs.
RAID 6 is similar to RAID 5; however, RAID 6 can survive up to TWO PV failures, thus offering more fault tolerance than RAID 5 at the expense of an extra PV.

Creating a RAID 6 LV
root # pvcreate /dev/loop[0-4]
root # vgcreate vg00 /dev/loop[0-4]
root # lvcreate --type raid6 -l 20%VG -i 3 -n lvm_raid6 vg00
Like RAID 5, the -i option is used to specify the number of PVs to stripe over, excluding the 2 PVs for parity. Thus for a 5-PV RAID 6, it is -i 3 and not -i 5.
root # pvscan
PV /dev/loop0   VG vg00   lvm2 [2.00 GiB / 1.32 GiB free]
PV /dev/loop1   VG vg00   lvm2 [2.00 GiB / 1.32 GiB free]
PV /dev/loop2   VG vg00   lvm2 [2.00 GiB / 1.32 GiB free]
PV /dev/loop3   VG vg00   lvm2 [2.00 GiB / 1.32 GiB free]
PV /dev/loop4   VG vg00   lvm2 [2.00 GiB / 1.32 GiB free]
On each PV, about 680 MB was reserved for LV lvm_raid6 in VG vg00.

Recovering from a failed RAID 6
Recovery for RAID 6 is the same as for RAID 5. A RAID 6 LV with a single failure degrades to RAID 5; a RAID 6 LV with 2 failures degrades to RAID 0. It is left as an exercise to the reader to simulate a 2-PV failure.
Unlike RAID 5, where the parity block is cheap to recompute compared to the disk I/O, this is only half true for RAID 6. RAID 6 uses 2 parity stripes: one stripe is computed the same way as in RAID 5 (simple XOR); the second parity stripe is much harder to compute.[3]
The same restrictions of stripe sets with parity apply to RAID 6 as well: it is not possible to convert an existing volume to RAID 6, nor reshape a RAID 6 across more/fewer PVs, nor to convert to a different RAID level/linear volume. A RAID 6 can be mirrored. It is possible to extend a RAID 6 across additional PVs, but they must be added in multiples of the original RAID 6 (which will effectively linearly append a new RAID 6), or --alloc anywhere must be specified (which can hurt performance). In the above example, 5 additional PVs would be required without --alloc anywhere.

Thin RAID 6 LV
It is not (yet) possible to create a RAID 6 thin pool or thin volume directly. It is possible to create a RAID 6 thin pool by creating a normal RAID 6 LV and then converting the LV into a thin pool with lvconvert. Two LVs are required: one for the thin pool and one for the thin metadata; the conversion process will merge them into a single LV.
Warning LVM 2.02.98 or above is required for this to work properly. Prior versions are either not capable or will segfault and corrupt the VG. Also, conversion of a RAID 6 LV into a thin pool destroys all existing data in the LV! (Note raid6 requires at least 3 data stripes, hence -i 3 below.)
root # lvcreate --type raid6 -i 3 -l20%VG -n thin_pool vg00
root # lvcreate --type raid6 -i 3 -L4MB -n thin_meta vg00
root # lvconvert --thinpool vg00/thin_pool --poolmetadata vg00/thin_meta
LVM RAID 10
Note RAID 10 requires at least 4 PVs. Also, the LVM syntax requires the number of PVs to be a multiple of the number of stripes times the number of mirrors, even though the RAID 10 format itself does not.
RAID 10 is a combination of RAID 0 and RAID 1. It is more powerful than RAID 0+RAID 1, as mirroring is done at the stripe level instead of the LV level, and therefore the layout need not be symmetric. A RAID 10 volume can tolerate at least a single missing PV, and possibly more.

Creating a RAID 10 LV
Note LVM currently limits RAID 10 to a single mirror.
root # pvcreate /dev/loop[0-3]
root # vgcreate vg00 /dev/loop[0-3]
root # lvcreate --type raid10 -l 1020 -i 2 -m 1 --nosync -n lvm_raid10 vg00
Using default stripesize 64.00 KiB
WARNING: New raid10 won't be synchronised. Don't read what you didn't write!
Both the -i AND -m options are specified: -i is the number of stripes and -m is the number of mirrors. 2 stripes and 1 mirror require 4 PVs. --nosync is an optimization to skip the initial copy.
root # pvscan
PV           VG    Fmt   Attr  PSize  PFree
/dev/loop0   vg00  lvm2  a-    2.00g  0
/dev/loop1   vg00  lvm2  a-    2.00g  0
/dev/loop2   vg00  lvm2  a-    2.00g  0
/dev/loop3   vg00  lvm2  a-    2.00g  0
On each PV, 2 GB was reserved for LV lvm_raid10 in VG vg00.

Recovering from a failed RAID 10
For a single failed PV, recovery for RAID 10 is the same as for RAID 5. In the example above, LVM chose to stripe over PVs loop0 and loop2, and mirror onto loop1 and loop3. The resulting array can tolerate the loss of any one PV, or of 2 PVs if they are on different mirrors (0/2, 0/3, 1/2, 1/3, but not 0/1 or 2/3).
The same restrictions of stripe sets apply to RAID 10 as well: it is not possible to convert an existing volume to RAID 10, nor reshape the RAID 10 across more/fewer PVs, nor to convert to a different RAID level/linear volume. It is possible to extend a RAID 10 across additional PVs, but they must be added in multiples of the original RAID 10 (which will effectively linearly append a new RAID 10), or --alloc anywhere must be specified (which can hurt performance). In the above example, 4 additional PVs would be required without --alloc anywhere.

Thin RAID 10
It is not (yet) possible to create a RAID 10 thin pool or thin volume directly. It is possible to create a RAID 10 thin pool by creating a normal RAID 10 LV and then converting the LV into a thin pool with lvconvert. Two LVs are required: one for the thin pool and one for the thin metadata; the conversion process will merge them into a single LV.
Warning Conversion of a RAID 10 LV into a thin pool destroys all existing data in the LV!
root # lvcreate -i 2 -m 1 --type raid10 -l 1012 -n thin_pool vg00
root # lvcreate -i 2 -m 1 --type raid10 -l 6 -n thin_meta vg00
root # lvconvert --thinpool vg00/thin_pool --poolmetadata vg00/thin_meta
Troubleshooting
LVM provides only mirroring/RAID and snapshots for some level of redundancy. However, there are certain situations where one might be able to restore a lost PV or LV.
vgcfgrestore utility
By default, on any change to an LVM PV, VG, or LV, LVM2 creates a backup file of the metadata in /etc/lvm/archive. These files can be used to recover from an accidental change (like deleting the wrong LV). LVM also keeps a backup copy of the most recent metadata in /etc/lvm/backup; these can be used to restore metadata to a replacement disk or to repair corrupted metadata.
To see which states of the VG are available to be restored (this is just partial output):
root # vgcfgrestore --list vg00
File:         /etc/lvm/archive/vg00_00042-302371184.vg
VG name:      vg00
Description:  Created *before* executing 'lvremove vg00/lvm_raid1'
Backup Time:  Sat Jul 13 01:41:32 201
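Backups are also taken automatically on every metadata change, but a manual backup can be forced at any time; a minimal sketch:
root # vgcfgbackup vg00
This writes the current metadata of vg00 to /etc/lvm/backup/vg00.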
Recovering an accidentally deleted LV
Suppose LV lvm_raid1 was accidentally removed from VG vg00. It is possible to recover it:
root # vgcfgrestore -f /etc/lvm/archive/vg00_00042-302371184.vg vg00
Important vgcfgrestore only restores LVM metadata, NOT the data inside the LV. However, pvremove, vgremove, and lvremove only wipe metadata, leaving any data intact. Note that if issue_discards is set in /etc/lvm/lvm.conf, then these commands ARE destructive to data.

Replacing a failed PV
In the above examples, when a disk containing a PV failed, an "add/remove" technique was used: a new PV was created on a new disk, the VG extended onto it, the LV repaired, and the old PV removed from the VG. However, it is possible to do a true "replace" and recreate the metadata on the new disk to be the same as on the old disk. Following the above example for a failed RAID 1:
root # vgdisplay --partial --verbose
  --- Physical volumes ---
  PV Name               /dev/loop0
  PV UUID               iLdp2U-GX3X-W2PY-aSlX-AVE9-7zVC-Cjr5VU
  PV Status             allocatable
  Total PE / Free PE    511 / 102

  PV Name               unknown device
  PV UUID               T7bUjc-PYoO-bMqI-53vh-uxOV-xHYv-0VejBY
  PV Status             allocatable
  Total PE / Free PE    511 / 102
The important entry here is the PV listed as "unknown device"; note its UUID.
root # pvcreate --uuid T7bUjc-PYoO-bMqI-53vh-uxOV-xHYv-0VejBY --restorefile /etc/lvm/backup/vg00 /dev/loop1
Couldn't find device with uuid T7bUjc-PYoO-bMqI-53vh-uxOV-xHYv-0VejBY.
Physical volume "/dev/loop1" successfully created
This recreates the PV metadata, but not the missing LV or VG data on the PV.
root # vgcfgrestore -f /etc/lvm/backup/vg00 vg00
Restored volume group vg00
This now reconstructs all the missing metadata on the PV, including the LV and VG data. However it doesn't restore the data, so the mirror is out of sync.
root # vgchange -ay vg00
device-mapper: reload ioctl on failed: Invalid argument
1 logical volume(s) in volume group "vg00" now active
root # lvchange --resync vg00/lvm_raid1
Do you really want to deactivate logical volume lvm_raid1 to resync it? [y/n]: y
This will resync the mirror. This works with RAID 4,5 and 6 as well.
Deactivate LV
You can deactivate an LV with the following command:
root # umount /dev/vg0/lvol1
root # lvchange -a n /dev/vg0/lvol1
You will not be able to mount the LV anywhere until it is reactivated:
root # lvchange -a y /dev/vg0/lvol1
External resources
• LVM2 at sourceware.org
• LVM at tldp.org
• LVM2 Wiki at redhat.com
ZFS Best Practices Guide
Contents
• 1 ZFS Administration Considerations
  • 1.1 ZFS Storage Pools Recommendations
    • 1.1.1 System/Memory/Swap Space
    • 1.1.2 Storage Pools
    • 1.1.3 Hybrid Storage Pools (or Pools with SSDs)
    • 1.1.4 Using Storage Arrays in Pools
    • 1.1.5 Additional Cautions for Storage Pools
      • 1.1.5.1 Simple or Striped Storage Pool Limitations
      • 1.1.5.2 Multiple Storage Pools on the Same System
      • 1.1.5.3 ZFS Root Pool Considerations
        • 1.1.5.3.1 ZFS Mirrored Root Pool Disk Replacement
  • 1.2 Storage Pool Performance Considerations
    • 1.2.1 General Storage Pool Performance Considerations
      • 1.2.1.1 Separate Log Devices
      • 1.2.1.2 Memory and Dynamic Reconfiguration Recommendations
      • 1.2.1.3 Separate Cache Devices
    • 1.2.2 RAIDZ Configuration Requirements and Recommendations
    • 1.2.3 Mirrored Configuration Recommendations
    • 1.2.4 Should I Configure a RAIDZ, RAIDZ-2, RAIDZ-3, or a Mirrored Storage Pool?
    • 1.2.5 RAIDZ Configuration Examples
  • 1.3 ZFS Migration Strategies
    • 1.3.1 Migrating to a ZFS Root File System
    • 1.3.2 Migrating a UFS Root File System With Zones to a ZFS Root File System
    • 1.3.3 Manually Migrating Non-System Data to a ZFS File System
    • 1.3.4 ZFS Interactions With Other Volume Management Products
• 2 General ZFS Administration Information
• 3 OpenSolaris/ZFS Considerations
  • 3.1 OpenSolaris/ZFS/Virtual Box Recommendations
• 4 Using ZFS for Application Servers Considerations
  • 4.1 ZFS NFS Server Practices
    • 4.1.1 ZFS NFS Server Benefits
  • 4.2 ZFS file service for SMB (CIFS) or SAMBA
  • 4.3 ZFS Home Directory Server Practices
    • 4.3.1 ZFS Home Directory Server Benefits
  • 4.4 Recommendations for Saving ZFS Data
    • 4.4.1 Using ZFS Snapshots
ZFS Administration Considerations

ZFS Storage Pools Recommendations
This section describes general recommendations for setting up ZFS storage pools.

System/Memory/Swap Space
• Run ZFS on a system that runs a 64-bit kernel.
• One GB or more of memory is recommended.
• Approximately 64 KB of memory is consumed per mounted ZFS file system. On systems with 1,000s of ZFS file systems, provision 1 GB of extra memory for every 10,000 mounted file systems, including snapshots. Be prepared for longer boot times on these systems as well.
• Size memory requirements to the actual system workload:
  • With a *known* application memory footprint, such as a database application, you might cap the ARC size so that the application will not need to reclaim its necessary memory from the ZFS cache.
  • Identify ZFS memory usage with memstat.
  • Consider dedup memory requirements.
  • For additional memory considerations, see Memory and Dynamic Reconfiguration Recommendations.

Storage Pools
• Set up one storage pool using whole disks per system, if possible.
• Keep vdevs belonging to one zpool of similar sizes; otherwise, as the pool fills up, new allocations will be forced to favor larger vdevs over smaller ones, and this will cause subsequent reads to come from a subset of the underlying devices, leading to lower performance.
• For production systems, use whole disks rather than slices for storage pools for the following reasons:
  • Allows ZFS to enable the disk's write cache for those disks that have write caches. If you are using a RAID array with a non-volatile write cache, then this is less of an issue and slices as vdevs should still gain the benefit of the array's write cache.
  • For JBOD attached storage, having an enabled disk cache allows some synchronous writes to be issued as multiple disk writes followed by a single cache flush, allowing the disk controller to optimize I/O scheduling. Separately, for systems that lack proper support for SATA NCQ or SCSI TCQ, having an enabled write cache allows the host to issue single I/O operations asynchronously from physical I/O.
  • The recovery process of replacing a failed disk is more complex when disks contain both ZFS and UFS file systems on slices.
  • ZFS pools (and underlying disks) that also contain UFS file systems on slices cannot be easily migrated to other systems by using the zpool import and export features.
  • In general, maintaining slices increases administration time and cost. Lower your administration costs by simplifying your storage pool configuration model.
  • Note: See the Additional Cautions section below regarding bug id 6844090 prior to Nevada, build 117.
• If you must use slices for ZFS storage pools, review the following:
  • Consider migrating the pools to whole disks after a transition period.
  • Use slices on small systems, such as laptops, where experts need access to both UFS and ZFS file systems.
  • However, take great care when reinstalling OSes in different slices so you don't accidentally clobber your ZFS pools.
  • Managing data on slices is more complex than managing data on whole disks.
• For production environments, configure ZFS so that it can repair data inconsistencies. Use ZFS redundancy, such as RAIDZ, RAIDZ-2, RAIDZ-3, or mirror, regardless of the RAID level implemented on the underlying storage device. With such redundancy, faults in the underlying storage device or its connections to the host can be discovered and repaired by ZFS.
• Avoid creating a RAIDZ, RAIDZ-2, RAIDZ-3, or a mirrored configuration with one logical device of 40+ devices. See the sections below for examples of redundant configurations.
• In a replicated pool configuration, leverage multiple controllers to reduce hardware failures and to improve performance. For example:
# zpool create tank mirror c1t0d0 c2t0d0
• Set up hot spares to speed up healing in the face of hardware failures. Spares are critical for high mean time to data loss (MTTDL) environments. One or two spares for a 40-disk pool is a commonly used configuration. For example: # zpool create tank mirror c1t0d0 c2t0d0 [mirror cxtydz ...] spare c1t1d0 c2t1d0
• Run zpool scrub on a regular basis to identify data integrity problems. If you have consumer-quality drives, consider a weekly scrubbing schedule. If you have datacenter-quality drives, consider a monthly scrubbing schedule. You should also run a scrub prior to replacing devices or temporarily reducing a pool's redundancy to ensure that all devices are currently operational. (A minimal scrub invocation is sketched at the end of this subsection.)
• ZFS works well with the following devices:
  • Solid-state storage devices that emulate disk drives (SSDs). You might wish to enable compression on storage pools that contain such devices because of their relatively high cost per byte.
  • iSCSI devices. For more information, see the ZFS Administration Guide and the following blog: x4500_solaris_zfs_iscsi_perfect
  • Storage-based protected LUNs (RAID-5 or mirrored LUNs from intelligent storage arrays). However, ZFS cannot heal corrupted blocks that are detected by ZFS checksums.

Hybrid Storage Pools (or Pools with SSDs)
• There are two possible ways to accelerate your ZFS pool through hybrid storage. By using "cache" devices, you may accelerate read operations. By using "log" devices, you may accelerate synchronous write operations.
• For more information about cache devices, see Separate Cache Devices.
• For more information about log devices, see Separate Log Devices.

Using Storage Arrays in Pools
• With MPxIO
  • Running the Solaris 10 5/09 release is recommended.
  • Enable MPxIO by using the stmsboot command. The paths will change (under /scsi_vhci), but ZFS can handle this change.
• ZFS and Array Replication Interactions
  • ZFS does not support the ability for a Solaris host to have both the ZFS storage pool contained on the Master Volume and a controller-based (or host-based) snapshot of said ZFS storage pool accessible on the Shadow Volume. This Shadow Volume can be accessed on another Solaris host, if the storage array supports multiple hosts, or the snapshot Shadow Volume is used as the source of remote replication, where the ZFS storage pool can then be accessed on the secondary node.
  • If the SNDR unit of replication is a ZFS storage pool (replicated as an SNDR I/O consistency group), all ZFS storage pool and file system properties, such as compression, are replicated too.
  • The TrueCopy snapshot feature does not retain write-order consistency across all volumes in a single ZFS storage pool. To address this issue, within TrueCopy, you must create a single I/O consistency group for all volumes in a "named" ZFS storage pool. The other solution is to do the following:
# zpool export <entire ZFS storage pool>
# TrueCopy snapshot
# zpool import <entire ZFS storage pool>
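As referenced in the scrub recommendation above, a minimal sketch of a manual scrub and a status check (the pool name "tank" is simply the example pool used throughout this guide):
# zpool scrub tank
# zpool status -v tank
zpool status reports scrub progress and any checksum errors that were found.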
Additional Cautions for Storage Pools
Review the following cautions before building your ZFS storage pool:
• Do not create a storage pool that contains components from another storage pool. Deadlocks can occur in this unsupported configuration.
• A pool created with a single slice or single disk has no redundancy and is at risk for data loss. A pool created with multiple slices but no redundancy is also at risk for data loss. A pool created with multiple slices across disks is harder to manage than a pool created with whole disks.
• A pool that is not created with ZFS redundancy (RAIDZ or mirror) can only report data inconsistencies. It cannot repair data inconsistencies. A pool created without ZFS redundancy is harder to manage because you cannot replace or detach disks in a non-redundant ZFS configuration.
• Although a pool that is created with ZFS redundancy can help reduce down time due to hardware failures, it is not immune to hardware failures, power failures, or disconnected cables. Make sure you back up your data on a regular basis. Performing routine backups of pool data on non-enterprise grade hardware is important.
• A pool cannot be shared across systems. ZFS is not a cluster file system.
• The size of the replacement disk, measured by usable sectors, must be the same as or greater than that of the disk being replaced. This can be confusing when whole disks are used, because different models of disks may provide a different number of usable sectors. For example, if a pool was created with a "500 GB" drive and you need to replace it with another "500 GB" drive, then you may not be able to do so if the drives are not of the same make, model, and firmware revision.
• Today, pool capacity cannot be reduced in size. CR 4852783 addresses the ability to reduce pool capacity.
• A disk that is part of a pool cannot be relabeled or repartitioned.
• Consider that a BIOS or firmware upgrade might inadvertently relabel a disk, so carefully review the upgrade changes that might impact the disks of your pool before the upgrade.
• Other hardware upgrades or changes might change the device paths of the devices in your pool. In general, a ZFS storage pool on Sun hardware can handle these changes, but review your hardware manual to see if the pool should be imported, or possibly other preparatory steps taken, before upgrading the hardware.

Simple or Striped Storage Pool Limitations
Simple or striped storage pools have limitations that should be considered. Expansion of space is possible by two methods:
• Adding another disk to expand the stripe. This method should also increase the performance of the storage pool because more devices can be utilized concurrently. Be aware that for current ZFS implementations, once vdevs are added, they cannot be removed.
# zpool add tank c2t2d0
• Replacing an existing vdev with a larger vdev. For example:
# zpool replace tank c0t2d0 c2t2d0

• ZFS can tolerate many types of device failures.
• For simple storage pools, metadata is dual redundant, but data is not redundant.
• You can set the redundancy level for data using the ZFS copies property.
• If a block cannot be properly read and there is no redundancy, ZFS will tell you which files are affected.
• Replacing a failing disk for a simple storage pool requires access to both the old and new device in order to put the old data onto the new device.
# zpool replace tank c0t2d0          ### wrong: cannot recreate data because there is no redundancy
# zpool replace tank c0t2d0 c2t2d0   ### ok
Multiple Storage Pools on the Same System
• The pooling of resources into one ZFS storage pool allows different file systems to get the benefit from all resources at different times. This strategy can greatly increase the performance seen by any one file system.
• If some workloads require more predictable performance characteristics, then you might consider separating workloads into different pools.
• For instance, Oracle log writer performance is critically dependent on I/O latency, and we expect best performance to be achieved by keeping that load on a separate small pool that has the lowest possible latency.

ZFS Root Pool Considerations
• A root pool must be created with disk slices rather than whole disks. Allocate the entire disk capacity for the root pool to slice 0, for example, rather than partition the disk that is used for booting for many different uses. A root pool must be labeled with a VTOC (SMI) label rather than an EFI label.
• A disk that contains the root pool or any pool cannot be repartitioned while the pool is active. If the entire disk's capacity is allocated to the root pool, then it is less likely to need more disk space.
• Consider keeping the root pool separate from pool(s) that are used for data. Several reasons exist for this strategy:
  • Only mirrored pools and pools with one disk are supported. No RAIDZ or unreplicated pools with more than one disk are supported.
  • You cannot add additional disks to create multiple mirrored vdevs, but you can expand a mirrored vdev by using the zpool attach command.
  • Data pools can be architecture-neutral. It might make sense to move a data pool between SPARC and Intel. Root pools are pretty much tied to a particular architecture.
  • In general, it's a good idea to separate the "personality" of a system from its data. Then, you can change one without having to change the other.
• A root pool cannot be exported on the local system. For recovery purposes, you can import a root pool when booted from the network or alternate media.
• Consider using descendent datasets in the root pool for non-system related data because you cannot rename or destroy the top-level pool dataset. Using the top-level pool dataset as a container for descendent datasets provides more flexibility if you need to snapshot, clone, or promote datasets in the root pool or a top-level dataset.
• Keep all root pool components, such as the /usr and /var directories, in the root pool.
• Create a mirrored root pool to reduce downtime due to hardware failures.
• For more information about setting up a ZFS root pool, see ZFS Root Pool Recommendations.
• For more information about migrating to a ZFS root file system, see Migrating to a ZFS Root File System.
ZFS Mirrored Root Pool Disk Replacement
If a disk in a mirrored root pool fails, you can either replace the disk or attach a replacement disk and then detach the failed disk. The basic steps are like this:
• Identify the disk to be replaced by using the zpool status command.
• You can do a live disk replacement if the system supports hot-plugging. On some systems, you might need to offline and unconfigure the failed disk first. For example:
# zpool offline rpool c1t0d0s0
# cfgadm -c unconfigure c1::dsk/c1t0d0
• Physically replace the disk. • Reconfigure the disk. This step might not be necessary on some systems. # cfgadm -c configure c1::dsk/c1t0d0
• Confirm that the replacement disk has an SMI label and a slice 0 to match the existing root pool configuration. • Let ZFS know that the disk is replaced. # zpool replace rpool c1t0d0s0
• Bring the disk online. # zpool online rpool c1t0d0s0
• Install the bootblocks after the disk is resilvered (a sketch of the bootblock commands is shown after this list).
• Confirm that the replacement disk is bootable by booting the system from the replacement disk.
• For information about formatting a disk that is intended for the root pool and installing boot blocks, see [1].
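To complete the bootblock step above, the platform-specific boot loader must be written to the new disk once resilvering finishes. A minimal sketch, assuming a root pool on slice 0 (installboot applies to SPARC, installgrub to x86):
# installboot -F zfs /usr/platform/`uname -i`/lib/fs/zfs/bootblk /dev/rdsk/c1t0d0s0
or, on an x86 system:
# installgrub /boot/grub/stage1 /boot/grub/stage2 /dev/rdsk/c1t0d0s0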
Storage Pool Performance Considerations

General Storage Pool Performance Considerations
• For better performance, use individual disks or at least LUNs made up of just a few disks. By providing ZFS with more visibility into the LUN setup, ZFS is able to make better I/O scheduling decisions.
• Depending on workloads, the current ZFS implementation can, at times, cause much more I/O to be requested than other page-based file systems. If the throughput flowing toward the storage, as observed by iostat, nears the capacity of the channel linking the storage and the host, tuning down the zfs recordsize should improve performance. This tuning is dynamic, but only impacts new file creations. Existing files keep their old recordsize.
• Tuning recordsize does not help sequential type loads. Tuning recordsize is aimed at improving workloads that intensively manage large files using small random reads and writes.
• Keep pool space under 80% utilization to maintain pool performance. Currently, pool performance can degrade when a pool is very full and file systems are updated frequently, such as on a busy mail server. Full pools might cause a performance penalty, but no other issues. If the primary workload is immutable files (write once, never remove), then you can keep a pool in the 95-96% utilization range. Keep in mind that even with mostly static content in the 95-96% range, write, read, and resilvering performance might suffer.
• For better performance, do not build UFS components on top of ZFS components. For ZFS performance testing, make sure you are not running UFS on top of ZFS components. See also ZFS for Databases.

Separate Log Devices
The ZFS intent log (ZIL) is provided to satisfy POSIX requirements for synchronous writes. By default, the ZIL is allocated from blocks within the main storage pool. Better performance might be possible by using dedicated nonvolatile log devices such as NVRAM, SSD drives, or even a dedicated spindle disk.
• If your server hosts a database, virtual machines, or iSCSI targets, acts as an NFS server with clients mounting in "sync" mode, or in any way has heavy synchronous write requests, then you may benefit from using a dedicated log device for the ZIL.
• The benefit of a dedicated ZIL depends on your usage. If you always do async writes, it won't matter at all, because the log device can only accelerate sync writes to be more similar to async writes. If you do many small sync writes, you will benefit a lot. If you do large continuous sync writes, you may see some benefit, but it's not clear exactly how significant.
• If you add a log device to your storage pool, it cannot be removed prior to zpool version 19. You can find out your pool version by running the "zpool upgrade" command. The Solaris 10 9/10 release includes pool version 22, which allows you to remove a log device.
• With two or more nonvolatile storage devices, you can create a mirrored set of log devices.
# zpool add tank log mirror c0t4d0 c0t6d0
• In a mirrored log configuration, you can always detach (unmirror) devices, but as mentioned above, you cannot remove your last unmirrored log device prior to pool version 19. • Log devices can be unreplicated or mirrored, but RAIDZ is not supported for log devices. • Mirroring the log device is recommended. Prior to pool version 19, if you have an unmirrored log device that fails, your whole pool might be lost or you might lose several seconds of unplayed writes, depending on the failure scenario. • In current releases, if an unmirrored log device fails during operation, the system reverts to the default behavior, using blocks from the main storage pool for the ZIL, just as if the log device had been gracefully removed via the "zpool remove" command. • The minimum size of a log device is the same as the minimum size of device in pool, which is 64 MB. The amount of in-play data that might be stored on a log device is relatively small. Log blocks are freed when the log transaction (system call) is committed. • The maximum size of a log device should be approximately 1/2 the size of physical memory
because that is the maximum amount of potential in-play data that can be stored. For example, if a system has 16 GB of physical memory, consider a maximum log device size of 8 GB.
• For a target throughput of X MB/sec, and given that ZFS pushes transaction groups every 5 seconds (and has 2 outstanding), we also expect the ZIL to not grow beyond X MB/sec * 10 sec. So to service 100 MB/sec of synchronous writes, 1 GB of log device should be sufficient.

Memory and Dynamic Reconfiguration Recommendations
The ZFS adaptive replacement cache (ARC) tries to use most of a system's available memory to cache file system data. The default is to use all of physical memory except 1 GB. As memory pressure increases, the ARC relinquishes memory.
Consider limiting the maximum ARC memory footprint in the following situations:
• When a known amount of memory is always required by an application. Databases often fall into this category.
• On platforms that support dynamic reconfiguration of memory boards, to prevent ZFS from growing the kernel cage onto all boards.
• A system that requires large memory pages might also benefit from limiting the ZFS cache, which tends to break down large pages into base pages.
• Finally, if the system is running another non-ZFS file system in addition to ZFS, it is advisable to leave some free memory to host that other file system's caches.
The trade-off to consider is that limiting this memory footprint means that the ARC is unable to cache as much file system data, and this limit could impact performance. In general, limiting the ARC is wasteful if the memory that now goes unused by ZFS is also unused by other system components. Note that non-ZFS file systems typically manage to cache data in what is nevertheless reported as free memory by the system. For information about tuning the ARC, see the following section:
http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide#Limiting_the_ARC_Cache

Separate Cache Devices
In addition to the in-memory ARC cache, ZFS employs a second-level, on-disk cache, the L2ARC. In a typical configuration, there would be a large pool of spindle disks and a smaller number of SSDs or other high-performance devices dedicated to cache. Using L2ARC cache devices may accelerate read operations, especially when some data is read repeatedly and cannot fit in the system memory ARC cache. This is particularly likely when active processes are consuming the system memory, and in high-performance machines which may already be maxed out for RAM. For example, if a machine maxes out at 128 GB of RAM, requires 120 GB of RAM for active processes, and frequently needs to read some data files from disk, then performance could likely be increased by adding a few hundred GB of SSD cache devices.
• You can add cache devices with the "zpool add" command.
zpool add tank cache c0t5d0 c0t6d0
• It is not possible to mirror or use raidz on cache devices, nor is it necessary. If a cache device fails, the data will simply be read from the main pool storage devices instead.
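Unlike log devices on older pool versions, cache devices can also be removed from the pool at any time; a minimal sketch using one of the devices added above:
# zpool remove tank c0t5d0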
RAIDZ Configuration Requirements and Recommendations
A RAIDZ configuration with N disks of size X with P parity disks can hold approximately (N-P)*X bytes and can withstand P device(s) failing before data integrity is compromised.
• Start a single-parity RAIDZ (raidz) configuration at 3 disks (2+1)
• Start a double-parity RAIDZ (raidz2) configuration at 6 disks (4+2)
• Start a triple-parity RAIDZ (raidz3) configuration at 9 disks (6+3)
• (N+P) with P = 1 (raidz), 2 (raidz2), or 3 (raidz3) and N equals 2, 4, or 6
• The recommended number of disks per group is between 3 and 9. If you have more disks, use multiple groups.
For a RAIDZ configuration example, see x4500 with RAID-Z2 and Recipe for a ZFS RAID-Z Storage Pool on Sun Fire X4540.

Mirrored Configuration Recommendations
• No currently reachable limits exist on the number of devices.
• On a Sun Fire X4500 server, do not create a single vdev with 48 devices. Consider creating 24 2-device mirrors. This configuration reduces the disk capacity by 1/2, but up to 24 disks, or 1 disk in each mirror, could be lost without a failure.
• If you need better data protection, a 3-way mirror has a significantly greater MTTDL than a 2-way mirror. Going to a 4-way (or greater) mirror may offer only marginal improvements in data protection. Concentrate on other methods of data protection if a 3-way mirror is insufficient.
For mirrored ZFS configuration examples, see x4500 with mirror and x4200 with mirror.

Should I Configure a RAIDZ, RAIDZ-2, RAIDZ-3, or a Mirrored Storage Pool?
A general consideration is whether your goal is to maximize disk space or maximize performance.
• A RAIDZ configuration maximizes disk space and generally performs well when data is written and read in large chunks (128K or more).
• A RAIDZ-2 configuration offers better data availability and performs similarly to RAIDZ. RAIDZ-2 has significantly better mean time to data loss (MTTDL) than either RAIDZ or 2-way mirrors.
• A RAIDZ-3 configuration maximizes disk space and offers excellent availability because it can withstand 3 disk failures.
• A mirrored configuration consumes more disk space but generally performs better with small random reads.
• If your I/Os are large, sequential, or write-mostly, then ZFS's I/O scheduler aggregates them in such a way that you'll get very efficient use of the disks regardless of the data replication model. For better performance, a mirrored configuration is strongly favored over a RAIDZ configuration, particularly for large, uncacheable, random read loads.
For more information about RAIDZ considerations, see When to (and not to) use RAID-Z.

RAIDZ Configuration Examples
For a RAIDZ configuration on a Thumper, mirror c3t0 and c3t4 (disks 0 and 1) as your root pool, with the remaining 46 disks available for user data. The following RAIDZ-2 configurations illustrate how to set up the remaining 46 disks:
* 5x(7+2), 1 hot spare, 17.5 TB
* 4x(9+2), 2 hot spares, 18.0 TB
* 6x(5+2), 4 hot spares, 15.0 TB
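As a concrete illustration of the 4x(9+2) layout, a minimal sketch of creating the pool with its first 11-disk raidz2 group plus the hot spares, and appending a second group (the c#t#d# device names are purely illustrative; the remaining groups would be added the same way):
# zpool create tank raidz2 c0t1d0 c1t1d0 c2t1d0 c3t1d0 c4t1d0 c5t1d0 c6t1d0 c7t1d0 c8t1d0 c0t2d0 c1t2d0 spare c2t2d0 c3t2d0
# zpool add tank raidz2 c4t2d0 c5t2d0 c6t2d0 c7t2d0 c8t2d0 c0t3d0 c1t3d0 c2t3d0 c3t3d0 c4t3d0 c5t3d0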
ZFS Migration Strategies

Migrating to a ZFS Root File System
In the SXCE (Nevada) build 90 release or in the Solaris 10 10/08 release, you can migrate your UFS root file system to a ZFS root file system by upgrading to build 90 or the Solaris 10 10/08 release and then using the Solaris Live Upgrade feature to migrate to a ZFS root file system.
You can create a mirrored ZFS root pool either during an initial installation or a JumpStart installation, or by using the zpool attach command to create a mirrored ZFS root pool after installation.
For information about installing a ZFS root file system or migrating a UFS root file system to a ZFS root file system, see Installing and Booting a ZFS Root File System.

Migrating a UFS Root File System With Zones to a ZFS Root File System
Keep the following points in mind when using ZFS datasets on a Solaris system with zones installed:
• In the Solaris 10 10/08 release, you can create a zone root path on a ZFS file system. However, supported configurations are limited when migrating a system with zones to a ZFS root file system by using the Solaris Live Upgrade feature. Review the following supported configurations before you begin migrating a system with zones:
  • Migrate System With UFS with Zones to ZFS Root
  • Configure ZFS Root With Zone Roots on ZFS
  • Upgrade or Patch ZFS Root with Zone Roots on ZFS
• You can use ZFS as a zone root path in the Solaris Express releases, but keep in mind that patching or upgrading these zones is not supported.
• You cannot associate ZFS snapshots with zones at this time.
For more information about using ZFS with zones, see the Zones FAQ.

Manually Migrating Non-System Data to a ZFS File System
Consider the following practices when migrating non-system-related data from UFS file systems to ZFS file systems:
• Unshare the existing UFS file systems
• Unmount the existing UFS file systems from the previous mount points
• Mount the UFS file systems to temporary unshared mount points
• Migrate the UFS data with parallel instances of rsync running to the new ZFS file systems
• Set the mount points and the sharenfs properties on the new ZFS file systems
ZFS Interactions With Other Volume Management Products
• ZFS works best without any additional volume management software.
• If you must use ZFS with SVM because you need an extra level of volume management, ZFS expects that 1 to 4 MB of consecutive logical blocks map to consecutive physical blocks. Keeping to this rule allows ZFS to drive the volume with efficiency.
• You can construct logical devices for ZFS by using volumes presented by software-based volume managers, such as Solaris Volume Manager (SVM) or Veritas Volume Manager (VxVM). However, these configurations are not recommended. While ZFS functions properly on such devices, less-than-optimal performance might be the result.
General ZFS Administration Information
• ZFS administration is performed while the data is online.
• For information about setting up pools, see ZFS Storage Pools Recommendations.
• ZFS file systems are mounted automatically when created. ZFS file systems do not have to be mounted by modifying the /etc/vfstab file.
• Currently, ZFS doesn't provide a comprehensive backup/restore utility like the ufsdump and ufsrestore commands. However, you can use the zfs send and zfs receive commands to capture ZFS data streams. You can also use the ufsrestore command to restore UFS data into a ZFS file system.
• For most ZFS administration tasks, see the following references:
• zfs.1m and zpool.1m provide basic syntax and examples
• the ZFS Administration Guide provides more detailed syntax and examples
• zfs-discuss: join this OpenSolaris discussion list to ask ZFS questions
• You can use "iostat -En" to display error information about devices that are part of a ZFS storage pool.
• A dataset is a generic term for a ZFS component, such as a file system, snapshot, clone, or volume.
• When you create a ZFS storage pool, a ZFS file system is automatically created. For example, the following syntax creates a pool named tank and a top-level dataset named tank that is mounted at /tank:
# zpool create tank mirror c1t0d0 c2t0d0
# zfs list tank
NAME   USED  AVAIL  REFER  MOUNTPOINT
tank    72K  8.24G    21K  /tank
• Consider using the top-level dataset as a container for other file systems. The top-level dataset cannot be destroyed or renamed. You can export and import the pool with a new name to change the name of the top-level dataset, but this operation would also change the name of the pool. If you want to snapshot, clone, or promote file system data, then create separate file systems in your pool. File systems provide points of administration that allow you to manage different sets of data within the same pool.
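For example, a minimal sketch of keeping the top-level dataset as a container and creating separate file systems underneath it (the dataset names are hypothetical):

# zfs create tank/projects
# zfs create tank/projects/alpha
# zfs create tank/projects/beta
# zfs snapshot tank/projects/alpha@before-cleanup

Each child file system can then be snapshotted, cloned, or given its own property settings without touching the rest of the pool.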
[edit] OpenSolaris/ZFS Considerations • Most of the general pool and storage recommendations apply to using ZFS in the OpenSolaris release.
OpenSolaris/ZFS/Virtual Box Recommendations
• By default, Virtual Box is configured to ignore cache flush commands from the underlying storage. This means that in the event of a system crash or a hardware failure, data could be lost.
• To enable cache flushing on Virtual Box, issue the following command:
VBoxManage setextradata <VM_NAME> "VBoxInternal/Devices/<type>/0/LUN#<n>/Config/IgnoreFlush" 0
where:
• <VM_NAME> is the name of the virtual machine
• <type> is the controller type, either piix3ide (if you're using the normal IDE virtual controller) or ahci (if you're using a SATA controller)
• <n> is the disk number
• For IDE disks, primary master is 0, primary slave is 1, secondary master is 2, secondary slave is 3.
• For SATA disks, it's simply the SATA disk number.
• Additional notes:
• This setting can only be enabled for (virtual) SATA/IDE disks. It cannot be enabled for CD/DVD drives. iSCSI behavior is unknown at this time.
• You can only enable this setting for disks that are attached to the particular virtual machine. If you enable it for any other disks (LUN#), it will fail to boot with a rather cryptic error message. This means that if you detach the disk, you have to disable this setting.
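For instance, a concrete invocation for a hypothetical VM named "opensolaris-zfs" whose data disk is the primary slave on the IDE controller (type piix3ide, LUN#1) would look like this:

VBoxManage setextradata "opensolaris-zfs" "VBoxInternal/Devices/piix3ide/0/LUN#0/Config/IgnoreFlush" 0
VBoxManage setextradata "opensolaris-zfs" "VBoxInternal/Devices/piix3ide/0/LUN#1/Config/IgnoreFlush" 0

The second line covers the data disk itself; the first is only needed if the boot disk (primary master, LUN#0) should also honor cache flushes.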
Using ZFS for Application Servers Considerations

ZFS NFS Server Practices
Consider the following lessons learned from a UFS to ZFS migration experience:
• Existing user home directories were renamed but not unmounted. NFS continued to serve the older home directories when the new home directories were also shared.
• Do not mix UFS directories and ZFS file systems in the same file system hierarchy, because this model is confusing to administer and maintain.
• Do not mix NFS legacy shared ZFS file systems and ZFS NFS shared file systems, because this model is difficult to maintain. Go with ZFS NFS shared file systems.
• ZFS file systems are shared with the sharenfs file system property and the zfs share command. For example:
# zfs set sharenfs=on export/home
• This syntax shares the file system automatically. If ZFS file systems need to be shared, use the zfs share command. For example:
# zfs share export/home
For information about ZFS-over-NFS performance, see ZFS and NFS Server Performance.

ZFS NFS Server Benefits
• NFSv4-style ACLs are available with ZFS file systems, and ACL information is automatically available over NFS.
• ZFS snapshots are available over NFSv4, so NFS-mounted home directories can access their .snapshot directories.
ZFS file service for SMB (CIFS) or SAMBA
Many of the best practices for NFS also apply to CIFS or SAMBA servers.
• ZFS file systems can be shared using the SMB service, for those OS releases which support it:
# zfs set sharesmb=on export/home
• If native SMB support is not available, then SAMBA offers a reasonable solution.
ZFS Home Directory Server Practices
Consider the following practices when planning your ZFS home directories (a command sketch follows the list):
• Set up one file system per user
• Use quotas and reservations to manage user disk space
• Use snapshots to back up users' home directories
• Beware that mounting thousands of file systems will impact your boot time
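A minimal sketch of those practices for one hypothetical user, assuming a pool named tank with home directories under tank/home; the quota, reservation, and snapshot names are illustrative only:

# zfs create tank/home/alice
# zfs set quota=10G tank/home/alice
# zfs set reservation=2G tank/home/alice
# zfs snapshot tank/home/alice@backup-monday

The quota caps how much space the user can consume, while the reservation guarantees a minimum amount of pool space for that file system.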
Consider the following practices when migrating data from UFS file systems to ZFS file systems:
• Unshare the existing UFS file systems
• Unmount the existing UFS file systems from the previous mount points
• Mount the UFS file systems to temporary unshared mount points
• Migrate the UFS data with parallel instances of rsync running to the new ZFS file systems
• Set the mount points and the sharenfs properties on the new ZFS file systems
See the ZFS/NFS Server Practices section for additional tips on sharing ZFS home directories over NFS.

ZFS Home Directory Server Benefits
• ZFS can handle many small files and many users because of its high-capacity architecture.
• Additional space for user home directories is easily expanded by adding more devices to the storage pool.
• ZFS quotas are an easy way to manage home directory space.
• Use ZFS property inheritance to apply properties to many file systems (see the sketch below).
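To illustrate property inheritance, a minimal sketch with hypothetical dataset names: setting a property once on the parent dataset lets all of the child home directories pick it up automatically.

# zfs set compression=on tank/home
# zfs get -r compression tank/home

The -r option lists the property for every descendant file system; the SOURCE column for the children shows that the value is inherited from tank/home rather than set locally.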
Recommendations for Saving ZFS Data

Using ZFS Snapshots
• Using ZFS snapshots is a quick and easy way to protect files against accidental changes and deletion. In nearly all cases, the availability of snapshots allows users to restore their own files, without administrator assistance, and without the need to access removable storage, such as tapes. For example:
$ rm reallyimportantfile /* D'oh!
$ cd .zfs/snapshot
$ cd .auto...
$ cp reallyimportantfile $HOME
$ ls $HOME/reallyimportantfile
/home/cindys/reallyimportantfile
• The following syntax creates recursive snapshots of all home directories in the tank/home file system. Then, you can use the zfs send -R command to create a recursive stream of the home directory snapshot, which also includes the individual file system property settings.
# zfs snapshot -r tank/home@monday
# zfs send -R tank/home@monday | ssh remote-system zfs receive -dvu pool
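Building on that example, a hedged sketch of a follow-up incremental transfer (the tuesday snapshot name is hypothetical): zfs send -R -i sends only the blocks that changed between the two recursive snapshots, so the stream is much smaller than a full send.

# zfs snapshot -r tank/home@tuesday
# zfs send -R -i monday tank/home@tuesday | ssh remote-system zfs receive -dvu pool

The receiving side must already hold the monday snapshots for the incremental stream to apply.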
• You can create rolling snapshots and zfs-auto-snapshots to help manage snapshot copies. For more information, see the Rolling Snapshots Made Easy blog or the ZFS Automatic Snapshot blog by Tim Foster.
• ZFS snapshots are accessible in the .zfs directories of the snapshotted file system. Configure your backup product to skip these directories.

Storing ZFS Snapshot Streams (zfs send/receive)
• You can use the zfs send and zfs receive commands to archive a snapshot stream, but saving ZFS send streams is different from traditional backups, for the following reasons:
• You cannot select individual files or directories to receive, because the zfs receive operation is an all-or-nothing event. You can get all of a file system snapshot or none of it.
• If you store a ZFS send stream on a file or on tape, and that file becomes corrupted, then it will not be possible to receive it, and none of the data will be recoverable. However, Nevada build 125 adds the zstreamdump(1m) command to verify a ZFS snapshot send stream. See also RFE 6736794.
• You cannot restore individual files or directories from a ZFS send stream, although you can copy files or directories from a snapshot. This limitation means that enterprise backup solutions and other archive tools, such as cp, tar, rsync, pax, and cpio, are more appropriate for tape backup/restore because you can restore individual files or directories.
• You cannot exclude directories or files from a ZFS send stream.
• You can create an incremental snapshot stream (see the "zfs send -i" syntax). This is generally much faster than incremental backups performed by file-level tools, such as tar and rsync, because ZFS already knows which blocks have changed on disk, and it can simply read those blocks as large sequential disk read operations, to the extent physically possible. Archive tools, such as tar and rsync, must walk the file system, checking every file and directory for modifications, in order to determine which files have changed and need to be included in the incremental backup.
• Another advantage of using a ZFS send stream over a file-based backup tool is that you can send and receive a file system's property settings.
• If you have random-access storage (not tape) to receive onto, and you don't need to exclude anything, then zfs send and receive can be used to store data, provided that you pipe the zfs send command directly into the zfs receive command.

Using ZFS With AVS
The Sun StorageTek Availability Suite (AVS) Remote Mirror Copy and Point-in-Time Copy services, previously known as SNDR (Sun Network Data Replicator) and II (Instant Image), are similar to the Veritas VVR (volume replicator) and FlashSnap (point-in-time copy) products and are currently available in the Solaris Express release. SNDR differs from the ZFS send and recv features, which are time-fixed replication features. For example, you can take a point-in-time snapshot, replicate it, or replicate it based on a differential of a prior snapshot. The combination of the AVS II and SNDR features also allows you to perform time-fixed replication. The other modes of the AVS SNDR replication feature allow you to obtain CDP (continuous data protection). ZFS doesn't currently have this feature. For more information about AVS, see the OpenSolaris AVS Project. View the AVS/ZFS Demos here.
Using ZFS With Enterprise Backup Solutions
• The robustness of ZFS does not protect you from all failures. Maintain copies of your ZFS data, either by taking regular ZFS snapshots and saving them to tape or other offline storage, or by using an enterprise backup solution.
• The Sun StorEdge Enterprise Backup Software (Legato Networker 7.3.2) product can fully back up and restore ZFS files, including ACLs.
• Symantec's NetBackup 6.5 product can fully back up and restore ZFS files, including ACLs. Release 6.5.2A offers some fixes which make backup of ZFS file systems easier.
• IBM's TSM product can be used to back up ZFS files. However, supportability is not absolutely clear. Based on the TSM documentation on IBM's website, ZFS with ACLs is supported with TSM client 5.3. It has been verified (internally to Sun) to work correctly with 5.4.1.2.
tsm> q file /opt/SUNWexplo/output
  #  Last Incr Date      Type  File Space Name
---  ------------------  ----  ---------------
  1  08/13/07 09:18:03   ZFS   /opt/SUNWexplo/output
                          ^
                          |__ correct filesystem type
• For the latest information about enterprise-backup solution issues with ZFS, see the ZFS FAQ.

Using ZFS With Open Source Backup Solutions
• ZFS snapshots are accessible in the .zfs directories of the snapshotted file system. Configure your backup product to skip these directories.
• Amanda - Joe Little blogs about how he backs up ZFS file systems to Amazon's S3 using Amanda. Integration of ZFS snapshots with MySQL and Amanda Enterprise 2.6 Software can also take advantage of ZFS snapshot capabilities.
• Bacula - Tips for backing up ZFS data are as follows:
• Make sure you create multiple jobs per host to allow multiple backups to occur in parallel rather than just one job per host. If you have several file systems and/or pools, running multiple jobs speeds up the backup process.
• Create a non-global zone to do backups. Having multiple zones for backups means that you can delegate control of a backup server to a customer so they can perform restores on their own and only have access to their own data.
• Use a large RAIDZ pool (20 TB), instead of tapes, to store all the backups, which allows quick backups and restores.
• Use opencsw (from opencsw.org) and/or blastwave (from blastwave.org) for packages, which makes Bacula very easy to install and maintain. If using opencsw, run "pkg-get -i bacula" and it installs all the prerequisites. If using blastwave, run "pkgutil -i bacula" and it installs all the prerequisites. On the clients, install opencsw and bacula_client in the global zones, and back up the local zones from the global zone.
• On the server, the "director" configuration file, /opt/csw/etc/bacula/bacula-dir.conf, contains the information about which clients are backed up. You can split the configuration into sections, such as "core" for the base system, "raid" for the local larger pools, and "zones" for the zones. If you have several zones, break "zones" up into a per-zone job, which is easy to do.
• For more information, see Bacula Configuration Example.
• The OpenSolaris ndmp service project is proposed to take advantage of ZFS features, such as snapshots, to improve the backup and restore process. With this addition, an enterprise backup solution could take advantage of snapshots to improve data protection for large storage repositories. For more information, see ndmp service.
ZFS and Database Recommendations
The information in this section has been consolidated into a separate ZFS for Databases section.
ZFS and Complex Storage Considerations
• Certain storage subsystems stage data through persistent memory devices, such as NVRAM on a storage array, allowing them to respond to writes with very low latency. These memory devices are commonly considered as stable storage, in the sense that they are likely to survive a power outage and other types of breakdown. At critical times, ZFS is unaware of the persistent nature of storage memory, and asks for that memory to be flushed to disk. If indeed the memory devices are considered stable, the storage system should be configured to ignore those requests from ZFS.
• For potential tuning considerations, see: ZFS Evil Tuning Guide, Cache Flushes
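If the array cannot be configured to ignore the flush requests on its own, the Evil Tuning Guide describes a host-side tunable. As a hedged sketch only, and only for arrays whose write cache is genuinely protected by NVRAM, the flush requests can be suppressed in /etc/system:

set zfs:zfs_nocacheflush = 1

The setting takes effect after a reboot and applies to every pool on the host, including pools on plain disks with unprotected caches, so configuring the array to ignore the flushes is the safer approach where available.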
Virtualization Considerations

ZFS and Virtual Tape Libraries (VTLs)
VTL solutions are hardware and software combinations that are used to emulate tapes, tape drives, and tape libraries. VTLs are used in backup/archive systems with a focus on reducing hardware and software costs.
• VTLs are big disk space eaters, and we believe ZFS will allow them to more efficiently and securely manage the massive, online disk space.
• OpenSolaris - the COMSTAR project delivers both tape and disk targets, so a ZFS volume can look like a tape drive.
• Falconstor VTL - has been tested on Thumper running ZFS. For more information, see: Sun Puts Thumper To Work
• NetVault from BakBone - This backup solution includes a VTL feature that has been tested on Thumper running ZFS.
ZFS Performance Considerations
See the following sections for basic system, memory, pool, and replication recommendations:
• ZFS Storage Pools Recommendations
• Should I Configure a RAIDZ, RAIDZ-2, RAIDZ-3, or a Mirrored Storage Pool?
ZFS and Application Considerations

ZFS and NFS Server Performance
ZFS is deployed over NFS in many different places with no reports of obvious deficiency. Many have reported disappointing performance, but those scenarios more typically relate to comparing ZFS-over-NFS performance with local file system performance. It is well known that serving NFS leads to a significant slowdown compared to local or directly attached file systems, especially for workloads that have low thread parallelism. A dangerous way to get better ZFS-over-NFS performance at the expense of data integrity is to set the kernel variable zil_disable. Setting this parameter is not recommended. Later versions of ZFS have implemented a separate ZIL log device option which improves NFS synchronous write operations. This is a better option than disabling the ZIL. Anecdotal evidence suggests a good NFS performance improvement even if the log device does not have nonvolatile RAM. To see if your zpool can support separate log devices, use zpool upgrade -v and look for version 7. For more information, see separate intent log. See also ZFS for Databases. For more detailed information about ZFS-over-NFS performance, see ZFS and NFS, a fine combination.
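As a minimal sketch of that approach, assuming the pool is named tank and a spare fast disk c2t5d0 is free to act as the log device (both names are hypothetical): first confirm the pool version supports separate intent logs, then attach the device.

# zpool upgrade -v
# zpool add tank log c2t5d0

For better protection of in-flight synchronous writes, the log can also be mirrored, for example with "zpool add tank log mirror c2t5d0 c3t5d0".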
ZFS Overhead Considerations
• Checksum and RAIDZ parity computations occur in concurrent threads in recent Solaris releases.
• Compression is no longer single-threaded due to the integration of CR 6460622.
Data Reconstruction
Traditional RAID systems, where the context of the data is not known, reconstruct data (a process also known as resilvering) blindly, in block order. ZFS only needs to reconstruct the data that is actually allocated, so it can be more efficient than traditional RAID systems when the storage pool is not full. ZFS reconstruction occurs top-down in a priority-based manner. Jeff Bonwick describes this in more detail in his Smokin' Mirrors blog post.
Since ZFS reconstruction occurs on the host, some concern exists over the performance impact and availability trade-offs. Two competing RFEs address this subject:
• CR 6678033, resilver code should prefetch
• CR 6494473, ZFS needs a way to slow down resilvering
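As a small illustration, the progress of a resilver can be monitored with the standard status command (the pool name tank is hypothetical):

# zpool status -v tank

While a resilver is running, the scrub line of the output reports the percentage completed and an estimate of the time remaining, which gives a rough feel for the host-side impact discussed above.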