XFS - Data Block Sharing (Reflink)
Matt Keenan

Following on from his recent blog post, "XFS - 2019 Development Retrospective", XFS upstream maintainer Darrick Wong dives a little deeper into the reflink implementation for XFS in the mainline Linux kernel.
Three years ago, I introduced to XFS a new experimental "reflink" feature that enables users to share data blocks between files. With this feature, users gain the ability to make fast snapshots of VM images and directory trees, and to deduplicate file data for more efficient use of storage hardware. Copy on write is used when necessary to keep file contents intact, but XFS otherwise continues to use direct overwrites to keep metadata overhead low. When copy on write is in use, the filesystem automatically creates speculative preallocations to combat fragmentation.
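To make the sharing rule concrete, here is a toy model in Python. It is a sketch only: the `ToyVolume` class and its methods are hypothetical names invented for illustration and do not reflect XFS's actual data structures. It shows blocks being shared on reflink, copied on the first write to a shared block, and overwritten in place once a block has a single owner.

```python
class ToyVolume:
    """Toy model of shared data blocks (hypothetical, not XFS's real design)."""

    def __init__(self):
        self.blocks = {}    # physical block number -> bytes stored there
        self.refcount = {}  # physical block number -> number of files using it
        self.files = {}     # file name -> list of physical block numbers
        self._next = 0

    def _alloc(self, data):
        pblk = self._next
        self._next += 1
        self.blocks[pblk] = data
        self.refcount[pblk] = 1
        return pblk

    def create(self, name, chunks):
        self.files[name] = [self._alloc(c) for c in chunks]

    def reflink(self, src, dst):
        # The new file points at the same physical blocks; no data is copied.
        self.files[dst] = list(self.files[src])
        for pblk in self.files[dst]:
            self.refcount[pblk] += 1

    def write(self, name, index, data):
        pblk = self.files[name][index]
        if self.refcount[pblk] > 1:
            # Shared block: copy on write, so the other owner is untouched.
            self.refcount[pblk] -= 1
            self.files[name][index] = self._alloc(data)
        else:
            # Exclusively owned block: direct overwrite in place.
            self.blocks[pblk] = data

    def read(self, name):
        return b"".join(self.blocks[p] for p in self.files[name])
```

In this model, `reflink("base", "vm1")` allocates nothing, the first `write` to a shared block of `vm1` allocates exactly one new block, and later writes to that block reuse it.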
I'm pleased to announce that, as of xfsprogs 5.1, the reflink feature is production ready and enabled by default on new installations, having graduated from the experimental and stabilization phases. Based on feedback from early adopters of reflink, we also redesigned some of our in-kernel algorithms for better performance, as noted below:
iomap for Faster I/O
Beginning with Linux 4.18, Christoph Hellwig and I have migrated XFS' IO paths away from the old VFS/MM infrastructure, which dealt with IO on a per-block and per-block-per-page ("bufferhead") basis. These mechanisms were introduced to handle simple filesystems on Linux in the 1990s, but are very inefficient.
The new IO paths, known as "iomap", iterate IO requests on an extent basis as much as possible to reduce overhead. The subsystem was written years ago to handle file mapping leases and the like, but nowadays we can use it as a generic binding between the VFS, the memory manager, and XFS whenever possible. The conversion finished as of Linux 5.4.
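The difference between the two approaches can be sketched by counting how many iterations each needs for one request. This is a rough illustration rather than kernel code: the 4096-byte block size, the function names, and the `(start_byte, byte_length)` extent representation are all assumptions made for the example.

```python
BLOCK_SIZE = 4096  # assumed filesystem block size, in bytes


def per_block_steps(offset, length):
    """Iterations needed when an I/O request is walked one block at a time."""
    first = offset // BLOCK_SIZE
    last = (offset + length - 1) // BLOCK_SIZE
    return last - first + 1


def per_extent_steps(extents, offset, length):
    """Iterations needed when the request is walked one extent at a time.

    extents is a list of (start_byte, byte_length) file mappings.
    """
    end = offset + length
    return sum(
        1 for start, size in extents
        if start < end and start + size > offset
    )


# A 1 MiB request over a file mapped by two 512 KiB extents.
extents = [(0, 512 * 1024), (512 * 1024, 512 * 1024)]
block_steps = per_block_steps(0, 1024 * 1024)
extent_steps = per_extent_steps(extents, 0, 1024 * 1024)
```

Under these assumptions the block-at-a-time walk takes 256 steps and the extent-at-a-time walk takes 2, which is the kind of overhead reduction the extent-based design is after.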
In-Core Extent Tree
For many years, the in-core file extent cache in XFS used a single contiguous chunk of memory to store the mappings. This introduced a serious and subtle pain point for users with large sparse files, because it can be very difficult for the kernel to fulfill such an allocation when memory is fragmented.
Christoph Hellwig rewrote the in-core mapping cache in Linux 4.15 to use a btree structure. Instead of using a single huge array, the btree structure reduces our contiguous memory requirements to 256 bytes per chunk, with no maximum on the number of chunks. This enables XFS to scale to hundreds of millions of extents while eliminating a source of OOM killer reports.
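A quick back-of-the-envelope calculation shows why this matters. The sketch below assumes each in-core mapping record occupies 16 bytes; that figure and the function name are assumptions for illustration, while the 256-byte chunk size comes from the text above.

```python
RECORD_BYTES = 16  # assumption: size of one in-core extent mapping record
CHUNK_BYTES = 256  # chunk size quoted in the text


def largest_contiguous_alloc(n_extents, chunked):
    """Largest single contiguous allocation needed to cache n_extents mappings."""
    if chunked:
        # Chunked layout: memory is requested one small chunk at a time.
        return CHUNK_BYTES
    # Flat array: every record must sit in one contiguous region.
    return n_extents * RECORD_BYTES


n = 100_000_000  # a file with one hundred million extents
flat = largest_contiguous_alloc(n, chunked=False)
chunked = largest_contiguous_alloc(n, chunked=True)
```

With these numbers the flat array would need a single 1.6 GB contiguous region, whereas the chunked layout never asks for more than 256 bytes at once, however many chunks it ends up using.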
Users need only upgrade their kernel to take advantage of this improvement.
Demonstration: Reflink
To begin experimenting with XFS's reflink support, one must format a new filesystem:
[Listing not preserved: mkfs command and its output]
If you do not see the exact phrase "reflink=1" in the mkfs output then your system is too old to support reflink on XFS. Now one must mount the filesystem:
[Listing not preserved: mount command]
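The `reflink=1` check described above can also be scripted. Here is a minimal sketch, assuming the geometry text printed by mkfs.xfs (or by `xfs_info` on the mounted filesystem) has already been captured in a string. The function name is hypothetical and the sample text is made up for the example, not real tool output.

```python
import re


def reflink_enabled(geometry_text):
    """Return True if the exact token reflink=1 appears in the geometry text."""
    return re.search(r"\breflink=1\b", geometry_text) is not None


# Illustrative input only, not captured from a real mkfs run.
sample = "crc=1 finobt=1, sparse=1, rmapbt=0\n         =    reflink=1\n"
supported = reflink_enabled(sample)
```

The word boundaries keep the match strict, so `reflink=0` or a line with no reflink field at all is reported as unsupported.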
At this point, the filesystem is ready to absorb some new files. Let's pretend that we're running a virtual machine (VM) farm and therefore need to manage deployment images. This and the next example are admittedly contrived, as any serious VM and container farm manager takes care of all these details.
[Listing not preserved: commands setting up the base VM image]
Now we install a base OS image that we will later use for fast deployment. Once that's done, we shut down the QEMU process. Before going any further, we'll check that everything's in order:
[Listing not preserved: check of the base image]
Now, let's say that we want to provision a new VM using the base image that we just created. In the old days we would have had to copy the entire image, which can be very time consuming. Now, we can do this very quickly:
[Listing not preserved: reflink copy of the base image, followed by the new file's extent map and the free space report]
This was a very quick copy! Notice how the extent map on the new image file shows file data pointing to the same physical storage as the original base image, but is now marked as a shared extent, and there's about as much free space as there was before the copy. Now let's start that new VM and let it run for a little while before re-querying the block mapping:
[Listing not preserved: extent map of the new image after the VM has run]
Notice how the first 128K of the file now points elsewhere. This is evidence that the VM guest wrote to its storage, causing XFS to employ copy on write on the file so that the original base image remains unmodified. We've apparently used another 4GB of space, which is far better than the 64GB that would have been required in the old days.
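For readers who want to reproduce the clone-then-write sequence from a program, the sketch below uses the Linux `FICLONE` ioctl, which asks the filesystem to make the destination file share the source file's blocks. The helper name `clone_or_copy` is hypothetical, and the numeric request value is an assumption that holds on x86-64 and arm64 but should be checked against `linux/fs.h` on other architectures. If the filesystem cannot share blocks, the sketch falls back to an ordinary byte copy.

```python
import fcntl
import shutil

# Assumption: _IOW(0x94, 9, int) as encoded on x86-64 and arm64.
FICLONE = 0x40049409


def clone_or_copy(src, dst):
    """Make dst a copy of src.

    Returns True if the filesystem shared the blocks (reflink),
    False if it could not and the data was copied byte for byte.
    """
    with open(src, "rb") as s, open(dst, "wb") as d:
        try:
            fcntl.ioctl(d.fileno(), FICLONE, s.fileno())
            return True
        except OSError:
            # No reflink support here (or files on different filesystems).
            shutil.copyfileobj(s, d)
            return False
```

Either way, writing to the first 128K of the copy afterwards leaves the original file unmodified. On a reflink-capable filesystem only the rewritten range should stop being shared.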
Let's turn our attention to the second major feature for reflink: fast(er) snapshotting of directory trees. Suppose now that we want to manage containers with XFS. After a fresh formatting, create a directory tree for our container base:
[Listing not preserved: command creating the container base directory]
In the directory we just created, install a base container OS image that we will later use for fast deployment. Once that's done, we shut down the container and check that everything's in order:
[Listing not preserved: check of the container base system]
OK, that looks like a reasonable base system. Let's use reflink to make a fast copy of this system:
[Listing not preserved: reflink copy of the container tree]
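A directory tree can be cloned from a program by applying a block-sharing copy to each regular file. The sketch below is one way to do that with Python's `shutil.copytree`. The names `clone_file` and `clone_tree` are hypothetical, and the `FICLONE` request value is an assumption valid on x86-64 and arm64. Files fall back to a plain copy when the filesystem cannot share blocks.

```python
import fcntl
import shutil

# Assumption: _IOW(0x94, 9, int) as encoded on x86-64 and arm64.
FICLONE = 0x40049409


def clone_file(src, dst):
    """Copy one regular file, sharing blocks with the source when possible."""
    with open(src, "rb") as s, open(dst, "wb") as d:
        try:
            fcntl.ioctl(d.fileno(), FICLONE, s.fileno())
        except OSError:
            # Filesystem cannot share blocks; copy the bytes instead.
            shutil.copyfileobj(s, d)
    # Preserve permission bits and timestamps.
    shutil.copystat(src, dst)
    return dst


def clone_tree(src_dir, dst_dir):
    """Recreate src_dir at dst_dir, cloning each regular file."""
    # symlinks=True recreates symbolic links instead of following them.
    return shutil.copytree(
        src_dir, dst_dir, symlinks=True, copy_function=clone_file
    )
```

Directories and symbolic links are recreated rather than shared; only regular file data is a candidate for block sharing.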
Now we let the container runtime do some work and update (for example) the bash binary:
[Listing not preserved: block mappings of the two copies of bash after the update]
Notice that the two copies of bash no longer share blocks. This concludes our demonstration. We hope you enjoy this major new feature!