There is a special issue when using ZFS-backed NFS for a Datastore under ESXi.
The problem is that the ESXi NFS client forces a commit/cache flush after every write. This makes sense in the context of what ESXi does: it wants to be able to reliably inform the guest OS that a particular block was actually written to the underlying physical disk. However, for ZFS those synchronous writes and cache flushes each trigger a ZIL (ZFS Intent Log) entry.
The end result is that the ZFS array ends up doing a massively disproportionate amount of writing to the ZIL and throughput suffers (I was seeing under 1 MiB/sec on Gigabit Ethernet!).
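If you want to watch this happening on your own system, per-vdev write activity during a guest write workload makes the ZIL traffic obvious. A minimal sketch, assuming a pool named tank (substitute your own pool name):

# Show per-vdev ops and bandwidth, refreshed every second; with a dedicated
# log device the ZIL writes show up on that vdev, otherwise they land on the
# array disks themselves.
zpool iostat -v tank 1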
Here are the results of testing the various work-arounds; as you can see, modifying the kernel is the clear winner. It also has minimal side effects compared to the other options.
|Method|Read Speed|Read Latency|Write Speed|Write Latency|
|---|---|---|---|---|
|NFS Kernel Mod|67 MiB/sec|341 ms|110 MiB/sec|153 ms|
|zfs set sync=disabled|69 MiB/sec|198 ms|69 MiB/sec|1628 ms|
|cache_flush_disable="1"|67 MiB/sec|760 ms|16 MiB/sec|1543 ms|
* Tested with dedicated 1 Gbit Ethernet interconnect.
Here are the four solutions:
IDEAL: Hack the NFS Subsystem
This makes the kernel ignore NFS clients’ requests for stable (committed-to-disk) writes, so ESXi’s (or any other NFS client’s) commit/cache-flush requests are never passed down to the file system.
This, in my view, is the ideal solution. If you have UPS power there is very, very little risk here.
Per this article we’re going to modify nfs_nfsdport.c: http://christopher-technicalmusings.blogspot.com/2011/06/speeding-up-freebsds-nfs-on-zfs-for-esx.html
Search for NFSWRITE_UNSTABLE and find this block:
if (stable == NFSWRITE_UNSTABLE)
	ioflags = IO_NODELOCKED;
else
	ioflags = (IO_SYNC | IO_NODELOCKED);
uiop->uio_resid = retlen;
uiop->uio_rw = UIO_WRITE;
And change it to:
//	if (stable == NFSWRITE_UNSTABLE)
	ioflags = IO_NODELOCKED;
//	else
//		ioflags = (IO_SYNC | IO_NODELOCKED);
uiop->uio_resid = retlen;
uiop->uio_rw = UIO_WRITE;
Then recompile the kernel. Remember that this change needs to be re-done after a freebsd-update or any time you update /usr/src.
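For reference, a typical rebuild looks something like the following (assuming the stock GENERIC kernel configuration; nfs_nfsdport.c lives under the NFS server code in /usr/src/sys):

# Rebuild and install the kernel containing the patched NFS server code,
# then reboot to pick it up.
cd /usr/src
make buildkernel KERNCONF=GENERIC
make installkernel KERNCONF=GENERIC
shutdown -r now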
The Other Options
There are other solutions, and for completeness’ sake here they are (and why I think the above solution is better):
SSD ZIL Disks
For this you optimally want two SSDs (mirrored for redundancy) to hold the ZIL, instead of leaving it on the array disks themselves.
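A minimal sketch of adding a mirrored SSD log vdev, assuming the pool is named zroot and the SSDs show up as ada1 and ada2 (device names are purely illustrative):

# Add a mirrored pair of SSDs as a dedicated ZIL (SLOG) for the pool.
zpool add zroot log mirror ada1 ada2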
Especially when you consider that writes are what wear out SSDs, I think this is a poor solution: there will still be an excessive number of writes, they will just be faster.
Disable the ZIL Entirely
This is a pretty blunt solution, but a quick and easy temporary fix. Run this against the pool’s root dataset:
zfs set sync=disabled zroot
This turns off sync forcing/cache flushing for the entire pool (the setting is inherited by every dataset under zroot). There are some who cry wolf and say this can lead to underlying ZFS corruption, but per this article I do not believe that is the case: https://blogs.oracle.com/roch/entry/nfs_and_zfs_a_fine
What it does say, though, is that you can end up with NFS client corruption (in the form of inconsistency). That may be so, but remember that the guest filesystem (e.g. NTFS or UFS) has its own protections built in which can help mitigate this.
And of course if everything is UPS backed (and nothing panics) this is even less of an issue.
I used this method temporarily until I made the NFS change and experienced no problems, but I dislike how this affects “everything” including native writes, Samba, etc.
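If you do use it as a stop-gap, reverting is a one-liner once the kernel change is in place (sync=standard is the default behaviour):

# Restore normal synchronous write semantics for the pool.
zfs set sync=standard zroot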
Setting vfs.zfs.cache_flush_disable="1" in /boot/loader.conf
This, I think, is an older “solution” from the 8.x days, and the sync=disabled option supersedes it. I found that while it did improve performance by a factor of 15x, that only meant about 16 MiB/sec writes, which I still consider unacceptable. And the “risks” are similar to sync=disabled above, which has much better performance.
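For completeness, this is all it takes. It is a loader tunable, so it only takes effect after a reboot:

# /boot/loader.conf
vfs.zfs.cache_flush_disable="1"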