Update November 14th, 2016: This is an old article, but my recommendation to hack the NFS file still stands, even given how inexpensive small SSDs are. Unfortunately, an SSD ZIL still delivers low performance with ESXi/NFS. We have a dual-SSD ZIL setup on this file server, and without the NFS hack we still only see 50 MiB/sec writes. We now have 10G fiber, so this is in contrast to 650 MiB/sec reads.

There is a special issue when using ZFS-backed NFS for a Datastore under ESXi.

The problem is that the ESXi NFS client forces a commit/cache flush after every write. This makes sense in the context of what ESXi does, as it wants to be able to reliably inform the guest OS that a particular block was actually written to the underlying physical disk. On ZFS, however, every one of those synchronous writes and cache flushes triggers an entry in the ZIL (ZFS Intent Log).

The end result is that the ZFS array ends up doing a massively disproportionate amount of writing to the ZIL, and throughput suffers (I was seeing under 1 MiB/sec on Gigabit Ethernet!).

Performance Benchmarking

Here are the results of testing the various workarounds. As you can see, modifying the kernel is the clear winner. It also has minimal side effects compared to the other options.

Method                     Read Speed    Read Latency    Write Speed    Write Latency
NFS Kernel Mod             67 MiB/sec    341 ms          110 MiB/sec    153 ms
zfs set sync=disabled      69 MiB/sec    198 ms          69 MiB/sec     1628 ms
cache_flush_disable="1"    67 MiB/sec    760 ms          16 MiB/sec     1543 ms

* Tested with dedicated 1 Gbit Ethernet interconnect.

Here are the four solutions:

IDEAL: Hack the NFS Subsystem

This makes the kernel ignore NFS clients’ requests to commit to disk, so a commit/cache-flush request from ESXi (or any other NFS client) is never passed along to the file system.

This, in my view, is the ideal. If you have UPS power there is very little risk here.

Per this article we’re going to modify nfs_nfsdport.c: http://christopher-technicalmusings.blogspot.com/2011/06/speeding-up-freebsds-nfs-on-zfs-for-esx.html

vi /usr/src/sys/fs/nfsserver/nfs_nfsdport.c

Search for NFSWRITE_UNSTABLE and find this block:

if (stable == NFSWRITE_UNSTABLE)
  ioflags = IO_NODELOCKED;
else
  ioflags = (IO_SYNC | IO_NODELOCKED);
uiop->uio_resid = retlen;
uiop->uio_rw = UIO_WRITE;

And change it to:

// if (stable == NFSWRITE_UNSTABLE)
ioflags = IO_NODELOCKED;
// else
// ioflags = (IO_SYNC | IO_NODELOCKED);
uiop->uio_resid = retlen;
uiop->uio_rw = UIO_WRITE;

Then recompile the kernel. Remember that this change needs to be redone after running freebsd-update or otherwise updating /usr/src.
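As a rough sketch of the rebuild step, assuming the sources live in /usr/src and you are building the stock GENERIC config (substitute your own KERNCONF if you use a custom one), the sequence looks something like the following. The run wrapper and the APPLY variable are my own illustration, not part of FreeBSD: by default the script only prints what it would do.

```shell
#!/bin/sh
# Dry-run by default: commands are printed, not executed, unless APPLY=1.
# KERNCONF=GENERIC is an assumption; use your own kernel config name if different.
KERNCONF="${KERNCONF:-GENERIC}"

run() {
    if [ "${APPLY:-0}" = "1" ]; then
        "$@"
    else
        echo "would run: $*"
    fi
}

rebuild_kernel() {
    # Build the patched kernel from /usr/src, install it, then reboot into it.
    run make -C /usr/src buildkernel "KERNCONF=${KERNCONF}" &&
    run make -C /usr/src installkernel "KERNCONF=${KERNCONF}" &&
    run shutdown -r now
}

rebuild_kernel
```

Run it once as-is to review the commands, then again as root with APPLY=1 once you are happy with them.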

The Other Options

There are other solutions, and for completeness’ sake here they are (and why I think the above solution is better):

SSD ZIL Disks

For this you ideally want two SSDs (mirrored for redundancy) to hold your ZIL, instead of keeping it on the array disks themselves.

I think this is a poor solution, especially when you consider that writing is what wears out SSDs. There will still be many excessive writes; they are just faster.
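For completeness, attaching a mirrored log device looks roughly like this. It is a sketch only: the pool name zroot is borrowed from the example further down, and the device names ada4 and ada5 are hypothetical placeholders for your two SSDs. The run wrapper and APPLY variable are my own illustration, and the script only prints the commands by default.

```shell
#!/bin/sh
# Dry-run by default: commands are printed, not executed, unless APPLY=1.
# POOL and both device names are assumptions; substitute your own.
POOL="${POOL:-zroot}"
SSD_A="${SSD_A:-ada4}"
SSD_B="${SSD_B:-ada5}"

run() {
    if [ "${APPLY:-0}" = "1" ]; then
        "$@"
    else
        echo "would run: $*"
    fi
}

add_mirrored_log() {
    # Attach both SSDs to the pool as a mirrored log (ZIL) vdev.
    run zpool add "$POOL" log mirror "$SSD_A" "$SSD_B" &&
    # Show the pool layout so the new "logs" section can be checked.
    run zpool status "$POOL"
}

add_mirrored_log
```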

Disable the ZIL Entirely

This is a pretty blunt solution, but a quick and easy temporary fix. Run this against the pool (zroot here):

zfs set sync=disabled zroot

This turns off sync forcing/cache flushing for the entire file system. There are some who say this can lead to underlying ZFS corruption and cry wolf, but per this article I do not believe that is the case: https://blogs.oracle.com/roch/entry/nfs_and_zfs_a_fine

What it does say, though, is that you can end up with NFS client corruption (in the form of inconsistency). This may be so, but remember that the guest file system itself (e.g., NTFS or UFS) also has protections built in, which can help mitigate these things.

And of course if everything is UPS backed (and nothing panics) this is even less of an issue.

I used this method temporarily until I made the NFS change and experienced no problems, but I dislike how this affects “everything” including native writes, Samba, etc.
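One way to narrow the blast radius, assuming the ESXi datastore lives in its own dataset, is to set the property on that dataset only instead of on the whole pool. The dataset name zroot/vmstore below is hypothetical, and the run wrapper and APPLY variable are my own illustration; the script only prints the commands by default. I have not benchmarked this variant.

```shell
#!/bin/sh
# Dry-run by default: commands are printed, not executed, unless APPLY=1.
# DATASET is a hypothetical name for the dataset exported to ESXi.
DATASET="${DATASET:-zroot/vmstore}"

run() {
    if [ "${APPLY:-0}" = "1" ]; then
        "$@"
    else
        echo "would run: $*"
    fi
}

disable_sync_on_dataset() {
    # Disable sync writes for this dataset (and children that inherit from it).
    run zfs set sync=disabled "$DATASET" &&
    # Print the effective value and where it was set (local vs. inherited).
    run zfs get -r sync "$DATASET"
}

restore_sync_on_dataset() {
    # Return to the default behavior once the temporary fix is no longer needed.
    run zfs set sync=standard "$DATASET"
}

disable_sync_on_dataset
```

Calling restore_sync_on_dataset afterwards puts the dataset back on the default sync=standard.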

Setting vfs.zfs.cache_flush_disable="1" in /boot/loader.conf

This, I think, is an older “solution” from the 8.x days, and the sync=disabled option supersedes it. I found that while it did improve performance by a factor of 15, that only meant 15 MiB/sec writes, which I still consider unacceptable. The “risks” are similar to those of sync=disabled above, which has much better performance.

5 Responses to “SOLVED: Performance Issues With FreeBSD ZFS Backed ESXi Storage Over NFS”

  1. Jimmy Koerting

    Adam, I wonder if I understood this at all 🙂

    I guess my setup is the other way around: a Linux storage box (NFSv4 server), a Xen dom0 server (CentOS) and a FreeBSD VM. This FreeBSD VM has a UFS2 root and a ZFS mount where the active jail is hosted.
    Am I right that it should be no problem to disable the ZIL (sync), as NFS is not the layer writing into ZFS, so there is no risk of data loss from this point?

    Would be great to get your view about this!

    • Adam Strohl

      Hey Jimmy,

      My experience with the issue is specific to FreeBSD as the NFS server with ZFS, but as you may gather the underlying issue is caused by ESXi triggering the “flush” action when writing to the NFS server.

      Xen likely does the same; however, Linux’s (in your case CentOS) ext3/ext4 file system doesn’t react to this as severely as ZFS does. Nor does FreeBSD’s UFS (the ‘native’ file system of FreeBSD), which is ultimately what you’re writing to, correct?

      That being said, how are you doing ZFS inside Xen? Virtual disks on the NFS server, or pass-through directly to devices? Can you show me the ‘zpool status’ output from the FreeBSD server?

  2. Christian P

    Adam:

    Thanks a lot for your article. We changed the file and recompiled FreeNAS 9.10. We’re now getting 90+ MB/s transfer speeds with ESXi.
