Backup failures and SSHFS

Initial investigations

My healthchecks instance informed me that my Borg backup cronjob on the system-apps VM was failing. Taking a look at it manually, I see the following:

root@system-apps:~# borgmatic -v1
/etc/borgmatic/config.yaml: Parsing configuration file
/mnt/freenas/Backups/borg: Pruning archives
Local Exception
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/borg/archiver.py", line 4455, in main
    exit_code = archiver.run(args)
  File "/usr/lib/python3/dist-packages/borg/archiver.py", line 4387, in run
    return set_ec(func(args))
  File "/usr/lib/python3/dist-packages/borg/archiver.py", line 139, in wrapper
    with repository:
  File "/usr/lib/python3/dist-packages/borg/repository.py", line 189, in __enter__
    self.open(self.path, bool(self.exclusive), lock_wait=self.lock_wait, lock=self.do_lock)
  File "/usr/lib/python3/dist-packages/borg/repository.py", line 392, in open
    self.lock = Lock(os.path.join(path, 'lock'), exclusive, timeout=lock_wait, kill_stale_locks=hostname_is_unique()).acquire()
  File "/usr/lib/python3/dist-packages/borg/locking.py", line 350, in acquire
    self._wait_for_readers_finishing(remove, sleep)
  File "/usr/lib/python3/dist-packages/borg/locking.py", line 363, in _wait_for_readers_finishing
    self._lock.acquire()
  File "/usr/lib/python3/dist-packages/borg/locking.py", line 140, in acquire
    with open(self.unique_name, "wb"):
OSError: [Errno 5] Input/output error: '/mnt/freenas/Backups/borg/lock.exclusive/system-apps.hosts.lan@46333701320834.18075-0'

Platform: Linux system-apps 4.19.0-11-amd64 #1 SMP Debian 4.19.146-1 (2020-09-17) x86_64
Linux: debian 10.6 
Borg: 1.1.9  Python: CPython 3.7.3
PID: 18075  CWD: /root
sys.argv: ['/usr/bin/borg', 'prune', '/mnt/freenas/Backups/borg', '--keep-hourly', '24', '--keep-daily', '7', '--keep-weekly', '4', '--keep-monthly', '6', '--keep-yearly', '1', '--prefix', '{hostname}-', '--stats', '--info']
SSH_ORIGINAL_COMMAND: None

Command '('borg', 'prune', '/mnt/freenas/Backups/borg', '--keep-hourly', '24', '--keep-daily', '7', '--keep-weekly', '4', '--keep-monthly', '6', '--keep-yearly', '1', '--prefix', '{hostname}-', '--stats', '--info')' returned non-zero exit status 2.

Need some help? https://torsion.org/borgmatic/#issues

Repeating this cronjob shows inconsistent failure location: sometimes it occurs during the prune operation, sometimes during the check operation, but invariably while acquiring its lock.

I have an open issue from a month back in my GitLab instance identifying the first instance of the error. I had flagged it as “non-breaking bug” as it only occurred ~5% of the time. Clearly something’s changed now that it’s happening ~95% of the time.

Approach 1: SSHFS mount options

There’s a relevant discussion in the Borg backup repo that suggests using mounting options reconnect,cache=no,noauto_cache,entry_timeout=0 to resolve the issue. Mucking around with fstab, remounting, re-running borgmatic, and this investigation path proves fruitless. The commenters don’t seem quite sure how the options are actually impacting SSHFS and I don’t have time to dig.

Approach 2: Race condition

I know I’m likely going to replace SSHFS with NFS or SMB for reasons mentioned below, but as a quick stab in the dark, let’s note that the script always fails at locking.py:140. Maybe there’s a subtle race condition in SSHFS? I try introducing a short sleep between the creation of self.path and self.unique_name. If it works, great. I save myself the effort of having to edit all my /etc/fstab files. If not, it’s only a few minutes of testing.

    def acquire(self, timeout=None, sleep=None):
        if timeout is None:
            timeout = self.timeout
        if sleep is None:
            sleep = self.sleep
        timer = TimeoutTimer(timeout, sleep).start()
        while True:
            try:
                os.mkdir(self.path)
            except FileExistsError:  # already locked
                if self.by_me():
                    return self
                self.kill_stale_lock()
                if timer.timed_out_or_sleep():
                    raise LockTimeout(self.path)
            except OSError as err:
                raise LockFailed(self.path, str(err)) from None
            else:
+               sleep(1)
                with open(self.unique_name, "wb"):
                    pass
                return self

Misleadingly, this worked the first time I tested after making this change. Unfortunately, I think this was just luck as subsequent attempts all failed in the same location. Playing around with the sleep duration in the false hope this was related was likewise fruitless. Change reverted.

Approach 3: Ditch SSHFS

Searching around, I find other complaints about SSHFS that make me consider using NFS or SMB instead. I initially went with SSHFS as that’s what the Borg documentation recommends and it affords me the ability to use public key login. But if there are stability issues, perhaps I take a look at my other options.

Approach 3A: Install Borg directly on FreeNAS

Borg has the capacity to run remote repositories via SSH, but this requires the installation of Borg on the remote host (FreeNAS). While Borg does have FreeBSD releases, I’m reluctant to install any non-isolated third-party code on to what might be my most critical host. I’d honestly rather deal with spotty backups over SSHFS than expose myself to this kind of unnecessary risk.

Approach 3B: Use SMB or NFS instead

Neither SMB nor NFS support public key logins, which while understandable is a frustrating choice between strong authentication and stable backups. On the upside, FreeNAS does support subnet restrictions for both SMB and NFS shares, which will be sufficient in my context. Noting that SMB is single-threaded, I head in the direction of NFS with the intent of circling back to SMB if NFS doesn’t work out.

  1. In FreeNAS, I set up the NFS shared paths.
  2. Mounting the shares, permission issues indicate that I also need to set up FreeNAS’ Maproot User and Maproot Group options. Fixed.
  3. Trying again, I find that some of my paths are empty. A few moments go by before I realize it’s specifically any nested datastore.
    • Investigation shows this is an NFS limitation with discussion back in 2009.
    • I try using FreeNAS’ All dirs option and mounting the paths separately, but apparently “all dirs” doesn’t cross datastores.
    • An easy solution is to share each FreeNAS datastore separately. Somewhat irritating, but it’s a one-time setup.
  4. With the mounts functional, I run the backup script again and find it working. Moreover, it’s working a lot faster than before.

Closing thoughts

Edit from the future

It seems as though even with NFS, the problem rears its head again. The intermediary solution I’ve put in place is to run the Borg backups directly within the VM and to rsync the backups to FreeNAS on a nightly basis. This greatly speeds up the backup process itself and separates the “do the backup” from the “store the backup remotely” tasks in two. While VM snapshots are now larger, I regain public key authentication and efficient consistency checking. A trade I’m happy to make.

The long-term solution is to move the FreeNAS on to ethernet and avoid this whole hassle. I’m starting to considering drilling holes through a few walls to make that doable. Alternativelty, what could possibly go wrong with putting a server in a nursery?

Mommy, why is there a server in the house?