MSE Database Sync

Introduction

The database_sync parameter, available for books in the mse.conf file, controls whether MSE synchronizes the database to disk on every transaction. The parameter is used only by the LMDB library embedded in Varnish: it is passed to the library when the book is opened, and all Varnish Enterprise code outside the library executes identically regardless of its value. In other words, changing the parameter does not change the control flow in Varnish Enterprise, except inside the library, which is not written by Varnish Software.

What does it do?

The parameter database_sync controls how LMDB deals with the disk. When database_sync is off, LMDB does not provide any integrity guarantees in the rare event of either:

  1. A complete power failure with neither battery backup nor a graceful shutdown of the OS.
  2. A kernel panic.

Note that the list above does not contain a Varnish panic, which means that bugs in Varnish will not affect the integrity of the database.

Event examples

Power failure

In a data center, a power failure is extremely rare. The normal setup is to have some battery backup, so that the Linux kernel can do a graceful shutdown before the battery runs out. In these cases, there is no issue with the database, and using database_sync=off is completely fine. Getting this to work requires some configuration, where the management software of the data center notifies the kernel that it needs to shut down.

If you are not sure what happens with your server during a power failure, we encourage you to contact whoever is responsible for the infrastructure where your servers are located.

Kernel panic

A kernel panic is even rarer. It usually results only from bad hardware or drivers, or from running experimental (bleeding edge) kernel builds. On an LTS Linux system running on stable hardware, kernel panics essentially do not happen at all.

LMDB behavior

We will now dive into how LMDB behaves when it updates the book, and how database_sync affects its behavior. The description is simplified, but still quite technical.

The book (LMDB database) is memory mapped into the varnishd memory space, which means that the kernel will write changes to the book whenever it finds it reasonable to do so. The kernel and the LMDB library manage the mapped files through pages, which are 4 KiB in size. When the kernel writes parts of the book to disk, it always writes whole pages, never partial ones. This is part of the virtual memory subsystem, which also includes management of swap and many other things.
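To illustrate the page granularity described above, here is a minimal sketch, assuming a page size of 4096 bytes. The helper names dirty_pages and writeback_bytes are hypothetical and exist only for this illustration; they are not part of LMDB or Varnish Enterprise.

```python
PAGE_SIZE = 4096  # assumed page size in bytes; the real value is platform dependent


def dirty_pages(offset, length, page_size=PAGE_SIZE):
    """Return the indices of the pages touched by a write of `length` bytes at `offset`."""
    if length <= 0:
        return []
    first = offset // page_size
    last = (offset + length - 1) // page_size
    return list(range(first, last + 1))


def writeback_bytes(offset, length, page_size=PAGE_SIZE):
    """Bytes that must eventually be written back: whole pages only, never partial ones."""
    return len(dirty_pages(offset, length, page_size)) * page_size
```

For example, a 10 byte change at offset 4090 straddles a page boundary, so dirty_pages(4090, 10) returns [0, 1] and writeback_bytes(4090, 10) returns 8192, even though only 10 bytes changed.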

The kernel’s algorithms are good at choosing when to write out dirty pages, that is, pages whose disk backing is out of sync with memory. Applications dealing with memory mapped files typically don’t interfere with this. The reason is that when the process (varnishd in our case) quits, the kernel makes sure that the file on disk contains all the changes made through “memory writes”. (The term is in quotes because the program only sees its virtual memory space and cannot tell actual RAM from other addresses.) When the kernel makes everything match, it is called a sync, and the sync after a process exit is what makes a previously mapped file correct, regardless of which pages were dirty when the process stopped.
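The following sketch illustrates the process-exit behavior on a Linux-like system. It is not Varnish or LMDB code. A child process maps a file, changes it with a plain memory write, and then exits abruptly without any explicit sync. The function name survives_process_exit and the child script are hypothetical and used only for this illustration.

```python
import mmap
import os
import subprocess
import sys
import tempfile

# Child script: map the file, change it, then die without flush() or close().
CHILD = """
import mmap, os, sys
f = open(sys.argv[1], "r+b")
m = mmap.mmap(f.fileno(), 0)   # shared mapping of the whole file
m[0:4] = b"book"               # plain memory write into the mapping
os._exit(1)                    # abrupt exit: no explicit sync, no cleanup
"""


def survives_process_exit(path):
    """Run the child against `path` and return the file's first four bytes afterwards."""
    subprocess.run([sys.executable, "-c", CHILD, path], check=False)
    with open(path, "rb") as f:
        return f.read(4)


if __name__ == "__main__":
    fd, path = tempfile.mkstemp()
    os.write(fd, b"\0" * mmap.PAGESIZE)  # one zero-filled page to map
    os.close(fd)
    print(survives_process_exit(path))
    os.remove(path)
```

The change made by the child is visible in the file even though the child never asked for a sync, because the dirty page belongs to the kernel, not to the process. This only covers a process exit. It says nothing about a power failure or kernel panic, where the kernel itself stops.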

The kernel provides APIs to perform such sync operations explicitly, and these APIs are used by databases and by other software that resembles a database or has database components (like Varnish Enterprise). The sync operation has a cost: the kernel has to write many pages to disk, and those pages are likely to become dirty (out of sync with the disk) again soon after. When LMDB performs sync operations, it does so only to provide integrity guarantees in the event of a power failure or kernel panic, at the expense of performance in the book.
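As a rough illustration of an explicit sync, the sketch below uses Python's mmap.flush(), which on Unix is implemented with msync. The function apply_updates and its sync argument are hypothetical names for this example. They only mimic the idea of syncing after every change and are not how LMDB is implemented.

```python
import mmap
import os
import tempfile
import time


def apply_updates(path, updates, sync):
    """Write each (offset, data) pair through a shared mapping.

    If `sync` is true, force the dirty pages to disk after every update.
    Returns the elapsed time in seconds.
    """
    with open(path, "r+b") as f, mmap.mmap(f.fileno(), 0) as m:
        start = time.perf_counter()
        for offset, data in updates:
            m[offset:offset + len(data)] = data
            if sync:
                m.flush()  # explicit sync: blocks until the pages are written
        return time.perf_counter() - start


if __name__ == "__main__":
    fd, path = tempfile.mkstemp()
    os.write(fd, b"\0" * (2 * mmap.PAGESIZE))
    os.close(fd)
    updates = [(0, b"x")] * 1000  # the same page is dirtied again and again
    print("with sync:   ", apply_updates(path, updates, sync=True))
    print("without sync:", apply_updates(path, updates, sync=False))
    os.remove(path)
```

How large the difference is depends on the storage device and file system, so no numbers are given here. The point is only that each explicit sync waits for the disk, while the unsynced variant leaves the timing of the writes to the kernel.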

To understand how the sync happens in Varnish, we need to dive into LMDB. Every LMDB database is organized as a tree, and the tree occupies a certain set of pages. When the database is stable, which means that there are no cache insertions, object expirations, purges or bans, all the threads access the tree without making any pages dirty.

As soon as a thread needs to change the database, it will:

  1. Acquire one or more locks, preventing other threads from making changes that conflict with its own change.
  2. Build up a new sub-tree in pages not currently in use by the database.
  3. If database_sync is on, sync the database to disk.
  4. Update a single page that will atomically substitute an old portion of the tree with the newly built subtree.
  5. If database_sync is on, sync the database to disk.

The result is that, at any point in time, the database is either completely in the state before the update or completely in the state after the update. If database_sync is on, the disk version of the database has the same guarantee: it is either the old or the new version.
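The five steps above can be sketched as a toy model. This is a simplified illustration under stated assumptions, not LMDB's actual on-disk layout, locking or I/O path. The class ToyBook, its single meta page and the crash_before_publish flag are all hypothetical. Page 0 holds the index of the live data page, and a commit builds the new version in a spare page before switching that index.

```python
import mmap
import struct
import threading

PAGE = mmap.PAGESIZE
META = struct.Struct("<I")  # page 0 stores the index of the live data page


class ToyBook:
    """Toy copy-then-publish commit over a memory-mapped file."""

    def __init__(self, path, database_sync=True):
        with open(path, "wb") as f:
            f.truncate(3 * PAGE)  # page 0: meta, pages 1 and 2: data
        self._file = open(path, "r+b")
        self._map = mmap.mmap(self._file.fileno(), 0)
        self._lock = threading.Lock()
        self.database_sync = database_sync
        self.syncs = 0
        META.pack_into(self._map, 0, 1)  # the (empty) initial version is page 1

    def _sync(self):
        if self.database_sync:
            self._map.flush()
            self.syncs += 1

    def read(self):
        (live,) = META.unpack_from(self._map, 0)
        return self._map[live * PAGE:(live + 1) * PAGE].rstrip(b"\0")

    def commit(self, value, crash_before_publish=False):
        with self._lock:                                   # step 1: take the lock
            (live,) = META.unpack_from(self._map, 0)
            spare = 2 if live == 1 else 1                  # a page not in use
            start = spare * PAGE
            self._map[start:start + PAGE] = value.ljust(PAGE, b"\0")  # step 2
            self._sync()                                   # step 3
            if crash_before_publish:
                return                                     # simulate dying before step 4
            META.pack_into(self._map, 0, spare)            # step 4: publish
            self._sync()                                   # step 5

    def close(self):
        self._map.close()
        self._file.close()
```

With this model, a commit that stops before step 4 leaves read() returning the old value, and a commit that completes makes read() return the new one. There is no state in between. With database_sync=False the same code path runs, only the two flush calls are skipped, which mirrors the statement that the parameter does not change the control flow outside the sync itself.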