Learn how the database writes data to disk by reading the database's source code

Amit Davidson 6 minutes to read

The Inside of Databases: How BoltDB keeps data on disk

Image credit: Markus Winkler in Unsplash

Most software engineers don't know about databases. For many people, the only thing they know about a database is whether it supports SQL or NoSQL.

Today, we will talk about the concept of databases. We will see what happens from the beginning of the write to the commit to disk.

We will do this by looking at BoltDB. A lightweight database written in Go for storing data on disk. The code is simple, with less than 5,000 lines of code, so it's a great place to start reading the database.

Although Bolt was written in Go, this post will be language-neutral.

BoltDB overview

BoltDB is a B+ tree-based NoSQL key/value store engine. There are only a few types in BoltDB: DB, Bucket, Tx, and Cursor.

A DB is an encapsulator of a database. A bucket is a collection of key/value pairs. Bucket operations are handled within transaction Tx. Read-only or read-write transactions, depending on the process. Cursors are used to traverse keys in the tree sequentially.

BoltDB does not have a query language like a mature database. Instead, all operations are performed through the read, put, and delete functions of the transaction itself.

Data organization

memory

To locate a record, the database system uses indexes. Indexes are built on different data structures, each with its advantages. The most popular are LSM and B-trees and their variants.

Bolt's underlying data structure is a B+ tree. This means that Bolt is suitable for heavy reading applications. All nodes contain key/value pairs called inodes. On a branch node, the value is a pointer to a child node on disk, while on a leaf, it is the actual value.

BoltDB B+ trees are different from traditional trees. Child nodes do not hold pointers to sibling nodes, making range sweeps slower and helping to reduce space.

BoltDB B+ tree

Rebalancing also applies to "real-world" impacts. It only includes merging and splitting, no rotation. In addition, it is deferred until the transaction commits to avoid undoing the change when the transaction rolls back.

disk

When writing to disk, the database must develop a format for how to lay out data on disk. A page is the minimum amount of data that can be written or read using the operating system. It is usually 4KiB in size and is laid continuously on disk. The goal of databases is to align their data with pages to minimize requests to disk and improve performance.

In Bolt, each tree node corresponds to one or more disk pages. Each page contains titles: id, flags, count, and overflow.

Flags are the type of page: branch, leaf, meta, or free list
count is the number of key-value pairs
Overflow is the number of pages that are overflowed when the node is large.

There is also a generic ptr field that points to specific data for the page type. The page layout on disk is as follows.

Page layout

Branch Offices page

The branch page is divided into two parts: the key and the pointer to the key. The pointer has a fixed size, but the key can vary. To find a key, we can easily find the pointer and read the ksize bytes.

Branch node layout

Leaf pages

Leaf pages are similar to branch pages. The most obvious difference is that pgid is replaced with vsize, and the value is added to the keys segment.

Layout of leaf nodes

When a node is loaded, it must first be deserialized. BoltDB doesn't complicate things, but uses binary encoding. When deserialized, the flags field is used to load the appropriate page type into ptr.

When a node is saved back, it must be converted from node notation to a page. Therefore, the data is serialized into a page format.

Transaction

ACID refers to atomicity, consistency, isolation, durability. It is a set of properties of a database that guarantees that transactions are handled reliably.

Atomicity means that either all transactions succeed or none succeed. This ensures that the database is not left in an undefined state.
Consistency ensures that any data written to the database must be valid according to all defined rules after the transaction completes.
Isolation guarantees that all transactions will occur in isolation. No transaction is affected by other transactions. So a transaction cannot read data from any other transaction that has not yet completed.
Persistence means that once a transaction is committed, the data is retained even in the event of a system failure.

BoltDB transactions are ACID-compliant. It uses copy-on-write, so when a node is modified, BoltDB copies all nodes on the path to it. Then to apply these modifications, all you have to do is replace the root of the new branch.

In this way, transactions are segregated. When the copied version is modified, the original version can be used to serve multiple readers. In addition, a transaction is atomic because its state can be switched from non-executing to full execution by pointer switching.

Metadata is tracked in pages with the previously mentioned flag meta. The metapage holds a reference to the page ID of the root bucket. So BoltDB replaces the root pointer by rewriting the meta page.

Page management

The operating system maintains a page cache to improve read performance. In a normal file read, the call goes through the cache, but the page may have been evicted from the cache a long time ago.

Page caching

Many databases implement their caches instead. BoltDB does not implement its own caching, but uses mmap. With mmap, pages are kept separately in memory until they are finished, so pages are never evicted. More RAM is needed to support faster reads.

The entire database is a single file. With mmap, you must specify a fixed size to map. Files may grow indefinitely, so mmap must be called accordingly.

mmap is a system call and is expensive, so BoltDB avoids calling it after each write. Instead, it maps more than it needs, so it needs to be called less often. From 32KB to 1GB, the increment is a power of 2, and after 1GB, the file grows in units of 1GB.

Free pages of remaining space are managed by freelist. When a transaction is committed, it requests free pages from freelist to write to the modified node. Once the transaction is complete or rolled back, the pages can be recycled. If there are no free pages, mmap is called at a larger size.

summary

Finally, we can put everything together and understand how database transactions work in BoltDB.

To make changes to the database, you need to start a read-write transaction.
For example, call Put, and the cursor is moved to the correct node and location in the tree.
To modify a node, you must first copy it down. The related pages are loaded and deserialized according to the type of page. All modifications are made on the tree until the transaction commits.
Once The Proxim is called, the tree is rebalanced with merge or split.
Freelist assigns a new page. If there are not enough pages, the files are reallocated with mmap.
The modified node is serialized to a page. These pages are written to disk.
Replace the root by rewriting the metapage, pointing to the new page.

I hope you now have a better understanding of how databases work. The biggest takeaway is realizing that despite the complexity of modern databases, studying them is not so difficult.

The Inside of Databases: How BoltDB keeps data on disk

BoltDB overview

Data organization

summary