Database storage engines under the hood

Shashank Baravani
8 min read · May 20, 2018

We have often wondered how databases store and manage massive loads of data behind the scenes, allowing developers to focus on the more pertinent matters of data models, API usage and sundry configuration. Here is a primer on the four major storage formats in use today across a wide variety of databases, from NoSQL stores to search indexes.

Hash Indexes

A hash index in its simplest form is nothing but an in-memory HashMap backed by a file on disk, i.e. the values are byte offsets into the underlying file. A storage engine like Bitcask (used by Riak) uses hash indexes under the hood. A value can be fetched from disk in one seek if it is not already in the filesystem cache. Since writes to the file are append-only (an update to a key simply appends a new record), each physical file has to be broken down into smaller segments. As the number of segments grows large, they have to be shrunk into fewer files through compaction.
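To make this concrete, here is a minimal sketch in Python, assuming a toy one-record-per-line layout and a single segment file (the real Bitcask format is binary, with CRCs and timestamps):

```python
import os

class HashIndex:
    """Toy Bitcask-style store: an in-memory dict mapping each key to the
    byte offset of its latest record in an append-only file."""

    def __init__(self, path="segment.log"):   # file name is illustrative
        self.index = {}                        # key -> byte offset
        self.f = open(path, "ab+")

    def put(self, key: str, value: str):
        self.f.seek(0, os.SEEK_END)
        offset = self.f.tell()                 # where the new record starts
        self.f.write(f"{key},{value}\n".encode())   # append-only, never in place
        self.f.flush()
        self.index[key] = offset               # point the key at the newest record

    def get(self, key: str):
        offset = self.index.get(key)
        if offset is None:
            return None
        self.f.seek(offset)                    # one seek, then a sequential read
        _, value = self.f.readline().decode().rstrip("\n").split(",", 1)
        return value
```

A lookup is one dictionary probe plus at most one disk seek, which is what makes reads so cheap.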

Bitcask is well suited to situations where all the keys fit in memory, the value for each key is updated frequently, and lookups happen at web scale. For example, the key might be the geocode of a location and the value might be the current weather conditions and the forecast for the next 30 days (updated periodically as the day progresses). Even if there are a million geocodes corresponding to inhabitable regions, all of the keys would still fit in memory.

  • Binary file formats are preferred since they encode keys and values in the most efficient manner.
  • Since there are no in-place updates, a deletion is actually a tombstone, an indicator that the record no longer exists. Tombstones are erased during compaction (see the compaction sketch after this list).
  • Crash recovery typically happens by recreating the hashmap from a snapshot or through the lengthy process of replaying all the segments.
  • Generally, a single writer thread and the immutability of records eliminate the need for any complex concurrency control mechanism.
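Compaction under the same toy layout is then a single replay of the segments, oldest to newest, so the latest value per key wins; tombstoned keys (marked here with an assumed sentinel) are dropped from the output:

```python
TOMBSTONE = "__deleted__"   # assumed sentinel written on delete

def compact(segment_paths, out_path):
    """Merge append-only segments into one file, keeping only the latest
    value per key and erasing tombstoned records."""
    latest = {}
    for path in segment_paths:          # oldest first, so later records win
        with open(path) as f:
            for line in f:
                key, value = line.rstrip("\n").split(",", 1)
                latest[key] = value
    with open(out_path, "w") as out:
        for key, value in latest.items():
            if value != TOMBSTONE:      # deletions vanish during compaction
                out.write(f"{key},{value}\n")
```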

On the performance front, hash indexes offer very high read and write performance. However, they are not distributed and hence do not scale horizontally, and range queries are inefficient since the keys are unsorted.

SSTables and LSM-Trees

Sorted String Tables (SSTables) are an improvement over the more basic hash indexes, with the additional constraint that all keys in a segment be sorted. Additionally, each key appears only once within a segment file. While both rely on the speed of sequential writes to disk, SSTables have some advantages.

  • Compaction runs much faster since, for all practical purposes, it is now the final pass of a merge sort, with time complexity O(n); see the merge sketch after this list.
  • When a key appears in multiple segments, only the value from the most recent segment is kept and older duplicates are discarded.
  • In some SSTables, during a read, the entire segment block containing the record is read into memory. This helps range queries and access patterns that frequently read adjacent records. Compressing each block before it is written to disk reduces disk I/O, leading to fast reads.
  • In other SSTables, during a read, the offset of the record is looked up and the record is fetched directly from the correct segment.
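To illustrate that merge pass, here is a sketch assuming each segment is an in-memory list of (key, value) pairs already sorted by key. heapq.merge performs the k-way merge in one sequential pass, and tagging each record with its segment's recency makes the newest value win for duplicate keys:

```python
import heapq

def merge_segments(segments):
    """segments: lists of (key, value) pairs, each sorted by key, ordered
    oldest to newest. Yields each key once, with its most recent value."""
    tagged = [
        [(key, -i, value) for key, value in seg]   # -i: newer segments sort first
        for i, seg in enumerate(segments)
    ]
    last_key = object()                            # sentinel matching no real key
    for key, _, value in heapq.merge(*tagged):
        if key != last_key:                        # first hit is the newest segment
            yield key, value
            last_key = key

old = [("a", "1"), ("b", "2")]
new = [("b", "9"), ("c", "3")]
print(list(merge_segments([old, new])))   # [('a', '1'), ('b', '9'), ('c', '3')]
```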

SSTables are best for use cases where humongous writes outnumber reads by orders of magnitude. For example, if you are building a distributed log service, you would need a storage engine that can absorb a very large throughput of application and error logs from a wide variety of clients. While the writes have to be O(1), the logs themselves would be accessed sparingly, such as for debugging purposes only, so slightly slower reads would be acceptable (O(k·log n), where k is the number of segments).

In order to write sorted segments onto disk, SSTables maintain a sorted map, typically a red-black tree or an AVL tree, in memory. These are called memtables. A memtable is flushed to disk as a segment when it reaches a size threshold. Reads start with the memtable and proceed through the segments, most recent to oldest. Probabilistic data structures like Bloom filters are used to avoid disk lookups whenever a key doesn't exist in the memtable.
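Here is a minimal sketch of that write path, with a plain dict standing in for the balanced tree (it is simply sorted at flush time) and an illustrative size threshold and file-naming scheme:

```python
class Memtable:
    """Buffers writes in memory and flushes them to disk as a sorted segment."""

    def __init__(self, threshold=1024):
        self.entries = {}              # stand-in for a red-black/AVL tree
        self.threshold = threshold     # illustrative flush threshold (entries)
        self.flushed = 0

    def put(self, key, value):
        self.entries[key] = value
        if len(self.entries) >= self.threshold:
            self.flush()

    def flush(self):
        path = f"segment-{self.flushed:04d}.sst"    # hypothetical naming scheme
        with open(path, "w") as f:
            for key in sorted(self.entries):        # segment written in key order
                f.write(f"{key},{self.entries[key]}\n")
        self.entries.clear()                        # start a fresh memtable
        self.flushed += 1
```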

Stashing incoming updates in a fast write-ahead log, prior to updating the memtable, helps recover from database crashes. The log is discarded after every memtable flush and a new one is started. SSTables are compacted and merged using size-tiered (based on segment size) and leveled (based on age of data) compaction techniques. Compactions are slow and expensive, and can eat up a significant portion of the database's I/O bandwidth, slowing down writes and starving reads as well.
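The write-ahead log protocol can be sketched as follows; the one-record-per-line format is an assumption for illustration, and a production engine would fsync the log rather than merely flush it:

```python
class WriteAheadLog:
    def __init__(self, path="wal.log"):
        self.path = path
        self.f = open(path, "a")

    def append(self, key, value):
        self.f.write(f"{key},{value}\n")   # logged BEFORE the memtable is touched
        self.f.flush()                     # a real engine would fsync here

    def replay(self):
        """Rebuild the memtable contents lost in a crash."""
        entries = {}
        with open(self.path) as f:
            for line in f:
                key, value = line.rstrip("\n").split(",", 1)
                entries[key] = value
        return entries
```

After each successful memtable flush, the engine would truncate this log and start a new one.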

Storage engines, such as HBase and Cassandra, that are based on this principle of merging and compacting sorted files are often called Log Structured Merge (LSM) storage engines.

B-Trees

Like SSTables, B-trees keep key-value pairs sorted by key, but instead of segments many megabytes in size, the underlying file is broken into small fixed-size blocks (pages) of around 4 KB, in line with the underlying hardware. The pages are connected like a linked list, and a tree is constructed on top of them.

The non-leaf nodes of the B-tree contain information about the range of keys beneath each child. To arrive at a record, we start at the root node and traverse downwards, narrowing the range at each step and moving towards the child whose sub-range contains the key we are looking for.

The leaf-level nodes either contain the data itself inline or contain references to the page blocks that hold the record we are interested in. The number of references to child pages in one page of the B-tree is called the branching factor (usually in the hundreds).
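That traversal can be sketched with Python's bisect module; the node layout here (separator keys plus child pointers in internal nodes, inline values in leaves) is a simplified assumption:

```python
import bisect

class Node:
    def __init__(self, keys, children=None, values=None):
        self.keys = keys            # sorted separator keys (internal) or record keys (leaf)
        self.children = children    # len(keys) + 1 child pointers, internal nodes only
        self.values = values        # inline records, leaf nodes only

def search(node, key):
    while node.children is not None:               # internal node: narrow the range
        i = bisect.bisect_right(node.keys, key)    # child whose sub-range holds `key`
        node = node.children[i]
    i = bisect.bisect_left(node.keys, key)         # leaf: exact match or a miss
    if i < len(node.keys) and node.keys[i] == key:
        return node.values[i]
    return None

leaf1 = Node(keys=["a", "c"], values=[1, 3])
leaf2 = Node(keys=["m", "t"], values=[13, 20])
root = Node(keys=["m"], children=[leaf1, leaf2])
print(search(root, "c"), search(root, "t"))   # 3 20
```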

Modifying a record requires the entire block to be read, updated and written back, causing a high degree of read and write amplification. B-trees remain balanced, and a four-level tree of 4 KB pages with a branching factor of 500 can store up to 256 TB. Like SSTables, B-trees also maintain a write-ahead (redo) log to enable recovery after a crash.
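That capacity figure checks out with a quick calculation, counting the data pages reachable through four levels of references and using decimal terabytes:

```python
branching, page_kb, levels = 500, 4, 4
pages = branching ** levels                   # data pages reachable from the root
capacity_tb = pages * page_kb * 1024 / 1e12   # bytes -> decimal terabytes
print(f"{capacity_tb:.0f} TB")                # -> 256 TB
```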

B-tree-based databases, such as Oracle and MySQL, are an excellent choice for use cases that require ACID guarantees, since to support atomicity and consistency the updates have to be in place, and writes to a key should preferably go via a single writer (concurrency control). For example, inventory updates during the checkout process of a merchant, or money movement in a financial application. For range queries too, B-trees give better performance than SSTables. Lastly, B-trees are the best fit for a complex relational data model that requires joins over a multitude of primary indexes to arrive at the final result set.

B-trees these days come with several optimizations; here are some of them:

  • Instead of a redo log, some B-trees use a copy-on-write scheme: the modified block is written to a new location and references are simply swapped.
  • Long keys in non-leaf nodes are abbreviated, since we only need enough range information to arrive at the right leaf node.
  • Leaf pages are laid out sequentially on disk for faster access.
  • B-trees also allow horizontal traversal by way of a skiplist-like linked structure, enabling quicker access to neighbouring nodes.
  • Some B-trees are actually fractal trees, enabling much faster insertions and deletions.

On the performance front, writes are generally slower due to write amplification, but reads (especially range queries over indexes) are relatively fast. B-trees also have some disadvantages, such as disk-space fragmentation from the unused space left when a page is split or when a row cannot fit into an existing page.

Inverted Index

In search indexes such as Elasticsearch, an index is composed of shards, and each shard is further broken down into segments. Each segment is an inverted index in itself: essentially a mapping from each term to the IDs of the documents containing it, along with the term frequency.
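A toy sketch of what one segment's inverted index contains; doc IDs here are simply positions in a list, and a real segment stores this structure on disk in compressed form:

```python
from collections import Counter, defaultdict

def build_inverted_index(docs):
    """docs: list of strings. Returns term -> [(doc_id, term_frequency), ...]."""
    index = defaultdict(list)
    for doc_id, text in enumerate(docs):
        for term, freq in Counter(text.lower().split()).items():
            index[term].append((doc_id, freq))
    return index

docs = ["red shoes", "red red dress", "blue shoes"]
index = build_inverted_index(docs)
print(index["red"])    # [(0, 1), (1, 2)]
print(index["shoes"])  # [(0, 1), (2, 1)]
```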

Postings list
  • Inside the postings list of each segment, documents are given an identifier between 0 and the number of documents in the segment (up to 2³¹ − 1).
  • Postings lists are split into blocks of 256 doc IDs, and each block is compressed separately using delta encoding and bit packing, an encoding technique called Frame of Reference (FOR); see the sketch after this list.
  • The maximum number of bits required to store the deltas in a block is computed and added to the block header; all deltas of the block are then encoded using that number of bits.
  • Searches, i.e. term queries and filters, are blazingly fast; the outcome is an iterator over a postings list from the inverted index.
  • Segments undergo a compaction process similar to the one described earlier, although merging search segments is much slower than merging SSTables.
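Here is a sketch of the Frame of Reference idea on a single block: doc IDs are delta-encoded, the bit width of the largest delta becomes the block header, and every delta is packed at that width. This shows the principle only; Lucene's actual on-disk encoding differs in detail:

```python
def for_encode_block(doc_ids):
    """Delta-encode a sorted block of doc IDs and bit-pack the deltas using
    the minimum width that fits the largest delta."""
    deltas = [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]
    bits = max(d.bit_length() for d in deltas)    # block header: bits per delta
    packed = 0
    for d in reversed(deltas):                    # pack fixed-width fields into one int
        packed = (packed << bits) | d
    return bits, len(deltas), packed

def for_decode_block(bits, count, packed):
    deltas = [(packed >> (i * bits)) & ((1 << bits) - 1) for i in range(count)]
    doc_ids, total = [], 0
    for d in deltas:                              # prefix sums undo the delta encoding
        total += d
        doc_ids.append(total)
    return doc_ids

bits, count, packed = for_encode_block([73, 100, 101, 130])
print(bits)                                   # 7 bits per ID instead of 32
print(for_decode_block(bits, count, packed))  # [73, 100, 101, 130]
```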

Bitsets

  • Bitwise operations are optimized at the CPU circuitry level, so an in-memory bitwise AND is orders of magnitude faster than parsing postings lists and performing intersections to arrive at a result set (see the sketch after this list).
  • The first time a filter executes, the postings lists have to be parsed, but thereafter the outcome is cached in the form of a BitSet. BitSets are reusable in other contexts as well. They are stored on a per-segment basis, and segments are immutable: once written to disk, they never change.
  • These cached bitsets are “smart”: they are updated incrementally. As you index new documents, only those new documents need to be added to the existing bitsets.
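A sketch of those cached per-segment bitsets, using Python integers as bit vectors; the filters and doc IDs are made up for illustration:

```python
def to_bitset(doc_ids):
    """Cache a filter's result: bit i is set iff doc i matches."""
    bits = 0
    for doc_id in doc_ids:
        bits |= 1 << doc_id
    return bits

def from_bitset(bits):
    return [i for i in range(bits.bit_length()) if bits >> i & 1]

# Hypothetical cached filter results for one immutable segment.
color_red  = to_bitset([0, 2, 5, 9])
brand_acme = to_bitset([2, 3, 9])

both = color_red & brand_acme     # a single CPU-friendly AND, no postings parsing
print(from_bitset(both))          # [2, 9]
```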

Needless to say, search indexes excel at deep querying; queries with frequently occurring terms are especially fast. Evidently, keeping the index updated after a write is not a constant-time operation: writes are very expensive and require replacement of the indexed document itself instead of an in-place update. Apart from free-text search, search indexes are best suited for use cases that require filtering over a large volume of data, for example, querying the right product collection from a catalogue of, say, 20 million products by filtering over price, brand, colour, size, weight, model and so on.

In Memory DBs

In addition to databases that rely on disk storage, there are purely in-memory DBs such as VoltDB, MemSQL and Oracle TimesTen for relational use cases, and Memcached for caching key-value pairs. Redis and Couchbase provide weak durability by writing to disk asynchronously. In-memory DBs are faster primarily because they avoid deserializing records fetched from disk. They also provide data models, such as queues and sets, that are difficult to implement with disk-based indexes.

To conclude…

Beyond the four covered here, there are many other data formats and storage schemes, but most of them are variants of these. Knowledge of the underlying storage engine is helpful in determining the right database for a given volume of data, data model, access patterns and read-write performance needs. I hope this post gives a good summary of what sits under the hood of most modern databases and helps you make the right choice.
