Designing Database Storage and Retrieval

All you need to know about database storage, indexes and data retrieval

Shilpi Gupta
Towards Data Science

--

Photo from https://unsplash.com/

A database is an integral part of any application design where processes store and manage data. Although you may never need to design a database from scratch, understanding how databases are designed and how different databases handle data will help you choose the datastore that provides the scalability and performance you need. For instance, a database suitable for handling transactional data may not be the right fit for a data warehouse.

Let’s start designing a database. We will begin with the simplest approach and then optimize it step by step until it becomes awesome.

The simplest approach is to keep key-value pairs in a file. A write just appends the record to the end of the file. If a segment (or file) exceeds the maximum allowed size, start writing to the next segment. A background process merges these segments, removing duplicate entries and keeping only the latest value for each key. This process is called compaction. The only problem with this approach is that a read has to scan entire segments to find the last value written for a key, possibly across multiple segments, which increases read time. Let’s try to solve this problem. One way is to build indexes on the keys.
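The append, read, and compaction steps above can be sketched in a few lines of Python. This is a minimal in-memory illustration (segments as lists rather than real files), not a production design:

```python
# Append-only key-value segments with compaction.
# Segments are modeled as in-memory lists of (key, value) pairs.

def append(segment, key, value):
    segment.append((key, value))          # a write is just an append

def read(segments, key):
    # Scan segments newest-to-oldest, records newest-to-oldest,
    # so the most recently written value wins.
    for segment in reversed(segments):
        for k, v in reversed(segment):
            if k == key:
                return v
    return None

def compact(segments):
    latest = {}
    for segment in segments:              # later entries overwrite earlier ones
        for k, v in segment:
            latest[k] = v
    return [list(latest.items())]         # one merged, de-duplicated segment

seg = []
append(seg, "a", 1)
append(seg, "a", 2)                       # newer write shadows the older one
assert read([seg], "a") == 2
```

Note how `read` has to scan backwards through every record in the worst case; that is exactly the cost an index is meant to remove.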

Hash Indexes

A simple way of creating an index is to use a hash map. You can keep an in-memory hash table that maps each key to its byte offset, i.e., the location where the record can be found in the segment. Each segment has its own in-memory hash map.

Look up in Hash indexes

A write creates an entry in the hash table and appends the record to the latest segment. A read goes through the hash tables to find the location of the record.
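Here is a hedged sketch of a hash index over a single append-only file: an in-memory dict maps each key to the byte offset of its latest record. The file path and the `key,value` line format are illustrative assumptions, not a real database's layout:

```python
# Hash index: key -> byte offset of the latest record in the segment file.
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "segment.log")
index = {}

def db_set(key, value):
    with open(path, "ab") as f:
        f.seek(0, os.SEEK_END)            # records are only ever appended
        index[key] = f.tell()             # remember where this record starts
        f.write(f"{key},{value}\n".encode())

def db_get(key):
    offset = index.get(key)
    if offset is None:
        return None
    with open(path, "rb") as f:
        f.seek(offset)                    # jump straight to the record
        k, _, v = f.readline().decode().rstrip("\n").partition(",")
        return v

db_set("user:1", "alice")
db_set("user:1", "bob")                   # index now points at the newer record
assert db_get("user:1") == "bob"
```

The read no longer scans the file; it seeks directly to the offset stored in the hash map. But the whole `index` dict must fit in memory, which leads to the cons below.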

This approach has some cons:
1. The index is kept in memory, so it is not useful when the number of keys is large and exceeds the available memory.
2. It is not optimized for range queries.

SSTables and LSM trees

Sorted String Tables (SSTables) contain data sorted by key, and each key appears only once in a segment. With keys in sorted order you don’t need to index every key in a hash map; you can store offsets for only a few records. A sorted segment has other benefits too: it allows compressing a block before writing, and it makes the compaction process easier, since segments can be merged using a simple merge-sort algorithm.

Indexing and compression in SSTables

The next question is: how do we keep the keys in sorted order within a segment? Maintaining a sorted structure is much easier in memory, using a self-balancing tree such as an AVL tree.

  1. When a write comes in, add it to an in-memory balanced tree (known as the memtable).
  2. When the memtable grows beyond a threshold, read its values and write them to disk. As they are already in sorted order, the segment will have its keys sorted.
  3. During a read, first check the memtable and then the most recent segments.

But what if the system crashes before the memtable is written to disk? To solve this, use logs: append every write to a log on disk. If the system crashes, the log can be replayed to reconstruct the memtable.
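The three steps plus the write-ahead log can be sketched as a tiny LSM write path. A plain dict sorted at flush time stands in for the balanced tree, and the threshold of 2 is deliberately tiny for illustration; both are assumptions of this sketch:

```python
# LSM write path: log every write, buffer it in a memtable,
# flush the memtable as a sorted segment once it grows too large.
MEMTABLE_LIMIT = 2                        # tiny threshold for illustration

wal, memtable, segments = [], {}, []

def put(key, value):
    wal.append((key, value))              # durable log first, for crash recovery
    memtable[key] = value
    if len(memtable) >= MEMTABLE_LIMIT:
        segments.append(sorted(memtable.items()))  # flush in key order
        memtable.clear()
        wal.clear()                       # log no longer needed once flushed

def get(key):
    if key in memtable:                   # check the memtable first...
        return memtable[key]
    for segment in reversed(segments):    # ...then segments, newest first
        for k, v in segment:
            if k == key:
                return v
    return None

put("b", 2)
put("a", 1)                               # hits the limit, triggers a flush
assert get("a") == 1
assert segments[0] == [("a", 1), ("b", 2)]  # flushed segment is sorted
```

After a crash, replaying `wal` in order against an empty memtable would rebuild exactly the unflushed state.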

The log-structured merge-tree, or LSM tree, is a data structure designed to provide low-cost indexing for files experiencing a high rate of inserts (and deletes) over an extended period. It is based on the idea of SSTables that are merged in the background. Different segments may contain values for the same key; these segments are compacted in the background to remove duplicates, saving space and read time.

B Trees

A B-tree is a self-balancing tree data structure that maintains sorted data. Most other self-balancing search trees (like AVL and red-black trees) assume that everything fits in main memory. To understand the use of B-trees, think of a huge amount of data that cannot fit in main memory. When the number of keys is high, data is read from disk in blocks, and disk access time is very high compared to main-memory access time. The main idea of B-trees is to reduce the number of disk accesses. B-trees support searches, sequential access, insertions, and deletions in logarithmic time.

Like SSTables, B-trees also keep the keys in order. The difference is in design philosophy: LSM trees are log-structured indexes that write data into a series of segments, whereas B-trees are page-structured indexes that read or write one page (or disk block) at a time. A B-tree breaks the datastore into fixed-size blocks or pages. Leaf nodes can contain the actual value for a key or a reference to where the value is stored. The image below shows how a B-tree lookup works.

Lookup for key = 272 in B tree

A read traverses the tree to find the page where the key is stored. A write first finds the page where the value needs to go and then writes it there. If there is not enough space in that page, the write may cause a page split and updates to the parent node.
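A B-tree lookup like the `key = 272` example above can be sketched as a descent from page to page. This is a minimal read-only sketch (B+-tree style, with values only in leaves); the page contents are made up for illustration:

```python
# B-tree page lookup: each page holds sorted keys; internal pages hold
# child references, leaf pages hold the values.
import bisect

class Page:
    def __init__(self, keys, children=None, values=None):
        self.keys = keys                  # sorted keys within this page
        self.children = children          # None for leaf pages
        self.values = values              # leaf pages hold the values

def search(page, key):
    i = bisect.bisect_left(page.keys, key)
    if page.children is None:             # leaf: the key is here or nowhere
        if i < len(page.keys) and page.keys[i] == key:
            return page.values[i]
        return None
    if i < len(page.keys) and page.keys[i] == key:
        i += 1                            # key equals a separator: go right
    return search(page.children[i], key)  # descend one page per step

leaf1 = Page([100, 200], values=["x", "y"])
leaf2 = Page([272, 300], values=["z", "w"])
root = Page([250], children=[leaf1, leaf2])
assert search(root, 272) == "z"           # root -> right child -> found
```

Each recursive call corresponds to one page (disk block) read, which is why a high branching factor keeps lookups to only a handful of disk accesses.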

To make writes resilient to crashes, B-trees also use logs: before any write, first record it in a log on disk, then apply it to the B-tree.

Comparison

  1. Read/write performance: Log-structured indexes like LSM trees allow faster writes, while page-structured indexes like B-trees allow faster reads. Reads in an LSM tree are comparatively slow because they may need to check several data structures and SSTables, depending on the compaction state.
  2. Redundancy: With B-trees, each key is stored in exactly one place, whereas with log-structured indexes it may be stored in multiple segments.
  3. Storage: B-trees may leave some disk space unused due to fragmentation, i.e., when a row can’t fit in the available page space. The compaction process in LSM trees reduces fragmentation.
  4. An LSM tree needs a well-tuned compaction strategy to merge segments for good read performance, and compaction should not interfere with ongoing reads and writes. This can become a problem when write throughput is high and limited disk bandwidth is shared between ongoing writes and the threads performing compaction.

B-trees remain the more popular choice for maintaining indexes.

All other indexes, such as secondary indexes and multi-column indexes, extend the approaches above. The only difference is that the key is not unique for these indexes: each key can have multiple records associated with it.
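The non-unique case is a small twist on what we already have: each key maps to a list of matching record locations instead of a single one. A minimal sketch, with made-up example records:

```python
# Secondary (non-unique) index: one key -> many record locations.
from collections import defaultdict

records = [("alice", "NYC"), ("bob", "SF"), ("carol", "NYC")]

city_index = defaultdict(list)            # city -> row ids; keys not unique
for row_id, (_, city) in enumerate(records):
    city_index[city].append(row_id)

assert city_index["NYC"] == [0, 2]        # two records share the same key
```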

Conclusion

This article provided insights into how databases store and retrieve data. The approaches above may not be optimized for a data warehouse, which stores bulk data for analytics. I’m keeping storage and retrieval for data warehouses for the next article!

To read more about databases and application design, I’d highly recommend going through ‘Designing Data Intensive Applications’ by Martin Kleppmann.

Hope you liked the article!


https://www.onepercentbetter.dev/ | Software Developer| Previously @ Adobe| Expedia| Google WTM Scholar’17 | Computer Engineering’18 from DTU (aka DCE)