As one of the most widely used indexing data structures, the B+-tree [3] powers almost all relational database management systems (RDBMSs) today. Recently, the log-structured merge tree (LSM-tree) [5] has attracted significant interest as a contender to the B+-tree, because its data structure can enable higher storage space usage efficiency and lower write amplification. Thanks to these two well-perceived advantages, LSM-trees have been adopted by many NoSQL products (e.g., Google’s BigTable, Cassandra, and RocksDB). We investigated whether these two advantages of LSM-tree over B+-tree still stand in the presence of new storage hardware with built-in transparent compression.
Hardware-based data compression is becoming increasingly available on modern storage devices and infrastructure: commercial solid-state drives (SSDs) with built-in transparent compression are emerging [8], most all-flash arrays (AFAs) [2, 4, 6] transparently compress their data, and cloud vendors have started to integrate hardware-based compression capability into their storage infrastructure (e.g., Microsoft Corsica [1]). Fig. 1 illustrates an SSD with built-in transparent compression: its controller chip performs compression/decompression on each 4 KB LBA (logical block address) block along the I/O path, which is transparent to the host that accesses the SSD as a normal block device through a standard interface (e.g., NVMe).
To allow the host to materialize the benefits of in-storage transparent compression, such SSDs can expose a sparse LBA space that is much larger than their internal physical storage capacity. Moreover, since repeated data patterns (e.g., all-zero or all-one) can be highly compressed, the host can store sparse data content (i.e., an arbitrary amount of valid data) in each 4 KB LBA block without wasting physical storage space, as illustrated in Fig. 2. Therefore, when using SSDs with built-in transparent compression, we can employ sparse on-disk data structures without inflating the true physical storage cost, which creates a new spectrum of design space for data management systems [9].
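To make this concrete, the following minimal sketch emulates, on the host side, what happens to a sparsely filled 4 KB block. Here zlib is used purely as a stand-in for the drive's built-in compressor (an assumption for illustration); in reality the compression runs inside the SSD controller and is invisible to the host.

```python
# Host-side emulation of why a sparsely filled 4 KB LBA block costs little
# physical space under transparent compression. zlib is only a proxy for the
# drive's built-in compressor (assumption for illustration).
import os
import zlib

BLOCK_SIZE = 4096  # one LBA block

def make_sparse_block(payload: bytes) -> bytes:
    """Place a small payload in a 4 KB block and zero-fill the remainder."""
    assert len(payload) <= BLOCK_SIZE
    return payload + b"\x00" * (BLOCK_SIZE - len(payload))

payload = os.urandom(512)                 # 512 B of valid (incompressible) data
block = make_sparse_block(payload)

print(len(block))                         # 4096: logical footprint on the LBA space
print(len(zlib.compress(block)))          # roughly the payload size: the zero padding compresses away
```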
We first compare the storage cost of B+-tree and LSM-tree. A B+-tree manages its data storage in the unit of pages. To reduce data storage cost, a B+-tree can apply block compression algorithms (e.g., lz4, zlib, and ZSTD) to compress each on-storage page (e.g., the page compression feature in MySQL and MongoDB/WiredTiger). In addition to the obvious CPU overhead, B+-tree page compression suffers from compression ratio loss due to the 4 KB-alignment constraint, which can be explained as follows: storage devices serve I/O requests in the unit of 4 KB LBA blocks. As a result, each B+-tree page (whether compressed or not) must entirely occupy one or multiple 4 KB LBA blocks on the storage device. When a B+-tree applies page compression, the 4 KB-alignment constraint can incur noticeable storage space waste, as illustrated in Fig. 3: suppose one 16 KB B+-tree page is compressed to 5 KB; the compressed page must still occupy two LBA blocks (i.e., 8 KB) on the storage device, wasting 3 KB of storage space. Therefore, due to the CPU overhead and the storage space waste caused by the 4 KB-alignment constraint, B+-tree page compression is not widely used in production environments. Meanwhile, under workloads with random writes, B+-tree pages tend to be only 50%∼80% full [3]. Hence, B+-tree typically has low storage space usage efficiency.
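The arithmetic behind this example can be sketched as follows; the page and block sizes are those used in the text, and the helper function is purely illustrative.

```python
# Illustrative arithmetic for the 4 KB-alignment penalty of B+-tree page
# compression, using the 16 KB -> 5 KB example from the text.
import math

LBA_BLOCK = 4 * 1024

def on_disk_footprint(compressed_page_bytes: int) -> int:
    """A compressed page must still occupy whole 4 KB LBA blocks."""
    return math.ceil(compressed_page_bytes / LBA_BLOCK) * LBA_BLOCK

compressed = 5 * 1024
footprint = on_disk_footprint(compressed)   # 8192 bytes, i.e., two LBA blocks
wasted = footprint - compressed             # 3072 bytes (3 KB) wasted by alignment
print(footprint, wasted)
```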
Compared with B+-tree, LSM-tree has higher storage space usage efficiency, which can be explained as follows: an LSM-tree consists of multiple immutable SSTable files, each of which contains a number of blocks (the typical block size is 4 KB). Because SSTable files are immutable, all blocks can be 100% full (i.e., completely filled with user data). When applying compression to reduce the storage cost, an LSM-tree compresses each block individually and packs the compressed blocks tightly together inside SSTable files (i.e., compressed blocks are not subject to the 4 KB-alignment constraint). For demonstration, we use RocksDB and WiredTiger as representatives of LSM-tree and B+-tree, and run random write-only workloads with 128-byte record size over a 150 GB dataset. For WiredTiger, we set its B+-tree leaf page size to 8 KB. Results show that RocksDB and WiredTiger occupy 218 GB and 280 GB on the storage device, respectively, which clearly demonstrates the better storage space usage efficiency of LSM-tree.
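The sketch below illustrates this tight packing: compressed blocks are simply concatenated and located through a small offset index. The layout is a deliberately simplified illustration, not RocksDB's actual SSTable format.

```python
# Simplified illustration of packing per-block compressed data tightly inside
# an SSTable-like file (no 4 KB alignment), with an (offset, length) index so
# each block can still be read and decompressed independently. This is not
# RocksDB's real on-disk format.
import zlib

BLOCK_SIZE = 4096

def build_sstable(blocks):
    """Compress each 4 KB block and concatenate the results back to back."""
    data, index = bytearray(), []
    for block in blocks:
        compressed = zlib.compress(block)
        index.append((len(data), len(compressed)))
        data += compressed                     # no padding to a 4 KB boundary
    return bytes(data), index

def read_block(data, index, i):
    offset, length = index[i]
    return zlib.decompress(data[offset:offset + length])

blocks = [bytes([i]) * BLOCK_SIZE for i in range(8)]    # eight fully packed blocks
data, index = build_sstable(blocks)
assert read_block(data, index, 3) == blocks[3]
print(len(data))    # far less than 8 * 4096: compressed blocks are tightly packed
```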
Nevertheless, when running on storage hardware with built-in transparent compression, the storage cost advantage of LSM-tree over B+-tree largely diminishes. As illustrated in Fig. 4, as long as B+-tree pages fill their unused space with highly compressible content such as all-zeros, in-storage transparent compression eliminates the storage cost penalty of the less-compact B+-tree page data structure. Moreover, in-storage transparent compression is not subject to the 4 KB-alignment constraint, i.e., all the compressed data blocks are tightly packed in NAND flash memory, as shown in Fig. 4. Therefore, in-storage transparent compression seamlessly mitigates the storage usage efficiency drawback of B+-tree, which closes the storage cost gap between B+-tree and LSM-tree. For example, we ran RocksDB and WiredTiger on an SSD with built-in transparent compression [8] under the same configuration as above; the results show that in-storage transparent compression reduces the storage cost of RocksDB from 218 GB to 129 GB and the storage cost of WiredTiger from 280 GB to 104 GB. The SSD achieves a higher compression ratio on WiredTiger because its less-compact B+-tree page data structure results in higher data compressibility. RocksDB has a higher post-compression storage cost mainly because of its inherent space amplification. The results demonstrate that in-storage transparent compression effectively closes the storage cost gap between B+-tree and LSM-tree.
Comparing B+-tree and LSM-tree in terms of write amplification is more complicated and strongly depends on runtime workload characteristics. B+-tree can have (much) lower write amplification than LSM-tree when (i) the B+-tree has a very large cache memory (e.g., enough to hold most of or the entire dataset) and uses very large redo log files, or (ii) the average record size is large (e.g., 512 B and above). This explains why the LSM-tree is typically used by systems that target much-larger-than-memory datasets with relatively small average record sizes.
We are interested in whether in-storage transparent compression could help close the B+-tree vs. LSM-tree write amplification gap under those LSM-tree-friendly workloads (i.e., larger-than-memory datasets with small record sizes). In this context, the write amplification of B+-tree is dominated by dirty page flushes. For example, if a B+-tree modifies a 32 B record within one 8 KB page and then flushes this dirty page from memory to storage, the write amplification is 8 KB / 32 B = 256. Even if the page can be compressed by the storage hardware at 4:1 (reducing the write amplification to 64), it is still much larger than the typical write amplification (e.g., ∼20) of an LSM-tree. Therefore, to close the B+-tree vs. LSM-tree write amplification gap, the B+-tree implementation must be modified to take better advantage of in-storage transparent compression. We present a solution below.
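For reference, the back-of-the-envelope calculation above can be written out as:

```python
# Back-of-the-envelope write amplification of flushing a whole B+-tree page
# after a small in-place record update, matching the numbers in the text.
PAGE_SIZE = 8 * 1024      # bytes
RECORD_SIZE = 32          # bytes

wa = PAGE_SIZE / RECORD_SIZE        # 256
wa_with_4to1_compression = wa / 4   # 64, still far above ~20 for a typical LSM-tree
print(wa, wa_with_4to1_compression)
```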
Our solution is motivated by a simple observation: for a B+-tree page, let Δ denote the difference between its in-memory image and its on-storage image. If this difference is significantly smaller than the page size, we can largely reduce the write amplification by logging the page modification Δ to the storage device, instead of writing the entire in-memory page image. Unfortunately, when a B+-tree runs on normal storage devices without built-in transparent compression, this approach is not practical due to significant operational overhead: given the 4 KB block I/O interface, we must coalesce multiple Δ’s from different pages into one 4 KB LBA block in order to materialize the write amplification reduction. To amplify the gain, page modification logging should be applied multiple times to each page before the process is reset by writing the up-to-date page image to storage. Accordingly, multiple Δ’s associated with the same page spread over multiple 4 KB blocks on the storage device, which increases the data management complexity and the page read latency. As a result, this simple design concept has not been used by real-world B+-tree implementations reported in the open literature.
Storage hardware with built-in transparent compression for the first time makes this simple idea practically viable. By applying the sparse data structure enabled by such storage hardware, we no longer have to coalesce multiple Δ’s from different pages into the same 4 KB LBA block. As illustrated in Fig. 5, each B+-tree page is associated with one dedicated 4 KB LBA block as its modification logging space to store its own Δ, which we refer to as localized page modification logging. Under the 4 KB I/O interface, to realize the proposed page modification logging, the B+-tree writes D = [Δ, O] (where O is an all-zero vector and |D| is 4 KB) to the 4 KB modification log block associated with the page. Inside the storage device, all the zeros in D are compressed away and only the compressed version of Δ is physically stored. This evidently leads to much lower write amplification than the conventional practice of writing each dirty page to storage in its entirety, no matter how many bytes in the page have truly changed. By dedicating one 4 KB modification logging block to each B+-tree page, we do not incur extra B+-tree storage management complexity. Of course, this design is subject to a longer page load latency, which fortunately is not significant for two reasons: (i) the B+-tree only needs to read one additional 4 KB LBA block that is contiguous with the page in the LBA space, so it issues only a single read request to the storage device; (ii) compared with fetching data from storage, it takes much less time to reconstruct the updated in-memory page from Δ.
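The write and read paths of this scheme can be sketched as follows. The dev block-device wrapper (with read(lba, nbytes) and write(lba, data)) and the (offset, length, payload) delta encoding are hypothetical simplifications for illustration, not the B--tree implementation evaluated below; the only layout assumption, taken from the description above, is that each page's 4 KB log block is contiguous with the page in the LBA space.

```python
# Sketch of localized page modification logging, assuming each 8 KB page has a
# dedicated 4 KB log block placed right after it in the LBA space. The delta
# format and the `dev` block-device interface are hypothetical.
PAGE_SIZE = 8 * 1024
LBA_BLOCK = 4 * 1024

def log_page_delta(dev, page_lba, delta):
    """Write D = [delta, zero padding] to the page's dedicated log block.
    In-storage compression shrinks the zero padding to almost nothing."""
    assert len(delta) <= LBA_BLOCK
    log_lba = page_lba + PAGE_SIZE // LBA_BLOCK        # log block right after the page
    dev.write(log_lba, delta + b"\x00" * (LBA_BLOCK - len(delta)))

def load_page(dev, page_lba):
    """Read page + log block with one contiguous request, then replay the delta."""
    raw = dev.read(page_lba, PAGE_SIZE + LBA_BLOCK)    # a single read request
    return apply_delta(raw[:PAGE_SIZE], raw[PAGE_SIZE:])

def apply_delta(page, log):
    """Hypothetical delta encoding: a sequence of (2-byte offset, 2-byte length,
    payload) records, terminated by a zero-length record (i.e., the zero padding)."""
    page = bytearray(page)
    pos = 0
    while pos + 4 <= len(log):
        off = int.from_bytes(log[pos:pos + 2], "little")
        length = int.from_bytes(log[pos + 2:pos + 4], "little")
        if length == 0:
            break
        page[off:off + length] = log[pos + 4:pos + 4 + length]
        pos += 4 + length
    return bytes(page)
```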
For demonstration purposes, we implemented a B+-tree that incorporates the above method for reducing write amplification; the resulting implementation is referred to as B--tree. For comparison, RocksDB and WiredTiger were used as representatives of LSM-tree and normal B+-tree. In all the experiments, each record is generated by filling half of its content with zeros and the other half with random bytes, in order to mimic realistic runtime data compressibility. Fig. 6 shows the measured write amplification under a 500 GB dataset, where all cases use 15 GB of cache memory. In each experiment, we use 1, 2, 4, 8, or 16 client threads to cover a wide range of runtime workload concurrency. Compared with RocksDB, WiredTiger has a much larger write amplification, while B--tree essentially closes the B+-tree vs. LSM-tree write amplification gap. For example, in the case of 32 B records and 4 client threads, the write amplification of RocksDB is 38, while the write amplification of WiredTiger is 268 under 8 KB page size and 530 under 16 KB page size, i.e., 7.1x and 13.9x that of RocksDB, respectively. In comparison, the write amplification of B--tree is 28 under 8 KB page size (only 73.7% of RocksDB’s write amplification) and 36 under 16 KB page size (almost the same as RocksDB).
We further compare the total storage usage in terms of both logical storage usage on the LBA space (i.e., before in-storage compression) and physical flash memory usage (i.e., after in-storage compression). Using the same 500 GB dataset, the logical storage usage of RocksDB, WiredTiger, and B--tree is 728 GB, 934 GB, and 1,548 GB, respectively. After in-storage compression, the physical flash memory usage of RocksDB, WiredTiger, and B--tree is 431 GB, 347 GB, and 452 GB, respectively. Since the LSM-tree has a more compact data structure than B+-tree, RocksDB has a smaller logical storage usage than the other two. Since B--tree allocates one 4 KB block for each page to implement localized modification logging, its logical storage usage is much larger than that of WiredTiger. WiredTiger consumes less physical flash memory than RocksDB (because of the space amplification of an LSM-tree) and than B--tree (because of the storage overhead caused by page modification logging). Due to that overhead, the physical storage usage of B--tree (452 GB) is only about 5% larger than that of RocksDB (431 GB).
Finally, Fig. 7 shows the measured speed performance under three different workloads: (a) random point reads, (b) random range scans, and (c) random point writes. As shown in Fig. 7(a), the normal B+-tree (i.e., WiredTiger) has the best point read TPS (transactions per second), while RocksDB and B--tree achieve almost the same random point read TPS. For example, under 16 client threads, WiredTiger achieves 71K TPS, while RocksDB and B--tree achieve 57K TPS, about 19.7% less than WiredTiger. Fig. 7(b) shows the measured TPS when running random range scan queries, where each range scan covers 100 consecutive records. Compared with the case of random point reads, WiredTiger and B--tree have a noticeably smaller difference in range scan throughput. This is because the two overheads of B--tree (i.e., fetching an extra 4 KB block and in-memory page reconstruction) are amortized over the records covered by each range scan. In comparison, RocksDB has noticeably worse range scan throughput than the other two, because a range scan invokes reads over all the levels of the LSM-tree, leading to high read amplification. Fig. 7(c) shows the performance under random point write workloads. The random write performance of both B+-tree and LSM-tree is fundamentally limited by write amplification. Therefore, by significantly reducing the write amplification, B--tree achieves much higher write throughput. As shown in Fig. 7(c), B--tree achieves 19% higher write throughput than RocksDB and about 2.1x higher write throughput than WiredTiger. More comprehensive experimental results are available in our FAST’22 paper [7].
Modern storage hardware with built-in transparent compression allows data management systems to employ sparse on-disk data structures without sacrificing the true physical data storage cost. This opens a new but largely unexplored spectrum of opportunities to innovate data management software design. As one small step towards exploring this design spectrum, we showed that B+-tree can nicely leverage such modern storage hardware to close the gap with its recent contender, the LSM-tree, in terms of storage cost and write amplification. This work suggests that the arrival of such modern storage hardware warrants revisiting the role and comparison of B+-tree and LSM-tree in future data management systems.