As one of the most widely used indexing data structures, the B+-tree [3] powers almost all relational database management systems (RDBMSs) today. Recently, the log-structured merge tree (LSM-tree) [5] has attracted significant interest as a contender to the B+-tree, because its data structure can enable higher storage space usage efficiency and lower write amplification. Thanks to these two well-perceived advantages, LSM-trees have been adopted by many NoSQL products (e.g., Google’s BigTable, Cassandra, and RocksDB). We investigated whether these two advantages of LSM-tree over B+-tree still stand in the presence of new storage hardware with built-in transparent compression.
Hardware-based data compression is becoming increasingly available on modern storage devices and infrastructure: commercial solid-state drives (SSDs) with built-in transparent compression are emerging [8], most all-flash arrays (AFAs) [2, 4, 6] transparently compress their data, and cloud vendors have started to integrate hardware-based compression capability into their storage infrastructure (e.g., Microsoft Corsica [1]). Fig. 1 illustrates an SSD with built-in transparent compression: its controller chip performs compression/decompression on each 4 KB LBA (logical block address) block along the I/O path, which is transparent to the host that accesses the SSD as a normal block device through a standard interface (e.g., NVMe).
To allow the host to materialize the benefits of in-storage transparent compression, such SSDs can expose a sparse LBA space that is much larger than their internal physical storage capacity. Moreover, since repeated data patterns (e.g., all-zero or all-one) can be highly compressed, the host can store sparse data content (i.e., an arbitrary amount of valid data) in each 4 KB LBA block without wasting physical storage space, as illustrated in Fig. 2. Therefore, when using SSDs with built-in transparent compression, we can employ sparse on-disk data structures without inflating the true physical storage cost, which creates a new spectrum of design space for data management systems [9].
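To make this concrete, the following minimal sketch emulates, on the host side, what happens to a sparsely filled 4 KB block. Here zlib is used purely as a stand-in for the drive's built-in compressor (an assumption for illustration); in reality the compression runs inside the SSD controller and is invisible to the host.

```python
# Host-side emulation of why a sparsely filled 4 KB LBA block costs little
# physical space under transparent compression. zlib is only a proxy for the
# drive's built-in compressor (assumption for illustration).
import os
import zlib

BLOCK_SIZE = 4096  # one LBA block

def make_sparse_block(payload: bytes) -> bytes:
    """Place a small payload in a 4 KB block and zero-fill the remainder."""
    assert len(payload) <= BLOCK_SIZE
    return payload + b"\x00" * (BLOCK_SIZE - len(payload))

payload = os.urandom(512)                 # 512 B of valid (incompressible) data
block = make_sparse_block(payload)

print(len(block))                         # 4096: logical footprint on the LBA space
print(len(zlib.compress(block)))          # roughly the payload size: the zero padding compresses away
```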
We first compare the storage cost of B+-tree and LSM-tree. A B+-tree manages its data storage in the unit of pages. To reduce data storage cost, a B+-tree can apply block compression algorithms (e.g., lz4, zlib, and ZSTD) to compress each on-storage page (e.g., the page compression feature in MySQL and MongoDB/WiredTiger). In addition to the obvious CPU overhead, B+-tree page compression suffers from compression ratio loss due to the 4 KB-alignment constraint, which can be explained as follows: storage devices serve I/O requests in the unit of 4 KB LBA blocks. As a result, each B+-tree page (whether compressed or not) must entirely occupy one or multiple 4 KB LBA blocks on the storage device. When a B+-tree applies page compression, the 4 KB-alignment constraint can incur noticeable storage space waste, as illustrated in Fig. 3: suppose one 16 KB B+-tree page is compressed to 5 KB; the compressed page must still occupy two LBA blocks (i.e., 8 KB) on the storage device, wasting 3 KB of storage space. Therefore, due to the CPU overhead and the storage space waste caused by the 4 KB-alignment constraint, B+-tree page compression is not widely used in production environments. Meanwhile, under workloads with random writes, B+-tree pages tend to be only 50%∼80% full [3]. Hence, B+-tree typically has low storage space usage efficiency.
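The arithmetic behind this example can be sketched as follows; the page and block sizes are those used in the text, and the helper function is purely illustrative.

```python
# Illustrative arithmetic for the 4 KB-alignment penalty of B+-tree page
# compression, using the 16 KB -> 5 KB example from the text.
import math

LBA_BLOCK = 4 * 1024

def on_disk_footprint(compressed_page_bytes: int) -> int:
    """A compressed page must still occupy whole 4 KB LBA blocks."""
    return math.ceil(compressed_page_bytes / LBA_BLOCK) * LBA_BLOCK

compressed = 5 * 1024
footprint = on_disk_footprint(compressed)   # 8192 bytes, i.e., two LBA blocks
wasted = footprint - compressed             # 3072 bytes (3 KB) wasted by alignment
print(footprint, wasted)
```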
Compared with B+-tree, LSM-tree has higher storage space usage efficiency, which can be explained as follows: an LSM-tree consists of multiple immutable SSTable files, each of which contains a number of blocks (the typical block size is 4 KB). Because SSTable files are immutable, all blocks can be 100% full (i.e., completely filled with user data). When applying compression to reduce the storage cost, an LSM-tree compresses each block individually and packs the compressed blocks tightly together inside SSTable files (i.e., compressed blocks are not subject to the 4 KB-alignment constraint). For demonstration, we use RocksDB and WiredTiger as representatives of LSM-tree and B+-tree, and run random write-only workloads with 128-byte record size over a 150 GB dataset. For WiredTiger, we set its B+-tree leaf page size to 8 KB. Results show that RocksDB and WiredTiger occupy 218 GB and 280 GB on the storage device, respectively, which clearly demonstrates the better storage space usage efficiency of LSM-tree.
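The sketch below illustrates this tight packing: compressed blocks are simply concatenated and located through a small offset index. The layout is a deliberately simplified illustration, not RocksDB's actual SSTable format.

```python
# Simplified illustration of packing per-block compressed data tightly inside
# an SSTable-like file (no 4 KB alignment), with an (offset, length) index so
# each block can still be read and decompressed independently. This is not
# RocksDB's real on-disk format.
import zlib

BLOCK_SIZE = 4096

def build_sstable(blocks):
    """Compress each 4 KB block and concatenate the results back to back."""
    data, index = bytearray(), []
    for block in blocks:
        compressed = zlib.compress(block)
        index.append((len(data), len(compressed)))
        data += compressed                     # no padding to a 4 KB boundary
    return bytes(data), index

def read_block(data, index, i):
    offset, length = index[i]
    return zlib.decompress(data[offset:offset + length])

blocks = [bytes([i]) * BLOCK_SIZE for i in range(8)]    # eight fully packed blocks
data, index = build_sstable(blocks)
assert read_block(data, index, 3) == blocks[3]
print(len(data))    # far less than 8 * 4096: compressed blocks are tightly packed
```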
Nevertheless, when running on storage hardware with built-in transparent compression, the storage cost advantage of LSM-tree over B+-tree largely diminishes. As illustrated in Fig. 4, as long as B+-tree pages fill their unused space with highly compressible content such as all-zeros, in-storage transparent compression eliminates the storage cost penalty of the less-compact B+-tree page data structure. Moreover, in-storage transparent compression is not subject to the 4 KB-alignment constraint, i.e., all the compressed data blocks are tightly packed in NAND flash memory, as shown in Fig. 4. Therefore, in-storage transparent compression seamlessly mitigates the storage usage efficiency drawback of B+-tree, which closes the storage cost gap between B+-tree and LSM-tree. For example, we ran RocksDB and WiredTiger on an SSD with built-in transparent compression [8] under the same configuration as above; the results show that in-storage transparent compression reduces the storage cost of RocksDB from 218 GB to 129 GB and the storage cost of WiredTiger from 280 GB to 104 GB. The SSD achieves a higher compression ratio on WiredTiger because its less-compact B+-tree page data structure results in higher data compressibility. RocksDB has a higher post-compression storage cost mainly because of its inherent space amplification. The results demonstrate that in-storage transparent compression effectively closes the storage cost gap between B+-tree and LSM-tree.
Comparing B+-tree and LSM-tree in terms of write amplification is more complicated and strongly depends on runtime workload characteristics. B+-tree can have (much) lower write amplification than LSM-tree when (i) the B+-tree has a very large cache memory (e.g., enough to hold most of or the entire dataset) and uses very large redo log files, or (ii) the average record size is large (e.g., 512 B and above). This explains why the LSM-tree is typically used by systems that target much-larger-than-memory datasets with relatively small average record sizes.
We are interested in whether in-storage transparent compression could help close the B+-tree vs. LSM-tree write amplification gap under those LSM-tree-friendly workloads (i.e., larger-than-memory datasets with small record sizes). In this context, the write amplification of B+-tree is dominated by dirty page flushes. For example, if a B+-tree modifies a 32 B record within one 8 KB page and then flushes this dirty page from memory to storage, the write amplification is 8 KB / 32 B = 256. Even if the page can be compressed by the storage hardware at 4:1 (reducing the write amplification to 64), it is still much larger than the typical write amplification (e.g., ∼20) of an LSM-tree. Therefore, to close the B+-tree vs. LSM-tree write amplification gap, the B+-tree implementation must be modified to take better advantage of in-storage transparent compression. We present a solution below.
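For reference, the back-of-the-envelope calculation above can be written out as:

```python
# Back-of-the-envelope write amplification of flushing a whole B+-tree page
# after a small in-place record update, matching the numbers in the text.
PAGE_SIZE = 8 * 1024      # bytes
RECORD_SIZE = 32          # bytes

wa = PAGE_SIZE / RECORD_SIZE        # 256
wa_with_4to1_compression = wa / 4   # 64, still far above ~20 for a typical LSM-tree
print(wa, wa_with_4to1_compression)
```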
Our solution is motivated by a simple observation: for a B+-tree page, let Δ denote the difference between its in-memory image and its on-storage image. If this difference is significantly smaller than the page size, we can largely reduce the write amplification by logging the page modification Δ to the storage device, instead of writing the entire in-memory page image. Unfortunately, when a B+-tree runs on normal storage devices without built-in transparent compression, this approach is not practical due to significant operational overhead: given the 4 KB block I/O interface, we must coalesce multiple Δ’s from different pages into one 4 KB LBA block in order to materialize the write amplification reduction. To amplify the gain, page modification logging should be applied multiple times to each page before the process is reset by writing the up-to-date page image to storage. Accordingly, multiple Δ’s associated with the same page spread over multiple 4 KB blocks on the storage device, which increases the data management complexity and the page read latency. As a result, this simple design concept has not been used by real-world B+-tree implementations reported in the open literature.
Storage hardware with built-in transparent compression for the first time makes this simple idea practically viable. By applying the sparse data structure enabled by such storage hardware, we no longer have to coalesce multiple Δ’s from different pages into the same 4 KB LBA block. As illustrated in Fig. 5, each B+-tree page is associated with one dedicated 4 KB LBA block as its modification logging space to store its own Δ, which we refer to as localized page modification logging. Under the 4 KB I/O interface, to realize the proposed page modification logging, the B+-tree writes D = [Δ, O] (where O is an all-zero vector and |D| is 4 KB) to the 4 KB modification log block associated with the page. Inside the storage device, all the zeros in D are compressed away and only the compressed version of Δ is physically stored. This evidently leads to much lower write amplification than the conventional practice of writing each dirty page to storage in its entirety, no matter how many bytes in the page have truly changed. By dedicating one 4 KB modification logging block to each B+-tree page, we do not incur extra B+-tree storage management complexity. Of course, this design is subject to a longer page load latency, which fortunately is not significant for two reasons: (i) the B+-tree only needs to read one additional 4 KB LBA block that is contiguous with the page in the LBA space, so it issues only a single read request to the storage device; (ii) compared with fetching data from storage, it takes much less time to reconstruct the updated in-memory page from Δ.
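The write and read paths of this scheme can be sketched as follows. The dev block-device wrapper (with read(lba, nbytes) and write(lba, data)) and the (offset, length, payload) delta encoding are hypothetical simplifications for illustration, not the B--tree implementation evaluated below; the only layout assumption, taken from the description above, is that each page's 4 KB log block is contiguous with the page in the LBA space.

```python
# Sketch of localized page modification logging, assuming each 8 KB page has a
# dedicated 4 KB log block placed right after it in the LBA space. The delta
# format and the `dev` block-device interface are hypothetical.
PAGE_SIZE = 8 * 1024
LBA_BLOCK = 4 * 1024

def log_page_delta(dev, page_lba, delta):
    """Write D = [delta, zero padding] to the page's dedicated log block.
    In-storage compression shrinks the zero padding to almost nothing."""
    assert len(delta) <= LBA_BLOCK
    log_lba = page_lba + PAGE_SIZE // LBA_BLOCK        # log block right after the page
    dev.write(log_lba, delta + b"\x00" * (LBA_BLOCK - len(delta)))

def load_page(dev, page_lba):
    """Read page + log block with one contiguous request, then replay the delta."""
    raw = dev.read(page_lba, PAGE_SIZE + LBA_BLOCK)    # a single read request
    return apply_delta(raw[:PAGE_SIZE], raw[PAGE_SIZE:])

def apply_delta(page, log):
    """Hypothetical delta encoding: a sequence of (2-byte offset, 2-byte length,
    payload) records, terminated by a zero-length record (i.e., the zero padding)."""
    page = bytearray(page)
    pos = 0
    while pos + 4 <= len(log):
        off = int.from_bytes(log[pos:pos + 2], "little")
        length = int.from_bytes(log[pos + 2:pos + 4], "little")
        if length == 0:
            break
        page[off:off + length] = log[pos + 4:pos + 4 + length]
        pos += 4 + length
    return bytes(page)
```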
For demonstration purposes, we implemented a B+-tree that incorporates the above method for reducing write amplification; the resulting implementation is referred to as B--tree. For comparison, RocksDB and WiredTiger were used as representatives of LSM-tree and normal B+-tree. In all the experiments, each record is generated by filling half of its content with zeros and the other half with random bytes, in order to mimic realistic runtime data compressibility. Fig. 6 shows the measured write amplification under a 500 GB dataset, where all cases use 15 GB of cache memory. In each experiment, we use 1, 2, 4, 8, or 16 client threads to cover a wide range of runtime workload concurrency. Compared with RocksDB, WiredTiger has a much larger write amplification, while B--tree essentially closes the B+-tree vs. LSM-tree write amplification gap. For example, in the case of 32 B records and 4 client threads, the write amplification of RocksDB is 38, while the write amplification of WiredTiger is 268 under 8 KB page size and 530 under 16 KB page size, i.e., 7.1x and 13.9x that of RocksDB, respectively. In comparison, the write amplification of B--tree is 28 under 8 KB page size (only 73.7% of RocksDB’s write amplification) and 36 under 16 KB page size (almost the same as RocksDB).
We further compare the total storage usage in terms of both logical storage usage on the LBA space (i.e., before in-storage compression) and physical flash memory usage (i.e., after in-storage compression). Using the same 500 GB dataset, the logical storage usage of RocksDB, WiredTiger, and B--tree is 728 GB, 934 GB, and 1,548 GB, respectively. After in-storage compression, the physical flash memory usage of RocksDB, WiredTiger, and B--tree is 431 GB, 347 GB, and 452 GB, respectively. Since the LSM-tree has a more compact data structure than B+-tree, RocksDB has a smaller logical storage usage than the other two. Since B--tree allocates one 4 KB block for each page to implement localized modification logging, its logical storage usage is much larger than that of WiredTiger. WiredTiger consumes less physical flash memory than RocksDB (because of the space amplification of an LSM-tree) and than B--tree (because of the storage overhead caused by page modification logging). Due to that overhead, the physical storage usage of B--tree (452 GB) is only about 5% larger than that of RocksDB (431 GB).
Finally, Fig. 7 shows the measured speed performance under three different workloads: (a) random point reads, (b) random range scans, and (c) random point writes. As shown in Fig. 7(a), the normal B+-tree (i.e., WiredTiger) has the best point read TPS (transactions per second), while RocksDB and B--tree achieve almost the same random point read TPS. For example, under 16 client threads, WiredTiger achieves 71K TPS, while RocksDB and B--tree achieve 57K TPS, about 19.7% less than WiredTiger. Fig. 7(b) shows the measured TPS when running random range scan queries, where each range scan covers 100 consecutive records. Compared with the case of random point reads, WiredTiger and B--tree have a noticeably smaller difference in range scan throughput. This is because the two overheads of B--tree (i.e., fetching an extra 4 KB block and in-memory page reconstruction) are amortized over the records covered by each range scan. In comparison, RocksDB has noticeably worse range scan throughput than the other two, because a range scan invokes reads over all the levels of the LSM-tree, leading to high read amplification. Fig. 7(c) shows the performance under random point write workloads. The random write performance of both B+-tree and LSM-tree is fundamentally limited by write amplification. Therefore, by significantly reducing the write amplification, B--tree achieves much higher write throughput. As shown in Fig. 7(c), B--tree achieves 19% higher write throughput than RocksDB and about 2.1x higher write throughput than WiredTiger. More comprehensive experimental results are available in our FAST’22 paper [7].
Modern storage hardware with built-in transparent compression allows data management systems to employ sparse on-disk data structures without sacrificing the true physical data storage cost. This opens a new but largely unexplored spectrum of opportunities to innovate data management software design. As one small step towards exploring this design spectrum, we showed that B+-tree can nicely leverage such modern storage hardware to close the gap with its recent contender, the LSM-tree, in terms of storage cost and write amplification. This work suggests that the arrival of such modern storage hardware warrants revisiting the role and comparison of B+-tree and LSM-tree in future data management systems.