Figure 3: Maintaining the virtual log. (a) New map entry sectors are appended to a backward chain. (b) Implementing the backward chain as a tree to support map entry overwriting. |
We implement the indirection map as a table. To keep the map persistent, we leverage the low latency offered by eager writing. Whenever an update takes place, we write the piece of the table that contains the new map entry to a free sector near the disk head. Suppose the file system addresses the disk at sector granularity and each map entry consumes four to eight bytes, the indirection map will consume a storage overhead between one to two percent of the disk capacity. We will discuss how to further reduce this storage overhead in the next section. If a transaction includes multiple data blocks and their map entries do not fall in the same map sector, then multiple map sectors may need to be written. Although the alternative of logging the multiple map entries using a single sector may better amortize the cost of map entry updates, it requires garbage collecting the obsolete map entries and is not used in our current design.
To recover the map after a failure, we must be able to identify the locations of the map sectors that are scattered throughout the disk due to eager writing. One way to accomplish this is to thread these sectors together to form a log. We term this a virtual log because its components are not necessarily physically contiguous. Because eager writing prevents us from predicting the location of the next map entry, we cannot maintain forward pointers in the map entries. Instead, we chain the map entries backward as shown in Figure 3a. Note that we can adapt this technique to a disk that supports a header per block, in which case a map entry including the backward pointer can be placed in the header.
As map entries are overwritten, the backward chain will accumulate obsolete sectors over time. We cannot simply reuse these obsolete sectors because doing so will break the backward chain. Our solution is to implement the backward linked list as a tree as shown in Figure 3b. Whenever an existing map entry is overwritten, a new log tail is introduced as the new tree root. One branch of the root points to the previous root; the other points to the map sector following the overwritten map entry. The overwritten sector can be recycled without breaking the virtual log. As Section 3.3 will show, we can keep the entire virtual log in disk memory during normal operation. Consequently, overwriting a map entry requires only one disk I/O to create the new log tail.
To recover the virtual log without scanning the disk, we must remember the location of the log tail. Modern disk drives use residual power to park their heads in a landing zone at the outer perimeter of the disks prior to spinning down the drives. It is easy to modify the firmware so that the drive records the current log tail location at a fixed location on disk before it parks the actuator[20, 31]. To be sure of the validity of the content stored at this fixed location, we can protect it with a checksum and clear it after recovery. In the extremely rare case when this power down sequence fails, we can detect the failure by computing the checksum and resort to scanning the disk for cryptographically signed map entries to retrieve the log tail.
With a stable log tail, recovering the virtual log is straightforward. We start at the log tail as the root and traverse the tree on the frontier based on age. Obsolete log entries can be recognized as such because their updated versions are younger and traversed earlier.