|
BSDCon 2002 Paper
[BSDCon '02 Tech Program Index]
Running "fsck" in the BackgroundMarshall Kirk McKusick, Author and ConsultantAbstractTraditionally, recovery of a BSD fast filesystem after an uncontrolled system crash such as a power failure or a system panic required the use of the filesystem checking program, "fsck". Because the filesystem cannot be used while it is being checked by "fsck", a large server may experience unacceptably long periods of unavailability after a crash.Rather than write a new version of "fsck" that can run on an active filesystem, I have added the ability to take a snapshot of a filesystem partition to create a quiescent filesystem on which a slightly modified version of the traditional "fsck" can run. A key feature of these snapshots is that they usually require filesystem write activity to be suspended for less than one second. The suspension time is independent of the size of the filesystem. To reduce the number and types of corruption, soft updates were added to ensure that the only filesystem inconsistencies are lost resources. With these two additions it is now possible to bring the system up immediately after a crash and then run checks to reclaim the lost resources on the active filesystems. Background "fsck" runs by taking a snapshot and then running its traditional first four passes to calculate the correct bitmaps for the allocations in the filesystem snapshot. From these bitmaps, "fsck" finds any lost resources and invokes special system calls to reclaim them in the underlying active filesystem. 1. Background and IntroductionTraditionally, recovery of the BSD fast filesystem [McKusick, et al., 1996] after an uncontrolled system crash such as a power failure or a system panic required the use of the "fsck" filesystem checking program [McKusick & Kowalski, 1994]. "fsck" would inspect all the filesystem metadata (bitmaps, inodes, and directories) and correct any inconsistencies that were found. As the metadata comprises about three to five percent of the disk space, checking a large filesystem can take an hour or longer. Because the filesystem is inaccessible while it is being checked by "fsck", a large server may experience unacceptably long periods of unavailability after a crash.Many methods exist for solving the metadata consistency and recovery problem [Ganger et al, 2000; Seltzer et al, 2000]. The solution selected for the fast filesystem is soft updates. Soft updates control the ordering of filesystem updates such that the only inconsistencies in the on-disk representation are that free blocks and inodes may be claimed as "in use" in the on-disk bitmaps when they are really unused. Soft updates eliminates the need to run "fsck" after a system crash except in the rare event of a hardware failure in the disk on which the filesystem resides. Since the only filesystem inconsistency is lost blocks and inodes, the only ill effect from running on the filesystem after a crash is that part of the filesystem space that should have been available will be lost. Any time the lost space from crashes reaches an unacceptable level, the filesystem must be taken offline long enough to run "fsck" to reclaim the lost resources. Because many server systems need to be available 24x7, there is never an available multi-hour time when a traditional version of "fsck" can be run. Thus, I have been motivated to develop a filesystem check program that can run on an active filesystem to reclaim the lost resources. The first approach that I considered was to write a new utility that would operate on an active filesystem to identify the lost resources and reclaim them. Such a utility would fall into the class of real-time garbage collection. Despite much research and literature, garbage collection of actively used resources remains challenging to implement. In addition to the complexity of real-time garbage collection, the new utility would need to recreate much of the functionality of the existing "fsck" utility. Having a second utility would lead to additional maintenance complexity as bugs or feature additions would need to be implemented in two utilities rather than just one. To avoid the complexity and maintenance problems, I decided to take an approach that would enable me to use most of the existing "fsck" program. "fsck" runs on the assumption that the filesystem it is checking is quiescent. To create an apparently quiescent filesystem, I added the ability to take a snapshot of a filesystem partition [McKusick & Ganger, 1999]. Using copy-on-write, a snapshot provides a filesystem image frozen at a specific point in time. The next section describes the details on how snapshots are taken and managed. As lost resources will not be used until they have been found and marked as free in the bitmaps, there are no limits on how long a background reclamation scheme can take to find and recover them. Thus, I can run the traditional "fsck" over a snapshot to find all the missing resources even if it takes several hours and the filesystem is being actively changed. Snapshots may seem a more complex solution to the problem of running space reclamation on an active filesystem than a more straight forward garbage-collection utility. However, their cost (about 1300 lines of code) is amortized over the other useful functionality: the ability to do reliable dumps of active filesystems and the ability to provide back-ups of the filesystem at several times during the day. This functionality was first popularized by Network Appliance [Hitz et al, 1994; Hutchinson et al, 1999]. Snapshots enable the safe backup of live filesystems. When dump notices that it is being asked to dump a mounted filesystem, it can simply take a snapshot of the filesystem and run over the snapshot instead of on the live filesystem. When dump completes, it releases the snapshot. Periodic snapshots can be made accessible to users. When taken every few hours during the day, they allow users to retrieve a file that they wrote several hours earlier and later deleted or overwrote by mistake. Snapshots are much more convenient to use than dump tapes and can be created much more frequently. To make a snapshot accessible to users through a traditional filesystem interface, BSD uses the memory-disk driver, md. The mdconfig command takes a snapshot file as input and produces a /dev/md0 character-device interface to access it. The /dev/md0 character device can then be used as the input device for a standard BSD FFS mount command, allowing the snapshot to appear as a replica of the frozen filesystem at whatever location in the namespace that the system administrator chooses to mount it. Thus, the following script could be run at noon to create a midday backup of the /usr filesystem and make it available at /backups/usr/noon:
# Take the snapshot mount -u -o snapshot /usr/snap.noon /usr # Attach it to /dev/md0 mdconfig -a -t vnode -u 0 -f /usr/snap.noon # Mount it for user access mount -r /dev/md0 /backups/usr/noonWhen no longer needed, it can be removed with: # Unmount snapshot umount /backups/usr/noon # Detach it from /dev/md0 mdconfig -d -u 0 # Delete the snapshot rm -f /usr/snap.noon 2. Creating a Filesystem Snapshot![]() Figure 1: Structure of a snapshot file A filesystem snapshot is a frozen image of a filesystem at a given instant in time. Implementing snapshots in the BSD fast filesystem has proven to be straightforward. Taking a snapshot entails the following steps:
The splitting of the bitmap copies between steps two and six is the way that I avoid having the suspend time be a function of the size of the filesystem. By making our primary copy pass while the filesystem is still active, and then having only a few cylinder groups in need of recopying after it has been suspended, I keep the suspend time down to a small and usually filesystem size independent time. The details of the two-pass algorithm are as follows. Before starting the copy and scan of all the cylinder groups, the snapshot code allocates a "progress" bitmap whose size is equal to the number of cylinder groups in the filesystem. The purpose of the "progress" bitmap is to keep track of which cylinder groups have been scanned. Initially, all the bits in the "progress" map are cleared. The first pass is completed in step two before the filesystem is suspended. In this first pass, all the cylinder groups are scanned. When the clinder group is read, its corresponding bit is set in the "progress" bitmap. The cylinder group is then copied and its block map is consulted to update the snapshot file as described in step two. Since the filesystem is still active, filesystem blocks may be allocated and freed while the cylinder groups are being scanned. Each time a cylinder group is updated because of a block being allocated or freed, its corresponding bit in the "progress" bitmap is cleared. Once this first pass over the cylinder groups is completed, the filesystem is "suspended." Step six now becomes the second pass of the algorithm. The second pass need only identify and update the snapshot for any cylinder groups that were modified after it handled them in the first pass. The changed cylinder groups are identified by scanning the "progress" bitmap and rescanning any cylinder groups whose bits are zero. Although every bitmap would have to be reprocessed in the worst case, in practice only a few bitmaps need to be recopied and checked. 3. Maintaining a Filesystem SnapshotEach time an existing block in the filesystem is modified, the filesystem checks whether that block was in use at the time that the snapshot was taken (i.e., it is not marked "not used"). If so, and if it has not already been copied (i.e., it is still marked "not copied"), a new block is allocated from among the "not used" blocks and placed in the snapshot file to replace the "not copied" entry. The previous contents of the block are copied to the newly allocated snapshot file block, and the modification to the original is then allowed to proceed. Whenever a file is removed, the snapshot code inspects each of the blocks being freed and claims any that were in use at the time of the snapshot. Those blocks marked "not used" are returned to the free list.When a snapshot file is read, reads of blocks marked "not copied" return the contents of the corresponding block in the filesystem. Reads of blocks that have been copied return the contents in the copied block (e.g., the contents that were stored at that location in the filesystem at the time that the snapshot was taken). Writes to snapshot files are not permitted. When a snapshot file is no longer needed, it can be removed in the same way as any other file; its blocks are simply returned to the free list and its inode is zeroed and returned to the free inode list. Snapshots may live across reboots. When a snapshot file is created, the inode number of the snapshot file is recorded in the superblock. When a filesystem is mounted, the snapshot list is traversed and all the listed snapshots are activated. The only limit on the number of snapshots that may exist in a filesystem is the size of the array in the superblock that holds the list of snapshots. Currently, this array can hold up to twenty snapshots. Multiple snapshot files can exist concurrently. As described above, earlier snapshot files would appear in later snapshots. If an earlier snapshot is removed, a later snapshot would claim its blocks rather than allowing them to be returned to the free list. This semantic means that it would be impossible to free any space on the filesystem except by removing the newest snapshot. To avoid this problem, the snapshot code goes through and expunges all earlier snapshots by changing its view of them to being zero length files. With this technique, the freeing of an earlier snapshot releases the space held by that snapshot. When a block is overwritten, all snapshots are given an opportunity to copy the block. A copy of the block is made for each snapshot in which the block resides. Overwrites typically occur only for inode and directory blocks. File data is usually not overwritten. Instead, a file will be truncated and then reallocated as it is rewritten. Thus, the slow and I/O intensive block copying is infrequent. Deleted blocks are handled differently. The list of snapshots is consulted. When a snapshot is found in which the block is active ("not copied"), the deleted block is claimed by that snapshot. The traversal of the snapshot list is then terminated. Other snapshots for which the block are active are left with an entry of "not copied" for that block. The result is that when they access that location, they will still reference the deleted block. Since snapshots may not be modified, the block will not change. Since the block is claimed by a snapshot, it will not be allocated to another use. If the snapshot claiming the deleted block is deleted, the remaining snapshots will be given the opportunity to claim the block. Only when none of the remaining snapshots wants to claim the block (i.e., it is marked "not used" in all of them) will it be returned to the freelist. 4. Snapshot PerformanceThe experiments described in this section and the background "fsck" performance section used the following hardware/software configuration:Computer: Dual Processor using two Celeron 350MHz CPUs. The machine has 256Mb of main memory. O/S: FreeBSD 5.0-current as of December 30, 2001 I/O Controller: Adaptec 2940 Ultra2 SCSI adapter
Disk: Two Small Filesystem: 0.5Gb, 8K block, 1K fragment, 90% full, 70874 files, initial snapshot size 0.392Mb (0.08% of filesystem space). Large Filesystem: 7.7Gb, 16K block, 2K fragment, 90% full, 520715 files, initial snapshot size 2.672Mb (0.03% of filesystem space). Load: Four continuously running simultaneous Andrew benchmarks that create a moderate amount of filesystem activity intermixed with periods of CPU intensive activity [Howard et al, 1988].
5. Implementation of Background "fsck"Background "fsck" runs by taking a snapshot then running its traditional algorithms over the snapshot. Because the snapshot is taken of a completely quiescent filesystem, all of whose dirty blocks have been written to disk, the snapshot appears to "fsck" to be exactly like an unmounted raw disk partition. "fsck" runs in five passes that can be summarized as follows:
Five operations were implemented in the kernel:
The final step after a successful background "fsck" run is to update the filesystem status in the superblock. There are two flags in the superblock that track the state of a filesystem. The first is the "clean" flag that is set when a filesystem is unmounted (by the system administrator or at system shutdown) and cleared while it is mounted with writing enabled. The second is the "unclean-at-mount" flag that is described below. The "clean" flag is used by the traditional "fsck" to decide which filesystems need to be checked. Those filesystems with the flag set are skipped; those filesystems with the flag clear are checked. Following a successful check, the "clean" flag is set. Before soft updates, the kernel did not allow unclean filesystems (e.g., filesystems with the "clean" flag cleared) to be mounted for writing as the corruption could cause the system to panic. The "unclean-at-mount" flag was added as part of soft updates. Unclean filesystems running with soft updates are safe to mount with writing permitted. However, the system needs to remember that some cleanup may be required. Thus, the "unclean-at-mount" flag gets set when an unclean filesystem is mounted (e.g., mounted with writing enabled without the "clean" flag having been set). The "unclean-at-mount" flag serves two purposes. First, when a filesystem with the "unclean-at-mount" is unmounted, the "clean" flag is not set to show that cleaning is still required. Second it tracks the filesystems that need cleaning. By the time that background "fsck" is run, all the filesystems are mounted so none will have their "clean" flag set. Thus, the "unclean-at-mount" flag is used by the background "fsck" to distinguish which filesystems need to be checked. Those filesystems with the flag set are checked; those filesystems with the flag clear are skipped. So, the final step after a successful run of a background "fsck" is to clear the "unclean-at-mount" bit in the superblock (using operation 1) so that the filesystem will be marked "clean" when it is unmounted by the system administrator or at system shutdown. 6. Operation of Background "fsck"Traditionally, "fsck" is invoked before the filesystems are mounted and all checks are synchronously done to completion at that time. If background checking is available, "fsck" is invoked twice. It is first invoked at the traditional time, before the filesystems are mounted, with the -F flag to do checking on all the filesystems that cannot do background checking. Filesystems that require traditional checking are those that are not running with soft updates and those that will not be mounted at system startup (e.g., those marked "noauto" in /etc/fstab). It is then invoked a second time, after the system has completed going multiuser, with the -B flag to do checking on all the filesystems that can do background checking. Unlike the foreground checking, the background checking is started asynchronously so that other system activity can proceed even on the filesystems that are being checked.The "fsck" program is really just a front end that reads the /etc/fstab file and determines which filesystems need to be checked. For each filesystem to be checked, the appropriate back end is invoked. For the fast filesystem, "fsck_ffs" is invoked. If "fsck" is invoked with neither the -F nor the -B flag, it runs in traditional mode and checks every listed filesystem. Otherwise it is invoked with the -F to request that it run in foreground mode. In foreground mode, the check program for each filesystem is invoked with the -F flag to determine whether it wishes to run as part of the boot up sequence, or if it is able to do its job in background after the system is up and running. A non-zero exit code indicates that it wants to run in foreground and it is invoked again with neither the -F nor the -B flag so that it will run in its traditional mode. A zero exit code indicates that it is able to run later in background and just a deferred message is printed. The "fsck" program attempts to run as many checks in parallel as possible. Typically it can run a check on each disk in parallel. After the system has gone multiuser, "fsck" is invoked with the -B flag to request that it run in background mode. The check program for each filesystem is invoked with the -F flag to determine whether it wishes to run as part of the boot up sequence, or if it is able to do its job in background after the system is up and running. A non-zero exit code indicates that it wanted to run in foreground that is assumed to have been done, so the filesystem is skipped. A zero exit code indicates that it is able to run in background so the check program is invoked with the -B flag to indicate that a check on the active filesystem should be done. When running in background mode, only one filesystem at a time will be checked. To further reduce the load on the system, the background check is typically run at a nice value of plus four. The "fsck_ffs" program does the actual checking of the fast filesystem. When invoked with the -F flag, "fsck_ffs" determines whether the filesystem needs to be checked immediately in foreground or if its checking can be deferred to background. To be eligible for background checking it must have been running with soft updates, not have been marked as needing a foreground check, and be mounted and writable when the background check is to be done. If these conditions are met, then "fsck_ffs" exits with a zero exit status. Otherwise it exits with a non-zero exit status. If the filesystem is clean, it will exit with a non-zero exit status so that the clean status of the filesystem can be verified and reported during the foreground checks. Note that, when invoked with the -F flag, no checking is done. The only thing that "fsck_ffs" does is to determine whether a foreground or background check is needed and exit with an appropriate status code. When "fsck_ffs" is invoked with the -B flag, a check is done on the specified and possibly active filesystem. The potential set of corrections is limited to those available when running in preen mode (as further detailed in the previous section). If unexpected errors are found, the filesystem is marked as needing a foreground check and "fsck_ffs" exits without attempting any further checking. 7. Background "fsck" PerformanceOver many years of tuning and refinement, "fsck" has been optimized to minimize and cluster its I/O requests. Further, its data structures have been tuned to the point where it consumes little CPU time and thus its running time is totally dominated by the time that it takes to do the needed I/O. The performance of background "fsck" is almost identical to that of the traditional "fsck". Since it is using the same algorithms, the number and pattern of its I/O requests are identical to the traditional program. Though the update requests are done through kernel calls rather than direct writes to the disk, the kernel calls typically execute the same (or through the benefits of soft updates) slightly fewer disk writes. The main added delay of background "fsck" is the time required to take and remove a snapshot of the filesystem. The time for the snapshot operation has already been covered in section 4 above.Because of "fsck"'s I/O clustering, it is capable of using nearly all the bandwidth of a disk. Although background "fsck" only checks one filesystem partition at a time (as compared to traditional "fsck" that checks all separate disks containing filesystems in parallel), even a single instance of "fsck" can cause seriously increased latency to processes trying to access files on the filesystem (or anything else on the same disk) that is being checked. As there is no urgency in completing the space reclamation, background "fsck" is usually run at lower priority than other processes. The usual way to reduce priority is to nice the process to some positive value which results in it getting a lower priority for the CPU. Because "fsck" is nearly completely I/O bound, giving it a lower CPU priority has almost no effect on the time in which it runs and hence in its rate of issuing I/O requests. As a general solution to reducing the resource usage of I/O bound processes such as background "fsck", a small change has been made to the disk strategy routine. When an I/O request is posted, the disk strategy routine first checks whether the process is running at a positive nice. If it is, and there are any other outstanding I/O requests for the disk, the process is put to sleep for nice hundredths of a second. Thus, a process running at a nice of four will sleep for forty millisecond each time it makes a disk I/O request. Such a process will be able to do at most twenty-five disk I/O requests per second - about a third of the bandwidth of a current technology disk. At the maximum nice value of twenty, a process is limited to five I/O requests per second which is low enough to be almost unnoticeable by other processes competing for access to the disk. Because the slowdown is imposed only when there are other outstanding disk requests, I/O bound processes can run at full speed on an otherwise idle system.
Table 6 shows the times to run background "fsck" on a filesystem that is otherwise idle. It is running with a nice value of zero. It is the only process active on the system, so represents a lower bound on the time that a background disk check can be done. Note that its running time is only slightly greater than would be expected by adding the time to take and remove a snapshot (see Table 1 and Table 3) to the running time of the traditional "fsck" shown in Table 5.
Table 7 shows the times to run background "fsck" on a filesystem that has four processes concurrently writing to it. It is running with a nice value of four. Note that its running time increases by a factor of ten as it is yielding to the other running processes. By contrast, its effect on the other processes is minimal as their aggregate throughput is slowed by less than ten percent. 8. Conclusions and Future WorkThis paper has described how to take snapshots of the fast filesystem with suspension intervals typically less than one second and independent of the size of the filesystem being snapshotted. When running with soft updates, the only filesystem corruption that occurs is the loss of inodes and data blocks. Using snapshots together with a slightly modified version of the traditional "fsck" it is possible to recover the lost inodes and data blocks while the filesystem is in active use. While a background "fsck" can run in about the same amount of time as a traditional "fsck", it is generally desirable to run it at a lower priority so that it causes less slowdown on other processes on the system.Snapshots may seem a more complex solution to the problem of running space reclamation on an active filesystem than a more straight forward garbage-collection utility. However, their cost (about 1300 lines of code) is amortized over the other useful functionality: the ability to do reliable dumps of active filesystems and the ability to provide back-ups of the filesystem at several times during the day. For the future, I need to gain experience with using background "fsck", to gain confidence in its robustness, and to find the optimal priority to minimize its slowdown on the system while still finishing its job in a reasonable amount of time. 9. Current StatusSnapshots and background "fsck" have been running on FreeBSD 5.0 systems since April 2001. All the relevant code including snapshots, the gating functions, the system call additions for the use of "fsck", and the changes to "fsck" itself are available as open-source under a Berkeley-style copyright.10. AcknowledgmentsI thank Rob Kolstad for his helpful comments and review of drafts of this paper. He pointed out several non-obvious gaps in the coverage. Thanks also to Matthew Dillon, Ian Dowse, Peter Wemm, and many other FreeBSD developers that put up with my kernel breakage and helped track down the bugs.11. References
12. BiographyDr. Marshall Kirk McKusick writes books and articles, consults, and teaches classes on UNIX- and BSD-related subjects. While at the University of California at Berkeley, he implemented the 4.2BSD fast filesystem, and was the Research Computer Scientist at the Berkeley Computer Systems Research Group (CSRG) overseeing the development and release of 4.3BSD and 4.4BSD. His particular areas of interest are the virtual-memory system and the filesystem. He earned his undergraduate degree in Electrical Engineering from Cornell University, and did his graduate work at the University of California at Berkeley, where he earned Masters degrees in Computer Science and Business Administration, and a doctorate in Computer Science. He is a past president and present board member of the Usenix Association, and is a member of ACM and IEEE.In his spare time, he enjoys swimming, scuba diving, and wine collecting. The wine is stored in a specially constructed wine cellar (accessible from the web at www.mckusick.com/~mckusick/) in the basement of the house that he shares with Eric Allman, his domestic partner of 20-and-some-odd years. You can contact him via email at mckusick@mckusick.com. |
This paper was originally published in the
Proceedings of the BSDCon '02 Conference on File and Storage Technologies, February 11-14, 2002, Cathedral Hill Hotel, San Francisco, California, USA.
Last changed: 28 Dec. 2001 ml |
|