Lakshmi Ganesh, Hakim Weatherspoon, Mahesh Balakrishnan, Ken Birman
Computer Science Department, Cornell University
{lakshmi, hweather, mahesh, ken}@cs.cornell.edu
Any disk power management scheme essentially attempts to exploit one fact: disks can be run in a high-power or a low-power mode, with a corresponding performance tradeoff. In the limit, a disk can be shut off so that it consumes no power. In a large cluster of disks, only a fraction of them is accessed at any given time, so the rest could potentially be switched to a low-power mode. However, since mode transitions consume time and power, disk management schemes must walk the tightrope of balancing power consumption against performance.
The solution space explored thus far in the literature can be divided as follows: (1) Hardware-based solutions, (2) Disk Management solutions, and (3) Caching solutions. Each of these solutions proposes a new system of some kind; hardware-based solutions propose novel storage hierarchies to strike the right balance between performance and power consumption; disk management solutions interject a new `disk management layer' on top of the file system, which controls disk configuration and data layout to achieve power-optimal disk access patterns; caching solutions devise new power-aware caching algorithms that allow large fractions of the storage system to remain idle for longer periods of time, allowing them to be switched to lower power modes.
The principal contribution of this paper is to argue that there is a fourth niche as yet unexplored: (4) File System solutions. We do not present a new system; instead, we take an idea that has been around for well over a decade, the Log-Structured File System (LFS) [13], and argue that technological evolution has given it new relevance today as a natural power-saving opportunity for large-scale storage systems. The key insight is that, where other solutions attempt to predict disk accesses to determine which disks to power down, the LFS automatically provides a perfect prediction mechanism, simply by virtue of the fact that all write accesses go to the log head. Section 3 explains and expands on this idea.
Another approach is proposed by Colarelli et al. in [2], using massive arrays of idle disks (MAID). They propose the use of a small number of cache disks in addition to the MAID disks. The data on these cache disks is updated to reflect the workload currently being accessed. The MAID disks can then be powered down, and need only be spun up when a cache miss occurs, upon which the requested data is copied onto the cache disks. This approach has several of the weaknesses that memory caches suffer from, only on a larger scale. If the cache disks are too small to hold the entire working set of the current workload, `thrashing' results, with considerable latency penalties. Further, the cache disks represent a significant added cost in themselves.
Disk Management Solutions
Pinheiro and Bianchini [11] suggest that if data is laid out on disks according to frequency of access, with the most popular files located on one set of disks and the least popular ones on another, then the latter set of disks can be powered down to conserve energy. Their scheme is called Popular Data Concentration (PDC), and they implement and evaluate a prototype file server called Nomad FS, which runs on top of the file system and monitors data layout on disks. They find that powering down the low-access disks results in a considerable performance hit, and suggest instead that these disks be run at low speed. While the idea is sound, it is not clear whether this scheme would adapt well to different workloads.
Son et al. propose another data layout management scheme to optimize disk access patterns [14]. Their approach uses finer-grained control over data layout on disk, tuning it on a per-application basis. Applications are instrumented and then profiled to obtain array access sequences, which their system uses to determine optimal disk layouts by computing the optimal stripe factor, stripe size, start disk, and so on. Again, the wisdom of marrying the disk layout to the application seems questionable.
Hibernator, proposed by Zhu et al. [6], combines a number of ideas. It assumes multi-speed disks and computes online the optimal speed at which each disk should run. To minimize speed-transition overheads, disks maintain their speeds for a fixed (long) period of time, which the authors call the coarse-grained approach. Hibernator includes a file server that sits on top of the file system and manipulates data layout to place the most-accessed data on the highest-speed disks. The authors address the issue of performance guarantees by stipulating that if performance drops below some threshold, all disks are spun up to their highest speed.
Caching Solutions
Zhu et al. [5] observe that the storage cache management policy is pivotal in determining the sequence of requests that reach the disks. Hence, cache management policies can be tailored to increase the average idle time between disk requests, providing more opportunities to reduce disk energy consumption. Further, cache policies that are aware of the underlying disk management scheme (e.g., which disks are running at which speeds) can make more intelligent replacement decisions. The authors present both offline and online power-aware cache replacement algorithms to optimize read accesses. They also show experimentally the somewhat obvious fact that, for write accesses, write-back policies offer more opportunities to save power than write-through policies.
How do reads work in the LFS? In the same way as in conventional file systems. Reads require random access and hence do not avoid seek latency. However, the assumption is that with good caching mechanisms, reads will constitute only a small fraction of disk accesses.
As can be imagined, space reclamation is a tricky problem in log-structured file systems. However, excellent solutions have been proposed, and one is of particular interest to us: the disk is divided into large log segments. Once a log segment fills, a new segment is allocated and the log head moves to it. When some threshold of a segment's data has been invalidated, its remaining valid data is moved to another segment (replacing that segment's invalid data), and the segment is returned to the pool of free log segments. Over time, this process naturally divides allocated segments into stable ones (i.e., consisting almost entirely of data that is rarely invalidated or modified) and volatile ones (which need to be constantly `cleaned'). We will see how this feature can be used to save power.
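To make this concrete, the following minimal Python sketch illustrates segment-based cleaning; the class names, the cleaning threshold, and the bookkeeping are our own illustrative assumptions, not a description of any particular LFS implementation.

```python
# Minimal sketch of segment-based log cleaning. Names and thresholds are
# illustrative assumptions only.

class Segment:
    def __init__(self, segment_id, capacity):
        self.id = segment_id
        self.capacity = capacity
        self.used = 0                # slots consumed by appends (live or dead)
        self.live_blocks = set()     # block ids whose latest copy lives here

    def live_fraction(self):
        return len(self.live_blocks) / self.capacity

    def is_full(self):
        return self.used >= self.capacity


class Log:
    def __init__(self, segments):
        self.free = list(segments)   # pool of free segments
        self.allocated = []
        self.head = self._new_head() # all writes go to the head segment

    def _new_head(self):
        seg = self.free.pop()
        self.allocated.append(seg)
        return seg

    def write(self, block_id, old_segment=None):
        # An overwrite invalidates the previous copy rather than updating in place.
        if old_segment is not None:
            old_segment.live_blocks.discard(block_id)
        if self.head.is_full():
            self.head = self._new_head()
        self.head.used += 1
        self.head.live_blocks.add(block_id)
        return self.head

    def clean(self, threshold=0.25):
        # Copy the remaining live data out of mostly-invalid segments, then
        # return those segments to the free pool. Segments whose data is rarely
        # invalidated stay above the threshold and are never touched; over time
        # this separates stable segments from volatile ones.
        for seg in list(self.allocated):
            if seg is not self.head and seg.live_fraction() < threshold:
                for block in list(seg.live_blocks):
                    self.write(block, old_segment=seg)
                seg.used = 0
                self.allocated.remove(seg)
                self.free.append(seg)
```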
Since all writes in an LFS go to the log head, we know in advance which disk they will access. This gives us a perfect prediction mechanism, at least for write accesses. Besides being deterministic, this prediction mechanism is also application-independent. Thus, if most disk accesses were writes, we could power down every disk except the one holding the log head. This, however, is an ideal-case scenario. Our view is that, with a good caching algorithm (the power-aware caching algorithms described in the related work section are good candidates), reads to disk can be minimized, so that only a small fraction of the disks need be powered on to serve all writes as well as reads.
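As a toy illustration of this determinism, suppose log segments are laid out sequentially across disks; then the disk serving the next write is a pure function of the log-head position. The layout and helper names below are assumptions made purely for illustration.

```python
# Toy illustration: with segments striped sequentially across disks, the next
# write always lands on the disk holding the log-head segment.

def disk_of_segment(segment_id, segments_per_disk):
    return segment_id // segments_per_disk

def candidate_disks_to_power_down(head_segment_id, segments_per_disk, num_disks):
    # In the write-only ideal case, every disk except the log-head disk is a
    # candidate for powering down.
    write_disk = disk_of_segment(head_segment_id, segments_per_disk)
    return [d for d in range(num_disks) if d != write_disk]
```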
However, what about the performance and power costs of log cleaning? Matthews et al. present optimizations in [4] that hide the performance penalty of log cleaning even when the workload allows little idle time. The power costs of log cleaning are a little trickier to justify; this is where the natural division of segments into stable and volatile ones produced by log cleaning (as described above) can help. After a significant fraction of the segments on a disk have been classified as stable, volatile, or free, we power the disk on, copy its stable segments to a `stable' disk and its volatile segments to a `volatile' disk (which is kept powered on), and then free the entire disk for reuse. This is similar to the log cleaning scheme described in [10], which uses a `hidden' structure embedded in the log to track segment utilization. Cleaning an entire disk at a time amortizes the cost of powering it on.
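A rough sketch of such a disk-level cleaning pass is shown below; the classification thresholds and the power_on(), power_off(), and copy_in() helpers are hypothetical, introduced only to illustrate the flow.

```python
# Rough sketch of a disk-level cleaning pass: segregate a disk's segments onto
# dedicated stable/volatile disks, then free the whole disk for reuse.
# Thresholds and the disk helper methods are hypothetical.

def classify(segment, stable_live=0.9, free_live=0.1):
    f = segment.live_fraction()
    if f >= stable_live:
        return "stable"     # rarely invalidated; can live on a powered-down disk
    if f <= free_live:
        return "free"       # nearly empty; nothing worth copying
    return "volatile"       # still being invalidated; keep on a powered-on disk

def clean_disk(disk, stable_disk, volatile_disk):
    disk.power_on()                       # amortize one spin-up over the whole disk
    for seg in disk.segments:
        label = classify(seg)
        if label == "stable":
            stable_disk.copy_in(seg)      # destination can be powered down afterwards
        elif label == "volatile":
            volatile_disk.copy_in(seg)    # destination stays powered on
        # "free" segments are simply dropped
    disk.segments.clear()                 # the entire disk is now free for reuse
    disk.power_off()
```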
The mechanism we simulate is as follows: all (non-empty) disks are assumed to begin in the `on' state, and an access count (an exponentially decayed average) is maintained for each disk. The user specifies the maximum percentage of disks that are kept powered on. Periodically (every 200ms in our experiments), a `disk check' process scans the access counts and powers down all but the most-accessed disks within that percentage, as well as any disk whose access count falls below a minimum threshold (zero in our experiments). If a cache miss results in an access to a powered-down disk, that disk is spun up (and remains powered on until the next `disk check'), with a corresponding latency penalty. Judicious choice of these two parameters minimizes the probability of this occurrence.
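A simplified sketch of this disk-check loop is given below; the decay rule and the parameter names (max_on_fraction for the percentage of disks kept on, min_count for the access-count threshold) are assumptions on our part.

```python
# Simplified sketch of the periodic disk check. The decay rule and parameter
# names are assumed; the simulation uses a 200 ms interval and a zero threshold.

DECAY = 0.5   # weight applied to the old average at every check (assumed)

class Disk:
    def __init__(self):
        self.on = True               # non-empty disks start powered on
        self.access_count = 0.0      # exponentially decayed access average

def record_access(disk):
    disk.access_count += 1.0
    if not disk.on:
        # Cache miss to a powered-down disk: spin it up (latency penalty) and
        # keep it on until the next disk check.
        disk.on = True

def disk_check(disks, max_on_fraction, min_count=0.0):
    for d in disks:
        d.access_count *= DECAY
    budget = max(1, int(max_on_fraction * len(disks)))
    ranked = sorted(disks, key=lambda d: d.access_count, reverse=True)
    for rank, d in enumerate(ranked):
        d.on = rank < budget and d.access_count >= min_count
```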
The performance of our system depends heavily on its cache configuration. Since cache optimization is an orthogonal issue that constitutes an entire field of research in itself, it is important to isolate its effect on performance. To achieve this, we implemented an `ideal cache' algorithm, which we term the oracle. Experiments using the oracle approximate the best performance we could achieve, since an oracle has future knowledge and can evict the items accessed furthest in the future [3]. In Figs. 1, 2, and 3, the data points that use this algorithm are annotated with the word `Oracle'.
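For reference, a minimal sketch of a Belady-style oracle policy of this kind is shown below; the trace representation (a list of block ids) is assumed purely for illustration.

```python
# Minimal sketch of a Belady-style `oracle' replacement policy: on a miss,
# evict the cached block whose next reference lies furthest in the future.

def oracle_miss_count(trace, cache_size):
    cache, misses = set(), 0
    for i, block in enumerate(trace):
        if block in cache:
            continue
        misses += 1
        if len(cache) >= cache_size:
            def next_use(b):
                try:
                    return trace.index(b, i + 1)   # position of next reference
                except ValueError:
                    return float("inf")            # never referenced again
            cache.remove(max(cache, key=next_use))
        cache.add(block)
    return misses
```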
Finally, we also wish to compare our system against conventional (non-log-structured) file systems. As an approximation of such a system, we implemented a `random placement' (RP) algorithm, which maps each block to a random disk. All disks are kept powered up, and ideal caching (oracle) is assumed. This data point is labeled `Oracle RP' in our graphs.
Having set the context, let us examine our results. Fig. 1 shows the relation between performance (per-access latency) and the number of disks that are powered on. If we imagine a line at y=0.001 (i.e., 99.9% of the accesses lie above this line), the 60% disks ON configuration is the third best, behind only the Oracle RP and 100% disks ON configurations. Further, the performance degradation in going from 100% disks ON to 60% disks ON is negligible. The principal takeaway is that, for the system under test, the optimal configuration is to have 60% of the disks powered on. In other words, 40% of the disks can be spun down while still maintaining performance comparable to that of a conventional file system.
Fig. 2 shows an estimate of the actual power savings achieved by our solution. The height of each bar is the average power consumed while processing the trace. Further, each bar shows the breakdown of power consumed by powered-up disks (On), powered-down disks (Idle), and mode transitions (Transition). We assume the following disk specifications: avg. operating power = 12.8 W, avg. idle power = 7.2 W, avg. mode-transition power = 13.2 W, avg. transition time = 6 s. We see that turning off 40% of the disks yields 12% power savings (compared to 32% with all the disks off), while maintaining acceptable performance.
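For intuition, an idealized estimate that ignores transition energy and forced spin-ups bounds the achievable savings from above; here f denotes the fraction of disks kept powered on, and P_on and P_idle are the operating and idle powers listed above (the notation is ours).

```latex
% Idealized upper bound on power savings, ignoring transitions and spin-ups.
\[
  \mathrm{savings}(f) = 1 - \frac{f\,P_{\mathrm{on}} + (1-f)\,P_{\mathrm{idle}}}{P_{\mathrm{on}}},
  \qquad
  \mathrm{savings}(0.6) \approx 17.5\%, \quad \mathrm{savings}(0) \approx 44\%.
\]
```

That the measured 12% and 32% figures fall below these bounds is consistent with the additional cost of mode transitions and of spin-ups caused by cache misses.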
Finally, Fig. 3 shows how much time the disks spend in the on, off, and transition states. The height of each bar is the cumulative time spent by the disks in each of these three states. When 0% of the disks are on, the run takes 7253 cumulative hours; we omit this bar from the graph for clearer presentation. We see that both the total duration of the experiment and the number of mode transitions increase as the percentage of disks powered on decreases. However, as in Fig. 1, keeping 60% of the disks on strikes an acceptable balance.
We also provide initial simulation results that validate our claim that power savings are possible using a log-structured file system. While simulations cannot provide conclusive evidence of a system's feasibility, they are an effective means of identifying promising solutions. Our principal contribution in this paper is to have shown a new fit for an old idea; we believe the log-structured file system shows promise as a power-saving opportunity for large-scale storage systems.
In future work, several questions remain to be addressed. Our evaluation has been limited by the difficulty of obtaining real file system traces from commercial data centers; we are actively looking for more recent traces against which to test our solution. We are also working on a more thorough study of the efficacy of the log cleaning approach outlined here. Finally, we believe the LFS provides an interesting substratum on which to build more elaborate solutions, and we are working on some promising options that we hope to share with the community soon.