The following paper was originally published in the Proceedings of the USENIX 1996 Annual Technical Conference, San Diego, California, January 1996.

Cut-and-Paste file-systems: integrating simulators and file-systems

Peter Bosch, Sape J. Mullender
Faculty of Computer Science/SPA, University of Twente, Netherlands
peterb@cs.utwente.nl, sape@cs.utwente.nl

Abstract

We have implemented an integrated and configurable file system called the PFS and a trace-driven file-system simulator called Patsy. Patsy is used for off-line analysis of file-system algorithms; PFS is used for on-line file-system data storage. Algorithms are first analyzed in Patsy and, when we are satisfied with the performance results, migrated into PFS for on-line usage. Since Patsy and PFS are derived from a common cut-and-paste file-system framework, this migration proceeds smoothly.

We have found this integration quite useful: algorithm bottlenecks have been found through Patsy that could have led to performance degradations in PFS. Off-line simulators are easier to analyze than on-line file-systems because a work load can be replayed repeatedly on the same off-line simulator. This is almost impossible in on-line file-systems, since it is hard to provide similar conditions for each experiment run. Since simulator and file-system are integrated (hence, use the same code), experiment results from the simulator have relevance in the real system.

This paper describes the cut-and-paste framework, the instantiation of the framework to PFS and Patsy and, finally, some of the experiments we conducted in Patsy.

1. Introduction

Building a fast file-system or introducing new algorithms to an existing file-system is often a difficult task. The designer does not know in full detail what a proposed algorithm does to the overall file-system performance. Normally, the only way to test a new algorithm is to introduce it into an on-line file-system and measure the performance effects while the system is in use. If a new algorithm does not work, file-system service may be interrupted.

A better way to test algorithms is to analyze their performance in an off-line simulator. When an algorithm works as expected, it can be integrated into a production file-system. The disadvantage of an off-line simulator, however, is that it is hard to build a representative one. Usually, simulators are approximations of real systems [21], and results from such a simulator may not reflect actual system performance.

Recent developments [20, 26, 27, 3, 8] in file-system research have shown us that improving file-system performance is still a hot topic. Many groups devise new storage algorithms and report better performance numbers. These performance numbers usually show the effectiveness of those algorithms within the environment they were developed for.
Ideally, to validate those reported performance numbers, other researchers would re-use (parts of) the presented system, insert it into their own environments, and re-evaluate the published performance. In reality, this is well nigh impossible [5]. Often the system code depends heavily on the environment for which it was developed. With some effort it is possible to port the source code, but it is then hard to set up a similar test environment [9, 29].

To solve both problems, we have designed and implemented an extensible reference file-system component library from which we instantiate on-line file-systems and off-line file-system simulators. On-line systems are systems that are in use in real systems; off-line systems are not in use by clients and only run in a controlled environment. The component library provides the basic components that are required to build a full file-system, and the same components are used to build file-system simulators. Helper components are added to complete an instantiation to a real file-system or simulator. Helper components in a real system deal with actual data movement, while in the simulator they compensate for the lack of ``real'' data. We have made the component library such that it is easy to extend it with new algorithms and policies.

The symbiosis of the simulator and the ``real thing'' helps us to:

* be more confident that simulated off-line performance numbers show real and representative on-line file-system performance numbers;

* easily detect performance bottlenecks of real file-system algorithms by running the same system off-line in an equivalent file-system simulator;

* analyze new file-system algorithms off-line before they are integrated into a production file-system;

* be more confident that no side effects are introduced when a simulated algorithm is moved into a real system;

* construct a reference file-system: other file-system algorithms can easily be integrated and compared to other algorithms in our environment;

* easily migrate algorithms from off-line simulators into real systems: an algorithm does not have to be re-implemented for the real system once it has been implemented for a simulator.

Our initial goal for this work was to build a production file-system and a separate file-system simulator. The production file-system was to be used for ordinary data storage with functionality to store continuous-media files; the simulator was to be used for analyzing storage algorithms.

In the original simulator, we analyzed cache-flush algorithms. For that work we replaced the Unix 30-second-update timer policy by flush policies that keep dirty data in the cache much longer [4], so-called write-saving policies. Unix file-system write traffic is characterized by a high overwrite factor in the first part of a file's lifetime [2, 10, 17, 22]. Keeping dirty data in memory longer without writing it to disk increases the probability that a block is overwritten or removed by truncate and delete calls in memory rather than on disk. As a result, fewer data blocks are written to disk. The work showed that replacing the 30-second-update timer by a write-saving policy greatly reduces file-system read latencies; it claims that disk I/O queues are the main cause of relatively high file-system latencies. Basically, the goal was to get writes out of the way of reads simply by writing less data to disk. The initial simulator used for those experiments was an approximation of a real file-system and it used a simple disk model.
As is shown by Ruemmler et al. [21], a simple disk model in a simulator may not show the actual performance: the results can be completely useless. Ruemmler et al. reported differences of up to 112% between real and simulated performance. It was obvious we could not trust the earlier analysis. To present more accurate simulated performance numbers, we decided to build a file-system simulator that simulates a file-system in all details, including a disk sub-system back-end much like HP's Pantheon disk simulator [31] and Dartmouth's disk simulator [13].

We continuously refined the simulator and eventually ended up with a full file-system. It turned out that this file-system shared many data structures and algorithms with our real file-system; it only lacked the data-manipulation code. In other words, the simulator was no longer an approximation. We repeated the write-saving experiments in this version of the simulator and analyzed the performance numbers again.

We then realized that it is important that a system and its simulator are closely related. When policies and algorithms differ, we cannot be reasonably sure that simulated performance shows real performance. Also, when policies and algorithms are analyzed in a simulator, they will eventually have to be migrated into a real file-system. If the real file-system uses completely different data structures or is constructed differently, this migration usually results in a complete rewrite of the policy or algorithm, which may introduce unwanted side effects or new bottlenecks in the real system. When simulator and file-system are derived from a common framework and use similar, if not equal, data and component structures, migrating code becomes a trivial matter.

We now use Patsy, PFS and the cut-and-paste component library as a new way of doing file-system development. Algorithms and file-system extensions are first analyzed off-line through pre-recorded file-system traces or hand-crafted work loads before they are integrated into a production file-system. We learn all the effects of algorithms before they are used. In fact, we use the framework as a reference system and starting point for further work. Quick off-line file-system experiments are performed before we decide to use an algorithm in a production version of the system. For example, we are currently performing continuous-media storage experiments in the framework, and we plan to use the framework for tertiary-storage experiments. We feel that having the framework available gives us a head start for file-system developments: components are already in place or are available in the library. Components that are written for experiments are always added to the component library for possible later re-use in other experiments.

Currently, we know of one other system that has integrated a file-system simulator and a real system. Thekkath et al. [30] describe a file-system simulator that is able to run an existing kernel file-system in a discrete-event simulator. Our work is similar to theirs, but we arrived at the same point through a different route. Thekkath's goal was to lift a kernel file-system into a user-level simulator and measure the performance of such a system. We started by building a simulator for some file-system experiments, and later integrated the simulator and a real file-system. Our system makes file-system development and measurements easy, whereas Thekkath's system makes measurements of existing file-systems easy.
The major advantage of Thekkath's system is that they have measured a production file-system (Unix FFS [14]) and calibrated their simulator against a production file-system, whereas we are still using an experimental file-system. A second advantage of Thekkath's system is the availability of a real file-system snapshot before an experiment starts. We are upgrading PFS from an experimental file-system to a production file-system, and we will use snapshots of PFS in Patsy experiments.

The remainder of this paper is organized as follows. Section 2 describes the common cut-and-paste framework. Section 3 describes the modules added to make up PFS, and Section 4 describes the components added to make up a file-system simulator. Section 5 describes a simulator that we built through the framework, some of the experiments we conducted in this simulator, and the lessons we learned while building the system. Section 6 summarizes this paper.

2. Cut-and-Paste

Our goal for the component library is to create a reference file-system framework from which we can instantiate many possible file-system and simulator configurations. This reference file-system framework, therefore, must be easy to maintain and easy to extend. The cut-and-paste framework should not force designers into early algorithm and policy decisions.

To build an open component library, we decided to implement the system in an object-oriented language. We have created a set of base components that are used in all file-systems and implement default behavior. Derived components implement specific behavior and are accessed through an inheritance chain. We do not allow derived components to enrich the interface of their base component without also adding that interface to the base component. This ensures that interfaces remain clean and without cross dependencies other than the base-component dependencies. By restricting ourselves to such a scheme, our system remains upward compatible.

The cut-and-paste component library is best visualized as a collection of objects that are combined into a real or simulated system whenever they are needed. All components are written in C++, are instantiated from their classes and are bound to global variables when a system starts. Base components are C++ base classes, derived components are C++ derived classes. The system currently consists of approximately 55K lines of C++ code, divided into 80 classes.

Currently, the following core components exist: a thread scheduler that allows multiple independent processes in the real or simulated system, a cache that implements a full file-system block cache, a file-system storage component that defines the storage layout on disk, a client interface that defines an abstract front-end to the system, an abstract file that ``knows'' everything about a file when it is loaded into the cache, and a device-driver that communicates with the simulated or real hardware to read and write data from disk. For all of the core components we have implemented the default behavior and one or two other algorithms. The core components themselves implement a true default file-system. As we are still adding functionality to our system, we expect the number of available algorithms to grow rapidly.
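As a concrete illustration of this pattern, the following minimal sketch (with hypothetical class and variable names, not the actual library interfaces) shows base classes that supply default behavior, derived classes that specialize it, and a start-up step that binds the chosen instances to global variables:

// Hypothetical names; a sketch of base/derived components bound to globals.
#include <iostream>

struct Scheduler {                       // base component: default behavior
    virtual ~Scheduler() {}
    virtual const char *kind() const { return "real-time"; }
};
struct VirtualTimeScheduler : Scheduler {   // derived: simulator-specific policy
    const char *kind() const { return "virtual-time"; }
};

struct StorageLayout {                   // base component: abstract layout
    virtual ~StorageLayout() {}
    virtual const char *kind() const { return "abstract"; }
};
struct SegmentedLfs : StorageLayout {    // derived: a concrete layout
    const char *kind() const { return "segmented LFS"; }
};

// Global bindings, filled in when the system (or simulator) starts.
Scheduler     *the_scheduler;
StorageLayout *the_layout;

int main() {
    // A simulator instantiation binds simulation-oriented classes here;
    // a real system binds their on-line counterparts to the same globals.
    the_scheduler = new VirtualTimeScheduler();
    the_layout    = new SegmentedLfs();
    std::cout << "scheduler: " << the_scheduler->kind()
              << ", layout: "  << the_layout->kind() << std::endl;
    delete the_scheduler;
    delete the_layout;
}

Because both configurations share the same globals and base-class interfaces, swapping a component is purely a configuration-time choice in this scheme.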
Figure 1: Some of the framework components. Class definitions are shown as shaded ovals, instantiated objects as clear ovals, and global variables and hash tables as clear boxes. Unstriped arrows are plain pointers, single-striped arrows show the relation between an instantiated object and its class definition, double-striped arrows refer to the inheritance chain between the class definitions, and triple-striped arrows show which objects call methods in other objects.

Figure 1 shows most of the core components of the combined file-system and simulator. The figure is used as a reference figure: not all components are explained immediately. All components that are not located inside boxes labeled Patsy or PFS are common to both simulator and system and are part of the cut-and-paste framework. The cut-and-paste components are explained in the remainder of this section. Boxes labeled PFS are explained in Section 3 and boxes labeled Patsy are explained in Section 4. Since the interfaces between objects can be quite large, we do not describe them here and refer to the source distribution instead.

Thread scheduler

The thread scheduler implements threads, synchronization primitives and real or virtual time. Independent file-system processes are given a separate thread of control inside the system. These threads are able to communicate with each other through synchronization primitives: one thread can make others runnable. Threads can also suspend themselves by yielding the processor for some amount of time. Finally, the scheduler knows the current time in a real system and defines virtual time in a simulator.

The synchronization primitives are based on events. Each thread can pick a unique event and block on it. Once a thread has blocked itself, another thread signals the event through the scheduler to make the thread runnable again. Internally, the scheduler maintains a queue of all delayed threads and makes them runnable whenever their timers expire. If the scheduler is configured in a simulator, thread timers only expire when there are no other threads to run: virtual time is advanced to the timer-expiry time of the first thread on the delayed queue. When the scheduler is configured in a real system, timers expire in real time. External events are also managed by the scheduler when it is configured in a real system. In Unix, a file-descriptor is associated with a thread, and whenever a message arrives on the file-descriptor, the thread associated with it is made runnable.

Currently, the scheduler provides random scheduling: it picks a random thread from the runnable set whenever it has to schedule the next thread. When other scheduling policies are required (e.g., when files with real-time constraints are introduced), a derived scheduler class can implement a different scheduling policy.

Caches

The cache modules are used to administer and maintain a file-system block cache. They provide interfaces to administer all dirty, non-dirty and free blocks in lists, and to allocate blocks from the cache. Also, when blocks are allocated from a full cache, the cache decides which blocks are replaced and flushed.

The base cache component implements LRU lists to maintain all dirty and non-dirty blocks. Blocks are first allocated from the non-dirty list, and when there are no non-dirty blocks available, the cache initiates a cache flush through the oldest dirty block. To flush, it calls an internal method that may be overloaded by different policies. For example, we have implemented a flush policy that flushes the whole file rather than a single block when a block flush is initiated.
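A minimal sketch of such an overloadable flush hook might look as follows (hypothetical names and a deliberately simplified dirty list, not the actual class interfaces): the base class selects the oldest dirty block and calls a virtual flush method, which a derived class overrides to flush every dirty block of the same file.

// Hypothetical names; sketch of a base cache with an overridable flush hook.
#include <cstdio>
#include <list>

struct Block { int file_id; int block_no; bool dirty; };

class Cache {
public:
    virtual ~Cache() {}
    void add_dirty(Block b) { b.dirty = true; dirty_.push_back(b); }
    // Called when a block must be freed and no non-dirty block is available.
    void reclaim() {
        if (!dirty_.empty()) { flush(dirty_.front()); dirty_.pop_front(); }
    }
protected:
    // Default policy: write back only the selected (oldest) block.
    virtual void flush(const Block &b) {
        std::printf("flush block %d of file %d\n", b.block_no, b.file_id);
    }
    std::list<Block> dirty_;      // oldest dirty block at the front (LRU order)
};

class WholeFileFlushCache : public Cache {
protected:
    // Alternative policy: also flush every other dirty block of the same file.
    void flush(const Block &b) {
        for (std::list<Block>::iterator i = dirty_.begin(); i != dirty_.end(); ) {
            if (i->file_id == b.file_id && i->block_no != b.block_no) {
                std::printf("flush block %d of file %d\n", i->block_no, i->file_id);
                i = dirty_.erase(i);
            } else {
                ++i;
            }
        }
        Cache::flush(b);          // finally write the block that triggered the flush
    }
};

int main() {
    WholeFileFlushCache c;
    Block b1 = { 1, 10, true }, b2 = { 1, 11, true }, b3 = { 2, 5, true };
    c.add_dirty(b1); c.add_dirty(b2); c.add_dirty(b3);
    c.reclaim();                  // flushes blocks 11 and 10 of file 1
    return 0;
}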
Specific persistency requirements can be implemented in derived components that call into the base component to initiate cache flushes. For example, the Unix SVR4 30-second-update timer policy is implemented through a derived class that examines the contents of the cache every couple of seconds. When it detects a dirty block older than 30 seconds, it flushes the file associated with the oldest block. Different cache administration policies are easily implemented by re-implementing the replacement methods of the base class in a new derived class. For example, to experiment with different replacement policies (e.g., RR, LFU, SLRU, LRU-K or adaptive [12]), only those functions that deal with LRU replacement need to be replaced in the base class.

The difference between a simulated cache and a real cache is the lack of a data pointer in the simulated case. In all cases where data is moved between buffers, the simulator delays the current thread for the amount of time it would take (based on the system hardware configuration) to copy the data. In a real system, a large chunk of (physical) memory is allocated and divided over all the cache blocks when the system starts.

Storage-layout

The storage-layout component is responsible for defining a file-system layout on a raw disk. This component knows the actual location(s) of file-system meta-data, and is able to store and retrieve information from one or more disks. It is consulted whenever something needs to be done with a raw disk. The base storage-layout class is only an interface: it does not implement an algorithm. Specific layouts are implemented through derived classes. The interface to a storage-layout class is defined such that for all layout and policy decisions there exists a virtual method in the base class.

Currently, we have implemented a segmented LFS [20, 26]. This system appends file-system updates to the end of the log, and is able to find files through an IFILE. The log-cleaner can be replaced and is plugged into the LFS component when the system starts up. To implement other storage layouts (such as the Unix FFS [14], EFS [28], or journalling file-systems), a new derived storage-layout class needs to be written that defines a new storage layout on disk. This derived class needs to implement all of the methods defined in the abstract storage-layout class.

A storage-layout module can also be instantiated for a simulator. In this case, all information that would have been read from or written to disk is simulated by making educated guesses. If, for example, a file is accessed that is not yet known to the storage-layout module, the module picks a random location on disk. Once an initial location has been chosen for a file, the simulator sticks to those addresses.

Abstract Client Interface

The abstract client interface provides the basic file-system interface. There are functions to open, close, read, write or delete a file, and there are functions to manipulate a hierarchical name-space. The abstract client interface initiates the loading of a file from disk when it is first accessed. It calls into the file-system module to read the file's inode into memory. Once the file is in memory, the component stores a reference to it in a global file table. Furthermore, when the file has been loaded into memory, the abstract client interface maps all incoming user requests to the memory representative of the file.
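The open path might be sketched roughly as follows (hypothetical names and a deliberately simplified global file table; the real component also handles name lookup, reference counting and error cases):

// Hypothetical names; sketch of loading a file into a global file table.
#include <cstdio>
#include <map>

struct Inode { int number; int size; };

struct File {                          // in-core representative of a file
    Inode inode;
    int read(int offset, int length) { // would touch cached blocks in the real system
        std::printf("read %d bytes at offset %d from inode %d\n",
                    length, offset, inode.number);
        return length;
    }
};

struct StorageLayout {                 // simplified stand-in for the layout component
    Inode load_inode(int number) { Inode i = { number, 4096 }; return i; }
};

std::map<int, File *> file_table;      // global table of loaded files
StorageLayout *the_layout;             // bound at start-up

File *open_file(int inode_number) {
    std::map<int, File *>::iterator it = file_table.find(inode_number);
    if (it != file_table.end())
        return it->second;             // already in core
    File *f = new File();              // stays in the table for the life of the sketch
    f->inode = the_layout->load_inode(inode_number);
    file_table[inode_number] = f;      // remember the in-core representative
    return f;
}

int main() {
    the_layout = new StorageLayout();
    open_file(42)->read(0, 1024);
    open_file(42)->read(1024, 1024);   // second request maps onto the same File
    delete the_layout;
    return 0;
}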
Files

Abstract client requests are dispatched to so-called instantiated files. An instantiated file is used to control a file that has been loaded into the file-system cache. It may contain a memory copy of the file's inode and references to cached file data, and it contains a set of functions to perform operations on a file, such as a read, write and flush method. As there are many file types in current file-systems (e.g., ordinary Unix-like files, directories, symbolic links, multi-media files, and administrative files), each with different access patterns and behavior, we have implemented each file type in a separate derived class. All of these derive basic file functionality from the base file.

This structure allows us to separate the policies that optimize access to a particular file type. An example of this is a multi-media file. If ordinary cache policies were used on a multi-media file, the whole cache would fill up with its data. A multi-media file prevents this from happening by implementing other cache policies. Pei Cao et al. have shown that two-level file caching can be beneficial to overall file-system performance [7]. In their work, cache replacement policies are delegated to a user-level manager process. In our approach we have the opportunity to delegate the policy decisions to a particular file -- the manager can be implemented inside the file-system. When a file is opened, a client can inform the file-system to use a certain replacement policy. This approach gives us fine-grain control over cache replacement policies.

When a file is requested by a client, the file-system front-end examines the file type of the requested file (through its inode or other administrative means) and instantiates an object of that type to manage the file while it is in core. A file is called active if the instantiated file spawns a thread of control that works independently inside the file-system. Such functionality is especially useful for files that have timing constraints on them. For a multi-media file, for example, the thread of control may take care of cache pre-loading and can even negotiate with (remote) clients and other modules in the file-system to establish a certain Quality of Service (QoS).

3. Pegasus File-system

The base components in the cut-and-paste library do not make up a complete system: they lack interfaces to the environment. To complete such a system, helper components are added to the component library that glue the components into the environment. The system glue currently consists of two parts: a PFS client interface that provides a real user interface, and a real disk-driver that accesses a real disk.

PFS Client Interface

We use NFS [23] as the external PFS interface. We have constructed a full NFS client-interface class, which is derived from the abstract client interface class. The NFS class spawns a number of threads that wait for incoming mount and NFS requests. Whenever a request is received, the call is dispatched to one (or more) calls in the abstract client interface. Each thread in the NFS component acts as a representative of a client while the request is in progress.

In the future we will construct a client interface that enables strict consistency. This client interface will use many of the techniques that are already in use by Sprite [16] and MFS [6]. By using client caching we hope to reduce the amount of network traffic and file latency. We will realize this within the cut-and-paste framework so that we can simulate client/server interaction and client cache performance.
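As a rough, hypothetical illustration (not the actual NFS code paths), a front-end worker thread essentially translates each incoming request into one or more calls on the abstract client interface:

// Hypothetical request encoding; sketch of a front-end dispatching to the
// abstract client interface on behalf of a client.
#include <cstdio>
#include <string>

struct AbstractClientInterface {       // stand-in for the abstract front-end
    virtual ~AbstractClientInterface() {}
    virtual int lookup(const std::string &name)
        { std::printf("lookup %s\n", name.c_str()); return 42; }
    virtual int read(int fh, int off, int len)
        { std::printf("read fh=%d off=%d len=%d\n", fh, off, len); return len; }
    virtual int write(int fh, int off, int len)
        { std::printf("write fh=%d off=%d len=%d\n", fh, off, len); return len; }
};

struct Request {
    enum Op { LOOKUP, READ, WRITE } op;
    std::string name;
    int fh, off, len;
};

// A worker thread would call something like this for each request it picks up,
// acting as the client's representative while the request is in progress.
void serve(AbstractClientInterface &fs, const Request &r) {
    switch (r.op) {
    case Request::LOOKUP: fs.lookup(r.name);             break;
    case Request::READ:   fs.read(r.fh, r.off, r.len);   break;
    case Request::WRITE:  fs.write(r.fh, r.off, r.len);  break;
    }
}

int main() {
    AbstractClientInterface fs;
    Request r1 = { Request::LOOKUP, "paper.tex", 0, 0, 0 };
    Request r2 = { Request::READ,   "",          42, 0, 8192 };
    serve(fs, r1);
    serve(fs, r2);
    return 0;
}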
Disk-driver

Real disks are accessed through disk-drivers. Disk-drivers implement one or more disk queues and send new operations to disks whenever the disks are ready to service new requests. They can implement disk-queue scheduling policies to optimize disk I/O queue time (e.g., SCAN, C-SCAN, LOOK, C-LOOK [11, 25]) or guarantee real-time delivery of data through algorithms such as scan-EDF [18]. Currently, only one disk-driver exists. This driver implements a combined read-write queue and schedules I/O requests through the C-LOOK scheduling policy. It uses a Unix file (an ordinary file or a raw device) as back-end.

4. Patsy

Patsy is the instantiation of the cut-and-paste library to a file-system simulator, combined with some helper components for off-line file-system simulation. In particular, we added components that simulate real disk-drivers and disks, a component that simulates the connection between the host and the disk sub-system, and a component hierarchy that is able to read a particular trace file and dispatch it to the simulator. We are simulating only a small subset of all hardware types. We hope to integrate our file-system simulator into HP's Pantheon disk simulator and use that as our disk back-end, because Pantheon simulates a wide variety of storage hardware.

Simulated disk-drivers

Simulated disks are accessed through simulation disk-drivers. These disk-drivers provide the same functions as their real counterparts, but also provide mechanisms to simulate the sending and receiving of operations from disk. The simulated disk-drivers have exactly the same interface as a real disk-driver: the differences are in the internal implementation. The system itself does not know it is communicating with a ``fake'' disk.

Simulation disk-drivers package disk operations in I/O-request data structures. These data structures contain all the relevant information for the disk simulator to simulate a disk read or write, and they contain timing information to measure the performance of the I/O operation. Before an operation is activated on disk, the disk-driver acquires the host/disk connection and simulates the sending of data. This is required because multiple entities can make use of the same host/disk connection. If the connection is already in use, the disk-driver waits until the connection is released again. Finally, it sends the I/O request to the disk itself and activates it. The disk performs the I/O request and later transmits the results to the disk-driver. For this, the disk acquires the host/disk connection and simulates the transmission of the I/O request back to the driver. Finally, the simulated disk-driver resumes the original caller and informs it of the I/O results through the I/O-request data structure.

Simulated disks

The disk component in the simulator acts as a representative for a real disk. A simulated disk component knows about heads, tracks, sectors, rotational speed and controller overhead, and it may implement disk cache policies. Internally, a disk is modeled by a separate thread of control that waits for work to arrive from external sources. Whenever a read or write request arrives at the disk, the controller unpacks the request and seeks to the correct cylinder or switches heads. Next, the disk waits for the rotational delay and reads or writes data to disk. Finally, the disk transfers the data to the host, or signals the host that a write has completed. Obviously, in the simulated world no real data is moved to and from disk. We simulate this by delaying the thread of control by the amount of time it would have taken to transfer the data.
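A deliberately naive sketch of how a simulated disk might charge service time to a request is shown below. The linear seek model and the illustrative figures are precisely the kind of simplification that Ruemmler et al. [21] warn about; the simulator described here relies on the much more detailed HP97560 model from the literature [21, 13] instead.

// Hypothetical, simplified disk-timing sketch: controller overhead + seek +
// rotational delay + media transfer. Numbers are only roughly HP97560-like.
#include <cstdio>
#include <cstdlib>

struct SimDisk {
    double controller_ms;        // fixed controller overhead
    double full_seek_ms;         // worst-case seek time
    double rotation_ms;          // time for one full revolution
    double transfer_mb_per_s;    // media transfer rate
    int    cylinders;

    // Service time in milliseconds for `bytes` at `to_cyl`, given the head position.
    double service_time(int from_cyl, int to_cyl, int bytes) {
        double seek = full_seek_ms * std::abs(to_cyl - from_cyl) / cylinders;
        double rot  = (std::rand() % 1000) / 1000.0 * rotation_ms;  // random rotational position
        double xfer = bytes / (transfer_mb_per_s * 1e6) * 1000.0;
        return controller_ms + seek + rot + xfer;
    }
};

int main() {
    // Illustrative figures only (4002 rpm gives roughly 15 ms per revolution).
    SimDisk d = { 2.0, 22.5, 60000.0 / 4002.0, 2.3, 1962 };
    double t = d.service_time(100, 900, 8192);
    std::printf("simulated service time: %.2f ms\n", t);
    // A disk thread in the simulator would now delay itself for t milliseconds
    // of virtual time instead of moving any real data.
    return 0;
}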
Currently, we have implemented an HP97560 disk [21, 13] as a separate disk class. This disk is equipped with a 128KB internal cache that can be used for immediately reported writes (writes that complete once the data arrives in the disk's internal cache) and a read-ahead policy (when there are no more outstanding requests, the disk reads the next 4KB following the last read).

Connections

Connections are the links between the host and the disk sub-system. Connections are used to transfer data, requests and responses between disks and hosts. They also arbitrate when more than one controller wants to send data over the same connection, to simulate connection contention (e.g., SCSI bus contention). Again, as no real data can be moved through a connection, the connection delays the thread of control by the amount of time it would have taken to transfer the data through the connection. We have implemented a SCSI-2 bus [24]. This bus allows multiple hosts/disks to use the same connection, and it allows hosts/disks to disconnect and re-connect during a single SCSI transaction. The bus simulates a transfer speed of 10MB/s.

Work loads and traces

File-system traces are collections of records that describe all the activity of a real file-system over some period of time. These records specify when each operation took place (usually down to the microsecond) and which file-system operation was executed. Usually, file-system traces do not contain all the details that are required for an exact replay. The reason is that, although all parameters can be recorded, doing so would result in prohibitively large trace files: recording such trace files would influence ordinary file-system performance too much [15, 17]. When replaying traces, we synthesize the missing parameters as best we can (e.g., the initial location of a file on disk, file names, the initial layout of the file-system, the exact time a read or write was executed). We are also considering a component that can be used to hand-craft work loads by probabilistic means. This component will, given some inputs, generate a work load and dispatch it to the simulator. The advantage of such a component is that it will improve our confidence in the simulation results.

We have modeled a trace-simulation class hierarchy on top of the abstract client interface. All file-system requests recorded in the file-system traces are mapped to calls in the abstract client interface. The simulator components currently consist of three parts: a general simulation class that gathers performance results and dispatches operations to the abstract client interface, and two derived classes: a Sprite class to replay the Sprite traces [2] and a Coda class to replay the Coda traces [15]. Clients are modeled by separate threads of control inside the Sprite and Coda classes. The threads read a part of the trace file, group operations that obviously belong together (such as an open, read, read, write, ..., close sequence), and call the abstract client interface to execute the operations on the simulated system. Since all of the trace records carry timing information, the threads know how long they have to delay themselves before they can dispatch the next operation. When simulation information is missing (such as the actual time a read or write operation took place), the client thread makes a guess. In the case of the missing read and write times, the operations are positioned equidistantly between the open and close operations. We plan to experiment with the position of these operations.
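The following small sketch (with a hypothetical record format) shows how a replay thread might assign synthetic, equidistant timestamps to the data operations of one open--close group before dispatching them in virtual time:

// Hypothetical record format; sketch of equidistant placement of data
// operations between an open and a close recorded in a trace.
#include <cstdio>
#include <vector>

struct TraceOp { const char *what; double time_s; };

// Given the open and close times and the data operations that belong to the
// same group, assign each operation a synthetic timestamp.
std::vector<TraceOp> position_ops(double open_t, double close_t,
                                  const std::vector<const char *> &ops) {
    std::vector<TraceOp> out;
    double step = (close_t - open_t) / (ops.size() + 1);
    for (size_t i = 0; i < ops.size(); i++) {
        TraceOp op = { ops[i], open_t + step * (i + 1) };
        out.push_back(op);
    }
    return out;
}

int main() {
    std::vector<const char *> ops;
    ops.push_back("read"); ops.push_back("read"); ops.push_back("write");
    std::vector<TraceOp> scheduled = position_ops(10.0, 12.0, ops);
    for (size_t i = 0; i < scheduled.size(); i++)
        std::printf("%s at t=%.2fs\n", scheduled[i].what, scheduled[i].time_s);
    // The replay thread would delay itself in virtual time until each of these
    // timestamps before dispatching the call to the abstract client interface.
    return 0;
}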
The overall measurements are taken from the general simulation class. This class measures how long it takes before an operation completes. The measurements are reported for every 15 minutes of simulation time and for the overall simulation. Detailed internal measurements are provided by plug-in statistics objects. These plug-ins can be activated when the simulator is started, and they provide standard statistics output with or without histograms. Some of the standard detailed statistics objects include histograms of disk-queue sizes, cache statistics, and disk rotational-delay statistics.

5. Using the component library

We have used the component library for file-system performance experiments and to construct a real file-system. We built the algorithms for the various experiments in the simulator, and later we instantiated the same algorithms for a real system: we did not have to change anything in the code except for some small additions where data was actually moved. In this section we describe an experiment we conducted in an instantiated simulator and some of the lessons we learned when we built the component library. The experiments primarily show what kind of simulations are supported. The full results that are presented briefly in this section are the subject of another report; here we only present the final results of the off-line experiments.

5.1. Delayed write policies

One of the reasons to build a file-system simulator in the first place was to re-do the performance analysis of various delayed-write policies, based on ideas described in earlier work [4]. We also conducted the experiments to confirm cache-experiment results reported by Ousterhout [17]. In those experiments, we buffered dirty data longer in the cache in the hope that less data would be written to disk. If dirty data stays in the cache longer without being written to disk, chances are higher that file delete and truncate calls remove the file's dirty data before a flush policy gets a chance to write the data to disk. If less data is written to disk, disk queues become shorter and I/O latencies decrease.

If the fraction of dirty data in the file-system cache increases, read cache hit-rates may be affected negatively: if there is more dirty data in the cache, non-dirty data is replaced earlier. If overall cache hit-rates drop (i.e., if cache hit-rates are higher for non-dirty caches), more data is read from disk. This leads to longer disk queues and to increased I/O latencies. We analyze this trade-off.

If dirty data remains in the cache longer before being written to disk, more data is lost when the system fails. Earlier work describes how to protect dirty data from a single point of failure in the system through client write caching and by using NVRAM or a UPS in the file-server [4].

We performed four different experiments with the Sprite traces to analyze the performance effects of these write-saving policies. Our first experiment shows the base-line performance of an ordinary Unix file-system with a 30-second-update policy: dirty data is buffered for 30 seconds before it is sent to disk (the write-delay experiment). Next, we equip the file-system with a UPS and only flush a cache block when we are out of non-dirty cache blocks.
This experiment shows the other extreme, as blocks are only written when the memory is filled with dirty blocks. Finally, we equip the file-system with 4MB of NVRAM and disallow dirty data to reside in volatile RAM. If the NVRAM is full and we need free (or non-dirty) blocks in the NVRAM buffer, we flush the oldest dirty block to disk. For the NVRAM case we consider two flush policies: we either flush the whole file associated with the oldest block (with the expectation that on average there is more non-dirty data in the cache), or we flush only the oldest block. We included the NVRAM experiments to answer how much file-system latencies improve by using a small NVRAM buffer in the file-server, a question that was raised in Baker et al. [1] and left unanswered.

Of all experiments, we expected the UPS experiment to perform best. The reason is that all of the available cache can be used for write caching, minimizing the amount of written data. For the other experiments we did not know beforehand which policy would work best: if too much dirty data is generated, the NVRAM buffer may become the system's bottleneck, reducing the effective write-delay time when the file-system is busy. In fact, adding NVRAM to the file-system may even be counter-productive.

We have rebuilt the Sprite file-server as closely as we could [29, 9] through the cut-and-paste component library, and we ran the recorded Sprite traces on this version of Patsy. The original machine on which the traces were recorded was a Sun 4/280 equipped with 128MB of main memory and three SCSI busses that connect to a total of 10 disks. There were a total of 14 file-systems on the set of disks, of which two were clearly hot-spots. For the experiments we used simulated HP97560 disks and SCSI-2 busses. On all file-systems we ran a segmented LFS. All components used by this simulator are derived from the cut-and-paste framework.

Figures 2--4 show a cumulative distribution of the file-system latencies for the Sprite traces in Patsy (only traces 1a, 1b, and 5 are shown; the others are permutations of these three runs). Each trace represents a 24-hour period of operations, and we have measured the time it takes to complete each of the operations. The horizontal bars show the average experienced latency, the vertical bars show the fraction of operations completed. Figure 5 shows the mean file-system latencies for all the traces.

Each graph in Figures 2--4 has a similar form. All operations that complete within 2 milliseconds are serviced from the file-system caches. The 2-millisecond boundary is the minimal latency when a request is serviced by the disk (SCSI-request decoding). The period up to 17 milliseconds represents the time waiting for the rotation of the disk (HP97560 disks spin at 4002 rpm), including the controller overhead. The bump at 17 milliseconds represents the large number of operations that had to wait for a full disk rotation. The periods larger than 17 milliseconds are those where the disk queues were longer than one entry or where the disk required head and/or cylinder switches. We are re-evaluating the trace results to present both delays separately.

From the performance numbers shown in Figures 2--4 and Figure 5 we learn that, in general, the UPS experiment performs better than the NVRAM experiments, which in turn perform better than the ordinary write-delay policy.
In all but trace 5, a UPS file-system is much faster than the write-delay experiment, while the NVRAM experiment is only twice as fast. The reason for this is that for most traces the disk queues are minimized. During trace 5, many large writes enter the system while there is also a fair number of stat and read operations. The write operations fill up all of the available memory with dirty blocks, which are not flushed since we are using a naive flush policy, and clutter up memory. This lowers read cache hit-rates and causes the read operations to stall while data is flushed to disk. This happens to a lesser extent in trace 1b. We found that, in general, write-saving policies combined with a naive flush policy resulted in lower cache hit-rates.

Figure 5 shows that for trace 1b NVRAM does not help much to reduce file-system latencies. In this trace there are many large and parallel write operations. Since dirty data can only be stored in NVRAM, the NVRAM becomes a bottleneck: new writes wait for the NVRAM to drain. In some cases we found that the write-back policy deteriorated to a write-through policy.

For the NVRAM case, we measured two different flush policies: whole file (that is, when a file block is flushed, all dirty blocks of that file are flushed) and partial file (only the selected block is flushed). As expected, whole-file flushes perform better than partial-file flushes, as they leave on average more non-dirty space in the NVRAM buffer. This means that write operations complete sooner, as they do not have to wait for data to be flushed to disk first.

In general, delaying write operations reduces disk contention even though there are more cache misses. We showed that, because of this, file-system operation latencies decrease. When there is a lot of activity in the file-system, there is a possibility that the available memory becomes cluttered with dirty data, which increases subsequent file-system latencies as data needs to be written to disk first. To solve this problem, we are experimenting with more aggressive write policies.

5.2 Lessons learned

While building and experimenting with an off-line simulator, we were surprised by a number of performance bottlenecks that would have been hard to find in an on-line file-system. Since simulator and system are one and the same, solving a performance bottleneck in the simulator also solves it in the real system.

We ran the Sprite traces on the simulator and were surprised by the ``slowness'' of the simulator: it took several hours to simulate one hour of Sprite time. First, we used prof(1) to analyze which procedures were accessed most. It turned out that the way we were maintaining the LRU lists was sub-optimal. By carefully analyzing what the system was doing through gdb(1), we found several short-cuts in the list maintenance. This improved simulation time dramatically. Finding this problem in a real system would have been hard. First, it would have been hard to replay the same workload repeatedly. Second, it would have been hard to run the real system under gdb(1) and analyze step by step what the actual problems were.

A second problem we found had to do with the way the cache was flushed. In our original system, the thread that needed a cache block was also the one that initiated a cache flush and waited for the flush to complete. As more esoteric flush policies were used, the delay for this thread increased. Since we were able to analyze off-line, we could quickly determine why some threads were severely delayed.
The obvious solution was to make the flush policy an asynchronous operation. Again, this problem would have been more difficult to find in a real system.

We found the NVRAM contention problem through careful analysis and a hand-crafted work load. The other option would have been to allocate a part of a file-server's memory and simulate the behavior of NVRAM in an on-line system. By being able to simulate the behavior off-line, we could quickly decide that it is better to equip a file-system with a UPS rather than with NVRAM.

The biggest surprise came when we instantiated the component library to a real system: most algorithms ran immediately. This basically means that file-systems can be developed in a controlled environment. The file-system is now in use for real data storage.

It is certainly true that file-system experiments can also be performed in kernel-based file-systems. Current BSD systems can easily be changed to support all kinds of experiments (as noted by an anonymous reviewer). We believe, however, that user-level file-systems are easier to maintain and that it is easier to track performance bottlenecks in them. Developing file-systems in user mode has the advantage that the machine does not halt whenever there is a fatal error. A production version of the system can always be implemented inside the kernel, although we do not expect the performance to improve much. The problem with Unix user-level file-systems, however, is that some parts of the file-system have to be constructed carefully to hide the absence of threads in most Unix systems. To demonstrate that the current system can also be used as a kernel service, we have inserted the system into the Nemesis micro-kernel [19]. The reason for this integration is to provide Nemesis with a file-system and to perform multi-media storage experiments.

We fully agree that the system described in this paper is partially an engineering exercise (as noted by an anonymous reviewer): we have built a tool to perform file-system experiments. Since related parts of the system are encapsulated in their class hierarchies, changing particular file-system policies is not hard. This gives us an opportunity to quickly learn the effects of new algorithms and file-system configurations (even if we do not have the equipment we are simulating).

5.3 How to make the results usable

Based on the experiments, we have decided to pursue a real file-system that is equipped with a UPS. We are re-evaluating a protocol to safely distribute dirty data over client and server machines. As is shown in earlier work [4], this allows us to guarantee data persistency in the case of a single point of failure or a global power failure. Evidently, we will re-evaluate this through the simulator. Once we are happy with the performance results of a more aggressive flush policy, we will migrate the cache policy into PFS and measure its performance relative to a standard 30-second-update policy. In these measurements we will show the performance differences between a simulator and a real system.

We have compared the performance of system and simulator in a small test environment. The analysis so far suggests that the results of the simulator have real value and are comparable to real performance. We are working on a full comparison of simulator and system and will provide a separate report that validates our approach.

6. Summary and further work

We have presented a file-system reference framework, which we use to experiment with new file-system storage algorithms.
The file-system framework can be instantiated to off-line simulators and on-line file-systems. Through off-line simulations, storage algorithms are analyzed thoroughly before being used in a real system. We have constructed the framework such that it is reasonably straightforward to analyze other researchers' algorithms in our simulator.

As an example, we have analyzed several write-saving policies in a simulator that is instantiated from the framework. We concluded that write-saving policies are beneficial even though cache hit-rates decrease. Currently, we are still analyzing more aggressive write policies before we put them to use in a real system and measure the relative performance of a real system compared to an equivalent simulator.

Acknowledgments

We would like to thank the anonymous reviewers, Tage Stabell-Kuloe, Richard Golding, Sjoerd Mullender, Arne Helme, George Neville-Neil, Paul Sijben, Martijn van der Valk, Lars Vognild and the other Pegasi for their support and criticism on earlier versions of this paper.

Availability

The framework is available through the authors. Announcements are made on pfs[-request]@pegasus.esprit.ec.org.

References

[1] Mary Baker, Satoshi Asami, Etienne Deprit, John Ousterhout, and Margo Seltzer. Non-volatile memory for fast, reliable file systems. Proceedings of the Fifth ASPLOS (Boston, Massachusetts), pages 10--22, 1992.

[2] Mary G. Baker, John H. Hartman, Michael D. Kupfer, Ken W. Shirriff, and John K. Ousterhout. Measurements of a distributed file system. Proceedings of the 13th ACM Symposium on Operating Systems Principles (Pacific Grove, CA), volume 25, number 5 of Operating Systems Review, pages 198--212, October 1991.

[3] Trevor Blackwell, Jeffrey Harris, and Margo Seltzer. Heuristic cleaning algorithms in log-structured file systems. Conference Proceedings of the USENIX 1995 Technical Conference on UNIX and Advanced Computing Systems (New Orleans), pages 277--88. Usenix Association, 16--20 January 1995.

[4] Peter Bosch. A cache odyssey. M.Sc. thesis, published as Technical Report SPA--94--10. Faculty of Computer Science/SPA, Universiteit Twente, the Netherlands, 23 June 1994.

[5] Peter Bosch. From ULFS towards PFS. University of Twente, 9 December 1992. A list of notes used to design a new file system.

[6] Michael Burrows. Efficient Data Sharing. PhD thesis, published as Technical Report 153. University of Cambridge, December 1988.

[7] Pei Cao, Edward W. Felten, and Kai Li. Implementation and performance of application-controlled file caching. Proceedings of the First Symposium on Operating Systems Design and Implementation (OSDI) (Monterey, CA), pages 165--77. Usenix Association, 14--17 November 1994.

[8] Richard Golding, Peter Bosch, Carl Staelin, Tim Sullivan, and John Wilkes. Idleness is not sloth. Conference Proceedings of the USENIX 1995 Technical Conference on UNIX and Advanced Computing Systems (New Orleans), pages 201--12. Usenix Association, 16--20 January 1995.

[9] John H. Hartman. Allspice's configuration, 24 March 1995. Private communication.

[10] John H. Hartman and John K. Ousterhout. Letter to the editor. Operating Systems Review, 27(1):7--10. Association for Computing Machinery SIGOPS, January 1993.

[11] David M. Jacobson and John Wilkes. Disk scheduling algorithms based on rotational position. Technical report HPL--CSP--91--7rev1. Hewlett-Packard Company, 16 February 1991.

[12] Ramakrishna Karedla, J. Spencer Love, and Bradley G. Wherry. Caching strategies to improve disk system performance.
IEEE Computer, pages 38--46, March 1994.

[13] David Kotz, Song Bac Toh, and Sriram Radhakrishnan. A detailed simulation model of the HP 97560 disk drive. Technical report PCS--TR94--220. Dartmouth College, 18 July 1994.

[14] Marshall K. McKusick, William N. Joy, Samuel J. Leffler, and Robert S. Fabry. A fast file system for UNIX. ACM Transactions on Computer Systems, 2(3):181--97, August 1984.

[15] L. Mummert and M. Satyanarayanan. Long term distributed file reference tracing: implementation and experience. Technical report CMU-CS-94-213, Carnegie Mellon University, November 1994.

[16] Michael N. Nelson, Brent B. Welch, and John K. Ousterhout. Caching in the Sprite network file system. ACM Transactions on Computer Systems, 6(1):134--54, February 1988.

[17] John K. Ousterhout, Hervé Da Costa, David Harrison, John A. Kunze, Mike Kupfer, and James G. Thompson. A trace-driven analysis of the Unix 4.2 BSD file system. Proceedings of the 10th ACM Symposium on Operating Systems Principles (Orcas Island, WA). Published as Operating Systems Review, 19(5):15--24, December 1985.

[18] A. L. Narasimha Reddy and James C. Wyllie. I/O issues in a multimedia system. IEEE Computer, pages 69--74, March 1994.

[19] Timothy Roscoe. The structure of a multi-service operating system. Technical report 94--5. Pegasus project, Esprit BRA 6586, March 1994.

[20] Mendel Rosenblum and John K. Ousterhout. The design and implementation of a log-structured file system. Proceedings of the 13th ACM Symposium on Operating Systems Principles (Pacific Grove, CA, October 1991). Published as Operating Systems Review, 25(5):1--15. Association for Computing Machinery, October 1991.

[21] Chris Ruemmler and John Wilkes. An introduction to disk drive modeling. IEEE Computer, pages 17--28, March 1994.

[22] Chris Ruemmler and John Wilkes. UNIX disk access patterns. USENIX Technical Conference Proceedings (San Diego, CA), pages 405--20. USENIX, Winter 1993.

[23] Russel Sandberg, David Goldberg, Steve Kleiman, Dan Walsh, and Bob Lyon. Design and implementation of the Sun Network Filesystem. USENIX Conference Proceedings (Portland, OR), pages 119--30. USENIX, Summer 1985.

[24] Small Computer System Interface -- 2. American National Standard for Information Systems, 23 June 1989. Working draft proposal.

[25] Margo Seltzer. A comparison of data manager and file system supported transaction processing. Computer Science Division, Department of Electrical Engineering and Computer Science, University of California at Berkeley, January 1990. Obtained from the author.

[26] Margo Seltzer, Keith Bostic, Marshall Kirk McKusick, and Carl Staelin. An implementation of a log-structured file system for UNIX. USENIX Technical Conference Proceedings (San Diego, CA), pages 307--26. USENIX, Winter 1993.

[27] Margo Seltzer, Keith A. Smith, Hari Balakrishnan, Jacqueline Chang, Sara McMains, and Venkata Padmanabhan. File system logging versus clustering: a performance comparison. Conference Proceedings of the USENIX 1995 Technical Conference on UNIX and Advanced Computing Systems (New Orleans), pages 249--64. Usenix Association, 16--20 January 1995.

[28] Margo Ilene Seltzer. File system performance and transaction support. PhD thesis. University of California, 1992.

[29] Ken Shirriff. Allspice's configuration, 23 March 1995. Private communication.

[30] Chandramohan A. Thekkath, John Wilkes, and Edward D. Lazowska. Techniques for file system simulation. Software---Practice and Experience, 24(11):981--99. John Wiley and Sons Ltd, November 1994.

[31] John Wilkes, Richard Golding, Carl Staelin, and Tim Sullivan.
The HP AutoRAID hierarchical storage system. Proceedings of the 15th ACM Symposium on Operating Systems Principles, 1995.

Author information

Peter Bosch graduated with degrees in EE (Bc., 1988) and CS (M.Sc., 1994) from the University of Twente, the Netherlands. He has worked since 1991 for the University of Twente as a Ph.D. student. His current work involves research into file-systems and the implementation of a high-performance file-system for the Esprit Pegasus project. His hobbies include traveling and spending evenings in restaurants.

Prof. Sape J. Mullender is chairman of the systems programming and architecture department in the Faculty of Computer Science of the University of Twente in the Netherlands, where he leads the Huygens research project on fault tolerance, real time, multimedia and security in distributed systems. Sape Mullender is interested in high-performance distributed computing and the design of scalable fault-tolerant services. He is also concerned about organization and protection in distributed systems that can span a continent. He is a principal designer of the Amoeba distributed operating system.