The following paper was originally published in the Proceedings of the USENIX Summer 1993 Technical Conference, Cincinnati, Ohio, June 21-25, 1993.

For more information about the USENIX Association contact:
1. Phone: 510 528-8649
2. FAX: 510 548-5738
3. Email: office@usenix.org
4. WWW URL: https://www.usenix.org

Adaptive Block Rearrangement Under UNIX*

Sedat Akyürek & Kenneth Salem
Department of Computer Science
University of Maryland, College Park, MD 20742

Abstract

An adaptive UNIX disk device driver is described. The driver copies frequently-referenced blocks from their original locations to reserved space near the center of the disk to reduce seek times. Reference frequencies need not be known in advance. Instead, they are estimated by monitoring the stream of arriving requests. Measurements show that the adaptive driver reduces seek times by more than half and improves response times significantly.

1 Introduction

In a recent paper [Akyurek 93] we introduced an adaptive block rearrangement technique which reduces disk response times by reducing seek times. In this technique, a small number of frequently referenced blocks are copied from their original locations to reserved space near the middle of the disk. The term "adaptive" means that no advance knowledge of the frequency of reference to the data blocks is required. Instead, reference frequencies are estimated by monitoring the stream of requests for data from the disk. The set of blocks in the reserved space is changed periodically to adapt to changing access patterns. Trace-driven simulations have shown that this technique can be very effective in reducing seek times. This paper describes an implementation of the technique using a modified UNIX device driver. It also presents the results of detailed measurements taken during the operation of the system. Our results demonstrate that adaptive block rearrangement can be easily and effectively integrated under existing file systems.

1.1 Related Work

That a disk's performance can be improved by clustering frequently accessed data is well known. If data references are derived from an independent random process with a known, fixed distribution, it has been shown that the organ pipe heuristic places the data optimally [Wong 80, Grossman 73]. The organ pipe heuristic calls for the most frequently accessed data to be placed in the center of the disk. The next most frequently accessed data is placed to either side of the center, and the process continues until the least-accessed data has been placed at the edge of the disk. More recently, similar results have been shown for optical storage media [Ford 91]. In practice, data references are not drawn from a fixed distribution, nor are they independent. Although references are highly skewed [Floyd 89, Staelin 91, Vongsath 90, Ouster 85], request distributions change over time, and they are generally not known in advance. Nevertheless, variations of the organ pipe heuristic seem to work well in practice. Recently, several papers have proposed adaptive applications of data clustering based on this idea.
__________________________________________________
*This work was supported by National Science Foundation Grant No. CCR-8908898 and in part by CESDIS. UNIX is a trademark of AT&T.

Vongsathorn and Carson [Vongsath 90] showed that dynamically clustering frequently accessed data worked better than static placement. In that study, disk cylinders are dynamically rearranged using the organ pipe heuristic, according to observed data access frequencies. Recent work in the DataMesh project [Ruemmler 91] considered rearrangement of cylinders and blocks, with mixed results. Their conclusion that block shuffling generally outperforms cylinder shuffling corroborates one of our own. A similar approach is employed in the experimental iPcress file system [Staelin 91], which monitors access to files and moves files with high "temperatures" (frequency of access divided by file size) to the center of the disk.

Our technique differs from each of the techniques mentioned above in at least one of the following aspects.

Granularity: Our technique moves blocks rather than cylinders or files. Blocks within a file or within a cylinder can vary in temperature. Also, block rearrangement can increase the number of zero-length seeks (i.e., no need to seek at all), while cylinder reorganization cannot. Smaller granularity also facilitates incremental rearrangement.

Data Volume: Only a small fraction of blocks are rearranged at any time, as opposed to reorganizing all the data on the disk. This makes rearrangement faster.

Layout Preservation: Block relocation is temporary, and cooled blocks are returned to their original locations.

Transparency: Our technique can be implemented in a device driver (or controller). No changes to the file system are required.

Other systems cluster data based on criteria other than reference frequency. The Berkeley Fast File System (FFS) [McKusick 84] used in many UNIX systems uses placement heuristics that try to cluster the blocks of a file. However, hot blocks from different files may be spread widely over the disk's surface. This can result in long random seek operations when requests for the blocks of different files are interleaved, as is the case in multi-user systems.

The LFS [Rosenblum 91] and Loge [English 92] systems rearrange data based on the order of writes in the request stream. The primary goal of these systems is to improve write performance, not read performance, and they do so by reducing both seek and rotational delays. LFS eliminates seek and rotational delays by combining many write operations into a single large write operation. The Loge self-organizing disk controller transparently reorganizes blocks each time they are written to reduce seek and rotational delay. Simulation studies of the controller show that it can reduce write service times, but the savings come at the expense of increased read service times. In contrast to both Loge and LFS, our adaptive block rearrangement technique makes both read and write operations faster, and, unlike Loge, the block rearrangement system described here preserves the data placement done by the file system. Finally, although the adaptive block rearrangement system described here is implemented in the device driver on the host, it can be implemented in an intelligent IO or disk controller like Loge.

In the next section we will give an overview of the block rearrangement technique. Section 3 describes the implementation of adaptive block rearrangement in a UNIX system.
In Section 4 we present the results of performance measurement experiments conducted while the system was operational on a network file server.

2 Overview

The idea of block rearrangement is motivated by the fact that access to data stored on disks is highly skewed. Of the thousands of blocks stored on a disk, a small fraction of them absorbs most of the requests. If the hot (frequently accessed) blocks are spread over the surface of the disk, distant from each other, long seek delays may result.

Hot blocks can be clustered on a reserved set of contiguous disk cylinders to reduce seek times. Under our block rearrangement technique, this is achieved by copying hot blocks onto a set of reserved cylinders in the middle of the disk. The block layout outside of the reserved cylinders is left undisturbed. Incoming requests are redirected to the reserved cylinders if the block resides there. If the reserved cylinders accommodate blocks which get most of the references, we expect that the disk head will tend to linger over those cylinders and seek distances will be reduced.

Figure 1: Adaptive Device Driver

Figure 1 illustrates the organization of an adaptive block rearrangement system implemented in a disk driver. The system monitors the stream of requests directed to the disk and periodically produces a list of hot (frequently-referenced) blocks, ordered by frequency of reference. The hot block list is used to determine which blocks should be placed in the reserved cylinders. The rearrangement system copies the selected hot blocks from their original locations to the reserved cylinders. These blocks remain in the reserved space until the next hot block list is produced. Blocks which are no longer hot are copied back to their original locations.

The block rearrangement system assigns hot blocks to the reserved cylinders according to their rank in the list, i.e., their frequency of reference, using the organ pipe heuristic. Assuming that C blocks fit on a cylinder, the C hottest blocks are placed on the middle cylinder of the reserved cylinder group. The next hottest are placed on an adjacent cylinder, and so on, so that the final cylinder reference distribution across the reserved cylinders forms an organ pipe distribution.
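To make the assignment concrete, the following is a minimal sketch in C of the organ-pipe fill order just described. It is an illustration, not code from the driver; the cylinder count, blocks-per-cylinder value, and block numbers are invented for the example.

    #include <stdio.h>

    #define NCYL 3          /* reserved cylinders (example value)  */
    #define C    4          /* blocks per cylinder (example value) */

    int main(void)
    {
        int hot[NCYL * C];  /* hot block numbers, hottest first    */
        int cyl[NCYL][C];   /* resulting reserved-area layout      */
        int order[NCYL];    /* cylinder fill order: organ pipe     */
        int mid = NCYL / 2;
        int i, j, k, lo, hi;

        for (i = 0; i < NCYL * C; i++)
            hot[i] = 1000 + i;          /* dummy block numbers */

        /* Middle cylinder first, then alternate to either side. */
        order[0] = mid;
        for (k = 1, lo = mid - 1, hi = mid + 1; k < NCYL; ) {
            if (hi < NCYL)
                order[k++] = hi++;
            if (k < NCYL && lo >= 0)
                order[k++] = lo--;
        }

        /* The C hottest blocks fill order[0], the next C fill
           order[1], and so on: an organ pipe distribution.     */
        for (i = 0; i < NCYL * C; i++)
            cyl[order[i / C]][i % C] = hot[i];

        for (i = 0; i < NCYL; i++) {
            printf("cylinder %d:", i);
            for (j = 0; j < C; j++)
                printf(" %4d", cyl[i][j]);
            printf("\n");
        }
        return 0;
    }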
3 Implementation

In this section we will present the implementation details of the components shown in Figure 1. The block rearrangement system is implemented through modifications to the disk device driver of SunOS† 4.1.1. In addition, several user level programs are used to control the modified driver.

__________________________________________________
†SunOS is a trademark of Sun Microsystems, Inc.

Figure 2: Block Mapping on a Rearranged Disk

3.1 Modifications in the Device Driver

The device driver modifications implement the reserved space on the disk and the mapping of hot blocks to their new positions in the reserved space. They also provide entry points to the kernel for controlling the movement of blocks to and from the reserved area and for monitoring block accesses. Another function of the modified driver is to gather performance statistics.

3.1.1 Reserved Space

Disk labels contain information about the size of the disk and the sizes and positions of disk partitions. This information is used by the newfs program to initialize a file system on a given partition on the disk. To make space for the rearranged blocks, the target disk is made to look smaller than it really is by changing the disk geometry information on the disk label. Disk partitioning information is also changed accordingly, so that the file system thinks that the disk has fewer cylinders. The hidden cylinders implement the reserved space. The driver implements the mapping between the logical (smaller) disk and the actual disk (Fig. 2).‡

When a target disk is initialized for rearrangement, the first sector and the length of the reserved space are recorded in its label. During initialization a special value is also recorded in the label to mark it as a "rearranged" disk.§ At the time of system start-up, the attach routine of the driver checks if the disk is a rearranged disk. If it is, then the information about the reserved area is read in to be used for block mapping.

__________________________________________________
‡In SCSI disks the disk interface presents the disk space as a sequence of logical sectors. The actual physical locations of the sectors are not known. We rely on an implicit assumption that most SCSI sector numbers are close to their true physical numbers.
§Block rearrangement is applied on a physical device basis. That is, a disk may have several partitions and consequently several file systems on it, but all the blocks on the disk can be rearranged regardless of which partition they belong to. We require each file system on the disk to have the same block size.

3.1.2 Block Mapping

The strategy routine of a disk driver is responsible for converting the file system's block addresses to physical block addresses. In the modified driver, it also does the mapping from the logical disk to the real disk and reroutes requests for rearranged blocks when necessary. When a request arrives for a rearranged disk, the strategy routine first converts the logical block address to a physical block address. It then determines whether the block has been repositioned into the reserved area. To implement this check the driver maintains a data structure called the block table for the rearranged blocks. When a block is copied into the reserved space, it is entered into the table. The table records the block's old and new physical addresses and a dirty bit. If an entry for the requested block is found in the block table, its new physical address is used to retrieve the data.

A copy of the block table is also stored on the disk (at the beginning of the reserved area) to be used at start-up and for recovery purposes. The attach routine of the driver reads in the block table at start-up. The copy of the block table on the disk always correctly reflects the rearranged blocks and their positions in the reserved area (unless media failure occurs). However, the dirty bit information may not always be up to date. Dirty bits are used to determine if rearranged blocks have been written. They determine whether blocks must be copied back to their original locations when they cool down. In our implementation, if there are repositioned blocks in the reserved area at the time of start-up, they are all marked as dirty. This conservative strategy ensures correct operation in case a repositioned block is written and system operation is interrupted before this was reflected in the disk copy of the block table.

The size of a "block" in the rearrangement system is the size of a file system block. The file system requests at most one block of data from the disk at a time. The raw disk IO interface, on the other hand, allows requests larger than the block size to be forwarded to the disk. Driver routines responsible for raw IO were changed to break large requests into block-sized subrequests.
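The block table and the remapping step are simple enough to sketch. The following is illustrative C, not the actual driver source: the entry layout (old address, new address, dirty bit) follows the description above, but the names, the fixed-size array, the linear search, and the prefix-shift reading of the logical-to-physical mapping in Figure 2 are our assumptions.

    #include <sys/types.h>

    #define NRELOC 1018     /* assumed capacity of the reserved area  */

    struct bt_entry {
        daddr_t orig;       /* original physical block address        */
        daddr_t newblk;     /* address within the reserved cylinders  */
        int     dirty;      /* set once the relocated copy is written */
    };

    static struct bt_entry btab[NRELOC];  /* the block table          */
    static int             btab_n;        /* entries currently in use */

    /* Logical (shrunken) disk to real disk.  One plausible reading of
       Figure 2: addresses at or past the start of the reserved area
       are shifted over the hidden cylinders. */
    static daddr_t resv_start, resv_len;  /* read from the disk label */

    static daddr_t
    log_to_phys(daddr_t lblk)
    {
        return (lblk < resv_start) ? lblk : lblk + resv_len;
    }

    /* Called from the strategy routine after address conversion:
       redirect the request if the block has been copied into the
       reserved area, and remember writes through the dirty bit. */
    static daddr_t
    bt_map(daddr_t pblk, int iswrite)
    {
        int i;

        for (i = 0; i < btab_n; i++) {
            if (btab[i].orig == pblk) {
                if (iswrite)
                    btab[i].dirty = 1;
                return btab[i].newblk;
            }
        }
        return pblk;        /* not rearranged; address unchanged */
    }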
3.1.3 Block Movement

The driver provides kernel entry points through ioctl calls for user-level processes to control the movement of blocks into and out of the reserved area. It implements two ioctl calls for block movement:

DKIOCBCOPY copies a given block to a given address in the reserved area. It also places an entry for the block in the block table and forces the block table to be written to disk.

DKIOCCLEAN cleans the reserved area by removing the blocks from the block table one at a time. Blocks leaving the reserved area are copied to their original disk positions if they have been updated, i.e., if their dirty bit is set. The block table is written to disk after each block is moved out.

Copying a block into the reserved area requires three IO operations. Moving a block out of the reserved area requires at least one IO operation, and two extra operations if the block is dirty. However, other requests can interleave with these operations. Requests for a block that is being moved are delayed temporarily by the driver.

3.1.4 Request Monitoring

The driver records requests (block number, request size, flags, etc.) in a small internal table. Request recording stops when the table is full, and resumes when the table is emptied. An ioctl call enables user processes to read the contents of the table and to clean it. A user process periodically reads the contents of the request table and uses them to analyze the block accesses. The frequency of request table reads can be changed. We have used a period of two minutes in our experiments. This was short enough to capture almost all the requests.

3.1.5 Performance Monitoring

Another function of the driver is to gather statistics about different measures related to disk accesses, such as seek distances, service times and queueing times. This is provided for monitoring the performance of the system and is not a part of the block rearrangement system. Performance monitoring is implemented much like request monitoring. Statistics are recorded in a table inside the driver. User-level processes can read the contents of this table through an ioctl call which also resets the values in the table. The table contains statistics collected since the last time the table was read. The beginning and ending times of the recording are also entered into the table.

All statistics are recorded separately for read operations and write operations. The table records seek distance distributions (in arrival order and in scheduled order) and service time and queueing time histograms. Time histograms are recorded with a resolution of 1 ms. Cumulative service times and queueing times are also recorded. Queueing time for a request is the period between the time the driver first receives the request and the time the request is submitted to the disk. Service time for a request is from the end of the queueing time to the time the request is returned from the disk.

3.2 User Level Programs

User level programs use the entry points provided by the modified kernel to monitor and analyze block access frequencies. Using this information, they decide which blocks should be rearranged and where they should be placed in the reserved area. One user-level process, the reference stream analyzer, monitors the block accesses and tries to determine the frequently accessed blocks and their access frequencies. Using this information, another process, called the block arranger, decides which blocks are to be rearranged. This process also controls the placement of the selected blocks in the reserved area.
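A rough sketch of the user-level side shows how these pieces fit together. DKIOCBCOPY and DKIOCCLEAN are the calls described above; everything else here (the ioctl numbers, the name DKIOCGETREQS standing in for the unnamed request-table call, the structure layouts, and the device path) is assumed for the sketch.

    #include <sys/ioctl.h>
    #include <sys/types.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <stdio.h>

    /* Placeholder ioctl numbers; the real ones would come from the
       modified driver's header. */
    #define DKIOCGETREQS 101    /* hypothetical request-table read   */
    #define DKIOCBCOPY   102
    #define DKIOCCLEAN   103

    struct reqrec   { long blkno; int size; int flags; };
    struct reqtable { int n; struct reqrec r[512]; };
    struct bcopyarg { long blkno; long newblkno; };  /* assumed layout */

    #define NBLKS 16384         /* assumed number of blocks on disk */
    static long freq[NBLKS];    /* per-block reference counts       */

    int main(void)
    {
        int fd = open("/dev/rsd1c", O_RDWR);  /* raw rearranged disk */
        struct reqtable rt;
        int i, periods = 0;

        if (fd < 0) { perror("open"); return 1; }
        for (;;) {
            /* Reference stream analyzer: every two minutes, read and
               clear the driver's request table and tally accesses.  */
            if (ioctl(fd, DKIOCGETREQS, &rt) == 0)
                for (i = 0; i < rt.n; i++)
                    if (rt.r[i].blkno >= 0 && rt.r[i].blkno < NBLKS)
                        freq[rt.r[i].blkno]++;

            /* Block arranger: at the end of the day, drain the
               reserved area and reinstall the hottest blocks.       */
            if (++periods == 720) {           /* 720 * 2 min = 24 h  */
                struct bcopyarg a;
                ioctl(fd, DKIOCCLEAN, 0);
                a.blkno = 0; a.newblkno = 0;
                /* Selection and placement omitted: sort freq[], pick
                   the hot blocks, compute reserved-area addresses,
                   then issue one DKIOCBCOPY per block, e.g.:        */
                ioctl(fd, DKIOCBCOPY, &a);
                periods = 0;
            }
            sleep(120);
        }
    }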
The functions implemented by the user-level processes can easily be incorporated into the device driver itself. In particular, our techniques for estimating block reference counts are fast and require little space,¶ and the block placement policies are very simple. However, we wanted to use our system as a test bed to experiment with different methods for learning access frequencies and different rearrangement policies. Implementing those functions in user-level processes gave us flexibility in our experiments.

The block arranger implements three block placement policies:

Organ-pipe placement: This policy arranges the blocks chosen for rearrangement in an organ-pipe pattern according to their access frequencies. This is done by placing the blocks with the highest access frequencies in the center cylinder of the reserved area and continuing to fill cylinders on alternating sides of the center with the blocks which have the next highest access frequencies.

Interleaved placement: The SunOS UNIX file system (UFS), which is based on FFS, places the blocks of a file interleaved by an interleaving factor. The purpose is to reduce rotational delays while accessing a file sequentially. Our interleaved placement policy tries to preserve the interleaving performed by the file system. This policy begins by placing the hottest block first, like the organ-pipe placement. However, after it places a block it does not immediately select the next hottest block for placement. Instead it looks for a block which could be the previous block's successor in the file system's interleaved placement. The successor's physical location must of course succeed the previous block's physical location by the interleaving factor. The placement policy also requires the successor's access frequency to be close‖ to the previous block's access frequency (this is the policy's criterion for deciding whether two blocks belong to the same file). If such a "could be successor" block is found, it is placed using the same interleaving factor used by the file system. This process continues as long as a chain of such probable successor blocks is found; a sketch of the successor test appears after this list. The interleaved placement also stops when the end of the current cylinder is reached. When the policy quits interleaved placement mode, it continues by choosing the block which has the next highest access frequency.

Serial placement: This is the simplest placement policy. Blocks that are chosen for rearrangement are placed in the reserved space in ascending order of their original block numbers. This policy needs to know only the block numbers of the hot blocks; access frequencies of the hot blocks are not necessary.

__________________________________________________
¶The hot block estimation algorithm we have used and several other space-efficient hot spot detection algorithms are presented in [Salem 92].
‖The successor's access frequency is defined to be close if it is at least 50% of the previous block's frequency. The 50% figure was selected arbitrarily.

Figure 3: Placement Policies. The block placements of the different policies are shown. The set of blocks to be rearranged and their estimated access frequencies are also given. The reserved area has three cylinders with four blocks in each cylinder. The interleaving factor (rotational delay) of the file system is assumed to be one block. The interleaved placement policy considers the access frequency of a block "close" to another block's if the first block's frequency is at least half of the second block's access frequency.
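The successor test at the heart of the interleaved policy can be sketched as follows. The data layout and names are ours, not the block arranger's source; cand[] is assumed to hold the candidate hot blocks with their estimated frequencies.

    #include <stdio.h>

    struct blk { long pblk; long freq; int placed; };

    /* Return the index of an unplaced block that looks like the
       file-system successor of cand[prev]: its physical address
       follows prev's by the interleave factor, and its frequency
       is at least half of prev's (the 50% closeness test).       */
    static int
    find_successor(struct blk cand[], int n, int prev, int ilv)
    {
        int i;

        for (i = 0; i < n; i++)
            if (!cand[i].placed
                && cand[i].pblk == cand[prev].pblk + ilv
                && cand[i].freq * 2 >= cand[prev].freq)
                return i;
        return -1;  /* none found: fall back to the next hottest */
    }

    int main(void)
    {
        struct blk cand[] = {
            { 100, 40, 0 },     /* hottest block               */
            { 101, 30, 0 },     /* plausible successor of 100  */
            { 500, 35, 0 },
        };
        int s = find_successor(cand, 3, 0, 1);

        if (s >= 0)
            printf("successor of block %ld: block %ld\n",
                   cand[0].pblk, cand[s].pblk);
        return 0;
    }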
4 Performance Measurements

In this section we will present the results of a series of experiments we have conducted to monitor the adaptive driver's performance. We will first describe the environment in which the experiments were run.

4.1 Experimental Environment

The adaptive block rearrangement system is operational on a Sun SPARCstation** 2, called Sakarya. The operating system on the machine is SunOS 4.1.1. The file system block size is 8 kilobytes and the fragment size is 1 kilobyte. Sakarya has a main memory of 32 megabytes. The amount of memory used for the buffer cache is not fixed. Under SunOS, memory pages are used for both process pages and IO pages on demand. This means that all of the available memory can potentially be used as a buffer cache [McVoy 91]. Sakarya is used mainly by one of the authors and is usually very lightly loaded. As a result, a large portion of the memory is available as a buffer cache most of the time.

Sakarya has two disks attached to it and one of them is used as a rearranged disk. The disk used for block rearrangement is a Toshiba Model MK156F SCSI disk. Specifications of the disk and its seek time function are given in Table 1. The seek time function is borrowed from [Carson], in which the disk delay parameters for this disk were measured and a precise seek time function was devised.

__________________________________________________
**Sun and Sparc are trademarks of Sun Microsystems, Inc.

Table 1: Specifications of the disk. Seek time is given in milliseconds as a function of the seek distance d (in cylinders).

    Toshiba MK156F SCSI disk
    ------------------------------
    Capacity (MB)            135
    Cylinders                815
    Tracks/Cylinder           10
    Sectors/Track             34
    RPM                     3600

    seektime(d) = 0                                                    if d = 0
    seektime(d) = 6.248 + 1.393*sqrt(d) - 0.99*cbrt(d) + 0.813*ln(d)   if 0 < d < 315
    seektime(d) = 17.503 + 0.03*d                                      if d >= 315

Out of the disk's 815 cylinders, 48 cylinders (3 cylinder groups) in the middle of the disk have been used as reserved cylinders. This amounts to approximately 8 megabytes (6% of the total disk capacity). Approximately 1000 8-kilobyte blocks can fit in this space. We chose the size of the reserved space in light of the results of our previous studies [Akyurek 93]. The results of trace-driven simulations in that study showed that most of the benefits of block rearrangement were realized by rearranging 1%-2% of the total number of blocks. The simulation results also showed that the additional benefits of using more blocks decreased sharply.

We used Sakarya as an NFS server for 14 SPARCstations for executable files and libraries. The disk stored parts of the /usr file system. The file system was mounted read-only on the client workstations. The user population of the workstations consisted of about 40 faculty and graduate students in the Computer Science Department of the University of Maryland. We also experimented with a read/write mounted file system stored on the rearranged disk. In this experiment the disk stored the home directories of 10 users whose primary activities include running simulations, editing papers and electronic mail.
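Seek times in the tables that follow are not measured directly; they are computed from measured seek-distance distributions and the seek time function of Table 1. A small C transcription of that computation is given below; the histogram values in main() are invented for illustration.

    #include <math.h>       /* link with -lm */
    #include <stdio.h>

    #define NCYLS 815

    /* Seek time in ms for a seek of d cylinders (Table 1). */
    static double
    seektime(int d)
    {
        if (d == 0)
            return 0.0;
        if (d < 315)
            return 6.248 + 1.393 * sqrt((double)d)
                 - 0.99 * cbrt((double)d)
                 + 0.813 * log((double)d);
        return 17.503 + 0.03 * d;
    }

    /* Mean seek time from a seek-distance histogram, where
       hist[d] counts the observed seeks of length d.        */
    static double
    mean_seek(const long hist[], int ncyls)
    {
        double t = 0.0;
        long   n = 0;
        int    d;

        for (d = 0; d < ncyls; d++) {
            t += hist[d] * seektime(d);
            n += hist[d];
        }
        return n ? t / n : 0.0;
    }

    int main(void)
    {
        static long hist[NCYLS];                /* zero-initialized */

        hist[0] = 880; hist[10] = 60; hist[400] = 60;  /* made up   */
        printf("mean seek time: %.2f ms\n", mean_seek(hist, NCYLS));
        return 0;
    }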
4.2 Experiments

We will present the results of four sets of experiments. In the first set, performance improvements due to block rearrangement were monitored when Sakarya's disk was storing the /usr file system. In the next two groups of experiments we varied the number of rearranged blocks and compared the effects of different rearrangement policies, respectively. These experiments were run when Sakarya held the /usr file system. In the last set of experiments we used rearrangement when the disk stored the home directories.

During the operation of the system, block access information that was monitored on one day was used at the end of the day for rearranging the blocks for the next day's operations. Statistics about the system's performance were gathered between 7am and 10pm. The timing measurements have a resolution of 5 microseconds on Sakarya.

4.2.1 Rearrangement of /usr

The block rearrangement system was operational on Sakarya for several months while it was serving 14 client workstations. We applied block rearrangement on alternate days for several weeks to assess the performance improvements. When rearrangement was applied, 1018 blocks were placed in the reserved area using the organ-pipe placement heuristic.

Results from four consecutive days are summarized in Tables 2 and 3. Results for the other days are similar. Table 2 presents the measurements for all requests whereas Table 3 describes only read operations. On days 1 and 3, block rearrangement was not applied and original copies of blocks were used. On days 2 and 4, block rearrangement was applied.

Table 2: Experimental results for /usr (all requests).

                                Day 1         Day 2     Day 3         Day 4
    Rearrangement               Not Applied   Applied   Not Applied   Applied
    FCFS Mean Seek Dist (cyln)  220           225       235           228
    Mean Seek Distance (cyln)   173             8       183             5
    Zero-length Seeks (%)        23            88        19            92
    FCFS Mean Seek Time (ms)     20.92         21.46     22.33         21.91
    Mean Seek Time (ms)          18.21          1.55     19.24          0.98
    Mean Service Time (ms)       38.41         22.95     39.71         22.61
    Mean Waiting Time (ms)       87.30         50.03     94.51         48.82
Table 3: Experimental results for /usr (read requests only).

                                Day 1         Day 2     Day 3         Day 4
    Rearrangement               Not Applied   Applied   Not Applied   Applied
    FCFS Mean Seek Dist (cyln)  131           165       182           192
    Mean Seek Distance (cyln)   123            23       170            17
    Zero-length Seeks (%)        44            67        31            72
    FCFS Mean Seek Time (ms)     12.97         16.14     17.24         17.91
    Mean Seek Time (ms)          12.49          4.49     16.61          3.54
    Mean Service Time (ms)       30.99         24.18     35.32         22.57
    Mean Waiting Time (ms)        6.64          5.47      4.76          4.46

All of the figures reported in the tables are measured values except for seek times. Seek times are computed using measured seek distance distributions and the seek time function shown in Table 1. First-come-first-served (FCFS) order seek delays are also presented together with actual seek delays. The effect of UNIX disk-head scheduling can be seen when we compare the FCFS and actual values on days 1 and 3.

Comparing the results for the "rearrangement-on" days and the "rearrangement-off" days, we see that adaptive block rearrangement drastically reduced seek distances and seek times. Service time reductions are significant but are slightly lower than the seek time reductions. This may be due to the disturbances in the file-system rotational optimizations caused by rearranged blocks.

The system is very successful at increasing the number of zero-length seeks. This is very important in reducing seek times because of the high constant-time overhead associated with every non-zero seek. By placing frequently accessed blocks on the same cylinders, the need to reposition the disk head was often eliminated completely. This is an advantage of rearranging blocks rather than cylinders. Cylinder rearrangement cannot increase the number of zero-length seeks.

Waiting times are also reduced as a result of block rearrangement, and this further reduces the response times for disk operations. This is particularly true for write operations because write requests arrive in bursts in UNIX systems.

A notable difference between the write operations and read operations in this environment is that block access distributions for write operations are much more skewed than they are for read operations. Since write operations are concentrated onto a smaller set of blocks, write operations benefit more from block rearrangement than read operations do. This fact explains many of the differences between the results reported in Table 3 and those in Table 2.

Figure 4: Results of experiments with different numbers of rearranged blocks.

4.2.2 Varying the Number of Rearranged Blocks

In our next set of experiments we varied the number of rearranged blocks in the reserved area. Block rearrangement was applied for several weeks (on weekdays only) by rearranging a different number of blocks each day. Results of these experiments are summarized in Figure 4. The graphs plot mean seek distances and mean seek times as a function of the number of rearranged blocks.
Arrival-order (FCFS) mean seek delays are also plotted. If block rearrangement had not been applied, the actual delays would have been slightly lower (due to disk head scheduling) than the arrival-order delays. As the number of blocks in the reserved area is increased, we expect to see greater reductions in seek delays. However, the results of our experiments draw a different picture. Most of the benefits of block rearrangement are realized by rearranging a very small number of blocks. The marginal benefit of using additional blocks is very small. Rearranging blocks in excess of the hottest 200 blocks brings almost no extra benefits.

The result of these experiments can be better understood by examining the skewed block reference distribution. Figure 5 shows the cumulative distribution of block requests on a typical day. Blocks are sorted in decreasing order of reference counts. Distributions for all the requests and also just for reads are shown. Fewer than 2000 blocks absorb all of the requests. The 100 hottest blocks absorb around 90% of all the requests. Read references are also very skewed. Most of the reductions in seek delays are due to the first few hundred blocks which get most of the references.

Figure 5: Distribution of block accesses on /usr.

4.2.3 Different Placement Policies

As mentioned earlier, the block arranger component of the rearrangement system implements two other block placement policies in addition to the organ-pipe placement. Although we normally used organ-pipe placement in our system, we also experimented with the other placement policies. Tables 4 and 5 present results of experiments for the serial placement policy and the interleaved placement policy. Table 4 lists the results of a day's operation for each of the three placement policies. Results for organ-pipe placement are included for comparison. Table 5 presents the results for another day for each of the interleaved and serial placement policies.

The serial placement policy projects hot blocks from different parts of the disk onto the reserved area without considering their access frequencies. It is a simpler policy than the organ-pipe placement but it is not as effective as the organ-pipe placement in reducing seek times. It fails to place more frequently accessed blocks closer to each other in the reserved area. As can be seen in Tables 4 and 5, serial placement produces fewer zero-length seeks than organ-pipe placement does. The serial placement policy's relatively poor performance illustrates the importance of using block access frequencies to determine placement.

The performances of the interleaved placement policy and the organ-pipe placement policy are very close. Organ-pipe placement performs slightly better in terms of seek times and service times. The interleaved placement policy is a variation of the organ-pipe policy. It tries to preserve the file system's rotational optimization in the reserved area. Organ-pipe placement may disturb the rotational optimization. While accessing files sequentially, this may result in higher rotational delays than in the case of no rearrangement.

Table 6 lists some data about the effects of each of the placement policies on rotational delays. These data are compiled from Tables 2-5. The values in Table 6 are the differences between the mean service times and the mean seek times for the read requests (for example, the interleaved placement entry derived from Table 4 is 24.31 - 5.86 = 18.45 ms). They correspond to the sum of mean transfer times and mean rotational delays.
Since rotational optimizations rarely benefit write requests, only the values for the read requests are included in the table. There are results from two different days for each of the placement policies and for the no-rearrangement case. Request size distributions are almost identical for each of the days, so we can safely assume that the differences in the values in Table 6 are due to differences in rotational delays. We can see from the values in the table that the organ-pipe placement and serial placement policies increase the rotational delays slightly. This is because of the disturbances they cause in the file system's rotationally-optimized placement. The interleaved placement, on the other hand, does not increase the rotational delays. Interleaved placement, however, disturbs the organ-pipe arrangement of blocks in the reserved area. As seen in Tables 4 and 5, this results in higher seek times than in the case of the organ-pipe placement. The increase in the mean rotational latency caused by the organ-pipe placement is more than made up for by the organ-pipe heuristic's greater seek time reductions. As a result, organ-pipe placement gives slightly better improvements in service time than interleaved placement.

Table 4: Experiments with placement policies.

                                Organ-pipe          Interleaved         Serial
                                All Reqs   Reads    All Reqs   Reads    All Reqs   Reads
    FCFS Mean Seek Dist (cyln)  225        165      208        144      208        142
    Mean Seek Distance (cyln)     8         23       15         24       22         39
    Zero-length Seeks (%)        88         67       83         61       26         39
    FCFS Mean Seek Time (ms)     21.46      16.14    20.02      14.39    20.02      14.23
    Mean Seek Time (ms)           1.55       4.49     2.50       5.86     8.50       8.57
    Mean Service Time (ms)       22.95      24.18    23.71      24.31    28.53      27.80
    Mean Waiting Time (ms)       50.03       5.47    46.85       5.14    61.32       6.32

Table 5: Experiments with interleaved and serial placement policies.

                                Interleaved         Serial
                                All Reqs   Reads    All Reqs   Reads
    FCFS Mean Seek Dist (cyln)  214        151      200        138
    Mean Seek Distance (cyln)    12         35       23         39
    Zero-length Seeks (%)        86         60       30         39
    FCFS Mean Seek Time (ms)     20.59      14.97    19.57      13.91
    Mean Seek Time (ms)           2.11       5.98     8.29       8.49
    Mean Service Time (ms)       23.50      24.45    28.49      27.81
    Mean Waiting Time (ms)       47.06       4.91    55.91       6.11

Table 6: Effects of different placement policies on rotational delays. The values in the table are mean rotational delay + mean transfer time for read requests, in milliseconds, from two different days.

                                Mean Rotational Latency + Mean Transfer Time (ms)
    Without Rearrangement       18.49      18.72
    Organ-pipe Placement        19.71      19.03
    Serial Placement            19.32      19.26
    Interleaved Placement       18.45      18.47

4.2.4 Rearrangement on a Read/Write Mounted File System

In this subsection we will present the results of rearrangement experiments on a read/write mounted file system.
As we mentioned earlier, the disk stored the home directories of 10 users. Because of the size limitations of the disk, we were not able to accommodate more users. This fact, combined with the large buffer cache provided by SunOS, resulted in very light disk loads.

For this environment we conducted experiments similar to the first set of experiments. We applied block rearrangement on alternate days for several weeks. In Figure 6, we present results from four consecutive days. Results for the other days are similar. In the figure, mean seek distances and mean seek times are given. We show both arrival-order (FCFS) and actual seek delays. As usual, statistics are given for read requests as well as for all the requests. Block rearrangement was applied on days 2 and 4.

Figure 6: Results of experiments with read/write mounted file system.

The reductions in the seek delays are not as large as those we observed for the /usr file system. Nonetheless, the reductions are significant, especially for the read operations. Mean seek times were reduced by 20%-40% for all the requests. Seek times for read requests were reduced by 40%-50%.

The poorer performance of the system in this environment can be attributed to several factors. Firstly, there are write requests resulting from new file creation and file expansion operations in this environment. Block rearrangement cannot benefit these kinds of write requests. Secondly, the original seek distances were relatively shorter. This makes the reductions look smaller. Finally, and most importantly, the block access distributions were less skewed in this environment. This can be observed from Figure 7, in which we have plotted the block access distribution for a typical day. Comparing Figure 7 with Figure 5, we see that the reference skew in this environment is less than that of the /usr file system.

Figure 7: Distribution of block accesses on read/write mounted file system.

A less skewed block access distribution is a disadvantage for the block rearrangement system. As the access distribution for the rearranged blocks becomes more uniform, the distribution of requests over the cylinders in the reserved area will also become more uniform. This reduces the system's ability to increase the number of zero-length seeks. Although it is not the case in this environment, a less skewed block access distribution may be a disadvantage in another way: the blocks placed in the reserved area get a smaller fraction of the total number of requests. As a result the disk head spends less time over the reserved area. This may reduce the benefits of the block rearrangement.

5 Conclusion

We have described the implementation of an adaptive block rearrangement technique under a UNIX operating system. The technique is implemented in the disk device driver and is transparent to the rest of the system.
Measurements taken during the operation of the system on a network file server have shown that it is very effective in reducing disk service times by reducing seek times. Results also show that waiting times are reduced, especially for bursty arrivals. The disk space overhead of the block rearrangement technique is very low. Most of the benefits of block rearrangement are obtained by rearranging around 1% of all the blocks on a disk.

We have experimented with three placement policies for the rearranged blocks. The organ-pipe placement policy had the best performance in reducing seek times and service times. The interleaved placement policy's performance was close to the organ-pipe placement's but slightly worse. The serial placement policy performed much worse than the other two. The success of the block rearrangement technique is also closely tied to the skew in the block access distribution. The more skewed the distribution, the greater the improvement.

The experimental results shown in this paper were obtained using a 135 megabyte SCSI disk. We are planning to repeat the experiments on a larger disk in the future.

Acknowledgements

We would like to thank Ólafur Gudmundsson, James da Silva, Harry Mantakos and the technical staff at the University of Maryland's Computer Science Department for their help and suggestions in setting up the experimental environment. We also would like to thank the people who agreed to be the users of our system and trusted it with their files.

References

[Akyurek 93] Akyürek, Sedat, Kenneth Salem, "Adaptive Block Rearrangement," Proceedings of the Ninth International Conference on Data Engineering, Vienna, Austria, April 1993.

[Carson] Carson, Scott D., Meenal Jobalia, "Precision Measurement of Disk Delay Parameters," in preparation.

[English 92] English, Robert M., Alexander A. Stepanov, "Loge: A Self-Organizing Disk Controller," Proceedings of the Winter 1992 USENIX Conference, San Francisco, CA, 1992.

[Floyd 89] Floyd, Richard A., Carla Schlatter Ellis, "Directory Reference Patterns in Hierarchical File Systems," IEEE Transactions on Knowledge and Data Engineering, Vol. 1, No. 2, June 1989.

[Ford 91] Ford, Daniel A., Stavros Christodoulakis, "Optimizing Random Retrievals from CLV Format Optical Disks," Proceedings of the 17th International Conference on Very Large Data Bases, Barcelona, Spain, September 1991.

[Grossman 73] Grossman, David D., Harvey F. Silverman, "Placement of Records on a Secondary Storage Device to Minimize Access Time," JACM, Vol. 20, No. 3, July 1973.

[McKusick 84] McKusick, Marshall K., et al., "A Fast File System for UNIX," ACM Transactions on Computer Systems, Vol. 2, No. 3, August 1984.

[McVoy 91] McVoy, L. W., S. R. Kleiman, "Extent-like Performance from a UNIX File System," USENIX Winter 1991 Conference Proceedings, Dallas, TX, 1991.

[Ouster 85] Ousterhout, John K., et al., "A Trace-Driven Analysis of the UNIX 4.2 BSD File System," Proceedings of the 10th ACM Symposium on Operating System Principles, 1985.

[Rosenblum 91] Rosenblum, M., J. K. Ousterhout, "The Design and Implementation of a Log-Structured File System," ACM Transactions on Computer Systems, Vol. 10, February 1992, 26-52.

[Ruemmler 91] Ruemmler, C., J. Wilkes, "Disk Shuffling," HPL-91-156, Hewlett-Packard Laboratories, Palo Alto, CA, October 1991.

[Salem 92] Salem, Kenneth, Daniel Barbará, Richard J. Lipton, "Probabilistic Diagnosis of Hot Spots," Proceedings of the Eighth International Conference on Data Engineering, Tempe, Arizona, February 1992.
[Staelin 91] Staelin, Carl, Hector Garcia-Molina, "Smart Filesystems," Proceedings of the Winter 1991 USENIX Conference, Dallas, TX, 1991.

[Vongsath 90] Vongsathorn, Paul, Scott D. Carson, "A System for Adaptive Disk Rearrangement," Software: Practice and Experience, Vol. 20, No. 3, March 1990.

[Wong 80] Wong, C. K., "Minimizing Expected Head Movement in One-Dimensional and Two-Dimensional Mass Storage Systems," Computing Surveys, Vol. 12, No. 2, June 1980.

Author Biographies

Sedat Akyürek is a PhD candidate in the Department of Computer Science at the University of Maryland. His research interests include disk and IO systems, operating systems, databases and distributed systems. He is currently working on data replication techniques in disk systems. He received an M.S. in Computer Science from the University of Maryland in 1991 and a B.S. in Computer Engineering from the Middle East Technical University, Turkey, in 1988. Sedat Akyürek can be reached via email at akyurek@cs.umd.edu.

Kenneth Salem is an assistant professor in the Department of Computer Science at the University of Maryland, College Park, and a staff scientist at NASA's Center of Excellence in Space Data and Information Sciences, located at the Goddard Space Flight Center. His research interests include database and operating systems and transaction processing. Dr. Salem received a BS in electrical engineering and applied mathematics from Carnegie-Mellon University in 1983, and a PhD in computer science from Princeton University in 1989. He is a member of the ACM and the IEEE Computer Society. His email address is salem@cs.umd.edu.