The following paper was originally published in the Proceedings of the USENIX Summer 1993 Technical Conference, Cincinnati, Ohio, June 21-25, 1993.

For more information about the USENIX Association contact:
1. Phone: 510 528-8649
2. FAX: 510 548-5738
3. Email: office@usenix.org
4. WWW URL: https://www.usenix.org

Adaptive Block Rearrangement Under UNIX*

Sedat Akyürek & Kenneth Salem
Department of Computer Science
University of Maryland, College Park, MD 20742

Abstract

An adaptive UNIX disk device driver is described. The driver copies frequently-referenced blocks from their original locations to reserved space near the center of the disk to reduce seek times. Reference frequencies need not be known in advance. Instead, they are estimated by monitoring the stream of arriving requests. Measurements show that the adaptive driver reduces seek times by more than half and improves response times significantly.

1 Introduction

In a recent paper [Akyurek 93] we introduced an adaptive block rearrangement technique which reduces disk response times by reducing seek times. In this technique, a small number of frequently referenced blocks are copied from their original locations to reserved space near the middle of the disk. The term "adaptive" means that no advance knowledge of the frequency of reference to the data blocks is required. Instead, reference frequencies are estimated by monitoring the stream of requests for data from the disk. The set of blocks in the reserved space is changed periodically to adapt to changing access patterns. Trace-driven simulations have shown that this technique can be very effective in reducing seek times. This paper describes an implementation of the technique using a modified UNIX device driver. It also presents the results of detailed measurements taken during the operation of the system. Our results demonstrate that adaptive block rearrangement can be easily and effectively integrated under existing file systems.

1.1 Related Work

That a disk's performance can be improved by clustering frequently accessed data is well known. If data references are derived from an independent random process with a known, fixed distribution, it has been shown that the organ pipe heuristic places the data optimally [Wong 80, Grossman 73]. The organ pipe heuristic calls for the most frequently accessed data to be placed in the center of the disk. The next most frequently accessed data is placed to either side of the center, and the process continues until the least-accessed data has been placed at the edge of the disk. More recently, similar results have been shown for optical storage media [Ford 91]. In practice, data references are not drawn from a fixed distribution, nor are they independent. Although references are highly skewed [Floyd 89, Staelin 91, Vongsath 90, Ouster 85], request distributions change over time, and they are generally not known in advance. Nevertheless, variations of the organ pipe heuristic seem to work well in practice. Recently, several papers have proposed adaptive applications of data clustering based on this idea.
__________________________________________________
*This work was supported by National Science Foundation Grant No. CCR-8908898 and in part by CESDIS. UNIX is a trademark of AT&T.

Vongsathorn and Carson [Vongsath 90] showed that dynamically clustering frequently accessed data worked better than static placement. In that study, disk cylinders are dynamically rearranged using the organ pipe heuristic, according to observed data access frequencies. Recent work in the DataMesh project [Ruemmler 91] considered rearrangement of cylinders and blocks, with mixed results. Their conclusion that block shuffling generally outperforms cylinder shuffling corroborates one of our own. A similar approach is employed in the experimental iPcress file system [Staelin 91], which monitors access to files and moves files with high "temperatures" (frequency of access divided by file size) to the center of the disk.

Our technique differs from each of the techniques mentioned above in at least one of the following aspects.

Granularity: Our technique moves blocks rather than cylinders or files. Blocks within a file or within a cylinder can vary in temperature. Also, block rearrangement can increase the number of zero-length seeks (i.e., no need to seek at all), while cylinder reorganization cannot. Smaller granularity also facilitates incremental rearrangement.

Data Volume: Only a small fraction of blocks are rearranged at any time, as opposed to reorganizing all the data on the disk. This makes rearrangement faster.

Layout Preservation: Block relocation is temporary, and cooled blocks are returned to their original locations.

Transparency: Our technique can be implemented in a device driver (or controller). No changes to the file system are required.

Other systems cluster data based on criteria other than reference frequency. The Berkeley Fast File System (FFS) [McKusick 84] used in many UNIX systems uses placement heuristics that try to cluster the blocks of a file. However, hot blocks from different files may be spread widely over the disk's surface. This can result in long random seek operations when requests for the blocks of different files are interleaved, as is the case in multi-user systems.

The LFS [Rosenblum 91] and Loge [English 92] systems rearrange data based on the order of writes in the request stream. The primary goal of these systems is to improve write performance, not read performance, and they do so by reducing both seek and rotational delays. LFS eliminates seek and rotational delays by combining many write operations into a single large write operation. The Loge self-organizing disk controller transparently reorganizes blocks each time they are written to reduce seek and rotational delay. Simulation studies of the controller show that it can reduce write service times, but the savings come at the expense of increased read service times. In contrast to both Loge and LFS, our adaptive block rearrangement technique makes both read and write operations faster, and, unlike Loge, the block rearrangement system described here preserves the data placement done by the file system. Finally, although the adaptive block rearrangement system described here is implemented in the device driver on the host, it can be implemented in an intelligent IO or disk controller like Loge.

In the next section we will give an overview of the block rearrangement technique. Section 3 describes the implementation of adaptive block rearrangement in a UNIX system.
In Section 4 we present the results of performance measurement experiments conducted while the system was operational on a network file server.

2 Overview

The idea of block rearrangement is motivated by the fact that access to data stored on disks is highly skewed. Of the thousands of blocks stored on a disk, a small fraction of them absorbs most of the requests. If the hot (frequently accessed) blocks are spread over the surface of the disk, distant from each other, long seek delays may result.

Hot blocks can be clustered on a reserved set of contiguous disk cylinders to reduce seek times. Under our block rearrangement technique, this is achieved by copying hot blocks onto a set of reserved cylinders in the middle of the disk. The block layout outside of the reserved cylinders is left undisturbed. Incoming requests are redirected to the reserved cylinders if the block resides there. If the reserved cylinders accommodate blocks which get most of the references, we expect that the disk head will tend to linger over those cylinders and seek distances will be reduced.

Figure 1: Adaptive Device Driver

Figure 1 illustrates the organization of an adaptive block rearrangement system implemented in a disk driver. The system monitors the stream of requests directed to the disk and periodically produces a list of hot (frequently-referenced) blocks, ordered by frequency of reference. The hot block list is used to determine which blocks should be placed in the reserved cylinders. The rearrangement system copies the selected hot blocks from their original locations to the reserved cylinders. These blocks remain in the reserved space until the next hot block list is produced. Blocks which are no longer hot are copied back to their original locations.

The block rearrangement system assigns hot blocks to the reserved cylinders according to their rank in the list, i.e., their frequency of reference, using the organ pipe heuristic. Assuming that C blocks fit on a cylinder, the C hottest blocks are placed on the middle cylinder of the reserved cylinder group. The next hottest are placed on an adjacent cylinder, and so on, so that the final cylinder reference distribution across the reserved cylinders forms an organ pipe distribution.
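To make the assignment concrete, the following is a minimal sketch in C of the organ-pipe fill order just described. It is an illustration, not code from the driver; the cylinder count, blocks-per-cylinder value, and block numbers are invented for the example.

    #include <stdio.h>

    #define NCYL 3          /* reserved cylinders (example value)  */
    #define C    4          /* blocks per cylinder (example value) */

    int main(void)
    {
        int hot[NCYL * C];  /* hot block numbers, hottest first    */
        int cyl[NCYL][C];   /* resulting reserved-area layout      */
        int order[NCYL];    /* cylinder fill order: organ pipe     */
        int mid = NCYL / 2;
        int i, j, k, lo, hi;

        for (i = 0; i < NCYL * C; i++)
            hot[i] = 1000 + i;          /* dummy block numbers */

        /* Middle cylinder first, then alternate to either side. */
        order[0] = mid;
        for (k = 1, lo = mid - 1, hi = mid + 1; k < NCYL; ) {
            if (hi < NCYL)
                order[k++] = hi++;
            if (k < NCYL && lo >= 0)
                order[k++] = lo--;
        }

        /* The C hottest blocks fill order[0], the next C fill
           order[1], and so on: an organ pipe distribution.     */
        for (i = 0; i < NCYL * C; i++)
            cyl[order[i / C]][i % C] = hot[i];

        for (i = 0; i < NCYL; i++) {
            printf("cylinder %d:", i);
            for (j = 0; j < C; j++)
                printf(" %4d", cyl[i][j]);
            printf("\n");
        }
        return 0;
    }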
3 Implementation

In this section we will present the implementation details of the components shown in Figure 1. The block rearrangement system is implemented through modifications to the disk device driver of SunOS† 4.1.1. In addition, several user level programs are used to control the modified driver.

__________________________________________________
†SunOS is a trademark of Sun Microsystems, Inc.

Figure 2: Block Mapping on a Rearranged Disk

3.1 Modifications in the Device Driver

The device driver modifications implement the reserved space on the disk and the mapping of hot blocks to their new positions in the reserved space. They also provide entry points to the kernel for controlling the movement of blocks to and from the reserved area and for monitoring block accesses. Another function of the modified driver is to gather performance statistics.

3.1.1 Reserved Space

Disk labels contain information about the size of the disk and the sizes and positions of disk partitions. This information is used by the newfs program to initialize a file system on a given partition on the disk. To make space for the rearranged blocks, the target disk is made to look smaller than it really is by changing the disk geometry information on the disk label. Disk partitioning information is also changed accordingly, so that the file system thinks that the disk has fewer cylinders. The hidden cylinders implement the reserved space. The driver implements the mapping between the logical (smaller) disk and the actual disk (Fig. 2).‡

When a target disk is initialized for rearrangement, the first sector and the length of the reserved space are recorded in its label. During initialization a special value is also recorded in the label to mark it as a "rearranged" disk.§ At the time of system start-up, the attach routine of the driver checks if the disk is a rearranged disk. If it is, then the information about the reserved area is read in to be used for block mapping.

__________________________________________________
‡In SCSI disks the disk interface presents the disk space as a sequence of logical sectors. The actual physical locations of the sectors are not known. We rely on an implicit assumption that most SCSI sector numbers are close to their true physical numbers.
§Block rearrangement is applied on a physical device basis. That is, a disk may have several partitions and consequently several file systems on it, but all the blocks on the disk can be rearranged regardless of which partition they belong to. We require each file system on the disk to have the same block size.

3.1.2 Block Mapping

The strategy routine of a disk driver is responsible for converting the file system's block addresses to physical block addresses. In the modified driver, it also does the mapping from the logical disk to the real disk and reroutes requests for rearranged blocks when necessary. When a request arrives for a rearranged disk, the strategy routine first converts the logical block address to a physical block address. It then determines whether the block has been repositioned into the reserved area. To implement this check the driver maintains a data structure called the block table for the rearranged blocks. When a block is copied into the reserved space, it is entered into the table. The table records the block's old and new physical addresses and a dirty bit. If an entry for the requested block is found in the block table, its new physical address is used to retrieve the data.

A copy of the block table is also stored on the disk (at the beginning of the reserved area) to be used at start-up and for recovery purposes. The attach routine of the driver reads in the block table at start-up. The copy of the block table on the disk always correctly reflects the rearranged blocks and their positions in the reserved area (unless media failure occurs). However, the dirty bit information may not always be up to date. Dirty bits are used to determine if rearranged blocks have been written. They determine whether blocks must be copied back to their original locations when they cool down. In our implementation, if there are repositioned blocks in the reserved area at the time of start-up, they are all marked as dirty. This conservative strategy ensures correct operation in case a repositioned block is written and system operation is interrupted before this was reflected in the disk copy of the block table.

The size of a "block" in the rearrangement system is the size of a file system block. The file system requests at most one block of data from the disk at a time. The raw disk IO interface, on the other hand, allows requests larger than the block size to be forwarded to the disk. Driver routines responsible for raw IO were changed to break large requests into block-sized subrequests.
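The block table and the remapping step are simple enough to sketch. The following is illustrative C, not the actual driver source: the entry layout (old address, new address, dirty bit) follows the description above, but the names, the fixed-size array, the linear search, and the prefix-shift reading of the logical-to-physical mapping in Figure 2 are our assumptions.

    #include <sys/types.h>

    #define NRELOC 1018     /* assumed capacity of the reserved area  */

    struct bt_entry {
        daddr_t orig;       /* original physical block address        */
        daddr_t newblk;     /* address within the reserved cylinders  */
        int     dirty;      /* set once the relocated copy is written */
    };

    static struct bt_entry btab[NRELOC];  /* the block table          */
    static int             btab_n;        /* entries currently in use */

    /* Logical (shrunken) disk to real disk.  One plausible reading of
       Figure 2: addresses at or past the start of the reserved area
       are shifted over the hidden cylinders. */
    static daddr_t resv_start, resv_len;  /* read from the disk label */

    static daddr_t
    log_to_phys(daddr_t lblk)
    {
        return (lblk < resv_start) ? lblk : lblk + resv_len;
    }

    /* Called from the strategy routine after address conversion:
       redirect the request if the block has been copied into the
       reserved area, and remember writes through the dirty bit. */
    static daddr_t
    bt_map(daddr_t pblk, int iswrite)
    {
        int i;

        for (i = 0; i < btab_n; i++) {
            if (btab[i].orig == pblk) {
                if (iswrite)
                    btab[i].dirty = 1;
                return btab[i].newblk;
            }
        }
        return pblk;        /* not rearranged; address unchanged */
    }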
3.1.3 Block Movement

The driver provides kernel entry points through ioctl calls for user-level processes to control the movement of blocks into and out of the reserved area. It implements two ioctl calls for block movement:

DKIOCBCOPY copies a given block to a given address in the reserved area. It also places an entry for the block in the block table and forces the block table to be written to disk.

DKIOCCLEAN cleans the reserved area by removing the blocks from the block table one at a time. Blocks leaving the reserved area are copied to their original disk positions if they have been updated, i.e., if their dirty bit is set. The block table is written to disk after each block is moved out.

Copying a block into the reserved area requires three IO operations. Moving a block out of the reserved area requires at least one IO operation, and two extra operations if the block is dirty. However, other requests can interleave with these operations. Requests for a block that is being moved are delayed temporarily by the driver.

3.1.4 Request Monitoring

The driver records requests (block number, request size, flags, etc.) in a small internal table. Request recording stops when the table is full, and resumes when the table is emptied. An ioctl call enables user processes to read the contents of the table and to clean it. A user process periodically reads the contents of the request table and uses them to analyze the block accesses. The frequency of request table reads can be changed. We have used a period of two minutes in our experiments. This was short enough to capture almost all the requests.

3.1.5 Performance Monitoring

Another function of the driver is to gather statistics about different measures related to disk accesses, such as seek distances, service times and queueing times. This is provided for monitoring the performance of the system and is not a part of the block rearrangement system. Performance monitoring is implemented much like request monitoring. Statistics are recorded in a table inside the driver. User-level processes can read the contents of this table through an ioctl call which also resets the values in the table. The table contains statistics collected since the last time the table was read. The beginning and ending times of the recording are also entered into the table.

All statistics are recorded separately for read operations and write operations. The table records seek distance distributions (in arrival order and in scheduled order) and service time and queueing time histograms. Time histograms are recorded with a resolution of 1 ms. Cumulative service times and queueing times are also recorded. Queueing time for a request is the period between the time the driver first receives the request and the time the request is submitted to the disk. Service time for a request is from the end of the queueing time to the time the request is returned from the disk.

3.2 User Level Programs

User level programs use the entry points provided by the modified kernel to monitor and analyze block access frequencies. Using this information, they decide which blocks should be rearranged and where they should be placed in the reserved area. One user-level process, the reference stream analyzer, monitors the block accesses and tries to determine the frequently accessed blocks and their access frequencies. Using this information, another process, called the block arranger, decides which blocks are to be rearranged. This process also controls the placement of the selected blocks in the reserved area.
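A rough sketch of the user-level side shows how these pieces fit together. DKIOCBCOPY and DKIOCCLEAN are the calls described above; everything else here (the ioctl numbers, the name DKIOCGETREQS standing in for the unnamed request-table call, the structure layouts, and the device path) is assumed for the sketch.

    #include <sys/ioctl.h>
    #include <sys/types.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <stdio.h>

    /* Placeholder ioctl numbers; the real ones would come from the
       modified driver's header. */
    #define DKIOCGETREQS 101    /* hypothetical request-table read   */
    #define DKIOCBCOPY   102
    #define DKIOCCLEAN   103

    struct reqrec   { long blkno; int size; int flags; };
    struct reqtable { int n; struct reqrec r[512]; };
    struct bcopyarg { long blkno; long newblkno; };  /* assumed layout */

    #define NBLKS 16384         /* assumed number of blocks on disk */
    static long freq[NBLKS];    /* per-block reference counts       */

    int main(void)
    {
        int fd = open("/dev/rsd1c", O_RDWR);  /* raw rearranged disk */
        struct reqtable rt;
        int i, periods = 0;

        if (fd < 0) { perror("open"); return 1; }
        for (;;) {
            /* Reference stream analyzer: every two minutes, read and
               clear the driver's request table and tally accesses.  */
            if (ioctl(fd, DKIOCGETREQS, &rt) == 0)
                for (i = 0; i < rt.n; i++)
                    if (rt.r[i].blkno >= 0 && rt.r[i].blkno < NBLKS)
                        freq[rt.r[i].blkno]++;

            /* Block arranger: at the end of the day, drain the
               reserved area and reinstall the hottest blocks.       */
            if (++periods == 720) {           /* 720 * 2 min = 24 h  */
                struct bcopyarg a;
                ioctl(fd, DKIOCCLEAN, 0);
                a.blkno = 0; a.newblkno = 0;
                /* Selection and placement omitted: sort freq[], pick
                   the hot blocks, compute reserved-area addresses,
                   then issue one DKIOCBCOPY per block, e.g.:        */
                ioctl(fd, DKIOCBCOPY, &a);
                periods = 0;
            }
            sleep(120);
        }
    }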
The functions implemented by the user-level processes can easily be incorporated into the device driver itself. In particular, our techniques for estimating block reference counts are fast and require little space,¶ and the block placement policies are very simple. However, we wanted to use our system as a test bed to experiment with different methods for learning access frequencies and different rearrangement policies. Implementing those functions in user-level processes gave us flexibility in our experiments.

The block arranger implements three block placement policies:

Organ-pipe placement: This policy arranges the blocks chosen for rearrangement in an organ-pipe pattern according to their access frequencies. This is done by placing the blocks with the highest access frequencies in the center cylinder of the reserved area and continuing to fill cylinders on alternating sides of the center with the blocks which have the next highest access frequencies.

Interleaved placement: The SunOS UNIX file system (UFS), which is based on FFS, places the blocks of a file interleaved by an interleaving factor. The purpose is to reduce rotational delays while accessing a file sequentially. Our interleaved placement policy tries to preserve the interleaving performed by the file system. This policy begins by placing the hottest block first, like the organ-pipe placement. However, after it places a block it does not immediately select the next hottest block for placement. Instead it looks for a block which could be the previous block's successor in the file system's interleaved placement. The successor's physical location must of course succeed the previous block's physical location by the interleaving factor. The placement policy also requires the successor's access frequency to be close‖ to the previous block's access frequency (this is the policy's criterion for deciding whether two blocks belong to the same file). If such a "could be successor" block is found, it is placed using the same interleaving factor used by the file system. This process continues as long as a chain of such probable successor blocks is found; a sketch of the successor test appears after this list. The interleaved placement also stops when the end of the current cylinder is reached. When the policy quits interleaved placement mode, it continues by choosing the block which has the next highest access frequency.

Serial placement: This is the simplest placement policy. Blocks that are chosen for rearrangement are placed in the reserved space in ascending order of their original block numbers. This policy needs to know only the block numbers of the hot blocks; access frequencies of the hot blocks are not necessary.

__________________________________________________
¶The hot block estimation algorithm we have used and several other space-efficient hot spot detection algorithms are presented in [Salem 92].
‖The successor's access frequency is defined to be close if it is at least 50% of the previous block's frequency. The 50% figure was selected arbitrarily.

Figure 3: Placement Policies. The block placements of the different policies are shown. The set of blocks to be rearranged and their estimated access frequencies are also given. The reserved area has three cylinders with four blocks in each cylinder. The interleaving factor (rotational delay) of the file system is assumed to be one block. The interleaved placement policy considers the access frequency of a block "close" to another block's if the first block's frequency is at least half of the second block's access frequency.
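The successor test at the heart of the interleaved policy can be sketched as follows. The data layout and names are ours, not the block arranger's source; cand[] is assumed to hold the candidate hot blocks with their estimated frequencies.

    #include <stdio.h>

    struct blk { long pblk; long freq; int placed; };

    /* Return the index of an unplaced block that looks like the
       file-system successor of cand[prev]: its physical address
       follows prev's by the interleave factor, and its frequency
       is at least half of prev's (the 50% closeness test).       */
    static int
    find_successor(struct blk cand[], int n, int prev, int ilv)
    {
        int i;

        for (i = 0; i < n; i++)
            if (!cand[i].placed
                && cand[i].pblk == cand[prev].pblk + ilv
                && cand[i].freq * 2 >= cand[prev].freq)
                return i;
        return -1;  /* none found: fall back to the next hottest */
    }

    int main(void)
    {
        struct blk cand[] = {
            { 100, 40, 0 },     /* hottest block               */
            { 101, 30, 0 },     /* plausible successor of 100  */
            { 500, 35, 0 },
        };
        int s = find_successor(cand, 3, 0, 1);

        if (s >= 0)
            printf("successor of block %ld: block %ld\n",
                   cand[0].pblk, cand[s].pblk);
        return 0;
    }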
4 Performance Measurements

In this section we will present the results of a series of experiments we have conducted to monitor the adaptive driver's performance. We will first describe the environment in which the experiments were run.

4.1 Experimental Environment

The adaptive block rearrangement system is operational on a Sun SPARCstation** 2, called Sakarya. The operating system on the machine is SunOS 4.1.1. The file system block size is 8 kilobytes and the fragment size is 1 kilobyte. Sakarya has a main memory of 32 megabytes. The amount of memory used for the buffer cache is not fixed. Under SunOS, memory pages are used for both process pages and IO pages on demand. This means that all of the available memory can potentially be used as a buffer cache [McVoy 91]. Sakarya is used mainly by one of the authors and is usually very lightly loaded. As a result, a large portion of the memory is available as a buffer cache most of the time.

Sakarya has two disks attached to it and one of them is used as a rearranged disk. The disk used for block rearrangement is a Toshiba Model MK156F SCSI disk. Specifications of the disk and its seek time function are given in Table 1. The seek time function is borrowed from [Carson], in which the disk delay parameters for this disk were measured and a precise seek time function was devised.

__________________________________________________
**Sun and Sparc are trademarks of Sun Microsystems, Inc.

Table 1: Specifications of the disk. Seek time is given in milliseconds as a function of the seek distance d (in cylinders).

    Toshiba MK156F SCSI disk
    ------------------------------
    Capacity (MB)            135
    Cylinders                815
    Tracks/Cylinder           10
    Sectors/Track             34
    RPM                     3600

    seektime(d) = 0                                                    if d = 0
    seektime(d) = 6.248 + 1.393*sqrt(d) - 0.99*cbrt(d) + 0.813*ln(d)   if 0 < d < 315
    seektime(d) = 17.503 + 0.03*d                                      if d >= 315

Out of the disk's 815 cylinders, 48 cylinders (3 cylinder groups) in the middle of the disk have been used as reserved cylinders. This amounts to approximately 8 megabytes (6% of the total disk capacity). Approximately 1000 8-kilobyte blocks can fit in this space. We chose the size of the reserved space in light of the results of our previous studies [Akyurek 93]. The results of trace-driven simulations in that study showed that most of the benefits of block rearrangement were realized by rearranging 1%-2% of the total number of blocks. The simulation results also showed that the additional benefits of using more blocks decreased sharply.

We used Sakarya as an NFS server for 14 SPARCstations for executable files and libraries. The disk stored parts of the /usr file system. The file system was mounted read-only on the client workstations. The user population of the workstations consisted of about 40 faculty and graduate students in the Computer Science Department of the University of Maryland. We also experimented with a read/write mounted file system stored on the rearranged disk. In this experiment the disk stored the home directories of 10 users whose primary activities include running simulations, editing papers and electronic mail.
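Seek times in the tables that follow are not measured directly; they are computed from measured seek-distance distributions and the seek time function of Table 1. A small C transcription of that computation is given below; the histogram values in main() are invented for illustration.

    #include <math.h>       /* link with -lm */
    #include <stdio.h>

    #define NCYLS 815

    /* Seek time in ms for a seek of d cylinders (Table 1). */
    static double
    seektime(int d)
    {
        if (d == 0)
            return 0.0;
        if (d < 315)
            return 6.248 + 1.393 * sqrt((double)d)
                 - 0.99 * cbrt((double)d)
                 + 0.813 * log((double)d);
        return 17.503 + 0.03 * d;
    }

    /* Mean seek time from a seek-distance histogram, where
       hist[d] counts the observed seeks of length d.        */
    static double
    mean_seek(const long hist[], int ncyls)
    {
        double t = 0.0;
        long   n = 0;
        int    d;

        for (d = 0; d < ncyls; d++) {
            t += hist[d] * seektime(d);
            n += hist[d];
        }
        return n ? t / n : 0.0;
    }

    int main(void)
    {
        static long hist[NCYLS];                /* zero-initialized */

        hist[0] = 880; hist[10] = 60; hist[400] = 60;  /* made up   */
        printf("mean seek time: %.2f ms\n", mean_seek(hist, NCYLS));
        return 0;
    }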
4.2 Experiments

We will present the results of four sets of experiments. In the first set, performance improvements due to block rearrangement were monitored when Sakarya's disk was storing the /usr file system. In the next two groups of experiments we varied the number of rearranged blocks and compared the effects of different rearrangement policies, respectively. These experiments were run when Sakarya held the /usr file system. In the last set of experiments we used rearrangement when the disk stored the home directories.

During the operation of the system, block access information that was monitored on one day was used at the end of the day for rearranging the blocks for the next day's operations. Statistics about the system's performance were gathered between 7am and 10pm. The timing measurements have a resolution of 5 microseconds on Sakarya.

4.2.1 Rearrangement of /usr

The block rearrangement system was operational on Sakarya for several months while it was serving 14 client workstations. We applied block rearrangement on alternate days for several weeks to assess the performance improvements. When rearrangement was applied, 1018 blocks were placed in the reserved area using the organ-pipe placement heuristic.

Results from four consecutive days are summarized in Tables 2 and 3. Results for the other days are similar. Table 2 presents the measurements for all requests whereas Table 3 describes only read operations. On days 1 and 3, block rearrangement was not applied and original copies of blocks were used. On days 2 and 4, block rearrangement was applied.

Table 2: Experimental results for /usr (all requests).

                                Day 1         Day 2     Day 3         Day 4
    Rearrangement               Not Applied   Applied   Not Applied   Applied
    FCFS Mean Seek Dist (cyln)  220           225       235           228
    Mean Seek Distance (cyln)   173             8       183             5
    Zero-length Seeks (%)        23            88        19            92
    FCFS Mean Seek Time (ms)     20.92         21.46     22.33         21.91
    Mean Seek Time (ms)          18.21          1.55     19.24          0.98
    Mean Service Time (ms)       38.41         22.95     39.71         22.61
    Mean Waiting Time (ms)       87.30         50.03     94.51         48.82
Table 3: Experimental results for /usr (read requests only).

                                Day 1         Day 2     Day 3         Day 4
    Rearrangement               Not Applied   Applied   Not Applied   Applied
    FCFS Mean Seek Dist (cyln)  131           165       182           192
    Mean Seek Distance (cyln)   123            23       170            17
    Zero-length Seeks (%)        44            67        31            72
    FCFS Mean Seek Time (ms)     12.97         16.14     17.24         17.91
    Mean Seek Time (ms)          12.49          4.49     16.61          3.54
    Mean Service Time (ms)       30.99         24.18     35.32         22.57
    Mean Waiting Time (ms)        6.64          5.47      4.76          4.46

All of the figures reported in the tables are measured values except for seek times. Seek times are computed using measured seek distance distributions and the seek time function shown in Table 1. First-come-first-served (FCFS) order seek delays are also presented together with actual seek delays. The effect of UNIX disk-head scheduling can be seen when we compare the FCFS and actual values on days 1 and 3.

Comparing the results for the "rearrangement-on" days and the "rearrangement-off" days, we see that adaptive block rearrangement drastically reduced seek distances and seek times. Service time reductions are significant but are slightly lower than the seek time reductions. This may be due to the disturbances in the file-system rotational optimizations caused by rearranged blocks.

The system is very successful at increasing the number of zero-length seeks. This is very important in reducing seek times because of the high constant-time overhead associated with every non-zero seek. By placing frequently accessed blocks on the same cylinders, the need to reposition the disk head was often eliminated completely. This is an advantage of rearranging blocks rather than cylinders. Cylinder rearrangement cannot increase the number of zero-length seeks.

Waiting times are also reduced as a result of block rearrangement, and this further reduces the response times for disk operations. This is particularly true for write operations because write requests arrive in bursts in UNIX systems.

A notable difference between the write operations and read operations in this environment is that block access distributions for write operations are much more skewed than they are for read operations. Since write operations are concentrated onto a smaller set of blocks, write operations benefit more from block rearrangement than read operations do. This fact explains many of the differences between the results reported in Table 3 and those in Table 2.

Figure 4: Results of experiments with different numbers of rearranged blocks.

4.2.2 Varying the Number of Rearranged Blocks

In our next set of experiments we varied the number of rearranged blocks in the reserved area. Block rearrangement was applied for several weeks (on weekdays only) by rearranging a different number of blocks each day. Results of these experiments are summarized in Figure 4. The graphs plot mean seek distances and mean seek times as a function of the number of rearranged blocks.
Arrival-order (FCFS) mean seek delays are also plotted. If block rearrangement had not been applied, the actual delays would have been slightly lower (due to disk head scheduling) than the arrival-order delays. As the number of blocks in the reserved area is increased, we expect to see greater reductions in seek delays. However, the results of our experiments draw a different picture. Most of the benefits of block rearrangement are realized by rearranging a very small number of blocks. The marginal benefit of using additional blocks is very small. Rearranging blocks in excess of the hottest 200 blocks brings almost no extra benefits.

The result of these experiments can be better understood by examining the skewed block reference distribution. Figure 5 shows the cumulative distribution of block requests on a typical day. Blocks are sorted in decreasing order of reference counts. Distributions for all the requests and also just for reads are shown. Fewer than 2000 blocks absorb all of the requests. The 100 hottest blocks absorb around 90% of all the requests. Read references are also very skewed. Most of the reductions in seek delays are due to the first few hundred blocks which get most of the references.

Figure 5: Distribution of block accesses on /usr.

4.2.3 Different Placement Policies

As mentioned earlier, the block arranger component of the rearrangement system implements two other block placement policies in addition to the organ-pipe placement. Although we normally used organ-pipe placement in our system, we also experimented with the other placement policies. Tables 4 and 5 present results of experiments for the serial placement policy and the interleaved placement policy. Table 4 lists the results of a day's operation for each of the three placement policies. Results for organ-pipe placement are included for comparison. Table 5 presents the results for another day for each of the interleaved and serial placement policies.

The serial placement policy projects hot blocks from different parts of the disk onto the reserved area without considering their access frequencies. It is a simpler policy than the organ-pipe placement but it is not as effective as the organ-pipe placement in reducing seek times. It fails to place more frequently accessed blocks closer to each other in the reserved area. As can be seen in Tables 4 and 5, serial placement produces fewer zero-length seeks than organ-pipe placement does. The serial placement policy's relatively poor performance illustrates the importance of using block access frequencies to determine placement.

The performances of the interleaved placement policy and the organ-pipe placement policy are very close. Organ-pipe placement performs slightly better in terms of seek times and service times. The interleaved placement policy is a variation of the organ-pipe policy. It tries to preserve the file system's rotational optimization in the reserved area. Organ-pipe placement may disturb the rotational optimization. While accessing files sequentially, this may result in higher rotational delays than in the case of no rearrangement.

Table 6 lists some data about the effects of each of the placement policies on rotational delays. These data are compiled from Tables 2-5. The values in Table 6 are the differences between the mean service times and the mean seek times for the read requests (for example, the interleaved placement entry derived from Table 4 is 24.31 - 5.86 = 18.45 ms). They correspond to the sum of mean transfer times and mean rotational delays.
Since rotational optimizations rarely benefit write requests, only the values for the read requests are included in the table. There are results from two different days for each of the placement policies and for the no-rearrangement case. Request size distributions are almost identical for each of the days, so we can safely assume that the differences in the values in Table 6 are due to differences in rotational delays. We can see from the values in the table that the organ-pipe placement and serial placement policies increase the rotational delays slightly. This is because of the disturbances they cause in the file system's rotationally-optimized placement. The interleaved placement, on the other hand, does not increase the rotational delays. Interleaved placement, however, disturbs the organ-pipe arrangement of blocks in the reserved area. As seen in Tables 4 and 5, this results in higher seek times than in the case of the organ-pipe placement. The increase in the mean rotational latency caused by the organ-pipe placement is more than made up for by the organ-pipe heuristic's greater seek time reductions. As a result, organ-pipe placement gives slightly better improvements in service time than interleaved placement.

Table 4: Experiments with placement policies.

                                Organ-pipe          Interleaved         Serial
                                All Reqs   Reads    All Reqs   Reads    All Reqs   Reads
    FCFS Mean Seek Dist (cyln)  225        165      208        144      208        142
    Mean Seek Distance (cyln)     8         23       15         24       22         39
    Zero-length Seeks (%)        88         67       83         61       26         39
    FCFS Mean Seek Time (ms)     21.46      16.14    20.02      14.39    20.02      14.23
    Mean Seek Time (ms)           1.55       4.49     2.50       5.86     8.50       8.57
    Mean Service Time (ms)       22.95      24.18    23.71      24.31    28.53      27.80
    Mean Waiting Time (ms)       50.03       5.47    46.85       5.14    61.32       6.32

Table 5: Experiments with interleaved and serial placement policies.

                                Interleaved         Serial
                                All Reqs   Reads    All Reqs   Reads
    FCFS Mean Seek Dist (cyln)  214        151      200        138
    Mean Seek Distance (cyln)    12         35       23         39
    Zero-length Seeks (%)        86         60       30         39
    FCFS Mean Seek Time (ms)     20.59      14.97    19.57      13.91
    Mean Seek Time (ms)           2.11       5.98     8.29       8.49
    Mean Service Time (ms)       23.50      24.45    28.49      27.81
    Mean Waiting Time (ms)       47.06       4.91    55.91       6.11

Table 6: Effects of different placement policies on rotational delays. The values in the table are mean rotational delay + mean transfer time for read requests, in milliseconds, from two different days.

                                Mean Rotational Latency + Mean Transfer Time (ms)
    Without Rearrangement       18.49      18.72
    Organ-pipe Placement        19.71      19.03
    Serial Placement            19.32      19.26
    Interleaved Placement       18.45      18.47

4.2.4 Rearrangement on a Read/Write Mounted File System

In this subsection we will present the results of rearrangement experiments on a read/write mounted file system.
As we mentioned earlier, the disk stored the home directories of 10 users. Because of the size limitations of the disk, we were not able to accommodate more users. This fact, combined with the large buffer cache provided by SunOS, resulted in very light disk loads.

For this environment we conducted experiments similar to the first set of experiments. We applied block rearrangement on alternate days for several weeks. In Figure 6, we present results from four consecutive days. Results for the other days are similar. In the figure, mean seek distances and mean seek times are given. We show both arrival-order (FCFS) and actual seek delays. As usual, statistics are given for read requests as well as for all the requests. Block rearrangement was applied on days 2 and 4.

Figure 6: Results of experiments with read/write mounted file system.

The reductions in the seek delays are not as large as those we observed for the /usr file system. Nonetheless, the reductions are significant, especially for the read operations. Mean seek times were reduced by 20%-40% for all the requests. Seek times for read requests were reduced by 40%-50%.

The poorer performance of the system in this environment can be attributed to several factors. Firstly, there are write requests resulting from new file creation and file expansion operations in this environment. Block rearrangement cannot benefit these kinds of write requests. Secondly, the original seek distances were relatively shorter. This makes the reductions look smaller. Finally, and most importantly, the block access distributions were less skewed in this environment. This can be observed from Figure 7, in which we have plotted the block access distribution for a typical day. Comparing Figure 7 with Figure 5, we see that the reference skew in this environment is less than that of the /usr file system.

Figure 7: Distribution of block accesses on read/write mounted file system.

A less skewed block access distribution is a disadvantage for the block rearrangement system. As the access distribution for the rearranged blocks becomes more uniform, the distribution of requests over the cylinders in the reserved area will also become more uniform. This reduces the system's ability to increase the number of zero-length seeks. Although it is not the case in this environment, a less skewed block access distribution may be a disadvantage in another way: the blocks placed in the reserved area get a smaller fraction of the total number of requests. As a result the disk head spends less time over the reserved area. This may reduce the benefits of the block rearrangement.

5 Conclusion

We have described the implementation of an adaptive block rearrangement technique under a UNIX operating system. The technique is implemented in the disk device driver and is transparent to the rest of the system.
Measurements taken during the operation of the system on a network file server have shown that it is very effective in reducing disk service times by reducing seek times. Results also show that waiting times are reduced, especially for bursty arrivals. The disk space overhead of the block rearrangement technique is very low. Most of the benefits of block rearrangement are obtained by rearranging around 1% of all the blocks on a disk.

We have experimented with three placement policies for the rearranged blocks. The organ-pipe placement policy had the best performance in reducing seek times and service times. The interleaved placement policy's performance was close to the organ-pipe placement's but slightly worse. The serial placement policy performed much worse than the other two. The success of the block rearrangement technique is also closely tied to the skew in the block access distribution. The more skewed the distribution, the greater the improvement.

The experimental results shown in this paper were obtained using a 135 megabyte SCSI disk. We are planning to repeat the experiments on a larger disk in the future.

Acknowledgements

We would like to thank Ólafur Gudmundsson, James da Silva, Harry Mantakos and the technical staff at the University of Maryland's Computer Science Department for their help and suggestions in setting up the experimental environment. We also would like to thank the people who agreed to be the users of our system and trusted it with their files.

References

[Akyurek 93] Akyürek, Sedat, Kenneth Salem, "Adaptive Block Rearrangement," Proceedings of the Ninth International Conference on Data Engineering, Vienna, Austria, April 1993.

[Carson] Carson, Scott D., Meenal Jobalia, "Precision Measurement of Disk Delay Parameters," in preparation.

[English 92] English, Robert M., Alexander A. Stepanov, "Loge: A Self-Organizing Disk Controller," Proceedings of the Winter 1992 USENIX Conference, San Francisco, CA, 1992.

[Floyd 89] Floyd, Richard A., Carla Schlatter Ellis, "Directory Reference Patterns in Hierarchical File Systems," IEEE Transactions on Knowledge and Data Engineering, Vol. 1, No. 2, June 1989.

[Ford 91] Ford, Daniel A., Stavros Christodoulakis, "Optimizing Random Retrievals from CLV Format Optical Disks," Proceedings of the 17th International Conference on Very Large Data Bases, Barcelona, Spain, September 1991.

[Grossman 73] Grossman, David D., Harvey F. Silverman, "Placement of Records on a Secondary Storage Device to Minimize Access Time," JACM, Vol. 20, No. 3, July 1973.

[McKusick 84] McKusick, Marshall K., et al., "A Fast File System for UNIX," ACM Transactions on Computer Systems, Vol. 2, No. 3, August 1984.

[McVoy 91] McVoy, L. W., S. R. Kleiman, "Extent-like Performance from a UNIX File System," USENIX Winter 1991 Conference Proceedings, Dallas, TX, 1991.

[Ouster 85] Ousterhout, John K., et al., "A Trace-Driven Analysis of the UNIX 4.2 BSD File System," Proceedings of the 10th ACM Symposium on Operating System Principles, 1985.

[Rosenblum 91] Rosenblum, M., J. K. Ousterhout, "The Design and Implementation of a Log-Structured File System," ACM Transactions on Computer Systems, Vol. 10, February 1992, 26-52.

[Ruemmler 91] Ruemmler, C., J. Wilkes, "Disk Shuffling," HPL-91-156, Hewlett-Packard Laboratories, Palo Alto, CA, October 1991.

[Salem 92] Salem, Kenneth, Daniel Barbará, Richard J. Lipton, "Probabilistic Diagnosis of Hot Spots," Proceedings of the Eighth International Conference on Data Engineering, Tempe, Arizona, February 1992.
[Staelin 91] Staelin, Carl, Hector Garcia-Molina, "Smart Filesystems," Proceedings of the Winter 1991 USENIX Conference, Dallas, TX, 1991.

[Vongsath 90] Vongsathorn, Paul, Scott D. Carson, "A System for Adaptive Disk Rearrangement," Software: Practice and Experience, Vol. 20, No. 3, March 1990.

[Wong 80] Wong, C. K., "Minimizing Expected Head Movement in One-Dimensional and Two-Dimensional Mass Storage Systems," Computing Surveys, Vol. 12, No. 2, June 1980.

Author Biographies

Sedat Akyürek is a PhD candidate in the Department of Computer Science at the University of Maryland. His research interests include disk and IO systems, operating systems, databases and distributed systems. He is currently working on data replication techniques in disk systems. He received an M.S. in Computer Science from the University of Maryland in 1991 and a B.S. in Computer Engineering from the Middle East Technical University, Turkey, in 1988. Sedat Akyürek can be reached via email at akyurek@cs.umd.edu.

Kenneth Salem is an assistant professor in the Department of Computer Science at the University of Maryland, College Park, and a staff scientist at NASA's Center of Excellence in Space Data and Information Sciences, located at the Goddard Space Flight Center. His research interests include database and operating systems and transaction processing. Dr. Salem received a BS in electrical engineering and applied mathematics from Carnegie-Mellon University in 1983, and a PhD in computer science from Princeton University in 1989. He is a member of the ACM and the IEEE Computer Society. His email address is salem@cs.umd.edu.