FeatureUSENIX

 

optimal peripheral access using pipe-based double-buffering

spinellis_diomedis

by Diomidis Spinellis
<dds@sena.gr>

Diomidis Spinellis holds a Ph.D. in computer science. Currently he is leading software development at SENA S.A. His research interests include information security, software engineering, and programming languages.



When working with slow peripherals such as tape drives and CD-R recorders, careful tuning of the access patterns and process architecture can result in substantial increases in performance. One of the responsibilities of an operating system is to optimize the utilization of a computer system by the appropriate scheduling of tasks. Thus, when a process waits for a peripheral to complete an operation, the OS can schedule other processes that are waiting to run. Although scheduling happens behind the scenes of normal system operation and is transparent to the user, in some cases one can optimize execution time by carefully arranging interdependent tasks. With ever-increasing processor speeds and with peripheral access times relatively stagnant, such optimizations can deliver impressive results. I will describe and explain how I reduced the overhead of tape access by almost 30 percent by simply adding two dd(1) commands to our backup pipeline.

The Problem

The problem I wanted to solve was to minimize the actual wall-clock time taken by our backup process, which currently needs eight hours for a full backup. I thought I could achieve this by making our drive stream, and by increasing the parallelism of the backup process. Our LAN consists of Windows 95/98/NT and UNIX machines. We perform our backups on a machine running Linux (2.0.36). At backup time, the Linux machine mounts the disks of all Windows hosts using the SMB filesystem and runs a tar(1) — based backup process.

The process is responsible for:

  • administering incremental and full backups

  • maintaining backup quota

  • measuring the backup set size

  • comparing the files after backup

  • logging the backup results

The heart of the process is essentially the following command:

   tar czf /dev/tape -b 1000 /smb

The command creates a tar file in /dev/tape from the contents of smb using gzip compression and a blocking factor of 1000. The blocking factor is supposed to make tar write to the tape device in chunks of 1000 blocks (i.e., of 0.5MB).

Accessing the tape in large blocks is mandatory for some devices, and — according to some man pages for tar — can be more efficient than access in smaller chunks. Since tape drives typically read or write data at a fixed rate, if that rate cannot be maintained on the rest of the system the drive has to read or write some data and then stop and backtrack to a point where the next block can be written. When this phenomenon occurs, the performance of the drive drastically deteriorates, and the drive is no longer said to be streaming. Writing in large blocks can help the drive to stream.

Our drive consistently refused to stream. Increasing the size of the blocking factor had absolutely no effect. Going through the source code of tar we are using (GNU tar 1.11.8 with a small local modification to handle the lack of stable inode numbers in SMB filesystems), I noticed that when compression is specified, tar ignores the blocking factor. That explained the lack of streaming and prompted me to examine the matter further.

Measurements and the Solution

A golden rule of performance optimization is to profile the process to be optimized. I therefore first measured the time it takes for tar and gzip to process a medium-sized directory:

   % time sh -c 'tar cf - /usr/src/linux | gzip -c >/dev/null'
   375.40user 39.38system 7:02.18elapsed 98%CPU

This measurement establishes our performance bottom line. No matter how much we improve tape access, we will never be able to make the process faster than seven minutes. The next step was to measure the overhead of writing to tape:

   % time sh -c 'tar cf - /usr/src/linux | gzip -c >/dev/tape'
   389.81user 37.34system 10:19.96elapsed 68%CPU

So we now know that writing to the tape increases the execution time by about three minutes, or 47 percent. It is this increase that we want to fight. Since our full backup takes about eight hours, optimizing tape access could result in significant time savings. The measurement also tells us that the CPU is idle about 30 percent of the time. Our task mix is therefore I/O bound, and we have CPU power to spare in order to optimize it. As a first try, I piped the gzip output through dd(1) specifying a large (0.5MB) output block size. This causes dd to accumulate data from its input until it has 0.5MB of data available and then write that data in one large chunk to the tape:

   % time sh -c 'tar cf - /usr/src/linux | gzip -c |
    dd obs=512k of=/dev/tape'
   27979+1 records in
   27+1 records out
   392.65user 41.55system 11:28.28elapsed 63%CPU

This time I could hear and see the tape drive stream; a beautiful sound and sight. Embarrassingly, the execution time increased by more than one minute (11 percent), while CPU utilization decreased. Streaming the tape drive did not appear to help. An analysis of the situation was now in order.

Apparently, in the system I am using, write commands to the tape device are processed synchronously; that is, the process issuing the write system call is suspended until the data is written to the device. In the first case involving the tape drive, while data was being written by gzip to the tape drive (and gzip was waiting), tar could continue to process some files writing its output to the pipe buffer connecting the two processes. In the second case where dd was added, dd took a longer time to write the 0.5MB block to the tape drive. During that time, the pipe buffer between gzip and dd got full, thereby blocking gzip. That, in turn, made the pipe buffer between tar and gzip fill up, blocking tar as well. The explanation of the timing anomaly made me realize the problem's root and think about a possible solution.

In order to optimize the backup process I would need a way to prevent the pipeline stalls while data was being written to the tape drive. The addition of another dd process after gzip could provide this buffer:

   % tar cf - /usr/src/linux | gzip -c | \
    dd obs=512k | dd obs=512k of=/dev/tape

In the pipeline above, gzip writes data to the first dd process. As soon as this process accumulates 0.5MB, the data is written to the second dd process using a single write call. As soon as the second dd process accumulates 0.5MB (i.e., immediately after the first process writes it), it will issue a write call to send the data to the tape drive. While data is being written to the tape, the second dd process is blocked; however, the first one is still running, since it has written its block to the second process and is now again accumulating the next block to write. Therefore tar and gzip can continue gathering and compressing data at the same time that data is being written to the tape drive. Thus, the two dd processes effectively implement a double-buffering mechanism. The time measurements prove the validity of this approach:

   % time sh -c 'tar cf - /usr/src/linux | gzip -c |
     dd obs=512k | dd obs=512k of=/dev/tape
   27979+1 records in
   27+1 records out
   27979+1 records in
   27+1 records out
   364.34user 48.57system 9:23.22elapsed 73%CPU

Our change resulted in a 28 percent decrease in the tape overhead, giving a 10 percent decrease in overall execution time. Depending on the performance differences between the CPU and the peripheral in question, the execution time can decrease even more.

Conclusions

As I mentioned at the beginning, one of the responsibilities of an OS is to optimize the utilization of a computer system by the appropriate scheduling of tasks. In our initial case, while the system was waiting for the tape drive to write the data, it could indeed run other unrelated processes such as sendmail or httpd. However, the operation of the processes in the backup pipeline could not be parallelized beyond the limits imposed by the small buffer provided by the implementation of pipes under UNIX. Soon after one process in the pipeline blocked, the whole pipeline was stalled. The addition of another dd process before the process writing to the tape provided extra breathing space to the pipeline and increased the possibility for parallel operation. The same approach can be used in all cases where a pipeline consists of a mixture of I/O and CPU-bound tasks.


 

?Need help? Use our Contacts page.
Last changed: 18 Nov. 1999 mc
Issue index
;login: index
USENIX home