;login: The Magazine of USENIX & SAGEOpen Source

 

performance tuning with source code UNIX

gray_bob

by Bob Gray
<bob@cs.colorado.edu>

Bob Gray is co-founder of Boulder Labs, a software consulting company. Designing architectures for performance has been his focus ever since he built an image processor system on UNIX in the late 1970s. He has a Ph.D. in computer science from the University of Colorado.

 

This issue's column will focus on performance improvement — an activity requiring technique, intuition, and of course source code. Most code bases have room for large improvements - but an analysis is required to determine if the payoffs are worthwhile. We'll be talking about the intuition for code efficiency that takes years to develop, and we'll examine the engineering process that leads to faster code. Finally, we'll look at some code-tuning success stories.

For 20 years, CPU speeds have doubled every 1.5 years. The VAX 11/780 that was widely available in 1980 ran at one SPECMARK.[1]

We now bask in the land of cheap CPU cycles with affordable machines running almost one thousand times faster. Then why do we frequently hear complaints about "sluggish" programs or "slow- booting" operating systems? The layperson might think that a 400MHz Pentium should "feel" three times faster than a 133MHz Pentium, but the new one still takes 60 seconds to boot. And why isn't opening documents and receiving mail three times faster?

We can look at three root causes for slowness complaints:

1. The task is computationally intense and just takes a long time — for example, image processing, multimedia encoding/decoding, and certain numerical problems. Here we will notice that faster CPUs make a big difference.

2. The slowness is due to I/O waiting, and faster CPUs won't help. Web-page loading over a 28Kbps connection will require about the same elapsed time on a Pentium-400 as on a Pentium-133. Similarly, the time required to search for something in the file system is largely independent of the CPU speed.

3. The code being run is suboptimal, and there is room for improvement. Here is where we can make a difference by tuning the application or adjusting its data structures and algorithms.

Performance Intuition
Have you ever run into a programmer who says, "That code should be able to run much faster," without ever having seen the code base? How did he or she know there was room for improvement? Over the years, one acquires a feel for what it takes to get a job done. Jon Bentley has written an outstanding book, Programming Pearls (2nd ed., ACM Press, 1999) to help a programmer develop this intuition. He poses numerous problems and goes through the thinking process required for an efficient solution. Bentley talks about back-of-the-envelope calculations to perform sanity checks on system designs. For example, what size computer system would be required to store, index, and retrieve ten years' worth of newspaper articles? Would a laptop be sufficient, or would you want a multiprocessor enterprise server? Given that we store 365 x 10 days, we only need to estimate how much storage is required per day to perform a sanity check on the server size. You'll employ this intuition when your client says "perfor-mance needs to be improved." Now is the time to ask lots of questions and get the client to commit to specific assumptions.

Methodology for Code Improvement
How do you go about making improvements to a code base? You'll be faced with numerous constraints and goals, and your first task will be to find the acceptable subset of the solution space. How much time is available for the project? How much effort can be expended? Do you have the right kinds of people to work on the project? What size improvement is being sought? Does it need to run 20% faster or 20 times faster? Can you spend money on equipment and solve the problem with hardware? Once some guidelines are established, you will be able to home in on the appropriate changes.

Before changing the code base you should first establish "baseline" performance measurements. Until you can quantify your program's behavior in terms of runtime, you cannot systematically attack it. To establish the baseline, you'll need both performance monitors and a repeatable input sequence. The simplest monitor is a stopwatch, but you may need finer-grained timers. UNIX systems provide microsecond timers (for example, getrusage) that can identify process "user" time, "system" time, and elapsed time. Occasionally, you may need access to kernel nanosecond timers — they are available on most workstations and recent PCs.

You should find a way to get repeatable time measurements for a certain task. You may need to isolate the test machine so that other people's work won't interfere with your measurements. You'll also want to run the same input data or the same sequence of mouse clicks. Be aware of the "warm cache" effects — where successive runs of a program require much less time because the system has already "seen" your code and data. Performance tuning will require the following steps:

    1. Form a hypothesis of what is happening in the system.
    2. Establish baseline performance measurement.
    3. Analyze what is happening in the system.
    4. Implement code change.
    5. Measure the effects of the changes.
    6. Repeat steps as necessary.

Most skillful code tuners are not geniuses — they just methodically attack the problem and carefully think about what is happening. Solutions then are often straightforward.

Tools
Profilers are tools for measuring the execution times of code fragments. Most systems provide the typical subroutine profiler. It measures the time spent executing subroutines, either by sampling the program counter or by collecting data by adding subroutine prologues and epilogues.

Finer-grained profilers such as tcov measure the time spent executing blocks of code or even individual lines of code. Sometimes these tools are well suited for numerical analysts that must pay attention to the inner loops of calculations.

When I need to understand the performance characteristics of a body of code, I resort to a call-graph profiler. I use gprof, which attributes times to program abstractions. For example, we may have a collection of subroutines that perform a logical task. Even though none of the task's individual routines shows up as "expensive," the whole body of code taken together may show that the runtime of the program is dominated by the one logical task. gprof highlights this information and shows the programmer what abstractions should be scrutinized or replaced.

index% timeselfchildrencalledname
0.021.14100/679xp_emulate [57]
0.022.14579/679zp_emulate [37]
[20]7.90.023.14679rxl_PrintBuff [20]
0.013.121210/1210rxl_PrintChar [21]
0.000.00679/817rxl_SetTexture [471]
0.013.121210/1210rxl_PrintBuff [20]
[21]7.80.013.121210rxl_PrintChar [21]
0.032.871139/1139get_char [24]
0.000.211089/1089rxl_FixedCh [92]

Table 1

 

Table 1 shows the twentieth [20] most expensive abstraction is rxl_PrintBuff. It is called 100 times from xp_emulate and 579 times from zp_emulate. The abstraction that includes its children consumes 7.9% of the time, or about 3 seconds (0.02 + 3.14). The routine itself isn't expensive, but it calls rxl_PrintChar 1,210 times. Again, rxl_PrintChar itself doesn't consume much time — the child get_char is where the CPU time is spent. The percentage times, by the way, are slightly distorted because of the percentage spent profiling; in reality, they are slightly higher.

Examples
Mike Durian and I were challenged by a laser-printer company to "speed up" printing. Its printer consisted of a RISC processor, a large amount of memory, a disk, and a full-featured source code operating system. We used a number of tools to characterize the printer's behavior, ranging from a stopwatch to various kernel counters. We realized that more sophisticated tools would help zero in on problems and consequentially implemented kernel gprof and a graphical virtual-memory trace facility, as shown in Figure 1. What appears as a pair of parallel horizontal lines is two sets of points representing discrete page-in and page-out events from the time 2700 microseconds to 2760. The upper set of points represents page-out activity that is due to a memory shortage. The lower set shows the corresponding page-in events. Here we have a classic case of thrashing — sending to disk a page that will be needed again soon. We implemented a different page-replacement policy, and subsequently it minimized the thrashing.

bobnew
























Figure 1

 

Information collected from gprof showed a bottleneck in the I/O system. Far too much time was spent waiting for disk blocks during normal system operation. This was true even without thrashing. We implemented a page-clustering algorithm, à la FreeBSD, to amortize the disk seek time over a large chunk of blocks. Our customer had considered upgrading its printers with faster and more expensive disks. Instead, a few hundred lines of code achieved a performance improvement that far exceeded the possible speed gains of quicker disks.

History
In 1980, Bill Joy performed one of the most dramatic examples of kernel tuning. The DEC VAX 11/780 with its "huge" 32-bit address space had just been released. The ARPA community (ARPANET was the predecessor to the Internet) was evaluating which operating system to run on the new hardware. Their choices were: DEC's VMS, UNIX, or a custom-built operating system.

David Kashtan of SRI published some performance numbers that showed VMS to be significantly more efficient in numerous areas. He urged the community to adopt VMS as the operating system base for future ARPA-sponsored code.

A few weeks later, Joy published a memo entitled "Comments on the Performance of UNIX on the VAX." The report details a set of performance improvements that were incorporated into the production systems at Berkeley during the first three weeks of March 1980 (the full paper is available at <http://boulderlabs.com/joy.html>).

Joy's paper showed a careful analysis of where the kernel time was spent in UNIX and went through a series of fairly easy modifications that eliminated many of the former inefficiencies. After tuning, he showed UNIX and VMS taking about the same amount of time for most of the benchmarks. While Joy's paper is an elegant piece on system tuning, parts of his conclusion are most insightful:

 In selecting between UNIX and VMS as an operating system for use in ARPA . . . , the UNIX system was chosen primarily out of concern for portability. For our purposes within the Computer Science Department at Berkeley, we have chosen UNIX because of its pliability to meet our needs.

 In the short term, there are areas of the UNIX system that are still suffering growing pains from the porting of the system to larger machines. We believe that the simplicity and modularity of the system, which are the keys to its portability, are also the reasons why UNIX is easy to tune, as this paper has demonstrated.

Code tuning is not black magic. Certainly some people are extremely effective in this area; however, anyone willing to apply simple principles systematically can achieve strong results.

NOTE
[1] SPEC, the Standard Performance Evaluation Corporation, is a nonprofit corporation formed to "establish, maintain and endorse a standardized set of relevant benchmarks that can be applied to the newest generation of high-performance computers." <http://www.spec.org>


 

?Need help? Use our Contacts page.
Last changed: 20 Jul. 2000 mc
Issue index
;login: index
USENIX home