|
performance tuning with source code UNIX
by Bob Gray
Bob Gray is co-founder of Boulder Labs, a software consulting company. Designing architectures for performance has been his focus ever since he built an image processor system on UNIX in the late 1970s. He has a Ph.D. in computer science from the University of Colorado.
This issue's column will focus on performance improvement an activity requiring technique, intuition, and of course source code. Most code bases have room for large improvements - but an analysis is required to determine if the payoffs are worthwhile. We'll be talking about the intuition for code efficiency that takes years to develop, and we'll examine the engineering process that leads to faster code. Finally, we'll look at some code-tuning success stories. For 20 years, CPU speeds have doubled every 1.5 years. The VAX 11/780 that was widely available in 1980 ran at one SPECMARK.[1] We now bask in the land of cheap CPU cycles with affordable machines running almost one thousand times faster. Then why do we frequently hear complaints about "sluggish" programs or "slow- booting" operating systems? The layperson might think that a 400MHz Pentium should "feel" three times faster than a 133MHz Pentium, but the new one still takes 60 seconds to boot. And why isn't opening documents and receiving mail three times faster? We can look at three root causes for slowness complaints: 1. The task is computationally intense and just takes a long time for example, image processing, multimedia encoding/decoding, and certain numerical problems. Here we will notice that faster CPUs make a big difference. 2. The slowness is due to I/O waiting, and faster CPUs won't help. Web-page loading over a 28Kbps connection will require about the same elapsed time on a Pentium-400 as on a Pentium-133. Similarly, the time required to search for something in the file system is largely independent of the CPU speed. 3. The code being run is suboptimal, and there is room for improvement. Here is where we can make a difference by tuning the application or adjusting its data structures and algorithms.
Performance Intuition
Methodology for Code Improvement
Before changing the code base you should first establish "baseline" performance measurements. Until you can quantify your program's behavior in terms of runtime, you cannot systematically attack it. To establish the baseline, you'll need both performance monitors and a repeatable input sequence. The simplest monitor is a stopwatch, but you may need finer-grained timers. UNIX systems provide microsecond timers (for example, getrusage) that can identify process "user" time, "system" time, and elapsed time. Occasionally, you may need access to kernel nanosecond timers they are available on most workstations and recent PCs. You should find a way to get repeatable time measurements for a certain task. You may need to isolate the test machine so that other people's work won't interfere with your measurements. You'll also want to run the same input data or the same sequence of mouse clicks. Be aware of the "warm cache" effects where successive runs of a program require much less time because the system has already "seen" your code and data. Performance tuning will require the following steps:
2. Establish baseline performance measurement. 3. Analyze what is happening in the system. 4. Implement code change. 5. Measure the effects of the changes. 6. Repeat steps as necessary. Most skillful code tuners are not geniuses they just methodically attack the problem and carefully think about what is happening. Solutions then are often straightforward.
Tools
Finer-grained profilers such as tcov measure the time spent executing blocks of code or even individual lines of code. Sometimes these tools are well suited for numerical analysts that must pay attention to the inner loops of calculations. When I need to understand the performance characteristics of a body of code, I resort to a call-graph profiler. I use gprof, which attributes times to program abstractions. For example, we may have a collection of subroutines that perform a logical task. Even though none of the task's individual routines shows up as "expensive," the whole body of code taken together may show that the runtime of the program is dominated by the one logical task. gprof highlights this information and shows the programmer what abstractions should be scrutinized or replaced.
Table 1
Table 1 shows the twentieth [20] most expensive abstraction is rxl_PrintBuff. It is called 100 times from xp_emulate and 579 times from zp_emulate. The abstraction that includes its children consumes 7.9% of the time, or about 3 seconds (0.02 + 3.14). The routine itself isn't expensive, but it calls rxl_PrintChar 1,210 times. Again, rxl_PrintChar itself doesn't consume much time the child get_char is where the CPU time is spent. The percentage times, by the way, are slightly distorted because of the percentage spent profiling; in reality, they are slightly higher.
Examples
Figure 1
Information collected from gprof showed a bottleneck in the I/O system. Far too much time was spent waiting for disk blocks during normal system operation. This was true even without thrashing. We implemented a page-clustering algorithm, à la FreeBSD, to amortize the disk seek time over a large chunk of blocks. Our customer had considered upgrading its printers with faster and more expensive disks. Instead, a few hundred lines of code achieved a performance improvement that far exceeded the possible speed gains of quicker disks.
History
David Kashtan of SRI published some performance numbers that showed VMS to be significantly more efficient in numerous areas. He urged the community to adopt VMS as the operating system base for future ARPA-sponsored code. A few weeks later, Joy published a memo entitled "Comments on the Performance of UNIX on the VAX." The report details a set of performance improvements that were incorporated into the production systems at Berkeley during the first three weeks of March 1980 (the full paper is available at <http://boulderlabs.com/joy.html>). Joy's paper showed a careful analysis of where the kernel time was spent in UNIX and went through a series of fairly easy modifications that eliminated many of the former inefficiencies. After tuning, he showed UNIX and VMS taking about the same amount of time for most of the benchmarks. While Joy's paper is an elegant piece on system tuning, parts of his conclusion are most insightful: In selecting between UNIX and VMS as an operating system for use in ARPA . . . , the UNIX system was chosen primarily out of concern for portability. For our purposes within the Computer Science Department at Berkeley, we have chosen UNIX because of its pliability to meet our needs. In the short term, there are areas of the UNIX system that are still suffering growing pains from the porting of the system to larger machines. We believe that the simplicity and modularity of the system, which are the keys to its portability, are also the reasons why UNIX is easy to tune, as this paper has demonstrated. Code tuning is not black magic. Certainly some people are extremely effective in this area; however, anyone willing to apply simple principles systematically can achieve strong results.
NOTE
|
|
Last changed: 20 Jul. 2000 mc |
|