
Conclusions

We now have many of the ingredients needed to determine how well or poorly lucifer (and its similar single-Celeron nodes, adam, eve, and abel) might perform on a simple parallel task. We also have a wealth of information to help us tune the task on each host, both to balance the loads and to take optimal advantage of various system performance determinants such as the L1 and L2 cache boundaries and the relatively poor (or at least inconsistent) network. These numbers, along with a certain amount of judicious task profiling (for a description of the use of profiling in parallelizing a beowulf application see [profiling]), can in turn be used to determine the parameters such as $T_s$, $T_p$, $T_{is}$ and $T_{ip}$ that describe a given task.
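For concreteness, here is the kind of relation these parameters feed into (an illustrative Amdahl-like form, not necessarily the exact expression appropriate to any given task): taking $T_s$ to be the serial compute time, $T_p$ the parallelizable compute time, and $T_{is}$ and $T_{ip}$ the serial and parallelizable components of the communications overhead, the speedup on $P$ nodes behaves roughly as

$R(P) = \frac{T_s + T_p}{T_s + P\,T_{is} + T_p/P + T_{ip}}$

The qualitative lesson survives any reshuffling of the overhead terms: any contribution that grows with $P$ (here $P\,T_{is}$) eventually dominates the shrinking $T_p/P$, so the speedup curve peaks at some finite $P$ and then turns down.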

In addition, we have scaling curves that indicate the kind of parallel speedup we can expect to obtain for the task on the hardware we've microbenchmark-measured, and by comparing the appropriate microbenchmark numbers we might even be able to make a reliable guess at what the numbers and scaling would be on related but slightly different hardware (for example on a 300 MHz Celeron node instead of a 466 MHz Celeron node).
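As a crude illustration of such an extrapolation (the scaling assumption here is ours, not a measurement): if the task is CPU bound and its compute time scales roughly inversely with clock, then moving from a 466 MHz to a 300 MHz Celeron stretches the parallel work by about the clock ratio, $T_p(300) \approx (466/300)\,T_p(466) \approx 1.55\,T_p(466)$, while the network-dominated terms $T_{is}$ and $T_{ip}$ remain nearly unchanged; the predicted speedup curve and its optimal node count shift accordingly.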

With these tools and the results they return, one can at least imagine being able to scientifically design a beowulf, tune the tasks that run on it, and predict the resulting performance, even if one isn't initially a true expert in beowulf or general systems performance tuning. Furthermore, by using the same tools across a wide range of candidate platforms and publishing the comparative results, it may eventually become possible to do the all-important cost-benefit optimization that is really the fundamental motivation for using a beowulf design in the first place.
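One way to state that cost-benefit target concretely (an illustrative formulation only, not one developed in this paper): over candidate configurations $C$ and node counts $P$, minimize the cost per unit of delivered speedup,

$\min_{C,P}\; \mathrm{Cost}(C,P)\,/\,R(C,P)$

with $R(C,P)$ estimated from microbenchmark-fed scaling curves like those above.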

It is the hope of the author that in the near future the lmbench suite will develop into a more or less standard microbenchmarking tool that can be used, along with a basic knowledge of parallel scaling theory, to identify and aggressively attack the critical bottlenecks that all too often appear in beowulf design and operation. An additional, equally interesting possibility would be to transform it into a daemon or kernel module that periodically runs on all systems and provides a standard matrix of performance measurements, available from simple system calls or via a /proc structure. This, in turn, would facilitate many, many aspects of the job of dynamically maximizing beowulf or general systems performance in the spirit of ATLAS, but without the need to rebuild a program.
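To make the proposed interface concrete, here is a minimal sketch of a consumer of such a /proc structure. Everything in it is hypothetical: the path /proc/lmbench, its one-measurement-per-line format, and the field names are assumptions for illustration, as no such interface exists in the stock lmbench suite.

    #include <stdio.h>
    #include <stdlib.h>

    /* Hypothetical /proc entry exporting the standard measurement matrix,
     * one "name value units" triple per line, e.g.:
     *     lat_mem_l2  112.300  ns
     * No such file is provided by lmbench today; this only sketches the
     * proposed interface. */
    #define LMBENCH_PROC "/proc/lmbench"

    int main(void)
    {
        char name[64], units[16];
        double value;
        FILE *fp = fopen(LMBENCH_PROC, "r");

        if (fp == NULL) {
            perror("fopen " LMBENCH_PROC);
            return EXIT_FAILURE;
        }
        /* A load balancer or scheduler could consume these numbers at
         * run time -- ATLAS-style tuning without rebuilding the program. */
        while (fscanf(fp, "%63s %lf %15s", name, &value, units) == 3)
            printf("%-24s %12.3f %s\n", name, value, units);

        fclose(fp);
        return EXIT_SUCCESS;
    }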


Robert G. Brown 2000-08-28