Lmbench Results

To publish lmbench results in a public forum, the lmbench license requires that the benchmark code be compiled with a ``standard'' level of optimization (-O only) and that all the results produced by the lmbench suite be published. Together these two rules ensure that results compare as fairly as possible, apples to apples, across multiple platforms, and they prevent vendors or overzealous computer scientists from seeking ``magic'' combinations of optimizations that improve one result (which they then selectively publish) at the expense of others.

Accordingly, on the following page is a full set of lmbench results generated for ``lucifer'', the primary server node for my home (primarily development) beowulf [Eden]. The mean values and error estimates were generated by averaging ten independent runs of the full benchmark. lucifer is a 466 MHz dual Celeron system, which permits it to function (in principle) simultaneously as a master node and as a participant node. The cpu-rate results are also included on this page for completeness, although they may be superseded in the future by Carl Staelin's superior hardware instruction latency measures.

Table 1: Lucifer System Description

HOST	lucifer
CPU	Celeron (Mendocino) (x2)
CPU Family	i686
MHz	467
L1 Cache Size	16 KB (code)/16 KB (data)
L2 Cache Size	128 KB
Motherboard	Abit BP6
Memory	128 MB of PC100 SDRAM
OS Kernel	Linux 2.2.14-5.0smp
Network (100BT)	Lite-On 82c168 PNIC rev 32
Network Switch	Netgear FS108

Table 2: lmbench latencies for selected processor/process activities. The values are all times in microseconds averaged over ten independent runs (with error estimates provided by an unbiased standard deviation), so ``smaller is better''.

null call	$0.696 \pm 0.006$
null I/O	$1.110 \pm 0.005$
stat	$3.794 \pm 0.032$
open/close	$5.547 \pm 0.054$
select	$44.7 \pm 0.82$
signal install	$1.971 \pm 0.006$
signal catch	$3.981 \pm 0.002$
fork proc	$634.4 \pm 28.82$
exec proc	$2755.5 \pm 10.34$
shell proc	$10569.0 \pm 46.92$
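
The averaging behind the table above can be sketched as follows. This is a minimal sketch, assuming that the ``unbiased standard deviation'' is the usual sample standard deviation with an n-1 divisor; the function name and the sample values are hypothetical illustrations, not the measured data or the script actually used.

```python
import math


def mean_and_error(samples):
    """Return (mean, sample standard deviation) for a list of timings.

    Assumption: the error estimate is the standard deviation computed
    with the n-1 (Bessel) divisor, not the standard error of the mean.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two runs to estimate a spread")
    mean = sum(samples) / n
    variance = sum((x - mean) ** 2 for x in samples) / (n - 1)
    return mean, math.sqrt(variance)


# Hypothetical latencies in microseconds from ten runs (illustrative only).
runs = [0.69, 0.70, 0.70, 0.69, 0.71, 0.70, 0.69, 0.70, 0.69, 0.70]
mean, err = mean_and_error(runs)
print(f"null call: {mean:.3f} +/- {err:.3f}")
```

With only ten runs the n-1 divisor matters: dividing by n instead would understate the spread by about five percent.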

Table 3: Lmbench latencies for context switches, in microseconds (smaller is better).

2p/0K	$1.91 \pm 0.036$
2p/16K	$14.12 \pm 0.724$
2p/64K	$144.67 \pm 9.868$
8p/0K	$3.30 \pm 1.224$
8p/16K	$48.45 \pm 1.224$
8p/64K	$201.23 \pm 2.486$
16p/0K	$6.26 \pm 0.159$
16p/16K	$63.66 \pm 0.779$
16p/64K	$211.38 \pm 5.567$

Table 4: Lmbench local communication latencies, in microseconds (smaller is better).

pipe	$10.62 \pm 0.069$
AF UNIX	$33.74 \pm 3.398$
UDP	$55.13 \pm 3.080$
TCP	$127.71 \pm 5.428$
TCP Connect	$265.44 \pm 7.372$
RPC/UDP	$140.06 \pm 7.220$
RPC/TCP	$185.30 \pm 7.936$

Table 5: Lmbench network communication latencies, in microseconds (smaller is better).

UDP	$164.91 \pm 2.787$
TCP	$187.92 \pm 9.357$
TCP Connect	$312.19 \pm 3.587$
RPC/UDP	$210.65 \pm 3.021$
RPC/TCP	$257.44 \pm 4.828$

Table 6: Lmbench memory latencies in nanoseconds (smaller is better). Also see graphs for a more complete picture.

L1 Cache	$6.00 \pm 0.000$
L2 Cache	$112.40 \pm 7.618$
Main mem	$187.10 \pm 1.312$
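
It can help to express these latencies in CPU clock cycles rather than nanoseconds. The sketch below assumes the 467 MHz clock reported in Table 1 and simply rescales the mean values from the table above; the function name is a hypothetical helper, not part of lmbench.

```python
CLOCK_MHZ = 467.0  # measured clock from Table 1


def ns_to_cycles(latency_ns, clock_mhz=CLOCK_MHZ):
    """Convert a latency in nanoseconds to CPU clock cycles.

    One cycle lasts 1000 / clock_mhz nanoseconds, so
    cycles = latency_ns * clock_mhz / 1000.
    """
    return latency_ns * clock_mhz / 1000.0


# Mean latencies in nanoseconds, copied from Table 6.
latencies_ns = {"L1 Cache": 6.00, "L2 Cache": 112.40, "Main mem": 187.10}

for level, ns in latencies_ns.items():
    print(f"{level}: {ns_to_cycles(ns):.1f} cycles")
```

On this reading the L1 latency is roughly three clock cycles, while a main memory access costs close to ninety.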

Table 7: Lmbench local communication bandwidths, in $10^6$ bytes/second (bigger is better).

pipe	$290.17 \pm 11.881$
AF UNIX	$64.44 \pm 3.133$
TCP	$31.70 \pm 0.663$
UDP	(not available)
bcopy (libc)	$79.51 \pm 0.782$
bcopy (hand)	$72.93 \pm 0.617$
mem read	$302.79 \pm 3.054$
mem write	$97.92 \pm 0.787$

Table 8: Lmbench network communication bandwidths, in $10^6$ bytes/second (bigger is better).

TCP	$11.21 \pm 0.018$
UDP	(not available)
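
As a rough sanity check, the TCP figure can be compared with the raw signalling rate of the 100BT link. The sketch below rests on two assumptions that the table itself does not state: that the bandwidth is expressed in units of $10^6$ bytes/second, and that the link carries $100 \times 10^6$ bits/second. The function name is hypothetical.

```python
def wire_fraction(bandwidth_mbytes_per_s, link_mbits_per_s=100.0):
    """Fraction of the raw link rate used by the measured bandwidth.

    Assumes bandwidth in 10^6 bytes/second and link rate in
    10^6 bits/second, so the factor of 8 converts bytes to bits.
    """
    return bandwidth_mbytes_per_s * 8.0 / link_mbits_per_s


# Mean TCP bandwidth from Table 8.
tcp_bandwidth = 11.21
print(f"TCP uses {100.0 * wire_fraction(tcp_bandwidth):.1f}% of the raw link rate")
```

Under these assumptions TCP delivers a little under ninety percent of the raw rate, with the remainder going to protocol headers, framing, and host overhead.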

Table 9: CPU-rates in BOGOMFLOPS - $10^6$ simple arithmetic operations/second, in L1 cache (bigger is better). Also see graph for out-of-cache performance.

Single precision	$289.10 \pm 1.394$
Double precision	$299.09 \pm 2.295$

lmbench clearly produces an extremely detailed picture of microscopic systems performance. Many of these numbers are of obvious interest to beowulf designers and have indeed been discussed (in many cases without a sound quantitative basis) on the beowulf list [beowulf]. To conduct a sane discussion in the allotted space, we must focus. In the following subsections we will consider the network, the memory, and the cpu-rates as primary contributors to beowulf and parallel code design.

These three are not at all independent. The rate at which the system does floating point arithmetic on streaming vectors of numbers is very strongly determined by the relative sizes of the L1 and L2 caches and the size of the vector(s) in question. Significant (and somewhat unexpected) structure is also revealed in network performance as a function of packet size, which suggests ``interesting'' interactions between the network, the memory subsystem, and the operating system that are worthy of further study.
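
The dependence on vector size can be made concrete with a crude working-set estimate. The sketch below uses the cache sizes from Table 1 and assumes 8-byte double precision elements and 1 KB = 1024 bytes. It ignores cache associativity, line size, and everything else competing for the cache, so it only indicates where the transitions in the cpu-rate curve might be expected. The function name is hypothetical.

```python
L1_DATA_BYTES = 16 * 1024   # L1 data cache size from Table 1
L2_BYTES = 128 * 1024       # L2 cache size from Table 1


def memory_level(vector_length, n_vectors=1, bytes_per_element=8):
    """Name the smallest memory level that could hold the working set.

    The working set is taken to be n_vectors vectors of vector_length
    elements each; this is an idealised upper bound on what fits.
    """
    working_set = vector_length * n_vectors * bytes_per_element
    if working_set <= L1_DATA_BYTES:
        return "L1"
    if working_set <= L2_BYTES:
        return "L2"
    return "main memory"


# Three double precision vectors, as in an operation like a[i] = b[i] + c[i].
for length in (500, 5000, 50000):
    print(length, memory_level(length, n_vectors=3))
```

For three double precision vectors this estimate places the L1 boundary near 680 elements per vector and the L2 boundary near 5460.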