Interconnect latencies and software contention are not included.
If we attempt to include these more realistic effects, we end up with a
slightly more subtle cementing procedure (Figure 4).
Figure 4. More realistic Cyberbricks configurations
Taking the Edge Off
Cementing together two Cyberbricks is the same as before (Figure 2) so
the effective capacity loss should be the same. To cement three
Cyberbricks, we have to combine them in such a way that each Cyberbrick
touches every other Cyberbrick. The third diagram in Figure 4 shows how
to do this.
There are three segments of cement and two Cyberbrick faces for each
segment, for a total of six faces that lose g=5% of each Cyberbrick's
volume (i.e., a total loss of 6 x 0.05 = 0.3 nodes). Subtracting this
loss from the 3 physical Cyberbricks leaves an effective capacity of
2.7 Cyberbricks. This is smaller than the previous case (Table 1).
When it comes to cementing four Cyberbricks, things get a little
trickier. You might expect the stack to be like that shown in Fig. 5,
but that configuration does not allow node A to interact directly with
node D. The correct stacking arrangement is shown in the last diagram
in Figure 4. The secret is that we need to take some additional
capacity by preparing a bevelled cement face between A-D and B-C.
Figure 5. Wrong version of stacking 4 Cyberbricks
The total loss is assessed as follows:
4 x (2 x 0.05) for each flush face
2 x (2 x 0.05) for each bevelled face
Subtracting this combined loss from the total number of Cyberbricks,
produces an effective capacity of:
4 (4 x 0.10) (2 x 0.10) = 4 0.6 = 3.40
For p = 1 to 4 nodes, the complete results are summarized in Table 2.
Table 2. Realistic scalability estimates
| Cyberbricks | Faces | Loss | Difference<
/TD> | Capacity |
| 1 | 0 | 0 | 1
0 | 1.00 |
| 2 | 2 | 2 x
0.05 | 2 0.1 | 1.90 |
| 3 | 6 | 6 x
0.05 | 3 0.3 | 2.70 |
| 4 | 12 | 12 x
0.05 | 4 0.6 | 3.40 |
The full impact of this stacking procedure up to eight nodes is shown
in Figure 6. The key difference from the naive scaling model (Figure 3)
is the nonlinear characteristic in the scaling curve (very typical of
real systems [4, 5]). This "curvature" in the scaling function reflects
the increasing overhead of the additional communication paths as more
Cyberbricks are added to the configuration.

Figure 6: Realistic scaling model up to eight Cyberbricks
If you don't know the g-factor value, you can get it from measured
throughput data (e.g., transaction per hour) across a few
configurations. This can often be obtained from application developers
doing pilot scalability studies or systems analysts who have been
tracking upgrades on production systems. These data are entered into
the array called "data" in the program called gfit.c. (See the
source code on pp. 53-54.)
Applying the Stacking Model
Let's see how this kind of simple model can be applied to estimating
the scalability of real computer systems. Because I discussed TPC-C
benchmark data previously, [2] I'll use that same source.
Unfortunately, there are no TPC-C data for different cluster
configurations (yet). In lieu of that, I'll apply the model to
multiprocessor data based on the 200MHz Pentium Pro running Microsoft
SQLServer (bars in Figure 7). This is permitted because the realistic
stacking model I've presented is quite generic and does not assume any
particular interconnect technology. A shared-memory bus (like the Intel
P6 bus) is also accounted for by our model.
Figure 7: Modelling estimates for Pentium/SQLServer TPC data
Turning to Figure 7, let's begin with the p=1 and p=2 CPU
configurations. The respective TPC-C throughput numbers were reported
as 2,502 tpmC and 4,939 tpmC. Applying gfit.c to these values,
we get a g-factor of almost 20% (a very poor value, by the way!). Using
this g-factor, we can project the throughput of a 4-way platform (white
dots in Fig. 7). The estimate is 4,219 transactions per minute
worse than the two-way configuration! That's right. Losing 20% of your
cycles for every communication path between CPUs implies that adding
the fourth CPU is a waste of money (even adding a third CPU is
questionable!).
But looking up the reported TPC-C result for a four-way, we find a
remarkable result: 11,055 tpmC very much better than expected
(by almost a factor of three times!). Does this measurement invalidate
our simple scaling model? No, it doesn't. There were many variables
changed during the period (MayOctober 1997) between reporting the
two-way TPC-C and the four-way TPC-C results (e.g., version changes in
the O/S, the RDBMS, and the L2 cache size was doubled for the four-way
platform).
Regardless of which of these changes had the biggest impact, we can use
our scaling model to do some interpolation estimates. Let's suppose
that 1MB L2 caches were also available on the two-way and single CPU
configurations. If the g-factor can be reduced to g=1%, then the single
CPU should produce about 2,800 transactions per minute (TPM), the
two-way should pull about 5,600 TPM, and the four-way (as expected)
should produce 1,058 TPM (see the curve with the black dots in Figure
7).
We can also do some extrapolations and estimate the TPC-C throughput
for an eight-way platform. With the g-factor maintained at 1% (or
better), the expected TPC-C throughput should be about 21,000 TPM
(black dots in Figure 7). How accurate is this prediction? We'll have
to wait and see.
Acknowledgments
I am grateful to Doug Smucker (HP) for a discussion about the P6 bus
architecture.
Notes
[1] N.J. Gunther, "NT to the Max...(NoT)," ;login:, November
1997, pp. 911.
[2] N.J. Gunther, "The ABC's of TPC's," ;login:, February 1998,
pp. 5054.
[3] J. Gray,
<http://www.usenix.org/publications/library/proceedings/usenix-nt97/
presentations/gray/index.htm>.
[4] N.J. Gunther, The Practical Performance Analyst, (New York:
McGraw-Hill, 1998).
[5] N.J. Gunther and A. Cockcroft, Practical Performance Methods,
Stanford class, August 1998.
<http://members.aol.com/CoDynamo/Training.wics.htm>.
Source Code
/* gfit.c Created by NJG, 11:46:45 03-27-96 */
#include <stdio.h>
#include <math.h>
| #define | MAXCPU | 8
|
| #define | STARTMIN | 9999.0
|
| #define | STARTERR | 100.0
|
| #define | | MAXITER | 10000
|
| #define | | MAXSEARCH | 100 |
| double | data[MAXCPU+1]; |
| float | | mpf, gf, sf; |
| main() { |
| int
| | | cpu, iter;
|
| int | | | | <
TD>findcpu1;
| double | quad(); |
| double | prev, abserr; |
| double | data1; |
/* input measured throughput data */
data[1] = 2502;
data[2] = 4039;
printf("Measured CPU Samples\n");
printf("CPU\tData\n");
for (cpu = 1; cpu <= LASTCPU; cpu++) {
if(data[cpu] > 0.0)
printf("%4d\t%03.2lf\n", cpu, data[cpu]);
}
printf("\nOptimizing ");
fflush(NULL);
if (data[1] != 0.0) { /* no need to iterate data[1] value
*/
data1 = data[1];
findcpu1 = FALSE;
} else { /* get the first non-zero sample to set data[1] */
findcpu1 = TRUE;
for (cpu = 1; cpu <= MAXCPU; cpu++) {
if(data[cpu] > 0.0) {
data1 = data[cpu]/cpu;
break;
}
}
data[1] = data1;
}
prev = STARTERR;
for (iter = 0; iter < MAXSEARCH; iter++) {
abserr = quad();
if (!findcpu1)
break;
if (abserr < prev) {
prev = abserr;
data[1] += 0.1;
}
else
break;
printf(".");
}
printf(" Done.\n");
printf("Best g-factor: %3.4f\n", gf);
printf("Estimated X(1): %3.2lf\n", data[1]);
printf("Cumulative error: %3.2lf %%\n", abserr);
} /* main */
double
quad() {
int it;
int cpu;
double q,
result;
double err, min;
float param;
min = STARTERR;
param = 0.0;
for (it = 0; it < MAXITER; it++) {
err = 0.0;
param += 0.0001;
for (cpu = 1; cpu <= LASTCPU; cpu++) {
if (data[cpu] > 0.0) {
q = data[1] * cpu * (1 - param * (cpu - 1));
err += fabs(q - data[cpu]) * 100.0 / data[cpu];
}
}
if (err <= min) {
min = err;
gf = param;
}
}
return(min);
} /* quad */
|