FeatureUSENIX

 

NT scalability III

gunther

by Neil Gunther
<ngunther@ricochet.net>

Neil Gunther is founder and principal consultant for Performance Dynamics Company in Mountain View, CA. Dr. Gunther has worked in the Silicon Valley for 18 years. He is a member of IEEE, ACM, and CMG.

How to Stack Cyberbricks

This is the third and final article in the series [1, 2] on NT and UNIX server scalability. The motivation stems from my concerns about the misleading comparisons between UNIX and NT scalability that were presented at the USENIX-NT Workshop last August. My final article has to do with defining what is actually meant by the term "scalability."

In his opening address [3], Gray never actually defined what he meant by scalability, but one wrong impression revolved around Microsoft's cluster strategy. The notion is to connect cheap PCs in the same way one combines bricks to build a wall; in this case the PC units are to be thought of as cyberbricks that can be combined via clustering technology into larger scale computational structures. Sounds good, doesn't it!? So where do you get yours? You can't, if you want the version Gray described, because it's not ready for prime time (recall that the two-node cluster demo failed to appear during Gray's presentation [1]). Pyramid, DEC, Sun, HP, et al., however, do have UNIX clusters on the market and in development.

Let's examine this scalability concept more closely. In Figure 1, the area bounded inside each box (or "brick" ) represents the maximum available PC computational capacity.

No Disk File

Figure 1. Ideal stacking.

This doesn't account for cementing the Cyberbricks together.If we include the "cement" or "glue" needed to build a larger computational structure with these Cyberbricks, we end up with the situation shown schematically in Figure 2. The "cement" in this case is the overhead cycles consumed at each PC node (due to both hardware and software) as they communicate with one another. Let's see how this would work in more detail. No Disk File

Figure 2. Cementing of up to 4 Cyberbricks

Adding Cyber-Cement

As with real bricks and mortar, we have to prepare the surface of each brick where the mortar will be applied to stick the bricks together. Let the percentage of each Cyberbrick given up to the interface be g% of the available nodal computational capacity (i.e., g% of each Cyberbrick's volume for each interface). This lost fraction of capacity is shown as the black area between the bricks in Figure 2.

Although this is how things would look if these were real bricks, it turns out to be rather unrealistic for Cyberbricks. The lack of reality can be seen on both theoretical and empirical grounds.

Look at the stack of four Cyberbricks. It's clear that node A can interact with node B because they are "cemented" together. But how does node A interact with node C or node D? In this naive model of scalability, they simply can't! The effect of this limitation is surprising. To see this, let's take some concrete values.

If we assume that g = 5%, then we can estimate the available computational capacity as we scale up the number of physical nodes (p). For p ranging from one to four nodes, the results are summarized in Table 1.

Table 1. Naive scalability estimates
CyberbricksFacesLossDifferenceCapacity
1001 ­ 01.00
222 x 0.052 ­ 0.11.90
344 x 0.053 ­ 0.22.80
466 x 0.054 ­ 0.33.70
Clearly, there is an impact. For example, when we add the second node, we don't get double the computational capacity, only 1.9 nodes of effective capacity (the discount being 5%, as expected). At 4 nodes, we get only 3.7 nodes of effective capacity. The full impact up to 8 nodes is shown in Figure 3.

No Disk File

Figure 3. Naive scaling model up to 8 nodes or Cyberbricks

The 5% discount (due to the chosen g-factor) applies across the board and simply means that the ideal scaling curve has a slightly lower slant (lower by 5%). So adding in the Cyber-cement like stacking real bricks (Figure 2) is more realistic than not including it all (Figure 1) but it is still an unrealistic model of real computers for two reasons:

  1. It doesn't account for arbitrary internode interaction (e.g., A to D).
  2. Interconnect latencies and software contention are not included.

    If we attempt to include these more realistic effects, we end up with a slightly more subtle cementing procedure (Figure 4).

    No Disk File

    Figure 4. More realistic Cyberbricks configurations

    Taking the Edge Off

    Cementing together two Cyberbricks is the same as before (Figure 2) so the effective capacity loss should be the same. To cement three Cyberbricks, we have to combine them in such a way that each Cyberbrick touches every other Cyberbrick. The third diagram in Figure 4 shows how to do this.

    There are three segments of cement and two Cyberbrick faces for each segment, for a total of six faces that lose g=5% of each Cyberbrick's volume (i.e., a total loss of 6 x 0.05 = 0.3 nodes). Subtracting this loss from the 3 physical Cyberbricks leaves an effective capacity of 2.7 Cyberbricks. This is smaller than the previous case (Table 1).

    When it comes to cementing four Cyberbricks, things get a little trickier. You might expect the stack to be like that shown in Fig. 5, but that configuration does not allow node A to interact directly with node D. The correct stacking arrangement is shown in the last diagram in Figure 4. The secret is that we need to take some additional capacity by preparing a bevelled cement face between A-D and B-C.

    No Disk File

    Figure 5. Wrong version of stacking 4 Cyberbricks

    The total loss is assessed as follows:

    4 x (2 x 0.05) for each flush face

    2 x (2 x 0.05) for each bevelled face

    Subtracting this combined loss from the total number of Cyberbricks, produces an effective capacity of:

    4 ­ (4 x 0.10) ­ (2 x 0.10) = 4 ­ 0.6 = 3.40
    For p = 1 to 4 nodes, the complete results are summarized in Table 2.

    Table 2. Realistic scalability estimates
    CyberbricksFacesLossDifference< /TD>Capacity
    1001 ­ 01.00
    222 x 0.052 ­ 0.11.90
    366 x 0.053 ­ 0.32.70
    41212 x 0.054 ­ 0.63.40
    The full impact of this stacking procedure up to eight nodes is shown in Figure 6. The key difference from the naive scaling model (Figure 3) is the nonlinear characteristic in the scaling curve (very typical of real systems [4, 5]). This "curvature" in the scaling function reflects the increasing overhead of the additional communication paths as more Cyberbricks are added to the configuration.

    No Disk File

    Figure 6: Realistic scaling model up to eight Cyberbricks

    If you don't know the g-factor value, you can get it from measured throughput data (e.g., transaction per hour) across a few configurations. This can often be obtained from application developers doing pilot scalability studies or systems analysts who have been tracking upgrades on production systems. These data are entered into the array called "data" in the program called gfit.c. (See the source code on pp. 53-54.)

    Applying the Stacking Model

    Let's see how this kind of simple model can be applied to estimating the scalability of real computer systems. Because I discussed TPC-C benchmark data previously, [2] I'll use that same source. Unfortunately, there are no TPC-C data for different cluster configurations (yet). In lieu of that, I'll apply the model to multiprocessor data based on the 200MHz Pentium Pro running Microsoft SQLServer (bars in Figure 7). This is permitted because the realistic stacking model I've presented is quite generic and does not assume any particular interconnect technology. A shared-memory bus (like the Intel P6 bus) is also accounted for by our model.

    No Disk File

    Figure 7: Modelling estimates for Pentium/SQLServer TPC data

    Turning to Figure 7, let's begin with the p=1 and p=2 CPU configurations. The respective TPC-C throughput numbers were reported as 2,502 tpmC and 4,939 tpmC. Applying gfit.c to these values, we get a g-factor of almost 20% (a very poor value, by the way!). Using this g-factor, we can project the throughput of a 4-way platform (white dots in Fig. 7). The estimate is 4,219 transactions per minute ­ worse than the two-way configuration! That's right. Losing 20% of your cycles for every communication path between CPUs implies that adding the fourth CPU is a waste of money (even adding a third CPU is questionable!).

    But looking up the reported TPC-C result for a four-way, we find a remarkable result: 11,055 tpmC ­ very much better than expected (by almost a factor of three times!). Does this measurement invalidate our simple scaling model? No, it doesn't. There were many variables changed during the period (May­October 1997) between reporting the two-way TPC-C and the four-way TPC-C results (e.g., version changes in the O/S, the RDBMS, and the L2 cache size was doubled for the four-way platform).

    Regardless of which of these changes had the biggest impact, we can use our scaling model to do some interpolation estimates. Let's suppose that 1MB L2 caches were also available on the two-way and single CPU configurations. If the g-factor can be reduced to g=1%, then the single CPU should produce about 2,800 transactions per minute (TPM), the two-way should pull about 5,600 TPM, and the four-way (as expected) should produce 1,058 TPM (see the curve with the black dots in Figure 7).

    We can also do some extrapolations and estimate the TPC-C throughput for an eight-way platform. With the g-factor maintained at 1% (or better), the expected TPC-C throughput should be about 21,000 TPM (black dots in Figure 7). How accurate is this prediction? We'll have to wait and see.

    Acknowledgments

    I am grateful to Doug Smucker (HP) for a discussion about the P6 bus architecture.

    Notes

    [1] N.J. Gunther, "NT to the Max...(NoT)," ;login:, November 1997, pp. 9­11.

    [2] N.J. Gunther, "The ABC's of TPC's," ;login:, February 1998, pp. 50­54.

    [3] J. Gray, <http://www.usenix.org/publications/library/proceedings/usenix-nt97/ presentations/gray/index.htm>.

    [4] N.J. Gunther, The Practical Performance Analyst, (New York: McGraw-Hill, 1998).

    [5] N.J. Gunther and A. Cockcroft, Practical Performance Methods, Stanford class, August 1998. <http://members.aol.com/CoDynamo/Training.wics.htm>.

    Source Code

    /*  gfit.c Created by NJG, 11:46:45 03-27-96 */

    #include <stdio.h>
    #include <math.h>
    < TD>findcpu1;
    #define MAXCPU 8
    #define STARTMIN9999.0
    #define STARTERR100.0
    #defineMAXITER10000
    #defineMAXSEARCH100
    double data[MAXCPU+1];
    float mpf, gf, sf;
    main() {
    int cpu, iter;
    int
    double quad();
    double prev, abserr;
    double data1;

     /* input measured throughput data */
     data[1] = 2502;
     data[2] = 4039;

     printf("Measured CPU Samples\n");
     printf("CPU\tData\n");
     for (cpu = 1; cpu <= LASTCPU; cpu++) {
      if(data[cpu] > 0.0)
       printf("%4d\t%03.2lf\n", cpu, data[cpu]);
     }

     printf("\nOptimizing ");
     fflush(NULL);

     if (data[1] != 0.0) { /* no need to iterate data[1] value */
      data1 = data[1];
      findcpu1 = FALSE;
     } else { /* get the first non-zero sample to set data[1] */
      findcpu1 = TRUE;
      for (cpu = 1; cpu <= MAXCPU; cpu++) {
       if(data[cpu] > 0.0) {
        data1 = data[cpu]/cpu;
        break;
       }
      }
      data[1] = data1;
     }

     prev = STARTERR;
     for (iter = 0; iter < MAXSEARCH; iter++) {
      abserr = quad();
      if (!findcpu1)
       break;
      if (abserr < prev) {
       prev = abserr;
       data[1] += 0.1;
      }
      else
       break;
     printf(".");
     }

    printf(" Done.\n");
     printf("Best g-factor: %3.4f\n", gf);
     printf("Estimated X(1): %3.2lf\n", data[1]);
     printf("Cumulative error: %3.2lf %%\n", abserr);

    } /* main */

    double
    quad() {
     int it;
     int cpu;
     double q,
     result;
     double err, min;
     float param;

    min = STARTERR;
     param = 0.0;
     for (it = 0; it < MAXITER; it++) {
      err = 0.0;
      param += 0.0001;

    for (cpu = 1; cpu <= LASTCPU; cpu++) {
       if (data[cpu] > 0.0) {
        q = data[1] * cpu * (1 - param * (cpu - 1));
        err += fabs(q - data[cpu]) * 100.0 / data[cpu];
       }
      }
      if (err <= min) {
       min = err;
       gf = param;
      }
     }

    return(min);

    } /* quad */

 

?Need help? Use our Contacts page.
First posted: 8th July 1998 efc
Last changed: 8th July 1998 efc
Issue index
;login: index
USENIX home