
Mkbench Evaluation

To compare the performance and scalability of DSS, MQS and PMQS, we ran a series of kernel builds with varying job sizes on the 16-way NUMA machine. The load on the system is determined by the number of simultaneous kernel builds and the job size of each build.
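
The Mkbench harness itself is not listed here; as a minimal sketch of such a driver, the snippet below starts B simultaneous builds with a fixed job size and reports the elapsed wall-clock time. The kernel-tree paths and the build target are hypothetical placeholders, not the authors' tool.

    # Minimal sketch of an Mkbench-style driver (illustrative only, not the
    # authors' harness): start B simultaneous kernel builds, each with job
    # size J ("make -j J"), and report the elapsed wall-clock time.
    import subprocess, time

    B = 4                                  # number of simultaneous kernel builds
    J = 16                                 # "-j" job size of each build
    TREES = ["/usr/src/linux-%d" % i for i in range(B)]  # hypothetical source trees

    start = time.time()
    builds = [subprocess.Popen(["make", "-j", str(J), "bzImage"], cwd=tree)
              for tree in TREES]
    for build in builds:
        build.wait()
    elapsed = time.time() - start
    print("B=%d, j=%d: %.1fs elapsed, ~%d runnable tasks system-wide"
          % (B, J, elapsed, B * J))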

For a 4x4 NUMA system, a poolsize of 4 is a natural choice for PMQS, as it assigns a single pool to each node. This confines scheduling and data lookups to the local node and migrates processes across nodes only during the LB phases.
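
To make the pool-to-node mapping concrete, the sketch below partitions CPUs into pools by contiguous index ranges; the cpu_to_pool helper and the assumption that node k owns CPUs 4k..4k+3 are illustrative, not taken from the PMQS sources.

    # Sketch of pool partitioning on a 4x4 NUMA system (assumes contiguous
    # per-node CPU numbering; cpu_to_pool is an illustrative helper, not PMQS code).
    CPUS, CPUS_PER_NODE = 16, 4

    def cpu_to_pool(cpu, poolsize):
        """Map a CPU index to its pool index under contiguous partitioning."""
        return cpu // poolsize

    # poolsize=4: each pool coincides with one node, so intra-pool scheduling
    # stays node-local; poolsize=8: each pool spans two nodes; poolsize=16:
    # one global pool covering the whole machine.
    for poolsize in (4, 8, 16):
        pools = [cpu_to_pool(cpu, poolsize) for cpu in range(CPUS)]
        print("poolsize=%2d -> pool of each CPU: %s" % (poolsize, pools))
    print("               node of each CPU: %s"
          % [cpu // CPUS_PER_NODE for cpu in range(CPUS)])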


Table 2: PMQS (poolsize=4) and DSS compared to MQS for Mkbench configurations with a varying number (B) of kernel builds on a 4x4-way NUMA system. Values are the % improvement over MQS; parenthesized numbers give the system-wide count of runnable tasks.

Scheduler   B=2 (32)   B=4 (64)   B=8 (128)
DSS            -3.93      -3.25       -3.47
LBOFF           5.27      16.30       23.28
IP              5.05      12.90       21.29
LBP-45          2.01       4.19        3.91
LBP-10          2.09       3.55        2.22
LBC             5.89       8.03        7.44


Table 2 shows the results for poolsize=4, a job size of 16, and an LB invocation frequency of 600 milliseconds for 2, 4 and 8 parallel kernel builds (B=2,4,8). In this setup, B corresponds to the per-CPU load and to an average system-wide load of 32, 64 and 128 runnable tasks, respectively. First, Table 2 shows that the DSS scheduler consistently underperforms the MQS scheduler by between 3.25% and 3.93% for B=2,4,8. This is consistent with the results published earlier in [5]. Overall, PMQS consistently outperforms MQS across all considered loads and configurations.
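
The quoted loads follow directly from the build count and job size, as the short calculation below illustrates:

    # System-wide and per-CPU load implied by B parallel "make -j 16" builds
    # on the 16-way machine (matches the 32/64/128 figures quoted above).
    CPUS, J = 16, 16
    for B in (2, 4, 8):
        runnable = B * J
        print("B=%d: %d runnable tasks system-wide, per-CPU load %d"
              % (B, runnable, runnable // CPUS))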

In general, LBOFF performs best. The reason is that parallel kernel builds are throughput-oriented parallel applications. Kernel compiles follow a dependency graph, and as soon as the compilation of an individual file finishes, the next one is started. Hence, overall completion time depends on aggregate throughput rather than on the rate of progress of any individual compile. MQS and DSS are both schedulers that take global priorities into account and hence tend to migrate tasks to ensure the best global scheduling decisions. On NUMA machines, crossing node boundaries increases the negative cache effects. Among the dynamic load balancers, LBC, the most aggressive one, performs better than LBP-45 and LBP-10. This is somewhat surprising, as one might expect that tighter load balancing of a parallel, mostly independent, throughput-oriented application creates unnecessary overhead. We believe that this can be attributed to the fact that LBC incurs less overhead than LBP for poolsize=1.
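
To illustrate the difference in aggressiveness, the sketch below contrasts a pass that always moves tasks toward the mean pool load (LBC-like) with a pass that acts only when the relative imbalance exceeds a percentage threshold (LBP-x-like); this is our reading of the variants, given purely for illustration, not the PMQS implementation.

    # Illustrative contrast between an aggressive and a threshold-gated balancing
    # pass; the policies below are our interpretation, not the PMQS code.
    def lb_pass(pool_loads, threshold_pct=None):
        """Return how many tasks a single balancing pass would migrate."""
        heavy, light = max(pool_loads), min(pool_loads)
        if threshold_pct is not None:
            # LBP-x style: act only if the relative imbalance exceeds x percent.
            if heavy == 0 or (heavy - light) * 100.0 / heavy <= threshold_pct:
                return 0
        # LBC style (threshold_pct=None): always move toward equal pool loads.
        return (heavy - light) // 2

    loads = [10, 7, 6, 9]                           # hypothetical runnable counts per pool
    print("LBC    would migrate:", lb_pass(loads))       # aggressive
    print("LBP-45 would migrate:", lb_pass(loads, 45))   # acts only on large imbalances
    print("LBP-10 would migrate:", lb_pass(loads, 10))   # acts on smaller imbalances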

To study the effect of the overhead associated with load balancing, we varied the invocation frequency for LBC from 200 milliseconds to 2 seconds for poolsize=4 and 4 kernel builds. The results in Table 3 show a general trend: less frequent LB invocation lowers the overhead associated with LB and thus increases performance.


Table 3: Impact of the LB invocation frequency (in msecs) for LBC compared to MQS for poolsizes 4 and 8 and Mkbench (B=4). Values are the % improvement over MQS.

Interval (ms)    200   400   600   800  1000  1200  1400  1600  1800  2000  LBOFF
poolsize=4      7.45  8.00  8.03  7.67  8.93  8.86  8.35  8.51  9.35  9.64  16.30
poolsize=8      2.55  2.52  2.56  3.01  2.69  3.72  3.26  3.99  3.79  4.67   6.02


Overall, PMQS had a maximum performance advantage over MQS of 23.28% for high load (B=8) with LBOFF, and a minimum performance increase of 2.01% for low load (B=2) with LBP-45.

Analyzing the impact of increasing load (B=2,4,8), we observe that for the non-periodic load balancers (LBOFF and IP) the %-improvement over MQS increases with the load. This is again explained by the fact that MQS tends to make global decisions, thus forcing more task migrations and hence loss of cache state. The periodic load balancers do not show such dramatic %-improvements over MQS and actually peak at medium load.


Table 4: PMQS (poolsize=8) and DSS compared to MQS for Mkbench configurations with a varying number (B) of kernel builds on a 4x4-way NUMA system. Values are the % improvement over MQS; parenthesized numbers give the system-wide count of runnable tasks.

Scheduler   B=2 (32)   B=4 (64)   B=8 (128)
DSS            -3.93      -3.25       -3.47
LBOFF           3.02       6.02       16.85
IP              2.09      13.02       14.26
LBP-45          0.57       1.62        1.19
LBP-10         -0.13       0.62       -0.13
LBC             1.98       2.56        2.18


We also studied the effect of changing poolsizes. Table 4 repeats the experiments of Table 2 for poolsize=8. The trends for poolsize=8 are very similar to those for poolsize=4. However, the performance improvements are not as significant. The reason is that the intra-pool balancing performed during scheduling by the basic PMQS scheduler results in more task migrations. For LBC we also measured the performance for poolsize=16, where we actually see relative performance degradations compared to MQS of 2.27%, 1.31% and 1.01% for B=2, 4 and 8, respectively.

Having evaluated the efficacy of PMQS for NUMA-based systems running kernel compiles, and having established that a poolsize equal to the number of CPUs per node provides the greatest benefit, we now turn our attention to whether smaller poolsizes have any effect. For that we executed the parallel kernel builds (B=1,2,4,8) on an 8-way Netfinity SMP system. The ``-j'' factor was chosen as 8 to again provide the same per-CPU load as on the NUMA system. We varied the poolsize from 1 to 8. The results are presented in Table 5 and are relative to MQS performance. First we observe that DSS and MQS show only marginal differences in performance across all loads B.

LBOFF and IP are extremely sensitive to low-load situations (B=1,2) and small poolsizes, and substantially underperform MQS there. Both show good improvements only for B=4 and poolsize=2,4. In general, LBP-45, LBP-10 and LBC demonstrate small overall performance improvements throughout the configuration space of poolsizes and build factors considered. Furthermore, for high loads (B=8) no meaningful difference with respect to MQS can be established, independent of the poolsize. Though no definite selection can be made, in general the LB variants outperform LBOFF and IP. In particular, LBP-45 seems to show the best overall performance while running Mkbench. We note that LBOFF with poolsize=8 is effectively an MQS scheduler with the changes made to idle process identification.

Overall, for the NUMA system we have shown that PMQS provides increasingly better performance compared to MQS as the load increases and when the poolsize is equal to the number of CPUs per node. LBOFF and IP provided by far the largest benefits. In contrast, we have shown for a single SMP that LBOFF and IP have the opposite effect when the poolsize is decreased.


Table 5: PMQS and DSS compared to MQS for Mkbench configurations with a varying number (B) of kernel builds on an 8-way SMP system. Values are the % improvement over MQS.

Scheduler  PoolSize     B=1      B=2      B=4      B=8
DSS        N/A          0.48     0.95    -0.42    -1.37
LBOFF      1          -34.10    -9.55     3.18     0.25
           2          -17.57    -3.96    10.88     3.12
           4          -12.72     4.05    13.59    -0.21
           8            2.20     3.37     3.11     0.47
IP         1          -19.71   -10.51    -0.43    -1.02
           2           -9.00    -6.60    15.02    -0.83
           4          -10.88    -2.89    12.98    -8.35
           8           -7.58   -10.91   -11.99    -1.09
LBP-45     1            2.82     3.61     2.75     0.30
           2            2.28    -4.03     5.70     4.26
           4            0.16     1.07     8.34     0.52
           8            5.69     3.68     3.19     0.23
LBP-10     1            5.47     4.37     3.07     0.39
           2           -1.77    -2.53     4.73     0.62
           4            3.30     4.07     4.33    -0.25
           8            4.11     3.09     3.01     0.24
LBC        1            2.22    -2.49     2.98    -0.18
           2            0.23     3.58     3.23     0.39
           4            4.49     0.53     3.32    -0.06
           8            0.91     2.37     2.20    -0.13

This evaluation suggests that Mkbench on a NUMA system might benefit from a mixed load balancing approach, wherein intra-node balancing is performed by the LB variants and inter-node balancing is performed using either no load balancing or IP.
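
One possible shape of such a mixed policy is sketched below; the intra-node equalize step and the two inter-node modes (none and idle_pull) are hypothetical stand-ins for LB-like, LBOFF-like and IP-like behaviour, not code from the PMQS sources.

    # Hypothetical two-level balancer sketch (not PMQS code): balancing inside
    # each node, plus an optional inter-node step that either does nothing
    # ("none") or only pulls work to idle nodes ("idle_pull").
    def equalize(cpus):
        """Toy intra-node balancer: spread a node's tasks evenly over its CPUs."""
        total, n = sum(cpus), len(cpus)
        for c in range(n):
            cpus[c] = total // n + (1 if c < total % n else 0)

    def mixed_balance(node_loads, inter_node="idle_pull"):
        """node_loads: one list of per-CPU runnable counts per node (modified in place)."""
        for cpus in node_loads:                   # intra-node: always balance locally
            equalize(cpus)
        if inter_node == "none":                  # LBOFF-like: never cross node boundaries
            return
        busiest = max(range(len(node_loads)), key=lambda i: sum(node_loads[i]))
        for cpus in node_loads:                   # idle_pull: idle nodes steal one task
            if sum(cpus) == 0 and sum(node_loads[busiest]) > 1:
                node_loads[busiest][0] -= 1
                cpus[0] += 1
                equalize(node_loads[busiest])

    nodes = [[4, 0, 1, 3], [0, 0, 0, 0], [2, 2, 2, 2], [5, 1, 0, 2]]
    mixed_balance(nodes)
    print(nodes)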

