

Introduction

Linux is becoming increasingly popular as a server operating system. It has already proven itself as a cost-effective solution for Web, file, and print serving, workloads that typically run on systems with 1-4 CPUs. More demanding enterprise applications, such as database, e-business, or departmental servers, tend to be deployed on larger symmetric multiprocessor (SMP) systems. To support such applications, Linux has to scale well as more CPUs are added to an SMP. Very large SMPs are increasingly built from smaller SMP building blocks connected by a cache-coherent interconnect. Because this design leads to non-uniform memory access times, such systems are referred to as NUMA systems, and Linux needs to incorporate NUMA awareness into the base kernel to support them.

With increasing CPU counts and application loads, Linux SMP scalability has become one of the focal points of kernel development. Within that context, we have been looking at the scalability of the Linux kernel scheduler. The current Linux scheduler (2.4.x kernel) has two defining characteristics. First, there is a single unordered runqueue for all runnable tasks in the system, protected by a single spinlock. Second, during scheduling, every task on the runqueue is examined while the runqueue lock is held. These characteristics have a two-fold effect on scalability. As the number of CPUs increases, there is more potential for lock contention. As the number of runnable tasks increases, lock hold time increases due to the linear examination of the runqueue. Independent of the number of CPUs, increased lock hold time can itself cause increased lock contention, depending on the frequency of scheduling decisions. For spinlocks, increased lock hold time and lock contention translate directly into increased lock wait time, which wastes CPU cycles. These observations are reinforced by recent studies. Measurements using Java benchmarks [2] show that the scheduler can consume up to 25% of total system time for workloads with a large number of tasks. Another study [3] has observed runqueue lock contention as high as 75% on a 32-way SMP.
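To make these costs concrete, the sketch below models the structure of a 2.4.x-style scheduling decision in user space: a single lock-protected runqueue and a full scan of it on every decision. The task_t type and the goodness() heuristic are simplified stand-ins of our own, not the kernel's actual definitions.

    /* User-space model of a 2.4.x-style scheduling decision.
     * task_t and goodness() are illustrative stand-ins, not the
     * kernel's definitions; assume runqueue_lock was initialized
     * with pthread_spin_init() at startup. */
    #include <pthread.h>
    #include <stddef.h>

    typedef struct task {
        int counter;              /* remaining time slice */
        int nice;                 /* static priority */
        struct task *next;        /* link in the single global runqueue */
    } task_t;

    static pthread_spinlock_t runqueue_lock;  /* one lock shared by all CPUs */
    static task_t *runqueue_head;             /* single unordered runqueue */

    /* Rough stand-in for the kernel's goodness() heuristic. */
    static int goodness(const task_t *p)
    {
        return p->counter + 20 - p->nice;
    }

    /* Every decision scans the whole runqueue with the global lock
     * held, so lock hold time is O(Ntasks). */
    static task_t *pick_next_task(void)
    {
        task_t *p, *best = NULL;
        int c = -1000;

        pthread_spin_lock(&runqueue_lock);
        for (p = runqueue_head; p != NULL; p = p->next) {
            int w = goodness(p);
            if (w > c) {
                c = w;
                best = p;
            }
        }
        pthread_spin_unlock(&runqueue_lock);
        return best;
    }

The lock hold time of pick_next_task() grows linearly with the number of runnable tasks, which is exactly the behavior reflected in the measurements cited above.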

To address these deficiencies, our previous work [5] proposes and implements two kernel schedulers, the Multi Queue Scheduler (MQS) and the Priority Level Scheduler (PLS). PLS organizes the global runqueue into multiple priority levels, resulting in fewer task examinations during scheduling. MQS is designed to address lock contention and lock hold time simultaneously. It does this by splitting the global runqueue and its associated lock into per-CPU equivalents. The two schedulers have been evaluated on an 8-way SMP using a variety of relevant workloads. Our results show that lock contention is the greater of the two problems and that MQS outperforms both the default SMP scheduler (DSS) and PLS in most cases. Both PLS and MQS attempt to be functionally equivalent to DSS. As a result, MQS ends up doing an O(Ncpus) scan looking for the globally best candidate. While this has minimal impact on lock contention, it does leave MQS open to scalability concerns. A side effect of making a global decision is an expected increase in the number of task migrations across CPUs, leading to more cache misses and, on NUMA systems, increased accesses to remote memory.
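The sketch below, reusing the illustrative types from the previous example, conveys the MQS idea: per-CPU runqueues and locks shrink the locking granularity, while a cached per-queue "best goodness" hint reduces the global decision to an O(Ncpus) scan of queue headers instead of an O(Ntasks) scan of tasks. The hint field and the lock-free scan over it are simplifications of our own; the real MQS must cope with the races this introduces.

    #define NCPUS 16

    struct cpu_runqueue {
        pthread_spinlock_t lock;  /* per-CPU lock: far less contention */
        task_t *head;             /* per-CPU list of runnable tasks */
        int best_goodness;        /* cached best goodness on this queue */
    };

    static struct cpu_runqueue rq[NCPUS];

    /* MQS-style decision: find the queue holding the globally best
     * candidate by scanning NCPUS cached hints, then lock only that
     * one queue to dequeue a task (simplified here to the head). */
    static task_t *mqs_pick_next(void)
    {
        int cpu, best_cpu = 0;
        task_t *p;

        for (cpu = 1; cpu < NCPUS; cpu++)   /* O(Ncpus), not O(Ntasks) */
            if (rq[cpu].best_goodness > rq[best_cpu].best_goodness)
                best_cpu = cpu;

        pthread_spin_lock(&rq[best_cpu].lock);
        p = rq[best_cpu].head;
        if (p != NULL)
            rq[best_cpu].head = p->next;
        pthread_spin_unlock(&rq[best_cpu].lock);
        return p;
    }

Note that the O(Ncpus) scan is precisely the step that pooling bounds: it touches every queue header on every decision, so its cost grows with the size of the machine.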

In this paper, we make an initial attempt to address these problems by using processor pooling. Processor pooling is implemented by an extension of MQS, called the Pooled Multi Queue Scheduler (PMQS), which partitions the CPUs of an SMP into subsets for making scheduling decisions. This places an upper bound on the scope of the search for candidate CPUs and tasks and is expected to improve scheduler performance. It is also expected to improve system throughput by decreasing the probability of cache misses and remote memory accesses. However, processor pooling can create load imbalance problems due to runqueue partitioning. We look at two ways of balancing CPU loads. The initial placement (IP) scheme places newly created tasks on the least loaded CPU and does not interfere with the functioning of PMQS thereafter. The other approach is more aggressive: it runs periodically and explicitly rebalances the CPU runqueues by moving runnable tasks between them.
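Continuing with the same illustrative structures, the sketch below shows both mechanisms: a PMQS-style decision whose candidate search is confined to the caller's pool, and an IP-style placement that scans all CPUs once, at task creation, for the shortest runqueue. The pool size and the contiguous assignment of CPUs to pools are assumptions made for illustration.

    #define POOL_SIZE 4   /* e.g., 16 CPUs in 4 pools (assumed layout) */

    static int queue_length(const struct cpu_runqueue *q)
    {
        int n = 0;
        const task_t *p;
        for (p = q->head; p != NULL; p = p->next)
            n++;
        return n;
    }

    /* PMQS-style decision: identical to the MQS scan, but confined
     * to this CPU's pool, bounding the search at O(POOL_SIZE). */
    static task_t *pmqs_pick_next(int this_cpu)
    {
        int start = (this_cpu / POOL_SIZE) * POOL_SIZE;
        int cpu, best_cpu = start;
        task_t *p;

        for (cpu = start + 1; cpu < start + POOL_SIZE; cpu++)
            if (rq[cpu].best_goodness > rq[best_cpu].best_goodness)
                best_cpu = cpu;

        pthread_spin_lock(&rq[best_cpu].lock);
        p = rq[best_cpu].head;
        if (p != NULL)
            rq[best_cpu].head = p->next;
        pthread_spin_unlock(&rq[best_cpu].lock);
        return p;
    }

    /* IP-style placement: a newly created task goes to the least
     * loaded CPU system-wide; PMQS does not migrate it across pool
     * boundaries afterwards. */
    static int ip_choose_cpu(void)
    {
        int cpu, best = 0, min_len = queue_length(&rq[0]);

        for (cpu = 1; cpu < NCPUS; cpu++) {
            int len = queue_length(&rq[cpu]);
            if (len < min_len) {
                min_len = len;
                best = cpu;
            }
        }
        return best;
    }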

Processor pooling is not a new idea. A simulation-based study of processor pooling for parallel systems [7] indicates that such pooling reduces average job response time. The importance of cache affinity in making scheduling decisions has been shown in [8,9] using simulations and analytical models. The effect of initial placement as a load balancing mechanism has been simulated in [4]. In this paper, we take an implementation-based approach to processor pooling in the context of Linux.

We examine the performance of processor pooling using two representative benchmarks, Mkbench and Chat, on a 16-way NUMA system and on an 8-way SMP. For Mkbench on the NUMA system, we find that PMQS shows substantial benefits over DSS and MQS; these benefits are even greater when no load balancing is done. For Chat, however, while PMQS does outperform DSS, it does substantially worse than MQS. Even though we feel that Mkbench is more representative of server workloads, the mixed performance makes it difficult to generalize the results of this paper and draw strong conclusions about the efficacy of pooling under Linux. These are preliminary implementations and results of the pooling concept; much work remains before stronger pronouncements can be made.

The rest of the paper is organized as follows. In Sections 2 and 3 we describe the default SMP scheduler (DSS) and the Multi Queue Scheduler (MQS), respectively. The Pooled Multi Queue Scheduler (PMQS) is presented in Section 4. Load balancing is described in Section 6. Section 7 contains the performance results and Section 8 concludes with directions for future work.

