

Pooled Multi Queue Scheduler (PMQS)

Before describing the design of the pooled multiqueue scheduler, it is instructive to examine the implications of the scheduling algorithm followed by DSS and MQS. During schedule(), DSS selects the next best task to run, as determined by the goodness() function. MQS attempts to do the same, though its examination of non-local tasks (tasks not on the runqueue of the invoking CPU) is done without a lock. During reschedule_idle(), DSS tries to find the best CPU on which a newly woken or created task can be run. MQS does the same without holding a lock on the remote runqueues. Overall, whether it is choosing a task or a CPU, DSS examines all viable candidates and chooses the globally best one, while MQS tries to achieve the same goal by an intelligent examination of fewer candidates. This has two implications for large SMP and NUMA machines:

  1. Scalability: The schedule() function in DSS is $O(N_{tasks})$, whereas in MQS it is $O(\frac{N_{tasks}}{N_{cpus}} + N_{cpus})$ (on average, the local runqueue in MQS holds $\frac{1}{N_{cpus}}$ of the total tasks in the system). Even if $N_{tasks}$ is a small constant multiple $c$ of $N_{cpus}$, both schedulers approach $O(N_{cpus})$: for MQS, the cost reduces to $\frac{c \cdot N_{cpus}}{N_{cpus}} + N_{cpus} = c + N_{cpus}$. Both are therefore susceptible to scalability problems. Similar conclusions can be drawn from their reschedule_idle() functions, which are both $O(N_{cpus})$, albeit with different locking strategies. Compared to DSS, MQS considerably increases scalability by removing the global lock, but it does not eliminate the potential scalability bottleneck. (A sketch of the DSS-style scan appears after this list.)
  2. Locality: When a runnable thread migrates from one CPU to another (either directly as a result of preemption or after a sleep/wakeup cycle), it runs the risk of losing the cache context accumulated on the previous CPU. For any SMP system, the probability of being able to take advantage of a previous run on the same CPU is inversely related to the number of alternate CPUs to which the thread could be migrated. For a NUMA system, migration can carry an even greater penalty if a task runs on a node different from the one on which most of its memory is allocated. The effect of the increased probability of cache misses and remote memory accesses is also highly workload dependent: for example, an application whose working set causes its cache context to be replaced periodically is not much affected by the extra cache misses, and an application with a high cache hit rate will not be severely affected by remote memory accesses.
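The following is a minimal sketch, not the actual 2.4 kernel code, of the DSS-style selection scan referred to above; the task fields and the goodness() calculation are simplified stand-ins for the kernel's versions. It makes the $O(N_{tasks})$ cost concrete: every runnable task in the system is examined on each invocation.

#include <stddef.h>

struct task {
        struct task *next;      /* link in the single global runqueue */
        int counter;            /* remaining timeslice */
        int processor;          /* CPU the task last ran on */
};

/* Simplified stand-in for goodness(): weight a candidate, with a
   small bonus for having last run on the invoking CPU. */
static int goodness(const struct task *p, int this_cpu)
{
        int weight = p->counter;
        if (p->processor == this_cpu)
                weight += 1;    /* crude cache-affinity bonus */
        return weight;
}

/* The O(N_tasks) scan: every runnable task is examined. */
static struct task *pick_next(struct task *runqueue_head, int this_cpu)
{
        struct task *p, *next = NULL;
        int c = -1;

        for (p = runqueue_head; p != NULL; p = p->next) {
                int weight = goodness(p, this_cpu);
                if (weight > c) {
                        c = weight;
                        next = p;
                }
        }
        return next;
}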

To address these two problems, we propose a pooled multiqueue scheduler (PMQS) based on MQS. The central idea in PMQS is to partition the CPUs of an SMP/NUMA machine into pools and to limit the search for candidate CPUs and/or tasks to these pools. The size of the pools is a configurable parameter that can be dynamically changed to meet different criteria of performance, fairness, etc. Restricting the scope of scheduling decisions to fixed-size pools improves the scalability of the scheduler by making it independent of $N_{cpus}$. It also increases the locality of tasks with respect to the CPUs on which they run.
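As an illustration of the partitioning, the following is a minimal sketch assuming pools are contiguous runs of CPU ids and that pool_size is the configurable pool size; the names are illustrative, not those of the actual patch.

/* Sketch: boundaries of the pool containing a given CPU, assuming
   contiguous pools of pool_size CPUs each (illustrative names only). */
static int pool_first(int cpu, int pool_size)
{
        return (cpu / pool_size) * pool_size;
}

static int pool_last(int cpu, int pool_size, int smp_num_cpus)
{
        int last = pool_first(cpu, pool_size) + pool_size - 1;
        return (last < smp_num_cpus) ? last : smp_num_cpus - 1;
}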

For simplicity, the implementation of PMQS is based on minor modifications to MQS. In general, DSS code that looks at all CPUs using

for (i = 0; i < smp_num_cpus; i++)

is replaced by

for (i = first; i <= last; i++)

where first and last denote the corresponding limits of the invoking CPU's pool. The first phase of the schedule() routine of MQS is unchanged in PMQS and is used to find the best local candidate task. In the second phase of the routine, the only remote runqueues examined are those of CPUs within the same pool as the invoking CPU; as in MQS, this examination is done without holding a lock. In the reschedule_idle() function, only CPUs that lie within the same pool as the task's previous CPU are treated as candidates. The sole exception to this rule is when there are idle CPUs outside the pool: leaving a CPU idle while other CPUs have runnable tasks wastes CPU cycles outright and is particularly undesirable in a server environment, so PMQS looks for idle CPUs systemwide without regard to pool boundaries.

The hunt for idle CPUs uses a simple mechanism. A global bitmap, called node_idle_bits, records the currently idle CPUs in the system and is maintained by the scheduler when it switches tasks on any CPU. During reschedule_idle(), node_idle_bits is first masked with a local pool mask to identify idle CPUs within the local pool. If none is found, it is masked with a remote pool mask to identify idle CPUs in remote pools. Consistent with the pooling philosophy, idle CPUs within a pool are thus preferentially chosen over those outside it. If an idle CPU is found at either stage, it is selected as the target CPU; contrary to the behaviour of DSS and MQS, PMQS makes no attempt to select the longest idle CPU. If no idle CPU is found, reschedule_idle() proceeds as normal, examining the max_na_goodness values of the CPUs within the pool.
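The following is a minimal sketch of this two-step idle-CPU search, assuming one bit per CPU in node_idle_bits and a caller-supplied mask for the local pool; the function name and its signature are assumptions, not the actual patch's interface.

#include <limits.h>

/* Sketch: find an idle CPU via node_idle_bits, preferring the local
   pool.  Returns a CPU id, or -1 if no CPU in the system is idle. */
static int find_idle_cpu(unsigned long node_idle_bits,
                         unsigned long local_pool_mask)
{
        unsigned long idle = node_idle_bits & local_pool_mask;
        int cpu;

        if (idle == 0)          /* nothing idle locally: look outside */
                idle = node_idle_bits & ~local_pool_mask;
        if (idle == 0)
                return -1;

        /* Take the lowest-numbered idle CPU; unlike DSS/MQS, no
           attempt is made to pick the longest-idle one. */
        for (cpu = 0; cpu < (int)(sizeof(idle) * CHAR_BIT); cpu++)
                if (idle & (1UL << cpu))
                        return cpu;
        return -1;
}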

