Next: Pooled Multi Queue Scheduler Up: PMQS : Scalable Linux Previous: Default SMP Scheduler (DSS)


Multi Queue Scheduler (MQS)

The Multi Queue Scheduler (MQS) is designed to address scalability by reducing lock contention and lock hold times while maintaining functional equivalence with DSS. It breaks up the global runqueue and the global runqueue lock into corresponding per-CPU structures. Lock hold times are reduced by limiting the examination of tasks to those on the runqueue of the invoking CPU, combined with a quick inspection of summary data for the non-local runqueues. Moreover, the absence of a global lock allows multiple instances of the scheduler to run in parallel, reducing the lock wait time caused by contention. Together, these changes reduce the scheduler-related lock contention seen by the system.

MQS defines per-CPU runqueues which are similar to the global runqueue of the DSS scheduler. Related information such as the number of runnable tasks on this runqueue is maintained and protected by a per-CPU runqueue lock.
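A per-CPU runqueue along these lines might be sketched as follows; the field and type names are illustrative assumptions, not the actual identifiers used in the MQS patch:

```c
#include <assert.h>

#define NR_CPUS 4

/* Hypothetical sketch of a per-CPU runqueue; field names are
 * illustrative, not the actual MQS identifiers. */
struct runqueue {
    long nt_running;        /* number of runnable tasks on this runqueue */
    int  max_na_goodness;   /* best non-affinity goodness among its tasks */
    /* spinlock_t lock;        per-CPU runqueue lock (omitted here)       */
    /* struct list_head tasks; local list of runnable tasks (omitted)    */
};

struct runqueue runqueues[NR_CPUS];   /* one runqueue per CPU */
```

Each runqueue is guarded by its own lock, so a CPU normally touches only its local structure.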

The schedule() routine of MQS operates in two distinct phases. In the first phase, it examines the local runqueue of the invoking CPU and finds the best local task to run. In the second phase, it compares this local candidate with top candidates from remote runqueues and chooses the global best.

In more detail, the schedule() routine of MQS acquires the runqueue lock of the invoking CPU's runqueue and scans that runqueue for the schedulable task with the highest goodness value. To facilitate the global decision in the second phase, it also records the second highest non-affinity goodness value in the max_na_goodness field of the local runqueue. The non-affinity goodness (henceforth called na_goodness) is the goodness value of a task without any consideration of CPU or memory map affinity. The local candidate's goodness value (which includes appropriate affinity boosts) is then compared with the max_na_goodness of all other runqueues to determine the best global candidate. If the global candidate is on a remote runqueue, schedule() tries to acquire the corresponding lock and move that candidate over to its local runqueue. If it fails to acquire the lock, or if the remote task is no longer a candidate (its na_goodness value has changed), schedule() skips that runqueue and tries again with the next best global candidate. In these situations MQS's decisions deviate slightly from those of DSS; for example, the third best task of the skipped runqueue could also have been a candidate but is never considered by MQS.
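The two-phase selection above can be sketched as a simplified simulation; the struct fields, the helper name pick_target_rq, and the boolean standing in for a trylock are all illustrative assumptions:

```c
#include <assert.h>
#include <stdbool.h>

#define NCPU 4

/* Hypothetical simulation of MQS's two-phase candidate selection.
 * Field names and values are illustrative, not kernel code. */
struct rq {
    int  best_goodness;    /* phase 1: best local goodness (incl. affinity boosts) */
    int  max_na_goodness;  /* advertised best non-affinity goodness of waiters     */
    bool lock_available;   /* stand-in for a successful trylock on the remote rq   */
};

/* Phase 2: compare the local candidate against each remote runqueue's
 * advertised max_na_goodness; a runqueue whose lock cannot be taken is
 * skipped, falling back to the next best advertised value. */
int pick_target_rq(struct rq rqs[], int local)
{
    int best = local;
    int best_val = rqs[local].best_goodness;

    for (int cpu = 0; cpu < NCPU; cpu++) {
        if (cpu == local)
            continue;
        if (rqs[cpu].max_na_goodness > best_val && rqs[cpu].lock_available) {
            best = cpu;
            best_val = rqs[cpu].max_na_goodness;
        }
    }
    return best;   /* runqueue from which the next task is taken */
}
```

Skipping a contended runqueue is exactly where MQS can diverge from DSS: a lower-ranked task on that runqueue is never reconsidered in this pass.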

The reschedule_idle() function attempts to find a CPU for a task that has become runnable. It creates a list of candidate CPUs along with the na_goodness values of the tasks currently running on them. It then chooses a target CPU in much the same way as the schedule() routine, trying to acquire a runqueue lock and verifying that the recorded na_goodness value is still valid. Once a target CPU is determined, it moves the task denoted by its argument onto the target CPU's runqueue and sends an IPI to the target CPU to force a schedule(). reschedule_idle() maintains functional equivalence with DSS in other ways too: if a task's previous CPU is idle, it is chosen as the target, and amongst other idle CPUs, the one that has been idle the longest is chosen first.
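The target-CPU choice described above might be sketched as follows. The cpu_state fields and the choose_target_cpu helper are hypothetical, and the final fallback to the CPU running the task with the lowest na_goodness stands in for the candidate-list comparison:

```c
#include <assert.h>
#include <stdbool.h>

#define NCPU 4

/* Hypothetical per-CPU state consulted by a reschedule_idle()-style
 * target search; field names are illustrative. */
struct cpu_state {
    bool idle;
    long idle_since;     /* timestamp: lower means idle for longer */
    int  na_goodness;    /* na_goodness of the currently running task */
};

/* Prefer the task's previous CPU if idle; otherwise the CPU that has
 * been idle the longest; otherwise the CPU running the task with the
 * lowest na_goodness, i.e. the cheapest one to preempt. */
int choose_target_cpu(struct cpu_state cpus[], int prev_cpu)
{
    if (cpus[prev_cpu].idle)
        return prev_cpu;

    int target = -1;
    for (int c = 0; c < NCPU; c++)
        if (cpus[c].idle &&
            (target < 0 || cpus[c].idle_since < cpus[target].idle_since))
            target = c;
    if (target >= 0)
        return target;

    target = 0;
    for (int c = 1; c < NCPU; c++)
        if (cpus[c].na_goodness < cpus[target].na_goodness)
            target = c;
    return target;
}
```

In the real scheduler the chosen CPU's runqueue lock must still be taken and the recorded na_goodness revalidated before the task is actually moved.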

MQS's treatment of realtime tasks takes into account the conflicting requirements of efficient dispatch and the need to support Round Robin and FIFO scheduling policies. Like DSS, it keeps runnable realtime tasks on a separate global runqueue and processes them the same way.

The implementation of MQS avoids unnecessary cache misses and false sharing. Runqueue data is allocated in per-CPU, cache-aligned data structures. Implications of MQS for large SMP and NUMA systems are discussed in Section 4.
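A minimal sketch of the cache-alignment point, assuming a 64-byte cache line (the actual line size is architecture dependent, and the struct is illustrative):

```c
#include <assert.h>
#include <stddef.h>

#define NR_CPUS    4
#define CACHE_LINE 64   /* assumed cache line size */

/* Padding each per-CPU element out to a cache line keeps two CPUs'
 * runqueue fields from ever sharing a line, avoiding false sharing. */
struct aligned_rq {
    long nt_running;
    int  max_na_goodness;
} __attribute__((aligned(CACHE_LINE)));

struct aligned_rq per_cpu_rq[NR_CPUS];
```

With the alignment attribute, array indexing places each CPU's runqueue at the start of its own cache line, so updates by one CPU do not invalidate another CPU's line.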


2001-09-18