
Related Work

 

Experiences with DEC's 1994 California Election HTTP server reveal many of the problems of a conventional network subsystem architecture when used as a busy HTTP server [15]. Mogul [16] suggests that novel OS support may be required to satisfy the needs of busy servers.

Mogul and Ramakrishnan [17] devise and evaluate a set of techniques for improving the overload behavior of an interrupt-driven network architecture. These techniques avoid receiver livelock by temporarily disabling hardware interrupts and using polling under conditions of overload. Disabling interrupts limits the interrupt rate and causes early packet discard by the network interface. Polling is used to ensure progress by fairly allocating resources among receive and transmit processing, and multiple interfaces.
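
A minimal, self-contained C sketch of this idea follows; it is illustrative only, and all names, constants, and the simulated interface state are assumptions of this sketch rather than Mogul and Ramakrishnan's code. Under overload the receive interrupt handler masks further interrupts, and a polling loop services the interfaces round-robin with a fixed per-interface budget, re-enabling interrupts once an interface's backlog has drained.

    /* Illustrative sketch: switch from interrupt-driven receive processing
     * to polling under overload. Names and constants are assumptions. */
    #include <stdbool.h>
    #include <stdio.h>

    #define NUM_IFACES          2
    #define POLL_BUDGET         4    /* packets per interface per polling round */
    #define OVERLOAD_THRESHOLD  100  /* arrival rate that triggers polling mode */

    struct iface {
        int  pending;       /* packets queued in the (simulated) interface */
        bool intr_enabled;  /* receive interrupts currently enabled?       */
    };

    static struct iface ifaces[NUM_IFACES];

    static void process_packet(int i) {
        printf("processed one packet from interface %d\n", i);
    }

    /* Receive interrupt: under overload, mask further interrupts and let the
     * polling loop take over; the interface then discards excess packets
     * cheaply, instead of the host investing work that is later thrown away. */
    static void rx_interrupt(int i, int arrival_rate) {
        if (arrival_rate > OVERLOAD_THRESHOLD) {
            ifaces[i].intr_enabled = false;
            return;
        }
        while (ifaces[i].pending > 0) {
            ifaces[i].pending--;
            process_packet(i);
        }
    }

    /* Polling loop: a fixed per-interface budget bounds receive work, so one
     * busy interface cannot starve the others (or transmit processing);
     * interrupts are re-enabled once an interface's backlog has drained. */
    static void poll_interfaces(void) {
        for (int i = 0; i < NUM_IFACES; i++) {
            int budget = POLL_BUDGET;
            while (budget-- > 0 && ifaces[i].pending > 0) {
                ifaces[i].pending--;
                process_packet(i);
            }
            if (ifaces[i].pending == 0)
                ifaces[i].intr_enabled = true;
        }
    }

    int main(void) {
        ifaces[0] = (struct iface){ .pending = 6, .intr_enabled = true };
        ifaces[1] = (struct iface){ .pending = 3, .intr_enabled = true };
        rx_interrupt(0, 150);   /* overload: interrupts masked, no receive work */
        poll_interfaces();      /* bounded, round-robin progress                */
        poll_interfaces();
        return 0;
    }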

The overload stability of their system appears to be comparable to that of NI-LRP, and it has an advantage over SOFT-LRP in that it eliminates--rather than postpones--livelock. On the other hand, their system does not achieve traffic separation, and therefore drops packets irrespective of their destination during periods of overload. Their system does not attempt to charge resources spent in network processing to the receiving application, and it does not attempt to reduce context switching by processing packets lazily. A direct quantitative comparison between LRP and their system is difficult, because of differing hardware/software environments and benchmarks.

Many researchers have noted the importance of early demultiplexing to high-performance networking. Demultiplexing immediately at the network interface is necessary for maintaining network quality of service (QoS) [22]; it enables user-level implementations of network subsystems [2, 7, 11, 21, 23]; it facilitates copy avoidance by allowing smart placement of data in main memory [1, 2, 5, 6]; and it allows proper resource accounting in the network subsystem [14, 19]. This paper argues that early demultiplexing also facilitates fairness and stability of network subsystems under conditions of overload. LRP uses early demultiplexing as a key component of its architecture.

Packet filters [12, 18, 25] are mechanisms that implement early demultiplexing without sacrificing layering and modularity in the network subsystem. In the most recent incarnations of packet filters, dynamic code generation is used to eliminate the overhead of the earlier interpreted versions [8].
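
As an illustration of what such a filter computes, the following C function is a hedged sketch of a demultiplexing predicate: does a frame carry a UDP packet addressed to a given destination port? The offsets assume a plain Ethernet/IPv4/UDP frame and the function names are ours; real packet filters express this predicate in a filter language interpreted by the kernel or, with dynamic code generation, compile it to native code.

    /* Hedged sketch of a demultiplexing predicate. Offsets assume an
     * untagged Ethernet/IPv4/UDP frame; names are illustrative. */
    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    static uint16_t read_be16(const uint8_t *p) {
        return (uint16_t)((p[0] << 8) | p[1]);
    }

    bool match_udp_dest_port(const uint8_t *frame, size_t len, uint16_t port) {
        if (len < 14 + 20 + 8)                    /* Ethernet + min IP + UDP */
            return false;
        if (read_be16(frame + 12) != 0x0800)      /* EtherType: IPv4         */
            return false;
        if (frame[23] != 17)                      /* IP protocol: UDP        */
            return false;
        size_t ip_hlen = (size_t)(frame[14] & 0x0f) * 4;
        if (ip_hlen < 20 || len < 14 + ip_hlen + 8)
            return false;
        return read_be16(frame + 14 + ip_hlen + 2) == port;  /* UDP dst port */
    }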

Architecturally, the design of LRP is related to that of user-level network subsystems. Unlike LRP, however, these prior works aim mainly to achieve low communication latency and high bandwidth by removing protection boundaries from the critical send/receive path, and/or by enabling application-specific customization of protocol services. To the best of our knowledge, the behavior of user-level network subsystems under overload has not been studied.

U-Net [1] and Application Device Channels (ADC) [4, 5] share with NI-LRP the approach of using the network interface to demultiplex incoming packets and placing them on queues associated with communication endpoints. With U-Net and ADCs, the endpoint queues are mapped into the address space of application processes. More conventional user-level networking subsystems [7, 11, 23] share with SOFT-LRP the early demultiplexing of incoming packets by the OS kernel (software). Demultiplexed packets are then handed to the appropriate application process using an upcall. In all user-level network subsystems, protocol processing is performed by user-level threads. Therefore, network processing resources are charged to the application process and scheduled at application priority.
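
The shared demultiplexing step can be sketched as follows; the data structures and names are assumptions of this illustration, not U-Net or ADC code. The demultiplexer (the network interface in U-Net, ADCs, and NI-LRP; the kernel in SOFT-LRP and the kernel-demultiplexed user-level stacks) appends each incoming packet to the queue of the endpoint it is destined for, and discards packets that match no endpoint or whose queue is full before any protocol processing has been spent on them.

    /* Illustrative demultiplexing onto per-endpoint queues (names and
     * structures are assumptions of this sketch). */
    #include <stdint.h>
    #include <stdio.h>

    #define NENDPOINTS 2
    #define QLEN       4

    struct endpoint {
        uint16_t port;         /* demultiplexing key used in this sketch      */
        int      queue[QLEN];  /* per-endpoint receive queue (in U-Net/ADCs,  */
        int      count;        /* mapped into the owning application's space) */
    };

    static struct endpoint endpoints[NENDPOINTS] = {
        { .port = 80 }, { .port = 53 }
    };

    /* Steer a packet to its endpoint; packets with no matching endpoint, or
     * whose endpoint queue is full, are discarded here, before any protocol
     * processing has been invested in them. */
    static void demux(uint16_t dest_port, int pkt_id) {
        for (int i = 0; i < NENDPOINTS; i++) {
            if (endpoints[i].port == dest_port) {
                if (endpoints[i].count < QLEN)
                    endpoints[i].queue[endpoints[i].count++] = pkt_id;
                return;   /* a drop is charged to this endpoint only */
            }
        }
        /* no matching endpoint: early discard */
    }

    int main(void) {
        demux(80, 1);
        demux(53, 2);
        demux(80, 3);
        for (int i = 0; i < NENDPOINTS; i++)
            printf("endpoint on port %d holds %d packet(s)\n",
                   endpoints[i].port, endpoints[i].count);
        return 0;
    }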

Based on the combination of early demultiplexing and protocol processing by user-level threads, user-level network subsystems can in principle be expected to display improved overload stability. Since user-level threads are normally scheduled at regular priority, competing with other user and kernel threads, protocol processing cannot starve other applications as it can in BSD. A user-level network subsystem's resilience to livelock then depends on the overhead of packet demultiplexing on the host. When demultiplexing and packet discard are performed by the NI, as in [1, 5], the system should be free of livelock. When these tasks are performed by the OS kernel, as in [7, 11, 23], the input rate at which the system experiences livelock depends on the overhead of packet demultiplexing (as in SOFT-LRP). Since the systems described in the literature use interpreted packet filters for demultiplexing, this overhead is likely to be high, and livelock protection correspondingly poor. User-level network subsystems share with LRP the improved fairness in allocating CPU resources, because protocol processing occurs in the context of the receiver process.

User-level network subsystems allow applications to use application-specific protocols on top of the raw network interface. The performance (i.e., latency, throughput) of such protocols under overload depends strongly on their implementation's processing model. LRP's technique of delaying packet processing until the application requests the associated data can be applied to such protocols. The following discussion is restricted to user-level implementations of TCP/IP.

The user-level implementations of TCP/IP described in the literature share with the original BSD architecture the eager processing model. That is, a dedicated user thread (which plays the role of the BSD software interrupt) is scheduled as soon as a packet arrives, regardless of whether or not the application is waiting for the packet. As in BSD, this eager processing can lead to additional context switching, when compared to LRP.
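
The following simplified, self-contained C sketch contrasts the two models; it is illustrative only and is not code from any of the cited systems. In the eager model, protocol processing runs as soon as a packet is demultiplexed, whether or not the application is waiting; in LRP's lazy model, the raw packet is merely queued at the endpoint, and protocol processing is deferred to the application's receive call, where it is charged to the receiver and runs at the receiver's priority.

    /* Illustrative contrast of eager and lazy (LRP) processing models;
     * all names and structures are assumptions of this sketch. */
    #include <stdio.h>

    #define QLEN 8

    struct queue { int buf[QLEN]; int head, tail; };  /* tiny FIFO, < QLEN items */

    static void enqueue(struct queue *q, int pkt) { q->buf[q->tail++ % QLEN] = pkt; }
    static int  dequeue(struct queue *q)          { return q->buf[q->head++ % QLEN]; }
    static int  is_empty(const struct queue *q)   { return q->head == q->tail; }

    struct endpoint {
        struct queue raw;    /* demultiplexed but unprocessed packets  */
        struct queue ready;  /* protocol-processed data, ready for app */
    };

    static void protocol_process(int pkt) { printf("protocol processing, packet %d\n", pkt); }

    /* Eager model: runs at arrival time (in BSD, a software interrupt; in the
     * user-level stacks, a dedicated library thread), whether or not the
     * application is waiting for the data. */
    static void eager_on_arrival(struct endpoint *ep, int pkt) {
        protocol_process(pkt);
        enqueue(&ep->ready, pkt);
    }

    /* Lazy model (LRP): arrival only queues the raw packet ... */
    static void lazy_on_arrival(struct endpoint *ep, int pkt) {
        enqueue(&ep->raw, pkt);
    }

    /* ... and protocol processing happens inside the receive call, charged to
     * the application and scheduled at its priority. (A real implementation
     * would block here when both queues are empty.) */
    static int lazy_receive(struct endpoint *ep) {
        while (is_empty(&ep->ready)) {
            int pkt = dequeue(&ep->raw);
            protocol_process(pkt);
            enqueue(&ep->ready, pkt);
        }
        return dequeue(&ep->ready);
    }

    int main(void) {
        struct endpoint eager = {0}, lazy = {0};
        for (int pkt = 1; pkt <= 3; pkt++) {
            eager_on_arrival(&eager, pkt);  /* work done immediately */
            lazy_on_arrival(&lazy, pkt);    /* work deferred         */
        }
        printf("application now calls receive on the lazy endpoint:\n");
        printf("application received packet %d\n", lazy_receive(&lazy));
        return 0;
    }

When the application is already blocked in a receive call, the lazy model saves the context switch to a separate protocol thread; when it is not, it avoids preempting other work in order to process data that has not yet been requested.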

In these implementations, the single shared IP queue of BSD is replaced with a per-application IP queue that is shared only among the sockets of a single application. As a result, the system ensures traffic separation between different applications, but not necessarily between different sockets within a single application. Depending on the thread scheduling policy and the relative priority of the dedicated protocol processing thread(s) and the application thread(s), incoming traffic can cause an application process to enter a livelock state, where the network library thread consumes all CPU resources allocated to the application, leaving no CPU time for the application threads. Traffic separation and livelock protection within an application process are important, for instance, in single-process HTTP servers.

Finally, UNIX-based user-level TCP/IP implementations revert to conventional network processing under certain conditions (e.g., whenever a socket is shared among multiple processes). In this case, the system's overload behavior is similar to that of a standard BSD system.

In summary, we expect that user-level network implementations--while designed with different goals in mind--share some but not all of LRP's benefits with respect to overload. This paper identifies and evaluates techniques for stability, fairness, and performance under overload, independent of the placement of the network subsystem (application process, network server, or kernel). We fully expect that LRP's design principles can be applied to improve the overload behavior of kernelized, server-based, and user-level implementations of network subsystems.

Livelock and other negative effects of BSD's interrupt-driven network processing model can be viewed as an instance of a priority inversion problem. The real-time OS community has developed techniques for avoiding priority inversion in communication systems in order to provide quality of service guarantees for real-time data streams [9, 10]. RT-Mach's network subsystem [10], which is based on the Mach user-level network implementation [11], performs early demultiplexing, and then hands incoming packets for processing to a real-time thread with a priority and resource reservation appropriate for the packet's stream. Like LRP, the system employs early demultiplexing, schedules protocol processing at a priority appropriate to the data's receiver, and charges resources to the receiver. Unlike LRP, it does not attempt to delay protocol processing until the data is requested by the application. Moreover, the overhead of the Mach packet filter is likely to make RT-Mach vulnerable to overload. We fully expect that LRP, when combined with real-time thread scheduling, is applicable to real-time networking, without requiring user-level protocols.

