Scalability of Linux Event-Dispatch Mechanisms
Abstract

Many Internet servers these days have to handle not just heavy request loads, but also increasingly large numbers of concurrent connections. In this paper, we discuss some of the event-dispatch mechanisms used by Internet servers to handle the network I/O generated by these request loads. We focus on the mechanisms supported by the Linux kernel, and measure their performance in terms of their dispatch overhead and dispatch throughput. Our comparative studies show that POSIX.4 Real Time signals (RT signals) are a highly efficient mechanism in terms of overhead and also provide good throughput compared to mechanisms like select() and /dev/poll. We also look at some limitations of RT signals and propose an enhancement to the default RT signal implementation, which we call signal-per-fd. This enhancement significantly reduces the complexity of a server implementation, increases its robustness under high load, and can also potentially increase its throughput. In addition, our results show that, contrary to conventional wisdom, even a select() based server can provide high throughput despite its high dispatch overhead, provided that this overhead is amortized by performing more useful work per select() call.
1 Introduction

The fast growth of the Web and e-commerce has led to a large increase in Internet traffic. Most network applications such as Web servers and proxies have to handle heavy loads from clients spread all across the globe. In addition to high request rates, servers also have to handle a large number of concurrent connections, many of which are idle most of the time. This is because connection times are long due to (i) the ``last-mile problem'' [3], which has the effect that most clients connect to the Internet through slow modems, and (ii) the geographically distributed nature of the Internet, which causes much of the traffic to travel across many hops, increasing both latency and the probability of packet drops due to congestion. For Web servers, the problem of long connections is exacerbated by the HTTP/1.1 protocol [5], which provides for persistent TCP connections that can be reused to handle multiple interactions with the server. These persistent connections further add to the length of the connection times. The bottom line is that servers need to service the high incoming request load, while simultaneously handling a large number of concurrent connections efficiently.

To handle these demands, many high-performance Web servers are structured as event-handling applications [9,16,18]. These servers employ event-dispatch mechanisms provided by the underlying operating system to handle the network I/O on multiple concurrent connections. Some studies have looked at the scalability issues of such mechanisms and found that traditional dispatch mechanisms are not very scalable [1]. While the performance of Web servers clearly is important, we should not forget that there are many other Internet services, such as ftp servers, proxy caches, and mail servers, that have to deal with similar scalability concerns. For example, poor scalability is one of the primary reasons the number of concurrent connections on many ftp servers is limited to a small number (around 30-50).

Another approach to building Internet servers that can handle high request loads and large numbers of concurrent connections is to move the entire application into kernel space. Recent efforts in this direction have produced dramatic results for Web servers (e.g., TUX [17]). However, this does not obviate the need for efficient event-dispatch mechanisms. In fact, it is our contention that due to security and robustness concerns, many server sites are likely to prefer running Internet servers in user space, provided that they can achieve performance that is comparable to a kernel-space solution. Efficient event-dispatch mechanisms are also essential for those applications that may be important for some sites (e.g., ftp), but perhaps not quite important enough to warrant the effort of developing an OS-specific kernel solution.

In this paper, we look at the different Linux event-dispatch mechanisms used by servers for doing network I/O. We try to identify the potential bottlenecks in each case, with an emphasis on the scalability of each mechanism and its performance under high load. We use two metrics to determine the efficiency of each mechanism, namely, the event-dispatch overhead and the dispatch throughput. The mechanisms we study in particular are the select() system call, the /dev/poll interface, and POSIX.4 Real Time signals (RT signals), each of which is described in more detail in the following sections.
Our studies show that RT signals are an efficient and scalable mechanism for handling high loads, but have some potential limitations. We propose an enhancement to the kernel implementation of RT signals that overcomes some of these drawbacks and allows for robust performance even under high load. We also measure the performance of a variant of the select() based server which amortizes the cost of each select() call, and show that it is more scalable in terms of server throughput.

The rest of the paper is organized as follows. In Section 2, we describe the primary event-dispatch mechanisms supported by the Linux kernel, and discuss some of the previous work in this regard. In Section 3, we compare some of these mechanisms in terms of their dispatch overhead. We discuss RT signals in more detail, identify their limitations, and propose an enhancement to the default implementation of RT signals in the Linux kernel. In Section 4, we present a comparative study of some of the mechanisms from the perspective of throughput achieved under high loads. Finally, we present our conclusions in Section 5.

2 Event-Dispatch Mechanisms

In this section, we first discuss the two main schemes employed by servers for handling multiple connections. Next, we look at the various event-dispatch mechanisms supported by the Linux kernel that can be employed by Web servers for doing network I/O. We follow this up with a discussion of previous work that has focussed on the scalability of some of these mechanisms, including other mechanisms that have been proposed to overcome some of their drawbacks.

2.1 Handling Multiple Connections

There are two main methodologies that could be adopted by servers for performing network I/O on multiple concurrent connections: (i) dedicating a separate process or thread to each connection, so that blocking I/O can be used on each connection independently, and (ii) multiplexing all connections within a single thread of control, using non-blocking I/O together with an event-dispatch mechanism that tells the server which connections are ready for I/O. The servers studied in this paper follow the second, event-driven approach; a minimal sketch of such an event loop, built around select(), follows.
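To make the event-driven approach concrete, here is a minimal, self-contained sketch of a select() based event loop. This is illustrative code only, not the experimental servers described in Section 3; handle_client() stands in for whatever request processing the server performs.

    /* Minimal sketch of a select() based event loop (illustrative only). */
    #include <sys/select.h>
    #include <sys/socket.h>
    #include <unistd.h>

    extern int  listen_fd;               /* assumed: a bound, listening socket        */
    extern void handle_client(int fd);   /* assumed: reads a request, writes a reply  */

    void event_loop(void)
    {
        fd_set interest;                  /* descriptors we are interested in         */
        int maxfd = listen_fd;

        FD_ZERO(&interest);
        FD_SET(listen_fd, &interest);

        for (;;) {
            fd_set readable = interest;   /* select() overwrites its argument         */
            if (select(maxfd + 1, &readable, NULL, NULL, NULL) <= 0)
                continue;

            if (FD_ISSET(listen_fd, &readable)) {
                int conn = accept(listen_fd, NULL, NULL);
                if (conn >= 0) {
                    FD_SET(conn, &interest);
                    if (conn > maxfd)
                        maxfd = conn;
                }
            }
            for (int fd = 0; fd <= maxfd; fd++)
                if (fd != listen_fd && FD_ISSET(fd, &readable))
                    handle_client(fd);    /* on EOF the handler closes and FD_CLRs fd */
        }
    }

Note that the entire interest set is copied and passed to the kernel on every iteration, and the server must scan all descriptors up to maxfd after each call; as discussed below, both costs grow with the number of open connections.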
2.2 Linux Kernel Mechanisms

As described above, event-based servers employ event-dispatch mechanisms provided by the underlying operating system to perform network I/O. In this section, we describe the mechanisms that the Linux kernel provides for event notification to such applications: the select() and poll() system calls, the /dev/poll interface (a recent addition to Linux [10]), and POSIX.4 Real Time signals (RT signals). With RT signals, the application uses fcntl() to ask the kernel to enqueue a queued real-time signal, carrying the descriptor number, whenever an I/O event occurs on a connection; the application then dequeues these notifications, typically with sigwaitinfo().
One problem with RT signals is that the signal queue is finite, and
hence, once the signal queue overflows, a server using RT signals has to
fall back on a different dispatch mechanism (such as select()
or poll()). Also, sigwaitinfo() allows the application
to dequeue only one signal at a time. We discuss these problems in more detail in Section 3.3.
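As a point of reference, the following is a rough sketch of how a server typically uses this mechanism on Linux, combining fcntl() based setup with a sigwaitinfo() dequeue loop. The choice of signal number and the helper names are illustrative assumptions, not code from any particular server; recover_from_overflow() stands for the fallback path discussed in Section 3.3.

    /* Sketch of RT-signal based event dispatch on Linux (illustrative only). */
    #define _GNU_SOURCE                   /* F_SETSIG is Linux-specific               */
    #include <fcntl.h>
    #include <signal.h>
    #include <unistd.h>

    #define IO_SIG (SIGRTMIN + 1)         /* any real-time signal may be chosen       */

    extern void handle_client(int fd);            /* assumed request handler          */
    extern void recover_from_overflow(void);      /* fallback path, see Section 3.3   */

    /* Arm one descriptor: the kernel will queue IO_SIG with si_fd set to fd. */
    static void arm_fd(int fd)
    {
        int flags = fcntl(fd, F_GETFL);
        fcntl(fd, F_SETOWN, getpid());                     /* deliver signals to us   */
        fcntl(fd, F_SETSIG, IO_SIG);                       /* use a queued RT signal  */
        fcntl(fd, F_SETFL, flags | O_NONBLOCK | O_ASYNC);  /* enable async notify     */
    }

    static void signal_loop(void)
    {
        sigset_t sigs;
        sigemptyset(&sigs);
        sigaddset(&sigs, IO_SIG);
        sigaddset(&sigs, SIGIO);               /* SIGIO signals a queue overflow      */
        sigprocmask(SIG_BLOCK, &sigs, NULL);   /* keep them queued until we ask       */

        for (;;) {
            siginfo_t info;
            int sig = sigwaitinfo(&sigs, &info);   /* dequeues exactly one signal     */
            if (sig == IO_SIG)
                handle_client(info.si_fd);         /* si_fd identifies the connection */
            else if (sig == SIGIO)
                recover_from_overflow();           /* queue overflowed: fall back     */
        }
    }

Because sigwaitinfo() returns a single siginfo_t, the server pays one system call per event, which is one of the motivations for the sigtimedwait4() proposal discussed below.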
2.3 Previous Work

Banga et al. [1] have studied the limitations of a select() based server on DEC UNIX, and shown the problems with its scalability, some of which we have discussed above. They have proposed a new API in [2], which allows an application to specify its interest set incrementally to the kernel and supports event notifications on descriptors instead of state notifications (as in the case of select() and poll()). The system calls proposed as part of this API are declare_interest(), which allows an application to declare its interest in a particular descriptor, and get_next_event(), which is used to get the next pending event(s) from the kernel.

Another event-dispatch mechanism is the /dev/poll interface, which is supported by the Solaris 8 kernel [14]. This interface is an optimization of the poll() system call. Recently, Provos et al. [10] have implemented the /dev/poll interface in the Linux kernel. This interface works as follows. The application first does an open() on the /dev/poll device, which creates a new interest set for the application. From this point onwards, the application can add a new socket to this interest set incrementally by creating a pollfd struct and writing it to the /dev/poll device. Finally, the polling is done by using an ioctl() call, which returns a list of pollfd structs corresponding to the set of ready descriptors. Further, the overhead of user-kernel copies can be reduced by using mmap() to map the array of pollfd structs onto the /dev/poll device. In [10], the /dev/poll interface is shown to be an improvement over the traditional poll() implementation, especially as it reduces the cost of specifying the interest set to the kernel. Hence, in our experiments, we have used /dev/poll instead of poll() for comparison with the other dispatch mechanisms. A usage sketch of this interface is shown below.

RT signals have been used for network I/O in the phhttpd [4] Web server. Provos et al. have discussed its implementation and some of its shortcomings, such as the potential for signal queue overflow and the fact that the sigwaitinfo() system call fetches only one signal at a time. They have proposed a new system call, sigtimedwait4(), which allows the server to dequeue multiple signals from the signal queue [11].
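For concreteness, here is a minimal sketch of this usage pattern. It follows the Solaris-style API (struct dvpoll and the DP_POLL ioctl declared in <sys/devpoll.h>); the exact header and constant names used by the Linux patch are an assumption here, and the mmap() optimization is omitted.

    /* Sketch of the /dev/poll usage pattern, following the Solaris-style API
     * (struct dvpoll and DP_POLL from <sys/devpoll.h>); header and constant
     * names for the Linux patch are an assumption, and mmap() is omitted.   */
    #include <sys/devpoll.h>
    #include <sys/ioctl.h>
    #include <poll.h>
    #include <fcntl.h>
    #include <unistd.h>

    #define MAX_READY 1024

    extern void handle_client(int fd);    /* assumed request handler                  */

    static int dp_fd;                     /* handle to the /dev/poll device           */

    static void dp_init(void)
    {
        dp_fd = open("/dev/poll", O_RDWR);          /* creates a new interest set     */
    }

    /* Add one descriptor to the kernel-resident interest set (incremental update). */
    static void dp_add(int fd)
    {
        struct pollfd pfd = { .fd = fd, .events = POLLIN, .revents = 0 };
        write(dp_fd, &pfd, sizeof(pfd));
    }

    static void dp_loop(void)
    {
        struct pollfd ready[MAX_READY];

        for (;;) {
            struct dvpoll dvp = { .dp_fds = ready, .dp_nfds = MAX_READY, .dp_timeout = -1 };
            int n = ioctl(dp_fd, DP_POLL, &dvp);    /* returns only ready descriptors */
            for (int i = 0; i < n; i++)
                handle_client(ready[i].fd);
        }
    }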
3 Dispatch Overhead

In this section, we look at the first scalability parameter for event-dispatch mechanisms, namely the overhead involved in handling requests as a function of the number of concurrent connections. This parameter becomes important in the context of a large number of idle or slow connections, irrespective of the actual active load on the server. In what follows, we first present an experimental study of some of the Linux dispatch mechanisms, and then discuss some of the insights from this study. We follow this up with a detailed discussion of RT signal behavior, including their limitations. We then propose an enhancement to the implementation of RT signals in the Linux kernel to overcome some of these limitations.

3.1 Comparative Study

In this section, we present the results of our comparative study of some of the kernel mechanisms discussed above. The main goal of this study is to look at the behavior of Web servers under high load in terms of their CPU usage as the number of concurrent connections (most of them idle) increases.

3.1.1 Experimental Testbed

To conduct the experimental study, we implemented a set of micro Web servers (µservers), each employing a different event-dispatch mechanism. Most of the request handling and administrative code in these µservers is identical, to avoid differences in performance arising from other factors. This ensures that the different versions are as similar as possible. Also, using the µservers instead of widely-used, full-fledged Web servers allows us to focus on the performance impact of the dispatch mechanisms by reducing all other overheads to the absolute minimum. Moreover, existing Web servers have an underlying event-handling architecture (such as process-per-connection for Apache), which may not be suitable for the purpose of our study. Thus, the µservers do very simple HTTP protocol processing, and the various µservers differ only in their use of the event-dispatch mechanism. Specifically, we compared µservers employing select(), /dev/poll and RT signals as their event-dispatch mechanisms. While this approach of using the µservers does not answer the question of how important the event-dispatch costs are as part of the overall server overhead for commercial Web servers, it does help in determining the limit on the scalability of such servers. This is because, even if the dispatch overhead is tiny with a small number of connections, non-linear scaling behavior could magnify this overhead with an increasing number of connections until it eventually becomes the first-order bottleneck.

Each of these µservers was run on a 400 MHz Pentium-III based dual-processor HP NetServer LPr machine running Linux 2.4.0-test7 in uniprocessor mode. The client load was generated by running httperf [8] on ten B180 PA-RISC machines running HP-UX 11.0. The clients and the server were connected via a 100 Mbps Fast Ethernet switch. To simulate a large number of concurrent and idle connections, each httperf was used to establish a set of persistent connections, each of which generated periodic requests to the server. The effect was that at all times, some of the connections were active while the rest were idle, and these active and idle connection sets kept changing with time. Thus, in these experiments, the connection rate was different from the request rate (with each connection generating multiple requests). The server's reply size was 92 bytes. In each experiment, the total request rate was kept constant, while the number of concurrent connections was varied to see the effect of a large number of idle connections on server performance. To measure the CPU usage of the server, we inserted an idle_counter in the kernel running on the server machine; this counter counts the idle cycles on the CPU. We computed the CPU load imposed by the server by comparing the idle cycles for the system with and without the server running on it. The server reply rate and response times were measured by the httperf clients.

3.1.2 Experimental Results

As part of our comparative study, we ran experiments to measure the performance of three µservers based on select(), /dev/poll and RT signals respectively. In each experiment, the clients were used to generate a fixed request rate, and the number of concurrent connections was increased from 250 to 3000. Figure 3 shows the reply rates achieved by the servers for request rates of 500 req/s and 1000 req/s respectively. As can be seen from the figure, the reply rate matches the request rate for the RT signal and select() based servers at all points. On the other hand, the reply rate starts dropping off for the /dev/poll based server after a point. This is because the server becomes overloaded and starts dropping connections beyond a certain load.
We cannot explain why the overload behavior of /dev/poll is so poor. The more interesting figures are Figures 4 and 5, which show the CPU usage and the average response time, respectively, for each of the servers as the number of concurrent connections is increased. As can be seen from Figure 4, the CPU usage for both select() and /dev/poll increases with the number of concurrent connections, and they become saturated after a certain point. On the other hand, the CPU usage for RT signals is insensitive to the number of idle connections. The RT signal based server's CPU usage is about 6.67% on average for the 500 req/s case, while it is about 13.25% for the 1000 req/s case. Thus, the CPU overhead of RT signals seems to be dependent only on the request rate. Also, the RT signal CPU usage is dramatically lower than that of either the select() or /dev/poll based server.

A similar behavior is seen for the response time in Figure 5. Once again, the response time increases for both the select() and /dev/poll based servers with the number of connections. On the other hand, the RT signal based server shows a very small response time for each of the request rates (about 0.3 ms in each case). Further, this response time is independent of the number of concurrent connections. Note that, even though the absolute value of the response times in the graphs may not seem significant from the perspective of an end user, it is the shape of these graphs that is significant, as these curves reflect the scalability of the dispatch mechanisms.

Thus, the results in this section show that RT signals have very small dispatch overhead and also that this overhead does not depend on the number of concurrent or idle connections being handled by the server. Rather, it is determined only by the active work being done by the server.

3.2 RT Signals: Reasons for Efficiency

From our comparative study, we observe that RT signals have a relatively low overhead compared to the select() and /dev/poll event-dispatch mechanisms. Further, this overhead seems to be independent of the number of idle connections, and depends only on the active request rate. In other words, RT signals show essentially ideal behavior. In this section, we discuss the reasons for the better performance of RT signals in more detail. RT signals are more efficient mainly because (i) the interest set is maintained incrementally inside the kernel (a descriptor is armed once, when the connection is set up), so the server does not have to pass its entire set of descriptors to the kernel on every dispatch call, (ii) notification is event-based rather than state-based, so work is performed only when I/O actually occurs, instead of scanning all open connections, and (iii) each dequeued signal directly identifies the affected descriptor, so no post-call scan of the descriptor set is needed.
3.3 Limitations of RT Signals

In spite of their efficiency, RT signals, as currently implemented in Linux, have some potential limitations. These limitations arise from the fact that the signal queue is a limited resource. Since each event results in a signal being appended to the signal queue, a few active connections could dominate the signal queue usage or even trigger an overflow. The former could result in unfair service and the latter could cause a deadlock-like situation in which the server can no longer make any progress, and appears to be suspended or hung.

To understand how a signal queue overflow can lead to a ``hung'' server, note that once the queue is full, no further signals can be enqueued and hence all future events are dropped. Of course, eventually the server would drain the queue and new events would start to come in again. However, those events that got dropped are lost forever. Further, notice that the signal queue is decoupled from the TCP buffers and there is no feedback mechanism between the two. Thus, even after the signal queue fills up and starts losing signals, there is nothing to throttle the TCP traffic. Consequently, even though events are occurring on the open connections and the listening socket, the server loses the notifications corresponding to these events. In other words, there is a ``notification leak'' at the server. If one of the lost events happened to indicate, for example, that the listen queue has pending connections, the server may never realize that it ought to call accept() to service those connections. Similarly, if an event got dropped that indicated that a particular connection is now readable, the server may never realize that it should call read() on that connection. Over time, the more events are dropped, the more likely it becomes that either some connections end up in a suspended state or that the listening socket is no longer serviced. In either case, throughput will suffer and eventually drop to zero.

To avoid this kind of suspended state, the Linux kernel sends a SIGIO signal to the application when the signal queue overflows. At this point, the application can recover from the overflow by falling back onto some other event-dispatch mechanism. For example, the application could use select() or poll() to detect any events that may have been dropped from the signal queue. Unfortunately, using a fallback mechanism comes with its own set of problems. Specifically, there are two issues: (i) cost -- the fallback mechanism (select() or poll()) is exactly the kind of expensive, state-based mechanism whose overhead grows with the number of open connections, and overflows are most likely to occur under high load, which is precisely when this extra overhead can be least afforded; and (ii) complexity -- the server must implement and maintain two dispatch mechanisms, keep its interest set available to the fallback mechanism, and correctly handle the transition between the two (draining the stale signal queue and rescanning all descriptors).
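To illustrate what such a fallback involves, here is one possible shape of the recover_from_overflow() routine from the earlier sketch. This is a hedged sketch of the general approach, not code from any particular server; it assumes the server also maintains its interest set in user space, in a select() style fd_set, purely so that the fallback has something to scan.

    /* One possible recovery path after a signal-queue overflow (illustrative only). */
    #include <signal.h>
    #include <sys/select.h>
    #include <time.h>

    extern sigset_t io_sigs;              /* the blocked RT signal set (plus SIGIO)   */
    extern fd_set   interest;             /* interest set, duplicated in user space   */
    extern int      maxfd;
    extern void     handle_client(int fd);

    void recover_from_overflow(void)
    {
        siginfo_t info;
        struct timespec zero = { 0, 0 };

        /* 1. Drain whatever is still sitting in the (now stale) signal queue. */
        while (sigtimedwait(&io_sigs, &info, &zero) > 0)
            ;                             /* discard: everything is rescanned below   */

        /* 2. Fall back to a state-based mechanism to find the lost events. */
        struct timeval poll_now = { 0, 0 };
        fd_set readable = interest;
        if (select(maxfd + 1, &readable, NULL, NULL, &poll_now) > 0)
            for (int fd = 0; fd <= maxfd; fd++)
                if (FD_ISSET(fd, &readable))
                    handle_client(fd);    /* includes accept()ing pending connections */
    }

The need to maintain this duplicate bookkeeping, and to run an expensive select() scan at the worst possible time, is precisely the cost and complexity referred to above.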
Thus, using RT signals as implemented in the kernel has some potential drawbacks even if they are used in conjunction with another mechanism.

3.4 Signal-per-fd: RT Signal Enhancement

As discussed above, having to handle a signal queue overflow can be both costly and complex for an application. It would be desirable, therefore, if signal queue overflows could be avoided altogether. To understand why signal queue overflows happen in the first place, note that multiple events can be generated for each connection, and hence multiple signals can be enqueued for each descriptor. But, most of the time, the application does not need to receive multiple events for the same descriptor. This is because even when an application picks up a signal corresponding to an event, it still needs to check the status of the descriptor for its current state, as the signal might have been enqueued well before the application picks it up. In the meantime, it is possible that there might have been other events and the status of the descriptor might have changed. For instance, the application might pick up a signal corresponding to a read event on a descriptor after the descriptor was closed, so that the application would have to decide what to do with the event in this case. Thus, it might be more efficient and useful if the kernel could coalesce multiple events and present them as a single notification to the application. The application could then check the status of the descriptor and figure out what needs to be done accordingly.

To understand what kind of events can be coalesced together, we have to understand what kind of information the events are supplying to the application. In general, events are coalescable under two scenarios: (i) when multiple events of the same kind occur on a descriptor (for example, several arrivals of data on the same connection), a single notification conveys the same information, since the application discovers how much data is available only when it examines the descriptor; and (ii) when later events supersede earlier ones (for example, a read event followed by the close of the connection), the application in any case has to act on the current state of the descriptor rather than on the individual events.
To efficiently check for the existence of a signal corresponding to a descriptor, we maintain a bitmap per process. In this bitmap, each bit corresponds to a file descriptor, and the bit is set whenever there is an enqueued signal for the corresponding descriptor. Note that checking the bit corresponding to a descriptor obviates the need to scan the signal queue for a signal corresponding to that descriptor, and hence this check can be done in constant time. The bit is set whenever the kernel enqueues a new signal for the descriptor, and it is cleared whenever the application dequeues the signal. By ensuring that at most one signal is queued for the application per descriptor, the kernel coalesces multiple events for a connection into a single notification, and the application then checks the status of the corresponding descriptor to determine the action to be taken. Thus, if the size of the signal queue (and hence the bitmap) is as large as the file descriptor set size, we can ensure that there will never be a signal queue overflow. A simplified sketch of this bookkeeping appears below. This enhancement has the following advantages: (i) the signal queue can no longer overflow, so the server does not need a fallback dispatch mechanism, which significantly simplifies its implementation; (ii) the server remains robust under high load, since a burst of activity on a few connections can no longer crowd out notifications for other connections or for the listening socket; and (iii) the server can potentially achieve higher throughput, since it dequeues at most one signal per descriptor instead of processing a backlog of stale events.
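The following self-contained sketch illustrates this per-descriptor bookkeeping in ordinary C. It is a simplification of the idea for exposition, not the actual kernel patch; enqueue_signal_for_fd() is a placeholder for the kernel's signal-queuing code.

    /* Simplified sketch of the signal-per-fd bookkeeping (not the actual patch). */
    #include <limits.h>

    #define MAX_FDS       1024                          /* size of the fd table      */
    #define BITS_PER_WORD (CHAR_BIT * sizeof(unsigned long))

    static unsigned long pending[MAX_FDS / BITS_PER_WORD + 1];  /* one bit per fd    */

    extern void enqueue_signal_for_fd(int fd);   /* placeholder: queue one RT signal */

    /* Called for every I/O event on a descriptor. */
    void on_io_event(int fd)
    {
        unsigned long *word = &pending[fd / BITS_PER_WORD];
        unsigned long  mask = 1UL << (fd % BITS_PER_WORD);

        if (*word & mask)
            return;                      /* a notification for fd is already queued:
                                            coalesce this event into it              */
        *word |= mask;
        enqueue_signal_for_fd(fd);       /* first event since the last dequeue       */
    }

    /* Called when the application dequeues the signal for fd (e.g., sigwaitinfo()). */
    void on_signal_dequeued(int fd)
    {
        pending[fd / BITS_PER_WORD] &= ~(1UL << (fd % BITS_PER_WORD));
    }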
On the whole, signal-per-fd is a simple enhancement to the implementation of RT signals that can overcome some of their limitations in the context of using them as an event-dispatch mechanism for doing network I/O.

4 Dispatch Throughput

In this section, we look at another parameter associated with the efficiency of event-dispatch mechanisms, namely, the throughput that can be achieved as a function of the active load on the server. This metric is orthogonal to the overhead discussed in the previous section, as it refers to the actual amount of useful work being performed by the server. In what follows, we first provide a comparative experimental study of some of the Linux dispatch mechanisms, including the signal-per-fd optimization proposed in the previous section. In addition, we also look at the throughput achieved by a select() based server with a minor modification which allows the server to do multiple accept()s each time the listening socket becomes ready. Then, we discuss the results of this study and provide some insights into the behavior of the various mechanisms.

4.1 Experimental Study

Here, we experimentally evaluate the throughput achieved by various event-dispatch mechanisms under high load. Our experimental setup is the same as that used in Section 3.1 for the comparative study of select(), /dev/poll and RT signals. In this study, we evaluate two new mechanisms/enhancements as well: (i) an RT signal based server using the signal-per-fd enhancement described in Section 3.4, and (ii) a multi-accept variant of the select() based server, which accepts all pending connections whenever the listening socket becomes ready, thereby amortizing the cost of each select() call over more useful work (a sketch of this modification follows).
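The following is a minimal sketch of the multi-accept modification (illustrative only; register_connection() stands in for adding the new descriptor to the select() interest set, and the listening socket is assumed to be non-blocking).

    /* Sketch of the multi-accept modification to a select() based server.
     * Instead of accepting a single connection per notification, the server
     * drains the listen queue; the listening socket is assumed non-blocking. */
    #include <sys/socket.h>
    #include <errno.h>

    extern int  listen_fd;                      /* non-blocking listening socket      */
    extern void register_connection(int fd);    /* add fd to the select() interest set */

    void accept_all_pending(void)
    {
        for (;;) {
            int conn = accept(listen_fd, NULL, NULL);
            if (conn < 0) {
                if (errno == EAGAIN || errno == EWOULDBLOCK)
                    break;                      /* listen queue drained               */
                break;                          /* other errors treated the same here */
            }
            register_connection(conn);          /* more useful work per select() call */
        }
    }

With this change, one select() call can admit an entire burst of new connections rather than a single one, which is the amortization effect analyzed below.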
Figure 6 shows the performance of the servers with 252 idle connections and 1 byte server reply sizes, while Figures 7 and 8 plot the same information for 6000 idle connections with server reply sizes of 1 byte and 6 KB respectively. As can be seen from Figures 6(a), 7(a) and 8(a), the throughput with select() plateaus out well before it does for the RT signals (both the default and the signal-per-fd implementations). The fall in reply rate of /dev/poll is much more dramatic and, again, it seems to perform very poorly under overload. The interesting observation is that multi-accept select is able to sustain a high throughput, similar to the RT signals, and even manages to achieve a slightly higher peak throughput in the first two cases.

Figures 6(b), 7(b) and 8(b) show the CPU usage for the servers. Again, as can be seen from these figures, the CPU usage for RT signals is much less than that for select(), multi-accept select and /dev/poll in all cases, and RT signals reach saturation at a much higher load. In fact, for 6000 idle connections (Figures 7(b) and 8(b)), CPU usage is 100% for the select(), multi-accept select and /dev/poll based servers right from the beginning, which can be attributed to their high overhead in handling a large number of concurrent connections. On the other hand, the CPU overhead for the RT signal based server (for both the default and signal-per-fd cases) increases linearly with the load in either case. An interesting point to be noted from these figures is that, for the 1 byte reply sizes, the server with the default RT signal implementation reaches saturation at a slightly smaller load than signal-per-fd, and this is more pronounced for the 6000 idle connections. We will discuss this point in more detail below.

Figures 9(a), (b) and (c) plot the average response times of the various servers with increasing load for 252 idle connections (1 byte replies) and 6000 idle connections (1 byte and 6 KB replies) respectively. Figure 9(a) shows that select() reaches overload at a relatively low load, while the other mechanisms get overloaded at much higher loads. In Figures 9(b) and (c), select() shows high response times for all loads and is thus overloaded for all the points in the graph. These plots complement Figures 6(a), 7(a) and 8(a), which show the throughput for these cases. The figures further show that the /dev/poll server achieves small response times at low loads, but under overload, it offers much higher response times compared to the other mechanisms. Thus, its overload behavior is again seen to be very poor. The interesting point in Figure 9(a) is that multi-accept select is able to provide a low response time up to very high loads. Figure 9(b) shows an even more interesting behavior of multi-accept select -- its response time actually decreases with increasing load until it hits overload. This behavior clearly shows the load amortization occurring for multi-accept select: more useful work being extracted for the same select() call overhead translates into lower average response times. In Figure 9(c), multi-accept select shows higher response times, but these increase only gradually with load, which again shows that the select() cost is being amortized. Finally, the two RT signal implementations have the lowest response times until they get overloaded, which is expected as they have the lowest overhead. Once again, these graphs show that the default RT signal based server reaches overload slightly earlier than the signal-per-fd server for the 1 byte reply size cases.
Next, we will try to understand these results, and in particular, we
will focus on the behavior of multi-accept select and the two
implementations of RT signals.
4.2 Discussion

From the results of our comparative study, we get the following insights into the behavior of the various mechanisms: (i) the dispatch overhead of select() and /dev/poll grows with the total number of open connections, so both saturate the CPU well before the RT signal based servers do, and /dev/poll in particular degrades badly once it is overloaded; (ii) the overhead of RT signals depends only on the active request rate, which is why both RT signal variants sustain the highest loads with the lowest response times; (iii) the default RT signal implementation starts to suffer from signal queue overflows, and the attendant fallback processing, somewhat earlier than signal-per-fd, which is consistent with its slightly earlier saturation for the small reply sizes; and (iv) multi-accept select, by accepting all pending connections per select() call, amortizes the high cost of each call over more useful work, which allows it to match the throughput of the RT signal based servers despite its much higher CPU usage.
5 Conclusion

In this paper, we first discussed some of the common event-dispatch mechanisms employed by Internet servers. We focussed on the mechanisms available in the Linux kernel, and measured their performance in terms of the overhead and throughput of a minimal Web server. Our comparative studies showed that RT signals are a highly efficient mechanism in terms of their dispatch overhead and also provide good throughput compared to mechanisms like select() and /dev/poll. In particular, the overhead of RT signals is independent of the number of connections being handled by the server, and depends only on the active I/O being performed by it. However, an RT signal based server can suffer from signal queue overflows. Handling such overflows leads to complexity in the server implementation and also to potential performance penalties under high loads. To overcome these drawbacks, we proposed a scheme called signal-per-fd, which is an enhancement to the default RT signal implementation in the Linux kernel. This enhancement was shown to significantly reduce the complexity of a server implementation, increase its robustness under high load, and also potentially increase its throughput. Overall, we conclude that RT signals are a highly scalable event-dispatch mechanism, and servers based on these signals can also be substantially simplified when coupled with the signal-per-fd enhancement.

Another interesting result of our study was the performance of select() based servers under high loads. According to conventional wisdom, select() based servers have high overhead and thus perform very poorly under high loads in terms of server throughput as well. Our experiments with the multi-accept variant of a select() based server show that though select() does have high dispatch overhead, this overhead can be amortized better by performing more useful work per select() call, resulting in a high throughput even under heavy load conditions. Thus, we conclude that even a select() based server can be made to scale substantially if its overhead is better utilized to perform more useful work.

Acknowledgements

We would like to thank Martin Arlitt for providing us with a large number of client machines and helping us set up the test-bed for our experiments. We would also like to thank the anonymous reviewers and our shepherd, Greg Ganger, whose invaluable comments and suggestions helped us improve this paper immensely.

References
[1] Scalable kernel performance for Internet servers under realistic loads. In Proceedings of the USENIX Annual Technical Conference, June 1998.
[2] A scalable and explicit event delivery mechanism for UNIX. In Proceedings of the USENIX Annual Technical Conference, June 1999.
[3] On-ramp prospects for the Information Superhighway Dream. Communications of the ACM, 39(7):55-61, July 1996.
[4] phhttpd. http://www.zabbo.net/phhttpd, November 1999.
[5] Hypertext Transfer Protocol - HTTP/1.1. RFC 2068, January 1997.
[6] POSIX.4: Programming for the Real World. O'Reilly, 1995.
[7] Measuring the Impact of Event Dispatching and Concurrency Models on Web Server Performance Over High-speed Networks. In Proceedings of the Second IEEE Global Internet Conference, November 1997.
[8] httperf - A Tool for Measuring Web Server Performance. In Proceedings of the SIGMETRICS Workshop on Internet Server Performance, June 1998.
[9] Flash: An efficient and portable web server. In Proceedings of the USENIX Annual Technical Conference, June 1999.
[10] Scalable Network I/O in Linux. In Proceedings of the USENIX Annual Technical Conference, FREENIX Track, June 2000.
[11] Analyzing the Overload Behavior of a Simple Web Server. In Proceedings of the Fourth Annual Linux Showcase and Conference, October 2000.
[12] Advanced Windows. Microsoft Press, third edition, 1997.
[13] The X window system. ACM Transactions on Graphics, 5(2):79-109, 1986.
[14] Solaris /dev/poll documentation. http://docs.sun.com:80/ab2/coll.40.6/REFMAN7/@Ab2PageView/55123?Ab2Lang=C&Ab2Enc=iso-8859-1.
[15] UNIX Network Programming. Prentice Hall, 1990.
[16] thttpd. http://www.acme.com/software/thttpd.
[17] TUX Web server interview. http://slashdot.org/interviews/00/07/20/1440204.shtml.
[18] Zeus Web server. http://www.zeustech.net/products/ws.