The use of network appliances, i.e., computers specialized to perform a single function, is becoming increasingly widespread. Examples of such appliances are file servers [24,6], e-mail servers [19,13], web proxies [25,5], web accelerators [25,5,16] and load balancers [4,12]. Appliance computers have many potential advantages over traditional general-purpose systems, such as higher performance/cost metrics, simpler configuration and lower costs of management. With the recent growth in the use of networked systems by the non-expert, mainstream population, all of these advantages have significant importance.
A network appliance is typically constructed using off-the-shelf hardware components. The appliance's service is implemented by custom software running on top of a specialized operating system. (Often the server software is tightly integrated with the OS in the same address space.) The operating system itself is either designed and constructed from scratch, e.g., Network Appliance's Data ONTAP [26], or is a stripped-down version of a general-purpose operating system, e.g., BSDI's Embedded BSD/OS [8].
While appliance computer systems have delivered on the promise of higher performance/cost vis-a-vis general-purpose systems, the same is not strictly true of their manageability. The complexity of configuring and managing appliance computers in normal circumstances is significantly lower than that of general-purpose systems, but debugging the configuration and performance problems of appliances (when they do occur) remains a task that requires substantial operating system and networking expertise. In this respect, appliance systems are similar to general-purpose systems.
This state of technology is not very surprising: today, the term "appliance-like" is usually taken to mean specialized to do a single coherent task well. Specialization of this form has allowed appliance vendors to build and maintain smaller amounts of code than are used on general-purpose computer systems. The narrower functionality of appliances has enabled simpler configuration, and more aggressive optimizations leading to superior performance. The ability to easily debug configuration and performance problems has, however, been a secondary issue so far and has not received much attention.
Appliance operating systems often contain significant code derived from general-purpose operating systems, particularly UNIX. For instance, the BSD TCP/IP protocol code [33] is a common building block in appliance operating systems. Like general-purpose systems, appliance operating systems export a set of command interfaces that allow users to display the values of statistic counters corresponding to the various events that have occurred during the operation of the system. Some command interfaces display system configuration parameters. As with general-purpose systems, these command interfaces are the key tools for debugging performance and configuration problems with appliance systems.
For example, the TCP/IP code of many appliance systems exports its event statistics and configuration via a variant of the UNIX netstat command. When a person debugging a configuration or performance problem suspects a bug or problem in the network subsystem of the target appliance, she executes the netstat command (possibly multiple times with different options) and analyzes the output for aberrations from expected normal values. Any deviations of these statistics from the norm provide clues to what might be wrong with the system. Using these clues, the person debugging the problem may perform additional observations of the system's statistics, using other commands, followed by further analysis and corrective actions (such as configuration changes).
The fundamental problem with this style of statistic-inspection based problem diagnosis is the need for human intervention, and for specialized networking and performance debugging expertise in the intervening human. For example, consider a workstation that is experiencing poor NFS [27] file access performance. Assume that the cause of this problem is excessive packet loss in the network path between the client and an NFS server due to an Ethernet duplex mismatch at the server. To diagnose this problem today, the person debugging the problem needs to first isolate the problem to the problematic server, then check the packet drop statistics for the transport protocol in use (UDP or TCP), and correlate these statistics with excessive values for CRC errors or late collisions maintained by the appropriate network interface device driver (see footnote 1). After this, the problem debugger has to check the appropriate switch's configuration to verify the existence of a duplex mismatch.
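As an illustration only, the correlation step in the duplex mismatch example could be sketched as follows. This is not the logic of any particular system; the counter names, the two record types and both thresholds are hypothetical, and a real diagnosis would still need to confirm the mismatch against the switch's configuration.

```python
from dataclasses import dataclass


@dataclass
class InterfaceStats:
    """Hypothetical per-interface counters, as a device driver might report them."""
    frames_in: int
    frames_out: int
    crc_errors: int
    late_collisions: int


@dataclass
class TransportStats:
    """Hypothetical transport-level counters (TCP or UDP)."""
    packets_sent: int
    retransmits_or_drops: int


# Illustrative thresholds only; real values would need tuning per deployment.
LOSS_RATE_THRESHOLD = 0.01
LINK_ERROR_RATE_THRESHOLD = 0.001


def rate(events: int, total: int) -> float:
    """Return events/total, treating an idle counter as a zero rate."""
    return events / total if total > 0 else 0.0


def suspect_duplex_mismatch(transport: TransportStats, iface: InterfaceStats) -> bool:
    """Flag a possible duplex mismatch when transport loss coincides with link errors."""
    loss = rate(transport.retransmits_or_drops, transport.packets_sent)
    crc = rate(iface.crc_errors, iface.frames_in)
    late = rate(iface.late_collisions, iface.frames_out)
    link_errors_high = (
        crc > LINK_ERROR_RATE_THRESHOLD or late > LINK_ERROR_RATE_THRESHOLD
    )
    return loss > LOSS_RATE_THRESHOLD and link_errors_high
```

The point of the sketch is that neither statistic is conclusive alone: transport-level loss has many causes, and a few CRC errors are not unusual, but the two together narrow the search to the link layer.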
For any organization engaged in selling and supporting appliance computer systems, it is very expensive to provide a large number of human experts with this level of expertise for the on-site debugging of customer problems. In the absence of sufficient numbers of human experts, problem FAQs, and semi-interactive troubleshooting guides are commonly used by customers and by the (mostly) non-expert customer support staff of the appliance vendors for diagnosing field problems.
Another limitation of this style of problem debugging is that field problems are usually detected only after they occur. Problems are first detected by unusual behavior (e.g., poor performance) at the application level and then traced back to the cause by a human expert via an exhaustive search and pattern-match through the system's statistics. While there is usually a well-understood notion of normal and bad values for the various statistics, there exists no software logic to continuously monitor the statistics and to catch shifts in their values from normal to bad. As a result, problems (and the resulting service outages) that could have been avoided by timely corrective action are not avoided.
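The missing monitoring logic can be pictured with a minimal sketch such as the one below. It assumes that statistics are cumulative counters sampled at a fixed interval and that "bad" means growing faster than a per-interval limit; the counter names, the limits and the `StatisticMonitor` class are hypothetical and serve only as an illustration.

```python
from typing import Dict, List

# Hypothetical limits on how much each counter may grow between two samples.
MAX_DELTA_PER_INTERVAL: Dict[str, int] = {
    "crc_errors": 10,
    "late_collisions": 5,
}


class StatisticMonitor:
    """Tracks cumulative counters and reports those growing faster than allowed."""

    def __init__(self, limits: Dict[str, int]) -> None:
        self.limits = limits
        self.previous: Dict[str, int] = {}

    def check(self, sample: Dict[str, int]) -> List[str]:
        """Compare a new sample with the previous one; return the suspect counters."""
        suspects = []
        for name, limit in self.limits.items():
            if name in sample and name in self.previous:
                if sample[name] - self.previous[name] > limit:
                    suspects.append(name)
        # Remember this sample so the next call measures growth since now.
        self.previous = dict(sample)
        return suspects
```

Calling `check` once per sampling interval lets a shift from normal to bad be noticed when it happens. The first call only establishes a baseline and reports nothing, because cumulative counters carry no rate information until there are two samples to compare.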
For all of these reasons, the use of an appliance system can sometimes be a somewhat frustrating experience for a non-expert customer. The subject of this paper is the problem of enabling simple and easy, i.e., appliance-like, debugging of the field problems of appliances. We describe four techniques that we have developed to make the diagnosis of appliance problems easier: continuous statistic monitoring, protocol augmentation, cross-layer analysis and configuration change tracking. We also describe the application of these ideas in an auto-diagnosis subsystem of the Data ONTAP operating system.
Specifically, continuous monitoring involves periodically checking the system's collected operational statistics for potential problems, while actively analyzing and fixing whichever problems the monitoring logic can. Protocol augmentation allows configuration problems with a network protocol to be diagnosed using specially constructed higher-level protocol tests. Cross-layer analysis is a path-based approach [23] for isolating a problem with a multi-layered system to a specific system layer. Automatic configuration change tracking keeps track of changes in the system's configuration, making it easier to trace a problem to its cause.
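To make the last of these techniques concrete, a configuration change tracker could look roughly like the sketch below. It assumes that the configuration can be captured as a flat mapping from option names to string values and that snapshots are taken periodically; the `ConfigChangeTracker` class and the option names used with it are hypothetical, not part of any real system's interface.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class ConfigChange:
    """One recorded change; None means the option was absent."""
    timestamp: float
    key: str
    old_value: Optional[str]
    new_value: Optional[str]


class ConfigChangeTracker:
    """Keeps a history of differences between successive configuration snapshots."""

    def __init__(self, initial: Dict[str, str]) -> None:
        self.current = dict(initial)
        self.history: List[ConfigChange] = []

    def record_snapshot(self, snapshot: Dict[str, str], timestamp: float) -> None:
        """Diff the snapshot against the last known state and log every change."""
        for key in sorted(set(self.current) | set(snapshot)):
            old = self.current.get(key)
            new = snapshot.get(key)
            if old != new:
                self.history.append(ConfigChange(timestamp, key, old, new))
        self.current = dict(snapshot)

    def changes_since(self, timestamp: float) -> List[ConfigChange]:
        """Return the changes recorded at or after the given time."""
        return [c for c in self.history if c.timestamp >= timestamp]
```

When a problem is first observed at some time, `changes_since` with a slightly earlier timestamp yields a short list of candidate causes, which is the sense in which tracking makes it easier to tie a problem to a recent configuration change.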
Our discussion in the remainder of the paper is set in the context of an appliance operating system. More specifically, we focus on problems that arise with file server appliance systems built and sold by Network Appliance. However, we believe that most of the ideas that we present are directly applicable to the space of general-purpose operating systems. Indeed, the class of field problems involving general-purpose computer systems is much larger than the class of appliance field problems because of the broader functionality and services offered by general-purpose systems. It is probably just as important (and useful) to provide for easier debugging of field problems with general-purpose systems as it is with appliance systems. Later in this paper, we will briefly outline how our auto-diagnosis techniques can be used in a general-purpose operating system, such as BSD.
The rest of the paper is structured as follows. In the next section, we discuss the nature of common field problems of appliance computer systems. In Section 3, we describe the four techniques that we have developed to diagnose such problems automatically and efficiently. In Section 4, we describe the implementation of the NetApp Auto-diagnosis System (NADS). Section 5 describes our experience with this auto-diagnosis system. Section 6 covers related work. Finally, Section 7 summarizes the paper and offers some directions for future work.