Benjamin C. Ling, Emre Kiciman and Armando Fox
{bling, emrek, fox}@cs.stanford.edu
The cost and complexity of administering large systems have come to
dominate their total cost of ownership. Stateless and soft-state
components, e.g. Web servers or network routers, are easy to manage:
capacity can be scaled incrementally by adding more nodes, rebalancing
of load after failover is easy, and reactive or proactive
("rolling") reboots can be used to handle transient failures. We
show that it is possible to achieve the same ease of management for
the state-storage subsystem by subdividing persistent state according
to the specific guarantees needed by each type. While other
systems [19, 17] have addressed
persistent-until-deleted state, we describe SSM, a store for a
previously unaddressed class of state - user-session state - that
exhibits the same manageability properties as stateless nodes while
providing firm storage guarantees. Any node can be
proactively or reactively rebooted at any time to recover from
transient faults, without impacting online performance or losing data.
We exploit this simplified manageability by pairing SSM with an
application-generic, statistical-anomaly-based framework that detects
crashes, hangs, and performance failures, and automatically attempts
to recover from them by rebooting faulty nodes. Although
the detection techniques generate some false positives, the cost of
recovery is so low that the false positives have little impact. We
provide microbenchmarks to demonstrate SSM's built-in overload
protection, failure management and self-tuning. We benchmark
SSM integrated into a production enterprise-scale interactive service
to demonstrate that these benefits need not come at the cost of
significantly decreased throughput or response time.