Next: Brick MTTF vs. Availability
Up: Proposed Solution: SSM
Previous: Failure and Recovery
Previous work has argued that rebooting is an appealing recovery
strategy in cases where it can be made to
work [6]: it is simple to understand and use,
reclaims leaked resources, cleans up corrupted transient operating
state, and returns the system to a known state. Even assuming a
component is reboot-safe, in some cases multiple components may have
to be rebooted to allow the system as a whole to continue operating;
because inter-component interactions are not always fully known,
deciding which components to reboot may be difficult. If the
decision is too conservative (too many components are rebooted),
recovery may take longer than necessary. If it is too lenient (too
few components are rebooted), the system as a whole may not recover,
and another recovery attempt is needed, which again wastes time.
By making recovery "free" in SSM, we largely eliminate the cost of
being too conservative. If an SSM brick is suspected of being
faulty (perhaps it is displaying fail-stutter behavior [2] or other
characteristics associated with software aging [14]), there is
essentially no penalty for rebooting it prophylactically. This can be
thought of as a special case of
fault-model enforcement: treat any performance fault in an SSM brick
as a crash fault, and recover accordingly. In recent terminology,
SSM is a crash-only subsystem [6].
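
The fault-model enforcement policy described above can be illustrated with a minimal sketch. It is not the SSM implementation: the `Brick` class, the `enforce_crash_only` function, the median-latency test, and the `slow_factor` threshold are all hypothetical names and assumptions chosen for illustration. The sketch only shows the idea that a brick with a suspected performance fault is handled exactly like a crashed brick, by rebooting it.

```python
from dataclasses import dataclass, field
from statistics import median


@dataclass
class Brick:
    """Hypothetical stand-in for an SSM brick (not the authors' code)."""
    name: str
    latencies_ms: list = field(default_factory=list)
    reboots: int = 0

    def reboot(self):
        # A reboot discards transient state; in this toy model that is
        # only the recorded latency history.
        self.latencies_ms.clear()
        self.reboots += 1


def enforce_crash_only(bricks, slow_factor=3.0):
    """Reboot every brick whose median latency exceeds slow_factor times
    the median of the per-brick medians. Returns the rebooted names.

    Assumption: a slow brick is treated as a crashed brick, so the only
    recovery action is a reboot.
    """
    medians = {b.name: median(b.latencies_ms) for b in bricks if b.latencies_ms}
    if not medians:
        return []
    cluster_median = median(medians.values())
    rebooted = []
    for b in bricks:
        if b.name in medians and medians[b.name] > slow_factor * cluster_median:
            b.reboot()
            rebooted.append(b.name)
    return rebooted


if __name__ == "__main__":
    bricks = [
        Brick("b0", [2, 3, 2]),
        Brick("b1", [2, 2, 3]),
        Brick("b2", [40, 55, 38]),  # stuttering brick
    ]
    print(enforce_crash_only(bricks))
```

Because the text assumes a reboot is essentially free, the sketch makes no attempt to confirm that a slow brick is truly faulty before rebooting it; a false positive costs almost nothing, so the detector can afford to be aggressive.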
Benjamin Chan-Bin Ling
2004-03-04