Recovery Philosophy

Previous work has argued that rebooting is an appealing recovery strategy in cases where it can be made to work [6]: it is simple to understand and use, reclaims leaked resources, cleans up corrupted transient operating state, and returns the system to a known state. Even assuming a component is reboot-safe, in some cases multiple components may have to be rebooted to allow the system as a whole to continue operating; because inter-component interactions are not always fully known, deciding which components to reboot may be difficult. If the decision is too conservative (too many components are rebooted), recovery may take longer than necessary. If it is too lenient (too few components are rebooted), the system as a whole may not recover, so another recovery attempt is needed and time is again wasted.

By making recovery "free" in SSM, we largely eliminate the cost of being too conservative. If an SSM brick is suspected of being faulty (perhaps it is displaying fail-stutter behavior [2] or other characteristics associated with software aging [14]), there is essentially no penalty for rebooting it prophylactically. This can be thought of as a special case of fault-model enforcement: treat any performance fault in an SSM brick as a crash fault, and recover accordingly. In recent terminology, SSM is a crash-only subsystem [6].
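The fault-model enforcement policy described above can be illustrated with a minimal sketch. This is not SSM's actual code: the `Brick` class, the `probe` callback, and the `latency_limit_s` threshold are all hypothetical names and values chosen for illustration. The only point it makes is that a brick that answers slowly is handled exactly like a brick that does not answer at all.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Brick:
    """Hypothetical stand-in for an SSM brick; not the real SSM interface."""
    name: str
    reboots: int = 0

    def reboot(self) -> None:
        # Rebooting is assumed to be essentially free, so this sketch
        # only counts the restart.
        self.reboots += 1


def enforce_crash_only(
    bricks: List[Brick],
    probe: Callable[[Brick], Optional[float]],
    latency_limit_s: float = 0.5,  # assumed threshold, not taken from the paper
) -> List[Brick]:
    """Reboot every brick whose probe is missing or too slow.

    probe(brick) returns the observed response time in seconds,
    or None if the brick did not answer. Returns the rebooted bricks.
    """
    rebooted = []
    for brick in bricks:
        latency = probe(brick)
        # A missing reply is a crash fault; a slow reply is a performance
        # fault. Fault-model enforcement treats both identically.
        if latency is None or latency > latency_limit_s:
            brick.reboot()
            rebooted.append(brick)
    return rebooted


# Example: b1 is slow (fail-stutter), b2 is silent (crashed), b0 is healthy.
bricks = [Brick("b0"), Brick("b1"), Brick("b2")]
observed = {"b0": 0.01, "b1": 2.0, "b2": None}
rebooted = enforce_crash_only(bricks, lambda b: observed[b.name])
print([b.name for b in rebooted])  # prints ['b1', 'b2']
```

Because a reboot is assumed to cost almost nothing, the sketch errs on the conservative side: any suspicion triggers a restart, and no attempt is made to diagnose why the brick was slow.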

Benjamin Chan-Bin Ling 2004-03-04