Next: CPU/Memory Performance Fault Up: Experimental Results Previous: Memory Fault in Data

Network Performance Fault

SSM can tolerate and recover from transient network faults. We use FAUmachine to inject a fault at a brick's network interface: we cause the interface to drop 70 percent of all outgoing packets. Figure 12 shows this experiment. We run all six bricks on FAUmachine, whose overhead makes the system run an order of magnitude slower. In this benchmark, $W=3, WQ=2, R=2$; to compensate for the slowdown, we increase t to 700ms and decrease the size of the state written to 3KB.
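The following is a minimal sketch, not SSM's implementation, of why a single lossy brick need not cause write failures under these parameters. It assumes that a write is sent to W randomly chosen bricks and succeeds once WQ acknowledgements arrive, and that the injected fault affects only the faulty brick's outgoing acknowledgements. The names `ack_delivered`, `write_succeeds`, and `simulate` are hypothetical.

```python
import random

DROP_RATE = 0.70   # fraction of outgoing packets dropped (from the text)
W, WQ = 3, 2       # write parameters from the text
NUM_BRICKS = 6     # all six bricks run on FAUmachine
FAULTY_BRICK = 0   # assumption: exactly one brick has the injected fault


def ack_delivered(brick, rng):
    """Return True if the brick's outgoing acknowledgement reaches the client."""
    if brick == FAULTY_BRICK and rng.random() < DROP_RATE:
        return False  # packet dropped at the faulty network interface
    return True


def write_succeeds(rng):
    """Assumed write path: send to W random bricks, need WQ acks.

    Returns (success, number_of_lost_acks).
    """
    targets = rng.sample(range(NUM_BRICKS), W)
    delivered = [ack_delivered(b, rng) for b in targets]
    acks = sum(delivered)
    return acks >= WQ, W - acks


def simulate(trials=10000, seed=0):
    """Return (fraction of successful writes, total acks lost)."""
    rng = random.Random(seed)
    successes = 0
    lost = 0
    for _ in range(trials):
        ok, dropped = write_succeeds(rng)
        successes += ok
        lost += dropped
    return successes / trials, lost


if __name__ == "__main__":
    fraction, lost = simulate()
    print(f"successful writes: {fraction:.3f}, acks lost: {lost}")
```

Under these assumptions at most one of the W = 3 acknowledgements can be lost per write, so WQ = 2 is always reached even though the faulty brick drops most of its replies. The sketch ignores timeouts, reads, and retransmission, all of which matter in the real system.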

Figure 12: Fault injection: dropping 70 percent of outgoing packets. Fault injected at time 35, brick killed at time 45, brick restarted at time 70.

The fault is injected at time 35, and the brick continues to run with the injected fault for 10 seconds, as shown in the darkened portion of Figure 12. At time 45, Pinpoint detects and kills the faulty brick. The fault is then cleared so that network traffic can resume as normal, and the brick is restarted. Restart takes significantly longer on FAUmachine, and the brick completes its restart at time 70. Throughout the experiment, all requests complete correctly within the specified timeout, and data is available at all times. As expected, throughput drops slightly while only five bricks are functioning, from time 45 to time 70; recall that running bricks on FAUmachine causes an order of magnitude slowdown.
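The intervals in this timeline can be worked out directly from the three timestamps. The short sketch below does only that arithmetic; the function name `timeline_summary` is hypothetical, and the capacity figure is simply the fraction of bricks still running, not a measured throughput.

```python
FAULT_INJECTED = 35    # seconds, from the text
BRICK_KILLED = 45      # Pinpoint detects and kills the brick
BRICK_RESTARTED = 70   # restart completes
TOTAL_BRICKS = 6


def timeline_summary():
    """Derive interval lengths and remaining brick fraction from the timestamps."""
    detection_delay = BRICK_KILLED - FAULT_INJECTED    # time spent running with the fault
    restart_window = BRICK_RESTARTED - BRICK_KILLED    # time with one brick down
    bricks_up = TOTAL_BRICKS - 1
    return {
        "detection_delay_s": detection_delay,
        "restart_window_s": restart_window,
        "bricks_up_fraction": bricks_up / TOTAL_BRICKS,
    }


if __name__ == "__main__":
    for name, value in timeline_summary().items():
        print(name, value)
```

This gives a 10 second detection delay and a 25 second window in which five of the six bricks are serving requests.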


Benjamin Chan-Bin Ling 2004-03-04