We identify the following three metrics to evaluate the efficacy of SMFS:
- Data Loss: What happens in the event of a disaster at the primary? For varying loss rates on the wide-area link, how much does the mirror site diverge from the primary? We want our system to minimize this divergence.
- Latency: Latency can be used to measure both performance and reliability. Application-perceived latency measures performance, while mirroring latency measures reliability. In particular, the lower the mirroring latency, and the tighter the spread of its distribution, the better the fidelity of the mirror to the primary.
- Throughput: Throughput is a good measure of performance. We want throughput to degrade gracefully with increasing link loss and latency. Also, for mirroring solutions that use forward error correction (FEC) codes, there is a fundamental tradeoff between data reliability and goodput (i.e., application-level throughput): proactive redundancy via FEC increases tolerance to link loss and latency, but reduces the maximum goodput due to the overhead of the FEC codes. We focus on goodput.
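The goodput tradeoff can be sketched with back-of-the-envelope arithmetic. This is our own illustration, not SMFS code; it assumes a Maelstrom-style FEC scheme that injects c repair packets for every r data packets, so at most r/(r+c) of the raw link capacity carries application data.

```python
# Hypothetical sketch (function names are ours): best-case goodput under
# proactive FEC with r data packets and c repair packets per group.

def fec_goodput_fraction(r: int, c: int) -> float:
    """Fraction of raw link throughput available to the application."""
    return r / (r + c)

def fec_overhead_fraction(r: int, c: int) -> float:
    """Fraction of raw link throughput consumed by FEC repair traffic."""
    return c / (r + c)

# With the parameters used in these experiments (r = 8, c = 3),
# at most 8/11 of the link capacity is goodput; the redundancy
# overhead is 3/11, roughly 27%.
goodput = fec_goodput_fraction(8, 3)   # ~0.727
overhead = fec_overhead_fraction(8, 3)  # ~0.273
```

Raising c buys loss tolerance at the cost of this fixed overhead, which is why the experiments below report goodput rather than raw throughput.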
Table 2: Experimental Configuration Parameters

| Parameter group                      | Parameter     | Value  |
| Layered Interleaving FEC Params [10] | r             | 8      |
|                                      | c             | 3      |
| Network-sync Parameters              | segment size  | 100 MB |
|                                      | append size   | 512 kB |
|                                      | block size    | 4 kB   |
| Experiment Parameters                | expt duration | 3 mins |
For effective comparison, we define the following five configurations; all configurations use TCP to communicate between each pair of storage servers.
- Local-sync: This is the canonical state-of-the-art solution; it is
semi-synchronous. As soon as the request has been applied to the local
storage image and the local kernel has buffered a request to send a
message to the remote mirror, the local storage server responds to the
application; it does not wait for feedback from the remote mirror, or
even for the packet to be placed on the wire.
- Remote-sync: This is the other end of the spectrum. It is
a synchronous solution. The local storage server waits for a
storage-level acknowledgment from the remote mirror
before responding to the application.
- Network-sync: This is SMFS running with the network-sync option,
implemented by Maelstrom in the manner outlined in Section 3 (e.g.,
with TCP over FEC). The network-sync layer provides feedback after
proactively injecting redundancy into the network; SMFS responds to
the application after receiving this redundancy notification.
- Local-sync+FEC: As a comparison point, this scheme is the local-sync mechanism, with Maelstrom running on the wide-area link, but without network-level callbacks to report when FEC packets are placed on the wire (i.e. storage servers are unaware of the proactive redundancy). The local server permits the application to resume execution as soon as data has been written to the local storage system.
- Remote-sync+FEC: As a second comparison point, this scheme is the remote-sync mechanism, again using Maelstrom on the wide-area link but without upcalls when FEC packets are sent. The local server waits for the remote storage system to acknowledge updates.
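The five configurations differ only in which event the local storage server waits for before acknowledging the application, and in whether Maelstrom FEC runs on the wide-area link. A minimal sketch of that design space (the names and table are ours, mirroring the prose above, not SMFS internals):

```python
# Hypothetical summary of the five configurations: the event after which
# write() returns to the application, and whether Maelstrom FEC is used.
ACK_POINT = {
    "local-sync":      "local write applied; send buffered in kernel",
    "local-sync+fec":  "local write applied; send buffered in kernel",
    "network-sync":    "data plus FEC redundancy injected into the network",
    "remote-sync":     "storage-level ack received from remote mirror",
    "remote-sync+fec": "storage-level ack received from remote mirror",
}

USES_MAELSTROM_FEC = {"network-sync", "local-sync+fec", "remote-sync+fec"}

def ack_point(config: str) -> str:
    """Event the local storage server waits for before acking the app."""
    return ACK_POINT[config]
```

Note that the two +FEC variants keep the acknowledgment semantics of their base configurations; only network-sync ties the acknowledgment to the redundancy actually reaching the network.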
These five SMFS configurations are evaluated on each of the above
metrics, and their comparative performance is presented. The
Network-sync, Local-sync+FEC, and Remote-sync+FEC configurations all
use the Maelstrom layered interleaving forward error correction codes
with parameters (r = 8, c = 3), which increases the tolerance to
network transmission errors, but reduces the goodput by as much as
c/(r+c) = 3/11 (roughly 27%) of
the maximum throughput achievable without any proactive redundancy.
Table 2 lists the configuration parameters used in
the experiments described below.
Hakim Weatherspoon
2009-01-14