We identify the following three metrics to evaluate the efficacy of SMFS:
- Data Loss: What happens in the event of a disaster at the primary? For varying loss rates on the wide-area link, how much does the mirror site diverge from the primary? We want our system to minimize this divergence.
- Latency: Latency can be used to measure both performance and reliability. Application-perceived latency measures performance, while mirroring latency measures reliability. In particular, the lower the mirroring latency, and the tighter the spread of its distribution, the better the fidelity of the mirror to the primary.
- Throughput: Throughput is a good measure of performance. We want throughput to degrade gracefully with increasing link loss and latency. Also, for mirroring solutions that use forward error correction (FEC) codes, there is a fundamental tradeoff between data reliability and goodput (i.e., application-level throughput): proactive redundancy via FEC increases tolerance to link loss and latency, but reduces the maximum goodput due to the overhead of the FEC codes. We focus on goodput.
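The goodput tradeoff can be sketched with back-of-the-envelope arithmetic. This is our own illustration, not SMFS code; it assumes a Maelstrom-style FEC scheme that injects c repair packets for every r data packets, so at most r/(r+c) of the raw link capacity carries application data.

```python
# Hypothetical sketch (function names are ours): best-case goodput under
# proactive FEC with r data packets and c repair packets per group.

def fec_goodput_fraction(r: int, c: int) -> float:
    """Fraction of raw link throughput available to the application."""
    return r / (r + c)

def fec_overhead_fraction(r: int, c: int) -> float:
    """Fraction of raw link throughput consumed by FEC repair traffic."""
    return c / (r + c)

# With the parameters used in these experiments (r = 8, c = 3),
# at most 8/11 of the link capacity is goodput; the redundancy
# overhead is 3/11, roughly 27%.
goodput = fec_goodput_fraction(8, 3)   # ~0.727
overhead = fec_overhead_fraction(8, 3)  # ~0.273
```

Raising c buys loss tolerance at the cost of this fixed overhead, which is why the experiments below report goodput rather than raw throughput.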
Table 2: Experimental Configuration Parameters

| Parameter group                      | Parameter     | Value  |
| Layered Interleaving FEC Params [10] | r             | 8      |
|                                      | c             | 3      |
| Network-sync Parameters              | segment size  | 100 MB |
|                                      | append size   | 512 kB |
|                                      | block size    | 4 kB   |
| Experiment Parameters                | expt duration | 3 mins |
For effective comparison, we define the following five configurations; all configurations use TCP to communicate between each pair of storage servers.
- Local-sync: This is the canonical state-of-the-art solution; it is
semi-synchronous. As soon as the request has been applied to the local
storage image and the local kernel has buffered a request to send a
message to the remote mirror, the local storage server responds to the
application; it does not wait for feedback from the remote mirror, or
even for the packet to be placed on the wire.
- Remote-sync: This is the other end of the spectrum. It is
a synchronous solution. The local storage server waits for a
storage-level acknowledgment from the remote mirror
before responding to the application.
- Network-sync: This is SMFS running with the network-sync option,
implemented by Maelstrom in the manner outlined in Section 3 (e.g.,
with TCP over FEC). The network-sync layer provides feedback after
proactively injecting redundancy into the network; SMFS responds to
the application after receiving this redundancy notification.
- Local-sync+FEC: As a comparison point, this scheme is the local-sync mechanism, with Maelstrom running on the wide-area link, but without network-level callbacks to report when FEC packets are placed on the wire (i.e. storage servers are unaware of the proactive redundancy). The local server permits the application to resume execution as soon as data has been written to the local storage system.
- Remote-sync+FEC: As a second comparison point, this scheme is the remote-sync mechanism, again using Maelstrom on the wide-area link but without upcalls when FEC packets are sent. The local server waits for the remote storage system to acknowledge updates.
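The five configurations differ only in which event the local storage server waits for before acknowledging the application, and in whether Maelstrom FEC runs on the wide-area link. A minimal sketch of that design space (the names and table are ours, mirroring the prose above, not SMFS internals):

```python
# Hypothetical summary of the five configurations: the event after which
# write() returns to the application, and whether Maelstrom FEC is used.
ACK_POINT = {
    "local-sync":      "local write applied; send buffered in kernel",
    "local-sync+fec":  "local write applied; send buffered in kernel",
    "network-sync":    "data plus FEC redundancy injected into the network",
    "remote-sync":     "storage-level ack received from remote mirror",
    "remote-sync+fec": "storage-level ack received from remote mirror",
}

USES_MAELSTROM_FEC = {"network-sync", "local-sync+fec", "remote-sync+fec"}

def ack_point(config: str) -> str:
    """Event the local storage server waits for before acking the app."""
    return ACK_POINT[config]
```

Note that the two +FEC variants keep the acknowledgment semantics of their base configurations; only network-sync ties the acknowledgment to the redundancy actually reaching the network.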
These five SMFS configurations are evaluated on each of the above
metrics, and their comparative performance is presented. The
Network-sync, Local-sync+FEC, and Remote-sync+FEC configurations all
use the Maelstrom layered interleaving forward error correction codes
with parameters (r = 8, c = 3), which increases the tolerance to
network transmission errors, but reduces the goodput by as much as
c/(r+c) = 3/11 (roughly 27%) of
the maximum throughput achievable without any proactive redundancy.
Table 2 lists the configuration parameters used in
the experiments described below.
Hakim Weatherspoon
2009-01-14