
Minuet: Rethinking Concurrency Control in Storage Area Networks

Andrey Ermolinskiy$^\dagger$, Daekyeong Moon$^\dagger$, Byung-Gon Chun$^\star$, Scott Shenker$^{\dagger\ddagger}$
$^\dagger$University of California at Berkeley, $^\star$Intel Research Berkeley, $^\ddagger$ICSI


Abstract:

Clustered applications in storage area networks (SANs), widely adopted in enterprise datacenters, have traditionally relied on distributed locking protocols to coordinate concurrent access to shared storage devices. We examine the semantics of traditional lock services for SAN environments and ask whether they are sufficient to guarantee data safety at the application level. We argue that a traditional lock service design that enforces strict mutual exclusion via a globally-consistent view of locking state is neither sufficient nor strictly necessary to ensure application-level correctness in the presence of asynchrony and failures. We also argue that in many cases, strongly-consistent locking imposes an additional and unnecessary constraint on application availability. Armed with these observations, we develop a set of novel concurrency control and recovery protocols for clustered SAN applications that achieve safety and liveness in the face of arbitrary asynchrony, crash failures, and network partitions. Finally, we present and evaluate Minuet, a new synchronization primitive based on these protocols that can serve as a foundational building block for safe and highly-available SAN applications.


1 Introduction

In recent years, storage area networks (SANs) have been gaining widespread adoption in enterprise datacenters [19] and are proving effective in supporting a range of applications across a broad spectrum of industries. According to a recent survey of IT professionals across a range of corporations, government agencies, and universities, the overwhelming majority (80%) have deployed a storage area network in their organizations and 26% of the respondents report having deployed five or more SANs [14]. Some of the common applications include online transaction processing in finance and e-commerce, digital media production, business data analytics, and high-performance scientific computing.

A SAN architecture is a particularly attractive choice for parallel clustered applications that demand high-speed concurrent access to a scalable storage backend. Such applications commonly rely on a clustered middleware service to provide a higher-level storage abstraction such as a filesystem (GFS [35], OCFS [8], PanFS [10], GPFS [37]) or a relational database (Oracle RAC [9]) on top of raw disk blocks.

One of the primary design challenges for clustered SAN applications and middleware is ensuring safe and efficient coordination of access to application state and metadata that resides on shared storage. The traditional approach to concurrency control in shared-disk clusters involves the use of a synchronization module called a distributed lock manager (DLM). Typically, DLM services aim to provide the guarantee of strict mutual exclusion, ensuring that no two processes in the system can simultaneously hold conflicting locks. In abstract terms, providing such guarantees requires enforcing a globally-consistent view of lock acquisition state, and one could argue that a traditional DLM design views such consistency as an end in itself rather than a means to achieving application-level correctness.

In this paper, we take a close look at the semantics of SAN lock services and ask whether the assurances of full mutual exclusion and strongly-consistent locking are, in fact, a prerequisite for correct application behavior. Our main finding is that the standard semantics of mutual exclusion provided by a DLM are neither strictly necessary nor sufficient to guarantee safe coordination in the presence of node failures and asynchrony. In particular, processing and queuing delays in SAN switches and host bus adapters (HBAs) expose applications to out-of-order delivery of I/O requests from presumed faulty processes which, in certain scenarios, can incur catastrophic violations of safety and cause permanent data loss.

We propose and evaluate a new technique for disk access coordination in SAN environments. Our approach augments target storage devices with a tiny application-independent functional component, called a guard, and a small amount of state, which enable them to reject inconsistent I/O requests and provide a property called session isolation.

These extensions enable a novel optimistic approach to concurrency control in SANs and can also make existing lock-based protocols safe in the face of arbitrarily delayed message delivery, drifting clocks, crash process failures, and network partitions. The session isolation property in turn provides a foundational primitive for implementing more complex and useful coordination semantics, such as serializable transactions, and we demonstrate one such protocol.

We then describe the implementation of Minuet, a software library that provides a novel synchronization primitive for SAN applications based on the protocols we present. Minuet assumes the presence of guard logic at the target storage devices and provides applications with locking and distributed transaction facilities, while guaranteeing liveness and data safety in the face of arbitrary asynchrony, node failures, and network partitions. Our evaluation shows that applications built atop Minuet compare favorably to those that rely on a conventional strongly-consistent DLM, offering comparable or better performance and improved availability.

Unlike existing services for fault-tolerant distributed coordination such as Chubby [20] and ZooKeeper [15], Minuet requires its lock managers to maintain only loosely-consistent replicas of locking state and thus permits applications to make progress with less than a majority of lock manager replicas. To demonstrate the practical feasibility of our approach, we implemented two sample applications - a distributed chunkmap and a B+ tree - on top of Minuet and evaluated them in a clustered environment supported by an iSCSI-based SAN.

The benefits of optimistic concurrency control and the associated tradeoffs have been extensively explored in the database literature and are well understood. In particular, techniques such as callback locking, optimistic 2-phase locking, and adaptive callback locking [18, 21, 24, 42] have been proposed to enable safe coordination and efficient caching in client-server databases. It is important to note, however, that these approaches are not directly applicable to SANs because they assume the existence of a central lock server, typically co-located with the data block storage server. This assumption does not hold in a SAN environment, where the storage "servers" are application-agnostic disk arrays that possess no knowledge of locking state or node liveness status. Hence, a conservative DLM service that enforces strict mutual exclusion has traditionally been viewed as the only practical method of coordinating concurrent access to shared state for SAN applications.

Our main insight is that a single nearly trivial extension to the internal logic of a SAN storage device suffices to address the data safety problems associated with traditional DLMs and enables a very different approach to protocol layering for storage access coordination. Crucially, we achieve this without introducing application-level logic into storage devices and without forfeiting the generality and simplicity of the traditional block-level interface to SAN-attached devices.

The technical feasibility of device-based synchronization and its practical advantages have been demonstrated by several earlier proposals [12, 29]. Our study builds on this earlier work: while prior efforts have primarily focused on moving the functionality of a traditional cluster lock manager into the storage device, Minuet aims to provide a more general and useful synchronization primitive that supports a wider range of concurrency control mechanisms. In addition to supporting traditional conservative locking, our approach enables an optimistic method of concurrency control that can improve performance for certain application workloads. Further, Minuet allows existing locking protocols to remain safe in the presence of arbitrarily-delayed message delivery, node failures, and network partitions.

The rest of this paper is organized as follows. In Section 2, we provide the relevant background on SANs and some representative examples of data safety problems. In Section 3, we present our main contribution - the design of Minuet, a novel safe and highly available synchronization mechanism for SAN applications. Section 4 describes our prototype implementation and two sample parallel applications. We evaluate our system in Section 5 and discuss practical aspects of our approach in Section 6. Finally, we discuss related work in Section 7 and conclude in Section 8.


2 Background


2.1 Storage area networks

Storage area networks (SANs) are popular in enterprise datacenters and are commonly adopted to support the storage needs of data-intensive clustered applications. In the SAN (or shared-disk) model, persistent storage devices, typically disk drive arrays or specialized hardware appliances, are attached to a dedicated storage network and appear to members of the application cluster as local disks. Most SANs utilize a combination of SCSI and a low-level transport protocol such as TCP/IP or FCP (Fibre Channel Protocol) for communication between the application nodes and the target storage devices.

SANs aim to provide fully decentralized access to shared application state on disk: in principle, any SAN-attached client node can access any piece of data without routing its requests to a dedicated server. While all requests on a particular piece of data are still serialized at a central point in this model, the crucial distinction from the traditional server-attached storage paradigm is that the point of serialization is a hardware disk controller that exposes an application-independent I/O interface on raw disk blocks and is oblivious to application semantics and data layout considerations.

Broadly, the SAN paradigm is advantageous from the standpoint of availability because it offers better redundancy and decouples node failures from loss of persistent state. Incoming application requests can be routed to any available node in the cluster and, in the event of a node failure, subsequent requests can be redirected to another processor with minimal interruption of service.

One of the primary design challenges for clustered SAN applications and middleware is ensuring safe and efficient coordination of access to shared state on disk and commonly, a software service called a distributed lock manager (DLM) is employed to provide such coordination. A typical lock service such as OpenDLM[7] operates on shared resources, abstract application-level entities that require access coordination, and attempts to provide the guarantee of mutual exclusion - no two processes may simultaneously hold conflicting locks on the same resource.


2.2 Safety and liveness problems in SANs

In theory, DLM-based mutual exclusion offers a sufficient mechanism for ensuring safe access to shared state. In practice, however, guaranteeing safe serialization of disk requests tends to be more difficult than the above discussion might suggest due to the effects of node failures and asynchrony: nodes can fail by stopping, and processing and communication delays are not bounded. The following examples illustrate the nature of the problem.

Scenario 1: Consider a data structure $ S$ spanning 10 blocks on a shared disk $ D$ and two clients, $ C_1$ and $ C_2$ , that are accessing the data structure concurrently. $ C_1$ is updating blocks $ [3 - 7]$ of $ S$ under the protection of an exclusive lock, while $ C_2$ wants to read $ S$ in its entirety (i.e., blocks $ [0 - 9]$ ) and is waiting for a shared lock. Suppose $ C_1$ crashes after sending its $ WRITE$ request to $ D$ but before hearing the response. The lock manager correctly detects the failure, reclaims the exclusive lock, and grants it to $ C_2$ in shared mode. Next, $ C_2$ proceeds to reading $ S$ and, assuming that a single disk request can carry up to 5 blocks of data, issues two requests: $ R_1 = \langle READ [0-4] \rangle$ and $ R_2 = \langle READ [5-9] \rangle$ . Suppose $ C_1$ 's delayed $ WRITE$ request on blocks $ [3 - 7]$ reaches the disk after $ R_1$ but before $ R_2$ , in which case only the latter would reflect the effects of $ C_1$ 's update. Hence, although individual I/O requests are processed by $ D$ as atomic units, their inconsistent interleaving would cause $ C_2$ to observe and act upon a partial update from $ C_1$ , which can be viewed as a violation of data safety.

As an alternative to heartbeat failure detection, a lease-based mechanism [26] can be used to coordinate clients' accesses in the above example, but precisely the same problematic scenario would arise when clocks are not synchronized. When $ C_1$ crashes and its lease expires, the lease manager could grant it to $ C_2$ prior to the arrival of the last $ WRITE$ from $ C_1$ to the storage target. Since the target does not coordinate with the lease manager, it fails to establish the fact that an incoming request from $ C_1$ is inconsistent with the current lease ownership state.

Scenario 2: Clustered applications and middleware services commonly need to enforce transactional semantics on updates to application state and metadata. In a shared-disk clustered environment, distributed transactions have traditionally been supported by two-phase locking in conjunction with a distributed write-ahead logging (WAL) protocol. In the abstract, the system maintains a snapshot of application state along with a set of per-client logs (also on shared disks) that record Redo and/or Undo information for every transaction along with its commit status. During failure recovery, the system must examine the suspected client's log and restore consistency by rolling back all uncommitted updates and replaying all updates associated with committed transactions that may not have been flushed to the snapshot prior to the failure. An essential underlying assumption here is that once log recovery is initiated, no additional $ WRITE$ requests from the suspected process will reach the snapshot. A violation of this assumption could result in the corruption of logs and application data.

Ensuring data safety in a shared-disk environment has traditionally required a set of partial synchrony assumptions to allow reliable heartbeat-driven failure detection and/or leases. For example, lease-based mechanisms typically expect bounded clock drift rates and message delivery delays to ensure the absence of in-flight I/O requests upon lease termination. However, these assumptions are probabilistic at best and since application data integrity is predicated on the validity of these assumptions, failure timeouts must be tuned to a very conservative value to account for worst-case delays in switch queues and client-side buffering. Such (necessarily) pessimistic timeouts may have a profoundly negative impact on failure recovery times - one of the common criticisms of SAN-oriented applications[16].

Another serious limitation of today's SAN applications concerns liveness. The DLM (or lease manager) represents an additional point of failure and while various fault tolerance techniques can be applied to improve its availability, the very nature of the semantics enforced by the DLM places a fundamental constraint on the overall system availability. For instance, multiple lock manager replicas can be deployed in a cluster, but mutual exclusion can be guaranteed only if clients' requests are presented to them in a consistent order, which necessitates consensus mechanisms such as Paxos [31]. Alternatively, a single lock manager instance can be elected dynamically [27] from a group of candidates and in this case, ensuring mutual exclusion necessitates global agreement on the lock manager's identity. In both cases, reaching agreement fundamentally requires access to an active primary component - typically a majority of nodes. As a result, a large-scale node failure or a network partition that renders the primary component unavailable or unreachable may bring about a system-wide outage and complete loss of service.

To summarize, today's SAN applications and middleware face significant limitations along the dimensions of safety and liveness. At present, several hardware-assisted techniques, such as out-of-band power management (STOMITH) [3], SAN fabric fencing [1], and SCSI-3 PR [11] can be employed to mitigate some of these issues. These mechanisms help reduce the likelihood of data corruption under common failure scenarios, but do not provide the desired assurances of safety and liveness in the general case and, as we argue, do not address the underlying problem. We observe that the underlying problem may be a case of capability mismatch between "intelligent" application processes, which possess full knowledge of the application's data structures, their disk layout, and consistency semantics, and relatively "dumb" storage devices. The safety and liveness problems illustrated above can be attributed to a disk controller's inability to identify and appropriately react to various application-level events such as lock release, failure suspicion, and failure recovery actions.


3 Minuet Design

At a high level, our approach reexamines the correctness guarantees that a cluster DLM service must provide to applications. Traditionally, DLMs tend to treat shared application resources as purely abstract entities and enforce the mutual exclusion property: no two clients may simultaneously hold conflicting locks on the same shared resource. We note, however, that the mutual exclusion property as stated above is provably unattainable in an asynchronous system that is subject to even a single crash failure - a consequence of the impossibility of consensus [23] in such an environment. Furthermore, as we explained in the previous section, a hypothetical lock service that does offer such guarantees would not by itself suffice to guarantee data safety in such a setting due to the possibility of out-of-order I/O request delivery.

Rather than restricting access to critical code sections, our approach views the access coordination problem in terms of I/O request ordering guarantees that the storage system must provide to application processes. We refer to this alternate notion of correctness as session isolation.

We define this correctness property in formal terms below and then present a protocol that achieves session isolation with the help of guard logic. Finally, we demonstrate how distributed multi-resource transactions can be supported using session isolation as a building block.


3.1 Session isolation

Throughout this paper, we will use the term resource to denote the basic logical unit of concurrency control. Each resource $ R$ is identified by a unique and persistent application-level identifier (denoted $ R.resID$ ) and has some physical representation on a SAN-attached storage device, which we call its owner ($ R.owner$ ). More concretely, a resource may represent a filesystem block, a database table, or an individual tuple in a table. An application process operates on $ R$ by issuing $ READ$ /$ WRITE$ commands to $ R.owner$ , as well as by acquiring and releasing locks on $ R.resID$ . We begin by defining the notion of session to a shared resource and describing the session isolation criterion.

Definition 1   If a client process $ C$ requests a $ Shared$ lock on $ R$ and the request is granted by the lock service, we say that $ C$ establishes a shared session to $ R$ . An existing shared session is terminated when $ C$ releases the $ Shared$ lock (i.e., downgrades to $ None$ ). Analogously, by acquiring an $ Excl$ lock, a client establishes an exclusive session to $ R$ that can subsequently be terminated by downgrading to $ Shared$ or $ None$ .

We define $ Sessions(T, C, R)$ to be the set of all sessions to $ R$ from $ C$ active at time $ T$ , which is determined solely by the sequence of $ C$ 's prior upgrade and downgrade requests to the lock service. $ Sessions(T, C, R)$ may contain a shared or an exclusive session to $ R$ , or both, or none.

We say that a shared session conflicts with every exclusive session to the same resource $ R$ and an exclusive session conflicts with every other session to $ R$ .

Definition 2   If a client process $ C$ issues at time $ T$ a disk request $ r$ that operates on shared resource $ R$ , we say that $ r$ belongs to session $ S$ if $ S \in {Sessions(T, C, R)}$ . For a given session $ S$ , we additionally define $ Requests(S)$ to be the set of all disk requests that belong to $ S$ .

Definition 3   A given global execution history satisfies session isolation with respect to $R$ if the sequence of disk request messages $M = \langle r_1, r_2, \ldots \rangle$ observed and processed in this history by $R.owner$ satisfies the following condition: for all $r_i, r_j \in M$ such that $\{r_i, r_j\} \subset Requests(S)$ for some session $S$, there exists no $r_k \in M$ with $i < k < j$ such that $r_k \in Requests(S^*)$ for a session $S^*$ from another client that conflicts with $S$.

Informally, the above condition requires $ R.owner$ to observe the prefixes of all sessions to $ R$ in strictly serial order, ensuring that no two requests in a client's session are interleaved by a conflicting request from another client. To illustrate this definition, consider a pair of concurrent request sequences shown in Figure 1. In this scenario, the following two orderings of request observations by the owner of shared resource $ X$ would satisfy session isolation:

$E_1 = \langle R_{1.1}, R_{1.2}, W_{1.1}, W_{1.2}, R_{1.3}, R_{1.4}, R_{2.1}, W_{2.1}, W_{2.2} \rangle$
$E_2 = \langle R_{1.1}, R_{1.2}, W_{1.1}, R_{2.1}, W_{2.1}, W_{2.2} \rangle$

However, an execution history that causes the owner to observe $\langle R_{1.1}, R_{2.1}, R_{1.2}, W_{1.1}, W_{2.1} \rangle$ does not obey session isolation because it permits $R_{2.1}$ and $W_{2.1}$, two shared-session requests from $C_2$, to be interleaved by $W_{1.1}$, an exclusive-session request from $C_1$.
Figure 1: Concurrent request streams to a shared resource $ X$ from two client processes, $ C_1$ and $ C_2$ . In this example, $ C_1$ first performs two $ READ$ operations on $ X$ under the protection of a $ Shared$ lock, then upgrades to $ Excl$ and issues two $ WRITE$ s. Lastly, $ C_1$ downgrades its lock to $ Shared$ and performs two more $ READ$ s. Client $ C_2$ acquires a $ Shared$ lock on $ X$ and submits a $ READ$ request, followed by an upgrade to $ Excl$ and two $ WRITE$ requests.

Note that session isolation is more permissive than strict mutual exclusion and in particular, permits execution histories in which two clients simultaneously hold conflicting locks on the same shared resource. At the same time, one could argue that these semantics meaningfully capture the essence of shared-disk locking, by which we mean that the request ordering guarantees provided by session isolation are precisely those that application developers have come to expect from a traditional DLM. To see this, observe that in the previous example, a conventional lock service offering full mutual exclusion would cause $X$ to observe $E_1$ by granting clients' requests in the order $\langle C_1(Shared), C_1(Excl), C_2(Shared), C_2(Excl) \rangle$. Likewise, $E_2$ corresponds to a possible failure scenario in which $C_1$ crashes after acquiring its locks, causing the DLM to reclaim them and grant ownership to $C_2$.


3.2 Guard

Our core approach is inspired by earlier work on bridging the intelligence gap between applications and block storage devices [17,25], as well as earlier proposals for device-based synchronization[12, 29]. We augment SAN-attached disks with a small application-independent component, which we call a guard, that enforces the session isolation invariant on the stream of incoming I/O commands. We associate a session identifier ($ SID$ ) with every client session to a shared resource and modify the storage protocol stack on the initiators to annotate all outgoing disk commands with the current $ SID$ for the respective resource. Below, we refer to this additional state in the command header as session annotation.

A session annotation for a disk command operating on $ R$ has two components: a session verifier and a session update, denoted by $ R.verifySID$ and $ R.updateSID$ , respectively. For commands that belong to an existing session, the verifier enables the target to confirm session validity prior to accepting the command and $ updateSID$ is used by the initiator to signal the start of a new session.

For each shared resource $ R$ , its owner device maintains a local session identifier (denoted $ R.ownerSID$ ) on persistent storage. Upon receipt of an I/O command from an initiator, the owner invokes the guard, which evaluates the command's session annotation against $ R.ownerSID$ and determines whether session isolation would be preserved by accepting the command. Functionally, the guard operation is a form of compare-and-set and we describe this operation in detail in Section 3.3.

If an incoming I/O request fails verification, the target drops the request from its input queue and notifies the initiator via a special status code EBADSESSION. From an application developer's point of view, session rejection appears as a failed disk request along with an exception notification from the lock service indicating that a lock on the respective resource is no longer valid.

The guard situated at the target devices addresses the safety problems caused by delayed messages and inconsistent failure observations that plague asynchronous distributed environments. Enforcing safety at the target device also permits us to simplify the core functionality of the DLM module: in our scheme, the primary purpose of the lock service is to ensure an efficient assignment of session identifiers to clients, one that minimizes the frequency of command rejection for a given application workload.

Decoupling correctness from performance in this manner enables substantial flexibility in the choice of mechanism used to control the assignment of session identifiers. At one extreme is a purely optimistic technique, whereby every client selects its SIDs via an independent local decision, without attempting to coordinate with the rest of the cluster; this can be an entirely reasonable strategy for applications and workloads characterized by a consistently low rate of data contention. A traditional DLM service that serializes all session requests at a central lock server can be viewed as a design point at the other extreme. Minuet aims to position itself in the continuum between these extremes and allow application developers to trade off lock service availability, synchronization overhead, and I/O performance.


3.3 Enforcing session isolation

Minuet uses a simple timestamp-based mechanism to guarantee the session isolation invariant. A client's session to a given resource $ R$ is identified by a value pair $ \langle T_s, T_x \rangle$ specifying a shared and an exclusive timestamp, respectively. These timestamps are globally unique - no two sessions from distinct clients are identified using the same pair of values and no client is assigned the same value pair twice. To ensure global uniqueness, we use the following timestamp format: $ \langle T.incNum.clntID \rangle$ , where $ clntID$ uniquely identifies the client process and $ incNum$ is the client's incarnation number - a monotonic counter ensuring uniqueness across crashes.
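
To make this format concrete, the following C++ sketch shows one possible representation of such globally-unique timestamps. The struct layout, field widths, and names are assumptions made for illustration only; they are not taken from the Minuet implementation.

  #include <cstdint>
  #include <tuple>

  // Sketch of a globally-unique timestamp <T.incNum.clntID>. Field names and
  // integer widths are illustrative assumptions, not Minuet's actual layout.
  struct Timestamp {
      uint64_t t      = 0;   // logical timestamp value T
      uint32_t incNum = 0;   // client incarnation number (monotonic across crashes)
      uint32_t clntID = 0;   // unique client process identifier

      // Lexicographic comparison: ties on T are broken by the incarnation number
      // and client id, so no two clients ever produce an identical timestamp.
      bool operator<(const Timestamp& o) const {
          return std::tie(t, incNum, clntID) < std::tie(o.t, o.incNum, o.clntID);
      }
      bool operator==(const Timestamp& o) const {
          return t == o.t && incNum == o.incNum && clntID == o.clntID;
      }
  };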


Client-side state:  For each shared resource $ R$ , a client $ C$ maintains a pair of session identifiers for its shared and exclusive sessions to $ R$ , denoted by $ R.sharedSID$ and $ R.exclSID$ , respectively. Additionally, $ R.curSType$ identifies the current session type, one of $ \{None, Shared, Excl\}$ , and $ R.contSType$ holds the client's session continuation type. The latter value is used by the target device to verify (prior to executing a request from $ C$ ) that its existing session has not been broken by a conflicting request from another client. Finally, every client $ C$ maintains an estimate of the largest shared and exclusive timestamp values previously assigned to a session identifier for any client, which we denote by $ R.maxT_s$ and $ R.maxT_x$ . Initially, $ R.sharedSID = R.exclSID = NIL$ , $ R.curSType = R.contSType = None$ , and $ R.maxT_s = R.maxT_x = 0$ . The steps and states of the basic locking and storage access protocols are illustrated in Figure 2.

Figure 2: Protocol messages and per-resource state at application clients, lock managers, and shared storage devices.
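
As a companion to Figure 2, the following sketch spells out the per-resource client-side state listed above, reusing the Timestamp type from the previous sketch. The concrete layout is an assumption; the field names follow the paper's notation.

  // Per-resource client-side state (Section 3.3). The enum and struct layout
  // are assumptions; the field names follow the notation used in the text.
  enum class SType { None, Shared, Excl };

  // A session identifier is a pair <T_s, T_x> of globally-unique timestamps.
  struct SessionID {
      Timestamp Ts;            // shared timestamp T_s
      Timestamp Tx;            // exclusive timestamp T_x
      bool      nil = true;    // NIL until a session is established
  };

  struct ClientResourceState {
      SessionID sharedSID;                // identifier of the current shared session
      SessionID exclSID;                  // identifier of the current exclusive session
      SType     curSType  = SType::None;  // current session type
      SType     contSType = SType::None;  // session continuation type
      Timestamp maxTs;                    // largest shared timestamp observed so far
      Timestamp maxTx;                    // largest exclusive timestamp observed so far
  };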


Acquiring locks:  To acquire/upgrade a lock on resource $ R$ , a client $ C$ proposes a unique session timestamp pair $ \langle proposedT_s, proposedT_x \rangle$ to the lock manager. To acquire a $ Shared$ lock on $ R$ , $ C$ sets $ proposedT_x \leftarrow R.maxT_x$ and sets $ proposedT_s$ to some unique timestamp greater than $ R.maxT_s$ . The client then sends an $ UpgradeLock$ request to the lock manager, specifying the desired mode ($ Shared$ ) and the proposed timestamp pair. The lock manager accepts and enqueues this request if no request with a larger $ proposedT_x$ value has been accepted. Otherwise, the manager denies the request and responds with $ UpgradeDenied$ , which includes the largest timestamp values observed by the manager. In the latter case, the client updates its local estimates $ \langle R.maxT_s, R.maxT_x \rangle$ and submits a new proposal. After accepting and enqueuing $ C$ 's request, the lock manager eventually grants it and responds with $ UpgradeGranted$ . The client then sets $ R.curSType \leftarrow Shared$ and initializes the shared session identifier: $ R.sharedSID \leftarrow \langle proposedT_s, proposedT_x \rangle$ .

To upgrade a lock from $ Shared$ to $ Excl$ , the client sends $ UpgradeLock$ to the lock manager after setting $ proposedT_s \leftarrow R.maxT_s$ and $ proposedT_x$ to some unique timestamp greater than $ R.maxT_x$ . Upon receiving $ UpgradeGranted$ from the lock manager, the client sets $ R.curSType \leftarrow Excl$ and $ R.exclSID \leftarrow\langle proposedT_s , proposedT_x \rangle$ . Upgrading from $ None$ to $ Excl$ is functionally equivalent to acquiring a $ Shared$ lock and then upgrading to $ Excl$ , but as an optimization, these operations can be combined into a single request to the lock manager.
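
The propose-and-retry structure of the upgrade path described above is sketched below, building on the previous two sketches. The reply type, sendUpgradeLock, and newTimestampAbove are hypothetical stand-ins for Minuet's actual RPC interface and timestamp allocator.

  #include <algorithm>
  #include <string>

  // Hypothetical manager reply; not Minuet's actual wire format.
  struct UpgradeReply {
      bool      granted;       // UpgradeGranted vs. UpgradeDenied
      Timestamp maxTs, maxTx;  // largest timestamps observed by the manager
  };

  // Assumed helpers: an RPC stub and an allocator of unique timestamps > floor.
  UpgradeReply sendUpgradeLock(const std::string& resID, SType mode,
                               Timestamp proposedTs, Timestamp proposedTx);
  Timestamp newTimestampAbove(Timestamp floor, uint32_t incNum, uint32_t clntID);

  // Acquire a Shared lock on resID: propose <T_s > maxT_s, T_x = maxT_x> and
  // retry with fresher timestamps whenever the manager reports larger ones.
  void acquireShared(ClientResourceState& r, const std::string& resID,
                     uint32_t incNum, uint32_t clntID) {
      for (;;) {
          Timestamp proposedTs = newTimestampAbove(r.maxTs, incNum, clntID);
          Timestamp proposedTx = r.maxTx;
          UpgradeReply rep = sendUpgradeLock(resID, SType::Shared,
                                             proposedTs, proposedTx);
          if (rep.granted) {
              r.curSType      = SType::Shared;
              r.sharedSID.Ts  = proposedTs;
              r.sharedSID.Tx  = proposedTx;
              r.sharedSID.nil = false;
              return;
          }
          // UpgradeDenied: adopt the manager's larger estimates and re-propose.
          r.maxTs = std::max(r.maxTs, rep.maxTs);
          r.maxTx = std::max(r.maxTx, rep.maxTx);
      }
  }

The Shared-to-Excl upgrade is symmetric: it proposes a fresh exclusive timestamp above $maxT_x$ while keeping $proposedT_s = maxT_s$, and installs $R.exclSID$ on success.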


Accessing shared storage:  After establishing a session to $R$ by acquiring a corresponding lock, the client can proceed to issue disk requests that operate on the content of $R$. Each outgoing request is augmented with a session annotation that enables the target device to verify proper ordering of requests and enforce session isolation. The annotation carries a tuple of the form $\langle R.resID, R.verifySID, R.updateSID \rangle$ and is initialized as shown in Figure 3.

Figure 3: Disk request submission and guard logic pseudocode.

Upon receipt of a disk request from a client, the owner device invokes the guard logic, which evaluates the session annotation as specified in Figure 3. In the event of rejection, the owner immediately discards the command and sends an $ EBADSESSION$ response to the client, together with a response annotation carrying $ \langle R.ownerSID \rangle$ . Otherwise, the owner executes (or enqueues) the command and updates its local session identifier as shown in the figure.
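
Since the pseudocode of Figure 3 is not reproduced here, the sketch below gives one plausible reading of the guard's compare-and-set check, reconstructed from the prose in this section and reusing the types from the earlier sketches. In particular, how the guard distinguishes shared-session commands from exclusive-session ones (the exclusive flag below) is an assumption of this sketch; the authoritative rules are those of the original figure and [22].

  // Per-command session annotation carried in the command header (Section 3.2).
  // The explicit 'exclusive' flag is an assumption of this sketch.
  struct SessionAnnotation {
      std::string resID;      // application-level resource identifier
      SessionID   verifySID;  // session the command claims to belong to
      SessionID   updateSID;  // session state to install if the command is accepted
      bool        exclusive;  // does the command belong to an exclusive session?
  };

  // Per-resource state kept persistently by the owner device.
  struct OwnerState { SessionID ownerSID; };

  // Returns true if accepting the command preserves session isolation; on
  // acceptance, ownerSID advances to the component-wise maximum. On rejection
  // the device replies EBADSESSION and attaches ownerSID in the response.
  bool guardCheck(OwnerState& owner, const SessionAnnotation& a) {
      // A newer exclusive session has been observed: all older sessions are broken.
      if (a.verifySID.Tx < owner.ownerSID.Tx) return false;
      // A newer shared session has been observed: exclusive sessions are broken.
      if (a.exclusive && a.verifySID.Ts < owner.ownerSID.Ts) return false;

      owner.ownerSID.Ts  = std::max(owner.ownerSID.Ts, a.updateSID.Ts);
      owner.ownerSID.Tx  = std::max(owner.ownerSID.Tx, a.updateSID.Tx);
      owner.ownerSID.nil = false;
      return true;
  }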

Upon receipt of an $EBADSESSION$ status code, the initiator examines the response annotation and notifies the application process that its lock and session on $R$ are no longer valid. The condition $R.verifySID.T_s < R.ownerSID.T_s$ indicates interruption of an exclusive session, in which case the client downgrades its lock to $Shared$, sets $\langle R.curSType, R.contSType \rangle \leftarrow Shared$, and sets $R.exclSID \leftarrow NIL$. A $Shared$ lock is further downgraded to $None$ if $R.verifySID.T_x < R.ownerSID.T_x$ (since in this case, a conflicting exclusive-session request has been accepted). In this situation, the client sets $\langle R.curSType, R.contSType \rangle \leftarrow None$ and $R.sharedSID \leftarrow NIL$. In both cases, the maximum timestamp estimates $R.maxT_s$ and $R.maxT_x$ are updated to reflect the most recent timestamps observed by the owner.

Upon receiving a $ SUCCESS$ status code, the client sets $ R.contSType \leftarrow R.curSType$ and updates the shared session identifier to reflect the most recent value in the annotation: $ R.sharedSID \leftarrow R.updateSID$ . (This step is necessary to ensure that a shared session remains valid after a $ Shared \rightarrow Excl$ upgrade or a downgrade to $ Shared$ ).


Downgrading locks:  To downgrade an existing lock from $ Excl$ to $ Shared$ , the client sends a DowngradeLock request to the lock manager and resets the exclusive-session state: $ R.exclSID \leftarrow NIL$ , $ \langle R.curSType, R.contSType \rangle \leftarrow Shared$ . Similarly, to downgrade from $ Shared$ to $ None$ , the client notifies the lock manager and sets $ R.sharedSID \leftarrow NIL$ , $ \langle R.curSType, R.contSType \rangle \leftarrow None$ . Upon receipt of a $ DowngradeLock$ request, the manager updates the ownership state for $ R$ and, if possible, grants the lock to the next waiter in the queue.


Correctness:  The locking protocol and the guard described above guarantee session isolation and a formal correctness argument can be found in [22]. Informally, consider two clients $ C_1$ and $ C_2$ that compete for shared and exclusive access to $ R$ , respectively, and suppose that a shared-session request from $ C_1$ gets accepted with $ R.updateSID = \langle T_s^1, T_x^1 \rangle$ in its annotation. Observe that due to global uniqueness of session proposals, the owner of $ R$ would subsequently accept an exclusive-session request from $ C_2$ with verifier $ R.verifySID = \langle T_s^2, T_x^2 \rangle$ only if $ T_x^2$ is strictly greater than $ T_x^1$ . In this case, subsequent shared-session requests from $ C_1$ would fail verification, causing $ C_1$ to observe $ EBADSESSION$ and downgrade its lock. Thus, session isolation would be preserved in this example via a forced termination of $ C_1$ 's session. A similar argument demonstrates that no two exclusive-session commands can be interleaved by a conflicting command from another client.


3.4 Supporting distributed transactions


3.4.1 Overview and design requirements

Transactions are widely regarded as a useful programming primitive and traditionally, SAN-oriented applications implement transactional semantics using two-phase locking for isolation and a write-ahead logging (WAL) facility (sometimes referred to as journaling) for atomicity and durability. To support transactions, Minuet relies on these well-understood and widely-used mechanisms, while extending them with the use of the guard to address the safety problems outlined in Section 2.2. Since the primary focus of this paper is the feasibility of safe and highly-available applications in SANs rather than performance, we provide only a subset of the features typically found in a state-of-the-art transaction service such as D-ARIES [38]. Below, we present a design that implements $Redo$-only logging to support the "no-force, no-steal" buffer policy. Currently, our design permits only one active transaction per process at a time: after starting a transaction, a client must commit or abort before initiating the next transaction. Finally, we assume unbounded log space for each client. These restrictions allow us to focus the discussion on the novel aspects of our approach and we believe that additional optimizations, such as support for $Undo$ logging, can easily be retrofitted onto our scheme if necessary. The following set of requirements motivates our design:

(1) Avoid introducing assumptions of synchrony required by conventional transaction schemes for SAN environments. We rely on the guard at target devices to provide session isolation and protect the state on disk from the effects of arbitrarily-delayed I/O commands operating on the application data and the log.

(2) Eliminate reliance on strongly-consistent locking. Rather than requiring clients to coordinate concurrent activity via a strongly-consistent DLM, the guard at storage devices enables a limited form of isolation and permits us to relax the degree of consistency required from the lock service. Prior to committing a transaction, a client process in Minuet issues an extra disk request, which verifies the validity of all locks acquired at the start of the transaction. This mechanism allows us to identify and resolve cases of conflicting access due to inconsistent locking state at commit time and can be viewed as a variant of optimistic concurrency control - a well-known technique from the DBMS literature[30].

(3) Avoid enforcing a globally-consistent view of process liveness. Rather than relying on a group membership service to detect client failures and initiate log recovery proactively in response to perceived failures, our design explores a lazy approach to transaction recovery that postpones the recovery action until the affected data is accessed. This enables Minuet to operate without global agreement on group membership.


3.4.2 Basic transaction protocol

Minuet stores transaction $Redo$ information in a set of per-client logs on shared disks. The physical location of a client's log can be computed from its client identifier ($clntID$). These logs appear to Minuet's transaction module as regular lockable resources that can be read from and written to, while the guard is assumed to enforce session isolation in the event of concurrent access from multiple clients.

To support transactions, we extend the basic session isolation machinery described in Section 3.3 with an additional piece of state called a commit session identifier (CSID) of the form $ \langle$ clntID, xactID$ \rangle$ . We extend the format of a session annotation to include two commit session identifiers, denoted $ verifyCSID$ and $ updateCSID$ , and both are set to NIL unless specified otherwise. For each shared resource $ R$ , the owner device maintains a local commit session identifier (R.ownerCSID) as well as $ R.ownerSID$ . Conceptually, the value of R.ownerCSID at a particular point in time identifies the most recent transaction that may have updated $ R$ and committed without flushing its changes to the disk image of $ R$ . If R.ownerCSID $ \neq$ NIL, the current state of $ R$ on disk may be missing updates from a committed transaction and thus cannot be assumed valid. In this case, $ R.ownerCSID.clntID$ identifies the client process responsible for the latest transaction on $ R$ and it is used to locate the corresponding log for recovery purposes.

Upon receiving a disk request, the guard examines the annotation and rejects the request if $ R.verifyCSID.clntID \neq R.ownerCSID.clntID$ or if $ R.verifyCSID.xactID < R.ownerCSID.xactID$ . A request is accepted only if its $ verifySID$ and $ verifyCSID$ both pass verification and upon completing the request, the owner device updates its local commit session identifier by setting $ R.ownerCSID \leftarrow R.updateCSID$ . If verification fails, the owner responds with $ EBADSESSION$ and attaches the tuple $ \langle R.ownerSID, R.ownerCSID \rangle$ in a response annotation.
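
For illustration, the commit-session check can be layered on top of the session-isolation guard sketched in Section 3.3. The CSID encoding below, including the explicit NIL flag and the treatment of NIL as matching only NIL, is an assumption of this sketch rather than Minuet's exact rule.

  #include <cstdint>

  // Commit session identifier <clntID, xactID>; NIL means the disk image is clean.
  struct CSID {
      uint32_t clntID = 0;
      uint64_t xactID = 0;
      bool     nil    = true;
  };

  // Extended guard check for the transaction protocol. The session-isolation
  // check of Section 3.3 is assumed to have passed already; on success the
  // owner installs updateCSID, and a NIL updateCSID marks the image clean again.
  bool commitGuardCheck(CSID& ownerCSID, const CSID& verifyCSID, const CSID& updateCSID) {
      bool sameClient = (verifyCSID.nil && ownerCSID.nil) ||
                        (!verifyCSID.nil && !ownerCSID.nil &&
                         verifyCSID.clntID == ownerCSID.clntID);
      if (!sameClient)
          return false;                               // reject with EBADSESSION
      if (!verifyCSID.nil && verifyCSID.xactID < ownerCSID.xactID)
          return false;                               // names an older transaction
      ownerCSID = updateCSID;
      return true;
  }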

In Minuet, transactions proceed in five stages: Begin, Read, Update, Prepare, and Commit; we illustrate them using high-level pseudocode in [22]. During one-time client initialization, Minuet's transaction service locks the local client's log in $Excl$ mode. To begin a new transaction $T$, the client selects a new transaction identifier ($curXactID$) via a monotonic local counter and appends a BeginXact record to its log. Next, in the Read phase of a transaction, the application process acquires a $Shared$ lock on every resource in $T.readSet$ and reads the corresponding data from shared disks into local memory buffers. In the Update phase that follows, the client acquires $Excl$ locks on the elements of $T.writeSet$, applies the desired set of updates locally, and communicates a description of updates to Minuet's transaction service, which appends the corresponding set of $Update$ records to the log. Each such record describes an atomic mutation on some resource in $T.writeSet$ and essentially stores the parameters of a single disk $WRITE$ command.

The Prepare phase serves a dual purpose: to verify the validity of client's sessions (and hence, the accuracy of cached data) and to lock the elements of the write set in preparation for committing. For each resource in $ T.readSet \cup T.writeSet$ , the client sends a special $ PREPARE$ request to its owner. Minuet implements $ PREPARE$ requests as zero-length $ READ$ s, whose sole purpose is to transport an annotation and invoke the guard. $ PREPARE$ requests for elements of $ T.writeSet$ carry $ verifyCSID = NIL$ and $ updateCSID = \langle C, curXactID \rangle$ in their annotations, where $ C$ is the client's identifier. If all $ PREPARE$ requests return $ SUCCESS$ , the transaction enters the final Commit phase, in which a CommitXact record is force-appended to client $ C$ 's log.

The protocol outlined above ensures transaction isolation, identifying cases of conflicting access during the Prepare phase. Recall, however, that under the session isolation semantics, any I/O command, including operations on the log, may fail with $ EBADSESSION$ due to conflicting access from other clients. This gives rise to several exception cases at various stages of transaction execution. For example, a client $ C$ may receive an error while forcing CommitXact to disk due to loss of session to the log. This can happen only if another process has initiated log recovery on $ C$ and hence, the active transaction must be aborted. Other failure cases and the corresponding recovery logic are described in the report[22].


Syncing updates to disk:  After committing a transaction, a client $ C$ flushes its locally-buffered updates to $ R$ simply by issuing the corresponding sequence of $ WRITE$ commands to its owner device. Each such command specifies in its annotation $ \{R.verifyCSID, R.updateCSID \} = \langle C, syncXactID \rangle$ , where $ syncXactID$ denotes $ C$ 's most recent committed transaction that modified $ R$ . After flushing all committed updates, $ C$ issues an additional zero-length $ WRITE$ request, which specifies $ R.verifyCSID = \langle C, syncXactID \rangle$ and $ R.updateCSID = NIL$ in the annotation. This request causes the device to reset $ R.ownerCSID$ to $ NIL$ , effectively marking the disk image of $ R$ as "clean". Lastly, $ C$ appends to its log an $ UpdateSynced$ record of the form $ \langle R, syncXactID \rangle$ .
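
A minimal sketch of this flush sequence is shown below, building on the CSID sketch above. The annotated-I/O and logging helpers (annotatedWrite, appendUpdateSynced) and the BufferedUpdate record are hypothetical; error handling, including $EBADSESSION$ on any of these commands, is omitted.

  #include <cstddef>
  #include <cstdint>
  #include <string>
  #include <vector>

  // A locally-buffered committed update: the parameters of one WRITE command.
  struct BufferedUpdate { uint64_t offset; std::vector<uint8_t> data; };

  // Assumed helpers: issue a WRITE carrying the given commit-session annotation,
  // and append an UpdateSynced record <R, syncXactID> to the local client's log.
  void annotatedWrite(const std::string& resID, uint64_t offset,
                      const uint8_t* buf, size_t len,
                      const CSID& verifyCSID, const CSID& updateCSID);
  void appendUpdateSynced(const std::string& resID, uint64_t syncXactID);

  // Flush the buffered updates of resource R committed by transaction syncXactID
  // of client clntID, then mark the on-disk image of R as clean.
  void syncResource(const std::string& resID, uint32_t clntID, uint64_t syncXactID,
                    const std::vector<BufferedUpdate>& updates) {
      CSID cur;  cur.clntID = clntID;  cur.xactID = syncXactID;  cur.nil = false;
      CSID nil;                                    // default-constructed: NIL

      for (const BufferedUpdate& u : updates)      // replay the committed WRITEs
          annotatedWrite(resID, u.offset, u.data.data(), u.data.size(), cur, cur);

      // Zero-length WRITE that resets R.ownerCSID to NIL ("clean").
      annotatedWrite(resID, 0, nullptr, 0, cur, nil);

      appendUpdateSynced(resID, syncXactID);
  }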


Lazy transaction recovery:  A client $ C$ can initiate transaction recovery when its disk command on some resource $ R$ fails with $ EBADSESSION$ and a non-$ NIL$ value $ ownerCSID = \langle C_F, xactID\rangle$ is specified in the response annotation. This response indicates that the disk image of $ R$ may be missing updates from a transaction committed earlier by another client $ C_F$ . If $ C$ suspects that $ C_F$ has failed, it invokes a local recovery procedure that tries to repair the disk image of $ R$ . First, $ C$ acquires exclusive locks on $ R$ and $ C_F.Log$ and reads the log from disk. Next, $ C$ searches the log for the most recent transaction that has successfully flushed its updates to $ R$ , from which it determines the list of subsequent committed updates that may be missing from the disk image. The client then proceeds to repairing the state of $ R$ on disk by reapplying these updates and all $ WRITE$ requests sent to the owner during this phase specify $ \{R.verifyCSID, R.updateCSID \} = R.ownerCSID$ in the annotation. Finally, after reapplying all missing updates, $ C$ completes recovery by issuing a zero-length $ WRITE$ annotated with $ R.verifyCSID = R.ownerCSID$ , $ R.updateCSID = NIL$ . A more detailed discussion of transaction recovery in Minuet can be found in [22].


3.5 Lock manager replication

Some lock services seek to achieve fault tolerance by replicating lock managers. Since Minuet does not need to provide assurances of mutual exclusion, it relies on a simpler and more available replication scheme that permits clients to continue making progress in the face of extensive node and connectivity failures. A lock can be acquired as long as at least one manager instance is reachable. In an extreme case, that instance can be the local Minuet client itself, which would simply grant its own proposals without coordinating with other processes.

To support manager replication, we extend the basic locking protocol presented in Section 3.3 as follows: When acquiring or upgrading a lock, a client selects a subset of managers, which we call its voter set, and sends an $ UpgradeLock$ request to all members of this set. The lock is considered granted once $ UpgradeGranted$ votes are collected from all members. If any of the voters respond with $ UpgradeDenied$ due to an outdated timestamp, the client downgrades the lock on all members that have responded with $ UpgradeGranted$ , updates its $ maxT_s$ and $ maxT_x$ values, and resubmits the upgrade request with a new timestamp proposal. As a performance optimization, we allow $ UpgradeLock$ requests to specify an implicit downgrade for an earlier timestamp.
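
The voter-set variant of the upgrade protocol is sketched below, reusing the client-side types and helpers from the sketches in Section 3.3; ManagerId and the per-manager RPC stubs are assumptions of this sketch.

  #include <vector>

  using ManagerId = int;   // hypothetical handle for a lock manager replica

  // Assumed per-manager RPC stubs.
  UpgradeReply sendUpgradeLockTo(ManagerId m, const std::string& resID, SType mode,
                                 Timestamp proposedTs, Timestamp proposedTx);
  void sendDowngradeTo(ManagerId m, const std::string& resID);

  // A lock is granted only when every member of the voter set votes
  // UpgradeGranted; any denial forces a downgrade at the voters that granted
  // and a retry with a fresher timestamp proposal.
  void acquireFromVoterSet(ClientResourceState& r, const std::string& resID,
                           SType mode, const std::vector<ManagerId>& voters,
                           uint32_t incNum, uint32_t clntID) {
      for (;;) {
          Timestamp pTs = (mode == SType::Shared)
                              ? newTimestampAbove(r.maxTs, incNum, clntID) : r.maxTs;
          Timestamp pTx = (mode == SType::Excl)
                              ? newTimestampAbove(r.maxTx, incNum, clntID) : r.maxTx;

          std::vector<ManagerId> granted;
          bool denied = false;
          for (ManagerId m : voters) {
              UpgradeReply rep = sendUpgradeLockTo(m, resID, mode, pTs, pTx);
              if (rep.granted) { granted.push_back(m); continue; }
              denied = true;                          // adopt the larger estimates
              r.maxTs = std::max(r.maxTs, rep.maxTs);
              r.maxTx = std::max(r.maxTx, rep.maxTx);
          }
          if (!denied) {                              // all voters granted
              SessionID& sid = (mode == SType::Excl) ? r.exclSID : r.sharedSID;
              sid.Ts = pTs;  sid.Tx = pTx;  sid.nil = false;
              r.curSType = mode;
              return;
          }
          for (ManagerId m : granted)                 // release the partial grants
              sendDowngradeTo(m, resID);
      }
  }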


4 Implementation

We have implemented a proof-of-concept prototype of Minuet based on the design presented in the preceding section. The prototype has been implemented on the Linux platform using C/C++ and consists of a client-side library, a lock manager process, an iSCSI protocol stack extension, and two sample parallel applications.

4.1 Core Minuet modules

4.1.0.1 Client-side library (5,440 LoC):

The client-side component is implemented as a statically-linked library and provides an event-driven interface to Minuet's core services, which include locking, remote disk I/O, and transaction execution. When requesting a lock, a client can optionally specify the desired size of the voter set, which allows application developers to tune the degree of locking consistency and choose between optimism and strict coordination. A small voter set works well for low-contention resources; it helps keep the lock message overhead low and permits clients to make progress in a partitioned network. Conversely, a large voter set requires connectivity to more manager replicas, but reduces the rate of I/O rejection under high contention. All outgoing disk commands are augmented with session annotations and in the event of rejection by the target device, a $ForcedDowngrade$ event is posted to inform the application that the corresponding lock has been downgraded to some weaker mode.

4.1.0.2 Minuet lock manager (4,285 LoC):

The lock manager process grants and revokes locks using the timestamp mechanism of Section 3.3 and several manager replicas can be deployed for fault tolerance. For each lockable resource, the manager maintains the current lock mode, the list of current holders, the queue of blocked upgrade requests, and the largest observed timestamp proposal.

4.1.0.3 SAN protocols and guard logic:

To demonstrate the practicality of our approach, we implemented the guard logic and session annotations within the framework of iSCSI[4], a widely-used transport for IP-based SANs, and our prototype extends an existing software-based implementation of the iSCSI standard. On application client nodes, we modified the top and the bottom levels of the 3-tier Linux SCSI driver model. The top-level driver (/drivers/scsi/sd.c) presents the abstraction of a generic block device to the kernel and converts incoming block requests into SCSI commands. We extended sd with a new ioctl call, which enables the Minuet client library to specify session annotations for outgoing requests and to collect response annotations.

The bottom-level driver implements a TCP encapsulation of SCSI and our current prototype builds on the Open-iSCSI Initiator driver [6] v2.0-869.2. We used the additional header segment (AHS) feature of iSCSI to attach Minuet annotations to command PDUs and defined a new AHS type for this purpose.

Our storage backend is based on the iSCSI Enterprise Target driver[5] v0.4.16, which exposes a local block device to remote initiators via iSCSI. We extended it with the guard logic, which examines incoming PDUs and makes an accept/reject decision based on the annotation. Command rejection is signaled to the initiator via the REJECT PDU defined by the iSCSI standard.

The addition of guard logic represents the most substantial extension to the SAN protocol stack, but incurs only a modest increase in the overall complexity. The initial implementation of the Enterprise Target driver contained 14,341 lines of code and augmenting it with Minuet guard logic required adding 348 lines.

4.2 Sample applications

4.2.0.1 Distributed chunkmap (342 LoC):

Our first application implements a read-modify-write operation on a distributed data structure comprised of a set of fixed-length data chunks. It mimics atomic mutations to a distributed chunkmap - a common scenario in clustered middleware such as filesystems and databases. The chunkmap could represent a bitmap of free disk blocks, an array of i-node structures, or an array of directory file slots. In each iteration, the application selects a random chunk, reads it from shared disk, modifies a random chunk region, and writes it back to disk. To ensure update atomicity, the application acquires an $Excl$ lock on the respective chunk from Minuet prior to reading it from disk and releases the lock after writing back the updated version.
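
The per-iteration logic amounts to the loop sketched below. The blocking MinuetClient wrapper, LockMode, and the helper functions are hypothetical stand-ins for the event-driven client library described in Section 4.1; the chunk geometry matches the configuration used in Section 5.2.

  #include <array>
  #include <cstddef>
  #include <cstdint>
  #include <random>
  #include <string>

  // Hypothetical blocking wrapper over Minuet's event-driven client library.
  enum class LockMode { None, Shared, Excl };
  class MinuetClient {
  public:
      void acquire(const std::string& resID, LockMode m);         // lock a resource
      void release(const std::string& resID);                     // unlock it
      void read(uint64_t offset, void* buf, size_t len);          // annotated READ
      void write(uint64_t offset, const void* buf, size_t len);   // annotated WRITE
  };

  std::string chunkResourceId(size_t chunk);                       // chunk -> resID
  void mutateRandomRegion(uint8_t* buf, size_t len, std::mt19937_64& rng);

  constexpr size_t kChunkSize = 8 * 1024;      // 8KB chunks (Section 5.2)
  constexpr size_t kNumChunks = 250000;        // ~2GB of logical disk space

  // One read-modify-write iteration of the chunkmap benchmark.
  void doOneOperation(MinuetClient& minuet, std::mt19937_64& rng) {
      std::uniform_int_distribution<size_t> pick(0, kNumChunks - 1);
      const size_t chunk = pick(rng);

      minuet.acquire(chunkResourceId(chunk), LockMode::Excl);      // Excl lock first

      std::array<uint8_t, kChunkSize> buf;
      minuet.read(chunk * kChunkSize, buf.data(), kChunkSize);     // read the chunk
      mutateRandomRegion(buf.data(), kChunkSize, rng);             // modify a region
      minuet.write(chunk * kChunkSize, buf.data(), kChunkSize);    // write it back

      minuet.release(chunkResourceId(chunk));                      // then release
  }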

4.2.0.2 Distributed B-Tree (3,345 LoC):

To demonstrate the feasibility of serializable transactions, we implemented a distributed B-link tree [32] (a variant of B+ tree) on top of Minuet. Our implementation provides transactional insert, delete, update, and search operations based on the protocol presented in Section 3.4.2. For each operation, the application initiates a transaction and fetches the chain of tree blocks necessary for the operation (Read phase). Next, it upgrades the locks on the modified blocks to $ Excl$ mode and logs the updates (Update phase). Lastly, the client Prepares and Commits the transaction. If a transaction aborts due to loss of session to a tree block or the client's log, the application reacquires the corresponding lock and retries (without backoff) until it commits successfully. For efficiency, clients retain $ Shared$ locks (and the content of cache buffers) across transactions and stale cache entries are detected and invalidated during the Prepare phase.


5 Evaluation

In this section, we evaluate the performance of our applications under different modes of locking. Due to space constraints, we present only key results that demonstrate the benefits of optimistic coordination enabled by Minuet and confirm the feasibility of our design. Several additional important measurements are reported in [22].

5.1 Experimental setup

For our experiments, we emulated a 39-node SAN environment interconnected via 100Mbps links using Emulab [41]; detailed hardware specifications are provided in Figure 4. Three of the nodes were configured to serve as Minuet lock managers and four additional nodes were used to emulate SAN-attached target devices, collectively providing 2GB of logical disk space, equally striped across the nodes. The remaining 32 nodes were configured as application clients. We ran each iteration of the experiment for 5 minutes and all of the values reported below are averages over 3 iterations.

Figure 4: Hardware specifications of the Emulab cluster.

                     Storage targets    Lock managers        Clients
  Number of nodes    4                  3                    32
  CPU                3GHz Xeon          850MHz Pentium III   850MHz Pentium III
  RAM                2GB                512MB                512MB
  Disk               10K RPM SCSI       7200 RPM IDE         7200 RPM IDE

In each iteration, we measure the aggregate goodput (the number of successful application-level operations per second) from all nodes and the rate of disk command rejection under the following locking configurations:

strong($ x$ ): We deploy a total of $ 2x-1$ lock managers and require clients to obtain permissions from a majority ($ x$ ). Note that strong(1) represents a traditional locking protocol with a single central lock manager, while strong(2) requires 3 lock manager replicas and masks one failure.
weak-own: An extreme form of weakly-consistent locking. Each client obtains permissions only from the local lock manager (co-located on the same machine) and does not attempt to coordinate with the other clients.

In all of our experiments, applications rely on Minuet to provide both modes of locking and do not make use of any other synchronization facilities.

5.2 Distributed chunkmap

In this experiment, we configured the chunk size to 8KB (for a total of 250K chunks) and ran the chunkmap application with 32 clients, varying the number of storage targets from 1 to 4. We considered two forms of workload:

uniform: In each operation, a chunk to be modified is selected uniformly at random.
hotspot($ x$ ): $ x\%$ of operations touch a hotspot region of the chunkmap constituting 0.1% of the entire dataset.

Table 1 reports the aggregate goodput under the uniform workload, which represents a low-contention scenario. The goodput exhibits linear scaling with the number of storage servers. Further, there is no measurable difference in performance between the three locking configurations. These results suggest that the optimistic method of coordination enabled by Minuet does not adversely affect application performance, while providing safety, in scenarios where the overall I/O load is high, but contention for a single resource is relatively rare.


Table 1: Chunkmap application goodput (in operations per second) under the uniform workload.

                            Number of storage targets
                            1        2        3        4
  strong(1) coordination    105.0    220.0    329.9    412.4
  strong(2) coordination    105.5    219.5    330.7    411.7
  weak-own coordination     105.9    220.9    331.1    410.6

The rate of I/O rejection increases when the workload has hotspots and, as expected, weak-own suffers a performance hit proportional to the popularity of the hotspot (Figure 5). We note that the hotspot workload represents a very stressful case (the hotspot size is 0.1%) and our results demonstrate that weakly-consistent locking degrades gracefully and can still provide reasonable performance in such scenarios.

Figure 5: Left: chunkmap goodput under the hotspot(x) workload for varying $ x$ . Right: the rate of rejected I/O requests under hotspot(x) for varying $ x$ .


We also ran experiments in a partitioned network scenario, where each client can communicate with only a subset of replicas. A strongly-consistent locking protocol demands a well-connected primary component containing at least a majority of manager replicas - a condition that our partitioned scenario fails to satisfy. As a result, no client can make progress with traditional strong locking and the overall application goodput is zero. In contrast, under Minuet's weak locking, clients can still make good progress. This demonstrates the availability benefits that Minuet gains over a traditional DLM design.

5.3 Distributed B+ tree

The B+ tree application demonstrates Minuet's support for serializable transactions. In this experiment, we start with a pre-populated tree and run the application for 5 minutes on 32 client nodes. Each client inserts a series of randomly-generated keys and we measure the aggregate goodput, defined as the total rate of successful insertions per second from all clients. To test Minuet's behavior under different transaction complexity and contention scenarios, we used two different pre-populated B+ trees, whose parameters are given in Figure 6.

Figure 6: Pre-populated B+ tree parameters.

                          tree-small    tree-large
  Node size               8KB           8KB
  Fanout                  150           150
  Number of keys          187,500       18,750,000
  Leaf node occupancy     50%           50%
  Tree depth              3             4
  Total dataset size      20MB          2GB

Figure 7(left) compares the performance of strong(1) and weak-own. Under both locking schemes, the throughput exhibits near-linear scaling with the number of storage targets. As expected, tree-large demonstrates a lower aggregate transaction rate because each transaction requires accessing a longer chain of nodes. Moreover, since the number of leaf nodes is large, read-write or write-write contention is relatively infrequent and hence, the performance penalty due to I/O rejection incurred under weak-own is negligible. By contrast, tree-small represents a high-contention workload and our results suggest that even in this stressful scenario, Minuet's weak locking incurs only a modest performance penalty.

Figure 7: Left: B+ tree application goodput with tree-small and tree-large datasets. Right: effects of the timestamp broadcast optimization with tree-small dataset.

Further investigation revealed that the primary cause of the performance degradation was an outdated estimate of maximum timestamps $ \langle maxT_s, maxT_x \rangle$ , causing some of the commands to carry outdated session identifiers (e.g., with $ verifySID.T_x < ownerSID.T_x$ ). Under weak-own, clients select session identifiers without coordinating with other clients and hence, a client may not know the up-to-date value of $ ownerSID.T_x$ that may have been set by an earlier transaction from another client.

A simple optimization that alleviates this issue is to let clients lazily synchronize their knowledge of the maximum timestamps. More specifically, each client can broadcast its local updates to $ maxT_s$ and $ maxT_x$ to other clients at some fixed broadcast rate ($ b$ ), and other clients can update their local estimates accordingly. We implemented this optimization and measured its effects on the tree-small workload with 4 storage targets. Figure 7(right) shows the results, which suggest that broadcasting with $ b \geq 0.2$ substantially reduces the rate of rejection, and the resulting goodput closely approaches the maximum value achievable under strong locking.
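
A minimal sketch of this optimization, assuming a hypothetical broadcast transport (bcast_send, on_bcast_recv) and 64-bit timestamps; the actual Minuet client library may represent this state differently:

    #include <stddef.h>
    #include <stdint.h>

    struct max_ts_estimate {
        uint64_t max_Ts;    /* largest shared-session timestamp seen locally    */
        uint64_t max_Tx;    /* largest exclusive-session timestamp seen locally */
    };

    static struct max_ts_estimate local_est;

    /* Invoked at the configured broadcast rate b: advertise the local estimate. */
    void broadcast_timestamps(void (*bcast_send)(const void *buf, size_t len)) {
        bcast_send(&local_est, sizeof(local_est));
    }

    /* Invoked when another client's advertisement arrives: keep the element-wise
     * maximum so that subsequent transactions pick fresher session identifiers.
     * Lost or reordered broadcasts only affect performance, never safety, since
     * stale sessions are still rejected by the storage targets. */
    void on_bcast_recv(const struct max_ts_estimate *remote) {
        if (remote->max_Ts > local_est.max_Ts) local_est.max_Ts = remote->max_Ts;
        if (remote->max_Tx > local_est.max_Tx) local_est.max_Tx = remote->max_Tx;
    }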

Note that this optimization affects only performance and is not required for safety. Conceptually, the broadcast rate $ b$ parameterizes the continuum between traditional conservative locking and the fully optimistic weak-own scheme; other methods of navigating this tradeoff may also be possible.


6 Discussion

In this section, we discuss several issues pertaining to the practical feasibility of our approach and the implications of Minuet's programming model.


Storage target modifications:  Our approach rests on the basic idea of extending SAN-attached storage targets with a small amount of guard logic that enables them to detect and filter out inconsistent I/O requests, and this will require storage array vendors to introduce a new feature into their products.

We acknowledge that Minuet relies on functionality that does not presently exist in standard storage hardware and, consequently, faces non-trivial barriers to standardization, implementation, and deployment. However, we observe that the proposed extensions are very incremental and can easily be retrofitted into an existing design. The guard logic is amenable to efficient implementation in hardware or firmware, requiring only a few table lookups and comparison operations.
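
As a rough illustration of this point, a simplified version of the per-command check might look as follows (our own sketch; it shows only the exclusive-session comparison discussed in Section 5.3 and omits the full shared/exclusive rules defined by the protocol):

    #include <stdbool.h>
    #include <stdint.h>

    /* The 16 bytes of per-resource state maintained by the target. */
    struct session_id {
        uint64_t Ts;    /* shared-session timestamp    */
        uint64_t Tx;    /* exclusive-session timestamp */
    };

    /* Simplified guard check: a command whose verifySID is older than the
     * resource's current ownerSID (e.g., verifySID.Tx < ownerSID.Tx) is
     * rejected, and the initiator is notified that its session is stale.
     * The real guard logic also handles shared sessions and annotation
     * updates, which are omitted here. */
    bool guard_admit(const struct session_id *ownerSID,
                     const struct session_id *verifySID) {
        if (verifySID->Tx < ownerSID->Tx)
            return false;    /* inconsistent session: reject the I/O request */
        return true;         /* consistent: forward the command to the disk  */
    }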

As we argue above, the benefits of implementing such an extension can be substantial. In addition to lifting the safety and liveness limitations that have traditionally characterized shared-disk applications and middleware, our approach establishes a new degree of freedom in the design space of SAN applications, enabling a choice between optimism and strict coordination.

Our investigation builds upon earlier work on device locking [12], which has demonstrated the practical feasibility of this approach and the willingness of storage hardware vendors to adopt a promising new feature [2].


Metadata storage overhead:  In our prototype implementation, target storage devices maintain $ 16$ bytes of per-resource metadata. For a typical middleware service such as a database or a filesystem, a resource would correspond to a single fixed-length block containing application data or metadata; taking a clustered filesystem as an example, block sizes in the 128KB - 1MB range are common [8]. Assuming a 128KB application block size, the table of Minuet session identifiers for a 1TB dataset would consume an additional 128MB.
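
To spell out the arithmetic: a 1TB dataset divided into 128KB blocks yields $ 2^{23} \approx 8.4 \times 10^6$ resources, and $ 2^{23} \times 16$ bytes $ = 128$ MB of session-identifier state.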

A further concern is that Minuet metadata must be stored in random-access memory for efficient lookup on the data path. We envision the use of flash memory or battery-backed RAM for this purpose and observe that today's high-performance storage arrays already make extensive use of NVRAM for asynchronous write caching [39]. Alternatively, the session state can be stored persistently on disk and a fixed-size NVRAM buffer can be used as a cache, providing efficient access to the working set.


Protocol extensions:  Our approach requires augmenting the format of $ READ$ and $ WRITE$ commands with session annotations and our prototype implementation extends the iSCSI protocol with a new AHS type for this purpose. A transport-level modification simplified our software implementation, but would be difficult to deploy in a production environment, since it would require modifying the HBAs. For a more easily-deployable solution, the required set of extensions can be implemented in a transport-independent manner at the SCSI command level. One option would be to use an extended command descriptor block (XCDB), as defined in SPC-4 ([13], section 4.3.4), and introduce a new descriptor extension type for carrying the session annotation. Likewise, command rejection can be signaled to the initiator via a $ CHECK$ $ CONDITION$ status code with a new additional sense code and the response annotation can be communicated as fixed-format sense data ([13], section 4.5.3).


Programming model:  Another concern is that Minuet imposes a different programming model, exposing application developers to additional exception cases that do not naturally arise under strong locking. When a traditional DLM service grants a lock to an application process, the lock is assumed to be valid and the client can proceed to access the disk without worrying about conflicting accesses from other clients. In contrast, Minuet gives out locks in a more permissive manner, but provides machinery for detecting and resolving conflicting accesses at the storage device. As a result, applications that rely on Minuet for concurrency control must be programmed with the assumption that any I/O request can fail with $ EBADSESSION$ due to inconsistent lock state.
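
The resulting code pattern is a retry loop around each guarded operation. The sketch below uses hypothetical client-library names (minuet_lock_excl, minuet_unlock, and an EBADSESSION status constant); it illustrates the shape of the exception handling rather than the actual Minuet API:

    #include <stdint.h>

    #define EBADSESSION 1001    /* illustrative status code, not a real errno */

    extern void minuet_lock_excl(uint64_t resource_id);
    extern void minuet_unlock(uint64_t resource_id);
    extern int  do_read_modify_write(uint64_t resource_id);  /* issues annotated I/O */

    /* Retry until the read-modify-write completes with a consistent session. */
    int update_chunk(uint64_t chunk_id) {
        for (;;) {
            minuet_lock_excl(chunk_id);
            int rc = do_read_modify_write(chunk_id);
            minuet_unlock(chunk_id);
            if (rc != EBADSESSION)
                return rc;    /* success, or a failure unrelated to locking */
            /* The target rejected our I/O because the session was stale;
             * re-acquire the lock (obtaining a fresher session) and retry. */
        }
    }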

We observe that while I/O rejection does not occur under conservative locking, the protocols employed by traditional DLMs to ensure system-wide consistency of locking state inevitably expose application developers to analogous exception cases. For instance, a network problem causing some application node to lose connectivity to a majority of lock managers would typically cause that node to observe a DLM-related exception event. More concretely, the application process would be informed that, due to the lack of connectivity, some of its locks may no longer be valid - these are precisely the semantics of Minuet's $ ForcedDowngrade$ notification. Hence, both models demand exception-handling logic for dealing with forced lock revocation.

With Minuet, a node that finds itself partitioned from the rest of the cluster need not immediately give up its locks; instead, it can perform a more granular recovery action. For example, it can switch to the optimistic method and resume disk access without coordinating with other application processes, which permits it to make progress in the absence of conflicting accesses.
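
A sketch of such a recovery action, again with hypothetical API names (on_forced_downgrade, minuet_set_locking_mode, MINUET_WEAK_OWN):

    #include <stdint.h>

    #define MINUET_WEAK_OWN 0    /* illustrative constant for the weak-own mode */

    extern void minuet_set_locking_mode(uint64_t resource_id, int mode);

    /* Invoked when Minuet reports that a lock may no longer be valid.  Rather
     * than aborting, the partitioned node switches the affected resource to
     * optimistic locking and continues issuing I/O; any conflicting access is
     * caught by the target's guard logic and surfaces as a rejected command. */
    void on_forced_downgrade(uint64_t resource_id) {
        minuet_set_locking_mode(resource_id, MINUET_WEAK_OWN);
    }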

Our experience with developing and deploying sample applications on top of Minuet suggests that the availability benefits enabled by the use of such fine-grained recovery actions are certainly worth the extra implementation effort, which we believe to be relatively small. The chunkmap application was initially implemented on top of conventional locking using 327 lines of C code and extending the implementation to operate on top of Minuet required adding only 15 lines of code to handle the $ EBADSESSION$ notifications.


7 Related Work

Concurrency control has been extensively studied in the operating systems, distributed systems, and database communities. VMS [40] was among the first widely-available platforms to provide application developers with the abstraction of a general-purpose distributed lock manager and today, DLMs are generally viewed as a useful building block for distributed applications.

Clustered and distributed filesystems [8, 10, 35, 36, 37] and relational databases [9] rely on locking or lease-based mechanisms to coordinate access to shared application state. Both sets of mechanisms make certain assumptions about timing, such as partially-synchronized clocks and bounded communication latency, in order to operate safely. These systems can directly leverage Minuet to ensure safe coordination of concurrent access to shared data on disk without assuming synchrony.

In web service data centers, distributed coordination services such as Chubby [20] and ZooKeeper [15] have also become popular. These services are intended primarily for coarse-grained synchronization - a typical use case might be to elect a master among a set of candidates. Although the intended use of Minuet is to provide fine-grained synchronization in a shared-disk cluster, our system can also support such use cases by transitioning to strongly-consistent locking, whereby each lock is acquired with a majority voter set. Unlike our system, Chubby provides a hierarchical namespace and the ability to store small pieces of data, but these features are largely orthogonal to our approach. Chubby's lock sequencer mechanism allows servers to detect out-of-order requests submitted under an outdated lock, and our timestamp-based sessions generalize this idea to support shared-exclusive locking. We also develop this notion further and observe that once we have the ability to reject inconsistent requests at the destination, very little is gained by enforcing strong consistency on replicated locking state; in particular, the use of an agreement protocol (e.g., Paxos [31]) may be more than is necessary.

Concurrency control and transaction mechanisms have been extensively studied in databases. ARIES [33] is a state-of-the-art transaction recovery algorithm for a centralized database, supporting fine-granularity locking and partial rollbacks of transactions, while D-ARIES [38] extends this work to be usable in distributed shared-disk databases. Implementing these mechanisms on top of Minuet's locking and I/O facilities would ensure that they retain their safety properties in the face of arbitrary asynchrony. Minuet's basic transaction service presented in Section 3.4 is a variation of timestamp-based concurrency control - a standard and well-known technique in relational database design. Finally, database researchers have explored hybrid approaches to concurrency control [34] that enable tradeoffs between optimism and strict coordination and our work enables similar tradeoffs for general SAN applications, where the data resides on application-agnostic block devices.

There have been several research projects tackling the intelligence/information gap between operating systems and storage systems [17, 25, 28]. These projects aim to achieve more expressive storage interfaces by exposing more information or adding more intelligence to storage devices. In our work, we identified and tackled safety problems in SANs by narrowing the intelligence gap between clustered applications and SAN storage devices.

Several earlier projects have investigated new approaches to concurrency control via functional extensions to storage devices. [29] proposes Dlocks as a new primitive for distributed mutual exclusion, whereby the lock acquisition state is maintained by the target devices themselves and manipulated by the initiators using a new SCSI command. Due to the inherent complexity of distributed locking, the lock management functionality proved too difficult to implement in a SAN storage array and, as a result, this mechanism did not attain wide acceptance among storage device vendors. Follow-on work to the initial proposal presented a simplified scheme in the form of DMEP [12]. In this scheme, storage devices expose an array of shared memory buffers holding the lock state, and clients manipulate this state directly using simple atomic commands. The DMEP specification was implemented by a storage device vendor [2] and used by earlier versions of GFS [35].

Our work revisits the idea of device-assisted synchronization and is in line with these earlier efforts, but differs in several crucial respects. First, rather than extending the storage devices with lock management functions, we propose a more general synchronization primitive that supports a wider range of coordination techniques. In addition to "traditional" conservative locking, Minuet enables the use of optimistic concurrency control, which has been shown to reduce the synchronization overhead and deliver better performance for certain application workloads. As a result, Minuet enables a new degree of freedom in the design space of parallel SAN applications, enabling the developers to safely exploit the tradeoffs between synchronization overhead, access conflict rate, and application availability. Second, acquiring or releasing a lock in Minuet does not require explicit communication with the target storage device and instead, clients annotate outgoing I/O requests with the relevant synchronization state. This technique addresses the problem of delayed requests delivered under the protection of an outdated lock and thus enables SAN applications to guarantee safety despite arbitrary message delays, drifting clocks, and node failures. Finally, unlike prior proposals, our design does not require new SCSI commands and can be implemented within the confines of existing protocol standards.

Similar in spirit to this work, SCSI-3 Persistent Reserve [11] tries to address the safety problems in shared-disk environments by extending the storage protocol and target devices. Revoking a suspected node's reservation typically necessitates a global decision on declaring the respective node faulty, which, in turn, requires majority agreement. Hence, SCSI-3 PR offers safety but not liveness in the presence of network partitions and massive node failures, while Minuet provides both.


8 Conclusion

This paper investigates a novel approach to concurrency control in SANs. Today, clustered SAN applications coordinate access to shared state on disk using strongly-consistent locking protocols, which are subject to safety and liveness problems in the presence of asynchrony and failures. To address these problems, we augment target devices with a small amount of guard logic, which enables us to provide a property called session isolation and a relaxed model of locking that, in turn, provide a building block for distributed transactions. These mechanisms also allow us to loosen the consistency requirements on distributed locking state, thus providing high availability despite failures and network partitions.

We have designed, implemented, and evaluated Minuet, a DLM-like synchronization and transaction module for SAN applications based on the protocols we presented. Our evaluation suggests that distributed applications built atop Minuet enjoy good performance and availability, while guaranteeing safety.

Acknowledgments

We would like to thank the anonymous reviewers for their useful comments and our shepherd, James Bottomley, for his guidance.

References

1
Brocade 5300 switch.
https://www.brocade.com.

2
Dot Hill and Sistina Software team up to improve computing performance for Linux users.
https://findarticles.com/p/articles/mi_m0EIN/is_2001_August_28/ai_77601552.

3
HP remote insight lights-out edition II (QuickSpecs).
https://www.hp.com.

4
IETF RFC 3720. Internet small computer systems interface (iSCSI).
https://www.ietf.org.

5
iSCSI enterprise target v0.4.16.
https://iscsitarget.sourceforge.net.

6
Open-iSCSI.
https://www.open-iscsi.org.

7
OpenDLM.
https://opendlm.sourceforge.net.

8
Oracle OCFS.
https://www.oracle.com.

9
Oracle real application clusters.
https://www.oracle.com.

10
Panasas PanFS.
https://www.panasas.com.

11
SCSI-3 block commands (draft proposed standard).
https://www.t10.org.

12
SCSI device memory export protocol (version 0.9.8)
https://www.t10.org.

13
SCSI primary commands - 4 (SPC-4), working draft project T10/1731-D revision 17.
https://www.t10.org.

14
Survey finds SAN usage becoming mainstream.
https://findarticles.com/p/articles/mi_qa4137/is_200402/ai_n9362169/pg_1.

15
ZooKeeper.
https://zookeeper.sourceforge.net.

16
Private communication with IBM Research.
2005.

17
A. Acharya, M. Uysal, and J. Saltz.
Active disks: Programming model, algorithms and evaluation.
In ASPLOS, 1998.

18
A. Adya, R. Gruber, B. Liskov, and U. Maheshwari.
Efficient optimistic concurrency control using loosely synchronized clocks.
In ACM SIGMOD, pages 23-34, 1995.

19
T. Asaro.
ESG analysis - the state of iSCSI-based IP SAN 2006.
https://www.netelligentgroup.com/articles/esgiscsi.pdf

20
M. Burrows.
The Chubby lock service for loosely-coupled distributed systems.
In OSDI, 2006.

21
M. Carey, M. Franklin, and M. Zaharioudakis.
Fine-grained sharing in a page server OODBMS. In ACM SIGMOD, 1994.

22
A. Ermolinskiy, D. Moon, B.-G. Chun, and S. Shenker.
Deploying and evaluating an iSCSI-based implementation of Minuet.
In UC Berkeley EECS TR, Jan., 2009.

23
M. J. Fischer, N. A. Lynch, and M. S. Paterson.
Impossibility of distributed consensus with one faulty process.
In J. ACM, 32(2), pages 374-382, 1985.

24
M. Franklin, M. Carey, and M. Livny.
Transactional client-server cache consistency: alternatives and performance.
In ACM TODS 22(3), pages 315-363, 1997.

25
G. R. Ganger.
Blurring the line between OSes and storage devices.
In CMU SCS Technical Report CMU-CS-01-166, 2001.

26
C. G. Gray and D. R. Cheriton.
Leases: An efficient fault-tolerant mechanism for distributed file cache consistency.
In SOSP, 1989.

27
D. S. Hirschberg and J. B. Sinclair.
Decentralized extrema-finding in circular configurations of processors.
In Commun. ACM, 23(11), pages 627-628, 1980.

28
K. Keeton, D. A. Patterson, and J. M. Hellerstein.
A case for intelligent disks (IDISKs).
In SIGMOD Record, Sept., 1998.

29
C. J. S. Kenneth W. Preslan, Steven R. Soltis, M. O'Keefe, G. Houlder, and J. Coomes.
Device locks: mutual exclusion for storage area networks.
In IEEE MSS, pages 262-274, 1999.

30
H. T. Kung and J. T. Robinson. On optimistic methods for concurrency control.
In Readings in database systems (2nd ed.), pages 209-215, 1994.

31
L. Lamport.
The part-time parliament.
In ACM Trans. Comput. Syst, 16(2), pages 133-169, 1998.

32
P. L. Lehman and S. B. Yao.
Efficient locking for concurrent operations on B-trees.
In ACM Trans. DB Syst. 6(4), pages 650-670, 1981.

33
C. Mohan, D. Haderle, B. Lindsay, H. Pirahesh, and P. Schwartz.
ARIES: A transaction recovery method supporting fine-granularity locking and partial rollbacks using write-ahead logging.
In ACM TODS, 17(1), 1992.

34
S. H. Phatak and B. R. Badrinath.
Bounded locking for optimistic concurrency control.
In Rutgers University TR DCS-TR-380, 1999.

35
K. W. Preslan, A. Barry, J. Brassow, M. Declerck, A. J. Lewis, A. Manthei, B. Marzinski, E. Nygaard, S. V. Oort, D. Teigland, M. Tilstra, S. Whitehouse, and M. O'Keefe.
Scalability and failure recovery in a Linux cluster file system.
In USENIX ALS, 2000.

36
O. Rodeh and A. Teperman.
zFS - a scalable distributed file system using object disks.
In MSST, pages 207-218, 2003.

37
F. Schmuck and R. Haskin.
GPFS: A shared-disk file system for large computing clusters.
In FAST, pages 231-244, 2002.

38
J. Speer and M. Kirchberg.
D-ARIES: A distributed version of the ARIES recovery algorithm.
In ADBIS Research Commun., 2005.

39
R. Treiber and J. Menon.
Simulation study of cached RAID5 designs.
In HPCA, 1995.

40
W. E. Snaman, Jr. and D. W. Thiel.
The VAX/VMS distributed lock manager.
In Digital Technical Journal, Sept., 1987.

41
B. White, J. Lepreau, L. Stoller, R. Ricci, S. Guruprasad, M. Newbold, M. Hibler, C. Barb, and A. Joglekar.
An integrated experimental environment for distributed systems and networks.
In OSDI, 2002.

42
M. Zaharioudakis, M. Carey, and M. Franklin.
Adaptive, fine-grained sharing in a client-server OODBMS: a callback-based approach.
In ACM Trans. on Database Syst, 22(4), 1997.
