################################################ # # # ## ## ###### ####### ## ## ## ## ## # # ## ## ## ## ## ### ## ## ## ## # # ## ## ## ## #### ## ## ## ## # # ## ## ###### ###### ## ## ## ## ### # # ## ## ## ## ## #### ## ## ## # # ## ## ## ## ## ## ### ## ## ## # # ####### ###### ####### ## ## ## ## ## # # # ################################################ The following paper was originally published in the Proceedings of the USENIX Summer 1993 Technical Conference Cincinnati, Ohio June 21-25, 1993 For more information about USENIX Association contact: 1. Phone: 510 528-8649 2. FAX: 510 548-5738 3. Email: office@usenix.org 4. WWW URL: https://www.usenix.org Design And Implementation of a Multimedia Protocol Suite in a BSD Unix Kernel This research is supported in part by the National Science Foundation Grant No. NCR-9111323 and Grant no. STI-9108764. Lakshman K, Giri Kuthethoor, Raj Yavatkar Department of Computer Science University of Kentucky, Lexington, KY 40506 Abstract Development of distributed multimedia applications requires support for coordination and temporal/causal synchronization of traffic over related streams. Our current research involves investigation of appropriate OS and communication abstractions to support such applications. Towards this goal, we have designed and implemented MCP, a suite of transport and session layer protocols, in the framework of a standard BSD Unix networking platform. MCP contains two new abstractions. First, MCP contains a token-based mechanism for coordination of traffic over a multipoint connection. Second, MCP includes an abstraction called a multi-flow conversation that enforces both temporal and causal synchronization among related data streams. This paper discusses Unix kernel implementation of MCP and describes our experience in using MCP. abstract Introduction With the increasing use of high speed networks, it is now possible to build multimedia applications that involve geographically dispersed group of users. These applications involve concurrent communication of text, voice, graphics, and video. Our current research involves investigation of appropriate OS and communication abstractions to support distributed multimedia applications. Towards this goal, we have designed a suite of transport and session layer protocols [mcp-dcs] and implemented the protocols in the framework of a standard BSD Unix networking platform. This paper describes our experience in design and implementation of transport and session layer protocols for distributed multimedia applications. Development of distributed multimedia applications requires resolution of several issues including media coordination and synchronization, media mixing, OS scheduling, and the problem of network resource allocation to provide performance guarantees in terms of delay bounds and bandwidth reservation. However, we must emphasize that this paper only addresses the question of providing suitable transport and upper level protocol abstractions to meet the temporal/causal synchronization and coordination needs of such applications. In particular, the paper addresses the following aspects of building distributed multimedia systems: Unique communication requirements posed by distributed multimedia applications and need for appropriate communication support to facilitate development of such applications. Design of suitable transport and session layer communication mechanisms to build necessary communication framework. Implementation of a transport and session layer protocol suite using the SUNOS 4.1.1 version of Unix as the experimental platform. Investigation of suitability and extensibility of the UNIX IPC facilities, protocol software structure, and user interface in accommodating needs of multimedia applications. User interface requirements to support multipoint, collaborative applications. Rest of this paper is organized as follows. Sections~motive] and back provide the motivation and background for our work. Section~approach describes our approach including design of MCP and discusses the design alternatives and tradeoffs. Section~impl describes our implementation of MCP suite in SunOs 4.1.1 kernel. We have conducted a systematic performance evaluation of MCP kernel implementation and have also compared the performance of user applications with and without MCP. Sections~experience and perf discuss the results of our performance evaluation, Section~related compared our work with work by others, and Section~concl provides a summary of our work. Motivation motive Our goal is to facilitate development of multimedia distributed applications that involve a geographically dispersed group of users. An example of such an application is a collaborative, software engineering environment[ellis-gibbs-89,colab-from- xerox]. In such an environment, a group of designers located at different sites collaborate on a design document (or a program) using interactive tools to edit and test parts of the design under development. Interaction may involve a group editor (based on shared windows), an image display (that displays resulting design), and a voice channel that allows them to view, discuss, and edit the suggestions made by each other. Our main focus here is on the issues of coordination and synchronization of traffic over related data streams. These issues arise in following forms: A single multipoint communication involving multiple users on multiple, remote sites requires causal synchronization so that all participants ``see'' all the communication events in the same or ``correct'' order. For instance, a voice conference channel that spans multiple participants requires such a synchronization. Apart from voice/video interactions, both communication and application-level software must also enforce some degree of concurrency control when a group of users may simultaneously view and update shared text or image data to maintain consistency. The communication software must allow an application to coordinate multiple information channels that carry traffic from multiple media such as text, voice, video (or image). The collaborative, software development group effort described earlier is an example of such an interaction. For such an application, the communication software must allow related streams to be grouped together and recognize the order and sequencing of traffic sent over them. For example, when a participant scrolls through an image browser (or a shared editor window) and says, ``look at the middle of the display'', the statement should be heard at the same time (or just after) the scrolling is completed. Such temporal relationships must be captured in the delivery of traffic over related streams. Another example of such a synchronization is ``lip-synching'' when voice and video are transmitted over separate connections in a digital network. Existing transport and/or session layer protocols lack communication abstractions that provide the necessary semantics for coordination and temporal synchronization in multimedia applications. We are currently investigating such abstractions in our research and are using them in building bf PolySchmues, a distributed interactive environment for collaboration using multiple media. In the following, we first elaborate on the problem of coordination and temporal synchronization and then discuss proposed solution. Coordination and Temporal Synchronization Problem back Hereafter, we will refer to a single or a multipoint connection as a flow. We assume that the network layer provides a flow abstraction that may have performance guarantees associated with it to provide predictable performance needed by real-time voice or video channels[ferrari-verma]. The question of how to satisfy the real-time performance requirements of a flow is not the focus of discussion here; that question has been addressed by many researchers[ferrari-verma,rsvp,mythesis]. Instead, the focus here is on research issues in using and combining such flows to build multimedia, collaborative, distributed applications. For the design of multimedia systems such as PolySchmues, one must address two aspects of coordination, namely, what sort of coordination is necessary and how much control must be exercised at the communication substrate. In our view, two kinds of coordination are necessary: description [Intra-Flow Coordination] Given a flow spanning a group of users, one must consider the problem of concurrency control when more than one user may be transmitting data at the same time. A common example of such coordination is avoiding interruptions and crosstalk in a multipoint voice channel. Such coordination is also necessary when communication involves a shared text document or an image. [Inter-Flow Coordination] A multimedia application may need to synchronize traffic over multiple flows, each carrying traffic from a different medium (text, voice, video, or image). A collaborative, software development group effort described earlier is an example of such an interaction. description Degree Of Coordination The amount of coordination needed varies in both inter- and intra-flow cases depending on the application. In the case of a shared window package[colab-from-xerox], a single window is displayed on the display screens of multiple users, and each user gets an identical image of the window. Shared window systems provide no concurrency control for simultaneous actions by multiple users. Instead, they allow users to constantly see the actions of other users who are responsible for manually ensuring that there are no conflicts. At the other extreme lie teleconferencing systems in which interference is avoided by strict control based on the notion of a ``floor'', where only the current ``speaker'' that has the floor is allowed to ``speak'' (or transmit data) at any time. Strict coordination is also necessary among separate flows when a group of users are pointing at a shared text window and discussing parts of text over a voice channel. Both the extremes have their limitations. On one hand, lack of any concurrency control puts additional responsibility on application designers to resolve conflicts. On the other hand, strict floor-based control does not allow applications to exploit inherent concurrency and may sometimes unnecessarily increase latency due to the waiting involved. In addition, there is a continuum between these two extremes where varying degree of coordination may be necessary or desirable for present and future applications. For instance, users in a conferencing system may sometimes decide to enter smaller discussion groups that may hold multiple, concurrent conversations[colab-from-xerox]. A ``brainstorming'' session based on the ``chalkboard'' metaphor[colab-from-xerox] in a cooperative environment is another example where less stringent coordination is appropriate. In the case of a multimedia-based group editor, strict coordination is not necessary when a user is browsing through a part of text while another is annotating a different part of the text. Thus, the degree of coordination needed varies from time to time within an application and across different applications. subsubsectionCausal Synchronization Both intra-flow and inter-flow coordination also require causal synchronization when more than two communicating entities are involved. For instance, if sender $S_2$ sends a message over a connection $C_1$ after ``hearing'' from sender $S_1$ over (say) another connection $C_2$, all other participants must ``see'' messages from $S_1$ and $S_2$ in the proper causal order. Enforcing causality requires that the underlying communication system buffer out-of-order messages and enforce the same total ordering on all related messages at all the sites. However, traffic delivery over multimedia connections (such as voice or video) usually has delay constraints and the messages not delivered within their delay bounds are rendered useless. Thus, causally ordered delivery mechanism must also take into account the real-time delivery requirements of multimedia traffic. Our Approach approach MCP (Multi-Flow Conversation Protocol) is a protocol suite consisting of transport and session layer protocols to facilitate multimedia communication. MCP provides two methods of coordination: token-based control and an abstraction called multi-flow conversation]. Together, these two methods yield a flexible and adaptable coordination mechanism. Token-based Concurrency Control When a multipoint flow is created, a token is assigned to that flow that acts as an authorization for data transmission. A sender must hold a token to be able to send traffic over a flow. However, token management primitives are provided so that other participants can obtain transmission privileges. Thus, applications are free to transfer, replicate, and delete tokens to govern the degree of concurrency control needed. This type of concurrency control is entirely different from the token-based synchronization provided by OSI session-layer protocol. The latter method of synchronization allows session participants to insert resynchronization points (or checkpoints) in the data stream to allow rollback and to reduce the amount of retransmitted data in case of a transmission error. We use the token mechanism to allow several levels of concurrency control. The following control hierarchy is derived based on schemes proposed in the literature[colab-from- xerox,ellis-gibbs-89,lantz-cscw86,swinehart-system-support]. description [Floor Control] As described earlier, real-time teleconferencing systems employ such a strict concurrency control. A single token enforces such control over a multipoint connection. The token will be passed on from one speaker to another whenever the floor is transferred. To obtain control of the floor, a participant must explicitly request transfer of the token by invoking a token management primitive. [Brainstorming] This form of coordination is common in shared window systems where there is no concurrency control for simultaneous actions by multiple users. For such applications, the token is replicated and distributed to all the participants. [Chalkboard Interaction] Applications based on the ``chalkboard'' metaphor in a cooperative environment interact in two phases. In the first phase, a speaker addresses a group of listeners with no interruptions. The application may switch to the second phase any time. In the second phase, a group of questioners may address questions to the speaker. The token-based method can accommodate both types of interaction. Only the speaker may hold the token during the first phase, whereas the token may be replicated and distributed to the questioners during the second phase. Replicated tokens will be destroyed at the end of the second phase. [Discussion Groups] Some environments such as real-time conferencing systems envisage a session breaking up into smaller discussion groups and thus holding multiple, concurrent conversations. In such a system, initially only a single token would be created and passed on from one speaker to another to achieve strict ``floor-based'' control. However, the token can be replicated and transferred to each discussion group, and each group may then use the token independently as it sees fit. description Multi-flow Conversations To allow temporal synchronization among traffic over multiple flows, we introduce a communication abstraction called a multi-flow conversation. A conversation is a logical entity and consists of one or more (two-party or multipoint) flows. An application will typically first create individual flows with appropriate performance requirements[ferrari-verma]. It will then create a conversation that includes one or more related flows to achieve the necessary temporal synchronization. A conversation (Figure conv) may consist of one or more (two- party or multipoint) connections, MCP enforces temporal synchronization in delivery of traffic over participant connections. For instance, consider a conversation consisting of a multipoint voice connection $V_1$ and a data (text) connection $T_2$. If a sender S, sends a message on $T_2$ followed by a message over $V_1$; all the conversation participants should receive those messages in the same sequence. In addition, a conversation provides limited causal ordering among messages sent by multiple sources. For instance, if $S_2$ sends a message $M_2$ over a constituent connection after ``hearing'' $M_1$ from $S_1$, all other participants should ``see'' messages from $S_1$ and $S_2$ in the proper causal order. However, enforcement of such a causal order among messages sent over a conversation has performance penalties (including delaying message delivery for a long time) and such penalties are not acceptable to real-time flows involving voice or video that have better-never-than-late semantics. For instance, real-time voice flows must deliver traffic within 100 milliseconds before quality of service degrades and traffic delivered after about 300 milliseconds is useless for voice interaction. In fact, delivery of certain messages might be meaningless if delayed beyond the delay constraint of the flow involved. Therefore, we have decided to discard strict causal ordering in favor of a notion of causality called bf $Delta$-causality that takes into account the delay constraints associated with constituent flows. Additional features of MCP include an out-of-band signaling channel for exchanging control information and an asynchronous interface to applications. Out-of-band signaling facilitates real-time protocol processing and simplifies processing of user data. The asynchronous interface allows an application to specify event-triggered actions to facilitate fast processing of periodic data and changes to the status of multipoint connections. Delta-Causality Informally, $Delta$-causality works as follows. Given a set of flows and a set of participants in a conversation, message delivery to an application at a recipient site is causally ordered provided an upper bound $Delta$ on end-to-end delay is not violated for any of the related messages. For instance, in the example described above, the receiver $S_3$ will see the proper causal order provided $M_1$ and $M_2$ were generated and sent in such a way that their arrival at $S_3$ is not delayed beyond an upper bound $Delta$. $Delta$ is a function of individual delay constraints for flows $Flow_1$ and $Flow_2$. We assume that each flow has performance characteristics associated with it that are specified when a flow is established. Apart from the packet rate and error rate parameters, there are two delay constraints associated with each flow $F_i$: description [Desired delay $delta_i$] is the maximum end-to-end delay that the flow $F_i$ can tolerate before quality of service deteriorates. Example of such a constraint is 100 milliseconds delay bound in packet voice. [Loss delay $lambda_i$] is an upper limit on end-to-end delay for flow traffic beyond which the delivered traffic delivery is useless. For example, packet voice or video have better-never- than-late delivery semantics and specify such a constraint. Packet voice traffic with desired delay constraint of 100 ms has a loss delay constraint in the range of 200 to 300 ms. description Consider a conversation C consisting of flows $F_1 ldots F_i ldots F_n$ with respective delay constraints $delta_i$'s and $lambda_i$'s. We compute two conversation-wide desired delay and loss delay constraints: quote $Delta_1 = max delta_i, i=1, ldots n $ hspace*10mm $Delta_2 = min lambda_i, i=1, ldots n $ quote Based on these two constraints, we define the causality interval $Delta$ for each conversation. $Delta$ is computed as $Delta_1 < Delta < Delta_2$ and defines a window of causality for each message sent in a conversation. It must be noted that our notion of real-time assumes large-grain clock synchronization among the participants. This is not an unrealistic assumption as fault-tolerant clock synchronization algorithms exist that achieve such synchronization. In TCP/IP Internet, Network Time Protocol (NTP) achieves global clock synchronization across the country within a few milliseconds[rfc1128]. The value of $Delta$ is chosen based on the following pragmatic considerations. First, $Delta$ must at least be equal to $Delta_1 + 2 ast epsilon$ ($epsilon$ is the maximum possible clock skew). Second, to allow some flexibility in the presence of fluctuations in network conditions and delays, MCP provider may choose value of $Delta$ to be higher than $Delta_1 + 2 ast epsilon$ without compromising individual flow semantics as long as $Delta$ remains much below the upper bound $Delta_2$. Design Issues Given the need for flexible coordination and synchronization mechanisms, we carefully considered the following design alternatives: description [ Policy vs. Mechanism:] One of the questions we had to address was whether to implement the necessary policies in the communication substrate or to build only the necessary communication mechanisms that are flexible enough to allow an application to choose and enforce desired policies. We believe that the right way for accommodating a wide range of coordination and synchronization requirements is to provide communication mechanisms that allow specification and selection of the degree of synchronization needed and leave the task of choosing the appropriate policy and degree of synchronization to each individual application. Such a division of functionality is appropriate because an application programmer has the relevant knowledge to specify the necessary conditions for causality. [ Kernel vs. Library:] Given the availability and popularity of existing transport protocols such as TCP or UDP, we also had to grapple with the issue of whether we could implement MCP functionality in an application layer library on top of conventional transport protocols. Advantage of such an approach is two fold. First, no modifications are necessary to the kernel allowing easy portability to different systems. Second, a library-level implementation can always hide unpleasant details of coordination and synchronization from an application and thus let application writer concentrate on application-specific details. Library level implementation is also easier to modify allowing experimentation with alternative designs. However, library-level implementation approach suffers from some drawbacks. A distributed multimedia application must handle and coordinate transmission and reception of traffic over multiple data streams and must have the ability to handle asynchronous events that occur on any one of the streams. If you consider details of rate-based flow control, buffer and timer management, and asynchronous token management, even a multi-threaded library implementation becomes quite unwieldy. Some degree of inefficiency also results as the library must duplicate some of the kernel functions such as event scheduling and buffer/timer management. Based on these arguments and our previous experience with a library-based implementation of an experimental transport protocol[rgp-paper], we decided in favor of the kernel-based implementation. description MCP Implementation impl One of the goals of our research is to integrate MCP abstractions into a general purpose operating system such as Unix so that we can utilize a large collection of toolkits and application software libraries[suite] to build interactive multimedia applications. Towards that goal, we have implemented MCP in the framework of Unix 4.3 BSD networking software. Implementation Issues Unlike the implementation of TCP and UDP protocols in a Unix environment, the MCP protocol makes some additional demands on networking software: MCP must support both point-to-point (or two-party) and multipoint (or N-party) flows. In the case of multipoint flows, a flow is identified by a group address and the protocol software must be careful in multiplexing and demultiplexing of packets addressed to the group address. For instance, if three transport level entities at a host may participate in the multipoint flow, each endpoint must be separately identified and a copy of an incoming packet must be delivered to each local participant. MCP must also clearly define the loopback semantics for an outgoing packet over such a flow. For instance, the loopback may mean that that all the local participants other than the sender of the packet receive a copy of each outgoing packet. MCP application-layer interface must make provisions to allow specification of QOS parameters such as data rates, error tolerance, and delay bounds for individual flows and to specify control operations such as token request/transfer and joining or leaving an ongoing conversation. Given the QOS specifications, MCP must exercise rate-based flow control according to the QOS specification and include appropriate timer and buffer management functions to implement $Delta$-causality. MCP must also make provisions for reporting asynchronous events and exceptional conditions to user applications. For example, arrival of a token or token transfer request can occur any time. Because MCP assumes that a user application implements the desired policy for token management, MCP must asynchronously report such events to the application. The size of data units transmitted varies depending on the media used. For example, a voice sample may consist of a few bytes whereas a video frame typically consists of hundreds of bytes. To ensure proper temporal synchronization, MCP must preserve message boundaries over individual flows and insert synchronization markers to ensure appropriate playout of related media units at a receiver. Because the transmission and delivery of data over a multipoint flow is controlled by transfer of tokens and relevant control information, processing of normal data is complicated if one uses ``in-band signaling'' as in traditional transport protocols. To facilitate real-time protocol processing and to simplify processing of user data, MCP must establish an auxiliary control connection for communicating out-of-band control messages related to conversation and token management. Unix Networking Software unix Networking in the BSD kernel is divided into three software layers (Figure softlayers): The socket layer, the protocol layer, and the network layer. A socket acts as an endpoint of communication and is accessible to processes as a Unix file descriptor. The socket layer is usually invoked by the C library using a set of system calls. The protocol layer consists of protocol suits or domains such as TCP/IP, Xerox NS, and Decnet domains. This layer handles all protocol specific processing. The network interface layer manages the physical network such as ethernet or token ring. Figure~softlayers shows the MCP's place vis-a-vis other protocol modules and additional details on Unix network software design can be found in [4.3-bsd-book]. In addition to the standard Unix software, two extensions that support network layer connections (ST-II and COIP-K) are available in public domain[stii-partridge,coip-k]. Because we want to provide delay-constrained delivery of multimedia data, we needed a network layer service that would reserve resources to guarantee performance. From that point of view, both ST-II and COIP provide necessary service. We decided to use COIP protocol because COIP protocol implementation consists of a toolkit called COIP-K that provides a nice modular structure and a flexible platform to implement connection oriented protocols. Socket Library and Programming Interface MCP retains the standard 4.3 BSD socket interface as much as possible. We have extended the socket library to provide additional functionality needed by MCP applications. In particular, we have extended connect call to allow specification of multiple destination addresses for a multipoint connection. We have added new options to the setsockopt call so that an application can specify a flow's QOS requirements. setsockopt call is also used to request a token, to explicitly transfer a token to another participant, or to join/leave a conversation. We have also extended getsockopt to implement an asynchronous event notification mechanism as described later in Section~asynch-sect. MCP retains the conventional TCP/IP programming interface in Unix. Thus, applications involving multipoint communication are also written using client-server paradigm. In a multipoint application, a single participant acts as a client and all other participants act as servers (or passive entities). An example of such a program is shown in Figure~api-fig. Each server uses a combination of socket, bind, listen, and accept calls to indicate its willingness to join a multipoint flow. After one or more flows are established, participants may add the flows to a conversation. Once a conversation is established, the data transfer begins using the usual Unix send/recv calls. Before sending data, A participant must typically request a token using the setsockopt call and a token holder may transfer the token using another setsockopt call (see Figure~token- xfer). CLIENT SERVER s = socket(PF_COIP, SOCK_MCP, 0) s = socket(PF_COIP, SOCK_MCP, 0) s = setsockopt(s, MCP, MCP_SETFLSPEC, ..... &flowspec, sizeof(flowspec) ) bind(s, addr, addrlen); listen(s,5); connect(s, addr, addrlen); snew = accept(s, addr, addrlen); /* Establish a conversatioon */ setsockopt(s, MCP, MCP_INITCONP, sockarray, sizeof(sockarray)); caption typical multipoint MCP application consists of a single client and one or more servers. Servers are passive listeners waiting for the client to establish a connection. Once a connection is established, the token holder gets to transmit and a server (or the client) must first obtain a token before sending any data. MCP Transport and Session Layer Implementation MCP transport and session layers reside below the socket layer. The following discussion on transport and session layer implementation assumes that readers are familiar with basic structure of Unix network software including the two commonly used data structures, Protocol Control Block (or PCB) and mbuf (Memory Buffer). PCB is used to hold protocol-specific state information and mbuf is another useful data structure used to shepherd an incoming or outgoing packet through protocol layers without repeated copying. More details can be found in [4.3-bsd- book,coip-k]. Internal structure of MCP implementation is shown in Figure~interact. For simplicity, only selected functions are shown. The function mcp_userreq handles all user requests such as socket creation, bind, connect, and read/write and passes on these requests to lower layer functions. For instance, cin_output at network layer transmits data to other endpoints of a flow. Other functions implement token management and conversation abstraction as described below. On input side, the routine cinintr handles packet arrivals at the network interface and passes data to the transport layer. The routine mcp_ctxext enforces the causality requirement. The routine mcp_ctlinput handles data transfer across the control connection. figure[tbp] centerlinepsfigfigure=coiplay.ps captionInternal structure of MCP implementation interact figure subsubsectionTransport Layer Implementation MCP transport layer is responsible for establishment of point- to-point and multipoint data and control flows and for the transfer of data to and from the MCP session layer. MCP transport layer identifies each endpoint of a flow using a unique transport level address consisting of $$. When a client issues the socket call followed by a connect call, MCP transport layer traces the following steps to establish a multipoint flowfootnoteEstablishment of a point-to-point or two-party flow is similar except packet delivery is straightforward as in TCP or UDP. Separate discussion of point- to-point flows is omitted here.: figure[tbp] minipage[t]3.2in psfigfigure=token.ps,height=2.8in,width=2.8in minipagehspace0.2in minipage[t]2.9in psfigfigure=mpcb.ps,height=2.8in,width=2.4in minipage captionExample of token request and transfer using setsockopt call and unix signal is shown above. MCP PCB (Protocol Control Block) is shown on right hand side. token-xfer figure The transport layer creates a MCP PCB (see Figure~token-xfer) to maintain the state information about the flow and fills in flow specifications, list of flow participants, and initial status of token. MCP transport layer assigns the flow a unique flow identifier. A unique flow identifier is constructed by appending a local sequence number to the IP address of the client's site. The transport layer constructs an MCP open packet, fills in flow identifier and flow specifications, and invokes the COIP-K network layer to establish a multipoint flow passing the open packet as an argument. Under the first approach, COIP-K builds a tree with client (active entity) as a root that connects the group participants using point-to-point connections. Every participant forwards a message addressed to the group to the entity at the root and the network-level entity at the root delivers a copy of the message to the local participants and also forwards copies of the message to other leaves across a separate point-to-point connection. Under the second approach, MCP uses IP multicast[deering- multicast to send and receive packets over the flow. Thus, in this approach, there is no master-slave relationship. However, additional network layer functionality[rsvp] is necessary to provide performance guarantees. Given a destination group address, the IP multicast routing module (added to COIP-k) ensures that a packet reaches all the members of the multicast group. Current implementation does not include error control to ensure reliable multicast delivery and is the focus of our current research. Given a conversation, MCP also establishes a separate control flow among the transport-layer peers at participating sites. The control flow is devoted to exchange of state and control information related to the conversation and typically has different packet delivery semantics. Communication over a control flow is typically only two-party and requires reliable delivery. When a conversation is created, MCP transport layer uses the COIP-K layer to establish a control flow and the open packet sent to participating sites carries a list of flow identifiers for the constituent flows of the conversation. MCP Session Layer Implementation MCP session layer implements two important functions, namely, token management across a multipoint flow and temporal/causal synchronization among messages sent across constituent flows in a conversation. Data transmission across a flow at each sender consists of a sequence of messages. Data sent using each write call by an application constitutes a single message. The MCP session layer preserves the message boundaries in delivery of data at each participant and also uses the message boundaries as synchronization markers (or points) for both temporal and causal synchronization. subsubsectionTemporal Synchronization Given a flow, MCP session layer at each sender assigns a unique sequence number to each message sent across the flow. When multiple flows are part of the same conversation, sequence numbers for messages sent by the same sender across constituent flows are drawn from a single sequence number space. Thus, MCP session layer can deduce the message ordering among messages sent by the same sender across constituent flows of a conversation. When a sender $S_1$ sends two messages $M_1$ and $M_2$ in that order across two constituent flows $F_1$ and $F_2$, MCP implements temporal synchronization by delivering them in the same order at all the participants. Because message sequence numbers capture the necessary temporal order, implementation of temporal synchronization is straightforward. However, if a preceding message $M_1$ is lost or arrives out-of-sequence, MCP session layer at the destination must buffer the subsequent message $M_2$ and decide when to deliver it to the application. Because each flow QOS specification specifies the delay bounds for individual messages and the packet rate, MCP uses the information to derive a delivery deadline for each message. Thus, $M_2$ is buffered only until its deadline and is delivered to the application (with an exception reported) if $M_1$ fails to arrive by the deadline. subsubsectionToken Management For each participant of a flow, MCP allocates a separate MCP PCB that maintains the state information about the token for the participant. MCP enforces token semantics in the following way: When a sender tries to send a message across a flow using MCP, the MCP session layer transmits the message only if the sender PCB holds a token. Otherwise, MCP returns an error indicating that the token is not present and it is left to the application to explicitly request transfer of a token from the current token holder. When a token request is made, the MCP session layer broadcasts the request across the multipoint flow. When the request is received by MCP layer at a token holder, it immediately acknowledges receipt of a request so that the sender may not repeatedly broadcast its request. MCP layer at the token holder then notifies its application using an asynchronous event notification mechanism that identifies the requesting participant. MCP does not implement any particular token management policy and it is left to the application to decide whether and when to transfer the token to the requesting entity. Token replication requests are handled locally by updating appropriate fields in the MCP PCB for the participant. MCP also implements other token management calls such as copy_token delete_token and distribute_token. subsubsection$Delta$-Causal Synchronization When multiple senders are involved in a conversation, MCP session layer implements a limited notion of causality called $Delta$-causality. Whenever a conversation participant $A$ sends a message $M_1$, MCP session layer at $A$ includes some context information in the message header. The context information included with each message is a pair verb++ for each participant (or sender) in the conversation. That is, the context includes the sequence number of the last message received at $A$ from each participant before sending $M_1$ (see Figure~mcp-buff- mgmt). figure[tbp] captionAt the top, figure shows the context information included by A in its message number 18. Context information in the message specifies that the message should be delivered ONLY after message number 17 (from A), message number 2 (from C), and message number 12 (from D) have been delivered. However, the context information at B indicates that B has received so far first 15 messages from A, first 2 messages from C and first 10 messages from D. Therefore, message 18 from A is queued up and added to the dependency lists. mcp-buff-mgmt figure Thus, when $M_1$ arrives at another participant $B$, the MCP session layer does the following: MCP at B checks to see whether or not it has already received all the messages specified in the context of $M_1$. If yes, MCP schedules $M_1$ for delivery. If not, MCP can build the dependency list of messages for $M_1$. Figure~mcp-buff-mgmt shows an example. Pending reception of previous messages in the context, MCP computes the delivery deadline of $M_1$ using the $Delta$ value, sets a timer to go off at the deadline, and buffers $M_1$. $M_1$ is buffered in a message dependency list according to its position with respect to pending messages. Whenever another message $M_2$ arrives over the conversation, MCP checks the head of the dependency list to see whether there are any messages waiting for the arrival of $M_2$. If so, those packets are dequeued and are scheduled for delivery along with $M_2$ according to their delivery deadlines. If message $M_2$ itself contains some unresolved dependencies, it is also queued up (see Figure ~mcp-advance-buff). When a timer goes off for a message $M$, MCP dequeues $M$ and any messages that were queued behind $M$ according to dependency information and queues them up for delivery to the application layer. figure[t] centerlinepsfigfigure=mcpadvbuff.ps captionFigure shows MCP actions when another message (seq. no. 4) arrives from C. The new message depends on the messages 18 and 12 from A and D respectively. mcp-advance-buff figure subsubsectionAsynchronous Event Notification asynch-sect One of the interesting MCP implementation issues concerned an user interface concern. A multimedia application involving multiple data streams must simultaneously handle several events including arrival of traffic over individual flows, mouse and keyboard events, and asynchronous events such as token arrival or a token transfer request. Conventional approach to building such applications in Unix is to use the select system call which is clumsy at best. It is our belief that system designers must allow an upcall- based user interface to allow notification of asynchronous, event-triggered actions and MCP assumes such an interface. Under such an interface, an MCP application can specify an action to be taken when a particular event occurs. For example, the MCP service primitive verb+flow_status(, )+ specifies the upcall procedure to be called when one of the pre-defined status changes or events occurs on a flow. The arguments to the verb+procname+ include flow identifier, an integer code specifying the kind of change, and optionally the user data received. However, Unix does not support upcalls. We are currently investigating alternate ways of implementing upcalls in Unix. Currently, such a facility is implemented in a MCP library routine. The library routine in conjunction with MCP kernel simulates an upcall as follows. MCP reports an asynchronous event by sending a signal ( SIG_URG) to the user process and setting the outofband flag in the socket structure. The upcall_handler library routine first uses getsockopt call to retrieve the cause of the event and any additional information from the kernel, and then calls the corresponding upcall routine that is previously registered by the application. Experience with MCP experience The experimental set up for MCP prototype consists of four SUN Sparc ELC workstations equipped with audio devices on an ethernet in our Distributed Systems Laboratory. The laboratory is mainly used for experimental research and consists of a dozen HP and Sun workstations with three Network file servers. We have implemented many multi-party, interactive applications to test/debug MCP kernel, to evaluate the effectiveness of MCP abstractions, and to measure performance of multimedia applications with and without using MCP. In the following, we describe our experience using MCP and Section~perf discusses the results of our performance evaluation. Simple Tests To help us test and debug various parts of MCP implementation, we wrote two simple applications, namely, mshell and mtalk. Mshell is an interactive shell that provides command-level access to various MCP primitives such as flow creation, conversation creation, joining/leaving a conversation, and token request/transfer. Thus, Mshell was useful in testing each MCP service primitive separately and in various combinations with other primitives. Mtalk (for Multi-user talk) is a multi-party conferencing program that allows a conversation among multiple users[coip-k. At each user, the program divides the screen into two parts: an input region and display region. Lines of text typed in input region are transmitted over a MCP flow to all the other users and text from other users is displayed in the display region of the screen. In particular, we used mtalk to test our implementation of multi-point flows using IP multicast. Nevot NEVOT (NEtwork VOice Terminal) is a network voice terminal that supports multiple concurrent conferences (both two-party and multi-party) on top of a variety of transport protocols and using a range of audio encodings from vocoder to multi-channel CD quality. It is developed as an experimental tool by Henning Schulzrinne of Univ. of Massachusetts and has been implemented on top of TCP, UDP and ST-II. We have ported Nevot to run on top of MCP and used Nevot/MCP in two ways. First, Nevot allowed us to gain some experience in using MCP token management primitives in implementing different conference control styles. Second, we evaluated the cost of using MCP by comparing the performance of audio conferencing applications implemented on top of Nevot/MCP, Nevot/UDP, and Nevot/MCP. Using Nevot, we built an audio conferencing application and experimented with the following conference control styles: description [ Floor Control:] This is a user driven conference control style used commonly in teleconferencing applications. Under this model at most one speaker holds ``floor'' at any time and is allowed to talk at any time. Other users must request control of floor before speaking. MCP supports such a user-driven control using tokens. For each flow, there is only one token allocated initially and the applications can transfer token from one speaker to another based on a policy that is chosen by the application. In addition, MCP allows an application to make multiple copies of a token and distribute copies among few participants. In Nevot, for instance, we can implement a hierarchical conference control where a subgroup of participants have access to the floor at any time and can separately control arbitration of speaking rights among the members of the subgroup. Our overall experience so far has been that token management in MCP works very well in implementing a range of floor control policies. However, users must be made aware of token management because a speaker must typically be told to wait until a token is requested and obtained from another speaker. [ Activity Sensing:] Floor-control based conference control using explicit transfer of tokens is not suitable for many collaborative multimedia applications involving shared windows, image browsers, and group text editors. In such an environment, strict control is desired to ensure that at most one user is allowed to manipulate parts of a shared workspace. However, transparent transfer of transmission rights is desirable whenever a user finishes her update (idle for some time), or current sender of data moves mouse (focus) away from a shared window. An example of transparent floor control is a distributed, voice activated, collision sensing algorithm proposed in[jj-86paper]. Under an activity sensing algorithm, the underlying software transfers token to another requestor when it notices that the token holder has been idle for some time. However, such an algorithm requires choosing a correct time interval to decide whether or not a speaker is in idle (quiescent) state. Our experience showed that an idle interval of 200 to 400 milliseconds gives good results in terms of detecting quiescence. The algorithm works well only in a local area environment and may not work so well when propagation delays are large over a wide area network (WAN). We are now trying out several variations on activity sensing algorithms. [ Brain Storming:] An extreme alternative to strict floor-based control is free for all in which no explicit conference control is exercised and all participants are free to transmit whenever they wish. We have found this to be a useful and efficient form of conferncing especially when a conference enters a discussion phase where there is no obvious speaker at any time. description Tests involving $Delta$-causality To test $Delta$-causality, we wrote another application that created a MCP conversation consisting of three multipoint flows: a background flow, $F_file$ that repeatedly transferred a large file, a background audio flow $F_au$ that continuously played back file containing pre-recorded music, and an interactive mtalk flow in the foreground involving at least three participants. The tests involved simulating variable delays and random packet loss inside the kernel. In the experiments, about 1 packets were chosen at random for drop and, in addition, the end-to-end delay was varied between 20 milliseconds to 500 milliseconds. The parameter $Delta$ was chosen to be 200 milliseconds which happens to be an acceptable delay for packet voice before delays cause noticeable breaks in audio reception. The tests served two purposes. First, they helped us verify correctness of our implementation of causal and temporal synchronization. For instance, MCP kernel estimates the maximum buffering necessary to buffer packets delayed until their $Delta$ deadlines and also computes the rate of packet delivery based on QOS specifications. Tests allowed us to verify that buffer estimates were sufficient and packet rates could be maintained in a non real-time operating system such as Unix. Second, we wanted to see how does enforcement of $Delta$-causality affect the audio playback at participants in the presence of background file transfer and bursty interactive traffic. Our experience is that MCP can easily maintain the deadlines associated with $Delta$- causality in the presence of occasional packet loss. Packets are lost either due to random drop (network errors) or because packets that arrive beyond the delivery deadline are discarded. Assuming resource reservation at the network layer, we expect the packet loss and delays in future high-speed Internet to be of similar nature. Tests also showed that loss and delays in delivery of audio packets did not cause easily perceptible degradation in playback quality. However, more systematic and quantitative measures are necessary to evaluate impact of such losses. Performance Evaluation perf Goals of our performance evaluation were two-fold. First, we wanted to determine the impact of using MCP on a multimedia application. Specifically, we were interested in comparing the performance of an application implemented using MCP against the performance of the same application implemented using other transport protocols such as UDP and TCP. Second, we wanted to compare the performance of MCP kernel implementation against implementation of MCP as a run-time library. In the following, we describe the results of our performance evaluation. Whenever end-to-end measured values for an operation were too small for clock resolution, we repeated the operation several times and computed the average value. TCP & UDP & COIP-RAW & MCP-kernel Loopback (ms) 1.68 & 1.28 & 1.26 & .96 Ether (ms) 1.80 & 1.47 & 1.36 & 1.07 caption The columns show the cost of transferring a token from one participant to another located across the ethernet or on the same machine (loopback). The values represent an average computed across 100 token transfers and experiment was repeated several times to verify that the results are consistent. nevot-token-transfer table TCP & UDP & COIP-RAW & MCP 1.9 & 1.9 & 1.9 & 46 The columns show the cost of replicating a token in microseconds. Replicating a token only involves updating a field in MCP PCB and no communication. Cost of token replication is higher with MCP kernel implementation because it involves a system call as compared to simply updating a variable in MCP library in case of first three implementations. nevot-token-replicate table TCP & TCP (NO_DELAY) & UDP & COIP-RAW & MCP-kernel usertime 45.3 & 45 & 41.8 & 35 & 35/45 system 29.6 & 41 & 33.6 & 34.3 & 35/43 caption Statistics at the Nevot sender side for 20000 packets sent at the rate of 50 packets per second. Each packet contains an audio sample worth 164 bytes. The second (larger) time for nevot on the MCP kernel take into account the cost of activity sensing. nevot-sender table TCP & TCP(NO_DELAY) & UDP & COIP-RAW & MCP-kernel usertime 38.2 & 38.5 & 37.9 & 38.7 & 38.8 system 29.2 & 40.8 & 29.9 & 33 & 33.35 latepkt 21-11 & 3 & 3 & 5 & 5 caption Statistics at the Nevot receiver side. At the receiver side, row labeled late packets shows the number of packets that arrive too late for playout. nevot-receiver table Comparison with Other Protocols We have implemented MCP services as part of a run-time library on top of three different protocols: TCP, UDP, and COIP-RAW. COIP- RAW provides direct access to COIP network interface through Unix sockets. We ported Nevot to run on each of the MCP library implementations and on top of MCP kernel. Tables~nevot-token-transfer through nevot-receiver show the results involving token operations and use of Nevot. Tables~nevot-sender and nevot-receiver show results of a Nevot experiment involving a sender and a receiver. Sender transmitted contents of an audio file (audio playback) in a loop. Nevot prints statistics collected using Unix getrusage call after every 20000 packets and statistics include application's user and system times. Packets arriving late at the receiver are discarded. MCP kernel performance is comparable to that using other protocols. On the receiving side, Nevot/TCP loses packets due to late arrivals even across an ethernet because TCP queues up data at sending side to send in chunks. Using TCP_NODELAY option prevents TCP from queuing up data and that reduces the number of late packet arrivals, but the total times are then much worse as shown in Tables. Kernel vs. Library Implementation Type of Connection & Implementation level & MCP Library & Kernel Point to Point & 35.6 & 6.7 Multipoint & 57.6 & 4.5 caption The Conversation create time (in milliseconds) shows the amount of time needed to establish a single-flow conversation involving two participants in each case. Establishing a conversation consists of creating a control flow and updating MCP state to add the flow to a conversation. Typically, a conversation consists of more than one flow, but the cost of creating a multi-flow conversation is similar to that of creating a single-flow conversation. conv-create table Type of Connection & Implementation Method & MCP Library & MCP Library & Kernel & (COIP-RAW) & (TCP) Point to Point& 257 & 400 & 78 Multipoint & 441 & -- & hline caption Data Transfer time for 100 packets each containing one byte of information. All the timings are in milliseconds. giri-data-xfer table To evaluate the impact of implementing MCP in a kernel, we also implemented MCP kernel as a run-time library on top of COIP-RAW interface. A multimedia application using MCP library is implemented as a multi-process application where one thread is responsible for handling data transfer across a flow in a conversation. MCP state common among threads is maintained using shared memory. Thus, creating a conversation requires starting a new process to manage the control flow and communication with remote participants to establish a conversation. Implementing library routines through multiple threads of an application simplifies programming and makes it easy to hide the details of handling data transfer and asynchronous events over individual flows. However, as Tables~conv-create and giri-data-xfer shows, cost of operations such as creating a conversation and data transfer is significantly higher using the library as compared to the kernel implementation. Related Work related The work described in this paper is related to current research in two main areas: user interface for coordination and conference control and OS and protocol support for enforcement of QOS and temporal/causal synchronization. In the area of conference control and collaboration, CECED[ceced] system is a notable example of a software system that supports collaboration involving multiple users with minimal intrusion of software and user styles. For coordination and floor control, CECED uses an activity sensing algorithm as described earlier. We have used the algorithm as a basis for further experimentation with alternate activity sensing policies. Schulzerrinne[henning-ietf-draft] discusses several issues involved in designing a real-time transport protocol (RTP) to support audio conferencing across the Internet. RTP's main focus on achieving playout synchronization. MCP design[mcp-dcs] precedes RTP design and differs from RTP because MCP considers causal synchronization in addition to the temporal synchronization. Recent work at Tenet group in Berkeley[tenet-report-mim] has identified some fundamental requirements of multimedia collaborative applications including conference management, QOS mechanisms, support for $N X M$ (N senders and M receivers) data flows, and so on. MCP provides functionality to meet several of those requirements and although MCP currently supports only symmetric $N X N$ flows, it can easily be modified to support $N X M$ flows. MMCC, Multimedia Conference Control system, developed at USC/ISI[mmcc-joi] concentrates only on providing session level conference management functions such as conference establishment, inviting parties to join a conversation, merging conferences and starting a ``side-conference'' through an application level connection manager. MCP includes most of such support and, in addition, provides communication mechanisms to enforce several varieties of conference control and to achieve synchronization among several flows. A system developed by Angebranndt and others[louds] shares some of the goals of MCP. Their system contains an audio protocol that supports audio connections and virtual devices. Synchronization in that system refers to synchronizing beginning of playout at several audio devices and achieving lip-syncing. MCP does not contain mechanisms to synchronize playback starts at several sites, but provides a general purpose temporal and causal synchronization facility irrespective of whether media sources are of fixed rate or totally asynchronous variety. Another nice aspect of Angebranndt's system is that their programming interface supports a set of command queues (or scripts) that allow specification of synchronization among audio devices. The system also reports several types of events including device operation completions, arrival of synchronization points, and so on. MCP, on the other hand, only considers a subset of communication events related to token passing and changes in membership of a flow or a conversation. Summary concl We have described design and implementation of MCP session and transport layer protocols that provide communication mechanisms and abstractions useful for building distributed multimedia applications. We have implemented MCP in SUN OS 4.1.1 kernel and evaluated its performance and suitability of its mechanisms for building multimedia applications. Our work is significant in following respects. First, to the best of our knowledge, this is one of the first Unix-based implementation of transport and session level services designed explicitly to meet the coordination and temporal/causal synchronization needs of multimedia applications. Second, we have shown that coordination and synchronization mechanisms can be efficiently implemented in a general purpose OS such as Unix. Third, we have identified alternate conference control styles that should be supported to meet a wide range of applications including audio conferencing, group editors, and shared workspace. We plan to continue further research in various conference control styles and user interfaces for collaboration using multiple media. During our work, we have also discovered some of the limitations of the process scheduling and event notification mechanisms in Unix-like operating systems. For instance, Unix lacks an upcall mechanism and it is not easy to promptly notify a process of an asynchronous event such as arrival of a token or a token request. We have identified some possible extensions to Unix scheduling mechanism to overcome such a problem and plan to pursue further research in that direction. Acknowledgments Authors would like to thank Charles Kranor and Guru Parulkar of Washington University in St.~Louis for making COIP-K sources available to us and for cheerfully putting up with our questions about COIP-k software during the initial stages of MCP implementation. [j-86paper] L.Aguilar, J.J Garcia-Luna-Aceves, D.Morgan, E.J. Craighill, and R.Brungardt. Architecture for a multimedia teleconfrencing systsm. In Proceedings of ACM SIGCOMM '86 Symposium, 1986. [clouds] Susan Angebranndt, RichardL. hyde, DaphneHuetu Luong, and Nagendra Siravara. Integrating audio and telephony in a distributed workstion environment. In USENIX '91 Conference Proceedings, pages 419--435. USENIX, June 1991. [ceced] E.Craighill, R.Lang, K.Skinner, M.Fong, and J.J. Garcia- Luna-Aceves. CECED: A Ssystem for Informal Multimedia Collaboration. unpublished report, SRI International, CA, January 1993. [coip-k] C.Cranor. An Implementation Model for Connection-Oriented Internet Protocols. Technical report, Department of Computer Science, Washington University, May 1992. MS Thesis. [deering-multicast] S.E. Deering. Multicast Routing in Internetworks and Extended LANs. In Proceedings of ACM SIGCOMM '88 Symposium, August 1988. [suite] Prasun Dewan. A Guide to Suite: Version 1.0. Technical Report SERC-TR-60-P, Software Engineering Research Center, Purdue University, February 1990. [ellis-gibbs-89] C.Ellis, S.Gibbs, and G.L. Rein. Design and Use of a Group Editor. In IFIP WG2.7 Working Conference on Enginerring for Human Computer Interaction. North Holland, August 1989. [ferrari-verma] D.Ferrari and D.Verma. A scheme for real-time channel establishment in wide-area networks. IEEE Journal on Selected Areas in Communications, 8(3), April 1990. [lamport-timeout] Leslie Lamport. Using Time Instead of Timeouts. ACM Transactions on Programming Languages and Systems, 6(2):254--280, April 1984. [lantz-cscw86] KeithA. Lantz. An Experiment in Integrated Multimedia Conferencing. In Proceedings of Conference on Computer-Supported Cooperative Work, pages 267--275, December 1986. [4.3-bsd-book] SamuelJ Leffler, MarshallKirk McKusick, MichaelJ Karels, and JohnS Quatarman. The Design and Implementation of the 4.3 BSD UNIX Operating System. Addsion Wesly, 1989. [rfc1128] DaveL. Mills. Measured performance of the network time protocol in the internet system. Network Working Group Request for Comments: 1128, October 1989. [stii-partridge] Craig Partridge and Stephen Pink. An Implementation of the Revised Internet Stream Protocol (st-2). Journal of Internetworking Research and Experience, March 1992. [henning-ietf-draft] H.Schulzrinne. Issues in designing a transport protocol for audio and video conferences and other multiparticipant real-time applications. INTERNET-DRAFT, IETF Audio-Video Transport Working Group, December 1992. [nevot] H.Schulzrinne. Voice Communication Across the Internet: A Network Voice Terminal. Technical report, Department of Computer Science, University of Massachusetts, Amherst, MA, August 1992. unpublished report. [mmcc-joi] EveM. Schooler. Case study: Multimedia conference control in a packet-switched teleconferncing system. Journal of Internetworking: Research and Experience (1993), 1993. [colab-from-xerox] Mark Stefik, Gregg Foster, Daniel Bobrow, Kenneth Kahn, Stuan Lanning, and Lucy Suchmann. Beyond the Chalkboard: Computer Support for Collaboration and Problem Solving in Meetings. Communications of the ACM, 30(1):32--47, January 1987. [tenet-report-mim] Clemens Szyperski and Giorgio Ventre. A characterization of multi-party interactive multimedia applications. Technical report, The Tenet Group International Computer Science Institute and Computer Science Division UC Berkeley, February 1993. [swinehart-system-support] DanielC. Swinehart. System Support Requirements for Multi-media Workstations. In Proccedings of the SpeechTech '88 Conference, pages 82--83, New York, April 1988. Media Dimensions, Inc. [rajthesis] R.S. Yavatkar. An Architecture for High- Speed Packet Switched Networks. PhD thesis, Purdue University, August 1989. Also available as TR-898. [mcp-dcs] Raj Yavatkar. MCP: A Protocol for Coordination and Temporal Synchronization in Multi-media Collaborative Applications. In Proceedings of the 12th International Conference on Distributed Computing Systems. IEEE, June 1992. [rgp-paper] R.Yavatkar and V.Dumasia. Reliagram -- A Communication Abstraction for Distributed Processing. In Proceedings of the IEEE Symposium on Parallel and Distributed Processing, December 1991. [rsvp] L.Zhang, S.Deering, D.Estrin, S.Schenker, and D.Zappala. RSVP: A New Resource ReSerVation Protocol. submitted to ACM SIGCOMM '93, March 1993. document