The following paper was originally published in the Proceedings of the USENIX 1996 Annual Technical Conference, San Diego, California, January 1996. For more information about the USENIX Association: phone 510 528-8649, FAX 510 548-5738, email office@usenix.org, WWW https://www.usenix.org.

FLIPC: A Low Latency Messaging System for Distributed Real Time Environments

David L. Black, Randall D. Smith, Steven J. Sears, and Randall W. Dean
Open Software Foundation Research Institute, Cambridge, MA
dlb@osf.org, randys@osf.org, sjs@novell.com, rwd@osf.org

Abstract

FLIPC is a new messaging system intended to support distributed real time applications on high performance communication hardware. Application messaging systems designed for high performance computing environments are not well suited to other environments because they lack support for the complex application structures involving multiple processes, threads, and classes of message traffic found in environments such as distributed real time. These messaging systems also have not been optimized for the medium size messages found in important classes of real time applications. FLIPC includes additional features to support applications outside the high performance computing domain. For medium size messages, our system significantly outperforms other messaging systems on the Intel Paragon. An explicit design focus on programmable communication hardware and the resulting use of wait-free synchronization was a key factor in achieving this level of performance. The implementation of FLIPC was accelerated by our use of PC clusters connected by ethernet or by a SCSI bus as development platforms to reduce the need for Paragon time.

Introduction

Messaging systems for high performance computing are designed to deliver as much of the hardware performance of the interconnect to applications as possible. Examples include the NX system on the Intel Paragon [11], CMMD on the Thinking Machines CM-5 [17], and Active Messages on various platforms [2][19]. These systems have been optimized for numerical supercomputing to the detriment of their use in other environments. Among the factors that contribute to this limitation are the assumption of a single threaded application structure, system optimizations for small and large message sizes to the detriment of intermediate sizes, and the assumption of a single application environment in which there are no competing sources of message traffic. Our work concerns distributed real time environments in which these assumptions do not hold.

Continuing advances in networking technology, such as ATM [18] and Myrinet [1], have dramatically increased the communication performance available to environments other than numerical supercomputing. This paper describes our experimental implementation of a high performance messaging system for one such environment, namely event driven distributed real time. This environment is characterized by the need to support multiple application threads of varying importance or priority, and messaging streams of varying importance, concurrently on each node.
This includes not only processing, but also resource allocation. For example, the system must not only process a message announcing detection of an incoming missile in preference to a message indicating that it is time for preventative maintenance, but must also ensure that the latter message does not consume resources required to handle the former. The event driven nature of applications in this environment results in messages that fall between the small and large sizes found in numerical supercomputing. The events cannot be described by very small messages, and aggregation of events into larger messages is limited by the impact of the aggregation delay on system response. Examples of such systems include distributed systems for process control, factory floor automation, and military command and control [5] (e.g., AEGIS, AWACS).

The FLIPC messaging system is designed for event driven real time environments. Among its more important characteristics are:

- performance optimization for medium sized messages (50-500 bytes),
- support for multithreaded applications,
- support for threads and message streams of varying importance, and
- support for explicit control of resource allocation.

Communication Interface Architecture

FLIPC is designed to leverage the powerful programmable controllers that are increasingly found in the interfaces to high speed interconnects and networks. Examples of these controllers include the dedicated message processor on each node of a Paragon system [6], embedded microprocessors in communication controllers such as the i960 [4], and custom designed communication controllers such as Myrinet [1], SCSI [10], and FLASH [8]. The increasing use of such controllers is motivated by the need to match increases in processor speed with increases in communication performance, and the resulting desire to off-load communication functionality from the main processor(s). Similar trends have been observed in other domains, such as mainframe input/output and graphics controllers [14]. Experience in these domains suggests that the power and functionality of the programmable controllers in network interfaces will continue to increase.

The power of these interface controllers cannot be directly utilized by applications because the controllers and their execution environment are functionally specialized to communication tasks. The resulting obstacles to execution of application code include:

- Instruction Set Differences: Of the controllers cited above, only the dedicated Paragon processor employs the same instruction set as the application processor(s).
- Protection Concerns: Controllers have access to critical system resources, often including all of physical memory. Communication is also critical to system operation, as a controller crash may imply a node crash. A related problem is that an application must be prevented from monopolizing the communication resource to the exclusion of other applications. As a result, untrusted code cannot be executed on such an interface controller.
- Memory Access Restrictions: Neither the SCSI nor Myrinet controllers are capable of performing read-modify-write atomic operations such as test-and-set on main memory.
- Execution Restrictions: A common structure for controller software is a non-preemptible event loop. This puts restrictions on the time that may be consumed by added code; excessive consumption may have undesirable side effects on unrelated communications.
FLIPC's architecture takes advantage of interface controllers by programming them as operating system components rather than as part of the applications. This enables the use of system programming tools that target the embedded controller, and simplifies protection concerns. Restrictions on memory access and execution are dealt with in the design of the FLIPC component that executes on the controller. The alternate approach of using software fault isolation [20] is of limited utility because correctness of the code executing on the controller depends not only on the memory it accesses, but also on completing execution in a timely fashion. Restrictions such as forbidding backwards branches are necessary, greatly reducing the programming flexibility.

Message Sizes

The design of FLIPC is based on the observation that there are three distinct message size classes with corresponding distinct implementation techniques for high performance. The classes and corresponding implementation techniques are:

- Small. Small messages can both be stored in registers and be copied without seriously impacting performance in either case. Current technology puts an upper bound on their size at about 32 bytes on a 32 bit processor with 32 registers. Implementation techniques for this class include carrying a message in registers and allowing the message to be copied to avoid coupling application and message transport interfaces. Hardware DMA functionality may not be useful for this class of communication because the messages are too short to amortize DMA setup costs.
- Medium. Medium messages are too large to fit in registers or to be copied without impacting performance, but too small to amortize the setup cost of a throughput-oriented transfer protocol. The result is a size range from around 40 bytes to a kilobyte or small number of kilobytes. Direct access to application memory and the use of DMA support, if available, are important to achieving performance.
- Large. The lower bound of this class of messages is the size at which the primary performance concern shifts from message latency or message throughput to data throughput within a single message. Optimizing for data throughput requires a protocol that maximizes use of the available bandwidth without causing receiver overrun. This necessitates transfers directly to and from application memory, and may require flow control. The resulting protocols often have a setup phase to validate the memory addresses on the remote system and initialize any required flow control. Messages using such a protocol must be large enough to amortize this setup overhead. This minimum size may be a kilobyte or more.

With the exception of Express Messages [9], most other messaging systems have recognized only the small and large message classes. FLIPC is designed to optimize performance of the medium class of messages by using a shared memory interface to avoid copying, an optimistic basic transfer protocol designed to move this class of messages, and a flow control architecture that does not involve the basic transport protocol.
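As a rough illustration of this three-way classification, the following C sketch selects a transfer strategy from a message size. The function name, the strategy labels, and the exact cutoffs (about 32 bytes and about a kilobyte, per the discussion above) are illustrative assumptions rather than anything defined by FLIPC.

    /* Illustrative only: a strategy selector encoding the approximate
     * size-class boundaries discussed above.  Real systems tune these
     * thresholds per platform; none of these names come from FLIPC. */
    #include <stddef.h>
    #include <stdio.h>

    enum xfer_strategy {
        XFER_SMALL,   /* carry in registers; copying is acceptable      */
        XFER_MEDIUM,  /* shared buffer plus optimistic fixed-size sends */
        XFER_LARGE    /* setup phase, DMA and flow control, bulk moves  */
    };

    static enum xfer_strategy classify(size_t bytes)
    {
        if (bytes <= 32)
            return XFER_SMALL;
        if (bytes <= 1024)
            return XFER_MEDIUM;
        return XFER_LARGE;
    }

    int main(void)
    {
        printf("%d %d %d\n", classify(20), classify(120), classify(65536));
        return 0;
    }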
Architecture and Design

FLIPC contains three major components, as shown in Figure 1:

- The messaging engine. This is the body of hardware and software that moves messages between nodes.
- A fixed size, non-pageable communication buffer that is shared between the messaging engine and all applications that use FLIPC.
- An application interface layer that provides formal interfaces to applications and hides the data structures in the communication buffer. This consists of both a library and header file(s).

The operating system kernel is involved only in synchronization actions that cannot be directly accomplished via state in the communication buffer. The heavy arrows in Figure 1 indicate message flow, and the light arrows indicate synchronization operations that require interaction with the operating system kernel.

The messaging engine is an independently executing component of the system. It is intended to execute on the programmable controller in the communication interface when one is present, but can also be implemented as part of the operating system kernel for debugging purposes or on systems lacking the required hardware. Synchronization between the messaging engine and the application consists entirely of wait-free synchronization, making it impossible for an errant application to stall the communication controller.

The communication buffer is the focal point of FLIPC. It is located in shared memory accessible to both the application(s) and the messaging engine, and it contains all of the memory resources used for messaging. This design enables direct interactions between the application and the messaging engine without requiring code from either component to cross the protection boundary into the other. An additional result is that the OS kernel is removed from the messaging path, making that path significantly shorter. Protection of the messaging engine from the application can be enforced via appropriate checks in the messaging engine, but these checks can be removed to increase performance for a trusted application.

FLIPC implements multiple endpoints per communication buffer to support application requirements. Using different endpoints for different threads avoids contention for communication resources in multithreaded applications. Multiple cooperating applications can run on a single node by dividing or otherwise sharing the endpoints in a single communication buffer. The implementation of resource control at the endpoint level makes it easy to separate resources for different classes of traffic by using different endpoints.

An endpoint group logically combines multiple endpoints into a single abstraction. FLIPC supports a receive operation that retrieves a message from an endpoint if there is an available message on any endpoint in the group. This operation is implemented entirely in the library, because the resource control model's association of buffers with endpoints makes it infeasible to merge the endpoint buffer queues. FLIPC also supports a blocking receive operation on an endpoint group. This design approach was chosen over the interrupting upcall methodology of active messages because application level interrupts are not appropriate for a multitasking real time environment. Interrupts disrupt execution in a way that cannot be controlled by the scheduler, reducing the real time predictability of the system. In contrast, FLIPC provides a real time semaphore option that causes the thread awakened by a message arrival to be presented to the scheduler in the OS kernel, allowing the scheduler to determine when it is appropriate to execute that thread.
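The paper describes the group receive as a purely library-level operation but does not show its code; the sketch below is one minimal way such a poll over the member endpoints could look. All of the names (flipc_buffer_t, flipc_endpoint_t, flipc_endpoint_reap, flipc_epgroup_t) are illustrative placeholders, not the actual FLIPC interface, and the per-endpoint dequeue is assumed as a primitive.

    /* Minimal sketch of an endpoint-group receive implemented entirely in
     * the application interface library, as described above.  Names are
     * placeholders, not the real FLIPC interface.  Because buffers remain
     * associated with individual endpoints, the group operation polls the
     * members rather than merging their queues. */
    #include <stddef.h>

    typedef struct flipc_buffer   flipc_buffer_t;
    typedef struct flipc_endpoint flipc_endpoint_t;

    /* Assumed primitive: dequeue a delivered message from one receive
     * endpoint, or return NULL if none is available. */
    extern flipc_buffer_t *flipc_endpoint_reap(flipc_endpoint_t *ep);

    typedef struct {
        flipc_endpoint_t **member;   /* endpoints combined by this group */
        int                count;
        int                rotor;    /* where the last successful poll stopped */
    } flipc_epgroup_t;

    /* Return a message from any member endpoint, or NULL if all are empty.
     * Starting from the rotor gives rough fairness across the members. */
    flipc_buffer_t *epgroup_try_receive(flipc_epgroup_t *g, flipc_endpoint_t **from)
    {
        for (int i = 0; i < g->count; i++) {
            int idx = (g->rotor + i) % g->count;
            flipc_buffer_t *msg = flipc_endpoint_reap(g->member[idx]);
            if (msg != NULL) {
                g->rotor = (idx + 1) % g->count;
                if (from != NULL)
                    *from = g->member[idx];
                return msg;
            }
        }
        return NULL;   /* a blocking variant would wait on the real time
                          semaphore described above before retrying */
    }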
Fixed size messages simplify the messaging engine logic. The actual size restriction is platform dependent. For the Intel Paragon, the characteristics of the DMA support in the interconnect interface require a message size that is at least 64 bytes and a multiple of 32 bytes. FLIPC uses 8 bytes of each message for internal addressing and synchronization purposes, so 56 bytes is the minimum application message size. Transfer of messages larger than the fixed size selected at boot time is not supported.

FLIPC shields applications from buffer alignment restrictions by internalizing all message buffers. An application must call FLIPC to allocate a message buffer, allowing the implementation to ensure that all such buffers are correctly aligned.

Addressing facilities were designed to minimize implementation complexity. FLIPC message destinations (receive endpoint addresses) are opaque and determined by the system. This requires receivers to obtain from FLIPC the addresses of endpoints they have allocated and to pass those addresses to senders. FLIPC does not contain a nameservice of its own, but assumes that one is available for this purpose.

Message Transfer

FLIPC's basic messaging operation performs an asynchronous one way delivery of a fixed size message from a source endpoint to a destination endpoint. The send and receive interactions with the messaging engine are symmetric in that both involve queueing a message buffer for the messaging engine and subsequently determining whether the send or receive operation has completed. Both polling and blocking versions of completion detection are supported.

Figure 2 shows the complete sequence of operations involved in a message transfer. Step 1 involves the receiver providing a message buffer to receive the message. The sender sends the message at step 2 by queueing the message buffer on the endpoint for the messaging engine. The messaging engine transfers the message in step 3, queueing it on the receive endpoint by placing it in the buffer supplied at step 1. The receiver receives the message by removing it from the receive endpoint at step 4, and the sender recovers its buffer for reuse by removing it from the send endpoint at step 5. Steps 2 through 4 are the actual message delivery path, with steps 1 and 5 being associated resource control operations.
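As a rough illustration of the five steps just described, the sketch below walks one message through the sequence using hypothetical interface names (flipc_buffer_alloc, flipc_endpoint_post, flipc_endpoint_reap); these stand in for, and should not be read as, the real FLIPC calls.

    /* Illustrative walk-through of the five-step transfer in Figure 2.
     * Every function name here is a placeholder for the corresponding
     * FLIPC operation, not the actual interface. */
    #include <stddef.h>

    typedef struct flipc_buffer   flipc_buffer_t;    /* FLIPC-managed message buffer */
    typedef struct flipc_endpoint flipc_endpoint_t;  /* send or receive endpoint     */

    /* Assumed placeholder primitives at the application library level. */
    extern flipc_buffer_t *flipc_buffer_alloc(void);                           /* aligned buffer in the comm buffer */
    extern void flipc_endpoint_post(flipc_endpoint_t *ep, flipc_buffer_t *b);  /* queue a buffer for the engine     */
    extern flipc_buffer_t *flipc_endpoint_reap(flipc_endpoint_t *ep);          /* dequeue a processed buffer or NULL */

    void receiver(flipc_endpoint_t *recv_ep)
    {
        /* Step 1: provide a buffer into which a message can be received. */
        flipc_buffer_t *b = flipc_buffer_alloc();
        flipc_endpoint_post(recv_ep, b);

        /* Step 3 happens asynchronously: the messaging engine delivers an
         * incoming message into the posted buffer. */

        /* Step 4: receive the message by removing it from the endpoint. */
        flipc_buffer_t *msg;
        while ((msg = flipc_endpoint_reap(recv_ep)) == NULL)
            ;  /* polling shown; a blocking variant is also supported */
        /* ... process msg, then repost or free it ... */
    }

    void sender(flipc_endpoint_t *send_ep, flipc_buffer_t *filled)
    {
        /* Step 2: send by queueing the filled buffer on the send endpoint. */
        flipc_endpoint_post(send_ep, filled);

        /* Step 5: once the engine has transmitted it, reclaim the buffer. */
        flipc_buffer_t *done;
        while ((done = flipc_endpoint_reap(send_ep)) == NULL)
            ;  /* again, polling or blocking completion detection */
        /* 'done' may now be refilled and sent again. */
    }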
FLIPC uses a message delivery model that combines a reliable ordered transport with higher level responsibility for resource management and any associated flow control. The messaging engine provides a reliable transport that preserves order for messages sent from the same source endpoint to the same destination endpoint. Vesting responsibility for resource management and flow control in the application and the application library allows the transport to ensure that every node is always prepared to receive from the interconnect. If a receive occurs without an available buffer on the destination endpoint, the received message is discarded. The resulting guarantee that all messages in transit can always be accepted by their receiving node avoids deadlocks on reliable interconnects. FLIPC maintains a counter in each endpoint to track discarded message events, and makes this available to applications.

Flow control to avoid discarded messages can be provided either by applications or by libraries designed to fit between applications and FLIPC. This structure greatly simplifies the buffer management logic in FLIPC and allows flow control policies to be customized to application needs. In some cases, static properties of the application structure may remove the need for runtime flow control. One example is that an RPC interaction structure with a fixed set of clients can statically determine the number of buffers needed based on the maximum number of clients. Another example is that an application made up of strictly periodic components can often determine its worst case buffering needs in advance based on the maximum number of messages sent per time period.

Wait-Free Synchronization

Properties of the programmable controllers that FLIPC exploits require the use of wait-free techniques for synchronization between the applications and the messaging engine. This is necessitated by the fact that the controller on which the messaging engine executes may be the only means of communication with the rest of the system. Hence, a controller hang may render the node useless. Wait-free synchronization ensures that an ill-behaved application cannot stall the controller and cause damage in this fashion.

This synchronization problem is further complicated by the inability to assume that the programmable controller can execute atomic memory operations other than loads and stores. The resulting memory model is the same as that assumed by mutual exclusion algorithms such as Peterson's algorithm [15]. We simplified these synchronization problems by making the combination of the application and the application library component of FLIPC responsible for all synchronization among application threads. This reduces the wait-free synchronization problems to two-way problems between the messaging engine and an application thread. We solved these by separating or duplicating data structure components so that the application and messaging engine never attempt to concurrently write the same memory location. The synchronization among application threads is implemented via conventional multithreaded locking techniques based on a test and set lock, because these threads cannot execute on the programmable controller.

This type of wait-free synchronization requires a different approach to designing data structures because the only available atomic operations are reads and writes. For example, consider the counter used to record the number of discarded messages for each endpoint. FLIPC allows the application to read and reset this counter as a single logical operation, and guarantees that the reset will not cause dropped message events to be lost. A single memory location is not sufficient to implement a wait-free version of this counter because message drops occurring between the read and the write that resets the counter to zero will be lost. Instead FLIPC uses two memory locations to implement the counter. The first location is incremented each time a message is discarded, and the second location holds the value of this count as of the last `read and reset' operation. The actual count of discarded messages is then the difference between these two values, and the reset is implemented by copying the value from the first location to the second location. This cleanly separates the data structure into one location written only by the messaging engine, the counter, and one written only by the application, the copied value.
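A minimal C sketch of this two-location counter follows, with illustrative names; it assumes, as on the platforms discussed, that aligned word stores are atomic, and on a weakly ordered machine it would additionally need memory barriers.

    /* Sketch of the two-location wait-free drop counter described above.
     * Names are illustrative, not the actual FLIPC data structures.
     * 'dropped' is written only by the messaging engine; 'acknowledged'
     * is written only by the application, so the two sides never write
     * the same location and no atomic read-modify-write is needed. */
    #include <stdio.h>

    typedef struct {
        volatile unsigned long dropped;       /* incremented by the messaging engine */
        volatile unsigned long acknowledged;  /* copy taken at last read-and-reset   */
    } drop_counter_t;

    /* Messaging engine side: called whenever a message must be discarded. */
    static void engine_note_drop(drop_counter_t *c)
    {
        c->dropped = c->dropped + 1;          /* single word store */
    }

    /* Application side: read the number of drops since the last reset and
     * reset the logical count, without ever losing a concurrent drop. */
    static unsigned long app_read_and_reset(drop_counter_t *c)
    {
        unsigned long seen = c->dropped;      /* snapshot the engine's count */
        unsigned long count = seen - c->acknowledged;
        c->acknowledged = seen;               /* 'reset' = record the snapshot */
        return count;
    }

    int main(void)
    {
        drop_counter_t c = { 0, 0 };
        engine_note_drop(&c);
        engine_note_drop(&c);
        printf("drops: %lu\n", app_read_and_reset(&c));  /* prints 2 */
        printf("drops: %lu\n", app_read_and_reset(&c));  /* prints 0 */
        return 0;
    }

Because the reset merely records a snapshot, drops that occur between the snapshot and the store are reported by the next read rather than being lost.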
The most important data structure for synchronization between the application and the messaging engine is the buffer queue in each endpoint. As shown in Figure 3, this structure consists of a circular queue of buffer pointers that tracks not only the queue head and tail, but also how far the messaging engine has proceeded through the queue. The solid circular arrow in the figure indicates the direction in which the release (head), process (middle), and acquire (tail) pointers move as buffers move through the endpoint.

The application releases buffers to the endpoint by inserting pointers to them at the front of this queue. The messaging engine uses the process pointer to follow behind the front of the queue, sending from or receiving into these buffers depending on the endpoint type. Buffers processed by the messaging engine become free to be acquired by the application. This buffer acquisition occurs at the tail of the queue. The queue is empty when all three pointers point to the same location; the two half empty conditions of no buffers to process and no buffers to acquire occur when the corresponding two pointers point to the same cell. Each buffer also contains a state field that is changed when processing has been completed, allowing an application to determine when processing of a specific buffer is complete.
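The sketch below shows one possible layout of such a queue, with illustrative names and sizes. The release and acquire indices are written only by the application library, the process index only by the messaging engine, and the engine-written index is padded into its own 32 byte cache line, anticipating the false-sharing issue discussed under Implementation below. Aligned word stores are assumed to be atomic; a weakly ordered machine would also need memory barriers.

    /* Illustrative sketch of an endpoint's circular buffer queue with the
     * release (head), process (middle), and acquire (tail) indices
     * described above.  Names, sizes, and layout are placeholders. */
    #include <stdbool.h>
    #include <stddef.h>
    #include <stdio.h>

    #define EPQ_SLOTS  16                               /* illustrative queue depth */
    #define CACHE_LINE 32                               /* Paragon cache line size  */

    typedef struct buffer { int payload; } buffer_t;    /* stand-in for a message buffer */

    typedef struct {
        /* Written only by the application library. */
        volatile unsigned release;                      /* head: next slot to fill        */
        volatile unsigned acquire;                      /* tail: next slot to reclaim     */
        char pad_app[CACHE_LINE - 2 * sizeof(unsigned)];
        /* Written only by the messaging engine, in its own cache line. */
        volatile unsigned process;                      /* middle: next slot to send/recv */
        char pad_eng[CACHE_LINE - sizeof(unsigned)];
        buffer_t * volatile slot[EPQ_SLOTS];            /* circular array of buffer ptrs  */
    } endpoint_queue_t;

    /* Application: release a buffer to the endpoint (steps 1 and 2 in Figure 2). */
    static bool epq_release(endpoint_queue_t *q, buffer_t *b)
    {
        unsigned next = (q->release + 1) % EPQ_SLOTS;
        if (next == q->acquire)                         /* queue full */
            return false;
        q->slot[q->release] = b;
        q->release = next;                              /* single store publishes to the engine */
        return true;
    }

    /* Messaging engine: look at the next released buffer, then mark it done. */
    static buffer_t *epq_peek_unprocessed(endpoint_queue_t *q)
    {
        return (q->process == q->release) ? NULL : q->slot[q->process];
    }

    static void epq_mark_processed(endpoint_queue_t *q)
    {
        q->process = (q->process + 1) % EPQ_SLOTS;
    }

    /* Application: acquire (reclaim) a buffer the engine has finished with. */
    static buffer_t *epq_acquire(endpoint_queue_t *q)
    {
        if (q->acquire == q->process)                   /* nothing processed yet */
            return NULL;
        buffer_t *b = q->slot[q->acquire];
        q->acquire = (q->acquire + 1) % EPQ_SLOTS;
        return b;
    }

    int main(void)
    {
        static endpoint_queue_t q;                      /* zero-initialized: empty queue */
        buffer_t b = { 42 };

        epq_release(&q, &b);                            /* application posts a buffer   */
        if (epq_peek_unprocessed(&q))                   /* engine sends from / fills it */
            epq_mark_processed(&q);
        buffer_t *done = epq_acquire(&q);               /* application reclaims it      */
        printf("%d\n", done ? done->payload : -1);      /* prints 42 */
        return 0;
    }

Each index is written by exactly one side, so only aligned loads and stores are needed; a production version would also carry the per-buffer state field mentioned above.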
Implementation

FLIPC has been implemented on three different hardware platforms. The application interface library and communication buffer data structures are the same in all three cases, demonstrating the generality and portability of the system. On the other hand, the messaging engine is necessarily specific to the communication hardware and requires reimplementation for each platform. Our development platforms have been PC clusters interconnected by ethernet or a SCSI bus [3]. We have also implemented FLIPC on the Intel Paragon using the Paragon mesh interconnect. Our initial development employed transports for all three platforms, developed by another project as OS kernel components behind a common interface, the Kernel to Kernel Transport (KKT) interface [13]. This interface is not a good match for the one way messages used by FLIPC because KKT uses an RPC to deliver each message. On the other hand, it was very effective for development purposes, because it allowed us to focus on the platform independent components of FLIPC. The resulting development strategy allowed us to build and debug a fully functional system without using scarce Paragon time. When Paragon time became available, we were able to move the KKT implementation of FLIPC from the SCSI and ethernet clusters to the Paragon in less than a week, including test time.

We then developed a native FLIPC messaging engine optimized and customized for the Paragon hardware, specifically Paragons that use MP3 nodes. These nodes contain three 50 MHz i860 processors, one of which is reserved as a message coprocessor for accelerating messaging operations. Cache coherency is implemented among all three processors. This messaging engine executes directly on the Paragon's message coprocessor and accesses the communication buffer directly instead of using the KKT transport interface. The inter-node protocol for message transfer is an optimistic protocol that aggressively sends messages without acknowledgment or feedback. As noted previously, the receiving node discards messages if receive buffers are not available. This protocol coexists with other protocols in the Paragon's protocol framework on the message coprocessor, allowing multiple protocols to be used simultaneously. For instance, our implementation of FLIPC on the OSF/1 AD operating system [22] requires both the FLIPC and OSF/1 AD protocols to operate simultaneously. The Performance section below reports results for this implementation.

Tuning this implementation uncovered two performance problems caused by the cache coherency implementation on the multiprocessor Paragon nodes. The first problem was that the caches do not implement cache residency for multiprocessor locks. Instead, the test and set operation that acquires a lock must lock the bus and perform the read and write operations directly on memory, with a severe impact on performance. We avoided this problem by implementing versions of the message send and receive operations that do not use multiprocessor locks for mutual exclusion among application threads. These operations are intended for use by applications whose structure ensures that at most one thread will access each endpoint, or for which mutual exclusion can be provided at a higher level. All of our performance results use these interface versions.

The second problem was that false sharing of variables written by both the messaging engine and the application library in the same 32 byte cache line on the Paragon caused excessive numbers of cache invalidations. We solved this problem by extending our design approach to wait-free synchronization to ensure that concurrent writes from the application and messaging engine can never occur in the same cache line. This reorganization of communication buffer data structures eliminated the false sharing. The combination of these two optimizations improved latency by 15 µs, or almost a factor of two. We had expected to encounter cache effects in tuning FLIPC, but were surprised by the magnitude of their impact.

Performance

Figure 4 shows the message latency for the optimized FLIPC implementation on the Paragon. The measured message latencies range from about 15.5 µs to 17 µs. The corresponding standard deviations range from 0.5 µs to 0.65 µs and are approximately the size of the symbols used in the figure. For message sizes of 96 bytes and above, the latencies obey the following equation:

    Latency = 15.45 µs + 6.25 ns/byte

Shorter messages can be sent slightly faster due to changes in hardware behavior. The slope of 6.25 ns/byte indicates that additional message bytes consume interconnect bandwidth at a rate of over 150 MB/s. This is on an interconnect whose hardware peak bandwidth is 200 MB/s, and for which the best throughput achieved by any software is 160 MB/s [12].

These measurements were obtained via a test program that measures the time consumed by multiple two-way message exchanges between a pair of nodes. The time for a single message is then obtained by dividing this overall time by twice the number of two-way exchanges. These results are from a configuration that does not contain all of the validity checks that protect the messaging engine against corruption of the communication buffer by an errant or malicious application. Configuring these checks adds an additional 2 µs to the above times.

Running the test program for a small number of exchanges yields results that are about 3 µs faster than the above steady state results, which come from test runs that include hundreds of message exchanges. We believe that the faster results for small numbers of exchanges are explained by cache start-up transients. The 16 kilobyte caches used by the i860 processors on the Paragon are relatively small, and the Paragon does not implement any secondary caches. The result is that the caches are sensitive to disturbances caused by executing code outside the inner test loop, so that saving the results of the previous test cycle is sufficient to cause replacement of a significant portion of the cache. The result of these replacements is that cache lines that are shared in the steady state of the test are not shared at start-up, resulting in fewer cache invalidations. This performance anomaly reinforces the importance of cache effects to the performance of this class of messaging systems.
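The measurement loop described above is a conventional ping-pong test; the sketch below shows its general shape. The messaging calls are hypothetical blocking placeholders rather than the real FLIPC interface, and gettimeofday stands in for whatever high resolution timer the actual test program used.

    /* Sketch of the ping-pong methodology described above: time N two-way
     * exchanges between two nodes and divide by 2N.  The messaging calls
     * and node identities are placeholders, not the real FLIPC test code. */
    #include <stdio.h>
    #include <sys/time.h>

    /* Assumed primitives: blocking fixed-size message send and receive. */
    extern void flipc_send_blocking(int dest_node, const void *msg, int bytes);
    extern void flipc_recv_blocking(void *msg, int bytes);

    static double now_seconds(void)
    {
        struct timeval tv;
        gettimeofday(&tv, 0);
        return tv.tv_sec + tv.tv_usec / 1e6;
    }

    /* Returns the estimated one-way latency in microseconds. */
    double ping_pong_latency_us(int my_node, int peer_node, int bytes, int iters)
    {
        char msg[1024] = { 0 };           /* >= largest tested message size */
        double start, elapsed;

        /* A warm-up pass would reduce the cache start-up transient
         * discussed above; omitted here for brevity. */
        start = now_seconds();
        for (int i = 0; i < iters; i++) {
            if (my_node < peer_node) {    /* "ping" side sends first */
                flipc_send_blocking(peer_node, msg, bytes);
                flipc_recv_blocking(msg, bytes);
            } else {                      /* "pong" side echoes */
                flipc_recv_blocking(msg, bytes);
                flipc_send_blocking(peer_node, msg, bytes);
            }
        }
        elapsed = now_seconds() - start;
        return (elapsed * 1e6) / (2.0 * iters);  /* per message, one way */
    }

Dividing by twice the number of exchanges attributes half of each round trip to each direction; with the iteration count in the hundreds, this reproduces the steady-state figures reported above.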
Related Work

FLIPC is most closely related to two systems, Paragon Active Messages (PAM) on the Paragon and Express Messages on the iPSC/2 hypercube.

PAM [2] consists of two communication subsystems: an active messages facility that transfers 20 byte messages, and a bulk transport facility based on direct read and write operations to remote memory. The active messages facility is layered on top of an optimistic transport protocol for fixed size messages that polls for incoming messages. PAM's optimizations for small messages, and its simpler functionality by comparison to FLIPC, yield a message latency of less than 10 µs, about a third faster than FLIPC would be on a 20 byte message. An important difference caused by optimizing for small messages is that application preallocation and management of message buffers is not needed by PAM, because a 20 byte message can be copied to or from an internal data structure at almost zero cost, less than 0.2 µs. Functionality differences include the support for multiple endpoints, data streams, and the related resource control that is found in FLIPC but not PAM. PAM's bulk transport mechanism is complementary to FLIPC, as FLIPC does not contain a bulk transport mechanism.

There are strong similarities between the PAM and FLIPC implementations in that both use a wired communication buffer shared between the application and messaging engine, and an optimistic transport that discards messages when receive resources are not available. Both systems recognize that a dedicated buffer avoids deadlock problems that may arise in the use of an optimistic transport on a reliable interconnect, although the actual buffer designs are quite different. Both systems also move flow control support from the basic internode protocol to higher layers to support customizing flow control to application requirements. The window based flow control protocol used by PAM's active message facility is an example of this customization. PAM is targeted to smaller messages than FLIPC; PAM messages are 28 bytes, of which 8 bytes are used by PAM, leaving 20 for the application. PAM implements active messages on top of its transport by using 4 bytes of the message to hold the address of the remote message handler; the same implementation could be used on FLIPC.

Express Messages [9] was a messaging facility for the iPSC/2 hypercube. This system contained several ideas that were utilized and enhanced in the design of FLIPC. Express Messages recognized the distinction among small, medium, and large messages, and also used an aggressive optimistic transfer protocol for medium messages. Fixed size message buffers were used for medium messages, but via page mapping techniques instead of a shared memory buffer. A shared control bit was used to switch between polling and interrupt-based message delivery mechanisms, but system calls were used for buffer management, in contrast to the shared data structure implementation in FLIPC. The absence of kernel threading support in the NX/2 operating system required Express Messages to implement threading at user level, including a handoff mechanism that executed an interrupted critical section if the interrupt handler needed to enter that section. FLIPC is based on kernel thread support, allowing it to deliver messages to threads rather than to upcall handlers. This avoids interrupting critical sections.
Instead, the message that would have caused an interruption is handled by a thread, allowing conventional multithreaded synchronization techniques to be used to resolve critical section conflicts.

Two other common high performance messaging systems for the Paragon are NX [11] and SUNMOS [21]. Both optimize performance of large messages for high performance numerical computation, and SUNMOS also optimizes zero length messages. NX is part of the basic Paragon operating system, whereas SUNMOS is a single application operating system that is run on a subset of the Paragon nodes, utilizing the Paragon operating system running on the remainder of the machine for the services it does not provide. NX achieves a bandwidth of over 140 MB/s, and SUNMOS approaches 160 MB/s for sufficiently large messages. Aside from the underlying hardware architecture, FLIPC has little in common with these systems because they are optimized for large messages rather than medium messages. FLIPC is complementary to such systems, as a bulk transfer mechanism needs to be added to FLIPC to obtain a complete system. SUNMOS has the further complication that its basic messaging protocol assumes a non-multiprogrammed system by sending multi-megabyte messages as a single packet. This occupies the path through the interconnect for the duration of the message and is a potential responsiveness problem in a real time environment.

The optimized FLIPC implementation for the Paragon significantly exceeds the performance of these other messaging systems on medium sized messages. The FLIPC latency for a 120 byte message transferred between applications on two Paragon nodes is 16.2 µs. The comparable latencies for the three other messaging systems on the Paragon are:

- NX (Paragon O/S R1.3.2): 46 µs [11]
- Paragon Active Messages: 26 µs [2]
- SUNMOS: 28 µs [12][21]

This demonstrates the performance impact of not optimizing for the medium class of messages. These three messaging systems have been optimized for bandwidth on large messages. In addition, PAM has been optimized for latency on very small messages.

Future Work

There are a number of opportunities for future work to improve on FLIPC. The explicit resource management model is effective, but is also awkward to program. Our experience is that a FLIPC application can expect to employ about half of its calls to FLIPC to send or receive messages, and the other half for message buffer management. An improved buffer management design that frees the programmer from most of these details is clearly called for. Support for multiple communication buffers per node, and protection mechanisms that restrict where messages can be sent, should be added to support multiple applications that do not trust each other.

FLIPC was designed solely to address the transport of medium sized messages and needs to be integrated into a system that provides excellent performance for messages of all sizes. As part of this work, we are considering extensions that allow applications to indirectly access memory on other nodes [16]; some related ideas can be found in the SUNMOS [21], PAM [2], and Illinois Fast Messages [7] systems. Finally, we intend to pursue further integration of FLIPC into a real time environment by adding real time prioritization and capacity/bandwidth control functionality to the basic inter-node transport.
Conclusion

FLIPC incorporates an innovative combination of design and implementation techniques that achieve excellent performance for medium messages on high speed communication hardware in a real time environment. The most important architectural approach is to recognize and explicitly target programmable communication controllers. The FLIPC design manifests this targeting in several ways:

- Logical separation of the messaging engine and the OS kernel allows the messaging engine to be implemented on the communication controller.
- Asynchronous messaging interfaces and wait-free synchronization decouple the communication controller from applications.
- The shared communication buffer allows the OS kernel to be bypassed.

FLIPC is designed for medium messages, an important class of messages for distributed real time applications. The increased performance it obtains by comparison to other messaging systems indicates that this size class of messages has been overlooked by those systems.

Our experiences in implementing FLIPC provide some useful guidance for designers of other systems. The importance of cache management to the performance of distributed messaging systems is indicated by the cache problems we encountered during performance tuning, and by the factor of two in performance obtained by tuning for cache effects. The limitations of some programmable communication controllers required us to address wait-free synchronization problems in a memory model without atomic read-modify-write operations. Finally, employing PC clusters proved to be an effective way to develop prototypes, including validating the portability of FLIPC across two different communication media. This greatly reduced our need for Paragon time to develop the final version of FLIPC, because the platform independent components had already been implemented and debugged.

Acknowledgments

For their assistance in making this work possible, we are grateful to the Intel Scalable Systems Division, especially Paul Pierce and Roy Larsen. We would also like to thank the Active Messages group of the University of California at Berkeley, and especially Lok Liu, for making the source code of Paragon Active Messages available for us to study. Paul Davis of Honeywell kindly provided measurements on the current version of NX. Dejan Milojicic and the anonymous Usenix reviewers provided comments that greatly improved this paper.

References

[1] Nanette J. Boden, Danny Cohen, Robert E. Felderman, Alan E. Kulawik, Charles L. Seitz, Jakov N. Seizovic, and Wen-King Su, `Myrinet -- A Gigabit-per-second Local-Area Network', IEEE Micro, February 1995.

[2] Eric A. Brewer, Frederic T. Chong, Lok T. Liu, Shamik D. Sharma, and John Kubiatowicz, `Remote Queues: Exposing Messages for Optimization and Atomicity', Proceedings of the 7th Annual ACM Symposium on Parallel Algorithms and Architectures, July 1995, pp. 42-53.

[3] Randall W. Dean, Michelle M. Dominijanni, Steven J. Sears, and Alan Langerman, `SCSI for Host to Host Communication', OSF Research Institute Technical Report, in preparation.

[4] Peter Druschel, Larry L. Peterson, and Bruce S. Davie, `Experiences with a High-Speed Network Adapter: A Software Perspective', 1994 SIGCOMM Conference Proceedings, October 1994, pp. 2-13.

[5] `High Performance Distributed Computing (Hiper-D) Integrated Demonstration One Report', Naval Surface Warfare Center (NSWC), Dahlgren, Virginia, September 1994.

[6] Intel Corporation, Intel Paragon XP/S Supercomputer Spec Sheet, 1992.
[7] Vijay Karamcheti and Andrew A. Chien, `A Comparison of Architectural Support for Messaging in the TMC CM-5 and the Cray T3D', Proceedings of the 22nd Annual International Symposium on Computer Architecture, June 1995, pp. 298-307.

[8] Jeffrey Kuskin, David Ofelt, Mark Heinrich, John Heinlein, Richard Simoni, Kourosh Gharachorloo, John Chapin, David Nakahira, Joel Baxter, Mark Horowitz, Anoop Gupta, Mendel Rosenblum, and John Hennessy, `The Stanford FLASH Multiprocessor', Proceedings of the 21st Annual International Symposium on Computer Architecture, 1994, pp. 302-313.

[9] J. William Lee, `Performance of User-Level Communication on Distributed Memory Multiprocessors with an Optimistic Protocol', Technical Report 93-12-06, Department of Computer Science and Engineering, University of Washington, December 1993.

[10] NCR Corporation, NCR 53C825 PCI-SCSI I/O Processor With Local ROM Interface Data Manual, 1993.

[11] Paul Pierce and Greg Regnier, `The Paragon Implementation of the NX Message Passing Interface', Technical Report, Intel Scalable Systems Division, Beaverton, Oregon.

[12] Rolf Riesen, `SUNMOS Performance', Sandia National Laboratories web page, https://www.cs.sandia.gov/~rolf/puma/sunmos/sunmos.html.

[13] Steve Sears, Michelle Dominijanni, Alan Langerman, and David Black, `Kernel to Kernel Transport Interface for the Mach Kernel', Technical Report in OSF Research Institute Collected Papers, Operating Systems Volume 3, April 1994.

[14] Daniel P. Siewiorek, C. Gordon Bell, and Allen Newell, Computer Structures: Principles and Examples, New York: McGraw-Hill Book Company, 1982.

[15] Andrew S. Tanenbaum, Modern Operating Systems, Englewood Cliffs, New Jersey: Prentice-Hall, Inc., 1992, page 37.

[16] Chandramohan A. Thekkath, Henry M. Levy, and Edward D. Lazowska, `Separating Data and Control Transfer in Distributed Operating Systems', Proceedings of the Sixth International Conference on Architectural Support for Programming Languages and Operating Systems, Operating Systems Review, vol. 28, num. 4, December 1994, pp. 2-11.

[17] Lewis W. Tucker and Alan Mainwaring, `CMMD: Active Messages on the CM-5', Parallel Computing, vol. 20, 1994, pp. 481-496.

[18] Ronald J. Vetter, `ATM Concepts, Architectures, and Protocols', Communications of the ACM, vol. 38, num. 2, February 1995, pp. 30-38.

[19] Thorsten von Eicken, David E. Culler, Seth Copen Goldstein, and Klaus Erik Schauser, `Active Messages: a Mechanism for Integrated Communication and Computation', Proceedings of the 19th Annual International Symposium on Computer Architecture, May 1992, pp. 256-267.

[20] Robert Wahbe, Steven Lucco, Thomas E. Anderson, and Susan L. Graham, `Efficient Software-Based Fault Isolation', Proceedings of the 14th ACM Symposium on Operating Systems Principles, December 1993, pp. 203-216.

[21] Stephen R. Wheat, Arthur B. Maccabe, Rolf Riesen, David W. van Dresser, and T. Mack Stallcup, `PUMA: An Operating System for Massively Parallel Systems', Proceedings of the 27th Annual Hawaii International Conference on System Sciences, January 1994, pp. 56-65.

[22] Roman Zajcew, Paul Roy, David Black, Chris Peak, Paulo Guedes, Bradford Kemp, John LoVerso, Michael Leibensperger, Michael Barnett, Faramarz Rabii, and Durriya Netterwala, `An OSF/1 Unix for Massively Parallel Multicomputers', Proceedings of the Winter 1993 USENIX Technical Conference, January 1993, pp. 449-468.

Author Information

David L. Black is a Research Scientist at the OSF Research Institute, where he has been working on operating systems of various sorts since 1990.
Previously, he was a graduate student at Carnegie Mellon University performing research on Mach, and learning how long it takes to write a Ph.D. thesis. His research interests include operating systems, multiprocessors, and real-time computing.

Randall D. Smith has been working on high-speed application messaging systems for three years, at Thinking Machines Corporation and the OSF Research Institute. He also spent three years in graduate study at the Massachusetts Institute of Technology in Computational Neuroscience, and a year working for the Free Software Foundation hacking GDB and GLD.

Steven J. Sears is currently a Senior Consulting Software Engineer with Novell Corporate Research and Development, attached to the Advanced Systems Architecture & Technologies Group. Before joining Novell he solved `hard problems' at the OSF Research Institute. His interests include distributed operating systems and software, computer architecture, high speed networking, and distributed memory systems.

Randall W. Dean is a Senior Research Engineer at the OSF Research Institute, where he is currently working on distributed operating system issues. Prior to joining the OSF in 1993, he spent eight years on the research staff at Carnegie Mellon University, the last four on the Mach project. His research interests include operating systems, multiprocessors, distributed computing, and distributed memory management.