The following paper was originally published in the Proceedings of the USENIX 1996 Annual Technical Conference, San Diego, California, January 1996.

For more information about the USENIX Association contact:
1. Phone: 510 528-8649
2. FAX: 510 548-5738
3. Email: office@usenix.org
4. WWW URL: http://www.usenix.org

A New Approach to Distributed Memory Management in the Mach Microkernel

Stephan Zeisset, Stefan Tritscher, Martin Mairandres
Intel Corporation

ABSTRACT

In this paper we describe a new approach to extending Mach virtual memory semantics across the node boundaries of a multicomputer system. At its core, the Advanced Shared Virtual Memory (ASVM) system employs algorithms from the realm of shared virtual memory that were adapted and extended to create a system that is efficient and scalable and that supports full VM semantics, paging between node memories, and efficient execution of SVM applications. Our performance measurements demonstrate ASVM's superior efficiency and scalability compared to its predecessor, the Extended Memory Manager (XMM), which is part of the Mach kernel's NORMA distribution.

1. Introduction

Multicomputer systems generally run separate copies of their operating system on each of their processing nodes. To ease both programming and system administration, the more advanced multicomputer operating systems, among them OSF1/AD TNC [2], support the concept of a single system image. This provides the user with the illusion of a single workstation environment, including single IP, process, and file name spaces.

To accomplish this, OSF1/AD TNC relies heavily on two additions to its Mach microkernel that are also known as NORMA (No Remote Memory Access) extensions: NORMA-IPC for extending interprocess communication and XMM for extending virtual memory semantics across the multicomputer's node boundaries. NORMA-IPC is used by the operating system for all kinds of internode communication and synchronization, while XMM provides copy-on-access semantics for remote task creation and shared memory semantics for the memory mapped file system and user-created shared memory segments.

Unfortunately, the original design of NORMA-IPC and XMM did not perform well on large scale parallel machines. The main deficiencies of NORMA-IPC were its broken flow control in many-to-one communication scenarios and its poor utilization of the available communication bandwidth. The main deficiencies of XMM were its non-scalable approach to handling shared memory and an inefficient communication protocol. These problems prompted Intel's Scalable Systems Division, which based the operating system of its Paragon multicomputer system on OSF1/AD TNC, to redesign the NORMA-IPC and XMM subsystems. In this paper we will concentrate on the XMM rewrite effort and its outcome, the Advanced Shared Virtual Memory (ASVM) system.

2. Background

2.1 The Paragon multicomputer system

The Paragon is a multiprocessor system with distributed memory. Each of its nodes contains two or three i860XP processors that share a local memory of 16 to 128 MBytes.
One of the processors usually acts as a dedicated message passing processor while the others execute user tasks. The nodes are interconnected by a two-dimensional mesh of wormhole-routed communication channels with a raw bandwidth of 200 MByte/s in each direction. Existing installations reach up to 1792 nodes.

Three properties of the Paragon multicomputer system had a major influence on the design of ASVM: the high bandwidth and low latency of the interconnect, the great number of nodes a system can contain, and the fact that there is typically one node with an attached disk drive for every 32 nodes that are used for computation.

2.2 The Mach Kernel's VM system

The Mach kernel's virtual memory (VM) system provides the abstraction of memory objects. These are user-managed entities which are represented by a VM object on the kernel side and a memory object port on the user side. The available physical memory is used as a cache for memory object contents. User-level pager tasks are responsible for providing the initial contents of a memory object and for preserving the object's data when it is evicted from the cache. The External Memory Management Interface (EMMI) protocol is used between the kernel and the pager tasks to exchange page contents and access rights (see FIGURE 1).

Beyond enabling a task to map a memory object into a specific part of its address space, the VM system also supports shared access of multiple tasks to the same memory object (shared memory semantics) and the creation of lazily evaluated copies of a memory object (delayed copy semantics). While sharing of memory contents is easily accomplished by entering the same VM object into the address maps of multiple tasks, delayed copy semantics are implemented by building shadow/copy relationships between VM objects.

Two different strategies are used for implementing delayed copy semantics, called the symmetric and the asymmetric copy strategy. With the symmetric copy strategy, the address maps of the source and the copy task continue to reference the same object. Only when a page is about to be written in either the source or the copy is a new object created that has a shadow link to the original object. The new object replaces the original object in the address map where the write fault occurred (see FIGURE 2). Pages that are not present in the shadow object are retrieved from the source object through the shadow link, while modifications of them take place in the shadow object. This means that the contents of the source object are frozen after a symmetric copy has been made.

While the symmetric copy strategy is a very efficient way of implementing delayed copy semantics, it is not applicable if changes in the source address space have to be reflected back to the source object's pager, as with memory mapped files. Therefore, the asymmetric copy strategy is used in these cases. With this strategy, a copy object is created at the time a delayed copy operation is done. The source and the copy object are connected by copy and shadow links (see FIGURE 3). When a page is needed in the copy object, it is retrieved from the source object through the shadow link (pull operation). Before a page can be modified in the source object, a copy of it is first inserted into the copy object, using the copy link (push operation). Multiple objects which are connected through copy links form a copy chain. New copies are inserted into the copy chain immediately after their source object.
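To make the shadow/copy relationships more concrete, the following simplified C sketch shows how a pull and a push might walk these links. The structure layout and function names are purely illustrative and do not reflect the actual Mach vm_object implementation.

    #include <stddef.h>

    /* Highly simplified view of a VM object in a copy chain; the real
     * Mach vm_object carries far more state. */
    struct vm_page;                        /* page contents, opaque here */

    struct vm_object {
        struct vm_object *shadow;          /* where to pull missing pages from   */
        struct vm_object *copy;            /* where to push pages before writing */
        struct vm_page  *(*lookup)(struct vm_object *, long offset);
        void             (*insert)(struct vm_object *, long offset,
                                   struct vm_page *page);
    };

    /* Pull: resolve a read fault in a copy object by walking shadow links
     * until some object in the chain holds the page. */
    struct vm_page *pull_page(struct vm_object *obj, long offset)
    {
        for (; obj != NULL; obj = obj->shadow) {
            struct vm_page *p = obj->lookup(obj, offset);
            if (p != NULL)
                return p;                  /* fault satisfied with this page */
        }
        return NULL;                       /* end of chain: zero-fill the page */
    }

    /* Push: before a page is modified in the source object, place a copy of
     * its current contents into the copy object so the delayed copy keeps
     * the old data. */
    void push_page(struct vm_object *source, long offset, struct vm_page *snapshot)
    {
        if (source->copy != NULL &&
            source->copy->lookup(source->copy, offset) == NULL)
            source->copy->insert(source->copy, offset, snapshot);
    }

With the symmetric strategy only the shadow link exists; the asymmetric strategy uses both links, which is what allows changes in the source address space to still reach the source object's pager.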
An interesting property of both the symmetric and the asymmetric copy strategy is that the creation of page copies is delayed until the page is modified in the source or the copy address space. Pages that are retrieved through a shadow link in the course of a read page-fault are not copied to the object where the fault occurred. Instead, the page-fault is satisfied by directly entering the source object's page into the physical pagemap of the faulting task.

2.3 The eXtended Memory Manager

The original design of XMM is due to Joe Barrera, who created it in the context of a Ph.D. thesis at Carnegie Mellon University. Later, the Open Software Foundation (OSF) took over responsibility for further development of XMM [6]. In this section we describe the OSF NMK13 version of XMM, which is part of the software basis for the Paragon operating system and forms the basis for our performance comparisons. The major accomplishment in later OSF versions of XMM is a better integration of shared memory and delayed copy management, which closes a semantic gap in NMK13 XMM that prohibits the combined use of shared and inherited memory.

2.3.1 Structure

XMM is located inside the Mach kernel and intercepts the communication between the VM system and external pagers. For each memory object, XMM representations of the object are created on all nodes that make use of it. Only one of these representations holds state information and communicates with the pager task, while the others merely act as forwarding proxies for requests to and from their local VM system. This way XMM acts towards each node's VM system as the memory object's pager and towards the pager as a single node's VM system (see FIGURE 4).

2.3.2 Shared Memory Management

XMM uses the centralized manager model for providing a coherent address space across multiple nodes. The manager node keeps track of the status of each page in the memory object's virtual address space on each node that makes use of the object and enforces a "single writer or multiple readers" policy. When some node makes a request for read or write access to a page of the memory object, this request is answered in two steps: First, a coherent version of the page is created at the pager. This means that the page contents have to be returned to the pager if the page has been modified. Also, if the request is for write access, the page has to be flushed from the VM cache of all nodes except the origin node of the request. Then the request is forwarded to the pager, which can now view the requesting node as the only user of the page. The pager's answer is then forwarded to the requesting node.

2.3.3 Delayed Copy Management

NMK13 XMM supports delayed copy optimizations only in the context of remote task creation. The approach it takes toward the lazy evaluation of inherited memory is quite simple in that it leaves most of the work to the virtual memory system of the node where the source task is located. This is done by creating a copy of the source address space just as in the case of a local fork() operation. Then an XMM-internal pager is created for each memory object in the copy address space, thereby providing a new memory object abstraction to be mapped in the address space of the remote task. Page faults in the remote task's address space cause a request to the XMM-internal pager, which in turn generates a page-fault on the local copy address space and supplies the resulting page contents back to the remote node.
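As a rough illustration of this scheme, the following C sketch shows the path a remote page-fault might take through such an XMM-internal pager; all names are invented for illustration and none of this is actual XMM code.

    /* Illustrative only: one XMM-internal pager per memory object in the
     * copy of the source address space. */
    struct xmm_internal_pager;

    /* Take a page-fault on the local copy address space and return the
     * resulting page contents (stub). */
    extern void *fault_in_local_copy(struct xmm_internal_pager *p, long offset);

    /* Ship page contents to the node running the remote task (stub). */
    extern void send_page_to_remote_node(int node, long offset, const void *page);

    /* Runs on the source task's node whenever the remote task faults. */
    void handle_remote_fault(struct xmm_internal_pager *p, int remote_node,
                             long offset)
    {
        void *page = fault_in_local_copy(p, offset);          /* local page-fault */
        send_page_to_remote_node(remote_node, offset, page);  /* supply remotely  */
    }

Note that the thread executing this path is blocked until the local page-fault completes, which is the blocking behavior that motivates ASVM's asynchronous state transitions described in section 3.1.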
3. The ASVM system

3.1 Design Principles

Based on our experiences with XMM, we developed our new system along the following guidelines:

· Distributed Manager
XMM is based on a centralized manager approach for managing cross-node memory relationships. This means that for each Mach memory object, the XMM code on a single node is responsible for coherency and copy management. Several research studies ([1], [10]) indicate that this approach is non-scalable and thus unsuitable for the management of shared virtual memory on large systems, because the centralized manager becomes a bottleneck when a large number of nodes use the memory object. ASVM uses a distributed manager approach where each page has its own manager, which is also called the page owner. The ownership can migrate between all nodes that use the page.

· Limited Memory Requirements
With XMM, the centralized manager stores the page state of a memory object in a data structure that requires 1 byte of non-pageable memory for each page in the virtual address space of the memory object, multiplied by the number of nodes that use the object. With large numbers of nodes or large and sparsely populated address spaces this concept can consume a lot of memory. In extreme cases it may even consume more memory than is actually available, leading to a system crash. ASVM not only distributes the page state information across the system, but also ties it to physical pages in that a node only holds state information about pages that are cached in its physical memory. The same concept is used by the Mach kernel's virtual memory system to support large and sparsely populated address spaces.

· Asynchronous State Transitions
One problem of the mechanism XMM uses for implementing its delayed copy support is that the copy pager thread which generates a page-fault is blocked until the page-fault completes. As an internode copy chain might cross the same node multiple times, this leads to a deadlock if the available number of threads is exhausted. To avoid problems of this kind, ASVM generally uses asynchronous state transitions, which means that neither a thread nor other resources are blocked while waiting for a request to be answered.

· Specialized Communication Protocol
XMM uses a protocol called XMMI (eXtended Memory Management Interface) for communication inside the XMM system. XMMI is an extension of the EMMI protocol that defines the interaction between the virtual memory system and an external pager [4]. Although this is an elegant solution, it is also very inefficient when used for coherency management, as it requires more messages than necessary. For example, transferring a write permission from one node to another using XMMI takes five messages, two of them containing page contents. With a more suitable protocol, this number could be reduced to three messages (request to manager, request from manager to current writer, and answer from current writer to new writer), only one of them containing page contents [7]. ASVM uses XMMI only as an interface to the local VM system and pager while defining its own ASVM protocol for all communication between the ASVM instances on different nodes.

· Dedicated Transport Service
XMM uses Mach NORMA IPC as a transport service for XMMI communication to remote nodes. This introduces high latencies and overhead due to the handling of port rights and complex message structures.
This is especially conspicuous on multiprocessor systems with a high-performance interconnect such as the mesh used in Paragon systems. In fact, on these systems NORMA IPC is responsible for about 90 percent of the latency involved in resolving remote page faults for memory that is shared through XMM. The ASVM protocol, on the other hand, is mapped to a dedicated transport interface, the SVM transport service (STS). This allows the use of lower level protocol stacks for ASVM communication that can take advantage of the simple structure of ASVM messages, which consist of a fixed-size block of untyped data (currently 32 Byte), possibly followed by the contents of a VM page (8 KByte). Also, flow control is simplified by the fact that page contents are only transferred on behalf of a request from their receiver, which allows preallocation of page receive buffers.

3.2 Basic Operation

For managing coherency and memory inheritance, some entity must maintain information about the current state of a page. This entity is called the manager of the page. The main design freedom for any SVM system is the method used to distribute the managers of particular pages across the nodes of the system. There are several known approaches to this problem [1]:

· Centralized Manager
The managers for all pages of a memory object are located on the same node. This approach is implemented in NMK13 XMM.

· Fixed Distributed Manager
The page managers are statically distributed by using some function which maps a page number to a node number. The manager for the page is located on that node.

· Dynamic Distributed Manager
The manager of a page migrates between the nodes that use the page. The node on which the manager of a page is currently located is called the owner of the page. Kai Li [1] suggests moving page ownership to the node that most recently made a request for the page and locating the actual page owner by following a hint chain across all the nodes that previously owned the page. This hint chain naturally collapses when a node that once owned a page becomes its owner again. The hint chain can also be collapsed while forwarding a request, as the originator of the request becomes the next owner.

ASVM adopts the dynamic distributed manager approach, but makes a clearer distinction between the method for distributing page managers and the method for finding the current manager of a page. In this approach, there are actually two managers for a page. One of them is the page owner, which keeps information about the state of the page and is responsible for answering access requests from other nodes. The other is the ownership manager, which is responsible for forwarding requests to the page owner. While a page is always owned by the node that most recently had write access to it, multiple methods are available for forwarding requests to the page owner. This is different from Kai Li's approach, in which page ownership also migrates on read requests and a single method is used for forwarding requests (the hint chain).

3.3 General Structure

FIGURE 5 shows the structure of the ASVM system and how it is embedded into the Mach kernel. When the virtual memory system of a node needs access to some page, an access request is generated and fed into the request redirector, which is responsible for delivering it to the current page owner.
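Such a request ultimately travels over the STS as one of the simple messages described in section 3.1: a fixed 32-byte block of untyped data, possibly followed by the contents of an 8-KByte page. A hypothetical C sketch of such a message is shown below; the individual header fields are invented for illustration, only the overall sizes come from the paper.

    #include <stdint.h>

    #define ASVM_HEADER_BYTES 32        /* fixed-size untyped header            */
    #define ASVM_PAGE_BYTES   8192      /* one VM page on the Paragon (8 KByte) */

    struct asvm_message {
        union {
            uint8_t raw[ASVM_HEADER_BYTES];
            struct {                    /* invented field layout */
                uint16_t request_type;  /* e.g. read, write, invalidate, ...    */
                uint16_t flags;
                uint32_t object_id;     /* which memory object                  */
                uint64_t page_offset;   /* which page within the object         */
                uint32_t origin_node;   /* where the request originated         */
                uint32_t owner_hint;    /* best current guess at the page owner */
                uint8_t  reserved[8];
            } fields;
        } header;
        /* Present only when page contents travel with the message; requests
         * without data consist of just the 32-byte header. */
        uint8_t page[ASVM_PAGE_BYTES];
    };

Because page contents travel only in response to a request from their receiver, the receiver can preallocate one page buffer per outstanding request, which is the flow-control simplification mentioned above.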
When the request arrives on the page owner node, it is fed into a state machine which keeps the page coherent throughout shared usage, delayed copying, and internode paging.

3.4 Forwarding Mechanisms

The motivation for providing multiple strategies for finding the owner of a page comes from the design goal of avoiding data structures that grow with the size of virtual address spaces. This goal is guaranteed for the page management structures by enforcing the invariant that each node that is owner of a page has the page in its virtual memory cache. If no node has the page in its VM cache, there is no owner of the page and no state information about it. The pager can be viewed as the page owner in this case. A different approach must be taken for page ownership information, since each node must also hold ownership information for pages that aren't in its virtual memory cache. Consider the following forwarding strategies:

· Dynamic forwarding
Requests for a page are forwarded as in Kai Li's dynamic distributed manager approach.

· Static forwarding
Requests for a page are forwarded through a fixed distributed ownership manager.

· Global forwarding
Requests for a page are forwarded through all nodes until they eventually reach the page owner or - if there is no owner of the page - the pager.

Dynamic forwarding requires the most memory since each node has to hold an ownership hint for each page in the virtual address space. Static forwarding uses less memory since each node only holds ownership information for a subset of the pages in the virtual address space. Global forwarding uses the least memory since only the page owner itself has to know that it owns the page.

ASVM implements all three of these strategies and holds the ownership information for dynamic and static forwarding in caches for the most recently accessed pages (see FIGURE 6). This means that both dynamic and static forwarding can fail if the required information is not present in the caches. Static forwarding will not fail as often as dynamic forwarding, since the static forwarding cache is in effect distributed among all static ownership managers and thus can hold many more owner references than a dynamic cache of the same size. Therefore, static forwarding is used as a backup for dynamic forwarding and is in turn backed up by global forwarding. Global forwarding will never fail if any node is owner of the page. In addition to a node reference, the static cache can also hold hints that the page has never been initialized (fresh) or that the page has been paged out (paged), thereby avoiding costly global forwarding operations in these cases.

The ASVM system allows dynamic or static forwarding (or both) to be disabled on a per-memory-object basis. This provides great flexibility. If only static and global forwarding are enabled, the behavior of the ASVM system is identical to Kai Li's fixed distributed manager approach. Enabling dynamic forwarding makes the ASVM system resemble the dynamic manager approach. Of course this assumes that in both cases the relevant caches are big enough.

3.5 Shared Memory Management

The only coherency model currently supported by ASVM is strong coherence, which means that any read operation to a shared memory address will return the data of the most recent write operation to this address. FIGURE 7 shows the part of the ASVM state machine that keeps a page coherent through shared usage, enforcing the single writer or multiple readers invariant.
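A minimal C sketch of the per-page state such a state machine might keep on the owning node is given below; the names and layout are illustrative only. The reader list and the invalidation step correspond to transitions 5 through 7 in the list that follows.

    #include <stdbool.h>
    #include <stdint.h>

    enum page_access { ACCESS_NONE, ACCESS_READ, ACCESS_WRITE };

    /* Illustrative per-page state; kept only for pages resident in the
     * node's VM cache, per the invariant from section 3.4. */
    struct page_state {
        enum page_access local_access;  /* access the local VM system holds */
        bool             is_owner;      /* is this node the page owner?     */
        uint16_t        *readers;       /* reader list, maintained by owner */
        uint16_t         reader_count;
    };

    /* Before write access is granted to anyone (transitions 6 and 7 below),
     * the owner invalidates all read copies. */
    void invalidate_readers(struct page_state *ps,
                            void (*send_invalidate)(uint16_t node))
    {
        for (uint16_t i = 0; i < ps->reader_count; i++)
            send_invalidate(ps->readers[i]);
        ps->reader_count = 0;
    }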
The state of a page on a particular node includes the type of access the node's VM system has to the page and an indicator of whether the node is the owner of the page. The following state transitions can occur:

1. The node is granted read access to the page.
2. The node is granted write access to the page.
3. The node is granted an upgrade from read to write access.
4. The node is owner of the page and grants write access to another node.
5. The node is owner of the page and grants read access to another node. This node is also entered into a list of nodes with read access (the reader list).
6. The node is owner of the page and grants write access to another node. An invalidation message is sent to all nodes in the reader list.
7. The node is owner of the page and upgrades its own access rights from read to write access. An invalidation message is sent to all nodes in the reader list.
8. The node receives an invalidation message from the page owner.

3.6 Internode Paging

The main idea behind the concept of internode paging is to extend not only the semantics of the VM system across node boundaries, but also its concept of managing physical memory as a cache for virtual memory contents. Together with the mechanisms for maintaining coherency, ASVM's internode paging facilities allow it to view the physical memory of all nodes that use a particular memory object as a cache for the memory object's contents. When a page is evicted from a node's virtual memory cache, the following algorithm is used:

1. If the node is not the page owner, the page is simply discarded, as it can be retrieved from the page owner at any time.
2. If the node is the page owner and its list of nodes with read access is not empty, these nodes are asked, one after another, whether they still have read access to the page (they might have discarded the page as in Step 1). If a queried node answers that it has no read access to the page, the node is removed from the reader list. Otherwise, ownership of the page is transferred to this node. Note that this ownership transfer doesn't require sending the page contents.
3. If the node is the page owner and there are no more nodes with read access to the page, ASVM tries to transfer the page to one of the other nodes that have mapped the memory object to which the page belongs. This fails if no node with sufficient free memory can be found.
4. Finally, if Step 3 failed, the page is returned to the memory object's pager.

For selecting a pageout node in Step 3, a counter is used that cycles through the list of nodes which have mapped the memory object. This counter is incremented on each pageout. First, the node identified by the current counter value is asked to accept the page transfer. If this node doesn't accept the page transfer because it is low on memory, the node which most recently accepted a page transfer is asked again, as sketched below. This algorithm is meant to adapt itself to the current memory situation by locking onto nodes which are known to have free memory available until another node is found which also has memory available. If many nodes have free memory, such as when a single node initializes an SVM region that is mapped on multiple nodes, the pages are distributed evenly among the other nodes. This provides good load balancing for ASVM's distributed manager algorithms.
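The pageout-node selection just described can be sketched as follows; this is a simplified reading of the text with invented names, not ASVM code.

    #include <stdbool.h>

    struct pageout_state {
        int cursor;          /* cycling counter, advanced on every pageout          */
        int last_acceptor;   /* node that most recently accepted a page, -1 if none */
    };

    /* Ask a node whether it will take the page (stub); a node that is low
     * on memory declines. */
    extern bool node_accepts_page(int node);

    /* Returns the node that accepted the page, or -1 if the page has to be
     * returned to the memory object's pager (Step 4).  'mappers' is the list
     * of nodes that have mapped the memory object. */
    int select_pageout_node(struct pageout_state *st,
                            const int *mappers, int n_mappers)
    {
        st->cursor = (st->cursor + 1) % n_mappers;
        if (node_accepts_page(mappers[st->cursor])) {
            st->last_acceptor = mappers[st->cursor];
            return st->last_acceptor;
        }
        /* Fall back to the node that most recently accepted a page; it is
         * likely to still have free memory. */
        if (st->last_acceptor >= 0 && node_accepts_page(st->last_acceptor))
            return st->last_acceptor;
        return -1;
    }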
3.7 Delayed Copy Management

ASVM extends the asymmetric copy strategy of the virtual memory system across node boundaries. The basis for internode copy relationships is provided by the shared virtual memory functionality of ASVM. When a copy of a memory object is to be mapped on some node, first a shared mapping of the source object is established on this node. Then a local copy is created through the standard mechanisms of the VM system (see FIGURE 8). After that, all resident pages of the source object on all nodes which share the object are marked read-only by sending a message to the sharing nodes, which causes them to do a memory_object_lock_request on the copied region.

While all local push and pull operations continue to be done internally in the VM system, ASVM has to be involved if either the source or the copy object is shared between multiple nodes:

· If the source object is shared, a page which is about to be modified has to be pushed not only to the local copy object, but also to the copy objects on all other nodes which share the source object.

· If the copy object is shared, before pushing a page to the copy object it has to be determined whether the page is already present on any of the nodes which share the copy object.

3.7.1 Extensions to the EMMI Interface

The shared virtual memory functionality of ASVM was implemented without changing the EMMI. This was not possible for the delayed copy management, because the EMMI design hides the existence of copies and shadows from the memory manager. So the following EMMI extensions had to be made to give the memory manager control over the VM-internal copy mechanisms:

memory_object_lock_request was extended by a "mode" argument that allows specifying whether the page should be pushed down the VM-internal copy chain before executing the lock operation.

memory_object_lock_completed was extended by a "result" argument which is used to return an indication if the lock_request couldn't execute a push operation because the page was not present in the VM cache.

memory_object_data_supply was extended by a "mode" argument that allows a page to be pushed down the copy chain instead of being supplied to the source object.

memory_object_pull_request was added to allow a page to be retrieved through the VM-internal shadow chain.

memory_object_pull_completed was added to return the result of a pull_request. There are three possible results:
1. The page is not available and can be zero-filled.
2. The page is available and its contents are returned.
3. The memory manager of a shadow object has to be asked for the page and the shadow object port is returned.

3.7.2 Push operations

A page has to be pushed into all (possibly remote) copies of a memory object before it can be modified in the source object. A push operation is initiated by the current owner of a page if it receives a write request and the page has not already been pushed into all copy objects. To determine whether a page needs a push operation, ASVM uses version counters for memory objects and pages: An object's version counter is incremented each time a copy is made from the object. The version counter of a page is set to the associated object's version counter each time a push operation takes place on the page. If a page is about to be modified and the page version is not equal to the associated object's version, a push operation is initiated before write access is granted.
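A minimal sketch of this version test, with invented names, might look as follows:

    #include <stdint.h>

    struct asvm_object { uint32_t version; };  /* incremented on every delayed copy */
    struct asvm_page   { uint32_t version; };  /* set to object version on push     */

    /* Push the page into all (possibly remote) copy objects (stub). */
    extern void push_to_all_copies(struct asvm_object *o, struct asvm_page *p);

    /* Called on the page owner before write access is granted. */
    void ensure_pushed_before_write(struct asvm_object *o, struct asvm_page *p)
    {
        if (p->version != o->version) {
            push_to_all_copies(o, p);
            p->version = o->version;   /* the page is now current for all copies */
        }
        /* write access can now be granted */
    }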
When a push operation takes place for a page, a message is sent to all nodes that have mapped the associated memory object, except the node that initiates the push operation. On these nodes the memory_object_lock_request EMMI call is used to prompt the VM system to push the page down the copy chain and invalidate it in the source object. If the page is not present in the VM cache, the reply to the lock request will indicate this, and a request to send the page contents is sent to the page owner. Once that request is answered, a memory_object_data_supply EMMI call is made to push the page down the copy chain. Once the page owner has received replies from all nodes that were requested to push the page, and after all missing pages have been sent to these nodes, the source-object side of the push operation is complete.

Additional effort is required on the side of the copy objects if they are shared. To determine whether a page is already present in a shared copy object, a special type of access request, called a push scan request, is generated and forwarded through ASVM's forwarding mechanisms. If a page owner exists in the copy object, it will answer the push scan request and the push operation will be canceled for this copy object. If no page owner exists in the copy object, the request will finally be forwarded to the shadow object and the push operation will proceed for this copy object.

3.7.3 Pull operations

For pull operations, ASVM uses the VM system to traverse local shadow chains and its own SVM capabilities to bridge the gap between node boundaries. If a page-fault occurs in a VM object, the VM system will traverse the local shadow chain and - if the page isn't present in one of the traversed source objects - generate a memory_object_data_request EMMI call in the first object that has an associated ASVM object. ASVM will then forward this request through its forwarding mechanisms.

If the page is present on some node, it will have an owner who receives and answers the request. Otherwise - assuming that the current object is a copy too - the request will finally be forwarded to the node on which the copy was created, which also has a mapping of the corresponding source object. On this node, which is also called the peer node of the current object, ASVM will then use the memory_object_pull_request EMMI call to traverse the local shadow chain. If the requested page is present in one of the source objects on this node or can be zero-filled, it will be supplied to the origin node of the request. If the result of the pull_request indicates that the page might be present in a source object which has an associated ASVM object, the request will be forwarded into this object. This continues until either the page is found in some object or the end of the shadow chain is reached, in which case the page can be zero-filled.

As an example, FIGURE 9 shows the copy chain that is created when a task forks to a remote node and the child task does the same. Assume that a page-fault occurs in object 3 on Node C and the page is located in object 1 on Node A. The VM system on Node C issues a data_request for the page in object 2. ASVM forwards the request to Node B, which is the peer node of object 2, and uses a pull_request to traverse the local shadow chain. The result of the pull_request indicates that the page has to be looked up in object 1, and ASVM forwards the request to Node A. Here, again, a pull_request is used and returns the page contents. ASVM then supplies the page to the object from which it got the request, object 2 on Node C. Finally, the VM system will take care of entering the page into the physical pagemap of the task that took the page-fault in object 3.

One issue left out so far is that of synchronization between push and pull operations.
Because of the distributed nature of ASVM's page management, a request from a copy object can enter its source object at the same time a push operation is in progress. In this case, the copy request is held up until the push operation completes. Then a retry indicator is set in the request and it is sent back to the origin node, which causes the request to be repeated.

4. Performance Measurements

We made our performance measurements on a Paragon multicomputer system with 72 GP nodes (2 processors and 16 MByte of memory per node).

4.1 Basic Page-Fault Latencies

4.1.1 Page-Faults on Shared Memory

Table 1 compares characteristic types of SVM page faults and their latency under the ASVM system and NMK13 XMM. All latencies were measured in user-task context by performing read or write operations and timing their duration (in milliseconds).

Table 1: Page Fault Latencies (ms)

Fault Type                                               ASVM     XMM
Write fault on a page with 1 read copy                   2.24    38.42
Write fault on a page with 2 read copies                 3.10    12.92
Write fault on a page with 64 read copies                8.96    72.18
Write fault on a page with 2 read copies,
  faulting node has read copy                            1.51     3.83
Write fault on a page with 64 read copies,
  faulting node has read copy                            7.75    63.72
Read fault on a page, faulting node is first reader      2.35    38.59
Read fault on a page, faulting node is second reader     2.35    10.06

FIGURE 10 is a graphical representation of the latency introduced by a write page fault in relation to the number of nodes that have read copies of the page. The figure distinguishes between two cases. In one, called write upgrade fault, the faulting node already has a read copy of the page. In the other, called write fault, the faulting node doesn't have a read copy. The NMK13 XMM times are measured for the general case in which the XMM stack is remote from both the faulting node and the nodes that have read copies.

The graph shows that the page fault latencies of ASVM increase much more slowly with the number of nodes that have a read copy of the page than the page fault latencies of NMK13 XMM. This is especially important for SVM applications and is one indicator of the better scalability of ASVM. The huge difference between page faults with only one read copy and with two read copies that was measured for NMK13 XMM can be explained by the fact that XMM writes a dirty page to the paging space when it is requested for the first time by another node.

4.1.2 Page-Faults on Inherited Memory

For evaluating the performance of delayed copy operations across node boundaries we used a test program that initializes a region of memory (128 KByte), spawns a chain of copies of that region across a defined number of nodes, and faults in all pages of the region on the last node in the copy chain. FIGURE 11 shows the resulting page fault latencies both for NMK13 XMM and ASVM. They can be described as lb + n * la, where lb is the basic latency for a remote copy-on-access fault (5.0 ms for NMK13 XMM and 2.7 ms for ASVM) and la is the cost of forwarding the page fault across an additional node (about 4.3 ms for NMK13 XMM and about 0.48 ms for ASVM).

The low additional cost ASVM incurs on faults that have to be forwarded across a long copy chain is especially important for applications that use dynamic load balancing, as each migration of a task adds another stage to the copy chain from the node where the task was originally started to the node where it is running.
But normal parallel applications also take considerable advantage: The Paragon operating system spawns applications onto a set of nodes by forking along a binary tree, which creates copy chains whose maximum length grows logarithmically with the number of nodes involved. For example, an application which is started on 256 nodes creates copy chains with a maximum length of 8. A page fault across a copy chain of length 8 is associated with a latency of 35 ms for NMK13 XMM and 6.4 ms for ASVM.

4.2 Mapped Filesystem Performance

While the Paragon OS uses a memory mapped Unix file system, parallel access to the same file is sequentialized by the OSF1/AD TNC server, which lacks multiple reader semantics. Therefore, the measurements in this section don't use standard Unix read/write calls. Instead, they bypass the server by using the mmap() function to map a file into memory and reading/writing directly from/to memory. The performance of standard read/write calls can be expected to come close to the values presented here once multiple reader semantics are added to the OSF1/AD server.

TABLE 2, FIGURE 13 and FIGURE 12 show the effective transfer rates seen by each of multiple nodes accessing the same file. The write transfer rate is measured by letting all nodes write different sections of a 4-MB file. Asynchronous writes were used, so the upper limit for the combined transfer rate of all nodes is given by the rate at which the file pager can supply initially zero-filled pages for the file. The read transfer rate is measured by letting all nodes read a 4-MB file in parallel. The upper limit for the individual transfer rate of each node is given by the rate at which the file pager can supply the contents of the file.

TABLE 2: File Transfer Rates (MB/s)

Nodes:          1     2     4     8    16    32    64
ASVM write   2.80  2.60  2.05  1.22  0.62  0.30  0.15
XMM write    2.15  1.77  0.90  0.49  0.24  0.12  0.06
ASVM read    1.57  1.53  1.14  0.91  0.70  0.66  0.66
XMM read     1.18  0.38  0.25  0.11  0.05  0.02  0.01

Note that because of its distributed manager concept, ASVM is able to sustain a reasonable read transfer rate even on a high number of nodes. To achieve the same for write transfers, the file pager would have to tell the kernel about chunks of uninitialized data, so that ASVM's distributed mechanisms could be used for supplying initially zero-filled pages.

4.3 SVM Application Performance

This section presents performance results based on a program called EM3D. This program's communication was originally based on active messages [9], which we changed to shared-memory-based communication. EM3D models the three-dimensional propagation of electromagnetic waves. Its basic data structure is a bipartite graph that has directed edges from a set of E nodes representing the electric field to a set of H nodes representing the magnetic field, and vice-versa. After an initialization phase for building up the graph, the computation consists of a large number of iterations to calculate the development of the fields over time. In each iteration the value of each E node is changed to the weighted sum of the H nodes to which it is connected, and then the value of each H node is changed to the weighted sum of the E nodes to which it is connected. To prevent confusion with processing nodes, the E and H nodes of the EM3D code will be called cells. The graph for the performance measurements was generated randomly, with a user-specified percentage (20%) of the edges (6 per cell) leading to a cell located on a different processing node.
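To make the iteration structure concrete, here is a simplified C sketch of one EM3D time step; the data layout and names are illustrative and are not taken from the actual EM3D sources.

    #include <stddef.h>

    /* Illustrative EM3D cell: its value is recomputed each iteration as the
     * weighted sum of the cells it depends on (E cells depend on H cells and
     * vice versa).  In the SVM version, 'deps' may point into memory that is
     * resident on other processing nodes. */
    struct cell {
        double        value;
        size_t        n_deps;
        struct cell **deps;       /* cells on the other side of the bipartite graph */
        double       *weights;    /* one weight per dependency                      */
    };

    static void update_cells(struct cell *cells, size_t n)
    {
        for (size_t i = 0; i < n; i++) {
            double sum = 0.0;
            for (size_t j = 0; j < cells[i].n_deps; j++)
                sum += cells[i].weights[j] * cells[i].deps[j]->value;
            cells[i].value = sum;  /* reads of remote cells go through SVM */
        }
    }

    /* One iteration: first all E cells, then all H cells.  Synchronization
     * between the two phases across processing nodes is assumed but not
     * shown here. */
    void em3d_iteration(struct cell *e_cells, size_t n_e,
                        struct cell *h_cells, size_t n_h)
    {
        update_cells(e_cells, n_e);
        /* ...synchronize across all processing nodes... */
        update_cells(h_cells, n_h);
    }

The remote reads in the inner loop, and the invalidations triggered when a cell's new value is written, are served by the shared-memory mechanisms whose fault latencies were measured in section 4.1.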
The execution times are given in seconds for 100 iterations of the computation loop. The initialization phase wasn't included in the measurements, since practical applications would use even more iterations, making the initialization overhead almost vanish.

TABLE 3 demonstrates the effect of ASVM's enhanced scalability on SVM applications and shows the execution times of EM3D for various problem sizes, both for NMK13 XMM and ASVM. The column headings indicate the number of nodes on which the application is run, while the row headings indicate the use of XMM or ASVM and the problem size, given by the number of cells. With ASVM the execution times decrease with the number of nodes, resulting in reasonable speedups, while with NMK13 XMM the execution times actually increase, resulting in a slowdown.

TABLE 3: EM3D Timings (seconds)

Nodes:                    1      2      4      8     16     32     64
ASVM, 64000 cells      43.6   32.0   19.9   13.9   11.2   9.86   9.55
XMM, 64000 cells       43.6    151    213    392    755   1405   2735
ASVM, 256000 cells     174*     **     **   33.6   21.5   15.6   12.8
XMM, 256000 cells      174*     **     **    520    842   1604   2957
ASVM, 1024000 cells    698*     **     **     **     **   54.2   24.4
XMM, 1024000 cells     698*     **     **     **     **   1863   3373

Note that each cell in the problem space needs 224 bytes of memory. This means that 64,000 cells consume 14 MB, which is already too much for a 16-MB processing node that only has about 9 MB of memory available for user applications. The sequential measurement with 64,000 cells was therefore made on a 32-MB node. The parallel measurements (marked ** in TABLE 3) were omitted where the combined memory of the 16-MB compute nodes wasn't sufficient to hold the complete data set.

5. Conclusions

Our first evaluations indicate that ASVM fully achieves its goal of providing efficient and scalable distributed memory management. The main reason for this success lies in the close interaction between the three ASVM subsystems for shared virtual memory management, delayed copy support, and internode paging. To demonstrate this interaction, we will take a closer look at how it helps in achieving scalability and efficiency:

· Scalability
As described in Section 3.5, the owner of a page keeps a list of nodes with read access to the page. The maximum size of this list grows linearly with the number of nodes in the system, which poses a problem for scalability. This is where ASVM's internode pageout mechanisms enter the game: The greater the number of nodes with read access to a page, the greater the number of nodes which are available for an ownership transfer when memory gets scarce and the page is evicted from the VM cache. In this way, ASVM's internode pageout mechanisms effectively balance the amount of owner information each node has to keep among all nodes which share a memory region.

· Efficient Memory Usage
A great part of the memory inherited by a child process is not modified by either the parent or the child. The most prominent examples of this are program text segments. As ASVM uses the VM system for establishing local copy relationships, pages requested through a read page fault will be supplied to the source object on the requesting node, instead of the copy object (see 2.2). This means that paging for read-only pages of copy objects is done through their source object, taking full advantage of ASVM's internode pageout mechanisms. If, for example, a parallel application is loaded onto a number of nodes and memory gets scarce, ownership of all read-only pages will be distributed among the participating nodes and pages that are not owned are discarded.
If a node later needs access to a page it has discarded, the page can be retrieved from its current owner instead of being read from disk.

6. Future Work

While ASVM provides a solid foundation for establishing an efficient and scalable single system image, a number of modifications have to be made to the rest of the Paragon OS to utilize ASVM's full potential. We will concentrate on the area of file systems here.

Currently, the Paragon OS provides two types of local file systems, the Unix File System (UFS) and the Parallel File System (PFS). The Unix File System is a memory mapped filesystem and thereby uses the VM system to provide data buffering directly on the nodes which use a file. The Parallel File System, on the other hand, supports striping of files across multiple I/O nodes and uses NORMA-IPC to distribute read/write requests to the I/O nodes. The main advantage of UFS is its caching scheme, which becomes even more efficient through ASVM's internode paging facilities, while the main advantage of PFS is its scalability through the use of multiple I/O nodes. In the following we hint at what is necessary to combine the advantages of UFS and PFS into a new filesystem that supports striping, local caching and full Unix file semantics without a loss of performance:

· Provide and utilize ASVM primitives for locking a range of pages in a shared address space for the exclusive access of a particular task on a particular node. This would make it possible to guarantee the atomicity of read and write operations by locking the range which is about to be modified prior to write operations. The current scheme uses NORMA-IPC to acquire an exclusive token from a token server each time a read or write operation takes place and the token is not present on the node which does the file access.

· Modify the VM system to allow multiple pagers for one VM object that are used for paging requests in a round-robin fashion. This allows one VM object to represent a striped file system, with one pager located on each of the I/O nodes.

· Implement clustering of page-out and page-in requests in the virtual memory system and in ASVM to achieve adequate bandwidths.

Acknowledgments

This research was sponsored by the Advanced Research Projects Agency under contract MDA972-89-C-0034. It has also benefited from discussions, suggestions and feedback by many people, including our colleagues at Intel's Scalable Systems Division, Fritz Gerneth (Intel GmbH) and Michael Gerndt (KFA Juelich).

References

[1] Kai Li: Shared Virtual Memory on Loosely Coupled Multiprocessors. Ph.D. thesis, Yale University, September 1986.

[2] Roman Zajcew, Paul Roy, David Black, et al.: An OSF/1 UNIX for Massively Parallel Multicomputers. In Proceedings of the USENIX Conference, January 1993.

[3] Richard F. Rashid: Threads of a New System. Unix Review, Vol. 4(8), August 1986.

[4] Michael Wayne Young: Exporting a User Interface to Memory Management from a Communication-Oriented Operating System. Ph.D. thesis, Carnegie Mellon University, November 1989.

[5] Avadis Tevanian, Jr.: Architecture-Independent Virtual Memory Management for Parallel and Distributed Environments. Ph.D. thesis, Carnegie Mellon University, December 1987.

[6] Bill Bryant, Steve Sears, David Black, Alan Langerman: An Introduction to Mach 3.0's XMM Subsystem. OSF Research Institute Operating Systems Collected Papers, Vol. 2, October 1993.
[7] Alessandro Forin, Joseph Barrera, Michael Young, Richard Rashid: Design, Implementation and Performance Evaluation of a Distributed Shared Memory Server for Mach. Report CMU-CS-88-165, Carnegie Mellon University, August 1988.

[8] Pete Keleher, Sandhya Dwarkadas, Alan L. Cox, Willy Zwaenepoel: TreadMarks: Distributed Shared Memory on Standard Workstations and Operating Systems. In Proceedings of the USENIX Conference, January 1994.

[9] Culler, D.E., Dusseau, A., Goldstein, S.C., et al.: Parallel Programming in Split-C. In Proceedings of Supercomputing '93, 1993.

[10] Lahjomri, Z., Priol, T.: KOAN: A Shared Virtual Memory for the iPSC/2 Hypercube. In Proceedings of the CONPAR Conference, 1992.

The Authors

Stephan Zeisset currently works as an International Project Leader at Intel's Scalable Systems Division. He graduated in 1994 with a Master's degree in Computer Science from the Munich University of Technology. His master's thesis formed the basis of the XMM rewrite project, in which he took part as a Software Engineer at Intel GmbH. His e-mail address is sz@ssd.intel.com.

Stefan Tritscher is a Technical Marketing Engineer at Intel GmbH. Prior to joining the marketing department he worked as a Senior Software Engineer on Paragon OS components such as load balancing and the XMM rewrite project. He holds a Master's degree in Computer Science from the Munich University of Technology. His e-mail address is stefan@esdc.intel.com.

Martin Mairandres is currently pursuing a Ph.D. degree at RWTH Aachen, Germany. He also holds a Master's degree in Computer Science from the Munich University of Technology. During his Ph.D. research at Intel GmbH he contributed substantially to the shared virtual memory functionality of ASVM and designed interfaces for system and application level monitoring. His e-mail address is martinX@esdc.intel.com.

Trademarks

Paragon is a trademark of Intel Corporation.