1999 USENIX Annual Technical Conference
June 6-11, 1999
Monterey, California, USA
These reports were originally published in the December 1999 issue of ;login:.
Integration Applications: The Next Frontier in Programming
John Ousterhout, Scriptics Corporation
Summary by Peter Collinson
The 1999 USENIX conference at Monterey opened on a somewhat cooler-than-expected
Wednesday morning in early June. After the usual rounds of announcements and
thanks, including the announcement of the USENIX Lifetime Achievement Award and
the Software Tools User Group Award, John Ousterhout came to the podium to give
his keynote presentation. Ousterhout's been a frequent visitor to USENIX
conferences over the years, presenting papers that reflect diverse interests
that sprang originally from his academic base as a professor of computer science
at the University of California, Berkeley.
Ousterhout's life shifted from the campus into industry through his creation of
Tcl (pronounced "tickle"), an extensible scripting language designed to be
embedded easily to control both hardware and software. Tcl gave rise to Tk, the
GUI-builder library, which has had a wide impact, providing a GUI-builder API
for other scripting languages, notably Perl.
I'll guess that Ousterhout's move from the groves of academe surprised both the
university and Ousterhout. He became a distinguished engineer at Sun and
recently has moved again to be the CEO of his own company, Scriptics. Scriptics
promotes Tcl and the use of Tcl, and is part of the new wave of companies that
are based on open-source software.
The talk concentrated on, you guessed it, scripting languages and their impact.
He started with the premise that much of today's software development is
actually the integration of applications to make a greater whole, and that many
programmers are engaged in the often difficult business of making different
software components interact.
Several application areas are driving the need to provide integration
interfaces: we
have the ongoing shift from the command-line interface to the GUI; Web sites
are providing a need to integrate legacy applications such as databases with the
user on a network either local or remote; special-purpose black boxes are on the
rise and there's a need to configure such embedded devices; many applications
are consolidating legacy applications within an enterprise, perhaps a department
in a hospital or when there are mergers and acquisitions; finally, there are the
various component frameworks: COM, EJB, or CORBA.
His contention is that scripting languages are better at providing the necessary
flexible glue than traditional system-programming languages because these
integration tasks have different characteristics from traditional programming
tasks. The fundamental problem is not the algorithms or data structures of the
application but how to connect, coordinate, and customize different parts. The
integration exercise must support a variety of interfaces, protocols, and
formats; it often involves automating business processes and requires rapid and
unpredictable evolution. Finally, integration often involves less sophisticated
programmers.
Ousterhout went on to give a short history of system-programming languages in
terms of their original design goals and compared their approach with scripting
languages. He highlighted two areas where he feels that the traditional
languages fail when used for integration. First, the traditional languages use
strong variable typing designed to reduce errors by using compile-time checking.
Second, the design of traditional languages tends to generate errors that could
be avoided by providing sensible defaults.
Ousterhout dismissed the traditional concerns about scripting languages. The
main one is usually performance. This is no longer a problem: machines are
500 times faster now than they were in 1980, and anyway most expensive
operations can be done in libraries. Second, people often complain that it's
hard to find errors in scripting languages because there are fewer compile-time
checks. He counters this problem by saying that there is better runtime checking
in scripting languages and that they are "safe" because they provide sensible
defaults. (I had to "hmm" a bit at this. I've played the game of "find the
missing bracket" a little too often in both Perl and Tcl.) Finally, he dismissed
the notion that scripting code is hard to maintain, on the ground that there is
much less code to deal with in the first place.
Ousterhout then moved on to talk about Tcl. Tcl arose because Ousterhout wanted
to create a simple command language for applications, one that could be reused
in many different applications. The ideas gave birth to the Tool Command
Language, or Tcl, which is a simple language that's embeddable and extensible.
Tcl provides generic programming facilities while anything really hard or
performance-impacting can be placed in a library for the application and
accessed by invoking a command.
Tcl today has more than 500,000 developers worldwide. The Scriptics site is
supplying 40,000 downloads every month. There's an active open-source community
with strong grassroots support. There are thousands of commercial applications:
automated testing, Web sites, electronic design automation, finance, media
control, health care, animation, and industrial control.
Ousterhout concluded by saying that we are experiencing a fundamental shift in
software development, moving toward more integration applications. These
applications are better served by scripting languages that are supporting a new
style of programming. The programming style seeks to minimize differences
between components and to eliminate special cases. The use of simple
interchangeable types in the language helps to keep the size of the code down
and aids the programmer. It also helps to minimize errors. Scripting languages
are providing a challenge to the community by making programming available to
more people and also by allowing more to be done with less code.
(Thanks to John for the slides of his talk, whose contents I have cheerfully
stolen for this report.)
REFEREED PAPERS
Session: Virtual Memory
Summary by Brian Kurotsuchi
The Region Trap Library: Handling Traps on Application-Defined Regions of
Memory
Tim Brecht, University of Waterloo; Harjinder Sandhu, York University
Tim Brecht presented a library in which a user-level program can mark
arbitrarily sized memory regions as being invalid, read-only, or read-write.
The advantage of placing this capability into a library is that the protection
can be assigned at whatever granularity the application
wants, not just the page-by-page basis that the operating system provides.
This library does not use the mprotect mechanism because that form of
protection is page based and does not offer the flexibility this library is
looking for. Instead, the library takes a different approach based on address
"swizzling" and custom trap handlers. Applications that choose to use this
library allocate memory by using custom functions inside the library.
Inside the library, memory is allocated, but the application is provided with
an address that has been swizzled to point into kernel space. Clearly the
application will cause a page fault when it tries to access that memory region
in the future. When that segmentation violation occurs, the preassigned trap
handler will look up the requested memory address in an AVL tree and unswizzle
the address into the appropriate register. The application can then continue
with its work as if there were no problem.
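The general shape of the trick is easy to sketch. The C fragment below is my
own illustration, not the library's code: it protects a page, catches the
resulting SIGSEGV, and receives the faulting address in the handler. The real
Region Trap Library avoids mprotect altogether, hands out swizzled pointers,
and patches the faulting register in an architecture-specific way.

    /* Toy user-level trap handling: NOT the Region Trap Library's code. */
    #define _GNU_SOURCE
    #include <signal.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    static char  *region;                  /* the "protected" region */
    static size_t region_len;

    static void trap_handler(int sig, siginfo_t *si, void *uctx)
    {
        (void)sig; (void)uctx;
        /* The real handler looks si->si_addr up in an AVL tree of regions
         * and unswizzles the address in the saved register context.  Here
         * we simply make the (single) region writable again. */
        if ((char *)si->si_addr >= region &&
            (char *)si->si_addr <  region + region_len)
            mprotect(region, region_len, PROT_READ | PROT_WRITE);
        else
            _exit(1);                      /* a genuine segfault */
    }

    int main(void)
    {
        region_len = (size_t)sysconf(_SC_PAGESIZE);
        region = mmap(NULL, region_len, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        struct sigaction sa;
        memset(&sa, 0, sizeof sa);
        sa.sa_sigaction = trap_handler;
        sa.sa_flags = SA_SIGINFO;
        sigaction(SIGSEGV, &sa, NULL);

        mprotect(region, region_len, PROT_READ);  /* mark region read-only */
        region[0] = 42;                    /* faults; handler unprotects    */
        printf("resumed after trap, value = %d\n", region[0]);
        return 0;
    }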
The Case for Compressed Caching in Virtual Memory Systems
Scott Kaplan, Paul R. Wilson, and Yannis Smaragdakis, University of Texas at
Austin
Memory subsystems in modern computers still suffer from the fact that the CPU
can use memory much faster than RAM can supply it. The general solution today
is to place faster caches between the CPU and main memory, but this only goes
so far, since caches can hold only a limited amount of data. In this
presentation, Scott
Kaplan made a case for adding yet another level in between main memory and the
CPU with the hope of increasing performance.
They added a cache of compressed memory between the main memory subsystem and
the paging mechanism. Kaplan pointed out that this strategy was attempted in
the past but ended up with results that show little to no benefit. He asserts
that with modern-day hardware we can compress/decompress the data fast enough to
gain a significant advantage when compared to the cost of paging the data to
disk, whereas past experiments have lacked the spare CPU cycles to make the
scheme feasible.
In their implementation of this new cache, they created the Wilson-Kaplan
compression algorithm, further improving on past work in this field. They claim
to have a 2:1 compression ratio when compressing the type of data found in a
typical chunk of memory. Measuring their results, they claim to have found a
reduction of about 40% in paging requirements, but they are unable to present a
comparison to previous work because the previous work could not be reproduced.
Session: Web Servers
Summary by Aaron Brown
Web++: A System for Fast and Reliable Web Service
Radek Vingralek and Yuri Breitbart, Lucent Technologies, Bell
Laboratories; Mehmet Sayal and Peter Scheuermann, Northwestern University
Radek Vingralek's presentation described Web++, a system that addresses the
problems of Web-server response time and reliability by using "Smart Clients"
and cooperating servers to balance and replicate Web content dynamically across
a group of distributed Web servers. Vingralek began with several motivating
examples that illustrated the problems of poor and inconsistent Web performance,
especially when transcontinental links are involved. He then presented the
solution of content replication across geographically distributed servers, which
addresses both the performance and reliability problems of single-site servers
and which does not suffer the flaws of the proxy-caching approach (low hit rates
and cache bypassing by providers) or of the server-clustering approach (single
dispatcher and no way to avoid network bottlenecks).
The Web++ approach combines a Smart Client, implemented as a signed Java applet
and downloaded on demand to the user's browser, with Java servlet-based server
extensions that maintain the replicas and provide clients with information on
how to find them. The server preprocesses all HTML files sent to the clients,
replacing each HTTP URL with a list of URLs pointing to the various replicas of
the original object. It keeps a persistent directory of replica locations and
uses a genealogy-tree-based algorithm to maintain eventual consistency with
other servers; all communication among servers is handled with HTTP/1.1
put, delete, and post commands. The client
architecture is based on a Java applet that intercepts the JavaScript event
handlers to capture all page requests. For every page request, the applet uses
the replica information embedded by the server to select the actual destination
for the request. The replica-selection algorithm attempts to select the replica
located on the server with the overall best request latency in the recent past;
to do this, it keeps a persistent per-server latency table on the client
machine. To avoid suboptimal selections due to stale data, the client
periodically and asynchronously polls servers at a rate selected to balance
overhead with data freshness. This selection algorithm outperforms most standard
algorithms, including random selection, selection based on the RTT or number of
network hops, and probabilistic selection.
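A toy version of that latency-driven choice might look like the C below; the
table layout, the smoothing factor, and the function names are my assumptions,
not Web++ internals.

    #include <stdio.h>

    struct server { const char *name; double est_latency_ms; };

    /* Pick the replica whose server showed the best recent latency. */
    static int pick_replica(const struct server s[], int n)
    {
        int best = 0;
        for (int i = 1; i < n; i++)
            if (s[i].est_latency_ms < s[best].est_latency_ms)
                best = i;
        return best;
    }

    /* Latency samples (from normal requests or the periodic asynchronous
     * polls) update the per-server table with a smoothed estimate. */
    static void record_sample(struct server *s, double sample_ms)
    {
        s->est_latency_ms = 0.75 * s->est_latency_ms + 0.25 * sample_ms;
    }

    int main(void)
    {
        struct server replicas[] = { {"california", 80.0},
                                     {"kentucky",  120.0},
                                     {"germany",   300.0} };
        record_sample(&replicas[2], 100.0);   /* Germany looks faster lately */
        printf("chosen replica: %s\n",
               replicas[pick_replica(replicas, 3)].name);
        return 0;
    }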
Vingralek presented a brief performance evaluation of Web++ that demonstrated an
average of 47% improvement in response time using a workload of fixed-size files
and three geographically distributed servers (in California, Kentucky, and
Germany). Server response time degraded by at most 14% because of the servlet
extension. The Web++ client algorithms underperformed the optimal algorithm
(sending each request to every replica simultaneously and using the fastest
response) by only 2.2%. The speaker concluded by bemoaning the poor support for
smart clients in the current Java applet model, and by emphasizing the
importance of developing good models and easy-to-use algorithms for replica
consistency. One audience member questioned the complexity of the Web++ system
relative to the benefits it provides, especially relative to simpler solutions.
Vingralek responded that Web++ can significantly outperform most of the standard
replica-selection algorithms such as random, number-of-hops, and RTT.
Efficient Support for P-HTTP in Cluster-Based Web Servers
Mohit Aron, Peter Druschel, and Willy Zwaenepoel, Rice University
Mohit Aron described extensions that add support for persistent HTTP (P-HTTP) to
the LARD (Locality-Aware Request Distribution) technique for cache-aware load
balancing in cluster-based Web servers. The first part of the talk introduced
traditional LARD, which relies on a front-end machine to examine each incoming
request and route it to the back-end cluster node that is most likely to have
the requested content in its cache. LARD outperforms traditional algorithms such
as weighted round-robin (WRR) in both load balance (a LARD system remains
CPU-bound as the size of the cluster is increased, whereas WRR becomes
disk-bound) and cache behavior (the effective cache size of a LARD system is the
sum of the sizes of each node's cache, as opposed to WRR, in which the effective
cache size is the size of one node's cache).
LARD was developed for the HTTP/1.0 protocol and thus assumes that one TCP
connection carries only one request. If used unmodified, simple LARD does not
perform well with HTTP/1.1's P-HTTP, since LARD balances load on the granularity
of a connection, which with P-HTTP can contain multiple independent requests. In
the next part of the talk, Aron described various options for updating LARD to
perform request-granular load balancing with P-HTTP. The first option, multiple
TCP handoff, requires that the front-end examine each request in the connection
and hand each request off to the appropriate back-end node, which then sends the
response to just that request directly to the client. This achieves
request-granular balancing but adds the overhead of creating a new back-end
connection with each request, defeating much of the benefit of P-HTTP. With the
other option, back-end forwarding, the front-end hands the entire connection
over to a back-end node, which services the first request and then sends
additional requests directly to appropriate (potentially different) back-end
nodes. In simulation, neither the naive connection-based method nor these two
attempts at request-based redistribution come close to the ideal of
zero-overhead-per-request redistribution.
Aron investigated a modified version of back-end forwarding for LARD that takes
into account the extra cost of forwarding a request between back-end nodes. In
this version, a connection is assigned (using LARD) to a back-end node X on the
basis of its first request. For each subsequent request, if that request is
already cached at X, it is serviced by X. Otherwise, if X's disk utilization
(load) is low, then the request is still serviced by X, avoiding the cost of an
extra hop. Otherwise, the request is sent to another back-end node that has the
appropriate data cached. In simulation, this policy comes very close to the
ideal. Aron also described an implementation of this policy in a cluster of
FreeBSD machines running Apache; the Web servers were unmodified, but the kernel
was enhanced with loadable modules to implement TCP handoff and the front-end
dispatcher. In experiments, the enhanced back-end forwarding policy exceeded the
performance of simple LARD with HTTP/1.0, simple LARD with P-HTTP, and all
weighted-round-robin variants. It outperformed LARD-HTTP/1.0 by 26%,
demonstrating the benefit of P-HTTP in a LARD system.
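The decision logic of the extended policy is simple enough to sketch. The C
below is a paraphrase of the description above; the utilization threshold and
the stand-in cache and LARD functions are invented for illustration, not taken
from the paper's implementation.

    #include <stdio.h>
    #include <string.h>

    struct backend { int id; double disk_util; };

    #define DISK_UTIL_THRESHOLD 0.8          /* assumed, not from the paper */

    /* Toy stand-ins for the real cache lookup and LARD assignment. */
    static int cached_at(const struct backend *n, const char *url)
    { return (strlen(url) + (size_t)n->id) % 2 == 0; }

    static struct backend *lard_pick_node(struct backend nodes[], int n,
                                          const char *url)
    { return &nodes[strlen(url) % (size_t)n]; }

    /* conn_node is the node the connection was handed to on its first request. */
    static struct backend *route_request(struct backend *conn_node,
                                         struct backend nodes[], int n,
                                         const char *url)
    {
        if (cached_at(conn_node, url))
            return conn_node;                 /* already cached locally        */
        if (conn_node->disk_util < DISK_UTIL_THRESHOLD)
            return conn_node;                 /* disk is idle: fetch it locally */
        return lard_pick_node(nodes, n, url); /* forward to the caching node   */
    }

    int main(void)
    {
        struct backend nodes[2] = { {0, 0.9}, {1, 0.1} };
        struct backend *b = route_request(&nodes[0], nodes, 2, "/index.html");
        printf("request routed to back-end %d\n", b->id);
        return 0;
    }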
One questioner claimed that 90% of hits on big sites fit in a small cache and
can be serviced by a single machine, and asked why a complicated cluster-based
solution like LARD was necessary. Aron replied that their experiments were based
on real traces that did not have such a small working set, that it was common
for sites to have static-request working sets in the gigabytes, and that LARD
was targeted at situations where the working set does not fit in a
single-machine cache. He also argued that cluster-based solutions provide an
easy means of incremental scalability as the working set increases (simply add
more machines).
Flash: An Efficient and Portable Web Server
Vivek S. Pai, Peter Druschel, and Willy Zwaenepoel, Rice University
The authors' motivation in building yet another Web server was both to create a
server with good portability and high throughput over a range of workloads, and
to gain a better understanding of the impact of different concurrency
architectures on Web-server performance. To that end, Flash implements several
different concurrency architectures in a common implementation, so that
architecture can be examined independently of implementation.
In the first part of his talk, Vivek Pai introduced the various architectures
for handling multiple concurrent Web requests. The first of these is the
Multiple Process (MP) architecture, which uses multiple processes, each handling
one request at a time. MP is simple to program but suffers from high
context-switch overhead and poor caching. The next option, Multithreaded (MT),
uses one process with multiple threads; each thread handles one request at a
time. This approach reduces overhead and improves caching, but requires robust
kernel-threads support for large numbers of threads, blocking I/O in threads,
and synchronization. Another architecture is Single Process Event Driven (SPED),
in which only one process/thread is used, and in which multiple requests are
handled in an event-driven manner via an event-dispatcher using
select() and asynchronous I/O. This model removes the need for threads
and synchronization, but often in practice performs poorly because of the lack
of asynchronous disk I/O in most OSes. Finally, Pai introduced a new
architecture, Asymmetric Multiple-Process Event Driven (AMPED), which uses a
SPED-like model of a central event dispatcher, but which also uses independent
helper processes to handle disk and network I/O operations asynchronously.
Pai next described an implementation of the AMPED architecture in the Flash Web
server. Besides implementing AMPED (as well as several other concurrency
models), Flash also incorporates additional optimizations such as the use of
memory-mapped files and gather writes. Most important, Flash uses aggressive
application-level caching of pathname translations, response headers, and file
mappings. In simple experiments in which a single page is repeatedly fetched,
this application-level caching is the dominant performance factor (accounting
for a doubling in performance in some cases), and the concurrency architecture
is not a major factor. In trace-based experiments, Flash with AMPED in general
outperformed or was competitive with all other servers and architectures because
of its optimizations, application-level caches, and the good cache locality
achieved by its single-address-space design. It performed up to 30% faster than
the commercial Zeus SPED server, and up to 50% faster than MP-based Apache. In
particular, Flash approached SPED performance where SPED performs best (on
cacheable workloads) and exceeded MP performance on disk-bound workloads (where
MP performs best), demonstrating that AMPED combines the best features of both
architectures and works well across a range of workloads.
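The event-driven core that SPED and AMPED share boils down to a single
select() loop. The toy C server below is my illustration of the pattern, not
Flash's code; the comment marks where AMPED would hand potentially blocking
disk work to a helper process whose pipe joins the same select() set.

    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/select.h>
    #include <sys/socket.h>
    #include <netinet/in.h>

    int main(void)
    {
        int lfd = socket(AF_INET, SOCK_STREAM, 0);
        struct sockaddr_in addr;
        memset(&addr, 0, sizeof addr);
        addr.sin_family = AF_INET;
        addr.sin_addr.s_addr = htonl(INADDR_ANY);
        addr.sin_port = htons(8080);
        bind(lfd, (struct sockaddr *)&addr, sizeof addr);
        listen(lfd, 16);

        fd_set active;
        FD_ZERO(&active);
        FD_SET(lfd, &active);
        int maxfd = lfd;

        for (;;) {
            fd_set rdset = active;           /* one process, many connections */
            if (select(maxfd + 1, &rdset, NULL, NULL, NULL) < 0)
                break;
            for (int fd = 0; fd <= maxfd; fd++) {
                if (!FD_ISSET(fd, &rdset))
                    continue;
                if (fd == lfd) {             /* new connection */
                    int cfd = accept(lfd, NULL, NULL);
                    if (cfd >= 0) {
                        FD_SET(cfd, &active);
                        if (cfd > maxfd) maxfd = cfd;
                    }
                } else {                     /* request bytes are ready */
                    char buf[1024];
                    ssize_t n = read(fd, buf, sizeof buf);
                    if (n <= 0) { close(fd); FD_CLR(fd, &active); continue; }
                    /* In AMPED, a cache miss here would be passed to a helper
                     * process; its pipe descriptor would join the select()
                     * set, and the reply would go out when it became ready. */
                    const char *resp =
                        "HTTP/1.0 200 OK\r\nContent-Length: 2\r\n\r\nok";
                    write(fd, resp, strlen(resp));
                    close(fd); FD_CLR(fd, &active);
                }
            }
        }
        return 0;
    }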
Session: Caching
Summary by David Oppenheimer
NewsCache: A High-Performance Cache Implementation for Usenet News
Thomas Gschwind and Manfred Hauswirth, Technische Universität Wien
Thomas Gschwind described NewsCache, a USENET news cache. Noting that the
network bandwidth requirement for carrying a typical USENET news feed is
35 gigabytes per day and growing, Gschwind suggested separating the USENET
article-distribution infrastructure from the access infrastructure, with the
latter being handled by cache servers and the former being served by a dedicated
distribution backbone. NewsCache is designed to serve as part of the access
infrastructure: it is accessed by clients using NNRP and itself accesses the
news infrastructure using NNRP.
NewsCache uses several techniques to achieve high performance. Most
significantly, it stores small articles (by default, those less than 16KB) in
memory-mapped databases, one database per newsgroup, with only large articles
stored in the file system. Gschwind studied a number of article-replacement
strategies for the cache, including BAF (biggest article first), LFU (least
frequently used first), LRU (least recently used first), and LETF (least
expiration time first). Both the LFU and LRU strategies were studied on a
per-article and per-newsgroup basis (i.e., replacement of the least recently
used article in the system, or of the entire least recently used newsgroup in
the system). Gschwind examined hit rate and bytes transferred as a function of
spool size for each of the replacement strategies. In space-constrained
situations the LRU-group strategy generally performed the best.
Besides caching, NewsCache provides transparent multiplexing among
multiple-source news servers and can perform prefetching. NewsCache is
distributed with Debian/GNU Linux. More information is available at
<https://www.infosys.tuwien.ac.at/NewsCache/>.
Reducing the Disk I/O of Web Proxy Server Caches
Carlos Maltzahn and Kathy J. Richardson, Compaq Computer Corporation; Dirk
Grunwald, University of Colorado, Boulder
Carlos Maltzahn described techniques for reducing the amount of disk I/O
required by Web-proxy-server caches running on top of a generic OS filesystem.
His study used Squid as its reference Web cache. Squid stores every cached
object as a single file in a two-level directory structure and uses a
round-robin mechanism to ensure that all directories in the two-level structure
remain balanced. Maltzahn compared and contrasted Web-cache workloads with
generic-filesystem workloads: the significant differences are that Web-cache
workloads show slowly changing object popularity, while filesystem workloads
show more temporal locality of reference; the hit rate of Web caches is lower
than that of file systems because of a higher fraction of writes in Web caches;
and crash recovery is less critical in Web caches than in file systems because
the objects stored by Web caches are by definition redundant copies.
Maltzahn described two changes to the Squid cache architecture that were found
to reduce disk I/O. The first is to hash each object's URL's hostname to
determine where the object is stored, in order to store all objects from the
same server in the same directory. This storage scheme reduced the number of
disk I/Os to 47% of the number using unmodified Squid. The second change is to
store all objects in one large memory-mapped file rather than in individual
per-object files. Maltzahn used the original Squid scheme for objects larger
than 8KB and a memory-mapped file for all other objects. This scheme reduced
disk I/O to 38% of the original value, and in combination with the
server-URL-hashing approach yielded 29% of the original number of disk I/Os
compared to unmodified Squid.
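The first of those changes amounts to hashing the hostname portion of each URL
to pick a cache directory, so that one server's objects cluster together. A
small C sketch of the idea follows; the hash function and directory count are
my choices, not Squid's or the authors'.

    #include <stdio.h>
    #include <string.h>

    #define NDIRS 256                        /* assumed number of cache dirs */

    static unsigned hash_hostname(const char *url)
    {
        const char *host = strstr(url, "://");   /* skip the scheme, if any */
        host = host ? host + 3 : url;

        unsigned h = 5381;                   /* djb2-style string hash */
        for (; *host && *host != '/' && *host != ':'; host++)
            h = h * 33 + (unsigned char)*host;
        return h % NDIRS;
    }

    int main(void)
    {
        const char *urls[] = { "http://www.usenix.org/events/usenix99/",
                               "http://www.usenix.org/publications/login/",
                               "http://www.infosys.tuwien.ac.at/NewsCache/" };
        for (int i = 0; i < 3; i++)          /* same host, same directory */
            printf("%-45s -> cache dir %u\n", urls[i], hash_hostname(urls[i]));
        return 0;
    }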
Maltzahn next compared three replacement strategies for the memory-mapped cache
by analyzing the strategies' ability to minimize disk I/O. The strategies
studied were LRU, FBC (frequency-based cyclic), and a near-optimal
"future-looking" replacement strategy derived from the entire reference stream.
Maltzahn found that LRU performed poorly. Compared to LRU, FBC provided an
almost identical hit rate, a small reduction in the number of disk I/Os, and a
good reduction in wall-clock time (mostly due to a reduction in seek time). The
near-optimal policy fared much better than either LRU or FBC, suggesting that
more careful coordination of memory and disk could lead to more significant
performance improvements.
An Implementation Study of a Detection-Based Adaptive Block Replacement
Scheme
Jongmoo Choi, Seoul National University; Sam H. Noh, Hong-Ik University; Sang
Lyul Min and Yookun Cho, Seoul National University
Sam Noh described DEAR, DEtection-based Adaptive Replacement, a filesystem
buffer cache management scheme that adapts its replacement strategy to the disk
block reference patterns of applications. A monitoring module in the kernel VFS
layer observes each application's disk block reference pattern over time. The
application's reference pattern is inferred by examining the relationship
between blocks' backward distance (time to last reference) and reference
frequency, and the expected time to the blocks' next reference. DEAR uses this
information to categorize an application's reference pattern as sequential,
looping, temporally clustered, or probabilistic (the last meaning blocks are
associated with a stationary probability of reference). As the application runs,
the program's reference pattern is dynamically detected, and the buffer cache
block replacement algorithm is updated. A detected sequential or looping pattern
triggers MRU replacement, a detected probabilistic pattern triggers LFU
replacement, and a temporally clustered or undetectable reference pattern
triggers LRU replacement.
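The mapping from detected pattern to replacement policy fits in a few lines of
C; the names below are illustrative and are not taken from the FreeBSD
implementation.

    #include <stdio.h>

    enum ref_pattern { SEQUENTIAL, LOOPING, TEMPORAL, PROBABILISTIC, UNKNOWN };
    enum policy { POLICY_MRU, POLICY_LFU, POLICY_LRU };

    static enum policy choose_policy(enum ref_pattern p)
    {
        switch (p) {
        case SEQUENTIAL:
        case LOOPING:       return POLICY_MRU;  /* just-used blocks won't recur soon */
        case PROBABILISTIC: return POLICY_LFU;  /* favor frequently used blocks */
        case TEMPORAL:
        case UNKNOWN:
        default:            return POLICY_LRU;  /* the safe default */
        }
    }

    int main(void)
    {
        printf("looping pattern -> policy %d\n", choose_policy(LOOPING));
        return 0;
    }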
DEAR uses a two-level buffer cache management scheme that is implemented in the
kernel VFS layer. One application cache manager (ACM) per application performs
reference pattern detection and block replacement, and one systemwide system
cache manager (SCM) allocates blocks to processes. DEAR was implemented in
FreeBSD 2.2.5 and its performance evaluated using a number of applications.
Compared to the default LRU replacement scheme, DEAR reduced disk I/Os by an
average of 23% and response time by an average of 12% for single applications.
When multiple applications were run simultaneously, disk I/Os were reduced by an
average of 12% and response time by an average of 8%. A trace-driven simulation
was used to compare DEAR with the application-controlled file caching scheme
developed by Cao, Felten, and Li, which requires explicit programmer hints to
select the replacement policy. DEAR achieved performance comparable to that of
application-controlled file caching, but without requiring explicit programmer
hints.
IP Telephony: Protocols and Architectures
Melinda Shore, Nokia IP Telephony Division
Summary by Jeffrey Hsu
Melinda Shore began her talk by noting that telecommunications and telephony are
undergoing a radical change, and that information on the topic has mostly been
tied up in expensive-to-join committees and thus not readily available to the
public. The driving factor behind all the interest in IP telephony is the
potential cost savings and efficiencies of using a data network to transport
voice. In addition to voice, IP telephony is also used for video and to
integrate voice and email.
Shore described in detail the various scenarios in which IP telephony can be
used, such as end-to-end IP, calls originating in an IP network and terminating
in a switched-circuit network, calls originating and terminating in a
switched-circuit network but passing through an IP network, and various other
permutations.
IP telephony is heavily standards-driven, since interoperability among different
vendors and with the traditional voice networks is key. Two communities are
working on the standards: those from the traditional voice networks (the
bellheads) and those from an IP networking background (the netheads). The
difference in opinion between the two revolves around the issue of centralized
versus decentralized call control. The netheads view the intelligence as being
in the terminals, while the bellheads view the intelligence as residing in the
network.
Shore then described the various standards bodies, such as the European
Telecommunications Standards Institute (ETSI), the ITU-T, and the IETF. It turns
out that many of these standards groups are attended by the same people, so the
standard bodies are not all that different.
Shore discussed in depth the H.323 standard, which is produced by the ITU-T.
H.323 is not a technical specification itself, but rather an umbrella
specification that refers to other specifications such as H.225 and H.245.
H.323 is actually a multimedia conferencing specification, but it is used
mainly for voice telephony. H.225 is the call-control part of H.323; it
specifies call establishment and call tear-down. H.245 is the
connection-control part of H.323; it is encoded using ASN.1 PER. H.235 is the
security part of H.323.
H.323 is the most widely used IP telephony signaling protocol, but it is very
complex and H.323 stacks are very expensive, costing hundreds of thousands of
dollars. There is a new open-source H.323 project that can be referenced at
<https://www.openh323.org>.
Shore explained the role of a gatekeeper in an IP telephony network. It handles
address translation, bandwidth control, and zone management. A gatekeeper is
needed for billing purposes. Call signaling may also be routed through a
gatekeeper. Shore went over several alternative ways to set up call signaling
and the various phases of a telephone call.
Shore wrapped up by talking about
some of the addressing issues in IP telephony. The standard that covers this is
ITU-T E.164. There are open issues involved with locating users and
telephone numbers in an IP network.
Will There Be an IPv6 Transition?
Allison Mankin, USC/Information Sciences Institute
Summary by Bruce Jones
Allison Mankin's talk explored the problems, concerns, and potentials
surrounding the IETF's proposal to move the Internet to IPv6, the "next
generation" of the Internet Protocol.
The problem with the current generation of IP, IPv4, is simple:
there are not enough addresses for everyone to set up and operate the networks
they want. (We won't go into "needs" here, following the
as-apolitical-as-possible model of the IETF. As Mankin notes, not everyone is
using all of their addresses to best advantage, but . . . )
Compound this shortage with coming uses of IP for things like networks for
houses, cars, and Asia, and even with the 2^32 (4 billion) nodes possible in
IPv4 you can see that you couldn't cover the last of these even if all its
users wanted were access to a free email account at Yahoo! and a remote-control
refrigerator.
So the IETF, in its infinite wisdom, organized the "IP Next Generation Working
Group," whose job it is to see if they can generate a standard for the
generation of IP to succeed IPv4: IPv6. Mankin was the co-director of the
IETF Steering Group process that led to formation of the working group.
The IETF, prior to the formation of the IPng Working Group, had generated three
proposals for a new standard: one based on the International Standards
Organization's CLNP; a "radical change" candidate; and the successful
candidate, "Simple IP." This plan, now called IPv6, is capable of supporting
billions of networks and trillions of end-nodes in its 128-bit address space.
The IPv6 address space is broken up in interesting ways. Toss out the three bits
for a format prefix and eight bits "Reserved for future use," and you're left
with bits for a "Top-level Aggregation" (8K of global ISPsa number and
scheme designed to reduce load on major routers), Site-level Aggregation ISPs
(137 billion prefixes), and Site-Level Aggregation with 65.5K networks for every
subscriber site. Polish it off with 64 bits for an Interface Identifier (IID),
which, if I understood correctly, is just the MAC address of the device, and you
have the potential for enough addresses for Internet light switches in every
flat in China.
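Those numbers fall out of the field widths of the aggregatable unicast format
(assuming the RFC 2374 layout the talk appears to describe: a 3-bit prefix,
13-bit TLA, 8 reserved bits, 24-bit NLA, 16-bit SLA, and a 64-bit IID), as this
little C program shows.

    #include <stdio.h>

    int main(void)
    {
        unsigned long long tla = 1ULL << 13;          /* top-level aggregators */
        unsigned long long nla = tla * (1ULL << 24);  /* prefixes below them   */
        unsigned long long sla = 1ULL << 16;          /* subnets per site      */

        printf("TLAs:          %llu (~8K global ISPs)\n", tla);
        printf("NLA prefixes:  %llu (~137 billion)\n", nla);
        printf("SLAs per site: %llu (~65.5K networks)\n", sla);
        return 0;
    }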
Along the way this plan was modified to include address space for Sub-Top-Level
Aggregators because of the demands of the more conservative address managers.
While IPv6 is a coming thing, some strong currents in the Internet world are
counteracting the need for a broader address space. Primary among these is NAT
(Network Address Translation). A NAT is a device that "connect[s] an isolated
address realm with private addresses to an external realm with globally unique
registered addresses" (<https://www.ietf.org/internet-drafts/draft0ietf-nat-termonology-03.txt>). Put simply, if all you
have is one address and you want to put several machines on the Internet, then a
NAT will handle the job by letting you give your machines pseudo IP addresses
while it handles traffic outside your shop via your single address.
NATs are delaying the transition to IPv6 because they offer a solution to
address
shortages that works in many areas and for most applications. However, NATs
will not be able to forestall that transition completely, because they do not
work as well at the provider level as IPv6 addresses.
Finally, returning to the question in the title (will there be a transition to
IPv6?), Mankin finds that the answer depends on what is meant
by transition. If by transition one means that the entire network rushes to
embrace IPv6, then the answer is clearly no. People with systems to keep up are
understandably loath to replace current working technology with something new
simply because the backers of the new tech say it's better. For many, IPv4
serves current needs perfectly well, thank you very much. On the other hand, for
those for whom IPv4 is not a solution, there is movement toward IPv6, as would
be expected.
The biggest push toward IPv6 will come as the number of places that have made
the transition begins to exceed the number of places that haven't. As Asia comes
online in really large numbers, if some NAT solution is not discovered for those
numbers, there will be business pressures on non-IPv6 users to make the
transition too.
The slides from Mankin's talk are at
<https://www.usenix.org/events/usenix99/>.
The Joys of Interpretive Languages: Real Programmers Don't Always Use C
Henry Spencer, SP Systems
Summary by Arthur Richardson
Henry Spencer started his talk by claiming that far too often a programmer will
take the wrong approach when trying to solve a problem. In many cases
programmers immediately start to code a solution using C as the programming
language. This approach usually causes unnecessary work, and working at too low
a level results in too complex a solution. An example he mentioned is the use
of C to write the program man.
The typical reason someone will use C is the perception that it will result in a
more efficient program. Spencer reminds us that writing in C does not guarantee
efficient code. The algorithm the programmer uses is more important to
efficiency than the language. First-cut C code is not always fast.
One of the characteristics of an interpretive language is having significant
smarts at runtime as opposed to compile time. Java, he feels, is a half-breed
that tries to be all things to all people.
His reasons for the use of an interpretive language include: fast turnaround
times, better debugging tools, dynamic code, and working at a much higher level.
Although each of these benefits can be derived from compiled languages, they are
usually much harder to achieve with them.
The first goal in using an interpretive language is that it must work to solve
the problem. Often, this is good enough. Performance may not always be
important, and a working solution can always be used to please management. It
also allows for others involved in the project to work on their areas of the
solution much sooner in the process. Documentation can be created much earlier
in the development process. Clean code is a second goal. The programmer should
make it easy to alter the code for requirements that change later and for
modifications such as improving the user interface. Languages can't demand clear
code, but they can encourage it. A third goal is for the program to run fast
enough. Often soft realtime is all that is required. Spencer has written a mark
sense reader, for entering test answers, that was coded entirely in Awk. There
are circumstances when you can just throw hardware at it if performance is still
not acceptable. Other times there may be a real performance requirement that
requires a lower-level language, but he claims that doesn't happen as often as
people would guess.
Spencer went on to describe a few of the interpretive languages and to point out
some of their benefits and weaknesses:
SH. SH is often adequate. Although it is slow and clumsy when dealing
with low-level operations, it does have good primitives. The only problem with
some of those primitives is that shells weren't considered when they were
designed. A lot of the time, the program doesn't work in a way that would allow
inline use of
itself in a script. An example would be the program quota. This
language is very weak at doing arithmetic.
Awk. Awk is usually considered a glue language. It is very good at doing
small data manipulation. It is capable of doing larger jobs, such as the
troff clone that Spencer wrote, but can sometimes be slow. It is very
clumsy at doing some things such as splitting strings or subtracting fields. It
sometimes feels like development on this language was stopped before it was
completely built. There aren't any methods built into the language to allow for
extensions, so it will most likely remain as a glue language.
Perl. Spencer described Perl as "Awk with skin cancer." Perl is better
evolved and has a better implementation than Awk. The strongest benefits of Perl
are the large number of extensions built for it and its large and active user
community. Spencer then characterized Perl as having readability problems and
said that the structure of the language makes it hard to write good, readable
code.
Tcl. Tcl was designed to control other things and not be its own
language. There are many extensions available for it, including Tk for
programming in X and Expect for controlling interactive programs. Historically,
Tcl suffered from performance problems, but Spencer feels that it has improved
over time. It suffers from a user community that is not very well organized, but
that is improving as well.
Python. Spencer claims that Python is a long step away from scripting.
Much more syntax and more data types are used, which makes it feel more like a
programming language instead of an interpretive one. The object-oriented design
of the language requires much more design consideration prior to beginning the
coding process. All of these factors may take Python a step too far away from
the traditional interpretive languages.
When you are deciding which language to use when attacking a problem, one of
the things to avoid is language fanaticism. A solution that mixes interpretive
languages with compiled extensions can often be the best answer.
The downsides of interpretive languages include late error checking, limited
data types, and the overhead in using mixed solutions.
In the choice of a language to use, availability is one of the more important
considerations. You must also consider what each language is good at. Compare
the need for data manipulation and pipes against the need for arithmetic
calculations. Familiarity with a language is very overrated, and often the
benefit of learning a language that suits the problem outweighs the cost of the
time involved in learning it.
E-Mail Bombs, Countermeasures, and the Langley Cyber Attack
Tim Bass, Consultant
Summary by Bruce Jones
"You gotta keep the mail running 100%the mission is to filter mail, kill
the spam, stop the trouble, keep the stuff running 100% of the time and never
let the system go down." --Tim Bass
Bass began with a long thank-you to the creators of UNIX, sendmail, and the
other tools available to him, pointing out that, while he hadn't invented any of
the utilities, facilities, and software packages he used to stop the attack,
neither would there have been tools or the network to use them on if it were not
for many of the folks in the USENIX audience. This proved a nice segue into the
core theme of his talk, that computers and systems and software and users and
admins are just interrelated nodes on a system.
"We are in a network of other people, and anything that we want to do that is
significant requires other people." In Bass's case, "significant" will come to
be defined as anything to do with setting up, running, maintaining, or
protecting a network.
Bass then turned his talk to a loose history of the events of the Langley Cyber
Attack:
An obviously forged email message from Clinton to Bass's boss alerts Bass to the
fact that the logging on his machines is not sufficient for intrusion detection.
When he turns up the logging, he finds that Langley machines are relaying the
trash of the Internet: porno, hate mail, advertising spam, get-rich-quick
schemes, anything and everything.
At Langley, the initial response is to retaliate: bomb the spammers and porno
generators back; turn on error messages so they would be bombed automatically.
Bass convinces these folks to simply absorb all the traffic but not to
retaliate, not to reply. As Bass noted, "Archive all traffic [and] stop all
error messages because if you forward to the sender and you send a bad message,
the reply goes to the victim." Bass's strategy is the "ooda loop": observe,
orient, decide, act.
Like anyone trying to solve a complex problem, these folks had some lessons to
learn along the way:
First they try to clean up all outgoing mail. This proves to be an impossible
task, as they are getting ~3K messages every couple of minutes. The second
alternative is to queue all outgoing mail. Bass notes two problems with this
tactic: The first is technical: writing scripts to do the work. This is fairly
easy, if labor-intensive. The second is political. Sysadmins have to worry
about security (sensitive mail that should not be seen by anyone other than the
intended recipients); they have to worry about the privacy rights of
individuals (sysadmins are not allowed to look at someone else's message
without good reason); and they have to worry about resource allocation and use.
Bass decides that the mail header files are fair game. He can key on those and
decide which messages to dump out of the delivery queue and into a holding area
for later use, if such use becomes possible or necessary.
He also immediately stops relaying mail, which has foreseeable effects: "When we
cut off the relays it pissed off the hackers. . . . People started bombing and
probing us," and they started trying to work around the fences: "Every rule set
we came up with, they figured out." Many of the standard methods of defense
fail. "We tried firewalls. That worked for about two seconds."
Other lines of defense were in place, even though Bass didn't realize it:
"Having a really slow network is a good line of defense."
After finishing his history lesson, Bass then gave a guided tour of some readily
available hacker tools and techniques on the Web. He covered four types of mail
attack: chain bombing; error-message bombing; covert channel distribution; and
mailing-list bombing. Then he
ran through a few of the available Windows-based GUI mail-bombing tools:
Unabomber; Kaboom; Avalanche; Death & Destruction; Divine Intervention;
Genesis.
The conclusion of Bass's talk was an overview of the future of the work of
protecting systems against these kinds of attacks: "Intrusion detection systems
and firewalls are largely ineffective because you can't understand a network
from a GUI. In networks we need to be looking at another paradigm for the
future. We need to be teaching our operators and the people on the network
awareness of what's happening on the network. We need to take the concepts of
situational awareness and begin building awareness systems that allow people to
understand network infrastructure. . . . Our systems haven't learned to fuse
sensor information with long-term knowledge and then develop mid-term
situational awareness."
As Bass noted, much of the hacker/cracker menace is just juveniles engaging in
the same kinds of (what used to be mildly destructive) vandalism that kids have
engaged in for decades. Unfortunately for the system administrators whose
systems are the targets of the bad guys, these kids have more time, tools, and
energy (and, in some cases, computer horsepower) available than working
sysadmins. Their available resources turn what might look like mild vandalism
from the user-interface end into serious problems at the receiving end.
To paraphrase Bass, sysadmins and security people are going to expend a lot of
resources in the next few years dealing with this sort of stuff. Sysadmins have
to prepare for the day when they have multiple attacks on their network(s) with
multiple decoy targets and actual targets. They have to have systems where "the
average 17-year-old operator who's working a summer job managing a network is
now able to differentiate between what's real and what's just someone having fun
on the network."
You can read all the details in Bass's paper at
<www.silkroad.com/papers/html/bomb>.
Big Data and the Next Wave of InfraStress: Problems, Solutions, Opportunities
John R. Mashey, Chief Scientist, SGI
Summary by Art Mulder
John Mashey, current custodian of the California "UNIX" license plate, presented
an overview of where computer technology appears to be heading and outlined
areas where we need to be concerned and prepared. A key opening thought was that
if we don't understand the upcoming technology trends, then watch out, we'll be
like people standing on the shore when a large wave comes rushing in to crash
over us.
Mashey began with a definition of the term "infrastress," a word that he made up
by combining "infrastructure" and "stress." You experience infrastress when
computing subsystems and usage change more quickly than the underlying
infrastructure can change to keep up. The symptoms include bottlenecks,
workarounds, and instability.
We all know that computer technology is growing: disk capacities, CPU speeds,
RAM capacity constantly increase. But we need to understand how those
technologies interact, especially if the growth rates are not parallel. The
audience looked at a lot of log charts to understand this. For instance, on a
log chart we could clearly see that CPU speed was increasing at a rate far
larger than DRAM access times.
Most (all?) computer textbooks teach that a memory access is roughly equivalent
to a CPU instruction. But with new technologies the reality is that a memory
operation, like a cache miss, may cost you 1000 CPU instructions. We need to be
aware of this and change our programming practices accordingly. The gap between
CPU and disk latency is even worse. Avoid disk access at all costs. For
instance, how can I change my program to use more memory and avoid going to
disk? Or, similarly, minimize going to the network, since network latency is
another concern?
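A toy C example makes the point: the same arithmetic, traversed in two
different orders, turns a stream of cache hits into a stream of misses. (This
is my illustration, not an example from the talk.)

    #include <stdio.h>

    #define N 2048

    static double a[N][N];                  /* 32MB: far bigger than any cache */

    int main(void)
    {
        double sum = 0.0;

        /* cache-friendly: rows are contiguous, so most accesses hit */
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                sum += a[i][j];

        /* cache-hostile: column order strides N*sizeof(double) bytes and
         * misses on nearly every access */
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                sum += a[i][j];

        printf("sum = %f\n", sum);
        return 0;
    }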
Disk capacity and latency is another area where two technologies are growing at
different rates. Disk capacity is growing at a faster rate than disk-access
time. We are packing in a lot more data, but our ability to read it back is not
speeding up at the same rate. This is a big concern for backups. Mashey
suggested that we may need to move from tape backups to other techniques:
RAIDs, mirrors, or maybe backup on cartridge disks. We also need to change our
disk filesystems and algorithmic practices to deal with the changing technology.
One interesting side comment had to do with digital cameras and backups.
Virtually everyone in attendance probably has to deal with backups at work. Yet
how many people bother with backups at home? Probably very few, since most
people don't generate that much data on their home systems. A few letters or
spreadsheets, but for the rest the average home system these days is most likely
full of games and other purchased software, all of which are easily restored
from CD-ROM after a system crash. Yet very soon, with the proliferation of
digital cameras, we can expect that home computer systems are going to become
filled with many gigabytes of irreplaceable data in the form of family
snapshots and photo albums. Easy and reliable backup systems are going to be
needed to handle this.
Mashey's technology summary: On the good side, CPU is growing in MHz, and RAM,
disk and tape are all growing in capacity. On the bad side, all those
technologies have problems with latency. This means that there is lots of work
to be done in software and exciting times for system administrators.
The slides for this talk are available at
<https://www.usenix.org/events/usenix99/>.
What's Wrong with HTTP and Why It Doesn't Matter
Jeffrey C. Mogul, Compaq Western Research Laboratory
Summary by Jerry Peek
You probably know Jeff Mogul or use products of his work, such as subnetting.
One of his main projects in the '90s has been the HTTP protocol version 1.1.
Even his mother uses HTTP; it carries about 75% of the bytes on the Internet.
HTTP didn't have a formal specification until 1.1, and that process took four
years. It wasn't an easy four years. Mogul started by saying that the talk is
completely his own opinion and that some people would "violently disagree."
The features of HTTP are well documented; the talk covered them briefly. It's a
request-response protocol with ASCII headers and an optional unformatted binary
body. HTTP/1.1 made several major changes. In 1.1, connections are persistent;
the client can pipeline multiple requests over a connection, which increases
efficiency. Handling of caching is much improved. HTTP/1.1 supports partial
transfers, for example, to request just the start of a long document or to
resume after an error. It's more extensible than version 1.0. There's digest
authentication without 1.0's cleartext passwords. And there's more. (For
details, see the paper "Key Differences Between HTTP/1.0 and HTTP/1.1"
<https://www.research.att.com/~bala/papers/h0vh1.html>.)
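As an illustration of the partial-transfer feature (my example, not one from
the talk), a client that lost its connection after receiving the first 5,000
bytes of a document could resume with an HTTP/1.1 Range request such as:

    GET /paper.ps HTTP/1.1
    Host: www.example.org
    Range: bytes=5000-

A range-capable server answers with a 206 Partial Content response carrying
only the missing bytes, instead of retransmitting the whole file.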
Most of the talk was a long series of critiques, many more than can be mentioned
here. Here's an example. Calling HTTP "object-oriented" is a mistake because the
terms "object" and "method" aren't used correctly. For instance, HTTP transfers
the current resource response instead of the resource itself. A resource can be
dynamic and vary over time; there's no cache consistency for updatable
resources. There's no precise term for the "thing that a resource gives in
response to a GET request at some particular point in time." This and other
problems led to "fuzzy thinking" and underspecification. For example, if a
client requests a file, the connection terminates early, and a partial transfer
is used to get the rest of the file, then there's no way to make an MD5
checksum of the entire instance (only of each of the two messages).
There were procedural problems in the HTTP-WG working group (of which Jeff was a
member). The spec took more than four years to write. Lots of players joined
relatively late, or moved on. There was a tendency to rush decisions ("gotta
finish!") but, on the other hand, architectural issues tended to drift because
the group wanted to get the "big-picture" view. The protocol was deployed in
1996 as RFC 2068, an IETF Proposed Standard. Normally, RFCs in this stage aren't
ready for widespread deployment; they're usually revised to fix mistakes. The
early deployment made it hard to change the protocol. There weren't enough
resources for tedious jobs that needed doing. On the good side, the long process
gave the group time to reflect, find many bugs in the original design, and come
to a consensus. The vendors "behaved themselves," not trying to bias toward
their code. ("Engineers cooperate better than marketers," Jeff pointed out.) He
said that HTTP/1.1 has a good balance between fixes and compatibility.
The bottom line, he said, is that the bugs in HTTP don't matter. (If technical
excellence always mattered, then FORTRAN, Windows, QWERTY keyboards, and VHS
tapes would be dead and buried.) HTTP works well enough, and revising it again
would be too hard. For instance, poor cache coherence can be fixed by
cache-busting or by telling users, "For latest view, hit Shift-Reload."
Inefficiency in the protocol (he gave several examples) might be irrelevant, as
bandwidth keeps increasing and as "site designers get a clue . . . some day." No
single protocol can support every feature; there will be other protocols, such
as RealAudio, that suit particular needs. Human nature adapts readily to
circumstance. HTTP isn't perfect, but it'll be hard to revise again, especially
as the installed base gets massive and sites become mission-critical.
The presentation slides are at <https://www.usenix.org/events/usenix99/>.
UNIX to Linux in Perspective
Peter Salus, USENIX Historian
Summary by Jerry Peek
USENIX historian Peter Salus gave a warm and fascinating talk full of tidbits,
slides showing early UNIX relics, and lots of interaction with the audience
(many of whom had stories to contribute).
1999 is the thirtieth anniversary of "everything that makes it possible for us
to be here": the birth of both the ARPANET (the predecessor of the Internet) and
UNIX. He wove those two histories together into this talk because, without the
Internet, we wouldn't have Linux at all.
Peter started by showing the first page of the first technical report that was
the foundation of ARPANET, "the best investment that . . . the government has
ever made." The bottom line was $1,077,727 (for everything: salaries, phone
links, equipment), over a period of 15 months, to set up a network with five
nodes. On April 7, 1969, RFC 1 was released. As we think about 128-bit
addressing, remember that RFC 1 provided for five-bit addressing; no one had
"the foggiest idea" what the possible growth was. The first host was plugged in
on September 2, 1969. Peter showed Elmer Shapiro's first network map: a single
line. By the end of 1969, the size of the Net had quadrupled . . . to four sites
. . . and each site had a different architecture. The first two network
protocols let you log into a remote machine and transfer files. By 1973, there
were two important network links by satellite: to Hawaii and to Europe.
In October 1973 the Symposium on Operating Systems Principles at IBM Research in
Yorktown Heights was where Ken and Dennis gave the first UNIX paper. (Before
that, almost all UNIX use was at Bell Labs.) Peter said that the paper
"absolutely blew people away," leaving a lasting mark on people's lives. Lou
Katz remembers Cy Levinthal telling him to "get UNIX"; he did, from Ken
Thompson, and he was the first outside person to get UNIX (on RK05s, which he
didn't have a way to read at the time). There was no support, "no anything";
this forced users to get together and, eventually, to become USENIX. May
15, 1974, was the first UNIX users' meeting.
The first text editor, and the one that all UNIX systems have, is ed:
"my definition of user-hostile," said Peter. George Coulouris, in the UK, had
"instant hate" for ed. He rewrote it into em, which stands for
"ed for mortals." Then ed came to UC Berkeley, on tape, where
George went on sabbatical. One day, George was using em at a glass
terminal (Berkeley had two glass terminals!), and a "wild-eyed graduate student"
sitting next to him took a copy. Within a couple of weeks, that student, Bill
Joy, had rewritten the editor into ex . . . which was released in the
next edition of UNIX from Bell Labs. The story here is of software going from a
commercial company in New Jersey, to an academic institution in the UK, to
another academic institution in California, back to the same company in New
Jersey. This kind of exchange gave rise to the sort of user community that
fostered Linux . . . and brought many of us to where we are today. (It wasn't
the first "open source" software, though. Peter mentioned SHARE, the IBM users
software exchange, which began in 1955.)
Ken and Dennis's paper appeared in the July 1974 issue of CACM. At the
same time, a group of students at the University of Illinois started the
foundation of RFC 681. The students said they were "putting UNIX on the net." In
actuality, they were doing the opposite: causing the network to ride on top of
UNIX. Suddenly, the community realized that using UNIX as the basis of the
network changed everything.
One big meeting of the USENIX Association was in June 1979 in Toronto. The
meeting was preceded by a one-day meeting of the Software Tools User Group,
STUG. At that meeting, the first speaker was Al Arms from AT&T, who
announced a big increase in UNIX licensing prices. Now UNIX V7 would cost
$20,000 per CPU, and 32V would be $40,000 per CPU. Although academic
institutions paid much less, "I don't think anybody was very happy," Peter
quipped. This was "the sort of mistake, . . . a corporate lack of common sense,"
that drives users to create things like MINIX, which Andy Tanenbaum did that
year. And MINIX was what helped Linus Torvalds, a dozen years later, to write
Linux. "If it's good and you make it exorbitant, you drive the bright guys to
find alternatives."
You can read much more history in Peter's ;login: articles.
FREENIX TRACK
Session: File Systems
Summary by Chris van den Berg
Soft Updates: A Technique for Eliminating Most Synchronous Writes
in the Fast Filesystem
Marshall Kirk McKusick, Author and Consultant; Gregory R. Ganger, Carnegie
Mellon University
Marshall Kirk McKusick presented soft updates, a project he has been working on
for the past several years. As the title of the paper suggests, the central
intention of soft updates is to increase filesystem performance by reducing the
need for synchronous writes in the Fast Filesystem (and its current derivatives,
most commonly today's UFS). Soft updates also provide an important alternative
to file systems that use write-ahead logging, another common implementation for
tracking synchronous writes. Additionally, soft updates can eliminate the need
to run a filesystem-check program, such as fsck, by ensuring that
unclaimed blocks or inodes are the only inconsistencies in a file system. Soft
updates can also take snapshots of a live filesystem, useful for doing
filesystem backups on nonquiescent systems.
The soft updates technique uses delayed writes for metadata changes, tracking
the dependencies between the updates and enforcing these dependencies during
write-back. Dependency tracking is performed on a per-pointer basis, allowing
blocks to be written in any order and reducing circular dependencies that occur
when dependencies are recorded only at the block level. Updates in a metadata
block can be rolled back before the block is written and rolled forward after
write. In this scheme, applications always see the current metadata blocks,
while the disk sees only copies that are consistent with its own contents.
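The per-pointer tracking can be pictured with a small sketch; this is purely
illustrative and not the actual soft-updates code. Each record remembers one
pointer inside an in-memory metadata block, the value that is safe to put on
disk right now, and the value applications should continue to see:

    /*
     * Illustrative only -- not the kernel's soft-updates structures.
     */
    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    struct dep_record {
        void              *blockp;     /* metadata block holding the pointer */
        size_t             offset;     /* byte offset of the pointer in the block */
        uint64_t           safe_value; /* may be written to disk immediately */
        uint64_t           new_value;  /* what applications should see */
        int                satisfied;  /* set once prerequisite blocks reach disk */
        struct dep_record *next;
    };

    /* Just before the block goes to the disk driver: roll back unsatisfied updates. */
    static void
    roll_back(struct dep_record *deps)
    {
        struct dep_record *d;

        for (d = deps; d != NULL; d = d->next)
            if (!d->satisfied)
                memcpy((char *)d->blockp + d->offset, &d->safe_value,
                    sizeof(d->safe_value));
    }

    /* After the write completes: make the in-memory copy current again. */
    static void
    roll_forward(struct dep_record *deps)
    {
        struct dep_record *d;

        for (d = deps; d != NULL; d = d->next)
            if (!d->satisfied)
                memcpy((char *)d->blockp + d->offset, &d->new_value,
                    sizeof(d->new_value));
    }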
McKusick discussed the incorporation of soft updates into the 4.4BSD-based Fast
File System (FFS) used by NetBSD, OpenBSD, FreeBSD, and BSDI. The three examples
of soft updates in real environments were very impressive. These tests compared
the speed of a standard BSD FFS, a file system mounted asynchronously, and a
file system using soft updates. The first is McKusick's "filesystem torture
test," which showed the asynchronous and soft-updates configurations requiring
42% fewer writes (with synchronous writes almost nonexistent) and a 28% shorter
running time
than the tests run on the BSD FFS. The second test involved building and
installing the FreeBSD system (known as "make world" in FreeBSD parlance). Soft
updates resulted in 75% fewer writes and 21% less running time. The last test
involved testing the BSDI central mail server (which compared only the BSD FFS
and soft updates, since asynchronous mounts are obviously too dangerous for real
systems requiring data coherency). Soft updates required a total of 70% fewer
writes than the BSD FFS, dramatically increasing the performance of the file
system.
The soft updates code is available for commercial use in BSDI's BSD/OS, versions
4.0 and later, and on FreeBSD, NetBSD, and OpenBSD. Also, McKusick announced
that Sun Microsystems has agreed to consider testing and incorporation of soft
updates into Solaris.
Design and Implementation of a Transaction-Based Filesystem on FreeBSD
Jason Evans, The Hungry Programmers
Jason Evans discussed transactional database-management systems (DBMSes), which
are structured to avoid data loss and corruption. One of the key points in the
implementation is that the traditional BSD Fast File System (FFS) doesn't
address the data-integrity requirements that are necessary for designers of
transactional database-management systems.
Typically, programmers of transaction-based applications must ensure that atomic
changes to files occur in order to avoid the possibility of data corruption. In
the FFS, the use of triple redundancy is common in order to implement atomic
writes. The principal downside of a triple-redundancy scheme is that performance
tends to suffer greatly. The alternative that Evans proposed is the use of a
Block Repository (BR), which is similar in many respects to a journaled
filesystem. The major highlights of the BR are that it provides:
- A simple block-oriented, rather than a file-oriented, interface.
- Userland implementation that provides improved performance and control.
The block repository library, which is linked into applications, controls
access to allocated storage resources.
- Data storage on multiple devices, named backing stores, which is similar
in many ways to concepts found in volume managers.
A block repository contains at least four backing stores, which are files or raw
devices with a header and data space. The backing-store header is
triple-redundant to permit atomicity of header updates.
The block repository is designed to be long-running and allow uninterrupted
access to the data in the BR. Online insertion and removal of backing stores is
possible, which allows modification of the BR size without downtime for
configuration changes or maintenance. The repository scheme also allows for
block caching, block locking, data-block management, and transaction commit-log
processing. Additionally, the BR supports full and incremental backup while
online.
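To make the block-oriented, userland character of the interface concrete, here
is a hypothetical sketch; every name and signature below is invented for
illustration and is not taken from the SQRL sources:

    #include <stddef.h>
    #include <stdint.h>

    typedef struct br     br_t;      /* an open block repository */
    typedef struct br_txn br_txn_t;  /* a transaction within it */

    br_t     *br_open(const char *const backing_stores[], size_t nstores);
    br_txn_t *br_txn_begin(br_t *br);

    /* Blocks, not files: callers address fixed-size blocks by number. */
    int br_read(br_t *br, uint64_t blkno, void *buf);
    int br_write(br_txn_t *txn, uint64_t blkno, const void *buf);

    /* Either every write in the transaction reaches stable storage, or none does. */
    int br_txn_commit(br_txn_t *txn);
    int br_txn_abort(br_txn_t *txn);

    void br_close(br_t *br);

Because the library lives in the application's address space, a program links
against it directly instead of going through the kernel's filesystem interface.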
The block repository is part of SQRL, a project sponsored by the Hungry
Programmers (<https://www.hungry.com>). Information on SQRL is available at
<https://www.sqrl.org/sqrl>.
The Global File System: A Shared Disk File System for *BSD and Linux
Kenneth Preslan and Matthew O'Keefe, University of Minnesota; John Lekashman,
NASA Ames
The Global File System is a Shared Disk File System (SDFS) that implements a
symmetric-share distributed filesystem. It is distributed under an open-source
GPL license and implements a high-performance 64-bit network storage filesystem
intended for Irix, Linux, and FreeBSD.
The basic design of the GFS includes a number of GFS clients, a Storage Area
Network, and a Network Storage Pool. Multiple clients are able to access the
Storage Area Network simultaneously.
Some of the key design features of the Global File System are:
- Increased availability. If one client fails, another may continue to
process its tasks while still accessing the failed client's files on the shared
disk.
- Load balancing. A client can quickly access any portion of the dataset on
any of the disks.
- Pooling. Multiple storage devices are made into a unified disk volume
accessible to all machines in the system.
- Scalability in terms of capacity, connectivity, and bandwidth. This
avoids many of the bottlenecks in file systems such as NFS, which typically
depend upon a centralized server holding the data.
The implementation includes a pool driver, which is a logical-volume driver for
network-attached storage. It handles disks that change IDs because of network
rearrangement. A pool is made up of subpools of devices with similar
characteristics. The file system presents a high-performance local filesystem
with intermachine locking and is optimized for network storage. Device locks are
global locks that provide the synchronization necessary for a symmetric Shared
Disk File System. The lock is at the level of the network storage device and is
accessed with the Dlock SCSI command. GFS also uses dynamic inodes and flat,
64-bit metadata structures (all sizes, offsets, and block addresses are
64-bits). Additionally, file metadata trees are of uniform height. To increase
performance, hashing of directories is used for fast directory access, and full
use is made of the buffer cache. Performance testing showed 50MB/s write
performance, with slightly less on read.
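The device-lock idea can be sketched roughly as follows; dlock() here is a
hypothetical wrapper standing in for the Dlock SCSI command and is not a real
GFS interface:

    #include <unistd.h>

    /* Hypothetical: issue the Dlock command; returns nonzero when acquired. */
    int dlock(int devfd, unsigned int lockno, int acquire);

    /* Spin politely until the device-resident lock is ours. */
    static void
    gfs_lock(int devfd, unsigned int lockno)
    {
        while (!dlock(devfd, lockno, 1))
            usleep(1000);           /* back off before retrying */
    }

    static void
    gfs_unlock(int devfd, unsigned int lockno)
    {
        (void)dlock(devfd, lockno, 0);
    }

Because the lock lives on the storage device itself, any client on the Storage
Area Network can synchronize with any other without a central lock server.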
More information on the Global File System can be found at
<https://www.globalfilesystem.org/>.
Session: Device Drivers
Summary by Chris van den Berg
Standalone Device Drivers in Linux
Theodore Ts'o, MIT
Theodore Ts'o discussed the distribution of device drivers outside of the Linux
kernel. Device drivers have traditionally been developed and distributed inside
of the kernel, a situation that can have a number of disadvantages. For example,
versions of drivers are inherently tied to a given kernel version. This could
lead to a situation in which someone wanting to run a stable device-driver
release would be required to run a bleeding-edge kernel, or vice versa:
someone running a stable kernel release ends up running a device driver that may
still have a number of problems. Additionally, device-driver distribution
doesn't scale well. The size of kernel distributions increases proportionally
with the number of device drivers that are added. Growth of the kernel therefore
can't be tied to the number of device drivers available for it if long-term
scalability is desired.
Initially, one method of having separate device-driver distribution simply
involved supplying patches for a given driver, reconfiguring, and recompiling
the kernel. As time went on, loadable kernel modules were introduced that
allowed developers to reduce the configuration, compilation, and test time
dramatically. This also keeps kernel bloat to a minimum. Kernel modules, Ts'o
noted, are an excellent distribution mechanism for standalone device drivers.
One complication in building modules outside of the kernel tree is that the
kernel build infrastructure may no longer be present. Linux does not, however,
typically require a complex system for building kernels. For very simple
modules, a few C preprocessor flags in the Makefile suffice. Similar
modifications to the Makefile can be made in order for drivers to be built both
standalone and inside the kernel.
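As a rough illustration of how small such a standalone module can be, here is a
minimal 2.2-era sketch; the Makefile is assumed to pass the usual flags (along
the lines of -D__KERNEL__ -DMODULE -I/usr/src/linux/include), and a real driver
would register a device rather than just log messages:

    #include <linux/module.h>
    #include <linux/kernel.h>

    int init_module(void)
    {
        printk(KERN_INFO "example: standalone module loaded\n");
        return 0;               /* a nonzero return aborts the insmod */
    }

    void cleanup_module(void)
    {
        printk(KERN_INFO "example: standalone module unloaded\n");
    }

Loading and unloading with insmod and rmmod is exactly the step that the
installation scripts discussed below can automate.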
One issue with modularized device drivers is making their installation
user-friendly. This can be addressed by ensuring during
installation that the driver is placed in the right location (which may be a
little trickier than it sounds, depending on the kernel version), and setting up
rc scripts necessary to load the module at startup. Also, for modules that
export functions, it's important to modify kernel header files so that other
modules can load a driver's exported interfaces. This leads to the conclusion
that rather than using a Makefile target to perform all of these functions, a shell
script is a better option. Some amount of work is also needed within Linux in
order to better support the development of standalone device drivers, such as
standardization of rc.d scripts, and binary compatibility. The latter could be
achieved through an adapter layer, which could maintain compatibility at the ABI
layer but may cause concerns with respect to performance. Taking compatibility
issues one step further, a project to create a "Uniform Device Interface" has
been undertaken by some industry vendors (most notably SCO). This could allow
device drivers to be portable across many OSes, but, again, performance concerns
are a major issue.
Design and Implementation of Firewire Device Driver on FreeBSD
Katsushi Kobayashi, Communication Research Laboratory
Katsushi Kobayashi discussed the implementation of a Firewire device driver
under FreeBSD. This driver includes an IP network stack, a socket system
interface, and a stream-device interface.
Firewire, the IEEE 1394 high-performance serial bus (also marketed as i.Link),
is currently of greatest interest in the audio-visual field. Firewire as a
standard encompasses the physical layer up to network-management functions
within the system and is capable of high network bandwidth. It also includes
online insertion and removal capabilities, as well as the possibility of
integrating numerous
different peripheral types into one bus system.
The standard itself defines a raw packet-level communication protocol, and
applications depend on higher-level protocols that can utilize Firewire. A
Firewire device driver has been implemented already for Windows 98, Windows NT,
and Linux. Higher-level protocols are being standardized for Firewire, including
IP networking, audio-visual device control, and peripheral protocols such as SBP
(Serial Bus Protocol), a SCSI adaptation protocol for Firewire.
The FreeBSD implementation of the Firewire device driver is divided into two
parts: the common parts of the Firewire system that are hardware-independent,
and the device-dependent parts. The device driver currently supports two types
of Firewire chipsets, the Texas Instruments PCILynx and the Adaptec AIC5800, with
plans to develop driver code for newer-generation chipsets, such as OHCI and
PCILynx2, capable of 400Mbps transmission. The API specification currently
developed is still not complete, and compatibility with other types of UNIX is
an important goal in further development.
The FreeBSD Firewire device driver can be found at
<ftp://ftp.uec.ac.jp/pub/firewire>.
newconfig: A Dynamic-Configuration Framework for FreeBSD
Atsushi Furuta, Software Research Associates, Inc.; Jun-ichiro Hagino,
Research Laboratory, Internet Initiative Japan, Inc.
The original inspiration for newconfig was work done by Chris Torek in 4.4BSD,
and the framework for it is currently being ported to FreeBSD-current. Its
motivations are PAO development, CardBus support, and dealing with the
difficulties of the IRQ abstraction (especially for CardBus support).
The goals of the newconfig project are to merge newconfig into FreeBSD-current,
implement dynamic configuration, and add support for any type of drivers and
buses. The eventual removal of the old config(8) is also one of the purposes of
newconfig, which has the advantage of bus and machine independence. Newconfig
supports separation of bus-dependent
parts of device drivers from the generalized parts. Auto-configuration includes
configuration hints to the device drivers, bus and device hierarchy information,
inter-module dependency information, and device-name-to-object-filename
mappings. Currently newconfig handles these components by statically linking
them to the kernel. Part of the future work for newconfig includes dynamic
configuration.
Information on newconfig is available at
<https://www.jp.freebsd.org/newconfig/>.
Session: File Systems
Summary by Chris van den Berg
The Vinum Volume Manager
Greg Lehey, Nan Yang Computer Services Ltd.
Greg Lehey discussed the Vinum Volume Manager, a block device driver
implementing virtual disk drives. In Vinum, disk hardware is isolated from the
block device interface, and data is stored with an eye toward increasing
performance, flexibility, and reliability.
Vinum addresses a number of issues pressing upon current disk-drive and
filesystem technology:
- Disk drives are too small for current storage needs. Disk drivers that
can create abstract storage devices spanning multiple disks provide much greater
flexibility for current storage technology.
- Disk subsystems can often bottleneck, not necessarily because of slow
hardware but because of the type of load multiple concurrent processes can place
on the disk subsystem. Effective transfer capacity, for example, is greatly
reduced in the presence of many small random accesses.
- Data integrity is critical for most installations. Volume management
with Vinum can address this through both RAID-1 and RAID-5.
Vinum is open-source volume-management software available under FreeBSD. It was
inspired, according to its author, by the VERITAS volume manager. It implements
RAID-0 (striping), RAID-1 (mirroring), and RAID-5 (rotated block-interleaved
parity). Vinum allows for the possibility of striped mirrors (a.k.a. RAID-10).
Vinum also provides an easy-to-use command-line interface.
Vinum objects are divided into four types: Volumes, Plexes, Subdisks, and
Drives.
Volumes are essentially virtual disks that are much like a traditional UNIX disk
drive, with the principal exception that volumes have no inherent size
limitations. Volumes are made up of plexes. Plexes represent the total address
space of a volume and are the key hierarchical component in providing
redundancy. Subdisks are the building blocks of plexes. Rather than tie subdisks
to UNIX partitions, which are limited in number, Vinum subdisks allow plexes to
be composed of numerous subdisks, for increased flexibility. Drives are the
Vinum representation of UNIX partitions; they can contain an unlimited number of
subdisks. Also, an entire drive is available to the volume manager for storage.
Vinum has a configuration database that contains the objects known to the
system. The vinum(8) utility allows the user to construct volumes from
a configuration file. Copies of the configuration database are stored on each
drive that Vinum manages.
One of the interesting issues in performance, especially for RAID-0 stripes, is
the choice of stripe size. Frequently administrators set the stripe size too low
and actually degrade performance by causing single I/O requests to or from a
volume to be converted into more than one physical request. Since the most
significant performance factor is seek time, multiple physical requests can
cause significant slowdowns in volume performance. Lehey empirically determined
256KB to be the optimal stripe size for RAID-0 and RAID-5 volumes.
This should ensure that disk access isn't concentrated in one area, while also
ensuring that almost all single disk I/O operations won't result in multiple
physical transfers.
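A back-of-the-envelope calculation shows why: counting how many physical
requests a single volume I/O turns into makes the effect of stripe size obvious.
This is not Vinum code, just arithmetic (values in bytes):

    #include <stdio.h>

    static unsigned long
    physical_requests(unsigned long offset, unsigned long length,
        unsigned long stripe_size)
    {
        unsigned long first = offset / stripe_size;
        unsigned long last  = (offset + length - 1) / stripe_size;

        return last - first + 1;
    }

    int main(void)
    {
        /* A 64KB transfer with a too-small 16KB stripe crosses five stripes. */
        printf("16KB stripe: %lu requests\n",
            physical_requests(8192, 65536, 16384));
        /* With a 256KB stripe the same transfer almost always stays on one disk. */
        printf("256KB stripe: %lu requests\n",
            physical_requests(8192, 65536, 262144));
        return 0;
    }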
Future directions for Vinum include hot-spare capability, logging changes to a
degraded volume, volume snapshots, SNMP management interface, extensible UFS,
remote data replication, and extensible RAID-0 and RAID-5 plexes.
Vinum is available as part of the FreeBSD 3.1 distribution (without RAID-5) and
under license from Cybernet Inc.
<https://www.cybernet.com/netmax/index.html>.
Porting the Coda File System to Windows
Peter J. Braam, Carnegie Mellon University; Michael J. Callahan, The Roda
Group, Inc.; M. Satyanarayanan and Marc Schnieder, Carnegie Mellon
University
This presentation described the porting of the Coda distributed filesystem to
Windows 95 and Windows 98. (A Windows NT port is still in its early stages.)
Coda contains user-level cache managers and servers as well as kernel code for
filesystem support. It is a distributed filesystem that includes many
interesting features:
- read/write server replication
- a persistent client cache
- a good security model
- access control lists
- disconnected and low-bandwidth operation for mobile hosts
- assurance of continuing operation even during network or server failure
The port to Windows 9x involved a number of steps. The port of the user-level
code was relatively straightforward; much of the difficulty lay in implementing
the kernel code under Windows 9x. Coda's user-level programs include Vice, the
file server that services network requests from different clients, and Venus,
which acts as the client cache manager. The kernel module is called
Minicache. Filesystem metadata for clients and servers is mapped into the
address space for Vice and Venus, employing rvm, a transaction package.
The kernel-level module translates Win32 requests into requests that Venus can
service. The initial design was around BSD UNIX filesystems, and so required
modifications to account for differences in filesystem models.
Part of the task for getting clients running under Windows 9x involved
developing Potemkin Venus, a program that stands in for the genuine client cache
manager and allows easier testing of the Minicache kernel code. Complications
arose with the Win32 API: filesystem I/O calls made by a Win32 Venus would
attempt to acquire the Win16Mutex, deadlocking whenever the process whose
request was being serviced already held it. This led to the decision to
implement the cache manager as a
DOS program rather than as a Win32 process. This could be done by hosting Venus
in a DOS box, which was made possible in part by using DJGPP's compiler and
libc. Once workarounds were found for missing APIs (BSD sockets,
select, mmap), the port became more straightforward. Windows 95's
inability to dynamically load filesystems also required a separate filesystem
driver, and communication between Venus and the Minicache was modified to use
UDP sockets.
In summary, many of the complex porting problems were overcome through the use
of freely available software packages and the implementation of mechanisms to
circumvent the user-level Win32/Win16 mutex problems.
More information on Coda is available at
<https://www.coda.cs.cmu.edu/index.html>.
A Network File System over HTTP: Remote Access and Modification of Files and
"files"
Oleg Kiselyov
Oleg Kiselyov discussed the HTTP filesystem (HTTPFS), which allows access to
remote files, directories, and other objects via HTTP mechanisms. Standard
operations such as file retrieval, creation, and modification are possible as if
one were doing this on a local filesystem. The remote host can be any that
supports HTTP and can run Perl CGI scripts either directly or via a Web proxy or
gateway. The program runs in user space and currently supports creating,
reading, writing, appending, and truncating files on a remote server.
Using standard HTTP request methods such as GET, PUT, HEAD, and DELETE,
something akin to a network file system is created, but with the added advantages
that the system is cross-platform and can run on almost any HTTP server.
Additionally, both programmatic and interactive support models exist.
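The mapping is easy to picture: retrieving a remote file is nothing more than
issuing a GET for it. The fragment below does exactly that over a plain socket;
the host name and CGI path are hypothetical, and this illustrates the idea
rather than reproducing the HTTPFS client library:

    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <netdb.h>
    #include <sys/types.h>
    #include <sys/socket.h>
    #include <netinet/in.h>

    int main(void)
    {
        struct hostent *hp = gethostbyname("server.example.com");
        if (hp == NULL)
            return 1;

        int s = socket(AF_INET, SOCK_STREAM, 0);
        struct sockaddr_in sin;
        memset(&sin, 0, sizeof(sin));
        sin.sin_family = AF_INET;
        sin.sin_port = htons(80);
        memcpy(&sin.sin_addr, hp->h_addr_list[0], hp->h_length);
        if (s < 0 || connect(s, (struct sockaddr *)&sin, sizeof(sin)) < 0)
            return 1;

        /* "Opening" the remote file is just issuing a GET for it. */
        const char *req =
            "GET /cgi-bin/httpfs.pl?file=/etc/motd HTTP/1.0\r\n"
            "Host: server.example.com\r\n\r\n";
        write(s, req, strlen(req));

        /* "Reading" the file is reading the response body. */
        char buf[4096];
        ssize_t n;
        while ((n = read(s, buf, sizeof(buf))) > 0)
            fwrite(buf, 1, (size_t)n, stdout);

        close(s);
        return 0;
    }

A PUT of the modified contents plays the role of a write-back, and HEAD and
DELETE map naturally onto stat-like queries and file removal.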
The HTTPFS is a user-level filesystem written as a C++ class library on the
client side and requiring a Perl CGI script on the remote server. The C++
classes can be employed directly or via different applications which link to a
library that replaces many standard filesystem calls such as open(),
stat(), and close(). Modifications to the kernel and system
libraries are not necessary, and in fact the system does not even need to be run
with administrative privileges.
Another advantage of HTTPFS is that the server can apply many of the request
methods to objects as if they were files without necessarily being files, such
as databases, documents, system attributes, or process I/O.
Kiselyov noted that the potential for security risks is inherent in the use of
the HTTPFS and that necessary access controls should be in place that are
concordant with administrators' authentication and authorization policies.
Session: Networking
Summary by Chris van den Berg
Trapeze/IP: TCP/IP at Near-Gigabit Speeds
Andrew Gallatin, Jeff Chase, and Ken Yocum, Duke University
This presentation focused on high-speed TCP/IP networking on a gigabit-per-
second Myrinet network, which employs a messaging system called Trapeze. Common
optimizations above and below the TCP/IP stack are important. They include
zero-copy sockets, large packets combined with scatter/gather I/O, checksum
offloading, adaptive message pipelining, and interrupt suppression. The tests
were conducted on a range of current desktop hardware using a modified FreeBSD
4.0 kernel (dated 04/15/1999), and showed bandwidth utilization as high as 956
Mb/s with Myrinet, and 988 Mb/s with Gigabit Ethernet NICs from Alteon Networks.
It is now widely believed that current TCP implementations are capable of
utilizing a high percentage of available bandwidth on gigabit-per-second speed
links. Nevertheless, TCP/IP implementations will depend upon a number of
critical modifications both above and below a host TCP/IP stack, to reduce data
movement overhead. One of this paper's critical foci was to profile the current
state of the art in short-haul networks with low latency and error rates, and
close to gigabit-per-second bandwidth, as well as to provide quantitative data
to support the importance of different optimizations.
The Trapeze messaging system consists of a messaging library linked into the
kernel or user applications, and firmware that runs on a Myrinet NIC. Trapeze
firmware communicates with the host via NIC memory that is addressable in the
host-address space by means of programmed I/O. The firmware controls host-NIC
data movement and allows for a number of features important for high-bandwidth
TCP/IP. These include:
- Header/payload separation (handled by the firmware and message system),
which allows payloads to be moved to and from aligned page frames in host
memory. This in turn provides a mechanism for zero-copy optimizations.
- Large MTUs and scatter/gather DMA. Myrinet operates without requiring a
fixed MTU, and scatter/gather DMA lets payload buffers utilize multiple
noncontiguous page frames.
- Adaptive message pipelining, which minimizes latency for large packets
while falling back to unpipelined DMA when bandwidth demand is high.
Interrupt suppression is also important in minimizing per-packet overhead for
smaller MTUs and is implemented on NICs such as the Alteon Gigabit Ethernet NIC,
which is capable of amortizing interrupts across multiple packets via adaptive
interrupt suppression. Interrupt suppression doesn't provide much benefit for
MTUs larger than 16KB and is therefore not used for packet reception in Myrinet.
Low-overhead data movement is critical in conserving CPU, though it's important
to note that, because of memory bandwidth constraints, faster CPUs are not
necessarily a panacea for higher data movement. Optimizations to the FreeBSD I/O
manipulation routines such as zero-copy sockets are integral to reduction of
data-movement overhead. Page remapping techniques eliminate data movement while
preserving copy semantics of the current socket interface. Zero-copy TCP/IP at
the socket was implemented following John Dyson's read/write syscall interface
for zero-copy I/O. Zero-copy reads map kernel buffer pages into process address
space via uiomoveco, a variant of uiomove. A read from a file
instantiates a copy-on-write mapping to a page in the unified buffer cache,
while a read from a socket requires no copy-on-write, since the kernel buffer
need not be maintained after the read; any physical page frames that previously
backed the remapped virtual pages in the user buffer are freed. For a send,
copy-on-write is used in case the sending process writes to its send buffer
before the send has completed.
Copy-on-write mappings are freed when the mbuf is released after transmit. This
applies only to anonymous VM pages, since zero-copy transmission of memory
backed by mapped files would duplicate the existing sendfile routine
written by David Greenman.
Checksum offloading reduces overhead by moving checksum computation to the NIC
hardware. This is available in Myricom's LANai-5 adapter and the Alteon Gigabit
Ethernet NIC. The host PCI-DMA engine employs checksum offloading during the DMA
transfer to and from host memory and can be done with little modification to the
TCP/IP stack. A few complications arise: multiple DMA transfers for single
packets require modifications to checksum computations; TCP/UDP checksumming in
conjunction with separate IP checksumming requires movement of the checksum
calculation below the IP stack (i.e., in the driver or NIC firmware); and the
complete packet must be available before checksum computation can occur.
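For reference, the computation being moved onto the NIC is the ordinary 16-bit
one's-complement Internet checksum; a plain-C version, essentially the RFC 1071
sample, looks like this:

    #include <stddef.h>
    #include <stdint.h>

    uint16_t
    in_cksum(const void *data, size_t len)
    {
        const uint16_t *p = data;
        uint32_t sum = 0;

        while (len > 1) {               /* sum 16-bit words */
            sum += *p++;
            len -= 2;
        }
        if (len == 1)                   /* odd trailing byte */
            sum += *(const uint8_t *)p;

        sum = (sum & 0xffff) + (sum >> 16);     /* fold carries back in */
        sum = (sum & 0xffff) + (sum >> 16);
        return (uint16_t)~sum;
    }

Doing this word-by-word pass in the host CPU touches every byte of the packet,
which is exactly the memory traffic that offloading to the DMA engine avoids.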
Tests of the Myrinet system were performed on a variety of commercially
available desktop hardware, with a focus on TCP bandwidth, CPU utilization, TCP
overhead, and UDP latency. The tests showed the importance of techniques to
reduce communication overhead for TCP/IP performance on currently available
desktop platforms under FreeBSD 4.0. The results of 956 Mb/s using Trapeze over
Myrinet and 988 Mb/s with the Alteon Gigabit Ethernet card
are currently the highest recorded TCP bandwidths publicly available, with a DEC
Monet (21264 model 500MHz Alpha) achieving these bandwidths at around 20% CPU
utilization.
Trapeze is available at: <https://www.cs.duke.edu/ari/trapeze>. FreeBSD modifications are available in the FreeBSD code base.
Managing Traffic with ALTQ
Kenjiro Cho, Sony Computer Science Laboratories, Inc.
Kenjiro Cho discussed ALTQ, a package for traffic management that includes a
framework and numerous queueing disciplines, and also supports diffserv and
RSVP. The advantages and disadvantages of different designs, as well as the
available technologies for traffic management, were discussed.
Kenjiro noted that traffic management typically boils down to queue management.
Many disciplines have been proposed that meet a variety of requirements.
Different functional blocks determine the type of queueing available for a
router, and while functional blocks can appear at the ingress interface, they
exist most commonly on the egress interface. Common functional blocks:
- Classifiers categorize traffic based on header content, and
packet-matching rules are used to determine further processing.
- Meters measure traffic streams for certain characteristics that
are saved as flow state and are available to other functions.
- Markers set particular values within a header, such as priority,
congestion information, or application type.
- Droppers discard packets in order to limit queue length or for
congestion notification.
- Queues are buffers that store packets; different queues can exist
for different types of traffic.
- Schedulers perform packet-forwarding determination for a given
queue.
- Shapers shape traffic streams by delaying certain packets and may
discard packets if insufficient space exists in available buffers.
Different queueing disciplines promote different requirements. The available
types of queues are: a standard FIFO; Priority Queueing (PQ); Weighted Fair
Queueing (WFQ), which assigns a different queue for every flow; Stochastic
Fairness Queueing, an easier-to-implement form of Weighted Fair Queueing;
Class-Based Queueing (CBQ), which divides link bandwidth using hierarchically
structured classes; and Random Early Detection (RED), a "fair" form of queueing
that drops packets with a probability that rises as the buffer fills.
Kenjiro discussed some of the major issues in queueing, including the wide
variety of mechanisms available and how many of them cover only certain specific
needs in a queueing environment. Also, employing multiple types of queueing can
be difficult because each queue discipline is designed to meet a specific, not
necessarily inter-compatible, set of criteria and design goals. The most common
uses of traffic management are bandwidth control and congestion control.
Additionally, traffic-management needs must be balanced with ease of
administration. It's also important to note that queueing delays can have a
significant impact on latency, especially in comparison to link-speed latency
delays.
ALTQ is a framework for FreeBSD that allows for numerous queueing disciplines
for both research and operational needs. The queueing interface is implemented
as a switch to a set of disciplines. The struct ifnet has several
fields added to it, such as discipline type, a general state field, a pointer to
discipline state, and pointers to enqueue and dequeue functions. Queueing
disciplines have a common set of queue operations, and other parts of the kernel
employ four basic queue-management operations: enqueue, dequeue, peek, and
flush. Drivers can then refer to these structures rather than using the ifqueue
structure. This adds flexibility to driver support for queueing mechanisms.
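The shape of that interface can be sketched as a structure of function
pointers; the field names below are illustrative rather than copied from the
ALTQ headers:

    struct mbuf;
    struct ifnet;

    struct altq_ops {
        int           type;                 /* which discipline is attached */
        void         *state;                /* per-discipline state */
        int         (*enqueue)(struct ifnet *, struct mbuf *);
        struct mbuf *(*dequeue)(struct ifnet *);
        struct mbuf *(*peek)(struct ifnet *);
        void        (*flush)(struct ifnet *);
    };

    /* A driver hands outgoing packets to whatever discipline is attached. */
    static int
    driver_output(struct ifnet *ifp, struct altq_ops *altq, struct mbuf *m)
    {
        return altq->enqueue(ifp, m);
    }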
Queueing disciplines are controlled by ioctl system calls via character
device interfaces in /dev, with each discipline defined as a minor
device for the primary character device. ALTQ implements CBQ, WFQ, RED, ECN, and
RIO queueing disciplines. CBQ meets many requirements for traffic management,
thanks, in part, to its flexibility.
Additionally, Kenjiro mentioned some of the ways Linux lends itself to queueing
in comparison to *BSD. Specifically, the number of fields that Linux's
sk_buff contains gives it more flexibility than the BSD mbuf
structure. The Linux network device layer also adds flexibility by allowing
queue-discipline classifiers to access network or transport layer information
more readily.
ALTQ for FreeBSD is available at
<https://www.csl.sony.co.jp/person/kjc/software.html>.
Opening the Source Repository with Anonymous CVS
Charles D. Cranor, AT&T Labs-Research; Theo de Raadt, The OpenBSD
Project
Charles Cranor discussed Anonymous CVS, a source-file distribution mechanism
intended to allow open-source software projects to more readily distribute
source code and information regarding that code to the Internet community.
Anonymous CVS is built on top of CVS, the Concurrent Version System, which
provides revision control. Anonymous CVS is currently in use by a number of
open-source projects.
Anonymous CVS was initially developed to provide access to an open-source
software project for Internet users who did not have write access to the CVS
repository. This greatly enhanced the ability of developers and users to access
the repository without compromising the security of the repository itself.
Anonymous CVS also provides a much better format for distribution of open-source
software than previous mechanisms such as Usenet, anonymous FTP, Web, SUP,
rsync, or CTM. One of the critical features of anonymous CVS is that it allows
access to the metadata for a source repository, e.g., modification times and
version information for individual files.
Some of the principal design goals for anonymous CVS were security, efficiency,
and convenience. One particularly interesting aspect of the development of
anonymous CVS was an anoncvs chroot'd shell, which limited the capabilities of a
user with malicious intentions by confining client access to a restricted
environment. This environment integrates nicely with the CVS server system and
can be accessed by standard means such as ssh or rsh.
One major implementation issue for anonymous CVS involved limitations in the
CVS file-locking mechanisms. Since a user cannot write to the CVS repository
when accessing it anonymously, file locking was disabled for read-only access.
Though there were cases where this could lead to some inconsistencies in the
files on the CVS servers, the likelihood was very low. Future versions of
anonymous CVS may look to provide some type of file-locking mechanism for
anonymous access.
New tools based upon CVS have been developed, e.g., the CVS Pserver, CVSWeb, and
CVSup. Pserver, distributed with CVS, requires a login for access to the CVS
repository. One downside of Pserver is that it does not operate in a chroot'd
environment as the anoncvs shell does. CVSWeb allows a GUI interface for
browsing the repository and updates to a local source tree. Additionally, the
CVSup package provides an efficient and flexible mechanism for file distribution
based on anonymous CVS. CVSup is capable of multiplexing stream requests between
the client and server, and it operates very quickly. It understands RCS files,
CVS repositories, and append-only log files, which make up most of the CVS
environment. CVSup provides a command-line and GUI interface. Its one major
drawback is that it's written in Modula-3.
Session: Business
Summary by Jerry Peek
Open Software in a Commercial Operating System
Wilfredo Sánchez, Apple Computer Inc.
As Apple considered major rewrites to the MacOS after version 7, it faced the
fact that writing a new operating system is hard. An OS must be very reliable.
But OSes are complex, so new ones will have bugs. Apple acquired NeXT Software
and got their expertise in OSes. But Apple's core OS team saw that BSD and Mach,
two freely available OSes, had a lot of the features they needed. These
tried-and-true OSes have been refined for years. An active developer
community is adding features all the time. As a bonus, Apple would get Internet
services, Emacs, Perl, CVS, and other useful packages.
So why should Apple bother having its own OS? Why not just give all Apple users
a copy of (for example) Linux? One reason is that many Apple customers don't
want raw UNIX-type systems; they want the familiar look and feel. So Apple added
application toolkits, did hardware integration, and merged the free code into
its new Mac OS X.
Apple decided to contribute much of its own work on the open code base,
including some of the proprietary code, back to the community. They also
have an in-house system to let developers propose that certain new code be made
open source. Why? After all, the BSD license doesn't require release of new
code. One reason is that by sharing code and staying in sync with the open base,
Apple's code wouldn't fall behind and have to track larger and larger
differences. Staying in sync also lets Apple take advantage of the better
testing and quality feedback that the open base gets on multiple platforms. One
surprising side effect of this code sharing is that, as an active open-source
project, Apple's source for PowerPC processors still contains unused code for
Intel processors.
Business Issues in Free Software Licensing
Don Rosenberg, Stromian Technologies
Don Rosenberg's talk discussed how a commercial software vendor should deal with
open source and what those vendors really want. In general, software vendors
want to protect their financial investment and to recover that investment. They
also want to make a profit: revenues to keep the doors open and the programmers
fed. How can companies do this? The talk covered several current models.
The GNU General Public License (GPL) is good for operating systems because OSes are so
widely used. In general, there are many more users of a particular OS than of
any single application under that OS. Here, vendors can make money by
distributing the source code and, possibly, binaries. Red Hat Software is a good
example of this model; Don quoted Bob Young, Red Hat's chairman, as saying that
Red Hat "gives away its software and sells its sales promotion items."
Scriptics Corporation distributes Tcl/Tk, a freely available language and
toolkit with hundreds of thousands of users. Scriptics improves that core
material for free while developing Tcl applications for sale. Because users can
modify and distribute Tcl and extensions themselves, Scriptics has to work hard
to keep them happy if it wants to stay at the center of development. Profits
from commercial applications pay for the free software work. Scriptics' Web site
also aims to be the principal resource for Tcl and its extensions.
Aladdin's Ghostscript has different free and revenue versions, under different
licenses, distributed by different enterprises. Free users are restricted in how
they can distribute the product; they get yearly updates from Aladdin. Licensed
commercial users, on the other hand, can distribute Ghostscript more freely;
they also receive more frequent updates.
More restrictive licenses, such as the Sun Community Source License, are
appearing. Sun's license lets users read the source code but requires that any
modifications be made by Sun. Rosenberg didn't have a prediction for the success
of this kind of license.
Next came a long discussion of the problems with the Troll Tech Qt library and
the Q Public License; I'll have to refer you to the paper for all the details.
The old Qt Free Edition License has been improved in the new QPL by allowing
distribution of Qt with patches, but it did not change the restrictions that the
license put on the popular Linux desktop KDE, which uses Qt. Troll Tech "wants
to control the toolkit . . . makes the product free on Linux in hope of
collecting improvements from users, and wants to reserve the Windows and
Macintosh platforms for their revenue product."
The Qt license problems were a good example of the trouble with licensing
dependencies. Licensing concerns meant that Debian and Red Hat wouldn't
distribute KDE or Qt. Movements have sprung up to clone a Qt that can be
distributed as freely as Linux. Will Troll Tech survive?
Finally, Don presented a model for licensing layers of an operating system and
its applications. The base operating system is most likely to succeed if it's
free. Toolkits and extensions, as well as applications that build on them, can
be either free or proprietary, but the free side should be carefully
separated from the proprietary side to ensure that licensing dependencies don't
cause serious problems.
There's an open-source licensing page and more information at
<https://www.stromian.com>.
"Magicpoint" Presentation Tool
Jun-ichiro Hagino, KAME Project
A last-minute substitution for another talk featured Magicpoint, a presentation
tool similar to the commercial Microsoft PowerPoint software. Magicpoint runs on
the X Window System and is distributed under a BSD-style license. The speaker
used Magicpoint to give his talk, and the slides, and the idea in general, drew
kudos and applause from the crowd.
There were three design goals:
- It should be possible to prepare a presentation on demand, in five
minutes.
- The display should be screen-size independent.
- It should look good.
The presentation source is plain text with encoding (the same idea as, say, HTML
or troff, though this encoding doesn't resemble those). You can
"%include" a style file. There's no built-in editor; you choose your own. As
soon as you edit and save a source file, Magicpoint updates the slide on the
screen.
You can place inline images; they'll be rescaled relative to the screen size.
Fancy backgrounds, such as your company logo, are no problem. Magicpoint handles
animated text and pauses. You can invoke xanim and mpegplay.
The speaker ran xeyes from one of his slides! It's also possible to
invoke UNIX commands interactively and have them appear, together with their
results, on the screen.
Magicpoint has better font rendering than X11 (which isn't good at big fonts, he
says). It uses any font size; all length metrics are relative to the screen
size. Magicpoint uses freetype (for TrueType fonts) and vflib
(a vector font renderer for Japanese fonts).
Here are some of the other goodies in this amazing tool:
- PostScript output generation (for paper printouts)
- A converter to HTML and LaTeX
- Remote control by PocketPoint (from <https://www.mindpath.com>)
- A handy automatic presentation timer on the screen, a histogram that lets
the presenter keep track of the time a slide has been up
- An interactive page selector that puts all the page numbers and titles at
the bottom of the screen
- Handling of multiple languages, including Japanese (of course) and other
Asian languages; can handle multi-language Asian presentations
Future work includes file conversion to and from PowerPoint, revisiting the
rendering engine, an improved syntax (though not too complex), better color
handling in a limited-color environment, and better math and table support
(right now, these are made by piping eqn or tbl output into
groff).
The initial idea, the key concepts, and much of the coding for Magicpoint were
by Yoshifumi Nishida, <nishida@csl.sony.co.jp>. To get the code and more
details, see <https://www.mew.org/mgp/>.
Session: Systems
Summary by Chris van den Berg
Sendmail Evolution: 8.10 and Beyond
Gregory Neil Shapiro and Eric Allman, Sendmail, Inc.
Gregory Neil Shapiro started out by recounting the history of sendmail and how
this led to the formation of the Sendmail Consortium. The overwhelming number of
feature requests received by the Sendmail Consortium then led to the formation
of Sendmail, Inc. Sendmail, Inc. now has full-time engineers and a formal
infrastructure consisting of seven different platforms that are tested on each
release.
Shapiro next discussed the driving forces for the evolution of sendmail:
changing usage patterns in the form of increased message volume; spam and virus
control; new standards such as SMTP authentication, message submissions, and the
IETF SMTP Update standard; and finally, friendly competition with other
open-source MTAs.
After that, Shapiro talked about the new features slated for sendmail v8.10.
These include SMTP authentication, a mail-filter API (which will probably be
deferred to v8.11), IPv6 support, and performance improvements such as multiple
queues and use of buffered file I/O to avoid creating temporary files on disk.
The buffered file I/O optimizations require the Torek stdio, which
makes it possible to attach stdio calls such as fprintf to in-memory
buffers rather than to files on disk. The BSD
implementations of UNIX such as FreeBSD, OpenBSD, NetBSD, and BSDI all use the
Torek stdio library.
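On systems with the Torek stdio, the hook in question is funopen(3); the
fragment below is a simplified sketch of pointing fprintf at a memory buffer
(fixed size, no seek support) rather than sendmail's actual code:

    #include <stdio.h>
    #include <string.h>

    static char membuf[8192];
    static int  memlen;

    static int
    mem_write(void *cookie, const char *buf, int n)
    {
        (void)cookie;
        if (memlen + n > (int)sizeof(membuf))
            n = (int)sizeof(membuf) - memlen;
        memcpy(membuf + memlen, buf, n);
        memlen += n;
        return n;
    }

    int main(void)
    {
        FILE *fp = funopen(NULL, NULL, mem_write, NULL, NULL);

        if (fp == NULL)
            return 1;
        fprintf(fp, "Received: from %s\n", "mail.example.com");
        fclose(fp);
        printf("buffered %d bytes in memory\n", memlen);
        return 0;
    }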
Finally, Shapiro concluded with some directions for the future of sendmail.
These include a complete mail-filter API, threading, better memory management
using memory pools instead of forking and exiting to clean up memory, a
Windows NT port, and more performance tuning.
The GNOME Desktop Project
Miguel de Icaza, Universidad de México
With great animation, Miguel de Icaza discussed the GNOME desktop project. The
goal of the project is to bring new technology to free software systems: a
component model, compound document model, printing architecture, and GUI
development tools. On top of these, the project will then build missing
applications such as the desktop; the GNOME workshop, consisting of a
spreadsheet, word processor, and presentation programs; and groupware
tools like distributed mail, calendaring, and a contact manager.
The GNOME project is structured to allow for volunteers, who cannot commit to
long-term development, to work on small components separately. GNOME makes use
of the CORBA framework to tie components together. The GNOME component and
document model is called Bonobo (a type of chimpanzee). There are query
interfaces to ask whether an operation is supported. This query interface is
similar to the OLE2/ActiveX design. CORBA services support mail and printing.
On the graphics side, the GNOME GUI builder is called Glade. It generates C,
Ada, and C++ code plus XML-based definitions of the layout. The GNOME canvas is
very similar to Tk's canvas, but without doing everything using strings, which
de Icaza pointed out as a shortcoming of Tcl/Tk. Finally, he went on to talk
about the GNOME printing architecture. It's the PostScript imaging model with
anti-aliasing plus alpha channels.
An audience question concerned KDE (K Desktop Environment) support and
integration with GNOME. Miguel feels this can be done but needs help from
volunteers. Those interested in GNOME are directed to
<https://www.gnome.org> and
<https://www.gnome-support.com>.
Meta: A Freely Available Scalable MTA
Assar Westerlund, Swedish Institute of Computer Science; Love
Hörnquist-Åstrand, Dept. of Signals, Sensors and Systems, KTH; Johan
Danielsson, Center for Parallel Computers, KTH
Meta addresses the problem of building a high-capacity, secure mail hub. The
main protocols supported are SMTP and POP. IMAP could be supported, but the
authors feel that IMAP is too big to support at this time. Meta is meant to
replace the traditional solution consisting of sendmail, the local mail-delivery
program mail.local, and popper.
The goals of the Meta MTA are simplicity, efficiency, scalability, security, and
little or no configuration. It uses the techniques of SMTP pipelining and
omitting fsyncs, and the authors conclude that it's not hard to do better than
sendmail simply by omitting the expensive fsync calls.
The spool files are kept in a special POP-wire format to speed up retrievals.
Meta servers are clustered: mail is received by any server and fetched by
querying all the servers. Simple load-sharing is achieved through multiple A
records, though more sophisticated schemes such as load-aware name servers or
hardware TCP routers are possible.
As for security, all spool files are owned by the Meta nonprivileged user. Users
never access files directly, so they don't need shell accounts. A user database,
not /etc/passwd, containing information such as full names, quotas, and
spam filters, is kept and replicated on all servers.
The configuration is not sendmail.cf-compatible. The audience asked
when Meta would be available. The authors make no promises.
See <https://www.stacken.kth.se/project/meta> for further information.
Session: Kernel
Summary by Chris van den Berg
Porting Kernel Code to Four BSDs
and Linux
Craig Metz, ITT Systems and Sciences Corporation
Craig Metz discussed some of the issues involved in porting the U.S. Naval
Research Lab's IPv6 and IPSec distribution to different BSDs and Linux. Both the
specifics of the porting process and some discoveries concerning porting
software to different OSes were presented. The software discussed was ported to
FreeBSD, OpenBSD, NetBSD, BSDI's BSD/OS, and Linux.
One general observation: don't port code that shouldn't be ported. Anything can
potentially be ported, but architectural dissimilarities between systems, for
example, may make it infeasible to port certain software. Porting software
across kernel/user space generally shouldn't be attempted. Most code exists
where it does for a reason.
One technique for building portable code involves the use of abstraction. When
the operations being performed are substantially similar in nature, abstraction
can be a powerful tool for portable code. For example, abstract macros can
expand to system-specific code, such as the similar but differing use of
malloc under BSD systems and Linux. BSD systems use the function
malloc(size,type,wait), whereas Linux employs kmalloc(size,
flags). These two forms of malloc were abstracted to
OSDEP_MALLOC(size), which expands to the correct call for each system.
Additionally, macros can be preferable to functions in kernel space, because the
overhead associated with a function call can be significant when performance and
memory use are critical, as they are in kernel space.
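A sketch of that abstraction might look like the following; OSDEP_MALLOC is the
name from the talk, but the particular malloc type and allocation flags are
filled in here only as plausible examples:

    #ifdef __linux__
    #include <linux/slab.h>
    #define OSDEP_MALLOC(size)  kmalloc((size), GFP_ATOMIC)
    #define OSDEP_FREE(p)       kfree(p)
    #else   /* the BSDs */
    #include <sys/malloc.h>
    #define OSDEP_MALLOC(size)  malloc((size), M_TEMP, M_NOWAIT)
    #define OSDEP_FREE(p)       free((p), M_TEMP)
    #endif

The shared code then calls OSDEP_MALLOC everywhere and never mentions the
system-specific allocator by name.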
For significantly different parts of a system, abstraction is useful but may
often depend upon having large functions or a group of smaller functions that
are encompassed by conditionals. When data structures differ greatly across
platforms, abstraction may provide a useful tool for portability. Metz discussed
struct nbuf, which was developed as an abstract data structure for
portable packet buffers. The nbuf incorporated aspects from the
traditional BSD mbuf and Linux's sk_buff. From BSD, the
nbuf structure took advantage of small headers and few extraneous
fields, and from Linux the nbuf borrowed packet data that is contiguous
in memory and payload data copied to its final location with headers assembled
around it. Additionally, the nbuf design focused on ensuring that
system-native buffer conversion to an nbuf is quick for most cases, and
that converting such an nbuf structure back to the native buffer was
quick. The latter requirement did not have to hold for most cases, since
nbufs are never the initial data structure.
This helped reduce code complexity. The nbuf contains various pointers
to sections of the buffer and packet data, a pointer to the encapsulated native
buffer, and a few OS-specific fields. For most cases, quick conversion from
system-native to nbuf structure was possible.
In summary, Metz mentioned that they were able to achieve a significant degree
of portability for the IPv6 and IPSec implementations with these techniques and
that porting kernel code to multiple systems can be a feasible project.
strlcpy and strlcat: Consistent, Safe, String Copy and
Concatenation
Todd C. Miller, University of Colorado, Boulder; Theo de Raadt, The OpenBSD
Project
Todd Miller gave a brief presentation on strlcpy and strlcat,
which are intended as safe and efficient alternatives to traditional C string
copy and concatenation routines, such as strcpy, strcat,
strncpy, and strncat. OpenBSD undertook the project of
auditing its source base for potential security holes in 1996, with an emphasis
on the possibility of buffer overflows in the use of strcat,
strcpy, and sprintf. In many places, strncat and
strncpy had been used, but in ways indicating that the API for
these functions can easily be misunderstood. An alternative to these routines was
created that was safe, efficient, and had a more intuitive API.
One common misconception about strncpy is that it NUL-terminates the
destination string; this is true only when the source string is shorter than the
size parameter. Another misconception is that strncpy doesn't cause
performance degradation when compared with strcpy. Miller pointed out
that strncpy zero-fills the remaining bytes of the string. This can
cause degradation in cases where the size of the destination buffer greatly
exceeds the size of the source string. With strncat, a common mistake
is the use of an incorrect size parameter, because the space for the NUL should
not be counted in this parameter. Additionally, the parameter is the amount of
space available, rather than the total size of the destination, and this can
often be computed incorrectly.
strlcpy and strlcat guarantee to NUL-terminate the destination
string whenever the size parameter is nonzero. Also, they take the full size of
the destination buffer as the size parameter, which can typically be computed
easily at compile time. Finally, strlcpy and strlcat do not zero-fill
the destination beyond the compulsory NUL-termination. Both of these
functions also return the total length of the string they tried to create, which makes
checking for truncation easier.
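A compact implementation with these semantics looks roughly like the following;
the canonical version lives in the OpenBSD sources:

    #include <stddef.h>

    size_t
    strlcpy(char *dst, const char *src, size_t size)
    {
        const char *s = src;

        if (size != 0) {
            /* Copy at most size-1 bytes, stopping at the NUL. */
            while (--size != 0 && (*dst++ = *s) != '\0')
                s++;
            if (size == 0)
                *dst = '\0';    /* truncated: terminate explicitly */
        }
        while (*s != '\0')      /* keep walking src to learn its full length */
            s++;
        return (size_t)(s - src);
    }

A caller detects truncation with a single comparison, for example
if (strlcpy(dst, src, sizeof(dst)) >= sizeof(dst)).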
strlcpy runs almost as quickly as strcpy, and significantly
faster than strncpy for cases that are copied into large buffers (i.e.,
that would require significant zero-filling by strncpy). This was
evident on tests run on different architectures, and is a by-product, in all
likelihood, not only of less zero-filling but also of the fact that the zero
padding no longer effectively flushes the CPU cache. strlcpy and
strlcat are included in OpenBSD and are slated for future versions of
Solaris. They are also available at <ftp://ftp.openbsd.org/pub/OpenBSD/src/lib/libc/string>.
pk: A POSIX Threads Kernel
Frank W. Miller, Cornfed Systems, Inc.
pk is an operating-system kernel whose primary targets are embedded and realtime
applications. It includes documentation with literate programming techniques
based on the noweb tool. The noweb tool allows documentation
and code to be written concurrently: the noweave utility extracts the
documentation portion of a noweb source file, generating LaTeX
documentation, and the notangle utility extracts the source-code
portion. This tends to force a programmer to document while coding, because the
source is mixed in with the documentation.
pk is based on the concurrency model for POSIX threads, and Pthreads assume that
all threads operate within the same address space, which was intended to be a
UNIX process. pk modifies the Pthreads design by adding page-based memory
protection in conjunction with the MMU. Additionally, pk is not developed around
the ability to utilize paging or swapping, since realtime applications can't
rely on this model. pk is designed for threads to have direct access to physical
memory, and the MMU is used to provide memory-access restrictions rather than
separate address spaces. The three types of memory protection that are provided
are Inter-thread, Kernel-thread, and Intra-thread. The first restricts a thread
to its own address space; the second restricts thread access to kernel memory
via syscall entry points; and the third allows portions of code that are part of
a thread to be marked read-only. Because the design of pk differs from the
Pthreads model of a monolithic unprotected address space, some modifications
were necessary to account for these differences, such as placing restrictions on
certain data structures or routines defined in the API.
Further information about pk is available at <https://www.cornfed.com/pk>.
Session: Applications
Summary by Arthur Richardson
Berkeley DB
Michael A. Olson, Keith Bostic, and Margo Seltzer, Sleepycat Software,
Inc.
Sleepycat Software is the company responsible for the embedded database system
called Berkeley DB. Michael Olson listed a few of the larger applications which
use Berkeley DB, including sendmail, some of Netscape's products, and LDAP.
Berkeley DB was first released in 1991. The current version, 2.6, was released
early in 1999. The original versions were released prior to the creation of the
Sleepycat Software company. Sleepycat was formed to provide commercial support
for Berkeley DB. Upon the creation of the company, they updated the software to
2.x, adding better concurrency handling and transaction locking.
One of the strengths of the Berkeley DB package is that it runs everywhere. It's
POSIX-compliant and runs on both Windows and UNIX. It's considered the best
embedded database software available because of its features, scalability, and
performance. It has been used for directory servers, messaging, authentication,
security, and backing store Web sites.
The Berkeley DB system is a library that is linked in with other applications.
It is very versatile, allowing for use by single or multiple users, single or
multiple processes, and single or multiple threads. It has several built-in
on-disk storage structures and a full-fledged transaction system, and it is
capable of recovering from all kinds of crashes.
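As a rough illustration of this embedded, link-it-in model, here is a minimal
sketch using the classic dbopen() interface from db(3), on top of which the
2.x library layers environments, locking, and transactions; the file name and
record contents are purely illustrative:

    /* Store and fetch a single record through the classic db(3) API. */
    #include <sys/types.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <db.h>

    int
    main(void)
    {
            DB *db;
            DBT key, data;

            db = dbopen("example.db", O_CREAT | O_RDWR, 0644, DB_BTREE, NULL);
            if (db == NULL) {
                    perror("dbopen");
                    return 1;
            }

            memset(&key, 0, sizeof(key));
            memset(&data, 0, sizeof(data));
            key.data = "conference"; key.size = strlen("conference");
            data.data = "usenix99";  data.size = strlen("usenix99");
            if (db->put(db, &key, &data, 0) != 0)
                    perror("put");

            memset(&data, 0, sizeof(data));
            if (db->get(db, &key, &data, 0) == 0)
                    printf("%.*s\n", (int)data.size, (char *)data.data);

            db->close(db);
            return 0;
    }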
In the final part of his presentation, Olson described how a company such as
Sleepycat Software could make money and stay in business by having its main
product released under an open-source license. It may be distributed freely with
applications whose source is also distributed. To distribute it as a binary
within a proprietary application, however, a commercial license is required.
Sleepycat Software also sells support, consulting, and training. Most of the
company's money comes from the commercial licenses, but it sells a number of
support contracts. It has experienced strong growth over the last two to three
years. Olson's final comment: Open source can be the basis for a thriving
business.
The FreeBSD Ports Collection
Satoshi Asami, The FreeBSD Project
Satoshi Asami has built a system for the distribution of a set of sanctioned
applications to FreeBSD systems. The Ports system keeps track of software
packages' version numbers, dependencies, and other useful details. With it,
installing software onto a FreeBSD system is much easier and better organized.
The installed Ports system maintains a small footprint, currently taking about
50MB for the list of over 2,400 ports. The collection is categorized, using
symbolic links, so applications can be found by either real or virtual
categories.
Package dependency is maintained within the Ports system for both compile-time
and runtime dependencies. Currently, file checksums are used for maintaining
package integrity and as a method of security.
New packages get added into the current Ports collection via the
send-pr command. Groups of people, whom Satoshi called Committers,
first test the package and then commit the package for addition. The Ports
manager, Satoshi, does the actual incorporation of the package into the Ports
collection.
The Ports tree changes every day. It supports both FreeBSD-current (4.0) and
FreeBSD-stable (3.2). Both the current and stable packages are built three times
a week and updated once a week on <ftp.freebsd.org>. The Ports system is
frozen a few days before release.
Satoshi next spent some time talking about the maintenance of the Ports
collection. Under the original build system, the packages were built by issuing
the commands cd /usr/ports and make package. This method was slow
and required a great deal of human intervention. They had problems with
incomplete dependency checks, and the system took about three days to compile on
dual PII-300s and used more than 10GB of disk space. They now use
chroot to isolate each build environment, which provides for greater
control of package dependencies. They have also added parallel processes to help
speed up the build time. Currently they use a master system that is a dual
PII-300 system with 128MB memory and 10 SCSI disks. It has eight clients that
are all K6-3D 350s with 64MB of memory and a single IDE disk. The master does
some load balancing and passes off the work to the clients. With this new
system, it takes about 16 hours to build the 2,000+ packages, plus an
additional four hours to run the cvs update and build the
INDEX. All build errors are available at
<https://bento.freebsd.org>.
Current problems include the number of packages in the collection and the size
of the system. Having over 2000 packages can sometimes make it hard to find the
right port. They are considering some sort of keyword database to help organize
things. One size problem is that now they have so many small files that it slows
down CVS. Another size problem is that the collection no longer fits on a single
CD, and the weekly 2GB+ update to all the FTP mirrors can cause some network
stress.
They have two other unsolved problems. There is no built-in method for doing
updates. Currently they only allow installing and deleting a single port. There
is also more thought going into increasing security of the ports. Satoshi
mentioned PGP signatures as a possible solution to that problem.
Multilingual vi Clones: Past, Now and the Future
Jun-ichiro Hagino, KAME Project; and Yoshitaka Tokugawa, WIDE Project
Jun-ichiro has built a multilingual version of vi. His goal was to build a
product that could be used throughout the world by being able to support any
language format.
He planned to rely on the experience he had gained supporting Asian languages. He decided that
Unicode was not an option because it doesn't seem to be widely used, and some
Chinese, Korean, and Japanese characters are mapped to the same codepoint.
His problems had to do with some of the assumptions that most vi clones had
about text encoding. These included the idea that each single byte was a
character and that a space was used to separate words. Asian users need
multi-byte encoding support.
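To see why the one-byte-per-character assumption breaks down, here is a small
sketch in plain ISO C (using the standard mbrtowc interface, not the editor's
own code) that counts bytes versus characters in a locale-dependent multibyte
string; the sample text is illustrative:

    /* Count bytes and characters in a multibyte string.  In a multibyte
     * locale (EUC-JP, UTF-8, ...) the two numbers differ. */
    #include <locale.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <wchar.h>

    int
    main(void)
    {
            const char *s = "漢字 kanji";   /* illustrative sample text */
            const char *p = s;
            size_t bytes = strlen(s), chars = 0, n;
            mbstate_t st;

            setlocale(LC_CTYPE, "");        /* use the user's locale */
            memset(&st, 0, sizeof(st));
            while (*p != '\0') {
                    n = mbrtowc(NULL, p, (size_t)(s + bytes - p), &st);
                    if (n == (size_t)-1 || n == (size_t)-2)
                            break;          /* invalid or incomplete sequence */
                    p += n;
                    chars++;
            }
            printf("%lu bytes, %lu characters\n",
                (unsigned long)bytes, (unsigned long)chars);
            return 0;
    }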
He set out to build something that would allow switching between various
external encoding methods in order to have seamless multilingual support. He
also wanted his product to be able to mix character sets in the text and to be
able to preserve that information within the saved file, all while still
behaving like the standard vi editor.
The first attempt resulted in JElvis. It was based on Elvis with an updated
internal encoding method. It was limited to only a few external encoding methods
but was a step closer to what he wanted to accomplish. His current product is
nvi-m17n. It is based on Keith Bostic's nvi but has better multilingual
capabilities.
Nvi-m17n solves problems in most cases, but he's still working on a few things,
including word boundary issues and the regex library. He would also like to
switch to using widechar and multi-encoding.
Jun-ichiro's key advice for multilingual programming is to reduce the number of
assumptions made by the software.
Session: Kernel
Summary by Jeffrey Hsu
Improving Application Performance through Swap Compression
R. Cervera, T. Cortes, and Y. Becerra, Universitat Politècnica de Catalunya,
Barcelona
Toni Cortes started out by explaining that the motivation for his group's work
was not to run super-large applications, but to enable laptops to run larger
applications. The goals were to increase swapping performance without adding
more memory and to increase swap space without adding more disk. He went on to
describe three novel optimizations of the project: different paths for reads and
writes, batching writes, and not splitting pages between buffers. The speed-ups
on the benchmarked applications range from 1.2x to 6.5x. In performing the
benchmarking, they discovered there was no perfect cache size: large caches
take away memory from the application, while too-small caches won't allow the
application to run. He then described related work, such as that done by Douglis
in 1993, which did not handle reads and writes differently, had limited batch
writes, and showed performance gains only for some applications. There was a
question raised from the audience about which compression scheme was used. The
answer was LZO, but it is easy to change the algorithm if a better one comes
along.
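As a rough illustration of the rule that a compressed page is never split
across swap buffers, here is a user-space sketch; compress_page() and
write_swap_buffer() are hypothetical stand-ins (the project used LZO), and
none of this is the paper's actual implementation:

    /* Pack compressed pages into a buffer that is written to the swap
     * device in one batched write; a page that would straddle two
     * buffers forces a flush and goes whole into the next buffer. */
    #include <stddef.h>
    #include <string.h>

    #define PAGE_SIZE     4096
    #define SWAP_BUF_SIZE (64 * 1024)

    /* Hypothetical: compresses one page into out (at most PAGE_SIZE
     * bytes, falling back to a raw copy) and returns the length used. */
    extern size_t compress_page(const void *page, void *out);
    /* Hypothetical: issues one batched write to the swap device. */
    extern void write_swap_buffer(const void *buf, size_t len);

    static unsigned char swap_buf[SWAP_BUF_SIZE];
    static size_t swap_used;

    void
    swap_out_page(const void *page)
    {
            unsigned char tmp[PAGE_SIZE];
            size_t clen = compress_page(page, tmp);

            if (swap_used + clen > sizeof(swap_buf)) {
                    /* Would split the page: flush the batch first. */
                    write_swap_buffer(swap_buf, swap_used);
                    swap_used = 0;
            }
            memcpy(swap_buf + swap_used, tmp, clen);
            swap_used += clen;
    }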
New Tricks for an Old Terminal Driver
Eric Fischer, University of Chicago
Eric Fischer started by posing the basic question of how to get the arrow keys
to work with different shells and applications. The problem is that some
applications support keyboard editing directly and others rely on the operating
system. He then discussed the three possible places to implement basic editing
facilities: individual applications, the operating system, and a middle layer.
He concluded that it's best to add support in the OS. He gave an overview of the
line terminal code in the kernel and how it parses line input, looking for
editing keys. One complication is that the VT100 and vi-mode bindings require
the kernel to keep track of state, since they are multi-key sequences. History,
in the form of the up and down keys, is kept in a daemon rather than managed
inside the kernel. There are ioctls that an application can use to store lines,
read lines, get the contents of the current line, and change the contents of the
current line. For compatibility reasons, the OS implementation of the editing
keys needs to preserve, from the application's point of view, the illusion that
the cursor is always at the right end of the line. He does this by removing
characters when moving left and placing
them back when moving right. He uses VT100 key bindings for the cursor keys. A
question was raised during the session about how non-VT100 terminals are
supported. The reply was that most modern terminals use ANSI sequences, which
are the VT100 ones. It is hard to get access to termcap info inside the kernel;
therefore, these sequences are hard-wired.
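To illustrate why multi-key bindings force the line discipline to carry state
from one input character to the next, here is a small recognizer for the
VT100/ANSI cursor-key sequences (ESC [ A/B/C/D); it is an illustration only,
not the kernel code from the paper:

    /* Byte-at-a-time recognizer for ESC [ A/B/C/D cursor keys. */
    #include <stdio.h>

    enum key   { KEY_NONE, KEY_CHAR, KEY_UP, KEY_DOWN, KEY_RIGHT, KEY_LEFT };
    enum state { ST_NORMAL, ST_ESC, ST_CSI };

    static enum state st = ST_NORMAL;

    static enum key
    feed_byte(int c)
    {
            switch (st) {
            case ST_NORMAL:
                    if (c == 0x1b) { st = ST_ESC; return KEY_NONE; }
                    return KEY_CHAR;
            case ST_ESC:
                    if (c == '[') { st = ST_CSI; return KEY_NONE; }
                    st = ST_NORMAL;
                    return KEY_CHAR;    /* ESC followed by something else */
            case ST_CSI:
                    st = ST_NORMAL;
                    switch (c) {
                    case 'A': return KEY_UP;
                    case 'B': return KEY_DOWN;
                    case 'C': return KEY_RIGHT;
                    case 'D': return KEY_LEFT;
                    default:  return KEY_CHAR;
                    }
            }
            return KEY_NONE;
    }

    int
    main(void)
    {
            const int up[] = { 0x1b, '[', 'A' };    /* an up-arrow press */
            int i;

            for (i = 0; i < 3; i++)
                    if (feed_byte(up[i]) == KEY_UP)
                            printf("up arrow recognized\n");
            return 0;
    }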
The Design of the DENTS DNS Server
Todd Lewis, MindSpring Enterprises
Todd Lewis first recounted his frustrations configuring and administering the
BIND DNS server and how they motivated him to work on a flexible name server
with easy-to-configure graphical admin tools and the ability to use multiple
back ends rather than flat files to store data. Dents was written from scratch
in ANSI C. It uses glib, which plays a role for C similar to that of the STL
for C++. Internally, it uses CORBA interfaces to control its facilities. Lewis
talked about the internal structure of the code and how one can write different
adapter drivers to retrieve zone information from many disparate sources,
including flat files, RDBMSs, and even scripts. Lewis believes that server
restarts should be rare
events, unlike in Windows where configuration changes invariably require a
system reboot. Yet in UNIX we suffer from the same problem for many of our
services. His solution is to use persistent objects, not config files, to store
parameters. By having a well-defined CORBA IDL interface to the server,
configuration changes can be made without having to restart. Furthermore, the
use of CORBA allows for transactional zone editing. Dents currently uses the
ORBit ORB.
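The adapter-driver idea might be sketched as a table of function pointers like
the one below; this interface is hypothetical, for illustration only, and is
not the actual Dents driver API:

    /* A zone back end (flat file, RDBMS, script, ...) exposes the same
     * small contract, so the server core never knows where data lives. */
    #include <stddef.h>

    struct dns_rr {
            const char *name;
            int         type;       /* e.g. A, NS, MX */
            int         ttl;
            const char *rdata;
    };

    struct zone_driver {
            const char *backend;    /* "flatfile", "sql", "script", ... */
            void *(*open)(const char *zone, const char *config);
            /* Fill at most max records; return the number found. */
            int   (*lookup)(void *handle, const char *name, int type,
                            struct dns_rr *out, size_t max);
            void  (*close)(void *handle);
    };

    /* The server resolves a query through whichever driver the zone is
     * configured with. */
    static int
    resolve(const struct zone_driver *drv, void *handle, const char *qname,
        int qtype, struct dns_rr *out, size_t max)
    {
            return drv->lookup(handle, qname, qtype, out, max);
    }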
Lewis feels that the later revisions of the DNS spec confuse the primary role
of DNS, answering queries, with new features such as editing. The new BIND
supports dynamic updates, but the changes are not persistent; they are lost
when the server shuts down.
The first public release of Dents came in fall 1998. Ongoing work includes
control-facility enhancements and more drivers to interface to different zone
data stores. Interested developers should see <https://www.dents.org>.