LISA '06 Paper
The NMI Build & Test Laboratory: Continuous Integration Framework for Distributed Computing Software
Andrew Pavlo, Peter Couvares, Rebekah Gietzel, Anatoly Karp, Ian D. Alderman,
and Miron Livny - University of Wisconsin-Madison
Charles Bacon - Argonne National Laboratory
Pp. 263-274 of the Proceedings of LISA '06:
20th Large Installation System Administration Conference (Washington, DC:
USENIX Association, December 3-8, 2006).
Abstract
We present a framework for building and testing software in a
heterogeneous, multi-user, distributed computing environment. Unlike
other systems for automated builds and tests, our framework is not
tied to a specific developer tool, revision control system, or testing
framework, and allows access to computing resources across
administrative boundaries. Users define complex software building
procedures for multiple platforms with simple semantics. The system
balances the need to continually integrate software changes with the
need to provide on-demand access for developers. Our key contributions
in this paper are: (1) the development of design principles for
distributed build-and-test systems, (2) a description of an
implemented system that satisfies those principles, and (3) case
studies on how this system is used in practice at two sites where
large, multi-component systems are built and tested.
Introduction
Frequently building and testing software yields many benefits [10,
14, 17]. This process, known as continuous integration, allows
developers to recognize and respond to problems in their applications
as they are introduced, rather than be inundated with software bugs
only when a production release is needed [2]. If the time span between
the last successful build of an application and latest broken version
is short, it is easier to isolate the source code modifications that
caused the application's compilation or testing to fail [17]. It is
important to fix these software bugs early in the development process,
as the cost of the fix has been shown to be proportional to the age of
the bug [3].
We developed the NMI Build & Test framework to facilitate
automatic builds and tests of distributed computing software in a
distributed computing environment. It is part of the NSF Middleware
Initiative (NMI), whose mission is to develop an integrated national
middleware infrastructure in support of science and engineering
applications. In these types of problem domains, build-and-test
facilities may comprise a few computing resources in a single
location, or a large, heterogeneous collection of machines in
different geographical and administrative domains. Our system
abstracts the build-and-test procedures from the underlying technology
needed to execute these procedures on multiple resources. The
logically distinct steps to build or test each application may be
encapsulated in separate, fully automated tasks in the framework
without restricting users to any specific development tool or testing
system. Thus, developers can migrate their existing build-and-test
procedures easily without compromising the procedures of other
applications using the framework.
To build and test any application, users explicitly define the
execution workflow of build-and-test procedures, along with any
external software dependencies and target platforms, using a
lightweight declarative syntax. The NMI Build & Test software stores
this information in a central repository to ensure every build or test
is reproducible. When a build or test routine is submitted to
the framework, the procedures are dynamically deployed to the
appropriate computing resources for execution. Users can view the
status of their routines as they execute on build-and-test resources.
The framework captures any artifacts produced during this execution
and automatically transfers them to a central repository. Authorized
users can pause or remove their routines from the framework at any
time.
We implement the NMI framework as a lightweight software layer
that runs on top of the Condor high-throughput distributed batch
computing system [15, 25]. Leveraging a feature-rich batch system like
Condor provides our framework with the fault-tolerance, scheduling
policies, accounting, and security it requires. The NMI Build & Test
software is not installed persistently on all available computing
resources; it is deployed dynamically by the batch system at runtime.
The framework and software that we present here are just one
component of the NMI Build & Test Laboratory at the University of
Wisconsin-Madison's Department of Computer Sciences. The Laboratory
also provides maintenance and administration for a diverse collection
of resources. It is used as the primary build-and-test environment for
the Globus Toolkit [8] and the Condor batch system [25], as well as
other products. Managing such a production facility presents certain
specific system administration problems, such as maintaining a
standard set of software libraries across multiple platforms and
coping with the large amount of data produced daily by our users.
In this document, we discuss not only the design and architecture
of the NMI framework and software, but also the tools and practices we
developed for managing a heterogeneous build-and-test facility on
which it is deployed.
Related Work
Continuous integration and automated build-and-test systems are
used by many large software projects [14]. The benefits of these
systems are most often reported in discussions of agile software
development [2, 12, 22]. Research literature on the general design of
such systems, however, is limited.
There are numerous commercial and open source continuous
integration and automated build systems available [13]. Almost all
provide the basic functionality of managing build and test execution
on one or more computing resources. Three popular systems are the
Mozilla Project's Tinderbox system [20], the Apache Software
Foundation's Maven [16], and the CruiseControl toolkit [6]. The
Tinderbox system requires autonomous agents on build machines to
continuously retrieve source code from a repository, compile the
application, and send status reports back to a central server. This is
different from the approach taken by Maven and CruiseControl, where a
central manager pushes builds and tests to computing resources and
then retrieves the results when they are complete.
Many systems make important assumptions about the scope and
complexity of the computing environment in which they are deployed.
For example, some require that all build-and-test resources be
dedicated or that all users have equal access to them. Other systems
assume that prerequisite software is predictably installed and
configured by the system administrator on all machines in the pool.
Users must hard-code paths to these external dependencies in their
build-and-test scripts, making it difficult to reproduce past routines
on platforms that have been patched or updated. Although these
constraints may be appropriate for smaller projects with few users,
they are not realistic for larger organizations with diverse
administrative controls or projects involving developers located
throughout the world.
Other systems offer more flexibility and control of the build-and-test
execution environment. The ElectricCloud commercial distributed
build system re-factors an application's Makefiles into parallel
workloads executed on dedicated clusters [19]. A central manager
synchronizes the system clocks for the pool to help ensure that a
build script's time stamp-based dependencies work correctly. Another
full-featured commercial offering is the BuildForge continuous
integration system [7]. It uses an integrated batch system to provide
rudimentary opportunistic computing capabilities and resource controls
based on user and group policies.
These systems seldom address the many problems inherent in
managing workloads in a distributed environment, however. For example,
a system must ensure that a running build or test can be cancelled and
completely removed from a machine. This is often not easy: thorough
testing of an application often requires additional services, such as
a database server, to be launched along with the application, and
testing scripts may leave a myriad of files scattered about the local
disk.
Motivation
In a distributed computing environment, a build-and-test system
cannot assume central administrative control, or that its resources
are dedicated or reliable. Therefore, we need a system that can safely
execute routines on resources outside of one local administrative
domain. This also means that our system cannot assume that each
computing resource is installed with software needed by the framework,
or configured identically.
Because of the arbitrary nature of how routines execute in this
environment, non-dedicated remote computing resources are often less
reliable than local build-and-test machines. But even large pools of
dedicated resources begin to resemble opportunistic pools as their
capacity increases, since hardware failure is inevitable. A routine
may be evicted from its execution site at any time. We need a system
that can restart a routine on another resource and only execute tasks
that have not already completed. The build-and-test framework should
also ensure that routines are never ``lost'' in the system when
failures occur.
Lastly, we need a mechanism for describing the capabilities,
constraints, and properties of heterogeneous resources in the pool.
With this information, a system can ensure that each build-and-test
routine is matched with a machine providing the correct execution
environment: a user may require that their build routines only execute
on a computing resource with a specific software configuration. The
system needs to schedule the routine on an available matching machine
or defer execution until a machine satisfying the user's requirements
becomes available. If a satisfying resource does not exist, the system
needs to notify the user that their requirements cannot be met.
Design Principles
The NMI framework was designed in response to our experiences
developing distributed computing software. Our first implementation
was created to help merge the build-and-test procedures of two large
software projects into a unified environment where they could share a
single pool of computing resources and be packaged into a single grid
software distribution. Both projects already had different established
practices for building and testing their applications using a
menagerie of custom scripts and build tools. Thus, our goal was to
develop a unified framework incorporating these application-specific
scripts and processes.
We developed a set of design principles for distributed build-and-test
systems to solve the problems that we encountered in this merging
process. We incorporated these principles into our implementation of
the NMI Build & Test system. They are applicable to other continuous
integration frameworks, both large and small.
Tool Independent
The framework should not require a software project to use a
particular set of development or testing tools. If the build-and-test
procedures for an application are encapsulated in well-defined
building blocks, then a clear separation of the blocks and the tools
used to manipulate them permits diversity. In our system, users are
provided with a general interface to the framework that is compatible
with arbitrary build-and-test utilities. The abstraction afforded by
this interface ensures that new application-specific scripts can be
integrated without requiring modifications to, and thereby affecting
the stability of, the framework or other applications.
Lightweight
The software should be small and portable. This approach has three
advantages: (1) it is easier for system administrators to add new
resources to a build-and-test pool, (2) users are able to access
resources outside of their local administrative domain where they may
be prohibited from installing persistent software, and (3) framework
software upgrades are easier as only the submission hosts need to be
updated.
The NMI Build & Test framework uses existing, proven tools to
solve many difficult problems in automating a distributed computing
workflow. Because it is designed to be lightweight, it is able to run
on top of the Condor batch system and take advantage of the workload
management and distributed computing features Condor offers. The NMI
software only needs to be installed on submission hosts, from which it
is deployed dynamically to computing resources. By this we mean that a
subset of the framework software is transferred to build-and-test
resources and automatically deployed at runtime.
Explicit, Well-Controlled Environments
All external software dependencies and resource requirements for
each routine must be explicitly defined. This helps to ensure a
predictable, reproducible, and reliable execution environment for a
build or test, even on unpredictable and unreliable computing
resources.
When a routine's procedures are sent to a build-and-test resource
for execution in the NMI system, the framework creates and isolates
the proper execution environment on demand. The framework ensures that
only the software required by the routine is available at run time.
This may be accomplished in two ways: (1) the developer must declare
all the external software that their application requires other than
what exists in the default vendor installation of the operating
system, or (2) the developer may use the framework interface to
automatically retrieve, configure, and temporarily install external
software in their routine's runtime environment.
Figure 5: Condor Builds & Tests - Each platform job is
a single build or test execution cycle on a computing resource; there
may be multiple platform jobs for a single framework build-and-test
routine. Sharp increases in the number of builds correspond to release
deadlines for developers.
Central Results Repository
A build-and-test system should capture all information and data
generated by routines and store it in a central repository. It is
important that the system allow users to easily retrieve the latest
version of applications and view the state of their builds and tests
[10]. The repository maintains each routine's provenance information and
historical data, which can be used for statistical analysis of builds
and tests.
The NMI framework stores the execution results, log files, and
output created by routines, as well as all input data, environment
information, and dependencies needed to reproduce the build or test.
While a routine executes, the NMI Build & Test software continuously
updates the central repository with the results of each procedure;
users do not need to wait for a routine to finish before viewing its
results. Any output files produced by builds or tests are
automatically transferred back to the central repository.
Fault Tolerance
The framework must be resilient to errors and faults from
arbitrary components in the system. This allows builds and tests to
continue to execute even when a database server goes down or network
connectivity is severed. If the NMI Build & Test software deployed on
a computing resource is unable to communicate with the submission
host, the routine executing on that resource continues unperturbed.
When the submission host is available again, all queued information is
sent back; routines never stop making forward progress because the
framework was unable to store the results.
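The store-and-forward behavior described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the NMI implementation: the class name `ResultForwarder` and the `send` callback are hypothetical, and `send` is assumed to raise `ConnectionError` when the submission host cannot be reached.

```python
from collections import deque
from typing import Callable, Deque


class ResultForwarder:
    """Buffers task results locally while the submission host is unreachable.

    Hypothetical sketch: `send` stands in for whatever transport reports
    results to the submission host; it raises ConnectionError on failure.
    """

    def __init__(self, send: Callable[[str], None]) -> None:
        self._send = send
        self._pending: Deque[str] = deque()

    def report(self, result: str) -> None:
        # Enqueue first, then attempt delivery; a failed delivery never
        # propagates to the routine, so the routine keeps running.
        self._pending.append(result)
        self.flush()

    def flush(self) -> int:
        """Try to deliver queued results in order; return how many were sent."""
        sent = 0
        while self._pending:
            try:
                self._send(self._pending[0])
            except ConnectionError:
                break  # host still unreachable; keep results queued
            self._pending.popleft()
            sent += 1
        return sent

    @property
    def queued(self) -> int:
        return len(self._pending)
```

A result is removed from the queue only after `send` returns, so an outage delays reporting but does not drop results.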
The framework also uses leases to track an active routine in the
system. If the framework software is unable to communicate with a
resource executing a routine, the routine is not restarted on another
machine until its lease expires. Thus, there are never duplicate
routines executing at the same time.
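The lease rule can be illustrated with a small sketch. The names `Lease` and `may_restart` are hypothetical, and times are plain numbers supplied by the caller; the sketch covers only the submission side's decision about when a restart is allowed.

```python
from dataclasses import dataclass


@dataclass
class Lease:
    """Hypothetical lease record for a routine running on a remote resource."""
    routine_id: str
    expires_at: float  # seconds on the tracker's clock

    def renew(self, now: float, duration: float) -> None:
        # Called whenever the resource executing the routine is heard from.
        self.expires_at = now + duration


def may_restart(lease: Lease, now: float) -> bool:
    """A routine may be restarted elsewhere only after its lease has expired."""
    return now >= lease.expires_at
```

While contact is lost, `renew` is never called, so `may_restart` stays false until the last granted lease runs out.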
Platform-Independent vs. Specific
For multi-platform applications, users should be able to define
platform-independent tasks that are only executed once per routine
submission. This improves the overall throughput of a build-and-test
pool. For example, an application's documentation only needs to be
generated once for all platforms.
Build/Test Separation
The output of a successful build can be used as the input to
another build, or to a future test. Thus, users are able to break
distinct operations into smaller steps and decouple build products
from testing targets. As described above, the framework archives the
results of every build and test. When these cached results are needed
by another routine as an input, the framework automatically transfers
the results and deploys them on the computing resource at run time.
NMI Software
We developed the NMI Build & Test Laboratory's continuous
integration framework software based on the design principles
described in the previous section. The primary focus of our framework
is to enable software to be built and tested in a distributed batch
computing environment. Our software provides a command-line execution
mechanism that can be triggered by any arbitrary entity, such as the
UNIX cron daemon or a source code repository monitor, or by users when
they need to quickly build their software before committing changes
[10]. We believe that it is important for the framework to accommodate
diverse projects' existing development practices, rather than force
the adoption of a small set of software.
The NMI framework allows users to submit builds and tests for an
application on multiple resources from a single location. We use a
batch system to provide all the network and workload management
functionality. The batch system is installed on every machine in a
build-and-test pool, but the NMI software is only installed on the
submission hosts. The framework stores all information about executing
routines in a central database. The output from routines is returned
to the submission hosts, which can store them on either a shared
network storage system or an independent file system.
A build-and-test routine is composed of a set of glue
scripts and a specification file containing information
about how an application is built or tested. The glue scripts are
user-provided, application-specific tasks that automate parts of the
build-and-test process. These scripts together contain the steps
needed to configure, compile, or deploy an application inside of the
framework. The specification file tells the framework when to execute
these glue scripts, which platforms to execute them on, how to
retrieve input data, and what external software dependencies exist.
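To make the contents of a specification file concrete, the information listed above could be held in a structure like the one below. This is only a sketch of the kinds of entries involved: it does not reproduce the framework's declarative syntax, and every field name is an assumption made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class RoutineSpec:
    """Hypothetical in-memory form of a specification file."""
    inputs: List[str]      # where input data is retrieved from
    platforms: List[str]   # target platforms
    prereqs: List[str]     # external software dependencies
    # stage name -> glue script to run in that stage
    glue_scripts: Dict[str, str] = field(default_factory=dict)

    def missing_fields(self) -> List[str]:
        """Return the names of required entries that were left empty."""
        required = {"inputs": self.inputs, "platforms": self.platforms}
        return [name for name, value in required.items() if not value]
```

Because every input, platform, and dependency is written down explicitly, a stored copy of such a record is enough to describe what a past build or test needed.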
Workflow Stages
The execution steps of a framework submission are divided into
four stages: fetch, pre-processing, platform, and post-processing (see
Figure 1). The tasks in the pre- and post-processing stages can be
distributed on multiple machines to balance the workload. A routine's
results and output are automatically transferred to and stored on the
machine that it was submitted from.
- Fetch:
In this stage, the framework retrieves all the input data needed to
build or test an application. Instead of writing custom scripts, users
declare where and how files are retrieved using templates provided by
the framework. Input data may come from multiple sources, including
source code repositories (cvs, svn), file servers (http, ftp), and the
output results from previous builds. Thus, input templates document
the provenance of all inputs and help ensure the repeatability of
routines.
- Pre-processing:
This optional stage prepares the build-and-test routine for execution
on computing resources. These tasks are often used to process the
input data collected in the previous stage. The platform-independent
tasks execute first and may modify the input data for all platforms.
The framework then makes separate copies of the potentially modified
input data for each platform and executes the platform-specific tasks.
Any modifications made to the input data by the platform-specific
tasks are only reflected in that platform's copy.
Figure 1: Workflow Stages - The steps to build or test
an application in the NMI framework are divided into four stages. The
fetch stage is executed on the machine from which the user submitted
the routine. The pre- and post-processing stages execute on any resource.
The remote platform tasks each execute on the appropriate
platform.
- Remote platform:
After the input data is retrieved and processed, the framework submits
one job for each target platform to the batch system. These jobs spawn
the remote platform tasks to build or test an application on an
appropriate compute resource. The NMI framework tells the batch system
which input files to transfer to the resource along with a copy of the
remote NMI framework software and the platform task glue scripts.
Before these scripts begin to execute, the NMI software prepares the
working directory for the routine and binds the execution environment
paths to the local configuration of the machine. When each task
finishes, any output produced can be sent back to the submission host
for storage.
- Post-processing:
This stage contains tasks that process the output data produced by
routines executing on build-and-test resources. As the platform tasks
complete for each platform, the framework executes the platform-specific
scripts for the corresponding set of results. Once these
tasks are completed for all the platforms, the platform-independent
scripts are then executed.
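The ordering constraints among the four stages can be written down as a dependency graph. The sketch below is our reading of the stage descriptions above, not code from the framework; the task labels (`fetch`, `pre_all`, `post_all`, and the per-platform names) are hypothetical.

```python
from typing import Dict, List, Set


def build_task_graph(platforms: List[str]) -> Dict[str, Set[str]]:
    """Map each task name to the set of tasks it depends on.

    Task names are hypothetical labels for illustration only.
    """
    graph: Dict[str, Set[str]] = {
        "fetch": set(),
        "pre_all": {"fetch"},   # platform-independent pre-processing
        "post_all": set(),      # platform-independent post-processing
    }
    for p in platforms:
        graph[f"pre_{p}"] = {"pre_all"}         # works on that platform's copy
        graph[f"platform_{p}"] = {f"pre_{p}"}   # remote build or test job
        graph[f"post_{p}"] = {f"platform_{p}"}  # runs as each platform completes
        graph["post_all"].add(f"post_{p}")      # waits for every platform
    return graph
```

Note the asymmetry: platform-independent tasks run before the per-platform tasks in pre-processing, but after them in post-processing.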
Workflow Manager
Using a distributed batch system to coordinate the execution of
jobs running on the build-and-test machines provides the NMI framework
with the robustness and reliability needed in a distributed computing
environment.
We use the Directed Acyclic Graph Manager (DAGMan) to automate and
control jobs submitted to the batch system by the NMI Build & Test
software [5, 25]. DAGMan is a meta-scheduler service for executing
multiple jobs in a batch system with dependencies in a declarative
form; it monitors and schedules the jobs in a workflow. These
workflows are expressed as directed graphs where each node of the
graph denotes an atomic task and the directed edge indicates a
dependency relationship between two adjacent nodes.
When a routine is submitted to the framework, its specification
file is transformed into an execution graph. A single instance of
DAGMan with this graph as its input is submitted to the batch system.
DAGMan can then submit new jobs to the batch system using a standard
application interface. As each of its spawned jobs completes, DAGMan is
notified and can deploy additional jobs based on the dependencies in
the graph.
DAGMan also provides the NMI Build & Test software with fault-tolerance.
It is able to checkpoint a workflow much like a batch
system is able to checkpoint a job. If the batch system fails and must
be restarted, the workflow is restarted automatically and DAGMan only
executes tasks that have not already completed.
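The resume behavior can be sketched in a few lines. This is not DAGMan's actual mechanism; it is a minimal illustration that uses Python's standard `graphlib` module (Python 3.9 or later) to order tasks, with a hypothetical `completed` set standing in for a persisted checkpoint.

```python
from graphlib import TopologicalSorter
from typing import Callable, Dict, List, Set


def run_workflow(
    graph: Dict[str, Set[str]],
    run_task: Callable[[str], None],
    completed: Set[str],
) -> List[str]:
    """Run tasks in dependency order, skipping those already completed.

    `graph` maps each task to its prerequisites. `completed` plays the role
    of a persisted checkpoint and is updated as tasks finish.
    Returns the tasks executed during this call.
    """
    executed: List[str] = []
    for task in TopologicalSorter(graph).static_order():
        if task in completed:
            continue  # finished before the restart; do not repeat
        run_task(task)
        completed.add(task)
        executed.append(task)
    return executed
```

If `run_task` raises partway through, `completed` still records the tasks that finished, so a second call with the same set picks up where the first left off.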
Glue Scripts
A routine's glue scripts contain the procedures needed to build or
test an application using the NMI framework. These scripts automate
the typical human-operated steps so that builds and tests require no
human intervention. Build glue scripts typically include configure,
compile, and package steps. Test glue scripts can deploy additional
services or sub-systems at runtime for thorough testing and can use
any testing harness or framework.
The framework provides a glue script with information about the
progress of its routine through pre-defined environment variables.
Thus, the scripts can control a routine's execution workflow while
they are running on a build-and-test resource. For example, a build
glue script might halt execution if a dependency failed to compile in
a previous step. Optionally, a test glue script may choose to continue
even if the previous test case failed.
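A glue script's use of this information might look like the following sketch. The variable name `PREVIOUS_TASK_STATUS` and the convention that "0" means success are assumptions made for illustration; the variables the framework actually defines are not listed in this section.

```python
from typing import Mapping


def should_run(env: Mapping[str, str], continue_on_failure: bool) -> bool:
    """Decide whether this step should run, given the previous step's status.

    PREVIOUS_TASK_STATUS is a hypothetical variable name; "0" means success.
    A missing variable is treated as success (there was no previous step).
    """
    previous_ok = env.get("PREVIOUS_TASK_STATUS", "0") == "0"
    return previous_ok or continue_on_failure
```

A build step would call `should_run(os.environ, continue_on_failure=False)` and exit when it returns false, while a test step that tolerates earlier failures would pass `continue_on_failure=True`.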
Application Interfaces
The NMI framework provides a standard interface for submitting and
managing routines in a build-and-test system. This interface can
easily be augmented by other clients or notification paradigms. For
example, our framework distribution includes a web interface that
provides an up-to-date overview of the system (Figure 2).
Figure 2(a): Routine status
Figure 2(b): Computing resource information
Figure 2: NMI Framework Web Interface - The NMI Build & Test software
provides a web client for users to view information about their build-and-test
system. The screenshot in Figure 2(a) shows status
information about a routine submitted to the framework; users can
monitor the progress of tasks, download output files, and view log
files. The screenshot in Figure 2(b) shows the capabilities of a
machine, lists all prerequisite software installed, and provides
information about the routines currently executing on it.
Batch System
We designed the NMI framework to run on top of the Condor high-throughput
distributed computing batch system [15, 25]. When a user
submits a build-and-test routine, the framework software deploys a
single DAGMan job into Condor (Figure 3). This DAGMan job then spawns
multiple Condor jobs for each platform targeted by the routine. Condor
ensures that these jobs are reliably executed on computing resources
that satisfy the explicit requirements of the routine.
Figure 3: NMI Framework Architecture - The user submits
a new routine comprised of glue scripts, input data, and a workflow
specification file. The NMI software uses this information to create a
dependency execution graph and submits a DAGMan job to the Condor
batch system. When the DAGMan job begins to execute, it deploys
multiple Condor jobs to the build-and-test computing resources. All
output data produced by the routine's jobs are stored in a central
repository and retrieved through ancillary clients.
Features
Condor provides many features that are necessary for a distributed
continuous integration system like the NMI framework [24]. It would be
possible to deploy the framework using a different batch system if the
system implemented capabilities similar to the following found in
Condor.
- Matchmaking:
Condor uses a central negotiator for planning and scheduling jobs for
execution in a pool. Each machine provides the negotiator with a list
of its capabilities, system properties, pre-installed software, and
current activity. Jobs waiting for execution also advertise their
requirements that correspond to the information provided by the
machines. After Condor collects this information from both parties,
the negotiator pairs jobs with resources that mutually satisfy each
other's requirements. The matched job and resource communicate
directly with each other to negotiate further terms, and then the job
is transferred by Condor to the machine for execution. The framework
will warn users if they submit a build or test with a requirement that
cannot be satisfied by any machine in the pool.
- Fault tolerance:
The failure of a single component in a Condor pool only affects those
processes that deal directly with it. If a computing resource crashes
while executing a build-and-test routine, Condor can either migrate
the job to another machine or restart it when the resource returns.
Condor uses a transient lease mechanism to ensure only a single
instance of a job exists in a pool at any one time. If a computing
resource is unable to communicate with the central negotiator when a
job finishes execution, Condor transfers back the retained results
once network connectivity is restored.
- Grid resource access:
Condor enables users to access computing resources in other pools
outside of their local domain. Condor can submit jobs to grid resource
middleware systems to allow builds and tests to execute on remote
machines that may or may not be running Condor [11].
- Resource control:
A long-standing philosophy of the Condor system is that the resource
owner must always be in control of their resource, and set the terms
of its use. Owners that are inconvenienced by sharing their resources
are less likely to continue participation in a distributed build-and-test
pool. Condor provides flexible policy expressions that allow
administrators to control which users can access resources, set
preferences for certain routines over others, and limit when users are
allowed to execute builds and tests.
- Authentication:
Condor supports several authentication methods for controlling access
to remote computing resources, including GSI [9], Kerberos [23], and
Microsoft's SSPI [1].
- File transfer:
The NMI framework uses Condor's built-in file transfer protocol to
send data between submission hosts and build-and-test resources. This
robust mechanism ensures that files are reliably transferred;
transfers are automatically restarted upon connection failure or file
corruption. Condor can also use a number of encryption methods to
securely transfer files without a shared file system.
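Of the features above, matchmaking is the easiest to illustrate in a few lines. The sketch below is a simplified model in plain Python and does not use Condor's own advertisement language; the types `Machine` and `Job` and the functions `match` and `satisfiable` are hypothetical names. It shows the two-sided check, the decision to defer a job when every compatible machine is busy, and the test behind a warning for requirements that no machine in the pool can satisfy.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Optional

Attrs = Dict[str, Any]


def accept_any(job_attrs: Attrs) -> bool:
    """Default owner policy: accept every job."""
    return True


@dataclass
class Machine:
    name: str
    attrs: Attrs                                      # advertised properties
    accepts: Callable[[Attrs], bool] = accept_any     # owner's policy
    busy: bool = False


@dataclass
class Job:
    attrs: Attrs
    requires: Callable[[Attrs], bool]                 # job's requirements


def compatible(job: Job, m: Machine) -> bool:
    # Both sides must be satisfied with each other.
    return job.requires(m.attrs) and m.accepts(job.attrs)


def match(job: Job, pool: List[Machine]) -> Optional[Machine]:
    """Return an idle compatible machine, or None to defer the job."""
    for m in pool:
        if not m.busy and compatible(job, m):
            return m
    return None


def satisfiable(job: Job, pool: List[Machine]) -> bool:
    """False means no machine in the pool could ever run the job."""
    return any(compatible(job, m) for m in pool)
```

Keeping `match` and `satisfiable` separate distinguishes "wait for a machine to become free" from "tell the user the requirement cannot be met".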
Pool Configuration
Condor is designed to balance the needs and interests of resource
owners, users wanting to execute jobs, and system administrators. In
this spirit, Condor enables administrators to deploy and manage build-and-test
pools that respect the wishes of resource owners but can
still provide access for users. Priority schemes for both dedicated
and non-dedicated resources can be created using Condor's flexible
resource policy expressions. For example, the dedicated resources in a
pool may prefer to execute processor-intensive builds and high-load
stress tests so that shorter tests can be scheduled on idle
workstations. Preferential job priority may also be granted to
specific users and groups at different times based on deadlines and
release schedules.
Condor can also further divide the resources of individual build-and-test
machines, similar to the policies for the entire pool. Condor
can allocate a multi-processor machine's resources disproportionately
for each processor. For example, in one configuration a processor can
be dedicated for build routines and therefore is allocated a larger
portion of the system's memory. Test routines are only allowed to
execute on the processor with more memory when no other jobs are
waiting for execution. If a build is submitted while a test job is
executing on this processor, Condor automatically evicts the test job
and restarts it at a later time.
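The policy for the build-preferred processor can be summarized as a small decision function. This is plain Python written for illustration, not Condor policy expression syntax, and the function and parameter names are hypothetical.

```python
from typing import Optional, Tuple


def schedule_on_build_slot(
    running: Optional[str], incoming: str, others_waiting: bool
) -> Tuple[bool, bool]:
    """Return (start_incoming, evict_running) for the build-preferred slot.

    Job kinds are the strings "build" and "test"; names are illustrative.
    """
    if incoming == "build":
        if running is None:
            return True, False
        if running == "test":
            return True, True    # evict the test; it is restarted later
        return False, False      # another build is running; wait
    # Incoming job is a test: allowed only on an idle slot with nothing waiting.
    return (running is None and not others_waiting), False
```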
Build-and-test pools often have periods where there are no new
routines available for execution. If a computing resource is idle for
a certain length of time, Condor can trigger a special task in the
framework that performs continuous tests against an application as
backfill. This is useful to perform long-term stress and random
input tests on an application [18]. The results from these tests are
reported back to the central repository periodically or whenever
Condor evicts the backfill job to run a regular build or test routine.
Pool & Resources Management
We now discuss our experiences in managing the NMI Build & Test
laboratory at the University of Wisconsin-Madison. The NMI framework
is also currently deployed and running in production at other
locations, including multi-national corporations and other academic
institutions.
Our facility currently maintains over 60 machines running a
variety of operating systems (see Table 1). Over a dozen projects,
representing many developers and institutions, use the NMI laboratory
for building and testing grid and distributed computing software. In
order to fully support the scientific community, we maintain multiple
versions of operating systems on different architectures. Machines are
not merely upgraded as newer versions of our supported platforms are
released. We must instead install new hardware and maintain support
for older platform combinations for as long as they are needed by users.
| Operating System | Versions | Archs | CPUs |
| Debian Linux | 1 | 1 | 2 |
| Fedora Core Linux | 4 | 2 | 20 |
| FreeBSD | 1 | 1 | 4 |
| HP HPUX | 1 | 1 | 3 |
| IBM AIX | 2 | 1 | 6 |
| Linux (Other) | 3 | 2 | 9 |
| Macintosh OS X | 2 | 2 | 8 |
| Microsoft Windows | 1 | 2 | 3 |
| OSF1 | 1 | 1 | 2 |
| Red Hat Linux | 3 | 2 | 13 |
| Red Hat Enterprise Linux | 2 | 3 | 19 |
| Scientific Linux | 3 | 2 | 11 |
| SGI Irix | 1 | 1 | 4 |
| Sun Solaris | 2 | 1 | 6 |
| SuSE Enterprise Linux | 3 | 3 | 15 |
Table 1: NMI Build & Test Laboratory Hardware - The
laboratory supports multiple versions of operating systems on a wide
variety of processor architectures.
Resource Configuration
We automate all persistent software installations and system
configuration changes on every machine in our build-and-test pool.
Anything that must be installed, configured, or changed after the
default vendor installation of the operating system is completely
scripted, and then performed using cfengine [4]. This includes
installing vendor patches and updates. Thus, new machine installations
can be added to the facility without requiring staff to rediscover or
repeat modifications that were made to previous instances of the
platform.
Prerequisite Software
In a multi-user build-and-test environment, projects often require
overlapping sets of external software and libraries for compilation
and testing. The NMI framework lets administrators offer prerequisite
software for routines in two ways: (1) the external software can be
pre-installed on each computing resource and published to the NMI
system, or (2) the system can maintain a cache of pre-compiled
software to be deployed dynamically when requested by a user. Dynamic
deployment is advantageous in environments where routines may execute
on resources outside of a single administrative domain and cannot
rely on a predictable set of prerequisite software being installed.
At the NMI Laboratory, we use cfengine to install a large set of
prerequisite software on each of our computing resources. This eases
the burden on new users whose builds expect a precise set of non-standard
tools but who are not prepared to supply those tools themselves.
The trade-off, however, is that these builds and tests are less
portable across administrative domains.
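The two ways of offering prerequisite software can be sketched as a single lookup: prefer what the resource has published as pre-installed, and fall back to the cache of pre-compiled packages. The function name, the package names, and the cache paths below are hypothetical and are not part of the NMI framework's interface.

```python
from typing import Dict, Set, Tuple


def resolve_prerequisites(
    requested: Set[str],
    preinstalled: Set[str],
    cache: Dict[str, str],
) -> Tuple[Set[str], Dict[str, str], Set[str]]:
    """Split requested prerequisites into local, to-deploy, and unavailable."""
    # (1) Software already installed on the resource and published to the system.
    local = requested & preinstalled
    # (2) Software that must be deployed dynamically from the pre-compiled cache.
    deploy = {name: cache[name] for name in requested - local if name in cache}
    # Anything left cannot be satisfied on this resource.
    missing = requested - local - set(deploy)
    return local, deploy, missing
```

A routine whose `missing` set is non-empty could not run on that resource, which is one way to picture why relying on pre-installed software makes builds less portable across administrative domains.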
Data Management
The NMI Laboratory produces approximately 150 GB of data per day.
To help manage the large amount of data generated by builds and tests,
the framework provides tools and options for administrators.
- Multiple submission points
More than one machine can be deployed as a submission host in a build-and-test
pool. By default, the output of a routine is archived on the
machine it is submitted from. The framework provides a built-in
mechanism to make these files accessible from any submission host
without requiring users to know which machine the data resides on. If
a user requests output files from a previous build on a different
submission host, the framework automatically transfers the files from
the correct location.
- Repository pruning
The framework provides mechanisms for removing older build and test
results from the repository based on flexible policies defined by the
lab administrator. When the framework is installed on a submission
host it deploys a special job into the batch system that periodically
removes files based on the administrator's policy. Routines may be
pruned based on file size, submission date, or other more complicated
properties, such as duplicate failures. This process will only remove
user-specified results; task output log files, error log files, and
input data are retained so that builds and tests are reproducible.
Users can set a routine's preservation time stamp to prevent their
files from being removed before a certain date.
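As an illustration of how a pruning policy like the one above might be evaluated, the sketch below selects routines whose results are eligible for removal by age or size while honoring a preservation time stamp. The record fields, thresholds, and function name are assumptions made for this example; the framework's real policy language is not shown in this paper.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class RoutineRecord:
    run_id: str
    submitted: datetime
    result_bytes: int
    preserve_until: Optional[datetime] = None  # user-set preservation time stamp


def select_for_pruning(
    records: List[RoutineRecord],
    now: datetime,
    max_age: timedelta,
    max_bytes: int,
) -> List[str]:
    """Return run ids whose result files may be removed under this policy.

    Only result files are candidates; logs and input data are out of scope
    here, matching the retention rule described in the text.
    """
    doomed = []
    for rec in records:
        if rec.preserve_until is not None and now < rec.preserve_until:
            continue  # the user asked to keep these files until a later date
        too_old = now - rec.submitted > max_age
        too_big = rec.result_bytes > max_bytes
        if too_old or too_big:
            doomed.append(rec.run_id)
    return doomed
```

More complicated properties, such as duplicate failures, would need additional fields on the record and are left out of this sketch.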
Case Studies
The NMI Laboratory is used as a build and test facility for two
large distributed computing research projects: the Globus Toolkit from
the Globus Alliance [8], and the Condor batch system from the
University of Wisconsin-Madison's Department of Computer Sciences
[15]. We present two brief case studies on how the NMI framework has
improved the software development process of each of these projects.
Globus Toolkit
The Globus Toolkit is an open source software distribution that
provides components for constructing large grid systems and
applications [8]. It enables users to share computing power and other
resources across administrative boundaries without sacrificing local
autonomy. Globus-based systems are deployed all across the world and
are the backbone of many large academic projects and collaborations.
Prior to switching to the NMI framework, the Globus system was
built and tested using a combination of custom scripts and the
Tinderbox open-source continuous integration system [20]. Each build
machine contained a pre-defined source file that mapped all the
external software needed by the build process to paths on the local
disk. This file contained the only record in the system of what
external software was used to execute a build or test, and did not
contain full information about the specific version used. If the
computing resource was updated to use a newer version of the software,
there was no record in the build system to reflect that fact.
As the project grew, developers received an increasing number of
bug reports from users. Many of these reports were for esoteric
platforms that were not readily available to the Globus developers.
As a result, fewer builds and tests were submitted to these machines,
which in turn caused bugs and errors to be discovered long after they
were introduced into the source code.
Now the Globus Toolkit is built and thoroughly tested every night
by the NMI Build & Test software on 10 different platforms (Figure 4).
The component glue scripts for Globus contain the same build
procedures that an end-user follows in order to compile the toolkit.
These procedures also include integrity checks that warn developers
when the build process generates files that are different from what
the system expected. All other regression and unit tests are performed
immediately after compilation. Globus' developers have benefited from
the NMI framework's strict attention to the set of software installed
on computing resources and its ability to maintain a consistent
execution environment for each build-and-test run. This allows them to
test backwards compatibility of their build procedures with older
versions of development tools, which they were unable to do before.
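An integrity check of the kind mentioned above could compare the files a build actually produced against a list of expected files. The following is a hedged sketch only: the function names and the idea of a flat manifest of relative paths are assumptions, and the Globus glue scripts may perform this check quite differently.

```python
from pathlib import Path
from typing import Dict, List, Set


def list_outputs(build_dir: Path) -> Set[str]:
    """Collect every regular file under build_dir as a relative POSIX path."""
    return {
        p.relative_to(build_dir).as_posix()
        for p in build_dir.rglob("*")
        if p.is_file()
    }


def check_manifest(expected: Set[str], actual: Set[str]) -> Dict[str, List[str]]:
    """Report files the build should have produced but did not, and vice versa."""
    return {
        "missing": sorted(expected - actual),
        "unexpected": sorted(actual - expected),
    }
```

A glue script could call `check_manifest(expected, list_outputs(Path(install_dir)))` after compilation, where `install_dir` is a hypothetical output location, and emit a warning when either list is non-empty.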
Figure 4: Globus Builds & Tests - The large spike in
the number of jobs in the graph indicates when a new version of Globus
was released and required many new build and test routines. Initially,
the toolkit's build-and-test procedures were contained in a monolithic
batch script. The tests were then later broken out of the build
scripts into separate tasks. Thus, no data exists on these tests that
were executed in the first months after switching to the NMI
system.
Condor
Before the advent of Linux's popularity, Condor supported a modest
number of operating systems used by the academic and corporate
communities. Initially, each developer was assigned a platform to
manually execute builds and given a paper checklist of tests to
perform whenever a new production release was needed. All of Condor's
build scripts contained hard-coded path information for each machine
that it was built on. If one of these machines needed to be rebuilt or
replaced, the administrator would have to construct the system to
exactly match the expected specification.
Like Globus, the Condor development team also deployed a Tinderbox
system to automate builds and tests on all the platforms that were
supported. Due to hardware and storage limitations, however, this
system could only build either the stable branch or the development
branch of Condor each day; developers had to make a decision on which
branch the system should build next. This also meant that the system
could not easily build custom branches or perform on-demand builds of
developers' workspaces.
Since transitioning to the NMI framework, the Condor project has
experienced a steady increase in the number of builds and tests
(Figure 5). The development team submits an automatic build and test
to the framework every night for both the stable and development
releases; Condor is built on 17 platforms with 122 unit and regression
tests per platform. In addition, the framework is used for numerous
on-demand builds of Condor submitted by individual developers and
researchers to test and debug experimental features and new platforms.
Figure 5: Condor Builds & Tests - Each platform job is
a single build or test execution cycle on a computing resource; there
may be multiple platform jobs for a single framework build-and-test
routine. Sharp increases in the number of builds correspond to release
deadlines for developers.
Future Work
Many facets of the NMI framework can be expanded to further
improve its capabilities.
Currently, the NMI framework coordinates builds and tests on
multiple platforms independently. Each routine executes on a single
computing resource for each specified platform. We are developing a
mechanism whereby a build-and-test routine can execute on multiple
machines in parallel and allow them to communicate with one another.
Users specify an arbitrary number of machines and the batch system
deploys the routine only when it has simultaneous access to all of the
resources it requires. The framework passes information to the glue
scripts about which machines are running the other parallel instances
of the routine. Such dynamic cross-machine testing will allow users to
easily test platform and version interoperability without maintaining
permanent ``target'' machines for testing.
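The all-or-nothing placement described above can be pictured with the sketch below: a routine is deployed only if every requested machine is available at the same time, and each instance is then told where its peers run. The function names, the platform strings, and the `NMI_SKETCH_*` variable names are invented for illustration and are not the mechanism under development.

```python
from typing import Dict, List, Optional


def try_coallocate(
    wanted: List[str],
    idle: Dict[str, List[str]],
) -> Optional[Dict[str, str]]:
    """Assign one idle host per requested platform, or None if any is lacking."""
    assignment: Dict[str, str] = {}
    taken: Dict[str, int] = {}
    for i, platform in enumerate(wanted):
        hosts = idle.get(platform, [])
        used = taken.get(platform, 0)
        if used >= len(hosts):
            return None  # all-or-nothing: do not start a partial routine
        assignment[f"node{i}"] = hosts[used]
        taken[platform] = used + 1
    return assignment


def peer_environment(assignment: Dict[str, str], me: str) -> Dict[str, str]:
    """Build the information a glue script would need about the other instances."""
    peers = [host for node, host in sorted(assignment.items()) if node != me]
    return {
        "NMI_SKETCH_SELF": assignment[me],
        "NMI_SKETCH_PEERS": ",".join(peers),
    }
```

Returning `None` rather than a partial assignment reflects the requirement that the batch system deploys the routine only when it has simultaneous access to all of the resources.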
We are also extending our test network into the Schooner [21]
system, based on Emulab [26], to expand these distributed tests to
cover a variety of network scenarios. Schooner permits users to
perform tests which include explicit network configurations. For
example, the NMI framework will be able to include automated tests of
how a distributed application performs in the presence of loss or
delay in the network. This system will also allow administrators to
rapidly deploy a variety of different operating system configurations
both on bare hardware and in virtual machines.
A major boon to the NMI framework will be the proliferation of
virtualization technology in more systems. Instead of deploying and
maintaining a specific computing resource for every supported
platform, the framework would keep a cache of virtual machine images
that would be dynamically deployed at a user's request. Because
administrators will only need to configure a single virtual machine
image for each operating system in the entire pool, this will simplify
build-and-test pool management and utilization. The framework would
then also be able to support application testing that requires
privileged system access or which makes irreversible alterations to
the system configuration; these changes would be localized to that
instance of the virtual operating system and not the cached image.
Availability
The NMI Build & Test Laboratory continuous integration framework
is available for download at our website under a BSD-like license:
https://nmi.cs.wisc.edu/.
Acknowledgments
This research is supported in part by NSF Grants No. ANI-0330634,
No. ANI-0330685, and No. ANI-0330670.
Conclusion
We have presented the NMI Build & Test Laboratory continuous
integration framework software. Our implementation is predicated on
design principles that we have established for distributed build-and-test
systems. The key features that distinguish our system are (1) its
ability to execute builds and tests on computing resources spanning
administrative boundaries, (2) its dynamic deployment on
heterogeneous resources, and (3) its balance between
continuous integration practices and on-demand access to builds and
tests. Our software uses the Condor batch system to provide the
capabilities necessary to operate in a distributed computing
environment. We discussed our experiences in managing a diverse,
heterogeneous build-and-test facility and showed how the NMI framework
functions as the primary build-and-test system for two large software
projects. From this, we believe that our system can be used to improve
the development process of software in a distributed computing
environment.
Author Biographies
Andrew Pavlo, Peter Couvares, Rebekah Gietzel, and Anatoly Karp
are members of the Condor research project at the University of
Wisconsin-Madison's Department of Computer Sciences. Ian D. Alderman
is a Ph.D. candidate at the University of Wisconsin-Madison's
Department of Computer Sciences. Miron Livny is a Professor with the
Department of Computer Sciences at the University of Wisconsin-Madison
and currently leads the Condor research project.
Charles Bacon is a researcher specializing in grid technology at
Argonne National Laboratory.
Bibliography
[1] The security support provider interface, White paper,
Microsoft Corp., Redmond, WA, 1999.
[2] Beck, K., Extreme programming explained: embrace
change, Addison-Wesley Longman Publishing Co., Inc., Boston, MA,
USA, 2000.
[3] Boehm, B. W. and P. N. Papaccio, ``Understanding and
controlling software costs,'' IEEE Transactions on Software
Engineering, Vol. 14, Num. 10, pp. 1462-1477, 1988.
[4] Burgess, M., ``A site configuration engine,'' USENIX
Computing Systems, Vol. 8, Num. 2, pp. 309-337, 1995.
[5] Couvares, P., T. Kosar, A. Roy, J. Weber, and K. Wenger,
Workflows for e-Science, Chapter: Workflow Management in
Condor, Springer-Verlag, 2006.
[6] CruiseControl, https://cruisecontrol.sourceforge.net.
[7] Fierro, D., Process automation solutions for software
development: The BuildForge solution, White paper, BuildForge,
Inc., Austin, TX, March, 2006.
[8] Foster, I., and C. Kesselman, ``Globus: A metacomputing
infrastructure toolkit,'' The International Journal of
Supercomputer Applications and High Performance Computing, Vol.
11, Num. 2, pp. 115-128, Summer, 1997.
[9] Foster, I. T., C. Kesselman, G. Tsudik, and S. Tuecke, ``A
security architecture for computational grids,'' ACM Conference on
Computer and Communications Security, pp. 83-92, 1998.
[10] Fowler, M., Continuous integration, May, 2006,
https://www.martinfowler.com/articles/continuousIntegration.html.
[11] Frey, J., T. Tannenbaum, I. Foster, M. Livny, and S. Tuecke,
``Condor-G: A computation management agent for multi-institutional
grids,'' Cluster Computing, Vol. 5, pp. 237-246, 2002.
[12] Grenning, J., ``Launching extreme programming at a process-intensive
company,'' IEEE Software, Vol. 18, Num. 6, pp. 27-33,
2001.
[13] Hellesoy, A., Continuous integration server feature
matrix, May, 2006,
https://damagecontrol.codehaus.org/Continuous+Integration+Server+Feature+Matrix.
[14] Holck, J., and N. Jørgensen, ``Continuous integration
and quality assurance: A case study of two open source projects,''
Australian Journal of Information Systems, Num. 11/12, pp.
40-53, 2004.
[15] Litzkow, M., M. Livny, and M. Mutka, ``Condor - a hunter of
idle workstations,'' Proceedings of the 8th International
Conference of Distributed Computing Systems, June, 1988.
[16] Apache Maven, https://maven.apache.org.
[17] McConnell, S., ``Daily build and smoke test,'' IEEE
Software, Vol. 13, Num. 4, p. 144, 1996.
[18] Miller, B. P., L. Fredriksen, and B. So, ``An empirical
study of the reliability of UNIX utilities,'' Communications of the
Association for Computing Machinery, Vol. 33, Num. 12, pp. 32-44,
1990.
[19] Ousterhout, J., and J. Graham-Cumming, Scalable software
build accelerator: Faster, more accurate builds, White paper,
Electric Cloud, Inc., Mountain View, CA, February, 2006.
[20] Reis, C. R., and R. P. de Mattos Fortes, ``An overview of
the software engineering process and tools in the Mozilla Project,''
Workshop on Open Source Software Development, Newcastle, UK,
pp. 162-182, 2002.
[21] Schooner, https://www.schooner.wail.wisc.edu.
[22] Schuh, P., ``Recovery, redemption, and extreme
programming,'' IEEE Software, Vol. 18, Num. 6, pp. 34-41, 2001.
[23] Steiner, J. G., B. C. Neuman, and J. I. Schiller,
``Kerberos: An authentication service for open network systems,''
Proceedings of the USENIX Winter 1988 Technical Conference,
USENIX Association Berkeley, CA, pp. 191-202, 1988.
[24] Tannenbaum, T., D. Wright, K. Miller, and M. Livny, ``Condor
- a distributed job scheduler,'' Beowulf Cluster Computing with
Linux, T. Sterling, Ed., MIT Press, October, 2001.
[25] Thain, D., T. Tannenbaum, and M. Livny, ``Distributed
computing in practice: the Condor experience,'' Concurrency -
Practice and Experience, Vol. 17, Num. 2-4, pp. 323-356, 2005.
[26] White, B., J. Lepreau, L. Stoller, R. Ricci, S. Guruprasad,
M. Newbold, M. Hibler, C. Barb, and A. Joglekar, ``An integrated
experimental environment for distributed systems and networks,''
Proc. of the Fifth Symposium on Operating Systems Design and
Implementation, Boston, MA, pp. 255-270, USENIX Association, Dec.,
2002.