1 Introduction
There are many families of programs whose members have a high degree
of commonality; such a family is called a domain. When the
commonality is inherently complex, a domain-specific programming
language may help domain programmers develop better software more
quickly by factoring the complexity into the language.
Recent examples of domain-specific languages and their domains include
Envision [SF97] for computer vision,
Fran [Ell97] for computer animation,
GAL [TMC97] for video card device drivers,
Mawl [ABB+97] for dynamic web servers, and
Teapot [CRL96, CDR+97] for writing coherence protocols.
In each case, the domain-specific language moved the burden of writing
intricate domain code from the programmer to the compiler (or interpreter).
We have designed and built a domain-specific language, called
Hancock, to handle the complexity that arises from the
scale of signature computations [CP98].
As with the earlier languages, Hancock programs are easier to write,
debug, read, and maintain than the equivalent programs written in
general-purpose programming languages.
Signatures are profiles of customers that are updated daily from
information collected on AT&T's network. They are used for a variety
of purposes, including fraud detection and marketing. For example,
AT&T uses signatures to track the typical calling pattern for each of
its customers. If customers far exceed their usual calling
levels, the fraud detection system raises alerts.
In contrast, a sudden drop in their calling levels signals that
they may have chosen another long-distance carrier, triggering a
marketing alert.
Cortes and Pregibon describe signatures and their uses in more detail
[CP98].
At a high level, the computation of a
signature is straightforward: process a list of call records and
update data in a file based on those records. Unfortunately,
performance requirements that arise from the volume of call
records and the amount of stored profile data complicate matters
substantially. Efficiently coping with the data requires a more
complex system architecture. Furthermore, the scale pressures programmers to
conserve instructions, which tends
to reduce program readability. The complexity of the architecture and
the code complicates testing, which is already difficult
because of long testing cycles.
The scale of signature computations arises from both the number of
calls and the number of customers. On a typical weekday, there are more
than 250 million calls made on AT&T's network. The call records that
are used for signature computations contain only 32 of the more than
200 bytes of data that are collected for each call. To get a sense of
the scale of this data, it takes about 15 minutes to read through one
weekday's call data (approx 6.5GB) and about two hours to compute a
typical signature on one processor of a 16 processor SGI Origin 2000.
A typical signature tracks several hundred million phone numbers. Even if
we store only two bytes of data per customer, the files are very large,
requiring a minimum of 500MB to hold the data. Such files also require
significant additional space to provide appropriate indexing.
This scale complicates the architecture of signature programs because
at such levels, I/O presents a significant bottleneck. To reduce this
bottleneck, signature programs must be structured to minimize I/O. In
particular, signature programs sort the call records that they process
to improve locality of reference to their signature files, and they
cache parts of these files to reduce the number of disk accesses.
Both techniques trade improved performance for increased code
complexity.
In addition to affecting the high-level architecture, the amount of
data complicates the low-level code structure.
Programmers writing signature programs have tended to respond to
performance pressures by avoiding function
calls and thereby writing less modular code.
The deeply nested code that
results is difficult to decipher, which is a serious problem because
it is often the only documentation of the signature it implements.
Another complication comes from the fact that the size of the
signature files limits the number of bytes we can store per phone
number. As a result, we cannot store exact information for each phone
number; instead we must approximate. Managing the translation between
the representation that we can afford to store and the representation
that we want to compute with complicates the code.
Hancock is a C-based domain-specific language that reduces the coding
effort of writing signatures and improves the clarity of the resulting
code by factoring into the language all the issues that relate
to scale.
In particular, Hancock programmers can focus on the
signature they want to compute rather than the scaffolding necessary
to compute it. The language includes constructs for describing the
overall system architecture, for specifying work that should be done
in response to events in the call stream, and for specifying the
representation of signature data. The Hancock
compiler uses these constructs to generate the boiler-plate
code that makes hand-written signatures difficult to maintain,
without sacrificing efficiency. The Hancock runtime system supports
external sorting and efficient access to signature data. Hancock
does not address directly the problem of testing; instead, it reduces
opportunities for bugs by relieving programmers of the burden of
writing intricate code.
This paper describes the design and implementation of
Hancock. Section 2 describes the domain in more detail
and gives an example signature. Sections
3-5 present Hancock's data and control
abstractions. We discuss the implementation of Hancock's compiler and
runtime system in Section 6, early experiences
with Hancock in Section 7, and our design process
in Section 8. We conclude in Section 9.