Check out the new USENIX Web site. Previous Next Contents

1   Introduction

There are many families of programs whose members have a high degree of commonality; such a family is called a domain. When the commonality is inherently complex, a domain-specific programming language may help domain programmers develop better software more quickly by factoring the complexity into the language. Recent examples of domain-specific languages and their domains include Envision [SF97] for computer vision, Fran [Ell97] for computer animation, GAL [TMC97] for video card device drivers, Mawl [ABB+97] for dynamic web servers, and Teapot [CRL96, CDR+97] for writing coherence protocols. In each case, the domain-specific language moved the burden of writing intricate domain code from the programmer to the compiler (or interpreter). We have designed and built a domain-specific language, called Hancock, to handle the complexity that arises from the scale of signature computations [CP98]. As with the earlier languages, Hancock programs are easier to write, debug, read, and maintain than the equivalent programs written in general-purpose programming languages.

Signatures are profiles of customers that are updated daily from information collected on AT&T's network. They are used for a variety of purposes, including fraud detection and marketing. For example, AT&T uses signatures to track the typical calling pattern for each of its customers. If customers far exceed their usual calling levels, the fraud detection system raises alerts. In contrast, a sudden drop in their calling levels signals that they may have chosen another long-distance carrier, triggering a marketing alert. Cortes and Pregibon describe signatures and their uses in more detail [CP98].

At a high level, the computation of a signature is straightforward: process a list of call records and update data in a file based on those records. Unfortunately, performance requirements that arise from the volume of call records and the amount of stored profile data complicate matters substantially. Efficiently coping with the data requires a more complex system architecture. Furthermore, the scale pressures programmers to conserve instructions, which tends to reduce program readability. The complexity of the architecture and the code complicates testing, which is already difficult because of long testing cycles.

The scale of signature computations arises from both the number of calls and the number of customers. On a typical weekday, there are more than 250 million calls made on AT&T's network. The call records that are used for signature computations contain only 32 of the more than 200 bytes of data that are collected for each call. To get a sense of the scale of this data, it takes about 15 minutes to read through one weekday's call data (approx 6.5GB) and about two hours to compute a typical signature on one processor of a 16 processor SGI Origin 2000. A typical signature tracks several hundred million phone numbers. Even if we store only two bytes of data per customer, the files are very large, requiring a minimum of 500MB to hold the data. Such files also require significant additional space to provide appropriate indexing.

This scale complicates the architecture of signature programs because at such levels, I/O presents a significant bottleneck. To reduce this bottleneck, signature programs must be structured to minimize I/O. In particular, signature programs sort the call records that they process to improve locality of reference to their signature files, and they cache parts of these files to reduce the number of disk accesses. Both techniques trade improved performance for increased code complexity.

In addition to affecting the high-level architecture, the amount of data complicates the low-level code structure. Programmers writing signature programs have tended to respond to performance pressures by avoiding function calls and thereby writing less modular code. The deeply nested code that results is difficult to decipher, which is a serious problem because it is often the only documentation of the signature it implements. Another complication comes from the fact that the size of the signature files limits the number of bytes we can store per phone number. As a result, we cannot store exact information for each phone number; instead we must approximate. Managing the translation between the representation that we can afford to store and the representation that we want to compute with complicates the code.

Hancock is a C-based domain-specific language that reduces the coding effort of writing signatures and improves the clarity of the resulting code by factoring into the language all the issues that relate to scale. In particular, Hancock programmers can focus on the signature they want to compute rather than the scaffolding necessary to compute it. The language includes constructs for describing the overall system architecture, for specifying work that should be done in response to events in the call stream, and for specifying the representation of signature data. The Hancock compiler uses these constructs to generate the boiler-plate code that makes hand-written signatures difficult to maintain, without sacrificing efficiency. The Hancock runtime system supports external sorting and efficient access to signature data. Hancock does not address directly the problem of testing; instead, it reduces opportunities for bugs by relieving programmers of the burden of writing intricate code.

This paper describes the design and implementation of Hancock. Section 2 describes the domain in more detail and gives an example signature. Sections 3-5 present Hancock's data and control abstractions. We discuss the implementation of Hancock's compiler and runtime system in Section 6, early experiences with Hancock in Section 7, and our design process in Section 8. We conclude in Section 9.


Previous Next Contents