Signature domain

2 Signature domain

Figure 1: High-level architecture of signature computations

Signatures are a way to associate information with individual telephone numbers. For each phone number, a typical signature contains information about the characteristics of outgoing calls from that number and incoming calls to it. The outgoing data is often split into sub-categories based on the type of call (for example, toll-free, international, intra-state, and other). This section describes the process of computing signatures, discusses the representation of signature data, and presents an example signature.

We first define some basic terminology. Signatures maintain information about phone numbers rather than customers. A second database can be used to match signature data with customer information, but that process is outside of Hancock's domain. A telephone number (973 555 1212) contains an area code (973), an exchange (555), and a line number (1212). We often use the term exchange to mean the first six digits of a phone number (973 555) and the term line to mean the whole phone number. Hancock supports a few basic types for phone numbers, dates, and times: area_code_t, exchange_t, line_t, date_t, and time_t. It also supports a type for call records:

typedef struct {
   line_t origin;     
   line_t dialed;    
   date_t connectTime;
   time_t duration;
   char isIncomplete;
   char isIntl;
   char isTollFree;
   ...
} callRec_t;

Each call record contains two phone numbers: the originating number and the dialed number. Each record also stores the time the call was connected (connectTime) and the duration of the call in seconds (duration). Call records also contain boolean flags describing the type of call: isIncomplete, isIntl, isTollFree, etc.

2.1 Signature computation

We update signature data daily from the records of the calls made the previous day. Figure 1 shows a graphical depiction of the typical architecture we use for such computations. We first sort the call records by the originating phone number and then update the outgoing portion of the signature for each phone number that made a call. We then repeat this process, sorting the call records by the dialed number and updating the incoming portion of the signature for each phone number that received a call. This architecture reduces the I/O needed to process a signature by ensuring good locality for references to the stored signature data at the cost of a pair of external sorts.

2.2 Signature representation

Figure 2: Signature data representation

As mentioned earlier, the size of signature files limits the number of bytes we can keep for each phone number. As a result, we cannot keep exact information for each line. Instead, we need to approximate. Conceptually, in each signature we have two views for our data: a precise form called the signature view and an approximate form called the approximation view. We compute with the signature, but store the approximation. The choice of these two views is application specific, as is the method for converting between them. We call the process of converting from the signature to the approximation view freezing; the converse process, thawing. (See Figure 2.) Note that thaw(freeze(v)) does not necessarily equal v because freezing is often lossy.

Having two views for the data allows programmers to compute with the natural representation while saving disk space, but it comes at the cost of having to manage conversions between the two views.

2.3 Usage signature

This section describes the Usage signature, which we use as a running example. Usage approximates cumulative daily call duration for incoming calls, outgoing calls, and outgoing calls to toll-free numbers. Usage uses the same structure to track all three types of calls. In particular, the signature type for each is seconds; the approximation type, a bucket number between zero and fifteen. The buckets represent non-uniform ranges of durations. Bucket zero corresponds to very short calls and serves as the default value for lines with no recorded activity, while bucket fifteen corresponds to long calls. Thawing converts bucket numbers to seconds by associating a default duration with each bucket. Freezing identifies the bucket with the appropriate range of times.

There are two parts to the daily Usage computation. First, we accumulate the precise usage in seconds for a particular phone number for a particular type of call, and then we blend that data with the existing signature for that type of call. The code for blending is:

 blend(new, old) = new*lambda + old*(1-lambda)

Blending of this form is common in signature computations.

Pseudo-code for computing part of the Usage signature appears in Figure 3. For brevity, we include only the code for computing the outgoing portion of the signature for a single line, called origin in the pseudo-code. A detailed version of Usage that tracks some additional information can be found in Appendix A.

 
outgoingUsage(origin, calls)  
  int cumTollFree = 0;
  int cumOut = 0;

  uApprox = Get usage data for origin  
  uSig = Convert uApprox from buckets to seconds 

   for c in calls  do  
      if c.isTollFree  then 
        cumTollFree += c.duration
      else 
        cumOut += c.duration 
      endif  
   done 

  uSig.outTollFree = 
     blend(cumTollFree, uSig.outTollFree)
  uSig.out = blend(cumOut, uSig.out)

  uApprox = Convert uSig from seconds to buckets.
  Record new uApprox data for origin

Figure 3: Pseudo-code for computing part of the Usage signature