Our tracing data consists of paths from a test host towards a single host on a destination network. The list of possible destinations is obtained from the routing arbiter database [28]. This is a central registry of all assigned Internet addresses, including those used only privately. Each provides a target network address, such as 135.104.0.0/16.
We should also include networks announced in the core routing tables but not contained in the routing arbiter database. Preliminary analysis of these tables reveal that we miss approximately twenty percent of the networks. These omissions will be corrected when we start the multiple-source mapping described below.
We need to scan towards a particular host on the target network. It is not particularly important that the host actually be present. The network scanner randomly picks an IP number on each network that is likely to be in use. This random selection is biased based on a quick survey of commonly-used IP addresses (e.g., the most common last octet is 1 and lower numbers are more common). Essentially, we are performing a slow host scan over time until a responsive host is found.
If the trace reaches a host on the target network, the address is saved for future traces. More than half the traces end with silence (due to an invalid address or firewall) or an ICMP error reporting failure.
This technique only records an outgoing packet path. The incoming path is often different: many Internet routes are asymmetric, as ISP interconnect agreements often divert traffic through different connections. We do not know of a reliable way to discover return packet paths, but some ideas are discussed in section 7.
The path may vary between traces, or even individual probes, depending on outages, redundant links, reconfigurations, etc. This means the mapping program may occasionally `discover' paths that don't exist. Imagine a packet to Germany that is either routed through the United Kingdom or France at random, for example. As alternate packets travel through alternate paths, the mapping program will infer connections between the alternate paths that do not exist. We believe that load-balancing over large stretches of paths is rare, so the effect of them is limited. In terms of outages and routing changes, the number of routes changing during a scan should be relatively small in most cases.
The technique employed only discovers the IP path. Each link along this path may not actually represent a physical link. For example, if an ISP is running their backbone over ATM, then each link represents a virtual circuit that may travel through many ATM nodes. Depending on how the ATM network is configured, such an ISP's backbone may appear to be completely connected, even though it isn't physically true. From an IP standpoint, however, detecting the physical connectivity is extremely difficult.
The target, date, path data, and path completion codes are recorded in a simple text format, described in appendix A. The database is manipulated with traditional UNIX text tools and some simple additional programs.
Each day's database is compressed and stored permanently. Copies are available upon request. The latest Internet database is available daily online [23]. The compressed database is about 10-20MB: we periodically strip out old paths in order to keep its size down (Special snapshots of the database are taken before this, however).