Architecture

A latency service based on NCs exploits several properties of NCs that help satisfy the design goals from Section 2.

NCs achieve good accuracy on Internet topologies. Internet latencies violate the triangle inequality, which introduces embedding error, but these violations are not severe enough to prevent a metric embedding in practice. Previous work [4] has found a median relative error of $11\%$, which we confirmed on PlanetLab.
Non-existent measurements between nodes are interpolated by the network embedding, thus reducing the measurement overhead. The trade-off between measurement overhead and accuracy is made explicit by NCs. The accuracy and convergence of NCs can be improved by increasing the measurement frequency and extending the neighbor set.
NCs provide almost instantaneous latency predictions because they do not initiate new latency measurements in response to latency queries. Active measurement approaches, such as Meridian [20], may introduce a non-trivial delay while a fresh latency estimate is being obtained.
The decentralized algorithm for computing NCs makes the implementation scalable to a large number of nodes. We have successfully deployed a NC service on over PlanetLab nodes.
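The first property can be made concrete with a small sketch. It assumes plain Euclidean coordinates and defines relative error as |predicted - measured| / measured; both are assumptions made for illustration, and the class and method names are hypothetical rather than part of the deployed service.

```java
/** Illustrative helpers only; names and the error definition are assumptions. */
final class EmbeddingSketch {
    private EmbeddingSketch() {}

    /** Predicted latency (ms) between two nodes: the Euclidean distance of their NCs. */
    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    /** Relative error of one prediction against one measured latency (ms, must be > 0). */
    static double relativeError(double predictedMs, double measuredMs) {
        return Math.abs(predictedMs - measuredMs) / measuredMs;
    }

    /** True if the direct path A-C is slower than the detour A-B-C. */
    static boolean violatesTriangle(double abMs, double bcMs, double acMs) {
        return acMs > abMs + bcMs;
    }
}
```

For example, measured latencies of 10 ms (A to B), 10 ms (B to C), and 25 ms (A to C) violate the triangle inequality. Distances in a metric space cannot reproduce all three values exactly, so at least one prediction carries a non-zero relative error.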

To achieve simple application integration, we propose two different architectures: a stand-alone NC service and a per-application NC library. Both approaches have the advantage that they provide applications with a correct implementation of NCs. As will be explained in Section 4, the application programmer does not have to deal with the complexity of latency measurement.

Network Coordinate Service. If the network infrastructure is cooperative and under the control of a single authority, such as PlanetLab, an efficient solution is to deploy a NC service on all the nodes. Each application then accesses the locally running NC service. This has the advantage that the cost of inter-node measurements is amortized across all applications that share the service. A drawback of this approach is that parameters such as the measurement frequency, which determines the convergence of the NCs, must be set globally for all applications.

    double   estimateLat (double[] remoteNC)
    double[] getNC ()
    double   getConfidence ()
    double   getRelError ()
    double   forceUpdate (IPAddr remoteNode)

Above we show the API of the latency service that is part of our SBON deployment [17] on PlanetLab. The function estimateLat returns the latency estimate between the local node and a remote node, given the remote node's NC. The local NC and confidence are returned by the getNC and getConfidence calls, respectively. A call to getRelError returns the current median relative error over the most recent latency measurements that were used for coordinate updates. If the application needs an up-to-date latency to a remote node, a call to forceUpdate causes the NC service to perform a measurement to the remote node and return the observed latency. This API assumes that nodes in a distributed application are identified by an IP address and NC pair, (IPAddr,NC). As a result, any node can obtain a latency estimate to any other node about which it has learned.
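As a rough illustration of how these five calls fit together, the sketch below writes them as a Java interface with a toy in-memory implementation. Several parts are assumptions rather than details of the deployed service: String stands in for IPAddr, estimateLat is a plain Euclidean distance, and the window size, the probe hook, and the recordSample helper are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.ToDoubleFunction;

/** The five calls from the listing above; String stands in for IPAddr. */
interface LatencyService {
    double estimateLat(double[] remoteNC);
    double[] getNC();
    double getConfidence();
    double getRelError();
    double forceUpdate(String remoteNode);
}

/** Toy stand-in: fixed coordinate, pluggable probe, sliding window of errors. */
class ToyLatencyService implements LatencyService {
    private final double[] localNC;
    private final double confidence;
    private final int window;                        // hypothetical window size
    private final ToDoubleFunction<String> probe;    // hypothetical measurement hook
    private final Deque<Double> relErrors = new ArrayDeque<>();

    ToyLatencyService(double[] localNC, double confidence, int window,
                      ToDoubleFunction<String> probe) {
        this.localNC = localNC.clone();
        this.confidence = confidence;
        this.window = window;
        this.probe = probe;
    }

    /** Answered from coordinates alone; no network traffic is generated. */
    public double estimateLat(double[] remoteNC) {
        double sum = 0.0;
        for (int i = 0; i < localNC.length; i++) {
            double d = localNC[i] - remoteNC[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    public double[] getNC() { return localNC.clone(); }

    public double getConfidence() { return confidence; }

    /** Median relative error over the samples currently in the window. */
    public double getRelError() {
        if (relErrors.isEmpty()) return Double.NaN;
        double[] sorted = relErrors.stream().mapToDouble(Double::doubleValue).sorted().toArray();
        int mid = sorted.length / 2;
        return sorted.length % 2 == 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2.0;
    }

    /** The only call that triggers a fresh measurement. */
    public double forceUpdate(String remoteNode) {
        return probe.applyAsDouble(remoteNode);
    }

    /** Hypothetical helper: record how well the NC predicted one measured latency (ms). */
    void recordSample(double[] remoteNC, double measuredMs) {
        relErrors.addLast(Math.abs(estimateLat(remoteNC) - measuredMs) / measuredMs);
        if (relErrors.size() > window) relErrors.removeFirst();
    }
}
```

The split mirrors the trade-off described earlier: estimateLat answers immediately from coordinates, while forceUpdate pays the cost of a measurement in exchange for an up-to-date value.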

Network Coordinate Library. In some cases, an application should include a module for latency estimation without relying on an externally running service. This is true for peer-to-peer applications that are deployed on a varying set of heterogeneous nodes. To address this, we also propose a NC library that any application can link against to support NCs. In order to avoid duplicating functionality, the library handles only the computation of coordinates but leaves the actual network communication for network probing to the application. This enables the application to exploit application traffic as much as possible for measurements.

    void updateNC (IPAddr remoteNode, double[] remoteNC, double remoteConf, double latency)
    void forceUpdate (IPAddr remoteNode)

In addition to the functions provided by the stand-alone service, the NC library API has a function updateNC that is used by the application to feed in new network measurements from application-level traffic. Only if the application-level traffic is not frequent enough or does not cover a large enough set of nodes to compute an accurate NC does the library request additional latency measurements from the application. As will be explained in Section 4.2, the NC library monitors its relative error to decide if the NC is converging sufficiently. If this is not the case, it uses the forceUpdate callback to the application to request more diverse measurements by initiating a latency measurement to a new remote node.
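The interaction between updateNC and the forceUpdate callback can be sketched as follows. This section does not specify the coordinate update rule or the convergence test, so the sketch assumes a Vivaldi-style spring update and a simple smoothed relative error instead of the median. The constants, the addCandidate helper, and the use of String in place of IPAddr are all hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;

/** Callback implemented by the application; String stands in for IPAddr. */
interface ProbeCallback {
    void forceUpdate(String remoteNode);
}

class NCLibrarySketch {
    private static final double STEP = 0.25;        // hypothetical spring step size
    private static final double SMOOTHING = 0.25;   // hypothetical weight of the newest error sample
    private static final double ERROR_LIMIT = 0.5;  // hypothetical "not converging" threshold

    private final double[] nc;
    private final ProbeCallback app;
    private final Set<String> measured = new HashSet<>();
    private final Deque<String> candidates = new ArrayDeque<>();
    private double relError = 1.0;                  // smoothed relative error, starts pessimistic

    NCLibrarySketch(double[] initialNC, ProbeCallback app) {
        this.nc = initialNC.clone();
        this.app = app;
    }

    /** Hypothetical helper: the application tells the library which other nodes exist. */
    void addCandidate(String node) { candidates.addLast(node); }

    double estimateLat(double[] remoteNC) {
        double sum = 0.0;
        for (int i = 0; i < nc.length; i++) {
            double d = nc[i] - remoteNC[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    double getRelError() { return relError; }

    /** Feed in one latency sample (ms, > 0) that the application observed to remoteNode. */
    void updateNC(String remoteNode, double[] remoteNC, double remoteConf, double latency) {
        measured.add(remoteNode);
        double predicted = estimateLat(remoteNC);

        // Weight the sample by how uncertain we are relative to the remote node.
        double localErr = Math.min(1.0, relError);
        double remoteErr = 1.0 - remoteConf;
        double weight = (localErr + remoteErr > 0.0) ? localErr / (localErr + remoteErr) : 0.5;

        // Track how well the current coordinate predicted this sample.
        double sampleErr = Math.abs(predicted - latency) / latency;
        relError = SMOOTHING * sampleErr + (1.0 - SMOOTHING) * relError;

        // Move along the line through both coordinates: away if we underestimated, closer otherwise.
        double scale = STEP * weight * (latency - predicted);
        for (int i = 0; i < nc.length; i++) {
            double unit = predicted > 0.0 ? (nc[i] - remoteNC[i]) / predicted : (i == 0 ? 1.0 : 0.0);
            nc[i] += scale * unit;
        }

        // Application traffic alone is not enough: ask for a measurement to a new node.
        if (relError > ERROR_LIMIT) requestNewMeasurement();
    }

    private void requestNewMeasurement() {
        while (!candidates.isEmpty()) {
            String node = candidates.removeFirst();
            if (!measured.contains(node)) {
                app.forceUpdate(node);
                return;
            }
        }
    }
}
```

In this sketch the library never opens a socket itself. The application passes in samples from its own traffic and decides how to perform a probe when forceUpdate is invoked, which matches the division of responsibilities described above.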

Jonathan Ledlie 2005-10-18