WORLDS '05 Preliminary Paper
Using PlanetLab for Network Research:
Myths, Realities,
and Best Practices
Neil Spring, Larry Peterson, Andy Bavier, and Vivek Pai
1 Introduction
PlanetLab is a research testbed that supports 428
experiments on 583 nodes at 276 sites in 30 countries.
It has lowered the barrier to distributed experimentation in
network measurement, peer-to-peer networks, content
distribution, resource management, authentication,
distributed file systems, and many other areas.
PlanetLab did not become a useful network testbed overnight.
It started as little more than a group of Linux machines with a
common password file, which scaled poorly and suffered under load.
However, PlanetLab was conceived as an evolvable system under
the direction of a community of researchers.
With their help, PlanetLab version 3.0 has since corrected many
previous faults through
virtualization and substantial performance isolation.
This paper is meant to guide those considering developing a network
service or experiment on PlanetLab by separating widely-held myths
from the realities of service and experiment deployment.
Building and maintaining a testbed for the research
community taught us lessons that may shape its continued
evolution and may generalize beyond PlanetLab to other
systems. First,
users do not always search out “best practice” approaches:
they expect the straightforward approach to work. Second,
users rarely report failed attempts: we learned of the
perceived shortcomings described in this paper through conversations, not through
messages to the mailing lists. Third, frustration lingers:
users hesitate to give another chance to a system that was
recently inadequate or difficult to use. These experiences
are especially challenging for an evolvable system, which
relies on user feedback to add the features that will let it
support more users.
We organize the myths in decreasing order of veracity: those
that are realities in Section 2, those that were once
true in Section 3, and those that are false
when best practices are employed in Section 4. We
summarize the discussion in Section 5.
2 Realities
This section describes widely-cited criticisms of PlanetLab
that are entirely true, and are likely to remain so even as PlanetLab
evolves.
Reality: Results are not reproducible
PlanetLab was designed to subject network services to
real-world conditions, not to provide a controlled environment.
By running a service for months or years, researchers should
be able to identify trends and understand the
performance and reliability their service achieves. An
experiment that runs for an hour will reflect only the
conditions of the network (and PlanetLab) during that hour.
Various aspects of a service can be meaningfully
measured by applying simple rules-of-thumb. Avoid heavily-loaded
times and nodes: CoMon [5] tracks and publishes current
resource usage on each PlanetLab node. Secure more
resources for your experiment from a
brokerage service (see Section 3) if needed.
Repeat experiments to generate statistically valid results. Finally,
regard PlanetLab's ability to exercise a system in unintended ways, producing unexpected
results, as a feature, not a bug.
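As a concrete illustration of the first rule of thumb, the following sketch (in C) selects lightly-loaded nodes from a monitoring snapshot. It assumes a locally saved, simplified two-column file of host names and free-CPU percentages; the file name, format, and 20% threshold are illustrative assumptions, not CoMon's actual interface.

    /* Hypothetical sketch: choose lightly-loaded nodes from a pre-fetched
     * snapshot.  Assumes a simplified two-column file ("hostname cpu_free"),
     * which is NOT CoMon's real output format. */
    #include <stdio.h>

    int main(void)
    {
        FILE *f = fopen("nodes.txt", "r");   /* hypothetical snapshot file */
        char host[256];
        double cpu_free;

        if (!f) { perror("nodes.txt"); return 1; }
        while (fscanf(f, "%255s %lf", host, &cpu_free) == 2) {
            if (cpu_free >= 20.0)            /* illustrative threshold */
                printf("%s\n", host);        /* candidate node */
        }
        fclose(f);
        return 0;
    }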
Reality: The network between PlanetLab sites does not represent
the Internet
No testbed, no simulator [2], and no
emulator is inherently representative of the Internet. The
challenges for researchers are to develop experiments that
overcome this limitation, perhaps by recruiting real users
behind residential access networks, or, failing that, to
interpret results taking PlanetLab's special network into
account. The challenge for PlanetLab is to evolve so that
this limitation is less severe, seeking new sites and new
access links.
PlanetLab's network is dominated by the
global research and education network (GREN) [1]
(Internet2 in the United States). However, commercial sites
have joined PlanetLab and research sites have connected
machines to DSL and cable modem links: 26 sites are purely
on the commercial Internet. The question is, how does PlanetLab's
network connectivity affect research?
First, some experiments are suitable for
the GREN.
Claims that a new routing technique can
find better routes than BGP are suspect if those better
routes take advantage of well-provisioned research networks that
are not allowed by BGP policy.
However, claims that a service can find the best available
route might be accurate even on the GREN: results
obtained on the GREN are not necessarily tainted.
Second, services for off-PlanetLab users and network
measurement projects that send probes off-PlanetLab observe
the commercial Internet. Although most of PlanetLab is on
the GREN, most machines also connect to the commercial
network or are part of transit ASes. The PlanetFlow
auditing service [4] reports that PlanetLab
nodes communicate with an average of 565,000 unique IP
addresses each day. PlanetSeer [10], which monitors TCP connections between
CoDeeN nodes at PlanetLab sites and Web clients/servers
throughout the Internet,
observed traffic traversing 10,090 ASes, including all
tier-1 ISPs, 96% of the tier-2 ISPs, roughly 80% of the
tier-3 and 4 ISPs, and even 43% of the tier-5 ISPs.
Measurement services like Scriptroute [7] can
use the geographic diversity of vantage points provided by
PlanetLab to probe the Internet without being limited by the
network topology between PlanetLab nodes.
Finally, it is sometimes not the topology of the GREN, but
the availability of its very high bandwidths and low
contention that calls results into question.
Researchers can, however, limit the bandwidth their slices
consume to emulate a lower bandwidth link, via user-space
mechanisms (e.g., pacing the send rate) or by asking PlanetLab
support to lower the slice's outgoing bandwidth cap.
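A minimal sketch of user-space pacing, assuming a datagram socket the caller has already created and connected: sleep between sends so the long-run rate stays near a target. This illustrates one way to emulate a lower-bandwidth link from user space; it is not PlanetLab's own bandwidth cap mechanism.

    /* Minimal sketch: pace sends so that the long-run rate stays near a
     * target bit rate.  Assumes a connected datagram socket; oversleeping
     * on a loaded node only pushes the rate further below the target. */
    #include <stddef.h>
    #include <time.h>
    #include <sys/socket.h>

    static void send_paced(int sock, const char *buf, size_t len,
                           int count, double target_bps)
    {
        double gap_ns = (len * 8.0 / target_bps) * 1e9;  /* time per packet */
        struct timespec gap = {
            .tv_sec  = (time_t)(gap_ns / 1e9),
            .tv_nsec = (long)gap_ns % 1000000000L,
        };

        for (int i = 0; i < count; i++) {
            send(sock, buf, len, 0);   /* error handling omitted for brevity */
            nanosleep(&gap, NULL);     /* coarse pacing between packets */
        }
    }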
Reality: PlanetLab nodes are not representative of
peer-to-peer network nodes
Typically, this is a comment about the high-bandwidth
network (see above). Sometimes it means that PlanetLab is a
managed infrastructure and not subject to the same churn as
desktop systems.
Although PlanetLab is not equivalent to a set of desktop
machines—and it is not expected to scale to millions of
machines—it can contribute to P2P services. A “seed
deployment” on PlanetLab would show the value of a new
service and encourage end-users to load the service on
desktop machines. End System Multicast [3] instead
uses PlanetLab nodes as the “super nodes” of a P2P
network. PlanetLab can contribute a core of
stable, managed nodes to P2P systems.
Figure 1: Available CPU across PlanetLab nodes. Median percentage available CPU is red (upper),
25th percentile is green (middle), and 10th is blue (lower).
3 Myths that are no longer true
Some who tried to use early versions of PlanetLab found
challenges that are no longer so daunting because PlanetLab
has evolved.
Myth: PlanetLab is too heavily loaded
Although PlanetLab may always be under-provisioned and load
is especially high before conference deadlines, this
perception is misleading in two ways.
Figure 2: Median available CPU measurements using spin loops (blue, upper), load average (green, middle), and number of active slices (red, typically lowest).
First, upgrades to the OS allow nodes to better tolerate high CPU load,
memory consumption, and disk access load. CPU cycles are fairly
distributed among slices rather than threads: a slice with 100
threads receives the same CPU allocation as a slice with just
one. A daemon polices memory consumption, killing
slices that use too much when memory pressure is high; users
now take greater care in configuring programs that may have
a heavy memory footprint to avoid having them killed, which
in turn has reduced memory pressure for everyone. Finally,
an OS upgrade enables disk access via DMA,
rather than programmed I/O, improving performance
when the node is swapping.
Second, PlanetLab has two brokerage services, Sirius and
Bellagio, that perform admission control to a pool of
resources. Researchers can use these services to receive
more than a “fair share” of the CPU, for fixed periods of
time, during periods of heavy load.
CPU availability measurements.
An experiment begun in February 2005 supports the claim that PlanetLab
has sufficient CPU capacity. The experiment runs a spin-loop on each
PlanetLab node to sample the CPU available to a slice; because of PlanetLab's
fair share CPU scheduler, this measurement is more
accurate than standard techniques such as the load
metric reported by top. Figure 1 summarizes
seven months of CPU availability measurements.
The three lines are the median, 25th,
and 10th percentiles of the available CPU across all nodes.
The median line shows that most nodes had at least
20% available: a slice on a typical
PlanetLab node contends with three to five other slices that are
running processes non-stop. The 25th percentile line
generally stays above
10%, indicating that fewer than one-fourth of the nodes had less than
10% free. A slice can get nearly 10% of the CPU on almost any node.
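The sketch below illustrates one way such a spin-loop probe can estimate the CPU share a slice receives: busy-wait for a fixed wall-clock window and compare the CPU time the process was actually granted with the wall time that elapsed. The 10-second window and the use of getrusage() are our assumptions, not necessarily how the deployed experiment or CoMon implements the test.

    /* Sketch: estimate the CPU fraction available to this slice by spinning
     * for a fixed wall-clock window and comparing consumed CPU time against
     * elapsed wall time.  The window length is an illustrative choice. */
    #include <stdio.h>
    #include <sys/time.h>
    #include <sys/resource.h>

    static double tv_secs(struct timeval tv)
    {
        return tv.tv_sec + tv.tv_usec / 1e6;
    }

    int main(void)
    {
        const double window = 10.0;          /* seconds of wall-clock spinning */
        struct timeval start, now;
        struct rusage ru;

        gettimeofday(&start, NULL);
        do {
            gettimeofday(&now, NULL);        /* busy-wait: consume any CPU we get */
        } while (tv_secs(now) - tv_secs(start) < window);

        getrusage(RUSAGE_SELF, &ru);         /* CPU time actually granted to us */
        double cpu = tv_secs(ru.ru_utime) + tv_secs(ru.ru_stime);
        printf("available CPU: %.1f%%\n", 100.0 * cpu / window);
        return 0;
    }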
CPU time is also available immediately
before conference deadlines. For example, during the week
before the SIGCOMM deadline (February 1–8, 2005), 360
of the 362 running nodes (99%) had at least 10% available CPU,
averaged over the week; 328 of the 360 nodes (91%) had at least 20%
available. These results show somewhat higher availability than in
Figure 1.
Some projects may have
refrained from using PlanetLab to leave resources available
to those running last-minute experiments.
Estimates of available CPU using other metrics
are less accurate. In
Figure 2, we show the median capacities (a) measured
directly using spin loops, (b) estimated using the inverse of the load
average (a load of 100 equals 1% CPU availability), and (c) estimated
using the inverse of
the number of active slices (meaning slices with a runnable thread).
The top line, the spin-loop measured capacity, is significantly higher.
The Unix-reported load average is often misleading: the
processors did have high load (sometimes exceeding
100), but the CPU available to slices is much greater
because although slices that spawn many processes increase the load
average, their processes compete only against each other for CPU.
Likewise, not all active slices use their entire quanta and so the
active slice count overestimates contention. The CoMon monitoring
service now publishes the results of the spin-loop tests to help
users choose nodes by CPU availability.
Myth: PlanetLab cannot guarantee resources
Resource guarantees could not be given before version 3.0.
Schedulers are now available to make resource
guarantees, but PlanetLab does not yet have a policy about what slices
should receive them. Typically, continuously running
services on PlanetLab are robust to varying resource availability (and
have not asked for guarantees), while short-term experiments have the
option of using one of the brokerage services (see previous
item) to gain sufficient capacity for the duration of a
run. Once we have enough experience to understand what policies
should be associated with guarantees, or someone develops a robust market
in which users can acquire resources, resource guarantees are
likely to become commonplace.
4 Myths falsified by best practices
The following four myths about PlanetLab are not true if best
practices are followed. Often these myths stem from
mismatches between the behavior of a single, unloaded Linux
workstation and that of a highly-shared network of
PlanetLab-modified Linux nodes. The first three myths address
problems in using PlanetLab for network measurement; the last
addresses its potential for churn.
Myth: Load prevents accurate latency measurement
Because PlanetLab machines are loaded, no
application can expect that a call to gettimeofday()
right after recv() will return the time when the
packet was received by the machine. The PlanetLab kernel
scheduler (Section 3) can isolate slices so that none are starved of
CPU, but cannot ensure that any slice will be scheduled
immediately upon receiving a packet.
Using the in-kernel timestamping features of Linux, an application
can isolate network delay from (most) processing delay.
When a machine receives a packet, the network device sends
an interrupt to the processor so that the kernel can pull
the packet from the device's queue. At the point when Linux
accepts the packet from the device driver, it annotates the
buffer with the current time.1 The kernel will return
control to the current process for the remainder of its
quantum, but this timestamp is kept in the kernel and made
available in at least three ways:
- The SIOCGSTAMP ioctl, called after reading a packet.
Ping uses this ioctl, but Linux kernel
comments suggest the call is Linux-specific.
- The SO_TIMESTAMP socket option combined with
recvmsg(): the ancillary data includes a
timestamp (see the sketch following this list). The Spruce [8]
receiver code uses this method, which was introduced in BSD and is
supported by Linux. It is not widely documented, but can be used by a
non-root user.
- The library behind tcpdump, libpcap.
This may be the most portable method, but it requires
root, which is easy to obtain on PlanetLab.
Sent packets are also
timestamped [9].
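The following sketch illustrates the second method: it enables SO_TIMESTAMP on a socket and extracts the kernel's receive timestamp from the ancillary data returned by recvmsg(). Socket creation and error handling are left to the caller.

    /* Sketch: read a datagram along with the kernel's receive timestamp,
     * using the SO_TIMESTAMP socket option and recvmsg() ancillary data. */
    #include <string.h>
    #include <sys/socket.h>
    #include <sys/time.h>
    #include <sys/uio.h>

    /* Returns bytes received (or -1) and fills *stamp with the kernel's
     * receive timestamp for the packet. */
    static ssize_t recv_with_stamp(int sock, char *buf, size_t len,
                                   struct timeval *stamp)
    {
        int on = 1;
        char control[CMSG_SPACE(sizeof(struct timeval))];
        struct iovec iov = { .iov_base = buf, .iov_len = len };
        struct msghdr msg = {
            .msg_iov = &iov, .msg_iovlen = 1,
            .msg_control = control, .msg_controllen = sizeof(control),
        };

        setsockopt(sock, SOL_SOCKET, SO_TIMESTAMP, &on, sizeof(on));
        ssize_t n = recvmsg(sock, &msg, 0);
        if (n < 0)
            return -1;

        for (struct cmsghdr *c = CMSG_FIRSTHDR(&msg); c; c = CMSG_NXTHDR(&msg, c))
            if (c->cmsg_level == SOL_SOCKET && c->cmsg_type == SCM_TIMESTAMP)
                memcpy(stamp, CMSG_DATA(c), sizeof(*stamp));
        return n;
    }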
Figure 3: Approaches to packet
round-trip timing: applications can use gettimeofday
before sending and after receiving; closer to the device
are kernel-supplied timestamps applied as the packet is
queued for transmission or received. The driver and
hardware also may delay packets on transmission and
receipt.
Figure 4: A cumulative distribution of the
differences between application-level timestamps and
kernel-level timestamps when sending (left) or receiving
(right) in microseconds.
Do kernel timestamps matter?
To collect samples of application- and kernel-level
timestamps, we modified traceroute to print the timestamps
it collects via gettimeofday(), then ran traceroute
and tcpdump in parallel to gain kernel-level timestamps for
the same packets from 300 PlanetLab machines to three
destinations, collecting 40,000 samples for comparison.
Figure 3 illustrates where traceroute and
the kernel annotate timestamps.
In Figure 4, we show the differences
between application- and kernel-captured timestamps when
sending probes and receiving responses.
Although the time between gettimeofday() and when the
packet is delivered to the network device is typically small
(18 μs median, 84 μs mean), the time after the
packet is received is typically larger and more variable (77
μs median, 788 μs mean). The larger median may
represent the cost of the intermediate system calls: in traceroute, it is
select() that returns when the response packet is
received. However, the fact that 4% of samples exceed 1 ms
suggests contention with other active processes. Further,
only the smallest 3% of samples fall between 20 and 30 μs,
which suggests that tools that filter for the minimum round-trip
time, such as pathchar, will have difficulty: 97% of the packets
will not observe minimal delay in receive processing.
Measurement tools downloaded from research Web pages may not
use kernel-level techniques to measure packet timings; their
results should be viewed with skepticism until their methods
are understood.
Figure 5: Timing statistics for 1 ms (spin-based) chirp trains. The
green (upper) line indicates at least 5 consecutive gaps met the target
timings, while the blue (lower) line indicates all gaps met the target.
Figure 6: Timing statistics for 11 ms (sleep-based) chirp trains. The
green (upper) line indicates at least 5 consecutive gaps met the target
timings, while the blue (lower) line indicates all gaps met the target.
Myth: Load prevents sending precise packet trains
Sending packets at precise times, as needed by several tools that
measure available bandwidth, is more difficult. If the process
is willing to discard measurements where the desired sending
times were not achieved or when control of the processor is
lost, then sending rate-paced data on PlanetLab simply
requires more attempts than on unloaded systems.
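A minimal sketch of this discard-and-retry approach, assuming a connected datagram socket: spin-wait between sends to hit the target gap, record the application-level send time of each packet, and have the caller discard and resend the train if any gap strays beyond a tolerance (3% would be passed as 0.03). For larger gaps, such as the 11 ms case below, the spin could be replaced by nanosleep().

    /* Sketch: send an 11-packet train with spin-wait pacing, recording the
     * send time of each packet so the caller can discard trains whose gaps
     * missed the target by more than the given tolerance. */
    #include <sys/socket.h>
    #include <sys/time.h>

    #define TRAIN_LEN 11

    static double now_us(void)
    {
        struct timeval tv;
        gettimeofday(&tv, NULL);
        return tv.tv_sec * 1e6 + tv.tv_usec;
    }

    /* Returns 1 if every gap was within tol (a fraction) of gap_us, else 0. */
    static int send_train(int sock, const char *pkt, size_t len,
                          double gap_us, double tol)
    {
        double sent[TRAIN_LEN];

        for (int i = 0; i < TRAIN_LEN; i++) {
            if (i > 0)
                while (now_us() - sent[i - 1] < gap_us)
                    ;                         /* spin until the next send slot */
            send(sock, pkt, len, 0);          /* error handling omitted */
            sent[i] = now_us();
        }
        for (int i = 1; i < TRAIN_LEN; i++) {
            double gap = sent[i] - sent[i - 1];
            if (gap < gap_us * (1 - tol) || gap > gap_us * (1 + tol))
                return 0;                     /* caller should retry this train */
        }
        return 1;
    }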
To determine how CPU load impairs precise sending,
we measure how often we can send precisely-spaced
packets in a train. Sent trains consist of eleven packets, spaced either
by 1 ms, to test spin-waiting, or 11 ms, to test sleep-based waiting using
the nanosleep() system call (via the usleep() library call).
We show how often the desired gaps were achieved for 1 ms gaps in Figure 5 and
11 ms gaps in Figure 6.
In all measurements, 10 gaps are used, and we measure
how often the gaps are within 3% of the target either for all 10
gaps or for any 5 consecutive gaps.
For both tests, at least five consecutive gaps have the desired intervals
in 80–90% of the trains. For the 11 ms test, all 10 gaps had the
correct timing 60–70% of the time. The 1 ms test did not fare as well:
all 10 gaps met their target times in only 20–40% of the trains.
For the shorter (5-gap) chirp trains, the results are
quite good: sending 10 packets means fewer than 20% of the
measurements must be discarded. For longer chirp trains, two to five times
as many probes may have to be sent, which may be
tolerable for many experiments.
Mechanisms for negotiating temporarily longer time slices, or
even delegating packet transmission scheduling to the kernel, are
being discussed.
The latter might address another source of concern for measurement
experiments: the packet scheduler used to cap bandwidth and fairly
share bandwidth among slices. The timestamps on sent packets that a
process can observe with libpcap are accurate—the kernel timestamps
packets after they pass through the packet scheduler—and so
can still be used to discard bad results. However, the scheduler does
limit the kinds of trains that can be sent: it enforces a
per-slice cap of 10 Mbps with a maximum burst size of 30KB. Longer
trains sent at a faster rate are not permitted.
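For planning a measurement, a proposed train can be checked against such a cap with a token-bucket model (rate 10 Mbps, bucket 30 KB, using the figures above). The model is our assumption about how a cap of this kind is typically enforced, not a description of the PlanetLab scheduler's implementation. For example, eleven 1500-byte packets sent back to back total about 16.5 KB and fit within a 30 KB burst, while a much longer back-to-back train would not.

    /* Sketch: check whether a packet train conforms to a token-bucket cap
     * (rate_bps, burst_bytes).  Token-bucket enforcement is an assumption,
     * not a description of PlanetLab's packet scheduler. */
    #include <stddef.h>

    /* times_us[i] is the intended send time of packet i, in microseconds. */
    static int train_conforms(const double *times_us, const size_t *sizes,
                              int n, double rate_bps, double burst_bytes)
    {
        double tokens = burst_bytes;        /* bucket starts full */
        double rate_Bps = rate_bps / 8.0;   /* bytes per second */

        for (int i = 0; i < n; i++) {
            if (i > 0) {
                tokens += (times_us[i] - times_us[i - 1]) / 1e6 * rate_Bps;
                if (tokens > burst_bytes)
                    tokens = burst_bytes;
            }
            if ((double)sizes[i] > tokens)
                return 0;                   /* this packet would exceed the cap */
            tokens -= sizes[i];
        }
        return 1;
    }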
Myth: The PlanetLab AUP makes it unsuitable for measurement
The PlanetLab Acceptable Use
Policy [6] states:
PlanetLab is designed to support network measurement
experiments that purposely probe the Internet. However, we
expect all users to adhere to widely-accepted standards of
network etiquette in an effort to minimize complaints from
network administrators. Activities that have been
interpreted as worm and denial-of-service attacks in the
past (and should be avoided) include sending SYN packets
to port 80 on random machines, probing random IP
addresses, repeatedly pinging routers, overloading
bottleneck links with measurement traffic, and probing a
single target machine from many PlanetLab nodes.
This policy is a result of experience with network
measurements on PlanetLab, and is designed to prevent
network abuse reports of the form “PlanetLab is attacking my
machine.” Here we elaborate on steps to
conduct responsible Internet measurement on PlanetLab. The
goal of these practices is to make network measurements as
easy to support as possible by building a list of hosts that
“opt-out” of measurement without growing the list of
PlanetLab sites that have asked to “opt-out” of
hosting measurement experiments.
Test locally and start slow.
Do not use PlanetLab to send traffic you would not send from
your workstation. Use a machine at your site first to
discover any problems with your tool before causing
network-wide disruption. Measurements from PlanetLab can
appear to be a distributed denial of service attack;
starting with a few nodes can limit how many sites receive
abuse reports. Some intrusion detection systems generate
automatic abuse reports; an abuse report to every PlanetLab
host is best avoided.
Software has bugs, and bugs can cause
measurements to be more intrusive than necessary. Bugs
that have made PlanetLab-supported tools unnecessarily
intrusive include faulty checksum computation in a
lightweight traceroute implementation and a reaction to
unreachable hosts that directed a great deal of redundant
measurement toward the same router. Such errors could have
been detected before deployment with local testing.
Even a correctly-implemented tool may require local testing,
because very little experimental data guides non-intrusive
measurement tool design: are TCP ACKs less likely to raise
alarms than SYNs? Should traceroute not increment the UDP
destination port to avoid appearing as a port scan? How
many probes are needed to distinguish lossy links from
unreachable hosts?
Starting slow could have avoided abuse report flurries in
March and October 2005. An experiment with an
implementation flaw generated 19 abuse reports from as many
sites, half on the first day, March 15. The experiment ran for only 21 hours
before being shut down, but reports continued to arrive for two
weeks. A carefully-designed experiment in October tickled two
remote firewalls and a local intrusion detection system for
a total of 10 abuse reports forwarded to PlanetLab support. The automated responses from
remote firewalls might have been avoided by local testing of
the destination address list. Many more abuse reports were likely
generated by automated systems but discarded by recipients as frivolous,
since they reported only a single ICMP echo request (ping) as an attack.
Alert PlanetLab support.
Update your slice description
and send a message to PlanetLab support detailing
your intended measurement, how to identify its traffic, and
what you've done to try to avoid problems. First, sending
such a message shows that you, as an experimenter, believe
you have put sufficient effort into avoiding abuse reports.
Second, describing your approach gives PlanetLab staff and
other interested people the chance to comment upon your
design. Finally, knowing the research goals and methods can
save PlanetLab staff time and ensure a prompt response to
abuse reports.
Figure 7: Median uptime in days across all PlanetLab nodes.
Use Scriptroute.
Scriptroute separates measurement
logic from low-level details of measurement execution. It
prevents contacting hosts that have complained
about traffic, avoids inadvertently sending invalid
packets that trigger intrusion detection systems, limits
the rate of traffic sent, collects timestamps from libpcap,
and schedules probes using a hybrid of
sleeping and busy-waiting.
Curtail ambition.
It is tempting to demonstrate implementation skill by
running a measurement study from everywhere to
everywhere, using many packets for accuracy, and
using TCP SYN packets to increase the chance of discovering
properties of networks behind firewalls. Resist!
Aggressive measurement increases the cost it imposes on the network
for only a marginal gain in the authority of your results.
Myth: PlanetLab experiences excessive churn
Widespread outages on PlanetLab are fairly rare. Only three times
during the last two years have many PlanetLab nodes been down for
longer than a reboot: (1) all
nodes were taken off-line for a week in response to a security
incident in December 2003; the system was also upgraded from version
1.0 to 2.0; (2) an upgrade from version 2.0 to 3.0 during November
2004 caused more churn than usual for a two week period; and (3) a
kernel bug in February 2005 took many nodes off-line for a weekend.
On the other hand, roughly 30% of PlanetLab's nodes are
down at any given time. About one-third of these are
down for several weeks, usually because a site is
upgrading the hardware or blocking access due to an AUP or
security issue. The remaining failed nodes are part of the daily
churn that typically sees 15–20 nodes fail and as many
recover each day. Major software upgrades that
require reboots of all nodes occur, but are infrequent.
PlanetLab as a whole has been remarkably stable.
Figure 7 shows median node uptimes over
13 months. Of the six sharp drops in uptime, four are
due to testbed-wide software upgrades requiring reboots.
The longer upgrade, to version 3.0, is shown starting at
day 100. The kernel bug, followed by an upgrade, is evident
starting at day 170.
Median uptimes are
generally longer than 5 days, and often 15 to 20 days—much
higher than what would be expected in typical home systems.
Since PlanetLab does experience churn, no users should
expect that the storage offered by PlanetLab nodes is
persistent and no users should expect that a set of
machines, once chosen, will remain operational for the
duration of a long-running experiment.
5 Summary
In this paper, we described realities of the PlanetLab
platform: it is not representative of the Internet or of
peer-to-peer networks, and results are not always
reproducible. We then described myths that linger despite
being fixed: PlanetLab's notoriously high load poses less of
a problem today than it once did because there are resource
brokerage services and the operating system has been
upgraded to isolate experiments. Finally, we described
challenges that can often be addressed by following some
best practices. PlanetLab is capable of substantial network
measurement, despite technical challenges in precise timing
and social challenges in avoiding abuse complaints. In
addition, many PlanetLab machines may fail or be down at any
time; being prepared for this churn is a challenge for
experimenters.
Our hope is that separating myth from reality will make
clear the features and flaws of PlanetLab as an evolving research
platform, enabling researchers to choose the right platform
for their experiments and warning them of the challenges
PlanetLab presents.
Acknowledgments
We would like to thank the anonymous reviewers for their useful
feedback on the paper. This work was supported in part by NSF Grants
ANI-0335214, CNS-0439842, and CNS-0435065.
References
[1] S. Banerjee, T. G. Griffin, and M. Pias. The interdomain connectivity of PlanetLab nodes. In PAM, 2004.
[2] S. Floyd and V. Paxson. Difficulties in simulating the Internet. IEEE/ACM Transactions on Networking, 9(4):392–403, 2001.
[3] Y.-H. Chu, S. G. Rao, and H. Zhang. A case for end system multicast. In ACM SIGMETRICS, 2000.
[4] M. Huang, A. Bavier, and L. Peterson. PlanetFlow: Maintaining accountability for network services. Submitted for publication.
[5] K. Park and V. Pai. CoMon: A monitoring infrastructure for PlanetLab.
[6] PlanetLab Consortium. PlanetLab acceptable use policy (AUP), 2004.
[7] N. Spring, D. Wetherall, and T. Anderson. Scriptroute: A public Internet measurement facility. In USITS, 2003.
[8] J. Strauss, D. Katabi, and F. Kaashoek. A measurement study of available bandwidth estimation tools. In IMC, 2003.
[9] TCPDUMP.org. Frequently Asked Questions, 2001.
[10] M. Zhang et al. PlanetSeer: Internet path failure monitoring and characterization in wide-area services. In OSDI, 2004.
1. See linux/net/core/dev.c:netif_rx().