To test the hypothesis that deltas would be sufficiently small, we took a sample of WWW pages and measured how large the difference between two versions of the same page was, relative to the size of the page itself. We considered two sources of sample pages: the version archive of the AT&T Internet Difference Engine (AIDE) [8], which stores versions of pages for later visual comparison of their changes, and a set of random URLs obtained from AltaVista [1]. The random pages were tracked by AIDE as well, but they were considered separately from the pages that were registered explicitly, either by individuals or by inclusion in a list of popular URLs collected from a set of bookmark files within AT&T. The random and popular URLs were archived daily if changes were detected, while the pages tracked for individual users were typically archived upon explicit request.
Throughout our experiments, we computed deltas using vdelta, a program that generates compact deltas by essentially compressing the deltas in the process of computing them, and which can be used as a stand-alone compression program as well [10]. We must consider the possibility that WWW pages that are compressed in a stand-alone fashion will compress so well that the deltas between two versions of a page are not much smaller than the compressed page. In this case the client and server proxies could merely compress every page (or rely on compression in the modems) without using deltas and have the same benefit. We will see that in practice, however, deltas are substantially smaller than simple compression.
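The comparison we are making can be illustrated with the sketch below. It is only an illustration: vdelta computes a compact binary delta directly, whereas this fragment approximates a delta by compressing a textual diff with zlib, and the page contents are invented. It nonetheless shows why a delta can be far smaller than the compressed page when only a small portion changes between versions.

```python
import difflib
import zlib

def compressed_size(data: bytes) -> int:
    """Size of the page under stand-alone compression (zlib here,
    standing in for the compressor inside vdelta)."""
    return len(zlib.compress(data))

def delta_size(old: bytes, new: bytes) -> int:
    """Rough stand-in for a delta: compress a textual diff between the
    two versions.  vdelta computes a compact binary delta directly;
    this approximation is for illustration only."""
    diff = "\n".join(difflib.unified_diff(
        old.decode("latin-1").splitlines(),
        new.decode("latin-1").splitlines(),
        lineterm=""))
    return len(zlib.compress(diff.encode("latin-1")))

# A made-up page: 200 story lines plus a timestamp that changes daily.
stories = "\n".join(f"<p>Story {i}: headline about topic {i * i}</p>"
                    for i in range(200))
old = (f"<html><body>\n{stories}\n<p>Updated: Mon</p>\n"
       f"</body></html>\n").encode("latin-1")
new = old.replace(b"Updated: Mon", b"Updated: Tue")

print(f"page: {len(new)} bytes, "
      f"compressed: {compressed_size(new)} bytes, "
      f"delta: {delta_size(old, new)} bytes")
```

For a one-line change such as a timestamp, the compressed diff is a small constant regardless of page size, while the compressed page grows with the page; this is the gap the experiments below quantify.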
Figure 2: Comparison of sizes of deltas and original or compressed pages, using vdelta. While deltas have a greater benefit when simple compression is not taken into account, they help above and beyond the benefits of compression. In each graph, the dashed line indicates the break-even point and the solid line depicts the mean across all files.
Considering first the non-random pages: of 380 pages in the archive, 181 had more than one version, with a mean of 4.9 versions/page (sigma). Figure 2(a) plots delta size against original file size; the delta is usually a small fraction of the original. By comparison, Figure 2(b) plots delta size against the size of the newer file once compressed. It shows that the delta is consistently much smaller than the compressed file, though in some cases it is approximately the same size; this usually happens when a file has changed completely from one version to the next. Even if a file does not compress well (for instance, a GIF file), the worst vdelta will do is reproduce the original file with a few bytes of overhead. Across more than 2200 comparisons, the mean ratio of delta size to compressed-file size was 19% (sigma).
The outlying points in Figure 2, which are due to one GIF file that has been archived automatically each day and a compressed postscript file with two versions in the archive, might be a concern in practice if the system were to send stale copies and compute deltas regardless of file type. Fortunately, file types are identifiable, both from the Content-type HTTP header and data within the files, so it is possible to treat images and other non-textual data specially. One might instead use a distillation technique to send a version of an image that is more appropriate for a low-bandwidth link [11].
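Such special treatment amounts to a policy check on the response's media type before attempting a delta. The predicate below is a hypothetical policy of our own devising, not the system's actual rule; the particular list of types is an assumption:

```python
def should_delta(content_type: str) -> bool:
    """Hypothetical policy: apply delta encoding only to textual
    media types, and skip images and already-compressed formats,
    where a delta is unlikely to beat sending the file as-is.
    The media type comes from the HTTP Content-type header."""
    # Strip parameters such as "; charset=iso-8859-1".
    media = content_type.split(";")[0].strip().lower()
    if media.startswith("text/"):
        return True
    # PostScript is textual enough to delta well (our assumption).
    if media == "application/postscript":
        return True
    return False

print(should_delta("text/html; charset=iso-8859-1"))  # textual: delta
print(should_delta("image/gif"))                      # image: skip
```

A fuller implementation might also sniff magic numbers in the data itself, since Content-type headers are sometimes missing or wrong.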
Figure 3: Distribution of the number of versions detected by daily checks of 861 randomly selected URLs over a two-month period, for the 21% of pages that were modified.
Our study of 1000 random URLs from AltaVista [1] found that 861 were still accessible when we began tracking them, and the vast majority (79%) of those were not modified during the next two months of daily checks. Figure 3 graphs the distribution of the number of versions detected for the remaining 21% that were modified. Just 43 of the 861 URLs (5%) had 40 or more versions over the two months of the study; the minor variations in the number of versions of the frequently changed pages may be due to transient errors while contacting those hosts. We also performed the above analysis of delta sizes for these pages, and found that the mean delta size was just 3.7% of the original page size (sigma), and 10.4% of the compressed page size (sigma). The pages themselves compressed to an average of 12.1% (sigma) of their original size. A possible explanation for the small deltas is that the sample was dominated by pages that changed daily, and those changes may often have been the inclusion of timestamps or other small, simple modifications.
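The daily-check methodology can be sketched as a digest comparison: each day's fetch is recorded as a new version only when its content differs from the previous day's. The SHA-1 digest and the helper below are our illustration, not the tool the study actually used:

```python
import hashlib

def count_versions(daily_snapshots):
    """Count distinct versions seen by a daily check: a new version
    is recorded whenever today's content digest differs from the
    digest of the most recently archived copy.  The initial fetch
    counts as the first version."""
    versions = 0
    last_digest = None
    for body in daily_snapshots:
        digest = hashlib.sha1(body).hexdigest()
        if digest != last_digest:
            versions += 1
            last_digest = digest
    return versions

# Six daily fetches of a page that changed twice after its first fetch.
snapshots = [b"v1", b"v1", b"v2", b"v2", b"v2", b"v3"]
print(count_versions(snapshots))  # -> 3
```

Comparing digests rather than full bodies keeps the per-URL state small, at the usual (negligible) risk of hash collisions.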
Another consideration is what sorts of data can be compared. Even dynamic pages, which are not cacheable, may have substantial overlap between versions of the same page, or between pages with the same base URL but different parameters to a CGI script. (Determining when to compare one URL against a slightly different URL for differencing is an open question, but as long as the client and server proxies agree on the versions being compared, the system will act correctly.)
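One plausible pairing heuristic for that open question (our assumption, not a rule from the system) is to treat two URLs as versions of the "same" page when they agree on everything except a designated result-range parameter; the parameter name `start` below is hypothetical:

```python
from urllib.parse import urlsplit, parse_qsl

def delta_candidates(url_a, url_b, paging_params=("start",)):
    """Heuristic sketch: two URLs are candidates for differencing if
    they share scheme, host, and path, and their query strings differ
    only in the parameters named in paging_params (here the made-up
    name 'start' for a result-range parameter)."""
    a, b = urlsplit(url_a), urlsplit(url_b)
    if (a.scheme, a.netloc, a.path) != (b.scheme, b.netloc, b.path):
        return False
    qa = {k: v for k, v in parse_qsl(a.query) if k not in paging_params}
    qb = {k: v for k, v in parse_qsl(b.query) if k not in paging_params}
    return qa == qb

# Pages 1 and 2 of the same query: good candidates for a delta.
print(delta_candidates("http://search.example/q?q=flash+memory&start=1",
                       "http://search.example/q?q=flash+memory&start=11"))
```

Whatever heuristic is chosen, both proxies must apply it identically so that they agree on which base version a delta refers to.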
For example, a query to the AltaVista search engine [1] might result in a page containing several links to content and several more links to other URLs within AltaVista. The ``boilerplate'' can dominate the content that changes from page to page, because each page contains the same form at the top and, at the bottom, a set of links to each other page generated by the query. Figure 4 graphs the sizes of deltas from two queries, compared to the size of the page if it were just compressed. The first, a name lookup, returned 9 pages; the second, a query with many terms (``storage management mobile computing flash memory nvram'') that generated thousands of matches, returned 20 pages, of which 10 were compared. In each case the deltas from one page to the next, within a given search result, were much smaller even than the compressed pages.
Figure 4: Example of deltas for two AltaVista queries, one for a person's name and one for a mixture of computer-science terms. All comparisons were pairwise in sequential order, starting with a delta between the first two pages of a query result. The URL of each page varied slightly because it specified the range of responses to return (1-10, 11-20, etc.).