Section 3.1 answered the question of who is requesting data, and section 3.2 discussed how often data is requested. In this section, we inspect the nature of the data that is requested. Figure 6a shows the MIME type breakdown of the transferred data in terms of the number of bytes transferred, and figure 6b shows this breakdown in terms of the number of files transferred.
Figure 6: Breakdown of bytes and files transferred by MIME type
From figure 6a, we see that most of the bytes transferred over the Home IP modem lines come from three predominant MIME types: text/html, image/gif, and image/jpeg. Similarly, figure 6b shows that most files sent over the modem lines have the same three predominant MIME types. Interestingly, however, we see that although most bytes transferred correspond to JPEG images, most files transferred correspond to GIF images. Because JPEGs account for more bytes but fewer files than GIFs, the average JPEG must be larger than the average GIF.
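The two breakdowns and the average-size argument above can be illustrated with a minimal sketch. It assumes the trace has already been reduced to (MIME type, size in bytes) pairs; the function name `mime_breakdown` and the toy records are hypothetical and are not taken from the traces.

```python
from collections import defaultdict


def mime_breakdown(records):
    """Return {mime: (byte_share, file_share, mean_size)} for (mime, size) pairs."""
    bytes_by_type = defaultdict(int)
    files_by_type = defaultdict(int)
    for mime, size in records:
        bytes_by_type[mime] += size
        files_by_type[mime] += 1

    total_bytes = sum(bytes_by_type.values())
    total_files = sum(files_by_type.values())

    result = {}
    for mime, n_files in files_by_type.items():
        n_bytes = bytes_by_type[mime]
        # Guard against a trace whose responses are all empty.
        byte_share = n_bytes / total_bytes if total_bytes else 0.0
        file_share = n_files / total_files
        mean_size = n_bytes / n_files
        result[mime] = (byte_share, file_share, mean_size)
    return result


# Toy records invented for illustration only (not measured data).
toy = [
    ("image/gif", 1000), ("image/gif", 1000), ("image/gif", 1000),
    ("image/jpeg", 4000), ("image/jpeg", 4000),
    ("text/html", 1000),
]
breakdown = mime_breakdown(toy)
```

In this toy input, GIFs make up half of the files but only a quarter of the bytes, while JPEGs make up a third of the files and two thirds of the bytes. The mean JPEG size (bytes divided by files) is therefore larger than the mean GIF size, which is the same inference drawn from figures 6a and 6b.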
The fact that nearly 58% of bytes transferred and 67% of files transferred are images is good news for Internet cache infrastructure proponents. Image content tends to change less often than HTML content: images are usually created statically and remain stable for long periods between modifications, whereas HTML is increasingly generated dynamically.
Figure 7: Size distributions by MIME type, shown on a logarithmic scale. The average HTML file size is 5.6 kilobytes, the average GIF file size is 4.1 kilobytes, and the average JPEG file size is 12.8 kilobytes.
In figure 7, we see the distribution of sizes of files belonging to the three most common MIME types. Two observations can immediately be made: most Internet content is less than 10 kilobytes in size, and the size distributions are quite heavy-tailed, meaning that there is a non-trivial number of large data files on the web. Looking more closely at the individual distributions, we can confirm our previous hypothesis that JPEG files tend to be larger than GIF files. The JPEG file size distribution is also considerably more heavy-tailed than the GIF distribution. There are more large JPEGs than large GIFs, perhaps in part because JPEGs tend to be photographic images, whereas GIFs tend to be cartoons, line art, or other such simple, small images.
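A size distribution on a logarithmic scale, like the one in figure 7, can be approximated by counting files in bins of equal width in log10(size). The sketch below is one possible way to do this and is not necessarily how the figure was produced; the function names, the bin width, and the sample sizes are all assumptions made for illustration.

```python
import math
from collections import Counter


def log_histogram(sizes, bins_per_decade=4):
    """Count sizes in bins of equal width in log10(size).

    Returns {lower_bin_edge_in_bytes: count}, ordered by increasing edge.
    """
    counts = Counter()
    for size in sizes:
        if size <= 0:
            continue  # log10 is undefined for empty responses; skip them
        counts[math.floor(math.log10(size) * bins_per_decade)] += 1
    return {10 ** (idx / bins_per_decade): n for idx, n in sorted(counts.items())}


def fraction_below(sizes, threshold):
    """Fraction of files strictly smaller than `threshold` bytes."""
    sizes = list(sizes)
    if not sizes:
        return 0.0
    return sum(1 for s in sizes if s < threshold) / len(sizes)


# Sample sizes invented for illustration only (not measured data).
sample = [500, 2000, 3000, 50000]
hist = log_histogram(sample, bins_per_decade=1)
share_small = fraction_below(sample, 10_000)
```

With one bin per decade, the sample falls into the bins starting at 100, 1000, and 10000 bytes. A heavy tail shows up in such a histogram as non-zero counts that persist into the highest bins, even when `fraction_below` reports that most files are under 10 kilobytes.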
There are other anomalies in these distributions. The GIF distribution has two visible plateaus, one at roughly 300-1000 bytes, and another at 1000-5000 bytes. We hypothesize that the 300-1000 byte plateau is caused by small "bullet" images or icons on web pages, and the 1000-5000 byte plateau represents all other GIF content, such as cartoons, pictures, diagrams, advertisements, etc. Another anomaly is the large spike in the HTML distribution at roughly 11 kilobytes. Investigation revealed that this spike is caused by the extremely popular Netscape Corporation "Net Search" page.