In September of 2022, I started receiving the web log files for usenix.org. I felt I needed to see what articles were being downloaded, and the current system didn't provide the insights I was looking for in a timely manner. I filtered out log entries for publications/login, counted the download frequencies, and sorted these with the most popular first.
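For readers who want to try a similar count on their own logs, here is a minimal sketch of the filter, count, and sort steps. It assumes the server writes the common Apache/NGINX "combined" log format and that papers are served as PDF files. Neither detail is stated above, and the sample paths and the `count_downloads` name are made up for illustration.

```python
import re
from collections import Counter

# Assumed "combined" log format; the actual usenix.org log layout may differ.
LOG_LINE = re.compile(
    r'^\S+ \S+ \S+ \[[^\]]+\] "(?P<method>[A-Z]+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) '
)

def count_downloads(lines, suffix=".pdf"):
    """Count successful GET requests per path, most popular first."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match is None:
            continue  # skip lines that do not fit the assumed format
        if match["method"] != "GET" or match["status"] != "200":
            continue  # only count successful downloads
        path = match["path"].split("?", 1)[0]  # drop any query string
        if path.lower().endswith(suffix):
            counts[path] += 1
    return counts.most_common()

# Hypothetical log lines, invented for this example.
sample = [
    '203.0.113.5 - - [06/Sep/2023:10:00:00 +0000] "GET /system/files/a.pdf HTTP/1.1" 200 1234 "-" "Mozilla/5.0"',
    '203.0.113.6 - - [06/Sep/2023:10:00:01 +0000] "GET /system/files/b.pdf HTTP/1.1" 200 1234 "-" "Mozilla/5.0"',
    '203.0.113.7 - - [06/Sep/2023:10:00:02 +0000] "GET /system/files/a.pdf?x=1 HTTP/1.1" 200 1234 "-" "Mozilla/5.0"',
    '203.0.113.8 - - [06/Sep/2023:10:00:03 +0000] "GET /system/files/a.pdf HTTP/1.1" 404 0 "-" "Mozilla/5.0"',
    '203.0.113.9 - - [06/Sep/2023:10:00:04 +0000] "GET /index.html HTTP/1.1" 200 512 "-" "Mozilla/5.0"',
]

for path, downloads in count_downloads(sample):
    print(downloads, path)
```

Counting only status 200 responses is a simplification: partial (206) responses from resumed downloads would need their own handling.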
While I might think that a particular topic will be interesting to lots of people, I am not always right. And as an editor who doesn't pay authors, I really, really want people to read the articles that authors have taken the time to write. I have a responsibility to authors, one I only get to exercise by carefully curating the authors of papers about topics I think will be very popular.
Over time, I noticed a couple of things. Sometimes an article might remain popular for over a year, while other articles would suddenly become popular for a period of just weeks. In both cases, the initial surge in popularity often involves promotion by people other than the authors in places like Slashdot and Hacker News. And sometimes an article just fills a particular need through its thorough explanation of some technical topic.
I noticed a prominent example of how promotion affects downloads the second time I took a look at papers. An OSDI paper had an unusually high number of downloads in the September-October 2023 timeframe: 1818. I asked Ding Yuan, the lead author, if he had any idea what might have caused this increase in activity. Ding discovered that a post on X by Kevlin Henney on September 6 had resulted in about 20% of his followers downloading the paper.
Besides papers and ;login: articles leaping in popularity because of promotion, some stay near the top for other reasons. I skimmed the top 20 papers from June 2024 trying to determine what it was about these papers that made them so popular. I first noticed that they were all well written, but that's really not uncommon for highly rated papers.
The other thing all these papers had in common was that they weren't just introducing some new research or software: they did a great job of teaching about the issues involved. Usually, section two of a paper covers related work, while section one provides motivation for why this particular work deserves to be published. Between these two sections, you can learn a lot about a topic. And that's why you find papers about Meta's Haystack, Google's TensorFlow, and Yahoo's ZooKeeper in the top 20 (Table 1).
| Index | Title | Description | Downloads |
| --- | --- | --- | --- |
| 1 | The Multi Router Traffic Grapher and RRDtool | Description of MRTG and RRDtool, a binary logging tool for time-sequence data | 939 |
| 2 | Scaling Memcache at Facebook | How Facebook scaled memcached to thousands of servers | 874 |
| 3 | Replication: No One Can Hack My Mind | Survey of security advice from experts and non-experts | 845 |
| 4 | In the Compression Hornet’s Nest | Denial of service attacks when Deflate is used in Apache HTTPD, Tomcat, and other services | 772 |
| 5 | An Analysis of Private Browsing Modes in Modern Browsers | Evaluation of private browsing in four major browsers, inconsistencies and failures | 743 |
| 6 | How the Great Firewall of China Detects and Blocks Fully Encrypted Traffic | Blocking encrypted traffic based on passive traffic analysis | 642 |
| 7 | REX: A Development Platform and Online Learning Approach | Dana, a component-based programming language, an assembly and learning framework, and an online learning implementation that together allow for runtime optimization | 533 |
| 8 | Extracting Training Data from Large Language Models | Training data from LLMs (GPT-2) can be recovered | 540 |
| 9 | TensorFlow: A System for Large-Scale Machine Learning | One of the foundational papers leading to LLMs: describes a dataflow graph to represent both the computation in an algorithm and the state | 533 |
| 10 | Is Real-time Phishing Eliminated with FIDO? | Downgrade attack against the use of two-factor authentication that uses the FIDO protocol | 493 |
| 11 | Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing | Improvement to Hadoop by caching datasets (Spark) | 475 |
| 12 | Remote Exploitation of Memory Corruptions in Cellular Protocol Stacks | Demonstration of attacks against the radio processor on smartphones | 462 |
| 13 | Orca: A Distributed Serving System for Transformer-Based Generative Models | New scheduling mechanism improves performance of LLM inference procedures | 459 |
| 14 | Spark: Cluster Computing with Working Sets | Addition of working sets for MapReduce and interactive analytics using a Dryad-like interface | 451 |
| 15 | Scalability! But at what COST? | Workshop paper that explains COST (the Configuration that Outperforms a Single Thread), showing that many data parallel systems are either slower than a single threaded solution or have high COST | 440 |
| 16 | Finding a needle in Haystack: Facebook’s photo storage | Facebook's Haystack photo storage system keeps metadata in memory | 376 |
| 17 | Andromeda: Performance, Isolation, and Velocity at Scale | Google's Andromeda cloud network virtualization for isolation and performance | 373 |
| 18 | Fingerprinting Obfuscated Proxy Traffic with Encapsulated TLS Handshakes | Uncovering obfuscated proxy traffic, such as is done by the Great Firewall of China | 373 |
| 19 | Zookeeper | Description of Yahoo's ZooKeeper system for coordinating distributed services | 353 |
| 20 | Amazon DynamoDB | A fully distributed NoSQL database supporting multiple tenants, limitless tables, predictable and reliable performance | 343 |
At the end of this article, I've included Table 2 with the top 100 paper downloads from July 2024. If you look at the downloads column, you'll notice that the number of downloads drops off quickly. I graphed the top 3000 downloads against their index numbers, and you can see (Figure 1) just how steeply downloads drop off. When you consider that there are nearly 31 thousand papers represented in the log files, it might seem unfair that a relative handful are so popular. I suggest keeping in mind a couple of things: one, that papers about famous software are going to be downloaded more often, and two, that promotion can briefly push a paper to the top of the list.
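The drop-off can also be put into numbers. The sketch below ranks download counts and reports what fraction of all downloads the top few papers capture. The function names are invented for this example, and the sample uses only the first five download counts from Table 1, so the resulting share says nothing about the full set of papers.

```python
def rank_downloads(counts):
    """Pair each download count with its 1-based rank, most downloaded first."""
    return list(enumerate(sorted(counts, reverse=True), start=1))

def top_share(counts, n):
    """Fraction of all downloads captured by the n most downloaded papers."""
    ordered = sorted(counts, reverse=True)
    total = sum(ordered)
    return sum(ordered[:n]) / total if total else 0.0

# Download counts from the first five rows of Table 1; a real run would use every paper.
sample_counts = [939, 874, 845, 772, 743]

for rank, downloads in rank_downloads(sample_counts):
    print(rank, downloads)
print(f"top 2 share of this sample: {top_share(sample_counts, 2):.1%}")
```

Plotting the `(rank, downloads)` pairs, ideally with a logarithmic axis, gives the kind of curve described for Figure 1.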
I produced two lists of the top 100 paper downloads, one from the end of 2023 and the other from July 2024, and only 31 papers are in both lists. In other words, there is a fair amount of churn happening over time.
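Comparing two such lists is a set intersection. This sketch assumes each list holds unique paper identifiers; the identifiers shown are placeholders, not real entries from either list.

```python
def overlap(list_a, list_b):
    """Return the items of list_a that also appear in list_b, in list_a's order."""
    in_b = set(list_b)
    return [item for item in list_a if item in in_b]

def churn(list_a, list_b):
    """Fraction of list_a that does not reappear in list_b."""
    if not list_a:
        return 0.0
    return (len(list_a) - len(overlap(list_a, list_b))) / len(list_a)

# Placeholder identifiers standing in for two top-N lists.
earlier = ["paper-1", "paper-2", "paper-3", "paper-4"]
later = ["paper-3", "paper-9", "paper-1"]

print(overlap(earlier, later))
print(churn(earlier, later))
```

With the figures above, 31 shared papers out of 100 means that 69 of the earlier top 100 had dropped out of the later list.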
I didn't just look at the top papers either. I skimmed the paper with index number 3000, a workshop paper from HotPar'11 by Hans Boehm where he points out that there are no benign data races. A bit obscure, certainly, but still an interesting enough workshop paper.
One of the last entries in the list was the slide deck from a LEET'10 presentation about botnets. These days, it seems that no one talks about botnets, and looking at the web logs, only crawler bots actually visited this link. Still, I found the slides interesting as the botnet it described was very advanced compared to those from the 90s, with multiple tiers for command and control.
Papers with ten or fewer downloads, starting almost halfway down the list, are still being downloaded by visitors other than bots. Unless the browser information has been falsified, that LEET presentation mentioned above was downloaded by just three bots, and nothing else.
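Separating bots from other visitors usually relies on the User-Agent field of each log entry, which is the browser information mentioned above. The following is only a rough sketch of that idea: the marker list is illustrative and far from complete, the function names are invented here, and, as noted, a client can falsify this field.

```python
# Illustrative substrings only; real crawler lists are much longer.
BOT_MARKERS = ("bot", "crawler", "spider", "slurp")

def is_probable_bot(user_agent):
    """Guess whether a User-Agent string belongs to a crawler."""
    ua = user_agent.lower()
    return any(marker in ua for marker in BOT_MARKERS)

def split_downloads(user_agents):
    """Count downloads attributed to probable bots versus everything else."""
    bots = sum(1 for ua in user_agents if is_probable_bot(ua))
    return {"bots": bots, "others": len(user_agents) - bots}

# Example User-Agent values for the downloads of a single file.
agents = [
    "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
    "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)",
    "Mozilla/5.0 (X11; Linux x86_64; rv:128.0) Gecko/20100101 Firefox/128.0",
]

print(split_downloads(agents))
```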
Finally, there's the matter of conference popularity and how that affects the papers downloaded. If you go back to the early days of USENIX, there were just two conferences: USENIX Summer and USENIX Winter. All topics were included in those conferences, and the main difference between them, besides the season, was that one happened near the East Coast and the other on the West Coast. Starting around 1990, conferences began appearing that covered a particular topic area, like system administration or security, the first two new conferences. Figure 2 shows the binning of downloads when separated into conference categories.
Security dwarfs all other categories. If you wonder why this is, just consider that there were over 400 papers at Security'23, and Security'24 has even more. Some conferences, like SRE, have no papers at all, but they do have some presentation slides and all presentations appear on YouTube as videos, data not included in this analysis. LISA had few papers, but one from LISA'98 by Tobi Oetiker about RRDtool is often in the top 100.
I'm closing this brief analysis with Table 2, the top 100 downloads during July 2024. When I compared this list to the one from June, 37 papers were the same. If you are wondering about the outlier, Inference of Error Specifications and Bug Detection Using Structural Similarities by Dossche and Coppens with over 10,000 downloads, I quickly found a posting on X by Winson Tang referring to this paper.
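Binning by category amounts to summing download counts under a label per conference. The sketch below assumes each download record already carries a conference tag. The tags, the category mapping, and the counts are all invented for illustration and are not the data behind Figure 2.

```python
from collections import defaultdict

# Hypothetical mapping from conference tag to category.
CATEGORY = {
    "security": "Security",
    "osdi": "Systems",
    "lisa": "System administration",
}

def bin_by_category(downloads, category_map, default="Other"):
    """Sum download counts per category, largest category first."""
    totals = defaultdict(int)
    for conference, count in downloads:
        totals[category_map.get(conference, default)] += count
    return dict(sorted(totals.items(), key=lambda item: item[1], reverse=True))

# Invented (conference, downloads) records.
records = [("security", 400), ("osdi", 120), ("security", 50), ("hotpar", 3)]

print(bin_by_category(records, CATEGORY))
```

A category with many papers, as with Security, accumulates a large total even when individual papers are downloaded only modestly, so dividing each total by the number of papers would give a different picture.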