When I heard that a paper by Dutch Meyer and Bill Bolosky had won the FAST 2023 Test of Time Award, I actually remembered that paper. Dutch did a great job of presenting the results of research into the storage usage of 875 Microsoft employees [1]. I also thought that I remembered Dutch saying that one of the usage patterns they uncovered was WORN: Write Once, Read Never. I was wrong about that, but Bill Bolosky thought that he had said that about a similar research project published at FAST several years earlier.
I wondered what results in a paper getting a Test of Time Award. Logically, it should be a paper that other researchers tended to cite in later papers. I decided to ask someone who should know.
Rik Farrow: Can you tell me why this paper won a Test of Time Award?
Bill Bolosky: The Test of Time Award [2] goes to the paper that the test of time committee selects. Being on the committee (though obviously recused from my own paper), I know how the process works. The committee is made up of former FAST PC chairs. Any paper published at FAST at least 10 years ago that hasn't already won the award is eligible. Committee members can nominate any eligible paper. These papers get put in a special HotCRP site where committee members can "review" them, which in this case means rating them on a 1-to-5 scale ("never should get the award" to "really should get it right now"); the committee chairs then read the reviews and select the winner.
In my experience, having the most citations helps a lot toward winning. It's an imperfect measure, but it's more or less the only objective one. It certainly informs (though does not determine) my votes.
RF: That's pretty much what I had thought, although in this particular case your research provided a lot of data about how a population of mostly programmers and other desktop users actually used the local storage in their desktops.
In the paper, you analyze storage patterns from MS workstations, so you had to have some way to collect that information. I assumed that was because of some company policy, perhaps a network backup, that made data about file sizes, types, and access patterns available. Is that true?
Dutch Meyer: Not so much! We knew what we wanted for the paper, so I wrote a purpose-built tool that walked every file in every directory of every hard drive, and I built an installer that would schedule that tool to run in the background at times when we figured it would be less disruptive. Once we had that tool, we pulled the entire directory of Microsoft's Redmond employees and randomly sampled ten thousand of them. Then I emailed them a very innocent message: "Please install this tool that will scan your hard drive in the most invasive way imaginable." The installer also had mechanisms for data retrieval, basically just copying the resulting files back to a network file server. Then of course you need odds-and-ends features like jittering the upload so it doesn't take our server out, and checking a remote file we controlled so that we could remotely trigger an uninstall.
Of course, it would have been much better for us if this had already existed, but since it didn't, we had to build it to operate in a time-limited way. We essentially had only one shot at getting the tool to work and to run.
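In outline, the scanner Dutch describes is a recursive walk over every local drive, plus a jittered upload and a remote kill switch. The sketch below is a minimal Python illustration of that structure; the paths, network share, and kill-switch file are hypothetical, and the real tool was a purpose-built Windows program rather than this script.

```python
import os
import random
import shutil
import time

RESULTS_FILE = "scan_results.txt"         # hypothetical local output file
UPLOAD_DIR = r"\\fileserver\scans"        # hypothetical network share used for retrieval
KILL_SWITCH = r"\\fileserver\stop_scan"   # hypothetical remote file that triggers an uninstall


def scan_all_drives(out_path, drives=("C:\\", "D:\\")):
    """Walk every file of every directory on each drive, recording basic metadata."""
    with open(out_path, "w", encoding="utf-8") as out:
        for drive in drives:
            if not os.path.exists(drive):
                continue
            for root, _dirs, files in os.walk(drive):
                for name in files:
                    path = os.path.join(root, name)
                    try:
                        st = os.stat(path)
                    except OSError:
                        continue          # skip files we can't read
                    out.write(f"{st.st_size}\t{st.st_mtime}\t{name}\n")


def upload_with_jitter(local_path, upload_dir, max_delay=3600):
    """Copy the results back to the collection server after a random delay to spread the load."""
    time.sleep(random.uniform(0, max_delay))
    shutil.copy(local_path, upload_dir)


def should_uninstall(kill_switch_path=KILL_SWITCH):
    """Check a file we control remotely; its presence tells the tool to remove itself."""
    return os.path.exists(kill_switch_path)


if __name__ == "__main__":
    if not should_uninstall():
        scan_all_drives(RESULTS_FILE)
        upload_with_jitter(RESULTS_FILE, UPLOAD_DIR)
```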
RF: Wow, that's not what I was expecting. But it does make sense, in that to collect the type of data you presented in that paper, you really needed to be able to do more than just analyze backups. How did you determine file sizes and types?
DM: We knew, because it wasn't particularly hard to know: we looked at the metadata available through the file system APIs and took almost everything. For extensions, we read the file names and took the string following the period.
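That kind of metadata pass is simple to picture. Here is a minimal Python sketch that records size, modification time, and the extension (taken here as the string after the last period), standing in for the Windows file system APIs the real scanner used.

```python
import os


def file_metadata(path):
    """Return (size, modification time, extension) using ordinary stat metadata."""
    st = os.stat(path)
    name = os.path.basename(path)
    # Treat the extension as the string after the last period, or empty if there is none.
    ext = name.rsplit(".", 1)[1].lower() if "." in name else ""
    return st.st_size, st.st_mtime, ext


# Example: walk a directory tree and print size and extension for each file.
for root, _dirs, files in os.walk("."):
    for name in files:
        try:
            size, _mtime, ext = file_metadata(os.path.join(root, name))
        except OSError:
            continue
        print(size, ext)
```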
On the deduplication side we had to think more carefully about space and time constraints. The scanner itself, running on the users' computers, read the file data and turned it into content hashes. If you break the files into different-sized chunks you get different hashes, and we had to do all of that on the user's computer because we couldn't copy all their data back to our server. So we had to pick several different ways to break up those files and hash the data, such that we would get decent coverage of what different systems would do, but also so that the scan would finish for most users in a timely way. We also hashed the users' file names (but not the extensions) for privacy.
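The chunk-and-hash step can be illustrated with a short sketch like the one below, which hashes each file whole and at a couple of fixed chunk sizes, and hashes the file name with a salt while leaving the extension readable. The chunk sizes, the salt, and the use of fixed-size rather than content-defined chunking are illustrative assumptions, not the paper's exact parameters.

```python
import hashlib
import os

CHUNK_SIZES = (8 * 1024, 64 * 1024)   # illustrative fixed chunk sizes, not the paper's choices
NAME_SALT = b"per-study-secret"       # hypothetical salt so hashed names can't be looked up


def hash_file_contents(path):
    """Return the whole-file hash plus per-chunk hashes at each chunk size."""
    with open(path, "rb") as f:
        data = f.read()               # a real scanner would stream rather than slurp the file
    whole = hashlib.sha1(data).hexdigest()
    chunks = {
        size: [hashlib.sha1(data[off:off + size]).hexdigest()
               for off in range(0, len(data), size)]
        for size in CHUNK_SIZES
    }
    return whole, chunks


def hash_file_name(path):
    """Hash the file name for privacy while keeping the extension in the clear."""
    name = os.path.basename(path)
    base, dot, ext = name.rpartition(".")
    digest = hashlib.sha1(NAME_SALT + (base or name).encode("utf-8")).hexdigest()
    return digest + (dot + ext if dot else "")
```

Running several chunkings in a single pass is what lets one scan stand in for what different deduplication systems would see, without ever shipping file contents off the user's machine.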