Short Paper: A Memory Soft Error Measurement on Production Systems
Memory state can be corrupted by particle impacts that cause single-event upsets (SEUs). Understanding and dealing with these soft (or transient) errors is important for system reliability. Several earlier studies have provided field test measurements of the memory soft error rate, but no results were available for recent production computer systems. We believe measurement results on real production systems are uniquely valuable because of various environmental effects. This paper presents methodologies for memory soft error measurement on production systems, where the performance impact on existing running applications must be negligible and system administrative control may or may not be available.
We conducted measurements in three distinct system environments: a rack-mounted server farm for a popular Internet service (Ask.com search engine), a set of office desktop computers (Univ. of Rochester), and a geographically distributed network testbed (PlanetLab). Our preliminary measurement on over 300 machines for varying multi-month periods finds 2 suspected soft errors. In particular, our result on the Internet servers indicates that, with high probability, the soft error rate is at least two orders of magnitude lower than those reported previously. We discuss several factors in today’s production system environments that may account for the low error rate. In contrast, our measurement incidentally discovers permanent (or hard) memory faults on 9 out of 212 Ask.com machines, suggesting that hard memory faults are relatively common.
author = {Xin Li and Kai Shen and Michael C. Huang and Lingkun Chu},
title = {Short Paper: A Memory Soft Error Measurement on Production Systems},
booktitle = {2007 USENIX Annual Technical Conference (USENIX ATC 07)},
year = {2007},
address = {Santa Clara, CA},
url = {https://www.usenix.org/conference/2007-usenix-annual-technical-conference/short-paper-memory-soft-error-measurement},
publisher = {USENIX Association},
month = jun
}