A Realistic Evaluation of Memory Hardware Errors and Software System Susceptibility

http://www.cs.rochester.edu/~kshen/papers/usenix2010-li.pdf

Abstract

Memory hardware reliability is an indispensable part of whole-system dependability. This paper presents the collection of realistic memory hardware error traces (including transient and non-transient errors) from production computer systems with more than 800 GB memory for around nine months. Detailed information on the error addresses allows us to identify patterns of single-bit, row, column, and whole-chip memory errors. Based on the collected traces, we explore the implications of different hardware ECC protection schemes so as to identify the most common error causes and approximate error rates exposed to the software level. Further, we investigate the software system susceptibility to major error causes, with the goal of validating, questioning, and augmenting results of prior studies. In particular, we find that the earlier result that most memory hardware errors do not lead to incorrect software execution may not be valid, due to the unrealistic model of exclusive transient errors. Our study is based on an effi- cient memory error injection approach that applies hardware watchpoints on hotspot memory regions.

1 Introduction

Memory hardware errors are an important threat to computer system reliability [37] as VLSI technologies continue to scale [6]. Past case studies [27,38] suggested that these errors are significant contributing factors to whole-system failures. Managing memory hardware errors is an important component in developing an overall system dependability strategy. Recent software system studies have attempted to examine the impact of memory hardware errors on computer system reliability [11, 26] and security [14]. Software system countermeasures to these errors have also been investigated [31]. Despite its importance, our collective understanding about the rate, pattern, impact, and scaling trends of

memory hardware errors is still somewhat fragmented and incomplete. The lack of knowledge on realistic errors has forced failure analysis researchers to use synthetic error models that have not been validated [11, 14, 24, 26, 31]. Without a good understanding, it is tempting for software developers in the field to attribute (often falsely) non-deterministic system failures or rare performance anomalies [36] to hardware errors. On the other hand, anecdotal evidence suggests that these errors are being encountered in the field. For example, we were able to follow a Rochester student’s failure report and identify a memory hardware error on a medical Systemon-Chip platform (Microchip PIC18F452). The faulty chip was used to monitor heart rate of neonates and it reported mysterious (and alarming) heart rate drops. Using an in-circuit debugger, we found the failure was caused by a memory bit (in the SRAM’s 23rd byte) stuck at ‘1’.

In an effort to acquire valuable error statistics in realworld environments, we have monitored memory hardware errors in three groups of computers—specifically, a rack-mounted Internet server farm with more than 200 machines, about 20 university desktops, and 70 PlanetLab machines. We have collected error tracking results on over 800 GB memory for around nine months. Our error traces are available on the web [34]. As far as we know, they are the first (and so far only) publicly available memory hardware error traces with detailed error addresses and patterns.

One important discovery from our error traces is that non-transient errors are at least as significant a source of reliability concern as transient errors. In theory, permanent hardware errors, whose symptoms persist over

time, are easier to detect. Consequently they ought to present only a minimum threat to system reliability in an ideally-maintained environment. However, some nontransient errors are intermittent [10] (i.e., whose symptoms are unstable at times) and they are not necessarily easy to detect. Further, the system maintenance is hardly perfect, particularly for hardware errors that do not trigger obvious system failures. Given our discovery of nontransient errors in real-world production systems, a holistic dependability strategy needs to take into account their presence and error characteristics.

We conduct trace-driven studies to understand hardware error manifestations and their impact on the software system. First, we extrapolate the collected traces into general statistical error manifestation patterns. We then perform Monte Carlo simulations to learn the error rate and particularly error causes under different memory protection mechanisms (e.g., single-error-correcting ECC or stronger Chipkill ECC [12]). To achieve high confidence, we also study the sensitivity of our results to key parameters of our simulation model.

Further, we use a virtual machine-based error injection approach to study the error susceptibility of real software systems and applications. In particular, we discovered the previous conclusion that most memory hardware errors do not lead to incorrect software execution [11,26] is inappropriate for non-transient memory errors. We also validated the failure oblivious computing model [33] using our web server workload with injected non-transient errors.

2 Background

2.1 Terminology

In general, a fault is the cause of an error, and errors lead to service failures [23]. Precisely defining these terms (“fault”, “error”, and “failure”), however, can be “surprisingly difficult” [2], as it depends on the notion of the system and its boundaries. For instance, the consequence of reading from a defective memory cell (obtaining an erroneous result) can be considered as a failure of the memory subsystem, an error in the broader computer system, or it may not lead to any failure of the computer system at all if it is masked by subsequent processing. In our discussion, we use error to refer to the incidence of having incorrect memory content. The root cause of an error is the fault, which can be a particle impact, or defects in some part of the memory circuit. Note that an error does not manifest (i.e., it is a latent error) until the corrupt location is accessed.

An error may involve more than a single bit. Specifically, we count all incorrect bits due to the same root cause as part of one error. This is different from the con

cept of a multi-bit error in the ECC context, in which case the multiple incorrect bits must fall into a single ECC word. To avoid confusions we call these errors wordwise multi-bit instead.

Transient memory errors are those that do not persist and are correctable by software overwrites or hardware scrubbing. They are usually caused by temporary environmental factors such as particle strikes from radioactive decay and cosmic ray-induced neutrons. Nontransient errors, on the other hand, are often caused (at least partially) by inherent manufacturing defect, insuf- ficient burn-in, or device aging [6]. Once they manifest, they tend to cause more predictable errors as the deterioration is often irreversible. However, before transitioning into permanent errors, they may put the device into a marginal state causing apparently intermittent errors.

A Realistic Evaluation of Memory Hardware Errors and Software System Susceptibility的更多相关文章

Methods and Systems for Enhancing Hardware Transactions Using Hardware Transactions in Software Slow-Path
Hybrid transaction memory systems and accompanying methods. A transaction to be executed is received ...
Virtualizing physical memory in a virtual machine system
A processor including a virtualization system of the processor with a memory virtualization support ...
Rochester Memory Hardware Error Research Project
http://www.cs.rochester.edu/research/os/memerror/
[转]How to build a data storage and VM Server using comodity hardware and free software
Source: http://learnandremember.blogspot.jp/2010_01_01_archive.html Requisites: 1) RAID protection f ...
Programming In hardware Programming in software
COMPUTER ORGANIZATION AND ARCHITECTURE DESIGNING FOR PERFORMANCE NINTH EDITION
Flipping Bits in Memory Without Accessing Them: An Experimental Study of DRAM Disturbance Errors
目录 . Rowhammer Introduction . Rowhammer Principle . Track And Fix 1. rowhammer introduction 今天的DRAM ...
Bit error testing and training in double data rate (ddr) memory system
DDR PHY interface bit error testing and training is provided for Double Data Rate memory systems. An ...
memory ordering 内存排序
Memory ordering - Wikipedia https://en.wikipedia.org/wiki/Memory_ordering https://zh.wikipedia.org/w ...
Understanding Virtual Memory
Understanding Virtual Memory by Norm Murray and Neil Horman Introduction Definitions The Life of a P ...

随机推荐

win8 鼠标失灵解决办法
前几天也没更新,却不知道突然win8 pro 失灵了,是不是ms 后台运行的也不确定,不过更新之后就可以用了. 经供参考: 更新前: 更新后: 我的主板是华硕的,有时候需要重启几次鼠标才显示出来
Android UI组件学习
android.view.View类是全部UI组件的父类. 如果一些属性的内容本类找不到的时候一定要到父类之中进行查找. 所谓的学习组件的过程就是一个文档的查找过程. ※ Android之中所有的组件 ...
建模算法（二）——整数规划
一.概述 1.定义:规划中变量部分或全部定义成整数是,称为整数规划. 2.分类:纯整数规划和混合整数规划. 3.特点: (1)原线性规划有最优解,当自变量限制为整数后: a.原最优解全是整数,那最优解 ...
Android R文件相关问题
今天遇到的问题,gen下没有自动生成文件,而大部分java文件中错误是找不到R.java.“R cannot be resolved to a variable” 这就一定有别的原因造成错误, ...
在mysql数据库中制作千万级测试表
在mysql数据库中制作千万级测试表前言: 最近准备深入的学一下mysql,包括各种引擎的特性.性能优化.分表分库等.为了方便测试性能.分表等工作,就需要先建立一张比较大的数据表.我这里准备先建一张 ...
win7下loadrunner创建mysql数据库参数化问题解决
问题现象: 安装mysql数据源驱动后,lr创建mysql驱动程序列表没有安装的驱动程序: 安装完mysql ODBC数据源后 2.在控制面板-数据源(ODBC) 3.创建mysql数据源: 4.从l ...
安装Kali Linux操作系统Kali Linux无线网络渗透
安装Kali Linux操作系统Kali Linux无线网络渗透 Kali Linux是一个基于Debian的Linux发行版,它的前身是BackTrack Linux发行版.在该操作系统中,自带了大 ...
递推DP URAL 1225 Flags
题目传送门 /* 1 r; 2 b; 3 w 2不能在最前面,所以dp[1] = 2; dp[2] = 2: 13 or 31 dp[i] = dp[i-1] + dp[i-2]; 只加1或3时,总数 ...
BZOJ1807 : [Ioi2007]Pairs 彼此能听得见的动物对数
一维的情况: 排序后维护一个单调指针即可,时间复杂度$O(n\log n)$. 二维的情况: 旋转坐标系后转化为二维数点问题,扫描线+树状数组维护即可,时间复杂度$O(n\log n)$. 三维的情况 ...
移动互联网终端的touch事件,touchstart, touchend, touchmove
如果我们允许用户在页面上用类似桌面浏览器鼠标手势的方式来控制WEB APP,这个页面上肯定是有很多可点击区域的,如果用户触摸到了那些可点击区域怎么办呢??诸如智能手机和平板电脑一类的移动设备通常会有一 ...

A Realistic Evaluation of Memory Hardware Errors and Software System Susceptibility

A Realistic Evaluation of Memory Hardware Errors and Software System Susceptibility的更多相关文章

随机推荐

热门专题