A New Focus for a New Century: Availability and Maintainability >> Performance

2/4/02


Table of Contents

A New Focus for a New Century: Availability and Maintainability >> Performance

Thanks to Darrell Long for FAST!

Outline

The past: research goals and assumptions of last 15 years

After 15 years of research on price-performance, what’s next?

Downtime Costs (per Hour)

Total Cost of Ownership Hypothesis

Cost of Ownership after 15 years of improving price-performance?

What have we learned from past projects?

Jim Gray: Trouble-Free Systems

Butler Lampson: Systems Challenges

John Hennessy: What Should the “New World” Focus Be?

IBM Research (10/15/2001)

Bill Gates M/S (1/15/2002): “Trustworthy Computing”

New research goals for a New Century: ACME

Where does ACME stand today?

ACME: Availability

ACME: Claims of 5 9s?

ACME: Uptime of HP.com?

“Microsoft fingers technicians for crippling site outages”

ACME: Learning from other fields: disasters

ACME: Learning from other fields: human error

ACME: The Automation Irony

Learning from other fields: Bridges

Summary: the present

Outline

Recovery-Oriented Computing Philosophy

ROC approach

ROC Part 1: Failure Data Lessons about human operators

Failure Data: Public Switched Telephone Network (PSTN) record

Blocked Calls: PSTN in 2000

Failure Data: 2 Internet Sites

Internet Site Failures

ROC Part 1: Failure Data Collection (so far)

ROC Part 2: ACME benchmarks

Availability benchmarking 101

Availability Benchmarking Environment

Example: 1 fault in SW RAID

Software RAID: QoS behavior

ROC Part 2: ACME Benchmarks (so far)

ROC Part 3: Margin of Safety in CS&E?

ROC Part 4: Create and Evaluate Techniques to help ACME

Safe, forgiving space for operator?

Partitioning and Redundancy?

Geographic distribution, Paired Sites

Input Insertion for Detection?

Aid Diagnosis?

Automation vs. Aid?

Refresh via Restart?

Support Operator Trial and Error?

Undo for Sysadmin

Summary: from ACME to ROC

Interested in ROCing?

BACKUP SLIDES

A science fiction analogy: Autonomic vs. ROC

Outage Report

TCO breakdown (average)

Internet x86/Linux Breakdown

Evaluating ROC: human aspects

Example results: software RAID (2)

Lessons Learned from Other Cultures

Author: conference

Email: patterson@cs.berkeley.edu
