ECE 254 / CPS 225

Fault-Tolerant and Testable Computing Systems

Fall 2011
Professor Daniel J. Sorin

                  

 Course Objective and Content
Objective: To provide students with an understanding of fault-tolerant computers, including both the theory of how to design and evaluate them and practical knowledge of real fault-tolerant systems.

Content: The main themes of this course are: technological reasons for faults, fault models, information redundancy, spatial redundancy, backward and forward error recovery, fault-tolerant hardware and software, modeling and analysis, testing, and design for test.

The course includes a project that will allow the students to apply what they have learned in class.

Prerequisites: ECE 152 or CPS 104 or consent of instructor.

 Class Location and Hours

Class meets M/W/F from 10:20am - 11:10am.

Location: Hudson 222

 Instructor

 

Professor Daniel J. Sorin

Office: 209C Hudson Hall

Office Hours: Mondays 11:15-noon, Thursdays 2-3


 Textbook

 

There is no official textbook for this course.  However, I am somewhat partial to the following book that can be downloaded for free by Duke students (you need a Duke IP address):

Daniel J. Sorin. "Fault Tolerant Computer Architecture." Synthesis Lectures on Computer Architecture, Morgan & Claypool Publishers, 2009.

 Assignments and Grading

Students are responsible for:

The project is a significant assignment that requires:
Deadlines will be enforced except under extreme circumstances. I would prefer that you turn in something not quite done on the due date rather than waiting until after the deadline to try to finish it. Each day late will result in a 10% reduction of the grade given.


Academic Misconduct: I will not tolerate academically dishonest work. This includes cheating on the homework and exams and plagiarism on the project. Be careful on the project to cite prior work and to give proper credit to others' work.

 Course Topics, Lecture Notes, and Readings

I will post lecture notes (in PDF format) the night before I cover them in class.  Feel free to print them out and bring them to class with you.

Topic
Readings
Introduction: Terminology and Metrics

Faults and Their Causes
"IBM Experiments in Soft Fails in Computer Electronics" (Ziegler),

"Designing Reliable Systems from Unreliable Components: The Challenges of Transistor Variability and Degradation" (Borkar),

"A Large-Scale Study of Failures in High-Performance Computing Systems" (Schroeder and Gibson),

"Why Do Internet Services Fail, and What Can be Done About It?" (Oppenheimer et al.)
General Fault Tolerance Concepts
     - Physical redundancy
     - Error detecting/correcting codes
     - Re-execution techniques
     - Backward error recovery

Slides - part 1

Slides - part 2

Slides - part 3

"The Teramac Custom Computer: Extending the Limits with Defect Tolerance" (Culbertson et al.),

"Use ECP, Not ECC, for Hard Failures in Resistive Memories" (Schechter et al.),

"Memory Mapped ECC: Low-Cost Error Protection for Last Level Caches" (Yoon and Erez),

"Razor: A Low-Power Pipeline Based on Circuit-Level Timing Speculation" (Ernst et al.) [you can skip the part on "counterflow pipelining"],

"AR-SMT: A Microarchitectural Approach to Fault Tolerance in Microprocessors" (Rotenberg),

"A Survey of Rollback-Recovery Protocols in Message-Passing Systems" (Elnozahy et al.),

"End-to-End Arguments in System Design" (Saltzer et al.)

Using Hardware to Tolerate Faults



"RAS Strategy for IBM S/390 G5 and G6" (Mueller et al.),

"DIVA: A Reliable Substrate for Deep Submicron Microarchitecture Design" (Austin),

"Argus: Low-Cost, Comprehensive Error Detection in Simple Cores" (Meixner, Bauer, and Sorin),

"Ultra Low-Cost Defect Protection for Microprocessor Pipelines" (Shyam et al.),

"Core Cannibalization Architecture: Improving Lifetime Chip Performance for Multicore Processors in the Presence of Hard Faults" (Romanescu and Sorin)

Using Software to Tolerate Faults

"SWIFT: Software Implemented Fault Tolerance" (Reis et al.),

"Understanding the Propagation of Hard Errors to Software and Implications for Resilient System Design" (Li et al.),

"Proactive Management of Software Aging" (Castelli et al.),

"Detouring: Translating Software to Circumvent Hard Faults in Simple Cores" (Meixner and Sorin),

"Web Search for a Planet: The Google Cluster Architecture" (Barroso et al.)

Modeling and Evaluation

"Modeling the Effect of Technology Trends on the Soft Error Rate of Combinational Logic" (Shivakumar et al.),

[OPTIONAL] "A Systematic Methodology to Compute the Architectural Vulnerability Factors for a High-Performance Microprocessor" (Mukherjee et al.),

[OPTIONAL] "The Impact of Technology Scaling on Lifetime Reliability" (Srinivasan et al.)
Testing, Design for Test, and Verification

"Validating the Pentium 4 Microprocessor" (Bentley and Gray),

"Engineering Trust with Semantic Guardians" (Wagner and Bertacco),

"IDDQ Test: Will It Survive the DSM Challenge?" (Sabade and Walker),

"EXE: Automatically Generating Inputs of Death" [you don't have to read sections 3 and 4 in depth] (Cadar et al.)

 
 Homework Assignments

Homework #1, Due Friday, Sept 16 (in class)

Homework #2, Due Friday, Sept 30 (in class)

Homework #3, Due Monday, Oct 31 (in class)

Homework #4, Due Tuesday, Nov 29 (under my office door)

 Tentative Schedule (subject to change)

 

Note that there are several 75-minute classes.  These will be held from 10:05-11:20.

 

Week
Monday Wednesday Friday
Aug 29
introduction faults and their causes faults and their causes
Sep 5 "IBM Experiments in Soft Fails"; "Designing Reliable Systems" "A Large-Scale Study of Failures"; "Why Do Internet Services Fail?" basic FT concepts
Sep 12 physical redundancy; "The Teramac" information redundancy;  "ECP"; "Memory Mapped ECC"
Sep 19 re-execution; "AR-SMT" "Razor"; backward error recovery "End-to-End Arguments"
Sep 26 FT microprocessors, "RAS Strategy for IBM S/390" "DIVA" review for midterm
Oct 3
MIDTERM EXAM "Argus" "Ultra Low-Cost Defect Protection"; "Core Cannibalization"
Oct 10
FALL BREAK FT memory, disks, networks FT software, "Software Implemented Fault Tolerance"
Oct 17
"Understanding Propagation of Hard Errors";  "Proactive Management of Software Aging"; "Detouring" "The Google Cluster Architecture"
Oct 24
modeling/evaluation, "Modeling the Effect of Technology Trends" modeling/evaluation, "Architectural Vulnerability" modeling/evaluation
Oct 31
modeling/evaluation testing testing
Nov 7
testing design for test "Validating the Pentium 4 Microprocessor"
Nov 14
"Semantic Guardians" "IDDQ Test" "Generating Inputs of Death"
Nov 21
(slack) THANKSGIVING
Nov 28
review for final PROJECT PRESENTATIONS PROJECT PRESENTATIONS
Dec 5 READING PERIOD