ECE/CS 554

Fault-Tolerant and Testable Computing Systems

Spring 2024

Professor Daniel J. Sorin

                  

 Course Objective and Content

 

Objective: To provide students with an understanding of fault-tolerant computers, including both the theory of how to design and evaluate them and practical knowledge of real fault-tolerant systems.

Content: The main themes of this course are: technological reasons for faults, fault models, information redundancy, spatial redundancy, backward and forward error recovery, fault-tolerant hardware and software, modeling and analysis, testing, and design for test.

The course includes a project that will allow the students to apply what they have learned in class.

Prerequisites: ECE/CS 250 or consent of instructor.

 

Class Location and Hours

 

Class meets M/W/F from 10:20am - 11:10am.

Location: Hudson Hall 208 (not 115A)

 Instructor

 

Professor Daniel J. Sorin

Office: 403 Wilkinson

Office Hours: TBD

 Textbook

 

There is no official textbook for this course. However, I am somewhat partial to the following book, which Duke students can download for free (you need a Duke IP address or a connection to the Duke VPN):

Daniel J. Sorin. "Fault Tolerant Computer Architecture." Synthesis Lectures on Computer Architecture, Morgan & Claypool Publishers, 2009.

 Assignments and Grading

Students are responsible for:

The project is a significant assignment that requires:

Late homework policy -- NO exceptions, NO extensions (except in case of dean's excuse or official Duke-sanctioned travel)

    0-24 hours late:  Take earned score and multiply by 0.9
    24-48 hours late: Take earned score and multiply by 0.8
    >48 hours late: NO CREDIT
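As a rough illustration of the multipliers above, here is a minimal sketch in Python. The function name and the handling of the exact 24-hour and 48-hour boundaries are assumptions (the policy lists overlapping ranges, and this sketch applies the smaller penalty at each boundary). Excused cases such as a dean's excuse are not modeled.

```python
def late_homework_score(earned: float, hours_late: float) -> float:
    """Apply the late homework multipliers to an earned score.

    Hypothetical helper for illustration only. Boundary handling
    (exactly 24 or 48 hours late) is an assumption, not stated policy.
    """
    if hours_late <= 0:
        return earned            # on time: full earned score
    if hours_late <= 24:
        return earned * 0.9      # 0-24 hours late
    if hours_late <= 48:
        return earned * 0.8      # 24-48 hours late
    return 0.0                   # more than 48 hours late: no credit
```

Under this reading, a homework that earned 100 points and arrived 30 hours late would be recorded as 80 points.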

Late project policy: NO CREDIT

Late paper summary policy: NO CREDIT 

Academic Misconduct: I will not tolerate academically dishonest work.  This includes cheating on the homework and exams and plagiarism on the project.  

Be careful on the project to cite prior work and to give proper credit to others' work. 

 

 Course Topics and Readings


I will post lecture notes (in PDF format) the night before I cover them in class.  Feel free to print them out and bring them to class with you.

Topic

Readings

Introduction: Terminology and Metrics

Faults and Their Causes

"IBM Experiments in Soft Fails in Computer Electronics" (Ziegler)

"Designing Reliable Systems from Unreliable Components: The Challenges of Transistor Variability and Degradation" (Borkar)

"A Large-Scale Study of Failures in High-Performance Computing Systems" (Schroeder and Gibson)

Error Detection and Recovery: General Concepts
     - Physical redundancy
     - Error detecting/correcting codes
     - Re-execution techniques
     - Backward error recovery

"The Teramac Custom Computer: Extending the Limits with Defect Tolerance" (Culbertson et al.)

"Use ECP, Not ECC, for Hard Failures in Resistive Memories" (Schechter et al.)

"Methuselah Flash: Rewriting Codes for Extra Long Storage Lifetime" (Mappouras et al.)

"AR-SMT: A Microarchitectural Approach to Fault Tolerance in Microprocessors" (Rotenberg) [optional but highly recommended]

"Razor: A Low-Power Pipeline Based on Circuit-Level Timing Speculation" (Ernst et al.) [you can skip the part on "counterflow pipelining"]

"A Survey of Rollback-Recovery Protocols in Message-Passing Systems" (Elnozahy et al.) [optional but highly recommended]

"End-to-End Arguments in System Design" (Saltzer et al.)

Error Detection and Recovery: Hardware Faults

"RAS Strategy for IBM S/390 G5 and G6" (Mueller et al.)

"DIVA: A Reliable Substrate for Deep Submicron Microarchitecture Design" (Austin)

"Argus: Low-Cost, Comprehensive Error Detection in Simple Cores" (Meixner, Bauer, and Sorin)

"Ultra Low-Cost Defect Protection for Microprocessor Pipelines" (Shyam et al.)

"Understanding the Propagation of Hard Errors to Software and Implications for Resilient System Design" (Li et al.)

"SWIFT: Software Implemented Fault Tolerance" (Reis et al.)

Error Detection and Recovery: Software Faults

"Proactive Management of Software Aging" (Castelli et al.)

"EXE: Automatically Generating Inputs of Death" (Cadar et al.) [you don't have to read sections 3 and 4 in depth]

"Web Search for a Planet: The Google Cluster Architecture" (Barroso et al.)

Fault Diagnosis and Self-Repair

"Detouring: Translating Software to Circumvent Hard Faults in Simple Cores" (Meixner and Sorin)

"mSWAT: Low-Cost Hardware Fault Detection and Diagnosis for Multicore Systems" (Hari et al.)

Modeling and Evaluation

"Modeling the Effect of Technology Trends on the Soft Error Rate of Combinational Logic" (Shivakumar et al.)

[OPTIONAL] "A Systematic Methodology to Compute the Architectural Vulnerability Factors for a High-Performance Microprocessor" (Mukherjee et al.)

[OPTIONAL] "The Impact of Technology Scaling on Lifetime Reliability" (Srinivasan et al.)

Testing, Design for Test, and Verification

"Validating the Pentium 4 Microprocessor" (Bentley and Gray)

"Engineering Trust with Semantic Guardians" (Wagner and Bertacco)

"IDDQ Test: Will It Survive the DSM Challenge?" (Sabade and Walker)

 

 Homework Assignments

On Gradescope

 Tentative Schedule (dates, topics, and papers subject to change)

 

 

 

Week

Monday

Wednesday

Friday

Jan 6

introduction; measuring fault tolerance

faults and their causes

Jan 13

faults and their causes

"IBM Experiments in Soft Fails"; "Designing Reliable Systems from Unreliable Components"

"A Large-Scale Study of Failures"

basic FT concepts

Jan 20

MLK DAY

basic FT concepts

physical redundancy; "The Teramac"

Jan 27

information redundancy

"ECP"; "Methuselah Flash"

re-execution; "AR-SMT"

Feb 3

"Razor"; backward error recovery

"End-to-End Arguments"

FT microprocessors, "RAS Strategy for IBM S/390"

Feb 10

"DIVA"

"Argus"

"Ultra Low-Cost Defect Protection"; "Understanding Propagation of Hard Errors"

Feb 17

"Software Implemented Fault Tolerance"
Project Proposals Due

FT memory, disks, networks

 

FT software; "Proactive Management of Software Aging"

Feb 24

"Generating Inputs of Death"

"The Google Cluster Architecture"

 

MIDTERM EXAM

Mar 3

diagnosis/self-repair

"Detouring"

"mSWAT"

Progress Reports Due

Mar 10

 

SPRING BREAK

Mar 17

modeling/evaluation

modeling/evaluation; "Modeling the Effect of Technology Trends"

modeling/evaluation; paper TBD

Mar 24

testing

testing

testing

Mar 31

design for test

"Validating the Pentium 4 Microprocessor"

"Semantic Guardians"

Apr 7

TBD

TBD

 

Project Reports Due

PROJECT PRESENTATIONS (depending on class size)

Apr 14

PROJECT PRESENTATIONS

(depending on class size)


 review for final

Reading period for graduate classes

Apr 21

READING PERIOD (grad and undergrad)