ECE/CS 554 |
Fault-Tolerant and Testable Computing Systems |
Spring 2024 |
Professor Daniel J. Sorin |
Course Objective and Content |
Objective: To provide students with an understanding of fault-tolerant
computers, including both the theory of how to design and evaluate them and
practical knowledge of real fault-tolerant systems. |
Content: The main themes of this course are: technological reasons for
faults, fault models, information redundancy, spatial redundancy, backward
and forward error recovery, fault-tolerant hardware and software, modeling
and analysis, testing, and design for test. |
The course includes a project that allows students to apply what they
have learned in class. |
Prerequisites: ECE/CS 250 or consent of instructor. |
Class Location and Hours |
Class meets M/W/F from 10:20am - 11:10am.
Location: Hudson Hall 208 (not 115A)
Instructor |
Office: 403 Wilkinson
Office Hours: TBD
Textbook |
There is no official textbook for this course. However, I am somewhat partial
to the following book, which Duke students can download for free
(you need a Duke IP address or to be on the Duke VPN):
Daniel J. Sorin. "Fault Tolerant Computer Architecture." Synthesis Lectures
on Computer Architecture, Morgan & Claypool Publishers, 2009.
Assignments and Grading |
Students are responsible for:
The project is a significant assignment that requires: |
Late homework policy -- NO exceptions, NO extensions (except in case of dean's
excuse or official Duke-sanctioned travel)
0-24 hours late: Take earned score and multiply by 0.9
Late project policy: NO CREDIT
Late paper summary policy: NO CREDIT |
Academic Misconduct: I will not tolerate academically dishonest
work. This includes cheating on the homework and exams and plagiarism
on the project. |
Be careful on the project to cite prior work and to give proper
credit to others' work. |
Course Topics and Readings |
I will post lecture notes (in PDF format) the night before I cover them in
class. Feel free to print them out and bring them to class with you.
Homework Assignments |
On Gradescope
Tentative Schedule (dates, topics, and papers subject to change) |
Week | Monday | Wednesday | Friday |
Jan 6 | introduction; measuring fault tolerance | faults and their causes | |
Jan 13 | faults and their causes | "IBM Experiment in Soft Fails"; "Designing Reliable Systems from Unreliable Components" | "A Large-Scale Study of Failures"; basic FT concepts |
Jan 20 | MLK DAY | basic FT concepts | physical redundancy; "The Teramac" |
Jan 27 | information redundancy | "ECP"; "Methuselah Flash" | re-execution; "AR-SMT" |
Feb 3 | "Razor"; backward error recovery | "End-to-End Arguments" | FT microprocessors; "RAS Strategy for IBM S/390" |
Feb 10 | "DIVA" | "Argus" | "Ultra Low-Cost Defect Protection"; "Understanding Propagation of Hard Errors" |
Feb 17 | "Software Implemented Fault Tolerance" | FT memory, disks, networks | FT software; "Proactive Management of Software Aging" |
Feb 24 | "Generating Inputs of Death" | "The Google Cluster Architecture" | MIDTERM EXAM |
Mar 3 | diagnosis/self-repair | "Detouring"; "mSWAT" | Progress Reports Due |
Mar 10 | SPRING BREAK | | |
Mar 17 | modeling/evaluation | modeling/evaluation; "Modeling the Effect of Technology Trends" | modeling/evaluation; paper TBD |
Mar 24 | testing | testing | testing |
Mar 31 | design for test | "Validating the Pentium 4 Microprocessor" | "Semantic Guardians" |
Apr 7 | TBD | TBD; Project Reports Due | PROJECT PRESENTATIONS (depending on class size) |
Apr 14 | PROJECT PRESENTATIONS (depending on class size) | | Reading period for graduate classes |
Apr 21 | READING PERIOD (grad and undergrad) |