| ECE 254 / CPS 225 |
| Fault-Tolerant and Testable Computing Systems |
| Fall 2011 |
| Professor Daniel J. Sorin |
| Course Objective and Content |
| Objective: To provide students with an understanding of fault-tolerant computers, including both the theory of how to design and evaluate them and practical knowledge of real fault-tolerant systems. |
| Content: The main themes of this course are: technological reasons for faults, fault models, information redundancy, spatial redundancy, backward and forward error recovery, fault-tolerant hardware and software, modeling and analysis, testing, and design for test. |
| The course includes a project that will allow students to apply what they have learned in class. |
| Prerequisites: ECE 152 or CPS 104 or consent of instructor. |
| Class Location and Hours |
Class meets M/W/F from 10:20am - 11:10am.
Location: Hudson 222
| Instructor |
Office: 209C Hudson Hall
Office Hours: Mondays 11:15-noon, Thursdays 2-3
Email: 
| Textbook |
There is no official textbook for this course. However, I am somewhat partial to the following book, which Duke students can download for free (you need a Duke IP address):
Daniel J. Sorin. "Fault Tolerant Computer Architecture." Synthesis Lectures on Computer Architecture, Morgan & Claypool Publishers, 2009.
| Assignments and Grading |
Students are responsible for:
| The project is a significant assignment that requires: |
| Deadlines will be enforced except under extreme circumstances. I would prefer that you turn in something not quite done on the due date rather than waiting until after the deadline to try to finish it. Each day late will result in a 10% reduction of the grade given. |
| Academic Misconduct: I will not tolerate academically dishonest work. This includes cheating on the homework and exams and plagiarism on the project. Be careful on the project to cite prior work and to give proper credit to others' work. |
| Course Topics, Lecture Notes, and Readings |
| Homework Assignments |
Homework #1, Due Friday, Sept 16 (in class)
Homework #2, Due Friday, Sept 30 (in class)
Homework #3, Due Monday, Oct 31 (in class)
Homework #4, Due Tuesday, Nov 29 (under my office door)
| Tentative Schedule (subject to change) |
Note that there are several 75-minute classes. These will be held from 10:05-11:20.
| Week | Monday | Wednesday | Friday |
| Aug 29 | introduction | faults and their causes | faults and their causes |
| Sep 5 | "IBM Experiment in Soft Fails"; "Designing Reliable Systems" | "A Large-Scale Study of Failures"; "Why Do Internet Services Fail?" | basic FT concepts |
| Sep 12 | physical redundancy; "The Teramac" | information redundancy | "ECP"; "Memory Mapped ECC" |
| Sep 19 | re-execution; "AR-SMT" | "Razor"; backward error recovery | "End-to-End Arguments" |
| Sep 26 | FT microprocessors; "RAS Strategy for IBM S/390" | "DIVA" | review for midterm |
| Oct 3 | MIDTERM EXAM | "Argus" | "Ultra Low-Cost Defect Protection"; "Core Cannibalization" |
| Oct 10 | FALL BREAK | FT memory, disks, networks | FT software; "Software Implemented Fault Tolerance" |
| Oct 17 | "Understanding Propagation of Hard Errors" | "Proactive Management of Software Aging"; "Detouring" | "The Google Cluster Architecture" |
| Oct 24 | modeling/evaluation; "Modeling the Effect of Technology Trends" | modeling/evaluation; "Architectural Vulnerability" | modeling/evaluation |
| Oct 31 | modeling/evaluation | testing | testing |
| Nov 7 | testing | design for test | "Validating the Pentium 4 Microprocessor" |
| Nov 14 | "Semantic Guardians" | "IDDQ Test" | "Generating Inputs of Death" |
| Nov 21 | (slack) | THANKSGIVING | |
| Nov 28 | review for final | PROJECT PRESENTATIONS | PROJECT PRESENTATIONS |
| Dec 5 | READING PERIOD | | |