ECE/CS 554 |
Fault-Tolerant and Testable Computing Systems |
Spring 2015 |
Professor Daniel J. Sorin |
Course Objective and Content |
Objective: To provide students with an understanding of fault-tolerant computers, including both the theory of how to design and evaluate them and practical knowledge of real fault-tolerant systems. |
Content: The main themes of this course are: technological reasons for faults, fault models, information redundancy, spatial redundancy, backward and forward error recovery, fault-tolerant hardware and software, modeling and analysis, testing, and design for test. |
The course includes a project that will allow the students to apply what they have learned in class. |
Prerequisites: ECE/CS 250 or consent of instructor. |
Class Location and Hours |
Class meets M/W/F from 10:20am - 11:10am.
Location: Hudson 208
Instructor |
Office: 209C Hudson Hall
Office Hours: Monday 1-2pm, Thursday 2-3pm
Email:
Textbook |
There is no official textbook for this course. However, I am somewhat partial to the following book, which Duke students can download for free (you need a Duke IP address):
Daniel J. Sorin. "Fault Tolerant Computer Architecture." Synthesis Lectures on Computer Architecture, Morgan & Claypool Publishers, 2009.
Assignments and Grading |
Students are responsible for:
The project is a significant assignment that requires: |
Late homework policy -- NO exceptions, NO extensions (except in case of dean's excuse)
0-24 hours late: 10% penalty
Late project policy: NO CREDIT
Late paper summary policy: NO CREDIT |
Academic Misconduct: I will not tolerate academically dishonest work. This includes cheating on the homework and exams and plagiarism on the project. On the project, be careful to cite prior work and to give proper credit to others' work. |
Course Topics, Lecture Notes, and Readings |
Homework Assignments |
Homework #1, Due TBD
Tentative Schedule (subject to change) |
Week | Monday | Wednesday | Friday |
Jan 5 | introduction | faults and their causes | |
Jan 12 | faults and their causes | "IBM Experiment in Soft Fails"; "Designing Reliable Systems" | "A Large-Scale Study of Failures"; "Why Do Internet Services Fail?" |
Jan 19 | MLK DAY | basic FT concepts | physical redundancy; "The Teramac" |
Jan 26 | information redundancy | "ECP"; "Memory Mapped ECC" | re-execution; "AR-SMT" |
Feb 2 | "Razor"; backward error recovery | "End-to-End Arguments" | FT microprocessors, "RAS Strategy for IBM S/390" |
Feb 9 | "DIVA" | "Argus" | "Ultra Low-Cost Defect Protection"; "Core Cannibalization" |
Feb 16 | FT memory, disks, networks | FT software, "Software Implemented Fault Tolerance" | "Understanding Propagation of Hard Errors"; Project Proposals Due |
Feb 23 | "Proactive Management of Software Aging"; "Detouring" | "The Google Cluster Architecture" | NO CLASS |
Mar 2 | MIDTERM EXAM | modeling/evaluation, "Modeling the Effect of Technology Trends" | modeling/evaluation, "Architectural Vulnerability" |
Mar 9 | SPRING BREAK | | |
Mar 16 | modeling/evaluation | modeling/evaluation | testing |
Mar 23 | testing; Progress Reports Due | testing | design for test |
Mar 30 | NO CLASS | "Validating the Pentium 4 Microprocessor" | "Semantic Guardians" |
Apr 6 | "IDDQ Test" | "Generating Inputs of Death" | review for final |
Apr 13 | PROJECT PRESENTATIONS; Project Reports Due | PROJECT PRESENTATIONS | |
Apr 20 | READING PERIOD | | |
Apr 27 | FINAL EXAMS | | |