Duke FaultFinder Project

Duke University Department of Electrical and Computer Engineering

Directed by Prof. Daniel J. Sorin

 Objective 
Our goal is to improve the availability and reliability of computer architectures. We are developing novel, low-cost mechanisms for comprehensive error detection, fault diagnosis, and reconfiguration in response to faults.  

 

 Publications
Patrick J. Eibl, Albert Meixner, and Daniel J. Sorin. "An FPGA-Based Experimental Evaluation of Microprocessor Core Error Detection with Argus-2." SIGMETRICS, June 2011.

Dimitris Gizopoulos, Mihalis Psarakis, Sarita V. Adve, Pradeep Ramachandran, Siva Kumar Sastry Hari, Daniel Sorin, Albert Meixner, Arijit Biswas, and Xavier Vera. "Architectures for Online Error Detection and Recovery in Multicore Processors." Design, Automation & Test in Europe (DATE), March 2011.

Bogdan F. Romanescu, Alvin R. Lebeck, and Daniel J. Sorin. "Address Translation-Aware Memory Consistency." IEEE Micro: Micro's Top Picks from Computer Architecture Conferences, January/February 2011.

Ralph Nathan and Daniel J. Sorin. "Argus-G: A Low-Cost Error Detection Scheme for GPGPUs." Workshop on Resilient Architectures (WRA), December 2010.
Bogdan F. Romanescu, Alvin R. Lebeck, and Daniel J. Sorin. "Specifying and Dynamically Verifying Address Translation-Aware Memory Consistency." 15th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS 2010), March 2010.
Patrick J. Eibl, Daniel J. Sorin, and Andrew D. Cook. "Reduced Precision Checking for a Floating Point Adder." 24th IEEE International Symposium on Defect and Fault Tolerance in VLSI Systems, October 2009.
Michael E. Bauer, Albert Meixner, and Daniel J. Sorin. "Proving the Completeness of the Composition of Two Dynamic Verification Techniques." Technical Report Duke-ECE-2009-1, May 2009. 
Albert Meixner and Daniel J. Sorin. "Dynamic Verification of Memory Consistency in Cache-Coherent Multithreaded Computer Architectures." IEEE Transactions on Dependable and Secure Computing (TDSC), volume 6, number 1, January-March 2009. 
Bogdan F. Romanescu and Daniel J. Sorin. "Core Cannibalization Architecture: Improving Lifetime Chip Performance for Multicore Processors in the Presence of Hard Faults." Seventeenth International Conference on Parallel Architectures and Compilation Techniques (PACT), October 2008. 
Albert Meixner and Daniel J. Sorin. "Detouring: Translating Software to Circumvent Hard Faults in Simple Cores." 38th Annual International Conference on Dependable Systems and Networks, June 2008. 
Albert Meixner and Daniel J. Sorin. "IOTA: Detecting Erroneous I/O Behavior via I/O Transaction Auditing." First Workshop on Compiler and Architectural Techniques for Application Reliability and Security (CATARS), June 2008. 
Fred A. Bower, Daniel J. Sorin, and Landon P. Cox. "The Impact of Dynamically Heterogeneous Multicore Processors on Thread Scheduling." IEEE Micro, May/June 2008. 
Albert Meixner, Michael E. Bauer, and Daniel J. Sorin. "Argus: Low-Cost, Comprehensive Error Detection in Simple Cores." IEEE Micro: Micro's Top Picks from Computer Architecture Conferences, January/February 2008. 
Albert Meixner, Michael E. Bauer, and Daniel J. Sorin. "Argus: Low-Cost, Comprehensive Error Detection in Simple Cores." 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), December 2007. 
Mahmut Yilmaz, Sule Ozev, and Daniel J. Sorin. "Low-Cost Run-time Diagnosis of Hard Delay Faults in the Functional Units of a Microprocessor." IEEE International Conference on Computer Design (ICCD), October 2007. 
Mahmut Yilmaz, Albert Meixner, Sule Ozev, and Daniel J. Sorin. "Lazy Error Detection for Microprocessor Functional Units." IEEE International Symposium on Defect and Fault Tolerance in VLSI Systems (DFTS), September 2007. 
Anita Lungu and Daniel J. Sorin. "Verification-Aware Microprocessor Design." Sixteenth International Conference on Parallel Architectures and Compilation Techniques (PACT), September 2007. 
Albert Meixner and Daniel J. Sorin. "Error Detection Using Dynamic Dataflow Verification." Sixteenth International Conference on Parallel Architectures and Compilation Techniques (PACT), September 2007. 
Fred A. Bower, Daniel J. Sorin, and Sule Ozev. "Online Diagnosis of Hard Faults in Microprocessors." ACM Transactions on Architecture and Code Optimization (TACO), volume 4, number 2, June 2007. 
Albert Meixner and Daniel J. Sorin. "Error Detection via Online Checking of Cache Coherence with Token Coherence Signatures." 13th International Symposium on High-Performance Computer Architecture (HPCA), February 2007. 
Mahmut Yilmaz, Derek R. Hower, Sule Ozev, and Daniel J. Sorin. "Self-Detecting and Self-Diagnosing 32-bit Microprocessor Multiplier." International Test Conference (ITC), October 2006. 
Nathan N. Sadler and Daniel J. Sorin. "Choosing an Error Protection Scheme for a Microprocessor's L1 Data Cache." International Conference on Computer Design (ICCD), October 2006. 
Albert Meixner and Daniel J. Sorin. "Dynamic Verification of Memory Consistency in Cache-Coherent Multithreaded Computer Architectures." International Conference on Dependable Systems and Networks (DSN), June 2006. 
Fred A. Bower, Derek R. Hower, Mahmut Yilmaz, Daniel J. Sorin, and Sule Ozev. "Applying Architectural Vulnerability Analysis to Hard Faults in the Microprocessor." Poster and 2-page paper in ACM SIGMETRICS, June 2006. 
Fred A. Bower, Sule Ozev, and Daniel J. Sorin. "Autonomic Microprocessor Execution via Self-Repairing Arrays." IEEE Transactions on Dependable and Secure Computing, October-December 2005. 
Fred A. Bower, Daniel J. Sorin, and Sule Ozev. "A Mechanism for Online Diagnosis of Hard Faults in Microprocessors." 38th Annual International Symposium on Microarchitecture (MICRO), November 2005. 
Albert Meixner and Daniel J. Sorin. "Dynamic Verification of Sequential Consistency." International Symposium on Computer Architecture (ISCA), June 2005. 
Fred A. Bower, Paul G. Shealy, Sule Ozev, and Daniel J. Sorin. "Tolerating Hard Faults in Microprocessor Array Structures." International Conference on Dependable Systems and Networks (DSN), June 2004. 
Jonathan R. Carter, Sule Ozev, and Daniel J. Sorin. "Circuit-Level Modeling for Concurrent Testing of Operational Defects due to Gate Oxide Breakdown." Design, Automation, and Test in Europe (DATE), March 2004. 
Daniel J. Sorin, Mark D. Hill, and David A. Wood. "Dynamic Verification of End-to-End Multiprocessor Invariants." International Conference on Dependable Systems and Networks (DSN), June 2003.

 

 Group
Prof. Daniel J. Sorin, Associate Professor of Electrical and Computer Engineering and Computer Science

Fred Bower

Albert Meixner

Anita Lungu

Funding and Support
National Science Foundation grants CCF-0444516 and CCR-0309164
National Aeronautics and Space Administration grant NNG04GQ06G
Toyota InfoTechnology Center
Intel Corporation

 

 Links
Duke Architecture
WWW Computer Architecture Page
Duke's Pratt School of Engineering