The goal of the REE (Remote Exploration and Experimentation) project is to enable the use of high-performance parallel processing on board spacecraft. Fault tolerance is the most challenging issue in implementing the REE system. The REE research group created a testbed to develop a strategy for achieving fault tolerance and system-level reliability in the space environment. The testbed will also be used to test, refine, and validate scalable architectures and system approaches to increased availability, reliability, fault tolerance, and power performance.
The REE system can experience a number of radiation-induced transient errors per day and a very small number of permanent faults over a multi-year mission. As components fail, the system must continue operating through graceful degradation. It must also provide fault detection and recovery so that applications can operate in the presence of faults. To better understand metrics such as the availability, reliability, and performability of the REE system, we need to develop models of its fault tolerance features.