Fault-tolerance, or graceful degradation, is the property of a system that continues operating properly when some of its parts fail. If its operating quality decreases at all, the decrease is proportional to the severity of the failure, in contrast to a naively designed system, in which even a small failure can cause total breakdown. Fault-tolerance is particularly sought after in high-availability or life-critical systems.
Fault-tolerance is not just a property of individual machines; it may also characterise the rules by which they interact. For example, the Transmission Control Protocol (TCP) is designed to allow reliable two-way communication in a packet-switched network, even when the communications links are imperfect or overloaded. It does this by requiring the endpoints of the communication to expect packet loss, duplication, reordering and corruption, so that these conditions do not damage data integrity and only reduce throughput by a proportional amount.
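The receiving side of such a protocol can be sketched as follows. This is a toy illustration, not TCP itself: the names `make_segment` and `Receiver` are hypothetical, sequence numbers count whole segments rather than bytes, and CRC32 stands in for TCP's real checksum. It shows how an endpoint that expects loss, duplication, reordering and corruption can still deliver intact, in-order data.

```python
import zlib

def make_segment(seq, payload):
    """Build a (seq, payload, checksum) tuple. CRC32 is an assumption here,
    standing in for the checksum a real protocol would use."""
    return (seq, payload, zlib.crc32(payload))

class Receiver:
    """Toy in-order receiver (hypothetical, not a TCP implementation)."""

    def __init__(self):
        self.next_seq = 0     # next sequence number owed to the application
        self.buffer = {}      # out-of-order segments waiting for a gap to fill
        self.delivered = []   # payloads handed to the application, in order

    def receive(self, segment):
        seq, payload, checksum = segment
        if zlib.crc32(payload) != checksum:
            # Corruption: drop the segment; a retransmission is expected later.
            return self.next_seq
        if seq >= self.next_seq and seq not in self.buffer:
            # New data. Duplicates and already-delivered segments are ignored.
            self.buffer[seq] = payload
        while self.next_seq in self.buffer:
            # Reordering: deliver only once the run is contiguous.
            self.delivered.append(self.buffer.pop(self.next_seq))
            self.next_seq += 1
        # Cumulative acknowledgement: a lost segment shows up as a repeated value.
        return self.next_seq
```

Feeding this receiver segments that arrive out of order, twice, or with a bad checksum leaves `delivered` unchanged apart from the delay; only throughput suffers, which is the behaviour the paragraph above describes.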
Data formats may also be designed to degrade gracefully. HTML, for example, is designed to be forward compatible, allowing new HTML elements to be ignored by browsers that do not understand them, without making the document unusable.
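A minimal sketch of this behaviour, using Python's standard `html.parser` module: the renderer below knows only a small, hypothetical set of tags (`KNOWN_TAGS`) and silently skips any tag outside it while keeping the enclosed text. The class and function names are illustrative assumptions, not part of any browser.

```python
from html.parser import HTMLParser

# Hypothetical set of tags that this "old browser" understands.
KNOWN_TAGS = {"p", "b", "i"}

class LenientRenderer(HTMLParser):
    """Keeps known tags and all text; ignores tags it does not recognise."""

    def __init__(self):
        super().__init__()
        self.out = []       # pieces of the rendered document
        self.skipped = []   # unknown tags that were ignored

    def handle_starttag(self, tag, attrs):
        if tag in KNOWN_TAGS:
            self.out.append(f"<{tag}>")
        else:
            self.skipped.append(tag)   # unknown tag: drop it, keep going

    def handle_endtag(self, tag):
        if tag in KNOWN_TAGS:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)          # text survives even inside unknown tags

def render(document):
    """Return (rendered_text, ignored_tags) for an HTML fragment."""
    renderer = LenientRenderer()
    renderer.feed(document)
    renderer.close()
    return "".join(renderer.out), renderer.skipped

rendered, ignored = render("<p>Hello <sparkle>new</sparkle> world</p>")
```

Here `sparkle` is a made-up tag standing in for an element introduced after the renderer was written. The document stays readable because the unknown markup is dropped rather than treated as a fatal error.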
Recovery from errors in fault-tolerant systems can be characterised as either roll-forward or roll-back. When the system detects that it has made an error, roll-forward recovery takes the system state at that time and corrects it, so that the system can move forward. Roll-back recovery reverts the system state to some earlier, correct version, for example using checkpointing, and moves forward from there. Roll-back recovery requires that the operations between the checkpoint and the detected erroneous state can be made idempotent. Some systems make use of both roll-forward and roll-back recovery for different errors or different parts of one error.
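The two recovery styles can be sketched on a toy key-value store. Everything here (`RecoverableStore`, its methods, the validity rule passed to `roll_forward`) is a hypothetical illustration. The only operation is "set key to value", chosen because it is idempotent, so replaying the log after a roll-back is safe.

```python
import copy

class RecoverableStore:
    """Toy store supporting checkpointing, roll-back and roll-forward."""

    def __init__(self):
        self.state = {}
        self._saved = {}   # last checkpoint
        self.log = []      # operations applied since the last checkpoint

    def checkpoint(self):
        """Freeze a copy of the current state and start a fresh log."""
        self._saved = copy.deepcopy(self.state)
        self.log.clear()

    def apply(self, key, value):
        """Idempotent operation: applying it twice gives the same state."""
        self.state[key] = value
        self.log.append((key, value))

    def roll_back(self):
        """Revert to the checkpoint, then replay the logged operations."""
        self.state = copy.deepcopy(self._saved)
        for key, value in self.log:
            self.state[key] = value

    def roll_forward(self, is_valid, repair):
        """Correct the current state in place instead of reverting it."""
        self.state = {
            key: value if is_valid(value) else repair(value)
            for key, value in self.state.items()
        }
```

A fault is simulated by writing to `state` directly, bypassing `apply`. Calling `roll_back()` then restores the checkpoint and replays the log, whereas `roll_forward(is_valid, repair)` keeps the current state and patches only the entries that fail the caller-supplied check.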
Within the scope of an individual system, fault-tolerance can be achieved by anticipating exceptional conditions and building the system to cope with them and, in general, by aiming for self-stabilization so that the system converges towards an error-free state. However, if the consequences of a system failure are catastrophic, or the cost of making it sufficiently reliable is very high, a better solution may be to use some form of duplication.
Fault-tolerance by duplication
Duplication can give fault-tolerance in three ways:
- Replication: Providing multiple identical instances of the same system, directing tasks or requests to all of them in parallel, and choosing the correct result on the basis of a quorum;
- Redundancy: Providing multiple identical instances of the same system and switching to one of the remaining instances in case of a failure (fall-back or backup);
- Diversity: Providing multiple different implementations of the same specification, and using them like replicated systems to cope with errors in a specific implementation.
A redundant array of independent disks (RAID) is an example of a fault-tolerant storage device that uses redundancy.
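The three forms of duplication can be sketched with plain functions standing in for system instances. The helper names `replicated_call` and `redundant_call` are hypothetical, and the sketch assumes that a faulty replica returns a wrong value while a failed instance raises an exception.

```python
from collections import Counter

def replicated_call(replicas, request):
    """Replication: send the request to every replica and accept the answer
    that a strict majority (the quorum) agrees on."""
    results = [replica(request) for replica in replicas]
    answer, votes = Counter(results).most_common(1)[0]
    if votes <= len(replicas) // 2:
        raise RuntimeError("no quorum")
    return answer

def redundant_call(instances, request):
    """Redundancy: use the first instance and fall back to the next one
    whenever an instance fails (here: raises an exception)."""
    for instance in instances:
        try:
            return instance(request)
        except Exception:
            continue   # this instance failed; switch to the next one
    raise RuntimeError("all instances failed")

# Diversity: different implementations of the same specification ("square x"),
# used like replicas. The third one contains a deliberate bug.
def square_by_multiplying(x):
    return x * x

def square_by_power(x):
    return x ** 2

def square_buggy(x):
    return x + x

def unavailable(x):
    raise ConnectionError("instance is down")

voted = replicated_call([square_by_multiplying, square_by_power, square_buggy], 3)
fallback = redundant_call([unavailable, square_by_multiplying], 4)
```

In the voted call the two correct implementations outvote the buggy one; in the fallback call the failed first instance is skipped and the second one answers.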
A lockstep fault-tolerant machine uses replicated elements operating in parallel. At any time, all the replications of each element should be in the same state. The same inputs are provided to each replication, and the same outputs are expected. The outputs of the replications are compared using a voting circuit. A machine with two replications of each element is termed dual modular redundant (DMR). The voting circuit can then only detect a mismatch, and recovery relies on other methods. A machine with three replications of each element is termed triple modular redundant (TMR). The voting circuit can determine which replication is in error when a two-to-one vote is observed. In this case, the voting circuit can output the correct result and discard the erroneous version. After this, the internal state of the erroneous replication is assumed to be different from that of the other two, and the voting circuit can switch to a DMR mode. This model can be applied to any larger number of replications.
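The voting logic can be sketched in software. The function names and the `MismatchError` class are hypothetical, and a real voting circuit is hardware; `bitwise_majority` shows the usual two-out-of-three gate expression applied to whole integers.

```python
class MismatchError(Exception):
    """Raised when replications disagree and no majority exists."""

def dmr_compare(a, b):
    """DMR: two outputs can only be checked for agreement. A mismatch is
    detected, but the comparator cannot tell which replication is wrong."""
    if a != b:
        raise MismatchError("replications disagree")
    return a

def tmr_vote(a, b, c):
    """TMR: return (majority_output, index_of_faulty_replication).
    The index is None when all three replications agree."""
    if a == b == c:
        return a, None
    if a == b:
        return a, 2
    if a == c:
        return a, 1
    if b == c:
        return b, 0
    raise MismatchError("all three replications disagree")

def bitwise_majority(a, b, c):
    """Each output bit is 1 when at least two of the three input bits are 1."""
    return (a & b) | (a & c) | (b & c)

# Two-to-one vote: the third replication is identified as erroneous.
output, faulty = tmr_vote(7, 7, 9)

# After that, the voter can continue in DMR mode with the two good replications.
still_ok = dmr_compare(7, 7)
```

With inputs `7, 7, 9` the voter outputs the majority value and reports index 2 as the replication in error, after which only mismatch detection remains available.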
Lockstep fault-tolerant machines are most easily made fully synchronous, with each gate of each replication making the same state transition on the same edge of the clock, and the clocks to the replications being exactly in phase. However, it is possible to build lockstep systems without this requirement.
Bringing the replications into synchrony requires making their internal stored states the same. They can be started from a fixed initial state, such as the reset state. Alternatively, the internal state of one replication can be copied to another.
One variant of DMR is pair-and-spare. Two replicated elements operate in lockstep as a pair, with a voting circuit that detects any mismatch between their operations and outputs a signal indicating that there is an error. A second pair operates in exactly the same way. A final circuit selects the output of the pair that does not report an error. Pair-and-spare requires four replications rather than the three of TMR, but has been used commercially.
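Pair-and-spare can be sketched in the same style. The names `run_pair` and `pair_and_spare` are hypothetical, and the pairs are evaluated one after the other here, whereas in hardware both pairs would run simultaneously.

```python
def run_pair(element_a, element_b, inputs):
    """One lockstep pair: feed both elements the same inputs and return
    (output, error), where error is True when their outputs differ."""
    out_a = element_a(inputs)
    out_b = element_b(inputs)
    return out_a, out_a != out_b

def pair_and_spare(primary, spare, inputs):
    """Select the output of the first pair that does not report an error.
    `primary` and `spare` are each a 2-tuple of replicated elements."""
    for element_a, element_b in (primary, spare):
        output, error = run_pair(element_a, element_b, inputs)
        if not error:
            return output
    raise RuntimeError("both pairs report an error")

def good(x):
    return x + 1

def faulty(x):
    return x - 1   # simulated fault: this element computes the wrong result

# The primary pair disagrees internally, so the spare pair's output is used.
result = pair_and_spare((good, faulty), (good, good), 1)
```

Note the limitation this inherits from DMR: a pair whose two elements fail in the same way would not report an error.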
See also
- The Byzantine Generals problem and several solutions were originally described by Lamport, Shostak, and Pease in ACM Transactions on Programming Languages and Systems in 1982.
External links
- Implementing Fault Tolerance on Windows Networks: a high-level survey of the different fault-tolerant technologies available for Windows Server 2003
- Fault Handling and Fault Tolerance - Articles about software and hardware fault tolerance techniques.
- Article "Practical Considerations in Making CORBA Services Fault-Tolerant" by Priya Narasimhan
- Article "Experiences, Strategies and Challenges in Building Fault-Tolerant CORBA Systems" by Pascal Felber and Priya Narasimhan