|
In electronics and computing, an error is a signal or datum which is wrong. Errors may be caused by a defect, usually understood to be either a mistake in design or construction, or a broken component. A soft error is also a signal or datum which is wrong, but it is not assumed to imply such a mistake or breakage. After observing a soft error, there is no implication that the system is any less reliable than before.
If detected, a soft error may be corrected by rewriting correct data in place of erroneous data. Highly reliable systems use error correction to correct soft errors on the fly. However, in many systems, it may be impossible to determine the correct data, or even to discover that an error is present at all. In addition, before the correction can occur, the system may have crashed, in which case the recovery procedure must include a reboot.
Soft errors involve changes to data (the electrons in a storage circuit, for example) but not changes to the physical circuit itself (the atoms). If the data is rewritten, the circuit will work perfectly again.
Soft errors can occur on transmission lines, in digital logic, analog circuits, magnetic storage, and elsewhere, but are most commonly known in semiconductor storage.
Causes of soft errors
Package decay
Soft errors became widely known with the introduction of dynamic RAM in the 1970s. In these early devices, chip packaging materials contained small amounts of radioactive contaminants. Very low decay rates are needed to avoid excess soft errors, and chip companies have occasionally suffered problems with contamination ever since. It is extremely hard to maintain the material purity needed.
Package radioactive decay usually causes a soft error by alpha particle emission. The positively charged alpha particle travels through the semiconductor and disturbs the distribution of electrons there. If the disturbance is large enough, a digital signal can change from a 0 to a 1 or vice versa. In combinational logic, this effect is transient, perhaps lasting a fraction of a nanosecond, and as a result the challenge of soft errors in combinational logic has mostly gone unnoticed. In sequential logic such as latches and RAM, even this transient upset can become stored for an indefinite time, to be read out later. Thus, designers are usually much more aware of the problem in storage circuits.
Critical charge
Whether a circuit experiences a soft error depends on the energy of the incoming particle, the geometry of the impact, the location of the strike, and the design of the logic circuit. Logic circuits with higher capacitance and higher logic voltages are less likely to suffer an error. This combination of capacitance and voltage is described by the critical charge parameter, Qcrit, the minimum electron charge disturbance needed to change the logic level. A higher Qcrit means fewer soft errors. Unfortunately, a higher Qcrit also means a slower logic gate and higher power dissipation. Reductions in chip feature size and supply voltage, desirable for many reasons, decrease Qcrit. Thus, the importance of soft errors increases as chip technology advances.
In a logic circuit, Qcrit is defined as the minimum amount of induced charge required at a circuit node to cause a voltage pulse to propagate from that node to the output and be of sufficient duration and magnitude to be reliably latched. Since a logic circuit contains many nodes that may be struck, and each node may be of unique capacitance and distance from output, Qcrit is typically characterized on a per-node basis.
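The dependence of Qcrit on capacitance and voltage can be illustrated with a rough first-order sketch. It assumes Qcrit is approximately the node capacitance multiplied by the supply voltage, which is a simplification and not the per-node characterisation described above. The function name and the example values are hypothetical.

```python
def critical_charge_fc(node_capacitance_ff: float, supply_voltage_v: float) -> float:
    """First-order estimate Qcrit ~ C * V (an assumed simplification).

    Capacitance is given in femtofarads and voltage in volts,
    so the result is in femtocoulombs (fF * V = fC).
    """
    return node_capacitance_ff * supply_voltage_v


# Hypothetical node values, chosen only to show the scaling trend.
older_node = critical_charge_fc(node_capacitance_ff=10.0, supply_voltage_v=3.3)
newer_node = critical_charge_fc(node_capacitance_ff=1.0, supply_voltage_v=1.0)

print(f"older node: {older_node:.1f} fC, newer node: {newer_node:.1f} fC")
```

Under this approximation, shrinking both capacitance and voltage lowers Qcrit, which matches the trend described in the text.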
Cosmic rays
Once the electronics industry had determined how to control package contaminants, it became clear that other causes were also at work. James F. Ziegler led a program of work at IBM which culminated in the publication of a number of papers (Ziegler and Lanford, 1979) demonstrating that cosmic rays also could cause soft errors. Indeed, in modern devices, cosmic rays are the predominant cause. Many different particles can be present in cosmic rays, but the main cause of soft errors seems to be neutrons. Neutrons are uncharged and cannot disturb electron distribution on their own, but can undergo neutron capture by the nucleus of an atom in a chip, producing an unstable isotope which then causes a soft error when it decays producing an alpha particle.
Cosmic ray flux depends on altitude. Burying a system in a cave reduces the rate of cosmic-ray-induced soft errors to a negligible level. In the lower levels of the atmosphere, the flux increases by a factor of about 2.2 for every 1000 m (1.3 for every 1000 ft) increase in altitude above sea level. Computers operated on top of mountains, or in aircraft, experience an order of magnitude higher rate of soft errors compared to sea level. This is in contrast to soft errors induced by package decay, which do not change with location.
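The altitude scaling quoted above can be sketched as a simple exponential. This assumes the factor of 2.2 per 1000 m applies uniformly, which the text states only for the lower levels of the atmosphere. The function and constant names are illustrative.

```python
FLUX_FACTOR_PER_1000_M = 2.2  # factor quoted in the text for the lower atmosphere


def relative_flux(altitude_m: float) -> float:
    """Cosmic ray flux relative to sea level, assuming uniform exponential scaling."""
    return FLUX_FACTOR_PER_1000_M ** (altitude_m / 1000.0)


for altitude_m in (0.0, 304.8, 1000.0, 3000.0):
    print(f"{altitude_m:7.1f} m -> {relative_flux(altitude_m):.2f}x sea level")
```

With these assumptions, 304.8 m (1000 ft) gives a factor of roughly 1.3, and an altitude of about 3000 m gives roughly a tenfold increase, consistent with the order-of-magnitude figure for mountain tops.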
It happens that one isotope of boron, boron-10, captures neutrons and undergoes alpha decay very efficiently. It has a very high neutron collision cross section. Boron is used in BPSG, a glass used to cover silicon dies to protect them. In critical designs, depleted boron, consisting almost entirely of boron-11, is used to avoid this effect and therefore to reduce the soft error rate. Boron-11 is a by-product of the nuclear industry.
Other causes
Soft errors can also be caused by random noise or signal integrity problems, such as inductive or capacitive crosstalk. However, in general, these sources represent a small contribution to the overall soft error rate when compared to radiation effects.
Designing Around Soft Errors
Soft Error Mitigation
A designer can attempt to minimise the rate of soft errors by judicious device design, choosing the right semiconductor, package and substrate materials, and the right device geometry. Often, however, this is limited by the need to reduce device size and voltage, to increase operating speed and to reduce power dissipation. The susceptibility of devices to upsets is described in the industry using the JEDEC JESD-89 standard.
One technique that can be used to reduce the soft error rate in digital circuits is called radiation hardening. This involves increasing the capacitance at selected circuit nodes in order to increase their effective Qcrit value. This reduces the range of particle energies that can upset the logic value of the node. Radiation hardening is often accomplished by increasing the size of transistors that share a drain/source region at the node. Since the area and power overhead of radiation hardening can be restrictive to design, the technique is often applied selectively to nodes which are predicted to have the highest probability of resulting in soft errors if struck. Tools and models that can predict which nodes are most vulnerable are the subject of past and current research in the area of soft errors.
Correcting soft errors
Designers can choose to accept that soft errors will occur, and design systems with appropriate error detection and correction to recover gracefully. Typically, a semiconductor memory design might use forward error correction, incorporating redundant data into each word to create an error correcting code. Alternatively, roll-back error correction can be used, detecting the soft error with an error-detecting code such as parity, and rewriting correct data from another source. This technique is often used for write-through cache memories.
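As a minimal sketch of forward error correction, the following uses a Hamming(7,4) code, which adds three redundant bits to a four-bit word and can correct any single flipped bit. The choice of this particular code is an assumption made for brevity; the text does not say which code a given memory uses, and real memories generally protect wider words.

```python
def hamming74_encode(data_bits):
    """Encode 4 data bits as a 7-bit codeword laid out as p1 p2 d1 p4 d2 d3 d4."""
    d1, d2, d3, d4 = data_bits
    p1 = d1 ^ d2 ^ d4  # covers codeword positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # covers codeword positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # covers codeword positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def hamming74_decode(codeword):
    """Return (data_bits, syndrome); a non-zero syndrome is the corrected position."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s4
    if syndrome:
        c[syndrome - 1] ^= 1  # flip the single bit the syndrome points at
    return [c[2], c[4], c[5], c[6]], syndrome


word = [1, 0, 1, 1]
stored = hamming74_encode(word)
stored[4] ^= 1  # emulate a soft error flipping one stored bit
recovered, position = hamming74_decode(stored)
print(recovered == word, position)
```

The decoder locates the flipped bit from the recomputed parity checks alone, so no second copy of the data is needed, in contrast to the roll-back approach.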
Soft errors in logic circuits are sometimes detected and corrected using the techniques of fault-tolerant design. These often include the use of redundant circuitry or redundant computation, and typically come at the cost of circuit area, decreased performance, and/or higher power consumption. The concept of triple modular redundancy (TMR) can be employed to ensure very high soft-error reliability in logic circuits. In this technique, three identical copies of a circuit compute on the same data in parallel, and their outputs are fed into majority voting logic, which returns the value that occurred in at least two of the three cases. In this way, the failure of one circuit due to a soft error is discarded, assuming the other two circuits operated correctly. In practice, however, few designers can afford the greater than 200% circuit area and power overhead required, so it is usually applied only selectively. Another common approach to correcting soft errors in logic circuits is temporal (or time) redundancy, in which one circuit operates on the same data multiple times and compares successive evaluations for consistency. This approach, however, often incurs performance overhead, area overhead (if copies of latches are used to store data), and power overhead, though it is considerably more area-efficient than modular redundancy.
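The majority voting step of TMR can be sketched in software as a bitwise vote over three results. The helper names and the way the upset is injected (an XOR mask applied to one copy) are illustrative assumptions, not a description of any particular hardware voter.

```python
def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise majority: each output bit takes the value held by at least two inputs."""
    return (a & b) | (a & c) | (b & c)


def tmr(compute, value: int, upset_mask: int = 0) -> int:
    """Run three copies of `compute`; XOR `upset_mask` into one copy to emulate a soft error."""
    copy0 = compute(value) ^ upset_mask
    copy1 = compute(value)
    copy2 = compute(value)
    return majority_vote(copy0, copy1, copy2)


def increment(v: int) -> int:
    return v + 1


# One copy suffers a flipped bit, but the vote still returns the correct result.
print(tmr(increment, 41, upset_mask=0b100))
```

The vote fails only if two copies are upset in the same bit position, which is why the text notes the assumption that the other two circuits operated correctly.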
Traditionally, DRAM has received the most attention in the quest to reduce or work around soft errors, because DRAM has comprised the majority of the susceptible device surface area in desktop and server computer systems (see the prevalence of ECC RAM in server computers). Hard figures for DRAM susceptibility are hard to come by, and vary considerably across designs, fabrication processes, and manufacturers. In 1980s technology, 256-kilobit DRAMs could have clusters of five or six bits flip from a single alpha particle. Modern DRAMs have much smaller feature sizes, so the deposition of a similar amount of charge could easily cause many more bits to flip.
The design of error detection and correction circuits is helped by the fact that soft errors usually are localised to a very small area of a chip. Usually, only one cell of a memory is affected, although high energy events can cause a multi-cell upset. Conventional memory layout usually places one bit of many different correction words adjacent on a chip. As a result, even a multi-cell upset leads to only a number of separate single-bit upsets in multiple correction words, rather than a multi-bit upset in a single correction word. An error correcting code therefore needs only to cope with a single bit in error in each correction word in order to cope with all likely soft errors. The term 'multi-cell' is used for upsets affecting multiple cells of a memory, whatever correction words those cells happen to fall in. 'Multi-bit' is used when multiple bits in a single correction word are in error.
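The effect of this layout can be sketched with a toy model in which physically adjacent cells belong to different correction words. The modulo mapping and the function names are assumptions made for illustration; actual memory layouts vary.

```python
def cell_to_word_and_bit(cell_index: int, num_words: int):
    """Toy interleaved layout: neighbouring cells hold bits of different words."""
    return cell_index % num_words, cell_index // num_words


def errors_per_word(first_cell: int, upset_width: int, num_words: int):
    """Count the bits in error per correction word for a run of adjacent upset cells."""
    counts = {}
    for cell in range(first_cell, first_cell + upset_width):
        word, _bit = cell_to_word_and_bit(cell, num_words)
        counts[word] = counts.get(word, 0) + 1
    return counts


# A four-cell upset in a row interleaving eight words touches four different words once each.
print(errors_per_word(first_cell=6, upset_width=4, num_words=8))
```

In this model, a multi-cell upset becomes a multi-bit upset in a single word only when the run of upset cells is wider than the number of interleaved words.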
Soft errors in combinational logic
The three natural masking effects in combinational logic that determine whether a single event upset (SEU) will propagate to become a soft error are electrical masking, logical masking, and temporal (or timing-window) masking. An SEU is logically masked if its propagation is blocked from reaching an output latch because off-path gate inputs prevent a logical transition of that gate's output. An SEU is electrically masked if the signal is attenuated by the electrical properties of gates on its propagation path such that the resulting pulse is of insufficient magnitude to be reliably latched. An SEU is temporally masked if the erroneous pulse reaches an output latch, but it does not occur close enough to when the latch is actually triggered to hold.
If all three masking effects fail to occur, the propagated pulse becomes latched and the output of the logic circuit will be an erroneous value. In the context of circuit operation, this erroneous output value may be considered a soft error event. However, from a microarchitectural-level standpoint, the affected result may not change the output of the currently-executing program. For instance, the erroneous data could be overwritten before use, masked in subsequent logic operations, or simply never be used. If erroneous data does not affect the output of a program, it is considered to be an example of microarchitectural masking.
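Logical masking, the first of the three effects described above, can be illustrated with a two-gate toy circuit. The circuit, an AND gate feeding an OR gate, and the flag used to inject the upset are hypothetical and chosen only for illustration.

```python
def toy_circuit(a: int, b: int, c: int, upset_internal_node: bool = False) -> int:
    """Compute (a AND b) OR c, optionally flipping the AND output to emulate an SEU."""
    internal = a & b
    if upset_internal_node:
        internal ^= 1  # transient upset on the internal node
    return internal | c


# Off-path input c = 1 forces the OR output to 1, so the upset is logically masked.
print(toy_circuit(1, 1, 1), toy_circuit(1, 1, 1, upset_internal_node=True))

# With c = 0 the OR gate passes the internal node through, so the upset propagates.
print(toy_circuit(1, 1, 0), toy_circuit(1, 1, 0, upset_internal_node=True))
```

The sketch covers only logical masking; electrical and temporal masking depend on pulse shape and timing, which a purely Boolean model does not capture.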
Soft Error Rate
Soft error rate (SER) is the rate at which a device or system encounters or is predicted to encounter soft errors. It is typically expressed as either the number of failures in time (FIT) or the mean time between failures (MTBF). One FIT is equivalent to 1 error per billion hours of device operation. MTBF is usually given in years of device operation. To put this in perspective, an MTBF of 1 year is equal to approximately 114,155 FIT. While many electronic systems have an MTBF that exceeds the expected lifetime of the circuit, the SER may still be unacceptable to the manufacturer or customer. For instance, many failures per million circuits due to soft errors can be expected in the field if the system does not have adequate soft error protection. The failure of even a few products in the field, particularly if catastrophic, can tarnish the reputation of the product and the company that designed it. Also, in safety- or cost-critical applications where the cost of system failure far outweighs the cost of the system itself, a 1% chance of soft error failure per lifetime may be too high to be acceptable to the customer. Therefore, it is advantageous to design for low SER when manufacturing a system in high volume or when extremely high reliability is required.
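The conversion between the two units follows directly from the definition of FIT. The sketch below assumes a year of 365 days (8760 hours), which reproduces the figure of approximately 114,155 FIT quoted above. The function names are illustrative.

```python
HOURS_PER_YEAR = 24 * 365          # 8760 hours, assuming a 365-day year
HOURS_PER_FIT_WINDOW = 1e9         # one FIT is 1 error per billion device-hours


def mtbf_years_to_fit(mtbf_years: float) -> float:
    """Convert a mean time between failures in years to a rate in FIT."""
    return HOURS_PER_FIT_WINDOW / (mtbf_years * HOURS_PER_YEAR)


def fit_to_mtbf_years(fit: float) -> float:
    """Convert a rate in FIT to a mean time between failures in years."""
    return HOURS_PER_FIT_WINDOW / (fit * HOURS_PER_YEAR)


print(round(mtbf_years_to_fit(1.0)))      # FIT for an MTBF of 1 year
print(fit_to_mtbf_years(1000.0))          # MTBF in years for 1000 FIT
```

Because the relationship is a reciprocal, halving the FIT rate doubles the MTBF.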
See also
- Single event upset
- Radiation hardening
External links
- ZeroSoft's page on soft errors - Lots of resources plus a free on-line evaluation.
- Soft Errors in Electronic Memory - A White Paper - A good summary paper with many references - Tezzaron Jan 2004
- Benefits of Chipkill-Correct ECC for PC Server Main Memory - A 1997 discussion of SDRAM reliability - some interesting information on "soft errors" from cosmic rays, especially with respect to Error-correcting code schemes
- Soft errors' impact on system reliability - Ritesh Mastipuram and Edwin C Wee, Cypress Semiconductor, 2004
- Scaling and Technology Issues for Soft Error Rates - A Johnston - 4th Annual Research Conference on Reliability Stanford University, October 2000
- Evaluation of LSI Soft Errors Induced by Terrestrial Cosmic rays and Alpha Particles - H. Kobayashi, K. Shiraishi, H. Tsuchiya, H. Usuki (all of Sony), and Y. Nagai, K. Takahisa (Osaka University), 2001.
- SELSE Workshop Website - Website for the workshop on the System Effects of Logic Soft Errors
- iRoC Technologies - A company dedicated to Soft Errors-related solutions and products
References
- Ziegler, J. F. and W. A. Lanford, "Effect of Cosmic Rays on Computer Memories", Science, 206, 776 (1979).
|