Our systems suffer many insults when they contact the real world: flaky inputs, unreliable networks, and misbehaving users, to name just a few. As we design our components and systems to thrive in the only environment that matters, it pays to have a mental schema and a shared language for discussing the issues.

A fault is an incorrect internal state in your software. Faults are often introduced at component, module, or subsystem boundaries. There can be a mismatch between the contract a module is designed to implement and its actual behavior. A very simple example is accepting a negative integer or zero when a strictly positive integer was expected.
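
To make the boundary example concrete, here is a minimal sketch of a module that enforces its contract at the point of entry. The function name allocate_slots and its behavior are hypothetical, chosen only for illustration. The point is that a zero or negative value is rejected before it can become part of the module's internal state.

```python
def allocate_slots(count: int) -> list:
    """Hypothetical entry point whose contract requires count >= 1."""
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(count, bool) or not isinstance(count, int):
        raise TypeError(f"count must be an int, got {type(count).__name__}")
    # Enforce the contract at the boundary: zero and negatives never get in.
    if count <= 0:
        raise ValueError(f"count must be strictly positive, got {count}")
    return [None] * count
```

Without the check, allocate_slots(0) would quietly return an empty list, and the mismatch between contract and behavior would only show up somewhere downstream.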

A fault may also occur when a latent bug in the software is triggered by an external or internal condition. For example, attempting to allocate an object when memory is exhausted may return a null pointer, as malloc does in C. If the software proceeds with the null pointer, it can cause problems later, perhaps in a distant part of the code.
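
The same pattern can be sketched in Python, where None plays the role of the null pointer. Everything here is an assumption made for illustration: try_allocate, Buffer, Session, and open_session_checked are hypothetical names, and the "available" parameter stands in for real memory pressure. The unchecked path stores the None and only blows up later, far from the cause. The checked path surfaces the problem where it happens.

```python
from typing import Optional


class Buffer:
    def __init__(self, size: int):
        self.data = bytearray(size)


def try_allocate(size: int, available: int) -> Optional[Buffer]:
    """Hypothetical allocator: returns None when the request exceeds the budget."""
    if size > available:
        return None
    return Buffer(size)


class Session:
    def __init__(self, buffer: Optional[Buffer]):
        # Latent bug: a None buffer is stored without any check.
        # The fault now exists, but nothing visible has gone wrong yet.
        self.buffer = buffer

    def write(self, payload: bytes) -> None:
        # With a None buffer, this raises AttributeError, far from the cause.
        self.buffer.data[: len(payload)] = payload


def open_session_checked(size: int, available: int) -> Session:
    """Checks the allocation result so the fault is caught at its source."""
    buffer = try_allocate(size, available)
    if buffer is None:
        raise MemoryError(f"could not allocate {size} bytes")
    return Session(buffer)
```

Calling Session(try_allocate(10, 5)) succeeds, and the trouble only appears on the first write. Calling open_session_checked(10, 5) raises immediately instead.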

Such an incorrect state may be recoverable. A fault-tolerant module will attempt to restore a good internal state after detecting a fault. Exception handlers and error-checking code are efforts to provide fault-tolerance.
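
One common way to restore a good state is to take a snapshot before a risky operation and roll back if a fault is detected. The sketch below assumes a hypothetical Inventory module whose invariant is that stock levels never go negative. It is one possible shape for fault tolerance, not the only one.

```python
class Inventory:
    """Hypothetical module with the invariant: no stock level is negative."""

    def __init__(self):
        self.stock = {}

    def apply_batch(self, changes) -> bool:
        """Apply (item, delta) pairs; return False and roll back on a fault."""
        snapshot = dict(self.stock)  # last known good state
        try:
            for item, delta in changes:
                new_level = self.stock.get(item, 0) + delta
                if new_level < 0:
                    raise ValueError(f"{item} would drop to {new_level}")
                self.stock[item] = new_level
            return True
        except ValueError:
            # Fault detected partway through: restore the snapshot so a
            # half-applied batch never becomes visible to callers.
            self.stock = snapshot
            return False
```

A batch that fails on its second entry leaves no trace of its first entry, so callers only ever see states that satisfy the invariant.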

Another school of thought says that fault tolerance is unreliable. In this approach, once a fault has occurred, the entire memory state of the program must be regarded as corrupt. Instead of attempting to restore a good state by backtracking or patching up the internal state, fault-intolerant modules will exit to avoid producing errors. A system built from these fault-intolerant modules will include supervisor capabilities to restart exited modules.
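
Here is a rough in-process sketch of that arrangement. In a real system the worker would usually be a separate process, so that its memory is genuinely discarded when it exits. This sketch only imitates that by throwing away the worker object and building a fresh one. The names Worker, WorkerCrashed, and Supervisor are hypothetical, as is the restart limit.

```python
class WorkerCrashed(Exception):
    """Stands in for a module exiting abnormally."""


class Worker:
    def __init__(self):
        self.cache = {}  # every start begins from a clean state

    def handle(self, request):
        if isinstance(request, bool) or not isinstance(request, int) or request <= 0:
            # Fault detected: make no attempt at repair, just exit.
            raise WorkerCrashed(f"bad request: {request!r}")
        self.cache[request] = request * 2
        return self.cache[request]


class Supervisor:
    def __init__(self, factory, max_restarts=3):
        self.factory = factory
        self.max_restarts = max_restarts
        self.restarts = 0
        self.worker = factory()

    def handle(self, request):
        try:
            return self.worker.handle(request)
        except WorkerCrashed:
            if self.restarts >= self.max_restarts:
                raise  # too many restarts: give up and fail loudly
            self.restarts += 1
            # Discard the old worker and all of its state, then start fresh.
            self.worker = self.factory()
            return None
```

The supervisor never inspects or repairs the worker's state. It only decides whether to start a new worker or to give up, which keeps the recovery logic small and separate from the logic that can fault.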

If a fault propagates in the system, it can produce visibly incorrect behavior. This is an error. Faults may occur without producing errors, as in the case of fault-tolerant modules that correct their own state before an error is observed. An error may be as limited as an incorrect output displayed to a user, but the term covers any incorrect behavior, including data loss or corruption, network flooding, or launching attack drones.

At the component, module, or subsystem level, our mission is to prevent faults from causing errors.

A failure results when a system terminates without completing its job. For a long-running service or server, a failure means it stops responding to requests within a finite time. For a program that should run to completion and exit, a failure means it exits abnormally before completing. A failure may be preferable to an error, depending on the harm caused by the error.

Next time, I will address system stability in the face of faults, errors, and failures.