Notes from reading the book: Release It! Design and Deploy Production-Ready Software.

Modern software systems usually need to process transactions, where a transaction is an abstract unit of work. A resilient system keeps processing transactions even when transient impulses, persistent stresses, or component failures disrupt normal processing.

Failure Mode

Software system failures come in many forms. For example, the effects of a bug can gradually accumulate and eventually bring the system down. In this post, however, we mainly discuss failures caused by impulsive traffic. A software system may run stably under regular traffic, but impulses and excessive requests can trigger catastrophic failure. Some component of the system will start to fail before everything else does. The original trigger, the way the crack spreads to the rest of the system, and the resulting damage are collectively called a failure mode.

Cracks Propagate and Chain of Failures

The notion of a failure mode is especially useful when analyzing software system failures. We need to trace back to when the failure started, how it propagated, and finally how we could have avoided it.

When we consider the reliability of a system, we need to keep in mind that component failures are not independent: a failure in one component can propagate and cause failures in other components. Behind every system outage there is a chain of failures.
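To make the idea of a chain of failures concrete, here is a minimal sketch that walks a dependency map and lists which components a single failure can reach. The component names and the DEPENDS_ON map are hypothetical, and the model is deliberately pessimistic: it assumes every caller of a failed component fails too, which is what happens when nothing contains the crack.

```python
from collections import deque

# Hypothetical dependency map: each component depends on the ones in its list.
DEPENDS_ON = {
    "web": ["app"],
    "app": ["db", "cache"],
    "reports": ["db"],
    "db": [],
    "cache": [],
}


def failure_chain(depends_on, first_failure):
    """Return components in the order an uncontained failure reaches them."""
    # Invert the map so we can ask: who calls this component?
    callers = {name: [] for name in depends_on}
    for name, deps in depends_on.items():
        for dep in deps:
            callers[dep].append(name)

    # Breadth-first walk from the first failed component to its callers.
    order = [first_failure]
    seen = {first_failure}
    queue = deque([first_failure])
    while queue:
        failed = queue.popleft()
        for caller in callers[failed]:
            if caller not in seen:
                seen.add(caller)
                order.append(caller)
                queue.append(caller)
    return order


if __name__ == "__main__":
    print(failure_chain(DEPENDS_ON, "db"))
```

In this toy model a failure in "db" reaches "app", "reports", and finally "web", while a failure in "web" stays local. Tracing such a chain backwards is the kind of analysis described above.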

Patterns and Antipatterns

Failures will happen no matter what we do, so our job is to design the system's reaction to specific failures. This is why we should learn patterns and antipatterns. Patterns do not prevent failures from happening; they prevent failures from propagating, by containing them and preserving partial functionality instead of a total crash. Learning antipatterns helps us avoid applying them in our own system designs. For example, tight coupling accelerates cracks.
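The contrast between tight coupling and containment can be sketched in a few lines. This is an illustrative example, not code from the book: fetch_recommendations, the page structure, and the "degraded" flag are all hypothetical names. The downstream call is hard-wired to fail so that both behaviours are visible.

```python
def fetch_recommendations(user_id):
    """Hypothetical downstream call; here it always fails to simulate an outage."""
    raise TimeoutError("recommendation service did not respond")


def render_page_tightly_coupled(user_id):
    # Tight coupling: any downstream failure becomes a failure of the whole page.
    recs = fetch_recommendations(user_id)
    return {"user": user_id, "recommendations": recs}


def render_page_contained(user_id):
    # Containment: the failure is caught where the two components meet,
    # and the page is still rendered without the optional section.
    try:
        recs = fetch_recommendations(user_id)
        degraded = False
    except (TimeoutError, ConnectionError):
        recs = []
        degraded = True
    return {"user": user_id, "recommendations": recs, "degraded": degraded}


if __name__ == "__main__":
    print(render_page_contained(42))
    try:
        render_page_tightly_coupled(42)
    except TimeoutError as exc:
        print("whole page failed:", exc)
```

The contained version does not stop the downstream failure from happening. It only stops the crack from spreading to the caller, so the user still gets a partially working page.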