Quick Read 1: Circuit Breaker Pattern

Quick Read is a series of blog posts, each aiming to give a quick explanation of one concept. One at a time.

Source:

What is Circuit Breaker?

Circuit breaker is a design pattern used in modern software development. It detects failures and encapsulates the logic that prevents a failure from constantly recurring during maintenance, temporary external system failures, or unexpected system difficulties. The Circuit Breaker pattern also enables an application to detect whether the fault has been resolved.

How does a Circuit Breaker (CB) improve service availability?

Almost every software system depends on some remote system, e.g. a DB instance. When the DB slows down significantly under heavy load or becomes unavailable, client requests might have to wait a long time before they time out. While waiting, these requests hold some of the system’s critical resources, such as pooled connections, memory, and CPU. New requests for the same resource keep arriving and hold these resources in turn, so the resources stay exhausted even after the earlier requests have timed out.

Requests that don’t need the remote resource are then blocked as well until the critical resources become available: a failure in one module brings down more modules. With a CB on the remote resource access path, later requests fail fast at the CB and the resources are released quickly. Hence, the other parts of the system stay available.

In sum, CB improves overall service availability by cutting off access to operations that are likely to fail, which prevents a bigger failure from spreading through the system.

How does Circuit Breaker work?

A CB usually works as a proxy on the access path to a dependent resource. You can wrap a protected function call in a circuit breaker object, which monitors for failures. Once the failures reach a certain threshold, the circuit breaker trips, and all further calls to the circuit breaker return with an error, without the protected call being made at all.
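The wrapping described above can be sketched in a few lines of Python. This is a minimal illustration, not a production implementation: the names `SimpleCircuitBreaker`, `CircuitOpenError`, and `failure_threshold` are made up for this example, the breaker counts consecutive failures only, and it never resets on its own.

```python
class CircuitOpenError(Exception):
    """Raised when the breaker is open and the protected call is skipped."""


class SimpleCircuitBreaker:
    """Wraps one protected function and trips after repeated failures.

    Hypothetical sketch: counts consecutive failures and is not thread-safe.
    """

    def __init__(self, func, failure_threshold=3):
        self.func = func
        self.failure_threshold = failure_threshold
        self.failure_count = 0

    @property
    def is_open(self):
        return self.failure_count >= self.failure_threshold

    def call(self, *args, **kwargs):
        if self.is_open:
            # Fail fast: the protected function is not invoked at all.
            raise CircuitOpenError("circuit is open; call skipped")
        try:
            result = self.func(*args, **kwargs)
        except Exception:
            self.failure_count += 1
            raise
        # A success resets the consecutive-failure count.
        self.failure_count = 0
        return result
```

A caller would wrap the remote access once, for example `breaker = SimpleCircuitBreaker(query_db)`, and then go through `breaker.call(...)` instead of calling `query_db` directly, where `query_db` stands for whatever function reaches the remote resource.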

What’s the difference between CB and retry?

Many failures are transient, say a lost network packet or a brief network issue. These failures can be resolved by retrying. However, some failures are caused by unanticipated events and might take much longer to fix. In these situations it might be pointless for an application to continually retry an operation that is unlikely to succeed, and instead the application should quickly accept that the operation has failed and handle this failure accordingly.

The retry logic should be sensitive to any exceptions returned by the circuit breaker and abandon retry attempts if the circuit breaker indicates that a fault is not transient.
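One way to make retry logic sensitive to the breaker is to treat the breaker’s exception as a stop signal. The sketch below assumes the breaker raises a dedicated exception type, here called `CircuitOpenError`; that name, the `call_with_retry` helper, and its parameters are all hypothetical.

```python
import time


class CircuitOpenError(Exception):
    """Hypothetical exception a circuit breaker raises while it is open."""


def call_with_retry(operation, max_attempts=3, delay_seconds=0.0):
    """Retry transient failures, but give up at once if the breaker is open."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    last_error = None
    for attempt in range(max_attempts):
        try:
            return operation()
        except CircuitOpenError:
            # The breaker signals a non-transient fault: abandon the retries.
            raise
        except Exception as exc:
            last_error = exc
            if attempt < max_attempts - 1:
                time.sleep(delay_seconds)
    raise last_error
```

Any other exception is treated as possibly transient and retried up to `max_attempts` times, which is a simplification; real code would usually retry only on specific error types.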

Is TIMEOUT enough to protect the system?

Timeout is a common strategy in system integration: when an access or task fails to finish within a given time, the call is timed out to prevent infinite waiting. The timeout is then handled, sometimes with a retry. However, a client-side timeout usually takes too long to fire; by then the critical resources may already be used up, which causes cascading failure.

For example: an operation that invokes a service could be configured to implement a timeout, and reply with a failure message if the service fails to respond within this period. This strategy could cause many concurrent requests to the same operation to be blocked until the timeout period expires. These blocked requests might hold critical system resources such as memory, threads, database connections, and so on.

It would be preferable for the operation to fail immediately instead of waiting until the timeout expires.

Does CB need a manual reset after it trips?

Not necessarily. We can have the breaker itself detect whether the underlying calls are working again. We can implement this self-resetting behavior by trying the protected call again after a suitable interval, and resetting the breaker should it succeed.

A circuit breaker component might have the following three states:

  • Closed
  • Open
  • Half-open

In the half-open state, a limited number of requests are allowed through. If they succeed, the breaker transitions to the closed state and normal traffic resumes; otherwise it goes back to the open state.
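The three states can be sketched as a small state machine. In the usual convention, closed means calls pass through, open means calls fail fast, and half-open lets a trial call through after a waiting interval. The class below is a hedged illustration: `CircuitBreaker`, `failure_threshold`, `reset_timeout`, and the injectable `clock` are names chosen for this example, a single trial call decides the half-open outcome, and the code is not thread-safe.

```python
import time


class CircuitOpenError(Exception):
    """Raised while the breaker is open and calls are rejected."""


class CircuitBreaker:
    CLOSED, OPEN, HALF_OPEN = "closed", "open", "half-open"

    def __init__(self, failure_threshold=3, reset_timeout=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout  # seconds to wait before a trial
        self.clock = clock                  # injectable so tests can fake time
        self.state = self.CLOSED
        self.failure_count = 0
        self.opened_at = None

    def call(self, func, *args, **kwargs):
        if self.state == self.OPEN:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit is open")
            # The interval has elapsed: let one trial call through.
            self.state = self.HALF_OPEN
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_success(self):
        # Any success (including a half-open trial) closes the breaker.
        self.failure_count = 0
        self.state = self.CLOSED

    def _on_failure(self):
        self.failure_count += 1
        if (self.state == self.HALF_OPEN
                or self.failure_count >= self.failure_threshold):
            # A failed trial, or too many failures, opens the breaker.
            self.state = self.OPEN
            self.opened_at = self.clock()
```

Passing the clock in as a parameter is a design choice for testability: a test can advance a fake clock instead of sleeping for `reset_timeout` seconds.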

How does CB pattern work for asynchronous calls?

Asynchronous calls or tasks are usually submitted to a queue/worker system for execution. In some cases, the callers are less sensitive to when the call executes or what it returns. In other cases, a timeout can be set for such tasks in case they take too long.

How do we install a CB in async task execution? A common technique is to put all requests on a queue, which the supplier consumes at its own speed, a useful technique to avoid overloading servers. In this case the circuit breaks when the queue fills up.
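A bounded queue gives this behavior almost for free. The sketch below uses Python’s standard `queue.Queue` with a `maxsize`; the `submit` helper and the `QueueFullError` exception are hypothetical names for this example.

```python
import queue


class QueueFullError(Exception):
    """Raised when the task queue is full and the request is rejected."""


def submit(task_queue, task):
    """Enqueue a task without blocking; reject it if the queue is full."""
    try:
        task_queue.put_nowait(task)
    except queue.Full:
        # The "circuit" is broken: fail fast instead of waiting for space.
        raise QueueFullError("task queue is full; rejecting request") from None
```

With `task_queue = queue.Queue(maxsize=100)`, producers call `submit(task_queue, task)` and get an immediate error once 100 tasks are waiting; the circuit closes again by itself as the consumer drains the queue.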

Are there any frameworks available for Circuit Breaker?

Yes. Hystrix, from Netflix, is one example, and it integrates with Spring Cloud.

Release It! Stability

Notes from reading the book: Release It! Design and Deploy Production-Ready Software.

Modern software systems usually need to process transactions, where a transaction is an abstract unit of work. A resilient system keeps processing transactions even when transient impulses, persistent stresses, or component failures disrupt normal processing.

Failure Mode

Software system failures come in many different forms. For example, bugs in the system can gradually accumulate and eventually bring it down. In this post, however, we mainly want to discuss failures caused by impulsive traffic. A software system may run stably under regular traffic, but impulses and excessive requests can trigger catastrophic failure. Some component of the system will start to fail before everything else does. The original trigger and the way the crack spreads to the rest of the system, together with the results of the damage, are collectively called a failure mode.

Cracks Propagate and Chain of Failures

The failure mode is especially useful when analyzing software system failures. We need to trace back to when the failure started, how it propagated, and finally, how we could have avoided it.

When we consider the reliability of the system, we need to keep in mind that failures in components are not independent: they can propagate, and a failure in one component might cause a failure in another. Behind every system outage, there is a chain of failures.

Patterns and Antipatterns

Failures will happen no matter what we do, and our job is to design the system’s reaction to specific failures. This is why we should learn patterns and anti-patterns. Patterns do not prevent failures from happening; they prevent failures from propagating by containing them and preserving partial functionality instead of a total crash. Learning anti-patterns helps us avoid applying them in our system design. For example, tight coupling accelerates cracks.

Release It! Chapter 13: Availability

These are study notes from the book Release It! Design and Deploy Production-Ready Software.

AVAILABILITY is one of the most frequently used terms to describe the reliability of a system. In this section, we will dive deep into the heart of availability. We will discuss how availability is measured and how it is achieved in modern internet services.

First of all, we must realize that there is a tension between the availability and the cost:

  • The desire for greater availability.
  • The desire for minimizing cost.

Every improvement in availability means increased cost. But at the same time, every bit of availability improvement also means saved revenue. When you decide on the availability of your system, be clear about the availability level you want to achieve. Start by gathering availability requirements:

  • Availability can be translated into system downtime.
  • Reducing system downtime reduces revenue loss.
  • How much revenue loss can the business tolerate?
  • How much does it cost to reduce the loss?
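The translation from availability to downtime and revenue loss is simple arithmetic. The sketch below is illustrative only: the function names are made up, a year is taken as 365 days, and revenue is assumed to be lost at a flat hourly rate, which real businesses rarely match exactly.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def downtime_hours_per_year(availability_percent):
    """Hours of downtime per year implied by an availability percentage."""
    return HOURS_PER_YEAR * (1 - availability_percent / 100)


def yearly_revenue_loss(availability_percent, revenue_per_hour):
    """Revenue lost per year, assuming a flat hourly revenue (hypothetical)."""
    return downtime_hours_per_year(availability_percent) * revenue_per_hour
```

For example, 99.9% availability allows about 8.76 hours of downtime per year, while 99% allows about 87.6 hours; comparing the revenue difference between two levels against the cost of moving from one to the other is the tradeoff described above.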

Next, you want to document the availability requirements. One common mistake when documenting availability is to define the SLA for the system as a whole, without realizing that the system might have many features, each carrying a different availability number. Therefore:

  • Define the SLAs in terms of specific features or functions of the system. Don’t define them vaguely for the system as a whole.
  • Be aware of your dependent systems: you can’t offer a better SLA than the worst of the external dependencies involved in a feature.
  • Make sure you can define how you want to measure availability: how do you know the feature or function is available? Ideally, you are equipped with some automated availability detection.
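The dependency rule in the list above can be checked with a quick calculation. The helper names below are hypothetical. The second function goes one step beyond the text: it assumes the feature needs every dependency and that the dependencies fail independently, in which case the combined availability is the product of the individual ones and so is lower than the worst dependency.

```python
import math


def sla_upper_bound(dependency_availabilities):
    """A feature cannot be more available than its worst dependency."""
    return min(dependency_availabilities)


def serial_availability(dependency_availabilities):
    """Combined availability if all dependencies are required and fail
    independently (an assumption, not a claim from the book)."""
    return math.prod(dependency_availabilities)
```

With dependencies at 0.999, 0.995, and 0.9999, the upper bound is 0.995, and under the independence assumption the realistic figure is slightly below that.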

How is high availability achieved? One major technique is load balancing, which distributes requests across a pool or farm of servers so that all requests are served correctly. Horizontally scalable systems achieve both availability and scalability through multiplicity. Adding more machines to increase capacity simultaneously improves resiliency to impulses. And small servers can be added incrementally as needed, which is far more cost effective.

There are many approaches for load balancing.

DNS Round-robin

DNS provides service name to IP address lookup. By mapping the service name to multiple IP addresses and returning them in a round-robin manner, we can achieve load balancing through DNS. DNS load balancing is often used for small to medium business websites. However, this approach has some security issues: front-end IP addresses are visible to and reachable from clients, which means they can be attacked. Also, DNS has no information about the health of the web servers, so it can send requests to a dead server.
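The rotation itself can be simulated in a few lines. This is a toy model, not a DNS server: `RoundRobinResolver` is a made-up class, and the addresses come from the 192.0.2.0/24 documentation range.

```python
from collections import deque


class RoundRobinResolver:
    """Toy model of round-robin DNS: every lookup returns the full address
    list, rotated by one position compared with the previous answer."""

    def __init__(self, addresses):
        self._addresses = deque(addresses)

    def resolve(self):
        answer = list(self._addresses)
        # Rotate so the next lookup starts with the next address.
        self._addresses.rotate(-1)
        return answer
```

Clients typically connect to the first address in the answer, so successive lookups spread across the servers. Note that nothing in the model knows whether a server is alive, which is exactly the weakness described above.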

Reverse Proxy

Let’s first clarify what a proxy server is: it multiplexes many outgoing calls into one single source IP address. A reverse proxy does the opposite: it demultiplexes calls coming into a single IP address and fans them out to multiple addresses. A reverse proxy acts as an interceptor for every request. In current reverse proxy implementations, another feature is caching static content to reduce the load on the web servers. And since the reverse proxy sits in the middle of every request, it can track which origin servers are healthy and responsive.
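Because the reverse proxy sees every request, it can skip servers it knows to be unhealthy. The sketch below shows only that selection step, under assumed names (`HealthAwareBalancer`, `mark_down`, `mark_up`, `pick`); how health is detected and how requests are actually forwarded are left out.

```python
class HealthAwareBalancer:
    """Round-robin selection over origin servers, skipping unhealthy ones."""

    def __init__(self, backends):
        self._backends = list(backends)
        self._healthy = set(self._backends)
        self._next = 0

    def mark_down(self, backend):
        self._healthy.discard(backend)

    def mark_up(self, backend):
        if backend in self._backends:
            self._healthy.add(backend)

    def pick(self):
        # Try each backend at most once per call, starting where we left off.
        for _ in range(len(self._backends)):
            backend = self._backends[self._next]
            self._next = (self._next + 1) % len(self._backends)
            if backend in self._healthy:
                return backend
        raise RuntimeError("no healthy backends")
```

A real proxy would call `mark_down` when a health check or a proxied request fails and `mark_up` when the server responds again.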

Hardware Load Balancer

A hardware load balancer is similar to a reverse proxy server. It can provide switching at layers 4 through 7 of the OSI stack. But it is far more expensive than the software options.

Clustering

Clustering is different from a service pool. The servers within a cluster are aware of each other and actively participate in distributing load. In general, there are two types of clusters for scaling:

  • Active/active cluster: for load balancing.
  • Active/passive: used for redundancy in case of failure. The active node handles all the load until it fails, then the passive one takes over and becomes active.

Load-balanced clusters do not scale linearly, as clusters incur communication overhead such as heartbeats. Within the cluster, the application can most likely coordinate its own availability and failover by having a master control node.

Critical Thinking: The Basic Concepts

In its simplest form, critical thinking is about effectively processing information to reach a conclusion.

I found critical thinking was not easy to master when I first learned it during GRE test training at my college. Students are required to finish two essays in one hour, one focused on analyzing an issue, the other on making an argument about a given topic. It was difficult not only because critical thinking skills were new to most of us, but also because the topics were mostly classic moral, legal, and sociological controversies that require a large amount of background knowledge to reason about soundly.

I find it necessary to revisit these skills because I increasingly realize the importance of processing information effectively. I will start with a few basic concepts of critical thinking.

  • Claim, Issues, Argument

A claim is basically a statement: it is what we say. When we question the statement, we raise an issue. An argument is different in that it has to include two parts: the premise and the conclusion. The premise is there to support the conclusion. In real life, however, people don’t state their arguments directly, and your job is to find the conclusion and the premises they use to support it.

  • Inductive Argument and Deductive Argument

In an inductive argument, a true premise gives strong support to the conclusion, but the conclusion is not guaranteed. In a deductive argument, if the premise is true, there is no chance that the conclusion is false.

  • Argument and Explanation

An explanation is different from an argument. An explanation explains a fact, so it often starts with a statement of fact and then says why the fact is so. An argument, by contrast, tries to prove a conclusion that is not yet an established fact; otherwise there would be no need to prove it.

  • Value judgement, Moral Judgement

Besides making arguments, people often make value judgements and moral judgements as well. A value judgement is about the value of something to a person according to his or her personal preference, while a moral judgement is about morality: what is right and what is wrong.

  • Evaluating Argument

In general, there are three steps to evaluating an argument. First of all, you should clarify the argument’s structure. The most basic structure is the premise and the conclusion. However, much reasoning has more than one level of argument, where the conclusion of the first argument becomes the premise of the next, and in many other cases the premise or the conclusion is left unstated.

The next step is to evaluate the logic: from a purely logical point of view, does the reasoning make sense? In this step, it is especially important not to be swayed by rhetorical expressions.

If the logic of the reasoning makes sense, you then want to validate the premise: is it a true statement? In many cases, you have to accept that some of the premises are outside your own background knowledge, and that you have to do research to check their credibility.

Ask the Most from Your People and Get It, Part I

Managers should ask the most from their team for many reasons. The biggest reason is that the company and the organization expect you to deliver work as a group. Your value is measured by how much value your team can deliver. The more effective your team is, the more value it can deliver and the more successful you are. This is largely why, as a manager, your success is determined by the success of your team.

However, some first-time managers, including me, feel frustrated about asking the best from their team. Some first-time managers get promoted to the manager position because they were the most effective people on the team. They inevitably set high expectations for the team: the team should deliver as well as they could themselves, and they get frustrated because, naturally, not everyone will be as effective as they are.

At other times, we feel frustrated because we carry an unrealistic expectation about motivating people. We think of managers as inspirational speakers who give great speeches and inspire their followers, or as religious leaders whose followers respect and admire them. To some extent, a great manager should be like a great speaker or religious leader, motivating the team by inspiring them to be better versions of themselves. However, an incorrect goal can be very misleading. As a first step, let’s try to distinguish the different skill sets of a manager and a leader.

First of all, we have to realize that it is impossible to motivate people. Misled by the wrong examples, I once believed people could be motivated because they don’t know what they want, and that your job is to tell them what they should pursue. However, I gradually realized that people are only motivated by themselves. The manager’s job is not to find a way to plant a goal in a team member’s mind, but to find out what people truly want and put them in a position where they can achieve it.

Finding out what people truly want takes skill and, most importantly, a lot of time. Many first-time managers, including me, spend more time thinking about strategy and the execution roadmap, that is, what the team should do, than thinking about what people want to do. These managers are good at giving feedback on people’s execution and coaching their skills. But they find it hard to ask even more from people beyond the feedback they’ve given.

In the next part of this article, I am going to discuss some of the fundamental skills I learned about asking the most from the team.