Release It: Stability Antipatterns

What is an antipattern?

Stability antipatterns are systematic patterns of failure that undermine service stability.

Why study antipatterns?

Software rarely crashes outright today. Instead, each of the antipatterns creates, accelerates, or multiplies cracks in the system. Learning the antipatterns helps us avoid applying them in our system and software design.

Pattern 1: Integration Point

An integration point is any place where a system calls out to a remote system, for example a remote service or a database. Integration points are the number-one killer of systems, yet building a system without integration points is not an option. The key is to remember that every integration point you introduce can cause a failure in your system. The Circuit Breaker pattern is a good strategy for handling integration points: fail requests fast once the circuit trips, which typically happens after enough failures have been observed. Failing fast protects the system from continuing to hammer the troubled dependency and from tying up critical resources while waiting on it.
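
As a first line of defense at an integration point, here is a minimal sketch of wrapping a remote call with explicit connect and request timeouts using Java's built-in HttpClient; the inventory endpoint, class name, and durations are illustrative assumptions, not something from the book.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

public class InventoryClient {
    // Never call a remote integration point without bounding how long you will wait.
    private final HttpClient client = HttpClient.newBuilder()
            .connectTimeout(Duration.ofSeconds(2))   // bound the time to establish a connection
            .build();

    public String fetchStock(String sku) throws Exception {
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("https://inventory.example.com/stock/" + sku)) // hypothetical endpoint
                .timeout(Duration.ofSeconds(3))      // bound the time waiting for a response
                .GET()
                .build();
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() >= 500) {
            // Treat server-side errors as failures of the integration point, not as data.
            throw new IllegalStateException("inventory service error: " + response.statusCode());
        }
        return response.body();
    }
}
```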

Pattern 2: Chain Reaction

A chain reaction occurs when there is some defect in an application, usually a resource leak or a load-related crash, and the death of one server makes the surviving servers pick up its slack, pushing them past their own limits. Most of the time, a chain reaction starts with a memory leak. Partitioning servers with Bulkheads can prevent a chain reaction from taking out the entire service, and using Circuit Breakers helps as well.

Pattern 3: Cascading Failures

A cascading failure occurs when cracks jump from one system or layer to another, usually because of insufficiently paranoid integration points. It can also follow a chain reaction in a lower layer. A cascading failure often starts with a resource pool, such as a connection pool, that gets exhausted when none of its calls return: the threads holding connections block forever, and all other threads block waiting for connections. Safe resource pools always limit the time a thread can wait to check out a resource. A cascading failure happens after something else has already gone wrong. Circuit Breaker protects your system by avoiding calls to the troubled integration point, and Timeouts ensure that you can come back from a call to it.
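
As a hedged sketch of "limit the time a thread can wait to check out a resource", the example below puts a semaphore gate in front of a pool so callers fail fast instead of blocking forever; the pool size and wait budget are made-up values.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

public class BoundedConnectionGate {
    // Caps concurrent checkouts; a caller that cannot get a permit in time fails fast
    // instead of blocking forever and feeding a cascading failure.
    private final Semaphore permits;

    public BoundedConnectionGate(int maxConnections) {
        this.permits = new Semaphore(maxConnections);
    }

    public <T> T withConnection(ConnectionWork<T> work) throws Exception {
        if (!permits.tryAcquire(500, TimeUnit.MILLISECONDS)) {
            // The pool is exhausted; surface the problem instead of hanging.
            throw new IllegalStateException("timed out waiting for a connection");
        }
        try {
            return work.run();   // the caller does its database work here
        } finally {
            permits.release();
        }
    }

    public interface ConnectionWork<T> {
        T run() throws Exception;
    }
}
```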

Pattern 4: Users

Users in the real world do things that you won’t predict. If there’s a weak spot in your application, they will find it through sheer numbers. Test scripts are useful for functional testing but too predictable for stability testing; run dedicated load tests to validate the stability of your service. Become intimate with your network design as well; it should help you avert attacks.

Pattern 5: Blocked Threads

The Blocked Threads antipattern is the proximate cause of almost all failures. It usually shows up when you check resources out of a connection pool, deal with caches or object registries, or make calls to external systems. Hanging threads are hard to find during development because of the essential complexity of concurrent programming and testing. Be careful when you introduce the synchronized keyword: while one thread is executing the synchronized code, every other thread that needs it is blocked. When you do have to write concurrent code, prefer proven concurrency primitives and defend against blocked threads with timeouts.
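
As one example of "proven primitives plus timeouts", this hedged sketch uses ReentrantLock.tryLock with a bounded wait instead of a bare synchronized block; the cache scenario and the 250 ms budget are assumptions for illustration.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

public class RegistryCache {
    private final ReentrantLock lock = new ReentrantLock();
    private volatile Object cachedValue;

    public Object refresh() throws InterruptedException {
        // Unlike a synchronized block, tryLock lets us bound the wait and fail fast.
        if (!lock.tryLock(250, TimeUnit.MILLISECONDS)) {
            // Another thread is already refreshing; return the stale value rather than block.
            return cachedValue;
        }
        try {
            cachedValue = loadFromSource();   // potentially slow operation
            return cachedValue;
        } finally {
            lock.unlock();
        }
    }

    private Object loadFromSource() {
        return new Object();   // placeholder for the real, possibly slow, load
    }
}
```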

Also be careful with third-party libraries; they often come with resource pools built on multithreading. Your first problem with these libraries is determining exactly how they behave. If a library blocks easily, you need to protect your request-handling threads, and if it allows you to set timeouts, use them.
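
One hedged way to protect request-handling threads from a library whose blocking behavior you do not control is to run its calls on a dedicated executor and bound the wait with Future.get; the vendor client below is a placeholder, not a real library.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class VendorGateway {
    // A dedicated pool: if the library misbehaves it ties up these threads, not request handlers.
    private final ExecutorService vendorPool = Executors.newFixedThreadPool(10);

    public String lookup(String key) throws Exception {
        Future<String> result = vendorPool.submit(() -> vendorClient().fetch(key));
        try {
            return result.get(2, TimeUnit.SECONDS);   // bound the wait even if the library blocks
        } catch (TimeoutException e) {
            result.cancel(true);                      // interrupt the stuck call if possible
            throw new IllegalStateException("vendor lookup timed out", e);
        }
    }

    // Placeholder for the third-party client whose behavior we are defending against.
    private VendorClient vendorClient() {
        return key -> "value-for-" + key;
    }

    interface VendorClient {
        String fetch(String key) throws Exception;
    }
}
```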

Pattern 6: Attack Of Self-Denial

On a retail or similar website, a special offer can cause an attack of self-denial: your own promotion drives a surge of traffic that overwhelms the system. In such cases, protecting the shared resources is critical.

Pattern 7: Scaling Effects

A shared resource can become a bottleneck when we try to scale a service, because in most cases the shared resource can only be used exclusively. When the shared resource saturates, you get a connection backlog. A shared-nothing architecture largely solves this problem: each request is processed by one node, and nodes do not share memory or storage. A fully shared-nothing architecture is relatively hard to achieve, but many steps can reduce dependence on shared resources, for example partitioning database tables, as sketched below.
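
As one illustration of reducing dependence on a shared resource, here is a hedged sketch of routing each user's data to a partitioned table by hashing an id; the shard count and naming scheme are assumptions made for the example.

```java
public class ShardRouter {
    private final int shardCount;

    public ShardRouter(int shardCount) {
        this.shardCount = shardCount;
    }

    // Deterministically map a user to one partition so no single table is shared by everyone.
    public String shardFor(String userId) {
        int bucket = Math.floorMod(userId.hashCode(), shardCount);
        return "orders_shard_" + bucket;   // e.g. orders_shard_0 .. orders_shard_7
    }

    public static void main(String[] args) {
        ShardRouter router = new ShardRouter(8);
        System.out.println(router.shardFor("user-42"));   // always the same shard for this user
    }
}
```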

Pattern 8: Unbalanced Capacities

Production systems are deployed onto a relatively fixed set of resources. Unbalanced capacities are rarely observed during QA, because in most QA environments the system is scaled down to a couple of servers. How do you test whether your system suffers from unbalanced capacities? Use capacity modeling to make sure you are at least in the ballpark, then test your system with an excessive workload.
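
To make "in the ballpark" concrete, here is a hedged back-of-the-envelope capacity check; all numbers are invented for illustration and should be replaced with measurements from your own system.

```java
public class CapacityCheck {
    public static void main(String[] args) {
        // Hypothetical numbers: 20 front-end servers with 100 request threads each,
        // 4 back-end servers, each able to handle 75 concurrent calls.
        int frontEndServers = 20;
        int threadsPerFrontEnd = 100;
        int backEndServers = 4;
        int concurrentCallsPerBackEnd = 75;

        int maxFrontEndDemand = frontEndServers * threadsPerFrontEnd;       // 2000
        int backEndCapacity = backEndServers * concurrentCallsPerBackEnd;   // 300

        System.out.printf("Front end can generate %d concurrent calls; back end handles %d.%n",
                maxFrontEndDemand, backEndCapacity);
        // A ratio well above 1 means the front end can bury the back end under peak load.
        System.out.printf("Demand/capacity ratio: %.1f%n",
                (double) maxFrontEndDemand / backEndCapacity);
    }
}
```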

Pattern 9: Slow Response

A slow response is worse than refusing a connection or returning an error: failing fast allows the caller to finish processing its transaction quickly, whereas slow responses from the lower layers gradually propagate to the upper layers and cause a cascading failure. Commonly seen root causes of slow responses include contention for database connections, memory leaks, and inefficient low-level protocols.

Pattern 10: SLA Inversion

A service-level agreement (SLA) is a contractual agreement about how well the organization must deliver its services. When building a software system, however, the best you can possibly do is the SLA of the worst of your service providers. SLA inversion means that a system which must meet a high-availability SLA depends on systems of lower availability. To handle SLA inversion, on the one hand you can decouple from the lower-availability systems or degrade gracefully, making sure your system can continue to operate without the remote system. On the other hand, when you craft your SLAs, focus on particular functions and features; for features that require a remote dependency, you can only promise the best SLA that the remote system offers.

Pattern 11: Unbounded Result Sets

A commonly seen DB query pattern is that the application sends a query to the database and loops over all the results, without realizing that the result set can be far larger than the server can handle. This pattern is hard to detect during development because the test data set is usually small, and sometimes it does not even show up right after ramping to production, only once the data set has grown too big to handle. One solution is to use the LIMIT keyword to cap the result size sent back from the database. For large result sets, it is worth introducing pagination at both the API and the database level.
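
As a sketch of the LIMIT/pagination idea, the following hedged JDBC example uses keyset pagination with a hard cap on page size; the orders table, column names, and page size are hypothetical.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

public class OrderDao {
    private static final int PAGE_SIZE = 500;   // hard cap on rows fetched per call

    // Keyset pagination: fetch the next page of ids after the last id already seen,
    // so neither the database nor the application ever holds an unbounded result set.
    public List<Long> nextPageOfOrderIds(Connection conn, long afterId) throws SQLException {
        String sql = "SELECT id FROM orders WHERE id > ? ORDER BY id LIMIT ?";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setLong(1, afterId);
            ps.setInt(2, PAGE_SIZE);
            List<Long> ids = new ArrayList<>();
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    ids.add(rs.getLong("id"));
                }
            }
            return ids;
        }
    }
}
```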

Quick Read 1: Circuit Breaker Pattern

Quick Read is a series of blog posts, each aiming to provide a quick explanation of one concept. One at a time.

What is a Circuit Breaker?

Circuit breaker is a design pattern used in modern software development. It detects failures and encapsulates the logic of preventing a failure from constantly recurring during maintenance, temporary external system failures, or unexpected system difficulties. The Circuit Breaker pattern also enables an application to detect whether the fault has been resolved.

How does CB improve service availability?

Almost every software system has some dependency on a remote system, e.g. a DB instance. When the DB slows down significantly because of heavy load, or becomes unavailable, requests from clients may have to wait a long time until they time out. These requests hold some of the system’s critical resources, such as connection pool slots, memory, and CPU. New requests for the same resource keep arriving and continue to hold these resources even after the previous requests have timed out.

Meanwhile, requests that don’t need to access these remote resources are also blocked until the critical resources become available: the failure in one module has spread to other modules. With a CB on the remote resource access path, later requests fail fast at the CB and the resources are quickly released, so the rest of the system remains available.

In sum, a CB improves overall service availability by cutting off access to operations that are likely to fail, preventing a small failure from growing into a bigger one.

How does Circuit Breaker work?

A CB usually works as a proxy on the path to some dependent resource. You wrap a protected function call in a circuit breaker object, which monitors for failures. Once the failures reach a certain threshold, the circuit breaker trips, and all further calls to the circuit breaker return with an error, without the protected call being made at all.

What’s the difference between CB and retry?

Many failures are transient, say a dropped network packet. These failures can be resolved by retrying. However, some failures are due to unanticipated events and might take much longer to fix. In these situations it may be pointless for an application to continually retry an operation that is unlikely to succeed; instead, the application should quickly accept that the operation has failed and handle the failure accordingly.

The retry logic should be sensitive to any exceptions returned by the circuit breaker and abandon retry attempts if the circuit breaker indicates that a fault is not transient.

Is TIMEOUT enough to protect the system?

Timeout is a common strategy in system integration: when an access or task fails to finish within a given time, it is timed out in order to prevent infinite waiting, and the timeout is then handled (sometimes with a retry). However, a timeout on the client side usually takes too long to be detected, and by the time it fires the critical resources have already been used up, which causes cascading failures.

For example: an operation that invokes a service could be configured to implement a timeout, and reply with a failure message if the service fails to respond within this period. This strategy could cause many concurrent requests to the same operation to be blocked until the timeout period expires. These blocked requests might hold critical system resources such as memory, threads, database connections, and so on.

It would be preferable for the operation to fail immediately instead of waiting for the timeout.

Does a CB need a manual reset after it trips?

No. We can have the breaker itself detect whether the underlying calls are working again, by trying the protected call again after a suitable interval and resetting the breaker if it succeeds.

A circuit breaker component typically has the following three states:

  • Closed
  • Open
  • Half-open

In the half-open state, a limited number of requests are allowed through; if they succeed, the breaker transitions back to the closed (normal) state, otherwise it returns to the open state.
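
Putting the three states together, here is a minimal, hedged sketch of a breaker; the failure threshold, open timeout, and the decision to synchronize every call are simplifications for illustration, not a production implementation.

```java
import java.util.concurrent.Callable;

public class CircuitBreaker {
    private enum State { CLOSED, OPEN, HALF_OPEN }

    private final int failureThreshold;
    private final long openTimeoutMillis;

    private State state = State.CLOSED;
    private int consecutiveFailures = 0;
    private long openedAt = 0;

    public CircuitBreaker(int failureThreshold, long openTimeoutMillis) {
        this.failureThreshold = failureThreshold;
        this.openTimeoutMillis = openTimeoutMillis;
    }

    // Synchronized for simplicity; a real breaker would not serialize calls like this.
    public synchronized <T> T call(Callable<T> protectedCall) throws Exception {
        if (state == State.OPEN) {
            if (System.currentTimeMillis() - openedAt >= openTimeoutMillis) {
                state = State.HALF_OPEN;          // allow a trial call through
            } else {
                throw new IllegalStateException("circuit open: failing fast");
            }
        }
        try {
            T result = protectedCall.call();
            state = State.CLOSED;                 // success closes (or keeps closed) the breaker
            consecutiveFailures = 0;
            return result;
        } catch (Exception e) {
            consecutiveFailures++;
            if (state == State.HALF_OPEN || consecutiveFailures >= failureThreshold) {
                state = State.OPEN;               // trip (or re-trip) the breaker
                openedAt = System.currentTimeMillis();
            }
            throw e;
        }
    }
}
```

A caller would wrap each protected call, for example breaker.call(() -> fetchStock(sku)), and treat the fast failure as a signal to degrade gracefully.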

How does CB pattern work for asynchronous calls?

Asynchronous calls or tasks are usually submitted to a queue/worker system for execution. In some cases, the callers are not sensitive to when the call executes or what its result is. In other cases, a timeout can be set for such tasks in case they take too long.

How do you install a CB in async task execution? A common technique is to put all requests on a queue, which the supplier consumes at its own speed (also a useful technique for avoiding overloaded servers). In this case, the circuit breaks when the queue fills up.
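
A hedged sketch of that idea: a bounded queue whose non-blocking offer acts like an open breaker once the queue is full. The queue size is an arbitrary example value.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class TaskIntake {
    // Bounded queue: the consumer drains it at its own pace, and a full queue
    // behaves like an open breaker for new asynchronous work.
    private final BlockingQueue<Runnable> queue = new ArrayBlockingQueue<>(1000);

    public boolean submit(Runnable task) {
        boolean accepted = queue.offer(task);   // non-blocking; returns false when the queue is full
        if (!accepted) {
            // Reject immediately instead of letting producers pile up behind a slow consumer.
            System.err.println("queue full, rejecting task");
        }
        return accepted;
    }

    public Runnable take() throws InterruptedException {
        return queue.take();   // worker side: blocks until work is available
    }
}
```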

Are there any frameworks available for Circuit Breaker?

Yes. Netflix Hystrix is a well-known implementation, and it integrates with Spring Cloud through Spring Cloud Netflix.
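
For reference, a typical Hystrix usage looks roughly like the sketch below: a command wraps the protected call and supplies a fallback. The class name, group name, and the stubbed remote call are assumptions for illustration.

```java
import com.netflix.hystrix.HystrixCommand;
import com.netflix.hystrix.HystrixCommandGroupKey;

public class StockCommand extends HystrixCommand<String> {
    private final String sku;

    public StockCommand(String sku) {
        super(HystrixCommandGroupKey.Factory.asKey("InventoryGroup"));
        this.sku = sku;
    }

    @Override
    protected String run() throws Exception {
        // The protected call to the remote integration point goes here.
        return callInventoryService(sku);
    }

    @Override
    protected String getFallback() {
        // Returned when the call fails, times out, or the circuit is open.
        return "unknown";
    }

    private String callInventoryService(String sku) {
        return "in-stock";   // placeholder for the real remote call
    }
}
```

Executing new StockCommand("sku-123").execute() runs the call through Hystrix's thread pool, timeout, and circuit-breaker machinery.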

Release It! Stability

Notes from reading the book: Release It! Design and Deploy Production-Ready Software.

Modern software systems usually need to process transactions, where a transaction is an abstract unit of work. A resilient system keeps processing transactions even when transient impulses, persistent stresses, or component failures disrupt normal processing.

Failure Mode

Software system failures come in many different forms; for example, bugs can gradually accumulate and eventually bring the system down. In this post, however, we mainly discuss failures caused by impulsive traffic: a software system runs stably under regular traffic, but impulses and excessive requests can trigger catastrophic failure. Some component of the system will start to fail before everything else does. The original trigger, the way the crack spreads to the rest of the system, and the resulting damage are collectively called a failure mode.

Cracks Propagate and Chain of Failures

The notion of a failure mode is used especially when analyzing software system failures: we trace back to when the failure started, how it propagated, and finally how we could have avoided it.

When we consider the reliability of a system, we need to keep in mind that failures in its components are not independent: they propagate, and a failure in one component can cause a failure in another. Under every system outage there is a chain of failures.

Patterns and Antipatterns

Failures will happen no matter what we do, and our job is to design the system’s reaction to specific failures. This is why we should learn patterns and antipatterns. Patterns do not prevent failures from happening; they prevent failures from propagating, by containing them and preserving partial functionality instead of allowing a total crash. Learning antipatterns helps us avoid applying them in our system design; for example, tight coupling accelerates cracks.

Release It! Chapter 13: Availability

These are my study notes from the book Release It! Design and Deploy Production-Ready Software.

AVAILABILITY is one of the most frequently used terms to describe the reliability of a system. In this section, we dive deep into the heart of availability: how it is measured and how it is achieved in modern internet services.

First of all, we must realize that there is a tension between the availability and the cost:

  • The desire for greater availability.
  • The desire for minimizing cost.

Every improvement in availability comes at an increased cost, but at the same time every bit of availability improvement also means saved revenue. When you decide on the availability of your system, be clear about the level you want to achieve. Start by gathering the availability requirements:

  • Availability targets translate into allowed system downtime.
  • Reducing downtime means reducing revenue loss.
  • How much revenue loss can the business tolerate?
  • How much does it cost to reduce that loss?

Next, document the availability requirements. A common mistake when documenting availability is to define the SLAs for the system as a whole, without realizing that the system has many features, each of which may carry a different availability number. Therefore:

  • Define the SLAs in terms of specific features or functions of the system. Don’t define them vaguely for the system as a whole.
  • Be aware of your dependent systems: you can’t afford a better SLA than the worst of the external dependencies involved in a feature.
  • Define how you will measure availability: how do you know a feature or function is available? Ideally, back this up with automated availability monitoring.

How is high availability achieved? One major technique is load balancing: distributing requests across a pool or farm of servers so that all requests are served correctly. Horizontally scalable systems achieve both availability and scalability through multiplicity. Adding more machines to increase capacity simultaneously improves resiliency to impulses, and small servers can be added incrementally as needed, which is far more cost-effective.

There are many approaches for load balancing.

DNS Round-robin

DNS provides a service-name-to-IP-address lookup. By mapping a service name to multiple IP addresses and returning them in round-robin order, we can achieve load balancing through DNS. DNS load balancing is often used for small to medium business websites. However, this approach has some drawbacks: the front-end IP addresses are visible and reachable from clients, which means they can be attacked directly, and DNS has no information about the health of the web servers, so it can send requests to a dead node.

Reverse Proxy

Let’s first clarify what a proxy server is: it multiplexes many outgoing calls into a single source IP address. A reverse proxy does the opposite: it demultiplexes calls coming into a single IP address and fans them out to multiple addresses. The reverse proxy acts as an interceptor for every request. Current reverse proxy implementations also cache static content to reduce the load on the web servers. And since the reverse proxy sits in the middle of every request, it can track which origin servers are healthy and responsive.

Hardware Load Balancer

A hardware load balancer is similar to a reverse proxy server and can provide switching at layers 4 through 7 of the OSI stack, but it is far more expensive than software options.

Clustering

Clustering is different from a service pool: the servers within a cluster are aware of each other and actively participate in distributing load. In general there are two types of clusters for scaling:

  • Active/active cluster: for load balancing.
  • Active/passive cluster: used for redundancy in the case of failure. One node handles all the load until it fails, and the passive node then takes over and becomes active.

Load-balanced clusters do not scale linearly, because clusters incur communication overhead such as heartbeats. Within a cluster, the application most likely coordinates its own availability and failover through a master control node.

SRE: Data Integrity

 

Data integrity usually refers to the accuracy and consistency of data throughout its lifetime. For customer-facing online services, things get even more complex: any data corruption, data loss, or extended unavailability is considered a data integrity issue from the customer’s point of view.

Data integrity issues can be a big problem. In one instance, a database table was corrupted and we had to spend a few hours restoring the data from a snapshot database. In another instance, data was accidentally deleted, with a fatal impact on our client, since the client never expected the data to become unavailable; restoring the data was too expensive, so we had to fix the dependent data records and some code on the client side to mitigate the impact. In yet another instance, the data loaded for the client was not what they expected, which is clearly a data consistency issue; the problem was not reproducible, which made it very hard for the team to debug.

There are many types of failure that can lead to data integrity issues:

  • Root cause: user action, operator error, application bug, infrastructure defect, hardware failure, site disaster
  • Scope: wide, or narrow and directed
  • Rate: big bang, or slow and steady

Combining these factors yields 24 kinds of data integrity issues. How do we handle them?

The first layer of defense is to apply soft deletion to client data. The idea behind soft deletion is to make sure that data is recoverable if needed, for example after an operator error. A soft delete is usually implemented by adding an is_delete flag and a deleted_at timestamp to the table. When data is to be deleted, it is not removed from the database immediately; instead it is marked as deleted, with the actual purge scheduled for some time in the future, say 60 days after the deletion. This way, the deletion can be reverted if necessary.
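
A minimal sketch of how soft delete might look over JDBC, assuming a hypothetical documents table with the is_delete and deleted_at columns described above; the 60-day window follows the example in the text.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Timestamp;
import java.time.Instant;
import java.time.temporal.ChronoUnit;

public class DocumentStore {
    // Soft delete: mark the row instead of removing it, scheduling the real purge for later.
    public void softDelete(Connection conn, long documentId) throws SQLException {
        String sql = "UPDATE documents SET is_delete = TRUE, deleted_at = ? WHERE id = ?";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setTimestamp(1, Timestamp.from(Instant.now().plus(60, ChronoUnit.DAYS)));
            ps.setLong(2, documentId);
            ps.executeUpdate();
        }
    }

    // Every read path has to filter out soft-deleted rows, which is part of the extra
    // complexity discussed below.
    public boolean isVisible(Connection conn, long documentId) throws SQLException {
        String sql = "SELECT 1 FROM documents WHERE id = ? AND is_delete = FALSE";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setLong(1, documentId);
            try (ResultSet rs = ps.executeQuery()) {
                return rs.next();
            }
        }
    }
}
```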

There are different opinions about soft deletion, as it introduces extra complexity into data management. For example, when there are hierarchies and dependency relationships between data records, a soft deletion might break data constraints. It also makes data selection more complex, since a filter has to be applied to every query to exclude soft-deleted rows. Recovering soft-deleted data can be complex as well, especially when only part of the data was deleted and recovery involves a complicated merge.

The second layer is to build a data backup system and make the recovery process fast. Be careful here: backups and archives are not themselves the goal of data integrity. What matters is finding ways to prevent data loss, to detect data corruption, and to recover quickly from a data integrity incident. Data backup is often neglected because it yields no visible benefit and is not a high priority for anyone, but building a restore system is a much more useful goal.

Many cloud services offer backups as an option; for example, AWS RDS supports creating data snapshots, and the cloud cache Redis cluster supports backing up its data to EBS storage. Many people stop here, assuming that the data is covered because it is backed up. However, data recovery can take a long time to finish, and data integrity is broken during the recovery window, so recovery time should be an important metric for the system.

Besides backups, many systems use replicas, and by failing over to a replica when the primary node has an issue, they improve the system’s availability. Keep in mind, though, that the data might not be consistent between the primary instance and the replica instance.

A third layer is to detect errors earlier, for example with a data validation job that checks the integrity of data across different storage systems so that issues can be fixed soon after they happen.
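
As a sketch of such a validation job, the hedged example below compares a cheap aggregate (a per-day row count) between two data stores; the orders table, column name, and connections are assumptions.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class IntegrityCheckJob {
    // A mismatch in a cheap aggregate between the primary store and its replica/backup
    // flags possible corruption or loss early, before customers notice.
    public void compareDailyCounts(Connection primary, Connection replica, String day) throws SQLException {
        long primaryCount = countForDay(primary, day);
        long replicaCount = countForDay(replica, day);
        if (primaryCount != replicaCount) {
            System.err.printf("Integrity alert for %s: primary=%d, replica=%d%n",
                    day, primaryCount, replicaCount);
        }
    }

    private long countForDay(Connection conn, String day) throws SQLException {
        String sql = "SELECT COUNT(*) FROM orders WHERE created_date = ?";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setString(1, day);
            try (ResultSet rs = ps.executeQuery()) {
                rs.next();
                return rs.getLong(1);
            }
        }
    }
}
```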

SRE: Service Level Objectives

As we move from a monolithic service to micro-services, I find it useful to think about the following problems:

  • How do we correctly measure the service?

This question can be broken down into the following sub-questions:

  • If you maintain the service, how do you tell whether it is functioning correctly?
  • If your service is a client-facing product, how do you know whether you are providing a good experience for the client?
  • How do you know whether your service has hit a performance bottleneck?
  • If you are about to scale the service, what metrics should you use to find the performance bottleneck?

As the sub-questions above show, it is all about defining the right metrics for the right audience and scenario. A sanity check of the service should be quick and straightforward; for product experience checks, put metrics where the product impact can be measured; for performance monitoring and optimization, figuring out how to measure resource utilization and the dependent services is essential.

I have found that we often fail to measure our services correctly, and that it takes time, experience, and domain knowledge to define such metrics well.

  • How do we manage the expectations of the clients that use our service?

One common problem with a monolithic service is that integration is often based on direct DB or data access, where the client treats it as a local DB whose data is always available. With this setup, a failure in one domain becomes contagious: the client never assumes there will be failures from that domain, so when a failure happens, no exception-handling logic is in place to deal with it.

To make the system more resilient, a service-level agreement or expectation is truly needed. It is about setting expectations for your service’s clients: we are not a service that is always performant and reliable; we may slow down or become unavailable in some cases, and you should be prepared for that.

So I find it useful to think about these problems in terms of SLOs and the related concepts:

  • SLIs: Service Level Indicators

Service level indicators are the metrics you define to measure the behavior and performance of your service. They can be product-facing or purely engineering-facing.

  • SLOs: Service Level Objectives

Service level objectives are targets set on top of the SLIs. They serve as the direction for the team to optimize the system when necessary.

  • SLAs: Service Level Agreement

Service level agreements are more about the contract you define for the client: how fast your service will load data on average, in which cases your service might fail, and how your client should handle such failures.

Besides defining the SLIs and SLAs, this approach also provides a way to validate the adoption of the SLAs. For example, if your clients are supposed to handle data access failures from your service, you can validate that by scheduling an outage of your service; by doing so, you push your clients to adapt to the SLAs.
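
To make the SLO idea concrete, here is a small, hedged calculation of the downtime budget implied by an availability objective; the 99.9% target and the 30-day window are example values, not a recommendation.

```java
import java.time.Duration;

public class SloBudget {
    public static void main(String[] args) {
        // Hypothetical SLO: 99.9% availability over a 30-day window.
        double availabilityTarget = 0.999;
        Duration window = Duration.ofDays(30);

        // The error budget is the time the service is allowed to be unavailable.
        long allowedDowntimeSeconds = (long) (window.getSeconds() * (1 - availabilityTarget));
        System.out.printf("Allowed downtime per 30 days: about %d minutes%n",
                allowedDowntimeSeconds / 60);
        // 30 days at 0.1% unavailability is roughly 43 minutes; the measured SLI is
        // compared against this budget.
    }
}
```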

Book Review: Cloud Native Java

Cloud Native Java provides an overview of why people want to build cloud-native applications, and how to build them with Spring Boot and other frameworks.

A cloud-native application focuses on the application logic that solves business and product problems, without worrying about the underlying infrastructure setup and operation. Spring Boot is such a framework: it provides a simple service bootstrapping process and takes care of most of the complicated configuration needed to bring up a service. It does not solve every engineering problem you will face, but it provides easy integrations with the best frameworks in each problem domain.

Spring Boot enables you to quickly build a REST application; you can have one up and running within half an hour. This ease of use reminds me of good frameworks in the Python web world like Flask. You can easily equip your application with different persistence solutions: SQL databases like MySQL and PostgreSQL, NoSQL databases like MongoDB, and cache services like Redis. You can also build a data-driven application with Spring's message channel framework or by integrating with message brokers like Kafka and RabbitMQ.

Spring makes such integration easy by providing native abstractions for these systems, like Spring Data JPA for database access, the Spring cache abstraction for cache service access, and Spring message channels for streaming frameworks. These abstractions help you focus on the application logic without worrying about the underlying details.

Spring also leverages well-built frameworks to solve problems beyond the application logic. By integrating with OAuth, it provides a robust authentication and authorization solution; by integrating with Eureka, it helps you solve the middle-tier routing problem.

It is fair to say that this book is very comprehensive: it covers almost every part of the system you need to build for a cloud application. The well-organized examples and annotations make it a handbook for people who need to solve exactly the same problems.

It is a good guide for people who are trying to understand how to build a Java application with Spring Boot in the cloud. It discusses some of the essential problems you need to consider and solve when building cloud-native applications, and it demonstrates that these problems can be solved easily with Spring Boot.

However, I don’t consider it a good book for people who are exploring Spring Boot as an alternative solution to engineering problems within a relatively established organization.

The major issue is that the infrastructure components in such an organization have mostly already been built, and the challenge for engineers is to integrate Spring Boot into the existing systems. For example, OAuth is a good choice for authentication, but what if your company already has its own authentication system, for example Django’s authentication framework where each client is viewed as a user?

I understand that it is very difficult to write a book that fits everyone’s use case: every company is built differently, and a solution that works for one company might not work for another. And the most interesting challenge for engineers is to solve the problem within a given context.

The other problem with this book is that it lacks in-depth discussion of the essential problems engineers face when building cloud-native applications; the discussion is relatively superficial. For books that don’t provide the exact solution to my problem, I at least expect them to provide inspiration.

Overall, I recommend this book to people who are new to the industry and to the Java web service world. For people who are looking for more in-depth solutions or discussions, this book is not the right choice.