As we move from monolithic services to microservices, I find it useful to think about the following problems:
- How do we correctly measure the service?
This question can be broken down into the following sub-questions:
- If you maintain a service, how do you tell whether it is functioning correctly?
- If your service is a client-facing product, how do you know whether you are providing a good experience for the client?
- How do you know whether your service has hit a performance bottleneck?
- If you are about to scale the service, what metrics should you use to find the performance bottleneck?
As the sub-questions above suggest, it is all about defining the right metrics for the right client and the right scenario. A sanity check of the service should be quick and straightforward; for a product-experience check, you should place metrics where the product impact can be measured; for performance monitoring and optimization, the essential metrics measure resource utilization and the behavior of dependent services.
I have found that we often fail to measure our services correctly, and it takes time, experience, and domain knowledge to define such metrics well.
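As a concrete illustration, here is a minimal sketch of two common metrics computed from raw request samples: 99th-percentile latency (engineering-facing) and error rate (client-facing). The `Request` type and the sample data are hypothetical, just to make the idea runnable:

```python
from dataclasses import dataclass

@dataclass
class Request:
    latency_ms: float
    ok: bool

def p99_latency(requests):
    """99th-percentile latency: a common engineering-facing metric."""
    latencies = sorted(r.latency_ms for r in requests)
    idx = min(len(latencies) - 1, int(len(latencies) * 0.99))
    return latencies[idx]

def error_rate(requests):
    """Fraction of failed requests: a common client-facing metric."""
    return sum(1 for r in requests if not r.ok) / len(requests)

# 100 hypothetical requests: 98 fast successes, 1 slow success, 1 failure.
samples = [Request(12.0, True)] * 98 + [Request(250.0, True), Request(40.0, False)]
print(p99_latency(samples))  # 250.0
print(error_rate(samples))   # 0.01
```

Note how a mean latency over these samples (~14 ms) would hide the 250 ms outlier entirely, which is why percentiles are usually the more honest measurement.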
- How do we manage the expectations of the clients that use our service?
One common problem with monolithic services is that integration is often based on direct DB or data access, where the client treats it as a local DB whose data is always available. In this setting, a failure in one domain becomes contagious: the client never assumes the domain can fail, so when a failure does happen, no exception-handling logic is in place to deal with it.
To make the system more resilient, a service-level agreement or expectation is truly needed. It is about setting the expectations of your service's clients: we are not a service that is always performant and reliable; we may slow down or become unavailable in some cases, and you should be prepared for that.
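What "being prepared" looks like on the client side can be sketched as retry-with-backoff plus a graceful fallback. The `fetch_profile` call below is a hypothetical remote call that always fails, to simulate an outage; the names are illustrative, not a real API:

```python
import time

class ServiceUnavailable(Exception):
    pass

def fetch_profile(user_id):
    # Hypothetical remote call; always fails here to simulate an outage.
    raise ServiceUnavailable("profile service is down")

def fetch_profile_resilient(user_id, retries=2, fallback=None):
    """Retry with exponential backoff, then degrade gracefully instead of crashing."""
    for attempt in range(retries + 1):
        try:
            return fetch_profile(user_id)
        except ServiceUnavailable:
            if attempt < retries:
                time.sleep(0.01 * 2 ** attempt)  # exponential backoff
    return fallback  # degraded experience, but the client keeps working

print(fetch_profile_resilient(42, fallback={"name": "unknown"}))  # {'name': 'unknown'}
```

The key design point is that the client owns a degraded path (`fallback`) instead of implicitly assuming the dependency behaves like a local DB.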
So I find it useful to think about these problems through SLOs and the related concepts:
- SLIs: Service Level Indicators
Service level indicators are metrics you define to measure the behavior and performance of your service. They can be product-facing or purely engineering-facing.
- SLOs: Service Level Objectives
Service level objectives are targets set on top of the SLIs. They serve as the direction for the team to optimize the system when necessary.
- SLAs: Service Level Agreements
Service level agreements are more about the contract you define for your clients: how fast your service loads data on average, in what cases your service might fail, and how your clients should handle such failures.
Besides defining the SLIs and SLAs, it is also worth validating that clients actually honor the SLAs. For example, if your clients are supposed to handle data-access failures from your service, you can validate that by scheduling an outage of your service. By doing so, you push your clients to adapt to the SLAs.
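The scheduled-outage idea can be sketched as a fault-injection wrapper around a request handler: some fraction of calls fail on purpose, so clients that never built a failure path find out quickly. This is an illustrative toy, not a production chaos-engineering tool:

```python
import random

class InjectedOutage(Exception):
    pass

def with_fault_injection(handler, failure_rate=0.1, rng=None):
    """Wrap a handler so a fraction of calls fail on purpose,
    forcing clients to exercise their failure-handling paths."""
    rng = rng or random.Random()
    def wrapped(*args, **kwargs):
        if rng.random() < failure_rate:
            raise InjectedOutage("scheduled fault injection")
        return handler(*args, **kwargs)
    return wrapped

handler = with_fault_injection(lambda x: x * 2, failure_rate=1.0)
try:
    handler(21)
except InjectedOutage:
    print("client must handle this")
```

In practice you would inject faults only during an announced window and at a small `failure_rate`, but the principle is the same: an SLA that is never exercised is an SLA that is never really adopted.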