SRE: Data Integrity


Data integrity usually refers to the accuracy and consistency of data throughout its lifetime. For customer-facing online services, the picture is even more complex: any data corruption, data loss, or extended unavailability is a data integrity issue from the customer's point of view.

Data integrity problems can be severe. In one instance, a database table was corrupted and we spent several hours restoring the data from a snapshot database. In another, data was accidentally deleted, with a fatal impact on our client, who had never expected the data to be unavailable. Restoring the data was too expensive, so we instead fixed the dependent data records, plus some code on the client side, to mitigate the impact. In yet another instance, the data loaded for the client was not what they expected: clearly a data consistency issue. But it was not reproducible, which made it very hard for the team to debug.

There are many types of failure that can lead to data integrity issues, which can be classified along three dimensions:

  • Root cause

User action, operator error, application bug, infrastructure defect, hardware failure, site disaster

  • Scope

Wide; narrow and directed

  • Rate

Big Bang, Slow and Steady

Combining these dimensions yields 24 distinct data integrity scenarios. How do we handle such issues?

The first layer of defense is to adopt soft deletion for client data. The idea behind soft deletion is to keep data recoverable when needed, for example after an operator error. A soft delete is usually implemented by adding an is_deleted flag and a deleted_at timestamp to the table. When data is deleted, the rows are not destroyed immediately; they are marked as deleted, with the real purge scheduled for some point in the future, say 60 days out. This way, the deletion can be reverted if necessary.
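As a minimal sketch of the idea, assuming an illustrative documents table (only the is_deleted flag and deleted_at timestamp come from the text above; the rest of the schema is made up):

```python
import sqlite3
from datetime import datetime, timedelta, timezone

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE documents (
        id INTEGER PRIMARY KEY,
        body TEXT,
        is_deleted INTEGER NOT NULL DEFAULT 0,
        deleted_at TEXT
    )
""")
conn.execute("INSERT INTO documents (id, body) VALUES (1, 'hello')")

def soft_delete(doc_id, retention_days=60):
    # Mark the row as deleted and schedule the real purge 60 days out.
    purge_at = (datetime.now(timezone.utc) + timedelta(days=retention_days)).isoformat()
    conn.execute(
        "UPDATE documents SET is_deleted = 1, deleted_at = ? WHERE id = ?",
        (purge_at, doc_id),
    )

def undelete(doc_id):
    # Revert the deletion at any point before the purge job runs.
    conn.execute(
        "UPDATE documents SET is_deleted = 0, deleted_at = NULL WHERE id = ?",
        (doc_id,),
    )

soft_delete(1)
# Normal reads must filter out soft-deleted rows.
live = conn.execute("SELECT COUNT(*) FROM documents WHERE is_deleted = 0").fetchone()[0]
undelete(1)
restored = conn.execute("SELECT COUNT(*) FROM documents WHERE is_deleted = 0").fetchone()[0]
print(live, restored)  # 0 1
```

Note that every normal read path needs the `is_deleted = 0` filter, which is exactly the complexity discussed next.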

Opinions differ on soft deletion, as it adds complexity to data management. When there are hierarchies and dependency relationships between records, a soft delete may break data constraints. It also complicates selection and other operations, since a custom filter must be applied to every query to exclude soft-deleted rows. And recovering soft-deleted data can be complex as well, especially when only part of the data was deleted and recovery involves a non-trivial data merge.

The second layer is to build a data backup system and make the recovery process fast. We need to be careful here: backups and archives are not themselves the goal of data integrity. Finding ways to prevent data loss, to detect corruption, and to recover quickly from incidents matters more. Backups are often neglected because they yield no visible benefit and are nobody's priority; building a restore system is a much more useful goal.

Many cloud services offer backups as an option: AWS RDS supports creating data snapshots, and a cloud-cache Redis cluster can back its data up to EBS storage. Many people stop here, assuming the data is now safely backed up. But we should realize that recovery can take a long time to finish, and data integrity remains broken for the whole duration of the recovery. Recovery time should be an important metric for the system.

Besides backups, many systems use replicas, and by failing over to a replica when the primary node has an issue they improve availability. But we need to realize that the data may not be consistent between the primary instance and the replica.

A third layer is to detect errors earlier: for example, run a data validation job that checks the integrity of the data across different storage systems, so issues can be fixed quickly when they happen.
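A validation job along these lines might diff content digests between the primary store and a derived copy (a cache, replica, or search index). The dict-based stores and key names here are purely illustrative:

```python
import hashlib

# Stand-ins for the primary store and a derived copy; data is made up.
primary = {"user:1": b"alice", "user:2": b"bob", "user:3": b"carol"}
replica = {"user:1": b"alice", "user:2": b"bbo", "user:4": b"dave"}

def digest(value: bytes) -> str:
    return hashlib.sha256(value).hexdigest()

def validate(primary, replica):
    """Return keys that are missing, orphaned, or corrupted in the replica."""
    missing = sorted(primary.keys() - replica.keys())
    orphaned = sorted(replica.keys() - primary.keys())
    corrupted = sorted(
        k for k in primary.keys() & replica.keys()
        if digest(primary[k]) != digest(replica[k])
    )
    return missing, orphaned, corrupted

missing, orphaned, corrupted = validate(primary, replica)
print(missing, orphaned, corrupted)  # ['user:3'] ['user:4'] ['user:2']
```

In practice such a job would run on a schedule and page the team when any of the three lists is non-empty.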

SRE: Service Level Objectives

As we move from a monolithic service to microservices, I find it useful to think about the following problems:

  • How do we correctly measure the service?

This question can be broken down into the following sub-questions:

  • If you are about to maintain a service, how do you tell whether it is functioning correctly?
  • If your service is a client-facing product, how do you know whether you are providing a good experience for the client?
  • How do you know whether your service has hit a performance bottleneck?
  • If you are about to scale the service, what metrics should you use to locate the performance bottleneck?

As the sub-questions suggest, it is all about defining metrics for the right client and the right scenario. A sanity check of the service should be quick and straightforward; for the product experience, you should rather put metrics where product impact can be measured; for performance monitoring and optimization, measuring resource utilization and the behavior of dependent services is essential.

I have found that many times we did not measure our service correctly, and that it takes time, experience, and domain knowledge to define such metrics well.

  • How do we manage the expectations of the clients using our service?

One common problem with monolithic services is that integration is often done directly against the database, which clients treat as a local DB whose data is always available. With this setup, a failure in one domain becomes contagious: clients never assume the domain can fail, so when a failure happens, no exception handling logic is in place to deal with it.

To make the system more resilient, a service level agreement, or at least an explicit expectation, is truly needed. It sets the expectation for your service's clients: we are not a service that is always performant and reliable; we may slow down or become unavailable in some cases, and you should be prepared for that.

So I find it useful to think about these problems in terms of SLOs and the related concepts:

  • SLIs: Service Level Indicators

Service level indicators are metrics you define to measure the behavior and performance of your service. They can be product-facing or purely engineering-facing.
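For example, two common SLIs, availability and a latency threshold, can be computed over a request log; the status codes, latencies, and the 100 ms threshold below are all made up for illustration:

```python
# Hypothetical request log: (status_code, latency_ms) pairs.
requests = [(200, 45), (200, 80), (500, 30), (200, 120), (200, 60),
            (503, 15), (200, 95), (200, 40), (200, 70), (200, 55)]

# Availability SLI: fraction of requests served without a 5xx error.
good = sum(1 for status, _ in requests if status < 500)
availability = good / len(requests)

# Latency SLI: fraction of requests completed under a 100 ms threshold.
fast = sum(1 for _, latency in requests if latency < 100)
fast_ratio = fast / len(requests)

print(availability, fast_ratio)  # 0.8 0.9
```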

  • SLOs: Service Level Objectives

Service level objectives are targets set on top of the SLIs. They serve as the direction for the team when optimizing the system.
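A common way to make an SLO actionable is the error budget it implies. A sketch, assuming a 99.9% availability target over a 30-day window (the numbers are illustrative):

```python
# Assumed target: 99.9% availability over a 30-day window.
slo = 0.999
window_minutes = 30 * 24 * 60  # 43200 minutes in the window

# The error budget is the downtime the SLO still permits.
error_budget_minutes = (1 - slo) * window_minutes
print(round(error_budget_minutes, 1))  # 43.2

# If 10 minutes of downtime have already been consumed this window:
remaining = error_budget_minutes - 10
print(round(remaining, 1))  # 33.2
```

When the remaining budget approaches zero, the team shifts effort from features to reliability.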

  • SLAs: Service Level Agreement

Service level agreements are more like the contract you define for your clients: how fast your service loads data on average, in which cases it might fail, and how clients should handle such failures.

Beyond defining the SLIs and SLAs, it is also worth validating that clients actually follow the SLAs. For example, if your clients are supposed to handle data access failures from your service, you can validate that by scheduling an outage of your service. Doing so pushes your clients to adapt to the SLAs.







  • Face value: the face value (par value) is the amount repaid to the holder when the bond matures.
  • Term to maturity: the date on which the bond matures and is settled, e.g. one year, five years, or thirty years.
  • Coupon and payment frequency: by coupon, bonds divide into zero-coupon and coupon-bearing bonds; by payment frequency, into semiannual, annual, and single-payment bonds.
  • The creditworthiness of the issuer.
  • The return on comparable alternative investments: mainly the interest on bank deposits.



  • Term to maturity: 30 years
  • Coupon: 3%, i.e. 3% paid each year
  • Price: 102.52 USD
  • Face value: 100 USD

Y = 3 + 3/(1+3%) + 3/(1+3%)^2 + … + 3/(1+3%)^29 + 100/(1+3%)^29

The Python code below computes the 30-year total as roughly 103.0; that is, the present value of this treasury bond is about 103.0.

par_value = 100
coupon = 3
interest_rate = 0.03

# Discount each of the 30 coupon payments, then the par value repaid at maturity.
present_value = 0.0
for year in range(30):
    present_value += coupon / ((1 + interest_rate) ** year)
    print("accumulated present value through year {}: {}".format(year, present_value))
present_value += par_value / ((1 + interest_rate) ** 29)
print("present value: {:.1f}".format(present_value))

As mentioned earlier, treasury bonds can be traded on the secondary market, and the central bank can control the money supply through repurchases in that market.



  • The federal funds market

By law, commercial banks must hold back a portion of their deposits to reduce risk; the share of total deposits held back is called the reserve ratio. A bank's actual reserves may fall below or exceed the legal requirement: excess reserves are generally lent out, while a bank short of the requirement can borrow from other banks. The funds used to satisfy the reserve requirement are called federal funds, and the market for lending and borrowing them is the federal funds market, also known as the interbank lending market. The spread between the rates at which a bank lends and borrows is its profit.
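A toy example with assumed numbers (the deposit size, the 10% reserve ratio, and the actual reserve level are all illustrative):

```python
# Illustrative: $500M of deposits under a 10% reserve requirement.
deposits = 500_000_000
reserve_ratio = 0.10
actual_reserves = 42_000_000

required = deposits * reserve_ratio     # reserves the bank must hold
shortfall = required - actual_reserves  # amount to borrow, e.g. on the federal funds market
print(required, shortfall)
```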

  • The federal funds rate


  • The discount window and the discount rate

Besides interbank lending, commercial banks can also borrow directly from the central bank. This mechanism is called the discount window, and the interest rate commercial banks pay on such loans is the discount rate. When choosing a funding source to meet the reserve requirement, a bank compares the alternatives: if the discount window rate is below the interbank lending rate, using the discount window is profitable, though transaction costs and the risk of bad debt also factor in.












How should we understand the difference between the nominal and the real exchange rate? The real exchange rate is the relative price of goods between two countries. For example, the same McDonald's Big Mac (à la carte, not the meal) costs $3.99 in the US and 17 RMB in China, which at the current USD/CNY rate of 6.71 is $2.53. One American burger can therefore buy about 1.57 Chinese burgers, and that ratio is the US–China real exchange rate in terms of burgers. The popular Big Mac Index uses exactly this to compare real exchange rates across countries. The nominal exchange rate is simply the rate at which the US dollar exchanges for Chinese currency.
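The burger arithmetic above can be reproduced directly (all numbers come from the text):

```python
# $3.99 in the US, 17 RMB in China, nominal rate of 6.71 RMB per USD.
price_us_usd = 3.99
price_cn_rmb = 17
nominal_rate = 6.71  # RMB per USD

price_cn_usd = price_cn_rmb / nominal_rate   # Chinese burger priced in dollars
real_rate = price_us_usd / price_cn_usd      # Chinese burgers per US burger
print(round(price_cn_usd, 2), round(real_rate, 2))  # 2.53 1.57
```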




  • Problems with purchasing power parity









  • Pay attention to a model's assumptions, and note where the model does not fully match reality.


  • Pay attention to the process of the argument.


  • Connect the material with real life.


  • Subfields and cross-disciplinary fields.




The national income model is one of the better-known models in neoclassical macroeconomics. It aims to explain how, within an economy, income is distributed from firms to households, and how the demand for goods and services reaches equilibrium with their supply.



  • Capital: K
  • Labor: L


  • Y = F(K, L)



  • Profit = PY – WL – RK = PF(K, L) – WL – RK


  • P×MPL = W: when the value of the marginal product of labor equals the wage, the firm's profit is maximized.
  • P×MPK = R: when the value of the marginal product of capital equals the rental price of capital, the firm's profit is maximized.


  • Economic profit = Y – MPL×L – MPK×K

And under constant returns to scale, economic profit is zero, so Y = MPL×L + MPK×K.
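This zero-profit identity can be checked numerically with a constant-returns Cobb-Douglas production function; the exponent alpha and the input levels below are illustrative, not from the text:

```python
# Check Y = MPL*L + MPK*K for Y = K^alpha * L^(1-alpha).
alpha = 0.3
K, L = 8.0, 27.0

Y = K ** alpha * L ** (1 - alpha)
MPK = alpha * K ** (alpha - 1) * L ** (1 - alpha)  # dY/dK
MPL = (1 - alpha) * K ** alpha * L ** (-alpha)     # dY/dL

# Factor payments exhaust output, so economic profit is zero.
assert abs(Y - (MPL * L + MPK * K)) < 1e-9
print("factor payments equal output:", abs(Y - (MPL * L + MPK * K)) < 1e-9)
```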


  • Consumption (C): assume consumption depends directly on disposable income, C = C(Y – T)
  • Investment (I): assume investment demand depends on the interest rate: I = I(r)
  • Government purchases (G): taken as given (exogenous)


  • S= Y – C – G = I


  • S = (Y – T – C) + (T – G)


  • Y – C(Y-T) – G = I(r)


  • When investment demand exceeds the supply of funds, i.e. saving, the interest rate rises, which reduces investment demand.
  • When investment demand falls short of the supply of funds, i.e. saving, the interest rate falls, and more funds flow into investment.
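This adjustment process can be sketched numerically. The linear forms of C and I, and all the numbers below, are assumed purely for illustration:

```python
# Illustrative forms: C(d) = 50 + 0.6*d and I(r) = 300 - 10*r, with Y, T, G fixed.
Y, T, G = 1000.0, 100.0, 200.0

def saving():
    consumption = 50 + 0.6 * (Y - T)  # C(Y - T)
    return Y - consumption - G        # national saving S

def investment(r):
    return 300 - 10 * r               # investment demand I(r)

# Bisection on r mirrors the adjustment story: excess investment demand
# pushes the rate up, excess saving pushes it down.
lo, hi = 0.0, 15.0
for _ in range(60):
    mid = (lo + hi) / 2
    if investment(mid) > saving():
        lo = mid
    else:
        hi = mid
r_star = (lo + hi) / 2
print(round(saving(), 1), round(r_star, 4))  # 210.0 9.0
```

At the equilibrium rate, I(r*) equals S, clearing the market for loanable funds.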