Site Reliability Engineering: principles and how-to

Aaron Hsieh
Mar 28, 2021

Before we start

In today's business world, everyone wants more market share, so product requirements always arrive with urgency. As a result, we often see service teams focus on delivering features quickly while cutting back on reliability design and on efforts to reduce operational work. Then, once the service runs into a problem that damages its image with customers, the team regrets not designing it better.

In my experience, reliability design should be included and reviewed at every stage. Please do not leave it to the final phase before product release.

SRE

Site Reliability Engineering (SRE) was introduced by Ben Treynor Sloss and his engineering team at Google, who published its principles while people were talking about DevOps and Agile. He described what people should do as an SRE. The goal of the role is to ensure service reliability.

After reading the Site Reliability Engineering book, I came to believe an SRE should keep the following three principles.

  • Embrace and control risk. If you are so afraid of risk that you reject every change, things will only get worse: in the future, your service won't be able to handle new business requirements.
  • Simplicity. Simplicity is really important. A complex design causes more trouble, so keep the design as simple as you can.
  • Security. An SRE should keep security in the design. Otherwise, a security leak will cause a huge problem.

Furthermore, I suggest that SREs keep applying the following three methods in their daily work.

  • Automation. If your service is changed through manual steps, that is a high risk. An SRE should keep implementing automation.
  • Visibility. An SRE should build all of the telemetry before the service is released. If the telemetry is only built after the product is online, you will not be able to measure your product metrics when it launches.
  • Reduction. Sometimes you implement something new for your service and skip some things because of the release schedule. I suggest you keep reviewing your methods to reduce that legacy.

Once you have these in place, your service is on track to achieve a high SLA. If you keep practicing them as a loop, you can go on to expand the scope of your service.
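To make "high SLA" concrete, it helps to translate an availability target into the downtime it permits. The following is a minimal sketch, not something from the SRE book: the function name `allowed_downtime_minutes` and the 30-day window are assumptions chosen for illustration.

```python
def allowed_downtime_minutes(sla_percent: float, period_days: int = 30) -> float:
    """Return the downtime, in minutes, that an availability target permits.

    Both the name and the default 30-day window are illustrative assumptions.
    """
    if not 0 < sla_percent <= 100:
        raise ValueError("sla_percent must be in the range (0, 100]")
    total_minutes = period_days * 24 * 60
    # The share of time NOT covered by the target is the downtime budget.
    return total_minutes * (1 - sla_percent / 100)


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99):
        minutes = allowed_downtime_minutes(target)
        print(f"{target}% over 30 days allows about {minutes:.1f} minutes of downtime")
```

Each extra "nine" shrinks the budget by a factor of ten, which is why risk has to be controlled rather than avoided: every change spends part of this budget.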

How to

Based on the principles and methods above, I apply the DevOps loop to SRE work in each phase. The small icons are tools you could try.

I believe everyone knows that the DevOps loop starts from Plan, moves through Build, Continuous Integration, Deployment, Operation, and Continuous Feedback, and then returns to Plan again without end. I would say the SRE has a major mission in each phase of the DevOps loop.

Plan

In the plan phase, the SRE should look for better solutions. For example, you can use kubectl rollout status to check the state of a rollout and, since Kubernetes 1.15, kubectl rollout restart to restart the Pods. You don't need to kill the Pods one by one or use other brute-force approaches.
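As a rough sketch of how this could be automated, the helper below builds the `kubectl rollout` command lines and runs a restart followed by a status check. The function names, the `runner` parameter, and the deployment name `my-service` are hypothetical; only the `kubectl rollout restart` and `kubectl rollout status` subcommands and the `-n` namespace flag come from kubectl itself.

```python
import subprocess
from typing import Callable, List, Optional


def rollout_command(action: str, deployment: str,
                    namespace: Optional[str] = None) -> List[str]:
    """Build the argument list for `kubectl rollout <action> deployment/<name>`."""
    if action not in ("status", "restart"):
        raise ValueError(f"unsupported rollout action: {action}")
    cmd = ["kubectl", "rollout", action, f"deployment/{deployment}"]
    if namespace:
        cmd += ["-n", namespace]
    return cmd


def restart_and_wait(deployment: str, namespace: Optional[str] = None,
                     runner: Callable = subprocess.run) -> None:
    """Trigger a rolling restart, then block until the rollout completes.

    `runner` is injectable so the flow can be exercised without a cluster.
    """
    runner(rollout_command("restart", deployment, namespace), check=True)
    runner(rollout_command("status", deployment, namespace), check=True)


if __name__ == "__main__":
    # Hypothetical deployment name; requires kubectl and cluster access to run.
    print(" ".join(rollout_command("restart", "my-service", "default")))
```

Keeping command construction separate from execution means the automation can be reviewed and tested before it ever touches a real cluster.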

Build

Building good practices really helps improve a service. For example, you can automate your manual steps to reduce human error. Additionally, a good tool can change a service for the better.

Continuous Integration

In the CI phase, the SRE has to protect the service's CI/CD pipeline and ensure that service changes go through the CI/CD tooling.

Deployment

When a service is released, the SRE has to coordinate the release work with stakeholders. Otherwise, your change might cause problems for other services. Prepare a recovery plan before the change, so that if the deployment runs into a problem you can recover your service quickly. Additionally, the SRE has to verify the load balancing design. This verification covers not only network load balancing but also the application design, e.g. the AP read/write design. Finally, the SRE has to ensure telemetry is ready for the product release. Without telemetry, you might run into problems with no early warning.
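The idea of preparing a recovery plan before the change can be sketched as a small deployment wrapper. This is an illustration under assumptions, not a real deployment tool: `deploy`, `is_healthy`, and `rollback` are hypothetical callables you would supply, and the health check is assumed to read from telemetry that already exists.

```python
import time
from typing import Callable


def deploy_with_recovery(deploy: Callable[[], None],
                         is_healthy: Callable[[], bool],
                         rollback: Callable[[], None],
                         checks: int = 3,
                         interval_seconds: float = 0.0) -> bool:
    """Run a deployment and roll back as soon as a health check fails.

    Returns True if every check passed, False if the rollback was triggered.
    """
    deploy()
    for _ in range(checks):
        if not is_healthy():
            # The recovery plan was prepared up front, so recovery is one call.
            rollback()
            return False
        time.sleep(interval_seconds)
    return True


if __name__ == "__main__":
    ok = deploy_with_recovery(
        deploy=lambda: print("deploying new version"),
        is_healthy=lambda: True,  # stand-in for a telemetry-based check
        rollback=lambda: print("rolling back"),
    )
    print("deployment kept" if ok else "deployment rolled back")
```

The wrapper only works if `is_healthy` has something to read, which is one more reason to have telemetry ready before the release.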

Operation

In the operation phase, the SRE should uphold the service's SLA. When an incident happens, the SRE should try every possible solution to recover the service as soon as possible.

Additionally, the SRE could consider Chaos Engineering. It can verify how reliable your service is when it encounters possible failures.

By the way, I recently completed an SLO improvement project. I will share it in another article.
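A toy version of this idea is to inject failures into a call on purpose and check that the caller still copes. The sketch below is an in-process illustration only, not how a chaos tool works against real infrastructure; `InjectedFault`, `with_chaos`, and `call_with_retry` are hypothetical names.

```python
import random
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class InjectedFault(Exception):
    """Raised on purpose by the chaos wrapper."""


def with_chaos(func: Callable[..., T], failure_rate: float,
               rng: Optional[random.Random] = None) -> Callable[..., T]:
    """Wrap `func` so that a fraction of calls fail with InjectedFault."""
    rng = rng or random.Random()

    def wrapper(*args, **kwargs):
        # rng.random() is in [0, 1), so a rate of 0.0 never fails
        # and a rate of 1.0 always fails.
        if rng.random() < failure_rate:
            raise InjectedFault("injected failure")
        return func(*args, **kwargs)

    return wrapper


def call_with_retry(func: Callable[[], T], attempts: int = 3) -> T:
    """Call `func`, retrying on injected faults up to `attempts` times."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error: Optional[InjectedFault] = None
    for _ in range(attempts):
        try:
            return func()
        except InjectedFault as err:
            last_error = err
    assert last_error is not None
    raise last_error


if __name__ == "__main__":
    flaky = with_chaos(lambda: "ok", failure_rate=0.3)
    try:
        print(call_with_retry(flaky, attempts=5))
    except InjectedFault:
        print("service did not survive the injected faults")
```

Running the experiment with a fixed failure rate shows whether the retry logic is enough, before a real outage asks the same question.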

Continuous Feedback

While your service is running, you might encounter all kinds of issues. I suggest you keep recording them and keep tracking them. Furthermore, the SRE should think about possible solutions to improve the service, and then go back to the plan phase.

Finally, a suggestion for everyone working in an SRE role: you will encounter problems. When you do, make sure you fully understand the scope of a problem's impact and fix it. Otherwise, a small problem might have a big impact.

If you have any better ideas about SRE, please share them with me. I keep reviewing my methods against the principles above. Thank you.
