Site Reliability Engineering: Principles and How-To

Before you start

In my experience, reliability design should be included and reviewed at every stage. Please do not leave it to the final phase of a product release.


After reading this book, I realized that SRE should hold to the following three principles.

  • Embrace and control risk. If you fear risk so much that you reject every change, you end up in the worst case: in the future, your service won't be able to handle new business requirements.
  • Simplicity. Simplicity is really important. A complex design causes more trouble, so keep the design as simple as you can.
  • Security. An SRE should keep security in the design. Otherwise, a security leak will cause huge problems.

Furthermore, I would suggest that SREs keep applying the following three methods in their daily work.

  • Automation. If your service is changed through manual steps, that is a high risk. SRE should keep implementing automation.
  • Visibility. Before your service is released, SRE should build all of the telemetry. If the telemetry is built only after the product is online, you have no way to measure your product metrics in the meantime.
  • Reduction. Sometimes you implement something new for your service and skip some things because of the release schedule. I would suggest you keep reviewing your approach to reduce legacy.
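
To make the Visibility point a little more concrete, here is a minimal sketch of building telemetry in before release. It assumes a hypothetical in-process `METRICS` store and a made-up `checkout` handler. A real service would export these numbers to whatever monitoring system it uses, so please treat the names as illustrations only.

```python
import time
from collections import defaultdict
from functools import wraps

# Hypothetical in-process metrics store (an assumption for this sketch).
# A real service would export these values to a monitoring system.
METRICS = defaultdict(lambda: {"calls": 0, "errors": 0, "total_seconds": 0.0})


def instrumented(name):
    """Record call count, error count, and total latency for a function."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            except Exception:
                # Count the failure, then let the caller see the error.
                METRICS[name]["errors"] += 1
                raise
            finally:
                # Runs on both success and failure.
                METRICS[name]["calls"] += 1
                METRICS[name]["total_seconds"] += time.perf_counter() - start
        return wrapper
    return decorator


@instrumented("checkout")
def checkout(order_total):
    # Hypothetical request handler used only for illustration.
    if order_total < 0:
        raise ValueError("order total must not be negative")
    return {"status": "ok", "total": order_total}
```

Because the decorator is applied when the handler is written, the call, error, and latency numbers exist from the first request rather than being added after the product is already online.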

Once you have done the above, your service will be on its way to achieving a high SLA. If you keep practicing these steps as a loop, you can steadily expand your service scope.
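
It may help to see what a "high SLA" means in numbers. The sketch below converts an availability target into the downtime it allows over a period. The function names and the 30-day default period are my own assumptions for illustration, not something taken from a specific tool.

```python
def allowed_downtime_minutes(sla_percent, period_days=30):
    """Return the minutes of downtime an availability target permits."""
    if not 0 < sla_percent <= 100:
        raise ValueError("sla_percent must be greater than 0 and at most 100")
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - sla_percent / 100)


def remaining_budget_minutes(sla_percent, downtime_so_far, period_days=30):
    """Return how much downtime is still available in the period."""
    return allowed_downtime_minutes(sla_percent, period_days) - downtime_so_far


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99):
        minutes = allowed_downtime_minutes(target)
        print(f"{target}% over 30 days allows {minutes:.1f} minutes of downtime")
```

For example, a 99.9% target over 30 days leaves roughly 43.2 minutes of downtime, so a single 30-minute incident uses most of it. That is one way to decide how much risk a change is allowed to carry.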

How to

I believe everyone knows that the DevOps loop starts from Plan, then goes through Build, Continuous Integration, Deployment, Operation, and Continuous Feedback, and then returns to Plan again without end. I would say SRE has a major mission in each phase of the DevOps loop.



Continuous Integration



Additionally, SRE could consider Chaos Engineering. It can verify your service's reliability when the service encounters possible failures.
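
As a rough illustration of the idea, the sketch below injects random failures into a dependency call and compares the success ratio with and without a retry. Everything here is hypothetical (`with_chaos`, `call_with_retry`, `fetch_profile`, and the 30% failure rate are assumptions for the example). Real chaos experiments normally target infrastructure such as hosts or networks rather than a single function.

```python
import random


class InjectedFault(Exception):
    """Raised by the chaos wrapper to simulate a dependency failure."""


def with_chaos(func, failure_rate, rng):
    """Wrap func so that roughly failure_rate of calls raise InjectedFault."""
    def wrapper(*args, **kwargs):
        if rng.random() < failure_rate:
            raise InjectedFault("simulated dependency failure")
        return func(*args, **kwargs)
    return wrapper


def call_with_retry(func, attempts=3):
    """Call func, retrying on InjectedFault up to `attempts` times."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for _ in range(attempts):
        try:
            return func()
        except InjectedFault as error:
            last_error = error
    raise last_error


def success_ratio(func, trials=1000):
    """Return the fraction of calls that complete without InjectedFault."""
    ok = 0
    for _ in range(trials):
        try:
            func()
            ok += 1
        except InjectedFault:
            pass
    return ok / trials


def fetch_profile():
    # Hypothetical dependency call used only for illustration.
    return {"user": "demo"}


rng = random.Random(42)  # Fixed seed so the experiment is repeatable.
flaky = with_chaos(fetch_profile, failure_rate=0.3, rng=rng)
bare = success_ratio(flaky)
retried = success_ratio(lambda: call_with_retry(flaky))
print(f"without retry: {bare:.1%}, with retry: {retried:.1%}")
```

The point of the comparison is that the experiment tells you, before a real incident does, whether the service's protections actually hold up when a dependency misbehaves.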

By the way, I recently completed an SLO improvement project. I will share it in another article.

Continuous Feedback

Finally, I would like to suggest this to everyone working in an SRE role: when you encounter a problem, fully understand its impact scope and then fix it. Otherwise, a small problem might cause a big impact.

If you have any better ideas about SRE, please share them with me. I always keep reviewing my methods against the principles above. Thank you.
