Site Reliability Engineering: Principles and How-To

Before You Start

In today's business world, everyone wants more market share, so product requirements always arrive urgently. As a result, we often find service teams that focus on delivering features quickly and cut back on reliability design and on reducing operational work. Once the service hits a problem that damages its image with customers, the team regrets not having designed it better.

SRE

Site Reliability Engineering (SRE) was introduced by Ben Treynor Sloss and his engineering team at Google, at a time when people were talking about DevOps and Agile. He described what people should do as an SRE, a role meant to ensure service reliability, following a few principles:

  • Simple. Simplicity is really important. A complex design causes more trouble, so keep the design as simple as you can.
  • Security. An SRE should keep security in the design. Otherwise, a security leak will cause a huge problem.
  • Visibility. The SRE should build all of the telemetry before the service is released. If telemetry is only built after the product is online, you cannot measure your product metrics from the start.
  • Reduction. Sometimes you implement something new for your service and skip some things because of the release schedule. I suggest reviewing your approach regularly to reduce legacy.

How to

Following the principles above, I map SRE work onto each phase of the DevOps loop. The small icons show tools you could try.

Plan

In the plan phase, the SRE should look for better solutions. For example, kubectl rollout can check the status of a rollout and, since Kubernetes 1.15, restart the Pods of a workload. You don’t need to kill the Pods one by one by brute force.
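As a rough illustration of the kubectl rollout example above, here is a minimal Python sketch that builds and runs `kubectl rollout status` and `kubectl rollout restart` commands. The helper names `rollout_command` and `run_rollout`, the deployment name `web`, and the `default` namespace are assumptions made for this example. Running the commands requires kubectl and access to a cluster.

```python
import subprocess

# Rollout actions covered by this sketch.
SUPPORTED_ACTIONS = ("status", "restart")


def rollout_command(action, deployment, namespace="default"):
    """Build a `kubectl rollout` command as an argument list (hypothetical helper)."""
    if action not in SUPPORTED_ACTIONS:
        raise ValueError(f"unsupported rollout action: {action}")
    return ["kubectl", "rollout", action, f"deployment/{deployment}", "-n", namespace]


def run_rollout(action, deployment, namespace="default"):
    """Run the command; raises CalledProcessError if kubectl exits with an error."""
    return subprocess.run(rollout_command(action, deployment, namespace), check=True)
```

For example, `run_rollout("restart", "web")` followed by `run_rollout("status", "web")` would restart the Pods of a hypothetical `web` deployment and then wait for the rollout to finish.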

Build

Building good practices really helps improve a service. For example, you can automate manual steps to reduce human error. In addition, a good tool can change a service for the better.

Continuous Integration

In the CI phase, the SRE has to protect the service's CI/CD pipeline and ensure that service changes go through the CI/CD tooling.

Deployment

When a service is released, the SRE has to coordinate the release with stakeholders. Otherwise, your change might cause problems for other services. Prepare a recovery plan before the change so that, if the deployment goes wrong, you can recover the service quickly. The SRE also has to verify the load-balancing design. This covers not only network load balancing but also the design itself, e.g. the AP read/write design. Finally, the SRE has to ensure telemetry is ready for the product release. Without telemetry, you might hit problems without any early warning.
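The idea of preparing a recovery plan before a change can be sketched roughly as follows. This is only an illustration: `deploy_with_recovery` and the three callables passed to it (`deploy`, `is_healthy`, `rollback`) are hypothetical names, and a real release would plug in its own deployment, telemetry check, and rollback steps.

```python
def deploy_with_recovery(deploy, is_healthy, rollback):
    """Run a deployment and fall back to the recovery plan if it fails.

    All three arguments are caller-supplied callables (assumptions for this sketch):
    deploy() applies the change, is_healthy() reads telemetry and returns a bool,
    rollback() executes the recovery plan prepared before the change.
    """
    try:
        deploy()
        healthy = is_healthy()
    except Exception:
        # A failed deployment or a failed health check is treated as unhealthy.
        healthy = False

    if healthy:
        return "deployed"

    rollback()
    return "rolled_back"
```

The point of the design is that the rollback step exists, and is wired in, before the change is attempted, and that the health check depends on telemetry being ready at release time.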

Operation

In the operation phase, the SRE should ensure the service meets its SLA. When an incident happens, the SRE should try every possible solution to recover the service as soon as possible.
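To make the SLA concrete, it can help to translate it into a downtime allowance. The sketch below assumes the SLA is expressed as an availability percentage over a 30-day window. Both that form and the function names are assumptions for illustration, not something the text above specifies.

```python
def allowed_downtime_minutes(sla_percent, period_days=30):
    """Minutes of downtime permitted by an availability SLA over the period."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - sla_percent / 100)


def remaining_downtime_minutes(sla_percent, downtime_so_far, period_days=30):
    """Downtime still available before the SLA is breached (negative if breached)."""
    return allowed_downtime_minutes(sla_percent, period_days) - downtime_so_far
```

Under these assumptions, a 99.9% SLA over 30 days allows roughly 43.2 minutes of downtime, which shows why fast recovery during an incident matters so much.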

Continuous Feedback

While your service is running, you may encounter all kinds of issues. I suggest recording them and tracking them. Furthermore, the SRE should think about possible ways to improve the service, and then return to the plan phase.
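Recording and tracking issues can start very small. The following is a minimal in-memory sketch. The `Issue` and `IssueLog` classes and their fields are hypothetical, and a real team would more likely use an issue tracker.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Issue:
    title: str
    category: str
    resolved: bool = False


@dataclass
class IssueLog:
    issues: list = field(default_factory=list)

    def record(self, title, category):
        """Record a new issue and return it so it can be tracked later."""
        issue = Issue(title, category)
        self.issues.append(issue)
        return issue

    def open_issues(self):
        """Issues that still need follow-up."""
        return [issue for issue in self.issues if not issue.resolved]

    def top_categories(self, n=3):
        """Most frequent issue categories, as input for the next plan phase."""
        return Counter(issue.category for issue in self.issues).most_common(n)
```

Counting issues by category is one simple way to decide which improvement to bring into the next plan phase.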