When software moves from testing to production, the site reliability engineering (SRE) team is responsible for making sure it runs properly. Sometimes the production environment differs from the development and test environments, so the application doesn't perform as well as it did in testing. Sometimes there are more users than anticipated, and the application doesn't scale. Sometimes real-world data causes problems that test data didn't uncover. Whatever the issue in production, site reliability engineers need to find the cause and make the changes needed for the application to succeed.
At some companies, the SRE function is called DevOps, because the work centers on moving applications out of development and keeping them operational.
Monitoring and Planning Ahead
Much of the site reliability engineer's role involves monitoring the system and planning for problems before they occur. For an SRE, the "system" means the entire system, including the application, third-party software, the hardware, and the network. The SRE team monitors the system to make sure it meets availability and responsiveness requirements.
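As a rough illustration of what checking availability and responsiveness requirements can look like, here is a minimal Python sketch. The `Sample` type, the function names, and the targets (99.9% availability, 300 ms at the 95th percentile) are all assumptions made for the example, not values taken from any particular team or tool.

```python
import math
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Sample:
    """One observed request (hypothetical shape for this example)."""
    ok: bool            # True if the request succeeded
    latency_ms: float   # time taken to respond, in milliseconds


def availability(samples: Sequence[Sample]) -> float:
    """Fraction of requests that succeeded."""
    return sum(1 for s in samples if s.ok) / len(samples)


def percentile(values: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def meets_targets(
    samples: Sequence[Sample],
    min_availability: float = 0.999,  # assumed target
    max_p95_ms: float = 300.0,        # assumed target
) -> bool:
    """True if both the availability and the 95th-percentile latency targets are met."""
    p95 = percentile([s.latency_ms for s in samples], 95)
    return availability(samples) >= min_availability and p95 <= max_p95_ms
```

In practice these numbers usually come from a monitoring system rather than a hand-built list, but the comparison against agreed targets is the same idea.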
The team also looks to the future of the system. They make sure that planned changes to any component have minimal impact on users. They review capacity and develop plans for expansion. They are also responsible for handling unplanned downtime and for disaster recovery planning.
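A capacity review often comes down to a simple question: at the current growth rate, how long until the system runs out of headroom? The sketch below assumes steady compound monthly growth, which is a simplification; the function name and the figures in the usage note are hypothetical.

```python
from typing import Optional


def months_until_full(
    current_load: float,
    capacity: float,
    monthly_growth: float,
) -> Optional[int]:
    """Months until load reaches capacity, assuming compound monthly growth.

    Returns 0 if the system is already at or over capacity, and None if
    load is not growing (capacity is never reached under this model).
    """
    if current_load >= capacity:
        return 0
    if monthly_growth <= 0:
        return None

    months = 0
    load = current_load
    while load < capacity:
        load *= 1 + monthly_growth  # apply one month of growth
        months += 1
    return months
```

For example, a system serving 500 requests per second with capacity for 1,000 and 10% monthly growth would reach its limit in 8 months under this model, which gives the team a deadline for its expansion plan.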
Site Reliability Engineer Skills
Site reliability engineers need solid software engineering skills. They need to understand how software works and how different software products interoperate. SREs often write complex scripts to automate operational tasks. But they also need to bring a broader perspective than just application software development, and understand networks and system administration.
Site reliability engineers need to be creative thinkers and problem solvers who can work under pressure to diagnose a system problem and quickly build a solid solution that brings things back under control. They also need to be analytical, reviewing data about system usage and system problems in order to develop plans for the future of the application.
Communication skills are important. SREs need to be able to ask questions of other technical teams to diagnose a problem, and to explain both the problem and the solution to management. SREs are part of a team and need to be able to work with a variety of colleagues.
Site Reliability Engineer Career Path
In some cases, SREs choose to strengthen their software engineering skills and move to the software engineering team to create the future of the application. Other SREs choose to develop their system engineering skills and continue to work within site reliability engineering. For those who are interested in management, success as an SRE can lead to firm-wide responsibility for managing infrastructure and shaping the future of the enterprise.