If your startup has an extensive and ever-growing infrastructure, you should hire a site reliability engineer.
Site reliability engineers (SREs) can help you improve your systems' performance and stability, support products and services while seamlessly deploying updates and releases, and more. Unlike DevOps engineers, who run a pre-existing infrastructure and automate IT operations to boost reliability, SREs plan and create sturdy infrastructure and update it as needed. They also collaborate with business leaders to develop and run sustainable IT systems, which can help you create new solutions to evolving challenges.
Read on to learn more about SREs and how you can hire qualified site reliability engineers.
What Does a Site Reliability Engineer Do?
Site reliability engineers have many responsibilities, including:
Build and maintain infrastructure
The most important responsibility for SREs is to create and maintain the IT infrastructure on which your company runs its services and products. This involves working with your self-hosted cloud and public clouds, such as Google Cloud and AWS.
Many SREs write infrastructure-as-code (IaC) with HCL and YAML. IaC allows SREs to automate infrastructure provisioning.
It has many other benefits, including:
- A template to follow for provisioning, which simplifies the configuration management process
- The ability to avoid ad-hoc, undocumented configuration changes
- The ability to divide infrastructure into modular parts that can be combined in various ways through automation
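To make the benefits above more concrete, here is a minimal Python sketch of the declarative idea behind IaC: the desired infrastructure is written down as data, compared with what currently exists, and turned into a plan of changes. The resource names and the `plan_changes` helper are hypothetical and exist only for illustration; real IaC tools do far more than this.

```python
from typing import Dict, List

# A resource map: name -> attributes. All names and attributes are hypothetical.
Resources = Dict[str, Dict[str, object]]


def plan_changes(desired: Resources, actual: Resources) -> List[str]:
    """Return a sorted list of actions needed to make `actual` match `desired`."""
    actions = []
    for name, spec in desired.items():
        if name not in actual:
            actions.append(f"create {name}")
        elif actual[name] != spec:
            actions.append(f"update {name}")
    for name in actual:
        if name not in desired:
            actions.append(f"delete {name}")
    return sorted(actions)


# Desired state, as it might be declared in version-controlled configuration.
desired = {
    "web-server": {"size": "small", "count": 3},
    "database": {"size": "large", "count": 1},
}

# Actual state, as it might be reported by a cloud provider.
actual = {
    "web-server": {"size": "small", "count": 2},  # drifted from the declared count
    "old-cache": {"size": "small", "count": 1},   # created by hand, never documented
}

for action in plan_changes(desired, actual):
    print(action)
```

Because the desired state lives in a file, every change goes through review, and undocumented resources such as the hand-made cache in this example show up as differences instead of going unnoticed.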
Your SRE will also help you define and manage important metrics such as Service Level Objectives (SLOs) and Service Level Indicators (SLIs). SLOs set the target levels for your service, while SLIs measure the service levels you actually deliver.
SREs can derive SLOs from internal discussions about consumer expectations and from promises made through Service Level Agreements (SLAs). After defining SLOs, they will determine error budgets: the amount of time your service is allowed to fall below the target level. These budgets give your SRE and development teams more breathing space, since no service can realistically run at 100% reliability. Error budgets can also help your startup measure incident impacts. For example, if a cybersecurity incident consumes 20% of your budget, you can label it as a major incident.
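The arithmetic behind an error budget can be sketched in a few lines of Python. The 99.9% target, the 30-day window, and the 9-minute incident are assumed values chosen for illustration, and the 20% cutoff simply mirrors the example in the paragraph above; your own SLOs and incident labels may differ.

```python
def error_budget_minutes(slo_target: float, window_days: int) -> float:
    """Allowed downtime, in minutes, for an availability SLO over a window."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (1 - slo_target)


def budget_consumed(downtime_minutes: float, budget_minutes: float) -> float:
    """Fraction of the error budget used up by an incident."""
    return downtime_minutes / budget_minutes


# Assumed values for illustration: a 99.9% SLO over a 30-day window.
budget = error_budget_minutes(slo_target=0.999, window_days=30)
print(f"Error budget: {budget:.1f} minutes")

# A hypothetical incident that caused 9 minutes of downtime.
share = budget_consumed(downtime_minutes=9, budget_minutes=budget)
label = "major" if share >= 0.20 else "minor"
print(f"Incident used {share:.0%} of the budget ({label})")
```

With these assumptions the budget works out to roughly 43.2 minutes per 30 days, so a single 9-minute outage uses a little over a fifth of it.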
Deploy monitoring and alerting systems
Your SRE will then check whether your company meets its SLOs by defining and setting up SLI monitoring.
SREs typically monitor the following SLIs through Software as a Service (SaaS) vendors like Sentry and Datadog or self-hosted tools like Grafana and Prometheus:
- CPU and memory usage spikes
- Page load speed
- Service uptime for APIs, websites, and apps
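As a rough illustration, a service-uptime SLI like the last item above could be computed from health-check results and compared with an SLO. The probe data and the 99.9% target below are made up for the example, and `availability_sli` is a hypothetical helper rather than part of any monitoring product.

```python
from typing import List


def availability_sli(probe_results: List[bool]) -> float:
    """Fraction of health checks that succeeded."""
    if not probe_results:
        raise ValueError("no probe results to measure")
    return sum(probe_results) / len(probe_results)


# Hypothetical data: 1,000 health checks, 2 of which failed.
probes = [True] * 998 + [False] * 2
SLO_TARGET = 0.999  # assumed target for illustration

sli = availability_sli(probes)
meets_slo = sli >= SLO_TARGET
print(f"SLI: {sli:.3%}, meets SLO: {meets_slo}")
```

In practice a monitoring platform collects and aggregates these measurements for you; the point of the sketch is only that the SLI is the measured value and the SLO is the target it is compared against.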
After setting up monitoring and alert systems, your SRE will work with you to ensure that the monitoring thresholds are set at the right levels. This will prevent team members from being bombarded with low-priority alerts. Your SRE will also refine the alerting system to send alerts whenever it detects symptoms of a problem so that team members can take action right away.
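One way to picture threshold tuning is a small routing function that decides whether a reading should page someone, open a low-priority ticket, or be ignored. The metric names and threshold values in this Python sketch are hypothetical; real values would come from your SLOs and from how your systems have behaved in the past.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Threshold:
    warn: float  # at or above this value, open a low-priority ticket
    page: float  # at or above this value, page the on-call engineer


# Hypothetical thresholds chosen for illustration only.
THRESHOLDS = {
    "cpu_percent": Threshold(warn=80.0, page=95.0),
    "page_load_seconds": Threshold(warn=2.0, page=5.0),
}


def classify(metric: str, value: float) -> Optional[str]:
    """Return "page", "ticket", or None when no alert is needed."""
    limits = THRESHOLDS[metric]
    if value >= limits.page:
        return "page"
    if value >= limits.warn:
        return "ticket"
    return None


print(classify("cpu_percent", 72.0))        # below both thresholds
print(classify("cpu_percent", 85.0))        # warning level
print(classify("page_load_seconds", 6.3))   # paging level
```

Raising or lowering the `warn` and `page` values is the tuning step described above: set them too low and the team is flooded with noise, set them too high and real problems are noticed late.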
Automate rote work
SREs can also reduce "toil," which in turn lowers labor costs. According to Google SRE, toil is manual, repetitive, automatable, and tactical work that slows down other projects and takes time away from SRE and dev teams.
Examples of toil include:
- Digging into legacy configurations and code to fix errors
- Manually sending out SMS and emails to push alerts
- Manually executing each step of a script that automates a task
SREs can build automation for these repetitive and energy-consuming tasks. For example, your SRE hire can design a system that allows development teams to automate script execution. They can also create an alert system that automatically sends out SMS and emails to team members.
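As a sketch of what automated script execution might look like, the Python example below runs a list of steps in order and stops at the first failure, so nobody has to trigger each step by hand. The step names and the `run_steps` helper are hypothetical, and the step bodies are placeholders for real work.

```python
from typing import Callable, List, Tuple

Step = Tuple[str, Callable[[], None]]


def run_steps(steps: List[Step]) -> List[str]:
    """Run each step in order; stop at the first failure and report what happened."""
    log = []
    for name, action in steps:
        try:
            action()
        except Exception as exc:
            log.append(f"FAILED {name}: {exc}")
            break
        log.append(f"ok {name}")
    return log


# Hypothetical steps; the bodies stand in for real commands.
def back_up_database() -> None:
    pass


def apply_migration() -> None:
    raise RuntimeError("migration file missing")


def restart_service() -> None:
    pass


steps = [
    ("back_up_database", back_up_database),
    ("apply_migration", apply_migration),
    ("restart_service", restart_service),
]

for line in run_steps(steps):
    print(line)
```

Because the runner stops as soon as a step fails, the service is not restarted on top of a broken migration, and the log records exactly where the process stopped.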
Manage and respond to incidents on call
Once your SRE has set up monitoring, alerts, and automation, they will use a schedule to distribute the load of responding to alerts. They will use an incident management platform to manage all alerts and incidents in one centralized hub. This platform will also help the SRE:
- See who did what and when during each incident
- Calculate key metrics like Mean Time to Resolve (MTTR) and Mean Time to Acknowledge (MTTA)
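The two metrics above are simple averages over incident timestamps. The Python sketch below uses made-up incident records and measures both metrics from the moment the alert fired; that starting point is an assumption, since teams and platforms define these metrics slightly differently.

```python
from datetime import datetime
from statistics import mean
from typing import Dict, List

Incident = Dict[str, datetime]

# Hypothetical incident records: when the alert fired, was acknowledged, and was resolved.
incidents: List[Incident] = [
    {
        "fired": datetime(2024, 1, 3, 9, 0),
        "acked": datetime(2024, 1, 3, 9, 4),
        "resolved": datetime(2024, 1, 3, 9, 50),
    },
    {
        "fired": datetime(2024, 1, 9, 14, 0),
        "acked": datetime(2024, 1, 9, 14, 10),
        "resolved": datetime(2024, 1, 9, 15, 10),
    },
]


def mean_minutes(records: List[Incident], start_key: str, end_key: str) -> float:
    """Average time in minutes between two timestamps across all incidents."""
    return mean(
        (record[end_key] - record[start_key]).total_seconds() / 60
        for record in records
    )


mtta = mean_minutes(incidents, "fired", "acked")     # Mean Time to Acknowledge
mttr = mean_minutes(incidents, "fired", "resolved")  # Mean Time to Resolve
print(f"MTTA: {mtta:.0f} minutes, MTTR: {mttr:.0f} minutes")
```

For these two sample incidents the averages come to 7 minutes to acknowledge and 60 minutes to resolve. An incident management platform tracks the same timestamps automatically.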
Your SRE will also be responsible for post-mortems, where they will explain the following to external and internal stakeholders:
- Events that led up to each incident
- Steps taken to resolve the incident
- Changes that your organization made to prevent similar incidents from occurring in the future
Why You Would Hire an SRE
There are many benefits to hiring an SRE, including:
Maximize system uptime
In our highly digital world, customers are used to accessing websites, APIs, and apps any time they want. Frequent and prolonged downtime of your products will lead to significant reputation and financial losses.
SREs will help you prevent or minimize downtimes of your apps, APIs, and other services. They accomplish this by building and maintaining secure and reliable IT infrastructure, managing and responding to security and system incidents that threaten stability, and deploying monitoring and alerting systems.
Accelerate software delivery
SREs will also help you shorten software development and delivery cycles by automating both. They will also establish continuous integration and continuous delivery (CI/CD) best practices that reduce dev overhead, so you can deliver your products effectively and efficiently.
Evaluate and mitigate risks
It's more important than ever to reduce risks and improve security. According to CIRA's 2021 Cybersecurity Report, the volume of cyberattacks increased from 29% in 2020 to 36% in 2021. What's more, 17% of all companies surveyed were hit with ransomware — and 69% of those affected paid the ransom.
Hire an SRE to develop contingency plans and countermeasures to protect your data from malicious third parties. SREs will use these documents and procedures to assess and mitigate risks such as cybersecurity breaches and DDoS attacks.
Improve cost-efficiency
Research has shown that downtime causes customer loss for more than a third of small and medium businesses. Of these businesses, 17% also experience revenue loss.
Hiring an SRE will improve your startup's cost-efficiency by reducing the chances of downtime. They will build reliable IT infrastructure so your offerings can provide value to customers as consistently as possible.
This will allow you to:
- Attract and retain more customers
- Start and finish more deals, particularly during peak season
What Skills to Look For in a Site Reliability Engineer
Now that you know why hiring an SRE can make or break your startup, here are the skills to look for in a candidate:
Core technical SRE skills
First, you should check if your SRE hire has core technical SRE skills. These include:
- Expert knowledge of version control
- CI/CD implementation experience
- Deep understanding of DevOps best practices and concepts
- Expert knowledge of Linux
- Issue troubleshooting experience
- Automation experience
- Experience with distributed storage solutions like Ceph, NFS, HDFS, and S3
- Experience with dynamic resource management frameworks such as YARN, Kubernetes, and Mesos
- Previous experience in technical engineering
Soft SRE skills
Besides core SRE skills, you should also look for soft or non-technical SRE skills. These include:
- Strong problem-solving skills, including a proactive approach to spotting areas for improvement, problems, and performance bottlenecks
- Fluency in the language(s) your company uses — SREs need to pitch their ideas to stakeholders and communicate with other team members.
- Excellent written and verbal communication skills
- Ability to perform well under pressure
Where to Find SREs
Once you know what skills you're looking for, it's time to start your search for site reliability engineers. Here are some of the best places to find SREs:
SRE job boards
SRE job boards are a great place to start your search. Unlike general job boards like LinkedIn and Indeed, these boards are specifically tailored for SREs and companies looking for SREs. This makes sourcing SREs much more straightforward.
The most popular SRE job boards are:
Srejobs.com is a sleek, user-friendly website for sourcing and hiring SRE talent. SRE jobs from all over the web are posted on this site. Contact the web administrator through Twitter to start advertising on this site.
Sre.pallet.xyz offers a diverse array of jobs. Besides SRE positions, it also has roles from the following categories:
- Software engineering
- Data and analytics
- Business management and strategy
Curated by Rootly, an incident management platform and Slack app built for SREs, this board has jobs from a wide variety of companies. It also allows applicants to browse roles by companies.
Create an account with a profile for your company, and you can post your SRE job on sre.pallet.xyz.
LinkedIn is another excellent place to source site reliability engineers, with over 20,000 SREs on the platform! Use LinkedIn Business tools and guides to narrow down your choices and target the right candidates at the right time.
LinkedIn also offers other products to help you hire SREs:
- Career Pages: Use this employer branding tool to highlight your open roles and company culture.
- Talent Insights: Use this workforce planning product to make informed talent decisions using real-time data.
- Recruiter: This platform helps you source, connect, and manage people you want to hire.
If you don't have time to manually vet candidates, consider using talent marketplaces. These platforms allow you to hire pre-vetted SREs who are ready to work at any time from anywhere.
Revelo is a talent marketplace that matches tech companies with pre-vetted and qualified remote developers from Latin America. You can rest assured you're getting the top SREs.
All of our engineers are:
- Pre-screened for more than 100 skills, including Node, React, Python, Ruby on Rails, and more
- Fluent in English
- Located in U.S.-adjacent time zones, such as Eastern Standard Time (EST), Mountain Standard Time (MST), and Pacific Standard Time (PST)
To start hiring, all you have to do is schedule a meeting. After you tell us your goals, technical demands, and needs, we'll match you with a list of vetted SRE candidates. You can then interview and select the candidates you want.
Site Reliability Engineer Job Description
After you've found a platform for sourcing SREs, you need to write a comprehensive job description to attract the SRE applicant you want.
Remember to include the following when creating an SRE job post:
- The name of your position (e.g., Staff Site Reliability Engineer)
- Whether your position is remote or on-site
- Whether your position is full-time, part-time, or freelance
- Whether your position is permanent or contract
- Your SRE's responsibilities and how they will fit into the team
- Required skill sets and experiences
- Any other requirements, such as travel and background checks
Here's what a typical site reliability engineer job description looks like:
Staff Site Reliability Engineer - Revelo
Los Angeles, CA - Remote
150,000 - 210,000 USD a year - Full-time, Permanent
We are looking for remote Site Reliability Engineers to join our team.
Revelo Site Reliability Engineers will work with our engineering and development teams to design, code, validate, run, and grow our IT infrastructure. Your goal is to ensure that our platform is always running the way it should.
This position is open for SREs located in the following time zones:
- Pacific Standard Time (PST)
- Mountain Standard Time (MST)
- Eastern Standard Time (EST)
Responsibilities:
- Create and implement actionable alerts
- Perform chaos testing
- Work with devs and engineers to create SLOs and SLIs
- Resolve existing issues in our infrastructure
- Mitigate and prevent future issues in our infrastructure
- Handle cost optimization, capacity planning, and architecture review of Kafka, Druid, Hadoop, Flink, Spark, and other systems
- Create and maintain network diagrams, technical documentation, procedures, and runbooks
- Respond to production incidents using your knowledge and experience in systems engineering and software development
- Allocate authority and resources as needed
Key Skills and Attributes:
- Bachelor of Science in Computer Science or equivalent practical experience
- 5+ years of big data maintenance and operation experience
- Ability to debug, write, and optimize code
- Ample coding experience in Java, Python, Go, Perl, Shell, or another language
- Passion for SRE topics like resilience, performance, SLOs, and the elimination of toil
- Strong problem-solving skills
- Experience with observability tools like Zabbix, Grafana, and Prometheus
- Strong verbal and written communication skills
- Experience and proven ability to work remotely
- Understands or has experience with Chaos Engineering
- Proven experience in automating routine tasks using tools like Terraform, Chef, or Ansible
- Ability to express IT infrastructure as code
- Experience with configuration management tools like Puppet
- Experience with container tools such as Docker and Kubernetes
Who We Are
Revelo is Latin America's largest technology company in the human resources sector. We offer an intuitive recruitment platform that matches candidates with companies in only three days. Our mission is to connect qualified developers with tech startups around the world. To learn more about Revelo, check out our website at revelo.com.
Schedule:
- Monday to Friday
- 9 AM to 5 PM EST
Benefits:
- Paid time off
- Dental insurance
- Referral program
- Health insurance
- Employee discount
Site Reliability Engineer Average Salary
Besides creating a strong job description, you also need to think about salaries for your future SRE hires.
SRE salaries are typically high: the average base pay for SREs in San Francisco is $119,654 per year. The average base pay is lower in other parts of the country, at $105,548 per year, although some cities come close to San Francisco's level. In Chicago, for instance, the average base pay of SREs is $118,469 per year.
In comparison, the average annual cost to hire a senior Chilean SRE is $106,960. SREs from Uruguay, Brazil, Argentina, and other Latin American nations offer similar rates. The average salary is lower because these countries have a significantly lower cost of living, yet they house a vast pool of upcoming tech talent. Chile, for instance, is home to a rapidly growing tech pool and innovative IT infrastructure. The nation is also known for its world-class startup accelerators such as Start-Up Chile (SUP).
If you're interested in hiring remote SREs from Chile and other Latin American countries, check out how Revelo can help.
SRE Job Interview Questions
You also need to craft interview questions for your applicants. Don't just ask generic questions like "What are SLOs?" and "Why do you want to work here?" Ask questions that will give you clear insight into your candidate's knowledge, experience, and personality.
Here are some questions you can ask your candidates during an interview:
- How long have you worked as an SRE?
- What drew you to the SRE field?
- How do you set SLOs and SLIs? How do you make adjustments as needed?
- Which pillar of observability is the most important to you?
- How have you implemented automation in the past? Give me two examples.
- Do you consider employee or customer experience when implementing SRE strategies? Why or why not?
- Do you like working with container tools like Docker and Kubernetes?
- How do you keep up with SRE trends?
- What's your favorite SRE field?
SREs play an important role in stabilizing and protecting your company's IT infrastructure. Without them, your infrastructure will be exposed to significant risks, including frequent and prolonged downtime, cybersecurity attacks, and more.
While sourcing and hiring SREs typically takes a lot of time and energy, it doesn't have to be an all-consuming task. Join Revelo today to start connecting with pre-vetted dedicated site reliability engineers. We'll also take responsibility for the most laborious steps of the onboarding process, such as compliance, benefits, and payroll concerns.
Interested in the Revelo experience? Schedule a meeting with us today. Tell us about your needs, expectations, and goals, and we'll match you with full-time vetted SRE talent within three days. You can then interview and hire the candidates you want, and you're well on the way to smooth and secure operations at your company.