SRE challenges and how to overcome them with insights from an Ardent SRE expert

3 March 2023 | Noor Khan

Site Reliability Engineering (SRE) is a growing discipline that ensures the uptime and availability of software by bridging the gap between development and operations teams. By leveraging software to effectively manage and monitor applications, Site Reliability Engineers enable software to scale easily without having to manually manage multiple systems.

SRE offers multiple benefits, such as improved collaboration between teams, increased efficiency through software and automation, a culture of continuous improvement, and improved levels of software reliability and resilience. However, these benefits do not come without some challenges. In this article, we will look at some of the key SRE challenges and how to overcome them with insights from Ardent’s experienced Site Reliability Engineer.

Resistance to the SRE approach

SRE is a relatively new discipline, and it may be met with resistance by teams as it requires a change of mindset and approach. To overcome this, we recommend you pilot the SRE methodology and measure its success with key metrics. If goals and objectives are met, a further rollout would be the next step. Additionally, provide training so that teams understand the SRE approach and are able to adopt and implement it effectively.

Choosing the right tools and technologies

Choosing the right tech stack can be challenging, especially if you do not have the expertise in-house. Start by identifying your key objectives and clearly defining your metrics of success; these can then inform your choice of tools and technologies. For example, at Ardent, one key metric we measure is the response time to incidents. A brilliant technology which enables us to swiftly report and communicate errors is PagerDuty.

“There is a wide variety of SRE technologies to choose from, including the likes of Prometheus for web monitoring and AWS CloudWatch and Data Log for data monitoring. We choose the right technology by establishing the client's budget and the technical requirements. For example, open source technologies are cost-effective as they are free to use.”
Shoaib Mulani, Site Reliability Engineer

Ensuring continuous reliability and uptime

The ultimate objective of SRE is to ensure continuous reliability and uptime of software through the processes and software put in place. This can be challenging, especially when the software is updated regularly, whether through maintenance updates or feature updates. To overcome the challenge of ensuring 100 percent uptime, SRE must take a very structured and organised approach to error detection, communication and resolution.

“Automation is essential to every SRE team to ensure the reliability of infrastructure and applications. The two processes which should be automated are monitoring and reporting. For example, for a client project, we measure the server disk utilisation to ensure there is no downtime. The monitoring and error reporting are automated to ensure that an alert pops up when the disk is fully utilised.”
Shoaib Mulani, Site Reliability Engineer
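
As an illustration of the kind of automated disk check described in the quote above, here is a minimal Python sketch. It is not Ardent's actual tooling: the function names and the 80% and 95% thresholds are assumptions chosen for the example, and a real setup would send the result to an alerting tool rather than print it. Alerting before the disk is completely full is assumed here so that there is time to react.

```python
import shutil

# Hypothetical thresholds for illustration; tune them to your own systems.
WARN_AT_PERCENT = 80.0
CRITICAL_AT_PERCENT = 95.0


def disk_utilisation_percent(path: str = ".") -> float:
    """Return the percentage of disk space used on the volume holding `path`."""
    usage = shutil.disk_usage(path)  # named tuple of total, used, free (bytes)
    return usage.used / usage.total * 100


def classify_utilisation(percent: float,
                         warn_at: float = WARN_AT_PERCENT,
                         critical_at: float = CRITICAL_AT_PERCENT) -> str:
    """Map a utilisation percentage to an alert level."""
    if percent >= critical_at:
        return "critical"
    if percent >= warn_at:
        return "warning"
    return "ok"


if __name__ == "__main__":
    percent = disk_utilisation_percent(".")
    level = classify_utilisation(percent)
    # In a real setup, this is where an alert would be raised with the on-call tool.
    print(f"Disk utilisation: {percent:.1f}% ({level})")
```

Running a check like this on a schedule, and raising an alert whenever the level is not "ok", is one simple way to automate both the monitoring and the reporting.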

Additionally, if your software requires continuous ongoing operational monitoring and support, you may want to outsource the process as it can be a cost-effective solution.

Selecting the right metrics to measure

“A common challenge many SRE teams face is deciding which metrics they should be following. To overcome this challenge, at Ardent we measure traditional metrics such as CPU utilisation and disk utilisation, to name a few. However, for each client, we discuss their goals and objectives to identify and set the key metrics.”
Shoaib Mulani, Site Reliability Engineer

Managing incidents effectively

Managing incidents effectively will directly impact the reliability and uptime of software, both in the present and in the future. However, many businesses find that there are no effective, structured processes in place, which means there is a lack of learning from errors and mistakes and, consequently, those mistakes are repeated.

The following are the steps every SRE should take to overcome this challenge:

Establish set and structured procedures and policies in line with SLAs and ensure they are followed every time. You can do this through training and by building the procedures into the workflow of the SRE team. These can range from which parties to communicate with to the steps to take when an incident is first detected.
Take an organised approach to recording incidents and maintain records as and when they happen.
Perform root cause analysis (RCA) to mitigate the risk of these errors occurring again.
Keep documentation and track everything including the post-mortem reports of major incidents to better prepare your organisation for any incidents in the future.
Communication is crucial to effective SRE within your organisation, so SRE teams must implement a clear, effective communication model. There should also be regular communication (daily, weekly and monthly) in line with your business requirements.

“Keeping up and maintaining a run book is vital for every SRE team. Having set procedures in place to deal with incidents enables engineers to react quickly in order for a quick resolution.”
Shoaib Mulani, Site Reliability Engineer
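
To make the recording and root cause analysis steps above more concrete, here is a minimal Python sketch of an incident log. The `Incident` class, its fields and the `missing_root_cause` helper are hypothetical names invented for this example; many teams would keep this information in a ticketing or incident management tool instead.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class Incident:
    """One recorded incident (hypothetical structure for illustration)."""
    title: str
    detected_at: datetime
    resolved_at: Optional[datetime] = None  # None while the incident is open
    root_cause: Optional[str] = None        # filled in once RCA is complete

    def time_to_resolve(self) -> Optional[timedelta]:
        """Time from detection to resolution, or None if still open."""
        if self.resolved_at is None:
            return None
        return self.resolved_at - self.detected_at


def missing_root_cause(incidents: List[Incident]) -> List[Incident]:
    """Return resolved incidents that still have no root cause recorded."""
    return [
        incident for incident in incidents
        if incident.resolved_at is not None and incident.root_cause is None
    ]
```

A list like the one returned by `missing_root_cause` could be reviewed in a regular team meeting so that no resolved incident is closed without its lessons being captured.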

Meeting service level expectations

To effectively meet service level expectations, the following should be clearly established and communicated to the entire SRE team and the key stakeholders.

SLA (Service Level Agreement) – The SLA will cover and detail how the service will be delivered, the communication channels and frequency, reporting type and frequency and more.
SLI (Service Level Indicators) – These are key metrics such as response time. For example, you may set a response time threshold of 30 minutes. If this is breached, then it becomes a problem.
SLO (Service Level Objectives) – These are the core objectives, for example, ensuring 95% uptime.
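
To show how an uptime objective translates into something measurable, here is a minimal Python sketch using the 95% uptime example above. The function names and the 30-day reporting period are assumptions made for illustration. Over 30 days, a 95% uptime objective leaves 5% of the period, or 36 hours, as the maximum downtime.

```python
from datetime import timedelta


def allowed_downtime(slo_percent: float, period: timedelta) -> timedelta:
    """Downtime permitted over `period` while still meeting the uptime SLO."""
    return period * (1 - slo_percent / 100)


def availability_percent(downtime: timedelta, period: timedelta) -> float:
    """Measured availability (the SLI) as a percentage of the period."""
    return (1 - downtime / period) * 100


def slo_met(downtime: timedelta, period: timedelta, slo_percent: float) -> bool:
    """True if the recorded downtime stays within what the SLO allows."""
    return downtime <= allowed_downtime(slo_percent, period)


if __name__ == "__main__":
    period = timedelta(days=30)  # assumed reporting period
    budget = allowed_downtime(95.0, period)
    print(f"Downtime allowed by a 95% SLO over 30 days: {budget}")
```

Comparing recorded downtime against this allowance during the period, rather than only at the end, gives the team an early warning that the objective is at risk.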

Implementing automation

With automation at the heart of the SRE discipline, implementing automation across all processes is key to reducing toil, which uses up time that could otherwise be spent on high-value, mission-critical tasks. There are many brilliant automation tools on the market, including Terraform, Docker and Ansible.

Security challenges

“Security is a common challenge that SRE teams will face from time to time, therefore research and knowledge are key. To overcome common security challenges, ensure you are aware of the limitations of your tech stack when it comes to security. If these limitations present gaps within your solutions, they should be reported to the development team.”
Shoaib Mulani, Site Reliability Engineer

Finding and retaining valuable SREs

Hiring and retaining SREs remains a great challenge for many organisations, with DevOps.com reporting that demand for SRE-specific skills is high. If this is a challenge for your organisation, a great and much more cost-effective solution to overcoming this is working with a third party and outsourcing your SRE processes. This reduces the time and resources required in finding, hiring and training SRE professionals.

SRE best practices as highlighted by Ardent’s Site Reliability Engineer Shoaib Mulani

Continuously engaging in and improving the whole lifecycle of services from inception and design, through deployment, operation and refinement.
Supporting services before they go live through activities such as system design consulting, developing software platforms and frameworks, capacity planning and launch reviews.
Maintaining services once they are live by measuring and monitoring availability, latency and overall system health.
Scaling systems sustainably through mechanisms like automation.
Evolving and optimising systems by actively pushing to create change that improves reliability and velocity.
Practicing sustainable incident response and blameless postmortems.

Key SRE benefits

Reduce software downtime
Bridge the gap between platform design, development and operations
Increase security and compliance
Mitigate the risk of human error with automation
Gain visibility into the health and performance of software and systems

SRE challenges and how to overcome them with Ardent

At Ardent, we have worked with many clients to provide ongoing operational monitoring and support of their systems, applications and data. Ardent's operational monitoring and support service incorporates the SRE discipline and offers invaluable benefits such as:

Continuous improvement and optimisation
Peace of mind with your systems, software and data being in expert hands
Swift error detection and resolution
A clear, defined structured approach
Around-the-clock monitoring and support

If you are facing the SRE challenges mentioned in this article and are exploring outsourcing SRE, then you have come to the right place. Get in touch to find out more and we can discuss our three-tier structure to find a solution that is unique to your challenges, needs and requirements.

Ardent Expert: Shoaib Mulani

Shoaib Mulani is a highly knowledgeable Site Reliability Engineer with significant experience in the field. He has worked on many SRE projects leveraging a wide variety of SRE tools and technologies to deliver excellence to our clients.
