Site Reliability Engineer

Join Police Digital Service as a Site Reliability Engineer - starting salary £80,000

As a Site Reliability Engineer (SRE), you will be a cornerstone of the Technical Operations team, dedicated to ensuring the seamless operation and reliability of the systems that deliver critical services to our Policing customers. This role is at the heart of our technological infrastructure and requires a blend of software engineering skill and operational acumen.

Key Responsibilities

Design Scalable Infrastructure: Architect and engineer cloud solutions that are inherently scalable, ensuring they can manage varying loads and demands with ease, while maintaining performance and reliability.
Automate Operations: Develop and implement robust scripts and automation tools to streamline deployment, configuration, and management tasks, thereby increasing efficiency and reducing the potential for human error.
Monitor System Health: Use comprehensive monitoring solutions to continuously track system performance and health indicators, allowing for proactive identification and resolution of potential issues.
Lead Incident Response: Take charge during service disruptions, coordinating and leading the response to ensure rapid resolution, minimal impact, and clear communication throughout the incident.
Enforce Security Standards: Vigilantly uphold security protocols and compliance standards to protect sensitive data and infrastructure against threats and vulnerabilities.
Plan for Capacity: Engage in strategic capacity planning to accurately predict and prepare for future infrastructure needs, scaling resources accordingly to handle increased load and service demands.
Document Systems: Create and maintain clear, detailed, and up-to-date documentation of cloud infrastructure, including architecture designs, configurations, and operational procedures.
Mentor Team Members: Provide expert guidance and mentorship to less experienced team members, promoting a culture of knowledge sharing, continuous learning, and technical excellence.
Research New Technologies: Actively investigate and evaluate new technologies, tools, and practices that can enhance system reliability, efficiency, and the overall cloud service offering.
Develop Resilience Strategies: Formulate and implement strategies to enhance the resilience and fault-tolerance of cloud services, ensuring they can withstand and recover from unexpected disruptions.
Problem Management: Lead comprehensive post-mortem analysis following incidents to identify root causes, extract lessons learned, and implement preventive measures to avoid future occurrences.

What you need to succeed in the role

Technical Expertise: In-depth knowledge of Azure cloud infrastructure, including services like Azure Compute, Azure Storage, and Azure Networking. Familiarity with implementing and managing Azure solutions such as Azure Kubernetes Service, Azure Functions, and Azure DevOps is crucial.
Software Engineering Skills: Strong coding skills in languages such as PowerShell, Python, Go, or Ruby, and experience with software development life cycles and agile methodologies. Understanding of Azure SDKs and APIs for integration and automation purposes.
Automation and Orchestration: Experience with automation tools like Azure Resource Manager, Azure Automation, Ansible, or Chef, and with orchestration platforms like Kubernetes or Docker Swarm. Proficiency in Azure Bicep would be a significant advantage.
Monitoring and Analytics: Proficiency with Azure monitoring tools such as Azure Monitor, Application Insights, and Network Watcher. Ability to analyse and interpret complex datasets to inform decision-making.
Continuous Learning: A commitment to continuous professional development, staying abreast of the latest industry trends and emerging technologies in cloud computing and SRE practices, particularly within the Azure ecosystem.
Leadership and Mentorship: The ability to lead initiatives, mentor junior team members, and contribute to a culture of technical excellence and continuous improvement.

For a full list of responsibilities and criteria, please refer to the Candidate Pack.

Why join us?

Balance is important and we want you to take time off to recharge – we offer 28 days' annual leave plus bank holidays, rising to 30 days after 5 years of service. Holiday purchase is also available.
Flexible working hours – we trust you to do your job and we appreciate that life doesn't always fit around a 9 to 5 workday. We operate core hours of 8 to 6, Monday to Friday (37-hour week).
We care about your well-being – we have an EAP that offers not just welfare benefits but also retail discounts
Plan for the future – we offer an excellent pension scheme and life assurance cover
Put your mind at rest regarding your health – we offer remote GP, mental health and physiotherapy appointments via video consultation

You can find out more here:
Benefits – Police Digital Service (pds.police.uk)

Diversity, equity and inclusion

We are committed to equal opportunity for all and will not discriminate on any grounds. We encourage applications from people from the widest possible span of experience. We particularly welcome applications from Black, Asian and Minority Ethnic (BAME) candidates and people with disabilities.

Working Arrangements

This is a remote role.

All applicants must be eligible for NPPV3 and SC clearances. Successful applicants must have NPPV3 clearance approved before starting with PDS.

Posted by Police ICT.
