The Role:
We are seeking a highly skilled and experienced Senior Site Reliability Engineer to join our growing team. You will play a central role in ensuring the reliability, scalability, and performance of our critical infrastructure and applications. Beyond core SRE responsibilities, you will also serve as a key liaison across teams, fostering collaboration and keeping operations running smoothly.
Responsibilities:
Site Reliability Engineering:
- Proactively identify and mitigate potential issues impacting infrastructure and applications.
- Partner with development teams to implement best practices for building reliable and scalable systems.
- Stay up-to-date on the latest SRE trends and technologies.
Monitoring and Observability:
- Design, implement, and maintain robust monitoring solutions using tools like Prometheus and Grafana.
- Develop and configure alerts within tools like PagerDuty to ensure timely notification of potential issues.
- Analyze and troubleshoot issues using collected application and infrastructure metrics.
Incident Management:
- Lead incident response, ensuring timely resolution and minimizing downtime.
- Document and communicate incident details effectively to stakeholders.
- Conduct post-incident reviews to identify root causes and implement preventative measures.
Service Level Agreements (SLAs):
- Collaborate with product and engineering teams to define clear and measurable SLAs for our SaaS offerings.
- Establish Service Level Objectives (SLOs) for key metrics based on SLA requirements.
- Define Service Level Indicators (SLIs) to track progress towards achieving SLOs.
- Monitor SLO compliance and proactively identify potential SLA breaches.
Automation:
- Identify opportunities for automation to improve efficiency and reliability.
- Develop and implement automation scripts in languages such as Python or Bash.
- Automate routine tasks and incident response workflows.
Cross-Team Collaboration:
- Act as a liaison between SRE, Product, Security, Application Engineering, and Customer Operations teams.
- Facilitate communication and information sharing across teams to ensure smooth operations.
- Work collaboratively to define and implement solutions that meet the needs of all stakeholders.
Mentorship and Knowledge Sharing:
- Mentor and collaborate with junior SREs.
- Share knowledge and best practices within the team.
- Contribute to the development and documentation of internal SRE processes.
Required Skills:
- 5-8 years of experience as a Site Reliability Engineer (SRE) or related role.
- Experience with Google Cloud Platform (GCP).
- Proven experience with monitoring tools like Prometheus and Grafana.
- Strong understanding of incident management best practices.
- Experience with alerting tools like PagerDuty.
- Experience with scripting languages like Python or Bash for automation.
- Excellent communication and collaboration skills.
- Ability to work independently and as part of a team.
- Strong problem-solving and analytical skills.
- Passion for building reliable and scalable systems.
Nice to Have:
- Experience with container orchestration platforms like Kubernetes.
- Experience with chaos engineering principles.
- Experience with configuration management tools like Ansible or Chef.
What we offer:
- Remote Work Opportunities
- Flexible Work Hours
Tech Holding
Tech Holding is a full-service consulting firm that delivers predictable outcomes and high-quality solutions to clients.