Senior Site Reliability Engineer

Remote
Senior
🇲🇽 Mexico
Site Reliability Engineer
Technology

The Role:

We are seeking a highly skilled and experienced Senior Site Reliability Engineer to join our growing team. You will play a key role in ensuring the reliability, scalability, and performance of our critical infrastructure and applications. Beyond core SRE responsibilities, you will also serve as a liaison across teams, fostering collaboration and ensuring seamless operations.

Responsibilities:

Site Reliability Engineering:

  • Proactively identify and mitigate potential issues impacting infrastructure and applications.
  • Partner with development teams to implement best practices for building reliable and scalable systems.
  • Stay up-to-date on the latest SRE trends and technologies.

Monitoring and Observability:

  • Design, implement, and maintain robust monitoring solutions using tools like Prometheus and Grafana.
  • Develop and configure alerts within tools like PagerDuty to ensure timely notification of potential issues.
  • Analyze and troubleshoot issues using collected application and infrastructure metrics.

Incident Management:

  • Lead incident response, ensuring timely resolution and minimizing downtime.
  • Document and communicate incident details effectively to stakeholders.
  • Conduct post-incident reviews to identify root causes and implement preventative measures.

Service Level Agreements (SLAs):

  • Collaborate with product and engineering teams to define clear and measurable SLAs for our SaaS offerings.
  • Establish Service Level Objectives (SLOs) for key metrics based on SLA requirements.
  • Define Service Level Indicators (SLIs) to track progress towards achieving SLOs.
  • Monitor SLO compliance and proactively identify potential SLA breaches.

Automation:

  • Identify opportunities for automation to improve efficiency and reliability.
  • Develop and implement automation scripts using languages like Python or Bash.
  • Automate routine tasks and incident response workflows.

Cross-Team Collaboration:

  • Act as a liaison between SRE, Product, Security, Application Engineering, and Customer Operations teams.
  • Facilitate communication and information sharing across teams to ensure smooth operations.
  • Work collaboratively to define and implement solutions that meet the needs of all stakeholders.

Mentorship and Knowledge Sharing:

  • Mentor and collaborate with junior SREs.
  • Share knowledge and best practices within the team.
  • Contribute to the development and documentation of internal SRE processes.

Required Skills:

  • 5-8 years of experience as a Site Reliability Engineer (SRE) or related role.
  • Experience with Google Cloud Platform (GCP).
  • Proven experience with monitoring tools like Prometheus and Grafana.
  • Strong understanding of incident management best practices.
  • Experience with alerting tools like PagerDuty.
  • Experience with scripting languages like Python or Bash for automation.
  • Excellent communication and collaboration skills.
  • Ability to work independently and as part of a team.
  • Strong problem-solving and analytical skills.
  • Passion for building reliable and scalable systems.

Nice to Have:

  • Experience with container orchestration platforms like Kubernetes.
  • Experience with chaos engineering principles.
  • Experience with configuration management tools like Ansible or Chef.

What we offer:

  • Remote Work Opportunities
  • Flexible Work Hours

Tech Holding

Tech Holding is a full-service consulting firm that delivers predictable outcomes and high-quality solutions to clients.

Consulting
