Senior Site Reliability Engineer

Tech Holding

Remote · Senior

🇲🇽 Mexico

Site Reliability Engineer

Technology

Ansible Bash Chef Cloud GCP Grafana Kubernetes PagerDuty Prometheus Python SaaS

The Role:

We are seeking a highly skilled and experienced Senior Site Reliability Engineer to join our growing team. You will play a key role in ensuring the reliability, scalability, and performance of our critical infrastructure and applications. Beyond core SRE responsibilities, you will also serve as a liaison across teams, fostering collaboration and ensuring seamless operations.

Responsibilities:

Site Reliability Engineering:

Proactively identify and mitigate potential issues impacting infrastructure and applications.
Partner with development teams to implement best practices for building reliable and scalable systems.
Stay up to date on the latest SRE trends and technologies.

Monitoring and Observability:

Design, implement, and maintain robust monitoring solutions using tools like Prometheus and Grafana.
Develop and configure alerts within tools like PagerDuty to ensure timely notification of potential issues.
Analyze and troubleshoot issues using collected application and infrastructure metrics.

Incident Management:

Lead incident response, ensuring timely resolution and minimizing downtime.
Document and communicate incident details effectively to stakeholders.
Conduct post-incident reviews to identify root causes and implement preventative measures.

Service Level Agreements (SLAs):

Collaborate with product and engineering teams to define clear and measurable SLAs for our SaaS offerings.
Establish Service Level Objectives (SLOs) for key metrics based on SLA requirements.
Define Service Level Indicators (SLIs) to track progress towards achieving SLOs.
Monitor SLO compliance and proactively identify potential SLA breaches.

Automation:

Identify opportunities for automation to improve efficiency and reliability.
Develop and implement automation scripts using tools like Python or Bash.
Automate routine tasks and incident response workflows.

Cross-Team Collaboration:

Act as a liaison between SRE, Product, Security, Application Engineering, and Customer Operations teams.
Facilitate communication and information sharing across teams to ensure smooth operations.
Work collaboratively to define and implement solutions that meet the needs of all stakeholders.

Mentorship and Knowledge Sharing:

Mentor and collaborate with junior SREs.
Share knowledge and best practices within the team.
Contribute to the development and documentation of internal SRE processes.

Required Skills:

5-8 years of experience as a Site Reliability Engineer (SRE) or in a related role.
Experience with Google Cloud Platform (GCP).
Proven experience with monitoring tools like Prometheus and Grafana.
Strong understanding of incident management best practices.
Experience with alerting tools like PagerDuty.
Experience with scripting languages like Python or Bash for automation.
Excellent communication and collaboration skills.
Ability to work independently and as part of a team.
Strong problem-solving and analytical skills.
Passion for building reliable and scalable systems.

Nice to Have:

Experience with container orchestration platforms like Kubernetes.
Experience with chaos engineering principles.
Experience with configuration management tools like Ansible or Chef.

What we offer:

Remote Work Opportunities
Flexible Work Hours

Tech Holding

Tech Holding is a full-service consulting firm that delivers predictable outcomes and high-quality solutions to clients.

Consulting

techholding.co

