Site Reliability Engineer

Remote, Senior

🇺🇸 United States

Technology

Ansible, AWS, Azure, Bash, Chef, CI/CD, Cloud, Datadog, Docker, EC2, ELK stack, GCP, GitLab CI, Go, Grafana, Kubernetes, Prometheus, Puppet, Python, RDS, security groups, Splunk, Terraform, VPCs

What you'll be responsible for:

Infrastructure Management: Design, implement, and maintain scalable and resilient infrastructure using Terraform for infrastructure as code, ensuring high availability and performance.
Kubernetes and Containers: Deploy, manage, and optimize Kubernetes clusters and containerized applications using Docker. Implement best practices for container orchestration and management.
Systems and Application Monitoring/Observability: Develop and maintain comprehensive monitoring and observability solutions using Datadog. Ensure detailed visibility into system performance and application health.
SLO and SLA Management: Define, monitor, and maintain Service Level Objectives (SLOs) and Service Level Agreements (SLAs) to ensure reliable and consistent service delivery.
Incident Response and Troubleshooting: Respond to incidents, perform root cause analysis, and implement solutions to prevent recurrence. Participate in post-incident reviews and contribute to blameless postmortems.
Reliability and Production Environment Management: Ensure the reliability and stability of our production environments. Continuously assess and improve system reliability, identifying and addressing potential points of failure.
Automation and Scripting: Develop automation scripts and tools to reduce manual intervention and improve system reliability using Python, Bash, or Go. Implement and improve CI/CD pipelines.
CI/CD Pipeline Management: Enhance and maintain continuous integration and continuous deployment pipelines using GitLab CI. Ensure seamless and reliable deployment processes.
Capacity Planning and Scaling: Assist in capacity planning and ensure that systems are scalable to meet future demands. Implement auto-scaling strategies where applicable.
Security and Compliance: Implement security best practices and ensure compliance with industry standards. Regularly review and update security policies and procedures.
Collaboration and Support: Work closely with development teams to ensure reliability and scalability of new features and services. Provide technical support and guidance on infrastructure-related issues.
Software Engineering for Operations: Develop and maintain internal tools and services that enhance the efficiency and reliability of our operations.
On-Call Rotation: Participate in an on-call rotation to address production issues and collaborate in incident response efforts.

Requirements

We’re looking for someone with:

Experience: 7+ years of experience in SRE, DevOps, or a related role.
Cloud Platform Experience: Proficient with cloud platforms such as AWS, GCP, or Azure. Experience with EC2, RDS, VPCs, and security groups is essential.
Kubernetes and Containers: Strong experience with Kubernetes and Docker, including deployment, scaling, and management of containerized applications.
Infrastructure as Code: Expert in using Terraform for infrastructure as code. Proficient with configuration management tools such as Ansible, Puppet, or Chef.
Monitoring and Observability: Extensive experience with monitoring and observability tools like Datadog, Prometheus, Grafana, ELK stack, or Splunk. Skilled in setting up detailed monitoring and logging systems.
CI/CD Practices: Familiarity with GitLab CI or a similar tool for continuous integration and deployment. Experience in setting up and managing pipelines.

PayNearMe

PayNearMe develops award-winning technology to facilitate the end-to-end customer payment experience, making it easy for businesses to manage and accept payments.

Fintech

Software
