Site Reliability Engineer

Remote
Senior
🇺🇸 United States
Technology

Our mission

Genmo makes it easy for anyone to create movies, as if it were magic. Using our web application, any user can create cinematic video using a simple text prompt.

We imagine a world where high-quality cinematic video content is as plentiful as water. Our mission is to empower the next billion video creators to tell their stories.

As a Site Reliability Engineer (SRE) at Genmo, you will be responsible for designing, implementing, and maintaining the infrastructure that powers our large generative AI models. You will work on infrastructure automation and distributed systems design, and you will manage high-performance computing (HPC) and GPU clusters. The ideal candidate will have a strong background in infrastructure automation and distributed systems, along with experience in GPU and HPC environments.

Responsibilities:

  • Design, implement, and maintain scalable infrastructure to support our generative AI models.
  • Develop and maintain infrastructure automation tools using technologies like Docker, Kubernetes, and Terraform.
  • Ensure the reliability, availability, and performance of our systems through proactive monitoring and incident response.
  • Collaborate with software engineers and researchers to design and implement distributed systems.
  • Manage and optimize GPU and HPC clusters for efficient AI model training and inference.
  • Develop and maintain CI/CD pipelines to streamline development and deployment processes.
  • Implement and maintain security best practices across the infrastructure.

Qualifications:

  • 5+ years of experience in site reliability engineering or a similar role.
  • Experience working in a 24x7 enterprise environment.
  • Hands-on experience with infrastructure as code and automation tools (Ansible, Chef, Puppet, Terraform).
  • Strong experience with infrastructure automation tools such as Docker, Kubernetes, and Terraform.
  • Expertise in designing and maintaining distributed systems.
  • Proficiency in scripting and programming languages, particularly Python and C++.
  • Strong understanding of networking, security, and system performance.
  • Excellent problem-solving skills and the ability to work in a fast-paced environment.

Bonus points:

  • Experience with cloud providers like AWS, GCP, or Azure.
  • Experience with monitoring and logging tools (e.g., Prometheus, Grafana, ELK stack).
  • Familiarity with CI/CD tools and practices (e.g., Jenkins, GitLab CI/CD).
  • Experience working with AI and machine learning models.
  • Strong passion for artificial intelligence and the drive to learn new technologies.

Genmo is an Equal Opportunity Employer. Candidates are evaluated without regard to age, race, color, religion, sex, disability, national origin, sexual orientation, veteran status, or any other characteristic protected by federal or state law. Genmo, Inc. is an E-Verify company; you may review the Notice of E-Verify Participation and the Right to Work posters in English and Spanish.

 

Genmo

Genmo makes it easy for anyone to create movies using a simple text prompt.

Artificial Intelligence
Entertainment
Film
Software
