Site Reliability Engineering (SRE) at NVIDIA is an engineering discipline to design, build and maintain large scale production systems with high efficiency and availability using the combination of software and systems engineering practices. This is a highly specialized discipline which demand knowledge across different systems, networking, coding, database, capacity management, continuous delivery and deployment and open source cloud enabling technologies like Kubernetes and OpenStack. SRE at NVIDIA ensures that our internal and external facing GPU cloud services run maximum reliability and uptime as promised to the users and at the same time enabling developers to make changes to the existing system through careful preparation and planning while keeping an eye on capacity, latency and performance. SRE is also a mindset and a set of engineering approaches to running better production systems and optimizations. Much of our software development focuses on eliminating manual work through automation, performance tuning and growing efficiency of production systems.
As SREs are responsible for the big picture of how our systems relate to each other, we use a breadth of tools and approa...
NVIDIA
NVIDIA is a technology company that specializes in AI computing, with a focus on Deep Learning GPUs
Other jobs at NVIDIA
Notifications about similar jobs
Get notifications to your inbox about new jobs that are similar to this one.
No spam. No ads. Unsubscribe anytime.
Similar jobs