Senior Chief Engineer SRE

Senior
🇮🇳 India
Site Reliability Engineer
Technology

Position Summary

Site Reliability Engineer
Site reliability engineers are dedicated full-time to creating software that improves the reliability of production systems, fixing issues, responding to incidents, and usually taking on on-call responsibilities. They operate systems efficiently and systematically through continuous monitoring and improvement, automation of system and service operations, and consistent application of process.

Role and Responsibilities

Building software to help operations and support teams
SRE teams are in charge of proactively building and implementing services that make IT and support teams better at their jobs. This can be anything from adjustments to monitoring and alerting to code changes in production. A site reliability engineer can be tasked with building a homegrown tool from scratch to address weaknesses in software delivery or incident management.

Fixing support escalation issues
As with the point above, a site reliability engineer can expect to spend time fixing support escalation cases. But as your SRE operations mature, your systems will become more reliable and you’ll see fewer critical incidents in production, leading to fewer support escalations. Because an SRE team touches so many different parts of the engineering and IT organization, it can be a great source of knowledge and can help route issues to the right people and teams.

Optimizing on-call rotations and processes
More often than not, site reliability engineers will need to take on-call responsibilities. At most organizations, the SRE role has a lot of say in how the team can improve system reliability by optimizing on-call processes. SRE teams help add automation and context to alerts, leading to better real-time collaborative response from on-call responders. Additionally, site reliability engineers can update runbooks, tools and documentation to help prepare on-call teams for future incidents.

Documenting “tribal” knowledge
SRE teams gain exposure to systems in both staging and production, as well as to all technical teams. They take part in software development, support, IT operations and on-call duties, meaning they build up a great amount of historical knowledge over time. Instead of siloing this knowledge in the mind of one team or one person, site reliability engineers can be tasked with documenting much of what they know. Constant upkeep of documentation and runbooks can ensure that teams get the information they need right when they need it.

Conducting post-incident reviews
Without thorough post-incident reviews, you have no way to identify what’s working and what’s not. SRE teams need to keep teams honest and ensure that everyone, software developers and IT professionals alike, is conducting post-incident reviews, documenting their findings and acting on what they learn. Site reliability engineers are then often tasked with action items for building or optimizing some part of the SDLC or incident lifecycle to bolster the reliability of their service.

Skills and Qualifications

Primary Skill sets, 5-10 years
• Public Cloud - AWS, Kubernetes

• Scripting - Shell, Terraform, Ansible, Python, Jenkins, Spinnaker, CI/CD

• Knowledge of installing, configuring and managing public cloud infrastructure on AWS and GCP using Terraform and Ansible

• Operate systems efficiently and systematically through continuous monitoring and improvement, system/service operation automation and process application.

• Experienced professional with a full understanding of specialized areas; resolves a wide range of issues in creative ways

• Works on problems of diverse scope where analyzing data requires evaluating identifiable factors. Demonstrates good judgement in selecting methods and techniques for obtaining solutions

• Normally receives little instruction on day-to-day work and receives general instructions on new assignments

• Monitor server applications and infrastructure around the clock (24 hours a day) and handle faults.

• Automate system and service operations to improve cost-effectiveness.

• Typically requires a minimum of 10 years of related experience and a Bachelor's degree, or 3 years and a Master's degree

• Good command of English

Secondary - Monitoring using Grafana, Prometheus, InfluxDB, TSDB (2-4 years)
Desired: MySQL, NoSQL, time series DB

Samsung Electronics

A tech leader in mobile technologies, consumer electronics, home appliances, and enterprise solutions.

Technology
Consumer Goods
Electronics
Home Goods
