Director, Site Reliability Engineering (SRE)

Hybrid

Director

🇺🇸 United States

Site Reliability Engineer

Technology

Agile, API, Argo CD, Azure, CI/CD, Cloud, Git, GitHub Actions, Kubernetes, Microsoft Azure, OpenShift

With a company culture rooted in collaboration, expertise and innovation, we aim to promote progress and inspire our clients, employees, investors and communities to achieve their greatest potential. Our work is the catalyst that helps others achieve their goals. In short, We Enable Possibility℠.

The Director, Site Reliability Engineering (SRE) is a pivotal role in the technology infrastructure team, responsible for ensuring the highest levels of reliability, scalability, and performance. This leadership role will set the vision and strategic direction for a skilled SRE team, aligning with the strategic objectives of the IT Infrastructure team and fostering a culture of continuous improvement and operational excellence. The role requires a deep understanding of cloud-based infrastructure services and technologies, distributed systems, product delivery platforms, DevOps, automation, and monitoring, along with a proactive approach to preventing and mitigating potential issues. The incumbent must also foster a culture of innovation and collaboration within a team of highly skilled engineers to meet the organization’s evolving needs and deliver a superior digital experience to our product teams and customers.

*This is a hybrid role, onsite twice a week at our Greensboro and Raleigh offices.

Leadership & Strategy

Develop and implement a comprehensive SRE strategy that aligns with IT Infrastructure team, IT, and company objectives.
Lead the SRE team, setting clear goals and expectations, and providing mentorship and career development opportunities.
Collaborate with cross-functional teams to enhance system reliability and efficiency.

Technical Expertise

Oversee systems related to the availability of our infrastructure ecosystem, including cloud services and internal tooling.
Ensure the team has a deep understanding of and expertise in the system architecture, covering not only Kubernetes and OpenShift but the entire product delivery stack.

Team Management

Manage the SRE team, ensuring effective resource allocation and prioritization of POCs and initiatives.
Drive the adoption of best practices in incident management and post-mortem analysis.

Incident Management

Be a leader in the response to high-impact infrastructure incidents, ensuring swift resolution and minimal disruption.
Implement proactive monitoring and measures to prevent future incidents and improve system resilience.

Communications

Articulate the value and accomplishments of the SRE team to stakeholders at all levels.
Foster a transparent communication environment within the team and across the organization.
Work closely with shared infrastructure services teams (including other SRE teams) within the corporation to establish a productive and transparent partnership and help establish consistent SRE and Infrastructure practices across the company.

Knowledge & Skills:

Proven expertise in large-scale, complex systems engineering and administration, including cloud-based infrastructure in Microsoft Azure.
Strong leadership skills with the ability to inspire and motivate a high-performing team.
Excellent problem-solving abilities and a data-driven approach to decision-making.
Technical leadership skills, including collaboration, technical problem-solving, and leading complex, mission-critical initiatives.
In-depth understanding of Kubernetes concepts, components, and APIs, with hands-on experience in orchestration of containerized applications using OpenShift (on-premises or in the cloud).
Experience with OpenShift’s value-added features, such as advanced CI/CD pipelines for containerized product delivery.
Experience with GitHub, GitHub Actions, and/or Argo CD or similar technologies.
Strong background working within an agile service delivery methodology, with a focus on iterative service improvement.

Education & Experience:

A bachelor’s degree in Computer Science, Engineering, or related field; a master’s degree is preferred.
At least 10 years of experience in IT infrastructure, system administration, or reliability engineering, with a minimum of 5 years in a leadership role.
A track record of managing complex infrastructure initiatives and leading incident response efforts.

Do you like solving complex business problems and working with talented colleagues, and do you have an innovative mindset? Arch may be a great fit for you. If this job isn’t the right fit but you’re interested in working for Arch, create a job alert! Simply create an account and opt in to receive emails when we have job openings that meet your criteria. Join our talent community to share your preferences directly with Arch’s Talent Acquisition team.

14500 Arch U.S. MI Services Inc.

Arch Capital Services LLC

Arch is a company that promotes progress and inspires clients, employees, investors, and communities to achieve their potential.

Consulting

CSR (Corporate Social Responsibility)

🌍 archgroup.com
