Lead Reliability Engineer

Celestial AI

Senior

💰$175–200K

🇺🇸 United States

💰Equity

Site Reliability Engineer

Technology

About Celestial AI

As the industry strives to meet the demands of AI workloads, bottlenecks in data transfers between processors and memory have hindered progress. The Photonic Fabric-based Memory Fabric provides an optically scalable solution to the ‘Memory Wall’ problem, enabling tens of terabytes of memory capacity at full HBM bandwidths with low tens of nanoseconds of latency and extremely low power. The Photonic Fabric-based Compute Fabric enables terabyte-class bandwidth between compute nodes at low latency and power. Photonic Fabric delivers a transformative leap in AI system performance, ten years more advanced than existing technologies.

Job Description:

We are looking for a Lead Reliability Engineer to spearhead reliability efforts specifically tailored for datacenter and high-performance computing (HPC) applications. The ideal candidate will have a strong background in reliability engineering with a focus on these critical environments, ensuring the robustness and uptime of our systems in demanding operational scenarios.

ESSENTIAL DUTIES AND RESPONSIBILITIES:

Develop and implement reliability strategies, standards, and processes customized for datacenter and high-performance computing applications, addressing unique challenges such as thermal management, power integrity, and workload variability.
Lead reliability testing and qualification activities tailored for datacenter and HPC environments, including stress testing, thermal cycling, and performance degradation analysis.
Collaborate closely with cross-functional teams, including hardware design, systems engineering, and datacenter operations, to integrate reliability considerations into product development and deployment processes.
Conduct thorough reliability analyses specific to datacenter and HPC applications, such as MTBF (Mean Time Between Failures) calculations, system-level fault tolerance assessments, and risk mitigation strategies.
Define reliability requirements and specifications for new products targeting datacenter and HPC markets, working closely with design teams to ensure compliance with industry standards and customer expectations.
Lead root cause analysis and corrective actions for reliability issues identified in datacenter and HPC environments, driving continuous improvement initiatives and implementing best practices.
Stay abreast of emerging technologies and industry trends in datacenter and HPC reliability engineering, leveraging this knowledge to enhance the reliability and performance of our systems.

QUALIFICATIONS:

Bachelor's degree in Engineering or related field; Master's or PhD degree preferred.
15+ years of experience in reliability engineering, with a focus on datacenter and high-performance computing applications at the component, board, and system levels.
Very strong understanding of the physics of failure, used to drive material and process improvements for components.
Strong understanding of reliability principles, methodologies, and tools relevant to datacenter and HPC environments, such as reliability modeling, fault tolerance techniques, and performance optimization strategies.
Experience working with industry standards and guidelines specific to datacenter and HPC reliability, such as GR-468 and other relevant datacenter component qualification requirements.
Proven ability to lead cross-functional teams and drive reliability initiatives in fast-paced environments.
Excellent problem-solving skills and the ability to perform detailed root cause analysis in complex systems.
Effective communication skills and the ability to collaborate with internal teams and external stakeholders in the datacenter and HPC ecosystem.

Location: Bay Area preferred.

For California location:

As an early-stage startup experiencing explosive growth, we offer an extremely attractive total compensation package, inclusive of a competitive base salary and a generous grant of our valuable early-stage equity. The target base salary for this role is approximately $175,000.00 - $200,000.00. The base salary offered may be slightly higher or lower than the target base salary, based on the final scope as determined by the depth of the experience and skills demonstrated by the candidate in the interviews.

We offer great benefits (health, vision, dental, and life insurance) and a collaborative, continuous-learning work environment where you will get a chance to work with smart and dedicated people engaged in developing the next-generation architecture for high-performance computing.

Celestial AI Inc. is proud to be an equal opportunity workplace and is an affirmative action employer.

#LI-Onsite
