
Reliability Engineering Manager

Hybrid
Manager
🇳🇦 Namibia
📍 EMEA
📍 APAC
Site Reliability Engineer
Technology

Who We Are

At The Trade Desk, we recognize that a seamless customer experience is driven by operational excellence. In pursuit of constantly improving the reliability of our platform, we are establishing a global Reliability Operations team. This team's core mission will be to vigilantly monitor The Trade Desk platform services, refine our incident response methodologies, and guarantee a robust and highly available customer experience. If you're passionate about system reliability, process improvement, and making an essential customer impact, we invite you to play a critical role in this next evolution of our on-call experience.

What You'll Do

  • Define, manage, and measure incident response engineering practices
  • Liaise with engineering teams to ensure work discovered during incident response is prioritized
  • Participate in incident response engineering duties as necessary
  • Manage a global Reliability Operations team (3 to 6+ Reliability Operations engineers across NAMER, EMEA, and APAC)
    • Meeting with reports across time zones will periodically require extended hours
  • There may be periodic weekend coverage requirements

Who We Are Looking For

  • Bachelor's degree from a four-year university or relevant substitute experience
  • 6+ years of relevant work experience in Technical and/or Application Support, with strong knowledge of technical troubleshooting
  • 2-5 years of management experience with direct reports

The Reliability Operations Engineering Manager will either possess or be excited to learn a number of skills...

Management:

  • Adaptive management style according to level and proficiency of engineering reports.
  • Ability to understand technical employee career paths and collaboratively develop career plans.
  • Scheduling a global team across time zones through holidays, sick leave, and vacations.

Technical Proficiency:

  • Understanding of large-scale distributed system architectures (e.g., databases, web services, application services).
  • Familiarity with monitoring tools (e.g., Prometheus, Grafana, Nagios).
  • Ability to author scripts to facilitate troubleshooting as well as configure alerts.
    • Proficiency in scripting languages (e.g., Python, Bash) is a plus

Incident Management and Troubleshooting:

  • Ability to prioritize and manage incidents based on severity, with a focus on customer impact.
  • Ability to remain calm under pressure and quickly diagnose issues.
  • Understanding of system logs, metrics, and telemetry.

Communication Skills:

  • Ability to take command and confidently direct engineering resources in ambiguous situations.
  • Ability to communicate effectively with stakeholders during an incident.
  • Ability to maintain and update troubleshooting guides (TSGs) and operational documentation.
