Who We Are
At The Trade Desk, we recognize that a seamless customer experience is driven by operational excellence. In pursuit of constantly improving the reliability of our platform, we are establishing a global Reliability Operations team. This team's core mission will be to vigilantly monitor The Trade Desk platform services, refine our incident response methodologies, and guarantee a robust and highly available customer experience. If you're passionate about system reliability, process improvement, and making an essential customer impact, we invite you to play a critical role in this next evolution of our on-call experience.
What You'll Do
- Define, manage, and measure incident response engineering practices
- Liaise with engineering teams to ensure work discovered during incident response is prioritized
- Participate in incident response engineering duties as necessary
- Manage a global Reliability Operations team (3 to 6+ Reliability Operations engineers across NAMER, EMEA, APAC)
- Meeting with reports across timezones will periodically require extended hours
- There may be periodic weekend coverage requirements
Who We Are Looking For
- Bachelor's Degree from a four-year university or relevant substitute experience
- 6+ years relevant work experience in Technical and/or Application Support with strong knowledge of technical troubleshooting
- 2-5 years of management experience with direct reports
The Reliability Operations Engineering Manager will either possess or be excited to learn the following skills:
Management:
- Adaptive management style according to level and proficiency of engineering reports.
- Ability to understand technical employee career paths and collaboratively develop career plans.
- Scheduling a global team through holidays, sickness and vacation leaves, across timezones.
Technical Proficiency:
- Understanding of large-scale distributed system architectures (e.g., databases, web services, application services).
- Familiarity with monitoring tools (e.g., Prometheus, Grafana, Nagios).
- Ability to author scripts to facilitate troubleshooting as well as configure alerts.
- Proficiency in scripting languages (e.g., Python, Bash) is a plus.
Incident Management and Troubleshooting:
- Ability to prioritize and manage incidents based on severity, with a focus on customer impact.
- Ability to remain calm under pressure and quickly diagnose issues.
- Understanding of system logs, metrics, telemetry.
Communication Skills:
- Ability to take command and confidently direct engineering resources in ambiguous situations.
- Ability to communicate effectively with stakeholders during an incident.
- Ability to maintain and update troubleshooting guides (TSGs) and operational documentation.
The Trade Desk
A new office in Hamburg, Germany that offers a great combination of what The Trade Desk and the city are all about: perspective, collaboration, and openness.