Sr. Site Reliability Engineer - Transactions Platform
Twilio's customers send hundreds of millions of communications every day. These events become a firehose of micro-transactions processed a fraction of a penny at a time. It's critical to external and internal customers that we account for these transactions in a timely and accurate manner. As a shared platform within Twilio, we have a responsibility to ensure that our systems can quickly and easily scale to support new products and use cases.
We're seeking engineers with experience managing large-scale distributed systems. Our services are written in Scala and Python, and we use Spark to process our large-volume data. MySQL and Redis play key roles as our data stores for serving our financial data and coordinating processing in our distributed environment.
About the Job
As a Senior Site Reliability Engineer, you will be a core contributor on the Real-Time Transactions team and face some of the most complex challenges in distributed data systems at scale.
- Maintain a performant and scalable production environment in AWS with 24x7 availability and zero-downtime deployments.
- Instrument and monitor the health and availability of our services, with fault detection, alerting, and recovery (automated and manual).
- Collaborate with developers to integrate existing services with Twilio's continuous delivery environment and processes.
- Manage large MySQL and NoSQL database clusters.
- Write scripts and runbooks to automate procedures.
- Manage system performance with benchmarking and monitoring of vital metrics, create capacity plans, and work with developers to resolve performance problems.
- Work closely with Twilio’s cloud infrastructure, orchestration, and security teams to help define company-wide requirements for operability initiatives and tooling.
- 5+ years of experience building complex distributed systems in the cloud, with a focus on areas such as reliability, high availability, performance, scalability, capacity planning, backup and recovery, business continuity planning, and automation of everything.
- Significant development experience in at least one modern scripting language, preferably Python.
- Experience with managing and automating configuration of MySQL or NoSQL database clusters.
- Hands-on experience with cloud infrastructure technologies, including continuous integration and release management tools, configuration management, systems monitoring, and alerting tools.
- Strong AWS experience in a production environment.
- Exceptional communication and troubleshooting skills.
- Experience with agile processes.
- Experience developing services in Scala.
- Experience with technologies such as Spark, Kafka, and S3.
- Experience with securing distributed systems. You understand the purpose of reasonable security techniques and the tradeoffs with operational efficiency.
- Adept at administering Linux systems, dealing with networking issues, and fine-tuning instrumentation and alerting systems.
- Experience with managing systems in distributed regions in the cloud or on-site.
- Experience with auto-scaling of distributed systems.
Twilio's mission is to fuel the future of communications. Developers and businesses use Twilio to make communications relevant and contextual by embedding messaging, voice and video capabilities directly into their software applications. Founded in 2008, Twilio has over 650 employees, with headquarters in San Francisco and other offices in Bogotá, Dublin, Hong Kong, London, Madrid, Mountain View, Munich, New York City, Singapore and Tallinn.
Twilio is proud to be an equal opportunity workplace and is an affirmative action employer. We are committed to equal opportunity regardless of race, color, ancestry, religion, gender, gender identity, parental or pregnancy status, national origin, sexual orientation, age, citizenship, marital status, disability, or Veteran status and operate in compliance with the San Francisco Fair Chance Ordinance.