Sr. Site Reliability Engineer - Billing Platform
Twilio's Billing Platform captures, stores, and processes global event data reliably at scale, and makes this data available to Twilio and the market for a large set of products and services.
About the job:
As a Senior SRE, you will be a core contributor and face some of the most complex challenges in distributed data systems at scale.
- Create a resilient and highly operable production environment in AWS with 24x7 availability, high performance, scalability, and zero-downtime releases.
- Manage large MySQL database clusters and NoSQL systems such as Redis, DynamoDB, and Cassandra.
- Manage regional deployments and set up disaster recovery for Kafka data pipelines, systems, and stores in an AWS environment.
- Collaborate with Engineers to create a continuous delivery environment and processes.
- Instrument and monitor the health and availability of services, with fault detection, alerting, triage and recovery (automated and manual).
- Work closely with Twilio’s cloud infrastructure, orchestration, and security teams to help implement company-wide security and operability initiatives and to provide tooling requirements.
- Manage performance (through benchmarking and monitoring of vital metrics), plan capacity, and resolve performance problems affecting service levels.
- Write scripts and runbooks to automate procedures.
- Enable auto-scaling.
- Your background will be that of a Senior Engineer who has had considerable experience in a highly complex technical operations environment with cloud-based services.
- 5+ years of experience building complex distributed systems, with a focus on reliability, high availability, performance, scalability, capacity planning, backup and recovery, business continuity planning, and automation of everything.
- Strong AWS experience in a production environment.
- Experience with managing and automating configuration of MySQL database clusters.
- Hands-on experience with cloud infrastructure technologies, including continuous integration tools, configuration management, systems monitoring and alerting tools.
- Experience with managing systems in distributed regions in the cloud or on-site.
- Adept at troubleshooting and administering Linux systems, dealing with networking issues, and fine-tuning instrumentation and alerting systems.
- Demonstrated experience with agile processes, continuous integration, test automation, and release management.
- Significant development experience in at least one modern scripting language, preferably Python.
- Exceptional communication and troubleshooting skills.
- Preferred: experience operating a high-load data pipeline and exposure to technologies such as Kafka, Kinesis, Spark, S3, and Redshift.
- Preferred: experience managing NoSQL systems such as Redis, DynamoDB, and Cassandra.
- Experience securing distributed systems. You understand the purpose of reasonable security techniques and their tradeoffs with operational efficiency.
Twilio's mission is to fuel the future of communications. Developers and businesses use Twilio to make communications relevant and contextual by embedding messaging, voice and video capabilities directly into their software applications. Founded in 2008, Twilio has over 650 employees, with headquarters in San Francisco and other offices in Bogotá, Dublin, Hong Kong, London, Madrid, Mountain View, Munich, New York City, Singapore and Tallinn.
Twilio is proud to be an equal opportunity workplace and is an affirmative action employer. We are committed to equal opportunity regardless of race, color, ancestry, religion, gender, gender identity, parental or pregnancy status, national origin, sexual orientation, age, citizenship, marital status, disability, or Veteran status and operate in compliance with the San Francisco Fair Chance Ordinance.