Redux: Scaling LaunchDarkly From 4 to 200 Billion Feature Flags Daily

LaunchDarkly
Serving over 200 billion feature flags daily to help software teams build better software, faster. LaunchDarkly helps eliminate risk for developers and operations teams from the software development cycle.

Written By John Kodumal, CTO and Co-Founder, LaunchDarkly


Background

LaunchDarkly is a feature management platform—we make it easy for software teams to adopt feature flags, helping them eliminate risk in their software development cycles. When we first wrote about our stack, we served about 4 billion feature flags a day. Last month, we averaged over 200 billion flags daily. To me, that's a mind-boggling number, and a testament to the degree to which we're able to change the way teams do software development. Some additional metrics:

  • Our global P99 flag update latency (the time it takes for a feature flag change on our dashboard to be reflected in your application) is under 500ms
  • Our primary Elasticsearch cluster indexes 175M+ docs / day
  • At daily peak, 1.5 million+ mobile devices and browsers and 500k+ servers are connected to our streaming APIs
  • Our event ingestion pipeline processes 40 billion events per day

We've scaled all our services through a process of gradual evolution, with an occasional bit of punctuated equilibrium. We've never rewritten a service from scratch, nor have we ever had to completely re-architect any of our services (we did migrate one service from a SaaS provider to a homegrown one; more on that later). In fact, from a high level, our stack is very similar to what we described in our earlier post:

  • A Go monolith that serves our REST API and UI (JS / React)
  • A Go microservice that powers our streaming API
  • An event ingestion / transformation pipeline implemented as a set of Go microservices

We use AWS as our cloud provider, and Fastly as our CDN.

Let's talk about some of the changes we've made to scale these systems.

Buy first, build if necessary

Over the past year, we've shifted our philosophy on managed services and have moved several critical parts of our infrastructure away from self-managed options. The most prominent was our shift away from HAProxy to AWS's managed application load balancers (ALBs). As we scaled, managing our HAProxy fleet became a larger and larger burden. We spent a significant amount of time tuning our configuration files and benchmarking different EC2 instance types to maximize throughput. Emerging needs like DDoS protection and auto scaling turned into large projects that we needed to schedule urgently. Instead of continuing this investment, we chose to shift to managed ALB instances. This was a large project, but it quickly paid for itself as we've nearly eliminated the time spent managing load balancers. We also gained DDoS protection and auto scaling "for free".

As we've evolved or added additional infrastructure to our stack, we've biased towards managed services:

  • Most new backing stores are Amazon RDS instances now. We do use self-managed PostgreSQL with TimescaleDB for time-series data—this is made HA with the use of Patroni and Consul.
  • We also use managed Elasticache instances instead of spinning up EC2 instances to run Redis workloads.
  • In our previous StackShare article, I wrote about a project to incorporate Kafka into our event ingestion pipeline. In keeping with our shift towards managed services, we shifted to Amazon's Kinesis instead of Kafka.

Managed services do have some drawbacks:

  • They're almost never cheaper (in raw dollars) than self-managed alternatives. Pricing is often more opaque, more variable, and hard to predict
  • Much less visibility into the operation, errors, and availability of the service
  • Vendor lock-in

Still, it's a false economy to compare only the raw cost of a managed service with that of a self-managed one: factor in your team's time and the math is usually pretty clear.

There is one notable case where we've moved from a managed SaaS solution to a homegrown one. LaunchDarkly relies on a novel streaming architecture to push feature flag changes out in near real-time. Our SDKs create persistent outbound HTTPS connections to the LaunchDarkly streaming APIs. When you change a feature flag on your dashboard, that change is pushed out using the server-sent events (SSE) protocol. When we initially built our streaming service, we relied heavily on a third-party service, Fanout, to manage persistent connections. Fanout worked well for us, but over time we found that we could introduce domain-specific performance and cost optimizations if we built a custom service for our use case. We created a Go microservice that manages persistent connections and is heavily optimized for the unique workloads associated with feature flag delivery. We use NATS as a message broker to connect our REST API to a fleet of EC2 instances running this microservice. Each of these instances can manage over 50,000 concurrent SSE connections.

At scale, everything is a tight loop

Some of our analytics services receive tens of thousands of requests per second. One of the biggest things we've learned over the past year is that at this scale, there's almost no such thing as premature optimization. Because of the sheer volume of requests, every handler you write is effectively running in a tight loop. We found that to keep meeting our service level objectives and cost goals at scale, we had to do two things repeatedly:

  1. Profile aggressively to identify and address CPU and memory bottlenecks
  2. Apply a set of micro-patterns to handle specific workloads

Profiling must be done periodically, as new bottlenecks will constantly emerge as traffic scales and old bottlenecks are eliminated. As an example, at one point, we found that the "front-door" microservice for our analytics pipeline was CPU-bound parsing JSON. We switched from Go's built-in encoding/json package to easyjson, which uses compile-time specialization to eliminate slow runtime reflection in JSON parsing.
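
As an illustration of the kind of profiling described above, the following sketch captures a CPU profile around a JSON-parsing loop using Go's standard runtime/pprof package. The event type, the payload, and the parseEvents and profileCPU helpers are hypothetical stand-ins for a hot request handler, not code from our pipeline.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"runtime/pprof"
)

// event is a hypothetical analytics payload used only to give the profiler work.
type event struct {
	Kind string `json:"kind"`
	Key  string `json:"key"`
}

// parseEvents decodes the same payload n times, standing in for a hot
// request handler, and returns how many events were parsed.
func parseEvents(payload []byte, n int) (int, error) {
	parsed := 0
	for i := 0; i < n; i++ {
		var e event
		if err := json.Unmarshal(payload, &e); err != nil {
			return parsed, err
		}
		parsed++
	}
	return parsed, nil
}

// profileCPU runs work while a CPU profile is being collected and returns
// the raw profile bytes together with any error from work.
func profileCPU(work func() error) ([]byte, error) {
	var buf bytes.Buffer
	if err := pprof.StartCPUProfile(&buf); err != nil {
		return nil, err
	}
	err := work()
	pprof.StopCPUProfile() // flushes the profile into buf
	return buf.Bytes(), err
}

func main() {
	payload := []byte(`{"kind":"feature","key":"new-checkout"}`)
	profile, err := profileCPU(func() error {
		_, err := parseEvents(payload, 200000)
		return err
	})
	if err != nil {
		fmt.Println("profiling failed:", err)
		return
	}
	fmt.Printf("captured %d bytes of CPU profile data\n", len(profile))
}
```

Writing the returned bytes to a file and opening it with go tool pprof shows where the CPU time went. For a long-running service, the standard net/http/pprof package exposes the same profiles over HTTP.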

We also identified a set of "micro-patterns" that we have extracted as self-contained libraries so they can be applied in appropriate contexts. Some examples:

  • Read coalescing: In a read-heavy workload, concurrent calls to fetch the same data can be queued to await the first in-flight read and share its result, a kind of short-lived memoization. This pattern is encapsulated in Google's singleflight package.
  • Write coalescing: The dual of read coalescing. In a write-heavy workload where the last write wins, writes can be queued, and stale queued writes discarded in favor of the latest write attempt.
  • Multi-layer caching: In scenarios where an in-process, in-memory cache is necessary for performance, horizontal scaling can reduce cache hit rates. We make our fleet more resilient to this effect by employing multiple layers of caching, for example by backing an in-memory cache with a shared Redis cache before finally falling back to a slower, persistent disk-backed store.
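
The read-coalescing pattern from the list above can be sketched with the standard library alone. This is a simplified stand-in for the idea, not the singleflight package itself: the coalescer type, its waiting helper (present only so the demo is deterministic), and the "flags:prod" key are assumptions for illustration, and the sketch does not handle a fetch that panics.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// call tracks one in-flight fetch and the result shared with its waiters.
type call struct {
	done    chan struct{}
	waiters int // callers parked behind this fetch; guarded by coalescer.mu
	val     string
	err     error
}

// coalescer deduplicates concurrent fetches for the same key.
type coalescer struct {
	mu       sync.Mutex
	inflight map[string]*call
}

func newCoalescer() *coalescer {
	return &coalescer{inflight: make(map[string]*call)}
}

// Do runs fetch for key unless a fetch for that key is already in flight,
// in which case it waits for that fetch and returns its result instead.
func (c *coalescer) Do(key string, fetch func() (string, error)) (string, error) {
	c.mu.Lock()
	if existing, ok := c.inflight[key]; ok {
		existing.waiters++
		c.mu.Unlock()
		<-existing.done
		return existing.val, existing.err
	}
	cl := &call{done: make(chan struct{})}
	c.inflight[key] = cl
	c.mu.Unlock()

	cl.val, cl.err = fetch()

	c.mu.Lock()
	delete(c.inflight, key) // later callers start a fresh fetch
	c.mu.Unlock()
	close(cl.done)
	return cl.val, cl.err
}

// waiting reports how many callers are parked behind the in-flight fetch for key.
func (c *coalescer) waiting(key string) int {
	c.mu.Lock()
	defer c.mu.Unlock()
	if cl, ok := c.inflight[key]; ok {
		return cl.waiters
	}
	return 0
}

// runDemo issues concurrent reads of one key against a slow fetch and
// returns how many fetches actually ran, plus each caller's result.
func runDemo(callers int) (int64, []string) {
	c := newCoalescer()
	var fetches int64
	release := make(chan struct{})
	fetch := func() (string, error) {
		atomic.AddInt64(&fetches, 1)
		<-release // simulate a slow backing store
		return "flag-config-v1", nil
	}
	results := make([]string, callers)
	var wg sync.WaitGroup
	for i := 0; i < callers; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			results[i], _ = c.Do("flags:prod", fetch)
		}(i)
	}
	// Wait until every other caller is parked behind the first fetch.
	for c.waiting("flags:prod") < callers-1 {
		time.Sleep(time.Millisecond)
	}
	close(release)
	wg.Wait()
	return atomic.LoadInt64(&fetches), results
}

func main() {
	fetches, results := runDemo(5)
	fmt.Printf("%d callers shared %d fetch: %v\n", len(results), fetches, results)
}
```

Results are shared only while a fetch is in flight. Once it completes the entry is removed, so this is coalescing rather than caching, and it pairs naturally with a cache layer in front of it.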

These simple patterns improved performance at scale and also helped us deal with bad traffic patterns like reconnection storms.
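
Write coalescing, described in the list above, can be sketched in a similar way. In this minimal version, pending writes are held per key, a newer write replaces an unflushed older one, and a flush sends only the latest value downstream. The writeCoalescer type, the store callback, and the example key are hypothetical names chosen for illustration.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// writeCoalescer buffers writes per key and keeps only the most recent value.
type writeCoalescer struct {
	mu      sync.Mutex
	pending map[string]string
	store   func(key, value string) // the expensive downstream write
}

func newWriteCoalescer(store func(key, value string)) *writeCoalescer {
	return &writeCoalescer{pending: make(map[string]string), store: store}
}

// Write records value as the latest for key, discarding any earlier value
// for that key that has not been flushed yet (last write wins).
func (w *writeCoalescer) Write(key, value string) {
	w.mu.Lock()
	w.pending[key] = value
	w.mu.Unlock()
}

// Flush sends the latest value for each pending key to the store and
// returns how many downstream writes were issued.
func (w *writeCoalescer) Flush() int {
	w.mu.Lock()
	batch := w.pending
	w.pending = make(map[string]string)
	w.mu.Unlock()

	keys := make([]string, 0, len(batch))
	for k := range batch {
		keys = append(keys, k)
	}
	sort.Strings(keys) // deterministic order for the demo
	for _, k := range keys {
		w.store(k, batch[k])
	}
	return len(batch)
}

func main() {
	wc := newWriteCoalescer(func(key, value string) {
		fmt.Printf("store %s=%s\n", key, value)
	})
	// Three rapid writes to one key collapse into a single downstream write.
	wc.Write("device-42:last-seen", "10:00:01")
	wc.Write("device-42:last-seen", "10:00:02")
	wc.Write("device-42:last-seen", "10:00:03")
	fmt.Println("downstream writes:", wc.Flush())
}
```

In a real service, Flush would typically run on a timer or when the buffer reaches a size limit. The trade-off is a bounded delay before a write becomes durable, in exchange for far fewer downstream writes during bursts.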

Get good at managing change

Scaling up isn't just about improving your services and architecture. It requires equal investment in people, processes, and tools. On the process and tools front, one thing we really focused on was understanding change. Better visibility into changes being made to the service had a massively positive impact on service reliability. Here are a few things we did to improve visibility:

  • Internal changelog service: This service catalogues intentional changes being made to the system. This includes deploys, instance type changes, configuration changes, feature flag changes, and more. Anything that could potentially impact the service (either in a positive or negative way) is catalogued here. We couldn't find anything off the shelf here, so we built something ourselves.
  • COGS (cost of goods sold) log: Very similar to our changelog, but focused on price changes to our services. If we scale out a service, or change instance types, or make reserved instance reservations, we add an entry to this log. For us, this is just a Confluence page.
  • Observability / APM: We use a number of services to gain observability into what is happening to our service at runtime. We use a mix of Graphite / Grafana and Honeycomb.io to give us the observability we need. We're big fans of Honeycomb here.
  • Operational and release feature flags: We feature flag most changes using LaunchDarkly. Most new changes are protected by release flags (short-lived flags that are used to protect the initial rollout and rollback of a feature). We also create operational flags—which are long-lived flags that act as control switches to the application. Observability lets us understand change, and feature flags allow us to react to change to maintain availability or improve user experience.
  • Spinnaker / Armory: LaunchDarkly is an almost five-year-old company, and our methodology for deploying was state of the art... for 2014. We recently undertook a project to modernize the way we deploy our software, moving from Ansible-based deploy scripts that executed on our local machines to using Spinnaker (along with Terraform and Packer) as the basis of our deployment system. We've been using Armory's enterprise Spinnaker offering to make this project a reality.
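
To illustrate the difference between the release and operational flags described in the list above, here is a minimal sketch. It uses a hypothetical in-memory flagStore rather than the LaunchDarkly SDK, and the flag keys and the handleEvent function are invented for the example.

```go
package main

import (
	"fmt"
	"sync"
)

// flagStore is a hypothetical in-memory flag source, standing in for a real SDK client.
type flagStore struct {
	mu    sync.RWMutex
	flags map[string]bool
}

func newFlagStore() *flagStore {
	return &flagStore{flags: make(map[string]bool)}
}

// Set turns a flag on or off at runtime.
func (s *flagStore) Set(key string, on bool) {
	s.mu.Lock()
	s.flags[key] = on
	s.mu.Unlock()
}

// Enabled reports the flag's value, or fallback when the flag is unknown.
func (s *flagStore) Enabled(key string, fallback bool) bool {
	s.mu.RLock()
	defer s.mu.RUnlock()
	if on, ok := s.flags[key]; ok {
		return on
	}
	return fallback
}

// handleEvent shows both flag kinds guarding one code path.
func handleEvent(flags *flagStore, payload string) string {
	// Operational flag: a long-lived control switch for shedding load.
	if flags.Enabled("ops.pause-ingestion", false) {
		return "dropped"
	}
	// Release flag: a short-lived guard around a new code path, removed
	// once the rollout is complete.
	if flags.Enabled("release.new-parser", false) {
		return "v2:" + payload
	}
	return "v1:" + payload
}

func main() {
	flags := newFlagStore()
	fmt.Println(handleEvent(flags, "evt")) // old path while the release flag is off

	flags.Set("release.new-parser", true)
	fmt.Println(handleEvent(flags, "evt")) // new path after rollout

	flags.Set("ops.pause-ingestion", true)
	fmt.Println(handleEvent(flags, "evt")) // control switch overrides both paths
}
```

Both flags default to off when unknown, so a failure to load flag data leaves the service on its existing behavior.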

Like the sound of this stack? Learn more about LaunchDarkly.
