Redux: Scaling LaunchDarkly From 4 to 200 Billion Feature Flags Daily


Written by John Kodumal, CTO and Co-Founder, LaunchDarkly


Background

LaunchDarkly is a feature management platform—we make it easy for software teams to adopt feature flags, helping them eliminate risk in their software development cycles. When we first wrote about our stack, we served about 4 billion feature flags a day. Last month, we averaged over 200 billion flags daily. To me, that's a mind-boggling number, and a testament to the degree to which we're able to change the way teams do software development. Some additional metrics:

  • Our global P99 flag update latency (the time it takes for a feature flag change on our dashboard to be reflected in your application) is under 500ms
  • Our primary Elasticsearch cluster indexes 175M+ docs / day
  • At daily peak, more than 1.5 million mobile devices and browsers, plus 500k+ servers, are connected to our streaming APIs
  • Our event ingestion pipeline processes 40 billion events per day

We've scaled all our services through a process of gradual evolution, with an occasional bit of punctuated equilibrium. We've never re-written a service from scratch, nor have we ever had to completely re-architect any of our services (we did migrate one service from a SaaS provider to a homegrown solution; more on that later). In fact, from a high level, our stack is very similar to what we described in our earlier post:

  • A Go monolith that serves our REST API and UI (JS / React)
  • A Go microservice that powers our streaming API
  • An event ingestion / transformation pipeline implemented as a set of Go microservices

We use AWS as our cloud provider, and Fastly as our CDN.

Let's talk about some of the changes we've made to scale these systems.

Buy first, build if necessary

Over the past year, we've shifted our philosophy on managed services and have moved several critical parts of our infrastructure away from self-managed options. The most prominent was our shift away from HAProxy to AWS's managed application load balancers (ALBs). As we scaled, managing our HAProxy fleet became a larger and larger burden. We spent a significant amount of time tuning our configuration files and benchmarking different EC2 instance types to maximize throughput. Emerging needs like DDoS protection and auto scaling turned into large projects that we needed to schedule urgently. Instead of continuing this investment, we chose to shift to managed ALB instances. This was a large project, but it quickly paid for itself as we've nearly eliminated the time spent managing load balancers. We also gained DDoS protection and auto scaling "for free".

As we've evolved or added additional infrastructure to our stack, we've biased towards managed services:

  • Most new backing stores are Amazon RDS instances now. We do use self-managed PostgreSQL with TimescaleDB for time-series data, which we make highly available with Patroni and Consul.
  • We also use managed Elasticache instances instead of spinning up EC2 instances to run Redis workloads.
  • In our previous StackShare article, I wrote about a project to incorporate Kafka into our event ingestion pipeline. In keeping with our shift towards managed services, we adopted Amazon Kinesis instead of Kafka (a minimal producer sketch follows this list).
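
To give a flavor of what this looks like in code, here's a minimal Kinesis producer sketch using the AWS SDK for Go. The region, stream name, partition key, and payload are hypothetical placeholders, not our production configuration:

    package main

    import (
        "log"

        "github.com/aws/aws-sdk-go/aws"
        "github.com/aws/aws-sdk-go/aws/session"
        "github.com/aws/aws-sdk-go/service/kinesis"
    )

    func main() {
        // Reuse one session and client across requests; both are safe for concurrent use.
        sess := session.Must(session.NewSession(&aws.Config{Region: aws.String("us-east-1")}))
        svc := kinesis.New(sess)

        // Write a single event. Partitioning by something like an environment ID
        // spreads load across shards; all names here are hypothetical.
        out, err := svc.PutRecord(&kinesis.PutRecordInput{
            StreamName:   aws.String("analytics-events"),
            PartitionKey: aws.String("env-1234"),
            Data:         []byte(`{"kind":"feature","key":"my-flag"}`),
        })
        if err != nil {
            log.Fatal(err)
        }
        log.Printf("wrote record to shard %s", aws.StringValue(out.ShardId))
    }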

Managed services do have some drawbacks:

  • They're almost never cheaper (in raw dollars) than self-managed alternatives, and pricing is often more opaque, more variable, and harder to predict
  • You get much less visibility into the operation, errors, and availability of the service
  • They introduce vendor lock-in

Still, it's a false economy to compare only the raw cost of a managed service against a self-managed one: factor in your team's time, and the math is usually pretty clear.

There is one notable case where we've moved from a managed SaaS solution to a homegrown one. LaunchDarkly relies on a novel streaming architecture to push feature flag changes out in near real-time. Our SDKs create persistent outbound HTTPS connections to the LaunchDarkly streaming APIs. When you change a feature flag on your dashboard, that change is pushed out using the server-sent events (SSE) protocol. When we initially built our streaming service, we relied heavily on a third-party service, Fanout, to manage persistent connections. Fanout worked well for us, but over time we found that we could introduce domain-specific performance and cost optimizations if we built a custom service for our use case. We created a Go microservice that manages persistent connections and is heavily optimized for the unique workloads associated with feature flag delivery. We use NATS as a message broker to connect our REST API to a fleet of EC2 instances running this microservice. Each of these instances can manage over 50,000 concurrent SSE connections.
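
As a rough sketch of how the pieces fit together (not our actual service), an SSE handler in Go can bridge a NATS subscription to a long-lived HTTP response. The NATS subject and the event name below are hypothetical:

    package main

    import (
        "fmt"
        "log"
        "net/http"

        "github.com/nats-io/nats.go"
    )

    func main() {
        nc, err := nats.Connect(nats.DefaultURL)
        if err != nil {
            log.Fatal(err)
        }

        http.HandleFunc("/stream", func(w http.ResponseWriter, r *http.Request) {
            flusher, ok := w.(http.Flusher)
            if !ok {
                http.Error(w, "streaming unsupported", http.StatusInternalServerError)
                return
            }
            // Standard SSE response headers.
            w.Header().Set("Content-Type", "text/event-stream")
            w.Header().Set("Cache-Control", "no-cache")

            // Deliver broker messages for this connection through a buffered channel.
            updates := make(chan *nats.Msg, 64)
            sub, err := nc.ChanSubscribe("flags.updates", updates) // hypothetical subject
            if err != nil {
                http.Error(w, "subscribe failed", http.StatusInternalServerError)
                return
            }
            defer sub.Unsubscribe()

            for {
                select {
                case <-r.Context().Done(): // client disconnected
                    return
                case msg := <-updates:
                    // SSE wire format: "event: <name>\ndata: <payload>\n\n"
                    fmt.Fprintf(w, "event: put\ndata: %s\n\n", msg.Data)
                    flusher.Flush()
                }
            }
        })
        log.Fatal(http.ListenAndServe(":8080", nil))
    }

The production service layers the domain-specific optimizations described above on top of this basic SSE-over-NATS shape.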

At scale, everything is a tight loop

Some of our analytics services receive tens of thousands of requests per second. One of the biggest things we've learned over the past year is that at this scale, there's almost no such thing as premature optimization. Because of the sheer volume of requests, every handler you write is effectively running in a tight loop. We found that to keep meeting our service level objectives and cost goals at scale, we had to do two things repeatedly:

  1. Profile aggressively to identify and address CPU and memory bottlenecks (see the profiling sketch after this list)
  2. Apply a set of micro-patterns to handle specific workloads
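
For Go services, getting profiles out of production can be as simple as exposing the standard library's pprof endpoints. A minimal sketch (the port and wiring are illustrative):

    package main

    import (
        "log"
        "net/http"
        _ "net/http/pprof" // registers /debug/pprof/* handlers on the default mux
    )

    func main() {
        // Serve profiling endpoints on a loopback-only port, separate from
        // the service's public listener.
        go func() {
            log.Println(http.ListenAndServe("localhost:6060", nil))
        }()
        // ... the rest of the service starts here ...
        select {} // placeholder to keep the example alive
    }

From there, `go tool pprof http://localhost:6060/debug/pprof/profile?seconds=30` captures a CPU profile, and the /debug/pprof/heap endpoint captures a memory snapshot.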

Profiling must be done periodically, as new bottlenecks will constantly emerge as traffic scales and old bottlenecks are eliminated. As an example, at one point, we found that the "front-door" microservice for our analytics pipeline was CPU-bound parsing JSON. We switched from Go's built-in encoding/json package to easyjson, which uses compile-time specialization to eliminate slow runtime reflection in JSON parsing.
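
As an illustration of the easyjson approach: you annotate a type and run the code generator ahead of time, rather than paying for reflection on every request. The struct below is a hypothetical, simplified event, not our actual schema:

    package events

    //go:generate easyjson -all event.go

    // Event is a simplified analytics payload. Running the easyjson generator
    // produces event_easyjson.go with reflection-free MarshalJSON/UnmarshalJSON
    // implementations for this type.
    //easyjson:json
    type Event struct {
        Kind         string `json:"kind"`
        Key          string `json:"key"`
        CreationDate int64  `json:"creationDate"`
    }

Handlers then decode with easyjson.Unmarshal(body, &event) instead of encoding/json, trading a code-generation step for substantially less CPU per request.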

We also identified a set of "micro-patterns" that we have extracted as self-contained libraries so they can be applied in appropriate contexts. Some examples:

  • Read coalescing—In a read-heavy workload, expensive calls to fetch data can be coalesced so that concurrent callers await a single in-flight read—a kind of memoization. This pattern is encapsulated in Google's singleflight package (see the sketch after this list)
  • Write coalescing—The dual of read coalescing. In a write-heavy workload, where last write wins, writes can be queued and discarded in favor of the latest write attempt.
  • Multi-layer caching—In scenarios where an in-process, in-memory cache is necessary for performance, horizontal scaling can reduce cache hit rates. We make our fleet more resilient to this effect by employing multiple layers of caching—for example, backing an in-memory cache with a shared Redis cache before finally falling back to a slower persistent disk-backed store.
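
For example, read coalescing with singleflight looks roughly like this; the fetchFromStore helper is a hypothetical stand-in for a slow backing-store read:

    package flags

    import (
        "context"

        "golang.org/x/sync/singleflight"
    )

    // group collapses concurrent fetches for the same key into one call.
    var group singleflight.Group

    // getFlag coalesces concurrent reads: while one fetch for a given key is
    // in flight, later callers wait for and share its result instead of
    // issuing duplicate queries to the backing store.
    func getFlag(ctx context.Context, key string) (interface{}, error) {
        v, err, _ := group.Do(key, func() (interface{}, error) {
            return fetchFromStore(ctx, key)
        })
        return v, err
    }

    // fetchFromStore is a hypothetical expensive read (e.g. Redis, then a
    // slower persistent store).
    func fetchFromStore(ctx context.Context, key string) (interface{}, error) {
        // ... real lookup elided ...
        return nil, nil
    }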

These simple patterns improved performance at scale and also helped us deal with bad traffic patterns like reconnection storms.

Get good at managing change

Scaling up isn't just about improving your services and architecture. It requires equal investment in people, processes, and tools. One thing we really focused on, on the process and tools front, is understanding change. Better visibility into the changes being made to the service has had a massively positive impact on reliability. Here are a few things we did to improve visibility:

  • Internal changelog service: This service catalogues intentional changes being made to the system. This includes deploys, instance type changes, configuration changes, feature flag changes, and more. Anything that could potentially impact the service (either in a positive or negative way) is catalogued here. We couldn't find anything off the shelf here, so we built something ourselves.
  • COGS (cost of goods sold) log: Very similar to our changelog, but focused on cost changes to our services. If we scale out a service, change instance types, or purchase reserved instances, we add an entry to this log. For us, this is just a Confluence page.
  • Observability / APM: We use a number of services to gain observability into what is happening to our service at runtime. We use a mix of Graphite / Grafana and Honeycomb.io to give us the observability we need. We're big fans of Honeycomb here.
  • Operational and release feature flags: We feature flag most changes using LaunchDarkly. Most new changes are protected by release flags (short-lived flags that are used to protect the initial rollout and rollback of a feature). We also create operational flags—long-lived flags that act as control switches for the application. Observability lets us understand change, and feature flags let us react to change to maintain availability or improve user experience (a minimal sketch of an operational flag follows this list).
  • Spinnaker / Armory: LaunchDarkly is almost a five-year-old company, and our methodology for deploying was state of the art... for 2014. We recently undertook a project to modernize the way we deploy our software, moving from Ansible-based deploy scripts that executed on our local machines to using Spinnaker (along with Terraform and Packer) as the basis of our deployment system. We've been using Armory's enterprise Spinnaker offering to make this project a reality.
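
To make the operational-flag idea concrete, here's a minimal sketch using a v4-era LaunchDarkly Go server-side SDK; the SDK key and flag key are placeholders:

    package main

    import (
        "log"
        "time"

        ld "gopkg.in/launchdarkly/go-client.v4"
    )

    func main() {
        // Initialize once at startup; the SDK keeps flag state current via streaming.
        client, err := ld.MakeClient("YOUR_SDK_KEY", 5*time.Second)
        if err != nil {
            log.Fatal(err)
        }
        defer client.Close()

        // A "user" can represent a service instance when evaluating operational flags.
        user := ld.NewUser("event-ingest-host-1") // hypothetical key

        // Long-lived operational flag acting as a control switch (hypothetical key).
        useNewPath, _ := client.BoolVariation("use-new-ingest-path", user, false)
        if useNewPath {
            // route traffic through the new code path
        } else {
            // fall back to the existing path
        }
    }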

Like the sound of this stack? Learn more about LaunchDarkly.
