MaxCDN has always used a Perl-based system for provisioning zones to various Points of Presence (POPs) throughout the cache network. That system started to creak as our client base grew: provisioning ran on a single thread and blocked on I/O operations.


  • We decided to make provisioning API-driven, and had to choose between two implementation options:

    • Go, the server-side language from Google
    • NodeJS, an asynchronous framework in JavaScript

    We built prototypes in both languages, and decided on NodeJS:

    • NodeJS is asynchronous-by-default, which suited the problem domain. Provisioning is more like “start the job, let me know when you’re done” than a traditional C-style program that’s CPU-bound and needs low-level efficiency.
    • NodeJS acts as an HTTP-based service, so exposing the API was trivial.

    Getting into the headspace and internalizing the assumptions of a tool helps pick the right one. NodeJS assumes services will be non-blocking/event-driven and HTTP-accessible, which snapped into our scenario perfectly. The new NodeJS architecture resulted in a staggering 95% reduction in processing time: requests went from 7.5 seconds to under a second.
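
To make the “start the job, let me know when you’re done” model concrete, here is a minimal sketch, not the production code. `provisionToPop` is a hypothetical stand-in that simulates I/O with a timer, and the function names and the 50 ms delay are assumptions chosen for illustration.

```javascript
// Hypothetical stand-in for the real I/O (for example, an HTTP call to a POP).
// It resolves after `delayMs` to simulate network latency.
function provisionToPop(pop, zone, delayMs = 50) {
  return new Promise((resolve) => {
    setTimeout(() => resolve({ pop, zone, status: "ok" }), delayMs);
  });
}

// Blocking-style flow: each POP waits for the previous one to finish.
async function provisionSequentially(pops, zone) {
  const results = [];
  for (const pop of pops) {
    results.push(await provisionToPop(pop, zone));
  }
  return results;
}

// Event-driven flow: start every job, then wait for all of them together.
function provisionConcurrently(pops, zone) {
  return Promise.all(pops.map((pop) => provisionToPop(pop, zone)));
}
```

With the sequential version, total time grows with the number of POPs; with the concurrent version, it is roughly the time of the slowest single POP, because the process is free to start other jobs while each one waits on I/O.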


  • The original API performed a synchronous Nginx reload after provisioning a zone, which often took 30 seconds or longer. While important, this step shouldn’t block the response to the user (or API) that a new zone has been created, or block subsequent requests to adjust the zone. With the new API, an independent worker reloads Nginx configurations based on zone modifications.

    It’s like ordering a product online: don’t pause the purchase process until the product’s been shipped. Say the order has been created, and you can still cancel or modify shipping information. Meanwhile, the remaining steps are being handled behind the scenes. In our case, zone provisioning happens instantly, and you can see the result in your control panel or API. Behind the scenes, the zone will be serving traffic within a minute.
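
One way to picture the decoupling is the sketch below. It assumes an in-memory set of changed zones and a reload function passed in by the caller; the names `createZone`, `runWorkerOnce` and `reloadNginx` are hypothetical, and the real worker may be structured quite differently.

```javascript
// Zones whose configuration changed since the last reload.
const dirtyZones = new Set();

// API-side handler: record the change and respond right away,
// without waiting for Nginx.
function createZone(zoneId) {
  dirtyZones.add(zoneId);
  return { zoneId, status: "created" };
}

// Worker-side step: a single reload covers every change queued so far.
// `reloadNginx` is injected by the caller; in a real worker it might
// shell out to `nginx -s reload` (an assumption, not shown here).
async function runWorkerOnce(reloadNginx) {
  if (dirtyZones.size === 0) return [];
  const batch = [...dirtyZones];
  dirtyZones.clear();
  await reloadNginx(batch);
  return batch;
}
```

A worker could call `runWorkerOnce` on a timer. Because several zone changes collapse into one reload, a slow reload never holds up the API response.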


  • How do you know what parts of the workflow need improvement? Measure it. With New Relic in place, we have graphs of our API performance and can directly see if a server or zone is causing trouble, and the impact of our changes. There’s no comparison between a real-time performance graph and “Strange, the site seems slow, I should tail the logs”.
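
As a rough illustration of the kind of per-call timing a monitoring tool collects, the sketch below wraps an async function and records its duration. It uses Node’s built-in `perf_hooks` module, not the New Relic API; the `timed` helper and the `record` callback are hypothetical names.

```javascript
const { performance } = require("node:perf_hooks");

// Wraps an async function and reports how long each call takes,
// whether it succeeds or throws.
function timed(name, fn, record) {
  return async (...args) => {
    const start = performance.now();
    try {
      return await fn(...args);
    } finally {
      record({ name, ms: performance.now() - start });
    }
  };
}
```

Passing a `record` callback that forwards each sample to a metrics store gives a per-operation view of where the time goes, instead of guessing from the logs.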

