May 31, 2017
Real-Time Communication with Flannel and AWS
“Your Slack client is the window into your workplace, and teams have grown into the tens of thousands of people, much larger than any primitive village. Slack was architected around the goal of keeping teams of hundreds of people connected, and as teams have gotten larger, our initial techniques for loading and maintaining data have not scaled. To address that, we created a system that lazily loads data on demand and answers queries as you go.”
As teams got bigger, the initial techniques for loading and maintaining data stopped scaling, so Slack built a system that lazily loads data on demand and answers queries as they arrive. Three problems had become critical by then: connection times were getting longer, the client's memory footprint was large, and reconnecting to Slack had become expensive. Slack clients now connect to Flannel, an application-level caching service developed in-house and deployed to Slack's edge points of presence. Flannel in turn gathers the full client startup data and opens a WebSocket connection to Slack's servers in the AWS regions. In an episode of “This is My Architecture”, Richard Crowley, Director of Service Engineering, shows how Slack uses CloudFront, HAProxy, ELB, EC2, and Route 53 to make all of this happen.
Flannel then returns a slimmed-down version of this startup data to the client, which is enough for the client to bootstrap and be ready to use. Flannel has been running in Slack's edge locations since January 2017, serving 4 million simultaneous connections at peak and 600k client queries per second. With Flannel, the payload needed for client bootstrap shrank considerably. Overall, Flannel has played a significant role in making Slack faster and more reliable.
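
To make the idea concrete, here is a minimal Python sketch of the pattern described above: an edge cache holds the full startup data, hands the client a slim bootstrap payload, and answers detail queries on demand. This is not Flannel's actual code or API. The class EdgeCache, its methods, the fake_origin function, and the shape of the data are all hypothetical and chosen only for illustration.

```python
from typing import Callable, Dict, Optional


class EdgeCache:
    """Toy stand-in for an edge cache (hypothetical, not Flannel's design)."""

    def __init__(self, fetch_full_startup: Callable[[str], dict]) -> None:
        # fetch_full_startup is an assumed callback that loads the full
        # startup data for a team from the origin servers.
        self._fetch_full_startup = fetch_full_startup
        self._teams: Dict[str, dict] = {}
        self.origin_fetches = 0  # counts trips to the origin

    def _team(self, team_id: str) -> dict:
        # Load the full data once per team, then serve it from memory.
        if team_id not in self._teams:
            self._teams[team_id] = self._fetch_full_startup(team_id)
            self.origin_fetches += 1
        return self._teams[team_id]

    def bootstrap(self, team_id: str) -> dict:
        # Slim payload: no per-user profiles, only what is needed to start.
        full = self._team(team_id)
        return {
            "team": full["team"],
            "channels": full["channels"],
            "user_count": len(full["users"]),
        }

    def query_user(self, team_id: str, user_id: str) -> Optional[dict]:
        # On-demand query: the client asks for a profile only when needed.
        return self._team(team_id)["users"].get(user_id)


def fake_origin(team_id: str) -> dict:
    # Simulated origin response with 1000 user profiles (made-up data).
    users = {f"U{i}": {"id": f"U{i}", "name": f"user{i}"} for i in range(1000)}
    return {"team": team_id, "channels": ["general", "random"], "users": users}


cache = EdgeCache(fake_origin)
slim = cache.bootstrap("T1")              # small payload sent at connect time
profile = cache.query_user("T1", "U42")   # fetched lazily, served by the cache
```

In this sketch the client never receives the 1000 user profiles up front. It gets a small bootstrap payload and asks the cache for individual profiles as it needs them, and the origin is contacted only once per team.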


