Early job listings for “Application Engineer” required applicants to have “Fluency with the LAMP stack”. The LAMP stack (Linux, Apache, MySQL, PHP) was, and still is, a popular choice for web applications, which is how Slack started out. Other requirements included an excellent understanding of networking, HTTP, JSON, and Smarty (a template engine for PHP). According to an AWS case study, Tiny Speck (the original company name for what became Slack Technologies) “used AWS in 2009 when it was the only viable offering for public cloud services.”
One of the earliest public references to Slack’s stack comes from a Twitter conversation. The Slack account states that “the messaging server is java, the app is php, db is mysql and solr for search,” and that uploaded files are “Stored on S3, but private files require authentication so requests go through the app.”
As of 2017, Slack was processing more than 1.4 billion jobs on its busiest days, at a peak rate of 33,000 jobs per second. Up to that point, Slack had continued to depend on its initial job queue implementation, which was based on Redis. While that system had allowed the company to grow exponentially and diversify its services, Slack eventually outgrew it. Redis had little memory headroom: dequeuing a job itself requires free memory, so once memory ran out, jobs could no longer be dequeued. Scaling up the job workers placed further load on Redis, slowing the entire system.
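As a back-of-the-envelope check on those figures, 1.4 billion jobs spread evenly over a day works out to roughly 16,200 per second, so the 33,000 figure is best read as a peak rate rather than a daily average. The variable names below are only illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds

jobs_per_day = 1_400_000_000  # busiest-day volume quoted above
quoted_rate = 33_000          # jobs per second quoted above

# Average rate if the daily volume were spread evenly across the day.
average_rate = jobs_per_day / SECONDS_PER_DAY

# How far the quoted per-second figure sits above that average.
peak_to_average = quoted_rate / average_rate

print(f"average: {average_rate:,.0f} jobs/s")
print(f"quoted rate is {peak_to_average:.1f}x the average")
```

A gap of about 2x between peak and average load is the kind of burstiness a queue needs to absorb.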
Slack decided to use Kafka so that it could scale up without discarding the existing architecture. Rather than replace the system, the team added Kafka in front of Redis and left the existing queuing interface in place. A stateless service called Kafkagate was developed in Go to enqueue jobs to Kafka. It exposes an HTTP POST interface in which each request comprises a topic, a partition, and content. Kafkagate's design reduces latency when writing jobs and allows greater flexibility in job queue design. JQRelay, also a stateless service, relays jobs from a Kafka topic to Redis. It ensures that only one relay process is assigned to each topic, that failures are self-healing, and that job-specific errors are handled by re-enqueuing the job to Kafka. The new system was rolled out by double-writing all jobs to both Redis and Kafka, with JQRelay operating in "shadow mode": dropping all jobs after reading them from Kafka. Jobs were verified by tracking each one at every part of the system throughout its lifetime. With durable storage and JQRelay in place, the enqueuing rate could be paused or adjusted to give Redis the necessary breathing room, making Slack a much more resilient service.
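The rollout described above can be sketched as a toy, in-memory model. This is only an illustration under stated assumptions, not Slack's implementation: the `Job` and `Pipeline` classes, their fields, and the stage names are all hypothetical, a `deque` stands in for Redis, and a dictionary of lists stands in for Kafka partitions. The only details taken from the text are the (topic, partition, content) shape of an enqueue request, the double write, and a relay that drops jobs in shadow mode.

```python
from collections import defaultdict, deque
from dataclasses import dataclass, field


@dataclass
class Job:
    job_id: str     # hypothetical identifier used for tracking
    topic: str      # Kafka topic the job is written to
    partition: int  # partition within that topic
    content: str    # opaque job payload


@dataclass
class Pipeline:
    """Toy model: in shadow mode every job is double-written and the relay drops it."""
    shadow_mode: bool = True
    redis_queue: deque = field(default_factory=deque)                   # stand-in for Redis
    kafka_log: dict = field(default_factory=lambda: defaultdict(list))  # (topic, partition) -> jobs
    offsets: dict = field(default_factory=lambda: defaultdict(int))     # relay read positions
    seen: dict = field(default_factory=lambda: defaultdict(set))        # job_id -> stages reached

    def enqueue(self, job: Job) -> None:
        # New path: every job is appended to its Kafka partition.
        self.kafka_log[(job.topic, job.partition)].append(job)
        self.seen[job.job_id].add("kafka")
        if self.shadow_mode:
            # Legacy path stays live during rollout (the double write).
            self.redis_queue.append(job)
            self.seen[job.job_id].add("redis")

    def relay(self, topic: str, partition: int) -> int:
        """Consume unread jobs from one partition; return how many reached Redis."""
        key = (topic, partition)
        log = self.kafka_log[key]
        forwarded = 0
        while self.offsets[key] < len(log):
            job = log[self.offsets[key]]
            self.offsets[key] += 1
            self.seen[job.job_id].add("relay")
            if self.shadow_mode:
                continue  # shadow mode: record the read, then drop the job
            self.redis_queue.append(job)
            self.seen[job.job_id].add("redis")
            forwarded += 1
        return forwarded

    def fully_tracked(self) -> bool:
        # Verification: every job must have been observed at every stage.
        expected = {"kafka", "redis", "relay"}
        return all(stages == expected for stages in self.seen.values())
```

In shadow mode the relay forwards nothing, yet every job is still observed at each stage, which is what makes it possible to compare the old and new paths before switching over. Setting `shadow_mode=False` in this sketch models the state after the switch, where Redis is fed only by the relay.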
One size definitely doesn’t fit all when it comes to open source monitoring solutions, and applying generally understood best practices to unique distributed systems presents all sorts of problems. Megan Anctil, a senior engineer on the Technical Operations team at Slack, gave a talk at an O’Reilly Velocity Conference sharing pain points and lessons learned from wrangling well-known technologies such as Icinga, Graphite, Grafana, and the Elastic Stack to fit the company’s use cases.
At the time, Slack relied on a few well-known monitoring tools because its Technical Operations team wasn’t large enough to build in-house replacements for all of them. Nor did the team think it was sustainable to throw money at the problem, given the volume of information processed and the not-insignificant price and rigidity of many vendor solutions. With thousands of servers across multiple regions and millions of metrics and documents being processed and indexed per second, the team had to figure out how to scale these technologies to fit Slack’s needs.
On the backend, the team experimented with multiple clusters in both Graphite and ELK, distributed Icinga nodes, and more. At the same time, they tried to build usability into Grafana that reflects the team’s mental models of the system, and they found ways to make alerts from Icinga more insightful and actionable.