One of the earliest public references to Slack’s stack comes from a Twitter conversation. The Slack account states that “the messaging server is java, the app is php, db is mysql and solr for search,” and that uploaded files are “Stored on S3, but private files require authentication so requests go through the app.”
One size definitely doesn’t fit all when it comes to open source monitoring solutions, and executing generally understood best practices in the context of unique distributed systems presents all sorts of problems. Megan Anctil, a senior engineer on the Technical Operations team at Slack gave a talk at an O’Reilly Velocity Conference sharing pain points and lessons learned at wrangling known technologies such as Icinga, Graphite, Grafana, and the Elastic Stack to best fit the company’s use cases.
At the time, Slack used a few well-known monitoring tools since it’s Technical Operations team wasn’t large enough to build an in-house solution for all of these. Nor did the team think it’s sustainable to throw money at the problem, given the volume of information processed and the not-insignificant price and rigidity of many vendor solutions. With thousands of servers across multiple regions and millions of metrics and documents being processed and indexed per second, the team had to figure out how to scale these technologies to fit Slack’s needs.
On the backend, they experimented with multiple clusters in both Graphite and ELK, distributed Icinga nodes, and more. At the same time, they’ve tried to build usability into Grafana that reflects the team’s mental models of the system and have found ways to make alerts from Icinga more insightful and actionable.
As of 2017, Slack was handling a peak of 1.4 billion jobs a day, (33,000 jobs every second). Until recently, Slack had continued to depend on their initial job queue implementation system based on Redis. While it had allowed them to grow exponentially and diversify their services, they soon outgrew the existing system. Also, dequeuing jobs required memory that was unavailable. Allowing job workers to scale up further burdened Redis, slowing the entire system.
Slack decided to use Kafka to ease the process and allow them to scale up without getting rid of the existing architecture. To build on it, they added Kafka in front of Redis leaving the existing queuing interface in place. A stateless service called Kafkagate was developed in Go to enqueue jobs to Kafka. It exposes an HTTP POST interface with each request comprising a topic, partition, and content. Kafkagate's design reduces latency while writing jobs and allows greater flexibility in job queue design. JQRelay, a stateless service, is used to relay jobs from a Kafka topic to Redis. It ensures only one relay process is assigned to each topic, failures are self-healing, and job-specific errors are corrected by re-enqueuing the job to Kafka. The new system was rolled out by double writing all jobs to both Redis and Kafka, with JQRelay operating in 'shadow mode' - dropping all jobs after reading it from Kafka. Jobs were verified by being tracked at each part of the system through its lifetime. By using durable storage and JQRelay, the enqueuing rate could be paused or adjusted to give Redis the necessary breathing room and make Slack a much more resilient service.
Slack's new desktop application was launched for macOS. It was built using Electron for a faster, frameless look with a host of background improvements for a superior Slack experience. Instead of adopting a complete-in-box approach taken by other apps, Slack prefers a hybrid approach where some of the assets are loaded as part of the app, while others are made available remotely. Slack's original desktop app was written using the MacGap v1 framework using WebView to host web content within the native app frame. But it was difficult to upgrade with new features only available to Apple's WKWebView and moving to this view called for a total application rewrite.
Electron brings together Chromium's rendering engine with the Node.js runtime and module system. The new desktop app is now based on an ES6 + async/await React application is currently being moved gradually to TypeScript. Electron functions on Chromium's multi-process model, with each Slack team signed into a separate process and memory space. It also helps prevent remote content to directly access desktop features using a feature called WebView Element which creates a fresh Chromium renderer process and assigns rendering of content for its hosting renderer. Additional security can be ensured by preventing Node.js modules from leaking into the API surface and watching out for APIs with file paths. Communication between processes on Electron is carried out via electron-remote, a pared-down, zippy version of Electron's remote module, which makes implementing the web apps UI much easier.
The Slack desktop app was originally written us the MacGap framework, which used Apple’s WebView to host web content inside of a native app frame. As this approach continued to present product limitations, Slack decided to migrate the desktop app to Electron. Electron is a platform that combines the rendering engine from Chromium and the Node.js runtime and module system. The desktop app is written as a modern ES6 + async/await React application.
For the desktop app, Slack takes a hybrid approach, wherein some of the assets ship as part of the app, but most of their assets and code are loaded remotely.
"Slack provides two strategies for searching: Recent and Relevant. Recent search finds the messages that match all terms and presents them in reverse chronological order. If a user is trying to recall something that just happened, Recent is a useful presentation of the results.
Relevant search relaxes the age constraint and takes into account the Lucene score of the document — how well it matches the query terms (Solr powers search at Slack). Used about 17% of the time, Relevant search performed slightly worse than Recent according to the search quality metrics we measured: the number of clicks per search and the click-through rate of the search results in the top several positions. We recognized that Relevant search could benefit from using the user’s interaction history with channels and other users — their ‘work graph’."
Slack’s data team works to “provide an ecosystem to help people in the company quickly and easily answer questions about usage, so they can make better and data informed decisions.” To achieve that goal, that rely on a complex data pipeline.
An in-house tool call Sqooper scrapes MySQL backups and pipe them to S3. Job queue and log data is sent to Kafka then persisted to S3 using an open source tool called Secor, which was created by Pinterest.
For compute, Amazon’s Elastic MapReduce (EMR) creates clusters preconfigured for Presto, Hive, and Spark.
Presto is then used for ad-hoc questions, validating data assumptions, exploring smaller datasets, and creating visualizations for some internal tools. Hive is used for larger data sets or longer time series data, and Spark allows teams to write efficient and robust batch and aggregation jobs. Most of the Spark pipeline is written in Scala.
Thrift binds all of these engines together with a typed schema and structured data.
Finally, the Hive Metastore serves as the ground truth for all data and its schema.
Throughout 2016, Slack began migrating from PHP5 to Hack. They cite several well-known challenges inherent to PHP, including surprise type conversions, inconsistency around reference semantics, inconsistencies in the standard library, and the fact that “PHP tries very, very hard to keep the request running, even if it has done something deeply strange.”
To overcome these challenges while maintaining the unique values of PHP, Slack turned to Hack, a gradual typing system for PHP. Hack runs on the HipHop Virtual Machine, or HHVM, an open source just-in-time (JIT) environment for PHP.
Some Slack customers need tighter control and visibility into their data without disturbing the most essential features. This resulted in the development and release of Slack Enterprise Key Management (Slack EKM). It allows larger visibility into data and more control over the keys used to encrypt and decrypt the data. Slack EKM allows users to bring their own keys into Slack. The initial release supported third-party integration of Amazon Web Services Key Management Service to store keys. Slack EKM works independently of the web application so that other security measures could be added to it. It was written in Go, largely due to its suitability for CPU-intensive cryptographic operations and its top-notch AWS software development kit. KMS key requests are logged on AWS CloudTrail, which key requests created directly from AWS KMS.
In a 2015 AWS case study, Richard Crowley, Director of Operations said “with traditional IT, it would take weeks or months to contend with hardware lead times to add more capacity. Using AWS, we can look at user metrics weekly or daily and react with new capacity in 30 seconds.”
Slack needed to pick an infrastructure partner that could support the exponential growth they were experiencing. AWS is the cloud provider that supplied them with i2.xlarge Amazon Elastic Compute Cloud (Amazon EC2) instances for their LAMP stack, Amazon Simple Storage Service (Amazon S3) for user's file uploads and static assets, and ELB to Load Balance workloads across their EC2 instances.
For security, Slack went with Amazon Virtual Private Cloud (VPC) for controlling security groups and firewall rules and AWS Identity and Access Management (IAM) for controlling user credentials and roles.
In 2018, Slack signed an agreement with AWS to spend at least $50 million a year over five years, for a total of at least $250 million, according to the company’s filing with the SEC for a public stock listing (via CNBC)