Slack

Mar 18, 2019

Slack introduces Enterprise Key Management

Some Slack customers need tighter control over and visibility into their data without giving up Slack's most essential features. This led to the development and release of Slack Enterprise Key Management (Slack EKM). It gives customers greater visibility into their data and more control over the keys used to encrypt and decrypt it. Slack EKM lets customers bring their own keys into Slack. The initial release supported Amazon Web Services Key Management Service (AWS KMS) as the third-party integration for storing keys. Slack EKM runs independently of the web application so that other security measures can be added to it. It was written in Go, largely because of Go's suitability for CPU-intensive cryptographic operations and its top-notch AWS software development kit. KMS key requests are logged in AWS CloudTrail, which records key requests made directly against AWS KMS.

16.2k views

StackShare Editors

Feb 9, 2019

Slack's CTO summarizes their current stack

Since the beginning, Cal Henderson has been the CTO of Slack. Earlier this year, he commented on a Quora question summarizing their current stack.

Apps

Web: a mix of JavaScript/ES6 and React.
Desktop: Electron, used to ship the web app as a desktop application.
Android: a mix of Java and Kotlin.
iOS: a mix of Objective-C and Swift.

Backend

The core application and the API are written in PHP/Hack and run on HHVM.
The data is stored in MySQL using Vitess.
Caching is done using Memcached and MCRouter.
The search service is based on SolrCloud, with various Java services for ranking.
The messaging system uses WebSockets with many services in Java and Go.
Load balancing is done using HAProxy, with Consul for configuration.
Most services talk to each other over gRPC, with some Thrift and JSON-over-HTTP.
The voice and video calling service was built in Elixir.

Data warehouse

Built using open source tools including Presto, Spark, Airflow, Hadoop and Kafka.

Etc

Server configuration and management are handled with Terraform, Chef and Kubernetes.
Prometheus is used for time series metrics and ELK for logging.

848k views

StackShare Editors

Feb 5, 2019

Slack's tech stack approaching IPO

By early 2019, a mix of JavaScript, ES6, and React powered the web app, with the desktop app shipping in Electron. Java and Kotlin powered the Android app, with Objective-C and Swift powering iOS.

CTO Cal Henderson goes on to describe the backend: “we have our core application which powers and our API, which is written in PHP/Hacklang running on HHVM. We store data in MySQL using Vitess. For caching, we use Memcached and MCRouter. Our search service is based on SolrCloud, with various Java services for ranking. Our real-time messaging system uses WebSockets and is comprised of many services written in Java and Go.

“We use HAproxy for load balancing and Consul for configuration and some service discovery. Most of our services talk to each other over gRPC, though we have some Thrift and JSON-over-HTTP too. Our voice and video calling service is built in Elixir. A few different services are also written in Node. Our async task queue system is built on Kafka and Redis.

“Our data warehouse is built on open source tools, including Presto, Spark, Airflow, Hadoop and Kafka. For server configuration and management we use Terraform, Chef and Kubernetes. We use Prometheus for time series metrics and ELK for logging. Slack is largely hosted in AWS, in many regions globally.”

41 views

StackShare Editors

Aug 9, 2018

Re-architecting Slack’s Workspace Preferences

Slack's data is critical to the business and is operated on by an ecosystem of tools. Once those tools have touched the data, it is important to verify that the data remains as expected at all times. Even with the best efforts to prevent errors, inconsistencies are bound to creep in at any stage. To check the data in a comprehensive manner, Slack developed a structure known as a consistency check framework.

This is a responsive and customizable framework that can meaningfully analyze and report on data, with a number of proactive and reactive benefits. It helps with repair and recovery after an outage or bug, it helps ensure effective data migration through scripts that test the data post-migration, and it finds bugs throughout the database. The framework also prevents duplication by identifying the canonical code for each case and running it as reusable code.

The framework was built by writing generic versions of the scanning and reporting code and defining an interface for the checking code. The checks could be run from the command line, against either a single team or the whole system. The process was improved over time to further customize the checks and make them more specific. To make the framework accessible to everyone, a GUI was added and connected to the internal administrative system. The framework was also extended with code that can fix certain problems automatically, while others are left for manual intervention. For Slack, such a tool proved extremely beneficial in ensuring data integrity both internally and externally.
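The split between generic scanning and reporting code and a per-check interface can be sketched as follows. This is a minimal illustration, not Slack's implementation: the `Check` interface, the `Issue` record, the `scan` function, and the `ChannelCountCheck` example (including the team fields it inspects) are all hypothetical names invented for this sketch.

```python
from dataclasses import dataclass
from typing import Iterable, Protocol


@dataclass
class Issue:
    """One line of the report produced by the generic reporting code."""
    team_id: int
    check: str
    detail: str


class Check(Protocol):
    """Interface that each specific consistency check implements (hypothetical)."""

    name: str

    def find_issues(self, team: dict) -> list[str]: ...


class ChannelCountCheck:
    """Hypothetical check: the stored channel count must match the channel list."""

    name = "channel_count"

    def find_issues(self, team: dict) -> list[str]:
        actual = len(team["channels"])
        if team["channel_count"] != actual:
            return [f"channel_count={team['channel_count']} but {actual} channels exist"]
        return []

    def fix(self, team: dict) -> None:
        # Optional repair step; checks without `fix` are left for manual intervention.
        team["channel_count"] = len(team["channels"])


def scan(teams: Iterable[dict], checks: list[Check], repair: bool = False) -> list[Issue]:
    """Generic scanner: run every check on every team and collect a report."""
    report = []
    for team in teams:
        for check in checks:
            for detail in check.find_issues(team):
                report.append(Issue(team["id"], check.name, detail))
                if repair and hasattr(check, "fix"):
                    check.fix(team)
    return report


teams = [
    {"id": 1, "channels": ["general", "random"], "channel_count": 2},
    {"id": 2, "channels": ["general"], "channel_count": 3},
]
issues = scan(teams, [ChannelCountCheck()], repair=True)
for issue in issues:
    print(issue)
```

Passing a one-element list such as `scan([teams[1]], checks)` corresponds to scanning a single team, while passing every team corresponds to scanning the whole system.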

27k views

StackShare Editors

Dec 6, 2017

Using Kafka and Redis to handle billions of tasks in milliseconds

As of 2017, Slack was handling a peak of 1.4 billion jobs a day, with a peak rate of 33,000 jobs per second. Until recently, Slack had continued to depend on its initial job queue implementation, which was based on Redis. While that system had allowed the company to grow exponentially and diversify its services, Slack eventually outgrew it. Dequeuing jobs required free memory that Redis did not always have, and letting job workers scale up put further load on Redis, slowing the entire system.

Slack decided to use Kafka to ease the load and scale up without discarding the existing architecture. They added Kafka in front of Redis, leaving the existing queuing interface in place. A stateless service called Kafkagate was developed in Go to enqueue jobs to Kafka. It exposes an HTTP POST interface, with each request comprising a topic, partition, and content. Kafkagate's design reduces latency when writing jobs and allows greater flexibility in job queue design.

JQRelay, also a stateless service, relays jobs from a Kafka topic to Redis. It ensures that only one relay process is assigned to each topic, that failures are self-healing, and that job-specific errors are corrected by re-enqueuing the job to Kafka.

The new system was rolled out by double-writing all jobs to both Redis and Kafka, with JQRelay operating in 'shadow mode', dropping each job after reading it from Kafka. Jobs were verified by tracking each one through every part of the system over its lifetime. With durable storage and JQRelay in place, the enqueuing rate could be paused or adjusted to give Redis the necessary breathing room, making Slack a much more resilient service.
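The enqueue and relay flow described above can be sketched with in-memory stand-ins. This is an illustration under stated assumptions, not Kafkagate or JQRelay themselves (Kafkagate is written in Go): the JSON field names, the function names `kafkagate_enqueue` and `jqrelay_drain`, and the use of plain Python containers in place of Kafka and Redis are all hypothetical.

```python
import json
from collections import defaultdict

# In-memory stand-ins for Kafka (durable log) and Redis (job queue).
kafka_log = defaultdict(list)  # (topic, partition) -> list of job payloads
redis_queue = []


def kafkagate_enqueue(request_body: str) -> int:
    """Handle one POST body and return an HTTP-style status code."""
    try:
        req = json.loads(request_body)
        key = (req["topic"], int(req["partition"]))
        content = req["content"]
    except (ValueError, KeyError, TypeError):
        return 400  # malformed request: nothing is written
    kafka_log[key].append(content)
    return 200


def jqrelay_drain(topic: str, partition: int, offset: int,
                  max_jobs: int, shadow_mode: bool = False) -> int:
    """Relay up to max_jobs jobs from the log to the queue; return the new offset.

    max_jobs models the adjustable relay rate. In shadow mode, jobs are read
    from the log and then dropped instead of being written to the queue.
    """
    jobs = kafka_log[(topic, partition)][offset:offset + max_jobs]
    if not shadow_mode:
        redis_queue.extend(jobs)
    return offset + len(jobs)


for i in range(5):
    status = kafkagate_enqueue(
        json.dumps({"topic": "jobs", "partition": 0, "content": f"job-{i}"})
    )

# First pass in shadow mode (jobs dropped), second pass relays for real.
offset = jqrelay_drain("jobs", 0, offset=0, max_jobs=2, shadow_mode=True)
offset = jqrelay_drain("jobs", 0, offset=offset, max_jobs=2)
print(offset, redis_queue)
```

Because the log keeps every job, the relay can be paused or slowed by lowering `max_jobs` without losing work. That is the property the durable storage layer adds in front of the queue.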

208k views

StackShare Editors

Jun 22, 2017

How Slack runs monitoring at scale

One size definitely doesn’t fit all when it comes to open source monitoring solutions, and executing generally understood best practices in the context of unique distributed systems presents all sorts of problems. Megan Anctil, a senior engineer on the Technical Operations team at Slack, gave a talk at an O’Reilly Velocity Conference sharing pain points and lessons learned in wrangling known technologies such as Icinga, Graphite, Grafana, and the Elastic Stack to best fit the company’s use cases.

At the time, Slack used a few well-known monitoring tools, since its Technical Operations team wasn’t large enough to build an in-house solution for all of them. Nor did the team think it was sustainable to throw money at the problem, given the volume of information processed and the not-insignificant price and rigidity of many vendor solutions. With thousands of servers across multiple regions and millions of metrics and documents being processed and indexed per second, the team had to figure out how to scale these technologies to fit Slack’s needs.

On the backend, they experimented with multiple clusters in both Graphite and ELK, distributed Icinga nodes, and more. At the same time, they tried to build usability into Grafana that reflects the team’s mental models of the system, and found ways to make alerts from Icinga more insightful and actionable.

559k views

StackShare Editors

May 31, 2017

Real-Time Communication with Flannel and AWS

“Your Slack client is the window into your workplace, and teams have grown into the tens of thousands of people, much larger than any primitive village. Slack was architected around the goal of keeping teams of hundreds of people connected, and as teams have gotten larger, our initial techniques for loading and maintaining data have not scaled. To address that, we created a system that lazily loads data on demand and answers queries as you go.”

As teams got bigger, the initial techniques for loading and maintaining data did not scale. The critical problems at this juncture were that connection times grew longer, the client memory footprint was large, and reconnecting to Slack became expensive. To fix this, Slack developed a system that lazily loads data on demand and answers queries as they arise. Slack clients connect to Flannel, an application-level caching service developed in-house and deployed to Slack's edge points of presence. Flannel in turn opens a WebSocket connection to Slack’s servers in the AWS regions and gathers the full client startup data. In an episode of “This is My Architecture”, Richard Crowley, Director of Service Engineering, shows how they use CloudFront, HAProxy, ELB, EC2, and Route 53 to make all of this happen.

Flannel then returns a slimmed-down version of this startup data to the client, allowing it to bootstrap and ensuring the Slack client is ready to use. Flannel had been running in Slack's edge locations since January 2017, serving 4 million simultaneous connections at peak and 600k client queries per second. With Flannel, the payload size needed for client bootstrap shrank considerably. In all, Flannel played a significant role in making Slack faster and more reliable.
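The idea of an edge cache that holds the full startup data, hands the client a slimmed-down bootstrap payload, and answers later lookups on demand can be sketched as below. This is a toy model only: the `EdgeCache` class, its method names, and the shape of the startup data are assumptions made for illustration and do not reflect Flannel's actual interface or data model.

```python
class EdgeCache:
    """Toy stand-in for an edge cache that holds full startup data per team."""

    def __init__(self, fetch_full_startup_data):
        # Hypothetical callable that fetches full startup data from the origin.
        self._fetch = fetch_full_startup_data
        self._teams = {}

    def bootstrap(self, team_id, user_id):
        """Return a slim payload: no member list, only the caller's channels."""
        full = self._load(team_id)
        return {
            "team": full["name"],
            "self": user_id,
            "channels": [
                name for name, members in full["channels"].items()
                if user_id in members
            ],
        }

    def query_user(self, team_id, user_id):
        """Answer a lazy lookup for one user's profile on demand."""
        return self._load(team_id)["users"].get(user_id)

    def _load(self, team_id):
        # Fetch from the origin once per team, then serve from the cache.
        if team_id not in self._teams:
            self._teams[team_id] = self._fetch(team_id)
        return self._teams[team_id]


origin_calls = []


def fetch_full_startup_data(team_id):
    origin_calls.append(team_id)
    return {
        "name": "acme",
        "users": {"U1": {"name": "ana"}, "U2": {"name": "bo"}},
        "channels": {"general": {"U1", "U2"}, "ops": {"U2"}},
    }


cache = EdgeCache(fetch_full_startup_data)
payload = cache.bootstrap("T1", "U1")
profile = cache.query_user("T1", "U2")
print(payload, profile, origin_calls)
```

The client starts with only what it needs to render, and the full user list never has to travel to it; individual profiles are fetched when something on screen requires them.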

4.26k views

StackShare Editors

Feb 7, 2017

Optimizing Slack's Search by Relevance

"Slack provides two strategies for searching: Recent and Relevant. Recent search finds the messages that match all terms and presents them in reverse chronological order. If a user is trying to recall something that just happened, Recent is a useful presentation of the results.

Relevant search relaxes the age constraint and takes into account the Lucene score of the document — how well it matches the query terms (Solr powers search at Slack). Used about 17% of the time, Relevant search performed slightly worse than Recent according to the search quality metrics we measured: the number of clicks per search and the click-through rate of the search results in the top several positions. We recognized that Relevant search could benefit from using the user’s interaction history with channels and other users — their ‘work graph’."
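The difference between the two presentations can be illustrated with a small sketch. It is a simplification under stated assumptions: `toy_score` is a made-up stand-in for the Lucene score, the `Message` type and function names are hypothetical, and the Relevant ordering here uses the score alone rather than combining it with message age.

```python
from dataclasses import dataclass


@dataclass
class Message:
    text: str
    timestamp: int


def matches_all(msg, terms):
    """True if every query term appears as a word in the message."""
    words = msg.text.lower().split()
    return all(t in words for t in terms)


def toy_score(msg, terms):
    # Stand-in for the Lucene score: share of the message's words that are query terms.
    words = msg.text.lower().split()
    return sum(words.count(t) for t in terms) / len(words)


def recent_search(messages, terms):
    """Messages matching all terms, newest first."""
    hits = [m for m in messages if matches_all(m, terms)]
    return sorted(hits, key=lambda m: m.timestamp, reverse=True)


def relevant_search(messages, terms):
    """Messages matching all terms, best toy score first."""
    hits = [m for m in messages if matches_all(m, terms)]
    return sorted(hits, key=lambda m: toy_score(m, terms), reverse=True)


messages = [
    Message("deploy failed again on staging this morning folks", 300),
    Message("deploy failed", 100),
    Message("lunch anyone", 200),
]
terms = ["deploy", "failed"]
print([m.timestamp for m in recent_search(messages, terms)])
print([m.timestamp for m in relevant_search(messages, terms)])
```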

430k views

StackShare Editors

Dec 1, 2016

Data engineering with Presto, Hive, and Spark

Slack’s data team works to “provide an ecosystem to help people in the company quickly and easily answer questions about usage, so they can make better and data informed decisions.” To achieve that goal, they rely on a complex data pipeline.

An in-house tool called Sqooper scrapes MySQL backups and pipes them to S3. Job queue and log data are sent to Kafka and then persisted to S3 using an open source tool called Secor, which was created by Pinterest.

For compute, Amazon’s Elastic MapReduce (EMR) creates clusters preconfigured for Presto, Hive, and Spark.

Presto is then used for ad-hoc questions, validating data assumptions, exploring smaller datasets, and creating visualizations for some internal tools. Hive is used for larger data sets or longer time series data, and Spark allows teams to write efficient and robust batch and aggregation jobs. Most of the Spark pipeline is written in Scala.

Thrift binds all of these engines together with a typed schema and structured data.

Finally, the Hive Metastore serves as the ground truth for all data and its schema.

45.8k views

StackShare Editors

Nov 11, 2016

Monitoring for potentially malicious activity

To protect applications such as Slack from malicious activity, it is crucial to monitor the infrastructure at all times. The best way to do this is through a centralized logging system, which Slack built with tools such as StreamStash, Elasticsearch, and ElastAlert.

StreamStash is a Node.js-based service for aggregating, filtering, and redirecting logs. It sends its output to Elasticsearch, an open source full-text search engine with an HTTP web interface and schema-free JSON documents. Elasticsearch provides scalable, near-real-time search to the user.

This lets users retrieve the most up-to-date state of any log almost instantly. ElastAlert raises alerts on anomalies, spikes, and other patterns of interest in the data stored in Elasticsearch. Together, this system ensures that all the data is collected and processed, and can be studied and retrieved at a moment's notice for necessary action.
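A spike alert of the kind mentioned above compares how many events fall in the current time window with how many fell in the window just before it. The sketch below shows that comparison in plain Python; it is not ElastAlert's code or configuration, and the `is_spike` function, its parameters, and the sample timestamps are assumptions made for illustration.

```python
def is_spike(timestamps, now, window, threshold):
    """Return True if the current window holds at least `threshold` times
    as many events as the window immediately before it."""
    current = sum(1 for t in timestamps if now - window < t <= now)
    reference = sum(1 for t in timestamps if now - 2 * window < t <= now - window)
    # Treat an empty reference window as one event to avoid a zero baseline.
    return current >= threshold * max(reference, 1)


# Hypothetical failed-login timestamps, in seconds.
failed_logins = [5, 20, 50, 61, 62, 63, 70, 75, 90, 110, 115]

# Three events in the previous minute, eight in the current one.
print(is_spike(failed_logins, now=120, window=60, threshold=2))
print(is_spike(failed_logins, now=120, window=60, threshold=3))
```

Tuning `window` and `threshold` trades sensitivity against noise: a short window with a low threshold fires on small bursts, while a long window with a high threshold only fires on sustained surges.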

10.6k views
