Distributed, fault tolerant, high throughput pub-sub messaging system

What is Kafka?

Kafka is a distributed, partitioned, replicated commit log service. It provides the functionality of a messaging system, but with a unique design.

Kafka is a tool in the Message Queue category of a tech stack.

Kafka is an open source tool with 11.8K Github Stars and 6.36K Github Forks. Here’s a link to Kafka's open source repository on Github

Who Uses Kafka?

466 companies use Kafka including Uber, Spotify, and Shopify.

Kafka integrates with

Datadog, Couchbase, Apache Flink, Presto, and Woopra are some of the popular tools that integrate with Kafka. Here's a list of all 26 tools that integrate with Kafka.

Why people like Kafka

Here’s a list of reasons why companies and developers use Kafka.

Add a one-liner

Here are some stack decisions and reviews by companies and developers who chose Kafka in their tech stack.

Nick Rockwell
Nick Rockwell
CTO at NY Times · | 23 upvotes · 52896 views
atThe New York Times
Apache HTTP Server

When I joined NYT there was already broad dissatisfaction with the LAMP (Linux Apache HTTP Server MySQL PHP) Stack and the front end framework, in particular. So, I wasn't passing judgment on it. I mean, LAMP's fine, you can do good work in LAMP. It's a little dated at this point, but it's not ... I didn't want to rip it out for its own sake, but everyone else was like, "We don't like this, it's really inflexible." And I remember from being outside the company when that was called MIT FIVE when it had launched. And been observing it from the outside, and I was like, you guys took so long to do that and you did it so carefully, and yet you're not happy with your decisions. Why is that? That was more the impetus. If we're going to do this again, how are we going to do it in a way that we're gonna get a better result?

So we're moving quickly away from LAMP, I would say. So, right now, the new front end is React based and using Apollo. And we've been in a long, protracted, gradual rollout of the core experiences.

React is now talking to GraphQL as a primary API. There's a Node.js back end, to the front end, which is mainly for server-side rendering, as well.

Behind there, the main repository for the GraphQL server is a big table repository, that we call Bodega because it's a convenience store. And that reads off of a Kafka pipeline.

See more
Dan Robinson
Dan Robinson
at Heap, Inc. · | 14 upvotes · 11914 views

At Heap, we searched for an existing tool that would allow us to express the full range of analyses we needed, index the event definitions that made up the analyses, and was a mature, natively distributed system.

After coming up empty on this search, we decided to compromise on the “maturity” requirement and build our own distributed system around Citus and sharded PostgreSQL. It was at this point that we also introduced Kafka as a queueing layer between the Node.js application servers and Postgres.

If we could go back in time, we probably would have started using Kafka on day one. One of the biggest benefits in adopting Kafka has been the peace of mind that it brings. In an analytics infrastructure, it’s often possible to make data ingestion idempotent.

In Heap’s case, that means that, if anything downstream from Kafka goes down, we won’t lose any data – it’s just going to take a bit longer to get to its destination. We also learned that you want the path between data hitting your servers and your initial persistence layer (in this case, Kafka) to be as short and simple as possible, since that is the surface area where a failure means you can lose customer data. We learned that it’s a very good fit for an analytics tool, since you can handle a huge number of incoming writes with relatively low latency. Kafka also gives you the ability to “replay” the data flow: it’s like a commit log for your whole infrastructure.

#MessageQueue #Databases #FrameworksFullStack

See more
John Kodumal
John Kodumal
CTO at LaunchDarkly · | 14 upvotes · 5729 views
Amazon Kinesis
Amazon EC2
Amazon ElastiCache
Amazon RDS

As we've evolved or added additional infrastructure to our stack, we've biased towards managed services. Most new backing stores are Amazon RDS instances now. We do use self-managed PostgreSQL with TimescaleDB for time-series data—this is made HA with the use of Patroni and Consul.

We also use managed Amazon ElastiCache instances instead of spinning up Amazon EC2 instances to run Redis workloads, as well as shifting to Amazon Kinesis instead of Kafka.

See more
Eric Colson
Eric Colson
Chief Algorithms Officer at Stitch Fix · | 10 upvotes · 4964 views
atStitch Fix
Amazon EC2 Container Service
Apache Spark
Amazon S3

The algorithms and data infrastructure at Stitch Fix is housed in #AWS. Data acquisition is split between events flowing through Kafka, and periodic snapshots of PostgreSQL DBs. We store data in an Amazon S3 based data warehouse. Apache Spark on Yarn is our tool of choice for data movement and #ETL. Because our storage layer (s3) is decoupled from our processing layer, we are able to scale our compute environment very elastically. We have several semi-permanent, autoscaling Yarn clusters running to serve our data processing needs. While the bulk of our compute infrastructure is dedicated to algorithmic processing, we also implemented Presto for adhoc queries and dashboards.

Beyond data movement and ETL, most #ML centric jobs (e.g. model training and execution) run in a similarly elastic environment as containers running Python and R code on Amazon EC2 Container Service clusters. The execution of batch jobs on top of ECS is managed by Flotilla, a service we built in house and open sourced (see https://github.com/stitchfix/flotilla-os).

At Stitch Fix, algorithmic integrations are pervasive across the business. We have dozens of data products actively integrated systems. That requires serving layer that is robust, agile, flexible, and allows for self-service. Models produced on Flotilla are packaged for deployment in production using Khan, another framework we've developed internally. Khan provides our data scientists the ability to quickly productionize those models they've developed with open source frameworks in Python 3 (e.g. PyTorch, sklearn), by automatically packaging them as Docker containers and deploying to Amazon ECS. This provides our data scientist a one-click method of getting from their algorithms to production. We then integrate those deployments into a service mesh, which allows us to A/B test various implementations in our product.

For more info:

#DataScience #DataStack #Data

See more
Adam Rabinovitch
Adam Rabinovitch
Senior Technical Recruiter & Engineering Evangelist at Beamery · | 3 upvotes · 86067 views

Beamery runs a #microservices architecture in the backend on top of Google Cloud with Kubernetes There are a 100+ different microservice split between Node.js and Go . Data flows between the microservices over REST and gRPC and passes through Kafka RabbitMQ as a message bus. Beamery stores data in MongoDB with near-realtime replication to Elasticsearch . In addition, Beamery uses Redis for various memory-optimized tasks.

See more
Conor Myhrvold
Conor Myhrvold
Tech Brand Mgr, Office of CTO at Uber · | 3 upvotes · 50681 views
atUber Technologies
Kafka Manager
Apache Spark

Why we built Marmaray, an open source generic data ingestion and dispersal framework and library for Apache Hadoop :

Built and designed by our Hadoop Platform team, Marmaray is a plug-in-based framework built on top of the Hadoop ecosystem. Users can add support to ingest data from any source and disperse to any sink leveraging the use of Apache Spark . The name, Marmaray, comes from a tunnel in Turkey connecting Europe and Asia. Similarly, we envisioned Marmaray within Uber as a pipeline connecting data from any source to any sink depending on customer preference:


(Direct GitHub repo: https://github.com/uber/marmaray Kafka Kafka Manager )

See more

Kafka's Features

  • Written at LinkedIn in Scala
  • Used by LinkedIn to offload processing of all page and other views
  • Defaults to using persistence, uses OS disk cache for hot data (has higher throughput then any of the above having persistence enabled)
  • Supports both on-line as off-line processing

Kafka's alternatives

  • RabbitMQ - RabbitMQ is a messaging broker - an intermediary for messaging
  • Amazon SQS - Fully managed message queuing service
  • Celery - Distributed task queue
  • ActiveMQ - A message broker written in Java together with a full JMS client
  • ZeroMQ - Fast, lightweight messaging library that allows you to design complex communication system without much effort

See all alternatives to Kafka