
Kafka

Distributed, fault-tolerant, high-throughput pub-sub messaging system

What is Kafka?

Kafka is a distributed, partitioned, replicated commit log service. It provides the functionality of a messaging system, but with a unique design.
Kafka is a tool in the Message Queue category of a tech stack.
Kafka is an open source tool with 24.4K GitHub stars and 12.4K GitHub forks. Here's a link to Kafka's open source repository on GitHub.
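
As a quick orientation to the messaging side, publishing a record takes only a few calls with the official Java producer client. This is a minimal sketch; the broker address, topic name, key, and value are assumptions for illustration:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class HelloKafkaProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // assumed broker address
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Each record is appended to a partition of the "events" topic (illustrative name);
            // the key decides which partition, so records with the same key stay ordered.
            producer.send(new ProducerRecord<>("events", "user-42", "signed_up"));
        }
    }
}
```

On the broker side, that record is appended to a partition of the topic's commit log and replicated across brokers, which is where the durability and fault tolerance come from.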

Who uses Kafka?

Companies
1440 companies reportedly use Kafka in their tech stacks, including Uber, Shopify, and Spotify.

Developers
18735 developers on StackShare have stated that they use Kafka.

Kafka Integrations

Datadog, Presto, Apache Flink, Couchbase, and Databricks are some of the popular tools that integrate with Kafka. Here's a list of all 90 tools that integrate with Kafka.
Pros of Kafka
  • 126 High-throughput
  • 119 Distributed
  • 91 Scalable
  • 85 High-performance
  • 65 Publish-subscribe
  • 37 Durable
  • 19 Simple to use
  • 18 Open source
  • 11 Written in Scala and Java; runs on the JVM
  • 8 Message broker + streaming system
  • 4 KSQL
  • 4 Robust
  • 4 Avro schema integration
  • 3 Supports multiple clients
  • 2 Partitioned, replayable log
  • 1 Flexible
  • 1 Extremely good parallelism constructs
  • 1 Fun
  • 1 Simple publisher / multi-subscriber model (see the sketch below)
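
To make the publish-subscribe and partitioned, replayable log points above concrete: each consumer group keeps its own offsets over the same log, so a new group can replay everything from the beginning without disturbing existing readers. A minimal consumer sketch with the official Java client (broker address, topic, and group id are illustrative):

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class HelloKafkaConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // assumed broker address
        props.put("group.id", "analytics");                 // consumers sharing this id split the partitions
        props.put("auto.offset.reset", "earliest");         // a brand-new group replays the log from the start
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("events"));           // illustrative topic name
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("%s -> %s%n", record.key(), record.value());
                }
            }
        }
    }
}
```
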
Decisions about Kafka

Here are some stack decisions, common use cases and reviews by companies and developers who chose Kafka in their tech stack.

Needs advice on Kafka and RabbitMQ

I want to collect the dependency data that Java applications declare in their Maven builds, gathered through CI/CD tools. I'd like to know how to choose the collection technology, and what the pros and cons are between Kafka and RabbitMQ.

Thanks!

Needs advice on Druid, Kafka, and Apache Spark

My process is like this: I would get data once a month, either from Google BigQuery or as parquet files from Azure Blob Storage. I have a script that does some cleaning and then stores the result as partitioned parquet files because the following process cannot handle loading all data to memory.

The next step performs a heavy computation in a parallel fashion (per partition) and stores 3 intermediate versions as parquet files: two are used for statistics, and the third is filtered to create the final files.

I build a report based on the two statistics files in a Jupyter notebook and convert it to HTML.

  • Everything is done with vanilla Python and Pandas.
  • Sometimes I may get the data in a different format.
  • The cloud service is Microsoft Azure.

What I'm considering is the following:

Ingest the data with Kafka or with native Python, do the first processing, and store the result in Druid; the second processing step would be done with Apache Spark, reading the data from Apache Druid.

The intermediate states could be stored in Druid too, and visualization would be done with Apache Superset.
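
If Kafka does become the entry point, one option worth sketching is having Spark read the topic directly with Structured Streaming rather than only via Druid. A minimal sketch, assuming a `monthly-data` topic and a local broker (both names are made up here); it needs the spark-sql-kafka package on the classpath:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class KafkaToSparkSketch {
    public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession.builder()
                .appName("monthly-data-processing")                  // hypothetical app name
                .getOrCreate();

        // Read the raw records that the ingestion step published to Kafka.
        Dataset<Row> raw = spark.readStream()
                .format("kafka")
                .option("kafka.bootstrap.servers", "localhost:9092") // assumed broker address
                .option("subscribe", "monthly-data")                 // hypothetical topic name
                .load();

        // Kafka delivers key/value as binary; cast the value to a string for downstream parsing.
        Dataset<Row> values = raw.selectExpr("CAST(value AS STRING) AS json");

        // Placeholder sink: print to the console; a real job would parse the records
        // and write partitioned parquet or push aggregates onward.
        values.writeStream()
                .format("console")
                .outputMode("append")
                .start()
                .awaitTermination();
    }
}
```
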

Waheed Khan, Associate Java Developer at txtsol
Needs advice on Java, Kafka, and Spring Boot

Hi all, I'm working on a project where I have to implement messaging queues. I just want to hear about your personal experience with these queues and which one is best (RabbitMQ or Kafka).

Thanks
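
For what it's worth, if the choice lands on Kafka, Spring Boot's Spring for Apache Kafka support keeps the queue code small. A minimal sketch, assuming Spring Boot auto-configuration and an illustrative `orders` topic and group id:

```java
import org.springframework.kafka.annotation.KafkaListener;
import org.springframework.kafka.core.KafkaTemplate;
import org.springframework.stereotype.Component;

// Minimal sketch of sending and receiving with Spring for Apache Kafka.
// Topic and group id are illustrative; Spring Boot auto-configures the
// KafkaTemplate and listener container from spring.kafka.* properties.
@Component
public class OrderMessaging {

    private final KafkaTemplate<String, String> template;

    public OrderMessaging(KafkaTemplate<String, String> template) {
        this.template = template;
    }

    public void publish(String orderJson) {
        // Fire-and-forget send; the returned future can be inspected for acks.
        template.send("orders", orderJson);
    }

    @KafkaListener(topics = "orders", groupId = "order-service")
    public void consume(String orderJson) {
        // Each consumer group gets its own copy of the stream.
        System.out.println("received: " + orderJson);
    }
}
```
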

Needs advice on Airflow and Kafka

We're looking to do a project for a company that has incoming data from 2 sources, namely MongoDB and MySQL. We need to combine the data from these 2 sources and push it to PostgreSQL in real time, ideally about 600,000 records per day. Which tool would be better for this use case: Airflow or Kafka?
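
If Kafka is chosen, a common shape for this is change-data-capture from MongoDB and MySQL into Kafka topics, then a sink that writes the merged stream into PostgreSQL. Below is a minimal hand-rolled sink sketch with the Java client and JDBC; the topic name, table, and connection details are illustrative assumptions, and the PostgreSQL JDBC driver must be on the classpath:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class PostgresSinkSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");             // assumed broker
        props.put("group.id", "postgres-sink");                       // illustrative group id
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
             Connection db = DriverManager.getConnection(
                     "jdbc:postgresql://localhost:5432/app", "app", "secret")) { // illustrative DSN

            consumer.subscribe(List.of("combined-events"));            // hypothetical merged topic
            PreparedStatement insert =
                    db.prepareStatement("INSERT INTO events(payload) VALUES (?::jsonb)");

            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    insert.setString(1, record.value());
                    insert.addBatch();
                }
                insert.executeBatch();  // batch the writes; 600k records/day is a modest load for Kafka
            }
        }
    }
}
```

In practice, Kafka Connect source connectors for MongoDB and MySQL (for example Debezium) plus a JDBC sink connector can replace a loop like this with configuration rather than code.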

Needs advice on BlazeMeter, Gatling, and k6

Kindly suggest the best tool for generating a 10M+ concurrent user load. The tool must support MQTT traffic, REST APIs, interfaces such as Kafka, WebSockets, persistent HTTP connections, and various auth types, so that we can assess its support and coverage.

The tool should also integrate with CI pipelines like Azure Pipelines, GitHub, and Jenkins.

Needs advice on Confluent, Kafka Streams, and KSQL

I have recently started using Confluent/Kafka Cloud. We want to do some stream processing. As I was going through Kafka I came across Kafka Streams and KSQL. Both seem to be a good fit for stream processing, but I could not figure out which one should be used or whether one has any advantage over the other. We will be using a Confluent/Kafka managed cloud instance. In the near future, our producers and consumers will run on premises and will interact with Confluent Cloud.

Also, Confluent Cloud Kafka has a primitive interface; is there a better UI for managing a Kafka cloud cluster?
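
For a rough feel of the trade-off: Kafka Streams is a Java library you embed and deploy in your own application (which fits producers and consumers running on premises against Confluent Cloud), while KSQL/ksqlDB expresses a similar topology as SQL running on a separate server you host or consume as a managed service. A minimal Streams sketch, with a roughly equivalent KSQL statement shown as a comment; topic names and the filter condition are made up for illustration:

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class StreamsVsKsqlSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "orders-filter");      // illustrative app id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");  // assumed broker
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> orders = builder.stream("orders");
        // Keep only "large" orders and write them to a second topic.
        orders.filter((key, value) -> value.contains("\"large\":true"))
              .to("large-orders");

        // Roughly the same pipeline in KSQL/ksqlDB would be a single statement, e.g.:
        //   CREATE STREAM large_orders AS SELECT * FROM orders WHERE large = true;

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```
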


Blog Posts

  • Dec 22 2021 at 5:41AM, Pinterest (MySQL, Kafka, Druid, +3)
  • Amazon S3, Kafka, Zookeeper, +5
  • Mar 24 2021 at 12:57PM, Pinterest (Git, Jenkins, Kafka, +7)

Jobs that mention Kafka as a desired skillset

  • CBRE: Richardson, Texas, United States of America
  • Pinterest: San Francisco, CA, US

Kafka's Features

  • Written at LinkedIn in Scala
  • Used by LinkedIn to offload processing of all page views and other views
  • Defaults to persistence and uses the OS disk cache for hot data (giving higher throughput than comparable messaging systems with persistence enabled)
  • Supports both online and offline processing

Kafka Alternatives & Comparisons

What are some alternatives to Kafka?
ActiveMQ
Apache ActiveMQ is fast, supports many cross-language clients and protocols, and comes with easy-to-use Enterprise Integration Patterns and many advanced features while fully supporting JMS 1.1 and J2EE 1.4. Apache ActiveMQ is released under the Apache 2.0 License.
RabbitMQ
RabbitMQ gives your applications a common platform to send and receive messages, and your messages a safe place to live until received.
Amazon Kinesis
Amazon Kinesis can collect and process hundreds of gigabytes of data per second from hundreds of thousands of sources, allowing you to easily write applications that process information in real-time, from sources such as web site click-streams, marketing and financial information, manufacturing instrumentation and social media, and operational logs and metering data.
Apache Spark
Spark is a fast and general processing engine compatible with Hadoop data. It can run in Hadoop clusters through YARN or Spark's standalone mode, and it can process data in HDFS, HBase, Cassandra, Hive, and any Hadoop InputFormat. It is designed to perform both batch processing (similar to MapReduce) and new workloads like streaming, interactive queries, and machine learning.
Akka
Akka is a toolkit and runtime for building highly concurrent, distributed, and resilient message-driven applications on the JVM.

Kafka's Followers
19638 developers follow Kafka to keep up with related blogs and decisions.