Druid vs Apache Spark: What are the differences?
What is Druid? Fast column-oriented distributed data store. Druid is a distributed, column-oriented, real-time analytics data store that is commonly used to power exploratory dashboards in multi-tenant environments. Druid excels as a data warehousing solution for fast aggregate queries on petabyte-sized data sets. It supports flexible filters, exact calculations, approximate algorithms, and other useful calculations.
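To make the combination of filters, exact calculations, and approximate algorithms concrete, here is a minimal sketch of a Druid native timeseries query built as a Python dictionary. The datasource name "events" and the columns "country", "clicks", and "user_id" are hypothetical and only serve as illustration. The query types, filter type, and aggregator types are taken from Druid's native JSON query language, but check them against the documentation for your Druid version before relying on them.

```python
import json


def build_timeseries_query(datasource, interval, country):
    """Build a Druid native timeseries query as a plain dict.

    All datasource and column names passed in or used here are
    hypothetical examples, not part of any real schema.
    """
    return {
        "queryType": "timeseries",
        "dataSource": datasource,
        "granularity": "hour",
        "intervals": [interval],
        # Filter: keep only rows whose "country" dimension matches.
        "filter": {"type": "selector", "dimension": "country", "value": country},
        "aggregations": [
            # Exact calculations: row count and sum of a numeric column.
            {"type": "count", "name": "rows"},
            {"type": "longSum", "name": "clicks", "fieldName": "clicks"},
            # Approximate calculation: estimated number of distinct users.
            {"type": "cardinality", "name": "approx_users", "fields": ["user_id"]},
        ],
    }


query = build_timeseries_query("events", "2024-01-01/2024-01-02", "US")
payload = json.dumps(query, indent=2)
print(payload)
```

The resulting JSON would typically be sent as an HTTP POST body to a Druid Broker's native query endpoint. The sketch stops at building the payload so that it runs without a cluster.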
What is Apache Spark? Fast and general engine for large-scale data processing. Spark is a fast and general processing engine compatible with Hadoop data. It can run in Hadoop clusters through YARN or Spark's standalone mode, and it can process data in HDFS, HBase, Cassandra, Hive, and any Hadoop InputFormat. It is designed to perform both batch processing (similar to MapReduce) and new workloads like streaming, interactive queries, and machine learning.
Druid and Apache Spark are both primarily classified as "Big Data" tools.
"Real Time Aggregations" is the primary reason developers cite for choosing Druid over its competitors, whereas "Open-source" is the key factor they cite for picking Apache Spark.
Druid and Apache Spark are both open source tools. With 22.5K stars and 19.4K forks on GitHub, Apache Spark appears to be more popular than Druid, which has 8.31K stars and 2.08K forks.
Uber Technologies, Slack, and Shopify are some of the popular companies that use Apache Spark, whereas Druid is used by Airbnb, Instacart, and Dial Once. Apache Spark has broader adoption: it is mentioned in 266 company stacks and 112 developer stacks, compared to Druid, which is listed in 24 company stacks and 12 developer stacks.