Apache Spark vs Druid vs Presto: What are the differences?
Apache Spark, Druid, and Presto are all popular tools for big data processing. Each has its own strengths and weaknesses that make it suited to different use cases.
Data Processing Paradigm: Apache Spark is a general-purpose distributed processing framework that provides high-level APIs in programming languages like Java, Scala, and Python. It is especially useful for iterative algorithms and interactive data mining. Druid, on the other hand, is a high-performance, real-time analytics database specifically designed for fast ad-hoc queries on large datasets. Presto is a distributed SQL query engine optimized for low-latency interactive queries on various data sources.
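To make the contrast concrete, here is a minimal PySpark sketch (assuming the `pyspark` package is installed and a local Spark runtime is available) showing the high-level, general-purpose DataFrame API that Spark exposes:

```python
from pyspark.sql import SparkSession

# Start a local Spark session; in a cluster this would point at YARN, Kubernetes, etc.
spark = SparkSession.builder.appName("paradigm-demo").master("local[*]").getOrCreate()

# Small in-memory dataset for illustration; in practice data would come from HDFS, S3, etc.
events = spark.createDataFrame(
    [("page_view", 3), ("click", 5), ("page_view", 7)],
    ["event_type", "duration_ms"],
)

# A declarative, general-purpose transformation expressed at a high level.
summary = events.groupBy("event_type").avg("duration_ms")
summary.show()

spark.stop()
```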
Data Storage: Apache Spark does not have its own storage capabilities but can read and process data from various data sources like HDFS, Cassandra, and S3. Druid is designed to store and query large volumes of data in a column-oriented, distributed architecture, optimized for fast query performance. Presto supports querying data where it resides, allowing users to query data from multiple sources like HDFS, MySQL, and Cassandra without the need to move or replicate the data.
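As an illustration of Presto querying data where it resides, the following sketch uses the `presto-python-client` package; the coordinator address, catalogs, schemas, and table names are hypothetical placeholders:

```python
import prestodb  # pip install presto-python-client

# Connect to a Presto coordinator (address and credentials are assumptions).
conn = prestodb.dbapi.connect(
    host="presto-coordinator.example.com",
    port=8080,
    user="analyst",
    catalog="hive",
    schema="web",
)
cur = conn.cursor()

# Join data in place across two sources (Hive/HDFS and MySQL) without moving or replicating it.
cur.execute("""
    SELECT u.country, count(*) AS views
    FROM hive.web.page_views v
    JOIN mysql.crm.users u ON v.user_id = u.id
    GROUP BY u.country
""")
for row in cur.fetchall():
    print(row)
```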
Query Speed and Performance: Apache Spark is known for its in-memory computation capabilities, which can significantly improve processing speed for iterative algorithms and machine learning tasks. Druid is optimized for sub-second query performance, making it ideal for real-time analytics and dashboard applications. Presto is designed for low-latency interactive queries on massive datasets, making it suitable for scenarios where query speed is crucial.
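The sketch below shows the style of aggregation Druid is built to answer in sub-second time, submitted through Druid's SQL-over-HTTP API with the `requests` library; the broker URL and the `site_events` datasource are assumptions:

```python
import requests

# Default Druid broker port is 8082; the hostname here is a placeholder.
DRUID_SQL_ENDPOINT = "http://druid-broker.example.com:8082/druid/v2/sql"

query = {
    "query": """
        SELECT channel, COUNT(*) AS events
        FROM site_events
        WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '1' HOUR
        GROUP BY channel
        ORDER BY events DESC
        LIMIT 10
    """
}

# On a well-partitioned datasource, Druid typically answers this kind of
# rollup query fast enough to back an interactive dashboard.
response = requests.post(DRUID_SQL_ENDPOINT, json=query, timeout=10)
response.raise_for_status()
for row in response.json():
    print(row)
```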
Scalability and Fault Tolerance: Apache Spark provides fault tolerance through its resilient distributed dataset (RDD) abstraction, which enables data recovery in case of node failures. Druid is highly scalable and can handle petabytes of data spread across a cluster of machines, providing horizontal scalability for growing data volumes. Presto is horizontally scalable and fault-tolerant, allowing users to add more nodes to the cluster for increased processing power and to handle failures gracefully.
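A small PySpark sketch (local mode assumed) illustrates the lineage behind RDD fault tolerance: every transformation is recorded, so a lost partition can be recomputed from its parents rather than restored from a backup:

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "rdd-lineage-demo")

# Build an RDD and apply transformations; each step is recorded in the lineage graph.
numbers = sc.parallelize(range(1_000_000), numSlices=8)
squares = numbers.map(lambda x: x * x)
evens = squares.filter(lambda x: x % 2 == 0)

# toDebugString() prints the lineage Spark would replay to rebuild a lost partition.
lineage = evens.toDebugString()
print(lineage.decode("utf-8") if isinstance(lineage, bytes) else lineage)
print("count:", evens.count())

sc.stop()
```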
Use Cases: Apache Spark is widely used for batch processing, real-time streaming analytics, machine learning, and graph processing. Druid is commonly used for real-time analytics, time series data analysis, and event data analysis. Presto is popular for ad-hoc queries, interactive analytics, and business intelligence applications where fast query response times are critical.
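For the real-time streaming use case, here is a hedged Structured Streaming sketch using Spark's built-in `rate` test source; a production job would typically read from Kafka or another real source:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import window

spark = SparkSession.builder.appName("streaming-demo").master("local[*]").getOrCreate()

# The "rate" source generates synthetic rows for testing.
stream = spark.readStream.format("rate").option("rowsPerSecond", 10).load()

# Count incoming rows in 10-second windows, a typical streaming aggregation.
counts = stream.groupBy(window(stream.timestamp, "10 seconds")).count()

query = (
    counts.writeStream
    .outputMode("complete")
    .format("console")
    .start()
)
query.awaitTermination(30)  # run briefly for the sketch, then shut down
query.stop()
spark.stop()
```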
In summary, Apache Spark, Druid, and Presto each offer distinct capabilities for big data processing: Apache Spark focuses on general-purpose distributed processing, Druid on real-time analytics, and Presto on interactive querying.
Pros of Druid
- Real-time aggregations
- Batch and real-time ingestion
- OLAP
- OLAP + OLTP
- Combining stream and historical analytics
- OLTP
Pros of Presto
- Works directly on files in S3 (no ETL)
- Open-source
- Join multiple databases
- Scalable
- Gets ready in minutes
- MPP
Pros of Apache Spark
- Open-source
- Fast and flexible
- One platform for every big data problem
- Great for distributed SQL-like applications
- Easy to install and use
- Works well for most data science use cases
- Interactive queries
- Machine learning libraries, real-time streaming
- In-memory computation
Cons of Druid
- Limited SQL support
- Joins are not well supported
- Complexity
Cons of Presto
Cons of Apache Spark
- Speed