Druid vs Presto: What are the differences?

What is Druid? Fast column-oriented distributed data store. Druid is a distributed, column-oriented, real-time analytics data store that is commonly used to power exploratory dashboards in multi-tenant environments. Druid excels as a data warehousing solution for fast aggregate queries on petabyte-sized data sets. Druid supports a variety of flexible filters, exact calculations, approximate algorithms, and other useful calculations.
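
To make the distinction between exact calculations and approximate algorithms concrete, here is a minimal Python sketch that counts distinct values both ways. It is an illustration only and not Druid's implementation: the function names, the k-minimum-values (KMV) technique, and the default `k=256` are assumptions chosen for brevity.

```python
import hashlib
import heapq


def _hash01(value):
    """Map a value to a pseudo-random float in [0, 1) via SHA-1."""
    digest = hashlib.sha1(str(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def exact_distinct(values):
    """Exact distinct count: memory grows with the number of unique values."""
    return len(set(values))


def approx_distinct(values, k=256):
    """Approximate distinct count with a k-minimum-values (KMV) sketch.

    Only the k smallest hash values are kept, so memory stays bounded.
    """
    heap = []        # max-heap (negated hashes) holding the k smallest hashes
    members = set()  # hashes currently in the heap, used to skip duplicates
    for v in values:
        h = _hash01(v)
        if h in members:
            continue
        if len(heap) < k:
            heapq.heappush(heap, -h)
            members.add(h)
        elif h < -heap[0]:
            # New hash is smaller than the current k-th smallest: swap it in.
            evicted = -heapq.heappushpop(heap, -h)
            members.discard(evicted)
            members.add(h)
    if len(heap) < k:
        return len(heap)  # fewer than k distinct hashes seen: count is exact
    # Standard KMV estimate: (k - 1) divided by the k-th smallest hash.
    return int((k - 1) / -heap[0])


events = [i % 5000 for i in range(20000)]
print("exact:", exact_distinct(events))
print("approx:", approx_distinct(events))
```

The approximate version trades a small, bounded error for fixed memory use, which is the general reason analytics stores offer both kinds of calculation.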

What is Presto? Distributed SQL Query Engine for Big Data. Presto is an open source distributed SQL query engine for running interactive analytic queries against data sources of all sizes ranging from gigabytes to petabytes.

Druid and Presto can be categorized as "Big Data" tools.

"Real Time Aggregations" is the primary reason developers cite for choosing Druid over its competitors, whereas "Works directly on files in s3 (no ETL)" is the key factor cited for picking Presto.

Druid and Presto are both open source tools. With 9.29K GitHub stars and 3.15K forks, Presto appears to be more popular than Druid, which has 8.31K stars and 2.08K forks.

According to the StackShare community, Druid has broader approval: it is mentioned in 24 company stacks and 12 developer stacks, compared with Presto, which is listed in 19 company stacks and 11 developer stacks.

      What are some alternatives to Druid and Presto?
      HBase
      Apache HBase is an open-source, distributed, versioned, column-oriented store modeled after Google's Bigtable: A Distributed Storage System for Structured Data by Chang et al. Just as Bigtable leverages the distributed data storage provided by the Google File System, HBase provides Bigtable-like capabilities on top of Apache Hadoop.
      MongoDB
      MongoDB stores data in JSON-like documents that can vary in structure, offering a dynamic, flexible schema. MongoDB was also designed for high availability and scalability, with built-in replication and auto-sharding.
      Cassandra
      Partitioning means that Cassandra can distribute your data across multiple machines in an application-transparent manner. Cassandra will automatically repartition as machines are added and removed from the cluster. Row store means that, like relational databases, Cassandra organizes data by rows and columns. The Cassandra Query Language (CQL) is a close relative of SQL.
      Prometheus
      Prometheus is a systems and service monitoring system. It collects metrics from configured targets at given intervals, evaluates rule expressions, displays the results, and can trigger alerts if some condition is observed to be true.
      Elasticsearch
      Elasticsearch is a distributed, RESTful search and analytics engine capable of storing data and searching it in near real time. Elasticsearch, Kibana, Beats and Logstash are the Elastic Stack (sometimes called the ELK Stack).
      Decisions about Druid and Presto
      StackShare Editors
      Presto, Apache Spark, Hadoop

      Around 2015, the growing use of Uber’s data exposed limitations in the ETL and Vertica-centric setup, not to mention the increasing costs. “As our company grew, scaling our data warehouse became increasingly expensive. To cut down on costs, we started deleting older, obsolete data to free up space for new data.”

      To overcome these challenges, Uber rebuilt their big data platform around Hadoop. “More specifically, we introduced a Hadoop data lake where all raw data was ingested from different online data stores only once and with no transformation during ingestion.”

      “In order for users to access data in Hadoop, we introduced Presto to enable interactive ad hoc user queries, Apache Spark to facilitate programmatic access to raw data (in both SQL and non-SQL formats), and Apache Hive to serve as the workhorse for extremely large queries.”

      StackShare Editors
      Presto, Apache Spark, Hadoop

      To improve platform scalability and efficiency, Uber transitioned from JSON to Parquet, and built a central schema service to manage schemas and integrate different client libraries.

      While the first generation big data platform was vulnerable to upstream data format changes, “ad hoc data ingestions jobs were replaced with a standard platform to transfer all source data in its original, nested format into the Hadoop data lake.”

      These platform changes helped Uber address the scaling challenges it was facing around that time: “On a daily basis, there were tens of terabytes of new data added to our data lake, and our Big Data platform grew to over 10,000 vcores with over 100,000 running batch jobs on any given day.”
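
As a rough illustration of why a central schema service protects a platform from upstream format changes, the sketch below validates incoming records against a registered schema before ingestion. Everything here is hypothetical: the `SchemaRegistry` class, the `trips` dataset, and its fields are invented for the example and do not describe Uber's actual service.

```python
class SchemaRegistry:
    """Hypothetical in-memory stand-in for a central schema service."""

    def __init__(self):
        self._schemas = {}

    def register(self, dataset, schema):
        # schema maps each field name to the Python type it must have
        self._schemas[dataset] = dict(schema)

    def validate(self, dataset, record):
        """Return a list of problems; an empty list means the record conforms."""
        schema = self._schemas[dataset]
        problems = []
        for field, expected in schema.items():
            if field not in record:
                problems.append(f"missing field: {field}")
            elif not isinstance(record[field], expected):
                actual = type(record[field]).__name__
                problems.append(
                    f"{field}: expected {expected.__name__}, got {actual}"
                )
        for field in record:
            if field not in schema:
                problems.append(f"unexpected field: {field}")
        return problems


registry = SchemaRegistry()
registry.register("trips", {"trip_id": str, "fare": float})

# A conforming record passes; an upstream change (fare sent as text) is caught.
print(registry.validate("trips", {"trip_id": "t1", "fare": 12.5}))
print(registry.validate("trips", {"trip_id": "t1", "fare": "12.5"}))
```

Checking records at the boundary means a format change upstream surfaces as an explicit validation problem rather than as a broken downstream job.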

      StackShare Editors
      Presto, Apache Spark, Scala, MySQL, Kafka

      Slack’s data team works to “provide an ecosystem to help people in the company quickly and easily answer questions about usage, so they can make better and data informed decisions.” To achieve that goal, they rely on a complex data pipeline.

      An in-house tool called Sqooper scrapes MySQL backups and pipes them to S3. Job queue and log data is sent to Kafka, then persisted to S3 using an open source tool called Secor, which was created by Pinterest.

      For compute, Amazon’s Elastic MapReduce (EMR) creates clusters preconfigured for Presto, Hive, and Spark.

      Presto is then used for ad-hoc questions, validating data assumptions, exploring smaller datasets, and creating visualizations for some internal tools. Hive is used for larger data sets or longer time series data, and Spark allows teams to write efficient and robust batch and aggregation jobs. Most of the Spark pipeline is written in Scala.

      Thrift binds all of these engines together with a typed schema and structured data.

      Finally, the Hive Metastore serves as the ground truth for all data and its schema.

      StackShare Editors
      Apache Thrift, Kotlin, Presto, HHVM (HipHop Virtual Machine), gRPC, Kubernetes, Apache Spark, Airflow, Terraform, Hadoop, Swift, Hack, Memcached, Consul, Chef, Prometheus

      Since the beginning, Cal Henderson has been the CTO of Slack. Earlier this year, he commented on a Quora question summarizing their current stack.

      Apps
      • Web: a mix of JavaScript/ES6 and React.
      • Desktop: Electron is used to ship it as a desktop application.
      • Android: a mix of Java and Kotlin.
      • iOS: written in a mix of Objective-C and Swift.
      Backend
      • The core application and the API are written in PHP/Hack and run on HHVM.
      • The data is stored in MySQL using Vitess.
      • Caching is done using Memcached and MCRouter.
      • The search service relies on SolrCloud, with various Java services.
      • The messaging system uses WebSockets with many services in Java and Go.
      • Load balancing is done using HAProxy with Consul for configuration.
      • Most services talk to each other over gRPC, with some Thrift and JSON-over-HTTP.
      • Voice and video calling service was built in Elixir.
      Data warehouse
      • Built using open source tools including Presto, Spark, Airflow, Hadoop and Kafka.
      Eric Colson
      Chief Algorithms Officer at Stitch Fix
      Amazon EC2 Container Service, Docker, PyTorch, R, Python, Presto, Apache Spark, Amazon S3, PostgreSQL, Kafka
      #Data #DataStack #DataScience #ML #Etl #AWS

      The algorithms and data infrastructure at Stitch Fix is housed in #AWS. Data acquisition is split between events flowing through Kafka and periodic snapshots of PostgreSQL DBs. We store data in an Amazon S3 based data warehouse. Apache Spark on YARN is our tool of choice for data movement and #ETL. Because our storage layer (S3) is decoupled from our processing layer, we are able to scale our compute environment very elastically. We have several semi-permanent, autoscaling YARN clusters running to serve our data processing needs. While the bulk of our compute infrastructure is dedicated to algorithmic processing, we also implemented Presto for ad hoc queries and dashboards.

      Beyond data movement and ETL, most #ML centric jobs (e.g. model training and execution) run in a similarly elastic environment as containers running Python and R code on Amazon EC2 Container Service clusters. The execution of batch jobs on top of ECS is managed by Flotilla, a service we built in house and open sourced (see https://github.com/stitchfix/flotilla-os).

      At Stitch Fix, algorithmic integrations are pervasive across the business. We have dozens of data products actively integrated into systems. That requires a serving layer that is robust, agile, flexible, and allows for self-service. Models produced on Flotilla are packaged for deployment in production using Khan, another framework we've developed internally. Khan provides our data scientists the ability to quickly productionize the models they've developed with open source frameworks in Python 3 (e.g. PyTorch, sklearn), by automatically packaging them as Docker containers and deploying to Amazon ECS. This provides our data scientists a one-click method of getting from their algorithms to production. We then integrate those deployments into a service mesh, which allows us to A/B test various implementations in our product.
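
One common way to split traffic between implementations in an A/B test is deterministic hash-based assignment, sketched below in Python. This is an assumption about the general technique, not a description of Stitch Fix's service mesh: the `assign_variant` function, the experiment name, and the user IDs are all hypothetical.

```python
import hashlib


def assign_variant(user_id, experiment, variants=("A", "B")):
    """Deterministically map a user to one variant of an experiment.

    Hashing the experiment name together with the user ID keeps each user's
    assignment stable across requests and independent across experiments.
    """
    key = f"{experiment}:{user_id}".encode("utf-8")
    bucket = int.from_bytes(hashlib.sha256(key).digest()[:8], "big")
    return variants[bucket % len(variants)]


# The same user always lands in the same variant for a given experiment.
print(assign_variant("user-42", "ranking-model-v2"))
print(assign_variant("user-42", "ranking-model-v2"))
```

Because the assignment depends only on the inputs, no per-user state has to be stored to keep a user's experience consistent.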
