What is Presto?

Presto is an open source distributed SQL query engine for running interactive analytic queries against data sources of all sizes ranging from gigabytes to petabytes.
Presto is a tool in the Big Data Tools category of a tech stack.
Presto is an open source tool with 9.3K GitHub stars and 3.2K GitHub forks. Its source repository is hosted on GitHub.

Who uses Presto?

29 companies reportedly use Presto in their tech stacks, including Airbnb, Facebook, and Netflix.

61 developers on StackShare have stated that they use Presto.

Presto Integrations

Cloudera Enterprise, MySQL, Microsoft SQL Server, PostgreSQL, and MongoDB are some of the popular tools that integrate with Presto. In all, 16 tools integrate with Presto.

Presto Reviews

Here are some stack decisions, common use cases and reviews by companies and developers who chose Presto in their tech stack.

Eric Colson
Chief Algorithms Officer at Stitch Fix · 19 upvotes · 91.3K views
Amazon EC2 Container Service
Apache Spark
Amazon S3

The algorithms and data infrastructure at Stitch Fix is housed in #AWS. Data acquisition is split between events flowing through Kafka and periodic snapshots of PostgreSQL DBs. We store data in an Amazon S3 based data warehouse. Apache Spark on YARN is our tool of choice for data movement and #ETL. Because our storage layer (S3) is decoupled from our processing layer, we are able to scale our compute environment very elastically. We have several semi-permanent, autoscaling YARN clusters running to serve our data processing needs. While the bulk of our compute infrastructure is dedicated to algorithmic processing, we also implemented Presto for ad hoc queries and dashboards.

Beyond data movement and ETL, most #ML centric jobs (e.g. model training and execution) run in a similarly elastic environment as containers running Python and R code on Amazon EC2 Container Service clusters. The execution of batch jobs on top of ECS is managed by Flotilla, a service we built in house and open sourced (see https://github.com/stitchfix/flotilla-os).

At Stitch Fix, algorithmic integrations are pervasive across the business. We have dozens of data products actively integrated into systems. That requires a serving layer that is robust, agile, flexible, and allows for self-service. Models produced on Flotilla are packaged for deployment in production using Khan, another framework we've developed internally. Khan provides our data scientists the ability to quickly productionize the models they've developed with open source frameworks in Python 3 (e.g. PyTorch, sklearn), by automatically packaging them as Docker containers and deploying to Amazon ECS. This gives our data scientists a one-click method of getting from their algorithms to production. We then integrate those deployments into a service mesh, which allows us to A/B test various implementations in our product.

For more info:

#DataScience #DataStack #Data

StackShare Editors
at Uber Technologies · 4 upvotes · 1.7K views
Apache Spark

To improve platform scalability and efficiency, Uber transitioned from JSON to Parquet, and built a central schema service to manage schemas and integrate different client libraries.
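The idea of a central schema service can be illustrated with a small sketch. This is not Uber's implementation; the `SchemaRegistry` class, its method names, and the `trip` schema are all hypothetical, and the sketch only shows the general pattern of registering a schema once and validating every incoming record against it.

```python
class SchemaRegistry:
    """Hypothetical in-memory registry mapping a schema name to field types."""

    def __init__(self):
        self._schemas = {}

    def register(self, name, fields):
        # fields maps each field name to the Python type it must have.
        self._schemas[name] = dict(fields)

    def validate(self, name, record):
        """Return a list of problems; an empty list means the record conforms."""
        schema = self._schemas[name]
        errors = []
        for field, expected in schema.items():
            if field not in record:
                errors.append(f"missing field: {field}")
            elif not isinstance(record[field], expected):
                actual = type(record[field]).__name__
                errors.append(f"{field}: expected {expected.__name__}, got {actual}")
        for field in record:
            if field not in schema:
                errors.append(f"unexpected field: {field}")
        return errors


registry = SchemaRegistry()
registry.register("trip", {"trip_id": int, "fare": float})

good = registry.validate("trip", {"trip_id": 7, "fare": 3.5})
bad = registry.validate("trip", {"trip_id": "7", "fare": 3.5, "note": "x"})
print(good)
print(bad)
```

With a typed schema in one place, a producer that changes a field's type or adds an undeclared field is caught at validation time rather than breaking downstream jobs.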

While the first generation big data platform was vulnerable to upstream data format changes, “ad hoc data ingestions jobs were replaced with a standard platform to transfer all source data in its original, nested format into the Hadoop data lake.”

These platform changes helped Uber address the scaling challenges it was facing around that time: “On a daily basis, there were tens of terabytes of new data added to our data lake, and our Big Data platform grew to over 10,000 vcores with over 100,000 running batch jobs on any given day.”

StackShare Editors
at Uber Technologies · 4 upvotes · 1K views

In early 2015, Uber Engineering migrated its business entities from integer identifiers to UUID identifiers as part of an initiative focused on using multiple active data centers. To do that, Uber engineers had to identify foreign key relationships between every table in the data warehouse—a nontrivial task by any accounting.

Uber’s solution was to observe and analyze incoming SQL queries to extract foreign key relationships. For this it built a tool called Queryparser, which it later open sourced.
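As a rough illustration of mining queries for key relationships, the sketch below counts how often pairs of columns are compared for equality across a set of queries. It is a toy, regex-based approximation, not how Queryparser works (Queryparser is a full SQL parser); the function name `candidate_keys`, the sample tables, and the queries are all made up for the example.

```python
import re
from collections import Counter

# Equality between two qualified columns, e.g. "t.city_id = c.id".
JOIN_PREDICATE = re.compile(r"(\w+)\.(\w+)\s*=\s*(\w+)\.(\w+)")

# "FROM trips t" or "JOIN cities AS c"; the lookahead keeps SQL keywords
# such as ON or JOIN from being mistaken for an alias.
TABLE_ALIAS = re.compile(
    r"\b(?:from|join)\s+(\w+)"
    r"(?:\s+(?:as\s+)?"
    r"(?!(?:on|where|join|inner|left|right|full|cross|group|order|limit|using)\b)"
    r"(\w+))?",
    re.IGNORECASE,
)


def candidate_keys(query):
    """Return (table, column) pairs that a query compares for equality."""
    aliases = {}
    for table, alias in TABLE_ALIAS.findall(query):
        aliases[table] = table
        if alias:
            aliases[alias] = table
    pairs = []
    for lt, lc, rt, rc in JOIN_PREDICATE.findall(query):
        left = (aliases.get(lt, lt), lc)
        right = (aliases.get(rt, rt), rc)
        # Sort so that a = b and b = a count as the same relationship.
        pairs.append(tuple(sorted([left, right])))
    return pairs


queries = [
    "SELECT * FROM trips t JOIN cities c ON t.city_id = c.id",
    "SELECT count(*) FROM cities JOIN trips ON cities.id = trips.city_id",
    "SELECT * FROM trips t JOIN drivers AS d ON t.driver_id = d.id",
]
counts = Counter(pair for q in queries for pair in candidate_keys(q))
for pair, n in counts.most_common():
    print(n, pair)
```

Column pairs that are joined frequently across many queries are plausible foreign key candidates. A real tool needs a proper parser to handle subqueries, scoping, and dialect differences, which a regex cannot do reliably.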

Queryparser is written in Haskell, a language that the team wasn’t previously familiar with but that has strong support for language parsing. To help each other get up to speed, engineers started a weekly reading group to discuss Haskell books and documentation.

StackShare Editors
Apache Spark

Slack’s data team works to “provide an ecosystem to help people in the company quickly and easily answer questions about usage, so they can make better and data informed decisions.” To achieve that goal, they rely on a complex data pipeline.

An in-house tool called Sqooper scrapes MySQL backups and pipes them to S3. Job queue and log data are sent to Kafka, then persisted to S3 using an open source tool called Secor, which was created by Pinterest.

For compute, Amazon’s Elastic MapReduce (EMR) creates clusters preconfigured for Presto, Hive, and Spark.

Presto is then used for ad-hoc questions, validating data assumptions, exploring smaller datasets, and creating visualizations for some internal tools. Hive is used for larger data sets or longer time series data, and Spark allows teams to write efficient and robust batch and aggregation jobs. Most of the Spark pipeline is written in Scala.

Thrift binds all of these engines together with a typed schema and structured data.

Finally, the Hive Metastore serves as the ground truth for all data and its schema.

StackShare Editors
at Uber Technologies · 3 upvotes · 1.7K views
Apache Spark

Around 2015, the growing use of Uber’s data exposed limitations in the ETL and Vertica-centric setup, not to mention the increasing costs. “As our company grew, scaling our data warehouse became increasingly expensive. To cut down on costs, we started deleting older, obsolete data to free up space for new data.”

To overcome these challenges, Uber rebuilt their big data platform around Hadoop. “More specifically, we introduced a Hadoop data lake where all raw data was ingested from different online data stores only once and with no transformation during ingestion.”

“In order for users to access data in Hadoop, we introduced Presto to enable interactive ad hoc user queries, Apache Spark to facilitate programmatic access to raw data (in both SQL and non-SQL formats), and Apache Hive to serve as the workhorse for extremely large queries.”

StackShare Editors
at Uber Technologies · 3 upvotes · 844 views

By mid-2016, Uber’s team was running more than one hundred thousand analytic queries daily. To keep up, they decided to redesign their analytics system, leveraging Presto, an open source SQL engine for large datasets, and Parquet, a columnar storage format for Hadoop.

Presto was chosen for a few reasons, including its scalability (according to Uber, it can access over five petabytes of data, and completes more than 90% of queries within 60 seconds).

Uber stores its data in Parquet, a columnar storage format for Hadoop that supports compression and encoding and has ground-up support for nested data sets. Storing data in columns instead of rows removes the need to scan and discard unwanted data in each row, which saves disk space and improves query performance on large datasets.
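The benefit of a columnar layout can be seen in a toy comparison. The sketch below is plain Python, not Parquet; the sample trip records and function names are invented for illustration. It sums one field under both layouts and counts how many stored values each approach has to touch.

```python
# Row layout: each record keeps all of its fields together.
rows = [
    {"trip_id": 1, "city": "SF", "fare": 12.5},
    {"trip_id": 2, "city": "NYC", "fare": 30.0},
    {"trip_id": 3, "city": "SF", "fare": 8.0},
]

# Column layout: one list per field, holding that field for every record.
columns = {name: [row[name] for row in rows] for name in rows[0]}


def total_fare_row_layout(rows):
    """Sum fares by reading whole records, as a row store would."""
    values_read = 0
    total = 0.0
    for row in rows:
        values_read += len(row)  # every field of the record is read
        total += row["fare"]
    return total, values_read


def total_fare_column_layout(columns):
    """Sum fares by reading only the fare column."""
    fares = columns["fare"]
    return sum(fares), len(fares)


row_total, row_reads = total_fare_row_layout(rows)
col_total, col_reads = total_fare_column_layout(columns)
print(row_total, row_reads)
print(col_total, col_reads)
```

Both layouts give the same total, but the row layout reads every field of every record while the column layout reads only the values it needs. Storing similar values next to each other is also what makes column-level compression and encoding effective.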


Presto Alternatives & Comparisons

What are some alternatives to Presto?
Apache Spark
Spark is a fast and general processing engine compatible with Hadoop data. It can run in Hadoop clusters through YARN or Spark's standalone mode, and it can process data in HDFS, HBase, Cassandra, Hive, and any Hadoop InputFormat. It is designed to perform both batch processing (similar to MapReduce) and new workloads like streaming, interactive queries, and machine learning.
Apache Flink
Apache Flink is an open source system for fast and versatile data analytics in clusters. Flink supports batch and streaming analytics, in one system. Analytical programs can be written in concise and elegant APIs in Java and Scala.
Amazon Athena
Amazon Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL. Athena is serverless, so there is no infrastructure to manage, and you pay only for the queries that you run.
Druid
Druid is a distributed, column-oriented, real-time analytics data store that is commonly used to power exploratory dashboards in multi-tenant environments. Druid excels as a data warehousing solution for fast aggregate queries on petabyte sized data sets. Druid supports a variety of flexible filters, exact calculations, approximate algorithms, and other useful calculations.
Apache Hive
Hive facilitates reading, writing, and managing large datasets residing in distributed storage using SQL. Structure can be projected onto data already in storage.

Presto's Followers
155 developers follow Presto to keep up with related blogs and decisions.