Alternatives to Presto

Apache Spark, Stan, Amazon Athena, Apache Flink, and Apache Hive are the most popular alternatives and competitors to Presto.

What is Presto and what are its top alternatives?

Presto is an open source distributed SQL query engine for running interactive analytic queries against data sources of all sizes ranging from gigabytes to petabytes.
Presto is a tool in the Big Data Tools category of a tech stack.
Presto is an open source tool with 10K GitHub stars and 3.4K GitHub forks.

Presto alternatives & related posts

related Apache Spark posts

Eric Colson
Chief Algorithms Officer at Stitch Fix | 19 upvotes 352.4K views
Tools: Kafka, PostgreSQL, Amazon S3, Apache Spark, Presto, Python, R, PyTorch, Docker, Amazon EC2 Container Service
#AWS #ETL #ML #DataScience #DataStack #Data

The algorithms and data infrastructure at Stitch Fix is housed in #AWS. Data acquisition is split between events flowing through Kafka and periodic snapshots of PostgreSQL DBs. We store data in an Amazon S3 based data warehouse. Apache Spark on YARN is our tool of choice for data movement and #ETL. Because our storage layer (S3) is decoupled from our processing layer, we are able to scale our compute environment very elastically. We have several semi-permanent, autoscaling YARN clusters running to serve our data processing needs. While the bulk of our compute infrastructure is dedicated to algorithmic processing, we also implemented Presto for ad hoc queries and dashboards.

Beyond data movement and ETL, most #ML centric jobs (e.g. model training and execution) run in a similarly elastic environment as containers running Python and R code on Amazon EC2 Container Service clusters. The execution of batch jobs on top of ECS is managed by Flotilla, a service we built in house and open sourced (see https://github.com/stitchfix/flotilla-os).

At Stitch Fix, algorithmic integrations are pervasive across the business. We have dozens of data products actively integrated into our systems. That requires a serving layer that is robust, agile, flexible, and allows for self-service. Models produced on Flotilla are packaged for deployment in production using Khan, another framework we've developed internally. Khan gives our data scientists the ability to quickly productionize the models they've developed with open source frameworks in Python 3 (e.g. PyTorch, sklearn) by automatically packaging them as Docker containers and deploying them to Amazon ECS. This gives our data scientists a one-click method of getting from their algorithms to production. We then integrate those deployments into a service mesh, which allows us to A/B test various implementations in our product.

Conor Myhrvold
Tech Brand Mgr, Office of CTO at Uber | 5 upvotes 163.9K views
Tools: Kafka, Kafka Manager, Hadoop, Apache Spark, GitHub

Why we built Marmaray, an open source generic data ingestion and dispersal framework and library for Apache Hadoop:

Built and designed by our Hadoop Platform team, Marmaray is a plug-in-based framework built on top of the Hadoop ecosystem. Users can add support to ingest data from any source and disperse it to any sink using Apache Spark. The name, Marmaray, comes from a tunnel in Turkey connecting Europe and Asia. Similarly, we envisioned Marmaray within Uber as a pipeline connecting data from any source to any sink depending on customer preference:

https://eng.uber.com/marmaray-hadoop-ingestion-open-source/

(Direct GitHub repo: https://github.com/uber/marmaray)

Stan

A Probabilistic Programming Language

related Amazon Athena posts

Tools: Amazon Athena, Google BigQuery

I use Amazon Athena because, similar to Google BigQuery, you can store and query data easily. In particular, since you can define data schemas in the Glue Data Catalog, there's a central way to define data models.

However, I would not recommend it for batch jobs. I typically use it to check intermediary datasets in data engineering workloads. It's good for getting a look and feel of the data along its ETL journey.

Apache Flink

Fast and reliable large-scale data processing engine

related Apache Flink posts

Surabhi Bhawsar
Technical Architect at Pepcus | 6 upvotes 45.1K views
Tools: Kafka, Apache Flink

I need to build an alert and notification framework using a scheduled program. We will analyze the events from a database table, filter the events that fall within a one-day timespan, and send these event messages by email. Currently, we are using Kafka Pub/Sub for messaging. The customer wants us to move to Apache Flink, and I am trying to understand how Apache Flink could be a better fit for us.

Apache Hive

Data Warehouse Software for Reading, Writing, and Managing Large Datasets

related Apache Hive posts

Ashish Singh
Tech Lead, Big Data Platform at Pinterest | 20 upvotes 34.3K views
Tools: Apache Hive, Kubernetes, Kafka, Amazon S3, Amazon EC2, Presto
#DataScience #DataEngineering #AWS #BigData

To meet employees' critical need for interactive querying, we've worked with Presto, an open-source distributed SQL query engine, over the years. Operating Presto at Pinterest's scale has involved resolving quite a few challenges, such as supporting deeply nested and huge Thrift schemas, slow or bad worker detection and remediation, cluster auto-scaling, graceful cluster shutdown, and impersonation support for the LDAP authenticator.

Our infrastructure is built on top of Amazon EC2 and we leverage Amazon S3 for storing our data. This separates compute and storage layers, and allows multiple compute clusters to share the S3 data.

We have hundreds of petabytes of data and tens of thousands of Apache Hive tables. Our Presto clusters comprise a fleet of 450 r4.8xl EC2 instances and together have over 100 TB of memory and 14K vCPU cores. Within Pinterest, more than 1,000 monthly active users (out of 1,600+ Pinterest employees) use Presto, running about 400K queries on these clusters per month.
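
As a rough sanity check on these figures, the sketch below multiplies out the fleet size. It assumes the published r4.8xlarge specification of 32 vCPUs and 244 GiB of memory per instance, which is not stated in the post; the variable names are illustrative only.

```python
# Assumed per-instance specs for r4.8xlarge (not given in the post).
VCPUS_PER_INSTANCE = 32
MEMORY_GIB_PER_INSTANCE = 244

instances = 450              # from the post
monthly_queries = 400_000    # from the post
monthly_users = 1_000        # from the post ("more than 1,000")

# Fleet-wide totals implied by the instance count.
total_vcpus = instances * VCPUS_PER_INSTANCE
total_memory_tib = instances * MEMORY_GIB_PER_INSTANCE / 1024

# Upper bound on the average load per user, since there are at least 1,000 users.
queries_per_user = monthly_queries / monthly_users

print(total_vcpus)                 # 14400
print(round(total_memory_tib, 1))  # 107.2
print(queries_per_user)            # 400.0
```

Under those assumptions the totals line up with the "14K vCPU cores" and "over 100 TB of memory" quoted above, and each active user runs at most about 400 queries a month on average.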

Each query submitted to a Presto cluster is logged to a Kafka topic via Singer. Singer is a logging agent built at Pinterest and we talked about it in a previous post. Each query is logged when it is submitted and when it finishes. When a Presto cluster crashes, we will have query submitted events without corresponding query finished events. These events enable us to capture the effect of cluster crashes over time.
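
The pairing of submitted and finished events described above can be sketched roughly as follows. This is a minimal illustration, not Pinterest's implementation: the event dictionaries, the field names (`query_id`, `cluster`, `event`) and both function names are hypothetical, and the events are assumed to have already been read from the Kafka topic.

```python
from collections import Counter


def unfinished_queries(events):
    """Map query ID -> cluster for queries submitted but never finished.

    `events` is an iterable of dicts with the hypothetical keys
    "query_id", "cluster" and "event" ("submitted" or "finished").
    """
    submitted = {}
    finished = set()
    for e in events:
        if e["event"] == "submitted":
            submitted[e["query_id"]] = e["cluster"]
        elif e["event"] == "finished":
            finished.add(e["query_id"])
    return {qid: cluster for qid, cluster in submitted.items() if qid not in finished}


def orphans_per_cluster(events):
    """Count unmatched submissions per cluster, a rough signal of crashes."""
    return Counter(unfinished_queries(events).values())


sample_events = [
    {"query_id": "q1", "cluster": "adhoc", "event": "submitted"},
    {"query_id": "q1", "cluster": "adhoc", "event": "finished"},
    {"query_id": "q2", "cluster": "adhoc", "event": "submitted"},
    {"query_id": "q3", "cluster": "batch", "event": "submitted"},
    {"query_id": "q3", "cluster": "batch", "event": "finished"},
]

print(orphans_per_cluster(sample_events))
```

A query that is still running also has no finished event yet, so a real analysis would presumably only count submissions older than some cutoff.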

Each Presto cluster at Pinterest has workers on a mix of dedicated AWS EC2 instances and Kubernetes pods. The Kubernetes platform lets us add and remove workers from a Presto cluster very quickly. The best-case latency for bringing up a new worker on Kubernetes is less than a minute. However, when the Kubernetes cluster itself is out of resources and needs to scale up, it can take up to ten minutes. Another advantage of deploying on Kubernetes is that our Presto deployment becomes agnostic of cloud vendor, instance types, OS, etc.


Druid

Fast column-oriented distributed data store

AWS Glue

Fully managed extract, transform, and load (ETL) service

Apache Impala

Real-time Query for Hadoop

Pig

Platform for analyzing large data sets

Amazon Redshift Spectrum

Exabyte-Scale In-Place Queries of S3 Data

Apache Kudu

Fast Analytics on Fast Data. A columnar storage manager developed for the Hadoop platform

Talend

A single, unified suite for all integration needs

Vertica

Storage platform designed to handle large volumes of data

Apache Parquet

A free and open-source column-oriented data storage format

Hue

An open source SQL Workbench for Data Warehouses

Mule

Revolutionizing the way the world connects data and applications

Azure Data Factory

Hybrid data integration service that simplifies ETL at scale

Singer

Simple, Composable, Open Source ETL

Pachyderm

MapReduce without Hadoop. Analyze massive datasets with Docker.