Denodo vs Presto

What is Denodo?

Denodo positions itself as a leader in data virtualization, providing data access, data governance, and data delivery capabilities across a broad range of enterprise, cloud, big data, and unstructured data sources without moving the data from their original repositories.

What is Presto?

Presto is an open source distributed SQL query engine for running interactive analytic queries against data sources of all sizes ranging from gigabytes to petabytes.

          What are some alternatives to Denodo and Presto?
          AtScale
          Its Virtual Data Warehouse delivers performance, security and agility to exceed the demands of modern-day operational analytics.
          Tableau
          Tableau can help anyone see and understand their data. Connect to almost any database, drag and drop to create visualizations, and share with a click.
          Pandas
          Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more.
          NumPy
          Besides its obvious scientific uses, NumPy can also be used as an efficient multi-dimensional container of generic data. Arbitrary data-types can be defined. This allows NumPy to seamlessly and speedily integrate with a wide variety of databases.
          Metabase
          Metabase is an easy way to generate charts and dashboards, ask simple ad hoc queries without using SQL, and see detailed information about rows in your Database. You can set it up in under 5 minutes, and then give yourself and others a place to ask simple questions and understand the data your application is generating.
          Decisions about Denodo and Presto
          StackShare Editors
          Hadoop, Apache Spark, Presto

          Around 2015, the growing use of Uber’s data exposed limitations in the ETL and Vertica-centric setup, not to mention the increasing costs. “As our company grew, scaling our data warehouse became increasingly expensive. To cut down on costs, we started deleting older, obsolete data to free up space for new data.”

          To overcome these challenges, Uber rebuilt their big data platform around Hadoop. “More specifically, we introduced a Hadoop data lake where all raw data was ingested from different online data stores only once and with no transformation during ingestion.”

          “In order for users to access data in Hadoop, we introduced Presto to enable interactive ad hoc user queries, Apache Spark to facilitate programmatic access to raw data (in both SQL and non-SQL formats), and Apache Hive to serve as the workhorse for extremely large queries.”

          StackShare Editors
          Hadoop, Apache Spark, Presto

          To improve platform scalability and efficiency, Uber transitioned from JSON to Parquet, and built a central schema service to manage schemas and integrate different client libraries.

          While the first generation big data platform was vulnerable to upstream data format changes, “ad hoc data ingestions jobs were replaced with a standard platform to transfer all source data in its original, nested format into the Hadoop data lake.”

          These platform changes helped Uber meet the scaling challenges it was facing around that time: “On a daily basis, there were tens of terabytes of new data added to our data lake, and our Big Data platform grew to over 10,000 vcores with over 100,000 running batch jobs on any given day.”
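
A central schema service of the kind mentioned above can be pictured as a registry that stores versioned schemas and rejects incompatible changes. The sketch below is purely illustrative and is not Uber's implementation: `SchemaRegistry`, its methods, the `trips` topic, and the compatibility rule (a new version may add fields but may not remove or retype existing ones) are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class SchemaRegistry:
    """Toy in-memory registry of versioned schemas (hypothetical)."""

    # Maps a topic name to its list of schema versions. Each schema maps
    # a field name to a type name, e.g. {"trip_id": "string"}.
    _schemas: dict = field(default_factory=dict)

    def register(self, topic: str, schema: dict) -> int:
        """Store a new schema version and return its 1-based version number."""
        versions = self._schemas.setdefault(topic, [])
        if versions:
            latest = versions[-1]
            for name, type_name in latest.items():
                # Assumed rule: existing fields must keep their name and type.
                if schema.get(name) != type_name:
                    raise ValueError(f"incompatible change to field {name!r}")
        versions.append(dict(schema))
        return len(versions)

    def latest(self, topic: str) -> dict:
        """Return the most recent schema registered for a topic."""
        return self._schemas[topic][-1]

    def validate(self, topic: str, record: dict) -> bool:
        """Check that a record only uses fields known to the latest schema."""
        return set(record) <= set(self.latest(topic))


registry = SchemaRegistry()
v1 = registry.register("trips", {"trip_id": "string", "fare": "double"})
v2 = registry.register(
    "trips", {"trip_id": "string", "fare": "double", "city": "string"}
)
```

Rejecting removals and type changes at registration time is one simple way to shield downstream readers from upstream format changes; a real service would also need persistence and client library integration.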

          StackShare Editors
          Kafka, MySQL, Scala, Apache Spark, Presto

          Slack’s data team works to “provide an ecosystem to help people in the company quickly and easily answer questions about usage, so they can make better and data informed decisions.” To achieve that goal, they rely on a complex data pipeline.

          An in-house tool called Sqooper scrapes MySQL backups and pipes them to S3. Job queue and log data is sent to Kafka, then persisted to S3 using an open source tool called Secor, which was created by Pinterest.

          For compute, Amazon’s Elastic MapReduce (EMR) creates clusters preconfigured for Presto, Hive, and Spark.

          Presto is then used for ad-hoc questions, validating data assumptions, exploring smaller datasets, and creating visualizations for some internal tools. Hive is used for larger data sets or longer time series data, and Spark allows teams to write efficient and robust batch and aggregation jobs. Most of the Spark pipeline is written in Scala.

          Thrift binds all of these engines together with a typed schema and structured data.

          Finally, the Hive Metastore serves as the ground truth for all data and its schema.

          StackShare Editors
          Prometheus, Chef, Consul, Memcached, Hack, Swift, Hadoop, Terraform, Airflow, Apache Spark, Kubernetes, gRPC, HHVM (HipHop Virtual Machine), Presto, Kotlin, Apache Thrift

          Since the beginning, Cal Henderson has been the CTO of Slack. Earlier this year, he commented on a Quora question summarizing their current stack.

          Apps
          • Web: a mix of JavaScript/ES6 and React.
          • Desktop: Electron, used to ship the web app as a desktop application.
          • Android: a mix of Java and Kotlin.
          • iOS: written in a mix of Objective-C and Swift.
          Backend
          • The core application and the API are written in PHP/Hack and run on HHVM.
          • The data is stored in MySQL using Vitess.
          • Caching is done using Memcached and MCRouter.
          • The search service uses SolrCloud, with various Java services.
          • The messaging system uses WebSockets with many services in Java and Go.
          • Load balancing is done using HAProxy with Consul for configuration.
          • Most services talk to each other over gRPC, with some Thrift and JSON-over-HTTP.
          • Voice and video calling service was built in Elixir.
          Data warehouse
          • Built using open source tools including Presto, Spark, Airflow, Hadoop and Kafka.
          Eric Colson, Chief Algorithms Officer at Stitch Fix
          Kafka, PostgreSQL, Amazon S3, Apache Spark, Presto, Python, R, PyTorch, Docker, Amazon EC2 Container Service
          #AWS #Etl #ML #DataScience #DataStack #Data

          The algorithms and data infrastructure at Stitch Fix is housed in #AWS. Data acquisition is split between events flowing through Kafka and periodic snapshots of PostgreSQL DBs. We store data in an Amazon S3 based data warehouse. Apache Spark on YARN is our tool of choice for data movement and #ETL. Because our storage layer (S3) is decoupled from our processing layer, we are able to scale our compute environment very elastically. We have several semi-permanent, autoscaling YARN clusters running to serve our data processing needs. While the bulk of our compute infrastructure is dedicated to algorithmic processing, we also implemented Presto for ad hoc queries and dashboards.

          Beyond data movement and ETL, most #ML centric jobs (e.g. model training and execution) run in a similarly elastic environment as containers running Python and R code on Amazon EC2 Container Service clusters. The execution of batch jobs on top of ECS is managed by Flotilla, a service we built in-house and open sourced (see https://github.com/stitchfix/flotilla-os).

          At Stitch Fix, algorithmic integrations are pervasive across the business. We have dozens of data products actively integrated into our systems. That requires a serving layer that is robust, agile, flexible, and allows for self-service. Models produced on Flotilla are packaged for deployment in production using Khan, another framework we've developed internally. Khan provides our data scientists the ability to quickly productionize the models they've developed with open source frameworks in Python 3 (e.g. PyTorch, sklearn), by automatically packaging them as Docker containers and deploying to Amazon ECS. This provides our data scientists a one-click method of getting from their algorithms to production. We then integrate those deployments into a service mesh, which allows us to A/B test various implementations in our product.
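
One common way to split traffic between deployments for an A/B test is to hash a stable identifier deterministically. The sketch below illustrates that general idea only. It does not describe how Stitch Fix's service mesh or Khan works, and `assign_variant`, the variant names, and the weights are hypothetical.

```python
import hashlib


def assign_variant(client_id: str, weights: dict) -> str:
    """Deterministically map a client to a variant in proportion to weights."""
    if not weights:
        raise ValueError("weights must not be empty")
    total = sum(weights.values())
    # Hash the id to a stable integer, then scale it into [0, total).
    digest = hashlib.sha256(client_id.encode("utf-8")).hexdigest()
    point = (int(digest, 16) % 10_000) / 10_000 * total
    cumulative = 0.0
    for variant, weight in weights.items():
        cumulative += weight
        if point < cumulative:
            return variant
    return variant  # floating-point guard; not normally reached


# Hypothetical split: 90% of clients see the current model, 10% the candidate.
split = {"model_current": 0.9, "model_candidate": 0.1}
chosen = assign_variant("client-123", split)
```

Because the assignment depends only on the identifier and the weights, the same client sees the same variant on every request, which keeps the comparison between implementations clean.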

          Ashish Singh, Tech Lead, Big Data Platform at Pinterest
          Apache Hive, Kubernetes, Kafka, Amazon S3, Amazon EC2, Presto
          #DataScience #DataEngineering #AWS #BigData

          To provide employees with the critical need of interactive querying, we’ve worked with Presto, an open-source distributed SQL query engine, over the years. Operating Presto at Pinterest’s scale has involved resolving quite a few challenges, such as supporting deeply nested and huge Thrift schemas, slow/bad worker detection and remediation, cluster auto-scaling, graceful cluster shutdown, and impersonation support for the LDAP authenticator.

          Our infrastructure is built on top of Amazon EC2 and we leverage Amazon S3 for storing our data. This separates compute and storage layers, and allows multiple compute clusters to share the S3 data.

          We have hundreds of petabytes of data and tens of thousands of Apache Hive tables. Our Presto clusters comprise a fleet of 450 r4.8xl EC2 instances. Together, the Presto clusters have over 100 TB of memory and 14K vCPU cores. Within Pinterest, more than 1,000 monthly active users (out of 1,600+ total Pinterest employees) use Presto, running about 400K queries on these clusters per month.
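
The cluster totals quoted above are consistent with the instance count. As a rough check, assuming "r4.8xl" refers to the AWS r4.8xlarge instance type with 32 vCPUs and 244 GiB of memory per instance (figures taken from AWS's published specifications, not from the post):

```python
# Assumed per-instance specs for r4.8xlarge (not stated in the post).
VCPUS_PER_INSTANCE = 32
MEMORY_GIB_PER_INSTANCE = 244

instances = 450               # from the post
queries_per_month = 400_000   # from the post
monthly_active_users = 1_000  # from the post ("more than 1,000")

total_vcpus = instances * VCPUS_PER_INSTANCE
total_memory_tib = instances * MEMORY_GIB_PER_INSTANCE / 1024
queries_per_user = queries_per_month / monthly_active_users

print(f"vCPUs: {total_vcpus}")                                # 14400, about 14K
print(f"Memory: {total_memory_tib:.1f} TiB")                  # about 107 TiB
print(f"Queries per user per month: {queries_per_user:.0f}")  # about 400
```

Since the user count is a lower bound, the per-user query figure is an upper estimate.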

          Each query submitted to a Presto cluster is logged to a Kafka topic via Singer. Singer is a logging agent built at Pinterest and we talked about it in a previous post. Each query is logged when it is submitted and when it finishes. When a Presto cluster crashes, we will have query submitted events without corresponding query finished events. These events enable us to capture the effect of cluster crashes over time.
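
The crash-detection idea above amounts to finding submitted events that have no matching finished event. A minimal sketch, assuming each logged event is a dictionary with hypothetical `query_id` and `event` fields (the post does not describe the actual event schema, and reading from Kafka is omitted):

```python
def find_unfinished_queries(events):
    """Return ids of queries submitted but never finished, in submission order."""
    pending = {}  # dicts preserve insertion order
    for event in events:
        if event["event"] == "submitted":
            pending[event["query_id"]] = event
        elif event["event"] == "finished":
            pending.pop(event["query_id"], None)
    return list(pending)


# Hypothetical event stream: q2 and q3 never report a finish.
events = [
    {"query_id": "q1", "event": "submitted"},
    {"query_id": "q2", "event": "submitted"},
    {"query_id": "q1", "event": "finished"},
    {"query_id": "q3", "event": "submitted"},
]
print(find_unfinished_queries(events))  # ['q2', 'q3']
```

In practice, queries that are still running would also show up as unfinished, so a real analysis would need a time window or a cluster-restart marker to separate crashes from in-flight work.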

          Each Presto cluster at Pinterest has workers on a mix of dedicated AWS EC2 instances and Kubernetes pods. The Kubernetes platform lets us add and remove workers from a Presto cluster very quickly. The best-case latency for bringing up a new worker on Kubernetes is less than a minute. However, when the Kubernetes cluster itself is out of resources and needs to scale up, it can take up to ten minutes. Another advantage of deploying on Kubernetes is that our Presto deployment becomes agnostic of cloud vendor, instance types, OS, etc.
