Amazon DynamoDB vs Presto

Overview

Amazon DynamoDB

Stacks4.0K

Followers3.2K

Votes195

Presto

Stacks394

Followers1.0K

Votes66

Amazon DynamoDB vs Presto: What are the differences?

What is Amazon DynamoDB? Fully managed NoSQL database service. All data items are stored on Solid State Drives (SSDs), and are replicated across 3 Availability Zones for high availability and durability. With DynamoDB, you can offload the administrative burden of operating and scaling a highly available distributed database cluster, while paying a low price for only what you use.

What is Presto? Distributed SQL Query Engine for Big Data. Presto is an open source distributed SQL query engine for running interactive analytic queries against data sources of all sizes ranging from gigabytes to petabytes.

Amazon DynamoDB and Presto are primarily classified as "NoSQL Database as a Service" and "Big Data" tools respectively.

"Predictable performance and cost" is the primary reason why developers consider Amazon DynamoDB over the competitors, whereas "Works directly on files in s3 (no ETL)" was stated as the key factor in picking Presto.

Presto is an open source tool with 9.22K GitHub stars and 3.12K GitHub forks. Here's a link to Presto's open source repository on GitHub.

According to the StackShare community, Amazon DynamoDB has a broader approval, being mentioned in 433 company stacks & 173 developers stacks; compared to Presto, which is listed in 19 company stacks and 11 developer stacks.

Share your Stack

Help developers discover the tools you use. Get visibility for your team's tech choices and contribute to the community's knowledge.

View Docs

CLI (Node.js)

Manual

Advice on Amazon DynamoDB, Presto

Ashish

Tech Lead, Big Data Platform at Pinterest

Nov 27, 2019

Needs adviceon

Apache Hive

Presto

Amazon EC2

To provide employees with the critical need of interactive querying, we’ve worked with Presto, an open-source distributed SQL query engine, over the years. Operating Presto at Pinterest’s scale has involved resolving quite a few challenges like, supporting deeply nested and huge thrift schemas, slow/ bad worker detection and remediation, auto-scaling cluster, graceful cluster shutdown and impersonation support for ldap authenticator.

Our infrastructure is built on top of Amazon EC2 and we leverage Amazon S3 for storing our data. This separates compute and storage layers, and allows multiple compute clusters to share the S3 data.

We have hundreds of petabytes of data and tens of thousands of Apache Hive tables. Our Presto clusters are comprised of a fleet of 450 r4.8xl EC2 instances. Presto clusters together have over 100 TBs of memory and 14K vcpu cores. Within Pinterest, we have close to more than 1,000 monthly active users (out of total 1,600+ Pinterest employees) using Presto, who run about 400K queries on these clusters per month.

Each query submitted to Presto cluster is logged to a Kafka topic via Singer. Singer is a logging agent built at Pinterest and we talked about it in a previous post. Each query is logged when it is submitted and when it finishes. When a Presto cluster crashes, we will have query submitted events without corresponding query finished events. These events enable us to capture the effect of cluster crashes over time.

Each Presto cluster at Pinterest has workers on a mix of dedicated AWS EC2 instances and Kubernetes pods. Kubernetes platform provides us with the capability to add and remove workers from a Presto cluster very quickly. The best-case latency on bringing up a new worker on Kubernetes is less than a minute. However, when the Kubernetes cluster itself is out of resources and needs to scale up, it can take up to ten minutes. Some other advantages of deploying on Kubernetes platform is that our Presto deployment becomes agnostic of cloud vendor, instance types, OS, etc.

#BigData #AWS #DataScience #DataEngineering

3.72M views3.72M

Comments

Doru

Solution Architect

Jun 9, 2019

Reviewon

Amazon DynamoDB

I use Amazon DynamoDB because it integrates seamlessly with other AWS SaaS solutions and if cost is the primary concern early on, then this will be a better choice when compared to AWS RDS or any other solution that requires the creation of a HA cluster of IaaS components that will cost money just for being there, the costs not being influenced primarily by usage.

1.44k views1.44k

Comments

akash

Aug 27, 2020

Needs adviceon

Cloud Firestore

Firebase Realtime Database

Amazon DynamoDB

We are building a social media app, where users will post images, like their post, and make friends based on their interest. We are currently using Cloud Firestore and Firebase Realtime Database. We are looking for another database like Amazon DynamoDB; how much this decision can be efficient in terms of pricing and overhead?

199k views199k

Comments

Detailed Comparison

Amazon DynamoDB	Presto
With it , you can offload the administrative burden of operating and scaling a highly available distributed database cluster, while paying a low price for only what you use.	Distributed SQL Query Engine for Big Data
Automated Storage Scaling – There is no limit to the amount of data you can store in a DynamoDB table, and the service automatically allocates more storage, as you store more data using the DynamoDB write APIs;Provisioned Throughput – When creating a table, simply specify how much request capacity you require. DynamoDB allocates dedicated resources to your table to meet your performance requirements, and automatically partitions data over a sufficient number of servers to meet your request capacity;Fully Distributed, Shared Nothing Architecture	-
Statistics
Stacks 4.0K	Stacks 394
Followers 3.2K	Followers 1.0K
Votes 195	Votes 66
Pros & Cons
Pros 62 Predictable performance and cost 56 Scalable 35 Native JSON Support 21 AWS Free Tier 7 Fast Cons 4 Only sequential access for paginate data 1 Document Limit Size 1 Scaling	Pros 18 Works directly on files in s3 (no ETL) 13 Open-source 12 Join multiple databases 10 Scalable 7 Gets ready in minutes
Integrations
Amazon RDS for PostgreSQL PostgreSQL MySQL SQLite Azure Database for MySQL	PostgreSQL Kafka Redis MySQL Hadoop Microsoft SQL Server

What are some alternatives to Amazon DynamoDB, Presto?

Apache Spark

Spark is a fast and general processing engine compatible with Hadoop data. It can run in Hadoop clusters through YARN or Spark's standalone mode, and it can process data in HDFS, HBase, Cassandra, Hive, and any Hadoop InputFormat. It is designed to perform both batch processing (similar to MapReduce) and new workloads like streaming, interactive queries, and machine learning.

Azure Cosmos DB

Azure DocumentDB is a fully managed NoSQL database service built for fast and predictable performance, high availability, elastic scaling, global distribution, and ease of development.

Cloud Firestore

Cloud Firestore is a NoSQL document database that lets you easily store, sync, and query data for your mobile and web apps - at global scale.

Amazon Athena

Amazon Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL. Athena is serverless, so there is no infrastructure to manage, and you pay only for the queries that you run.

Apache Flink

Apache Flink is an open source system for fast and versatile data analytics in clusters. Flink supports batch and streaming analytics, in one system. Analytical programs can be written in concise and elegant APIs in Java and Scala.

lakeFS

It is an open-source data version control system for data lakes. It provides a “Git for data” platform enabling you to implement best practices from software engineering on your data lake, including branching and merging, CI/CD, and production-like dev/test environments.

Druid

Druid is a distributed, column-oriented, real-time analytics data store that is commonly used to power exploratory dashboards in multi-tenant environments. Druid excels as a data warehousing solution for fast aggregate queries on petabyte sized data sets. Druid supports a variety of flexible filters, exact calculations, approximate algorithms, and other useful calculations.

Cloudant

Cloudant’s distributed database as a service (DBaaS) allows developers of fast-growing web and mobile apps to focus on building and improving their products, instead of worrying about scaling and managing databases on their own.

Google Cloud Bigtable

Google Cloud Bigtable offers you a fast, fully managed, massively scalable NoSQL database service that's ideal for web, mobile, and Internet of Things applications requiring terabytes to petabytes of data. Unlike comparable market offerings, Cloud Bigtable doesn't require you to sacrifice speed, scale, or cost efficiency when your applications grow. Cloud Bigtable has been battle-tested at Google for more than 10 years—it's the database driving major applications such as Google Analytics and Gmail.

Apache Kylin

Apache Kylin™ is an open source Distributed Analytics Engine designed to provide SQL interface and multi-dimensional analysis (OLAP) on Hadoop/Spark supporting extremely large datasets, originally contributed from eBay Inc.