Amazon EMR vs Snowflake vs Stitch

Overview

Amazon EMR: Stacks 543, Followers 682, Votes 54

Snowflake: Stacks 1.2K, Followers 1.2K, Votes 27

Stitch: Stacks 150, Followers 150, Votes 12

Amazon EMR vs Snowflake vs Stitch: What are the differences?

Introduction

The key differences between Amazon EMR, Snowflake, and Stitch are outlined below.

Distributed Computing vs. Data Warehouse vs. ETL: Amazon EMR is a distributed computing service that processes large amounts of data across dynamically scalable Amazon EC2 instances. Snowflake is a cloud data warehouse for storing and analyzing structured and semi-structured data with SQL. Stitch is neither: it is an ETL service that replicates data from databases and SaaS tools into a warehouse.
Pricing Model: Amazon EMR follows a pay-as-you-go model in which users are billed for the compute resources a cluster uses while it runs. Snowflake is also usage-based, billing compute and storage separately. Stitch offers subscription plans whose tiers depend on the number of rows replicated per month.
Data Transformation Capabilities: Amazon EMR runs tools such as Apache Hive, Apache Spark, and Apache Pig, which makes it the most flexible of the three for complex transformations and unstructured data. Snowflake transformations are expressed primarily in SQL. Stitch focuses on replication and applies only light transformations, such as normalizing nested JSON into relational tables.
Integration with External Data Sources: Stitch provides prebuilt integrations for many databases and SaaS applications (Stripe, Zendesk, and MongoDB among them), and Snowflake connects to common languages and BI tools such as Python, Apache Spark, and Looker. Amazon EMR requires more setup and configuration to integrate with external data sources and tools.
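
As a rough illustration of the pricing contrast above, the sketch below compares usage-based billing (paying for the compute a cluster consumes) with a tiered plan capped by a monthly row limit. All rates, tier names, and limits are invented for the example and are not actual prices of any of the three services.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Tier:
    """A hypothetical subscription tier (not a real plan)."""
    name: str
    monthly_price: float  # flat fee per month, made-up currency units
    row_limit: int        # rows that may be replicated per month


def usage_cost(node_count: int, hours: float, rate_per_node_hour: float) -> float:
    """Usage-based billing: pay only for the node-hours actually consumed."""
    return node_count * hours * rate_per_node_hour


def cheapest_tier(rows_per_month: int, tiers: Sequence[Tier]) -> Tier:
    """Tiered billing: pick the cheapest tier whose row limit covers the volume."""
    eligible = [t for t in tiers if t.row_limit >= rows_per_month]
    if not eligible:
        raise ValueError("no tier covers this monthly row volume")
    return min(eligible, key=lambda t: t.monthly_price)


# Hypothetical tiers, for illustration only.
TIERS = [
    Tier("small", 100.0, 5_000_000),
    Tier("medium", 250.0, 50_000_000),
    Tier("large", 600.0, 300_000_000),
]

if __name__ == "__main__":
    # A 4-node cluster running a 2.5-hour nightly job at an assumed rate.
    print("usage-based:", usage_cost(4, 2.5, 0.25))
    # 20 million rows per month falls into the hypothetical "medium" tier.
    print("tiered:", cheapest_tier(20_000_000, TIERS).name)
```

The practical difference is that the usage-based figure drops to zero when nothing runs, while the tiered fee stays flat until the row volume crosses into the next tier.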

Summary

In summary, Amazon EMR focuses on distributed computing, Snowflake is a data warehouse with SQL-based analysis, and Stitch is an ETL service that loads data into a warehouse. The three platforms also differ in pricing model, data transformation capabilities, and integrations with external data sources.


Advice on Amazon EMR, Snowflake, Stitch

Julien

CTO at Hawk

Sep 19, 2020

Decided

The cloud data warehouse is the centerpiece of a modern data platform. Choosing the most suitable solution is therefore fundamental.

Our benchmark covered BigQuery and Snowflake. Both seem to match our goals, but they take very different approaches.

BigQuery is notably the only 100% serverless cloud data warehouse, which requires absolutely NO maintenance: no re-clustering, no compression, no index optimization, no storage management, no performance management. Snowflake requires setting up (paid) reclustering processes, managing the performance allocated to each profile, and so on. We also considered Redshift, which we eliminated because it requires even more ops work.

BigQuery can therefore be set up at almost no cost in human resources. Its on-demand pricing suits small workloads particularly well: it costs nothing when the solution is not in use, and you pay only for the queries you run. As usage grows, however, slots (with monthly or per-minute commitment) drastically reduce the cost. We cut the cost of our nightly batches by a factor of 10 by using flex slots.

Finally, a major advantage of BigQuery is its almost perfect integration with Google Cloud Platform services: Cloud Functions, Dataflow, Data Studio, etc.

BigQuery is still evolving very quickly. The next milestone, BigQuery Omni, will allow running queries over data stored on an external cloud platform (Amazon S3, for example). It will be a major breakthrough in the history of cloud data warehouses. Omni will compensate for a weakness of BigQuery: transferring data in near real time from S3 to BigQuery is not easy today. It was even simpler to implement with Snowflake's Snowpipe.

We also plan to use the machine learning features built into BigQuery to accelerate our deployment of data-science projects, an opportunity offered only by BigQuery.


Detailed Comparison

Description

Amazon EMR: It is used in a variety of applications, including log analysis, data warehousing, machine learning, financial analysis, scientific simulation, and bioinformatics.
Snowflake: Snowflake eliminates the administration and management demands of traditional data warehouses and big data platforms. Snowflake is a true data warehouse as a service running on Amazon Web Services (AWS): no infrastructure to manage and no knobs to turn.
Stitch: Stitch is a simple, powerful ETL service built for software developers. Stitch evolved out of RJMetrics, a widely used business intelligence platform. When RJMetrics was acquired by Magento in 2016, Stitch was launched as its own company.

Features

Amazon EMR:
  Elastic: Amazon EMR enables you to quickly and easily provision as much capacity as you need and add or remove capacity at any time. Deploy multiple clusters or resize a running cluster.
  Low Cost: Amazon EMR is designed to reduce the cost of processing large amounts of data. Some of the features that make it low cost include low hourly pricing, Amazon EC2 Spot integration, Amazon EC2 Reserved Instance integration, elasticity, and Amazon S3 integration.
  Flexible Data Stores: With Amazon EMR, you can leverage multiple data stores, including Amazon S3, the Hadoop Distributed File System (HDFS), and Amazon DynamoDB.
  Hadoop Tools: EMR supports powerful and proven Hadoop tools such as Hive, Pig, and HBase.
Snowflake: none listed
Stitch:
  Connect to your ecosystem of data sources: UI allows you to configure your data pipeline in a way that balances data freshness with cost and production database load.
  Replication frequency: Choose full or incremental loads, and determine how often you want them to run, from every minute to once every 24 hours.
  Data selection: Configure exactly what data gets replicated by selecting the tables, fields, collections, and endpoints you want in your warehouse.
  API: With the Stitch API, you're free to replicate data from any source. Its REST API supports JSON or Transit, and recognizes your schema based on the data you send.
  Usage dashboard: Access our simple UI to check usage data like the number of rows synced by data source, and how you're pacing toward your monthly row limit.
  Email alerts: Receive immediate notifications when Stitch encounters issues like expired credentials, integration updates, or warehouse errors preventing loads.
  Warehouse views: By using the freshness data provided by Stitch, you can build a simple audit table to track replication frequency.
  Scalable: Highly scalable. Stitch handles all data volumes with no data caps, allowing you to grow without the possibility of an ETL failure.
  Transform nested JSON: Stitch provides automatic detection and normalization of nested document structures into relational schemas.
  Complete historical data: On your first sync, Stitch replicates all available historical data from your database and SaaS tools. No database dump necessary.

Statistics

Amazon EMR: Stacks 543, Followers 682, Votes 54
Snowflake: Stacks 1.2K, Followers 1.2K, Votes 27
Stitch: Stacks 150, Followers 150, Votes 12

Pros & Cons

Amazon EMR pros: On demand processing power (15 votes); Don't need to maintain Hadoop Cluster yourself (12); Hadoop Tools (7); Elastic (6); Backed by Amazon (4)
Snowflake pros: Public and Private Data Sharing (7 votes); Multicloud (4); Good Performance (4); User Friendly (4); Great Documentation (3)
Stitch pros: 3 minutes to set up (8 votes); Super simple, great support (4)

Integrations

Amazon EMR: No integrations available
Snowflake: Python, Apache Spark, Node.js, Looker, Periscope, Mode
Stitch: Stripe, Twilio, SendGrid, Zendesk, MongoDB, Marketo, Recurly, GitLab, Zapier, FreshDesk, Harvest
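
The "Transform nested JSON" feature listed for Stitch describes normalizing nested documents into relational schemas. The sketch below shows one generic way such a normalization can work: nested objects are flattened into prefixed columns, and nested lists become child tables linked back to the source record. The function names, the `__` separator, and the `_source_*` and `_level_*_index` link columns are assumptions made for this illustration; this is not Stitch's actual algorithm or naming scheme.

```python
from typing import Any, Dict, List, Tuple

Row = Dict[str, Any]


def _split(obj: Dict[str, Any], prefix: str = "") -> Tuple[Row, List[Tuple[str, list]]]:
    """Return (flat columns, nested lists) for one object, flattening nested dicts."""
    columns: Row = {}
    lists: List[Tuple[str, list]] = []
    for field, value in obj.items():
        name = f"{prefix}{field}"
        if isinstance(value, dict):
            # Nested object: fold its fields into the parent row with a prefix.
            sub_columns, sub_lists = _split(value, prefix=f"{name}__")
            columns.update(sub_columns)
            lists.extend(sub_lists)
        elif isinstance(value, list):
            # Nested list: defer, it becomes a child table.
            lists.append((name, value))
        else:
            columns[name] = value
    return columns, lists


def normalize(record: Dict[str, Any], table: str, key: str) -> Dict[str, List[Row]]:
    """Split one nested record into flat rows grouped by (hypothetical) table name."""
    tables: Dict[str, List[Row]] = {}
    root_ref = {f"_source_{key}": record[key]}

    def emit(obj: Dict[str, Any], table_name: str, link: Row, depth: int) -> None:
        columns, lists = _split(obj)
        tables.setdefault(table_name, []).append({**link, **columns})
        for name, items in lists:
            for index, item in enumerate(items):
                # Child rows carry the source key plus their position at each level.
                child_link = {**(link or root_ref), f"_level_{depth}_index": index}
                child = item if isinstance(item, dict) else {"value": item}
                emit(child, f"{table_name}__{name}", child_link, depth + 1)

    emit(record, table, {}, 0)
    return tables


if __name__ == "__main__":
    order = {
        "id": 7,
        "customer": {"name": "Ada", "address": {"city": "Paris"}},
        "items": [{"sku": "A1", "qty": 2}, {"sku": "B2", "qty": 1}],
        "tags": ["new", "gift"],
    }
    for table_name, rows in normalize(order, "orders", "id").items():
        print(table_name, rows)
```

In this sketch the sample order yields three tables: `orders` with the flattened customer columns, and `orders__items` and `orders__tags` with one row per list element, each carrying the source `id` so the rows can be joined back in SQL.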

What are some alternatives to Amazon EMR, Snowflake, Stitch?

Google BigQuery

Run super-fast, SQL-like queries against terabytes of data in seconds, using the processing power of Google's infrastructure. Load data with ease: bulk load your data using Google Cloud Storage or stream it in. Access BigQuery by using a browser tool, a command-line tool, or by making calls to the BigQuery REST API with client libraries for languages such as Java, PHP, or Python.

Amazon Redshift

It is optimized for data sets ranging from a few hundred gigabytes to a petabyte or more and costs less than $1,000 per terabyte per year, a tenth the cost of most traditional data warehousing solutions.

Qubole

Qubole is a cloud-based service that makes big data easy for analysts and data engineers.

Altiscale

We run Apache Hadoop for you. We not only deploy Hadoop, we monitor, manage, fix, and update it for you. Then we take it a step further: We monitor your jobs, notify you when something’s wrong with them, and can help with tuning.

Azure Synapse

It is an analytics service that brings together enterprise data warehousing and Big Data analytics. It gives you the freedom to query data on your terms, using either serverless on-demand or provisioned resources—at scale. It brings these two worlds together with a unified experience to ingest, prepare, manage, and serve data for immediate BI and machine learning needs.

Dremio

Dremio, the data lake engine, operationalizes your data lake storage and speeds your analytics processes with a high-performance and high-efficiency query engine while also democratizing data access for data scientists and analysts.

Cloudera Enterprise

Cloudera Enterprise includes CDH, the world’s most popular open source Hadoop-based platform, as well as advanced system management and data management tools plus dedicated support and community advocacy from our world-class team of Hadoop developers and experts.

Airbyte

It is an open-source data integration platform that syncs data from applications, APIs, and databases to data warehouses, data lakes, and databases.

Treasure Data

Treasure Data's Big Data as-a-Service cloud platform enables data-driven businesses to focus their precious development resources on their applications, not on mundane, time-consuming integration and operational tasks. The Treasure Data Cloud Data Warehouse service offers an affordable, quick-to-implement and easy-to-use big data option that does not require specialized IT resources, making big data analytics available to the mass market.

Xplenty

Read and process data from cloud storage sources such as Amazon S3, Rackspace Cloud Files and IBM SoftLayer Object Storage. Once done processing, Xplenty allows you to connect with Amazon Redshift, SAP HANA and Google BigQuery. You can also store processed data back in your favorite relational database, cloud storage or key-value store.

