Amazon EMR vs Amazon S3 vs Redis

Overview

Amazon S3
Stacks: 55.1K | Followers: 40.2K | Votes: 2.0K

Amazon EMR
Stacks: 543 | Followers: 682 | Votes: 54

Redis
Stacks: 61.9K | Followers: 46.5K | Votes: 3.9K | GitHub Stars: 42 | Forks: 6

Amazon EMR vs Amazon S3 vs Redis: What are the differences?

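Introduction

Amazon EMR and Amazon S3 are services offered by Amazon Web Services (AWS). Redis is not an AWS product: it is an open-source in-memory data store that can be self-hosted or run on AWS through a managed offering such as Amazon ElastiCache. The three tools serve different purposes: EMR processes data, S3 stores it durably, and Redis serves it quickly from memory.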

Data Processing and Analytics Capabilities:
- Amazon EMR: Amazon Elastic MapReduce (EMR) is a managed big data processing and analytics service. It runs distributed processing of large datasets on clusters using popular frameworks such as Apache Hadoop, Spark, and Presto.
- Amazon S3: Amazon Simple Storage Service (S3) is an object storage service that provides scalable storage for data and files. It is designed for durability, scalability, and accessibility of data.
- Redis: Redis is an open-source, in-memory data structure store that can be used as a database, cache, and message broker. It supports various data structures such as strings, lists, sets, and hashes, and provides high-performance data storage and retrieval.
Data Storage and Retrieval:
- Amazon EMR: EMR is primarily focused on processing and analytics of large datasets. It can read data from and write data to various data sources such as Amazon S3, HDFS, and DynamoDB. It provides integration with different storage systems for data storage and retrieval.
- Amazon S3: S3 is a scalable object storage service that can be used to store and retrieve any amount of data. It offers high durability, availability, and security for the stored data. S3 provides a simple API for data manipulation and supports lifecycle policies for data management.
- Redis: Redis primarily serves as an in-memory data store and supports various data structures. It is optimized for high-performance data storage and retrieval. Redis can be used as a primary database or as a cache layer in applications.
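
The "cache layer" role mentioned above is often implemented with the cache-aside pattern. The following is a minimal sketch of that pattern, not Redis client code: `TTLCache` is a hypothetical in-memory stand-in for a Redis-style key-value store with expiry, and `load_from_source` is an assumed callback representing a slower backing store such as a database or S3.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class TTLCache:
    """Hypothetical in-memory stand-in for a Redis-style cache with per-key expiry."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._data: Dict[str, Tuple[str, float]] = {}

    def get(self, key: str) -> Optional[str]:
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            # Expired entries behave like missing keys.
            del self._data[key]
            return None
        return value

    def set(self, key: str, value: str, ttl_seconds: float) -> None:
        self._data[key] = (value, self._clock() + ttl_seconds)


def get_with_cache(
    cache: TTLCache,
    key: str,
    load_from_source: Callable[[str], str],
    ttl_seconds: float = 60.0,
) -> str:
    """Cache-aside read: try the cache, fall back to the source, then populate."""
    cached = cache.get(key)
    if cached is not None:
        return cached
    value = load_from_source(key)
    cache.set(key, value, ttl_seconds)
    return value
```

With a real Redis deployment the same flow would typically map onto the `GET` and `SET` commands with an expiry, so repeated reads within the TTL never reach the slower store.
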
Data Processing Capabilities:
- Amazon EMR: EMR provides a wide range of data processing capabilities through its integration with frameworks like Apache Hadoop, Spark, and Presto. It allows for distributed processing of large datasets and supports batch processing, real-time processing, and iterative algorithms.
- Amazon S3: S3 does not provide native data processing capabilities. It is mainly used as a storage layer for data. However, data stored in S3 can be accessed and processed using other AWS services like EMR, AWS Glue, or directly through custom applications.
- Redis: Redis is not a data processing framework either. It focuses on high-performance data storage and retrieval; any data manipulation is expressed through Redis commands (optionally grouped into server-side Lua scripts) and client libraries.
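
To make the batch processing model behind Hadoop-style frameworks concrete, here is a single-process sketch of the map, shuffle, and reduce phases applied to a word count. It is an illustration only: it does not use EMR, Hadoop, or Spark APIs, and the function names are chosen for this example. On a cluster, each phase would run in parallel across many nodes, with input typically read from S3 or HDFS.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def map_phase(line: str) -> List[Tuple[str, int]]:
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group all emitted values by key, as the framework does between phases."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Combine all values for one key into a single count."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())
```
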
Scaling and Performance:
- Amazon EMR: EMR allows for scaling of compute resources as per the processing requirements. It automatically provisions and manages the underlying infrastructure based on the data processing workloads. This enables parallel processing and high-performance analytics.
- Amazon S3: S3 scales to virtually unlimited data volumes without capacity planning. It provides high aggregate throughput, although per-request latency is higher than that of an in-memory store such as Redis. S3 automatically distributes data across multiple devices and storage nodes.
- Redis: Redis is known for its exceptional read and write performance. It stores data in-memory, which allows for low-latency access. Redis supports replication and sharding for horizontal scaling to handle larger data sets and increased traffic loads.
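
As an illustration of how sharding assigns keys to nodes, Redis Cluster maps every key to one of 16384 hash slots using a CRC16 checksum, and each node owns a range of slots. The sketch below reproduces that mapping under the assumption that Python's `binascii.crc_hqx` with an initial value of 0 matches the CRC16 variant Redis uses (XMODEM); `hash_tag` and `key_slot` are names chosen for this example.

```python
import binascii

HASH_SLOTS = 16384  # number of hash slots in Redis Cluster


def hash_tag(key: bytes) -> bytes:
    """Return the part of the key that is hashed.

    If the key contains "{...}" with at least one character between the
    first "{" and the first "}" after it, only that substring is hashed.
    This lets related keys be forced onto the same slot.
    """
    start = key.find(b"{")
    if start == -1:
        return key
    end = key.find(b"}", start + 1)
    if end == -1 or end == start + 1:
        return key
    return key[start + 1:end]


def key_slot(key: str) -> int:
    """Compute the cluster hash slot for a key."""
    return binascii.crc_hqx(hash_tag(key.encode("utf-8")), 0) % HASH_SLOTS
```

For example, `key_slot("{user1000}.following")` and `key_slot("{user1000}.followers")` return the same slot, which is what allows multi-key operations on both keys in a sharded deployment.
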
Data Persistence and Durability:
- Amazon EMR: EMR primarily serves as a data processing and analytics service, and the persistence and durability of data depend on the underlying storage system being used, such as Amazon S3 or HDFS. EMR does not provide its own persistent storage.
- Amazon S3: S3 is designed for 99.999999999% (11 nines) durability of stored objects. This is a design target rather than a contractual guarantee (the S3 Service Level Agreement covers availability). S3 achieves it by automatically storing data redundantly across multiple devices and facilities.
- Redis: Redis provides persistence options to ensure data durability. It supports snapshotting and append-only file (AOF) persistence mechanisms. Snapshots allow periodically saving the dataset to disk, while AOF logs the write operations, ensuring data recovery in case of failures.
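
To put the 11-nines durability figure in perspective, the arithmetic can be worked through directly. This sketch assumes the figure is read as an independent annual loss probability of 1 in 10^11 per object, which is a simplification for illustration; the function names are hypothetical.

```python
from fractions import Fraction

# 1 - 0.99999999999, kept as an exact fraction to avoid float rounding.
ANNUAL_LOSS_PROBABILITY = Fraction(1, 10**11)


def expected_annual_losses(object_count: int) -> Fraction:
    """Expected number of objects lost per year under the assumption above."""
    return object_count * ANNUAL_LOSS_PROBABILITY


def years_per_expected_loss(object_count: int) -> Fraction:
    """Average number of years between single-object losses."""
    return 1 / expected_annual_losses(object_count)
```

Under this reading, storing 10,000,000 objects gives an expected loss of one object roughly every 10,000 years.
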
Use Cases:
- Amazon EMR: EMR is commonly used for big data processing and analytics tasks, such as log analysis, data warehousing, machine learning, real-time analytics, and genomics research.
- Amazon S3: S3 is widely used for storing and retrieving large amounts of data, backup and restore, content storage and distribution, data archiving, data lakes, and as a data source for other AWS services.
- Redis: Redis is popularly used as a caching layer, session store, real-time analytics, queuing system, leaderboard implementation, messaging system, and in various other use cases that require high-performance data storage and retrieval.

In summary, Amazon EMR focuses on big data processing and analytics, Amazon S3 provides scalable and durable object storage, and Redis is an in-memory data structure store optimized for low-latency access. The three are complementary rather than interchangeable: a common arrangement keeps data in S3, processes it with EMR, and serves hot results from Redis.

Advice on Amazon S3, Amazon EMR, Redis

Decision from Gabriel, CEO at NaoLogic Inc (Dec 24, 2019)

We offer our customers HIPAA-compliant storage. After analyzing the market, we decided to go with Google Storage. The Node.js API is OK, but it is still not ES6 and can be very confusing to use. For each new customer, we created a separate bucket so that their data stays isolated and they do not have to worry about data loss. After 1000+ customers, we started seeing many problems with creating new buckets and with saving or retrieving files. There were many false positives: the Promise resolved successfully, but in reality the operation had failed.

That's why we switched to S3, which just works.


Detailed Comparison

Description
- Amazon S3: Amazon Simple Storage Service provides a fully redundant data storage infrastructure for storing and retrieving any amount of data, at any time, from anywhere on the web.
- Amazon EMR: It is used in a variety of applications, including log analysis, data warehousing, machine learning, financial analysis, scientific simulation, and bioinformatics.
- Redis: Redis is an open source (BSD licensed), in-memory data structure store, used as a database, cache, and message broker. Redis provides data structures such as strings, hashes, lists, sets, sorted sets with range queries, bitmaps, hyperloglogs, geospatial indexes, and streams.

Features
Amazon S3:
- Write, read, and delete objects containing from 1 byte to 5 terabytes of data each. The number of objects you can store is unlimited.
- Each object is stored in a bucket and retrieved via a unique, developer-assigned key.
- A bucket can be stored in one of several Regions. You can choose a Region to optimize for latency, minimize costs, or address regulatory requirements. Amazon S3 is currently available in the US Standard, US West (Oregon), US West (Northern California), EU (Ireland), Asia Pacific (Singapore), Asia Pacific (Tokyo), Asia Pacific (Sydney), South America (Sao Paulo), and GovCloud (US) Regions. The US Standard Region automatically routes requests to facilities in Northern Virginia or the Pacific Northwest using network maps.
- Objects stored in a Region never leave the Region unless you transfer them out. For example, objects stored in the EU (Ireland) Region never leave the EU.
- Authentication mechanisms are provided to ensure that data is kept secure from unauthorized access. Objects can be made private or public, and rights can be granted to specific users.
- Options for secure data upload/download and encryption of data at rest are provided for additional data protection.
- Uses standards-based REST and SOAP interfaces designed to work with any Internet-development toolkit.
- Built to be flexible so that protocol or functional layers can easily be added. The default download protocol is HTTP. A BitTorrent protocol interface is provided to lower costs for high-scale distribution.
- Provides functionality to simplify manageability of data through its lifetime. Includes options for segregating data by buckets, monitoring and controlling spend, and automatically archiving data to even lower cost storage options. These options can be easily administered from the Amazon S3 Management Console.
- Reliability backed with the Amazon S3 Service Level Agreement.
Amazon EMR:
- Elastic: Amazon EMR enables you to quickly and easily provision as much capacity as you need and add or remove capacity at any time. Deploy multiple clusters or resize a running cluster.
- Low Cost: Amazon EMR is designed to reduce the cost of processing large amounts of data. Some of the features that make it low cost include low hourly pricing, Amazon EC2 Spot integration, Amazon EC2 Reserved Instance integration, elasticity, and Amazon S3 integration.
- Flexible Data Stores: With Amazon EMR, you can leverage multiple data stores, including Amazon S3, the Hadoop Distributed File System (HDFS), and Amazon DynamoDB.
- Hadoop Tools: EMR supports powerful and proven Hadoop tools such as Hive, Pig, and HBase.
Redis:
- No features listed.
Statistics
Metric	Amazon S3	Amazon EMR	Redis
GitHub Stars	-	-	42
GitHub Forks	-	-	6
Stacks	55.1K	543	61.9K
Followers	40.2K	682	46.5K
Votes	2.0K	54	3.9K
Pros & Cons
Amazon S3
- Pros: Reliable (590); Scalable (492); Cheap (456); Simple & easy (329); Many SDKs (83)
- Cons: Permissions take some time to get right (7); Takes time/work to organize buckets & folders properly (6); Requires a credit card (6); Complex to set up (3)
Amazon EMR
- Pros: On demand processing power (15); Don't need to maintain Hadoop Cluster yourself (12); Hadoop Tools (7); Elastic (6); Backed by Amazon (4)
- Cons: none listed
Redis
- Pros: Performance (888); Super fast (542); Ease of use (514); In-memory cache (444); Advanced key-value cache (324)
- Cons: Cannot query objects directly (15); No secondary indexes for non-numeric data types (3); No WAL (1)

What are some alternatives to Amazon S3, Amazon EMR, Redis?

Google BigQuery

Run super-fast, SQL-like queries against terabytes of data in seconds, using the processing power of Google's infrastructure. Load data with ease. Bulk load your data using Google Cloud Storage or stream it in. Easy access. Access BigQuery by using a browser tool, a command-line tool, or by making calls to the BigQuery REST API with client libraries such as Java, PHP or Python.

Amazon Redshift

It is optimized for data sets ranging from a few hundred gigabytes to a petabyte or more and costs less than $1,000 per terabyte per year, a tenth the cost of most traditional data warehousing solutions.

Amazon EBS

Amazon EBS volumes are network-attached, and persist independently from the life of an instance. Amazon EBS provides highly available, highly reliable, predictable storage volumes that can be attached to a running Amazon EC2 instance and exposed as a device within the instance. Amazon EBS is particularly suited for applications that require a database, file system, or access to raw block level storage.

Google Cloud Storage

Google Cloud Storage allows world-wide storing and retrieval of any amount of data and at any time. It provides a simple programming interface which enables developers to take advantage of Google's own reliable and fast networking infrastructure to perform data operations in a secure and cost effective manner. If expansion needs arise, developers can benefit from the scalability provided by Google's infrastructure.

Qubole

Qubole is a cloud based service that makes big data easy for analysts and data engineers.

Hazelcast

With its various distributed data structures, distributed caching capabilities, elastic nature, memcache support, integration with Spring and Hibernate, and, more importantly, so many happy users, Hazelcast is a feature-rich, enterprise-ready, and developer-friendly in-memory data grid solution.

Azure Storage

Azure Storage provides the flexibility to store and retrieve large amounts of unstructured data, such as documents and media files with Azure Blobs; structured nosql based data with Azure Tables; reliable messages with Azure Queues, and use SMB based Azure Files for migrating on-premises applications to the cloud.

Aerospike

Aerospike is an open-source, modern database built from the ground up to push the limits of flash storage, processors and networks. It was designed to operate with predictable low latency at high throughput with uncompromising reliability – both high availability and ACID guarantees.

MemSQL

MemSQL converges transactions and analytics for sub-second data processing and reporting. Real-time businesses can build robust applications on a simple and scalable infrastructure that complements and extends existing data pipelines.

Minio

Minio is an object storage server compatible with Amazon S3 and licensed under the Apache 2.0 License.
