Amazon EMR vs Hazelcast

Overview

Amazon EMR

Stacks: 544
Followers: 682
Votes: 54

Hazelcast

Stacks: 428
Followers: 474
Votes: 59
GitHub Stars: 6.4K
Forks: 1.9K

Amazon EMR vs Hazelcast: What are the differences?

Developers describe Amazon EMR as "Distribute your data and processing across Amazon EC2 instances using Hadoop". Amazon EMR is used in a variety of applications, including log analysis, web indexing, data warehousing, machine learning, financial analysis, scientific simulation, and bioinformatics. Customers launch millions of Amazon EMR clusters every year. Hazelcast, on the other hand, is described as a "Clustering and highly scalable data distribution platform for Java". With its distributed data structures, distributed caching, elasticity, memcache support, and integration with Spring and Hibernate, Hazelcast is positioned as a feature-rich, enterprise-ready, and developer-friendly in-memory data grid.

Amazon EMR belongs to the "Big Data as a Service" category of the tech stack, while Hazelcast is primarily classified under "In-Memory Databases".

Some of the features offered by Amazon EMR are:

Elastic: Amazon EMR enables you to quickly and easily provision as much capacity as you need and add or remove capacity at any time. Deploy multiple clusters or resize a running cluster.
Low Cost: Amazon EMR is designed to reduce the cost of processing large amounts of data. Some of the features that make it low cost include low hourly pricing, Amazon EC2 Spot integration, Amazon EC2 Reserved Instance integration, elasticity, and Amazon S3 integration.
Flexible Data Stores: With Amazon EMR, you can leverage multiple data stores, including Amazon S3, the Hadoop Distributed File System (HDFS), and Amazon DynamoDB.
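Log analysis is one of the EMR use cases named above, and Hadoop-style jobs are usually expressed as a map step followed by a grouped reduce step. The following is a minimal in-process sketch of that shape in plain Java. It does not use the Hadoop or EMR APIs, it runs on a single machine, and the log format (status code as the last whitespace-separated field) and the class name `LogAnalysisSketch` are assumptions made for illustration.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Illustrative only: a single-process imitation of a map/reduce-style count.
// On a real cluster the same two steps would be distributed across nodes.
class LogAnalysisSketch {

    // "Map" step: derive a key from one record.
    // Assumption: the status code is the last whitespace-separated field.
    static String statusOf(String logLine) {
        String[] fields = logLine.trim().split("\\s+");
        return fields[fields.length - 1];
    }

    // "Shuffle" and "reduce" steps: group records by key, then count each group.
    static Map<String, Long> countByStatus(List<String> lines) {
        return lines.stream()
                .collect(Collectors.groupingBy(
                        LogAnalysisSketch::statusOf,
                        TreeMap::new,
                        Collectors.counting()));
    }

    public static void main(String[] args) {
        List<String> lines = List.of(
                "GET /index.html 200",
                "GET /missing 404",
                "POST /login 200");
        System.out.println(countByStatus(lines));
    }
}
```

The point of the sketch is only the structure: a per-record key extraction and a per-key aggregation are independent of each other, which is what lets a framework spread them over many instances.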

On the other hand, Hazelcast provides the following key features:

Distributed implementations of java.util.{Queue, Set, List, Map}
Distributed implementation of java.util.concurrent.locks.Lock
Distributed implementation of java.util.concurrent.ExecutorService
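Because these features are described as implementations of standard JDK interfaces, application logic can be written against `java.util.Map`, `java.util.Queue`, and `java.util.concurrent.locks.Lock` without naming a concrete class. The sketch below shows that idea using local JDK collections as stand-ins; they are not distributed, and the class and method names are hypothetical. Swapping in cluster-backed instances, and checking how atomic each operation is across a cluster, is left to the Hazelcast documentation.

```java
import java.util.ArrayDeque;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;

// Sketch: logic that depends only on java.util interfaces.
// The local collections used in main() are stand-ins, NOT distributed structures.
class InterfaceBasedSketch {

    // Counts occurrences of each word into the supplied map.
    static void countWords(Map<String, Integer> counts, List<String> words) {
        for (String word : words) {
            counts.merge(word, 1, Integer::sum);
        }
    }

    // Moves every pending word from the queue into the map; returns how many were moved.
    static int drain(Queue<String> pending, Map<String, Integer> counts) {
        int moved = 0;
        String word;
        while ((word = pending.poll()) != null) {
            counts.merge(word, 1, Integer::sum);
            moved++;
        }
        return moved;
    }

    // Runs an action while holding the given lock, always releasing it afterwards.
    static void withLock(Lock lock, Runnable action) {
        lock.lock();
        try {
            action.run();
        } finally {
            lock.unlock();
        }
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new ConcurrentHashMap<>();
        Queue<String> pending = new ArrayDeque<>(List.of("emr", "hazelcast"));
        Lock lock = new ReentrantLock();

        withLock(lock, () -> countWords(counts, List.of("hazelcast", "grid", "hazelcast")));
        int moved = drain(pending, counts);
        System.out.println(moved + " moved, counts = " + counts);
    }
}
```

Keeping the method signatures on the interfaces means the choice between a local and a clustered implementation is made in one place, where the collections are created.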

"On demand processing power" is the primary reason developers give for choosing Amazon EMR over its competitors, whereas "High Availability" is cited as the key factor in picking Hazelcast.

Hazelcast is an open source tool with 6.4K GitHub stars and 1.9K GitHub forks. Its source repository is hosted on GitHub.

Netflix, Medium, and Yelp are some of the popular companies that use Amazon EMR, whereas Hazelcast is used by Yammer, Seat Pagine Gialle, and Para. Amazon EMR has broader adoption: it is mentioned in 95 company stacks and 18 developer stacks, compared with Hazelcast, which is listed in 26 company stacks and 16 developer stacks.


Detailed Comparison

Amazon EMR	Hazelcast
Description

Amazon EMR: It is used in a variety of applications, including log analysis, data warehousing, machine learning, financial analysis, scientific simulation, and bioinformatics.
Hazelcast: With its various distributed data structures, distributed caching capabilities, elastic nature, memcache support, and integration with Spring and Hibernate, Hazelcast is a feature-rich, enterprise-ready, and developer-friendly in-memory data grid solution.

Features

Amazon EMR:
Elastic: Amazon EMR enables you to quickly and easily provision as much capacity as you need and add or remove capacity at any time. Deploy multiple clusters or resize a running cluster.
Low Cost: Amazon EMR is designed to reduce the cost of processing large amounts of data. Some of the features that make it low cost include low hourly pricing, Amazon EC2 Spot integration, Amazon EC2 Reserved Instance integration, elasticity, and Amazon S3 integration.
Flexible Data Stores: With Amazon EMR, you can leverage multiple data stores, including Amazon S3, the Hadoop Distributed File System (HDFS), and Amazon DynamoDB.
Hadoop Tools: EMR supports powerful and proven Hadoop tools such as Hive, Pig, and HBase.

Hazelcast:
Distributed implementations of java.util.{Queue, Set, List, Map}
Distributed implementation of java.util.concurrent.locks.Lock
Distributed implementation of java.util.concurrent.ExecutorService
Distributed MultiMap for one-to-many relationships
Distributed Topic for publish/subscribe messaging
Synchronous (write-through) and asynchronous (write-behind) persistence
Transaction support
Socket level encryption support for secure clusters
Second level cache provider for Hibernate
Monitoring and management of the cluster via JMX
Dynamic HTTP session clustering
Support for cluster info and membership events
Dynamic discovery, scaling, partitioning with backups and fail-over

Statistics
Metric	Amazon EMR	Hazelcast
GitHub Stars	-	6.4K
GitHub Forks	-	1.9K
Stacks	544	428
Followers	682	474
Votes	54	59
Pros & Cons
Amazon EMR pros (votes):
On demand processing power (15)
Don't need to maintain Hadoop Cluster yourself (12)
Hadoop Tools (7)
Elastic (6)
Backed by Amazon (4)

Hazelcast pros (votes):
High Availability (11)
Distributed compute (6)
Distributed Locking (6)
Sharding (5)
Load balancing (4)

Hazelcast cons (votes):
License needed for SSL (4)

Integrations
Amazon EMR: No integrations available
Hazelcast: Java, Spring
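
The Hazelcast feature list distinguishes synchronous (write-through) from asynchronous (write-behind) persistence. As a rough illustration of the difference, the toy cache below updates a backing store either immediately on each put or later when deferred writes are flushed. This is not Hazelcast's implementation or API; the class `PersistenceModesSketch`, its fields, and the explicit `flush()` call are hypothetical, and a real system would typically flush from a background thread or timer.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

// Toy cache contrasting write-through and write-behind persistence.
// Single-threaded and illustrative only.
class PersistenceModesSketch {
    private final Map<String, String> cache = new HashMap<>();
    private final Map<String, String> backingStore; // stand-in for a database
    private final Queue<String> dirtyKeys = new ArrayDeque<>();
    private final boolean writeBehind;

    PersistenceModesSketch(Map<String, String> backingStore, boolean writeBehind) {
        this.backingStore = backingStore;
        this.writeBehind = writeBehind;
    }

    void put(String key, String value) {
        cache.put(key, value);
        if (writeBehind) {
            // Write-behind: remember the key and persist it later.
            dirtyKeys.add(key);
        } else {
            // Write-through: the store is updated before put() returns.
            backingStore.put(key, value);
        }
    }

    String get(String key) {
        return cache.get(key);
    }

    // Persists all deferred writes and returns how many store writes were made.
    int flush() {
        int written = 0;
        String key;
        while ((key = dirtyKeys.poll()) != null) {
            backingStore.put(key, cache.get(key));
            written++;
        }
        return written;
    }

    public static void main(String[] args) {
        Map<String, String> store = new HashMap<>();
        PersistenceModesSketch behind = new PersistenceModesSketch(store, true);
        behind.put("user:1", "alice");
        System.out.println("before flush: " + store);
        behind.flush();
        System.out.println("after flush: " + store);
    }
}
```

The trade-off the sketch makes visible is that write-behind returns from `put` sooner but leaves a window in which the cache and the store disagree, while write-through keeps them in step at the cost of a store write on every update.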

What are some alternatives to Amazon EMR and Hazelcast?

Redis

Redis is an open source (BSD licensed), in-memory data structure store, used as a database, cache, and message broker. Redis provides data structures such as strings, hashes, lists, sets, sorted sets with range queries, bitmaps, hyperloglogs, geospatial indexes, and streams.

Google BigQuery

Run super-fast, SQL-like queries against terabytes of data in seconds, using the processing power of Google's infrastructure. Load data with ease. Bulk load your data using Google Cloud Storage or stream it in. Easy access. Access BigQuery by using a browser tool, a command-line tool, or by making calls to the BigQuery REST API with client libraries such as Java, PHP or Python.

Amazon Redshift

It is optimized for data sets ranging from a few hundred gigabytes to a petabyte or more and costs less than $1,000 per terabyte per year, a tenth the cost of most traditional data warehousing solutions.

Qubole

Qubole is a cloud based service that makes big data easy for analysts and data engineers.

Aerospike

Aerospike is an open-source, modern database built from the ground up to push the limits of flash storage, processors and networks. It was designed to operate with predictable low latency at high throughput with uncompromising reliability – both high availability and ACID guarantees.

MemSQL

MemSQL converges transactions and analytics for sub-second data processing and reporting. Real-time businesses can build robust applications on a simple and scalable infrastructure that complements and extends existing data pipelines.

Apache Ignite

It is a memory-centric distributed database, caching, and processing platform for transactional, analytical, and streaming workloads, delivering in-memory speeds at petabyte scale.

Altiscale

We run Apache Hadoop for you. We not only deploy Hadoop, we monitor, manage, fix, and update it for you. Then we take it a step further: We monitor your jobs, notify you when something’s wrong with them, and can help with tuning.

Snowflake

Snowflake eliminates the administration and management demands of traditional data warehouses and big data platforms. Snowflake is a true data warehouse as a service running on Amazon Web Services (AWS)—no infrastructure to manage and no knobs to turn.

SAP HANA

It is an application that uses in-memory database technology that allows the processing of massive amounts of real-time data in a short time. The in-memory computing engine allows it to process data stored in RAM as opposed to reading it from a disk.
