Amazon Athena vs Amazon EMR

Overview

Amazon EMR: 544 stacks, 682 followers, 54 votes

Amazon Athena: 524 stacks, 840 followers, 49 votes

Amazon Athena vs Amazon EMR: What are the differences?

Amazon Athena and Amazon EMR are two key services provided by Amazon Web Services (AWS) for big data analytics. While both services offer solutions for processing and analyzing large amounts of data, they differ in several key aspects.

Data Processing Framework: Amazon Athena is a serverless interactive query service that lets you analyze data directly in Amazon S3 using standard SQL. It is a simple and cost-effective option for ad-hoc querying and analysis. Amazon EMR, by contrast, is a managed cluster platform that runs big data frameworks such as Apache Hadoop, Spark, and Presto on a cluster of EC2 instances. EMR is the more flexible and scalable choice for complex data processing tasks.
Managed Infrastructure: With Amazon Athena, you do not provision or manage any infrastructure. The service scales and manages the resources needed to run queries, so you can focus on data analysis. With Amazon EMR, you choose the instance types and cluster size and are responsible for configuring and maintaining the cluster. This gives you more control over the infrastructure at the price of more operational effort.
Data Compression and Partitioning: Amazon Athena can read compressed and partitioned data stored in Amazon S3. Both techniques reduce the amount of data a query scans, which improves query performance and lowers cost. Athena also applies a schema on read, so raw JSON data can be queried as structured rows without a separate conversion step. Amazon EMR supports compression and partitioning as well, and gives you more control over how data is stored and processed.
Cost Structure: Amazon Athena follows a pay-as-you-go pricing model in which you are billed for the amount of data scanned by your queries. This can be cost-effective for sporadic or ad-hoc analysis. Amazon EMR pricing has more components: the EC2 instances, the EMR charge on top of them, storage, and data transfer. It is better suited to long-running or consistently heavy workloads.
Ease of Use: Amazon Athena is designed to be easy to use, with little setup and no cluster administration. It integrates with other AWS services and supports standard SQL queries. Amazon EMR provides more flexibility and control but requires more setup and management, so it suits users with more advanced technical skills and specific requirements.
Data Processing Capabilities: Amazon Athena focuses on ad-hoc query processing and analysis and is optimized for fast, interactive queries on large datasets. Amazon EMR covers a broader range of workloads through the big data frameworks it runs, including batch processing, real-time streaming, machine learning, and graph analytics.
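
The cost difference described above can be made concrete with a rough model. The sketch below is illustrative only: the function names are hypothetical, and the rates passed in (price per TB scanned, hourly EC2 and EMR charges) are placeholder inputs, not current AWS prices. It also ignores storage and data transfer.

```python
TIB = 1024 ** 4  # bytes per tebibyte; assumed here as the billing unit


def athena_query_cost(bytes_scanned: int, price_per_tb: float) -> float:
    """Scan-based pricing: cost grows with the data a query reads."""
    return bytes_scanned / TIB * price_per_tb


def emr_cluster_cost(instances: int, hours: float,
                     ec2_hourly: float, emr_hourly: float) -> float:
    """Cluster pricing: every instance is billed for every hour it runs,
    whether or not a job is using it."""
    return instances * hours * (ec2_hourly + emr_hourly)


def break_even_queries(cluster_cost: float, cost_per_query: float) -> float:
    """Number of scan-billed queries that would cost as much as the cluster."""
    return cluster_cost / cost_per_query


if __name__ == "__main__":
    # All rates below are made-up placeholders for illustration.
    per_query = athena_query_cost(200 * 1024 ** 3, price_per_tb=5.0)
    cluster = emr_cluster_cost(instances=3, hours=10,
                               ec2_hourly=0.20, emr_hourly=0.05)
    print(f"one 200 GiB scan: {per_query:.2f}")
    print(f"3-node cluster for 10 h: {cluster:.2f}")
    print(f"break-even: {break_even_queries(cluster, per_query):.1f} queries")
```

The point of the model is the shape of the two curves: scan-based cost is zero when nobody queries, while cluster cost accrues for as long as the instances run.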

In summary, Amazon Athena is a serverless query service for ad-hoc analysis that offers simplicity and cost-effectiveness. Amazon EMR is a managed big data platform that offers more flexibility and power but requires more configuration and management effort.

Advice on Amazon EMR, Amazon Athena

Kevin

Co-founder at Transloadit

Dec 18, 2020

Review

Hey there, the trick to keeping costs under control is to partition. This means you split up your source files by date, and also query within dates, so that Athena only scans the few files necessary for those dates. I hope that makes sense (and I also hope I understood your question right). This article explains it better: https://aws.amazon.com/blogs/big-data/analyze-your-amazon-cloudfront-access-logs-at-scale/
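
To illustrate the partitioning advice above, here is a minimal sketch of a date-partitioned layout and a query that stays within a date range. The bucket name, the `dt` partition key, the table name `access_logs`, and the per-day sizes are all made up for the example; the last function only simulates how much data a date filter would leave to scan.

```python
from datetime import date, timedelta
from typing import Mapping


def partition_prefix(base: str, day: date) -> str:
    """Key prefix that stores one day of files under its own dt= folder."""
    return f"{base.rstrip('/')}/dt={day.isoformat()}/"


def date_bounded_query(table: str, start: date, end: date) -> str:
    """SQL that filters on the partition column so only those days are read."""
    return (
        f"SELECT count(*) FROM {table} "
        f"WHERE dt BETWEEN '{start.isoformat()}' AND '{end.isoformat()}'"
    )


def bytes_in_range(sizes: Mapping[date, int], start: date, end: date) -> int:
    """Total size of the partitions a date-bounded query would touch."""
    return sum(size for day, size in sizes.items() if start <= day <= end)


if __name__ == "__main__":
    first = date(2020, 12, 1)
    # Hypothetical data set: 30 daily partitions of 1 GB each.
    sizes = {first + timedelta(days=i): 10**9 for i in range(30)}

    print(partition_prefix("s3://example-bucket/logs", date(2020, 12, 18)))
    print(date_bounded_query("access_logs", date(2020, 12, 17), date(2020, 12, 18)))
    print("full scan bytes:", sum(sizes.values()))
    print("two-day scan bytes:",
          bytes_in_range(sizes, date(2020, 12, 17), date(2020, 12, 18)))
```

Because Athena bills by data scanned, a query limited to two of the thirty daily partitions reads a small fraction of what an unfiltered query would.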

Pavithra

Mar 12, 2020

Needs advice on Amazon S3, Amazon Athena, Amazon Redshift

Hi all,

Currently, we need to ingest data from Amazon S3 into a database, either Amazon Athena or Amazon Redshift. The problem is that the data is in .PSV (pipe-separated values) format and is over 200 GB in size. Query performance in Athena/Redshift is not up to the mark: queries time out or run far more slowly than in Google BigQuery. How can I improve performance and query response time? Can anyone please help me out?

Detailed Comparison

Description
Amazon EMR: It is used in a variety of applications, including log analysis, data warehousing, machine learning, financial analysis, scientific simulation, and bioinformatics.
Amazon Athena: Amazon Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL. Athena is serverless, so there is no infrastructure to manage, and you pay only for the queries that you run.

Features
Amazon EMR:
Elastic: Amazon EMR enables you to quickly and easily provision as much capacity as you need and add or remove capacity at any time. Deploy multiple clusters or resize a running cluster.
Low Cost: Amazon EMR is designed to reduce the cost of processing large amounts of data. Some of the features that make it low cost include low hourly pricing, Amazon EC2 Spot integration, Amazon EC2 Reserved Instance integration, elasticity, and Amazon S3 integration.
Flexible Data Stores: With Amazon EMR, you can leverage multiple data stores, including Amazon S3, the Hadoop Distributed File System (HDFS), and Amazon DynamoDB.
Hadoop Tools: EMR supports powerful and proven Hadoop tools such as Hive, Pig, and HBase.
Amazon Athena: none listed

Statistics
	Amazon EMR	Amazon Athena
Stacks	544	524
Followers	682	840
Votes	54	49

Pros & Cons
Amazon EMR pros: On demand processing power (15); Don't need to maintain Hadoop Cluster yourself (12); Hadoop Tools (7); Elastic (6); Backed by Amazon (4)
Amazon Athena pros: Use SQL to analyze CSV files (16); Glue crawlers gives easy Data catalogue (8); Cheap (7); Query all my data without running servers 24x7 (6); No data base servers yay (4)

Integrations
Amazon EMR: No integrations available
Amazon Athena: Amazon S3, Presto
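
To give a feel for what provisioning an EMR cluster involves, the sketch below builds the kind of request dictionary that boto3's EMR `run_job_flow` call accepts, without sending it to AWS. The helper name, release label, instance type, node counts, and role names are assumptions chosen for illustration, not recommendations; the task group uses Spot capacity to reflect the Spot integration mentioned in the comparison.

```python
from typing import Any, Dict, List


def build_cluster_request(name: str, core_nodes: int,
                          spot_task_nodes: int) -> Dict[str, Any]:
    """Describe a cluster: one primary node, on-demand core nodes,
    and optional Spot task nodes that can be added or removed."""
    groups: List[Dict[str, Any]] = [
        {"Name": "primary", "InstanceRole": "MASTER", "Market": "ON_DEMAND",
         "InstanceType": "m5.xlarge", "InstanceCount": 1},
        {"Name": "core", "InstanceRole": "CORE", "Market": "ON_DEMAND",
         "InstanceType": "m5.xlarge", "InstanceCount": core_nodes},
    ]
    if spot_task_nodes > 0:
        groups.append(
            {"Name": "task-spot", "InstanceRole": "TASK", "Market": "SPOT",
             "InstanceType": "m5.xlarge", "InstanceCount": spot_task_nodes})
    return {
        "Name": name,
        "ReleaseLabel": "emr-6.15.0",  # assumed release; pick your own
        "Applications": [{"Name": "Spark"}, {"Name": "Hive"}],
        "Instances": {
            "InstanceGroups": groups,
            # Shut the cluster down once its steps finish.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "JobFlowRole": "EMR_EC2_DefaultRole",  # assumed default role names
        "ServiceRole": "EMR_DefaultRole",
    }


if __name__ == "__main__":
    request = build_cluster_request("nightly-etl", core_nodes=2, spot_task_nodes=4)
    for group in request["Instances"]["InstanceGroups"]:
        print(group["InstanceRole"], group["Market"], group["InstanceCount"])
    # With AWS credentials configured, the request could be submitted with
    # boto3.client("emr").run_job_flow(**request); that call is not made here.
```

None of these decisions (instance types, node counts, markets, roles) exist on the Athena side, which is the practical meaning of "serverless" in this comparison.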

What are some alternatives to Amazon EMR, Amazon Athena?

Google BigQuery

Run super-fast, SQL-like queries against terabytes of data in seconds, using the processing power of Google's infrastructure. Load data with ease: bulk load it from Google Cloud Storage or stream it in. Access BigQuery through a browser tool, a command-line tool, or calls to the BigQuery REST API with client libraries for languages such as Java, PHP, or Python.

Apache Spark

Spark is a fast and general processing engine compatible with Hadoop data. It can run in Hadoop clusters through YARN or Spark's standalone mode, and it can process data in HDFS, HBase, Cassandra, Hive, and any Hadoop InputFormat. It is designed to perform both batch processing (similar to MapReduce) and new workloads like streaming, interactive queries, and machine learning.

Amazon Redshift

It is optimized for data sets ranging from a few hundred gigabytes to a petabyte or more and costs less than $1,000 per terabyte per year, a tenth the cost of most traditional data warehousing solutions.

Qubole

Qubole is a cloud based service that makes big data easy for analysts and data engineers.

Presto

Distributed SQL Query Engine for Big Data

Apache Flink

Apache Flink is an open source system for fast and versatile data analytics in clusters. Flink supports batch and streaming analytics in one system. Analytical programs can be written in concise and elegant APIs in Java and Scala.

lakeFS

It is an open-source data version control system for data lakes. It provides a “Git for data” platform enabling you to implement best practices from software engineering on your data lake, including branching and merging, CI/CD, and production-like dev/test environments.

Druid

Druid is a distributed, column-oriented, real-time analytics data store that is commonly used to power exploratory dashboards in multi-tenant environments. Druid excels as a data warehousing solution for fast aggregate queries on petabyte sized data sets. Druid supports a variety of flexible filters, exact calculations, approximate algorithms, and other useful calculations.

Altiscale

We run Apache Hadoop for you. We not only deploy Hadoop, we monitor, manage, fix, and update it for you. Then we take it a step further: we monitor your jobs, notify you when something's wrong with them, and can help with tuning.

Snowflake

Snowflake eliminates the administration and management demands of traditional data warehouses and big data platforms. Snowflake is a true data warehouse as a service running on Amazon Web Services (AWS)—no infrastructure to manage and no knobs to turn.
