Amazon Athena vs Amazon EMR: What are the differences?

Amazon Athena and Amazon EMR are two key services provided by Amazon Web Services (AWS) for big data analytics. While both services offer solutions for processing and analyzing large amounts of data, they differ in several key aspects.

  1. Data Processing Framework: Amazon Athena is a serverless interactive query service that lets you analyze data directly in Amazon S3 using standard SQL. It provides a simple and cost-effective option for ad-hoc querying and analysis. Amazon EMR, on the other hand, is a managed cluster platform that runs big data frameworks like Apache Hadoop, Spark, and Presto on a cluster of EC2 instances. EMR provides a more flexible solution for complex data processing tasks.

  2. Managed Infrastructure: With Amazon Athena, you do not need to provision or manage any infrastructure. It automatically scales and manages the underlying resources required to run queries, allowing you to focus on data analysis. In contrast, Amazon EMR requires you to provision and manage a cluster of EC2 instances. This gives you more control over the infrastructure but also requires additional effort in terms of configuration and maintenance.

  3. Data Compression and Partitioning: Amazon Athena supports data compression and partitioning to improve query performance and reduce costs, because both reduce the amount of data a query has to scan. It can read compressed and partitioned data stored in Amazon S3 once the table and its partitions are defined in the data catalog. Athena also applies schema-on-read, so raw JSON data can be queried as structured columns without being converted first. Amazon EMR also supports data compression and partitioning, and gives you more control and flexibility in defining how data is stored and processed.

  4. Cost Structure: Amazon Athena follows a pay-as-you-go pricing model, where you are billed based on the amount of data scanned by your queries. This can be cost-effective for sporadic or ad-hoc analysis tasks. Amazon EMR, on the other hand, has a more complex pricing structure that includes costs for EC2 instances, storage, and data transfer. It is more suitable for long-running or consistently high workloads.

  5. Ease of Use: Amazon Athena is designed to be easy to use and does not require any setup or administration overhead. It integrates seamlessly with other AWS services and supports standard SQL queries. In comparison, Amazon EMR provides more flexibility and control but also requires more setup and management. It is suitable for users with more advanced technical skills and specific requirements.

  6. Data Processing Capabilities: Amazon Athena is primarily focused on ad-hoc query processing and analysis. It is optimized for fast, interactive queries on large datasets. Amazon EMR, on the other hand, supports a broader range of data processing capabilities through its support for various big data frameworks. This includes batch processing, real-time streaming, machine learning, and graph analytics.

In summary, Amazon Athena is a serverless query service for ad-hoc analysis, providing simplicity and cost-effectiveness. Amazon EMR is a managed big data platform, offering more flexibility and power but also requiring more configuration and management effort.
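
The cost trade-off in point 4 can be made concrete with a little arithmetic. The sketch below is only an illustration: the rates, node count, and data size are hypothetical placeholders rather than real AWS prices, and the cluster model ignores storage and data transfer charges.

```python
BYTES_PER_TB = 1024 ** 4  # assumption: binary terabytes, for illustration only


def athena_query_cost(bytes_scanned: int, usd_per_tb_scanned: float) -> float:
    """Cost of one query that is billed on the amount of data it scans."""
    return (bytes_scanned / BYTES_PER_TB) * usd_per_tb_scanned


def emr_cluster_cost(nodes: int, hours: float, usd_per_node_hour: float) -> float:
    """Cost of a cluster billed per node-hour, whether or not it is busy."""
    return nodes * hours * usd_per_node_hour


def break_even_queries(cluster_cost: float, cost_per_query: float) -> float:
    """Number of queries at which per-query billing equals the cluster bill."""
    return cluster_cost / cost_per_query


if __name__ == "__main__":
    # Hypothetical inputs: a 200 GB full scan, a 4-node cluster running for a day.
    per_query = athena_query_cost(200 * 1024 ** 3, usd_per_tb_scanned=5.0)
    cluster = emr_cluster_cost(nodes=4, hours=24, usd_per_node_hour=0.25)
    print(f"per query: ${per_query:.2f}, cluster per day: ${cluster:.2f}")
    print(f"break-even: {break_even_queries(cluster, per_query):.1f} queries per day")
```

Below the break-even point, per-query billing is cheaper; above it, a cluster that is already running becomes the better deal. Compressing or partitioning the data lowers `bytes_scanned`, which moves the break-even point further out.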

Advice on Amazon Athena and Amazon EMR

Hi all,

Currently, we need to ingest data from Amazon S3 into either Amazon Athena or Amazon Redshift. The problem is that the data is in .PSV (pipe-separated values) format and is more than 200 GB in size. Query performance in Athena/Redshift is not up to the mark: queries time out or run too slowly compared to Google BigQuery. How can I optimize the performance and the query result time? Can anyone please help me out?

Replies (4)

You can use the AWS Glue service to convert your pipe-separated data to Parquet format, which also gives you data compression. You should then choose Redshift and copy your data into it, since the data set is very large. To manage your data, partition it in the S3 bucket and also distribute it across the Redshift cluster.

Carlos Acedo, Data Technologies Manager at SDG Group Iberia, recommends Amazon Redshift:

First of all, you should choose between Redshift and Athena based on your use case, since they are two very different services: Redshift is an enterprise-grade MPP data warehouse, while Athena is a SQL layer on top of S3 with limited performance. If performance is a key factor, users are going to execute unpredictable queries, and direct and management costs are not a problem, I'd definitely go for Redshift. If performance is not so critical and queries will be somewhat predictable, I'd go for Athena.

Once you select the technology, you'll need to optimize your data to get queries executed as fast as possible. In both cases you may need to adapt the data model to fit your queries better. If you go for Athena, you'd also probably need to change your file format to Parquet or Avro and review your partition strategy depending on your most frequent type of query. If you choose Redshift, you'll need to ingest the data from your files into it and maybe carry out some tuning tasks for performance gain.

I'll recommend Redshift for now since it can address a wider range of use cases, but we could give you better advice if you described your use case in depth.
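
To illustrate the partition strategy mentioned in this reply, here is a minimal sketch that groups pipe-separated rows under Hive-style `column=value/` prefixes, the layout commonly used for partitioned data on S3. The function name, the sample data, and the `event_date` column are hypothetical, and it works on an in-memory string rather than on S3 objects.

```python
import csv
import io
from collections import defaultdict
from typing import Dict, List


def partition_psv(text: str, partition_col: str) -> Dict[str, List[Dict[str, str]]]:
    """Group PSV rows by a Hive-style prefix such as 'event_date=2024-01-01/'."""
    reader = csv.DictReader(io.StringIO(text), delimiter="|")
    groups: Dict[str, List[Dict[str, str]]] = defaultdict(list)
    for row in reader:
        # The partition value moves into the prefix, so it is dropped from the row.
        value = row.pop(partition_col)
        groups[f"{partition_col}={value}/"].append(row)
    return dict(groups)


if __name__ == "__main__":
    # Hypothetical sample data in pipe-separated format.
    sample = (
        "id|event_date|amount\n"
        "1|2024-01-01|10\n"
        "2|2024-01-02|20\n"
        "3|2024-01-01|30\n"
    )
    for prefix, rows in sorted(partition_psv(sample, "event_date").items()):
        print(prefix, rows)
```

Choosing the column that most queries filter on as `partition_col` lets the query engine skip whole prefixes instead of scanning every file.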

Alexis Blandin recommends Amazon Athena:

It depends on the nature of your data (structured or not?) and, of course, on your queries (ad-hoc or predictable?). For example, you can look at partitioning and a columnar format to maximize MPP capabilities for both Athena and Redshift.


You can convert your PSV data to the Parquet file format with AWS Glue, and your query performance will then improve.
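
The replies suggest AWS Glue for the PSV-to-Parquet conversion. As an alternative sketch (not the approach the replies describe), the same conversion can be expressed as two Athena statements: an external table over the pipe-delimited files, then a CREATE TABLE AS SELECT that writes partitioned Parquet. The helper names, table names, columns, and bucket paths below are hypothetical; the code only builds the SQL text and does not run it.

```python
from typing import List, Tuple


def psv_table_ddl(table: str, columns: List[Tuple[str, str]], location: str) -> str:
    """DDL for an external table over pipe-delimited text files."""
    cols = ",\n  ".join(f"{name} {sql_type}" for name, sql_type in columns)
    return (
        f"CREATE EXTERNAL TABLE {table} (\n  {cols}\n)\n"
        "ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'\n"
        f"LOCATION '{location}'"
    )


def parquet_ctas(source: str, target: str, location: str,
                 data_cols: List[str], partition_col: str) -> str:
    """CTAS statement that rewrites the source table as partitioned Parquet."""
    # In a partitioned CTAS the partition column must come last in the SELECT list.
    select_list = ", ".join(data_cols + [partition_col])
    return (
        f"CREATE TABLE {target}\n"
        f"WITH (format = 'PARQUET', external_location = '{location}', "
        f"partitioned_by = ARRAY['{partition_col}'])\n"
        f"AS SELECT {select_list} FROM {source}"
    )


if __name__ == "__main__":
    # Hypothetical table names, columns, and S3 paths.
    print(psv_table_ddl(
        "events_psv",
        [("id", "bigint"), ("amount", "double"), ("event_date", "string")],
        "s3://example-bucket/raw/",
    ))
    print()
    print(parquet_ctas("events_psv", "events_parquet",
                       "s3://example-bucket/parquet/", ["id", "amount"], "event_date"))
```

Because Parquet is columnar and compressed, queries against the converted table typically scan far less data than queries against the raw PSV files.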

Pros of Amazon Athena (upvotes in parentheses)

  • Use SQL to analyze CSV files (16)
  • Glue crawlers give an easy data catalog (8)
  • Cheap (7)
  • Query all my data without running servers 24x7 (6)
  • No database servers, yay (4)
  • Easy integration with QuickSight (3)
  • Query and analyze CSV, Parquet, and JSON files in SQL (2)
  • Glue and Athena also use the same data catalog (2)
  • No configuration required (1)
  • Ad hoc checks on data made easy (0)

Pros of Amazon EMR (upvotes in parentheses)

  • On-demand processing power (15)
  • Don't need to maintain a Hadoop cluster yourself (12)
  • Hadoop tools (7)
  • Elastic (6)
  • Backed by Amazon (4)
  • Flexible (3)
  • Economic: pay as you go, easy-to-use CLI and SDKs (3)
  • Don't need a dedicated Ops group (2)
  • Massive data handling (1)
  • Great support (1)


What is Amazon Athena?

Amazon Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL. Athena is serverless, so there is no infrastructure to manage, and you pay only for the queries that you run.

What is Amazon EMR?

Amazon EMR is used in a variety of applications, including log analysis, data warehousing, machine learning, financial analysis, scientific simulation, and bioinformatics.

What are some alternatives to Amazon Athena and Amazon EMR?

  • Presto: Distributed SQL Query Engine for Big Data.
  • Amazon Redshift Spectrum: With Redshift Spectrum, you can extend the analytic power of Amazon Redshift beyond data stored on local disks in your data warehouse to query vast amounts of unstructured data in your Amazon S3 “data lake” -- without having to load or transform any data.
  • Amazon Redshift: It is optimized for data sets ranging from a few hundred gigabytes to a petabyte or more and costs less than $1,000 per terabyte per year, a tenth the cost of most traditional data warehousing solutions.
  • Cassandra: Partitioning means that Cassandra can distribute your data across multiple machines in an application-transparent manner. Cassandra will automatically repartition as machines are added and removed from the cluster. Row store means that, like relational databases, Cassandra organizes data by rows and columns. The Cassandra Query Language (CQL) is a close relative of SQL.
  • Spectrum: The community platform for the future.