Amazon Athena

Needs advice
on
Ali9t apache aws
and
Trino

Could you please suggest the best database engine for on-premise use? We have used Amazon Athena in the cloud and are looking for a similar product for on-premise. It should support the Node.js programming language.

2 upvotes·1.8K views
Needs advice
on
Amazon Athena
and
Amazon DynamoDB

So, I have data in Amazon S3 as parquet files and I have it available in the Glue data catalog too. I want to build an AppSync API on top of this data. Now the two options that I am considering are:

  1. Bring the data to Amazon DynamoDB and then build my API on top of this Database.

  2. Add a Lambda function that resolves Amazon Athena queries made by AppSync.

Which of the two approaches will be cost effective?

I would really appreciate some back of the envelope estimates too.

Note: I only expect to make read queries. Thanks.

5 upvotes·21.8K views
Replies (2)
Lead Developer at Di-Vision Consultion·
Recommends
on
Amazon DynamoDB

Overall, I would think that if the data fits in AWS DynamoDB and you are able to Query it (not Scan), that would be a bit more cost-effective. But it all depends on the size of, and the rate of change in, the data.

On relatively stale data, Athena could be cheaper under big loads when the data is processed via Glue; the Lambda costs are quite small. DynamoDB could become expensive under big loads of reads and/or writes.
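For reference, option 2 from the question can be sketched roughly as below. This is a hypothetical minimal resolver, not a complete implementation: the database name, table handling, and the results bucket (`s3://my-athena-results/`) are assumptions.

```python
import time

def build_query(table, limit=100):
    # Read-only SELECT with a LIMIT; Athena bills per TB scanned,
    # so keeping queries narrow is the main cost lever.
    return f'SELECT * FROM "my_database"."{table}" LIMIT {int(limit)}'

def handler(event, context):
    import boto3  # provided in the Lambda runtime
    athena = boto3.client("athena")
    qid = athena.start_query_execution(
        QueryString=build_query(event["arguments"]["table"]),
        ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
    )["QueryExecutionId"]
    # Athena is asynchronous: poll until the query reaches a terminal state.
    while True:
        state = athena.get_query_execution(QueryExecutionId=qid)[
            "QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(0.5)
    if state != "SUCCEEDED":
        raise RuntimeError(f"Athena query ended in state {state}")
    # Return the raw rows to AppSync; a real resolver would reshape them.
    return athena.get_query_results(QueryExecutionId=qid)["ResultSet"]["Rows"]
```

Note that a per-request Athena query adds multi-second latency to the API, which matters for the cost/UX trade-off discussed above.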

3 upvotes·2K views
Engineering leader at Cyren·

It really depends on the data size and the number of requests. If latency isn't an issue for this use case, Amazon Athena might be a good solution, provided you partition the data correctly to make it effective enough. DynamoDB is a key-value DB, so it really depends on the use case: you might not be able to retrieve the relevant info.

2 upvotes·2.7K views
Senior Product Engineer ·

Hey all, I need some suggestions on creating a replica of our RDS DB for reporting and analytical purposes. Cost is a major factor. I was thinking of using AWS Glue to move data from Amazon RDS to Amazon S3 and using Amazon Athena to run queries on it. Any other suggestions would be appreciated.

2 upvotes·60.3K views
Replies (1)
VP Engineering at Onefootball·

If cost is a major factor, I suggest you either A) look at open-source tools that you can run on compute you already pay for, or B) use AWS services within the free tier.

For option A), check out Singer taps and targets. For option B), check out AWS DMS (Database Migration Service). It's made for replicating data, and your use case is described here: https://docs.aws.amazon.com/dms/latest/userguide/CHAP_Target.S3.html

3 upvotes·706 views

Hi all,

Currently, we need to ingest data from Amazon S3 into a DB, either Amazon Athena or Amazon Redshift. But the problem with the data is that it is in .PSV (pipe-separated values) format, and its size is above 200 GB. Query performance in Athena/Redshift is not up to the mark: queries time out or run too slowly compared to Google BigQuery. How would I optimize the performance and query result time? Can anyone please help me out?

3 upvotes·488.1K views
Replies (4)

You can use the AWS Glue service to convert your pipe-format data to Parquet format, and thus achieve data compression. Then you should choose Redshift to copy your data into, as it is very large. To manage your data, partition it in the S3 bucket and also distribute it across the Redshift cluster.
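The conversion step suggested above could be sketched as follows. This is a hypothetical sketch of what a Glue (or plain Python) job would do; the output layout and the `dt` partition column are assumptions, and the Parquet step relies on pandas/pyarrow being available.

```python
def split_psv_line(line):
    # PSV is just CSV with "|" as the delimiter
    return line.rstrip("\n").split("|")

def partition_key(date_str):
    # Hive-style partition folder that Athena recognizes automatically
    return f"dt={date_str}"

def psv_to_parquet(src_path, out_dir, date_str):
    import pandas as pd  # pandas + pyarrow do the heavy lifting in a Glue job
    df = pd.read_csv(src_path, sep="|")  # read the pipe-separated source
    dest = f"{out_dir}/{partition_key(date_str)}/part-0.parquet"
    # Columnar + compressed: Athena scans only the columns a query touches
    df.to_parquet(dest, compression="snappy")
    return dest
```

Laying files out under `dt=...` prefixes is what lets Athena prune partitions instead of scanning all 200 GB on every query.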

7 upvotes·191.8K views
Data Technologies Manager at SDG Group Iberia·
Recommends
on
Amazon Redshift

First of all, you should choose between Redshift and Athena based on your use case, since they are two very different services: Redshift is an enterprise-grade MPP data warehouse, while Athena is a SQL layer on top of S3 with limited performance. If performance is a key factor, users are going to execute unpredictable queries, and direct and management costs are not a problem, I'd definitely go for Redshift. If performance is not so critical and queries will be somewhat predictable, I'd go for Athena.

Once you select the technology, you'll need to optimize your data to get queries executed as fast as possible. In both cases you may need to adapt the data model to fit your queries better. If you go for Athena, you'd also probably need to change your file format to Parquet or Avro and review your partition strategy based on your most frequent type of query. If you choose Redshift, you'll need to ingest the data from your files into it and perhaps carry out some tuning tasks for performance gain.
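If you take the Athena route, registering the converted Parquet data as a partitioned external table might look roughly like the DDL built below. This is a hedged sketch: the database, table, columns, and S3 location are all assumptions for illustration.

```python
def create_table_ddl(database, table, location):
    # Partitioning by dt lets Athena skip whole S3 prefixes at query time;
    # after creating the table, run MSCK REPAIR TABLE (or ALTER TABLE ADD
    # PARTITION) so Athena discovers the dt=... folders.
    return (
        f"CREATE EXTERNAL TABLE {database}.{table} (\n"
        "  id string,\n"
        "  amount double\n"
        ")\n"
        "PARTITIONED BY (dt string)\n"
        "STORED AS PARQUET\n"
        f"LOCATION '{location}'"
    )
```

A query that filters on `dt` then reads only the matching partitions, which is usually the single biggest lever on both speed and per-TB-scanned cost.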

I'd recommend Redshift for now, since it can address a wider range of use cases, but we could give you better advice if you described your use case in more depth.

5 upvotes·233.1K views