Amazon S3

kew44

Needs advice

I'm trying to establish a data lake (or maybe a puddle) for my org's Data Sharing project. The idea is that outside partners would send cuts of their PHI data, regardless of format/variables/systems, to our Data Team, who would then harmonize the data, create data marts, and eventually use it for something. End-to-end, I'm envisioning:

Ingestion -> Secure, role-based, self-service portal for users to upload data (1a. bonus points if it can perform basic validations/masking)
Storage -> Amazon S3 seems like the cheapest. We probably won't need very much, even at full capacity. Our current storage is a secure Box folder that has ~4GB with several batches of test data, code, presentations, and planning docs.
Data Catalog -> AWS Glue? Azure Data Factory? Snowplow? Is the main difference basically the vendor? We will also have Data Dictionaries/Codebooks from submitters. Where would they fit in?
Partitions -> I've seen Cassandra and YARN mentioned, but have no experience with either.
Processing -> We want to use SAS if at all possible. What will work with SAS code?
Pipeline/Automation -> The check-in and verification processes that have been outlined are rather involved. Some sort of automated messaging or approval workflow would be nice.
I have very little guidance on what a "Data Mart" should look like, so I'm going with the idea that it would be another "experimental" partition. Unless there's an actual mart-building paradigm I've missed?
An end user might use the catalog to pull certain de-identified data sets from the marts. Again, role-based access and a self-service GUI would be preferable. I'm the only full-time tech person on this project, but I'm mostly an OOP, HTML, JavaScript, and some SQL programmer. Most of this is outside my repertoire. I've done a lot of research, but I can't be an effective evangelist without hands-on experience. Since we're starting a new year of our grant, they've finally decided to let me try some stuff out. Any pointers would be appreciated!

6 upvotes·109.9K views

Sindhumathi Parameswaran

Aug 4, 2022

Needs advice

Amazon S3

and

MySQL

Hi, I'm working on a project to integrate data from Shopify (e-commerce platform) into Amazon QuickSight. I'm trying to decide which database to use, either Amazon S3 or MySQL.

3 upvotes·41K views

Needs advice

Hi team, we are building an e-commerce marketplace with both sides (sellers and buyers). We have finished the frontend development, but we are now facing the choice of database. We would use Amazon S3 as cloud storage, but it is hard to choose between SQL and NoSQL (Heroku Postgres and MongoDB). For the tech stack, we are using Next.js and Node.js for the frontend and backend. Please share your professional input. Thanks in advance.

9 upvotes·34.9K views

Replies (1)

Ray Arayilakath

Full-stack Developer && Software Engineer at Self-employed·May 7, 2022

Recommends

MongoDB

There are many things to consider when choosing between NoSQL and SQL databases, but in short, SQL databases are often more performant, whereas NoSQL databases are friendlier for developers to create and maintain.

Since you are building this website from scratch, it would be best to use MongoDB because it makes for an easier developer experience, cuts down on the time needed to set up a database, removes the need for someone specialized in database development (unlike SQL databases, which need to be engineered with tables and rows, NoSQL databases are rather intuitive), and is overall a better option for a startup.
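To make the "tables and rows" versus document contrast concrete, here is a minimal sketch using only the Python standard library. The listing data, table names, and columns are hypothetical, and sqlite3 and JSON merely stand in for a relational database and a document store; this is not how either Heroku Postgres or MongoDB must be used.

```python
import json
import sqlite3

# Hypothetical marketplace listing, used only for illustration.
listing = {
    "title": "Handmade mug",
    "seller": {"name": "Ana", "rating": 4.8},
    "tags": ["ceramic", "kitchen"],
}

# Document style: the nested structure is stored as a single JSON document.
document = json.dumps(listing)

# Relational style: the same data is split across tables with a declared schema.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE sellers (id INTEGER PRIMARY KEY, name TEXT, rating REAL);
    CREATE TABLE listings (id INTEGER PRIMARY KEY, title TEXT,
                           seller_id INTEGER REFERENCES sellers(id));
    CREATE TABLE listing_tags (listing_id INTEGER REFERENCES listings(id), tag TEXT);
    """
)
seller_id = conn.execute(
    "INSERT INTO sellers (name, rating) VALUES (?, ?)",
    (listing["seller"]["name"], listing["seller"]["rating"]),
).lastrowid
listing_id = conn.execute(
    "INSERT INTO listings (title, seller_id) VALUES (?, ?)",
    (listing["title"], seller_id),
).lastrowid
conn.executemany(
    "INSERT INTO listing_tags (listing_id, tag) VALUES (?, ?)",
    [(listing_id, tag) for tag in listing["tags"]],
)

# Reading the listing back together with its seller requires a join.
row = conn.execute(
    "SELECT l.title, s.name FROM listings l JOIN sellers s ON s.id = l.seller_id"
).fetchone()
```

The relational layout takes more up-front design, but it also lets the database enforce the relationships between sellers, listings, and tags.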

One consideration to keep in mind is that switching from a NoSQL database to a SQL database later on is very, very tricky. If you find that you want to improve performance, or later prefer SQL databases, it's not going to be an easy feat. However, you may come to love NoSQL databases, and your team will appreciate their simplicity as well. Another consideration is cost. I'm not familiar with the expenses involved in using Heroku PostgreSQL, but MongoDB can become a financial overhead when your project starts to become more user-heavy and you perform more operations and store more data. You may want to perform a quick cost analysis before choosing your preferred database.

I hope this helps you (or anyone else with a similar question in mind) out a bit and good luck with your marketplace!

3 upvotes·6K views

Needs advice

So, I have data in Amazon S3 as Parquet files, and I have it available in the Glue Data Catalog too. I want to build an AppSync API on top of this data. The two options that I am considering are:

Bring the data to Amazon DynamoDB and then build my API on top of this Database.
Add a Lambda function that resolves Amazon Athena queries made by AppSync.

Which of the two approaches will be cost effective?

I would really appreciate some back of the envelope estimates too.

Note: I only expect to make read queries. Thanks.

5 upvotes·43.5K views

Replies (2)

Roel van den Brand

Lead Developer at Di-Vision Consultion·May 2, 2022

Recommends

Amazon DynamoDB

Overall, I would think that if the data fits in DynamoDB and you can use Query (not Scan) operations, DynamoDB would be a bit more cost-effective. But it all depends on the size of the data and how often it changes.

On relatively static data, Athena could be cheaper under big loads when the data is processed via Glue, and the Lambda costs are quite small. DynamoDB could become expensive under heavy loads (reads and/or writes).
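For a back-of-the-envelope comparison, the trade-off above can be sketched as two small functions. This assumes Athena is billed by the amount of data scanned and on-demand DynamoDB by the number of read requests; the function names are hypothetical, all prices are parameters you would take from the current AWS pricing pages for your region, and storage, Lambda, and data-loading costs are left out.

```python
def athena_monthly_cost(queries_per_month, gb_scanned_per_query, usd_per_tb_scanned):
    """Athena-style pricing: cost grows with queries times data scanned per query."""
    tb_scanned = queries_per_month * gb_scanned_per_query / 1024
    return tb_scanned * usd_per_tb_scanned


def dynamodb_monthly_read_cost(reads_per_month, usd_per_million_reads):
    """On-demand-style pricing: cost grows with the number of read requests."""
    return reads_per_month / 1_000_000 * usd_per_million_reads


# Placeholder workload and prices, for illustration only (not quoted AWS prices).
athena = athena_monthly_cost(
    queries_per_month=100_000, gb_scanned_per_query=0.5, usd_per_tb_scanned=5.0
)
dynamodb = dynamodb_monthly_read_cost(
    reads_per_month=100_000, usd_per_million_reads=0.25
)
```

The useful part is the shape of the formulas: partitioning or compressing the Parquet files lowers `gb_scanned_per_query`, while the DynamoDB figure depends only on request volume, not on how much data sits in the table.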

3 upvotes·2.2K views

Eran Levy

Engineering leader at Cyren·Apr 26, 2022

It really depends on the data size and the number of requests. If latency isn't an issue for this use case, then Amazon Athena might be a good solution, provided you partition the data correctly so that queries are efficient. DynamoDB is a key-value database, so it really depends on the use case: you might not be able to retrieve the relevant info.
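The key-value caveat can be illustrated with a small in-memory sketch. It only mimics the access pattern and does not use the DynamoDB API; the items, the `customer_id` partition key, and the function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical items; "customer_id" plays the role of the partition key.
items = [
    {"customer_id": "c1", "order_id": "o1", "country": "NL", "total": 20},
    {"customer_id": "c1", "order_id": "o2", "country": "NL", "total": 35},
    {"customer_id": "c2", "order_id": "o3", "country": "IL", "total": 50},
]

# Index by partition key, the way a key-value store organizes its data.
by_customer = defaultdict(list)
for item in items:
    by_customer[item["customer_id"]].append(item)


def query(customer_id):
    """Key-based lookup: touches only the items stored under that key."""
    return by_customer.get(customer_id, [])


def scan(predicate):
    """Filter on a non-key attribute: every item has to be examined."""
    examined = 0
    matches = []
    for item in items:
        examined += 1
        if predicate(item):
            matches.append(item)
    return matches, examined
```

If the API needs to filter on attributes that are not part of the key (here, `country`), every read degrades into the scan case unless an additional index is designed for it.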

2 upvotes·2.8K views

Needs advice

I have been building a website with Gatsby (for a small group of volunteers). I track it in GitHub and push it to Amazon S3.

I am satisfied with it as a single user; however, I would like non-technical teammates to be able to post Markdown blog posts. I tried to teach them to add mdx files, git push, gatsby build, and publish with gatsby-plugin-s3, but I am getting a fair amount of resistance :).

So I wonder if there are tools, preferably using Node.js, that support multiple blog authors à la WordPress, i.e. with an interface for non-technical bloggers, but producing static/pre-rendered web pages.

(PS: I am considering having a Node/Express.js server where they could upload their mdx file and the server would rebuild, push, and publish for them, without having them install anything, but I'd like to know if something already exists before jumping into this endeavor.)

6 upvotes·56.2K views

Replies (1)

Recommends

If you're after Markdown I would look at https://www.netlifycms.org. I've used it on several projects to allow clients to use Markdown to publish and it integrates really well with Gatsby. You can create your own content structures using it then implement them into your templates. These are all the widgets you can use: https://www.netlifycms.org/docs/widgets/

This keeps it strictly static file driven with no database or need for express etc.

9 upvotes·1 comment·19.8K views

Arnaud Amzallag

March 31st 2022 at 6:29PM

Thank you, I was skeptical at first, but now that I have read more about it, that is a great answer! Before you answered, I started to go the route of ghost.js to see how the Gatsby build could source from my Ghost server, but that would be an endeavor. Netlify CMS seems to achieve what I wanted much more directly. I will continue to learn about it. Thank you!

Miroslav Petrovic

Senior Software Engineer at Incode technologies·Feb 7, 2022

Needs advice

ceph

and

Minio

I need a private storage replacement for Amazon S3. Which one would you choose?

3 upvotes·24.7K views

Umair Iftikhar

Technical Architect at ERP Studio·Jan 5, 2022

Needs advice

Amazon DynamoDB

and

MongoDB

We are developing a system in which we have to collect 10 million records every day. We need a NoSQL database solution. The data is simple logs. We are using AWS for now. I want to know which of the two available technologies is cheaper: Amazon S3 or MongoDB.

We have 30 Tables that are collecting these logs.

6 upvotes·26.1K views

Replies (4)

Ivan Begtin

Founder - Dateno, Director - NGO "Informational Culture" / Ambassador - OKFN Armenia at Infoculture·Jan 12, 2022

Recommends

Clickhouse

Elasticsearch

I am a big fan of MongoDB, and it's great for document storage, but I am not really sure that it's the best engine for log storage. If the data you store is "flat" and well-defined, then log storage based on engines like ClickHouse or the Elasticsearch stack could be much more efficient. It's also quite important how you reuse the collected logs. Do you calculate aggregated metrics? Do you need full-text search? And so on.

If the logs are really simple and full-text search is needed, then Logstash + Elasticsearch. If you need to calculate a lot of metrics and the logs are not just text but include numbers/values needed for aggregation, then ClickHouse.

5 upvotes·24.9K views

subz390

Developer ·Jan 5, 2022

Recommends

Amazon DynamoDB

MongoDB

The way I'd approach this is to carry out a survey. Prioritise a list of important criteria, such as performance, functionality, and cost. For example, with MongoDB you can archive documents that are not immediately required, which saves on costs at the expense of instant access; if that fits your use case, you can use that feature. So create a test project that actually uses both services as per your use case and see the results of the tests for yourself. Along the way you'll encounter issues peculiar to each platform that you can factor into your final decision, such as how easy their API is to use, or whether the documentation is sparse or confusing. From there you'll be able to make an informed decision, and you'll be confident investing further resources into it.
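One way to turn such a survey into a decision is a simple weighted score. The sketch below is only an illustration: the criteria weights and the 1-5 scores are made-up placeholders for neutral options, not measurements of DynamoDB or MongoDB, and you would fill them in from your own test project.

```python
def weighted_score(scores, weights):
    """Combine per-criterion scores (e.g. 1-5) using the given priority weights."""
    total_weight = sum(weights.values())
    return sum(scores[name] * weight for name, weight in weights.items()) / total_weight


# Hypothetical priorities and test results; replace with your own.
weights = {"performance": 3, "functionality": 2, "cost": 5}
candidates = {
    "option_a": {"performance": 4, "functionality": 3, "cost": 2},
    "option_b": {"performance": 3, "functionality": 4, "cost": 4},
}

# Highest weighted score first.
ranking = sorted(
    candidates,
    key=lambda name: weighted_score(candidates[name], weights),
    reverse=True,
)
```

Because cost carries the largest weight here, the option that scores better on cost ends up first even though it scores lower on performance.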

5 upvotes·1 comment·26.2K views

reidmorrison

January 9th 2022 at 5:20PM

If you use Amazon DocumentDB instead of DynamoDB, it is compatible with the MongoDB API. That will keep your code cloud-agnostic, and you will have the option of switching between DocumentDB and MongoDB in the future based on whichever ends up being cheapest to run.

Needs advice

Amazon S3 or CloudFlare as a CDN? Does anyone prefer one over the other? My devs are switching to Amazon S3, but I would like other opinions. I've read a little about how they differ, but would love an expert opinion. Thank you in advance!

3 upvotes·192.1K views

Replies (1)

Steven Heryanto

Technical Consultant ·Sep 13, 2021

Recommends

Amazon S3

If you use other AWS services, S3 is by far the better choice, since they're much easier to integrate. One thing to consider is cost: in some regions, AWS prices are higher than in others.

2 upvotes·3 comments·673 views

Jordan Giha

September 13th 2021 at 11:56PM

Hey Steven, thank you. What would this cost difference be?

Steven Heryanto

September 14th 2021 at 3:06AM

It depends on which region you want to put your S3 bucket in. You can estimate the cost on this page: https://calculator.aws/#/createCalculator/S3

Jordan Giha

September 15th 2021 at 1:36AM

thank you! I wish I knew what to fill out here lol

Needs advice

I have data stored in an Amazon S3 bucket in Parquet file format.

I want this data to be copied from S3 to Amazon Redshift, so I use COPY commands to achieve this. But I need to do this manually. I want to automate it so that whenever a new file arrives in S3, it is copied to the required table in Redshift. Can you suggest different approaches I could use?

8 upvotes·17K views

Replies (1)

Scott MESSNER

Backend Software Engineer ·Mar 17, 2021

Recommends

aws

sns

aws-s3

Hello Aditya, I haven't tried this myself, but theoretically you can harness the power of Amazon S3 events to generate an event whenever there is a CRUD event on your Parquet files in S3. First you'd have to generate the event: https://docs.aws.amazon.com/AmazonS3/latest/userguide/NotificationHowTo.html

Then you would want to subscribe to the event and use the event info to pull the file into Redshift: https://docs.aws.amazon.com/redshift/latest/mgmt/working-with-event-notifications.html#working-with-event-notifications-subscribe

-Scott
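As a rough sketch of the idea above, assuming the S3 notification is delivered to an AWS Lambda function: the handler below turns each notified object into a Redshift COPY statement. The table name, the IAM role ARN, and the function names are hypothetical placeholders, and actually running the SQL (for example through a database driver or the Redshift Data API) is deliberately left out.

```python
from urllib.parse import unquote_plus

# Hypothetical names; substitute your own table and IAM role.
TARGET_TABLE = "analytics.events"
IAM_ROLE_ARN = "arn:aws:iam::123456789012:role/example-redshift-copy"


def build_copy_statements(event):
    """Turn an S3 notification payload into one COPY statement per Parquet file."""
    statements = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        if not key.endswith(".parquet"):
            continue  # Ignore anything that is not a Parquet file.
        statements.append(
            f"COPY {TARGET_TABLE} FROM 's3://{bucket}/{key}' "
            f"IAM_ROLE '{IAM_ROLE_ARN}' FORMAT AS PARQUET;"
        )
    return statements


def handler(event, context):
    """Lambda entry point; executing the statements against Redshift is omitted."""
    return build_copy_statements(event)
```

Since the object key ends up inside a SQL string, it is worth restricting the notification to a known prefix and validating key names before executing anything.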

4 upvotes·3.7K views

Pardha Saradhi

Technical Lead at Incred Financial Solutions·Dec 10, 2020

Needs advice

Amazon S3

Metabase

and

Presto

Hi,

We are currently storing the data in Amazon S3 using Apache Parquet format. We are using Presto to query the data from S3 and catalog it using AWS Glue catalog. We have Metabase sitting on top of Presto, where our reports are present. Currently, Presto is becoming too costly for us, and we are looking for alternatives for it but want to use the remaining setup (S3, Metabase) as much as possible. Please suggest alternative approaches.

6 upvotes·109.2K views

Replies (1)

Kevin van Zonneveld

Co-founder at Transloadit·Dec 18, 2020

Recommends

Hey there, the trick to keeping costs under control is to partition. This means you split up your source files by date, and also query within dates, so that Athena only scans the few files necessary for those dates. I hope that makes sense (and I also hope I understood your question right). This article explains it better: https://aws.amazon.com/blogs/big-data/analyze-your-amazon-cloudfront-access-logs-at-scale/
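A minimal sketch of date partitioning, assuming a Hive-style `dt=YYYY-MM-DD` prefix in the object key and a matching `dt` partition column registered in the catalog. The prefix, table, file names, and function names are hypothetical.

```python
from datetime import date


def partitioned_key(prefix, day, filename):
    """Build an object key so each day's files land under their own prefix."""
    return f"{prefix}/dt={day.isoformat()}/{filename}"


def date_range_query(table, start, end):
    """Build a query that filters on the partition column to limit the scan."""
    return (
        f"SELECT * FROM {table} "
        f"WHERE dt BETWEEN '{start.isoformat()}' AND '{end.isoformat()}'"
    )


key = partitioned_key("reports", date(2020, 12, 18), "part-0.parquet")
sql = date_range_query("reports", date(2020, 12, 1), date(2020, 12, 18))
```

Because the filter is on the partition column, the engine can skip every prefix outside the requested dates instead of reading the whole dataset.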

4 upvotes·4.9K views